Learning a Domain-Invariant Embedding for Unsupervised Domain Adaptation Using Class-Conditioned Distribution Alignment
Alexander J. Gabourie, Mohammad Rostami, Philip E. Pope, Soheil Kolouri, and Kyungnam Kim
Abstract — We address the problem of unsupervised domain adaptation (UDA) by learning a cross-domain agnostic embedding space in which the distance between the probability distributions of the source and target visual domains is minimized. We use the output space of a shared cross-domain deep encoder to model the embedding space and use the Sliced-Wasserstein Distance (SWD) to measure and minimize the distance between the embedded distributions of the two domains, which enforces the embedding to be domain-agnostic. Additionally, we use the source domain labeled data to train a deep classifier from the embedding space to the label space, which enforces the embedding to be discriminative. As a result of this training scheme, we provide an effective solution for training the deep classification network on the source domain such that it generalizes well on the target domain, where only unlabeled training data is accessible. To mitigate the challenge of class matching, we also align corresponding classes in the embedding space by using high-confidence pseudo-labels for the target domain, i.e., assigning the class for which the source classifier has a high prediction probability. We provide experimental results on UDA benchmark tasks to demonstrate that our method is effective and leads to state-of-the-art performance.

Alexander J. Gabourie is with the Department of Electrical Engineering, Stanford University, Stanford, CA, USA ([email protected]). Mohammad Rostami is with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA, USA ([email protected]). Philip Pope, Soheil Kolouri, and Kyungnam Kim are with the Information and Systems Laboratory, HRL Laboratories, LLC, Malibu, CA, USA ({pepope, skolouri, kkim}@hrl.com).
I. INTRODUCTION
Deep learning classification algorithms have surpassed human performance for a wide range of computer vision applications. However, this achievement is conditioned on the availability of high-quality labeled datasets to supervise the training of deep neural networks. Unfortunately, preparing huge labeled datasets is not feasible in many situations, as data labeling and annotation can be expensive [29]. Domain adaptation [12] is a paradigm to address the problem of labeled data scarcity in computer vision, where the goal is to improve learning speed and model generalization as well as to avoid expensive redundant model retraining. The major idea is to overcome labeled data scarcity in a target domain by transferring knowledge from a related auxiliary source domain, where labeled data is easy and cheap to obtain. A common technique in the domain adaptation literature is to embed data from the source and target visual domains in an intermediate embedding space such that common cross-domain discriminative relations are captured in the embedding space. For example, if the data from the source and target domains have similar class-conditioned probability distributions in the embedding space, then a classifier trained solely using labeled data from the source domain will generalize well on data points drawn from the target domain distribution [28], [30].

In this paper, we propose a novel unsupervised domain adaptation (UDA) algorithm following the above procedure. Our approach is a simpler, yet effective, alternative to the adversarial learning techniques that have been more dominant in addressing probability matching indirectly for UDA [41], [43], [23]. Our contribution is twofold. First, we train the shared encoder by minimizing the Sliced-Wasserstein Distance (SWD) [26] between the source and the target distributions in the embedding space. We also train a classifier network simultaneously using the source domain labeled data. A major benefit of SWD over alternative probability metrics is that it can be computed efficiently. Additionally, SWD is known to be suitable for gradient-based optimization, which is essential for deep learning [28]. Our second contribution is to circumvent the class matching challenge [34] by minimizing the SWD between class-conditional distributions in sequential iterations, which yields better performance compared to prior UDA methods that match probabilities explicitly. At each iteration, we assign pseudo-labels only to the target domain data points for which the classifier predicts the assigned class label with high probability, and use this portion of the target data to minimize the SWD between conditional distributions. As more learning iterations are performed, the number of target data points with correct pseudo-labels grows and progressively enforces the distributions to align class-conditionally. We provide experimental results on benchmark problems, including ablation and sensitivity studies, to demonstrate that our method is effective.
II. BACKGROUND AND RELATED WORK
There are two major approaches in the literature to address domain adaptation. The first group of methods is based on preprocessing the target domain data points: the target data is mapped from the target domain to the source domain such that the target data structure is preserved in the source [33]. Another common approach is to map data from both domains to a latent domain-invariant space [7]. Early methods within the second approach learn a linear subspace as the invariant space [15], [13], in which the target domain data points are distributed similarly to the source domain data points. A linear subspace, however, is not suitable for capturing complex distributions. For this reason, deep neural networks have recently been used to model the intermediate space as the output of the network. The network is trained such that the source and the target domain distributions at its output have minimal discrepancy. Training can be done either by adversarial learning [14] or by directly minimizing the distance between the two distributions [4].

Several important UDA methods use adversarial learning. Ganin et al. [10] pioneered and developed an effective method to match two distributions indirectly by using adversarial learning. Liu et al. [21] and Tzeng et al. [41] use the Generative Adversarial Network (GAN) structure [14] to tackle domain adaptation. The idea is to train two competing (i.e., adversarial) deep neural networks to match the source and the target distributions. A generator network maps data points from both domains to the domain-invariant space, and a binary discriminator network is trained to classify the data points, with each domain considered as a class, based on the representations of the target and the source data points. The generator network is trained such that eventually the discriminator cannot distinguish between the two domains, i.e., its classification rate drops to chance.

A second group of domain adaptation algorithms matches the distributions directly in the embedding space by using a shared cross-domain mapping such that the distance between the two distributions is minimized with respect to a distance metric [4]. Early methods use simple metrics such as the Maximum Mean Discrepancy (MMD) for this purpose [16]. MMD measures the discrepancy between two distributions simply as the distance between the means of the embedded features. In contrast, more recent techniques that use a shared deep encoder employ the Wasserstein metric [42] to address UDA [4], [6]. The Wasserstein metric has been shown to be a more accurate probability metric and can be minimized effectively by first-order deep learning optimization techniques. A major benefit of matching distributions directly is the existence of theoretical guarantees. In particular, Redko et al. [28] provided theoretical guarantees for using the Wasserstein metric to address domain adaptation. Additionally, adversarial learning often requires deliberate architecture engineering, optimization initialization, and selection of hyper-parameters to be stable [32]. In some cases, adversarial learning also suffers from a phenomenon known as mode collapse [22]: if the data distribution is multi-modal, which is the case for most classification problems, the generator network might not generate samples from some modes of the distribution.
These challenges are easier to address when the distributions are matched directly. As the Wasserstein distance finds more applications in deep learning, its efficient computation has become an active area of research. The reason is that the Wasserstein distance is defined as a linear programming problem, and solving this optimization problem is computationally expensive for high-dimensional data. Although computationally efficient variations and approximations of the Wasserstein distance have recently been proposed [5], [40], [25], these variations still require an additional optimization in each iteration of the stochastic gradient descent (SGD) steps to match distributions. Courty et al. [4] used a regularized version of optimal transport for domain adaptation. Seguy et al. [36] used a dual stochastic gradient algorithm for solving the regularized optimal transport problem. Alternatively, we propose to address the above challenges using the Sliced Wasserstein Distance (SWD). The definition of SWD is motivated by the fact that, in contrast to higher dimensions, the Wasserstein distance for one-dimensional distributions has a closed-form solution that can be computed efficiently. This fact is used to approximate the Wasserstein distance by SWD, which is a computationally efficient approximation and has recently drawn interest from the machine learning and computer vision communities [26], [1], [3], [8], [39].

III. PROBLEM FORMULATION
Consider a source domain, $\mathcal{D}_S = (X_S, Y_S)$, with $N$ labeled samples, i.e., labeled images, where $X_S = [\mathbf{x}_1^s, \ldots, \mathbf{x}_N^s] \in \mathcal{X} \subset \mathbb{R}^{d \times N}$ denotes the samples and $Y_S = [\mathbf{y}_1^s, \ldots, \mathbf{y}_N^s] \in \mathcal{Y} \subset \mathbb{R}^{k \times N}$ contains the corresponding labels. Note that the label $\mathbf{y}_n^s$ identifies the membership of $\mathbf{x}_n^s$ in one or multiple of the $k$ classes (e.g., the digits $0, \ldots, 9$ for hand-written digit recognition). We assume that the source samples are drawn i.i.d. from the source joint probability distribution, i.e., $(\mathbf{x}_i^s, \mathbf{y}_i^s) \sim p(\mathbf{x}^s, \mathbf{y}^s)$. We denote the source marginal distribution over $\mathbf{x}^s$ by $p_S$. Additionally, we have a related target domain (e.g., machine-typed digit recognition) with $M$ unlabeled data points $X_T = [\mathbf{x}_1^t, \ldots, \mathbf{x}_M^t] \in \mathbb{R}^{d \times M}$. Following existing UDA methods, we assume that the same set of labels used in the source domain holds for the target domain. The target samples are drawn from the target marginal distribution, $\mathbf{x}_i^t \sim p_T$. Despite the similarity between these domains, a distribution discrepancy exists between them, i.e., $p_S \neq p_T$. Our goal is to classify the unlabeled target data points through knowledge transfer from the source domain. Learning a good classifier for the source data points is a straightforward problem: given a large enough number of source samples, $N$, a parametric function $f_{\theta}: \mathbb{R}^d \rightarrow \mathcal{Y}$, e.g., a deep neural network with concatenated learnable parameters $\theta$, can be trained to map samples to their corresponding labels using standard supervised learning solely in the source domain. Training is conducted by minimizing the empirical risk, $\hat{\theta} = \arg\min_{\theta} \hat{e}_{\theta} = \arg\min_{\theta} \sum_i \mathcal{L}(f_{\theta}(\mathbf{x}_i^s), \mathbf{y}_i^s)$, with respect to a proper loss function $\mathcal{L}(\cdot)$ (e.g., cross-entropy loss). The learned classifier $f_{\hat{\theta}}$ generalizes well on testing data points only if they are drawn from the training data distribution; only then is the empirical risk a suitable surrogate for the real risk function, $e = \mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim p(\mathbf{x}^s, \mathbf{y}^s)}[\mathcal{L}(f_{\theta}(\mathbf{x}), \mathbf{y})]$. Hence, the naive approach of using $f_{\hat{\theta}}$ on the target domain might not be effective: given the discrepancy between the source and target distributions, $f_{\hat{\theta}}$ might not generalize well on the target domain. Therefore, the training procedure of $f_{\hat{\theta}}$ must be adapted by incorporating the unlabeled target data points such that the knowledge learned from the source domain can be transferred and used for classification in the target domain using only the unlabeled samples.

Fig. 1: Architecture of the proposed unsupervised domain adaptation framework.

The main challenge is to circumvent the problem of discrepancy between the source and the target domain distributions. To that end, the mapping $f_{\theta}(\cdot)$ can be decomposed into a feature extractor $\phi_{\mathbf{v}}(\cdot)$ and a classifier $h_{\mathbf{w}}(\cdot)$, such that $f_{\theta} = h_{\mathbf{w}} \circ \phi_{\mathbf{v}}$, where $\mathbf{w}$ and $\mathbf{v}$ are the corresponding learnable parameters, i.e., $\theta = (\mathbf{w}, \mathbf{v})$. The core idea is to learn the feature extractor $\phi_{\mathbf{v}}$ for both domains such that the domain-specific distributions of the extracted features are similar to one another. The feature extraction function $\phi_{\mathbf{v}}: \mathcal{X} \rightarrow \mathcal{Z}$ maps the data points from both domains to an intermediate embedding space
$\mathcal{Z} \subset \mathbb{R}^f$ (i.e., the feature space), and the classifier $h_{\mathbf{w}}: \mathcal{Z} \rightarrow \mathcal{Y}$ maps the data point representations in the embedding space to the label set. Note that, as a deterministic function, the feature extractor $\phi_{\mathbf{v}}$ can change the distribution of the data in the embedding. Therefore, if $\phi_{\mathbf{v}}$ is learned such that the discrepancy between the source and target distributions is minimized in the embedding space, i.e., the discrepancy between $p_S(\phi(\mathbf{x}^s))$ and $p_T(\phi(\mathbf{x}^t))$ (i.e., the embedding is domain-agnostic), then the classifier will generalize well on the target domain and can be used to label the target domain data points. This is the core idea behind various prior domain adaptation approaches in the literature [24], [11].
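To make the decomposition concrete, the following is a minimal PyTorch-style sketch of one way $f_{\theta} = h_{\mathbf{w}} \circ \phi_{\mathbf{v}}$ could be instantiated. The layer sizes, the 64-dimensional embedding, and the 32x32 single-channel inputs are illustrative assumptions of this sketch, not the architectures used in our experiments (Section V uses DRCN- and VGG-based backbones).

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared feature extractor phi_v: maps images from either domain to Z."""
    def __init__(self, feature_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, feature_dim)  # assumes 32x32 inputs

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Classifier(nn.Module):
    """Shallow classifier h_w: maps embeddings to class scores (softmax via the loss)."""
    def __init__(self, feature_dim=64, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_classes))

    def forward(self, z):
        return self.net(z)

# f_theta = h_w o phi_v: the encoder is shared by both domains,
# while the classifier is trained only with source labels.
encoder, classifier = Encoder(), Classifier()
logits = classifier(encoder(torch.randn(16, 1, 32, 32)))  # -> (16, 10)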
IV. PROPOSED METHOD

We consider the case where the feature extractor, $\phi_{\mathbf{v}}(\cdot)$, is a deep convolutional encoder with weights $\mathbf{v}$ and the classifier $h_{\mathbf{w}}(\cdot)$ is a shallow fully connected neural network with weights $\mathbf{w}$. The last layer of the classifier network is a softmax layer that assigns a membership probability distribution to any given data point. Labels are typically assigned according to the class with the maximum predicted probability. In short, the encoder network is learned to mix both domains such that the extracted features in the embedding are: 1) domain-agnostic in terms of data distributions, and 2) discriminative for the source domain, so that learning $h_{\mathbf{w}}$ is feasible. Figure 1 presents a system-level overview of our framework. Following this framework, UDA reduces to solving the following optimization problem for $\mathbf{v}$ and $\mathbf{w}$:

$$\min_{\mathbf{v}, \mathbf{w}} \sum_{i=1}^{N} \mathcal{L}\big(h_{\mathbf{w}}(\phi_{\mathbf{v}}(\mathbf{x}_i^s)), \mathbf{y}_i^s\big) + \lambda D\big(p_S(\phi_{\mathbf{v}}(X_S)), p_T(\phi_{\mathbf{v}}(X_T))\big), \quad (1)$$

where $D(\cdot, \cdot)$ is a discrepancy measure between the probabilities and $\lambda$ is a trade-off parameter. The first term in Eq. (1) is the empirical risk for classifying the source labeled data points from the embedding space, and the second term is the cross-domain probability matching loss. The encoder parameters are learned using data points from both domains, and the classifier parameters are simultaneously learned using the source domain labeled data.

A major remaining question is how to select a proper metric. First, note that the actual distributions $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$ are unknown and we can rely only on observed samples from these distributions. Therefore, a sensible discrepancy measure $D(\cdot, \cdot)$ should be able to measure the dissimilarity between these distributions based only on the drawn samples. In this work, we use the SWD [27], as it can be computed efficiently from samples drawn from the corresponding distributions. More importantly, the SWD is a good approximation for the optimal transport [2], which has gained interest in the deep learning community as an effective distribution metric whose gradient is non-vanishing.

The idea behind the SWD is to project two $d$-dimensional probability distributions onto their marginal one-dimensional distributions, i.e., to slice the high-dimensional distributions, and to approximate the Wasserstein distance by integrating the Wasserstein distances between the resulting marginal probability distributions over all possible one-dimensional subspaces. For the distribution $p_S$, a one-dimensional slice of the distribution is defined as:

$$\mathcal{R}p_S(t; \gamma) = \int_{\mathcal{X}} p_S(\mathbf{x}) \, \delta(t - \langle \gamma, \mathbf{x} \rangle) \, d\mathbf{x}, \quad (2)$$

where $\delta(\cdot)$ denotes the one-dimensional Dirac delta function, $\langle \cdot, \cdot \rangle$ denotes the vector dot product, $\mathbb{S}^{d-1}$ is the $d$-dimensional unit sphere, and $\gamma$ is the projection direction. In other words, $\mathcal{R}p_S(\cdot; \gamma)$ is a marginal distribution of $p_S$ obtained by integrating $p_S$ over the hyperplanes orthogonal to $\gamma$. The SWD can then be computed by integrating the Wasserstein distance between sliced distributions over all $\gamma$:

$$SW(p_S, p_T) = \int_{\mathbb{S}^{d-1}} W\big(\mathcal{R}p_S(\cdot; \gamma), \mathcal{R}p_T(\cdot; \gamma)\big) \, d\gamma, \quad (3)$$

where $W(\cdot)$ denotes the Wasserstein distance. The main advantage of the SWD is that, unlike the Wasserstein distance, its calculation does not require a numerically expensive optimization.
This is due to the fact that the Wasserstein distance between two one-dimensional probability distributions has a closed-form solution and is equal to the $\ell_p$-distance between the inverses of their cumulative distribution functions. Since only samples from the distributions are available, the one-dimensional Wasserstein distance can be approximated as the $\ell_p$-distance between the sorted samples [31]. The integral in Eq. (3) is approximated using a Monte Carlo style numerical integration. Doing so, the SWD between $f$-dimensional samples $\{\phi(\mathbf{x}_i^s) \in \mathbb{R}^f \sim p_S\}_{i=1}^{M}$ and $\{\phi(\mathbf{x}_j^t) \in \mathbb{R}^f \sim p_T\}_{j=1}^{M}$ can be approximated by the following sum:

$$SW^2(p_S, p_T) \approx \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{M} \big| \langle \gamma_l, \phi(\mathbf{x}_{s_l[i]}^s) \rangle - \langle \gamma_l, \phi(\mathbf{x}_{t_l[i]}^t) \rangle \big|^2, \quad (4)$$

where $\gamma_l \in \mathbb{S}^{f-1}$ is a random sample drawn uniformly from the $f$-dimensional unit sphere $\mathbb{S}^{f-1}$, and $s_l[i]$ and $t_l[i]$ are the indices that sort $\{\langle \gamma_l, \phi(\mathbf{x}_i) \rangle\}_{i=1}^{M}$ for the source and target domains, respectively. Note that for a fixed dimension $d$, the Monte Carlo approximation error is proportional to $O(1/\sqrt{L})$. We utilize the SWD as the discrepancy measure between the probability distributions to match them in the embedding space. Next, we discuss a major deficiency of Eq. (1) and our remedy to address it.
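As an illustration, the following is a minimal sketch of Eq. (4), assuming two equally sized batches of embedded features and projection directions drawn uniformly from the unit sphere; the number of projections and the use of the squared distance (p = 2) are choices of this sketch rather than prescribed values.

import torch

def sliced_wasserstein_distance(z_s, z_t, num_projections=50):
    """Monte Carlo approximation of the SWD in Eq. (4).

    z_s, z_t: (M, f) tensors of source/target embeddings phi(x).
    """
    f_dim = z_s.shape[1]
    # gamma_l: random directions drawn uniformly from the unit sphere S^{f-1}
    gamma = torch.randn(f_dim, num_projections, device=z_s.device)
    gamma = gamma / gamma.norm(dim=0, keepdim=True)

    # one-dimensional slices of each empirical distribution
    proj_s = z_s @ gamma   # (M, L)
    proj_t = z_t @ gamma

    # closed-form 1-D Wasserstein distance: match the sorted projections;
    # the mean averages over both samples and projections, i.e. a 1/(LM)-scaled Eq. (4)
    sorted_s, _ = torch.sort(proj_s, dim=0)
    sorted_t, _ = torch.sort(proj_t, dim=0)
    return ((sorted_s - sorted_t) ** 2).mean()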
A. Class-conditional Alignment of Distributions

A main shortcoming of Eq. (1) is that minimizing the discrepancy between $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$ does not guarantee semantic consistency between the two domains. To clarify this point, consider the source and target domains to be images of printed digits and handwritten digits, respectively. While the feature distributions in the embedding space could have low discrepancy, the classes might not be correctly aligned in this space: digits from one class in the target domain could be matched to a wrong class of the source domain, or digits from multiple classes in the target domain could even be matched to the cluster of a single digit of the source domain. In such cases, the source classifier will not generalize well on the target domain. In other words, the shared embedding space $\mathcal{Z}$ might not be a semantically meaningful space for the target domain if we solely minimize the SWD between $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$. To solve this challenge, the encoder should be learned such that the class-conditioned probabilities of both domains in the embedding space are similar, i.e., $p_S(\phi(\mathbf{x}^s) \mid \mathcal{C}_j) \approx p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$, where $\mathcal{C}_j$ denotes a particular class. Given this, we can mitigate the class matching problem by using an adapted version of Eq. (1):

$$\min_{\mathbf{v}, \mathbf{w}} \sum_{i=1}^{N} \mathcal{L}\big(h_{\mathbf{w}}(\phi_{\mathbf{v}}(\mathbf{x}_i^s)), \mathbf{y}_i^s\big) + \lambda \sum_{j=1}^{k} D\big(p_S(\phi_{\mathbf{v}}(\mathbf{x}^s) \mid \mathcal{C}_j), p_T(\phi_{\mathbf{v}}(\mathbf{x}^t) \mid \mathcal{C}_j)\big), \quad (5)$$

where the discrepancy between distributions is minimized conditioned on the classes, to enforce semantic alignment in the embedding space. Solving Eq. (5), however, is not tractable, as the labels for the target domain are not available and the conditional distribution $p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$ is not known.

To tackle this issue, we compute a surrogate of the objective in Eq. (5). Our idea is to approximate $p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$ by generating pseudo-labels for the target data points. The pseudo-labels are obtained from the source classifier prediction, but only for the portion of target data points for which the source classifier provides a confident prediction. More specifically, we solve Eq. (5) in incremental gradient descent iterations. We first initialize the classifier network by training it on the source data. We then alternate between optimizing the classification loss for the source data and the SWD loss term at each iteration. At each iteration, we pass the target domain data points through the classifier learned on the source data and analyze the label probability distribution of the classifier's softmax layer. We choose a threshold $\tau$ and assign pseudo-labels only to those target data points for which the classifier predicts the pseudo-label with high confidence, i.e., $p(\hat{y}_i^t \mid \mathbf{x}_i^t) > \tau$. Since the source and the target domains are related, it is sensible that the source classifier can classify a subset of the target data points correctly and with high confidence. We use these data points to approximate $p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$ in Eq. (5) and update the encoder parameters $\mathbf{v}$ accordingly. In our empirical experiments, we have observed that, because the domains are related, the number of data points with confident pseudo-labels increases as more optimization iterations are performed, so our approximation of Eq. (5) improves and becomes more stable, enforcing the source and the target distributions to align class-conditionally in the embedding space. As a side benefit, since we match the distributions class-conditionally, a problem similar to mode collapse is unlikely to occur. Figure 2 visualizes this process using real data. Our proposed framework, named Domain Adaptation with Conditional Alignment of Distributions (DACAD), is summarized in Algorithm 1, and a sketch of one adaptation iteration follows below.

Algorithm 1 DACAD($L, \eta, \lambda$)
Input: data $\mathcal{D}_S = (X_S, Y_S)$; $\mathcal{D}_T = (X_T)$
Pre-training: $\hat{\theta} = (\hat{\mathbf{w}}, \hat{\mathbf{v}}) = \arg\min_{\theta} \sum_i \mathcal{L}(f_{\theta}(\mathbf{x}_i^s), \mathbf{y}_i^s)$
for itr = 1, ..., ITR do
    $\mathcal{D}_{PL} = \{(\mathbf{x}_i^t, \hat{y}_i^t) \mid \hat{y}_i^t = f_{\hat{\theta}}(\mathbf{x}_i^t),\ p(\hat{y}_i^t \mid \mathbf{x}_i^t) > \tau\}$
    for alt = 1, ..., ALT do
        Update encoder parameters using pseudo-labels:
            $\hat{\mathbf{v}} = \arg\min_{\mathbf{v}} \sum_j D\big(p_S(\phi_{\mathbf{v}}(\mathbf{x}^s) \mid \mathcal{C}_j), p_{PL}(\phi_{\mathbf{v}}(\mathbf{x}^t) \mid \mathcal{C}_j)\big)$
        Update the entire model:
            $\hat{\mathbf{v}}, \hat{\mathbf{w}} = \arg\min_{\mathbf{w}, \mathbf{v}} \sum_{i=1}^{N} \mathcal{L}\big(h_{\mathbf{w}}(\phi_{\mathbf{v}}(\mathbf{x}_i^s)), \mathbf{y}_i^s\big)$
    end for
end for
V. EXPERIMENTAL VALIDATION

We evaluate our algorithm on standard benchmark UDA tasks and compare it against several UDA methods.
Datasets:
We investigate the empirical performance of our proposed method on five commonly used benchmark datasets in UDA, namely: MNIST (M) [19], USPS (U) [20], Street View House Numbers, i.e., SVHN (S), CIFAR (CI), and STL (ST). The first three are 10-class (0-9) digit classification datasets. MNIST and USPS are collections of handwritten digits, whereas SVHN is a collection of real-world RGB images of house numbers. STL and CIFAR contain RGB images that share 9 object classes: airplane, car, bird, cat, deer, dog, horse, ship, and truck. For the digit datasets, while six domain adaptation problems can be defined among these datasets, prior works often consider four of these six cases, as knowledge transfer from the simpler MNIST and USPS datasets to the more challenging SVHN domain does not seem to be tractable. Following the literature, we use 2000 randomly selected images from MNIST and 1800 images from USPS in our experiments for the U → M and S → M cases [23]. In the remaining cases, we use the full datasets. All datasets have their images scaled to a common resolution, and the SVHN images are converted to grayscale, as the encoder network is shared between the domains. CIFAR and STL maintain their RGB components. We report the target classification accuracy across the tasks.

Fig. 2: The high-level system architecture, shown on the left, illustrates the data paths used during UDA training. On the right, t-SNE visualizations demonstrate how the embedding space evolves during training for the S → U task. In the target domain, colored points are examples with assigned pseudo-labels, which increase in number with the confidence of the classifier.

Pre-training: Our experiments involve a pre-training stage to initialize the encoder and the classifier networks solely using the source data. This is an essential step because the combined deep network can generate confident pseudo-labels on the target domain only if it is initially trained on the related source domain. In other words, this initially learned network serves as a naive model on the target domain. We then boost the performance on the target domain using our proposed algorithm, demonstrating that our algorithm is indeed effective for transferring knowledge. In doing so, we investigate a less-explored issue in the UDA literature. Different UDA approaches use considerably different networks, both in terms of complexity, e.g., the number of layers and convolution filters, and structure, e.g., the use of an auto-encoder. Consequently, it is ambiguous whether the performance of a particular UDA algorithm is due to successful knowledge transfer from the source domain or simply a good baseline network that performs well on the target domain even without considerable knowledge transfer. To highlight that our algorithm can indeed transfer knowledge, we use two different network architectures: DRCN [11] and VGG [38]. We then show that our algorithm can effectively boost baseline performance (statistically significantly) regardless of the underlying network. In most of the domain adaptation tasks, we demonstrate that this boost indeed stems from transferring knowledge from the source domain. In our experiments, we use the Adam optimizer [18] with a fixed pseudo-labeling threshold $\tau$.
Data Augmentation: Following the literature, we use data augmentation to create additional training data by applying reasonable transformations to the input data in an effort to improve generalization [37]. Confirming the results reported in [11], we also found that geometric transformations and noise, applied to appropriate inputs, greatly improve performance and transferability of the source model to the target data. Data augmentation can help reduce the domain shift between the two domains. The augmentations in this work are limited to translation, rotation, skew, zoom, Gaussian noise, binomial noise, and inverted pixels; a sketch of such a pipeline is given below.
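As an illustration, a torchvision-style pipeline covering these transformations might look like the following sketch; the magnitudes and probabilities are placeholders rather than the values used in our experiments.

import torch
from torchvision import transforms

def add_gaussian_noise(x, std=0.05):
    """Additive Gaussian pixel noise, clipped back to [0, 1]."""
    return (x + std * torch.randn_like(x)).clamp(0.0, 1.0)

def add_binomial_noise(x, p=0.05):
    """Randomly drop (zero out) pixels with probability p."""
    return x * (torch.rand_like(x) > p).float()

def random_invert(x, p=0.2):
    """Invert pixel intensities with probability p."""
    return 1.0 - x if torch.rand(()).item() < p else x

augment = transforms.Compose([
    # translation, rotation, skew (shear), and zoom (scale)
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1),
                            scale=(0.9, 1.1), shear=10),
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),
    transforms.Lambda(add_binomial_noise),
    transforms.Lambda(random_invert),
])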
A. Results
Figure 2 demonstrates how our algorithm successfully learns an embedding with class-conditional alignment of the distributions of both domains. The figure presents the two-dimensional t-SNE visualization of the source and target domain data points in the shared embedding space for the S → U task. The horizontal axis corresponds to the optimization iterations, where each cell presents the data visualization after a particular optimization iteration. The top sub-figures visualize the source data points, where each color represents a particular class. The bottom sub-figures visualize the target data points, where the colored data points represent the pseudo-labeled data points at each iteration and the black points represent the rest of the target domain data points. We can see that, due to the pre-training initialization, the embedding space is discriminative for the source domain at the beginning, but the target distribution differs from the source distribution. Nevertheless, the classifier is confident about a portion of the target data points. As more optimization iterations are performed, the network becomes a better classifier for the target domain and the number of pseudo-labeled target data points increases, improving our approximation of Eq. (5). As a result, the discrepancy between the two distributions progressively decreases. Over time, our algorithm learns a shared embedding which is discriminative for both domains, making the pseudo-labels a good prediction of the original labels (bottom right-most sub-figure). This result empirically validates the applicability of our algorithm to UDA.
TABLE I: Classification accuracy for UDA between MNIST, USPS, SVHN, CIFAR, and STL datasets, for the tasks M → U, U → M, S → M, S → U, ST → CI, and CI → ST. † indicates use of the full MNIST and USPS datasets as opposed to the subset described in the paper. ‡ indicates results from the reimplementation in [41].

We also compare our results against several recent UDA algorithms in Table I. In particular, we compare against the recent adversarial learning algorithms Generate to Adapt (GtA) [35], CoGAN [21], ADDA [41], CyCADA [17], and I2I-Adapt [24]. We also include FADA [23], which is originally a few-shot learning technique. For FADA, we list the reported one-shot accuracy, which is very close to the UDA setting (but arguably a simpler problem). Additionally, we include results for RevGrad [9], DRCN [11], AUDA [34], OPDA [4], and MML [36]. The latter methods are similar to our method in that they learn an embedding space to couple the domains; OPDA and MML are the most similar, as they match distributions explicitly in the learned embedding. Finally, we include the performance of fully-supervised (FS) learning on the target domain as an upper bound for UDA. In our own results, we include the baseline target performance obtained by naively employing a DRCN network, as well as the target performance of a VGG network, both learned solely on the source domain. We notice in Table I that our baseline performance is better than some of the UDA algorithms for some tasks. This is a crucial observation, as it demonstrates that, in some cases, a trained deep network with good data augmentation can extract domain-invariant features that make domain adaptation feasible even without any further transfer learning procedure. The last row demonstrates that our method is effective in transferring knowledge to boost the baseline performance. In other words, Table I serves as an ablation study demonstrating that the effectiveness of our algorithm stems from successful cross-domain knowledge transfer. We can see that our algorithm leads to near state-of-the-art or state-of-the-art performance across the tasks. Additionally, an important observation is that our method significantly outperforms the methods that match distributions directly and is competitive against the methods that use adversarial learning. This can be explained as the result of matching distributions class-conditionally and suggests that our second contribution could potentially boost the performance of these methods. Finally, we note that our proposed method provides a statistically significant boost in all but two of the cases (shown in gray in Table I).

VI. CONCLUSIONS AND DISCUSSION
We developed a new UDA algorithm based on learning a domain-invariant embedding space. We map data points from two related domains to the embedding space such that the discrepancy between the transformed distributions is minimized. We used the Sliced Wasserstein Distance as the measure for matching the distributions in the embedding space; as a result, our method is computationally efficient. Additionally, we matched the distributions class-conditionally by assigning pseudo-labels to the target domain data. As a result, our method is more robust and outperforms prior UDA methods that match distributions directly. We provided experimental validation to demonstrate that our method is competitive against recent state-of-the-art UDA methods.

REFERENCES
[1] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
[2] N. Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis, Paris 11, 2013.
[3] M. Carriere, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. arXiv preprint arXiv:1706.03358, 2017.
[4] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE TPAMI, 39(9):1853–1865, 2017.
[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[6] B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty. DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. arXiv preprint arXiv:1803.10081, 2018.
[7] H. Daumé III. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815, 2009.
[8] I. Deshpande, Z. Zhang, and A. Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.
[9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2014.
[10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[11] M. Ghifary, B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[12] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pages 513–520, 2011.
[13] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[15] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE, 2011.
[16] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching, 2009.
[17] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.
[20] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, et al. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, volume 60, pages 53–60. Perth, Australia, 1995.
[21] M. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[22] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[23] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6670–6680, 2017.
[24] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 2017.
[25] A. Oberman and Y. Ruan. An efficient linear programming method for optimal transportation. arXiv preprint arXiv:1509.03668, 2015.
[26] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
[27] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
[28] I. Redko, A. Habrard, and M. Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer, 2017.
[29] M. Rostami, D. Huber, and T. Lu. A crowdsourcing triage algorithm for geopolitical event forecasting. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 377–381. ACM, 2018.
[30] M. Rostami, S. Kolouri, E. Eaton, and K. Kim. Deep transfer learning for few-shot SAR image classification. Remote Sensing, 11(11):1374, 2019.
[31] M. Rostami, S. Kolouri, E. Eaton, and K. Kim. SAR image classification using few-shot cross-domain transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[32] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.
[33] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226. Springer, 2010.
[34] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2018.
[35] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
[36] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal transport and mapping estimation. In ICLR, 2018.
[37] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963, 2003.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[39] U. Şimşekli, A. Liutkus, S. Majewski, and A. Durmus. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. arXiv preprint arXiv:1806.08141, 2018.
[40] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.
[41] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
[42] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[43] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.