Ensemble Multi-Source Domain Adaptation with Pseudolabels
Seongmin Lee, Hyunsik Jeon & U Kang*
Seoul National University
{ligi214, jeon185, ukang}@snu.ac.kr
*Corresponding author

ABSTRACT
Given multiple source datasets with labels, how can we train a target model with no labeled data?
Multi-source domain adaptation (MSDA) aims to train a model using multiple source datasets different from a target dataset in the absence of target data labels. MSDA is a crucial problem applicable to many practical cases where labels for the target data are unavailable due to privacy issues. Existing MSDA frameworks are limited since they align data without considering the conditional distributions p(x|y) of each domain. They also miss a lot of target label information by not considering the target label at all and by relying on only one feature extractor. In this paper, we propose Ensemble Multi-source Domain Adaptation with Pseudolabels (EnMDAP), a novel method for multi-source domain adaptation. EnMDAP exploits label-wise moment matching to align the conditional distributions p(x|y), using pseudolabels for the unavailable target labels, and introduces an ensemble learning scheme by using multiple feature extractors for accurate domain adaptation. Extensive experiments show that EnMDAP provides state-of-the-art performance for multi-source domain adaptation tasks in both image and text domains.
1 INTRODUCTION
Given multiple source datasets with labels, how can we train a target model with no labeled data?
A large amount of training data is essential for training deep neural networks. Collecting abundant data is unfortunately an obstacle in practice; even if enough data are obtained, manually labeling them is prohibitively expensive. Using other available or much cheaper datasets would be a solution to these limitations; however, indiscriminate usage of other datasets often brings severe generalization error due to the presence of dataset shifts (Torralba & Efros (2011)). Unsupervised domain adaptation (UDA) tackles these problems where no labeled data from the target domain are available, but labeled data from other source domains are provided. Finding domain-invariant features has been the focus of UDA since it allows knowledge transfer from the labeled source dataset to the unlabeled target dataset. There have been many efforts to transfer knowledge from a single source domain to a target one. Most recent frameworks minimize the distance between two domains by deep neural networks and distance-based techniques such as discrepancy regularizers (Long et al. (2015; 2016; 2017)), adversarial networks (Ganin et al. (2016); Tzeng et al. (2017)), and generative networks (Liu et al. (2017); Zhu et al. (2017); Hoffman et al. (2018b)).

While the above-mentioned approaches consider one single source, we address multi-source domain adaptation (MSDA), which is crucial and more practical in real-world applications as well as more challenging. MSDA is able to bring significant performance enhancement by virtue of accessibility to multiple datasets as long as the multiple domain shift problems are resolved. Previous works have extensively presented both theoretical analyses (Ben-David et al. (2010); Mansour et al. (2008); Crammer et al. (2008); Hoffman et al. (2018a); Zhao et al. (2018); Zellinger et al. (2020)) and models (Zhao et al. (2018); Xu et al. (2018); Peng et al. (2019)) for MSDA. MDAN (Zhao et al. (2018)) and DCTN (Xu et al. (2018)) build adversarial networks for each source domain to generate features domain-invariant enough to confound domain classifiers. However, these approaches do not encompass the shifts among source domains, counting only the shifts between source and target domains. M3SDA (Peng et al. (2019)) adopts a moment matching strategy but makes the unrealistic assumption that matching the marginal probability p(x) would guarantee the alignment of the conditional probability p(x|y). Most of these methods also do not fully exploit the knowledge of the target domain, owing to the inaccessibility of its labels. Furthermore, all these methods leverage one single feature extractor, which possibly misses important information regarding label classification.

In this paper, we propose EnMDAP, a novel MSDA framework which mitigates the limitations of these methods: not explicitly considering the conditional probability p(x|y), and relying on only one feature extractor. The model architecture is illustrated in Figure 1. EnMDAP aligns the conditional probability p(x|y) by utilizing label-wise moment matching. We employ pseudolabels for the inaccessible target labels to maximize the usage of the target data. Moreover, integrating the features from multiple feature extractors gives abundant label information to the extracted features. Extensive experiments show the superiority of our proposed method.

Our contributions are summarized as follows:

• Method. We propose EnMDAP, a novel approach for MSDA that effectively obtains domain-invariant features from multiple domains by matching the conditional probability p(x|y), not the marginal one, utilizing pseudolabels for inaccessible target labels to fully exploit the target data, and using multiple feature extractors. It allows domain-invariant features to be extracted while capturing the intrinsic differences of different labels.
• Analysis. We theoretically prove that minimizing the label-wise moment matching loss is relevant to bounding the target error.
• Experiments. We conduct extensive experiments on image and text datasets. We show that 1) EnMDAP provides the state-of-the-art accuracy, and 2) each of our main ideas significantly contributes to the superior performance.
2 RELATED WORK
Single-source Domain Adaptation.
Given a labeled source dataset and an unlabeled target dataset, single-source domain adaptation aims to train a model that performs well on the target domain. The challenge of single-source domain adaptation is to reduce the discrepancy between the two domains and to obtain appropriate domain-invariant features. Various discrepancy measures such as Maximum Mean Discrepancy (MMD) (Tzeng et al. (2014); Long et al. (2015; 2016; 2017); Ghifary et al. (2016)) and KL divergence (Zhuang et al. (2015)) have been used as regularizers. Inspired by the insight that domain-invariant features should exclude clues about their domain, constructing adversarial networks against domain classifiers has shown superior performance. Liu et al. (2017) and Hoffman et al. (2018b) deploy GANs to transform data across the source and target domains, while Ganin et al. (2016) and Tzeng et al. (2017) leverage adversarial networks to extract common features of the two domains. Unlike these works, we focus on multiple source domains.
Multi-source Domain Adaptation.
Single-source domain adaptation should not be naively employed for multiple source domains due to the shifts between source domains. Many previous works have tackled MSDA problems theoretically. Mansour et al. (2008) establish a distribution weighted combining rule stating that the weighted combination of source hypotheses is a good approximation of the target hypothesis. The rule is further extended to a stochastic case with a joint distribution over the input and the output space in Hoffman et al. (2018a). Crammer et al. (2008) propose a general theory of how to sift appropriate samples out of multi-source data using expected loss. Efforts to find transferable knowledge from multiple sources from the causal viewpoint are made in Zhang et al. (2015). There have been salient studies on the learning bounds for MSDA. Ben-David et al. (2010) found generalization bounds based on the H∆H-divergence, which are further tightened by Zhao et al. (2018). Frameworks for MSDA have been presented as well. Zhao et al. (2018) propose learning algorithms based on the generalization bounds for MSDA. DCTN (Xu et al. (2018)) resolves domain and category shifts between source and target domains via adversarial networks. M3SDA (Peng et al. (2019)) associates all the domains into a common distribution by aligning the moments of the feature distributions of multiple domains. However, all these methods do not consider the multimode structure (Pei et al. (2018)) in which differently labeled data follow distinct distributions, even if they are drawn from the same domain. Also, the domain-invariant features in these methods contain the label information for only one label classifier, which leads these methods to miss a large amount of label information. Different from these methods, our framework fully accounts for the multimode structure by handling the data distributions in a label-wise manner, and minimizes the label information loss by considering multiple label classifiers.
Figure 1: EnMDAP for n = 2. EnMDAP consists of n pairs of feature extractor and label classifier, one extractor classifier, and one final label classifier. Colors and symbols of the markers indicate domains and class labels of the data, respectively.

Moment Matching.
Domain adaptation has deployed the moment matching strategy to minimize the discrepancy between source and target domains. The MMD regularizer (Tzeng et al. (2014); Long et al. (2015; 2016; 2017); Ghifary et al. (2016)) can be interpreted as matching the first-order moment, while Sun et al. (2016) address second-order moments of the source and target distributions. Zellinger et al. (2017) investigate the effect of higher-order moment matching. M3SDA (Peng et al. (2019)) demonstrates that moment matching yields remarkable performance also with multiple sources. While previous works have focused on matching the moments of marginal distributions for single-source adaptation, we handle conditional distributions in multi-source scenarios.
3 PROPOSED METHOD
In this section, we describe our proposed method, EnMDAP. We first formulate the problem definition in Section 3.1. Then, we describe our main ideas in Section 3.2. Section 3.3 elaborates how to match label-wise moments with pseudolabels, and Section 3.4 extends the approach by adding the concept of ensemble learning. Figure 1 shows the overview of EnMDAP.

3.1 PROBLEM DEFINITION
Given a set of labeled datasets from $N$ source domains $S_1, \ldots, S_N$ and an unlabeled dataset from a target domain $T$, we aim to construct a model that minimizes the test error on $T$. We formulate a source domain $S_i$ as a tuple of the data distribution $\mu_{S_i}$ on the data space $\mathcal{X}$ and the labeling function $l_{S_i}$: $S_i = (\mu_{S_i}, l_{S_i})$. The source dataset drawn from the distribution $\mu_{S_i}$ is denoted as $X_{S_i} = \{(x_j^{S_i}, y_j^{S_i})\}_{j=1}^{n_{S_i}}$. Likewise, the target domain and the target dataset are denoted as $T = (\mu_T, l_T)$ and $X_T = \{x_j^T\}_{j=1}^{n_T}$, respectively. We narrow our focus down to homogeneous settings in classification tasks: all domains share the same data space $\mathcal{X}$ and label set $\mathcal{C}$.

3.2 OVERVIEW
We propose EnMDAP based on the following observations: 1) existing methods focus on aligning the marginal distributions p(x), not the conditional ones p(x|y), 2) knowledge of the target data is not fully employed as no target label is given, and 3) there exists a large amount of label information loss since domain-invariant features are extracted for only one single label classifier. Thus, we design EnMDAP aiming to solve these limitations. Designing such a method entails the following challenges:

1. Matching conditional distributions. How can we align the conditional distribution p(x|y) of multiple domains, not the marginal one p(x)?
2. Exploitation of the target data. How can we fully exploit the knowledge of the target data despite the absence of the target labels?
3. Maximally utilizing feature information. How can we maximally utilize the information that the domain-invariant features contain?

We propose the following main ideas to address the challenges:

1. Label-wise moment matching (Section 3.3). We match the label-wise moments of the domain-invariant features so that the features with the same labels have similar distributions regardless of their original domains.
2. Pseudolabels (Section 3.3). We use pseudolabels as alternatives to the target labels.
3. Ensemble of feature representations (Section 3.4). We learn to extract an ensemble of features from multiple feature extractors, each of which involves distinct domain-invariant features for its own label classifier.

3.3 LABEL-WISE MOMENT MATCHING WITH PSEUDOLABELS
We describe how EnMDAP matches the conditional distributions $p(x|y)$ of the features from multiple distinct domains. In EnMDAP, a feature extractor $f_e$ and a label classifier $f_{lc}$ lead the features to be domain-invariant and label-informative at the same time. The feature extractor $f_e$ extracts features from data, and the label classifier $f_{lc}$ receives the features and predicts the labels for the data. We train the two components, $f_e$ and $f_{lc}$, according to the losses for label-wise moment matching and label classification, which make the features domain-invariant and label-informative, respectively.

Label-wise Moment Matching.
To achieve the alignment of domain-invariant features, we define a label-wise moment matching loss as follows:

$$\mathcal{L}_{lmm,K} = \frac{1}{|\mathcal{C}| \binom{N+1}{2}} \sum_{k=1}^{K} \sum_{\mathcal{D}, \mathcal{D}'} \sum_{c \in \mathcal{C}} \left\| \frac{1}{n_{\mathcal{D},c}} \sum_{j;\, y_j^{\mathcal{D}} = c} f_e(x_j^{\mathcal{D}})^k - \frac{1}{n_{\mathcal{D}',c}} \sum_{j;\, y_j^{\mathcal{D}'} = c} f_e(x_j^{\mathcal{D}'})^k \right\|, \qquad (1)$$

where $K$ is a hyperparameter indicating the maximum order of the moments considered by the loss, $\mathcal{D}$ and $\mathcal{D}'$ are two distinct domains among the $N+1$ domains, and $n_{\mathcal{D},c}$ is the number of data points labeled as $c$ in $X_{\mathcal{D}}$. We introduce pseudolabels for the target data, determined by the outputs of the model currently being trained, to manage the absence of ground truths for the target data. In other words, we leverage $f_{lc}(f_e(x^T))$ to assign a pseudolabel to the target data point $x^T$. Drawing the pseudolabels from the incomplete model, however, brings a mislabeling issue which impedes further training. To alleviate this problem, we set a threshold $\tau$ and assign pseudolabels to the target data only when the prediction confidence is greater than the threshold. Target examples with low confidence are not pseudolabeled and are not counted in label-wise moment matching.

By minimizing $\mathcal{L}_{lmm,K}$, the feature extractor $f_e$ aligns data from multiple domains by enforcing consistency in the distributions of features with the same labels. Data with distinct labels are aligned independently, taking account of the multimode structure in which differently labeled data follow different distributions.

Label Classification.
The label classifier $f_{lc}$ takes the features projected by $f_e$ as inputs and makes the label predictions. The label classification loss is defined as follows:

$$\mathcal{L}_{lc} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_{S_i}} \sum_{j=1}^{n_{S_i}} \mathcal{L}_{ce}\big(f_{lc}(f_e(x_j^{S_i})),\, y_j^{S_i}\big), \qquad (2)$$

where $\mathcal{L}_{ce}$ is the softmax cross-entropy loss. Minimizing $\mathcal{L}_{lc}$ separates the features with different labels so that they become label-distinguishable.
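To make the above concrete, the following PyTorch sketch shows one possible way to implement the confidence-thresholded pseudolabeling and the label-wise moment matching loss of Eq. (1). The function names, the per-mini-batch normalization over the (domain pair, class, order) terms, and the default threshold value tau=0.9 are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pseudolabel(features_t, label_classifier, tau=0.9):
    """Assign pseudolabels to target features; keep only confident predictions."""
    with torch.no_grad():
        probs = F.softmax(label_classifier(features_t), dim=1)
        conf, labels = probs.max(dim=1)
    keep = conf > tau
    return features_t[keep], labels[keep]

def label_wise_moment_loss(feat_by_domain, labels_by_domain, num_classes, K=2):
    """Label-wise moment matching, a sketch of Eq. (1).

    feat_by_domain: list of [n_d, dim] feature tensors, one per domain
                    (sources plus the pseudolabeled target batch).
    labels_by_domain: list of [n_d] label tensors aligned with the features.
    For each pair of domains and each class c, the first K raw moments of the
    class-conditional features are matched.
    """
    loss, num_terms = 0.0, 0
    num_domains = len(feat_by_domain)
    for a in range(num_domains):
        for b in range(a + 1, num_domains):
            for c in range(num_classes):
                fa = feat_by_domain[a][labels_by_domain[a] == c]
                fb = feat_by_domain[b][labels_by_domain[b] == c]
                if len(fa) == 0 or len(fb) == 0:
                    continue  # class absent in one domain for this mini-batch
                for k in range(1, K + 1):
                    loss = loss + torch.norm(fa.pow(k).mean(0) - fb.pow(k).mean(0))
                    num_terms += 1
    return loss / max(num_terms, 1)
```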
3.4 ENSEMBLE OF FEATURE REPRESENTATIONS

In this section, we introduce ensemble learning for further enhancement. Features extracted with the strategies elaborated in the previous section contain the label information for a single label classifier. However, each label classifier leverages only limited label characteristics, and thus the conventional scheme of adopting only one pair of feature extractor and label classifier captures only a small part of the label information. Our idea is to leverage an ensemble of multiple pairs of feature extractor and label classifier in order to make the features more label-informative.

We train multiple pairs of feature extractor and label classifier in parallel following the label-wise moment matching approach explained in Section 3.3. Let $n$ denote the number of feature extractors in the overall model. We denote the $n$ (feature extractor, label classifier) pairs as $(f_{e,1}, f_{lc,1}), (f_{e,2}, f_{lc,2}), \ldots, (f_{e,n}, f_{lc,n})$ and the $n$ resultant features as $feat_1, feat_2, \ldots, feat_n$, where $feat_i$ is the output of the feature extractor $f_{e,i}$. After obtaining the $n$ different feature mapping modules, we concatenate the $n$ features into one vector $feat_{final} = \mathrm{concat}(feat_1, feat_2, \ldots, feat_n)$. The final label classifier $f_{lc,final}$ takes the concatenated feature as input and predicts the label of the feature.

Naively exploiting multiple feature extractors, however, does not guarantee the diversity of the features since it resorts to randomness. Thus, we introduce a new model component, the extractor classifier, which separates the features from different extractors. The extractor classifier $f_{ec}$ gets the features generated by a feature extractor as inputs and predicts which feature extractor has generated the features. For example, if $n = 2$, the extractor classifier $f_{ec}$ attempts to predict whether the input feature is extracted by the extractor $f_{e,1}$ or $f_{e,2}$. By training the extractor classifier and the multiple feature extractors at once, we explicitly diversify the features obtained from different extractors. We train the extractor classifier utilizing the feature diversifying loss $\mathcal{L}_{fd}$:

$$\mathcal{L}_{fd} = \frac{1}{N+1} \left[ \sum_{i=1}^{N} \frac{1}{n_{S_i}} \sum_{j=1}^{n_{S_i}} \sum_{k=1}^{n} \mathcal{L}_{ce}\big(f_{ec}(f_{e,k}(x_j^{S_i})),\, k\big) + \frac{1}{n_T} \sum_{j=1}^{n_T} \sum_{k=1}^{n} \mathcal{L}_{ce}\big(f_{ec}(f_{e,k}(x_j^{T})),\, k\big) \right], \qquad (3)$$

where $n$ is the number of feature extractors.
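The ensemble components of Section 3.4 could be organized as in the schematic PyTorch sketch below. The class name, the use of single linear layers for the label and extractor classifiers, and the default n = 2 are our own simplifications; `backbone_fn` is a hypothetical factory that builds a feature extractor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsembleAdapter(nn.Module):
    """A minimal sketch of the ensemble part of the model (Section 3.4)."""

    def __init__(self, backbone_fn, feat_dim, num_classes, n_extractors=2):
        super().__init__()
        self.extractors = nn.ModuleList([backbone_fn() for _ in range(n_extractors)])
        self.label_classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(n_extractors)]
        )
        # Extractor classifier: predicts which feature extractor produced a feature.
        self.extractor_classifier = nn.Linear(feat_dim, n_extractors)
        # Final label classifier consumes the concatenated features.
        self.final_classifier = nn.Linear(feat_dim * n_extractors, num_classes)

    def feature_diversifying_loss(self, x):
        """Eq. (3): encourage features from different extractors to be separable."""
        loss = 0.0
        for k, f_e in enumerate(self.extractors):
            logits = self.extractor_classifier(f_e(x))
            target = torch.full((x.size(0),), k, dtype=torch.long, device=x.device)
            loss = loss + F.cross_entropy(logits, target)
        return loss / len(self.extractors)

    def forward(self, x):
        feats = [f_e(x) for f_e in self.extractors]
        return self.final_classifier(torch.cat(feats, dim=1))
```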
3.5 ENMDAP: ENSEMBLE MULTI-SOURCE DOMAIN ADAPTATION WITH PSEUDOLABELS
Our final model EnMDAP consists of $n$ pairs of feature extractor and label classifier, $(f_{e,1}, f_{lc,1}), (f_{e,2}, f_{lc,2}), \ldots, (f_{e,n}, f_{lc,n})$, one extractor classifier $f_{ec}$, and one final label classifier $f_{lc,final}$. We first train the entire model except the final label classifier with the loss $\mathcal{L}$:

$$\mathcal{L} = \sum_{k=1}^{n} \mathcal{L}_{lc,k} + \alpha \sum_{k=1}^{n} \mathcal{L}_{lmm,K,k} + \beta \mathcal{L}_{fd}, \qquad (4)$$

where $\mathcal{L}_{lc,k}$ is the label classification loss of the classifier $f_{lc,k}$, $\mathcal{L}_{lmm,K,k}$ is the label-wise moment matching loss of the feature extractor $f_{e,k}$, and $\alpha$ and $\beta$ are hyperparameters. Then, the final label classifier is trained with respect to the label classification loss $\mathcal{L}_{lc,final}$ using the concatenated features from multiple feature extractors.
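The two-stage procedure above could be organized as in the following training sketch, which builds on the `EnsembleAdapter`, `pseudolabel`, and `label_wise_moment_loss` sketches given earlier. The loader structure, function signature, and hyperparameter values are illustrative placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def train_enmdap(model, source_loaders, target_loader, optimizer, final_optimizer,
                 num_classes, K=2, alpha=0.5, beta=1.0, tau=0.9, epochs=100):
    """Schematic two-stage training of EnMDAP (Section 3.5)."""
    # Stage 1: train extractors, per-extractor classifiers, and the extractor
    # classifier with L = sum_k L_lc,k + alpha * sum_k L_lmm,K,k + beta * L_fd (Eq. (4)).
    for _ in range(epochs):
        for source_batches, x_t in zip(zip(*source_loaders), target_loader):
            xs_list = [x for x, _ in source_batches]
            ys_list = [y for _, y in source_batches]
            loss = 0.0
            for f_e, f_lc in zip(model.extractors, model.label_classifiers):
                feats_s = [f_e(x) for x in xs_list]
                # Label classification loss over the sources, Eq. (2).
                loss_lc = sum(F.cross_entropy(f_lc(f), y)
                              for f, y in zip(feats_s, ys_list)) / len(feats_s)
                # Pseudolabel the target batch, then match label-wise moments, Eq. (1).
                feat_t, pl_t = pseudolabel(f_e(x_t), f_lc, tau)
                loss_lmm = label_wise_moment_loss(feats_s + [feat_t], ys_list + [pl_t],
                                                  num_classes, K)
                loss = loss + loss_lc + alpha * loss_lmm
            # Feature diversifying loss, Eq. (3), on all available data.
            loss = loss + beta * model.feature_diversifying_loss(torch.cat(xs_list + [x_t]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Stage 2: train only the final label classifier on the concatenated features
    # (final_optimizer is assumed to hold only model.final_classifier parameters).
    for _ in range(epochs):
        for source_batches, _ in zip(zip(*source_loaders), target_loader):
            loss = sum(F.cross_entropy(model(x), y) for x, y in source_batches) / len(source_batches)
            final_optimizer.zero_grad()
            loss.backward()
            final_optimizer.step()
```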
4 ANALYSIS

We present a theoretical insight regarding the validity of the label-wise moment matching loss. For simplicity, we tackle only binary classification tasks. The error rate of a hypothesis $h$ on a domain $\mathcal{D}$ is denoted as $\epsilon_{\mathcal{D}}(h) = \Pr[h(x) \neq l_{\mathcal{D}}(x)]$, where $l_{\mathcal{D}}$ is the labeling function on the domain $\mathcal{D}$. We first introduce the $k$-th order label-wise moment divergence.

Definition 1.
Let $\mathcal{D}$ and $\mathcal{D}'$ be two domains over an input space $\mathcal{X} \subset \mathbb{R}^n$, where $n$ is the dimension of the inputs. Let $\mathcal{C}$ be the set of the labels, and $\mu_c(x)$ and $\mu'_c(x)$ be the data distributions given that the label is $c$, i.e., $\mu_c(x) = \mu(x|y=c)$ and $\mu'_c(x) = \mu'(x|y=c)$ for the data distributions $\mu$ and $\mu'$ on the domains $\mathcal{D}$ and $\mathcal{D}'$, respectively. Then, the $k$-th order label-wise moment divergence $d_{LM,k}(\mathcal{D}, \mathcal{D}')$ of the two domains $\mathcal{D}$ and $\mathcal{D}'$ over $\mathcal{X}$ is defined as

$$d_{LM,k}(\mathcal{D}, \mathcal{D}') = \sum_{c\in\mathcal{C}} \sum_{\mathbf{i}\in\Delta_k} \left| p(c) \int_{\mathcal{X}} \mu_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - p'(c) \int_{\mathcal{X}} \mu'_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \right|, \qquad (5)$$

where $\Delta_k = \{\mathbf{i} = (i_1, \ldots, i_n) \in \mathbb{N}^n \mid \sum_{j=1}^{n} i_j = k\}$ is the set of tuples of nonnegative integers which add up to $k$, $p(c)$ and $p'(c)$ are the probabilities that arbitrary data from $\mathcal{D}$ and $\mathcal{D}'$ are labeled as $c$, respectively, and the data $x \in \mathcal{X}$ is expressed as $(x_1, \ldots, x_n)$.

The ultimate goal of MSDA is to find a hypothesis $h$ with the minimum target error. We nevertheless train the model with respect to the source data since ground truths for the target are unavailable. Let $N$ datasets be drawn from $N$ labeled source domains $S_1, \ldots, S_N$ respectively. We denote the $i$-th source dataset $X_{S_i}$ as $\{(x_j^{S_i}, y_j^{S_i})\}_{j=1}^{n_{S_i}}$. The empirical error of a hypothesis $h$ in the $i$-th source domain $S_i$ estimated with $X_{S_i}$ is formulated as $\hat\epsilon_{S_i}(h) = \frac{1}{n_{S_i}} \sum_{j=1}^{n_{S_i}} \mathbb{1}\big[h(x_j^{S_i}) \neq y_j^{S_i}\big]$. Given a weight vector $\boldsymbol\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ such that $\sum_{i=1}^{N} \alpha_i = 1$, the weighted empirical source error is formulated as $\hat\epsilon_{\boldsymbol\alpha}(h) = \sum_{i=1}^{N} \alpha_i \hat\epsilon_{S_i}(h)$. We extend the theorems in Ben-David et al. (2010); Peng et al. (2019) and derive a bound for the target error $\epsilon_T(h)$, for $h$ trained with source data, in terms of the $k$-th order label-wise moment divergence.

Theorem 1.
Let $\mathcal{H}$ be a hypothesis space of VC dimension $d$, $n_{S_i}$ be the number of samples from source domain $S_i$, $m = \sum_{i=1}^{N} n_{S_i}$ be the total number of samples from the $N$ source domains $S_1, \ldots, S_N$, and $\boldsymbol\beta = (\beta_1, \ldots, \beta_N)$ with $\beta_i = \frac{n_{S_i}}{m}$. Let us define a hypothesis $\hat h = \arg\min_{h\in\mathcal{H}} \hat\epsilon_{\boldsymbol\alpha}(h)$ that minimizes the weighted empirical source error, and a hypothesis $h^*_T = \arg\min_{h\in\mathcal{H}} \epsilon_T(h)$ that minimizes the true target error. Then, for any $\delta \in (0, 1)$ and $\epsilon > 0$, there exist $N$ integers $n^1_\epsilon, \ldots, n^N_\epsilon$ and $N$ constants $a_{n^1_\epsilon}, \ldots, a_{n^N_\epsilon}$ such that

$$\epsilon_T(\hat h) \le \epsilon_T(h^*_T) + \eta_{\boldsymbol\alpha,\boldsymbol\beta,m,\delta} + \epsilon + \sum_{i=1}^{N} \alpha_i \left( \lambda_i + a_{n^i_\epsilon} \sum_{k=1}^{n^i_\epsilon} d_{LM,k}(S_i, T) \right) \qquad (6)$$

with probability at least $1-\delta$, where $\eta_{\boldsymbol\alpha,\boldsymbol\beta,m,\delta} = 4\sqrt{\left(\sum_{i=1}^{N} \frac{\alpha_i^2}{\beta_i}\right)\left(\frac{2d(\log(2m/d)+1) + 2\log(4/\delta)}{m}\right)}$ and $\lambda_i = \min_{h\in\mathcal{H}} \{\epsilon_T(h) + \epsilon_{S_i}(h)\}$.

Proof. See the appendix.

Assuming that all datasets are balanced with respect to the annotations, i.e., $p(c) = p'(c) = \frac{1}{|\mathcal{C}|}$ for any $c \in \mathcal{C}$, $\mathcal{L}_{lmm,K}$ is expressed as the sum of the estimates of $d_{LM,k}$ with $k = 1, \ldots, K$. The theorem provides the insight that label-wise moment matching allows the model trained with source data to have performance comparable to the optimal one on the target domain.

5 EXPERIMENTS
We conduct experiments to answer the following questions about EnMDAP.

Q1 Accuracy (Section 5.2). How well does EnMDAP perform in classification tasks?
Q2 Ablation Study (Section 5.3). How much does each component of EnMDAP contribute to the performance improvement?
Q3 Effects of Degree of Ensemble (Section 5.4). How does the performance change as the number n of pairs of feature extractor and label classifier increases?

5.1 EXPERIMENTAL SETTINGS
Datasets.
We use three kinds of datasets: Digits-Five, Office-Caltech10, and Amazon Reviews. Digits-Five consists of five datasets for digit recognition: MNIST (LeCun et al. (1998)), MNIST-M (Ganin & Lempitsky (2015)), SVHN (Netzer et al. (2011)), SynthDigits (Ganin & Lempitsky (2015)), and USPS (Hastie et al. (2001)). We set one of them as the target domain and the rest as source domains. Following the conventions in prior works (Xu et al. (2018); Peng et al. (2019)), we randomly sample 25000 instances from the source training set and 9000 instances from the target training set to train the model, except for USPS for which the whole training set is used. Office-Caltech10 is a dataset for image classification with the 10 categories that the Office31 dataset and the Caltech dataset have in common. It involves four different domains: Amazon, Caltech, DSLR, and Webcam. We double the number of data by data augmentation and exploit all the original data and augmented data as training data and test data, respectively. The Amazon Reviews dataset contains customers' reviews on 4 product categories: Books, DVDs, Electronics, and Kitchen appliances. The instances are encoded into 5000-dimensional vectors and are labeled as either positive or negative depending on their sentiments. We set each of the four categories as a target and the rest as sources. For all the domains, 2000 instances are sampled for training, and the rest of the data are used for the test. Details about the datasets are summarized in the appendix.

Dataset sources: https://people.eecs.berkeley.edu/~jhoffman/domainadapt/, https://github.com/KeiraZhao/MDAN/blob/master/amazon.npz, http://yann.lecun.com/exdb/mnist/, http://yaroslav.ganin.net, http://ufldl.stanford.edu/housenumbers/

Table 1: Accuracy of the compared methods; SC and SB denote Source Combined and Single Best, respectively. Note that EnMDAP shows the best performance.

(a) Digits-Five
Method M+S+D+U/T T+S+D+U/M T+M+D+U/S T+M+S+U/D T+M+S+D/U Average
LeNet5 (SC) 97.58 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± SDA 98.75 ± ± ± ± ± ± SDA- β ± ± ± ± ± ± N MDAP (n=2) ± ± ± ± ± ± (b) Office-Caltech10 Method C+D+W/A A+D+W/C A+C+W/D A+C+D/W Average
ResNet50 (SC) 95.47 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± SDA 95.14 ± ± ± ± ± SDA- β ± ± ± ± ± N MDAP (n=2) ± ± ± ± ± (c) Amazon Reviews Method D+E+K/B B+E+K/D B+D+K/E B+D+E/K Average
MLP (SC) 79.76 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± SDA 78.97 ± ± ± ± ± SDA- β ± ± ± ± ± N MDAP (n=2) ± ± ± ± ± Competitors.
We use 3 MSDA algorithms with state-of-the-art performance as baselines: DCTN (Xu et al. (2018)), M3SDA (Peng et al. (2019)), and M3SDA-β (Peng et al. (2019)). All the frameworks share the same architecture for the feature extractor, the domain classifier, and the label classifier for consistency. For Digits-Five, we use convolutional neural networks based on LeNet5 (LeCun et al. (1998)). For Office-Caltech10, ResNet50 (He et al. (2016)) pretrained on ImageNet is used as the backbone architecture. For Amazon Reviews, the feature extractor is composed of three fully-connected layers with 1000, 500, and 100 output units, and a single fully-connected layer with 100 input units and 2 output units is adopted for both the extractor and label classifiers. With Digits-Five, LeNet5 (LeCun et al. (1998)) and ResNet14 (He et al. (2016)) without any adaptation are additionally investigated in two different manners: Source Combined and Single Best. In Source Combined, multiple source datasets are simply combined and fed into a model. In Single Best, we train the model with each source dataset independently and report the result of the best-performing one. Likewise, ResNet50 and an MLP consisting of 4 fully-connected layers with 1000, 500, 100, and 2 units are investigated without adaptation for Office-Caltech10 and Amazon Reviews, respectively.
Training Details.
We train our models for Digits-Five with the Adam optimizer (Kingma & Ba (2015)) with β1 = 0. , β2 = 0. , and a learning rate of 0.  for 100 epochs. All images are scaled to  × , and the mini-batch size is set to  . We set the hyperparameters α = 0. , β = 1, and K = 2. For the experiments with Office-Caltech10, all the modules comprising our model are trained with SGD with learning rate 0. , except that the optimizers for feature extractors have learning rate 0. . We scale all the images to  ×  and set the mini-batch size to  . All the hyperparameters are kept the same as in the experiments with Digits-Five. For Amazon Reviews, we train the models for  epochs using the Adam optimizer with β1 = 0. , β2 = 0. , and a learning rate of 0. . We set α = β = 1, K = 2, and the mini-batch size to  . For every experiment, the confidence threshold τ is set to 0. .

Table 2: Experiments with EnMDAP and its variants.
Method M+S+D+U/T T+S+D+U/M T+M+D+U/S T+M+S+U/D T+M+S+D/U Average
MDAP-L 98.75 ± ± ± ± ± ± ± ± ± ± ± ± N MDAP-R (n=2) ± ± ± ± ± ± N MDAP (n=2) 99.31 ± ± ± ± ± ± E N MDAP (n=3) 99.31 ± ± ± ± ± ± N MDAP (n=4) 99.30 ± ± ± ± ± ± ERFORMANCE E VALUATION
We evaluate the performance of EnMDAP with n = 2 against the competitors. We repeat the experiments for each setting five times and report the mean and the standard deviation. The results are summarized in Table 1. Note that EnMDAP provides the best accuracy on all the datasets, showing consistent superiority in both the image datasets (Digits-Five, Office-Caltech10) and the text dataset (Amazon Reviews). The enhancement is especially remarkable when MNIST-M is the target domain in Digits-Five, improving the accuracy compared to the state-of-the-art methods.

5.3 ABLATION STUDY
We perform an ablation study on Digits-Five to identify what exactly enhances the performance of EnMDAP. We compare EnMDAP with 3 of its variants: MDAP-L, MDAP, and EnMDAP-R. MDAP-L uses the same strategy as M3SDA, aligning moments regardless of the labels of the data. MDAP trains the model without the ensemble learning scheme. EnMDAP-R exploits the ensemble learning strategy but relies on randomness, without the extractor classifier and the feature diversifying loss.

The results are shown in Table 2. By comparing MDAP-L and MDAP, we observe that considering labels in moment matching plays a significant role in extracting domain-invariant features. The remarkable performance gap between MDAP and EnMDAP with n = 2 verifies the effectiveness of ensemble learning. On the other hand, the performances of EnMDAP-R and EnMDAP show little difference. It indicates that two feature extractors trained independently without any diversifying techniques are unlikely to be correlated, even though this resorts to randomness.
5.4 EFFECTS OF ENSEMBLE
We vary n, the number of pairs of feature extractor and label classifier, and repeat the performance evaluation on Digits-Five. The results are summarized in Table 2. While an ensemble of two pairs gives much better performance than the model with one single pair, using more than two pairs rarely brings further improvement. This result demonstrates that two pairs of feature extractor and label classifier are able to cover most information without losing important label information in Digits-Five. It is notable that increasing n sometimes brings a small performance degradation. As more feature extractors are adopted to obtain the final features, the complexity of the final features increases. It is harder for the final label classifier to manage features with high complexity compared to simple ones. This deteriorates the performance when we exploit more than two feature extractors.

6 CONCLUSION
We propose EnMDAP, a novel framework for the multi-source domain adaptation problem. EnMDAP overcomes the problems of existing methods: not directly addressing the conditional distributions of data p(x|y), not fully exploiting knowledge of the target data, and missing a large amount of label information. EnMDAP aligns data from multiple source domains and the target domain considering the data labels, and exploits pseudolabels for the unlabeled target data. EnMDAP further enhances the performance by introducing multiple feature extractors. Our framework exhibits superior performance on both image and text classification tasks. Considering labels in moment matching and adding the ensemble learning scheme are shown to bring remarkable performance enhancement through the ablation study. Future works include extending our approach to other tasks such as regression, which may require modification of the pseudolabeling method.
REFERENCES

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Mach. Learn., 79(1-2):151-175, 2010.

Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In AISTATS, 2005.

Koby Crammer, Michael J. Kearns, and Jennifer Wortman. Learning from multiple sources. JMLR, 2008.

Yaroslav Ganin and Victor S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.

Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV, 2016.

Trevor Hastie, Jerome H. Friedman, and Robert Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2001.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. Algorithms and theory for multiple-source adaptation. In NIPS, 2018a.

Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018b.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML, 2013.

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.

Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.

Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2008.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In AAAI, 2018.

Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, 2019.

Marshall Harvey Stone. Applications of the theory of boolean rings to general topology. Transactions of the American Mathematical Society, 1937.

Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.

A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011.

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.

Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.

Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, 2018.

Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (CMD) for domain-invariant representation learning. CoRR, abs/1702.08811, 2017.

Werner Zellinger, Bernhard Alois Moser, and Susanne Saminger-Platz. Learning bounds for moment-based domain adaptation. CoRR, abs/2002.08260, 2020.

Kun Zhang, Mingming Gong, and Bernhard Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, 2015.

Han Zhao, Shanghang Zhang, Guanhang Wu, José M. F. Moura, João Paulo Costeira, and Geoffrey J. Gordon. Adversarial multiple source domain adaptation. In NIPS, 2018.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Fuzhen Zhuang, Xiaohu Cheng, Ping Luo, Sinno Jialin Pan, and Qing He. Supervised representation learning: Transfer learning with deep autoencoders. In IJCAI, 2015.

APPENDIX
A.1 PROOF FOR THEOREM 1

We first restate the $k$-th order label-wise moment divergence $d_{LM,k}$ and define the disagreement ratio $\epsilon_{\mathcal{D}}(h_1, h_2)$ of the two hypotheses $h_1, h_2 \in \mathcal{H}$ on the domain $\mathcal{D}$.

Definition 1.
Let $\mathcal{D}$ and $\mathcal{D}'$ be two domains over an input space $\mathcal{X} \subset \mathbb{R}^n$, where $n$ is the dimension of the inputs. Let $\mathcal{C}$ be the set of the labels, and $\mu_c(x)$ and $\mu'_c(x)$ be the data distributions given that the label is $c$, i.e., $\mu_c(x) = \mu(x|y=c)$ and $\mu'_c(x) = \mu'(x|y=c)$ for the data distributions $\mu$ and $\mu'$ on the domains $\mathcal{D}$ and $\mathcal{D}'$, respectively. Then, the $k$-th order label-wise moment divergence $d_{LM,k}(\mathcal{D}, \mathcal{D}')$ of the two domains $\mathcal{D}$ and $\mathcal{D}'$ over $\mathcal{X}$ is defined as

$$d_{LM,k}(\mathcal{D}, \mathcal{D}') = \sum_{c\in\mathcal{C}} \sum_{\mathbf{i}\in\Delta_k} \left| p(c) \int_{\mathcal{X}} \mu_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - p'(c) \int_{\mathcal{X}} \mu'_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \right|, \qquad (7)$$

where $\Delta_k = \{\mathbf{i} = (i_1, \ldots, i_n) \in \mathbb{N}^n \mid \sum_{j=1}^{n} i_j = k\}$ is the set of tuples of nonnegative integers which add up to $k$, $p(c)$ and $p'(c)$ are the probabilities that arbitrary data from $\mathcal{D}$ and $\mathcal{D}'$ are labeled as $c$, respectively, and the data $x \in \mathcal{X}$ is expressed as $(x_1, \ldots, x_n)$.

Definition 2.
Let $\mathcal{D}$ be a domain over an input space $\mathcal{X} \subset \mathbb{R}^n$ with the data distribution $\mu(x)$. Then, we define the disagreement ratio $\epsilon_{\mathcal{D}}(h_1, h_2)$ of the two hypotheses $h_1, h_2 \in \mathcal{H}$ on the domain $\mathcal{D}$ as

$$\epsilon_{\mathcal{D}}(h_1, h_2) = \Pr_{x \sim \mu(x)}[h_1(x) \neq h_2(x)]. \qquad (8)$$

Theorem 2. (Stone-Weierstrass Theorem (Stone (1937))) Let $K$ be a compact subset of $\mathbb{R}^n$ and $f: K \to \mathbb{R}$ be a continuous function. Then, for every $\epsilon > 0$, there exists a polynomial $P: K \to \mathbb{R}$ such that

$$\sup_{x \in K} |f(x) - P(x)| < \epsilon. \qquad (9)$$

Theorem 2 indicates that continuous functions on a compact subset of $\mathbb{R}^n$ can be approximated with polynomials. We next formulate the discrepancy of the two domains using the disagreement ratio and bound it with the label-wise moment divergence.

Lemma 1.
Let $\mathcal{D}$ and $\mathcal{D}'$ be two domains over an input space $\mathcal{X} \subset \mathbb{R}^n$, where $n$ is the dimension of the inputs. Then, for any hypotheses $h_1, h_2 \in \mathcal{H}$ and any $\epsilon > 0$, there exist $n_\epsilon \in \mathbb{N}$ and a constant $a_{n_\epsilon}$ such that

$$|\epsilon_{\mathcal{D}}(h_1, h_2) - \epsilon_{\mathcal{D}'}(h_1, h_2)| \le a_{n_\epsilon} \sum_{k=1}^{n_\epsilon} d_{LM,k}(\mathcal{D}, \mathcal{D}') + \epsilon. \qquad (10)$$

Proof.
Let the domains D and D (cid:48) have the data distribution of µ ( x ) and µ (cid:48) ( x ) , respectively, over aninput space X , which is a compact subset of R n , where n is the dimension of the inputs. For brevity,we denote | (cid:15) D ( h , h ) − (cid:15) D (cid:48) ( h , h ) | as ∆ D , D (cid:48) . Then, ∆ D , D (cid:48) = | (cid:15) D ( h , h ) − (cid:15) D (cid:48) ( h , h ) |≤ sup h ,h ∈H | (cid:15) D ( h , h ) − (cid:15) D (cid:48) ( h , h ) | = sup h ,h ∈H (cid:12)(cid:12)(cid:12)(cid:12) Pr x ∼ µ ( x ) [ h ( x ) (cid:54) = h ( x )] − Pr x ∼ µ (cid:48) ( x ) [ h ( x ) (cid:54) = h ( x )] (cid:12)(cid:12)(cid:12)(cid:12) = sup h ,h ∈H (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X µ ( x ) h ( x ) (cid:54) = h ( x ) d x − (cid:90) X µ (cid:48) ( x ) h ( x ) (cid:54) = h ( x ) d x (cid:12)(cid:12)(cid:12)(cid:12) . (11)For any hypotheses h , h , the indicator function h ( x ) (cid:54) = h ( x ) is Lebesgue integrable on X , i.e . h ( x ) (cid:54) = h ( x ) is a L function. Since a set of continuous functions is dense in L ( X ) , for every11 > , there exists a continuous L function f defined on X such that (cid:12)(cid:12) h ( x ) (cid:54) = h ( x ) − f ( x ) (cid:12)(cid:12) ≤ (cid:15) (12)for every x ∈ X , and the fixed h and h that drive Equation 5 to the supremum. Accordingly, f ( x ) − (cid:15) ≤ h ( x ) (cid:54) = h ( x ) ≤ f ( x ) + (cid:15) . (13)By integrating every term in the inequality over X , the inequality, (cid:90) X µ ( x ) f ( x ) d x − (cid:15) ≤ (cid:90) X µ ( x ) h ( x ) (cid:54) = h ( x ) d x ≤ (cid:90) X µ ( x ) f ( x ) d x + (cid:15) , (14)follows. Likewise, the same inequality on the domain D (cid:48) with µ (cid:48) instead of µ holds. By subtractingthe two inequalities and reformulating it, the inequality, − (cid:15) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X µ ( x ) h ( x ) (cid:54) = h ( x ) d x − (cid:90) X µ (cid:48) ( x ) h ( x ) (cid:54) = h ( x ) d x (cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X µ ( x ) f ( x ) d x − (cid:90) X µ (cid:48) ( x ) f ( x ) d x (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:15) , (15)is induced. By substituting the inequality in Equation 9 to the Equation 5, ∆ D , D (cid:48) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X µ ( x ) f ( x ) d x − (cid:90) X µ (cid:48) ( x ) f ( x ) d x (cid:12)(cid:12)(cid:12)(cid:12) + (cid:15) . (16)By the Theorem 2, there exists a polynomial P ( x ) such that sup x ∈X | f ( x ) − P ( x ) | < (cid:15) , (17)and the polynomial P ( x ) is expressed as P ( x ) = n (cid:15) (cid:88) k =1 (cid:88) i ∈ ∆ k α i n (cid:89) j =1 ( x j ) i j , (18)where n (cid:15) is the order of the polynomial, ∆ k = { i = ( i , . . . , i n ) ∈ N n | (cid:80) nj =1 i j = k } is the set ofthe tuples of the nonnegative integers, which add up to k , α i is the coefficient of each term of thepolynomial, and x = ( x , x , . . . , x n ) . 
By applying Equation 11 to the Equation 10 and substitutingthe expression in Equation 12, ∆ D , D (cid:48) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X µ ( x ) P ( x ) d x − (cid:90) X µ (cid:48) ( x ) P ( x ) d x (cid:12)(cid:12)(cid:12)(cid:12) + (cid:15) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X µ ( x ) n (cid:15) (cid:88) k =1 (cid:88) i ∈ ∆ k α i n (cid:89) j =1 ( x j ) i j d x − (cid:90) X µ (cid:48) ( x ) n (cid:15) (cid:88) k =1 (cid:88) i ∈ ∆ k α i n (cid:89) j =1 ( x j ) i j d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:15) ≤ n (cid:15) (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ ∆ k α i (cid:90) X µ ( x ) n (cid:89) j =1 ( x j ) i j d x − α i (cid:90) X µ (cid:48) ( x ) n (cid:89) j =1 ( x j ) i j d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:15) ≤ n (cid:15) (cid:88) k =1 (cid:88) i ∈ ∆ k | α i | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X µ ( x ) n (cid:89) j =1 ( x j ) i j d x − (cid:90) X µ (cid:48) ( x ) n (cid:89) j =1 ( x j ) i j d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:15) = n (cid:15) (cid:88) k =1 (cid:88) i ∈ ∆ k | α i | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X (cid:88) c ∈C p ( c ) µ c ( x ) n (cid:89) j =1 ( x j ) i j d x − (cid:90) X (cid:88) c ∈C p (cid:48) ( c ) µ (cid:48) c ( x ) n (cid:89) j =1 ( x j ) i j d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:15), (19)where p ( c ) and p (cid:48) ( c ) are the probability that an arbitrary data is labeled as class c in domain D and D (cid:48) , respectively, and µ c ( x ) = µ ( x | y = c ) and µ (cid:48) c ( x ) = µ (cid:48) ( x | y = c ) are the data distribution given12hat the data is labeled as class c on domain D and D (cid:48) , respectively. For a ∆ k = max i ∈ ∆ k | α i | , ∆ D , D (cid:48) ≤ n (cid:15) (cid:88) k =1 a ∆ k (cid:88) i ∈ ∆ k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X (cid:88) c ∈C p ( c ) µ c ( x ) n (cid:89) j =1 ( x j ) i j d x − (cid:90) X (cid:88) c ∈C p (cid:48) ( c ) µ (cid:48) c ( x ) n (cid:89) j =1 ( x j ) i j d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:15) ≤ n (cid:15) (cid:88) k =1 a ∆ k (cid:88) i ∈ ∆ k (cid:88) c ∈C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p ( c ) (cid:90) X µ c ( x ) n (cid:89) j =1 ( x j ) i j − p (cid:48) ( c ) (cid:90) X µ (cid:48) c ( x ) n (cid:89) j =1 ( x j ) i j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:15) ≤ n (cid:15) (cid:88) k =1 a ∆ k d LM.k ( D , D (cid:48) ) + (cid:15) ≤ a n (cid:15) n (cid:15) (cid:88) k =1 d LM,k ( D , D (cid:48) ) + (cid:15), (20)for a n (cid:15) = 2 max ≤ k ≤ n (cid:15) a ∆ k .Let N datasets be drawn from N labeled source domains S , S , . . . , S N respectively. We denote i -ith source dataset X S i as { ( x S i j , y S i j ) } n S i j =1 . The empirical error of hypothesis h in i -th sourcedomain S i estimated with X S i is formulated as ˆ (cid:15) S i ( h ) = n S i (cid:80) n S i j =1 h ( x S ij ) (cid:54) = y S ij . Given a positiveweight vector α = ( α , α , . . . , α N ) such that (cid:80) Ni =1 α i = 1 and α i ≥ , the weighted empiricalsource error is formulated as ˆ (cid:15) α ( h ) = (cid:80) Ni =1 α i ˆ (cid:15) S i ( h ) . Lemma 2.
For $N$ source domains $S_1, S_2, \ldots, S_N$, let $n_{S_i}$ be the number of samples from source domain $S_i$, $m = \sum_{i=1}^{N} n_{S_i}$ be the total number of samples from the $N$ source domains, and $\boldsymbol\beta = (\beta_1, \beta_2, \ldots, \beta_N)$ with $\beta_i = \frac{n_{S_i}}{m}$. Let $\epsilon_{\boldsymbol\alpha}(h)$ be the weighted true source error, which is the weighted sum of $\epsilon_{S_i}(h) = \Pr_{x \sim \mu_{S_i}(x)}[h(x) \neq y]$. Then,

$$\Pr\big[|\hat\epsilon_{\boldsymbol\alpha}(h) - \epsilon_{\boldsymbol\alpha}(h)| \ge \epsilon\big] \le 2\exp\left(\frac{-2m\epsilon^2}{\sum_{i=1}^{N} \alpha_i^2 / \beta_i}\right). \qquad (21)$$

Proof. It has been proven in Ben-David et al. (2010).

We now turn our focus back to Theorem 1 in the paper and complete the proof.
Theorem 1.
Let H be a hypothesis space of VC dimension d , n S i be the number of sam-ples from source domain S i , m = (cid:80) Ni =1 n S i be the total number of samples from N sourcedomains S , . . . , S N , and β = ( β , . . . , β N ) with β i = n S i m . Let us define a hypothesis ˆ h = arg min h ∈H ˆ (cid:15) α ( h ) that minimizes the weighted empirical source error, and a hypothesis h ∗T = arg min h ∈H (cid:15) T ( h ) that minimizes the true target error. Then, for any δ ∈ (0 , and (cid:15) > ,there exist N integers n (cid:15) , . . . , n N(cid:15) and N constants a n (cid:15) , . . . , a n N(cid:15) such that (cid:15) T (ˆ h ) ≤ (cid:15) T ( h ∗T ) + η α , β ,m,δ + (cid:15) + N (cid:88) i =1 α i λ i + a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) (22) with probability at least − δ , where η α , β ,m,δ = 4 (cid:115)(cid:16)(cid:80) Ni =1 α i β i (cid:17) (cid:18) d ( log ( md ) +1 ) +2 log ( δ ) m (cid:19) and λ i = min h ∈H { (cid:15) T ( h ) + (cid:15) S i ( h ) } .Proof. | (cid:15) α ( h ) − (cid:15) T ( h ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N (cid:88) i =1 α i (cid:15) S i ( h ) − (cid:15) T ( h ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ N (cid:88) i =1 α i | (cid:15) S i ( h ) − (cid:15) T ( h ) | . (23)We define h ∗ i = arg min h ∈H (cid:15) S i ( h ) + (cid:15) T ( h ) for every i = 1 , , . . . , N for the following equations.We also note that the 1-triangular inequality (Crammer et al. (2008)) holds for binary classification13asks, i.e ., (cid:15) D ( h , h ) ≤ (cid:15) D ( h , h ) + (cid:15) D ( h , h ) for any hypothesis h , h , h ∈ H and domain D .Then, | (cid:15) D ( h ) − (cid:15) D ( h, h (cid:48) ) | = | (cid:15) D ( h, l D ) − (cid:15) D ( h, h (cid:48) ) | ≤ (cid:15) D ( l D , h (cid:48) ) = (cid:15) D ( h (cid:48) ) (24)for the ground truth labeling function l D on the domain D and two hypotheses h, h (cid:48) ∈ H . Applyingthe definition and the inequality to Equation 17, | (cid:15) α ( h ) − (cid:15) T ( h ) | ≤ N (cid:88) i =1 α i ( | (cid:15) S i ( h ) − (cid:15) S i ( h, h ∗ i ) | + | (cid:15) S i ( h, h ∗ i ) − (cid:15) T ( h, h ∗ i ) | + | (cid:15) T ( h, h ∗ i ) − (cid:15) T ( h ) | ) ≤ N (cid:88) i =1 α i ( (cid:15) S i ( h ∗ i ) + | (cid:15) S i ( h, h ∗ i ) − (cid:15) T ( h, h ∗ i ) | + (cid:15) T ( h ∗ i )) (25)By the definition of h ∗ i , (cid:15) S i ( h ∗ i ) + (cid:15) T ( h ∗ i ) = λ i for λ i = min h ∈H { (cid:15) T ( h ) + (cid:15) S i ( h ) } . Additionally,according to Lemma 1, for any (cid:15) > , there exists an integer n (cid:15) and a constant a n i(cid:15) such that | (cid:15) S i ( h, h ∗ i ) − (cid:15) T ( h, h ∗ i ) | ≤ a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) + (cid:15) . (26)By applying these relations, | (cid:15) α ( h ) − (cid:15) T ( h ) | ≤ N (cid:88) i =1 α i λ i + 12 a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) + (cid:15) ≤ N (cid:88) i =1 α i λ i + 12 a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) + (cid:15) . (27)By Lemma 2 and the standard uniform convergence bound for hypothesis classes of finite VC di-mension (Ben-David et al. 
(2010)), (cid:15) T (ˆ h ) ≤ (cid:15) α (ˆ h ) + (cid:15) N (cid:88) i =1 α i λ i + 12 a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) ≤ ˆ (cid:15) α (ˆ h ) + 12 η α , β ,m,δ + (cid:15) N (cid:88) i =1 α i λ i + 12 a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) ≤ ˆ (cid:15) α ( h ∗T ) + 12 η α , β ,m,δ + (cid:15) N (cid:88) i =1 α i λ i + 12 a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) ≤ (cid:15) α ( h ∗T ) + η α , β ,m,δ + (cid:15) N (cid:88) i =1 α i λ i + 12 a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) ≤ (cid:15) T ( h ∗T ) + η α , β ,m,δ + (cid:15) + N (cid:88) i =1 α i λ i + a n i(cid:15) n i(cid:15) (cid:88) k =1 d LM,k ( S i , T ) . (28)The last inequality holds by Equation 21 with h = h ∗T .14.2 S UMMARY OF DATASETS
Table 3: Summary of datasets.
Datasets              Features      Labels  Training set  Test set  Properties
Digits-Five
  MNIST               1x28x28       10      60000         10000     Grayscale images
  MNIST-M             3x32x32       10      59001         9001      RGB images
  SVHN                3x32x32       10      73257         26032     RGB images
  SynthDigits         3x32x32       10      479400        9553      RGB images
  USPS                1x16x16       10      7291          2007      Grayscale images
Office-Caltech10
  Amazon              3x300x300     10      958           958       RGB images
  Caltech             Variable      10      1123          1123      RGB images
  DSLR                3x1000x1000   10      157           157       RGB images
  Webcam              Variable      10      295           295       RGB images
Amazon Reviews
  Books               5000          2       2000          4465      5000-dim vector
  DVDs                5000          2       2000          3586      5000-dim vector
  Electronics         5000          2       2000          5681      5000-dim vector
  Kitchen appliances  5000          2       2000          5945      5000-dim vector
A.3 QUALITATIVE ANALYSIS FOR ENMDAP
(Figure) Two-dimensional feature visualizations (Dim1 vs. Dim2), with points colored by class label (0-9): (a) No Adaptation, (b) MDAP-L, (c) MDAP, (d) EnMDAP (n=2), (e) EnMDAP (n=3).