Causal Generative Domain Adaptation Networks
Mingming Gong∗†‡, Kun Zhang∗‡, Biwei Huang‡, Clark Glymour‡, Dacheng Tao§, Kayhan Batmanghelich†

∗Equal contribution. †Department of Biomedical Informatics, University of Pittsburgh, USA. ‡Department of Philosophy, Carnegie Mellon University, USA. §UBTECH Sydney AI Center, The University of Sydney, Australia.

June 29, 2018
Abstract
An essential problem in domain adaptation is to understand and make use of distribution changes across domains. For this purpose, we first propose a flexible Generative Domain Adaptation Network (G-DAN) with specific latent variables to capture changes in the generating process of features across domains. By explicitly modeling the changes, one can even generate data in new domains using the generating process with new values for the latent variables in G-DAN. In practice, the process to generate all features together may involve high-dimensional latent variables, requiring dealing with distributions in high dimensions and making it difficult to learn domain changes from few source domains. Interestingly, by further making use of the causal representation of joint distributions, we then decompose the joint distribution into separate modules, each of which involves different low-dimensional latent variables and can be learned separately, leading to a Causal G-DAN (CG-DAN). This improves both the statistical and computational efficiency of the learning procedure. Finally, by matching the feature distribution in the target domain, we can recover the target-domain joint distribution and derive the learning machine for the target domain. We demonstrate the efficacy of both G-DAN and CG-DAN in domain generation and cross-domain prediction on both synthetic and real data.
In recent years supervised learning has achieved great success in various real-world problems, such as visual recognition, speech recognition, and natural language processing. However, the predictive model learned on training data may not generalize well when the distribution of test data is different. For example, a predictive model trained with data from one hospital may fail to produce reliable predictions in a different hospital due to distribution change. Domain adaptation (DA) aims at learning models that can predict well in test (or target) domains by transferring proper information from source to target domains [1, 2]. In this paper, we are concerned with a difficult scenario called unsupervised domain adaptation, where no labeled data are provided in the target domain.

Let X denote the features and Y the class label. If the joint distribution P_{XY} changes arbitrarily, the source domain may contain no useful knowledge to help prediction in the target domain. Thus, various constraints on how the distribution changes have been assumed for successful transfer. For instance, a large body of previous methods assume that the marginal distribution P_X changes but P_{Y|X} stays the same, i.e., the covariate shift situation [3, 4, 5, 6]. In this situation, correcting the shift in P_X can be achieved by reweighting source-domain instances [4, 5] or learning a domain-invariant representation [7, 8, 9]. This paper is concerned with a more difficult situation, namely conditional shift [10], where P_{X|Y} changes, leading to simultaneous changes in P_X and P_{Y|X}. In this case, P_{XY} in the target domain is generally not recoverable because there is no label information in the target domain. Fortunately, with appropriate constraints on how P_{X|Y} changes, we may recover P_{XY} in the target domain by matching only P_X in the target domain. For example, location-scale transforms in P_{X|Y} have been assumed and the corresponding identifiability conditions have been established [10, 11]. Despite its success in certain applications, the location-scale transform assumption may be too restrictive in many practical situations, calling for a more general treatment of distribution changes.

How can we model and estimate the changes in P_{X|Y} without assuming strong parametric constraints? To this end, we first propose a Generative Domain Adaptation Network (G-DAN) with specific latent variables θ to capture changes in P_{X|Y}. Specifically, the proposed network implicitly represents P_{X|Y} by a functional model X = g(Y, E; θ), where the latent variables θ may take different values across domains, and the independent noise E and the function g are shared by all domains. This provides a compact and nonparametric representation of changes in P_{X|Y}, making it easy to capture changes and find new sensible values of θ to generate new domain data. Assuming θ is low-dimensional, we provide the necessary conditions to recover θ up to some transformations from the source domains. In particular, if the changes can be captured by a single latent variable, we show that under mild conditions the changes can be recovered from a single source domain and an unlabeled target domain.
We can then estimate the joint distribution in the target domain by matching only the feature distribution, which enables target-domain-specific prediction. Furthermore, by interpolation in the θ space and realization of the noise term E, one can straightforwardly generate sensible data in new domains.

However, as the number of features increases, modeling P_{X|Y} for all features X jointly becomes much more difficult. To circumvent this issue, we propose to factorize the joint distribution of the features and the label according to the causal structure underlying them. Each term in the factorization aims at modeling the conditional distribution of one or more variables given their direct causes, and accordingly the latent variables θ are decomposed into disjoint, unrelated subsets [12]. Thanks to the modularity of causal systems, the latent variables for those terms can be estimated separately, enjoying a "divide-and-conquer" advantage. With the resulting Causal G-DAN (CG-DAN), it is then easier to interpret and estimate the latent variables and find their valid regions to generate new, sensible data. Our contribution is mainly twofold:

• Explicitly capturing invariance and changes across domains by exploiting the latent variables θ in a meaningful way.
• Further facilitating the learning of θ and the generating process in light of causal representations.

Due to the distribution shift phenomenon, DA has attracted a lot of attention in the past decade, and here we focus on related unsupervised DA methods. In the covariate shift scenario, where P_X changes but P_{Y|X} stays the same, traditional approaches correct the shift in P_X by reweighting the source-domain data with density ratios of the features [3, 5, 4, 13, 14]. However, such methods require the target domain to be contained in the support of the source domain, which may be restrictive in many applications. Another collection of methods searches for a domain-invariant representation that has invariant distributions in both domains. These methods rely on various distance or discrepancy measures as objective functions to match representation distributions; typical ones include maximum mean discrepancy (MMD) [7, 8], separability measured by classifiers [9], and optimal transport [15]. Moreover, the representation learning architectures have developed from shallow architectures such as linear projections [8] to deep neural networks [16, 9].

Recently, another line of research has attempted to address a more challenging situation where P_X and P_{Y|X} both change across domains. A class of methods to deal with this situation assumes a (causal) generative model Y → X, factorizes the joint distribution following the causal direction as P_{XY} = P_Y P_{X|Y}, and considers the changes in P_Y and P_{X|Y} separately. Assuming that only P_Y changes, one can estimate P_Y in the target domain by matching the feature distribution [17, 10, 18]. Further works also consider changes in P_{X|Y}, known as generalized target shift, and proposed representation learning methods with identifiability justifications, i.e., the learned representation τ(X) has invariant P_{τ(X)|Y} across domains if P_{τ(X)} is invariant after correction for P_Y [10, 11]. This also partially explains why previous representation learning methods for correcting covariate shift work well in the conditional shift situation where only P_{X|Y} changes but P_Y remains the same.
To better match joint distributions across domains, recent methods focus on exploring more powerful distribution discrepancy measures and more expressive representation learning architectures [15, 19].

In the causal discovery field (i.e., estimating the underlying causal structure from observational data) [20, 12], some recent methods have tried to learn causal graphs by representing functional causal models with neural networks [21]. Such methods leverage conditional independences and different complexities of the distribution factorizations in data from a fixed distribution to learn causal structures. In contrast, the purpose of our work is not causal discovery, but to make use of generative models for capturing domain-invariant and changing information, and to benefit from the modularity property of causal models [12] to understand and deal with changes in distributions.

In unsupervised DA, we are given m source domains D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} ∼ P^s_{XY}, where s ∈ {1, ..., m}, and a target domain D_t = {x_i^t}_{i=1}^{n_t} of n_t unlabeled examples sampled from P^t_X. The goal is to derive a learning machine to predict target-domain labels by making use of information from all domains. To achieve this goal, it is essential to understand how the generating process of all features X given the label Y changes across domains. Here we have assumed that Y is a cause of X, as is the case in many classification problems [10].

Figure 1: G-DAN. The generator g (represented by an NN) maps Y, E, and θ to X.

In this section, we propose a Generative Domain Adaptation Network (G-DAN), as shown in Fig. 1, to model both the invariant and changing parts of P_{X|Y} across domains for the purpose of DA. G-DAN uses specific latent variables θ ∈ R^d to capture the variability in P_{X|Y}. In the s-th domain, θ takes the value θ^(s) and thus encodes domain-specific information. In Fig. 1, the network specifies a distribution Q_{X|Y;θ} by the following functional model:

X = g(Y, E; θ),   (1)

which transforms random noise E ∼ Q_E into X ∈ R^D ∼ Q_{X|Y;θ}, conditioning on Y and θ. Q_{X|Y;θ=θ^(s)} is trained to approximate the conditional distribution in the s-th domain, P_{X|Y,S=s}, where S is the domain index. E is independent of Y and has a fixed distribution across domains. g is a function represented by a neural network (NN) and shared by all domains. Note that the functional model can be seen as a way to specify the conditional distribution of X given Y. For instance, consider the particular case where X and E have the same dimension and E can be recovered from X and Y. Then, by the change-of-variables formula, we have Q_{YX} = Q_{YE} / |∂(Y, X)/∂(Y, E)| = Q_Y Q_E / |∂X/∂E|; as a consequence, Q_{X|Y} = Q_{XY}/Q_Y = Q_E / |∂X/∂E|.

G-DAN provides a compact way to model the distribution changes across domains. For example, in the generating process of images, the illumination θ may change across different domains, but the mechanism g that maps the class label, the illumination θ (as well as other quantities that change across domains), and other factors E into the image X is invariant. This may resemble how humans understand different but related things. We can capture the invariant part, understand the changing part, and even generate new virtual domains according to the perceived process.
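To make the functional model (1) concrete, the following is a minimal PyTorch sketch of a G-DAN generator. The layer sizes, the embedding used for Y, and the way θ is stored as one row per domain are illustrative assumptions, not the architecture used in the paper's experiments (which adopts a DCGAN generator for images).

```python
import torch
import torch.nn as nn

class GDANGenerator(nn.Module):
    """Sketch of the functional model X = g(Y, E; theta): a single network g
    shared by all domains, plus a low-dimensional theta whose value differs
    across domains."""

    def __init__(self, n_classes, noise_dim, theta_dim, x_dim, n_domains, hidden=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.label_emb = nn.Embedding(n_classes, n_classes)       # encodes the label Y
        # One theta value per domain; equivalently theta = Theta s with s one-hot.
        self.Theta = nn.Parameter(0.1 * torch.randn(n_domains, theta_dim))
        self.g = nn.Sequential(                                    # invariant mechanism g
            nn.Linear(n_classes + noise_dim + theta_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, y, domain_idx):
        e = torch.randn(y.shape[0], self.noise_dim)                # noise E with a fixed distribution
        theta = self.Theta[domain_idx].unsqueeze(0).expand(y.shape[0], -1)
        return self.g(torch.cat([self.label_emb(y), e, theta], dim=1))

# Example: generate fake features for a batch of labels in source domain s = 0.
gen = GDANGenerator(n_classes=10, noise_dim=16, theta_dim=1, x_dim=64, n_domains=2)
x_fake = gen(torch.randint(0, 10, (32,)), domain_idx=0)           # shape (32, 64)
```

The design point is that only the rows of Theta differ across domains; everything else (g, the noise distribution, the label encoding) is shared, which is exactly what allows the learned θ values to be interpolated or re-estimated for a new domain.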
Suppose the true distributions P_{X|Y;η} across domains were generated by some function with changing (effective) parameters η; the function is fixed and η changes across domains. It is essential to see whether η is identifiable given enough source domains. Here identifiability is about whether the estimated θ̂ can capture all distribution changes, as implied by the following proposition.

Proposition 1.
Assume that η in P_{X|Y;η} is identifiable [22], i.e., P_{X|Y;η=η_1} = P_{X|Y;η=η_2} ⇒ η_1 = η_2 for all suitable η_1, η_2 ∈ R^d. Then, as n_s → ∞, if P^s_{X|Y} = Q^s_{X|Y}, the estimated θ̂ captures the variability in η in the sense that θ̂_1 = θ̂_2 ⇒ η_1 = η_2.

A complete proof of Proposition 1 can be found in Section S1 of the Supplementary Material. The above result says that different values of η correspond to different values of θ̂ once Q_{X|Y;θ} perfectly fits the true distribution. It should be noted that if one further enforces that different θ correspond to different Q_{X|Y;θ} in (1), then we have θ̂_1 = θ̂_2 ⇔ η_1 = η_2. That is, the true η can be estimated up to a one-to-one mapping.

If θ is high-dimensional, intuitively, the changes in the conditional distributions are complex; in the extreme case, the distribution could change arbitrarily across domains. To demonstrate that the number of required domains increases with the dimensionality of θ, for illustrative purposes, we consider a restrictive, parametric case where the expectation of P_{X|Y;θ} depends linearly on θ.

Proposition 2.
Assume that for all suitable θ ∈ R^d and functions g, Q_{X|Y;θ,g} satisfies E_{Q_{X|Y;θ,g}}[X] = Aθ + h(Y), where A ∈ R^{D×d}. Suppose the source-domain data were generated from Q_{X|Y;θ*,g*} implied by model (1) with true θ* and g*; the corresponding expectation is A*θ* + h*(Y). Denote by Θ* ∈ R^{(d+1)×m} the matrix whose s-th column is [θ*^(s); 1] and by A*_aug the matrix [A*, h*(Y)]. Assume that rank(Θ*) = d + 1 (a necessary condition is m ≥ d + 1) and rank(A*_aug) = d + 1. Then, if P^s_{X|Y} = Q^s_{X|Y}, the estimated θ̂ is a one-to-one mapping of θ*.

A proof of Proposition 2 can be found in Section S2 of the Supplementary Material. If the number of domains m is smaller than d + 1, the matrix Θ* and the estimated Â will be rank deficient; therefore, only part of the changes in θ can be recovered.

In practice, we often encounter a very challenging situation in which there is only one source domain. In this case, we must incorporate the target domain to learn the distribution changes. However, due to the absence of labels in the target domain, the identifiability of θ and the recovery of the target-domain joint distribution need more assumptions. One possibility is to use linear independence constraints on the modeled conditional distributions, as stated below.

Proposition 3.
Assume Y ∈ {1, ..., C}. Denote by Q_{X|Y;θ} the conditional distribution of X specified by X = g(Y, E; θ). Suppose that for any θ and θ′, the elements in the set {λ Q_{X|Y=c;θ} + λ′ Q_{X|Y=c;θ′}; c = 1, ..., C} are linearly independent for all λ and λ′ such that λ + λ′ ≠ 0. If the marginal distributions of the generated X satisfy Q_{X|θ} = Q_{X|θ′}, then Q_{X|Y;θ} = Q_{X|Y;θ′}, and the corresponding marginal distributions of Y also satisfy Q_Y = Q′_Y.

A proof is given in Section S3 of the Supplementary Material. The above conditions enable us to identify θ as well as to recover the joint distribution in the target domain in the single-source DA problem.

To estimate g and θ from empirical data, we adopt the adversarial training strategy [23, 24], which minimizes the distance between the empirical distributions in all domains and the empirical distribution of the data sampled from model (1). Because our model involves an unknown θ whose distribution is unknown, we cannot directly sample data from the model. To enable adversarial training, we reparameterize θ by a linear transformation of the one-hot representation s of the domain index s, i.e., θ = Θs, where Θ can be learned by adversarial training.

We aim to match distributions in all source domains and the target domain. In particular, in the s-th source domain, we estimate the model by matching the joint distributions P^s_{XY} and Q^s_{XY}, where Q^s_{XY} is the joint distribution of the generated features X = g(Y, E; θ^(s)) and the labels Y. Specifically, we use the Maximum Mean Discrepancy (MMD) [4, 25, 24] to measure the distance between the true and model distributions:

J^s_{kl} = ‖ E_{(X,Y)∼P^s_{XY}}[φ(X) ⊗ ψ(Y)] − E_{(X,Y)∼Q^s_{XY}}[φ(X) ⊗ ψ(Y)] ‖²_{H_x ⊗ H_y},   (2)

where H_x denotes a characteristic Reproducing Kernel Hilbert Space (RKHS) on the input feature space X associated with a kernel k(·,·): X × X → R, φ the associated mapping such that φ(x) ∈ H_x, and H_y the RKHS on the label space Y associated with a kernel l(·,·): Y × Y → R and mapping ψ. In practice, given a mini-batch of size n consisting of {(x_i^s, y_i^s)}_{i=1}^n sampled from the source-domain data and {(x̂_i^s, ŷ_i^s)}_{i=1}^n sampled from the generator, we estimate the empirical MMD for gradient evaluation and optimization:

Ĵ^s_{kl} = (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i^s, x_j^s) l(y_i^s, y_j^s) − (2/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i^s, x̂_j^s) l(y_i^s, ŷ_j^s) + (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(x̂_i^s, x̂_j^s) l(ŷ_i^s, ŷ_j^s).

Figure 2: A causal graph over Y and the features X_i. Y is the variable to be predicted, and the nodes in gray are in its Markov blanket. S denotes the domain index; a direct link from S to a variable indicates that the generating process for that variable changes across domains. Here the generating processes for Y and several of the features vary across domains.

In the target domain, since no labels are present, we propose to match the marginal distributions P^t_X and Q^t_X, where Q^t_X is the marginal distribution of the generated features X = g(Y, E; θ^t), and θ^t is the value of θ in the target domain. The loss function is M_k = ‖ E_{X∼P^t_X}[φ(X)] − E_{X∼Q^t_X}[φ(X)] ‖²_{H_x}, and its empirical version M̂_k is similar to Ĵ^s_{kl} except that the terms involving the kernel function l are absent.
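For concreteness, the sketch below implements the empirical estimators Ĵ^s_{kl} and M̂_k. The mixture of RBF kernels for k follows the implementation details given later; the specific bandwidths and the delta kernel used for the discrete label Y are illustrative assumptions rather than the exact choices of the paper.

```python
import torch

def rbf_mixture(a, b, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """k(a, b) as a sum of RBF kernels with several bandwidths (assumed values)."""
    d2 = torch.cdist(a, b) ** 2
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def delta_kernel(y1, y2):
    """l(y, y') = 1 if the labels are equal, else 0 (a common choice for discrete Y)."""
    return (y1.unsqueeze(1) == y2.unsqueeze(0)).float()

def joint_mmd2(x, y, x_fake, y_fake):
    """Empirical squared MMD between P^s_XY and Q^s_XY with the product kernel k * l."""
    k_pp = rbf_mixture(x, x) * delta_kernel(y, y)
    k_pq = rbf_mixture(x, x_fake) * delta_kernel(y, y_fake)
    k_qq = rbf_mixture(x_fake, x_fake) * delta_kernel(y_fake, y_fake)
    return k_pp.mean() - 2.0 * k_pq.mean() + k_qq.mean()

def marginal_mmd2(x, x_fake):
    """Empirical squared MMD between P^t_X and Q^t_X (the target-domain term M_k)."""
    return (rbf_mixture(x, x).mean()
            - 2.0 * rbf_mixture(x, x_fake).mean()
            + rbf_mixture(x_fake, x_fake).mean())

# Both terms are differentiable w.r.t. the generated samples, so the generator g and
# the domain parameters Theta can be trained with gradient-based optimization.
```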
Finally, the function g and the changing parameters (latent variables) Θ can be estimated by min_{g,Θ} J^s_{kl} + αM_k, where α is a hyperparameter that balances the losses on the source and target domains. In the experiments, we set α to 1.

The proposed G-DAN models the class-conditional distribution P^s_{X|Y} for all features X jointly. As the dimensionality of X increases, the estimation of joint distributions requires more data in each domain to achieve a given accuracy. In addition, more domains are needed to learn the distribution changes. To circumvent this issue, we propose to factorize the joint distribution of the features and the label according to the causal structure. Each term in the factorization aims at modeling the conditional distribution of one variable (or a variable group) given its direct causes, which involves a smaller number of variables [12]. Accordingly, the latent variables θ can be decomposed into disjoint, unrelated subsets, each of which has a lower dimensionality.

We model the distribution of Y and the relevant features following a causal graphical representation. We assume that the causal structure is fixed, but the generating process, in particular the functions or parameters, can change across domains. According to the modularity property of a causal system, the changes in the factors of the factorization of the joint distribution are independent of each other [20, 12]. We can then enjoy a "divide-and-conquer" strategy to deal with those factors separately.

Figure 3: CG-DAN with modules Y → X1 and (Y, X1) → X2. Each module g_i is represented by an NN with its own noise E_i and latent variables θ_i.

Fig. 2 shows a Directed Acyclic Graph (DAG) over Y and the features X_i. According to the Markov factorization [12], the distribution over Y and the variables in its Markov blanket (nodes in gray) can be factorized according to the DAG as P_{XY} = P_Y ∏_i P_{X_i|PA_i}, where PA_i denotes the direct causes (parents) of X_i in the DAG. For the purpose of DA, it suffices to consider only the Markov blanket MB(Y) of Y, since Y is independent of the remaining variables given its Markov blanket. By considering only the Markov blanket, we can reduce the complexity of the model distribution without hurting the prediction performance.

We further adopt a functional causal model (FCM) [12] to specify such conditional distributions. The FCM (a tuple ⟨F, P⟩) consists of a set of equations F = (F_1, ..., F_D):

F_i: X_i = g_i(PA_i, E_i; θ_i),  i = 1, ..., D,   (3)

and a probability distribution P over E = (E_1, ..., E_D). PA_i denotes the direct causes of X_i, and E_i represents noise due to unobserved factors. Modularity implies that the parameters in different FCMs are unrelated, so the latent variables θ can be decomposed into disjoint subsets θ_i, each of which only captures the changes in the corresponding conditional distribution P_{X_i|PA_i}. g_i encodes the invariant part of P_{X_i|PA_i} and is shared by all domains. The E_i are required to be jointly independent. Without loss of generality, we assume that all the noises E_i follow Gaussian distributions. (A nonlinear function of E_i, as part of g_i, will change the distribution of the noise.)

Based on (3), we employ a neural network to model each g_i separately, and construct a constrained generative model according to the causal DAG, which we call a Causal Generative Domain Adaptation Network (CG-DAN). Fig. 3 gives an example network constructed on Y, X1, and X2 in Fig. 2 (for simplicity, the other features are ignored).
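As an illustration of how the modular structure in Fig. 3 can be realized, the sketch below chains two small generator modules, one for Y → X1 and one for (Y, X1) → X2. The dimensions, layer sizes, noise dimensions, and the one-hot encoding of Y as the parent input are assumed values for illustration only.

```python
import torch
import torch.nn as nn

class CausalModule(nn.Module):
    """One FCM module X_i = g_i(PA_i, E_i; theta_i) with its own noise and theta_i."""
    def __init__(self, pa_dim, noise_dim, theta_dim, out_dim, n_domains, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.Theta = nn.Parameter(0.1 * torch.randn(n_domains, theta_dim))  # module-specific theta_i
        self.g = nn.Sequential(nn.Linear(pa_dim + noise_dim + theta_dim, hidden),
                               nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, parents, domain_idx):
        e = torch.randn(parents.shape[0], self.noise_dim)                    # independent Gaussian noise E_i
        theta = self.Theta[domain_idx].unsqueeze(0).expand(parents.shape[0], -1)
        return self.g(torch.cat([parents, e, theta], dim=1))

# CG-DAN for the modules Y -> X1 and (Y, X1) -> X2 (dimensions are illustrative).
n_classes, x1_dim, x2_dim, n_domains = 10, 8, 8, 2
m1 = CausalModule(pa_dim=n_classes, noise_dim=4, theta_dim=1, out_dim=x1_dim, n_domains=n_domains)
m2 = CausalModule(pa_dim=n_classes + x1_dim, noise_dim=4, theta_dim=1, out_dim=x2_dim, n_domains=n_domains)

y = torch.randint(0, n_classes, (32,))
y_onehot = torch.nn.functional.one_hot(y, n_classes).float()
x1 = m1(y_onehot, domain_idx=0)                          # sample X1 given its parent Y
x2 = m2(torch.cat([y_onehot, x1], dim=1), domain_idx=0)  # sample X2 given its parents (Y, X1)
```

Because each module carries its own θ_i, the modules can be trained and re-combined independently, which is the "divide-and-conquer" advantage discussed above.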
Because of the causal factorization, we can learn each conditional distribution Q_{X_i|PA_i;θ_i} separately by the adversarial learning procedure described in Section 3.2. To see this, let us start by using the Kullback-Leibler (KL) divergence to measure the distribution distance; the function to be minimized can be written as

L(P^s_{XY} ‖ Q^s_{XY|θ}) = E_{P^s_{XY}}[ log ( P^s_{Y,X_1,...,X_D} / Q^s_{Y,X_1,...,X_D|θ} ) ]
 = E_{P^s_{XY}}[ log ( P^s_Y ∏_{i=1}^D P^s_{X_i|PA_i} ) / ( Q^s_Y ∏_{i=1}^D Q^s_{X_i|PA_i,θ_i} ) ]
 = Σ_{i=1}^D KL( P^s_{X_i|PA_i} ‖ Q^s_{X_i|PA_i,θ_i} ),

where the last equality holds when Q^s_Y = P^s_Y, and each term corresponds to a conditional distribution of the relevant variable given all its parents. It can be seen that the objective function is the sum of the modeling qualities of the individual modules. For computational convenience, we can use the MMD (2) in place of the KL divergence.

Remark
It is worthwhile to emphasize the advantages of using the causal representation. On the one hand, by decomposing the latent variables into unrelated, separate sets θ_i, each of which corresponds to a causal module [12], it is easier to interpret the parameters and find their valid regions corresponding to reasonable data. On the other hand, even if we just use the parameter values learned from observed data, we can easily come up with new data by making use of their combinations. For instance, suppose we have (θ_1, θ_2) and (θ′_1, θ′_2) learned from two domains. Then (θ_1, θ′_2) and (θ′_1, θ_2) also correspond to valid causal processes because of the modularity property of causal processes.

With a single source domain, we adopt a widely used causal discovery algorithm, PC [20], to learn the causal graph up to its Markov equivalence class. (Note that the target domain does not have Y values, but we aim to find the causal structure involving Y, so in the causal discovery step we do not use the target domain.) We add the constraint that Y is a root cause of the X_i to identify the causal structure on a single source domain.

If there are multiple source domains, we modify the CD-NOD method [26], which extends the original PC algorithm to the case with distribution shifts. In particular, the domain index S is added as an additional variable into the causal system to capture the heterogeneity of the underlying causal model. CD-NOD allows us to recover the underlying causal graph robustly and to detect the changing causal modules. By doing so, we only need to learn θ_i for the changing modules.

Since causal discovery methods are not guaranteed to orient all edges, we propose a simple solution to build a network on top of the produced Partially Directed Acyclic Graph (PDAG), so that we can construct the network structure for CG-DAN. We detect indirectly connected variable groups, each of which contains a set of nodes connected by undirected edges, and then form a "DAG" whose nodes represent individual variables or variable groups. We can then apply the network construction procedure described in Sec. 4.2 to this "DAG", as sketched below.
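The grouping step only requires a small amount of graph bookkeeping. The sketch below (with hypothetical variable names) contracts the undirected components of a PDAG into group nodes and lifts the directed edges to the group level, yielding the "DAG" on which the CG-DAN modules can be built.

```python
from collections import defaultdict

def group_pdag(nodes, undirected_edges, directed_edges):
    """Sketch: contract nodes connected by undirected edges into groups, then
    lift the directed edges to edges between groups (a 'DAG' over variable groups)."""
    # 1. Connected components of the undirected subgraph (each becomes one group).
    adj = defaultdict(set)
    for a, b in undirected_edges:
        adj[a].add(b)
        adj[b].add(a)
    group_of, groups = {}, []
    for v in nodes:
        if v in group_of:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        gid = len(groups)
        groups.append(sorted(comp))
        for u in comp:
            group_of[u] = gid
    # 2. Lift directed edges to the group level, dropping within-group edges.
    group_edges = {(group_of[a], group_of[b])
                   for a, b in directed_edges if group_of[a] != group_of[b]}
    return groups, sorted(group_edges)

# Example: Y -> X1, Y -> X2 oriented, X1 - X2 left undirected by the discovery step.
groups, edges = group_pdag(
    nodes=["Y", "X1", "X2"],
    undirected_edges=[("X1", "X2")],
    directed_edges=[("Y", "X1"), ("Y", "X2")],
)
# groups == [['Y'], ['X1', 'X2']] and edges == [(0, 1)]: one module generates the
# group (X1, X2) jointly given its parent group containing Y.
```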
Implementation Details
We implement all the models using PyTorch and train them with the RMSProp optimizer [30]. On the two image datasets, we adopt the DCGAN network architecture [31], a widely used convolutional structure for generating images, and use the DCGAN generator for the g function in our G-DAN model. As done in [32], we add a discriminator network f to transform the images into high-level features and compare the distributions of the learned features using MMD. We use a mixture of RBF kernels k(x, x′) = Σ_{q=1}^K k_{σ_q}(x, x′) in MMD. On the image datasets, we fix K = 5 and use a fixed set of bandwidths σ_q; on the WiFi dataset, we fix K = 5 and set the σ_q to a fixed set of multiples of the median of the pairwise distances between all source examples.

We first conduct simulation studies on the MNIST dataset to demonstrate how distribution changes can be identified from multiple source domains. MNIST is a handwritten digit dataset with ten classes (0-9), 60,000 training images, and 10,000 test images. We rotate the images by different angles and construct corresponding domains, denoting a domain with images rotated by angle γ as D_γ. The dimensionality of θ in G-DAN is set to a fixed small value.

Since g is generally nonlinear w.r.t. θ, one cannot expect full identification of θ from only two domains. However, we might be able to identify θ from two nearby domains if g is approximately linear w.r.t. θ between them. To verify this, we conduct experiments on two synthetic datasets: one contains two source domains whose rotation angles are close, and the other contains two source domains whose rotation angles differ substantially. We train G-DAN to obtain ĝ and θ̂ in each domain.

To investigate whether the model is able to learn meaningful rotational changes, we sample θ values uniformly from [θ̂_s, θ̂_t] to generate new domains, where θ̂_s and θ̂_t are the learned domain-specific parameters in the two given domains. As shown in Figure 4, on the dataset with nearby source domains, our model successfully generates a new domain in between. However, on the dataset with distant source domains, although our model fits the two source domains well, the generated new domain does not correspond to a domain of rotated images, indicating that the two domains are too different for G-DAN to identify meaningful distribution changes.
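The new-domain generation used here amounts to interpolating between the learned per-domain values of θ and pushing samples through the shared generator. A minimal sketch, reusing the hypothetical GDANGenerator class from the earlier sketch (not the paper's actual code), is shown below.

```python
import torch

def generate_interpolated_domains(gen, y, n_steps=5):
    """Sample images/features from virtual domains whose theta lies on the segment
    between the two learned domain parameters (assumed to be rows 0 and 1 of gen.Theta)."""
    theta_s, theta_t = gen.Theta[0].detach(), gen.Theta[1].detach()
    samples = []
    for w in torch.linspace(0.0, 1.0, n_steps):
        theta = (1 - w) * theta_s + w * theta_t            # a point in [theta_s, theta_t]
        e = torch.randn(y.shape[0], gen.noise_dim)         # fresh realization of the noise E
        x = gen.g(torch.cat([gen.label_emb(y), e,
                             theta.unsqueeze(0).expand(y.shape[0], -1)], dim=1))
        samples.append(x)
    return samples
```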
Figure 4: Generated images from our model. The first row shows the generated images in the two nearby source domains and the new domain in between; the second row shows the generated images in the two distant source domains and the new domain. An animated illustration is provided in the Supplementary Material.

Here we use the MNIST-USPS dataset to demonstrate whether G-DAN can successfully learn distribution changes and generate new domains. We also test the classification accuracy in the target domain. USPS is another handwritten digit dataset with ten classes (0-9), 7,291 training images, and 2,007 test images. There exists a slight scale change between the two domains. Following CoGAN [33, 34], we use the standard training-test splits for both MNIST and USPS. We compare our method with CORAL [35], DAN [16], DANN [9], DSN [34], and CoGAN [33]. We adopt the discriminator in CoGAN for classification by training on the labeled images generated from our G-DAN model. The quantitative results are shown in Table 1. It can be seen that our method achieves slightly better performance than CoGAN and outperforms the other methods.

Table 1: Comparison of different methods on MNIST-USPS (classification accuracy, %).

CORAL   DAN    DANN   DSN    CoGAN   G-DAN
81.7    81.1   91.3   91.2   95.7

In addition, we provide qualitative results to demonstrate our model's ability to generate new domains. As shown in Figure 5, we generate a new domain in the middle of MNIST and USPS. The images in the new domain have a slightly larger scale than those in MNIST and a slightly smaller scale than those in USPS, indicating that our model understands how the distribution changes. Although the slight scale change is not easily distinguishable by human eyes, it causes a performance degradation in terms of classification accuracy. The proposed G-DAN successfully recovers the joint distribution in the target domain and enables accurate prediction in the target domain.
Figure 5: Generated images in the source (MNIST), new, and target (USPS) domains. An animated illustration is provided in the Supplementary Material.
We then perform evaluations on the cross-domain indoor WiFi localization dataset [29] to demonstrate the advantage of incorporating causal structures. The WiFi data were collected from a building hallway area, which was discretized into a space of grid points. At each grid point, the strength of the WiFi signals received from D access points was collected. We aim to predict the location of the device from the D-dimensional WiFi signals, which is cast as a regression problem. The dataset contains two domain adaptation tasks: 1) transfer across time periods and 2) transfer across devices. In the first task, the WiFi data were collected by the same device during three different time periods t1, t2, and t3 in the same hallway; three subtasks, t1 → t2, t1 → t3, and t2 → t3, are used for performance evaluation. In the second task, the WiFi data were collected by different devices, causing a distribution shift of the received signals. Here we evaluate the methods on three datasets, hallway1, hallway2, and hallway3, each of which contains data collected by two different devices.

Table 2: Comparison of different methods on the WiFi dataset. The top two performing methods are marked in bold. Columns: KRR, TCA, SuK, DIP, CTC, G-DAN, CG-DAN; rows: t1 → t2, t1 → t3, t2 → t3, hallway1, hallway2, hallway3 (CG-DAN is only applied to the time-transfer tasks).

For both tasks, we implement our G-DAN by using an MLP with one hidden layer for the generator g, and set the dimensions of the input noise E and of θ to fixed small values. In the time-transfer task, because the causal structure is stable across domains, we also apply the proposed CG-DAN constructed according to the causal structure learned from the source domains. Section S4 of the Supplementary Material shows the causal graph and the detected changing modules obtained by the CD-NOD method on the t1 and t2 datasets. We use an MLP with one hidden layer to model each g_i; the dimensions of E_i and θ_i are fixed for all the modules.

We also compare with KMM, surrogate kernels (SuK) [29], TCA [7], DIP [8], and CTC [11], and follow the evaluation procedures in [29]. The performance of the different methods is shown in Table 2. The reported accuracy is the percentage of examples for which the predicted location is within 3 or 6 meters of the true location, for the time-transfer and device-transfer tasks, respectively. It can be seen that CG-DAN outperforms G-DAN and the previous methods in the time-transfer task, demonstrating the benefits of incorporating causal structures in generative modeling for domain transfer.

We have shown how generative models formulated in particular ways, together with the causal graph underlying the class label Y and the relevant features X, can improve domain adaptation in a flexible, nonparametric way. This illustrates some potential advantages of leveraging both data-generating processes and flexible representations such as neural networks. To this end, we first proposed a generative domain adaptation network which is able to understand distribution changes and generate new domains. The proposed generative model also demonstrates promising performance in single-source domain adaptation. We then showed that by incorporating reasonable causal structure into the model and making use of modularity, one can benefit from a reduction of model complexity and accordingly improve the transfer efficiency. In future work we will study the effect of changing class priors across domains and how to quantify the level of "transferability" with the proposed methods.

References

[1] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[2] J. Jiang. A literature survey on domain adaptation of statistical classifiers, 2008.
[3] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227-244, 2000.
[4] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS 19, pages 601-608, 2007.
[5] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699-746, 2008.
[6] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pages 38-53. Springer, 2008.
[7] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22:199-210, 2011.
[8] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 769-776, Dec 2013.
[9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 17(1):2096-2030, 2016.
[10] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML, 2013.
[11] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In ICML, volume 48, pages 2839-2848, 2016.
[12] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.
[13] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In NIPS 23, 2010.
[14] Y. Yu and C. Szepesvári. Analysis of kernel mean matching under covariate shift. In ICML, pages 607-614, 2012.
[15] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In NIPS, pages 3733-3742, 2017.
[16] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In David Blei and Francis Bach, editors, ICML, pages 97-105. JMLR Workshop and Conference Proceedings, 2015.
[17] A. Storkey. When training and test sets are different: Characterizing learning transfer. In J. Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence, editors, Dataset Shift in Machine Learning, pages 3-28. MIT Press, 2009.
[18] A. Iyer, A. Nath, and S. Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In ICML, 2014.
[19] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208-2217, 2017.
[20] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.
[21] Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, David Lopez-Paz, Isabelle Guyon, Michele Sebag, Aris Tritas, and Paola Tubaro. Learning functional causal models with generative neural networks. arXiv preprint arXiv:1709.05321, 2017.
[22] Paul G. Hoel. Introduction to Mathematical Statistics. 2nd edition, 1954.
[23] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672-2680, 2014.
[24] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, pages 1718-1727, 2015.
[25] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions. IEEE Signal Processing Magazine, 30:98-111, 2013.
[26] Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI, volume 2017, page 1347, 2017.
[27] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[28] John S. Denker, W. R. Gardner, Hans Peter Graf, Donnie Henderson, R. E. Howard, W. Hubbard, Lawrence D. Jackel, Henry S. Baird, and Isabelle Guyon. Neural network recognizer for hand-written zip code digits. In NIPS, pages 323-331, 1989.
[29] Kai Zhang, V. Zheng, Q. Wang, J. Kwok, Q. Yang, and I. Marsic. Covariate shift in Hilbert space: A solution via surrogate kernels. In ICML, pages 388-395, 2013.
[30] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31, 2012.
[31] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[32] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584, 2017.
[33] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, pages 469-477, 2016.
[34] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In NIPS, pages 343-351, 2016.
[35] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV, pages 443-450. Springer, 2016.

Supplement to "Causal Generative Domain Adaptation Networks"
This supplementary material provides the proofs and some details which are omitted in the submitted paper. The equation numbers in this material are consistent with those in the paper.
S1. Proof of Proposition 1
Proof. θ̂_1 = θ̂_2 implies that Q_{X|Y;θ̂=θ̂_1} = Q_{X|Y;θ̂=θ̂_2}. Matching P^s_{X|Y} and Q^s_{X|Y} results in P_{X|Y;η} = Q_{X|Y;θ̂}. Thus, we have P_{X|Y;η=η_1} = P_{X|Y;η=η_2}. Due to the identifiability of η in P_{X|Y;η}, we further have η_1 = η_2.

S2. Proof of Proposition 2
Proof.
After conditional-distribution matching on all domains, we have Q_{X|Y;θ*=θ*^(s),g*} = Q_{X|Y;θ̂=θ̂^(s),ĝ}, and thus E_{Q_{X|Y;θ*=θ*^(s),g*}}[X] = E_{Q_{X|Y;θ̂=θ̂^(s),ĝ}}[X] for s = 1, ..., m. Moreover, because E_{Q_{X|Y;θ,g}}[X] = Aθ + h(Y), for any y ∈ Y we have

A*θ*^(s) + h*(y) = Âθ̂^(s) + ĥ(y),  for s = 1, ..., m,   (4)

which can be written in matrix form:

[A*, h*(y)] [θ*^(1), ..., θ*^(m); 1, ..., 1] = [Â, ĥ(y)] [θ̂^(1), ..., θ̂^(m); 1, ..., 1].   (5)

Let A*_aug = [A*, h*(y)] ∈ R^{D×(d+1)}, Â_aug = [Â, ĥ(y)] ∈ R^{D×(d+1)}, Θ* = [θ*^(1), ..., θ*^(m); 1, ..., 1], and Θ̂ = [θ̂^(1), ..., θ̂^(m); 1, ..., 1]. According to the assumptions that rank(A*_aug) = d + 1 and rank(Θ*) = d + 1, Â_aug and Θ̂ should also have rank d + 1. Therefore, we have θ* = A*_aug^† Â_aug θ̂ and θ̂ = Â_aug^† A*_aug θ*, indicating that θ* is a one-to-one mapping of θ̂.

S3. Proof of Proposition 3
Proof.
According to the sum rule, we have

P_{X|θ} = Σ_{c=1}^C P_{X|Y=c,θ} P_{Y=c},  P_{X|θ′} = Σ_{c=1}^C P_{X|Y=c,θ′} P′_{Y=c}.   (6)

Since P_{X|θ} = P_{X|θ′}, we have Σ_{c=1}^C P_{X|Y=c,θ} P_{Y=c} = Σ_{c=1}^C P_{X|Y=c,θ′} P′_{Y=c}. Also, because the linear independence assumption in Proposition 3 holds, we have, for each c,

P_{X|Y=c,θ} P_{Y=c} − P_{X|Y=c,θ′} P′_{Y=c} = 0.   (7)

Taking the integral of (7) leads to P_Y = P′_Y, which further implies that P_{X|Y,θ} = P_{X|Y,θ′}.

S4. Causal Structure on WiFi Data
Figure 6:
The causal structure learned by CD-NOD on the WiFi t1 and t2 datasets. Pink nodes denote the changing modules, and green ones denote the constant modules whose conditional distributions do not change across domains.