Deep Transfer Learning with Joint Adaptation Networks
Mingsheng Long, Han Zhu, Jianmin Wang, Michael I. Jordan

Abstract
Deep networks have been successfully applied to learn transferable features for adapting models from a source domain to a different target domain. In this paper, we present joint adaptation networks (JAN), which learn a transfer network by aligning the joint distributions of multiple domain-specific layers across domains based on a joint maximum mean discrepancy (JMMD) criterion. An adversarial training strategy is adopted to maximize JMMD such that the distributions of the source and target domains are made more distinguishable. Learning can be performed by stochastic gradient descent with the gradients computed by back-propagation in linear time. Experiments testify that our model yields state-of-the-art results on standard datasets.
1. Introduction
Deep networks have significantly improved the state of the art for diverse machine learning problems and applications. Unfortunately, the impressive performance gains come only when massive amounts of labeled data are available for supervised learning. Since manual labeling of sufficient training data for diverse application domains on-the-fly is often prohibitive, for a target task short of labeled data, there is strong motivation to build effective learners that can leverage rich labeled data from a different source domain. However, this learning paradigm suffers from the shift in data distributions across different domains, which poses a major obstacle in adapting predictive models for the target task (Quionero-Candela et al., 2009; Pan & Yang, 2010). Learning a discriminative model in the presence of the shift between training and test distributions is known as transfer learning or domain adaptation (Pan & Yang, 2010).

Key Lab for Information System Security, MOE; Tsinghua National Lab for Information Science and Technology (TNList); NEL-BDS; School of Software, Tsinghua University, Beijing 100084, China; University of California, Berkeley, Berkeley 94720. Correspondence to: Mingsheng Long.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Previous shallow transfer learning methods bridge the source and target domains by learning invariant feature representations or estimating instance importance without using target labels (Huang et al., 2006; Pan et al., 2011; Gong et al., 2013). Recent deep transfer learning methods leverage deep networks to learn more transferable representations by embedding domain adaptation in the pipeline of deep learning, which can simultaneously disentangle the explanatory factors of variations behind data and match the marginal distributions across domains (Tzeng et al., 2014; 2015; Long et al., 2015; 2016; Ganin & Lempitsky, 2015; Bousmalis et al., 2016).

Transfer learning becomes more challenging when domains may change by the joint distributions of input features and output labels, which is a common scenario in practical applications. First, deep networks generally learn the complex function from input features to output labels via multilayer feature transformation and abstraction. Second, deep features in standard CNNs eventually transition from general to specific along the network, and the transferability of features and classifiers decreases when the cross-domain discrepancy increases (Yosinski et al., 2014). Consequently, after feed-forwarding the source and target domain data through deep networks for multilayer feature abstraction, the shifts in the joint distributions of input features and output labels still linger in the network activations of multiple domain-specific higher layers. Thus we can use the joint distributions of the activations in these domain-specific layers to approximately reason about the original joint distributions, which should be matched across domains to enable domain adaptation. To date, this problem has not been addressed in deep networks.

In this paper, we present Joint Adaptation Networks (JAN) to align the joint distributions of multiple domain-specific layers across domains for unsupervised domain adaptation. JAN largely extends the ability of deep adaptation networks (Long et al., 2015) to reason about the joint distributions as mentioned above, while keeping the training procedure even simpler. Specifically, JAN admits a simple transfer pipeline, which processes the source and target domain data by convolutional neural networks (CNN) and then aligns the joint distributions of activations in multiple task-specific layers. To learn parameters and enable alignment, we derive the joint maximum mean discrepancy (JMMD), which measures the Hilbert-Schmidt norm between the kernel mean embeddings of the empirical joint distributions of the source and target data. Thanks to a linear-time unbiased estimate of JMMD, we can easily draw a mini-batch of samples to estimate the JMMD criterion and implement it efficiently via back-propagation. We further maximize JMMD using an adversarial training strategy such that the distributions of the source and target domains are made more distinguishable. Empirical study shows that our models yield state-of-the-art results on standard datasets.
2. Related Work
Transfer learning (Pan & Yang, 2010) aims to build learning machines that generalize across different domains following different probability distributions (Sugiyama et al., 2008; Pan et al., 2011; Duan et al., 2012; Gong et al., 2013; Zhang et al., 2013). Transfer learning finds wide applications in computer vision (Saenko et al., 2010; Gopalan et al., 2011; Gong et al., 2012; Hoffman et al., 2014) and natural language processing (Collobert et al., 2011; Glorot et al., 2011). The main technical problem of transfer learning is how to reduce the shifts in data distributions across domains. Most existing methods learn a shallow representation model by which domain discrepancy is minimized; however, such shallow models cannot suppress domain-specific explanatory factors of variations.

Deep networks learn abstract representations that disentangle the explanatory factors of variations behind data (Bengio et al., 2013) and extract transferable factors underlying different populations (Glorot et al., 2011; Oquab et al., 2013), which can only reduce, but not remove, the cross-domain discrepancy (Yosinski et al., 2014). Recent work on deep domain adaptation embeds domain-adaptation modules into deep networks to boost transfer performance (Tzeng et al., 2014; 2015; 2017; Ganin & Lempitsky, 2015; Long et al., 2015; 2016). These methods mainly correct the shifts in marginal distributions, assuming that conditional distributions remain unchanged after the marginal distribution adaptation.

Transfer learning becomes more challenging when domains may change by the joint distributions $P(X, Y)$ of input features $X$ and output labels $Y$. The distribution shifts may stem from the marginal distributions $P(X)$ (a.k.a. covariate shift (Huang et al., 2006; Sugiyama et al., 2008)), the conditional distributions $P(Y|X)$ (a.k.a. conditional shift (Zhang et al., 2013)), or both (a.k.a. dataset shift (Quionero-Candela et al., 2009)). Another line of work (Zhang et al., 2013; Wang & Schneider, 2014) corrects both target and conditional shifts based on the theory of kernel embedding of conditional distributions (Song et al., 2009; 2010; Sriperumbudur et al., 2010). Since the target labels are unavailable, adaptation is performed by minimizing the discrepancy between marginal distributions instead of conditional distributions. In general, the presence of conditional shift leads to an ill-posed problem, and an additional assumption that the conditional distribution may only change under location-scale transformations on $X$ is commonly imposed to make the problem tractable (Zhang et al., 2013). As it is not easy to justify which components of the joint distribution are changing in practice, our work is transparent to diverse scenarios by directly manipulating the joint distribution without assumptions on the marginal and conditional distributions. Furthermore, it remains unclear how to account for the shift in joint distributions within the regime of deep architectures.
3. Preliminary
We begin by providing an overview of Hilbert space embeddings of distributions, where each distribution is represented by an element in a reproducing kernel Hilbert space (RKHS). Denote by $X$ a random variable with domain $\Omega$ and distribution $P(X)$, and by $\mathbf{x}$ the instantiations of $X$. A reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ on $\Omega$ endowed with a kernel $k(\mathbf{x}, \mathbf{x}')$ is a Hilbert space of functions $f: \Omega \mapsto \mathbb{R}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Its element $k(\mathbf{x}, \cdot)$ satisfies the reproducing property: $\langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = f(\mathbf{x})$. Alternatively, $k(\mathbf{x}, \cdot)$ can be viewed as an (infinite-dimensional) implicit feature map $\phi(\mathbf{x})$ where $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$. Kernel functions can be defined on vector spaces, graphs, time series and structured objects to handle diverse applications.

The kernel embedding represents a probability distribution $P$ by an element in the RKHS endowed by a kernel $k$ (Smola et al., 2007; Sriperumbudur et al., 2010; Gretton et al., 2012):

$$\mu_X(P) \triangleq \mathbb{E}_X[\phi(X)] = \int_{\Omega} \phi(\mathbf{x}) \, dP(\mathbf{x}), \quad (1)$$

where the distribution is mapped to the expected feature map, i.e. to a point in the RKHS, given that $\mathbb{E}_X[k(\mathbf{x}, \mathbf{x}')] < \infty$. The mean embedding $\mu_X$ has the property that the expectation of any RKHS function $f$ can be evaluated as an inner product in $\mathcal{H}$: $\langle \mu_X, f \rangle_{\mathcal{H}} \triangleq \mathbb{E}_X[f(X)]$, $\forall f \in \mathcal{H}$. This kind of kernel mean embedding provides us a nonparametric perspective on manipulating distributions by drawing samples from them. We will require a characteristic kernel $k$ such that the kernel embedding $\mu_X(P)$ is injective, and such that the embedding of distributions into infinite-dimensional feature spaces preserves all of the statistical features of arbitrary distributions, which removes the necessity of density estimation of $P$. This technique has been widely applied in many tasks, including feature extraction, density estimation and two-sample testing (Smola et al., 2007; Gretton et al., 2012).

While the true distribution $P(X)$ is rarely accessible, we can estimate its embedding using a finite sample (Gretton et al., 2012). Given a sample $\mathcal{D}_X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of size $n$ drawn i.i.d. from $P(X)$, the empirical kernel embedding is

$$\widehat{\mu}_X = \frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i). \quad (2)$$

This empirical estimate converges to its population counterpart in RKHS norm $\|\mu_X - \widehat{\mu}_X\|_{\mathcal{H}}$ with a rate of $O(n^{-1/2})$.
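To make Eqs. (1)–(2) concrete, here is a minimal NumPy sketch (not from the paper) that forms the empirical mean embedding under an explicit, finite-dimensional feature map of a polynomial kernel and checks the reproducing property on the sample; the kernel choice and feature map are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)                    # sample D_X drawn from P(X)

def phi(x):
    # explicit feature map of the polynomial kernel k(x, x') = (1 + x x')^2
    return np.stack([np.ones_like(x), np.sqrt(2.0) * x, x**2], axis=-1)

mu_hat = phi(x).mean(axis=0)                # empirical mean embedding, Eq. (2)

# Reproducing property at the sample level: <mu_hat, phi(x')> equals the
# sample average of k(x_i, x'), i.e. the empirical estimate of E_X[k(X, x')].
x_new = 0.3
lhs = mu_hat @ phi(np.array([x_new]))[0]
rhs = np.mean((1.0 + x * x_new) ** 2)
print(np.allclose(lhs, rhs))                # True
```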
Kernel embeddings can be readily generalized to joint distributions of two or more variables using tensor product feature spaces (Song et al., 2009; 2010; Song & Dai, 2013). A joint distribution $P$ of variables $X^1, \ldots, X^m$ can be embedded into an $m$-th order tensor product feature space $\otimes_{\ell=1}^{m} \mathcal{H}^{\ell}$ by

$$\mathcal{C}_{X^{1:m}}(P) \triangleq \mathbb{E}_{X^{1:m}}\!\left[\otimes_{\ell=1}^{m} \phi^{\ell}(X^{\ell})\right] = \int_{\times_{\ell=1}^{m} \Omega_{\ell}} \left(\otimes_{\ell=1}^{m} \phi^{\ell}(\mathbf{x}^{\ell})\right) dP(\mathbf{x}^{1}, \ldots, \mathbf{x}^{m}), \quad (3)$$

where $X^{1:m}$ denotes the set of $m$ variables $\{X^1, \ldots, X^m\}$ on domain $\times_{\ell=1}^{m} \Omega_{\ell} = \Omega_1 \times \ldots \times \Omega_m$, $\phi^{\ell}$ is the feature map endowed with kernel $k^{\ell}$ in RKHS $\mathcal{H}^{\ell}$ for variable $X^{\ell}$, and $\otimes_{\ell=1}^{m} \phi^{\ell}(\mathbf{x}^{\ell}) = \phi^{1}(\mathbf{x}^{1}) \otimes \ldots \otimes \phi^{m}(\mathbf{x}^{m})$ is the feature map in the tensor product Hilbert space, where the inner product satisfies $\langle \otimes_{\ell=1}^{m} \phi^{\ell}(\mathbf{x}^{\ell}), \otimes_{\ell=1}^{m} \phi^{\ell}(\mathbf{x}'^{\ell}) \rangle = \prod_{\ell=1}^{m} k^{\ell}(\mathbf{x}^{\ell}, \mathbf{x}'^{\ell})$. The joint embedding can be viewed as an uncentered cross-covariance operator $\mathcal{C}_{X^{1:m}}$ by the standard equivalence between a tensor and a linear map (Song et al., 2010). That is, given a set of functions $f^1, \ldots, f^m$, their covariance can be computed by $\mathbb{E}_{X^{1:m}}\!\left[\prod_{\ell=1}^{m} f^{\ell}(X^{\ell})\right] = \langle \otimes_{\ell=1}^{m} f^{\ell}, \mathcal{C}_{X^{1:m}} \rangle$.

When the true distribution $P(X^1, \ldots, X^m)$ is unknown, we can estimate its embedding using a finite sample (Song et al., 2013). Given a sample $\mathcal{D}_{X^{1:m}} = \{\mathbf{x}^{1:m}_1, \ldots, \mathbf{x}^{1:m}_n\}$ of size $n$ drawn i.i.d. from $P(X^1, \ldots, X^m)$, the empirical joint embedding (the cross-covariance operator) is estimated as

$$\widehat{\mathcal{C}}_{X^{1:m}} = \frac{1}{n} \sum_{i=1}^{n} \otimes_{\ell=1}^{m} \phi^{\ell}(\mathbf{x}^{\ell}_{i}). \quad (4)$$

This empirical estimate converges to its population counterpart with a convergence rate similar to that of the marginal embedding.

Let $\mathcal{D}_{X^s} = \{\mathbf{x}^s_1, \ldots, \mathbf{x}^s_{n_s}\}$ and $\mathcal{D}_{X^t} = \{\mathbf{x}^t_1, \ldots, \mathbf{x}^t_{n_t}\}$ be the sets of samples from distributions $P(X^s)$ and $Q(X^t)$, respectively. Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) is a kernel two-sample test which rejects or accepts the null hypothesis $P = Q$ based on the observed samples. The basic idea behind MMD is that if the generating distributions are identical, all the statistics are the same. Formally, MMD defines the following difference measure:

$$D_{\mathcal{H}}(P, Q) \triangleq \sup_{f \in \mathcal{H}} \left( \mathbb{E}_{X^s}[f(X^s)] - \mathbb{E}_{X^t}[f(X^t)] \right), \quad (5)$$

where $\mathcal{H}$ is a class of functions. It is shown that the class of functions in a universal RKHS $\mathcal{H}$ is rich enough to distinguish any two distributions, and MMD is expressed as the distance between their mean embeddings: $D_{\mathcal{H}}(P, Q) = \|\mu_{X^s}(P) - \mu_{X^t}(Q)\|_{\mathcal{H}}$. The main theoretical result is that $P = Q$ if and only if $D_{\mathcal{H}}(P, Q) = 0$ (Gretton et al., 2012). In practice, an estimate of the MMD compares the squared distance between the empirical kernel mean embeddings as

$$\widehat{D}_{\mathcal{H}}(P, Q) = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} k(\mathbf{x}^s_i, \mathbf{x}^s_j) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k(\mathbf{x}^t_i, \mathbf{x}^t_j) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} k(\mathbf{x}^s_i, \mathbf{x}^t_j), \quad (6)$$

where $\widehat{D}_{\mathcal{H}}(P, Q)$ is an unbiased estimator of $D_{\mathcal{H}}(P, Q)$.
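As an illustration of Eq. (6) (not code from the paper), the following minimal NumPy sketch computes the empirical MMD between two samples; the Gaussian kernel and fixed bandwidth are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(xs, xt, sigma=1.0):
    """Empirical squared MMD of Eq. (6) between samples xs ~ P and xt ~ Q."""
    return (gaussian_kernel(xs, xs, sigma).mean()
            + gaussian_kernel(xt, xt, sigma).mean()
            - 2 * gaussian_kernel(xs, xt, sigma).mean())

# toy usage: two Gaussians whose means differ by 0.5 in every dimension
xs = np.random.randn(100, 16)
xt = np.random.randn(120, 16) + 0.5
print(mmd2(xs, xt))
```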
4. Joint Adaptation Networks
In unsupervised domain adaptation, we are given a source domain $\mathcal{D}_s = \{(\mathbf{x}^s_i, y^s_i)\}_{i=1}^{n_s}$ of $n_s$ labeled examples and a target domain $\mathcal{D}_t = \{\mathbf{x}^t_j\}_{j=1}^{n_t}$ of $n_t$ unlabeled examples. The source domain and target domain are sampled from joint distributions $P(X^s, Y^s)$ and $Q(X^t, Y^t)$ respectively, with $P \neq Q$. The goal of this paper is to design a deep neural network $y = f(\mathbf{x})$ which formally reduces the shifts in the joint distributions across domains and enables learning both transferable features and classifiers, such that the target risk $R_t(f) = \mathbb{E}_{(\mathbf{x}, y) \sim Q}[f(\mathbf{x}) \neq y]$ can be minimized by jointly minimizing the source risk and the domain discrepancy.

Recent studies reveal that deep networks (Bengio et al., 2013) can learn more transferable representations than traditional hand-crafted features (Oquab et al., 2013; Yosinski et al., 2014). The favorable transferability of deep features leads to several state-of-the-art deep transfer learning methods (Ganin & Lempitsky, 2015; Tzeng et al., 2015; Long et al., 2015; 2016). This paper also tackles unsupervised domain adaptation by learning transferable features using deep neural networks. We extend deep convolutional neural networks (CNNs), including AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016), to novel joint adaptation networks (JANs) as shown in Figure 1. The empirical error of the CNN classifier $f(\mathbf{x})$ on the source domain labeled data $\mathcal{D}_s$ is

$$\min_{f} \frac{1}{n_s} \sum_{i=1}^{n_s} J(f(\mathbf{x}^s_i), y^s_i), \quad (7)$$

where $J(\cdot, \cdot)$ is the cross-entropy loss function. Based on the quantification study of feature transferability in deep convolutional networks (Yosinski et al., 2014), convolutional layers can learn generic features that are transferable across domains. Thus we opt to fine-tune the features of the convolutional layers when transferring pre-trained deep models from the source domain to the target domain. However, the literature findings also reveal that the deep features can reduce, but not remove, the cross-domain distribution discrepancy (Yosinski et al., 2014; Long et al., 2015; 2016).
Figure 1. The architectures of the Joint Adaptation Network (JAN) (a) and its adversarial version (JAN-A) (b). Since deep features eventually transition from general to specific along the network, activations in the multiple domain-specific layers $\mathcal{L}$ are not safely transferable, and the joint distributions of the activations $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|})$ and $Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$ in these layers should be adapted by JMMD minimization.

Hence, the shifts in the joint distributions $P(X^s, Y^s)$ and $Q(X^t, Y^t)$ still linger in the activations $Z^1, \ldots, Z^{|\mathcal{L}|}$ of the higher network layers $\mathcal{L}$. Taking AlexNet (Krizhevsky et al., 2012) as an example, the activations in the higher fully-connected layers $\mathcal{L} = \{fc6, fc7, fc8\}$ are not safely transferable for domain adaptation (Yosinski et al., 2014). Note that the shift in the feature distributions $P(X^s)$ and $Q(X^t)$ mainly lingers in the feature layers $fc6$ and $fc7$, while the shift in the label distributions $P(Y^s)$ and $Q(Y^t)$ mainly lingers in the classifier layer $fc8$. Thus we can use the joint distributions of the activations in layers $\mathcal{L}$, i.e. $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|})$ and $Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$, as good surrogates of the original joint distributions $P(X^s, Y^s)$ and $Q(X^t, Y^t)$, respectively. To enable unsupervised domain adaptation, we should find a way to match $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|})$ and $Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$.

Many existing methods address transfer learning by bounding the target error with the source error plus a discrepancy between the marginal distributions $P(X^s)$ and $Q(X^t)$ of the source and target domains (Ben-David et al., 2010). The Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), as a kernel two-sample test statistic, has been widely applied to measure the discrepancy in the marginal distributions $P(X^s)$ and $Q(X^t)$ (Tzeng et al., 2014; Long et al., 2015; 2016). To date, MMD has not been used to measure the discrepancy in the joint distributions $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|})$ and $Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$, possibly because MMD was not directly defined for joint distributions by (Gretton et al., 2012), while in conventional shallow domain adaptation methods the joint distributions are not easy to manipulate and match.

Following the virtue of MMD (5), we use the Hilbert space embeddings of joint distributions (3) to measure the discrepancy of the two joint distributions $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|})$ and $Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$. The resulting measure is called the Joint Maximum Mean Discrepancy (JMMD), which is defined as

$$D_{\mathcal{L}}(P, Q) \triangleq \left\| \mathcal{C}_{Z^{s,1:|\mathcal{L}|}}(P) - \mathcal{C}_{Z^{t,1:|\mathcal{L}|}}(Q) \right\|^{2}_{\otimes_{\ell=1}^{|\mathcal{L}|} \mathcal{H}^{\ell}}. \quad (8)$$

Based on the virtue of the kernel two-sample test theory (Gretton et al., 2012), we will have $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|}) = Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$ if and only if $D_{\mathcal{L}}(P, Q) = 0$. Given the source domain $\mathcal{D}_s$ of $n_s$ labeled points and the target domain $\mathcal{D}_t$ of $n_t$ unlabeled points drawn i.i.d. from $P$ and $Q$ respectively, the deep networks will generate activations in layers $\mathcal{L}$ as $\{(\mathbf{z}^{s1}_i, \ldots, \mathbf{z}^{s|\mathcal{L}|}_i)\}_{i=1}^{n_s}$ and $\{(\mathbf{z}^{t1}_j, \ldots, \mathbf{z}^{t|\mathcal{L}|}_j)\}_{j=1}^{n_t}$.
The empirical estimate of $D_{\mathcal{L}}(P, Q)$ is computed as the squared distance between the empirical kernel mean embeddings as

$$\widehat{D}_{\mathcal{L}}(P, Q) = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \prod_{\ell \in \mathcal{L}} k^{\ell}(\mathbf{z}^{s\ell}_i, \mathbf{z}^{s\ell}_j) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \prod_{\ell \in \mathcal{L}} k^{\ell}(\mathbf{z}^{t\ell}_i, \mathbf{z}^{t\ell}_j) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \prod_{\ell \in \mathcal{L}} k^{\ell}(\mathbf{z}^{s\ell}_i, \mathbf{z}^{t\ell}_j). \quad (9)$$
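For concreteness, a minimal NumPy sketch of the empirical JMMD of Eq. (9) is given below (not code from the paper): the layer-wise kernel matrices are multiplied elementwise, which realizes the tensor-product kernel of Eq. (3). The Gaussian kernels and bandwidths are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (n x d) and b (m x d)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def jmmd2(zs_layers, zt_layers, sigmas):
    """Empirical JMMD of Eq. (9).

    zs_layers / zt_layers are lists of activation matrices, one per layer in L,
    of shapes (n_s, d_l) and (n_t, d_l).
    """
    k_ss, k_tt, k_st = 1.0, 1.0, 1.0
    for zs, zt, sigma in zip(zs_layers, zt_layers, sigmas):
        k_ss = k_ss * gaussian_kernel(zs, zs, sigma)
        k_tt = k_tt * gaussian_kernel(zt, zt, sigma)
        k_st = k_st * gaussian_kernel(zs, zt, sigma)
    return k_ss.mean() + k_tt.mean() - 2 * k_st.mean()

# toy usage: activations of two "layers" for 64 source and 64 target points
zs = [np.random.randn(64, 256), np.random.randn(64, 31)]
zt = [np.random.randn(64, 256) + 0.3, np.random.randn(64, 31)]
print(jmmd2(zs, zt, sigmas=[1.0, 1.0]))
```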
Remark: Taking a close look at the objectives of MMD (6) and JMMD (9), we can find some interesting connections. The difference is that, for the activations $Z^{\ell}$ in each layer $\ell \in \mathcal{L}$, instead of putting uniform weights on the kernel function $k^{\ell}(\mathbf{z}^{\ell}_i, \mathbf{z}^{\ell}_j)$ as in MMD, JMMD applies non-uniform weights, reflecting the influence of the other variables in the other layers $\mathcal{L} \backslash \ell$. This captures the full interactions between the different variables in the joint distributions $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|})$ and $Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$, which is crucial for domain adaptation. Previous deep transfer learning methods (Tzeng et al., 2014; Long et al., 2015; Ganin & Lempitsky, 2015; Tzeng et al., 2015; Long et al., 2016) have not addressed this issue.

Denote by $\mathcal{L}$ the domain-specific layers where the activations are not safely transferable. We will formally reduce the discrepancy in the joint distributions of the activations in layers $\mathcal{L}$, i.e. $P(Z^{s1}, \ldots, Z^{s|\mathcal{L}|})$ and $Q(Z^{t1}, \ldots, Z^{t|\mathcal{L}|})$. Note that the features in the lower layers of the network are transferable and hence do not require further distribution matching. By integrating the JMMD (9) over the domain-specific layers $\mathcal{L}$ into the CNN error (7), the joint distributions are matched end-to-end with network training:

$$\min_{f} \frac{1}{n_s} \sum_{i=1}^{n_s} J(f(\mathbf{x}^s_i), y^s_i) + \lambda \widehat{D}_{\mathcal{L}}(P, Q), \quad (10)$$

where $\lambda > 0$ is a tradeoff parameter of the JMMD penalty. As shown in Figure 1(a), we set $\mathcal{L} = \{fc6, fc7, fc8\}$ for the JAN model based on AlexNet (last three layers), while we set $\mathcal{L} = \{pool5, fc\}$ for the JAN model based on ResNet (last two layers), as these layers are tailored to task-specific structures, which are not safely transferable and should be jointly adapted by minimizing the CNN error and JMMD (9).

A limitation of JMMD (9) is its quadratic complexity, which is inefficient for scalable deep transfer learning. Motivated by the unbiased estimate of MMD (Gretton et al., 2012), we derive a similar linear-time estimate of JMMD as follows:

$$\widehat{D}_{\mathcal{L}}(P, Q) = \frac{2}{n} \sum_{i=1}^{n/2} \left( \prod_{\ell \in \mathcal{L}} k^{\ell}(\mathbf{z}^{s\ell}_{2i-1}, \mathbf{z}^{s\ell}_{2i}) + \prod_{\ell \in \mathcal{L}} k^{\ell}(\mathbf{z}^{t\ell}_{2i-1}, \mathbf{z}^{t\ell}_{2i}) \right) - \frac{2}{n} \sum_{i=1}^{n/2} \left( \prod_{\ell \in \mathcal{L}} k^{\ell}(\mathbf{z}^{s\ell}_{2i-1}, \mathbf{z}^{t\ell}_{2i}) + \prod_{\ell \in \mathcal{L}} k^{\ell}(\mathbf{z}^{t\ell}_{2i-1}, \mathbf{z}^{s\ell}_{2i}) \right), \quad (11)$$

where $n = n_s$. This linear-time estimate fits the mini-batch stochastic gradient descent (SGD) algorithm well. In each mini-batch, we sample the same number of source points and target points to eliminate the bias caused by domain size. This enables our models to scale linearly to large samples.
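A minimal NumPy sketch of the linear-time estimate of Eq. (11) on one mini-batch, pairing consecutive examples; the Gaussian kernel and variable names are illustrative assumptions.

```python
import numpy as np

def gauss(a, b, sigma):
    # row-wise Gaussian kernel k(a_i, b_i) between paired examples
    return np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * sigma**2))

def jmmd_linear(zs_layers, zt_layers, sigmas):
    """Linear-time JMMD estimate of Eq. (11) on one mini-batch.

    zs_layers / zt_layers: lists of (n, d_l) activation arrays, one per layer
    in L, with the same (even) batch size n for source and target.
    """
    n = zs_layers[0].shape[0]
    odd, even = slice(0, n, 2), slice(1, n, 2)       # indices 2i-1 and 2i
    same_s = same_t = cross_st = cross_ts = 1.0
    for zs, zt, sigma in zip(zs_layers, zt_layers, sigmas):
        same_s   = same_s   * gauss(zs[odd], zs[even], sigma)
        same_t   = same_t   * gauss(zt[odd], zt[even], sigma)
        cross_st = cross_st * gauss(zs[odd], zt[even], sigma)
        cross_ts = cross_ts * gauss(zt[odd], zs[even], sigma)
    # (2/n) * sum over the n/2 pairs equals the mean over pairs
    return np.mean(same_s + same_t) - np.mean(cross_st + cross_ts)
```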
The MMD defined using the RKHS (6) has the advantage of not requiring a separate network to approximately maximize the original definition of MMD (5). But the original MMD (5) reveals that, in order to maximize the test power such that any two distributions can be distinguished, we require the class of functions $f \in \mathcal{H}$ to be rich enough. Although (Gretton et al., 2012) shows that a universal RKHS is rich enough, such kernel-based MMD may suffer from vanishing gradients for low-bandwidth kernels. Moreover, it may be possible that some widely-used kernels are unable to capture very complex distances in high dimensional spaces such as natural images (Reddi et al., 2015; Arjovsky et al., 2017).

To circumvent the issues of vanishing gradients and the non-rich function class of kernel-based MMD (6), we are enlightened by the original MMD (5), which fits the adversarial training in GANs (Goodfellow et al., 2014). We add multiple fully-connected layers parametrized by $\theta$ to the proposed JMMD (9) to make the function class of JMMD richer using a neural network, as shown in Figure 1(b). We maximize JMMD with respect to these new parameters $\theta$ to approach the virtue of the original MMD (5), that is, maximizing the test power of JMMD such that the distributions of the source and target domains are made more distinguishable (Sriperumbudur et al., 2009). This leads to a new adversarial joint adaptation network:

$$\min_{f} \max_{\theta} \frac{1}{n_s} \sum_{i=1}^{n_s} J(f(\mathbf{x}^s_i), y^s_i) + \lambda \widehat{D}_{\mathcal{L}}(P, Q; \theta). \quad (12)$$

By learning deep features that minimize this more powerful JMMD, intuitively any shift in the joint distributions will be more easily identified by JMMD and then adapted by the CNN.
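As a rough illustration of the min-max objective in Eq. (12), the following assumed PyTorch-style sketch (not the paper's Caffe implementation) alternates a gradient ascent update on the adversarial parameters $\theta$ with a descent update on the network $f$; `classifier`, `adversaries`, and `jmmd_loss` are hypothetical placeholders for the CNN $f$, the added fully-connected layers, and a differentiable JMMD estimate such as Eq. (11).

```python
import torch.nn.functional as F

def train_step(classifier, adversaries, opt_f, opt_theta,
               xs, ys, xt, jmmd_loss, lam=1.0):
    """One alternating update for the min-max objective of Eq. (12).

    classifier:  the CNN f; returns (list of layer activations in L, source logits).
    adversaries: list of small fully-connected modules (parameters theta), one per
                 layer in L, applied to the activations before the JMMD estimate.
    """
    # --- step 1: maximize JMMD w.r.t. theta (ascent on the adversary) ---------
    zs, _ = classifier(xs)
    zt, _ = classifier(xt)
    d = jmmd_loss([a(z) for a, z in zip(adversaries, zs)],
                  [a(z) for a, z in zip(adversaries, zt)])
    opt_theta.zero_grad()
    (-lam * d).backward()              # ascent on theta = descent on -JMMD
    opt_theta.step()

    # --- step 2: minimize source risk + lambda * JMMD w.r.t. f ----------------
    zs, logits_s = classifier(xs)
    zt, _ = classifier(xt)
    d = jmmd_loss([a(z) for a, z in zip(adversaries, zs)],
                  [a(z) for a, z in zip(adversaries, zt)])
    loss = F.cross_entropy(logits_s, ys) + lam * d
    opt_f.zero_grad()
    loss.backward()
    opt_f.step()
    return float(loss.detach())
```

The two updates could also be fused into a single backward pass with a gradient-reversal trick as in RevGrad (Ganin & Lempitsky, 2015); the alternating form above is only meant to make the min-max structure explicit.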
Remark: This version of JAN shares the idea of domain-adversarial training with (Ganin & Lempitsky, 2015), but differs in that we use the JMMD as the domain adversary, while (Ganin & Lempitsky, 2015) uses logistic regression. As pointed out in a very recent study (Arjovsky et al., 2017), our JMMD-adversarial network can be trained more easily.
5. Experiments
We evaluate the joint adaptation networks against state-of-the-art transfer learning and deep learning methods. Codes and datasets are available at http://github.com/thuml.

Office-31 (Saenko et al., 2010) is a standard benchmark for domain adaptation in computer vision, comprising 4,652 images and 31 categories collected from three distinct domains: Amazon (A), which contains images downloaded from amazon.com, and Webcam (W) and DSLR (D), which contain images respectively taken by a web camera and a digital SLR camera under different settings. We evaluate all methods across three transfer tasks A → W, D → W and W → D, which are widely adopted by previous deep transfer learning methods (Tzeng et al., 2014; Ganin & Lempitsky, 2015), and another three transfer tasks A → D, D → A and W → A as in (Long et al., 2015; 2016; Tzeng et al., 2015).

ImageCLEF-DA is a benchmark dataset for the ImageCLEF 2014 domain adaptation challenge (http://imageclef.org/2014/adaptation), which is organized by selecting the 12 common categories shared by the following three public datasets, each considered as a domain: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). There are 50 images in each category and 600 images in each domain. We use all domain combinations and build 6 transfer tasks: I → P, P → I, I → C, C → I, C → P, and P → C. Different from Office-31, where different domains are of different sizes, the three domains in ImageCLEF-DA are of equal size, which makes it a good complement to Office-31 for more controlled experiments.
We compare with conventional and state-of-the-art transfer learning and deep learning methods: Transfer Component Analysis (TCA) (Pan et al., 2011), Geodesic Flow Kernel (GFK) (Gong et al., 2012), the convolutional neural networks AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016), Deep Domain Confusion (DDC) (Tzeng et al., 2014), Deep Adaptation Network (DAN) (Long et al., 2015), Reverse Gradient (RevGrad) (Ganin & Lempitsky, 2015), and Residual Transfer Network (RTN) (Long et al., 2016). TCA is a transfer learning method based on MMD-regularized kernel PCA. GFK is a manifold learning method that interpolates across an infinite number of intermediate subspaces to bridge domains. DDC is the first method that maximizes domain invariance by regularizing the adaptation layer of AlexNet using linear-kernel MMD (Gretton et al., 2012). DAN learns transferable features by embedding deep features of multiple task-specific layers in reproducing kernel Hilbert spaces (RKHSs) and matching different distributions optimally using multi-kernel MMD. RevGrad improves domain adaptation by making the source and target domains indistinguishable to a domain discriminator via adversarial training. RTN jointly learns transferable features and adaptive classifiers by deep residual learning (He et al., 2016).

We examine the influence of deep representations for domain adaptation by employing the breakthrough AlexNet (Krizhevsky et al., 2012) and the state-of-the-art ResNet (He et al., 2016) for learning transferable deep representations. For AlexNet, we follow DeCAF (Donahue et al., 2014) and use the activations of layer fc7 as the image representation. For ResNet (50 layers), we use the activations of the last feature layer pool5 as the image representation. We follow the standard evaluation protocols for unsupervised domain adaptation (Long et al., 2015; Ganin & Lempitsky, 2015).
For both the Office-31 and ImageCLEF-DA datasets, we use all labeled source examples and all unlabeled target examples. We compare the average classification accuracy of each method over three random experiments, and report the standard error of the classification accuracies across different experiments of the same transfer task. We perform model selection by tuning hyper-parameters using transfer cross-validation (Zhong et al., 2010). For MMD-based methods and JAN, we adopt a Gaussian kernel with bandwidth set to the median pairwise squared distance on the training data (Gretton et al., 2012).
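A small NumPy sketch of this median heuristic, in the spirit of Gretton et al. (2012); the exact convention for relating the median distance to the bandwidth is an assumption made for illustration.

```python
import numpy as np

def median_heuristic_bandwidth(x):
    """Set the Gaussian kernel bandwidth so that 2*sigma^2 equals the median
    pairwise squared distance of the sample x (shape: n x d)."""
    sq = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    median_sq = np.median(sq[np.triu_indices_from(sq, k=1)])
    return np.sqrt(median_sq / 2.0)

x = np.random.randn(200, 64)
sigma = median_heuristic_bandwidth(x)
# the kernel k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)) then uses this bandwidth
```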
We implement all deep methods based on the Caffe framework, and fine-tune from Caffe-provided models of AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016), both pre-trained on the ImageNet 2012 dataset. We fine-tune all convolutional and pooling layers and train the classifier layer via back-propagation. Since the classifier is trained from scratch, we set its learning rate to be 10 times that of the other layers. We use mini-batch stochastic gradient descent (SGD) with momentum of 0.9 and the learning rate annealing strategy of RevGrad (Ganin & Lempitsky, 2015): the learning rate is not selected by a grid search due to high computational cost; it is adjusted during SGD using the formula $\eta_p = \frac{\eta_0}{(1 + \alpha p)^{\beta}}$, where $p$ is the training progress linearly changing from 0 to 1, $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, which is optimized to promote convergence and low error on the source domain. To suppress noisy activations at the early stages of training, instead of fixing the adaptation factor $\lambda$, we gradually change it from 0 to 1 by a progressive schedule $\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1$, where $\gamma = 10$ is fixed throughout the experiments (Ganin & Lempitsky, 2015). This progressive strategy significantly stabilizes parameter sensitivity and eases model selection for JAN and JAN-A.
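The two schedules above can be written as a short Python sketch; the constants mirror the values stated in the text and the function names are illustrative.

```python
import numpy as np

def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    # annealed learning rate eta_p = eta0 / (1 + alpha * p)^beta,
    # with p the training progress in [0, 1]
    return eta0 / (1.0 + alpha * p) ** beta

def adaptation_factor(p, gamma=10.0):
    # lambda_p grows smoothly from 0 to 1 to suppress noisy activations early on
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0

for p in (0.0, 0.1, 0.5, 1.0):
    print(p, learning_rate(p), adaptation_factor(p))
```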
The classification accuracy results on the Office-31 dataset for unsupervised domain adaptation based on AlexNet and ResNet are shown in Table 1. For fair comparison under an identical evaluation setting, the results of DAN (Long et al., 2015), RevGrad (Ganin & Lempitsky, 2015), and RTN (Long et al., 2016) are directly reported from their published papers. The proposed JAN models outperform all comparison methods on most transfer tasks. It is noteworthy that JANs promote the classification accuracies substantially on hard transfer tasks, e.g. D → A and W → A, where the source and target domains are substantially different and the source domain is smaller than the target domain, and produce comparable classification accuracies on easy transfer tasks, D → W and W → D, where the source and target domains are similar (Saenko et al., 2010). The encouraging results highlight the key importance of joint distribution adaptation in deep neural networks, and suggest that JANs are able to learn more transferable representations for effective domain adaptation.

The results reveal several interesting observations. (1) Standard deep learning methods either outperform (AlexNet) or underperform (ResNet) traditional shallow transfer learning methods (TCA and GFK) using deep features (AlexNet-fc7 and ResNet-pool5) as input, and the traditional shallow transfer learning methods perform better with the more transferable deep features extracted by ResNet. This confirms the current practice that deep networks learn abstract feature representations, which can only reduce, but not remove, the domain discrepancy (Yosinski et al., 2014). (2) Deep transfer learning methods substantially outperform both standard deep learning methods and traditional shallow transfer learning methods. This validates that reducing the domain discrepancy by embedding domain-adaptation modules into deep networks (DDC, DAN, RevGrad, and RTN) can learn more transferable features. (3) The JAN models outperform previous methods by large margins and set a new state-of-the-art record. Different from all previous deep transfer learning methods that only adapt the marginal distributions based on independent feature layers (one layer for RevGrad and multiple layers for DAN and RTN), JAN adapts the joint distributions of the network activations in all domain-specific layers to fully correct the shifts in joint distributions across domains.
Table 1. Classification accuracy (%) on the Office-31 dataset for unsupervised domain adaptation (AlexNet and ResNet); columns report the tasks A → W, D → W, W → D, A → D, D → A, W → A, and the average.
Table 2. Classification accuracy (%) on ImageCLEF-DA for unsupervised domain adaptation (AlexNet and ResNet); columns report the tasks I → P, P → I, I → C, C → I, C → P, P → C, and the average.

Although both JAN and DAN (Long et al., 2015) adapt multiple domain-specific layers, the improvement from DAN to JAN is crucial for the domain adaptation performance: JAN uses a JMMD penalty to reduce the shift in the joint distributions of multiple task-specific layers, which reflects the shift in the joint distributions of input features and output labels; DAN needs multiple MMD penalties, each independently reducing the shift in the marginal distribution of one layer, implicitly assuming that the feature layers and the classifier layer are independent.

By going from AlexNet to the extremely deep ResNet, we can attain a more in-depth understanding of feature transferability. (1) ResNet-based methods outperform AlexNet-based methods by large margins. This validates that very deep convolutional networks, e.g. VGGnet (Simonyan & Zisserman, 2015), GoogLeNet (Szegedy et al., 2015), and ResNet, not only learn better representations for general vision tasks but also learn more transferable representations for domain adaptation.
(2) The JAN models significantly outperform ResNet-based methods, revealing that even very deep networks can only reduce, but not remove, the domain discrepancy. (3) The boost of JAN over ResNet is more significant than the improvement of JAN over AlexNet. This implies that JAN can benefit from more transferable representations.

A great aspect of JAN is that, via the kernel trick, there is no need to train a separate network to maximize the MMD criterion (5) over the ball of an RKHS. However, this has the disadvantage that some kernels used in practice are unsuitable for capturing very complex distances in high dimensional spaces such as natural images (Arjovsky et al., 2017). The JAN-A model significantly outperforms the previous domain-adversarial deep network (Ganin & Lempitsky, 2015). The improvement from JAN to JAN-A also demonstrates the benefit of adversarial training for optimizing the JMMD in a richer function class. By maximizing the JMMD criterion with respect to a separate network, JAN-A can maximize the distinguishability of the source and target distributions. By adapting domains against deep features where their distributions maximally differ, we can enhance the feature transferability.

The three domains in ImageCLEF-DA are more balanced than those of Office-31. With these more balanced transfer tasks, we expect to testify whether transfer learning improves when domain sizes do not change. The classification accuracy results based on both AlexNet and ResNet are shown in Table 2. The JAN models outperform the comparison methods on most transfer tasks, but by smaller improvements. This suggests that the difference in domain sizes may cause additional shift.
Figure 2. The t-SNE visualization of network activations (ResNet) generated by DAN (a)(b) and JAN (c)(d), respectively: (a) DAN: Source = A; (b) DAN: Target = W; (c) JAN: Source = A; (d) JAN: Target = W.
Figure 3. Analysis: (a) A-distance; (b) JMMD; (c) parameter sensitivity of λ; (d) convergence (dashed lines show the best baseline results).

We visualize in Figures 2(a)–2(d) the network activations of task A → W learned by DAN and JAN respectively, using t-SNE embeddings (Donahue et al., 2014). Compared with the activations given by DAN in Figures 2(a)–2(b), the activations given by JAN in Figures 2(c)–2(d) show that the target categories are discriminated much more clearly by the JAN source classifier. This suggests that the adaptation of the joint distributions of multilayer activations is a powerful approach to unsupervised domain adaptation.
Distribution Discrepancy: The theory of domain adaptation (Ben-David et al., 2010; Mansour et al., 2009) suggests the $\mathcal{A}$-distance as a measure of distribution discrepancy, which, together with the source risk, bounds the target risk. The proxy $\mathcal{A}$-distance is defined as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a classifier (e.g. kernel SVM) trained on the binary problem of discriminating the source and target domains. Figure 3(a) shows $d_{\mathcal{A}}$ on tasks A → W and W → D with features of CNN, DAN, and JAN. We observe that $d_{\mathcal{A}}$ using JAN features is much smaller than $d_{\mathcal{A}}$ using CNN and DAN features, which suggests that JAN features can close the cross-domain gap more effectively. As domains W and D are very similar, $d_{\mathcal{A}}$ of task W → D is much smaller than that of A → W, which explains the better accuracy on W → D.

A limitation of the $\mathcal{A}$-distance is that it cannot measure the cross-domain discrepancy of joint distributions, which is addressed by the proposed JMMD (9). We compute JMMD (9) across domains using CNN, DAN and JAN activations respectively, based on the features in fc7 and the ground-truth labels in fc8 (the target labels are not used for model training). Figure 3(b) shows that JMMD using JAN activations is much smaller than JMMD using CNN and DAN activations, which validates that JANs successfully reduce the shifts in joint distributions to learn more transferable representations.
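A minimal scikit-learn sketch of the proxy $\mathcal{A}$-distance (an illustrative assumption, not the paper's exact protocol): train a binary classifier to separate source features from target features and plug its held-out error into $d_{\mathcal{A}} = 2(1 - 2\epsilon)$.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def proxy_a_distance(feats_src, feats_tgt):
    """Proxy A-distance d_A = 2(1 - 2*eps) from a source/target domain classifier."""
    x = np.vstack([feats_src, feats_tgt])
    y = np.concatenate([np.zeros(len(feats_src)), np.ones(len(feats_tgt))])
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)
    clf = SVC(kernel="rbf").fit(x_tr, y_tr)          # kernel SVM domain discriminator
    eps = 1.0 - clf.score(x_te, y_te)                # estimated generalization error
    return 2.0 * (1.0 - 2.0 * eps)

# usage with hypothetical feature matrices extracted from a network layer
src = np.random.randn(300, 256)
tgt = np.random.randn(300, 256) + 0.8
print(proxy_a_distance(src, tgt))
```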
Parameter Sensitivity: We check the sensitivity of the JMMD parameter λ, i.e. the maximum value of the relative weight for JMMD. Figure 3(c) demonstrates the transfer accuracy of JAN based on AlexNet and ResNet respectively, obtained by varying λ over a wide range of values on task A → W. The accuracy of JAN first increases and then decreases as λ varies, showing a bell-shaped curve. This confirms the motivation of combining deep learning with joint distribution adaptation, as a proper trade-off between them enhances transferability.
Convergence Performance: As JAN and JAN-A involve adversarial training procedures, we examine their convergence behavior. Figure 3(d) demonstrates the test errors of different methods on task A → W, which suggests that JAN converges fastest thanks to its nonparametric JMMD, while JAN-A has a convergence speed similar to RevGrad with significantly improved accuracy throughout the whole convergence procedure.
6. Conclusion
This paper presented a novel approach to deep transfer learning, which enables end-to-end learning of transferable representations. Unlike previous methods that match the marginal distributions of features across domains, the proposed approach reduces the shift in the joint distributions of the network activations of multiple task-specific layers, which approximates the shift in the joint distributions of input features and output labels. The discrepancy between joint distributions can be computed by embedding the joint distributions in a tensor-product Hilbert space; the resulting criterion scales linearly to large samples and can be implemented in most deep networks. Experiments testified the efficacy of the proposed approach.
Acknowledgments
We thank Zhangjie Cao for conducting part of the experiments. This work was supported by NSFC (61502265, 61325008), the National Key R&D Program of China (2016YFB1000701, 2015BAF32B01), and Tsinghua TNList Lab Key Projects.
References
Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1798–1828, 2013.

Bousmalis, Konstantinos, Trigeorgis, George, Silberman, Nathan, Krishnan, Dilip, and Erhan, Dumitru. Domain separation networks. In Advances in Neural Information Processing Systems (NIPS), pp. 343–351, 2016.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research (JMLR), 12:2493–2537, 2011.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.

Duan, L., Tsang, I. W., and Xu, D. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(3):465–479, 2012.

Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015.

Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In International Conference on Machine Learning (ICML), 2011.

Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

Gong, B., Grauman, K., and Sha, F. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), 2013.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

Gopalan, R., Li, R., and Chellappa, R. Domain adaptation for object recognition: An unsupervised approach. In IEEE International Conference on Computer Vision (ICCV), 2011.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research (JMLR), 13:723–773, 2012.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Hoffman, J., Guadarrama, S., Tzeng, E., Hu, R., Donahue, J., Girshick, R., Darrell, T., and Saenko, K. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems (NIPS), 2014.

Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., and Schölkopf, B. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems (NIPS), 2006.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Long, Mingsheng, Cao, Yue, Wang, Jianmin, and Jordan, Michael I. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.

Long, Mingsheng, Zhu, Han, Wang, Jianmin, and Jordan, Michael I. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems (NIPS), pp. 136–144, 2016.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In Conference on Computational Learning Theory (COLT), 2009.

Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010.

Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks (TNN), 22(2):199–210, 2011.

Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. The MIT Press, 2009.

Reddi, Sashank J, Ramdas, Aaditya, Póczos, Barnabás, Singh, Aarti, and Wasserman, Larry A. On the high dimensional power of a linear-time two sample test under mean-shift alternatives. In Artificial Intelligence and Statistics Conference (AISTATS), 2015.

Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), 2010.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015 (arXiv:1409.1556v6).

Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf, Bernhard. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory (ALT), pp. 13–31. Springer, 2007.

Song, L., Huang, J., Smola, A., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning (ICML), 2009.

Song, Le and Dai, Bo. Robust low rank kernel embeddings of multivariate distributions. In Advances in Neural Information Processing Systems (NIPS), pp. 3228–3236, 2013.

Song, Le, Boots, Byron, Siddiqi, Sajid M, Gordon, Geoffrey J, and Smola, Alex. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning (ICML), 2010.

Song, Le, Fukumizu, Kenji, and Gretton, Arthur. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G., and Schölkopf, B. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems (NIPS), 2009.

Sriperumbudur, Bharath K, Gretton, Arthur, Fukumizu, Kenji, Schölkopf, Bernhard, and Lanckriet, Gert RG. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research (JMLR), 11(Apr):1517–1561, 2010.

Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P. V., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems (NIPS), 2008.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Simultaneous deep transfer across domains and tasks. In IEEE International Conference on Computer Vision (ICCV), 2015.

Tzeng, Eric, Hoffman, Judy, Saenko, Kate, and Darrell, Trevor. Adversarial discriminative domain adaptation. arXiv preprint arXiv:1702.05464, 2017.

Wang, X. and Schneider, J. Flexible transfer learning under support and model shift. In Advances in Neural Information Processing Systems (NIPS), 2014.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NIPS), 2014.

Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. Domain adaptation under target and conditional shift. In International Conference on Machine Learning (ICML), 2013.

Zhong, E., Fan, W., Yang, Q., Verscheure, O., and Ren, J. Cross validation framework to choose amongst models and datasets for transfer learning. In