Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement
Xingjian Li, Di Hu, Xuhong Li, Haoyi Xiong, Zhi Ye, Zhipeng Wang, Chengzhong Xu, Dejing Dou
Baidu Inc., Renmin University of China, University of Macau
{lixingjian, lixuhong, xionghaoyi, yezhi, wangzhipeng, doudejing}@baidu.com, [email protected], [email protected]

Abstract
Fine-tuning deep neural networks pre-trained on large scale datasets is one of the most practical transfer learning paradigms given a limited quantity of training samples. To obtain better generalization, using the starting point as the reference, either through weights or features, has been successfully applied to transfer learning as a regularizer. However, due to the domain discrepancy between the source and target tasks, there is an obvious risk of negative transfer. In this paper, we propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED), where the knowledge relevant to the target task is disentangled from the original source model and used as a regularizer during fine-tuning of the target model. Experiments on various real world datasets show that our method stably improves the standard fine-tuning by more than 2% on average. TRED also outperforms other state-of-the-art transfer learning regularizers such as $L^2$-SP, AT, DELTA, and BSS.
1. Introduction
Deep convolutional networks achieve great success on large scale vision tasks such as ImageNet [38] and Places365 [50]. In addition to their notable improvements in accuracy, deep representations learned by modern CNNs are demonstrated to be transferable across relevant tasks [48]. This is rather fortunate for the many real world applications with insufficient labeled examples. Transfer learning aims to obtain good performance on such tasks by leveraging knowledge learned from relevant large scale datasets. The auxiliary and desired tasks are called the source and target tasks respectively. Following [34], we focus on inductive transfer learning, which addresses the situation where the source and target tasks have different label spaces.

The most popular practice is to fine-tune a model pre-trained on the source task with the $L^2$ regularization, which has the effect of constraining the parameter around the origin of zeros. [24] points out that since the parameter may be driven far from the Start Point (SP) of the pre-trained model, a major disadvantage of naive fine-tuning is the risk of catastrophic forgetting of the knowledge learned from the source. They recommend the $L^2$-SP regularizer instead of the popular $L^2$. In parallel, knowledge distillation, originally designed for compressing the knowledge of a complex model into a simple one [18], has proved useful for transfer learning tasks where knowledge is distilled from a different dataset [49, 47]. Recent work [25] formulates knowledge distillation in transfer learning as a regularizer on features and further improves it through unactivated channel reusing, to better fit the training samples.

Although the mentioned starting point regularization and knowledge distillation methods have succeeded in preserving the knowledge contained in the source model, fine-tuning still takes an obvious risk of negative transfer. Intuitively, if the source and target data distributions are dissimilar to some extent, not all the knowledge from the source is transferable to the target, and an indiscriminate transfer could be detrimental to the target model. However, the impact of negative transfer has rarely been considered in inductive transfer learning studies. The most related work, [7], proposed the regularizer of Batch Spectral Shrinkage (BSS) to inhibit negative transfer, where the small singular values of feature representations are considered not transferable and are suppressed. Yet, it is hard to adaptively determine the scope of small singular values when faced with different target tasks. Moreover,
BSS does not take the risk of catastrophic forgetting into consideration, which means it has to be equipped with other fine-tuning techniques (e.g., $L^2$-SP [24], DELTA [25], etc.) to achieve considerable performance.

According to the above analysis, it is natural to think about a better solution which simultaneously takes into consideration preserving relevant knowledge and avoiding negative transfer. In this paper, we intend to improve the standard fine-tuning paradigm by accurate knowledge transfer. Assuming that the knowledge contained in the source model consists of one part relevant to the target task and another part which is irrelevant¹, we explicitly disentangle the former from the source model. Thus, a target task specific starting point is used as the reference instead of the original one. Specifically, we design a novel regularizer for deep transfer learning through Target-awareness REpresentation Disentanglement (TRED). The whole algorithm includes two steps. First, we use a lightweight disentangler to separate the middle representations of the pre-trained source model into positive and negative parts. The disentanglement is achieved by simultaneously maximizing the maximum mean discrepancy (MMD) between these two parts and requiring that they can reconstruct the original representation. Supervision from labeled target examples is utilized to distinguish the positive part from the negative part. The second step is to perform fine-tuning using the disentangled positive part of the representations as the reference. In summary, our main contributions are as follows:

• We are the first to apply the idea of representation disentanglement to improve inductive transfer learning.
• Our algorithm, aiming at accurate knowledge transfer, contributes to the study of negative transfer.
• Our proposed TRED significantly outperforms state-of-the-art transfer learning regularizers including $L^2$, $L^2$-SP, AT, DELTA and BSS on various real world datasets.

Table 1: Comparison among different fine-tuning approaches.

Approach             Risk Considered
                     CF*     NT*
$L^2$ [9]            ✗       ✗
SPAR [49, 24, 25]    ✓       ✗
BSS [7]              ✗       ✓
TRED                 ✓       ✓

* CF = Catastrophic Forgetting, NT = Negative Transfer, SPAR = Starting Point As the Reference.

¹ Although this is not mathematically guaranteed, it is very common in practice that the source task is for general purpose while the target task focuses on a specific domain.
2. Related Work
Regularization techniques have a long history since Stein's paradox [40, 10], which showed that shrinking towards chosen parameters yields an estimate more accurate than simply using observed averages. Common choices like the Lasso and Ridge Regression pull the model towards zero, while it is widely believed that shrinking towards the "true parameters" is more effective. In transfer learning, models pre-trained on relevant source tasks with sufficient labeled examples are often regarded as the "true parameters". Earlier works demonstrated the effectiveness of this idea on Maximum Entropy [6] and Support Vector Machine models [46, 23, 2].
To overcome over-fitting, various transfer learning regularizers have been proposed. According to the type of regularized objective, they can be categorized as parameter based [24], feature based [49, 47, 25] or spectral based [7] methods.

Our paper adopts the general idea of preserving knowledge by regularizing features of the source model. Unlike previous methods, however, we do not directly use the original knowledge provided by the source model. Instead, we disentangle the useful part for reference, to avoid negative transfer. The main differences are summarized in Table 1.

Studies from other angles, such as sample selection [12, 31, 20], dynamic computing [16], sparse transfer [44] and cross-modality transfer [19], are also important topics but out of this paper's scope.
Techniques of representation disentanglement are developed to depict the underlying regularities between the input and output in an interpretable way. Note that the concept of disentanglement in this paper differs somewhat from the mainstream understanding, where the objective is to separate latent factors of variation [14, 4]. Our work is highly inspired by the idea of domain information disentanglement [26, 3], which extracts domain invariant representations in unsupervised domain adaptation tasks. In those works, the disentangled components are domain representations, or groups of features, rather than individual features.
3. Preliminaries
In inductive transfer learning, we are given a model pre-trained on the source task, with the parameter vector $\omega^0$. For the desired task, the training set contains $n$ tuples, each of which is denoted as $(x_i, y_i)$; $x_i$ and $y_i$ refer to the $i$-th example and its corresponding label.

Figure 1: The architecture of deep transfer learning through Target-awareness Representation Disentanglement (TRED). [Figure: Step 1, Representation Disentanglement — target data passes through the frozen source model's feature extractor into a trainable disentangler D and classifier C. Step 2, Fine-tuning — the target model is trained with the empirical loss plus a regularization loss computed against the frozen source model and disentangler.]

Let us further denote $z$ as the function of the neural network and $\omega$ as the parameter vector of the target network. We have the objective of structural risk minimization

$$\min_{\omega} \sum_{i=1}^{n} L(z(x_i, \omega), y_i) + \lambda \cdot \Omega(\omega, \omega^0), \quad (1)$$

where the first term is the empirical loss and the second is the regularization term. $\lambda$ is the coefficient balancing the effect of data fitting against reducing over-fitting.

Recent studies in the deep learning paradigm show that SGD itself has an implicit regularization effect that helps generalization in the over-parameterized regime [39]. In addition, since fine-tuning is usually performed with a smaller learning rate and fewer epochs, it can be regarded as a form of implicit regularization towards an initial solution with good generalization properties [27]. Below, we give a brief introduction to state-of-the-art explicit regularizers for deep transfer learning.

$L^2$ Penalty. The most common choice is the $L^2$ penalty of the form $\|\omega\|_2^2$, also named weight decay in deep learning. From a Bayesian perspective, it corresponds to a Gaussian prior on the parameter with zero mean. Its shortcoming is that the meaningful initial point $\omega^0$ is ignored.

$L^2$-SP. [24] follows the idea of shrinking towards chosen targets instead of zero. They propose to use the starting point as the reference:

$$\Omega(\omega) = \alpha \|\omega_s - \omega_s^0\|_2^2 + \beta \|\omega_{\bar{s}}\|_2^2, \quad (2)$$

where the first term constrains the parameters of the part responsible for representation learning around the starting point, and the second is weight decay on the remaining, task-specific part. Since $\omega_{\bar{s}}$ is handled in the same way in all mentioned methods, we omit it in the following formulas.

DELTA. [25] extends the framework of feature distillation [37, 49] by incorporating an attention mechanism. They constrain the 2-d activation maps of different channels with different strengths, according to their value to the target task. Given a training example $(x_i, y_i)$ and a distance metric $D$ between activation maps, the regularization is formulated as

$$\Omega_i(\omega_s) = \alpha \sum_{j=1}^{C} W_j(\omega_s^0, x_i, y_i) \cdot D_{ij}, \quad (3)$$

where $C$ is the number of channels and $W_j(\cdot)$ is the regularization weight assigned to the $j$-th channel. Specifically, each weight is independently evaluated by the performance drop when disabling that channel.

BSS. The authors of [7] propose Batch Spectral Shrinkage (
BSS), penalizing untransferable spectral components of deep representations. They show that the less transferable spectral components are those corresponding to relatively small singular values. They apply differentiable SVD to compute all singular values of a batch feature matrix and penalize the smallest $k$ of them:

$$\Omega(\omega_s) = \alpha \sum_{i=1}^{k} \sigma_{b+1-i}^2, \quad (4)$$

where all singular values $[\sigma_1, \sigma_2, ..., \sigma_b]$ are in descending order. $\omega^0$ is not involved, as BSS does not consider preserving the knowledge in the source model.
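For concreteness, both baseline penalties can be written down in a few lines of PyTorch. The sketch below is a minimal illustration of our reading of Eqs. 2 and 4; the function and argument names are ours (e.g., `named_params` is expected to come from `model.named_parameters()` and `source_params` to be a dict of the pre-trained tensors), not an official implementation of either method.

```python
import torch

def l2_sp_penalty(named_params, source_params, alpha, beta):
    # L2-SP (Eq. 2): shrink shared weights towards the pre-trained
    # starting point; plain weight decay on the task-specific part.
    shared, task_specific = 0.0, 0.0
    for name, p in named_params:
        if name in source_params:        # part shared with the source network
            shared = shared + (p - source_params[name]).pow(2).sum()
        else:                            # new, task-specific parameters
            task_specific = task_specific + p.pow(2).sum()
    return alpha * shared + beta * task_specific

def bss_penalty(features, alpha, k=1):
    # BSS (Eq. 4): penalize the squares of the k smallest singular
    # values of the (batch_size x feature_dim) feature matrix.
    sigma = torch.linalg.svdvals(features)   # returned in descending order
    return alpha * (sigma[-k:] ** 2).sum()
```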
Algorithm 1: The framework of TRED

Input: labeled target dataset $\{(x_i, y_i)\}_{i=1}^{n}$; model pre-trained on the source task $F_s$; representation disentangler $D$; classifier for disentangled representations $C$; target model $F_t$.
Output: well-trained target model $\hat{F}_t$.

// Step 1: Representation Disentanglement
Set $(D, C)$ trainable and $F_s$ frozen;
while not converged do
    Sample a mini-batch $B = \{(x_j, y_j)\}_{j=1}^{|B|}$;
    Compute $(FM_{pos}, FM_{neg})$ by Eq. 5 given $(B, F_s, D)$;
    Compute $L = L^D_{di} + L^D_{re} + L^D_{ce}$ by Eqs. 8–10;
    Update $D$ and $C$ with SGD by minimizing $L$;
end
Obtain the well-trained $\hat{D} = D$;

// Step 2: Fine-tuning
Set $(F_s, \hat{D})$ frozen and $F_t$ trainable;
while not converged do
    Sample a mini-batch $B = \{(x_j, y_j)\}_{j=1}^{|B|}$;
    Compute $FM_{pos}$ by Eq. 5 given $(B, F_s, \hat{D})$;
    Compute $\Omega(\omega_s) = \sum_{j=1}^{|B|} \Omega_j(\omega_s)$ by Eq. 11;
    Update $F_t$ by Eq. 1;
end
return $\hat{F}_t = F_t$
4. Disentangled Starting Point As the Reference
The most important component of our algorithm is to disentangle the original knowledge into two parts which are respectively relevant and irrelevant to the target task. Imitating the visual attention mechanism of humans, we force the two parts to pay attention to different spatial regions of the original image. Mathematically, this is achieved by enlarging the discrepancy between their statistical distributions, measured by MMD. The positive part is distinguished by a simple discriminator. We then fine-tune the target model with a regularization that restricts the distance between its feature maps and the corresponding disentangled ones. The framework of the approach is illustrated in Figure 1, and one possible instantiation of the disentangler is sketched below. We explain the components in the following paragraphs.
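The paper describes the disentangler only as "lightweight"; the module below is an illustrative assumption of ours, not the authors' specified architecture. It simply maps a source feature map to two tensors of the same shape, as Eq. 5 below requires.

```python
import torch.nn as nn

class Disentangler(nn.Module):
    """Splits a source feature map FM_ori into positive and negative
    parts of the same shape (Eq. 5). Architecture is hypothetical."""
    def __init__(self, channels):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=1),
            )
        self.pos = branch()   # target-relevant component
        self.neg = branch()   # target-irrelevant component

    def forward(self, fm_ori):
        return self.pos(fm_ori), self.neg(fm_ori)
```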
Representation Disentanglement. Different from the mainstream of disentanglement studies, which try to separate various atomic attributes such as color or angle, we care about disentangling the components relevant to the target task from the whole representation produced by the source model. Formally, we disentangle the original representation $FM_{ori} \in \mathbb{R}^{C \times H \times W}$, obtained from the pre-trained model, into positive and negative parts with the disentangler module $D$:

$$FM_{pos}, FM_{neg} = D(FM_{ori}), \quad (5)$$

where $FM_{pos}$ and $FM_{neg}$ have the same shape as $FM_{ori}$. For efficient estimation and optimization of the disentanglement, we further denote the mapping functions $F_C: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{C}$ and $F_S: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{H \times W}$, representing dimension reduction along the spatial and channel directions respectively. Therefore we get

$$f^c_* = F_C(FM_*), \quad f^s_* = F_S(FM_*), \quad (6)$$

where $*$ refers to either $pos$ or $neg$.

Maximum Mean Discrepancy. Acting as a relevant criterion for comparing distributions based on the Reproducing Kernel Hilbert Space, Maximum Mean Discrepancy (MMD) is widely used in statistical and machine learning tasks such as two-sample tests [11, 15], domain adaptation [33, 42, 29] and generative adversarial networks [41, 1]. Denoting $X_s = \{x_s^1, x_s^2, ..., x_s^n\}$ and $X_t = \{x_t^1, x_t^2, ..., x_t^m\}$ as sets of random variables with distributions $P$ and $Q$, an empirical estimate [42, 28] of the MMD between $P$ and $Q$ compares the squared distance between the empirical kernel mean embeddings:

$$\mathrm{MMD}(P, Q) = \Big\| \frac{1}{n} \sum_{i=1}^{n} k(x_s^i) - \frac{1}{m} \sum_{j=1}^{m} k(x_t^j) \Big\|^2, \quad (7)$$

where $k$ refers to the kernel-induced feature mapping; a Gaussian radial basis function (RBF) kernel is usually used in practice [28, 30]. Our objective is to enlarge the MMD between the disentangled positive and negative parts along the spatial dimension. Intuitively, this explicitly encourages the two parts to attend to different regions of the input image. For stabler optimization, we minimize the negative exponential of the MMD:

$$L^D_{di} = \lambda_{di} \, e^{-\mathrm{MMD}(f^s_{pos}, f^s_{neg})}. \quad (8)$$

Reconstruction Requirement. As both the positive and negative parts are produced by the flexible disentangler, it is easy to obtain two meaningless parts if the only objective is maximizing the distribution distance. To ensure the disentanglement is informative rather than an arbitrary transformation, we add a reconstruction requirement that restricts the disentangled representations within the original knowledge. Specifically, the disentangled positive and negative parts are required to recover the original representation by point-wise addition:

$$L^D_{re} = \lambda_{re} \|FM_{pos} + FM_{neg} - FM_{ori}\|_2^2. \quad (9)$$
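Eqs. 7–9 admit a compact PyTorch realization. The sketch below assumes a single-bandwidth Gaussian kernel and treats each example's channel-pooled spatial map $f^s$ as one sample of the compared distributions; both choices are our reading of the text, not a stated specification.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    # Biased empirical MMD (Eq. 7) with a Gaussian RBF kernel.
    # x: (n, d) samples from P, y: (m, d) samples from Q.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def step1_losses(fm_pos, fm_neg, fm_ori, lam_di, lam_re):
    # fm_*: (B, C, H, W) feature maps from the disentangler (Eq. 5).
    f_s_pos = fm_pos.mean(dim=1).flatten(1)   # F_S: pool channels -> (B, H*W)
    f_s_neg = fm_neg.mean(dim=1).flatten(1)
    l_di = lam_di * torch.exp(-rbf_mmd(f_s_pos, f_s_neg))          # Eq. 8
    l_re = lam_re * (fm_pos + fm_neg - fm_ori).pow(2).mean()       # Eq. 9
    return l_di, l_re
```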
Table 2: Comparison of top-1 accuracy (%) with different methods. Baselines are No Regularization, $L^2$, BSS [7], $L^2$-SP [24], AT [49] and DELTA [25].

Dataset          No Reg.   $L^2$   BSS     $L^2$-SP   AT      DELTA   TRED
CUB-200-2011     79.31
Stanford-Dogs    84.47
Stanford-Cars    88.62
Flower-102       89.43
MIT-Indoor       82.20
Texture          68.32
Oxford-Pet       93.75
Food101-30       63.48
Food101-150      77.45
Average          80.78     80.75   81.11   82.05      81.86   82.50

Since the above representation disentanglement is actually symmetric between the two parts, an explicit signal is required to distinguish the features which are useful for the target task. In particular, the selected layer for representation transfer is followed by a classifier consisting of a global pooling layer and a fully connected layer. A regular cross entropy loss is added to explicitly drive the disentangler to extract into the positive part the components which are discriminative for the target task. Denoting the involved classifier as $C$, we have

$$L^D_{ce} = \lambda_{ce} \, \mathrm{CrossEntropy}(C(f^c_{pos}), y_i). \quad (10)$$

Regularizing the Disentangled Representation. After the representation disentanglement step, we perform fine-tuning on the target task, regularizing the distance between a feature map and its corresponding starting point. Quite different from previous feature map based regularizers [37, 49, 25], the starting point here is the disentangled positive part of the original representation. The regularization term corresponding to an example $(x_i, y_i)$ becomes

$$\Omega_i(\omega_s) = \alpha \|FM(\omega_s, x_i) - FM_{pos}(\omega_s^0, \omega_{di}, x_i)\|_2^2, \quad (11)$$

where $\omega_{di}$ refers to the parameters of the disentangler $D$, which is frozen during the fine-tuning stage. The complete training procedure is presented in Algorithm 1.
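Putting Step 2 together, one fine-tuning step under Eqs. 1 and 11 could look as follows. This is a minimal sketch: the `features`/`classifier` attributes, the layer at which feature maps are matched, and the single coefficient `alpha` (folding together $\lambda$ from Eq. 1 and $\alpha$ from Eq. 11) are placeholders of ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

def finetune_step(x, y, target_net, source_net, disentangler, optimizer, alpha):
    # Disentangled reference: frozen source features through frozen D (Eq. 5).
    with torch.no_grad():
        fm_pos, _ = disentangler(source_net.features(x))
    fm = target_net.features(x)            # target feature maps at the same layer
    loss = F.cross_entropy(target_net.classifier(fm), y)       # empirical loss
    reg = (fm - fm_pos).pow(2).flatten(1).sum(dim=1).mean()    # Eq. 11, batch mean
    loss = loss + alpha * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Step 1 is analogous: the disentangler and classifier are updated by minimizing $L^D_{di} + L^D_{re} + L^D_{ce}$ (the first two as in the sketch after Eq. 9, the last a standard cross entropy on the pooled positive part).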
5. Experiments
We select several popular transfer learning datasets toevaluate the effectiveness of our method.
Stanford Dogs.
The Stanford Dogs [21] dataset consists of images of 120 breeds of dogs, each of which contains 100 examples for training and 72 for testing. It is a subset of ImageNet.

Table 3: Comparison of top-1 accuracy (%) on CUB-200-2011 with respect to different sampling rates (50%, 30%, 15%), for $L^2$, BSS, $L^2$-SP, AT, DELTA and TRED.

Figure 2: The improvements in top-1 accuracy of different algorithms on CUB-200-2011 with respect to different sampling rates.
MIT Indoor-67.
MIT Indoor-67 [36] is an indoor scene classification task consisting of 67 categories, with 80 images for training and 20 for testing per category.
CUB-200-2011.
Caltech-UCSD Birds-200-2011 [45] contains 11,788 images of 200 bird species from around the world. Each species is associated with a Wikipedia article and organized by scientific classification.

Food-101.
Food-101 [5] is a large scale dataset consisting of more than 100k food images divided into 101 different kinds. To better fit transfer learning applications, we use two subsets which contain 30 and 150 training examples per category.
Flower-102.
Flower-102 [32] consists of 102 flower categories. 1,020 images are used for training and 6,149 for testing; only 10 samples are provided per category during training.
Stanford Cars.
The Stanford Cars [22] dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training and 8,041 testing images.
Oxford-IIIT Pet.
The Oxford-IIIT Pet [35] dataset is a 37-category pet dataset with about 200 cat or dog images per class.
Textures.
The Describable Textures Dataset [8] is a texture database containing 5,640 images organized into 47 categories according to perceptual properties of textures.
We implement transfer learning experiments based on ResNet [17]. For MIT Indoor-67, we use ResNet-50 pre-trained on the large scale scene recognition dataset Places365 [50] as the source model. For the remaining datasets, we use ImageNet pre-trained ResNet-101 as the source model. Input images are resized with the shorter edge being 256 and then randomly cropped to 224 × 224 during training. For optimization, we first train 5 epochs to optimize the disentangler using Adam with a learning rate of 0.01; the hyperparameters $\lambda_{di}$, $\lambda_{ce}$ and $\lambda_{re}$ are set to their default values. We then use SGD with momentum 0.9, batch size 64 and an initial learning rate of 0.01 for fine-tuning the target model. We train 40 epochs on each dataset, dividing the learning rate by 10 after 25 epochs. We run each experiment three times and report the average top-1 accuracy. A sketch of this setup follows.
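The schedule above translates directly into PyTorch (a sketch only; the module arguments are placeholders for the actual networks):

```python
import torch

def make_optimizers(disentangler, classifier, target_model):
    # Step 1: Adam, lr 0.01 (5 epochs, handled by the caller).
    opt_d = torch.optim.Adam(
        list(disentangler.parameters()) + list(classifier.parameters()), lr=0.01)
    # Step 2: SGD, momentum 0.9, lr 0.01 divided by 10 after epoch 25.
    opt_t = torch.optim.SGD(target_model.parameters(), lr=0.01, momentum=0.9)
    sched_t = torch.optim.lr_scheduler.MultiStepLR(opt_t, milestones=[25], gamma=0.1)
    return opt_d, opt_t, sched_t
```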
TRED is compared with state-of-the-art transfer learning regularizers including $L^2$-SP [24], AT [49], DELTA [25] and BSS [7]. We perform 3-fold cross validation to search for the best hyperparameter $\alpha$ in each experiment. For $L^2$-SP, DELTA and TRED, the search space is a grid of three values of $\alpha$. Although the authors of AT and BSS recommended fixed values of $\alpha$, we likewise extend the search to grids of three values for AT and for BSS.

Since recent theoretical studies proved that weight decay actually has no regularization effect [43, 13] when combined with the commonly used batch normalization, we use No Regularization as the most naive baseline and reevaluate $L^2$. From Table 2 we observe that $L^2$ does not outperform fine-tuning without any regularization. This may imply that deep transfer learning hardly benefits from regularizers with non-informative priors.

Advanced works [49, 24, 25] adopt regularizers using the starting point as the reference for knowledge preservation. From the perspective of Bayesian theory, these are equivalent to an informative prior which trusts the knowledge contained in the source model, in the form of either parameters or behavior. Table 2 shows that these algorithms obtain significant improvements on datasets such as Stanford Dogs and MIT Indoor-67, where the target dataset is very similar to the source dataset. However, the benefits are much smaller on other datasets such as CUB-200-2011, Flower-102, Stanford Cars and Food-101.

Table 2 illustrates that TRED consistently outperforms all the above baselines on all evaluated datasets. It outperforms the naive fine-tuning regularizer $L^2$ by more than 2% on average. Except on Stanford Dogs and MIT Indoor-67, the improvements are obvious even compared with the state-of-the-art regularizers $L^2$-SP, AT, DELTA and BSS.
To evaluate the scalability of our algorithm with more limited data, we conduct additional experiments on subsets of CUB-200-2011, against the baselines $L^2$, BSS [7], $L^2$-SP [24], AT [49] and DELTA [25]. Specifically, we randomly sample 50%, 30% and 15% of the training examples of each category to construct new training sets. As presented in Table 3, our proposed TRED achieves remarkable improvements over all competitors.

Figure 2 shows how these methods behave as the size of the training set is reduced. For clear illustration, we treat all regularizers that only consider the risk of catastrophic forgetting as one group, namely SPAR, as they all follow the framework of using the Starting Point As the Reference; the average accuracy of $L^2$-SP, AT and DELTA is used to represent SPAR. BSS, which is designed to suppress the untransferable ingredients of features, only tackles the problem of negative transfer, while TRED deals with both challenges. We plot the improvements of these methods over naive fine-tuning with $L^2$ regularization. We observe from Figure 2 that BSS and TRED obtain increasing improvements as the sampling rate is reduced, implying that the negative impact of the source model is greater when the target dataset is smaller. Although sharing the same trend, TRED always outperforms BSS by an obvious margin at all sampling rates, while the curve of SPAR remains much more stable as the sampling rate decreases.
6. Discussions
In this section, we dive deeper into the mechanism and the experimental results to explain why target-awareness disentanglement provides a better reference. In subsection Representation Visualization, we show the effect of our method by visualizing attention maps and feature embeddings. In subsection Shrinking Towards True Behavior, we briefly discuss the theoretical understanding related to shrinkage estimation, and then provide more statistical evidence to validate the advantage of the disentangled positive representation. In subsection Ablation Study, we empirically analyze why the disentanglement component is essential.

Figure 3: The effectiveness of representation disentanglement on CUB-200-2011 (left) and Stanford Cars (right). For each dataset, we select three typical cases for demonstration. In addition to the input image (a), we overlay a spatial attention map onto the original image in columns (b), (c) and (d), computed from the input image and the corresponding representation at the last convolutional layer of ResNet-101: (b) the original representation generated by the ImageNet pre-trained model; (c) and (d) the disentangled positive and negative parts produced by TRED.

Figure 4: Visualization of the original (a: Flower-102, c: Indoor-67) and disentangled (b: Flower-102, d: Indoor-67) feature representations by t-SNE. Different colors and markers denote different categories.

Figure 5: Singular values of the original and disentangled deep representations on (a) Stanford Cars and (b) CUB-200-2011.

Representation Visualization
Show Cases. The authors of [49] show that the spatial attention map plays a critical role in knowledge transfer. We demonstrate the effect of representation disentanglement by visualizing attention maps in Fig. 3. As observed in typical cases from CUB-200-2011 and Stanford Cars, the original representations generated by the ImageNet pre-trained model usually cover a wide range of semantic features, such as other objects or the background, in addition to parts of birds. Our proposed disentangler is able to "purify" the concepts of interest into the positive part, while the negative part pays more attention to the complementary constituents.
Embedding Visualization. Since the most important change of our method is to use the disentangled rather than the original representation as the reference, we are interested in comparing the properties of these two representations on the target task. We visualize the original and disentangled feature representations of Flower-102 and MIT Indoor-67. The dimension of the features is reduced along the spatial direction and then plotted in 2D space using t-SNE embeddings, as sketched below. As illustrated in Figure 4, deep representations derived by our proposed disentangler are separated more clearly between different categories and clustered more tightly for samples of the same category.
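A plot of this kind can be produced with scikit-learn's TSNE; the spatial mean pooling and the plotting details below are our assumptions.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(feature_maps, labels, title):
    # feature_maps: (N, C, H, W) numpy array; labels: (N,) integer classes.
    pooled = feature_maps.mean(axis=(2, 3))   # reduce along the spatial direction
    emb = TSNE(n_components=2, init="pca").fit_transform(pooled)
    plt.figure()
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=8)
    plt.title(title)
    plt.show()
```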
Shrinking Towards True Behavior

Recent work [24] discusses the connection between the proposed $L^2$-SP and the classical statistical theory of shrinkage estimation [10]. The key hypothesis is that shrinking towards a value close to the "true parameters" is more effective than shrinking towards an arbitrary one, and [24] argues that the starting point is closer to the "true parameters" than zero. [49, 25] regularize features rather than parameters, which can be interpreted as shrinking towards the "true behavior". Our proposed TRED further improves on them by explicitly disentangling "truer behavior", utilizing the global distribution and the supervision information of the target dataset. To support this claim, we provide additional evidence as follows.
Reducing Untransferable Components. Inspired by [7], we compute the singular vectors and values of the deep representation by SVD. All singular values are sorted in descending order and plotted in Fig. 5. The authors of [7] demonstrate that the spectral components corresponding to smaller singular values are less transferable, and they find that these components can be suppressed by involving more training examples. Interestingly, we find similar trends with the proposed representation disentanglement. As observed in Fig. 5, the smaller singular values of the disentangled positive representation are further reduced compared with the original representation. Fig. 5 also shows that the spectral components corresponding to larger singular values are increased, a phenomenon which does not appear in [7]. This is intuitively consistent with the hypothesis that features relevant to the target task are disentangled and strengthened. Such spectra can be reproduced as sketched below.
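A minimal sketch for a Fig. 5-style comparison, assuming `f_ori` and `f_pos` are (batch, dim) matrices of flattened original and disentangled positive features:

```python
import torch
import matplotlib.pyplot as plt

def plot_spectra(f_ori, f_pos):
    # Singular values of the two feature matrices, in descending order.
    s_ori = torch.linalg.svdvals(f_ori).cpu().numpy()
    s_pos = torch.linalg.svdvals(f_pos).cpu().numpy()
    plt.plot(s_ori, label="original")
    plt.plot(s_pos, label="disentangled positive")
    plt.xlabel("index")
    plt.ylabel("singular value")
    plt.legend()
    plt.show()
```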
Robustness to Regularization Strength. We also provide empirical evidence to illustrate the effect of the "truer behavior" obtained by our proposed disentangler. The intuition is straightforward: if the behavior (representation) used as the reference is "truer", it should be more robust to a larger regularization strength. We compare with DELTA, which uses the original representation as the reference, on three transfer learning tasks: Places365 → MIT Indoor-67, ImageNet → Stanford Cars and Places365 → Stanford Dogs. The regularization strength $\alpha$ is gradually increased from 0.001 to 1. As illustrated in Fig. 6, the performance of DELTA falls rapidly as $\alpha$ increases, especially on ImageNet → Stanford Cars and Places365 → Stanford Dogs, indicating that a regularizer using the original representations as the reference suffers seriously from negative transfer. In contrast, TRED is much more robust to the increase of $\alpha$.

Figure 6: Top-1 accuracy of transfer learning tasks corresponding to different regularization strengths $\alpha$.

Table 4: Top-1 accuracy (%) on the target and source tasks. TRED- refers to TRED without disentanglement.

Dataset          Target Task        Source Task
                 TRED    TRED-      TRED    TRED-
Flower-102       91.34   89.79      71.23   69.73
CUB-200-2011     82.07   78.84      72.65   70.11
MIT Indoor-67*

* Pre-trained on Places365 and evaluated on ImageNet.

Ablation Study

Since the desired output for the target task is the disentangled positive part, it may seem sufficient to obtain the discriminative representation using only the classifier corresponding to $L^D_{ce}$. In this section, we conduct an ablation study against this simpler framework, which omits the maximization of the distribution distance and thus performs a direct transformation of the original representation rather than disentanglement. This version is denoted by TRED-.

We observe in Table 4 that all evaluated tasks suffer a significant performance drop on the target task without disentanglement. A reasonable hypothesis is that disentangling helps preserve the knowledge in the source model and restrains the representation transformation from over-fitting the classifier $C$. To verify this hypothesis, we compare the top-1 accuracy of ImageNet classification (the source task) between TRED and TRED-. Specifically, we train a randomly initialized classifier to recognize the categories of ImageNet, using the fixed transformed representation as input. The top-1 accuracy of the pre-trained ResNet-101 model is 75.99%. As shown in Table 4, TRED- suffers a larger performance drop on ImageNet than TRED, indicating that representation disentanglement better preserves the general knowledge learned from the source task.
7. Conclusion
In this paper, we extend the study of negative transfer in inductive transfer learning. Specifically, we propose a novel approach, TRED, which regularizes the disentangled deep representation to achieve accurate knowledge transfer. We successfully implement the target-awareness disentanglement by utilizing the MMD metric together with other reasonable constraints. Extensive experimental results on various real-world transfer learning datasets show that TRED significantly outperforms state-of-the-art transfer learning regularizers. Moreover, we provide empirical analyses verifying that the disentangled target-awareness representation is indeed closer to the expected "true behavior" of the target task.
References

[1] Michael Arbel, Dougal Sutherland, Mikołaj Bińkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems, pages 6700–6710, 2018.
[2] Yusuf Aytar and Andrew Zisserman. Tabula rasa: Model transfer for object category detection. In International Conference on Computer Vision, pages 2252–2259. IEEE, 2011.
[3] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[4] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[5] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
[6] Ciprian Chelba and Alex Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech & Language, 20(4):382–399, 2006.
[7] Xinyang Chen, Sinan Wang, Bo Fu, Mingsheng Long, and Jianmin Wang. Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. In Advances in Neural Information Processing Systems, pages 1906–1916, 2019.
[8] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[9] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
[10] Bradley Efron and Carl Morris. Stein's paradox in statistics. Scientific American, 236(5):119–127, 1977.
[11] Kenji Fukumizu, Arthur Gretton, Gert R Lanckriet, Bernhard Schölkopf, and Bharath K Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, pages 1750–1758, 2009.
[12] Weifeng Ge and Yizhou Yu. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1086–1095, 2017.
[13] Aditya Sharad Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. In Advances in Neural Information Processing Systems, pages 10678–10688, 2019.
[14] Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems, pages 646–654, 2009.
[15] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
[16] Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4805–4814, 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[19] Di Hu, Xuhong Li, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, and Dejing Dou. Cross-task transfer for multimodal aerial scene recognition. arXiv preprint arXiv:2005.08449, 2020.
[20] Yunho Jeon, Yongseok Choi, Jaesun Park, Subin Yi, Dongyeon Cho, and Jiwon Kim. Sample-based regularization: A transfer learning strategy toward better generalization. arXiv preprint arXiv:2007.05181, 2020.
[21] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
[22] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[23] Xiao Li. Regularized adaptation: Theory, algorithms and applications, volume 68. Citeseer, 2007.
[24] Xuhong Li, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. In Thirty-fifth International Conference on Machine Learning, 2018.
[25] Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, and Jun Huan. DELTA: Deep learning transfer using feature map with attention for convolutional networks. arXiv preprint arXiv:1901.09229, 2019.
[26] Alexander H Liu, Yen-Cheng Liu, Yu-Ying Yeh, and Yu-Chiang Frank Wang. A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems, pages 2590–2599, 2018.
[27] Hong Liu, Mingsheng Long, Jianmin Wang, and Michael I Jordan. Towards understanding the transferability of deep representations. arXiv preprint arXiv:1909.12031, 2019.
[28] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.
[29] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2208–2217. JMLR.org, 2017.
[30] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard S Zemel. The variational fair autoencoder. In ICLR, 2016.
[31] Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V Le, and Ruoming Pang. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018.
[32] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[33] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2010.
[34] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
[35] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
[36] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 413–420. IEEE, 2009.
[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[39] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
[40] Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Technical report, Stanford University, Stanford, United States, 1956.
[41] Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488, 2016.
[42] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
[43] Twan Van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
[44] Kafeng Wang, Xitong Gao, Yiren Zhao, Xingjian Li, Dejing Dou, and Cheng-Zhong Xu. Pay attention to features, transfer learn faster CNNs. In International Conference on Learning Representations, 2019.
[45] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[46] Jun Yang, Rong Yan, and Alexander G Hauptmann. Adapting SVM classifiers to data with shifted distributions. In Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pages 69–76. IEEE, 2007.
[47] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[48] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[49] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
[50] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.