Joint Information Preservation for Heterogeneous Domain Adaptation

Peng Xu, Zhaohong Deng, Kup-Sze Choi, Jun Wang, Shitong Wang

School of Digital Media, Jiangnan University, China

Abstract

Domain adaptation aims to assist the modeling tasks of the target domain with knowledge of the source domain. The two domains often lie in different feature spaces due to diverse data collection methods, which leads to the more challenging task of heterogeneous domain adaptation (HDA). A core issue of HDA is how to preserve the information of the original data during adaptation. In this paper, we propose a joint information preservation method to deal with the problem. The method preserves the information of the original data from two aspects. On the one hand, although paired samples often exist between the two domains of the HDA, current algorithms do not utilize such information sufficiently. The proposed method preserves the paired information by maximizing the correlation of the paired samples in the shared subspace. On the other hand, the proposed method improves the strategy of preserving the structural information of the original data, where the local and global structural information are preserved simultaneously. Finally, the joint information preservation is integrated by distribution matching. Experimental results show the superiority of the proposed method over the state-of-the-art HDA algorithms.

1 Introduction

Domain adaptation (DA) utilizes abundant labeled data from the source domain to assist the modeling tasks of a different but related target domain with only few (or no) labeled data [Csurka et al., 2017; Weiss et al., 2016]. According to the consistency of the feature spaces between the two domains, DA can be divided into homogeneous DA and heterogeneous DA (HDA). Based on the availability of labeled data in the target domain, DA can be further divided into unsupervised and semi-supervised DA [Day and Khoshgoftaar, 2017].

Existing DA algorithms mainly follow three strategies: instance-based adaptation [Aljundi et al., 2015], model-based adaptation [Long et al., 2014] and feature-based adaptation [Sun et al., 2016; Yan et al., 2018]. Owing to their flexibility in representation learning and classifier selection, feature-based adaptation methods have attracted extensive attention. Generally speaking, feature-based adaptation consists of two steps: one is to minimize the distance between the projected data of the two domains in the shared subspace; the other is to preserve the information of the original data during adaptation. Hence, existing feature-based adaptation methods differ in three aspects: the way to construct the shared subspace, the way to measure the distance between the two domains, and the way to preserve the information. These aspects are detailed in the following.

For the first aspect, the shared subspace can be constructed through symmetric or asymmetric feature transformation. In the former, the data of the two domains are transformed into an intermediate shared subspace [Zhong et al., 2009], while in the latter the data are usually transformed from one domain to the other [Fernando et al., 2014]. For the second aspect, the distance between the two domains can be measured using the maximum mean discrepancy (MMD) [Li et al., 2018], the Kullback-Leibler divergence [Zhuang et al., 2015] or the Wasserstein distance [Shen et al., 2018]. For the third aspect, the existing methods mainly make various manifold assumptions about the original data for information preservation, such as principal component analysis [Long et al., 2013], locality preserving projections [Wang and Mahadevan, 2011] and discriminative local alignment [Si et al., 2010].

Although these existing methods have achieved good performance, they suffer from two shortcomings in terms of information preservation. First, in many HDA scenarios, paired samples exist between the two domains. Take image-text HDA classification as an example: despite the abundant paired samples on the Internet, where the image and text on the same page are matched, the existing methods hardly utilize such paired information. Second, the structural information preservation adopted by the existing methods is so specific that either the local or the global manifold is used, but not both, which reduces the generalization performance.

To overcome the above challenges, a feature-based adaptation algorithm called joint information preservation (JIP) is proposed in this paper. It aims to tackle tasks in the semi-supervised HDA setting. Unlike the existing methods, which adopt only simple strategies for information preservation during the construction of the shared subspace, the proposed method incorporates the paired information and the structural information preservation into a unified HDA framework. Specifically, JIP constructs a shared subspace by symmetric transformation, and the transformation matrices are optimized through three aspects: 1) the distribution distance of the two domains is minimized in the shared subspace for adaptation, 2) the paired information is preserved by maximizing the correlation of paired samples, and 3) the structural information is preserved with local and global manifold methods simultaneously. The contributions of this paper are highlighted below:

• The paired information and more robust structural information are jointly preserved, which alleviates the problem of information loss during adaptation.
• An HDA framework is proposed by integrating joint information preservation and distribution matching. The optimization is then formulated as a generalized eigenvalue decomposition problem.
• The superiority of the proposed method over state-of-the-art methods is verified with experiments on image classification, action recognition and multimedia retrieval.

2 Related Work

In this section, classical HDA algorithms are first reviewed, followed by research related to the proposed method.

Among feature-based HDA algorithms, [Shi et al., 2010] first proposes a classical HDA framework and its implementation, Hemap, which adopts symmetric transformation to construct the shared subspace; the transformation matrices are optimized by minimizing the distance of the projected data. The algorithm DAMA proposed in [Wang and Mahadevan, 2011] introduces manifold alignment into HDA to preserve the manifold within each domain and to align the manifolds between the domains using labels. Besides, asymmetric transformation is introduced into HDA to yield the algorithm ARC-t [Kulis et al., 2011]. Among model-based HDA methods, MMDT is similar to ARC-t in that it adopts the asymmetric transformation, and it integrates the optimization of a large-margin model to obtain an adaptive SVM [Hoffman et al., 2013]. As an extension of MMDT, the model-based method HFA conducts feature augmentation after feature transformation and increases the similarity of the samples within each domain through the augmented features [Li et al., 2014]. Unlike the previous algorithms, which adapt the transformed features, SHFR adapts the parameters of the trained classifiers [Zhou et al., 2014]. Instance-based adaptation is another important strategy for HDA. CDLS proposed in [Tsai et al., 2016] assigns each sample a weight during the optimization. The samples with nonzero weights are selected as landmarks, and the adaptation is based on these landmarks. [Li et al., 2018] proposes the general HDA framework TIT to integrate distribution matching, manifold learning, sample weighting and feature selection. With the rise of neural networks, deep learning based HDA algorithms have also been proposed. TNT proposed in [Chen et al., 2016] designs a neural network based architecture for HDA. [Wang et al., 2017] utilizes the autoencoder to develop a deep HDA algorithm for better information preservation. A comprehensive review of HDA can be found in [Csurka et al., 2017; Day and Khoshgoftaar, 2017].

Some studies have considered paired samples for DA. They can be divided into two categories. The first category is multi-view DA, which assumes that the data in both domains have multiple views [Hoffman et al., 2012; Yang and Gao, 2013]. Multi-view DA aims to assist the learning tasks of the target domain by utilizing the labeled, multi-view samples in the source domain, which differs from the setting considered in this paper. The second category considers that there is only a single view in each domain and that paired samples exist across the two domains. The work in this paper falls into this category. [Yeh et al., 2014] first leverages paired samples across domains for HDA. Canonical correlation analysis (CCA) is used to construct a correlated subspace for adaptation, and CTSVM is proposed by integrating the classifier optimization. Following the same basic idea as CTSVM, DCA proposed in [Yan et al., 2017] adopts a different optimizer, the alternating direction method of multipliers, to solve the problem. RSP-KCCA proposed in [Mehrkanoon and Suykens, 2018] kernelizes CCA and formulates the problem as a least squares SVM. CTSVM, DCA and RSP-KCCA share a common problem: the construction of the shared subspace for adaptation depends entirely on the paired samples. If only a few paired samples are available, the adaptability of the shared subspace decreases severely.

3 Joint Information Preservation

In this section, the details of the proposed method are presented under the settings of HDA.

In the scene of semi-supervised HDA, the feature spaces of the source domain and the target domain are different. The task is to enhance the performance on the target domain by leveraging the data of the source domain. Denote the data of the source domain as $\mathbf{X}_S = \{\mathbf{x}_i^s\}_{i=1}^{n_s}$ with the corresponding labels $\mathbf{Y}_S = \{y_i^s\}_{i=1}^{n_s}$, and the data of the target domain as $\mathbf{X}_T = \{\mathbf{x}_i^t\}_{i=1}^{n_t}$ with the corresponding pseudo labels $\hat{\mathbf{Y}}_T = \{\hat{y}_i^t\}_{i=1}^{n_t}$, where $\mathbf{X}_S \in \mathbb{R}^{d_s \times n_s}$ and $\mathbf{X}_T \in \mathbb{R}^{d_t \times n_t}$; $d_s$ and $d_t$ are the feature dimensions of the two domains, and $n_s$ and $n_t$ are the numbers of samples of the two domains. Assume that the number of paired samples is $n_p$ with $n_p \le \min\{n_s, n_t\}$; the paired samples can then be represented as $\mathbf{X}_{SP} = \{\mathbf{x}_i^s\}_{i=1}^{n_p}$ and $\mathbf{X}_{TP} = \{\mathbf{x}_i^t\}_{i=1}^{n_p}$. Since the paired samples in the two domains share their labels, the labels of the paired samples are represented as $\mathbf{Y}_{SP} = \{y_i^s\}_{i=1}^{n_p}$ and $\mathbf{Y}_{TP} = \{y_i^t\}_{i=1}^{n_p}$. Following the basic idea of feature-based adaptation, the objective function of the proposed method can be represented as follows,

$$\min_{\phi}\; \underbrace{D(\mathbf{X}_S, \mathbf{X}_T \mid \phi)}_{\text{distribution distance}} + \underbrace{L(\mathbf{X}_{SP}, \mathbf{X}_{TP} \mid \phi) + L(\mathbf{X}_S, \mathbf{X}_T \mid \phi)}_{\text{joint paired and structural information loss}}, \tag{1}$$

where $\phi$ denotes the feature transformation used to construct the shared subspace. The first term of (1) minimizes the distribution distance of the two domains in the shared subspace, and the second term jointly preserves the paired information and the structural information.

3.2 Distribution Matching

In general, the basic step of feature-based HDA is to match the distributions of the two domains in the shared subspace. In this paper, the strategy of joint distribution adaptation (JDA) [Long et al., 2013] is adopted. However, unlike the single shared transformation matrix adopted in JDA, two transformation matrices $\mathbf{A}$ and $\mathbf{B}$ are used in the proposed JIP to bridge the heterogeneous feature spaces. Here, $\mathbf{A} \in \mathbb{R}^{d_s \times m}$, $\mathbf{B} \in \mathbb{R}^{d_t \times m}$, and $m$ is the dimension of the shared subspace. JDA jointly adapts both the marginal and the conditional distributions using MMD. The optimization of JDA in the scene of HDA can be formalized as follows,

$$\min_{\mathbf{A},\mathbf{B}} \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} \mathbf{A}^{T}\mathbf{x}_i^s - \frac{1}{n_t}\sum_{j=1}^{n_t} \mathbf{B}^{T}\mathbf{x}_j^t \right\|_F^2, \tag{2a}$$

$$\min_{\mathbf{A},\mathbf{B}} \sum_{c=1}^{C} \left\| \frac{1}{n_s^c}\sum_{y_i^s = c} \mathbf{A}^{T}\mathbf{x}_i^s - \frac{1}{n_t^c}\sum_{\hat{y}_j^t = c} \mathbf{B}^{T}\mathbf{x}_j^t \right\|_F^2, \tag{2b}$$

where (2a) represents the adaptation of the marginal distribution and (2b) represents the adaptation of the conditional distribution, $C$ is the number of classes, and $n_s^c$ and $n_t^c$ are the numbers of samples in class $c$ in the source and the target domains, respectively. Similar to JDA, an iterative pseudo-label refinement strategy is adopted, which is detailed in Algorithm 1. With $\mathbf{W} = [\mathbf{A}^{T}, \mathbf{B}^{T}]^{T}$, the following objective can be obtained by unifying (2a) and (2b),

$$\min_{\mathbf{W}} \mathrm{Tr}\!\left(\mathbf{W}^{T}\mathbf{X}\Big(\mathbf{M}_0 + \sum_{c=1}^{C}\mathbf{M}_c\Big)\mathbf{X}^{T}\mathbf{W}\right), \tag{3}$$

where $\mathbf{X} = \begin{bmatrix} \mathbf{X}_S & \mathbf{0}_{d_s \times n_t} \\ \mathbf{0}_{d_t \times n_s} & \mathbf{X}_T \end{bmatrix}$. Here, $\mathbf{M}_0$ and $\mathbf{M}_c$, $c = 1, 2, \ldots, C$, are the MMD matrices used for matching the marginal and conditional distributions, respectively. They can be computed in the same way as in JDA [Long et al., 2013].

To preserve the paired information, we use CCA to maximize the correlation of the paired samples. CCA aims to find a pair of projection vectors $\mathbf{a} \in \mathbb{R}^{d_s}$ and $\mathbf{b} \in \mathbb{R}^{d_t}$ that maximize the correlation between the projected data $\mathbf{a}^{T}\mathbf{X}_{SP}$ in the source domain and the projected data $\mathbf{b}^{T}\mathbf{X}_{TP}$ in the target domain. The objective function of CCA is given by

$$\max_{\mathbf{a},\mathbf{b}} \frac{\mathbf{a}^{T}\mathbf{X}_{SP}\mathbf{H}_P\mathbf{X}_{TP}^{T}\mathbf{b}}{\sqrt{\mathbf{a}^{T}\mathbf{X}_{SP}\mathbf{H}_P\mathbf{X}_{SP}^{T}\mathbf{a}}\,\sqrt{\mathbf{b}^{T}\mathbf{X}_{TP}\mathbf{H}_P\mathbf{X}_{TP}^{T}\mathbf{b}}}, \tag{4}$$

where $\mathbf{H}_P$ denotes the centering matrix, which simplifies the calculation of the covariance and variance in (4). Denote the identity matrix as $\mathbf{I}_P \in \mathbb{R}^{n_p \times n_p}$ and the column vector of all ones as $\mathbf{1}_P \in \mathbb{R}^{n_p}$; then $\mathbf{H}_P = \mathbf{I}_P - (1/n_p)\mathbf{1}_P\mathbf{1}_P^{T}$. By optimizing (4), only one pair of projection vectors $\mathbf{a}$ and $\mathbf{b}$ is obtained, and the projected subspace is one-dimensional. To span the projected data in a higher-dimensional space, it is necessary to jointly optimize a group of correlation coefficients, which yields more than one pair of projection vectors, $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m]$ and $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_m]$. The optimization problem in (4) for a group of correlation coefficients can then be formulated as follows,

$$\max_{\mathbf{A},\mathbf{B}} \mathrm{Tr}\!\left(\mathbf{A}^{T}\mathbf{X}_{SP}\mathbf{H}_P\mathbf{X}_{TP}^{T}\mathbf{B}\right) \quad \text{s.t.} \quad \mathbf{A}^{T}\mathbf{X}_{SP}\mathbf{H}_P\mathbf{X}_{SP}^{T}\mathbf{A} = \mathbf{I},\;\; \mathbf{B}^{T}\mathbf{X}_{TP}\mathbf{H}_P\mathbf{X}_{TP}^{T}\mathbf{B} = \mathbf{I}. \tag{5}$$

Rescaling the projection vectors does not affect the solution of (4), which is why (5) can be written as a constrained optimization problem. The most commonly used method to optimize (5) is the method of Lagrange multipliers [Hardoon et al., 2014], with which the projection matrices $\mathbf{A}$ and $\mathbf{B}$ are obtained sequentially.

In the proposed method, we want to integrate the paired information preservation into the framework of distribution matching. Hence, the projection matrices need to be optimized simultaneously rather than sequentially. Given the paired samples in the two domains, $\mathbf{X}_{SP}\mathbf{H}_P\mathbf{X}_{SP}^{T}$ and $\mathbf{X}_{TP}\mathbf{H}_P\mathbf{X}_{TP}^{T}$ are both fixed. Therefore, the effect of the equality constraints in (5) is to limit the scale of the projection vectors so that their directions can be optimized. To optimize the projection matrices simultaneously, (5) is formulated as

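$$\max_{\mathbf{A},\mathbf{B}} \mathrm{Tr}\!\left(\mathbf{A}^{T}\mathbf{X}_{SP}\mathbf{H}_P\mathbf{X}_{TP}^{T}\mathbf{B} + \mathbf{B}^{T}\mathbf{X}_{TP}\mathbf{H}_P\mathbf{X}_{SP}^{T}\mathbf{A}\right). \tag{6}$$

The constraint on $\mathbf{A}$ and $\mathbf{B}$ is discussed in Section 3.5. Since $\mathbf{W} = [\mathbf{A}^{T}, \mathbf{B}^{T}]^{T}$, (6) can be expressed as

$$\max_{\mathbf{W}} \mathrm{Tr}\!\left(\mathbf{W}^{T}\mathbf{C}\mathbf{W}\right), \quad \mathbf{C} = \begin{bmatrix} \mathbf{0}_{d_s \times d_s} & \mathbf{X}_{SP}\mathbf{H}_P\mathbf{X}_{TP}^{T} \\ \mathbf{X}_{TP}\mathbf{H}_P\mathbf{X}_{SP}^{T} & \mathbf{0}_{d_t \times d_t} \end{bmatrix}, \tag{7}$$

where $\mathbf{C}$ is called the correlation matrix. The paired information preservation is thus ultimately formulated as (7).

To preserve the structural information of the original data more effectively, the proposed JIP adopts local and global manifold methods simultaneously. Discriminative manifold methods are adopted to utilize the labels of the data.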
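As a minimal NumPy sketch, the two matrices that do not depend on the pseudo labels, the marginal MMD matrix of (3) and the correlation matrix of (7), can be assembled and checked against the sums they encode. The function names, the toy dimensions and the choice of the first `n_p` columns as the paired samples are assumptions made for illustration only; the marginal MMD matrix follows the JDA construction referred to in the text.

```python
import numpy as np


def marginal_mmd_matrix(n_s: int, n_t: int) -> np.ndarray:
    """Marginal MMD matrix e e^T with e = [1/n_s, ..., 1/n_s, -1/n_t, ..., -1/n_t]."""
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)


def centering_matrix(n: int) -> np.ndarray:
    """Centering matrix H = I - (1/n) 1 1^T."""
    return np.eye(n) - np.ones((n, n)) / n


def block_diag_data(X_s: np.ndarray, X_t: np.ndarray) -> np.ndarray:
    """Block-diagonal data matrix X = [X_S, 0; 0, X_T] used in (3)."""
    d_s, n_s = X_s.shape
    d_t, n_t = X_t.shape
    return np.block([[X_s, np.zeros((d_s, n_t))],
                     [np.zeros((d_t, n_s)), X_t]])


def correlation_matrix(X_sp: np.ndarray, X_tp: np.ndarray) -> np.ndarray:
    """Correlation matrix of (7); X_sp is d_s x n_p and X_tp is d_t x n_p."""
    d_s, d_t = X_sp.shape[0], X_tp.shape[0]
    cross = X_sp @ centering_matrix(X_sp.shape[1]) @ X_tp.T  # d_s x d_t block
    return np.block([[np.zeros((d_s, d_s)), cross],
                     [cross.T, np.zeros((d_t, d_t))]])


# Toy data (hypothetical sizes); the first n_p columns are taken as the pairs.
rng = np.random.default_rng(0)
d_s, d_t, n_s, n_t, n_p, m = 5, 3, 8, 6, 4, 2
X_s, X_t = rng.normal(size=(d_s, n_s)), rng.normal(size=(d_t, n_t))
A, B = rng.normal(size=(d_s, m)), rng.normal(size=(d_t, m))
W = np.vstack([A, B])

X = block_diag_data(X_s, X_t)
M0 = marginal_mmd_matrix(n_s, n_t)
C = correlation_matrix(X_s[:, :n_p], X_t[:, :n_p])

# Trace form of the marginal term versus the explicit mean difference of (2a).
mmd_trace = np.trace(W.T @ X @ M0 @ X.T @ W)
mean_gap = np.sum((A.T @ X_s.mean(axis=1) - B.T @ X_t.mean(axis=1)) ** 2)

# Trace form of (7) versus the two symmetric terms of (6).
H = centering_matrix(n_p)
paired_corr = np.trace(A.T @ X_s[:, :n_p] @ H @ X_t[:, :n_p].T @ B)
corr_trace = np.trace(W.T @ C @ W)

print(np.isclose(mmd_trace, mean_gap), np.isclose(corr_trace, 2 * paired_corr))
```

Both checks compare a trace expression with the quantity it is meant to encode, so they hold for any choice of `A` and `B`, not only for the random matrices used here.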

Local Structure Preservation

To preserve the local manifold structure of the data, locality preserving projections (LPP) [He and Niyogi, 2003] is introduced. LPP is a linear approximation of Laplacian eigenmaps [Belkin and Niyogi, 2001], in which the neighborhood structure of the original samples is retained in the shared subspace. The objective function of LPP is

$$\min_{\mathbf{A},\mathbf{B}} \sum_{i,j=1}^{n_s+n_t} \left\| \mathbf{z}_i - \mathbf{z}_j \right\|^2 \mathbf{W}^{L}_{ij}, \quad \mathbf{z}_l = \mathbf{W}^{T}\mathbf{X}_l, \tag{8}$$

where $l = 1, 2, \ldots, (n_s + n_t)$ is the index of the projected samples and $\mathbf{X}_l$ is the $l$-th column of $\mathbf{X}$. $\mathbf{W}^{L}$ is the adjacency matrix, where $\mathbf{W}^{L}_{ij}$ is the distance measure between the samples $\mathbf{x}_i$ and $\mathbf{x}_j$. Define $\mathbf{D}$ as a diagonal matrix with $\mathbf{D}_{ii} = \sum_{j}\mathbf{W}^{L}_{ij}$ and $\mathbf{L}$ as the Laplacian matrix with $\mathbf{L} = \mathbf{D} - \mathbf{W}^{L}$. The objective function in trace form can be derived from (8) as follows,

$$\min_{\mathbf{W}} \mathrm{Tr}\!\left(\mathbf{W}^{T}\mathbf{X}\mathbf{L}\mathbf{X}^{T}\mathbf{W}\right). \tag{9}$$

The Laplacian matrix $\mathbf{L}$ can be calculated from a given $\mathbf{W}^{L}$, which is constructed from the distance between each pair of samples. There are many ways to measure the distance between samples, such as Euclidean distance, cosine similarity, local neighborhood relationship and label information. To leverage the label information, $\mathbf{W}^{L}$ is constructed in this paper using the cosine distance in a discriminative manner [Li et al., 2018].

Global Structure Preservation

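Besides the local structure, preservation of the global structural information is also important when the data structure is unknown. In this paper, linear discriminant analysis is adopted to preserve the global structural information, i.e., minimizing the within-class scatter and maximizing the between-class scatter. The objective functions are given by

$$\max_{\mathbf{W}} \mathrm{Tr}\!\left(\mathbf{W}^{T}\mathbf{S}_b\mathbf{W}\right), \quad \mathbf{S}_b = \begin{bmatrix} \mathbf{S}_{sb} & \mathbf{0}_{d_s \times d_t} \\ \mathbf{0}_{d_t \times d_s} & \mathbf{S}_{tb} \end{bmatrix}, \tag{10a}$$

$$\min_{\mathbf{W}} \mathrm{Tr}\!\left(\mathbf{W}^{T}\mathbf{S}_w\mathbf{W}\right), \quad \mathbf{S}_w = \begin{bmatrix} \mathbf{S}_{sw} & \mathbf{0}_{d_s \times d_t} \\ \mathbf{0}_{d_t \times d_s} & \mathbf{S}_{tw} \end{bmatrix}, \tag{10b}$$

where $\mathbf{S}_b$ and $\mathbf{S}_w$ are the between-class and within-class scatter matrices, respectively. $\mathbf{S}_{sb}$ and $\mathbf{S}_{sw}$ are the scatter matrices of the data in the source domain; $\mathbf{S}_{tb}$ and $\mathbf{S}_{tw}$ are the scatter matrices of the data in the target domain. They are calculated as follows,

$$\mathbf{S}_{sw} = \sum_{i=1}^{C} \mathbf{X}_S^{i}\mathbf{H}_S^{i}(\mathbf{X}_S^{i})^{T}, \tag{11a}$$

$$\mathbf{S}_{sb} = \sum_{i=1}^{C} m_s^{i}(\boldsymbol{\mu}_s^{i} - \boldsymbol{\mu}_s)(\boldsymbol{\mu}_s^{i} - \boldsymbol{\mu}_s)^{T}, \tag{11b}$$

$$\mathbf{S}_{tw} = \sum_{i=1}^{C} \mathbf{X}_T^{i}\mathbf{H}_T^{i}(\mathbf{X}_T^{i})^{T}, \tag{11c}$$

$$\mathbf{S}_{tb} = \sum_{i=1}^{C} m_t^{i}(\boldsymbol{\mu}_t^{i} - \boldsymbol{\mu}_t)(\boldsymbol{\mu}_t^{i} - \boldsymbol{\mu}_t)^{T}, \tag{11d}$$

where the subscripts $S$, $s$ and $T$, $t$ denote the data in the source domain and the target domain, respectively. $\mathbf{X}_S^{i}$ and $\mathbf{X}_T^{i}$ are the data matrices of the $i$-th class; $m_s^{i}$ and $m_t^{i}$ are the numbers of samples belonging to the $i$-th class. $\mathbf{H}_S^{i}$ and $\mathbf{H}_T^{i}$ are the centering matrices for the samples belonging to the $i$-th class. They are calculated in the same way as $\mathbf{H}_P$ in (4), except that $n_p$ is replaced by $m_s^{i}$ or $m_t^{i}$. $\boldsymbol{\mu}_s^{i}$ and $\boldsymbol{\mu}_t^{i}$ are the means of the samples belonging to the $i$-th class. $\boldsymbol{\mu}_s$ and $\boldsymbol{\mu}_t$ are the means of all the samples in the source and target domains, respectively.

Through the integration of the objective functions (3), (7), (9) and (10), the overall objective function can be obtained by introducing the regularization parameters $\alpha$, $\beta$ and $\gamma$ to balance the preservation of local structure, paired information and global structure:

$$\min_{\phi}\; \underbrace{D(\mathbf{X}_S, \mathbf{X}_T \mid \phi)}_{\text{distribution distance}} + \underbrace{L(\mathbf{X}_{SP}, \mathbf{X}_{TP} \mid \phi) + L(\mathbf{X}_S, \mathbf{X}_T \mid \phi)}_{\text{joint paired and structural information loss}} = \min_{\mathbf{W}} \frac{\mathrm{Tr}\!\left(\mathbf{W}^{T}\Big(\mathbf{X}\big(\mathbf{M}_0 + \sum_{c=1}^{C}\mathbf{M}_c + \alpha\mathbf{L}\big)\mathbf{X}^{T} + \gamma\mathbf{S}_w\Big)\mathbf{W}\right)}{\mathrm{Tr}\!\left(\mathbf{W}^{T}(\beta\mathbf{C} + \gamma\mathbf{S}_b)\mathbf{W}\right)}. \tag{12}$$

Note that rescaling $\mathbf{W}$ does not affect the solution of (12). The denominator of (12) is therefore treated as a constraint so that the optimization has a unique solution. At the same time, the issue left open in (6) is resolved, because the constraint on the projection matrices in (5), which was temporarily discarded, is reinstated. Hence, the problem is transformed into the following objective function,

$$\min_{\mathbf{W}} \mathrm{Tr}\!\left(\mathbf{W}^{T}\Big(\mathbf{X}\big(\mathbf{M}_0 + \sum_{c=1}^{C}\mathbf{M}_c + \alpha\mathbf{L}\big)\mathbf{X}^{T} + \gamma\mathbf{S}_w\Big)\mathbf{W}\right) \quad \text{s.t.} \quad \mathrm{Tr}\!\left(\mathbf{W}^{T}(\beta\mathbf{C} + \gamma\mathbf{S}_b)\mathbf{W}\right) = 1. \tag{13}$$

Using the Lagrange function, (13) can be optimized to give (14) as follows,

$$\mathcal{L} = \mathrm{Tr}\!\left(\mathbf{W}^{T}\Big(\mathbf{X}\big(\mathbf{M}_0 + \sum_{c=1}^{C}\mathbf{M}_c + \alpha\mathbf{L}\big)\mathbf{X}^{T} + \gamma\mathbf{S}_w\Big)\mathbf{W}\right) + \mathrm{Tr}\!\left(\big(\mathbf{I} - \mathbf{W}^{T}(\beta\mathbf{C} + \gamma\mathbf{S}_b)\mathbf{W}\big)\boldsymbol{\Phi}\right), \tag{14}$$

where $\boldsymbol{\Phi} = \mathrm{diag}(\phi_1, \phi_2, \ldots, \phi_m)$ contains the Lagrange multipliers and $m$ is the dimension of the shared subspace. By setting $\partial\mathcal{L}/\partial\mathbf{W} = \mathbf{0}$, the following equation is obtained,

$$\Big(\mathbf{X}\big(\mathbf{M}_0 + \sum_{c=1}^{C}\mathbf{M}_c + \alpha\mathbf{L}\big)\mathbf{X}^{T} + \gamma\mathbf{S}_w\Big)\mathbf{W} = (\beta\mathbf{C} + \gamma\mathbf{S}_b)\mathbf{W}\boldsymbol{\Phi}. \tag{15}$$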
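Equation (15) has the form of a generalized eigenvalue problem, with one matrix on the left and another on the right. The following is a minimal NumPy sketch of extracting the eigenvectors for the smallest eigenvalues of such a problem. It assumes that the right-hand matrix, called `Q` here, is symmetric positive definite, which is not guaranteed for the right-hand side of (15) because the correlation matrix is indefinite in general; the function name and the random test matrices are hypothetical.

```python
import numpy as np


def smallest_generalized_eigvecs(P: np.ndarray, Q: np.ndarray, m: int):
    """Solve P w = lambda Q w and return the m smallest eigenpairs.

    Assumes P is symmetric and Q is symmetric positive definite
    (an assumption of this sketch, not a property stated in the paper).
    """
    L = np.linalg.cholesky(Q)                  # Q = L L^T
    Linv_P = np.linalg.solve(L, P)             # L^{-1} P
    S = np.linalg.solve(L, Linv_P.T).T         # L^{-1} P L^{-T}
    S = (S + S.T) / 2                          # remove rounding asymmetry
    eigvals, V = np.linalg.eigh(S)             # eigenvalues in ascending order
    W = np.linalg.solve(L.T, V[:, :m])         # map back: w = L^{-T} v
    return eigvals[:m], W


# Random stand-ins for the left-hand and right-hand matrices of (15).
rng = np.random.default_rng(1)
G, F = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
P = G @ G.T                 # symmetric positive semidefinite
Q = F @ F.T + np.eye(6)     # symmetric positive definite

lam, W = smallest_generalized_eigvecs(P, Q, m=3)
print(lam)
print(np.allclose(P @ W, Q @ W @ np.diag(lam)))
```

With this reduction the returned columns satisfy `W.T @ Q @ W = I`, which is the matrix form of the normalization used in the Lagrange function. When the right-hand matrix is not positive definite, a general-purpose generalized eigensolver would be needed instead.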

Hence, the optimization problem in (12) is transformed into the generalized eigenvalue decomposition problem in (15). Finding the optimal $\mathbf{W}$ then reduces to solving (15) for the $m$ smallest eigenvalues; the corresponding eigenvectors constitute the projection matrix $\mathbf{W}$. The procedure is summarized in Algorithm 1.

Algorithm 1: JIP

Input: Data of the two domains; dimension of the shared subspace; regularization parameters; number of iterations $T$.

Output: Labels of the unlabeled data in the target domain.

JIP Procedure:
1: Label the unlabeled data in the target domain using a classifier pre-trained on the labeled data in the target domain.
2: Calculate $\mathbf{M}_0$ and $\mathbf{C}$ in (3) and (7), respectively.
3: For $t = 1, 2, \ldots, T$ do
4: Update $\mathbf{M}_c$ in (3).
5: Update $\mathbf{L}$ in (9).
6: Update $\mathbf{S}_b$ and $\mathbf{S}_w$ in (10a) and (10b).
7: Update $\mathbf{W}$ based on (15) using generalized eigenvalue decomposition.
8: Calculate the data in the shared subspace.
9: Train the classifier using the new data, and update the pseudo labels $\hat{\mathbf{Y}}_T = \{\hat{y}_i^t\}_{i=1}^{n_t}$ for the data in the target domain.
10: end for

4 Experiments

The proposed method is evaluated on three datasets: Caltech-Office, IXMAS and WIKI.

Caltech-Office is an image classification dataset composed of Caltech and Office. The Office dataset, containing 31 classes, comes from three different sources, i.e., AMAZON (A), Webcam (W) and DSLR (D). The Caltech (C) dataset contains 256 classes. In the experiments, the four sources A, W, D and C are treated as four small datasets, and the ten classes common to these four datasets are used. Two types of features, SURF and DeCAF, are extracted for all the images in a way similar to that in [Tsai et al., 2016]. These two types of features are regarded as two views. To construct HDA tasks for the experiments, one view is regarded as the source domain and the other as the target domain. Eight tasks are thus constructed, as shown in Table 1. Taking A-D2S as an example, it represents the adaptation from the source domain DeCAF to the target domain SURF on dataset A.

IXMAS is an action recognition dataset containing eleven classes. There are 36 samples for each class of action. The actions are captured by five cameras, and each camera is treated as a view. Similar to the processing in [Mehrkanoon and Suykens, 2018], the samples are transformed into 1000-dimensional vectors. In the experiments, samples from any two cameras are used to construct HDA tasks. Since a camera can be treated as either the source domain or the target domain, 20 tasks can be constructed with the five cameras.

WIKI is an image-text dataset, where each sample contains an image and the corresponding text description. Following [Mehrkanoon and Suykens, 2018], the images are represented as 128-dimensional vectors using the Scale Invariant Feature Transform, and the texts are represented as 10-dimensional vectors using Latent Dirichlet Allocation. In the experiments, five classes are selected, each containing 100 samples. Either the image or the text can be treated as the source domain, with the other as the target domain. Therefore, two HDA tasks, img2txt and txt2img, are constructed, where img2txt represents adaptation from image to text and txt2img the reverse.
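The task counts quoted above (eight, 20 and two) follow from taking every ordered source/target pair of views. A short sketch of this enumeration is given below; the task-name pattern mirrors names such as A-D2S and img2txt used in the text, while the variable names and the camera numbering 1 to 5 are assumptions made for illustration.

```python
from itertools import permutations

# Caltech-Office: four datasets, two views (S = SURF, D = DeCAF).
datasets = ["A", "W", "D", "C"]
views = ["S", "D"]
caltech_office_tasks = [
    f"{name}-{src}2{tgt}"
    for name in datasets
    for src, tgt in permutations(views, 2)   # ordered (source, target) pairs
]

# IXMAS: five cameras, each ordered pair of distinct cameras is one task.
ixmas_tasks = list(permutations(range(1, 6), 2))

# WIKI: image and text modalities in both directions.
wiki_tasks = [f"{src}2{tgt}" for src, tgt in permutations(["img", "txt"], 2)]

print(len(caltech_office_tasks), len(ixmas_tasks), len(wiki_tasks))
```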

In the experiments, seven algorithms are compared with the proposed JIP. The baseline method is SVM_t, which trains an SVM using only the labeled data in the target domain. The other six methods are all state-of-the-art HDA methods: MMDT [Hoffman et al., 2013], CTSVM [Yeh et al., 2014], semi-supervised HFA (SHFA) [Li et al., 2014], CDLS [Tsai et al., 2016], TNT [Chen et al., 2016], and TIT [Li et al., 2018]. For all the methods, the number of iterations is set to 5 and the dimension of the shared subspace is set to 100; the optimal regularization parameters involved are searched from the set {0, 0.01, 0.1, 1, 10, 100}.

For the experiments on the Caltech-Office and IXMAS datasets, 30% of the samples of each domain are selected to constitute paired samples. For the WIKI dataset, 10%, 20%, 30% and 40% paired samples are constructed to demonstrate the performance of JIP with different proportions of paired samples. In the target domain, only the paired samples are labeled and the rest are unlabeled.

The results of the experiments on the Caltech-Office dataset are shown in Table 1. The proposed method achieves the best or competitive performance on most of the tasks, and it ranks first in average accuracy over the eight tasks. The results on the IXMAS dataset are shown in Figure 1. The proposed method also exceeds all the other algorithms in average accuracy over the 20 tasks, achieving the highest accuracy of 80.38%.
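The parameter search described in the setup can be sketched as an exhaustive grid search. For a method with three regularization parameters, the six-value set gives 6^3 = 216 configurations per task. In the sketch below, the parameter names `alpha`, `beta` and `gamma`, the `grid_search` helper and the `toy_score` function are hypothetical; `toy_score` only stands in for training the method and measuring its accuracy, which the text does not specify in detail.

```python
from itertools import product

GRID = [0, 0.01, 0.1, 1, 10, 100]


def grid_search(evaluate, grid=GRID, n_params=3):
    """Return the best parameter tuple and its score under `evaluate`."""
    best_params, best_score = None, float("-inf")
    for params in product(grid, repeat=n_params):
        score = evaluate(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(alpha, beta, gamma):
    """Hypothetical stand-in for an accuracy measurement; peaks at (0.1, 1, 10)."""
    return -((alpha - 0.1) ** 2 + (beta - 1) ** 2 + (gamma - 10) ** 2)


n_configs = len(GRID) ** 3
best_params, best_score = grid_search(toy_score)
print(n_configs, best_params)
```

Because every configuration requires a full run of the method, the cost grows exponentially with the number of parameters.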

Table 1: Accuracy of algorithms on the Caltech-Office dataset (%)

Tasks SVM_t MMDT CTSVM SHFA CDLS TNT TIT JIP
A-D2S 66.77 66.77 69.00 70.64 67.96 71.41
C-D2S 55.27 54.51 52.10 55.02 53.37 56.17 57.30
C-S2D 87.47 86.66 86.66 89.58
D-S2D 80.00 95.45 88.18 89.09 94.55 94.31 93.64
W-D2S 69.08 74.88 76.33 74.88 75.85 79.46
Average 74.23 78.61 76.97 78.70 79.39 80.59 81.24

Fig. 1: Accuracy of algorithms on the IXMAS dataset

The results on the WIKI dataset are shown in Table 2, where different proportions of paired samples are selected for the experiments. It can be seen that the performance of the algorithms improves as the proportion of paired samples increases. JIP achieves the best result in average accuracy over all the tasks.

The proposed method is further studied by analyzing the convergence and dimensionality, and also the effectiveness of the paired and the structural information preservation.

Analysis of Convergence and Dimensionality

Two important parameters of the proposed JIP are the number of iterations and the number of dimensions. Fig. 2(a) and Fig. 2(b) show the variation in accuracy of JIP on the Caltech-Office dataset with the number of iterations and the number of dimensions, respectively. For the sake of clarity, the accuracy curve of each task is shifted up or down as a whole, which does not affect the trend analysis. As shown in Fig. 2(a), the proposed method demonstrates good convergence. It can be seen from Fig. 2(b) that the trend over the number of dimensions differs from task to task, and the highest accuracy does not necessarily occur at the highest dimensionality, i.e., 100 in the experiments. With all the other parameters fixed and the number of dimensions varied from 10 to 100 in steps of 10, the average accuracy of the proposed method on the eight tasks is 82.30%.

Effectiveness of Information Preservation

The effectiveness of information preservation of the proposed method is analyzed from four aspects. In Fig. 3(a)-(c), the regularization parameter of the corresponding term in (12) is set either to 0 or to its optimized value, with the other parameters fixed. It can be seen that the accuracy increases for almost all the tasks when the parameters are non-zero, which demonstrates the effectiveness of the information preservation. In Fig. 3(d), the regularization parameters for the local and global structure preservation terms are both set to 0. The accuracy improvement in Fig. 3(d) is significantly higher than that in Fig. 3(b) and (c), which verifies the effectiveness of hierarchical structure preservation compared with individual local or global structure preservation.

5 Conclusions

This paper proposes a new heterogeneous domain adaptation method that jointly preserves the paired information and the structural information. Further, an HDA framework is proposed to integrate the joint information preservation with distribution matching, which can effectively alleviate the problem of information loss during adaptation. A disadvantage is that the settings of the regularization parameters depend on grid search, which is time consuming. A more adaptive strategy for setting the parameters will be explored in our future work.

Table 2: Accuracy of algorithms on the WIKI dataset (%)

Tasks SVM_t MMDT CTSVM SHFA CDLS TNT TIT JIP
10% img2txt 92.67 94.89 88.89 91.56 94.67 93.91 92.67
10% txt2img 41.11 35.56 44.22 44.22 40.22 46.02 44.44
20% img2txt 94.25 95.75 87.75 91.00 95.25 92.63 94.25
20% txt2img 43.25 37.00 43.50 47.25 40.25 48.08 47.25
30% img2txt 93.57 96.29 91.14 94.57 94.57 94.34 96.00
30% txt2img 49.14 37.14 46.29 52.86 48.29 51.17 50.57
40% img2txt 96.33 96.33 91.67 96.00 95.00 96.07

Average 69.96 66.00 67.68 71.35 70.07 71.65 71.73

Fig. 2: Parameter analysis. (a) Convergence analysis; (b) Dimensionality analysis.

Fig. 3: Effectiveness analysis of information preservation. (a) Paired information; (b) local structure; (c) global structure; (d) hierarchical structure.

References

[Aljundi et al., 2015] R. Aljundi, R. Emonet, D. Muselet, and M. Sebban. Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In CVPR, 56-63, 2015.
[Belkin and Niyogi, 2001] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 585-591, 2001.
[Chen et al., 2016] W. Chen, T. H. Hsu, Y. H. Tsai, Y. F. Wang, and M. Chen. Transfer neural trees for heterogeneous domain adaptation. In ECCV, 399-414, 2016.
[Csurka et al., 2017] G. Csurka. Domain adaptation for visual applications: a comprehensive survey. CoRR abs/1702.05374, 2017.
[Day and Khoshgoftaar, 2017] O. Day and T. M. Khoshgoftaar. A survey on heterogeneous transfer learning. Journal of Big Data, 4:29, 2017.
[Fernando et al., 2014] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Subspace alignment for domain adaptation. CoRR abs/1409.5241, 2014.
[Li et al., 2014] W. Li, L. Duan, D. Xu, and I. W. Tsang. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. TPAMI, 36:1134-1148, 2014.
[Li et al., 2018] J. Li, K. Lu, Z. Huang, L. Zhu, and H. T. Shen. Transfer independently together: a generalized framework for domain adaptation. TCYB, doi: 10.1109/TCYB.2018.2820174, 2018.
[Long et al., 2013] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In ICCV, 2200-2207, 2013.
[Long et al., 2014] M. Long, J. Wang, G. Ding, S. J. Pan, and P. S. Yu. Adaptation regularization: a general framework for transfer learning. TKDE, 26:1076-1089, 2014.
[Mehrkanoon and Suykens, 2018] S. Mehrkanoon and J. A. K. Suykens. Regularized semipaired kernel CCA for domain adaptation. TNNLS, 29:3199-3213, 2018.
[Hardoon et al., 2014] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: an overview with application to learning methods. Neural Computation, 16(12):2639-2664, 2014.
[He and Niyogi, 2003] X. He and P. Niyogi. Locality preserving projections. In NIPS, 153-160, 2003.
[Hoffman et al., 2012] J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discovering latent domains for multisource domain adaptation. In ECCV, 702-715, 2012.
[Hoffman et al., 2013] J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko. Efficient learning of domain-invariant image representations. In ICLR, 2013.
[Kulis et al., 2011] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: domain adaptation using asymmetric kernel transforms. In CVPR, 1785-1792, 2011.
[Shen et al., 2018] J. Shen, Y. Qu, W. Zhang, and Y. Yu. Wasserstein distance guided representation learning for domain adaptation. In AAAI, 4058-4065, 2018.
[Shi et al., 2010] X. Shi, Q. Liu, W. Fan, P. S. Yu, and R. Zhu. Transfer learning on heterogeneous feature spaces via spectral transformation. In ICDM, 1049-1054, 2010.
[Si et al., 2010] S. Si, D. Tao, and B. Geng. Bregman divergence-based regularization for transfer subspace learning. TKDE, 2010.
[Sun et al., 2016] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2058-2065, 2016.
[Tsai et al., 2016] Y. H. Tsai, Y. Yeh, and Y. F. Wang. Learning cross-domain landmarks for heterogeneous domain adaptation. In CVPR, 5081-5090, 2016.
[Wang et al., 2017] X. Wang, Y. Ma, Y. Cheng, L. Zou, and J. J. P. C. Rodrigues. Heterogeneous domain adaptation network based on autoencoder. Journal of Parallel and Distributed Computing, 117:281-291, 2017.
[Wang and Mahadevan, 2011] C. Wang and S. Mahadevan. Heterogeneous domain adaptation using manifold alignment. In IJCAI, 1541-1546, 2011.
[Weiss et al., 2016] K. Weiss, T. M. Khoshgoftaar, and D. D. Wang. A survey of transfer learning. Journal of Big Data, 3:9, 2016.
[Yan et al., 2017] Y. Yan, W. Li, M. Ng, M. Tan, H. Wu, and H. Min. Learning discriminative correlation subspace for heterogeneous domain adaptation. In IJCAI, 3252-3258, 2017.
[Yan et al., 2018] K. Yan, L. Kou, and D. Zhang. Learning domain-invariant subspace using domain features and independence maximization. TCYB, 48:288-299, 2018.
[Yang and Gao, 2013] P. Yang and W. Gao. Multi-view discriminant transfer learning. In IJCAI, 1848-1854, 2013.
[Yeh et al., 2014] Y. Yeh, C. Huang, and Y. F. Wang. Heterogeneous domain adaptation and classification by exploiting the correlation subspace. TIP, 23:2009-2018, 2014.
[Zhong et al., 2009] E. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, and D. S. Turaga. Cross domain distribution adaptation via kernel mapping. In KDD, 1027-1036, 2009.
[Zhou et al., 2014] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan. Heterogeneous domain adaptation for multiple classes. In AISTATS, 1095-1103, 2014.
[Zhuang et al., 2015] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He. Supervised representation learning: transfer learning with deep autoencoders. in