Domain Adaptation for Visual Applications: A Comprehensive Survey
Gabriela Csurka
Abstract
The aim of this paper is to give an overview of domain adaptation and transfer learning with a specific view on visual applications. After a general motivation, we first position domain adaptation in the larger transfer learning problem. Second, we try to address and analyze briefly the state-of-the-art methods for different types of scenarios, first describing the historical shallow methods, addressing both the homogeneous and the heterogeneous domain adaptation methods. Third, we discuss the effect of the success of deep convolutional architectures, which led to a new type of domain adaptation methods that integrate the adaptation within the deep architecture. Fourth, we overview the methods that go beyond image categorization, such as object detection, image segmentation, video analysis or learning visual attributes. Finally, we conclude the paper with a section where we relate domain adaptation to other machine learning solutions.

While huge volumes of unlabeled data are generated and made available in many domains, the cost of acquiring data labels remains high. To overcome the burden of annotation, alternative solutions have been proposed in the literature in order to exploit the unlabeled data (referred to as semi-supervised learning), or data and/or models available in similar domains (referred to as transfer learning). Domain Adaptation (DA) is a particular case of transfer learning (TL) that leverages labeled data in one or more related source domains to learn a classifier for unseen or unlabeled data in a target domain. In general it is assumed that the task is the same, i.e. class labels are shared between domains. The source domains are assumed to be related to the target domain, but not identical; when they are identical, it becomes a standard machine learning (ML) problem where we assume that the test data is drawn from the same distribution as the training data. When this assumption is not verified, i.e. the distributions of training and test sets do not match, the performance at test time can be significantly degraded.
Xerox Research Center Europe, 6 chemin Maupertuis, 38240 Meylan, France, e-mail:
[email protected]

Book chapter to appear in "Domain Adaptation in Computer Vision Applications", Springer Series: Advances in Computer Vision and Pattern Recognition, Edited by Gabriela Csurka.
Fig. 1
Example scenarios with domain adaptation needs.
In visual applications, such distribution differences, called domain shift, are common in real-life applications. They can be consequences of changing conditions, i.e. background, location or pose changes, but the domain mismatch might be more severe when, for example, the source and target domains contain images of different types, such as photos, NIR images, paintings or sketches [1, 2, 3, 4]. Service provider companies are especially concerned since, for the same service (task), the distribution of the data may vary a lot from one customer to another. In general, machine learning components of service solutions that are re-deployed from a given customer or location to a new customer or location require specific customization to accommodate the new conditions. For example, in brand sentiment management it is critical to tune the models to the way users talk about their experience given the different products. In surveillance and urban traffic understanding, models pretrained on previous locations might need adjustment to the new environment. All these entail either the acquisition of annotated data in the new field or the calibration of the pretrained models to achieve the contractual performance in the new situation. However, the former solution, i.e. data labeling, is expensive and time consuming due to the significant amount of human effort involved. Therefore, the second option is preferred when possible. This can be achieved either by adapting the pretrained models taking advantage of the unlabeled (and, if available, labeled) target set or by building the target model exploiting both the previously acquired labeled source data and the new unlabeled target data together.

Numerous approaches have been proposed in the last years to address adaptation needs that arise in different application scenarios (see a few examples in Figure 1).
Examples include DA and TL solutions for named entity recognition and opinion extraction across different text corpora [5, 6, 7, 8], multilingual text
Fig. 2
An overview of different transfer learning approaches. (Image: Courtesy to S.J. Pan [37].)

classification [9, 10, 11], sentiment analysis [12, 13], WiFi-based localization [14], speech recognition across different speakers [15, 16], object recognition in images acquired in different conditions [17, 18, 19, 20, 21], video concept detection [22], video event recognition [23], activity recognition [24, 25], human motion parsing from videos [26], face recognition [27, 28, 29], facial landmark localization [30], facial action unit detection [31], 3D pose estimation [32], document categorization across different customer datasets [33, 34, 35], etc.

In this paper, we mainly focus on domain adaptation methods applied to visual tasks. For a broader review of the transfer learning literature as well as for approaches specifically designed to solve non-visual tasks, e.g. text or speech, please refer to [36].

The rest of the paper is organized as follows. In Section 2 we define more formally transfer learning and domain adaptation. In Section 3 we review shallow DA methods that can be applied on visual features extracted from the images, both in the homogeneous and heterogeneous case. Section 4 addresses more recent deep DA methods and Section 5 describes DA solutions proposed for computer vision applications beyond image classification. In Section 6 we relate DA to other transfer learning and standard machine learning approaches, and in Section 7 we conclude the paper.
In this section, we follow the definitions and notation of [37, 36]. Accordingly, a domain D is composed of a d-dimensional feature space X ⊂ R^d with a marginal probability distribution P(X), and a task T defined by a label space Y and the conditional probability distribution P(Y|X), where X and Y are random variables. Given a particular sample set X = {x_1, ..., x_n} of X, with corresponding labels Y = {y_1, ..., y_n} from Y, P(Y|X) can in general be learned in a supervised manner from these feature-label pairs {x_i, y_i}.

Let us assume that we have two domains with their related tasks: a source domain D_s = {X_s, P(X_s)} with T_s = {Y_s, P(Y_s|X_s)} and a target domain D_t = {X_t, P(X_t)} with T_t = {Y_t, P(Y_t|X_t)}. If the two domains correspond, i.e. D_s = D_t and T_s = T_t, traditional ML methods can be used to solve the problem, where D_s becomes the training set and D_t the test set. When this assumption does not hold, i.e. D_t ≠ D_s or T_t ≠ T_s, the models trained on D_s might perform poorly on D_t, or they are not applicable directly if T_t ≠ T_s. When the source domain is somewhat related to the target, it is possible to exploit the related information from {D_s, T_s} to learn P(Y_t|X_t). This process is known as transfer learning (TL).

We distinguish between homogeneous TL, where the source and target are represented in the same feature space, X_t = X_s, with P(X_t) ≠ P(X_s) due to domain shift, and heterogeneous TL, where the source and target data can have different representations, X_t ≠ X_s (or they can even be of different modalities, such as image vs. text).

Based on these definitions, [37] categorizes the TL approaches into three main groups depending on the different situations concerning source and target domains and the corresponding tasks. These are the inductive TL, transductive TL and unsupervised TL (see Figure 2).
The inductive TL is the case where the target task is different but related to the source task, no matter whether the source and target domains are the same or not. It requires at least some labeled target instances to induce a predictive model for the target data. In the case of the transductive TL, the source and target tasks are the same, and either the source and target data representations are different (X_t ≠ X_s) or the source and target distributions are different due to selection bias or distribution mismatch. Finally, the unsupervised TL refers to the case where both the domains and the tasks are different but somewhat related. In general, labels are available neither for the source nor for the target, and the focus is on exploiting the (unlabeled) information in the source domain to solve an unsupervised learning task in the target domain. These tasks include clustering, dimensionality reduction and density estimation [38, 39].

According to this classification, DA methods are transductive TL solutions, where it is assumed that the tasks are the same, i.e. T_t = T_s. In general they refer to a categorization task, where both the set of labels and the conditional distributions are assumed to be shared between the two domains, i.e. Y_s = Y_t and P(Y|X_t) = P(Y|X_s). However, the second assumption is rather strong and does not always hold in real-life applications. Therefore, the definition of domain adaptation is relaxed to the case where only the first assumption is required, i.e. Y_s = Y_t = Y.

In the DA community, we further distinguish between the unsupervised (US) case, where the labels are available only for the source domain, and the semi-supervised (SS) case, where a small set of target examples are labeled. Note also that the unsupervised DA is not related to the unsupervised TL, for which no source labels are available and in general the task to be solved is unsupervised.
Fig. 3
Illustration of the effect of instance re-weighting of the source samples on the source classifier. (Image: Courtesy to M. Long [40].)
In this section, we review shallow DA methods that can be applied on vectorial visual features extracted from images. First, in Section 3.1 we survey homogeneous DA methods, where the feature representation for the source and target domains is the same, X_t = X_s with P(X_t) ≠ P(X_s), and the tasks are shared, Y_s = Y_t. Then, in Section 3.2 we discuss methods that can efficiently exploit several source domains. Finally, in Section 3.3 we discuss the heterogeneous case, where the source and target data have different representations.

Instance re-weighting methods.
The DA case where we assume that the conditional distributions are shared between the two domains, i.e. P(Y|X_s) = P(Y|X_t), is often referred to as dataset bias or covariate shift [41]. In this case, one could simply apply the model learned on the source to estimate P(Y|X_t). However, as P(X_s) ≠ P(X_t), the source model might yield a poor performance when applied on the target set despite the underlying P(Y|X_s) = P(Y|X_t) assumption. The most popular early solutions proposed to overcome this are based on instance re-weighting (see Figure 3 for an illustration).

To compute the weight of an instance, early methods proposed to estimate the ratio between the likelihoods of being a source or target example. This can be done either by estimating the likelihoods independently using a domain classifier [42] or by approximating directly the ratio between the densities with a Kullback-Leibler Importance Estimation Procedure [43, 44]. However, one of the most popular measures used to weight data instances, used for example in [45, 46, 14], is the Maximum Mean Discrepancy (MMD) [47] computed between the data distributions in the two domains.

The method proposed in [48] infers re-sampling weights through maximum entropy density estimation. [41] improves predictive inference under covariate shift by weighting the log-likelihood function. The Importance Weighted Twin Gaussian Processes [32] directly learns the importance weight function, without going through density estimation, by using relative unconstrained least-squares importance fitting. The Selective Transfer Machine [31] jointly optimizes the weights as well as the classifier's parameters to preserve the discriminative power of the new decision boundary.
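To give a rough feel for how the MMD quantifies the discrepancy between two domains, the following NumPy sketch computes a biased empirical estimate of the squared MMD with an RBF kernel. The function names, the kernel bandwidth choice and the toy data are ours, not part of any of the cited methods:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased empirical estimate of the squared Maximum Mean Discrepancy
    # between a source sample Xs and a target sample Xt.
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (200, 2))   # source sample
Xt = rng.normal(1.5, 1.0, (200, 2))   # target sample under a mean shift
print(mmd2(Xs, Xs))                   # essentially zero: no discrepancy
print(mmd2(Xs, Xt))                   # clearly positive under the shift
```

Instance re-weighting schemes built on the MMD typically search for source weights that make the weighted source mean embedding close to the target one in this kernel space.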
Fig. 4
Illustration of the TrAdaBoost method [49], where the idea is to decrease the importance of the misclassified source examples while focusing, as in AdaBoost [50], on the misclassified target examples. (Image: Courtesy to S. J. Pan.)
The Transfer Adaptive Boosting (TrAdaBoost) [49] is an extension of AdaBoost [50] that iteratively re-weights both source and target examples during the learning of a target classifier. This is done by increasing the weights of misclassified target instances, as in traditional AdaBoost, but decreasing the weights of misclassified source samples in order to diminish their importance during the training process (see Figure 4). TrAdaBoost was further extended by integrating dynamic updates in [51, 52].

Parameter adaptation methods.
Another set of early DA methods, which however do not necessarily assume P(Y|X_s) = P(Y|X_t), investigates different options to adapt the classifier trained on the source domain, e.g. an SVM, in order to perform better on the target domain. Note that these methods in general require at least a small set of labeled target examples per class, hence they can only be applied in the semi-supervised DA scenario. One such method is the Transductive SVM [53], which aims at decreasing the generalization error of the classification by incorporating knowledge about the target data into the SVM optimization process. The Adaptive SVM (A-SVM) [54] progressively adjusts the decision boundaries of the source classifiers with the help of a set of so-called perturbation functions built by exploiting predictions on the available labeled target examples (see Figure 5). The Domain Transfer SVM [55] simultaneously reduces the mismatch in the distributions (MMD) between the two domains and learns a target decision function. The Adaptive Multiple Kernel Learning (A-MKL) [23] generalizes this by learning an adapted classifier based on multiple base

Code at https://github.com/BoChen90/machine-learning-matlab/blob/master/TrAdaBoost.m
The code for several methods, such as A-SVM, A-MKL, DT-MKL can be downloaded from
Fig. 5
Illustration of the Adaptive SVM [54], where a set of so-called perturbation functions ∆f are added to the source classifier f_s to progressively adjust the decision boundaries of f_s for the target domain. (Courtesy to D. Xu.)

kernels and the pre-trained average classifier. The model minimizes jointly the structural risk functional and the mismatch between the data distributions (MMD) of the two domains.

The domain adaptation SVM (DASVM) [56] exploits, within the semi-supervised DA scenario, both the transductive SVM [53] and its extension, the progressive transductive SVM [57]. The cross-domain SVM, proposed in [58], constrains the impact of source data to the k-nearest neighbors (similarly in spirit to the Localized SVM [59]). This is done by down-weighting support vectors from the source data that are far from the target samples.

Feature augmentation.
One of the simplest methods for DA was proposed in [60], where the original representation x is augmented with itself and a vector of the same size filled with zeros: the source features become (x_s, x_s, 0) and the target features (x_t, 0, x_t). Then an SVM is trained on these augmented features to figure out which parts of the representation are shared between the domains and which are the domain specific ones.

The idea of feature augmentation is also behind the Geodesic Flow Sampling (GFS) [61, 62] and the Geodesic Flow Kernel (GFK) [18, 63], where the domains are embedded in d-dimensional linear subspaces that can be seen as points on the Grassman manifold corresponding to the collection of all d-dimensional subspaces. In the case of GFS [61, 62], following the geodesic path between the source and target domains, representations corresponding to intermediate domains are sampled gradually and concatenated (see illustration in Figure 6). Instead of sampling, GFK [18, 63] extends GFS to the infinite case, proposing a kernel that makes the solution equivalent to integrating over all common subspaces lying on the geodesic path. A more generic framework, proposed in [62], accommodates domain representations in a high-dimensional Reproducing Kernel Hilbert Space (RKHS) using kernel methods and low-dimensional manifold representations corresponding to Laplacian Eigenmaps. The approach described in [64] was inspired by the manifold-based incremental learning framework in [61]. It generates a set of intermediate dictionaries which smoothly connect the source and target domains. This is done by decomposing the target data with the current intermediate domain dictionary updated with a reconstruction residue estimated on the target. Concatenating these

Code available at
Fig. 6
The GFS samples, on the geodesic path between the source and the target, intermediate domains that can be seen as cross-domain data representations. (Courtesy to R. Gopalan [61].)

intermediate representations enables learning a better cross-domain classifier.

These methods exploit intermediate cross-domain representations that are built without the use of class labels. Hence, they can be applied in both the US and SS scenarios. These cross-domain representations are then used either to train a discriminative classifier [62] using the available labeled set (only from the source or from both domains), or to label the target instances using nearest neighbor search in the kernel space [18, 63].

Feature space alignment.
Instead of augmenting the features, other methods try to align the source features with the target ones. As such, the Subspace Alignment (SA) [19] learns an alignment between the source subspace obtained by PCA and the target PCA subspace, where the PCA dimensions are selected by minimizing the Bregman divergence between the subspaces. Its advantage is its simplicity, as shown in Algorithm 1. Similarly, the linear Correlation Alignment (CORAL) [21] can be written in a few lines of MATLAB code, as illustrated in Algorithm 2. The method minimizes the domain shift by using the second-order statistics of the source and target distributions. The main idea is a whitening of the source data using its covariance, followed by a "re-coloring" using the target covariance matrix.

As an alternative to feature alignment, a large set of feature transformation methods were proposed with the objective to find a projection of the data into a latent space such that the discrepancy between the source and target distributions is decreased. Note that the projections can be shared between the domains or they can be domain specific; in the latter case we talk about asymmetric feature transformation. Furthermore, when the transformation learning procedure uses no class labels, the method is called unsupervised feature transformation, and when the transformation is learned by exploiting class labels (only from the source or also from the target when available) it is referred to as supervised feature transformation.
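Both alignment steps translate directly into a few lines of NumPy. The sketch below mirrors Algorithms 1 and 2; the function names, the eigendecomposition-based matrix square root and the PCA helper are our choices, not part of the original implementations:

```python
import numpy as np

def mat_pow(C, p):
    # Matrix power of a symmetric positive definite matrix
    # via eigendecomposition (used for C^{-1/2} and C^{1/2}).
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** p) @ vecs.T

def coral(Xs, Xt):
    # CORAL [21]: whiten the source with its (regularized) covariance,
    # then re-color with the target covariance.
    Cs = np.cov(Xs, rowvar=False) + np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + np.eye(Xt.shape[1])
    return Xs @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)

def subspace_alignment(Xs, Xt, d):
    # SA [19]: align the top-d PCA basis of the source to that of the target.
    def pca_basis(X, d):
        _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
        return Vt[:d].T
    Ps, Pt = pca_basis(Xs, d), pca_basis(Xt, d)
    return Xs @ Ps @ Ps.T @ Pt, Xt @ Pt   # aligned source, projected target
```

After either transformation, a standard classifier trained on the adjusted source features can be applied to the target.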
Unsupervised feature transformation.
One of the first such DA methods is the Transfer Component Analysis (TCA) [14], which proposes to discover common latent features having the same marginal distribution across the source and target domains, while maintaining the intrinsic structure (local geometry of the data manifold) of the original domain by a smoothness term.
Algorithm 1: Subspace Alignment (SA) [19]
Input: Source data X_s, target data X_t, subspace dimension d
1: P_s ← PCA(X_s, d), P_t ← PCA(X_t, d)
2: X_sa = X_s P_s P_s^⊤ P_t, X_ta = X_t P_t
Output: Aligned source data X_sa and target data X_ta.

Algorithm 2: Correlation Alignment (CORAL) [21]
Input: Source data X_s, target data X_t
1: C_s = cov(X_s) + eye(size(X_s, 2)), C_t = cov(X_t) + eye(size(X_t, 2))
2: X_sw = X_s * C_s^{-1/2} (whitening), X_sa = X_sw * C_t^{1/2} (re-coloring)
Output:
Source data X_sa adjusted to the target.

Instead of restricting the discrepancy to a simple distance between the sample means in the lower-dimensional space, Baktashmotlagh et al. [65] propose the Domain Invariant Projection (DIP) approach that compares directly the distributions in the RKHS while constraining the transformation to be orthogonal. They go a step further in [66] and, based on the fact that probability distributions lie on a Riemannian manifold, propose the Statistically Invariant Embedding (SIE) that uses the Hellinger distance on this manifold to compare kernel density estimates of the source and target data. Both DIP and SIE involve non-linear optimizations and are solved with the conjugate gradient algorithm [67].

The Transfer Sparse Coding (TSC) [68] learns robust sparse representations for classifying cross-domain data accurately. To bring the domains closer, the distances between the sample means for each dimension of the source and the target are incorporated into the objective function to be minimized. The Transfer Joint Matching (TJM) [40] learns a non-linear transformation between the two domains by minimizing the distance between the empirical expectations of source and target data distributions integrated within a kernel embedding. In addition, to put less emphasis on the source instances that are irrelevant for classifying the target data, instance re-weighting is employed.

The feature transformation proposed in [12] exploits the correlation between the source and target set to learn a robust representation by reconstructing the original features from their noised counterparts. The method, called Marginalized Denoising Autoencoder (MDA), is based on a quadratic loss and a drop-out noise level that factorizes over all feature dimensions. This allows the method to avoid explicit data corruption by marginalizing out the noise and to have a closed-form solution for the feature transformation.
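A minimal NumPy sketch of this closed-form denoising transformation, for a single layer, could look as follows; the variable names and default values of the noise level p and the ridge regularizer are ours, and the stacked variant simply repeats the step with a tanh non-linearity:

```python
import numpy as np

def mda_layer(X, p=0.5, omega=1e-2):
    # One marginalized denoising autoencoder layer in closed form.
    # X: n x D rows of (typically stacked source + target) features;
    # p: drop-out noise level; omega: ridge regularizer.
    S = X.T @ X                                    # D x D scatter matrix
    P = (1.0 - p) * S                              # expected corrupted/clean cross-term
    Q = (1.0 - p) ** 2 * S + p * (1.0 - p) * np.diag(np.diag(S))
    W = np.linalg.solve(Q + omega * np.eye(X.shape[1]), P)
    return X @ W                                   # denoised features
```

Note that as the noise level and the regularizer go to zero, the learned mapping W approaches the identity, as expected.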
Note that it is straightforward to stack together several layers, with optional non-linearities between layers, to obtain a multi-layer network with the parameters for each layer obtained in a single forward pass (see Algorithm 3).

In general, the above mentioned methods learn the transformation without using any class label. After projecting the data into the new space, any classifier trained on the source set can be used to predict labels for the target data. The model often works even better if, in addition, a small set of the target examples is hand-labeled (SS adaptation). The class labels can also be used to learn a better transformation. Such methods,

Code at https://drive.google.com/uc?export=download&id=0B9_PW9TCpxT0c292bWlRaWtXRHc
Code at https://drive.google.com/uc?export=download&id=0B9_PW9TCpxT0SEdMQ1pCNzdZekU
Code at http://ise.thss.tsinghua.edu.cn/~mlong/doc/transfer-sparse-coding-cvpr13.zip
Code at http://ise.thss.tsinghua.edu.cn/~mlong/doc/transfer-joint-matching-cvpr14.zip
Algorithm 3: Stacked Marginalized Denoising Autoencoder (sMDA) [12]
Input: Source data X_s, target data X_t
Input: Parameters: p (noise level), ω (regularizer) and k (number of stacked layers)
1: X = [X_s, X_t], S = X^⊤ X, and X^(0) = X
2: P = (1 − p) S and Q = (1 − p)^2 S + p (1 − p) diag(S)
3: W = (Q + ω I_D)^{−1} P
4: (Optionally) stack k layers with X^(k) = tanh(X^(k−1) W^(k))
Output: Denoised features X^(k).

called supervised feature transformation based DA methods, exploit class labels to learn the transformation, either only from the source or also from the target (when available). When only the source class labels are exploited, the method can still be applied in the US scenario, while methods using also target labels are designed for the SS case.

Supervised feature transformation.
Several unsupervised feature transformation methods cited above have been extended to capitalize on class labels to learn a better transformation. Among these extensions, we can mention the Semi-Supervised TCA [14, 69], where the objective function that is minimized contains a label dependency term in addition to the distance between the domains and the manifold regularization term. The label dependency term has the role of maximizing the alignment of the projections with the source labels and, when available, target labels.

Similarly, in [70] a quadratic regularization term, relying on the pretrained source classifier, is added to the MDA framework [12], in order to keep the denoised source data well classified. Moreover, the domain denoising and the cross-domain classifier can be learned jointly, by iteratively solving a Sylvester linear system to estimate the transformation and a linear system to get the classifier in closed form.

To take advantage of class labels, the distance between each source sample and its corresponding class mean is added as a regularizer into the DIP [65], respectively the SIE model [66]. This term encourages the source samples from the same class to be clustered in the latent space. The Adaptation Regularization based Transfer Learning [71] performs DA by optimizing simultaneously the structural risk functional, the joint distribution matching between domains and the manifold consistency. The Max-Margin Domain Transform [72] optimizes both the transformation and classifier parameters jointly, by introducing an efficient cost function based on the misclassification loss.

Another set of methods extends marginal distribution discrepancy minimization to the conditional distribution, involving data labels from the source and class predictions from the target.
Thus, [73] proposes an adaptive kernel approach that maps the marginal distribution of the target and source sets into a common kernel space, and uses a sample selection strategy to draw the conditional probabilities between the two domains closer. The Joint Distribution Adaptation [20] jointly adapts the marginal distribution, through a principled (PCA based) dimensionality reduction procedure, and the conditional distribution between the domains.

Code at https://github.com/sclincha/xrce_msda_da_regularization
Code at http://ise.thss.tsinghua.edu.cn/~mlong/doc/adaptation-regularization-tkde14.zip
Code at https://cs.stanford.edu/~jhoffman/code/Hoffman_ICLR13_MMDT_v3.zip
Code at http://ise.thss.tsinghua.edu.cn/~mlong/doc/joint-distribution-adaptation-iccv13.zip
Fig. 7
The NBNN-DA adjusts the image-to-class distances by tuning the per-class metrics, iteratively making the metric progressively more suitable for the target. (Image: Courtesy to T. Tommasi [61].)
Metric learning based feature transformation.
These methods are particular supervised feature transformation methods that require at least a limited set of target labels to be available, and they use metric learning techniques to bridge the relatedness between the source and target domains. Thus, [74] proposes distance metric learning with either log-determinant or manifold regularization to adapt face recognition models between subjects. [17] uses the Information-Theoretic Metric Learning from [75] to define a common distance metric across different domains. This method was further extended in [76] by incorporating non-linear kernels, which enables the model to be applicable to the heterogeneous case (i.e. different source and target representations).

The metric learning for Domain Specific Class Means (DSCM) [77] learns a transformation of the feature space which, for each instance, minimizes the weighted soft-max distances to the corresponding domain specific class means. This allows, in the projected space, to decrease the intra-class and to increase the inter-class distances (see also Figure 10). This was extended with an active learning component by the Self-adaptive Metric Learning Domain Adaptation (SaML-DA) [77] framework, where the target training set is iteratively increased with labels predicted with DSCM and used to refine the current metric. SaML-DA was inspired by the Naive Bayes Nearest Neighbor based Domain Adaptation (NBNN-DA) [78] framework, which combines metric learning and the NBNN classifier to adjust the instance-to-class distances by progressively making the metric more suitable for the target domain (see Figure 7). The main idea behind both methods, SaML-DA and NBNN-DA, is to replace at each iteration the most ambiguous source example of each class by the target example for which the classifier (DSCM, respectively NBNN) is the most confident for the given class.

Local feature transformation.
The previous methods learn a global transformation to be applied to each source and target example. In contrast, the Adaptive Transductive Transfer Machines (ATTM) [80] complement the global transformation with a sample-based transformation to refine the probability density function of the source instances, assuming that the transformation from the source to the target domain is locally linear.

Code at

Fig. 8
The OTDA [79] considers a local transportation plan for each sample in the source domain to transport the training samples close to the target examples. (Image: Courtesy to N. Courty.)

This is achieved by representing the target set by a Gaussian Mixture Model and learning an optimal translation parameter that maximizes the likelihood of the translated source as a posterior. Similarly, the Optimal Transport for Domain Adaptation (OTDA) [79] considers a local transportation plan for each source example. The model can be seen as a graph matching problem, where the final coordinates of each sample are found by mapping the source samples to the target ones, whilst respecting the marginal distribution of the target domain (see Figure 8). To exploit class labels, a regularization term with group-lasso is added, inducing, on one hand, group sparsity and, on the other hand, constraining source samples of the same class to remain close during the transport.
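The transport step can be illustrated with a generic entropy-regularized optimal transport solver. The sketch below is a plain Sinkhorn iteration under uniform marginals, not the class-regularized algorithm of [79]; all names, the cost normalization and the parameter values are our choices:

```python
import numpy as np

def sinkhorn_plan(Xs, Xt, reg=0.1, n_iter=500):
    # Entropy-regularized optimal transport plan between the uniform
    # empirical distributions on the source and target samples.
    ns, nt = len(Xs), len(Xt)
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    C = C / C.max()                                # normalize the cost scale
    K = np.exp(-C / reg)
    a, b = np.ones(ns) / ns, np.ones(nt) / nt      # uniform marginals
    v = np.ones(nt)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]             # transport plan G

def barycentric_map(Xs, Xt, reg=0.1):
    # Move each source sample to the barycenter of the target samples it is
    # coupled with: the "local transportation" of each source example.
    G = sinkhorn_plan(Xs, Xt, reg)
    return (G / G.sum(axis=1, keepdims=True)) @ Xt
```

Each transported source point is a convex combination of target points, so the mapped source cloud lands inside the target domain while the plan respects the prescribed marginals.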
Landmark selection.
In order to improve the feature learning process, several methods have been proposed with the aim of selecting the most relevant instances from the source, so-called landmark examples, to be used to train the adaptation model (see examples in Figure 9). Thus, [63] proposes to minimize a variant of the MMD to identify good landmarks by creating a set of auxiliary tasks that offer multiple views of the original problem. The Statistically Invariant Sample Selection [66] uses the Hellinger distance on the statistical manifold instead of the MMD. The selection is forced to keep the proportions of the source samples per class the same as in the original data. In contrast to these approaches, the Multi-scale Landmark Selection [81] does not require any class labels. It takes each instance independently and considers it as being a good candidate if the Gaussian distributions of the source examples and of the target points centered on the instance are similar over a set of different scales (Gaussian variances).

Note that the landmark selection process, although strongly related to instance re-weighting methods with binary weights, can rather be seen as data preprocessing and hence complementary to the adaptation process.

Code at
Code at http://home.heeere.com/data/cvpr-2015/LSSA.zip
Fig. 9
Landmarks selected for the task amazon versus webcam using the popular Office31 dataset [17] with (a) the MMD [63] and (b) the Hellinger distance on the statistical manifold [66].
Most of the above mentioned methods were designed for a single source vs. target case. When multiple sources are available, they can be concatenated to form a single source set, but because of the possible shift between the different source domains, this might not always be a good option. Alternatively, the models built for each source-target pair (or their results) can be combined to make a final decision. However, a better option might be to build multi-source DA models which, relying only on the a priori known domain labels, are able to exploit the specificity of each source domain. Such methods are the Feature Augmentation (FA) [60] and the A-SVM [54], already mentioned in Section 3.1, both naturally exploiting the multi-source aspect of the dataset. Indeed, in the case of FA, extra feature sets, one for each source domain, concatenated to the representations, allow learning source-specific properties shared between a given source and the target. The A-SVM uses an ensemble of source-specific auxiliary classifiers to adjust the parameters of the target classifier. Similarly, the Domain Adaptation Machine [82] leverages a set of source classifiers by the integration of a domain-dependent regularization term based on a smoothness assumption. The model forces the target classifier to share similar decision values with the relevant source classifiers on the unlabeled target instances. The Conditional Probability based Multi-source Domain Adaptation (CP-MDA) approach [83] extends the above idea by adding weight values for each source classifier based on conditional distributions. The DSCM proposed in [77] relies on domain-specific class means both to learn the metric and to predict the target class labels (see illustration in Figure 10).
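The prediction rule of such a domain-specific class-mean classifier can be sketched as follows (a simplified, identity-metric version; the learned transformation W of [77] is omitted and all values are toy data):

```python
import numpy as np

def dscm_predict(x, class_means, sigma=1.0):
    """Assign x to the class with the highest sum of soft-max weighted
    similarities to its domain-specific class means (the metric learning
    of [77] is omitted; an identity metric is assumed)."""
    scores = {}
    for c, means in class_means.items():          # one mean per source domain
        d = np.array([((x - m) ** 2).sum() for m in means])
        scores[c] = np.exp(-d / (2 * sigma ** 2)).sum()
    return max(scores, key=scores.get)

# Two source domains with slightly shifted class means for classes 0 and 1.
class_means = {0: [np.array([0.0, 0.0]), np.array([0.5, 0.0])],
               1: [np.array([4.0, 4.0]), np.array([4.5, 4.0])]}
print(dscm_predict(np.array([0.2, 0.1]), class_means))   # -> 0
print(dscm_predict(np.array([4.2, 3.9]), class_means))   # -> 1
```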
The domain regularization and classifier based regularization terms of the extended MDA [70] are both sums of source-specific components. The Robust DA via Low-Rank Reconstruction (RDALRR) [84] transforms each source domain into an intermediate representation such that the transformed samples can be linearly reconstructed from the target ones. Within each source domain, the intrinsic relatedness of the reconstructed samples is imposed by using a low-rank structure, where the outliers are identified using sparsity constraints. By enforcing different source domains to have jointly low ranks, a compact source sample set is formed with a distribution close to the target domain (see Figure 11). Fig. 10
Metric learning for the DSCM classifier, where µ^s_{c_i} and µ^{s′}_{c_i} represent source-specific class means and µ^t_{c_i} class means in the target domain. The feature transformation W is learned by minimizing for each sample the weighted soft-max distances to the corresponding domain-specific class means in the projected space. To better take advantage of having multiple source domains, extensions of methods previously designed for a single source vs. target case were proposed in [62, 85, 86, 87]. Thus, [62] describes a multi-source version of the GFS [61], which was further extended in [85] to the Subspaces by Sampling Spline Flow approach. The latter uses smooth polynomial functions determined by splines on the manifold to interpolate between the different sources and the target domain. [86] combines a constrained clustering algorithm, used to automatically identify source domains in a large data set, with a multi-source extension of the Asymmetric Kernel Transform [76]. [87] efficiently extends the TrAdaBoost [49] to multiple source domains. Source domain weighting.
When multiple sources are available, it is desirable to select those domains that provide the best information transfer and to remove the ones that most likely have a negative impact on the final model. Thus, to down-weight the effect of less related source domains, in [88] first the available labels are propagated within clusters obtained by spectral clustering, and then a Supervised Local Weight (SLW) is assigned to each source cluster based on the percentage of label matches between predictions made by a source model and those made by label propagation. In the Locally Weighted Ensemble framework [88], the model weights are computed as a similarity between the local neighborhood graphs centered on source and target instances. The CP-MDA [83], mentioned above, uses a weighted combination of source learners, where the weights are estimated as a function of conditional probability differences between the source and target domains. The Rank of Domain value defined in [18] measures the relatedness between each source and target domain as the KL divergence between data distributions once the data is projected into the latent subspace. The Multi-Model Knowledge Transfer [89] minimizes the negative transfer by giving higher weights to the most related linear SVM source classifiers. These weights are determined through a leave-one-out learning process. Code at https://cs.stanford.edu/˜jhoffman/code/hoffman_latent_domains_release_v2.zip
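A minimal sketch of agreement-based source weighting in the spirit of these approaches (the pseudo-labels and the agreement criterion are illustrative, not the exact SLW or CP-MDA formulations):

```python
import numpy as np

def source_weights(source_preds, pseudo_labels):
    """Weight each source classifier by how often its target predictions
    agree with pseudo-labels obtained by another mechanism
    (e.g. label propagation), then normalize into ensemble weights."""
    agree = np.array([(p == pseudo_labels).mean() for p in source_preds])
    return agree / agree.sum()

pseudo = np.array([0, 1, 1, 0, 1])      # hypothetical propagated labels
preds_a = np.array([0, 1, 1, 0, 1])     # source A agrees on 5/5
preds_b = np.array([1, 0, 1, 0, 0])     # source B agrees on 2/5
w = source_weights([preds_a, preds_b], pseudo)
print(np.round(w, 2))                   # source A gets the larger weight
```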
Fig. 11
The RDALRR [84] transforms each source domain into an intermediate representation such that the transformed samples can be linearly reconstructed from the target samples. (Image: Courtesy to I.H. Jhuo.)
Heterogeneous transfer learning (HTL) refers to the setting where the representation spaces are different for the source and target domains (X_t ≠ X_s as defined in Section 2). As a particular case, when the tasks are assumed to be the same, i.e. Y_s = Y_t, we refer to it as heterogeneous domain adaptation (HDA). Both HDA and HTL are strongly related to multi-view learning [90, 91], where the presence of multiple information sources gives an opportunity to learn better representations (features) by analyzing the views simultaneously. This makes it possible to solve the task when not all the views are available. Such situations appear when processing simultaneously audio and video [92], documents containing both image and text (e.g. web pages or photos with tags or comments) [93, 94, 95], images acquired with depth information [96], etc. We can also have multi-view settings when the views have the same modalities (textual, visual, audio), such as in the case of parallel text corpora in different languages [97, 98], or photos of the same person taken across different poses, illuminations and expressions [27, 29, 99, 100]. Multi-view learning assumes that at training time multiple views of the same data instance, coming from complementary information sources, are available (e.g. a person is identified by photograph, fingerprint, signature or iris). Instead, in the case of HTL and HDA, the challenge comes from the fact that we have one view at training and another one at test time. Therefore, one set of methods proposed to solve HDA relies on some multi-view auxiliary data to bridge the gap between the domains (see Figure 12). Methods relying on auxiliary domains.
These methods principally exploit feature co-occurrences (e.g. between words and visual features) in the multi-view auxiliary domain. As such, the Transitive Transfer Learning [101] selects an appropriate domain from a large data set guided by domain complexity and the distribution differences between the original domains (source and target) and the selected one (auxiliary). When the bridge is to be built between visual and textual representations, a common practice is to crawl the Web for pages containing both text and images in order to build such intermediate multi-view data.
Fig. 12
Heterogeneous DA through an intermediate domain allowing to bridge the gap between the features representing the two domains. For example, when the source domain contains text and the target images, the intermediate domain can be built from a set of crawled Web pages containing both text and images. (Image courtesy B. Tan [101]).
Then, using Non-negative Matrix Tri-factorization [102], feature clustering and label propagation are performed simultaneously through the intermediate domain. The Mixed-Transfer approach [103] builds a joint transition probability graph of mixed instances and features, considering the data in the source, target and intermediate domains. The label propagation on the graph is done by a random walk process to overcome the data sparsity. In [104] the representations of the target images are enriched with semantic concepts extracted from the intermediate data through a Collective Matrix Factorization [105]. [106] proposes to build a translator function between the source and target domain by directly learning the product of the two transformation matrices that map each domain into a common (hypothetical) latent topic space built on the co-occurrence data. Following the principle of parsimony, they encode as few topics as possible in order to be able to match text and images. The semantic labels are propagated from the labeled text corpus to unlabeled new images by a cross-domain label propagation mechanism using the built translator. In [107] the co-occurrence data is represented by the principal components computed in each feature space, and a Markov Chain Monte Carlo method [108] is employed to construct a directed cyclic network where each node is a domain and each edge weight represents the conditional dependence between the corresponding domains defined by the transfer weights. [109] studies online HDA, where offline labeled data from a source domain is transferred to enhance the online classification performance for the target domain. The main idea is to build an offline classifier based on heterogeneous similarity using labeled data from a source domain and unlabeled co-occurrence data collected from Web pages and social networks (see Figure 13).
The online target classifier is combined with the offline source classifier using the Hedge weighting strategy, used in AdaBoost [50], to update their weights for ensemble prediction. Instead of relying on external data to bridge the data representation gap, several HDA methods directly exploit the data distribution in the source and target domains, aiming to simultaneously remove the gap between the feature representations and minimize the data distribution shift. This is done by learning either a
Fig. 13
Combining the online classifier with the offline classifier (right) and transferring the knowledge through co-occurrence data in the heterogeneous intermediate domain (left). (Image: Courtesy to Y. Yan [109]) projection for each domain into a domain-invariant common latent space, referred to as symmetric transformation based HDA, or a transformation from the source space towards the target space, called asymmetric transformation based HDA. These approaches require at least a limited amount of labeled target examples (semi-supervised DA). Symmetric feature transformation.
The aim of symmetric transformation based HDA approaches is to learn projections for both the source and target spaces into a common latent (embedding) feature space better suited to learn the task for the target. These methods are related, on one hand, to the feature transformation based homogeneous DA methods described in Section 3.1 and, on the other hand, to multi-view embedding [93, 110, 99, 111, 112, 113], where different views are embedded in a common latent space. Therefore, several DA methods originally designed for the homogeneous case have been inspired by the multi-view embedding approaches and extended to heterogeneous data. As such, the Heterogeneous Feature Augmentation (HFA) [114], prior to data augmentation, embeds the source and target into a common latent space (see Figure 15). In order to avoid the explicit projections, the transformation metrics are computed by the minimization of the structural risk functional of the SVM expressed as a function of these projection matrices. The final target prediction function is computed by an alternating optimization algorithm that simultaneously solves the dual SVM and finds the optimal transformations. This model was further extended in [115], where each projection matrix is decomposed into a linear combination of a set of rank-one positive semi-definite matrices, which are combined within a Multiple Kernel Learning approach. The Heterogeneous Spectral Mapping [116] unifies different feature spaces using spectral embedding, where the similarity between the domains in the latent space is maximized under the constraint of preserving the original structure of the data. Combined with a source sample selection strategy, a Bayesian-based approach is applied to model the relationship between the different output spaces. These methods can be used even if the source and target data are represented in the same feature space, i.e.
X_t = X_s. Therefore, it is not surprising that several methods are direct extensions of homogeneous DA methods described in Section 3.1. Code available at https://sites.google.com/site/xyzliwen/publications/HFA_release_0315.rar
Fig. 14
The SDDL proposes to learn a dictionary in a latent common subspace while maintaining the manifold structure of the data. (Image: Courtesy to S. Shekhar [28]) [117] presents a semi-supervised subspace co-projection method, which addresses heterogeneous multi-class DA. It is based on discriminative subspace learning and exploits unlabeled data to enforce an MMD criterion across domains in the projected subspace. It uses Error Correcting Output Codes (ECOC) to address the multi-class aspect and to enhance the discriminative informativeness of the projected subspace. The Semi-supervised Domain Adaptation with Subspace Learning [118] jointly explores invariant low-dimensional structures across domains to correct the data distribution mismatch and leverages available unlabeled target examples to exploit the underlying intrinsic information in the target domain. To deal with both domain shift and heterogeneous data, the Shared Domain-adapted Dictionary Learning (SDDL) [28] learns a class-wise discriminative dictionary in the latent projected space (see Figure 14). This is done by jointly learning the dictionary and the projections of the data from both domains onto a common low-dimensional space, while maintaining the manifold structure of the data represented by sparse linear combinations of dictionary atoms. The Domain Adaptation Manifold Alignment (DAMA) [119] models each domain as a manifold and creates a separate mapping function to transform the heterogeneous input space into a common latent space while preserving the underlying structure of each domain. This is done by representing each domain with a Laplacian that captures the closeness of the instances sharing the same label. The RDALRR [84], mentioned above (see also Figure 11), transforms each source domain into an intermediate representation such that the source samples linearly reconstructed from the target samples are enforced to be related to each other under a low-rank structure.
Note that both DAMA and RDALRR are multi-source HDA approaches.
Fig. 15
The HFA [114] seeks an optimal common space while simultaneously learning a discriminative SVM classifier. (Image: Courtesy to Dong Xu.)
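As a toy illustration of the symmetric setting, the following sketch learns projections of 5-D source and 3-D target features into a shared 2-D space from paired samples via an SVD of their cross-covariance (a PLS-style simplification, not the actual HFA or spectral mapping algorithms; all data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical paired data (e.g. co-occurring samples): 5-D source
# features and 3-D target features describing the same instances.
Z = rng.normal(size=(50, 2))                   # shared latent factors
Xs = Z @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(50, 5))
Xt = Z @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(50, 3))

# The SVD of the cross-covariance gives projections Ps, Pt into a shared
# d-dimensional space where paired samples are maximally covariant.
d = 2
C = (Xs - Xs.mean(0)).T @ (Xt - Xt.mean(0)) / len(Xs)
U, _, Vt = np.linalg.svd(C)
Ps, Pt = U[:, :d], Vt[:d].T
Es, Et = Xs @ Ps, Xt @ Pt                      # both domains, common space
r = np.corrcoef(Es[:, 0], Et[:, 0])[0, 1]
print(round(abs(r), 2))                        # first aligned components correlate
```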
Asymmetric feature transformation.
In contrast to symmetric transformation based HDA, these methods aim to learn a projection of the source features into the target space such that the distribution mismatch within each class is minimized. One such method is the Asymmetric Regularized Cross-domain Transformation [76], which utilizes an objective function responsible for the domain-invariant transformation learned in a non-linear Gaussian RBF kernel space. The Multiple Outlook MAPping algorithm [120] finds the transformation matrix by a singular value decomposition process that encourages the marginal distributions within the classes to be aligned while maintaining the structure of the data. It requires a limited amount of labeled target data for each class to be paired with the corresponding source classes. [10] proposes a sparse and class-invariant feature mapping that leverages the weight vectors of the binary classifiers learned in the source and target domains. This is done by considering the learning task as a Compressed Sensing [121] problem and using the ECOC scheme to generate a sufficient number of binary classifiers given the set of classes. The recent progress in image categorization due to deep convolutional architectures, trained in a fully supervised fashion on large scale annotated datasets, in particular on part of ImageNet [122], allowed a significant improvement of the categorization accuracy over previous state-of-the-art solutions. Furthermore, it was shown that features extracted from the activation layers of these deep convolutional networks can be re-purposed to novel tasks [123] even when the new tasks differ significantly from the task originally used to train the model. Concerning domain adaptation, baseline methods without adaptation, obtained using features generated by deep models on the two most popular benchmark datasets, Office (OFF31) [17] and Office+Caltech (OC10) [18], outperform by a large margin the shallow DA methods using the SURFBOV features originally provided with these datasets.
Indeed, the results obtained with such Deep Convolutional Activation Features (DeCAF) [123], even without any adaptation to the target, are significantly better than the results Code available at http://vision.cs.uml.edu/code/DomainTransformsECCV10_v1.tar.gz Activation layers extracted from popular CNN models, such as AlexNet [124], VGGNET [125], ResNet [126] or GoogleNet [127]. Code to extract features available at https://github.com/UCBAIR/decaf-releas
Fig. 16
Examples from the Cross-Modal Places Dataset (CMPlaces) proposed in [3]. (Image: Courtesy to L. Castrejón.) obtained with any DA method based on SURFBOV [128, 123, 21, 70]. As also shown in [129, 130], this suggests that deep neural networks learn more abstract and robust representations, encode category level information and remove, to a certain extent, the domain bias [123, 21, 70, 4]. Note however that in the OFF31 and OC10 datasets the images remain relatively similar to the images used to train these models (usually datasets from the ImageNet Large-Scale Visual Recognition Challenge [122]). In contrast, if we consider category models between e.g. images and paintings, drawings, clip art or sketches (see examples from the CMPlaces dataset in Figure 16), the models have more difficulty handling the domain differences [1, 131, 2, 3] and alternative solutions are necessary. Solutions proposed in the literature to exploit deep models can be grouped into three main categories. The first group considers the CNN models as extractors of vectorial features to be used by the shallow DA methods. The second solution is to train or fine-tune the deep network on the source domain, adjust it to the new task, and use the model to predict class labels for target instances. Finally, the most promising methods are based on deep learning architectures designed for DA. Shallow methods with deep features.
The first, naive solution is to consider the deep network as a feature extractor, where the activations of one or several layers of the deep architecture are taken as the representation of the input image. These Deep Convolutional Activation Features (DeCAF) [123], extracted from both source and target examples, can then be used within any shallow DA method described in Section 3. For example, Feature Augmentation [60], Max-Margin Domain Transforms [72] and Geodesic Flow Kernel [18] Dataset available at http://projects.csail.mit.edu/cmplaces/
Fig. 17
The DLID model aims at interpolating between domains based on the amount of source and target data used to train each model. (Image courtesy S. Chopra [128]). were applied to DeCAF features in [123], Subspace Alignment [19] and Correlation Alignment in [21]. [70] experiments with DeCAF features within the extended MDA framework, while [4] explores various metric learning approaches to align deep features extracted from RGB face images (source) and NIR images or sketches (target). In general, these DA methods allow to further improve the classification accuracy compared to the baseline classifiers trained only on the source data with these DeCAF features [123, 21, 70, 4]. Note however that the gain is often relatively small and significantly lower than the gain obtained with the same methods when used with the SURFBOV features.
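As an example of such a shallow method applied on top of deep features, the whitening/recoloring step at the core of Correlation Alignment [21] has a simple closed form; in the sketch below the "deep" features are random stand-ins, and the regularization constant is an illustrative choice:

```python
import numpy as np

def coral(Xs, Xt, eps=1e-5):
    """CORAL [21]: whiten the source covariance, then re-color it with
    the target covariance, so second-order statistics match."""
    def cov(X):
        Xc = X - X.mean(0)
        return Xc.T @ Xc / (len(X) - 1) + eps * np.eye(X.shape[1])
    def mat_pow(C, p):           # symmetric matrix power via eigendecomposition
        w, V = np.linalg.eigh(C)
        return (V * w ** p) @ V.T
    Cs, Ct = cov(Xs), cov(Xt)
    return (Xs - Xs.mean(0)) @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5) + Xs.mean(0)

rng = np.random.default_rng(4)
Xs = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])  # anisotropic source
Xt = rng.normal(size=(200, 3))                              # isotropic target
Xs_aligned = coral(Xs, Xt)
# After alignment the source covariance is close to the target's.
print(np.round(np.cov(Xs_aligned.T), 1))
```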
Fine-tuning deep CNN architectures.
The second and most used solution is to fine-tune the deep network model on the new type of data and for the new task [132, 133, 134, 135]. But fine-tuning requires in general a relatively large amount of annotated data, which is not available for the target domain, or is very limited. Therefore, the model is in general fine-tuned on the source (augmented, when available, with the few labeled target instances), which allows in the first place to adjust the deep model to the new task, common between the source and target in the case of DA. This is fundamental if the targeted classes do not belong to the classes used to pretrain the deep model. However, if the domain difference between the source and target is important, fine-tuning the model on the source might over-fit the model for the source. In this case the performance of the fine-tuned model on the target data can be worse than just training the class prediction layer or, as above, using the model as a feature extractor and training a classifier with the corresponding DeCAF features [128, 21]. Finally, the most promising are the deep domain adaptation (deepDA) methods that are based on deep learning architectures designed for domain adaptation. One of the first deep models used for DA is the Stacked Denoising Autoencoders [137], proposed to adapt sentiment classification between reviews of different products [13]. This model aims at finding common features between the source and target collections relying on denoising autoencoders. This is done by training a multi-layer neural network to reconstruct input data from partial random corruptions with backpropagation. The Stacked Marginalized Denoising Autoencoders [12] This is done by replacing the class prediction layer to correspond to the new set of classes. Note that the two approaches are equivalent when the layers preceding the class prediction layer are extracted.
Fig. 18
Adversarial adaptation methods can be viewed as instantiations of the same framework with different choices regarding their properties [136]. (Image courtesy E. Tzeng). (see also Section 3.1) is a variant of the SDA, where the random corruption is marginalized out and hence yields a unique optimal solution (feature transformation) computed in closed form between layers. The Domain Adaptive Neural Network [138] uses such a denoising auto-encoder as a pretraining stage. To ensure that the model pretrained on the source continues to adapt to the target, the MMD is embedded as a regularization in the supervised backpropagation process (added to the cross-entropy based classification loss of the labeled source examples). The Deep Learning for Domain Adaptation [128], inspired by the intermediate representations on the geodesic path [18, 62], proposes a deep model based interpolation between domains. This is achieved by a deep nonlinear feature extractor trained in an unsupervised manner using the Predictive Sparse Decomposition [139] on intermediate datasets, where the amount of source data is gradually replaced by target samples. [140] proposes a light-weight domain adaptation method which, by using only a few target samples, analyzes and reconstructs the output of the filters that were found affected by the domain shift. The aim of the reconstruction is to make the filter responses given a target image resemble the response map of a source image. This is done by simultaneously selecting and reconstructing the response maps of the bad filters using a Lasso based optimization with a KL-divergence measure that guides the filter selection process. Most deepDA methods follow a Siamese architecture [141] with two streams, representing the source and target models (see for example Figure 18), and are trained with a combination of a classification loss and a discrepancy loss [142, 143, 138, 144, 145] or an adversarial loss. The classification loss depends on the labeled source data.
The discrepancy loss aims to diminish the shift between the two domains, while the adversarial loss tries to encourage a common feature space through an adversarial objective with respect to Code available at https://github.com/ghif/mtae
Fig. 19
The JAN [145] minimizes a joint distribution discrepancy of several intermediate layers, including the soft prediction one. (Image courtesy M. Long). a domain discriminator.
Discrepancy-based methods.
These methods, inspired by the shallow feature space transformation approaches described in Section 3.1, use in general a discrepancy based on the MMD defined between corresponding activation layers of the two streams of the Siamese architecture. One of the first such methods is the Deep Domain Confusion (DDC) [142], where the layer to be considered for the discrepancy and its dimension are automatically selected amongst a set of fine-tuned networks based on the linear MMD between the source and the target. Instead of using a single layer and the linear MMD, Long et al. proposed the Deep Adaptation Network (DAN) [143], which considers the sum of MMDs defined between several layers, including the soft prediction layer. Furthermore, DAN explores multiple kernels for adapting these deep representations, which substantially enhances adaptation effectiveness compared to the single kernel used in [138] and [142]. This was further improved by the Joint Adaptation Networks [145], which, instead of the sum of marginal distribution discrepancies (MMD) defined between different layers, consider the joint distribution discrepancies of these features. The Deep CORAL [144] extends the shallow CORAL [21] method described in Section 3 to deep architectures. The main idea is to learn a nonlinear transformation that aligns correlations of activation layers between the two streams. This idea is similar to DDC and DAN except that, instead of the MMD, the CORAL loss (expressed by the distance between the covariances) is used to minimize the discrepancy between the domains. In contrast to the above methods, Rozantsev et al. [146] consider the MMD between the weights of the source and target models, respectively, at different layers, where an extra regularizer term ensures that the weights in the two models remain linearly related. Adversarial discriminative models.
The aim of these models is to encourage domain confusion through an adversarial objective with respect to a domain discriminator. [136] proposes a unified view of existing adversarial DA methods by comparing them depending on the loss type, the weight sharing strategy between the two streams and on whether they are discriminative or generative (see illustration in Figure 18). Amongst the discriminative models we have the model proposed in [148] using a confusion loss, the Adversarial Discriminative Domain Adaptation [136] that considers an inverted label GAN loss [149] and the Domain-Adversarial Neural Network [147] with a minimax loss. The generative methods, additionally to the discriminator, rely on a generator, which, in general, is a Generative Adversarial Network (GAN) [149]. Code available at https://github.com/thuml/transfer-caffe Code available at https://github.com/VisionLearningGroup/CORAL Note that this loss can be seen as minimizing the MMD with a polynomial kernel.
Fig. 20
The DANN architecture includes a feature extractor (green) and a label predictor (blue), which together form a standard feed-forward architecture. Unsupervised DA is achieved by the gradient reversal layer that multiplies the gradient by a certain negative constant during the backpropagation-based training to ensure that the feature distributions over the two domains are made indistinguishable. (Image courtesy Y. Ganin [147]). The domain confusion based model proposed in [148] considers a domain confusion objective, under which the mapping is trained with both unlabeled and sparsely labeled target data using a cross-entropy loss function against a uniform distribution. The model simultaneously optimizes the domain invariance to facilitate domain transfer and uses a soft label distribution matching loss to transfer information between tasks. The Domain-Adversarial Neural Networks (DANN) [147] integrates a gradient reversal layer into the standard architecture to promote the emergence of features that are discriminative for the main learning task on the source domain and indiscriminate with respect to the shift between the domains (see Figure 20). This layer is left unchanged during the forward propagation and its gradient is reversed during backpropagation. The Adversarial Discriminative Domain Adaptation [136] uses an inverted label GAN loss to split the optimization into two independent objectives, one for the generator and one for the discriminator.
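The effect of DANN's gradient reversal layer can be illustrated with a single manual backward pass through a toy linear feature extractor and domain classifier (all shapes, values and the logistic loss are illustrative only, not the original architecture):

```python
import numpy as np

# Toy setup: feature extractor F(x) = Wx and a linear domain classifier v.
rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))          # feature extractor weights
v = rng.normal(size=3)               # domain classifier weights
x, y_dom, lam = rng.normal(size=4), 1.0, 0.3

h = W @ x                            # forward: features
p = 1.0 / (1.0 + np.exp(-v @ h))     # domain probability (sigmoid)
# Backward pass of the logistic domain loss:
g_logit = p - y_dom                  # dL/d(v.h)
g_v = g_logit * h                    # gradient for the domain classifier
g_h = g_logit * v                    # gradient reaching the GRL from above
g_W = np.outer(-lam * g_h, x)        # GRL: gradient is REVERSED (scaled by
                                     # -lambda) before reaching the extractor

# The domain classifier descends its loss while the feature extractor
# ascends it, pushing features toward domain indistinguishability.
print(g_W.shape)                     # (3, 4)
```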
In contrast to the above methods, this model considers independent source and target mappings (unshared weights between the two streams), allowing domain specific feature extraction to be learned, where the target weights are initialized by the network pretrained on the source. Adversarial generative models.
These models combine the discriminative model with a generative component, in general based on GANs [149]. As such, the Coupled Generative Adversarial Networks [150] consist of a tuple of GANs, each corresponding to one of the domains. It learns a joint distribution of multi-domain images and enforces a weight sharing constraint to limit the network capacity. Code available at https://github.com/erictzeng/caffe/tree/confusion Code available at https://github.com/ddtm/caffe/tree/grl
Fig. 21
The DSN architecture combines shared and domain specific encoders, which learn common and domain specific representation components respectively, with a shared decoder that learns to reconstruct the input samples. (Image courtesy K. Bousmalis [153]).
The model proposed in [151] also exploits GANs, with the aim of generating source-domain images such that they appear as if they were drawn from the target domain. Prior knowledge regarding the low-level image adaptation process, such as a foreground-background segmentation mask, can be integrated in the model through a content-similarity loss defined by a masked Pairwise Mean Squared Error [152] between the unmasked pixels of the source and generated images. As the model decouples the process of domain adaptation from the task-specific architecture, it is able to generalize also to object classes unseen during the training phase.
Data reconstruction (encoder-decoder) based methods.
In contrast to the above methods, the Deep Reconstruction Classification Network proposed in [154] combines the standard convolutional network for source label prediction with a deconvolutional network [155] for target data reconstruction. To jointly learn source label predictions and unsupervised target data reconstruction, the model alternates between unsupervised and supervised training. The parameters of the encoding are shared across both tasks, while the decoding parameters are separated. The data reconstruction can be viewed as an auxiliary task to support the adaptation of the label prediction. The Domain Separation Networks (DSN) [153] introduce the notion of a private subspace for each domain, which captures domain specific properties, such as background and low level image statistics. A shared subspace, enforced through the use of autoencoders and explicit loss functions, captures common features between the domains. The model integrates a reconstruction loss using a shared decoder, which learns to reconstruct the input sample by using both the private (domain specific) and shared representations (see Figure 21). Code available at https://github.com/ghif/drcn
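The joint objective of such reconstruction-based models can be caricatured as a weighted sum of a supervised source classification loss and an unsupervised target reconstruction loss passing through a shared encoder (a toy linear sketch with illustrative names and values, not the actual DRCN architecture):

```python
import numpy as np

def drcn_objective(enc, cls, dec, Xs, ys, Xt, lam=0.5):
    """Toy DRCN-style objective: a shared encoder feeds a supervised
    classification branch (source) and a reconstruction branch (target)."""
    Hs, Ht = Xs @ enc, Xt @ enc                  # shared encoding parameters
    logits = Hs @ cls                            # source label prediction branch
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    class_loss = -np.log(p[np.arange(len(ys)), ys]).mean()
    recon_loss = ((Ht @ dec - Xt) ** 2).mean()   # target reconstruction branch
    return lam * class_loss + (1 - lam) * recon_loss

rng = np.random.default_rng(6)
enc = rng.normal(size=(4, 3))                    # shared encoder weights
cls, dec = rng.normal(size=(3, 2)), rng.normal(size=(3, 4))
Xs, ys = rng.normal(size=(8, 4)), rng.integers(0, 2, 8)
Xt = rng.normal(size=(10, 4))
print(drcn_objective(enc, cls, dec, Xs, ys, Xt))
```

In the actual model, the two losses are not summed in one pass but alternated between supervised and unsupervised training phases, as described above.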
Fig. 22
The DTN architecture with strongly-shared and weakly-shared parameter layers. (Image courtesy X. Shu [157]).
Heterogeneous deepDA.
Concerning heterogeneous or multi-modal deep domain adaptation, we can mention the Transfer Neural Trees [156] proposed to relate heterogeneous cross-domain data. It is a two-stream network, one stream per modality, where the weights in the latter stages of the network are shared. As the prediction layer, a Transfer Neural Decision Forest (Transfer-NDF) is used, which performs adaptation and classification jointly.

The weakly-shared Deep Transfer Networks for Heterogeneous-Domain Knowledge Propagation [157] learn a domain translator function from multi-modal source data that can be used to predict class labels in the target domain even if only one of the modalities is present. The proposed structure has the advantage of being flexible enough to represent both domain-specific features and features shared across domains (see Figure 22).
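The idea of "weakly-shared" layers can be sketched as a soft tying penalty: instead of forcing corresponding layers in the two streams to use identical weights, their weights are pulled together by an L2 term. This is a hedged toy sketch, not the exact formulation of [157]; the sharing strength is an assumed hyper-parameter.

```python
# Hedged sketch of weak parameter sharing between two network streams.

def weak_sharing_penalty(w_src, w_tgt, strength):
    """L2 distance between corresponding layer weights, scaled by an
    assumed sharing-strength hyper-parameter. Hard sharing is the
    limiting case of an infinite strength."""
    return strength * sum((a - b) ** 2 for a, b in zip(w_src, w_tgt))

# Identical weights incur no penalty; diverging weights are penalised.
print(weak_sharing_penalty([0.5, -1.0], [0.5, -1.0], 10.0))  # 0.0
print(weak_sharing_penalty([0.5, -1.0], [1.5, 0.0], 10.0))   # 20.0
```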
In the previous sections, we attempted to provide an overview of visual DA methods with an emphasis on image categorization. Compared to this vast literature focused on object recognition, relatively few papers go beyond image classification and address domain adaptation for other computer vision problems such as object detection, semantic segmentation, pose estimation, or video event or action detection. One of the main reasons is probably that these problems are more complex and often come with additional challenges and requirements (e.g. precise localization in the case of detection, pixel-level accuracy for image segmentation, the increased annotation burden for videos, etc.). Moreover, adapting visual representations such as contours, deformable and articulated 2-D or 3-D models, graphs, random fields or visual dynamics is less obvious with classical vectorial DA techniques. Therefore, when these tasks are addressed in the context of domain adaptation, the problem is generally rewritten as a classification problem with vectorial feature representations and a set of predefined class labels. In this case the main challenge becomes finding the best vectorial representation for the given
Fig. 23
Virtual world examples: SYNTHIA (top), Virtual KITTI (bottom).

task. When this is possible, shallow DA methods, described in Section 3, can be applied to the problem. Thereupon, we can find in the literature DA solutions such as Adaptive SVM [54], DT-SVM [55], A-MKL [23] or the Selective Transfer Machine [31] applied to video concept detection [22], video event recognition [23], activity recognition [24, 25], facial action unit detection [31], and 3D pose estimation [32].

When rewriting the problem as classification of vectorial representations is less obvious, as in the case of image segmentation, where the output is structured, or detection, where the output is a set of bounding boxes, most often the target training set is simply augmented with the source data and traditional methods - segmentation, detection, etc. - are used. To overcome the lack of labels in the target domain, source data is often gathered by crawling the Web (webly supervised) [158, 159, 160], or the target set is enriched with synthetically generated data. The use of synthetic data has become even more popular since the massive adoption of deep CNNs to perform computer vision tasks requiring large amounts of annotated data.
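Once a task is rewritten with vectorial features, a shallow method such as Subspace Alignment [19] applies directly: the source subspace basis P_s is aligned to the target basis P_t through M = P_s^T P_t, and source samples are compared to target samples in the target-aligned space. Below is a minimal sketch with hand-picked toy bases; in practice the bases come from per-domain PCA.

```python
# Toy subspace-alignment sketch in the spirit of [19].

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Toy 2-D domains with 1-D subspaces (columns are basis vectors).
P_s = [[1.0], [0.0]]              # source basis
P_t = [[0.6], [0.8]]              # target basis
M = matmul(transpose(P_s), P_t)   # alignment matrix M = P_s^T P_t

X_s = [[3.0, 0.0]]                # one source sample (row vector)
aligned_source = matmul(matmul(X_s, P_s), M)   # X_s P_s M
```

Target samples are simply projected with P_t, after which any classifier trained on the aligned source features can score them.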
Synthetic data based adaptation.
Early methods used 3D CAD models to improve solutions for pose and viewpoint estimation [161, 162, 163, 164], object and object part detection [165, 166, 167, 168, 169, 170, 171, 172], and segmentation and scene understanding [173, 174, 175]. Recent progress in computer graphics and modern high-level generic graphics platforms such as game engines enables the generation of photo-realistic
Fig. 24
Illustration of the Cool-TSN deep multi-task learning architecture [189] for end-to-end action recognition in videos. (Image courtesy C. De Souza).

virtual worlds with diverse, realistic, and physically plausible events and actions. Popular virtual worlds are SYNTHIA [176] (available at http://synthia-dataset.net), Virtual KITTI [177] and GTA-V [178] (see also Figure 23). Such virtually generated and controlled environments come with different levels of labeling for free and therefore hold great promise for deep learning across a variety of computer vision problems, including optical flow [179, 180, 181, 182], object tracking [183, 177], depth estimation from RGB [184], object detection [185, 186, 187], semantic segmentation [188, 176, 178] and human action recognition [189].

In most cases, the synthetic data is used to enrich the real data for building the models. However, DA techniques can further help to adjust a model trained on virtual data (source) to real data (target), especially when no or few labeled examples are available in the real domain [190, 191, 176, 189]. As such, [190] proposes a deep spatial feature point architecture for visuomotor representation which, using synthetic examples and a few supervised examples, transfers the pretrained model to real imagery. This is done by combining a pose estimation loss, a domain confusion loss that aligns the synthetic and real domains, and a contrastive loss that aligns specific pairs in the feature space. Together, these three losses ensure that the representation is suitable for the pose estimation task while remaining robust to the synthetic-real domain shift.

The Cool Temporal Segment Network [189] is an end-to-end action recognition model for real-world target categories that combines a few labeled real-world videos with a large number of procedurally generated synthetic videos.
The model uses a deep multi-task representation learning architecture, able to mix synthetic and real videos even if the action categories differ between the real and synthetic sets (see Figure 24).
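The three-part objective described for [190] can be sketched in miniature: a task loss, a domain-confusion term, and a contrastive term pulling matched synthetic/real feature pairs together. This is a hedged toy sketch with made-up vectors and assumed loss weights, not the paper's actual formulation.

```python
# Toy sketch of a combined synthetic-to-real adaptation objective.

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def contrastive_alignment(pairs):
    """Mean squared distance over matched (synthetic, real) feature pairs."""
    return sum(sq_dist(s, r) for s, r in pairs) / len(pairs)

def total_loss(task_loss, confusion_loss, pairs, w_conf=0.1, w_pair=0.1):
    # w_conf and w_pair are assumed hyper-parameters, not paper values
    return (task_loss + w_conf * confusion_loss
            + w_pair * contrastive_alignment(pairs))

pairs = [([1.0, 0.0], [1.0, 0.0]),   # perfectly aligned pair
         ([0.0, 1.0], [0.0, 0.0])]   # misaligned pair
```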
Fig. 25
Online adaptation of the generic detector with tracked regions. (Image courtesy P. Sharma [204]).
Concerning visual applications, after image-level categorization, object detection has received the most attention from the visual DA/TL community. Until recently, object detection models were composed of a window selection mechanism and appearance-based classifiers trained on features extracted from labeled bounding boxes. At test time, the classifier was used to decide whether a region of interest obtained by sliding windows or generic window selection models [192, 193, 194] contains the object or not.

Therefore, considering the window selection mechanism as domain independent, standard DA methods can be integrated with the appearance-based classifiers to adapt the models trained on the source domain to the target domain. The Projective Model Transfer SVM (PMT-SVM) and the Deformable Adaptive SVM (DA-SVM) proposed in [195] are such methods; they adapt HOG deformable source templates [196, 197] with labeled target bounding boxes (SS scenario), and the adapted template is used at test time to detect the presence or absence of an object class in sliding windows. In [198] the PMT-SVM was further combined with MMDT [72] to handle complex domain shifts. The detector is further improved by smoothness constraints imposed on the classifier scores, utilizing instance correspondences (e.g. the same object observed simultaneously from multiple views or tracked between video frames). [199] uses TCA [14] to adapt image-level HOG representations between source and target domains for object detection. [200] proposes a Taylor-expansion-based classifier adaptation for either boosting or logistic regression to adapt person detection between videos acquired in different meeting rooms.
Online adaptation of the detector.
Most early works related to object detector adaptation concern online adaptation of a generic detector trained on strongly labeled images (bounding boxes) to detect objects (in general cars or pedestrians) in videos. These methods exploit redundancies in videos to obtain prospective positive target examples (windows), either by background modeling/subtraction [201, 202] or by combining object tracking with regions proposed by the generic detector [203, 204, 205, 206] (see the main idea in Figure 25). Using these designated target samples in the new frame, the model is updated with semi-supervised approaches such as self-training [207, 208] or co-training [209, 210]. For instance, [211] proposes a non-parametric detector adaptation algorithm, which adjusts an offline frame-based object detector to the visual characteristics of a new video clip. The Structure-Aware Adaptive Structural SVM (SA-SSVM) [212] adapts the deformable part-based model [213] online for pedestrian detection (see Figure 26). To handle the case when no target label is available, a strategy inspired by self-paced learning and supported by Gaussian Process Regression is used to automatically label samples in the target domains. The temporal structure of the video is exploited through similarity constraints imposed on the adapted detector.

Fig. 26

Domain Adaptation of DPM based on SA-SSVM [212]. (Image courtesy J. Xu).
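The self-training loop common to these online adaptation methods can be sketched as follows. This is a hedged toy stand-in: a nearest-centroid scorer replaces the actual detectors of [207, 208], and the threshold and learning rate are assumed values; confident detections on new frames become pseudo-labels that update the model.

```python
# Toy self-training loop for online detector adaptation.

def score(centroid, x):
    """Negative squared distance: higher means more object-like."""
    return -sum((c - a) ** 2 for c, a in zip(centroid, x))

def self_train(centroid, frames, threshold=-0.5, lr=0.5):
    for window in frames:                        # candidate windows
        if score(centroid, window) > threshold:  # confident positive
            # move the model toward the pseudo-labeled target sample
            centroid = [c + lr * (a - c) for c, a in zip(centroid, window)]
    return centroid

# Two confident windows update the model; the outlier window is ignored.
model = self_train([0.0, 0.0], [[0.2, 0.0], [5.0, 5.0], [0.4, 0.2]])
```

Real systems replace the confidence test with tracker consistency or background subtraction cues, which is what makes the pseudo-labels reliable.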
Multi-object tracking.
Multi-object tracking aims at automatically detecting and tracking individual object (e.g. car or pedestrian) instances [214, 205, 206]. These methods generally capitalize on multi-task and multi-instance learning to perform category-to-instance adaptation. For instance, [214] introduces a Multiple Instance Learning (MIL) loss function for Real AdaBoost, which is used within a tracking-based unsupervised online sample collection mechanism to incrementally adjust the pretrained detector. [205] proposes an unsupervised, online and self-tuning learning algorithm to optimize a multi-task-learning-based convex objective involving a high-precision/low-recall off-the-shelf generic detector. The method exploits the data structure to jointly learn an ensemble of instance-level trackers, from which adapted category-level object detectors are derived. The main idea in [206] is to jointly learn all detectors (the target instance models and the generic one) using online adaptation via Bayesian filtering coupled with multi-task learning to efficiently share parameters and reduce drift, while gradually improving recall.

The transductive approach in [203] re-trains the detector with automatically discovered target domain examples, starting with the easiest first and iteratively re-weighting labeled source samples by scoring trajectory tracks. [204] introduces a multi-class random fern adaptive classifier where different categories of positive samples (corresponding to different video tracks) are considered as different target classes, and all negative online samples are considered as a single negative target class. [215] proposes a particle filtering framework for multi-person tracking-by-detection to predict the target locations.
Deep neural architectures.
More recently, end-to-end deep learning object detection models were proposed that integrate and learn simultaneously the region proposals and the object appearance. In general, these models are initialized with deep models pretrained using image-level annotations (often on the ILSVRC datasets [122]). In fact, the pretrained deep model, combined with class-agnostic region-of-interest proposals, can already be used to predict the presence or absence of the target object in the proposed local regions [216, 217, 133, 218]. When strongly labeled target data is available, the model can further be fine-tuned using the labeled bounding boxes to improve both the recognition and the object localization. Thus, Large Scale Detection through Adaptation [218] learns to transform an image classifier into an object detector by fine-tuning the CNN model, pretrained on images, with a set of labeled bounding boxes (code available at https://github.com/jhoffman/lsda/zipball/master). The advantage of this model is that it generalizes well even for the localization of classes for which there were no bounding box annotations during the training phase. Instead of fine-tuning, [219] uses Subspace Alignment [19] to adjust class-specific representations of bounding boxes (BBs) between the source and target domains. The source BBs are extracted from the strongly annotated training set, while the target BBs are obtained with the RCNN detector [217] trained on the source set. The detector is then re-trained with the target-aligned source features and used to classify the target data projected into the target subspace.

The aim of this section is to relate domain adaptation to other machine learning solutions. First, in Section 6.1 we discuss how DA is related to other transfer learning (TL) techniques. Then, in Section 6.2 we connect DA to several classical machine learning approaches, illustrating how these methods are exploited in various DA solutions.
Finally, in Section 6.3 we examine the relationship between heterogeneous DA and multi-view/multi-modal learning.
As shown in Section 2, DA is a particular case of transductive transfer learning that aims to solve a classification task common to the source and target by simultaneously exploiting labeled source and unlabeled target examples (see also Figure 2). As such, DA is the opposite of unsupervised TL, where both domains and tasks are different and labels are available neither for the source nor for the target.

DA is also different from self-taught learning [220], which exploits limited labeled target data for a classification task together with a large amount of unlabeled source data mildly related to the task. The main idea behind self-taught learning is to explore the unlabeled source data and to discover repetitive patterns that could be used for the supervised learning task. On the other hand, DA is more closely related to domain generalization [221, 222, 138, 223, 224], multi-task learning [225, 226, 227] and few-shot learning [228, 229], discussed below.
Domain generalization.
Similarly to multi-source DA [83, 82, 84], domain generalization methods [221, 222, 138, 223, 224] aim to aggregate knowledge from several related source domains in order to learn a model for a new target domain. But, in contrast to DA, where unlabeled target instances are available to adapt the model, in domain generalization no target example is accessible at training time.
Multi-task learning.
In multi-task learning [225, 226, 227], different tasks (e.g. sets of labels) are learned at the same time using a shared representation, such that what is learned for each task can help in learning the other tasks. If we consider the tasks in DA as domain-specific (source and target) tasks, a semi-supervised DA method can be seen as a sort of two-task learning problem where, in particular, learning the source-specific task helps learning the target-specific task. Furthermore, in the case of multi-source domain adaptation [230, 231, 89, 87, 232, 86, 28, 233, 62, 77], different source-specific tasks are jointly exploited in the interest of the target task. On the other hand, as we have seen in Section 5.1, multi-task learning techniques can be beneficial for online DA, in particular for multi-object tracking and detection [205, 206], where the generic object detector (trained on source data) is adapted for each individual object instance.
Few-shot learning.
Few-shot learning [228, 229, 89, 234] aims to learn information about object categories when only a few training images are available. This is done by making use of prior knowledge from related categories for which a larger amount of annotated data is available. Existing solutions include knowledge transfer through the reuse of model parameters [235], methods sharing parts or features [236], and approaches relying on contextual information [237].

An extreme case of few-shot learning is zero-shot learning [238, 239], where the new task is deduced from previous tasks without using any training data for the current task. To address zero-shot learning, methods rely either on nameable image characteristics and semantic concepts [238, 239, 240, 241] or on latent topics discovered by the system directly from the data [242, 243, 244]. In both cases, detecting these attributes can be seen as the common task between the training classes (source domains) and the new classes (target domains).
Unified DA and TL models.
We have seen that the particularity of DA is the shared label space, in contrast to more generic TL approaches where the focus is on task transfer between classes. However, in [245] it is claimed that task transfer and domain shift can be seen as different declinations of the learning-to-learn paradigm, i.e. the ability to leverage prior knowledge when attempting to solve a new task. Based on this observation, a common framework is proposed to leverage source data regardless of the origin of the distribution mismatch. Considering prior models as experts, the original features are augmented with the output confidence values of the source models, and target classifiers are then learned with these features. Similarly, the Transductive Prediction Adaptation (TPA) [246] augments the target features with class predictions from source experts before applying the MDA framework [12] on these augmented features. It is shown that MDA, exploiting the correlations between the target features and the source predictions, can denoise the class predictions and improve classification accuracy. In contrast to the method in [245], TPA also works when no label is available in the target domain (US scenario).

The Cross-Domain Transformation [17] learns a regularized non-linear transformation, using supervised data from both domains, to map source examples closer to the target ones. It is shown that models built in this new space generalize well not only to new samples from categories used to train the transformation (DA), but also to new categories that were not present at training time (task transfer). The Unifying Multi-Domain Multi-Task Learning framework [247] is a neural network framework that can be flexibly applied to multi-task, multi-domain and zero-shot learning, and even to zero-shot domain adaptation.
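The feature augmentation shared by [245] and TPA [246] is simple to sketch: target features are concatenated with the confidence outputs of source "expert" models before a target classifier (or MDA) is applied. The expert below is a made-up two-class linear scorer, used only for illustration; neither paper prescribes this particular model.

```python
# Toy sketch of augmenting target features with source-expert predictions.

def source_expert(x):
    """A stand-in source model returning per-class confidence scores."""
    return [x[0] - x[1], x[1] - x[0]]        # two-class scores

def augment(features):
    """Concatenate each feature vector with the expert's scores."""
    return [x + source_expert(x) for x in features]

# Each 2-D target feature becomes 4-D: original dims + two expert scores.
augmented = augment([[1.0, 0.0], [0.0, 2.0]])
```

A target classifier trained on such augmented features can exploit correlations between the raw features and the (possibly noisy) source predictions.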
Semi-supervised learning.
DA can be seen as a particular case of semi-supervised learning [248, 249], where, similarly to the majority of DA approaches, unlabeled data is exploited to remedy the lack of labeled data. Hence, ignoring the domain shift, traditional semi-supervised learning can be used as a solution for DA, where the source instances form the supervised part and the target domain provides the unlabeled data. For this reason, DA methods often exploit or extend semi-supervised learning techniques such as the transductive SVM [56], self-training [207, 208, 78, 77], or co-training [209, 210]. When the domain shift is small, traditional semi-supervised methods can already bring a significant improvement over baselines obtained with the pretrained source model [56].
Active learning.
Instance-selection-based DA methods exploit ideas from active learning [250] to select the instances with the best potential to help the training process. Thus, the Migratory-Logit algorithm [251] explores both the target and source data to actively select unlabeled target samples to be added to the training set. [252] describes an active learning method for relevant target data selection and labeling, which combines TrAdaBoost [49] with a standard SVM. [224] (see also Chapter 15) uses active learning and DA techniques to generalize semantic object parts (e.g. animal eyes or legs) to unseen classes (animals). The methods described in [253, 254, 255, 78, 77, 256] combine transfer learning and domain adaptation with target sample selection and automatic sample labeling based on the classifier confidence. These new samples are then used to iteratively update the target models.
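A common confidence-driven selection rule used by such methods can be sketched as a margin split: target samples whose top class score barely beats the runner-up are sent for manual labeling, while high-margin samples are pseudo-labeled automatically. The scores and threshold below are toy values; the cited methods each use their own classifier and criterion.

```python
# Toy margin-based split of target samples for active labeling.

def margin(scores):
    """Gap between the best and second-best class scores."""
    top = sorted(scores, reverse=True)
    return top[0] - top[1]

def split_for_labeling(samples, tau=0.5):
    """Return (to_annotate, auto_labeled) according to the margin tau."""
    to_annotate = [s for s in samples if margin(s) < tau]
    auto_labeled = [s for s in samples if margin(s) >= tau]
    return to_annotate, auto_labeled

# The ambiguous middle sample is queried; the confident ones are kept.
ask, auto = split_for_labeling([[0.9, 0.1], [0.55, 0.45], [0.2, 0.9]])
```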
Online learning.
Online or sequential learning [257, 258, 259] is strongly related to active learning; in both cases the model is iteratively and continuously updated using new data. However, while in active learning the data to be used for the update is actively selected, in online learning the new data is generally acquired sequentially. Domain adaptation can be combined with online learning too. As an example, we presented in Section 5.1 the online adaptation to incoming video frames of a generic object detector trained offline on labeled image sets [215, 212]. [109] proposes online adaptation of an image classifier to user-generated content in social computing applications. Furthermore, as discussed in Section 4, fine-tuning a deep model [132, 128, 133, 134, 135, 21] pretrained on ImageNet (source) for a new dataset (target) can be seen as a sort of semi-supervised domain adaptation. Both fine-tuning and training deepDA models [147, 143, 154] use sequential learning, where data batches are used to perform the stochastic gradient updates. If we assume that these batches contain target data acquired sequentially, the model learning process can be directly used for online DA adaptation of the original model.
Metric learning.
In Section 3 we presented several metric-learning-based DA methods [74, 17, 260, 76, 77], where class labels from both domains are exploited to bridge the relatedness between the source and the target. Thus, [74] derives a new distance metric for the target domain from existing distance metrics learned on the source domain. [17] uses information-theoretic metric learning [75] as a distance metric across different domains, which was extended to non-linear kernels in [76]. [77] proposes a metric learning method adapted to the DSCM classifier, while [260] defines a multi-task metric learning framework to learn relationships between source and target tasks. [4] explores various metric learning approaches to align deep features extracted from RGB and NIR face images.
Fig. 27
Illustrating through an example the difference between TL and ML in the case of homogeneous data, and between multi-view learning and HTL/HDA when working with heterogeneous data. (Image courtesy Q. Yang [264]).
Classifier ensembles.
Well studied in ML, classifier ensembles have also been considered for DA and TL. As such, [261] applies a bagging approach for transferring the learning capabilities of a model to different domains, where a high number of trees is learned on both source and target data in order to build a pruned version of the final ensemble and avoid negative transfer. [262] uses random decision forests to transfer relevant features between domains. The optimization framework in [263] takes as input several classifiers learned on the source domain as well as the results of a cluster ensemble operating solely on the target domain, yielding a consensus labeling of the data in the target domain. Boosting was extended to DA and TL in [49, 51, 200, 87, 52].
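The core of such consensus schemes can be sketched as a majority vote over the predictions of several source-trained classifiers; in [263] this vote is further reconciled with a target-side cluster ensemble, which is omitted from this hedged toy sketch.

```python
# Toy majority-vote consensus over an ensemble of source classifiers.
from collections import Counter

def consensus(votes_per_sample):
    """For each target sample, return the most frequent ensemble label."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in votes_per_sample]

# Three source classifiers voting on two target samples.
labels = consensus([["car", "car", "bus"], ["bus", "bus", "bus"]])
```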
In many data-intensive applications, such as video surveillance, social computing, medical health records or environmental sciences, data collected from diverse domains or obtained from various feature extractors exhibit heterogeneity. For example, a person can be identified by different facets, e.g. face, fingerprint, signature or iris, and in video surveillance an action or event can be recognized using multiple cameras. When working with such heterogeneous or multi-view data, most methods try to exploit the different modalities simultaneously to build better final models.
As such, multi-view learning methods are related to HDA/HTL, as also discussed in Section 3.3. Nevertheless, while multi-view learning [90, 91] assumes that multi-view examples are available during training, in the case of HDA [116, 119, 84, 28, 118] this assumption rarely holds (see the illustration in Figure 27). On the contrary, the aim of HDA is to transfer information from a source domain represented with one type of data (e.g. text) to a target domain represented with another type of data (e.g. images). While this assumption essentially differentiates multi-view learning from HDA, we have seen in Section 3.3 that HDA methods often rely on an auxiliary intermediate multi-view domain [104, 106, 103, 101, 107, 109]. Hence, HDA/HTL can strongly benefit from multi-view learning techniques such as canonical correlation analysis [93], co-training [265], spectral embedding [116] and multiple kernel learning [114].

Similarly to HDA/HTL relying on intermediate domains, cross-modal image retrieval methods depend on multi-view auxiliary data to define cross-modal similarities [266, 267], or to perform semantic [268, 269, 270, 95] or multi-view embedding [93, 110, 99, 111, 112, 113]. Hence, HDA/HTL can strongly benefit from such cross-modal data representations. In the same spirit, webly supervised approaches [271, 272, 273, 274, 158, 159, 275] are also related to DA and HDA, as these approaches rely on collected Web (source) data used to refine the target model. As such, [276] uses multiple kernel learning to adapt visual events learned from Web data to video clips. [277] and [275] propose domain transfer approaches from weakly-labeled Web images for action localization and event recognition tasks.
In this chapter we tried to provide an overview of different solutions for visual domain adaptation, including both shallow and deep methods. We grouped the methods both by the similarity of the problems addressed (homogeneous vs. heterogeneous data, unsupervised vs. semi-supervised scenario) and by the solutions proposed (feature transformation, instance reweighting, deep models, online learning, etc.). We also reviewed methods that solve DA in the case of heterogeneous data, as well as approaches that address computer vision problems beyond image classification, such as object detection or multi-object tracking. Finally, we positioned domain adaptation within a larger context by linking it to other transfer learning techniques as well as to traditional machine learning approaches.

Due to the lack of space and the large number of methods mentioned, we could only briefly depict each method; the interested reader can follow the references for deeper reading. We also decided not to provide any comparative experimental results between these methods, for the following reasons. (1) Even if many DA methods were tested on the benchmark OFF31 [17] and OC10 [18] datasets, papers often use different experimental protocols (sampling the source vs. using the whole data, unsupervised vs. supervised) and different parameter tuning strategies (fixed parameter sets, tuning on the source, cross-validation, or unknown). (2) Results reported in different papers for the same methods (e.g. GFK, TCA, SA) also vary a lot between different re-implementations. For all these reasons, making a fair comparison between all the methods based only on the literature review is rather difficult. (3) These datasets are rather small, some methods have published results only with the outdated SURFBOV features, and relying only on these results is not sufficient to derive general conclusions about the methods.
For a fair comparison, deep methods should be compared to shallow methods using deep features extracted from similar architectures, but both features extracted from the latest deep models and deep DA architectures built on these models perform extremely well on OFF31 and OC10 even without adaptation.

Most DA solutions in the literature are tested on these relatively small datasets (both in terms of number of classes and number of images). However, with the proliferation of sensors, large amounts of heterogeneous data are collected, and hence there is a real need for solutions able to exploit them efficiently. This calls for more challenging datasets on which to evaluate and compare the performance of different methods. The few new DA datasets, such as the Testbed cross-dataset (TB) [278] or datasets built for model adaptation between photos, paintings and sketches [1, 2, 3, 4], while more challenging than the popular OFF31 [17], OC10 [18] or MNIST [279] vs. SVHN [280], are only sparsely used. Moreover, except for the cross-modal Places dataset [3], they are still small-scale, single-modality datasets.

We have also seen that relatively few papers address adaptation beyond recognition and detection. For image and video understanding, semantic and instance-level segmentation, human pose, event and action recognition, motion and 3D scene understanding, simply describing the problem with a vectorial representation and applying classical domain adaptation, even when possible, has serious limitations. Recently, these challenging problems have been addressed with deep methods requiring large amounts of labeled data. How to adapt these new models between domains with no or a very limited amount of data is probably one of the main challenges that should be addressed by the visual domain adaptation and transfer learning community in the next few years.

References
1. B. F. Klare, S. S. Bucak, A. K. Jain, and T. Akgul, “Towards automated caricature recognition,” in
International Confer-ence on Biometrics (ICB) , 2012.2. E. J. Crowley and A. Zisserman, “In search of art,” in
ECCV Workshop on Computer Vision for Art Analysis , 2014.3. L. Castrej´on, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, “Learning aligned cross-modal representations fromweakly aligned data,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016.4. S. Saxena and J. Verbeek, “Heterogeneous face recognition with cnns,” in
ECCV Workshop on Transferring and AdaptingSource Knowledge in Computer Vision (TASK-CV) , 2016.5. H. Daum´e III and D. Marcu, “Domain adaptation for statistical classifiers,”
Journal of Artificial Intelligence Research ,vol. 26, no. 1, pp. 101–126, 2006.6. S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen, “Cross-domain sentiment classification via spectral feature alignment,”in
International Conference on World Wide Web (WWW) , 2010.7. J. Blitzer, S. Kakade, and D. P. Foster, “Domain adaptation with coupled subspaces,” in
International Conference onArtificial Intelligence and Statistics (AISTATS) , 2011.8. M. Zhou and K. C. Chang, “Unifying learning to rank and domain adaptation: Enabling cross-task document scoring,” in
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD) , 2014.9. P. Prettenhofer and B. Stein, “Cross-language text classification using structural correspondence learning,” in
AnnualMeeting of the Association for Computational Linguistics(ACL) , 2010.10. J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan, “Heterogeneous domain adaptation for multiple classes,” in
InternationalConference on Artificial Intelligence and Statistics (AISTATS) , 2014.11. J. T. Zhou, S. J. Pan, I. W. Tsang, and Y. Yan, “Hybrid heterogeneous transfer learning through deep learning,” in
AAAIConference on Artificial Intelligence (AAAI) , 2014.12. M. Chen, Z. Xu, K. Q. Weinberger, and F. Sha, “Marginalized denoising autoencoders for domain adaptation,” in
Inter-national Conference on Machine Learning (ICML) , 2012.13. X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning ap-proach,” in
International Conference on Machine Learning (ICML) , 2011.14. S. J. Pan, J. T. Tsang, Ivor W.and Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,”
Transactionson Neural Networks , vol. 22, no. 2, pp. 199 – 210, 2011.15. C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous densityhidden markov models,”
Computer Speech and Language , vol. 9, no. 2, pp. 171–185, 1995.omain Adaptation for Visual Applications: A Comprehensive Survey 3716. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian Mixture Models,”
DigitalSignal Processing , vol. 10, no. 1, pp. 19–41, 2000.17. K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in
European Conferenceon Computer Vision (ECCV) , 2010.18. B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in
IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) , 2012.19. B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace align-ment,” in
IEEE International Conference on Computer Vision (ICCV) , 2013.20. M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in
IEEEInternational Conference on Computer Vision (ICCV) , 2013.21. B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in
AAAI Conference on ArtificialIntelligence (AAAI) , 2016.22. J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive svms,” in
IEEE Interna-tional Conference on Computer Vision (ICCV) , 2013.23. L. Duan, D. Xu, and S.-F. Chang, “Exploiting web images for event recognition in consumer videos: A multiple sourcedomain adaptation approach,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2012.24. N. FarajiDavar, T. deCampos, D. Windridge, J. Kittler, and W. Christmas, “Domain adaptation in the context of sportvideo action recognition,” in
BMVA British Machine Vision Conference (BMVC) , 2012.25. F. Zhu and L. Shao, “Enhancing action recognition by cross-domain dictionary learning,” in
BMVA British Machine VisionConference (BMVC) , 2013.26. H. Shen, S.-I. Yu, Y. Yang, D. Meng, and A. Hauptmann, “Unsupervised video adaptation for parsing human motion,” in
European Conference on Computer Vision (ECCV) , 2014.27. M. Yang, L. Zhang, X. Feng, and D. Zhang, “Fisher discrimination dictionary learning for sparse representation,” in
IEEEInternational Conference on Computer Vision (ICCV) , 2011.28. S. Shekhar, V. M. Patel, H. V. Nguyen, and R. Chellappa, “Generalized domain-adaptive dictionaries,” in
IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR) , 2013.29. A. Sharma and D. W. Jacobs, “Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2011.30. B. Smith and L. Zhang, “Collaborative facial landmark localization for transferring annotations across datasets,” in
European Conference on Computer Vision (ECCV), 2014.
31. W.-S. Chu, F. De la Torre, and J. F. Cohn, “Selective transfer machine for personalized facial action unit detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
32. M. Yamada, L. Sigal, and M. Raptis, “No bias left behind: Covariate shift adaptation for discriminative 3D pose estimation,” in European Conference on Computer Vision (ECCV), 2012.
33. B. Chidlovskii, S. Clinchant, and G. Csurka, “Domain adaptation in the absence of source domain data,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2016.
34. G. Csurka, B. Chidlovskii, and S. Clinchant, “Adapted domain specific class means,” in ICCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2015.
35. G. Csurka, D. Larlus, A. Gordo, and J. Almazan, “What is the right way to represent document images?,” CoRR, vol. arXiv:1603.01076, 2016.
36. K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 9, no. 3, 2016.
37. S. J. Pan and Q. Yang, “A survey on transfer learning,” Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
38. W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Self-taught clustering,” in International Conference on Machine Learning (ICML), 2008.
39. Z. Wang, Y. Song, and C. Zhang, “Transferred dimensionality reduction,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2008.
40. M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer joint matching for unsupervised domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
41. H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 227–244, 2000.
42. B. Zadrozny, “Learning and evaluating classifiers under sample selection bias,” in International Conference on Machine Learning (ICML), 2004.
43. M. Sugiyama, S. Nakajima, H. Kashima, P. v. Buenau, and M. Kawanabe, “Direct importance estimation with model selection and its application to covariate shift adaptation,” in
Annual Conference on Neural Information Processing Systems (NIPS), 2008.
44. T. Kanamori, S. Hido, and M. Sugiyama, “Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection,” Journal of Machine Learning Research, vol. 10, pp. 1391–1445, 2009.
45. J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf, “Correcting sample selection bias by unlabeled data,” in Annual Conference on Neural Information Processing Systems (NIPS), 2007.
46. A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf, “Covariate shift by kernel mean matching,” in Dataset Shift in Machine Learning (J. Quiñonero Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, eds.), The MIT Press, 2009.
47. K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, pp. 49–57, 2006.
48. M. Dudík, R. E. Schapire, and S. J. Phillips, “Correcting sample selection bias in maximum entropy density estimation,” in Annual Conference on Neural Information Processing Systems (NIPS), 2005.
49. W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for transfer learning,” in International Conference on Machine Learning (ICML), 2007.
50. Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
51. S. Al-Stouhi and C. K. Reddy, “Adaptive boosting for transfer learning using dynamic updates,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2011.
52. B. Chidlovskii, G. Csurka, and S. Gangwar, “Assembling heterogeneous domain adaptation methods for image classification,” in CLEF online Working Notes, 2014.
53. T. Joachims, “Transductive inference for text classification using support vector machines,” in International Conference on Machine Learning (ICML), 1999.
54. J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive SVMs,” in ACM Multimedia, 2007.
55. L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank, “Domain transfer SVM for video concept detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
56. L. Bruzzone and M. Marconcini, “Domain adaptation problems: A DASVM classification technique and a circular validation strategy,”
Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, pp. 770–787, 2010.
57. Y. Chen, G. Wang, and S. Dong, “Learning with progressive transductive support vector machine,” Pattern Recognition Letters, vol. 24, no. 12, pp. 845–855, 2003.
58. W. Jiang, E. Zavesky, S.-F. Chang, and A. Loui, “Cross-domain learning methods for high-level visual concept classification,” in International Conference on Image Processing (ICIP), 2008.
59. H. Cheng, P.-N. Tan, and R. Jin, “Localized support vector machine and its efficient algorithm,” in SIAM International Conference on Data Mining (SDM), 2007.
60. H. Daumé III, “Frustratingly easy domain adaptation,” CoRR, vol. arXiv:0907.1815, 2009.
61. R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach,” in IEEE International Conference on Computer Vision (ICCV), 2011.
62. R. Gopalan, R. Li, and R. Chellappa, “Unsupervised adaptation across domain shifts by generating intermediate data representations,” Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 11, 2014.
63. B. Gong, K. Grauman, and F. Sha, “Connecting the dots with landmarks: Discriminatively learning domain invariant features for unsupervised domain adaptation,” in International Conference on Machine Learning (ICML), 2013.
64. J. Ni, Q. Qiu, and R. Chellappa, “Subspace interpolation via dictionary learning for unsupervised domain adaptation,” in IEEE International Conference on Computer Vision (ICCV), 2013.
65. M. Baktashmotlagh, M. Harandi, B. Lovell, and M. Salzmann, “Unsupervised domain adaptation by domain invariant projection,” in IEEE International Conference on Computer Vision (ICCV), 2013.
66. M. Baktashmotlagh, M. Harandi, B. Lovell, and M. Salzmann, “Domain adaptation on the statistical manifold,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
67. A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” Journal of Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
68. M. Long, G. Ding, J. Wang, J. Sun, Y. Guo, and P. Yu, “Transfer sparse coding for robust image representation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
69. G. Matasci, M. Volpi, M. Kanevski, L. Bruzzone, and D. Tuia, “Semi-supervised transfer component analysis for domain adaptation in remote sensing image classification,”
Transactions on Geoscience and Remote Sensing, vol. 53, no. 7, pp. 3550–3564, 2015.
70. G. Csurka, B. Chidlovskii, S. Clinchant, and S. Michel, “Unsupervised domain adaptation with regularized domain instance denoising,” in ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2016.
71. M. Long, J. Wang, G. Ding, S. J. Pan, and P. Yu, “Adaptation regularization: a general framework for transfer learning,” Transactions on Knowledge and Data Engineering, vol. 26, no. 5, pp. 1076–1089, 2014.
72. J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko, “Efficient learning of domain-invariant image representations,” in International Conference on Learning Representations (ICLR), 2013.
73. E. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, D. Turaga, and O. Verscheure, “Cross domain distribution adaptation via kernel mapping,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2009.
74. Z.-J. Zha, T. Mei, M. Wang, Z. Wang, and X.-S. Hua, “Robust distance metric learning with auxiliary knowledge,” in AAAI International Joint Conference on Artificial Intelligence (IJCAI), 2009.
75. J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in International Conference on Machine Learning (ICML), 2007.
76. B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
77. G. Csurka, B. Chidlovskii, and F. Perronnin, “Domain adaptation with a domain specific class means classifier,” in ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2014.
78. T. Tommasi and B. Caputo, “Frustratingly easy NBNN domain adaptation,” in IEEE International Conference on Computer Vision (ICCV), 2013.
79. N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, “Optimal transport for domain adaptation,” CoRR, vol. arXiv:1507.00504, 2015.
80. N. FarajiDavar, T. deCampos, and J. Kittler, “Adaptive transductive transfer machines,” in BMVA British Machine Vision Conference (BMVC), 2014.
81. R. Aljundi, R. Emonet, D. Muselet, and M. Sebban, “Landmarks-based kernelized subspace alignment for unsupervised domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
82. L. Duan, D. Xu, and I. W. Tsang, “Domain adaptation from multiple sources: A domain-dependent regularization approach,”
Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 504–518, 2012.
83. R. Chattopadhyay, J. Ye, S. Panchanathan, W. Fan, and I. Davidson, “Multi-source domain adaptation and its application to early detection of fatigue,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2011.
84. I.-H. Jhuo, D. Liu, D. Lee, and S.-F. Chang, “Robust visual domain adaptation with low-rank reconstruction,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
85. R. Caseiro, J. F. Henriques, P. Martins, and J. Batista, “Beyond the shortest path: Unsupervised domain adaptation by sampling subspaces along the spline flow,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
86. J. Hoffman, B. Kulis, T. Darrell, and K. Saenko, “Discovering latent domains for multisource domain adaptation,” in European Conference on Computer Vision (ECCV), 2012.
87. Y. Yao and G. Doretto, “Boosting for transfer learning with multiple sources,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
88. L. Ge, J. Gao, H. Ngo, K. Li, and A. Zhang, “On handling negative transfer and imbalanced distributions in multiple source transfer learning,” in SIAM International Conference on Data Mining (SDM), 2013.
89. T. Tommasi and B. Caputo, “Safety in numbers: learning categories from few examples with multi model knowledge transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
90. C. Xu, D. Tao, and C. Xu, “A survey on multi-view learning,” CoRR, vol. arXiv:1304.5634, 2013.
91. W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in International Conference on Machine Learning (ICML), 2015.
92. K. Chaudhuri, S. Kakade, K. Livescu, and K. S. Sridharan, “Multi-view clustering via canonical correlation analysis,” in International Conference on Machine Learning (ICML), 2009.
93. D. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
94. R. Socher and F.-F. Li, “Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
95. F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
96. J. Hoffman, S. Gupta, and T. Darrell, “Learning with side information through modality hallucination,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
97. A. Vinokourov, N. Cristianini, and J. Shawe-Taylor, “Inferring a semantic representation of text via cross-language correlation analysis,” in Annual Conference on Neural Information Processing Systems (NIPS), 2003.
98. M. Faruqui and C. Dyer, “Improving vector space word representations using multilingual correlation,” in Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014.
99. A. Sharma, A. Kumar, H. Daumé III, and D. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
100. Q. Qiu, V. M. Patel, P. Turaga, and R. Chellappa, “Domain adaptive dictionary learning,” in European Conference on Computer Vision (ECCV), 2012.
101. B. Tan, Y. Song, E. Zhong, and Q. Yang, “Transitive transfer learning,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2015.
102. C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix tri-factorizations for clustering,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2005.
103. B. Tan, E. Zhong, M. K. Ng, and Q. Yang, “Mixed-transfer: Transfer learning over mixed graphs,” in SIAM International Conference on Data Mining (SDM), 2014.
104. Y. Zhu, Y. Chen, Z. Lu, S. J. Pan, G.-R. Xue, Y. Yu, and Q. Yang, “Heterogeneous transfer learning for image classification,” in AAAI Conference on Artificial Intelligence (AAAI), 2011.
105. A. Singh, P. Singh, and G. Gordon, “Relational learning via collective matrix factorization,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2008.
106. G.-J. Qi, C. Aggarwal, and T. Huang, “Towards semantic knowledge propagation from text corpus to web images,” in International Conference on World Wide Web (WWW), 2011.
107. L. Yang, L. Jing, J. Yu, and M. K. Ng, “Learning transferred weights from co-occurrence data for heterogeneous transfer learning,” Transactions on Neural Networks and Learning Systems, vol. 27, no. 11, pp. 2187–2200, 2015.
108. C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan, “An introduction to MCMC for machine learning,”
Machine Learning, vol. 50, no. 1, pp. 5–43, 2003.
109. Y. Yan, Q. Wu, M. Tan, and H. Min, “Online heterogeneous transfer learning by weighted offline and online classifiers,” in ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2016.
110. J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning (ICML), 2011.
111. Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
112. G. Cao, A. Iosifidis, K. Chen, and M. Gabbouj, “Generalized multi-view embedding for visual recognition and cross-modal retrieval,” CoRR, vol. arXiv:1605.09696, 2016.
113. L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
114. L. Duan, D. Xu, and I. W. Tsang, “Learning with augmented features for heterogeneous domain adaptation,” Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 6, pp. 1134–1148, 2012.
115. W. Li, L. Duan, D. Xu, and I. W. Tsang, “Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation,” Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 6, pp. 1134–1148, 2014.
116. X. Shi, Q. Liu, W. Fan, P. S. Yu, and R. Zhu, “Transfer learning on heterogeneous feature spaces via spectral transformation,” in IEEE International Conference on Data Mining (ICDM), 2010.
117. M. Xiao and Y. Guo, “Semi-supervised subspace co-projection for multi-class heterogeneous domain adaptation,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2015.
118. T. Yao, Y. Pan, C.-W. Ngo, H. Li, and T. Mei, “Semi-supervised domain adaptation with subspace learning for visual recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
119. C. Wang and S. Mahadevan, “Heterogeneous domain adaptation using manifold alignment,” in AAAI International Joint Conference on Artificial Intelligence (IJCAI), 2011.
120. M. Harel and S. Mannor, “Learning from multiple outlooks,” in International Conference on Machine Learning (ICML), 2011.
121. D. L. Donoho, “Compressed sensing,” Transactions on Information Theory, vol. 52, pp. 1289–1306, 2006.
122. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,”
International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
123. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning (ICML), 2014.
124. A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Annual Conference on Neural Information Processing Systems (NIPS), 2012.
125. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. arXiv:1409.1556, 2014.
126. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. arXiv:1512.03385, 2015.
127. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
128. S. Chopra, S. Balakrishnan, and R. Gopalan, “DLID: Deep learning for domain adaptation by interpolating between domains,” in ICML Workshop on Challenges in Representation Learning (WREPL), 2013.
129. Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1798–1828, 2013.
130. J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Annual Conference on Neural Information Processing Systems (NIPS), 2014.
131. E. J. Crowley and A. Zisserman, “The state of the art: Object retrieval in paintings using discriminative regions,” in BMVA British Machine Vision Conference (BMVC), 2014.
132. M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” CoRR, vol. arXiv:1311.2901, 2013.
133. M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
134. A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky, “Neural codes for image retrieval,” in European Conference on Computer Vision (ECCV), 2014.
135. B. Chu, V. Madhavan, O. Beijbom, J. Hoffman, and T. Darrell, “Best practices for fine-tuning visual classifiers to new domains,” in
ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2016.
136. E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in NIPS Workshop on Adversarial Training (WAT), 2016.
137. P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in International Conference on Machine Learning (ICML), 2008.
138. M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi, “Domain generalization for object recognition with multi-task autoencoders,” in IEEE International Conference on Computer Vision (ICCV), 2015.
139. K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to object recognition,” CoRR, vol. arXiv:1010.3467, 2010.
140. R. Aljundi and T. Tuytelaars, “Lightweight unsupervised domain adaptation by convolutional filter reconstruction,” in ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2016.
141. J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 4, pp. 669–688, 1993.
142. E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” CoRR, vol. arXiv:1412.3474, 2014.
143. M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International Conference on Machine Learning (ICML), 2015.
144. B. Sun and K. Saenko, “Deep CORAL: Correlation alignment for deep domain adaptation,” in ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2016.
145. M. Long, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” CoRR, vol. arXiv:1605.06636, 2016.
146. A. Rozantsev, M. Salzmann, and P. Fua, “Beyond sharing weights for deep domain adaptation,” CoRR, vol. arXiv:1603.06432, 2016.
147. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, 2016.
148. E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
149. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Annual Conference on Neural Information Processing Systems (NIPS), 2014.
150. M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Annual Conference on Neural Information Processing Systems (NIPS), 2016.
151. K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” CoRR, vol. arXiv:1612.05424, 2016.
152. D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Annual Conference on Neural Information Processing Systems (NIPS), 2014.
153. K. Bousmalis, G. Trigeorgis, N. Silberman, D. Erhan, and D. Krishnan, “Domain separation networks,” in Annual Conference on Neural Information Processing Systems (NIPS), 2016.
154. M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi, “Deep reconstruction-classification networks for unsupervised domain adaptation,” in European Conference on Computer Vision (ECCV), 2016.
155. M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus, “Deconvolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
156. W.-Y. Chen, T.-M. H. Hsu, and Y.-H. H. Tsai, “Transfer neural trees for heterogeneous domain adaptation,” in European Conference on Computer Vision (ECCV), 2016.
157. X. Shu, G.-J. Qi, J. Tang, and J. Wang, “Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation,” in ACM Multimedia, 2015.
158. S. Divvala, A. Farhadi, and C. Guestrin, “Learning everything about anything: Webly-supervised visual concept learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
159. X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
160. E. Crowley and A. Zisserman, “The art of detection,” in ECCV Workshop on Computer Vision for Art Analysis (CVAA), 2016.
161. A. Agarwal and B. Triggs, “A local basis representation for estimating human pose from cluttered images,” in Asian Conference on Computer Vision (ACCV), 2006.
162. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
163. P. Panareda-Busto, J. Liebelt, and J. Gall, “Adaptation of synthetic data for coarse-to-fine viewpoint refinement,” in BMVA British Machine Vision Conference (BMVC), 2015.
164. H. Su, C. Qi, Y. Yi, and L. Guibas, “Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views,” in IEEE International Conference on Computer Vision (ICCV), 2015.
165. M. Stark, M. Goesele, and B. Schiele, “Back to the future: Learning shape models from 3D CAD data,” in BMVA British Machine Vision Conference (BMVC), 2010.
166. B. Pepik, M. Stark, P. Gehler, and B. Schiele, “Teaching 3D geometry to deformable part models,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
167. B. Sun and K. Saenko, “From virtual to reality: Fast adaptation of virtual object detectors to real domains,” in BMVA British Machine Vision Conference (BMVC), 2014.
168. A. Rozantsev, V. Lepetit, and P. Fua, “On rendering synthetic images for training an object detector,” Computer Vision and Image Understanding, vol. 137, pp. 24–37, 2015.
169. X. Peng, B. Sun, K. Ali, and K. Saenko, “Learning deep object detectors from 3D models,” in IEEE International Conference on Computer Vision (ICCV), 2015.
170. H. Hattori, V. Naresh Boddeti, K. M. Kitani, and T. Kanade, “Learning scene-specific pedestrian detectors without real data,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
171. F. Massa, B. Russell, and M. Aubry, “Deep exemplar 2D-3D detection by adapting from real to rendered views,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
172. E. Bochinski, V. Eiselein, and T. Sikora, “Training a convolutional neural network for multi-class object detection using solely virtual world data,” in IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), 2016.
173. S. Satkin, J. Lin, and M. Hebert, “Data-driven scene understanding from 3D models,” in BMVA British Machine Vision Conference (BMVC), 2012.
174. L.-C. Chen, S. Fidler, A. L. Yuille, and R. Urtasun, “Beat the MTurkers: Automatic image labeling from weak 3D supervision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
175. J. Papon and M. Schoeler, “Semantic pose using deep networks trained on synthetic RGB-D,” in
IEEE International Conference on Computer Vision (ICCV), 2015.
176. G. Ros, L. Sellart, J. Materzyńska, D. Vázquez, and A. López, “The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
177. A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
178. S. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in European Conference on Computer Vision (ECCV), 2016.
179. S. Meister and D. Kondermann, “Real versus realistically rendered scenes for optical flow evaluation,” in ITG Conference on Electronic Media Technology (CEMT), 2011.
180. D. Butler, J. Wulff, G. Stanley, and M. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conference on Computer Vision (ECCV), 2012.
181. N. Onkarappa and A. Sappa, “Synthetic sequences and ground-truth flow field generation for algorithm validation,” Multimedia Tools and Applications, vol. 74, no. 9, pp. 3121–3135, 2015.
182. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
183. G. Taylor, A. Chosak, and P. Brewer, “OVVV: Using virtual worlds to design and evaluate surveillance systems,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
184. A. Shafaei, J. Little, and M. Schmidt, “Play and learn: Using video games to train computer vision models,” in BMVA British Machine Vision Conference (BMVC), 2016.
185. J. Marín, D. Vázquez, D. Gerónimo, and A. M. López, “Learning appearance in virtual scenarios for pedestrian detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
186. D. Vázquez, A. M. López, J. Marín, D. Ponsa, and D. Gerónimo, “Virtual and real world adaptation for pedestrian detection,” Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 4, pp. 797–809, 2014.
187. J. Xu, D. Vázquez, A. López, J. Marín, and D. Ponsa, “Learning a part-based pedestrian detector in a virtual world,” Transactions on Intelligent Transportation Systems, vol. 15, no. 5, pp. 2121–2131, 2014.
188. A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla, “Understanding real world indoor scenes with synthetic data,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016.189. C. D. Souza, A. Gaidon, Y. Cabon, and A. L´opez, “Procedural generation of videos to train deep action recognitionnetworks,”
CoRR , vol. arXiv:1612.00881, 2016.190. E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell, “Towards adapting deep visuo-motor representations from simulated to real environments,”
CoRR , vol. arXiv:1511.07111, 2015.191. J. Xu, S. Ramos, D. V´azquez, and A. L´opez, “Hierarchical adaptive structural SVM for domain adaptation,”
InternationalJournal of Computer Vision , vol. 119, no. 2, pp. 159–178, 2016.192. X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in
IEEE International Conference onComputer Vision (ICCV) , 2013.193. C. L. Zitnick and P. Doll´ar, “Edge boxes: Locating object proposals from edges,” in
European Conference on ComputerVision (ECCV) , 2014.194. J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,”
InternationalJournal of Computer Vision , vol. 104, no. 2, pp. 154–171, 2013.195. Y. Aytar and A. Zisserman, “Tabula rasa: Model transfer for object category detection,” in
IEEE International Conferenceon Computer Vision (ICCV) , 2011.196. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2005.197. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trainedpart-based models,”
Transactions of Pattern Recognition and Machine Analyses (PAMI) , vol. 32, no. 9, pp. 1627–1645,2010.198. J. Donahue, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell, “Semi-supervised domain adaptation with instance con-straints,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2013.199. F. Mirrashed, V. I. Morariu, B. Siddiquie, R. S. Feris, and L. S. Davis, “Domain adaptive object detection,” in
Workshopson Application of Computer Vision (WACV) , 2013.200. C. Zhang, R. Hammid, and Z. Zhang, “Taylor expansion based classifier adaptation: Application to person detection,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2008.4 Gabriela Csurka201. P. M. Roth, S. Sternig, H. Grabner, and H. Bischof, “Classifier grids for robust adaptive object detection,” in
IEEEConference on Computer Vision and Pattern Recognition (CVPR) , 2009.202. S. Stalder, H. Grabner, and L. V. Gool, “Exploring context to learn scene specific object detectors.,” in
InternationalWorkshop on Performance Evaluation of Tracking and Surveillance (PETS) , 2009.203. K. Tang, V. Ramanathan, L. Fei-Fei, and D. Koller, “Shifting weights: Adapting object detectors from image to video,” in
Annual Conference on Neural Information Processing Systems (NIPS) , 2012.204. P. Sharma and R. Nevatia, “Efficient detector adaptation for object detection in a video,” in
IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2013.205. A. Gaidon, G. Zen, and J. A. Rodriguez-Serrano, “Self-learning camera: Autonomous adaptation of object detectors tounlabeled video streams.,”
CoRR , vol. arXiv:1406.4296, 2014.206. A. Gaidon and E. Vig, “Online domain adaptation for multi-object tracking,” in
BMVA British Machine Vision Conference(BMVC) , 2015.207. C. Rosenberg, M. Hebert, and H. Schneiderman, “Semisupervised self-training of object detection models,” in
Workshopson Application of Computer Vision (WACV/MOTION) , 2005.208. B. Wu and R. Nevatia, “Improving part based object detection by unsupervised, online boosting,” in
IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2007.209. O. Javed, S. Ali, and M. Shah, “Online detection and classification of moving objects using progressively improvingdetectors,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2005.210. A. Levin, P. Viola, and Y. Freund, “Unsupervised improvement of visual detectors using co-training,” in
IEEE Interna-tional Conference on Computer Vision (ICCV) , 2013.211. X. Wang, G. Hua, and T. X. han, “Detection by detections: Non-parametric detector adaptation for a video,” in
IEEEConference on Computer Vision and Pattern Recognition (CVPR) , 2012.212. J. Xu, S. Ramos, D. V´azquez, and A. L´opez, “Domain adaptation of deformable part-based models,”
Transactions ofPattern Recognition and Machine Analyses (PAMI) , vol. 36, no. 12, pp. 2367–2380, 2014.213. P. Doll´ar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: a benchmark,” in
IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2009.214. P. Sharma, C. Huang, and R. Nevatia, “Unsupervised incremental learning for improved object detection in a video,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2012.215. M. D. Breitenstein, F. Reichlin, E. Koller-Meier, B. Leibe, and L. Van Gool, “Online multi-person tracking-by-detectionfrom a single, uncalibrated camera,”
Transactions of Pattern Recognition and Machine Analyses (PAMI) , vol. 31, no. 9,pp. 1820–1833, 2011.216. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localizationand detection using convolutional networks,”
CoRR , vol. arXiv:1312.6229, 2013.217. R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semanticsegmentation,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2014.218. J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, “LSDA: Large scaledetection through adaptation,” in
Annual Conference on Neural Information Processing Systems (NIPS) , 2014.219. A. Raj, V. P. N. Namboodiri, and T. Tuytelaars, “Subspace alignment based domain adaptation for rcnn detector,” in
BMVA British Machine Vision Conference (BMVC) , 2015.220. R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng, “Self-taught learning: transfer learning from unlabeled data,” in
International Conference on Machine Learning (ICML) , 2007.221. K. Muandet, D. Balduzzi, and B. Sch¨olkopf, “Domain generalization via invariant feature representation,” in
InternationalConference on Machine Learning (ICML) , 2013.222. Z. Xu, W. Li, L. Niu, and D. Xu, “Exploiting low-rank structure from latent domains for domain generalization,” in
European Conference on Computer Vision (ECCV) , 2014.223. C. Gan, T. Yang, and B. Gong, “Learning attributes equals multi-source domain generalization,” in
IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2016.224. D. Novotny, D. Larlus, and A. Vedaldi, “I have seen enough: Transferring parts across categories,” in
BMVA BritishMachine Vision Conference (BMVC) , 2016.225. R. Caruana, “Multitask learning: A knowledge-based source of inductive bias,”
Machine Learning , vol. 28, pp. 41–75,1997.226. T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in
ACM SIGKDD Conference on Knowledge Discoveryand Data Mining (SIGKDD) , 2004.227. B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil, “Multilinear multitask learning,” in
InternationalConference on Machine Learning (ICML) , 2013.omain Adaptation for Visual Applications: A Comprehensive Survey 45228. E. Miller, N. Matsakis, and P. Viola, “Learning from one example through shared densities on transforms,” in
IEEEConference on Computer Vision and Pattern Recognition (CVPR) , 2010.229. L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,”
Transactions of Pattern Recognition andMachine Analyses (PAMI) , vol. 28, no. 4, pp. 594–611, 2006.230. Y. Mansour, M. Mohri, and A. Rostamizadeh, “Domain adaptation with multiple sources,” in
Annual Conference onNeural Information Processing Systems (NIPS) , 2009.231. L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua, “Domain adaptation from multiple sources via auxiliary classifiers,” in
International Conference on Machine Learning (ICML) , 2009.232. Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye, “A two-stage weighting framework for multi-source domainadaptation,” in
Annual Conference on Neural Information Processing Systems (NIPS) , 2011.233. B. Gong, K. Grauman, and F. Sha, “Reshaping visual datasets for domain adaptation,” in
Annual Conference on NeuralInformation Processing Systems (NIPS) , 2013.234. T. Tommasi and B. Caputo, “The more you know, the less you learn: from knowledge transfer to one-shot learning ofobject categories,” in
BMVA British Machine Vision Conference (BMVC) , 2009.235. M. Fink, “Object classification from a single example utilizing class relevance pseudo-metrics,” in
Annual Conference onNeural Information Processing Systems (NIPS) , 2004.236. E. Bart and S. Ullman, “Cross-generalization: Learning novel classes from a single example by feature replacement,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2005.237. K. Murphy, A. Torralba, and W. Freeman, “Using the forest to see the trees: a graphical model relating features, objects,and scenes,” in
Annual Conference on Neural Information Processing Systems (NIPS) , 2003.238. V. Ferrari and A. Zisserman, “Learning visual attributes.,” in
Annual Conference on Neural Information Processing Sys-tems (NIPS) , 2007.239. C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attributetransfer,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2009.240. M. Palatucci, D. Pomerleau, G. Hinton, and T. M. Mitchell, “Zero-shot learning with semantic output codes,” in
AnnualConference on Neural Information Processing Systems (NIPS) , 2009.241. Y. Fu, T. Hospedales, T. Xiang, Z. Fu, and S. Gong, “Transductive multi-view embedding for zero-shot recognition andannotation,” in
European Conference on Computer Vision (ECCV) , 2014.242. V. Sharmanska, N. Quadrianto, and C. Lampert, “Augmented attribute representations,” in
European Conference on Com-puter Vision (ECCV) , 2012.243. R. Layne, T. Hospedales, and S. Gong, “Re-id: Hunting attributes in the wild,” in
BMVA British Machine Vision Confer-ence (BMVC) , 2014.244. Y. Fu, T. Hospedales, T. Xiang, and S. Gong, “Learning multimodal latent attributes,”
Transactions of Pattern Recognitionand Machine Analyses (PAMI) , vol. 36, no. 2, pp. 303–316, 2014.245. N. Patricia and B. Caputo, “Learning to learn, from transfer learning to domain adaptation: A unifying perspective,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2014.246. S. Clinchant, G. Csurka, and B. Chidlovskii, “Transductive adaptation of black box predictions,” in
Annual Meeting ofthe Association for Computational Linguistics(ACL) , 2016.247. Y. Yang and T. M. Hospedales, “A unified perspective on multi-domain and multi-task learning,” in
International Confer-ence on Learning representations (ICLR) , 2015.248. O. Chapelle, B. Sch ˜A ¶ lkopf, and A. Zien, Semi-supervised learning . MIT Press, 2006.249. X. Zhu, A. Goldberg, R. Brachman, and T. Dietterich,
Introduction to semi-supervised learning . Morgan & ClaypoolPublishers, 2009.250. B. Settles, “”active learning literature survey,” Tech. Rep. Computer Sciences Technical Report 1648, University ofWisconsin-Madison, 2010.251. X. Liao, Y. Xue, and L. Carin, “Logistic regression with an auxiliary data source,” in
International Conference on MachineLearning (ICML) , 2005.252. X. Shi, W. Fan, and J. Ren, “Actively transfer domain knowledge,” in
Joint European Conference on Machine Learningand Knowledge Discovery in Databases (ECML PKDD) , 2008.253. Y. Chan and H. Ng, “Domain adaptation with active learning for word sense disambiguation,” in
Annual Meeting of theAssociation for Computational Linguistics(ACL) , 2007.254. P. Rai, A. Saha, H. Daum´e III, and S. Venkatasubramanian, “Domain adaptation meets active learning,” in
ACL Workshopon Active Learning for Natural Language Processing (ALNLP) , 2010.255. A. Saha, P. Rai, H. Daum´e III, S. Venkatasubramanian, and S. DuVall, “Active supervised domain adaptation,” in
JointEuropean Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD) , 2011.6 Gabriela Csurka256. X. Wang, T.-K. Huang, and J. Schneider, “Active transfer learning under model shift,” in
International Conference onMachine Learning (ICML) , 2014.257. S. Shalev-Shwartz,
Online Learning: Theory, Algorithms, and Applications.
PhD thesis, Hebrew University, 7 2007.258. L. Bottou,
Online Algorithms and Stochastic Approximations . Cambridge University Press, 1998.259. S. Shalev-Shwartz, “Online learning and online convex optimization,”
Foundations and Trends in Machine Learning ,vol. 4, no. 2, pp. 107–194, 2011.260. Y. Zhang and D.-Y. Yeung, “Transfer metric learning by learning task relationships,” in
ACM SIGKDD Conference onKnowledge Discovery and Data Mining (SIGKDD) , 2010.261. T. Kamishima, M. Hamasaki, and S. Akaho, “Trbagg: A simple transfer learning method and its application to personal-ization in collaborative tagging,” in
IEEE International Conference on Data Mining (ICDM) , 2009.262. E. Rodner and J. Denzler, “Learning with few examples by transferring feature relevance,” in
BMVA British MachineVision Conference (BMVC) , 2009.263. A. Acharya, E. Hruschka, J. Ghosh, and S. Acharyya, “Transfer learning with cluster ensembles,” in
ICML Workshop onUnsupervised and Transfer Learning (WUTL) , 2012.264. Q. Yang, Y. Chen, G.-R. Xue, W. Dai, and Y. Yong, “Heterogeneous transfer learning for image clustering via the social-web,” in
Annual Meeting of the Association for Computational Linguistics(ACL) , 2009.265. M. Chen, K. Q. Weinberger, and J. Blitzer, “Co-training for domain adaptation,” in
Annual Conference on Neural Infor-mation Processing Systems (NIPS) , 2011.266. J. Ah-Pine, M. Bressan, S. Clinchant, G. Csurka, Y. Hoppenot, and J.-M. Renders, “Crossing textual and visual contentin different application scenarios,”
Multimedia Tools and Applications , vol. 42, no. 1, pp. 31–56, 2009.267. Y. Jia, M. Salzmann, and T. Darrell, “Learning cross-modality similarity for multinomial data,” in
IEEE InternationalConference on Computer Vision (ICCV) , 2011.268. J. Weston, S. Bengio, and N. Usunier, “Large scale image annotation: learning to rank with joint word-image embed-dings,”
Machine Learning , vol. 81, no. 1, pp. 21–35, 2010.269. N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach tocross-modal multimedia retrieval,” in
ACM Multimedia , 2010.270. A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “Devise: A deep visual-semanticembedding model,” in
Annual Conference on Neural Information Processing Systems (NIPS) , 2013.271. R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, “Learning object categories from google’s image search,” in
IEEEInternational Conference on Computer Vision (ICCV) , 2005.272. X.-J. Wang, L. Zhang, X. Li, and W.-Y. Ma, “Annotating images by mining image search results,”
Transactions of PatternRecognition and Machine Analyses (PAMI) , vol. 30, pp. 1919–1932, 2008.273. F. Schroff, A. Criminisi, and A. Zisserman, “Harvesting image databases from the web,” in
IEEE International Conferenceon Computer Vision (ICCV) , 2007.274. A. Bergamo and L. Torresani, “Exploiting weakly-labeled web images to improve object classification: a domain adapta-tion approach,” in
Annual Conference on Neural Information Processing Systems (NIPS) , 2010.275. C. Gan, C. Sun, L. Duan, and B. Gong, “Webly-supervised video recognition by mutually voting for relevant web imagesand web video frames,” in
European Conference on Computer Vision (ECCV) , 2016.276. L. Duan, D. Xu, I. W. Tsang, and J. Luo, “Visual event recognition in videos by learning from web data,”
Transactions ofPattern Recognition and Machine Analyses (PAMI) , vol. 34, no. 9, pp. 1667–1680, 2012.277. C. Sun, S. Shetty, R. Sukthankar, and R. Nevatia, “Temporal localization of fine-grained actions in videos by domaintransfer from web images,” in
ACM Multimedia , 2015.278. T. Tommasi and T. Tuytelaars, “A testbed for cross-dataset analysis,” in
ECCV Workshop on Transferring and AdaptingSource Knowledge in Computer Vision (TASK-CV) , 2014.279. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”
Proceedingsof the IEEE , vol. 86, no. 11, pp. 2278–2324, 1998.280. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervisedfeature learning,” in