Incremental Unsupervised Domain-Adversarial Training of Neural Networks
Antonio-Javier Gallego, Jorge Calvo-Zaragoza, Robert B. Fisher
a Department of Software and Computing Systems, University of Alicante, 03690 Alicante, Spain
b School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK

∗ Corresponding author. Email addresses: [email protected] (Antonio-Javier Gallego), [email protected] (Jorge Calvo-Zaragoza), [email protected] (Robert B. Fisher)
Abstract
In the context of supervised statistical learning, it is typically assumed that the training set comes from the same distribution that draws the test samples. When this is not the case, the behavior of the learned model is unpredictable and becomes dependent upon the degree of similarity between the distribution of the training set and the distribution of the test set. One of the research topics that investigates this scenario is referred to as domain adaptation. Deep neural networks brought dramatic advances in pattern recognition and that is why there have been many attempts to provide good domain adaptation algorithms for these models. Here we take a different avenue and approach the problem from an incremental point of view, where the model is adapted to the new domain iteratively. We make use of an existing unsupervised domain-adaptation algorithm to identify the target samples on which there is greater confidence about their true label. The output of the model is analyzed in different ways to determine the candidate samples. The selected set is then added to the source training set by considering the labels provided by the network as ground truth, and the process is repeated until all target samples are labelled. Our results report a clear improvement with respect to the non-incremental case in several datasets, also outperforming other state-of-the-art domain adaptation algorithms.
Keywords:
Domain Adaptation, Unsupervised learning, Neural Networks, Convolutional Neural Networks, Incremental labelling, Machine learning
1. Introduction
Supervised learning is the most considered approach for dealing with classification tasks. This paradigm is based on a sufficiently representative training set to learn a classification model. This level of representativeness is usually defined by two criteria: on the one hand, the training samples must be varied, which allows the algorithm to generalize instead of memorizing; on the other hand, the application of the trained model is assumed to be carried out on samples that come from the same distribution as those of the training set [11].

Building a training set fulfilling these conditions is not always straightforward. Although obtaining samples might be easy, assigning their correct labels is costly. This is why there are efforts to alleviate the aforementioned requirements. However, while the conflict between memorization and generalization has been well studied, and there exist established mechanisms to deal with it such as regularization or data augmentation [15], learning a model that is able to correctly classify samples from a different target distribution remains open to further research. This problem is generally called transfer learning (TL) [26], and when the classification labels do not vary in the target distribution it is usually referred to as domain adaptation (DA) [35].

Within the context of supervised learning, deep learning represents an important breakthrough [19]. This term refers to the latest generation of artificial neural networks, for which novel mechanisms have been developed that allow training deeper networks, i.e., with many layers. These deep neural networks represent the state of the art in many classification tasks, and have managed to break the existing glass ceiling in many traditionally complex tasks. In turn, deep learning often requires a large amount of data, which makes the study of DA even more interesting.

As we will review in the next section, there are several alternatives to attempt DA, both with general strategies and with deep neural networks. In this work we take a different avenue and study an incremental approach. We propose to use an existing DA algorithm for deep learning to classify those samples of the target domain for which the model is confident. Assuming the assigned labels as ground truth, the model is retrained. This added knowledge allows the network to refine its behavior to correctly classify other samples of the target set. This incremental process is repeated until the entire target set is completely annotated. We will show that this incremental approach achieves noticeable improvements with respect to the underlying DA algorithm. In addition, it is competitive on different benchmarks compared to other state-of-the-art DA algorithms.

The rest of the paper is structured as follows: we outline in Section 2 the existing literature about DA, with special emphasis on that based on deep neural networks; we present in Section 3 the proposed incremental methodology, as well as the underlying DA model that we consider in this work; we describe our experimental setting in Section 4, while the results are reported in Section 5; finally, the work is concluded in Section 6.
2. Background
Since the beginning of machine learning research, there exists the idea of exploiting a model beyond its use over unknown samples of the source distribution. In the literature we can find two main topics that pursue this objective: the aforementioned TL and DA strategies.

In TL, some knowledge of the model is used to solve a different classification task. For example, a pre-trained DNN model can be used as initialization [28, 17] or its feature extraction process can be considered as the basis of another classification model [37]. As a special case of TL, the DA challenge typically assumes that the classification task of the target distribution is the same (i.e., the set of labels is equal). In this work we focus on the latter case.

In a DA scenario, we can also distinguish between semi-supervised and unsupervised approaches. While semi-supervised DA considers that some labeled samples of the target distribution are available [8, 36, 25], unsupervised DA works with just unlabeled samples [18]. We revisit in this section unsupervised DA techniques, as is the case with the proposed approach.

Performing unsupervised DA is still considered an open problem from both theoretical and practical perspectives [5]. Most approaches consider that the key is to build a good feature representation that becomes invariant to the domain [4]. A good example is the Domain Adaptation Neural Network (DANN) proposed by Ganin et al. [13], which simultaneously learns domain-invariant features from both source and target data and discriminative features from the source domain. Following this line of research, many approaches have been proposed more recently. Virtual Adversarial Domain Adaptation (VADA), proposed by Shu et al. [27], added a penalty term to the loss function to penalize class boundaries that cross high-density feature regions. The Deep Reconstruction-Classification Networks (DRCN) [14] consist of a neural network that forces a common representation of both the source and target domains by sample reconstruction, while learning the classification task from the source samples. The Domain Separation Networks (DSN) proposed by Bousmalis et al. [6] are trained to map input representations onto both a domain-specific subspace and a domain-independent subspace, in order to improve the way that the domain-invariant features are learned. Haeusser et al. [16] proposed
Associative Domain Adaptation (ADA), which is another domain-invariant feature learning approach that reinforces associations between source and target representations in an embedding space with neural networks. The Adversarial Discriminative Domain Adaptation (ADDA) strategy [33] follows the idea of Generative Adversarial Networks, along with discriminative modeling and untied weight sharing, to learn domain-invariant features while keeping a useful representation for the discriminative task. Drop to Adapt (DTA) [21] makes use of adversarial dropout to enforce discriminative domain-invariant features. Damodaran et al. [10] proposed the Deep Joint Distribution Optimal Transport (DeepJDOT) approach, which learns both the classifier and aligned data representations between the source and target domain within a single neural framework, with a loss function based on Optimal Transport theory [34].

A different strategy for DA consists of learning how to transform features from one domain to another. Following this idea, the Subspace Alignment (SA) method [12] seeks to represent the source and target domains using subspaces modelled by eigenvectors. Then, it solves an optimization problem to align the source subspace with the target one. Also, Sun and Saenko proposed the Deep Correlation Alignment (D-CORAL) approach [31], which consists of a neural network that learns a nonlinear transformation to align correlations of layer activations from the source and target distributions.

While the methods outlined above seek new ways to achieve the desired characteristics of a proper DA method, our proposed approach takes a different avenue. Specifically, we build upon the existing DANN approach, and we propose novel ways to improve its ability to adapt to the target domain by performing the adaptation incrementally.
3. Methodology
Let X be the input space and Y be the output or label space. A classification task assumes that there exists a function f : X → Y that assigns a label to each possible sample of the input space. For supervised learning, the goal is to learn a hypothesis function h that models the unknown function f with the least possible error. We refer to h as the label classifier. Quite often, the approach is to estimate a posterior probability P(Y | X) so that the label classifier follows a maximum a posteriori decision such that h(x) = arg max_{y ∈ Y} P(y | x). This is the case with neural networks.

In the DA scenario, there exist two distributions over X × Y: D_S and D_T, which are referred to as the source domain and the target domain, respectively. We focus on the case of unsupervised domain adaptation, for which DA is only provided with a labeled source set S = {(x_i, y_i)}_{i=1}^{n} ∼ (D_S)^n and a completely unlabeled target domain T = {x_i}_{i=1}^{n'} ∼ (D_T)^{n'}. The goal of a DA algorithm is to build a label classifier for D_T by using the information provided in both S and T.

Given its importance in the context of our work, we further describe here the operation of DANN, which will be considered as the backbone for our incremental approach. DANN is based on the theory of learning from different domains discussed by [3, 2]. This suggests that the transfer of the knowledge gained from one domain to another must be based on learning features that do not allow discriminating between the two domains (source and target) of the samples to be classified. For this, DANN learns a classification model from features that do not encode information about the domain of the sample to be classified, thus generalizing the knowledge from a labeled source domain to an unlabeled target domain.

More specifically, the proposed neural architecture includes a feature extractor module (G_f) and a label classifier (G_y), which together build a standard feed-forward neural network that can be trained to classify an input sample x into one of the possible categories of the output space Y. The last layer of the label classifier G_y uses a "softmax" activation, which models the posterior probability P(y | x), ∀ y ∈ Y, of a given input x ∈ X.

DANN adds a new domain classifier module (G_d) to the neural network, which classifies the domain to which the input sample x belongs. This classifier is built as a binary logistic regressor that models the probability that an input sample x comes from the source distribution (d_i = 0 if x ∼ D_S) or the target distribution (d_i = 1 if x ∼ D_T), where d_i denotes a binary variable that indicates the domain of the sample.

The unsupervised adaptation to a target domain is achieved as follows: the domain classifier G_d is connected to the feature extractor G_f (which is shared with the label classifier G_y) through the so-called gradient reversal layer (GRL). This layer does nothing at prediction time. However, while learning through back-propagation, it multiplies the gradient by a certain negative constant: the GRL receives the gradient from the subsequent layer and multiplies it by −λ, therefore changing its sign before passing it to the preceding layer. The idea of this operation is to force G_f to learn generic features that do not allow discriminating the domain.
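To make the behaviour of the GRL concrete, the sketch below shows one common way to implement it. This is an illustrative PyTorch rendering rather than the authors' code, and the names GradReverse and grad_reverse are ours.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass,
    multiplies the incoming gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient that flows towards the feature extractor.
        return grad_output.neg() * ctx.lambd, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```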
In addition, since this training is carried out simultaneously with the training of G_y (the label classifier), the features must be adequate for discriminating the categories to classify, yet unbiased with respect to the input domain. According to the DA theory, this should cause G_y to be able to correctly classify input samples regardless of their domain, given that the features from G_f are forced to be invariant.

The DANN training simultaneously updates all modules, providing samples for both G_y and G_d. This can be done by using conventional mechanisms such as Stochastic Gradient Descent, from batches that include half of the examples from each domain. During the training process, the learning of G_f pursues a trade-off between appropriate features for the classification (G_y) and inappropriate features for discriminating the domain of the input sample (G_d). The hyper-parameter λ allows tuning this trade-off. The training is performed until the result converges to a saddle point, which can be found as a stationary point of the gradient update defined by the following equation:

    θ_f ← θ_f − µ ( ∂L_y/∂θ_f − λ ∂L_d/∂θ_f )        (1)

where θ_f denotes the weights of G_f, µ denotes the learning rate, and L_y and L_d represent the loss functions for the label classifier and the domain classifier, respectively. A graphical overview of the DANN architecture is depicted in Fig. 1.

Figure 1: Graphical overview of the DANN architecture, consisting of three blocks: feature extractor (G_f), label classifier (G_y), and domain classifier (G_d). The GRL circle denotes the gradient reversal layer that multiplies the gradient by a negative factor.

Our main contribution within the context of DA is to propose an incremental approach to DANN (iDANN). This strategy is explained below. Once the DANN model is trained as explained in the previous section, we can use both the feature extractor G_f and the label classifier G_y to predict the category of samples from both the target domain and the source domain (G_y(G_f(x))). The "softmax" activation used at the output of this classifier returns the posterior probability that the network considers x to belong to any of the classes of the output space Y.

Our main assumption is that we can use the subset of samples from the target domain for which G_y is more confident, and then add them to the labeled source domain assuming the prediction as ground truth. These samples are thereafter considered completely as samples of the source domain. Afterwards, we can retrain the DANN network to fine-tune its weights using the new training set. This process is repeated iteratively, moving the labeled samples with greater confidence from the target domain to the source domain after each iteration. We stop when there are no more samples to move from the target domain.

The intuitive idea behind our approach is that by adding target domain information to the source (labeled) domain, the DANN learns new domain-invariant features that better fit the eventual classification task, thereby becoming more accurate for other target domain samples. In each iteration, however, the task increases its complexity because it deals first with the simplest samples to classify (for which the DANN is more confident), leaving those that have more dissimilar features in the unlabeled target set.
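As an illustration of how the update of Eq. (1) can be realized in practice, the following hedged sketch performs one joint DANN step in PyTorch. It assumes G_f, G_y and G_d are torch.nn modules, reuses the grad_reverse helper from the previous sketch, and is not the authors' implementation; through the GRL, the −λ ∂L_d/∂θ_f term of Eq. (1) is applied implicitly by ordinary back-propagation.

```python
import torch
import torch.nn.functional as F

def dann_step(G_f, G_y, G_d, optimizer, xs, ys, xt, lambd):
    """One joint update: label loss on source samples, domain loss on source + target.
    The GRL (grad_reverse, defined above) makes the reversed domain gradient implicit."""
    optimizer.zero_grad()

    f_s, f_t = G_f(xs), G_f(xt)                      # shared feature extractor

    loss_y = F.cross_entropy(G_y(f_s), ys)           # label classifier, source only

    feats = torch.cat([f_s, f_t])                    # domain classifier, both domains
    d_true = torch.cat([torch.zeros(len(xs)), torch.ones(len(xt))]).to(feats.device)
    d_prob = G_d(grad_reverse(feats, lambd)).squeeze(1)   # sigmoid output of G_d
    loss_d = F.binary_cross_entropy(d_prob, d_true)

    (loss_y + loss_d).backward()                     # back-propagate the joint objective
    optimizer.step()
    return loss_y.item(), loss_d.item()
```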
When the DANN is retrained with labeled samples that include target domain information, the domain classifier G_d needs to be more specific. This forces the feature extraction module G_f to forget the features that differentiate the more complex samples of the target domain.

We formalize the process in Algorithm 1, where e and b represent the number of epochs and the batch size considered, respectively, e_inc denotes the number of epochs for the incremental stage of the algorithm, r indicates the size of the subset of target domain samples to select in each iteration, and β is a constant that allows us to modify this size after each iteration. In this algorithm, the samples of the target domain (B̂) are classified using the label classifier G_y, which then proceeds to select a subset B̂_r of size r to be moved from the target domain to the source domain. For this purpose, two selection criteria are proposed, which are described in the next section.

Algorithm 1: Incremental DANN (iDANN)
Input:  S ← {(x_i, y_i) ∼ D_S}
        T ← {(x_i) ∼ D_T}
        e, e_inc, b, r, λ, β ← initial hyper-parameter values
Output: G_f, G_y, CNN

 1: while T ≠ ∅ do
 2:     G_f, G_y ← Fit DANN with {S, T, e, b, λ}
 3:     B̂_r ← selection_policy(G_f, G_y, T, r)
 4:     S ← S ∪ B̂_r
 5:     T ← T \ B̂_r
 6:     e ← e_inc
 7:     r ← β r
 8: end while
 9: T̂ ← {(x_i, y_i) | x_i ∼ D_T, y_i = G_y(G_f(x_i))}
10: Fit CNN with {T̂, e, b}

Once the iterative stage of the algorithm ends, the label classifier G_y is used to classify the entire original target domain (see line 9 of Algorithm 1). This labeled target set is then used to train a neural network from scratch, which is therefore specialized in classifying target domain samples (more details in Section 3.5).

Below we describe in detail the two proposed policies to select samples during the iterative stage of Algorithm 1 (selection_policy). One policy is directly based on the confidence level that the network provides to the prediction, while the other is based on geometric properties of the learned feature space.
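The following Python sketch mirrors Algorithm 1 at a schematic level; fit_dann, selection_policy, predict_label and fit_cnn are hypothetical placeholders for the procedures described in the text, not functions from any released code.

```python
def idann(S, T, e, e_inc, b, r, lambd, beta, selection_policy):
    """Schematic rendering of Algorithm 1.
    S: list of labelled source pairs (x, y); T: list of unlabelled target samples."""
    T_all = list(T)                                   # keep the full target set for the final stage
    while T:
        G_f, G_y = fit_dann(S, T, epochs=e, batch_size=b, lambd=lambd)  # adversarial training
        B_r = selection_policy(G_f, G_y, T, r)        # confident target samples + pseudo-labels
        S = S + B_r                                   # pseudo-labels treated as ground truth
        chosen = {id(x) for x, _ in B_r}
        T = [x for x in T if id(x) not in chosen]     # shrink the unlabelled target pool
        e = e_inc                                     # later iterations only fine-tune
        r = int(beta * r)                             # grow the selection size
    # Final stage: label the whole original target set and train a plain CNN on it.
    T_hat = [(x, predict_label(G_f, G_y, x)) for x in T_all]
    cnn = fit_cnn(T_hat, epochs=e, batch_size=b)
    return G_f, G_y, cnn
```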
As mentioned above, the output of the label classifier G_y uses a softmax activation. Let L denote the number of labels. Then, the standard softmax function σ : R^L → R^L is defined by Equation 2:

    σ(z)_i = e^{z_i} / Σ_{j=1}^{L} e^{z_j},   for i = 1, ..., L and z = (z_1, ..., z_L) ∈ R^L        (2)

This function normalizes an L-dimensional vector z of unbounded real values into another L-dimensional vector σ(z), whose values lie in the range [0, 1] and add up to 1. This can be interpreted as a posterior probability over the different possible labels [7]. In order to turn these probabilities into the predicted class label, we simply take the argmax-index position of this output vector, following a Maximum a Posteriori probability criterion.

Taking advantage of this interpretation, the first policy for selecting samples to move from the target domain to the source is based on the probability provided by the label classifier G_y, which can be seen as a measure of confidence in such classification. With this criterion, we keep the maximum predicted probability value for each sample of the target set among the possible labels. Then, we order all samples based on this value, from highest to lowest, in order to select the first r samples to build the subset B̂_r.

Algorithm 2 presents the algorithmic description of this process, where G_y^p refers to the probabilistic output of the label classifier after the softmax activation, before applying argmax to select a label. The function sortr is used to sort the set in decreasing order.

Algorithm 2: Confidence policy
Input:  T ← {(x_i) ∼ D_T}
        G_f, G_y ← feature extractor and label classifier
        r ← size of the subset of samples to select
Output: B̂_r

B̂ ← G_y^p(G_f(T))
B̂_r ← ∅
foreach (x_i, y_i) ∈ sortr(B̂) do
    B̂_r ← B̂_r ∪ (x_i, y_i)
    if |B̂_r| = r then
        break
    end if
end foreach

Figure 2 shows an example of a set of probabilities obtained after predicting the target samples with DANN. The figure on the left shows the maximum probability values obtained for the classification of each sample (without sorting), while the figure on the right shows the sorted set, where the threshold r has been highlighted.

Figure 2: Example of probabilities obtained with DANN. Left: maximum probability of each sample. Right: ordered set of maximum probabilities, where the threshold r has been highlighted.
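A minimal NumPy sketch of the confidence policy of Algorithm 2 is given below; G_f and G_y_prob stand for the feature extractor and the probabilistic (softmax) output of the label classifier, and the function name is ours.

```python
import numpy as np

def confidence_policy(G_f, G_y_prob, T, r):
    """Algorithm 2 sketch: rank target samples by their maximum softmax probability
    and return the r most confident ones together with their predicted labels."""
    probs = G_y_prob(G_f(np.stack(T)))       # shape (len(T), L): softmax outputs
    labels = probs.argmax(axis=1)            # MAP label for each target sample
    confidence = probs.max(axis=1)           # confidence = highest class probability
    order = np.argsort(-confidence)          # most confident first
    return [(T[i], int(labels[i])) for i in order[:r]]
```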
kNN policy

As in the previous case, once the network has been trained, we use the label classifier G_y to predict the labels of the whole target domain and then sort them based on the confidence given by the network. However, in this case, instead of directly selecting a subset of samples according to this confidence, we also evaluate the geometric properties of the feature space. This is performed following the k-nearest neighbor rule.

We first obtain the feature set F_S from the source set S (using G_f(S)). We then proceed to iterate over the target set samples sorted by their level of confidence. Given a target sample, if the labels of the k nearest samples of the source domain match the label assigned by the label classifier G_y, then we select the prototype. Otherwise, we discard it. Therefore, samples are selected based on both the confidence provided by the DANN in their label and the extent to which they match the distribution of the source domain.

Algorithm 3 describes this process algorithmically. The kNN(q, F_S, k) function receives as parameters the query sample q, the set F_S and the value k to be used, and yields the predicted label l and the number of samples m within its k nearest neighbors from S that have the same label.

Algorithm 3: kNN policy

Input:  S ← {(x_i, y_i) ∼ D_S}
        T ← {(x_i) ∼ D_T}
        G_f, G_y ← feature extractor and label classifier
        r ← size of the subset of samples to select
        k ← number of neighbors to consider
Output: B̂_r

F_S ← G_f(S)
B̂ ← G_y^p(G_f(T))
B̂_r ← ∅
foreach (x_i, y_i) ∈ sortr(B̂) do
    f_T^(i) ← G_f(x_i)
    l, m ← kNN(f_T^(i), F_S, k)
    if y_i = l and m = k then
        B̂_r ← B̂_r ∪ (x_i, y_i)
        if |B̂_r| = r then
            break
        end if
    end if
end foreach

The idea of this policy is to select the samples of the target domain whose features lie within the cluster of the source domain for the same class. An illustrative example of this condition is shown in Fig. 3 with k = 5. The example shows two labels of the source domain as green circles and blue squares. The red stars denote the target domain examples that are being evaluated to determine if they are selected. For instance, the star on the left would be selected if, and only if, the network classified it as a green circle, since its 5 nearest neighbors are green circles. Similarly, the star on the right would be selected if, and only if, the network classified it as a blue square. However, the central star would always be discarded because its 5 neighbors belong to two different classes. If we increase k, the red star on the left would still be selected (if labeled as a green circle) because it is located in the middle of the cluster. However, the red star on the right is closer to the label boundaries, and so it would eventually be discarded.

Figure 3: Example of sample selection using the kNN policy with k = 5. Green circles and blue squares represent samples of two different classes from the source domain. Red stars represent the samples of the target domain that are evaluated to determine whether they are chosen.
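To illustrate the kNN policy of Algorithm 3, the following sketch combines the network's confidence ranking with a unanimity check over the k nearest source features. It relies on scikit-learn's NearestNeighbors and uses assumed helper names (G_f, G_y_prob), so it should be read as an outline rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_policy(G_f, G_y_prob, S, T, r, k):
    """Algorithm 3 sketch: a confident target sample is kept only if all of its k nearest
    source features carry the same label that the network predicts for it."""
    xs, ys = zip(*S)
    ys = np.asarray(ys)
    F_S = G_f(np.stack(xs))                           # feature set of the source domain
    nn = NearestNeighbors(n_neighbors=k).fit(F_S)

    F_T = G_f(np.stack(T))                            # target features
    probs = G_y_prob(F_T)
    labels, conf = probs.argmax(axis=1), probs.max(axis=1)

    B_r = []
    for i in np.argsort(-conf):                       # iterate targets by decreasing confidence
        _, idx = nn.kneighbors(F_T[i:i + 1])
        if np.all(ys[idx[0]] == labels[i]):           # unanimous agreement with the predicted label
            B_r.append((T[i], int(labels[i])))
            if len(B_r) == r:
                break
    return B_r
```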
As described in Algorithm 1, once the iterative stage of the iDANN algorithm is completed, we use the label classifier G_y to annotate the entire original target set T from scratch. Then, a new CNN is trained by conventional means considering the same neural architecture as G_y(G_f(·)). This allows us to eventually obtain a neural network that is directly specialized in the classification of the target domain.

However, we assume that some part of the iterative annotation of the target set will contain noise at the label level. To mitigate the possible effects of this noise, we consider label smoothing [32]. This is an efficient and theoretically grounded strategy for dealing with label noise, which also makes the model less prone to overfitting. Compared to the classical one-hot output representation, label smoothing changes the construction of the true probability to

    y'_i = (1 − ε) y_i + ε/L,        (3)

where ε is a small constant (or smoothing parameter) and L is the total number of classes. Hence, instead of minimizing cross-entropy with hard targets (0 or 1), it considers soft targets.
4. Experimental setup
The proposed approach is evaluated on two different classification tasks that are common in the DA literature. The first one is digit classification, for which we consider the following datasets:

• MNIST [20]: this collection contains 28 × 28 images representing isolated handwritten digits.

• MNIST-M [13]: this dataset was synthetically generated by merging MNIST samples with random color patches from BSDS500 [1].

• Street View House Numbers (SVHN) [24]: it consists of images obtained from house numbers in Google Street View. It represents a real-world challenge of digit recognition in natural scenes, for which several digits might appear in the same image and only the central one must be classified.

• Synthetic Numbers [13]: images of digits generated using Windows™ fonts, with varying position, orientation, color and resolution.

In addition, we also evaluate our approach on traffic sign classification with the following datasets:

• German Traffic Sign Recognition Benchmark (GTSRB) [30]: this dataset contains images of traffic signs obtained from the real world in different sizes, positions, and lighting conditions, as well as including occlusions.

• Synthetic Signs [23]: this dataset was synthetically generated by taking common street signs from Wikipedia and applying several transformations. It tries to simulate images from GTSRB, although there are significant differences between them.

Table 1 summarizes the information of our evaluation corpora, including the domain to which they belong, the number of labels, the image resolution, the number of samples, and the type of image, indicating whether they are in color or grayscale format. Figure 4 shows some random examples from each of these datasets.
Table 1: Description of the datasets used in the experimentation.
Set             Labels   Dataset        Resolution   Type    Samples
Numbers         10       MNIST          28 × 28      Gray    65,000
                         MNIST-M        28 × 28      Color   65,000
                         SVHN           32 × 32      Color   99,289
                         Syn. Numbers   32 × 32      Color   488,953
Traffic signs   43       GTSRB          variable     Color   —
                         Syn. Signs     40 × 40      Color   100,000

Figure 4: Random examples from the datasets used in the experimentation: (a) MNIST, (b) MNIST-M, (c) SVHN, (d) Syn. Numbers, (e) GTSRB, (f) Syn. Signs.
The images of each classification task were rescaled to the same size: the digits to 28 × 28 pixels and the traffic signs to 40 × 40 pixels. Concerning the pre-processing of the input data, the images were normalized within the range [0, 1].

4.2. CNN architectures

To evaluate the proposed methodology, the same three CNN architectures used in the original DANN paper have been tested. Table 2 reports a summary of these architectures. As the authors pointed out, these topologies are not necessarily optimal and better adaptation performance might be attained if they were tweaked. However, we chose to keep the same configuration to make a fairer comparison.

As the activation function, a Rectified Linear Unit (ReLU) was used for each convolution layer and fully-connected layer, except for the output layers. L neurons with softmax activation were used as the output of the label classifier. For the output of the domain classifier, a single neuron with a logistic (sigmoid) activation function was used to discriminate between the two possible categories (source domain or target domain).

Model 1 was used for all the experiments with digit datasets, except those using SVHN. This topology is inspired by the classical LeNet-5 architecture [20]. Model 2 was used to evaluate the experiments with digits that include SVHN. This architecture is inspired by [29]. Finally, Model 3 was used for the experiments with traffic signs. In this case, the single-CNN baseline obtained from [9] was used.

Table 2: CNN network configurations considered. Notation: Conv(f, w, h) stands for a layer with f convolution operators with a kernel of size w × h pixels; MaxPool(w, h) stands for the max-pooling operator of dimensions w × h pixels, with a 2 × 2 stride; FC(n) represents a fully-connected layer of n neurons. In the output layer of the label classifier, a fully-connected layer of L neurons with softmax activation is added, where L denotes the number of categories of the dataset at issue.

Model   Feature extractor (Layer 1 / Layer 2 / Layer 3)   Label classifier   Domain classifier
To ensure a fair comparison with the original DANN algorithm, we set the same training configuration: Stochastic Gradient Descent with a learning rate of 0.01, a decay of 10⁻, and momentum of 0.9, as well as the same number of epochs (300).

For the iterative stage of iDANN, we set e_inc to 25. This value was determined empirically: we observed that it allowed the network weights to be tuned with the new knowledge without taking too long to perform a new iteration. In each training iteration, the greater improvement occurs in the first epochs, after which the accuracy of the label classifier stabilizes.

Concerning the size of the subset to select from the target set (r), we decided to consider a percentage of the remaining samples rather than a fixed value. Initially, we set it to 5%, and it was increased after each iteration by a factor of β = 1.5 until all target domain samples are selected. This value was also obtained empirically, by observing better results and more stable training if few samples are added in the first iterations.

Different values for both the batch size b and λ are evaluated, as will be reported in the experimentation section.
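As a small worked example of this schedule, the sketch below computes how many target samples would be selected per iteration under the stated values (an initial 5% of the remaining samples, grown by a factor β = 1.5 each iteration); the function name and rounding details are ours.

```python
def selection_schedule(n_target, r0=0.05, beta=1.5):
    """Number of target samples selected per iteration: an initial fraction r0 of the
    remaining samples, with the fraction multiplied by beta after every iteration."""
    remaining, fraction, sizes = n_target, r0, []
    while remaining > 0:
        take = max(1, min(remaining, round(fraction * remaining)))
        sizes.append(take)
        remaining -= take
        fraction *= beta
    return sizes

# For 10,000 target samples the first iterations select roughly 500, 710 and 990 samples.
```

With these values the selection fraction exceeds 100% at the ninth step, so the loop always runs for nine iterations regardless of the target set size, which is consistent with the nine iterations that appear in the results reported below.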
5. Results
In this section we evaluate the proposed method using the datasets, topologies, and settings described in Section 4. We first study the different hyper-parameterizations, as well as the two prototype selection policies proposed. Next, we show the performance results obtained over the datasets and, finally, we compare with other state-of-the-art methods.
In this section, we start by analyzing the influence of the batch size and the value of λ on the performance of the method, as these hyper-parameters are the ones that affect the training stage the most. For this, we consider batch sizes of {16, 32, 64, 128, 256, 512} and four values of λ. This means that each result comes from a total of 336 experiments (14 combinations of dataset pairs × 6 batch sizes × 4 values of λ). The rest of the hyper-parameters are set as indicated in Section 4.3, that is: e = 300 (as in the original DANN paper), e_inc = 25, and r = 5%, which were empirically determined to favor stable training and obtain good results. In addition, we evaluate the results using only the prototype selection policy based on the network's confidence, as the next section will be devoted to comparing the two proposed policies with the best hyper-parameters found.

As we are dealing with an unsupervised method, we mainly focus on analyzing the trend when modifying these parameters. Table 3 shows the results of this experiment, where each figure represents the average over the 14 possible combinations of source and target domain of the datasets considered and over all the iterations performed by the iDANN algorithm.

The first thing to remark is that some of the hyper-parameter combinations evaluated in these experiments do not converge (e.g., a batch size of 32 with the highest λ for traffic signs). This could be detected automatically, since the accuracy is abruptly reduced to a value approximately equal to a random guess, for both the training set and evaluation set and for both the source and the target domain. However, these results have been kept in order to observe the general trend of the method and how these parameters affect it.

It can also be observed that the best performance is achieved with a low value of λ in the two types of corpora, while batch sizes of 64 and 32 are better for the digits and traffic signs, respectively. On average, better results are reported with low λ values and batch sizes between 32 and 256. When λ is greater, the training becomes highly unstable, especially if combined with small batch sizes.
Numbers Traffic signs λ λ
Next, we analyze the influence of these parameters with respect to the iteration of the iDANN algorithm. Table 4 shows the average result obtained by grouping all combinations of datasets (numbers and traffic signs) and hyper-parameters considered. As in the previous analysis, better results are also observed for low λ values and batch sizes between 32 and 256 (see column 'Avg.'). In this case, it can also be seen that low λ values are more appropriate in the first iterations, whereas greater λ values are more appropriate in the last iterations. It might happen that a more stable way of proceeding (low λ) is preferred in the first iterations, even at the cost of being less aggressive in the domain adaptation. Therefore, we propose to start with a low λ and increase its value gradually after each epoch.

Additionally, it is observed that each iteration of the algorithm leads to a better result than the previous one (except for the largest λ values), yielding the highest leap in the first iterations and reducing this difference towards the last iterations. Including all cases, the results improve by 5.19% between the first and the last iteration, on average. If we ignore those settings that do not converge, the average improvement increases to around 10%.
Table 4: Influence of hyper-parameter setting on the performance of the iDANN algorithm with respect to the number of iterations. Figures report classification accuracy over the target set, averaged over the respective datasets. Missing entries correspond to values that were highlighted in the source table and could not be recovered.

                         Iterations
λ      Batch    1      2      3      4      5      6      7      8      9      Avg.
10^-   16      58.46  59.71  61.46  62.81  63.91  64.86  65.39  65.78  66.04  63.16
       32      65.20  67.85  69.37  69.91  70.73  71.14  71.89  72.11   —      —
       64      63.88  67.11  68.33  68.75  69.35  70.30  70.71  71.13  71.22  68.97
       128     62.67  65.52  67.09  67.68  68.19  68.84  69.46  69.88  70.06  67.71
       256     62.82  65.20  66.92  67.83  68.65  68.84  69.57  70.09  70.34  67.81
       512     61.51  63.28  64.32  65.45  65.97  66.93  67.60  67.87  68.03  65.66
10^-   16      56.65  57.70  59.65  60.72  61.43  62.25  62.75  63.15  63.31  60.85
       32      63.95  67.42  68.68  69.93  70.59  71.26  71.79  72.19  72.33  69.79
       64      64.66  67.39  68.78  69.89  70.41  71.32  72.06  72.44   —      —
       128     63.75  66.59  68.61  69.07  69.93  70.82  71.44  72.00  72.14  69.37
       256     62.38  65.08  66.67  67.44  67.69  68.52  69.10  69.49  69.71  67.34
       512     61.36  63.46  64.72  65.48  66.25  66.96  67.42  67.84  68.08  65.73
10^-   16      56.73  61.57  63.37  65.13  65.81  66.31  62.94  63.13  63.26  63.14
       32      62.07  64.75  66.01  66.68  67.46  67.78  68.23  68.47  68.80  66.69
       64      64.78  67.77  69.44  70.20  71.21  71.92  72.35  72.74   —      —
       128     64.49  67.27  69.02  69.75  70.71  71.58  72.00  72.72  72.87  70.05
       256     63.16  65.55  66.75  67.49  68.32  68.76  69.51  69.81  70.21  67.73
       512     61.61  63.49  64.82  65.57  66.10  66.81  67.52  68.01  68.28  65.80
10^-   16      46.48  50.71  52.12  53.35  53.78  42.41  42.98  43.35  43.52  47.63
       32      50.09  53.07  47.66  42.61  43.12  44.06  44.39  44.91  44.96  46.10
       64      61.64  64.02  64.24  54.01  54.48  55.44  55.96  56.47  56.59  58.09
       128     55.72  57.10  57.16  57.01  52.95  53.15  53.63  53.50  53.69  54.88
       256     60.13  62.26  62.60  63.53  64.25  64.93  65.57  66.04   —      —
       512     53.54  54.39  54.63  55.01  55.69  56.04  56.57  56.84  56.95  55.52

We now evaluate the effect of the incremental training process on the domain adaptation approach. Figure 5 shows the evolution of the accuracy obtained over the target test set during the training process for the Syn Numbers → MNIST-M combination of datasets, with a batch size of 64 and a low value of λ. The training epochs are represented on the horizontal axis, while the iterations (i.e., when new training samples are added) are highlighted with blue lines and marked above. It can be observed that in the first iteration (spanning 300 epochs), the accuracy slowly improves until around 150 epochs, after which it becomes stable. In the subsequent iterations, the accuracy further improves, especially during iterations 2, 3 and 4. Then, the performance increase is gradually reduced until it is hardly noticeable.

Figure 5: Accuracy curve with respect to training epochs and iterations of the incremental approach.

To provide further analysis, we also examine the representation space learned by the network in each of these iterations, using the same combination of datasets and training parameters. We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) [22] projection to visualize the samples according to their representation by the last hidden layer of the label predictor. Figure 6 shows a visualization of the features learned after each of the iterations, where the red color represents the target domain, the blue color represents the source domain, and the green color represents the set B̂_r (selected samples) using the confidence policy. This representation reveals 10 well-defined clusters (the 10 possible classes of the datasets considered for this analysis) around an additional central cluster. This central cluster groups the samples of the target domain (red color) whose representation does not yet correspond to any of the existing classes. This cluster would therefore correspond to target samples whose representation has not been correctly mapped onto any of the source domain classes. Iteratively, the method selects samples (green points) of the target domain and moves them to the source domain. In the first iterations (until approximately the 6th one), the method selects only samples that are well located in one of the source domain clusters (that is, those samples for which the network is more confident). Due to this process, the size of the central cluster is reduced. It is important to emphasize that this cluster becomes smaller although no samples out of it are selected, which indicates that the network is learning to better map those samples because of the selected samples of previous iterations. Towards the last iterations, the method begins to select the most complex samples that are still in this additional cluster. In Fig. 6(*) (which is the same as Fig. 6(9) but highlighting each class with a different color), the additional cluster of target samples still appears without being mapped, yet with a very small size. This cluster contains almost all the classification errors, having incorrectly mapped only some isolated prototypes onto the actual class clusters.

Figure 6: t-SNE representation of the feature space of the neural network (from the last hidden layer) with respect to the iteration of the approach. The red color represents the target domain, the blue color represents the source domain, and the green color represents the set B̂_r. The last representation (*) depicts each sample according to its actual category by using different colors.
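A visualization in the spirit of Fig. 6 can be produced with scikit-learn's t-SNE. In this hedged sketch, embed is an assumed helper returning the activations of the last hidden layer of the label predictor for a batch of images, and the colour scheme follows the one described above.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_space(embed, source_x, target_x, selected_x):
    """Project last-hidden-layer activations with t-SNE and colour them by origin
    (blue: source, red: target, green: selected subset), as described for Fig. 6."""
    feats = np.concatenate([embed(source_x), embed(target_x), embed(selected_x)])
    emb = TSNE(n_components=2).fit_transform(feats)

    n_s, n_t = len(source_x), len(target_x)
    plt.scatter(emb[:n_s, 0], emb[:n_s, 1], c="tab:blue", s=2, label="source")
    plt.scatter(emb[n_s:n_s + n_t, 0], emb[n_s:n_s + n_t, 1], c="tab:red", s=2, label="target")
    plt.scatter(emb[n_s + n_t:, 0], emb[n_s + n_t:, 1], c="tab:green", s=2, label="selected")
    plt.legend()
    plt.show()
```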
We compare in this section the two policies proposed for selecting the set of target prototypes B̂_r to be added to the source domain. To this end, we evaluate whether the label assigned to each of these prototypes is correct. In this case, we make use of the ground truth of the target domain just for the sake of analysis.

We show in Fig. 7 a dotted line with the performance of the confidence policy, which may serve as a baseline here, and eight results for the kNN policy with varying k values. As in the previous experiments, the reported figures are obtained for all combinations of datasets and hyper-parameters considered.

It is observed that, as the number of iterations of the algorithm increases, the accuracy of the additional labels assigned to the selected prototypes decreases. However, the kNN policy generally obtains better results from the first iteration, obtaining on average (over all iterations) an improvement of 6.36% with respect to the confidence policy. This improvement is significantly greater in the last iterations, with an increase of up to 24.85% between the result of the confidence policy and the best result obtained with the kNN policy.

The role of the parameter k is also illustrated in Fig. 7, where better results are attained as k is increased. It is shown that the impact of this parameter is more noticeable in the last iterations, where a difference of up to 8.91% is obtained between k = 3 and k = 150. Because the kNN selection policy worked better, this policy was used in all following experiments.

Figure 7: Accuracy of the labels assigned to the selected prototype set (B̂_r) from the target domain according to the selection policy.

In this section, we evaluate the final result obtained through the proposed iDANN method with the best combination of hyper-parameters previously obtained for each of the dataset pairs. We compare this result with that obtained by the original DANN method in order to check the goodness of the incremental approach.

Table 5 reports the results of the experiment, where rows indicate the dataset pairs (source and target) and columns represent the DA method. Concerning iDANN, we report two results: the accuracy of the labels assigned during the iterative process itself (1), as well as the accuracy using the CNN trained from scratch using only the target samples (once all the target samples have been assigned a label). In addition to the DANN and iDANN methods, we have also added the results obtained with neural networks trained just with the source set ('CNN Src.'), as well as the results obtained with neural networks directly trained with the target set ('CNN Tgt.'). The former serves as a baseline, to better assess the impact of the domain-adaptation mechanisms, while the second represents the upper bound of accuracy.

The first thing to remark is that the worst results obtained by the baseline ('CNN Src.') come from the combinations of single-digit datasets (MNIST, MNIST-M) as source and complex digit datasets (SVHN, Syn Numbers) as target. Furthermore, the best results from the baseline are reported for combinations where the source and target are similar (MNIST-M → MNIST, SVHN → Syn Numbers).

The original DANN method outperforms the results obtained by the baseline network ('CNN Src.') by 10.7% on average, obtaining the most significant improvement for the combination Syn Numbers → MNIST (improving by 29.31%). The impact of DANN is also noticeable when the dataset pair consists of similar tasks with the most complex one as target, such as MNIST → MNIST-M (improvement of 23%) or Syn Signs → GTSRB (improvement of 15.49%). These results for DANN have been obtained using our own implementation, following the details given in the original paper. We observed that the accuracy approximately matches that reported by the authors (for the 4 combinations they considered), and so we assume that our implementation is correct. We can therefore faithfully report the performance in all source–target combinations of our experiments.

Concerning the labels assigned during the proposed incremental approach iDANN (D_T Acc. (1)), the first thing to note is its improvement with respect to the underlying DANN method, which is around 16% on average. In the best case, this improvement reaches values around 33%, 35% and 36% for the Syn Numbers → MNIST-M, MNIST-M → Syn Numbers, and MNIST → Syn Numbers pairs, respectively. This confirms the goodness of our strategy, which uses the same domain adaptation method in a novel way.

Finally, if the CNN is trained from scratch with the target labels automatically assigned by iDANN (D_T Acc. (2)), it can further improve the results by up to 1.64% on average, and up to 5.5% in the best case (MNIST-M → Syn Numbers). It should be noted that in some specific combinations, this approach slightly outperforms the CNN trained with the correct target labels (for example, MNIST-M → MNIST or Syn Numbers → SVHN). It might happen that the incorrectly assigned labels of the iDANN process act as a regularizer that alleviates some overfitting.
Table 5: Accuracy (%) over the target dataset for different strategies: 'CNN Src.' indicates a neural network trained only with source samples; 'DANN' denotes the original DANN strategy; 'iDANN' yields two results from the incremental strategy: D_T Acc. (1) refers to the accuracy of the labels assigned to the target samples during the iterative process, while D_T Acc. (2) refers to the classification after training a new CNN from scratch using the labels assigned to the target samples; and 'CNN Tgt.' denotes a CNN trained using the ground truth of the target samples.
Source        Target        CNN Src.   DANN       iDANN                       CNN Tgt.
                            D_T Acc.   D_T Acc.   D_T Acc.(1)   D_T Acc.(2)   D_T Acc.
MNIST         MNIST-M       55.71      78.70      96.09         96.67         97.34
MNIST         SVHN          16.26      31.32      35.83         36.49         90.93
MNIST         Syn Numbers   32.14      44.66      80.79         84.82         99.34
MNIST-M       MNIST         97.95      98.65      99.04         99.59         98.94
MNIST-M       SVHN          32.91      41.41      61.87         61.89         90.93
MNIST-M       Syn Numbers   46.34      54.02      89.49         94.99         99.34
SVHN          MNIST         59.04      67.08      82.72         84.50         98.94
SVHN          MNIST-M       43.49      47.42      66.40         67.62         97.34
SVHN          Syn Numbers   88.42      89.56      96.43         98.10         99.34
Syn Numbers   MNIST         60.04      89.35      98.13         99.35         98.94
Syn Numbers   MNIST-M       41.84      54.38      87.10         90.26         97.34
Syn Numbers   SVHN          85.16      87.24      91.42         91.95         90.93
GTSRB         Syn Signs     76.39      86.22      98.28         98.57         99.74
Syn Signs     GTSRB         69.79      85.28      96.31         98.00         97.89
Average                     57.53      68.23      84.28         85.91         96.95

5.5. Comparison with the state of the art

To conclude the results section, we present below a comparison with other domain adaptation strategies from the state of the art. In these works, not all possible combinations of source–target pairs are considered, but only a few of them. We show in Table 6 the results reported in the literature, along with the results obtained by our proposal (iDANN). A brief description of the competing methods was provided in Section 2. Readers are referred to the corresponding references for details.

These results reveal that our method yields the best performance in 5 out of 7 source–target pairs. The performance of iDANN is especially remarkable in the case of MNIST → Syn Num, where the improvement reaches around 30% compared to the literature. For the cases in which our proposal does not attain the best result, we observe a dissimilar performance: it is still very competitive for the MNIST → SVHN pair, whereas it is outperformed for the SVHN → MNIST pair. When all the results are good, the improvement is relative, but when there is enough margin, the improvement is quite remarkable (as in the case of MNIST → Syn Num).
Table 6: Comparison of accuracy (%) between state-of-the-art DA approaches and iDANN. The first two rows denote the source and target dataset, respectively. A dash indicates that the result is not reported in the literature; cells marked with "—" correspond to values that could not be recovered from the source document.

Source:          MNIST     MNIST   MNIST     SVHN    SVHN      Syn Num   Syn Num
Target:          MNIST-M   SVHN    Syn Num   MNIST   Syn Num   MNIST     SVHN
SA [12]          56.9      –       –         59.32   –         –         86.44
DRCN [14]        –         –       –         81.97   –         –         –
DSN [6]          83.2      –       –         82.7    –         –         91.2
DANN [13]        76.66     12.4    22.9      73.85   96.9      87.6      91.09
D-CORAL [31]     –         35      55.8      76.3    95.5      89.9      78.8
ADDA [33]        –         –       –         76      –         –         –
ADA [16]         –         12.9    34.8      96.3    95.5      97.1      88.1
VADA [27]        –         18.6    45.9      92.9    96.8      96.2      85.3
DeepJDOT [10]    92.4      –       –         96.7    –         –         –
DTA [21]         –         –       –         —       –         –         –
iDANN            —         —       —         —       —         —         —
Furthermore, it should be noted that many of the compared methods propose a specific CNN architecture for each combination of datasets and/or focus on optimizing the result for a particular combination, such as DTA or DeepJDOT. In our case, we utilized the topologies proposed in the original DANN paper.

Unlike the results of the previous section, the DANN values of Table 6 are those reported in the original paper [13].
6. Conclusions and Future Work
This paper proposes an incremental strategy for the problem of domain adaptation with artificial neural networks. Our approach is built upon an existing domain adaptation approach, combined with a heuristic that, in each iteration, decides which prototypes of the target set can be added to the training set by considering the label provided by the neural network. To this end, two selection policies have been proposed: one directly based on the confidence given by the network to the prediction, and another based on geometric properties of the learned feature space. We observed that the latter reported a better performance, especially in the last iterations of the algorithm. In addition, we consider a final stage in which the labeled target set is used to train a new neural network with label smoothing.

Our experiments were performed on various corpora and using several configurations of the neural network. From the results, we conclude that the incremental approach outperforms the underlying DANN model, as well as other state-of-the-art methods. It is interesting to note that, in some cases, the iDANN approach improves the result obtained with the CNN trained directly with the ground-truth data of the target set, which could indicate that the incremental process also serves as a regularizer that leads to greater robustness. Furthermore, unlike the classic DANN, our approach improves results when domains are similar and helps keep the accuracy for the source domain. We also observed greater training stability and less dependence on the hyper-parameters set.

As future work, a primary objective would be to establish a well-principled stopping criterion that allows us to detect when the prediction over the target samples is not reliable. In addition, we want to extend the experiments to other types of input (such as sequences), as well as to study the behavior of the incremental strategy when the underlying DA method is different, given that there currently exist several architectures for this challenge. Note that our incremental approach is independent of the underlying DA model considered, and so it could be adopted as a generic strategy that might improve to the same extent as the underlying DA algorithm improves. Other avenues to further explore this proposal are to evaluate more neural network architectures, as well as adding data augmentation to the learning process.
References

[1] Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 898–916.
[2] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 151–175.
[3] Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2006). Analysis of representations for domain adaptation. In NIPS (pp. 137–144).
[4] Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19 (NIPS) (pp. 137–144). MIT Press.
[5] Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., & Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 95–104).
[6] Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 343–351). Curran Associates, Inc.
[7] Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Soulié & J. Hérault (Eds.), Neurocomputing (pp. 227–236). Berlin, Heidelberg: Springer.
[8] Cheng, L., & Pan, S. J. (2014). Semi-supervised domain adaptation on manifolds. IEEE Transactions on Neural Networks and Learning Systems, 2240–2249.
[9] Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 333–338. Selected Papers from IJCNN 2011.
[10] Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (pp. 467–483). Cham: Springer International Publishing.
[11] Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification, 2nd Edition. Wiley.
[12] Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In IEEE International Conference on Computer Vision (ICCV) (pp. 2960–2967).
[13] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 1–35.
[14] Ghifary, M., Kleijn, W. B., Zhang, M., Balduzzi, D., & Li, W. (2016). Deep reconstruction-classification networks for unsupervised domain adaptation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision – ECCV 2016 (pp. 597–613). Cham: Springer International Publishing.
[15] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
[16] Haeusser, P., Frerix, T., Mordvintsev, A., & Cremers, D. (2017). Associative domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
[17] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
[18] Kouw, W. M., & Loog, M. (2019). A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
[19] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
[20] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
[21] Lee, S., Kim, D., Kim, N., & Jeong, S.-G. (2019). Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
[22] van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 2579–2605.
[23] Moiseev, B., Konev, A., Chigorin, A., & Konushin, A. (2013). Evaluation of traffic sign recognition methods trained on synthetically generated data. In J. Blanc-Talon, A. Kasinski, W. Philips, D. Popescu, & P. Scheunders (Eds.), Advanced Concepts for Intelligent Vision Systems (pp. 576–583). Cham: Springer International Publishing.
[24] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
[25] Saito, K., Kim, D., Sclaroff, S., Darrell, T., & Saenko, K. (2019). Semi-supervised domain adaptation via minimax entropy. In The IEEE International Conference on Computer Vision (ICCV).
[26] Shao, L., Zhu, F., & Li, X. (2014). Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 1019–1034.
[27] Shu, R., Bui, H., Narui, H., & Ermon, S. (2018). A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations.
[28] Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
[29] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 1929–1958.
[30] Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 323–332. doi:10.1016/j.neunet.2012.02.016.
[31] Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation alignment for deep domain adaptation. In G. Hua & H. Jégou (Eds.), Computer Vision – ECCV 2016 Workshops (pp. 443–450). Cham: Springer International Publishing.
[32] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2818–2826).
[33] Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2962–2971).
[34] Villani, C. (2009). Optimal Transport: Old and New. Springer.
[35] Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 135–153.
[36] Yao, T., Pan, Y., Ngo, C.-W., Li, H., & Mei, T. (2015). Semi-supervised domain adaptation with subspace learning for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2142–2150).
[37] Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (NIPS).