Incremental Unsupervised Domain-Adversarial Training of Neural Networks
Antonio-Javier Gallego, Jorge Calvo-Zaragoza, Robert B. Fisher
a Department of Software and Computing Systems, University of Alicante, 03690 Alicante, Spain
b School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK

∗ Corresponding author. Email addresses: [email protected] (Antonio-Javier Gallego), [email protected] (Jorge Calvo-Zaragoza), [email protected] (Robert B. Fisher)
Abstract
In the context of supervised statistical learning, it is typically assumed that the training set comes from the same distribution that draws the test samples. When this is not the case, the behavior of the learned model is unpredictable and becomes dependent upon the degree of similarity between the distribution of the training set and the distribution of the test set. One of the research topics that investigates this scenario is referred to as domain adaptation. Deep neural networks brought dramatic advances in pattern recognition and that is why there have been many attempts to provide good domain adaptation algorithms for these models. Here we take a different avenue and approach the problem from an incremental point of view, where the model is adapted to the new domain iteratively. We make use of an existing unsupervised domain-adaptation algorithm to identify the target samples on which there is greater confidence about their true label. The output of the model is analyzed in different ways to determine the candidate samples. The selected set is then added to the source training set by considering the labels provided by the network as ground truth, and the process is repeated until all target samples are labelled. Our results report a clear improvement with respect to the non-incremental case in several datasets, also outperforming other state-of-the-art domain adaptation algorithms.
Keywords:
Domain Adaptation, Unsupervised learning, Neural Networks, Convolutional Neural Networks, Incremental labelling, Machine learning
1. Introduction
Supervised learning is the most considered approach for dealing with classification tasks. This paradigm is based on a sufficiently representative training set to learn a classification model. This level of representativeness is usually defined by two criteria: on the one hand, the training samples must be varied, which allows the algorithm to generalize instead of memorizing; on the other hand, the application of the trained model is assumed to be carried out on samples that come from the same distribution as those of the training set [11].

Building a training set fulfilling these conditions is not always straightforward. Although obtaining samples might be easy, assigning their correct labels is costly. This is why there are efforts to alleviate the aforementioned requirements. However, while the conflict between memorization and generalization has been well studied, and there exist established mechanisms to deal with it such as regularization or data augmentation [15], learning a model that is able to correctly classify samples from a different target distribution remains open to further research. This problem is generally called transfer learning (TL) [26], and when the classification labels do not vary in the target distribution it is usually referred to as domain adaptation (DA) [35].

Within the context of supervised learning, deep learning represents an important breakthrough [19]. This term refers to the latest generation of artificial neural networks, for which novel mechanisms have been developed that allow training deeper networks, i.e., with many layers. These deep neural networks represent the state of the art in many classification tasks, and have managed to break the existing glass ceiling in many traditionally complex tasks. In turn, deep learning often requires a large amount of data, which makes the study of DA even more interesting.

As we will review in the next section, there are several alternatives to attempt DA, both with general strategies and with deep neural networks. In this work we take a different avenue and study an incremental approach. We propose to use an existing DA algorithm for deep learning to classify those samples of the target domain for which the model is confident. Assuming the assigned labels as ground truth, the model is retrained. This added knowledge allows the network to refine its behavior to correctly classify other samples of the target set. This incremental process is repeated until the entire target set is completely annotated. We will show that this incremental approach achieves noticeable improvements with respect to the underlying DA algorithm. In addition, it is competitive on different benchmarks compared to other state-of-the-art DA algorithms.

The rest of the paper is structured as follows: we outline in Section 2 the existing literature about DA, with special emphasis on that based on deep neural networks; we present in Section 3 the proposed incremental methodology, as well as the underlying DA model that we consider in this work; we describe our experimental setting in Section 4, while the results are reported in Section 5; finally, the work is concluded in Section 6.
2. Background
Since the beginning of machine learning research, there exists the idea of exploiting a model beyond its use over unknown samples of the source distribution. In the literature we can find two main topics that pursue this objective: the aforementioned TL and DA strategies.

In TL, some knowledge of the model is used to solve a different classification task. For example, a pre-trained DNN model can be used as initialization [28, 17] or its feature extraction process can be considered as the basis of another classification model [37]. As a special case of TL, the DA challenge typically assumes that the classification task of the target distribution is the same (i.e., the set of labels is equal). In this work we focus on the latter case.

In a DA scenario, we can also distinguish between semi-supervised and unsupervised approaches. While semi-supervised DA considers that some labeled samples of the target distribution are available [8, 36, 25], unsupervised DA works with just unlabeled samples [18]. We revisit in this section unsupervised DA techniques, as is the case with the proposed approach.

Performing unsupervised DA is still considered an open problem from both theoretical and practical perspectives [5]. Most approaches consider that the key is to build a good feature representation that becomes invariant to the domain [4]. A good example is the Domain Adaptation Neural Network (DANN) proposed by Ganin et al. [13], which simultaneously learns domain-invariant features from both source and target data and discriminative features from the source domain. Following this line of research, many approaches have been proposed more recently. Virtual Adversarial Domain Adaptation (VADA), proposed by Shu et al. [27], added a penalty term to the loss function to penalize class boundaries that cross high-density feature regions. The Deep Reconstruction-Classification Networks (DRCN) [14] consist of a neural network that forces a common representation of both the source and target domains by sample reconstruction, while learning the classification task from the source samples. The Domain Separation Networks (DSN) proposed by Bousmalis et al. [6] are trained to map input representations onto both a domain-specific subspace and a domain-independent subspace, in order to improve the way that the domain-invariant features are learned. Haeusser et al. [16] proposed
Associative Domain Adaptation (ADA), which is another domain-invariant feature learning approach that reinforces associations between source and target representations in an embedding space with neural networks. The Adversarial Discriminative Domain Adaptation (ADDA) strategy [33] follows the idea of Generative Adversarial Networks, along with discriminative modeling and untied weight sharing, to learn domain-invariant features while keeping a useful representation for the discriminative task. Drop to Adapt (DTA) [21] makes use of adversarial dropout to enforce discriminative domain-invariant features. Damodaran et al. [10] proposed the Deep Joint Distribution Optimal Transport (DeepJDOT) approach, which learns both the classifier and aligned data representations between the source and target domain within a single neural framework, with a loss function based on Optimal Transport theory [34].

A different strategy for DA consists of learning how to transform features from one domain to another. Following this idea, the Subspace Alignment (SA) method [12] seeks to represent the source and target domains using subspaces modelled by eigenvectors. Then, it solves an optimization problem to align the source subspace with the target one. Also, Sun and Saenko proposed the Deep Correlation Alignment (D-CORAL) approach [31], which consists of a neural network that learns a nonlinear transformation to align correlations of layer activations from the source and target distributions.

While the methods outlined above seek new ways to achieve the desired characteristics of a proper DA method, our proposed approach takes a different avenue. Specifically, we build upon the existing DANN approach, and we propose novel ways to improve its ability to adapt to the target domain by performing the adaptation incrementally.
3. Methodology
Let X be the input space and Y be the output or label space. A classification task assumes that there exists a function f : X → Y that assigns a label to each possible sample of the input space. For supervised learning, the goal is to learn a hypothesis function h that models the unknown function f with the least possible error. We refer to h as the label classifier. Quite often, the approach is to estimate a posterior probability P(Y | X) so that the label classifier follows a maximum a posteriori decision such that h(x) = arg max_{y ∈ Y} P(y | x). This is the case with neural networks.

In the DA scenario, there exist two distributions over X × Y: D_S and D_T, which are referred to as the source domain and the target domain, respectively. We focus on the case of unsupervised domain adaptation, for which DA is only provided with a labeled source set S = {(x_i, y_i)}_{i=1}^{n} ∼ (D_S)^n and a completely unlabeled target domain T = {x_i}_{i=1}^{n'} ∼ (D_T)^{n'}. The goal of a DA algorithm is to build a label classifier for D_T by using the information provided in both S and T.

Given its importance in the context of our work, we further describe here the operation of DANN, which will be considered as the backbone for our incremental approach. DANN is based on the theory of learning from different domains discussed by [3, 2]. This suggests that the transfer of the knowledge gained from one domain to another must be based on learning features that do not allow discriminating between the two domains (source and target) of the samples to be classified. For this, DANN learns a classification model from features that do not encode information about the domain of the sample to be classified, thus generalizing the knowledge from a labeled source domain to an unlabeled target domain.

More specifically, the proposed neural architecture includes a feature extractor module (G_f) and a label classifier (G_y), which together build a standard feed-forward neural network that can be trained to classify an input sample x into one of the possible categories of the output space Y. The last layer of the label classifier G_y uses a "softmax" activation, which models the posterior probability P(y | x), ∀ y ∈ Y, of a given input x ∈ X.

DANN adds a new domain classifier module (G_d) to the neural network, which classifies the domain to which the input sample x belongs. This classifier is built as a binary logistic regressor that models the probability that an input sample x comes from the source distribution (d_i = 0 if x ∼ D_S) or the target distribution (d_i = 1 if x ∼ D_T), where d_i denotes a binary variable that indicates the domain of the sample.

The unsupervised adaptation to a target domain is achieved as follows: the domain classifier G_d is connected to the feature extractor G_f (which is shared with the label classifier G_y) through the so-called gradient reversal layer (GRL). This layer does nothing at prediction time. However, while learning through back-propagation, it multiplies the gradient by a certain negative constant: the GRL receives the gradient from the subsequent layer and multiplies it by −λ, therefore changing its sign before passing it to the preceding layer. The idea of this operation is to force G_f to learn generic features that do not allow discriminating the domain.
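To make the behaviour of the GRL concrete, the sketch below shows one common way to implement it. This is an illustrative PyTorch rendering rather than the authors' code, and the names GradReverse and grad_reverse are ours.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass,
    multiplies the incoming gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient that flows towards the feature extractor.
        return grad_output.neg() * ctx.lambd, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```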
In addition, since this training is carried out simultaneously with the training of G_y (the label classifier), the features must be adequate for discriminating the categories to classify, yet unbiased with respect to the input domain. According to the DA theory, this should cause G_y to be able to correctly classify input samples regardless of their domain, given that the features from G_f are forced to be invariant.

The DANN training simultaneously updates all modules, providing samples for both G_y and G_d. This can be done by using conventional mechanisms such as Stochastic Gradient Descent, from batches that include half of the examples from each domain. During the training process, the learning of G_f pursues a trade-off between appropriate features for the classification (G_y) and inappropriate features for discriminating the domain of the input sample (G_d). The hyper-parameter λ allows tuning this trade-off. The training is performed until the result converges to a saddle point, which can be found as a stationary point of the gradient update defined by the following equation:

    θ_f ← θ_f − µ ( ∂L_y/∂θ_f − λ ∂L_d/∂θ_f )        (1)

where θ_f denotes the weights of G_f, µ denotes the learning rate, and L_y and L_d represent the loss functions for the label classifier and the domain classifier, respectively. A graphical overview of the DANN architecture is depicted in Fig. 1.

Figure 1: Graphical overview of the DANN architecture, consisting of three blocks: feature extractor (G_f), label classifier (G_y), and domain classifier (G_d). The GRL circle denotes the gradient reversal layer that multiplies the gradient by a negative factor.

Our main contribution within the context of DA is to propose an incremental approach to DANN (iDANN). This strategy is explained below. Once the DANN model is trained as explained in the previous section, we can use both the feature extractor G_f and the label classifier G_y to predict the category of samples from both the target domain and the source domain (G_y(G_f(x))). The "softmax" activation used at the output of this classifier returns the posterior probability that the network considers x to belong to any of the classes of the output space Y.

Our main assumption is that we can use the subset of samples from the target domain for which G_y is more confident, and then add them to the labeled source domain assuming the prediction as ground truth. These samples are thereafter considered completely as samples of the source domain. Afterwards, we can retrain the DANN network to fine-tune its weights using the new training set. This process is repeated iteratively, moving the labeled samples with greater confidence from the target domain to the source domain after each iteration. We stop when there are no more samples to move from the target domain.

The intuitive idea behind our approach is that by adding target domain information to the source (labeled) domain, the DANN learns new domain-invariant features that better fit the eventual classification task, thereby becoming more accurate for other target domain samples. In each iteration, however, the task increases its complexity because it deals first with the simplest samples to classify (for which the DANN is more confident), leaving those that have more dissimilar features in the unlabeled target set.
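As an illustration of how the update of Eq. (1) can be realized in practice, the following hedged sketch performs one joint DANN step in PyTorch. It assumes G_f, G_y and G_d are torch.nn modules, reuses the grad_reverse helper from the previous sketch, and is not the authors' implementation; through the GRL, the −λ ∂L_d/∂θ_f term of Eq. (1) is applied implicitly by ordinary back-propagation.

```python
import torch
import torch.nn.functional as F

def dann_step(G_f, G_y, G_d, optimizer, xs, ys, xt, lambd):
    """One joint update: label loss on source samples, domain loss on source + target.
    The GRL (grad_reverse, defined above) makes the reversed domain gradient implicit."""
    optimizer.zero_grad()

    f_s, f_t = G_f(xs), G_f(xt)                      # shared feature extractor

    loss_y = F.cross_entropy(G_y(f_s), ys)           # label classifier, source only

    feats = torch.cat([f_s, f_t])                    # domain classifier, both domains
    d_true = torch.cat([torch.zeros(len(xs)), torch.ones(len(xt))]).to(feats.device)
    d_prob = G_d(grad_reverse(feats, lambd)).squeeze(1)   # sigmoid output of G_d
    loss_d = F.binary_cross_entropy(d_prob, d_true)

    (loss_y + loss_d).backward()                     # back-propagate the joint objective
    optimizer.step()
    return loss_y.item(), loss_d.item()
```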
When the DANN is retrained with labeled samples that include target domain information, the domain classifier G_d needs to be more specific. This forces the feature extraction module G_f to forget the features that differentiate the more complex samples of the target domain.

We formalize the process in Algorithm 1, where e and b represent the number of epochs and the batch size considered, respectively, e_inc denotes the number of epochs for the incremental stage of the algorithm, r indicates the size of the subset of target domain samples to select in each iteration, and β is a constant that allows us to modify this size after each iteration. In this algorithm, the samples of the target domain (B̂) are classified using the label classifier G_y, which then proceeds to select a subset B̂_r of size r to be moved from the target domain to the source domain. For this purpose, two selection criteria are proposed, which are described in the next section.

Algorithm 1: Incremental DANN (iDANN)
Input:  S ← {(x_i, y_i) ∼ D_S}
        T ← {(x_i) ∼ D_T}
        e, e_inc, b, r, λ, β ← initial hyper-parameter values
Output: G_f, G_y, CNN

 1: while T ≠ ∅ do
 2:     G_f, G_y ← Fit DANN with {S, T, e, b, λ}
 3:     B̂_r ← selection_policy(G_f, G_y, T, r)
 4:     S ← S ∪ B̂_r
 5:     T ← T \ B̂_r
 6:     e ← e_inc
 7:     r ← β r
 8: end while
 9: T̂ ← {(x_i, y_i) | x_i ∼ D_T, y_i = G_y(G_f(x_i))}
10: Fit CNN with {T̂, e, b}

Once the iterative stage of the algorithm ends, the label classifier G_y is used to classify the entire original target domain (see line 9 of Algorithm 1). This labeled target set is then used to train a neural network from scratch, which is therefore specialized in classifying target domain samples (more details in Section 3.5).

Below we describe in detail the two proposed policies to select samples during the iterative stage of Algorithm 1 (selection_policy). One policy is directly based on the confidence level that the network provides to the prediction, while the other is based on geometric properties of the learned feature space.
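The following Python sketch mirrors Algorithm 1 at a schematic level; fit_dann, selection_policy, predict_label and fit_cnn are hypothetical placeholders for the procedures described in the text, not functions from any released code.

```python
def idann(S, T, e, e_inc, b, r, lambd, beta, selection_policy):
    """Schematic rendering of Algorithm 1.
    S: list of labelled source pairs (x, y); T: list of unlabelled target samples."""
    T_all = list(T)                                   # keep the full target set for the final stage
    while T:
        G_f, G_y = fit_dann(S, T, epochs=e, batch_size=b, lambd=lambd)  # adversarial training
        B_r = selection_policy(G_f, G_y, T, r)        # confident target samples + pseudo-labels
        S = S + B_r                                   # pseudo-labels treated as ground truth
        chosen = {id(x) for x, _ in B_r}
        T = [x for x in T if id(x) not in chosen]     # shrink the unlabelled target pool
        e = e_inc                                     # later iterations only fine-tune
        r = int(beta * r)                             # grow the selection size
    # Final stage: label the whole original target set and train a plain CNN on it.
    T_hat = [(x, predict_label(G_f, G_y, x)) for x in T_all]
    cnn = fit_cnn(T_hat, epochs=e, batch_size=b)
    return G_f, G_y, cnn
```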
As mentioned above, the output of the label classifier G_y uses a softmax activation. Let L denote the number of labels. Then, the standard softmax function σ : R^L → R^L is defined by Equation 2:

    σ(z)_i = e^{z_i} / Σ_{j=1}^{L} e^{z_j},   for i = 1, ..., L and z = (z_1, ..., z_L) ∈ R^L        (2)

This function normalizes an L-dimensional vector z of unbounded real values into another L-dimensional vector σ(z), whose values lie in the range [0, 1] and add up to 1. This can be interpreted as a posterior probability over the different possible labels [7]. In order to turn these probabilities into the predicted class label, we simply take the argmax-index position of this output vector, following a Maximum a Posteriori probability criterion.

Taking advantage of this interpretation, the first policy for selecting samples to move from the target domain to the source is based on the probability provided by the label classifier G_y, which can be seen as a measure of confidence in such classification. With this criterion, we keep the maximum predicted probability value for each sample of the target set among the possible labels. Then, we order all samples based on this value, from highest to lowest, in order to select the first r samples to build the subset B̂_r.

Algorithm 2 presents the algorithmic description of this process, where G_y^p refers to the probabilistic output of the label classifier after the softmax activation, before applying argmax to select a label. The function sortr is used to sort the set in decreasing order.

Algorithm 2: Confidence policy
Input:  T ← {(x_i) ∼ D_T}
        G_f, G_y ← feature extractor and label classifier
        r ← size of the subset of samples to select
Output: B̂_r

B̂ ← G_y^p(G_f(T))
B̂_r ← ∅
foreach (x_i, y_i) ∈ sortr(B̂) do
    B̂_r ← B̂_r ∪ (x_i, y_i)
    if |B̂_r| = r then
        break
    end if
end foreach

Figure 2 shows an example of a set of probabilities obtained after predicting the target samples with DANN. The figure on the left shows the maximum probability values obtained for the classification of each sample (without sorting), while the figure on the right shows the sorted set, where the threshold r has been highlighted.

Figure 2: Example of probabilities obtained with DANN. Left: maximum probability of each sample. Right: ordered set of maximum probabilities, where the threshold r has been highlighted.
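A minimal NumPy sketch of the confidence policy of Algorithm 2 is given below; G_f and G_y_prob stand for the feature extractor and the probabilistic (softmax) output of the label classifier, and the function name is ours.

```python
import numpy as np

def confidence_policy(G_f, G_y_prob, T, r):
    """Algorithm 2 sketch: rank target samples by their maximum softmax probability
    and return the r most confident ones together with their predicted labels."""
    probs = G_y_prob(G_f(np.stack(T)))       # shape (len(T), L): softmax outputs
    labels = probs.argmax(axis=1)            # MAP label for each target sample
    confidence = probs.max(axis=1)           # confidence = highest class probability
    order = np.argsort(-confidence)          # most confident first
    return [(T[i], int(labels[i])) for i in order[:r]]
```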
kNN policy

As in the previous case, once the network has been trained, we use the label classifier G_y to predict the labels of the whole target domain and then sort them based on the confidence given by the network. However, in this case, instead of directly selecting a subset of samples according to this confidence, we also evaluate the geometric properties of the feature space. This is performed following the k-nearest neighbor rule.

We first obtain the feature set F_S from the source set S (using G_f(S)). We then proceed to iterate over the target set samples sorted by their level of confidence. Given a target sample, if the labels of the k nearest samples of the source domain match the label assigned by the label classifier G_y, then we select the prototype. Otherwise, we discard it. Therefore, samples are selected based on both the confidence provided by the DANN in their label and the extent to which they match the distribution of the source domain.

Algorithm 3 describes this process algorithmically. The kNN(q, F_S, k) function receives as parameters the query sample q, the set F_S and the value k to be used, and yields the predicted label l and the number of samples m within its k nearest neighbors from S that have the same label.

Algorithm 3: kNN policy

Input:  S ← {(x_i, y_i) ∼ D_S}
        T ← {(x_i) ∼ D_T}
        G_f, G_y ← feature extractor and label classifier
        r ← size of the subset of samples to select
        k ← number of neighbors to consider
Output: B̂_r

F_S ← G_f(S)
B̂ ← G_y^p(G_f(T))
B̂_r ← ∅
foreach (x_i, y_i) ∈ sortr(B̂) do
    f_T^(i) ← G_f(x_i)
    l, m ← kNN(f_T^(i), F_S, k)
    if y_i = l and m = k then
        B̂_r ← B̂_r ∪ (x_i, y_i)
        if |B̂_r| = r then
            break
        end if
    end if
end foreach

The idea of this policy is to select the samples of the target domain whose features lie within the cluster of the source domain for the same class. An illustrative example of this condition is shown in Fig. 3 with k = 5. The example shows two labels of the source domain as green circles and blue squares. The red stars denote the target domain examples that are being evaluated to determine if they are selected. For instance, the star on the left would be selected if, and only if, the network classified it as a green circle, since its 5 nearest neighbors are green circles. Similarly, the star on the right would be selected if, and only if, the network classified it as a blue square. However, the central star would always be discarded because its 5 neighbors belong to two different classes. If we increase k, the red star on the left would still be selected (if labeled as a green circle) because it is located in the middle of the cluster. However, the red star on the right is closer to the label boundaries, and so it would eventually be discarded.

Figure 3: Example of sample selection using the kNN policy with k = 5. Green circles and blue squares represent samples of two different classes from the source domain. Red stars represent the samples of the target domain that are evaluated to determine whether they are chosen.
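To illustrate the kNN policy of Algorithm 3, the following sketch combines the network's confidence ranking with a unanimity check over the k nearest source features. It relies on scikit-learn's NearestNeighbors and uses assumed helper names (G_f, G_y_prob), so it should be read as an outline rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_policy(G_f, G_y_prob, S, T, r, k):
    """Algorithm 3 sketch: a confident target sample is kept only if all of its k nearest
    source features carry the same label that the network predicts for it."""
    xs, ys = zip(*S)
    ys = np.asarray(ys)
    F_S = G_f(np.stack(xs))                           # feature set of the source domain
    nn = NearestNeighbors(n_neighbors=k).fit(F_S)

    F_T = G_f(np.stack(T))                            # target features
    probs = G_y_prob(F_T)
    labels, conf = probs.argmax(axis=1), probs.max(axis=1)

    B_r = []
    for i in np.argsort(-conf):                       # iterate targets by decreasing confidence
        _, idx = nn.kneighbors(F_T[i:i + 1])
        if np.all(ys[idx[0]] == labels[i]):           # unanimous agreement with the predicted label
            B_r.append((T[i], int(labels[i])))
            if len(B_r) == r:
                break
    return B_r
```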
As described in Algorithm 1, once the iterative stage of the iDANN algorithm is completed, we use the label classifier G_y to annotate the entire original target set T from scratch. Then, a new CNN is trained by conventional means considering the same neural architecture as G_y(G_f(·)). This allows us to eventually obtain a neural network that is directly specialized in the classification of the target domain.

However, we assume that some part of the iterative annotation of the target set will contain noise at the label level. To mitigate the possible effects of this noise, we consider label smoothing [32]. This is an efficient and theoretically grounded strategy for dealing with label noise, which also makes the model less prone to overfitting. Compared to the classical one-hot output representation, label smoothing changes the construction of the true probability to

    y'_i = (1 − ε) y_i + ε/L,        (3)

where ε is a small constant (or smoothing parameter) and L is the total number of classes. Hence, instead of minimizing cross-entropy with hard targets (0 or 1), it considers soft targets.
4. Experimental setup
The proposed approach is evaluated on two different classification tasks that are common in the DA literature. The first one is digit classification, for which we consider the following datasets:

• MNIST [20]: this collection contains 28 × 28 images representing isolated handwritten digits.

• MNIST-M [13]: this dataset was synthetically generated by merging MNIST samples with random color patches from BSDS500 [1].

• Street View House Numbers (SVHN) [24]: it consists of images obtained from house numbers in Google Street View. It represents a real-world challenge of digit recognition in natural scenes, for which several digits might appear in the same image and only the central one must be classified.

• Synthetic Numbers [13]: images of digits generated using Windows™ fonts, with varying position, orientation, color and resolution.

In addition, we also evaluate our approach on traffic sign classification with the following datasets:

• German Traffic Sign Recognition Benchmark (GTSRB) [30]: this dataset contains images of traffic signs obtained from the real world in different sizes, positions, and lighting conditions, as well as including occlusions.

• Synthetic Signs [23]: this dataset was synthetically generated by taking common street signs from Wikipedia and applying several transformations. It tries to simulate images from GTSRB, although there are significant differences between them.

Table 1 summarizes the information of our evaluation corpora, including the domain to which they belong, the number of labels, the image resolution, the number of samples, and the type of image, indicating whether they are in color or grayscale format. Figure 4 shows some random examples from each of these datasets.
Table 1: Description of the datasets used in the experimentation.
Set             Labels   Dataset        Resolution   Type    Samples
Numbers         10       MNIST          28 × 28      Gray    65,000
                         MNIST-M        28 × 28      Color   65,000
                         SVHN           32 × 32      Color   99,289
                         Syn. Numbers   32 × 32      Color   488,953
Traffic signs   43       GTSRB          variable     Color   —
                         Syn. Signs     40 × 40      Color   100,000

Figure 4: Random examples from the datasets used in the experimentation: (a) MNIST, (b) MNIST-M, (c) SVHN, (d) Syn. Numbers, (e) GTSRB, (f) Syn. Signs.
The images of each classification task were rescaled to the same size: the digits to 28 × 28 pixels and the traffic signs to 40 × 40 pixels. Concerning the pre-processing of the input data, the images were normalized within the range [0, 1].

4.2. CNN architectures

To evaluate the proposed methodology, the same three CNN architectures used in the original DANN paper have been tested. Table 2 reports a summary of these architectures. As the authors pointed out, these topologies are not necessarily optimal and better adaptation performance might be attained if they were tweaked. However, we chose to keep the same configuration to make a fairer comparison.

As the activation function, a Rectified Linear Unit (ReLU) was used for each convolution layer and fully-connected layer, except for the output layers. L neurons with softmax activation were used as the output of the label classifier. For the output of the domain classifier, a single neuron with a logistic (sigmoid) activation function was used to discriminate between the two possible categories (source domain or target domain).

Model 1 was used for all the experiments with digit datasets, except those using SVHN. This topology is inspired by the classical LeNet-5 architecture [20]. Model 2 was used to evaluate the experiments with digits that include SVHN. This architecture is inspired by [29]. Finally, Model 3 was used for the experiments with traffic signs. In this case, the single-CNN baseline obtained from [9] was used.

Table 2: CNN network configurations considered. Notation: Conv(f, w, h) stands for a layer with f convolution operators with a kernel of size w × h pixels; MaxPool(w, h) stands for the max-pooling operator of dimensions w × h pixels, with a 2 × 2 stride; FC(n) represents a fully-connected layer of n neurons. In the output layer of the label classifier, a fully-connected layer of L neurons with softmax activation is added, where L denotes the number of categories of the dataset at issue.

Model   Feature extractor (Layer 1 / Layer 2 / Layer 3)   Label classifier   Domain classifier
To ensure a fair comparison with the original DANN algorithm, we set the same training configuration: Stochastic Gradient Descent with a learning rate of 0.01, a decay of 10⁻, and momentum of 0.9, as well as the same number of epochs (300).

For the iterative stage of iDANN, we set e_inc to 25. This value was determined empirically: we observed that it allowed the network weights to be tuned with the new knowledge without taking too long to perform a new iteration. In each training iteration, the greater improvement occurs in the first epochs, after which the accuracy of the label classifier stabilizes.

Concerning the size of the subset to select from the target set (r), we decided to consider a percentage of the remaining samples rather than a fixed value. Initially, we set it to 5%, and it was increased after each iteration by a factor of β = 1.5 until all target domain samples are selected. This value was also obtained empirically, by observing better results and more stable training if few samples are added in the first iterations.

Different values for both the batch size b and λ are evaluated, as will be reported in the experimentation section.
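As a small worked example of this schedule, the sketch below computes how many target samples would be selected per iteration under the stated values (an initial 5% of the remaining samples, grown by a factor β = 1.5 each iteration); the function name and rounding details are ours.

```python
def selection_schedule(n_target, r0=0.05, beta=1.5):
    """Number of target samples selected per iteration: an initial fraction r0 of the
    remaining samples, with the fraction multiplied by beta after every iteration."""
    remaining, fraction, sizes = n_target, r0, []
    while remaining > 0:
        take = max(1, min(remaining, round(fraction * remaining)))
        sizes.append(take)
        remaining -= take
        fraction *= beta
    return sizes

# For 10,000 target samples the first iterations select roughly 500, 710 and 990 samples.
```

With these values the selection fraction exceeds 100% at the ninth step, so the loop always runs for nine iterations regardless of the target set size, which is consistent with the nine iterations that appear in the results reported below.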
5. Results
In this section we evaluate the proposed method using the datasets, topologies, and settings described in Section 4. We first study the different hyper-parameterizations, as well as the two prototype selection policies proposed. Next, we show the performance results obtained over the datasets and, finally, we compare with other state-of-the-art methods.
In this section, we start by analyzing the influence of the batch size and the value of λ on the performance of the method, as these hyper-parameters are the ones that affect the training stage the most. For this, we consider batch sizes of {16, 32, 64, 128, 256, 512} and four values of λ. This means that each result comes from a total of 336 experiments (14 combinations of dataset pairs × 6 batch sizes × 4 values of λ). The rest of the hyper-parameters are set as indicated in Section 4.3, that is: e = 300 (as in the original DANN paper), e_inc = 25, and r = 5%, which were empirically determined to favor stable training and obtain good results. In addition, we evaluate the results using only the prototype selection policy based on the network's confidence, as the next section will be devoted to comparing the two proposed policies with the best hyper-parameters found.

As we are dealing with an unsupervised method, we mainly focus on analyzing the trend when modifying these parameters. Table 3 shows the results of this experiment, where each figure represents the average over the 14 possible combinations of source and target domain of the datasets considered and over all the iterations performed by the iDANN algorithm.

The first thing to remark is that some of the hyper-parameter combinations evaluated in these experiments do not converge (e.g., a batch size of 32 with the highest λ for traffic signs). This could be detected automatically, since the accuracy is abruptly reduced to a value approximately equal to a random guess, for both the training set and evaluation set and for both the source and the target domain. However, these results have been kept in order to observe the general trend of the method and how these parameters affect it.

It can also be observed that the best performance is achieved with a low value of λ in the two types of corpora, while batch sizes of 64 and 32 are better for the digits and traffic signs, respectively. On average, better results are reported with low λ values and batch sizes between 32 and 256. When λ is greater, the training becomes highly unstable, especially if combined with small batch sizes.
Numbers Traffic signs λ λ
Next, we analyze the influence of these parameters with respect to the iteration of the iDANN algorithm. Table 4 shows the average result obtained by grouping all combinations of datasets (numbers and traffic signs) and hyper-parameters considered. As in the previous analysis, better results are also observed for low λ values and batch sizes between 32 and 256 (see column 'Avg.'). In this case, it can also be seen that low λ values are more appropriate in the first iterations, whereas greater λ values are more appropriate in the last iterations. It might happen that a more stable way of proceeding (low λ) is preferred in the first iterations, even at the cost of being less aggressive in the domain adaptation. Therefore, we propose to start with a low λ and increase its value gradually after each epoch.

Additionally, it is observed that each iteration of the algorithm leads to a better result than the previous one (except for the largest λ values), yielding the highest leap in the first iterations and reducing this difference towards the last iterations. Including all cases, the results improve by 5.19% between the first and the last iteration, on average. If we ignore those settings that do not converge, the average improvement increases to around 10%.
Table 4: Influence of hyper-parameter setting on the performance of the iDANN algorithm with respect to the number of iterations. Figures report classification accuracy over the target set, averaged over the respective datasets. Missing entries correspond to values that were highlighted in the source table and could not be recovered.

                         Iterations
λ      Batch    1      2      3      4      5      6      7      8      9      Avg.
10^-   16      58.46  59.71  61.46  62.81  63.91  64.86  65.39  65.78  66.04  63.16
       32      65.20  67.85  69.37  69.91  70.73  71.14  71.89  72.11   —      —
       64      63.88  67.11  68.33  68.75  69.35  70.30  70.71  71.13  71.22  68.97
       128     62.67  65.52  67.09  67.68  68.19  68.84  69.46  69.88  70.06  67.71
       256     62.82  65.20  66.92  67.83  68.65  68.84  69.57  70.09  70.34  67.81
       512     61.51  63.28  64.32  65.45  65.97  66.93  67.60  67.87  68.03  65.66
10^-   16      56.65  57.70  59.65  60.72  61.43  62.25  62.75  63.15  63.31  60.85
       32      63.95  67.42  68.68  69.93  70.59  71.26  71.79  72.19  72.33  69.79
       64      64.66  67.39  68.78  69.89  70.41  71.32  72.06  72.44   —      —
       128     63.75  66.59  68.61  69.07  69.93  70.82  71.44  72.00  72.14  69.37
       256     62.38  65.08  66.67  67.44  67.69  68.52  69.10  69.49  69.71  67.34
       512     61.36  63.46  64.72  65.48  66.25  66.96  67.42  67.84  68.08  65.73
10^-   16      56.73  61.57  63.37  65.13  65.81  66.31  62.94  63.13  63.26  63.14
       32      62.07  64.75  66.01  66.68  67.46  67.78  68.23  68.47  68.80  66.69
       64      64.78  67.77  69.44  70.20  71.21  71.92  72.35  72.74   —      —
       128     64.49  67.27  69.02  69.75  70.71  71.58  72.00  72.72  72.87  70.05
       256     63.16  65.55  66.75  67.49  68.32  68.76  69.51  69.81  70.21  67.73
       512     61.61  63.49  64.82  65.57  66.10  66.81  67.52  68.01  68.28  65.80
10^-   16      46.48  50.71  52.12  53.35  53.78  42.41  42.98  43.35  43.52  47.63
       32      50.09  53.07  47.66  42.61  43.12  44.06  44.39  44.91  44.96  46.10
       64      61.64  64.02  64.24  54.01  54.48  55.44  55.96  56.47  56.59  58.09
       128     55.72  57.10  57.16  57.01  52.95  53.15  53.63  53.50  53.69  54.88
       256     60.13  62.26  62.60  63.53  64.25  64.93  65.57  66.04   —      —
       512     53.54  54.39  54.63  55.01  55.69  56.04  56.57  56.84  56.95  55.52

We now evaluate the effect of the incremental training process on the domain adaptation approach. Figure 5 shows the evolution of the accuracy obtained over the target test set during the training process for the Syn Numbers → MNIST-M combination of datasets, with a batch size of 64 and a low value of λ. The training epochs are represented on the horizontal axis, while the iterations (i.e., when new training samples are added) are highlighted with blue lines and marked above. It can be observed that in the first iteration (spanning 300 epochs), the accuracy slowly improves until around 150 epochs, after which it becomes stable. In the subsequent iterations, the accuracy further improves, especially during iterations 2, 3 and 4. Then, the performance increase is gradually reduced until it is hardly noticeable.

Figure 5: Accuracy curve with respect to training epochs and iterations of the incremental approach.

To provide further analysis, we also examine the representation space learned by the network in each of these iterations, using the same combination of datasets and training parameters. We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) [22] projection to visualize the samples according to their representation by the last hidden layer of the label predictor. Figure 6 shows a visualization of the features learned after each of the iterations, where the red color represents the target domain, the blue color represents the source domain, and the green color represents the set B̂_r (selected samples) using the confidence policy. This representation reveals 10 well-defined clusters (the 10 possible classes of the datasets considered for this analysis) around an additional central cluster. This central cluster groups the samples of the target domain (red color) whose representation does not yet correspond to any of the existing classes. This cluster would therefore correspond to target samples whose representation has not been correctly mapped onto any of the source domain classes. Iteratively, the method selects samples (green points) of the target domain and moves them to the source domain. In the first iterations (until approximately the 6th one), the method selects only samples that are well located in one of the source domain clusters (that is, those samples for which the network is more confident). Due to this process, the size of the central cluster is reduced. It is important to emphasize that this cluster becomes smaller although no samples out of it are selected, which indicates that the network is learning to better map those samples because of the selected samples of previous iterations. Towards the last iterations, the method begins to select the most complex samples that are still in this additional cluster. In Fig. 6(*) (which is the same as Fig. 6(9) but highlighting each class with a different color), the additional cluster of target samples still appears without being mapped, yet with a very small size. This cluster contains almost all the classification errors, having incorrectly mapped only some isolated prototypes onto the actual class clusters.

Figure 6: t-SNE representation of the feature space of the neural network (from the last hidden layer) with respect to the iteration of the approach. The red color represents the target domain, the blue color represents the source domain, and the green color represents the set B̂_r. The last representation (*) depicts each sample according to its actual category by using different colors.
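A visualization in the spirit of Fig. 6 can be produced with scikit-learn's t-SNE. In this hedged sketch, embed is an assumed helper returning the activations of the last hidden layer of the label predictor for a batch of images, and the colour scheme follows the one described above.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_space(embed, source_x, target_x, selected_x):
    """Project last-hidden-layer activations with t-SNE and colour them by origin
    (blue: source, red: target, green: selected subset), as described for Fig. 6."""
    feats = np.concatenate([embed(source_x), embed(target_x), embed(selected_x)])
    emb = TSNE(n_components=2).fit_transform(feats)

    n_s, n_t = len(source_x), len(target_x)
    plt.scatter(emb[:n_s, 0], emb[:n_s, 1], c="tab:blue", s=2, label="source")
    plt.scatter(emb[n_s:n_s + n_t, 0], emb[n_s:n_s + n_t, 1], c="tab:red", s=2, label="target")
    plt.scatter(emb[n_s + n_t:, 0], emb[n_s + n_t:, 1], c="tab:green", s=2, label="selected")
    plt.legend()
    plt.show()
```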
We compare in this section the two policies proposed for selecting the set of target prototypes B̂_r to be added to the source domain. To this end, we evaluate whether the label assigned to each of these prototypes is correct. In this case, we make use of the ground truth of the target domain just for the sake of analysis.

We show in Fig. 7 a dotted line with the performance of the confidence policy, which may serve as a baseline here, and eight results for the kNN policy with varying k values. As in the previous experiments, the reported figures are obtained for all combinations of datasets and hyper-parameters considered.

It is observed that, as the number of iterations of the algorithm increases, the accuracy of the additional labels assigned to the selected prototypes decreases. However, the kNN policy generally obtains better results from the first iteration, obtaining on average (over all iterations) an improvement of 6.36% with respect to the confidence policy. This improvement is significantly greater in the last iterations, with an increase of up to 24.85% between the result of the confidence policy and the best result obtained with the kNN policy.

The role of the parameter k is also illustrated in Fig. 7, where better results are attained as k is increased. It is shown that the impact of this parameter is more noticeable in the last iterations, where a difference of up to 8.91% is obtained between k = 3 and k = 150. Because the kNN selection policy worked better, this policy was used in all following experiments.

Figure 7: Accuracy of the labels assigned to the selected prototype set (B̂_r) from the target domain according to the selection policy.

In this section, we evaluate the final result obtained through the proposed iDANN method with the best combination of hyper-parameters previously obtained for each of the dataset pairs. We compare this result with that obtained by the original DANN method in order to check the goodness of the incremental approach.

Table 5 reports the results of the experiment, where rows indicate the dataset pairs (source and target) and columns represent the DA method. Concerning iDANN, we report two results: the accuracy of the labels assigned during the iterative process itself (1), as well as the accuracy using the CNN trained from scratch using only the target samples (once all the target samples have been assigned a label). In addition to the DANN and iDANN methods, we have also added the results obtained with neural networks trained just with the source set ('CNN Src.'), as well as the results obtained with neural networks directly trained with the target set ('CNN Tgt.'). The former serves as a baseline, to better assess the impact of the domain-adaptation mechanisms, while the second represents the upper bound of accuracy.

The first thing to remark is that the worst results obtained by the baseline ('CNN Src.') come from the combinations of single-digit datasets (MNIST, MNIST-M) as source and complex digit datasets (SVHN, Syn Numbers) as target. Furthermore, the best results from the baseline are reported for combinations where the source and target are similar (MNIST-M → MNIST, SVHN → Syn Numbers).

The original DANN method outperforms the results obtained by the baseline network ('CNN Src.') by 10.7% on average, obtaining the most significant improvement for the combination Syn Numbers → MNIST (improving by 29.31%). The impact of DANN is also noticeable when the dataset pair consists of similar tasks with the most complex one as target, such as MNIST → MNIST-M (improvement of 23%) or Syn Signs → GTSRB (improvement of 15.49%). These results for DANN have been obtained using our own implementation, following the details given in the original paper. We observed that the accuracy approximately matches that reported by the authors (for the 4 combinations they considered), and so we assume that our implementation is correct. We can therefore faithfully report the performance in all source–target combinations of our experiments.

Concerning the labels assigned during the proposed incremental approach iDANN (D_T Acc. (1)), the first thing to note is its improvement with respect to the underlying DANN method, which is around 16% on average. In the best case, this improvement reaches values around 33%, 35% and 36% for the Syn Numbers → MNIST-M, MNIST-M → Syn Numbers, and MNIST → Syn Numbers pairs, respectively. This confirms the goodness of our strategy, which uses the same domain adaptation method in a novel way.

Finally, if the CNN is trained from scratch with the target labels automatically assigned by iDANN (D_T Acc. (2)), it can further improve the results by up to 1.64% on average, and up to 5.5% in the best case (MNIST-M → Syn Numbers). It should be noted that in some specific combinations, this approach slightly outperforms the CNN trained with the correct target labels (for example, MNIST-M → MNIST or Syn Numbers → SVHN). It might happen that the incorrectly assigned labels of the iDANN process act as a regularizer that alleviates some overfitting.
Table 5: Accuracy (%) over the target dataset for different strategies: 'CNN Src.' indicates a neural network trained only with source samples; 'DANN' denotes the original DANN strategy; 'iDANN' yields two results from the incremental strategy: D_T Acc. (1) refers to the accuracy of the labels assigned to the target samples during the iterative process, while D_T Acc. (2) refers to the classification after training a new CNN from scratch using the labels assigned to the target samples; and 'CNN Tgt.' denotes a CNN trained using the ground truth of the target samples.
Source        Target        CNN Src.   DANN       iDANN                       CNN Tgt.
                            D_T Acc.   D_T Acc.   D_T Acc.(1)   D_T Acc.(2)   D_T Acc.
MNIST         MNIST-M       55.71      78.70      96.09         96.67         97.34
MNIST         SVHN          16.26      31.32      35.83         36.49         90.93
MNIST         Syn Numbers   32.14      44.66      80.79         84.82         99.34
MNIST-M       MNIST         97.95      98.65      99.04         99.59         98.94
MNIST-M       SVHN          32.91      41.41      61.87         61.89         90.93
MNIST-M       Syn Numbers   46.34      54.02      89.49         94.99         99.34
SVHN          MNIST         59.04      67.08      82.72         84.50         98.94
SVHN          MNIST-M       43.49      47.42      66.40         67.62         97.34
SVHN          Syn Numbers   88.42      89.56      96.43         98.10         99.34
Syn Numbers   MNIST         60.04      89.35      98.13         99.35         98.94
Syn Numbers   MNIST-M       41.84      54.38      87.10         90.26         97.34
Syn Numbers   SVHN          85.16      87.24      91.42         91.95         90.93
GTSRB         Syn Signs     76.39      86.22      98.28         98.57         99.74
Syn Signs     GTSRB         69.79      85.28      96.31         98.00         97.89
Average                     57.53      68.23      84.28         85.91         96.95

5.5. Comparison with the state of the art

To conclude the results section, we present below a comparison with other domain adaptation strategies from the state of the art. In these works, not all possible combinations of source–target pairs are considered, but only a few of them. We show in Table 6 the results reported in the literature, along with the results obtained by our proposal (iDANN). A brief description of the competing methods was provided in Section 2. Readers are referred to the corresponding references for details.

These results reveal that our method yields the best performance in 5 out of 7 source–target pairs. The performance of iDANN is especially remarkable in the case of MNIST → Syn Num, where the improvement reaches around 30% compared to the literature. For the cases in which our proposal does not attain the best result, we observe a dissimilar performance: it is still very competitive for the MNIST → SVHN pair, whereas it is outperformed for the SVHN → MNIST pair. When all the results are good, the improvement is relative, but when there is enough margin, the improvement is quite remarkable (as in the case of MNIST → Syn Num).
Table 6: Comparison of accuracy (%) between state-of-the-art DA approaches and iDANN. The first two rows denote the source and target dataset, respectively. A dash indicates that the result is not reported in the literature; cells marked with "—" correspond to values that could not be recovered from the source document.

Source:          MNIST     MNIST   MNIST     SVHN    SVHN      Syn Num   Syn Num
Target:          MNIST-M   SVHN    Syn Num   MNIST   Syn Num   MNIST     SVHN
SA [12]          56.9      –       –         59.32   –         –         86.44
DRCN [14]        –         –       –         81.97   –         –         –
DSN [6]          83.2      –       –         82.7    –         –         91.2
DANN [13]        76.66     12.4    22.9      73.85   96.9      87.6      91.09
D-CORAL [31]     –         35      55.8      76.3    95.5      89.9      78.8
ADDA [33]        –         –       –         76      –         –         –
ADA [16]         –         12.9    34.8      96.3    95.5      97.1      88.1
VADA [27]        –         18.6    45.9      92.9    96.8      96.2      85.3
DeepJDOT [10]    92.4      –       –         96.7    –         –         –
DTA [21]         –         –       –         —       –         –         –
iDANN            —         —       —         —       —         —         —
Furthermore, it should be noted that many of the compared methods propose a specific CNN architecture for each combination of datasets and/or focus on optimizing the result for a particular combination, such as DTA or DeepJDOT. In our case, we utilized the topologies proposed in the original DANN paper.

Unlike the results of the previous section, the DANN values of Table 6 are those reported in the original paper [13].
6. Conclusions and Future Work
This paper proposes an incremental strategy for the problem of domain adaptation with artificial neural networks. Our approach is built upon an existing domain adaptation approach, combined with a heuristic that, in each iteration, decides which prototypes of the target set can be added to the training set by considering the label provided by the neural network. To this end, two selection policies have been proposed: one directly based on the confidence given by the network to the prediction, and another based on geometric properties of the learned feature space. We observed that the latter reported a better performance, especially in the last iterations of the algorithm. In addition, we consider a final stage in which the labeled target set is used to train a new neural network with label smoothing.

Our experiments were performed on various corpora and using several configurations of the neural network. From the results, we conclude that the incremental approach outperforms the underlying DANN model, as well as other state-of-the-art methods. It is interesting to note that, in some cases, the iDANN approach improves the result obtained with the CNN trained directly with the ground-truth data of the target set, which could indicate that the incremental process also serves as a regularizer that leads to greater robustness. Furthermore, unlike the classic DANN, our approach improves results when domains are similar and helps keep the accuracy for the source domain. We also observed greater training stability and less dependence on the hyper-parameters set.

As future work, a primary objective would be to establish a well-principled stopping criterion that allows us to detect when the prediction over the target samples is not reliable. In addition, we want to extend the experiments to other types of input (such as sequences), as well as to study the behavior of the incremental strategy when the underlying DA method is different, given that there currently exist several architectures for this challenge. Note that our incremental approach is independent of the underlying DA model considered, and so it could be adopted as a generic strategy that might improve to the same extent as the underlying DA algorithm improves. Other avenues to further explore this proposal are to evaluate more neural network architectures, as well as adding data augmentation to the learning process.
References

[1] Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 898–916.
[2] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 151–175.
[3] Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2006). Analysis of representations for domain adaptation. In NIPS (pp. 137–144).
[4] Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19 (NIPS) (pp. 137–144). MIT Press.
[5] Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., & Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 95–104).
[6] Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 343–351). Curran Associates, Inc.
[7] Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Soulié & J. Hérault (Eds.), Neurocomputing (pp. 227–236). Berlin, Heidelberg: Springer.
[8] Cheng, L., & Pan, S. J. (2014). Semi-supervised domain adaptation on manifolds. IEEE Transactions on Neural Networks and Learning Systems, 2240–2249.
[9] Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 333–338. Selected Papers from IJCNN 2011.
[10] Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (pp. 467–483). Cham: Springer International Publishing.
[11] Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification, 2nd Edition. Wiley.
[12] Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In IEEE International Conference on Computer Vision (ICCV) (pp. 2960–2967).
[13] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 1–35.
[14] Ghifary, M., Kleijn, W. B., Zhang, M., Balduzzi, D., & Li, W. (2016). Deep reconstruction-classification networks for unsupervised domain adaptation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision – ECCV 2016 (pp. 597–613). Cham: Springer International Publishing.
[15] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
[16] Haeusser, P., Frerix, T., Mordvintsev, A., & Cremers, D. (2017). Associative domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
[17] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
[18] Kouw, W. M., & Loog, M. (2019). A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
[19] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
[20] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
[21] Lee, S., Kim, D., Kim, N., & Jeong, S.-G. (2019). Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
[22] van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 2579–2605.
[23] Moiseev, B., Konev, A., Chigorin, A., & Konushin, A. (2013). Evaluation of traffic sign recognition methods trained on synthetically generated data. In J. Blanc-Talon, A. Kasinski, W. Philips, D. Popescu, & P. Scheunders (Eds.), Advanced Concepts for Intelligent Vision Systems (pp. 576–583). Cham: Springer International Publishing.
[24] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
[25] Saito, K., Kim, D., Sclaroff, S., Darrell, T., & Saenko, K. (2019). Semi-supervised domain adaptation via minimax entropy. In The IEEE International Conference on Computer Vision (ICCV).
[26] Shao, L., Zhu, F., & Li, X. (2014). Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 1019–1034.
[27] Shu, R., Bui, H., Narui, H., & Ermon, S. (2018). A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations.
[28] Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
[29] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 1929–1958.
[30] Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 323–332. doi:10.1016/j.neunet.2012.02.016.
[31] Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation alignment for deep domain adaptation. In G. Hua & H. Jégou (Eds.), Computer Vision – ECCV 2016 Workshops (pp. 443–450). Cham: Springer International Publishing.
[32] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2818–2826).
[33] Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2962–2971).
[34] Villani, C. (2009). Optimal Transport: Old and New. Springer.
[35] Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 135–153.
[36] Yao, T., Pan, Y., Ngo, C.-W., Li, H., & Mei, T. (2015). Semi-supervised domain adaptation with subspace learning for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2142–2150).
[37] Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (NIPS).