Augmented Cyclic Adversarial Learning for Low Resource Domain Adaptation
Ehsan Hosseini-Asl, Yingbo Zhou, Caiming Xiong, Richard Socher
Salesforce Research
{ehosseiniasl,yingbo.zhou,cxiong,rsocher}@salesforce.com

ABSTRACT
Training a model to perform a task typically requires a large amount of data from the domains in which the task will be applied. However, it is often the case that data are abundant in some domains but scarce in others. Domain adaptation deals with the challenge of adapting a model trained from a data-rich source domain to perform well in a data-poor target domain. In general, this requires learning plausible mappings between domains. CycleGAN is a powerful framework that efficiently learns to map inputs from one domain to another using adversarial training and a cycle-consistency constraint. However, the conventional approach of enforcing cycle-consistency via reconstruction may be overly restrictive in cases where one or more domains have limited training data. In this paper, we propose an augmented cyclic adversarial learning model that enforces the cycle-consistency constraint via an external task specific model, which encourages the preservation of task-relevant content as opposed to exact reconstruction. We explore digit classification in low-resource supervised, semi-supervised, and unsupervised settings, as well as the high-resource unsupervised setting. In the low-resource supervised setting, the results show that our approach yields considerable absolute performance improvements when adapting SVHN to MNIST and vice versa, outperforming unsupervised domain adaptation methods that require a high-resource unlabeled target domain. Moreover, using only a few unlabeled target samples, our approach can still outperform many high-resource unsupervised models. Our model also achieves the best performance on USPS to MNIST and Synthetic Digits to SVHN for high-resource unsupervised adaptation. In the speech domain, we similarly adopt a speech recognition model from each domain as the task specific model. Our approach improves the absolute performance of speech recognition for female speakers in the TIMIT dataset, where the majority of training samples are from male voices.

1 INTRODUCTION
Domain adaptation (Huang et al., 2007; Xue et al., 2008; Ben-David et al., 2010) aims to generalize a model from a source domain to a target domain. Typically, the source domain has a large amount of training data, whereas the data are scarce in the target domain. This challenge is typically addressed by learning a mapping between domains, which allows data from the source domain to enrich the available data for training in the target domain. One way of learning such mappings is through Generative Adversarial Networks (GANs; Goodfellow et al., 2014) with a cycle-consistency constraint (CycleGAN; Zhu et al., 2017), which enforces that mapping an example from the source to the target and then back to the source domain results in the same example (and vice versa for a target example). Due to this constraint, CycleGAN learns to preserve the 'content' from the source domain while only transferring the 'style' to match the distribution of the target domain. This is a powerful constraint, and various works (Yi et al., 2017; Liu et al., 2017; Hoffman et al., 2018) have demonstrated its effectiveness in learning cross domain mappings.

Here the content refers to the invariant properties of the data with respect to a task. For example, in image classification the semantic information of an image would be its class. Thus, different tasks on the same data would result in different semantic information. In this paper we use content and semantic information interchangeably.

Enforcing cycle-consistency is appealing as a technique for preserving semantic information of the data with respect to a task, but implementing it through reconstruction may be too restrictive when data are imbalanced across domains. This is because the reconstruction error encourages an exact match of samples from the reverse mapping, which may in turn encourage the forward mapping to keep the sample close to the original domain. Normally, the adversarial objectives would counter this effect; however, when data from the target domain are scarce, it is very difficult to learn a powerful discriminator that can capture meaningful properties of the target distribution. Therefore, the resulting mappings learned are likely to be sub-optimal. Importantly, for the learned mapping to be meaningful, it is not necessary to have exact reconstruction. As long as the 'semantic' information is preserved and the 'style' matches the corresponding distribution, it is a valid mapping.

To address this issue, we propose an augmented cyclic adversarial learning model (ACAL) for domain adaptation. In particular, we replace the reconstruction objective with a task specific model. The model learns to preserve the 'semantic' information from the data samples in a particular domain by minimizing the loss of the mapped samples under the task specific model. In addition, the task specific model serves as an additional source of information for the corresponding domain and hence supplements the discriminator in that domain to facilitate better modeling of the distribution. The task specific model can also be viewed as an implicit way of disentangling the information essential to the task from the 'style' information that relates to the data distribution of a different domain. We show that our approach improves performance over the baseline on digit domain adaptation.
We also improve the phoneme error rate on the TIMIT dataset when adapting a speech recognition model trained on one gender to the other.

1.1 RELATED WORK
Our work is broadly related to domain adaptation using neural networks, for both supervised and unsupervised domain adaptation.
Supervised Domain Adaptation
When labels are available in the target domain, a common approach is to utilize the label information in the target domain to minimize the discrepancy between source and target domains (Hu et al., 2015; Tzeng et al., 2015; Gebru et al., 2017; Hoffman et al., 2016; Gupta et al., 2016; Ge and Yu, 2017). For example, Hu et al. (2015) apply the marginal Fisher analysis criteria and Maximum Mean Discrepancy (MMD) to minimize the distribution difference between source and target domains. Tzeng et al. (2015) propose to add a domain classifier that predicts the domain label of the inputs, together with a domain confusion loss. Gebru et al. (2017) leverage attributes by using attribute- and class-level classification losses with an attribute consistency loss to fine-tune the target model. Our method also employs models from both domains; however, our models are used to assist adversarial learning for better learning of the target domain distribution. In addition, our final model for supervised domain adaptation is obtained by training on data from the target domain as well as the transferred data from the source domain, rather than by fine-tuning a source/target domain model.
Unsupervised Domain Adaptation
More recently, various works have taken advantage of the substantial generation capabilities of the GAN framework and applied them to domain adaptation (Liu and Tuzel, 2016; Bousmalis et al., 2017; Yi et al., 2017; Tzeng et al., 2017; Kim et al., 2017; Hoffman et al., 2018). However, most of these works focus on high-resource unsupervised domain adaptation, which may be unsuitable for situations where the target domain data are limited. Bousmalis et al. (2017) use a GAN to adapt data from the source to the target domain while simultaneously training a classifier on both the source and adapted data. Our method also employs task specific models; however, we use the models to augment the CycleGAN formulation. We show that having cycles in both directions (i.e. from source to target and vice versa) is important in the case where the target domain has limited data (see sec. 4). Tzeng et al. (2017) propose adversarial discriminative domain adaptation (ADDA), where adversarial learning is employed to match the representations learned from the source and target domains. Our method also utilizes a pre-trained model from the source domain, but we only implicitly match the representation distributions rather than explicitly enforcing representational similarity. Cycle-consistent adversarial domain adaptation (CyCADA; Hoffman et al., 2018) is perhaps the most similar work to our own. This approach uses both $\ell_1$ reconstruction and semantic consistency to enforce cycle-consistency. An important difference in our work is that we also include another cycle that starts from the target domain. This is important because, if the target domain is of low resource, the adaptation from source to target may fail due to the difficulty of learning a good discriminator in the target domain.

Figure 1: Illustration of the proposed approach. Left: CycleGAN (Zhu et al., 2017). Middle: relaxed cycle-consistent model (RCAL), where the cycle-consistency is enforced through task specific models in the corresponding domain. Right: augmented cycle-consistent model (ACAL). In addition to the relaxed model, the task specific model is also used to augment the discriminator of the corresponding domain to facilitate learning. In the diagrams, x and L denote data and losses, respectively. We point out that the ultimate goal of our approach is to use the mapped Source → Target samples ($x_{S \mapsto T}$) to augment the limited data of the target domain ($x_T$).

Almahairi et al. (2018) also suggest improving CycleGAN by explicitly enforcing content consistency and style adaptation, extending cyclic adversarial learning to the hidden representations of the domains. Our model differs from recent cyclic adversarial learning through its implicit learning of content and style representations via an auxiliary task, which is more suitable for low-resource domains. Using classification to assist GAN training has also been explored previously (Springenberg, 2015; Sricharan et al., 2017; Kumar et al., 2017). Springenberg (2015) proposed CatGAN, where the discriminator is converted to a multi-class classifier. We extend this idea to any task specific model, including a speech recognition task, and use this model to preserve task specific information regarding the data. We also propose that the definition of the task model can be extended to unsupervised tasks, such as language or speech modeling in each domain, yielding augmented unsupervised domain adaptation.

2 PRELIMINARIES
2.1 GENERATIVE ADVERSARIAL NETWORK
To learn the true data distribution $P_{data}(X)$ in a nonparametric way, Goodfellow et al. (2014) proposed the generative adversarial network (GAN). In this framework, a discriminator network $D(x)$ learns to discriminate between the data produced by a generator network $G(z)$ and the data sampled from the true data distribution $P_{data}(X)$, whereas the generator models the true data distribution by learning to confuse the discriminator. Under certain assumptions (Goodfellow et al., 2014), the generator learns the true data distribution when the game reaches equilibrium. Training of a GAN is in general done by alternately optimizing the following objective for $D$ and $G$:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim P_{data}(X)}[\log D(x)] + \mathbb{E}_{z \sim P_z(Z)}[\log(1 - D(G(z)))] \quad (1)$$
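To make the alternating optimization concrete, here is a minimal PyTorch-style sketch of one update of eq. 1. The networks G and D, the optimizers, and the use of the common non-saturating generator loss are illustrative assumptions, not the exact setup of this paper.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, x_real, z):
    """One alternating update of the minimax objective in eq. 1.
    Assumes D outputs a sigmoid probability in [0, 1]."""
    # Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))].
    opt_d.zero_grad()
    d_real = D(x_real)
    d_fake = D(G(z).detach())  # detach: no generator gradient here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    opt_d.step()

    # Generator step: the non-saturating variant maximizes log D(G(z)).
    opt_g.zero_grad()
    d_gen = D(G(z))
    g_loss = F.binary_cross_entropy(d_gen, torch.ones_like(d_gen))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```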
2.2 CYCLEGAN

CycleGAN (Zhu et al., 2017) extends this framework to multiple domains, $P_S(X)$ and $P_T(X)$, while learning to map samples back and forth between them. Adversarial learning is applied such that the resulting mapping from $G_{S \mapsto T}$ matches the target distribution $P_T(X)$, and similarly for the reverse mapping $G_{T \mapsto S}$. This is accomplished by the following adversarial objectives:

$$\mathcal{L}_{adv}(G_{S \mapsto T}, D_T) = \mathbb{E}_{x \sim P_T(X)}[\log D_T(x)] + \mathbb{E}_{x \sim P_S(X)}[\log(1 - D_T(G_{S \mapsto T}(x)))] \quad (2)$$

$$\mathcal{L}_{adv}(G_{T \mapsto S}, D_S) = \mathbb{E}_{x \sim P_S(X)}[\log D_S(x)] + \mathbb{E}_{x \sim P_T(X)}[\log(1 - D_S(G_{T \mapsto S}(x)))] \quad (3)$$

CycleGAN also introduces cycle-consistency, which enforces that each mapping is able to invert the other. In the original work, this is achieved by including the following reconstruction objective:

$$\mathcal{L}_{cyc}(G_{S \mapsto T}, G_{T \mapsto S}) = \mathbb{E}_{x \sim P_S(X)}[\| G_{T \mapsto S}(G_{S \mapsto T}(x)) - x \|_1] + \mathbb{E}_{x \sim P_T(X)}[\| G_{S \mapsto T}(G_{T \mapsto S}(x)) - x \|_1] \quad (4)$$

Learning the CycleGAN model involves optimizing a weighted combination of the objectives in eqs. 2, 3 and 4.
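The following sketch writes eqs. 2-4 as PyTorch loss terms, assuming sigmoid-output discriminators; the min/max alternation (and detaching generator outputs during the discriminator step) is left to the surrounding training loop, and the weight lam is a placeholder.

```python
import torch
import torch.nn.functional as F

def adv_loss(D, x_real, x_fake):
    """E[log D(real)] + E[log(1 - D(fake))] as BCE terms (eqs. 2 and 3)."""
    d_real, d_fake = D(x_real), D(x_fake)
    return F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
         + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

def cycle_loss(G_st, G_ts, x_s, x_t):
    """Eq. 4: exact L1 reconstruction after a round trip, in both directions."""
    return F.l1_loss(G_ts(G_st(x_s)), x_s) + F.l1_loss(G_st(G_ts(x_t)), x_t)

def cyclegan_objective(G_st, G_ts, D_s, D_t, x_s, x_t, lam=10.0):
    """Weighted combination of eqs. 2, 3 and 4."""
    return adv_loss(D_t, x_t, G_st(x_s)) \
         + adv_loss(D_s, x_s, G_ts(x_t)) \
         + lam * cycle_loss(G_st, G_ts, x_s, x_t)
```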
3 AUGMENTED CYCLIC ADVERSARIAL LEARNING (ACAL)

Enforcing cycle-consistency using a reconstruction objective (e.g. eq. 4) may be too restrictive and potentially result in sub-optimal mapping functions. This is because the learning dynamics of CycleGAN balance two contrastive forces. The adversarial objective encourages the mapping functions to generate samples that are close to the true distribution. At the same time, the reconstruction objective encourages identity mapping. Balancing these objectives may work well in the case where both domains have a relatively large number of training samples. However, problems may arise in the case of domain adaptation, where data within the target domain are relatively sparse.

Let $P_S(X)$ and $P_T(X)$ denote the source and target domain distributions, respectively, where samples from $P_T(X)$ are limited. In this case, it will be difficult for the discriminator $D_T$ to model the actual distribution $P_T(X)$. A discriminator model with sufficient capacity will quickly overfit, and the resulting $D_T$ will act like a delta function on the sample points from $P_T(X)$. Attempts to prevent this by limiting the capacity or using regularization may easily induce over-smoothing and under-fitting, such that the probability outputs of $D_T$ are only weakly sensitive to the mapped samples. In both cases, the influence of the reconstruction objective will begin to outweigh that of the adversarial objective, thereby encouraging an identity mapping. More generally, even if we are able to obtain a reasonable discriminator $D_T$, the support of the distribution learned through it would likely be small due to limited data. Therefore, the learning signal that $G_{S \mapsto T}$ receives from $D_T$ would be limited. In summary, limited data within $P_T(X)$ makes it less likely that the discriminator will encourage meaningful cross domain mappings.

The root of the above issue in domain adaptation is twofold. First, exact reconstruction is too strong an objective for enforcing cycle-consistency. Second, learning a mapping function to a particular domain that depends solely on the discriminator for that domain is not sufficient. To address these two problems, we propose to 1) use a task specific model to enforce the cycle-consistency constraint, and 2) use the same task specific model, in addition to the discriminator, to train more meaningful cross domain mappings. In more detail, let $M_S$ and $M_T$ be the task specific models trained on domains $P_S(X, Y)$ and $P_T(X, Y)$, and let $\mathcal{L}_{task}$ denote the task specific loss. Our cycle-consistent objective is then:

$$\mathcal{L}_{RCAL}(G_{S \mapsto T}, G_{T \mapsto S}, M_S, M_T) = \mathbb{E}_{(x,y) \sim P_S(X,Y)}[\mathcal{L}_{task}(M_S(G_{T \mapsto S}(G_{S \mapsto T}(x))), y)] + \mathbb{E}_{(x,y) \sim P_T(X,Y)}[\mathcal{L}_{task}(M_T(G_{S \mapsto T}(G_{T \mapsto S}(x))), y)] \quad (5)$$

Here, $\mathcal{L}_{task}$ enforces cycle-consistency by requiring that the reverse mappings preserve the semantic information of the original sample. Importantly, this constraint is less strict than when using reconstruction, because now, as long as the content matches that of the original sample, the incurred loss will not increase. (Some style consistency is implicitly enforced, since each model $M$ is trained on data within a particular domain.)
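For a classification task, where $\mathcal{L}_{task}$ is a cross-entropy loss, eq. 5 can be sketched as below; the generators, task models, and label tensors are placeholders.

```python
import torch.nn.functional as F

def relaxed_cycle_loss(G_st, G_ts, M_s, M_t, x_s, y_s, x_t, y_t):
    """Eq. 5: cycle-consistency enforced through the task models. The
    round-tripped sample only needs to keep its label-relevant content;
    no pixel-wise match with the original is required."""
    loss_s = F.cross_entropy(M_s(G_ts(G_st(x_s))), y_s)  # S -> T -> S cycle
    loss_t = F.cross_entropy(M_t(G_st(G_ts(x_t))), y_t)  # T -> S -> T cycle
    return loss_s + loss_t
```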
This is a much looser constraint than having consistency in the original data space, and thus we refer to it as the relaxed cycle-consistency objective.

To address the second issue, we augment the adversarial objective with a corresponding task objective:

$$\mathcal{L}_{ACAL-supervised}(G_{T \mapsto S}, D_S, M_S) = \mathbb{E}_{x \sim P_S(X)}[\log D_S(x)] + \mathbb{E}_{x \sim P_T(X)}[\log(1 - D_S(G_{T \mapsto S}(x)))] + \mathbb{E}_{(x,y) \sim P_S(X,Y)}[\mathcal{L}_{task}(M_S(x), y)] + \mathbb{E}_{(x,y) \sim P_T(X,Y)}[\mathcal{L}_{task}(M_S(G_{T \mapsto S}(x)), y)] \quad (6)$$

$$\mathcal{L}_{ACAL-supervised}(G_{S \mapsto T}, D_T, M_T) = \mathbb{E}_{x \sim P_T(X)}[\log D_T(x)] + \mathbb{E}_{x \sim P_S(X)}[\log(1 - D_T(G_{S \mapsto T}(x)))] + \mathbb{E}_{(x,y) \sim P_T(X,Y)}[\mathcal{L}_{task}(M_T(x), y)] + \mathbb{E}_{(x,y) \sim P_S(X,Y)}[\mathcal{L}_{task}(M_T(G_{S \mapsto T}(x)), y)] \quad (7)$$

Similar to adversarial training, we optimize the above objectives by maximizing over $D_S$ ($D_T$) and minimizing over $G_{T \mapsto S}$ ($G_{S \mapsto T}$) and $M_S$ ($M_T$). With the new terms, learning of the mapping functions $G$ gets assistance from both the discriminator and the task specific model. The task specific model learns to capture the conditional probability distribution $P_S(Y|X)$ ($P_T(Y|X)$), which also preserves information regarding $P_S(X)$ ($P_T(X)$). This conditional information is different from the information captured by the discriminator $D_S$ ($D_T$). The difference is that the model is only required to preserve useful information regarding $X$ with respect to predicting $Y$ for modeling the conditional distribution, which makes learning the conditional model a much easier problem. In addition, the conditional model mediates the influence of data that the discriminator does not have access to ($Y$), which should further assist learning of the mapping functions $G_{T \mapsto S}$ ($G_{S \mapsto T}$).

In the case of unsupervised domain adaptation, when there is no information about the target conditional probability distribution $P_T(Y|X)$, we propose to use the source model $M_S$ to estimate $P_T(Y|X)$ through adversarial learning, i.e. $P_T(Y|X) \approx \mathbb{E}_{x \sim P_T(X)}[M_S(G_{T \mapsto S}(x))]$. Therefore, the proposed model can be extended to unsupervised domain adaptation, with the correspondingly modified objectives:

$$\mathcal{L}_{ACAL-unsupervised}(G_{T \mapsto S}, D_S, M_S) = \mathbb{E}_{x \sim P_S(X)}[\log D_S(x)] + \mathbb{E}_{x \sim P_T(X)}[\log(1 - D_S(G_{T \mapsto S}(x)))] + \mathbb{E}_{(x,y) \sim P_S(X,Y)}[\mathcal{L}_{task}(M_S(x), y)] \quad (8)$$

$$\mathcal{L}_{ACAL-unsupervised}(G_{S \mapsto T}, D_T, M_T) = \mathbb{E}_{x \sim P_T(X)}[\log D_T(x)] + \mathbb{E}_{x \sim P_S(X)}[\log(1 - D_T(G_{S \mapsto T}(x)))] + \mathbb{E}_{x \sim P_T(X)}[\mathcal{L}_{task}(M_T(x), M_S(G_{T \mapsto S}(x)))] + \mathbb{E}_{(x,y) \sim P_S(X,Y)}[\mathcal{L}_{task}(M_T(G_{S \mapsto T}(x)), y)] \quad (9)$$

To further extend this approach to semi-supervised domain adaptation, the supervised and unsupervised objectives are used interchangeably for labeled and unlabeled target samples, as detailed in Algorithm 1.

Algorithm 1 Augmented Cyclic Adversarial Learning (ACAL)

Input: source domain data P_S(x, y), target domain data P_T(x, y), pretrained source task model M_S
Output: target task model M_T
while not converged do
  Sample (x_s, y_s) from P_S
  if y_t is available in P_T then  % supervised %
    Sample (x_t, y_t) from P_T
    Fine-tune source model M_S on (x_s, y_s) and (G_{T↦S}(x_t), y_t) samples (eq. 6)
    Train task model M_T on (x_t, y_t) and (G_{S↦T}(x_s), y_s) samples (eq. 7)
  else  % unsupervised %
    Sample x_t from P_T
    Fine-tune source model M_S on (x_s, y_s) samples (eq. 8)
    Train task model M_T on (G_{S↦T}(x_s), y_s) and (x_t, M_S(G_{T↦S}(x_t))) samples (eq. 9)
  end if
end while
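A sketch of the per-iteration losses of Algorithm 1 follows, reusing adv_loss from the CycleGAN sketch above and again assuming a cross-entropy task loss; pseudo-labeling the unlabeled target samples with M_S implements the approximation P_T(Y|X) ≈ M_S(G_{T↦S}(x)) used in eq. 9. Optimizer updates and the min/max alternation are omitted.

```python
import torch.nn.functional as F

def acal_losses(G_st, G_ts, D_s, D_t, M_s, M_t, x_s, y_s, x_t, y_t=None):
    """Loss terms for one iteration of Algorithm 1. With target labels the
    supervised objectives (eqs. 6-7) apply; without them, M_s pseudo-labels
    the target samples via the T -> S mapping (eqs. 8-9)."""
    fake_t, fake_s = G_st(x_s), G_ts(x_t)

    # Source-side augmented objective (eq. 6, or eq. 8 when y_t is None).
    loss_s = adv_loss(D_s, x_s, fake_s) + F.cross_entropy(M_s(x_s), y_s)
    if y_t is not None:
        loss_s = loss_s + F.cross_entropy(M_s(fake_s), y_t)

    # Target-side augmented objective (eq. 7, or eq. 9 when y_t is None).
    loss_t = adv_loss(D_t, x_t, fake_t) + F.cross_entropy(M_t(fake_t), y_s)
    if y_t is not None:
        loss_t = loss_t + F.cross_entropy(M_t(x_t), y_t)
    else:
        # Pseudo label from the source model: M_S(G_{T->S}(x_t)).
        pseudo_y = M_s(fake_s).argmax(dim=1).detach()
        loss_t = loss_t + F.cross_entropy(M_t(x_t), pseudo_y)

    return loss_s, loss_t
```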
4 EXPERIMENTS

In this section, we evaluate our proposed model on domain adaptation for visual and speech recognition. We continue the convention of referring to the data domains as 'source' and 'target', where target denotes the domain with either limited or unlabeled training data. Visual domain adaptation is evaluated using the MNIST dataset (M) (Lecun et al., 1998), the Street View House Numbers (SVHN) dataset (S) (Netzer et al., 2011), USPS (U) (Hull, 1994), MNIST-M (MM), and Synthetic Digits (SD) (Ganin and Lempitsky, 2014). Adaptation on speech is evaluated on the domain of gender within the TIMIT dataset (Garofolo et al., 1993), which contains broadband 16 kHz recordings of phonetically-balanced speech. The male/female ratio of speakers across the train/validation/test sets is approximately 70% to 30%. Therefore, we treat male speech as the source domain and female speech as the low-resource target domain.

4.1 MODEL ABLATIONS
To get an idea of the contribution from each component of our model, in this section we perform a series of ablations and present the results in Table 1. We perform these ablations by treating SVHN as the source domain and MNIST as the target domain. We down-sample the MNIST training data so that only 10 samples per class are available during training, denoted MNIST-(10), which is only about 0.17% of the full training data. The testing performance is calculated on the full MNIST test set. We use a modified LeNet for all experiments in this ablation. The modified LeNet consists of two convolutional layers, followed by a dropout layer and two fully connected layers.
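Since the exact layer widths are not stated above, the sketch below uses placeholder sizes (c1, c2 and h are assumptions) to illustrate the overall shape of such a task model for 28×28 digit inputs.

```python
import torch.nn as nn

class ModifiedLeNet(nn.Module):
    """Sketch of the modified LeNet task model: two conv layers, dropout,
    and two fully connected layers. The widths c1, c2 and h are placeholder
    assumptions, not the sizes used in the paper."""
    def __init__(self, in_channels=1, num_classes=10, c1=20, c2=50, h=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, c1, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(c1, c2, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(c2 * 4 * 4, h), nn.ReLU(),  # 4x4 maps for 28x28 inputs
            nn.Linear(h, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```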
Table 1: Ablation study results from SVHN (Source) to MNIST (Target). See text for more details. Note: the MNIST domain is limited to only 10 samples per class (about 0.17% of the full training dataset), denoted MNIST-(10). Experiments were performed several times with different random samplings of MNIST.

Domain Adaptation Model                  Test Accuracy (%)
No Adaptation (trained on SVHN)          71.11
Target Model (trained on MNIST-(10))     79.22
S → T                                    69.91
(S → T → S)-One Cycle                    46.32
(T → S → T)-One Cycle                    58.34
(S → T → S)-RCAL (Ours)                  –
(T → S → T)-RCAL (Ours)                  43.56
(S → T → S)-ACAL (Ours)                  –
(T → S → T)-ACAL (Ours)                  49.81
RCAL (Ours)                              –
ACAL (Ours)                              –

There are various ways that one may utilize cycle-consistency or adversarial training for domain adaptation from components of our model. One way is to use adversarial training in the target domain to ensure that the distribution of the adapted data matches, and to use the task specific model to ensure that the 'content' of the data from the source domain is preserved. This is the model described in Bousmalis et al. (2017), except that their model is originally unsupervised. This model is denoted as S → T in Table 1. It is also interesting to examine the importance of the double cycle, which is proposed in Zhu et al. (2017) and adopted in our work. Theoretically, one cycle would be sufficient to learn the mapping between domains; therefore, we also investigate the performance of one-cycle-only models, where one direction goes from source to target and then back, and similarly for the other direction. These models are denoted as (S → T → S)-One Cycle and (T → S → T)-One Cycle in Table 1, respectively. To test the effectiveness of the relaxed cycle-consistency (eq. 5) and the augmented adversarial loss (eqs. 6 and 7), we also test one-cycle models while progressively adding these two losses. Interestingly, the one-cycle relaxed and one-cycle augmented models are similar to the model proposed in Hoffman et al. (2018) when their model maps from the source to the target domain and then back. The difference is that their model is unsupervised and includes more losses at different levels.

As can be seen from Table 1, the simple conditional model performs surprisingly well compared to the more complicated cyclic counterparts. This may be attributed to its reduced complexity, since it only needs to learn one set of mappings. As expected, the single-cycle performance is poor when the target domain has limited data, due to inefficient learning of the discriminator in the target domain (see section 3). When we change the cycle to the other direction, where there are abundant data in the target domain, the performance improves, but is still worse than the simple model without a cycle. This is because the adaptation mapping (i.e. $G_{S \mapsto T}$) is only learned via the generated samples from $G_{T \mapsto S}$, which likely deviate from the real examples in practice. This observation also suggests that it is beneficial to have cycles in both directions when applying the cycle-consistency constraint, since then both mappings can be learned via real examples. The trends reverse when we use the relaxed implementation of cycle-consistency, replacing the reconstruction error with the task specific losses. This is because the power of the task specific model is now crucial for preserving the content of the data after the reverse mapping. When the source domain dataset is sufficiently large, cycle-consistency is preserved. As such, the resulting learned mapping functions preserve meaningful semantics of the data while transferring the styles to the target domain, and vice versa.
In addition, it is clear that augmenting the discriminator with the task specific loss is helpful for learning adaptations. Furthermore, the information added from the task specific model is clearly beneficial for improving the adaptation performance; without it, none of the models outperform the baseline model, where no adaptation is performed. Last but not least, it is also clear from the results that using the task specific model improves the overall adaptation performance.

Figure 2: Comparison of adaptation robustness between CyCADA (Hoffman et al., 2018), CyCADA with no $\ell_1$ reconstruction loss (Relaxed), and ACAL algorithms for a variable number of unsupervised target samples, across the adaptation directions (a) M → U, (b) U → M, (c) M → S, (d) S → M, (e) S → U and (f) U → S. Note: no labeled sample is used.

To further evaluate the effectiveness of using the task-specific loss with two cycles in the low-resource unsupervised domain adaptation scenario, we compare our model with CyCADA (Hoffman et al., 2018) and with a variant of CyCADA in which no reconstruction loss is used, referred to as "CyCADA (Relaxed)". The latter resembles the (S → T → S)-ACAL in Table 1, but with a different semantic loss. As shown in Figure 2, the CyCADA model and its relaxed variant fail to learn a good adaptation when the target domain contains few unlabeled samples per class. Additionally, the CyCADA models show high instability in the low-resource situation. As described in section 1.1, instability is an expected behaviour of CyCADA when target data are limited, because the source-to-target cycle fails to preserve consistency due to a weak target domain discriminator. In contrast, the ACAL model shows stable and consistent performance, due to its use of the source classifier to enforce consistency, rather than relying on the target and source discriminators.

4.2 VISUAL DOMAIN ADAPTATION
In this section, we experiment on domain adaptation for the task of digit recognition. In each experiment, we select one domain (MNIST, USPS, MNIST-M, SVHN, Synthetic Digits) to be the target. We conduct three types of domain adaptation, i.e. low-resource supervised, high-resource unsupervised, and low-resource semi-supervised adaptation. The evaluation results are obtained without using any data augmentation.

Low-resource supervised adaptation:
In this setting, we sub-sample the target domain to contain only a few labeled samples per class, and use the full dataset of the other domain as the source. No unlabeled samples are used. A comparison with a recent low-resource domain adaptation method, FADA (Motiian et al., 2017), for MNIST, USPS, and SVHN adaptation is shown in Figure 3. To provide more baselines, we also compare with a model trained only on the limited target data, and with one trained on the combination of the labeled source and limited target domains. As shown in Figure 3, ACAL outperforms FADA and the two other baselines in all adaptations.
High-resource unsupervised adaptation:
Here, we use the whole target domain with no labels. Evaluation results on all adaptation directions are presented in Table 2 and Table 7 (Appendix A). It is evident that the ACAL model's performance is on par with state-of-the-art unsupervised approaches, and that it outperforms them on MNIST → USPS and Syn-Digits → SVHN. It is worth mentioning that Shu et al. (2018) improved their VADA adversarial model using natural gradients in a teacher-student training setup, which is not directly comparable to adversarial approaches. Moreover, the source-only baseline of Shu et al. (2018) is stronger than those reported by the other unsupervised approaches, as well as our baseline.

Figure 3: Low-resource supervised domain adaptation on the MNIST (M), USPS (U) and SVHN (S) datasets, across the adaptation directions (a) U → M, (b) M → U, (c) M → S, (d) S → M, (e) U → S and (f) S → U. The FADA model refers to Motiian et al. (2017). n = 1 means one labeled example per class. No unlabeled target sample is used.

Low-resource semi-supervised adaptation:
We also evaluate the performance of the ACAL algorithm when there are limited labeled and unlabeled target samples, in Table 6 (Appendix A). In the case of MNIST → USPS, our model outperforms many high-resource unsupervised domain adaptation approaches in Table 2 while using only a small number of unlabeled samples.

4.3 SPEECH DOMAIN ADAPTATION
We also apply our proposed model to domain adaptation in speech recognition. We use the TIMIT dataset, where the male to female speaker ratio is about 7:3, and we therefore choose the data subset from male speakers as the source and the subset from female speakers as the target domain. We evaluate performance on the standard TIMIT test set and use the phoneme error rate (PER) as the evaluation metric. A spectrogram representation of the audio is used for model evaluation. As demonstrated by Hosseini-Asl et al. (2018), multi-discriminator training significantly impacts adaptation performance. Therefore, we use the multi-discriminator architecture as the discriminator for the adversarial loss in our evaluation. In this set of experiments, our task-specific model is a speech recognition model pre-trained within each domain.

The results are shown in Table 3. We observe significant performance improvements over the baseline model, as well as comparable or better performance relative to previous methods. It is interesting to note that the performance of the proposed model on the adapted male speech (M → F) almost matches the baseline model performance, where the model is trained on true female speech. In addition, the performance gap in this case is significant compared to other methods, which suggests that the adapted distribution is indeed close to the true target distribution. Furthermore, when combined with more data, our model outperforms the baseline by a noticeable margin.
Table 2: High-resource unsupervised domain adaptation between MNIST (M), USPS (U), MNIST-M (MM), SVHN (S), and Synthetic Digits (SD). Note: Direction indicates the source → target adaptation direction. VADA (Shu et al., 2018) used a stronger source-only baseline on S → M compared to other approaches. Note: no data augmentation is used in our experiments. The table compares the Source-only baseline, prior unsupervised methods including CyCADA (Hoffman et al., 2018), ACAL (Ours), and the Target-only (completely supervised) upper bound, over the domain pairs M−U, M−MM, M−S and S−SD in both directions.

Table 3: Speech domain adaptation results on TIMIT. We treat Male (M) and Female (F) voices as the source and target domains, respectively, based on the intrinsic imbalance of speaker genders in the dataset. For the evaluation metric (PER), lower is better.

                                                               Female (PER)
Training Set      Domain Adaptation Model                      Val      Test
M                 (Baseline model)                             –        –
F                 (Baseline model)                             –        –
M → F             CycleGAN (Zhu et al., 2017)                  32.95    30.07
                  FHVAE (Hsu et al., 2017)                     –        –
F + (M → F)       CycleGAN (Zhu et al., 2017)                  28.32    28.43
                  MD-CycleGAN (Hosseini-Asl et al., 2018)      21.15    19.08
                  ACAL (Ours)                                  –        –
F + M             –                                            20.63    20.52
F + M + (M → F)   CycleGAN (Zhu et al., 2017)                  21.03    22.81
                  MD-CycleGAN (Hosseini-Asl et al., 2018)      20.26    19.60
                  ACAL (Ours)                                  –        –
5 CONCLUSION AND FUTURE WORK
In this paper, we propose augmented cycle-consistency adversarial learning for domain adaptation and introduce a task specific model to facilitate learning domain related mappings. We enforce cycle-consistency using a task specific loss instead of the conventional reconstruction objective. Additionally, we use the task specific model as an additional source of information for the discriminator in the corresponding domain. We demonstrate the effectiveness of our proposed approach by evaluating on two domain adaptation tasks, and in both cases we achieve significant performance improvement compared to the baseline.

By extending the definition of the task-specific model to unsupervised learning, such as a reconstruction loss using an autoencoder, or self-supervision, our proposed method would extend to all settings of domain adaptation. Such an unsupervised task could be speech modeling using WaveNet (van den Oord et al., 2016), or language modeling using recurrent or transformer networks (Radford et al., 2018).

REFERENCES
A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. C. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In ICML, 2018.

S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, May 2010. URL https://doi.org/10.1007/s10994-009-5152-4.

K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

G. French, M. Mackiewicz, and M. Fisher. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations (ICLR), 2018.

Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.

J. S. Garofolo et al. TIMIT acoustic-phonetic continuous speech corpus LDC93S1. Philadelphia: Linguistic Data Consortium, 1993.

W. Ge and Y. Yu. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

T. Gebru, J. Hoffman, and L. Fei-Fei. Fine-grained recognition in the wild: A multi-task domain adaptation approach. In IEEE International Conference on Computer Vision (ICCV), pages 1358–1367, 2017.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2827–2836, 2016.

P. Häusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In IEEE International Conference on Computer Vision (ICCV), pages 2784–2792, 2017.

J. Hoffman, S. Gupta, J. Leong, S. Guadarrama, and T. Darrell. Cross-modal adaptation for RGB-D detection. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 5032–5039, 2016.

J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1994–2003, 2018.

E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher. A multi-discriminator CycleGAN for unsupervised non-parallel speech domain adaptation. In INTERSPEECH, 2018.

W.-N. Hsu, Y. Zhang, and J. R. Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In NIPS, 2017.

J. Hu, J. Lu, and Y.-P. Tan. Deep transfer metric learning. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 325–333, 2015.

L. Hu, M. Kan, S. Shan, and X. Chen. Duplex generative adversarial network for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.

J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2007.

J. J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16:550–554, 1994.

T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.

A. Kumar, P. Sattigeri, and T. Fletcher. Semi-supervised learning with GANs: Manifold invariance with improved inference. In Advances in Neural Information Processing Systems, pages 5540–5550, 2017.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.

M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.

S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems 30, pages 6670–6680. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7244-few-shot-adversarial-domain-adaptation.pdf.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source to target and back: Symmetric bi-directional adaptive GAN. In CVPR, 2018.

R. Shu, H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=H1q-TM-AW.

J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

K. Sricharan, R. Bala, M. Shreve, H. Ding, K. Saketh, and J. Sun. Semi-supervised conditional GANs. arXiv preprint arXiv:1708.05789, 2017.

E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4068–4076, 2015.

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. In SSW, 2016.

G.-R. Xue, W. Dai, Q. Yang, and Y. Yu. Topic-bridged PLSA for cross-domain text classification. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 627–634. ACM, 2008.

Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint, 2017.

Y. Zhou, C. Xiong, and R. Socher. Improving end-to-end speech recognition with policy learning. arXiv preprint arXiv:1712.07101, 2017.

J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
APPENDIX A DIGIT DOMAIN ADAPTATION ANALYSIS
In this section, we evaluate domain adaptation for MNIST ↔ SVHN for comparison with CycleGAN, as well as with the relaxed version of the cycle-consistent objective (Relaxed-Cyc, see eq. 5 in section 3). For the latter, the $\ell_1$ reconstruction loss is replaced with the model loss in order to encourage cycle-consistency. We also experiment with two different task specific models M: specifically, DenseNet (Huang et al., 2017), representing a relatively complex architecture, and a modified LeNet, representing a relatively simple architecture (see section 4.1). Tables 4 and 5 show the results of augmenting the low-resource MNIST and SVHN domains with the complementary high-resource domain. This approach improves the test performance of the target classifier by a large margin, compared to training only on the target domain data. We observe that training a more complicated deep model for the target domain weakens this effect. As shown in Table 4, using DenseNet as a classifier on MNIST (target) achieves lower test classification accuracy than using a variant of LeNet. This difference likely reflects differences in the two architectures' degree of overfitting. Overfitting will produce a false gradient signal during cyclic adversarial learning (when classifying the adapted source examples). Based on this observation, we use the comparatively simpler LeNet architecture with SVHN as the target domain (see Table 5). Using our proposed approach, SVHN test performance improves over domain adaptation using CycleGAN. We also include some qualitative results when performing domain adaptation from SVHN (source) to MNIST (target), as shown in Figure 5. We also compare performance with different numbers of labeled target samples in Figure 4, which indicates the improvement in generalization performance of the target model using augmented cyclic adaptation, with a variable number of labeled target samples on the MNIST and SVHN datasets. The evaluation of semi-supervised adaptation is presented in Table 6.
Figure 4: Performance comparison of the proposed ACAL algorithm on SVHN (S) and MNIST (M) with baselines, using different numbers of labeled training samples (per class) in the target domain, for (a) S → M and (b) M → S adaptation. (Best viewed in color.)

Table 4: Visual domain adaptation results from SVHN to MNIST (low resource). No adaptation denotes the model trained on the source domain (SVHN), and target model refers to the model trained on the target domain (MNIST). Note: the MNIST (low resource) domain contains only 10 labeled samples per class (MNIST-(10)); the experiments were performed several times with different random samplings of MNIST.

                                          MNIST Test (%)
Domain Adaptation Model                   LeNet (Modified)   DenseNet
No Adaptation (trained on SVHN)           –                  –
Target Model (trained on MNIST-(10))      –                  –
CycleGAN                                  –                  –
RCAL (Ours)                               –                  –
ACAL (Ours)                               –                  –

Table 5: Visual domain adaptation results from MNIST to SVHN (low resource). Note: the SVHN (low resource) domain contains only 50 images per class (SVHN-(50)); the experiments were performed several times with different random samplings of SVHN.

                                          SVHN Test (%)
Domain Adaptation Model                   LeNet (Modified)
No Adaptation (trained on MNIST)          –
Target Model (trained on SVHN-(50))       –
CycleGAN                                  –
RCAL (Ours)                               –
ACAL (Ours)                               –

Figure 5: Qualitative comparison of domain adaptation for the experimental models. Each column illustrates the mapping performed by each of the models from the original SVHN image (source domain) to MNIST (target domain, 10 labeled samples per class). It can be seen that the augmented cycle-consistent model is able to preserve most of the semantic information, while still approximately matching the target distribution.

Table 6: Low-resource semi-supervised and unsupervised domain adaptation on the MNIST (M), USPS (U) and SVHN (S) datasets. Note: n = 10 means 10 samples per class, and the percentage denotes the fraction of target samples (per class) that have labels; 0% corresponds to low-resource unsupervised adaptation.

             S → M          M → U          U → M          S → U
n = 10       –      –       –      –       –      –       –      –
n = 50       –      –       –      –       –      –       –      –
n = 100      –      –       –      –       –      –       –      –
n = full     96.51  99.41   96.91  95.71   96.74  98.45   79.23  93.17
Table 7: High-resource unsupervised domain adaptation between MNIST (M), USPS (U), MNIST-M (MM), SVHN (S), and Synthetic Digits (SD). Note: Direction indicates the source → target adaptation direction. The table covers the Source-only baseline and our models over the domain pairs M−SD, U−MM, U−S, U−SD, MM−S and MM−SD in both directions.

APPENDIX B SPEECH DOMAIN MODELS IMPLEMENTATION
In this section, the details of the CycleGAN and speech model architectures are explained. The size of a convolution layer is denoted by the tuple (C, F, T, SF, ST), where C, F, T, SF, and ST denote the number of channels, filter size in the frequency dimension, filter size in the time dimension, stride in the frequency dimension, and stride in the time dimension, respectively. The architecture of the CycleGAN model is based on Zhu et al. (2017), with the modifications mentioned in Hosseini-Asl et al. (2018). Both generators in the CycleGAN are based on the U-Net architecture (Ronneberger et al., 2015), with convolution layers of sizes (8,3,3,1,1), (16,3,3,1,1), (32,3,3,2,2), (64,3,3,2,2), followed by corresponding deconvolution layers. To increase the stability of adversarial training, as proposed by Hosseini-Asl et al. (2018), the discriminator output is modified to predict a single scalar as the real/fake probability. The discriminator has convolution layers of sizes (8,4,4,2,2), (16,4,4,2,2), (32,4,4,2,2), (64,4,4,2,2), matching the default kernel and stride sizes in Hosseini-Asl et al. (2018). The ASR model is implemented based on Zhou et al. (2017) and is trained only with maximum likelihood. The model includes one convolutional layer of size (32,41,11,2,2) and five residual convolution blocks of sizes (32,7,3,1,1), (32,5,3,1,1), (32,3,3,1,1), (64,3,3,2,1), (64,3,3,1,1), respectively. The convolutional layers are followed by layers of bidirectional GRU RNNs, and a fully-connected hidden layer is used before the output layer.
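As an illustration, the sketch below instantiates the discriminator stack described above, with convolution layers of sizes (8,4,4,2,2) through (64,4,4,2,2) reduced to a single real/fake probability; the padding, activations, and pooling head are assumptions, since only the kernel and stride sizes are specified.

```python
import torch.nn as nn

class SpectrogramDiscriminator(nn.Module):
    """Sketch of the discriminator: four conv layers with the stated
    (C, F, T, SF, ST) sizes, producing one scalar probability. Padding,
    LeakyReLU slope, and the average-pooling head are assumptions."""
    def __init__(self, in_channels=1):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out in (8, 16, 32, 64):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=(4, 4),
                                 stride=(2, 2), padding=1),
                       nn.LeakyReLU(0.2)]
            c_in = c_out
        self.body = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, spec):
        # spec: (batch, channels, frequency, time) spectrogram tensor
        return self.head(self.body(spec))
```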
B.1 QUALITATIVE EVALUATION OF DOMAIN ADAPTATION
In this section, we show some qualitative results on transcriptions produced by different models.
Table 8: ASR prediction improvement on the low-resource female domain (TIMIT), when augmented with adapted audio from the high-resource male domain.
Train on Female + (Male → Female)