Learning Sampling Policies for Domain Adaptation
Yash Patel*, Kashyap Chitta*, and Bhavan Jasani*
The Robotics Institute, Carnegie Mellon University
{yashp, kchitta, bjasani}@andrew.cmu.edu
* Equal contribution

Abstract.
We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning. The core idea is to consider the predictions of a source domain network on target domain data as noisy labels, and learn a policy to sample from this data so as to maximize classification accuracy on a small annotated reward partition of the target domain. Our experiments show that learned sampling policies construct labeled sets that improve accuracies of visual classifiers over baselines.
Keywords: Domain Adaptation, Active Learning, Deep Q-learning.
Introduction

Dataset bias [1] is a well-known drawback of supervised approaches to visual recognition tasks. In general, the success of supervised recognition models, both of the traditional and deep learning varieties, is restricted to data from the domain they were trained on [2]. The common approach to handle this is fairly straightforward: pre-trained deep models perform well on new domains when they are fine-tuned with a sufficient amount of data from the new distribution. However, data for fine-tuning needs to be annotated, and in many situations, labeling enough data for this approach to be effective is still prohibitively expensive.

Recent work on domain adaptation addresses this problem by aligning the features extracted from the network across the source and target domains, without any labeled target samples. The alignment typically involves minimizing some measure of distance between the source and target feature distributions, such as correlation distance [3], maximum mean discrepancy [4], or adversarial discriminator accuracy [5,6,2].

In this work, we explore the semi-supervised domain adaptation problem. We assume that we can collect data in the target domain, as well as annotate a small fraction of it, and that we have a fixed budget for annotation. This setting has been extensively studied in the field of active learning [7,8], where the goal is to obtain better predictive models than those trained on equal amounts of i.i.d. data by deciding which examples to annotate from a large unlabeled dataset. However, active learning methods are inherently designed for a target domain directly, and do not make use of the extensive amount of annotated data we have in the source domain.
We propose a reinforcement learning based formulation of the semi-supervised domain adaptation problem. In active learning, we need to choose a subset of the data to annotate and train from. We hypothesize that we could better use our annotation budget if we label a 'reward partition', used to generate rewards for a deep Q-network. Knowledge from the source domain could be coupled with this Q-agent to potentially give us a large quantity of well-labeled data in the target domain, which could not be achieved independently through unsupervised domain adaptation or active learning.

Inspired by a similar approach for action recognition [9], we aim to use our Q-network to learn a policy for sampling from noisily labeled data in the target domain. A classifier trained on the source domain is used to generate these noisy annotations for the entire target dataset. The agent is rewarded for sampling data from the target domain that, when used to train a new classifier, leads to high accuracies on the annotated reward partition.

We evaluate our approach on the Office-31 dataset, a widely accepted benchmark for testing real-world visual domain adaptation methods [10], comparing our learned policies to baselines and to state-of-the-art unsupervised domain adaptation methods.
Method

In this section, we describe the proposed method for semi-supervised domain adaptation for an n-way classification problem. The training data consists of images from two different domains, which we refer to as D_s = {(x_s^i, y_s^i)}, i = 1, ..., N_s (source domain) and D_t = {x_t^i}, i = 1, ..., N_t (target domain). An overview of the entire method is shown in Fig. 1. It consists of the following components:

– a deep convolutional neural network based source classifier, trained for classification of the source domain D_s images into n object categories;
– a binary Support Vector Machine (SVM) used as a domain discriminator, to help select a held-out subset of target domain samples S_rew for generating rewards, and to initialize a training set S_pos for the multi-class SVM;
– a multi-class SVM based target classifier, for classification of the target domain D_t images into n object categories;
– a deep Q-agent, which samples an image from the target domain D_t at every iteration to be added to S_pos.
Source Classifier. For image feature representations, we make use of a ResNet-50 [11] architecture pretrained on ImageNet [12]. We choose this model for comparison to previous work [13]. In order to obtain fine-grained representations, we first fine-tune the network on the source domain D_s in a supervised setting. We denote this source classifier as C_src.
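As a rough illustration, a source classifier of this kind could be set up in Keras (the library our models are implemented in, see the implementation details below); the following is only a sketch, in which the data pipeline (source_images, source_labels) and the training schedule are placeholder assumptions rather than the exact configuration used for our experiments.

# Sketch: constructing the source classifier C_src by fine-tuning an
# ImageNet-pretrained ResNet-50. The data pipeline and schedule are placeholders.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

N_CLASSES = 31  # object categories in Office-31

backbone = ResNet50(weights='imagenet', include_top=False,
                    input_shape=(224, 224, 3))
pool5 = GlobalAveragePooling2D(name='pool5')(backbone.output)  # 2048-d features
probs = Dense(N_CLASSES, activation='softmax')(pool5)

c_src = Model(inputs=backbone.input, outputs=probs)
c_src.compile(optimizer=SGD(learning_rate=0.003, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
# c_src.fit(source_images, source_labels, batch_size=16, epochs=...)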
Fig. 1. Overall method: candidate images are drawn from the target domain by random sampling; the Q-agent observes a state S_t, takes an action A_t that updates the positive training set used to retrain the SVM, and receives a reward R_t computed on the labelled reward set, together with the next state S_t+1.

Domain Discriminator. We construct a reward set S_rew, consisting of sampled target domain images, that is used to evaluate the performance of the target classifier and in turn to compute rewards for the Q-agent. In order to pick an appropriate set of images for S_rew, we train a binary SVM on image representations to classify the images as source domain or target domain. We denote this classifier as C_dom. The reward set S_rew is then stochastically sampled from the target domain based on the sample distances from the separating hyperplane of C_dom, with the sampling weight of each sample w_i ∝ d_i. The idea behind this initialization is that the samples further away from the domain classification hyperplane are more confusing and different from the source domain samples, and thus make a good evaluation set [14]. Note that we use our budget for ground truth annotations to get the true labels for the images in S_rew.
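To make the sampling scheme concrete, the sketch below shows one way both uses of C_dom could be implemented with scikit-learn on the pool5 features. The function names are ours, the reward set size assumes that the k = 3 labels-per-class budget mentioned in the experiments is spent entirely on S_rew, and l = 100 is taken from the implementation details; none of this should be read as the exact code used in the paper.

# Sketch: the two uses of the domain discriminator C_dom. Feature arrays
# `feats_src` and `feats_tgt` (N x 2048 pool5 features) are assumed inputs.
import numpy as np
from sklearn.svm import LinearSVC

def fit_domain_discriminator(feats_src, feats_tgt):
    X = np.concatenate([feats_src, feats_tgt])
    y = np.concatenate([np.zeros(len(feats_src)), np.ones(len(feats_tgt))])
    return LinearSVC().fit(X, y)

def sample_reward_and_positive_sets(c_dom, feats_tgt, n_rew=31 * 3, l=100):
    # Distance of every target sample from the separating hyperplane.
    d = np.abs(c_dom.decision_function(feats_tgt))

    # S_rew: sampling weight proportional to the distance, favouring samples
    # that look least like the source domain (w_i proportional to d_i).
    p_rew = d / d.sum()
    rew_idx = np.random.choice(len(feats_tgt), size=n_rew, replace=False, p=p_rew)

    # S_pos initialization: weight inversely proportional to the distance,
    # favouring source-like samples for which C_src's labels are most reliable.
    remaining = np.setdiff1d(np.arange(len(feats_tgt)), rew_idx)
    w_pos = 1.0 / (d[remaining] + 1e-8)
    pos_idx = np.random.choice(remaining, size=l, replace=False,
                               p=w_pos / w_pos.sum())
    return rew_idx, pos_idx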
Target Classifier. The n-way classifier C_tar is trained on a subset of target domain images governed by the Q-agent, which we denote S_pos. The target classification labels are set to the predictions of the source domain classifier C_src on the images in S_pos. At each iteration of training, the action taken by the Q-agent updates S_pos, and this updated training set is used to train C_tar again. For our setup, we make use of a multi-class SVM for C_tar, since the number of images in the target domain D_t is limited and we need relatively quick convergence, as the classifier is repeatedly retrained every episode.

For the initialization of S_pos, we use C_dom again, but now sample l data points from D_t with weights inversely proportional to the sample distances from the classification hyperplane (w_i ∝ 1/d_i). This follows the idea that the samples confusing the domain classifier are visually quite similar to those in the source domain, making the source classifier's predictions on them more reliable.

The objective of the Q-agent is to select the set of samples that maximizes the performance of the target domain classifier C_tar on the reward set S_rew. At any given timestep t, the Q-agent observes the current state s_t, selects an action a_t from its discrete action space, and receives a reward r_t as well as a next state observation s_{t+1}.
Actions. To have a fixed-length action space at every iteration, we randomly sample n_cand samples from D_t, with replacement, giving a new set of examples each iteration that we call the candidate set S_cand. The Q-agent then chooses one of n_cand + 1 actions, which are either to:

– pick a sample from S_cand to be added to S_pos, in which case that sample is flagged and no longer chosen for S_cand for the rest of the episode, or
– pick none of the samples in the current S_cand and move to the next iteration.
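The bookkeeping this implies is small; a possible sketch, with helper names of our own choosing and n_cand = 20 as in the implementation details, is the following.

# Sketch: drawing the candidate set S_cand and applying the agent's action.
import numpy as np

N_CAND = 20  # size of the candidate set (see implementation details)

def draw_candidates(target_indices, flagged, n_cand=N_CAND):
    # Candidates are drawn with replacement from the target samples that
    # have not yet been added to S_pos (flagged samples are excluded).
    pool = np.array([i for i in target_indices if i not in flagged])
    return np.random.choice(pool, size=n_cand, replace=True)

def apply_action(action, candidates, s_pos, flagged):
    # Actions 0 .. n_cand-1 add the corresponding candidate to S_pos and flag
    # it; action == n_cand means "pick nothing" and simply moves on.
    if action < len(candidates):
        chosen = int(candidates[action])
        s_pos.append(chosen)
        flagged.add(chosen)
    return s_pos, flagged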
State Representation. We formulate the state as a concatenation of two vectors, one dependent on the positive set and the other on the current candidate set. To obtain the first part of the state representation, which summarizes the current distribution of examples in the positive set, we use a histogram of classifier confidences, obtained by applying C_tar to the current S_pos: we threshold the distribution over classes output by C_tar into discrete bins of equal width. This component of the state representation has a dimensionality of n × n_bin (where n is the number of classes).

The second part of the state representation gives the agent a concise summary of the choices it has: for every image in S_cand, we obtain the distribution of classifier confidences over the classes as a confidence vector for that sample. The concatenation of all these vectors, each of which corresponds to a specific action that the Q-agent can take, has a dimensionality of n × n_cand.

The entire state vector, which is the input to the Q-network, is a flattened concatenation of these two parts.
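A sketch of how such a state vector could be assembled is given below. The confidence matrices are assumed to come from C_tar, and normalizing the histogram counts is our own assumption rather than a detail from the paper.

# Sketch: building the Q-network input from C_tar's class confidences.
# probs_pos:  (|S_pos|, n) confidence matrix over classes for the positive set
# probs_cand: (n_cand, n)  confidence matrix for the current candidates
import numpy as np

def build_state(probs_pos, probs_cand, n_bin=10):
    n = probs_pos.shape[1]
    # Part 1: per-class histogram of confidences over S_pos (n x n_bin values).
    hist = np.stack([np.histogram(probs_pos[:, c], bins=n_bin,
                                  range=(0.0, 1.0))[0] for c in range(n)])
    hist = hist.astype(np.float32) / max(len(probs_pos), 1)  # normalization (assumed)
    # Part 2: one confidence vector per candidate / action (n x n_cand values).
    cand = probs_cand.astype(np.float32)
    # Flattened concatenation, e.g. 31*10 + 31*20 = 930 dimensions.
    return np.concatenate([hist.ravel(), cand.ravel()])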
Rewards. The reward r_t for the agent at any timestep is determined by the relative change in the performance of the target domain classifier C_tar from the previous timestep.
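Putting these pieces together, one step of the resulting environment might look like the following sketch, which reuses apply_action, draw_candidates and build_state from the earlier sketches. Treating a softmax over the SVM decision values as class confidences, and taking the reward as a plain difference in reward-set accuracy, are our readings of the description above rather than stated details.

# Sketch: one interaction step of the sampling environment. `env` is a dict
# holding the target features, the noisy labels predicted by C_src, the
# labelled reward set, and the S_pos / flagged / candidate bookkeeping.
import numpy as np
from sklearn.svm import LinearSVC

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def environment_step(action, env):
    env['s_pos'], env['flagged'] = apply_action(action, env['candidates'],
                                                env['s_pos'], env['flagged'])

    # Retrain the one-vs-all target classifier C_tar on S_pos with noisy labels.
    c_tar = LinearSVC().fit(env['feats_tgt'][env['s_pos']],
                            env['noisy_labels'][env['s_pos']])

    # Reward: change in accuracy on the labelled reward set S_rew.
    acc = c_tar.score(env['feats_rew'], env['labels_rew'])
    reward = acc - env['prev_acc']
    env['prev_acc'] = acc

    # Draw the next candidate set and assemble the next state observation.
    env['candidates'] = draw_candidates(np.arange(len(env['feats_tgt'])),
                                        env['flagged'])
    conf_pos = softmax(c_tar.decision_function(env['feats_tgt'][env['s_pos']]))
    conf_cand = softmax(c_tar.decision_function(env['feats_tgt'][env['candidates']]))
    return build_state(conf_pos, conf_cand), reward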
Q-Network. In the standard formulation of Q-learning, we use a function approximator with weights w to estimate the Q-value of a state-action pair (s, a). We use a Dueling Deep Q-Network (DDQN) [15] as our function approximator. We define the advantage function relating the value and Q functions:

A^π(s, a) = Q^π(s, a) − V^π(s)    (1)

We implement the DDQN with two sequences (or streams) of fully connected layers. We subtract the average output over all actions from the advantage stream:

Q(s, a; α, β) = V(s; β) + A(s, a; α) − (1/|A|) Σ_{a′ ∈ A} A(s, a′; α)    (2)

where α are the parameters of the advantage stream, β are the parameters of the value stream, and the weights w we intend to optimize are the combination of these two sets of parameters.
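A sketch of this architecture in Keras is shown below. The layer sizes (a 930-dimensional state, 21 actions, one state-sized hidden layer in the value stream, two 512-unit layers in the advantage stream) and the sigmoid output follow the implementation details reported later; everything else is illustrative.

# Sketch: a dueling Q-network implementing Eq. (2), with the mean advantage
# subtracted when combining the value and advantage streams.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_q_network(state_dim=930, n_actions=21):
    state = layers.Input(shape=(state_dim,))

    # Value stream V(s; beta): one hidden layer as wide as the state.
    v = layers.Dense(state_dim, activation='relu')(state)
    v = layers.Dense(1)(v)

    # Advantage stream A(s, a; alpha): two hidden layers of 512 units.
    a = layers.Dense(512, activation='relu')(state)
    a = layers.Dense(512, activation='relu')(a)
    a = layers.Dense(n_actions)(a)

    # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')        (Eq. 2)
    q = layers.Lambda(lambda va: va[0] + va[1]
                      - tf.reduce_mean(va[1], axis=1, keepdims=True))([v, a])

    # Sigmoid bounds the predicted Q-values to [0, 1] (see Stabilization).
    q = layers.Activation('sigmoid')(q)
    return Model(inputs=state, outputs=q)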
Stabilization. We further stabilize the training by maintaining two Q-networks [16]: an online network with weights w, and a target network with weights w−. The weights w are updated using the following gradient step:

w := w + η ( r + γ max_{a′ ∈ A} Q_{w−}(s′, a′) − Q_w(s, a) ) ∇_w Q_w(s, a)    (3)

where Q_{w−} is generated by the target network and η is our learning rate. Every few iterations, the target network weights are set equal to those of the online Q-network, and are kept frozen until the next such assignment.

We additionally found that adding weight decay as a regularizer to the objective and using a sigmoid activation on the final predicted Q-values significantly improved the optimization. Since the Q-values are expectations of cumulative rewards under an optimal policy, the sigmoid activation bounds these values to the range [0, 1]. This is a valid restriction based on our reward structure, as the maximum possible reward of an optimal policy is 1.

Experiments

We evaluate our algorithm on the Office-31 dataset [10]. It contains images from three domains: Amazon (A), Webcam (W) and DSLR (D). Within each domain, images belong to 31 classes of everyday objects, with a fairly even class distribution. The dataset is imbalanced across domains, with 2,817 images in A, 795 images in W, and 498 images in D. With 3 domains, we have 6 possible transfer tasks, which address various forms of domain shift, including resolution, lighting, viewpoint, background and dataset size.

The standard protocol for evaluation in unsupervised adaptation involves using all images in the source domain (with labels) and target domain (without labels) for training, and reporting performance on the entire target domain. We compare our results to other work based on the same feature extraction backbone (ResNet-50), by evaluating the final learned policy on the entire target domain dataset. The key difference between existing approaches and ours is that we use k = 3 labels per class from the target domain, in addition to the remaining unlabeled data, during training.

Backbone Network. For each transfer task, we initially fine-tune our backbone network on the source domain for 30,000 iterations. We use SGD with an exponentially decaying learning rate starting at 0.003 and a batch size of 16. The 2048-dimensional pool5 feature vectors are used as representations for the images by the other components of our algorithm.
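A small sketch of this feature extraction step is given below, reusing the c_src model and the 'pool5' layer name introduced in the earlier source-classifier sketch; the preprocessing call is the standard Keras ResNet-50 one and is an assumption here.

# Sketch: extracting 2048-d pool5 representations from the fine-tuned backbone
# for use by the SVMs and the Q-agent.
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.applications.resnet50 import preprocess_input

feature_extractor = Model(inputs=c_src.input,
                          outputs=c_src.get_layer('pool5').output)

def extract_features(images, batch_size=16):
    """images: float array of shape (N, 224, 224, 3), RGB order."""
    x = preprocess_input(np.asarray(images, dtype=np.float32))
    return feature_extractor.predict(x, batch_size=batch_size)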
SVMs.
For the domain discriminator C_dom, we train a binary SVM with a linear kernel, using representations from the entire source domain as one class and the target domain as the other. The target classifier C_tar is a multi-class linear-kernel SVM trained in a one-vs-all manner every iteration. Both of these are implemented with the default parameters of the liblinear library [17]. We initialize S_pos with l = 100 samples for all 6 transfer tasks.
Q-agent.
For our state representation, we set the n_bin of the histogram to 10 and n_cand to 20. This leads to a state space of dimensionality 930 and an action space of size 21. For the value stream of our DDQN, we use a single hidden layer with as many units as the dimensionality of the state. The advantage stream consists of two hidden layers of 512 units each.

We use the Adam optimizer [18] with a learning rate of 0.001. We use an ε-greedy exploration policy, reducing the value of ε linearly from 1 to 0 over the first 2,000 iterations of training, and train for a total of 20,000 iterations for each transfer task. We assign the weights of our online network to the target network every 10 iterations. Our models are implemented in Keras with a TensorFlow backend [19].
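The sketch below shows how these components could be tied together into a training loop, reusing build_dueling_q_network and environment_step from the earlier sketches. The replay memory and weight decay are omitted, the discount factor GAMMA is an assumed value, and env is assumed to be initialized as described in the method section.

# Sketch: epsilon-greedy training of the online network with a periodically
# synchronized target network (cf. Eq. 3).
import numpy as np
from tensorflow.keras.optimizers import Adam

TOTAL_ITERS, EPS_ANNEAL_ITERS, SYNC_EVERY, GAMMA, N_CAND = 20000, 2000, 10, 0.99, 20

online_net = build_dueling_q_network()
target_net = build_dueling_q_network()
target_net.set_weights(online_net.get_weights())
online_net.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

state, _ = environment_step(N_CAND, env)   # a "pick nothing" step yields an initial state

for it in range(TOTAL_ITERS):
    eps = max(0.0, 1.0 - it / EPS_ANNEAL_ITERS)        # linear exploration schedule
    q = online_net.predict(state[None], verbose=0)[0]
    action = np.random.randint(N_CAND + 1) if np.random.rand() < eps else int(q.argmax())

    next_state, reward = environment_step(action, env)

    # One-step update towards the frozen target network's estimate (Eq. 3).
    q_target = q.copy()
    q_target[action] = reward + GAMMA * target_net.predict(next_state[None], verbose=0)[0].max()
    online_net.train_on_batch(state[None], q_target[None])

    if (it + 1) % SYNC_EVERY == 0:
        target_net.set_weights(online_net.get_weights())
    state = next_state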
Results

We report our results in Table 1. As a baseline, we use the classifier trained on the source domain and evaluate it on the target domain (without any domain adaptation) for all 6 transfer tasks. We additionally compare to the best reported results for unsupervised domain adaptation on this dataset [13]. We observe that the learned policies do better than the baselines, but fail to close the gap to existing state-of-the-art unsupervised adaptation methods.

Table 1. Comparison with the state of the art

Method                         A⇒D     A⇒W     D⇒A     D⇒W     W⇒A     W⇒D
Baseline (ours)                76.9%   71.4%   58.6%   91.7%   57.4%   96.?%
Unsupervised DA                ??.7%   82.0%   68.2%   96.9%   67.4%   99.?%
Unsupervised DA (best) [13]    ??.8%   86.8%   74.3%   99.3%   73.9%   100%
Ours (semi-supervised)         86.1%   83.5%   64.6%   93.1%   60.8%   98.?%

Conclusion

We present a reinforcement learning based approach to learning sampling policies for the purpose of domain adaptation. Our method learns to select samples for training from the target domain that maximize performance on a reward set, and in turn improve the overall classification accuracy in the target domain.

Unlike the existing state-of-the-art methods, we make use of fixed representations for both the source and target domain samples, which hurts the performance of our method. In order to address this, we plan to work on the integration of a labeler into the Q-agent, which will be jointly optimized with the current sampler to obtain better representations and less noisy labels for the target domain. Another idea is to learn sampling policies using the representations obtained after performing unsupervised domain adaptation through existing feature alignment techniques.
References
1. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR. (2011)
2. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR. (2017)
3. Sun, B., Saenko, K.: Deep CORAL: Correlation alignment for deep domain adaptation. In: ECCV Workshops. (2016)
4. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: ICML. (2015)
5. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML. (2015)
6. Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: ICCV. (2015)
7. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Machine Learning (1994)
8. Settles, B.: Active learning literature survey. (2010)
9. Yeung, S., Ramanathan, V., Russakovsky, O., Shen, L., Mori, G., Fei-Fei, L.: Learning to learn from noisy web videos. In: CVPR. (2017)
10. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: ECCV. (2010)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
12. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. (2009)
13. Kang, G., Zheng, L., Yan, Y., Yang, Y.: Deep adversarial attention alignment for unsupervised domain adaptation: the benefit of target expectation maximization. CoRR (2018)
14. Rai, P., Saha, A., Daumé III, H., Venkatasubramanian, S.: Domain adaptation meets active learning. In: ALNLP. (2010)
15. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015)
16. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI. (2016)
17. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (2008)
18. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)