Learning Sampling Policies for Domain Adaptation
Yash Patel*, Kashyap Chitta*, and Bhavan Jasani*
The Robotics Institute, Carnegie Mellon University
{yashp, kchitta, bjasani}@andrew.cmu.edu
* Equal contribution

Abstract.
We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning. The core idea is to consider the predictions of a source domain network on target domain data as noisy labels, and learn a policy to sample from this data so as to maximize classification accuracy on a small annotated reward partition of the target domain. Our experiments show that learned sampling policies construct labeled sets that improve accuracies of visual classifiers over baselines.
Keywords: Domain Adaptation, Active Learning, Deep Q-learning.
Introduction

Dataset bias [1] is a well-known drawback of supervised approaches to visual recognition tasks. In general, the success of supervised recognition models, both of the traditional and deep learning varieties, is restricted to data from the domain they were trained on [2]. The common approach to handle this is fairly straightforward: pre-trained deep models perform well on new domains when they are fine-tuned with a sufficient amount of data from the new distribution. However, data for fine-tuning needs to be annotated, and in many situations, labeling enough data for this approach to be effective is still prohibitively expensive.

Recent work on domain adaptation addresses this problem by aligning the features extracted from the network across the source and target domains, without any labeled target samples. The alignment typically involves minimizing some measure of distance between the source and target feature distributions, such as correlation distance [3], maximum mean discrepancy [4], or adversarial discriminator accuracy [5,6,2].

In this work, we explore the semi-supervised domain adaptation problem. We assume that we can collect data in the target domain, as well as annotate a small fraction of it, and that we have a fixed budget for annotation. This setting has been extensively studied in the field of active learning [7,8], where the goal is to obtain better predictive models than those trained on equal amounts of i.i.d. data by deciding which examples to annotate from a large unlabeled dataset. However, active learning methods are inherently designed for a target domain directly, and do not make use of the extensive amount of annotated data we have in the source domain.
We propose a reinforcement learning based formulation of the semi-supervised domain adaptation problem. In active learning, we need to choose a subset of the data to annotate and train from. We hypothesize that we could better use our annotation budget if we label a 'reward partition', used to generate rewards for a deep Q-network. Knowledge from the source domain could be coupled with this Q-agent to potentially give us a large quantity of well-labeled data in the target domain, which could not be achieved independently through unsupervised domain adaptation or active learning.

Inspired by a similar approach for action recognition [9], we aim to use our Q-network to learn a policy for sampling from noisily labeled data in the target domain. A classifier trained on the source domain is used to generate these noisy annotations for the entire target dataset. The agent is rewarded for sampling data from the target domain that, when used to train a new classifier, leads to high accuracies on the annotated reward partition.

We evaluate our approach on the Office-31 dataset, a widely accepted benchmark for testing real-world visual domain adaptation methods [10], comparing our learned policies to baselines and to state-of-the-art unsupervised domain adaptation methods.
Method

In this section, we describe the proposed method for semi-supervised domain adaptation for an n-way classification problem. The training data consists of images from two different domains, which we refer to as D_s = {(x_s^i, y_s^i)}, i = 1, ..., N_s (source domain) and D_t = {x_t^i}, i = 1, ..., N_t (target domain). An overview of the entire method is shown in Fig. 1. It consists of the following components:

– a deep convolutional neural network based source classifier, trained for classification of the source domain D_s images into n object categories;
– a binary Support Vector Machine (SVM) used as a domain discriminator, to help select a held-out subset of target domain samples S_rew for generating rewards, and to initialize a training set S_pos for the multi-class SVM;
– a multi-class SVM based target classifier, for classification of the target domain D_t images into n object categories;
– a deep Q-agent, which samples an image from the target domain D_t at every iteration to be added to S_pos.
Source Classifier. For image feature representations, we make use of a ResNet-50 [11] architecture pretrained on ImageNet [12]. We choose this model for comparison to previous work [13]. In order to obtain fine-grained representations, we first fine-tune the network on the source domain D_s in a supervised setting. We denote this source classifier as C_src.
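As a rough illustration, a source classifier of this kind could be set up in Keras (the library our models are implemented in, see the implementation details below); the following is only a sketch, in which the data pipeline (source_images, source_labels) and the training schedule are placeholder assumptions rather than the exact configuration used for our experiments.

# Sketch: constructing the source classifier C_src by fine-tuning an
# ImageNet-pretrained ResNet-50. The data pipeline and schedule are placeholders.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

N_CLASSES = 31  # object categories in Office-31

backbone = ResNet50(weights='imagenet', include_top=False,
                    input_shape=(224, 224, 3))
pool5 = GlobalAveragePooling2D(name='pool5')(backbone.output)  # 2048-d features
probs = Dense(N_CLASSES, activation='softmax')(pool5)

c_src = Model(inputs=backbone.input, outputs=probs)
c_src.compile(optimizer=SGD(learning_rate=0.003, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
# c_src.fit(source_images, source_labels, batch_size=16, epochs=...)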
Fig. 1. Overall method: candidate images are drawn from the target domain by random sampling; the Q-agent observes a state S_t, takes an action A_t that updates the positive training set used to retrain the SVM, and receives a reward R_t computed on the labelled reward set, together with the next state S_t+1.

Domain Discriminator. We construct a reward set S_rew, consisting of sampled target domain images, that is used to evaluate the performance of the target classifier and in turn to compute rewards for the Q-agent. In order to pick an appropriate set of images for S_rew, we train a binary SVM on image representations to classify the images as source domain or target domain. We denote this classifier as C_dom. The reward set S_rew is then stochastically sampled from the target domain based on the sample distances from the separating hyperplane of C_dom, with the sampling weight of each sample w_i ∝ d_i. The idea behind this initialization is that the samples further away from the domain classification hyperplane are more confusing and different from the source domain samples, and thus make a good evaluation set [14]. Note that we use our budget for ground truth annotations to get the true labels for the images in S_rew.
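To make the sampling scheme concrete, the sketch below shows one way both uses of C_dom could be implemented with scikit-learn on the pool5 features. The function names are ours, the reward set size assumes that the k = 3 labels-per-class budget mentioned in the experiments is spent entirely on S_rew, and l = 100 is taken from the implementation details; none of this should be read as the exact code used in the paper.

# Sketch: the two uses of the domain discriminator C_dom. Feature arrays
# `feats_src` and `feats_tgt` (N x 2048 pool5 features) are assumed inputs.
import numpy as np
from sklearn.svm import LinearSVC

def fit_domain_discriminator(feats_src, feats_tgt):
    X = np.concatenate([feats_src, feats_tgt])
    y = np.concatenate([np.zeros(len(feats_src)), np.ones(len(feats_tgt))])
    return LinearSVC().fit(X, y)

def sample_reward_and_positive_sets(c_dom, feats_tgt, n_rew=31 * 3, l=100):
    # Distance of every target sample from the separating hyperplane.
    d = np.abs(c_dom.decision_function(feats_tgt))

    # S_rew: sampling weight proportional to the distance, favouring samples
    # that look least like the source domain (w_i proportional to d_i).
    p_rew = d / d.sum()
    rew_idx = np.random.choice(len(feats_tgt), size=n_rew, replace=False, p=p_rew)

    # S_pos initialization: weight inversely proportional to the distance,
    # favouring source-like samples for which C_src's labels are most reliable.
    remaining = np.setdiff1d(np.arange(len(feats_tgt)), rew_idx)
    w_pos = 1.0 / (d[remaining] + 1e-8)
    pos_idx = np.random.choice(remaining, size=l, replace=False,
                               p=w_pos / w_pos.sum())
    return rew_idx, pos_idx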
Target Classifier. The n-way classifier C_tar is trained on a subset of target domain images governed by the Q-agent, which we denote S_pos. The target classification labels are set to the predictions of the source domain classifier C_src on the images in S_pos. At each iteration of training, the action taken by the Q-agent updates S_pos, and this updated training set is used to train C_tar again. For our setup, we make use of a multi-class SVM for C_tar, since the number of images in the target domain D_t is limited and we need relatively quick convergence, as the classifier is repeatedly retrained every episode.

For the initialization of S_pos, we use C_dom again, but now sample l data points from D_t with weights inversely proportional to the sample distances from the classification hyperplane (w_i ∝ 1/d_i). This follows the idea that the samples confusing the domain classifier are visually quite similar to those in the source domain, making the source classifier's predictions on them more reliable.

The objective of the Q-agent is to select the set of samples that maximizes the performance of the target domain classifier C_tar on the reward set S_rew. At any given timestep t, the Q-agent observes the current state s_t, selects an action a_t from its discrete action space, and receives a reward r_t as well as a next state observation s_{t+1}.
Actions. To have a fixed-length action space at every iteration, we randomly sample n_cand samples from D_t, with replacement, giving a new set of examples each iteration that we call the candidate set S_cand. The Q-agent then chooses one of n_cand + 1 actions, which are either to:

– pick a sample from S_cand to be added to S_pos, in which case that sample is flagged and no longer chosen for S_cand for the rest of the episode, or
– pick none of the samples in the current S_cand and move to the next iteration.
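The bookkeeping this implies is small; a possible sketch, with helper names of our own choosing and n_cand = 20 as in the implementation details, is the following.

# Sketch: drawing the candidate set S_cand and applying the agent's action.
import numpy as np

N_CAND = 20  # size of the candidate set (see implementation details)

def draw_candidates(target_indices, flagged, n_cand=N_CAND):
    # Candidates are drawn with replacement from the target samples that
    # have not yet been added to S_pos (flagged samples are excluded).
    pool = np.array([i for i in target_indices if i not in flagged])
    return np.random.choice(pool, size=n_cand, replace=True)

def apply_action(action, candidates, s_pos, flagged):
    # Actions 0 .. n_cand-1 add the corresponding candidate to S_pos and flag
    # it; action == n_cand means "pick nothing" and simply moves on.
    if action < len(candidates):
        chosen = int(candidates[action])
        s_pos.append(chosen)
        flagged.add(chosen)
    return s_pos, flagged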
State Representation. We formulate the state as a concatenation of two vectors, one dependent on the positive set and the other on the current candidate set. To obtain the first part of the state representation, which summarizes the current distribution of examples in the positive set, we use a histogram of classifier confidences, obtained by applying C_tar to the current S_pos: we threshold the distribution over classes output by C_tar into discrete bins of equal width. This component of the state representation has a dimensionality of n × n_bin (where n is the number of classes).

The second part of the state representation gives the agent a concise summary of the choices it has: for every image in S_cand, we obtain the distribution of classifier confidences over the classes as a confidence vector for that sample. The concatenation of all these vectors, each of which corresponds to a specific action that the Q-agent can take, has a dimensionality of n × n_cand.

The entire state vector, which is the input to the Q-network, is a flattened concatenation of these two parts.
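A sketch of how such a state vector could be assembled is given below. The confidence matrices are assumed to come from C_tar, and normalizing the histogram counts is our own assumption rather than a detail from the paper.

# Sketch: building the Q-network input from C_tar's class confidences.
# probs_pos:  (|S_pos|, n) confidence matrix over classes for the positive set
# probs_cand: (n_cand, n)  confidence matrix for the current candidates
import numpy as np

def build_state(probs_pos, probs_cand, n_bin=10):
    n = probs_pos.shape[1]
    # Part 1: per-class histogram of confidences over S_pos (n x n_bin values).
    hist = np.stack([np.histogram(probs_pos[:, c], bins=n_bin,
                                  range=(0.0, 1.0))[0] for c in range(n)])
    hist = hist.astype(np.float32) / max(len(probs_pos), 1)  # normalization (assumed)
    # Part 2: one confidence vector per candidate / action (n x n_cand values).
    cand = probs_cand.astype(np.float32)
    # Flattened concatenation, e.g. 31*10 + 31*20 = 930 dimensions.
    return np.concatenate([hist.ravel(), cand.ravel()])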
Rewards. The reward r_t for the agent at any timestep is determined by the relative change in the performance of the target domain classifier C_tar from the previous timestep.
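Putting these pieces together, one step of the resulting environment might look like the following sketch, which reuses apply_action, draw_candidates and build_state from the earlier sketches. Treating a softmax over the SVM decision values as class confidences, and taking the reward as a plain difference in reward-set accuracy, are our readings of the description above rather than stated details.

# Sketch: one interaction step of the sampling environment. `env` is a dict
# holding the target features, the noisy labels predicted by C_src, the
# labelled reward set, and the S_pos / flagged / candidate bookkeeping.
import numpy as np
from sklearn.svm import LinearSVC

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def environment_step(action, env):
    env['s_pos'], env['flagged'] = apply_action(action, env['candidates'],
                                                env['s_pos'], env['flagged'])

    # Retrain the one-vs-all target classifier C_tar on S_pos with noisy labels.
    c_tar = LinearSVC().fit(env['feats_tgt'][env['s_pos']],
                            env['noisy_labels'][env['s_pos']])

    # Reward: change in accuracy on the labelled reward set S_rew.
    acc = c_tar.score(env['feats_rew'], env['labels_rew'])
    reward = acc - env['prev_acc']
    env['prev_acc'] = acc

    # Draw the next candidate set and assemble the next state observation.
    env['candidates'] = draw_candidates(np.arange(len(env['feats_tgt'])),
                                        env['flagged'])
    conf_pos = softmax(c_tar.decision_function(env['feats_tgt'][env['s_pos']]))
    conf_cand = softmax(c_tar.decision_function(env['feats_tgt'][env['candidates']]))
    return build_state(conf_pos, conf_cand), reward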
Q-Network. In the standard formulation of Q-learning, we use a function approximator with weights w to estimate the Q-value of a state-action pair (s, a). We use a Dueling Deep Q-Network (DDQN) [15] as our function approximator. We define the advantage function relating the value and Q functions:

A^π(s, a) = Q^π(s, a) − V^π(s)    (1)

We implement the DDQN with two sequences (or streams) of fully connected layers. We subtract the average output over all actions from the advantage stream:

Q(s, a; α, β) = V(s; β) + A(s, a; α) − (1/|A|) Σ_{a′ ∈ A} A(s, a′; α)    (2)

where α are the parameters of the advantage stream, β are the parameters of the value stream, and the weights w we intend to optimize are the combination of these two sets of parameters.
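A sketch of this architecture in Keras is shown below. The layer sizes (a 930-dimensional state, 21 actions, one state-sized hidden layer in the value stream, two 512-unit layers in the advantage stream) and the sigmoid output follow the implementation details reported later; everything else is illustrative.

# Sketch: a dueling Q-network implementing Eq. (2), with the mean advantage
# subtracted when combining the value and advantage streams.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_q_network(state_dim=930, n_actions=21):
    state = layers.Input(shape=(state_dim,))

    # Value stream V(s; beta): one hidden layer as wide as the state.
    v = layers.Dense(state_dim, activation='relu')(state)
    v = layers.Dense(1)(v)

    # Advantage stream A(s, a; alpha): two hidden layers of 512 units.
    a = layers.Dense(512, activation='relu')(state)
    a = layers.Dense(512, activation='relu')(a)
    a = layers.Dense(n_actions)(a)

    # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')        (Eq. 2)
    q = layers.Lambda(lambda va: va[0] + va[1]
                      - tf.reduce_mean(va[1], axis=1, keepdims=True))([v, a])

    # Sigmoid bounds the predicted Q-values to [0, 1] (see Stabilization).
    q = layers.Activation('sigmoid')(q)
    return Model(inputs=state, outputs=q)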
Stabilization. We further stabilize the training by maintaining two Q-networks [16]: an online network with weights w, and a target network with weights w−. The weights w are updated using the following gradient step:

w := w + η ( r + γ max_{a′ ∈ A} Q_{w−}(s′, a′) − Q_w(s, a) ) ∇_w Q_w(s, a)    (3)

where Q_{w−} is generated by the target network and η is our learning rate. Every few iterations, the target network weights are set equal to those of the online Q-network, and are kept frozen until the next such assignment.

We additionally found that adding weight decay as a regularizer to the objective and using a sigmoid activation on the final predicted Q-values significantly improved the optimization. Since the Q-values are expectations of cumulative rewards under an optimal policy, the sigmoid activation bounds these values to the range [0, 1]. This is a valid restriction based on our reward structure, as the maximum possible reward of an optimal policy is 1.

Experiments

We evaluate our algorithm on the Office-31 dataset [10]. It contains images from three domains: Amazon (A), Webcam (W) and DSLR (D). Within each domain, images belong to 31 classes of everyday objects, with a fairly even class distribution. The dataset is imbalanced across domains, with 2,817 images in A, 795 images in W, and 498 images in D. With 3 domains, we have 6 possible transfer tasks, which address various forms of domain shift, including resolution, lighting, viewpoint, background and dataset size.

The standard protocol for evaluation in unsupervised adaptation involves using all images in the source domain (with labels) and target domain (without labels) for training, and reporting performance on the entire target domain. We compare our results to other work based on the same feature extraction backbone (ResNet-50), by evaluating the final learned policy on the entire target domain dataset. The key difference between existing approaches and ours is that we use k = 3 labels per class from the target domain, in addition to the remaining unlabeled data, during training.

Backbone Network. For each transfer task, we initially fine-tune our backbone network on the source domain for 30,000 iterations. We use SGD with an exponentially decaying learning rate starting at 0.003 and a batch size of 16. The 2048-dimensional pool5 feature vectors are used as representations for the images by the other components of our algorithm.
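A small sketch of this feature extraction step is given below, reusing the c_src model and the 'pool5' layer name introduced in the earlier source-classifier sketch; the preprocessing call is the standard Keras ResNet-50 one and is an assumption here.

# Sketch: extracting 2048-d pool5 representations from the fine-tuned backbone
# for use by the SVMs and the Q-agent.
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.applications.resnet50 import preprocess_input

feature_extractor = Model(inputs=c_src.input,
                          outputs=c_src.get_layer('pool5').output)

def extract_features(images, batch_size=16):
    """images: float array of shape (N, 224, 224, 3), RGB order."""
    x = preprocess_input(np.asarray(images, dtype=np.float32))
    return feature_extractor.predict(x, batch_size=batch_size)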
SVMs.
For the domain discriminator C_dom, we train a binary SVM with a linear kernel, using representations from the entire source domain as one class and the target domain as the other. The target classifier C_tar is a multi-class linear-kernel SVM trained in a one-vs-all manner every iteration. Both of these are implemented with the default parameters of the liblinear library [17]. We initialize S_pos with l = 100 samples for all 6 transfer tasks.
Q-agent.
For our state representation, we set the n_bin of the histogram to 10 and n_cand to 20. This leads to a state space of dimensionality 930 and an action space of size 21. For the value stream of our DDQN, we use a single hidden layer with as many units as the dimensionality of the state. The advantage stream consists of two hidden layers of 512 units each.

We use the Adam optimizer [18] with a learning rate of 0.001. We use an ε-greedy exploration policy, reducing the value of ε linearly from 1 to 0 over the first 2,000 iterations of training, and train for a total of 20,000 iterations for each transfer task. We assign the weights of our online network to the target network every 10 iterations. Our models are implemented in Keras with a TensorFlow backend [19].
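The sketch below shows how these components could be tied together into a training loop, reusing build_dueling_q_network and environment_step from the earlier sketches. The replay memory and weight decay are omitted, the discount factor GAMMA is an assumed value, and env is assumed to be initialized as described in the method section.

# Sketch: epsilon-greedy training of the online network with a periodically
# synchronized target network (cf. Eq. 3).
import numpy as np
from tensorflow.keras.optimizers import Adam

TOTAL_ITERS, EPS_ANNEAL_ITERS, SYNC_EVERY, GAMMA, N_CAND = 20000, 2000, 10, 0.99, 20

online_net = build_dueling_q_network()
target_net = build_dueling_q_network()
target_net.set_weights(online_net.get_weights())
online_net.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

state, _ = environment_step(N_CAND, env)   # a "pick nothing" step yields an initial state

for it in range(TOTAL_ITERS):
    eps = max(0.0, 1.0 - it / EPS_ANNEAL_ITERS)        # linear exploration schedule
    q = online_net.predict(state[None], verbose=0)[0]
    action = np.random.randint(N_CAND + 1) if np.random.rand() < eps else int(q.argmax())

    next_state, reward = environment_step(action, env)

    # One-step update towards the frozen target network's estimate (Eq. 3).
    q_target = q.copy()
    q_target[action] = reward + GAMMA * target_net.predict(next_state[None], verbose=0)[0].max()
    online_net.train_on_batch(state[None], q_target[None])

    if (it + 1) % SYNC_EVERY == 0:
        target_net.set_weights(online_net.get_weights())
    state = next_state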
Results

We report our results in Table 1. As a baseline, we use the classifier trained on the source domain and evaluate it on the target domain (without any domain adaptation) for all 6 transfer tasks. We additionally compare to the best reported results for unsupervised domain adaptation on this dataset [13]. We observe that the learned policies do better than the baselines, but fail to close the gap to existing state-of-the-art unsupervised adaptation methods.

Table 1. Comparison with the state of the art

Method                         A⇒D     A⇒W     D⇒A     D⇒W     W⇒A     W⇒D
Baseline (ours)                76.9%   71.4%   58.6%   91.7%   57.4%   96.?%
Unsupervised DA                ??.7%   82.0%   68.2%   96.9%   67.4%   99.?%
Unsupervised DA (best) [13]    ??.8%   86.8%   74.3%   99.3%   73.9%   100%
Ours (semi-supervised)         86.1%   83.5%   64.6%   93.1%   60.8%   98.?%

Conclusion

We present a reinforcement learning based approach to learning sampling policies for the purpose of domain adaptation. Our method learns to select samples for training from the target domain that maximize performance on a reward set, and in turn improve the overall classification accuracy in the target domain.

Unlike the existing state-of-the-art methods, we make use of fixed representations for both the source and target domain samples, which hurts the performance of our method. In order to address this, we plan to work on the integration of a labeler into the Q-agent, which will be jointly optimized with the current sampler to obtain better representations and less noisy labels for the target domain. Another idea is to learn sampling policies using the representations obtained after performing unsupervised domain adaptation through existing feature alignment techniques.
References
1. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR. (2011)
2. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR. (2017)
3. Sun, B., Saenko, K.: Deep CORAL: Correlation alignment for deep domain adaptation. In: ECCV Workshops. (2016)
4. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: ICML. (2015)
5. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML. (2015)
6. Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: ICCV. (2015)
7. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Machine Learning (1994)
8. Settles, B.: Active learning literature survey. (2010)
9. Yeung, S., Ramanathan, V., Russakovsky, O., Shen, L., Mori, G., Fei-Fei, L.: Learning to learn from noisy web videos. In: CVPR. (2017)
10. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: ECCV. (2010)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
12. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. (2009)
13. Kang, G., Zheng, L., Yan, Y., Yang, Y.: Deep adversarial attention alignment for unsupervised domain adaptation: the benefit of target expectation maximization. CoRR (2018)
14. Rai, P., Saha, A., Daumé III, H., Venkatasubramanian, S.: Domain adaptation meets active learning. In: ALNLP. (2010)
15. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015)
16. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI. (2016)
17. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (2008)
18. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)