Detecting Bias in Transfer Learning Approaches for Text Classification
Irene Li
Department of Computer Science, Yale University
[email protected]
Abstract
Classification is an essential and fundamental task in machine learning, playing a cardinal role in both natural language processing (NLP) and computer vision (CV). In a supervised learning setting, labels are required for the classification task; deep neural models in particular need a large amount of high-quality labeled data for training. However, when a new domain appears, it is usually hard or expensive to acquire the labels. Transfer learning is one option for transferring knowledge from a source domain to a target domain. A challenge is that these two domains can differ, either in the feature distribution or in the class distribution of the samples. In this work, we evaluate existing transfer learning approaches, both traditional and deep, on detecting the bias introduced by imbalanced classes. In addition, we propose an approach to bridge the gap caused by the domain class imbalance issue.
Machine learning has been widely applied across a variety of industries. Supervised machine learning is an essential family of algorithms in which both data and the corresponding labels are required during learning. Classification, a fundamental supervised learning task, plays an important role in many applications such as face recognition and sentiment analysis (Le and Nguyen, 2015). However, a large amount of labeled data is often required for supervised classification, especially for deep neural networks. When it is difficult or expensive to obtain enough labeled data, transfer learning (Pan and Yang, 2009) is considered an approach to this problem: it tries to transfer existing knowledge to a new domain.

Existing research has targeted improving model performance when applying transfer learning techniques from a source domain to a target domain. In the real world, the source and target domains differ. We focus on the domain class imbalance (DCI) issue, where the two domains have different class ratios among their samples. For instance, in a binary classification problem we may have 50/50 (Pos/Neg) balanced samples in the source domain but 30/70 in the target domain. In this case, a classifier that predicts every sample as negative still achieves an accuracy of 0.7. Yet in some settings, such as the medical domain, we also want to examine the accuracy on the positive cases. Most prior works report an average improvement, such as an average accuracy or F1 score over all classes. Fewer efforts have checked the improvement for each class, especially for rare classes. In this work, we aim to analyze the robustness of deep transfer learning models on the text classification task under a domain class imbalanced setting.
This work is inspired by (Weiss and Khoshgoftaar, 2016), who performed tests on traditional transfer learning algorithms under this setting but did not analyze deep models.
Most related works learn feature matching and instance reweighting for the two domains. Recent work by (Long et al., 2014) proposed a novel Transfer Joint Matching (TJM) method that models both in a unified optimization problem, reducing the domain difference by jointly matching the features and re-weighting the instances across domains. This method substantially improved cross-domain image recognition baselines. Another direction is to learn a good feature representation shared by the two domains. (Pan et al., 2010) proposed a learning method for such shared representations called transfer component analysis (TCA). TCA learns transferable feature components in a Reproducing Kernel Hilbert Space (RKHS) using Maximum Mean Discrepancy (MMD). In the shared feature space, standard machine learning methods can then be used for the classification task. Similarly, (Long et al., 2015) brought this framework into a deep neural network, achieving satisfying results on image classification.

Adversarial learning methods have proven more robust when working with deep neural network models. In recent work by (Tzeng et al., 2017), a model called Adversarial Discriminative Domain Adaptation (ADDA) was proposed to combine discriminative modeling, untied weight sharing, and a GAN loss. Their results achieved state-of-the-art unsupervised adaptation results on a classic image classification task. We will provide more details in the following section. The mentioned related works mostly target CV applications, but such methods can also be adapted to NLP applications like sentiment classification.

For simplicity, we first define notation: indices s and t indicate the Source and Target domains. Source domain samples are X_s = {x_{s1}, x_{s2}, ..., x_{sn}} with labels Y_s = {y_{s1}, y_{s2}, ..., y_{sn}}; similarly, target domain samples are X_t = {x_{t1}, x_{t2}, ..., x_{tm}} with labels Y_t = {y_{t1}, y_{t2}, ..., y_{tm}}.
In our unsupervised transfer learning setting, Y_t is invisible to the model during training, but Y_s and Y_t share the same label set Z = {z_1, z_2, ..., z_l}. There are two main differences between source and target domain data. Firstly, the marginal data distributions are different: P(X_s) ≠ P(X_t). That means the features differ across the two domains, so a classifier trained on the source domain and applied directly to the target suffers a performance drop. Secondly, the sample class ratios in the training sets are different. We define the sample class ratio as the distribution of the ratio for each class: R_s = {r_{s1}, r_{s2}, ..., r_{sl}} for the source domain and R_t = {r_{t1}, r_{t2}, ..., r_{tl}} for the target domain. Note that this work focuses on a class imbalanced setting: in our experiments, for each domain, we choose the ratios to be different for each class. We also want a domain class imbalanced setting, which means R_s ≠ R_t.

Proposed by (Mikolov et al., 2013), word embeddings are dense word vectors that serve as representations preserving the semantics of the words. They can usually be pre-trained on free text without any labels. The dimension of the dense vectors is pre-defined, and the vectors are trained using the skip-gram or CBOW model (Mikolov et al., 2013). In this work, we pre-train the word vectors as the input to the other models.
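As a concrete illustration of the DCI setting, the sketch below subsamples a labeled pool into source and target splits with different class ratios, so that R_s ≠ R_t. The helper name, pool sizes, and ratios are hypothetical choices for illustration, not part of our experimental pipeline.

```python
import random

def subsample_to_ratio(samples, labels, pos_ratio, total, seed=0):
    """Subsample a binary-labeled dataset so that positives make up
    pos_ratio of `total` examples (hypothetical helper)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = int(total * pos_ratio)
    idx = rng.sample(pos, n_pos) + rng.sample(neg, total - n_pos)
    rng.shuffle(idx)
    return [samples[i] for i in idx], [labels[i] for i in idx]

# Source domain: balanced 50/50; target domain: 30/70, so R_s != R_t
X = [f"review {i}" for i in range(2000)]
Y = [1] * 1000 + [0] * 1000
Xs, Ys = subsample_to_ratio(X, Y, pos_ratio=0.5, total=200)
Xt, Yt = subsample_to_ratio(X, Y, pos_ratio=0.3, total=200)
```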
GANs (Goodfellow, 2016) have been widely applied to various applications and tasks, achieving promising results. A GAN is a generalized model containing two main components: a discriminator D and a generator G. The task of the generator G is to generate cases that fool the discriminator D into believing they come from the real data distribution P_data. The discriminator D, in turn, tries its best to recognize whether input cases come from P_data. The generator G takes noise variables p_z(z) as input to generate sample cases. Finally, we learn the distribution P_g of G, which we expect to be close to P_data. In general, this can be defined as a two-player minimax game with a value function V(D, G):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (1)

Figure 1: ADNN Model Framework, adapted from (Tzeng et al., 2017).

As mentioned before, in transfer learning tasks the data distributions are different. Since the source domain data can be obtained more easily, we have a good understanding of its distribution P(X_s). We apply a feature extractor M_s to the raw inputs, so the distribution becomes P(M_s(X_s)), which is known once the feature extractor M_s is well-learned with the provided labels. The question is then: can we learn a feature extractor M_t for the target samples while their labels are missing? In this section, we introduce the idea of applying a GAN to solve this issue. The base model we choose is the Adversarial Discriminative Domain Adaptation (ADDA) model proposed by (Tzeng et al., 2017), shown in Figure 1. The model was proposed to solve an unsupervised image classification task with images from two domains: digit images with colorful backgrounds (source images) and handwritten digit images without backgrounds (target images). The first step is to pre-train a classifier C using the source data, where labels are provided.
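The value function in Equation 1 can be estimated empirically from discriminator outputs. The sketch below is a minimal NumPy illustration, not training code; it shows that a fully fooled discriminator, D = 0.5 everywhere, yields V = 2 log 0.5.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of V(D, G) from Eq. (1):
    E[log D(x)] + E[log(1 - D(G(z)))], with discriminator outputs in (0, 1)."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A confident discriminator (D -> 1 on real, -> 0 on fake) drives V toward 0;
# a fooled discriminator (D = 0.5 everywhere) gives V = 2 * log(0.5).
v_fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```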
It is essential to learn the features of the training samples. We apply a Convolutional Neural Network (CNN) (Krizhevsky et al., 2012) model as a feature extractor M_s for the source domain before feeding into the classifier C. In this step, the loss function is a typical classification loss, so the optimization problem is:

min_{M_s, C} L_cls(X_s, Y_s) = −E_{(x_s, y_s)∼(X_s, Y_s)} Σ_{k=1}^{K} 1[k = y_s] log C(M_s(x_s))   (2)

The second step is the adversarial adaptation training process. The goal of this part is to fix the feature extractor for the source domain, M_s, learned in the first step, and learn the one for the target domain, M_t. The main idea is to use the GAN (Goodfellow, 2016) framework. The Target CNN, i.e., the feature extractor M_t for the target domain, acts like a generator that tries to generate samples similar to the source images, taking the original target images as input. A discriminator D is then set to determine whether the images come from the source or the target by outputting domain labels. In other words, we try to pull the distribution from the target, P(M_t(X_t)), toward the known distribution from the source, P(M_s(X_s)). In this step, M_t is learned, which acts like a mapping function from the target images to the source images, so that the pre-trained classifier C can be adapted to the target domain images. The optimization functions for the adversarial losses are given below; in this stage, M_s is fixed while learning M_t:

min_D L_advD(X_s, X_t, M_s, M_t) = −E_{x_s∼X_s}[log D(M_s(x_s))] − E_{x_t∼X_t}[log(1 − D(M_t(x_t)))]   (3)

min_{M_s, M_t} L_advM(X_s, X_t, D) = −E_{x_t∼X_t}[log D(M_t(x_t))]   (4)

The last step is the testing stage, where we predict labels for the target images. We take the learned target CNN feature extractor M_t from step two and the pre-trained classifier C from step one to perform classification.
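The two adversarial losses in Equations 3 and 4 can be sketched in PyTorch as below. The discriminator architecture, feature dimension, and batch size are assumptions for illustration, not the configuration used in the experiments.

```python
import torch
import torch.nn.functional as F

def adda_discriminator_loss(D, feat_s, feat_t):
    """Eq. (3): D should output 1 for source features and 0 for target
    features. feat_s = M_s(x_s) (M_s frozen), feat_t = M_t(x_t); the target
    features are detached so only D is updated by this loss."""
    loss_s = F.binary_cross_entropy(D(feat_s), torch.ones(feat_s.size(0), 1))
    loss_t = F.binary_cross_entropy(D(feat_t.detach()),
                                    torch.zeros(feat_t.size(0), 1))
    return loss_s + loss_t

def adda_mapping_loss(D, feat_t):
    """Eq. (4): M_t is trained so that D labels target features as source."""
    return F.binary_cross_entropy(D(feat_t), torch.ones(feat_t.size(0), 1))

# Toy usage with a linear discriminator over 8-d features (shapes are assumed)
torch.manual_seed(0)
D = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Sigmoid())
feat_s, feat_t = torch.randn(4, 8), torch.randn(4, 8)
d_loss = adda_discriminator_loss(D, feat_s, feat_t)
m_loss = adda_mapping_loss(D, feat_t)
```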
Inspired by the work of (Shen et al., 2018) and (Wang et al., 2019), it is possible to improve the model via instance re-weighting. The main idea is that we can give more weight to instances from the target domain if they are very similar to instances from the source domain. We apply a distance-based method to guide backpropagation during training. The original stochastic gradient descent (SGD) update is defined as:

θ ← θ − α ∇_θ J(θ; x_i, y_i)   (5)

where α is the learning rate, and the SGD step is applied to each mini-batch of k instances (we set k = 10 in all of our experiments). The distance-based backpropagation is then defined as:

θ ← θ − α Σ_{i=1}^{k} w_i ∇_θ J(θ; x_i, y_i)   (6)

where w_i is the weight of instance x_i. We now show how to calculate the instance weights. The distance-based backpropagation aims to give more weight to instances from the target and source domains that are very close to each other. We take the outputs of a mini-batch after the feature extractors, M_s(x_s) and M_t(x_t), and for each instance x_{i,t} from the target domain, the weight is defined as:

w_i = (1/τ) Dist(M_s(x_s), M_t(x_{i,t}))   (7)

where τ is a partition function that makes all weights within a single mini-batch sum to 1. We decide the distance metric between each target domain instance output (a vector) and the source domain output M_s(x_s) (a matrix) later. If the distance is small, we should give more weight to that instance. It is possible to apply other distance metrics as well. The distance-based backpropagation is applied only to the target feature extractor part; we keep the source feature extractor optimized with the original SGD.

The original ADDA model was proposed for image classification. As we target the text classification task, we choose word embeddings as the input and a CNN model to deal with the input sentences (Zhang and Wallace, 2015).
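A minimal sketch of such a CNN feature extractor over embedded text is given below. The 140-token truncation and 128-dimensional embeddings follow this work's setting, while the vocabulary size, filter count, and kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM = 140, 128  # truncation length and embedding size used here

class CNNFeatureExtractor(nn.Module):
    """Sketch of the M_s / M_t feature extractor: the embedded sentence is
    treated as a 1 x MAX_LEN x EMB_DIM image, followed by a convolution and
    a max-pooling layer (filter sizes are assumptions, not the exact ones)."""
    def __init__(self, vocab_size=5000, n_filters=100, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM)
        self.conv = nn.Conv2d(1, n_filters, (kernel, EMB_DIM))
        self.pool = nn.AdaptiveMaxPool2d((1, 1))

    def forward(self, token_ids):              # (batch, MAX_LEN)
        x = self.emb(token_ids).unsqueeze(1)   # (batch, 1, MAX_LEN, EMB_DIM)
        x = torch.relu(self.conv(x))           # (batch, n_filters, L', 1)
        return self.pool(x).flatten(1)         # (batch, n_filters)

ids = torch.randint(0, 5000, (2, MAX_LEN))
feats = CNNFeatureExtractor()(ids)
```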
As mentioned before, we truncate each sentence to 140 words, and each word embedding has a dimension of 128, forming a 2-D matrix as an image-like input to the CNN. As illustrated in Figure 2, after the word embeddings are fed into the model, there is a convolutional layer followed by a pooling layer. At the end, a fully-connected layer and a softmax layer predict the labels. When choosing the base model for the feature extractors M_s and M_t, it is also possible to apply Long Short-Term Memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), but we leave this out since the CNN has already proved to perform well.

Figure 2: Convolutional Neural Networks for Sentence Classification (adapted from (Zhang and Wallace, 2015)).

Our desired algorithmic result is a potential improvement over the existing ADDA model when testing with imbalanced training samples on a text classification job. The original experimental results by (Tzeng et al., 2017) did not include model performance for this particular setting. One possible idea is to apply weights to individual samples during the adversarial adaptation training process. More specifically, both E_{x_s∼X_s} and E_{x_t∼X_t} in Equation 3 can be adjusted by the class ratios. For example, if the ratio is 90/10 for Pos/Neg, we may add more weight to the samples of the Neg class for fairness and to prevent bias.

We adapted code from https://github.com/corenel/pytorch-adda, implemented in PyTorch; some bugs were fixed, and loading texts instead of images was implemented. The simple baseline results were implemented from scratch, using libraries including NLTK for natural language processing and scikit-learn for machine learning methods.

The goal is to finalize a classifier C that is able to classify the target domain data X_t. Average accuracy and F1 scores are reported over all classes. Accuracy is defined as the number of correctly classified samples divided by the total number of samples.
F1 score is defined as:

F1 = ((recall^{-1} + precision^{-1}) / 2)^{-1} = 2 · precision · recall / (precision + recall)   (8)

We also check the scores for each individual class. To be more specific, we define the accuracy for each class as the number of samples correctly classified by C divided by the total number of samples in that class. Similarly, we compare the F1 score for each class. By comparing the performance of different models, we expect an improvement in both accuracy and F1 score.

We performed experiments on the Multi-Domain Sentiment Dataset, which contains labeled reviews from four domains. Each domain contains 1000 positive and 1000 negative user reviews.

Method | Src acc | Tgt acc | Src f1 (p) | Src f1 (n) | Tgt f1 (p) | Tgt f1 (n)
LR | | | | | |
NB | 0.5938 | 0.5412 | 0.6011 | 0.5850 | 0.5314 | 0.5456
RF | 0.7446 | 0.6592 | 0.7569 | 0.7304 | 0.6784 | 0.6343

Table 1: Baseline methods
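The per-class accuracy and F1 definitions above can be computed in a few lines of plain Python; this is a sketch with toy labels, not our evaluation code.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Accuracy for each class: correct predictions whose true label is that
    class, divided by the number of samples in that class."""
    totals, correct = Counter(y_true), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    return {c: correct[c] / totals[c] for c in totals}

def f1(y_true, y_pred, positive):
    """F1 = 2 * precision * recall / (precision + recall), as in Eq. (8)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    pred_pos = sum(1 for p in y_pred if p == positive)
    true_pos = sum(1 for t in y_true if t == positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / pred_pos, tp / true_pos
    return 2 * precision * recall / (precision + recall)

acc = per_class_accuracy([1, 1, 0, 0], [1, 0, 0, 0])  # {1: 0.5, 0: 1.0}
```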
Simple Baseline Methods
We tried simple TF-IDF (term frequency–inverse document frequency) features as the input to three traditional classifiers: logistic regression (LR), random forest (RF), and naive Bayes (NB). Table 1 shows the baseline results with balanced training; note that these are average results. Since this method involves no adaptation process, the column Src acc gives the accuracy when training and testing on the source domain, while Tgt acc gives the accuracy when training on source but testing on target data. We also provide F1 scores for both the positive (marked as (p)) and negative (marked as (n)) class. From the results, we can see that logistic regression (LR) performs much better than the other two classifiers, so we compare the other model results against this method.

LR-discriminator Model
To compare with the ADDA model, it is better to keep the model structure shown in Figure 1. We therefore replaced the CNN feature extractor with a traditional linear network structure and kept the other layers. Specifically, the linear feature extractor is defined as:

x′ = xA^T + b   (9)

where the matrix A and the vector b are trainable parameters. We use this linear layer for feature extraction when the inputs are TF-IDF features, and each token is represented by its ID. To prevent overfitting, the dimensions of the layers shrink to smaller numbers. We show the results for balanced training in Table 3 (LR-Dis). The In column shows the results for training and testing in the source domain, the Out column gives results for training on source but testing on target, and Adapted is after adaptation, tested in the target domain. The columns f1(p) and f1(n) give the F1 scores for the positive and negative class for each method. However, this setting is not suitable for this task, as most of the accuracies are lower than random guessing (0.5). The bold values are the ones higher than random guessing, but there is no significant improvement. A possible reason is overfitting, as the training accuracy can reach 0.9.
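Equation 9 is exactly an affine layer. Below is a minimal PyTorch sketch of the linear feature extractor with shrinking dimensions; the layer sizes are assumptions for illustration, not the exact ones used.

```python
import torch
import torch.nn as nn

# Eq. (9), x' = xA^T + b, is an affine map; stacking two such layers with
# shrinking widths illustrates the overfitting countermeasure described above.
linear_extractor = nn.Sequential(
    nn.Linear(5000, 256),  # TF-IDF vocabulary -> compact features (assumed sizes)
    nn.ReLU(),
    nn.Linear(256, 64),
)

tfidf_batch = torch.rand(4, 5000)     # a toy batch of TF-IDF vectors
feats = linear_extractor(tfidf_batch)
```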
Deep Adaptation Adversarial Model
Results with the ADDA model are shown in Table 3, marked as ADDA in the method column. Similarly, we show the accuracy under different training settings: balanced training, 1:10 (pos:neg), 3:10, 5:10, and 7:10. Ideally, adapted results are higher than the out-of-domain results (bold values). We give in-domain and out-of-domain accuracy and F1 scores for both classes, as well as the results after adaptation.
Optimizer and Distance Metric
To select a better distance metric for calculating the weights in Equation 7 and a good optimizer, we compare Euclidean distance (EC) and cosine similarity (Cosine) in Table 2, as well as two optimizers: stochastic gradient descent (SGD) (Bottou, 2010) and Adam (Kingma and Ba, 2014). The table shows that applying the Adam optimizer and cosine similarity yields slightly better transferred results, with higher accuracy and F1 scores for both the positive and negative class. For comparison, we only show results for the experimental setting transferring from the dvd domain to the other three domains.
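The two candidate metrics can be plugged into the Equation 7 weighting as sketched below. Here each target instance is scored against the mean source feature, and a softmax plays the role of the partition function τ; both are simplifying assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def batch_weights(feat_s, feat_t, metric="cosine"):
    """Sketch of Eq. (7) instance weights under the two compared metrics.
    Each target instance gets a score against the mean source feature;
    scores become positive weights summing to 1 within the mini-batch."""
    center = feat_s.mean(dim=0, keepdim=True)
    if metric == "cosine":
        score = F.cosine_similarity(feat_t, center)      # higher = closer
    else:  # Euclidean: negate so that closer instances score higher
        score = -torch.cdist(feat_t, center).squeeze(1)
    return torch.softmax(score, dim=0)

torch.manual_seed(0)
fs, ft = torch.randn(10, 8), torch.randn(10, 8)   # toy mini-batch features
w_cos = batch_weights(fs, ft, "cosine")
w_ec = batch_weights(fs, ft, "ec")
```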
Our Model
We then apply cosine similarity and the Adam optimizer as the final choice in our proposed model. Compared with the other baseline models in Table 3, our model (Our) performs better in the adapted results than all the other models.

We now compare the F1 scores of both the positive and negative classes in Figure 3, where the x-axis shows the class with its ratio group and the y-axis gives the F1 scores. For example, Pos gives the positive class under the ratio of 1:10, and the Neg beside it indicates the negative class with the same ratio. We can see that, in a reasonable training setting, especially 3:10, 5:10 and 7:10, our proposed model has a better F1 score on the minor class and nearly the same F1 score on the other class.

Source | Target | In | f1(p) | f1(n) | Out | f1(p) | f1(n) | Adapted | f1(p) | f1(n)
Cosine SGD
dvd | books | 0.725 | 0.7208 | 0.7291 | 0.688 | 0.7006 | 0.6743 | 0.5205 | 0.2696 | 0.6431
dvd | kitchen | 0.77 | 0.7745 | 0.7653 | 0.6665 | 0.6929 | 0.6352 | 0.5123 | 0.3345 | 0.6151
dvd | electronics | 0.78 | 0.78 | 0.78 | 0.6386 | 0.6478 | 0.629 | 0.5265 | 0.3997 | 0.6091
Cosine Adam
dvd | books | 0.7 | 0.6939 | 0.7059 | 0.7055 | 0.6892 | 0.7202 | | |
EC SGD
dvd | books | 0.705 | 0.6811 | 0.7256 | 0.708 | 0.6949 | 0.72 | 0.5505 | 0.5773 | 0.52
dvd | kitchen | 0.705 | 0.6776 | 0.7281 | 0.6585 | 0.6268 | 0.6852 | 0.5696 | 0.5405 | 0.5952
dvd | electronics | 0.775 | 0.7805 | 0.7692 | 0.6687 | 0.6927 | 0.6406 | 0.5295 | 0.4731 | 0.575
EC Adam
dvd | books | 0.695 | 0.6115 | 0.749 | 0.65 | 0.5308 | 0.7209 | 0.65 | 0.5308 | 0.7209
dvd | kitchen | 0.745 | 0.7437 | 0.7463 | 0.666 | 0.6549 | 0.6764 | | |
Table 2: Comparison of SGD and Adam, Euclidean distance and Cosine similarity.
Method | Ratio | In | f1(p) | f1(n) | Out | f1(p) | f1(n) | Adapted | f1(p) | f1(n)
Baseline | 10:10 | 0.7211 | 0.7262 | 0.7149 | 0.6433 | 0.6460 | 0.6363 | - | - | -
LR | | 0.5150 | 0.4991 | 0.5261 | 0.5047 | 0.4928 | 0.5130 | 0.5033 | 0.4926 | 0.5103
ADDA | | 0.7592 | 0.7535 | 0.7612 | 0.6660 | 0.6738 | 0.6440 | | |
Our | | 0.7604 | 0.7517 | 0.7668 | 0.6703 | 0.6723 | 0.6625 | 0.6703 | 0.6724 | 0.6626
Baseline | 1:10 | 0.5044 | 0.6666 | 0.0337 | 0.5032 | 0.6662 | 0.0279 | - | - | -
LR | | 0.5067 | 0.6521 | 0.1489 | 0.5015 | 0.6495 | 0.1354 | 0.5017 | 0.6516 | 0.1236
ADDA | | 0.5000 | 0.6667 | 0.0000 | 0.5007 | 0.6673 | 0.0000 | 0.5007 | 0.6673 | 0.0000
Our | | 0.5000 | 0.6667 | 0.0000 | 0.5007 | 0.6673 | 0.0002 | 0.5007 | 0.6673 |
Baseline | 3:10 | 0.5399 | 0.6750 | 0.1953 | 0.5208 | 0.6633 | 0.1474 | - | - | -
LR | | 0.5038 | 0.6151 | 0.2990 | 0.5009 | 0.6120 | 0.2989 | 0.5014 | 0.6171 | 0.2834
ADDA | | 0.5275 | 0.6789 | 0.1015 | 0.5155 | 0.6732 | 0.0625 | 0.5104 | 0.6712 | 0.0418
Our | | 0.5279 | 0.6794 | 0.1022 | 0.5213 | 0.6751 | 0.0834 | | |
Baseline | 5:10 | 0.6046 | 0.6905 | 0.4372 | 0.5608 | 0.6663 | 0.3303 | - | - | -
LR | | 0.5025 | 0.5740 | 0.3982 | 0.5055 | 0.5766 | 0.4025 | 0.5029 | 0.5791 | 0.3902
ADDA | | 0.6338 | 0.7280 | 0.4343 | 0.5850 | 0.6980 | 0.3241 | 0.5637 | 0.6919 | 0.2481
Our | | 0.6296 | 0.7267 | 0.4163 | 0.5787 | 0.6960 | 0.3084 | | |
Baseline | 7:10 | 0.6907 | 0.7262 | 0.6392 | 0.6077 | 0.6710 | 0.4968 | - | - | -
LR | | 0.5146 | 0.5592 | 0.4583 | 0.5105 | 0.5494 | 0.4624 | 0.5071 | 0.5430 | 0.4600
ADDA | | 0.7129 | 0.7646 | 0.6267 | 0.6258 | 0.6978 | 0.4824 | 0.6363 | |
Table 3: Final results.

Under a fair training setting, the ADDA model performs well in both classes. However, given such a small number of training and testing examples, the distance metric may not be accurate, especially with a small batch size, which may affect the computed gradients.
Figure 3: A comparison of F1 scores among different ratio groups.

We have shown that our proposed distance-based backpropagation model has the ability and the potential to prevent class bias: the more unbalanced the training data, the better our model performs. However, our results also show that directly applying the method works on small datasets, while small datasets without pre-trained word embeddings tend to overfit easily. Future work is to extend our method to larger-scale datasets, and more analysis can be done on other optimizers. Besides, more work can be done on the unbalanced training issue. Our proposed approach could be applied in the following manner: we assign a larger weight w_i to minor-class instances, and Equation 7 then becomes:

w_i = n_p / (n_p + n_n)   if y_i is negative
w_i = n_n / (n_p + n_n)   if y_i is positive   (10)

In this case, we apply Equation 6 to both the source and target feature extractors.

Acknowledgements
We thank Professor Nisheeth Vishnoi for useful discussions.
References
Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer.

Ian Goodfellow. 2016. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105.

Bac Le and Huy Nguyen. 2015. Twitter sentiment analysis using machine learning techniques. In Advanced Computational Methods for Knowledge Engineering, pages 279–289. Springer.

Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. 2014. Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417.

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. 2010. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210.

Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. 2018. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176.

Zhi Wang, Wei Bi, Yan Wang, and Xiaojiang Liu. 2019. Better fine-tuning via instance weighting for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7241–7248.

Karl R. Weiss and Taghi M. Khoshgoftaar. 2016. Investigating transfer learners for robustness to domain class imbalance. In , pages 207–213. IEEE.

Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.