Distance Based Source Domain Selection for Sentiment Classification
LEX ELIAS RAZOUX SCHULTZ, MARCO LOOG, PEYMAN MOHAJERIN ESFAHANI
Delft Center for Systems and Control and Pattern Recognition Laboratory, TU Delft, The Netherlands
August 29, 2018
Abstract.
Automated sentiment classification (SC) on short text fragments has received increasing attention in recent years. Performing SC on unseen domains with few or no labeled samples can significantly affect the classification performance due to the different expression of sentiment in the source and target domain. In this study, we aim to mitigate this undesired impact by proposing a methodology based on a predictive measure, which allows us to select an optimal source domain from a set of candidates. The proposed measure is a linear combination of well-known distance functions between probability distributions supported on the source and target domains (e.g. Earth Mover's distance and Kullback-Leibler divergence). The performance of the proposed methodology is validated through an SC case study in which our numerical experiments suggest a significant improvement in the cross domain classification error in comparison with a randomly selected source domain, for both a naive and an adaptive learning setting. In the case of more heterogeneous datasets, the predictability feature of the proposed model can be utilized to further select a subset of candidate domains, where the corresponding classifier outperforms the one trained on all available source domains. This observation reinforces a hypothesis that our proposed model may also be deployed as a means to filter out redundant information during a training phase of SC.

1. Introduction
Automated SC is performed when a trained classifier is used to label documents with a sentiment label, based on the content of the document. In practice, a document represents a review, tweet or other small textual opinion expression and ought to be classified as either positive or negative (binary SC). Allied terminology for SC is semantic orientation or polarity, which indicates the direction in which a word deviates from the norm of its semantic group [17]. Often, the terms opinion mining and sentiment analysis are used to describe the computational treatment of opinion, sentiment and subjectivity in text [30]. Examples of current applications that benefit from SC are recommendation systems [39], stock market prediction systems [7] and political election predictors [40]. Also in the future, we could imagine SC put in practice to improve man-machine communication such as voice commands or even interaction between robots and humans.

One of the biggest challenges in many SC applications is the discrepancy between the source domain on which the classifier is trained and the target domain of interest. Typically, we have large amounts of labeled data from different source domains, but only possess few, unlabeled data from the target domain, both originating from different underlying distributions. In this setting, SC generally performs rather poorly compared to inner domain training and testing. This so-called domain transfer problem can be intuitively explained by the different fashions of sentiment expression in different domains [38]. A word may have different sentiment polarity with respect to the domains it is used in, or different words may be used to express similar sentiment among different domains [46].

Various research has been performed with the aim of decreasing the loss in classification performance due to domain shift. This loss is referred to as the adaptation loss [18] [5]. One of the popular transfer learning techniques to maintain performance of SC when crossing domains is Structural Correspondence Learning, which uses pivot features as a link between the source and target domain [6] [22]. Many other techniques are based on optimal transport, which matches the conditional probability distributions of the training and target domains by a transformation of the source domain data [10] [20].
The distance between the source and target distribution is minimized [9]. Often the distance is minimized in a lower dimensional embedded space, for example when performing Subspace Alignment [13] or Transfer Component Analysis [29].

In recent years, the use of neural models for adaptation learning has proven successful. The ability of a neural model to be trained on both source and unlabeled target data has been leveraged to improve performance of the cross domain classification task. Various approaches have been proposed to adapt the neural net towards superior classification performance in a given target domain. In the field of SC, Domain-Adversarial Neural Networks (DANNs) have been effectively used in such a context [14]. DANNs improve performance by augmenting a feed-forward model with a few standard layers and a new gradient reversal layer. Due to this augmentation, the model promotes features that are discriminative on the class label and indiscriminative regarding a shift in domain. Other research is based on the use of Stacked Denoising Auto-encoders (SDAs) [8] [15]. An SDA can be used to learn a higher level feature extraction by using unlabeled data from the source and target domain. The feature extraction transforms the feature space, and subsequently a classifier is trained on this transformed feature space. Other popular techniques with much resemblance to SDAs are the word2vec [28] and doc2vec [24] algorithms.

One may invoke other techniques in combination with adaptation learning to reduce the adaptation loss, e.g., unsupervised source domain selection. Instead of focusing on the type of classifier or the underlying representation of the data, these techniques primarily revolve around preselecting an appropriate training dataset. The source domain selection method opts to select one or multiple optimal source domains from a set of candidate domains without supervision, i.e., it selects the source domains that have the best cross domain classification performance compared to the other candidates without the use of class labels of the target domain data. Source domain selection has been successfully deployed in other contexts, for example to improve brain-computer interface calibration [44]. In this case, supervised source domain selection based on the distance between the class average vectors has been used. Alternatively, ranking convolutional neural networks trained on different source domains for a target task shows success when using the mutual information between domains [1].

Source domain selection is a higher level of selection than instance selection, as it chooses a set of instances from the same underlying domain rather than individual instances. It is based on the assumption that inner domain information is more informative and therefore yields an improved performance when used for training. That raises the question of what constitutes a domain. It is argued that there is no common ground in what constitutes a domain, but in NLP the term is typically used to refer to some coherent data with respect to topic or genre [34].
However, there are numerous other factors that should be taken into consideration, e.g., the medium on which the domain data is exposed, the time spirit, and the purpose of the data.

In order to find an appropriate source domain, it was recently proposed to represent a domain as an average of its document representations in a pre-trained embedded space [36]. Additionally, that study performs the same experiments using denoising auto-encoder representations trained on all source data instead of a pre-trained embedding. They conclude that using a pre-trained embedded space for domain representation is not necessarily recommended. Their approach that relies on denoising auto-encoders is shown to be effective in selecting source domains, but only when large training sets are available.

More common approaches for source domain selection use domain similarity metrics. One of these metrics, developed for assessing the potential of transfer learning, is the A-distance [4]. In the field of source domain selection for SC, Blitzer et al. show a good correlation between the A-distance and the adaptation loss [5]. The adaptation loss is measured using the target domain's inner domain classification error, which unfortunately requires labeled data of the target domain. The concept of A-distance is also used in recent research on orthophoto classification, where it is approximated by the Maximum Mean Discrepancy [43]. For a more open search space for a source domain, techniques have been researched to find a suitable source domain in open online information sources such as Wikipedia [45]. However, this technique also uses labeled information of the target domain.
Another similarity metric used in source domain selection is the Kullback-Leibler divergence (KL) [23]. This metric may not be well defined in SC, and therefore the data requires smoothing. Alternatively, approximations to KL, such as the Jensen-Shannon divergence [26] or the skew divergence [25], can be used to determine the distance between domains. These divergence indicators have proven their ability to select a suitable source domain and thus improve the classification performance of cross domain classification tasks [35].

In this article, we propose a new method, CMEK, for source domain selection that does not require any labeled information of the target domain. The CMEK method trains a source domain selection model by means of a set of given distance metrics. This approach differs from other state-of-the-art approaches as it proposes a linear combination of distance metrics, where the weights are determined by training on the available source domains, instead of a single metric for source domain selection. We emphasize that the technique proposed here can be utilized in combination with adaptation learning in a sequential manner, which potentially improves the performance of our domain transformation task.

In Section 2, we define the problem of source domain selection in general, that is, not in the context of SC. Section 3 proposes an approach to predict the performance of a classifier trained on a source domain and applied to a target domain of interest whose labels are unknown. This prediction is based on statistical distances between the distributions of the source and target data. To this end, we leverage several statistical measures to determine the distance. Section 4 addresses certain limitations of our proposed model, starting with fundamental limitations that occur in all classification tasks and ending with limitations of our approach within the task of SC. In Section 5 we present background knowledge on how to perform sentiment classification by using a bag-of-words model, as we do in our experiments. Section 6 describes the experimental setup and model settings for two different types of datasets: a homogeneous one where the datasets share similar contents and a heterogeneous one that contains datasets from very diverse topics and sources. In this article, our homogeneous set is the DRANZIERA benchmark dataset [12]. In addition, we describe how we run experiments in an adaptation learning setting. Section 7 reports the numerical results in detail, and finally in Section 8 we conclude with our reflection on the results and final messages along with suggestions for future work.
2. Problem Definition
With a document represented by a random variable X and its sentiment label by a random variable Y, each domain is uniquely characterized by its joint probability density function supported on X × Y. Let us then define two underlying distributions for the evaluated source and target data on X × Y, denoted by P and P̄ respectively. Their marginals on X are denoted by P_x and P̄_x respectively. In machine learning, a realization of P is referred to as labeled data, whereas realizations from P_x are referred to as unlabeled data.

We define a hypothesis function as a mapping from X to Y, h : X → Y. In the context of classification problems in the machine learning literature, the set Y is often discrete, often binary, and the hypothesis h is often referred to as a classifier, as it is able to assign a class label in Y to each data point in X. The true hypothesis function h*_P for a domain characterized by P is defined as

(1)    h*_P := arg min_h Prob( h(x) ≠ y | (x, y) ∼ P ).

We then define the inner domain classification error as the probability of misclassification in the same domain as the one in which the true hypothesis function is constructed, and the cross domain classification error as the probability of misclassification in the target domain of the true hypothesis function of the source domain:
(2)    ξ(P) := Prob( h*_P(x) ≠ y | (x, y) ∼ P ),
       ξ(P, P̄) := Prob( h*_P(x) ≠ y | (x, y) ∼ P̄ ).

Note that due to the stochastic nature of sentiment expression, ξ(P) and ξ(P, P̄) are not likely to be zero, even if we had infinite data to train on. In addition, due to the difference between the source and target distribution, ξ(P, P̄) is likely to be higher than ξ(P) [5]. The difference between the cross and inner domain classification error is referred to as the adaptation loss.

The goal in many classification tasks is to minimize the cross domain classification error (2) for a specific target domain while we have access to a finite set of labeled candidate source domains. An appropriate choice of the source domain in this context is the main goal of this study, leading to the following question: What source domain minimizes the cross domain classification error for a given target domain?
With the choice of source domain restricted to a domain characterized by a probability density function in the candidate set 𝒫, our main goal can be formally described through the optimization program

(3)    P* := arg min_{P ∈ 𝒫} ξ(P, P̄),

where ξ(P, P̄) is the cross domain classification error introduced in (2). In other words, when we possess a finite number of labeled datasets, each originating from a unique source domain, on which set should we train our classifier in order to minimize the classification error on data from a specific target domain?

The challenge concerning the objective (3) is that for a target domain of interest we typically only have unlabeled data, and therefore we cannot calculate the cross domain classification error. This means we only know the marginal distribution P̄_x of our target domain instead of P̄, making it impossible to explicitly calculate ξ(P, P̄). In order to find the best source domain characterized by P* ∈ 𝒫 for our target domain with distribution P̄, we ought to predict ξ(P, P̄) for all P ∈ 𝒫. For this prediction we have full distributional information of the source domains, P, and marginal information on the distribution of the target domain, P̄_x, available.

3. Method
In this section, we propose a model, called CMEK source domain selection (an acronym of the four measures we chose in our case study), to deal with the previously stated challenge. The model measures statistical distances between the marginal distributions of candidate source domains and the target domain and uses the inner domain classification error of the candidate source domains. The candidate source domain with the lowest distance is hypothesized to have the lowest cross domain classification error and is selected to train a classifier. This section describes the methodology, the distance measures we consider, and how we evaluate the performance of the model.

We consider a set D containing a family of distance functions d : 𝒫 × P̄_x → R_+. We now hypothesize that, given an available set of source domain distributions 𝒫 and a target domain distribution denoted by P̄ with marginal P̄_x, there exists a d̂ ∈ D that can reliably predict the cross domain classification error ξ(P, P̄), that is,

(4)    ξ(P, P̄) ≈ d̂(P, P̄_x).

More specifically, we hypothesize that the cross domain classification error can be predicted by looking at the difference between the marginal distribution functions P_x and P̄_x, and at ξ(P). With this prediction, the best candidate source domain can be selected for training. We formalize this optimal choice of the measure by considering the optimization problem

(5)    d̂ := arg min_{d ∈ D} Σ_{P ∈ 𝒫} | ξ(P, P̄) − d(P, P̄_x) |.

To construct the family D of candidate measures, we use a vector s with K known statistical distance measures as elements, s_i : P_x × P̄_x → R_+ for i ∈ {1, ..., K}. The family D consists of linear combinations of these measures s_i with corresponding weight coefficients β_i ∈ R_+, with β_i being elements of the weight vector β. We augment the linear combination with a constant. Note that the distance measures are functions supported on the marginal distributions of source and target domain. Examples of such statistical distance metrics are Integral Probability Metrics and φ-divergences. When a source domain has poor inner domain classification performance, we expect it to be less suitable as a source domain for cross domain classification. Therefore, we calculate the inner domain classification error for each candidate source domain and add it to our vector s. With our choice for D, the optimization problem (5) can be reduced to

(6)    β̂ := arg min_{β ≥ 0} Σ_{P ∈ 𝒫} | ξ(P, P̄) − β^T s(P, P̄_x) |,    d̂ := β̂^T s(P, P̄_x).

Note that program (6) constructs the optimal unsupervised predictor by using the full distribution of the target data, which is presumed to be unknown. In a practical setting with a finite number of elements in 𝒫, we deal with this problem by training only on the domains in 𝒫. In each run, one of the elements of 𝒫 is extracted from the set and used as a proxy for P̄; at the end of the run, this element, denoted as P̃ with marginal P̃_x, is placed back in 𝒫. In every run, another element is extracted and placed back for training. This training program is described as

β̂ := arg min_{β ≥ 0} Σ_{P̃ ∈ 𝒫} Σ_{P ∈ 𝒫, P ≠ P̃} | ξ(P, P̃) − β^T s(P, P̃_x) |.
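As an illustration of this training program, the sketch below fits the non-negative weight vector β by least absolute deviations over all ordered pairs of candidate source domains, formulated as a linear program. The callables distance_vector (returning s(P, P̃_x)) and cross_error (returning ξ(P, P̃)) are hypothetical placeholders for the distance computations and classifier training described later; this is a minimal sketch of the idea, not the implementation used in our experiments.

import numpy as np
from scipy.optimize import linprog

def fit_cmek_weights(S, xi):
    # Least absolute deviations with non-negative weights:
    # minimize sum_j |xi_j - S_j . beta| subject to beta >= 0,
    # written as a linear program over [beta (k entries), t (m entries)].
    m, k = S.shape
    c = np.concatenate([np.zeros(k), np.ones(m)])
    A_ub = np.block([[S, -np.eye(m)], [-S, -np.eye(m)]])
    b_ub = np.concatenate([xi, -xi])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (k + m))
    return res.x[:k]

def train_cmek(domains, distance_vector, cross_error):
    # Leave-one-source-domain-out training: every domain acts once as the proxy target.
    rows, targets = [], []
    for p_tilde in domains:
        for p in domains:
            if p is p_tilde:
                continue
            rows.append(distance_vector(p, p_tilde))   # s(P, P~_x), including xi(P) and the constant
            targets.append(cross_error(p, p_tilde))    # xi(P, P~)
    return fit_cmek_weights(np.asarray(rows), np.asarray(targets))

def select_source(domains, target, beta, distance_vector):
    # Pick the candidate with the lowest predicted cross domain error.
    scores = [float(np.dot(beta, distance_vector(p, target))) for p in domains]
    return domains[int(np.argmin(scores))]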
We hypothesize that, with a sufficient number of source domain distributions in 𝒫, the constructed predictor for the cross domain classification error, based on information from 𝒫, is also reasonably accurate in predicting the cross domain classification error for a target domain not included in 𝒫. This allows us to select the best source domain to train on for this unseen target domain.

Now, in view of our hypothesis (4), with the predictor d̂ we are able to approximate ξ(P, P̄) with only marginal information of the target domain P̄_x, which allows us to deal with the challenge through an approximation. This enables us, given a target domain with distribution P̄, to select the source domain, characterized by P̂, from the set of available source domains that is predicted to minimize the cross domain classification error ξ(P, P̄), as formalized through the optimization program

P̂ := arg min_{P ∈ 𝒫} d̂(P, P̄_x),

which is the goal of this study.

3.1. Measures
In the SC case study, we hypothesize that the way people express sentiment can be captured in how often certain words and word combinations are used. The underlying joint distributions P differ among different domains. To accurately classify sentiment for a target domain, we therefore ought to find a domain with a similar underlying distribution function. To this end, one may need to measure the similarity between different distributions. In this article we consider two classes of distance measures, Integral Probability Metrics (IPMs) and φ-divergences, both mapping a pair of distributions on X to R. Both classes compare the distributions of the features, but in different fashions.

3.1.A. Integral Probability Metrics
The main characteristic component of IPM distance functions is the set F that contains a family of functions f : X → R. Using F, we define the distance between two distributions P_x and P̄_x as

s_i(P_x, P̄_x) := sup_{f ∈ F} | E_{P_x}[f(X)] − E_{P̄_x}[f(X)] |.

Different choices of the family F give rise to different distance functions. A popular choice is to limit the search space to the unit ball in a predefined metric space. In this article, we consider two choices: (i) the unit ball in a Reproducing Kernel Hilbert Space, which corresponds to the Maximum Mean Discrepancy (MMD) distance function [16]; (ii) the unit ball of Lipschitz functions, which corresponds to the Earth Mover's Distance (EMD) function [2].

More intuitively, the EMD metric measures the minimum transportation cost to transform one distribution into another. The transportation cost, roughly speaking, amounts to the effort, measured in a so-called ground metric, that is required to move a unit of probability mass. The MMD metric finds a well behaved, smooth function which is typically high on the points drawn from P_x and low on P̄_x, or vice versa.
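For concreteness, a biased empirical estimate of the MMD with the Gaussian RBF kernel adopted later in the experiments can be sketched as follows; the rows of X and Y are assumed to be document feature vectors sampled from the source and target corpora, and the bandwidth sigma is left as a free parameter. This is an illustration only, as the experiments rely on existing implementations.

import numpy as np

def rbf_kernel(A, B, sigma):
    # Gaussian RBF kernel matrix between the rows of A and B.
    sq_dists = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq_dists / sigma)

def mmd(X, Y, sigma=1.0):
    # Biased empirical estimate of the Maximum Mean Discrepancy between two samples.
    k_xx = rbf_kernel(X, X, sigma).mean()
    k_yy = rbf_kernel(Y, Y, sigma).mean()
    k_xy = rbf_kernel(X, Y, sigma).mean()
    return float(np.sqrt(max(k_xx + k_yy - 2.0 * k_xy, 0.0)))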
3.1.B. Phi-divergence

For the second class of distance measures, we consider the φ-divergence between two distributions P_x and P̄_x, defined as

s_i(P_x, P̄_x) := ∫_X φ( dP_x / dP̄_x ) dP̄_x   if P_x ≪ P̄_x,   and +∞ otherwise,

where φ is a convex function, and P_x ≪ P̄_x denotes that P_x is absolutely continuous with respect to P̄_x [37]. A widely used choice is φ(t) = t log(t), resulting in the Kullback-Leibler Divergence (KLD). Another popular choice is φ(t) = (t − 1)², which is based on Pearson's χ² test statistic. This choice gives the χ²-divergence (Chi2) distance measure. An interesting property of these two divergences is that they are asymmetric, s_i(P_x, P̄_x) ≠ s_i(P̄_x, P_x).

Since we hypothesize that the way people express sentiment can be captured in how often certain words and word combinations are used, regardless of the context, we are left with a broad spectrum of possibilities to optimize over. We propose to use a linear combination of the two IPM approaches MMD and EMD, the two φ-divergences KLD and Chi2, a constant representing an offset error, and the inner domain classification error of the source domain.

In most practical cases, we only have access to the true probability distribution through a finite number of observations. For that reason, in order to evaluate these discrete distributions, the integrals are replaced by summations. Selection based on the linear combination of these four metrics, the constant, and ξ(P) is referred to as the CMEK selection model:

d(P, P̄_x) = β_1 Chi2(P_x, P̄_x) + β_2 MMD(P_x, P̄_x) + β_3 EMD(P_x, P̄_x) + β_4 KLD(P_x, P̄_x) + β_5 ξ(P) + β_6.
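The two φ-divergences can be evaluated directly on the smoothed word-frequency histograms of the source and target corpora. The sketch below assumes the histograms are aligned count vectors over the same feature set and applies additive smoothing before normalization (the smoothing constant mirrors the value reported in the design choices of Section 6); the overall CMEK prediction is simply the weighted sum of the six entries of s.

import numpy as np

def smooth(counts, lam=0.05):
    # Additive smoothing followed by normalization to a probability distribution.
    p = np.asarray(counts, dtype=float) + lam
    return p / p.sum()

def kl_divergence(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def chi2_divergence(p, q):
    # Pearson chi-squared divergence between discrete distributions.
    return float(np.sum((p - q) ** 2 / q))

def cmek_predictor(s, beta):
    # s = [Chi2, MMD, EMD, KLD, inner domain error, 1.0]; the prediction is the weighted sum.
    return float(np.dot(beta, s))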
3.2. Performance evaluation

We have constructed a model that predicts the cross domain classification error between a candidate source domain and a target domain with a linear combination of statistical distance measures based on the marginal distributions P_x and P̄_x, a constant, and ξ(P). Now we are interested in how well this model is able to identify the best candidate source domain, i.e. the one that gives the lowest cross domain classification error ξ(P, P̄). For n_c acquired datasets, let us use one corpus as the target domain, P̄, and the others as candidate source domains, denoted by the set 𝒫'. We can repeat the experiment n_c times by using a different target domain each time and average the results.

We let the model select the predicted best source domain P̂ ∈ 𝒫' and use this domain for training. To evaluate performance, we define the relative cross domain classification error as

(7)    ξ_relative(P̂, P̄) = ξ(P̂, P̄) − ξ(P*, P̄),

where P* is the domain in 𝒫' which we know has the lowest cross domain classification error, the true best domain. When the model selects the best source domain, the relative error will therefore be zero. We compare the distribution of this relative error when using the CMEK model with randomly selecting one source domain for training. Also, we look at the absolute cross domain classification error, the probability of selecting the best domain, and the probability of selecting one of the five worst domains when using the CMEK model. We compare this with randomly selecting a domain and with an optimal selection model.

Next, we let our proposed CMEK model select the best n domains from 𝒫' and compare ξ(∪_{i=1}^{n} P̂_i, P̄) with training on all domains in 𝒫' and with randomly selecting n domains. All results are tested for significance using a paired t-test over the results of 13 runs; we use a significance cutoff of p = 0.05.
The paired t-test is strongly dependent on the assumption that the pairwise differences are normally distributed. For every comparison between results, we evaluated the normality assumption. We report only on significant results for which the normality assumption is confirmed.

To test whether our methodology creates added value when used in an adaptive learning setting, we let our model select the predicted best source domain and evaluate the cross domain classification error when using the DANN adaptive learning setting described in the recent study [14]. As a benchmark, the result is compared with randomly selecting a source domain in the same training and test setting.
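The evaluation protocol can be summarized in a few lines. In the sketch below, select_fn stands for any selection strategy (CMEK or random) and cross_error for the cross domain classification error of a classifier trained on one domain and tested on another; both are hypothetical callables introduced for illustration.

def evaluate_selection(domains, select_fn, cross_error):
    # Each domain acts once as the target; the remaining ones are the candidates P'.
    relative_errors = []
    for target in domains:
        candidates = [p for p in domains if p is not target]
        errors = {p: cross_error(p, target) for p in candidates}
        chosen = select_fn(candidates, target)
        best = min(errors.values())                      # error of the true best source domain
        relative_errors.append(errors[chosen] - best)    # relative error (7)
    return relative_errors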
4. Limitations
In solving the problem as defined in Section 2 with our proposed method, we encounter limitations of various kinds. We first discuss a fundamental statistical limitation that is present in almost all practical machine learning problems. Then we discuss a limitation of a computational nature and how to deal with it. In the last paragraph we elaborate on limitations that are more specific to the case study of SC: numerically representing text and the limited hypothesis space. We end by briefly explaining the limitations of our proposed CMEK model.
Statistical limitations
In the first place, and inherent to all classification problems, we are unable to retrieve the true distribution over the support set X × Y in practical settings. Instead, we need to infer this distribution through only a finite number of observations. In this case a natural approximation for the distribution is the empirical distribution supported on these observations, which are the available documents in our SC case study. We then encounter an inevitable approximation

P ≈ (1/n_s) Σ_{i=1}^{n_s} δ_{(x_i, y_i)},

where n_s denotes the number of documents in the domain sample. Due to the approximation of the true distribution, the hypothesis function (1) is suboptimal, resulting in a higher classification error. To evaluate this error, the available data originating from one domain is split into a train and a test set. The train set is used to construct h*_P as defined in (1). The test set is used to calculate the inner domain classification error. When using sufficient data in the test set, a large difference between the apparent and true error indicates that the observations used for training are a poor support for approximating the true distribution, since it means that the empirical distributions from the train and test set are not much alike. When this is the case, we are overfitting: the hypothesis function works for the data it is constructed on, but cannot be properly generalized to perform well on new data from the same underlying distribution. To reduce overfitting, we ought to use a sufficient number of training objects and a low, but not too low, number of features.

Computational limitations
Another limitation is that the optimization in (1) uses the indicator function as loss function, which is non-convex. For computational reasons, we replace this loss function with a convex counterpart. For the true hypothesis function h*_P in (1), a common convex loss function ℓ(x, y) to minimize is the quadratic error of the prediction, popular as the method of least squares,

(8)    h*_P = arg min_h Σ_{i=1}^{n_s} ℓ(x_i, y_i),    ℓ(x, y) := ( h(x) − y )²,

and the cross domain classification error ξ(P, P̄) is then defined by the empirical distribution of the target and the hypothesis function from (8),

(9)    ξ(P, P̄) := (1/n_s) Σ_{i=1}^{n_s} 1{ h*_P(x_i) ≠ y_i },    (x_i, y_i) ∼ P̄.
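As an illustration (and not the exact classifier of the case study, which uses logistic regression), the convex surrogate (8) and the empirical error (9) can be written out for a linear least-squares hypothesis as follows.

import numpy as np

def fit_least_squares(X, y):
    # Linear hypothesis minimizing the squared loss (8); a bias column is appended to X
    # and y contains 0/1 labels.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def empirical_error(w, X, y, threshold=0.5):
    # Empirical (cross domain) classification error (9) on a labeled sample.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    predictions = (Xb @ w > threshold).astype(int)
    return float(np.mean(predictions != y))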
Further discussion on SC related limitations

There are limitations concerned with performing SC whose nature may also be relevant to other fields of application. Choosing a word representation model is a field of study by itself and should be considered carefully here. In the previous section we assumed a true hypothesis function h* mapping each document represented in X to a label in Y. For computational purposes, we need to numerically express documents in the X space. Let us call W the set of all words, W = {w_1, w_2, ..., w_{n_w}}, where n_w is the number of unique words. Then, documents in X are combinations of these words, indicating that each element of X can be viewed as an element of the power set of W, denoted by ℘(W). Note that repetition in the subset is possible. For computational feasibility, we choose a subset F ⊂ ℘(W) to represent our documents. The subset F is called the feature set, whose cardinality is denoted by |F|. The feature selection process depends on what representation model is at hand. The representation model F maps X to R^{|F|}, i.e. F : X → R^{|F|}. The need for a representation model F to select a subset of X for representation limits the classification model, since information has to be discarded; see Figure 1.

Furthermore, in the task of classification we need to define a search space H for our hypothesis function, as it is computationally impossible to search through all possible hypothesis functions as proposed in (1). SC tasks are characterized by the fact that the representation space is sparse, i.e. the ratio of the number of training documents to |F| is low. In this situation, computational feasibility calls for a tractable subset of hypothesis functions H, such as all linear functions. The true hypothesis function is not likely to be in our hypothesis space, h* ∉ H. We define the optimal hypothesis function constructed in the domain P as the hypothesis ĥ within our limited hypothesis space H, which is the collection of all linear mappings from R^{|F|} to R, that minimizes the chosen loss function (8).
Figure 1. Representation and classification structure (X is mapped to Y by h*, to R^{|F|} by the representation model F, and from R^{|F|} to Y by ĥ ∈ H).

The cross domain classification error from (9) is calculated with the optimal hypothesis function ĥ_P. Hereafter, we use P to denote the empirical distribution instead of the true continuous distribution. Furthermore, the cross domain classification error defined as ξ(P, P̄) in (2) is calculated with the optimal hypothesis function in H that minimizes the empirical loss in (8).

CMEK selection model limitations
Our model assumes that a good source domain, with low cross domain classification error, can be identified from the distributions P and P̄_x. However, whereas in reality sentiment is determined by the joint distribution of the features, the measures we use only measure distances between the individual distributions of the features. This means we try to infer the similarity of the joint distributions from marginal distributions of individual features. The same holds for the classification, which is based on marginal data of individual features, whereas sentiment in real life is determined by combinations of features. Since we use marginal data both for calculating the cross domain classification error and for the prediction of this error, this approach seems justified, but it creates a model separated from the true rules of expressing sentiment.

In addition, we do not know in what way we should compare distributions, only that when the distributions are the same, we expect no adaptation loss. It might very well be that the linear combination of our chosen measures does not include the true distance function that distinguishes domains from each other.

5. Case Study: Sentiment Classification
Let us give some brief background information on how to perform SC, mainly to clarify the design choices made in the experimental setup. The pipeline for SC can be segmented into three parts: data preprocessing, document representation, and classification. The next paragraphs describe the pipeline. For a lower dimensional visualization, see Figure 2.
Preprocessing
In the preprocessing part one tries to remove noise from the text that does not hold sentiment information in order to reduce complexity, i.e. dimensionality. Common techniques to do so are stop word removal, lowercasing words, spelling correction, and removing punctuation. Other techniques, including part-of-speech tagging, stemming, and lemmatization, attempt to find implicit sentiment information by predicting the syntactics of words or combinations of words.
Document representation
The second part of the pipeline for sentiment classification deals with representing documents in a mathematically interpretable fashion. The most straightforward approach is to use words as features and word counts as feature values, without looking at word order: a bag-of-words approach. With n_F words or combinations of words as features, n_F = |F|, we can build an n_F-dimensional feature space in which each word is represented by an n_F-dimensional vector. The number of occurrences of a feature in the document determines the value in that dimension. Each document is then represented as the sum of all the vectors of its features. When using n_F features, a corpus of n_d documents can be represented as a matrix X ∈ N^{n_d × n_F}, which is referred to as the feature value matrix. X represents a projection of the empirical distribution P. In addition, one could choose to use combinations of N words as features, called N-grams. Often, features are weighted to assign less value to more common words such as stop words. A widely used weighting scheme borrowed from information retrieval systems is the TF-IDF weighting scheme. To reduce dimensionality, feature selection algorithms can be used, selecting the most class-label-informative features by using for example χ², mutual information, or lexicon based selection [19]. Another popular approach for document representation uses word embeddings [24] [28], a lower dimensional projection of a high dimensional feature space. To construct a good projection, we need a lot of data, preferably from the domain that is under evaluation.

Figure 2. Classified feature space, illustrated with the features f1 = "cat", f2 = "sad", f3 = "happy" and the documents d1 = "The cat makes me happy." (POS), d2 = "The cat is beautiful." (POS), d3 = "The cat makes me sad." (NEG), d4 = "I was happy, the sad cat made me sad." (NEG).
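To make the bag-of-words representation concrete, the snippet below builds the feature value matrix for the four example documents of Figure 2 with the three features "cat", "sad" and "happy"; scikit-learn's CountVectorizer is used here purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat makes me happy.",                 # d1, positive
    "The cat is beautiful.",                   # d2, positive
    "The cat makes me sad.",                   # d3, negative
    "I was happy, the sad cat made me sad.",   # d4, negative
]
vectorizer = CountVectorizer(vocabulary=["cat", "sad", "happy"])
X = vectorizer.fit_transform(docs).toarray()   # feature value matrix, one row per document
print(X)                                       # e.g. the row for d4 is [1, 2, 1]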
Classification

When we have a mathematical representation for the source and target documents, we try to find a hypothesis function that separates the feature space into subspaces belonging to the different class labels. All documents that are represented in a certain subspace are predicted to have the accompanying label of that specific subspace. The hypothesis function, or classifier, is constructed by minimizing a loss function, see (8). Popular classifiers for sentiment classification are Support Vector Machines, Naive Bayes, and Logistic Regression (LR).
6. Experimental Setup
In this section we briefly elaborate on the different datasets we used for our experiments as well as the design choices we consider concerning the SC pipeline and the proposed distance measures.
6.1. Datasets
To evaluate the methodology, we choose two sets of data that differ in topic variety, source medium, number of documents, document length, and class balance. Our primary goal in conducting two experiments with datasets at different homogeneity levels is to investigate the performance and robustness of our proposed method with respect to the diversity of the training datasets.
6.1.A. Homogeneous dataset
The homogeneous dataset consists of 20 corpora from the DRANZIERA benchmark dataset [12]. We use 5000 positive and 5000 negative documents from each corpus. The documents range from 64 to 123 words with an average of 90. Each corpus consists of Amazon reviews on products from a different category. This set is referred to as the "homogeneous" set, as the nature of all corpora is similar and only the subtopics of the corpora differ.
6.1.B. Heterogeneous dataset
Our second dataset contains 13 corpora. Each corpus consists of a different number of documents, ranging from 502 to 30602 documents with an average of 6326 documents. The average document length among the corpora varies between 10 and 20 words, with an overall average of 15 words. We deploy a corpus of online reviews on movies, venues and consumer products [21], short opinions about movies [41], and nine sets of tweets on various topics [11]. The acquired datasets do not have any major class imbalances. This set is referred to as the "heterogeneous" set, as the corpora differ in number of documents, source, and topic. In a direct comparison with the first dataset, we note that the homogeneous set is richer and more balanced in terms of data points, and the topics of its corpora lie closer to each other in semantic meaning.
6.1.C. Adaptive learning dataset
The third set acquired is a well-known dataset containing Amazon reviews on four different categories of products [5]. We use 2000 reviews from each category. Motivated by the original paper [14] on DANN, this dataset is used to evaluate the performance of our proposed method in an adaptive learning setting.
6.2. Design choices
Considering the SC pipeline design choices, in our experiments we only remove punctuation and lowercase all words as preprocessing. For classification, we use a bag-of-words approach with unigrams and bigrams and use all features that occur more than 4 times in the training corpus and in at most 40% of the training documents. We weight the features according to the TF-IDF weighting scheme. We deliberately do not use word embeddings, since the lower dimensional projection needs too much textual domain data to construct. Using data from other domains would distort results, as the representation would become dependent on domains other than the source and target domain that are under performance evaluation. We use LR for its good performance. We use binary classification, y ∈ {0, 1}, giving the following loss function for our classifier:

ℓ(x, y)_LR = Σ_{i=1}^{n_d} log( 1 + e^{−(2y_i − 1)(x_i β + β_0)} ) + α ‖β‖²,

where β and β_0 are the optimization variables and α is the regularization parameter. The hypothesis function is defined as

h(x) = 1 if x β + β_0 > 0, and 0 otherwise.
For α, we use the default classifier settings of our toolbox [31]. In the adaptive learning setting, we use exactly the same model settings as in [14]. The model deploys DANN for classification, both with an original representation of 5000 features and with a marginalized Stacked Denoising Auto-encoder (mSDA) representation [8], an advanced representation that improves performance in SC. For both representations, we evaluate the performance of our methodology.

For all distance measures, we only measure the distance over the n = 1000 most occurring features for computational convenience. We do not compare the distributions of all features for two reasons. In the first place, the distance calculations would take long when using all features. Second, and more importantly, the distributions of rarely occurring features are less reliable. For the Chi2 divergence, we smooth the distributions by adding a constant λ = 0.05 over the entire distribution p(X) to prevent the distance between two discrete distributions from becoming infinite when we encounter a probability of zero.
For the KLD, we add a small constant λ to the entire distribution for the same reason. For the MMD, we choose a Reproducing Kernel Hilbert Space that corresponds to the Gaussian Radial Basis Function kernel K(x, y) = e^{−‖x−y‖²/σ}, where σ is our optimizer. We use corpora with the same number of objects, i.e. documents, as input by using a subset of the larger of the two corpora. For computational convenience we use a maximum of 5000 documents. For the EMD, we use the set of all continuous Lipschitz functions with respect to the 1-norm. Note that the distribution is highly dependent on the length of a document; therefore, we match document lengths between the two domains before calculating the EMD. For the optimization of the two IPMs we use available code for discrete distributions [42] [27] [32] [33].

For the inner domain classification error that is used as a predictor in the CMEK model, we use the same logistic regression as for the cross domain classification and use 10-fold cross validation to assess the inner domain classification error.
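The design choices above translate into a short scikit-learn pipeline. The sketch below mirrors the stated settings (lowercasing, unigrams and bigrams, TF-IDF weighting, logistic regression with default regularization, 10-fold cross validation for the inner domain error); the min_df/max_df arguments only approximate the stated frequency thresholds, and the snippet is an illustration rather than the exact experimental code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def make_vectorizer():
    # Unigram/bigram bag-of-words with TF-IDF weights; min_df/max_df approximate the
    # "more than 4 occurrences, at most 40% of documents" thresholds described above.
    return TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=5, max_df=0.4)

def inner_domain_error(docs, labels):
    # Inner domain classification error estimated with 10-fold cross validation.
    X = make_vectorizer().fit_transform(docs)
    accuracy = cross_val_score(LogisticRegression(), X, labels, cv=10).mean()
    return 1.0 - accuracy

def cross_domain_error(src_docs, src_labels, tgt_docs, tgt_labels):
    # Train on the selected source corpus, evaluate on a labeled target corpus.
    vec = make_vectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(src_docs), src_labels)
    return 1.0 - clf.score(vec.transform(tgt_docs), tgt_labels)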
7. Results

In this section we present the results of our experiments according to the performance evaluation described in Subsection 3.2.

Figure 3 shows the empirical distribution of the relative error as defined in (7) when selecting one source domain. The relative error distributions when randomly selecting a source domain and when using the CMEK source domain selection model are shown.

Figure 3. Relative error of CMEK and random source domain selection when selecting a single training domain: (a) homogeneous dataset, (b) heterogeneous dataset.

For the homogeneous and heterogeneous datasets, the CMEK model uses the optimized weight vectors β̂_homogeneous and β̂_heterogeneous, respectively.
Table 1. Performance of CMEK, random, and optimal selection.

(a) Homogeneous dataset

Selection method | Probability best possible domain selected | Probability one of 5 worst domains selected | Average absolute ξ(P̂, P̄)
Optimal          | 1                                          | 0                                           | .139
CMEK             | .600                                       | 0                                           | .144
Random           | .053                                       | .263                                        | .191

(b) Heterogeneous dataset

Selection method | Probability best possible domain selected | Probability one of 5 worst domains selected | Average absolute ξ(P̂, P̄)
Optimal          | 1                                          | 0                                           | .252
CMEK             | .385                                       | .154                                        | .337
Random           | .083                                       | .417                                        | .403

Figure 4. Performance of the CMEK model and random selection for different numbers of selected source domains in comparison with training on all source domains: (a) homogeneous dataset, (b) heterogeneous dataset.

Table 1 reports the probability of selecting the best possible domain, the probability of selecting one of the five worst domains, and the average absolute cross domain classification error, averaged over all target domains. We list the results for using the CMEK model for domain selection, for selecting the source domain at random, and for an optimal model that would always select the best source domain. The optimal cross domain classification error can be seen as a lower bound on the error.

Figure 4 shows the absolute cross domain classification error when we let our CMEK selection model select n source domains. Note that the CMEK model is constructed by optimizing the predicted cross domain classification error when training the classifier on single source domains, not on multiple ones. The results are compared with randomly selecting n domains and with training on all candidate source domains.

Table 2 reports the probability of selecting the best domain in terms of the cross domain classification error, and the respective average absolute cross domain classification error; the result is empirically computed based on the four available target domains provided by [5]. We list the results corresponding to the CMEK model as a means to select the source domain, to selecting the source domain at random, and to an optimal model that would always select the best source domain.
Table 2. Performance of CMEK, random, and optimal selection in the adaptive learning setting [14].

(a) Original representation

Selection method | Probability best possible domain selected | Average absolute ξ(P̂, P̄)
Optimal          | 1                                          | .199
CMEK             | 1                                          | .199
Random           | .33                                        | .237

(b) mSDA representation

Selection method | Probability best possible domain selected | Average absolute ξ(P̂, P̄)
Optimal          | 1                                          | .152
CMEK             | 1                                          | .152
Random           | .33                                        | .187

8. Conclusion
To select a suitable source domain in terms of cross domain classification error, we hypothesized that this error can be predicted by a function of the source domain distribution P and the marginal target domain distribution P̄_x. We proposed the CMEK model to select the best source domain(s) from a set of candidates. For our case study, the CMEK model uses a linear combination of the Chi2, MMD, EMD, and KLD measures, the inner domain classification error, and the constant 1, with weight vector β, as predictor for the cross domain classification error. To evaluate performance, we consider an example including N domains. In each run, N−1 of them serve as candidate source domains, and β is optimized to minimize the absolute error of the prediction. The optimized predictor is then used to select one or multiple best source domains among those N−1 candidates to train a classifier for the N-th, unseen target domain. This classifier is tested by calculating the cross domain classification error of the selected source domain(s) and the target domain. The process is repeated N times, each time using a different unseen target domain to test on. We benchmark the performance of our solution against randomly selecting a source domain as well as training on all candidate source domains. We perform this method over a homogeneous dataset (N = 20) and a heterogeneous dataset (N = 13).

From Figure 3 we see that the CMEK model is well able to identify source domains with low cross domain classification error. In 90% of the runs on the homogeneous dataset, the selected source domain was within 2.5 percent points error of the optimal choice, whereas random selection only selected a source domain in this category 26% of the times. For the heterogeneous set we see a similar increase in the performance of the CMEK model when looking at the probability that a domain is selected within 5 percent points error of the optimal choice: 45% and 16%, respectively.

From Table 1, we see that compared to random domain selection, the CMEK model realizes a significant improvement in average cross domain classification error for both the homogeneous set and the heterogeneous set. A similar benefit is observed when the CMEK model selects multiple source domains n, see Figure 4b. For the heterogeneous dataset, the CMEK model seems to perform significantly better than training on all domains when it selects 9 or 10 domains; the normality assumption is confirmed for n = 9, but in this specific evaluation the pairwise differences seem not normally distributed for n = 10. What the optimal number of domains to train on is may depend strongly on which candidate source domains are available. When we have candidate source domains that are somehow similar to the target domain, the optimal n will be higher, or even equal to the total number of sets, as in the homogeneous dataset, see Figure 4a. For a diverse set of candidate source domains the optimal n will be lower.

Looking at the results when combining the CMEK model with adaptation learning, Table 2, we observe that the CMEK model enabled us to select the best source domain out of three candidates for each of the four target domains. This gives us a significant improvement with regard to a random selection model.

Further discussion
Reflecting on our conclusion, we would like to place some remarks. We showed that the CMEK model is superior in selecting one source domain compared to random selection. However, the results show that, even for candidate source domains with a wide spread in topic and source medium, it is quite beneficial to train on all the candidate source domains. One of the reasons not to train on all data could be that it is too computationally expensive. Another reason might be that the candidate source domains are too diverse. To establish whether this is the case, we would need a measure that informs us about the diversity of the candidate source domains.

When we know we have candidate source domains that are similar to each other in terms of expressing sentiment, it might very well be that our CMEK model is not able to improve performance compared to training on all data. We illustrated this by using the homogeneous dataset, see Figure 4a. Therefore, if we have candidate source domains that are somewhat similar to each other, we may choose to train on all domains.

Another disadvantage of the CMEK model is that it uses some distance measures that are quite expensive to calculate. This can be a problem when we have many or large candidate source domains. This challenge might be addressed by using fewer but more informative features to calculate distances over, and by optimizing the code for efficiency. We might investigate what happens to the performance of the CMEK model in case we leave out the EMD and KLD measures, as they are given the least importance, i.e. the lowest weights β_i.

We constructed, tested, and evaluated the model on three different acquired datasets. We would like to remind the reader that sentiment expression is characterized by a severe notion of stochasticity. It is hard to tell how much the performance of the CMEK model will deviate when other sets are chosen to construct the model. With more available domains to construct the model, we may assume that the model will be more refined and better performing due to the wider support for the β weight vector.

The dataset used for the adaptive learning setting is quite popular in the field. However, it only contains data from four domains. When using one of the domains as the target, only three domains remain in the candidate source domain set. The chance of coincidentally selecting the optimal domain is not negligible. Although we presented that the results are statistically significant, for future work we would prefer to use a dataset that contains data originating from more domains.

Although we described a method that can be easily implemented in other fields of machine learning that encounter a domain shift, such as computer vision, fraud detection or spam detection, we only showed that our model works reasonably well in the domain of SC.
Future Work
As the CMEK model is to some extent able to select a source domain with a lower than average cross domain classification error, we concluded that the used measures are to some extent able to predict the cross domain classification error in order to select a suitable source domain. However, if our model were to fully do justice to our hypothesis (4), we would improve even further. We could say that a proof of concept is given: it is possible to roughly predict the cross domain classification error based on the source domain distribution and the marginal distribution of the target domain. However, we still see much room for improvement. As the lower bound for the error when selecting one domain in the heterogeneous dataset is .252, we are only halfway. For the homogeneous dataset, it turns out that we are closer to achieving an optimal model.

If we would like to improve results further, a first step would be to look closer into the features over which we measure distance. Should we use the 1000 most common features, or is distance better measured with more or fewer features? This could be dependent on corpus size as well. There might be feature selection methods to select the features that are most informative in terms of distance, instead of simply using the N most occurring features. We could, for example, look at domain specific emotion lexicons to measure distance [3].

To improve, we might add some predictors to the linear combination, such as the document length distribution or the size of a corpus. Brief statements are on average more similar to each other than brief statements compared with long expressions. Also, if we have more objects in a source domain, we might benefit more from training on that large source domain than from training on a corpus with only a few documents. However, with more elements in β, we might need more data to prevent overfitting.

We can also extend our model to give a weight factor to every source domain: source domains that are predicted not to be suitable get a low weight, while domains with low distance to the target are given more weight. This approach leans towards importance sampling.

We constructed a predictor for the cross domain classification error of one source domain, and used the model to select multiple source domains. If this is what we are interested in, it would make sense to calculate distances between a set of multiple domains and the target domain and build a selection model optimized on predicting the cross domain classification error when training on multiple domains. In that case, synergies of datasets would be taken into account. This approach will, however, be very computationally expensive.

Concerning the results when deploying the CMEK model in an adaptive learning setting, we evaluated the performance when selecting a source domain based on the original source data distribution. It would be interesting to see what we could achieve when we let the CMEK model select based on the adapted source data distribution. In that scenario, the source data is first transformed with adaptive techniques, and then the respective distance is measured between the transformed source data and the target data in order to select the predicted best domain in the adaptive learning setting.

At last, we notice that when the available datasets are rather heterogeneous, the CMEK model is typically able to outperform the classifiers trained on all datasets by selecting an appropriate subset of the training domains.
To realize this, it is essential to determine what the optimal number of domains to train on is. We could approach this question by defining an absolute threshold on the distance between the source and target domain, or by clustering the candidate source domains and selecting a cluster of candidates to train on. Combining the proposed methodology with a model that selects the optimal number of domains to train on might bring a very useful solution to many machine learning problems.
Acknowledgment
The authors are grateful to Dr. Mauro Dragoni for providing us the DRANZIERA dataset, and to the company Crunchr for support throughout the project. The third author gratefully acknowledges funding from the Swiss National Science Foundation under grant P2EZP2-165264.
References

[1] M. J. Afridi, A. Ross, and E. M. Shapiro, On automated source selection for transfer learning in convolutional neural networks, Pattern Recognition, 73 (2018), pp. 65–75.
[2] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein GAN, arXiv preprint arXiv:1701.07875, (2017).
[3] A. Bandhakavi, N. Wiratunga, D. Padmanabhan, and S. Massie, Lexicon based feature extraction for emotion text classification, Pattern Recognition Letters, 93 (2017), pp. 133–142.
[4] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, Analysis of representations for domain adaptation, in Advances in Neural Information Processing Systems, 2007, pp. 137–144.
[5] J. Blitzer, M. Dredze, F. Pereira, et al., Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification, in ACL, vol. 7, 2007, pp. 440–447.
[6] J. Blitzer, R. McDonald, and F. Pereira, Domain adaptation with structural correspondence learning, in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2006, pp. 120–128.
[7] J. Bollen, H. Mao, and X. Zeng, Twitter mood predicts the stock market, Journal of Computational Science, 2 (2011), pp. 1–8.
[8] M. Chen, Z. Xu, K. Weinberger, and F. Sha, Marginalized denoising autoencoders for domain adaptation, arXiv preprint arXiv:1206.4683, (2012).
[9] S. Cohen and L. Guibas, The earth mover's distance under transformation sets, in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, IEEE, 1999, pp. 1076–1083.
[10] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, Optimal transport for domain adaptation, arXiv preprint arXiv:1507.00504, (2015).
[11] CrowdFlower, Data for everyone, 2017.
[12] M. Dragoni, A. G. Tettamanzi, and C. D. C. Pereira, Dranziera: an evaluation protocol for multi-domain opinion mining, in Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), 2016, pp. 267–272.
[13] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, in Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE, 2013, pp. 2960–2967.
[14] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, Domain-adversarial training of neural networks, The Journal of Machine Learning Research, 17 (2016), pp. 2096–2030.
[15] X. Glorot, A. Bordes, and Y. Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 513–520.
[16] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, A kernel two-sample test, Journal of Machine Learning Research, 13 (2012), pp. 723–773.
[17] V. Hatzivassiloglou and K. R. McKeown, Predicting the semantic orientation of adjectives, in Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 1997, pp. 174–181.
[18] Y. He, C. Lin, and H. Alani, Automatically extracting polarity-bearing topics for cross-domain sentiment classification, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, Association for Computational Linguistics, 2011, pp. 123–131.
[19] F. H. Khan, U. Qamar, and S. Bashir, Swims: Semi-supervised subjective feature weighting and intelligent model selection for sentiment analysis, Knowledge-Based Systems, 100 (2016), pp. 97–111.
[20] S. Kolouri, A. B. Tosun, J. A. Ozolek, and G. K. Rohde, A continuous linear optimal transport approach for pattern analysis in image datasets, Pattern Recognition, 51 (2016), pp. 453–462.
[21] D. Kotzias, M. Denil, N. De Freitas, and P. Smyth, From group to individual labels using deep features, in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 597–606.
[22] W. M. Kouw, L. J. Van Der Maaten, J. H. Krijthe, and M. Loog, Feature-level domain adaptation, Journal of Machine Learning Research, 17 (2016), pp. 1–32.
[23] S. Kullback and R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics, 22 (1951), pp. 79–86.
[24] Q. Le and T. Mikolov, Distributed representations of sentences and documents, in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188–1196.
[25] L. Lee, On the effectiveness of the skew divergence for statistical language analysis, in AISTATS, Citeseer, 2001, pp. 65–72.
[26] J. Lin, Divergence measures based on the shannon entropy, IEEE Transactions on Information Theory, 37 (1991), pp. 145–151.
[27] W. Mayner, pyemd, 2017.
[28] T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, (2013).
[29] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks, 22 (2011), pp. 199–210.
[30] B. Pang, L. Lee, et al., Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval, 2 (2008), pp. 1–135.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12 (2011), pp. 2825–2830.
[32] O. Pele and M. Werman, A linear time histogram metric for improved sift matching, in Computer Vision - ECCV 2008, Springer, October 2008, pp. 495–508.
[33] O. Pele and M. Werman, Fast and robust earth mover's distances, in 2009 IEEE 12th International Conference on Computer Vision, IEEE, September 2009, pp. 460–467.
[34] B. Plank, What to do about non-standard (or non-canonical) language in nlp, arXiv preprint arXiv:1608.07836, (2016).
[35] B. Plank and G. Van Noord, Effective measures of domain similarity for parsing, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, Association for Computational Linguistics, 2011, pp. 1566–1576.
[36] S. Ruder, P. Ghaffari, and J. G. Breslin, Data selection strategies for multi-domain sentiment analysis, arXiv preprint arXiv:1702.02426, (2017).
[37] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet, On integral probability metrics, φ-divergences and binary classification, arXiv preprint arXiv:0901.2698, (2009).
[38] S. Tan, X. Cheng, Y. Wang, and H. Xu, Adapting naive bayes to domain adaptation for sentiment analysis, Advances in Information Retrieval, (2009), pp. 337–349.
[39] L. Terveen, W. Hill, et al., Phoaks: a system for sharing recommendations, Communications of the ACM, 40 (1997), pp. 59–63.
[40] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, Predicting elections with twitter: What 140 characters reveal about political sentiment, ICWSM, 10 (2010), pp. 178–185.
[41] University of Michigan, UMICH SI650 - Sentiment Classification, 2011.
[42] V. van Asch, mmd.py, 2012.
[43] K. Vogt, A. Paul, J. Ostermann, F. Rottensteiner, and C. Heipke, Boosted unsupervised multi-source selection for domain adaptation, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 4 (2017), p. 229.
[44] D. Wu, V. J. Lawhern, and B. J. Lance, Reducing offline bci calibration effort using weighted adaptation regularization with source domain selection, in Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, IEEE, 2015, pp. 3209–3216.
[45] E. W. Xiang, S. J. Pan, W. Pan, J. Su, and Q. Yang, Source-selection-free transfer learning, in IJCAI Proceedings - International Joint Conference on Artificial Intelligence, vol. 22, 2011, p. 2355.
[46]