Weakly-Supervised Deep Learning for Domain Invariant Sentiment Classification
Pratik Kayal
Indian Institute of Technology Gandhinagar, Gujarat, India
Mayank Singh
Indian Institute of Technology Gandhinagar, Gujarat, India
Pawan Goyal
Indian Institute of Technology Kharagpur, West Bengal, India
ABSTRACT
The task of learning a sentiment classification model that adapts well to any target domain, different from the source domain, is a challenging problem. The majority of existing approaches focus on learning a common representation by leveraging both source and target data during training. In this paper, we introduce a two-stage training procedure that leverages weakly supervised datasets for developing simple lift-and-shift-based predictive models without being exposed to the target domain during the training phase. Experimental results show that transfer with weak supervision from a source domain to various target domains provides performance very close to that obtained via supervised training on the target domain itself.
CCS CONCEPTS
• Computing methodologies → Supervised learning by classification; Semi-supervised learning settings; Neural networks.
KEYWORDS
Sentiment Analysis, Domain Transfer, Weakly labeled datasets
INTRODUCTION
Sentiment analysis is the practice of applying natural language processing and machine learning techniques to examine the polarity (sentiment) of subjective information in text. With advancements in internet infrastructure and cheaper Web services, a large volume of opinionated sentences is produced and consumed by users. The majority of this volume is available on social media, blogs, online retail shops, and discussion forums. Primarily, this user-generated content evaluates the utility and quality of products and their components, such as laptops, mobile phones, and books, and of services such as restaurants, hotels, and events. As of April 2013, 90% of customers' purchase decisions depended on online reviews [23]. However, automatic sentiment classification is a challenging problem due to several factors such as the unavailability of labeled training datasets [6], multilingualism [11], bias [13], etc. We describe relevant past work and our contributions in the following sections.
Figure 1 shows representative sentences from five unrelated domains. Development of supervised classification models is heavily dependent on labeled datasets, which are rarely available in resource-constrained domains. One line of work focuses on creating a general representation for multiple domains based on the co-occurrences of domain-specific and domain-independent features [3, 4, 15, 20, 28–30]. Peng et al. [24] propose an innovative method to simultaneously extract domain-specific and invariant representations, using labeled data from both the source and target domains. Qiu and Zhang [26] identify domain-specific words to improve cross-domain classification. Blitzer et al. [3] propose Structural Correspondence Learning (SCL), which learns a shared feature representation for source and target domains. Pan et al. [20] propose Spectral Feature Alignment (SFA), which constructs an alignment between the source and target domains using co-occurrences between them, in order to build a bridge between the domains. In general, the above methods leverage both source-target domain
Domain I: Laptop reviews
R: Laptop can get warm, to the point of discomfort near the WSAD keys, which I assume is where the GPU is located internally.
S: Negative
Domain II: Restaurant reviews
R: I finished the meal with the "Cookies & Cream" ice cream sandwich. That was a little disappointing... not much flavour to it.
S: Negative
Domain III: Movie reviews
R: Tired of sobby melodramas and stupid comedies? Why not watch a film with a difference?
S: Positive
Domain IV: Weather reviews
R: This week in NYC will mark the longest stretch of dry weather since February.
S: Positive
Domain V: Scientific reviews
R: Many approaches for POS tagging have been developed in the past, including rule-based tagging (Brill, 1995), HMM taggers (Brants, 2000), maximum-entropy models (Ratnaparkhi, 1996), etc. All of these approaches require either a large amount of annotated training data (for supervised tagging) or a lexicon listing all possible tags for each word (for unsupervised tagging).
S: Negative
R: Review Sentence; S: Sentiment
Figure 1: Example reviews from five domains.
CoDS COMAD 2020, January 5–7, 2020, Hyderabad, India. Pratik Kayal, Mayank Singh, and Pawan Goyal.
pairs for training. Obtaining labeled instances in resource-constrained domains remains a challenge. As a remedy, recent advances in transfer learning have led to the development of classification models (termed 'lift-and-shift models') that are trained on a labeled dataset from one domain but can perform significantly better on several other domains [2, 12, 27]. The lift-and-shift model is a natural extension of a single-domain sentiment classifier to cross-domain sentiment classification. Here, we pick a classifier trained on a source domain and classify reviews from the target domain without any prior training on the target domain. However, we observe that models trained on reviews of a particular domain generally do not do well when tested on reviews of an unknown and different target domain. Several limitations in textual domain transfer exist primarily due to out-of-vocabulary tokens [19], stylistic variations [16], non-generalizable features [12], etc. For example, Crammer et al. [5] assumed that the distributions of multiple sources are the same, but that the labelings of the data from different sources may differ from each other.
The majority of e-commerce, travel, and restaurant websites (such as Amazon, Flipkart, Airbnb, etc.) allow customers to submit their reviews along with a rating on a five-point scale. Even though the rating might not directly correlate with the sentiment of the review, it provides weak signals for estimating sentence polarity [10, 31]. We term these weakly supervised rating datasets "Weakly Labeled Datasets (WLDs)". We show that WLDs, in addition to the labeled dataset, produce significant improvements in domain transfer for resource-constrained target domains.
In this work, we propose a two-stage lift-and-shift training procedure that leverages standard labeled sentences along with polarity signals emerging from weakly labeled review datasets. Informally, the proposed model is trained on a single source domain but predicts sentiments for different target domains. We show that even though BERT (Bidirectional Encoder Representations from Transformers) [9] and ELMO [25] achieve state-of-the-art performance and outperform previous benchmarks for single-domain sentiment analysis, they consistently fail in cross-domain sentiment analysis. Our proposed training mechanism adapts to unknown target domains and even performs better than models that explicitly leverage target domain data.
In this paper, we address the cross-domain sentiment classification problem. Given a source domain D_src and a set of N distinct target domains {D_1, D_2, ..., D_N} (collectively denoted D_tar), with D_src ≠ D_tar, the task is to train a classifier on labeled D_src data that achieves high polarity prediction accuracy on sentences from D_tar. We use two types of review datasets: (i) weakly labeled datasets and (ii) fully labeled datasets. Table 1 details the dataset statistics.
Weakly labeled review datasets contain weak signals about the polarity of review sentences. In the current scenario, the weak signals are represented by the user rating associated with each review. User ratings are noisy labels and, used alone, would result in significantly weaker classifiers. As user ratings lie between 1 and 5, with one being the worst and five being the best review, we adopt a simple strategy to assign sentiment labels to these sentences:

sentiment = positive, if rated 4 or 5 stars; negative, if rated 1 or 2 stars.

Please note that we do not consider three-star rated reviews. The current study uses three weakly labeled datasets, Amazon product reviews [18, 31], Yelp restaurant reviews [8], and IMDB movie reviews [17], which we henceforth refer to as AWLD (Amazon weakly labeled dataset), YWLD (Yelp weakly labeled dataset), and IWLD (IMDB weakly labeled dataset), respectively. WLDs are easier to collect: they are not explicitly labeled for sentiment, but carry manual ratings given by reviewers to complement the review text.

Fully labeled datasets (FLDs) consist of manually labeled review sentences. The reviews are labeled into three classes: (i) positive, (ii) negative, or (iii) neutral. For the current study, we only consider reviews associated with positive or negative sentiments. The study leverages six fully labeled datasets: (i) Weather sentiment data [7], (ii) IMDB [14], (iii) Yelp [14], (iv) Amazon (Cell and Accessory) [14], (v) Scientific citation context data [1], and (vi) Amazon (Digital Cameras, Cell Phones and Laptops) [31].

Table 1 presents salient statistics of the two types of review datasets. Experiments are reported for those source domains that possess corresponding WLDs; thus, the scientific citation and weather datasets are not considered as source domains due to the unavailability of WLDs. The compiled datasets are available at https://bit.ly/2EnjsSe.
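The rating-to-label rule above can be sketched as a small helper; the function name `weak_label` and the sample data are ours, for illustration only:

```python
def weak_label(rating):
    """Map a 1-5 star rating to a weak sentiment label.

    Returns "positive" for 4-5 stars, "negative" for 1-2 stars,
    and None for 3-star reviews, which are discarded.
    """
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None  # neutral 3-star reviews are not used

# Build a weakly labeled dataset, dropping 3-star reviews.
reviews = [("Great battery life", 5), ("Screen died in a week", 1), ("It's okay", 3)]
wld = [(text, weak_label(stars)) for text, stars in reviews if weak_label(stars)]
```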
As discussed in previous sections, the intuition is to leverage weak signals generated by WLDs to complement the polarity classification training on FLDs. We therefore present a two-stage training procedure. In the first stage, the predictive model is pre-trained on WLD data with a lower learning rate for a few iterations (in the current paper, one iteration is a single pass over all training instances, i.e., one epoch). The second stage follows a standard training procedure: the predictive model is trained on FLD instances with a usually higher learning rate until convergence. Algorithm 1 presents the detailed methodology. The training procedure is followed by the
Type  Dataset     Abbreviation  Reviews  P/N Ratio
WLD   Amazon      AWLD          1.1M     1.43
WLD   Yelp        YWLD          1M       3.95
WLD   IMDB        IWLD          50,000   1
FLD   Weather     WEAT          980      0.81
FLD   Amazon      ACAD          11,800   0.88
FLD   IMDB        IMDB          1,000    1
FLD   Yelp        YELP          1,000    1
FLD   Amazon      ADLD          1,000    1
FLD   Scientific  SCCD          700      2.96
Table 1: Salient statistics of the datasets. Abbreviations are added for better readability in further sections. Amazon (ACAD) represents the Cell and Accessory category, and Amazon (ADLD) represents the Digital Cameras, Cell Phones, and Laptops category. The rightmost column displays the ratio of counts of positive reviews (P) to negative reviews (N).

testing procedure. Since the current work focuses on domain-invariant sentiment classification, the test dataset domain, in the majority of cases, is different from the train dataset domain.
Algorithm 1: Weakly-Supervised Training for Domain Generalization
Input: D_src: source domain; D_1, D_2, ..., D_N: N target domains
1. Pre-train the model on the WLD of source domain D_src with a low learning rate for 'n' epochs, where n is a small integer;
2. Train the model on the FLD of source domain D_src;
3. for i = 1; i ≤ N; i = i + 1 do: test the trained model on D_i.

For the current polarity prediction task, we train a standard fully-connected feed-forward network with softmax as the activation function and two output perceptrons. We plug this fully connected layer on top of recently published state-of-the-art natural language embedding models. We experiment with two models, (i) BERT base [9] and (ii) ELMO [25], pre-trained on general corpora such as Wikipedia articles. Our proposed algorithm uses their pre-trained weights as the initialization for basic language understanding and adapts them to the specific polarity knowledge.
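Algorithm 1 can be sketched framework-agnostically as a skeleton in which the actual optimizer step is passed in as a callback; the function and parameter names here are ours and the loop structure is the only part taken from the paper:

```python
def two_stage_train(model, wld_batches, fld_batches, pretrain_lr, train_lr,
                    step, n_pretrain_epochs=2, max_train_epochs=10, converged=None):
    """Two-stage procedure: (1) pre-train on weakly labeled data (WLD)
    with a low learning rate for a few epochs; (2) train on fully labeled
    data (FLD) with a higher learning rate until convergence.

    `step(model, batch, lr)` performs one optimizer update; `converged(model)`
    optionally signals early stopping in stage 2.
    """
    # Stage 1: weak-supervision pre-training (low LR, few epochs).
    for _ in range(n_pretrain_epochs):
        for batch in wld_batches:
            step(model, batch, pretrain_lr)
    # Stage 2: supervised training on the FLD (higher LR, until convergence).
    for _ in range(max_train_epochs):
        for batch in fld_batches:
            step(model, batch, train_lr)
        if converged and converged(model):
            break
    return model
```

A deep-learning framework would supply the real `step` (forward pass, loss, backward pass, optimizer update); here the separation simply makes the two learning-rate regimes of the algorithm explicit.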
Each review is tokenized and subjected to standard token- and character-level filtering such as lower-casing and special-character removal. All neutral reviews are also filtered out. Next, each FLD dataset is randomly split into two sets, training and test: 85% of the review sentences are allocated for training and the remaining 15% for testing. Note that WLDs are not subjected to random splitting.
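The preprocessing and split described above can be sketched as follows; `preprocess` is a minimal stand-in for the filtering steps, and the function names are ours:

```python
import random
import re

def preprocess(sentence):
    """Lower-case and strip special characters (a minimal stand-in for the
    token- and character-level filtering described above)."""
    return re.sub(r"[^a-z0-9\s]", "", sentence.lower()).strip()

def split_fld(labeled_reviews, train_frac=0.85, seed=0):
    """Drop neutral reviews, preprocess, and randomly split an FLD
    into 85% train / 15% test."""
    data = [(preprocess(s), y) for s, y in labeled_reviews if y != "neutral"]
    random.Random(seed).shuffle(data)  # fixed seed for a reproducible split
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]
```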
We compare our model with several lift-and-shift baselines that use no WLD-based pretraining. Lift-and-shift models are a natural extension of single-domain sentiment classification models: we pick a classifier trained on a source domain and predict sentiments for sentences from the target domain without any prior training on the target domain. The baseline training procedure does not involve the WLD datasets either.
We compare our proposed model against the standard lift-and-shift baselines described in the previous section. We leverage standard metrics in sentiment classification, (i) average accuracy score and (ii) F1 score, for comparing the predicted polarity with the ground-truth polarity.
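For reference, the two metrics reduce to a few lines over paired label lists (this is the standard definition, not code from the paper):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive="positive"):
    """Binary F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    pred_pos = sum(p == positive for p in y_pred)   # predicted positives
    true_pos = sum(t == positive for t in y_true)   # actual positives
    if tp == 0:
        return 0.0
    precision, recall = tp / pred_pos, tp / true_pos
    return 2 * precision * recall / (precision + recall)
```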
The BERT and ELMO embedding vector sizes are fixed at 768 and 512 dimensions, respectively. We use a batch size of 64 and the binary cross-entropy loss function. In the case of BERT, we find that learning rates of 3.00e-5 for the training phase and 3.00e-8 for the pretraining phase perform best. Similarly, in the case of ELMO, learning rates of 3.00e-3 for training and 3.00e-6 for pretraining perform best.
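The reported hyperparameters can be collected into a single configuration object; the dictionary layout and key names below are ours, only the values come from the paper:

```python
# Hyperparameters reported in the experimental setup.
CONFIG = {
    "batch_size": 64,
    "loss": "binary_cross_entropy",
    # Per-encoder settings: embedding size, stage-1 (pretrain) LR, stage-2 (train) LR.
    "bert": {"embedding_dim": 768, "pretrain_lr": 3e-8, "train_lr": 3e-5},
    "elmo": {"embedding_dim": 512, "pretrain_lr": 3e-6, "train_lr": 3e-3},
}
```

Note that in both cases the stage-1 pretraining learning rate is three orders of magnitude below the stage-2 training rate.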
We first discuss the performance of the ELMO-based model. Table 2 presents accuracy and F1 scores for the ELMO-based classification model. Here, each target domain is represented by its 15% test data. The ELMO baselines perform best when the system is trained and tested on the same domain (shown by the diagonal cells in Table 2). Training only on WLD datasets produces the worst results (see the row with AWLD as the source). Our proposed training procedure (for example, ACAD-AWLD) yields only marginal gains. We witness similarly poor and marginal performances for the other WLDs and for our other proposed two-stage experiment settings, respectively. We attribute ELMO's poorer gains in cross-domain classification primarily to its limited generalizability compared to state-of-the-art embeddings generated by transformer-based language models. Next, we experiment with other competitive models.

Table 3 presents accuracy and F1 scores for the BERT-based classification model. BERT performs significantly better than ELMO. Our proposed training procedure performs exceptionally well, in some cases even at par with models that are trained and tested on the same domain. This training procedure not only improves over standard lift-and-shift models but also leads to higher transfer results. Again, training only on WLD datasets produces the worst results; however, the values are better than those of the corresponding ELMO-based models. Even though YELP-YWLD performs marginally worse on its own domain (0.3% lower than YELP), it performs significantly better on other domains such as ADLD (6% higher than YELP) and IMDB (4.7% higher than YELP). Similar transfer improvements are reported
                              Target domains
Source      ACAD         WEAT         ADLD         IMDB         YELP         SCCD
            A      F1    A      F1    A      F1    A      F1    A      F1    A      F1
ACAD        82.50  0.809 74.60  0.691 87.30  0.871 74.00  0.735 75.30  0.764 71.20  0.809
WEAT        69.10  0.570 82.10  0.813 69.30  0.596 72.70  0.682 70.70  0.735 29.40  0.182
ADLD        75.30  0.775 68.70  0.687 88.00  0.878 73.30  0.750 78.00  0.802 69.30  0.791
IMDB        77.30  0.756 74.60  0.761 86.00  0.847 78.70  0.802 79.30  0.812 70.60  0.802
YELP        76.10  0.744 74.60  0.746 82.70  0.783 74.70  0.740 81.30  0.823 37.90  0.371
SCCD        47.80  0.647 44.80  0.619 48.00  0.649 52.70  0.687 49.30  0.661 75.80  0.862
AWLD        54.40  0.676 46.30  0.625 52.70  0.670 52.00  0.684 53.30  0.679 75.80  0.862
IWLD        67.30  0.699 59.70  0.640 79.30  0.805 74.70  0.729 74.00  0.755 66.40  0.762
YWLD        73.90  0.765 68.70  0.720 84.00  0.848 74.00  0.780 80.00  0.824 77.80  0.861
ACAD-AWLD   80.00  0.794 71.60  0.708 80.70  0.803 75.30  0.764 73.30  0.762 69.90  0.807
IMDB-IWLD   75.60  0.740 68.70  0.704 80.70  0.788 83.30  0.843 74.70  0.756 61.00  0.691
YELP-YWLD   76.40  0.755 76.10  0.742 90.00  0.891 73.30  0.710 80.70  0.818 58.30  0.560
Table 2: [Color online] Accuracy (A) and F1 scores with ELMO as embeddings. Here, each target domain represents 15% held-out data. Blue and red colors represent the best and second-best values for a given target domain. AWLD, IWLD and YWLD represent training on weakly labeled datasets only. ACAD-AWLD, IMDB-IWLD and YELP-YWLD represent our two-stage training procedure.
                              Target domains
Source      ACAD         WEAT         ADLD         IMDB         YELP         SCCD
            A      F1    A      F1    A      F1    A      F1    A      F1    A      F1
ACAD        90.20  0.898 80.50  0.805 95.30  0.953 89.30  0.901 88.60  0.887 80.20  0.876
WEAT        79.90  0.766 85.00  0.848 86.60  0.857 87.30  0.876 85.30  0.853 67.00  0.731
ADLD        84.10  0.819 85.00  0.827 95.30  0.950 84.60  0.841 91.30  0.909 50.80  0.867
IMDB        82.30  0.805 86.50  0.852 92.60  0.925 91.30  0.915 87.30  0.875 65.80  0.867
YELP        81.90  0.791 79.10  0.787 89.30  0.887 87.30  0.872 95.30  0.953 55.00  0.867
SCCD        73.60  0.747 56.70  0.658 76.60  0.782 76.00  0.785 87.30  0.883 82.00  0.883
AWLD        78.65  0.812 50.70  0.645 89.30  0.898 78.60  0.829 79.30  0.820 77.20  0.870
IWLD        76.00  0.785 73.10  0.769 74.00  0.782 84.70  0.862 79.30  0.825 82.00  0.893
YWLD        85.40  0.855 76.10  0.778 94.00  0.941 90.70  0.912 91.30  0.916 76.60  0.868
ACAD-AWLD   90.50  0.903 85.00  0.843 94.00  0.938 92.60  0.931 92.00  0.921 80.80  0.879
IMDB-IWLD   86.50  0.857 85.10  0.844 93.30  0.933 92.70  0.929 92.70  0.928 76.60  0.868
YELP-YWLD   86.00  0.853 80.60  0.806 95.30  0.953 92.00  0.923 95.00  0.951 71.90  0.868
Table 3: [Color online] Accuracy (A) and F1 scores with BERT as embeddings. Here, each target domain represents 15% held-out data. Blue and red colors represent the best and second-best values for a given target domain. AWLD, IWLD and YWLD represent training on weakly labeled datasets only. ACAD-AWLD, IMDB-IWLD and YELP-YWLD represent our two-stage training procedure.

for ACAD-AWLD and IMDB-IWLD. This transfer performance improvement reconfirms the usefulness of the two-stage training procedure. As expected, domain transfer from SCCD and WEAT is inferior due to the high dissimilarity between SCCD, WEAT, and the other domains.

Note that the performance of models trained only on WLDs is poor compared to that of models trained on FLDs. This observation suggests that WLDs are themselves very noisy and, on their own, not good enough for the sentiment classification task [21, 22]. Also, the poor performance of the same-domain weakly labeled dataset on fully labeled data suggests that the star rating is not highly correlated with sentiment.
CONCLUSION
In this paper, we propose a two-stage training framework for cross-domain sentiment classification. We showcase the utility of combining weak labels and full labels for domain-invariant sentiment analysis. Even though the proposed approach uses more data than the baselines, curating WLD data is a far easier task than curating FLDs, and WLD datasets extend transfer capabilities to a significant extent. Our experimental results on the BERT-based model demonstrate the effectiveness of the proposed framework across a wide range of domains.

The primary focus of this paper has been to train more generalizable sentiment classification models using only single-domain data, without using any target domain signals. The idea of pretraining on weak signals can be explored further by combining various weakly labeled domains.
REFERENCES
[1] Awais Athar. 2011. Sentiment Analysis of Citations using Sentence Structure-Based Features. In Proceedings of the ACL 2011 Student Session.
[2] In Proceedings of Recent Advances in Natural Language Processing (RANLP), Vol. 1. Citeseer, 2–1.
[3] John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. 440–447.
[4] Danushka Bollegala, David Weir, and John Carroll. 2011. Using multiple sources to construct a sentiment sensitive thesaurus for cross-domain sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 132–141.
[5] Koby Crammer, Michael Kearns, and Jennifer Wortman. 2007. Learning from multiple sources. In Advances in Neural Information Processing Systems. 321–328.
[6] Sajib Dasgupta and Vincent Ng. 2009. Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2.
[7] Weather sentiment data.
[8] Yelp restaurant reviews dataset.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Ziyu Guan, Long Chen, Wei Zhao, Yi Zheng, Shulong Tan, and Deng Cai. 2016. Weakly-supervised deep learning for customer review sentiment classification. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. AAAI Press, 3719–3725.
[11] Alexander Hogenboom, Bas Heerschop, Flavius Frasincar, Uzay Kaymak, and Franciska de Jong. 2014. Multi-lingual support for lexicon-based sentiment analysis guided by semantics. Decision Support Systems 62 (2014), 43–53.
[12] Jing Jiang and ChengXiang Zhai. 2007. A two-stage approach to domain adaptation for statistical classifiers. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. ACM, 401–410.
[13] Muhammad Taimoor Khan, Mehr Durrani, Armughan Ali, Irum Inayat, Shehzad Khalid, and Kamran Habib Khan. 2016. Sentiment analysis and the complex natural language. Complex Adaptive Systems Modeling 4, 1 (2016), 2.
[14] Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 597–606.
[15] Fangtao Li, Sinno Jialin Pan, Ou Jin, Qiang Yang, and Xiaoyan Zhu. 2012. Cross-domain co-extraction of sentiment and topic lexicons. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 410–419.
[16] Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437 (2018).
[17] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 142–150.
[18] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[19] Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International Conference on World Wide Web. ACM, 751–760.
[20] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22, 2 (2011), 199–210.
[21] Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 115–124.
[22] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing-Volume 10. Association for Computational Linguistics, 79–86.
[23] Ling Peng, Geng Cui, Mengzhou Zhuang, and Chunyu Li. 2014. What do seller manipulations of online product reviews mean to consumers? (2014).
[24] Minlong Peng, Qi Zhang, Yu-gang Jiang, and Xuanjing Huang. 2018. Cross-Domain Sentiment Classification with Target Domain Specific Information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2505–2513.
[25] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
[26] Likun Qiu and Yue Zhang. 2015. Word segmentation for Chinese novels. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[27] Songbo Tan, Gaowei Wu, Huifeng Tang, and Xueqi Cheng. 2007. A novel scheme for domain-transfer problem in the context of sentiment analysis. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. Citeseer, 979–982.
[28] Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345 (2017).
[29] Jianfei Yu and Jing Jiang. 2016. Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 236–246.
[30] Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. 2014. Type-supervised domain adaptation for joint segmentation and POS-tagging. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. 588–597.
[31] Wei Zhao, Ziyu Guan, Long Chen, Xiaofei He, Deng Cai, Beidou Wang, and Quan Wang. 2017. Weakly-supervised deep embedding for product review sentiment analysis.