Review-Level Sentiment Classification with Sentence-Level Polarity Correction
Sylvester Olubolu Orimaye, Saadat M. Alhashmi, Eu-Gene Siew, Sang Jung Kang
Sylvester Olubolu Orimaye (School of Information Technology, Monash University)
Saadat M. Alhashmi (Department of Management Information Systems, University of Sharjah, Sharjah, United Arab Emirates)
Eu-Gene Siew (School of Business, Monash University)
Sang Jung Kang (School of Information Technology, Monash University)
Abstract: We propose an effective technique for solving the review-level sentiment classification problem by using sentence-level polarity correction. Our polarity correction technique takes into account the consistency of the polarities (positive and negative) of sentences within each product review before performing the actual machine learning task. While sentences with inconsistent polarities are removed, sentences with consistent polarities are used to learn state-of-the-art classifiers. The technique achieved better results on different types of product reviews and outperforms baseline models without the correction technique. Experimental results show an average of 82% F-measure on four different product review domains.
Key Words:
Sentiment Analysis, Review-Level Classification, Polarity Correction, Data Mining, Machine Learning
1 Introduction

Sentiment classification has attracted a number of research studies in the past decade. The most prominent in the literature is Pang et al.,[1] which employed supervised machine learning techniques to classify positive and negative sentiments in movie reviews. The significance of that work influenced the research community and created different research directions within the field of sentiment analysis and opinion mining.[2, 3, 4] Practical benefits also emerged as a result of automatic recommendation of movies and products by using the sentiments expressed in the related reviews.[5, 3, 4] This is also applicable to business intelligence applications which rely on customers' reviews to extract 'satisfaction' patterns that may improve profitability.[6, 7] While the number of reviews has continued to grow, and sentiments are expressed in a subtle manner, it is important to develop more effective sentiment classification techniques that can correctly classify sentiments despite natural language ambiguities, which include the use of irony.[8, 9, 10, 11]

In this work, we classify sentiments expressed on individual product types by learning a language model classifier. We focus on online product reviews which cover individual product domains and express explicit sentiment polarities. For example, it is quite common that the opinions expressed in reviews are targeted at the specific products on which the reviews are written.[2, 7] This enables the reviewer to express a substantial level of sentiment on the particular product alone without necessarily splitting the opinions between different products. Also, in a review, sentiments are likely to be expressed on specific aspects of the particular product.[12] For example, an iPad user may express positive sentiment about the 'camera quality' of the device but express negative sentiment about its 'audio quality'.
This provides useful and collaborative information on aspects of the product that need improvement.[13, 14, 15]

The application of sentiment classification is important to the ordinary users of opinion mining and sentiment analysis systems.[16, 2, 3] This is because the different categories of sentiments (e.g. positive and negative) represent the actual stances of humans on a particular target (e.g. a product). A product manufacturer, for example, can have an overview of how many people 'like' and 'dislike' the product by using the number of positive and negative reviews. Similarly, sentiment classification has been quite useful in finance industries, especially for stock market prediction.[17, 18, 19]

Sentiment classification on product reviews can be challenging,[16, 3, 4] which is why it is still a very active area of research. More importantly, sentiments expressed in each product review sometimes include ambiguous and unexpected sentences,[20] and often alternate between the two different positive and negative polarities. This causes inconsistencies in the sentiments expressed, consequently leading to the mis-classification of the review document.[1, 16, 3] As such, the bag-of-words approach alone is not sufficient.[3, 4] We emphasize that most negative reviews contain positive sentences and often express negative sentiments by using just a few negative sentences.[21] We show an example as follows:

bought myself one of these and used it minimally and was happy (POSITIVE)

I am using my old 15 year old Oster (NEGATIVE)
Also to my surprise it is doing a better job (POSITIVE)
Just not as pretty (NEGATIVE)
I have KA stand mixer, hand blender, food processors large and small... (OBJECTIVE)
Will buy other KA but not this again (NEGATIVE)
The above problem often degrades the accuracy of sentiment classifiers as many review documents get mis-classified into the opposite category. This results in false positives and false negatives as the case may be.

While the above problem is non-trivial, we propose a polarity correction technique that extracts sentences with consistent polarities in a review. Our correction technique includes three separate steps. First, we perform training set correction by training a 'naïve' sentence-level polarity classifier to identify false negatives in both positive and negative categories. We then combine the true positive sentences and the false negative sentences of the two opposite categories to form a new training set for each category. Second, we propose a sentence-level polarity correction algorithm to identify consistent polarities in each review, while discarding sentences with inconsistent polarities. Finally, we train different machine learning algorithms to perform the sentiment classification task.

The above steps were performed on four different Amazon product review domains; they improved the accuracy of sentiment classification of the reviews over a baseline technique and give comparable performance with standard bigram, bag-of-words, and unigram techniques. In terms of F-measure, the technique achieves an average of 82% on all the product review domains.

The rest of this paper is organized as follows. We discuss related research work in Section 2. In Section 3, we propose the training set correction technique for the sentiment classification task. Section 4 describes the sentence-level polarity correction technique and the corresponding algorithm. Our machine learning experiments and results are presented in Section 5. Finally, Section 6 presents conclusions and future work.
2 Related Work

Pang and Lee[22] proposed a subjectivity summarization technique based on minimum cuts to classify sentiment polarities in IMDb movie reviews. The intuition is to identify and extract subjective portions of the review document using minimum cuts in graphs. The minimum cut approach takes into consideration the pairwise proximity information via graph cuts that partition sentences which are likely to be in the same class. For example, a strongly subjective sentence might have lexical dependencies on its preceding or next sentence. Thus Pang and Lee[22] showed that minimum cuts in graphs put such sentences in the same class. In the end, the subjective portions identified by the minimum graph cuts are classified as either negative or positive polarity. This approach showed a significant improvement from 82.8% to 86.4% with just 60% of the subjective portion of the documents.

In our work, we introduce additional steps by not only extracting subjective sentences. Instead, we extract subjective sentences with consistent sentiment polarities. We then discard other subjective sentences with inconsistent sentiment polarities that may contribute noise and reduce the performance of the sentiment classifier. Thus, contrary to Pang and Lee,[22] our work has the ability to effectively learn sentiments by identifying the likely subjective sentences with consistent sentiments. Again, we emphasize that some subjective sentences may not necessarily express sentiments towards the subject matter.[3, 4] Consider, for example, the following excerpt from a 'positive-labelled' movie review:

'real life, however, consists of long stretches of boredom with a few dramatic moments and characters who stand around, think thoughts and do nothing, or come and go before events are resolved. Spielberg gives us a visually spicy and historically accurate real life story. You will like it.'
In the above excerpt, sentence 1 is a subjective sentence which does not contribute to the sentiment on the movie. Explicit sentiments are expressed in sentences 2 and 3. We propose that discarding sentences such as sentence 1 from reviews is likely to improve the accuracy of a sentiment classifier.

Similarly, Wilson et al.[23] used instances of polar words to detect the contextual polarity of phrases in the MPQA corpus. Each detected phrase is verified to be either a polar or non-polar phrase by using the presence of opinionated words from a polarity lexicon. Polar phrases are then processed further to detect their respective contextual polarities, which can then be used to train machine learning techniques. Identifying the polarity of a phrase-level expression is a challenge in sentiment analysis.[3] Earlier in Section 1, we illustrated some example sentences to that effect. For clarity, consider the sentence 'I am not saying the picture quality of the camera is not good'. In this sentence, the presence of the negation word 'not' does not represent 'negative' polarity of the sentence in context. In fact, it emphasizes a 'desired state' that the 'picture quality' of the camera entity is 'good'. However, without effective contextual polarity detection, such sentences could easily be classified as 'negative' by ordinary machine learning techniques. To this extent, Wilson et al.[23] performed manual annotation of contextual polarities in the MPQA corpus to train a classifier with a combination of ten features, resulting in 65.7% accuracy and leaving room for more improvement.

Choi and Cardie[24] proposed a compositional semantics approach to learn the polarity of sentiments at the sub-sentential level of opinionated expressions. The compositional semantics approach breaks the lexical constituents of an expression into different semantic components. Thus, the work used content word negators (e.g.
sceptic, disbelief) to identify the sentiment polarities of the different semantic components of the expression. Content word negators are negation words other than function words such as not, but, never and so on. Identified sentiment polarities are then combined using a set of heuristic rules to form an overall sentiment polarity feature, which can then be used to train machine learning techniques. Interestingly, on the Multi-Perspective Question Answering (MPQA) corpus created by Wiebe et al.,[25] this combination yielded a performance of 90.7% over the 89.1% performance of an ordinary classifier (e.g. using bag-of-words).

The performance achieved by Choi and Cardie[24] is understandable given that the MPQA corpus contains well 'structured' news articles which are mostly well written on certain topics. Moreover, sentences or expressions contained in news articles are most likely to express sequential sentiments, allowing a reasonable classification performance.[17, 26, 27] For example, it is more likely that a negative news 'event' such as 'Disaster unfolds as Tsunami rocks Japan' will attract 'persistent' negative expressions and sentiments in news articles. In contrast, sentiment classification on product reviews is more challenging as the reviews often contain inconsistent or mixed sentiment polarities. We have illustrated an example to that effect in Section 1. It would be interesting to know the performance of the heuristics used by Choi and Cardie[24] on standard product review datasets such as the Amazon online product review datasets. A detailed review of other sentiment classification techniques on review documents is provided in Tang et al.[28]

Our main contribution to the sentiment classification task is to perform training set correction and further detect inter-sentence polarity consistency that could improve a sentiment classifier.
That is, given a review of n sentences, we try to understand how the sentiment polarity varies from sentence 1 to sentence n. We hypothesize that detecting consistent sentiment patterns in reviews could improve a sentiment classifier without further sophisticated natural language techniques (e.g. using compositional semantics or linguistic dependencies).[17] More importantly, we believe every sentence in a review may not necessarily contribute to the classification of the review into the appropriate class.[22] We argue that certain sequential sentences with consistent sentiment polarities could be sufficient to represent and distinguish between the sentiment classes of a review. Representative features have been argued to be the key to effective classification techniques.[29, 30] We emphasize that our approach is promising and can be easily integrated into any sentiment classification system regardless of the sentiment detection technique employed.

3 Training Set Correction
Training set polarity correction has been largely ignored in sentiment classification tasks.[6] Earlier, we emphasized that a review document could contain both positive and negative sentences. Moreover, since reviewers often express sentiments on different aspects of products, it is probable that some aspects of the products will receive positive sentiments while others get negative sentiments.[3] In a negative-labeled product review, for example, it is more likely that negative sentiments will be expressed within the first few portions of the review, followed by positive sentiments in the later portion of the review on some of the aspects of the product that gave some satisfaction.[5, 3] This could be because reviewers tend to emphasize the negative aspects of a product more than the positive aspects, and in some cases, both polarities are expressed alternately, which we will discuss in Section 4. Thus, using such mixed sentiments in each category for training a machine learning algorithm will only result in bias and reduce the accuracy of the classifier.[22, 31]

As such, we propose a promising approach to reduce the bias in the training set by first learning a 'naïve' sentence-level classifier on all sentences from both the positive and negative categories. A 'naïve' classifier could be any classifier trained with surface-level features (e.g. unigram or bag-of-words),[22, 32] without necessarily performing sophisticated feature engineering, since the final sentiment classifier will be constructed with more fine-grained features.[33] For example, one could learn the popular Naïve Bayes classifier with only unigram features.[22, 34, 35] It is also possible to use a more complex classifier at the expense of efficiency. Having said that, the 'naïve' classifier is then used to also test the same sentences from both the positive and negative categories.
The idea is to identify positive-labelled sentences that will be classified as negative and negative-labelled sentences that will be classified as positive. Having identified these, it is imperative to correct the training set by combining the wrongly classified sentences with their appropriate categories. That is, positive-labelled sentences that are classified as negative should be combined with the original negative sentences (in the negative category), and negative-labelled sentences that are classified as positive should be combined with the original positive sentences (in the positive category).

While this technique may result in a meta classification,[36] we propose to include the technique as part of the training process of the final sentiment classifier. In addition, in order to minimize wrongly classified sentences, we implement the 'naïve' classifier to maximize the
Joint-Log-Probability score of a given sentence belonging to either of the positive or negative categories. This is because most ordinary classifiers maximize the conditional probability over all categories, which is at the expense of better accuracy.[37] We compute the
Joint-Log-Probability as follows:

P(S, C) = log P(S | C) + log P(C)   (1)

P_c = argmax_{c ∈ C} P(S, C)   (2)

where P(S, C) is the probability of a sentence given a class, P_c is the probability of the sentence belonging to either a 'positive' category c or a 'negative' category c, and P(C) is a multivariate distribution over the positive and negative categories.

4 Sentence-Level Polarity Correction

Following the training set correction in Section 3, we propose sentence-level polarity correction to further reduce mis-classification in both the 'training' and 'testing' sets. More importantly, because the bag-of-words approach has seldom improved the accuracy of a sentiment classifier,[3, 4] a sentence-level approach could give better improvement, since most sentiments are expressed at sentence level anyway.[38] However, we have indicated in Section 3 that many review documents have the tendency to contain both positive and negative sentences, regardless of their individual categories (i.e. positive or negative). While the consistent sentence polarities of both categories might be helpful to the classification task, it would be better to remove sentences with outlier polarities that cause inconsistencies by using a polarity correction approach.[39, 40, 3] Note that we have motivated the inconsistency problem with an example in Section 1.

The idea of sentence-level polarity correction is to remove inconsistent sentence polarities from each review. We observed that sentences with inconsistent polarity deviate from the previous consistent polarity. More often than not, the polarities of sentences in a given review are expressed consistently except for some outlier polarities.[1, 40, 3] As such, a given polarity is expressed consistently over a number of sentences, deviates at a certain point to the other polarity, and continues over a number of sentences alternately.
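As a concrete sketch of the training set correction step of Section 3, the following implements a 'naïve' unigram Naïve Bayes sentence classifier that scores each sentence by the joint log-probability described above and then rebuilds the per-category training sets. All class and function names are illustrative, not the paper's actual implementation:

```python
import math
from collections import Counter

class NaiveSentenceClassifier:
    """A 'naive' unigram Naive Bayes sentence classifier that scores a
    sentence S against a category C by log P(S|C) + log P(C) and
    predicts the argmax category."""

    def fit(self, sentences, labels):
        self.word_counts = {}                      # per-category unigram counts
        self.class_counts = Counter(labels)
        self.vocab = set()
        for sent, lab in zip(sentences, labels):
            counts = self.word_counts.setdefault(lab, Counter())
            for word in sent.lower().split():
                counts[word] += 1
                self.vocab.add(word)
        self.totals = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        n_docs = sum(self.class_counts.values())
        self.log_prior = {c: math.log(k / n_docs)  # log P(C)
                          for c, k in self.class_counts.items()}
        return self

    def joint_log_prob(self, sentence, category):
        # log P(S|C) with add-one smoothing, plus log P(C)
        v = len(self.vocab)
        score = self.log_prior[category]
        for word in sentence.lower().split():
            count = self.word_counts[category][word]
            score += math.log((count + 1) / (self.totals[category] + v))
        return score

    def predict(self, sentence):
        return max(self.log_prior,
                   key=lambda c: self.joint_log_prob(sentence, c))

def correct_training_set(sentences, labels):
    """Training set correction: sentences the naive classifier places in
    the opposite category are moved there, so each corrected category
    holds its own correctly classified sentences plus the sentences the
    other category 'loses' to it."""
    clf = NaiveSentenceClassifier().fit(sentences, labels)
    corrected = {c: [] for c in set(labels)}
    for sent in sentences:
        corrected[clf.predict(sent)].append(sent)
    return corrected
```

The add-one smoothing is an assumption for the sketch; any smoothed estimate of P(S|C) would serve the same role.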
Figure 1 shows an illustration depicting a possible review with consistent polarities and inconsistent polarities (or outlier polarities).

[Figure 1: a 10-sentence review with sentence polarities 1. negative, 2. negative, 3. negative, 4. positive, 5. negative, 6. negative, 7. negative, 8. positive, 9. positive, 10. negative; runs 1-3, 5-7, and 8-9 are labelled consistent, while sentences 4 and 10 are labelled outliers. Caption: An illustration of a review document with outlier, consistent, and inconsistent polarities.]

Given a 10-sentence review, a reviewer has expressed negative sentiments with the first three sentences. This is followed by a single positive sentence on line 4. Lines 5 to 7 consist of another three negative sentences. Lines 8 and 9 express positive sentences. Finally, line 10 concludes with a negative sentence. Thus, we regard line 4 (positive sentence) and line 10 (negative sentence) as outlier polarities because there is no subsequent matching polarity after each of them. Our polarity correction algorithm removes such outliers, leaving only the consistent polarities. It is to be noted that at this stage, the algorithm is independent of a particular sentiment category (i.e. positive or negative). We consider exact subsequent polarities - either positive or negative - since a review is likely to contain both polarities as discussed earlier. Our intuition is that sentences with consistent polarity could better represent the overall sentiment expressed in a review document by providing a wider margin between the categories of the major consistent sentiment polarities.[17, 21] Note that this technique is different from intra-sentence polarity detection as studied in Li et al.[40] Additionally, we performed negation tagging by tagging 1 to 3 words after a negation word in each sentence. In contrast to our baseline, the negation tagging showed some improvements in our correction technique.

Thus, given a review document with n sentences S_1, ..., S_n, we classify each sentence with the 'naïve' classifier and compare the polarity Φ_{s_1} of the first sentence with the polarity Φ_{s_{n+1}} of the next sentence, up to s_{n-1}. Where Φ_{s_1} is the starting polarity, the polarity of the subsequent sentence Φ_{s_{n+1}} is compared with the polarity of the prior sentence Φλ_{s_{n+1}}. When Φλ_{s_{n+1}} equals Φ_{s_{n+1}}, the sentence is stored into the consistent category; otherwise, the sentence is considered an outlier. Note that we set a consistency threshold by specifying a parameter θ, which indicates the minimum number of subsequent identical sentence polarities that must be observed to be considered consistent. As such, consistent sentence polarity runs shorter than the θ value are ignored.

In our experiment, we set θ = 2 to simulate the default case. Our empirical observation shows that θ = 2 sufficiently captures consistent polarities for a sparse review document containing as few as 7 sentences.
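A minimal sketch of the sentence-level correction step, assuming sentence polarities have already been assigned by the 'naïve' classifier. The negation word list and function names are illustrative, not the paper's exact implementation:

```python
NEGATIONS = {"not", "no", "never", "n't", "cannot"}  # illustrative list

def negation_tag(tokens, window=3):
    """Tag up to `window` words following a negation word, e.g.
    'not good' -> ['not', 'NOT_good'], as described for negation tagging."""
    out, remaining = [], 0
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            out.append(tok)
            remaining = window
        elif remaining > 0:
            out.append("NOT_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out

def consistent_sentences(sentences, polarities, theta=2):
    """Keep only sentences belonging to a run of at least `theta`
    identical polarities; shorter runs are outliers and are removed."""
    kept = []
    i, n = 0, len(sentences)
    while i < n:
        j = i
        while j < n and polarities[j] == polarities[i]:
            j += 1                      # extend the run of equal polarities
        if j - i >= theta:              # consistency threshold theta
            kept.extend(sentences[i:j])
        i = j
    return kept
```

Applied to the 10-sentence example of Figure 1 with θ = 2, the filter keeps the runs at sentences 1-3, 5-7, and 8-9 and drops the outliers at sentences 4 and 10.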
Figure 2 shows how consistent polarities are extracted with different thresholds θ. For a 7-sentence review with polarities (neg, pos, neg, neg, pos, pos, pos), θ = 2 retrieves sentences n_3 to n_7 and θ = 3 retrieves only sentences n_5 to n_7.

[Figure 2: Extracting consistent polarities with different θ thresholds.]

5 Experiments and Results

We performed several experiments with our correction technique and compared the performance of popular state-of-the-art classifiers with and without our polarity correction techniques. The classifiers comprise the Sequential Minimal Optimization (SMO) variant of Support Vector Machines (SVM)[41] and the Naïve Bayes (NB) classifier.[42] We used SVM and NB on the WEKA machine learning platform,[43] with bag-of-words, unigram, and word bigram features. We did not include word trigram features, as both word unigram and word bigram features have been shown to improve sentiment classification tasks.[1, 22, 3] We conducted a performance evaluation for comparison with the baselines on each dataset domain.

For selecting the best parameters for the baseline algorithms, we performed a hyperparameter search using Auto-WEKA,[44] with cross-validation and the Sequential Model-based Algorithm Configuration (SMAC) optimization algorithm, which is a Bayesian optimization method proposed as part of Auto-WEKA.[44] We performed the search using the unigram features on the training set of each domain. This is because unigram features have shown robust performance in sentiment analysis.[1, 22, 3]

Our dataset is the multi-domain sentiment dataset constructed by Blitzer et al.[5] The dataset was first used in the year 2007 and consists of Amazon online
product reviews from four different types of product domains, which include beauty products, books, software, and kitchen. Each product domain contains positive and negative reviews, which were identified based on the customers' star ratings according to Blitzer et al.[5] For each domain, we separated a number of documents per category as the training set and used the remaining documents as an unseen testing set. We extracted the review text and performed sentence boundary identification by optimizing the output of the MedlineSentenceModel available as part of the LingPipe library.

Model | Hyperparameters
SVM-beauty | -C 1.1989425641153333 -N 0 -K "NormalizedPolyKernel -E 1.6144079568156302 -L"
SVM-books | -C 1.2918141993816825 -N 2 -K "NormalizedPolyKernel -E 2.78637472738497"
SVM-kitchen | -C 1.2929645940353218 -N 2 -K "Puk -S 9.028189222927269 -O 0.9952824838773323"
SVM-software | -C 1.1471978195519354 -N 2 -M -K "NormalizedPolyKernel -E 1.7177045231155679 -L"
NB-all-domains | -K (Kernel Estimator)

Table 1: Auto-Weka hyperparameter settings for SVM and NB on product domains with unigram features.

As our baseline, we implemented a sentence-level sentiment classifier using a technique similar to Pang and Lee[22] on the same dataset but without our correction technique. The baseline technique has worked very well in most sentiment classification tasks. The baseline work removes objective sentences from the training and testing documents by using an automatic subjectivity detector component which uses subjective sentences only for sentence-level classification.
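The paper performs the hyperparameter search with Auto-WEKA's SMAC. As a simplified stand-in (an exhaustive cross-validated grid search rather than Bayesian optimization), the selection loop can be sketched as follows, where `train_eval` is a hypothetical caller-supplied function that would wrap an actual SVM or NB training run and return a held-out score:

```python
from itertools import product as grid

def cross_val_select(param_grid, train_eval, data, k=5):
    """Pick the hyperparameter combination with the best mean k-fold
    cross-validation score. `train_eval(params, train, heldout) -> score`
    is supplied by the caller."""
    folds = [data[i::k] for i in range(k)]        # simple interleaved folds
    best, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for values in grid(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = []
        for i in range(k):
            heldout = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(train_eval(params, train, heldout))
        mean = sum(scores) / k
        if mean > best_score:
            best, best_score = params, mean
    return best, best_score
```

Unlike SMAC, this enumerates every grid point; it is only meant to show the cross-validated selection criterion.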
We used three evaluation metrics: precision, recall, and F-measure (F-1). Precision is calculated as TP/(TP + FP), recall as TP/(TP + FN), and F-measure as (2 × precision × recall)/(precision + recall). Note that TP, TN, FP, and FN are defined as true positives, true negatives, false positives, and false negatives, respectively. All results are based on a 95% confidence interval.

We present the results in Tables 2-5, where Model is the type of classifier, Pr. is the precision, Rc. is the recall, and F-1 is the F-measure, respectively. We identify the models with our correction technique with 'cor' after the model names. For

Model | Pr. | Rc. | F-1
SVM-Bigram-cor | 0.83 | 0.83 | 0.83
SVM-BOWS-cor | 0.83 | 0.83 | 0.83
SVM-Unigram-cor | – | – | –
SVM-Bigram | 0.83 | 0.83 | 0.83
SVM-BOWS | 0.85 | 0.85 | 0.85 †
SVM-Unigram | 0.83 | 0.83 | 0.83
SVM-Baseline | 0.76 | 0.76 | 0.76
NB-Bigram-cor | 0.83 | 0.79 | 0.80
NB-BOWS-cor | – | – | –
NB-Unigram-cor | 0.79 | 0.74 | 0.75
NB-Bigram | 0.86 | 0.86 | 0.86 †
NB-BOWS | 0.83 | 0.83 | 0.83
NB-Unigram | 0.86 | 0.85 | 0.86
NB-Baseline | 0.75 | 0.75 | 0.75

Table 2: Performance of unseen test sets on Beauty reviews

Footnote: MedlineSentenceModel, http://alias-i.com/lingpipe/docs/api/com/aliasi/sentences/MedlineSentenceModel.html
Model | Pr. | Rc. | F-1
SVM-Bigram-cor | 0.77 | 0.77 | 0.77
SVM-BOWS-cor | – | – | –
SVM-Unigram-cor | 0.75 | 0.75 | 0.75
SVM-Bigram | 0.76 | 0.75 | 0.76
SVM-BOWS | 0.78 | 0.78 | 0.78
SVM-Unigram | 0.78 | 0.77 | 0.77
SVM-Baseline | 0.68 | 0.68 | 0.68
NB-Bigram-cor | 0.78 | 0.74 | 0.74
NB-BOWS-cor | – | – | –
NB-Unigram-cor | 0.79 | 0.77 | 0.77
NB-Bigram | 0.78 | 0.77 | 0.76
NB-BOWS | 0.56 | 0.55 | 0.53
NB-Unigram | 0.69 | 0.67 | 0.66
NB-Baseline | 0.69 | 0.67 | 0.64

Table 3: Performance of unseen test sets on Books reviews
Model | Pr. | Rc. | F-1
SVM-Bigram-cor | 0.83 | 0.82 | 0.82
SVM-BOWS-cor | – | – | –
SVM-Unigram-cor | 0.82 | 0.82 | 0.82
SVM-Bigram | 0.84 | 0.83 | 0.83
SVM-BOWS | 0.82 | 0.82 | 0.82
SVM-Unigram | 0.85 | 0.85 | 0.85 †
SVM-Baseline | 0.70 | 0.70 | 0.70
NB-Bigram-cor | 0.84 | 0.84 | 0.84
NB-BOWS-cor | 0.80 | 0.79 | 0.79
NB-Unigram-cor | – | – | –
NB-Bigram | 0.83 | 0.79 | 0.79
NB-BOWS | 0.77 | 0.77 | 0.76
NB-Unigram | 0.79 | 0.77 | 0.77
NB-Baseline | 0.72 | 0.72 | 0.72

Table 4: Performance of unseen test sets on Software reviews
Model | Pr. | Rc. | F-1
SVM-Bigram-cor | 0.79 | 0.79 | 0.79
SVM-BOWS-cor | – | – | –
SVM-Unigram-cor | 0.79 | 0.79 | 0.79
SVM-Bigram | 0.81 | 0.79 | 0.79
SVM-BOWS | 0.81 | 0.80 | 0.80
SVM-Unigram | 0.79 | 0.79 | 0.79
SVM-Baseline | 0.74 | 0.74 | 0.74
NB-Bigram-cor | – | – | –
NB-BOWS-cor | 0.82 | 0.82 | 0.82
NB-Unigram-cor | 0.82 | 0.82 | 0.82
NB-Bigram | 0.84 | 0.84 | 0.84 †
NB-BOWS | 0.77 | 0.76 | 0.76
NB-Unigram | 0.82 | 0.82 | 0.82
NB-Baseline | 0.72 | 0.72 | 0.72

Table 5: Performance of unseen test sets on Kitchen reviews

example, 'SVM-Unigram-Cor' depicts a model using SVM with unigram features and our correction technique. Standard models are identified by the algorithm name and the feature used. Baseline models are identified with 'Baseline'. In addition, we identify our best performing model above the baseline with (∗), and comparable performance with standard models is identified with (†).

We see that the models with our correction techniques outperformed the baseline models without the correction techniques on all domains. Other than the baseline model, our technique shows comparable performance with the standard bigram, bag-of-words, and unigram models. Not surprisingly, SVM performed better than NB in most cases with bag-of-words and unigram features. On the other hand, NB performed better than SVM with bigram features. The improvement over the baseline technique and the comparable performance with the standard models show the importance of our polarity correction techniques as applicable to sentiment classification. It also emphasizes the fact that using the unseen test sets without sentence-level polarity correction is likely to lead to mis-classification as a result of inconsistent polarities within each review.
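For reference, the Pr., Rc., and F-1 columns in the tables follow the standard definitions given earlier; a small helper makes the computation explicit:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN),
    F-1 = 2*precision*recall/(precision+recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```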
Perhaps it could be beneficial to consider the integration of our polarity correction techniques into an independent sentiment classifier for more accurate sentiment classification.

The limitation of our polarity correction techniques, however, could be in the construction and the performance of the initial 'naïve' classifier for performing both the training set and the sentence-level polarity corrections. Also, the classifier needs to be trained on each review domain. At the same time, we emphasize that a moderate classifier - taking an NB classifier as an example - trained with the standard bag-of-words features gives an average of approximately 72% F-measure across all domains, as observed in our results. Therefore, we believe that the process is likely to have a minimal or negligible effect on the resulting sentiment classifier. As such, in favor of a more efficient classification task, especially on very large datasets, we do not recommend sophisticated classifiers for the initial correction processes. We also emphasize that any reasonable sentence-level polarity identification technique[3] used in place of the 'naïve' classifier in the correction processes is likely to work just fine and give improved results for the overall sentiment classification task.

6 Conclusions and Future Work

In this work, we have proposed training set and sentence-level polarity correction for the sentiment classification task on review documents. We performed experiments on different Amazon product review domains and showed that a sentiment classifier with training set and sentence-level polarity corrections yields improved performance and outperforms a state-of-the-art sentiment classification baseline on all the review domains. Our correction techniques first remove polarity bias from the training set and then remove inconsistent sentence-level polarities from both training and testing sets.
Given the difficulty of the sentiment classification task,[3] we believe that the improvement shown by the correction technique is promising and could lead to building a more accurate sentiment classifier.

In the future, we will integrate the training set and sentence-level polarity correction techniques as part of an independent sentiment detection algorithm and perform larger scale experiments on large datasets such as the SNAP Web Data: Amazon reviews dataset (http://snap.stanford.edu/data/web-Amazon.html), which was prepared by McAuley and Leskovec.[45]

References
1. B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up?: sentiment classification using machine learning techniques," in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79-86, Association for Computational Linguistics, 2002.
2. O. Vechtomova, "Facet-based opinion retrieval from blogs," Information Processing & Management, vol. 46, no. 1, pp. 71-88, 2010.
3. B. Liu, "Sentiment analysis and opinion mining," Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1-167, 2012.
4. E. Cambria, B. Schuller, Y. Xia, and C. Havasi, "New avenues in opinion mining and sentiment analysis," IEEE Intelligent Systems, vol. 28, no. 2, pp. 15-21, 2013.
5. J. Blitzer, M. Dredze, and F. Pereira, "Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification," Association for Computational Linguistics (ACL), 2007.
6. E. Martínez-Cámara, M. T. Martín-Valdivia, L. A. Ureña-López, and A. R. Montejo-Ráez, "Sentiment analysis in twitter," Natural Language Engineering, vol. 20, no. 1, pp. 1-28, 2014.
7. D. Vilares, M. A. Alonso, and C. Gómez-Rodríguez, "A syntactic approach for opinion mining on spanish reviews," Natural Language Engineering, pp. 139-163, 2015.
8. J. Mendel, L. Zadeh, E. Trillas, R. Yager, J. Lawry, H. Hagras, and S. Guadarrama, "What computing with words means to me [discussion forum]," IEEE Computational Intelligence Magazine, vol. 5, no. 1, pp. 20-26, 2010.
9. F. Keshtkar and D. Inkpen, "A hierarchical approach to mood classification in blogs," Natural Language Engineering, vol. 18, no. 1, pp. 61-81, 2012.
10. A. Reyes and P. Rosso, "On the difficulty of automatically detecting irony: beyond a simple case of negation," Knowledge and Information Systems, vol. 40, no. 3, pp. 595-614, 2014.
11. M. Melero, M. Costa-Jussà, P. Lambert, and M. Quixal, "Selection of correction candidates for the normalization of spanish user-generated content," Natural Language Engineering, pp. 1-27, 2014.
12. Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 815-824, ACM, 2011.
13. A. Tsai, R. Tsai, and J. Hsu, "Building a concept-level sentiment dictionary based on commonsense knowledge," IEEE Intelligent Systems, vol. 28, no. 2, pp. 22-30, 2013.
14. S. Poria, A. Gelbukh, A. Hussain, D. Das, and S. Bandyopadhyay, "Enhanced senticnet with affective labels for concept-based opinion mining," IEEE Intelligent Systems, vol. 28, no. 2, pp. 31-38, 2013.
15. N. T. Roman, P. Piwek, A. M. B. R. Carvalho, and A. R. Alvares, "Sentiment and behaviour annotation in a corpus of dialogue summaries," Journal of Universal Computer Science, vol. 21, no. 4, pp. 561-586, 2015.
16. B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135, 2008.
17. A. Devitt and K. Ahmad, "Sentiment polarity identification in financial news: A cohesion-based approach," in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 984-991, ACL, 2007.
18. A. Brabazon, M. O'Neill, and I. Dempsey, "An introduction to evolutionary computation in finance," IEEE Computational Intelligence Magazine, vol. 3, no. 4, pp. 42-55, 2008.
19. E. J. Fortuny, T. D. Smedt, D. Martens, and W. Daelemans, "Evaluating and understanding text-based stock price prediction models,"
Information Processing &Management , vol. 50, no. 2, pp. 426 – 441, 2014.20. D. Li, A. Laurent, P. Poncelet, and M. Roche, “Extraction of unexpected sen-tences: A sentiment classification assessed approach,”
Intelligent Data Analysis ,vol. 14, no. 1, p. 31, 2010.21. L. Jia, C. Yu, and W. Meng, “The effect of negation on sentiment analysis andretrieval effectiveness,” in
Proceeding of the 18th ACM conference on Informationand knowledge management , (Hong Kong, China), pp. 1827–1830, ACM, 2009.22. B. Pang and L. Lee, “A sentimental education: sentiment analysis using subjec-tivity summarization based on minimum cuts,” in
Proceedings of the 42nd AnnualMeeting on Association for Computational Linguistics , (Barcelona, Spain), p. 271,Association for Computational Linguistics, 2004.23. T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in phrase-level sentiment analysis,” in
Proceedings of the conference on Human LanguageTechnology and Empirical Methods in Natural Language Processing , (Vancouver,British Columbia, Canada), pp. 347–354, Association for Computational Linguis-tics, 2005.24. Y. Choi and C. Cardie, “Learning with compositional semantics as structural in-ference for subsentential sentiment analysis,” in
Proceedings of the Conference onEmpirical Methods in Natural Language Processing , (Honolulu, Hawaii), pp. 793–801, Association for Computational Linguistics, 2008.25. J. Wiebe, T. Wilson, and C. Cardie, “Annotating expressions of opinions and emo-tions in language,”
Language Resources and Evaluation , vol. 39, no. 2/3, pp. 165–210, 2005.26. M. Bautin, C. B. Ward, A. Patil, and S. S. Skiena, “Access: news and blog analysisfor the social sciences,” in
Proceedings of the 19th international conference onWorld wide web , (Raleigh, North Carolina, USA), pp. 1229–1232, ACM, 2010.27. Y. Lee, H.-y. Jung, W. Song, and J.-H. Lee, “Mining the blogosphere for top newsstories identification,” in
Proceeding of the 33rd international ACM SIGIR confer-ence on Research and development in information retrieval , (Geneva, Switzerland),pp. 395–402, ACM, 2010.28. H. Tang, S. Tan, and X. Cheng, “A survey on sentiment detection of reviews,”
Expert Systems with Applications , vol. 36, no. 7, pp. 10760–10773, 2009.29. I. Arel, D. C. Rose, and T. P. Karnowski, “Deep machine learning-a new frontierin artificial intelligence research [research frontier],”
Computational IntelligenceMagazine, IEEE , vol. 5, no. 4, pp. 13–18, 2010.30. Y. He and D. Zhou, “Self-training from labeled features for sentiment analysis,”
Information Processing & Management , vol. 47, no. 4, pp. 606–616, 2011.31. P. H. Calais Guerra, A. Veloso, W. Meira Jr, and V. Almeida, “From bias to opin-ion: a transfer-learning approach to real-time sentiment analysis,” in
Proceedingsof the 17th ACM SIGKDD international conference on Knowledge discovery anddata mining , pp. 150–158, ACM, 2011.32. H. Cui, V. Mittal, and M. Datar, “Comparative experiments on sentiment classifi-cation for online product reviews,” (American Association for Artificial Intelligence(AAAI)), 2006.33. S. Maldonado and C. Montecinos, “Robust classification of imbalanced data us-ing one-class and two-class svm-based multiclassifiers,”
Intelligent Data Analysis ,vol. 18, no. 1, pp. 95–112, 2014.34. S. Tan, X. Cheng, Y. Wang, and H. Xu, “Adapting naive bayes to domain adap-tation for sentiment analysis,” in
Advances in Information Retrieval , pp. 337–349,Springer, 2009.35. Q. Ye, Z. Zhang, and R. Law, “Sentiment classification of online reviews to traveldestinations by supervised machine learning approaches,”
Expert Systems with Ap-plications , vol. 36, no. 3, pp. 6527–6535, 2009.
6. R. Xia, C. Zong, X. Hu, and E. Cambria, “Feature ensemble plus sample selec-tion: A comprehensive approach to domain adaptation for sentiment classification,”
IEEE Intelligent Systems , vol. 28, no. 3, pp. 10–18, 2013.37. D. Lowd and P. Domingos, “Naive bayes models for probability estimation,” in
Proceedings of the 22nd international conference on Machine learning , ICML ’05,(New York, NY, USA), pp. 529–536, ACM, 2005.38. L. Tan, J. Na, Y. Theng, and K. Chang, “Sentence-level sentiment polarity clas-sification using a linguistic approach,”
Digital Libraries: For Cultural Heritage,Knowledge Dissemination, and Future Creation , pp. 77–87, 2011.39. T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity: An ex-ploration of features for phrase-level sentiment analysis,”
Computational Linguis-tics , vol. 35, no. 3, pp. 399–433, 2009.40. S. Li, S. Y. M. Lee, Y. Chen, C.-R. Huang, and G. Zhou, “Sentiment classifica-tion and polarity shifting,” in
Proceedings of the 23rd International Conference onComputational Linguistics , (Beijing, China), pp. 635–643, Association for Compu-tational Linguistics, 2010.41. J. Platt, “Sequential minimal optimization: A fast algorithm for training supportvector machines,” Tech. Rep. MSR-TR-98-14, Microsoft Research, 1998.42. I. Rish, “An empirical study of the naive bayes classifier,” in
IJCAI 2001 workshopon empirical methods in artificial intelligence , vol. 3, pp. 41–46, 2001.43. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten,“The weka data mining software: an update,”
ACM SIGKDD explorations newslet-ter , vol. 11, no. 1, pp. 10–18, 2009.44. C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-weka: Com-bined selection and hyperparameter optimization of classification algorithms,” in
Proceedings of the 19th ACM SIGKDD international conference on Knowledge dis-covery and data mining , pp. 847–855, ACM, 2013.45. J. McAuley and J. Leskovec, “Hidden factors and hidden topics: understandingrating dimensions with review text,” in
Proceedings of the 7th ACM conference onRecommender systems , pp. 165–172, ACM, 2013., pp. 165–172, ACM, 2013.