Proceedings of the 2019 3rd International Conference on E-Business and Internet | 2019
The Effect of Text Preprocessing Strategies on Detecting Fake Consumer Reviews
Abstract
Fake review detection is becoming crucial owing to the rapid growth of online purchasing. It is important to choose an efficient algorithm for detecting fake (deceptive, spam) reviews, whether positive or negative. It is equally important to preprocess the textual content of the reviews, both for training and later for use in a production environment. This study examines a number of text preprocessing choices: feature dimensionality, tokenization, stop word removal, stemming and different term weighting schemes. Three well-known machine learning algorithms are used as benchmark classifiers: Naïve Bayes, neural network and support vector machine. We show that text preprocessing strategies are important determinants of classifier performance. We find that the classifiers perform better on high-dimensional datasets represented by bigrams or trigrams selected according to a non-binary weighting scheme. Stemming and stop word removal seem to be less important.
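The preprocessing choices named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `tokenize` and `vectorize`, the tiny `STOP_WORDS` set, and the use of an unsmoothed TF-IDF formula as the "non-binary" weighting are all assumptions made for illustration, and stemming is omitted for brevity.

```python
import math
from collections import Counter

# Illustrative stop word list only; a real study would use a standard list.
STOP_WORDS = {"the", "a", "is", "and", "was", "it", "to", "of"}


def tokenize(text, n=1, remove_stop_words=False):
    """Lowercase, strip edge punctuation, and return word n-grams."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def vectorize(docs, n=1, weighting="tfidf", remove_stop_words=False):
    """Build a document-term matrix with binary or TF-IDF weights.

    TF-IDF here is tf * ln(N / df), an assumed stand-in for the
    non-binary weighting scheme mentioned in the abstract.
    """
    tokenized = [tokenize(d, n, remove_stop_words) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        if weighting == "binary":
            row = [1.0 if tf[t] > 0 else 0.0 for t in vocab]
        else:
            row = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        rows.append(row)
    return vocab, rows
```

Calling `vectorize(reviews, n=2, weighting="tfidf")` on a list of review strings would give a bigram representation with non-binary weights, while `n=1, weighting="binary"` gives the simplest unigram presence/absence baseline; the resulting rows could then be passed to any of the benchmark classifiers.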