ACM J. Data Inf. Qual. | 2021

Improving Opinion Spam Detection by Cumulative Relative Frequency Distribution

 
 
 
 

Abstract


M. Fazzolari et al. ACM Journal of Data and Information Quality, Vol. 13, No. 1, Article 4. Publication date: January 2021. https://doi.org/10.1145/3439307. © 2021 Association for Computing Machinery.

[...] reviews, a.k.a. opinion spam. This results in the diffusion of different kinds of disinformation and misinformation, where misinformation refers to inaccuracies that may originate even from people acting in good faith, while disinformation is false information deliberately spread to deceive [20]. Over the past few years, online reviews have become very important, since they reflect the customers’ experience with a product or service and now constitute the basis on which the reputation of an organization is built. Unfortunately, confidence in such reviews is often misplaced, because spammers are tempted to write fake information in exchange for some reward, or to mislead consumers in order to obtain business advantages [28]. The practice of writing false reviews is not only morally deplorable, as it is misleading for customers and inconvenient for service providers, but it is also punishable by law.

Considering both the longevity and the spread of the phenomenon, scholars have for years investigated various approaches to opinion spam detection, mainly based on supervised or unsupervised learning algorithms. Further approaches are based on Multi-Criteria Decision Making [42]. Machine learning approaches rely on input data to build a mathematical model that makes predictions or decisions. To this aim, data are usually represented by a set of features, which are structured and ideally fully representative of the phenomenon being modeled.
An effective feature engineering process, i.e., the process through which an analyst uses domain knowledge of the data under investigation to prepare appropriate features [11], is a critical and time-consuming task. However, if done correctly, feature engineering increases the predictive power of algorithms by facilitating the machine learning process.

In this article, we do not aim to contribute by defining novel features suitable for fake review detection; rather, starting from features that have been proven very effective in the academic literature, we re-engineered them by considering the distribution of the occurrences of the feature values in the dataset under analysis. In particular, we focus on the Cumulative Relative Frequency Distribution of a set of basic features already employed for the task of fake review detection. We compute this distribution for each feature and substitute each feature value with the corresponding value of the distribution. To demonstrate the effectiveness of the proposed approach, both the distributional (cumulative) features and the basic ones have been used to train several supervised machine learning classifiers, and the results obtained have been compared. To the best of the authors’ knowledge, this is the first time that the Cumulative Relative Frequency Distribution of a set of features has been considered for unveiling fake reviews. The experimental results show that the distributional features improve the performance of the classifiers, at the mere cost of a small computational overhead in the feature engineering phase.

The rest of the article is organized as follows. The next section reviews related work in the area. Section 3 describes the process of feature engineering. In Section 4, we present the experimental setup, while Section 5 reports the results of the comparison among the classification algorithms.
Moreover, in that section, we assess the importance of the distributional features and discuss the benefits brought by their adoption. Finally, Section 6 concludes the article.
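
The substitution of feature values with their Cumulative Relative Frequency Distribution, as described above, can be illustrated with a minimal sketch. It assumes the usual empirical definition, in which each value v is replaced by the fraction of observations in the dataset that are less than or equal to v. Whether the article additionally bins or discretizes the values is not stated in this section, so that choice, the function name, and the toy data below are all assumptions made for illustration only.

```python
from bisect import bisect_right


def cumulative_relative_frequency(values):
    """Replace each value with the fraction of observations <= that value.

    Assumption: plain empirical cumulative relative frequency, with no
    binning. This is an illustrative sketch, not the article's own code.
    """
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right returns how many sorted observations are <= v.
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical basic feature: number of reviews written by each reviewer.
review_counts = [1, 1, 2, 5, 1, 2, 40, 3]

# Distributional counterpart of the same feature, one value per reviewer.
crf_feature = cumulative_relative_frequency(review_counts)
print(crf_feature)
```

In this toy example the outlier 40 maps to 1.0 and the most common value 1 maps to 0.375, so the transformed feature encodes where each value sits in the dataset rather than its raw magnitude. The extra cost is one sort per feature, in line with the small overhead in the feature engineering phase mentioned above.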

Volume 13
Pages 4:1-4:16
DOI 10.1145/3439307
Language English
Journal ACM J. Data Inf. Qual.

Full Text