International Journal of Intelligent Engineering and Systems | 2021

Text Generation with Content and Structure-Based Preprocessing in Imbalanced Data of Product Review


Abstract


Spam detection typically categorizes product reviews into spam and non-spam. Spam reviews may contain fake reviews as well as non-review statements that describe things unrelated to the product. Most publicly available spam reviews are labelled as fake reviews, while texts labelled non-spam because they are not fake may still contain non-review statements. Noticing these non-review statements is crucial because they mislead consumers. Non-review statements are rarely found, and labelling them in large collections of long texts is usually done manually, which is time-consuming. Because non-review statements are so rare, the data are imbalanced: non-spam forms the majority class, and spam containing non-review statements forms the minority class. Augmenting the spam class with fake reviews is ineffective because their content resembles non-spam, for example opinion words about product features. Generating non-review statements is therefore preferable for adding spam texts. Text generation has its own issues: the commonly used neural network-based methods require large amounts of training data, and existing pre-trained models produce texts whose context differs from that of non-review statements. The augmented texts should share the content and the context represented by the structure of non-review statements. We therefore propose a text generation model with content- and structure-based preprocessing to produce non-review statements, which is expected to overcome the imbalanced data and give better spam detection results for product reviews. Structure-based preprocessing identifies the feature structures of non-opinion words from part-of-speech tags; these features represent the context of spam reviews in unlabelled texts. Content-based preprocessing then assigns selected topic modeling results of non-review statements taken from fake reviews. In our experiments, the BLEU (Bi-Lingual Evaluation Understudy) score, which measures the correspondence between generated and trained texts, improved by approximately 0.04. This value indicates that the generated texts are not identical to the trained non-review statements. Nevertheless, combining these generated texts with the original spam texts gave better spam detection results, increasing the average recall by more than 40%.
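
To make the preprocessing and evaluation steps mentioned above more concrete, the sketch below illustrates how a part-of-speech structure can be derived from the non-opinion words of a sentence and how a generated text can be scored against trained non-review statements with BLEU. It is a minimal illustration using NLTK with a made-up opinion lexicon and toy sentences; it is not the authors' pipeline, and all names and example texts in it are assumptions.

    # Minimal sketch (not the authors' code): POS-based structure of non-opinion
    # words, plus a BLEU check between a generated text and trained statements.
    # The opinion lexicon and the example sentences are placeholder toy data.
    import nltk
    from nltk import pos_tag, word_tokenize
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Resource names differ across NLTK versions; missing names are skipped quietly.
    for pkg in ("punkt", "punkt_tab",
                "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
        nltk.download(pkg, quiet=True)

    # (a) Structure-based preprocessing: keep the POS tags of non-opinion words.
    OPINION_WORDS = {"great", "bad", "excellent", "poor", "love", "hate"}  # toy lexicon

    def non_opinion_structure(sentence):
        """Return the POS-tag sequence of tokens that are not opinion words."""
        tagged = pos_tag(word_tokenize(sentence.lower()))
        return [tag for word, tag in tagged if word not in OPINION_WORDS]

    print(non_opinion_structure("The seller shipped the package very late"))

    # (b) Correspondence evaluation: BLEU between a generated text and trained
    # non-review statements; smoothing avoids zero scores for short texts.
    trained = [
        "the seller shipped the package very late last month".split(),
        "delivery took two weeks and the box arrived damaged".split(),
    ]
    generated = "the package arrived late and the box was damaged".split()
    smooth = SmoothingFunction().method1
    score = sentence_bleu(trained, generated, smoothing_function=smooth)
    print("BLEU: %.4f" % score)  # a low score suggests the generated text is not a copy

A comparable correspondence check could, under these assumptions, be run between each generated non-review statement and its training texts before the generated texts are added to the spam class.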

Volume 14
Pages 516-527
DOI 10.22266/ijies2021.0228.48
Language English
Journal International Journal of Intelligent Engineering and Systems

Full Text