Analyzing the effectiveness of semi-supervised learning approaches for opinion spam classification

Alexander Ligthart, Cagatay Catal*, Bedir Tekinerdogan

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Opinion spam detection is concerned with identifying fake reviews that are deliberately placed to either promote or discredit a product. Opinionated social media such as product reviews are increasingly important resources for people and businesses in decision-making, yet they can be easily manipulated by opportunistic individuals. To reduce the growing impact of opinion spam, detection approaches have been proposed, most of which adopt supervised classification methods. In practice, however, the available data is largely unlabeled, so semi-supervised learning approaches are required instead. To this end, this study analyzes the effectiveness of several semi-supervised learning approaches for opinion spam classification. Four semi-supervised methods are evaluated on a dataset of genuine and deceptive hotel reviews. The results are compared with several traditional classification methods using the same amount of labeled data. In this study, the self-training algorithm with Naive Bayes as the base classifier yields 93% accuracy. The results show that self-training is the only one of the four tested semi-supervised models that outperforms traditional supervised classification models when limited data is available. The study further shows that self-training can reduce labeling effort while retaining high model performance, which is useful in scenarios where limited data is available or obtaining labeled data is costly.
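
The self-training idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `NaiveBayes` class, the `self_train` function, the 0.8 confidence threshold, and the toy reviews are all hypothetical choices made for illustration. The sketch fits a multinomial Naive Bayes model on a few labeled reviews, pseudo-labels the unlabeled reviews it is confident about, and refits on the enlarged set.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for d in docs for w in tokenize(d)}
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        return self

    def predict_proba(self, doc):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for w in tokenize(doc):
                if w in self.vocab:  # words never seen in training are skipped
                    smoothed = self.word_counts[c][w] + 1
                    score += math.log(smoothed / (total_words + len(self.vocab)))
            log_scores[c] = score
        # Convert log scores to probabilities (shifted by the max for stability).
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}

    def predict(self, doc):
        proba = self.predict_proba(doc)
        return max(proba, key=proba.get)


def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.8, max_rounds=10):
    """Repeatedly add confidently pseudo-labeled documents and refit."""
    docs, ys = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    model = NaiveBayes().fit(docs, ys)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for doc in pool:
            proba = model.predict_proba(doc)
            best = max(proba, key=proba.get)
            if proba[best] >= threshold:
                confident.append((doc, best))
            else:
                remaining.append(doc)
        if not confident:  # nothing new to learn from, so stop
            break
        for doc, label in confident:
            docs.append(doc)
            ys.append(label)
        pool = remaining
        model = NaiveBayes().fit(docs, ys)
    return model, len(docs) - len(labeled_docs)


# Toy data, invented for this sketch (not the hotel review dataset of the study).
labeled = ["best hotel ever amazing amazing luxury", "room was clean staff helpful"]
labels = ["deceptive", "genuine"]
unlabeled = ["amazing luxury best experience ever", "clean room and helpful staff"]

model, added = self_train(labeled, labels, unlabeled)
print(added, model.predict("amazing luxury best"), model.predict("clean helpful staff"))
```

The threshold controls the trade-off at the heart of self-training: a higher value adds fewer but more reliable pseudo-labels, while a lower value grows the training set faster at the risk of reinforcing early mistakes.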

Original language: English
Article number: 107023
Journal: Applied Soft Computing
Volume: 101
Publication status: Published - Mar 2021

Keywords

  • Fake reviews
  • Machine learning
  • Opinion spam detection
  • Semi-supervised learning
