The effects of data balancing approaches: A case study

Paul Mooijman, Cagatay Catal*, Bedir Tekinerdogan, Arjen Lommen, Marco Blokland

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

22 Citations (Scopus)

Abstract

Imbalanced datasets adversely affect the performance of machine learning algorithms. To cope with this problem, several resampling methods have been developed in recent years. In this article, we present a case study investigating the effects of data balancing approaches. The case study concerns discriminating between growth-hormone-treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 came from animal studies and were guaranteed to contain growth-stimulating hormones, while the rest were reported to be untreated; the treated class thus makes up only ∼5% of the dataset. In this research, classification algorithms combined with resampling strategies and dimensionality reduction methods were investigated to find a prediction model that correctly identifies the samples of treated animals. Furthermore, to cope with the large number of missing data points in the dataset, missing values were replaced with random low values. Our results showed that this replacement method was effective, and that the best-performing models, obtained after log transformation of the dataset followed by Recursive Feature Elimination, were LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier combined with SMOTE, and LinearDiscriminantAnalysis.
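The preprocessing and balancing steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The toy intensities, the function names, and the reading of "random low values" as a uniform draw between 10% and 100% of a feature's smallest observed value are all assumptions made for illustration. The oversampler is a simplified SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours, not the full algorithm as shipped in a library. Recursive Feature Elimination and the classifiers themselves are left out.

```python
import math
import random


def fill_missing_with_low_values(rows, rng, low=0.1, high=1.0):
    """Replace None with a random value at or below the column's smallest observed value.

    Assumption: the range [low * column_min, high * column_min] is hypothetical;
    the abstract does not state how the low values were drawn.
    Every column needs at least one observed, positive value.
    """
    n_cols = len(rows[0])
    col_min = [min(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [
        [v if v is not None else rng.uniform(low, high) * col_min[j]
         for j, v in enumerate(r)]
        for r in rows
    ]


def log_transform(rows):
    """Natural-log transform; valid because all filled intensities are positive."""
    return [[math.log(v) for v in r] for r in rows]


def smote_like_oversample(minority, n_new, rng, k=5):
    """Create n_new synthetic samples by interpolating towards a near neighbour.

    Simplified SMOTE-style step: pick a minority sample, pick one of its k
    nearest minority neighbours, and place a new point on the segment between them.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [m for idx, m in enumerate(minority) if idx != i]
        others.sort(key=lambda m: math.dist(x, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Class share reported in the abstract: 65 treated samples out of 1241.
minority_share_percent = round(65 / 1241 * 100, 1)

# Hypothetical toy data: 3 "treated" rows followed by 9 "untreated" rows,
# with None marking a missing intensity.
raw = [
    [120.0, None, 30.0], [150.0, 8.0, None], [110.0, 6.0, 25.0],
    [40.0, 2.0, None], [35.0, None, 12.0], [50.0, 3.0, 10.0],
    [45.0, 2.5, 11.0], [38.0, None, 9.0], [42.0, 2.2, None],
    [47.0, 2.8, 13.0], [36.0, 1.9, 10.5], [41.0, None, 12.5],
]
labels = [1] * 3 + [0] * 9

rng = random.Random(0)  # fixed seed so the sketch is repeatable
filled = fill_missing_with_low_values(raw, rng)
logged = log_transform(filled)

minority_log = [r for r, y in zip(logged, labels) if y == 1]
majority_log = [r for r, y in zip(logged, labels) if y == 0]

# Oversample the treated class until both classes have the same size.
synthetic = smote_like_oversample(
    minority_log, len(majority_log) - len(minority_log), rng, k=2
)
balanced_minority = minority_log + synthetic
```

In a real evaluation, oversampling would normally be applied to the training folds only, so that synthetic samples never leak into the data used to score the model.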

Original language: English
Article number: 109853
Journal: Applied Soft Computing
Volume: 132
Publication status: Published - Jan 2023

Keywords

  • Cattle
  • Classification
  • Feature selection
  • Hormone abuse detection
  • Imbalanced dataset
  • LC–MS
  • Missing data
  • Resampling
  • Supervised machine learning
