Ecol. Informatics | 2021

Effects of class imbalance on resampling and ensemble learning for improved prediction of cyanobacteria blooms


Abstract


This study aimed to explicitly explore the effects of the degree of class imbalance on predicting infrequently occurring events, i.e., cyanobacteria blooms. Although class imbalance poses a major issue in binary classification schemes, few efforts have been made to relate model performance with real-life applications. The data utilized herein were collected from 2013 to 2019 at 13 sites within three major rivers in South Korea; a variety of physicochemical and hydrometeorological factors were obtained as input variables, and the occurrence of cyanobacteria blooms (indicated by a cell count ≥ 1000 cells/mL) was included as a response variable. The imbalance ratio (IR) for cyanobacteria blooms differed significantly by site, ranging widely from 0.93 to 9.32. The study results suggested that class imbalance negatively affected model performance, with an increase in the IR significantly increasing the false negative (FN) rate. The application of resampling decreased the FN rate while simultaneously increasing the true positive (TP) rate, which yielded improvements that tended to increase with increasing IRs. Ensemble classifiers, which combine multiple single classifiers into an integrated classifier, alone could not successfully address the class imbalance problem; however, in combination with resampling, they consistently outperformed single classifiers. Among the ensemble classifiers, AdaBoost yielded the most stable performance across a range of IRs, irrespective of the resampling application. A variable importance analysis indicated that temperature was usually the primary influencing factor of cyanobacteria blooms. These findings highlight the effectiveness of resampling applications for addressing class imbalance, while providing useful guidelines for learning from imbalanced data, including the selection of classification algorithms and model evaluation metrics.
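The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and it rests on assumptions the abstract does not state: the IR is taken as the number of non-bloom samples divided by the number of bloom samples, the FN and TP rates are normalized by the number of actual bloom samples, and plain random oversampling stands in for whichever resampling method the study applied. All function names and the toy data are hypothetical.

```python
import random


def imbalance_ratio(labels):
    """Non-bloom (0) count divided by bloom (1) count; an assumed IR definition."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return neg / pos


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((pos_idx, neg_idx), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def fn_and_tp_rate(y_true, y_pred):
    """Return (FN rate, TP rate), both as fractions of the actual bloom samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fn / (tp + fn), tp / (tp + fn)


# Toy data (hypothetical): 9 non-bloom samples and 1 bloom sample.
rows = [[float(i)] for i in range(10)]
labels = [0] * 9 + [1]

ir_before = imbalance_ratio(labels)
balanced_rows, balanced_labels = random_oversample(rows, labels)
ir_after = imbalance_ratio(balanced_labels)

# Hypothetical predictions used only to exercise the metric function.
fn_rate, tp_rate = fn_and_tp_rate([1, 1, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
```

Under these assumptions, resampling would be applied to the training split only, so that the evaluation data keep the original class proportions.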

Volume 61
Pages 101202
DOI 10.1016/j.ecoinf.2020.101202
Language English
Journal Ecol. Informatics
