Inf. Sci. | 2021

A hybrid data-level ensemble to enable learning from highly imbalanced dataset


Abstract


Highly imbalanced class distribution is well recognized as a major cause of performance degradation for most supervised learning algorithms. Unfortunately, such detrimental distributions occur inherently in many real-world applications. In this work, we developed a hybrid data-level ensemble (HD-Ensemble), which integrates ensemble learning with a combination of margin-based undersampling and diversity-enhancing oversampling. The proposed undersampling method filters out a certain number of unrepresentative majority instances based on an unsupervised margin definition, while the proposed oversampling method generates diverse minority instances according to the behavior of ensemble learning. Combining the two data-level approaches serves a twofold purpose: balancing the data distribution and optimizing the fundamental properties of the ensemble (e.g., margin distribution and diversity). As a result, the inferior performance caused by adopting a single data-level approach can be better addressed. Targeting binary classification, we evaluated HD-Ensemble on 42 highly imbalanced datasets, which varied considerably in sample number (ranging from 129 to 20,034), feature number (ranging from 3 to 5,000) and imbalance ratio (ranging from 9.08 to 970.6). Experimental results demonstrated the performance advantages of the proposed HD-Ensemble over ten other ensemble solutions.
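To make the idea of a hybrid data-level ensemble concrete, the following is a minimal Python sketch, not the authors' method. The abstract does not specify the margin definition or the oversampling rule, so two simple stand-ins are assumed here: random undersampling in place of the margin-based filter, and jittered duplication in place of the diversity-enhancing oversampler. The function names, the `jitter` parameter, and the choice of a geometric-mean target size are all hypothetical. The imbalance ratio is taken to be the majority-class count divided by the minority-class count, which is the usual definition.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def hybrid_resample(X, y, rng, jitter=0.01):
    """Build one balanced training bag for a binary problem.

    Assumed stand-ins (NOT the paper's methods):
      - random undersampling replaces the margin-based filter
      - jittered duplication replaces the diversity-enhancing oversampler
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = next(label for label in counts if label != minority)
    maj = [x for x, label in zip(X, y) if label == majority]
    mino = [x for x, label in zip(X, y) if label == minority]

    # Meet in the middle: geometric mean of the two class sizes (an assumption).
    target = max(len(mino), round((len(maj) * len(mino)) ** 0.5))

    # Undersample the majority class without replacement.
    kept_maj = rng.sample(maj, target)

    # Oversample the minority class by copying random instances with noise.
    extra = [
        [v + rng.gauss(0.0, jitter) for v in rng.choice(mino)]
        for _ in range(target - len(mino))
    ]
    new_min = mino + extra

    bag_X = kept_maj + new_min
    bag_y = [majority] * len(kept_maj) + [minority] * len(new_min)
    return bag_X, bag_y


def make_bags(X, y, n_bags=5, seed=0):
    """One differently seeded bag per base learner of the ensemble."""
    return [hybrid_resample(X, y, random.Random(seed + i)) for i in range(n_bags)]


# Toy data: 90 majority and 10 minority instances, one feature each.
X_demo = [[float(i)] for i in range(100)]
y_demo = [0] * 90 + [1] * 10
ratio_demo = imbalance_ratio(y_demo)  # 90 / 10 = 9.0
bags_demo = make_bags(X_demo, y_demo, n_bags=3)
```

In a full pipeline, each bag would train one base classifier and the predictions would be aggregated, for example by majority vote. Mixing the two resampling directions avoids both discarding most of the majority class and flooding the bag with synthetic minority points.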

Volume 554
Pages 157-176
DOI 10.1016/j.ins.2020.12.023
Language English
Journal Inf. Sci.
