IEEE Access | 2021

An Improved MAHAKIL Oversampling Method for Imbalanced Dataset Classification

Abstract


In imbalanced dataset classification, traditional oversampling techniques do not take the distribution characteristics of the data into account, which easily results in low classification accuracy on positive samples. To improve the recognition rate of positive samples, this article proposes an algorithm that combines K-means clustering with MAHAKIL oversampling. First, the K-means clustering algorithm divides the positive samples into K clusters. The samples in each cluster are then divided into two initial populations according to the magnitude of their Mahalanobis distance. Finally, the crossover operator of a genetic algorithm is applied repeatedly to synthesize new samples until the dataset is balanced. Experimental results show that the K-means MAHAKIL oversampling algorithm achieves better classification performance when combined with different classification algorithms, which verifies its effectiveness.
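The three steps named in the abstract (cluster the positive samples, split each cluster by Mahalanobis distance, synthesize children by crossover) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`kmeans`, `mahalanobis_sq`, `synthesize_cluster`, `kmeans_mahakil`) are hypothetical, and several details are assumptions the abstract does not specify: the crossover is taken to be the midpoint of two parents, new samples are allocated to clusters in proportion to cluster size, clusters with fewer than two samples are skipped, and a plain Lloyd-style K-means stands in for whatever clustering routine the paper uses.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd-style K-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the mean of X."""
    diff = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariance
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def synthesize_cluster(X, n_new):
    """Create n_new samples from one cluster by repeated midpoint crossover."""
    if len(X) < 2 or n_new <= 0:
        return np.empty((0, X.shape[1]))
    # Sort by distance (largest first) and split into two parent populations.
    order = np.argsort(-mahalanobis_sq(X))
    half = len(X) // 2
    pop_a, pop_b = X[order[:half]], X[order[half:2 * half]]
    children, total = [], 0
    queue = [(pop_a, pop_b)]
    while total < n_new:
        a, b = queue.pop(0)
        child = (a + b) / 2.0  # assumed crossover: average of the two parents
        children.append(child)
        total += len(child)
        # Later generations pair each child with one of its parents.
        queue.append((a, child))
        queue.append((child, b))
    return np.vstack(children)[:n_new]


def kmeans_mahakil(X_min, n_new, k=2, seed=0):
    """Oversample the positive (assumed minority) samples X_min by n_new rows."""
    labels = kmeans(X_min, k, seed=seed)
    clusters = [X_min[labels == j] for j in range(k)]
    clusters = [c for c in clusters if len(c) >= 2]  # too small to pair
    if not clusters:
        raise ValueError("no cluster has at least two samples")
    total = sum(len(c) for c in clusters)
    parts, remaining = [], n_new
    for i, c in enumerate(clusters):
        if i == len(clusters) - 1:
            quota = remaining  # last cluster absorbs rounding leftovers
        else:
            quota = min(remaining, round(n_new * len(c) / total))
        parts.append(synthesize_cluster(c, quota))
        remaining -= quota
    return np.vstack(parts)
```

A call such as `kmeans_mahakil(X_min, n_new=len(X_maj) - len(X_min), k=3)` would, under these assumptions, return enough synthetic positive samples to match a majority set `X_maj`. Because every child is an average of two samples from the same cluster, synthetic points stay inside that cluster's region. This is one way to read the abstract's point about respecting the distribution of the data.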

Volume 9
Pages 16030-16040
DOI 10.1109/ACCESS.2020.3047741
Language English
Journal IEEE Access
