Proceedings of the 23rd Pan-Hellenic Conference on Informatics | 2019

Utilizing the buckshot algorithm for efficient big data clustering in the MapReduce model

 
 

Abstract


Clustering is an efficient data mining as well as machine-learning method when we need to get an insight of the objects of a dataset that could be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the most known and commonly used methods of clustering; the former due to its low time cost and the latter due to its accuracy. However, even the use of K-Means in document clustering over large-scale collections can lead to unpredictable time costs. In this paper, towards the direction of the efficient handling of big text data, we present a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure on a sample of the input dataset and then uses the results as the initial centers for a K-Means based assignment of the remaining documents, with very few iterations. We also give a highly efficient adaptation of the proposed Buckshot-based approach in the MapReduce model which is then experimentally tested using Apache Hadoop over a real cluster environment. As it comes out of the experiments, it leads to acceptable clustering quality as well as to significant execution time improvements. Preliminary results drawn from relevant experiments using the Spark framework are also presented.

Volume None
Pages None
DOI 10.1145/3368640.3368658
Language English
Journal Proceedings of the 23rd Pan-Hellenic Conference on Informatics

Full Text