Inf. Sci. | 2021

Data clustering via cooperative games: A novel approach and comparative study

 
 

Abstract


Arguably, the main purpose of cluster analysis is to develop algorithms that reveal natural groupings (clusterings) over a set of data points based on their similarity. Cooperative game theory (CGT), in turn, studies the formation of groups (coalitions) of decision makers (players) and ways to split the resulting income among them. Owing to the conceptual similarity between these fields, algorithms rooted in CGT have recently emerged for tackling the data clustering problem. In this work, we revisit two such algorithms, one based on cluster prototypes (Biobjective Game Clustering, BiGC) and the other based on dense regions of data points (Density-Restricted Agglomerative Clustering, DRAC). We also present a novel partitional clustering algorithm, referred to as HGC (after Hedonic Game based Clustering), which is grounded on theoretical results stemming from the subclass of hedonic games. Two HGC versions are investigated, which differ in the order of the players in the game, and a detailed factorial simulation study is reported to analyze how sensitive they are to three relevant factors, namely number of clusters, number of features, and noise level. In addition, a heuristic is provided to calibrate the value of HGC's single control parameter (viz., the number of nearest neighbors of each point) so as to yield high-quality partitions. To compare the performance of the CGT algorithms, a series of experiments was conducted on UCI and gene-expression data sets, most of which are high dimensional. Overall, the results measured by 10 external validation indices indicate that HGC is usually more stable and effective than DRAC and BiGC. They also show that HGC is very competitive with (and sometimes considerably better than) well-known clustering algorithms and variants (specifically, k-means, k-means++, affinity propagation, two variants of hierarchical clustering, and the density peak clustering algorithm). Remarkably, HGC fully recovered the true clustering structures for two gene-expression data sets.
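As an illustration of what an external validation index measures, the sketch below computes the plain Rand index between a reference labeling and a predicted partition. This is only an assumed example: the abstract does not list which 10 indices were used, and the function name `rand_index` and the toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two partitions agree.

    A pair counts as an agreement when both partitions place the two
    points in the same cluster, or both place them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two points are required")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_true = labels_true[i] == labels_true[j]
        same_in_pred = labels_pred[i] == labels_pred[j]
        if same_in_true == same_in_pred:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Toy labels (hypothetical): the predicted partition matches the
# reference up to a renaming of the cluster labels.
reference = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]
score = rand_index(reference, predicted)
```

A value of 1.0 means the predicted partition recovers the reference structure exactly, regardless of how the clusters are named, which is the sense in which a clustering can "fully recover" a known grouping.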

Volume 545
Pages 791-812
DOI 10.1016/J.INS.2020.09.018
Language English
Journal Inf. Sci.
