Concurrency and Computation: Practice and Experience | 2019

Design and evaluation of a parallel document clustering algorithm based on hierarchical latent semantic analysis


Abstract


We propose a parallel generalization scheme for Singular Value Decomposition–based clustering algorithms. The scheme enables such an algorithm to generate a hierarchy of clusters instead of a flat set of clusters. It automatically infers the number of levels in the hierarchy and the number of clusters per level, without depending on any user‐supplied parameter. The performance of the resulting hierarchical clustering algorithm was evaluated using the web directory taxonomy hosted by the Open Directory DMOZ. Empirical evaluations and statistical tests reveal that the proposed generalization scheme produces a superior cluster hierarchy compared with two existing generalization techniques in terms of precision, recall, F‐measure, and the Rand index. The generalization scheme is well‐equipped to deal with large datasets, and the speed‐up achieved by the parallelized scheme over its sequential variant was measured on a multicore computer.

Volume 31
DOI 10.1002/cpe.5094
Language English
Journal Concurrency and Computation: Practice and Experience
