IEEE Transactions on Knowledge and Data Engineering | 2021
Efficient Computation and Visualization of Multiple Density-Based Clustering Hierarchies
Abstract
HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter $mpts$. While a small change in $mpts$ typically leads to a small change in the clustering structure, choosing a “good” $mpts$ value can be challenging: depending on the data distribution, a high or low $mpts$ value may be more appropriate, and certain clusters may reveal themselves at different values.
To explore results for a range of $mpts$ values, one has to run HDBSCAN* for each value independently, which can be computationally impractical. In this paper, we propose an approach to efficiently compute all HDBSCAN* hierarchies for a range of $mpts$ values by building upon results from computational geometry to replace HDBSCAN*’s complete graph with a smaller equivalent graph. An experimental evaluation shows that our approach can obtain over one hundred hierarchies for the computational cost equivalent to running HDBSCAN* about twice, which corresponds to a speedup of more than 60 times compared to running HDBSCAN* independently that many times. We also propose a series of visualizations that allow users to analyze a collection of hierarchies for a range of $mpts$ values, along with case studies that illustrate how these analyses are performed.
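To make the baseline cost concrete, the following is a minimal sketch of the naive strategy mentioned above: recomputing a minimum spanning tree over the complete graph once per $mpts$ value. It is not the approach proposed in the paper. It assumes Euclidean distance and the usual HDBSCAN* definitions, which this abstract does not spell out: the core distance of a point is the distance to its $mpts$-th nearest neighbor (counting the point itself), and the mutual reachability distance between two points is the maximum of their two core distances and their distance. All function names are hypothetical.

```python
import math


def core_distances(points, mpts):
    """Distance from each point to its mpts-th nearest neighbor.

    Assumption: the point itself counts as its own first neighbor,
    so mpts = 1 gives a core distance of 0 for every point.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[mpts - 1])
    return cores


def mutual_reachability_mst(points, mpts):
    """Prim's algorithm on the complete mutual reachability graph.

    Returns a list of (weight, u, v) edges. Every pair of points is
    considered, which is the O(n^2) cost paid once per mpts value.
    """
    n = len(points)
    core = core_distances(points, mpts)

    def mreach(i, j):
        return max(core[i], core[j], math.dist(points[i], points[j]))

    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge connecting each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                w = mreach(u, v)
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges


def msts_for_range(points, mpts_values):
    """Naive baseline: one independent MST computation per mpts value."""
    return {m: mutual_reachability_mst(points, m) for m in mpts_values}


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
    for m, tree in msts_for_range(data, range(1, 4)).items():
        print(m, round(sum(w for w, _, _ in tree), 3))
```

Each call to `mutual_reachability_mst` scans all pairs of points again, so exploring k values of $mpts$ this way costs roughly k times a single run. That repeated work over the complete graph is what the smaller equivalent graph is meant to avoid.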