ArXiv | 2021

Statistically-Robust Clustering Techniques for Mapping Spatial Hotspots: A Survey

 
 
 

Abstract


© 2020 Association for Computing Machinery. Vol. 1, No. 1, Article . Publication date: March 2020. arXiv:2103.12019v1 [stat.ML] 22 Mar 2021. Xie, Shekhar and Li.

design issues, etc. These are just a few of the many examples, and mapping of hotspots is also widely applied in other important use cases such as agriculture, forestry, fishery, environment, astronomy and disaster management (detailed in Sec. 2).

While spatial hotspots are naturally clusters of events, mapping of hotspots poses many unique challenges. Here we summarize these challenges in four aspects.

First, spurious results often have very high social and economic costs in the application domains of hotspot mapping [12, 160]. For example, if a city falsely claims a neighborhood as a crime cluster, there can be many unnecessary negative impacts, such as reducing property values in that region, hurting local small businesses, increasing mental pressure on residents and potentially pushing them to move. Unfortunately, spurious clusters are very common in real-world datasets and can easily arise as a consequence of natural randomness. Thus, decision makers in important application domains require statistical rigor as a necessary component to robustly control the rate of spurious patterns.

Second, mapping of spatial hotspots needs explicit consideration of underlying risk factors, which are typically not considered in traditional clustering approaches [87, 115]. By definition, hotspots are regions with significantly higher rates or probability density of generating certain events, which is fundamentally different from having a higher density of events.
For example, having 1,000 crimes in a region with a population of one million is different from having the same number of crimes in an equal-area region with a population of one thousand. Thus, the distribution of underlying risk factors, either discrete or continuous, needs to be explicitly considered in the clustering process.

Third, hotspots are geographically contiguous regions and should not be fragmented pieces in spatial dimensions [167, 170]. As a result, spatial attributes cannot simply be mixed with other non-spatial attributes during the clustering process, and need specific modeling.

Finally, while a hotspot is often a geographically contiguous region, the observed events inside it may not necessarily form a continuous or smooth distribution of density, especially when the total number of observations is small (i.e., susceptible to higher variance caused by natural randomness). This challenges basic assumptions on contiguous density used in many traditional clustering approaches [20, 52].

Clustering has long been a core topic in data mining and machine learning, providing a prolific set of techniques of various families. Partitioning-based clustering methods (e.g., k-means, Gaussian mixture models, CLARANS [136]) split the input data into k groups to minimize within-cluster dissimilarities. Density-based approaches (e.g., DBSCAN [52], OPTICS [10], HDBSCAN [20]) measure density using the distances (e.g., Minkowski, Mahalanobis) between data points and apply local or global density criteria to separate clusters from noise. These approaches do not rely on a pre-defined number of clusters and can be made flexible for different shapes and varying densities. Many techniques also utilize the hierarchical structure of clusters, e.g., by first splitting data into small units and then sequentially merging them into final clusters based on similarity and adjacency, or in the opposite direction (e.g., Chameleon [84], BIRCH [198], CURE [65], HDBSCAN [20]).
These different families of approaches are not necessarily mutually exclusive (e.g., HDBSCAN leverages both density and hierarchy). Beyond these, there are many other types of clustering methods, such as grid-based (e.g., STING [181], CLIQUE [7]), graph-spectrum-based (e.g., spectral clustering [178]), kernel-based (e.g., kernel k-means [156], SVC [15]), etc. Given that these techniques have been well summarized in other surveys on general clustering (e.g., [190, 191]), here we skip the details and concentrate on the topic of statistically-robust clustering for hotspot mapping.

Traditional clustering techniques are not sufficient for responding to the unique challenges posed by hotspot mapping. These techniques, for example, often do not consider the high costs of spurious results and are prone to return many clusters formed by natural randomness (e.g., k-means always returns k partitions, and DBSCAN and its variations have difficulty avoiding spurious high-density regions in random point distributions [20, 115, 186, 187]). In addition, these methods normally do not consider underlying risk factors (e.g., population at risk as control), geographic contiguity of outputs, and non-contiguous within-cluster density (e.g., due to small data or randomness).

To address these challenges, recent decades have seen a blossoming of research on statistically-robust clustering techniques with explicit consideration and modeling of these various needs of hotspot mapping. As many of the formulations are based on ideas from scan statistics, which were first developed by the statistics community, statistically-robust clustering is also known as scan statistics. Since both clustering and scan statistics have been widely used in the literature, we use clustering in the rest of the survey to better connect with the intended audience in data mining.
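
The scan-statistic idea behind statistically-robust clustering can be illustrated with a minimal sketch. It is not taken from the survey or from any surveyed implementation, and it rests on several assumptions: a Kulldorff-style Poisson log-likelihood ratio as the test statistic, circular candidate zones centred on grid cells, a 50% cap on the population inside a zone, Monte Carlo significance testing with 99 replicas, and a toy 5x5 grid with a planted hotspot. All function names (`poisson_llr`, `best_zone`, `scan`) are hypothetical.

```python
import math
import random


def poisson_llr(c, b, total):
    """Poisson log-likelihood ratio of a zone with c observed and b expected
    cases out of `total` cases overall; 0 unless the zone shows an excess."""
    if c <= b:
        return 0.0
    llr = c * math.log(c / b)
    if total > c:
        llr += (total - c) * math.log((total - c) / (total - b))
    return llr


def best_zone(cells, cases, max_pop_frac=0.5):
    """Scan circles centred on every cell; return (best score, cell indices).
    `cells` holds (x, y, population) tuples with positive populations."""
    n = len(cells)
    total_pop = sum(pop for _, _, pop in cells)
    total_cases = sum(cases)
    best = (0.0, ())
    for cx, cy, _ in cells:
        d2 = [(x - cx) ** 2 + (y - cy) ** 2 for x, y, _ in cells]
        order = sorted(range(n), key=d2.__getitem__)
        pop_in = cases_in = 0
        for k, i in enumerate(order):
            pop_in += cells[i][2]
            cases_in += cases[i]
            if pop_in > max_pop_frac * total_pop:
                break
            if k + 1 < n and d2[order[k + 1]] == d2[i]:
                continue  # add all equidistant cells so the zone stays a circle
            # Expected count is proportional to population (the risk factor).
            expected = total_cases * pop_in / total_pop
            score = poisson_llr(cases_in, expected, total_cases)
            if score > best[0]:
                best = (score, tuple(sorted(order[: k + 1])))
    return best


def scan(cells, cases, n_sim=99, seed=0):
    """Return (score, zone, Monte Carlo p-value) for the best zone."""
    rng = random.Random(seed)
    observed, zone = best_zone(cells, cases)
    pops = [pop for _, _, pop in cells]
    exceed = 0
    for _ in range(n_sim):
        # Null hypothesis: each case falls in a cell with probability
        # proportional to that cell's population.
        sim = [0] * len(cells)
        for i in rng.choices(range(len(cells)), weights=pops, k=sum(cases)):
            sim[i] += 1
        if best_zone(cells, sim)[0] >= observed:
            exceed += 1
    return observed, zone, (exceed + 1) / (n_sim + 1)


# Toy data: 5x5 grid, 1,000 residents per cell, elevated counts in the
# radius-1 circle around cell (1, 1).
cells = [(x, y, 1000) for x in range(5) for y in range(5)]
hot = {(1, 1), (0, 1), (2, 1), (1, 0), (1, 2)}
cases = [10 if (x, y) in hot else 1 for x, y, _ in cells]
score, zone, p_value = scan(cells, cases)
print(score, zone, p_value)
```

On this toy grid the scan should recover the five planted cells with a small Monte Carlo p-value, whereas perfectly uniform counts give a best score of 0 and no zone at all. Because the expected count scales with population, the score reflects an excess rate rather than a raw event density.
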
First, statistical significance of cluster candidates was modeled and incorporated into the detection process to provide robust control of the rate of spurious results. New algorithms were also developed to reduce the computational cost of significance testing, which often requires Monte Carlo estimation due to the complexity of test statistics. Second, underlying risk factors were explicitly modeled into the test statistics used for candidate evaluation, allowing the scores to reflect differences in statistical processes rather than directly observed densities of events. Third, the statistically-robust clustering techniques can also guarantee that the output clusters are contiguous in geographic space by modeling the problem as a region-maximization problem. In this paradigm, the enumeration space of region candidates is clearly defined (e.g., circles, rings or linear paths formed by subsets of data) and a wide variety of efficient region-maximization algorithms have been developed. This enumeration scheme also addresses the fourth challenge, that within-cluster density may not be smooth or continuous due to the effect of natural randomness. By directly enumerating geographic regions instead of forming them by local or hierarchical density criteria (e.g., DBSCAN, HDBSCAN), regions can be evaluated regardless of their inner-density distribution (see footnote 1).

There are several related surveys on traditional and statistically-robust clustering. For traditional clustering, multiple taxonomies are provided in [17, 107, 141, 190, 191] to summarize a broad spectrum of techniques, including partition-based, density-based, hierarchy-based, graph-spectrum-based, kernel-based, spatial and non-spatial, etc. These surveys do not cover the models and techniques developed for statistically-robust clustering. A few surveys also exist on statistically-robust clustering from different perspectives.
Kulldorff (1999) [88] provides a summary of early methods and applications of spatial scan statistics, as well as basic enumeration algorithms; an overview of earlier domain applications was also discussed later in [34]. Neill and Moore (2006) [129] describe a few extensions of the spatial scan statistic, including efficient computational strategies for rectangular-shaped clusters and an expectation-based approach to improve space-time cluster detection. Later, two sequential handbooks on scan statistics were developed mainly by the statistics community [59, 61], covering both new and existing theoretical results on distribution estimation (e.g., closed-form solutions, approximations), extreme values, sequences, etc. The handbooks also included a few chapters that are related to data mining, including summaries on Bayesian scan statistics [121] and irregular-shaped cluster detection [45]. Finally, the topic of statistically-robust clustering (a.k.a. hotspot detection) has also been briefly discussed (e.g., as a paragraph or section) in broad spatio-temporal data mining and data science surveys [12, 160, 185].

Given the richness of application contexts and the large variety of techniques developed along different dimensions of statistically-robust clustering, there is a need for a taxonomy from the computer science or data mining perspective that decomposes the complex modeling and detection process into a set of key steps, highlights their roles and mutual relationships, and discusses the advances in each key step by extracting corresponding contributions from the vast (and often versatile) literature. This can promote cross-pollination and synergistic integration of ideas across various research communities, especially considering that the developments of general clustering and statistically-robust clustering have largely run on siloed tracks. This can also help engage the broader data mining community in addressing the challenges and fostering future directions in statistically-robust cl

Footnote 1: If continuous density is desired, constraints can also be added to enforce it.

Volume abs/2103.12019
DOI 10.1145/3487893
Language English
Journal ArXiv
