FISHDBC: Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering for Arbitrary Data and Distance
Matteo Dell’Amico
Symantec Research
[email protected]
ABSTRACT
FISHDBC is a flexible, incremental, scalable, and hierarchical density-based clustering algorithm. It is flexible because it empowers users to work on arbitrary data, skipping the feature extraction step that usually transforms raw data into numeric arrays, letting users define an arbitrary distance function instead. It is incremental and scalable: it avoids the O(n²) performance of other approaches in non-metric spaces and requires only lightweight computation to update the clustering when a few items are added. It is hierarchical: it produces a "flat" clustering which can be expanded to a tree structure, so that users can group and/or divide clusters in sub- or super-clusters when data exploration so requires. It is density-based and approximates HDBSCAN*, an evolution of DBSCAN. We evaluate FISHDBC on 8 datasets, confirming its scalability. Our quality metrics show that FISHDBC often performs comparably to HDBSCAN*, and sometimes FISHDBC's results are even preferable thanks to a regularization effect.

1 INTRODUCTION

In exploratory data analysis (EDA), data are often large, complex, and arrive in a streaming fashion; clustering is an important tool for EDA, because it summarizes datasets—making them more amenable to human analysis—by grouping similar items. Data can be complex because of heterogeneity: consider, e.g., a database of user data as diverse as timestamps, IP addresses, user-generated text, geolocation information, etc. Clustering structure can be complex as well, involving clusters within clusters. Complexity requires clustering algorithms that are flexible, in the sense that they can deal with arbitrarily complex data, and are able to discover hierarchical clusters. Large datasets call for scalable solutions, and streaming data benefits from incremental approaches where the clustering can be updated cheaply as new data items arrive. In addition, it is desirable to distinguish signal from noise with algorithms that do not fit isolated data items into clusters.

As discussed in Section 2, while these problems have been considered previously in the literature, our proposal tackles all of them at once. FISHDBC, which stands for
Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering, is flexible because it is applicable to arbitrary data and distance functions: rather than being forced to convert data to numeric values through a feature extraction process that may lose valuable information, domain experts can encode as much domain knowledge as needed by defining any symmetric and possibly non-metric distance function, no matter how complex—our implementation accepts arbitrary Python functions as distance measures. FISHDBC is incremental: it holds a set of data structures to which new data can be added cheaply and from which clustering can be computed quickly; in a streaming context, new data can be added as they arrive, and clustering can be computed inexpensively. FISHDBC is also scalable, in the sense that it avoids in most common cases the O(n²) complexity that most clustering algorithms have when dealing with non-metric spaces; our experiments show that it can scale to millions of data items. It is hierarchical, recognizing clusters within clusters. FISHDBC belongs to the family of density-based algorithms inspired by DBSCAN [9], inheriting the ability to recognize clusters of arbitrary shapes and filtering noise.

FISHDBC approximates HDBSCAN* [4], an evolution of DBSCAN supporting hierarchical clustering and recognizing clusters with different densities; HDBSCAN*, however, has O(n²) computational complexity when using distance functions for which no accelerated indexing exists. The key idea that allows FISHDBC to be flexible and incremental while maintaining scalability is maintaining a data structure—a spanning tree connecting data items—which is updated as new items are added to the dataset. The problems of neighbor discovery and incremental model maintenance are separated, making the algorithm simpler to understand, implement and modify. In Section 3 we present the algorithm, together with an analysis of its time and space complexity and its relationship with HDBSCAN*.

We evaluate FISHDBC on 8 datasets varying by size, dimensionality, data type, and distance function used. In Section 4, we validate the scalability and show that clustering quality metrics are often close to the ones of HDBSCAN*, and sometimes they outperform it thanks to a regularization effect. We conclude by discussing when FISHDBC is preferable to existing approaches in Section 5.

2 RELATED WORK

Several algorithms have a subset of the desirable properties discussed in Section 1: for example, spectral clustering [12] is not limited to spherical clusters; agglomerative methods [30] produce hierarchical clusters and can have incremental implementations. To the best of our knowledge, though, no other algorithm embodies at once all the properties that FISHDBC satisfies, being flexible, incremental, scalable, and providing hierarchical density-based clustering. Due to space limitations, we cannot cover all approaches that have some of the above properties. In the following, we focus on density-based clustering and approaches applicable to arbitrary data and (potentially non-metric) dissimilarity/distance functions.
Relational Clustering.
These algorithms take as input a distance matrix D containing all O(n²) pairwise distances. Among them, some are specialized towards arbitrary (non-metric) distances [11, 19]. Unfortunately, these methods are intrinsically not scalable because computing D requires Ω(n²) time. FISHDBC scales better because not all pairwise distances are computed: rather than taking a matrix as input, FISHDBC takes a dataset of arbitrary items and a distance function to apply to them: the distance function will be called on a small subset of the O(n²) item pairs.

Spectral clustering, which is expensive because it involves factorizing an O(n²)-sized affinity matrix, can be accelerated via the Nyström method [13]: computing approximate eigenvectors by randomly sampling matrix rows. This sampling approach would be ineffective for density-based clustering, as it would not retrieve a good approximation of each node's local neighborhood, which density-based algorithms need to discover dense areas. FISHDBC is instead guided by an approximate neighbor search converging towards each node's neighbors, discovering most of them cheaply.

Density-Based Clustering on Arbitrary Data.
Density-based clustering was introduced with DBSCAN [9] and generalized to arbitrary data in GDBSCAN [37], in which clusters are connected dense areas: given a definition of an item's neighborhood (in most cases, given a distance function, the items at distance smaller than a threshold ε), a node is considered to be in a dense area if its neighborhood contains at least MinPts points, and each node in its neighborhood is considered to be in the same cluster. In the general case, GDBSCAN has O(n²) complexity, even though indexing structures can lower the computational complexity of the algorithm, depending on the complexity of range queries [38], which are O(n) in the general case of arbitrary distance functions. Some subsequent pieces of work still require indexing structures to lower computational complexity [23], while others [2] are based on filter functions, i.e., cheap functions that return a superset of an item's neighborhood: in this latter case, complexity depends on the filter function's selectivity, i.e., how big its output is. Unlike these approaches, FISHDBC does not require users to provide an indexing structure or a filter function tailored to the distance function used, and it avoids O(n²) complexity by introducing approximation.

NG-DBSCAN [22] is a distributed approximate DBSCAN implementation that discovers neighbors in arbitrary spaces with an approach inspired by the NN-Descent [6] approximate nearest-neighbor algorithm. Other approaches [17, 21] use a similar strategy. Unlike FISHDBC, these approaches are not incremental: their results must be wholly recomputed as the dataset changes. Moreover, FISHDBC benefits from the better scalability of HNSWs over NN-Descent [1]. Finally, compared to these works, FISHDBC inherits the improvements of HDBSCAN* over DBSCAN: better clustering, one less parameter, and hierarchical output.

Incremental Density-Based Clustering.
Unlike our work, existing incremental density-based clustering algorithms [10, 14, 18] have quadratic complexity in non-metric spaces; moreover, they generally report speed-up factors lower than 100 for incremental recomputation after adding a few elements. What we obtain (see Tables 3 and 8, "cluster" columns) is generally similar or better.
HDBSCAN*.
Campello et al. [4] improve on DBSCAN while removing the cluster density threshold ε, which is tuned automatically and separately for each cluster. In addition to simplifying tuning, result quality improves because the output can include clusters having different density in the same dataset.

HDBSCAN* introduces the concepts of core and reachability distance. A node a's core distance c(a) is the distance of its MinPts-th closest neighbor, while the reachability distance between items a and b is max(d(a, b), c(a), c(b)), with d being the distance function. Reachability distance essentially factors in the computation the density of each node's neighborhood. HDBSCAN* computes the minimum spanning tree (MST) T of a complete reachability graph RG having data items as nodes and their reachability distance as weights; the hierarchical clustering is obtained from T by removing all edges in order of decreasing weight. Because T is a spanning tree, edge removals split connected components into reciprocally disconnected ones. An m_cs parameter controls the minimum cluster size, and each split is added to the hierarchical clustering if both resulting components have size at least m_cs; Campello et al. suggest to set m_cs = MinPts. The non-hierarchical flat output consists of disjoint clusters selected from the hierarchical ones, selecting an ε threshold for each branch of T to maximize cluster stability across a wide range of densities. Explicitly computing RG has O(n²) complexity; McInnes and Healy [26] introduced a faster implementation that directly computes T thanks to accelerated lookup structures, if the distance function belongs to a set of supported ones.

HDBSCAN* improves on DBSCAN in terms of result quality and by yielding hierarchical results recognizing clusters within clusters. Unfortunately, though, HDBSCAN* is not incremental—if new data arrives, results have to be recomputed from scratch—and it has O(n²) complexity in the generic case of arbitrary distance functions; it also underperforms when lookup structures are ineffective, e.g., when datasets have very high dimensionality. As our analytic (Section 3.2) and empirical (Section 4) results show, FISHDBC instead supports incremental computation, maintains or even improves result quality, is accelerated with arbitrary distance functions in most common cases, and has a moderate memory footprint.

3 THE FISHDBC ALGORITHM

The core idea of FISHDBC is maintaining an approximate version of the T MST described in Section 2 and updating it incrementally, at a low cost, as new data arrive. We discover candidate edges for T by carefully adapting HNSWs (Hierarchical Navigable Small Worlds [24]). HNSWs are indexes conceived for near-neighbor querying in non-metric spaces; however, rather than first building an HNSW representing our dataset and then querying it to find each node's neighbors, we piggyback on all calls to the distance function performed by building the index, and generate batches of (a, b, d(a, b)) triples that we consider for inclusion in T, as sketched below.
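A minimal sketch of the piggybacking idea (our illustrative naming, not code from the released implementation): the user's distance function is wrapped so that every evaluation performed while building the index is also recorded as a candidate edge. An HNSW built with the wrapped function then produces candidate batches as a side effect of insertion, with no extra distance computations.

    def make_recording_distance(d):
        """Wrap distance d so that every computed pair becomes a candidate edge."""
        candidates = []  # batch of (a, b, d(a, b)) triples

        def recording_d(a, b):
            v = d(a, b)                   # distances are computed only here
            candidates.append((a, b, v))  # piggyback on the call: no extra d() needed
            return v

        return recording_d, candidates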
This piggybacking strategy allows us to significantly improve FISHDBC's efficiency because no query is ever performed on the HNSW; moreover, we tune the HNSW for speed: as we will see, settings that speed up index construction, but would result in low accuracy for nearest-neighbor querying, hit desirable trade-offs for our clustering task.

The crux of FISHDBC's approximation lies in the fact that not all d(a, b) pairs are computed, and the clustering result only depends on known distances—as proven in Theorem 3.4, FISHDBC's results are equivalent to assuming d(a, b) = ∞ for non-computed distances. While this may seem to imply a loss in clustering quality, in machine learning [36] and clustering in particular [16] subsampling the distance matrix can improve the results by working as a regularization step that avoids overfitting. As discussed in Section 2, uniformly sampling the distance matrix would not be effective in our case; hence, we resort to HNSWs, which provide a good approximation of a node's neighborhood to estimate local density.

A second regularization effect benefitting FISHDBC is that there are often multiple valid MSTs of a given reachability graph, because several edges connected to a same node can have the same weight (e.g., because they correspond to that node's reachability distance). FISHDBC tends to privilege edges towards nodes that are higher up in the HNSW hierarchy, leading to MSTs with a lower diameter (because the top of the HNSW hierarchy is reached more quickly), which in turn corresponds to final outputs with fewer and larger clusters, and with shallower hierarchies. As a consequence of these two factors, some results of Section 4 indeed show that FISHDBC outperforms HDBSCAN* in terms of quality metrics.

Our implementation is available at https://github.com/matteodellamico/flexible-clustering.

Algorithm 1 FISHDBC

     1: procedure setup(d, MinPts, ef)               ▷ d is the distance function
     2:     self.MinPts ← MinPts
     3:     self.mst ← {}           ▷ approx. MST: a hashtable mapping (x, y) edges to weights
     4:     self.neighbors ← {}     ▷ MinPts neighbors per node: maps data to max-heaps of (distance, neighbor) pairs
     5:     self.HNSW ← HNSW(d, MinPts, ef)          ▷ the HNSW's k parameter (neighbors per node) is MinPts
     6:     self.candidates ← {}    ▷ candidate edges: a mapping of (x, y) edges to weights
     7: procedure add(x)
     8:     self.HNSW.add(x)
     9:     self.neighbors[x] ← the MinPts closest neighbors found
    10:     for each time d(x, y) is called by the HNSW, returning v do
    11:         rd ← max(v, core distances of x and y)
    12:         self.candidates[x, y] ← rd           ▷ reachability distance
    13:         if we found a new top-MinPts neighbor for y then
    14:             update self.neighbors[y]
    15:             for all neighbors z of y at distance w < v do
    16:                 if the core distance of z is less than v then
    17:                     rd ← max(w, core distances of y and z)
    18:                     self.candidates[y, z] ← rd   ▷ the reachability distance for (y, z) decreased
    19:     if |candidates| > α · |neighbors| then
    20:         call update_MST                      ▷ we guarantee that candidates has O(n) size
    21: procedure update_MST
    22:     self.mst ← Kruskal(self.mst ∪ self.candidates)
    23:     self.candidates ← {}
    24: function cluster(m_cs)
    25:     if candidates is not empty then call update_MST
    26:     compute clustering from the MST          ▷ using McInnes and Healy [26]'s approach

Algorithm 1 shows FISHDBC in pseudocode. The state consists of four objects: (1) the HNSW; (2) neighbors: each node's MinPts closest discovered neighbors and their distance; (3) the current approximated MST and, for each edge (a, b) in it, the corresponding value of d(a, b); (4) candidates, a temporary collection of candidate MST edges. The setup procedure initializes the state.

add is called to incrementally add a new element x to the dataset. It adds x to the HNSW, updates the max-heap of x's neighbors with those discovered in the HNSW, and then processes all the pairs (x, y) whose distance has been computed while adding x to the HNSW. Each of them is considered as a candidate edge for our MST; in addition, we add to the candidate MST edges candidates all those for which the reachability distance decreased due to the new edge. Since neighbors contains max-heaps, each item's core distance—i.e., the distance of the MinPts-th closest neighbor—is accessible at the top of the heap. If candidates becomes larger than αn, we call update_MST to free memory. α has a moderate impact on runtime, and should be chosen as large as possible while guaranteeing that FISHDBC's state will fit in memory.

update_MST processes the temporary set of candidate edges candidates. Any minimum spanning forest algorithm can be called on the union of the current MST and the new candidates; in our implementation, we use Kruskal's algorithm. Technically, the approximate MST might be a forest—an acyclic graph with multiple connected components—rather than a tree; as shown in Theorem 3.4, this has no effect on final results. In a streaming context where data arrives incrementally, this procedure can be called during idle time.

The output is finally computed using the bottom-up strategy by McInnes and Healy [26] after calling update_MST.
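For concreteness, a usage sketch of the incremental workflow; the class and method names follow the public repository's documentation, but treat the exact API (and the toy distance and data) as assumptions rather than a specification.

    from flexible_clustering import FISHDBC  # https://github.com/matteodellamico/flexible-clustering

    def distance(a, b):
        # any symmetric, possibly non-metric Python function can be used
        return sum(abs(x - y) for x, y in zip(a, b))

    data = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
    clusterer = FISHDBC(distance)
    clusterer.update(data)  # incremental: call again whenever new items arrive
    labels, probs, stabilities, condensed_tree, slt, mst = clusterer.cluster()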
About HNSWs and the FISHDBC Design.
HNSWs represent each dataset as a set of layered approximated k-nearest neighbor graphs, where the bottom layer contains the whole dataset, and each other one contains approximately 1/k-th of the elements in the layer below it. Neighbors are found through searches starting at the top layer and continuing in the lower ones when a local minimum is found in the layer above. Since we want to find the MinPts nearest neighbors, we set k = MinPts. The ef parameter controls the effort spent in the search; in Section 4 we show that ef ∈ [20, 50] yields a good trade-off between speed and quality of results.

One may think that FISHDBC could have a simpler design, computing the MST based on the nearest-neighbor distances in the bottom graph of the HNSW, which represents the whole dataset, similarly to other approaches [17, 21]. This, however, is not optimal, as information about farther-away items is important to avoid breaking up large clusters: often, small clusters having around MinPts nodes are disconnected from other (close) clusters in the nearest-neighbor graph. By gradually converging towards the closest nodes during neighbor search, we obtain enough information about other nodes to ensure that local clusters remain connected.
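The layered structure can be illustrated with the standard HNSW level assignment (an assumption here: this is how HNSWs are commonly built, not code from the paper). Each inserted element draws a top layer so that P(level ≥ l) = k^(−l), which yields the 1/k decay per layer described above.

    import math
    import random
    from collections import Counter

    def draw_level(k):
        """Random top layer for a new element: P(level >= l) = k**(-l)."""
        m_l = 1.0 / math.log(k)
        return int(-math.log(1.0 - random.random()) * m_l)

    # With k = MinPts = 10: an element with top level l appears in layers 0..l,
    # so layer l contains roughly n / 10**l elements.
    print(Counter(draw_level(10) for _ in range(100_000)))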
We now prove results on FISHDBC's space and time complexity, and study its relationship with HDBSCAN*.
Space Complexity.
The asymptotic memory footprint of FISHDBC is rather small: this is confirmed in Section 4, where we show that FISHDBC can handle datasets that are too large for HDBSCAN*.

Theorem 3.1. FISHDBC's state has size O(n log n).

Proof. FISHDBC's state consists of (1) the HNSW (O(n log n) size [24]); (2) neighbors: each node's MinPts closest discovered neighbors and their distance (O(n) size); (3) mst: the current approximated MST, stored as a mapping between edges and their weight (n nodes and at most n − 1 edges: O(n) size); (4) the temporary set candidates of candidate edges (O(n) size, because update_MST empties it as soon as it grows larger than αn; see Algorithm 1). The total is therefore dominated by the HNSW's O(n log n). □
Time Complexity.
Theorem 3.2. Adding elements to FISHDBC and recomputing the clustering has average time complexity O((t + n) log n), where t is the number of calls to d() performed by the HNSW.

This theorem justifies why computation time grows slowly as dataset size increases (e.g., Fig. 2). The time complexity of FISHDBC depends on HNSWs: if they require few distance calls, computation cost remains low. We experimentally see that this is true in most real-world cases; moreover, Malkov and Yashunin [24] show that HNSWs have t = O(l log n) for adding l elements under some assumptions, and provide experimental results that support this, coherently with our own results. When this holds, incrementally processing l elements has time complexity O(l log² n + n log n), and processing a whole dataset has complexity O(n log² n). Our experiments show that most computation is spent in incrementally building and updating the MST, while computing the clustering is orders of magnitude cheaper (e.g., Table 3).

Proof. We call add(x) for each new element x to update the model, and then cluster to obtain the clustering. Core distance lookups have O(1) cost, as core distances are accessible at the top of each heap in neighbors. The complexity of adding elements to the HNSW is O(t), where t is the number of calls to d(). In the rest of the add procedure (see Algorithm 1), the most computationally intensive part is the inner loop of lines 15–18. This loop is executed at most O(t · MinPts) times: the O(t) factor is due to the outer loop (line 10) and O(MinPts) to the inner loop. The hashtable update at line 18 has complexity O(1), for an average complexity of O(t · MinPts) for the whole time spent in the add procedure, excluding update_MST calls.

The cost of update_MST is determined by the MSF algorithm. Kruskal's algorithm, which we use, has time complexity O(E log E), where E is the number of input edges. Since E ∈ O(n) here, a call to update_MST has cost O(n log n). This function will be called O(t/n + 1) times, resulting in a computational complexity of O((t/n + 1) · n log n) = O((t + n) log n) for this procedure.

The call to cluster has complexity O(n log n) [26]. The dominant cost is the time spent in update_MST, yielding a total complexity of O((t + n) log n). □
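As an illustration of update_MST's cost profile, a self-contained sketch of Kruskal's algorithm with union-find over the union of the current forest and the candidate edges (our naming; the released implementation may differ). The O(E log E) term comes from the sort.

    def kruskal(edges):
        """Minimum spanning forest of {(u, v): weight} edges via Kruskal."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        forest = {}
        for (u, v), w in sorted(edges.items(), key=lambda item: item[1]):
            ru, rv = find(u), find(v)
            if ru != rv:  # keep the edge only if it joins two components
                parent[ru] = rv
                forest[(u, v)] = w
        return forest

    def update_mst(mst, candidates):
        """Merge candidate edges into the current forest, as in Algorithm 1."""
        merged = {**mst, **candidates}  # newer candidate weights override stale ones
        return kruskal(merged)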
Approximation of HDBSCAN*.
We show that the only reason for the approximation is that we do not compute all pairwise distances: FISHDBC computes a valid result of HDBSCAN* when the latter is passed a distance matrix in which all the pairwise distances that were not computed are set to infinity. If d() is called on all the O(n²) pairwise distances, we will thus be proving that FISHDBC is equivalent to HDBSCAN*. We first prove that, in a reachability graph, edges with weight ∞ can be safely removed without any effect on the resulting clustering.

Lemma 3.3. Consider two reachability graphs RG and RG′, where RG′ is obtained by removing all edges weighted ∞ from RG. Clusterings resulting from RG and RG′ are equivalent.

Proof. The procedure we use to compute clustering [26] starts by considering each node as a cluster, iterates through MST edges grouped by increasing weight, and joins in the same cluster the nodes connected by those edges. When clusters of size at least m_cs are joined, they are added to the hierarchical clustering—excluding the root cluster, which contains all nodes.

Let us consider the minimum spanning forests F and F′ obtained respectively from RG and RG′. Because RG is a full graph, F is a spanning tree, while F′ may not be. If F = F′, the thesis is proven. If F ≠ F′, it must be because all edges of F′ are present in F, and one or more edges having weight ∞ are present in F. Since edges of the MST are processed by increasing weight, these ∞-weighted edges are processed last; hence the output for F and F′ will be the same until then, and joining edges in this last step will necessarily result in the root cluster containing all nodes, which is not returned in the final results. The two outputs will therefore be the same. □

We can now prove our theorem.

Theorem 3.4.
The output of FISHDBC is a valid output of HDBSCAN* run on a distance matrix D′ such that D′_{i,j} = d(i, j) if d(i, j) has been called, and D′_{i,j} = ∞ otherwise.

Proof. HDBSCAN* can have several valid outputs because it is based on computing a spanning tree of the reachability graph, which may not be unique if several edges have the same weight. We prove the equivalence for at least one of the valid spanning trees.

We base ourselves on a result by Eppstein [8, Lemma 1], which proves that minimum spanning forests (MSFs) can be built incrementally: rather than taking as input a whole graph G at once, we can take a subgraph G′, compute its MSF F′, and ignore the rest of G. We can later add to F′ the parts of G that were not in G′ and compute an MSF of the resulting graph: it will be a correct MSF F̂ of G. Hence, we can add edges incrementally in batches and keep memory consumption low (while G has size O(n²), F has size O(n)). More formally, given a graph G = (V, E) and a subgraph of it G′ = (V′ ⊆ V, E′ ⊆ E), for every MSF F′ of G′, there exists an MSF F̂ of G such that (E′ \ F′) ∩ F̂ = ∅.

Given the reachability graph RG obtained from D′, we consider RG′, which is RG without all the edges having weight ∞. Due to Lemma 3.3, our goal reduces to showing that FISHDBC will end up having in mst a minimum spanning forest of RG′.

Recall the update_MST procedure of Algorithm 1: we iteratively add elements from candidates to mst and discard the edges that are not part of the MSF. Thanks to the aforementioned result by Eppstein, our thesis is proven if all edges of RG′ eventually end up in candidates: this is actually done in line 12; the reachability distance might not be correct if some neighbors are not yet known, but it will be eventually updated to the correct value (line 18) when neighbors are discovered. We may include a single edge multiple times in candidates, but the weight always decreases: since we compute a minimum spanning forest, only the last (and correct) value for the weight will end up in mst.

Since all edges of RG′ are eventually added to candidates with their correct weights, mst will be a minimum spanning forest of RG′, which, thanks to Lemma 3.3, proves our thesis. □

4 EVALUATION

The key novelties of FISHDBC with respect to HDBSCAN* are its incremental implementation and handling arbitrary data and distance functions while maintaining scalability. HDBSCAN* is regarded as an improvement on DBSCAN and known for its result quality [4, 38], and the accelerated implementation by McInnes et al. [27] is competitive in terms of runtime with many other algorithms [26]. In the following, we therefore use McInnes et al. [27]'s HDBSCAN* implementation as a strong state-of-the-art baseline for both speed and clustering quality, which also handles arbitrary data and distance functions and returns hierarchical results, and evaluate where FISHDBC does (and does not) outperform it. We refer to McInnes and Healy [26] for comparisons between our reference HDBSCAN* implementation and other algorithms. We consider comparisons against distributed DBSCAN implementations [22, 39] as out of scope, also because of the difficulties in performing fair comparisons between single-machine and distributed approaches [28].
The goal is to test FISHDBC's flexibility by evaluating it on several very diverse datasets and distance functions. We evaluate FISHDBC's quality/runtime tradeoff on a single machine with 128 GB of RAM and different values of the ef HNSW parameter: 20 for faster computation and, in some cases, lower quality, and 50 for slower computation and possibly better results. We performed experiments—reported where space allows—with other values of ef, which hit less desirable tradeoffs; this is remarkable, because Malkov and Yashunin [24] report a good tradeoff between speed and approximation for nearest-neighbor search at higher values of ef. Following the advice of Schubert et al. [38], we use a low value of MinPts = 10; in additional experiments—not included due to space limitations—we see that MinPts has only a minor effect on final results. HNSW parameters are set to the defaults of Malkov and Yashunin [24], except for ef.
Datasets.
We validate FISHDBC on 8 datasets and 8 different distance functions (Table 1). While many related works are evaluated on large datasets with only a handful of dimensions, we are especially interested in high-dimensional cases, where ad-hoc lookup structures (and algorithms based on them) often do not scale well.
Blobs.
Synthetic labeled datasets of isotropic Gaussian blobs (10 centers, 10,000 samples) generated with scikit-learn [33]. Results are averaged over 30 generated datasets; the standard deviation is small enough that it would not be discernible in plots.
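The generation step in code, as a sketch with scikit-learn's make_blobs (parameters beyond those stated above are scikit-learn defaults, an assumption on our part):

    from sklearn.datasets import make_blobs

    # 10 isotropic Gaussian blobs, 10,000 samples; the number of features
    # varies between 1,000 and 10,000 across the generated datasets (Table 1).
    X, y = make_blobs(n_samples=10_000, centers=10, n_features=1_000)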
Docword.
The DW-* datasets [7] represent text documents as high-dimensional bags of words; here, we use cosine distance.
Finefoods consists of unlabeled textual food reviews [25], which we cluster with the Jaro-Winkler edit distance [40].
Fuzzy Hashes are digests of binary files from the study of Pagani et al. [32]—digests can be compared to output a similarity score between files. We use three algorithms: lzjd [34], sdhash [3] and tlsh [31]. sdhash and tlsh have been evaluated as sound approaches by Pagani et al., while lzjd is a recent improvement [34]. Files have 5 labels each: program, package, version, compiler used to build it, and options passed to the compiler.
Household is a large unlabeled 7-dimensional dataset of power consumption data [7]. We use Euclidean distance.
Synth datasets are created with Cesario et al. [5]'s generator, simulating transactions as event sets. In each, we generate 5 clusters of transactions with no outliers, no overlapping, and dimensionality varying between 640 and 2,048. We use Jaccard distance.
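Jaccard distance over event sets, as used here, in a few lines:

    def jaccard_distance(a, b):
        """1 minus the Jaccard similarity of two event sets."""
        a, b = set(a), set(b)
        return 1 - len(a & b) / len(a | b)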
USPS.
A set of 16x16-pixel images of handwritten digits [19]. Like other works [11, 19], we consider the 0 and 7 digits, discretize them to a bitmap using a threshold of 0.5, and consider only those with at least 20 pixels having a value of 1, for a total of 2,196 elements. As in these works, we use the Simpson score as our distance function: where & is the bitwise-and function and c() is the function that returns the number of '1' bits, the Simpson distance between bitmaps x and y is 1 − c(x & y)/min(c(x), c(y)).
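The Simpson distance in code, a direct transcription of the formula above with Python integers as bitmaps:

    def simpson_distance(x, y):
        """1 - c(x & y) / min(c(x), c(y)), where c() counts the '1' bits."""
        c = lambda v: bin(v).count("1")  # population count; assumes non-empty bitmaps
        return 1 - c(x & y) / min(c(x), c(y))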
Quality metrics.
We evaluate clustering on labeled datasets with external metrics: adjusted mutual information (AMI) and adjusted Rand index (ARI). These metrics vary between 0 (random clustering) and 1 (perfect matching). Like most density-based clustering algorithms, FISHDBC does not cluster all the elements, returning instead a set of unclustered "noise" elements: for this reason, we compute AMI and ARI by taking into account only the clustered elements. A metric like this, however, may reward clusterings that only group extremely similar items and mark as noise the rest of the dataset: hence, we use two additional metrics—respectively, AMI* and ARI*—that consider all noise items as a single additional cluster. While AMI/ARI evaluate whether clustered elements are grouped similarly to the reference labeling, AMI*/ARI* penalize outputs that do not cluster many items. Other options can be envisioned, such as treating each noise item as a single cluster, but this could trigger known problems, as metrics such as AMI are biased against solutions with many small clusters [15]. Romano et al. [35] advise using AMI rather than ARI for unbalanced datasets; as this can be the case when some clusters are disproportionately recognized as noise, we always use AMI and include ARI when space allows it.

For unlabeled datasets, we resort to internal metrics, such as silhouette, intra- and inter-cluster distance [20]. Silhouette is expensive to compute and generally requires more memory than FISHDBC, hence we obtained out-of-memory errors (OOM) on larger datasets; for intra-cluster (lower is better) and inter-cluster distance (higher is better) we resorted, for the larger clusters, to sampling, choosing two random elements from the same cluster (intra-cluster) or different clusters (inter-cluster), normalizing the probability of choosing each cluster to ensure that each pair has the same probability of being selected. We use a sample size of 10,000. We do not use the density-based clustering validation metric by Moulavi et al. [29], as—besides having O(n²) complexity—it is designed for low-dimensional datasets: results are unstable and overflow in our case because distances are exponentiated by the number of dimensions.

Dataset | Size | Data type | Distance function(s) | Metric | Labeled | Quality | Runtime
Blobs | 10 000 | 1,000 to 10,000-d vectors | Euclidean | yes | yes | Table 6 | Fig. 3
DW-Enron | 39 861 | sparse 914-d vectors | cosine | no | no | Table 7 | Table 8
DW-NYTimes | 300 000 | sparse 2,120-d vectors | cosine | no | no | Table 7 | Table 8
Finefoods | 568 474 | text (average 430 chars) | Jaro-Winkler | no | no | Table 7 | Table 8
Fuzzy hashes | 15 402 | file digests | lzjd, tlsh, sdhash | no | yes | Table 2 | Fig. 1
Household | 2 049 280 | 7-d vectors | Euclidean | yes | no | Table 7 | Table 8
Synth | 10 000 | 640–2,048-d sparse bool vectors | Jaccard | yes | yes | Table 4 | Table 3
USPS | 2 197 | 16x16 bitmaps | Simpson score | no | yes | Table 5 | Table 8

Table 1: Evaluated datasets. ("Metric" indicates whether the distance function is a metric; "Quality" and "Runtime" point to the corresponding results.)
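As an illustration of the AMI/AMI* distinction described above, a sketch with scikit-learn, assuming (our convention here) that noise points are labeled −1:

    import numpy as np
    from sklearn.metrics import adjusted_mutual_info_score

    def ami_and_ami_star(true_labels, pred_labels):
        true_labels = np.asarray(true_labels)
        pred_labels = np.asarray(pred_labels)
        clustered = pred_labels != -1  # -1 marks noise
        # AMI: computed on the clustered elements only
        ami = adjusted_mutual_info_score(true_labels[clustered],
                                         pred_labels[clustered])
        # AMI*: noise items kept, all sharing the single extra "cluster" -1
        ami_star = adjusted_mutual_info_score(true_labels, pred_labels)
        return ami, ami_star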
Figure 1: Fuzzy hashes: runtime comparison. The ef ∈ {20, 50} parameter is passed to the HNSW. (Three panels, (a) lzjd, (b) sdhash, and (c) tlsh, each plotting runtime in seconds against dataset size for HDBSCAN* and FISHDBC.)

Fuzzy hash | Clustering | ef | Clustered elements
lzjd | FISHDBC | 20 | 12 710
lzjd | FISHDBC | 50 | 12 879
lzjd | HDBSCAN* | | 13 365
sdhash | FISHDBC | 20 | 6 905
sdhash | HDBSCAN* | | 13 184
tlsh | FISHDBC | 20 | 9 746
tlsh | FISHDBC | 50 | 10 046
tlsh | HDBSCAN* | | 12 958

Table 2: Fuzzy hashes: external quality metrics, applied to different distance functions (rows) and labels (columns). (Only the clustered-element counts and a few stray values survived extraction; surviving metric values, with column assignment lost: lzjd/FISHDBC(20): 0.47, 0.42; sdhash/FISHDBC(20): 0.52, 0.21.)
We now consider distance measures for which our reference HDBSCAN* implementation [27] does not provide accelerated support; in such cases, it is still possible to run HDBSCAN* by computing a pairwise distance matrix. Here, FISHDBC can scale better than HDBSCAN* because of its lower asymptotic complexity.
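A sketch of this fallback, using the reference implementation's precomputed-matrix mode (the toy items and distance are ours, for illustration; hdbscan's min_samples plays the role of MinPts):

    import numpy as np
    import hdbscan

    def pairwise_matrix(items, d):
        """All O(n^2) distances: exactly the cost FISHDBC avoids paying."""
        n = len(items)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = d(items[i], items[j])
        return D

    items = ["foo", "food", "bar", "barn", "baz"]
    d = lambda a, b: len(set(a) ^ set(b))  # toy symmetric distance
    D = pairwise_matrix(items, d)
    clusterer = hdbscan.HDBSCAN(metric="precomputed",
                                min_cluster_size=2, min_samples=2)
    labels = clusterer.fit_predict(D)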
Fuzzy Hashes.
This dataset has the interesting property of having overlapping class labels. We start by analyzing Fig. 1: here, computational cost is dominated by the calls to the distance function, and we clearly see a quadratic increase in runtime for HDBSCAN*—differences between HDBSCAN* results are essentially due to the differences in cost between the distance functions. FISHDBC consistently scales much better than HDBSCAN*.

The quality metrics of Table 2, where we evaluate AMI and AMI* for each fuzzy hash algorithm/labeling pair, inspire some considerations.

First, HDBSCAN* consistently clusters more files than FISHDBC, but the AMI score of FISHDBC is often higher. This means that FISHDBC identifies more elements as noise, while outputting the other elements in more coherent clusters.

Second (with the single exception of sdhash applied to the "program" label, where FISHDBC's approximation appears to impact result quality negatively), the AMI* scores of HDBSCAN* are generally equivalent or worse than those of FISHDBC, suggesting that the additional elements clustered by HDBSCAN* are often not well clustered. This can be explained by the argument of Section 3, which suggests that—by working as regularization—FISHDBC's approximation can improve output quality. By manually examining results, we confirm that the hierarchical clustering of FISHDBC is generally simpler, with fewer, larger clusters and a shallower hierarchy.

Table 3: Synth: runtime (s). "Build" is the time to incrementally build the FISHDBC data structures, "cluster" the time to compute clustering using them as input. (Columns: dim; FISHDBC (ef = 20) build, cluster; FISHDBC (ef = 50) build, cluster; HDBSCAN*. Rows start at dim = 640; the cell values were lost in extraction.)

Table 4: Synth: quality metrics for varying ef (rows) and dimensionality (columns). (Most values were lost in extraction; the surviving HDBSCAN* row reads 0.49, 0.75, 0.79, 0.95.)
Synth.
Table 3 reports on runtime while varying ef. FISHDBC spends most of the time incrementally building its data structures, while the cost of extracting a clustering from them is more than two orders of magnitude cheaper. Therefore, clustering can be recomputed cheaply as the data structure grows; as shown in Table 8, this is the case in all our datasets. FISHDBC outperforms HDBSCAN* here, with a margin growing as the dimensionality (and hence the cost of the distance function) grows. Compared to the Fuzzy Hashes dataset, the smaller difference is largely due to a cheaper distance function. Quality results in Table 4 are perhaps more surprising: for 640 and 1,024 dimensions, FISHDBC substantially outperforms HDBSCAN*; once again, we attribute this to the regularization effect described in Section 3. As the dimensionality grows, clusters become more separated and quality metric values grow.
Finefoods.
This dataset is rather large, and the Jaro-Winkler distance applied to it is quite expensive. We could not apply HDBSCAN* to this dataset, as the full distance matrix would be very expensive to compute and could not fit in memory; this dataset allows us to investigate FISHDBC's scalability. In Fig. 2, we observe the average number of calls to the distance function performed per item as new elements get introduced in the FISHDBC data structure (a clustering is computed every time 2% of the dataset is added). We can see that, in the beginning, the number of comparisons grows as the dataset does, but it tends to plateau afterwards. Results for quality metrics and runtime are available in Tables 7 and 8.

Figure 2: Finefoods: scalability as the dataset size increases. (Average comparisons per item against dataset size, for ef ∈ {10, 20, 50, 100, 200}.)

Table 5: USPS: external quality metrics. (The cell values were lost in extraction.)
USPS.
In this smaller dataset, the runtime results of Table 8—while in any case small—are preferable for HDBSCAN*, as the advantages brought by FISHDBC's lower asymptotic complexity are irrelevant here. Results in Table 5 are, on the other hand, quite interesting: once again, the regularization effects discussed in Section 3 improve the quality metrics. In particular, AMI and ARI are both equal to 1, showing that FISHDBC always returns two clusters: one for each of the two labels in the original dataset (AMI*/ARI* values are still lower than 1 because many digits are still considered as noise). On the other hand, HDBSCAN* returns a larger number of clusters (11), and some of them contain mixed labels.
Summary.
FISHDBC enables performant clustering in cases where computing the full distance matrix falls short. Moreover, FISHDBC rarely fares worse than HDBSCAN* in terms of quality metrics—in various cases, indeed, regularization effects improve result quality.
We now consider Euclidean and cosine distance, for which HDBSCAN* provides a high-performance accelerated implementation.
Blobs.
These datasets have between 1,000 and 10,000 dimensions. HDBSCAN* uses a KD-tree here, but as the number of dimensions grows, the effectiveness of such data structures decreases. In Fig. 3, we see how the computation time for HDBSCAN* increases quite steeply as dimensionality grows; on the other hand, growth is definitely slower for FISHDBC thanks to the lower cost of approximated search through HNSWs.

Figure 3: Blobs: runtime comparison. (Runtime in seconds against dimensionality, for HDBSCAN* and FISHDBC with ef ∈ {20, 50}.)

Dimensions | ef = 20: AMI*, ARI* | ef = 50: AMI*, ARI* | HDBSCAN*: AMI*, ARI*
1 000 | 0.98, 0.99 | 0.99, 0.99 | (lost)
10 000 | 0.98, 0.99 | 0.98, 0.99 | (lost)

Table 6: Blobs: external quality metrics. (Intermediate dimensionality rows and the HDBSCAN* columns were lost in extraction.)

Quality metrics in Table 6 show that, here, FISHDBC pays a small price in terms of clustering quality. The experiment was repeated on 30 randomly generated datasets for each number of dimensions, and the standard deviation of AMI* and ARI* is, in all cases, 0.01 for FISHDBC and 0 for HDBSCAN*.
Household.
In this 7-dimensional Euclidean dataset, one may speculate that FISHDBC would be largely outperformed by the accelerated ad-hoc HDBSCAN* implementation (which uses an elaborate dual-tree version of Borůvka's algorithm). Actually, as reported in Table 8, HDBSCAN* is only slightly faster than FISHDBC. It is possible that optimizations on constant factors, e.g., swapping our pure Python HNSW implementation with a faster one, could make FISHDBC faster in this case as well. Intra- and inter-cluster quality metrics (Table 7) are better for HDBSCAN*, but FISHDBC produces a smaller number of clusters, which is arguably more desirable for data exploration because the summarization due to clustering is more succinct. While a considerable number of elements are categorized as noise in the flat clustering, almost all elements end up in a cluster when we consider the hierarchical clustering, which can facilitate data exploration tasks. This benefit is shared by both FISHDBC and HDBSCAN*, for most datasets reported in Table 7.
Docword.
We conclude our evaluation by examining sparse vector datasets where we use cosine distance, which has an accelerated ad-hoc implementation in HDBSCAN* [27]. Internal quality metrics in Table 7 are again similar between FISHDBC and HDBSCAN*. Results on runtime in Table 8, however, are quite different: the lookup structures of HDBSCAN* result in faster execution but larger memory footprint; hence, FISHDBC can compute results for DW-NYTimes while HDBSCAN* fails with an out-of-memory error.
Summary.
Ad-hoc lookup structures are appealing, but they do not always outperform the generic acceleration of FISHDBC. FISHDBC outperforms HDBSCAN* on very high-dimensional dense datasets like Blobs, and because of its lower memory footprint it can handle datasets that HDBSCAN* cannot, like DW-NYTimes.
5 CONCLUSION

FISHDBC can deal with arbitrary distance functions and can handle datasets that are too large for our HDBSCAN* reference. Its core features are providing cheap, incremental computation while supporting arbitrary data and distance functions, avoiding O(n²) complexity without needing filter functions or lookup indices: domain experts are free to write arbitrarily complex distance functions reflecting the quirks of the data at hand. In addition to being incremental, scalable and flexible, FISHDBC supports hierarchical clustering. It is also an option for very high-dimensional datasets where lookup structures suffer from the curse of dimensionality: our results show that, for datasets that have very high dimensionality, FISHDBC can outperform ad-hoc accelerated approaches.

We believe that separating neighbor discovery from incremental model maintenance is a powerful approach, which allows for algorithms that are easier to reason about, implement and improve.

REFERENCES

[1] Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. 2017. ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. In
Similarity Search and Applications (Lecture Notes in Computer Science), Christian Beecks, Felix Borutta, Peer Kröger, and Thomas Seidl (Eds.). Springer International Publishing, Munich, Germany, 34–49.
[2] Stefan Brecheisen, H-P Kriegel, and Martin Pfeifle. 2004. Efficient density-based clustering of complex objects. In Data Mining, 2004. ICDM'04. Fourth IEEE International Conference on. IEEE, Brighton, UK, 43–50.
[3] Frank Breitinger, Harald Baier, and Jesse Beckingham. 2012. Security and implementation analysis of the similarity digest sdhash. In First international baltic conference on network security & forensics (NeSeFo). Tartu, Estonia, 16.
[4] Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. 2013. Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining (Lecture Notes in Computer Science), Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu (Eds.). Springer Berlin Heidelberg, Gold Coast, Australia, 160–172.
[5] Eugenio Cesario, Giuseppe Manco, and Riccardo Ortale. 2007. Top-down parameter-free clustering of high-dimensional categorical data. IEEE Transactions on Knowledge and Data Engineering 19, 12 (2007), 1607–1624.
[6] Wei Dong, Charikar Moses, and Kai Li. 2011. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW. ACM, Hyderabad, India, 577–586.
[7] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[8] D. Eppstein. 1994. Offline Algorithms for Dynamic Minimum Spanning Tree Problems. Journal of Algorithms 17, 2 (Sept. 1994), 237–250. https://doi.org/10.1006/jagm.1994.1033
[9] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD. ACM, Portland, Oregon, USA, 226–231.
[10] Martin Ester and Rüdiger Wittmann. 1998. Incremental generalization for mining in a data warehousing environment. In International Conference on Extending Database Technology. Springer, Valencia, Spain, 135–149.
[11] Maurizio Filippone. 2009. Dealing with non-metric dissimilarities in fuzzy central clustering algorithms. International Journal of Approximate Reasoning 50, 2 (2009), 363–384.
[12] Maurizio Filippone, Francesco Camastra, Francesco Masulli, and Stefano Rovetta. 2008. A survey of kernel and spectral methods for clustering. Pattern Recognition 41, 1 (2008), 176–190.
Table 7: Internal clustering quality metrics. Columns: dataset; size; algorithm (ef); clustered elements (flat, hierarchical); clusters (flat, hierarchical); silhouette; average intra- and inter-cluster distance. OOM stands for out-of-memory errors when computing the Silhouette metric. (The rows cover DW-Kos, DW-NYTimes, Finefoods, and Household, each for FISHDBC with ef ∈ {20, 50} and for HDBSCAN*; most cell values were scrambled beyond recovery in extraction. HDBSCAN* ran out of memory on DW-NYTimes and Finefoods.)
Dataset | FISHDBC (ef = 20): build, cluster | FISHDBC (ef = 50): build, cluster | HDBSCAN* (accelerated?)
Blobs | see Figure 3 | |
DW-Kos | 27.4, 0.102 | 37.1, 0.103 | — (yes)
DW-Enron | 616, 2.39 | 851, 2.06 | — (yes)
DW-NYTimes | — | — | OOM (yes)
Finefoods | 50 422, — | — | OOM (no)
Fuzzy hashes | see Figure 1 | |
Household | 27 375, 123 | 38 759, 109 | 24 258 (yes)
Synth | see Table 3 | |
USPS | 9.1, 0.0500 | 12.1, 0.0502 | — (no)

Table 8: Runtime (in seconds). ("Build" and "cluster" are as in Table 3; "accelerated?" indicates whether HDBSCAN* has accelerated support for the distance function; dashes mark values lost in extraction.)

[13] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. 2004. Spectral grouping using the Nyström method.
IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 2 (Feb 2004), 214–225. https://doi.org/10.1109/TPAMI.2004.1262185
[14] Jun-Song Fu, Yun Liu, and Han-Chieh Chao. 2015. ICA: An Incremental Clustering Algorithm Based on OPTICS. Wireless Personal Communications 84, 3 (Oct. 2015), 2151–2170. https://doi.org/10.1007/s11277-015-2517-9
[15] Alexander J Gates and Yong-Yeol Ahn. 2017. The impact of random models on clustering similarity. The Journal of Machine Learning Research 18, 1 (2017), 3049–3076.
[16] Y. Han and M. Filippone. 2017. Mini-batch spectral clustering. In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, Anchorage, Alaska, USA, 3888–3895. https://doi.org/10.1109/IJCNN.2017.7966346
[17] Jacob Jackson, Aurick Qiao, and Eric P Xing. 2018. Scaling HDBSCAN Clustering with kNN Graph Approximation. In SysML. sysml.cc, Stanford, CA, USA, Article 2-5, 3 pages.
[18] Hans-Peter Kriegel, Peer Kröger, and Irina Gotlibovich. 2003. Incremental OPTICS: Efficient Computation of Updates in a Hierarchical Cluster Ordering. In Data Warehousing and Knowledge Discovery (Lecture Notes in Computer Science), Yahiko Kambayashi, Mukesh Mohania, and Wolfram Wöß (Eds.). Springer Berlin Heidelberg, Prague, Czech Republic, 224–233.
[19] Julian Laub and Klaus-Robert Müller. 2004. Feature discovery in non-metric pairwise data. Journal of Machine Learning Research 5, Jul (2004), 801–818.
[20] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. 2010. Understanding of internal clustering validation measures. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, Sydney, Australia, 911–916.
[21] Alessandro Lulli, Thibault Debatty, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. 2015. Scalable k-NN based text clustering. In 2015 IEEE International Conference on Big Data. IEEE, Santa Clara, CA, USA, 958–963.
[22] Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. 2016. NG-DBSCAN: scalable density-based clustering for arbitrary data. Proceedings of the VLDB Endowment 10, 3 (2016), 157–168.
[23] Son T. Mai, Ira Assent, Jon Jacobsen, and Martin Storgaard Dieu. 2018. Anytime parallel density-based clustering. Data Mining and Knowledge Discovery 32, 4 (July 2018), 1121–1176. https://doi.org/10.1007/s10618-018-0562-1
[24] Y. A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1. https://doi.org/10.1109/TPAMI.2018.2889473
[25] Julian John McAuley and Jure Leskovec. 2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd international conference on World Wide Web. ACM, Rio de Janeiro, Brazil, 897–908.
[26] Leland McInnes and John Healy. 2017. Accelerated Hierarchical Density Based Clustering. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on. IEEE, New Orleans, LA, USA, 33–42.
[27] Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. The Journal of Open Source Software 2, 11 (2017), 205. https://doi.org/10.21105/joss.00205
[28] Frank McSherry, Michael Isard, and Derek G. Murray. 2015. Scalability! But at what COST?. In 15th Workshop on Hot Topics in Operating Systems (HotOS XV). USENIX Association.
[29] Davoud Moulavi, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur Zimek, and Jörg Sander. 2014. Density-based clustering validation. In Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, Philadelphia, PA, USA, 839–847.
[30] Fionn Murtagh and Pedro Contreras. 2012. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 1 (2012), 86–97.
[31] J. Oliver, C. Cheng, and Y. Chen. 2013. TLSH – A Locality Sensitive Hash. In 2013 Fourth Cybercrime and Trustworthy Computing Workshop. IEEE, Sydney, Australia, 7–13. https://doi.org/10.1109/CTC.2013.9
[32] Fabio Pagani, Matteo Dell'Amico, and Davide Balzarotti. 2018. Beyond Precision and Recall: Understanding Uses (and Misuses) of Similarity Hashes in Binary Analysis. In Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy. ACM, Tempe, AZ, USA, 354–365.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[34] Edward Raff and Charles Nicholas. 2018. Lempel-Ziv Jaccard Distance, an effective alternative to ssdeep and sdhash. Digital Investigation 24 (2018), 34–49.
[35] Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. 2016. Adjusting for chance clustering comparison measures. The Journal of Machine Learning Research 17, 1 (2016), 4635–4666.
[36] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. 2015. Less is More: Nyström Computational Regularization. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., Montréal, Canada, 1657–1665. http://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization.pdf
[37] Jörg Sander, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. 1998. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery 2, 2 (1998), 169–194.
[38] Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu. 2017. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans. Database Syst. 42, 3, Article 19 (July 2017), 21 pages. https://doi.org/10.1145/3068335
[39] Hwanjun Song and Jae-Gil Lee. 2018. RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). ACM, New York, NY, USA, 1173–1187. https://doi.org/10.1145/3183713.3196887
[40] William E Winkler. 1999. The state of record linkage and current research problems. Technical Report. Statistical Research Division, US Census Bureau.