Clustering For Point Pattern Data
Quang N. Tran∗, Ba-Ngu Vo∗, Dinh Phung† and Ba-Tuong Vo∗
∗Curtin University, Australia; †Deakin University, Australia
Abstract—Clustering is one of the most common unsupervised learning tasks in machine learning and data mining. Clustering algorithms have been used in a plethora of applications across several scientific fields. However, there has been limited research in the clustering of point patterns (sets or multi-sets of unordered elements) that are found in numerous applications and data sources. In this paper, we propose two approaches for clustering point patterns. The first is a non-parametric method based on novel distances for sets. The second is a model-based approach, formulated via random finite set theory, and solved by the Expectation-Maximization algorithm. Numerical experiments show that the proposed methods perform well on both simulated and real data.
Index Terms—Clustering, point pattern data, multiple instance data, point process, random finite set, affinity propagation, expectation-maximization.
I. INTRODUCTION
Clustering is a data analysis task that groups similar data items together [1] and can be viewed as an unsupervised classification problem since the class (or cluster) labels are not given [2], [3]. Clustering is a fundamental problem in machine learning with a long history dating back to the 1930s in psychology [4]. Today, clustering is widely used in a host of application areas including genetics [5], medical imaging [6], market research [7], social network analysis [8], and mobile robotics [9]. Excellent surveys can be found in [2], [10].

Clustering algorithms can broadly be categorized as hard or soft. In hard clustering, each datum can only belong to one cluster [2], [11], i.e., a hard clustering algorithm outputs a partition of the dataset. (More specifically, hard clustering can be dichotomized as either hierarchical or partitional: hierarchical clustering outputs a nested tree of partitions, whereas partitional clustering outputs a single partition [1], [2].) K-means is a typical example of hard clustering. Soft clustering, on the other hand, allows each datum to belong to more than one cluster with certain degrees of membership. The Gaussian mixture model [3] is an example of soft clustering wherein the degree of membership of a data point to a cluster is given by its mixing probability. Hard clustering can be obtained from soft clustering results by simply assigning to each data point the cluster with the highest membership degree [2].

In many applications, each datum is a point pattern, i.e., a set or multi-set of unordered points (or elements). For example, in natural language processing and information retrieval, the 'bag-of-words' representation treats each document as a collection or set of words [12], [13]. In image and scene categorization, the 'bag-of-visual-words' representation, the analogue of the 'bag-of-words', treats each image as a set of its key patches [14], [15]. In data analysis for the retail industry as well as web management systems, transaction records such as market-basket data [16], [17], [18] and web log data [19] are sets of transaction items. Other examples of point pattern data can be found in drug discovery [20] and protein binding site prediction [21]. In multiple instance learning [22], [23], the 'bags' are indeed point patterns. Point patterns are also abundant in nature, such as the coordinates of trees in a forest, stars in a galaxy, etc. [24], [25], [26].

While point pattern data are abundant, the clustering problem for point patterns has received very limited attention. To the best of our knowledge, there are two clustering algorithms for point patterns: the Bag-level Multi-instance Clustering (BAMIC) algorithm [27], and the Maximum Margin Multiple Instance Clustering (M3IC) algorithm [28].
BAMIC adapts the k-medoids algorithm for the clustering of point pattern data (or multiple instance data) by using the Hausdorff metric as a measure of dissimilarity between two point patterns [27]. M3IC, on the other hand, poses the point pattern clustering problem as a non-convex optimization problem, which is then relaxed and solved via a combination of the Constrained Concave-Convex Procedure and Cutting Plane methods [28].

In this paper, we propose a non-parametric approach and a model-based approach to the clustering problem for point pattern data:

• Our non-parametric approach uses Affinity Propagation (AP) [29] with a novel measure of dissimilarity, known as the Optimal Sub-Pattern Assignment (OSPA) metric. This metric alleviates the insensitivity of the Hausdorff metric (used by BAMIC) to cardinality differences. Moreover, AP is known to find clusters faster and with much lower error than other methods such as k-medoids (used by BAMIC) [29].

• In our model-based approach, point patterns are modeled as random finite sets. Moreover, for a class of models known as independently and identically distributed (iid) cluster random finite sets, we develop an Expectation-Maximization (EM) technique [30] to learn the model parameters. To the best of our knowledge, this is the first model-based framework for bag-level clustering of multiple instance data.

Dinh Phung gratefully acknowledges support from the Air Force Office of Scientific Research under award number FA2386-16-1-4138.

II. NON-PARAMETRIC CLUSTERING FOR POINT PATTERN DATA
In AP, the notion of similarity/dissimilarity between data points plays an important role. There are various measures of similarity/dissimilarity, for example distances between observations, joint likelihoods of observations, or even manually setting the similarity/dissimilarity for each observation pair [29]. In this paper, we use distances for sets as measures of dissimilarity. In the following section, we describe several distances for point patterns.
A. Set distances
Let X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} denote two finite subsets of a metric space (S, d). Note that when S is a subset of R^n, d is usually the Euclidean distance, i.e., d(x, y) = ||x - y||.

The Hausdorff distance is defined by

d_H(X, Y) \triangleq \max\Big\{ \max_{x \in X} \min_{y \in Y} d(x, y),\ \max_{y \in Y} \min_{x \in X} d(x, y) \Big\},   (1)

and d_H(X, Y) = \infty when X = \emptyset and Y \neq \emptyset.

While the Hausdorff distance is a rigorous measure of dissimilarity between point patterns, it is relatively insensitive to differences in cardinalities [31], [32]. Consequently, clustering based on this distance has the undesirable tendency to group together point patterns with large differences in cardinality.

The Wasserstein distance of order p >= 1 is defined by

d_W^{(p)}(X, Y) \triangleq \min_{C} \Big( \sum_{i=1}^{m} \sum_{j=1}^{n} c_{i,j}\, d(x_i, y_j)^p \Big)^{1/p},   (2)

where each C = (c_{i,j}) is an m x n transportation matrix, i.e., C satisfies [31], [32]:

c_{i,j} >= 0 for 1 <= i <= m, 1 <= j <= n;
\sum_{j=1}^{n} c_{i,j} = 1/m for 1 <= i <= m;
\sum_{i=1}^{m} c_{i,j} = 1/n for 1 <= j <= n.

Note that when X = \emptyset and Y \neq \emptyset, d_W^{(p)}(X, Y) = \infty by convention.

The Wasserstein distance is more sensitive to differences in cardinalities than the Hausdorff distance and also has a physically intuitive interpretation when the point patterns have the same cardinality. However, it does not have a physically consistent interpretation when the point patterns have different cardinalities; see [31] for further details.

The OSPA distance of order p with cut-off c is defined by

d_O^{(p,c)}(X, Y) \triangleq \Big( \frac{1}{n} \Big( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} \min\big(c, d(x_i, y_{\pi(i)})\big)^p + c^p (n - m) \Big) \Big)^{1/p},   (3)

if m <= n, and d_O^{(p,c)}(X, Y) \triangleq d_O^{(p,c)}(Y, X) if m > n, where 1 <= p < \infty, c > 0, and \Pi_n is the set of permutations on {1, 2, ..., n}. The two adjustable parameters p and c are interpreted as the outlier sensitivity and the cardinality penalty, respectively. The OSPA metric allows for a physically intuitive interpretation even if the cardinalities of the two sets are not the same; see [31] for further details.
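For concreteness, the three set distances above can be computed as follows. This is an illustrative sketch rather than code from the paper: it uses SciPy's Hungarian solver for the OSPA assignment and solves the Wasserstein transportation problem as a small linear program; only the OSPA empty-set conventions are handled explicitly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, linprog
from scipy.spatial.distance import cdist

def hausdorff(X, Y):
    """Hausdorff distance (1) between two non-empty point patterns of shape (m, d), (n, d)."""
    D = cdist(X, Y)                              # pairwise Euclidean distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def wasserstein(X, Y, p=2):
    """Wasserstein distance (2): transportation LP with row sums 1/m and column sums 1/n."""
    m, n = len(X), len(Y)
    D = cdist(X, Y) ** p
    A_eq, b_eq = [], []
    for i in range(m):                           # each row of C sums to 1/m
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_eq.append(row); b_eq.append(1.0 / m)
    for j in range(n):                           # each column of C sums to 1/n
        col = np.zeros(m * n); col[j::n] = 1
        A_eq.append(col); b_eq.append(1.0 / n)
    res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    return res.fun ** (1.0 / p)

def ospa(X, Y, p=2, c=20.0):
    """OSPA distance (3) of order p with cut-off c."""
    if len(X) > len(Y):                          # enforce m <= n by symmetry
        X, Y = Y, X
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0                               # both patterns empty
    if m == 0:
        return c                                 # one pattern empty: maximal per-point penalty
    D = np.minimum(cdist(X, Y), c) ** p
    row, col = linear_sum_assignment(D)          # optimal sub-pattern assignment
    cost = D[row, col].sum() + (c ** p) * (n - m)
    return (cost / n) ** (1.0 / p)
```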
B. AP clustering with set distances

The AP algorithm first considers all data points as potential exemplars, i.e., centroids of clusters. Progressively better sets of exemplars and corresponding clusters are then determined by passing "responsibility" and "availability" messages based on similarities/dissimilarities between data points. Further details on the AP algorithm can be found in [29]. The dissimilarity measures considered in this work are the Hausdorff, Wasserstein and OSPA distances. In the following section, we evaluate AP clustering performance with these set distances.
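A minimal way to combine the set distances with AP is sketched below, assuming scikit-learn's AffinityPropagation (which accepts a precomputed similarity matrix) as the message-passing implementation. Negating the pairwise distances to obtain similarities and using the median similarity as the preference are common defaults, not choices prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def ap_cluster(patterns, set_distance, **kwargs):
    """Affinity Propagation on point patterns using a set distance.
    Similarities are the negated pairwise distances."""
    n = len(patterns)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = set_distance(patterns[i], patterns[j], **kwargs)
            S[i, j] = S[j, i] = -d
    # the 'preference' (self-similarity) controls the number of clusters;
    # the median similarity is the usual default
    ap = AffinityPropagation(affinity="precomputed", preference=np.median(S))
    return ap.fit_predict(S)

# e.g. labels = ap_cluster(data, ospa, p=2, c=20.0)
```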
C. Numerical experiments
We evaluate the proposed clustering algorithm with both simulated and real data. The performance is measured by the Rand index [33]. From a performance perspective, we consider AP clustering with the Hausdorff metric as a version of BAMIC, since the only difference is that the latter uses k-medoids instead of AP.

In the following experiments, we report the performance of AP clustering with different set distances, namely Hausdorff, Wasserstein and OSPA. We use p = 2 for the Wasserstein and OSPA metrics, and additionally c = 20 or c = 60 for OSPA.
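For reference, the (unadjusted) Rand index used in these experiments simply counts the fraction of point-pattern pairs on which the predicted and true labelings agree; a small pair-counting sketch, not code from the paper:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of pairs on which two labelings agree:
    both in the same cluster, or both in different clusters."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum((labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
                for i, j in pairs)
    return agree / len(pairs)
```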
1) AP clustering with simulated data:
The simulated point pattern data are generated from Poisson RFS models with 2-D Gaussian feature distributions (see section III-A), with each Poisson RFS representing a cluster. There are 3 clusters, each with 100 point patterns, in the simulated data (Fig. 2).
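Data of this kind can be generated as follows; the Poisson rates, means and covariances below are placeholders for illustration only, since the paper does not list the exact parameter values used.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_poisson_rfs(lam, mean, cov, n_patterns, rng):
    """Draw point patterns from a Poisson RFS with a 2-D Gaussian feature density:
    cardinality ~ Poisson(lam), features iid ~ N(mean, cov) given the cardinality."""
    patterns = []
    for _ in range(n_patterns):
        m = rng.poisson(lam)                                 # random cardinality
        patterns.append(rng.multivariate_normal(mean, cov, size=m))
    return patterns

# three clusters, 100 point patterns each (parameter values are illustrative only)
clusters = [
    sample_poisson_rfs(10, [0.0, 0.0], np.eye(2), 100, rng),
    sample_poisson_rfs(20, [5.0, 5.0], np.eye(2), 100, rng),
    sample_poisson_rfs(30, [0.0, 8.0], 2 * np.eye(2), 100, rng),
]
data = [X for c in clusters for X in c]
labels_true = [k for k, c in enumerate(clusters) for _ in c]
```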
Figure 1. Performance of AP clustering with various set distances on 10 different (simulated) datasets (section II-C1). The error bars are standard deviations of the Rand indices.
The test is run 10 times with 10 different (simulated) datasets. The averaged results are shown in Fig. 1, while individual results for certain datasets are shown in Fig. 2. Observe that the OSPA's performance can be improved if one chooses a suitable cut-off c (see section II-B).
2) AP clustering with real data:
This experiment involves clustering images from the classes "T14_brick1" and "T15_brick2" of the Texture images dataset [34]. Fig. 3 visualizes some example images from these classes.
Figure 2. Simulated data from various Poisson RFS distributions. In each subplot, Left: features of the data; Middle: cardinality histogram of the data; Right: AP clustering performance with Hausdorff, Wasserstein and OSPA distances. Subplots (a)-(c) correspond to the 1st, 2nd and 3rd datasets.
Figure 3. Example images from classes "T15_brick2" and "T14_brick1" of the Texture dataset (image 7 of each class shown). Circles mark detected SIFT keypoints.
Features are extracted from each image by applying the SIFT algorithm (using the VLFeat library [35]), followed by Principal Component Analysis (PCA) to convert the 128-D SIFT features into 2-D features. Thus each image is compressed into a point pattern of 2-D features. Fig. 4 plots the 2-D features of the images in the Texture images dataset. The clustering algorithm is used to group together point patterns extracted from images of the same class.
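A plausible reimplementation of this feature-extraction pipeline is sketched below. It substitutes OpenCV's SIFT for the VLFeat implementation used in the paper [35], and fits the PCA on all descriptors pooled across images, which is an assumption since the paper does not say how the PCA was fitted.

```python
import cv2                      # OpenCV SIFT used here in place of VLFeat
import numpy as np
from sklearn.decomposition import PCA

def extract_point_patterns(image_paths, n_components=2):
    """Turn each image into a point pattern of low-dimensional SIFT descriptors:
    128-D SIFT descriptors per image, then PCA down to n_components dimensions."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.empty((0, 128)))
    pca = PCA(n_components=n_components).fit(np.vstack(per_image))
    return [pca.transform(d) if len(d) else np.empty((0, n_components))
            for d in per_image]
```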
The AP clustering results for various set distances are shown in Fig. 5. Observe that for this dataset, the OSPA distance achieves better performance than the Hausdorff and Wasserstein distances.

Figure 4. Extracted data from images of the two classes "T15_brick2" and "T14_brick1" of the Texture data. (a) 2-D features (after applying PCA to the SIFT features) of images. (b) Histogram of cardinalities of images.

Figure 5. Performance of AP clustering with various set distances on the Texture image dataset (section II-C2).

III. MODEL-BASED CLUSTERING FOR POINT PATTERN DATA

In this section, we present generative models for clusters of point pattern data. In particular, the point patterns are assumed to be distributed according to a finite mixture of iid-cluster random finite set distributions. We then derive an Expectation-Maximization (EM) algorithm to learn the parameters of these models, which are then used to cluster the data.
A. Random Finite Set
Point patterns can be modeled as random finite sets (RFSs), or simple finite point processes. The likelihood of a point pattern of discrete features is straightforward since it is simply the product of the cardinality distribution and the joint probability of the features given the cardinality. The difficulties arise in continuous feature spaces. In this work, we only consider continuous feature spaces.

Let F(X) denote the space of finite subsets of a space X. A random finite set (RFS) X of X is a random variable taking values in F(X). An RFS X can be completely specified by a discrete (or categorical) distribution that characterizes the cardinality |X|, and a family of symmetric joint distributions that characterizes the distribution of the points (or features) of X, conditional on the cardinality.

Analogous to random vectors, the probability density of an RFS (if it exists) is essential in the modeling of point pattern data. The probability density p : F(X) -> [0, \infty) of an RFS is the Radon-Nikodym derivative of its probability distribution relative to the dominating measure \mu, defined for each (measurable) T \subseteq F(X) by [24], [26], [36], [37]:

\mu(T) = \sum_{m=0}^{\infty} \frac{1}{m!\, U^m} \int \mathbf{1}_T(\{x_1, ..., x_m\})\, d(x_1, ..., x_m),   (4)

where U is the unit of hyper-volume in X, and \mathbf{1}_T(\cdot) is the indicator function for T. The measure \mu is the unnormalized distribution of a Poisson point process with unit intensity u = 1/U when X is bounded. Note that \mu is unitless and consequently the probability density p is also unitless.

In general, the probability density of an RFS, with respect to \mu, evaluated at X = {x_1, ..., x_m} can be written as [38, p. 27] (Eqs. (1.5), (1.6), and (1.7)), [26]:

p(X) = p_c(m)\, m!\, U^m f_m(x_1, ..., x_m),   (5)

where p_c(m) = Pr(|X| = m) is the cardinality distribution, and f_m(x_1, ..., x_m) is a symmetric joint probability density of the points x_1, ..., x_m given the cardinality.

Imposing the independence assumption among the features, the model in (5) reduces to the iid-cluster RFS model [25]

p(X) = p_c(|X|)\, |X|!\, [U p_f]^X,   (6)

where p_f is a probability density on X, referred to as the feature density, and h^X \triangleq \prod_{x \in X} h(x), with h^\emptyset = 1 by convention, is the finite-set exponential notation.

When p_c is a Poisson distribution we have the celebrated Poisson point process (aka Poisson RFS)
p(X) = \lambda^{|X|} e^{-\lambda} [U p_f]^X,   (7)

where \lambda is the mean cardinality. The Poisson model is completely determined by the intensity function u = \lambda p_f [36], [37]. Note that the Poisson cardinality distribution is described by a single non-negative number \lambda, hence there is only one degree of freedom in the choice of cardinality distribution for the Poisson model.
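Under this model, the (unitless) log-density of a point pattern is straightforward to evaluate; a small sketch with a Gaussian feature density and U = 1 (an illustrative choice, not prescribed by the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def poisson_rfs_loglik(X, lam, mean, cov, U=1.0):
    """Log of the Poisson RFS density (7):
    |X| log(lam) - lam + sum_{x in X} log(U * p_f(x)),
    with Gaussian feature density p_f = N(mean, cov)."""
    m = len(X)
    loglik = m * np.log(lam) - lam
    if m:
        loglik += np.sum(multivariate_normal.logpdf(X, mean=mean, cov=cov)) + m * np.log(U)
    return loglik
```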
B. Mixture of iid-cluster RFSs

A mixture of iid-cluster RFSs is a probability density of the form

p(X | \Theta) = \sum_{k=1}^{N_{comp}} w_k\, p(X | C_k, F_k),   (8)

where k \in {1, ..., N_{comp}} is the component label, w_k = Pr(C = k) is the component weight (the probability of an observation belonging to the k-th component), and

p(X | C_k, F_k) = p_c(|X| \,|\, C_k)\, |X|!\, U^{|X|} \prod_{x \in X} p_f(x | F_k)   (9)

is the iid-cluster density of the k-th component with cardinality distribution parameter C_k and feature distribution parameter F_k.

Note that \Theta = {(w_k, C_k, F_k) : k = 1, ..., N_{comp}} is the complete collection of parameters of the iid-cluster mixture model. The probability density (8) is a likelihood function for iid point patterns arising from N_{comp} clusters.

C. EM clustering using mixture of iid-cluster RFSs
Given a dataset D = (X_1, ..., X_{N_{data}}) where each datum is a point pattern X_n = {x_{n,1}, ..., x_{n,|X_n|}}, assume that D is generated from an underlying iid-cluster RFS mixture model with N_{comp} components (8), where each component represents a data cluster. The EM clustering for the given data consists of two main steps:

1) Model learning: estimate the parameters of the underlying model using the EM method.

2) Cluster assignment: assign observations to clusters using maximum a posteriori (MAP) estimation.
Algorithm 1
EM algorithm for mixture of iid-cluster RFSs
Input: dataset D = {X_1, ..., X_{N_{data}}}, number of components N_{comp}, number of iterations N_{iter}
Output: parameters \Theta of the iid-cluster RFS mixture model

initialize \Theta^{(0)} = {(w_k^{(0)}, C_k^{(0)}, F_k^{(0)}) : k = 1, ..., N_{comp}}
for i = 1 to N_{iter}
    /* Compute posteriors */
    for n = 1 to N_{data}
        for k = 1 to N_{comp}
            p(k | X_n, \Theta^{(i-1)}) = \frac{w_k^{(i-1)} p(X_n | C_k^{(i-1)}, F_k^{(i-1)})}{\sum_{\ell=1}^{N_{comp}} w_\ell^{(i-1)} p(X_n | C_\ell^{(i-1)}, F_\ell^{(i-1)})}   (10)
        end
    end
    for k = 1 to N_{comp}
        /* Update component weights */
        w_k^{(i)} = \frac{1}{N_{data}} \sum_{n=1}^{N_{data}} p(k | X_n, \Theta^{(i-1)})   (11)
        /* Update cardinality distribution parameters */
        for m = 0 to N_{card}
            q_{k,m}^{(i)} = \frac{\sum_{n=1}^{N_{data}} \delta(m - |X_n|)\, p(k | X_n, \Theta^{(i-1)})}{\sum_{\ell=0}^{N_{card}} \sum_{n=1}^{N_{data}} \delta(\ell - |X_n|)\, p(k | X_n, \Theta^{(i-1)})}   (12)
        end
        /* Update feature density parameters */
        \mu_k^{(i)} = \frac{\sum_{n=1}^{N_{data}} \big( p(k | X_n, \Theta^{(i-1)}) \sum_{x \in X_n} x \big)}{\sum_{n=1}^{N_{data}} |X_n|\, p(k | X_n, \Theta^{(i-1)})}   (13)
        \Sigma_k^{(i)} = \frac{\sum_{n=1}^{N_{data}} p(k | X_n, \Theta^{(i-1)}) \sum_{x \in X_n} K^{(i)}(x)}{\sum_{n=1}^{N_{data}} |X_n|\, p(k | X_n, \Theta^{(i-1)})},   (14)
        where K^{(i)}(x) = (x - \mu_k^{(i)})(x - \mu_k^{(i)})^T
    end
end
return \Theta^{(N_{iter})}
Model learning step: In this paper, we present EM learning for the iid-cluster mixture model (8) with categorical cardinality distribution and Gaussian feature distribution, i.e., the parameters of the k-th component are

C_k = \Big\{ (q_{k,0}, ..., q_{k,N_{card}}) : 0 \le q_{k,m} \le 1,\ \sum_{m=0}^{N_{card}} q_{k,m} = 1 \Big\}, \qquad F_k = \{ (\mu_k, \Sigma_k) \},

where N_{card} is the maximum cardinality of the point patterns, and \mu_k, \Sigma_k are the means and covariances of the Gaussian distributions. The EM algorithm for learning the model parameters \Theta = {(w_k, C_k, F_k) : k = 1, ..., N_{comp}} proceeds as shown in Algorithm 1.

In each iteration of Algorithm 1, the parameters are estimated by \Theta^{(i)} = \operatorname*{argmax}_\Theta Q(\Theta, \Theta^{(i-1)}) [39], where

Q(\Theta, \Theta^{(i-1)}) = \sum_{k=1}^{N_{comp}} \sum_{n=1}^{N_{data}} \log\big( w_k\, p(X_n | C_k, F_k) \big)\, p(k | X_n, \Theta^{(i-1)}),   (15)

and p(X_n | C_k, F_k), p(k | X_n, \Theta^{(i-1)}) are given by (9), (10), respectively.
Cluster assignment step: The cluster label k_n of an observation X_n can be estimated using MAP estimation:

\hat{k}_n = \operatorname*{argmax}_{k \in K} p(k | X_n, \Theta),   (16)

where K = {1, ..., N_{comp}}, the parameters \Theta are learned by Algorithm 1, and p(k | X_n, \Theta) is the posterior probability:

p(k | X_n, \Theta) = \frac{w_k\, p(X_n | C_k, F_k)}{\sum_{\ell=1}^{N_{comp}} w_\ell\, p(X_n | C_\ell, F_\ell)}.   (17)
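Putting the two steps together, a compact NumPy sketch of Algorithm 1 followed by the MAP assignment (16)-(17) might look as follows. The random initialization, the cardinality support [0, max observed cardinality], the small covariance regularizer and U = 1 are implementation choices not specified in the paper.

```python
import numpy as np
from math import lgamma
from scipy.stats import multivariate_normal

def iid_cluster_logdensity(X, log_q, mean, cov):
    """log p(X | C_k, F_k) as in (9) with U = 1: categorical cardinality
    distribution (log_q[m] = log q_{k,m}) and iid Gaussian feature density."""
    m = len(X)
    out = log_q[m] + lgamma(m + 1)               # log p_c(|X|) + log(|X|!)
    if m:
        out += np.sum(multivariate_normal.logpdf(X, mean=mean, cov=cov))
    return out

def em_cluster(data, n_comp, n_iter=20, seed=0):
    """EM for a mixture of iid-cluster RFSs (Algorithm 1) plus MAP assignment (16)-(17)."""
    rng = np.random.default_rng(seed)
    dim = data[0].shape[1]
    cards = np.array([len(X) for X in data])
    n_card = cards.max()
    pts = np.vstack(data)                        # all feature points pooled, in dataset order
    # crude initialization (an implementation choice)
    w = np.full(n_comp, 1.0 / n_comp)
    q = rng.dirichlet(np.ones(n_card + 1), size=n_comp)   # cardinality pmfs q_{k,m}
    means = pts[rng.choice(len(pts), n_comp, replace=False)].astype(float)
    covs = np.stack([np.cov(pts.T) + 1e-6 * np.eye(dim)] * n_comp)

    for _ in range(n_iter):
        # E-step: posteriors p(k | X_n, Theta) as in (10), computed in log space
        logp = np.array([[np.log(w[k]) + iid_cluster_logdensity(X, np.log(q[k] + 1e-300),
                                                                means[k], covs[k])
                          for k in range(n_comp)] for X in data])
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: updates (11)-(14)
        w = r.mean(axis=0)
        for k in range(n_comp):
            q[k] = np.array([r[cards == m, k].sum() for m in range(n_card + 1)])
            q[k] /= q[k].sum()
            wk = np.repeat(r[:, k], cards)       # per-point weights p(k | X_n)
            means[k] = (wk[:, None] * pts).sum(axis=0) / wk.sum()
            diff = pts - means[k]
            covs[k] = (wk[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / wk.sum()
            covs[k] += 1e-6 * np.eye(dim)        # regularizer to keep covariances invertible
    return r.argmax(axis=1)                      # MAP cluster labels (16)
```

For example, labels = em_cluster(data, 3) yields hard cluster labels that can then be scored with the Rand index sketched in section II-C.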
D. Numerical experiments

In this section, we evaluate the proposed EM clustering with both simulated and real data. For comparison with AP clustering, we use the same datasets as in section II-C. Note that our experiments assume a mixture of Poisson RFSs as the model.
1) EM clustering with simulated data:
In this experiment we apply EM clustering to the same (simulated) datasets from section II-C1. The performance is shown in Fig. 6. EM clustering performs very well on these datasets, with an average Rand index of 0.95.

Figure 6. Performance of EM clustering on simulated datasets (the same datasets as in section II-C1).
To further investigate the performance of the proposed algorithm, we illustrate in Fig. 7 the distributions learned by the EM algorithm for the 3 datasets shown in Fig. 2. Observe that the model-based method performs better than the non-parametric method when there is little overlap between the feature distributions or the cardinality distributions of the data model (e.g., the 1st and 2nd simulated datasets). However, as expected, if there is significant overlap in both the feature distributions and the cardinality distributions, then model-based clustering performs poorly since there is not enough information to separate the data (e.g., the 3rd simulated dataset).
Figure 7. The RFS distributions learned by EM for the 3 simulated datasets (corresponding to the 3 datasets in Fig. 2). In each subplot, Left: feature distributions (2-D Gaussians); Right: cardinality distributions (Poissons).

Figure 8. Performance of EM clustering on the Texture dataset (described in section II-C2), evaluated with different numbers of iterations N_iter = 1, ..., 10.
2) EM clustering with real data:
In this experiment, we cluster the Texture dataset (described in section II-C2) using EM clustering with various numbers of iterations N_iter = 1, ..., 10. The performance is shown in Fig. 8. The best performance is a Rand index of 0.86, which equals the best performance of AP clustering using OSPA with c = 60 (Fig. 5). Fig. 9 shows the learned distributions after 10 EM iterations.

Figure 9. The learned RFS distributions for the Texture data after 10 EM iterations. Left: feature distributions; Right: cardinality distributions.

IV. CONCLUSION
This paper has detailed a non-parametric approach (based on set distances) and a model-based approach (based on random finite sets) to the clustering problem for point pattern data (aka 'bags' or multiple instance data). Experiments with both simulated and real data indicate that, in the non-parametric method, the choice of distance has a big influence on the clustering performance. The experiments also indicate that the model-based method performs better than the non-parametric method when there is little overlap between the feature distributions or the cardinality distributions of the data model. However, as expected, if there is significant overlap in both the feature distributions and the cardinality distributions, then model-based clustering performs poorly since there is not enough discriminative information to separate the data.

Future research directions may include developing more complex RFS models, such as RFSs with Gaussian mixture feature distributions, which can better capture multi-modal feature data (e.g., section II-C2). Another promising development is adapting the proposed clustering algorithms to data stream mining, an emerging research topic dealing with rapidly and continuously generated data such as search or surveillance data [40].

REFERENCES
[1] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
[3] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.
[4] R. C. Tryon, Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the Isolation of Unities in Mind and Personality. Edwards Brothers, Inc., 1939.
[5] D. J. Witherspoon, S. Wooding, A. R. Rogers, E. E. Marchani, W. S. Watkins, M. A. Batzer, and L. B. Jorde, "Genetic similarities within and between human populations," Genetics, vol. 176, no. 1, pp. 351–359, 2007.
[6] M.-S. Yang, Y.-J. Hu, K. C.-R. Lin, and C. C.-L. Lin, "Segmentation techniques for tissue differentiation in MRI of ophthalmology using fuzzy clustering algorithms," Magnetic Resonance Imaging, vol. 20, no. 2, pp. 173–179, 2002.
[7] G. Arimond and A. Elfessi, "A clustering method for categorical data in tourism market segmentation research," Journal of Travel Research, vol. 39, no. 4, pp. 391–397, 2001.
[8] N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan, "Clustering social networks," in Algorithms and Models for the Web-Graph. Springer, 2007, pp. 56–67.
[9] V. Nguyen, A. Martinelli, N. Tomatis, and R. Siegwart, "A comparison of line extraction algorithms using 2D laser rangefinder for indoor mobile robotics," in Proc. 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2005, pp. 1929–1934.
[10] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[11] R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
[12] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," DTIC Document, Tech. Rep., 1996.
[13] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in AAAI-98 Workshop on Learning for Text Categorization, vol. 752, 1998, pp. 41–48.
[14] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[15] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2005, pp. 524–531.
[16] S. Guha, R. Rastogi, and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes," in Proc. 15th International Conference on Data Engineering, 1999, pp. 512–521.
[17] Y. Yang, X. Guan, and J. You, "CLOPE: a fast and effective clustering algorithm for transactional data," in Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 682–687.
[18] C.-H. Yun, K.-T. Chuang, and M.-S. Chen, "An efficient clustering algorithm for market basket data based on small large ratios," in Proc. 25th Annual International Computer Software and Applications Conference (COMPSAC), 2001, pp. 505–510.
[19] I. V. Cadez, S. Gaffney, and P. Smyth, "A general probabilistic framework for clustering individuals and objects," in Proc. 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 140–149.
[20] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1, pp. 31–71, 1997.
[21] F. Minhas and A. Ben-Hur, "Multiple instance learning of calmodulin binding sites," Bioinformatics, vol. 28, no. 18, pp. i416–i422, 2012.
[22] J. Amores, "Multiple instance classification: Review, taxonomy and comparative study," Artificial Intelligence, vol. 201, pp. 81–105, 2013.
[23] J. Foulds and E. Frank, "A review of multi-instance learning assumptions," The Knowledge Engineering Review, vol. 25, no. 1, pp. 1–25, 2010.
[24] D. Stoyan, W. S. Kendall, and J. Mecke, Stochastic Geometry and Its Applications. John Wiley & Sons, 1995.
[25] D. J. Daley and D. Vere-Jones, An Introduction to the Theory of Point Processes. Springer, 1988, vol. 2.
[26] J. Moller and R. P. Waagepetersen, Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall/CRC, 2003.
[27] M.-L. Zhang and Z.-H. Zhou, "Multi-instance clustering with applications to multi-instance prediction," Applied Intelligence, vol. 31, no. 1, pp. 47–68, 2009.
[28] D. Zhang, F. Wang, L. Si, and T. Li, "M3IC: Maximum margin multiple instance clustering," in IJCAI, vol. 9, 2009, pp. 1339–1344.
[29] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.
[30] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[31] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, "A consistent metric for performance evaluation of multi-object filters," IEEE Transactions on Signal Processing, vol. 56, no. 8, pp. 3447–3457, 2008.
[32] J. R. Hoffman and R. P. Mahler, "Multitarget miss distance via optimal assignment," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 34, no. 3, pp. 327–336, 2004.
[33] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008, vol. 1.
[34] S. Lazebnik, C. Schmid, and J. Ponce, "A sparse texture representation using local affine regions," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[35] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms."
[36] B.-N. Vo, S. Singh, and A. Doucet, "Sequential Monte Carlo methods for multitarget filtering with random finite sets," IEEE Transactions on Aerospace and Electronic Systems, vol. 41, no. 4, pp. 1224–1245, 2005.
[37] H. G. Hoang, B.-N. Vo, B.-T. Vo, and R. Mahler, "The Cauchy-Schwarz divergence for Poisson point processes," IEEE Transactions on Information Theory, vol. 61, no. 8, pp. 4475–4485, 2015.
[38] M. van Lieshout, Markov Point Processes and Their Applications. Imperial College Press, 2000.
[39] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," International Computer Science Institute, vol. 4, no. 510, p. 126, 1998.
[40] H.-L. Nguyen, Y.-K. Woon, and W.-K. Ng, "A survey on data stream clustering and classification."