LARGE-DATA DETERMINANTAL CLUSTERING

A PREPRINT
Serge Vicente
Department of Mathematics and Statistics, Université de Montréal, Montréal, Québec, Canada
[email protected]
Alejandro Murua
Department of Mathematics and Statistics, Université de Montréal, Montréal, Québec, Canada
[email protected]
February 9, 2021

ABSTRACT
Determinantal consensus clustering is a promising and attractive alternative to partitioning around medoids and k-means for ensemble clustering. Based on determinantal point process (DPP) sampling, it ensures that subsets of similar points are less likely to be selected as centroids; it favors more diverse subsets of points. The sampling algorithm of the determinantal point process requires the eigendecomposition of a Gram matrix, which becomes computationally intensive when the data size is very large. This is particularly an issue in consensus clustering, where a given clustering algorithm is run several times in order to produce a final consolidated clustering. We propose two efficient alternatives to carry out determinantal consensus clustering on large datasets. They consist in DPP sampling based on sparse and small kernel matrices whose eigenvalue distributions are close to that of the original Gram matrix.

Keywords: Cholesky factorization · classification · k-nearest neighbors · Kullback-Leibler divergence · nearest-neighbors Gaussian process · sparseness · spectral decomposition

Cluster analysis is a classical procedure for aggregating elements according to their similarity. Among the most popular clustering methods are the well-known k-means [40] and partitioning around medoids (PAM) [33] algorithms. These two procedures share the particularity of building data partitions from an initial choice of random points called centroids. These points are usually sampled uniformly at random from the set of all datapoints. Since each point has the same probability of being chosen, the initial sample may end up containing many similar points that carry the same type of information. That is, the initial sample might not represent the diversity present in the data.
This might affect the effectiveness of the clustering. Nowadays, diversity in the selected elements is a major concern in several domains of research, like clinical trials [12], forensic sciences [57] or educational development [54]. Adopting uniform random sampling as a sampling mechanism can result in sets of elements with poor coverage of all the facets of a population under study. Determinantal point processes, or DPPs for short, introduced by [8], can address this problem. DPPs model negative correlations between points, so that similar elements have less chance of being simultaneously sampled. The negative correlations are captured by the so-called kernel or Gram matrix [34], a matrix whose entries represent a measure of similarity between pairs of points. DPPs have already been adopted in machine learning as models for subset selection [27, 21, 50, 22, 42]. The origins of DPPs can be found in quantum physics [41]. Known first as fermion processes, they model the distribution of fermion systems at thermal equilibrium. Much later, [8] introduced the now accepted determinantal point process terminology in the mathematics community. DPPs have also been applied to problems dealing with nonintersecting random paths [14], random spanning trees [9], and the study of eigenvalues of random matrices [3]. Clustering algorithms like k-means and partitioning around medoids result in a single partition of the data, seeking to maximize intra-cluster similarity and inter-cluster dissimilarity. However, the clustering results of two different
algorithms can be very different. The lack of an external, objective and impartial criterion can explain those differences [55]. The dependence on the initial choice of centroids is also an important factor. [5] and [6] proposed a new approach to improve the quality and robustness of clustering results. The approach was later formalized by [53], where the notion of cluster ensembles is introduced. Cluster ensembles combine different data partitions into a single consolidated clustering. A particular cluster ensemble method was later introduced in [45], the so-called consensus clustering. This method consists in performing multiple runs of the same clustering algorithm on the same data, so as to produce a single clustering configuration by agreement among all the runs. Most often, the particular clustering algorithm to be run several times for consensus involves random initial conditions or centroids. Recently, [56] introduced one such method, the determinantal consensus clustering or consensus DPP procedure, which generates centroids through a DPP. The similarity or distance between the datapoints is incorporated in the similarity matrix, also known as the kernel or Gram matrix, which constitutes the core of the DPP. Moreover, the link between similarity matrices and kernel methods for statistical or machine learning makes the method very flexible and effective at discovering data clusters. The diversity within centroids is automatically inherited from the DPP sampling. The DPP "diversity at sampling" property has been shown to greatly improve consensus clustering results [56]. In practice, the centroids are drawn using the DPP sampling algorithm described in [3] and [34]. This algorithm is based on computing the spectral decomposition of the data similarity or kernel matrix. Unfortunately, when the data size n is very large, the eigendecomposition becomes a computational burden.
The computational complexity of the eigendecomposition of an n × n symmetric matrix is of order O(n³). However, to sample the centroid points, we might not need to compute all eigenvalues and eigenvectors of the kernel matrix. In fact, the probability of selecting each datapoint depends on a corresponding eigenvalue. Datapoints associated with relatively very small eigenvalues are selected with very low probability. Therefore, the key to reducing the computational burden induced by large datasets is the extraction of the largest eigenvalues of the associated kernel matrix. This raises the need for algorithms able to extract such a subset of eigenvalues. One of the most used algorithms for extracting the largest eigenvalues is the Lanczos algorithm [37]. Due to its proven numerical instability, many variations of it have been proposed. A popular variation of the Lanczos algorithm is the implicitly restarted Lanczos method [10], which we adopt in this paper. The Lanczos algorithm has been specially developed for large sparse symmetric matrices. Hence, to be able to perform determinantal consensus clustering on large datasets, we need to find good sparse approximations of the often dense original kernel matrix. We propose two sound approaches: one based on the Nearest Neighbor Gaussian Process of [15], and another based on random sampling of small submatrices from the dense kernel matrix. This latter approach may be seen as a kind of divide-and-conquer approach. Although we show that these approaches offer good approximations to the eigenvalue distribution of the original kernel matrix, our goal is rather to introduce alternative efficient DPP sampling models for large datasets that inherit the data diversity expressed in the original kernel matrix. The paper is organized as follows: in Section 2, we summarize consensus clustering and recall the basic characteristics of determinantal point processes.
Determinantal consensus clustering is also summarized in this section. In Section 3, we introduce the problem of using the determinantal point process in the context of large datasets, and present two approaches to address this issue. In Section 4, we evaluate the two approaches presented in Section 3 through large dataset simulations; here we also illustrate the concept of diversity introduced by the determinantal point process. A performance comparison between our two approaches and two other competing methods on large real datasets is shown in Section 5. We conclude with a few thoughts and a discussion in Section 6.

Throughout the paper, the data will be denoted by S = {x_1, . . . , x_n} ⊂ R^p, where x_i represents a p-dimensional vector, for i = 1, . . . , n. Consider a particular clustering algorithm run R times on the same data S. The agreement among the several runs of the algorithm is based on the consensus matrix C. This is an n × n symmetric matrix whose entries {C_ij, i, j = 1, . . . , n} represent the proportion of runs in which elements x_i and x_j of S fall in the same cluster. Let r represent a specific run of the clustering algorithm, r = 1, . . . , R, and let C^r be the associated n × n symmetric binary matrix with entries c^r_ij = 1 if x_i and x_j belong to the same cluster in the r-th run of the algorithm, and c^r_ij = 0 otherwise, i, j = 1, . . . , n. The entries of the consensus matrix C are given by C_ij = Σ_{r=1}^{R} c^r_ij / R, i, j = 1, . . . , n. The entry C_ij is known as the consensus index. The diagonal entries are given by C_ii = 1, for i = 1, . . . , n. Our interest is to extend the determinantal consensus clustering (or consensus DPP) algorithm of [56] to large datasets. Consensus DPP is a modified version of the partitioning around medoids (PAM) algorithm [33] in which the center points
are sampled with a determinantal point process (DPP); more details on DPPs are given in the next section. Each run starts with a Voronoi diagram [2] on the set S. This partitions the space into several cells or regions based on a random subset of generator points. After R runs of the algorithm, we obtain R partitions associated with the R Voronoi diagrams. The consensus matrix C is computed from these partitions. Let θ ∈ [0, 1] be a proportion threshold. According to [5], if C_ij ≥ θ, points x_i and x_j are defined as "friends" and are included in the same consensus cluster. Moreover, all mutual friends (including friends of friends, etc.) are assigned to the same cluster. To select an appropriate threshold value, we follow [46] and consider all threshold values given by the set of all different observed consensus indexes C_ij. If there are t different consensus indexes, we will have a collection of t thresholds θ_1, θ_2, . . . , θ_t. For each threshold θ_i, i = 1, . . . , t, a consensus clustering configuration with K(θ_i) clusters is obtained. If θ_i = 0, we obtain a configuration with K(0) = 1 cluster, that is, a single cluster equal to S. If θ_i = 1, we obtain a configuration with K(1) = n singleton clusters; that is, each element of S forms a singleton cluster. In general, clustering configurations with one cluster or n clusters are of no interest. Therefore, thresholds θ_i that are too low or too large are not relevant. This observation leads to a more efficient procedure [56] where only a predetermined sequence of t thresholds τ ≤ θ_1 < θ_2 < · · · < θ_t, bounded from below by a certain minimum threshold τ, is considered. This approach is particularly useful in the context of large datasets, since reducing the range of considered thresholds can reduce the computational burden induced by the high number of elements. Moreover, we are not interested in clustering configurations with too many small clusters.
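As an illustration, the construction of the consensus matrix from R label vectors and the "friends of friends" rule at a threshold θ (connected components of the graph with an edge whenever C_ij ≥ θ) can be sketched as follows. This is a minimal Python sketch, not the authors' code, and the toy label vectors are hypothetical:

```python
import numpy as np

def consensus_matrix(label_runs):
    """C[i, j] = fraction of runs in which points i and j share a cluster."""
    R, n = len(label_runs), len(label_runs[0])
    C = np.zeros((n, n))
    for labels in label_runs:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / R

def consensus_clusters(C, theta):
    """'Friends of friends': connected components of the graph C[i, j] >= theta."""
    n = C.shape[0]
    labels = -np.ones(n, dtype=int)
    k = 0
    for seed in range(n):
        if labels[seed] >= 0:
            continue
        stack, labels[seed] = [seed], k
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] < 0 and C[i, j] >= theta:
                    labels[j] = k
                    stack.append(j)
        k += 1
    return labels

# Toy illustration: n = 4 points, R = 2 runs.
runs = [[0, 0, 1, 1], [0, 1, 1, 1]]
C = consensus_matrix(runs)            # e.g. C[0, 1] = 0.5, C[2, 3] = 1.0
labels = consensus_clusters(C, theta=0.75)
```

At θ = 0.75 only the pair clustered together in both runs survives as an edge, so points 2 and 3 end up in one consensus cluster while points 0 and 1 are singletons.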
Only clustering configurations with cluster sizes larger than a minimal value are admissible (that is, accepted). [56] established through extensive simulations that the "square-root choice" (for the number of bins of a histogram), √n, is an adequate value for the minimum cluster size. Each of the t consensus clustering configurations obtained with the t thresholds {θ_i} is examined to verify that it is admissible. For every non-admissible consensus clustering configuration, we merge each small cluster with its closest "large" cluster, according to the following procedure, inspired by single linkage: (i) select the cluster V that has the smallest cluster size < √n; (ii) find the pair of indexes (i*, j*) ∈ {1, . . . , n}² that satisfies C_{i*j*} = max{C_ij : x_i ∈ V, x_j ∉ V}; (iii) merge the cluster V with the cluster that includes x_{j*}; (iv) repeat the merging procedure until there are no more clusters with size smaller than √n. Once all t consensus clustering configurations are made admissible, we select as the final consensus clustering the configuration among these t that minimizes the kernel-based validation index of [17]. This index was conceived after the studies of [23] to choose the optimal final clustering among several possible partitions. It can be seen as an index that combines modified extensions of the between and within variances to kernel-based methods. See [17] or [56] for further details.

Let L be an n × n real symmetric positive semidefinite matrix that measures similarity between all pairs of elements of S. We denote by L_Y = (L_ij)_{i,j ∈ Y} the principal submatrix of L whose rows and columns are indexed by the subset Y ⊆ S. A determinantal point process, DPP for short, is a probability measure on the subsets of S that assigns probability

P(Y) = det(L_Y) / det(L + I_n),     (1)

to any subset Y ⊆ S, where I_n is the identity matrix of dimension n × n.
We write Y ∼ DPP_S(L) for the corresponding determinantal process. The matrix L is known as the kernel matrix of the DPP [34, 31, 26]. It can be shown [34] that det(L + I_n) = Σ_{Y ⊆ S} det(L_Y); hence (1) does indeed define a probability mass function over all subsets of S. This definition imposes restrictions on all the principal minors of the kernel matrix L, denoted det(L_Y). Indeed, as P(Y) ∝ det(L_Y) represents a probability measure, we must have det(L_Y) ≥ 0 for any Y ⊆ S. Any symmetric positive semidefinite matrix L may be the kernel matrix of a DPP. For the construction of the similarity matrix L in (1), we use a suitable Mercer kernel [23], as indicated in [56]. The choice of an appropriate kernel is a critical step in the application of any kernel-based method. However, as pointed out by [29], there is no rule nor consensus about its choice. Ideally, the kernel should be chosen according to prior knowledge of the problem domain [29, 36], but this practice is rarely observed. In the absence of expert knowledge, a common choice is the Radial Basis Function (RBF) kernel (or
Gaussian kernel):

L = ( exp{ −‖x_i − x_j‖² / (2σ²) } )_{i,j=1}^{n},

where the scale parameter σ, known as the kernel's bandwidth, represents the relative spread of the Euclidean distances ‖x_i − x_j‖ between any two points x_i and x_j. Due to its appealing mathematical properties, this particular kernel has been extensively used in many studies. A particular property of the Gaussian kernel is that it is positive and bounded from above by one, making it directly interpretable as a scaled measure of similarity between any given pair of points.
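The following sketch (illustrative Python with simulated toy data, not the authors' code) builds the RBF kernel and evaluates the DPP probability in (1); the bandwidth heuristic used here, the average pairwise squared distance, is an assumption close in spirit to the estimate discussed below. Summing P(Y) over all subsets recovers 1, reflecting the identity det(L + I_n) = Σ_Y det(L_Y):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                       # toy data: n = 6 points in R^2

# RBF (Gaussian) kernel with a simple bandwidth heuristic.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
sigma2 = sq[np.triu_indices(6, k=1)].mean()       # mean pairwise squared distance
L = np.exp(-sq / (2.0 * sigma2))

def dpp_prob(L, Y):
    """P(Y) = det(L_Y) / det(L + I_n) for an L-ensemble DPP."""
    LY = L[np.ix_(Y, Y)]
    return np.linalg.det(LY) / np.linalg.det(L + np.eye(L.shape[0]))

p_pair = dpp_prob(L, [0, 1])
# Probabilities over all 2^6 subsets sum to 1.
total = sum(dpp_prob(L, list(Y)) for s in range(7) for Y in combinations(range(6), s))
```

Note that the determinant of a principal submatrix of a PSD matrix is nonnegative, so each P(Y) is a valid probability.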
The computation of the RBF kernel requires the estimation of the bandwidth parameter σ. As pointed out by [46], most of the literature considers σ as a parameter that can be estimated from observed data. Inspired by [5] and [6], we estimate σ² by the average of all pairwise squared Euclidean distances, i.e., σ̂² = 2 Σ_{i<j} ‖x_i − x_j‖² / (n(n − 1)).

A DPP with kernel L can be written as a mixture of elementary DPPs, DPP(K^J), indexed by the subsets J of eigenvector indices of L [34]. The mixture component weight of DPP(K^J) is given by the product of the eigenvalues λ_i(L) corresponding to the eigenvectors v_i ∈ V_J, normalized by det(L + I_n) = Π_{i=1}^{n} [λ_i(L) + 1]. Sampling a subset Y ∼ DPP(L) can then be realized by first selecting an elementary DPP, DPP(K^J), with probability equal to its mixture component weight, and then, in a second step, sampling a subset from DPP(K^J). Moreover, it can be shown that the expected value and variance of the number of elements in Y, card(Y), are given by

E[card(Y)] = Σ_{i=1}^{n} λ_i(L) / (λ_i(L) + 1);   Var[card(Y)] = Σ_{i=1}^{n} λ_i(L) / (λ_i(L) + 1)².

It is well known that the computational complexity of obtaining the eigendecomposition of an n × n symmetric matrix is of order O(n³) and, as n grows large, the computation of the eigenvalues and eigenvectors becomes expensive. Even the storage of the matrix is limited by the available memory. The sampling algorithm based on a DPP starts with a subset of the eigenvectors of the kernel matrix, selected at random, where the probability of selecting each eigenvector depends on its associated eigenvalue. This fact directly suggests that it is unnecessary to compute all the eigenvalues, as eigenvectors with small associated eigenvalues are selected with low probability. This is particularly useful in the case of large matrices: computing only the largest eigenvalues can substantially reduce the computational burden of obtaining all the eigenvalues.
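To see numerically why truncating the spectrum is harmless, one can check that eigenvalues near zero contribute almost nothing to E[card(Y)]. The sketch below (illustrative, with a synthetic low-rank kernel) evaluates the formulas above:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
L = A @ A.T + 1e-6 * np.eye(50)                # toy PSD kernel: numerical rank ~5

lam = np.linalg.eigvalsh(L)                    # all eigenvalues (ascending)
p = lam / (lam + 1.0)                          # inclusion prob. of each eigenvector
expected_size = p.sum()                        # E[card(Y)]
variance_size = (lam / (lam + 1.0) ** 2).sum() # Var[card(Y)]
contrib_top5 = np.sort(p)[-5:].sum()           # contribution of the 5 largest
```

On this example the five dominant eigenvalues account for essentially all of the expected sample size, so the remaining 45 eigenvalues could be dropped with negligible effect on the DPP.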
The literature contains many well-known algorithms that can extract the t largest (or smallest) eigenvalues, with their associated eigenvectors, of an n × n Hermitian matrix, where usually t ≪ n. One of the most classical and widely used is the Lanczos algorithm [37]. Despite its popularity and computational efficiency, the algorithm was proven to be numerically unstable, due in part to the loss of orthogonality of the computed Krylov subspace basis vectors. Since then, many efforts have been made to address this issue, such as [13], [47] or [25]. Many of the variations of the Lanczos algorithm propose a restart after a certain number of iterations. One of the most popular restarted variations is the implicitly restarted Lanczos method, proposed by [10], which is implemented in ARPACK [39], motivating our preference for this particular variation. The Lanczos algorithm and its implicitly restarted variation were specially developed for large sparse symmetric matrices. Consequently, our first step before applying the implicitly restarted Lanczos method is to find a good approximation of the kernel matrix L by a sparse matrix. We address this question by following two approaches: one based on the Nearest Neighbor Gaussian Process [15], and another based on random sampling of small submatrices from the dense kernel matrix L.
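For intuition, a plain Lanczos iteration with full reorthogonalization can be sketched in a few lines (the implicitly restarted variant used in the paper, as implemented in ARPACK, is considerably more involved; this is only an illustration on a synthetic low-rank matrix):

```python
import numpy as np

def lanczos_topk(M, k, iters=60, seed=0):
    """Plain Lanczos with full reorthogonalization; returns the top-k Ritz values.
    A sketch only: production code would use an implicitly restarted variant."""
    n = M.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.normal(size=n)
    q /= np.linalg.norm(q)
    Q, alphas, betas = [q], [], []
    for _ in range(min(iters, n)):
        w = M @ Q[-1]
        a = float(Q[-1] @ w)
        alphas.append(a)
        w = w - a * Q[-1] - (betas[-1] * Q[-2] if betas else 0.0)
        for qi in Q:                      # full reorthogonalization
            w = w - (qi @ w) * qi
        b = float(np.linalg.norm(w))
        if b < 1e-12:                     # Krylov space exhausted
            break
        betas.append(b)
        Q.append(w / b)
    T = (np.diag(alphas)
         + np.diag(betas[:len(alphas) - 1], 1)
         + np.diag(betas[:len(alphas) - 1], -1))
    return np.sort(np.linalg.eigvalsh(T))[-k:]

rng = np.random.default_rng(2)
B = rng.normal(size=(200, 10))
M = B @ B.T                               # rank-10 PSD matrix, n = 200
approx = lanczos_topk(M, 3)               # extreme Ritz values converge quickly
exact = np.sort(np.linalg.eigvalsh(M))[-3:]
```

The eigenvalues of the small tridiagonal matrix T (the Ritz values) approximate the extreme eigenvalues of M well before the iteration count reaches n, which is the source of the computational savings.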
The Nearest Neighbor Gaussian Process (NNGP) was developed by [15] and extended by [18]. We use it to obtain an approximation L̃ of the kernel matrix L to which the implicitly restarted Lanczos method can be applied in order to obtain its largest eigenvalues. [18] showed that the covariance matrix W of a Gaussian process can be expressed through a specific Cholesky decomposition:

W = (I_n − A)⁻¹ D (I_n − A)⁻ᵀ,     (2)

where A is an n × n strictly lower-triangular matrix and D is an n × n diagonal matrix; here (·)⁻ᵀ stands for the inverse of the transposed matrix. In order to define the matrices A and D properly, we need to introduce the following notation. For any n × n matrix M and a subset of indices J ⊆ {1, . . . , n}, we write M_{k,J} = (M_{kj})_{j ∈ J} for the card(J)-dimensional vector formed by the corresponding components of row k of M, k = 1, . . . , n. We define M_{J,k} similarly. Also, we write [i₁ : i₂] for the set J = {j : i₁ ≤ j ≤ i₂}. Having introduced the notation, we can write the i-th row of A, A_{i•}, as A_{i,[1:i−1]} = W_{[1:i−1]}⁻¹ W_{[1:i−1],i} for i = 2, . . . , n, and A_{i,[i:n]} = 0 for i = 1, . . . , n, where W_{[1:k]} represents the leading principal submatrix of order k of the matrix W. The diagonal entries D_ii of D are such that D_{11} = W_{11} and D_ii = W_ii − W_{i,[1:i−1]} A_{i,[1:i−1]}ᵀ, for i = 2, . . . , n. Note that these are the linear equations that define the matrices A and D. They need to be solved for A and D in order to obtain the decomposition in (2). Unfortunately, the computation of A_{i•} still takes O(n³) floating point operations, especially for values of i close to n, which increase the dimension of W_{[1:i−1]}. Despite this shortcoming, the authors note that this specific decomposition highlights where sparseness can be exploited: setting to zero some elements in the lower triangular part of A.
This is achieved by limiting the number of nonzero elements in each row of A to a maximum of m elements. Let N_i ⊆ {1, . . . , n} be the set of indices j < i for which A_{i,j} ≠ 0. We constrain N_i to have at most m indices. In this case, all elements of the i-th row A_{i•} of A are zero, except for the elements A_{i,N_i} = W_{N_i}⁻¹ W_{N_i,i}, for i = 2, . . . , n, where W_{N_i} is the principal submatrix of W whose rows and columns are indexed by N_i. For the diagonal entries, we have D_{11} = W_{11} and D_ii = W_ii − W_{i,N_i} A_{i,N_i}ᵀ, for i = 2, . . . , n. These equations form linear systems of size at most m × m, with m = max_i(card(N_i)). This new system can be solved for A and D in O(nm³) floating point operations. Using these solutions gives rise to the sparse approximation to the precision matrix W⁻¹:

W̃⁻¹ = (I_n − A)ᵀ D⁻¹ (I_n − A).     (3)

The inverse of W̃⁻¹ is an approximation to W. [15] show, in the context of spatial Gaussian processes, that W̃⁻¹ has at most nm(m + 1)/2 nonzero entries. Thus, W̃⁻¹ is sparse provided that m ≪ n. We can apply this result to develop an efficient determinantal consensus clustering (see Section 2) when the data size n is large. Recall that the kernel matrix L is a real symmetric positive semidefinite matrix. When L is also positive definite, it can be seen as a covariance matrix; this is what we assume from now on. Therefore, L can be approximated using the NNGP approach by a matrix L̃ whose inverse L̃⁻¹ is sparse. For each i ∈ {1, . . . , n}, consider the distances d_ij = ‖x_i − x_j‖ for j < i, and let d_{i(1)} ≤ d_{i(2)} ≤ · · · ≤ d_{i(i−1)} be the corresponding sequence of ordered distances. In our model we set N_i = {j : j < i, d_ij ≤ d_{i(m)}} for i > m, and N_i = {1, . . . , i − 1} for i ≤ m.
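The construction of A and D from the neighbor sets can be sketched as follows (illustrative Python, not the authors' code; here N_i is taken as the m previous indices most similar to i under W, which coincides with nearest previous points for a monotone kernel). A convenient sanity check uses a one-dimensional exponential kernel, which is Markov: conditioning on the single nearest previous point (m = 1) already recovers the exact precision matrix:

```python
import numpy as np

def nngp_factors(W, m):
    """Factors A (strictly lower triangular) and D (diagonal) of the NNGP
    approximation W_tilde^{-1} = (I - A)^T D^{-1} (I - A)."""
    n = W.shape[0]
    A = np.zeros((n, n))
    D = np.zeros(n)
    D[0] = W[0, 0]
    for i in range(1, n):
        Ni = np.argsort(W[i, :i])[::-1][:m]          # m most similar previous points
        a = np.linalg.solve(W[np.ix_(Ni, Ni)], W[Ni, i])  # an (at most) m x m system
        A[i, Ni] = a
        D[i] = W[i, i] - W[i, Ni] @ a
    return A, D

# Toy 1-D covariance: exponential kernel on sorted points (Markov in 1-D).
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(size=40))
W = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)
A, D = nngp_factors(W, m=1)
I = np.eye(40)
prec = (I - A).T @ np.diag(1.0 / D) @ (I - A)        # sparse approximation of W^{-1}
```

The m = 1 exactness is a special property of this Markov toy example; for a general kernel the quality of W̃⁻¹ improves as m grows, at O(nm³) cost.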
Let Ľ be the m-nearest-neighbor matrix whose entries are given by Ľ_ii = L_ii, Ľ_ij = L_ij if j ∈ N_i, and Ľ_ij = 0 otherwise. We would like to stress here that the matrix L̃ based on the neighborhoods {N_i}_{i=1}^{n} is not the same as the m-nearest-neighbor matrix Ľ. We have adopted L̃ instead of Ľ for several reasons. First, L̃ is a dense approximation of L whose inverse L̃⁻¹ is sparse. Second, L̃⁻¹ is a sparse matrix with O(nm²) nonzero entries, whereas Ľ is a sparse matrix with O(n·m_nn) nonzero entries, where m_nn is the number of nearest neighbors. Therefore, for a fixed level of sparseness, building L̃⁻¹ requires many fewer nearest neighbors, m = O(√m_nn), than building Ľ. Finally, in Section 4, we compute both Frobenius distances [28], ‖L − L̃‖_F and ‖L − Ľ‖_F, and show that the former is always smaller than the latter, for all our simulated data. Moreover, in terms of the symmetrized Kullback-Leibler divergence [35], the distribution of the eigenvalues of L̃ is closer to the distribution of the eigenvalues of L than is the distribution of the eigenvalues of Ľ. As our primary goal is to obtain the t ≪ n largest eigenvalues of the kernel matrix L, we start with the construction of the sparse matrix L̃⁻¹ using formula (3) and the sparse computation form above. We then apply the Lanczos algorithm to extract the t smallest eigenvalues of L̃⁻¹ and their associated eigenvectors. By inverting the eigenvalues, we obtain the t largest eigenvalues of L̃; the eigenvectors of L̃ are the same as those of L̃⁻¹. With the eigenvalues and eigenvectors in hand, we can proceed as usual with the determinantal consensus clustering of Section 2.
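The invert-the-smallest-eigenvalues step rests on the fact that the eigenvalues of a positive definite matrix and of its inverse are reciprocals, with identical eigenvectors. This can be checked on a toy dense matrix (illustrative sketch; full eigendecompositions stand in for the Lanczos calls):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(100, 100))
L = B @ B.T + np.eye(100)                 # a toy positive definite "kernel"
L_inv = np.linalg.inv(L)                  # stand-in for the sparse NNGP precision

t = 5
mu = np.sort(np.linalg.eigvalsh(L_inv))[:t]   # t smallest eigenvalues of L^{-1}
top_from_inverse = np.sort(1.0 / mu)          # inverted: the t largest of L
top_direct = np.sort(np.linalg.eigvalsh(L))[-t:]
```

In the large-data setting one never forms the dense inverse, of course; the point is only that extracting the t smallest eigenpairs of the sparse L̃⁻¹ yields the t largest eigenpairs of L̃.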
We stress here that the actual DPP used for the determinantal consensus clustering after this construction is DPP_S(L̃), and not DPP_S(L). In practice, L is chosen only to give us a measure of similarity between data points.

In this section, we consider another approach to dealing with large datasets and kernel matrices, which combines dimension reduction techniques with the advantages of working with sparse matrices. Let L⁽¹⁾, . . . , L⁽ᴹ⁾ denote M r × r submatrices sampled uniformly at random and without replacement from L, where r < n (ideally, r ≪ n). The idea is to use these submatrices as proxies for L in the DPP sampling. By generating a sufficiently large number M of matrices, we expect to cover the set of eigenvectors and eigenvalues of L with the smaller sets of eigenvectors and eigenvalues of the collection {L⁽ⁱ⁾ : i = 1, . . . , M}. This might be the case if the data are well separated, so that, although L might be a dense matrix, its hidden structure might be sparse; that is, many entries of L might be small. We could also achieve sparseness by thresholding the elements of L. However, the following approach yielded better results in our experiments (not shown here), and hence it is the second approach adopted (the first having been described in the previous section). We apply the following iterative methodology to the set of submatrices, for a number N of times. For k = 1, . . . , N:

1. Select an index i_k from {1, 2, . . . , M} at random (with replacement), and consider the submatrix L⁽ⁱᵏ⁾.

2. Build a sparse approximation L̂⁽ⁱᵏ⁾ of the submatrix L⁽ⁱᵏ⁾ by considering the k-nearest neighbors of each point associated with the rows of the submatrix; that is, L̂⁽ⁱᵏ⁾_ij = L⁽ⁱᵏ⁾_ij if x_j is one of the k-nearest neighbors of x_i or if x_i is one of the k-nearest neighbors of x_j, and L̂⁽ⁱᵏ⁾_ij = 0 otherwise.
3. Generate a subset sample Y_{i_k} from DPP(L̂⁽ⁱᵏ⁾), based only on the t largest eigenvalues extracted with the Lanczos algorithm.

4. Find the Voronoi cells of the n data points based on the sampled center points Y_{i_k}. This yields the k-th partition of the algorithm.

At the end of this procedure, we apply the determinantal consensus clustering of Section 2 to the set of N partitions obtained. The number M of submatrices to be sampled must be chosen so that we benefit from using the submatrices to sample the generator sets through the DPP, rather than using the whole kernel matrix L. We know that the computational complexity of the eigendecomposition of the n × n kernel matrix is O(n³) operations, whereas the eigendecompositions of the M r × r submatrices require O(Mr³) operations. To obtain computational gains from the sampled submatrices, we must guarantee that Mr³ < n³, that is, M < (r/n)⁻³. The quantity γ = r/n is the proportion of points considered in the submatrices. Since we would like to take full advantage of both the dimension reduction and the speed, we should work with values of M ≪ γ⁻³. In our experiments, we set M = ⌊γ^(−3/2)⌋, where ⌊x⌋ stands for the floor function.

In this section, we apply both approaches presented in Section 3 to moderately large datasets in order to compare their results. The comparison and evaluation of the results are based on simulations. Because the data size depends on the nature of the problem, there is no clear definition of what is considered a "large dataset". Here, we consider two data sizes as large (the smaller being n = 1000). These values were chosen so as to obtain results that can be applied to moderately large real datasets, and to be able to explore the effect on the results of using sparse kernel matrices instead of the original kernel matrices for determinantal consensus clustering.
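One draw of the random-submatrix procedure (steps 1–3 above) can be sketched as follows. All sizes (n, r, the number of neighbors, t) are illustrative, a dense eigendecomposition stands in for the Lanczos call, and the budget M is one choice well below the γ⁻³ bound:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, knn, t = 300, 60, 10, 5            # illustrative sizes; gamma = r/n = 0.2
X = rng.normal(size=(n, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
L = np.exp(-sq / (2.0 * sq[np.triu_indices(n, 1)].mean()))

gamma = r / n
M = max(1, int(gamma ** -1.5))           # number of submatrices, well below gamma**-3

# One iteration of the loop (steps 1-3):
idx = rng.choice(n, size=r, replace=False)        # step 1: one r x r submatrix
L_sub = L[np.ix_(idx, idx)]
order = np.argsort(sq[np.ix_(idx, idx)], axis=1)  # step 2: kNN within the submatrix
mask = np.zeros((r, r), dtype=bool)
rows = np.repeat(np.arange(r), knn + 1)
mask[rows, order[:, :knn + 1].ravel()] = True     # keep self + knn nearest neighbors
mask |= mask.T                                    # symmetrize the sparsity pattern
L_hat = np.where(mask, L_sub, 0.0)
top = np.sort(np.linalg.eigvalsh(L_hat))[-t:]     # step 3 (stand-in for Lanczos)
```

The symmetrization of the neighbor mask mirrors the "x_j a neighbor of x_i, or x_i a neighbor of x_j" rule in step 2, and it guarantees that L̂ remains symmetric so that real eigenvalues exist.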
Moreover, the chosen data sizes keep the computation time within reasonable elapsed times for our computer resources. Our choices of data sizes do not necessarily constitute what is known as big data, a term popularized by [43] to describe datasets with sizes that go beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time [52]. As [7] emphasize, the processing of large datasets does not have to involve big data.

Data generation. Following [56], we created nine experimental conditions or scenarios to generate datasets with n = 1000 observations. Each dataset was generated with the algorithm of [44], which draws datasets from p-variate finite Gaussian mixtures. These have the form Σ_{k=1}^{K} π_k φ_p(· ; μ_k, V_k), where K is the number of Gaussian components, φ_p denotes the p-variate Gaussian density, {μ_1, . . . , μ_K} are the component means, and {V_1, . . . , V_K} are the covariance matrices of the components. The means constitute K independent realizations of a uniform p-variate distribution on the p-dimensional unit hypercube; the covariance matrices are K independent realizations of a p-variate standard Wishart distribution with p + 1 degrees of freedom; the mixing proportions π_k are drawn from a Dirichlet distribution, so that Σ_{k=1}^{K} π_k = 1. The number of data points per component is a draw from the multinomial distribution based on the mixing proportions. [44] introduce the concept of pairwise overlap to generate datasets with the algorithm. It represents the degree of interaction between components and defines the clustering complexity of the datasets. [44] define a range of 0.001 to 0.4 for the average pairwise overlap between components, representing a very low and a very high overlap degree, respectively.
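The generative model just described can be sketched directly (an illustrative sketch only: the dimensions are hypothetical, a standard Wishart draw with d degrees of freedom is taken as GGᵀ with G a p × d standard normal matrix, and the pairwise-overlap control of [44] is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(6)
p, K, n = 2, 3, 1000                     # illustrative dimensions

pis = rng.dirichlet(np.ones(K))          # mixing proportions, summing to 1
mus = rng.uniform(size=(K, p))           # means in the unit hypercube
# Standard Wishart draws with p + 1 degrees of freedom: G G^T, G ~ N(0,1)^{p x (p+1)}
Vs = []
for _ in range(K):
    G = rng.normal(size=(p, p + 1))
    Vs.append(G @ G.T)
counts = rng.multinomial(n, pis)         # component sizes
X = np.vstack([rng.multivariate_normal(mus[k], Vs[k], size=counts[k])
               for k in range(K)])       # the simulated dataset
```

Each covariance GGᵀ is positive semidefinite by construction (and almost surely positive definite for p + 1 > p columns), so the component densities are well defined.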
We follow the values suggested by the authors and choose, for every dataset generated, a random average pairwise overlap between 0.001 and 0.01. This range of values keeps the overlap degree at a low level, so that performing clustering on the data makes sense. All datasets were generated from mixtures with ellipsoidal covariance matrices; the simulated components do not necessarily contain the same number of elements. To obtain the nine simulated datasets with n = 1000 observations, we consider three values of p (number of variables) and three values of K (number of components), corresponding to three different levels (low, medium, large) for p and K. We also ensured that no cluster with size less than √n is present in any simulated dataset because, otherwise, following the procedure described in Section 2.1, small clusters would inevitably be merged with larger clusters. We applied the two approaches presented in Section 3 to each simulated dataset.

Results with the approach based on NNGP. In order to study the effect of sparseness on the approximation presented in Section 3.1, we set different values for m, the maximum number of nonzero elements in each row of the matrix A. The values of m were chosen to attain four levels of sparseness (total percentage of zeros) for the matrix L̃⁻¹: 20%, 40%, 60% and 80%. The t largest eigenvalues of L̃, with their corresponding eigenvectors, were extracted via L̃⁻¹ with the Lanczos algorithm, for t ∈ {10, 25, 50}. The consensus DPP of Section 2 was applied to obtain 200 partitions, as recommended in [56]. The whole procedure was repeated ten times. Considering the nine simulated scenarios, the experiment consists of nine plots with ten repeated measures from a 4 × 3 factorial design given by the combination of sparseness and number of extracted eigenvalues.
Within each single plot, we note that all observations are dependent, since the same data were used for all combinations. Figure 1 shows density estimates of the eigenvalue distribution obtained with the NNGP approximation matrix L̃ (solid line). We considered two combinations of sparseness level and number of eigenvalues: (20%, t = 10) and (60%, t = 50), respectively. The figure presents plots from one dataset, since they depict typical plots obtained with all datasets. The tick-marks on the horizontal axis locate all eigenvalues of the original kernel matrix L. We can see that the density estimates obtained from L̃ concentrate correctly around the true eigenvalues of the kernel matrix L. For comparison purposes, we also display in Figure 1 the density estimates of the set of eigenvalues from the m_nn-nearest-neighbor matrix Ľ (dashed line) described in Section 3.1. Unlike the NNGP approach, the eigenvalue density estimates associated with Ľ do not quite concentrate on the set of true eigenvalues. The plots corroborate the arguments of Section 3.1 concerning our preference for L̃ over Ľ. Following the discussion of Section 3, we also computed the Frobenius distances between the sparse approximation matrices, L̃ and Ľ, and the original dense matrix L. Table 1 shows the mean and standard deviation of the Frobenius distances over all nine data scenarios, as a function of the level of sparseness. The mean distance ‖L − L̃‖_F is always much smaller than the distance ‖L − Ľ‖_F. At first, the distances ‖L − L̃‖_F decrease with sparseness, until a certain minimum sparseness level is reached. However, the distances from L to Ľ monotonically increase with sparseness. The approximation based on NNGP is therefore the preferred choice for extracting eigenvalues.
These results corroborate the nice concentration of the eigenvalue distributions of the NNGP approach about the eigenvalues of L observed in Figure 1.

Approximation   Sparseness level
matrix          20%   40%   60%   80%
L̃
Ľ
Table 1: Mean and standard deviation of the Frobenius distances to L.

To compare the eigenvalue distributions of L̃ and L, we computed the symmetrized Kullback-Leibler (KL) divergence [35] between the corresponding density estimates of the two eigenvalue distributions. The densities were estimated using a Gaussian kernel density estimator [51]. Table 2 reports the mean and standard deviation of the KL divergences over all nine data scenarios for each combination of sparseness and number of eigenvalues extracted.

Figure 1: Kernel density estimates of the eigenvalue distribution from the NNGP approximation L̃ (solid line) and the m_nn-nearest-neighbor matrix Ľ (dashed line): (a) sparseness = 20%, t = 10; (b) sparseness = 60%, t = 50. The plots correspond to two of the twelve experimental conditions for a given dataset. The tick-marks indicate the eigenvalues of L.

There does not appear to be any significant difference among the cases, except for t = 10 eigenvalues. Better results are obtained when a larger, but still very moderate, number of eigenvalues is estimated. The divergences reported in Table 2 are very small, showing a good similarity between the eigenvalue distributions.

Number of     Sparseness level
eigenvalues   20%          40%          60%          80%
10            0.01 (0.00)  0.36 (0.98)  0.36 (0.97)  0.35 (0.96)
25            0.01 (0.00)  0.01 (0.00)  0.01 (0.00)  0.01 (0.00)
50            0.01 (0.00)  0.01 (0.00)  0.01 (0.00)  0.01 (0.00)
Table 2: Mean and standard deviation (within parentheses) of Kullback-Leibler divergences.

We also report and compare the elapsed time in seconds for calculating each set of eigenvalues.
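The symmetrized KL divergence used in Table 2 can be sketched numerically as below; this is a minimal grid-based version built on `scipy.stats.gaussian_kde`, not the exact estimator of [35] and [51]:

```python
import numpy as np
from scipy.stats import gaussian_kde

def symmetrized_kl(x, y, grid_size=1024):
    """Symmetrized KL divergence KL(p||q) + KL(q||p) between Gaussian
    kernel density estimates of two samples, on a shared uniform grid."""
    p, q = gaussian_kde(x), gaussian_kde(y)
    g = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()), grid_size)
    dx = g[1] - g[0]
    pg = np.clip(p(g), 1e-12, None)      # avoid log(0)
    qg = np.clip(q(g), 1e-12, None)
    pg /= pg.sum() * dx                  # renormalize on the grid
    qg /= qg.sum() * dx
    kl_pq = np.sum(pg * np.log(pg / qg)) * dx
    kl_qp = np.sum(qg * np.log(qg / pg)) * dx
    return kl_pq + kl_qp
```

Here `x` and `y` would be the two sets of extracted eigenvalues; identical samples give a divergence of zero, and increasingly different samples give larger values.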
Table 3 displays the means and standard deviations of elapsed times over all nine data scenarios for each combination of sparseness and number of eigenvalues extracted. As expected, the average computation time decreases with sparseness and increases with the number of eigenvalues extracted. In fact, a linear regression (not shown here) of the elapsed time as a function of sparseness and number of eigenvalues extracted yields a coefficient of determination of 0.97, clearly indicating that the computational time varies linearly with both sparseness and the number of eigenvalues to be estimated. The same statistics computed from the original kernel matrix L, from which all eigenvalues must be extracted, yield a mean of 0.27 seconds with a standard deviation of 0.06. Extracting only a few eigenvalues with the Lanczos algorithm therefore reduces the computation time significantly. The elapsed times were measured while running a Julia v1.1.1 script [4] on a PC with an Intel Core i5-4460 CPU running at 3.20GHz with 16GB of RAM.

Number of     Sparseness level
eigenvalues   20%          40%          60%          80%
10            0.10 (0.00)  0.07 (0.00)  0.06 (0.01)  0.04 (0.01)
25            0.12 (0.01)  0.09 (0.00)  0.08 (0.01)  0.05 (0.01)
50            0.14 (0.01)  0.12 (0.01)  0.11 (0.01)  0.07 (0.01)
Table 3: Means and standard deviations (within parentheses) of elapsed times in seconds for eigenvalue calculation.

To obtain the optimal clustering configuration for each dataset, we used the determinantal consensus clustering described in Section 2, meeting the minimal cluster size criterion of √n and using the kernel-based validation index of [17]. The quality of the chosen optimal clustering configuration was assessed with the adjusted Rand index, or ARI [49, 30]. Among the many measures of goodness-of-fit found in the literature, the ARI is one of the most common criteria.
The original Rand index counts the proportion of pairs of elements that are either in the same cluster in both clustering configurations or in different clusters in both configurations. The adjusted version of the Rand index corrects this proportion so that its expected value is zero when the clustering configurations are random. The larger the ARI, the more similar the two configurations are, with the maximum score of 1.0 indicating a perfect match. Table 4 displays the ARI means and standard deviations over all nine scenarios and replicas for all twelve combinations of sparseness and number of eigenvalues extracted.

Sparseness   t = 10       t = 25       t = 50
level
20%          0.94 (0.06)  0.95 (0.06)  0.95 (0.06)
40%          0.93 (0.05)  0.95 (0.05)  0.95 (0.06)
60%          0.94 (0.05)  0.95 (0.06)  0.95 (0.05)
80%          0.93 (0.06)  0.94 (0.05)  0.95 (0.05)
Table 4: Global ARI means and standard deviations (within parentheses) yielded by consensus DPP with the sparse kernel matrices for the twelve combinations of sparseness and number of eigenvalues extracted. Each mean and standard deviation was computed from ninety datasets.

For comparison purposes, we computed the ARI statistics obtained by applying consensus DPP with the original dense kernel matrix L, from which we extracted all the eigenvalues and eigenvectors. This yielded an ARI mean of 0.91 and a standard deviation of 0.08. We did the same with the consensus clustering methodology applied to partitions generated with the well-known Partitioning Around Medoids (PAM) algorithm [33]. PAM is a classical partitioning technique for clustering that chooses the center points of the Voronoi cells by uniform random sampling. Because DPP selects center points based on diversity, our goal here is to show how much the quality of the clustering configurations is affected by the lack of diversity when sampling centroids.
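The ARI can be sketched directly from the standard pair-counting formula; this is a self-contained illustration, not the implementation used in the paper:

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: pair-counting agreement between two clusterings,
    corrected so that its expectation under random labelings is zero."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # Contingency table between the two label assignments.
    table = np.array([[np.sum((a == i) & (b == j))
                       for j in np.unique(b)] for i in np.unique(a)])
    comb2 = lambda x: x * (x - 1) // 2      # number of pairs among x items
    sum_ij = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(a.size)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_ij - expected) / (max_index - expected)

# A perfect match (up to label permutation) scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Note that the index can be negative when two configurations agree less than expected by chance.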
PAM yielded a much lower ARI mean of 0.86 and a standard deviation of 0.14. The quality of the clustering results associated with the NNGP approach is better, in terms of ARI, than that obtained using the whole dense kernel matrix L. This conclusion holds regardless of the sparseness level and number of eigenvalues extracted. The results are also more stable, considering the standard deviations. Taking elapsed time into account, t = 25 eigenvalues is the best choice among the three levels of t studied here. Combining it with a higher sparseness level (to speed up the computation of the matrix L̃⁻¹) appears to be a winning combination. The results yielded by PAM are not as good. In conclusion, the NNGP approach provides a very good and efficient alternative to the use of the complete dense matrix L.

Results with the approach based on small submatrices from L. Following the notation presented in Section 3.2, we studied the effect of three parameters on the clustering results: the proportion γ of points chosen (or, equivalently, the size r of any submatrix L^(i_k), k = 1, ..., N); the sparseness level of any submatrix L^(i_k) (along with the number k of nearest neighbors needed to achieve that level); and the number t ∈ {10, 25, 50} of largest eigenvalues to be extracted. Table 5 shows the choices for γ and sparseness levels with the corresponding values of r and k. Note that each value of k ensures the same sparseness for all nine data scenarios.

             Number of nearest neighbors k per sparseness level
γ      r     20%   40%   60%   80%
0.05    50    34    25    17     8
0.1    100    68    50    34    16
0.2    200   136   100    68    32
Table 5: Number of nearest neighbors k associated with the choices for the proportion of data γ and the sparseness levels 20%, 40%, 60% and 80%.

As we did with the NNGP approach, we set the number of partitions used to obtain a consensus clustering to 200. The whole procedure was repeated ten times.
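The two ingredients of this approach, drawing a random principal submatrix of L and sparsifying it through k nearest neighbors, can be sketched as follows. This is a simplified illustration of the idea; the helper names and the symmetrization detail are ours, not from Section 3.2:

```python
import numpy as np

def sample_submatrix(L, gamma, rng):
    """Draw a principal submatrix of L on a uniform random subset
    containing a proportion gamma of the points (r = gamma * n)."""
    n = L.shape[0]
    r = max(1, int(round(gamma * n)))
    idx = rng.choice(n, size=r, replace=False)
    return L[np.ix_(idx, idx)], idx

def knn_sparsify(S, k):
    """k-nearest-neighbor sparsification of a symmetric similarity matrix:
    keep each row's k largest off-diagonal entries, symmetrize the mask,
    keep the diagonal; everything else is set to zero."""
    mask = np.zeros(S.shape, dtype=bool)
    for i in range(S.shape[0]):
        order = np.argsort(S[i])[::-1]
        keep = [j for j in order if j != i][:k]
        mask[i, keep] = True
    mask = mask | mask.T
    np.fill_diagonal(mask, True)
    return np.where(mask, S, 0.0)

# Toy usage with a Gaussian kernel on random points (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = np.exp(-0.5 * d2)
L_sub, idx = sample_submatrix(L, gamma=0.1, rng=rng)
L_sparse = knn_sparsify(L_sub, k=4)
```

With γ = 0.1 and n = 200 the submatrix has r = 20 rows; sparsifying with k = 4 then leaves at most 2k off-diagonal nonzeros per row after symmetrization.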
The analysis of the results mimics the one made above for the NNGP approach. Figure 2 shows density estimates of the eigenvalue distribution for three combinations of (γ, sparseness, number of eigenvalues): sparseness 20% with t = 10, sparseness 40% with t = 25, and sparseness 60% with t = 50. The plots correspond to a given dataset; they depict typical patterns observed in all datasets. The tick-marks on the horizontal axis locate all eigenvalues of the original kernel matrix L. We can see that the density estimates concentrate well around the smaller eigenvalues of L, but fail to capture the largest eigenvalues.

Figure 2: Density estimates of three sets of eigenvalues extracted with the random small matrices approach, associated with three scenarios of Table 5 on a given dataset: (a) sparseness = 20%, t = 10; (b) sparseness = 40%, t = 25; (c) sparseness = 60%, t = 50. The tick-marks are placed on the eigenvalues of L.

Table 6 shows the symmetrized Kullback-Leibler divergences between the distribution of the eigenvalues extracted from the random small matrices and the distribution of the eigenvalues of L, for the (γ, sparseness) combinations displayed in Table 5. The eigenvalue distributions were estimated with kernel density estimators. We can see that, globally, increasing the proportion γ of sampled points reduces the KL divergence when combined with moderate to low levels of sparseness and a higher number of eigenvalues. Overall, the KL divergences are larger than those obtained with the NNGP approach. But, again, we stress that the objective of these approaches is not to estimate the eigenvalues of L, but to offer efficient sampling alternatives to DPP_S(L).
γ      Number of     Sparseness level
       eigenvalues   20%          40%          60%          80%
0.05   10            2.09 (1.28)  1.28 (0.37)  1.00 (0.08)  1.01 (0.03)
       25            0.02 (0.01)  0.58 (1.25)  1.64 (0.46)  1.01 (0.04)
       50            0.02 (0.01)  0.03 (0.23)  1.71 (0.88)  1.10 (0.17)
0.1    10            1.95 (1.39)  1.34 (0.43)  1.02 (0.10)  1.00 (0.03)
       25            0.01 (0.00)  0.12 (0.58)  1.87 (0.89)  1.04 (0.10)
       50            0.02 (0.00)  0.01 (0.00)  0.15 (0.68)  1.32 (0.26)
0.2    10            2.07 (1.43)  1.36 (0.44)  1.03 (0.12)  1.00 (0.03)
       25            0.01 (0.00)  0.16 (0.69)  1.76 (1.08)  1.12 (0.23)
       50            0.01 (0.00)  0.01 (0.00)  0.01 (0.00)  1.64 (0.51)
Table 6: Means and standard deviations (within parentheses) of Kullback-Leibler divergences associated with the random small matrices approach.

Recalling the notation of Section 3.2, Table 7 shows the means and standard deviations of the elapsed times in seconds for eigenvalue computation, using the sampled submatrix L^(i_k) or its sparse approximation L̂^(i_k). The results were obtained while running a Julia script on a PC with an Intel Core i5-4460 CPU running at 3.20GHz with 16GB of RAM. Note that, contrary to NNGP, the level of sparseness does not affect the elapsed times for computing the eigenvalues with the Lanczos method. This is probably due to the small size of the matrices L̂^(i_k).

γ    Submatrix (L^(i_k), L̂^(i_k))    Number of eigenvalues: 10, 25, 50
Table 7: Means and standard deviations of elapsed times in seconds using the sampled submatrices L^(i_k) and their sparse approximations L̂^(i_k).

Table 8 displays the ARI means and standard deviations over all (γ, sparseness) combinations in Table 5. For each combination, these statistics were computed from a sample of ninety ARI scores given by the ten replica datasets from the nine data scenarios. Table 9 displays the same statistics obtained by keeping the dense version of the submatrices L^(i_k) and extracting all their eigenvalues, instead of using the sparse approximations. This comparison is appropriate to study the effect of making these small matrices sparse.
These results should be compared to those of (i) consensus DPP applied to the original dense kernel matrix L, from which all eigenvalues are extracted, and (ii) the consensus clustering methodology applied to partitions generated by PAM. These latter results were already mentioned above when presenting the results of the NNGP approach: they are, respectively, (i) an ARI mean and standard deviation of 0.91 and 0.08, and (ii) an ARI mean and standard deviation of 0.86 and 0.14.

γ    Sparseness level    t = 10    t = 25    t = 50
Table 8: ARI means and standard deviations over all (γ, sparseness) combinations in Table 5, and over the corresponding ninety datasets for each combination.

γ = 0.05    γ = 0.1    γ = 0.2
Table 9: ARI means and standard deviations of consensus DPP on the dense versions L^(i_k) of the random small submatrices.

The results confirm the benefit of using the dense L^(i_k) or sparse L̂^(i_k) random small submatrices rather than the whole dense kernel matrix L. The sparse approximations L̂^(i_k) require more computation, since the nearest neighbors must be found. This extra cost does not seem worthwhile when compared with the results associated with the dense submatrices L^(i_k). However, there is a slight improvement in the results at one of the γ levels. In that case, combining any sparseness level with a higher number of eigenvalues is generally a good choice, since computational times for eigenvalue extraction do not differ significantly. For the other levels of γ, the gain is not worth the complication of making the submatrices sparse.

Again, as with the NNGP approach, the random small submatrices approach outperforms PAM and shows less variability in the quality of the results. Furthermore, it achieves quality comparable to that of the NNGP approach. From a computational point of view, the NNGP approach is more expensive because it requires dealing with larger matrices. Therefore, the random small submatrices are a good and very efficient alternative both to the use of the complete dense matrix L and to the NNGP approach.

n = 10000 observations

Data generation.
For the second experiment, due to hardware limitations, we simulated only two datasets with n = 10000 observations. As in the previous case, these were generated with the algorithm of [44]. The first one has p = 15 variables and K = 10 components, while the second one has p = 10 variables and K = 5 components. Throughout this section, these data are referred to as dataset I and dataset II, respectively. Both datasets have a maximum pairwise overlap of 0.01. We ensured that no cluster of size less than √n is present in the simulated datasets, as it would inevitably be merged with a larger cluster, according to the procedure described in Section 2.1. We applied the two approaches presented in Section 3 to the simulated datasets.

Results with the approach based on NNGP. We studied the same four levels of sparseness for the matrix L̃⁻¹ (20%, 40%, 60%, 80%), setting the appropriate values for the maximum number m of nonzero elements in each row of the matrix A (Section 3.1). The largest t ∈ {100, 250, 500} eigenvalues and corresponding eigenvectors of L̃⁻¹ were extracted with the Lanczos algorithm. The determinantal consensus clustering of Section 2 was then applied so as to obtain 200 partitions of the data. The whole procedure was repeated five times. This time, the experiment consists of two plots of five repeated measures each, from a 4 × 3 factorial design given by the combination of four levels of sparseness and three levels for the number of extracted eigenvalues. All observations in each plot are dependent, since the same datasets were used for all scenarios. The analysis of the results follows the same steps as the NNGP approach applied to the smaller datasets.

Figure 3 shows density estimates of the eigenvalue distribution of L̃ for dataset I, for two combinations of sparseness and number of eigenvalues: (20%, t = 100) and (60%, t = 500), respectively.
We can see that the density estimates concentrate correctly around the true eigenvalues of the kernel matrix L.

Figure 3: Kernel density estimates of the eigenvalue distribution associated with L̃ for dataset I, for two sparseness-eigenvalue conditions among the twelve experimental conditions: (a) sparseness = 20%, t = 100; (b) sparseness = 60%, t = 500. The tick-marks indicate the eigenvalues of L.

For comparison purposes between the two sparse approximations L̃ and Ľ, we computed the Frobenius distance between these matrices and the kernel matrix L. Table 10 displays the results for both datasets. As in the experiment with the smaller datasets, the distance from L to L̃ is always much smaller than the distance from L to Ľ. For both datasets, the distances always increase with sparseness. The approximation based on NNGP is thus a better choice for extracting eigenvalues. This also explains the correct concentration of the density estimates of the eigenvalue distribution of L̃ around the eigenvalues of L, as seen in Figure 3.

Approximation   Dataset   Sparseness level
matrix                    20%       40%       60%       80%
L̃               I         2.75      8.82      15.65     96.71
                II        0.33      4.47      14.07     191.45
Ľ               I         2122.11   3269.61   4265.59   5224.48
                II        1955.09   3079.10   4128.15   5228.46
Table 10: Frobenius distances for both datasets (I and II).

Table 11 reports, for both datasets, the symmetrized Kullback-Leibler divergence between the density estimates of the two eigenvalue distributions, from L̃ and L. The densities were estimated with kernel density estimators. Again, as observed with the smaller datasets, the very small divergence values indicate a good resemblance between the eigenvalue distributions.
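The Frobenius-distance comparison behind Table 10 is straightforward to sketch. Given the dense kernel and two candidate approximations (toy stand-ins below; zeroing small entries is a crude sparsifier used only for illustration), the approximation with the smaller ‖L − ·‖_F is preferred:

```python
import numpy as np

def frobenius_distance(L, L_approx):
    """Frobenius distance ||L - L_approx||_F."""
    return np.linalg.norm(L - L_approx, ord="fro")

# Toy stand-ins built from a Gaussian kernel on random points.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = np.exp(-0.5 * d2)
L_good = np.where(L > 0.05, L, 0.0)   # mild truncation
L_bad = np.where(L > 0.50, L, 0.0)    # aggressive truncation
assert frobenius_distance(L, L_good) < frobenius_distance(L, L_bad)
```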
Number of     Dataset   Sparseness level
eigenvalues             20%         40%         60%         80%
100           I         0.0000338   0.0000337   0.0000337   0.0000336
              II        0.0000329   0.0000132   0.0000015   0.0000329
250           I         0.0000365   0.0000359   0.0000359   0.0000358
              II        0.0000133   0.0000015   0.0000329   0.0000139
500           I         0.0000090   0.0000090   0.0000090   0.0000090
              II        0.0000016   0.0000322   0.0000166   0.0000017
Table 11: Kullback-Leibler divergences for both datasets (I and II).

Table 12 compares the elapsed times in seconds for calculating each set of eigenvalues for dataset I. Extracting all eigenvalues from the original matrix L takes 170.08 seconds and 156.74 seconds for datasets I and II, respectively. The results were obtained on Julia v1.1.1 running on a PC with an Intel Core i5-4460 CPU running at 3.20GHz with 16GB of RAM. The elapsed times for dataset I are well explained by a linear regression on sparseness and number of eigenvalues extracted, with a high coefficient of determination. Similar results were obtained for dataset II.

Number of     Dataset   Sparseness level
eigenvalues             20%      40%      60%      80%
100           I         37.34    34.46    31.79    27.39
250           I         68.03    65.86    60.62    55.37
500           I         125.48   120.27   115.55   112.85
Table 12: Elapsed times in seconds for eigenvalue calculation of dataset I.

Sparseness   t = 100                    t = 250                    t = 500
level        I            II            I            II            I            II
20%          0.97 (0.04)  0.79 (0.04)   0.93 (0.02)  0.87 (0.05)   0.70 (0.04)  0.88 (0.07)
40%          0.98 (0.00)  0.87 (0.03)   0.94 (0.02)  0.87 (0.08)   0.77 (0.03)  0.89 (0.04)
60%          0.97 (0.04)  0.86 (0.09)   0.94 (0.01)  0.87 (0.07)   0.83 (0.02)  0.82 (0.07)
80%          0.98 (0.02)  0.86 (0.09)   0.97 (0.02)  0.86 (0.04)   0.71 (0.10)  0.87 (0.03)
Table 13: ARI means and standard deviations (within parentheses) obtained by consensus DPP on the sparse kernel matrices over the twelve experimental conditions and both datasets (I and II).

Table 13 displays the ARI means and standard deviations over all twelve experimental conditions for both datasets (I and II).
For comparison purposes, we computed the same statistics for (i) consensus DPP applied to the original dense kernel matrix L, from which all eigenvalues were extracted, and (ii) the consensus clustering methodology applied to partitions generated by PAM. For dataset I, the corresponding ARI means and standard deviations are 0.57 and 0.02 for consensus DPP, and 0.93 and 0.01 for PAM. For dataset II, we obtain ARI means and standard deviations of 0.82 and 0.06 for consensus DPP, and of 0.83 and 0.03 for PAM.

Observe that the quality of the clustering results is better if we consider the sparse approximation L̃⁻¹ and a lower number of eigenvalues, t ∈ {100, 250}. Considering a higher number of eigenvalues is not such a good option: the quality of the results decreases with the number of eigenvalues extracted at each sparseness level. The use of the original dense matrix L with all eigenvalues is not a good alternative either. The most probable reason for these observations is the large size of the kernel matrix. Most eigenvalues cannot be computed reliably, numerically speaking; hence, using all numerically extracted eigenvalues may introduce noise, leading the algorithm to perform poorly. Also, if most of the eigenvalues are small, they will be estimated with a lot of error, especially if the largest and smallest eigenvalues differ by orders of magnitude. This is known as ill-conditioning of the kernel matrix. We note that for all three cases (NNGP, the original dense matrix, and PAM) we used the √n criterion to merge small clusters during consensus (see Section 2.1). However, as pointed out in [56], a larger criterion, closer to a fixed fraction of n, might have been more appropriate given the small number of clusters K and the large number of observations n.
In fact, using this larger small-cluster size in the merging step of the consensus gives slightly better results for all methods, especially for dataset II. To summarize, determinantal consensus clustering with the NNGP approach outperforms PAM in all cases. In addition, if we wish to favor high-quality clustering results at low computational cost, combining t = 100 eigenvalues with a high level of sparseness appears to be the best option.

Results with the approach based on small submatrices from L. For this approach, due to hardware limitations, we reduced the number of evaluated scenarios for both datasets. Maintaining the proportions γ ∈ {0.05, 0.1, 0.2} of points selected from the dense matrix L, the sampled submatrices L^(i_k), k = 1, ..., N, are of size r ∈ {500, 1000, 2000}, respectively. These sizes are considerably larger than those used with the smaller datasets. Hence, sparse approximations of L^(i_k) are used with a unique high level of sparseness, achieved with 80, 160 and 320 nearest neighbors, respectively. Recall that, with the NNGP approach applied to the smaller datasets, a low mean elapsed time for eigenvalue calculation was achieved when combining a low number of eigenvalues with a high level of sparseness (see Table 3). Therefore, to speed up the computation, we extract only the largest 100 eigenvalues in all cases. The following analysis of the results is organized as for the simulations above.

Figure 4 shows the kernel density estimates for the aforementioned situations. The tick-marks on the horizontal axis locate all eigenvalues of the original kernel matrix L. As seen in the results with the smaller datasets, the density estimates of this approach do not capture well the largest eigenvalues when the sparseness is too high (for a low value of γ).
Figure 4: Kernel density estimates of the sets of eigenvalues extracted from the sparse submatrices, for (a) γ = 0.05, (b) γ = 0.1 and (c) γ = 0.2. The tick-marks are placed on all eigenvalues of L.

Table 14 reports the symmetrized Kullback-Leibler divergences between the distribution of the eigenvalues extracted from the sparse submatrices and the distribution of the eigenvalues of L, for γ ∈ {0.05, 0.1, 0.2}. The densities were estimated with kernel density estimators. The KL divergences decrease with γ and are very small, indicating a good resemblance between the two distributions.

Dataset   γ = 0.05   γ = 0.1    γ = 0.2
I         0.000875   0.000301   0.000095
II        0.000579   0.000173   0.000061
Table 14: Kullback-Leibler divergences for both datasets (I and II).

Table 15 compares the elapsed times in seconds for eigenvalue computation, using the sampled submatrix L^(i_k) or its sparse approximation L̂^(i_k). The results were obtained with Julia v1.1.1 on a desktop PC with an Intel Core i5-4460 CPU running at 3.20GHz with 16GB of RAM. In this case, due to the relatively large size of the matrices L^(i_k), it does make a difference to use the sparse matrices L̂^(i_k) instead of the dense ones.

Dataset   Submatrix (L^(i_k), L̂^(i_k))   γ = 0.05   γ = 0.1   γ = 0.2
Table 15: Means and standard deviations of elapsed times in seconds for eigenvalue computation with the dense and sparse submatrices.

Table 16 displays the ARI means and standard deviations for γ ∈ {0.05, 0.1, 0.2} and both datasets (I and II). It also reports the results obtained with consensus DPP applied to the original dense kernel matrix L, from which all eigenvalues were extracted, and with PAM. These latter results were already reported in Section 4.2.

Dataset   γ = 0.05     γ = 0.1      γ = 0.2      Original L   PAM
I         0.95 (0.00)  0.96 (0.01)  0.96 (0.01)  0.57 (0.02)  0.93 (0.01)
II        0.88 (0.02)  0.87 (0.04)  0.87 (0.03)  0.82 (0.06)  0.83 (0.03)
Table 16: ARI means and standard deviations (within parentheses) obtained from consensus DPP with the approach based on small submatrices from L for datasets I and II.
The results for the original dense matrix and PAM are also displayed. As we did with the smaller datasets, Table 17 shows the same statistics obtained by keeping the dense versions of the submatrices L^(i_k), and extracting all their eigenvalues, instead of using the sparse approximations.

Dataset   γ = 0.05     γ = 0.1      γ = 0.2
I         0.94 (0.05)  0.95 (0.05)  0.95 (0.06)
II        0.88 (0.04)  0.81 (0.06)  0.83 (0.05)
Table 17: Global ARI means and standard deviations (within parentheses) of consensus DPP on the dense versions of the random small submatrices for datasets I and II.

The random small matrices approach yields excellent results for any proportion γ. Considering the results with the dense versions of the submatrices, we can see that the results obtained with the sparse versions are similar or better. As already observed with the smaller samples, the use of a sparse approximation L̂^(i_k) does not hurt the quality of the results. The main advantage of the sparse approximations is the reduced amount of time needed to compute the eigenvalues; another is the stability of the clustering quality (low dispersion of ARI values). In summary, for low computational time and low dispersion, using the sparse approximation of submatrices with a low proportion of sampled points is a good option.

To end this section, we observe that the approach based on random sampling of submatrices outperforms PAM for all γ values in both datasets. It also reaches clustering quality comparable to the approach based on NNGP with t = 100 and a high sparseness level. Sampling submatrices of lower dimension is thus a good alternative to the NNGP approach.

The differences between DPP and PAM can also be highlighted using the logarithm of the probability mass function of the DPP, given by (1). Figure 5 displays histograms of these probability logarithms.
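For an L-ensemble DPP, the probability mass in (1) is P(S) = det(L_S) / det(L + I), so its logarithm can be sketched with NumPy's numerically stable `slogdet`. The helper name below is ours:

```python
import numpy as np

def dpp_log_pmf(L, subset):
    """log P(S) = logdet(L_S) - logdet(L + I) for an L-ensemble DPP,
    with L_S the principal submatrix of L indexed by the subset."""
    n = L.shape[0]
    _, logdet_sub = np.linalg.slogdet(L[np.ix_(subset, subset)])
    _, logdet_norm = np.linalg.slogdet(L + np.eye(n))
    return logdet_sub - logdet_norm

# Diversity in action: a pair of highly similar items is less probable
# than a pair of dissimilar ones (toy 2x2 kernels).
L_similar = np.array([[1.0, 0.9], [0.9, 1.0]])
L_diverse = np.array([[1.0, 0.1], [0.1, 1.0]])
assert dpp_log_pmf(L_similar, [0, 1]) < dpp_log_pmf(L_diverse, [0, 1])
```

The small determinant of a near-collinear submatrix is exactly the mechanism that pushes DPP samples toward diverse centroids.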
The histograms are based on 1000 random subsets of datapoints drawn (i) using the DPP sampling algorithm of [3] and [34], and (ii) using the simple random sampling of PAM. The subsets were drawn from two simulated large datasets: one of size n = 1000, and another of size n = 10000 observations.

Figure 5: Histograms of the logarithm of the probability mass function (loglik), using DPP and simple random sampling, for two simulated large datasets: (a) n = 1000 observations; (b) n = 10000 observations.

The histograms clearly show that DPP selects random subsets with higher and less dispersed probability mass values (likelihood) than simple random sampling. This explains the low dispersion of the ARI observed in most results of this section when sampling is performed with DPP. The higher likelihood achieved by DPP also implies a higher diversity of the sampled subsets. In turn, subsets sampled as in PAM yield a highly dispersed likelihood, resulting in subsets of highly variable diversity. DPP tends to be more consistent and stable, since it ensures a high level of diversity among the selected elements at each sampling.

In this section we evaluate the performance of consensus DPP versus PAM on three large real datasets:

1.- A dataset about human activity recognition and postural transitions using smartphones, collected from 30 subjects who performed six basic postures (downstairs, upstairs, walking, laying, sitting and standing) and six transitional postures between static postures (stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie and lie-to-stand). The experiment was carried out in the same environment and conditions, while carrying a waist-mounted smartphone with embedded inertial sensors. The dataset consists of 10929 observations, with 561 time- and frequency-extracted features, which are commonly used in the field of human activity recognition.
The dataset is available on the UCI Machine Learning Repository [16], a well-known database in the machine learning community for clustering and classification problems. The six transitional postures between static postures comprise a relatively small subset of observations; therefore, we apply our clustering algorithm only to the six basic postures. Using the notation of the simulated datasets of Section 4, the final dataset has n = 10411 observations, p = 561 variables and K = 6 components.

2.- The Modified National Institute of Standards and Technology (MNIST) dataset [38], one of the most common datasets used for image classification. It contains 60000 training images and 10000 testing images of handwritten digits, obtained from American Census Bureau employees and American high school students. Each 784-dimensional observation represents a 28 × 28 pixel gray-scale image depicting a handwritten version of one of the ten possible digits (0 to 9). As is common practice with the MNIST dataset, due to its intrinsic characteristics, we transformed the data to its multiplicative inverse, as hinted by the Box-Cox transformation. We only use the testing set of 10000 images, so that the final dataset has n = 10000 observations, p = 784 variables and K = 10 components.

3.- The Fashion-MNIST dataset [58], also one of the most common datasets used for image classification. It contains 60000 training images and 10000 testing images of Zalando's articles. Each observation represents a 28 × 28 pixel gray-scale image of clothing, associated with a label from 10 classes. For the same reasons as with MNIST, we transformed the data with an appropriate Box-Cox transformation and only worked with the testing portion of the data. The dataset has n = 10000 observations, p = 784 variables and K = 10 components.

We applied both approaches of Section 3 to each dataset and followed the same analysis procedure of Section 4.
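The Box-Cox step can be sketched with SciPy. This is an illustration on synthetic positive data, not the MNIST pixels themselves (which would need a positive shift first); an estimated exponent λ near −1 corresponds to the multiplicative-inverse transform 1/x mentioned above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic strictly positive feature; Box-Cox requires positive inputs.
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# scipy estimates the Box-Cox exponent lambda by maximum likelihood and
# returns the transformed sample along with the estimate.
x_bc, lam = stats.boxcox(x)

# For lognormal data the MLE is close to 0 (the log transform);
# lambda = -1 would instead indicate the inverse transform (1/x - 1)/(-1).
```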
For the NNGP approach, we set the maximum number m of nonzero elements in each row of the matrix A so as to obtain a sparse approximation of the kernel matrix L with a high level of sparseness. The Lanczos algorithm was applied to extract the t = 100 largest eigenvalues from the sparse approximated matrix. For the approach based on random small submatrices from L, we sampled a small proportion γ of points from the kernel matrix L and obtained sparse approximations of the sampled submatrices, again with a high level of sparseness. The Lanczos algorithm was applied to extract the t = 100 largest eigenvalues from the sparse approximated submatrices. Determinantal consensus clustering was applied to 200 partitions. The whole procedure was repeated five times.

For comparison purposes, we also include the results from two popular algorithms: consensus clustering with PAM, and with k-means. The PAM algorithm was already described in Section 4. The k-means algorithm was introduced by Stuart Lloyd in 1957, with a supporting publication in [40]. Given an initial set of k means, representing k clusters, it assigns each observation to the cluster with the nearest mean, and then recomputes each mean from the observations assigned to the same cluster. The procedure is repeated until the assignments no longer change. Among the many algorithms for initializing the k centers of k-means [19, 48, 24, 32], we chose the k-means++ algorithm of [1]. This method has become very popular [11, 20]: it is a probability-based technique that avoids the usually poor results found by standard k-means initialization methods. We performed consensus clustering with PAM and k-means using 200 partitions, also repeating the whole procedure five times. We report the mean and standard deviation of the ARI achieved by the different methods in Table 18.

Zalando is a European e-commerce company specializing in fashion; it provides image data in repositories such as GitHub.
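The k-means++ seeding of [1] can be sketched compactly as follows (the helper name is ours): the first center is uniform at random, and each subsequent center is drawn with probability proportional to D(x)², the squared distance from x to its closest already-chosen center.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first center uniform at random, then each next
    center drawn with probability proportional to the squared distance
    from every point to its nearest already-chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        diff = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=-1).min(axis=1)   # D(x)^2 for every point
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centers)

# Toy usage: three well-separated blobs (synthetic data).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(c, 0.1, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
centers = kmeans_pp_init(X, k=3, rng=rng)
```

Like DPP sampling, this seeding biases the initial centers toward spread-out points, but it does so greedily rather than through a joint determinantal probability over the whole subset.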
Table 18: Mean (and standard deviation, over five repetitions) of the ARI achieved by each method.

Dataset        NNGP          Small submatrices   PAM           k-means
Smartphones    0.59 (0.01)   0.58 (0.01)         0.43 (0.12)   0.49 (0.02)

Both DPP approaches, based on NNGP and on small submatrices, attain good results. The NNGP approach outperformed the other three methods. Overall, the gains are substantial when compared with PAM and k-means. The method based on random small submatrices also outperformed the other methods, except on the MNIST dataset.
Within the context of determinantal consensus clustering, we proposed two different approaches to overcome the computational burden induced by the eigendecomposition of the kernel matrix associated with large datasets. The first one is based on NNGP, and the second on random sampling of small submatrices from the kernel matrix. The NNGP approach finds a sparse approximation of the kernel matrix based on nearest neighbors. The sparse matrix substitutes the original dense kernel matrix in the extraction of the largest eigenvalues, so that determinantal consensus clustering can be performed on the large dataset. Simulations showed that using a high sparseness level and extracting only a few of the largest eigenvalues are enough to ensure good clustering results. Extracting 1% to 3% of the eigenvalues seems to be a reasonable strategy. The amount of time needed to reach the final clustering configuration is reduced considerably.
The approach based on random sampling of small submatrices from the kernel matrix is a good alternative to the NNGP approach. Its performance is comparable to that of the NNGP approach, even though the distribution of eigenvalues extracted with this method is not as close to the original eigenvalue distribution of the kernel matrix. Rather than keeping the original kernel matrix, this approach shows that extracting eigenvalues from several rather small submatrices of the kernel matrix is enough to obtain good results with consensus DPP.
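The quality measure used throughout, the adjusted Rand index (ARI) of Table 18, is available in scikit-learn as `adjusted_rand_score`; the labelings below are toy inputs, not the paper's data.

```python
from sklearn.metrics import adjusted_rand_score

# A ground-truth partition and a clustering that permutes the label
# names and misassigns one point. The ARI is invariant to label
# permutation: only the induced partitions matter.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [2, 2, 2, 0, 0, 1, 1, 1, 1]

ari = adjusted_rand_score(truth, found)
```

A perfect recovery of the partition gives an ARI of 1, a random labeling gives a value near 0, and the single misassignment above yields a value of about 0.64.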
Our simulations hint at sampling a number of small matrices that is a small fraction of n. If the submatrices are still too large, finding their sparse approximation with the k-nearest-neighbors graph method is a good option, even with a small k. In fact, for very large datasets, our simulations showed that it is strongly recommended to find a sparse approximation of the small submatrices. In order to speed up computation, choosing a high sparseness level is the best option.
The two approaches are able to reach better quality results than consensus clustering applied to PAM. Applications on large real datasets confirm the results found with the simulations. Determinantal consensus clustering also proved to be superior to k-means for most of the datasets. The presence of diversity in the sampled centroids helps to improve the quality of the clustering and the stability of the results.
An issue we found with the NNGP approach, on some real datasets, is the possibility of ill-conditioning of some of the intermediate submatrices W[1:(i-1)] (see equation (2)). This might occur because of strong similarity between some observations. As in regression, the original kernel matrix should be checked for numerical conditioning before attempting to use the NNGP approach. A possible solution is to eliminate observations that are too similar, so as to only consider representative observations instead of all of them. Another possibility is to regularize the submatrices by adding a small positive constant to the main diagonal, as is done in ridge regression. Alternatively, the approach based on random small submatrices might be used instead, if the problem arises.

References

[1] David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding.
In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, USA, 2007. Society for Industrial and Applied Mathematics.
[2] Franz Aurenhammer. Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys, 23(3):345–405, September 1991.
[3] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.
[4] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
[5] Marcelo Blatt, Shai Wiseman, and Eytan Domany. Superparamagnetic clustering of data. Physical Review Letters, 76:3251–3254, April 1996.
[6] Marcelo Blatt, Shai Wiseman, and Eytan Domany. Data clustering using a model granular magnet. Neural Computation, 9(8):1805–1842, November 1997.
[7] Stephen Bonner, Ibad Kureshi, John Brennan, and Georgios Theodoropoulos. Exploring the evolution of big data technologies. In Software Architecture for Big Data and the Cloud, pages 253–283. Elsevier, 2017.
[8] A. Borodin and G. Olshanski. Distributions on partitions, point processes, and the hypergeometric kernel. Communications in Mathematical Physics, 211:335–358, 2000.
[9] Alexei Borodin and Alexander Soshnikov. Janossy densities. I. Determinantal ensembles. Journal of Statistical Physics, 113:595–610, 2003.
[10] Daniela Calvetti, L. Reichel, and D. C. Sorensen. An implicitly restarted Lanczos method for large symmetric eigenvalue problems. Electronic Transactions on Numerical Analysis, 2:1–21, 1994.
[11] Marco Capó, Aritz Pérez, and Jose A. Lozano. An efficient approximation to the k-means clustering for massive data. Knowledge-Based Systems, 117:56–69, 2017.
[12] Luther T. Clark, Laurence Watkins, Ileana L. Piña, Mary Elmer, Ola Akinboboye, Millicent Gorham, Brenda Jamerson, Cassandra McCullough, Christine Pierre, Adam B. Polis, et al.
Increasing diversity in clinical trials: overcoming critical barriers. Current Problems in Cardiology, 44(5):148–172, 2019.
[13] Jane Cullum. The simultaneous computation of a few of the algebraically largest and smallest eigenvalues of a large, sparse, symmetric matrix. BIT Numerical Mathematics, 18(3):265–275, September 1978.
[14] D. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Vol. I: Elementary Theory and Methods, 2nd edition. Springer-Verlag, 2003.
[15] Abhirup Datta, Sudipto Banerjee, Andrew O. Finley, and Alan E. Gelfand. On nearest-neighbor Gaussian process models for massive spatial data. Wiley Interdisciplinary Reviews: Computational Statistics, 8(5):162–171, 2016.
[16] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[17] Zizhu Fan, Xiangang Jiang, Baogen Xu, and Zhaofeng Jiang. An automatic index validity for clustering. In Ying Tan, Yuhui Shi, and Kay Chen Tan, editors, Advances in Swarm Intelligence, pages 359–366, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[18] Andrew O. Finley, Abhirup Datta, Bruce D. Cook, Douglas C. Morton, Hans E. Andersen, and Sudipto Banerjee. Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28(2):401–414, 2019.
[19] Edward W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769, 1965.
[20] Pasi Fränti and Sami Sieranoja. How much can k-means be improved by using better initialization and repeats? Pattern Recognition, 93:95–112, 2019.
[21] Mike Gartrell, Elvis Dohmatob, and Jon Alberdi. Deep determinantal point processes. arXiv preprint arXiv:1811.07245, 2018.
[22] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Discovering diverse and salient threads in document collections.
In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 710–720, 2012.
[23] M. Girolami. Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3):780–784, May 2002.
[24] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
[25] R. Grimes, J. Lewis, and H. Simon. A shifted block Lanczos algorithm for solving sparse symmetric generalized eigenproblems. SIAM Journal on Matrix Analysis and Applications, 15(1):228–272, 1994.
[26] R. Hafiz Affandi, E. B. Fox, R. P. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. arXiv e-prints, February 2014.
[27] R. Hafiz Affandi, E. B. Fox, and B. Taskar. Approximate inference in continuous determinantal point processes. arXiv e-prints, November 2013.
[28] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, USA, 2nd edition, 2012.
[29] Tom Howley and Michael G. Madden. An evolutionary approach to automatic kernel construction. In Stefanos Kollias, Andreas Stafylopatis, Włodzisław Duch, and Erkki Oja, editors, Artificial Neural Networks – ICANN 2006, pages 417–426, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
[30] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
[31] Byungkon Kang. Fast determinantal point process sampling with application to clustering. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2319–2327. Curran Associates, Inc., 2013.
[32] Ioannis Katsavounidis, C.-C. Jay Kuo, and Zhen Zhang. A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1(10):144–146, 1994.
[33] Leonard Kaufmann and Peter Rousseeuw. Clustering by means of medoids.
In Data Analysis Based on the L1-Norm and Related Methods, pages 405–416, 1987.
[34] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv e-prints, July 2012.
[35] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, March 1951.
[36] Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, December 2004.
[37] Cornelius Lanczos. An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators. United States Government Press Office, Los Angeles, CA, 1950.
[38] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[39] Richard B. Lehoucq, Danny C. Sorensen, and Chao Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998.
[40] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–136, 1982.
[41] Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
[42] Zelda E. Mariet, Yaniv Ovadia, and Jasper Snoek. DppNet: Approximating determinantal point processes with deep networks. In Advances in Neural Information Processing Systems, pages 3223–3234, 2019.
[43] John R. Mashey. Big data and the next wave of infrastress problems, solutions, opportunities. In USENIX Annual Technical Conference, Monterey, California, June, pages 6–11, 1999.
[44] Volodymyr Melnykov, Wei-Chen Chen, and Ranjan Maitra. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51(12):1–25, 2012.
[45] Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub.
Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1):91–118, 2003.
[46] Alejandro Murua and Nicolas Wicker. The conditional-Potts clustering model. Journal of Computational and Graphical Statistics, 23(3):717–739, 2014.
[47] Beresford N. Parlett and Bahram Nour-Omid. Towards a black box Lanczos program. Computer Physics Communications, 53(1):169–179, 1989.
[48] José M. Pena, Jose Antonio Lozano, and Pedro Larranaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027–1040, 1999.
[49] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[50] Amar Shah and Zoubin Ghahramani. Determinantal clustering processes: a nonparametric Bayesian approach to kernel-based semi-supervised clustering. arXiv preprint arXiv:1309.6862, 2013.
[51] B. W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26 of Monographs on Statistics & Applied Probability. Chapman and Hall, 1986.
[52] Chris Snijders, Uwe Matzat, and Ulf-Dietrich Reips. "Big data": big gaps of knowledge in the field of internet science. International Journal of Internet Science, 7(1):1–5, 2012.
[53] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.
[54] Nikolett Szelei, Luís Tinoca, and Ana Sofia Pinho. Professional development for cultural diversity: the challenges of teacher learning in context. Professional Development in Education, pages 1–17, 2019.
[55] Sandro Vega-Pons and José Ruiz-Shulcloper. A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(03):337–372, 2011.
[56] Serge Vicente and Alejandro Murua. Determinantal consensus clustering.
arXiv e-prints, 2020.
[57] Iris R. Wagstaff and Gerald LaPorte. The importance of diversity and inclusion in the forensic sciences. National Institute of Justice Journal, 279:81–91, 2018.
[58] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv e-prints, arXiv:1708.07747, 2017.