On Consistency of Compressive Spectral Clustering
Muni Sreenivas Pydi [email protected]
Department of Electrical and Computer Engineering, University of Wisconsin - Madison, Madison, WI - 53726, USA

Ambedkar Dukkipati [email protected]
Department of Computer Science and Automation, Indian Institute of Science, Bengaluru - 560012, India
Abstract
Spectral clustering is one of the most popular methods for community detection in graphs. A key step in spectral clustering algorithms is the eigen decomposition of the $n \times n$ graph Laplacian matrix to extract its $k$ leading eigenvectors, where $k$ is the desired number of clusters among $n$ objects. This is prohibitively complex to implement for very large datasets. However, it has recently been shown that it is possible to bypass the eigen decomposition by computing an approximate spectral embedding through graph filtering of random signals. In this paper, we analyze the working of spectral clustering performed via graph filtering on the stochastic block model. Specifically, we characterize the effects of sparsity, dimensionality and filter approximation error on the consistency of the algorithm in recovering planted clusters.

Keywords: spectral methods, clustering, stochastic block model
1. Introduction
Detecting communities, or clusters, in networks is an important problem in many fields of science (Fortunato, 2010; Jain et al., 1999). Spectral clustering is a widely used algorithm for community detection in networks (Von Luxburg, 2007) because of its strong theoretical grounding (Ng et al., 2002; Shi and Malik, 2000) and recently established consistency results (Rohe et al., 2011; Lei and Rinaldo, 2015). Spectral clustering works by relaxing the NP-hard discrete optimization problem of graph partitioning into a continuous optimization problem. As a first step, one computes the $k$ leading eigenvectors of the graph Laplacian matrix, which gives a $k$-dimensional 'spectral' embedding for each vertex of the graph. In the second step, one performs $k$-means on the embedding to retrieve the graph clusters.

However, computing the leading eigenvectors of the graph Laplacian requires eigen decomposition, which is very hard to compute for large datasets. Several approximate algorithms have been proposed to overcome this problem via Nyström sampling (Fowlkes et al., 2004; Li et al., 2011; Choromanska et al., 2013). While these methods do not skip the eigen decomposition, they reduce its complexity via column sampling of the Laplacian. Another class of methods uses random projections to reduce the dimensionality of the dataset while obtaining an approximate spectral embedding (Sakai and Imiya, 2009; Gittens et al., 2013). On the other hand, with the emergence of signal processing on graphs (Shuman et al., 2013), there has been the development of techniques based on graph filtering that can side-step the eigen decomposition altogether (Ramasamy and Madhow, 2015; Tremblay et al., 2016b,a). While many of these approaches have been shown to work fairly well on real and synthetic datasets, a rigorous mathematical analysis is still lacking.

In this paper, we consider a variant of the compressive spectral clustering algorithm that uses graph filtering of random signals to compute an approximate spectral embedding of the graph nodes (Tremblay et al., 2016b). For a graph with $n$ nodes and $k$ clusters, the algorithm proceeds by calculating a $d$-dimensional embedding for the graph nodes, where $d$ is of the order of $\log(n)$. This compressed embedding acts as a substitute for the $k$-dimensional spectral embedding of the spectral clustering algorithm and does not need the eigen decomposition of the Laplacian. Instead, the embedding is obtained by extracting the top $k$ frequencies of $d$ random graph signals using fast graph filtering.

Contributions
In this paper, we analyze the spectral clustering algorithm performed via graph filtering (Algorithm 2) using the stochastic block model (SBM). We derive a bound on the number of vertices that would be incorrectly clustered by the algorithm, and prove that the algorithm can consistently recover planted clusters from the SBM under mild assumptions on the sparsity of the graph and the filter approximation used to compute the spectral embedding. For our analysis, we specifically consider the high-dimensional stochastic block model, which allows the number of clusters $k$ to grow faster than $\log(n)$. This is important because the computational gains of the compressive spectral clustering algorithm are most apparent in the high-dimensional case. In proving the weak consistency of Algorithm 2, we primarily use the proof techniques of Rohe et al. (2011), which were originally used to analyze the spectral clustering algorithm under the high-dimensional SBM. Finally, we analyze our consistency result in some special cases of the block model and validate our findings with accompanying experiments.
2. Preliminaries
2.1 Notation

We use capital letters to denote matrices, and specifically their formal script versions for random matrices. We use the superscript $(n)$ to denote matrices corresponding to a graph of $n$ nodes. We use $\|\cdot\|$ for the Euclidean norm of a vector and the spectral norm of a matrix, and $\|\cdot\|_F$ for the Frobenius norm of a matrix. For a matrix $M$, we use $M_{i*}$ and $M_{*j}$ to denote the $i$th row and $j$th column respectively. We also use the standard notation $o(\cdot)$, $O(\cdot)$ and $\Omega(\cdot)$ to describe the limiting behavior of functions.

2.2 Stochastic Block Model

We consider an undirected, unweighted graph $G$ with $n$ nodes. Under the SBM, each node of the graph $G$ is assigned to one of $k$ clusters or blocks via the membership matrix $Z \in \{0,1\}^{n \times k}$, where $Z_{ig} = 1$ if and only if node $i$ belongs to block $g$. The SBM adjacency matrix is defined as $W = ZBZ^T$, where $B \in [0,1]^{k \times k}$ is the block matrix whose entry $B_{gh}$ gives the probability of an edge between nodes of cluster $g$ and cluster $h$. $B$ is full rank and symmetric. The diagonal entries of $W$ are set to zero to prevent self edges. From $W$, we define the degree matrix $D$ such that $D_{ii} = \sum_k W_{ik}$ and the normalized Laplacian matrix $L = D^{-1/2} W D^{-1/2}$. We define $\tau_n = \min_{1 \le i \le n} D^{(n)}_{ii}/n$ to indicate the level of sparsity in the graph.

To generate a random graph with the SBM, we sample a random adjacency matrix $\mathscr{W}$ from its population version, $W$. Let $\mathscr{D}$ and $\mathscr{L}$ represent the corresponding degree matrix and the normalized Laplacian for the sampled graph. Using the Davis-Kahan theorem, it can be shown that the eigenvectors of $\mathscr{L}$ and $L$ converge asymptotically as $n$ becomes large. This is important because the spectral clustering algorithm relies on the eigenvectors of the sampled graph Laplacian $\mathscr{L}$ to estimate the node membership $Z$.

Now, we borrow a result from Rohe et al. (2011) that gives the conditions for convergence of the leading $k$ eigenvectors of $\mathscr{L}$ and $L$.
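To make the model concrete, here is a minimal sketch (ours, not the authors' code) of sampling a graph from the SBM described above and forming the normalized Laplacian $L = D^{-1/2}WD^{-1/2}$; all function names are illustrative.

```python
import numpy as np

def sample_sbm(Z, B, rng):
    """Sample a symmetric {0,1} adjacency matrix with P(edge i~j) = (Z B Z^T)_ij."""
    P = Z @ B @ Z.T                          # population adjacency W
    upper = rng.random(P.shape) < P          # independent Bernoulli draws
    A = np.triu(upper, k=1)                  # keep the strict upper triangle
    return (A | A.T).astype(float)           # symmetrize; diagonal stays 0 (no self edges)

def normalized_laplacian(W):
    """L = D^{-1/2} W D^{-1/2}, with a small guard for isolated nodes."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
n, k = 300, 3
Z = np.zeros((n, k)); Z[np.arange(n), np.arange(n) % k] = 1   # balanced membership
B = 0.1 * np.ones((k, k)) + 0.4 * np.eye(k)                   # full-rank, symmetric block matrix
L = normalized_laplacian(sample_sbm(Z, B, rng))
```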
Theorem 2.2.1 (Convergence of Eigenvalues and Eigenvectors) Let $\mathscr{W}^{(n)} \in \{0,1\}^{n \times n}$ be a sequence of adjacency matrices sampled from the SBM with population matrices $W^{(n)}$. Let $\mathscr{L}^{(n)}$ and $L^{(n)}$ be the corresponding graph Laplacians. Let $\mathscr{X}^{(n)}, X^{(n)} \in \mathbb{R}^{n \times k_n}$ be the matrices that contain the eigenvectors corresponding to the leading $k_n$ eigenvalues of $\mathscr{L}^{(n)}$ and $L^{(n)}$ in the absolute sense, respectively. Let $\bar{\lambda}_{k_n}$ be the least non-zero eigenvalue of $L^{(n)}$.

Assumption 1 (Eigengap) $n^{-1/2}(\log n)^2 = O(\bar{\lambda}_{k_n}^2)$

Assumption 2 (Sparsity) $\tau_n^2 > 2/\log n$

Under Assumptions 1 and 2, for some sequence of orthonormal matrices $O^{(n)}$,
$$\|\mathscr{X}^{(n)} - X^{(n)} O^{(n)}\|_F = o\left(\frac{\log n}{\sqrt{n}\,\bar{\lambda}_{k_n}^2\,\tau_n^2}\right).$$

Proof
Theorem 2.2.1 is a special case of Theorem 2.2 from Rohe et al. (2011). The result follows by setting $S_n = [\bar{\lambda}_{k_n}/2,\, 1]$ and $\delta_n = \delta'_n = \bar{\lambda}_{k_n}/2$ in that theorem.

While Assumption 1 ensures that the eigengap of $L$ is large enough to enable the separability of the $k$ clusters, Assumption 2 puts a lower bound on the sparsity level of the graph. Under these two assumptions, Theorem 2.2.1 bounds the Frobenius norm of the difference between the top $k$ eigenvectors of the population and sampled versions of the graph Laplacian.

2.3 Spectral Clustering

Spectral clustering operates on the $k$ leading eigenvectors of $\mathscr{L}$, i.e., the matrix $\mathscr{X}$ in Theorem 2.2.1. Each row of $\mathscr{X}$ is taken as the $k$-dimensional spectral embedding of the corresponding node, and $k$-means is performed on the new data points to retrieve the cluster membership matrix $Z$. The spectral clustering algorithm we consider is listed in Algorithm 1.
Algorithm 1 Spectral Clustering

Input: Graph Laplacian matrix $\mathscr{L}$, number of clusters $k$.

1. Compute $\mathscr{X} \in \mathbb{R}^{n \times k}$ containing the eigenvectors corresponding to the $k$ leading eigenvalues (in the absolute sense) of $\mathscr{L}$.
2. Treating each row of $\mathscr{X}$ as a point in $\mathbb{R}^k$, run $k$-means. From the result of $k$-means, form the membership matrix $\hat{Z} \in \{0,1\}^{n \times k}$ assigning each node to a cluster.

Output: Estimated membership matrix $\hat{Z}$.

Note that $k$-means is performed on the rows of the matrix $\mathscr{X}$ in Algorithm 1. For this to result in the $k$ distinct clusters of the SBM, the rows belonging to nodes in different clusters must be 'well-separated', while the rows belonging to nodes in the same cluster must be closely spaced. This property becomes evident from Theorem 2.3.1, which follows from the work of Rohe et al. (2011).

Theorem 2.3.1 (Separability of Clusters) Consider an SBM with $k$ blocks. Let $L$ be the population version of the graph Laplacian. Let $X \in \mathbb{R}^{n \times k}$ be the matrix containing the eigenvectors corresponding to the $k$ nonzero eigenvalues of $L$. Let $P$ be the number of nodes in the largest block, i.e., $P = \max_{1 \le j \le k} (Z^T Z)_{jj}$. Then the following statements are true.

1. There exists a matrix $\mu \in \mathbb{R}^{k \times k}$ such that $Z\mu = X$.
2. $X_{i*} = X_{j*} \Leftrightarrow Z_{i*} = Z_{j*}$, i.e., $\mu$ is invertible.
3. $\|X_{i*} - X_{j*}\| \ge \sqrt{2/P}$ for any $Z_{i*} \ne Z_{j*}$.

Proof
Statements 1 and 2 of Theorem 2.3.1 follow from Lemma 3.1 of Rohe et al. (2011). Statement 3 is equivalent to Statement D.3 from the proof of Lemma 3.2 in Rohe et al. (2011).

From Theorem 2.3.1, it is evident that performing $k$-means on the rows of $X$ would retrieve the block membership of all the nodes in the graph exactly. However, the matrix $X$ is hidden, and only its sampled version, $\mathscr{X}$, can be accessed. But by Theorem 2.2.1, we have that $\mathscr{X}$ is a close approximation of $X$ for large $n$. As Algorithm 1 performs $k$-means on $\mathscr{X}$, the estimated membership matrix $\hat{Z}$ should be close to the true membership matrix $Z$.

2.4 Graph Filtering

As in Algorithm 1, extracting the top $k$ eigenvectors of the Laplacian is a key step in the spectral clustering algorithm. This can be viewed as extracting the $k$ lowest frequencies or Fourier modes of the graph Laplacian. This interpretation allows us to use the fast graph filtering approach (Tremblay et al., 2016b; Ramasamy and Madhow, 2015) to speed up the computation. We briefly describe this here.

A graph signal $y \in \mathbb{R}^n$ is a mapping from the vertex set $V$ of a graph $G$ to $\mathbb{R}$. If the eigen decomposition of the graph Laplacian is $L = U \Lambda U^T$, then the graph Fourier transform of $y$ is $\hat{y} = U^T y$. The entries of $\hat{y}$ give the $n$ Fourier modes of the graph signal $y$. Assuming that the columns of $U$ are ordered in decreasing order (in absolute value) of the corresponding eigenvalues, the top $k$ Fourier modes of $y$ can be obtained as $\hat{y}_k = X^T y$, where $X \in \mathbb{R}^{n \times k}$ is the matrix whose columns are the top $k$ eigenvectors of $L$.

A graph filter function $h$ is defined over $[-1, 1]$, and $h(\Lambda)$ is a diagonal matrix defined as $h(\Lambda) := \mathrm{diag}(h(\lambda_1), \ldots, h(\lambda_n))$, where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $L$ ordered in decreasing order of absolute value. The equivalent filter operator in the spectral domain, $H \in \mathbb{R}^{n \times n}$, is defined as $H := U h(\Lambda) U^T$.
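A small numerical illustration (ours) of the graph Fourier transform and the filter operator $H = Uh(\Lambda)U^T$ defined above, on a stand-in symmetric matrix (a true normalized Laplacian would have its spectrum in $[-1,1]$, but the identity checked here does not depend on that):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
Lap = (A + A.T) / 2                      # stand-in for a graph Laplacian

lam, U = np.linalg.eigh(Lap)             # Lap = U Lambda U^T
order = np.argsort(-np.abs(lam))         # decreasing order of |lambda|
lam, U = lam[order], U[:, order]

y = rng.standard_normal(50)              # a graph signal
y_hat = U.T @ y                          # graph Fourier transform

h = lambda x: 1.0 / (1.0 + np.abs(x))    # an arbitrary filter function
H = U @ np.diag(h(lam)) @ U.T            # filter operator H = U h(Lambda) U^T
# filtering in the vertex domain multiplies each Fourier mode by h(lambda_i):
assert np.allclose(U.T @ (H @ y), h(lam) * y_hat)
```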
Spectral Clustering via Graph Filtering
Input:
Graph Laplacian L , number of clusters k , number of dimensions d , polynomialorder p .1. Estimate λ k of L .2. Compute (cid:101) h λ k to approximate the ideal filter h λ k .3. Construct R ∈ R n × d with i.i.d entries from N (0 , d ).4. Compute (cid:101) X R = (cid:101) H λ k R = (cid:80) pl =0 α l L l R .5. Treating each row of (cid:101) X R as a point in R d , run k -means. From the result of k -means,form the membership matrix ˆ Z ∈ { , } n × k assigning each node to a cluster. Output:
Estimated membership matrix ˆ Z .To extract the top k Fourier modes of a graph signal, we use an ideal low-pass filterdefined as h λ k ( λ ) = (cid:40) if | λ | ≥ | λ k | otherwise. (1)The result of graph signal y filtered through h λ k is given by y λ k = U h λ k (Λ) U T y = XX T y . Obviously, filtering a graph signal with the ideal filter in (1) needs the eigendecomposition of the graph Laplacian. Now we define (cid:101) h λ k ( λ ) := (cid:80) p(cid:96) =0 α (cid:96) λ (cid:96) , an order p polynomial, to be the non-ideal approximation of the filter h λ k ( λ ). The filter operator inspectral domain, (cid:101) H λ k can be computed as (cid:101) H λ k = U (cid:101) h λ k ( λ ) U T = (cid:80) p(cid:96) =0 α (cid:96) L (cid:96) . The signal y filtered by (cid:101) H λ k can be computed as (cid:101) y k = (cid:80) p(cid:96) =0 α (cid:96) L (cid:96) y , which does not required the eigendecomposition of L . Moreover, it only involves computing p matrix-vector multiplications.The method that we use for our analysis is outlined in Algorithm 2.
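Below is a sketch of Algorithm 2 under simplifying assumptions of ours: step 1 takes $\lambda_k$ from an exact eigen decomposition rather than the estimation procedure of Section 3.3, and step 2 uses a plain least-squares polynomial fit to the ideal filter, whereas the paper's experiments use Jackson-Chebyshev polynomials. It is meant to show the structure of the computation, not to be a reference implementation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def sc_gf(L, k, d=None, p=25, rng=None):
    """Spectral clustering via graph filtering (sketch of Algorithm 2)."""
    rng = rng or np.random.default_rng(0)
    n = L.shape[0]
    d = d or max(int(4 * np.log(n)), k)           # d = O(log n) random signals
    lam = np.sort(np.abs(np.linalg.eigvalsh(L)))[::-1]
    lam_k = lam[k - 1]                            # step 1 (assumed known here)
    # step 2: fit an order-p polynomial to the ideal filter on a grid of [-1, 1]
    grid = np.linspace(-1, 1, 400)
    ideal = (np.abs(grid) >= lam_k).astype(float)
    alpha = np.polynomial.polynomial.polyfit(grid, ideal, p)
    # steps 3-4: filter d random signals, sum_{l=0}^{p} alpha_l L^l R via mat-vecs
    R = rng.standard_normal((n, d)) / np.sqrt(d)
    X_tilde, power = np.zeros((n, d)), R.copy()
    for a in alpha:                               # accumulate alpha_l * L^l R
        X_tilde += a * power
        power = L @ power
    # step 5: k-means on the rows of the compressed embedding
    _, labels = kmeans2(X_tilde, k, minit='++', seed=1)
    return labels
```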
3. SBM and Spectral Clustering via Graph Filtering
In this section, we lay down the building blocks that make up Algorithm 2. In Section 3.1 we show how a compressed spectral embedding can be computed with graph filtering, and prove that the compressed embedding remains a close approximation of the embedding induced by the SBM's population graph Laplacian. In Section 3.2 we show the effect of using the fast graph filtering technique to compute the compressed embedding. In Section 3.3 we deal with the estimation of the $k$th eigenvalue of the graph Laplacian without resorting to eigen decomposition.

3.1 Compressed Spectral Embedding

From Algorithm 1, it seems that we need the matrix $\mathscr{X}$ containing the $k$ most significant eigenvectors of $\mathscr{L}$ in order to retrieve the clusters. Since we only use the rows of $\mathscr{X}$ as data points for the subsequent $k$-means step, we only need a distance-preserving embedding of the rows of $\mathscr{X}$. In this section, we see how such an embedding can be obtained through the result of filtering random graph signals. The technique used is similar to that of Tremblay et al. (2016b), except that we employ stricter assumptions to help in proving consistency results.

Consider the matrix $R \in \mathbb{R}^{n \times d}$ whose entries are independent Gaussian random variables with mean $0$ and variance $1/d$. Define $\mathscr{X}_R := H_{\lambda_k} R = U h_{\lambda_k}(\Lambda) U^T R = \mathscr{X}\mathscr{X}^T R$, whose $d$ columns contain the result of filtering the corresponding $d$ columns of $R$ using the filter $h_{\lambda_k}$. In Theorem 3.1.1, we show that the rows of $\mathscr{X}_R$ form an $\epsilon$-approximate distance-preserving embedding of the rows of $\mathscr{X}$ for sufficiently large $d$. To analyze the effect of this embedding on the true cluster centers, i.e., the $k$ unique rows of $X$, we define the matrix $X_R := XO\mathscr{X}^T R$, where $O$ is the orthonormal rotation matrix from Theorem 2.2.1. We aim to show that the separability of the true cluster centers is still ensured under the compressed embedding.
Theorem 3.1.1 (Convergence and Separability under Compressed Spectral Embedding) For the sequence of adjacency matrices as defined in Theorem 2.2.1, define $P_n = \max_{1 \le j \le k_n} (Z^T Z)_{jj}$ to be the sequence of populations of the largest block. Let $\mathscr{X}_R^{(n)}, X_R^{(n)} \in \mathbb{R}^{n \times d_n}$ be the compressed embeddings for $\mathscr{X}^{(n)}, X^{(n)}$ as defined in Theorem 2.2.1. For $\epsilon \in (0,1)$ and $\beta > 0$, if
$$d_n > \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log(n + k_n),$$
then with probability at least $1 - n^{-\beta}$, we have the following under the assumptions of Theorem 2.2.1:
$$\|\mathscr{X}_R^{(n)} - X_R^{(n)}\|_F = o\left(\frac{\log n}{\sqrt{n}\,\bar{\lambda}_{k_n}^2\,\tau_n^2}\right),$$
$$\|X_{R\,i*}^{(n)} - X_{R\,j*}^{(n)}\| \ge (1 - \epsilon)\sqrt{2/P_n} \quad \text{for any } Z_{i*} \ne Z_{j*},$$
where $i, j \in \{1, \cdots, n\}$.

Proof
See Appendix A.

Theorem 3.1.1 is analogous to the theorems on convergence (Theorem 2.2.1) and separability (Theorem 2.3.1) of the spectral clustering algorithm. It ensures that the approximate spectral embedding $\mathscr{X}_R^{(n)}$ converges to the corresponding population version $X_R^{(n)}$ while still ensuring that the true clusters remain separable.
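The distance-preservation claim of Theorem 3.1.1 is easy to probe numerically. The following sketch (ours) builds an orthonormal $X$, forms the compressed embedding $XX^TR$, and compares a pairwise distance before and after the compression; the ratio should be close to 1 once $d$ is a modest multiple of $\log n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, d = 2000, 4, 120
M = rng.standard_normal((n, k))
X, _ = np.linalg.qr(M)                    # n x k matrix with orthonormal columns
R = rng.standard_normal((n, d)) / np.sqrt(d)
X_R = X @ (X.T @ R)                       # ideal-filter compressed embedding

i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j])
comp = np.linalg.norm(X_R[i] - X_R[j])
print(orig, comp, comp / orig)            # ratio close to 1 for d = O(log n)
```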
3.2 Fast Graph Filtering

Now, we define an additional level of approximation for the spectral embedding using the fast graph filtering technique discussed in Section 2.4. Let $\tilde{\mathscr{X}}_R := \tilde{H}_{\lambda_k} R = \sum_{\ell=0}^{p} \alpha_\ell \mathscr{L}^\ell R$ be the output of approximate filtering of the columns of $R$, where $R \in \mathbb{R}^{n \times d}$ has entries drawn from $N(0, 1/d)$. Lemma 3.2.1 bounds the difference between $\tilde{\mathscr{X}}_R$ and $\mathscr{X}_R$, which result from approximate and ideal filtering respectively.

Lemma 3.2.1 (Bounding the Approximate Filtering Error) For the sequence of adjacency matrices as defined in Theorem 2.2.1, let $\tilde{\mathscr{X}}_R^{(n)} \in \mathbb{R}^{n \times d_n}$ be the approximation of $\mathscr{X}_R^{(n)}$ obtained using the polynomial filter $\tilde{h}_{\lambda_{k_n}}$ (instead of the ideal filter $h_{\lambda_{k_n}}$). Let $\sigma(\mathscr{L}^{(n)})$ be the spectrum of the sampled graph Laplacian $\mathscr{L}^{(n)}$. Define the maximum absolute error in the polynomial approximation as
$$e_n = \max_{\lambda \in \sigma(\mathscr{L}^{(n)})} |\tilde{h}_{\lambda_{k_n}}(\lambda) - h_{\lambda_{k_n}}(\lambda)|.$$
For $\epsilon \in (0,1)$, with probability at least $1 - e^{-nd_n(\epsilon^2 - \epsilon^3)/4}$,
$$\|\tilde{\mathscr{X}}_R^{(n)} - \mathscr{X}_R^{(n)}\|_F \le (1 + \epsilon)\, n\, e_n.$$

Proof
See Appendix A.

3.3 Estimation of $\lambda_k$

Lemma 3.2.1 shows that in order to achieve a fixed error bound between $\mathscr{X}_R$ and $\tilde{\mathscr{X}}_R$, the polynomial approximation must be increasingly accurate as $n$ grows large. Designing such a polynomial would necessitate knowing the value of $\lambda_k$. In this section, we explain how that can be done without having to perform the eigen decomposition of $\mathscr{L}$. First, we state the following lemma, which bounds the output of fast graph filtering, $\tilde{\mathscr{X}}_R$.

Lemma 3.3.1 (Estimation of $\lambda_k$) For the $\tilde{\mathscr{X}}_R^{(n)}$, $e_n$ and $\epsilon$ given in Lemma 3.2.1, with probability at least $1 - e^{-nd_n(\epsilon^2 - \epsilon^3)/4}$ we have
$$(1 - \epsilon)\frac{k_n}{n} - 2(1 + \epsilon) k_n e_n \;\le\; \frac{1}{n}\|\tilde{\mathscr{X}}_R^{(n)}\|_F^2 \;\le\; (1 + \epsilon)\left(k_n + 2k_n e_n + n e_n^2\right).$$

Proof
See Appendix A.

For $(2ke_n + ne_n^2) = o(1)$, Lemma 3.3.1 shows that the output of fast graph filtering, $\tilde{\mathscr{X}}_R$, is tightly concentrated around $k$ upon normalization by $n$. This can be used to estimate $|\lambda_k|$ by a dichotomic search in the range $[0, 1]$, as explained in Puy et al. (2016). The basic idea is to make a coarse initial guess of $|\lambda_k|$ in the interval $[0, 1]$, compute $\tilde{\mathscr{X}}_R$ with the current estimate, and iteratively refine the estimate by comparing $\frac{1}{n}\|\tilde{\mathscr{X}}_R\|_F^2$ with $k$.

Before we move on to proving the consistency of Algorithm 2, let us summarise the results from the previous sections. We have a tractable way to estimate $|\lambda_k|$ without the eigen decomposition of $\mathscr{L}$. Through Lemma 3.2.1, we know that the resultant approximate embedding will be close to the ideal compressed embedding, for a reasonably accurate polynomial approximation of the ideal filter. Through Theorem 3.1.1, we showed that the compressed embedding of the $k$ leading eigenvectors of $\mathscr{L}$ converges to the corresponding embedding on $L$. We also showed that the data points corresponding to different clusters are still separable under such an embedding.
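To make the dichotomic search concrete, here is a sketch (ours). For brevity, the eigencount is evaluated with the ideal filter via an explicit eigen decomposition; the actual algorithm would use the polynomial filter of Section 3.2 so that no decomposition is needed. The search converges (up to noise) to a threshold $t$ with $|\lambda_k| \ge t > |\lambda_{k+1}|$, which is exactly what the filter needs.

```python
import numpy as np

def estimate_lambda_k(L, k, d=60, iters=20, rng=None):
    """Dichotomic search for a pass-band threshold separating the k largest |lambda_i|."""
    rng = rng or np.random.default_rng(3)
    n = L.shape[0]
    lam, U = np.linalg.eigh(L)                   # stand-in for fast filtering
    R = rng.standard_normal((n, d)) / np.sqrt(d)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        t = (lo + hi) / 2
        h = (np.abs(lam) >= t).astype(float)     # ideal filter with threshold t
        # ||H R||_F^2 concentrates around the number of eigenvalues with |lambda_i| >= t
        count = np.linalg.norm(U @ (h[:, None] * (U.T @ R))) ** 2
        if count > k:                            # too many frequencies kept:
            lo = t                               # raise the threshold
        else:
            hi = t
    return (lo + hi) / 2
```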
4. Consistency of Algorithm SC-GF
4.1 Misclustered Vertices

Once we get the approximate spectral embedding of the $n$ nodes of the graph in the form of $\tilde{\mathscr{X}}_R$, we perform $k$-means with the rows of $\tilde{\mathscr{X}}_R$ as data points in $\mathbb{R}^d$. Let $c_1, \cdots, c_n \in \mathbb{R}^d$ be the centroids corresponding to the $n$ rows of $\tilde{\mathscr{X}}_R$, out of which only $k$ are unique. The $k$ unique centroids correspond to the centers of the $k$ clusters. Note that the true cluster centers correspond to the rows of $X_R$, and Theorem 3.1.1 ensures that they are separable from each other. Hence, we say that a node $i$ is correctly clustered if its $k$-means cluster center $c_i$ is closer to its true cluster center $X_{R\,i*}$ than it is to any other center $X_{R\,j*}$ with $Z_{j*} \ne Z_{i*}$. In the following lemma, we lay down a sufficient condition for correctly clustering a node $i$.
Lemma 4.1.1 (Sufficient Condition for Correct Clustering) Let $c_1^{(n)}, \cdots, c_n^{(n)} \in \mathbb{R}^{d_n}$ be the centroids resulting from performing $k_n$-means on the rows of $\tilde{\mathscr{X}}_R^{(n)}$. For $P_n$ and $\epsilon$ as defined in Theorem 3.1.1,
$$\|c_i^{(n)} - X_{R\,i*}^{(n)}\| < (1 - \epsilon)\frac{1}{\sqrt{2P_n}} \;\Rightarrow\; \|c_i^{(n)} - X_{R\,i*}^{(n)}\| < \|c_i^{(n)} - X_{R\,j*}^{(n)}\|$$
for any $z_i \ne z_j$.

Proof See Appendix B.

Following the analysis in (Rohe et al., 2011), we define the set of misclustered vertices $M$ as containing the vertices that do not satisfy the sufficient condition in Lemma 4.1.1:
$$M = \left\{ i : \|c_i^{(n)} - X_{R\,i*}^{(n)}\| \ge (1 - \epsilon)\frac{1}{\sqrt{2P_n}} \right\}.$$
Now that we have the definition of misclustered vertices, we analyze the performance of $k$-means. Let the matrix $C \in \mathbb{R}^{n \times k}$ be the result of $k$-means clustering, where the $i$th row, $c_i$, is the centroid corresponding to the $i$th vertex. We have $C \in \mathscr{C}_{n,k}$, where $\mathscr{C}_{n,k}$ represents the family of matrices with $n$ rows out of which only $k$ are unique, and $C$ can be defined as
$$C = \arg\min_{A \in \mathscr{C}_{n,k}} \|A - \tilde{\mathscr{X}}_R\|_F.$$
The next theorem bounds the number of misclustered vertices, that is, the size of the set $M$.

Theorem 4.1.2 (Bound on the Number of Misclustered Vertices)
$$|M| = o\left(P_n\left(\frac{(\log n)^2}{n\,\bar{\lambda}_{k_n}^4\,\tau_n^4} + n^2 e_n^2\right)\right) \quad (2)$$

Proof
See Appendix B.
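In practice (Section 5), reporting $|M|/n$ requires aligning the estimated labels with the planted ones, since cluster labels are only recovered up to permutation. A sketch (ours) that uses the Hungarian algorithm to compute such a misclustering rate, as a practical proxy for the set $M$ defined above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclustering_rate(z_true, z_hat, k):
    """Fraction of nodes whose estimated label disagrees with the best-matched planted label."""
    confusion = np.zeros((k, k))
    for t, h in zip(z_true, z_hat):
        confusion[t, h] += 1
    row, col = linear_sum_assignment(-confusion)   # best label permutation
    return 1.0 - confusion[row, col].sum() / len(z_true)
```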
4.2 A Four-Parameter SBM

We consider a simplified SBM with four parameters $k$, $q$, $r$ and $s$: there are $k$ blocks, each of which contains $s$ nodes, so that the total number of vertices in the graph is $n = ks$. The probability of an edge between two vertices of the same block is given by $q + r \in [0, 1]$ and that between vertices of different blocks is given by $r \in [0, 1]$. In this case $P_n = s$. The smallest non-zero eigenvalue of the population graph Laplacian $L$ is given by $\bar{\lambda}_{k_n} = \frac{1}{k(r/q) + 1}$, and the sparsity parameter is $\tau_n = q/k + r$ (Rohe et al., 2011). The proportion of misclustered vertices is then
$$\frac{|M|}{n} = o\left(\frac{k^3 (\log n)^2}{n} + \frac{n^2}{k}\, e_n^2\right). \quad (3)$$
For weak consistency, we need $\lim_{n \to \infty} \frac{|M|}{n} = 0$. From (3), the condition on the number of clusters for weak consistency is $k = o(n^{1/3}/(\log n)^{2/3})$, and the worst-case condition on the polynomial approximation error is $e_n = o(n^{-5/6}(\log n)^{-1/3})$.

Figure 1: Proportion of misclustered vertices plotted against the number of vertices, for fixed $q$ and $r = 0.1$. The polynomial order $p$ is set to 5, 25 and 125 for the three curves pertaining to Algorithm 2. The corresponding polynomial error $e_n$ is shown in the legend.
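This four-parameter model plugs directly into the SBM sampler sketched in Section 2.2; a sketch (ours) that also evaluates the closed-form $\bar{\lambda}_{k_n}$ and $\tau_n$ given above:

```python
import numpy as np

def four_parameter_sbm(k, s, q, r):
    """Membership and block matrices for the four-parameter SBM (n = k*s nodes)."""
    n = k * s
    Z = np.zeros((n, k)); Z[np.arange(n), np.arange(n) // s] = 1
    B = r * np.ones((k, k)) + q * np.eye(k)   # B_gg = q + r, B_gh = r for g != h
    lam_bar = 1.0 / (k * (r / q) + 1)         # smallest nonzero eigenvalue of L
    tau = q / k + r                           # sparsity parameter tau_n
    return Z, B, lam_bar, tau
```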
5. Experiments
We perform experiments on the simplified four-parameter SBM presented in Section 4.2. For the polynomial approximation of the ideal filter, we use Chebyshev polynomials with Jackson damping coefficients (Di Napoli et al., 2016).

In our first experiment, we analyze the error rate of Algorithm 2 for a fixed number of clusters as the number of nodes is increased. As expected, the proportion of misclustered vertices, $\frac{|M|}{n}$, tends to zero as $n$ grows large (Figure 1). However, for the case of high polynomial error ($p = 5$), we see that the error rate diverges. This validates the presence of $e_n$ in (3).

In our second experiment, we analyze the effect of the polynomial error $e_n$ in finer detail, by fixing all the other variables $n$, $k$, $q$ and $r$. From (3), the proportion of misclustered vertices should grow linearly with the squared polynomial error $e_n^2$. From Figure 2, this behavior is evident.

Figure 2: Proportion of misclustered vertices plotted against the squared polynomial error $e_n^2$, for fixed $n$, $k$, $q$ and $r$ ($r = 0.1$). The polynomial order $p$ is varied from 5 to 25 linearly.
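A sketch (ours) of the first experiment, reusing the helper functions from the earlier sketches (`sample_sbm`, `normalized_laplacian`, `sc_gf`, `four_parameter_sbm`, `misclustering_rate`); $q = 0.5$ is an arbitrary illustrative value, while $r = 0.1$ follows the figure captions.

```python
import numpy as np

for s in [50, 100, 200, 400]:                     # grow n = k*s with k fixed
    Z, B, _, _ = four_parameter_sbm(k=4, s=s, q=0.5, r=0.1)
    W = sample_sbm(Z, B, np.random.default_rng(s))
    L = normalized_laplacian(W)
    labels = sc_gf(L, k=4, p=25)                  # Algorithm 2 sketch
    z_true = Z.argmax(axis=1)
    print(4 * s, misclustering_rate(z_true, labels, k=4))
```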
6. Conclusion
In this paper, we prove some basic theorems that provide the theoretical basis for spectral clustering done via graph filtering. By Theorem 3.1.1, we prove the fundamental conditions required for the consistency of the spectral clustering algorithm via graph filtering, namely separability and convergence. By Theorem 4.1.2, we have shown that the algorithm can consistently retrieve the planted clusters in a stochastic block model, and we derive a bound on the number of misclustered vertices. Through Lemma 3.2.1 and Lemma 3.3.1, we quantify the maximum tolerable filtering error for the algorithm to succeed. We then validate our results by performing experiments on the simulated stochastic block model.

While the results we prove in this paper provide evidence for the weak consistency of Algorithm 2 under the stochastic block model under certain assumptions on sparsity and separability, several problems still remain open. First, the bound on the accuracy of the $\lambda_k$ estimate as given in Lemma 3.3.1 is derived in terms of the polynomial approximation error $e_n$. However, it is not trivial to estimate the polynomial order required to achieve a specific absolute error ($L_\infty$ norm) even in the case of popular choices like the Jackson-Chebyshev polynomials (Di Napoli et al., 2016). This complicates the derivation of explicit expressions for the algorithm's computational complexity. It also remains to be seen whether the algorithm remains consistent under a milder assumption on the graph sparsity ($\tau_n$), as is the case with the original spectral clustering algorithm (Lei and Rinaldo, 2015). While it is inevitable that the approximations involved in estimating $\lambda_k$ (Lemma 3.3.1) and in obtaining the approximate spectral embedding (Lemma 3.2.1) result in a weaker bound on the performance, we do not know if the results we derived are optimal. With this work, we hope to see a renewed interest in graph filtering approaches to spectral algorithms, which promise significant speed-ups in computation while (provably) maintaining almost the same performance.

References
Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Anna Choromanska, Tony Jebara, Hyungtae Kim, Mahesh Mohan, and Claire Monteleoni. Fast spectral clustering via the Nyström method. In International Conference on Algorithmic Learning Theory, pages 367–381. Springer, 2013.

Edoardo Di Napoli, Eric Polizzi, and Yousef Saad. Efficient estimation of eigenvalue counts in an interval. Numerical Linear Algebra with Applications, 23(4):674–692, 2016.

Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.

Alex Gittens, Prabhanjan Kambadur, and Christos Boutsidis. Approximate spectral clustering via randomized sketching. Ebay/IBM Research Technical Report, 2013.

Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.

Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.

Mu Li, Xiao-Chen Lian, James T. Kwok, and Bao-Liang Lu. Time and space efficient spectral clustering via column sampling. In CVPR, 2011.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.

Gilles Puy, Nicolas Tremblay, Rémi Gribonval, and Pierre Vandergheynst. Random sampling of bandlimited signals on graphs. Applied and Computational Harmonic Analysis, 2016.

Dinesh Ramasamy and Upamanyu Madhow. Compressive spectral embedding: sidestepping the SVD. In Advances in Neural Information Processing Systems, pages 550–558, 2015.

Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

Tomoya Sakai and Atsushi Imiya. Fast spectral clustering with random projection and sampling. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 372–384. Springer, 2009.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

Nicolas Tremblay, Gilles Puy, Pierre Borgnat, and Pierre Vandergheynst. Accelerated spectral clustering using graph filtering of random signals. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4094–4098. IEEE, 2016a.

Nicolas Tremblay, Gilles Puy, Rémi Gribonval, and Pierre Vandergheynst. Compressive spectral clustering. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016b.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
Appendix A. Proofs for Theorems in Section 3
A.1 Proof of Theorem 3.1.1

Proof
For the sake of compactness, we omit the superscript $(n)$ for the sequences of matrices, as the analysis is valid for every $n$.

By Theorem 2.3.1, there are at most $k$ unique rows among the $n$ rows of the matrix $X$, while the $n$ rows of the matrix $\mathscr{X}$ can potentially all be unique. The same inference can be made for the matrices $XO\mathscr{X}^T$ and $\mathscr{X}\mathscr{X}^T$, where $O$ is the orthonormal matrix from Theorem 2.2.1.

Treating the combined $n + k$ unique rows of the two matrices as data points in $\mathbb{R}^n$, we can use the Johnson-Lindenstrauss Lemma to approximately preserve the pairwise Euclidean distances between any two rows up to a factor of $\epsilon$. Applying Theorem 1.1 from Achlioptas (2003), if $d_n$ is larger than
$$\frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log(n + k),$$
then with probability at least $1 - n^{-\beta}$, we have
$$(1 - \epsilon)\|X_{i*}O\mathscr{X}^T - X_{j*}O\mathscr{X}^T\|^2 \le \|X_{R\,i*} - X_{R\,j*}\|^2 \le (1 + \epsilon)\|X_{i*}O\mathscr{X}^T - X_{j*}O\mathscr{X}^T\|^2 \quad (4)$$
for any $Z_{i*} \ne Z_{j*}$,
$$(1 - \epsilon)\|\mathscr{X}_{i*}\mathscr{X}^T - \mathscr{X}_{j*}\mathscr{X}^T\|^2 \le \|\mathscr{X}_{R\,i*} - \mathscr{X}_{R\,j*}\|^2 \le (1 + \epsilon)\|\mathscr{X}_{i*}\mathscr{X}^T - \mathscr{X}_{j*}\mathscr{X}^T\|^2,$$
and
$$(1 - \epsilon)\|\mathscr{X}_{i*}\mathscr{X}^T - X_{j*}O\mathscr{X}^T\|^2 \le \|\mathscr{X}_{R\,i*} - X_{R\,j*}\|^2 \le (1 + \epsilon)\|\mathscr{X}_{i*}\mathscr{X}^T - X_{j*}O\mathscr{X}^T\|^2 \quad (5)$$
where $i, j \in \{1, \cdots, n\}$.

Combining the inequality on the left side of (4) with Statement 3 of Theorem 2.3.1, we get
$$\|X_{R\,i*} - X_{R\,j*}\| \ge \sqrt{1 - \epsilon}\,\|X_{i*}O\mathscr{X}^T - X_{j*}O\mathscr{X}^T\| = \sqrt{1 - \epsilon}\,\|(X_{i*} - X_{j*})O\mathscr{X}^T\| = \sqrt{1 - \epsilon}\,\|X_{i*} - X_{j*}\| \ge (1 - \epsilon)\sqrt{2/P_n}$$
for any $Z_{i*} \ne Z_{j*}$. Since $\mathscr{X}^T\mathscr{X}$ is an identity matrix, the rows of $O\mathscr{X}^T$ are orthonormal. Hence, multiplication of a vector by $O\mathscr{X}^T$ from the right does not change its norm. By a similar procedure, combining the inequality on the right side of (5), taken with $j = i$, with Theorem 2.2.1, we get
$$\|\mathscr{X}_R - X_R\|_F^2 = \sum_{i=1}^n \|\mathscr{X}_{R\,i*} - X_{R\,i*}\|^2 \le (1 + \epsilon)\sum_{i=1}^n \|\mathscr{X}_{i*}\mathscr{X}^T - X_{i*}O\mathscr{X}^T\|^2 = (1 + \epsilon)\sum_{i=1}^n \|\mathscr{X}_{i*} - X_{i*}O\|^2 = (1 + \epsilon)\|\mathscr{X} - XO\|_F^2 = o\left(\frac{(\log n)^2}{n\,\bar{\lambda}_{k_n}^4\,\tau_n^4}\right).$$

A.2 Proof of Lemma 3.2.1

Proof
First, we note that $\|R^{(n)}\|_F^2$ is a scaled chi-squared random variable with $nd_n$ degrees of freedom and mean $n$. Using a Chernoff bound on $\|R^{(n)}\|_F^2$, we have
$$\Pr\left(\left|\frac{1}{n}\|R^{(n)}\|_F^2 - 1\right| > \epsilon\right) \le e^{-nd_n(\epsilon^2 - \epsilon^3)/4}. \quad (6)$$
Now, to bound the difference between the ideal and polynomial filters,
$$\|U(\tilde{h}_{\lambda_{k_n}}(\Lambda) - h_{\lambda_{k_n}}(\Lambda))U^T\|_F^2 = \|\tilde{h}_{\lambda_{k_n}}(\Lambda) - h_{\lambda_{k_n}}(\Lambda)\|_F^2 = \sum_{i=1}^n (\tilde{h}_{\lambda_{k_n}}(\lambda_i) - h_{\lambda_{k_n}}(\lambda_i))^2 \le \sum_{i=1}^n e_n^2 = n e_n^2. \quad (7)$$
Using the results from (6) and (7), we can bound the difference between the ideal and approximate spectral embeddings as follows:
$$\|\tilde{\mathscr{X}}_R^{(n)} - \mathscr{X}_R^{(n)}\|_F = \|\tilde{H}_{\lambda_{k_n}}R^{(n)} - H_{\lambda_{k_n}}R^{(n)}\|_F = \|U(\tilde{h}_{\lambda_{k_n}}(\Lambda) - h_{\lambda_{k_n}}(\Lambda))U^T R^{(n)}\|_F \le \|U(\tilde{h}_{\lambda_{k_n}}(\Lambda) - h_{\lambda_{k_n}}(\Lambda))U^T\|_F\, \|R^{(n)}\|_F \le (1 + \epsilon)\, n\, e_n,$$
where the last step holds with probability at least $1 - e^{-nd_n(\epsilon^2 - \epsilon^3)/4}$.

A.3 Proof of Lemma 3.3.1

Proof
For the sake of compactness, we omit the superscript $(n)$ for the sequences of matrices, as the analysis is valid for every $n$.

From Lemma 3.2.1, we have a bound on the term $\|\tilde{\mathscr{X}}_R - \mathscr{X}_R\|_F$. So, we proceed to prove Lemma 3.3.1 by bounding the term $\|\mathscr{X}_R\|_F$. For this, we make use of the fact that the $k_n$ columns of $\mathscr{X}^{(n)}$ are orthonormal:
$$\|\mathscr{X}_R\|_F^2 = \|\mathscr{X}\mathscr{X}^{(n)T}R\|_F^2 = \|\mathscr{X}^{(n)T}R\|_F^2, \quad (8)$$
which is a scaled chi-squared random variable with $k_n d_n$ degrees of freedom and mean $k_n$. Combining (8) with a Chernoff bound as in (6), we have the following with probability exceeding $1 - e^{-k_n d_n(\epsilon^2 - \epsilon^3)/4}$:
$$(1 - \epsilon)k_n \le \|\mathscr{X}_R\|_F^2 \le (1 + \epsilon)k_n.$$
Now, we prove the upper bound on $\tilde{\mathscr{X}}_R$:
$$\|\tilde{\mathscr{X}}_R\|_F^2 = \mathrm{tr}\left(\tilde{\mathscr{X}}_R^T \tilde{\mathscr{X}}_R\right) = \mathrm{tr}\left(R^T U \tilde{h}_{\lambda_k}(\Lambda) U^T U \tilde{h}_{\lambda_k}(\Lambda) U^T R\right) = \mathrm{tr}\left(R^T U (\tilde{h}_{\lambda_k}(\Lambda))^2 U^T R\right) = \mathrm{tr}\left((\tilde{h}_{\lambda_k}(\Lambda))^2 U^T R R^T U\right) \le \mathrm{tr}\left((\tilde{h}_{\lambda_k}(\Lambda))^2\right) \mathrm{tr}\left(U^T R R^T U\right), \quad (9)$$
where the last statement follows from the fact that the matrices $(\tilde{h}_{\lambda_k}(\Lambda))^2$ and $U^T R R^T U$ are positive semi-definite. Next,
$$\mathrm{tr}\left(U^T R R^T U\right) = \mathrm{tr}\left(R R^T\right) = \|R\|_F^2 \le (1 + \epsilon)n, \quad (10)$$
where the last step follows from (6) with probability at least $1 - e^{-nd_n(\epsilon^2 - \epsilon^3)/4}$. Using the definition of the maximum filter error $e_n$, we get
$$\mathrm{tr}\left((\tilde{h}_{\lambda_k}(\Lambda))^2\right) \le k(1 + e_n)^2 + (n - k)e_n^2 = k + 2ke_n + ne_n^2. \quad (11)$$
Combining (10) and (11) with (9), we get
$$\frac{1}{n}\|\tilde{\mathscr{X}}_R\|_F^2 \le (1 + \epsilon)(k + 2ke_n + ne_n^2). \quad (12)$$
Now we proceed to the lower bound. We have
$$\|\tilde{\mathscr{X}}_R\|_F^2 = \|\mathscr{X}_R\|_F^2 + \|\tilde{\mathscr{X}}_R - \mathscr{X}_R\|_F^2 + 2\,\mathrm{tr}\left(\mathscr{X}_R^T(\tilde{\mathscr{X}}_R - \mathscr{X}_R)\right), \quad (13)$$
and
$$\mathrm{tr}\left(\mathscr{X}_R^T(\tilde{\mathscr{X}}_R - \mathscr{X}_R)\right) = \mathrm{tr}\left(R^T U h_{\lambda_k}(\Lambda) U^T U (\tilde{h}_{\lambda_k}(\Lambda) - h_{\lambda_k}(\Lambda)) U^T R\right) = \mathrm{tr}\left(h_{\lambda_k}(\Lambda)(\tilde{h}_{\lambda_k}(\Lambda) - h_{\lambda_k}(\Lambda)) U^T R R^T U\right) = \mathrm{tr}\left(h_{\lambda_k}(\Lambda)(\tilde{h}_{\lambda_k}(\Lambda) - h_{\lambda_k}(\Lambda) + e_n I_n) U^T R R^T U\right) - \mathrm{tr}\left(h_{\lambda_k}(\Lambda)\, e_n I_n\, U^T R R^T U\right). \quad (14)$$
Here, $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix. By the definition of $e_n$, the diagonal entries of $(\tilde{h}_{\lambda_k}(\Lambda) - h_{\lambda_k}(\Lambda) + e_n I_n)$ are non-negative. Hence, the first term in (14) is non-negative. For the second term we have
$$\mathrm{tr}\left(h_{\lambda_k}(\Lambda)\, e_n I_n\, U^T R R^T U\right) \le e_n\, \mathrm{tr}\left(h_{\lambda_k}(\Lambda)\right) \mathrm{tr}\left(U^T R R^T U\right) = e_n k_n (1 + \epsilon) n. \quad (15)$$
In addition, the term $\|\tilde{\mathscr{X}}_R - \mathscr{X}_R\|_F^2$ in (13) is non-negative. Combining (15) with (13), we get
$$\frac{1}{n}\|\tilde{\mathscr{X}}_R\|_F^2 \ge (1 - \epsilon)\frac{k_n}{n} - 2(1 + \epsilon)e_n k_n. \quad (16)$$
Putting together (12) and (16) proves Lemma 3.3.1.
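Two ingredients of the proofs above, inequality (7) and the trace inequality $\mathrm{tr}(AB) \le \mathrm{tr}(A)\,\mathrm{tr}(B)$ for positive semi-definite matrices, can be sanity-checked numerically; a quick sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
lam = rng.uniform(-1, 1, n)
h = (np.abs(lam) >= 0.5).astype(float)                 # ideal filter values
h_tilde = h + rng.uniform(-0.05, 0.05, n)              # perturbed (polynomial-like) filter
e_n = np.max(np.abs(h_tilde - h))                      # maximum absolute error
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))       # orthonormal stand-in for U
E = Q @ np.diag(h_tilde - h) @ Q.T
assert np.linalg.norm(E) ** 2 <= n * e_n ** 2 + 1e-9   # inequality (7)

P1 = Q @ np.diag(rng.uniform(0, 1, n)) @ Q.T           # two PSD matrices
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
P2 = Q2 @ np.diag(rng.uniform(0, 1, n)) @ Q2.T
assert np.trace(P1 @ P2) <= np.trace(P1) * np.trace(P2) + 1e-9
```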
Appendix B. Proofs for Theorems in Section 4

B.1 Proof of Lemma 4.1.1

Proof
We follow a similar technique to that of Lemma 3.2 in Rohe et al. (2011). Suppose that $\|c_i^{(n)} - X_{R\,i*}^{(n)}\| < \frac{(1-\epsilon)}{\sqrt{2P_n}}$ for some $i$. For any $z_j \ne z_i$, we have
$$\|c_i^{(n)} - X_{R\,j*}^{(n)}\| \ge \|X_{R\,i*}^{(n)} - X_{R\,j*}^{(n)}\| - \|c_i^{(n)} - X_{R\,i*}^{(n)}\| \ge (1 - \epsilon)\sqrt{\frac{2}{P_n}} - (1 - \epsilon)\frac{1}{\sqrt{2P_n}} = (1 - \epsilon)\frac{1}{\sqrt{2P_n}}.$$
Here we have used the result of Theorem 3.1.1 on the separability of the rows of $X_R^{(n)}$.

B.2 Proof of Theorem 4.1.2

Proof