Scalable and Robust Community Detection with Randomized Sketching
Mostafa Rahmani, Andre Beckus, Student Member, IEEE, Adel Karimian, and George K. Atia, Senior Member, IEEE
Abstract—This paper explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This algorithm improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion.
Index Terms—Clustering, Community Detection, Matrix Completion, Randomized Methods
I. INTRODUCTION
The identification of clusters within graphs constitutes a critical component of network analysis and data mining, and can be found in a wide array of practical applications ranging from social networking [1] to biology [2]. Community detection algorithms identify communities or clusters of nodes within which connections are more dense (see [3] and references therein). Many algorithms have been proposed, including classic approaches such as spectral clustering [4], as well as newer approaches incorporating random walks [5] or the solution of semidefinite programming problems [6].

One common issue shared by community detection algorithms is the high computational cost, e.g., by requiring the exploration of the entire graph and the storage of the full adjacency matrix in memory. We seek to improve upon these limitations, thereby improving scalability and enabling the processing of large graphs.
This work was supported by NSF CAREER Award CCF-1552497. M. Rahmani, A. Beckus, and G. K. Atia are with the Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816 USA. A. Karimian, deceased, was with the Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816 USA. M. Rahmani and A. Beckus contributed equally to this work. A conference version of this work was presented at the 52nd Annual Asilomar Conference on Signals, Systems, and Computers, 2018.
Another recurring issue is that many algorithms fail to identify small clusters. Here, we propose an algorithm which can handle graphs that are highly unbalanced, i.e., when the cluster sizes are highly disproportionate.

To compare with previous work, we perform analysis and experiments with the Stochastic Block Model (SBM) [7], [8], along with a modification to incorporate partial observations. Graphs created from this probabilistic generative model contain a planted set of disjoint clusters (where each node must belong to a cluster). Edges within each cluster are created with probability p, and edges between clusters exist with probability q. Any given edge is observed with probability ρ.

In this work, we establish conditions for exact recovery, i.e., where the probability that the algorithm exactly returns the correct planted partition approaches one as the number of nodes N increases.

A. Contributions
We propose an approach in which a graph clustering algorithm is applied to a small random sub-graph obtained from the original graph through random sampling of the nodes. We provide an analysis establishing conditions for exact recovery using the proposed approach.

The approach allows flexibility to choose both the graph clustering algorithm and the randomized sampling technique. Here, we perform the analysis using the matrix decomposition clustering technique described in [9], [10]. Three randomized techniques are proposed, each of which varies in the information required from the full graph.

Uniform Random Sampling (URS) forms the sample set without any knowledge of the full graph, other than its size. This simple approach can significantly improve the performance of the clustering, both in terms of computational complexity and memory requirements. Suppose the data is balanced, i.e., the clusters are of size Θ(N/r). If the edge and observation probabilities in the generative model are constant, then successful clustering can occur with high probability (whp) using a sketch whose size depends only on r, up to logarithmic factors. This reduces the per-iteration computational complexity of the costly clustering step from O(rN²) on the full graph to the corresponding cost on the much smaller sketch, a substantial reduction when the number of clusters scales sub-linearly with the graph size, with the complexity becoming almost dimension-free when r is order-wise constant.
However, when the clusters are unbalanced in size, a challenging case in community detection, a considerable number of samples will still be required. In particular, if the smallest cluster size is ˜Θ(√N), then Ω(N) samples are required.

To address this issue, we study a sampling method in which the nodes are sampled with probability inversely proportional to their node degrees. This is equivalent to sampling based on the sparsity levels of the columns of the adjacency matrix, hence the appellation 'Sparsity-based Sampling' (SbS). By capturing smaller clusters and producing more balanced sketches, the clustering algorithm can be made even more likely to succeed with the sketch than with the full matrix. Again, if the clusters are of size ˜Ω(√N) and q diminishes sufficiently fast as N increases, then clustering using SbS is highly likely to succeed while requiring a number of samples that depends only on r, up to logarithmic factors (see Section III-B for details). This means that the number of samples is almost independent of the graph size. In fact, this result holds under the same conditions even if the minimum cluster size is reduced to n_min = ˜Θ(r), with a mild restriction on the number of clusters, r = ˜O(√N). As a point of comparison, under the same conditions, the best results we are aware of require n_min = ˜Ω(1) [11], [12]. Therefore, our approach comes close to state-of-the-art performance in this regime, while significantly reducing sample complexity to avoid the heavy cost of clustering a large graph.

Finally, we leverage a randomized structure-preserving sketching technique termed Spatial Random Sampling (SRS) [13]. With this approach, the individual edges of each node are considered, and sampling is performed based on the spatial distribution of the column vectors in the adjacency matrix. Numerical results show that this approach exceeds the performance of URS and SbS in many cases.

A table summarizing the results of this paper is shown in Table I. Additional symbols are defined in the caption, with more explanation given in later sections. In all cases, the per-iteration computational complexity of the clustering step is O(rN′²). Throughout, the soft-O notation ˜O(·) and soft-Ω notation ˜Ω(·) ignore log factors; specifically, f = ˜O(g) if there exists a j such that f ∈ O(g(N) log^j(N)), and f = ˜Ω(g) if there exists a k such that f ∈ Ω(g(N) log^k(N)).

TABLE I: Exact recovery sufficient conditions for the techniques described in this paper. Symbols are the observation probability ρ, the density difference γ = 1 − max{1 − p, q}, and the balance f = N/n_min.

Approach | Minimum cluster size n_min | Required samples N′
Full Graph [9] | Ω(√N log N / (√ρ γ)) | N
URS | Ω(√N log N / (√ρ γ)) | Ω(f log N / (ρ γ))
SbS (large-q), q = ω(f⁻¹) | Ω(r √(qN) log N / (√ρ γ)) | Ω(r q f log N / (ρ γ))
SbS (small-q), q = O(f⁻¹) | Ω(r log N / (ρ γ)) | Ω(r log N / (ρ γ))

II. BACKGROUND AND RELATED WORK
A. Sampling
Randomized sketching techniques have been instrumental in devising scalable solutions to many high-dimensional unsupervised learning problems, such as Principal Component Analysis (PCA) [14], matrix decomposition [15]–[17], outlier detection [18], [19], low rank matrix approximation [20]–[22], data summarization [20], [23], [24], and data clustering [25], [26].

The use of sampling naturally extends to graphs as well. Sampling may be performed on the edges or the nodes, depending on which imposes the dominant cost. We focus on node sampling (although our model accommodates partial observations, which can be considered as a form of uniform edge sampling).

In [27], a complexity-reducing scheme is explored, in which nodes are incrementally added to the sample such that at each step the number of nodes adjacent to the sample is maximized. However, this is largely experimental work, and conditions for exact recovery are not analyzed.

The approach in [28] also uses random node sampling, but considers entirely different scenarios in which the data feature matrices are known. Our setup only assumes availability of similarity and dissimilarity information given the partially observed topology; otherwise no data features are available.

Other works have used node sampling to produce representative sketches of graphs, although not in the specific context of community detection, e.g., [29], [30].

Edge-based approaches include [31], where reduced execution time is demonstrated experimentally by performing uniform edge sampling, and [32], which demonstrates a reduced edge sampling budget through the use of either URS or active sampling. These results are altogether different from our node-based approach.

The object of most sampling algorithms is to capture certain features of the full graph in the sketch, for example cut size [31] or degree distribution [29]. Similarly, we aim to maintain the SBM features of the full graph in the sketch. However, one feature which the SbS and SRS sampling algorithms intentionally seek to modify is the proportion of the clusters in the underlying planted partition. In particular, the clusters in the sketch should be of equal size, or as close as possible to this ideal. By doing so, the sufficient conditions for successful recovery can be significantly relaxed.

The goal of obtaining a balanced sketch is also considered in [33]. However, they do not consider this in the context of clustering, and no analysis or guarantees are provided. In [34], the clustering algorithm iteratively finds and removes large clusters to improve the balance of the remaining small clusters. However, their technique does not improve the computational complexity, as the full graph still needs to be clustered initially. Additionally, although they incorporate sampling, it is in the form of edge sampling via partial observations, rather than node sampling.
B. Correlation Clustering
Many clustering algorithms build on the intuition that there will be a higher edge density within communities than between communities [35]. One approach, correlation clustering, explicitly minimizes the total number of disagreements between the actual graph and a partitioning of the graph into disjoint cliques [36]. An advantage of this formulation is that it can be solved without prior knowledge of the number of clusters.
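As a concrete illustration of this objective, the following short numpy helper counts the disagreements between an adjacency matrix and a candidate partition (missing intra-cluster edges plus spurious inter-cluster edges). This is only a didactic sketch; the function name is ours.

```python
import numpy as np

def disagreements(A, labels):
    """Disagreements between a graph and a partition into disjoint cliques:
    absent edges inside clusters plus present edges across clusters,
    with each unordered node pair counted once."""
    same = labels[:, None] == labels[None, :]
    pairs = np.triu(np.ones_like(A, dtype=bool), 1)   # upper triangle, excluding the diagonal
    missing_within = np.sum((A == 0) & same & pairs)
    extra_across = np.sum((A == 1) & ~same & pairs)
    return int(missing_within + extra_across)
```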
We work with a recent convex algorithm which solves this problem by decomposing the graph's adjacency matrix as a sum of low rank and sparse components [9], [10]. The cluster structure is captured by the low rank component in the form of non-overlapping full cliques, and the sparse component indicates missing edges within clusters and extra edges across clusters. The validity of the low rank plus sparse structure emerges from the fact that p ≥ 1/2 ≥ q.

The decomposition-based approaches allow for partial observability, and also provide a good foundation for analyzing the sampling technique. The SBM-based analysis in [9] provides a direct tradeoff between the SBM parameters: minimum block size, p, q, and ρ. The algorithm therein, if successful, is guaranteed to produce not only the disagreement minimizer, but also whp the correct clustering.

C. Limits of Existing Clustering Algorithms
One factor in determining the success of an algorithm is the density difference γ = 1 − max{1 − p, q} of the SBM. As this gap decreases, the intra-cluster and inter-cluster edges become harder to distinguish. Another factor is the minimum cluster size n_min; as the smallest cluster(s) get proportionally smaller, they become easier to "lose" in the noise. Here, the value of q, the density of noisy edges found between clusters, also plays an important role.

Assuming that other parameters do not scale with N, it is typically considered sufficient to have n_min = ˜Ω(√N). This limit appears in the analytic sufficient conditions of many algorithms (see [37] for a review of algorithms having this limit).

However, when considering some special cases, a more nuanced picture emerges. It is shown in [11] that for the case of equal-sized clusters, if q = O(log N / N) then a polynomial-time algorithm can achieve exact recovery whp for n_min = Ω(log N). A similar result can be found in [12] for the more general case of arbitrarily-sized clusters. Both of these works require a priori knowledge of the sum of squares of the cluster sizes.

The authors of [38] propose and analyze another polynomial-time convex algorithm which performs well under a wide variety of cluster scalings. In particular, they provide an example where clusters are as small as Θ(√(log N)), although this requires additional conditions on the size and number of clusters, and requires that some of these clusters must be asymptotically sparse, i.e., the intra-cluster edge density p → 0.

A table comparing the cluster size lower bounds of this paper to other works is shown in Table II. For an extensive list of algorithms and sufficient conditions, see Table 1 of [9]. Though the algorithms listed in the table are polynomial-time algorithms, at each iteration they all require a Singular Value Decomposition (SVD), which scales quadratically with the graph size. A distinguishing feature of our work is that it can reduce the time of the SVD while coming close to state-of-the-art minimum cluster size requirements.
The authors of [34] demonstrate that even if capturing the small clusters is difficult, finding the large clusters is typically easy. By finding and removing the large clusters, the proportions of the small clusters become more favorable for successful recovery. However, their iterative algorithm still needs to be run initially on the full graph, affording no computational advantage, and they do not provide sufficient conditions for exact recovery of an entire graph (although a weaker result is provided). This idea of improving the proportions of the graph can be found in our approach as well, but in our case, the costly clustering step is only performed once, and performed on the sketch rather than the full graph.

TABLE II: Minimum cluster size requirements in the literature. For purposes of comparison, we assume density gap p − q = γ. For the algorithms of [11], [12], [38], unobserved entries are treated as zero, so we make the substitutions p → ρp and q → ρq.

Algorithm | n_min (large-q) | n_min (small-q)
(Chen, 2014) [9] | Ω(√N log N / (√ρ γ)) | (same as large-q)
(Cai, 2015) [12], (Chen, 2016) [11] | Ω(log N/(ργ) ∨ √(qN)/(√ρ γ)) | Ω(log N/(ργ))
(Jalali, 2016) [38] | Ω(log N/(ργ) ∨ (√n_max ∨ √(qN))/(√ρ γ)) | Ω(log N/(ργ) ∨ √n_max/(√ρ γ))
This paper (URS) | Ω(√N log N / (√ρ γ)) | (same as large-q)
This paper (SbS) | Ω(r √(qN) log N / (√ρ γ)) | Ω(r log N/(ργ))

D. Data Model
The adjacency matrix is assumed to follow a variant of the Planted Partition/Stochastic Block Model [7], [8], which allows for partial observations. This data model is defined as follows [9].
Data Model 1.
The graph consists of N nodes partitioned into r clusters. Any two nodes within a cluster are connected with probability p, and two nodes belonging to different clusters are connected with probability q. Any given edge is observed with probability ρ. Given an adjacency matrix A, the decomposition into the low rank L and sparse S matrices takes the form A = L + S. The matrix L captures the connectivity of the nodes within a cluster and has no inter-cluster edges, while S indicates the missing intra-cluster and extra inter-cluster edges.

We note that for an adjacency matrix A with ideal cluster structure, i.e., where p = 1, q = 0, and ρ = 1, we have L = A and S = 0. In the implementation of the algorithm, the diagonal elements of A are always fully observed and set to all ones for convenience.
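For concreteness, the following numpy sketch generates a graph from Data Model 1. The helper name, the choice of cluster sizes, and the convention of returning the observation mask are illustrative choices, not part of the model definition.

```python
import numpy as np

def sample_data_model_1(cluster_sizes, p, q, rho, seed=0):
    """Partially observed SBM adjacency matrix per Data Model 1.
    Returns the observed adjacency matrix (unobserved entries zeroed),
    the observation mask, and the planted cluster labels."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(cluster_sizes)), cluster_sizes)
    N = labels.size
    same = labels[:, None] == labels[None, :]              # intra-cluster indicator
    edges = (rng.random((N, N)) < np.where(same, p, q)).astype(float)
    edges = np.triu(edges, 1)
    edges = edges + edges.T                                # symmetric, no self-loops yet
    mask = rng.random((N, N)) < rho
    mask = np.triu(mask, 1)
    mask = mask | mask.T                                   # symmetric observation pattern
    A = edges * mask
    np.fill_diagonal(A, 1.0)                               # diagonal fully observed and set to one
    np.fill_diagonal(mask, True)
    return A, mask, labels
```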
E. Notation

Vectors and matrices are denoted using bold-face lower-case and upper-case letters, respectively. Given a vector a, its ℓ_p-norm is denoted by ‖a‖_p and a(i) denotes its i-th element. Given a matrix A, its ℓ₁-norm (the sum of the absolute values of its elements) is denoted by ‖A‖₁, and its nuclear norm (the sum of its singular values) is denoted by ‖A‖_*. The operator Ω_obs(·) returns the observed values of its vector or matrix argument.

III. PROPOSED APPROACH
In this section, we present the three sampling techniques and provide the main results of the analysis. The proofs are deferred to the appendices. We first present URS, and demonstrate the capability of this method to improve the computational complexity of the clustering algorithm. Second, SbS is analyzed and shown to also reduce the computational complexity, while at the same time improving the probability of success when working with unbalanced data. The proof of the main results for the URS approach can be found in Appendix A, and those for SbS in Appendix B. Finally, we describe how SRS can also be applied to unbalanced high-dimensional data.

Algorithm 1 shows the proposed approach. It consists of three main steps: node sampling, sub-graph clustering, and full data clustering.
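As a minimal illustration of the first step (node sampling) with URS, the sketch adjacency matrix A′ can be formed as follows; the helper name and the random seed are incidental choices.

```python
import numpy as np

def urs_sketch(A, n_prime, seed=0):
    """Uniform Random Sampling: draw N' node indices uniformly without
    replacement and return the corresponding principal sub-matrix A'."""
    rng = np.random.default_rng(seed)
    I = rng.choice(A.shape[0], size=n_prime, replace=False)
    return A[np.ix_(I, I)], I
```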
A. Randomized graph clustering using random node sampling
We will separately consider the three main steps of the algorithm.
1) Uniform random node sampling:
A key advantage tothe proposed approach is that it does not apply the clusteringalgorithm to the full data, but rather to a sketch, i.e., a sub-graph generated from a set of N (cid:48) randomly sampled nodes. Wedenote the adjacency matrix of the sketch by A (cid:48) ∈ R N (cid:48) × N (cid:48) .The parameter f = N/n min indicates how balanced thegraph A is. When f = r , the data is perfectly balanced, andlarger values of f indicate a lack of balance. For URS, theprobability of sampling from the smallest cluster is exactlyequal to f − . Therefore, we can expect that any imbalancein the full matrix will carry over to the sketch matrix, thusrequiring more samples to ensure enough columns are sampledfrom the smallest cluster. The following lemma shows thatthe sufficient number of random samples to obtain at least b samples from the smallest cluster whp is in fact linear with f . Lemma 1 (Sampling Size) . Suppose that A (cid:48) is produced byStep 1 (sampling) of Algorithm 1 using Random Sampling of N (cid:48) columns. If N ≥ N (cid:48) ≥ f [ b + log (2 rN )] , (1) then n (cid:48) min > b with probability at least − N − .2) Robust sub-graph clustering: Next, we solve (4) byapplying the matrix decomposition algorithm of [9] to clusterthe sketch A (cid:48) . This approach is valid since the sketch can bedecomposed as A (cid:48) = L (cid:48) + S (cid:48) , where L (cid:48) and S (cid:48) are the sketchesof L and S , respectively (constructed using the same index set I ). The following lemma provides a sufficient condition forthis step to exactly recover L (cid:48) . Lemma 2 (Sketch decomposition) . Suppose the adjacencymatrix A follows Data Model 1. Let ζ = C log Nργ , where C is a constant real number. If N (cid:48) ≤ min (cid:8) n min /ζ , N (cid:9) , (2) N (cid:48) ≥ f [ f ζ + log (2 rN )] , (3) then the optimal point of (4) yields the exact low rankcomponent of A (cid:48) with probability at least − cN − − N − ,where c is a constant real number. Remark 1.
The sufficient number of samples is ˜Ω (cid:16) f ργ (cid:17) .However, if the graph is perfectly balanced, i.e., n min = N/r ,then N (cid:48) = ˜Ω (cid:16) r ργ (cid:17) . Thus, for balanced graphs, the sufficientnumber of randomly sampled nodes is virtually independent ofthe size of the graph. The density difference γ and observabil-ity ρ are also critical factors driving the number of samples. Adecrease in either parameter will drive the sufficient numberof samples higher.The sufficient condition (2) also includes two upper bounds.First, we cannot sample more columns than there are in thefull matrix. Second, we need to avoid the following issue: bysampling a large number of columns, the supply of columnsfrom the smallest cluster may be exhausted. Such an event willlead to more imbalance in the sketch as sampling continues,making decomposition of the sketch even less likely to succeed.From conditions (2) and (3) , in order for there to be a gapbetween the upper and lower bounds as N → ∞ , we need n min = ˜Ω (cid:16) √ N log N √ ργ (cid:17) . Therefore, although URS may improvethe sample and computational complexity (see Section V fordetails), the sufficient condition on cluster size remains aboutthe same as for full-scale decomposition (in fact, because ofthe retrieval step, URS gives slightly poorer performance).3) Full data clustering: The third step infers the partitionof the full graph. This is accomplished by checking the edgesof each node in the full graph and finding the cluster that theedge patterns have the strongest correlation with.The following lemma gives a sufficient condition for thefull data clustering to succeed whp. This condition is given interms of n (cid:48) min , the size of the smallest cluster in the sketch matrix. Lemma 3 (Retrieval) . If n (cid:48) min ≥ pγ log (cid:0) rN (cid:1) , (5) then step 3 (retrieval) of Algorithm 1 will exactly reconstructmatrix L with probability at least − N − . We can readily state the following Theorem which estab-lishes conditions for Algorithm 1 to achieve exact recoverywhp.
Theorem 4.
Suppose the adjacency matrix A follows DataModel 1. If n min ≥ pγ log (cid:0) rN (cid:1) , (6) N (cid:48) ≤ min (cid:8) n min /ζ , N (cid:9) , (7) N (cid:48) ≥ f max (cid:26) f ζ , pγ log (cid:0) rN (cid:1)(cid:27) + 4 f log (2 rN ) , (8) then Algorithm 1 exactly clusters the graph with probabilityat least − cN − − N − , where ζ is defined in Lemma 2. Algorithm 1
Efficient cluster retrieval for full graph
Input: Given adjacency matrix A ∈ ℝ^{N×N}
1. Random Node Sampling:
1.1 Form the set I containing the indices of N′ randomly sampled nodes. Sampling is accomplished using either URS, SbS, or SRS. The sampling is without replacement. Construct A′ ∈ ℝ^{N′×N′} as the sub-matrix of A corresponding to the sampled nodes.
2. Sub-graph Clustering:
2.1 Define L′* and S′* as the optimal point of

  min_{L̇′, Ṡ′}  λ ‖Ṡ′‖₁ + ‖L̇′‖_*   subject to   Ω_obs(Ṡ′ + L̇′) = Ω_obs(A′).    (4)

This problem is solved with Algorithm 1 of [9], where the initial value λ = 1/√(N′ρ) uses ρ, the empirical observation probability from this instance of A. Subsequently, a binary search on λ is performed until a valid result is returned.
2.2 Cluster the sub-graph corresponding to A′ using L′* (we use Spectral Clustering in our experiments).
2.3 If r̂ is the number of detected clusters, define {v_i ∈ ℝ^{N′×1}}_{i=1}^{r̂} as the collection of characteristic vectors of the actual clusters in the sketch matrix, i.e., the set of vectors which span the column space of L′*.
3. Full Data Clustering:
Define a_kI ∈ ℝ^{N′} as the vector of elements of a_k (the k-th column of A) indexed by the set I. Let n̂′_i be the number of elements in the i-th cluster of the sketch (as identified in Step 2 of the algorithm).
For k from 1 to N
  u = arg max_i (a_kI)ᵀ v_i / n̂′_i
  Assign the k-th node to the u-th cluster.
End For
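For readers who want to prototype Steps 2 and 3 of Algorithm 1, the sketch below poses problem (4) with cvxpy (a generic solver, not the tailored ALM solver of [9] used in our experiments), clusters the recovered low rank component with scikit-learn's spectral clustering, and performs the correlation-based retrieval of Step 3. The regularization value lam and the number of clusters are supplied by the caller, and all names are illustrative.

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import SpectralClustering

def decompose_sketch(A_sub, mask_sub, lam):
    """Prototype of (4): minimize lam*||S'||_1 + ||L'||_* subject to the
    observed entries of L' + S' matching the sketch adjacency matrix."""
    n = A_sub.shape[0]
    L = cp.Variable((n, n))
    S = cp.Variable((n, n))
    obs = mask_sub.astype(float)                       # 1 on observed entries, 0 elsewhere
    objective = cp.Minimize(lam * cp.sum(cp.abs(S)) + cp.normNuc(L))
    constraints = [cp.multiply(obs, L + S) == cp.multiply(obs, A_sub)]
    cp.Problem(objective, constraints).solve()
    return L.value, S.value

def cluster_and_retrieve(A, I, L_star, n_clusters):
    """Steps 2.2-3: cluster the sketch via the recovered low rank component,
    then assign every node of the full graph to the cluster whose
    characteristic vector has the largest normalized correlation with the
    node's sampled edge pattern."""
    affinity = np.clip(L_star, 0.0, 1.0)               # L'* is approximately a 0/1 block matrix
    sketch_labels = SpectralClustering(n_clusters=n_clusters,
                                       affinity="precomputed").fit_predict(affinity)
    V = np.stack([(sketch_labels == i).astype(float)   # characteristic vectors v_i
                  for i in range(n_clusters)], axis=1)
    sizes = V.sum(axis=0)                              # sketch cluster sizes n'_i
    scores = (A[I, :].T @ V) / sizes                   # (a_kI)^T v_i / n'_i for every node k
    return scores.argmax(axis=1)
```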
Remark 2.
The sufficient condition is essentially that of Lemma 2, with two additional constraints to ensure that the retrieval step is successful: condition (6) imposes a constraint on the minimum cluster size in the full graph, and the lower bound on N′ in (8) is modified to ensure the retrieval step is successful. Nonetheless, with these additional constraints, the sufficient number of samples N′ remains of the same order as described in Remark 1.

B. Sparsity-based Sampling

The URS method achieves significant computational complexity improvements for balanced graphs. However, imbalance in the graphs will tend to carry over to the sketch, thus requiring a large number of samples to ensure that the sketch adequately captures the small clusters.

In this section, a new sampling method which can yield a more balanced sketch of unbalanced data is presented. This method, designated 'Sparsity-based Sampling', samples the sparser columns of the adjacency matrix, which represent nodes with fewer connections, with a higher probability. Specifically, the probability of the i-th node being selected is set inversely proportional to the degree of the node, i.e., the ℓ₁-norm of the corresponding column ‖a_i‖₁, a factor which is a measure of sparsity. When calculating the norm, the entries corresponding to unobserved edges are set to zero. Note that, because the diagonal element is always populated with one, the ℓ₁-norm will always be greater than zero. The sampling probabilities are properly normalized so that they sum up to one.
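A minimal numpy sketch of this sampling rule is given below. It draws the N′ indices in one weighted pass without replacement, which only approximates the with-replacement-and-discard-duplicates procedure analyzed in this section; the helper name is illustrative.

```python
import numpy as np

def sbs_sample(A_obs, n_prime, seed=0):
    """Sparsity-based Sampling: select nodes with probability inversely
    proportional to the observed degree (the l1 norm of the corresponding
    column, with unobserved entries set to zero)."""
    rng = np.random.default_rng(seed)
    degrees = A_obs.sum(axis=0)          # >= 1 because the diagonal is set to one
    probs = 1.0 / degrees
    probs /= probs.sum()                 # normalize so the probabilities sum to one
    I = rng.choice(A_obs.shape[0], size=n_prime, replace=False, p=probs)
    return A_obs[np.ix_(I, I)], I
```

In the ideal setting of Proposition 5 below (A = L, fully observed), every column in cluster i has degree n_i, so each cluster receives total probability mass (n_i · n_i⁻¹) / Σ_j (n_j · n_j⁻¹) = 1/r, regardless of the cluster sizes.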
To clearly illustrate the advantage of this method, we will first consider a sketch produced by sampling with replacement, and later consider the more difficult scenario where sampling is performed without replacement. The next proposition provides a result when the graph consists of disjoint cliques, A = L, i.e., the graph is uncorrupted.

Proposition 5. Suppose A = L, where L is as defined in Data Model 1, and all edges are observed. Then, using SbS with replacement, the sampling probability from every cluster is equal to 1/r.

Hence, there is an equal probability of sampling from each cluster regardless of the cluster sizes. Even if the clusters are highly unbalanced, i.e., the largest cluster is much larger than the smallest cluster, SbS will tend to produce a sketch which is near-perfectly balanced.

In fact, we can still improve the balance of the sketch even if the graph is corrupted. The next result will depend on the mean degree of a node in the smallest cluster, which is

  µ_min = ρ [(p − q) n_min + qN].    (9)

We will always assume that

  µ_min = ω(log(rN)).    (10)

This constraint on µ_min is satisfied for the clustering problems considered herein (due to the required growth of n_min for successful clustering). The following lemma provides a lower bound on the probability t_min of sampling from the smallest cluster using SbS. Lemma 6.
Suppose the adjacency matrix A follows DataModel 1 and all conditions in the statement of Lemma 2 hold.Define η = (cid:20) qp ( f − (cid:21) , (11) α = (cid:115) N ) µ min . (12) If α < , and sampling is performed using SbS with replace-ment, then t min ≥ − α α rη (13) with probability at least − N − . Remark 3.
The variable α is primarily dependent on themean degree µ min of the smallest cluster. This mean valuemay be small in some challenging cases: when few entriesare observed, when ( p − q ) is small, or when the clusters areextremely small. However, as this mean increases, the boundson the probability improve. Due to the assumption (10) , wehave − α α → .In (11) , the variable η reflects an important trade-off. Letus first consider how η behaves for fixed N and n min . Atone extreme, as q/p → , we have η → f , which means thatthe probability will strongly depend on the proportion of theminimum cluster size to the full graph, as in URS. At the otherextreme, as q → , we have η → . In this case, assuming α is small, the sampling probability for each cluster approaches r − , leading to a roughly equal chance of sampling from eachcluster. In terms of asymptotic behavior, we have η = O ( qf ) if q > , or η = 1 exactly if q = 0 . Remark 4.
The r in the denominator of (13) is due to someconservatism in the bounding techniques. However, in theregime where γ is small, a better approximation dispenseswith this factor in the sufficient condition, such that SbS sam-pling probabilities become roughly equal to those of randomsampling. See Remark 8 in Appendix B for more details. Now, we consider the SbS sampling process without re-placement, as is used in Step 1 (sampling) of Algorithm 1. Ifa column is sampled a second time, the duplicate sample isdiscarded and not counted towards the sampling budget N (cid:48) .The next lemma shows that as a consequence of the balancedsampling probability, fewer samples are required to obtain abalanced sketch. Lemma 7 (Sampling Size for SbS) . Suppose that A (cid:48) is pro-duced by SbS with N (cid:48) sampled columns (without replacement).Given an integer b with n min > b > , let g = η β − β (cid:18) − bn min (cid:19) − , (14) β = (cid:115) rN ) µ min , (15) where η is as defined in (11) . Then, n (cid:48) min ≥ b with probabilityat least − N − provided that β < and N ≥ N (cid:48) ≥ rg [ b log( N ) + log (2 rN )] . (16) Remark 5.
Note that the sufficient condition of Lemma 7 has asimilar structure to that of Lemma 1. In the sufficient conditionfor URS (1) , the main factor is the variable f , whereas forSbS in (16) , this factor changes to rg .As columns are sampled from a particular cluster, theprobability of sampling from this cluster may decrease. Thispossibility is reflected by the (cid:16) − bn min (cid:17) − penalty termwhich appears in (14) . This effect will be small for largeclusters, but may become significant for small clusters as therequired number of samples approaches the size of the smallestcluster. Remark 6.
Due to the assumption (10) , we have − β β → .Furthermore, suppose that b does not approach n min (forexample, assume that b ≤ cn min for some constant c ). Giventhese conditions, the number of samples to guarantee sufficientrepresentation of the smallest cluster in the sketch whp is N (cid:48) = ˜Ω ( qbrf ) for q > . However, q may scale with N . Inparticular, if q = O (cid:0) f − (cid:1) , then we get a much better result: N (cid:48) = ˜Ω ( br ) . We will refer to this as the small-q regime. Thesituation where q = ω (cid:0) f − (cid:1) will be referred to as the large-qregime. We now proceed to state the main result, which will consistof three theorems. First, we provide a sufficient conditionon the number of samples to obtain sketch clusters of size Ω( √ N (cid:48) log N (cid:48) ) , thus setting the stage for successful clusteringof the sketch. Theorem 8 (Cluster Size for SbS) . Let g (cid:48) = 2 η β − β , (17) ζ (cid:48) = C log Nρ (cid:48) γ (cid:48) . (18) If A (cid:48) is produced by SbS with N (cid:48) ≤ min (cid:26) n min ζ (cid:48) , N (cid:27) , (19) N (cid:48) ≥ rg (cid:48) (cid:2) rg (cid:48) ζ (cid:48) log N + 2 log(2 rN ) (cid:3) , (20) then n (cid:48) min ≥ √ CN (cid:48) log N (cid:48) √ ρ (cid:48) γ (cid:48) (21) with probability at least − N − . Next, we look at the stochastic properties of the sketchinduced by the randomness from the probabilistic samplingprocedure in addition to that of the generative SBM. Let p (cid:48) i bethe intra-cluster edge density for cluster i in the sketch matrix.Likewise, let q (cid:48) be the inter-cluster edge density, and ρ (cid:48) theobservation probability for the sketch matrix. If we performURS, then we have p (cid:48) i = p (for all clusters), q (cid:48) = q , and ρ (cid:48) = ρ . However, when the graph is sketched using SbS, therecould be a bias of p (cid:48) , q (cid:48) , and ρ (cid:48) toward slightly smaller valuesthan in the original graph. The next theorem demonstrates notonly that the bias will be extremely small, but also that theprobabilities of the sketch asymptotically approach those ofthe full graph as N → ∞ . Theorem 9 (Sketch probabilities p, q, ρ for SbS) . If A isconstructed using Data Model 1, and subsequently sampledusing SbS, then the sketch matrix densities will be bounded by p ≥ p (cid:48) i ≥ p − (22) q ≥ q (cid:48) ≥ q − (23) ρ ≥ ρ (cid:48) ≥ ρ − (24) where p − = p (1 − (cid:15) ) , (25) q − = q (1 − (cid:15) ) , (26) ρ − = ρ (1 − (cid:15) ) , (27) (cid:15) = N (1 − ρp ) n min (1 − ρq ) n min , (28) (cid:15) = N (1 − ρ ) N . (29)We can readily state the following theorem which providesguarantees for clustering using SbS based on the statisticalproperties from Theorem 9. Theorem 10.
Define p (cid:48) as the intra-cluster edge probabilityof the sketch (assuming the probability is the same for allclusters), and γ (cid:48) = 1 − { − p (cid:48) , q (cid:48) } . Suppose that
1) A sampling algorithm produces a sketch following theSBM with parameters p (cid:48) = p − , (30) q (cid:48) = q − , (31) ρ (cid:48) = ρ − , (32) n (cid:48) min ≥ √ CN (cid:48) log N (cid:48) √ ρ (cid:48) γ (cid:48) , (33) where p − , q − , and ρ − are as in (25) - (27) .2) The following two conditions hold: n min ≥ pγ log (cid:0) rN (cid:1) , (34) N ≥ N (cid:48) ≥ rg (cid:48) log(2 rN ) (cid:20) pγ log( N ) + 1 (cid:21) . (35) Then, Algorithm 1 exactly clusters the graph with probabilityat least − cN − − N − , where c is a constant real number. Remark 7.
Combining Theorems 8-10, the sampling complexity will be roughly ˜Ω(r q f / (ργ)). The main savings in this sampling complexity come when q is sufficiently small. In the small-q setting, the sampling complexity becomes ˜Ω(r / (ργ)), thus making the sufficient condition almost independent of the graph size. In the large-q regime, we need n_min = ˜Ω(r √(qN) / (√ρ γ)), while in the small-q regime, we only need n_min = ˜Ω(r / (ργ)). In the small-q regime, the number of clusters must be O(γ √(ρN) / log N).

As shown in Table II, the best achievability results for a convex algorithm are n_min = Ω(log N/(ργ) ∨ √(qN)/(√ρ γ)) for the large-q regime and n_min = Ω(log N/(ργ)) for the small-q regime [11], [12]. In addition to the aforementioned computational gain of our sketch-based approach, our result compares favorably with these works considering that the clustering algorithm we use only guarantees success if n_min = Ω(√N log N / (√ρ γ)) when clustering the full graph. Additionally, the analysis of [11] requires that clusters be of equal size, and both [11] and [38] require strong side information, i.e., the sum of squares of cluster sizes.

Note that SbS only looks at the degree, which is a count of edges. One would expect that taking into account the individual edges making up this count would yield further improvement. Next, we present a method which does just this.

IV. NODE SAMPLING USING SPATIALLY RANDOM SAMPLING
In Section III-B, we presented SbS, which notably raises the chance of sampling from small clusters as compared to URS. In this section, we present our second node sampling method with which we can obtain a balanced sketch from unbalanced data. In [13], two of the authors proposed Spatially Random Sampling (SRS) as a new randomized data sampling method. The main idea underlying SRS is to perform the random sampling in the spatial domain. To this end, the data points (here, the columns of the adjacency matrix) are projected onto the unit sphere, then points are sampled successively based on their proximity to randomly chosen directions in the ambient space. Thus, with SRS the probability of sampling from a specific data cluster depends on the amount of space the cluster occupies on the unit sphere (since SRS is applied to the normalized, unit ℓ₂-norm data points). Accordingly, the probability of sampling using SRS is nearly independent of the population sizes of the clusters [13]. For further details, we refer the reader to [13].

Suppose that A = L. In this case, the columns of A lie in the union of r 1-dimensional subspaces, one per cluster.

Lemma 11.
Suppose A = L and SRS is used to sample one column of A. If the corresponding node is sampled, then the probability of sampling from each cluster is equal to 1/r.
A. Efficient Implementation
In contrast to the low rank approximation-based column sampling methods, SRS is not sensitive to the existence of linear dependence between the clusters [13]. Thus, in order to reduce the computation complexity, we first embed the columns of the adjacency matrix using a computationally efficient embedding method to reduce the dimensionality of A. For instance, we can use a random binary embedding matrix (whose elements are independent random variables with values ±1 with equal probability) to embed the columns of A into a lower dimensional space and apply the SRS algorithm to the embedded data. Data embedding using a random binary matrix is significantly faster than using a conventional random Gaussian matrix since it does not involve numerical multiplication, and comes with no significant loss in performance. Algorithm 2 presents the SRS-based node sampling method.

Algorithm 2
Efficient Node Sampling using SRS
Input: Given adjacency matrix A ∈ ℝ^{N×N}
1. Random embedding:
Calculate A_φ = ΦA, where Φ ∈ ℝ^{m×N} is a random binary matrix. Normalize the ℓ₂-norm of the columns of A_φ, i.e., set a_φi = a_φi / ‖a_φi‖₂, where a_φi is the i-th column of A_φ.
2. Node Sampling:
Apply the SRS algorithm (Algorithm 1 of [13]) to sample N′ nodes/columns (without replacement).
Output: Set I as the set of the indices of the sampled nodes/columns in Step 2.
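As a rough illustration of Algorithm 2, the sketch below performs the random ±1 embedding and column normalization of Step 1, followed by a simplified rendition of spatially random sampling in which each sample is the remaining column closest to a freshly drawn random direction on the unit sphere. The exact SRS procedure is Algorithm 1 of [13]; the greedy selection rule and the default embedding dimension m used here are stand-ins.

```python
import numpy as np

def srs_sample(A, n_prime, m=500, seed=0):
    """Random binary embedding followed by a simplified spatially random sampling."""
    rng = np.random.default_rng(seed)
    N = A.shape[1]
    Phi = rng.choice([-1.0, 1.0], size=(m, N))                     # random +/-1 embedding matrix
    A_phi = Phi @ A
    A_phi = A_phi / np.linalg.norm(A_phi, axis=0, keepdims=True)   # unit l2-norm columns
    remaining = list(range(N))
    I = []
    for _ in range(n_prime):
        d = rng.standard_normal(m)
        d /= np.linalg.norm(d)                                     # random direction on the sphere
        j = max(remaining, key=lambda k: float(d @ A_phi[:, k]))   # closest remaining column
        I.append(j)
        remaining.remove(j)                                        # sampling without replacement
    return np.array(I)
```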
B. Preparing Data for SRS

Because SRS is not designed to support missing values, an important issue with adopting SRS as our node sampling algorithm is the missing values of the adjacency matrix A. One possible and easy solution is to replace the missing values with zeros. While this works well when a small fraction of the elements are missing, it degrades the performance of SRS if a notable part of the adjacency matrix is missing. The main reason stems from the previous observation, that is, if the data is clean and complete, then the columns corresponding to a cluster lie in a 1-dimensional subspace. If we replace the missing values with zeros, the columns corresponding to the large cluster significantly diffuse in the space. Thus, SRS ends up sampling many columns from the large clusters. We address this problem by pre-completing the adjacency matrix. The larger clusters are easily captured using random sampling. Thus, we apply Algorithm 1 with URS to A to complete it. Algorithm 3 presents the pre-completion step, and Algorithm 4 provides the full graph clustering approach using SRS.

Algorithm 3
Data Pre-Completion
Input: Given adjacency matrix A ∈ ℝ^{N×N}
1. Random Node Sampling:
1.1 Form the set I consisting of the indices of N′ randomly sampled nodes and construct A′ ∈ ℝ^{N′×N′} as the sub-matrix of A corresponding to the sampled nodes.
2. Sub-graph Clustering:
2.1 Similar to Step 2.1 of Algorithm 1.
2.2 Similar to Step 2.2 of Algorithm 1.
2.3 If r̂ is the number of detected clusters, the vectors {v_i ∈ ℝ^{N′×1}}_{i=1}^{r̂} are defined as in Step 2.3 of Algorithm 1. The vector v_0 ∈ ℝ^{N′×1} is defined as a zero vector.
3. Adjacency Matrix Generation:
3.1 Initialize U ∈ ℝ^{N×r̂} as a zero matrix.
3.2 For k from 1 to N
      j = arg min_i ‖a_kI − v_i‖
      If j > 0, then U(k, j) = 1.
    End For
    Note: the index of the columns/rows starts from 1 (similar to MATLAB).
3.3 Compute the completed matrix A_c = UUᵀ + A and clamp the elements of A_c which are greater than 1 to 1.
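A compact numpy rendition of Step 3 of Algorithm 3 is given below. It assumes the characteristic vectors of the detected sketch clusters are stacked as the columns of V (with the all-zero vector playing the role of v_0) and that I is the URS index set; the function name is illustrative.

```python
import numpy as np

def precomplete(A, I, V):
    """Step 3 of Algorithm 3: add back the block structure implied by assigning
    each node to its nearest characteristic vector, then clamp entries to one."""
    N = A.shape[0]
    r_hat = V.shape[1]
    U = np.zeros((N, r_hat))
    for k in range(N):
        a_kI = A[I, k]                                       # entries of the k-th column indexed by I
        dists = [np.linalg.norm(a_kI)]                       # distance to the zero vector v_0
        dists += [np.linalg.norm(a_kI - V[:, i]) for i in range(r_hat)]
        j = int(np.argmin(dists))                            # j = 0 means no cluster is close enough
        if j > 0:
            U[k, j - 1] = 1.0
    return np.minimum(U @ U.T + A, 1.0)                      # A_c = clamp(U U^T + A)
```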
Algorithm 4
Proposed Approach with SRS
Input: Given adjacency matrix A ∈ ℝ^{N×N}
0. Pre-completion: Apply Algorithm 3 to A to obtain A_c.
1. Random Node Sampling:
Apply Algorithm 2 to A_c to obtain the set I of indices of the sampled nodes. Construct A′ ∈ ℝ^{N′×N′} as the sub-matrix of A corresponding to the sampled nodes.
2. Sub-graph Clustering:
2.1 Similar to Step 2.1 of Algorithm 1.
2.2 Similar to Step 2.2 of Algorithm 1.
2.3 If r̂ is the number of detected clusters, the vectors {v_i ∈ ℝ^{N′×1}}_{i=1}^{r̂} are defined as in Step 2.3 of Algorithm 1.
3. Full Data Clustering:
Similar to Step 3 in Algorithm 1.
V. COMPUTATIONAL COMPLEXITY
We first consider the complexity of a single iteration of the convex optimization in Step 2 of Algorithm 1, which comprises the principal cost in most cases. The complexity of each iteration is dominated by the SVD computation, requiring Ω(rN²) computations per iteration for the full graph and Ω(rN′²) for the sketch.

If URS is used, substituting the required number of samples from Remark 1 shows that the per-iteration cost is governed by the balance f, ρ, and γ rather than by the full graph size. If the graph is perfectly balanced, the cost depends only on r, ρ, and γ, a considerable saving when the number of clusters scales sub-linearly with N. Likewise, the sketch subgraph is small enough to significantly reduce the memory requirements, or even to allow processing of large graphs that are virtually impossible to cluster otherwise.

For SbS, the per-iteration cost in the large-q regime is governed by the corresponding sample complexity in Table I. In the small-q regime, however, it can become almost independent of the graph size.

The computational complexity of Step 1 of Algorithm 1 will depend on the sampling algorithm used. For random sampling, it is linear in N′ and independent of the graph size. For SbS, a linear number of ℓ₁-norms must be calculated. For SRS, the sampling step is of order O(N′N²), i.e., for each sample we need to take an inner product with each column, albeit the binary embedding makes these inner products cheaper, of order O(mN′N), where m is the dimension of the embedding space.

The retrieval step requires an inner product with each column for each cluster, so its complexity is linear in N.

VI. NUMERICAL EXPERIMENTS
In this section, we perform a set of numerical simulations tostudy the performance of the proposed randomized frameworkand compare it to algorithms which cluster the full-scale graph.First, we demonstrate the substantial speedups afforded bythe proposed approach, and then perform experiments withunbalanced data to showcase the effectiveness of the SbSscheme. Cases are presented in which the proposed methodwith SbS can even outperform full-scale decomposition. Anexperiment is considered successful if the algorithm exactlyreconstructs the low rank matrix L with no errors. Each pointin the phase transition plots is obtained by averaging over 20independent runs.When executing the algorithms of [9], [12], we estimate thenumber of clusters using Spectral Clustering [39] applied tothe obtained low rank component. For efficiency in obtainingthe numerical results, when executing the algorithm of [9] weuse λ = √ N (cid:48) , a value which we found to work well across awide range of regimes. A. Running time
We compare the running time of the proposed randomizedmethod with the full-scale clustering algorithms of [9] and[12]. The parameters of the Data Model 1 are set to r = 2 , p = 0 . , q = 0 . , ρ = 0 . , and n = n = N/ . For therandomized method we use URS with N (cid:48) = 200 samples. Therunning time in seconds is shown as a function N in Fig. 1.The full-scale and sketching-based approaches cluster the dataaccurately in every case. However, in all cases the randomizedmethod is substantially faster, running in less than 1.5 seconds.The full-scale decomposition [9], on the other hand, rangesfrom 3.3 seconds for N = 500 to 647 seconds for N = 10000 ,and the time for the algorithm of [12] ranges from 1.4 secondsfor N = 500 to 1693 seconds for N = 10000 . The main factorin the fast execution speed of the randomized approach is that Fig. 1. Timing comparison between Algorithm 1 with SbS, full-scaledecomposition with (Chen,2014) [9], and full-scale clustering with (Cai,2015)[12]. Times are averaged over five runs. the matrix decomposition algorithm is applied to the muchsmaller sketch matrix. The complexity of the retrieval stepof Algorithm 1 is linear with N , making its run time impactinsignificant. B. Clustering unbalanced graphs
In this experiment, we demonstrate the performance of theproposed randomized approach in terms of sampling com-plexity and minimum cluster size. The graph follows DataModel 1 with parameters p = 0 . , q = 0 . , ρ = 0 . , and N = 5000 . The graph consists of two small clusters withsizes n = n = n min , and one large cluster with size n = 5000 − n min . Phase transition plots are shown in Fig. 2,comparing the URS, SbS, and SRS sampling techniques withrespect to sample complexity and minimum cluster size.For SRS, we set m = 500 in Algorithm 2. Additionally, wefound that sampling part of the sketch using URS improvedthe success rate: here we acquire N (cid:48) / samples via URS, and N (cid:48) / via SRS.The phase transitions are shown over the domain ≤ N (cid:48) ≤ and ≤ n min ≤ . The algorithm whichuses SbS can yield exact clustering even when the URS-basedalgorithm fails due to highly unbalanced data. For example,when n min = 200 , the SbS algorithm extracts accurate clustersusing only N (cid:48) = 200 samples (only 4% of the total nodes).This improved performance of SbS is due to the morebalanced sketches, stemming from the larger sampling prob-ability that SbS places on the columns in the small clusters.Fig. 2(d) shows the probability with which the smallest clusteris sampled for URS and SbS (averaged over 20 independentruns), corresponding to the N (cid:48) = 400 row in Fig. 2(a) and (b).The dashed line shows the ideal probability in which eachcluster has equal sampling probability. We can see that forURS, the probability is approximately n min /N . However, SbSprovides at least two-fold improvement over URS.Fig. 2(e) shows the resulting minimum cluster sizes in thesketch, again for N (cid:48) = 400 averaged over 20 independentruns. The black dashed line shows the ideal minimum clustersize to attain the most balanced sketch, i.e. where the clustersare equal-sized. For URS and SbS, we see the same trends asfound in Fig. 2(d). We also show the results for SRS, wherethe proportions come quite close to the ideal.In addition to being significantly faster than the full-scaleclustering algorithms, the proposed algorithm can even out- Fig. 2. Phase transition plots for a) URS, b) Sparsity-based Sampling, and c)Spatial Random Sampling. White regions indicate success and black regionsfailure. d) shows the probability of sampling from the smallest cluster for N (cid:48) = 400 . The minimum size of the sketch clusters is shown in e) for eachsampling method, for fixed N (cid:48) = 400 . perform the full-scale algorithm in terms of success rate. Asdescribed in Remark 7, this occurs when the inter-clusterprobability q is sufficiently small. We remark that this doesnot violate the data processing inequality which indicatesthat post-processing cannot increase information [40]. Rather,many full-scale clustering algorithms are not robust to dataunbalancedness in the sense that they often fail to yieldaccurate clustering with unbalanced data. As an example,consider the scenario where p = 0 . , q = 0 . , ρ = 0 . , N = 5000 , with a graph composed of three clusters with twosmall clusters of size n = n = n min and one dominantcluster of size n = 5000 − n min . Fig. 3(a) comparesthe probability of success of the proposed randomized withSbS and the two full-scale decomposition algorithms, as afunction n min . Using SbS with 500 sampled nodes (only 10%of the total number nodes), the proposed approach yields exactclustering even when n min = 120 . 
On the other hand, when n min ≤ , the full-scale decomposition algorithm [9] failsto yield accurate clustering. While [12] can have strongerasymptotic guarantees than [9] if q → , in this case ithas similar performance. In fact, the large-q lower bound on n min for [12] (see Table II) consists of a maximum over twoterms. For these parameters, the two terms are almost equal,suggesting that q is large enough to degrade the performanceof [12]. Fig. 3(b) shows the phase transition in terms of n min and N (cid:48) of the randomized approach with SbS under the same Fig. 3. a) Success probability for full-scale clustering and SbS approachesas a function of n min . b) Phase transition with SbS.Fig. 4. Comparison of (a) SbS and (b) SRS for q large. setup as in Fig. 3(a).Now, we turn our focus to SRS. Smaller values of p and ρ will increase the dispersion of the larger cluster, thus degradingthe results of SRS. However, this degradation is offset whenthe value of q is increased, thus causing dispersion of thesmaller clusters as well. To illustrate this, in Fig. 4, we showthe results when q is larger. As in Fig. 2, we use parameters p = 0 . , ρ = 0 . , and N = 5000 , but here we set q = 0 . .In this regime, SbS performs worse with larger q , showing nosuccesses for n min ≤ , whereas SRS continues to performwell for small n min . A PPENDIX AP ROOF FOR R ANDOM S AMPLING A PPROACH
In this appendix, we provide proof of the lemmas andtheorems for URS presented in Section III-A. The set C i contains the indices of the columns which belong to cluster i , and the set I i = C i ∩ I contains the indices of sampledcolumns which belong to cluster i . Proof of Lemma 1.
To simplify the analysis, we treat thediagonal elements of the adjacency matrix as regular intra-cluster connections, i.e. a self-loop occurs with probability p and is observed with probability ρ . This modification canonly hurt performance, and therefore provides a valid lowerbound. Additionally, as the graph size increases, the differencein performance will be negligible.For simplicity, we will perform the analysis on a Bernoullisampling model in place of the uniform sampling model. Inthe Bernoulli model, a Bernoulli trial with success probability N (cid:48) /N is repeated for each node, and a node is included if itstrial is successful. Let Ω (cid:48) be the index set of columns sampledusing the Bernoulli model, and Ω be the index set of columnssampled using URS (without replacement). Although the exact number of samples will vary around a mean of N (cid:48) in theBernoulli model, from [41], we can conclude that P ( n (cid:48) min < b | I = Ω (cid:48) ) ≥ P ( n (cid:48) min < b | I = Ω) . (36)In words, the probability of failing to sample sufficientcolumns using URS is no more than twice the probability offailing using Bernoulli random sampling.Now, in the Bernoulli model the number of samples n (cid:48) i fromcluster i is a Binomial random variable with n i independentexperiments, each with success probability N (cid:48) /N .Let ξ i , i = 1 , . . . , r be such that the number of samples N (cid:48) = ξ i bN/n i . Using the Chernoff bound for Binomialdistributions [42] and the union bound, P ( n (cid:48) min ≥ b | I = Ω) ≥ − r (cid:88) i =1 P ( n (cid:48) i < b | I = Ω (cid:48) ) ≥ − r (cid:88) i =1 exp (cid:18) − b ( ξ i − ξ i (cid:19) (37)Thus, if (1) holds, then ξ i ≥ b log (2 rN ) for ≤ i ≤ r and the RHS of (37) is lower-bounded by (1 − N − ) .The upper bound in (1) is required since we are samplingwithout replacement, and the sketch matrix cannot be largerthan the full matrix. Proof of Lemma 2.
First we consider the lower bound in (3).We need to sample a sufficient number of columns such thatthe decomposition of the sketch will be successful whp. Let b = √ CN (cid:48) log N (cid:48) √ ργ . Successful decomposition is guaranteed byTheorem 4 of [9] with probability − cN (cid:48)− if n (cid:48) min ≥ b . Ifsufficient conditions (2) and (3) hold, then N ≥ N (cid:48) ≥ f (cid:34) √ CN (cid:48) log N (cid:48) √ ργ + log (2 rN ) (cid:35) , (38)and therefore Lemma 1 guarantees that n (cid:48) min ≥ b withprobability at least (1 − N − ) .The cluster sizes in the sketch cannot exceed those in thefull graph, so we need n min ≥ b . This is satisfied by (7),which ensures that n min ≥ ζN (cid:48) . Proof of Lemma 3.
Define the inner product between the j th column and the i th characteristic vector of L (cid:48) (i.e. the eigenvec-tor of L (cid:48) representing cluster i ) as U i ( j ) = ( a (cid:48) j ) T v i . Supposethat a (cid:48) j belongs to the (cid:96) th cluster. Then, U (cid:96) ( j ) ∼ Bin( n (cid:48) (cid:96) , p ) whereas U i ( j ) ∼ Bin( n (cid:48) i , q ) for any i (cid:54) = (cid:96) .Now, we define u i ( j ) = U i ( j ) /n (cid:48) i , which is the normalizedinner product used in step 2.3 of Algorithm 1. Let τ = p + q and note that E [ u (cid:96) ( j )] = p and E [ u i ( j )] = q for i (cid:54) = (cid:96) . Then,from the upper and lower Chernoff bounds [42], it follows that P ( u (cid:96) ( j ) ≤ τ ) ≤ exp (cid:18) − ( p − q ) p n (cid:48) (cid:96) (cid:19) , (39) P ( u i ( j ) ≥ τ ) ≤ exp (cid:18) − p − q ) q + 4( p − q ) n (cid:48) i (cid:19) . (40) If (5) holds, then n (cid:48) min ≥ p ( p − q ) log (cid:0) rN (cid:1) > p + 5 q )3( p − q ) log (cid:0) rN (cid:1) (41)(since p > q ) and the right hand sides of (39) and (40) willbe upper bounded by ( rN ) − .Then, the probability that the algorithm will fail to classifycolumn j is the probability that the normalized inner productis larger for an incorrect clusters than for the correct cluster, P r (cid:91) i =1 i (cid:54) = (cid:96) u (cid:96) ( j ) ≤ u i ( j ) ≤ P [ u (cid:96) ( j ) ≤ τ ] (cid:91) r (cid:91) i =1 i (cid:54) = (cid:96) u i ( j ) ≥ τ ≤ P ( u (cid:96) ( j ) ≤ τ ) + r (cid:88) i =1 i (cid:54) = (cid:96) P ( u i ( j ) ≥ τ ) = N − (42)Thus, the probability of correctly classifying all columns is P N (cid:92) j =1 r (cid:92) i =1 i (cid:54) = (cid:96) u (cid:96) ( j ) > u i ( j ) ≥ − N (cid:88) j =1 P r (cid:91) i =1 i (cid:54) = d u (cid:96) ( j ) ≤ u i ( j ) > − N − (43) Proof of Theorem 4.
First, let b = pγ log (cid:0) rN (cid:1) . Conditions(7) and (8) imply that N ≥ N (cid:48) ≥ f (cid:20) pγ log (cid:0) rN (cid:1) + log (2 rN ) (cid:21) , (44)and so Lemma 1 guarantees that n (cid:48) min ≥ b with probability atleast (1 − N − ) . If Lemma 1 succeeds, then retrieval is guar-anteed with probability (1 − N − ) by Lemma 3. Additionally,for retrieval to be possible, we require n min ≥ b (i.e., thelower bound on cluster size cannot exceed the smallest clusterin the full graph), which is satisfied by (6).Next, conditions (7) and (8) also satisfy the conditions ofLemma 2, thus guaranteeing successful decomposition withprobability − cN − − N − .Therefore, the complete clustering algorithm will be suc-cessful with probability at least − cN − − N − .A PPENDIX BP ROOFS FOR S PARSITY - BASED S AMPLING A PPROACH
In this appendix, we provide proofs related to the SbSsampling approach.First, we present some technical lemmas which will be usedin the remainder of this section. We use the expected degreefor a column in cluster i , which is µ i = ρ [( p − q ) n i + qN ] . We will often find it useful to work with a “typical” graph,i.e., one whose adjacency matrix has all node degrees closeto their respective mean values. Specifically, define the set oftypical graphs as A (cid:15) = (cid:8) A | (cid:12)(cid:12) (cid:107) a j (cid:107) − µ i (cid:12)(cid:12) ≤ (cid:15) µ i , ≤ i ≤ r, j ∈ C i (cid:9) , (45)for an arbitrary (cid:15) > . The following lemma bounds theprobability with which the SBM will generate a typical graph. Lemma 12 (Typical graph) . Given an arbitrary δ > , let > (cid:15) ≥ (cid:115) µ min log (cid:18) Nδ (cid:19) , (46) then P ( A ∈ A (cid:15) ) ≥ − δ .Proof. From the Chernoff bound [42], if µ min ≥ (cid:15) log (cid:0) Nδ (cid:1) ,then for a given column j belonging to cluster i , P (cid:0)(cid:12)(cid:12) (cid:107) a j (cid:107) − µ i (cid:12)(cid:12) ≤ (cid:15) µ i (cid:1) ≥ − δN . (47)Finally, we take the union bound over all N columns.Next, we will place bounds on the sampling probabilitiesobtained using SbS for a typical graph. Lemma 13 (SbS column sampling probabilities) . Let s ( j ) bethe probability of sampling a single column j , assuming thatwe are sampling using SbS with replacement. Furthermore,suppose that A ∈ A α with α as defined in (12) . Then for anarbitrary column j belonging to cluster i , s ( j ) ≥ s − i , where s − i = − α α rn i η i and η i = (cid:104) qp (cid:16) Nn i − (cid:17)(cid:105) .Proof. Since A ∈ A α , then from Lemma 12, s ( j ) = (cid:32) N (cid:88) k =1 (cid:107) a j (cid:107) (cid:107) a k (cid:107) (cid:33) − ≥ (cid:32) r (cid:88) k =1 (1 + α ) µ i n k (1 − α ) µ k (cid:33) − = 1 − α α (cid:32) r (cid:88) k =1 [ n i + qp ( N − n i )] n k [ n k + qp ( N − n k )] (cid:33) − ≥ s − i . (48)Next we provide proofs for the SbS lemmas and theoremsfound in Section III-B. Proof of Lemma 5.
Proof of Lemma 5.
Because $A = L$, the degree of a node belonging to cluster $i$ is exactly $n_i$. Then the probability of sampling from cluster $i$ is
$$\frac{n_i\cdot\frac{1}{n_i}}{\sum_{j=1}^{r} n_j\cdot\frac{1}{n_j}} = \frac{1}{r}.$$
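A quick numerical check of this lemma, using the same inverse-degree weighting as the hypothetical sampler above on an idealized graph with made-up cluster sizes: the weights exactly cancel the cluster sizes, so every cluster is hit with probability $1/r$ despite the severe imbalance.

```python
import numpy as np

# Idealized setting of Lemma 5 (A = L): fully connected clusters of very
# different sizes, no inter-cluster edges, no missing observations.
sizes = [500, 50, 5]
labels = np.repeat(np.arange(len(sizes)), sizes)
A = (labels[:, None] == labels[None, :]).astype(float)   # block-diagonal matrix of ones

degrees = A.sum(axis=1)                                   # a node in cluster i has degree n_i
s = (1.0 / degrees) / (1.0 / degrees).sum()               # SbS sampling probabilities
per_cluster = [s[labels == i].sum() for i in range(len(sizes))]
print(np.round(per_cluster, 3))                           # [0.333, 0.333, 0.333]: each cluster w.p. 1/r
```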
Proof of Lemma 6.
Invoking Lemma 12, the probability that $A \in \mathcal{A}_\alpha$ is at least $1 - N^{-1}$. In this event, from Lemma 13 we have
$$t_{\min} = \min_{1\le i\le r}\sum_{j\in\mathcal{C}_i} s(j) \ge \min_{1\le i\le r} n_i\, s_i^-, \qquad (49)$$
which yields (13).

Remark 8.
In the limit as $p - q \to 0$, we would expect (13) to become $t_{\min} \ge \frac{1-\alpha}{1+\alpha}\, f^{-1}$, so that it is consistent with the column sampling probabilities for URS. Indeed, in this case $q/p \to 1$, and (48) becomes $s(j) \ge \frac{1-\alpha}{1+\alpha}\, N^{-1}$, which is the desired result.

Proof of Lemma 7. For ease of analysis, we calculate the number of samples with replacement that suffices to obtain $b$ distinct columns from each cluster. This condition is also sufficient for sampling without replacement, since sampling without replacement requires fewer samples than sampling with replacement.
First, we bound the probability that at least $n_i - b$ columns are not sampled from cluster $i$ in a particular sketch of $N'$ samples. In the following, let $s(k)$ be the probability of sampling column $k$, $m(k)$ the number of times that column $k$ is sampled, and $\mathcal{A}_\beta$ the typical set defined in Lemma 12 with $\epsilon = \beta$. From (15) and Lemma 12, we have that $P(A \notin \mathcal{A}_\beta) < (2rN)^{-1}$.
Let $\mathcal{S}_i$ be a set containing exactly $n_i - b$ distinct column indices from cluster $i$. The probability that the columns in $\mathcal{S}_i$ are not sampled by SbS is $P\left(\sum_{k\in\mathcal{S}_i} m(k) = 0\right) = \left(1 - \sum_{k\in\mathcal{S}_i} s(k)\right)^{N'}$. Note that this probability includes the event that other columns not in $\mathcal{S}_i$ are also absent from the sample. If the graph belongs to the typical set, then
$$P\left(\sum_{k\in\mathcal{S}_i} m(k) = 0 \;\middle|\; A\in\mathcal{A}_\beta\right) \le \left(1 - (n_i - b)\, s_i^-\right)^{N'} \le \exp\left\{-(n_i - b)\, N'\, s_i^-\right\}. \qquad (50)$$
Since there are $\binom{n_i}{b}$ ways to choose the elements in the set $\mathcal{S}_i$, the probability that fewer than $b$ distinct columns are sampled from cluster $i$ is
$$P\left(n'_i < b \;\middle|\; A\in\mathcal{A}_\beta\right) \le \binom{n_i}{b}\, P\left(\sum_{k\in\mathcal{S}_i} m(k) = 0 \;\middle|\; A\in\mathcal{A}_\beta\right) \le n_i^{b}\exp\left[-(n_i - b)\, N'\, s_i^-\right]. \qquad (51)$$
If (16) holds, then the right-hand side of (51) is less than $(2rN)^{-1}$. Then, the probability of failure for any graph is
$$P(n'_i < b) \le P\left(n'_i < b \mid A\in\mathcal{A}_\beta\right) + P\left(A\notin\mathcal{A}_\beta\right) \le (rN)^{-1}. \qquad (52)$$
Finally, applying the union bound over all clusters, we have $P(n'_{\min} \ge b) = 1 - P\left\{(n'_1 < b) \cup \cdots \cup (n'_r < b)\right\}$, which is greater than or equal to $1 - \sum_{i=1}^{r} P(n'_i < b) \ge 1 - N^{-1}$.
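The counting argument above can also be probed numerically. The short Monte Carlo below is only an illustration: the two-cluster sampling distribution is a stand-in for the SbS lower bound $s_i^-$, and the values of $b$, $N'$, and the cluster sizes are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(1)

def distinct_per_cluster(s, labels, n_sketch, r):
    """Draw n_sketch indices with replacement from the distribution s and
    count the distinct columns obtained from each cluster."""
    draw = rng.choice(len(s), size=n_sketch, replace=True, p=s)
    distinct = np.unique(draw)
    return np.array([(labels[distinct] == i).sum() for i in range(r)])

# Toy unbalanced setting: sampling probabilities uniform within each cluster,
# boosted for the small cluster (mimicking the SbS lower bound s_i^-).
sizes = np.array([900, 100])
labels = np.repeat([0, 1], sizes)
s = np.where(labels == 0, 0.5 / sizes[0], 0.5 / sizes[1])   # each cluster gets total mass 1/2
b, n_sketch, trials = 30, 200, 500
hits = sum(all(distinct_per_cluster(s, labels, n_sketch, 2) >= b) for _ in range(trials))
print(hits / trials)   # empirical P(n'_min >= b), to compare with the 1 - N^{-1} type guarantee
```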
Proof of Theorem 8.
We invoke Lemma 7 with $b = \frac{\sqrt{C N' \log N'}}{\sqrt{\rho'}\,\gamma'}$. From (19), it follows that $b \le n_{\min}/2$. Critically, this ensures that $\left(1 - \frac{b}{n_{\min}}\right)$ is bounded away from zero, and imposes the bound $g' \ge g$.
Then, from conditions (19) and (20), and since $g' \ge g$, condition (16) is satisfied. Therefore, Lemma 7 guarantees (21) to hold with probability at least $1 - N^{-1}$.

Proof of Theorem 9.
Because we need to calculate the edge probabilities $p'$, $q'$ and the observation probability $\rho'$ separately, we need to consider separately the fully observed adjacency matrix, denoted $\bar{A}$, and the indicator matrix of the observed entries, denoted $O$. The matrix $O$ contains zeros for unobserved entries and ones for observed entries. Both $\bar{A}$ and $O$ have zeros along the diagonal, so that the partially observed adjacency matrix $A$ is the sum of the identity matrix and the element-wise product of $\bar{A}$ and $O$. Define $\mathcal{A}_1 = \left\{\bar{A} \mid \|\bar{a}_j\| \ge 1,\; 1\le j\le N\right\}$ and $\mathcal{O}_1 = \left\{O \mid \|o_j\| \ge 1,\; 1\le j\le N\right\}$. For brevity, we use $\mathcal{A}_1$ as shorthand for the event $\bar{A} \in \mathcal{A}_1$, and $\mathcal{A}_1^C$ as shorthand for the event $\bar{A} \notin \mathcal{A}_1$, and likewise for $\mathcal{O}_1$.
We define $\epsilon_1$ as the probability that $\bar{A}$ has a zero column. Then,
$$\epsilon_1 = P\left(\bar{A} \notin \mathcal{A}_1\right) = P\left\{\cup_{j=1}^{N}\, \|\bar{a}_j\| = 0\right\} \le \sum_{j=1}^{N} P\left(\|\bar{a}_j\| = 0\right) = \sum_{i=1}^{r} n_i (1-p)^{n_i}(1-q)^{N-n_i} \le N(1-p)^{n_{\min}}(1-q)^{N-n_{\min}}. \qquad (53)$$
Likewise,
$$\epsilon_2 = P\left(O \notin \mathcal{O}_1\right) = P\left\{\cup_{j=1}^{N}\, \|o_j\| = 0\right\} \le \sum_{j=1}^{N} P\left(\|o_j\| = 0\right) = N(1-\rho)^{N}. \qquad (54)$$
We first bound $p'$. Let the set of intra-cluster index pairs contained in the sketch matrix be $\mathcal{P} = \left\{(V_i, V_j) \in \cup_{k=1}^{r}\left(\mathcal{I}_k\times\mathcal{I}_k\right) \mid V_i \neq V_j\right\}$, where $V_i$ and $V_j$ denote the $i$th and $j$th sampled nodes, respectively, and for $(V_i,V_j)\in\mathcal{P}$ the corresponding sketch entry is $a'_{ij} = \bar{a}_{V_iV_j}$.
Then, the probability that an arbitrary intra-cluster element of the sketch matrix equals one is bounded as follows. For the upper bound, we have
$$p' = P\left(\bar{a}_{V_iV_j} = 1 \mid (V_i,V_j)\in\mathcal{P}\right) = \frac{\sum_{(v_i,v_j)\in\mathcal{P}} P\left(v_i,v_j \mid \bar{a}_{v_iv_j} = 1\right) P\left(\bar{a}_{v_iv_j} = 1\right)}{\sum_{(v_i,v_j)\in\mathcal{P}} P\left(v_i,v_j\right)} = p\,\frac{\sum_{(v_i,v_j)\in\mathcal{P}} P\left(v_i,v_j \mid \bar{a}_{v_iv_j} = 1\right)}{\sum_{(v_i,v_j)\in\mathcal{P}} P\left(v_i,v_j\right)} \le p. \qquad (55)$$
In the last step, we have used the fact that the presence of an edge slightly reduces the probability of sampling the respective columns.
For the lower bound, first observe that
$$P\left(\bar{a}_{v_iv_j} = 1 \mid \mathcal{A}_1\right) = \frac{p - P\left(\bar{a}_{v_iv_j} = 1 \mid \mathcal{A}_1^C\right) P\left(\mathcal{A}_1^C\right)}{P\left(\mathcal{A}_1\right)} \ge \frac{p - p\,P\left(\mathcal{A}_1^C\right)}{P\left(\mathcal{A}_1\right)} \ge p\,(1 - \epsilon_1). \qquad (56)$$
In the second step, we have used the fact that if the graph $\bar{A}$ has some nodes with degree zero, this slightly lowers the probability that a given element equals one, and therefore $P\left(\bar{a}_{v_iv_j} = 1 \mid \mathcal{A}_1^C\right) < p$. Then, we have
$$p' = P\left(a'_{ij} = 1 \mid (V_i,V_j)\in\mathcal{P}\right) = P\left(a'_{ij} = 1 \mid \mathcal{A}_1, (V_i,V_j)\in\mathcal{P}\right) P\left(\mathcal{A}_1\right) + P\left(a'_{ij} = 1 \mid \mathcal{A}_1^C, (V_i,V_j)\in\mathcal{P}\right) P\left(\mathcal{A}_1^C\right) \ge P\left(a'_{ij} = 1 \mid \mathcal{A}_1, (V_i,V_j)\in\mathcal{P}\right)(1 - \epsilon_1). \qquad (57)$$
Finally,
$$P\left(a'_{ij} = 1 \mid \mathcal{A}_1, (V_i,V_j)\in\mathcal{P}\right) = \frac{\sum_{(v_i,v_j)\in\mathcal{P}} P\left(v_i,v_j \mid \bar{a}_{v_iv_j} = 1, \mathcal{A}_1\right) P\left(\bar{a}_{v_iv_j} = 1 \mid \mathcal{A}_1\right)}{\sum_{(v_i,v_j)\in\mathcal{P}} P\left(v_i,v_j \mid \mathcal{A}_1\right)} \ge p\,(1 - \epsilon_1). \qquad (58)$$
Combining (57) and (58), we arrive at the lower bound.
For bounding $q'$, we follow a similar line of reasoning as for $p'$, but substitute $\mathcal{Q} = \left\{(V_i,V_j) \in \cup_{k\neq l}\left(\mathcal{I}_k\times\mathcal{I}_l\right) \mid V_i\neq V_j\right\}$ in place of $\mathcal{P}$, which yields (23).
Likewise, to bound $\rho'$, we use the set of all sampled pairs $\left\{(V_i,V_j)\in(\mathcal{I}\times\mathcal{I}) \mid V_i\neq V_j\right\}$ in place of $\mathcal{P}$, and $o$ in place of $\bar{a}$ (with $\mathcal{O}_1$ and $\epsilon_2$ in place of $\mathcal{A}_1$ and $\epsilon_1$), which yields (24).
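The relations bounding $p'$, $q'$, and $\rho'$ can also be checked by direct simulation. The snippet below is a rough illustration only: the SBM parameters are made up, it reuses the hypothetical sbs_sample helper sketched earlier in this appendix, and it folds the nominal observation rate back out rather than estimating $\rho'$ separately.

```python
import numpy as np

rng = np.random.default_rng(2)

def sbm(sizes, p, q, rho):
    """Generate a partially observed SBM adjacency matrix and its planted labels."""
    N = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)
    upper = np.triu(rng.random((N, N)) < probs, 1)
    A_full = (upper | upper.T).astype(float)          # symmetric adjacency, zero diagonal
    mask = np.triu(rng.random((N, N)) < rho, 1)
    observed = mask | mask.T                          # symmetric observation pattern
    return A_full * observed, labels

A, labels = sbm([300, 100], p=0.6, q=0.05, rho=0.7)
idx = np.unique(sbs_sample(A, n_sketch=150))          # hypothetical SbS sampler from above
sub = A[np.ix_(idx, idx)]
sub_labels = labels[idx]
intra = sub_labels[:, None] == sub_labels[None, :]
np.fill_diagonal(intra, False)
print(sub[intra].mean() / 0.7)   # rough estimate of p'; should stay close to p = 0.6
```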
Proof of Theorem 10. Successful decomposition is guaranteed by Theorem 4 of [9] with probability $1 - cN'^{-10}$ if
$$n'_{\min} \ge \frac{\sqrt{C N' \log N'}}{\sqrt{\rho'}\,\gamma'},$$
which is satisfied by condition (33).
Furthermore, we need to sample enough columns so that the retrieval process is successful with high probability. We invoke Lemma 7 with $b = \frac{8p}{(p-q)^2}\log\left(rN^2\right)$. If conditions (34) and (35) hold, then $b \le n_{\min}/2$, $g' \ge g$, and condition (16) is satisfied. Then, Lemma 7 guarantees that $n'_{\min} \ge b$ with probability at least $1 - N^{-1}$. If Lemma 7 succeeds, then retrieval is guaranteed with probability at least $1 - N^{-1}$ by Lemma 3.
Using the union bound, all conditions hold with probability at least $1 - cN'^{-10} - 2N^{-1}$.

REFERENCES
[1] N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan, "Clustering social networks," in Proc. 5th Int'l Conf. Algor. Models Web-graph. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 56-67.
[2] C. Nicolini, C. Bordier, and A. Bifone, "Community detection in weighted brain connectivity networks beyond the resolution limit," arXiv e-prints, Sep. 2016.
[3] S. Fortunato and D. Hric, "Community detection in networks: A user guide," Phys. Rep., vol. 659, pp. 1-44, 2016.
[4] U. Von Luxburg, "A tutorial on spectral clustering," Stat. Comput., vol. 17, no. 4, pp. 395-416, 2007.
[5] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in Proc. 20th Int. Symp. Comput. Inform. Sci. Springer, 2005, pp. 284-293.
[6] B. Hajek, Y. Wu, and J. Xu, "Semidefinite programs for exact recovery of a hidden community," in J. Mach. Learn. Res., vol. 49, 2016.
[7] P. W. Holland, K. B. Laskey, and S. Leinhardt, "Stochastic blockmodels: First steps," Social Networks, vol. 5, no. 2, pp. 109-137, 1983.
[8] A. Condon and R. M. Karp, "Algorithms for graph partitioning on the planted partition model," Random Struct. Algor., vol. 18, no. 2, 2001.
[9] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu, "Clustering partially observed graphs via convex optimization," J. Mach. Learn. Res., vol. 15, no. 1, pp. 2213-2238, Jan. 2014.
[10] R. Korlakai Vinayak, S. Oymak, and B. Hassibi, "Graph clustering with missing data: Convex algorithms and analysis," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2996-3004.
[11] Y. Chen and J. Xu, "Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices," J. Mach. Learn. Res., vol. 17, no. 1, Jan. 2016.
[12] T. T. Cai and X. Li, "Robust and computationally feasible community detection in the presence of arbitrary outlier nodes," Ann. Statist., vol. 43, no. 3, pp. 1027-1059, 2015.
[13] M. Rahmani and G. K. Atia, "Spatial random sampling: A structure-preserving data sketching tool," IEEE Signal Process. Lett., vol. 24, no. 9, pp. 1398-1402, Sep. 2017.
[14] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" J. ACM, vol. 58, no. 3, pp. 11:1-11:37, Jun. 2011.
[15] M. Rahmani and G. K. Atia, "High dimensional low rank plus sparse matrix decomposition," IEEE Trans. Signal Process., vol. 65, 2017.
[16] M. Rahmani and G. Atia, "A subspace learning approach for high dimensional matrix decomposition with efficient column/row sampling," in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1206-1214.
[17] L. W. Mackey, M. I. Jordan, and A. Talwalkar, "Divide-and-conquer matrix factorization," in Proc. Adv. Neural Inf. Process. Syst., 2011.
[18] M. Rahmani and G. Atia, "Randomized robust subspace recovery and outlier detection for high dimensional data matrices," IEEE Trans. Signal Process., vol. 65, no. 6, Mar. 2017.
[19] X. Li and J. Haupt, "Identifying outliers in large matrices via randomized adaptive compressive sampling," IEEE Trans. Signal Process., vol. 63, no. 7, pp. 1792-1807, 2015.
[20] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Rev., vol. 53, no. 2, pp. 217-288, 2011.
[21] M. Gu and S. C. Eisenstat, "Efficient algorithms for computing a strong rank-revealing QR factorization," SIAM J. Sci. Comput., vol. 17, 1996.
[22] N. H. Nguyen, T. T. Do, and T. D. Tran, "A fast and efficient algorithm for low-rank approximation of a matrix," in Proc. 41st ACM Symp. Theory Comput., 2009, pp. 215-224.
[23] M. Rahmani and G. K. Atia, "Spatial random sampling: A structure-preserving data sketching tool," IEEE Signal Process. Lett., vol. 24, no. 9, pp. 1398-1402, 2017.
[24] N. Ailon and B. Chazelle, "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform," in Proc. 38th ACM Symp. Theory Comput., 2006, pp. 557-563.
[25] S. Ben-David, "A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering," Mach. Learn., vol. 66, no. 2-3, pp. 243-257, 2007.
[26] A. Czumaj and C. Sohler, "Sublinear-time approximation for clustering via random sampling," in Proc. 31st ICALP. Springer, 2004.
[27] A. S. Maiya and T. Y. Berger-Wolf, "Sampling community structure," in Proc. 19th Int'l Conf. World Wide Web. ACM, 2010, pp. 701-710.
[28] K. Voevodski, M.-F. Balcan, H. Röglin, S.-H. Teng, and Y. Xia, "Efficient clustering with limited distance information," arXiv preprint arXiv:1009.5168, 2010.
[29] C. Hübler, H. Kriegel, K. Borgwardt, and Z. Ghahramani, "Metropolis algorithms for representative subgraph sampling," in Proc. 8th IEEE Int'l Conf. Data Mining, Dec. 2008, pp. 283-292.
[30] J. Leskovec and C. Faloutsos, "Sampling from large graphs," in Proc. 12th ACM SIGKDD. New York, NY, USA: ACM, 2006, pp. 631-636.
[31] R. Gao, H. Xu, P. Hu, and W. C. Lau, "Accelerating graph mining algorithms via uniform random edge sampling," in IEEE ICC, May 2016.
[32] S.-Y. Yun and A. Proutiere, "Community detection via random and adaptive sampling," in Proc. Conf. Learn. Theory, 2014, pp. 138-175.
[33] M. Salehi, H. R. Rabiee, and A. Rajabi, "Sampling from complex networks with high community structures," Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 22, no. 2, 2012.
[34] N. Ailon, Y. Chen, and H. Xu, "Iterative and active graph clustering using trace norm minimization without cluster size constraints," J. Mach. Learn. Res., vol. 16, pp. 455-490, 2015.
[35] S. Fortunato, "Community detection in graphs," Phys. Rep., vol. 486, no. 3, pp. 75-174, 2010.
[36] N. Bansal, A. Blum, and S. Chawla, "Correlation clustering," Mach. Learn., vol. 56, no. 1, pp. 89-113, Jul. 2004.
[37] Y. Chen, S. Sanghavi, and H. Xu, "Improved graph clustering," IEEE Trans. Inf. Theory, vol. 60, no. 10, pp. 6440-6455, Oct. 2014.
[38] A. Jalali, Q. Han, I. Dumitriu, and M. Fazel, "Exploiting tradeoffs for exact recovery in heterogeneous stochastic block models," in Proc. Adv. Neural Inf. Process. Syst., D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., 2016, pp. 4871-4879.
[39] E. Elhamifar and R. Vidal, "Sparse subspace clustering: Algorithm, theory, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765-2781, Nov. 2013.
[40] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.
[41] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489-509, Feb. 2006.
[42] C. McDiarmid, "Concentration," in