Sparse and Smooth: improved guarantees for Spectral Clustering in the Dynamic Stochastic Block Model
SSparse and Smooth:improved guarantees for Spectral Clusteringin the Dynamic Stochastic Block Model
Nicolas KerivenCNRS & GIPSA-lab11 rue des Math´ematiques, 38400 St-Martin-d’H´er`es, [email protected] VaiterCNRS & IMB, Universit´e de Bourgogne9 avenue Alain Savary, 21000 Dijon, [email protected]
Abstract
In this paper, we analyse classical variants of the Spectral Clustering (SC) algorithm in theDynamic Stochastic Block Model (DSBM). Existing results show that, in the relatively sparsecase where the expected degree grows logarithmically with the number of nodes, guarantees inthe static case can be extended to the dynamic case and yield improved error bounds when theDSBM is sufficiently smooth in time, that is, the communities do not change too much betweentwo time steps. We improve over these results by drawing a new link between the sparsityand the smoothness of the DSBM: the more regular the DSBM is, the more sparse it can be,while still guaranteeing consistent recovery. In particular, a mild condition on the smoothnessallows to treat the sparse case with bounded degree. We also extend these guarantees to thenormalized Laplacian, and as a by-product of our analysis, we obtain to our knowledge the bestspectral concentration bound available for the normalized Laplacian of matrices with independentBernoulli entries.
In recent years, the study of dynamic networks has appeared as a topic of great interest to modelcomplex phenomenons that evolve with time, such as interactions in social networks, the spread ofinfectious diseases or opinions, or information packets in computer networks. In light of this, manyrandom graphs models, traditionally static (non-dynamic), have been extended to the dynamic case,see [13, 17] for reviews. One of the most popular use of dynamic networks consists in detectingand tracking communities of well-connected nodes, for instance users of a social network [44, 42,39]. In this context, the classical Stochastic Block Model (SBM) [16], in which nodes intra- andinter-communities are linked independently with some prescribed probabilities, has been extended todynamic settings (DSBM) in a myriad of ways. In this paper, we consider one of the first (and mostpopular) extension [44] as a discrete Hidden Markov Model (HMM) as well as one of its simplification,where node memberships follow a Markov chain with respect to time, and connections are generatedby a classical SBM conditionally on the memberships. We will also consider a slight simplification asin [33], where the authors remove the Markov Chain assumption and consider deterministic communitymemberships at each time steps. Many other models have been proposed since, to take into account1 a r X i v : . [ s t a t . M L ] F e b volving connection probabilities [42, 27, 33], varying number of nodes [41], connections that dependon their previous states [41], mixed-membership SBM [15] or multi-graphs [14].The literature on clustering nodes in a graph is vast, with a variety of methods. Arguably, themost popular class of algorithms in practice is that of spectral clustering (SC) methods [29, 37],which consist in applying a classical clustering algorithm for vectorial data, often the well-known k -means algorithm [25], to the eigenvectors of a matrix related to the structure of the graph such asthe adjacency matrix or normalized Laplacian. In a dynamic context, we will consider in this paperone of the simplest adaptation of SC, which consists in feeding a version of the adjacency matrix smoothed in time to the classical SC algorithm, in hope of implicitely enforcing smoothness of thecommunities. This can be an averaged version of the adjacency matrix over a finite window [15, 33],or computed through recursive updates with a certain “forgetting factor” [6, 7, 40], which is somehowmore amenable to streaming computing. Other works explicitely enforce smoothness between thecommunities or between the eigenvectors considered in SC through efficient updates [30, 9, 24].Beyond SC, many other methods have been proposed, such as Maximum Likelihood or varia-tional approaches, which are consistent for the SBM and DSBM [5, 27, 26], Bayesian approaches [44],learning-based approaches [2], or neural networks [4]. Many variants of the SC itself exist, often toaccelerate computation [36]. Guarantees for Spectral Clustering
There is a vast literature on the theoretical analysis of SC,and guarantees come in many different flavors. Several works analyze the algorithm when the graphis well-clustered (in some sense) [32], in terms of spectral convergence of the normalized Laplacianwhen the number of nodes goes to infinity [38, 10, 11, 35], or using random matrix theory [8].It is well-known that a key quantity to analyse SC algorithms is the density of edges with respectto the number of nodes. In the specific case of independent Bernoulli edges like the SBM and DSBM,this correspond to the mean probability of connection, which will be denoted by α n in this paper,where n is the number of nodes in the graph. The dense case α n ∼ sparse case α n ∼ n is much more complex, since thegraph is not even guaranteed to be connected with high probability [1].Modern analyses of the sparse case are often inspired by statistical physics [18, 28, 1], and areinterested with the computation of a detectability threshold , that is, the characterization of regimesof parameters in which there exists (or not) an algorithm that asymptotically performs better thanrandom guess. However, this approach does not concern the classic SC algorithm (which will generallyfail [18]), and the case where the number of communities K is larger than 2 is still largely open. Inthe dynamic case, a conjecture on the detectability threshold is given in [12]. In parallel, other worksstudy the sparse case by regularizing the adjacency matrix or normalized Laplacian of the graph beforethe SC algorithm [20, 21].In [22], Lei and Rinaldo provide strong, non-asymptotic consistency guarantees for the classic SCalgorithm on the adjacency matrix (without regularization) in the relatively sparse case α n (cid:38) log nn ,showing that the proportion of misclassified nodes tends to 0 with a probability that goes to 1 whenthe number of nodes n increases. Their recovery results are valid for any K , potentially growing slowlywith n . In [33], Pensky and Zhang extend this analysis to a particular Dynamic SBM, referred to as“deterministic” DSBM in the sequel, for the SC algorithm applied to a smoothed adjacency matrix.In this case, another key quantity is the temporal regularity of the model ε n , that is, the proportionof nodes that may change community between two time steps (the smaller ε n is, the more regularthe model). They showed that, in the relatively sparse case, if the model was sufficiently regular in ε n = o (cid:16) n (cid:17) , then the error bound of the static case can be improved. However their analysis stilltakes place in the relatively sparse case even when ε n is very low. Contributions
In this paper, we follow the analyses of [22] and [33] and significantly extend themin several ways. 2 Our main contribution is to draw a new link between the sparsity α n and regularity ε n in theanalysis of the DSBM: we show that, the more regular the model, the sparser it can be, whilestill guaranteeing consistency. In particular, a mildly strengthened condition ε n ∼ n allowsto give consistent guarantees in the sparse case α n ∼ n .– We extend the analysis to the normalized Laplacian, which was left open by Lei and Rinaldo [22].As a by-product, in the static case, we obtain, to our knowledge, the best spectral concentrationbound available (cid:107) L ( A ) − L ( E ( A )) (cid:107) (cid:46) √ log n in the relatively sparse case α n ∼ log nn .– We also improve the rate of the error bounds with respect to the number of communities K when the probabilities of connection between communities decrease with K , in both the static[22] and dynamic [33] cases.– Finally, we extend our results to the Markov DSBM introduced in [44], and the SC algorithm withan “exponentially smoothed” matrix, used in [6, 40] and appropriate in a streaming computingframework. Outline
In Section 2, we introduce notations, the SBM and DSBM, and recall the SC algorithm.In Section 3, we draw a link between recovery guarantees for SC and the concentration of the inputmatrix in spectral norm, similar to [22] but extended to the normalized Laplacian. In Section 4 and5, we expose our main concentration results respectively for the adjacency matrix and normalizedLaplacian. Proofs are given in Section 6, with technical computations deferred to the Appendix.
The set of the first n integers is denoted by [ n ] = { , . . . , n } . For any vector d ∈ R n , we definediag( d ) ∈ R n × n to be the diagonal matrix whose elements are given by d . For a varying parameter α n , the notation α n ∼ f ( n ) indicates that, as n → ∞ , the quantity α n /f ( n ) tends to a non-zeroconstant, α n (cid:46) f ( n ) indicates that there is a universal constant C such that α n (cid:54) Cf ( n ), andsimilarly for α n (cid:38) f ( n ).An undirected graph G = ( V, E ) is formed by a set of nodes V and edges E ⊂ V × V . For a graphwith n nodes, we often adopt V = [ n ], and we define its (symmetric) adjacency matrix A ∈ { , } n × n such that for i, j ∈ [ n ], A ij = (cid:40) { i, j } ∈ E, D ( A ) by D ( A ) = diag (( d i ) ni =1 ) where d i = n (cid:88) j =1 A ij . For any symmetric matrix A such that (cid:80) j A ij (cid:54) = 0 for all i , the normalized Laplacian L ( A ) is definedas L ( W ) = D ( A ) − AD ( A ) − . We note that, typically, the normalized Laplacian is defined as the matrix Id − D ( A ) − AD ( A ) − .However, SC is mainly concerned with the eigenvectors of the Laplacian, which are the same for bothvariants. 3 tochastic Block Model Let us start by introducing the classical static SBM. We take the followingnotations: n the number of nodes, K the number of communities. Each node belongs to exactly onecommunity. We denote by Θ ∈ { , } n × K the 0 − i , Θ ik = 1 indicates that it belongs to the k th community, and is 0 otherwise.The (symmetric) adjacency matrix is denoted by A ∈ { , } n × n . Given Θ, for i < j we have A ij | { Θ ik = 1 , Θ j(cid:96) = 1 } ∼ Ber( B k(cid:96) )where B ∈ [0 , K × K is a symmetric connectivity matrix, and Ber( p ) indicates a Bernoulli randomvariable with parameter p . We also let A ii = 0 and A ji = A ij . Finally, we define P = Θ B Θ (cid:62) ∈ R n × n the matrix storing the probabilities of connection between two nodes off its diagonal, and we have E ( A ) = P − diag( P )Typically, B has high diagonal terms and low off-diagonal terms. We will consider B of the form B = α n B (1)for some α n ∈ (0 ,
1) and B ∈ [0 , K × K whose elements are denoted by b (0) k(cid:96) . It is known that therate α n when n → ∞ is the main key quantity when analyzing the properties of random graphs.Typical settings include α n ∼ α n ∼ /n (sparse graphs), or middle grounds suchas α n ∼ log nn , usually referred to “relatively sparse” graphs. As we will see, it is known that strongguarantees of consistency can be given in the relatively sparse case, while the sparse case it hard toanalyze and only partially understood.For some maximum and minimum community sizes n max (cid:62) nK and n min (cid:54) nK , we define the set ofadmissible community sizes N def. = { ( n k ) Kk =1 | n min (cid:54) n k (cid:54) n max , (cid:80) k n k = n } , and¯ n max def. = max ( n (cid:96) ) (cid:96) ∈ N,k (cid:54) K (cid:88) (cid:96) n (cid:96) b (0) k(cid:96) , ¯ n min def. = min ( n (cid:96) ) (cid:96) ∈ N,k (cid:54) K (cid:88) (cid:96) n (cid:96) b (0) k(cid:96) (2)These quantities are such that the expected degree will be comprised between α n ¯ n min and α n ¯ n max .For simplicity, we will sometimes express our results with B equal to: B = (1 − τ )Id K + τ K (cid:62) K (3)In other words, B contains α n on its diagonal and τ α n outside. For this expression of B , we have¯ n max = (1 − τ ) n max + nτ , and similarly for ¯ n min . Interestingly, in the case of balanced communities n max , n min ∼ nK , we have then ¯ n min , ¯ n max ∼ (cid:40) n if τ ∼ nK if τ ∼ K Dynamic SBM
The Dynamic SBM (DSBM) is a random model for generating adjacency matrices A , . . . , A t at each time step. Each A i will be generated according to a classical SBM with constantnumber of nodes n , number of communities K and connectivity matrix B , but changing node mem-berships Θ t . Note that several works consider changing number of nodes [41] or changing connectivitymatrix [33], but for simplicity we assume that they are constant in time here. We will consider twopotential models on the Θ t .– The simplest one, adopted in [33], is to consider that Θ , . . . , Θ t are deterministic variables. Inthis case, we will assume that only a number s (cid:54) n of nodes change communities between eachtime step t − t , and denote ε n = s/n this relative proportion of nodes. We will also assumethat at all time steps, the communities sizes are comprised between some n min and n max , whichwill typically be of the order of n/K for balanced communities. As a shorthand, we will simplyrefer to this model as deterministic DSBM (keeping in mind that the A t are still random).4 In the second model, similar to [44] we assume that the nodes memberships follow a Markovchain, such that between two time steps, all nodes have a probability 1 − ε n to stay in the samecommunity, and ε n to go into any other community, that is: ∀ i, k, P ((Θ t ) ik = 1 | (Θ t − ) ik = 1) = 1 − ε n , ∀ (cid:96) (cid:54) = k, P ((Θ t ) i(cid:96) = 1 | (Θ t − ) ik = 1) = ε n K − t , the A t are drawn independently according to a SBM. The globalmodel is thus a Hidden Markov Model (HMM). We will simply refer to this case as MarkovDSBM. Note that, in this case, it is rather difficult to quantify, in a non-asymptotic manner,the probability of having bounded community sizes globally holding for all time steps. Hence¯ n max , ¯ n min will not intervene in our analysis of this case. Goal and error measure
The goal of a clustering algorithm is to give an estimator ˆΘ of the nodememberships Θ, up to permutation of the communities labels. We consider the following measure ofdiscrepancy between Θ and an estimator ˆΘ [22]: E ( ˆΘ , Θ) = min Q ∈P k n (cid:13)(cid:13)(cid:13) ˆΘ Q − Θ (cid:13)(cid:13)(cid:13) (4)where P k is the set of permutation matrices of [ k ] and (cid:107)·(cid:107) counts the number of non-zero elementsof a matrix. While other error measures are possible, as we will see one can generally relate them toa spectral concentration property, which will be the main focus of this paper.In the dynamic case, a possible goal is to estimate Θ , . . . , Θ t for all time steps simultaneously [40,33]. Here we consider a slightly different goal: at a given time step t , we seek to estimate Θ t withthe best precision possible, by exploiting past data. In general, this will give rise to methods thatare computationally lighter than simultaneous estimation of all the Θ t ’s, and more amenable tostreaming computing, where one maintains an estimator without having to keep all past data inmemory. Naturally, such methods could be applied independently at each time step to produceestimators of all the Θ t ’s, but this is not the primary goal here. Spectral Clustering (SC) algorithm
Spectral Clustering [29] is nowadays one of the leadingmethods to identify communities in an unsupervised setting. The basic idea is to solve the K -means problem [25] on the K leading eigenvectors E K of either the adjacency matrix or (normalized)Laplacian. Solving the K -means, i.e., obtaining( ¯Θ , ¯ C ) ∈ Argmin Θ ∈ R n × K ,C ∈ R K × K (cid:107) Θ C − E K (cid:107) F , (5)is known to be NP-hard, but several approximation algorithms, such as [19], are known to produce1 + δ approximate solutions ( ˆΘ , ˆ C ) (cid:13)(cid:13)(cid:13) ˆΘ ˆ C − E K (cid:13)(cid:13)(cid:13) F (cid:54) (1 + δ ) (cid:13)(cid:13) ¯Θ ¯ C − E K (cid:13)(cid:13) F . The SC is summarized in Algorithm 1.In the dynamic case, a typical approach to exploit past data is to replace the adjacency matrix A t with a version “smoothed” in time A smooth t , and feed either ˆ P = A smooth t or the correspondingLaplacian ˆ L = L ( A smooth t ) to the classical SC algorithm. In [33], the authors consider the smoothedadjacency matrix as an average over its last r values: A unif t = 1 r r − (cid:88) k =0 A t − k . (6)5 ata: Matrix M ∈ R n × n (typically adjacency or normalized Laplacian), number ofcommunities K , approximation ratio δ > Result:
Estimated communities ˆΘ ∈ R n × K Compute the K leading eigenvectors E K of M .Obtain a (1 + δ )-approximation ( ˆΘ , ˆ C ) of (5).Return ˆΘ. Algorithm 1:
Spectral Clustering algorithm ad j u s t ed R and i nde x A exp L ( A exp ) (a) Adjusted Rand index in function of theforgetting factor λ for the adjacency matrixand the normalized Laplacian. = 1/ r ad j u s t ed R and i nde x A exp A unif (b) Adjusted Rand index in function of theforgetting factor λ for A exp t and the windowsize r for A unif t . Figure 1: Performance results for SC on synthetic data.Note that, in the original paper, the authors sometimes consider non-uniform weights due to potentialchanges in time of the connectivity matrix B t , but in our case we consider a fixed B , and thus uniformweights r . In this paper, we will also consider the “exponentially smoothed” estimator proposed by[6, 7, 43], which is computed recursively as: A exp t = (1 − λ ) A exp t − + λA t . (7)for some “forgetting factor” λ ∈ (0 , A exp0 = A . Compared to the uniform estimator (6),this kind of estimator is somewhat more amenable to streaming and online computing, since only thecurrent A exp t needs to be stored in memory instead of the last r values A t , A t − , . . . , A t − r +1 (notehowever that A exp t may be denser that a typical adjacency matrix, so the memory gain is sometimesmitigated depending on the case).In Fig. 1, we illustrate the performance of the SC algorithm on a synthetic DSBM example. Asexpected, the normalized Laplacian L ( A exp t ) generally performs better than A exp t . Interestingly, theoptimal forgetting factor λ is slightly different from one to the other, and the normalized Laplacianreaches a higher performance altogether. We then compare A unif t and A exp t . As we will see in thesequel, taking r ∼ λ often results in the same performance for both estimators. However, a clearadvantage of the exponential estimator is that it is not limited to discrete window sizes, but has acontinuous forgetting factor. As such, A exp t with the optimal λ often reaches a better performancethan A unif t with the optimal r . 6 From Spectral Clustering to spectral norm concentration
As described in [22], a key quantity for analyzing SC algorithm is the concentration of the adjacencymatrix around its expectation in spectral norm . As a first contribution, we prove the following lemma,which is a generalisation of this result to the normalized Laplacian.
Lemma 1.
Let P = Θ B Θ (cid:62) correspond to some SBM with K communities, where n max , n (cid:48) max and n min are respectively the largest, second-largest and smallest community size. Assume B = α n B forany B with smallest eigenvalue γ . Let ˆ P be an estimator of P , and ˆΘ be the output of Algorithm 1on ˆ P with a (1 + δ ) -approximate k -means algorithm. Then E ( ˆΘ , Θ) (cid:46) (1 + δ ) n (cid:48) max Knα n n γ (cid:13)(cid:13)(cid:13) ˆ P − P (cid:13)(cid:13)(cid:13) , (8) Similarly, if ˆ L is an estimator of L ( P ) and ˆΘ is the output of Algorithm 1 on ˆ L , it holds that E ( ˆΘ , Θ) (cid:46) (1 + δ ) n (cid:48) max K ¯ n nn γ (cid:13)(cid:13)(cid:13) ˆ L − L ( P ) (cid:13)(cid:13)(cid:13) . (9) When B is defined as (3) , we have γ = 1 − τ . The proof of this lemma is deferred to Appendix A.1. The first bound (8) was proved in [22], weextend it to the Laplacian case. Note that ˆ L could be an estimator of L ( P ) without being of the formˆ L = L ( M ) for some matrix M .Using this lemma, in the static SBM case, the goal is to find estimators ˆ P or ˆ L that concentratesaround P or L ( P ) in spectral norm. In the dynamic case, where the goal is to estimate the communitiesat a particular time t , we seek the best estimators for P t or L ( P t ). As outlined in the previous section,we will consider smoothed versions of the adjacency matrix A smooth t , and prove concentration of A smooth t around P t and L ( A smooth t ) around L ( P t ). Remark 1.
Assuming that all community sizes are of the order of nK and τ is fixed, the error inthe adjacency case (8) scales as K n α n (cid:13)(cid:13)(cid:13) ˆ P − P (cid:13)(cid:13)(cid:13) , and in the normalized Laplacian case the error (9)scales as K (cid:13)(cid:13)(cid:13) ˆ L − L ( P ) (cid:13)(cid:13)(cid:13) . Also note that, when ¯ n max ∼ nK , then the error (9) is as (cid:13)(cid:13)(cid:13) ˆ L − L ( P ) (cid:13)(cid:13)(cid:13) .This does not explicitely depend on α n or K , however these quantities will naturally appear in theconcentration of the Laplacian.The next sections will therefore be devoted in analyzing the spectral concentration rates of thevarious estimators. Table 1 summarize our results and compare them with previous works. As wewill see in the next section, our main contribution is to weaken the hypothesis on the sparsity α n ,and relate it to the regularity of the DSBM ε n . We also provide the best bound available for thenormalized Laplacian in the static case, and the first bound in the dynamic case.In Figure 2, we illustrate numerically the spectral concentration of A exp t and L ( A exp t ), and theiractual clustering performance, with respect to the forgetting factor λ . We see that there is a slightdiscrepancy between the λ that minimizes the spectral bound, and the one that yields the bestclustering result. As we will see in the next sections, the λ that minimizes the spectral error istheoretically of the order of √ α n nε n . This rate is indeed verified numerically for the spectral error,however the actual best clustering performance deviates slightly. This indicates that spectral normconcentration probably does not yield sharp bounds in examining the performance of SC. We start by recalling the result of [22] in the static case and prove an interesting minor improvementin some cases, then we examine the result for DSBM of [33] and state our main contribution, that is,a weakening of the sparsity hypothesis for this case.7 .0 0.2 0.4 0.6 0.8 / n ad j u s t ed R and i nde x n = 100 n = 250 n = 500Best clusteringMinimal norm (a) Adjusted Rand index in function of thenormalized forgetting factor λ/ √ α n nε n forthe adjacency matrix. / n | A e x p P | n = 100 n = 250 n = 500Best clusteringMinimal norm (b) Approximation (in norm) of P by A exp in function of the normalized forgetting fac-tor λ/ √ α n nε n for the adjacency matrix. / n ad j u s t ed R and i nde x n = 100 n = 250 n = 500Best clusteringMinimal norm (c) Adjusted Rand index in function of thenormalized forgetting factor λ/ √ α n nε n forthe normalized Laplacian matrix. / n | L ( A e x p ) P | n = 100 n = 250 n = 500Best clusteringMinimal norm (d) Approximation (in norm) of P by L ( A exp ) in function of the normalized for-getting factor λ/ √ α n nε n for the normalizedLaplacian matrix. Figure 2: SC on synthetic data. Comparison between the forgetting factor λ that minimizes thespectral error, and the one that yields the best clustering result.8 static L static A dyn. L dyn. Hyp. E → n O (1) α n (cid:38) log nn Yes[22] √ α n n α n (cid:38) log nn Yes[33] √ α n n √ α n nρ n α n (cid:38) log nn Yes[3] √ log n α n (cid:38) n NoUs √ α n n √ α n n √ α n nρ n (cid:113) ρ n α n n α n ρ n (cid:38) log nn YesTable 1: Concentration rates in spectral norm of the adjacency matrix and normalized Laplacian,in the static or dynamic case, with respect to the sparsity parameter α n , and the factor ρ n def. =min(1 , √ nα n ε n ) which includes the regularity ε n . 
The last column indicate convergence of the errorusing Lemma 1. This table does not include methods with regularization [21]. In their landmark paper [22], Lei and Rinaldo analyze the relatively sparse case α n (cid:38) log nn and showthat, with probability at least 1 − n − ν for some ν >
0, the adjacency matrix concentrates as (cid:107) A − P (cid:107) (cid:46) √ nα n (10)Therefore, by Lemma 1, using A as an estimator for P in an SC algorithm leads to an error E ( ˆΘ , Θ) (cid:46) K α n n , such that E ( ˆΘ , Θ) → K = o ( √ nα n ). As a minor contribution, we remark that it isnot hard to prove the following Lemma that improves over their result in the particular case when B is defined as (3). Proposition 1.
Consider a static SBM where B is defined as (3) , assume that the community sizes n , . . . , n K are comprised between n min and n max , and that α n (cid:38) log n ¯ n min (11) Then, for all ν > , there exists a constant C ν such that, with probability at least − (cid:80) k n − νk , it holdsthat (cid:107) A − P (cid:107) (cid:54) C ν √ ¯ n max α n (12) Proof.
Denote by S , . . . , S K ⊂ [ n ] the subset of indices of each community, assume without lost ofgenerality that the nodes are ordered such that the S k are consecutive in [ n ], that is, S = { , . . . , n } , S = { n + 1 , . . . , n + n } , and so on. Define A k = A S k ,S k ∈ { , } n k × n k the adjacency matrixof the subgraph of nodes from the k th community. Note that by our assumption on B we have P k = P S k ,S k = α n n k (cid:62) n k . Denote A (cid:48) ∈ { , } n × n the block matrix containing the A k on its diagonalof blocks, similarly P (cid:48) , and A (cid:48)(cid:48) = A − A (cid:48) , P (cid:48)(cid:48) = P − P (cid:48) . We have (cid:107) A − P (cid:107) (cid:54) (cid:107) A (cid:48) − P (cid:48) (cid:107) + (cid:107) A (cid:48)(cid:48) − P (cid:48)(cid:48) (cid:107) = max k (cid:107) A k − P k (cid:107) + (cid:107) A (cid:48)(cid:48) − P (cid:48)(cid:48) (cid:107) where the equality is valid because A (cid:48) − P (cid:48) is a block diagonal matrix. From Lei and Rinaldo’s resultabove, for each k , if α n (cid:38) log n k n k , then with probability at least 1 − n − νk it holds that (cid:107) A k − P k (cid:107) (cid:46) √ n k α n , such that (cid:107) A (cid:48) − P (cid:48) (cid:107) (cid:46) √ n max α n . For the second term, we note that A (cid:48)(cid:48) is an adjacencymatrix generated by the SBM corresponding to P (cid:48)(cid:48) , whose maximal probability is τ α n . Hence, if τ α n (cid:38) log nn , then with probability 1 − n − ν we have (cid:107) A (cid:48)(cid:48) − P (cid:48)(cid:48) (cid:107) (cid:46) √ τ nα n . We conclude with a unionbound. 9his Lemma provides a better error rate than [22] when τ goes to 0 with K , at the price ofrequiring a higher α n . For instance, when the communities sizes are balanced n k ∼ nK , and we have τ ∼ K and α n ∼ K log nn , Lei and Rinaldo’s rate (10) yields E ( ˆΘ , Θ) (cid:46) K log n and converge only for K = o (log n ), while using Proposition 1 we get E ( ˆΘ , Θ) (cid:46) n . The latter does not depend on K ,which may grow with any rate in the number of nodes (recalling that τ and α n depend on K , andthat there must be a ν > K ν +1 n − ν → In [33], Pensky and Zhang analyze the dynamic case with Lei and Rinaldo’s proof technique. Theyconsider the deterministic DSBM model in the almost sparse case α n (cid:38) log nn and the uniform estimator(6). Defining a factor ρ (PZ) n = min(1 , √ nα n ε n ) , (13)they show that, for an optimal choice of window size r ∼ ρ (PZ) n , it holds that (cid:13)(cid:13) A unif t − P t (cid:13)(cid:13) (cid:46) (cid:113) nα n ρ (PZ) n (14)In particular, the concentration is better if ρ (PZ) n = o (1), that is: ε n = o (cid:18) α n n (cid:19) . (15)In other words, there is an improvement if we assume sufficient smoothness in time, which then leadsto a better error rate E ( ˆΘ , Θ) (cid:46) K ρ (PZ) n α n n when using A unif t in the SC algorithm. Note that, with thisproof technique a constant smoothness ε n ∼ α n , since intuitively, if there is more dataavailable where the communities are almost the same as the present time step, the density of edgesshould not need to be as large. We solve this in the following theorem, which is the central contributionof this paper. Theorem 1.
Consider the deterministic DSBM with any B . Define ρ n def. = min (cid:0) , √ ¯ n max α n ε n (cid:1) (16) Assume t (cid:62) t min def. = log ( ρnαnn ) − ρ n ) , and α n ρ n (cid:38) log nn . (17) Consider either the uniform estimator A smooth t = A unif t with r ∼ ρ n or the exponential estimator A smooth t = A exp t with λ ∼ ρ n .For all ν > , there is a universal constant C ν such that, with probability at least − n − ν , it holdsthat (cid:13)(cid:13) A smooth t − P t (cid:13)(cid:13) (cid:54) C ν √ nα n ρ n . (18)In this theorem, we improve over [33] in several ways. First, we improve ρ (PZ) n to ρ n by replacing n with ¯ n max (cid:54) n . In the case where (cid:80) (cid:96) ( B ) k(cid:96) stays bounded, for instance if it is defined as (3) with τ ∼ K , we have ¯ n max ∼ nK and this improves the bound (18) compared to (14). We also extend theresult to the exponential estimator with the right choice of forgetting factor.10ore importantly, the main feature of our result is the weaker condition (17), which relates thesparsity and the smoothness of the DSBM. Strinkingly, if ε n ∼ n/ ¯ n max log n , (19)which is a slight strengthening of (15), then our result is valid in the sparse regime α n ∼ n , whichis a significant improvement compared to previous works. In any case, if we have exactly α n ρ n ∼ log nn ,then as previously Lemma 1 yields that E ( ˆΘ , Θ) → K = o ( √ log n ). Proposition 2.
The result of Theorem 1 stays valid under the Markov DSBM, by replacing ¯ n max with n everywhere, and assuming ε n (cid:38) (cid:114) log nn (20) (in this case, “with probability at least − n − ν ” refers to joint probability on both the A t and the Θ t ). The above Lemma shows that the Markov DSBM yields the exact same error bounds than thedeterministic DSBM model, but since we do not assume a maximal community size here, ¯ n max isreplaced with n . Furthermore, ε n cannot be too small to still obtain a polynomial probability offailure. Nevertheless, the condition (20) is much weaker than the rate (19) for instance, such that thesparse regime with sufficient smoothness if still valid. Remark 2.
As already observed in [33], with this proof technique, a constant ε n , or in other words,a fraction of changing nodes s that grows linearly with n , does not result in an improvement of therate of the error bounds compared to the static case. Following the statistical physic approach in thesparse static case [18, 28, 1], a conjecture on the detectability threshold in the sparse case and ε n ∼ K > As mentioned in the introduction, the spectral concentration of the normalized Laplacian has been lessstudied than the adjacency matrix, even in the static case. Many works study the asymptotic spectralconvergence of the normalized Laplacian in the dense case [38], but few examine non-asymptoticbounds.
Among the few existing bounds, [31] proves a concentration in O (1) in the relatively sparse case,and [34] proves a concentation in Frobenius norm but with the stronger condition α n (cid:38) √ log n . Animportant corollary of our study of the dynamic case is to significantly improves over these results,and obtain, to our knowledge, the best bound available in the relatively sparse case. We state thefollowing proposition for any Bernoulli matrix (not necessarily SBM). Proposition 3 (Normalized Laplacian, static case) . Let A be a symmetric matrix with independententries a ij ∼ Ber ( p ij ) . Assume p ij (cid:54) α n , and that there is ¯ n min , ¯ n max such that for all i , α n ¯ n min (cid:54) (cid:80) j p ij (cid:54) α n ¯ n max , and µ B = ¯ n max ¯ n min . For all ν > , there are constants C ν , C (cid:48) ν such that: if α n (cid:62) C (cid:48) ν µ B log n ¯ n min (21) then with probability at least − n − ν we have (cid:107) L ( A ) − L ( P ) (cid:107) (cid:46) C ν µ B √ n ¯ n min √ α n roof. This is a direct consequence of Theorem 4 in Section 6.In other words, when ¯ n min ∼ n (for instance when all the p ij /α n are bounded below), then in therelatively sparse case the spectral concentration of the normalized Laplacian is in √ log n , which is astrict improvement over existing bounds.Let us comment a bit on the condition (21). When ¯ n min = o ( n ) or µ − B = o (1), it is stronger thanthe relatively sparse case. The attentive reader would also remark the subtle interplay of the quantifierswith the rate ν : in the analysis of the adjacency matrix in the previous section, any multiplicativeconstant between α n and log nn was acceptable, and the rate ν only forced a multiplicative constant C ν in the final error bound. Here, the rate ν also imposes a multiplicative constant C (cid:48) ν in the sparsityhypothesis. To our knowledge, the normalized Laplacian in the DSBM has never been studied theoretically. Ourresult is the following.
Theorem 2.
Consider the deterministic DSBM with B satisfying (3) , and either the uniform esti-mator A smooth t = A unif t with r ∼ ρ n or the exponential estimator A smooth t = A exp t with λ = ρ n . Assume t (cid:62) t min .For all ν > , there exist universal constants C ν , C (cid:48) ν > such that: if α n ρ n (cid:62) C (cid:48) ν µ B log n ¯ n min (22) then with probability at least − n − ν , it holds that (cid:13)(cid:13) L ( A smooth t ) − L ( P t ) (cid:13)(cid:13) (cid:54) C ν µ B (cid:114) nρ n ¯ n α n . (23)In the case of balanced communities, the result of Theorem 2 combined with Lemma 1 yieldsthe same error rate than in the case of the adjacency matrix with Theorem 1 and Lemma 1, evenin terms of K when ¯ n min , ¯ n max ∼ nK . Note however that in the latter, the condition (22) is slightlystronger than (17), similar to the static case. In practice however, it is well-known that the normalizedLaplacian generally performs better (Fig. 1). In this section, we provide the proof of our main results, largely inspired by [22] and [33]. The technicalcomputations are given in appendix. Despite some similarity with [22] and [33], we strove to makethe proofs self-contained.
We place ourselves at a particular time t . Both estimators A unif t and A exp t can be written as a weightedsum A smooth t = t (cid:88) k =0 β k A t − k , (24)where β = . . . = β r − = r and β k = 0 for k (cid:62) r in the uniform case, and β k = λ (1 − λ ) k for k < t and β t = (1 − λ ) t in the exponential case. As we will see, our results will be valid for any estimator12f the form (24), with weights β k (cid:62) β max , C β , C (cid:48) β > t (cid:88) k =0 β k = 1 , β k (cid:54) β max , t (cid:88) k =0 β k (cid:54) C β β max , (25) t (cid:88) k =0 β k min(1 , (cid:112) kε n ) (cid:54) C (cid:48) β (cid:114) ε n β max In words, the weights must naturally sum to 1 and be bounded; the sum of their squares must besmall; and they must decrease faster than √ k , which is roughly the rate at which the past communitiesΘ t − k deviate from Θ t . It is not difficult to show that the uniform and exponential estimator satisfythese conditions. Lemma 2.
The weights in the uniform estimator (6) satisfy (25) with β max = r , C β = C (cid:48) β = 1 .If t (cid:62) t min = min(log( ε n /β max ) , log β max )2 log(1 − β max ) , the weights in the exponential estimator (7) satisfy (25) with β max = λ , C β = , C (cid:48) β = 2 .Proof. The computations are trivial in the uniform case, where the last condition is implied by thestronger property (cid:80) k β k √ k (cid:54) √ r = (cid:113) β max . In the exponential case, we have β max = λ , and (cid:88) k β k = λ t − (cid:88) k =0 (1 − λ ) k + (1 − λ ) t (cid:54) λ ∞ (cid:88) k =0 (1 − λ ) k + λ (cid:54) λ where the first inequality is valid since t (cid:62) log β max − β max ) , and thus C β = . Next, we have t (cid:88) k =0 β k min(1 , (cid:112) kε n ) (cid:54) √ ε n λ ∞ (cid:88) k =0 √ k (1 − λ ) k + (1 − λ ) t Since t (cid:62) log( ε n /β max )2 log(1 − β max ) , we get (1 − λ ) t (cid:54) (cid:112) ε n λ and ∞ (cid:88) k =0 (1 − λ ) k √ k (cid:54) (cid:118)(cid:117)(cid:117)(cid:116) ∞ (cid:88) k =0 (1 − λ ) k (cid:118)(cid:117)(cid:117)(cid:116) ∞ (cid:88) k =0 (1 − λ ) k k = (cid:114) λ (cid:114) − λλ (cid:54) λ / and therefore we obtain the desired inequality with C (cid:48) β = 2. For an estimator of the form (24), our goal is to bound (cid:13)(cid:13) A smooth t − P t (cid:13)(cid:13) . We define P smooth t def. = (cid:80) tk =0 β k P t − k , and divide the error in two terms: (cid:13)(cid:13) A smooth t − P t (cid:13)(cid:13) (cid:54) (cid:13)(cid:13) A smooth t − P smooth t (cid:13)(cid:13) + (cid:13)(cid:13) P smooth t − P t (cid:13)(cid:13) (26)The first error term corresponds to the difference between A smooth t and its expectation (up to thediagonal terms). Intuitively, it decreases when the amount of smoothing increases , that is, when r increases or λ gets close to 0, since the sum of matrices is taken over more values. The second termis the difference between the smoothed matrix of probability connection and its value at time t . Thistime, it will increase when the amount of smoothing increases, since the past communities will beincreasingly present in P smooth t . Once we have the two bounds, we can balance them to obtain anoptimal value for r or λ , respectively ρ n and ρ n . 13 .2.1 Bound on the first term The first bound will be handled by the following general concentration theorem. This is where we areable to weaken the hypothesis on the sparsity.
Theorem 3.
Let A , . . . , A t ∈ { , } n × n be t symmetric Bernoulli matrices whose elements a ( k ) ij areindependent random variables: a ( k ) ij ∼ Ber ( p ( k ) ij ) , a ( k ) ji = a ( k ) ij , a ( k ) ii = 0 Assume p ( k ) ij (cid:54) α n . Consider non-negative weights β k that satisfy (25) . Denoting A = (cid:80) tk =0 β k A t − k and P = E ( A ) , there is a universal constant C such that for all c > we have P (cid:16) (cid:107) A − P (cid:107) (cid:62) C (1 + c ) (cid:112) nα n β max (cid:17) (cid:54) e − (cid:18) c / Cβ + 23 c − log(14) (cid:19) n (27)+ e − c / Cβ +2 c/ · nαnβ max +log n + n − c +6 This theorem is proved in Appendix A.2. Its proof is heavily inspired by [22] and [33]: the spectralnorm is expressed as a maximization problem over the sphere, and for each point of the sphere theobtained sum is divided into so-called “light” terms, for which Berstein’s concentration inequality issufficient, and more problematic “heavy” terms, that require a complex concentration method. Weobtain our weaker sparsity hypothesis in a small but crucial part of this second step, the so-called bounded degree lemma . Lemma 3 (Bounded degree.) . Denote d i,t = (cid:80) j ( A t ) ij the degree of node i at time t , d i = (cid:80) tk =0 β k d i,t − k the smoothed degree and ¯ d i = E d i . Then, for all c , P (max i | d i − ¯ d i | (cid:62) cnα n ) (cid:54) exp (cid:18) − c / C β + 2 c/ · nα n β max + log n (cid:19) Proof.
We use Bernstein’s inequality. For any fixed i we have d i = t (cid:88) k =0 (cid:88) j (cid:54) = i β k a ( t − k ) ij = (cid:88) k,j Y jk where Y jk = β k a ( t − k ) ij are such that E ( Y jk ) = β k p ( t − k ) ij (cid:54) β k α n , | Y jk − E Y jk | (cid:54) ( α n + 1) β k (cid:54) β max ,and V ar ( Y jk ) (cid:54) β k α n such that (cid:80) k,j V ar ( Y jk ) (cid:54) C β nα n β max .Therefore, applying Berstein’s inequality, we have P ( | d i − ¯ d i | (cid:62) cnα n ) (cid:54) exp (cid:18) − c n α n / C β nα n β max + β max cnα n (cid:19) Applying a union bound over the nodes i proves the result.In the static case [22] where β max = 1, the bounded degree lemma is exactly where the relativesparsity hypothesis α n (cid:38) log nn is needed, otherwise the probability of failure diverges. In the dynamiccase, we see that β max (which we will ultimately set at ρ n ) intervenes and gives our final hypothesison sparsity and smoothness.Applying Theorem 3, we obtain that for any fixed Θ , . . . , Θ t , if nα n β max (cid:38) log n , then for any ν > C ν such that with probability at least 1 − n − ν (cid:13)(cid:13) A smooth t − P smooth t (cid:13)(cid:13) (cid:54) (cid:13)(cid:13) A smooth t − E ( A smooth t ) (cid:13)(cid:13) + (cid:13)(cid:13) diag( P smooth t ) (cid:13)(cid:13) (cid:54) C ν (cid:112) nα n β max + α n Since in all considered cases we will have β max (cid:62) /n the second term is negligible, and we obtain (cid:13)(cid:13) A smooth t − P smooth t (cid:13)(cid:13) (cid:46) (cid:112) nα n β max (28)14 .2.2 Second term The second error term in (26) is handled slightly differently in the deterministic and Markov DSBM,even if the final bound is the same.
Lemma 4.
Consider the deterministic DSBM, with weights that satisfy (25) . It holds that (cid:13)(cid:13) P smooth t − P t (cid:13)(cid:13) (cid:46) C (cid:48) β α n (cid:114) n ¯ n max ε n β max (29) Proof.
Since the weights sum to 1, we decompose (cid:13)(cid:13) P smooth t − P t (cid:13)(cid:13) (cid:54) (cid:88) k β k (cid:107) P t − k − P t (cid:107) (cid:54) (cid:88) k β k (cid:107) P t − k − P t (cid:107) F where (cid:107)·(cid:107) F is the Frobenius norm. Consider P = Θ B Θ (cid:62) and P (cid:48) = Θ (cid:48) B (Θ (cid:48) ) (cid:62) two probability matricessuch that there is a set S of nodes that have changed communities. We have then: (cid:107) P − P (cid:48) (cid:107) F = (cid:88) i ∈S (cid:88) j ( p ij − p (cid:48) ij ) + ( p ji − p (cid:48) ji ) (cid:54) (cid:88) i ∈S (cid:88) j p ij + ( p (cid:48) ij ) (cid:54) α n |S| n max max k (cid:88) (cid:96) ( B ) k(cid:96) (cid:54) |S| α n ¯ n max Since at most ks nodes have changed community between P t and P t − k , with a maximum of n nodes,we have (cid:107) P t − k − P t (cid:107) F (cid:54) α n ¯ n max min( n, ks ) = 2 α n n ¯ n max min(1 , kε n ) (30)Using the hypothesis that we have made on (cid:80) k β k min(1 , √ kε n ), we obtained the desired bound.At the end of the day, combining (26), (28) and (29) for both deterministic and Markov DSBMmodel we obtain with the desired probability: (cid:13)(cid:13) A smooth t − P t (cid:13)(cid:13) (cid:46) E ( β max ) + E ( β max ) where (cid:40) E ( β ) def. = √ nα n βE ( β ) def. = α n (cid:113) n ¯ n max ε n β (31)As expected, E decreases and E increases when β max decreases. A simple function study show thatthe sum of the errors is minimized for β max = ρ n , which concludes the proof of Theorem 1. Since the bound on the first term (28) is valid for any Θ k , and the A k are conditionally independentgiven the Θ k , by the law of total probability it is also valid with joint probability at least 1 − n − ν onboth the A k and Θ k in the Markov DSBM model. For the bound on the second term, we show that(30) is still valid with high probability, replacing ¯ n max with n . Lemma 5.
Consider the Markov DSBM model. We have P (cid:16) ∃ k, (cid:107) P t − k − P t (cid:107) F (cid:62) (8 + C ) α n n min(1 , kε n ) (cid:17) (cid:54) e − C ε n n +log εn The proof is in Appendix A.4. Using this Lemma, if (20) is satisfied we obtain that with probabilityat least 1 − n − ν , (30) is satisfied for all k . Using the rest of the proof of Lemma 4, (29) is valid in theMarkov DSBM model, with n instead of ¯ n max . The rest of the proof is the same as the deterministiccase. 15 .3 Concentration of Laplacian: proof of Theorem 2 A crucial part of handling the normalized Laplacian is to lower-bound the degrees of the nodes, sincewe later manipulate the inverse of the degree matrix. Under our hypotheses, the minimal expecteddegree is of the order of α n ¯ n min , so we need to bound the deviation of the degrees with respect to thisquantity. We revisit the bounded degree lemma. Lemma 6 (Bounded degree revisited.) . Under the deterministic DSBM, for all c , P (max i | d i − ¯ d i | (cid:62) c ¯ n min α n ) (cid:54) exp (cid:18) − c / C β + 2 c/ · ¯ n min α n µ B β max + log n (cid:19) Proof.
We do the exact same proof as Lemma 3, but we remark that (cid:80) k,j
V ar ( Y jk ) (cid:54) C β ¯ n max α n β max ,since (cid:80) i p ( t − k ) ij (cid:54) α n ¯ n max for all k, i . Therefore, applying Berstein’s inequality, we have P ( | d i − ¯ d i | (cid:62) c ¯ n min α n ) (cid:54) exp (cid:18) − c ¯ n α n / C β ¯ n max α n β max + β max c ¯ n min α n (cid:19) Applying a union bound over the nodes i proves the result.To lower-bound d i , we use Lemma 6 with 0 < c <
1, for instance c = . The sparsity hypothesis(22) in the theorem comes directly from this: it uses ¯ n min instead of n , and the multiplicative constant C (cid:48) ν actually depends on the desired concentration rate ν , unlike the previous case of the adjacencymatrix where ν could be obtained by adjusting c in Lemma 3. Let us now turn to the proof of thetheorem.As before, we divide the bound in two parts: (cid:13)(cid:13) L ( A smooth t ) − L ( P t ) (cid:13)(cid:13) (cid:54) (cid:13)(cid:13) L ( A smooth t ) − L ( P smooth t ) (cid:13)(cid:13) + (cid:13)(cid:13) L ( P smooth t ) − L ( P t ) (cid:13)(cid:13) (32)The first bound is handled with a general concentration theorem. Theorem 4.
Let A , . . . , A t ∈ { , } n × n be t symmetric Bernoulli matrices whose elements a ( k ) ij areindependent random variables: a ( k ) ij ∼ Ber ( p ( k ) ij ) , a ( k ) ji = a ( k ) ij , a ( k ) ii = 0 Consider non-negative weights β k that satisfy (25) . Denoting A = (cid:80) tk =0 β k A t − k and P = E ( A ) .Assume p ( k ) ij (cid:54) α n , and that there is ¯ n min , ¯ n max such that for all i , α n ¯ n min (cid:54) (cid:80) j p ij (cid:54) α n ¯ n max . Thenthere is a universal constant C such that for all c > we have P (cid:16) (cid:107) L ( A ) − L ( P ) (cid:107) (cid:62) C (1 + c ) µ B ¯ n min (cid:114) nβ max α n (cid:17) (33) (cid:54) e − (cid:18) c / Cβ + 23 c − log(14) (cid:19) n + e − c / Cβ +2 c/ · nαnβ max +log n + e − / Cβ +1 / · ¯ n min αnµBβ max +log n + n − c +6 The proof is in Appendix A.3. Similar to the adjacency matrix case, we thus obtain (cid:13)(cid:13) L ( A smooth t ) − L ( P smooth t ) (cid:13)(cid:13) (cid:54) (cid:13)(cid:13) L ( A smooth t ) − L ( E ( A smooth t )) (cid:13)(cid:13) + (cid:13)(cid:13) L ( E ( A smooth t )) − L ( P smooth t ) (cid:13)(cid:13) and by Lemma 11, the second term is negligible since E ( A smooth t ) and P smooth t only differ by theirdiagonal, of the order of α n .The second bound is handled in the same way as the adjacency matrix in the deterministic case.16 emma 7. Under the deterministic DSBM, we have (cid:13)(cid:13) L ( P t ) − L ( P smooth t ) (cid:13)(cid:13) (cid:46) C (cid:48) β µ B ¯ n min (cid:114) n ¯ n max ε n β max The proof is in Appendix A.4.At the end of the day, we obtain (cid:13)(cid:13) L ( A smooth t ) − L ( P t ) (cid:13)(cid:13) (cid:46) µ B ¯ n min α n ( E ( β max ) + E ( β max )) (34)Which is minimized for the same choice of β max ∼ ρ n . In the DSBM, it should come as no surprise that a model that is very regular should not need to beas dense as when treading with a single snapshot. Our analysis is the first to show this, for classicSC, in a non-asymptotic manner. Under a slightly stronger condition on the regularity than thatin [33], we showed that strong consistency guarantees can be obtained even in the sparse case. Weextended the results to the normalized Laplacian and, although we obtain the same final error rateas the adjacency matrix, our analysis also yields, to our knowledge, the best non-asymptotic spectralbound concentration of the normalized Laplacian for Bernoulli matrices with independent edges.In this theoretical paper, we did not discuss how to select in practice the various parameters ofthe algorithms such as the number of communities K or the forgetting factor λ . This is left forfuture investigations, as well as the analysis of varying K , n , or B . As we mentioned in Remark2, an outstanding conjecture about the sparse case and ε n ∼ (cid:107) L ( A ) − L ( P ) (cid:107) → Acknowledgements
S. Vaiter was supported by ANR GraVa ANR-18-CE40-0005 and Projet ANER RAGA G048CVCRB-2018ZZ. We thank Nicolas Verzelen for useful discussion and pointing us to references.
References [1] Emmanuel Abbe. Community detection and stochastic block models: recent developments.
Jour-nal of Machine Learning Research , pages 1–86, 2018.[2] Fr Bach and Mi Jordan. Learning spectral clustering, with application to speech separation.
TheJournal of Machine Learning Research , 7:1963–2001, 2006.[3] Afonso S Bandeira and Ramon van Handel. Sharp nonasymptotic bounds on the norm of randommatrices with independent entries.
Annals of Probability , 44(4):2479–2506, 2016.[4] Joan Bruna and Xiang Li. Community Detection with Graph Neural Networks. pages 1–15, 2017.[5] Alain Celisse, Jean Jacques Daudin, and Laurent Pierre. Consistency of maximum-likelihood andvariational estimators in the stochastic block model.
Electronic Journal of Statistics , 6(September2011):1847–1899, 2012. 176] Yun Chi, Xiaodan Song, Dengyong Zhou, Koji Hino, and Belle L. Tseng. Evolutionary spec-tral clustering by incorporating temporal smoothness.
Proceedings of the 13th ACM SIGKDDinternational conference on Knowledge discovery and data mining - KDD ’07 , page 153, 2007.[7] Yun Chi, Xiaodan Song, Dengyong Zhou, Koji Hino, and Belle L Tseng. On evolutionary spectralclustering.
ACM Transactions on Knowledge Discovery from Data , 3(4):1–30, 2009.[8] Romain Couillet and Florent Benaych-Georges. Kernel spectral clustering of large dimensionaldata.
Electronic Journal of Statistics , 10(1):1393–1454, 2016.[9] Charanpal Dhanjal, Romaric Gaudel, and St´ephan Cl´emen¸con. Efficient eigen-updating for spec-tral graph clustering.
Neurocomputing , 131:440–452, 2014.[10] Peter Diao, Dominique Guillot, Apoorva Khare, and Bala Rajaratnam. Model-free consistencyof graph partitioning. 2016.[11] Nicol´as Garc´ıa Trillos and Dejan Slepˇcev. A variational approach to the consistency of spectralclustering.
Applied and Computational Harmonic Analysis , 45(2):239–281, 2018.[12] Amir Ghasemian, Pan Zhang, Aaron Clauset, Cristopher Moore, and Leto Peel. Detectabilitythresholds and optimal algorithms for community structure in dynamic networks.
Physical ReviewX , 6(3):1–9, 2016.[13] Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A survey ofstatistical network models.
Foundations and Trends in Machine Learning , 2(2):129–233, 2009.[14] Qiuyi Han, Kevin S. Xu, and Edoardo M. Airoldi. Consistent estimation of dynamic and multi-layer block models. , 2:1511–1520, 2015.[15] Qirong Ho, Le Song, and Eric P. Xing. Evolving cluster mixed-membership blockmodel fortime-varying networks.
Journal of Machine Learning Research , 15:342–350, 2011.[16] Paul W Holland. STOCHASTIC BLOCKMODELS: FIRST STEPS. 5, 1983.[17] Petter Holme. Modern temporal network theory: a colloquium.
European Physical Journal B ,88(9), 2015.[18] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborov´a,and Pan Zhang. Spectral redemption: clustering sparse networks. pages 1–11, 2013.[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A Simple Linear Time (1+ eps) -Approximation Algorithm for k-Means Clustering in Any Dimensions. , pages 454–462, 2004.[20] Can M. Le, Elizaveta Levina, and Roman Vershynin. Concentration and regularization of randomgraphs.
Random Structures and Algorithms , 51(3):538–561, 2017.[21] Can M. Le, Elizaveta Levina, and Roman Vershynin. Concentration of random graphs andapplication to community detection. 80:1–17, 2018.[22] Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models.
Annals of Statistics , 43(1):215–237, 2015.[23] Ron Levie, Michael M. Bronstein, and Gitta Kutyniok. Transferability of Spectral Graph Con-volutional Neural Networks. pages 1–35, 2019.1824] Fuchen Liu, David Choi, Lu Xie, and Kathryn Roeder. Global spectral clustering in dynamicnetworks.
Proceedings of the National Academy of Sciences of the United States of America ,115(5):927–932, 2018.[25] Stuart P. Lloyd. Least Squares Quantization in PCM.
IEEE Transactions on Information Theory ,28(2):129–137, 1982.[26] L´ea Longepierre and Catherine Matias. Consistency of the maximum likelihood and variationalestimators in a dynamic stochastic block model. 2019.[27] Catherine Matias and Vincent Miele. Statistical clustering of temporal networks through a dy-namic stochastic block model.
Journal of the Royal Statistical Society. Series B: StatisticalMethodology , 79(4):1119–1141, 2017.[28] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture.
Combinatorica , pages 1–44, 2017.[29] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On Spectral Clustering: Analysis and Algo-rithm. In
Advances in Neural Information Processing Systems , pages 849–856, 2001.[30] Huazhong Ning, W Xu, Y Chi, Y Gong, and Ts Huang. Incremental Spectral Clustering WithApplication to Monitoring of Evolving Blog Communities.
SIAM Int. Conf. on Data Mining ,pages 261–272, 2007.[31] Roberto Imbuzeiro Oliveira. Concentration of the adjacency matrix and of the Laplacian inrandom graphs with independent edges. pages 1–46, 2009.[32] Richard Peng, He Sun, and Luca Zanetti. Partitioning Well-Clustered Graphs: Spectral Cluster-ing Works! 40:1–33, 2014.[33] Marianna Pensky and Teng Zhang. Spectral clustering in the dynamic stochastic block model.
Electronic Journal of Statistics , 13(1):678–709, 2019.[34] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochas-tic blockmodel.
Annals of Statistics , 39(4):1878–1915, 2011.[35] Minh Tang and Carey E. Priebe. Limit theorems for eigenvectors of the normalized Laplacianfor random graphs.
Annals of Statistics , 46(5):2360–2415, 2018.[36] Nicolas Tremblay and Andreas Loukas. Approximating Spectral Clustering via Sampling: aReview. pages 1–41, 2019.[37] Ulrike Von Luxburg. A tutorial on spectral clustering.
Statistics and Computing , 17(4):395–416,2007.[38] Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering.
Annals of Statistics , 36(2):555–586, 2008.[39] Yu Wang, Aniket Chakrabarti, David Sivakoff, and Srinivasan Parthasarathy. Fast Change PointDetection on Dynamic Social Networks. pages 2992–2998, 2017.[40] Dinghai Xu and John Knight. Continuous Empirical Characteristic Function Estimation of Mix-tures of Normal Parameters.
Econometric Reviews , 30(1):25–50, 2010.[41] Kevin S. Xu. Stochastic Block Transition Models for Dynamic Networks. pages 1–23, 2014.[42] Kevin S. Xu and Alfred O. Hero. Dynamic stochastic blockmodels for time-evolving social net-works.
IEEE Journal on Selected Topics in Signal Processing , 8(4):552–562, 2014.1943] Kevin S Xu, Mark Kliger, and Alfred O Hero Iii. Evolutionary spectral clustering with adaptativeforgetting factor.
Time , (1):2174–2177, 2010.[44] Tianbao Yang, Yun Chi, Shenghuo Zhu, Yihong Gong, and Rong Jin. Detecting communitiesand their evolutions in dynamic social networks - A Bayesian approach.
Machine Learning ,82(2):157–189, 2011.
A Proofs
A.1 Proof of Lemma 1
From [22, Section 5.4], for any matrix M ∈ R K × K and Q = Θ M Θ (cid:62) , given an estimator ˆ Q that wefeed to the SC algorithm it holds that E ( ˆΘ , Θ) (cid:46) (1 + δ ) n (cid:48) max Knn γ M (cid:13)(cid:13)(cid:13) ˆ Q − Q (cid:13)(cid:13)(cid:13) where γ M is the smallest eigenvalue of M .When using the adjacency matrix ˆ P = A to estimate the probability matrix P = Θ B Θ (cid:62) , we have B = α n B , and γ M = α n γ , which gives us (8). When B is defined as (3), we have γ = 1 − τ .In the Laplacian case, for a node i (cid:54) n belonging to a community k (cid:54) K , we have d i = (cid:80) j p ij = d (cid:48) k def. = ( n k − B kk + (cid:80) (cid:96) (cid:54) = k n (cid:96) B k(cid:96) (cid:54) α n ¯ n max , hence the Laplacian of the probability matrix L ( P ) canbe written as: L ( P ) = D ( P ) − P D ( P ) − = D ( P ) − Θ B Θ (cid:62) D ( P ) − = Θ (cid:16) D − B BD − B (cid:17) Θ (cid:62) where D B = diag( d (cid:48) k ) ∈ R K × K . Hence we can apply the result above with M = D − B BD − B , andsince (cid:13)(cid:13)(cid:13) D B B − D B (cid:13)(cid:13)(cid:13) (cid:54) α n ¯ n max α n γ (cid:54) ¯ n max γ the smallest eigenvalue of D − B BD − B satisfies γ M (cid:62) γ ¯ n max , which leads to the result. A.2 Proof of Theorem 3
The proof is heavily inspired by [22]. Define P k = E ( A k ), W k = A k − P k and w ( k ) ij its elements, andtheir respective smoothed versions A = (cid:80) tk =0 β k A t − k , P = E ( A ), W = A − P , and a ij , p ij , w ij theirelements. Denote by S the Euclidean ball in R n of radius 1. The proof strategy of [22] is to define agrid T = { x ∈ S : 2 √ nx i ∈ Z } and simply note that (Lemma 2.1 in [22] supplementary): (cid:107) W (cid:107) = sup u ∈ S | u (cid:62) W u | (cid:54) x,y ∈ T | x (cid:62) W y | Hence we must bound this last quantity. To do this, for each given ( x, y ) in T , we divide their indicesinto ”light” pairs: L ( x, y ) = { ( i, j ) : | x i y j | (cid:54) (cid:114) α n β max n } and ”heavy” pairs H ( x, y ) are all the other indices. We naturally dividesup x,y ∈ T | x (cid:62) W y | (cid:54) sup x,y ∈ T | (cid:88) ( i,j ) ∈L ( x,y ) x i y j w ij | + sup x,y ∈ T | (cid:88) ( i,j ) ∈H ( x,y ) x i y j w ij | and bound each of these two terms separately. 20 .2.1 Bounding the light pairs To bound the light pairs, Bernstein’s concentration inequality is sufficient.
Lemma 8 (Bounding the light pairs). We have
\[
\mathbb{P}\Big( \sup_{x,y \in T} \Big| \sum_{(i,j) \in \mathcal{L}(x,y)} x_i y_j w_{ij} \Big| \ge c \sqrt{n \alpha_n \beta_{\max}} \Big) \le e^{-\big( \frac{c^2/2}{C_\beta + \frac{2}{3} c} - \log(14) \big) n}
\]
for all constants $c > 0$.

Proof. The proof is immediate by applying Bernstein's inequality. Take any $(x,y) \in T$, denote $C = \sqrt{\frac{\alpha_n}{n \beta_{\max}}}$. Define $u_{ij} = x_i y_j \mathbb{1}_{(i,j) \in \mathcal{L}(x,y)} + x_j y_i \mathbb{1}_{(j,i) \in \mathcal{L}(x,y)}$ (which is necessary because the edges $(i,j)$ and $(j,i)$ are not independent). We have
\[
\sum_{(i,j) \in \mathcal{L}(x,y)} x_i y_j w_{ij} = \sum_{1 \le i < j \le n} \sum_{k=0}^{t} Y_{ijk}, \qquad Y_{ijk} := u_{ij}\, \beta_k\, w^{(t-k)}_{ij}
\]
where the $Y_{ijk}$ are independent and centered, with $|Y_{ijk}| \le 2 C \beta_{\max} = 2\sqrt{\alpha_n \beta_{\max}/n}$, and $\sigma^2_{ijk} := \operatorname{Var}(Y_{ijk}) \le \beta_k^2 (x_i y_j + x_j y_i)^2 \alpha_n$ since $\operatorname{Var}(w^{(t-k)}_{ij}) = p^{(t-k)}_{ij}(1 - p^{(t-k)}_{ij}) \le \alpha_n$. Note that
\[
\sum_{i<j} \sum_k \sigma^2_{ijk} \le \alpha_n \sum_k \beta_k^2 \sum_{i<j} (x_i y_j + x_j y_i)^2 \le 2 \alpha_n \sum_k \beta_k^2 \le 2 C_\beta \alpha_n \beta_{\max}
\]
Bernstein's inequality with $t = c\sqrt{n\alpha_n\beta_{\max}}$ then yields a bound of the form $e^{-\frac{c^2/2}{C_\beta + 2c/3} n}$ for a single pair $(x,y)$, and a union bound over the grid $T$ concludes.

A.2.2 Bounding the heavy pairs

To bound the heavy pairs, two main Lemmas are required: the so-called bounded degree lemma (Lemma 3) and the bounded discrepancy lemma, presented below. As mentioned before, the bounded degree lemma is key in improving the sparsity hypothesis, despite the simplicity of its proof. The bounded discrepancy lemma is closer to its original proof in [22], which we reproduce here for completeness.

Lemma 9 (Bounded discrepancy). For $I, J \subset \{1, \dots, n\}$, we define
\[
\mu(I,J) = \alpha_n |I| |J|, \qquad e(I,J) = \sum_{k=0}^{t} \beta_k e_{t-k}(I,J)
\]
where $e_t(I,J)$ is the number of edges between $I$ and $J$ at time $t$. Then, for all $c, c'$, with probability at least $1 - e^{-\frac{c^2/2}{C_\beta + 2c/3}\, n \alpha_n \beta_{\max} + \log n} - n^{-c'+6}$: for all $|I| \le |J|$, at least one of the following is true:
1. $e(I,J) \le c'' \mu(I,J)$ with $c'' = \max(ec, 8)$;
2. $e(I,J) \log \frac{e(I,J)}{\mu(I,J)} \le c' \beta_{\max} |J| \log \frac{n}{|J|}$.

Of course, by symmetry it is also valid for $|J| \le |I|$ with the same probability (inverting the roles of $I$ and $J$ in the bounds). To prove it, we need the following Lemma.

Lemma 10 (Adapted from Lemma 9 in [33]). Let $X_1, \dots, X_n$ be independent variables such that $X_i = Y_i - \mathbb{E} Y_i$, where $Y_i$ is a Bernoulli random variable with parameter $p_i$. Define $X = \sum_i w_i X_i$ where $0 \le w_i \le w_{\max}$. Let $\mu$ be such that $\sum_{i=1}^n w_i p_i \le \mu$. Then, for all $t \ge 0$, we have
\[
\mathbb{P}(X \ge t\mu) \le \exp\Big( -\frac{t \log(1+t)\, \mu}{2 w_{\max}} \Big)
\]

Proof. For some $\lambda > 0$,
\[
\mathbb{E}\big( e^{\lambda w_i X_i} \big) = p_i e^{w_i (1-p_i) \lambda} + (1-p_i) e^{-w_i p_i \lambda}.
\]
Hence
\[
\mathbb{E}\big( e^{\lambda X} \big) = \prod_i \mathbb{E}\big( e^{\lambda w_i X_i} \big) = \prod_i \Big( p_i e^{w_i(1-p_i)\lambda} + (1-p_i) e^{-w_i p_i \lambda} \Big) \le e^{-\lambda \sum_i w_i p_i} \prod_i \Big( 1 + p_i \big( e^{w_i \lambda} - 1 \big) \Big)
\]
Using $1 + a \le e^a$ and $e^x - 1 \le \frac{e^A - 1}{A}\, x$ for $0 \le x \le A$, we have
\[
\prod_i \Big( 1 + p_i \big( e^{w_i \lambda} - 1 \big) \Big) \le \exp\Big( \frac{e^{w_{\max} \lambda} - 1}{w_{\max}} \sum_i w_i p_i \Big)
\]
Hence, for $t \ge 0$ and $\lambda = \frac{\log(1+t)}{w_{\max}}$,
\begin{align*}
\mathbb{P}(X \ge t\mu) \le e^{-t\mu\lambda}\, \mathbb{E}\big( e^{\lambda X} \big)
&\le \exp\bigg( \Big( \frac{e^{w_{\max}\lambda} - 1}{w_{\max}} - \lambda \Big) \sum_i w_i p_i - \lambda t \mu \bigg) \\
&= \exp\bigg( \frac{1}{w_{\max}} \Big( \big( t - \log(1+t) \big) \sum_i w_i p_i - \log(1+t)\, t \mu \Big) \bigg) \\
&\le \exp\Big( \frac{\mu}{w_{\max}} \big( t - \log(1+t) - \log(1+t)\, t \big) \Big) \quad \text{since } \log(1+t) \le t \text{ and } \sum_i w_i p_i \le \mu \\
&\le \exp\Big( -\frac{\mu}{2 w_{\max}}\, t \log(1+t) \Big) \quad \text{since } t - \log(1+t) \le \tfrac{t}{2} \log(1+t).
\end{align*}
The Lemma above is slightly stronger than Bernstein's inequality in this particular case: with Bernstein we would have obtained $O(t)$ instead of $O(t \log(1+t))$ in the exponent.
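To make this comparison concrete, here is a minimal Monte Carlo sanity check of Lemma 10 (the Bernoulli parameters and weights below are hypothetical, chosen only so that the tail is visible): the empirical tail of $X$ should sit below the bound $\exp(-t\log(1+t)\,\mu/(2w_{\max}))$.

```python
import numpy as np

# Monte Carlo check of Lemma 10 on a weighted sum of centered Bernoulli
# variables (hypothetical parameters).
rng = np.random.default_rng(0)
n, trials = 500, 20_000
p = rng.uniform(0.005, 0.02, size=n)        # Bernoulli parameters p_i
w = rng.uniform(0.0, 1.0, size=n)           # weights, so w_max = 1
w_max, mu = 1.0, float((w * p).sum())       # take mu = sum_i w_i p_i exactly

X = (rng.random((trials, n)) < p) @ w - mu  # X = sum_i w_i (Y_i - p_i)

for t in [1.0, 2.0, 4.0]:
    empirical = (X >= t * mu).mean()
    bound = np.exp(-t * np.log(1 + t) * mu / (2 * w_max))
    print(f"t={t:.0f}  empirical={empirical:.3e}  bound={bound:.3e}")
```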
Now we can prove the bounded discrepancy lemma.

Proof of Lemma 9. We assume that the bounded degree property (Lemma 3) holds, which implies that for all $I, J$:
\[
e(I,J) = \sum_{k=0}^{t} \beta_k e_{t-k}(I,J) \le \sum_k \beta_k \min\Big( \sum_{i \in I} d_{i,t-k},\, \sum_{j \in J} d_{j,t-k} \Big) \le \min\Big( \sum_{i \in I} d_i,\, \sum_{j \in J} d_j \Big) \le c\, n \alpha_n \min(|I|, |J|)
\]
Following this, for any pair $I, J$ such that $|I| \ge n/e$ or $|J| \ge n/e$ (where $e = \exp(1)$ is chosen for later convenience), we have
\[
\frac{e(I,J)}{\mu(I,J)} \le \frac{c\, n \alpha_n \min(|I|,|J|)}{\alpha_n |I| |J|} \le ce
\]
and the result is proved.

Thus we now consider the pairs $I, J$ where both have size less than $n/e$, and such that $|I| \le |J|$ without loss of generality. For such a given pair $I, J$, we decompose
\[
e(I,J) = \sum_{k=0}^{t} \sum_{(i,j)} \beta_k a^{(t-k)}_{ij} = \sum_{i,j,k} Y_{ijk}
\]
where the sum over $(i,j)$ counts each distinct edge between $I$ and $J$ only once, and $Y_{ijk} = \beta_k a^{(t-k)}_{ij}$, where $a^{(t-k)}_{ij}$ is a Bernoulli variable with parameter $p^{(t-k)}_{ij}$. Using $\sum_{i,j,k} \beta_k p^{(t-k)}_{ij} \le |I||J|\alpha_n = \mu(I,J)$ and Lemma 10 (with $w_{\max} = \beta_{\max}$), we have, for any $t \ge 2$,
\[
\mathbb{P}\big( e(I,J) \ge t \mu(I,J) \big) \le \mathbb{P}\big( e(I,J) - \mathbb{E}\, e(I,J) \ge (t-1)\mu(I,J) \big) \le \exp\Big( -\frac{\mu(I,J)(t-1)\log t}{2\beta_{\max}} \Big) \le \exp\Big( -\frac{\mu(I,J)\, t \log t}{4\beta_{\max}} \Big)
\]
Denoting $u = u(I,J)$ the unique value such that $u \log u = \frac{c' \beta_{\max} |J|}{\mu(I,J)} \log \frac{n}{|J|}$, and $t(I,J) = \max(8, u(I,J))$, we have (again for a fixed pair $I, J$ of size less than $n/e$):
\[
\mathbb{P}\big( e(I,J) \ge t(I,J)\, \mu(I,J) \big) \le e^{-c' |J| \log \frac{n}{|J|}}
\]
Then, performing the same computations as in [22] (reproduced here for completeness):
\begin{align*}
\mathbb{P}\Big( \exists I,J : |I| \le |J| \le \tfrac{n}{e},\ e(I,J) \ge t(I,J)\mu(I,J) \Big) &\le \sum_{1 \le |I| \le |J| \le n/e} e^{-c' |J| \log \frac{n}{|J|}} \le \sum_{1 \le h \le g \le n/e} \binom{n}{h} \binom{n}{g}\, e^{-c' g \log \frac{n}{g}} \\
&\le \sum_{1 \le h \le g \le n/e} \Big( \frac{ne}{h} \Big)^{h} \Big( \frac{ne}{g} \Big)^{g} e^{-c' g \log \frac{n}{g}} \\
&\le \sum_{1 \le h \le g \le n/e} \exp\Big( h \log \frac{ne}{h} + g \log \frac{ne}{g} - c' g \log \frac{n}{g} \Big) \\
&\le \sum_{1 \le h \le g \le n/e} \exp\Big( 4 g \log \frac{n}{g} - c' g \log \frac{n}{g} \Big) \le \sum_{1 \le h \le g \le n/e} n^{-c'+4} \le n^{-c'+6}
\end{align*}
using the fact that $x \mapsto x \log \frac{n}{x}$ is increasing on $[1, n/e]$ (so that $h \log \frac{ne}{h} \le g \log \frac{ne}{g} \le 2 g \log \frac{n}{g}$ and $g \log \frac{n}{g} \ge \log n$). So $e(I,J) \le t(I,J)\mu(I,J)$ holds uniformly for all pairs $I, J$ with high probability.

Finally, we distinguish two cases depending on the value of $t(I,J)$. If $t(I,J) = 8$, we get $e(I,J) \le 8 \mu(I,J)$.
If $t(I,J) = u(I,J) \ge 8$, we have $e(I,J) \le u\, \mu(I,J)$, and
\[
\frac{e(I,J)}{\mu(I,J)} \log \frac{e(I,J)}{\mu(I,J)} \le u \log u \le \frac{c' \beta_{\max}}{\mu(I,J)}\, |J| \log \frac{n}{|J|}
\]
We can now prove the bound on the heavy pairs, that is, we want to prove that with high probability:
\[
\sup_{x,y \in T} \Big| \sum_{(i,j) \in \mathcal{H}(x,y)} x_i y_j w_{ij} \Big| \lesssim \sqrt{n \alpha_n \beta_{\max}}
\]
Since $p_{ij} \le \alpha_n$ and by definition of the heavy pairs, for all $x, y \in T$:
\[
\Big| \sum_{(i,j) \in \mathcal{H}(x,y)} x_i y_j p_{ij} \Big| \le \alpha_n \sum_{(i,j) \in \mathcal{H}(x,y)} \frac{(x_i y_j)^2}{|x_i y_j|} \le \alpha_n \sqrt{\frac{n \beta_{\max}}{\alpha_n}}\, \|x\|^2 \|y\|^2 \le \sqrt{n \alpha_n \beta_{\max}}
\]
Hence our goal is now to bound $\sup_{x,y \in T} \big| \sum_{(i,j) \in \mathcal{H}(x,y)} x_i y_j a_{ij} \big|$. We will show that, when the bounded degree and bounded discrepancy properties hold, this sum is bounded for all $x, y$. From now on, we assume that these results hold, and consider any $x, y \in T$. Let us define sets of indices $I_s, J_t$ over which we bound $x_i$ and $y_j$ uniformly, and replace the sum over the $a_{ij}$ in these sets by $e(I_s, J_t)$. More specifically, we define
\[
I_s = \Big\{ i : \frac{2^{s-1}}{\sqrt n} \le |x_i| < \frac{2^s}{\sqrt n} \Big\} \quad \text{for } s = 1, \dots, \log_2(2\sqrt n) + 1, \qquad
J_t = \Big\{ j : \frac{2^{t-1}}{\sqrt n} \le |y_j| < \frac{2^t}{\sqrt n} \Big\} \quad \text{for } t = 1, \dots, \log_2(2\sqrt n) + 1
\]
Since we consider heavy pairs, we need only consider indices $(s,t)$ such that $2^{s+t} \ge \sqrt{\frac{\alpha_n n}{\beta_{\max}}}$, and we define $C_n = \sqrt{\frac{\alpha_n n}{\beta_{\max}}}$ for convenience. Moreover, we have $\sum_{i \in I_s, j \in J_t} a_{ij} \le 2\, e(I_s, J_t)$, since each edge appears at most twice. Hence, we have:
\[
\Big| \sum_{(i,j) \in \mathcal{H}(x,y)} x_i y_j a_{ij} \Big| \le \sum_{(s,t) : 2^{s+t} \ge C_n} \frac{2^{s+t}}{n} \cdot 2\, e(I_s, J_t) \tag{35}
\]
We now introduce more notations. We denote
\[
\mu_s = \frac{2^{2s} |I_s|}{n}, \qquad \nu_t = \frac{2^{2t} |J_t|}{n}, \qquad \gamma_{st} = \frac{e(I_s, J_t)}{\alpha_n |I_s| |J_t|}, \qquad \sigma_{st} = \gamma_{st}\, \frac{C_n}{2^{s+t}}.
\]
We reformulate (35) as:
\[
\Big| \sum_{(i,j) \in \mathcal{H}(x,y)} x_i y_j a_{ij} \Big| \le \sum_{(s,t) : 2^{s+t} \ge C_n} \frac{2^{s+t}}{n} \cdot 2\, e(I_s, J_t) = 2 \sqrt{n \alpha_n \beta_{\max}} \sum_{s,t} \mu_s \nu_t \sigma_{st} \tag{36}
\]
Our goal is therefore to show that $\sum_{s,t} \mu_s \nu_t \sigma_{st} \lesssim 1$. For this, we will make extensive use of the fact that $\mu_s \le 4 \sum_{i \in I_s} x_i^2$, and therefore $\sum_s \mu_s \le 4$, and similarly $\sum_t \nu_t \le 4$. We consider the set of pairs $\mathcal{C} = \{ (s,t) : 2^{s+t} \ge C_n,\ |I_s| \le |J_t| \}$, divided in six:
\begin{align*}
\mathcal{C}_1 &= \{ (s,t) \in \mathcal{C} : \sigma_{st} \le 1 \} \\
\mathcal{C}_2 &= \{ (s,t) \in \mathcal{C} \setminus \mathcal{C}_1 : \gamma_{st} \le c'' \} \\
\mathcal{C}_3 &= \{ (s,t) \in \mathcal{C} \setminus (\mathcal{C}_1 \cup \mathcal{C}_2) : 2^{s-t} \ge C_n \} \\
\mathcal{C}_4 &= \{ (s,t) \in \mathcal{C} \setminus (\cup_{i \le 3}\, \mathcal{C}_i) : \log \gamma_{st} > \tfrac14 \log \tfrac{4^t}{\nu_t} \} \\
\mathcal{C}_5 &= \{ (s,t) \in \mathcal{C} \setminus (\cup_{i \le 4}\, \mathcal{C}_i) : 2t \log 2 \ge \log \tfrac{1}{\nu_t} \} \\
\mathcal{C}_6 &= \mathcal{C} \setminus (\cup_{i \le 5}\, \mathcal{C}_i)
\end{align*}
Similarly, we define $\mathcal{C}' = \{ (s,t) : 2^{s+t} \ge C_n,\ |I_s| \ge |J_t| \}$ and the $\mathcal{C}'_i$ in the same way, by inverting the roles of $\mu_s$ and $\nu_t$. We write the proof for $\mathcal{C}$; the other case is strictly symmetric.
Our goal is to prove that each of the sums $\sum_{(s,t) \in \mathcal{C}_i} \mu_s \nu_t \sigma_{st}$ is bounded by a constant.

Pairs in $\mathcal{C}_1$. In this case we get
\[
\sum_{(s,t) \in \mathcal{C}_1} \mu_s \nu_t \sigma_{st} \le \sum_{s,t} \mu_s \nu_t \le 16
\]

Pairs in $\mathcal{C}_2$. This includes the indices for which the first case of the bounded discrepancy lemma (Lemma 9) is satisfied. Since for $(s,t) \in \mathcal{C}$ we have $2^{s+t} \ge C_n$, we have $\sigma_{st} \le \gamma_{st} \le c''$, and
\[
\sum_{(s,t) \in \mathcal{C}_2} \mu_s \nu_t \sigma_{st} \le c'' \sum_{s,t} \mu_s \nu_t \le 16\, c''
\]

Pairs in $\mathcal{C}_3$. Since $2^{s-t} \ge C_n$, we necessarily have $t \le s - \log_2 C_n$. Furthermore, since we assumed the bounded degree property (Lemma 3), we have $e(I_s, J_t) \le c |I_s| \alpha_n n$, and therefore $\gamma_{st} \le cn/|J_t|$. Then,
\[
\sum_{(s,t) \in \mathcal{C}_3} \mu_s \nu_t \sigma_{st} \le \sum_s \mu_s \sum_{t=1}^{s - \log_2 C_n} \frac{2^{2t} |J_t|}{n} \cdot \frac{cn}{|J_t|} \cdot \frac{C_n}{2^{s+t}} = c \sum_s \mu_s \frac{C_n}{2^s} \sum_{t=1}^{s - \log_2 C_n} 2^t \le c \sum_s \mu_s \frac{C_n}{2^s} \cdot 2^{\,s - \log_2 C_n + 1} = 2c \sum_s \mu_s \le 8c
\]

Pairs in $\mathcal{C}_4$. All $I_s, J_t$ for which the first case of the bounded discrepancy lemma is satisfied are included in $\mathcal{C}_2$, hence the remaining sets satisfy the second case of the lemma, which reads $e(I_s, J_t) \log \gamma_{st} \le c' \beta_{\max} |J_t| \log \frac{n}{|J_t|}$. Since $|J_t| = \nu_t n / 4^t$, dividing by $\alpha_n |I_s| |J_t|$, multiplying by $\frac{C_n 2^{s-t}}{n}$ and using $\frac{\beta_{\max} C_n}{\alpha_n n} = \frac{1}{C_n}$, it can be reformulated as
\[
\sigma_{st}\, \mu_s \log \gamma_{st} \le c'\, \frac{2^{s-t}}{C_n} \log \frac{4^t}{\nu_t} \tag{37}
\]
Since $(s,t) \notin \mathcal{C}_3$, we have $2^{s-t} \le C_n$, and therefore $s \le t + \log_2 C_n$. Since $(s,t) \in \mathcal{C}_4$, we have $\log \gamma_{st} > \frac14 \log \frac{4^t}{\nu_t}$, and (37) implies $\sigma_{st} \mu_s \le 4c'\, \frac{2^{s-t}}{C_n}$. Then,
\[
\sum_{(s,t) \in \mathcal{C}_4} \mu_s \nu_t \sigma_{st} \le \sum_t \nu_t \sum_{s=1}^{t + \log_2 C_n} 4c'\, \frac{2^{s-t}}{C_n} \le 4c' \sum_t \nu_t\, \frac{2^{-t}}{C_n} \cdot 2^{\,t + \log_2 C_n + 1} = 8 c' \sum_t \nu_t \le 32\, c'
\]

Pairs in $\mathcal{C}_5$. Since $(s,t) \notin \mathcal{C}_4$ and $2t \log 2 \ge \log \frac{1}{\nu_t}$, we have $\log \gamma_{st} \le \frac14 \log \frac{4^t}{\nu_t} \le t \log 2$ and $\gamma_{st} \le 2^t$. On the other hand, since $(s,t) \notin \mathcal{C}_1$, $1 \le \sigma_{st} = \gamma_{st} C_n 2^{-(s+t)} \le C_n 2^{-s}$, and $s \le \log_2 C_n$. Because $(s,t) \notin \mathcal{C}_2$ and $c'' \ge 8$, we have $\log \gamma_{st} \ge \log 8 \ge 2$, and equation (37) becomes
\[
\sigma_{st} \mu_s \le c'\, \frac{2^{s-t}}{C_n} \cdot 2t\log 2 \le c'\, \frac{2^{s+1}}{C_n} \quad \text{since } t\, 2^{-t} \le \tfrac12
\]
Then,
\[
\sum_{(s,t) \in \mathcal{C}_5} \mu_s \nu_t \sigma_{st} \le \sum_t \nu_t \sum_{s=1}^{\log_2 C_n} c'\, \frac{2^{s+1}}{C_n} \le c' \sum_t \nu_t\, \frac{2}{C_n} \cdot 2^{\log_2 C_n + 1} = 4 c' \sum_t \nu_t \le 16\, c'
\]

Pairs in $\mathcal{C}_6$. Finally, we have $0 \le \log \gamma_{st} \le \frac14 \log \frac{4^t}{\nu_t} \le \frac12 \log \frac{1}{\nu_t}$ because, respectively, $(s,t) \notin \mathcal{C}_2$, $(s,t) \notin \mathcal{C}_4$ and $(s,t) \notin \mathcal{C}_5$, so $\gamma_{st} \le \nu_t^{-1/2}$. Since by definition $2^{s+t} \ge C_n$, we have $t \ge \log_2 C_n - s$, and
\[
\sum_{(s,t) \in \mathcal{C}_6} \mu_s \nu_t \sigma_{st} = \sum_{(s,t) \in \mathcal{C}_6} \mu_s \nu_t \gamma_{st}\, \frac{C_n}{2^{s+t}} \le \sum_s \mu_s \frac{C_n}{2^s} \sum_{t \ge \log_2 C_n - s} \sqrt{\nu_t}\, \Big(\frac12\Big)^t \le 4 \sum_s \mu_s \frac{C_n}{2^s} \Big(\frac12\Big)^{\log_2 C_n - s} = 4 \sum_s \mu_s \le 16
\]
using $\sqrt{\nu_t} \le 2$. This concludes the proof of Theorem 3.
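To illustrate the resulting concentration bound, the following Monte Carlo sketch (hypothetical DSBM parameters: a fixed two-community structure and exponential smoothing weights, not taken from the paper's experiments) tracks $\|A - P\| / \sqrt{n \alpha_n \beta_{\max}}$ as $n$ grows in the sparse regime $\alpha_n = O(1/n)$, where the expected degree stays bounded; the ratio should remain bounded.

```python
import numpy as np

# Empirical illustration of Theorem 3 (hypothetical parameters): the spectral norm
# of W = A - P for the smoothed adjacency matrix, rescaled by sqrt(n alpha_n beta_max),
# should stay bounded as n grows, even with bounded expected degree.
rng = np.random.default_rng(1)
lam, T = 0.6, 30
beta = lam ** np.arange(T); beta /= beta.sum()   # smoothing weights summing to 1
beta_max = beta.max()
B0 = np.array([[1.0, 0.3], [0.3, 1.0]])

for n in [200, 400, 800, 1600]:
    alpha_n = 10.0 / n                           # bounded expected degree
    z = rng.integers(0, 2, size=n)               # static communities for simplicity
    P = alpha_n * B0[np.ix_(z, z)]
    W = np.zeros((n, n))
    for k in range(T):                           # W = sum_k beta_k (A_{t-k} - P)
        Ak = np.triu((rng.random((n, n)) < P).astype(float), 1)
        W += beta[k] * (Ak + Ak.T - P)
    print(n, round(np.linalg.norm(W, 2) / np.sqrt(n * alpha_n * beta_max), 3))
```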
A.3 Concentration of Laplacian: proof of Theorem 4

We denote by $D$ and $D_P$ the degree matrices of $A$ and $P$ respectively, containing the degrees $d_i = \sum_k \sum_j \beta_k a^{(t-k)}_{ij}$ and $\bar d_i = \mathbb{E}\, d_i$. Note that under our assumptions, $d_{\min} \stackrel{\text{def.}}{=} \alpha_n \bar n_{\min} \le \bar d_i \le d_{\max} \stackrel{\text{def.}}{=} \alpha_n \bar n_{\max}$. Applying Lemma 6 with $c = \frac12$, we obtain: for all $\nu > 0$, there is a constant $C'_\nu$ such that, if $\alpha_n \beta_{\max} \ge C'_\nu \mu_B \frac{\log n}{\bar n_{\min}}$, then with probability at least $1 - n^{-\nu}$ we have $\frac12 d_{\min} \le d_i \le 2 d_{\max}$ for all $i$. We assume that this event is satisfied for the rest of the proof.

We apply Lemma 11, from which
\[
\| L(A) - L(P) \| \le \frac{2 \| A - P \|}{d_{\min}} + \frac{4\, \| (D - D_P) P \|}{d_{\min}^2} \tag{38}
\]
We will now bound $\| A - P \|$ and $\| (D - D_P) P \|$ with high probability, and use a union bound to conclude. By Theorem 3, with probability $1 - n^{-\nu}$ we have $\| A - P \| \lesssim \sqrt{n \alpha_n \beta_{\max}}$, and the first term has the desired rate.

To bound the spectral norm of $(D - D_P) P$ with high probability, we re-use the "light and heavy pairs" strategy of the previous proof, whose definitions we adopt. Define $\delta_i = d_i - \bar d_i$, and recall that on the event above, $|\delta_i| \le d_{\max}$. We use again the fact that $\| (D - D_P) P \| \le 4 \sup_{x,y \in T} x^\top (D - D_P) P\, y$, where $T$ is the same grid. We decompose
\[
x^\top (D - D_P) P\, y = \sum_{(i,j) \in \mathcal{L}(x,y)} x_i y_j p_{ij} \delta_i + \sum_{(i,j) \in \mathcal{H}(x,y)} x_i y_j p_{ij} \delta_i
\]
In the proof of Theorem 3 we proved that $\sum_{(i,j) \in \mathcal{H}(x,y)} |x_i y_j|\, p_{ij} \le \sqrt{n \alpha_n \beta_{\max}}$, and therefore, with the same probability,
\[
\Big| \sum_{(i,j) \in \mathcal{H}(x,y)} x_i y_j p_{ij} \delta_i \Big| \lesssim d_{\max} \sqrt{n \alpha_n \beta_{\max}}
\]
which is the desired rate.

We must now handle the light pairs. We write $\delta_i = \sum_{\ell=1}^{n} \sum_{k=0}^{t} \beta_k w^{(t-k)}_{i\ell}$, and therefore
\[
\sum_{(i,j) \in \mathcal{L}(x,y)} x_i y_j p_{ij} \delta_i = \sum_{i < \ell} \sum_k u_{i\ell k}\, w^{(t-k)}_{i\ell} \quad \text{where} \quad u_{i\ell k} = \beta_k \sum_j \Big( x_i y_j p_{ij} \mathbb{1}_{(i,j) \in \mathcal{L}(x,y)} + x_\ell y_j p_{\ell j} \mathbb{1}_{(\ell,j) \in \mathcal{L}(x,y)} \Big)
\]
We want to apply Bernstein's inequality. The random variables $w^{(t-k)}_{i\ell}$ are independent centered Bernoulli variables with parameters $p^{(t-k)}_{i\ell}$. By definition of the light pairs, and since $\sum_j p_{ij} \le \alpha_n \bar n_{\max}$, we have
\[
|u_{i\ell k}| \le 2 \beta_{\max}\, \alpha_n \bar n_{\max} \sqrt{\frac{\alpha_n}{n \beta_{\max}}} = \frac{2 \alpha_n \bar n_{\max} \sqrt{\beta_{\max} \alpha_n}}{\sqrt n}
\]
Then, using $(a+b)^2 \le 2(a^2 + b^2)$,
\[
\sum_{i\ell k} \operatorname{Var}\big( u_{i\ell k}\, w^{(t-k)}_{i\ell} \big) \le 2 \sum_k \beta_k^2 \sum_{i\ell} p^{(t-k)}_{i\ell} \bigg[ \Big( x_i \sum_j y_j p_{ij} \Big)^2 + \Big( x_\ell \sum_j y_j p_{\ell j} \Big)^2 \bigg] \le 4\, C_\beta \beta_{\max} \bar n_{\max}^2 \alpha_n^3
\]
where we have used
\[
\Big| \sum_j y_j p_{\ell j} \Big| = \Big| \sum_k \beta_k \sum_j y_j p^{(t-k)}_{\ell j} \Big| \le \sum_k \beta_k \|y\|\, \alpha_n \sqrt{n_{\max} + \tau^2 n} \le \alpha_n \sqrt{\bar n_{\max}}
\]
and $\sum_{i\ell} p^{(t-k)}_{i\ell} x_i^2 \le \alpha_n \bar n_{\max} \|x\|^2$. Hence, using Bernstein's inequality,
\[
\mathbb{P}\Big( \Big| \sum_{(i,j) \in \mathcal{L}(x,y)} x_i y_j p_{ij} \delta_i \Big| \ge t \Big) \le 2 \exp\Bigg( \frac{-t^2/2}{4 C_\beta \beta_{\max} \bar n_{\max}^2 \alpha_n^3 + \frac{2 \alpha_n \bar n_{\max} \sqrt{\beta_{\max} \alpha_n}}{3 \sqrt n}\, t} \Bigg)
\]
so that
\[
\mathbb{P}\Big( \Big| \sum_{(i,j) \in \mathcal{L}(x,y)} x_i y_j p_{ij} \delta_i \Big| \ge c\, \bar n_{\max} \alpha_n \sqrt{n \beta_{\max} \alpha_n} \Big) \le 2 \exp\Big( -\frac{c^2/2}{4 C_\beta + \frac23 c}\, n \Big)
\]
Using a union bound over $T$, we can conclude.
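The decomposition underlying (38) can be checked numerically. The sketch below (hypothetical parameters; the relatively sparse regime $\alpha_n \asymp \log n / n$ keeps all degrees away from zero) draws one smoothed DSBM adjacency matrix and compares $\|L(A) - L(P)\|$ against the two terms of the Lemma 11 bound it instantiates.

```python
import numpy as np

# Numerical check of the Lemma 11 bound behind (38), on one smoothed DSBM draw
# (hypothetical parameters).
rng = np.random.default_rng(2)

def laplacian(M):
    d = M.sum(axis=1)
    return M / np.sqrt(np.outer(d, d))           # D^{-1/2} M D^{-1/2}

n, lam, T = 1000, 0.6, 30
beta = lam ** np.arange(T); beta /= beta.sum()
alpha_n = 5 * np.log(n) / n                      # relatively sparse regime
z = rng.integers(0, 2, size=n)
B0 = np.array([[1.0, 0.3], [0.3, 1.0]])
P = alpha_n * B0[np.ix_(z, z)]

A = np.zeros((n, n))
for k in range(T):                               # A = sum_k beta_k A_{t-k}
    Ak = np.triu((rng.random((n, n)) < P).astype(float), 1)
    A += beta[k] * (Ak + Ak.T)

d, d_P = A.sum(axis=1), P.sum(axis=1)
assert d.min() > 0                               # degrees must be positive
d_min = min(d.min(), d_P.min())
lhs = np.linalg.norm(laplacian(A) - laplacian(P), 2)
t1 = np.linalg.norm(A - P, 2) / d_min
t2 = np.linalg.norm(np.diag(d - d_P) @ P, 2) / d_min**2
print(f"||L(A)-L(P)|| = {lhs:.3f} <= {t1:.3f} + {t2:.3f}")
```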
A.4 Additional proofs

Proof of Lemma 5. For any $k$, we have $p^{(t-k)}_{ij} = p^{(t)}_{ij}$ if $z^{t-k}_i = z^t_i$ and $z^{t-k}_j = z^t_j$, that is, if the nodes have not changed communities. Thus we write
\[
\| P_{t-k} - P_t \|_F^2 = \sum_{i,j} \big( p^{(t-k)}_{ij} - p^{(t)}_{ij} \big)^2 \Big( 1 - \mathbb{1}_{\{z^{t-k}_i = z^t_i\}} \mathbb{1}_{\{z^{t-k}_j = z^t_j\}} \Big) \le \alpha_n^2 \bigg( n^2 - \Big( \sum_{i=1}^{n} \mathbb{1}_{\{z^{t-k}_i = z^t_i\}} \Big)^2 \bigg)
\]
Considering that $z^{t-k}_i = z^t_i$ at least when $z^{t-k}_i = z^{t-k+1}_i = \dots = z^t_i$, and that this event happens with probability $(1 - \varepsilon_n)^k$, we have
\[
\mathbb{1}_{\{z^{t-k}_i = z^t_i\}} \ge a_i \sim \operatorname{Ber}\big( (1 - \varepsilon_n)^k \big)
\]
where the $a_i$ are independent Bernoulli variables. By Hoeffding's inequality, for some $\delta > 0$,
\[
\mathbb{P}\Big( \frac{1}{n} \sum_i a_i \le (1 - \varepsilon_n)^k - \delta \Big) \le e^{-2\delta^2 n}
\]
and therefore, with probability at least $1 - \rho$,
\[
\| P_{t-k} - P_t \|_F^2 \le \alpha_n^2 n^2 \Big( 1 - \big( (1-\varepsilon_n)^k - \delta \big)^2 \Big) \le 2\, \alpha_n^2 n^2 \big( 1 - (1-\varepsilon_n)^k + \delta \big)
\]
then, using $1 - x^k = (1-x)(1 + x + \dots + x^{k-1}) \le (1-x)\, k$ for $|x| \le 1$,
\[
\| P_{t-k} - P_t \|_F^2 \le 2\, \alpha_n^2 n^2 \big( \min(1, k\varepsilon_n) + \delta \big)
\]
Then we choose $\delta \sim \varepsilon_n \le \min(1, k\varepsilon_n)$ to conclude.
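A small Monte Carlo experiment makes this drift bound concrete. In the sketch below (hypothetical parameters: a three-community SBM where each node resamples its membership uniformly with probability $\varepsilon_n$ at each step), the ratio $\| P_{t-k} - P_t \|_F / \big( \alpha_n n \sqrt{\min(1, k\varepsilon_n)} \big)$ should stay of order one for all lags $k$.

```python
import numpy as np

# Monte Carlo check of Lemma 5 (hypothetical parameters): under Markov memberships
# that resample with probability eps_n at each step, ||P_{t-k} - P_t||_F should be
# of order alpha_n * n * sqrt(min(1, k * eps_n)).
rng = np.random.default_rng(3)
n, K, alpha_n, eps_n = 600, 3, 0.05, 0.02
B0 = np.full((K, K), 0.3) + 0.7 * np.eye(K)

z = rng.integers(0, K, size=n)                 # memberships at time t - k_max
P_old = alpha_n * B0[np.ix_(z, z)]
for k in range(1, 26):
    move = rng.random(n) < eps_n               # each node resamples w.p. eps_n
    z[move] = rng.integers(0, K, size=move.sum())
    P_new = alpha_n * B0[np.ix_(z, z)]
    ratio = np.linalg.norm(P_new - P_old, 'fro') / (alpha_n * n * np.sqrt(min(1, k * eps_n)))
    if k in (1, 5, 10, 25):
        print(k, round(ratio, 3))              # ratio should remain O(1)
```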
Proof of Lemma 7. Denote by $D = D(P_t)$ and $\bar D = D(P^{\text{smooth}}_t)$ the degree matrices of $P_t$ and $P^{\text{smooth}}_t$, with elements $d_i$ and $\bar d_i$. By assumption, we have $d_i, \bar d_i \ge d_{\min} \stackrel{\text{def.}}{=} \bar n_{\min} \alpha_n$ for all $i$. Therefore, by applying Lemma 11 we have
\[
\big\| L(P_t) - L(P^{\text{smooth}}_t) \big\| \le \frac{\| P_t - P^{\text{smooth}}_t \|}{d_{\min}} + \frac{\| (D - \bar D)\, P^{\text{smooth}}_t \|}{d_{\min}^2}
\]
From Lemma 4, we have $\| P_t - P^{\text{smooth}}_t \| \lesssim C'_\beta\, \alpha_n \sqrt{n \bar n_{\max} \varepsilon_n / \beta_{\max}}$.

For the second term, defining $D_{t-k}$ as the diagonal degree matrix associated with $P_{t-k}$, we have
\[
\| (D - \bar D)\, P^{\text{smooth}}_t \| \le \sum_{k=0}^{t} \beta_k\, \| (D - D_{t-k})\, P^{\text{smooth}}_t \|_F
\]
Denoting by $\bar p_{ij}$ the elements of $P^{\text{smooth}}_t$, recall that $\sum_j \bar p_{ij} \le \alpha_n \bar n_{\max}$. Using Cauchy-Schwarz and $(a+b)^2 \le 2(a^2 + b^2)$, we have
\begin{align*}
\| (D - D_{t-k})\, P^{\text{smooth}}_t \|_F^2 &= \sum_i \Big( \sum_\ell p^{(t)}_{i\ell} - p^{(t-k)}_{i\ell} \Big)^2 \sum_j \bar p_{ij}^2 \le \alpha_n^2 \bar n_{\max} \sum_i \bigg( \sum_\ell \Big( \sqrt{p^{(t)}_{i\ell}} + \sqrt{p^{(t-k)}_{i\ell}} \Big) \Big( \sqrt{p^{(t)}_{i\ell}} - \sqrt{p^{(t-k)}_{i\ell}} \Big) \bigg)^2 \\
&\le \alpha_n^2 \bar n_{\max} \sum_i \bigg( \sum_\ell 2 \big( p^{(t)}_{i\ell} + p^{(t-k)}_{i\ell} \big) \bigg) \bigg( \sum_\ell \Big( \sqrt{p^{(t)}_{i\ell}} - \sqrt{p^{(t-k)}_{i\ell}} \Big)^2 \bigg) \le 4\, \alpha_n^3 \bar n_{\max}^2\, \big\| P_t^{\circ\frac12} - P_{t-k}^{\circ\frac12} \big\|_F^2
\end{align*}
where $A^{\circ\frac12}$ indicates the element-wise square root. Repeating the proof of Lemma 4, for two SBM connection probability matrices $P$ and $P'$ between which only the nodes belonging to a set $\mathcal{S}$ have changed community, we have
\[
\big\| P^{\circ\frac12} - (P')^{\circ\frac12} \big\|_F^2 \le \sum_{i \in \mathcal{S}} \sum_j \Big( \sqrt{p_{ij}} - \sqrt{p'_{ij}} \Big)^2 + \Big( \sqrt{p_{ji}} - \sqrt{p'_{ji}} \Big)^2 \le 2 \sum_{i \in \mathcal{S}} \sum_j \big( p_{ij} + p'_{ij} \big) \le 4\, |\mathcal{S}|\, \alpha_n \bar n_{\max}
\]
Therefore, $\big\| P_t^{\circ\frac12} - P_{t-k}^{\circ\frac12} \big\|_F^2 \lesssim \alpha_n \bar n_{\max}\, n \min(1, k\varepsilon_n)$, and
\[
\| (D - D_{t-k})\, P^{\text{smooth}}_t \|_F \lesssim \alpha_n^2\, \bar n_{\max}^{3/2} \sqrt{n}\, \min\big( 1, \sqrt{k \varepsilon_n} \big)
\]
We conclude using the hypothesis on $\sum_k \beta_k \min(1, \sqrt{k\varepsilon_n})$.

A.5 Technical Lemma

Lemma 11. Let $A, P \in \mathbb{R}^{n \times n}$ be symmetric matrices with non-negative entries, assume that $d_i = \sum_j a_{ij}$ and $d^P_i = \sum_j p_{ij}$ are strictly positive, and define $D = \operatorname{diag}(d_i)$, $D_P = \operatorname{diag}(d^P_i)$, $d_{\min} = \min_i (d_i, d^P_i)$. Then,
\[
\| L(A) - L(P) \| \le \frac{\| A - P \|}{d_{\min}} + \frac{\| (D - D_P) P \|}{d_{\min}^2}
\]
Usingthis property, sup x,y ∈S (cid:88) ij x i y j p ij δ i ( d Pi − d i ) (cid:54) sup x,y ∈S (cid:88) ij | x i y j p ij δ i ( d Pi − d i ) | (cid:54) d − min sup x,y ∈S (cid:88) ij | x i y j p ij ( d Pi − d i ) | = 12 d − min sup x,y ∈S (cid:88) ij x i y j p ij ( d Pi − d i )= 12 d − min (cid:107)(cid:107) ( D − D P ) P (cid:107)(cid:107)