Estimation of Static Community Memberships from Temporal Network Data
Konstantin Avrachenkov∗, Maximilien Dreveton†, and Lasse Leskelä‡
Inria Sophia Antipolis, France — Aalto University, Finland
Abstract
This article studies the estimation of static community memberships from temporally correlated pair interactions represented by an N-by-N-by-T tensor, where N is the number of nodes and T is the length of the time horizon. We present several estimation algorithms, both offline and online, which fully utilise the temporal nature of the observed data. As an information-theoretic benchmark, we study data sets generated by a dynamic stochastic block model, and derive fundamental information criteria for the recoverability of the community memberships as N → ∞ both for bounded and diverging T. These results show that (i) even a small increase in T may have a big impact on the recoverability of community memberships, and (ii) consistent recovery is possible even for very sparse data (e.g. bounded average degree) when T is large enough. We analyse the accuracy of the proposed estimation algorithms under various assumptions on data sparsity and identifiability, and prove that an efficient online algorithm is strongly consistent up to the information-theoretic threshold under suitable initialisation. Numerical experiments show that even a poor initial estimate (e.g., blind random guess) of the community assignment leads to high accuracy after a small number of iterations, and remarkably so also in very sparse regimes.

Keywords: temporal networks, dynamic stochastic block model, community detection, random graphs.
∗Email: [email protected]  †Email: [email protected]  ‡Email: [email protected]

Contents (recoverable entries)

4 Markov dynamics
 4.1 Scale-ν model
 4.2 Special case of bounded T
 4.3 Long time horizon
A Proof of lower bound
 A.1 Unique permutation minimising the Hamming distance between two different node labellings
 A.2 Proof of Theorem 3.1
 A.3 Bounding of J/I
B Proof of upper bound
 B.1 Test between two noisy samples from two distributions
 B.2 Probability of error for a single node
 B.3 Proof of Proposition 3.5
C Hellinger divergence between sparse binary Markov chains
 C.1 Notations and main result
 C.2 Proof of Proposition C.1
 C.3 Auxiliary asymptotics lemmas
D Markov dynamics with long time horizon
 D.1 Clustering using the union graph
 D.2 Clustering using time-aggregated adjacency tensors
E Analysis of baseline algorithms
 E.1 Proof of Theorem 6.1
 E.2 Proof of Proposition 6.4
 E.3 Proof of Proposition 6.6
Introduction
Data sets in many application domains consist of observed pairwise interactions over time. Examples include human interactions related to sociology and epidemiology [KY04, LGK12, MFB15, ZWL+16]
and human brain networks [BWP+11, BMD+16]. In many such data sets it is natural to assume that each node belongs to a latent type, called a community (block, group, cluster). In this case a natural unsupervised learning problem is to infer the types (node labels, block memberships) from the observed pair interactions, a task commonly known as community recovery (or clustering) [For10]. While many static clustering methods exist (spectral methods [VL07], methods based on modularity maximisation [GN02, BGLL08, BC09], belief propagation [MM09, Moo17], Bayesian methods [HW08, Pei19], likelihood-based methods [WB17]), the extension to dynamic networks is not necessarily straightforward. In particular, a naive approach of clustering after performing a temporal aggregation (by summing and/or thresholding) may lead to a potentially important loss of information. Additionally, the ability to update the community estimates in an online fashion, as the user receives new data, is extremely important in practice. Recently, community recovery with temporal network data became a popular topic, leading to a multitude of terminology and algorithms (see for example [MM17] and references therein). Despite this interest, there have been few theoretical analyses of community recovery limits
[GZC+16, BLMT18, BC20]. Moreover,
[GZC+16, BLMT18] do not present rigorous analyses.

In this article we study a large population of N nodes which is partitioned into a static set of K blocks. The observed data set is represented by a three-way binary tensor (A^t_{ij}) indexed by nodes i, j = 1, . . . , N and a time parameter t = 1, . . . , T. The data set is assumed to be generated by a multi-layer stochastic block model [HLL83], so that the observed temporal interaction pattern of a node pair is independent of the interaction patterns of other node pairs, but the interactions over time may be correlated. Interestingly, the original paper [HLL83] of Holland et al. defined the SBM as a multi-layer network, but later on the SBM has gradually been restricted to a single-layer version. However, recently there has been a steady growth of interest in the multi-layer SBM. Below we elaborate in detail on the relation between the multi-layer SBM and temporal networks.

When T = 1, existing work on community recovery in static networks with block-community structure provides a strong information-theoretic foundation [ZZ16, GMZZ17, Abb18]. The extension to T > 1 could be naively done by considering the temporal network as a static network whose pairwise interaction weights are binary strings of length T. Although the link-labelled (or multi-layer) SBM has been studied [XJL20, HLM12, LMX15, YP16] and is now fairly well understood, two remarks are in order. First, since the number of link labels is 2^T, any algorithm designed for the link-labelled SBM will be completely inefficient for a temporal network (even if the algorithm is linear in the number of labels, it is exponential in T, which is prohibitive already for a reasonably small number of snapshots, e.g., T = 20). Moreover, the existing theoretical results are for a bounded or slowly growing number of labels. For example, the results of [XJL20] directly apply only if 2^T < N, while in practice the number of snapshots might be extremely large. Of course, in the context of temporal networks, online algorithms are obviously preferable over the 'recompute from scratch' approach.

Related work on clustering temporal graphs
While several dynamic extensions of the SBM have been proposed, a majority of them describe the change in the node labels, while the graphs are independently re-sampled at each time step with the new node labelling [YCZ+16].
The authors of [GZC+16] considered a Markov chain for each node's label, while the edges were resampled at each time step. They conjectured an expression for the detection threshold in the constant degree regime and in the limit of a large number of snapshots. Although no proof was given, this expression extended nicely the static SBM results, and the authors provided some insights, as well as a belief propagation algorithm on a space-time graph.

Barucca et al. [BLMT18] proposed a Markov chain evolution for both the node labels and the edges. More precisely, at time t, the presence or absence of a link between two nodes is copied with probability ξ, and otherwise is re-sampled according to the new community memberships. This double evolution makes the theoretical analysis challenging, and the authors proposed a likelihood approach to recover the communities. They showed that, while the persistence of communities made the community recovery problem easier, the persistence of the edges made it harder. A first restriction of this model is that the edge persistence parameter ξ is independent of the node labelling. Moreover, restricting ξ ∈ [0, 1] implicitly implies that the Markov evolution is positively correlated. While positive correlation is a reasonable assumption for social networks, it might not be suitable in other situations (for example, in biological networks spiking phenomena occur). Although we propose a model where the labelling of the nodes remains static, we consider more general dynamics (not necessarily Markov and not necessarily positively correlated) for the interaction patterns.

Recently, [BC20] introduced a temporal SBM with fixed communities, where the pairwise interactions are sampled at each time step according to a connectivity probability matrix. This connectivity matrix varies over time according to a stochastic process. This can be seen as an external modulating randomness, hence implying synchronous interaction patterns across blocks. They proposed a spectral algorithm based on linear algebraic methods using the squared adjacency matrix, similar to [Lei20], and showed consistency in a sparse regime.

Main contributions of our work
The main contributions of our work are the following.

1. We derive a lower bound on the community recovery threshold for a general dynamic SBM with static community memberships. This result extends in a natural but non-trivial way [XJL20, Theorem 5.2], by allowing the number of snapshots (or equivalently the number of link labels) to be arbitrarily large.

2. We show that if the block interaction probabilities are known, we can asymptotically recover the true communities up to the information-theoretic lower bound. This also extends [XJL20, Proposition 6.1].

3. In the case of a Markov dynamics, we derive the information-theoretic thresholds for strong consistency (exact recovery) and weak consistency (almost exact recovery). This leads to the computation of the Rényi divergence between two sparse Markov chains, which could be of independent interest. Moreover, we compare these bounds to those obtained if one aggregates the temporal data. This comes from the following lemma (a numerical check is given after this list): a 2-by-2 binary stochastic matrix P ≠ I admits a representation

P = ξ I + (1 − ξ) \begin{pmatrix} π_0 & π_1 \\ π_0 & π_1 \end{pmatrix}

for some ξ ∈ [0, 1) and some probability distribution π = (π_0, π_1), iff both eigenvalues of P are non-negative, iff Cov(X_0, X_1) ≥ 0 for a stationary Markov chain (X_0, X_1, . . .) on {0, 1} with transition matrix P.
4. We provide two online algorithms for a Markov dynamics, in the situations where the interaction probabilities are known or unknown. The update step costs at most O(N²) (less if the network is sparse), making the algorithms linear in T. Numerical validation is provided in both cases. In particular, a numerical study demonstrates that in a typical situation we recover the correct communities in a few steps starting from an initial random guess.

5. In some specific situations (finite number of nodes but large number of snapshots, static intra-community interactions, etc.), we provide some baseline algorithms and establish guarantees on their performance.
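As a concrete sanity check of the lemma in item 3 (our addition, not part of the paper's analysis), the following snippet verifies on random 2-by-2 stochastic matrices that the three characterisations coincide; here ξ equals trace(P) − 1, which is also the second eigenvalue of P.

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(10_000):
    p01, p10 = rng.uniform(size=2)
    P = np.array([[1 - p01, p01], [p10, 1 - p10]])
    # Candidate representation P = xi*I + (1-xi)*[[pi0, pi1], [pi0, pi1]]:
    xi = np.trace(P) - 1.0            # also the second eigenvalue of P
    pi = np.array([p10, p01]) / (1.0 - xi)
    # The algebraic identity always holds; it is a valid representation iff xi >= 0.
    assert np.allclose(xi * np.eye(2) + (1 - xi) * np.array([pi, pi]), P)
    rep_ok = xi >= -1e-12             # pi is then automatically a distribution
    eig_ok = min(np.linalg.eigvals(P).real) >= -1e-12
    # Covariance of two consecutive states of the stationary chain:
    pi1 = p01 / (p01 + p10)           # stationary probability of state 1
    cov = pi1 * P[1, 1] - pi1 ** 2
    assert rep_ok == eig_ok == (cov >= -1e-12)
```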
Structure of the paper

The paper is structured as follows. The model, together with the main notations and assumptions, is presented in Section 2. In Section 3, we derive the lower and upper error bounds for community recovery in a general dynamic SBM. The technical proofs are relegated to Appendices A and B. Section 4 is devoted to the study of Markov dynamics. We compute the various thresholds for exact recovery of the communities. Along those computations, we establish some results about the Hellinger distance between two binary Markov chains describing sparse interactions, which could be of independent interest. The statements and proofs are relegated to Appendix C to avoid saturating the main text. We derive in Section 5 two online, likelihood-based algorithms for Markov dynamics. The first algorithm requires the knowledge of the interaction parameters while the second one does not. In Section 6 we present some baseline algorithms for special cases (large T and bounded N, or static interaction patterns for intra-community nodes). We provide some numerical illustrations in Section 7, which demonstrate the high efficacy of our algorithms. Finally, we conclude with some remarks about future research in Section 8.

This section summarizes basic notations and conventions used in the article.
The cardinality of a set A is denoted by |A|. An ordered pair of elements is denoted (i, j). Unordered pairs are abbreviated {i, j}. The set of unordered pairs with elements in S is denoted \binom{S}{2}. The symbol 1(A) is defined to be one when statement A is true, and zero otherwise. We denote [N] = {1, . . . , N}.

For nonnegative sequences a = a_n and b = b_n indexed by n = 1, 2, . . . we denote a ≪ b or a = o(b) when lim_{n→∞} a_n/b_n = 0, and a ≲ b or a = O(b) when lim sup_{n→∞} a_n/b_n < ∞. We denote a ≍ b or a = Θ(b) when a ≲ b and b ≲ a, and a ∼ b when lim_{n→∞} a_n/b_n = 1.

All probability measures are defined with respect to the Borel sigma-algebra of the underlying space. All countable spaces are equipped with the discrete topology. For probability measures on countable spaces we abbreviate f({x}) by f(x). The Dirac measure at x is denoted by δ_x. The product of probability measures f and g is denoted by f ⊗ g. Moreover, f^{⊗n} denotes the n-fold product measure of f.

Dynamic stochastic block models

A dynamic stochastic block model with N nodes, K blocks, and T layers (snapshots) is parameterized by a node labelling σ: [N] → [K] and an interaction kernel f = (f_{kℓ}), which is a collection of probability distributions on {0, 1}^T such that f_{kℓ} = f_{ℓk} for all k, ℓ = 1, . . . , K. These parameters specify the distribution

P_σ(A) = ∏_{1 ≤ i < j ≤ N} f_{σ(i)σ(j)}(A_{ij}),    (2.1)

where A_{ij} = (A^1_{ij}, . . . , A^T_{ij}) denotes the interaction pattern of the node pair {i, j}. The model is called homogeneous if f_{kℓ} = f_in whenever k = ℓ and f_{kℓ} = f_out whenever k ≠ ℓ. A Markov SBM is a dynamic SBM in which each interaction pattern is a binary Markov chain, so that

f_{kℓ}(x) = μ_{kℓ}(x_1) ∏_{t=2}^{T} P_{kℓ}(x_{t−1}, x_t),  x ∈ {0, 1}^T,    (2.2)

for initial distributions μ_{kℓ} and transition probability matrices P_{kℓ}; in the homogeneous case we write μ_in, P_in and μ_out, P_out for the intra- and inter-block parameters.

The dynamic community recovery problem is the task of estimating an unknown node labelling σ based on an observed three-dimensional data array A^T = (A^t_{ij}). After observing A^T, the observer directly learns N and T, but the other model parameters may or may not remain unknown. Hence there are three interrelated statistical tasks:

(i) estimation of the node labelling σ when the model dimensions (N, K, T) and the interaction parameters (f_in, f_out) are known;
(ii) simultaneous estimation of the node labelling σ and interaction parameters (f_in, f_out) when the model dimensions (N, K, T) are known;
(iii) simultaneous estimation of the node labelling σ, the interaction parameters (f_in, f_out), and the number of blocks K.

Most earlier research has focused on problems (i) and (ii) in the static setting. Cases (i) and (ii) are also our main focus, and discussions on estimation problem (iii) are postponed to Section 8.

A large network of interacting nodes is modelled as a sequence of Markov SBMs defined by (2.1)–(2.2), indexed by a scale parameter ν = 1, 2, . . . , where the ν-th model has dimensions (N^{(ν)}, K^{(ν)}, T^{(ν)}), interaction kernel (f_in^{(ν)}, f_out^{(ν)}), and node labelling σ^{(ν)}. The main attention is focused on the situations where:

• the number of nodes N^{(ν)} diverges to infinity;
• the number of time slots T^{(ν)} may or may not diverge to infinity;
• the number of blocks K^{(ν)} may or may not be constant;
• μ_in^{(ν)}(1) = c_in p^{(ν)} and μ_out^{(ν)}(1) = c_out p^{(ν)} for some constants c_in, c_out ∈ (0, ∞), and some p^{(ν)} representing the overall edge density;
• the edge refresh rates P_in^{(ν)}(1, 0) and P_out^{(ν)}(1, 0) may or may not be constants.

To keep the notation light, we omit the scale parameter when its role is clear from the context. When the number of nodes tends to infinity, we may without loss of generality assume that N^{(ν)} = ν and use N as the scale parameter. Then the model is parameterized by scale-dependent sequences K^{(ν)} = K_N, T^{(ν)} = T_N and by the interaction pattern probabilities μ_in^{(N)}, μ_out^{(N)}, P_in^{(N)}, P_out^{(N)}.
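To make the model concrete, here is a minimal NumPy sampler of a homogeneous Markov SBM tensor (our sketch; the function name and interface are ours). Each pair interaction pattern is drawn as an independent binary Markov chain with initial law μ_in or μ_out and transition matrix P_in or P_out, depending on whether the pair is intra- or inter-block.

```python
import numpy as np

def sample_markov_sbm(sigma, mu_in1, mu_out1, P_in, P_out, T, rng=None):
    """Sample an N x N x T binary tensor with A[i, j, t] = A[j, i, t]."""
    rng = np.random.default_rng(rng)
    N = len(sigma)
    A = np.zeros((N, N, T), dtype=np.int8)
    for i in range(N):
        for j in range(i + 1, N):
            same = sigma[i] == sigma[j]
            mu1 = mu_in1 if same else mu_out1
            P = P_in if same else P_out
            x = rng.random() < mu1                 # initial state ~ Bernoulli(mu1)
            A[i, j, 0] = A[j, i, 0] = x
            for t in range(1, T):
                x = rng.random() < P[int(x), 1]    # Markov transition
                A[i, j, t] = A[j, i, t] = x
    return A

# Example: two balanced blocks of 50 nodes, 10 snapshots.
sigma = np.repeat([0, 1], 50)
P_in = np.array([[0.99, 0.01], [0.5, 0.5]])
P_out = np.array([[0.999, 0.001], [0.9, 0.1]])
A = sample_markov_sbm(sigma, 0.05, 0.01, P_in, P_out, T=10, rng=0)
```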
Now the community recovery problem (i) for the scale-ν model becomes the problem of developing a function φ which maps an observed data array A = A^{(ν)} into an estimated node labelling σ̂^{(ν)} = φ(A^{(ν)}). For two node labellings σ_1, σ_2: [N] → [K] we introduce the loss ℓ(σ_1, σ_2) between σ_1 and σ_2 as

ℓ(σ_1, σ_2) = (1/N) min_{π ∈ Sym(K)} d_Ham(σ_1, π ∘ σ_2),

where Sym(K) denotes the group of permutations on [K]. Then we see that ℓ(σ_1, σ_2) = 0 if and only if the partitions {σ_1^{−1}(k) : k ∈ [K]} and {σ_2^{−1}(k) : k ∈ [K]} are equal. An estimator φ is said to achieve asymptotically exact recovery if

P_{σ^{(ν)}} ( ℓ(φ(A^{(ν)}), σ^{(ν)}) > 0 ) → 0,

and asymptotically almost exact recovery if

P_{σ^{(ν)}} ( ℓ(φ(A^{(ν)}), σ^{(ν)}) > ε ) → 0  for all ε > 0.
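The loss ℓ can be computed directly by enumerating the K! label permutations (our illustration, using 0-based labels; for large K one would instead solve the assignment problem with the Hungarian algorithm):

```python
from itertools import permutations
import numpy as np

def loss(sigma1, sigma2, K):
    """Relative Hamming distance minimised over relabellings of sigma2."""
    sigma1, sigma2 = np.asarray(sigma1), np.asarray(sigma2)
    N = len(sigma1)
    best = N
    for perm in permutations(range(K)):
        relabel = np.array(perm)[sigma2]          # pi o sigma2
        best = min(best, int(np.sum(sigma1 != relabel)))
    return best / N

# loss([0, 0, 1, 1], [1, 1, 0, 0], K=2) == 0.0  (same partition)
```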
Let S be a set containing an element 0. Let A ∈ S^{N×N} be a symmetric matrix with zero diagonal. A clustering algorithm is a map σ̂ which maps an interaction matrix A into a node labelling σ̂_A: [N] → [K]. The original node identifiers are the integers 1, . . . , N. Consider permuting the identifiers by a bijection π ∈ Sym(N), so that π(i) is the alternative identifier of a node with original identifier i. Then the matrix A^π(i, j) = A(π^{−1}(i), π^{−1}(j)) tells how a pair of nodes with alternative identifiers i, j interact. A clustering algorithm σ̂ is called permutation equivariant if σ̂_{A^π} ∘ π = σ̂_A for all A and all π ∈ Sym(N). A clustering algorithm σ̂ is called permutation equivariant as partition-valued if σ̂_{A^π} ∘ π ≃ σ̂_A for all A and for all π ∈ Sym(N), where we use the convention that σ_1 ≃ σ_2 if σ_1 = ρ ∘ σ_2 for some ρ ∈ Sym(K) (block membership structures are equal as partitions).

The following result generalizes [XJL20, Theorem 5.2] to a nonasymptotic setting which makes no regularity assumptions on f_in, f_out, nor any assumptions on the underlying space S of node labels (it can be an arbitrary measurable space). Note that [XJL20, Theorem 5.2] does not tell what happens for large T in the case where S = {0, 1}^T.

Theorem 3.1. Consider a homogeneous model where f_in, f_out are arbitrary. Let σ be the true node labelling, for which the minimum block size is N_0 = min_k |σ^{−1}(k)| ≥ 2, and for which there exists a block of size N_0 + 1. Then, for any permutation equivariant estimator σ̂, the expected relative error is bounded by

E ℓ(σ̂, σ) ≥ (N_0 / N) e^{−N_0 I − √(N_0 (I² + J))},

where I = −2 log ∫ (f_in f_out)^{1/2} denotes the Rényi divergence of order 1/2, and J = ∫ (log f_in − log f_out)² f* with f* = Z^{−1} (f_in f_out)^{1/2} and Z = ∫ (f_in f_out)^{1/2}.

The proof of Theorem 3.1 is presented in Appendix A.2.

Remark 3.2. The quantity J in Theorem 3.1 replaces [XJL20, Assumption A*], which states that ∫ (log f_in − log f_out)² (f_in + f_out) ≲ ∫ (f_in^{1/2} − f_out^{1/2})² ≈ Hel²(f_in, f_out).

Corollary 3.3. Under the same setting as Theorem 3.1, assume that N/N_0 = O(1) and that I ≤ 1. Then, almost exact recovery is not possible if lim sup N_0 I < ∞, and exact recovery is not possible if lim sup N_0 I / log N < 1.

Proof. Recall that almost exact recovery holds if ℓ(σ̂, σ) = o(1), and exact recovery holds if ℓ(σ̂, σ) = o(N^{−1}). From Theorem 3.1, almost exact recovery is not possible if lim sup ( N_0 I + √(N_0 (I² + J)) ) < ∞. From Lemma A.2, we have J ≤ 14 I. Hence,

N_0 I + √(N_0 (I² + J)) ≤ N_0 I + √(N_0 (I² + 14 I)) ≤ N_0 I + √(15 N_0 I),

where the latter inequality holds since I ≤ 1. Therefore,

lim sup ( N_0 I + √(N_0 (I² + J)) ) < ∞  ⟺  lim sup N_0 I < ∞,

and the result for almost exact recovery is established. The result for exact recovery is similar.

We now generalise [XJL20, Proposition 6.1] by introducing the following algorithm and stating its guarantee of consistency. Note that, similarly to the previous section, we cannot directly use the results of [XJL20] in the situation T ≫ 1. Moreover, the algorithm studied in [XJL20] is linear in the number of labels, hence exponential in T.

Remark 3.4. Similarly to [GMZZ17, XJL20], for technical reasons the initialisation step of Algorithm 1 consists of N separate runs of spectral clustering, each performed on the union graph with one node removed. A consensus step is therefore needed at the end, to correctly permute the individual predictions. Nonetheless, in practice it suffices to do one spectral clustering on the union graph and remove this consensus step. We will discuss practical aspects in more detail in Section 7.

Algorithm 1: General clustering for a dynamic SBM
Input: Observed adjacency tensor (A^t_{ij}); number of communities K; interaction parameters f_in, f_out.
Output: Estimated node labelling σ̂.
Initialize: for i = 1, . . . , N do
 Let σ̃^{(i)} ∈ [K]^{N−1} be the output of spectral clustering on the graph G̃^{(i)}, where G̃^{(i)} is generated from the union graph ∪_{t=1}^{T} G^t by removing node i and the edges attached to it.
Update: for i = 1, . . . , N do
 for k = 1, . . . , K do
  Compute h_{ik} ← Σ_{j ≠ i} 1(σ̃^{(i)}_j = k) log ( f_in(A^T_{ij}) / f_out(A^T_{ij}) ).
 Let σ̂^{(i)} ∈ [K]^N be such that σ̂^{(i)}_j = σ̃^{(i)}_j for j ≠ i, and set σ̂^{(i)}_i ← arg max_{1≤k≤K} h_{ik} with arbitrary tie breaks.
Consensus: Set σ̂_1 ← σ̂^{(1)}_1. for i = 2, . . . , N do
 σ̂_i ← arg max_{k ∈ [K]} | {j : σ̂^{(1)}_j = k} ∩ {j : σ̂^{(i)}_j = σ̂^{(i)}_i} |.

Proposition 3.5. Consider a homogeneous model defined by (2.1), indexed by a scale parameter ν. Assume that f_in^{(ν)}, f_out^{(ν)} are known and such that f_in^{(ν)} is absolutely continuous w.r.t. f_out^{(ν)}, and vice versa. Assume that D_{3/4}(f_in^{(ν)}, f_out^{(ν)}) / I^{(ν)} ≍ c^{(ν)} for some c^{(ν)}, where I^{(ν)} and D_{3/4}(f_in^{(ν)}, f_out^{(ν)}) are respectively the Rényi divergences of order 1/2 and of order 3/4 between f_in^{(ν)} and f_out^{(ν)}. Suppose that N^{(ν)} ≫ 1, and

lim_{ν→∞} c^{(ν)} βK max(1 − f_in(0), 1 − f_out(0)) / ( N^{(ν)} (f_in(0) − f_out(0))² ) = 0,

where f(0) abbreviates f(0, . . . , 0), the probability of the all-zero interaction pattern. Let σ̂ be the output of Algorithm 1. Then, there exists ξ^{(ν)} = o(1) such that

lim_{ν→∞} Pr( ℓ(σ̂, σ) ≤ exp( −(N^{(ν)} I^{(ν)} / (βK^{(ν)})) (1 − ξ^{(ν)}) ) ) = 1.

The proof is given in Appendix B. In particular, Algorithm 1 achieves almost exact and exact recovery up to the lower bound derived in Corollary 3.3. Note that lim_{ν→∞} βK max(1 − f_in(0), 1 − f_out(0)) / ( N^{(ν)} (f_in(0) − f_out(0))² ) = 0 ensures that spectral clustering on the union graph achieves almost exact recovery. The extra factor c^{(ν)} in the condition of Proposition 3.5 ensures that the mistakes made by the initial prediction do not spread too much while computing the likelihood ratio tests (see Lemma B.1 in the Appendix).

Markov dynamics

Scale-ν model

In this section, we focus on applications of Theorem 3.1 and Proposition 3.5 for sparse models with a Markov evolution.
By sparse, we mean that the probability of observing an interaction between any particular pair of nodes is small. This property is quantified by the assumption that

δ^{(ν)} T^{(ν)} ≪ 1,    (4.1)

where δ^{(ν)} = max_{k,ℓ ∈ [K^{(ν)}]} { μ_{kℓ}^{(ν)}(1), P_{kℓ}^{(ν)}(0, 1) }. This means that the expected number of on-periods between any particular pair of nodes is approximately zero. Often we assume that the blocks are balanced by requiring that the block sizes N_k = N_k^{(ν)} defined by N_k = |{i : σ(i) = k}| satisfy

N_k ≥ N / (βK),    (4.2)

for some constant β ∈ [1, ∞).

Theorem 4.1. Consider a homogeneous Markov SBM defined by (2.1)–(2.2) and indexed by a scale parameter ν which satisfies the sparsity assumption (4.1) and the block balance condition (4.2). Let

I^{(ν)} := I_0^{(ν)} + Σ_{t=2}^{T^{(ν)}} ( I_1^{(ν)} + I_2^{(ν)} + I_3^{(ν,t)} ),

where

I_0^{(ν)} = ( √(μ_in^{(ν)}(1)) − √(μ_out^{(ν)}(1)) )²,
I_1^{(ν)} = ( √(P_in^{(ν)}(0,1)) − √(P_out^{(ν)}(0,1)) )²,
I_2^{(ν)} = 2 ρ^{(ν)} √( P_in^{(ν)}(0,1) P_out^{(ν)}(0,1) ),

and

I_3^{(ν,t)} = 2 ρ^{(ν)} ( √(μ_in^{(ν)}(1) μ_out^{(ν)}(1)) − √(P_in^{(ν)}(0,1) P_out^{(ν)}(0,1)) / (1 − √(P_in^{(ν)}(1,1) P_out^{(ν)}(1,1))) ) × (1 − √(P_in^{(ν)}(1,1) P_out^{(ν)}(1,1))) ( √(P_in^{(ν)}(1,1) P_out^{(ν)}(1,1)) )^{t−2},

with

ρ^{(ν)} := 1 − √(P_in^{(ν)}(1,0) P_out^{(ν)}(1,0)) − √(P_in^{(ν)}(1,1) P_out^{(ν)}(1,1)).

Then:
(i) Exact recovery is possible if lim inf_{ν→∞} N^{(ν)} I^{(ν)} / (K^{(ν)} log N^{(ν)}) > 1, and is impossible if lim sup_{ν→∞} N^{(ν)} I^{(ν)} / (K^{(ν)} log N^{(ν)}) < 1.
(ii) Almost exact recovery is possible if lim inf_{ν→∞} N^{(ν)} I^{(ν)} / K^{(ν)} = ∞, and is impossible if lim sup_{ν→∞} N^{(ν)} I^{(ν)} / K^{(ν)} < ∞.

Proof. We saw in Corollary 3.3 that the conditions for impossibility of almost exact and exact recovery are governed by the quantity N^{(ν)} I^{(ν)}. Moreover, the lower bound is achieved by Algorithm 1. Hence, to prove Theorem 4.1, we need to compute the Rényi divergence between two Markov chains. The Rényi divergence of order 1/2 is linked to the Hellinger distance H = Hel(f, g) by the relation I = −2 log(1 − H²). Taylor's approximation implies that −log(1 − t) = t + ε(t) where 0 ≤ ε(t) ≤ t² for all 0 ≤ t ≤ 1/2. Hence 2H² ≤ I ≤ 2H² + 2H⁴ for H ≤ 1/√2, and hence I ∼ 2H² for H ≪ 1. A careful analysis of the Hellinger distance between two binary Markov chains is presented in detail in Appendix C. Using a first-order expansion (Proposition C.1) yields

I^{(ν)} ∼ I_0^{(ν)} + Σ_{t=2}^{T} ( I_1^{(ν)} + I_2^{(ν)} + I_3^{(ν,t)} ).

This completes the proof, because the expressions of I_0^{(ν)}, I_1^{(ν)}, I_2^{(ν)} and I_3^{(ν,t)} correspond to the ones in the statement.

Corollary 4.2. We consider the same assumptions and notations as in Theorem 4.1. Assume further that μ_in(1) = c_in p_N and μ_out(1) = c_out p_N with c_in, c_out being constants, and that P_in^{(N)}(0,1) = p_1 p_N, P_out^{(N)}(0,1) = p_2 p_N with p_1, p_2 being constants. We also assume that K, P_in^{(N)}(1,1), P_out^{(N)}(1,1) are constants. Then the condition for exact recovery becomes

lim_{N→∞} (N p_N / log N) ( Ĩ_0 + Σ_{t=2}^{T} ( Ĩ_1 + Ĩ_2 + Ĩ_3^{(t)} ) ) > K,

where

Ĩ_0 = (√c_in − √c_out)²,  Ĩ_1 = (√p_1 − √p_2)²,  Ĩ_2 = 2 ρ √(p_1 p_2),

and

Ĩ_3^{(t)} = 2 ρ ( √(c_in c_out) − √(p_1 p_2) / (1 − √(P_in(1,1) P_out(1,1))) ) (1 − √(P_in(1,1) P_out(1,1))) ( √(P_in(1,1) P_out(1,1)) )^{t−2},

with ρ = 1 − √(P_in(1,0) P_out(1,0)) − √(P_in(1,1) P_out(1,1)). Similarly, almost exact recovery is possible if and only if

lim_{N→∞} N p_N ( Ĩ_0 + Σ_{t=2}^{T} ( Ĩ_1 + Ĩ_2 + Ĩ_3^{(t)} ) ) = ∞.
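The following snippet (ours) evaluates the sum Ĩ_0 + Σ_{t=2}^{T} (Ĩ_1 + Ĩ_2 + Ĩ_3^{(t)}) of Corollary 4.2 numerically, and searches for the smallest number of snapshots T* at which the exact-recovery condition N p_N (·) > K log N is met; this is the quantity plotted in Section 7.1.

```python
import numpy as np

def divergence_terms(c_in, c_out, p1, p2, Pin11, Pout11, T):
    """I0~ + sum_{t=2}^T (I1~ + I2~ + I3~(t)) from Corollary 4.2."""
    s = np.sqrt(Pin11 * Pout11)                      # sqrt(P_in(1,1) P_out(1,1))
    assert s < 1, "non-static evolution required"
    rho = 1 - np.sqrt((1 - Pin11) * (1 - Pout11)) - s
    I0 = (np.sqrt(c_in) - np.sqrt(c_out)) ** 2
    I1 = (np.sqrt(p1) - np.sqrt(p2)) ** 2
    I2 = 2 * rho * np.sqrt(p1 * p2)
    total = I0
    for t in range(2, T + 1):
        I3t = (2 * rho * (np.sqrt(c_in * c_out) - np.sqrt(p1 * p2) / (1 - s))
               * (1 - s) * s ** (t - 2))
        total += I1 + I2 + I3t
    return total

def T_star(N, K, c_in, c_out, p1, p2, Pin11, Pout11, p_N, T_max=10**4):
    """Smallest T with N * p_N * divergence > K * log(N), or None."""
    for T in range(1, T_max + 1):
        if N * p_N * divergence_terms(c_in, c_out, p1, p2,
                                      Pin11, Pout11, T) > K * np.log(N):
            return T
    return None
```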
Remarks 4.3. Under the same setting as Corollary 4.2:

• if p_N ≫ log N / N, then exact recovery is always possible if the evolution is non-static or if c_in ≠ c_out;

• if p_N = log N / N, exact recovery is possible if Ĩ_0 + Σ_{t=2}^{T} (Ĩ_1 + Ĩ_2 + Ĩ_3^{(t)}) > K and impossible if Ĩ_0 + Σ_{t=2}^{T} (Ĩ_1 + Ĩ_2 + Ĩ_3^{(t)}) < K. More precisely:
 – the term Ĩ_0 accounts for the first snapshot: in particular, for T = 1 we recover the known threshold for exact recovery in the SBM [ABH16, MNS16];
 – each new snapshot adds an extra term Ĩ_1 + Ĩ_2 + Ĩ_3^{(t)}. This term is strictly positive if the evolution is non-static (and equal to zero otherwise). This increases the left-hand side of the inequality, hence making recovery easier.
 Notably, in that situation, exact recovery is possible if T is big enough. Indeed, the sum behaves linearly in T (minus a term vanishing exponentially fast in T), and is thus unbounded. Therefore, there exists a T* such that for all T ≥ T* exact recovery is possible. T* depends only on the model parameters. We plot in Section 7.1 some values of T*.

• Finally, if p_N ≪ log N / N, then exact recovery is never possible.

Similar remarks apply for almost exact recovery.

We give below examples of the exact recovery threshold computed in Corollary 4.2; almost exact recovery thresholds would be similar.

Example 4.4. For T = 1, the criterion for exact recovery becomes:

lim inf_{N→∞} (N / (K_N log N)) ( √(μ_in^{(N)}(1)) − √(μ_out^{(N)}(1)) )² > 1.

This corresponds to the known threshold for exact recovery in a static SBM, originally established in [ABH16, MNS16] for two communities.

Example 4.5. Assume that the interaction patterns between nodes of different communities are i.i.d., that is

P_out^{(N)} = [ 1 − μ_out^{(N)}(1)  μ_out^{(N)}(1) ]
              [ 1 − μ_out^{(N)}(1)  μ_out^{(N)}(1) ].

Then the condition for exact recovery becomes

lim inf_{N→∞} (N / (K_N log N)) [ ( √(μ_in(1)) − √(μ_out(1)) )² + (T − 1) ( √(P_in^{(N)}(0,1)) − √(μ_out^{(N)}(1)) )² ] > 1.

In particular, if the interaction patterns between nodes in the same communities are also i.i.d., that is

P_in^{(N)} = [ 1 − μ_in^{(N)}(1)  μ_in^{(N)}(1) ]
             [ 1 − μ_in^{(N)}(1)  μ_in^{(N)}(1) ],

then the condition for exact recovery becomes

lim inf_{N→∞} (N T / (K_N log N)) ( √(μ_in(1)) − √(μ_out(1)) )² > 1.

This is the exact recovery threshold of a static SBM with interaction probabilities T μ_in and T μ_out, that one would get by aggregating the T independent graphs. In particular, we recover the results of [PC16] on a multi-layer SBM with independent layers.

Example 4.6. Assume a static evolution for the interaction patterns between two nodes belonging to the same community, that is P_in^{(N)} = I. Then the exact recovery threshold becomes

lim inf_{N→∞} (N / (K_N log N)) [ ( √(μ_in^{(N)}) − √(μ_out^{(N)}) )² + (T − 1) P_out^{(N)}(0,1) + 2 (1 − √(P_out^{(N)}(1,1))) √(μ_in^{(N)} μ_out^{(N)}) ( 1 − ( √(P_out^{(N)}(1,1)) )^{T−1} ) ] > 1.

Example 4.7. Assume, like in Example 4.6, that P_in^{(N)} = I. Assume further that the evolution of the interaction patterns between two nodes in different communities is i.i.d., that is P_out^{(N)}(0,1) = P_out^{(N)}(1,1) = μ_out^{(N)}. The condition for exact recovery becomes

lim inf_{N→∞} (N / (K_N log N)) ( μ_in^{(N)} + T μ_out^{(N)} − 2 √( μ_in^{(N)} (μ_out^{(N)})^T ) ) > 1.

For T = 1, we recover the condition of a static SBM (cf. Example 4.4). For T ≥ 2 the condition becomes

lim inf_{N→∞} (N / (K_N log N)) ( μ_in^{(N)} + T μ_out^{(N)} ) > 1.    (4.3)
Condition (4.3) is equivalent to the connectivity threshold of a homogeneous SBM where the probabilities of intra-community and inter-community edges are respectively μ_in^{(N)} and T μ_out^{(N)}.

Long time horizon

We saw in Remark 4.3 that, in the particular case N → ∞, for large (but finite) T, almost exact recovery becomes always possible, and similarly for exact recovery in the logarithmic degree regime. Let us now investigate the situation when T grows unbounded.

Corollary 4.8. Consider a homogeneous Markov SBM defined by (2.1)–(2.2), indexed by a scale parameter ν which satisfies the sparsity assumption (4.1) and the block balance condition (4.2). Assume further that T^{(ν)} ≫ 1, and that

T^{(ν)} max{ P_in^{(ν)}(0,1), P_out^{(ν)}(0,1) } ≫ max{ μ_in^{(ν)}(1), μ_out^{(ν)}(1) }.    (4.4)

Let I_1^{(ν)}, I_2^{(ν)} be defined as in Theorem 4.1. Then:

• Exact recovery is possible if lim inf_{ν→∞} (N^{(ν)} T^{(ν)} / (K^{(ν)} log N^{(ν)})) ( I_1^{(ν)} + I_2^{(ν)} ) > 1, and is not possible if lim sup_{ν→∞} (N^{(ν)} T^{(ν)} / (K^{(ν)} log N^{(ν)})) ( I_1^{(ν)} + I_2^{(ν)} ) < 1.

• Almost exact recovery is possible if lim inf_{ν→∞} (N^{(ν)} T^{(ν)} / K^{(ν)}) ( I_1^{(ν)} + I_2^{(ν)} ) = ∞, and is not possible if lim sup_{ν→∞} (N^{(ν)} T^{(ν)} / K^{(ν)}) ( I_1^{(ν)} + I_2^{(ν)} ) < ∞.

Condition (4.4) ensures that the dominating terms are the ones coming from the dynamic patterns. In particular, the recovery conditions in Corollary 4.8 do not depend on the initial distributions. Indeed, under this condition, I^{(ν)} ≍ T^{(ν)} ( I_1^{(ν)} + I_2^{(ν)} ), and the proof of Corollary 4.8 follows immediately from Theorem 4.1.

Remark 4.9. By simply considering the union graph (see Appendix D.1), one would only recover the term I_1^{(ν)}. The second term I_2^{(ν)} corresponds to the gain one obtains using the difference of dynamic evolution between intra-community and inter-community interaction patterns. In particular, I_2^{(ν)} = 0 if P_in(1,1) = P_out(1,1). In that specific scenario, considering the union graph does not degrade the recovery conditions. Note that this scenario corresponds to the edge persistence setting of [BLMT18].

Remark 4.10. Surprisingly, considering the time-aggregated graph does not result in a significant loss of information. Indeed, the recovery conditions for the Markov SBM and the corresponding time-aggregated graph are the same when T^{(ν)} is unbounded and condition (4.4) holds (see Proposition D.4 in the Appendix). The reason is that the Rényi divergence I^{(ν)} between f_in and f_out is dominated by the terms coming from signals with only zero or one rare event (here a rare event is an apparition of a '1' in the binary string x representing the interaction pattern between two nodes). Therefore, all that matters is the number of 1's in the string x. Hence, transforming the initial signal x ∈ {0,1}^T into ||x||_1 results in a negligible loss of information.

Remarks 4.11. Suppose that N → ∞, and thus we use the parameter N instead of ν. Assume that P_in^{(N)}(0,1) = p_1 p_N and P_out^{(N)}(0,1) = p_2 p_N, with p_1, p_2 being constants (not both zero). Assume also that K_N, P_in^{(N)}(1,1) and P_out^{(N)}(1,1) are constants, not both equal to one. The exact recovery threshold becomes

lim_{N→∞} (N p_N T_N / log N) ( Ĩ_1 + Ĩ_2 ) > K,

where Ĩ_1 and Ĩ_2 are defined as in Corollary 4.2. Similarly, almost exact recovery is possible if and only if

lim_{N→∞} N p_N T_N ( Ĩ_1 + Ĩ_2 ) = ∞.
In particular,

• if p_N T_N ≫ log N / N, then exact recovery is always possible;
• if p_N T_N ∼ τ log N / N for some constant τ, then exact recovery is possible if τ ( Ĩ_1 + Ĩ_2 ) > K and impossible if τ ( Ĩ_1 + Ĩ_2 ) < K;
• if p_N T_N ≪ log N / N, then exact recovery is never possible.

Similar results apply for almost exact recovery.

The key product p_N T_N corresponds to the expected number of on-periods. Similarly to the static SBM, a phase transition for exact recovery arises when the number of on-periods is of the order of log N / N (note that for a static SBM, an on-period is simply an edge).

It is striking to note that exact and almost exact recovery are possible even in a very sparse setting, as long as the number of snapshots is large enough. For example, if p_N = 1/N, then T_N has to be at least of the order log N for exact recovery, and of the order ω(1) for almost exact recovery. This behavior is very different from the situation in the standard SBM, where in the constant degree regime (p_N = 1/N) the best one can achieve is detection (that is, doing better than a blind random guess).

Online likelihood-based algorithms

Given A^t = (A^1, . . . , A^t), define a log-likelihood ratio matrix by

M^{(t)}_{ij} = log ( f_in(A^t_{ij}) / f_out(A^t_{ij}) ).    (5.1)

Then the log of the probability of observing a graph sequence A^t given node labelling σ = (σ_1, . . . , σ_N) is equal to L(A^t | σ) + c(A^t), where

L(A^t | σ) = (1/2) Σ_i Σ_{j ≠ i} M^{(t)}_{ij} δ_{σ_j, σ_i},

and c(A^t) = (1/2) Σ_i Σ_{j ≠ i} log f_out(A^t_{ij}). Therefore, given an assignment σ̂^{(t−1)} computed from the observation of the t − 1 first snapshots, one can compute a new assignment σ̂^{(t)} such that node i is assigned to any block k which maximizes

L^{(t)}_{i,k} := Σ_{j ≠ i} M^{(t)}_{ij} δ_{σ̂^{(t−1)}_j, k}.    (5.2)

This formula is interesting only if M^{(t)} can easily be computed from M^{(t−1)}. This is in particular the case for a Markov evolution. Indeed, if μ_in and μ_out are the initial probability distributions, and P_in, P_out the transition matrices, then the cumulative log-likelihood matrices defined in equation (5.1) can be computed recursively by M^{(t)} = M^{(t−1)} + Δ^{(t)}, where

M^{(1)}_{ij} = log ( μ_in(A^1_{ij}) / μ_out(A^1_{ij}) ),    (5.3)

and

Δ^{(t)}_{ij} = log ( P_in(A^{t−1}_{ij}, A^t_{ij}) / P_out(A^{t−1}_{ij}, A^t_{ij}) ).    (5.4)

We summarize this in Algorithm 2. Let us emphasize that this algorithm works in an online adaptive fashion.

Algorithm 2: Online clustering for homogeneous Markov dynamics when the block interaction parameters are known.
Input: Observed interaction tensor (A^t_{ij}); block interaction parameters μ_in, μ_out, P_in, P_out; number of communities K; static graph clustering algorithm algo.
Output: Node labelling σ̂ = (σ̂_1, . . . , σ̂_N) ∈ [K]^N.
Initialize: Compute σ̂ ← algo(A^1), and M_{ij} ← log( μ_in(A^1_{ij}) / μ_out(A^1_{ij}) ) for i, j = 1, . . . , N.
for t = 2, . . . , T do
 Compute Δ_{ij} ← log( P_in(A^{t−1}_{ij}, A^t_{ij}) / P_out(A^{t−1}_{ij}, A^t_{ij}) ) for i, j = 1, . . . , N.
 Update M ← M + Δ.
 for i = 1, . . . , N do
  Set L_{ik} ← Σ_{j ≠ i} M_{ij} δ_{σ̂_j, k} for k = 1, . . . , K.
  Set σ̂_i ← arg max_{1≤k≤K} L_{ik}.
Return: σ̂
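A compact NumPy rendering of Algorithm 2 (our sketch; `algo` stands for any static clustering routine applied to the first snapshot and returning labels in {0, . . . , K−1}):

```python
import numpy as np

def online_clustering(A, mu_in1, mu_out1, P_in, P_out, K, algo):
    """Algorithm 2: A has shape (N, N, T); returns labels in {0,...,K-1}."""
    N, _, T = A.shape
    sigma = algo(A[:, :, 0])                        # initial static clustering
    mu_in = np.array([1 - mu_in1, mu_in1])
    mu_out = np.array([1 - mu_out1, mu_out1])
    M = np.log(mu_in[A[:, :, 0]] / mu_out[A[:, :, 0]])   # (5.3), entrywise
    logratio = np.log(P_in / P_out)                 # the 4 precomputed Delta values
    for t in range(1, T):
        M += logratio[A[:, :, t - 1], A[:, :, t]]   # M <- M + Delta, (5.4)
        np.fill_diagonal(M, 0.0)                    # exclude self-pairs
        Sigma = np.eye(K)[sigma]                    # one-hot membership matrix
        L = M @ Sigma                               # L[i,k] = sum_j M[ij] 1(sigma_j=k)
        sigma = np.argmax(L, axis=1)
    return sigma
```

The one-hot product L = M̄Σ used here is exactly the matrix-product formulation listed among the implementation remarks below.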
The (worst-case) time complexity of Algorithm 2 is O(KN²T) plus the time complexity of the initial clustering. The space complexity is O(N²). In addition:

• Since at each time step Δ can take only one of four values, these four values can be precomputed and stored, to avoid computing N²T logarithms.
• The N-by-K matrix (L_{ik}) can be computed as a matrix product L = M̄Σ, where M̄ is the matrix obtained by zeroing out the diagonal of M, and Σ is the one-hot representation of σ̂ such that Σ_{ik} = 1 if σ̂_i = k and zero otherwise.
• For sparse networks, the average time and space complexity can be reduced by a factor of d/N, where d is the average node degree, by neglecting the 0 → 0 transitions and only storing nonzero entries (similarly to what is often done for belief propagation in the static SBM [Moo17]).

Algorithm 2 requires a priori knowledge of the interaction parameters. This is often not the case in practice, and one has to learn the parameters during the process of recovering the communities. In this section, we adapt Algorithm 2 to estimate the parameters on the fly.

Let n_{ij}(a, b) be the observed number of transitions a → b in the interaction pattern between nodes i and j, and let n_{ij}(a) = Σ_b n_{ij}(a, b). Let P_{ij} be the 2-by-2 matrix of transition probabilities for the evolution of the interaction pattern between a node pair {i, j}. By the law of large numbers (for stationary and ergodic random processes), the empirical transition probabilities

P̂_{ij}(a, b) := n_{ij}(a, b) / n_{ij}(a)    (5.5)

are with high probability close to P_{ij}(a, b) for T ≫ 1.

An estimator of P_in is obtained by averaging those probabilities over the pairs of nodes predicted to belong to the same community. More precisely, after t observed snapshots (t ≥ 2), given a predicted community assignment σ̂^{(t)}, we define, for a, b ∈ {0, 1},

P̂_in^{(t)}(a, b) = (1 / |{(i,j) : σ̂^{(t)}_i = σ̂^{(t)}_j}|) Σ_{(i,j) : σ̂^{(t)}_i = σ̂^{(t)}_j} n^{(t)}_{ij}(a, b) / n^{(t)}_{ij}(a),    (5.6)

where

n^{(t)}_{ij}(a, b) = Σ_{t′=1}^{t−1} 1(A^{t′}_{ij} = a) 1(A^{t′+1}_{ij} = b)

is the number of a → b transitions in the interaction pattern between nodes i and j (with a, b ∈ {0, 1}) seen during the t first snapshots, and n^{(t)}_{ij}(a) = Σ_{b=0}^{1} n^{(t)}_{ij}(a, b). Similarly,

P̂_out^{(t)}(a, b) = (1 / |{(i,j) : σ̂^{(t)}_i ≠ σ̂^{(t)}_j}|) Σ_{(i,j) : σ̂^{(t)}_i ≠ σ̂^{(t)}_j} n^{(t)}_{ij}(a, b) / n^{(t)}_{ij}(a)    (5.7)

is an estimator of P_out(a, b). Moreover, the quantities n^{(t)}_{ij}(a, b) can be updated inductively. Indeed,

n^{(t+1)}_{ij}(a, b) = n^{(t)}_{ij}(a, b) + 1(A^t_{ij} = a) 1(A^{t+1}_{ij} = b).    (5.8)

Finally, the initial distributions can also be estimated by averaging:

μ̂_in^{(t)}(1) = (1 / |{(i,j) : σ̂^{(t)}_i = σ̂^{(t)}_j}|) Σ_{(i,j) : σ̂^{(t)}_i = σ̂^{(t)}_j} A^1_{ij}    (5.9)

and

μ̂_out^{(t)}(1) = (1 / |{(i,j) : σ̂^{(t)}_i ≠ σ̂^{(t)}_j}|) Σ_{(i,j) : σ̂^{(t)}_i ≠ σ̂^{(t)}_j} A^1_{ij}.    (5.10)

This leads to Algorithm 3, for clustering in a Markov SBM when only the number of communities K is known. Note that, to save computation time, we can choose not to update the parameters at each time step.
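The estimators (5.5)–(5.10) can be maintained with a few array operations, as in the following sketch (ours); as a pragmatic variant of (5.6)–(5.7), pairs with no observed a-transitions are simply skipped in the averages.

```python
import numpy as np

def update_counts(n_trans, A_prev, A_next):
    """Update rule (5.8): n_trans[a, b] counts a -> b transitions per pair."""
    for a in (0, 1):
        for b in (0, 1):
            n_trans[a, b] += (A_prev == a) & (A_next == b)

def estimate_parameters(A_first, n_trans, sigma):
    """Empirical versions of (5.5)-(5.10); sigma is an (N,) integer array."""
    N = len(sigma)
    same = sigma[:, None] == sigma[None, :]
    off = ~np.eye(N, dtype=bool)                     # ignore the diagonal
    mu_in1 = A_first[same & off].mean()
    mu_out1 = A_first[~same & off].mean()
    P_in, P_out = np.zeros((2, 2)), np.zeros((2, 2))
    for a in (0, 1):
        n_a = n_trans[a].sum(axis=0)                 # n_ij(a)
        for b in (0, 1):
            with np.errstate(invalid="ignore", divide="ignore"):
                ratio = n_trans[a, b] / n_a          # hat P_ij(a,b); nan if unseen
            P_in[a, b] = np.nanmean(ratio[same & off])
            P_out[a, b] = np.nanmean(ratio[~same & off])
    return mu_in1, mu_out1, P_in, P_out
```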
Algorithm 3: Online clustering for homogeneous Markov dynamics when the block interaction parameters are unknown.
Input: Observed graph sequence A^T = (A^1, . . . , A^T); number of communities K; static graph clustering algorithm algo.
Output: Node labelling σ̂ = (σ̂_1, . . . , σ̂_N).
Initialize:
• Compute σ̂ ← algo(A^1);
• Compute μ̂_in, μ̂_out using formulas (5.9), (5.10);
• Compute M using (5.3);
• Let n_{ij}(a, b) ← 0 for i, j ∈ [N] and a, b ∈ {0, 1}.
Update: for t = 2, . . . , T do
 Compute Δ using (5.4);
 Set M ← M + Δ.
 for i = 1, . . . , N do
  Set L_{i,k} ← Σ_{j ≠ i} M_{ij} 1(σ̂_j = k) for all k = 1, . . . , K.
  Set σ̂_i ← arg max_{1≤k≤K} L_{i,k}.
 Update μ̂_in, μ̂_out using formulas (5.9), (5.10);
 For every node pair {i, j}, update n_{ij}(a, b) using (5.8);
 Update P̂_in, P̂_out using (5.6) and (5.7).

Baseline algorithms

This section provides some baseline algorithms to recover the blocks in some particular cases, without prior knowledge of the block interaction parameters. Section 6.1 concerns regimes with N = O(1) and T ≫ 1. An algorithm based on parameter estimation is proposed, and shown to converge to the true community structure. Section 6.2 describes tailor-made algorithms for a specific model instance with static intra-block interactions and uncorrelated inter-block noise.

Let us consider the situation where the number of snapshots T goes to infinity while N remains bounded. The main idea is to use the ergodicity of the Markov chains to estimate the parameters using standard techniques, and then perform inference. For now, we will assume that the interaction parameters P_in, P_out are known, but K is unknown. We refer to Remark 6.2 for the case when P_in, P_out are unknown as well.

Recall that formula (5.5) gave consistent estimators for P_{ij}, the matrix of transition probabilities for the evolution of the interaction pattern between a node pair {i, j}. Once all P_{ij}(a, b) are known with good precision, we can use our knowledge of P_in, P_out to distinguish whether nodes i and j are in the same block or not, and use this data to construct a similarity graph on the set of nodes. This leads to Algorithm 4, which does not require a priori knowledge of the number of blocks, but instead estimates it as a byproduct. Note that this algorithm is tailor-made for homogeneous interaction tensors.
Forsuch data, we will first make two simple observations that greatly help recovering the underlyingblock structure. Those observations lead to two different algorithms, and we will study theirperformance in Section 6.2.2. If nodes i and j interact at time t but not at time t + 1 (or vice versa), then i and j do not belong to the same block. Observation 2. If nodes i and j interact at every time step, then i and j probably belong to thesame block.Observation 2 suggests a very simple and extremely fast clustering method (Algorithm 5)which tracks persistent interactions and disregards other information. Persistent interactions canbe represented as an intersection graph G = ∩ t G t , where G t is the graph with adjacency matrix A t . By noting that G can be computed by performing O (log T ) graph intersections of complexity O (∆ max N ) , and that a breadth-first search finds the connected components in O ( N ) time, wesee that Algorithm 5 runs in O (∆ max N log T ) time, where ∆ max = max t max i P j | A tij | is themaximum degree of the graphs G t . 20 lgorithm 5: Best friends forever Input: Observed interaction tensor ( A tij ) Output: Estimated node labelling ˆ σ = (ˆ σ , . . . , ˆ σ N ) ; estimated number of communities c K .Set V ← { , . . . , N } .Compute E T ← ∩ Tt =1 E t where E t = { ij : A tij = 1 } Compute C ← set of connected components in G T = ( V, E T ) and set ˆ K ← number ofmembers in C of size larger than N / , and ( C , . . . , C b K ) ← list of c K largest members in C in arbitrary order.Set V ← ∪ b Kk =1 C k .For i ∈ V , set ˆ σ i ← unique k for which C k ∋ i .For i ∈ V \ V , set ˆ σ i ← arbitrarily value k ∈ { , . . . , c K } .Similarly, we propose a clustering method based on Observation 1. We call enemies twonodes i and j such that there is a change in the interaction pattern between i and j . Then we cangroup nodes that share a common enemy. Indeed, if K = 2 , the fact that node i is enemy with j ,and j is also enemy with k means that nodes i and k belong to the same cluster. This enemies ofmy enemies are my friends procedure leads to Algorithm 6. Algorithm 6: Enemies of my enemy (for K = 2 ). Input: Observed interaction tensor ( A tij ) . Output: Estimated node labelling ˆ σ = (ˆ σ , . . . , ˆ σ N ) .Compute E ∩ ← ∩ t E t and E ∪ ← ∪ t E t where E t = { ij : A tij = 1 } .Compute E ′ = E ∪ \ E ∩ .Set V ← { , . . . , N } .Set G ′ ← ( V, E ′ ) .Set G ′′ = ( V, E ′′ ) where ij ∈ E ′′ iff there is a 2-path i → h → j in G ′ .Compute C ← set of connected components in G ′′ and set c K ← |C| and ( C , . . . , C b K ) ← members of C listed in arbitrary order. for i = 1 , . . . , N do ˆ σ i ← unique k for which C k ∋ i . Remark 6.3. The above description for Algorithm 6 runs in O (∆ max N T ) , where ∆ max is themaximal degree over all single layers. A faster, but less transparent, implementation is possible,by first computing the union graph. Then, two nodes are marked as enemies if the weight betweenthem in the union graph belongs to the interval [1 , T − . This reduces the time complexity to O (∆ max N log T ) . A simple generative model for interaction tensors with static and deterministic intra-block inter-actions is the Markov SBM where P kk = I is the 2-by-2 identity matrix for all k ∈ [ K ] . Underthis model, Proposition 6.4 states the performance guarantees for Algorithm 5. Proposition 6.4. Consider a dynamic SBM indexed by a scale parameter ν , with T ( ν ) snapshotsand K ( ν ) blocks of size N , . . . , N K . Assume that N k ≍ N for all k , and that. 
A simple generative model for interaction tensors with static and deterministic intra-block interactions is the Markov SBM where P_{kk} = I is the 2-by-2 identity matrix for all k ∈ [K]. Under this model, Proposition 6.4 states the performance guarantees for Algorithm 5.

Proposition 6.4. Consider a dynamic SBM indexed by a scale parameter ν, with T^{(ν)} snapshots and K^{(ν)} blocks of sizes N_1, . . . , N_K. Assume that N_k ≍ N for all k, and that

N² max_{1≤k<ℓ≤K^{(ν)}} f^{(ν)}_{kℓ}(1, . . . , 1) ≪ 1.    (6.1)

Then Algorithm 5 achieves exact recovery whp if

∀k ∈ [K^{(ν)}] : lim_{ν→∞} N_k^{(ν)} f^{(ν)}_{kk}(1, . . . , 1) / log( K^{(ν)} N_k^{(ν)} ) > 1.    (6.2)

Moreover, assume K^{(ν)}, T^{(ν)} are bounded. Then Algorithm 5 achieves almost exact recovery if

∀k ∈ [K] : lim_{ν→∞} N_k^{(ν)} f^{(ν)}_{kk}(1, . . . , 1) = ∞.    (6.3)

Remark 6.5. Condition (6.1) ensures that the number of node pairs in different communities interacting at every time step remains small, making Observation 2 meaningful. The extra conditions (6.2) and (6.3) ensure that in each community there are enough node pairs interacting at all time steps.

The following Proposition 6.6 gives convergence guarantees for Algorithm 6 in a general dynamic SBM with two communities.

Proposition 6.6. Consider a dynamic SBM with N ≫ 1 nodes and K = 2 blocks of sizes N_1, N_2 ≍ N. Assume that log(1/p^{11}_T) + log(1/p^{22}_T) ≪ N^{−2} and 1 − p^{12}_T ≫ N^{−1} log N, where

p^{kℓ}_T = f_{kℓ}(0, . . . , 0) + f_{kℓ}(1, . . . , 1)

is the probability of observing a static interaction pattern of length T between any particular pair of nodes in blocks k and ℓ. Then Algorithm 6 estimates the correct block memberships with high probability.

The proofs of Propositions 6.4 and 6.6 are postponed to Appendix E.

Numerical illustrations

In the numerical simulations, we suppose that (1 − μ_in(1), μ_in(1)), resp. (1 − μ_out(1), μ_out(1)), is the stationary distribution of P_in, resp. of P_out. Therefore,

P_in = [ 1 − μ_in(1)(1 − P_in(1,1))/(1 − μ_in(1))   μ_in(1)(1 − P_in(1,1))/(1 − μ_in(1)) ]
       [ 1 − P_in(1,1)                               P_in(1,1)                          ],

and similarly for P_out.
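In code (our sketch), the transition matrix compatible with a prescribed stationary edge density μ(1) and persistence P(1,1) is obtained as follows; the off-diagonal entry P(0,1) is forced by stationarity.

```python
import numpy as np

def stationary_transition(mu1, p11):
    """2x2 transition matrix with stationary law (1-mu1, mu1) and P[1,1]=p11."""
    p01 = mu1 * (1.0 - p11) / (1.0 - mu1)   # forced by stationarity
    assert 0.0 <= p01 <= 1.0, "incompatible (mu1, p11)"
    return np.array([[1.0 - p01, p01],
                     [1.0 - p11, p11]])

# e.g. stationary_transition(0.01, 0.9) keeps the average density at 0.01
```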
Let us focus on the regime where the average degree is of the order N p_N ≍ log N, which is known to be critical for exact recovery in the static SBM. In Remark 4.3, following Corollary 4.2, we stated that, as long as the model is theoretically identifiable and the evolution is non-static, there exists a threshold T* such that exact recovery is possible if we observe more than T* snapshots. Figure 1 displays the theoretical value T* as a function of P_in(1,1) and P_out(1,1), for various choices of μ_in(1) and μ_out(1). In particular, the hardest cases are:

• when P_in(1,1), P_out(1,1) are both close to one (nearly static situation);
• when μ_in(1) ≈ μ_out(1) and P_in(1,1) ≈ P_out(1,1), so that the interaction patterns are similar within blocks and between blocks.

Figure 1 (three greyscale panels, axes P_out(1,1) × P_in(1,1), values log T*; panels (a)–(c): μ_in(1) of order 1, 2 and 4 times log N/N): Theoretical minimum value T* needed to achieve exact recovery when K = 2, for μ_out(1) of order log N/N and different μ_in(1), as a function of P_in(1,1) and P_out(1,1). The hardest cases are around the top right corner (when P_in, P_out ≈ I: static situation) and around the diagonal P_in(1,1) = P_out(1,1) when μ_in(1) is close to μ_out(1). The plots show log T*.

Figure 2 (three greyscale panels, axes P_out(1,1) × P_in(1,1); panels (a)–(c) as in Figure 1): Greyscale plot of accuracy (proportion of correctly labelled nodes) as a function of P_in(1,1) and P_out(1,1), given by Algorithm 2 with a random guessing initialization. Simulations are done for a Markov SBM with T = 10 snapshots, N = 500 nodes, μ_out(1) of order log N/N and different μ_in(1).

The relevance of the theoretical threshold is next illustrated by numerical experiments. In Figure 2, we plot the accuracy obtained when P_in(1,1) and P_out(1,1) vary, while the other parameters of the model are fixed. We see that the hard region (i.e., where the accuracy remains bad after 10 snapshots) lies around the diagonal. Away from the diagonal P_in(1,1) = P_out(1,1), Algorithm 2 always achieves near-perfect accuracy after T = 10 snapshots.

Let us now study the effect of the initialization step. We plot in Figure 3 the evolution of the averaged accuracy obtained when we run Algorithm 2 on 50 realizations of a Markov SBM, where the initialization is done either using spectral clustering or random guessing. Obviously, when spectral clustering works well (see Figure 3c), it is preferable to use it rather than a random guess. Nonetheless, it is striking to see that, when the initial spectral clustering gives a bad accuracy, the likelihood method can overcome it. For example, in Figure 3a, the initial clustering with spectral clustering on the first snapshot is really bad (accuracy barely above that of a random guess), yet Algorithm 2 overcomes this and reaches a perfect clustering after a few snapshots. In that particular setting, there is no advantage in using spectral clustering rather than random guessing.

This is further strengthened by our numerical observations in the constant degree regime. As we see in Figure 4, our algorithm performs well when μ_in(1) = c_in/N and μ_out(1) = c_out/N (c_in, c_out constants), even if c_in ≈ c_out (see Figure 4b). This is very similar to what we saw in the logarithmic degree regime (Figure 3), except that the number of snapshots needed to get excellent accuracy is higher since the graphs are sparser.

Figure 3 (three panels of accuracy versus number of time steps, comparing initialisation by spectral clustering and by random guessing; panels (a)–(c): μ_in(1) of order 1, 2 and 4 times log N/N, with T*_theo = 13, 14 and 11 respectively): Evolution of the accuracy given by Algorithm 2 when the initialisation is done via spectral clustering or random guessing. The synthetic graphs are Markov SBMs with N = 500 nodes (equally divided into two clusters), μ_out(1) of order log N/N, and fixed P_in(1,1), P_out(1,1). Accuracy is averaged over 50 realisations, and the error bars represent the standard error. T*_theo is the theoretical minimum number of time steps needed to get above the exact recovery threshold.

Figure 4 (two panels of accuracy versus number of time steps, one curve per value of P_out(1,1): (a) N = 500 with P_out(1,1) ∈ {0.003, 0.3, 0.6, 0.9}; (b) N = 100 with P_out(1,1) ∈ {0.1, 0.2, 0.3, 0.4}): Evolution of the accuracy with the number of snapshots obtained by Algorithm 2 in a sparse setting (μ_in(1), μ_out(1) of order 1/N), when the initialisation is done via random guessing. We draw 50 synthetic Markov SBMs with two equal-size communities. The choice of parameters in (b) is much more challenging than in (a). The curves show the accuracy averaged over 50 trials, and error bars correspond to the empirical standard errors.

The case when the interaction parameters are unknown

We show in Figure 5 a comparison of the online Algorithm 2 (with known interaction parameters) with the online Algorithm 3 (with unknown interaction parameters).
We see that, when the starting round of spectral clustering gives a decent accuracy (at least 75%), Algorithm 3 can learn the model parameters as well as the communities. However, when spectral clustering gives a bad accuracy, Algorithm 3, without the model parameters, fails, whereas the version with the known interaction parameters succeeds.

Figure 5 (three panels of accuracy versus number of time steps, comparing known and unknown model parameters, for three increasing values of μ_out(1)): Comparison of the accuracy given by the online versions of the algorithm. The results are averaged over 20 realizations of a Markov SBM with parameters N = 1000, T = 30, fixed μ_in(1), P_in(1,1), P_out(1,1), and different μ_out(1).

In this section, we compare the performance of Algorithm 2 to the baseline methods proposed in Section 6.2. Results are shown in Figure 6. We draw the following observations:

• Algorithm 2 (called online likelihood in the plots) always achieves very high accuracy, and outperforms all other methods;
• Spectral clustering on the union graph always performs very poorly, while spectral clustering on the time-aggregated graph can perform very well if the evolution of the interaction patterns is not too static (that is, P_in(1,1) and P_out(1,1) should both be away from 1);
• Spectral clustering on Σ_{t=1}^{T} (A^t)² − D^t, where D^t is the degree matrix of layer t, is the method proposed and analysed in [Lei20]. This method, called squared adjacency SC in the caption of Figure 6, is always outperformed by spectral clustering on the time-aggregated graph;
• Algorithms 5 and 6 are more sensitive to the hypothesis P_in(1,1) = 1 than to P_out(1,1) = μ_out(1). In particular, Algorithm 6 (enemies of my enemy) fails as soon as P_in(1,1) ≠ 1 (in Figure 6b, as soon as P_in(1,1) drops below 1, so does the accuracy of Algorithm 6);
• Given its simplicity, Algorithm 5 (best friends forever) performs surprisingly well. Of course, when the parameter setting is too far from the ideal situation P_in(1,1) = 1 and P_out(1,1) = μ_out(1), the algorithm fails as expected. However, even at not too short a distance from this ideal case, Algorithm 5 gives a meaningful classification.

Figure 6 (two panels comparing six methods — online likelihood, union graph SC, time aggregated SC, squared adjacency SC, best friends forever, enemies of my enemy; (a) accuracy versus P_out(1,1) with P_in(1,1) = 1, (b) accuracy versus P_in(1,1) with P_out(1,1) = μ_out(1)): Comparison of the accuracy given by the different algorithms. The results are averaged over 50 realisations of a Markov SBM with parameters N = 500, T = 30 and fixed μ_in(1), μ_out(1). Panel (a) shows the situation P_in(1,1) = 1 (static intra-community interaction patterns) while P_out(1,1) varies; panel (b) shows P_out(1,1) = μ_out(1) (i.i.d. inter-community interaction patterns) while P_in(1,1) varies. Colours correspond to the same algorithms in both plots.

Conclusion

In this paper, we studied clustering in a dynamic stochastic block model where the node labelling is fixed and the interaction patterns of the node pairs are independent. We derived explicit conditions for recovery of the latent node labels, extending previously known results for a small number
For a Markov dynamics of the interac-tions pattern, we derived the conditions for almost exact and exact recovery, and made parallelwith existing work in the static SBM and the independent multi-layer SBM. We also proposed anonline algorithm (Algorithm 2) based on likelihood estimation. We investigated numerically theperformance of this algorithm. Especially, we observed that even in hard regimes ( P in ≈ P out ,and/or very sparse graphs), Algorithm 2 achieves excellent accuracy given a reasonable numberof snapshots.Algorithm 2 can be extended to a more general setting, for example where the communitiesare of different size, or where the interaction parameters f kℓ are not necessarily all equal to f out when k = ℓ . Theoretical perspectives of a general link-labelled SBM can be found in [YP16].If the interaction probabilities are unknown, we can estimate them. This leads to Algorithm 3.This method achieves high accuracy if the initialisation step gives a good enough accuracy (typ-ically at least ∼ µ in , µ out , P in , P out , and the node labelling σ .We leave open the theoretical study of those algorithms, and in particular a proof of consis-tency of Algorithm 2 given random guess initialisation.Moreover, both Algorithms 2 and 3 require the knowledge of the number of communities. Inpractice, such an information might not be available, and need to be inferred as well. Estimatingthe number of clusters in the static SBM has been investigated. Methods based on the likelihood[SYF17, WB17] or spectral properties of well chosen matrices [LL15] have been proposed, andmight be extendable to dynamic graphs.Another natural extension is to allow the node labelling to vary over time (in the spirit of[BLMT18], but with different pattern interaction between nodes in the same community andnodes in different communities). Acknowledgements This work has been done within the project of Inria - Nokia Bell Labs “Distributed Learning andControl for Network Analysis” and was partially supported by COSTNET Cost Action CA15109.26 eferences [Abb18] Emmanuel Abbe. Community detection and the stochastic block models. Founda-tions and Trends R (cid:13) in Communications and Information Theory , 14(1–2):1–62, 2018.[ABH16] Emmanuel Abbe, Afonso S. Bandeira, and Georgina Hall. Exact recovery in thestochastic block model. IEEE Transactions on Information Theory , 62(1):471–487,2016.[Ana17] Venkat Anantharam. A variational characterization of rényi divergences. In , pages 893–897, 2017.[BC09] Peter J Bickel and Aiyou Chen. A nonparametric view of network models andnewman–girvan and other modularities. Proceedings of the National Academy ofSciences , 106(50):21068–21073, 2009.[BC20] Sharmodeep Bhattacharyya and Shirshendu Chatterjee. General community detec-tion with optimal recovery conditions for multi-relational sparse networks with de-pendent layers, 2020. https://arxiv.org/abs/2004.03480 .[BGLL08] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre.Fast unfolding of communities in large networks. Journal of statistical mechanics:theory and experiment , 2008(10):P10008, 2008.[Bil61] Patrick Billingsley. Statistical methods in Markov chains. Ann. Math. Statist. , 32:12–40, 1961.[BLMT18] Paolo Barucca, Fabrizio Lillo, Piero Mazzarisi, and Daniele Tantari. Disentanglinggroup and link persistence in dynamic stochastic block models. 
Journal of StatisticalMechanics: Theory and Experiment , 2018(123407):1–18, 2018.[BMD + 16] Trygve E Bakken, Jeremy A Miller, Song-Lin Ding, Susan M Sunkin, Kimberly ASmith, Lydia Ng, Aaron Szafer, Rachel A Dalley, Joshua J Royall, Tracy Lemon,et al. A comprehensive transcriptional map of primate brain development. Nature ,535(7612):367–375, 2016.[BWP + 11] Danielle S Bassett, Nicholas F Wymbs, Mason A Porter, Peter J Mucha, Jean MCarlson, and Scott T Grafton. Dynamic reconfiguration of human brain networksduring learning. Proceedings of the National Academy of Sciences , 108(18):7641–7646, 2011.[CF17] François Caron and Emily B. Fox. Sparse graphs using exchangeable random mea-sures. Journal of the Royal Statistical Society B , 79(5):1295–1366, 2017.[For10] Santo Fortunato. Community detection in graphs. Physics Reports , 486(3–5):75–174,2010.[GMZZ17] Chao Gao, Zongming Ma, Anderson Y. Zhang, and Harrison H. Zhou. Achievingoptimal misclassification proportion in stochastic block models. J. Mach. Learn. Res. ,18(1):1980–2024, 2017.[GN02] M. Girvan and M. E. J. Newman. Community structure in social and biologicalnetworks. Proceedings of the National Academy of Sciences , 99(12):7821–7826,2002. 27GZC + 16] Amir Ghasemian, Pan Zhang, Aaron Clauset, Cristopher Moore, and Leto Peel. De-tectability thresholds and optimal algorithms for community structure in dynamicnetworks. Physical Review X , 6(3):031005, 2016.[HLL83] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochasticblockmodels: First steps. Social Networks , 5(2):109–137, 1983.[HLM12] Simon Heimlicher, Marc Lelarge, and Laurent Massoulié. Community detection inthe labelled stochastic block model. In NIPS Workshop on Algorithmic and StatisticalApproaches for Large Social Networks , 2012.[HS12] Petter Holme and Jari Saramäki. Temporal networks. Physics Reports , 519(3):97–125, 2012.[HW08] Jake M Hofman and Chris H Wiggins. Bayesian approach to network modularity. Physical review letters , 100(25):258701, 2008.[KAB + 14] Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P. Gleeson, Yamir Moreno, andMason A. Porter. Multilayer networks. Journal of Complex Networks , 2(3):203–271,07 2014.[Kal05] Olav Kallenberg. Probabilistic Symmetries and Invariance Principles . Springer,2005.[KY04] Bryan Klimt and Yiming Yang. The enron corpus: A new dataset for email clas-sification research. In European Conference on Machine Learning , pages 217–226.Springer, 2004.[Lei20] Jing Lei. Tail bounds for matrix quadratic forms and bias ad-justed spectral clustering in multi-layer stochastic block models, 2020. https://arxiv.org/abs/2003.08222 .[Les] Lasse Leskelä. Random graphs and network statistics.[LGK12] Kevin Lewis, Marco Gonzalez, and Jason Kaufman. Social selection and peer influ-ence in an online social network. Proceedings of the National Academy of Sciences ,109(1):68–72, 2012.[LL15] Can M. Le and Elizaveta Levina. Estimating the number of communities in networksby spectral methods, 2015. https://arxiv.org/abs/1507.00827 .[LM19] Léa Longepierre and Catherine Matias. Consistency of the maximum likelihood andvariational estimators in a dynamic stochastic block model. Electronic Journal ofStatistics , 13(2):4157–4223, 2019.[LMX15] Marc Lelarge, Laurent Massoulié, and Jiaming Xu. Reconstruction in the labelledstochastic block model. IEEE Trans. Netw. Sci. Eng. , 2(4):152–163, 2015.[MBLT20] P. Mazzarisi, P. Barucca, F. Lillo, and D. Tantari. 
A dynamic network model withpersistent links and node-specific latent variables, with an application to the interbankmarket. European Journal of Operational Research , 281(1):50–65, 2020.[MFB15] Rossana Mastrandrea, Julie Fournet, and Alain Barrat. Contact patterns in a highschool: a comparison between data collected using wearable sensors, contact diariesand friendship surveys. PloS one , 10(9):e0136497, 2015.28MM09] Marc Mezard and Andrea Montanari. Information, physics, and computation . OxfordUniversity Press, 2009.[MM17] Catherine Matias and Vincent Miele. Statistical clustering of temporal networksthrough a dynamic stochastic block model. J. R. Stat. Soc. Ser. B. Stat. Methodol. ,79(4):1119–1141, 2017.[MNS16] Elchanan Mossel, Joe Neeman, and Allan Sly. Consistency thresholds for the plantedbisection model. Electronic Journal of Probability , 2016.[Moo17] Cristopher Moore. The computer science and physics of community detection: Land-scapes, phase transitions, and hardness. Bulletin of the EATCS , 121, 2017.[PC16] Subhadeep Paul and Yuguo Chen. Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel. Electron. J.Statist. , 10(2):3807–3870, 2016.[Pei19] Tiago P Peixoto. Bayesian stochastic blockmodeling. Advances in network clusteringand blockmodeling , pages 289–332, 2019.[SC95] Karen B. Singer-Cohen. Random intersection graphs . PhD thesis, Johns HopkinsUniversity, 1995. Thesis (Ph.D.)–The Johns Hopkins University.[SYF17] D. Franco Saldaña, Yi Yu, and Yang Feng. How many communities are there? Jour-nal of Computational and Graphical Statistics , 26(1):171–181, 2017.[TSSM16] Dane Taylor, Saray Shai, Natalie Stanley, and Peter J. Mucha. Enhanced detectabilityof community structure in multilayer networks through layer aggregation. Phys. Rev.Lett. , 116:228301, Jun 2016.[vH14] T. van Erven and P. Harremoës. Rényi divergence and kullback–leibler divergence. IEEE Transactions on Information Theory , 60(7):3797–3820, July 2014.[VL07] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing ,17(4):395–416, 2007.[WB17] Y. X. Rachel Wang and Peter J. Bickel. Likelihood-based model selection for stochas-tic block models. Ann. Statist. , 45(2):500–528, 04 2017.[XH14] Kevin S Xu and Alfred O Hero. Dynamic stochastic blockmodels for time-evolvingsocial networks. IEEE Journal of Selected Topics in Signal Processing , 8(4):552–562,2014.[XJL20] Min Xu, Varun Jog, and Po-Ling Loh. Optimal rates for community estimation inthe weighted stochastic block model. Annals of Statistics , 48(1):183–204, 2020.[YCZ + 11] Tianbao Yang, Yun Chi, Shenghuo Zhu, Yihong Gong, and Rong Jin. Detectingcommunities and their evolutions in dynamic social networks—a bayesian approach. Machine Learning , 82(2):157–189, 2011.[YP16] Se-Young Yun and Alexandre Proutière. Optimal cluster recovery in the labeledstochastic block model. In Proceedings of the 30th International Conference on Neu-ral Information Processing Systems , NIPS’16, pages 973–981, USA, 2016. CurranAssociates Inc. 29ZWL + 14] Dawei Zhao, Lianhai Wang, Shudong Li, Zhen Wang, Lin Wang, and Bo Gao. Im-munization of epidemics in multiplex networks. PloS one , 9(11):e112018, 2014.[ZZ16] Anderson Y. Zhang and Harrison Huibin Zhou. Minimax rates of community detec-tion in stochastic block models. Annals of Statistics , 44(5):2252–2280, 2016. A Proof of lower bound A.1 Unique permutation minimising the Hamming distance between two differentnode labellings Lemma A.1. 
Let σ , σ : [ N ] → [ K ] be such that d Ham ( π ◦ σ , σ ) < s/ for some π ∈ Sym( K ) ,where s = min k | σ − ( k ) | . Then π is the unique permutation with this property, and π ( k ) = arg max ℓ | σ − ( k ) ∩ σ − ( ℓ ) | . This corresponds to [XJL20, Lemma B.6]. Proof. Assume that π ∈ Sym( K ) satisfies d Ham ( π ◦ σ , σ ) < s/ . Fix k ∈ [ K ] and let U k = { i : σ ( i ) = k, σ ( i ) = π ( k ) } . Then every node i in U k satisfies π ◦ σ ( i ) = σ ( i ) , andtherefore | U k | ≤ d Ham ( π ◦ σ , σ ) < s . Hence for any ℓ = π ( k ) , | σ − ( k ) ∩ σ − ( ℓ ) | ≤ | U k | < s . On the other hand, | σ − ( k ) ∩ σ − ( π ( k )) | = | σ − ( k ) | − | U k | ≥ s − s ≥ s . Hence π ( k ) is the unique value which maximizes ℓ 7→ | σ − ( k ) ∩ σ − ( ℓ ) | . Because this conclu-sion holds for all k , it follows that π is uniquely defined. A.2 Proof of Theorem 3.1 (0) Preliminaries. As a preparation, let us define some additional notation related to permutationsof blocks and nodes. Denote by Sym( σ , σ ) the set of permutations ρ ∈ Sym( K ) for which d Ham ( ρ ◦ σ , σ ) is minimized, and by E ( σ , σ ) the set of nodes i for which ρ ◦ σ ( i ) = σ ( i ) for some ρ ∈ Sym( σ , σ ) . Nodes in E ( σ , σ ) are critical in the sense that they may becomemisclustered, depending on the permutation. We denote ˜ ℓ ( σ , σ ) = N − |E ( σ , σ ) | . Note that ˜ ℓ ( σ , σ ) ≥ ℓ ( σ , σ ) in general.(i) Lower bounding by the proportion of critical nodes. Assume that σ is the true nodelabelling. Denote by N = min k | σ − ( k ) | the smallest block size. Note that ℓ = ℓ (ˆ σ A , σ ) satisfies, for c = N N , by Lemma A.1, and ≤ ˜ ℓ ≤ , E ℓ = E ℓ ℓ ≥ c ) + E ℓ ℓ < c )= E ℓ ℓ ≥ c ) + E ˜ ℓ ℓ < c ) ≥ c P ( ℓ ≥ c ) + ( E ˜ ℓ − P ( ℓ ≥ c )) + . If P ( ℓ ≥ c ) ≥ c E ˜ ℓ , then the first inequality implies E ℓ ≥ c c E ˜ ℓ . If P ( ℓ ≥ c ) ≤ c E ˜ ℓ ,then the latter inequality implies E ℓ ≥ c c E ˜ ℓ . Hence we conclude that E ℓ ≥ c c E ˜ ℓ = N N + 2 N E ˜ ℓ ≥ N N E ˜ ℓ . (A.1)30ii) Randomizing the reference node label. Let C a = σ − ( a ) and C b = σ − ( b ) be blocks withsizes N + 1 and N , respectively, and select a reference node u ∈ C a . Define a modified nodelabelling σ ′ by setting σ ′ ( u ) = b and σ ′ ( i ) = σ ( i ) for i = u . Define a probability measure P Φ on { σ , σ ′ } × S N × N by P Φ ( σ, A ) = 12 P σ ( A ) . This amounts to a randomized model with a random block membership structure σ ∈ { σ , σ ′ } where a coin flip is first performed to determine whether the label of the reference node u isswapped from the true value a into a false value b . We will now show that the randomization doesnot change the expected proportion of critical nodes, by verifying that E Φ ˜ ℓ (ˆ σ A , σ ) = E ˜ ℓ (ˆ σ A , σ ) . (A.2)Let π ∈ Sym( N ) be a permutation which swaps C a \ { u } and C b , and keeps other nodesfixed. Then σ ′ = τ ◦ σ ◦ π − where τ ∈ Sym( K ) is the map which swaps block labels a and b . Because ˆ σ is permutation equivariant, we see that ˆ σ A π = ˆ σ A ◦ π − . Now for any A and any ρ ∈ Sym( K ) , { i : ρ ◦ ˆ σ A ( i ) = σ ( i ) } = { i : ρ ◦ ˆ σ A ( i ) = τ − ◦ σ ′ ◦ π ( i ) } = { i : ρ ◦ ˆ σ A ◦ π − ( π ( i )) = τ − ◦ σ ′ ( π ( i )) } = { i : ρ ◦ ˆ σ A π ( π ( i )) = τ − ◦ σ ′ ( π ( i )) } = { i : τ ◦ ρ ◦ ˆ σ A π ( π ( i )) = σ ′ ( π ( i )) } . As a consequence, d Ham ( ρ ◦ ˆ σ A , σ ) = d Ham ( τ ◦ ρ ◦ ˆ σ A π , σ ′ ) . Hence ρ ∈ Sym(ˆ σ A , σ ) iff τ ◦ ρ ∈ Sym(ˆ σ A π , σ ′ ) . 
The above computation also shows that i ∈ E (ˆ σ A , σ ) iff π ( i ) ∈ E (ˆ σ A π , σ ′ ) .Hence P σ ( i ∈ E (ˆ σ A , σ )) = P σ ( π ( i ) ∈ E (ˆ σ A π , σ ′ )) . (A.3)Next, a key observation is that for any pair ij of distinct nodes, σ ( π ( i )) = σ ( π ( j )) if and onlyif σ ′ ( i ) = σ ′ ( j ) . Therefore, because the model is homogeneous, it follows that P σ ( A π ) = Y ij f σ ( π ( i )) σ ( π ( j )) ( A ij ) = Y ij f σ ′ ( i ) σ ′ ( j ) ( A ij ) = P σ ′ ( A ) . Thus, the law of A π under P σ is the same as the law of A under P σ ′ . Hence by (A.3), it followsthat P σ ( i ∈ E (ˆ σ A , σ )) = P σ ′ ( π ( i ) ∈ E (ˆ σ A , σ ′ )) . By summing both sides over i ∈ [ N ] and dividing the outcome by N , we conclude that E σ ˜ ℓ (ˆ σ A , σ ) = E σ ′ ˜ ℓ (ˆ σ A , σ ′ ) . Hence (A.2) follows from E Φ ˜ ℓ (ˆ σ A , σ ) = 12 E σ ˜ ℓ (ˆ σ A , σ ) + 12 E σ ′ ˜ ℓ (ˆ σ A , σ ′ ) = E σ ˜ ℓ (ˆ σ A , σ ) . (iii) From global to local error. A simple computation using the permutation equivariance([XJL20, Corollary G.1]) shows that E Φ ˜ ℓ = 12 X σ ∈{ σ ,σ ′ } N − N X i =1 P σ ( i ∈ E (ˆ σ A , σ )) ≥ X σ ∈{ σ ,σ ′ } N − X i ∈ C σ,u P σ ( i ∈ E (ˆ σ A , σ ))= 12 X σ ∈{ σ ,σ ′ } N − | C σ,u | P σ ( u ∈ E (ˆ σ A , σ )) , C σ,u = { i : σ ( i ) = σ ( u ) } . Because | C σ,u | = N + 1 for σ ∈ { σ , σ ′ } , it follows that E Φ ˜ ℓ ≥ N N P Φ ( u ∈ E (ˆ σ A , σ )) . (A.4)(iv) Defining an alt model. We will now change the distribution P Φ into a distribution P Ψ cor-responding to a modification where the interactions of the reference node u with nodes in blocks a and b are identically distributed. Let f ∗ be any probability measure on S which is absolutelycontinuous with respect to both f in and f out . For any node labelling σ , define a probability density P ∗ σ ( A ) = (cid:18) Y ij ∈ E ( u,C a ∪ C b ) f ∗ ( A ij ) (cid:19)(cid:18) Y ij ∈ E ( u,C a ∪ C b ) c f σ ( i ) σ ( j ) ( A ij ) (cid:19) , where E ( C, D ) denotes the of set unordered node pairs with one node in C and the other in D ,and E ( u, C ) is shorthand for E ( { u } , C ) . Define a probability measure P Ψ on { σ , σ ′ } × S N × N by P Ψ ( σ, A ) = 12 P ∗ σ ( A ) , corresponding to the same randomization as in the definition of P Φ . The alt model has beenconstructed so that P Ψ is absolutely continuous with respect to P Φ , and P ∗ σ = P ∗ σ ′ . (A.5)To see why (A.5) is true, note that E ( u, C a ∪ C b ) c = E ( u, D ) ∪ E ( { u } c , { u } c ) with D =( C a ∪ C b ) c . For a homogeneous model, f σ ( i ) σ ( j ) = f out for all i ∈ C a ∪ C b and j ∈ D . Hence,denoting C = C a ∪ C b , for node labeling σ = σ , P ∗ σ ( A ) = (cid:18) Y ij ∈ E ( u,C ) f ∗ ( A ij ) (cid:19)(cid:18) Y ij ∈ E ( u,D ) f out ( A ij ) (cid:19)(cid:18) Y ij ∈ E ( { u } c , { u } c ) f σ ( i ) σ ( j ) ( A ij ) (cid:19) . The same formula holds also for σ = σ ′ .(v) The alt model is blind for the reference node . Intuitively, the initial randomization of thelabel of the reference node u should make it impossible to cluster u better than a blind randomguess for data sample from the alt model. Technically, let us verify this as follows. Fix a number < δ < N / − N . For E = E (ˆ σ A , σ ) and ˜ ℓ = ˜ ℓ (ˆ σ A , σ ) consider the event E u,δ = n ( σ, A ) : u ∈ E or ˜ ℓ > δ o that the reference node is misclustered or the relative error is large. A key thing is to show that P Ψ ( E u,δ ) ≥ . (A.6)Observe that P Ψ ( E cu,δ ) = 12 P ∗ σ ( E cu,δ,σ ) + 12 P ∗ σ ′ ( E cu,δ,σ ′ ) , where E cu,δ,σ = { A : u 6∈ E (ˆ σ A , σ ) , ˜ ℓ (ˆ σ A , σ ) ≤ δ } . 
Let us verify that E cu,δ,σ and E cu,δ,σ ′ are dis-joint. Assume that A ∈ E cu,δ,σ . Then ℓ (ˆ σ A , σ ) ≤ ˜ ℓ (ˆ σ A , σ ) ≤ δ . Hence there exists a permuta-tion π ∈ Sym( K ) such that d Ham ( π ◦ ˆ σ A , σ ) = N ℓ (ˆ σ A , σ ) ≤ N δ . Because d Ham ( σ , σ ′ ) = 1 ,it follows that d Ham ( π ◦ ˆ σ A , σ ′ ) ≤ N δ + 1 . The choice of δ implies that d Ham ( π ◦ ˆ σ A , σ ) and d Ham ( π ◦ ˆ σ A , σ ′ ) are both strictly less than N . Because N is the minimum block sizecorresponding to both σ and σ ′ , Lemma A.1 implies that Sym(ˆ σ A , σ ) = Sym(ˆ σ A , σ ′ ) = { π } .32ecause A ∈ E cu,δ,σ , it follows that π ◦ ˆ σ A ( u ) = σ ( u ) = σ ′ ( u ) , and hence u ∈ E (ˆ σ A , σ ′ ) . Weconclude that A ∈ E u,δ,σ ′ . Hence E cu,δ,σ and E cu,δ,σ ′ are disjoint. Now because P ∗ σ ′ = P ∗ σ (see(A.5)), it follows that P Ψ ( E cu,δ ) = 12 (cid:18) P ∗ σ ( E cu,δ,σ ) + P ∗ σ ( E cu,δ,σ ′ ) (cid:19) = 12 P ∗ σ (cid:0) E cu,δ,σ ∪ E cu,δ,σ ′ (cid:1) ≤ . Hence (A.6) is valid.(vii) Lower bounding using the alt model. Markov’s inequality gives E Φ ˜ ℓ ≥ δ P Φ (˜ ℓ > δ ) . Bycombining this with (A.4) we find that ( N/N + 1 /δ ) E Φ ˜ ℓ ≥ (cid:18) P Φ ( u ∈ E ) + P Φ (˜ ℓ > δ ) (cid:19) ≥ P Φ ( E u,δ ) . (A.7)Now define a log-likelihood ratio by Q ( σ, A ) = log P Ψ ( σ,A ) P Φ ( σ,A ) , P Φ ( σ, A ) > , P Ψ ( σ, A ) > ∞ , otherwise . Then for any t ∈ R , noting that { P Ψ > } ⊂ { P Φ > } , by the absolute continuity of P Ψ withrespect to P Φ , P Φ ( E u,δ ) ≥ P Φ ( E u,δ , P Ψ > 0) = E Ψ e − Q E u,δ ) ≥ e − t P Ψ ( E u,δ , Q ≤ t ) . For t = E Ψ Q + 2 p Var Ψ ( Q ) , Chebyshev’s inequality implies P ( Q > t ) ≤ . For this choice of t , we see with the help of (A.6) that P Ψ ( E u,δ , Q ≤ t ) = P Ψ ( E u,δ ) − P Ψ ( E u,δ , Q > t ) ≥ − P Ψ ( Q > t ) ≥ . Hence P Φ ( E u,δ ) ≥ e − t = 14 e − E Ψ Q − √ Var Ψ ( Q ) . Together with (A.7), this shows that E Φ ˜ ℓ ≥ 14 ( N/N + 1 /δ ) − e − E Ψ Q − √ Var Ψ ( Q ) . The above bound holds for all < δ < N / − N . Hence it also holds for δ = N / − N , in whichcase ( N/N + 1 /δ ) − ≥ N N for all N ≥ . Then by (A.1) and (A.2), E ℓ ≥ (cid:18) N N (cid:19) e − E Ψ Q − √ Var Ψ ( Q ) . (viii) Mean of the log-likelihood ratio. Recall that Q ( σ, A ) = log P Ψ ( σ,A ) P Φ ( σ,A ) . Note that inthis case E Ψ Q = d KL ( P Ψ || P Φ ) together with E Φ Xe Q = E Ψ X and E Ψ Xe − Q = E Φ X for anyreal-valued random variable X whose outcome is a deterministic function of ( σ, A ) . Then, Q ( σ, A ) = X ij ∈ E ( u,C a ∪ C b ) log f ∗ ( A ij ) f σ ( i ) σ ( j ) ( A ij ) = X j ∈ ( C a ∪ C b ) \{ u } log f ∗ ( A uj ) f σ ( u ) σ ( j ) ( A uj ) . When A is P ∗ σ -distributed, the marginal distribution of A uj is f ∗ for all j ∈ ( C a ∪ C b ) \ { u } , forboth σ ∈ { σ , σ ′ } . Hence by taking P ∗ σ -expectations on both sides above, we find that X A Q ( σ, A ) P ∗ σ ( A ) = X j ∈ ( C a ∪ C b ) \{ u } d KL ( f ∗ || f σ ( u ) σ ( j ) ) . | C a \ { u }| = | C b | = N , it follows that X A Q ( σ, A ) P ∗ σ ( A ) = N (cid:16) d KL ( f ∗ || f in ) + d KL ( f ∗ || f out ) (cid:17) for both σ ∈ { σ , σ ′ } . Hence E Ψ Q = X σ,A Q ( σ, A ) P Ψ ( σ, A ) = 12 X σ,A Q ( σ, A ) P ∗ σ ( A ) = N I , where I = d KL ( f ∗ || f in ) + d KL ( f ∗ || f out ) .(ix) Variance of the log-likelihood ratio. A final part is to prove a concentration of the log-likelihood ratio Q , by getting an upper bound for the variance. Here, because σ E ∗ σ Q ( σ, A ) = P A Q ( σ, A ) P ∗ σ ( A ) is constant with respect to σ ∈ { σ , σ ′ } , we see that Var Ψ ( Q ) = 12 Var ∗ σ ( Q ( σ , A )) + 12 Var ∗ σ ′ ( Q ( σ ′ , A )) . 
Now, with C ∗ = ( C a ∪ C b ) \ { u } , Var ∗ σ ( Q ( σ , A )) = X j ∈ C ∗ Var ∗ σ log f ∗ ( A uj ) f aσ ( j ) ( A uj ) ! = X j ∈ C ∗ E ∗ σ log f ∗ ( A uj ) f aσ ( j ) ( A uj ) ! − X j ∈ C ∗ E ∗ σ log f ∗ ( A uj ) f aσ ( j ) ( A uj ) ! = X j ∈ C ∗ Z log f ∗ f aσ ( j ) ! f ∗ − X j ∈ C ∗ Z log f ∗ f aσ ( j ) f ∗ ! = N (cid:18) d KL , ( f ∗ || f in ) − d KL ( f ∗ || f in ) + d KL , ( f ∗ || f out ) − d KL ( f ∗ || f out ) (cid:19) . Because the same formula holds also for Var ∗ σ ′ ( Q ( σ ′ , A )) , we conclude that Var Ψ ( Q ) = N I where I = d KL , ( f ∗ || f in ) − d KL ( f ∗ || f in ) + d KL , ( f ∗ || f out ) − d KL ( f ∗ || f out ) = Z (cid:18) log f ∗ f in (cid:19) f ∗ − (cid:18)Z log f ∗ f in f ∗ (cid:19) + Z (cid:18) log f ∗ f out (cid:19) f ∗ − (cid:18)Z log f ∗ f out f ∗ (cid:19) . We denote f (cid:22) g if f is absolutely continuous with respect to g . Define d KL ( f || g ) = R (log fg ) f if f (cid:22) g and d KL ( f || g ) = ∞ otherwise. A variational characterization [Ana17,Theorem 1] shows that inf f ∗ (cid:22) f in ,f ∗ (cid:22) f out (cid:16) d KL ( f ∗ || f in ) + d KL ( f ∗ || f out ) (cid:17) = I ( f in , f out ) . The proof of the theorem also shows that the optimal probability measure has density f ∗ = Z ( f in f out ) / . For this choice of f ∗ , (cid:18) log f ∗ f in (cid:19) = (cid:18) 12 log f out f in − log Z (cid:19) = 14 (cid:18) I + log f out f in (cid:19) = 14 I + 12 I (cid:18) log f out f in (cid:19) + 14 (cid:18) log f out f in (cid:19) . 34n analogous formula also holds for (cid:16) log f ∗ f out (cid:17) . By summing these, the middle terms canceleach other, and we find that (cid:18) log f ∗ f in (cid:19) + (cid:18) log f ∗ f out (cid:19) = 12 I + 12 (cid:18) log f out f in (cid:19) . By integrating both sides above against f ∗ , we find that the key term in the formula Var Ψ ( Q ) = N I of the theorem becomes I = 12 I + 12 Z (cid:18) log f out f in (cid:19) f ∗ − (cid:18)Z log f ∗ f in f ∗ (cid:19) − (cid:18)Z log f ∗ f out f ∗ (cid:19) . Hence we get an upper bound I ≤ I + 12 Z (cid:18) log f out f in (cid:19) f ∗ . Now the claim follows. A.3 Bounding of JI Recall that I = − Z, and J = Z − Z log( f /g ) p f g, where Z = R √ f g . Lemma A.2. Assume that f, g > on S , and that Z > . Then J ≤ e I/ − . Especially, J ≤ I whenever I ≤ .Proof. Let us fix some x ∈ S for which f ( x ) = g ( x ) . At this point, for t = f /g , (log f − log g ) ( √ f − √ g ) p f g = 4 (log √ f − log √ g ) ( √ f − √ g ) p f g = 4 φ ( t ) where φ ( t ) = (log t ) ( t − t . Assume that t > , and let u = log t . Then t = e u and φ ( t ) = (cid:18) ue u − (cid:19) e u = (cid:18) ue u − e − u (cid:19) = (cid:18) u sinh u (cid:19) , where sinh u = 12 ( e u − e − u ) = X k> , odd u k k ! ≥ u. Hence φ ( t ) ≤ for all t > . Next, by noting that φ ( t ) = φ (1 /t ) for all < t , we conclude that φ ( t ) ≤ for all t > such that t = 1 . We conclude that (log f − log g ) p f g ≤ p f − √ g ) whenever f = g . Obviously the same inequality holds also when f = g . By integrating bothsides, it follows that ZJ ≤ Z ( p f − √ g ) = 4(2 − Z ) = 8(1 − Z ) . Hence J ≤ Z − − . The first claim follows because Z = e − I/ . The second claim followsby noting that e t/ − R t/ e s ds ≤ e / t for t ≤ , and e / ≤ .35 Proof of upper bound B.1 Test between two noisy samples from two distributions Let us start with a lemma linking the Rényi divergence to the likelihood ratio test. The Rényidivergence of positive order α = 1 is defined by D α ( P || Q ) = 1 α − X x P α ( x ) Q − α ( x ) . 
The following lemma describes a testing scenario, where we decide whether a noisy sample X , . . . , X m is sampled from P or from Q (if δ is small, most of the X i are sampled from P , but some are from Q ). The case δ = δ = 0 corresponds to pure samples. Lemma B.1. Let P, Q be two probability distributions, with P ≪ Q and Q ≪ P , and considerindependent random variables X , . . . , X m , Y , . . . , Y m . Let ≤ m , m ≤ m and assume that: • X , . . . , X m are sampled from Q , and X m +1 , . . . , X m from P ; • Y , . . . , Y m are sampled from P , and Y m +1 , . . . , Y m from Q .Let ℓ Q ( X i ) = log Q ( X i ) P ( X i ) and ℓ P ( Y i ) = log P ( Y i ) Q ( Y i ) be the log-likelihood ratios. Then, the randomvariable L = P mi =1 ( ℓ Q ( X i ) + ℓ P ( Y i )) satisfies P ( L > z ) ≤ e − z − mD / ( P || Q )+ m D / ( Q || P )+ m D / ( P || Q ) , z ∈ R . Proof. Because L = log Q mi =1 (cid:16) Q ( X i ) P ( Y i ) P ( X i ) Q ( Y i ) (cid:17) / , Markov’s inequality implies that P ( L > z ) ≤ e − z E e L = e − z m Y i =1 E s Q ( X i ) P ( X i ) E s P ( Y i ) Q ( Y i ) . (B.1)Because E s Q ( X i ) P ( X i ) = (P x Q / ( x ) P − / ( x ) = e D / ( Q,P ) , i ≤ m , P x P / ( x ) Q / ( x ) = e − D / ( P,Q ) , i > m , and E s P ( Y i ) Q ( Y i ) = (P x P / ( x ) Q − / ( x ) = e D / ( P,Q ) , i ≤ m , P x P / ( x ) Q / ( x ) = e − D / ( P,Q ) , i > m , we see that the right side of (B.1) equals e − z − mD / ( P || Q )+ m (cid:0) D / ( Q || P )+ D / ( Q || P ) (cid:1) + m (cid:0) D / ( P || Q )+ D / ( P || Q ) (cid:1) . The claim follows because α D α ( P || Q ) is nondecreasing [vH14, Thm 3]. B.2 Probability of error for a single node Lemma B.2. Let i ∈ [ N ] and ˜ σ ( i ) be the output of clustering on [ N ] \{ i } . Suppose π i ∈ S K satisfies ℓ ( σ, ˜ σ ( i ) ) = 1 n − d Ham (cid:16) σ − i , π i ◦ ˜ σ ( i ) (cid:17) . Suppose further that ℓ (cid:16) σ, ˜ σ ( i ) (cid:17) ≤ ǫ. nd that f in ≪ f out and f out ≪ f in , with c − ≤ D / ( P, Q ) I ≤ c and c − ≤ D / ( Q, P ) I ≤ c. Then, with probability at least − ( K − 1) exp (cid:18) − N IβK (1 − (1 + c ) ǫ ) (cid:19) we have π − i ( σ i ) = arg max k ∈ [ K ] X j = i (cid:16) ˜ σ ( i ) j = k (cid:17) log f in (cid:16) A Tij (cid:17) f out (cid:16) A Tij (cid:17) . Proof. Assume without loss of generalities that π i = Id and that σ i = 1 . For k = 1 , let E i ( k ) bethe event that X j = i e σ ( i ) j = 1) log f in ( A Tij ) f out ( A Tij ) > X j = i e σ ( i ) j = k ) log f in ( A Tij ) f out ( A Tij ) . Using Lemma B.1, we have: P ( E i ( k ) c ) ≤ e − ( | C | + | C k | ) I + | C ′ k | I + D / f in ,f out)2 + | C ′ | I + D / f out ,f in)2 , where I = D / ( f in , f out ) , and | C ′ k | denotes the number of nodes misclassified in cluster k by ˜ σ ( i ) .Since ℓ (cid:16) σ, ˜ σ ( i ) (cid:17) ≤ ǫ , we have | C ′ k | ≤ ǫ | C k | . Thus, P ( E i ( k ) c ) ≤ e − | C | + | Ck | I (1 − ǫ (1+ c )) ≤ e − NIβK (1 − ǫ (1+ c )) . where we used | C k | ≥ NβK .Hence, by the union bound P (cid:16) ∪ Kk =2 E i ( k ) c (cid:17) ≤ ( K − e − NIβK (1 − ǫ (1+ c )) , which ends the proof. B.3 Proof of Propositon 3.5 Proof. For simplicity of notation, we drop the superscript ν .The graph ˜ G ( i ) is a SBM with interaction probabilities − f in (0) and − f out (0) for intra-cluster and inter-cluster links.Denote ǫ = C spec βK N − − f in (0) , − f out (0))( f in (0) − f out (0)) , with C spec = 2 , and let E init bethe event that ∀ i ∈ [ N ] : ℓ \{ i } (˜ σ i , σ ) ≤ ǫ, (B.2)where ℓ \{ i } is the error on [ N ] \{ i } . 
Using [XJL20, Proposition B.3] and the union bound, wehave for N large enough, P ( E init ) ≥ − N ( N − − . (B.3)37or any σ ′ ∈ [ K ] N , let S K [ σ ′ , σ ] := arg min ρ ∈S K d Ham ( ρ ◦ σ ′ , σ ) , where S K the set of permutations on [ K ] . We will now use several times Lemma A.1, whichstates that under some condition, the set S K [ σ ′ , σ ] is a singleton, and gives in that case the uniquepermutation π ∈ S K [ σ, σ ′ ] . In fact, π is the consensus (cf. last part of Algorithm 1).Since ǫ = o (1) , we have ǫ < βK for N large enough. Hence by Lemma A.1, under theevent E init , the set S K [˜ σ i , σ ] is a singleton for every i ∈ [ N ] . We denote π i the only element of S K [˜ σ i , σ ] . Moreover, since ˆ σ ( i ) j = ˜ σ ( i ) j for j = i , we have N d Ham (cid:16) π i ◦ b σ ( i ) , σ (cid:17) ≤ N (cid:16) d Ham (cid:16) π i ◦ ˜ σ ( i ) , σ (cid:17) + 1 (cid:17) = N − N ǫ + 1 N < βK for N large enough, and hence again by Lemma A.1, π i is the only element of S K [ b σ ( i ) , σ ] . Then, ℓ (cid:16)b σ (1) , b σ ( i ) (cid:17) ≤ N d Ham (cid:16) π ◦ b σ (1) , π i ◦ b σ ( i ) (cid:17) ≤ N (cid:16) d Ham (cid:16) π ◦ b σ (1) , σ (cid:17) + d Ham (cid:16) σ, π i ◦ b σ ( i ) (cid:17)(cid:17) ≤ (cid:18) N − N ǫ + 1 N (cid:19) < βK for N large enough. Therefore, again by Lemma A.1, we conclude that π − ◦ π i is the onlyelement of S K [ b σ ( i ) , b σ (1) ] , and b σ i = arg max k ∈ [ K ] (cid:12)(cid:12)(cid:12) { j : b σ (1) j = k } ∩ { j : b σ ( i ) j = b σ ( i ) i } (cid:12)(cid:12)(cid:12) = (cid:16) π − ◦ π i (cid:17) (cid:16)b σ ( i ) i (cid:17) . (B.4)Hence, P (cid:16) ∃ π ∈ S k [ b σ (1) , σ ] : ( π ◦ b σ ) i = σ i (cid:17) ≤ P (cid:16) ∃ π ∈ S k [ b σ (1) , σ ] : ( π ◦ b σ ) i = σ i | E init (cid:17) + P ( E cinit )= ( a ) P (( π ◦ b σ ) i = σ i | E init ) + P ( E cinit )= ( b ) P (cid:16) ( π − i ◦ σ ) i = b σ i | E init (cid:17) + P ( E cinit ) ≤ ( K − 1) exp (cid:18) − N IβK (1 − (1 + c ) ǫ ) (cid:19) + N − where ( a ) comes from S k [ b σ (1) , σ ] = { π } , and ( b ) comes from equation (B.4). The last linecomes from Lemma B.2 and from equation (B.3).38et ξ ′ = (1 + c ) ǫ . By Assumption, ξ ′ = o (1) . Moreover, E ( ℓ ( b σ, σ )) = E min π ∈S K N X i ∈ [ N ] π ◦ b σ ) i = σ i ) ≤ E min π ∈S K [ b σ (1) ,σ ] N X i ∈ [ N ] π ◦ b σ ) i = σ i ) ≤ E N X i ∈ [ N ] (cid:16) ∃ π ∈ S K [ b σ (1) , σ ] : ( π ◦ b σ ) i = σ i (cid:17) = 1 N X i ∈ [ N ] P (cid:16) ∃ π ∈ S K [ b σ (1) , σ ] : ( π ◦ b σ ) i = σ i (cid:17) ≤ ( K − 1) exp (cid:18) − N IβK (1 − ξ ′ ) (cid:19) + N − . Finally, let ξ = ξ ′ + (cid:16) βKNI (cid:17) / . If ( K − 1) exp (cid:16) − NIβK (1 − ξ ) (cid:17) ≥ N , P (cid:18) ℓ ( b σ, σ ) ≥ ( K − 1) exp (cid:18) − N IβK (1 − ξ ) (cid:19)(cid:19) ≤ E ( ℓ ( b σ, σ ))( K − 1) exp (cid:16) − NIβK (1 − ξ ) (cid:17) ≤ exp (cid:18) N IβK ( ξ ′ − ξ ) (cid:19) + N − exp (cid:16) − NIβK (1 − ξ ) (cid:17) ≤ exp − s N IβK ! + N − . Otherwise, P (cid:18) ℓ ( b σ, σ ) ≥ ( K − 1) exp (cid:18) − N IβK (1 − ξ ) (cid:19)(cid:19) ≤ P ( ℓ ( b σ, σ ) > ≤ P (cid:18) min π ∈S K d Ham ( π ◦ ˆ σ, σ ) > (cid:19) ≤ P min π ∈S K [ˆ σ ,σ ] d Ham ( π ◦ ˆ σ, σ ) > ! ≤ X i ∈ [ N ] P ( ∃ π ∈ S K [ˆ σ , σ ] : ( π ◦ ˆ σ ) i = σ i ) ≤ N ( K − 1) exp (cid:18) − N IβK (1 − ξ ) (cid:19) + n − ≤ N − . Therefore, we can conclude that P (cid:18) ℓ ( b σ, σ ) ≥ ( K − 1) exp (cid:18) − N IβK (1 − ξ ) (cid:19)(cid:19) → . 
Hellinger divergence between sparse binary Markov chains C.1 Notations and main result In this section we consider { , } -valued Markov chains with initial distributions µ, ν , transitionmatrices P, Q , and path probability distributions defined by f x = µ x P x ,x · · · P x T − ,x T and g x = ν x Q x ,x · · · Q x T − ,x T , (C.1)which are assumed sparse in the sense that max { µ , ν , P , Q } ≤ δ (C.2)for some small constant δ > . By the union bound, this assumption implies that for both Markovchains, the probability of observing a path of length T not identically zero is at most δT . Thefollowing result describes a first-order expansion in a sparse setting. Proposition C.1. The Hellinger distance of Markov chain path probabilities defined by (C.1) ,which satisfy P Q < and (C.2) for some < δ ≤ such that δT ≤ , is approximated by Hel ( f, g ) = 12 ( √ µ − √ ν ) + T X t =2 J t + ǫ, where the error term is bounded by ≤ ǫ ≤ 24 ( δT ) , and J t = 12 (cid:16)p P − p Q (cid:17) + R (cid:18) − R − R (cid:19) + (1 − R − R ) (cid:18) √ µ ν − R − R (cid:19) R t − , with R ab = ( P ab Q ab ) / . C.2 Proof of Proposition C.1 For convenience, we use the shorthand notations ρ a = ( µ a ν a ) / , R ab = ( P ab Q ab ) / . (C.3)The proof is based on the following two lemmas. Lemma C.2. The squared Hellinger distance between Markov path probability distributions de-fined by (C.1) equals Hel ( f, g ) = 1 − P ⌈ T/ ⌉ j =0 S j , where S j = X a,b ∈{ , } X t s j,t ( a, b ) ρ − a ρ a R T − − j + a + b − t R j − a R j − b R t − j , (C.4) and the nonzero values of s j,t ( a, b ) are given by s , (0 , 0) = 1 , s ,t ( ab ) = T − t − , ( a, b ) = (0 , , ≤ t ≤ T − , , ( a, b ) = (0 , , (1 , , ≤ t ≤ T − , , ( a, b ) = (1 , , t = T, and s j,t ( a, b ) = (cid:0) t − j − (cid:1)(cid:0) T − t − j − a − b (cid:1) for ≤ j ≤ ⌈ T / ⌉ , and j ≤ t ≤ T − − j + a + b . roof. For a, b = 0 , , we denote by x ab the number of a → b transitions, and by k x k = P t x t thenumber of ones in path x = ( x , . . . , x T ) . We can split the sum on the right side of Hel ( f, g ) =1 − P x ( f x g x ) / into P x ( f x g x ) / = P j S j , where S j equals the sum of ( f x g x ) / over the setof paths with j on-periods. We further split S j into S j = X a,b S j ( a, b ) = X a,b X t S j,t ( a, b ) , where S j,t ( a, b ) equals the sum of ( f x g x ) / over the set A j,t ( a, b ) = n x ∈ { , } T : x + x = j, k x k = t, x = a, x T = b o (C.5)of paths with j on-periods and t ones which start at a and end at b . Observe that the number of on-periods can be written as x + x = x + x T , and the number of ones as k x k = x + x + x . Hence any path x in A j,t ( a, b ) satisfies x = j − a , x = j − b , x = t − j . Moreover, becausethe total number of transitions is T − , we also find that x = T − − j + a + b − t . Observenow that the path probabilities can be written as f x = µ x Y a,b P x ab ab and g x = ν x Y a,b Q x ab ab . Therefore, for any path x in A j,t ( a, b ) , ( f x g x ) / = ρ − a ρ a R T − − j + a + b − t R j − a R j − b R t − j . Hence (C.4) holds with s j,t ( a, b ) = | A j,t ( a, b ) | . We finish the proof by computing the cardinalities s j,t ( a, b ) .(i) Case j = 0 . The only path with no on-periods is the path of all zeros. Therefore, s ,t ( a, b ) = 1 for t = 0 and ( a, b ) = (0 , , and s ,t ( a, b ) = 0 otherwise.(ii) Case j = 1 . In this case s ,t (0 , 0) = T − t − for ≤ t ≤ T − and zero otherwise.Furthermore, s ,t (0 , 1) = s ,t (1 , 0) = 1 for ≤ t ≤ T − , and both are zero otherwise. 
Finally, s ,t (1 , 1) = 1 for t = T and zero otherwise.(iii) Case j ≥ . Now we proceed as follows. First, given a series of t ones, we choose j − places to break the series: there are (cid:0) t − j − (cid:1) ways of doing so. Then, we need to fill those breakswith zeros chosen among the T − t zeros of the chain. Note that when a = b = 0 , we also need toput zeros before and after the chain of ones. There are j − − a ) + (1 − b ) = j + 1 − a − b places to fill with T − t zeros, and we need to put at least one zero in each place: there are (cid:0) T − t − j − a − b (cid:1) ways of doing so. Therefore, we conclude that s j,t ( a, b ) = t − j − ! T − t − j − a − b ! . Lemma C.3. If (C.2) holds for some < δ ≤ such that δT ≤ , then ρ = 1 − δ = 1 − µ + ν ǫ ,R = 1 − δ = 1 − P + Q ǫ ,R T − = 1 − δ = 1 − ( T − P + Q ǫ ,S = 1 − δ = 1 − µ + ν − ( T − P + Q ǫ , A combinatorial fact, often refered as the stars and bars method, is that the number of ways in which n identicalballs can be divided into m distinct bins is (cid:18) n + m − m − (cid:19) , and (cid:18) n − m − (cid:19) if bins cannot be empty. here the error terms are bounded by ≤ δ , δ ≤ δ , ≤ δ ≤ δT , and ≤ δ ≤ δT ,together with | ǫ | , | ǫ | ≤ δ , | ǫ | ≤ δ T , and | ǫ | ≤ δ T .Proof. The error terms δ , δ , δ , δ are nonnegative because the quantities on the left are at mostone, being products of square roots of probabilities. Furthermore, µ + ν ≤ δ and P + Q ≤ δ due to (C.2). Hence the upper bounds on δ , δ , δ , δ follow from the bounds on ǫ , ǫ , ǫ , ǫ .Let us now verify the bounds on the error terms ǫ , ǫ , ǫ , ǫ . Taylor’s approximation (Lemma C.6)implies that ρ = (cid:16) (1 − µ )(1 − ν ) (cid:17) / = 1 − µ + ν ǫ , where the error term is bounded by | ǫ | ≤ (1 + / ) δ ≤ δ . Because R = ((1 − P )(1 − Q )) / , the same approximation also yields | ǫ | ≤ δ . If T = 1 , the bound for ǫ is trivial.Assume next that T ≥ . By applying Lemma C.6 again, now with a = ( T − / ≥ , we seethat R T − = (cid:16) (1 − P )(1 − Q ) (cid:17) ( T − / = 1 − ( T − P + Q ) + ǫ , where | ǫ | ≤ (cid:16) a (cid:17) a (1 + a ) δ = (cid:16) a (cid:17) ( T − δ ≤ δ T . By multiplying theapproximation formulas of ρ and R T − , we find that ǫ = ǫ (cid:18) − ( T − P + Q (cid:19) + ǫ (cid:18) − µ + ν (cid:19) + ǫ ǫ + ( T − µ + ν P + Q . By the triangle inequality, noting that δT ≤ , and | ǫ ǫ | ≤ δ T ≤ δ T due to δ ≤ , we findthat | ǫ | ≤ | ǫ | + | ǫ | + | ǫ ǫ | + δ T ≤ δ T . Let us now finish the proof of Proposition C.1. Lemma C.2 implies that Hel ( f, g ) = 1 − S − S − P ⌈ T/ ⌉ j =2 S j , where S = ρ R T − and S = S (0 , 0) + S (0 , 1) + S (1 , 0) + S (1 , ,where S (0 , 0) = ρ R R T − X t =1 ( T − t − R T − t R t − ,S (0 , 1) = ρ R T X t =2 R T − t R t − ,S (1 , 0) = ρ R T X t =2 R T − t R t − ,S (1 , 1) = ρ R T − . We will derive approximations to the above quantities.(i) Define ˜ S = 1 − µ + ν − ( T − P + Q . Then Lemma C.3 implies that | ˜ S − S | ≤ δ T .(ii) Define ˜ S = R R T − X t =1 ( T − t − R t − + ( R + ρ R ) T X t =2 R t − + ρ R T − . Because ρ , R ≤ , it follows that S ≤ ˜ S . Because ρ , R ≥ − δ (Lemma C.3) and R T − t ≥ R T − for t ≥ , it follows that S ≥ (cid:18) − δ (cid:19) T ˜ S ≥ (cid:18) − δT (cid:19) ˜ S . P T − t =1 ( T − t − R t − ≤ T P T − t =1 R t − ≤ T P ∞ t =0 R t , so that T − X t =1 ( T − t − R t − ≤ T (1 − R ) − . Moreover, P Tt =2 R t − ≤ ( T − . Because ρ , R ≤ δ and R ≤ − R , it follows that ˜ S ≤ δT R − R + 2 δ ( T − 1) + δ ≤ δT. Hence S ≥ ˜ S − δ T . 
This last line holds because R + R = 1 − H , where H is the squared Hellinger distancebetween distributions ( P , P ) and ( Q , Q ) on { , } , and therefore R ≤ − R . (iii)Let us now derive an upper bound for the term P ⌈ T/ ⌉ j =2 S j . Fix ≤ j ≤ ⌈ T / ⌉ . Recall fromLemma C.2 that S j = X a,b ∈{ , } X t s j,t ( a, b ) ρ − a ρ a R T − − j + a + b − t R j − a R j − b R t − j , and the nonzero values of s j,t ( a, b ) are given by s j,t ( a, b ) = (cid:0) t − j − (cid:1)(cid:0) T − t − j − a − b (cid:1) for j ≤ t ≤ T − − j + a + b . Because ρ , R ≤ and ρ , R ≤ δ , S j ≤ δ j X a,b ∈{ , } X t s j,t ( a, b ) R j − b R t − j . (C.6)We will deal with the cases b = 0 and b = 1 separately. If b = 0 , the inequality (cid:0) T − t − j − a − b (cid:1) ≤ T j − a − b ( j − a − b )! implies that X t s j,t ( a, R j R t − j = T − − j + a X t = j t − j − ! T − t − j − a ! R j R t − j ≤ R j T j − a ( j − a )! T − − j + a X t = j t − j − ! R t − j . It follows with the help of a geometric moment formula (Lemma C.4) that T − − j + a X t = j t − j − ! R t − j ≤ ∞ X t = j t − j − ! R t − j = (1 − R ) − j . Therefore, X t s j,t ( a, R j R t − j ≤ (cid:18) R − R (cid:19) j T j − a ( j − a )! ≤ T j − a ( j − a )! ≤ T j ( j − , (C.7)where we used R ≤ − R . 43f b = 1 , the previous reasoning leads to an extra (1 − R ) − factor. Let us first recallthat s j,t ( a, 1) = (cid:0) t − j − (cid:1)(cid:0) T − t − j − a − (cid:1) corresponds to the number of time series x ∈ { , } T with j on-periods, such that x = a, x T = 1 and || x || = t . If we call t j the length of the j -th on-period,we have ≤ t j ≤ T − j + 1 . For a given t j , choosing x accounts to cut a list of T − t zeros in j − − a places, and then fill the j − − a + a = j − spots with the t − t j remaining ones,given that no spot should be empty. Hence, s j,t ( a, 1) = t − j +1 X t j =1 T − t − j − − a ! t − t j − j − ! . From (cid:0) T − t − j − − a (cid:1) ≤ T j − − a ( j − − a )! and (cid:0) t − t j − j − (cid:1) ≤ (cid:0) t − j − (cid:1) , it follows that X t s j,t ( a, R j − R t − j = R j − T − j + a X t = j t − j +1 X t j =1 T − t − j − − a ! t − t j − j − ! R t − j ≤ R j − T j − − a ( j − − a )! T − j + a X t = j t − j +1 X t j =1 t − t j − j − ! R t − j ≤ R j − T j − − a ( j − − a )! T − j + a X t = j t − j − ! ( t − j + 1) R t − j ≤ R j − T j − a ( j − − a )! T − j + a X t = j t − j − ! R t − j . Using the geometric moment formula (Lemma C.4) T − j + a X t = j t − j − ! R t − j ≤ ∞ X t = j t − j − ! R t − j = (1 − R ) − ( j − , and the inequality R ≤ − R , it follows that X t s j,t ( a, R j − R t − j ≤ T j − a ( j − − a )! ≤ T j − a ( j − . (C.8)Going back to (C.6), and using (C.7)-(C.8), we can write S j ≤ δT ) j ( j − , and hence X j ≥ S j ≤ δT ) exp ( δT ) . By combining the estimates obtained in (i)–(iii), we conclude that Hel( f, g ) = 1 − ˜ S − ˜ S + ǫ, where | ǫ | ≤ δT ) + 6( δT ) + 4( δT ) exp ( δT ) . In particular, for δT ≤ , exp( δT ) ≤ e < implies | ǫ | ≤ 24 ( δT ) . − ˜ S − ˜ S = µ + ν T − P + Q − R R T − X t =1 ( T − t − R t − − ( R + ρ R ) T X t =2 R t − − ρ R T − . By applying formulas P Tt =2 R t − = (1 − R ) − (( T − − P Tt =2 R t ) and R T − = 1 − (1 − R ) P Tt =2 R t − , and simplifying the outcome using formulas µ + ν − ρ = ( √ µ − √ ν ) and P + Q − R = ( √ P − √ Q ) , we find that − ˜ S − ˜ S = 12 ( √ µ − √ ν ) + T X t =2 J t . Hence the claim of Proposition C.1 follows. C.3 Auxiliary asymptotics lemmas Lemma C.4. For any integer j ≥ and any real number ≤ q < , ∞ X k = j kj ! q k − j = (1 − q ) − ( j +1) . Proof. 
Let f ( q ) = (1 − q ) − . Then the j -th derivative of f equals f ( j ) ( q ) = j !(1 − q ) − ( j +1) .Because f ( q ) = P ∞ k =0 q k , we find that the j -th derivative of f also equals P ∞ k = j ( k ) j q k − j . Hencethe claim follows. Lemma C.5. For any ≤ x ≤ and a ≥ , the error term in the approximation (1 − x ) a =1 − ax + r ( x ) is bounded by | r ( x ) | ≤ | a − | a ax . Moreover, r ( x ) ≥ when a ≥ .Proof. The error term in the approximation f ( x ) = f (0) + f ′ (0) x + r ( x ) equals r ( x ) = R x R t f ′′ ( s ) dsdt and is bounded by | r ( x ) | ≤ cx with c = max ≤ x ≤ / | f ′′ ( x ) | . The function f ( x ) = (1 − x ) a satisfies f (0) = 1 and f ′ (0) = − a , together with f ′′ ( x ) = a ( a − − x ) a − .The claims follow after noticing that max ≤ x ≤ / | f ′′ ( x ) | = ( | f ′′ ( ) | = a a | a − | for < a < ,f ′′ (0) = a ( a − for a ≥ . Lemma C.6. For any ≤ x, y ≤ δ with δ ≤ , and any a ≥ , (cid:16) (1 − x )(1 − y ) (cid:17) a = 1 − a ( x + y ) + ǫ, where | ǫ | ≤ (cid:16) | a − | a (cid:17) aδ .Proof. Denote z = x + y − xy . Then ≤ z ≤ δ ≤ . Then by Lemma C.5, we findthat (1 − z ) a = 1 − az + r ( z ) , where the error term is bounded by | r ( z ) | ≤ | a − | a az . As aconsequence, (cid:16) (1 − x )(1 − y ) (cid:17) a = 1 − az + r ( z ) = 1 − a ( x + y ) + ǫ, where ǫ = axy + r ( z ) is bounded by | ǫ | ≤ axy + | r ( z ) | ≤ aδ + 2 | a − | a a (2 δ ) . Markov dynamics with long-time horizon D.1 Clustering using the union graph Proposition D.1. Let δ ( ν ) = max n µ ( ν )in (1) , µ ( ν )out (1) , P ( ν )in (0 , , P ( ν )out (0 , o and assume that T ( ν ) ≫ with δ ( ν ) T ( ν ) ≪ . Assume that the signal, coming from the dynamics, is stronger thanthe signal of the first snapshot, that is P ( ν )in (0 , T ( ν ) ≫ µ ( ν )in (1) and P ( ν )out (0 , T ( ν ) ≫ µ ( ν )out (1) .Let I ( ν )1 := (cid:18)q P ( ν )in (0 , − q P ( ν )out (0 , (cid:19) . Then, the following holds.(i) Exact recovery using the union graph is possible if lim inf ν →∞ N ( ν ) T ( ν ) K ( ν ) log N ( ν ) I ( ν )1 > , and impossible if lim sup ν →∞ N ( ν ) T ( ν ) K ( ν ) log N ( ν ) I ( ν )1 < . (ii) Almost exact recovery using the union graph is possible if lim inf ν →∞ N ( ν ) T ( ν ) K ( ν ) I ( ν )1 = ∞ , and impossible if lim sup ν →∞ N ( ν ) T ( ν ) K ( ν ) I ( ν )1 < ∞ . Proof. The union graph G ∪ = ∪ Tt =1 G t has adjacency matrix with entries max t A tij . There-fore, the union graph is an instance of a static SBM with intra-block link density p ∪ in = 1 − µ in (0) P in (0 , T − and inter-block link density p ∪ out = 1 − µ out (0) P out (0 , T − .A known result [ABH16, MNS16] states that exact recovery in SBM is possible if lim inf ν →∞ NK log N (cid:18)q p ∪ in − q p ∪ out (cid:19) > , and almost exact recovery is possible if lim inf ν →∞ NK (cid:18)q p ∪ in − q p ∪ out (cid:19) = ∞ , and similar impossibility conditions. Moreover, using the sparsity condition, we can write p ∪ in = µ in (1) + (1 − µ in (1))( T − P in (0 , 1) + O (cid:16) ( δ N T N ) (cid:17) ∼ T P in (0 , 1) + O (cid:16) ( δT ) (cid:17) , and similarly for p ∪ out . This ends the proof. Remark D.2. If µ ( ν )in (1) ≫ T ( ν ) P ( ν )in (0 , and µ ( ν )out (1) ≫ T ( ν ) P ( ν )out (0 , , then one would re-cover the static conditions for exact and almost exact recovery. Indeed, in this scenario, the patternarising from the dynamics are too weak to make an improvement.46 emark D.3. 
If T ( ν ) P ( ν )out (0 , ≫ µ ( ν )out (1) but T ( ν ) P ( ν )in (0 , ≪ µ ( ν )in (1) , then the condition forexact recovery becomes lim inf ν →∞ N ( ν ) T ( ν ) K ( ν ) log N ( ν ) P out (0 , > . In particular, this arises when the pattern interactions are i.i.d. for node pairs in different commu-nities, and static for node pairs in the same community. D.2 Clustering using time-aggregated adjacency tensors The majority of earlier literature on temporal network clustering is based on the analysis of thetime-aggregated N -by- N tensor A + ij = P t A tij . One might ask whether such aggregation destroysrelevant information about the data. The following results shows that for long ( T ≫ ) andsparse (expected number of on-periods ≪ N ) adjacency tensors, the amount of lost informationis negligible. Proposition D.4. Consider a homogeneous Markov SBM with the same Assumptions as Corol-lary 4.8. Then the conditions for exact and almost exact recovery for the the time-aggregatedtensor ( A + ij ) are the same as for the full tensor ( A tij ) .Proof. For t ∈ { , . . . , T } , let A t := { x ∈ { , } T : || x || = t } be the set of interaction patternswith t ones, and f W in ( t ) = X x ∈{ , } T : k x k = t f in ( x ) ,f W out ( t ) = X x ∈{ , } T : k x k = t f out ( x ) . Now the time-aggregated tensor ( A + ij ) is an instance of an edge-labelled SBM with finitely manypossible labels t ∈ { , . . . , T } whose probabilities are given by f W in ( t ) and by f W out ( t ) .Recall the definition of the sets A j,t ( ab ) , introduced in Equation (C.5): A j,t ( a, b ) = n x ∈ { , } T : x + x = j, k x k = t, x = a, x T = b o , and let us introduce ˜ f +in and ˜ f +out defined, for t ∈ [ T ] , by ˜ f in ( t ) = f in ( A ,t (0 , , ˜ f out ( t ) = f out ( A ,t (0 , . Note that in general, ˜ f in and ˜ f out are not probabilities distributions anymore as they do not sumto one. Nonetheless, D / (cid:16) f W in , f W out (cid:17) ≥ D / (cid:16) ˜ f in , ˜ f out (cid:17) . (D.1)Moreover, from the data processing inequality [vH14, Theorem 1], D / ( f in , f out ) ≥ D / (cid:16) f W in , f W out (cid:17) . (D.2)Let us make the following claim: D / (cid:16) ˜ f in , ˜ f out (cid:17) ≈ D / ( f in , f out ) . (D.3)Combining the claim with the inequality (D.2), yields D / (cid:16) f W in , f W out (cid:17) ≈ D / (cid:16) ˜ f in , ˜ f out (cid:17) , D / (cid:16) ˜ f in , ˜ f out (cid:17) = − T X t =0 q ˜ f in ( t ) ˜ f out ( t ) ! = − S + S (0 , , where S := p f in (0) f out (0) and S (0 , 0) := q f in ( A ,t (0 , f out ( A ,t (0 , . In the sparsesetting, an approximation of S (0 , was computed in Appendix C.1 and shown to be equal to S (0 , 0) = S U U T ( ν ) X t =1 ( T ( ν ) − t − U t − = S U U − U (cid:16) T ( ν ) − (cid:17) + o ( δ ( ν ) T ( ν ) ) , where U ab = s P in ( a, b ) P out ( a, b ) P in (0 , P out (0 , . Therefore, using Lemma C.3 and the sparsity condition δ ( ν ) T ( ν ) ≪ , we can write D / (cid:16) ˜ f in , ˜ f out (cid:17) = − − T ( ν ) − (cid:18) P in (0 , 1) + P out (0 , − R R − R (cid:19) + o (cid:16) δ ( ν ) T ( ν ) (cid:17)! = T ( ν ) (cid:16) I ( ν )1 + I ( ν )2 (cid:17) + o (cid:16) δ ( ν ) T ( ν ) (cid:17) , where I ( ν )1 and I ( ν )2 are defined in Theorem 4.1. This proves the claim, because as we noticed inCorollary 4.8, D / ( f in , f out ) ≍ T ( ν ) (cid:16) I ( ν )1 + I ( ν )2 (cid:17) . 
E Analysis of baseline algorithms E.1 Proof of Theorem 6.1 For a, b ∈ { , } , let n ij ( a ) = P b n ij ( a, b ) , where n ij ( a, b ) counts the observed number oftransitions a → b between nodes i and j . From [Bil61, Theorem 3.1 and Formula 3.13], thedistribution of the random variables ξ ij ( a, b ) := n ij ( a,b ) − n ij ( a ) p ij ( a ) √ n ij ( a ) tends to a normal distributionwith the zero mean and finite variance given by λ ( ab ) , ( cd ) := δ ac (cid:0) δ bd P ij ( a, b ) − P ij ( a, b ) P ij ( a, d ) (cid:1) .Therefore, for any α > , P (cid:16) | b P ij ( a, b ) − P ij ( a, b ) | ≥ α (cid:17) = P (cid:16) | ξ ij ( ab ) | ≥ α q n ij ( a ) (cid:17) (E.1)and this quantity goes to zero as T goes to infinity.From model identifiability, P in = P out . Therefore, w.l.o.g. we can assume P in (0 , = P out (0 , , and let α such that < α < P in (0 , − P out (0 , . The nodes i and j are predictedto be in the same community if b P ij (0 , > p in + p out , and the probability of making an error is P (cid:16)(cid:12)(cid:12)(cid:12) b P ij (0 , − P ij (0 , (cid:12)(cid:12)(cid:12) ≥ α (cid:17) . By the union bound, the probability that all nodes are correctly classified is bounded by N ( N − ij P (cid:16)(cid:12)(cid:12)(cid:12) b P ij (0 , − P ij (0 , (cid:12)(cid:12)(cid:12) ≥ α (cid:17) , ij . By equation (E.1), for all node pairs ij wehave P (cid:16) | b P ij (0 , − P ij (0 , | ≥ α (cid:17) → . Therefore, all nodes are a.s. correctly classified as T → ∞ . E.2 Proof of Proposition 6.4 Assume that the true block membership structure σ contains K blocks C , . . . , C K of sizes N k = | C k | . Let G t be the graph on node set V = [ N ] and edge set E t = { ij : A tij = 1 } . Let G T = ∩ t G t be the intersection graph.We denote by p Tkℓ = f kℓ (1 , . . . , | {z } T ) the probability of a persistent interaction of duration T between a pair of nodes in blocks k and ℓ . (a) Conditions for strong consistency. Algorithm 5 returns exactly the correct block mem-bership structure if and only if each C k forms a connected set of nodes in G T , and for all blocks k = ℓ , there are no links between C k and C ℓ in G T .The probability that the intersection graph G T contains a link between some distinct blocksis bounded by X ≤ k<ℓ ≤ K N k N ℓ p Tkℓ . Hence, by the union bound, the probability that Algorithm 5 does not give exact recovery isbounded by X k ∈ [ K ] (1 − c Tk ) + X ≤ k<ℓ ≤ K N k N ℓ p Tkℓ , where c Tk is the probability that the subgraph of G T induced by C k is connected. By classicalresults about Erd˝os–Rényi graph models [Les] we know that c Tk ≥ − e − ( N k p Tkk − log N k ) , whenever N k p Tkk ≥ max { e, log N k } . Hence X k ∈ [ K ] (1 − c Tk ) ≤ X k ∈ [ K ] e − ( N k p Tkk − log N k ) ≤ e − min k ∈ [ K ] log( KN k ) (cid:18) NkpTkk log( KNk ) − (cid:19) , and this last term goes to zero under Condition (6.2). Moreover, X ≤ k<ℓ ≤ K N k N ℓ p Tkℓ ≤ N ! max k = ℓ p Tkℓ which also goes to zero under Condition (6.1). (b) Condition for weak consistency. We just saw that the probability that the intersectiongraph G T contains a link between some distinct blocks is bounded by (cid:0) N (cid:1) max ≤ k<ℓ ≤ K p Tkℓ , andhence goes to zero if Condition (6.1) holds.Let G T [ C k ] be the subgraph of G T induced by C k . Let A kT be the event that the largestconnected component of G T [ C k ] has size at least N / , and all other components are smallerthan N / . 
Observe that G T [ C k ] is an instance of a Bernoulli random graph with N k nodeswhere all node pairs are independently linked with probability p Tkk . When N k p Tkk ≫ , classicalErd˝os–Rényi random graph theory tells that P ( A kT ) = 1 − o (1) for any fixed k and T as N ≫ .For bounded K, T = O (1) this implies that P ( ∩ k ∩ T A kT ) = 1 − o (1) .49n the event A = ( ∩ k ∩ T A kT ) ∩ B , the algorithm estimates ˆ K = K correctly, and (with thecorrect permutation), the number of misclustered nodes is at most X k ∈ [ K ] | C k \ ˆ C kT | ≪ , where ˆ C kT is the largest component of G T [ C k ] . E.3 Proof of Proposition 6.6 Denote the time-aggregated interaction tensor by A + ij = P t A tij . Let G ′ be the “enemy graph”with node set { , . . . , N } and adjacency matrix A ′ ij = 1(0 < A + ij < T ) . Let C , C be blockscorresponding to the true labelling σ . The probability that all intra-block interactions are static is p ( N ) T p ( N ) T ≥ ( p T p T ) N → . Hence, it follows that G ′ is whp bipartite with respect to partition { C , C } .Let us next analyze the probability that G ′ is connected. Let G ′′ be the graph on node set { , . . . , N } obtained by deleting all edges connecting pair of nodes within C or within C . Then G ′′ is random bipartite graph with bipartition { C , C } where each node pair ij with i ∈ C and j ∈ C is linked with probability q = 1 − p T , independently of other node pairs. Becauseblocks sizes are balanced according N , N ≍ N and N q ≫ log N , it follows by applying [SC95,Theorem 3.3] that G ′′ is connected with high probability. Because G ′′ is a subgraph of G ′ , thesame is true for G ′ .We have now seen that G ′ is whp connected and bipartite with respect to partition { C , C } .Let ˜ G be the graph on [ N ] , of which nodes i and j are linked if and only if there exists a 2-pathin G ′ between i and j . Then the connected components of ˜ G are C and C . Hence Algorithm 6estimates the correct block memberships on the high-probability event that G ′′