Sketch-based community detection in evolving networks

Abstract

We consider an approach for community detection in time-varying networks. At its core, this approach maintains a small sketch graph to capture the essential community structure found in each snapshot of the full network. We demonstrate how the sketch can be used to explicitly identify six key community events which typically occur during network evolution: growth, shrinkage, merging, splitting, birth and death. Based on these detection techniques, we formulate a community detection algorithm which can process a network concurrently exhibiting all processes. One advantage afforded by the sketch-based algorithm is the efficient handling of large networks. Whereas detecting events in the full graph may be computationally expensive, the small size of the sketch allows changes to be quickly assessed. A second advantage occurs in networks containing clusters of disproportionate size. The sketch is constructed such that there is equal representation of each cluster, thus reducing the possibility that the small clusters are lost in the estimate. We present a new standardized benchmark based on the stochastic block model which models the addition and deletion of nodes, as well as the birth and death of communities. When coupled with existing benchmarks, this new benchmark provides a comprehensive suite of tests encompassing all six community events. We provide a set of numerical results demonstrating the advantages of our approach both in run time and in the handling of small clusters.


Andre Beckus and George K. Atia

Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816 USA. (Dated: September 25, 2020)

I. INTRODUCTION

The detection of community structure in networks has garnered a great deal of attention, leading to a vast array of algorithms. Much of the focus has been on static networks, where the goal is to identify groups of nodes within which connections are dense and between which connections are relatively sparse. However, it is often the case that networks evolve with time. For example, edges in social media networks appear and disappear to reflect ever-changing friendships, and gene expression networks continuously evolve in response to external stimuli [1, 2]. In this dynamic setting, new sequential algorithms are needed to track the community structure underlying each temporal snapshot of the network. Here, we propose a sketch-based approach.

Sketching involves the construction of a small synopsis of a full dataset [3]. Notably, this technique has been used in static community detection [4, 5], where a sketch sub-graph is generated by sampling nodes from the full network. The sketch is clustered using an existing community detection algorithm, and the community membership of the nodes in the full network is inferred based on the estimated communities in the sketch. Here, we propose the use of a dynamic sketch which evolves to track the communities in the full network. This dynamic approach addresses two pervasive issues in community detection.

One important concern in community detection is the ability to process large graphs. Many static methods become infeasibly slow when processing a large network, leading to a search for efficient algorithms [6]. The extra time dimension inherent to the dynamic setting only makes this search for efficiency more pressing. However, time-evolving networks also offer a distinct advantage not found in the static domain. Specifically, evolving networks often possess temporal smoothness in which the community structure changes gradually [7]. In this case, previous snapshots offer prior information which can aid in the clustering of subsequent snapshots. We present a method which relies on a small sketch to convey information regarding previous snapshots. By using a small sketch, the algorithm can detect the main community events without requiring the full graph to be examined, thus reducing the required computational complexity. If the sketch size and number of clusters are fixed, the complexity of our algorithm scales linearly in network size.

Another typical issue found in community detection is the detection of small clusters [8]. If a community shrinks too small, it may become lost, i.e., the community may be absorbed into a larger cluster in the estimated partition. We show that once a cluster is captured in the sketch, it can be tracked even if the cluster becomes very small.

Our algorithm handles the six canonical community events observed in dynamic networks [9]: growth, shrinkage, merging, splitting, birth, and death. The existing benchmarks presented in [10], based on the well-known Stochastic Block Model (SBM) [11], include the first four of these events. Here, we propose a new benchmark which captures the last two events of birth and death. An important feature of the proposed benchmark is that the size of the network varies with time, a characteristic not found in the existing benchmarks. In addition to modeling the birth event, this benchmark incrementally adds new nodes to the network which join existing communities, a feature also not seen in [10]. The benchmarks capture the key fundamental processes under which a network can evolve, and provide a basic foundation for designing a community detection algorithm.

This paper is organized as follows. In Sec. II, we summarize existing community detection algorithms for evolving networks. Section III describes the network model, summarizes the existing SBM benchmarks, and proposes a new benchmark. In Sec. IV, we describe the sketch-based approach, and formulate techniques by which sketches can track the key evolutionary processes found in the benchmarks. Section V presents the proposed algorithm based on these tracking techniques. We present numerical results in Sec. VI and conclude in Sec. VII.

II. RELATED WORK: COMMUNITY DETECTION IN EVOLVING NETWORKS

A number of algorithms have been proposed for community detection in evolving networks (see [7, 12] for comprehensive surveys). One straightforward approach entails the independent clustering of each snapshot using a static algorithm. The communities in the current snapshot are matched to the previous communities such that there is continuity in the community identities. This category of algorithm contains a number of variants beginning with the classic work of [13].

More recently, many algorithms take a more sophisticated “dependent” approach, in which previous snapshots are accounted for in the clustering of the current snapshot. These algorithms have the potential to outperform independent community detection algorithms, since they incorporate previous knowledge directly in the clustering step.

One approach commonly seen in this category is the representation of each snapshot using a compact graph. In [14], a small weighted graph is constructed after clustering a given snapshot, with each community represented by a single “supernode”. Each supernode's self loop is weighted to reflect the number of edges within that community, whereas the edge weights between supernodes indicate the number of edges connecting the corresponding communities. When processing the next snapshot, nodes with changed edges are extracted from the supernodes and join the graph as singleton nodes. The graph is then clustered to produce a new partition estimate. In this way, only the first snapshot needs to be clustered in its entirety, with subsequent snapshots being clustered via their compact representation. A similar idea can be seen in dynamic methods built around the static Louvain algorithm [15], for example as seen in [16]. The extension of the Louvain algorithm to time-varying networks follows naturally from its reliance on supernodes. Our approach also uses a small representative graph, however using an altogether different idea of sketching, as described in Sec. IV.

The model used in this paper is based on the SBM [11]. Several recent algorithms have been developed based on dynamic SBM-based models. The dynamic models of [17, 18] specify that nodes move between a fixed set of communities according to a stationary transition probability matrix. In addition to allowing the movement of nodes between communities, the models of [19–21] also allow the edge probabilities of the communities to vary. Nonetheless, these works focus on the case where individual nodes only change community membership, i.e., the communities undergo the grow and shrink processes.

III. MODEL DESCRIPTION

Our model follows that described in [10], which is a dynamic extension to the SBM. At time $t$ the network snapshot is represented by graph $G(t) = (V(t), E(t))$, where $V(t)$ is the set of nodes in existence at time $t$, and $E(t)$ is the set of edges between these nodes. The network is partitioned into at most $q$ communities. Let $\mathcal{C}(t) = \{ C_\alpha(t) \mid \alpha = 1, \dots, q \}$ be the partition at time $t$, with $C_\alpha(t)$ denoting the set of nodes in community $\alpha$. We assume by convention that $C_\alpha(t) = \emptyset$ if community $\alpha$ does not exist at time $t$. Given a graph $G = (V, E)$ and node set $V' \subset V$, the subgraph of $G$ induced by $V'$ is denoted $G[V']$.

In each snapshot, an edge exists between nodes within a community with probability $p_{\text{in}}$. Unless otherwise specified, nodes in different clusters are connected with probability $p_{\text{out}}$. An exception to this rule occurs for the merge-split benchmark, where the intercommunity edge density varies as the community pairs merge and split.

We now describe two existing benchmarks and propose a third benchmark. These benchmarks capture important network processes which we explicitly use in the design of our algorithm.

A. Existing benchmarks: grow-shrink and merge-split

First, we summarize the existing benchmarks found in [10]. Each benchmark consists of an evolving network containing $2n$ total nodes. The underlying process is driven by a triangular waveform

$$x(t) = \begin{cases} 2t^*, & 0 \le t^* < 1/2, \\ 2 - 2t^*, & 1/2 \le t^* < 1, \end{cases} \tag{1}$$

where

$$t^* \equiv (t/\tau + \phi) \bmod 1, \tag{2}$$

$\tau$ is the period of the waveform, and $\phi$ controls the phase of the waveform. We will assume that $\phi = 0$ unless otherwise specified.

The grow-shrink benchmark models the movement of nodes between a pair of communities. At each time step the size of the first community is

$$n_A = n - n f \left[ 2x(t + \tau/4) - 1 \right], \tag{3}$$

where the parameter $f \in [0, 1]$ controls the amount of variation in the sizes of the communities. Since nodes lost from the first community transfer to the second community, and vice-versa, the size of the second community is $n_B = 2n - n_A$. Whenever a node transitions between communities, its edges are regenerated according to the intracommunity and intercommunity edge probabilities $p_{\text{in}}$ and $p_{\text{out}}$, respectively. For $t \in \{0, \tau/2, \tau\}$ the sizes of the communities are equal. At time $t = \tau/4$, a fraction $f$ of nodes in community A will have moved to community B, whereas at time $t = 3\tau/4$ a fraction $f$ of the nodes in community B will have moved to community A.

The merge-split benchmark models a pair of communities that repeatedly merge and split. At time $t = 0$ the two communities are fully split, each of size $n$, with intercommunity edges existing with probability $p_{\text{out}}$. New edges are gradually added between the two communities until they are completely merged at time $t = \tau/2$, at which point intercommunity edges will exist with probability $p_{\text{in}}$. Then, the process reverses and the new edges are removed until the communities are completely split again at time $t = \tau$. The snapshots are constructed in the following way. The intracommunity edges are added independently with probability $p_{\text{in}}$ and remain static throughout the process. The number of intercommunity edges $m_{\text{um}}$ in the unmerged state is drawn according to a binomial distribution with parameters $n^2$ and $p_{\text{out}}$. The number of edges $m_{\text{m}}$ in the merged state is similarly drawn, except using probability $p_{\text{in}}$. The intercommunity node pairs are sorted in random order, and edges are included between the first

$$m^*(t) = [1 - x(t)]\, m_{\text{um}} + x(t)\, m_{\text{m}} \tag{4}$$

node pairs at time $t$. In this way, the effective edge density between the two clusters is $p^*_{\text{inter}} = m^*(t)/n^2$. The communities are considered merged when

$$p_{\text{in}} - p^*_{\text{inter}} < \sqrt{\frac{1}{n}\left(p_{\text{in}} + p^*_{\text{inter}}\right)}. \tag{5}$$

This threshold was chosen based on the community detectability limit found in [22].

The benchmarks are periodic, i.e., the connections in the network at time $t$ will be exactly identical to those at time $t + r\tau$, for any integer $r$.

B. Birth-death benchmark
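As a reference point for the benchmark proposed in this subsection, the existing benchmarks of Eqs. (1)-(5) can be sketched in a few lines of Python. The function names, the rounding of sizes and edge counts to integers, and the parameter values in the usage note are assumptions made for illustration; they are not taken from [10].

```python
import math

def triangular_wave(t, tau, phi=0.0):
    """Eqs. (1)-(2): triangular waveform with period tau and phase phi."""
    t_star = (t / tau + phi) % 1.0
    return 2.0 * t_star if t_star < 0.5 else 2.0 - 2.0 * t_star

def grow_shrink_sizes(t, tau, n, f):
    """Eq. (3): sizes (n_A, n_B) of the two grow-shrink communities."""
    n_a = n - n * f * (2.0 * triangular_wave(t + tau / 4.0, tau) - 1.0)
    n_a = int(round(n_a))  # assumption: sizes are rounded to whole nodes
    return n_a, 2 * n - n_a

def merge_split_edge_count(t, tau, m_um, m_m):
    """Eq. (4): number of intercommunity node pairs joined by an edge."""
    x = triangular_wave(t, tau)
    return int(round((1.0 - x) * m_um + x * m_m))

def is_merged(p_in, p_inter, n):
    """Eq. (5): the pair counts as merged below the detectability gap."""
    return p_in - p_inter < math.sqrt((p_in + p_inter) / n)
```

With the illustrative values `tau = 100`, `n = 200`, and `f = 0.5`, `grow_shrink_sizes` gives equal sizes at `t = 0`, the smallest first community at `t = 25`, and the largest at `t = 75`, matching the description following Eq. (3).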

The grow-shrink and merge-split benchmarks discussed in the previous section lack important features. First, the networks remain fixed in size, with neither new nodes being added to the network, nor existing nodes being removed from the network. Furthermore, they do not model the fundamental processes in which communities are born or die. We now propose a new birth-death benchmark which includes these missing features.

A schematic diagram of the birth-death benchmark is shown in Fig. 1(a). The benchmark consists of two communities which pass into and out of existence. The size of the first community is

$$n_A = \begin{cases} 0, & x(t + \tau/4) \ge 1 - \gamma/2, \\ n\left[1 - x(t + \tau/4)\right], & \text{otherwise}, \end{cases} \tag{6}$$

where the parameter $\gamma \in [0, 1]$ controls the minimum size of the community. When $n_A = 0$, the community is non-existent. The community starts at time $t = 0$ with $n/2$ nodes. Nodes are then removed until $\gamma n/2$ nodes remain at time $t = \tau(1 - \gamma)/4$. At this point, the community dies and all of its remaining nodes are deleted from the network. At time $t = \tau(1 + \gamma)/4$, a new set of $\gamma n/2$ nodes is born as a community, which then grows until its size reaches $n$. At this point, nodes are again removed from the community until it contains $n/2$ nodes at time $t = \tau$. The size of the second community is

$$n_B = \begin{cases} 0, & x(t + \tau/4) < \gamma/2, \\ n\, x(t + \tau/4), & \text{otherwise}. \end{cases} \tag{7}$$

This community undergoes essentially the same process as the first community except with a phase shift of $\tau/2$. Nodes added to a community are new to the network, rather than existing nodes moved from one community to the other. Similarly, nodes removed from a community are deleted from the network rather than transferred to the other community.

FIG. 1. (a) Schematic representation of the birth-death benchmark. (b) Schematic representation of the mixed benchmark which stacks the grow-shrink, merge-split, and birth-death benchmarks.

FIG. 2. (a) Planted partitions for full graphs. Each vertical slice indicates the planted partition at time $t$. (b) Sketches produced with $n' = 40$. Each vertical slice indicates the planted partitions in sketch $S(t)$. White regions indicate that the corresponding node does not exist at time $t$.

C. Mixed (three benchmarks)
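The size schedules of Eqs. (6) and (7), which the mixed benchmark reuses for its birth-death communities, can be sketched as follows. This is a minimal illustration: the function names are hypothetical, sizes are returned as real numbers because the rounding rule is not stated here, and the values in the usage note are chosen only for the example.

```python
def triangular_wave(t, tau, phi=0.0):
    """Eqs. (1)-(2): triangular waveform with period tau and phase phi."""
    t_star = (t / tau + phi) % 1.0
    return 2.0 * t_star if t_star < 0.5 else 2.0 - 2.0 * t_star

def birth_death_sizes(t, tau, n, gamma):
    """Eqs. (6)-(7): sizes (n_A, n_B); a size of 0 means the community is dead."""
    x = triangular_wave(t + tau / 4.0, tau)
    n_a = 0.0 if x >= 1.0 - gamma / 2.0 else n * (1.0 - x)
    n_b = 0.0 if x < gamma / 2.0 else n * x
    return n_a, n_b
```

For example, with `tau = 100`, `n = 200`, and `gamma = 0.2`, both communities hold 100 nodes at `t = 0`, the first community is dead at `t = 25`, and the second is dead at `t = 75`, so the total number of nodes in the network changes over the period.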

In [10], a mixed benchmark is created by “stacking” the grow-shrink and merge-split benchmarks such that there are $4n$ nodes containing four communities. Two of the clusters, together containing $2n$ nodes, undergo the grow-shrink process while the other two clusters, also containing $2n$ nodes, undergo the merge-split process.

We propose an extended mixed benchmark to include the birth-death benchmark. A schematic of this mixed benchmark is shown in Fig. 1(b). The benchmark has a maximum of $6n$ nodes. The first $4n$ nodes contain the grow-shrink and merge-split benchmarks as previously described, whereas the last $2n$ nodes participate in the birth-death process (the actual number of nodes varies with time due to addition and deletion of nodes in the birth-death benchmark). We show an example of this mixed benchmark in Fig. 2(a), having parameters $n = 200$, $q = 6$, $f = 0.\ldots$, $\gamma = 0.\ldots$

IV. SKETCH-BASED TRACKING OF DYNAMIC PROCESSES

Our algorithm relies on a small representative sketch of the full network. The sketch captures important information which can be used to track the processes by which the network evolves. Meanwhile, the smaller size of the sketch allows these checks to be performed quickly without requiring a complete assessment of the entire network.

We first describe the sketch in detail, and then describe how this sketch can be used to detect specific events in each of the fundamental processes described in Sec. III. For the sake of clarity, we will consider the detection of each process in isolation, and in Sec. V present an algorithm which exploits all three techniques to track concurrent processes occurring in the same network.

A. Dynamic sketching

The sketch consists of a set of nodes sampled from the full network. At each time step, this set is updated to reflect the current state of the full network. The set of nodes in the sketch at time $t$ is denoted $S(t)$, and the subset of these nodes from cluster $\alpha$ is denoted $C'_\alpha(t) = S(t) \cap C_\alpha(t)$.

The sketch-based approach allows flexibility to choose which nodes are placed in the sketch. For the SBM used here, in which the intracommunity edges are placed with the same probability $p_{\text{in}}$ for all communities, algorithms generally have better success rates when the communities are of equal size [5]. In our approach, rather than requiring that the full network be balanced, we can instead improve the possibility of success by maintaining a balanced sketch. Ideally, all clusters in the sketch should be of equal size $n'$, but since communities in the full network may be smaller than $n'$, we set the size of community $C'_\alpha(t)$ as

$$\min\{ n', |C_\alpha(t)| \}. \tag{8}$$

An example sketch time series is shown in Fig. 2(b), where nodes have been sampled from the mixed benchmark shown in Fig. 2(a) according to (8). For the grow-shrink and birth-death processes, the white space indicates non-existent nodes, i.e., where the community sizes are smaller than $n'$.

For this example, we build the sketches using knowledge of the planted community partitions. The proposed algorithm has no such knowledge, and therefore must build the sketches based on its community estimates. We will present an actual set of sketches produced by the proposed algorithm in Sec. VI D.

B. Merge-split process
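The checks in this subsection operate on a balanced sketch sized according to Eq. (8). A minimal illustration of that sizing rule is given below, assuming (as in the Fig. 2(b) example) that the community partition is known and stored as a dictionary of node sets. The name `build_balanced_sketch` and its arguments are hypothetical.

```python
import random

def build_balanced_sketch(communities, n_prime, rng=None):
    """Sample min(n', |C|) nodes uniformly at random from each community C.

    communities: dict mapping a community label to a set of node ids.
    Returns a dict mapping each label to its sketch cluster (a set).
    """
    rng = rng or random.Random()
    sketch = {}
    for label, members in communities.items():
        k = min(n_prime, len(members))          # Eq. (8)
        # sorted() gives a sequence, which random.sample requires
        sketch[label] = set(rng.sample(sorted(members), k))
    return sketch
```

A community smaller than `n_prime` is included in full, so small clusters keep all of their nodes in the sketch while large clusters are capped.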

Suppose that we have a cluster $\alpha$ which is undergoing a split into two separate communities. We propose detection of the emerging clusters by using spectral techniques; here we use a simple approach based on the spectral gap heuristic [23].

First, we consider how to obtain the spectral gap in the current snapshot $G(t)$ based on the last estimate of the splitting community $C_\alpha(t-1)$. Let $A$ be the adjacency matrix of the subgraph of $G(t)$ induced by $C_\alpha(t-1)$. If $D$ is the diagonal matrix containing the degrees of the nodes in $A$ along the diagonal, then the normalized graph Laplacian is defined as [24]

$$L = I - D^{-1/2} A D^{-1/2}. \tag{9}$$

Let $\lambda_1, \lambda_2, \lambda_3$ be the three smallest eigenvalues of $L$ in increasing order. The eigengap heuristic states that if there is one community, $\lambda_2 - \lambda_1$ will tend to be large, whereas if there are two communities present, then $\lambda_2 - \lambda_1$ will tend to be smaller than $\lambda_3 - \lambda_2$. Noting that $\lambda_1 = 0$, we can determine the state of the process by monitoring

$$\lambda_{\text{gap}} = (\lambda_3 - \lambda_2)/\lambda_2. \tag{10}$$

An example is shown in Fig. 3 for a network having parameters $q = 2$, $n = 200$, $p_{\text{in}} = 0.\ldots$, $p_{\text{out}} = 0.05$. The planted partitions are shown in Fig. 3(a), and the dashed blue line in Fig. 3(b) shows the corresponding value of $\lambda_{\text{gap}}$ for each time step. As can be seen, the value of $\lambda_{\text{gap}}$ increases as the process moves in either direction away from the fully merged state (at $t = 50$) and towards the fully split state (at $t \in \{0, \tau\}$). For reference, the detectability limit that formally defines the split in the benchmark is shown as a vertical dashed line.

Rather than finding the eigenvalues for the community in the full network, we propose to estimate $\lambda_{\text{gap}}$ based on the sketch. We use the same procedure as described above, but instead let $A$ be the adjacency matrix of the subgraph of $G(t)$ induced by $C'_\alpha(t-1)$ rather than $C_\alpha(t-1)$. The resulting estimate is also shown in Fig. 3(b), computed with $n' = 50$ nodes sampled uniformly at random from each community at each time step. We note that the estimated value tends to be smaller than the actual value, thus making the split harder to detect. This is consistent with the fact that the sketch detectability limit, shown as a vertical dotted line, is larger due to the smaller cluster sizes (note that the right hand side of (5) is inversely proportional to the square root of the cluster size). Nonetheless, the estimate still allows the split to be identified. If more precision is required, a two-stage process could be used in which a small value in the estimate triggers a calculation of the exact value $\lambda_{\text{gap}}$ at a higher computational expense. Section V will discuss how the new partition is estimated once the split event is detected.

For the merge process, we exploit the fact that the two communities are already known at time $t - 1$. This means that we can estimate $p_{\text{in}}$ and $p^*_{\text{inter}}$ and use these estimates to directly check condition (5). Suppose that communities $\alpha, \alpha'$ are merging. The sketch allows us to quickly calculate empirical estimates of the edge probabilities using the expressions

$$\hat{p}_{\text{in}} = \sum_{u \in \{1, \dots, q\}} \frac{\left|\{ (i,j) \in E(t) \mid j \in C'_u(t-1) \}\right|}{\left|C'_u(t-1)\right|}, \tag{11}$$

$$\hat{p}_{\alpha,\alpha'} = \frac{\left|\{ (i,j) \in E(t) \mid i \in C'_\alpha(t-1),\ j \in C'_{\alpha'}(t-1) \}\right|}{\left|C'_\alpha(t-1)\right| \left|C'_{\alpha'}(t-1)\right|}. \tag{12}$$

Figure 3(c) shows the actual (dashed blue line) and estimated (solid orange line) values of $p_{\text{in}}$ for the example in Fig. 3(a). The actual (dashed purple line) and estimated (solid green line) values of $p^*_{\text{inter}}$ are also shown in the same plot. In both cases, the estimates track the actual values well.

FIG. 3. (a) Planted partitions of the merge-split benchmark. (b) Actual and estimated value of $\lambda_{\text{gap}}$ at each time step. (c) Actual and estimated value of $p_{\text{in}}$, $p^*_{\text{inter}}$ at each time step.

C. Birth-death process
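For reference, the split statistic of Eqs. (9) and (10) from the preceding subsection can be computed as in the following sketch. To keep the example free of dependencies, the eigenvalues are obtained with a small cyclic Jacobi routine; in practice a library eigensolver such as `numpy.linalg.eigvalsh` would be the natural choice. All function names are hypothetical, and the treatment of isolated nodes and of a vanishing $\lambda_2$ is an assumption, since the text does not discuss these cases.

```python
import math

def normalized_laplacian(adj):
    """Eq. (9): L = I - D^{-1/2} A D^{-1/2} for a symmetric 0/1 matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: isolated nodes contribute no off-diagonal terms.
            if i != j and adj[i][j] and deg[i] > 0 and deg[j] > 0:
                lap[i][j] = -adj[i][j] / math.sqrt(deg[i] * deg[j])
    return lap

def symmetric_eigenvalues(mat, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, ascending."""
    n = len(mat)
    a = [row[:] for row in mat]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):      # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):      # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def spectral_gap(adj):
    """Eq. (10): (lambda_3 - lambda_2) / lambda_2; needs at least 3 nodes."""
    lam = symmetric_eigenvalues(normalized_laplacian(adj))
    if lam[1] < 1e-12:          # disconnected graph: lambda_2 is zero
        return math.inf
    return (lam[2] - lam[1]) / lam[1]
```

On a single clique the returned gap is close to zero, whereas on two cliques joined by a single edge it is large, which is the behavior the eigengap heuristic relies on.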

First, we consider how to handle the incremental addition of new nodes to the network when these new nodes join an existing cluster. Suppose that at time $t$ a set of nodes $V^+(t)$ is added to community $\alpha$. We can estimate the correct community for a new node $i \in V^+(t)$ by evaluating its connectivity to the existing nodes in each cluster. To this end, we define

$$s_{i,u}(t) = \frac{\left|\{ (i,j) \in E(t) \mid j \in C'_u(t-1) \}\right|}{\left|C'_u(t-1)\right|}, \tag{13}$$

where it is assumed that $s_{i,u}(t) = 0$ if $C'_u(t-1)$ is empty. Noting that

$$\mathbb{E}[s_{i,u}(t)] = \begin{cases} p_{\text{in}}, & u = \alpha, \\ p_{\text{out}}, & u \ne \alpha, \end{cases} \tag{14}$$

we can see that $s_{i,u}(t)$ serves as a point estimate of the probability that there is an edge between node $i$ and an arbitrary node in $C'_u(t-1)$. Node $i$ can then be assigned to the cluster $\alpha$ with which connectivity is greatest, i.e., where

$$\alpha = \arg\max_{u \in \{1, \dots, q\}} s_{i,u}(t). \tag{15}$$

The variance in $s_{i,u}(t)$ is

$$\mathrm{Var}(s_{i,u}(t)) = \begin{cases} \dfrac{p_{\text{in}}(1 - p_{\text{in}})}{\left|C'_u(t-1)\right|}, & u = \alpha, \\[2ex] \dfrac{p_{\text{out}}(1 - p_{\text{out}})}{\left|C'_u(t-1)\right|}, & u \ne \alpha. \end{cases} \tag{16}$$

While the expected value is independent of sketch cluster size, the variance grows as the clusters shrink, thus motivating the use of equal-sized communities.

Now, consider the birth event in which some or all of the nodes $V^+(t)$ are added to a new community which does not exist at time $t - 1$. In this case, the expectation $\mathbb{E}[s_{i,u}(t)]$ will equal $p_{\text{out}}$ for any existing cluster. We can therefore identify these nodes as those having a connectivity significantly below the intracommunity density estimate found in (11). For determining an exact threshold, the standard deviation of the intracommunity density can be estimated as

$$\hat{\sigma}_{p_{\text{in}}} = \sqrt{\frac{\hat{p}_{\text{in}}(1 - \hat{p}_{\text{in}})}{\sum_{u \in \{1, \dots, q\}} \left|C'_u(t-1)\right|}}. \tag{17}$$

We then define the set of nodes in new clusters as

$$V_{\text{birth}} = \left\{ i \in V^+(t) \mid s_{i,u}(t) < \hat{p}_{\text{in}} - \hat{\sigma}_{p_{\text{in}}},\ \forall u \in \{1, \dots, q\} \right\}. \tag{18}$$

Since it is possible that the new nodes belong to multiple new clusters, we apply the static clustering algorithm found in [5] to cluster $V_{\text{birth}}$ (see Sec. V for details).

When nodes are deleted, the algorithm needs to remove these deleted nodes from the sketch, and then replace them by selecting new nodes from the same community uniformly at random.

D. Grow-shrink process
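The connectivity score $s_{i,u}(t)$ of Eq. (13), on which this subsection also relies, can be sketched together with the assignment rule (15) and the birth set (18). The code below is an illustration under stated assumptions: the graph is given as a hypothetical dictionary `neighbors` mapping each node to its set of neighbors, the sketch is a dictionary of node sets, and the estimates $\hat{p}_{\text{in}}$ and $\hat{\sigma}_{p_{\text{in}}}$ are passed in as precomputed numbers.

```python
def connectivity(node, neighbors, sketch_cluster):
    """Eq. (13): fraction of the sketch cluster that is adjacent to `node`."""
    if not sketch_cluster:
        return 0.0  # convention stated after Eq. (13)
    return len(neighbors[node] & sketch_cluster) / len(sketch_cluster)

def assign_node(node, neighbors, sketch):
    """Eq. (15): label of the sketch cluster with the greatest connectivity."""
    return max(sketch, key=lambda u: connectivity(node, neighbors, sketch[u]))

def birth_nodes(new_nodes, neighbors, sketch, p_in_hat, sigma_hat):
    """Eq. (18): new nodes whose connectivity to every cluster is below threshold."""
    threshold = p_in_hat - sigma_hat
    return {
        i for i in new_nodes
        if all(connectivity(i, neighbors, c) < threshold for c in sketch.values())
    }
```

Because each score only touches the sketch clusters, the cost per node depends on the sketch size rather than on the size of the full communities.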

Suppose that two clusters are evolving under the grow-shrink process, and that at time $t$, a set of nodes $V_{\alpha \to \alpha'}$ moves from cluster $\alpha$ to cluster $\alpha'$. To identify these nodes, we propose to use $s_{i,u}(t)$. The number of nodes in $V_{\alpha \to \alpha'}$ that are also contained in the sketch is

$$m' = \left| V_{\alpha \to \alpha'} \cap S(t-1) \right|. \tag{19}$$

For a node $i$ in community $\alpha$, we have for an arbitrary community $u$,

$$\mathbb{E}[s_{i,u}(t)] = \begin{cases} p_{\text{in}} - \dfrac{m'(p_{\text{in}} - p_{\text{out}})}{\left|C'_u(t-1)\right|}, & u = \alpha, \\[2ex] p_{\text{out}}, & u = \alpha', \end{cases} \tag{20}$$

while for a node $j$ in community $\alpha'$ we have

$$\mathbb{E}[s_{j,u}(t)] = \begin{cases} p_{\text{in}}, & u = \alpha', \\[1ex] p_{\text{out}} + \dfrac{m'(p_{\text{in}} - p_{\text{out}})}{\left|C'_u(t-1)\right|}, & u = \alpha. \end{cases} \tag{21}$$

For either node, the gap between the expected similarities for the correct and incorrect communities is

$$\mathbb{E}[s_{i,\alpha}(t)] - \mathbb{E}[s_{i,\alpha'}(t)] = \mathbb{E}[s_{j,\alpha'}(t)] - \mathbb{E}[s_{j,\alpha}(t)] = (p_{\text{in}} - p_{\text{out}}) \left[ 1 - \frac{m'}{\left|C'_\alpha(t-1)\right|} \right], \tag{22}$$

which follows by subtracting the two cases in (20) or in (21). If the sketch size is set such that $\left|C'_\alpha(t-1)\right| \gg m'$, then the reliability of (15) will be driven primarily by the density gap $p_{\text{in}} - p_{\text{out}}$.

V. ALGORITHM

We first describe two procedures upon which the algorithm depends, and then present the proposed algorithm itself. We finish with an analysis of the computational complexity of this algorithm.

A. Preliminaries

In order to cluster the first snapshot, and to partition new and splitting communities, we invoke the static sketch-based community detection framework found in [4, 5]. This framework first produces a static sketch of the full graph $G$ using the SamPling Inversely proportional to Node degree (SPIN) method [5], as detailed in the following procedure.

Sample-SPIN($G$, $N'$)
(1) $P_i \leftarrow \left( d_i \sum_{j \in V} d_j^{-1} \right)^{-1}$ for $i = 1, \dots, |V|$, where $d_i$ is the degree of node $i$.
(2) Sample $N'$ nodes without replacement to form set $S$. Node $i$ should be sampled with probability $P_i$.
(3) return $S$

Next, the framework applies an existing community detection algorithm $\mathcal{A}$ to the sketch $S$, and the membership of the nodes in the full graph is inferred based on the estimated partition. These steps are captured in the following procedure.

Static-Cluster($G$, $S$)
(1) $G' \leftarrow G[S]$
(2) Invoke community detection algorithm $\mathcal{A}$ on $G'$ to get sketch partition estimate $\mathcal{C}' = \{ C'_1, \dots, C'_{\hat{q}} \}$, where $\hat{q}$ is the estimated number of clusters.
(3) $C_u \leftarrow \emptyset$ for $u = 1, \dots, \hat{q}$
(4) for each node $i \in V$ do
(5)     $\alpha \leftarrow \arg\max_{u \in \{1, \dots, \hat{q}\}} \dfrac{\left|\{ (i,j) \in E \mid j \in C'_u \}\right|}{\left|C'_u\right|}$
(6)     $C_\alpha \leftarrow C_\alpha \cup \{ i \}$
(7) end for
(8) return partition $\mathcal{C} = \{ C_\alpha \mid \alpha = 1, \dots, \hat{q} \}$

This static community detection framework has two key advantages. First, applying algorithm $\mathcal{A}$ to a small sketch reduces the computational complexity of this costly step as compared to clustering the full graph. Second, when $G$ contains communities of disproportionate size, SPIN tends to produce a sketch with more uniform community sizes, thus improving the likelihood of success in the subsequent clustering step.

B. Proposed algorithm
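The two procedures of Sec. V A that the algorithm below invokes can be sketched in Python as follows. Several choices here are assumptions rather than details from [4, 5]: the graph is a hypothetical dictionary `neighbors` of neighbor sets, sampling without replacement is done by drawing one node at a time and renormalizing the remaining weights, isolated nodes are skipped, and a connected-components routine stands in for the unspecified algorithm $\mathcal{A}$.

```python
import random

def sample_spin(neighbors, n_sample, rng=None):
    """Sample-SPIN: draw nodes without replacement, weight proportional to 1/degree."""
    rng = rng or random.Random()
    # Assumption: degree-0 nodes are skipped, since 1/d is undefined for them.
    weights = {v: 1.0 / len(nb) for v, nb in neighbors.items() if nb}
    chosen = set()
    while weights and len(chosen) < n_sample:
        nodes = list(weights)
        pick = rng.choices(nodes, weights=[weights[v] for v in nodes], k=1)[0]
        chosen.add(pick)
        del weights[pick]
    return chosen

def connected_components(sub):
    """Toy stand-in for algorithm A: connected components of the sketch graph."""
    seen, parts = set(), []
    for start in sub:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(sub[v] - comp)
        seen |= comp
        parts.append(comp)
    return parts

def static_cluster(neighbors, sketch, cluster_sketch):
    """Static-Cluster: cluster the sketch, then assign every node by connectivity."""
    sub = {v: neighbors[v] & sketch for v in sketch}   # induced subgraph G[S]
    sketch_parts = cluster_sketch(sub)                 # step 2
    partition = [set() for _ in sketch_parts]
    for node, nb in neighbors.items():                 # steps 4-7
        scores = [len(nb & part) / len(part) for part in sketch_parts]
        partition[scores.index(max(scores))].add(node)
    return partition
```

Any clustering routine that maps the induced sketch graph to a list of non-empty node sets can be passed as `cluster_sketch`; the connected-components stand-in is only adequate for toy graphs in which the sketch clusters are disconnected.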

We now present the proposed algorithm, followed by adescription of its steps.Input: Sketch community size n (cid:48) . Graph snapshots G ( t ) , t = 0 , , . . . (1) S (cid:48) ← Sample-SPIN ( G (0) , qn (cid:48) )(2) C (0) ← Static-Cluster ( G (0) , S (cid:48) )(3) Build sketch S (0) by sampling n (cid:48) nodes uniformlyat random from each community C ∈ C (0). If n (cid:48) > | C | , then include all nodes from C .(4) r ← |C (0) | .(5) for t = 1 , , . . . do (6) G ← G ( t )(7) C u ← ∅ for u ∈ { , . . . , r } (8) Build set V birth using equation (18).(9) if V birth (cid:54) = ∅ then (10) G ← G [ V birth ](11) S ←

Sample-SPIN ( G, n (cid:48) )(12) { C r +1 , . . . , C r + (cid:98) q } ← Static-Cluster ( G, S )(13) r ← r + (cid:98) q (14) end if (15) for each node i ∈ V ( t ) \ V birth do (16) α ← arg max u ∈{ ,...,r } s i,u ( t )(17) C α ← C α ∪ { i } (18) end for (19) for α ∈ { , . . . , r } , where | C α | > a do (20) C (cid:48) ← C α ∩ S ( t −

1) (21) G (cid:48) ← G [ C (cid:48) ](22) Let A be the adjacency matrix of G (cid:48) . Cal-culate eigenvalues λ , λ of the normalizedLaplacian L deﬁned in equation (9) (see Sec.IV B). Calculate λ gap as in equation (10).(23) if λ gap > b then (24) G α ← G [ C α ](25) { C α , C r +1 } ← Static-Cluster ( G α , C (cid:48) )(26) r ← r + 1(27) end if (28) end for (29) for community pairs α, α (cid:48) ∈ { , . . . , r } do (30) if (cid:98) p in − (cid:98) p α,α (cid:48) < c (cid:114) ( (cid:98) p in + (cid:98) p α,α (cid:48) ) | C α | + | C α (cid:48) | then (31) C α ← C α ∪ C α (cid:48) (32) C α (cid:48) ← ∅ (33) end if (34) end for (35) C ( t ) ← (cid:8) C u | u = 1 , . . . , r (cid:9) (36) S ( t ) ← S ( t − \ V − (37) Re-proportion sketch S ( t ) such that it containsmin { n (cid:48) , | C |} nodes from each community C ∈C ( t ).(38) end for Output: Partitions C ( t ) , t = 0 , , . . . Steps 1-2 cluster the ﬁrst graph snapshot. This ﬁrstpartition estimate is used to construct a balanced sketchin step 3. The remainder of the algorithm processes eachsubsequent snapshot, handling each of the processes de-scribed in Sec. IV.First, the grow-shrink and birth-death processes areaddressed. Steps 8-14 identify and partition the set ofnewly-born communities. Meanwhile, steps 15-18 re-evaluate the community membership of existing nodes,as well as new nodes joining existing communities, i.e.,new nodes not in V birth . We note that the algorithmre-evaluates all nodes for simplicity, but we could reducethe time required for this step by only re-evaluating nodeswith changed edges.Steps 19-28 handle splits within each community. Onlycommunities with size greater than a are checked, as thespectral estimates become unreliable for small communi-ties. If the spectral gap λ gap exceeds threshold param-eter b , then a split is declared and step 25 bi-partitionsthe community. 
If the community is undergoing a split into more than two communities, then the additional communities will be detected and split at the next time step, a sequence which is reminiscent of the recursive bi-partitioning scheme sometimes used in spectral clustering [25]. Parameters $a$ and $b$ can be determined based on the merge-split benchmark as shown in Fig. 3. First, we can find the smallest value of $n'$ which gives a sufficiently reliable estimate of $\lambda_{\mathrm{gap}}$, and then set $a$ to this value. Second, the parameter $b$ should be set high enough such that noise in the estimate of $\lambda_{\mathrm{gap}}$ during the merged state will not prematurely trigger a split (when $n'$ is set as specified in the input to the algorithm). In this paper, we set $a = 20$, $b = 0.$[…] $n$ set to the average of the community sizes, and an additional scaling parameter $c$ in the right hand side. It might seem best to set $c = 1$ such that it exactly matches the detectability threshold in equation (5). However, the shrinking density gap $\hat{p}_{\mathrm{in}} - \hat{p}_{\alpha,\alpha'}$ causes erratic behavior in step 16, resulting in nodes incorrectly being moved between the pair of merging communities. This in turn corrupts the estimates $\hat{p}_{\mathrm{in}}$, $\hat{p}_{\alpha,\alpha'}$. We find that triggering the merge earlier by setting $c = 2$ avoids this issue.

C. Computational complexity analysis

In this section, we take $q$ to be the maximum number of communities, and $N$ to be the maximum number of nodes in any given snapshot. For the proposed algorithm, the complexity for estimating each partition $\mathcal{C}(t)$, $t \ge 0$, is $O\left(q^2 n' (q n'^2 + N)\right)$. A detailed justification for this result follows.

We start by commenting on the complexity of the procedures in Sec. V A. The computational complexity of Sample-SPIN is $O(N'|V|)$. The complexity of Static-Cluster depends on which algorithm $\mathcal{A}$ is chosen. We assume $\mathcal{A}$ to be at most cubic in $|S|$, whereas steps 4-7 of Static-Cluster incur a cost of $O(\hat{q}\,|S|\,|V|)$ time. Therefore, the run time of this procedure is $O\left(|S|\,(|S|^2 + \hat{q}\,|V|)\right)$.

The proposed algorithm first estimates the communities $\mathcal{C}(0)$ in $G(0)$. Since $|S| = qn'$, step 1 takes $O(q n' N)$ time, and step 2 takes $O\left(q^2 n' (q n'^2 + N)\right)$ time. We now consider the remainder of the algorithm, which estimates the community partitions $\mathcal{C}(t)$ in each snapshot $G(t)$ for $t > 0$. Steps 8-14 take $O\left(q^2 n' (q n'^2 + N)\right)$ time due to the cost of invoking Sample-SPIN and Static-Cluster. Calculation of the similarity metric $s_{i,u}(t)$ for a single community $u$ and single node $i$ takes time $O(n')$, and so steps 15-18 take time $O(q n' N)$ in total. For the split detection, calculation of the eigenvalues in step 22 takes $O(n'^3)$ time. If a split is detected, then the bi-partitioning in step 25 takes $O\left(n'(n'^2 + N)\right)$ time, since the size of the sketch $C'$ is $O(n')$. The aforementioned steps are repeated for each community, and so steps 19-28 altogether run in $O(q n'^3)$ time. For the merge detection, the estimates $\hat{p}_{\mathrm{in}}$ and $\hat{p}_{\alpha,\alpha'}$ (for all community pairs $\alpha, \alpha'$) can be calculated once per snapshot at a total cost of $O\left(n'^{[\ldots]}\right)$. Once these estimates are calculated, the loop in steps 29-34 runs in $O(q^2)$ time. In conclusion, the run time for the first snapshot in steps 1-3, and for each subsequent loop of steps 6-37, are both $O\left(q^2 n' (q n'^2 + N)\right)$. This yields the stated result.

VI. RESULTS

We now present results demonstrating the performance of the proposed algorithm. For comparison, results are also shown for four different algorithms. First, we use a classic "independent" community detection approach based on those described in [13, 26]. This approach applies a community detection algorithm $\mathcal{A}$ to each snapshot to obtain community estimates $\mathcal{C} = \{C_1, \ldots, C_q\}$. To provide continuity in the community assignments of the nodes, community $i$ in each snapshot at time $t > 0$ […] $t - 1$ […] $\alpha$, we set $C_\alpha(t) = C_{\alpha'}$, where

$$\alpha' = \arg\max_u \frac{|C_\alpha \cap C_u(t-1)|}{|C_\alpha \cup C_u(t-1)|}. \qquad (23)$$

This algorithm is referred to as Standard Independent (SI). Second, we apply a variation of SI, in which the static sketch-based algorithm of [5] is applied to each snapshot, using sketch size $N' = n'q$. The sketch-based algorithm uses an arbitrary algorithm $\mathcal{A}$ to partition the sketch. We refer to this method as Sketch-based Independent (SbI). Third, we run the algorithm of [14], which we refer to here as (Dinh, 2009). Lastly, we use ESPRA (Evolutionary clustering based on Structural Perturbation and Resource Allocation similarity), which is based on structural perturbation theory [28]. When running the ESPRA algorithm, we use the same parameters as used in the experimental results of [28]: $\alpha = 0.$[…], $\beta = 0.$[…] $\mathcal{A}$. Here, we use spectral clustering [24], and estimate the number of communities using the eigengap heuristic [23]. Specifically, let $A$ be the adjacency matrix of the entire graph snapshot, and let $\lambda_1, \ldots, \lambda_q$ be the $q$ smallest eigenvalues of the Laplacian $L$ defined in (9).
Then the estimated number of clusters is

$$\hat{q} = \arg\max_{r \in \{q^{-}, \ldots, q\}} (\lambda_{r+1} - \lambda_r), \qquad (24)$$

where $q^{-}$ is the minimum number of possible clusters. The communities are identified by performing k-means clustering on the first $\hat{q}$ eigenvectors of $L$.

We note that SI and SbI cannot handle the birth and split processes, and therefore we only apply these algorithms to the grow-shrink benchmark. Furthermore, ESPRA does not account for networks of changing size, and therefore is only applied to the grow-shrink and merge-split benchmarks. Both the proposed algorithm and (Dinh, 2009) can accommodate all of the benchmarks.

First, we show the run time of the proposed algorithm. Second, we demonstrate the ability of the algorithm to handle small clusters in the birth-death and grow-shrink benchmarks. Next, we demonstrate the performance of the algorithm on the merge-split benchmark, as well as on the mixed benchmark. Finally, we perform a sensitivity analysis showing how varying sketch size affects the success rate of the algorithm.

FIG. 4. Timing results for the algorithms on the grow-shrink benchmark (horizontal axis: $n$; vertical axis: time (s); curves: Proposed, SI, SbI, (Dinh, 2009), ESPRA). Time is averaged over 10 trials and plotted using logarithmic scales for both axes.
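The community matching of Eq. (23) and the eigengap rule of Eq. (24) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names and data layout are hypothetical, the eigenvalues are assumed to be precomputed and sorted in ascending order, and no attempt is made to resolve the case where two new communities are matched to the same previous label.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two node sets (0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_labels(current, previous):
    """Eq. (23) (sketch): label each new community by its best previous match.

    current: list of node sets detected in the snapshot at time t.
    previous: dict label -> node set from the snapshot at time t - 1.
    Returns one label per entry of `current`; conflicts are not resolved.
    """
    return [max(previous, key=lambda u: jaccard(c, previous[u]))
            for c in current]


def estimate_num_communities(eigenvalues, q_min, q_max):
    """Eq. (24) (sketch): eigengap heuristic over r in {q_min, ..., q_max}.

    eigenvalues: Laplacian eigenvalues sorted in ascending order, so that
    eigenvalues[r - 1] plays the role of lambda_r (needs q_max + 1 values).
    """
    if len(eigenvalues) < q_max + 1:
        raise ValueError("need at least q_max + 1 eigenvalues")
    return max(range(q_min, q_max + 1),
               key=lambda r: eigenvalues[r] - eigenvalues[r - 1])
```

For instance, with `previous = {"a": {1, 2, 3}, "b": {4, 5, 6}}`, the call `match_labels([{4, 5, 6, 7}, {1, 2}], previous)` labels the first detected community "b" and the second "a". Passing eigenvalues to `estimate_num_communities` rather than a graph keeps the sketch independent of any particular linear algebra library.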

A. Run time

We first demonstrate the speed-ups possible with the proposed algorithm. We run all of the algorithms on the grow-shrink benchmark with parameters $q = 2$, $n' = 100$, $p_{\mathrm{in}} = 0.$[…], $p_{\mathrm{out}} = 0.$[…], $f = 0.5$, and show the results in Fig. 4. All algorithms had perfect community estimates for all network sizes, except for ESPRA which still performed well, with less than 0.02% of the nodes being misclassified (on average) in every snapshot. The proposed algorithm finishes very fast, in under three seconds for all cases. SbI also rapidly clusters the network time-series through its use of sketching. While both SbI and the proposed algorithm scale well with increasing network size, the proposed algorithm still holds an advantage in success rate given that it carries over the sketch from previous iterations (we will show an example of this in the next section). Both SI and ESPRA apply spectral techniques to the full graph, and therefore scale super-linearly with network size. Although (Dinh, 2009) clusters a graph of reduced size at each time step, nodes having changed edges are left as singleton nodes. In this example, the edge changes are sufficient to keep many nodes as singletons, thus increasing the graph size and run time.

B. Performance with small clusters

We now use the grow-shrink and birth-death benchmarks to evaluate the algorithms' handling of small clusters. We use normalized agreement to compare the planted communities $\mathcal{C} = \{C_1, \ldots, C_q\}$ and estimated communities $\hat{\mathcal{C}} = \{\hat{C}_1, \ldots, \hat{C}_q\}$ (empty communities are added to the smaller set such that $|\mathcal{C}| = |\hat{\mathcal{C}}|$). Normalized agreement is defined as [29]

$$\tilde{A} = \frac{1}{q} \max_{\pi} \sum_{\substack{u=1 \\ |C_u| > 0}}^{q} \frac{\left|C_u \cap \hat{C}_{\pi(u)}\right|}{|C_u|}, \qquad (25)$$

where $\pi$ ranges over the permutations on $q$ elements (this permutation is necessary since the community indices may be ordered arbitrarily). Normalized agreement proves useful for quantifying performance in the presence of small clusters, since each community constitutes a fraction $1/q$ of the normalized agreement, regardless of community size. The normalized agreement for the snapshot at time $t$ is denoted $\tilde{A}(t)$. Plots of $\tilde{A}(t)$ show an ensemble average over 50 independent runs.

For summarizing the overall deviation between the actual and estimated communities for a snapshot sequence, we use the average-squared error

$$E_{\tilde{A}} = \frac{1}{T} \sum_{t=1}^{T} \left[1 - \tilde{A}(t)\right]^2, \qquad (26)$$

where $T$ is the total number of snapshots. When plotting $E_{\tilde{A}}$, we take an average over 50 independent trials.

We first consider a network with two concurrent instances of the birth-death benchmark. For the first instance, we set the size parameter to $n$ and use a phase shift of $\phi = 0$, whereas for the second instance we set the size parameter to $500 - n$ and use a phase shift of $\phi = \tau/2$. Both instances have parameters $q = 4$, $p_{\mathrm{in}} = 0.$[…], $p_{\mathrm{out}} = 0.$[…], $q^{-} = 2$. We set $n' = 50$, which leads to a maximum sketch size of 200 nodes.

One means for producing small clusters is by using small values of $\gamma$, such that each community is small immediately after birth and before death. An example is shown in Fig. 5(a), with $\gamma = 0.1$ and $n = 250$. The community detection results are shown in Fig. 5(b) using both the proposed algorithm and (Dinh, 2009). The algorithm of (Dinh, 2009) tends to absorb the small communities into the large communities, as exhibited by the large drop in normalized agreement immediately after birth and before death. Meanwhile, the proposed algorithm maintains $\tilde{A}(t) > 0.998$ for all $t \ge 0$. We expand on this example by plotting $E_{\tilde{A}}$ as a function of $\gamma$ in Fig. 5(c). As with the previous example, we have balanced instance sizes with $n = 250$. The proposed algorithm has $E_{\tilde{A}} < 0.002$ for all values of $\gamma$.

Another means for introducing small clusters is by reducing $n$, i.e., making smaller communities in the first benchmark instance while at the same time increasing the sizes of the communities in the second instance. For this example, we use a smaller sketch size with $n' = 25$. The value of $E_{\tilde{A}}$ is shown in Fig. 5(d) for varying $n$ with $\gamma = 0.5$. For the proposed algorithm, $E_{\tilde{A}}$ remains below 0.005 for $n \ge 40$. On the other hand, (Dinh, 2009) does not fall under this threshold until $n \ge$ […]

FIG. 5. (a) Planted partitions for a double-stacked version of the birth-death benchmark, with $n = 250$, $\gamma = 0.1$. (b) Normalized agreement $\tilde{A}(t)$ plotted as a function of time. Squared error of normalized agreement $E_{\tilde{A}}$ is shown for (c) varying $\gamma$ with $n = 250$ and (d) varying $n$ with $\gamma = 0.5$. Curves: Proposed, (Dinh, 2009).

[…] $\phi = 0$ and size parameter $n$ for the first instance, and $\phi = \tau/$[…] and size parameter $500 - n$ for the second instance. Figure 6(a) shows planted partitions for an example with $f = 0.95$. For the proposed algorithm we set $n' = 50$, and for SbI, we use a sketch size of 200 (to match the sketch size used in the proposed algorithm).

As with the birth-death benchmark, we can produce smaller clusters by reducing the value of $n$. Results are shown in Fig. 6(b) for varying $n$. We can also produce small communities by increasing $f$. This makes two small communities and two large communities at both $t = \tau/4$ and $t = 3\tau/4$. The value of $E_{\tilde{A}}$ is shown as a function of $f$ in Fig. 7(a). Both the proposed algorithm and (Dinh, 2009) have very similar performance, both having $E_{\tilde{A}} < 0.01$ for all values of $f$. The other algorithms perform significantly worse. To gain further insight into the behavior of the algorithms, we plot the value of $\tilde{A}(t)$ for each algorithm in Fig. 7(b)-(f). The value of $f$ is varied along the vertical axis, and the corresponding value of $\tilde{A}(t)$ is plotted along the horizontal axis as a function of time $t$. For the proposed algorithm, (Dinh, 2009), and ESPRA, most errors occur when the communities are most imbalanced (with ESPRA encountering low values of $\tilde{A}(t)$ over a much wider range of time around these extreme points). SI tends to lose track of the small clusters at $t = \tau/4$, resulting in a merge of communities and a sharp drop in agreement. This occurs again at $t = 3\tau/4$. […] $t = 0$.

FIG. 6. (a) Planted partitions for a double-stacked version of the grow-shrink benchmark, with $n = 250$, $f = 0.95$. (b) Squared error of normalized agreement $E_{\tilde{A}}$ shown for varying $n$ with $f = 0.$[…]. Curves: Proposed, SI, SbI, (Dinh, 2009), ESPRA.

C. Merge-split detection

We now execute the algorithms on the merge-split benchmark. We use two concurrent instances of the merge-split process, such that both instances undergo a split and merge simultaneously. The planted partitions are shown in Fig. 8(a). The parameters of the model are $q = 2$, $n = N/$[…], $p_{\mathrm{in}} = 0.$[…], $p_{\mathrm{out}} = 0.$[…], $q^{-} = 2$, and we set $n' = 50$ for the proposed algorithm. This results in a sketch size of 100 in the merged state, and 200 in the split state. The partitions reconstructed by the algorithms are shown in Fig. 8(b)-(d).

FIG. 7. Results for varying $f$ in the grow-shrink example in Fig. 6. Plot of $E_{\tilde{A}}$ is shown in (a). Panels (b) through (f) show heat maps of $\tilde{A}(t)$ as a function of time along the horizontal axis, and $f$ along the vertical axis for each algorithm: (b) Proposed, (c) (Dinh, 2009), (d) ESPRA, (e) SI, (f) SbI.

It is important to note that the network gradually interpolates between the fully merged and fully split states. The planted partitions, on the other hand, undergo an instantaneous transition between these states. This discrepancy means we cannot expect that the estimated partitions will exactly match the planted partitions. Indeed, all three algorithms overestimate the span of time during which the communities are merged. This is as expected since the benchmark defines the instantaneous transition to occur at the theoretical detectability limit.

For the proposed algorithm, nodes start being misclassified at $t = 19$. This is expected due to the shrinking density gap $p_{\mathrm{in}} - p^{*}_{\mathrm{inter}}$, as described in Sec. V B. Nonetheless, for the proposed algorithm, the times at which the estimated communities merge and split more closely match the corresponding event times in the benchmark.

D. Mixed benchmark

So far, our results have considered individual benchmarks in isolation. We now run the proposed algorithm on the mixed benchmark shown in Fig. 2(a). Recall that this example has concurrent birth-death, grow-shrink and merge-split processes. The network has edge density parameters $p_{\mathrm{in}} = 0.$[…], $p_{\mathrm{out}} = 0.05$, and minimum number of communities $q^{-} = 4$. The partition estimates are shown in Fig. 9(a). All of the mismatch occurs in the merge-split communities, which is consistent with our earlier results.

The set of sketches produced by the proposed algorithm is shown in Fig. 9(b). The sketch nodes are sorted vertically according to their planted communities, with their color indicating the estimated community of the corresponding node. The only deviation from the ideal sketch in Fig. 2(b) lies inside the merge-split communities, due to the errors present in the full graph.

The estimated partitions for (Dinh, 2009) are presented in Fig. 9(c). As with the earlier results, (Dinh, 2009) encounters difficulties in correctly identifying the small clusters in the grow-shrink and birth-death communities.

E. Effects of sketch size

We now show how the proposed algorithm performs on each benchmark when using different sketch sizes. To best illustrate the effects, we modify the algorithm to only perform tasks relevant to the benchmark being used (details of these modifications will be provided when discussing each result). For this section, we plot the un-normalized agreement [29]

$$A = \frac{1}{N} \max_{\pi} \sum_{\substack{u=1 \\ |C_u| > 0}}^{q} \left|C_u \cap \hat{C}_{\pi(u)}\right|, \qquad (27)$$

where $N$ is the total size of the graph. The agreement for the snapshot at time $t$ is denoted $A(t)$. We plot an ensemble average over 50 independent runs.

FIG. 8. Planted partitions for a double-stacked version of the merge-split benchmark are shown in (a). Panels (b) through (d) show the estimated partitions for each algorithm: (b) Proposed, (c) (Dinh, 2009), (d) ESPRA.

Figure 10(a) shows results for the birth-death benchmark example of Fig. 5(a) with $\gamma = 0.15$. When running the proposed algorithm, we remove steps 19-34. Furthermore, step 7 is changed to $C_u \leftarrow C'_u(t-1)$ […] $i \in V^{+}(t) \setminus V_{\mathrm{birth}}$. These changes ensure that only new nodes are assigned a community; existing nodes are left as-is and the merge-split detection is disabled. Recall that $n'$ is the number of nodes included in the sketch from each community. As $n'$ increases, the estimate in step 15 becomes more reliable as the variance of $s_{i,u}(t)$ falls [see (16)]. When $n' = 10$, the initial clustering has a large number of errors, and new nodes are consistently misclassified, resulting in a steady drop in agreement. For $n' = 20$, the initial partitioning is perfect. Nonetheless, the new nodes that are added after the first birth event are still consistently misclassified. When $n' = 40$ the reconstructed partitions are almost exactly correct, with $A(t) > 0.997$ for all time steps.

FIG. 9. Results for mixed benchmark. Estimated partitions produced by the proposed algorithm are shown for (a) the full network and (b) the sketches produced by the proposed algorithm. The estimated partitions produced by (Dinh, 2009) are shown in (c).

In Fig. 10(b), we show the results for the grow-shrink example of Fig. 6(a) with $f = 0.95$. We modify the proposed algorithm by removing steps 19-34 such that the algorithm only re-clusters individual nodes, without merging or splitting communities. For $n' = 20$, a large number of misclassifications occur around the extreme points when the community sizes are most imbalanced at $t = \tau/4$ and $t = 3\tau/4$. This occurs for the same reason as the misclassifications in the birth-death benchmark. As expected, as $n'$ increases, the misclassification rate drops.

In Fig. 10(c), we show results for the merge-split benchmark shown in Fig. 8(a). We modify the proposed algorithm by removing steps 8-14, and changing step 7 to $C_u \leftarrow C'_u(t-1)$ […] $n' = 25$, the estimates $\hat{p}_{\mathrm{in}}$ and $\hat{p}_{\alpha,\alpha'}$ used in step 30 become less reliable. This leads to more variation in the merge detection, as evidenced by the smoothed transition in the value of $A(t)$ (which is averaged over multiple runs). Similarly, the estimates of the eigenvalues used in step 23 are also less reliable, leading to similar variability in the split detection. As $n'$ increases, the merge and split events become more consistent, thus leading to a sharp transition for $n' = 125$. Note that for reasons described in Sec. VI C, there is a consistent misclassification of roughly half of the communities for $38 \le t \le 43$ and $57 \le t \le$ […]

FIG. 10. Effects of varying sketch size for each benchmark: (a) Birth-death ($n' = 10, 20, 40$), (b) Grow-shrink ($n' = 20, 25, 100$), (c) Merge-split ($n' = 25, 50, 125$). Horizontal axis: time ($t$).

VII. CONCLUSION

We have presented a sketch-based approach for community detection in time-evolving networks. We presented techniques for handling two existing fundamental processes: one involving growing and shrinking communities, and the other involving merging and splitting communities. We presented a third fundamental process involving the birth and death of communities, as well as techniques to handle these events. An algorithm was presented incorporating these techniques to handle concurrent processes.

Our approach is extendable to other graphs as well, for example the Degree Corrected SBM (DCSBM) [30]. This can be accomplished by substituting a suitable sampling technique for constructing DCSBM sketches, a new similarity definition between each node and the sketch communities, and an appropriate technique for determining whether clusters split or merge.

ACKNOWLEDGMENTS

This work was supported by NSF CAREER Award CCF-1552497. The University of Central Florida Advanced Research Computing Center provided computational resources that contributed to results reported herein.

[1] C. Aggarwal and K. Subbian, ACM Comput. Surv., 10:1 (2014).
[2] D. Greene, D. Doyle, and P. Cunningham, in (2010) pp. 176–183.
[3] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, Foundations and Trends in Databases, 1 (2011).
[4] M. Rahmani, A. Beckus, A. Karimian, and G. K. Atia, IEEE Transactions on Signal Processing, 962 (2020).
[5] A. Beckus and G. K. Atia, in Proc. IEEE 29th Int. Workshop Mach. Learn. Signal Process (2019) pp. 1–6.
[6] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E (2004), art. no. 066111.
[7] N. Dakiche, F. B.-S. Tayeb, Y. Slimani, and K. Benatchba, Inform. Process. Manag., 1084 (2019).
[8] S. Zhang and H. Zhao, Phys. Rev. E, 066114 (2012).
[9] J. Shang, L. Liu, X. Li, F. Xie, and C. Wu, Physica A, 70 (2016).
[10] C. Granell, R. K. Darst, A. Arenas, S. Fortunato, and S. Gómez, Phys. Rev. E, 012805 (2015).
[11] P. W. Holland, K. B. Laskey, and S. Leinhardt, Soc. Netw., 109 (1983).
[12] G. Rossetti and R. Cazabet, ACM Comput. Surv., 35:1 (2018).
[13] J. Hopcroft, O. Khan, B. Kulis, and B. Selman, P. Natl. Acad. Sci., 5249 (2004).
[14] T. N. Dinh, Ying Xuan, and M. T. Thai, in IEEE IPCCC (2009) pp. 161–168.
[15] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech., P10008 (2008).
[16] J. He and D. Chen, Physica A, 87 (2015).
[17] T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, Mach. Learn., 157 (2011).
[18] A. Ghasemian, P. Zhang, A. Clauset, C. Moore, and L. Peel, Phys. Rev. X, 031005 (2016).
[19] K. S. Xu and A. O. Hero, IEEE J. Sel. Topics Signal Process, 552 (2014).
[20] C. Matias and V. Miele, J. R. Stat. Soc. B, 1119 (2017).
[21] M. Pensky and T. Zhang, Electron. J. Statist., 678 (2019).
[22] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Phys. Rev. Lett., 065701 (2011).
[23] U. Von Luxburg, Stat. Comput., 395 (2007).
[24] A. Y. Ng, M. I. Jordan, and Y. Weiss, in Advances in Neural Information Processing Systems, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani (MIT Press, 2002) pp. 849–856.
[25] A. R. Benson, D. F. Gleich, and J. Leskovec, Science, 163 (2016).
[26] T. Aynaud, E. Fleury, J.-L. Guillaume, and Q. Wang, Communities in evolving networks: Definitions, detection, and analysis techniques, in Dynamics On and Of Complex Networks, Vol. 2 (Springer, 2013) pp. 159–200.
[27] P. Jaccard, New Phytologist, 37 (1912).
[28] P. Wang, L. Gao, and X. Ma, Journal of Statistical Mechanics: Theory and Experiment, 013401 (2017).
[29] E. Abbe, Found. Trends. Commun. Inform. Theor., 1 (2018).
[30] B. Karrer and M. E. J. Newman, Phys. Rev. E83