Optimal network online change point localisation
Yi Yu, Oscar Hernan Madrid Padilla, Daren Wang, Alessandro Rinaldo
Department of Statistics, University of Warwick; Department of Statistics, University of California, Los Angeles; Department of Statistics, University of Notre Dame; Department of Statistics & Data Science, Carnegie Mellon University

January 15, 2021
Abstract
We study the problem of online network change point detection. In this setting, a collection of independent Bernoulli networks is collected sequentially, and the underlying distributions change when a change point occurs. The goal is to detect the change point as quickly as possible, if it exists, subject to a constraint on the number or probability of false alarms. In this paper, we establish a minimax lower bound and two upper bounds on the detection delay, based on NP-hard algorithms and on polynomial-time algorithms, namely
$$\text{detection delay}\;\begin{cases} \gtrsim \dfrac{\log(1/\alpha)\max\{r^2/n,\, 1\}}{\kappa_0^2 n\rho}, & \\[1ex] \lesssim \dfrac{\log(\Delta/\alpha)\max\{r^2/n,\, \log(r)\}}{\kappa_0^2 n\rho}, & \text{with NP-hard algorithms}, \\[1ex] \lesssim \dfrac{\log(\Delta/\alpha)\, r}{\kappa_0^2 n\rho}, & \text{with polynomial-time algorithms}, \end{cases}$$
where $\kappa_0$, $n$, $\rho$, $r$ and $\alpha$ are the normalised jump size, network size, entrywise sparsity, rank sparsity and the overall Type-I error upper bound. All the model parameters are allowed to vary as $\Delta$, the location of the change point, diverges.

The polynomial-time algorithms are novel procedures that we propose in this paper, designed for quick detection under two different forms of Type-I error control. The first is based on controlling the overall probability of a false alarm when there are no change points, and the second is based on specifying a lower bound on the expected time of the first false alarm. Extensive experiments show that, under different scenarios and the aforementioned forms of Type-I error control, our proposed approaches outperform state-of-the-art methods.

Keywords: Dynamic networks, online change point detection, minimax optimality.
In this paper we are concerned with online change point detection in dynamic networks. To be specific, we observe a sequence of independent adjacency matrices $\{A(t),\, t = 1, 2, \ldots\}$, with $\mathbb{E}\{A(t)\} = \Theta(t)$, for $t \in \mathbb{N}^+$. If there exists $t^* \geq 2$ such that $\Theta(t^*) \neq \Theta(t^* - 1)$, then we call $t^*$ a change point. Our aim is to detect the existence of such change points as soon as they occur. On the other hand, if there is no change point, then we would like to avoid false alarms. To the best of our knowledge, this problem has not been theoretically studied in the existing statistical literature.

[Figure 1: Interaction networks based on the MIT cellphone data sets. A white dot means the corresponding row and column individuals interacted on the specific date. A red dot means the lack of interaction. Details are explained in Section 4.3.]

The problem we described above is an abstraction of various real-life problems. For instance, in cybersecurity, one monitors the internet or a system and wishes to detect malicious activity as early as it starts. In finance, regulatory authorities oversee the markets and aim to stop unlawful activities at an early stage. In epidemiology, public health agencies follow the spread of a contagious disease in a community and aim to learn about changes in the spreading pattern as they happen.

As a concrete example, we consider the Massachusetts Institute of Technology (MIT) cellphone data set (Eagle and Pentland, 2006). The data set consists of human interactions measured by the cellphone activity of the participants. There were 96 participants, including students and faculty members at MIT. The data were taken from 14-Sept-2004 to 5-May-2005.

We construct two experiments to evaluate our proposed methods and our competitors. In the first example, we use the data from 14-Sept-2004 to 15-Feb-2005, which cover the MIT winter recess starting on 22-Dec-2004 and ending on 3-Jan-2005. In our second example, we use the data from 1-Jan-2005 to 5-May-2005, which cover the spring recess starting on 26-Mar-2005 and ending on 3-Apr-2005. In Figure 1, we plot the interaction networks for a few representative dates. A white dot means the corresponding row and column individuals interacted on the specific date, while a red dot means the lack of interaction. For these two examples, our proposed method detects change points at 27-Dec-2004 and 31-Mar-2005, respectively. Our competitors' change point estimators are around 30-Jan-2005 and 6-Apr-2005, respectively. Our method is clearly the best at detecting the winter and spring recess periods. Numerical details are explained in Section 4.3.

Due to the aforementioned real-world applications, change point detection problems have been intensively studied in the literature, though not necessarily in the dynamic networks context. In terms of online change point detection, i.e., making sequential decisions about the existence of change points while collecting data, Lorden (1971), Moustakides (1986), Ritov (1990), Lai (1981), Lai (1998), Lai (2001), Chu et al. (1996), Aue and Horváth (2004), Kirch (2008), Madrid Padilla et al. (2019) and Yu et al. (2020) studied univariate sequences; He et al. (2018) focused on a sequence of random graphs; Chen (2019) and Dette and Gösmann (2019) allowed for more general scenarios, including nonparametric models; Chen et al. (2020) and Keshavarz et al. (2018) studied high-dimensional Gaussian vectors. In terms of offline change point detection, i.e., seeking change points retrospectively after a sequence of data has been collected, a wide range of models have been studied. The most closely related work is Wang et al. (2018), where a sequence of adjacency matrices was considered. Further comparisons with the existing literature will be provided as we unfold our results.
The contributions of this paper are summarised below.

• To the best of our knowledge, this is the first time that online change point detection is formally analysed in a sequence of adjacency matrices, allowing all model parameters to vary as functions of $\Delta$.

• We establish minimax lower bounds on the detection delay. To the best of our knowledge, in the statistical networks literature, a lower bound involving the rank parameter has only been established in estimation problems (e.g. Gao et al., 2015), but not in the context of testing, let alone change point detection. Our lower bound matches, up to a logarithmic factor, an upper bound derived based on an NP-hard algorithm.

• In addition, we propose a computationally-efficient network online change point detection method, which comes with two variants corresponding to two different Type-I error controlling strategies. Extensive numerical results are provided to evaluate the performance of our proposed methods against state-of-the-art competitors. We also discuss tuning parameter selection aspects of our approaches.

Throughout this paper, we adopt the following notation. For any matrix $M \in \mathbb{R}^{m_1 \times m_2}$, let $\|M\|_F$ and $\|M\|_\infty = \max_{i=1}^{m_1}\max_{j=1}^{m_2}|M_{ij}|$ be the Frobenius norm and the entry-wise supremum norm of $M$, respectively. For any two matrices $A, B \in \mathbb{R}^{m_1 \times m_2}$, let $\langle A, B\rangle = \mathrm{tr}(A^\top B)$ be the Frobenius inner product of the two matrices.

Since our data are a sequence of adjacency matrices, our first task is to formally define the networks at every time point.
Definition 1 (Inhomogeneous Bernoulli networks). A network with node set $\{1, \ldots, n\}$ is an inhomogeneous Bernoulli network if its adjacency matrix $A \in \mathbb{R}^{n \times n}$ satisfies
$$A_{ij} = A_{ji} = \begin{cases} 1, & \text{nodes } i \text{ and } j \text{ are connected by an edge}, \\ 0, & \text{otherwise}; \end{cases}$$
and $\{A_{ij},\, i < j\}$ are independent Bernoulli random variables with $\mathbb{E}(A_{ij}) = \Theta_{ij}$. We refer to the matrix $\Theta$ as the graphon matrix.

This general definition includes popular network models as special cases, but it is not restricted to any specific model. Note that we allow every random variable that corresponds to an integer pair $(i, j)$, $1 \leq i < j \leq n$, to have its own mean, i.e. $\mathbb{E}(A_{ij}) = \Theta_{ij}$.

Remark 1.
Despite the flexibility it enjoys, Definition 1 is also subject to a number of restrictions. First, each random variable is assumed to be Bernoulli, which is a sub-Gaussian random variable. However, our framework can be extended to handle Poisson random variables: the algorithms we propose in the sequel apply unchanged, and the theoretical results can be adjusted by using sub-exponential concentration inequalities. Second, the adjacency matrices are assumed to be symmetric in Definition 1. All the results in this paper can be extended straightforwardly to asymmetric cases, corresponding to directed networks. Finally, the model does not allow for dependence between entries of the adjacency matrix.
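To make Definition 1 concrete, an inhomogeneous Bernoulli network can be sampled from a given graphon matrix as in the following sketch (Python with NumPy; the function name is ours, not the paper's):

```python
import numpy as np

def sample_bernoulli_network(theta, rng=None):
    """Sample a symmetric adjacency matrix A with E[A_ij] = theta_ij for i != j.

    `theta` is an n x n symmetric graphon matrix with entries in [0, 1].
    """
    rng = np.random.default_rng(rng)
    n = theta.shape[0]
    # Draw the upper triangle independently, then symmetrise.
    upper = rng.random((n, n)) < theta
    iu = np.triu_indices(n, k=1)
    A = np.zeros((n, n), dtype=int)
    A[iu] = upper[iu].astype(int)
    A = A + A.T  # symmetric, zero diagonal
    return A

# An Erdos-Renyi-type special case: constant connection probability 0.4.
theta = np.full((5, 5), 0.4)
np.fill_diagonal(theta, 0.0)
A = sample_bernoulli_network(theta, rng=0)
```

Only the upper triangle is drawn, matching the independence assumption on $\{A_{ij},\, i < j\}$ in Definition 1.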
In order to detect the change points of $\{\Theta(t),\, t = 1, 2, \ldots\}$, we adopt the network CUSUM statistic, which originates in the univariate CUSUM statistic of Page (1954) and was first formally stated in Wang et al. (2018).

Definition 2. Given a sequence of matrices $\{A(t)\}_{t=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, we define the corresponding online CUSUM statistics as
$$\widehat{A}_{s,t} = \sqrt{\frac{t-s}{st}}\sum_{l=1}^{s} A(l) - \sqrt{\frac{s}{(t-s)t}}\sum_{l=s+1}^{t} A(l),$$
for all integer pairs $(s, t)$, $t \geq 2$ and $s \in [1, t)$.

With the notation introduced in Definition 2, we have for any integer pair $(s, t)$, $1 \leq s < t$,
$$\mathbb{E}(\widehat{A}_{s,t}) = \widehat{\Theta}_{s,t} = \sqrt{\frac{t-s}{st}}\sum_{l=1}^{s} \Theta(l) - \sqrt{\frac{s}{(t-s)t}}\sum_{l=s+1}^{t} \Theta(l).$$

Algorithm 2 is our main procedure, with a subroutine detailed in Algorithm 1 and a variant in Algorithm 3. Both Algorithms 2 and 3 are written in such a way that they will not stop if no change point is detected. In practice, they can be terminated either by the user or when there are no more new data.

To motivate our algorithms, we first investigate the statistics in Definition 2. These are linear combinations of all adjacency matrices up to time point $t \geq 2$. To be specific, for $1 \leq s < t$, the corresponding statistic is a scaled difference between the sample means before and after time point $s$. In the univariate online change point detection problem (e.g. Yu et al., 2020), one scans through all possible integer pairs $(s, t)$, $1 \leq s < t$. In Algorithm 2, we propose a more efficient procedure that avoids scanning through all $s \in [1, t)$.

For every time point $t \geq 2$, in Algorithm 2, we only consider $s \in \mathcal{S}(t)$ as candidate change points, where $\mathcal{S}(t) = \{t - 2^j,\, j = 0, 1, \ldots, \lfloor \log(t)/\log(2)\rfloor - 1\}$ is a set of geometric-scale grid points. Once the criteria
$$\|\widetilde{B}_{s,t}\|_F > C\log^{1/2}(t/\alpha) \quad\text{and}\quad \langle \widehat{A}_{s,t},\, \widetilde{B}_{s,t}/\|\widetilde{B}_{s,t}\|_F\rangle > b_t \qquad (1)$$
are met, we declare that there exists a change point at $t$.

The criteria in (1) are constructed based on two independent samples $\{A(t)\}$ and $\{B(t)\}$. In practice, one can achieve this by splitting the data into odd- and even-indexed subsamples. For each integer pair $(s, t)$, the quantity $\widetilde{B}_{s,t}$ is a function of the CUSUM statistic $\widehat{B}_{s,t}$, obtained by the subroutine Algorithm 1. In fact, $\widetilde{B}_{s,t}$ is a universal singular value thresholding (USVT) estimator. The USVT algorithm was proposed in Chatterjee (2015) for the purpose of estimating low-rank, high-dimensional, sparse matrices.

The criteria (1) have two components. We first check that the USVT estimator norm $\|\widetilde{B}_{s,t}\|_F$ is large enough. In theory, this is required to obtain a near-optimal detection delay; intuitively speaking, a change point should only be declared if $\|\widetilde{B}_{s,t}\|_F$ is large. Provided that this criterion holds, we then check that the matrix inner product $\langle \widehat{A}_{s,t},\, \widetilde{B}_{s,t}/\|\widetilde{B}_{s,t}\|_F\rangle$ is large enough. The data splitting is required because, for any Bernoulli random variable $X$, $X^2 = X$; therefore, in order to detect the change in terms of the Frobenius norm, data splitting helps to estimate the squared means of Bernoulli random variables.

In view of the whole procedure, there is a sequence of tuning parameters. The tuning parameters $\{\tau_{j,s,u},\, j = 1, 2,\, u = 1, 2, \ldots,\, s \in \mathcal{S}(u)\}$ are used in the subroutine Algorithm 1. As suggested in Xu (2018), the parameters $\tau_{1,\cdot,\cdot}$ serve as cutoffs of the upper bound on sample fluctuations, and the parameters $\tau_{2,\cdot,\cdot}$ are chosen to be the entry-wise maximum norms of the matrices of interest.
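The ingredients just described, the network CUSUM of Definition 2, the USVT subroutine of Algorithm 1 and the criteria in (1), can be sketched as follows. This is our own simplified Python/NumPy illustration, not the paper's code: the thresholds `b_t`, `c_t` (playing the role of $C\log^{1/2}(t/\alpha)$), `tau1` and `tau2` are passed in as plain numbers rather than set to their theoretical values.

```python
import numpy as np

def cusum(A_list, s, t):
    """Online network CUSUM statistic of Definition 2 for the pair (s, t)."""
    A = np.stack(A_list[:t]).astype(float)  # A[l-1] corresponds to A(l)
    left = np.sqrt((t - s) / (s * t)) * A[:s].sum(axis=0)
    right = np.sqrt(s / ((t - s) * t)) * A[s:t].sum(axis=0)
    return left - right

def usvt(M, tau1, tau2):
    """Universal singular value thresholding (Algorithm 1) for symmetric M."""
    lam, V = np.linalg.eigh(M)
    keep = np.abs(lam) >= tau1
    M1 = (V[:, keep] * lam[keep]) @ V[:, keep].T  # keep only large eigenvalues
    return np.clip(M1, -tau2, tau2)  # entry-wise truncation at level tau2

def detection_step(A_list, B_list, t, b_t, c_t, tau1, tau2):
    """One pass over the geometric grid S(t); True if a change point is declared."""
    J = int(np.floor(np.log2(t)))
    for j in range(J):
        s = t - 2 ** j
        B_tilde = usvt(cusum(B_list, s, t), tau1, tau2)
        norm = np.linalg.norm(B_tilde)  # Frobenius norm
        # The two criteria in (1); short-circuiting avoids dividing by zero.
        if norm > c_t and np.sum(cusum(A_list, s, t) * B_tilde) / norm > b_t:
            return True
    return False
```

The outer loop of Algorithm 2 simply calls `detection_step` for $t = 2, 3, \ldots$ until it returns `True`.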
The tuning parameter $\alpha \in (0, 1)$ is the tolerance of Type-I errors and acts as an upper bound on the probability of returning at least one false alarm. The thresholds $\{b_t\}$ are upper bounds on the inner products when there is no change point. More detailed discussions and guidance on tuning parameter selection are provided in Sections 3 and 4.

Algorithm 1
Universal Singular Value Thresholding. USVT$(A, \tau_1, \tau_2)$
INPUT: Symmetric matrix $A \in \mathbb{R}^{n\times n}$, $\tau_1, \tau_2 > 0$.
  $(\lambda_i, v_i) \leftarrow$ the $i$th eigen-pair of $A$, with $|\lambda_1| \geq \cdots \geq |\lambda_n|$;
  $A_1 \leftarrow \sum_{i:\, |\lambda_i| \geq \tau_1} \lambda_i v_i v_i^\top$;
  $A_2 \leftarrow$ a matrix with $(i, j)$th entry $(A_2)_{ij}$ satisfying
  $$(A_2)_{ij} = \begin{cases} (A_1)_{ij}, & |(A_1)_{ij}| \leq \tau_2, \\ \mathrm{sign}((A_1)_{ij})\,\tau_2, & |(A_1)_{ij}| > \tau_2. \end{cases}$$
OUTPUT: $A_2$.

In addition, we also present a variant of Algorithm 2 in Algorithm 3. The main difference between these two algorithms is that the tuning parameter $\alpha \in (0, 1)$ is replaced by $\gamma \in \mathbb{N}^+$ in Algorithm 3, and the inputs are changed correspondingly. These two algorithms represent two popular ways of controlling Type-I errors. The tuning parameter $\gamma$ is in fact a lower bound on the average run length; in other words, a choice of $\gamma$ implies that, when there is no change point, the expected time of the first false alarm is at least $\gamma$.

There is no algorithmic difference between Algorithms 2 and 3. Their theoretical differences will be explained in Section 3, and the tuning parameter selection differences will be discussed in Section 4.

Algorithm 2 Network online change point detection
INPUT: $\{A(u), B(u)\}_{u=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, $\{b_u, \tau_{1,s,u}, \tau_{2,s,u},\, u = 1, 2, \ldots,\, s = 1, \ldots, u\} \subset \mathbb{R}$, $\alpha \in (0, 1)$.
$t \leftarrow 1$; FLAG $\leftarrow 0$;
while FLAG $= 0$ do
  $t \leftarrow t + 1$; $J \leftarrow \lfloor\log(t)/\log(2)\rfloor$; $j \leftarrow 0$;
  while $j < J$ and FLAG $= 0$ do
    $s_j \leftarrow t - 2^j$;
    $\widetilde{B}_{s_j,t} \leftarrow \mathrm{USVT}(\widehat{B}_{s_j,t}, \tau_{1,s_j,t}, \tau_{2,s_j,t})$;
    FLAG $\leftarrow \mathbf{1}\{\langle \widehat{A}_{s_j,t},\, \widetilde{B}_{s_j,t}/\|\widetilde{B}_{s_j,t}\|_F\rangle > b_t\}\,\mathbf{1}\{\|\widetilde{B}_{s_j,t}\|_F > C\log^{1/2}(t/\alpha)\}$;
    $j \leftarrow j + 1$;
  end while
end while
OUTPUT: $t$.

Algorithm 3
Network online change point detection – a variant
INPUT: $\{A(u), B(u)\}_{u=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, $\{b_u, \tau_{1,s,u}, \tau_{2,s,u},\, u = 1, 2, \ldots,\, s = 1, \ldots, u\} \subset \mathbb{R}$, $\gamma \in \mathbb{N}^+$.
$t \leftarrow 1$; FLAG $\leftarrow 0$;
while FLAG $= 0$ do
  $t \leftarrow t + 1$; $J \leftarrow \lfloor\log(t)/\log(2)\rfloor$; $j \leftarrow 0$;
  while $j < J$ and FLAG $= 0$ do
    $s_j \leftarrow t - 2^j$;
    $\widetilde{B}_{s_j,t} \leftarrow \mathrm{USVT}(\widehat{B}_{s_j,t}, \tau_{1,s_j,t}, \tau_{2,s_j,t})$;
    FLAG $\leftarrow \mathbf{1}\{\langle \widehat{A}_{s_j,t},\, \widetilde{B}_{s_j,t}/\|\widetilde{B}_{s_j,t}\|_F\rangle > b_t\}\,\mathbf{1}\{\|\widetilde{B}_{s_j,t}\|_F > C\log^{1/2}(\gamma)\}$;
    $j \leftarrow j + 1$;
  end while
end while
OUTPUT: $t$.

This section consists of all the theoretical results we develop in this paper, with all the technical details deferred to the Appendix. This section is organised as follows. All the assumptions are stated and discussed in Section 3.1. The theoretical guarantees of Algorithms 2 and 3 are provided in Section 3.2. To investigate the fundamental limits, we establish a minimax lower bound on the detection delay in Section 3.3, together with a nearly minimax optimal NP-hard procedure studied in Section 3.4. To conclude, we provide additional discussion through comparisons with existing work in Section 3.5.

3.1 Assumptions
Before arriving at our main results, we start by introducing some assumptions. The following three assumptions introduce the sparsity parameter, and describe the one change point and no change point scenarios, respectively.
Assumption 1. Assume that $\{A(1), A(2), \ldots\} \subset \mathbb{R}^{n\times n}$ is a sequence of inhomogeneous Bernoulli networks satisfying $\mathbb{E}\{A(i)\} = \Theta(i) \in \mathbb{R}^{n\times n}$, $i = 1, 2, \ldots$, and $\sup_{i=1,2,\ldots}\|\Theta(i)\|_\infty = \rho$, where $\rho n \geq \log(n)$.

Assumption 2 (One change point scenario). Assume that there exists $\Delta \in \mathbb{N}^*$ such that $\Theta(1) = \cdots = \Theta(\Delta) = \Theta_1$ and $\Theta(\Delta+1) = \Theta(\Delta+2) = \cdots = \Theta_2$. In addition, let
$$\kappa_0 = \frac{\kappa}{n\rho} = \frac{\|\Theta_1 - \Theta_2\|_F}{n\rho} > 0 \quad\text{and}\quad r = \mathrm{rank}(\Theta_1 - \Theta_2).$$

Assumption 3 (No change point scenario). Assume that $\Theta(1) = \Theta(2) = \cdots = \Theta_1$.

In the one change point scenario, in view of Assumptions 1 and 2, we see that the change point detection problem is characterised by the following parameters: the network size $n$, the entry-wise sparsity parameter $\rho$, the size of the uncontaminated sample $\Delta$, the normalised jump size $\kappa_0$, and the low-rank parameter $r$.

The jump size $\kappa$ is defined to be the Frobenius norm of the difference between two consecutive but distinct graphon matrices. The choice of the Frobenius norm is tailored to the context of dynamic networks. Arguably, the most popular statistical network model is the stochastic block model (Holland et al., 1983). If both $\Theta_1$ and $\Theta_2$ are graphon matrices of stochastic block models, then the Frobenius norm can explicitly reflect the magnitude of the change, as compared to other matrix norms including the operator norm and the supremum norm. For instance, if the community structure stays unchanged, but the connection probability changes from $p_1$ to $p_2$ in a community of size $n/2$, then $\kappa = n|p_1 - p_2|/2$. If the between- and within-community probabilities, $p_1$ and $p_2$, remain the same, but the community structure changes from a balanced 2-community network to a balanced 3-community network, then $\kappa = c_0\, n|p_1 - p_2|$, where $c_0 \in (0, 1)$ is an absolute constant determined by the two partitions. Since $\kappa \in (0, n\rho]$, the normalised jump size $\kappa_0$ is scale free and satisfies $\kappa_0 \in (0, 1]$.

Without further restrictions, the low-rank parameter $r$ is allowed to take any value in $\{1, \ldots, n\}$. Note that the parameter $r$ is imposed on the difference matrix, and we allow for arbitrary structure of each graphon per se.

Recall that our missions are as follows. When there is a change point, we wish to declare its existence as soon as it appears. The distance between the change point estimator and the change point is called the detection delay, which is to be minimised. On the other hand, it is also vital to control the false alarms. When there is no change point, we either control the probability of declaring change points, or the expected time of the first false alarm. These two ways of controlling false alarms correspond to Algorithms 2 and 3, whose theoretical guarantees are provided in Theorem 1 and Corollary 2, respectively. In the results presented in this section, we assume the existence of two independent sequences of adjacency matrices. In practice, this can be achieved by splitting the data sequence into odd- and even-indexed subsequences.
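As a sanity check on the first example above, the Frobenius norm computation can be written out explicitly (ignoring the diagonal entries, which only contribute lower-order terms):

```latex
\[
\Theta_1 - \Theta_2 =
\begin{pmatrix}
(p_1 - p_2)\,\mathbf{1}\mathbf{1}^\top & 0 \\
0 & 0
\end{pmatrix},
\qquad
\kappa = \|\Theta_1 - \Theta_2\|_F
= \sqrt{\Big(\frac{n}{2}\Big)^2 (p_1 - p_2)^2}
= \frac{n\,|p_1 - p_2|}{2},
\]
```

where the nonzero block is of size $(n/2) \times (n/2)$ and $\mathbf{1} \in \mathbb{R}^{n/2}$ is the all-ones vector. The difference matrix is rank one, so this example also attains the smallest possible value of the low-rank parameter $r$.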
Theorem 1.
For any $\alpha \in (0, 1)$, assume that the data $\{A(t), B(t)\}_{t=1,2,\ldots}$ are two independent sequences of adjacency matrices satisfying Assumption 1. Let $\widehat{t}$ be the output of Algorithm 2 with inputs $\{A(t), B(t)\}_{t=1,2,\ldots}$,
$$b_u = C\sqrt{\rho\log\Big(\frac{u}{\alpha}\Big)}, \quad \tau_{1,s,u} = C\left[\sqrt{n\rho} + \sqrt{\log\left\{\frac{u(u+1)\log(u)}{\alpha\log(2)}\right\}}\,\right] \quad\text{and}\quad \tau_{2,s,u} = \sqrt{\frac{(u-s)s}{u}}\,\rho.$$

(i) (No change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 3, then it holds that
$$\mathbb{P}\left\{\bigcap_{m\in\mathbb{N}^+}\{\widehat{t} > m\}\right\} > 1 - \alpha.$$

(ii) (One change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 2, and there exists a large enough absolute constant $C_{\mathrm{SNR}} > 0$ such that
$$\Delta\kappa_0^2 n\rho > C_{\mathrm{SNR}}\, r\log(\Delta/\alpha), \qquad (2)$$
then
$$\mathbb{P}\left\{0 < \widehat{t} - \Delta < C_d\,\frac{r n\rho\log(\Delta/\alpha)}{\kappa^2} = C_d\,\frac{r\log(\Delta/\alpha)}{\kappa_0^2 n\rho}\right\} > 1 - \alpha,$$
where $C, C_d > 0$ are absolute constants.

We can see from Theorem 1(i) that, when there is no change point, with probability at least $1 - \alpha$, Algorithm 2 will not raise any false alarm. On the other hand, if there is a change point, then it follows from Theorem 1(ii) that the detection delay is at most of order
$$\frac{r\log(\Delta/\alpha)}{\kappa_0^2 n\rho},$$
with probability at least $1 - \alpha$.

In fact, condition (2) can be regarded as a signal-to-noise ratio condition and is a mild constraint. We list a few special cases here.

• (Small sample size.) If $\kappa_0, \rho, r, \alpha \asymp 1$, then (2) holds as long as $\Delta \gtrsim \log(\Delta)/n$. If the network size $n$ is large, this shows that Algorithm 2 can detect change points with a very small number of uncontaminated samples.

• (Large rank matrices.) If $\kappa_0 \asymp 1$, $\rho \asymp \log(n)/n$, $\alpha \asymp 1$ and $\Delta \asymp n$, then the rank parameter is allowed to be $r \asymp n$. This means that, provided the size of the uncontaminated sample is comparable with the size of the networks, it is not necessary to impose a low-rank assumption on the difference of the graphons.

• (Small jump size.) If $\rho, r, \Delta, \alpha \asymp 1$, then condition (2) holds provided that $\kappa_0 \gtrsim n^{-1/2}$. This means that the normalised jump size can decrease to zero as the size of the network diverges.

Finally, we remark on the choices of the tuning parameters. As we have mentioned in Section 2, the tuning parameters $\{\tau_{1,s,u}\}$ are the cutoffs due to the low-rank structure and $\{\tau_{2,s,u}\}$ bound the entry-wise maximum norms. The theoretical choices of these two sets of parameters can be found in Lemma 9, with the aim of ensuring the good performance of the USVT estimators. The tuning parameter $\alpha$ is completely determined by practitioners, reflecting the tolerance of Type-I errors. The sequence $\{b_u\}$ reflects an upper bound on the statistics' fluctuations when there is no change point; the rate of $\{b_u\}$ is determined in Lemma 8.

Corollary 2.
For any $\gamma > 1$, assume that the data $\{A(t), B(t)\}_{t=1,2,\ldots}$ are two independent sequences of adjacency matrices satisfying Assumption 1. Let $\widehat{t}$ be the output of Algorithm 3 with inputs $\{A(t), B(t)\}_{t=1,2,\ldots}$,
$$b_u = C\sqrt{\rho\log(\gamma)}, \quad \tau_{1,s,u} = C\left[\sqrt{n\rho} + \sqrt{\log\left\{\frac{\gamma(\gamma+1)\log(\gamma+1)}{\log(2)}\right\}}\,\right] \quad\text{and}\quad \tau_{2,s,u} = \sqrt{\frac{(u-s)s}{u}}\,\rho.$$

(i) (No change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 3, then $\mathbb{E}(\widehat{t}\,) \geq \gamma$, where $\mathbb{E}(\widehat{t}\,)$, under Assumption 3, is called the average run length.

(ii) (One change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 2, and it holds that
$$\gamma \geq \Delta \quad\text{and}\quad \Delta\kappa_0^2 n\rho > C_{\mathrm{SNR}}\, r\log(\gamma), \qquad (3)$$
where $C_{\mathrm{SNR}} > 0$ is a large enough absolute constant, then
$$\mathbb{P}\left\{0 < \widehat{t} - \Delta < C_d\,\frac{r n\rho\log(\gamma)}{\kappa^2} = C_d\,\frac{r\log(\gamma)}{\kappa_0^2 n\rho}\right\} > 1 - \gamma^{-1}, \qquad (4)$$
where $C, C_d > 0$ are absolute constants.

Corollary 2 is Theorem 1's counterpart based on Algorithm 3. Comparing Theorem 1(i) and Corollary 2(i) shows that Algorithms 2 and 3 control false alarms via different strategies. Theorem 1(i) shows that the Type-I error across the whole time horizon is upper bounded by $\alpha$ if Algorithm 2 is deployed. In contrast, Corollary 2(i) ensures that, if Algorithm 3 is used, then the expected time of the first false alarm is at least $\gamma$.

Both of these ways of controlling false alarms are widely used in the literature. Theorem 1 and Corollary 2 show that, if
$$\gamma \geq \Delta \quad\text{and}\quad \gamma \asymp \Delta/\alpha, \qquad (5)$$
then the two methods provide the same order of detection delay. If $\gamma < \Delta$, then the same localisation rate as in Corollary 2 holds for $\max\{\widehat{t} - \Delta, 0\}$ instead of $\widehat{t} - \Delta$.

Finally, we summarise the differences between Algorithms 2 and 3. Since both $\alpha$ and $\gamma$ reflect the preferences on Type-I error control, these two tuning parameters can be specified by the users. Although we have provided theoretical guidance on all the other tuning parameters, in practice they still involve either unknown quantities or unspecified constants. In order to tune these parameters, Algorithm 3 might be handier: one may have access to historical data from the pre-change-point distribution, and tune all the parameters such that the average run length is $\gamma$. This is less natural under the strategy of Algorithm 2, unless the whole time course has a pre-specified endpoint, since the Type-I error is controlled across the whole time course. This is in fact how we tune the parameters for Algorithm 2 in Section 4.

In Theorem 1, we showed that we are able to detect change points with the order of the detection delay upper bounded by
$$\frac{r\log(\Delta/\alpha)}{\kappa_0^2 n\rho}. \qquad (6)$$
In this subsection, we investigate the optimality of this upper bound.

Proposition 3.
Assume that $\{A(t)\}_{t=1,2,\ldots}$ is a sequence of independent adjacency matrices satisfying Assumptions 1 and 2. Denote the joint distribution of $\{A(t)\}_{t=1,2,\ldots}$ by $P_{\kappa,\Delta}$. Consider the class of estimators
$$\mathcal{D} = \{T:\, T \text{ is a stopping time and satisfies } \mathbb{P}_\infty(T < \infty) \leq \alpha\},$$
where $\mathbb{P}_\infty$ indicates $\Delta = \infty$. Then for sufficiently small $\alpha \in (0, 1)$, there exists an absolute constant $c > 0$ such that
$$\inf_{\widehat{t}\,\in\mathcal{D}}\;\sup_{P_{\kappa,\Delta}}\;\mathbb{E}_P\big\{(\widehat{t} - \Delta)_+\big\} \geq c\,\frac{\log(1/\alpha)}{\kappa_0^2 n\rho}\max\left\{1,\, \frac{r^2}{n}\right\}.$$

The change point estimators considered here are all stopping times whose overall Type-I error is controlled by $\alpha \in (0, 1)$. The rate of the detection delay is lower bounded by
$$\frac{\log(1/\alpha)}{\kappa_0^2 n\rho}\max\left\{1,\, \frac{r^2}{n}\right\}.$$
This means that in the low-rank regime $r \lesssim \sqrt{n}$, we have the lower bound $\log(1/\alpha)(\kappa_0^2 n\rho)^{-1}$; in the large-rank regime $r \gtrsim \sqrt{n}$, the lower bound is of the order $\log(1/\alpha)\, r^2(\kappa_0^2 n^2\rho)^{-1}$. In view of Proposition 3, we see that (6) is nearly optimal, saving for a logarithmic factor, only in the extreme regimes, i.e. $r \asymp n$ or $r \asymp 1$. Recall from Gao et al. (2015) that the minimax rate of estimating a graphon $\Theta \in \mathbb{R}^{n\times n}$ is
$$\inf_{\widehat{\Theta}}\;\sup_{\Theta}\;\mathbb{E}\left\{\frac{1}{n^2}\big\|\widehat{\Theta} - \Theta\big\|_F^2\right\} \asymp \frac{r^2 + n\log(r)}{n^2},$$
where the upper bound is achieved by an NP-hard algorithm. In fact, we can also adopt NP-hard procedures to match the lower bound in Proposition 3, up to logarithmic factors.

3.4 An NP-hard procedure on stochastic block models

In this subsection, we first focus on the stochastic block models (Holland et al., 1983).
Definition 3 (Sparse stochastic block model). A network is from a sparse stochastic block model with size $n$, sparsity parameter $\rho$, membership matrix $Z \in \mathbb{R}^{n\times r}$ and connectivity matrix $Q \in [0, 1]^{r\times r}$ if the corresponding adjacency matrix satisfies
$$\mathbb{E}(A) = \rho ZQZ^\top - \mathrm{diag}\big(\rho ZQZ^\top\big).$$
The membership matrix $Z$ consists of $n$ rows, each of which has one and only one entry equal to 1 and all other entries equal to 0; moreover, $Z$ has full column rank, i.e. $\mathrm{rank}(Z) = r$. The sparsity parameter $\rho \in [0, 1]$ potentially depends on $n$.

As we have already pointed out, stochastic block models are special cases of the inhomogeneous Bernoulli networks defined in Definition 1.

In Gao et al. (2015), an NP-hard estimator of stochastic block models' graphons was proposed. In this subsection, we replace the USVT estimator, defined in Algorithm 1 and used in Algorithm 2, by the NP-hard estimator studied in Gao et al. (2015). For completeness, we include the estimator construction below.
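The mean matrix of Definition 3 can be sketched as follows (Python with NumPy; the function name and the label-vector parameterisation of $Z$ are ours):

```python
import numpy as np

def sbm_graphon(z, Q, rho):
    """Mean matrix E[A] = rho * Z Q Z^T - diag(rho * Z Q Z^T) from Definition 3.

    `z` is a length-n vector of community labels in {0, ..., r-1},
    `Q` is an r x r symmetric connectivity matrix with entries in [0, 1].
    """
    z = np.asarray(z)
    n, r = len(z), Q.shape[0]
    Z = np.zeros((n, r))
    Z[np.arange(n), z] = 1.0  # one-hot membership matrix, full column rank
    theta = rho * Z @ Q @ Z.T
    np.fill_diagonal(theta, 0.0)  # subtract the diagonal: no self-loops
    return theta

# A balanced 2-community example: within-community probability 0.8 * rho,
# between-community probability 0.2 * rho.
Q = np.array([[0.8, 0.2], [0.2, 0.8]])
theta = sbm_graphon([0, 0, 1, 1], Q, rho=0.5)
```

Feeding `theta` to a Bernoulli sampler yields adjacency matrices from the sparse stochastic block model.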
Definition 4 (An NP-hard graphon estimator). For any positive integers $n$ and $r$, $r \leq n$, let $\mathcal{Z}_{n,r} = \{z:\, \{1, \ldots, n\} \to \{1, \ldots, r\}\}$ be the collection of all possible mappings from $\{1, \ldots, n\}$ to $\{1, \ldots, r\}$. Given an adjacency matrix $A = (A_{ij}) \in \mathbb{R}^{n\times n}$, any $z \in \mathcal{Z}_{n,r}$ and any $Q = (Q_{ab}) \in \mathbb{R}^{r\times r}$, define the objective function
$$L(Q, z) = \sum_{a,b\in\{1,\ldots,r\}}\;\sum_{\substack{(i,j)\in z^{-1}(a)\times z^{-1}(b) \\ i\neq j}} (A_{ij} - Q_{ab})^2.$$
For any optimiser of the objective function
$$(\widehat{Q}, \widehat{z}) \in \arg\min_{Q\in\mathbb{R}^{r\times r},\, z\in\mathcal{Z}_{n,r}} L(Q, z),$$
the estimator is defined as $\breve{\Theta} = (\breve{\Theta}_{ij})_{i,j=1}^n \in \mathbb{R}^{n\times n}$, with $\breve{\Theta}_{ij} = \breve{\Theta}_{ji} = \widehat{Q}_{\widehat{z}_i\widehat{z}_j}$, $i > j$, and $\breve{\Theta}_{ii} = 0$. For notational simplicity, we write $\breve{\Theta} = \mathrm{NP}(A, r)$.

The new procedure for change point detection replaces the USVT subroutine in Algorithm 2 with the estimator detailed in Definition 4. We present the full algorithm below.
Corollary 4.
For any $\alpha \in (0, 1)$, assume that the data $\{A(t), B(t)\}_{t=1,2,\ldots}$ are two independent sequences of adjacency matrices satisfying Assumption 1. Let $\widehat{t}$ be the output of Algorithm 4 with inputs $\{A(t), B(t)\}_{t=1,2,\ldots}$ and $b_u = C\sqrt{\rho\log(u/\alpha)}$.

Algorithm 4 Network online change point detection - NP-hard
INPUT: $\{A(u), B(u)\}_{u=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, $\{b_u,\, u = 1, 2, \ldots\} \subset \mathbb{R}$, $\alpha \in (0, 1)$, $r_0 \in \mathbb{N}^*$.
$t \leftarrow 1$; FLAG $\leftarrow 0$;
while FLAG $= 0$ do
  $t \leftarrow t + 1$; $J \leftarrow \lfloor\log(t)/\log(2)\rfloor$; $j \leftarrow 0$;
  while $j < J$ and FLAG $= 0$ do
    $s_j \leftarrow t - 2^j$;
    $\breve{B}_{s_j,t} \leftarrow \mathrm{NP}(\widehat{B}_{s_j,t}, r_0)$;
    FLAG $\leftarrow \mathbf{1}\{\langle \widehat{A}_{s_j,t},\, \breve{B}_{s_j,t}/\|\breve{B}_{s_j,t}\|_F\rangle > b_t\}\,\mathbf{1}\{\|\breve{B}_{s_j,t}\|_F > C\log^{1/2}(t/\alpha)\}$;
    $j \leftarrow j + 1$;
  end while
end while
OUTPUT: $t$.

(i) (No change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 3, then
$$\mathbb{P}\left\{\bigcap_{m\in\mathbb{N}^+}\{\widehat{t} > m\}\right\} > 1 - \alpha.$$

(ii) (One change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 2, the input $r_0 \geq r$, and there exists a large enough absolute constant $C_{\mathrm{SNR}} > 0$ such that
$$\Delta\kappa_0^2 n^2\rho > C_{\mathrm{SNR}}\{r^2 + n\log(r)\}\log(\Delta/\alpha),$$
then
$$\mathbb{P}\left[0 < \widehat{t} - \Delta < C_d\,\frac{\{r^2 + n\log(r)\}\rho\log(\Delta/\alpha)}{\kappa^2} = C_d\,\frac{\{r^2/n + \log(r)\}\log(\Delta/\alpha)}{\kappa_0^2 n\rho}\right] > 1 - \alpha,$$
where $C, C_d > 0$ are absolute constants.

Corollary 4 provides an upper bound on the detection delay matching the lower bound in Proposition 3, saving for logarithmic factors. However, the detection delay in Corollary 4 is based on the NP-hard procedure in Algorithm 4, which has limited practical value.

Comparing the results in Theorem 1 and Corollary 4, we see that in the extreme regimes, i.e. $r \asymp 1$ or $r \asymp n$, the detection delays obtained by the proposed polynomial-time and NP-hard algorithms achieve the same rates, and both estimators are nearly optimal, saving for logarithmic factors. Between the two extreme cases $r \asymp 1$ and $r \asymp n$, the NP-hard algorithm achieves sharper rates than the polynomial-time algorithm. This phenomenon is in line with the computational and statistical tradeoffs observed in other high-dimensional statistical problems, e.g. Zhang et al. (2012) and Loh and Wainwright (2013), to name but a few.

3.5 Comparisons with existing work

With all the theoretical results at hand, we are ready to provide some in-depth comparisons with existing work.
Since we believe that our paper is the first to provide theoretical results for network (in the sense of random matrices) online change point detection problems, the four papers we select in this subsection are all concerned with different but related problems.

Chen (2019) establishes a general framework for online change point detection. Provided a suitable notion of distance, a $k$-nearest-neighbour-based test statistic is used for testing the existence of change points in a sequential manner. In Section 4, we consider three different statistics such that the methods from Chen (2019) can be used as our competitors. As for the theoretical results, Chen (2019) focused on the average run length. In our paper, we provide a range of results, including the detection delay, the average run length and minimax lower bounds.

In the statistics literature, the term "network" sometimes refers to the precision matrices in Gaussian graphical models, which are different from what we study in this paper. Keshavarz et al. (2018) and Keshavarz and Michailidis (2020) studied online change point detection in Gaussian graphical models. In addition to the model differences, both Keshavarz et al. (2018) and Keshavarz and Michailidis (2020) focused on the limiting distributions of the test statistics under the null and alternative distributions. We conjecture that a detection delay might be obtainable based on the results therein, but such results are not yet explicit. On the other hand, in our paper, especially based on Theorem 1, it is immediate that the overall probabilities of falsely detecting change points and of missing change points are both upper bounded by $\alpha$.

Wang et al. (2018) investigated an offline network change point detection problem, where a sequence of independent adjacency matrices is collected and change point estimators are sought retrospectively. Despite the differences, there are some interesting comparisons which, to some extent, reflect the connections between online and offline change point detection problems.

(1) The detection delay in the online setting can be seen as the counterpart of the localisation error in the offline setting. The minimax lower bound on the localisation error in Wang et al. (2018) is of order $(\kappa_0^2 n^2\rho)^{-1}$, while the minimax lower bound on the detection delay in this paper is of order $\log(1/\alpha)(\kappa_0^2 n\rho)^{-1}(r^2/n + 1)$. The extra $\log(1/\alpha)$ term is rooted in the fact that we need to control the Type-I error in online settings. The other differences are more interesting: the offline rate is better than the online rate. This is because the de facto smallest sample size for a given distribution in the offline scenario is $\Delta$, while in the online scenario it is $\min\{\Delta,\, \text{detection delay}\}$.

(2) In both online and offline settings, we observe a computational and statistical tradeoff. Comparing Theorem 1 and Corollary 4, we see that NP-hard estimators can detect change points under a weaker condition and provide a smaller detection delay. In the offline setting, as Wang et al. (2018) conjectured, by replacing the USVT estimator with the NP-hard estimator in Definition 4, one can achieve a nearly optimal localisation error under a weaker condition than the one needed by the USVT estimator.

4 Numerical experiments
Recall that in Section 2, we proposed a network online change point detection method in Algorithm 2, with a subroutine in Algorithm 1 and a variant in Algorithm 3. In this section, we investigate the numerical performance of our proposed methods. Since there is no direct competitor available, we tailor the k-nearest-neighbours-type method proposed in Chen (2019).

In order to make a fair comparison with Chen (2019), we consider its three different statistics: the "original" (ORI), which specifies the original edge-count scan statistic; the weighted edge-count scan statistic (W); and the generalised edge-count scan statistic (G). The statistics are computed with internal functions in the R (R Core Team, 2020) package gStream (Chen and Chu, 2019).

We consider two different forms of calibration. The first is based on the probability of raising a false alarm. Using 200 Monte Carlo simulations and values of $\alpha \in \{0.01, 0.05\}$, we choose the detection thresholds such that the probability of raising a false alarm in the interval $[1, T_{\mathrm{train}}]$ is $\alpha$. The values of $T_{\mathrm{train}}$ are taken from the set $\{150, 200\}$. The second is based on the average run length $\gamma$. We consider two values of $\gamma$ and calibrate the thresholds of the competing methods, based on 200 Monte Carlo simulations, to have average run length approximately $\gamma$ under the pre-change model. For the data splitting required by Algorithms 2 and 3, in all of our experiments, the sequence of adjacency matrices $\{A(u)\}$ consists of the odd indices of the original sequence, and the sequence $\{B(u)\}$ of the even ones.

To evaluate the performance of different methods, we proceed as follows. For each generative model described below, we run $N =$
100 Monte Carlo simulations, where in each trial the data are collected in the interval $[1, T]$, and we denote by $\widehat t$ the declared change point, which can exceed $T$ if no change point is detected in $[1, T]$. We define $\widetilde t = \min\{T, \widehat t\,\}$ and compute the average detection delay

$$\mathrm{Delay} = \frac{\sum_{j=1}^N \mathbb{1}\{\widetilde t_j \ge \Delta\}(\widetilde t_j - \Delta)}{\sum_{j=1}^N \mathbb{1}\{\widetilde t_j \ge \Delta\}}.$$

We also report the proportion of false alarms

$$\mathrm{PFA} = \frac{\sum_{j=1}^N \mathbb{1}\{\widetilde t_j < \Delta\}}{N}.$$
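The two performance metrics above can be computed along the following lines (a minimal sketch; the function name and the toy detection times are ours, not the paper's):

```python
import numpy as np

def delay_and_pfa(t_hat, delta, T):
    """Average detection delay and proportion of false alarms over N trials.

    t_hat : detection times from N Monte Carlo trials (np.inf when no change
            point was declared in [1, T]).
    delta : true change point location; T : end of the observation window.
    """
    t_hat = np.asarray(t_hat, dtype=float)
    r = np.minimum(T, t_hat)                 # r_j = min(T, t_hat_j)
    detected = r >= delta                    # alarms at or after the change
    delay = (r[detected] - delta).mean() if detected.any() else np.nan
    pfa = (r < delta).mean()                 # alarms strictly before the change
    return delay, pfa

# Toy illustration: three true alarms (one censored at T) and one false alarm.
delay, pfa = delay_and_pfa([110, 105, np.inf, 90], delta=100, T=200)
```

Trials with no declared change point are censored at $T$ and therefore contribute $T - \Delta$ to the average delay, matching the definition of $\widetilde t$ above.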
As for Algorithm 2, guided by Theorem 1, we set

$$\tau_{1,s,u} = c_1\sqrt{n\widehat\rho} + \sqrt{2\log\left(\frac{(u-s)(u-s+1)}{\alpha}\right)} \quad \text{and} \quad \tau_{2,s,u} = \sqrt{\frac{s(u-s)}{u}}\,\widehat\rho,$$

where $c_1 > 0$ is an absolute constant and $\widehat\rho$ is an estimator of $\rho$, calculated as an upper quantile of the quantities

$$\widehat p_{i,j} = T^{-1}\sum_{t=1}^{T} A_{ij}(t), \quad i, j \in \{1, \ldots, n\},\ i < j, \quad (7)$$

where the matrices $\{A(t)\}_{t=1}^T$ are part of the training data. In addition, we set

$$b_u = C\sqrt{\widehat\rho\,\log\left(\frac{u}{\alpha}\right)},$$

with $C$ tuned to give the desired false alarm rate.
With respect to Algorithm 3, guided by Corollary 2, we let

$$\tau_{1,s,u} = c_1\sqrt{n\widehat\rho} + \sqrt{2\log\left(2(\gamma+1)\right)} \quad \text{and} \quad \tau_{2,s,u} = \sqrt{\frac{s(u-s)}{u}}\,\widehat\rho,$$

where $\widehat\rho$ is the same quantile-based estimator as above, and

$$b_u = C\sqrt{\widehat\rho\,\log(\gamma)},$$

with the constant $C$ calibrated such that, before the change point, the expected time of the first false alarm is $\gamma$.

4.1 Simulations

We consider four different settings.

Scenario 1.
This consists of a stochastic block model with 3 communities of sizes $\lfloor n/3\rfloor$, $\lfloor n/3\rfloor$ and $n - 2\lfloor n/3\rfloor$. The network size $n$ takes values in $\{100, 150\}$. Denoting by $z_i$ the label of the community associated with node $i \in \{1, \ldots, n\}$, the data are generated as

$$A_{ij}(t) \overset{\mathrm{ind.}}{\sim} \mathrm{Bernoulli}\left(\rho B_{z_i z_j}(t)\right),$$

where $\rho = 0.02$, and the matrices $B(t)$ are equal to a fixed $3 \times 3$ connectivity matrix for $t \in \{1, \ldots, \Delta\}$ and to a different fixed $3 \times 3$ connectivity matrix for $t \in \{\Delta+1, \Delta+2, \ldots, T\}$.

Scenario 2.
This is also a stochastic block model. We now take the number of communities to be 5 and the number of nodes in each community to be $n/5$, with $n \in \{100, 150\}$. Again we set $\rho = 0.02$. The matrices $B(t)$ are equal to a fixed $5 \times 5$ connectivity matrix for $t \in \{1, \ldots, \Delta\}$ and to a different fixed $5 \times 5$ connectivity matrix for $t \in \{\Delta+1, \Delta+2, \ldots, T\}$.

Scenario 3. We consider a degree corrected block model (Karrer and Newman, 2011) with 3 communities of sizes $\lfloor n/3\rfloor$, $\lfloor n/3\rfloor$ and $n - 2\lfloor n/3\rfloor$, where $n \in \{100, 150\}$. Let $z_i$ be the community to which node $i$ belongs, and define the degree parameters $v_i = \sqrt{i/n}$. The data are then generated as

$$A_{ij}(t) \sim \mathrm{Bernoulli}\left(v_i v_j B_{z_i z_j}(t)\right),$$

where $B(t)$ is a fixed $3 \times 3$ connectivity matrix for $t \in \{1, \ldots, \Delta\}$ and

$$B(t) = \begin{pmatrix} 0.95 & 0.15 & 0.15 \\ 0.15 & 0.95 & 0.15 \\ 0.15 & 0.15 & 0.95 \end{pmatrix}, \quad t \in \{\Delta+1, \Delta+2, \ldots, T\}.$$

Scenario 4.
This is a random dot product graph (Young and Scheinerman, 2007) with fixed latent positions. First, we generate the latent positions $X \in \mathbb{R}^{n \times d}$ as

$$X_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}[0, 1], \quad i = 1, \ldots, n,\ j = 1, \ldots, d,$$

which are kept fixed throughout our simulations. We then construct $\widetilde X \in \mathbb{R}^{n \times d}$ as

$$\widetilde X_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}[0, 1], \quad i = 1, \ldots, n,\ j = 1, \ldots, d,$$

which are also kept fixed throughout our simulations. Finally, the data are generated as

$$A_{ij}(t) \sim \mathrm{Bernoulli}\left(\frac{X_i^\top X_j}{\|X_i\|\|X_j\|}\right), \quad t \in \{1, \ldots, \Delta\},$$

and

$$A_{ij}(t) \sim \mathrm{Bernoulli}\left(\frac{Y_i^\top Y_j}{\|Y_i\|\|Y_j\|}\right), \quad t \in \{\Delta+1, \Delta+2, \ldots, T\},$$

where $X_i, \widetilde X_i \in \mathbb{R}^d$ are the $i$th rows of the matrices $X$ and $\widetilde X$, $\|\cdot\|$ is the $\ell_2$-norm of vectors, and $Y_i = \widetilde X_i$ for $i \le \lfloor n/2\rfloor$ and $Y_i = X_i$ otherwise.

We collect the results in Figure 2 and Tables 1 and 2. In Figure 2, we exhibit one realisation of each scenario, with each row corresponding to one scenario, from the first to the fourth. In each row, the left two panels are realisations before the change point, and the right two panels are post-change-point realisations. It can be seen from Figure 2 that these four scenarios cover different types of networks, and the change points are hard to spot with the naked eye.

Tables 1 and 2 correspond to the two different ways of controlling Type-I errors. We reiterate that Algorithm 2, Theorem 1 and Table 1 correspond to the strategy of controlling the overall Type-I error $\alpha$; Algorithm 3, Corollary 2 and Table 2 correspond to the strategy of lower bounding the average run length $\gamma$.
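The stochastic block model scenarios above can be simulated along the following lines (a minimal sketch; the connectivity matrices below are illustrative placeholders, not the paper's, and the function names are ours):

```python
import numpy as np

def sbm_adjacency(B, z, rng):
    """One symmetric Bernoulli adjacency matrix with P(A_ij = 1) = B[z_i, z_j]."""
    P = B[np.ix_(z, z)]
    upper = rng.random(P.shape) < P          # independent Bernoulli draws
    A = np.triu(upper, k=1)                  # keep the strict upper triangle
    return (A + A.T).astype(int)             # symmetrise, zero diagonal

def sbm_sequence(B_pre, B_post, z, delta, T, seed=0):
    """Independent networks: connectivity B_pre up to time delta, B_post after."""
    rng = np.random.default_rng(seed)
    return [sbm_adjacency(B_pre if t <= delta else B_post, z, rng)
            for t in range(1, T + 1)]

# Illustrative connectivity matrices at overall sparsity rho = 0.02.
rho = 0.02
B_pre = rho * np.array([[0.9, 0.3, 0.3], [0.3, 0.9, 0.3], [0.3, 0.3, 0.9]])
B_post = rho * np.array([[0.3, 0.9, 0.3], [0.9, 0.3, 0.3], [0.3, 0.3, 0.9]])
z = np.repeat([0, 1, 2], 50)                 # n = 150, three communities of 50
nets = sbm_sequence(B_pre, B_post, z, delta=75, T=150)
```

The degree corrected variant of Scenario 3 only changes the success probabilities to $v_i v_j B_{z_i z_j}(t)$, i.e. an extra rank-one rescaling of the matrix `P`.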
Figure 2: Examples of adjacency matrices generated under different scenarios. The first to the fourth rows correspond to the first to the fourth scenarios, respectively. In each row, from left to right, the first two plots correspond to networks generated before the change point, and the last two plots to networks generated after the change point. In each display, a white dot indicates one and a red dot indicates zero.

Table 1: Upper bounding overall Type-I errors: n, the network size; α, the Type-I error upper bound; T_train, the time length of the training data used for selecting tuning parameters; ORI, Chen (2019) using the original edge-count scan statistic; W, Chen (2019) using the weighted edge-count scan statistic; G, Chen (2019) using the generalised edge-count scan statistic. Delay and PFA are reported for Algo. 2, ORI, W and G in each of the following settings:

n    Scenario  α     T_train
150  1         0.01  200
150  1         0.05  200
150  1         0.05  150
100  1         0.01  200
100  1         0.05  200
100  1         0.01  150
100  1         0.05  150
150  2         0.05  200
150  2         0.01  150
100  2         0.05  150
150  3         0.01  200
100  4         0.05  200

Table 2: Lower bounding the average run lengths: n, the network size; γ, the average run length lower bound; ORI, Chen (2019) using the original edge-count scan statistic; W, Chen (2019) using the weighted edge-count scan statistic; G, Chen (2019) using the generalised edge-count scan statistic. Delay and PFA are reported for Algo. 3, ORI, W and G, in settings including n = 150, Scenario 1 and γ = 150.

4.2 Stock market data

We consider stock market data from April 1990 to January 2012. The data consist of the weekly log returns for the Dow Jones Industrial Average index, and they are available in the R (R Core Team, 2020) package ecp (James et al., 2019). To construct networks, we first use a sliding window of width 3 and consider the covariance matrix of 29 companies' log weekly returns over a 3-week period. We then binarise the covariance matrices by setting the entries with values above the 0.95-quantile to 1, and the remaining entries to 0. This construction leads to sparse networks. Some examples of these networks are illustrated in the first two rows of Figure 3.

As competitors to our estimator, we consider the same statistics ORI, W and G from Chen (2019) that were used in Section 4.1. To evaluate the performance of these methods, we choose two periods of the original data, each consisting of a training set and a test set. We calculate the maximum score of each method on the training data and use this maximum score as the threshold for declaring change points.

In the first period, the data from 2-Apr-1990 to 4-Jan-1999 are used as the training set, and the data from 25-Jan-1999 to 31-May-2004 are used as the test set. The algorithm we proposed in Algorithm 2 detects a change point corresponding to 25-Mar-2002. This seems to coincide with
Figure 3: Stock market data set described in Section 4.2.

the period of financial turbulence after the 11-Sep-2001 terrorist attacks. We can also see in the first row of Figure 3 that there seems to be a change in pattern around that date. In contrast, the competitor methods did not detect the change point with the given choice of threshold.

In the second period, the data from 31-May-2004 to 15-Jan-2007 are used as the training set, and the data from 5-Feb-2007 to 1-Mar-2010 are used as the test set. We remark that the training data correspond to the period before the financial crisis of 2007–2008. Algorithm 2 detects a change point corresponding to the date 10-Mar-2008. The competing approaches detect a change point in the same period: specifically, ORI detects the date 17-Dec-2007, and both W and G detect the date 19-Mar-2007. From looking at the second row of Figure 3, we can see that the 2007–2008 financial crisis seems to affect the network patterns in the data.
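The covariance-thresholding construction described above (window width 3, binarisation at the 0.95-quantile) can be sketched as follows; the function names and the synthetic returns are ours, not part of the paper or the ecp package:

```python
import numpy as np

def returns_to_network(returns, quantile=0.95):
    """Binary network from one window of log returns: compute the covariance
    matrix among the assets, then set entries above the given quantile to 1
    and the remaining entries to 0."""
    cov = np.cov(returns, rowvar=False)          # (n_assets, n_assets)
    return (cov > np.quantile(cov, quantile)).astype(int)

def sliding_networks(returns, width=3, quantile=0.95):
    """One network per sliding window of `width` consecutive weeks."""
    T = returns.shape[0]
    return [returns_to_network(returns[t:t + width], quantile)
            for t in range(T - width + 1)]

# Synthetic stand-in for the weekly log returns of 29 companies over 30 weeks.
rng = np.random.default_rng(0)
weekly_log_returns = rng.normal(scale=0.02, size=(30, 29))
networks = sliding_networks(weekly_log_returns)
```

Thresholding at a high quantile is what makes the resulting networks sparse, as noted in the text.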
4.3 MIT cellphone data

In Section 1, we studied the MIT cellphone data set. In this subsection, we provide all numerical details.

Originally, the data are in the form of 1392 networks of size 96 × 96. For each day of the experiment, the data include four networks corresponding to six-hour intervals in the given day. We sum the four networks within each day, resulting in 232 networks, each with 96 nodes. The networks are transformed to binary networks by setting all strictly positive entries to 1. In other words, in each binary network, if the entry (i, j) equals 1, then participants i and j were in physically close proximity during the corresponding day.

To evaluate the performances of different methods, we construct two experiments. In the first, we use the data from 14-Sept-2004 to 1-Dec-2004 as the training data set, and the data from 2-Dec-2004 to 15-Feb-2005 as the test set. The training set is the period before the MIT winter recess, which that year took place from 22-Dec-2004 to 3-Jan-2005. The thresholds are chosen in the same way as in Section 4.2. Algorithm 2 detects a change point on 27-Dec-2004; ORI on 30-Jan-2005; W and G on 29-Jan-2005. Clearly, our method is the best at detecting the winter recess period.

In our second example, we use the data from 1-Jan-2005 to 3-Mar-2005 as the training data set and the data from 4-Mar-2005 to 5-May-2005 as the test set. Thus the training data consist of networks before the spring recess, which took place from 26-Mar-2005 to 3-Apr-2005. Algorithm 2 detects a change point on 31-Mar-2005, and ORI on 6-Apr-2005. Thus, our method seems to be quicker at detecting the spring recess.

5 Conclusion

In this paper, we are concerned with online change point detection in a sequence of inhomogeneous Bernoulli networks. We established the minimax lower bound on the detection delay, which matches an upper bound, save for logarithmic factors, based on an NP-hard estimation procedure. In addition, we proposed a polynomial-time algorithm, the detection delay of which matches that based on NP-hard estimators in the extreme cases, i.e.
$r \asymp 1$ or $r \asymp n$.

Our proposed methods consist of two different Type-I error control strategies, with a worst-case computational cost of order $O(\log(t)\,\mathrm{Cost}(n))$ when proceeding to time point $t$, where $\mathrm{Cost}(n)$ is the computational cost of running the USVT algorithm (Algorithm 1) on a size-$n$ network.

In this paper, we only discuss the at-most-one-change-point scenario. In fact, it is straightforward to extend the algorithm and the results to the multiple change points scenario. To be specific, one can restart the algorithm whenever a change point is declared by Algorithm 2. As for the theoretical results, in Theorem 1, we can let $\Delta$ be the minimal spacing between two consecutive change points. Provided that the signal-to-noise ratio condition of Theorem 1 holds with this choice of $\Delta$, then with probability at least $1 - \alpha$, all change points can be detected, with detection delays uniformly upper bounded by the detection delay upper bound in Theorem 1 and without false alarms. This is a straightforward consequence of Theorem 1, and we therefore omit the technical details here.

In the existing literature, it is hoped that an online change point detection procedure has constant cost at every time point. In this sense, our cost $O(\log(t))$ is not efficient enough. However, to the best of our knowledge, when the pre- and post-change distributions are not fully specified, this constant computational cost is not achievable even in the univariate case. Having said this, it is still of vital interest to improve the computational efficiency of network online change point detection methods. We leave this for future work.

Appendices
All necessary lemmas are collected in Appendix A and all proofs of the main results are given in Appendix B.
A Network change point lemmas
The lemma below is identical to Lemma S.6 in Wang et al. (2018); we therefore omit its proof.

Lemma 5. (1) For any $t \in \mathbb{N}^*$, let $\{A(l)\}_{l=1}^t$ be a collection of independent matrices with independent Bernoulli entries satisfying

$$\max_{l = 1, \ldots, t} \|\mathbb{E}(A(l))\|_\infty \le \rho, \quad \text{with } n\rho \ge \log(n).$$

Let $\{w_l\}_{l=1}^t \subset \mathbb{R}$ be a collection of scalars such that $\sum_{l=1}^t w_l = 0$ and $\sum_{l=1}^t w_l^2 = 1$. Then there exists an absolute constant $C > 0$ such that, for any $\varepsilon > 0$,

$$\mathbb{P}\left( \left\| \sum_{l=1}^t w_l A(l) - \mathbb{E}\left( \sum_{l=1}^t w_l A(l) \right) \right\|_{\mathrm{op}} \ge C\sqrt{n\rho} + \varepsilon \right) \le \exp(-\varepsilon^2/2). \quad (8)$$

(2) If $\{A(l)\}_{l=1}^t$ are symmetric matrices, then (8) still holds.

Lemmas 6 and 7 are from Lemma 1 in Xu (2017).
Lemma 6. Let $A, B \in \mathbb{R}^{n \times n}$ be two symmetric matrices with $\|A - B\|_{\mathrm{op}} \le \tau/(1+\delta)$, $\tau > 0$. Then, for a fixed $\delta \in (0, 1)$, we have

$$\|\mathrm{USVT}(A, \tau, \infty) - B\|_{\mathrm F}^2 \le \min_{s = 0, \ldots, n} \left\{ s\tau^2 + \frac{(1+\delta)^2}{\delta^2} \sum_{i = s+1}^n \lambda_i^2(B) \right\},$$

where $|\lambda_1(B)| \ge \cdots \ge |\lambda_n(B)|$ are the eigenvalues of $B$, sorted in decreasing order of absolute value.

Lemma 7.
Let $A$ and $B$ be defined as in Lemma 6, and assume in addition that $\|B\|_\infty \le \tau_2$. Then

$$\|\mathrm{USVT}(A, \tau_1, \tau_2) - B\|_{\mathrm F}^2 \le \min_{s = 0, \ldots, n} \left\{ s\tau_1^2 + \frac{(1+\delta)^2}{\delta^2} \sum_{i = s+1}^n \lambda_i^2(B) \right\},$$

with the eigenvalues of $B$ ordered as in Lemma 6.

The lemma below is Lemma S.2 in Wang et al. (2018).
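For concreteness, the USVT operator appearing in Lemmas 6 and 7 can be sketched as follows (an illustrative implementation of singular value thresholding with entrywise truncation; the exact subroutine is Algorithm 1 in the main text):

```python
import numpy as np

def usvt(A, tau1, tau2=np.inf):
    """Singular value thresholding with entrywise truncation: keep the
    singular values exceeding tau1, rebuild the matrix from the retained
    components, then clip the entries to [-tau2, tau2].  With tau2 = inf
    this corresponds to USVT(A, tau, infinity) in Lemma 6; a finite tau2
    corresponds to USVT(A, tau1, tau2) in Lemma 7."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tau1                          # hard-threshold singular values
    M = (U[:, keep] * s[keep]) @ Vt[keep]    # low-rank reconstruction
    return np.clip(M, -tau2, tau2)           # entrywise truncation

# A rank-one matrix with singular value 10 survives a threshold of 5 intact,
# whereas the all-zero matrix is mapped to zero for any positive threshold.
M = np.outer(np.ones(10), np.ones(10))
recovered = usvt(M, tau1=5.0)
zeroed = usvt(np.zeros((10, 10)), tau1=1.0)
```

This "kill small singular values, clip entries" structure is exactly what Lemmas 6 and 7 quantify: the $s\tau^2$ term accounts for the retained directions and the tail sum for the discarded ones.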
Lemma 8.
Let $\{X(l)\}_{l = 1, 2, \ldots} \subset \mathbb{R}^p$ be a sequence of independent random vectors with independent Bernoulli entries. Suppose that $\mathbb{E}(X_i(t)) = \mu_i(t)$ and that $\sup_{l = 1, 2, \ldots} \|\mu(l)\|_\infty \le \rho$. For any $t > 0$, let $v \in \mathbb{R}^p$ and $\{w_l\}_{l=1}^t \subset \mathbb{R}$ satisfy $\sum_{l=1}^t w_l^2 = 1$. Then, for any $\varepsilon > 0$, we have

$$\mathbb{P}\left( \left| \sum_{i=1}^p v_i \sum_{l=1}^t w_l \left(X_i(l) - \mu_i(l)\right) \right| \ge \varepsilon \right) \le 2\exp\left( -\frac{\varepsilon^2/2}{\rho\|v\|_2^2 + \varepsilon \max_{i=1}^p |v_i| \max_{l=1}^t |w_l|} \right).$$

Lemma 9.
Assume that $\{B(u)\}$ is a sequence of adjacency matrices satisfying Assumption 1. For any integer $t \ge 1$, let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$, let $C > 0$ be the absolute constant in Lemma 5, and let

$$\varepsilon_{s,t} = \sqrt{2\log\left\{ \frac{2t(t+1)\log(t)}{\alpha\log(2)} \right\}}, \quad \tau_{1,s,t} = C\sqrt{n\rho} + \varepsilon_{s,t} \quad \text{and} \quad \tau_{2,s,t} = \sqrt{\frac{s(t-s)}{t}}\,\rho.$$

We have that, under Assumption 2, the event

$$\mathcal{F}_1 = \left\{ \forall t \ge 1:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \text{ for } t \le \Delta,\ \text{and}\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) - \widetilde{\Theta}_{s,t} \right\|_{\mathrm F} \le \sqrt{r}\,\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \text{ for } t > \Delta \right\}$$

holds with probability at least $1 - \alpha/2$; under Assumption 3, the event

$$\mathcal{F}_2 = \left\{ \forall t \ge 1:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \right\}$$

holds with probability at least $1 - \alpha/2$.

Proof. Step 1. If Assumption 2 holds, then for $t \le \Delta$ it holds that $\widetilde{\Theta}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$; for $t > \Delta$, it holds that

$$\widetilde{\Theta}_{s,t} = \begin{cases} (t-\Delta)\sqrt{\dfrac{s}{t(t-s)}}\,(\Theta_2 - \Theta_1), & s \le \Delta, \\[6pt] \Delta\sqrt{\dfrac{t-s}{st}}\,(\Theta_2 - \Theta_1), & s > \Delta, \end{cases} \qquad \text{and} \qquad \mathrm{rank}(\widetilde{\Theta}_{s,t}) \le r. \quad (9)$$

If Assumption 3 holds, then for any $t$, it holds that $\widetilde{\Theta}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$.

Step 2. Due to Definition 2, we have $\widehat{B}_{s,t} = \sum_{l=1}^t w_l^s B(l)$, where $\sum_{l=1}^t w_l^s = 0$ and $\sum_{l=1}^t (w_l^s)^2 = 1$. Let

$$\mathcal{E}^c = \left\{ \exists t \ge 1,\ s \in \mathcal{S}(t):\ \|\widehat{B}_{s,t} - \widetilde{\Theta}_{s,t}\|_{\mathrm{op}} > C\sqrt{n\rho} + \varepsilon_{s,t} \right\}.$$

It then follows from Lemma 5 that

$$\mathbb{P}(\mathcal{E}^c) \le \sum_{t=1}^\infty \frac{\log(t)}{\log(2)} \max_{s \in \mathcal{S}(t)} \mathbb{P}\left\{ \|\widehat{B}_{s,t} - \widetilde{\Theta}_{s,t}\|_{\mathrm{op}} > C\sqrt{n\rho} + \varepsilon_{s,t} \right\} \le \sum_{t=1}^\infty \frac{\log(t)}{\log(2)} \cdot \frac{\alpha\log(2)}{2t(t+1)\log(t)} \le \frac{\alpha}{2}, \quad (10)$$

with $C$ and $\varepsilon_{s,t}$ defined above. Under Assumption 2, it follows from Lemma 6 that $\mathcal{E} \subset \mathcal{E}_1$, where

$$\mathcal{E}_1 = \left\{ \forall t \ge 1:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \infty) - \widetilde{\Theta}_{s,t} \right\|_{\mathrm F} \le \sqrt{r}\,\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \text{ for } t > \Delta,\ \text{and}\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \infty) \right\|_{\mathrm F} = 0 \text{ for } t \le \Delta \right\}.$$

Due to (9), we have $\|\widetilde{\Theta}_{s,t}\|_\infty \le \sqrt{s(t-s)/t}\,\rho = \tau_{2,s,t}$; Lemma 7 then transfers the bounds from $\mathrm{USVT}(\cdot, \tau_{1,s,t}, \infty)$ to $\mathrm{USVT}(\cdot, \tau_{1,s,t}, \tau_{2,s,t})$, so $\mathcal{E}_1 \subset \mathcal{F}_1$ and $\mathbb{P}(\mathcal{F}_1) \ge \mathbb{P}(\mathcal{E}) \ge 1 - \alpha/2$. Under Assumption 3, note that $\widetilde{\Theta}_{s,t} = 0$; due to Lemmas 6 and 7, we have $\mathcal{E} \subset \mathcal{F}_2$, which completes the proof.

Lemma 10.
Assume that $\{B(u)\}$ is a sequence of adjacency matrices satisfying Assumption 1. For any integer $t \ge 1$, let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$, let $C > 0$ be the absolute constant in Lemma 5, and let

$$\varepsilon_{s,t} = \sqrt{2\log\left\{ \frac{4(\gamma+1)^2\log(\gamma+1)}{\log(2)} \right\}}, \quad \tau_{1,s,t} = C\sqrt{n\rho} + \varepsilon_{s,t} \quad \text{and} \quad \tau_{2,s,t} = \sqrt{\frac{s(t-s)}{t}}\,\rho.$$

We have that, under Assumption 2, the event

$$\mathcal{F}_1 = \left\{ \forall t \in \{1, \ldots, \gamma+1\}:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \text{ for } t \le \Delta,\ \text{and}\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) - \widetilde{\Theta}_{s,t} \right\|_{\mathrm F} \le \sqrt{r}\,\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \text{ for } t > \Delta \right\}$$

holds with probability at least $1 - \{4(\gamma+1)\}^{-1}$; under Assumption 3, the event

$$\mathcal{F}_2 = \left\{ \forall t \in \{1, \ldots, \gamma+1\}:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \right\}$$

holds with probability at least $1 - \{4(\gamma+1)\}^{-1}$.

Proof. Step 1. This step is identical to Step 1 in the proof of Lemma 9: if Assumption 2 holds, then $\widetilde{\Theta}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$ when $t \le \Delta$, and for $t > \Delta$ the population CUSUM $\widetilde{\Theta}_{s,t}$ takes the same form as in (9), relabelled (11) here, with $\mathrm{rank}(\widetilde{\Theta}_{s,t}) \le r$; if Assumption 3 holds, then $\widetilde{\Theta}_{s,t} = 0$ for any $t$ and all $s \in \mathcal{S}(t)$.

Step 2. Due to Definition 2, we have $\widehat{B}_{s,t} = \sum_{l=1}^t w_l^s B(l)$, where $\sum_{l=1}^t w_l^s = 0$ and $\sum_{l=1}^t (w_l^s)^2 = 1$. Let

$$\mathcal{E}^c = \left\{ \exists t \in \{1, \ldots, \gamma+1\},\ s \in \mathcal{S}(t):\ \|\widehat{B}_{s,t} - \widetilde{\Theta}_{s,t}\|_{\mathrm{op}} > C\sqrt{n\rho} + \varepsilon_{s,t} \right\}.$$

It then follows from Lemma 5, with a union bound over the at most $(\gamma+1)\log(\gamma+1)/\log(2)$ pairs $(s, t)$, that

$$\mathbb{P}(\mathcal{E}^c) < \frac{1}{4(\gamma+1)}, \quad (12)$$

with $C$ and $\varepsilon_{s,t}$ defined above. The rest of the argument is identical to Step 2 in the proof of Lemma 9: under Assumption 2, Lemma 6 implies $\mathcal{E} \subset \mathcal{E}_1$, the analogue of the event $\mathcal{E}_1$ there, and the bound $\|\widetilde{\Theta}_{s,t}\|_\infty \le \sqrt{s(t-s)/t}\,\rho = \tau_{2,s,t}$, from (11), together with Lemma 7 yields $\mathcal{E}_1 \subset \mathcal{F}_1$; under Assumption 3, note that $\widetilde{\Theta}_{s,t} = 0$, and Lemmas 6 and 7 give $\mathcal{E} \subset \mathcal{F}_2$, which completes the proof.

B Proofs of main results
Proof of Theorem 1.
We let $\widetilde{B}_{s,t} = \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t})$ for $t \ge 1$, and let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$. The rest of the proof is conducted on the event $\mathcal{F}_1$ defined in Lemma 9. In particular, $\mathbb{P}\{\mathcal{F}_1\} > 1 - \alpha/2$, and $\mathcal{F}_1$ concerns the data $\{B(t)\}$, which are independent of the data $\{A(t)\}$.

Step 1. Due to Lemma 9, it holds that if Assumption 2 holds and $t \le \Delta$, or if Assumption 3 holds, then $\widetilde{B}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$. Therefore claim (i) is proved, and for claim (ii) we have $\widehat{t} - \Delta > 0$. Define

$$t_1 = \min\left\{ t > \Delta:\ \max_{s \in \mathcal{S}(t)} \left[ \left| \left\langle \widehat{A}_{s,t}, \frac{\widetilde{B}_{s,t}}{\|\widetilde{B}_{s,t}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{B}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha) \right\} \right] > b_t \right\}. \quad (14)$$

Due to the design of Algorithm 2, we can see that $\widehat{t} \le t_1$ and therefore $d \le t_1 - \Delta$. In order to provide an upper bound on $d$, it thus suffices to upper bound $t_1$.

Step 2. Recall the quantity $\varepsilon_{s,t}$ defined in Lemma 9. With this quantity, we define

$$t_2 = \min\left\{ t > \Delta:\ \max_{s \in \mathcal{S}(t)} \left[ \left| \left\langle \widehat{A}_{s,t}, \frac{\widetilde{B}_{s,t}}{\|\widetilde{B}_{s,t}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{\Theta}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha) + \sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \right\} \right] > b_t \right\}.$$

Due to Lemma 9, we know that if the event $\{\|\widetilde{\Theta}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha) + \sqrt{r}(C\sqrt{n\rho} + \varepsilon_{s,t})\}$ considered in $t_2$ holds, then the event $\{\|\widetilde{B}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha)\}$ considered in $t_1$ also holds. This implies that $t_1 \le t_2$. It now suffices to bound $t_2$, which yields the upper bound $d \le t_2 - \Delta$.

Due to the choices of $s$, we in turn define

$$J = \min\left\{ j \in \mathbb{N}:\ \left| \left\langle \widehat{A}_{\Delta, \Delta+j}, \frac{\widetilde{B}_{\Delta, \Delta+j}}{\|\widetilde{B}_{\Delta, \Delta+j}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{\Theta}_{\Delta, \Delta+j}\|_{\mathrm F} > C\log^{1/2}((\Delta+j)/\alpha) + \sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{\Delta, \Delta+j}\right) \right\} > b_{\Delta+j} \right\}$$

and $t_3 = \Delta + J$. In the definition of $J$, we essentially choose the integer pair $(s, t)$ to be $(\Delta, \Delta + J)$; this is to ensure that $s \in \mathcal{S}(t)$ and $s = \Delta$. Due to this construction, we can see that $t_3$ is an upper bound on $t_2$, and our task is now to bound $J$.

Step 3.
We now show that, for a large enough absolute constant $C_d > 0$, it suffices to take

$$J = 2^{\left\lceil \log\left( \frac{C_d\, r \log(\Delta/\alpha)}{\kappa^2 n\rho} \right) \big/ \log(2) \right\rceil}. \quad (15)$$

For notational simplicity, in the rest of the proof we assume that $\log\{C_d r \log(\Delta/\alpha)/(\kappa^2 n\rho)\}/\log(2)$ is a positive integer, so that $J = C_d r \log(\Delta/\alpha)/(\kappa^2 n\rho)$; if this is violated, the proof only needs to be modified by keeping the ceiling operator throughout.

Step 3.1. With $J$ defined in (15), we have that

$$\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} = \kappa \sqrt{\frac{\Delta J}{\Delta + J}},$$

which can be derived by plugging $s = \Delta$ and $t = \Delta + J$ into (9). Due to (2), we have that

$$\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\log^{1/2}((\Delta+J)/\alpha) + \sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{\Delta, \Delta+J}\right). \quad (16)$$

This can be seen in the following three steps.

Step 3.1.1. We first show that

$$3^{-1}\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\log^{1/2}((\Delta+J)/\alpha). \quad (17)$$

Provided that $C_d < C_{\mathrm{SNR}}$, the condition (2) implies that $J \le \Delta$, and hence

$$\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} \ge \kappa\sqrt{J/2} = \kappa\sqrt{\frac{C_d\, r\log(\Delta/\alpha)}{2\kappa^2 n\rho}}. \quad (18)$$

In addition, provided that $\Delta/\alpha \ge$
2, it holds that

$$\log^{1/2}((\Delta+J)/\alpha) \le \log^{1/2}(2\Delta/\alpha) \le \sqrt{2}\,\log^{1/2}(\Delta/\alpha). \quad (19)$$

Therefore, provided that $C_d$ is a sufficiently large absolute constant and $n \ge 2$, (17) holds, where the condition $n\rho \ge \log(n)$ assumed in Assumption 1 is used.

Step 3.1.2. Provided that $C_d$ is sufficiently large, we have $3^{-1}\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\sqrt{rn\rho}$, by using (18).

Step 3.1.3. Lastly, we are to show that

$$3^{-1}\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\sqrt{r}\,\varepsilon_{\Delta, \Delta+J}. \quad (20)$$

Due to (19) and the fact that $J \le \Delta$,

$$\varepsilon_{\Delta, \Delta+J}^2 = 2\log\left\{ \frac{2(\Delta+J)(\Delta+J+1)\log(\Delta+J)}{\alpha\log(2)} \right\} \le C'\log\left(\frac{\Delta}{\alpha}\right)$$

for an absolute constant $C' > 0$. Therefore, provided that $C_d$ is sufficiently large, (20) holds.

Step 3.2. In addition, we have that

$$\left| \left\langle \widehat{A}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \ge \left| \left\langle \widetilde{\Theta}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| - \left| \left\langle \widehat{A}_{\Delta,\Delta+J} - \widetilde{\Theta}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| = (\mathrm{I}) - (\mathrm{II}). \quad (21)$$

Step 3.2.1. As for (I), writing $\widetilde{\Theta} = \widetilde{\Theta}_{\Delta,\Delta+J}$ and $\widetilde{B} = \widetilde{B}_{\Delta,\Delta+J}$ for brevity, we have

$$(\mathrm{I}) = \|\widetilde{\Theta}\|_{\mathrm F} \left| \left\langle \frac{\widetilde{\Theta}}{\|\widetilde{\Theta}\|_{\mathrm F}}, \frac{\widetilde{B}}{\|\widetilde{B}\|_{\mathrm F}} \right\rangle \right| \ge \|\widetilde{\Theta}\|_{\mathrm F} \left( 1 - \frac{1}{2}\left\| \frac{\widetilde{\Theta}}{\|\widetilde{\Theta}\|_{\mathrm F}} - \frac{\widetilde{B}}{\|\widetilde{B}\|_{\mathrm F}} \right\|_{\mathrm F}^2 \right) \ge \|\widetilde{\Theta}\|_{\mathrm F} \left\{ 1 - 2\left( \frac{\|\widetilde{\Theta} - \widetilde{B}\|_{\mathrm F}}{\|\widetilde{\Theta}\|_{\mathrm F}} \right)^2 \right\} \ge \|\widetilde{\Theta}\|_{\mathrm F} \left\{ 1 - 2\left( \frac{\sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{\Delta,\Delta+J}\right)}{\|\widetilde{\Theta}\|_{\mathrm F}} \right)^2 \right\} \ge \frac{1}{2}\|\widetilde{\Theta}\|_{\mathrm F}, \quad (22)$$

where the penultimate inequality is due to the event $\mathcal{F}_1$ and the last inequality follows from (16) with a sufficiently large $C_{\mathrm{SNR}}$.

Step 3.2.2.
As for (II), due to the independence between $\{A(t)\}$ and $\{B(t)\}$, it follows from Lemma 8 and (16) that

$$\mathbb{P}_A\left\{ \left| \left\langle \widehat{A}_{\Delta,\Delta+J} - \widetilde{\Theta}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \ge b_{\Delta+J} \right\} < \alpha. \quad (23)$$

To be specific, the CUSUM weights are regarded as the $\{w_l\}$ sequence in Lemma 8, and the entries of $\widetilde{B}_{\Delta,\Delta+J}\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}^{-1}$ are regarded as the $\{v_i\}$ sequence in Lemma 8; therefore $\|v\|_2 = 1$ and $\max_{l} |w_l| \le J^{-1/2}$, while $\max_{i} |v_i|$ is controlled using (16) and the definition of $\mathcal{F}_1$, so that applying Lemma 8 with $\varepsilon = b_{\Delta+J}$ gives a bound smaller than $\alpha$.

Due to (16), with a sufficiently large $C_{\mathrm{SNR}}$, it holds that $\|\widetilde{\Theta}_{\Delta,\Delta+J}\|_{\mathrm F} > 4 b_{\Delta+J}$. Then, combining (21), (22) and (23), it holds that

$$\mathbb{P}_A\left\{ \left| \left\langle \widehat{A}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \ge b_{\Delta+J} \right\} > 1 - \alpha. \quad (24)$$

Step 3.3. Combining (13) and (24), we have that

$$\mathbb{P}\left\{ \left| \left\langle \widehat{A}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F} > C\log^{1/2}((\Delta+J)/\alpha) \right\} > b_{\Delta+J} \right\} > 1 - \alpha,$$

which completes the proof.

Proof of Corollary 2.
We let $\widetilde{B}_{s,t} = \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t})$ for $t \ge 1$, and let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$.

Step 1. Due to Lemma 10, it holds that if Assumption 3 holds, then with probability at least $1 - \{4(\gamma+1)\}^{-1}$ the event $\mathcal{F}_2$ holds, i.e. $\widetilde{B}_{s,t} =$
0 for all $t \in \{1, \ldots, \gamma+1\}$ and $s \in \mathcal{S}(t)$. Then we have

$$\mathbb{E}(\widehat{t}\,) = \sum_{t=1}^\infty \mathbb{P}(\widehat{t} \ge t) \ge \sum_{t=1}^{\gamma+1} \mathbb{P}(\widehat{t} \ge t) \ge (\gamma+1)\,\mathbb{P}(\widehat{t} \ge \gamma+1) \ge (\gamma+1)\left( 1 - \frac{1}{\gamma+1} \right) = \gamma.$$

The claim (i) is proved.
Step 2. As for claim (ii), recall the event $\mathcal{F}_1$ and the associated quantities defined in Lemma 10. We have that $\mathbb{P}(\mathcal{F}_1) > 1 - \{4(\gamma+1)\}^{-1}$. Conditional on the event $\mathcal{F}_1$ of Lemma 10 instead of that of Lemma 9, the rest of the proof is identical to that of Theorem 1.

Proof of Corollary 4.
The proof is almost identical to the proof of Theorem 1, except that the large probability events on which the USVT estimators are well controlled in the proof of Theorem 1 are replaced by Theorem 2.1 in Gao et al. (2015). In fact, Theorem 2.1 in Gao et al. (2015) is stated and proved assuming $\rho = 1$. In order to obtain an upper bound as a function of $\rho$, we only need to change Lemmas 4.1-4.3 in Gao et al. (2015) correspondingly.

Proof of Proposition 3.
This proof consists of two different cases: (a) $r \lesssim \sqrt{n}$ and (b) $r \gtrsim \sqrt{n}$.

Case 1: $r \lesssim \sqrt{n}$.

Step 1 - Setup. We assume the networks are generated as follows. Prior to the change point, if there exists any, the adjacency matrices are generated independently from the distribution $P_0$, whose graphon matrix is $\Theta_0 = (\rho/2)\,\mathbf{1}\mathbf{1}^\top$. If there exists a change point, then the adjacency matrices after the change point are generated independently from the distribution

$$P_1 = 2^{-n} \sum_{u \in \{\pm 1\}^n} P_{1,u},$$

where the graphon of the distribution $P_{1,u}$ is $(\rho/2)\,\mathbf{1}\mathbf{1}^\top + \kappa\rho\, n^{-1} uu^\top$, $u \in \{\pm 1\}^n$.

For any $M \in \mathbb{N}$, let $P^M$ be the restriction of a distribution $P$ to $\mathcal{F}_M$, i.e. the $\sigma$-field generated by the observations $\{A(t)\}_{t=1}^M$. For notational simplicity, in this proof the adjacency matrices $A(t)$ will be denoted $A_t$. For any $\nu \ge 1$ and $M \ge \nu$, let

$$Z_{\nu,M} = \log\left( \frac{\mathrm{d}P^M_{\kappa,\nu}}{\mathrm{d}P^M_{\kappa,\infty}} \right),$$

where $P_{\kappa,\infty}$ indicates the distribution under which there is no change point, and $P_{\kappa,\nu}$ that under which the change point occurs at $\nu$.

Step 2 - When $Z_{\nu,T}$ is upper bounded. For any $\nu \ge 1$, define the event

$$\mathcal{E}_\nu = \left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho},\ Z_{\nu,T} < \frac{3}{4}\log(1/\alpha) \right\}.$$

Then we have

$$P_{\kappa,\nu}(\mathcal{E}_\nu) = \frac{\mathrm{d}P_{\kappa,\nu}}{\mathrm{d}P_{\kappa,\infty}}(\mathcal{E}_\nu)\, P_{\kappa,\infty}(\mathcal{E}_\nu) \le \alpha^{-3/4}\,\alpha = \alpha^{1/4}, \quad (25)$$

where the inequality follows from the definition of $\mathcal{D}$ and $\mathcal{E}_\nu$.

Step 3 - When $Z_{\nu,T}$ is lower bounded. For any $\nu \ge 1$ and $T \in \mathcal{D}$, since $\{T \ge \nu\} \in \mathcal{F}_{\nu-1}$, we have

$$P_{\kappa,\nu}\left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho},\ Z_{\nu,T} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ T \ge \nu \right\} \le \operatorname{ess\,sup} P_{\kappa,\nu}\left\{ \max_{1 \le l \le \log(1/\alpha)/(\kappa^2 n\rho)} Z_{\nu,\nu+l} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ A_1, \ldots, A_\nu \right\}$$
$$\le \frac{\log(1/\alpha)}{\kappa^2 n\rho} \max_{1 \le l \le \log(1/\alpha)/(\kappa^2 n\rho)} \operatorname{ess\,sup} P_{\kappa,\nu}\left\{ Z_{\nu,\nu+l} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ A_1, \ldots, A_\nu \right\}$$
$$\le \frac{\log(1/\alpha)}{\kappa^2 n\rho} \exp\left\{ -\frac{3}{4}\log(1/\alpha) \right\} \max_{1 \le l \le \log(1/\alpha)/(\kappa^2 n\rho)} \operatorname{ess\,sup}\, \mathbb{E}_{\kappa,\nu}\left\{ \exp(Z_{\nu,\nu+l})\ \middle|\ A_1, \ldots, A_\nu \right\}. \quad (26)$$

Step 3.1. Note that for any $l \in \{1, \ldots, \log(1/\alpha)(\kappa^2 n\rho)^{-1}\}$, it holds that

$$\mathbb{E}_{\kappa,\nu}\left\{ \exp(Z_{\nu,\nu+l})\ \middle|\ A_1, \ldots, A_\nu \right\} = \mathbb{E}_{\kappa,\nu}\left\{ \frac{\mathrm{d}P^{\nu+l}_{\kappa,\nu}}{\mathrm{d}P^{\nu+l}_{\kappa,\infty}}\ \middle|\ A_1, \ldots, A_\nu \right\}. \quad (27)$$

In addition, letting $\zeta = \rho/2$ and $v_i$ be the $i$th entry of any vector $v$, we have that

$$\mathbb{E}_{P_0}\left( \frac{\mathrm{d}P_1}{\mathrm{d}P_0}(A) \right)^2 = \mathbb{E}_u \mathbb{E}_{A|u}\left( 2^{-n} \sum_{v \in \{\pm 1\}^n} \frac{\mathrm{d}P_{1,v}}{\mathrm{d}P_0}(A) \right) \quad (28)$$
$$= \mathbb{E}_u \mathbb{E}_{A|u}\left\{ 2^{-n} \sum_{v \in \{\pm 1\}^n} \prod_{1 \le i < j \le n} \left( \frac{\zeta + \kappa\rho n^{-1} v_i v_j}{\zeta} \right)^{A_{ij}} \left( \frac{1 - \zeta - \kappa\rho n^{-1} v_i v_j}{1 - \zeta} \right)^{1 - A_{ij}} \right\}$$
$$= \mathbb{E}_u\left\{ 2^{-n} \sum_{v \in \{\pm 1\}^n} \prod_{1 \le i < j \le n} \left\{ 1 + \frac{\kappa^2\rho^2 n^{-2} u_i u_j v_i v_j}{\zeta(1-\zeta)} \right\} \right\} \le \mathbb{E}_u\left\{ 2^{-n} \sum_{v \in \{\pm 1\}^n} \prod_{i,j=1}^n \left\{ 1 + \frac{\kappa^2\rho^2 n^{-2} u_i u_j v_i v_j}{\zeta(1-\zeta)} \right\} \right\}, \quad (29)$$

where in (29), $u$ is a random vector with entries being independent Rademacher random variables. We further have that

$$(29) \le \mathbb{E}_u \mathbb{E}_v \prod_{i,j=1}^n \exp\left\{ \frac{\kappa^2\rho^2 n^{-2} u_i u_j v_i v_j}{\zeta(1-\zeta)} \right\} = \mathbb{E}_{u,v} \exp\left\{ \frac{2\kappa^2\rho}{n^2(1 - \rho/2)}\,(u^\top v)^2 \right\}, \quad (30)$$

where $u$ and $v$ are independent random vectors with entries being independent Rademacher random variables. Finally, we have that

$$(30) = \mathbb{E}_u \exp\left\{ \frac{2\kappa^2\rho}{n^2(1 - \rho/2)}\,(u^\top \mathbf{1})^2 \right\}, \quad (31)$$

where $\mathbf{1}$ is the all-one $n$-dimensional vector. Equation (31) is due to the fact that, for each $i \in \{1, \ldots, n\}$, $u_i v_i$ has the same distribution as $u_i$.

Step 3.2. Let $\varepsilon_n = \left( n^{-1}\sum_{i=1}^n u_i \right)^2$. By Hoeffding's inequality, $\mathbb{P}\{\varepsilon_n > x\} \le 2\exp(-nx/2)$ for any $x > 0$. Integrating the tail, we obtain

$$\mathbb{E}_u \exp\left\{ \frac{2\kappa^2 n\rho}{1 - \rho/2}\,\varepsilon_n \right\} = \int_0^\infty \mathbb{P}\left\{ \exp\left( \frac{2\kappa^2 n\rho}{1 - \rho/2}\,\varepsilon_n \right) > x \right\} \mathrm{d}x \le \exp\left( C_1 \kappa^2 n\rho \right) \quad (32)$$

for an absolute constant $C_1 > 0$, provided that

$$\rho + \kappa^2 n\rho < 1. \quad (33)$$

Step 3.3. Combining (26), (27), (31) and (32), we have that

$$P_{\kappa,\nu}\left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho},\ Z_{\nu,T} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ T \ge \nu \right\} \le \frac{\log(1/\alpha)}{\kappa^2 n\rho}\,\exp\left\{ C_1\,\frac{\log(1/\alpha)}{\kappa^2 n\rho}\,\kappa^2 n\rho \right\} \exp\left\{ -\frac{3}{4}\log(1/\alpha) \right\} \le \alpha^{1/4}, \quad (34)$$

provided that

$$\alpha^{1/4}\log(1/\alpha) < \kappa^2 n\rho. \quad (35)$$

Step 4.
Combining (25) and (34), we then have

$$\sup_{\nu \ge 1} P_{\kappa,\nu}\left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho} \right\} \le 2\alpha^{1/4}.$$

Then it holds that

$$\mathbb{E}_{\kappa,\Delta}\{(T - \Delta)_+\} \ge \frac{\log(1/\alpha)}{\kappa^2 n\rho}\, P_{\kappa,\Delta}\left\{ T - \Delta \ge \frac{\log(1/\alpha)}{\kappa^2 n\rho} \right\} = \frac{\log(1/\alpha)}{\kappa^2 n\rho}\left[ P_{\kappa,\Delta}\{T > \Delta\} - P_{\kappa,\Delta}\left\{ \Delta < T < \Delta + \frac{\log(1/\alpha)}{\kappa^2 n\rho} \right\} \right] \ge \frac{\log(1/\alpha)}{\kappa^2 n\rho}\,(1 - \alpha - 2\alpha^{1/4}) \ge \frac{\log(1/\alpha)}{2\kappa^2 n\rho}, \quad (36)$$

provided that

$$\alpha + 2\alpha^{1/4} < 1/2. \quad (38)$$

Step 5. Finally, we show that the set of parameters satisfying (33), (35) and (38) is not empty. For instance, we take $\rho = 1/2$ and $\kappa^2 n = 1/2$; then (33) holds. With this choice, (35) holds if

$$\alpha^{1/4}\log(1/\alpha) < 1/8 \quad (37)$$

holds, and a sufficiently small $\alpha$ satisfies both (37) and (38).

Step 6. We have now shown that, when $r \lesssim \sqrt{n}$, it holds that

$$\inf_{\widehat{t} \in \mathcal{D}} \sup_{P_{\kappa,\Delta}} \mathbb{E}_P\left\{ (\widehat{t} - \Delta)_+ \right\} \ge c\,\frac{\log(1/\alpha)}{\kappa^2 n\rho}, \quad (39)$$

with an absolute constant $c > 0$.

Case 2: $r \gtrsim \sqrt{n}$.

Step 1 - Setup. We assume that the networks are generated as follows. Prior to the change point, if there exists any, the adjacency matrices are generated independently from the distribution $P_0$, whose graphon matrix is $\Theta_0 = (\rho/2)\,\mathbf{1}\mathbf{1}^\top$. If there exists a change point, then the adjacency matrices after the change point are generated independently from the distribution

$$P_1 = 2^{-r(r+1)/2} \sum_{Z \in \mathcal{Z}} P_{1,Z},$$

where the graphon of the distribution $P_{1,Z}$ is $(\rho/2)\,\mathbf{1}\mathbf{1}^\top + \kappa\rho\, n^{1/2} r^{-1} Z$, and the collection $\mathcal{Z}$ is the set of all symmetric matrices satisfying $Z_{ij} =$
0, if max t i, j u ą r , and all the upper triangular matrix of Z p r q , p r q are independent Radamacher random variables.In order to show that P is a probability distribution, it suffices to justify that for any Z P Z , P ,Z is a suitable probability distribution. • Firstly, we have that } κ ρn { rZ } F “ κ ρn. • Secondly, we have that the entries of the matrix Z are all in the set t , ˘ u . This means thatprovided κ n { r ă { , (40)all the entries of P ,Z,E are in the interval r , ρ s . • Lastly, the rank of the matrix Z is upper bounded by r .33or any M P N , let P M be the restriction of a distribution P on F M , i.e. the σ -filed generated bythe observations t A p t qu Mi “ . For notational simplicity, in this proof, the adjacency matrices A p t q ’swill be denoted as A t ’s. For any ν ě M ě ν , we have that for any M ě ∆, let Z ν,M “ log ˜ P Mκ ,ν P Mκ , ¸ , where P κ, indicates the distribution under which there is no change point. Step 2 - When Z ν,T is upper bounded. For any ν ě
Step 2 - When $Z_{\nu,T}$ is upper bounded. For any $\nu \geq 1$, define the event
$$E_\nu = \left\{\nu < T < \nu + \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho},\; Z_{\nu,T} < \frac{3}{4}\log(1/\alpha)\right\}.$$
Then we have that
$$P_{\kappa_0,\nu}(E_\nu) = \mathbb{E}_{\kappa_0,\infty}\left\{\frac{\mathrm{d}P_{\kappa_0,\nu}}{\mathrm{d}P_{\kappa_0,\infty}}\,\mathbb{1}\{E_\nu\}\right\} \leq \alpha^{-3/4}\, P_{\kappa_0,\infty}(E_\nu) \leq \alpha^{-3/4}\,\alpha = \alpha^{1/4}, \quad (41)$$
where the first inequality holds since $Z_{\nu,T} < (3/4)\log(1/\alpha)$ on $E_\nu$, and the second follows from the definition of $\mathcal{D}$.

Step 3 - When $Z_{\nu,T}$ is lower bounded. For any $\nu \geq 1$ and any $T \in \mathcal{D}$, since $\{T \geq \nu\} \in \mathcal{F}_{\nu - 1}$, we have that
$$P_{\kappa_0,\nu}\left\{\nu < T < \nu + \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho},\; Z_{\nu,T} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, T \geq \nu\right\}$$
$$\leq \operatorname*{ess\,sup} P_{\kappa_0,\nu}\left\{\max_{1 \leq l \leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}} Z_{\nu,\nu+l} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, A_1,\ldots,A_\nu\right\}$$
$$\leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}\, \max_{1 \leq l \leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}} \operatorname*{ess\,sup} P_{\kappa_0,\nu}\left\{Z_{\nu,\nu+l} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, A_1,\ldots,A_\nu\right\}$$
$$\leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}\, \exp\left\{-\frac{3}{4}\log(1/\alpha)\right\} \max_{1 \leq l \leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}} \operatorname*{ess\,sup}\, \mathbb{E}_{\kappa_0,\nu}\left\{\exp(Z_{\nu,\nu+l})\,\middle|\, A_1,\ldots,A_\nu\right\}, \quad (42)$$
where the second inequality is a union bound and the last is a conditional Chernoff bound.
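The last inequality in (42) is the conditional version of the standard Chernoff (exponential Markov) bound; for completeness, here is the one-line derivation:

```latex
% Conditional Chernoff bound used in the last step of (42):
% for any random variable X, sigma-field \mathcal{G} and t \in \mathbb{R},
\begin{align*}
P\bigl\{X \ge t \,\big|\, \mathcal{G}\bigr\}
  &= P\bigl\{e^{X} \ge e^{t} \,\big|\, \mathcal{G}\bigr\}
   \le e^{-t}\,\mathbb{E}\bigl\{e^{X} \,\big|\, \mathcal{G}\bigr\},
\end{align*}
% applied with X = Z_{\nu,\nu+l}, \mathcal{G} = \sigma(A_1, \ldots, A_\nu)
% and t = (3/4)\log(1/\alpha), so that e^{-t} = \alpha^{3/4}.
```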
Step 3.1. Note that for any $l \in \bigl\{1, \ldots, \bigl\lfloor r\log(1/\alpha)/(\sqrt{n}\,\kappa_0^2 n\rho)\bigr\rfloor\bigr\}$, it holds that
$$\mathbb{E}_{\kappa_0,\nu}\left\{\exp(Z_{\nu,\nu+l})\,\middle|\, A_1,\ldots,A_\nu\right\} = \mathbb{E}_{\kappa_0,\nu}\left\{\frac{\mathrm{d}P^{\nu+l}_{\kappa_0,\nu}}{\mathrm{d}P^{\nu+l}_{\kappa_0,\infty}}\,\middle|\, A_1,\ldots,A_\nu\right\}. \quad (43)$$
In addition, letting $\zeta = \rho/2$, $U_{ij} = (\kappa_0\rho n/r) Z_{ij}$ and $V_{ij} = (\kappa_0\rho n/r) W_{ij}$, we have that
$$\mathbb{E}_{P_1}\left(\frac{\mathrm{d}P_1}{\mathrm{d}P_0}\right) = \mathbb{E}_Z\,\mathbb{E}_{A|Z}\left\{2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \frac{\mathrm{d}P_{1,W}}{\mathrm{d}P_0}\right\}$$
$$= \mathbb{E}_Z\,\mathbb{E}_{A|Z}\left\{2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \prod_{1 \leq i < j \leq n}\left(\frac{\zeta + V_{ij}}{\zeta}\right)^{A_{ij}}\left(\frac{1 - \zeta - V_{ij}}{1 - \zeta}\right)^{1 - A_{ij}}\right\}$$
$$= \mathbb{E}_Z\left[2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \prod_{1 \leq i < j \leq n}\left\{1 + \frac{U_{ij}V_{ij}}{\zeta(1-\zeta)}\right\}\right] \leq \mathbb{E}_Z\left[2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \prod_{i,j=1}^n\left\{1 + \frac{U_{ij}V_{ij}}{\zeta(1-\zeta)}\right\}\right]$$
$$\leq \mathbb{E}_Z\,\mathbb{E}_W \prod_{i,j=1}^n \exp\left\{\frac{U_{ij}V_{ij}}{\zeta(1-\zeta)}\right\} = \mathbb{E}_Z\,\mathbb{E}_W \exp\left\{\frac{4\kappa_0^2\rho n^2}{r^2(2-\rho)}\langle Z, W\rangle\right\}, \quad (44)$$
where the third equality uses that, conditionally on $Z$, $A_{ij} \sim \mathrm{Bernoulli}(\zeta + U_{ij})$, so that
$$\mathbb{E}_{A|Z}\left\{\left(\frac{\zeta + V_{ij}}{\zeta}\right)^{A_{ij}}\left(\frac{1-\zeta-V_{ij}}{1-\zeta}\right)^{1-A_{ij}}\right\} = 1 + \frac{U_{ij}V_{ij}}{\zeta(1-\zeta)},$$
and the last equality uses $U_{ij}V_{ij} = (\kappa_0\rho n/r)^2 Z_{ij}W_{ij}$ and $\zeta(1-\zeta) = \rho(2-\rho)/4$.

Step 3.2. For any fixed $Z, W \in \mathcal{Z}$, it holds that
$$\langle Z, W\rangle = 2 z^\top w, \quad (45)$$
where $z$ and $w$ are the vectorised strict upper-triangular parts of $Z$ and $W$, respectively. The vectors $z$ and $w$ lie in $\{\pm 1\}^{r(r-1)/2}$.
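Identity (45) only uses the symmetry and zero diagonal of the matrices in $\mathcal{Z}$; a quick numerical confirmation (a sketch, with hypothetical dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_Z(n, r, rng):
    """Symmetric sign matrix supported on the top-left r x r block,
    zero diagonal, Rademacher strict upper triangle."""
    Z = np.zeros((n, n))
    iu = np.triu_indices(r, k=1)
    Z[:r, :r][iu] = rng.choice([-1.0, 1.0], size=len(iu[0]))
    return Z + Z.T

n, r = 50, 10
Z, W = random_Z(n, r, rng), random_Z(n, r, rng)

# Vectorised strict upper-triangular parts, restricted to the active block.
iu = np.triu_indices(r, k=1)
z, w = Z[:r, :r][iu], W[:r, :r][iu]

inner = np.sum(Z * W)          # trace inner product <Z, W>
assert inner == 2 * (z @ w)    # identity (45)
assert len(z) == r * (r - 1) // 2
print("identity (45) verified")
```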
Step 3.3. Due to (45), it holds that
$$\mathbb{E}_{P_1}\left(\frac{\mathrm{d}P_1}{\mathrm{d}P_0}\right) \leq \mathbb{E}_Z\,\mathbb{E}_W \exp\left\{\frac{8\kappa_0^2\rho n^2}{r^2(2-\rho)}\, z^\top w\right\} = \mathbb{E}_Z \exp\left\{\frac{8\kappa_0^2\rho n^2}{r^2(2-\rho)}\, z^\top \mathbf{1}\right\},$$
where the equality holds because, for fixed $w$, the entries $z_i w_i$ are again independent Rademacher random variables, so $z^\top w$ and $z^\top \mathbf{1}$ have the same distribution. Let $\varepsilon = 2 z^\top \mathbf{1}/r^2$. Then, due to Hoeffding's inequality, for any $x > 0$,
$$P\{\varepsilon > x\} \leq \exp(-x^2 r^2/4),$$
and, for any $x > 1$, $\exp(-x^2 r^2/4) \leq \exp(-x r^2/4)$. Writing $c = 4\kappa_0^2\rho n^2/(2-\rho)$ and $a = e^c$, we have that
$$\mathbb{E}_Z \exp\{c\varepsilon\} = \int_0^\infty P\left\{\exp(c\varepsilon) > x\right\}\mathrm{d}x \leq a + \int_a^\infty P\left\{\varepsilon > \frac{\log x}{c}\right\}\mathrm{d}x \leq a + \int_a^\infty \exp\left\{-\frac{r^2\log x}{4c}\right\}\mathrm{d}x$$
$$\leq a + \frac{32\kappa_0^2\rho n^2}{r^2(2-\rho)} \leq a\left(1 + \frac{32\kappa_0^2 n^2\rho}{r^2}\right), \quad\text{provided that}\quad \kappa_0^2 n^2\rho \leq r^2/32, \quad (46)$$
since (46) guarantees that $\beta = r^2/(4c) = r^2(2-\rho)/(16\kappa_0^2\rho n^2) \geq 2$, so that $\int_a^\infty x^{-\beta}\,\mathrm{d}x \leq 2/\beta$. Then we have
$$\mathbb{E}_{P_1}\left(\frac{\mathrm{d}P_1}{\mathrm{d}P_0}\right) \leq \exp\left\{\frac{4\kappa_0^2\rho n^2}{2-\rho}\right\}\left(1 + \frac{32\kappa_0^2 n^2\rho}{r^2}\right) \leq \exp\left\{\frac{4\kappa_0^2\rho n^2}{2-\rho}\right\}\exp\left\{\frac{32\kappa_0^2 n^2\rho}{r^2}\right\}. \quad (47)$$

Step 3.4. Write for brevity $D_\alpha = r\log(1/\alpha)/(\sqrt{n}\,\kappa_0^2 n\rho)$. Combining (42), (43) and (47), we then have
$$P_{\kappa_0,\nu}\left\{\nu < T < \nu + D_\alpha,\; Z_{\nu,T} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, T \geq \nu\right\} \leq D_\alpha\,\alpha^{3/4}\exp\left\{D_\alpha\left(\frac{4\kappa_0^2\rho n^2}{2-\rho} + \frac{32\kappa_0^2 n^2\rho}{r^2}\right)\right\}$$
$$\leq \alpha^{-1/4}\,\alpha^{3/4}\,\alpha^{-1/4} = \alpha^{1/4},$$
provided that
$$D_\alpha \leq \alpha^{-1/4} \quad\text{and}\quad D_\alpha\left(\frac{4\kappa_0^2\rho n^2}{2-\rho} + \frac{32\kappa_0^2 n^2\rho}{r^2}\right) \leq \frac{\log(1/\alpha)}{4}. \quad (48)$$
Then it holds that
$$\mathbb{E}_{\kappa_0,\Delta}\{(T - \Delta)_+\} \geq D_\alpha\, P_{\kappa_0,\Delta}\{T - \Delta \geq D_\alpha\} = D_\alpha\left[P_{\kappa_0,\Delta}\{T > \Delta\} - P_{\kappa_0,\Delta}\{\Delta < T < \Delta + D_\alpha\}\right]$$
$$\geq D_\alpha\,(1 - \alpha - 2\alpha^{1/4}) \geq \frac{r\log(1/\alpha)}{2\sqrt{n}\,\kappa_0^2 n\rho}, \quad\text{provided that}\quad \alpha + 2\alpha^{1/4} < \frac{1}{2}, \quad (49)$$
where $P_{\kappa_0,\Delta}\{T \leq \Delta\} \leq \alpha$ by the definition of $\mathcal{D}$, and $P_{\kappa_0,\Delta}\{\Delta < T < \Delta + D_\alpha\} \leq 2\alpha^{1/4}$ by (41) and (48), applied with $\nu = \Delta$, on the events where $Z_{\Delta,T}$ is upper and lower bounded, respectively.
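The Hoeffding bound used in Step 3.3 can be checked by simulation; a minimal sketch (with a hypothetical $r$; not part of the proof) comparing the empirical tail of $\varepsilon = 2\mathbf{1}^\top z/r^2$ with the bound $\exp(-x^2 r^2/4)$:

```python
import numpy as np

rng = np.random.default_rng(2)

r = 20
m = r * (r - 1) // 2            # number of Rademacher coordinates of z
reps = 200_000

# epsilon = (2 / r^2) * (sum of m independent Rademacher signs)
z_sums = rng.choice([-1.0, 1.0], size=(reps, m)).sum(axis=1)
eps = 2.0 * z_sums / r**2

for x in (0.1, 0.2, 0.3):
    empirical = np.mean(eps > x)
    bound = np.exp(-x**2 * r**2 / 4)
    # The Hoeffding bound should dominate the empirical tail.
    assert empirical <= bound, (x, empirical, bound)
print("Hoeffding tail bound dominates the empirical tail")
```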
Step 4. Finally, we show that the set of parameters satisfying (40), (46), (48) and (49) is not empty. For instance, take $\rho = 1/2$ and $\kappa_0^2 n = 1/4$; then (40), (46) and the second condition in (48) hold whenever $r \geq 2\sqrt{n}$. With this choice, the first condition in (48) and (49) hold with the choice $\alpha = 1/n$, for all sufficiently large $n$.
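The parameter choices in Step 4 can be verified mechanically; a small sketch (using the conditions (40) and (46) as stated above, with hypothetical values of $n$):

```python
import numpy as np

# Step-4 choices: rho = 1/2, kappa_0^2 * n = 1/4, in the regime r >= 2 sqrt(n).
for n in (100, 400, 2500):
    rho = 0.5
    kappa0_sq = 1 / (4 * n)          # kappa_0^2 n = 1/4
    kappa0 = np.sqrt(kappa0_sq)
    r = int(np.ceil(2 * np.sqrt(n)))

    # Condition (40): kappa_0 * n / r < 1/2.
    assert kappa0 * n / r < 1 / 2

    # Condition (46): kappa_0^2 * n^2 * rho <= r^2 / 32.
    assert kappa0_sq * n**2 * rho <= r**2 / 32
print("conditions (40) and (46) hold for the Step-4 choices")
```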
Step 5. We have now shown that when $r \gtrsim \sqrt{n}$, it holds that
$$\inf_{\widehat{t} \in \mathcal{D}} \sup_{P \in \mathcal{P}_{\kappa_0,\Delta}} \mathbb{E}_P\bigl\{(\widehat{t} - \Delta)_+\bigr\} \geq \frac{c \log(1/\alpha)\, r}{\sqrt{n}\,\kappa_0^2 n\rho}, \quad (50)$$
with an absolute constant $c > 0$.
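Combining (39) and (50), the two cases can be summarised in the single lower bound quoted in the abstract; in LaTeX:

```latex
% Cases 1 and 2 together: the maximum combines the two regimes of r.
\[
  \inf_{\widehat{t}\,\in\,\mathcal{D}}\;
  \sup_{P\,\in\,\mathcal{P}_{\kappa_0,\Delta}}\;
  \mathbb{E}_P\bigl\{(\widehat{t}-\Delta)_+\bigr\}
  \;\gtrsim\;
  \frac{\log(1/\alpha)}{\kappa_0^2 n\rho}
  \max\Bigl\{1,\; \frac{r}{\sqrt{n}}\Bigr\},
\]
% where (39) covers the regime r <~ sqrt(n) and (50) covers r >~ sqrt(n).
```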