Optimal network online change point localisation
Yi Yu, Oscar Hernan Madrid Padilla, Daren Wang, Alessandro Rinaldo
Department of Statistics, University of Warwick; Department of Statistics, University of California, Los Angeles; Department of Statistics, University of Notre Dame; Department of Statistics & Data Science, Carnegie Mellon University

January 15, 2021
Abstract
We study the problem of online network change point detection. In this setting, a collection of independent Bernoulli networks is collected sequentially, and the underlying distributions change when a change point occurs. The goal is to detect the change point as quickly as possible, if it exists, subject to a constraint on the number or probability of false alarms. In this paper, we establish a minimax lower bound and two upper bounds on the detection delay, based on NP-hard algorithms and on polynomial-time algorithms, namely
$$\text{detection delay}\;\begin{cases} \gtrsim \dfrac{\log(1/\alpha)\max\{r^2/n,\, 1\}}{\kappa_0^2 n\rho}, & \\[1ex] \lesssim \dfrac{\log(\Delta/\alpha)\max\{r^2/n,\, \log(r)\}}{\kappa_0^2 n\rho}, & \text{with NP-hard algorithms}, \\[1ex] \lesssim \dfrac{\log(\Delta/\alpha)\, r}{\kappa_0^2 n\rho}, & \text{with polynomial-time algorithms}, \end{cases}$$
where $\kappa_0$, $n$, $\rho$, $r$ and $\alpha$ are the normalised jump size, network size, entrywise sparsity, rank sparsity and the overall Type-I error upper bound. All the model parameters are allowed to vary as $\Delta$, the location of the change point, diverges.

The polynomial-time algorithms are novel procedures that we propose in this paper, designed for quick detection under two different forms of Type-I error control. The first is based on controlling the overall probability of a false alarm when there are no change points, and the second is based on specifying a lower bound on the expected time of the first false alarm. Extensive experiments show that, under different scenarios and the aforementioned forms of Type-I error control, our proposed approaches outperform state-of-the-art methods.

Keywords: Dynamic networks, online change point detection, minimax optimality.
In this paper we are concerned with online change point detection in dynamic networks. To be specific, we observe a sequence of independent adjacency matrices $\{A(t),\, t = 1, 2, \ldots\}$, with $\mathbb{E}\{A(t)\} = \Theta(t)$, for $t \in \mathbb{N}^+$. If there exists $t^* \geq 2$ such that $\Theta(t^*) \neq \Theta(t^* - 1)$, then we call $t^*$ a change point. Our aim is to detect the existence of such change points as soon as they occur. On the other hand, if there is no change point, then we would like to avoid false alarms. To the best of our knowledge, this problem has not been theoretically studied in the existing statistical literature.

[Figure 1: Interaction networks based on the MIT cellphone data sets. A white dot means the corresponding row and column individuals interacted on the specific date. A red dot means the lack of interaction. Details are explained in Section 4.3.]

The problem we described above is an abstraction of various real-life problems. For instance, in cybersecurity, one monitors the internet or a system and wishes to detect malicious activity as early as it starts. In finance, regulatory authorities oversee the markets and aim to stop unlawful activities at an early stage. In epidemiology, public health agencies follow the spread of a contagious disease in a community and aim to learn about changes in the spreading pattern as they happen.

As a concrete example, we consider the Massachusetts Institute of Technology (MIT) cellphone data set (Eagle and Pentland, 2006). The data set consists of human interactions measured by the cellphone activity of the participants. There were 96 participants, including students and faculty members at MIT. The data were taken from 14-Sept-2004 to 5-May-2005.

We construct two experiments to evaluate our proposed methods and our competitors. In the first example, we use the data from 14-Sept-2004 to 15-Feb-2005, which cover the MIT winter recess starting on 22-Dec-2004 and ending on 3-Jan-2005. In our second example, we use the data from 1-Jan-2005 to 5-May-2005, which cover the spring recess starting on 26-Mar-2005 and ending on 3-Apr-2005. In Figure 1, we plot the interaction networks for a few representative dates. A white dot means the corresponding row and column individuals interacted on the specific date, while a red dot means the lack of interaction. For these two examples, our proposed method detects change points at 27-Dec-2004 and 31-Mar-2005, respectively. Our competitors' change point estimators are around 30-Jan-2005 and 6-Apr-2005, respectively. Our method is clearly the best at detecting the winter and spring recess periods. Numerical details are explained in Section 4.3.

Due to the aforementioned real-world applications, change point detection problems have been intensively studied in the literature, though not necessarily in the dynamic networks context. In terms of online change point detection, i.e., making sequential decisions about the existence of change points while collecting data, Lorden (1971), Moustakides (1986), Ritov (1990), Lai (1981), Lai (1998), Lai (2001), Chu et al. (1996), Aue and Horváth (2004), Kirch (2008), Madrid Padilla et al. (2019) and Yu et al. (2020) studied univariate sequences; He et al. (2018) focused on a sequence of random graphs; Chen (2019) and Dette and Gösmann (2019) allowed for more general scenarios, including nonparametric models; Chen et al. (2020) and Keshavarz et al. (2018) studied high-dimensional Gaussian vectors. In terms of offline change point detection, i.e., seeking change points retrospectively after a sequence of data has been collected, a wide range of models have been studied. The most closely related work is Wang et al. (2018), where a sequence of adjacency matrices was considered. Further comparisons with the existing literature will be provided as we unfold our results.
The contributions of this paper are summarised below.

• To the best of our knowledge, this is the first time that online change point detection is formally analysed in a sequence of adjacency matrices, allowing all model parameters to vary as functions of $\Delta$.

• We establish minimax lower bounds on the detection delay. To the best of our knowledge, in the statistical networks literature, a lower bound involving the rank parameter has only been established in estimation problems (e.g. Gao et al., 2015), but not in the context of testing, let alone change point detection. Our lower bound matches, up to a logarithmic factor, an upper bound derived based on an NP-hard algorithm.

• In addition, we propose a computationally-efficient network online change point detection method, which comes with two variants corresponding to two different Type-I error controlling strategies. Extensive numerical results are provided to evaluate the performance of our proposed methods against state-of-the-art competitors. We also discuss tuning parameter selection aspects of our approaches.

Throughout this paper, we adopt the following notation. For any matrix $M \in \mathbb{R}^{m_1 \times m_2}$, let $\|M\|_F$ and $\|M\|_\infty = \max_{i=1}^{m_1}\max_{j=1}^{m_2}|M_{ij}|$ be the Frobenius norm and the entry-wise supremum norm of $M$, respectively. For any two matrices $A, B \in \mathbb{R}^{m_1 \times m_2}$, let $\langle A, B\rangle = \mathrm{tr}(A^\top B)$ be the Frobenius inner product of the two matrices.

Since our data are a sequence of adjacency matrices, our first task is to formally define the networks at every time point.
Definition 1 (Inhomogeneous Bernoulli networks). A network with node set $\{1, \ldots, n\}$ is an inhomogeneous Bernoulli network if its adjacency matrix $A \in \mathbb{R}^{n \times n}$ satisfies
$$A_{ij} = A_{ji} = \begin{cases} 1, & \text{nodes } i \text{ and } j \text{ are connected by an edge}, \\ 0, & \text{otherwise}; \end{cases}$$
and $\{A_{ij},\, i < j\}$ are independent Bernoulli random variables with $\mathbb{E}(A_{ij}) = \Theta_{ij}$. We refer to the matrix $\Theta$ as the graphon matrix.

This general definition includes popular network models as special cases, but it is not restricted to any specific model. Note that we allow every random variable that corresponds to an integer pair $(i, j)$, $1 \leq i < j \leq n$, to have its own mean, i.e. $\mathbb{E}(A_{ij}) = \Theta_{ij}$.

Remark 1.
Despite the flexibility it enjoys, Definition 1 is also subject to a number of restrictions. First, each random variable is assumed to be Bernoulli, which is a sub-Gaussian random variable. However, our framework can be extended to handle Poisson random variables: the algorithms we propose in the sequel apply unchanged, and the theoretical results can be adjusted by using sub-exponential concentration inequalities. Second, the adjacency matrices are assumed to be symmetric in Definition 1. All the results in this paper can be extended straightforwardly to asymmetric cases, corresponding to directed networks. Finally, the model does not allow for dependence between entries of the adjacency matrix.
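To make Definition 1 concrete, an inhomogeneous Bernoulli network can be sampled from a given graphon matrix as in the following sketch (Python with NumPy; the function name is ours, not the paper's):

```python
import numpy as np

def sample_bernoulli_network(theta, rng=None):
    """Sample a symmetric adjacency matrix A with E[A_ij] = theta_ij for i != j.

    `theta` is an n x n symmetric graphon matrix with entries in [0, 1].
    """
    rng = np.random.default_rng(rng)
    n = theta.shape[0]
    # Draw the upper triangle independently, then symmetrise.
    upper = rng.random((n, n)) < theta
    iu = np.triu_indices(n, k=1)
    A = np.zeros((n, n), dtype=int)
    A[iu] = upper[iu].astype(int)
    A = A + A.T  # symmetric, zero diagonal
    return A

# An Erdos-Renyi-type special case: constant connection probability 0.4.
theta = np.full((5, 5), 0.4)
np.fill_diagonal(theta, 0.0)
A = sample_bernoulli_network(theta, rng=0)
```

Only the upper triangle is drawn, matching the independence assumption on $\{A_{ij},\, i < j\}$ in Definition 1.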
In order to detect the change points of $\{\Theta(t),\, t = 1, 2, \ldots\}$, we adopt the network CUSUM statistic, which originates in the univariate CUSUM statistic of Page (1954) and was first formally stated in Wang et al. (2018).

Definition 2. Given a sequence of matrices $\{A(t)\}_{t=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, we define the corresponding online CUSUM statistics as
$$\widehat{A}_{s,t} = \sqrt{\frac{t-s}{st}}\sum_{l=1}^{s} A(l) - \sqrt{\frac{s}{(t-s)t}}\sum_{l=s+1}^{t} A(l),$$
for all integer pairs $(s, t)$, $t \geq 2$ and $s \in [1, t)$.

With the notation introduced in Definition 2, we have for any integer pair $(s, t)$, $1 \leq s < t$,
$$\mathbb{E}(\widehat{A}_{s,t}) = \widehat{\Theta}_{s,t} = \sqrt{\frac{t-s}{st}}\sum_{l=1}^{s} \Theta(l) - \sqrt{\frac{s}{(t-s)t}}\sum_{l=s+1}^{t} \Theta(l).$$

Algorithm 2 is our main procedure, with a subroutine detailed in Algorithm 1 and a variant in Algorithm 3. Both Algorithms 2 and 3 are written in such a way that they will not stop if no change point is detected. In practice, they can be terminated either by the user or when there are no more new data.

To motivate our algorithms, we first investigate the statistics in Definition 2. These are linear combinations of all adjacency matrices up to time point $t \geq 2$. To be specific, for $1 \leq s < t$, the corresponding statistic is a scaled difference between the sample means before and after time point $s$. In the univariate online change point detection problem (e.g. Yu et al., 2020), one scans through all possible integer pairs $(s, t)$, $1 \leq s < t$. In Algorithm 2, we propose a more efficient procedure that avoids scanning through all $s \in [1, t)$.

For every time point $t \geq 2$, in Algorithm 2, we only consider $s \in \mathcal{S}(t)$ as candidate change points, where $\mathcal{S}(t) = \{t - 2^j,\, j = 0, 1, \ldots, \lfloor \log(t)/\log(2)\rfloor - 1\}$ is a set of geometric-scale grid points. Once the criteria
$$\|\widetilde{B}_{s,t}\|_F > C\log^{1/2}(t/\alpha) \quad\text{and}\quad \langle \widehat{A}_{s,t},\, \widetilde{B}_{s,t}/\|\widetilde{B}_{s,t}\|_F\rangle > b_t \qquad (1)$$
are met, we declare that there exists a change point at $t$.

The criteria in (1) are constructed based on two independent samples $\{A(t)\}$ and $\{B(t)\}$. In practice, one can achieve this by splitting the data into odd- and even-indexed subsamples. For each integer pair $(s, t)$, the quantity $\widetilde{B}_{s,t}$ is a function of the CUSUM statistic $\widehat{B}_{s,t}$, obtained by the subroutine Algorithm 1. In fact, $\widetilde{B}_{s,t}$ is a universal singular value thresholding (USVT) estimator. The USVT algorithm was proposed in Chatterjee (2015) for the purpose of estimating low-rank, high-dimensional, sparse matrices.

The criteria (1) have two components. We first check that the USVT estimator norm $\|\widetilde{B}_{s,t}\|_F$ is large enough. In theory, this is required to obtain a near-optimal detection delay; intuitively speaking, a change point should only be declared if $\|\widetilde{B}_{s,t}\|_F$ is large. Provided that this criterion holds, we then check that the matrix inner product $\langle \widehat{A}_{s,t},\, \widetilde{B}_{s,t}/\|\widetilde{B}_{s,t}\|_F\rangle$ is large enough. The data splitting is required because, for any Bernoulli random variable $X$, $X^2 = X$; therefore, in order to detect the change in terms of the Frobenius norm, data splitting helps to estimate the squared means of Bernoulli random variables.

In view of the whole procedure, there is a sequence of tuning parameters. The tuning parameters $\{\tau_{j,s,u},\, j = 1, 2,\, u = 1, 2, \ldots,\, s \in \mathcal{S}(u)\}$ are used in the subroutine Algorithm 1. As suggested in Xu (2018), the parameters $\tau_{1,\cdot,\cdot}$ serve as cutoffs of the upper bound on sample fluctuations, and the parameters $\tau_{2,\cdot,\cdot}$ are chosen to be the entry-wise maximum norms of the matrices of interest.
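The ingredients just described, the network CUSUM of Definition 2, the USVT subroutine of Algorithm 1 and the criteria in (1), can be sketched as follows. This is our own simplified Python/NumPy illustration, not the paper's code: the thresholds `b_t`, `c_t` (playing the role of $C\log^{1/2}(t/\alpha)$), `tau1` and `tau2` are passed in as plain numbers rather than set to their theoretical values.

```python
import numpy as np

def cusum(A_list, s, t):
    """Online network CUSUM statistic of Definition 2 for the pair (s, t)."""
    A = np.stack(A_list[:t]).astype(float)  # A[l-1] corresponds to A(l)
    left = np.sqrt((t - s) / (s * t)) * A[:s].sum(axis=0)
    right = np.sqrt(s / ((t - s) * t)) * A[s:t].sum(axis=0)
    return left - right

def usvt(M, tau1, tau2):
    """Universal singular value thresholding (Algorithm 1) for symmetric M."""
    lam, V = np.linalg.eigh(M)
    keep = np.abs(lam) >= tau1
    M1 = (V[:, keep] * lam[keep]) @ V[:, keep].T  # keep only large eigenvalues
    return np.clip(M1, -tau2, tau2)  # entry-wise truncation at level tau2

def detection_step(A_list, B_list, t, b_t, c_t, tau1, tau2):
    """One pass over the geometric grid S(t); True if a change point is declared."""
    J = int(np.floor(np.log2(t)))
    for j in range(J):
        s = t - 2 ** j
        B_tilde = usvt(cusum(B_list, s, t), tau1, tau2)
        norm = np.linalg.norm(B_tilde)  # Frobenius norm
        # The two criteria in (1); short-circuiting avoids dividing by zero.
        if norm > c_t and np.sum(cusum(A_list, s, t) * B_tilde) / norm > b_t:
            return True
    return False
```

The outer loop of Algorithm 2 simply calls `detection_step` for $t = 2, 3, \ldots$ until it returns `True`.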
The tuning parameter $\alpha \in (0, 1)$ is the tolerance of Type-I errors and acts as an upper bound on the probability of returning at least one false alarm. The thresholds $\{b_t\}$ are upper bounds on the inner products when there is no change point. More detailed discussions and guidance on tuning parameter selection are provided in Sections 3 and 4.

Algorithm 1
Universal Singular Value Thresholding. USVT$(A, \tau_1, \tau_2)$
INPUT: Symmetric matrix $A \in \mathbb{R}^{n\times n}$, $\tau_1, \tau_2 > 0$.
  $(\lambda_i, v_i) \leftarrow$ the $i$th eigen-pair of $A$, with $|\lambda_1| \geq \cdots \geq |\lambda_n|$;
  $A_1 \leftarrow \sum_{i:\, |\lambda_i| \geq \tau_1} \lambda_i v_i v_i^\top$;
  $A_2 \leftarrow$ a matrix with $(i, j)$th entry $(A_2)_{ij}$ satisfying
  $$(A_2)_{ij} = \begin{cases} (A_1)_{ij}, & |(A_1)_{ij}| \leq \tau_2, \\ \mathrm{sign}((A_1)_{ij})\,\tau_2, & |(A_1)_{ij}| > \tau_2. \end{cases}$$
OUTPUT: $A_2$.

In addition, we also present a variant of Algorithm 2 in Algorithm 3. The main difference between these two algorithms is that the tuning parameter $\alpha \in (0, 1)$ is replaced by $\gamma \in \mathbb{N}^+$ in Algorithm 3, and the inputs are changed correspondingly. These two algorithms represent two popular ways of controlling Type-I errors. The tuning parameter $\gamma$ is in fact a lower bound on the average run length; in other words, a choice of $\gamma$ implies that, when there is no change point, the expected time of the first false alarm is at least $\gamma$.

There is no algorithmic difference between Algorithms 2 and 3. Their theoretical differences will be explained in Section 3, and the tuning parameter selection differences will be discussed in Section 4.

Algorithm 2 Network online change point detection
INPUT: $\{A(u), B(u)\}_{u=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, $\{b_u, \tau_{1,s,u}, \tau_{2,s,u},\, u = 1, 2, \ldots,\, s = 1, \ldots, u\} \subset \mathbb{R}$, $\alpha \in (0, 1)$.
$t \leftarrow 1$; FLAG $\leftarrow 0$;
while FLAG $= 0$ do
  $t \leftarrow t + 1$; $J \leftarrow \lfloor\log(t)/\log(2)\rfloor$; $j \leftarrow 0$;
  while $j < J$ and FLAG $= 0$ do
    $s_j \leftarrow t - 2^j$;
    $\widetilde{B}_{s_j,t} \leftarrow \mathrm{USVT}(\widehat{B}_{s_j,t}, \tau_{1,s_j,t}, \tau_{2,s_j,t})$;
    FLAG $\leftarrow \mathbf{1}\{\langle \widehat{A}_{s_j,t},\, \widetilde{B}_{s_j,t}/\|\widetilde{B}_{s_j,t}\|_F\rangle > b_t\}\,\mathbf{1}\{\|\widetilde{B}_{s_j,t}\|_F > C\log^{1/2}(t/\alpha)\}$;
    $j \leftarrow j + 1$;
  end while
end while
OUTPUT: $t$.

Algorithm 3
Network online change point detection – a variant
INPUT: $\{A(u), B(u)\}_{u=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, $\{b_u, \tau_{1,s,u}, \tau_{2,s,u},\, u = 1, 2, \ldots,\, s = 1, \ldots, u\} \subset \mathbb{R}$, $\gamma \in \mathbb{N}^+$.
$t \leftarrow 1$; FLAG $\leftarrow 0$;
while FLAG $= 0$ do
  $t \leftarrow t + 1$; $J \leftarrow \lfloor\log(t)/\log(2)\rfloor$; $j \leftarrow 0$;
  while $j < J$ and FLAG $= 0$ do
    $s_j \leftarrow t - 2^j$;
    $\widetilde{B}_{s_j,t} \leftarrow \mathrm{USVT}(\widehat{B}_{s_j,t}, \tau_{1,s_j,t}, \tau_{2,s_j,t})$;
    FLAG $\leftarrow \mathbf{1}\{\langle \widehat{A}_{s_j,t},\, \widetilde{B}_{s_j,t}/\|\widetilde{B}_{s_j,t}\|_F\rangle > b_t\}\,\mathbf{1}\{\|\widetilde{B}_{s_j,t}\|_F > C\log^{1/2}(\gamma)\}$;
    $j \leftarrow j + 1$;
  end while
end while
OUTPUT: $t$.

This section consists of all the theoretical results we develop in this paper, with all the technical details deferred to the Appendix. This section is organised as follows. All the assumptions are stated and discussed in Section 3.1. The theoretical guarantees of Algorithms 2 and 3 are provided in Section 3.2. To investigate the fundamental limits, we establish a minimax lower bound on the detection delay in Section 3.3, together with a nearly minimax optimal NP-hard procedure studied in Section 3.4. To conclude, we provide additional discussion through comparisons with existing work in Section 3.5.

3.1 Assumptions
Before arriving at our main results, we start by introducing some assumptions. The following three assumptions introduce the sparsity parameter, and describe the one change point and no change point scenarios, respectively.
Assumption 1. Assume that $\{A(1), A(2), \ldots\} \subset \mathbb{R}^{n\times n}$ is a sequence of inhomogeneous Bernoulli networks satisfying $\mathbb{E}\{A(i)\} = \Theta(i) \in \mathbb{R}^{n\times n}$, $i = 1, 2, \ldots$, and $\sup_{i=1,2,\ldots}\|\Theta(i)\|_\infty = \rho$, where $\rho n \geq \log(n)$.

Assumption 2 (One change point scenario). Assume that there exists $\Delta \in \mathbb{N}^*$ such that $\Theta(1) = \cdots = \Theta(\Delta) = \Theta_1$ and $\Theta(\Delta+1) = \Theta(\Delta+2) = \cdots = \Theta_2$. In addition, let
$$\kappa_0 = \frac{\kappa}{n\rho} = \frac{\|\Theta_1 - \Theta_2\|_F}{n\rho} > 0 \quad\text{and}\quad r = \mathrm{rank}(\Theta_1 - \Theta_2).$$

Assumption 3 (No change point scenario). Assume that $\Theta(1) = \Theta(2) = \cdots = \Theta_1$.

In the one change point scenario, in view of Assumptions 1 and 2, we see that the change point detection problem is characterised by the following parameters: the network size $n$, the entry-wise sparsity parameter $\rho$, the size of the uncontaminated sample $\Delta$, the normalised jump size $\kappa_0$, and the low-rank parameter $r$.

The jump size $\kappa$ is defined to be the Frobenius norm of the difference between two consecutive but distinct graphon matrices. The choice of the Frobenius norm is tailored to the context of dynamic networks. Arguably, the most popular statistical network model is the stochastic block model (Holland et al., 1983). If both $\Theta_1$ and $\Theta_2$ are graphon matrices of stochastic block models, then the Frobenius norm can explicitly reflect the magnitude of the change, as compared to other matrix norms including the operator norm and the supremum norm. For instance, if the community structure stays unchanged, but the connection probability changes from $p_1$ to $p_2$ in a community of size $n/2$, then $\kappa = n|p_1 - p_2|/2$. If the between- and within-community probabilities, $p_1$ and $p_2$, remain the same, but the community structure changes from a balanced 2-community network to a balanced 3-community network, then $\kappa = c_0\, n|p_1 - p_2|$, where $c_0 \in (0, 1)$ is an absolute constant determined by the two partitions. Since $\kappa \in (0, n\rho]$, the normalised jump size $\kappa_0$ is scale free and satisfies $\kappa_0 \in (0, 1]$.

Without further restrictions, the low-rank parameter $r$ is allowed to take any value in $\{1, \ldots, n\}$. Note that the parameter $r$ is imposed on the difference matrix, and we allow for arbitrary structure of each graphon per se.

Recall that our missions are as follows. When there is a change point, we wish to declare its existence as soon as it appears. The distance between the change point estimator and the change point is called the detection delay, which is to be minimised. On the other hand, it is also vital to control the false alarms. When there is no change point, we either control the probability of declaring change points, or the expected time of the first false alarm. These two ways of controlling false alarms correspond to Algorithms 2 and 3, whose theoretical guarantees are provided in Theorem 1 and Corollary 2, respectively. In the results presented in this section, we assume the existence of two independent sequences of adjacency matrices. In practice, this can be achieved by splitting the data sequence into odd- and even-indexed subsequences.
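As a sanity check on the first example above, the Frobenius norm computation can be written out explicitly (ignoring the diagonal entries, which only contribute lower-order terms):

```latex
\[
\Theta_1 - \Theta_2 =
\begin{pmatrix}
(p_1 - p_2)\,\mathbf{1}\mathbf{1}^\top & 0 \\
0 & 0
\end{pmatrix},
\qquad
\kappa = \|\Theta_1 - \Theta_2\|_F
= \sqrt{\Big(\frac{n}{2}\Big)^2 (p_1 - p_2)^2}
= \frac{n\,|p_1 - p_2|}{2},
\]
```

where the nonzero block is of size $(n/2) \times (n/2)$ and $\mathbf{1} \in \mathbb{R}^{n/2}$ is the all-ones vector. The difference matrix is rank one, so this example also attains the smallest possible value of the low-rank parameter $r$.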
Theorem 1.
For any $\alpha \in (0, 1)$, assume that the data $\{A(t), B(t)\}_{t=1,2,\ldots}$ are two independent sequences of adjacency matrices satisfying Assumption 1. Let $\widehat{t}$ be the output of Algorithm 2 with inputs $\{A(t), B(t)\}_{t=1,2,\ldots}$,
$$b_u = C\sqrt{\rho\log\Big(\frac{u}{\alpha}\Big)}, \quad \tau_{1,s,u} = C\left[\sqrt{n\rho} + \sqrt{\log\left\{\frac{u(u+1)\log(u)}{\alpha\log(2)}\right\}}\,\right] \quad\text{and}\quad \tau_{2,s,u} = \sqrt{\frac{(u-s)s}{u}}\,\rho.$$

(i) (No change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 3, then it holds that
$$\mathbb{P}\left\{\bigcap_{m\in\mathbb{N}^+}\{\widehat{t} > m\}\right\} > 1 - \alpha.$$

(ii) (One change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 2, and there exists a large enough absolute constant $C_{\mathrm{SNR}} > 0$ such that
$$\Delta\kappa_0^2 n\rho > C_{\mathrm{SNR}}\, r\log(\Delta/\alpha), \qquad (2)$$
then
$$\mathbb{P}\left\{0 < \widehat{t} - \Delta < C_d\,\frac{r n\rho\log(\Delta/\alpha)}{\kappa^2} = C_d\,\frac{r\log(\Delta/\alpha)}{\kappa_0^2 n\rho}\right\} > 1 - \alpha,$$
where $C, C_d > 0$ are absolute constants.

We can see from Theorem 1(i) that, when there is no change point, with probability at least $1 - \alpha$, Algorithm 2 will not raise any false alarm. On the other hand, if there is a change point, then it follows from Theorem 1(ii) that the detection delay is at most of order
$$\frac{r\log(\Delta/\alpha)}{\kappa_0^2 n\rho},$$
with probability at least $1 - \alpha$.

In fact, condition (2) can be regarded as a signal-to-noise ratio condition and is a mild constraint. We list a few special cases here.

• (Small sample size.) If $\kappa_0, \rho, r, \alpha \asymp 1$, then (2) holds as long as $\Delta \gtrsim \log(\Delta)/n$. If the network size $n$ is large, this shows that Algorithm 2 can detect change points with a very small number of uncontaminated samples.

• (Large rank matrices.) If $\kappa_0 \asymp 1$, $\rho \asymp \log(n)/n$, $\alpha \asymp 1$ and $\Delta \asymp n$, then the rank parameter is allowed to be $r \asymp n$. This means that, provided the size of the uncontaminated sample is comparable with the size of the networks, it is not necessary to impose a low-rank assumption on the difference of the graphons.

• (Small jump size.) If $\rho, r, \Delta, \alpha \asymp 1$, then condition (2) holds provided that $\kappa_0 \gtrsim n^{-1/2}$. This means that the normalised jump size can decrease to zero as the size of the network diverges.

Finally, we remark on the choices of the tuning parameters. As we have mentioned in Section 2, the tuning parameters $\{\tau_{1,s,u}\}$ are the cutoffs due to the low-rank structure and $\{\tau_{2,s,u}\}$ bound the entry-wise maximum norms. The theoretical choices of these two sets of parameters can be found in Lemma 9, with the aim of ensuring the good performance of the USVT estimators. The tuning parameter $\alpha$ is completely determined by practitioners, reflecting the tolerance of Type-I errors. The sequence $\{b_u\}$ reflects an upper bound on the statistics' fluctuations when there is no change point; the rate of $\{b_u\}$ is determined in Lemma 8.

Corollary 2.
For any $\gamma > 1$, assume that the data $\{A(t), B(t)\}_{t=1,2,\ldots}$ are two independent sequences of adjacency matrices satisfying Assumption 1. Let $\widehat{t}$ be the output of Algorithm 3 with inputs $\{A(t), B(t)\}_{t=1,2,\ldots}$,
$$b_u = C\sqrt{\rho\log(\gamma)}, \quad \tau_{1,s,u} = C\left[\sqrt{n\rho} + \sqrt{\log\left\{\frac{\gamma(\gamma+1)\log(\gamma+1)}{\log(2)}\right\}}\,\right] \quad\text{and}\quad \tau_{2,s,u} = \sqrt{\frac{(u-s)s}{u}}\,\rho.$$

(i) (No change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 3, then $\mathbb{E}(\widehat{t}\,) \geq \gamma$, where $\mathbb{E}(\widehat{t}\,)$, under Assumption 3, is called the average run length.

(ii) (One change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 2, and it holds that
$$\gamma \geq \Delta \quad\text{and}\quad \Delta\kappa_0^2 n\rho > C_{\mathrm{SNR}}\, r\log(\gamma), \qquad (3)$$
where $C_{\mathrm{SNR}} > 0$ is a large enough absolute constant, then
$$\mathbb{P}\left\{0 < \widehat{t} - \Delta < C_d\,\frac{r n\rho\log(\gamma)}{\kappa^2} = C_d\,\frac{r\log(\gamma)}{\kappa_0^2 n\rho}\right\} > 1 - \gamma^{-1}, \qquad (4)$$
where $C, C_d > 0$ are absolute constants.

Corollary 2 is Theorem 1's counterpart based on Algorithm 3. Comparing Theorem 1(i) and Corollary 2(i) shows that Algorithms 2 and 3 control false alarms via different strategies. Theorem 1(i) shows that the Type-I error across the whole time horizon is upper bounded by $\alpha$ if Algorithm 2 is deployed. In contrast, Corollary 2(i) ensures that, if Algorithm 3 is used, then the expected time of the first false alarm is at least $\gamma$.

Both of these ways of controlling false alarms are widely used in the literature. Theorem 1 and Corollary 2 show that, if
$$\gamma \geq \Delta \quad\text{and}\quad \gamma \asymp \Delta/\alpha, \qquad (5)$$
then the two methods provide the same order of detection delay. If $\gamma < \Delta$, then the same localisation rate as in Corollary 2 holds for $\max\{\widehat{t} - \Delta, 0\}$ instead of $\widehat{t} - \Delta$.

Finally, we summarise the differences between Algorithms 2 and 3. Since both $\alpha$ and $\gamma$ reflect the preferences on Type-I error control, these two tuning parameters can be specified by the users. Although we have provided theoretical guidance on all the other tuning parameters, in practice they still involve either unknown quantities or unspecified constants. In order to tune these parameters, Algorithm 3 might be handier: one may have access to historical data from the pre-change-point distribution, and tune all the parameters such that the average run length is $\gamma$. This is less natural under the strategy of Algorithm 2, unless the whole time course has a pre-specified endpoint, since the Type-I error is controlled across the whole time course. This is in fact how we tune the parameters for Algorithm 2 in Section 4.

In Theorem 1, we showed that we are able to detect change points with the order of the detection delay upper bounded by
$$\frac{r\log(\Delta/\alpha)}{\kappa_0^2 n\rho}. \qquad (6)$$
In this subsection, we investigate the optimality of this upper bound.

Proposition 3.
Assume that $\{A(t)\}_{t=1,2,\ldots}$ is a sequence of independent adjacency matrices satisfying Assumptions 1 and 2. Denote the joint distribution of $\{A(t)\}_{t=1,2,\ldots}$ by $P_{\kappa,\Delta}$. Consider the class of estimators
$$\mathcal{D} = \{T:\, T \text{ is a stopping time and satisfies } \mathbb{P}_\infty(T < \infty) \leq \alpha\},$$
where $\mathbb{P}_\infty$ indicates $\Delta = \infty$. Then for sufficiently small $\alpha \in (0, 1)$, there exists an absolute constant $c > 0$ such that
$$\inf_{\widehat{t}\,\in\mathcal{D}}\;\sup_{P_{\kappa,\Delta}}\;\mathbb{E}_P\big\{(\widehat{t} - \Delta)_+\big\} \geq c\,\frac{\log(1/\alpha)}{\kappa_0^2 n\rho}\max\left\{1,\, \frac{r^2}{n}\right\}.$$

The change point estimators considered here are all stopping times whose overall Type-I error is controlled by $\alpha \in (0, 1)$. The rate of the detection delay is lower bounded by
$$\frac{\log(1/\alpha)}{\kappa_0^2 n\rho}\max\left\{1,\, \frac{r^2}{n}\right\}.$$
This means that in the low-rank regime $r \lesssim \sqrt{n}$, we have the lower bound $\log(1/\alpha)(\kappa_0^2 n\rho)^{-1}$; in the large-rank regime $r \gtrsim \sqrt{n}$, the lower bound is of the order $\log(1/\alpha)\, r^2(\kappa_0^2 n^2\rho)^{-1}$. In view of Proposition 3, we see that (6) is nearly optimal, saving for a logarithmic factor, only in the extreme regimes, i.e. $r \asymp n$ or $r \asymp 1$. Recall from Gao et al. (2015) that the minimax rate of estimating a graphon $\Theta \in \mathbb{R}^{n\times n}$ is
$$\inf_{\widehat{\Theta}}\;\sup_{\Theta}\;\mathbb{E}\left\{\frac{1}{n^2}\big\|\widehat{\Theta} - \Theta\big\|_F^2\right\} \asymp \frac{r^2 + n\log(r)}{n^2},$$
where the upper bound is achieved by an NP-hard algorithm. In fact, we can also adopt NP-hard procedures to match the lower bound in Proposition 3, up to logarithmic factors.

3.4 An NP-hard procedure on stochastic block models

In this subsection, we first focus on the stochastic block models (Holland et al., 1983).
Definition 3 (Sparse stochastic block model). A network is from a sparse stochastic block model with size $n$, sparsity parameter $\rho$, membership matrix $Z \in \mathbb{R}^{n\times r}$ and connectivity matrix $Q \in [0, 1]^{r\times r}$ if the corresponding adjacency matrix satisfies
$$\mathbb{E}(A) = \rho ZQZ^\top - \mathrm{diag}\big(\rho ZQZ^\top\big).$$
The membership matrix $Z$ consists of $n$ rows, each of which has one and only one entry equal to 1 and all other entries equal to 0; moreover, $Z$ has full column rank, i.e. $\mathrm{rank}(Z) = r$. The sparsity parameter $\rho \in [0, 1]$ potentially depends on $n$.

As we have already pointed out, stochastic block models are special cases of the inhomogeneous Bernoulli networks defined in Definition 1.

In Gao et al. (2015), an NP-hard estimator of stochastic block models' graphons was proposed. In this subsection, we replace the USVT estimator, defined in Algorithm 1 and used in Algorithm 2, by the NP-hard estimator studied in Gao et al. (2015). For completeness, we include the estimator construction below.
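The mean matrix of Definition 3 can be sketched as follows (Python with NumPy; the function name and the label-vector parameterisation of $Z$ are ours):

```python
import numpy as np

def sbm_graphon(z, Q, rho):
    """Mean matrix E[A] = rho * Z Q Z^T - diag(rho * Z Q Z^T) from Definition 3.

    `z` is a length-n vector of community labels in {0, ..., r-1},
    `Q` is an r x r symmetric connectivity matrix with entries in [0, 1].
    """
    z = np.asarray(z)
    n, r = len(z), Q.shape[0]
    Z = np.zeros((n, r))
    Z[np.arange(n), z] = 1.0  # one-hot membership matrix, full column rank
    theta = rho * Z @ Q @ Z.T
    np.fill_diagonal(theta, 0.0)  # subtract the diagonal: no self-loops
    return theta

# A balanced 2-community example: within-community probability 0.8 * rho,
# between-community probability 0.2 * rho.
Q = np.array([[0.8, 0.2], [0.2, 0.8]])
theta = sbm_graphon([0, 0, 1, 1], Q, rho=0.5)
```

Feeding `theta` to a Bernoulli sampler yields adjacency matrices from the sparse stochastic block model.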
Definition 4 (An NP-hard graphon estimator). For any positive integers $n$ and $r$, $r \leq n$, let $\mathcal{Z}_{n,r} = \{z:\, \{1, \ldots, n\} \to \{1, \ldots, r\}\}$ be the collection of all possible mappings from $\{1, \ldots, n\}$ to $\{1, \ldots, r\}$. Given an adjacency matrix $A = (A_{ij}) \in \mathbb{R}^{n\times n}$, any $z \in \mathcal{Z}_{n,r}$ and any $Q = (Q_{ab}) \in \mathbb{R}^{r\times r}$, define the objective function
$$L(Q, z) = \sum_{a,b\in\{1,\ldots,r\}}\;\sum_{\substack{(i,j)\in z^{-1}(a)\times z^{-1}(b) \\ i\neq j}} (A_{ij} - Q_{ab})^2.$$
For any optimiser of the objective function
$$(\widehat{Q}, \widehat{z}) \in \arg\min_{Q\in\mathbb{R}^{r\times r},\, z\in\mathcal{Z}_{n,r}} L(Q, z),$$
the estimator is defined as $\breve{\Theta} = (\breve{\Theta}_{ij})_{i,j=1}^n \in \mathbb{R}^{n\times n}$, with $\breve{\Theta}_{ij} = \breve{\Theta}_{ji} = \widehat{Q}_{\widehat{z}_i\widehat{z}_j}$, $i > j$, and $\breve{\Theta}_{ii} = 0$. For notational simplicity, we write $\breve{\Theta} = \mathrm{NP}(A, r)$.

The new procedure for change point detection replaces the USVT subroutine in Algorithm 2 with the estimator detailed in Definition 4. We present the full algorithm below.
Corollary 4.
For any $\alpha \in (0, 1)$, assume that the data $\{A(t), B(t)\}_{t=1,2,\ldots}$ are two independent sequences of adjacency matrices satisfying Assumption 1. Let $\widehat{t}$ be the output of Algorithm 4 with inputs $\{A(t), B(t)\}_{t=1,2,\ldots}$ and $b_u = C\sqrt{\rho\log(u/\alpha)}$.

Algorithm 4 Network online change point detection - NP-hard
INPUT: $\{A(u), B(u)\}_{u=1,2,\ldots} \subset \mathbb{R}^{n\times n}$, $\{b_u,\, u = 1, 2, \ldots\} \subset \mathbb{R}$, $\alpha \in (0, 1)$, $r_0 \in \mathbb{N}^*$.
$t \leftarrow 1$; FLAG $\leftarrow 0$;
while FLAG $= 0$ do
  $t \leftarrow t + 1$; $J \leftarrow \lfloor\log(t)/\log(2)\rfloor$; $j \leftarrow 0$;
  while $j < J$ and FLAG $= 0$ do
    $s_j \leftarrow t - 2^j$;
    $\breve{B}_{s_j,t} \leftarrow \mathrm{NP}(\widehat{B}_{s_j,t}, r_0)$;
    FLAG $\leftarrow \mathbf{1}\{\langle \widehat{A}_{s_j,t},\, \breve{B}_{s_j,t}/\|\breve{B}_{s_j,t}\|_F\rangle > b_t\}\,\mathbf{1}\{\|\breve{B}_{s_j,t}\|_F > C\log^{1/2}(t/\alpha)\}$;
    $j \leftarrow j + 1$;
  end while
end while
OUTPUT: $t$.

(i) (No change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 3, then
$$\mathbb{P}\left\{\bigcap_{m\in\mathbb{N}^+}\{\widehat{t} > m\}\right\} > 1 - \alpha.$$

(ii) (One change point.) If $\{A(t), B(t)\}_{t=1,2,\ldots}$ in addition satisfy Assumption 2, the input $r_0 \geq r$, and there exists a large enough absolute constant $C_{\mathrm{SNR}} > 0$ such that
$$\Delta\kappa_0^2 n^2\rho > C_{\mathrm{SNR}}\{r^2 + n\log(r)\}\log(\Delta/\alpha),$$
then
$$\mathbb{P}\left[0 < \widehat{t} - \Delta < C_d\,\frac{\{r^2 + n\log(r)\}\rho\log(\Delta/\alpha)}{\kappa^2} = C_d\,\frac{\{r^2/n + \log(r)\}\log(\Delta/\alpha)}{\kappa_0^2 n\rho}\right] > 1 - \alpha,$$
where $C, C_d > 0$ are absolute constants.

Corollary 4 provides an upper bound on the detection delay matching the lower bound in Proposition 3, saving for logarithmic factors. However, the detection delay in Corollary 4 is based on the NP-hard procedure in Algorithm 4, which has limited practical value.

Comparing the results in Theorem 1 and Corollary 4, we see that in the extreme regimes, i.e. $r \asymp 1$ or $r \asymp n$, the detection delays obtained by the proposed polynomial-time and NP-hard algorithms achieve the same rates, and both estimators are nearly optimal, saving for logarithmic factors. Between the two extreme cases $r \asymp 1$ and $r \asymp n$, the NP-hard algorithm achieves sharper rates than the polynomial-time algorithm. This phenomenon is in line with the computational and statistical tradeoffs observed in other high-dimensional statistical problems, e.g. Zhang et al. (2012) and Loh and Wainwright (2013), to name but a few.

3.5 Comparisons with existing work

With all the theoretical results at hand, we are ready to provide some in-depth comparisons with existing work.
Since we believe that our paper is the first to provide theoretical results for network (in the sense of random matrices) online change point detection problems, the four papers we select in this subsection are all concerned with different but related problems.

Chen (2019) establishes a general framework for online change point detection. Provided a suitable notion of distance, a $k$-nearest-neighbour-based test statistic is used for testing the existence of change points in a sequential manner. In Section 4, we consider three different statistics such that the methods from Chen (2019) can be used as our competitors. As for the theoretical results, Chen (2019) focused on the average run length. In our paper, we provide a range of results, including the detection delay, the average run length and minimax lower bounds.

In the statistics literature, the term "network" sometimes refers to the precision matrices in Gaussian graphical models, which are different from what we study in this paper. Keshavarz et al. (2018) and Keshavarz and Michailidis (2020) studied online change point detection in Gaussian graphical models. In addition to the model differences, both Keshavarz et al. (2018) and Keshavarz and Michailidis (2020) focused on the limiting distributions of the test statistics under the null and alternative distributions. We conjecture that a detection delay might be obtainable based on the results therein, but such results are not yet explicit. On the other hand, in our paper, especially based on Theorem 1, it is immediate that the overall probabilities of falsely detecting change points and of missing change points are both upper bounded by $\alpha$.

Wang et al. (2018) investigated an offline network change point detection problem, where a sequence of independent adjacency matrices is collected and change point estimators are sought retrospectively. Despite the differences, there are some interesting comparisons which, to some extent, reflect the connections between online and offline change point detection problems.

(1) The detection delay in the online setting can be seen as the counterpart of the localisation error in the offline setting. The minimax lower bound on the localisation error in Wang et al. (2018) is of order $(\kappa_0^2 n^2\rho)^{-1}$, while the minimax lower bound on the detection delay in this paper is of order $\log(1/\alpha)(\kappa_0^2 n\rho)^{-1}(r^2/n + 1)$. The extra $\log(1/\alpha)$ term is rooted in the fact that we need to control the Type-I error in online settings. The other differences are more interesting: the offline rate is better than the online rate. This is because the de facto smallest sample size for a given distribution in the offline scenario is $\Delta$, while in the online scenario it is $\min\{\Delta,\, \text{detection delay}\}$.

(2) In both online and offline settings, we observe a computational and statistical tradeoff. Comparing Theorem 1 and Corollary 4, we see that NP-hard estimators can detect change points under a weaker condition and provide a smaller detection delay. In the offline setting, as Wang et al. (2018) conjectured, by replacing the USVT estimator with the NP-hard estimator in Definition 4, one can achieve a nearly optimal localisation error under a weaker condition than the one needed by the USVT estimator.

4 Numerical experiments
Recall that in Section 2, we proposed a network online change point detection method in Algorithm 2, with a subroutine in Algorithm 1 and a variant in Algorithm 3. In this section, we investigate the numerical performance of our proposed methods. Since there is no direct competitor available, we tailor the k-nearest-neighbours-type method proposed in Chen (2019).

In order to make a fair comparison with Chen (2019), we consider its three different statistics: the "original" (ORI), which specifies the original edge-count scan statistic; the weighted edge-count scan statistic (W); and the generalised edge-count scan statistic (G). The statistics are computed with internal functions in the R (R Core Team, 2020) package gStream (Chen and Chu, 2019).

We consider two different forms of calibration. The first is based on the probability of raising a false alarm. Using 200 Monte Carlo simulations and values of $\alpha \in \{0.01, 0.05\}$, we choose the detection thresholds such that the probability of raising a false alarm in the interval $[1, T_{\mathrm{train}}]$ is $\alpha$. The values of $T_{\mathrm{train}}$ are taken from the set $\{150, 200\}$. The second is based on the average run length $\gamma$. We consider two values of $\gamma$ and calibrate the thresholds of the competing methods, based on 200 Monte Carlo simulations, to have average run length approximately $\gamma$ under the pre-change model. For the data splitting required by Algorithms 2 and 3, in all of our experiments, the sequence of adjacency matrices $\{A(u)\}$ consists of the odd indices of the original sequence, and the sequence $\{B(u)\}$ of the even ones.

To evaluate the performance of different methods, we proceed as follows. For each generative model described below, we run $N =$
100 Monte Carlo simulations, where in each trial the data are collected in the interval $[1, T]$, and we denote by $\widehat t$ the declared change point, which can exceed $T$ if no change point is detected in $[1, T]$. We define $\widetilde t = \min\{T, \widehat t\,\}$ and compute the average detection delay

$$\mathrm{Delay} = \frac{\sum_{j=1}^N \mathbb{1}\{\widetilde t_j \ge \Delta\}(\widetilde t_j - \Delta)}{\sum_{j=1}^N \mathbb{1}\{\widetilde t_j \ge \Delta\}}.$$

We also report the proportion of false alarms

$$\mathrm{PFA} = \frac{\sum_{j=1}^N \mathbb{1}\{\widetilde t_j < \Delta\}}{N}.$$
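The two performance metrics above can be computed along the following lines (a minimal sketch; the function name and the toy detection times are ours, not the paper's):

```python
import numpy as np

def delay_and_pfa(t_hat, delta, T):
    """Average detection delay and proportion of false alarms over N trials.

    t_hat : detection times from N Monte Carlo trials (np.inf when no change
            point was declared in [1, T]).
    delta : true change point location; T : end of the observation window.
    """
    t_hat = np.asarray(t_hat, dtype=float)
    r = np.minimum(T, t_hat)                 # r_j = min(T, t_hat_j)
    detected = r >= delta                    # alarms at or after the change
    delay = (r[detected] - delta).mean() if detected.any() else np.nan
    pfa = (r < delta).mean()                 # alarms strictly before the change
    return delay, pfa

# Toy illustration: three true alarms (one censored at T) and one false alarm.
delay, pfa = delay_and_pfa([110, 105, np.inf, 90], delta=100, T=200)
```

Trials with no declared change point are censored at $T$ and therefore contribute $T - \Delta$ to the average delay, matching the definition of $\widetilde t$ above.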
As for Algorithm 2, guided by Theorem 1, we set

$$\tau_{1,s,u} = c_1\sqrt{n\widehat\rho} + \sqrt{2\log\left(\frac{(u-s)(u-s+1)}{\alpha}\right)} \quad \text{and} \quad \tau_{2,s,u} = \sqrt{\frac{s(u-s)}{u}}\,\widehat\rho,$$

where $c_1 > 0$ is an absolute constant and $\widehat\rho$ is an estimator of $\rho$, calculated as an upper quantile of the quantities

$$\widehat p_{i,j} = T^{-1}\sum_{t=1}^{T} A_{ij}(t), \quad i, j \in \{1, \ldots, n\},\ i < j, \quad (7)$$

where the matrices $\{A(t)\}_{t=1}^T$ are part of the training data. In addition, we set

$$b_u = C\sqrt{\widehat\rho\,\log\left(\frac{u}{\alpha}\right)},$$

with $C$ tuned to give the desired false alarm rate.
With respect to Algorithm 3, guided by Corollary 2, we let

$$\tau_{1,s,u} = c_1\sqrt{n\widehat\rho} + \sqrt{2\log\left(2(\gamma+1)\right)} \quad \text{and} \quad \tau_{2,s,u} = \sqrt{\frac{s(u-s)}{u}}\,\widehat\rho,$$

where $\widehat\rho$ is the same quantile-based estimator as above, and

$$b_u = C\sqrt{\widehat\rho\,\log(\gamma)},$$

with the constant $C$ calibrated such that, before the change point, the expected time of the first false alarm is $\gamma$.

4.1 Simulations

We consider four different settings.

Scenario 1.
This consists of a stochastic block model with 3 communities of sizes $\lfloor n/3\rfloor$, $\lfloor n/3\rfloor$ and $n - 2\lfloor n/3\rfloor$. The network size $n$ takes values in $\{100, 150\}$. Denoting by $z_i$ the label of the community associated with node $i \in \{1, \ldots, n\}$, the data are generated as

$$A_{ij}(t) \overset{\mathrm{ind.}}{\sim} \mathrm{Bernoulli}\left(\rho B_{z_i z_j}(t)\right),$$

where $\rho = 0.02$, and the matrices $B(t)$ are equal to a fixed $3 \times 3$ connectivity matrix for $t \in \{1, \ldots, \Delta\}$ and to a different fixed $3 \times 3$ connectivity matrix for $t \in \{\Delta+1, \Delta+2, \ldots, T\}$.

Scenario 2.
This is also a stochastic block model. We now take the number of communities to be 5 and the number of nodes in each community to be $n/5$, with $n \in \{100, 150\}$. Again we set $\rho = 0.02$. The matrices $B(t)$ are equal to a fixed $5 \times 5$ connectivity matrix for $t \in \{1, \ldots, \Delta\}$ and to a different fixed $5 \times 5$ connectivity matrix for $t \in \{\Delta+1, \Delta+2, \ldots, T\}$.

Scenario 3. We consider a degree corrected block model (Karrer and Newman, 2011) with 3 communities of sizes $\lfloor n/3\rfloor$, $\lfloor n/3\rfloor$ and $n - 2\lfloor n/3\rfloor$, where $n \in \{100, 150\}$. Let $z_i$ be the community to which node $i$ belongs, and define the degree parameters $v_i = \sqrt{i/n}$. The data are then generated as

$$A_{ij}(t) \sim \mathrm{Bernoulli}\left(v_i v_j B_{z_i z_j}(t)\right),$$

where $B(t)$ is a fixed $3 \times 3$ connectivity matrix for $t \in \{1, \ldots, \Delta\}$ and

$$B(t) = \begin{pmatrix} 0.95 & 0.15 & 0.15 \\ 0.15 & 0.95 & 0.15 \\ 0.15 & 0.15 & 0.95 \end{pmatrix}, \quad t \in \{\Delta+1, \Delta+2, \ldots, T\}.$$

Scenario 4.
This is a random dot product graph (Young and Scheinerman, 2007) with fixed latent positions. First, we generate the latent positions $X \in \mathbb{R}^{n \times d}$ as

$$X_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}[0, 1], \quad i = 1, \ldots, n,\ j = 1, \ldots, d,$$

which are kept fixed throughout our simulations. We then construct $\widetilde X \in \mathbb{R}^{n \times d}$ as

$$\widetilde X_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}[0, 1], \quad i = 1, \ldots, n,\ j = 1, \ldots, d,$$

which are also kept fixed throughout our simulations. Finally, the data are generated as

$$A_{ij}(t) \sim \mathrm{Bernoulli}\left(\frac{X_i^\top X_j}{\|X_i\|\|X_j\|}\right), \quad t \in \{1, \ldots, \Delta\},$$

and

$$A_{ij}(t) \sim \mathrm{Bernoulli}\left(\frac{Y_i^\top Y_j}{\|Y_i\|\|Y_j\|}\right), \quad t \in \{\Delta+1, \Delta+2, \ldots, T\},$$

where $X_i, \widetilde X_i \in \mathbb{R}^d$ are the $i$th rows of the matrices $X$ and $\widetilde X$, $\|\cdot\|$ is the $\ell_2$-norm of vectors, and $Y_i = \widetilde X_i$ for $i \le \lfloor n/2\rfloor$ and $Y_i = X_i$ otherwise.

We collect the results in Figure 2 and Tables 1 and 2. In Figure 2, we exhibit one realisation of each scenario, with each row corresponding to one scenario, from the first to the fourth. In each row, the left two panels are realisations before the change point, and the right two panels are post-change-point realisations. It can be seen from Figure 2 that these four scenarios cover different types of networks, and the change points are hard to spot with the naked eye.

Tables 1 and 2 correspond to the two different ways of controlling Type-I errors. We reiterate that Algorithm 2, Theorem 1 and Table 1 correspond to the strategy of controlling the overall Type-I error $\alpha$; Algorithm 3, Corollary 2 and Table 2 correspond to the strategy of lower bounding the average run length $\gamma$.
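The stochastic block model scenarios above can be simulated along the following lines (a minimal sketch; the connectivity matrices below are illustrative placeholders, not the paper's, and the function names are ours):

```python
import numpy as np

def sbm_adjacency(B, z, rng):
    """One symmetric Bernoulli adjacency matrix with P(A_ij = 1) = B[z_i, z_j]."""
    P = B[np.ix_(z, z)]
    upper = rng.random(P.shape) < P          # independent Bernoulli draws
    A = np.triu(upper, k=1)                  # keep the strict upper triangle
    return (A + A.T).astype(int)             # symmetrise, zero diagonal

def sbm_sequence(B_pre, B_post, z, delta, T, seed=0):
    """Independent networks: connectivity B_pre up to time delta, B_post after."""
    rng = np.random.default_rng(seed)
    return [sbm_adjacency(B_pre if t <= delta else B_post, z, rng)
            for t in range(1, T + 1)]

# Illustrative connectivity matrices at overall sparsity rho = 0.02.
rho = 0.02
B_pre = rho * np.array([[0.9, 0.3, 0.3], [0.3, 0.9, 0.3], [0.3, 0.3, 0.9]])
B_post = rho * np.array([[0.3, 0.9, 0.3], [0.9, 0.3, 0.3], [0.3, 0.3, 0.9]])
z = np.repeat([0, 1, 2], 50)                 # n = 150, three communities of 50
nets = sbm_sequence(B_pre, B_post, z, delta=75, T=150)
```

The degree corrected variant of Scenario 3 only changes the success probabilities to $v_i v_j B_{z_i z_j}(t)$, i.e. an extra rank-one rescaling of the matrix `P`.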
Figure 2: Examples of adjacency matrices generated under different scenarios. The first to the fourth rows correspond to the first to the fourth scenarios, respectively. In each row, from left to right, the first two plots correspond to networks generated before the change point, and the last two plots to networks generated after the change point. In each display, a white dot indicates one and a red dot indicates zero.

Table 1: Upper bounding overall Type-I errors: n, the network size; α, the Type-I error upper bound; T_train, the time length of the training data used for selecting tuning parameters; ORI, Chen (2019) using the original edge-count scan statistic; W, Chen (2019) using the weighted edge-count scan statistic; G, Chen (2019) using the generalised edge-count scan statistic. Delay and PFA are reported for Algo. 2, ORI, W and G in each of the following settings:

n    Scenario  α     T_train
150  1         0.01  200
150  1         0.05  200
150  1         0.05  150
100  1         0.01  200
100  1         0.05  200
100  1         0.01  150
100  1         0.05  150
150  2         0.05  200
150  2         0.01  150
100  2         0.05  150
150  3         0.01  200
100  4         0.05  200

Table 2: Lower bounding the average run lengths: n, the network size; γ, the average run length lower bound; ORI, Chen (2019) using the original edge-count scan statistic; W, Chen (2019) using the weighted edge-count scan statistic; G, Chen (2019) using the generalised edge-count scan statistic. Delay and PFA are reported for Algo. 3, ORI, W and G, in settings including n = 150, Scenario 1 and γ = 150.

4.2 Stock market data

We consider stock market data from April 1990 to January 2012. The data consist of the weekly log returns for the Dow Jones Industrial Average index, and they are available in the R (R Core Team, 2020) package ecp (James et al., 2019). To construct networks, we first use a sliding window of width 3 and consider the covariance matrix of 29 companies' log weekly returns over a 3-week period. We then binarise the covariance matrices by setting the entries with values above the 0.95-quantile to 1, and the remaining entries to 0. This construction leads to sparse networks. Some examples of these networks are illustrated in the first two rows of Figure 3.

As competitors to our estimator, we consider the same statistics ORI, W and G from Chen (2019) that were used in Section 4.1. To evaluate the performance of these methods, we choose two periods of the original data, each consisting of a training set and a test set. We calculate the maximum score of each method on the training data and use this maximum score as the threshold for declaring change points.

In the first period, the data from 2-Apr-1990 to 4-Jan-1999 are used as the training set, and the data from 25-Jan-1999 to 31-May-2004 are used as the test set. The algorithm we proposed in Algorithm 2 detects a change point corresponding to 25-Mar-2002. This seems to coincide with
Figure 3: Stock market data set described in Section 4.2.

the period of financial turbulence after the 11-Sep-2001 terrorist attacks. We can also see in the first row of Figure 3 that there seems to be a change in pattern around that date. In contrast, the competitor methods did not detect the change point with the given choice of threshold.

In the second period, the data from 31-May-2004 to 15-Jan-2007 are used as the training set, and the data from 5-Feb-2007 to 1-Mar-2010 are used as the test set. We remark that the training data correspond to the period before the financial crisis of 2007–2008. Algorithm 2 detects a change point corresponding to the date 10-Mar-2008. The competing approaches detect a change point in the same period: specifically, ORI detects the date 17-Dec-2007, and both W and G detect the date 19-Mar-2007. From looking at the second row of Figure 3, we can see that the 2007–2008 financial crisis seems to affect the network patterns in the data.
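The covariance-thresholding construction described above (window width 3, binarisation at the 0.95-quantile) can be sketched as follows; the function names and the synthetic returns are ours, not part of the paper or the ecp package:

```python
import numpy as np

def returns_to_network(returns, quantile=0.95):
    """Binary network from one window of log returns: compute the covariance
    matrix among the assets, then set entries above the given quantile to 1
    and the remaining entries to 0."""
    cov = np.cov(returns, rowvar=False)          # (n_assets, n_assets)
    return (cov > np.quantile(cov, quantile)).astype(int)

def sliding_networks(returns, width=3, quantile=0.95):
    """One network per sliding window of `width` consecutive weeks."""
    T = returns.shape[0]
    return [returns_to_network(returns[t:t + width], quantile)
            for t in range(T - width + 1)]

# Synthetic stand-in for the weekly log returns of 29 companies over 30 weeks.
rng = np.random.default_rng(0)
weekly_log_returns = rng.normal(scale=0.02, size=(30, 29))
networks = sliding_networks(weekly_log_returns)
```

Thresholding at a high quantile is what makes the resulting networks sparse, as noted in the text.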
4.3 MIT cellphone data

In Section 1, we studied the MIT cellphone data set. In this subsection, we provide all numerical details.

Originally, the data are in the form of 1392 networks of size 96 × 96. For each day of the experiment, the data include four networks corresponding to six-hour intervals in the given day. We sum the four networks within each day, resulting in 232 networks, each with 96 nodes. The networks are transformed to binary networks by setting all strictly positive entries to 1. In other words, in each binary network, if the entry (i, j) equals 1, then participants i and j were in physically close proximity during the corresponding day.

To evaluate the performances of different methods, we construct two experiments. In the first, we use the data from 14-Sept-2004 to 1-Dec-2004 as the training data set, and the data from 2-Dec-2004 to 15-Feb-2005 as the test set. The training set is the period before the MIT winter recess, which that year took place from 22-Dec-2004 to 3-Jan-2005. The thresholds are chosen in the same way as in Section 4.2. Algorithm 2 detects a change point on 27-Dec-2004; ORI on 30-Jan-2005; W and G on 29-Jan-2005. Clearly, our method is the best at detecting the winter recess period.

In our second example, we use the data from 1-Jan-2005 to 3-Mar-2005 as the training data set and the data from 4-Mar-2005 to 5-May-2005 as the test set. Thus the training data consist of networks before the spring recess, which took place from 26-Mar-2005 to 3-Apr-2005. Algorithm 2 detects a change point on 31-Mar-2005, and ORI on 6-Apr-2005. Thus, our method seems to be quicker at detecting the spring recess.

5 Conclusion

In this paper, we are concerned with online change point detection in a sequence of inhomogeneous Bernoulli networks. We established the minimax lower bound on the detection delay, which matches an upper bound, save for logarithmic factors, based on an NP-hard estimation procedure. In addition, we proposed a polynomial-time algorithm, the detection delay of which matches that based on NP-hard estimators in the extreme cases, i.e.
$r \asymp 1$ or $r \asymp n$.

Our proposed methods consist of two different Type-I error control strategies, with a worst-case computational cost of order $O(\log(t)\,\mathrm{Cost}(n))$ when proceeding to time point $t$, where $\mathrm{Cost}(n)$ is the computational cost of running the USVT algorithm (Algorithm 1) on a size-$n$ network.

In this paper, we only discuss the at-most-one-change-point scenario. In fact, it is straightforward to extend the algorithm and the results to the multiple change points scenario. To be specific, one can restart the algorithm whenever a change point is declared by Algorithm 2. As for the theoretical results, in Theorem 1, we can let $\Delta$ be the minimal spacing between two consecutive change points. Provided that the signal-to-noise ratio condition of Theorem 1 holds with this choice of $\Delta$, then with probability at least $1 - \alpha$, all change points can be detected, with detection delays uniformly upper bounded by the detection delay upper bound in Theorem 1 and without false alarms. This is a straightforward consequence of Theorem 1, and we therefore omit the technical details here.

In the existing literature, it is hoped that an online change point detection procedure has constant cost at every time point. In this sense, our cost $O(\log(t))$ is not efficient enough. However, to the best of our knowledge, when the pre- and post-change distributions are not fully specified, this constant computational cost is not achievable even in the univariate case. Having said this, it is still of vital interest to improve the computational efficiency of network online change point detection methods. We leave this for future work.

Appendices
All necessary lemmas are collected in Appendix A and all proofs of the main results are given in Appendix B.
A Network change point lemmas
The lemma below is identical to Lemma S.6 in Wang et al. (2018); we therefore omit its proof.

Lemma 5. (1) For any $t \in \mathbb{N}^*$, let $\{A(l)\}_{l=1}^t$ be a collection of independent matrices with independent Bernoulli entries satisfying

$$\max_{l = 1, \ldots, t} \|\mathbb{E}(A(l))\|_\infty \le \rho, \quad \text{with } n\rho \ge \log(n).$$

Let $\{w_l\}_{l=1}^t \subset \mathbb{R}$ be a collection of scalars such that $\sum_{l=1}^t w_l = 0$ and $\sum_{l=1}^t w_l^2 = 1$. Then there exists an absolute constant $C > 0$ such that, for any $\varepsilon > 0$,

$$\mathbb{P}\left( \left\| \sum_{l=1}^t w_l A(l) - \mathbb{E}\left( \sum_{l=1}^t w_l A(l) \right) \right\|_{\mathrm{op}} \ge C\sqrt{n\rho} + \varepsilon \right) \le \exp(-\varepsilon^2/2). \quad (8)$$

(2) If $\{A(l)\}_{l=1}^t$ are symmetric matrices, then (8) still holds.

Lemmas 6 and 7 are from Lemma 1 in Xu (2017).
Lemma 6. Let $A, B \in \mathbb{R}^{n \times n}$ be two symmetric matrices with $\|A - B\|_{\mathrm{op}} \le \tau/(1+\delta)$, $\tau > 0$. Then, for a fixed $\delta \in (0, 1)$, we have

$$\|\mathrm{USVT}(A, \tau, \infty) - B\|_{\mathrm F}^2 \le \min_{s = 0, \ldots, n} \left\{ s\tau^2 + \frac{(1+\delta)^2}{\delta^2} \sum_{i = s+1}^n \lambda_i^2(B) \right\},$$

where $|\lambda_1(B)| \ge \cdots \ge |\lambda_n(B)|$ are the eigenvalues of $B$, sorted in decreasing order of absolute value.

Lemma 7.
Let $A$ and $B$ be defined as in Lemma 6, and assume in addition that $\|B\|_\infty \le \tau_2$. Then

$$\|\mathrm{USVT}(A, \tau_1, \tau_2) - B\|_{\mathrm F}^2 \le \min_{s = 0, \ldots, n} \left\{ s\tau_1^2 + \frac{(1+\delta)^2}{\delta^2} \sum_{i = s+1}^n \lambda_i^2(B) \right\},$$

with the eigenvalues of $B$ ordered as in Lemma 6.

The lemma below is Lemma S.2 in Wang et al. (2018).
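For concreteness, the USVT operator appearing in Lemmas 6 and 7 can be sketched as follows (an illustrative implementation of singular value thresholding with entrywise truncation; the exact subroutine is Algorithm 1 in the main text):

```python
import numpy as np

def usvt(A, tau1, tau2=np.inf):
    """Singular value thresholding with entrywise truncation: keep the
    singular values exceeding tau1, rebuild the matrix from the retained
    components, then clip the entries to [-tau2, tau2].  With tau2 = inf
    this corresponds to USVT(A, tau, infinity) in Lemma 6; a finite tau2
    corresponds to USVT(A, tau1, tau2) in Lemma 7."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tau1                          # hard-threshold singular values
    M = (U[:, keep] * s[keep]) @ Vt[keep]    # low-rank reconstruction
    return np.clip(M, -tau2, tau2)           # entrywise truncation

# A rank-one matrix with singular value 10 survives a threshold of 5 intact,
# whereas the all-zero matrix is mapped to zero for any positive threshold.
M = np.outer(np.ones(10), np.ones(10))
recovered = usvt(M, tau1=5.0)
zeroed = usvt(np.zeros((10, 10)), tau1=1.0)
```

This "kill small singular values, clip entries" structure is exactly what Lemmas 6 and 7 quantify: the $s\tau^2$ term accounts for the retained directions and the tail sum for the discarded ones.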
Lemma 8.
Let $\{X(l)\}_{l = 1, 2, \ldots} \subset \mathbb{R}^p$ be a sequence of independent random vectors with independent Bernoulli entries. Suppose that $\mathbb{E}(X_i(t)) = \mu_i(t)$ and that $\sup_{l = 1, 2, \ldots} \|\mu(l)\|_\infty \le \rho$. For any $t > 0$, let $v \in \mathbb{R}^p$ and $\{w_l\}_{l=1}^t \subset \mathbb{R}$ satisfy $\sum_{l=1}^t w_l^2 = 1$. Then, for any $\varepsilon > 0$, we have

$$\mathbb{P}\left( \left| \sum_{i=1}^p v_i \sum_{l=1}^t w_l \left(X_i(l) - \mu_i(l)\right) \right| \ge \varepsilon \right) \le 2\exp\left( -\frac{\varepsilon^2/2}{\rho\|v\|_2^2 + \varepsilon \max_{i=1}^p |v_i| \max_{l=1}^t |w_l|} \right).$$

Lemma 9.
Assume that $\{B(u)\}$ is a sequence of adjacency matrices satisfying Assumption 1. For any integer $t \ge 1$, let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$, let $C > 0$ be the absolute constant in Lemma 5, and let

$$\varepsilon_{s,t} = \sqrt{2\log\left\{ \frac{2t(t+1)\log(t)}{\alpha\log(2)} \right\}}, \quad \tau_{1,s,t} = C\sqrt{n\rho} + \varepsilon_{s,t} \quad \text{and} \quad \tau_{2,s,t} = \sqrt{\frac{s(t-s)}{t}}\,\rho.$$

We have that, under Assumption 2, the event

$$\mathcal{F}_1 = \left\{ \forall t \ge 1:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \text{ for } t \le \Delta,\ \text{and}\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) - \widetilde{\Theta}_{s,t} \right\|_{\mathrm F} \le \sqrt{r}\,\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \text{ for } t > \Delta \right\}$$

holds with probability at least $1 - \alpha/2$; under Assumption 3, the event

$$\mathcal{F}_2 = \left\{ \forall t \ge 1:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \right\}$$

holds with probability at least $1 - \alpha/2$.

Proof. Step 1. If Assumption 2 holds, then for $t \le \Delta$ it holds that $\widetilde{\Theta}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$; for $t > \Delta$, it holds that

$$\widetilde{\Theta}_{s,t} = \begin{cases} (t-\Delta)\sqrt{\dfrac{s}{t(t-s)}}\,(\Theta_2 - \Theta_1), & s \le \Delta, \\[6pt] \Delta\sqrt{\dfrac{t-s}{st}}\,(\Theta_2 - \Theta_1), & s > \Delta, \end{cases} \qquad \text{and} \qquad \mathrm{rank}(\widetilde{\Theta}_{s,t}) \le r. \quad (9)$$

If Assumption 3 holds, then for any $t$, it holds that $\widetilde{\Theta}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$.

Step 2. Due to Definition 2, we have $\widehat{B}_{s,t} = \sum_{l=1}^t w_l^s B(l)$, where $\sum_{l=1}^t w_l^s = 0$ and $\sum_{l=1}^t (w_l^s)^2 = 1$. Let

$$\mathcal{E}^c = \left\{ \exists t \ge 1,\ s \in \mathcal{S}(t):\ \|\widehat{B}_{s,t} - \widetilde{\Theta}_{s,t}\|_{\mathrm{op}} > C\sqrt{n\rho} + \varepsilon_{s,t} \right\}.$$

It then follows from Lemma 5 that

$$\mathbb{P}(\mathcal{E}^c) \le \sum_{t=1}^\infty \frac{\log(t)}{\log(2)} \max_{s \in \mathcal{S}(t)} \mathbb{P}\left\{ \|\widehat{B}_{s,t} - \widetilde{\Theta}_{s,t}\|_{\mathrm{op}} > C\sqrt{n\rho} + \varepsilon_{s,t} \right\} \le \sum_{t=1}^\infty \frac{\log(t)}{\log(2)} \cdot \frac{\alpha\log(2)}{2t(t+1)\log(t)} \le \frac{\alpha}{2}, \quad (10)$$

with $C$ and $\varepsilon_{s,t}$ defined above. Under Assumption 2, it follows from Lemma 6 that $\mathcal{E} \subset \mathcal{E}_1$, where

$$\mathcal{E}_1 = \left\{ \forall t \ge 1:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \infty) - \widetilde{\Theta}_{s,t} \right\|_{\mathrm F} \le \sqrt{r}\,\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \text{ for } t > \Delta,\ \text{and}\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \infty) \right\|_{\mathrm F} = 0 \text{ for } t \le \Delta \right\}.$$

Due to (9), we have $\|\widetilde{\Theta}_{s,t}\|_\infty \le \sqrt{s(t-s)/t}\,\rho = \tau_{2,s,t}$; Lemma 7 then transfers the bounds from $\mathrm{USVT}(\cdot, \tau_{1,s,t}, \infty)$ to $\mathrm{USVT}(\cdot, \tau_{1,s,t}, \tau_{2,s,t})$, so $\mathcal{E}_1 \subset \mathcal{F}_1$ and $\mathbb{P}(\mathcal{F}_1) \ge \mathbb{P}(\mathcal{E}) \ge 1 - \alpha/2$. Under Assumption 3, note that $\widetilde{\Theta}_{s,t} = 0$; due to Lemmas 6 and 7, we have $\mathcal{E} \subset \mathcal{F}_2$, which completes the proof.

Lemma 10.
Assume that $\{B(u)\}$ is a sequence of adjacency matrices satisfying Assumption 1. For any integer $t \ge 1$, let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$, let $C > 0$ be the absolute constant in Lemma 5, and let

$$\varepsilon_{s,t} = \sqrt{2\log\left\{ \frac{4(\gamma+1)^2\log(\gamma+1)}{\log(2)} \right\}}, \quad \tau_{1,s,t} = C\sqrt{n\rho} + \varepsilon_{s,t} \quad \text{and} \quad \tau_{2,s,t} = \sqrt{\frac{s(t-s)}{t}}\,\rho.$$

We have that, under Assumption 2, the event

$$\mathcal{F}_1 = \left\{ \forall t \in \{1, \ldots, \gamma+1\}:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \text{ for } t \le \Delta,\ \text{and}\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) - \widetilde{\Theta}_{s,t} \right\|_{\mathrm F} \le \sqrt{r}\,\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \text{ for } t > \Delta \right\}$$

holds with probability at least $1 - \{4(\gamma+1)\}^{-1}$; under Assumption 3, the event

$$\mathcal{F}_2 = \left\{ \forall t \in \{1, \ldots, \gamma+1\}:\ \max_{s \in \mathcal{S}(t)} \left\| \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t}) \right\|_{\mathrm F} = 0 \right\}$$

holds with probability at least $1 - \{4(\gamma+1)\}^{-1}$.

Proof. Step 1. This step is identical to Step 1 in the proof of Lemma 9: if Assumption 2 holds, then $\widetilde{\Theta}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$ when $t \le \Delta$, and for $t > \Delta$ the population CUSUM $\widetilde{\Theta}_{s,t}$ takes the same form as in (9), relabelled (11) here, with $\mathrm{rank}(\widetilde{\Theta}_{s,t}) \le r$; if Assumption 3 holds, then $\widetilde{\Theta}_{s,t} = 0$ for any $t$ and all $s \in \mathcal{S}(t)$.

Step 2. Due to Definition 2, we have $\widehat{B}_{s,t} = \sum_{l=1}^t w_l^s B(l)$, where $\sum_{l=1}^t w_l^s = 0$ and $\sum_{l=1}^t (w_l^s)^2 = 1$. Let

$$\mathcal{E}^c = \left\{ \exists t \in \{1, \ldots, \gamma+1\},\ s \in \mathcal{S}(t):\ \|\widehat{B}_{s,t} - \widetilde{\Theta}_{s,t}\|_{\mathrm{op}} > C\sqrt{n\rho} + \varepsilon_{s,t} \right\}.$$

It then follows from Lemma 5, with a union bound over the at most $(\gamma+1)\log(\gamma+1)/\log(2)$ pairs $(s, t)$, that

$$\mathbb{P}(\mathcal{E}^c) < \frac{1}{4(\gamma+1)}, \quad (12)$$

with $C$ and $\varepsilon_{s,t}$ defined above. The rest of the argument is identical to Step 2 in the proof of Lemma 9: under Assumption 2, Lemma 6 implies $\mathcal{E} \subset \mathcal{E}_1$, the analogue of the event $\mathcal{E}_1$ there, and the bound $\|\widetilde{\Theta}_{s,t}\|_\infty \le \sqrt{s(t-s)/t}\,\rho = \tau_{2,s,t}$, from (11), together with Lemma 7 yields $\mathcal{E}_1 \subset \mathcal{F}_1$; under Assumption 3, note that $\widetilde{\Theta}_{s,t} = 0$, and Lemmas 6 and 7 give $\mathcal{E} \subset \mathcal{F}_2$, which completes the proof.

B Proofs of main results
Proof of Theorem 1.
We let $\widetilde{B}_{s,t} = \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t})$ for $t \ge 1$, and let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$. The rest of the proof is conducted on the event $\mathcal{F}_1$ defined in Lemma 9. In particular, $\mathbb{P}\{\mathcal{F}_1\} > 1 - \alpha/2$, and $\mathcal{F}_1$ concerns the data $\{B(t)\}$, which are independent of the data $\{A(t)\}$.

Step 1. Due to Lemma 9, it holds that if Assumption 2 holds and $t \le \Delta$, or if Assumption 3 holds, then $\widetilde{B}_{s,t} = 0$ for all $s \in \mathcal{S}(t)$. Therefore claim (i) is proved, and for claim (ii) we have $\widehat{t} - \Delta > 0$. Define

$$t_1 = \min\left\{ t > \Delta:\ \max_{s \in \mathcal{S}(t)} \left[ \left| \left\langle \widehat{A}_{s,t}, \frac{\widetilde{B}_{s,t}}{\|\widetilde{B}_{s,t}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{B}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha) \right\} \right] > b_t \right\}. \quad (14)$$

Due to the design of Algorithm 2, we can see that $\widehat{t} \le t_1$ and therefore $d \le t_1 - \Delta$. In order to provide an upper bound on $d$, it thus suffices to upper bound $t_1$.

Step 2. Recall the quantity $\varepsilon_{s,t}$ defined in Lemma 9. With this quantity, we define

$$t_2 = \min\left\{ t > \Delta:\ \max_{s \in \mathcal{S}(t)} \left[ \left| \left\langle \widehat{A}_{s,t}, \frac{\widetilde{B}_{s,t}}{\|\widetilde{B}_{s,t}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{\Theta}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha) + \sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{s,t}\right) \right\} \right] > b_t \right\}.$$

Due to Lemma 9, we know that if the event $\{\|\widetilde{\Theta}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha) + \sqrt{r}(C\sqrt{n\rho} + \varepsilon_{s,t})\}$ considered in $t_2$ holds, then the event $\{\|\widetilde{B}_{s,t}\|_{\mathrm F} > C\log^{1/2}(t/\alpha)\}$ considered in $t_1$ also holds. This implies that $t_1 \le t_2$. It now suffices to bound $t_2$, which yields the upper bound $d \le t_2 - \Delta$.

Due to the choices of $s$, we in turn define

$$J = \min\left\{ j \in \mathbb{N}:\ \left| \left\langle \widehat{A}_{\Delta, \Delta+j}, \frac{\widetilde{B}_{\Delta, \Delta+j}}{\|\widetilde{B}_{\Delta, \Delta+j}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{\Theta}_{\Delta, \Delta+j}\|_{\mathrm F} > C\log^{1/2}((\Delta+j)/\alpha) + \sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{\Delta, \Delta+j}\right) \right\} > b_{\Delta+j} \right\}$$

and $t_3 = \Delta + J$. In the definition of $J$, we essentially choose the integer pair $(s, t)$ to be $(\Delta, \Delta + J)$; this is to ensure that $s \in \mathcal{S}(t)$ and $s = \Delta$. Due to this construction, we can see that $t_3$ is an upper bound on $t_2$, and our task is now to bound $J$.

Step 3.
We now show that, for a large enough absolute constant $C_d > 0$, it suffices to take

$$J = 2^{\left\lceil \log\left( \frac{C_d\, r \log(\Delta/\alpha)}{\kappa^2 n\rho} \right) \big/ \log(2) \right\rceil}. \quad (15)$$

For notational simplicity, in the rest of the proof we assume that $\log\{C_d r \log(\Delta/\alpha)/(\kappa^2 n\rho)\}/\log(2)$ is a positive integer, so that $J = C_d r \log(\Delta/\alpha)/(\kappa^2 n\rho)$; if this is violated, the proof only needs to be modified by keeping the ceiling operator throughout.

Step 3.1. With $J$ defined in (15), we have that

$$\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} = \kappa \sqrt{\frac{\Delta J}{\Delta + J}},$$

which can be derived by plugging $s = \Delta$ and $t = \Delta + J$ into (9). Due to (2), we have that

$$\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\log^{1/2}((\Delta+J)/\alpha) + \sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{\Delta, \Delta+J}\right). \quad (16)$$

This can be seen in the following three steps.

Step 3.1.1. We first show that

$$3^{-1}\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\log^{1/2}((\Delta+J)/\alpha). \quad (17)$$

Provided that $C_d < C_{\mathrm{SNR}}$, the condition (2) implies that $J \le \Delta$, and hence

$$\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} \ge \kappa\sqrt{J/2} = \kappa\sqrt{\frac{C_d\, r\log(\Delta/\alpha)}{2\kappa^2 n\rho}}. \quad (18)$$

In addition, provided that $\Delta/\alpha \ge$
2, it holds that

$$\log^{1/2}((\Delta+J)/\alpha) \le \log^{1/2}(2\Delta/\alpha) \le \sqrt{2}\,\log^{1/2}(\Delta/\alpha). \quad (19)$$

Therefore, provided that $C_d$ is a sufficiently large absolute constant and $n \ge 2$, (17) holds, where the condition $n\rho \ge \log(n)$ assumed in Assumption 1 is used.

Step 3.1.2. Provided that $C_d$ is sufficiently large, we have $3^{-1}\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\sqrt{rn\rho}$, by using (18).

Step 3.1.3. Lastly, we are to show that

$$3^{-1}\|\widetilde{\Theta}_{\Delta, \Delta+J}\|_{\mathrm F} > C\sqrt{r}\,\varepsilon_{\Delta, \Delta+J}. \quad (20)$$

Due to (19) and the fact that $J \le \Delta$,

$$\varepsilon_{\Delta, \Delta+J}^2 = 2\log\left\{ \frac{2(\Delta+J)(\Delta+J+1)\log(\Delta+J)}{\alpha\log(2)} \right\} \le C'\log\left(\frac{\Delta}{\alpha}\right)$$

for an absolute constant $C' > 0$. Therefore, provided that $C_d$ is sufficiently large, (20) holds.

Step 3.2. In addition, we have that

$$\left| \left\langle \widehat{A}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \ge \left| \left\langle \widetilde{\Theta}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| - \left| \left\langle \widehat{A}_{\Delta,\Delta+J} - \widetilde{\Theta}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| = (\mathrm{I}) - (\mathrm{II}). \quad (21)$$

Step 3.2.1. As for (I), writing $\widetilde{\Theta} = \widetilde{\Theta}_{\Delta,\Delta+J}$ and $\widetilde{B} = \widetilde{B}_{\Delta,\Delta+J}$ for brevity, we have

$$(\mathrm{I}) = \|\widetilde{\Theta}\|_{\mathrm F} \left| \left\langle \frac{\widetilde{\Theta}}{\|\widetilde{\Theta}\|_{\mathrm F}}, \frac{\widetilde{B}}{\|\widetilde{B}\|_{\mathrm F}} \right\rangle \right| \ge \|\widetilde{\Theta}\|_{\mathrm F} \left( 1 - \frac{1}{2}\left\| \frac{\widetilde{\Theta}}{\|\widetilde{\Theta}\|_{\mathrm F}} - \frac{\widetilde{B}}{\|\widetilde{B}\|_{\mathrm F}} \right\|_{\mathrm F}^2 \right) \ge \|\widetilde{\Theta}\|_{\mathrm F} \left\{ 1 - 2\left( \frac{\|\widetilde{\Theta} - \widetilde{B}\|_{\mathrm F}}{\|\widetilde{\Theta}\|_{\mathrm F}} \right)^2 \right\} \ge \|\widetilde{\Theta}\|_{\mathrm F} \left\{ 1 - 2\left( \frac{\sqrt{r}\left(C\sqrt{n\rho} + \varepsilon_{\Delta,\Delta+J}\right)}{\|\widetilde{\Theta}\|_{\mathrm F}} \right)^2 \right\} \ge \frac{1}{2}\|\widetilde{\Theta}\|_{\mathrm F}, \quad (22)$$

where the penultimate inequality is due to the event $\mathcal{F}_1$ and the last inequality follows from (16) with a sufficiently large $C_{\mathrm{SNR}}$.

Step 3.2.2.
As for (II), due to the independence between $\{A(t)\}$ and $\{B(t)\}$, it follows from Lemma 8 and (16) that

$$\mathbb{P}_A\left\{ \left| \left\langle \widehat{A}_{\Delta,\Delta+J} - \widetilde{\Theta}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \ge b_{\Delta+J} \right\} < \alpha. \quad (23)$$

To be specific, the CUSUM weights are regarded as the $\{w_l\}$ sequence in Lemma 8, and the entries of $\widetilde{B}_{\Delta,\Delta+J}\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}^{-1}$ are regarded as the $\{v_i\}$ sequence in Lemma 8; therefore $\|v\|_2 = 1$ and $\max_{l} |w_l| \le J^{-1/2}$, while $\max_{i} |v_i|$ is controlled using (16) and the definition of $\mathcal{F}_1$, so that applying Lemma 8 with $\varepsilon = b_{\Delta+J}$ gives a bound smaller than $\alpha$.

Due to (16), with a sufficiently large $C_{\mathrm{SNR}}$, it holds that $\|\widetilde{\Theta}_{\Delta,\Delta+J}\|_{\mathrm F} > 4 b_{\Delta+J}$. Then, combining (21), (22) and (23), it holds that

$$\mathbb{P}_A\left\{ \left| \left\langle \widehat{A}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \ge b_{\Delta+J} \right\} > 1 - \alpha. \quad (24)$$

Step 3.3. Combining (13) and (24), we have that

$$\mathbb{P}\left\{ \left| \left\langle \widehat{A}_{\Delta,\Delta+J}, \frac{\widetilde{B}_{\Delta,\Delta+J}}{\|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F}} \right\rangle \right| \mathbb{1}\left\{ \|\widetilde{B}_{\Delta,\Delta+J}\|_{\mathrm F} > C\log^{1/2}((\Delta+J)/\alpha) \right\} > b_{\Delta+J} \right\} > 1 - \alpha,$$

which completes the proof.

Proof of Corollary 2.
We let $\widetilde{B}_{s,t} = \mathrm{USVT}(\widehat{B}_{s,t}, \tau_{1,s,t}, \tau_{2,s,t})$ for $t \ge 1$, and let $\mathcal{S}(t) = \{t - 2^{j-1},\ j = 1, \ldots, \lfloor \log(t)/\log(2) \rfloor\}$.

Step 1. Due to Lemma 10, it holds that if Assumption 3 holds, then with probability at least $1 - \{4(\gamma+1)\}^{-1}$ the event $\mathcal{F}_2$ holds, i.e. $\widetilde{B}_{s,t} =$
0 for all $t \in \{1, \ldots, \gamma+1\}$ and $s \in \mathcal{S}(t)$. Then we have

$$\mathbb{E}(\widehat{t}\,) = \sum_{t=1}^\infty \mathbb{P}(\widehat{t} \ge t) \ge \sum_{t=1}^{\gamma+1} \mathbb{P}(\widehat{t} \ge t) \ge (\gamma+1)\,\mathbb{P}(\widehat{t} \ge \gamma+1) \ge (\gamma+1)\left( 1 - \frac{1}{\gamma+1} \right) = \gamma.$$

The claim (i) is proved.
Step 2. As for claim (ii), recall the event $\mathcal{F}_1$ and the associated quantities defined in Lemma 10. We have that $\mathbb{P}(\mathcal{F}_1) > 1 - \{4(\gamma+1)\}^{-1}$. Conditional on the event $\mathcal{F}_1$ of Lemma 10 instead of that of Lemma 9, the rest of the proof is identical to that of Theorem 1.

Proof of Corollary 4.
The proof is almost identical to the proof of Theorem 1, except that the large probability events on which the USVT estimators are well controlled in the proof of Theorem 1 are replaced by Theorem 2.1 in Gao et al. (2015). In fact, Theorem 2.1 in Gao et al. (2015) is stated and proved assuming $\rho = 1$. In order to obtain an upper bound as a function of $\rho$, we only need to change Lemmas 4.1-4.3 in Gao et al. (2015) correspondingly.

Proof of Proposition 3.
This proof consists of two different cases: (a) $r \lesssim \sqrt{n}$ and (b) $r \gtrsim \sqrt{n}$.

Case 1: $r \lesssim \sqrt{n}$.

Step 1 - Setup. We assume the networks are generated as follows. Prior to the change point, if there exists any, the adjacency matrices are generated independently from the distribution $P_0$, whose graphon matrix is $\Theta_0 = (\rho/2)\,\mathbf{1}\mathbf{1}^\top$. If there exists a change point, then the adjacency matrices after the change point are generated independently from the distribution

$$P_1 = 2^{-n} \sum_{u \in \{\pm 1\}^n} P_{1,u},$$

where the graphon of the distribution $P_{1,u}$ is $(\rho/2)\,\mathbf{1}\mathbf{1}^\top + \kappa\rho\, n^{-1} uu^\top$, $u \in \{\pm 1\}^n$.

For any $M \in \mathbb{N}$, let $P^M$ be the restriction of a distribution $P$ to $\mathcal{F}_M$, i.e. the $\sigma$-field generated by the observations $\{A(t)\}_{t=1}^M$. For notational simplicity, in this proof the adjacency matrices $A(t)$ will be denoted $A_t$. For any $\nu \ge 1$ and $M \ge \nu$, let

$$Z_{\nu,M} = \log\left( \frac{\mathrm{d}P^M_{\kappa,\nu}}{\mathrm{d}P^M_{\kappa,\infty}} \right),$$

where $P_{\kappa,\infty}$ indicates the distribution under which there is no change point, and $P_{\kappa,\nu}$ that under which the change point occurs at $\nu$.

Step 2 - When $Z_{\nu,T}$ is upper bounded. For any $\nu \ge 1$, define the event

$$\mathcal{E}_\nu = \left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho},\ Z_{\nu,T} < \frac{3}{4}\log(1/\alpha) \right\}.$$

Then we have

$$P_{\kappa,\nu}(\mathcal{E}_\nu) = \frac{\mathrm{d}P_{\kappa,\nu}}{\mathrm{d}P_{\kappa,\infty}}(\mathcal{E}_\nu)\, P_{\kappa,\infty}(\mathcal{E}_\nu) \le \alpha^{-3/4}\,\alpha = \alpha^{1/4}, \quad (25)$$

where the inequality follows from the definition of $\mathcal{D}$ and $\mathcal{E}_\nu$.

Step 3 - When $Z_{\nu,T}$ is lower bounded. For any $\nu \ge 1$ and $T \in \mathcal{D}$, since $\{T \ge \nu\} \in \mathcal{F}_{\nu-1}$, we have

$$P_{\kappa,\nu}\left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho},\ Z_{\nu,T} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ T \ge \nu \right\} \le \operatorname{ess\,sup} P_{\kappa,\nu}\left\{ \max_{1 \le l \le \log(1/\alpha)/(\kappa^2 n\rho)} Z_{\nu,\nu+l} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ A_1, \ldots, A_\nu \right\}$$
$$\le \frac{\log(1/\alpha)}{\kappa^2 n\rho} \max_{1 \le l \le \log(1/\alpha)/(\kappa^2 n\rho)} \operatorname{ess\,sup} P_{\kappa,\nu}\left\{ Z_{\nu,\nu+l} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ A_1, \ldots, A_\nu \right\}$$
$$\le \frac{\log(1/\alpha)}{\kappa^2 n\rho} \exp\left\{ -\frac{3}{4}\log(1/\alpha) \right\} \max_{1 \le l \le \log(1/\alpha)/(\kappa^2 n\rho)} \operatorname{ess\,sup}\, \mathbb{E}_{\kappa,\nu}\left\{ \exp(Z_{\nu,\nu+l})\ \middle|\ A_1, \ldots, A_\nu \right\}. \quad (26)$$

Step 3.1. Note that for any $l \in \{1, \ldots, \log(1/\alpha)(\kappa^2 n\rho)^{-1}\}$, it holds that

$$\mathbb{E}_{\kappa,\nu}\left\{ \exp(Z_{\nu,\nu+l})\ \middle|\ A_1, \ldots, A_\nu \right\} = \mathbb{E}_{\kappa,\nu}\left\{ \frac{\mathrm{d}P^{\nu+l}_{\kappa,\nu}}{\mathrm{d}P^{\nu+l}_{\kappa,\infty}}\ \middle|\ A_1, \ldots, A_\nu \right\}. \quad (27)$$

In addition, letting $\zeta = \rho/2$ and $v_i$ be the $i$th entry of any vector $v$, we have that

$$\mathbb{E}_{P_0}\left( \frac{\mathrm{d}P_1}{\mathrm{d}P_0}(A) \right)^2 = \mathbb{E}_u \mathbb{E}_{A|u}\left( 2^{-n} \sum_{v \in \{\pm 1\}^n} \frac{\mathrm{d}P_{1,v}}{\mathrm{d}P_0}(A) \right) \quad (28)$$
$$= \mathbb{E}_u \mathbb{E}_{A|u}\left\{ 2^{-n} \sum_{v \in \{\pm 1\}^n} \prod_{1 \le i < j \le n} \left( \frac{\zeta + \kappa\rho n^{-1} v_i v_j}{\zeta} \right)^{A_{ij}} \left( \frac{1 - \zeta - \kappa\rho n^{-1} v_i v_j}{1 - \zeta} \right)^{1 - A_{ij}} \right\}$$
$$= \mathbb{E}_u\left\{ 2^{-n} \sum_{v \in \{\pm 1\}^n} \prod_{1 \le i < j \le n} \left\{ 1 + \frac{\kappa^2\rho^2 n^{-2} u_i u_j v_i v_j}{\zeta(1-\zeta)} \right\} \right\} \le \mathbb{E}_u\left\{ 2^{-n} \sum_{v \in \{\pm 1\}^n} \prod_{i,j=1}^n \left\{ 1 + \frac{\kappa^2\rho^2 n^{-2} u_i u_j v_i v_j}{\zeta(1-\zeta)} \right\} \right\}, \quad (29)$$

where in (29), $u$ is a random vector with entries being independent Rademacher random variables. We further have that

$$(29) \le \mathbb{E}_u \mathbb{E}_v \prod_{i,j=1}^n \exp\left\{ \frac{\kappa^2\rho^2 n^{-2} u_i u_j v_i v_j}{\zeta(1-\zeta)} \right\} = \mathbb{E}_{u,v} \exp\left\{ \frac{2\kappa^2\rho}{n^2(1 - \rho/2)}\,(u^\top v)^2 \right\}, \quad (30)$$

where $u$ and $v$ are independent random vectors with entries being independent Rademacher random variables. Finally, we have that

$$(30) = \mathbb{E}_u \exp\left\{ \frac{2\kappa^2\rho}{n^2(1 - \rho/2)}\,(u^\top \mathbf{1})^2 \right\}, \quad (31)$$

where $\mathbf{1}$ is the all-one $n$-dimensional vector. Equation (31) is due to the fact that, for each $i \in \{1, \ldots, n\}$, $u_i v_i$ has the same distribution as $u_i$.

Step 3.2. Let $\varepsilon_n = \left( n^{-1}\sum_{i=1}^n u_i \right)^2$. By Hoeffding's inequality, $\mathbb{P}\{\varepsilon_n > x\} \le 2\exp(-nx/2)$ for any $x > 0$. Integrating the tail, we obtain

$$\mathbb{E}_u \exp\left\{ \frac{2\kappa^2 n\rho}{1 - \rho/2}\,\varepsilon_n \right\} = \int_0^\infty \mathbb{P}\left\{ \exp\left( \frac{2\kappa^2 n\rho}{1 - \rho/2}\,\varepsilon_n \right) > x \right\} \mathrm{d}x \le \exp\left( C_1 \kappa^2 n\rho \right) \quad (32)$$

for an absolute constant $C_1 > 0$, provided that

$$\rho + \kappa^2 n\rho < 1. \quad (33)$$

Step 3.3. Combining (26), (27), (31) and (32), we have that

$$P_{\kappa,\nu}\left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho},\ Z_{\nu,T} \ge \frac{3}{4}\log(1/\alpha)\ \middle|\ T \ge \nu \right\} \le \frac{\log(1/\alpha)}{\kappa^2 n\rho}\,\exp\left\{ C_1\,\frac{\log(1/\alpha)}{\kappa^2 n\rho}\,\kappa^2 n\rho \right\} \exp\left\{ -\frac{3}{4}\log(1/\alpha) \right\} \le \alpha^{1/4}, \quad (34)$$

provided that

$$\alpha^{1/4}\log(1/\alpha) < \kappa^2 n\rho. \quad (35)$$

Step 4.
Combining (25) and (34), we then have

$$\sup_{\nu \ge 1} P_{\kappa,\nu}\left\{ \nu < T < \nu + \frac{\log(1/\alpha)}{\kappa^2 n\rho} \right\} \le 2\alpha^{1/4}.$$

Then it holds that

$$\mathbb{E}_{\kappa,\Delta}\{(T - \Delta)_+\} \ge \frac{\log(1/\alpha)}{\kappa^2 n\rho}\, P_{\kappa,\Delta}\left\{ T - \Delta \ge \frac{\log(1/\alpha)}{\kappa^2 n\rho} \right\} = \frac{\log(1/\alpha)}{\kappa^2 n\rho}\left[ P_{\kappa,\Delta}\{T > \Delta\} - P_{\kappa,\Delta}\left\{ \Delta < T < \Delta + \frac{\log(1/\alpha)}{\kappa^2 n\rho} \right\} \right] \ge \frac{\log(1/\alpha)}{\kappa^2 n\rho}\,(1 - \alpha - 2\alpha^{1/4}) \ge \frac{\log(1/\alpha)}{2\kappa^2 n\rho}, \quad (36)$$

provided that

$$\alpha + 2\alpha^{1/4} < 1/2. \quad (38)$$

Step 5. Finally, we show that the set of parameters satisfying (33), (35) and (38) is not empty. For instance, we take $\rho = 1/2$ and $\kappa^2 n = 1/2$; then (33) holds. With this choice, (35) holds if

$$\alpha^{1/4}\log(1/\alpha) < 1/8 \quad (37)$$

holds, and a sufficiently small $\alpha$ satisfies both (37) and (38).

Step 6. We have now shown that, when $r \lesssim \sqrt{n}$, it holds that

$$\inf_{\widehat{t} \in \mathcal{D}} \sup_{P_{\kappa,\Delta}} \mathbb{E}_P\left\{ (\widehat{t} - \Delta)_+ \right\} \ge c\,\frac{\log(1/\alpha)}{\kappa^2 n\rho}, \quad (39)$$

with an absolute constant $c > 0$.

Case 2: $r \gtrsim \sqrt{n}$.

Step 1 - Setup. We assume that the networks are generated as follows. Prior to the change point, if there exists any, the adjacency matrices are generated independently from the distribution $P_0$, whose graphon matrix is $\Theta_0 = (\rho/2)\,\mathbf{1}\mathbf{1}^\top$. If there exists a change point, then the adjacency matrices after the change point are generated independently from the distribution

$$P_1 = 2^{-r(r+1)/2} \sum_{Z \in \mathcal{Z}} P_{1,Z},$$

where the graphon of the distribution $P_{1,Z}$ is $(\rho/2)\,\mathbf{1}\mathbf{1}^\top + \kappa\rho\, n^{1/2} r^{-1} Z$, and the collection $\mathcal{Z}$ is the set of all symmetric matrices satisfying $Z_{ij} =$
0, if max t i, j u ą r , and all the upper triangular matrix of Z p r q , p r q are independent Radamacher random variables.In order to show that P is a probability distribution, it suffices to justify that for any Z P Z , P ,Z is a suitable probability distribution. • Firstly, we have that } κ ρn { rZ } F “ κ ρn. • Secondly, we have that the entries of the matrix Z are all in the set t , ˘ u . This means thatprovided κ n { r ă { , (40)all the entries of P ,Z,E are in the interval r , ρ s . • Lastly, the rank of the matrix Z is upper bounded by r .33or any M P N , let P M be the restriction of a distribution P on F M , i.e. the σ -filed generated bythe observations t A p t qu Mi “ . For notational simplicity, in this proof, the adjacency matrices A p t q ’swill be denoted as A t ’s. For any ν ě M ě ν , we have that for any M ě ∆, let Z ν,M “ log ˜ P Mκ ,ν P Mκ , ¸ , where P κ, indicates the distribution under which there is no change point. Step 2 - When Z ν,T is upper bounded. For any ν ě
Step 2 - When $Z_{\nu,T}$ is upper bounded. For any $\nu \geq 1$, define the event
$$E_\nu = \left\{\nu < T < \nu + \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho},\; Z_{\nu,T} < \frac{3}{4}\log(1/\alpha)\right\}.$$
Then we have that
$$P_{\kappa_0,\nu}(E_\nu) = \mathbb{E}_{\kappa_0,\infty}\left\{\frac{\mathrm{d}P_{\kappa_0,\nu}}{\mathrm{d}P_{\kappa_0,\infty}}\,\mathbb{1}\{E_\nu\}\right\} \leq \alpha^{-3/4}\, P_{\kappa_0,\infty}(E_\nu) \leq \alpha^{-3/4}\,\alpha = \alpha^{1/4}, \quad (41)$$
where the first inequality holds since $Z_{\nu,T} < (3/4)\log(1/\alpha)$ on $E_\nu$, and the second follows from the definition of $\mathcal{D}$.

Step 3 - When $Z_{\nu,T}$ is lower bounded. For any $\nu \geq 1$ and any $T \in \mathcal{D}$, since $\{T \geq \nu\} \in \mathcal{F}_{\nu - 1}$, we have that
$$P_{\kappa_0,\nu}\left\{\nu < T < \nu + \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho},\; Z_{\nu,T} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, T \geq \nu\right\}$$
$$\leq \operatorname*{ess\,sup} P_{\kappa_0,\nu}\left\{\max_{1 \leq l \leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}} Z_{\nu,\nu+l} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, A_1,\ldots,A_\nu\right\}$$
$$\leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}\, \max_{1 \leq l \leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}} \operatorname*{ess\,sup} P_{\kappa_0,\nu}\left\{Z_{\nu,\nu+l} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, A_1,\ldots,A_\nu\right\}$$
$$\leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}\, \exp\left\{-\frac{3}{4}\log(1/\alpha)\right\} \max_{1 \leq l \leq \frac{r\log(1/\alpha)}{\sqrt{n}\,\kappa_0^2 n\rho}} \operatorname*{ess\,sup}\, \mathbb{E}_{\kappa_0,\nu}\left\{\exp(Z_{\nu,\nu+l})\,\middle|\, A_1,\ldots,A_\nu\right\}, \quad (42)$$
where the second inequality is a union bound and the last is a conditional Chernoff bound.
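The last inequality in (42) is the conditional version of the standard Chernoff (exponential Markov) bound; for completeness, here is the one-line derivation:

```latex
% Conditional Chernoff bound used in the last step of (42):
% for any random variable X, sigma-field \mathcal{G} and t \in \mathbb{R},
\begin{align*}
P\bigl\{X \ge t \,\big|\, \mathcal{G}\bigr\}
  &= P\bigl\{e^{X} \ge e^{t} \,\big|\, \mathcal{G}\bigr\}
   \le e^{-t}\,\mathbb{E}\bigl\{e^{X} \,\big|\, \mathcal{G}\bigr\},
\end{align*}
% applied with X = Z_{\nu,\nu+l}, \mathcal{G} = \sigma(A_1, \ldots, A_\nu)
% and t = (3/4)\log(1/\alpha), so that e^{-t} = \alpha^{3/4}.
```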
Step 3.1. Note that for any $l \in \bigl\{1, \ldots, \bigl\lfloor r\log(1/\alpha)/(\sqrt{n}\,\kappa_0^2 n\rho)\bigr\rfloor\bigr\}$, it holds that
$$\mathbb{E}_{\kappa_0,\nu}\left\{\exp(Z_{\nu,\nu+l})\,\middle|\, A_1,\ldots,A_\nu\right\} = \mathbb{E}_{\kappa_0,\nu}\left\{\frac{\mathrm{d}P^{\nu+l}_{\kappa_0,\nu}}{\mathrm{d}P^{\nu+l}_{\kappa_0,\infty}}\,\middle|\, A_1,\ldots,A_\nu\right\}. \quad (43)$$
In addition, letting $\zeta = \rho/2$, $U_{ij} = (\kappa_0\rho n/r) Z_{ij}$ and $V_{ij} = (\kappa_0\rho n/r) W_{ij}$, we have that
$$\mathbb{E}_{P_1}\left(\frac{\mathrm{d}P_1}{\mathrm{d}P_0}\right) = \mathbb{E}_Z\,\mathbb{E}_{A|Z}\left\{2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \frac{\mathrm{d}P_{1,W}}{\mathrm{d}P_0}\right\}$$
$$= \mathbb{E}_Z\,\mathbb{E}_{A|Z}\left\{2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \prod_{1 \leq i < j \leq n}\left(\frac{\zeta + V_{ij}}{\zeta}\right)^{A_{ij}}\left(\frac{1 - \zeta - V_{ij}}{1 - \zeta}\right)^{1 - A_{ij}}\right\}$$
$$= \mathbb{E}_Z\left[2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \prod_{1 \leq i < j \leq n}\left\{1 + \frac{U_{ij}V_{ij}}{\zeta(1-\zeta)}\right\}\right] \leq \mathbb{E}_Z\left[2^{-r(r-1)/2}\sum_{W \in \mathcal{Z}} \prod_{i,j=1}^n\left\{1 + \frac{U_{ij}V_{ij}}{\zeta(1-\zeta)}\right\}\right]$$
$$\leq \mathbb{E}_Z\,\mathbb{E}_W \prod_{i,j=1}^n \exp\left\{\frac{U_{ij}V_{ij}}{\zeta(1-\zeta)}\right\} = \mathbb{E}_Z\,\mathbb{E}_W \exp\left\{\frac{4\kappa_0^2\rho n^2}{r^2(2-\rho)}\langle Z, W\rangle\right\}, \quad (44)$$
where the third equality uses that, conditionally on $Z$, $A_{ij} \sim \mathrm{Bernoulli}(\zeta + U_{ij})$, so that
$$\mathbb{E}_{A|Z}\left\{\left(\frac{\zeta + V_{ij}}{\zeta}\right)^{A_{ij}}\left(\frac{1-\zeta-V_{ij}}{1-\zeta}\right)^{1-A_{ij}}\right\} = 1 + \frac{U_{ij}V_{ij}}{\zeta(1-\zeta)},$$
and the last equality uses $U_{ij}V_{ij} = (\kappa_0\rho n/r)^2 Z_{ij}W_{ij}$ and $\zeta(1-\zeta) = \rho(2-\rho)/4$.

Step 3.2. For any fixed $Z, W \in \mathcal{Z}$, it holds that
$$\langle Z, W\rangle = 2 z^\top w, \quad (45)$$
where $z$ and $w$ are the vectorised strict upper-triangular parts of $Z$ and $W$, respectively. The vectors $z$ and $w$ lie in $\{\pm 1\}^{r(r-1)/2}$.
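Identity (45) only uses the symmetry and zero diagonal of the matrices in $\mathcal{Z}$; a quick numerical confirmation (a sketch, with hypothetical dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_Z(n, r, rng):
    """Symmetric sign matrix supported on the top-left r x r block,
    zero diagonal, Rademacher strict upper triangle."""
    Z = np.zeros((n, n))
    iu = np.triu_indices(r, k=1)
    Z[:r, :r][iu] = rng.choice([-1.0, 1.0], size=len(iu[0]))
    return Z + Z.T

n, r = 50, 10
Z, W = random_Z(n, r, rng), random_Z(n, r, rng)

# Vectorised strict upper-triangular parts, restricted to the active block.
iu = np.triu_indices(r, k=1)
z, w = Z[:r, :r][iu], W[:r, :r][iu]

inner = np.sum(Z * W)          # trace inner product <Z, W>
assert inner == 2 * (z @ w)    # identity (45)
assert len(z) == r * (r - 1) // 2
print("identity (45) verified")
```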
Step 3.3. Due to (45), it holds that
$$\mathbb{E}_{P_1}\left(\frac{\mathrm{d}P_1}{\mathrm{d}P_0}\right) \leq \mathbb{E}_Z\,\mathbb{E}_W \exp\left\{\frac{8\kappa_0^2\rho n^2}{r^2(2-\rho)}\, z^\top w\right\} = \mathbb{E}_Z \exp\left\{\frac{8\kappa_0^2\rho n^2}{r^2(2-\rho)}\, z^\top \mathbf{1}\right\},$$
where the equality holds because, for fixed $w$, the entries $z_i w_i$ are again independent Rademacher random variables, so $z^\top w$ and $z^\top \mathbf{1}$ have the same distribution. Let $\varepsilon = 2 z^\top \mathbf{1}/r^2$. Then, due to Hoeffding's inequality, for any $x > 0$,
$$P\{\varepsilon > x\} \leq \exp(-x^2 r^2/4),$$
and, for any $x > 1$, $\exp(-x^2 r^2/4) \leq \exp(-x r^2/4)$. Writing $c = 4\kappa_0^2\rho n^2/(2-\rho)$ and $a = e^c$, we have that
$$\mathbb{E}_Z \exp\{c\varepsilon\} = \int_0^\infty P\left\{\exp(c\varepsilon) > x\right\}\mathrm{d}x \leq a + \int_a^\infty P\left\{\varepsilon > \frac{\log x}{c}\right\}\mathrm{d}x \leq a + \int_a^\infty \exp\left\{-\frac{r^2\log x}{4c}\right\}\mathrm{d}x$$
$$\leq a + \frac{32\kappa_0^2\rho n^2}{r^2(2-\rho)} \leq a\left(1 + \frac{32\kappa_0^2 n^2\rho}{r^2}\right), \quad\text{provided that}\quad \kappa_0^2 n^2\rho \leq r^2/32, \quad (46)$$
since (46) guarantees that $\beta = r^2/(4c) = r^2(2-\rho)/(16\kappa_0^2\rho n^2) \geq 2$, so that $\int_a^\infty x^{-\beta}\,\mathrm{d}x \leq 2/\beta$. Then we have
$$\mathbb{E}_{P_1}\left(\frac{\mathrm{d}P_1}{\mathrm{d}P_0}\right) \leq \exp\left\{\frac{4\kappa_0^2\rho n^2}{2-\rho}\right\}\left(1 + \frac{32\kappa_0^2 n^2\rho}{r^2}\right) \leq \exp\left\{\frac{4\kappa_0^2\rho n^2}{2-\rho}\right\}\exp\left\{\frac{32\kappa_0^2 n^2\rho}{r^2}\right\}. \quad (47)$$

Step 3.4. Write for brevity $D_\alpha = r\log(1/\alpha)/(\sqrt{n}\,\kappa_0^2 n\rho)$. Combining (42), (43) and (47), we then have
$$P_{\kappa_0,\nu}\left\{\nu < T < \nu + D_\alpha,\; Z_{\nu,T} \geq \frac{3}{4}\log(1/\alpha)\,\middle|\, T \geq \nu\right\} \leq D_\alpha\,\alpha^{3/4}\exp\left\{D_\alpha\left(\frac{4\kappa_0^2\rho n^2}{2-\rho} + \frac{32\kappa_0^2 n^2\rho}{r^2}\right)\right\}$$
$$\leq \alpha^{-1/4}\,\alpha^{3/4}\,\alpha^{-1/4} = \alpha^{1/4},$$
provided that
$$D_\alpha \leq \alpha^{-1/4} \quad\text{and}\quad D_\alpha\left(\frac{4\kappa_0^2\rho n^2}{2-\rho} + \frac{32\kappa_0^2 n^2\rho}{r^2}\right) \leq \frac{\log(1/\alpha)}{4}. \quad (48)$$
Then it holds that
$$\mathbb{E}_{\kappa_0,\Delta}\{(T - \Delta)_+\} \geq D_\alpha\, P_{\kappa_0,\Delta}\{T - \Delta \geq D_\alpha\} = D_\alpha\left[P_{\kappa_0,\Delta}\{T > \Delta\} - P_{\kappa_0,\Delta}\{\Delta < T < \Delta + D_\alpha\}\right]$$
$$\geq D_\alpha\,(1 - \alpha - 2\alpha^{1/4}) \geq \frac{r\log(1/\alpha)}{2\sqrt{n}\,\kappa_0^2 n\rho}, \quad\text{provided that}\quad \alpha + 2\alpha^{1/4} < \frac{1}{2}, \quad (49)$$
where $P_{\kappa_0,\Delta}\{T \leq \Delta\} \leq \alpha$ by the definition of $\mathcal{D}$, and $P_{\kappa_0,\Delta}\{\Delta < T < \Delta + D_\alpha\} \leq 2\alpha^{1/4}$ by (41) and (48), applied with $\nu = \Delta$, on the events where $Z_{\Delta,T}$ is upper and lower bounded, respectively.
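The Hoeffding bound used in Step 3.3 can be checked by simulation; a minimal sketch (with a hypothetical $r$; not part of the proof) comparing the empirical tail of $\varepsilon = 2\mathbf{1}^\top z/r^2$ with the bound $\exp(-x^2 r^2/4)$:

```python
import numpy as np

rng = np.random.default_rng(2)

r = 20
m = r * (r - 1) // 2            # number of Rademacher coordinates of z
reps = 200_000

# epsilon = (2 / r^2) * (sum of m independent Rademacher signs)
z_sums = rng.choice([-1.0, 1.0], size=(reps, m)).sum(axis=1)
eps = 2.0 * z_sums / r**2

for x in (0.1, 0.2, 0.3):
    empirical = np.mean(eps > x)
    bound = np.exp(-x**2 * r**2 / 4)
    # The Hoeffding bound should dominate the empirical tail.
    assert empirical <= bound, (x, empirical, bound)
print("Hoeffding tail bound dominates the empirical tail")
```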
Step 4. Finally, we show that the set of parameters satisfying (40), (46), (48) and (49) is not empty. For instance, take $\rho = 1/2$ and $\kappa_0^2 n = 1/4$; then (40), (46) and the second condition in (48) hold whenever $r \geq 2\sqrt{n}$. With this choice, the first condition in (48) and (49) hold with the choice $\alpha = 1/n$, for all sufficiently large $n$.
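The parameter choices in Step 4 can be verified mechanically; a small sketch (using the conditions (40) and (46) as stated above, with hypothetical values of $n$):

```python
import numpy as np

# Step-4 choices: rho = 1/2, kappa_0^2 * n = 1/4, in the regime r >= 2 sqrt(n).
for n in (100, 400, 2500):
    rho = 0.5
    kappa0_sq = 1 / (4 * n)          # kappa_0^2 n = 1/4
    kappa0 = np.sqrt(kappa0_sq)
    r = int(np.ceil(2 * np.sqrt(n)))

    # Condition (40): kappa_0 * n / r < 1/2.
    assert kappa0 * n / r < 1 / 2

    # Condition (46): kappa_0^2 * n^2 * rho <= r^2 / 32.
    assert kappa0_sq * n**2 * rho <= r**2 / 32
print("conditions (40) and (46) hold for the Step-4 choices")
```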
Step 5. We have now shown that when $r \gtrsim \sqrt{n}$, it holds that
$$\inf_{\widehat{t} \in \mathcal{D}} \sup_{P \in \mathcal{P}_{\kappa_0,\Delta}} \mathbb{E}_P\bigl\{(\widehat{t} - \Delta)_+\bigr\} \geq \frac{c \log(1/\alpha)\, r}{\sqrt{n}\,\kappa_0^2 n\rho}, \quad (50)$$
with an absolute constant $c > 0$.
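Combining (39) and (50), the two cases can be summarised in the single lower bound quoted in the abstract; in LaTeX:

```latex
% Cases 1 and 2 together: the maximum combines the two regimes of r.
\[
  \inf_{\widehat{t}\,\in\,\mathcal{D}}\;
  \sup_{P\,\in\,\mathcal{P}_{\kappa_0,\Delta}}\;
  \mathbb{E}_P\bigl\{(\widehat{t}-\Delta)_+\bigr\}
  \;\gtrsim\;
  \frac{\log(1/\alpha)}{\kappa_0^2 n\rho}
  \max\Bigl\{1,\; \frac{r}{\sqrt{n}}\Bigr\},
\]
% where (39) covers the regime r <~ sqrt(n) and (50) covers r >~ sqrt(n).
```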