Adaptive Inference for Change Points in High-Dimensional Data
Yangfan Zhang, Runmin Wang and Xiaofeng Shao*

* Yangfan Zhang is a Ph.D. student and Xiaofeng Shao is Professor at the Department of Statistics, University of Illinois at Urbana-Champaign. Runmin Wang is Assistant Professor at the Department of Statistical Science, Southern Methodist University. Emails: [email protected], [email protected] and [email protected]. We would like to thank two anonymous referees for constructive comments, which led to substantial improvements. We are also grateful to Dr. Farida Enikeeva for sending us the code used in Enikeeva and Harchaoui (2019). Shao's research is partially supported by NSF-DMS-1807023 and NSF-DMS-2014018.

Abstract: In this article, we propose a class of test statistics for a change point in the mean of high-dimensional independent data. Our test integrates the U-statistic based approach in a recent work by Wang et al. (2019) and the $L_q$-norm based high-dimensional test in He et al. (2020), and inherits several appealing features, such as being tuning parameter free and the asymptotic independence of test statistics corresponding to even $q$'s. A simple combination of test statistics corresponding to several different $q$'s leads to a test with adaptive power property, that is, it can be powerful against both sparse and dense alternatives. On the estimation front, we obtain the convergence rate of the maximizer of our test statistic standardized by sample size when there is one change point in the mean and $q = 2$, and propose to combine our tests with a wild binary segmentation (WBS) algorithm to estimate the change-point number and locations when there are multiple change points. Numerical comparisons using both simulated and real data demonstrate the advantage of our adaptive test and its corresponding estimation method.

Keywords: asymptotically pivotal, segmentation, self-normalization, structural break, U-statistics
Testing and estimation of change points in a sequence of time-ordered data is a classical problem in statistics. There is a rich literature for both univariate and multivariate data of low dimension; see Csörgő and Horváth (1997), Chen and Gupta (2011) and Tartakovsky et al. (2014) for some book-length introductions, and Perron (2006), Aue and Horváth (2013), and Aminikhanghahi and Cook (2017) for recent reviews of the subject. This paper addresses testing and estimation for change points in high-dimensional data, where the dimension p is large and can exceed the sample size n. As high-dimensional data become ubiquitous due to technological advances in science, engineering and other areas, change point inference under the high-dimensional setting has drawn great interest in recent years. When the dimension p is greater than the sample size n, traditional methods are often no longer applicable. Among recent work that addresses change point inference for the mean of high-dimensional data, we mention Horváth and Hušková (2012), Chan et al. (2013), Jirak (2015), Cho (2016), Wang and Samworth (2018), Enikeeva and Harchaoui (2019), Wang et al. (2019) and Yu and Chen (2020). In most of these papers, the proposed methods are powerful either when the alternative is sparse and strong, i.e., there are a few large non-zero values in the components of the mean difference, or when the alternative is weak and dense, i.e., there are many small values in the components of the mean difference. In some of these papers, the sparsity appears explicitly in the assumptions, e.g. Wang and Samworth (2018), who proposed to project the data onto an informative direction related to the mean change, to which a univariate change point detection algorithm can be applied; in others it appears implicitly in the methodology, e.g. Jirak (2015), who took the maximal CUSUM statistic and therefore essentially targeted the sparse alternative. Yu and Chen (2020) recently introduced a Gaussian multiplier bootstrap to calibrate critical values of the sup norm of CUSUM test statistics in high dimensions, and their test is also specifically designed for sparse alternatives. On the contrary, Horváth and Hušková (2012) aggregated the univariate CUSUM test statistics using the sum, and their test is intended to capture the dense alternative, but the validity of their method requires a cross-sectional independence assumption. Wang et al. (2019) aimed at dense alternatives by extending the U-statistic based approach pioneered by Chen and Qin (2010) in the two-sample testing problem. An exception is the test developed in Enikeeva and Harchaoui (2019), which is based on a combination of a linear statistic and a scan statistic, and can be adaptive to both sparse and dense alternatives.
However, its critical values were obtained under strong Gaussian and independent components assumptions, and they do not seem to work when these assumptions are not satisfied; see Section 4 for numerical evidence.

In practice, it is often unrealistic to assume a particular type of alternative, and there is little knowledge about the type of changes, if any. Thus there is a need to develop new tests that can be adaptive to different types of alternatives and have good power against a broad range of alternatives. In this article, we propose a new class of tests that have this adaptive power property, which holds without the strong Gaussian and independent components assumptions. Our test builds on two recent advances in the high-dimensional testing literature: Wang et al. (2019) and He et al. (2020). Wang et al. (2019) developed a mean change point test based on a U-statistic that is an unbiased estimator of the squared $L_2$ norm of the mean difference. They further used the idea of self-normalization [see Shao (2010), Shao and Zhang (2010), Shao (2015)] to eliminate the need to estimate the unknown nuisance parameter. He et al. (2020) studied both one-sample and two-sample high-dimensional testing problems for the mean and covariance matrix using the $L_q$ norm, where $q \in [2, \infty]$ is an integer. They showed that the corresponding U-statistics at different $q$'s are asymptotically independent, which facilitates a simple combination of the tests based on several values of $q$ (say 2 and $\infty$) and their corresponding p-values, and that the resulting combined test is adaptive to both dense and sparse alternatives.

Building on these two recent advances, we propose a new $L_q$-norm based test for a change point in the mean of high-dimensional independent data. Our contribution to the literature is threefold. On the methodological front, we develop a new class of test statistics (indexed by even $q$) based on the principle of self-normalization in the high-dimensional setting. Our test is tuning parameter free when testing for a single change point. A simple combination of tests corresponding to different $q$'s can be easily implemented owing to the asymptotic independence, and results in an adaptive test that has well-rounded power against a wide range of alternatives. On the theory front, whereas He et al. (2020) proved the asymptotic independence of one-sample and two-sample U-statistics corresponding to different $q$'s, we derive the asymptotic independence of several stochastic processes corresponding to different $q$'s under significantly weaker assumptions. More precisely, we define two-sample test statistics on different sub-samples for each $q$. These statistics can be viewed as smooth functionals of stochastic processes indexed by the starting and ending points of the sub-samples, which turn out to be asymptotically independent for different $q$'s. Compared to the adaptive test in Enikeeva and Harchaoui (2019), which relies on the Gaussian and independent components assumptions, our technical assumptions are much weaker, allowing non-Gaussianity and weak dependence among components. Furthermore, we obtain the convergence rate of the argmax of our SN-based test statistic standardized by sample size when there is one change point and $q = 2$. Lastly, in terms of empirical performance, we show in the simulation studies that the adaptive test can have accurate size and high power for both sparse and dense alternatives.
Its power is always close to the highest power attained by a single statistic under both dense and sparse alternatives.

The rest of the paper is organized as follows. In Section 2, we define our statistic, derive the limiting null distribution and analyze the asymptotic power when there is one change point. We also propose an adaptive procedure combining several tests of different $q$'s. In Section 3, we study the asymptotic behavior of the change-point location estimator when there is a single change point and combine the WBS algorithm with our test to estimate the locations when there are multiple change points. In Section 4, we present simulation results for both testing and estimation and apply the WBS-based estimation method to a real data set. Section 5 concludes. All technical details and some additional simulation results are gathered in the supplemental material.

Mathematically, let $\{Z_t\}_{t=1}^n \in \mathbb{R}^p$ be i.i.d. random vectors with mean 0 and covariance $\Sigma$. Our observed data are $X_t = Z_t + \mu_t$, where $\mu_t = E(X_t)$ is the mean at time $t$. The null hypothesis is that there is no change point in the mean vector $\mu_t$, and the alternative is that there is at least one change point, the location of which is unknown. That is, we want to test
\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_n \quad \text{v.s.} \quad H_1: \mu_1 = \cdots = \mu_{k_1} \neq \mu_{k_1+1} = \cdots = \mu_{k_s} \neq \mu_{k_s+1} = \cdots = \mu_n,
\]
where $k_1 < k_2 < \cdots < k_s$ and $s$ are unknown. Note that we assume temporal independence, which seems to be commonly adopted in change point analysis for genomic data; see Zhang et al. (2010), Jeng et al. (2010), and Zhang and Siegmund (2012), among others.

In this section, we first construct our two-sample U-statistic for a single change point alternative, which is the cornerstone for the estimation method introduced later. Then we derive the theoretical size and power results for our statistic. We also form an adaptive test that combines tests corresponding to different $q$'s. Throughout the paper, we assume $p \wedge n \to +\infty$, and we may write $p = p_n$ to emphasize that $p$ can depend on $n$. For a vector or matrix $A$ and $q \in \mathbb{N}$, we use $\|A\|_q$ to denote $\big(\sum_{i,j} A_{ij}^q\big)^{1/q}$; in particular, for $q = 2$, $\|\cdot\|_2 = \|\cdot\|_F$ equals the Frobenius norm. We use $\|\Sigma\|_s$ to denote the spectral norm. Denote the number of permutations $P_k^q = k!/(k-q)!$, and define $\sum^*$ to be the summation over all pairwise distinct indices. If $\lim_n a_n/b_n = 0$, we write $a_n = o(b_n)$, and if $0 < \liminf_n a_n/b_n \le \limsup_n a_n/b_n < +\infty$, we write $a_n \asymp b_n$. Throughout the paper, we use "$\overset{D}{\to}$" to denote convergence in distribution, "$\overset{P}{\to}$" for convergence in probability, and "$\rightsquigarrow$" for process convergence in a suitable function space. We use $\ell^\infty([0,1]^3)$ to denote the set of bounded functions on $[0,1]^3$.

In this subsection, we develop our test statistics for the one change point alternative, i.e.,
\[
H_1: \mu_1 = \cdots = \mu_{k_1} \neq \mu_{k_1+1} = \cdots = \mu_n,
\]
where $k_1$ is unknown. In Wang et al. (2019), a U-statistic based approach was developed, and their test targets the dense alternative since the power is a monotone function of $\sqrt{n}\,\|\Delta\|/\|\Sigma\|_F^{1/2}$, where $\Delta$ is the difference between pre-break and post-break means, i.e., $\Delta = \mu_n - \mu_1$, and $\Sigma$ is the covariance matrix of $X_i$. Thus their test may not be powerful if the change in mean is sparse and $\|\Delta\|_2$ is small.
Note that several tests have been developed to capture sparse alternatives, as mentioned in Section 1. In practice, when there is no prior knowledge of the alternative for a given data set at hand, it would be helpful to have a test that can be adaptive to different types and magnitudes of the change. To this end, we adopt the $L_q$ norm-based approach, as initiated by Xu et al. (2016) and He et al. (2020), develop a class of test statistics indexed by $q$, and then combine these tests to achieve adaptivity.

Denote $X_i = (X_{i,1}, \ldots, X_{i,p})^T$. For any positive even number $q$, consider the following two-sample U-statistic of order $(q, q)$,
\[
T_{n,q}(k) = \frac{1}{P_k^q P_{n-k}^q} \sum_{l=1}^p \;\sum^*_{1 \le i_1, \ldots, i_q \le k} \;\sum^*_{k+1 \le j_1, \ldots, j_q \le n} (X_{i_1,l} - X_{j_1,l}) \cdots (X_{i_q,l} - X_{j_q,l}),
\]
for any $k = q, \ldots, n-q$. A simple calculation shows that $E[T_{n,q}(k)] = 0$ for any $k = q, \ldots, n-q$ under the null hypothesis, and $E[T_{n,q}(k)] = \|\Delta\|_q^q$ under the alternative. When $q$ is odd and under the alternative, $E[T_{n,q}(k)] = \sum_{j=1}^p \delta_j^q \neq \|\Delta\|_q^q$, where $\Delta = (\delta_1, \ldots, \delta_p)^T$. This is the main reason we focus on statistics corresponding to even $q$'s, since for an odd $q$, $\sum_{j=1}^p \delta_j^q = 0$ does not imply $\Delta = 0$.

If the change point location $k = \lfloor \tau n \rfloor$, $\tau \in (0,1)$, were known, then we would use $T_{n,q}(k)$ as our test statistic. As implied by the asymptotic results shown later, we have that under the null,
\[
\left(\frac{\tau(1-\tau)\, n}{\|\Sigma\|_q}\right)^{q/2} \frac{T_{n,q}(k)}{\sqrt{q!}} \overset{D}{\to} N(0,1),
\]
under suitable moment and weak dependence assumptions on the components of $X_t$. A typical approach is then to replace $\|\Sigma\|_q$ by a ratio-consistent estimator, which is available for $q = 2$ [see Chen and Qin (2010)] but not for general $q$. In practice, the location $k_1$ is unknown, which adds complexity to the variance estimation and motivated Wang et al. (2019) to use the idea of self-normalization [Shao (2010), Shao and Zhang (2010)] in the case $q = 2$. Self-normalization is a nascent inferential method [Lobato (2001), Shao (2010)] that has been developed for a low and fixed-dimensional parameter in low dimensional time series. It uses an inconsistent variance estimator to yield an asymptotically pivotal statistic, and does not involve any tuning parameter, or involves fewer tuning parameters than traditional procedures. See Shao (2015) for a comprehensive review of recent developments for low dimensional time series. There have been two recent extensions to the high-dimensional setting: Wang and Shao (2019) adopted a one-sample U-statistic with trimming and extended self-normalization to inference for the mean of high-dimensional time series; Wang et al. (2019) used a two-sample U-statistic and extended the self-normalization (SN)-based change point test in Shao and Zhang (2010) to high-dimensional independent data. Both papers are $L_2$ norm based, and this seems to be the first time that an $L_q$-norm based approach is extended to the high-dimensional setting via self-normalization.

Following Wang et al. (2019), we consider the following self-normalization procedure. Define
\[
U_{n,q}(k; s, m) = \sum_{l=1}^p \;\sum^*_{s \le i_1, \ldots, i_q \le k} \;\sum^*_{k+1 \le j_1, \ldots, j_q \le m} (X_{i_1,l} - X_{j_1,l}) \cdots (X_{i_q,l} - X_{j_q,l}),
\]
which is an un-normalized version of $T_{n,q}$ applied to the subsample $(X_s, \ldots, X_m)$. Let
\[
W_{n,q}(k; s, m) := \frac{1}{m-s+1} \sum_{t=s+q-1}^{k-q} U_{n,q}(t; s, k)^2 + \frac{1}{m-s+1} \sum_{t=k+q}^{m-q} U_{n,q}(t; k+1, m)^2.
\]
The self-normalized statistic is given by
\[
\widetilde{T}_{n,q} := \max_{k=2q, \ldots, n-2q} \frac{U_{n,q}(k; 1, n)^2}{W_{n,q}(k; 1, n)}.
\]

Remark 2.1. If we want to test for multiple change points, we can use the scanning idea presented in Zhang and Lavitas (2018) and Wang et al. (2019) and construct the following statistic:
\[
T^*_{n,q} := \max_{2q \le l_1 \le l_2 - 2q} \frac{U_{n,q}(l_1; 1, l_2)^2}{W_{n,q}(l_1; 1, l_2)} + \max_{m_1 + 2q - 1 \le m_2 \le n - 2q} \frac{U_{n,q}(m_2; m_1, n)^2}{W_{n,q}(m_2; m_1, n)}.
\]
We skip further details, as the asymptotic theory and computational implementation are fairly straightforward.
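For concreteness, the following Python sketch illustrates how $U_{n,2}$, $W_{n,2}$ and $\widetilde{T}_{n,2}$ could be computed for $q = 2$; it is our own minimal illustration (the function names U2, W2 and sn_stat are ours, not the authors'), using the fact that for $q = 2$ the two-sample U-statistic can be expanded into componentwise sums, so that no explicit enumeration of index tuples is needed.

```python
import numpy as np

def U2(X, k, s, m):
    """Un-normalized two-sample U-statistic U_{n,2}(k; s, m) for q = 2.

    X is an (n, p) array; the two blocks are X_s,...,X_k and X_{k+1},...,X_m
    (1-based indices s <= k < m, as in the paper)."""
    A = X[s-1:k]            # pre-split block, n1 = k - s + 1 rows
    B = X[k:m]              # post-split block, n2 = m - k rows
    n1, n2 = A.shape[0], B.shape[0]
    sA, sB = A.sum(0), B.sum(0)              # columnwise sums
    qA, qB = (A**2).sum(0), (B**2).sum(0)    # columnwise sums of squares
    # sum over distinct pairs (i1, i2) and (j1, j2) of
    # (X_{i1,l} - X_{j1,l})(X_{i2,l} - X_{j2,l}), then over components l
    per_l = (n2*(n2-1)*(sA**2 - qA)
             - 2*(n1-1)*(n2-1)*sA*sB
             + n1*(n1-1)*(sB**2 - qB))
    return per_l.sum()

def W2(X, k, s, m, q=2):
    """Self-normalizer W_{n,2}(k; s, m)."""
    n_len = m - s + 1
    left = sum(U2(X, t, s, k)**2 for t in range(s+q-1, k-q+1))
    right = sum(U2(X, t, k+1, m)**2 for t in range(k+q, m-q+1))
    return (left + right) / n_len

def sn_stat(X, q=2):
    """Self-normalized statistic: max of U^2 / W over k = 2q, ..., n - 2q."""
    n = X.shape[0]
    return max(U2(X, k, 1, n)**2 / W2(X, k, 1, n) for k in range(2*q, n-2*q+1))
```

For larger even $q$, analogous closed-form expansions can be derived; a naive enumeration of index tuples would be prohibitively slow, which is consistent with the computational concerns for large $q$ discussed below.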
Before presenting our main theorem, we need to make the following assumptions.
Assumption 2.1. Suppose $Z_1, \ldots, Z_n$ are i.i.d. copies of $Z_1$ with mean 0 and covariance matrix $\Sigma$, and the following conditions hold.

1. There exists $c > 0$ not depending on $n$ such that $\inf_{i=1,\ldots,p_n} \mathrm{Var}(Z_{1,i}) \ge c$.

2. $Z_1$ has up to 8th moments, with $\sup_{1 \le j \le p} E[Z_{1,j}^8] \le C$, and for $h = 2, \ldots, 8$ there exist constants $C_h$ depending on $h$ only and a constant $r > 2$ such that
\[
|\mathrm{cum}(Z_{1,l_1}, \ldots, Z_{1,l_h})| \le C_h \Big(1 \vee \max_{1 \le i,j \le h} |l_i - l_j|\Big)^{-r}.
\]

Remark 2.2 (Discussion of Assumptions). The above cumulant assumption is implied by geometric moment contraction [cf. Proposition 2 of Wu and Shao (2004)], by the physical dependence measure proposed by Wu (2005) [cf. Section 4 of Shao and Wu (2007)], or by $\alpha$-mixing [Andrews (1991), Zhurbenko and Zuev (1975)] in the time series setting. It basically imposes weak dependence among the $p$ components of the data. Our theory holds as long as some permutation of the $p$ components satisfies the cumulant assumption, since our test is invariant to permutations of the components.

To derive the limiting null distribution of $\widetilde{T}_{n,q}$, we need to define some useful intermediate processes. Define
\[
D_{n,q}(r; [a,b]) = U_{n,q}(\lfloor nr \rfloor; \lfloor na \rfloor + 1, \lfloor nb \rfloor) = \sum_{l=1}^p \;\sum^*_{\lfloor na \rfloor + 1 \le i_1, \ldots, i_q \le \lfloor nr \rfloor} \;\sum^*_{\lfloor nr \rfloor + 1 \le j_1, \ldots, j_q \le \lfloor nb \rfloor} (X_{i_1,l} - X_{j_1,l}) \cdots (X_{i_q,l} - X_{j_q,l}),
\]
for any $0 \le a < r < b \le 1$. Note that under the null the $X_i$'s have the same mean. Therefore, we can rewrite $D_{n,q}$ as
\[
D_{n,q}(r; [a,b]) = \sum_{l=1}^p \;\sum^*_{\lfloor na \rfloor + 1 \le i_1, \ldots, i_q \le \lfloor nr \rfloor} \;\sum^*_{\lfloor nr \rfloor + 1 \le j_1, \ldots, j_q \le \lfloor nb \rfloor} (Z_{i_1,l} - Z_{j_1,l}) \cdots (Z_{i_q,l} - Z_{j_q,l})
= \sum_{c=0}^q (-1)^{q-c} \binom{q}{c} P^{\,q-c}_{\lfloor nr \rfloor - \lfloor na \rfloor - c} \, P^{\,c}_{\lfloor nb \rfloor - \lfloor nr \rfloor - q + c} \, S_{n,q,c}(r; [a,b]).
\]
In the above expression, considering the summand for each $c = 0, 1, \ldots, q$, we define, for any $0 \le a < r < b \le 1$,
\[
S_{n,q,c}(r; [a,b]) = \sum_{l=1}^p \;\sum^*_{\lfloor na \rfloor + 1 \le i_1, \ldots, i_c \le \lfloor nr \rfloor} \;\sum^*_{\lfloor nr \rfloor + 1 \le j_1, \ldots, j_{q-c} \le \lfloor nb \rfloor} \Big( \prod_{t=1}^c Z_{i_t,l} \prod_{s=1}^{q-c} Z_{j_s,l} \Big),
\]
if $\lfloor nr \rfloor \ge \lfloor na \rfloor + 1$ and $\lfloor nb \rfloor \ge \lfloor nr \rfloor + 1$, and 0 otherwise.

Theorem 2.1. If Assumption 2.1 holds, then under the null and for a finite set $I$ of positive even numbers, we have
\[
\big\{ a_{n,q}^{-1} S_{n,q,c}(\cdot; [\cdot,\cdot]) \big\}_{q \in I,\, 0 \le c \le q} \;\rightsquigarrow\; \big\{ Q_{q,c}(\cdot; [\cdot,\cdot]) \big\}_{q \in I,\, 0 \le c \le q}
\]
in $\ell^\infty([0,1]^3)$ jointly over $q \in I$, $0 \le c \le q$, where $a_{n,q} = \sqrt{n^q \sum_{l_1,l_2=1}^p \Sigma_{l_1,l_2}^q} = \sqrt{n^q \|\Sigma\|_q^q}$ and the $Q_{q,c}$ are centered Gaussian processes. Furthermore, the covariance of $Q_{q,c_1}$ and $Q_{q,c_2}$ is given by
\[
\mathrm{cov}\big(Q_{q,c_1}(r_1; [a_1,b_1]),\, Q_{q,c_2}(r_2; [a_2,b_2])\big) = \binom{C}{c}\, c!\,(q-c)!\,(r - A)^{c} (R - r)^{C-c} (b - R)^{q-C},
\]
where $(r,R) = (\min\{r_1,r_2\}, \max\{r_1,r_2\})$, $(a,A) = (\min\{a_1,a_2\}, \max\{a_1,a_2\})$, $(b,B) = (\min\{b_1,b_2\}, \max\{b_1,b_2\})$, and $(c,C) = (\min\{c_1,c_2\}, \max\{c_1,c_2\})$. Additionally, $Q_{q_1,c_1}$ and $Q_{q_2,c_2}$ are mutually independent if $q_1 \neq q_2$.

For illustration, consider the case when $a_1 < a_2 < r_1 < r_2 < b_1 < b_2$ and $c_1 \le c_2$. We have
\[
\mathrm{cov}\big(Q_{q,c_1}(r_1; [a_1,b_1]),\, Q_{q,c_2}(r_2; [a_2,b_2])\big) = \binom{c_2}{c_1}\, c_1!\,(q-c_1)!\,(r_1 - a_2)^{c_1} (r_2 - r_1)^{c_2-c_1} (b_1 - r_2)^{q-c_2},
\]
which implies, for example, $\mathrm{var}[Q_{q,c}(r;[a,b])] = c!\,(q-c)!\,(r-a)^c(b-r)^{q-c}$. The proof of Theorem 2.1 is long and is deferred to the supplement.
Theorem 2.2. Suppose Assumption 2.1 holds. Then for a finite set $I$ of positive even numbers,
\[
\big\{ n^{-q} a_{n,q}^{-1} D_{n,q}(\cdot; [\cdot,\cdot]) \big\}_{q \in I} \;\rightsquigarrow\; \big\{ G_q(\cdot; [\cdot,\cdot]) \big\}_{q \in I}
\]
in $\ell^\infty([0,1]^3)$ jointly over $q \in I$, where
\[
G_q(r; [a,b]) = \sum_{c=0}^q (-1)^{q-c} \binom{q}{c} (r-a)^{q-c} (b-r)^c \, Q_{q,c}(r; [a,b])
\]
and $Q_{q,c}$ is given in Theorem 2.1. Furthermore, for $q_1 \neq q_2$, $G_{q_1}$ and $G_{q_2}$ are independent. Consequently, under the null,
\[
\widetilde{T}_{n,q} \overset{D}{\longrightarrow} \widetilde{T}_q = \sup_{r \in [0,1]} \frac{G_q(r; 0,1)^2}{\int_0^r G_q(u; 0,r)^2 \, du + \int_r^1 G_q(u; r,1)^2 \, du}.
\]

It can be derived that $G_q(\cdot;[\cdot,\cdot])$ is a Gaussian process with the following covariance structure:
\[
\mathrm{var}[G_q(r;[a,b])] = \sum_{c=0}^q \binom{q}{c}^2 c!\,(q-c)!\,(r-a)^{2q-c}(b-r)^{q+c} = q!\,(r-a)^q(b-r)^q(b-a)^q.
\]
When $r_1 = r_2 = r$, $\mathrm{cov}(G_q(r;[a_1,b_1]), G_q(r;[a_2,b_2])) = q!\,(r-A)^q(b-r)^q(B-a)^q$. When $r_1 \neq r_2$,
\[
\mathrm{cov}(G_q(r_1;[a_1,b_1]), G_q(r_2;[a_2,b_2])) = q!\,\big[(r-A)(b-R)(B-a) - (A-a)(R-r)(B-b)\big]^q,
\]
where $(r,R,a,A,b,B,c,C)$ is defined in Theorem 2.1. The limiting null distribution $\widetilde{T}_q$ is pivotal, and its critical values can be simulated as done in Wang et al. (2019) for the case $q = 2$; we simulate the critical values and the corresponding realizations for $q = 2, 4, 6$. We do not pursue larger $q$, such as $q = 8, 10$, since a larger $q$ corresponds to more trimming on the two ends and the finite sample performance with $q = 6$ is already very promising for detecting sparse alternatives; see Section 4. An additional difficulty with larger $q$ is the associated computational cost and complexity of implementation.

Remark 2.3. Compared to He et al. (2020), we assume 8th moment conditions, which are weaker than the uniform sub-Gaussian type conditions in their condition A.4(2), although the latter condition seems to be used exclusively for deriving the limit of the test statistic corresponding to $q = \infty$. Furthermore, since their strong mixing condition with exponential decay rate [cf. condition A.4(3) of He et al. (2020)] implies our cumulant Assumption 2.1 [see Andrews (1991), Zhurbenko and Zuev (1975)], our overall assumption is weaker than condition A.4 in He et al. (2020). Despite the weaker assumptions, our results are stronger, as we derive the asymptotic independence of several stochastic processes indexed by $q$, which implies the asymptotic independence of U-statistics indexed by $q$.

Note that our current formulation does not include the $q = \infty$ case, which corresponds to the $L_\infty$ norm of the mean difference, $\|\Delta\|_\infty$. The $L_\infty$-norm based test was developed by Yu and Chen (2020); their test statistic is based on the CUSUM statistics
\[
Z_n(s) = \sqrt{\frac{s(n-s)}{n}} \left( \frac{1}{s} \sum_{i=1}^s X_i - \frac{1}{n-s} \sum_{i=s+1}^n X_i \right)
\]
and takes the form $T_n = \max_{s_0 \le s \le n - s_0} \|Z_n(s)\|_\infty$, where $s_0$ is a boundary removal parameter. They did not obtain the asymptotic distribution of $T_n$, but showed that a bootstrap CUSUM test statistic is able to approximate the finite sample distribution of $T_n$ using a modification of the Gaussian and bootstrap approximation techniques developed by Chernozhukov et al. (2013, 2017). Given the asymptotic independence between the $L_q$-norm based U-statistic and the $L_\infty$-norm based test statistic [He et al. (2020)] in the two-sample testing context, we conjecture that the $T_n$ statistic in Yu and Chen (2020) is asymptotically independent of our $\widetilde{T}_{n,q}$ for any even $q$ under suitable moment and weak componentwise dependence conditions. A rigorous investigation is left for future work.

Let $I$ be a finite set of positive even numbers (e.g., $\{2, 6\}$). Since the $\widetilde{T}_{n,q}$'s are asymptotically independent for different $q \in I$ under the null, we can combine their corresponding p-values to form an adaptive test. For example, we may use $p_{ada} = \min_{q \in I} p_q$, where $p_q$ is the p-value corresponding to $\widetilde{T}_{n,q}$, as a new statistic. Its p-value equals $1 - (1 - p_{ada})^{|I|}$. If we want to perform a level-$\alpha$ test, this is equivalent to conducting the tests based on $\widetilde{T}_{n,q}$ for each $q \in I$ at level $1 - (1-\alpha)^{1/|I|}$ and rejecting the null if at least one of the statistics exceeds its critical value. Therefore, we only need to compare each $\widetilde{T}_{n,q}$ with the $(1-\alpha)^{1/|I|}$-quantile of its corresponding limiting null distribution.

As explained before, a smaller $q$ (say $q = 2$) tends to have higher power under the dense alternative, which is also the main motivation for the proposed method in Wang et al. (2019). On the contrary, a larger $q$ has higher power under the sparse alternative, as $\lim_{q\to\infty}\|\Delta_n\|_q = \|\Delta_n\|_\infty$. Therefore, with the adaptive test, we can achieve high power under both dense and sparse alternatives while the asymptotic size remains equal to $\alpha$. This adaptivity will be confirmed by the asymptotic power analysis presented in Section 2.4 and the simulation results presented in Section 4.
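As a concrete illustration of the combination rule just described, the following sketch (our own, with hypothetical helper names; the simulated null draws would come from the pivotal limits $\widetilde{T}_q$) turns single-$q$ results into an adaptive level-$\alpha$ decision.

```python
import numpy as np

def adaptive_pvalue(p_values):
    """Combine asymptotically independent single-q p-values.

    p_values: dict mapping q -> p-value of the corresponding SN test.
    Returns the p-value of the adaptive test based on p_ada = min_q p_q."""
    p_ada = min(p_values.values())
    return 1.0 - (1.0 - p_ada) ** len(p_values)

def adaptive_reject(stats, null_samples, alpha=0.05):
    """Equivalent rejection rule: compare each statistic with the
    (1 - alpha)^(1/|I|) quantile of its simulated null distribution.

    stats: dict q -> observed value of the SN statistic
    null_samples: dict q -> 1-D array of simulated draws from its null limit"""
    level = (1.0 - alpha) ** (1.0 / len(stats))
    return any(stats[q] > np.quantile(null_samples[q], level) for q in stats)
```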
This adaptivity will beconfirmed by our asymptotic power analysis presented in Section 2.4 and simulation results presented in Section 4.8 .4 Power Analysis Theorem . Assume that the change point location is at k = (cid:98) τ n (cid:99) with the change in the mean equal to ∆ n =( δ n, , . . . , δ n,p ) T . Suppose Assumption 2.1, and the following conditions on ∆ n hold. We have1. If n q/ (cid:107) ∆ n (cid:107) qq / (cid:107) Σ (cid:107) q/ q → ∞ , then (cid:101) T n,q P −→ ∞ ;2. If n q/ (cid:107) ∆ n (cid:107) qq / (cid:107) Σ (cid:107) q/ q → , then (cid:101) T n,q D −→ (cid:101) T q ;3. If n q/ (cid:107) ∆ n (cid:107) qq / (cid:107) Σ (cid:107) q/ q → γ ∈ (0 , + ∞ ) , then (cid:101) T n,q D −→ sup r ∈ [0 , { G q ( r ; 0 ,
1) + γJ q ( r, , } (cid:82) r { G q ( u ; 0 , r ) + γJ q ( u, , r ) } du + (cid:82) r { G q ( u ; r,
1) + γJ q ( u, r, } du , where J q ( r, a, b ) := ( τ − a ) q ( b − r ) q a < τ ≤ r < b ( r − a ) q ( b − τ ) q a < r < τ < b τ < a or τ > b . Remark . The following example illustrates the power behavior using different q ∈ N . For simplicity, we assume Σ = I p and consider a change in the mean equal to ∆ n = δ · ( d , p − d ) T . In addition to demonstrating that large(small) q is favorable to the sparse (dense) alternatives, our local asymptotic power results stated in Theorem 2.3 alsoallow us to provide a rule to classify an alternative, which is given by sparse d = o ( √ p ) in between d (cid:16) √ pdense √ p = o ( d ) . To have a nontrivial power, it suffices to have n q/ (cid:107) ∆ n (cid:107) qq / (cid:107) Σ (cid:107) q/ q = dn q/ δ q / √ p = γ ∈ (0 , + ∞ ) , which implies δ (cid:16) ( √ p/d ) /q n − / . Therefore, when d = o ( √ p ) , a smaller δ corresponds to a larger q . On the contrary, when √ p = o ( d ) , a smaller q that yields a larger δ is preferable to have higher power. Similar argument still holds formore general ∆ n and Σ , as long as we have a similar order for (cid:107) ∆ n (cid:107) qq and (cid:107) Σ (cid:107) qq , and the latter one is guaranteed byAssumption 2.1.We can summarize the asymptotic powers of the tests under different alternatives in the following table. Note thatwhen at least one single- q based test obtains asymptotically nontrivial power (power 1), our adaptive test can alsoachieve nontrivial power (power 1). δ I = { } I = { q } I = { , q } Dense √ p = o ( d ) δ = o ( p / d − / n − / ) α α αδ (cid:16) p / d − / n − / β ∈ ( α, α ( α, β ) p / d − / n − / = o ( δ ) , α δ = o ( p / q d − /q n − / ) δ (cid:16) p / q d − /q n − / α,
1) 1 p / q d − /q n − / = o ( δ ) 1 1 1Sparse d = o ( √ p ) δ = o ( p / q d − /q n − / ) α α αδ (cid:16) p / q d − /q n − / α β ∈ ( α,
1) ( α, β ) p / q d − /q n − / = o ( δ ) , α δ = o ( p / d − / n − / ) δ (cid:16) p / d − / n − / ( α,
1) 1 1 p / d − / n − / = o ( δ ) 1 1 1Table 1: Asymptotic powers of single- q and adaptive testsLiu et al. (2020) recently studied the detection of a sparse change in the high-dimensional mean vector under theGaussian assumption as a minimax testing problem. Let ρ = min( k , n − k ) (cid:107) ∆ n (cid:107) . In the fully dense case, i.e.,when (cid:107) ∆ n (cid:107) = p , where (cid:107) ∆ n (cid:107) denotes the L norm, Theorem 8 in Liu et al. (2020) stated that the minimax rate isgiven by ρ (cid:16) (cid:107) Σ (cid:107) F (cid:112) log log(8 n ) ∨ (cid:107) Σ (cid:107) s log log(8 n ). Thus under the assumption that k /n = τ ∈ (0 , L -normbased test in Wang et al. (2019) achieves the rate optimality up to a logarithm factor. Consequently, any adaptivetest based on I is rate optimal (up to a logarithm factor) as long as 2 ∈ I .In the special case Σ = I p , the minimax rate is given by ρ (cid:16) (cid:112) p log log(8 n ) if d ≥ (cid:112) p log log(8 n ) d log (cid:16) ep log log(8 n ) d (cid:17) ∨ log log(8 n ) if d < (cid:112) p log log(8 n ) . Recall that ∆ n = δ · ( d , p − d ) T . In the sparse setting d = o ( √ p ) and under the assumptions that d (cid:16) p − v , v ∈ (0 , / d > log log(8 n ), the minimax rate is d (up to a logarithm factor), which corresponds to δ (cid:16) n − / . Our L q -normbased test is not minimax rate optimal since the detection boundary is ( √ p/d ) /q n − / , which gets closer to n − / as q ∈ N gets larger. In the dense setting √ p = o ( d ) and under the assumptions that d (cid:16) p − v , v ∈ (1 / ,
1) and d > (cid:112) p log log(8 n ), the minimax rate is √ p (up to a logarithm factor), which corresponds to δ (cid:16) p / / √ nd . Thereforethe L -norm based test in Wang et al. (2019) is again rate optimal (up to a logarithm factor). In this section, we investigate the change-point location estimation based on change-point test statistics we proposedin Section 2. Specifically, Section 3.1 presents convergence rate for the argmax of SN-based test statistic upon suitablestandardization. Section 3.2 proposes a combination of wild binary segmentation (WBS, Fryzlewicz (2014)) algorithmwith our SN-based test statistics for both single- q test and adaptive test to estimate multiple change points.10 .1 Single Change-point Estimation In this subsection, we propose to estimate the location of a change point assuming that the data is generated from thefollowing single change-point model, X t = µ + ∆ n ( t > k ∗ ) + Z t , t = 1 , · · · , n, where k ∗ = k = (cid:98) τ ∗ n (cid:99) is the location of change point. In the literature, it is common to focus on the convergencerate of the estimators of the relative location τ ∗ ∈ (0 , τ = ˆ k/n ,where ˆ k is an estimator for k ∗ .Given the discussions about size and power properties of the SN-based test statistic in Section 2, it is natural touse the argmax of the test statistic as the estimator for k ∗ . That is, we defineˆ k = argmax k =2 q,...,n − q U n,q ( k ; 1 , n ) W n,q ( k ; 1 , n ) . To present the convergence rate for ˆ τ , we shall introduce the following assumptions. Assumption . tr (Σ ) = o ( (cid:107) Σ (cid:107) F ) ;2. (cid:80) pl ,...,l h =1 cum ( Z ,l , ..., Z ,l h ) ≤ C (cid:107) Σ (cid:107) hF , for h = 2 , ..., ;3. (cid:107) Σ (cid:107) F = o ( n (cid:107) ∆ n (cid:107) ) . Let γ n,q = n q/ (cid:107) ∆ n (cid:107) qq / (cid:107) Σ (cid:107) q/ q so γ n, = n (cid:107) ∆ n (cid:107) / (cid:107) Σ (cid:107) F . We have the following convergence rate of ˆ τ for the case q = 2. Theorem . Suppose Assumption 3.1 holds and q = 2 . It holds that ˆ τ − τ ∗ = o p ( γ − / κn, ) as n ∧ p → ∞ , for any < κ < / . Remark . Assumption 3.1 (1) and (2) have been assumed in Wang et al. (2019), and they are implied by Assump-tion 2.1; see Remark 3.2 in Wang et al. (2019). Assumption 3.1(3) is equivalent to γ n, → ∞ , which implies that ˆ τ is a consistent estimator of τ ∗ . Note that even in the low-dimensional setting, no convergence rate for the argmax ofSN-based statistic (standarized by the sample size) is obtained in Shao and Zhang (2010). Thus this is the first timethe asymptotic rate for the argmax of a SN-based test statistic is studied. On the other hand, the proof for the moregeneral case q ∈ N is considerably more involved than the special case q = 2 and is deferred to future investigation. In practice, the interest is often in the change point estimation or segmentation, when the presence of change pointsis confirmed by testing or based on prior knowledge. In the high-dimensional context, the literature on change pointestimation is relatively scarce; see Cho (2016), Wang and Samworth (2018) and Wang et al. (2019). Here we shallfollow the latter two papers and use the wild binary segmentation [Fryzlewicz (2014)] coupled with our test developed11or a single q or adaptive test to estimate the number and location of change points. Note that the standard binarysegmentation procedure may fail when the change in means is not monotonic, as shown in Wang et al. 
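Computationally, the estimator $\hat k$ is just the argmax of the self-normalized objective; reusing the hypothetical U2 and W2 helpers sketched in Section 2.1 (our own naming, for the case $q = 2$), a minimal version reads:

```python
def change_point_estimate(X, q=2):
    """argmax of the SN objective U^2/W over k = 2q, ..., n - 2q (here q = 2)."""
    n = X.shape[0]
    candidates = range(2*q, n - 2*q + 1)
    k_hat = max(candidates, key=lambda k: U2(X, k, 1, n)**2 / W2(X, k, 1, n))
    return k_hat, k_hat / n   # location estimate and relative location tau-hat
```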
In practice, the interest is often in change point estimation or segmentation, once the presence of change points has been confirmed by testing or is supported by prior knowledge. In the high-dimensional context, the literature on change point estimation is relatively scarce; see Cho (2016), Wang and Samworth (2018) and Wang et al. (2019). Here we follow the latter two papers and use wild binary segmentation [Fryzlewicz (2014)] coupled with our test, developed for a single $q$ or in its adaptive form, to estimate the number and locations of change points. Note that the standard binary segmentation procedure may fail when the change in means is not monotonic, as shown in Wang et al. (2019) via simulations.

For any integers $s, e$ satisfying $2q \le s + 2q - 1 \le e - 2q \le n - 2q$, define
\[
Q_{n,q}(s, e) := \max_{b = s+2q-1, \ldots, e-2q} \frac{U_{n,q}(b; s, e)^2}{W_{n,q}(b; s, e)}.
\]
Note that $Q_{n,q}(s, e)$ is essentially the statistic $\widetilde{T}_{n,q}$ based on the sub-sample $(X_s, \ldots, X_e)$. Draw a random sample $F^M_n$ of $M$ pairs $(s_m, e_m)$, independently with replacement, from the pairs satisfying $2q \le s_m + 2q - 1 \le e_m - 2q \le n - 2q$. In practice, we may require the segments to be slightly longer to reduce unnecessary fluctuations of the critical values. Then define
\[
\hat\xi_{n,M,q} = \max_{m=1,\ldots,M} Q_{n,q}(s_m, e_m),
\]
and we stop the algorithm if $\hat\xi_{n,M,q} \le \xi_{n,q}$, where $\xi_{n,q}$ is a threshold to be specified below, and estimate a change point otherwise; see Algorithm 1 for details.

One anonymous reviewer asked whether it is possible to derive the limiting distribution of $\hat\xi_{n,M,q}$ under the null, which turns out to be challenging for two reasons: (1) the SN-based test statistics for different intervals can be highly dependent, especially when two intervals overlap substantially; (2) the number of randomly generated intervals is usually large, and it would be more valuable to develop an asymptotic distribution under the assumption that both the sample size and the number of intervals go to infinity. It seems difficult to use classical arguments for this problem, and we leave it for future investigation.

To obtain the threshold value $\xi_{n,q}$ needed in Algorithm 1, we generate $R$ standard Gaussian samples, each of sample size $n$ and dimension $p$. For the $r$-th sample ($r = 1, \ldots, R$), we calculate
\[
\hat\xi^{(r)}_{n,M,q} = \max_{m=1,\ldots,M} Q^{(r)}_{n,q}(s_m, e_m),
\]
where $Q^{(r)}_{n,q}(s_m, e_m)$ is the SN-based test statistic applied to the $r$-th simulated Gaussian sample. We take $\xi_{n,q}$ to be the 95% quantile of $\{\hat\xi^{(r)}_{n,M,q}\}_{r=1}^R$. Since the self-normalized test statistic is asymptotically pivotal, this threshold is expected to approximate the 95% quantile of the finite sample distribution of the maximized SN-based test statistic applied to the $M$ randomly drawn sub-samples of the original data.

Algorithm 1: WBS algorithm for a given $q$

function WBS(S, E)
    if E − S < 4q − 1 then
        STOP
    else
        M_{S,E} ← set of those 1 ≤ m ≤ M such that S ≤ s_m, e_m ≤ E and e_m − s_m ≥ 4q − 1
        m_q ← argmax_{m ∈ M_{S,E}} Q_{n,q}(s_m, e_m)
        if Q_{n,q}(s_{m_q}, e_{m_q}) > ξ_{n,q} then
            b ← argmax_b U_{n,q}(b; s_{m_q}, e_{m_q})² / W_{n,q}(b; s_{m_q}, e_{m_q})
            add b to the set of estimated change points
            WBS(S, b)
            WBS(b + 1, E)
        else
            STOP

To apply the adaptive test, we calculate $\hat\xi^{(r)}_{n,M,q}$ from the $r$-th Gaussian sample for each $q \in I$. Denote $q_I := \max_{q \in I} q$. We calculate a p-value for each single-$q$ statistic and select the most significant one for location estimation, which gives the adaptive version; see Algorithm 2.

Algorithm 2: Adaptive WBS algorithm

function WBS(S, E)
    if E − S < 4q_I − 1 then
        STOP
    else
        p ← 0.05
        for q in I do
            M_{S,E} ← set of those 1 ≤ m ≤ M such that S ≤ s_m, e_m ≤ E and e_m − s_m ≥ 4q_I − 1
            m_q ← argmax_{m ∈ M_{S,E}} Q_{n,q}(s_m, e_m)
            p_q ← 1 − R⁻¹ #{1 ≤ r ≤ R : Q_{n,q}(s_{m_q}, e_{m_q}) > ξ^{(r)}_{n,M,q}}
            if p_q < p then
                b ← argmax_b U_{n,q}(b; s_{m_q}, e_{m_q})² / W_{n,q}(b; s_{m_q}, e_{m_q})
                p ← p_q
        end for
        if a candidate b was selected then
            add b to the set of estimated change points
            WBS(S, b)
            WBS(b + 1, E)
        else
            STOP

In the next section, we present numerical results to examine the finite sample performance of our testing and estimation methods in comparison with existing alternatives. Section 4.1 shows the size and power of the single change point tests; Section 4.2 presents the estimation results when there is a single change point; Section 4.3 compares several WBS-based estimation methods for multiple change point estimation, including the INSPECT method in Wang and Samworth (2018). Finally, we apply our method to a real data set in Section 4.4.
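A compact Python rendering of the WBS recursion of Algorithm 1 is sketched below; it is our own illustration (not the authors' implementation), and it assumes a function q_stat(X, s, e) that returns $Q_{n,q}(s,e)$ together with the maximizing split $b$, e.g. built from the U2/W2 helpers above for $q = 2$.

```python
import numpy as np

def wbs_sn(X, intervals, xi, q=2, q_stat=None):
    """Wild binary segmentation with the SN test statistic (sketch).

    intervals: list of random (s_m, e_m) pairs, 1-based, with e_m - s_m >= 4q - 1
    xi: threshold, e.g. the 95% quantile of the same maximized statistic computed
        on R simulated standard Gaussian data sets of the same size."""
    change_points = []

    def recurse(S, E):
        if E - S < 4*q - 1:
            return
        cand = [(s, e) for (s, e) in intervals if S <= s and e <= E]
        if not cand:
            return
        stats = [q_stat(X, s, e) for (s, e) in cand]     # (Q_{n,q}(s,e), argmax b)
        best = int(np.argmax([val for val, _ in stats]))
        val, b = stats[best]
        if val > xi:
            change_points.append(b)
            recurse(S, b)
            recurse(b + 1, E)

    recurse(1, X.shape[0])
    return sorted(change_points)
```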
4.1 Single Change Point Testing

In this subsection, we examine the size and power properties of our single-$q$ and adaptive tests in comparison with the test in Enikeeva and Harchaoui (2019) (denoted EH), which seems to be the only adaptive method in the literature. The data are $X_i \sim N(\mu_i, \Sigma)$, where $\mu_i = 0$ for $i = 1, \ldots, n$ under the null. We first set $n = 200$ and specify $\Sigma = (\sigma_{ij})$ as follows:
\[
\sigma_{ij} = \begin{cases} \mathbf{1}(i = j) & \text{Id}, \\ 0.5^{|i-j|} & \text{AR(0.5)}, \\ 0.8^{|i-j|} & \text{AR(0.8)}, \\ \mathbf{1}(i = j) + c\,\mathbf{1}(i \neq j), \ c \in (0,1) & \text{CS}. \end{cases}
\]
They correspond to independent components (Id), auto-regressive models of order 1 (AR(0.5) and AR(0.8)), and a compound symmetric structure (CS), respectively. The first three configurations imply weak dependence among components and thus satisfy Assumption 2.1, whereas the compound symmetric covariance matrix corresponds to strong dependence among components and violates our assumption.

The sizes of our tests, including $\widetilde{T}_{n,q}$ at a single $q = 2, 4, 6$ and the adaptive tests with $I = (2,4)$, $(2,6)$ and $(2,4,6)$, are presented in Table 2. It appears that all tests are oversized when $\Sigma$ is compound symmetric, which is somewhat expected since the strong dependence among components brings non-negligible errors into the asymptotic approximation. As a matter of fact, we conjecture that our limiting null distribution $\widetilde{T}_q$ no longer holds in this case. Below we focus our comments on the first three configurations (Id, AR(0.5) and AR(0.8)).

The size for $q = 2$ (i.e., the test in Wang et al. (2019)) appears quite accurate, except for some degree of under-rejection in the Id case. For $q = 4$, the test is oversized and its size appears inferior to the case $q = 6$, which also shows some over-rejection for the AR(1) models when $n = 200$ and improves when we increase the sample size to $n = 400$. The adaptive tests with $I = (2,4,6)$ and $I = (2,6)$ exhibit the most accurate size overall. By contrast, EH shows serious size distortions in all settings, with severe over-rejection when the componentwise dependence is strong (e.g., AR(0.8) and CS), which is consistent with the fact that its validity strongly relies on the Gaussian and componentwise independence assumptions. We also checked the sensitivity of the size to non-Gaussian distributions and observed serious distortion for EH when the data are generated from a non-Gaussian distribution (results not shown). Overall, the adaptive test with $I = (2,6)$ seems preferable to all other tests (including the adaptive test with $I = (2,4,6)$).

We next examine the power. The mean is $\mu_i = 0$ for $i \le n/2$ and $\mu_i = \sqrt{\delta/d}\cdot(\mathbf{1}_d^T, \mathbf{0}_{p-d}^T)^T$ for $i > n/2$. We take $\delta = 1$ and $d = 3$, which corresponds to a sparse alternative, and let $d = p$ to examine the power under the dense alternative; see Table 3. In the case of the sparse alternative, the powers corresponding to $q = 4$ and $q = 6$ are much higher than that for $q = 2$, which is consistent with our intuition. When $q = 4$, the power is slightly higher than that for $q = 6$, which might be explained by the over-rejection with $q = 4$ (in the case of the AR(1) models), and we expect no power gain as we increase $q$ to $8, 10$, etc., so the results for these larger $q$ are not included. Also, for larger $q$ there is more trimming involved, as the maximum runs from $2q$ to $n - 2q$ in our test statistics, so if the change point occurs outside of the range $[2q, n-2q]$, our test has little power. In the dense alternative case, the power for $q = 2$ is the highest, as expected, and the power for $q = 4$ is again slightly higher than that for $q = 6$.

The power of the combined tests (i.e., $I = (2,4)$, $(2,6)$ or $(2,4,6)$) is very close to the power for $q = 6$ in the sparse case and quite close to the power for $q = 2$ in the dense case, indicating the adaptiveness of the combined tests. In the sparse case, the powers for $I = (2,4)$ and $(2,4,6)$ are slightly higher than that for $(2,6)$; however, since the size for $I = (2,6)$ is more accurate than that for $(2,4)$ and $(2,4,6)$, we slightly favor the $(2,6)$ combination. EH exhibits high power in all settings, but this comes at the cost of serious size distortion. We do not present size-adjusted power, as the size distortion is too great to recommend its use when there is componentwise dependence in the data.

Please insert Table 3 here!
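The covariance structures and mean shifts used above are straightforward to generate; the sketch below is our own (the CS off-diagonal correlation is left as a free parameter rho, since its exact value is not recorded here) and may help readers reproduce a similar design.

```python
import numpy as np

def make_cov(p, kind="Id", rho=0.5):
    """Covariance matrices used in the size/power study: Id, AR(rho), CS."""
    if kind == "Id":
        return np.eye(p)
    if kind == "AR":                       # sigma_ij = rho^{|i-j|}
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    if kind == "CS":                       # 1 on the diagonal, rho off-diagonal
        return np.full((p, p), rho) + (1 - rho) * np.eye(p)
    raise ValueError(kind)

def simulate_one_change(n, p, Sigma, delta, d, tau=0.5, rng=None):
    """X_i ~ N(mu_i, Sigma) with mu_i = sqrt(delta/d)*(1_d, 0_{p-d}) after time tau*n."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, p)) @ L.T
    shift = np.zeros(p)
    shift[:d] = np.sqrt(delta / d)
    Z[int(tau * n):] += shift
    return Z
```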
In this subsection, we present the square root of the mean squared error (RMSE, multiplied by 1000 for readability) of the SN-based location estimators and compare with the EH-based estimator under the same settings as in Section 4.1. For both dense and sparse alternatives, the proposed estimators (i.e., SN(2), SN(4) and SN(6)) perform better than the EH method when the signal is relatively weak (i.e., $\delta = 1, 2$); when the signal is strong (i.e., $\delta = 4$), the EH method can outperform ours in the identity covariance matrix case. On the other hand, the performance of the EH estimator apparently deteriorates as the cross-sectional dependence gets stronger, indicating its strong reliance on the componentwise independence assumption. It is interesting to note that the SN-based method performs fairly well even in the case of the compound symmetric covariance matrix, and SN(6) outperforms the other two in all settings. A theoretical justification for the latter phenomenon would be intriguing.

Please insert Table 4 here!

In the following simulations, we compare our WBS-based method with the INSPECT method proposed by Wang and Samworth (2018). Following Wang et al. (2019), we generate 100 samples of i.i.d. standard normal variables $\{Z_t\}_{t=1}^n$ with $n = 120$ and $p = 50$. The three change points are located at 30, 60 and 90. Denote the changes in mean by $\theta_1, \theta_2, \theta_3$, with $\theta_1 = -\theta_2 = 2\sqrt{k_1/d_1}\cdot(\mathbf{1}_{d_1}^T, \mathbf{0}_{p-d_1}^T)^T$ and $\theta_3 = 2\sqrt{k_3/d_3}\cdot(\mathbf{1}_{d_3}^T, \mathbf{0}_{p-d_3}^T)^T$. We use, e.g., Dense(2.5) to denote dense changes with $d_i = p = 50$ and $k_i = 2.5$ for $i = 1, 2, 3$, and Sparse(4) to denote sparse changes with $d_i = 5$ and $k_i = 4$ for $i = 1, 2, 3$. In particular, Dense(2.5) & Sparse(4) refers to a configuration with $k$ values 2.5 and 4 and $d$ values 50 and 5 for the corresponding change points, i.e., a mixture of dense and sparse changes.

We compare WBS with INSPECT, for which we use the default parameters of the "InspectChangepoint" package in R. We use two different metrics to evaluate the performance of the methods. One is the mean squared error (MSE) of the estimated number of change points. The other metric takes the accuracy of the location estimation into account: we utilize the corrected Rand index (CRI), which measures the accuracy of change point location estimation; see Rand (1971), Hubert and Arabie (1985) and Wang et al. (2019) for more details. For perfect estimation the CRI equals 1; in general it is a number between 0 and 1, and the more precisely the change point locations are estimated, the higher the CRI. We average the CRI over all Monte Carlo replications and report the average Rand index (ARI). We report the MSE and ARI of the different methods based on 100 replications in Table 5.
When there are only sparse changes and $\delta = 2.5$, the performance of the adaptive procedure (WBS-SN(2,6)) is similar to WBS-SN(6), whose estimation is much more accurate than WBS-SN(2) and INSPECT. When we strengthen the signal by increasing $\delta$ from 2.5 to 4, all methods improve and the differences shrink. For the dense changes, WBS-SN(2) is the most accurate at $\delta = 2.5$, and the adaptive procedure performs comparably to the better single-$q$ method as $\delta$ increases from 2.5 to 4. In the setting with a mixture of sparse and dense changes, the adaptive WBS-SN(2,6) clearly outperforms the single-$q$ methods and INSPECT; see Table 5.

In this subsection, we study a genomic micro-array data set that contains log intensity ratios of 43 individuals with bladder tumors, measured at 2215 different loci. The data are available in the R package ecp and were also studied by Wang et al. (2019) and Wang and Samworth (2018). We compare our results with theirs. We take the first 200 loci for our study. For the WBS algorithm, we generate 10000 samples from i.i.d. standard normal distributions with $(n, p) = (200, 43)$ to obtain the thresholds. The single-$q$ methods with $q = 2$ and $q = 6$ and the adaptive method each detect a number of change points, and the set detected by the adaptive method essentially combines those detected by the two single-$q$ methods, which suggests that the adaptive WBS method captures both sparse and dense alternatives, as expected. In particular, 32(33), 44(46), 74, 134(135), 158(155) and 173 are detected by both single-$q$ methods; 38(39), 97, 102 and 191 are detected only by $q = 2$; and 15, 59, 91, 116 and 186 only by $q = 6$. The set of change points detected by the adaptive WBS method overlaps substantially with that of INSPECT, including 15, 32(33), 38(36), 74(73), 91, 97, 102, 116(119), 134, 158(155), 173(174), and 191. It is worth noting that the change points at locations 91, 97 and 191 were detected by only one of the two single-$q$ WBS methods and by INSPECT, whereas the adaptive WBS method is able to capture them, demonstrating again its good all-round power against a broad range of alternatives. In Figure 1, we plot the log intensity ratios of the first 10 individuals at the first 200 loci, together with the locations of the change points estimated by the adaptive method.

Please insert Figure 1 here!

This example clearly demonstrates the usefulness of the proposed adaptive test and the corresponding WBS-based estimation method. An important practical choice is the threshold, which can be viewed as a tuning parameter in the implementation of the WBS algorithm. We shall leave its choice for future investigation.

In this paper, we propose a class of asymptotically pivotal statistics for testing a mean change in high-dimensional independent data. The test statistics are formed on the basis of an unbiased estimator of the $q$-th power of the $L_q$ norm of the mean change via U-statistics and self-normalization. They are asymptotically independent for different even $q$'s, and therefore we can form an adaptive test by taking the minimum of the p-values corresponding to test statistics with different $q$'s. The resulting test is shown to have good overall power against both dense and sparse alternatives via theory and simulations. On the estimation front, we obtain the convergence rate for the argmax of the SN-based test statistic standardized by sample size under the one change-point model and $q = 2$. We also combine our tests with the WBS algorithm to estimate multiple change points. As demonstrated by our simulations, the WBS-based estimation method inherits the advantage of the adaptive test: it outperforms other methods in the setting where there is a mixture of dense and sparse change points, and has close-to-best performance in the purely dense and purely sparse cases.

To conclude, we mention that it would be interesting to extend our adaptive test to the high-dimensional time series setting, for which a trimming parameter seems necessary to accommodate weak temporal dependence in view of recent work by Wang and Shao (2019).
In addition, the focus of this paper is on mean change, whereas in practice the interest could be in other high-dimensional parameters, such as the vector of marginal quantiles, the variance-covariance matrix, or even high-dimensional distributions. It remains to be seen whether extensions to these more general parameters are possible in the high-dimensional environment. We shall leave these open problems for future research.

Table 2: Size of the tests (α = 5%); columns: DGP, (n, p), q = 2, q = 4, q = 6, q = (2,4), q = (2,6), q = (2,4,6).

Table 3: Power of the tests (α = 5%); columns: DGP, δ, (n, p), q = 2, q = 4, q = 6, q = (2,4), q = (2,6), q = (2,4,6).

δ   Method   Sparse: Id  AR(0.5)  AR(0.8)  CS      Dense: Id  AR(0.5)  AR(0.8)  CS
1   SN(2)    38.7        53.7     72.3     84.5    41.1       50.6     72.6     91.4
    SN(4)    20.3        24.5     26.7     20.9    44.6       43.0     49.1     49.6
    SN(6)    18.5        22.0     22.3     19.6    33.1       31.6     32.6     31.4
    EH       150.3       214.9    300.6    326.8   155.6      216.9    291.3    332.3
2   SN(2)    26.0        33.5     41.7     44.3    27.5       36.6     54.8     76.8
    SN(4)    14.4        17.5     19.6     16.3    37.9       38.3     45.8     48.8
    SN(6)    12.1        14.1     14.9     11.9    29.7       28.6     31.9     30.5
    EH       41.4        90.2     196.8    272.7   40.1       110.8    210.2    286.6
4   SN(2)    21.8        24.8     29.5     30.4    20.7       26.1     39.2     64.2
    SN(4)    12.1        14.7     16.5     14.1    22.8       27.8     39.0     44.3
    SN(6)    9.9         10.8     11.6     10.1    26.1       26.5     29.1     33.6
    EH       8.7         16.7     66.8     153.2   9.7        29.9     109.4    209.6

Table 4: RMSE (multiplied by 10^3) for one change point location estimation under different alternatives

Figure 1: ACGH data of the first 10 individuals at the first 200 loci. The dashed lines represent the locations of the change points detected.

Setting                   Method        N̂ − N: −3   −2   −1    0    1    2    3     MSE    ARI
Sparse(2.5)               WBS-SN(2)             0    1   11   75   13    0    0     0.28   0.8667
                          WBS-SN(4)             0    0    0   98    2    0    0     0.02   0.958
                          WBS-SN(6)             0    0    0   94    5    1    0     0.09   0.9552
                          WBS-SN(2,6)           0    0    0   90   10    0    0     0.1    0.9489
                          INSPECT               0   26    0   69    5    0    0     1.09   0.7951
Sparse(4)                 WBS-SN(2)             0    0    0   86   14    0    0     0.14   0.9188
                          WBS-SN(4)             0    0    0   98    2    0    0     0.02   0.9684
                          WBS-SN(6)             0    0    0   94    5    1    0     0.09   0.9707
                          WBS-SN(2,6)           0    0    0   90   10    0    0     0.1    0.9678
                          INSPECT               0    0    0   91    8    1    0     0.12   0.9766
Dense(2.5)                WBS-SN(2)             0    2   10   74   13    1    0     0.35   0.8662
                          WBS-SN(4)            94    4    2    0    0    0    0     8.64   0.0263
                          WBS-SN(6)            70   20    7    3    0    0    0     7.17   0.1229
                          WBS-SN(2,6)           5    5    7   64    9    0    0     0.91   0.7809
                          INSPECT               0   40    0   46   13    0    1     1.82   0.6656
Dense(4)                  WBS-SN(2)             0    0    0   85   13    2    0     0.21   0.9186
                          WBS-SN(4)            47   33   14    6    0    0    0     5.69   0.2748
                          WBS-SN(6)            46   28   21    5    0    0    0     5.47   0.2642
                          WBS-SN(2,6)           0    0    0   87   13    0    0     0.13   0.9214
                          INSPECT               0    7    0   68   22    2    1     0.67   0.9027
Sparse(2.5) & Dense(4)    WBS-SN(2)             0    1   12   73   14    0    0     0.3    0.8742
                          WBS-SN(4)             0    0   62   37    1    0    0     0.63   0.7855
                          WBS-SN(6)             0    0   60   38    2    0    0     0.62   0.7743
                          WBS-SN(2,6)           0    0    0   91    9    0    0     0.09   0.9439
                          INSPECT               0   21    1   70    6    1    1     1.04   0.8198
Table 5: Multiple change-point estimation

References
Aminikhanghahi, S. and Cook, D. J. (2017), “A survey of methods for time series change point detection,”
Knowledgeand Information Systems , 51, 339–367.Andrews, D. (1991), “Heteroskedasticity and autocorrelation consistent covariant matrix estimation,”
Econometrica ,59, 817–858.Aue, A. and Horv´ath, L. (2013), “Structural breaks in time series,”
Journal of Time Series Analysis , 34, 1–16.Chan, J., Horv´ath, L., and Huˇskov´a, M. (2013), “Darling–Erd˝os limit results for change-point detection in panel data,”
Journal of Statistical Planning and Inference , 143, 955–970.Chen, H. and Zhang, N. (2015), “Graph-based change-point detection,”
The Annals of Statistics , 43, 139–176.Chen, J. and Gupta, A. K. (2011),
Parametric Statistical Change Point Analysis: with Applications to Genetics,Medicine, and Finance , Springer Science & Business Media.Chen, S. X. and Qin, Y.-L. (2010), “A two-sample test for high-dimensional data with applications to gene-set testing,”
The Annals of Statistics , 38, 808–835.Chernozhukov, V., Chetverikov, D., and Kato, K. (2013), “Gaussian approximations and multiplier bootstrap formaxima of sums of high-dimensional random vectors,”
Annals of Statistics , 41, 2786–2819.— (2017), “Central limit theorems and bootstrap in high dimensions,”
Annals of Probability , 45, 2309–2352.Cho, H. (2016), “Change-point detection in panel data via double CUSUM statistic,”
Electronic Journal of Statistics ,10, 2000–2038.Cs¨org¨o, M. and Horv´ath, L. (1997),
Limit Theorems in Change-Point Analysis. Wiley Series in Probability and Statis-tics. , Wiley.Enikeeva, F. and Harchaoui, Z. (2019), “High-dimensional change-point detection with sparse alternatives,”
The Annalsof Statistics , 47, 2051–2079.Fryzlewicz, P. (2014), “Wild binary segmentation for multiple change-point detection,”
The Annals of Statistics , 42,2243–2281.He, Y., Xu, G., Wu, C., and Pan, W. (2020), “Asymptotically independent U-Statistics in high-dimensional testing,”
The Annals of Statistics , forthcoming.Horv´ath, L. and Huˇskov´a, M. (2012), “Change-point detection in panel data,”
Journal of Time Series Analysis , 33,631–648.Hubert, L. and Arabie, P. (1985), “Comparing partitions,”
Journal of Classification, 2, 193–218. Jeng, X., Cai, T., and Li, H. (2010), “Optimal sparse segment identification with application in copy number variation analysis,”
Journal of the American Statistical Association , 105, 1156–1166.Jirak, M. (2015), “Uniform change point tests in high dimension,”
The Annals of Statistics , 43, 2451–2483.Kley, T., Volgushev, S., Dette, H., and Hallin, M. (2016), “Quantile spectral processes: asymptotic analysis andinference,”
Bernoulli , 22, 1770–1807.Liu, H., Gao, C., and Samworth, R. (2020), “Minimax Rates in Sparse High-Dimensional Changepoint Detection,”
Annals of Statistics, to appear .Lobato, I. N. (2001), “Testing that a dependent process is uncorrelated,”
Journal of the American Statistical Associ-ation , 96, 1066–1076.Perron, P. (2006), “Dealing with structural breaks,”
Palgrave Handbook of Econometrics , 1, 278–352.Rand, W. M. (1971), “Objective criteria for the evaluation of clustering methods,”
Journal of the American StatisticalAssociation , 66, 846–850.Shao, X. (2010), “A self-normalized approach to confidence interval construction in time series,”
Journal of the RoyalStatistical Society, Series, B. , 72, 343–366.— (2015), “Self-normalization for time series: a review of recent developments,”
Journal of the American StatisticalAssociation , 110, 1797–1817.Shao, X. and Wu, W. B. (2007), “Local whittle estimation of fractional integration for nonlinear processes,”
Econo-metric Theory , 23, 899–929.Shao, X. and Zhang, X. (2010), “Testing for change points in time series,”
Journal of the American Statistical Asso-ciation , 105, 1228–1240.Tartakovsky, A., Nikiforov, I., and Basseville, M. (2014),
Sequential Analysis: Hypothesis Testing and ChangepointDetection , Chapman and Hall/CRC.Wang, D., Yu, Y., and Rinaldo, A. (2020), “Optimal change point detection and localization in sparse dynamicnetworks,” forthcoming at Annals of Statistics, arXiv preprint arXiv:1809.09602 .Wang, R. and Shao, X. (2019), “Hypothesis testing for high-dimensional time series via self-normalization,”
The Annalsof Statistics , to appear.Wang, R., Volgushev, S., and Shao, X. (2019), “Inference for change points in high dimensional data,” arXiv preprintarXiv:1905.08446 .Wang, T. and Samworth, R. J. (2018), “High dimensional change point estimation via sparse projection,”
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 57–83. Wu, W. B. (2005), “Nonlinear system theory: Another look at dependence,”
Proceedings of the National Academy ofSciences USA , 102, 14150–14154.Wu, W. B. and Shao, X. (2004), “Limit theorems for iterated random functions,”
Journal of Applied Probability , 41,425–436.Xu, G., Lin, L., Wei, P., and Pan, W. (2016), “An adaptive two-sample test for high-dimensional means,”
Biometrika ,103, 609–624.Yu, M. and Chen, X. (2020), “Finite sample change point inference and identification for high-dimensional meanvectors,”
Journal of Royal Statistical Society, Series B, to appear .Zhang, N. R. and Siegmund, D. O. (2012), “Model selection for high-dimensional multi-sequence change-point prob-lems,”
Statistica Sinica , 22, 1507–1538.Zhang, N. R., Siegmund, D. O., Ji, H., and Li, J. Z. (2010), “Detecting simultaneous changepoints in multiplesequences,”
Biometrika , 97, 631–645.Zhang, T. and Lavitas, L. (2018), “Unsupervised self-normalized change-point testing for time series,”
Journal of theAmerican Statistical Association , 113, 637–648.Zhao, Z., Chen, L., and Lin, L. (2019), “Change-point detection in dynamic networks via graphon estimation,” arXivpreprint arXiv:1908.01823 .Zhurbenko, I. and Zuev, N. (1975), “On higher spectral densities of stationary processes with mixing,”
Ukrainian Mathematical Journal, 27, 364–373.

Supplement to "Adaptive Inference for Change Points in High-Dimensional Data"
The supplement contains all the technical proofs in Section 6 and some additional simulation results on network change-point detection in Section 7.
In the following, we write $a_n \lesssim b_n$ and $b_n \gtrsim a_n$ if $\limsup_n a_n/b_n < \infty$.
Recall that under the null, as X i ’s have the same mean, D n,q ( r ; [ a, b ]) = q (cid:88) c =0 ( − q − c (cid:18) qc (cid:19) P (cid:98) nr (cid:99)−(cid:98) na (cid:99)− cq − c P (cid:98) nb (cid:99)−(cid:98) nr (cid:99)− q + cc S n,q,c ( r ; [ a, b ]) . Therefore, we can calculate the covariance structure of G q based on that of Q q,c given in Theorem 2.1.var[ G q ( r ; [ a, b ])] = q (cid:88) c =0 (cid:18) qc (cid:19) c !( q − c )!( r − a ) q − c ( b − r ) q + c = q !( r − a ) q ( b − r ) q ( b − a ) q . When r < r , cov( G q ( r ; [ a , b ]) , G q ( r ; [ a , b ]))= (cid:88) ≤ c ≤ c ≤ q (cid:16) ( − c + c (cid:18) qc (cid:19)(cid:18) qc (cid:19)(cid:18) Cc (cid:19) c !( q − C )! r >a ,r r , cov( G q ( r ; [ a , b ]) , G q ( r ; [ a , b ]))= (cid:88) ≤ c ≤ c ≤ q (cid:16) ( − c + c (cid:18) qc (cid:19)(cid:18) qc (cid:19)(cid:18) Cc (cid:19) c !( q − C )! r >a ,r
Proof of the claim.
Define $S_{m,h}(l_1) := \big\{ 1 \le l_2, \ldots, l_h \le p_n : \max_{1 \le i,j \le h} |l_i - l_j| = m \big\}$.
25y triangular inequality, | l − l | + | l − l | + | l − l | + | l − l | ≥ ≤ i,j ≤ | l i − l j | , and therefore, p (cid:88) l , ··· ,l =1 (Σ l l Σ l l Σ l l Σ l l ) q/ = p n (cid:88) l =1 p n (cid:88) m =0 (cid:88) l ,...,l ∈ S m, ( l ) (Σ l l Σ l l Σ l l Σ l l ) q/ ≤ p n (cid:88) l =1 p n (cid:88) m =0 | S m, ( l ) | C q (1 ∨ m ) − qr (cid:46) p n p n (cid:88) m =0 (1 ∨ m ) − − qr . On the other hand, p (cid:88) l , ··· ,l h =1 cum q ( X ,l ,n , · · · , X ,l h ,n ) = p n (cid:88) l =1 p n (cid:88) m =0 (cid:88) l ,...,l h ∈ S m,h ( l ) cum q ( X ,l ,n , · · · , X ,l h ,n ) ≤ p n (cid:88) l =1 p n (cid:88) m =0 | S m,h ( l ) | C qh (1 ∨ m ) − qr (cid:46) p n p n (cid:88) m =0 (1 ∨ m ) h − − qr . RHS has order O (cid:0) p h − qrn (cid:1) if h − qr − >
0. Now a simple computation shows that Assumption 6.1 is satisfied if h − qr < h/ h = 2 , . . . , , and q = 2 , . . . , which is equivalent to r > ♦ We are now ready to introduce the following lemma, which is vital in proving the main result.
Lemma . Under Assumption 2.1, for any i ( h )1 , i ( h )2 , ..., i ( h ) q that are all distinct, h = 1 , ..., , and c = 1 , , ..., q , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (cid:88) l ,...,l =1 δ q − cn,l · · · δ q − cn,l E [ Z i (1)1 ,l · · · Z i (1) c ,l · · · Z i (8)1 ,l · · · Z i (8) c ,l ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:46) (cid:107) ∆ n (cid:107) q − c ) q (cid:107) Σ (cid:107) cq (1) In particular, for c = q , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (cid:88) l ,...,l =1 E [ Z i (1)1 ,l · · · Z i (1) c ,l · · · Z i (8)1 ,l · · · Z i (8) c ,l ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:46) (cid:107) Σ (cid:107) qq . (2) In addition, for any c = 1 , , ..., q − , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (cid:88) l ,l =1 δ q − cn,l δ q − cn,l Σ cl ,l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o (cid:16) (cid:107) ∆ n (cid:107) q − c ) q (cid:107) Σ (cid:107) cq (cid:17) . (3)26 roof of Lemma 6.1. Applying the generalized H¨older’s Inequality, we obtain (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (cid:88) l ,...,l =1 δ q − cn,l · · · δ q − cn,l h E [ Z i (1)1 ,l · · · Z i (1) c ,l · · · Z i (8)1 ,l · · · Z i (8) c ,l ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E (cid:32) (cid:89) u =1 (cid:34) p (cid:88) l u =1 δ q − cn,l u Z i ( u )1 ,l u · · · Z i ( u ) c ,l u (cid:35)(cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:89) u =1 E (cid:34) p (cid:88) l u =1 δ q − cn,l u Z i ( u )1 ,l u · · · Z i ( u ) c ,l u (cid:35) / = E (cid:34) p (cid:88) l =1 δ q − cn,l Z i (1)1 ,l · · · Z i (1) c ,l (cid:35) = p (cid:88) l ,...,l =1 δ q − cn,l ...δ q − cn,l E (cid:104) Z i (1)1 ,l ...Z i (1)1 ,l · · · Z i (1) c ,l ...Z i (1) c ,l (cid:105) = p (cid:88) l ,...,l =1 δ q − cn,l ...δ q − cn,l (cid:16) E (cid:104) Z i (1)1 ,l ...Z i (1)1 ,l (cid:105)(cid:17) c , since i (1)1 , i (1)2 , ..., i (1) c are all different, and { Z i } are i.i.d. Again by H¨older’s Inequality, p (cid:88) l ,...,l =1 δ q − cn,l ...δ q − cn,l (cid:16) E (cid:104) Z i (1)1 ,l ...Z i (1)1 ,l (cid:105)(cid:17) c ≤ p (cid:88) l ,...,l =1 ( δ q − cn,l ...δ q − cn,l ) q/ ( q − c ) ( q − c ) /q p (cid:88) l ,...,l =1 (cid:16) E (cid:104) Z i (1)1 ,l ...Z i (1)1 ,l (cid:105)(cid:17) cq/c c/q (cid:46) (cid:107) ∆ n (cid:107) q − c ) q p (cid:88) l ,...,l =1 (cid:88) π (cid:89) B ∈ π cum ( Z ,l i , i ∈ B ) q c/q . The last line in the above inequalities is due to the CR inequality and the definition of joint cumulants, where π runsthrough the list of all partitions of { , ..., } , B runs through the list of all blocks of the partition π . As all blocks ina partition are disjoint, we can further bound it as (cid:107) ∆ n (cid:107) q − c ) q p (cid:88) l ,...,l =1 (cid:88) π (cid:89) B ∈ π cum ( Z ,l i , i ∈ B ) q c/q = (cid:107) ∆ n (cid:107) q − c ) q (cid:88) π (cid:89) B ∈ π p (cid:88) l i =1 ,i ∈ B cum ( Z ,l i , i ∈ B ) q c/q (cid:46) (cid:107) ∆ n (cid:107) q − c ) q (cid:40)(cid:88) π (cid:107) Σ (cid:107) q (cid:80) B ∈ π | B | / q (cid:41) c/q (cid:46) (cid:107) ∆ n (cid:107) q − c ) q (cid:107) Σ (cid:107) cq , where the first inequality in the above is due to Assumption 6.1, A.2, which is a consequence of Assumption 2.1, andthe fact that there are only finite number of distinct partitions over { , ..., } . 
This completes the proof of the firstresult.For the second result, we first define A ◦ n as the notation for the element-wise n -th power of any real matrix A , i.e.27 ◦ ni,j = A ni,j . Then we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (cid:88) l ,l =1 δ q − cn,l δ q − cn,l Σ cl ,l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = ∆ ◦ ( q − c ) T n Σ ◦ c ∆ ◦ ( q − c ) n ≤ (cid:107) ∆ ◦ ( q − c ) n (cid:107) σ max (Σ ◦ c ) , where σ max is the largest eigenvalue. First observe that (cid:107) ∆ ◦ ( q − c ) n (cid:107) = (cid:80) pl =1 δ q − c ) n,l = (cid:107) ∆ n (cid:107) q − c )2( q − c ) . By properties of L q norm, (cid:107) ∆ n (cid:107) q − c ) ≤ (cid:107) ∆ n (cid:107) q , if q ≤ q − c ), and (cid:107) ∆ n (cid:107) q − c ) ≤ p / q − c ) − /q (cid:107) ∆ n (cid:107) q , if q > q − c ). This implies (cid:107) ∆ n (cid:107) q − c )2( q − c ) ≤ max( p (2 c − q ) /q (cid:107) ∆ n (cid:107) q − c ) q , (cid:107) ∆ n (cid:107) q − c ) q ).Next, for any symmetric matrix A , σ max ( A ) (cid:107) ≤ (cid:107) A (cid:107) ∞ = max i =1 ,...,p (cid:80) pj =1 | A i,j | . This, together with Assumption2.1 (A.2), implies σ max (Σ ◦ c ) ≤ max i =1 ,...,p p (cid:88) j =1 | Σ ◦ ci,j | (cid:46) max i =1 ,...,p p (cid:88) j =1 (1 ∧ | i − j | ) − cr ≤ p (cid:88) m =1 m − cr ≤ ∞ , for some r >
2. This is equivalent to σ max (Σ ◦ c ) = O (1). Note that (cid:107) Σ (cid:107) qq ≥ tr (Σ ◦ q ) (cid:38) p , which leads to p c/q (cid:46) (cid:107) Σ (cid:107) cq .So (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (cid:88) l ,l =1 δ q − cn,l δ q − cn,l Σ cl ,l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:46) max( p (2 c − q ) /q (cid:107) ∆ n (cid:107) q − c ) q , (cid:107) ∆ n (cid:107) q − c ) q ) = o ( (cid:107) ∆ n (cid:107) q − c ) q (cid:107) Σ (cid:107) cq ) , since (2 c − q ) /q = c/q + ( c − q ) /q < c/q , for c = 1 , , ..., q −
1. This completes the proof for the second result. ♦ This lemma is a generalization to its counterpart in Wang et al. (2019), in which we only have q = 2. To proveTheorem 2.1, we need the following lemmas to show tightness and finite dimensional convergence. Lemma . Under Assumption 2.1, for any c = 0 , , ..., q , and define the 3-dimensional index set G n := { ( i/n, j/n, k/n ) : i, j, k = 0 , , ..., n } , E [ a − n ( S n,q,c ( r ; [ a , b ]) − S n,q,c ( r ; [ a , b ])) ] ≤ C (cid:107) ( a , r , b ) − ( a , r , b ) (cid:107) , for some constant C , any ( a , r , b ) , ( a , r , b ) ∈ G n such that (cid:107) ( a , r , b ) − ( a , r , b ) (cid:107) > δ/n .Proof of Lemma 6.2. By CR-inequality, E [( S n,q,c ( r ; [ a , b ]) − S n,q,c ( r ; [ a , b ])) ] ≤ C (cid:110) E [( S n,q,c ( r ; [ a , b ]) − S n,q,c ( r ; [ a , b ])) ]+ E [( S n,q,c ( r ; [ a , b ]) − S n,q,c ( r ; [ a , b ])) ]+ E [( S n,q,c ( r ; [ a , b ]) − S n,q,c ( r ; [ a , b ])) ] (cid:111) . We shall only analyze E [( S n,q,c ( r ; [ a, b ]) − S n,q,c ( r ; [ a, b (cid:48) ])) ], and the analysis of the other 2 terms are similar.28ote that for any a, r, b, b (cid:48) ∈ [0 ,
1] and b < b (cid:48) , E [( S n,q,c ( r ; [ a, b ]) − S n,q,c ( r ; [ a, b (cid:48) ])) ]= E ( q − c ) p (cid:88) l =1 (cid:88) (cid:98) nb (cid:99) +1 ≤ j ≤(cid:98) nb (cid:48) (cid:99) (cid:88) (cid:98) na (cid:99) +1 ≤ i (cid:54) = ···(cid:54) = i c ≤(cid:98) nr (cid:99) (cid:88) (cid:98) nr (cid:99) +1 ≤ j (cid:54) = ···(cid:54) = j q − c − ≤ j − (cid:32) c (cid:89) t =1 Z i t ,l · q − c − (cid:89) s =1 Z j s ,l · Z j,l (cid:33) = C (cid:88) j ( · ) ,i ( · ) t ,j ( · ) s p (cid:88) l ,...,l =1 8 (cid:89) h =1 (cid:32) E (cid:34) c (cid:89) t =1 Z i ( h ) t ,l h (cid:35) E (cid:34) q − c − (cid:89) s =1 Z j ( h ) s ,l h (cid:35) E (cid:2) Z j ( h ) ,l h (cid:3)(cid:33) (cid:46) n q − ( (cid:98) nb (cid:48) (cid:99) − (cid:98) nb (cid:99) ) (cid:107) Σ (cid:107) qh/ q (cid:46) n q (cid:20) ( b (cid:48) − b ) + 1 n (cid:21) (cid:107) Σ (cid:107) qq , where we have applied Lemma 6.1-(2) to i ( h )1 , . . . , i ( h ) c , j ( h )1 , . . . , j ( h ) q − c − , j ( h ) , and the summation (cid:80) j ( · ) ,i ( · ) t ,j ( · ) s is over (cid:98) nb (cid:99) + 1 ≤ j ( h ) ≤ (cid:98) nb (cid:48) (cid:99) , (cid:98) na (cid:99) + 1 ≤ i ( h )1 (cid:54) = · · · (cid:54) = i ( h ) c ≤ (cid:98) nr (cid:99) , (cid:98) nr (cid:99) + 1 ≤ j ( h )1 (cid:54) = · · · (cid:54) = j ( h ) q − c − ≤ j ( h ) − h = 1 , . . . , a − n E [( S n,q,c ( r ; [ a, b ]) − S n,q,c ( r ; [ a, b (cid:48) ])) ] (cid:46) ( b (cid:48) − b ) + 1 n . ♦ Lemma . Fix q, c, for any ≤ a < r < b ≤ , ≤ a < r < b , any α , α ∈ R , we have α a n S n,q,c ( r ; [ a , b ]) + α a n S n,q,c ( r , [ a , b ]) D −→ α Q q,c ( r ; [ a , b ]) + α Q q,c ( r ; [ a , b ]) , where cov ( Q q,c ( r ; [ a , b ]) , Q q,c ( r ; [ a , b ])) = c !( q − c )!( r − A ) c ( b − R ) q − c , Proof of Lemma 6.3.
WLOG, we can assume a < a < r < r < b < b . The other terms are similar. Define ξ ,i = q − ca n p (cid:88) l =1 ∗ (cid:88) (cid:98) na (cid:99) +1 ≤ i , ··· ,i c ≤(cid:98) nr (cid:99) ∗ (cid:88) (cid:98) nr (cid:99) +1 ≤ j , ··· ,j q − c − ≤ i − (cid:32) c (cid:89) t =1 Z i t ,l · q − c − (cid:89) s =1 Z j s ,l · Z i,l (cid:33) ξ ,i = q − ca n p (cid:88) l =1 ∗ (cid:88) (cid:98) na (cid:99) +1 ≤ i , ··· ,i c ≤(cid:98) nr (cid:99) ∗ (cid:88) (cid:98) nr (cid:99) +1 ≤ j , ··· ,j q − c − ≤ i − (cid:32) c (cid:89) t =1 Z i t ,l · q − c − (cid:89) s =1 Z j s ,l · Z i,l (cid:33) , and (cid:101) ξ n,i = α ξ ,i if (cid:98) nr (cid:99) + q − c ≤ i ≤ (cid:98) nr (cid:99) + q − c − α ξ ,i + α ξ ,i if (cid:98) nr (cid:99) + q − c ≤ i ≤ (cid:98) nb (cid:99) α ξ ,i if (cid:98) nb (cid:99) + 1 ≤ i ≤ (cid:98) nb (cid:99) . F i = σ ( Z i , Z i − , · · · ), we can see that under the null E [ Z ] = 0, (cid:101) ξ n,i is a martingale difference sequence w.r.t. F i , and α a n S n,q,c ( r ; [ a , b ]) + α a n S n,q,c ( r , [ a , b ]) = (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c (cid:101) ξ n,i . To apply the martingale CLT (Theorem 35.12 in Billingsley (2008)), we need to verify the following two conditions(1) ∀ (cid:15) > , (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104)(cid:101) ξ n,i (cid:110)(cid:12)(cid:12)(cid:12)(cid:101) ξ n,i (cid:12)(cid:12)(cid:12) > (cid:15) (cid:111) (cid:12)(cid:12)(cid:12) F i − (cid:105) p → V n = (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104)(cid:101) ξ n,i | F i − (cid:105) p → σ . To prove (1), it suffices to show that (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104)(cid:101) ξ n,i (cid:105) → . Observe that (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104)(cid:101) ξ n,i (cid:105) = α (cid:98) nr (cid:99) + q − c − (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i (cid:3) + (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104) ( α ξ ,i + α ξ ,i ) (cid:105) + α (cid:98) nb (cid:99) (cid:88) i = (cid:98) nb (cid:99) +1 E (cid:2) ξ ,i (cid:3) ≤ α (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i (cid:3) + 8 α (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i (cid:3) . Straightforward calculations show that E (cid:2) ξ ,i (cid:3) = Cn q (cid:107) Σ (cid:107) qq (cid:88) i ( h ) t ,j ( h ) s p (cid:88) l ,l ,l ,l =1 4 (cid:89) h =1 (cid:32) E (cid:34) c (cid:89) t =1 Z i ( h ) t ,l h (cid:35) E (cid:34) q − c − (cid:89) s =1 Z j ( h ) s ,l h (cid:35) E [ Z i,l h ] (cid:33) (cid:46) n q (cid:107) Σ (cid:107) qq n q − (cid:107) Σ (cid:107) qq = O ( 1 n ) . The same result holds for ξ ,i . Therefore, (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104)(cid:101) ξ n,i (cid:105) (cid:46) (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i (cid:3) + (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i (cid:3) = O ( 1 n ) → .
30s regards (2), we decompose V n as follows, (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104)(cid:101) ξ n,i | F i − (cid:105) = α (cid:98) nr (cid:99) + q − c − (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i | F i − (cid:3) + (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:104) ( α ξ ,i + α ξ ,i ) | F i − (cid:105) + α (cid:98) nb (cid:99) (cid:88) i = (cid:98) nb (cid:99) +1 E (cid:2) ξ ,i | F i − (cid:3) = α (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i | F i − (cid:3) + α (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i | F i − (cid:3) + 2 α α (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E [ ξ ,i ξ ,i | F i − ]= : α V ,n + α V ,n + 2 α α V ,n . We still focus on the case a < a < r < r < b < b . Note that (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c E (cid:2) ξ ,i | F i − (cid:3) = ( q − c ) n q (cid:107) Σ (cid:107) qq c !( q − c − (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c (cid:88) i ( h ) t ,j ( h ) s p (cid:88) l ,l =1 Σ l l (cid:89) h =1 (cid:32) c (cid:89) t =1 Z i ( h ) t ,l h · q − c − (cid:89) s =1 Z j ( h ) s ,l h (cid:33) = ( q − c ) n q (cid:107) Σ (cid:107) qq c !( q − c − (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c (1) (cid:88) i ( h ) t ,j ( h ) s p (cid:88) l ,l =1 Σ l l (cid:89) h =1 (cid:32) c (cid:89) t =1 Z i ( h ) t ,l h · q − c − (cid:89) s =1 Z j ( h ) s ,l h (cid:33) + ( q − c ) n q (cid:107) Σ (cid:107) qq c !( q − c − (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c (2) (cid:88) i ( h ) t ,j ( h ) s p (cid:88) l ,l =1 Σ l l (cid:89) h =1 (cid:32) c (cid:89) t =1 Z i ( h ) t ,l h · q − c − (cid:89) s =1 Z j ( h ) s ,l h (cid:33) = : V (1)1 ,n + V (2)1 ,n , where (1) (cid:88) i ( h ) t ,j ( h ) s denotes the summation over terms s.t. i (1) t = i (2) t , j (1) s = j (2) s , ∀ t, s , and (2) (cid:88) i ( h ) t ,j ( h ) s is over the other terms.It is straightforward to see that E [ V (2)1 ,n ] = 0 as Z i ’s are independent, and E [ V (1)1 ,n ] = ( q − c ) n q (cid:107) Σ (cid:107) qq c !( q − c − n c ( r − a ) c (cid:98) nb (cid:99)−(cid:98) nr (cid:99) (cid:88) k =1 k q − c − p (cid:88) l ,l =1 Σ pl l + o (1)= c !( q − c )!( r − a ) c ( b − r ) q − c + o (1) . E [( V (1)1 ,n ) ] = ( q − c ) n q (cid:107) Σ (cid:107) qq [ c !( q − c − p (cid:88) l ,l ,l ,l =1 (cid:98) nb (cid:99) (cid:88) i = (cid:98) nr (cid:99) + q − c (cid:98) nb (cid:99) (cid:88) j = (cid:98) nr (cid:99) + q − c ∗ (cid:88) i ( h ) t ,j ( h ) s (cid:34) Σ l l Σ l l (cid:89) h =1 (cid:32) c (cid:89) t =1 Z i ( h ) t ,l h · q − c − (cid:89) s =1 Z j ( h ) s ,l h (cid:33) (cid:35) + o (1) , where the summation (cid:80) ∗ i ( h ) t ,j ( h ) s is over the range of i ( h ) t , j ( h ) s , h = 1 , , , i (1) t = i (2) t , j (1) s = j (2) s , i (3) t = i (4) t , j (3) s = j (4) s , ∀ t, s. Note that RHS can be further decomposed into 2 parts. The first part corresponds to the summation of theterms s.t. { i ( h ) t , j ( s ) } for h = 1 and has no intersection with that for h = 3, which has order( q − c ) n q (cid:107) Σ (cid:107) qq [ c !( q − c − n c ( r − a ) c (cid:98) nb (cid:99)−(cid:98) nr (cid:99) (cid:88) i =1 i q − c − (cid:98) nb (cid:99)−(cid:98) nr (cid:99) (cid:88) j =1 j q − c − p (cid:88) l ,l ,l ,l Σ ql l Σ ql l =[ c !( q − c )!( r − a ) c ( b − r ) q − c ] + o (1) = E [ V (1)1 ,n ] + o (1) . For the second part, it corresponds to the summation of the terms s.t. 
{ i ( h ) t , j ( s ) } for h = 1 and has at least oneintersection with that for h = 3. Since at least one ”degree of freedom” for n is lost, the summation still has theform (cid:80) pl ,l ,l ,l =1 E (cid:104) Z i (1)1 ,l · · · Z i (1) q ,l · · · Z i ( h )1 ,l h · · · Z i ( h ) q ,l h (cid:105) as in Lemma 6.1-(2), which has order O ( (cid:107) Σ (cid:107) qq ). We canconclude that the second part has order O ( n ), and hence goes to 0.Therefore, lim sup (cid:16) E [( V (1)1 ,n ) ] − E [ V (1)1 ,n ] (cid:17) ≤
0, which implies lim var( V (1)1 ,n ) = 0. Therefore, we can conclude that V (1)1 ,n p → lim E [ V (1)1 ,n ] = c !( q − c )!( r − a ) c ( b − r ) ( q − c ) . It remains to show that V (2)1 ,n p → E (cid:104) ( V (2)1 ,n ) (cid:105) →
0. Based on the same argument as before, by applying Lemma 6.1-(2) weknow that every kind of summation has the same order O ( n ) no matter how i ( h ) t , j ( h ) s , i, j intersects with each other.Therefore, the terms in the expansion of E (cid:104) ( V (2)1 ,n ) (cid:105) for which n has highest degree of freedom should dominate. Forthese terms, each index in i ( h ) t , j ( h ) s , i, j should have exactly one pair. The number of these terms is of order O ( n q ).The summation has forms (cid:80) pl ,l ,l ,l =1 (Σ dl l Σ dl l Σ dl l Σ el l Σ fl l Σ fl l ), s.t. d > , e + f > d + e + f = q . Weneed to show that it is of order o ( (cid:107) Σ (cid:107) qq ) to complete the proof. By symmetry, we can assume e >
0, and therefore d, e ≤
1. Note that for q > p (cid:88) l ,l ,l ,l =1 (Σ dl l Σ dl l Σ el l Σ el l Σ fl l Σ fl l )= p (cid:88) l ,l ,l ,l =1 (Σ l l Σ l l Σ l l Σ l l )(Σ d − l l Σ d − l l Σ e − l l Σ e − l l Σ fl l Σ fl l ) ≤ p (cid:88) l ,l ,l ,l =1 | Σ l l Σ l l Σ l l Σ l l | q/ /q p (cid:88) l ,l ,l ,l =1 | Σ d − l l Σ d − l l Σ e − l l Σ e − l l Σ fl l Σ fl l | q/ ( q − − /q (cid:46) o ( (cid:107) Σ (cid:107) q ) · (cid:107) Σ (cid:107) q − q = o ( (cid:107) Σ (cid:107) qq ) , p (cid:88) l ,l ,l ,l =1 | Σ d − l l Σ d − l l Σ e − l l Σ e − l l Σ fl l Σ fl l | q/ ( q − (cid:46) p (cid:88) l ,l ,l ,l =1 (Σ ql l Σ ql l + Σ ql l Σ ql l + Σ ql l Σ ql l ) = 3 (cid:107) Σ (cid:107) qq . When q = 2, it must be the case that d = e = 1, the term becomes (cid:80) pl ,l ,l ,l =1 | Σ l l Σ l l Σ l l Σ l l | , and directlyapplying A.1 can yield the desired order.We can then conclude that E [ V (2)1 ,n ] → V (2)1 ,n p →
0. Combining what we have proved so far, we obtain V ,n p → c !( q − c )!( r − a ) c ( b − r ) q − c .Similar argument shows that V ,n p → c !( q − c )!( r − a ) c ( b − r ) q − c , V ,n p → c !( q − c )!( r − a ) c ( b − r ) q − c . Therefore, we conclude that V n p → α c !( q − c )!( r − a ) c ( b − r ) q − c + α c !( q − c )!( r − a ) c ( b − r ) q − c + 2 α α c !( q − c )!( r − a ) c ( b − r ) q − c , which completes the proof. ♦ We can generalize the above lemma to the case when c i , q i are not identical. Lemma . Fix q , c , q , c for any ≤ a < r < b ≤ , ≤ a < r < b , any α , α ∈ R , we have α a n S n,q ,c ( r ; [ a , b ]) + α a n S n,q ,c ( r , [ a , b ]) D −→ α Q q ,c ( r ; [ a , b ]) + α Q q ,c ( r ; [ a , b ]) , where Q q ,r and Q q ,r are independent Gaussian processes if q (cid:54) = q , or ( c − c )( r − r ) < or r = r , c (cid:54) = c .And when q = q = q, ( c − c )( r − r ) > = 0 , we have cov ( Q q,c ( r ; [ a , b ]) , Q q,c ( r ; [ a , b ])) = (cid:18) Cc (cid:19) c !( q − C )!( r − A ) c ( R − r ) C − c ( b − R ) q − C , Proof of Lemma 6.4.
We use the same notations in proving last lemma, as the proof is similar to the previous oneand involves applying martingale CLT, where we have decomposed V n into 2 parts. Since the argument there can bedirectly applied, the only additional work is about calculating the mean.To prove the second statement, we take c < c , a < a < r < r < b < b , as the example case, since the proof33or other cases are similar. With the same technique we have used, it can be shown that E [ V n ] → α c !( q − c )!( r − a ) c ( b − r ) q − c + α c !( q − c )!( r − a ) c ( b − r ) q − c + 2 α α (cid:18) c c (cid:19) c !( q − c )!( r − a ) c ( r − r ) c − c ( b − r ) q − c . To derive the convergence in the statement, we can follow the same argument as before to show the variance goes to0, and therefore, we have the convergence in distribution, with desired covariance structure.As for the first statement, it is straightforward to see that the expectation for the crossing term (corresponding to α α ) is 0 for each of the cases in the first statement, which implies that the Gaussian processes have to be independentdue to asymptotic normality. ♦ Now we are ready to complete the proof of Theorem 2.1.
Proof of Theorem 2.1.
The tightness is guaranteed by Lemma 6.2 and applying Lemma 7.1 in Kley et al. (2016) withΦ( x ) = x , T = T n , d ( u, u (cid:48) ) = (cid:107) u − u (cid:48) (cid:107) / , ¯ η = n − / /
2. We omit the detailed proof as the argument is similar to thetightness proof in Wang et al. (2019). Lemma 6.4 has provided finite dimensional convergence of S n,q,c , which hasasymptotic covariance structure as Q q,c after normalization. Therefore, we have derived desired process convergence. ♦ Proof Theorem 2.3.
Let ( s, k, m ) = ( (cid:98) an (cid:99) + 1 , (cid:98) rn (cid:99) , (cid:98) bn (cid:99) ) and define D Zn,q ( r ; a, b ) = p (cid:88) l =1 ∗ (cid:88) s ≤ i ,...,i q ≤ k ∗ (cid:88) k +1 ≤ j ,...,j q ≤ m ( Z i ,l − Z j ,l ) · · · (cid:0) Z i q ,l − Z j q ,l (cid:1) . Recall that Theorem 2.2 holds for D Zn,q since under the null D Zn,q = D n,q .Now we are under the alternative, with the location point k = (cid:98) nτ (cid:99) and the change of mean equal to ∆ n . Suppose34LOG s < k < k < m . D n,q ( r ; a, b ) = p (cid:88) l =1 ∗ (cid:88) s ≤ i ,...,i q ≤ k ∗ (cid:88) k +1 ≤ j ,...,j q ≤ m ( X i ,l − X j ,l ) · · · (cid:0) X i q ,l − X j q ,l (cid:1) = q ! p (cid:88) l =1 (cid:88) s ≤ i <...
1, where D n,c,l ( r ; a, b ) = (cid:88) s ≤ i <...
Proposition . For any ≤ l < k < m ≤ n , k ≥ l + 1 and m ≥ k + 2 , we have:1. if k ∗ < l or k ∗ ≥ m , U n, ( k ; l, m ) = U Zn, ( k ; l, m ) ; . if l ≤ k ≤ k ∗ < m , U n, ( k ; l, m ) = U Zn, ( k ; l, m ) + ( k − l + 1)( k − l )( m − k ∗ )( m − k ∗ − (cid:107) ∆ n (cid:107) − k − l + 1)( m − k ∗ )( m − k ) k (cid:88) i = l ∆ Tn Z i + 2( k − l )( k − l + 1)( m − k ∗ ) m (cid:88) i = k +1 ∆ Tn Z i + 2( k − l )( m − k ∗ ) k (cid:88) i = l ∆ Tn Z i − k − l + 1)( k − l + 1) m (cid:88) i = k ∗ +1 ∆ Tn Z i ;
3. if l ≤ k ∗ ≤ k < m , U n, ( k ; l, m ) = U Zn, ( k ; l, m ) + ( k ∗ − l + 1)( k ∗ − l )( m − k )( m − k − (cid:107) ∆ n (cid:107) − k ∗ − l + 1)( m − k )( m − k − k (cid:88) i = l ∆ Tn Z i + 2( m − k − k ∗ − l + 1)( k − l + 1) m (cid:88) i = k +1 ∆ Tn Z i + 2( m − k − m − k ) k ∗ (cid:88) i = l ∆ Tn Z i − m − k − k ∗ − l + 1) m (cid:88) i = k +1 ∆ Tn Z i . Let (cid:15) n = nγ − / κn, . We have the following result. Proposition . Under Assumption 3.1,1. P (cid:0) sup k ∈ Ω n U n, ( k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) ≥ (cid:1) → ;2. P ( W n, ( k ∗ ; 1 , n ) − inf k ∈ Ω n W n, ( k ; 1 , n ) ≥ → ,where Ω n = { k : | k − k ∗ | > (cid:15) n } . Now we are ready to prove the convergence rate for SN-based statistic ˆ τ . Proof of Theorem 3.1.
Due to the fact that ˆ k is the global maximizer, we have0 ≤ U n, (ˆ k ; 1 , n ) W n, (ˆ k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) W n, ( k ∗ ; 1 , n )= U n, (ˆ k ; 1 , n ) W n, (ˆ k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) W n, (ˆ k ; 1 , n ) + U n, ( k ∗ ; 1 , n ) W n, (ˆ k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) W n, ( k ∗ ; 1 , n )= 1 W n, (ˆ k ; 1 , n ) ( U n, (ˆ k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) ) + U n, ( k ∗ ; 1 , n ) W n, (ˆ k ; 1 , n ) W n, ( k ∗ ; 1 , n ) ( W n, ( k ∗ ; 1 , n ) − W n, (ˆ k ; 1 , n )) . Since U n, ( k ∗ ; 1 , n ) , W n, (ˆ k ; 1 , n ) and W n, ( k ∗ ; 1 , n ) are all strictly positive almost surely, we can then concludethat U n, (ˆ k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) ≥ W n, ( k ∗ ; 1 , n ) − W n, (ˆ k ; 1 , n ) ≥
0. Define Ω n = { k : | k − k ∗ | > (cid:15) n } . If ˆ k ∈ Ω n ,37hen there exists at least one k ∈ Ω n such that U n, ( k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) ≥ W n, ( k ∗ ; 1 , n ) − W n, ( k ; 1 , n ) ≥ P (ˆ k ∈ Ω n ) ≤ P (cid:18) sup k ∈ Ω n U n, ( k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) ≥ (cid:19) + P (cid:18) W n, ( k ∗ ; 1 , n ) − inf k ∈ Ω n W n, ( k ; 1 , n ) ≥ (cid:19) . By Proposition 6.2, it is straightforward to see that P (ˆ k ∈ Ω n ) →
0, and this completes the proof. ♦ Proof of Proposition 6.1. If k ∗ < l or k ∗ ≥ m , then E [ X i ] are all identical, for i = l, ..., m . This implies that U n, ( k ; l, m ) = (cid:80) l ≤ i (cid:54) = i ≤ k (cid:80) k +1 ≤ j (cid:54) = j ≤ m ( X i − X j ) T ( X i − X j ) = (cid:80) l ≤ i (cid:54) = i ≤ k (cid:80) k +1 ≤ j (cid:54) = j ≤ m ( Z i − Z j ) T ( Z i − Z j ) = U Zn, ( k ; l, m ).When l ≤ k ∗ < m , there are two scenarios depending on the value of k . If k ≤ k ∗ , note that E [ X i ] = ∆ n for any i > k ∗ and zero otherwise, then by straightforward calculation we have U n, ( k ; l, m ) = (cid:88) l ≤ i (cid:54) = i ≤ k (cid:88) k +1 ≤ j (cid:54) = j ≤ m ( X i − X j ) T ( X i − X j )= (cid:88) l ≤ i (cid:54) = i ≤ k (cid:88) k +1 ≤ j (cid:54) = j ≤ m ( Z i − Z j − E [ X j ]) T ( Z i − Z j − E [ X j ])= U n, ( k ; l, m ) + ( k − l + 1)( k − l )( m − k ∗ )( m − k ∗ − (cid:107) ∆ n (cid:107) − k − l )( m − k ∗ ) k (cid:88) i = l k ∗ (cid:88) j = k +1 ∆ Tn ( Z i − Z j ) − k − l )( m − k ∗ − k (cid:88) i = l m (cid:88) j = k ∗ +1 ∆ Tn ( Z i − Z j )= U Zn, ( k ; l, m ) + ( k − l + 1)( k − l )( m − k ∗ )( m − k ∗ − (cid:107) ∆ n (cid:107) − k − l )( m − k ∗ )( m − k ) k (cid:88) i = l ∆ Tn Z i + 2( k − l )( m − k ∗ )( k − l + 1) m (cid:88) i = k +1 ∆ Tn Z i + 2( k − l )( m − k ∗ ) k (cid:88) i = l ∆ Tn Z i − k − l )( k − l + 1) m (cid:88) i = k ∗ +1 ∆ Tn Z i . k ≥ k ∗ we have U n, ( k ; l, m ) = (cid:88) l ≤ i (cid:54) = i ≤ k (cid:88) k +1 ≤ j (cid:54) = j ≤ m ( X i − X j ) T ( X i − X j )= (cid:88) l ≤ i (cid:54) = i ≤ k (cid:88) k +1 ≤ j (cid:54) = j ≤ m ( Z i − Z j + E [ X i ] − ∆ n ) T ( Z i − Z j + E [ X i ] − ∆ n )= U n, ( k ; l, m ) + ( k ∗ − l + 1)( k ∗ − l )( m − k )( m − k − (cid:107) ∆ n (cid:107) − m − k − k ∗ − l ) k ∗ (cid:88) i = l m (cid:88) j = k +1 ∆ Tn ( Z i − Z j ) − m − k − k ∗ − l + 1) k (cid:88) i = k ∗ +1 m (cid:88) j = k +1 ∆ Tn ( Z i − Z j )= U Zn, ( k ; l, m ) + ( k ∗ − l + 1)( k ∗ − l )( m − k )( m − k − (cid:107) ∆ n (cid:107) − k ∗ − l + 1)( m − k )( m − k − k (cid:88) i = l ∆ Tn Z i + 2( m − k − k ∗ − l + 1)( k − l + 1) m (cid:88) i = k +1 ∆ Tn Z i + 2( m − k − m − k ) k ∗ (cid:88) i = l ∆ Tn Z i − m − k − k ∗ − l + 1) m (cid:88) i = k +1 ∆ Tn Z i . ♦ Proof of Proposition 6.2.
To show the first result, we first assume k < k ∗ − (cid:15) n . Then according to Proposition 6.1, U n, ( k ; 1 , n ) = U Zn, ( k ; 1 , n ) + k ( k − n − k ∗ )( n − k ∗ − (cid:107) ∆ n (cid:107) − k − n − k ∗ )( n − k ) k (cid:88) i =1 ∆ Tn Z i + 2 k ( k − n − k ∗ ) n (cid:88) i = k +1 ∆ Tn Z i + 2( k − n − k ∗ ) k (cid:88) i =1 ∆ Tn Z i − k ( k − n (cid:88) i = k ∗ +1 ∆ Tn Z i . Similarly we have U n, ( k ∗ ; 1 , n ) = U Zn, ( k ∗ ; 1 , n ) + k ∗ ( k ∗ − n − k ∗ )( n − k ∗ − (cid:107) ∆ n (cid:107) − k ∗ − n − k ∗ )( n − k ∗ − k ∗ (cid:88) i =1 ∆ Tn Z i + 2 k ∗ ( k ∗ − n − k ∗ − n (cid:88) i = k ∗ +1 ∆ Tn Z i . It is easy to verify that E [ U n, ( k ; 1 , n )] = k ( k − n − k ∗ )( n − k ∗ − (cid:107) ∆ n (cid:107) , for k ≤ k ∗ . Furthermore, by Theorem2.1 in Wang et al. (2019) and the argument therein, we havesup k =2 ,...,n − | U Zn, ( k ; 1 , n ) | = O ( n (cid:107) Σ (cid:107) F ) = o p ( n . (cid:112) (cid:107) Σ (cid:107) F (cid:107) ∆ n (cid:107) ) , (cid:112) (cid:107) Σ (cid:107) F = o ( √ n (cid:107) ∆ n (cid:107) ) by Assumption 3.1 (3), andsup ≤ a ≤ b ≤ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) b (cid:88) i = a ∆ Tn Z i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p ( √ n (cid:113) ∆ Tn Σ∆ n ) ≤ O p ( (cid:112) n (cid:107) Σ (cid:107) (cid:107) ∆ n (cid:107) ) ≤ O p ( (cid:112) n (cid:107) Σ (cid:107) F (cid:107) ∆ n (cid:107) ) . These imply that U n, ( k ∗ ; 1 , n ) = k ∗ ( k ∗ − n − k ∗ )( n − k ∗ − (cid:107) ∆ n (cid:107) + O p ( n . (cid:107) ∆ n (cid:107) (cid:112) (cid:107) Σ (cid:107) F )= k ∗ ( k ∗ − n − k ∗ )( n − k ∗ − (cid:107) ∆ n (cid:107) + o p ( n (cid:107) ∆ n (cid:107) ) , since (cid:112) (cid:107) Σ (cid:107) F = o ( √ n (cid:107) ∆ n (cid:107) ) by Assumption 3.1 (3). Therefore, we have P ( sup k
0) and P (sup k 0. Similar tactics can be applied to the case k > k ∗ + (cid:15) n and by combining the two parts we have P (cid:0) sup k ∈ Ω n U n, ( k ; 1 , n ) − U n, ( k ∗ ; 1 , n ) ≥ (cid:1) → 0. Thereforethis completes the proof for the first result.It remains to show the second part. Let us again assume k < k ∗ − (cid:15) n first. By Proposition 6.1 we have W n, ( k ∗ ; 1 , n ) = 1 n k ∗ − (cid:88) t =2 U n, ( t ; 1 , k ∗ ) + 1 n n − (cid:88) t = k ∗ +2 U n, ( t ; k ∗ + 1 , n ) = 1 n k ∗ − (cid:88) t =2 U Zn, ( t ; 1 , k ∗ ) + 1 n n − (cid:88) t = k ∗ +2 U Zn, ( t ; k ∗ + 1 , n ) , and W n, ( k ; 1 , n ) = 1 n k − (cid:88) t =2 U n, ( t ; 1 , k ) + 1 n n − (cid:88) t = k +2 U n, ( t ; k + 1 , n ) = 1 n k − (cid:88) t =2 U Zn, ( t ; 1 , k ) + 1 n n − (cid:88) t = k +2 U n, ( t ; k + 1 , n ) . When t is between k + 2 and k ∗ , by Proposition 6.1 we have U n, ( t ; k + 1 , n ) = U Zn, ( t ; k + 1 , n ) + ( t − k )( t − k − n − k ∗ )( n − k ∗ − (cid:107) ∆ n (cid:107) − t − k − n − k ∗ )( n − t ) t (cid:88) i = k +1 ∆ Tn Z i + 2( t − k − n − k ∗ )( t − k ) n (cid:88) i = t +1 ∆ Tn Z i + 2( t − k − n − k ∗ ) t (cid:88) i = k +1 ∆ Tn Z i − t − k − t − k ) n (cid:88) i = k ∗ +1 ∆ Tn Z i , and from the above decomposition we observe that E [ U n, ( t ; k + 1 , n )] = ( t − k )( t − k − n − k ∗ )( n − k ∗ − (cid:107) ∆ n (cid:107) ,41hich is the second term in the above equality. Then U n, ( t ; k + 1 , n ) = ( U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] + E [ U n, ( t ; k + 1 , n )]) ≥ E [ U n, ( t ; k + 1 , n )] + 2 E [ U n, ( t ; k + 1 , n )]( U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )]) ≥ E [ U n, ( t ; k + 1 , n )] − E [ U n, ( t ; k + 1 , n )] sup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] | , since E [ U n, ( t ; k + 1 , n )] > 0. Furthermore,sup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] |≤ sup t = k +2 ,...,n − | U Zn, ( t ; k + 1 , n ) | + 8 n sup a
2, we have U n, ( t ; k + 1 , n ) ≥ E [ U n, ( t ; k + 1 , n )] − E [ U n, ( t ; k + 1 , n )] sup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] | , where E [ U n, ( t ; k + 1 , n )] = ( k ∗ − k )( k ∗ − k − n − t )( n − t − (cid:107) ∆ n (cid:107) > 0, andsup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] |≤ O p ( n (cid:107) Σ (cid:107) F ) + O p ( n . (cid:113) ∆ Tn Σ∆ n ) = O p ( n (cid:107) ∆ n (cid:107) / √ a n )42herefore by combining the above results we obtain that W n, ( k ; 1 , n ) ≥ n k − (cid:88) t =2 U Zn, ( t ; 1 , k ) + 1 n k ∗ (cid:88) t = k +2 E [ U n, ( t ; k + 1 , n )] + 1 n n − (cid:88) t = k ∗ +1 E [ U n, ( t ; k + 1 , n )] − n sup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] | k ∗ (cid:88) t = k +2 E [ U n, ( t ; k + 1 , n )] − n sup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] | n − (cid:88) t = k ∗ +1 E [ U n, ( t ; k + 1 , n )] (cid:38) ( k ∗ − k ) n (cid:107) ∆ n (cid:107) − ( k ∗ − k ) n (cid:107) ∆ n (cid:107) sup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] | + ( k ∗ − k ) n (cid:107) ∆ n (cid:107) − ( k ∗ − k ) n (cid:107) ∆ n (cid:107) sup t = k +2 ,...,n − | U n, ( t ; k + 1 , n ) − E [ U n, ( t ; k + 1 , n )] |− (cid:32) sup k sup t =2 ,...,k − | U Zn, ( t ; 1 , k ) | (cid:33) =( k ∗ − k ) n (cid:107) ∆ n (cid:107) [( k ∗ − k ) − o p ( n / √ γ n, )] + ( k ∗ − k ) n (cid:107) ∆ n (cid:107) [( k ∗ − k ) − o p ( n / √ γ n, )] − O p ( n (cid:107) Σ (cid:107) F ) ≥ ( k ∗ − k ) n (cid:107) ∆ n (cid:107) [ (cid:15) n − o p ( n / √ γ n, )] + ( k ∗ − k ) n (cid:107) ∆ n (cid:107) [ (cid:15) n − o p ( n / √ γ n, )] − O p ( n (cid:107) Σ (cid:107) F )=(( k ∗ − k ) n + ( k ∗ − k ) n ) (cid:107) ∆ n (cid:107) (cid:15) n (1 − o p (1)) − O p ( n (cid:107) Σ (cid:107) F ) , since (cid:15) n = na − / κn . Andinf k 61 (200,10) 0.035 0.096 0.068 0.08 0.048 0.075 0.152 0.135 0.124 0.096(400,20) 0.054 0.084 0.049 0.071 0.048 0.097 0.142 0.094 0.135 0.0990.2 (200,10) 0.065 0.117 0.08 0.116 0.062 0.095 0.153 0.151 0.147 0.121(400,20) 0.05 0.101 0.043 0.09 0.047 0.099 0.153 0.096 0.137 0.083 Table 6: Size for testing one change point of network time seriesAs regards the power simulation, we generate the network data with a change point located at (cid:98) n/ (cid:99) , which leadsto µ t = µ + δ I ( t > n/ · µ . We take µ = 0 . /c, r = cm with c = 0 . , δ = 0 . , . 5. We obtain the empirical powerbased on 1000 Monte Carlo repetitions. DGP ( n, m ) H ,5% H ,10%( δ, c ) q = 2 q = 4 q = 6 q = 2 , q = 2 , q = 2 q = 4 q = 6 q = 2 , q = 2 , Table 7: Power for testing one change point of network time seriesWe can see that our method exhibits similar size behavior as compared to the setting for Gaussian distributed44ata in Section 4.1. The power also appears to be quite good and increases when the signal increases. Unfortunately,we are not aware of any particular testing method tailored for single network change-point so we did not include anyother method into the comparison.To estimate the change-points in the network time series, we also combine our method with WBS. We generate100 samples of networks with connection probability µ t and sparsity parameter r . The 3 change points are located at30 , 60 and 90. We take µ t = µ + δ · I (30 < t ≤ 60 or t > · µ . We report the MSE and ARI of 100 Monte Carlosimulations as before. We compare our method with modified neighborhood smoothing (MNBS) algorithm in Zhaoet al. 
(2019) and the graph-based test in Chen and Zhang (2015) combined with the binary segmentation (denotedas CZ). We do not include a comparison with Wang et al. (2020) as their method requires two iid samples. We cansee that CZ performs worse than the other two methods as our simulation involves non-monotonic changes in themean that does not favor binary segmentation. When the network becomes sparse, i.e. c = 0 .