Adaptive Change Point Monitoring for High-Dimensional Data
SStatistica Sinica
Adaptive Change Point Monitoring for High-Dimensional Data
Teng Wu, Runmin Wang, Hao Yan, Xiaofeng Shao
Department of Statistics, University of Illinois at Urbana-ChampaignDepartment of Statistical Science, Southern Methodist UniversitySchool of Computing Informatics & Decision Systems Engineering, Arizona State University
Abstract:
In this paper, we propose a class of monitoring statistics for a mean shift in a sequence ofhigh-dimensional observations. Inspired by the recent U-statistic based retrospective tests developedby Wang et al. (2019) and Zhang et al. (2020), we advance the U-statistic based approach to thesequential monitoring problem by developing a new adaptive monitoring procedure that can detectboth dense and sparse changes in real time. Unlike Wang et al. (2019) and Zhang et al. (2020), whereself-normalization was used in their tests, we instead introduce a class of estimators for q -norm ofthe covariance matrix and prove their ratio consistency. To facilitate fast computation, we furtherdevelop recursive algorithms to improve the computational efficiency of the monitoring procedure.The advantage of the proposed methodology is demonstrated via simulation studies and real dataillustrations. Key words and phrases:
Change point detection, Sequential monitoring, Sequential testing, U-statistics.
1. Introduction
Change point detection problems have been extensively studied in many areas, such asstatistics, econometrics and engineering, and there are wide applications to fields in physicalscience and engineering. The literature is huge and is still growing rapidly. For the low-dimensional data, it dates back to early work by Page (1954); MacNeill (1974); Brown et al.(1975), among others. More recent work that studied change point problems for low/fixeddimensional multivariate time series data can be found in Shao and Zhang (2010); Mattesonand James (2014); Kirch et al. (2015); B¨ucher et al. (2019), among others. We refer toPerron (2006), Aue and Horv´ath (2013) and Aminikhanghahi and Cook (2017) for someexcellent reviews on this topic.The literature on change point detection can be roughly divided into two categories:retrospective testing and estimation of change points based on a complete data sequenceoffline and online sequential monitoring for change points based on some training data andsequentially arrived data. This paper is concerned with the sequential monitoring problemfor temporally independent but cross-sectionally dependent high-dimensional data. Thereare two major lines of research for sequential change-point detection/monitoring. In one line,a huge body of work follows the paradigm set by pioneers in the field, such as Wald (1945),Page (1954) and Lorden (1971); see Lai (1995, 2001) and Polunchenko and Tartakovsky(2012) for comprehensive reviews. Most sequential detection methods along this line areoptimized to have a minimal detection delay with a control of average run length underthe null and also the existing procedures are mostly developed for low-dimensional data. a r X i v : . [ s t a t . M E ] J a n hese methods often require both pre-change distribution and post-change distribution tobe specified or some parametric assumption to be made. In the other line, Chu et al. (1996)assumed that there is a stretch of training data (without any change points) and sequentialmonitoring was applied to sequentially arrived testing data. They employ the invarianceprinciple to control the type I error and their framework has been adopted by many otherresearchers in both parametric and nonparametric contexts. See Horv´ath et al. (2004);Aue et al. (2012); Wied and Galeano (2013); Fremdt (2015); Dette and G¨osmann (2019).Along this line, it is typical to use size and power (plus average detection delay) to describeand compare operating characteristics of competing procedures. Our procedure falls intothe second category. It seems to us that these two frameworks are in general difficult tocompare, as they differ in terms of the model assumptions and evaluation criteria etc.Nowadays, with the rapid improvement of data acquisition technology, high-dimensionaldata streams involving continuous sequential observations appear frequently in modern man-ufacturing and service industries, and the demand for efficient online monitoring tools forsuch data has never been higher. For example, in Yan et al. (2018), they proposed a methodto monitor a multi-channel tonnage profile used for the forging process, which has thou-sands of attributes. Furthermore, image-based monitoring [Yan et al. (2014)] has becomepopular in the literature, which includes thousands of pixels per image. In L´evy-Leducand Roueff (2009), they considered the problem of monitoring thousands of Internet trafficmetrics provided by a French Internet service provider. This kind of high-dimensional dataposes significant new challenges to the traditional multivariate statistical process controland monitoring, since when the dimension p is high and is comparable to the sample size n , most existing sequential monitoring methods constructed based on the fixed-dimensionassumptions become invalid.In this article, we propose a new class of sequential monitoring methodology to monitor achange in the mean of independent high-dimensional data based on (sequential) retrospectivetesting. Our proposal is inspired by recent work on retrospective testing of mean changein high-dimensional data by Wang et al. (2019) and Zhang et al. (2020). In Wang et al.(2019), the authors proposed a U-statistic based approach to target the L -norm of themean difference by extending the U-statistic idea initiated in Chen and Qin (2010) fromtwo-sample testing to the change point testing problem. In Zhang et al. (2020), they furtherextended the test in Wang et al. (2019) to a L q -norm-based one mimicking He et al. (2018),where q ∈ N , to capture sparse alternative. By combining L -norm-based test and L q -norm-based one, the adaptive test statistic they proposed is shown to achieve high power forboth dense and sparse alternatives. However, one of the limitations of these works is thatthese methods are designed for off-line analysis, which is not suitable to be applied to real-time online monitoring systems. Building on the works of Wang et al. (2019) and Zhanget al. (2020), we shall propose a new adaptive sequential monitoring procedure that cancapture both sparse and dense alternatives. Instead of using the self-normalization scheme[Shao (2010); Shao and Zhang (2010); Shao (2015)], as done in Wang et al. (2019) and Zhanget al. (2020), we opt to use ratio-consistent estimators for (cid:107) Σ (cid:107) qq based on the training data,where Σ is the common covariance matrix of the sequence of random vectors and providea rigorous proof. Further, we develop recursive algorithms for fast implementation so thatt each time the monitoring statistics can be efficiently computed. The resulting adaptivemonitoring procedure via a combination of L and L q (say q = 6) based sequential tests areshown to be powerful against both dense and sparse alternatives via theory and simulations.There is a growing literature on high-dimensional change point detection in the retro-spective setting; see Horv´ath and Huˇskov´a (2012); Cho and Fryzlewicz (2015); Jirak (2015);Yu and Chen (2017); Wang and Samworth (2018); Yu and Chen (2019); Wang et al. (2019);Zhang et al. (2020); Wang and Shao (2020), among others. It is worth noting that Enikeevaand Harchaoui (2019) developed a test based on a combination of a linear statistic and ascan statistic, and their test can be adaptive to both sparse and dense alternatives. How-ever, their Gaussian and independent components assumptions are also too restrictive. Inaddition, the literature for online monitoring of high-dimensional data streams has also beengrowing steadily in the literature of statistics and quality control in the last decade. In par-ticular, Mei (2010) proposed a global monitoring scheme based on the sum of the cumulativesum monitoring statistics from each individual data stream. His method aims to minimizethe delay time and control the global false alarm rate, which is based on the average runlength under the null. This is different from the size and power analysis as done in our work.Note that the assumptions in Mei (2010) are quite restrictive in the sense that he assumedall data streams do not have cross-sectional dependence, and that both the pre-change andpost-change distributions are known. See Wang and Mei (2015), Zou et al. (2015), Liuet al. (2019), and Li (2020) for several variants in the sense that they propose new ways ofaggregating the local monitoring statistics. In Xie and Siegmund (2013), they proposed amixture detection procedure based on a likelihood ratio statistic that takes into account thefraction of data streams being affected. They argue that the performance is good when thefraction of affected data streams are known and do not require to completely specify thepost-change distribution. However, the mixture global log-likelihood they specified relies onthat hypothesized affected fraction p , and they showed the robustness of different choices of p only through numerical studies. The results they derived hold for data generated from anormal distribution or other exponential families of distributions. A common feature of allthese works is that they assume the data streams do not have cross-sectional dependence,which may be violated in practice. As a matter of fact, our theory for the proposed monitor-ing statistic demonstrates the impact of the correlation/covariance structure of the multipledata streams, which seems not well appreciated in the above-mentioned literature.The rest of the paper is structured as follows: In Section 2, we specify the change pointmonitoring framework we use and propose the monitoring statistic that targets the L q -normof the mean change. An adaptive monitoring scheme can be derived by combining the teststatistic for different q ’s, q ∈ N . Section 3 provides a ratio-consistent estimator for (cid:107) Σ (cid:107) qq ,which is crucial when constructing the monitoring statistics. Section 4 provides simulationstudies to examine the finite sample performance of the adaptive monitoring statistic. InSection 5, we apply the adaptive monitoring scheme to two real datasets. Section 6 concludesthe paper. All the technical details can be found in the Appendix. . Monitoring Statistics In this section, we specify the general framework we use to perform change point monitoring.We consider a closed-end change point monitoring scenario following Chu et al. (1996).Assume that we observe a sequence of temporally independent high dimensional observations X , . . . , X n ∈ R p , which are ordered in time and have constant mean µ and covariance matrixΣ. We start the monitoring procedure from time ( n + 1) to detect if the mean vector changesin the future. Throughout the analysis, we assume that all X t ’s are independent over time.A decision is to be made at each of the time points, and we will signal an alarm when themonitoring statistic exceeds a certain boundary. The process ends at time nT regardless ofwhether a change point is detected, where T is a pre-specified number. The Type-I errorof the monitoring procedure is controlled at α , which means the probability of signaling analarm when there is no change within the period [ n + 1 , nT ] is at most α .Under the null hypothesis, no change occurs within the monitoring period and we have E ( X t ) = µ for t = 1 , . . . , nT . against the alternativeUnder the alternatives, the mean function changes at some time t > n , and remains at thesame level for the following observations. That is E ( X t ) = (cid:40) µ < t < t µ + ∆ t ≤ t ≤ nT. We propose a family of test statistics T n,q ( k ), which serves as the monitoring statistictargeting (cid:107) ∆ (cid:107) q . The case q = 2 corresponds to dense alternatives, and larger values of q ’s correspond to sparser alternatives. We will discuss the formulation of our monitoringstatistic for q = 2 and then extend to general q’s in the following subsections. L -norm-based monitoring statistics In this section, we will first develop the L -norm-based monitoring statistic, which is espe-cially useful to detect the dense alternative. Furthermore, we will discuss the asymptoticproperties of the L -norm-based statistic. Finally, the recursive computational algorithmwill be developed to allow efficient implementation. For a given time k > n , suppose we know a change point happens at the location m , where n < m < k . We can separate the observations into two independent samples: pre-break X , . . . , X m and post-break X m +1 , . . . , X k . Consider using a two-sample U-statistic with.1 L -norm-based monitoring statisticskernel h (( X, Y ) , ( X (cid:48) , Y (cid:48) )) = ( X − Y ) T ( X (cid:48) − Y (cid:48) ), where ( X (cid:48) , Y (cid:48) ) is an independent copy of ( X, Y ). Then we have E [ h (( X, Y ) , ( X (cid:48) , Y (cid:48) ))] = (cid:107) E ( X ) − E ( Y ) (cid:107) , which estimates the squared L -norm of the mean difference. Indeed Wang et al. (2019)constructed a L -norm-based retrospective change point detection statistic by scanning overall possible m . For the online monitoring problem, we shall combine this idea with theapproach in Dette and G¨osmann (2019) to propose a monitoring statistic. To be moreprecise, at each time point k , we scan through all possible change point locations m ( n Assumption 1. tr (Σ ) = o ( (cid:107) Σ (cid:107) F ). Assumption 2. Let Cum ( h ) = (cid:80) pl ,...,l h =1 cum ( X ,l , . . . , X ,l h ) ≤ C || Σ || hF for h = 2 , , , , C . Here cum ( · ) is the joint cumulant. In general, for a sequence of ran-dom variable Y , . . . , Y n , their joint cumulant is defined as cum ( Y , . . . , Y n ) = (cid:88) π ( | π | − − | π |− (cid:89) B ∈ π E (cid:32)(cid:89) i ∈ B Y i (cid:33) , where π runs through the list of all partitions of { , . . . , n } , B runs through the list of allblocks of partition π and and π is the number of parts in the partition.Assumption 1 was also imposed in Chen and Qin (2010), who pioneered the use of U -statistic approach in the two-sample testing problem for high-dimensional data, and it canbe satisfied by a wide range of covariance models. Assumption 2 can be viewed as some.1 L -norm-based monitoring statisticsrestrictions on the dependence structure, which holds under uniform bounds on momentsand ‘short-range’ dependence type conditions on the entries of the vector ( X , , ..., X ,p ). SeeWang et al. (2019) for more discussions about these two assumptions. Finally, under thenull hypothesis and these assumptions, we provide the limiting distribution of the proposedmonitoring statistic in Theorem 1. Theorem 1. Under Assumptions 1 and 2, we have nT max k = n +3 T n, ( k ) D −→ sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] G ( s, t ) , where G ( s, t ) = t ( t − s ) Q (0 , s ) + stQ ( s, t ) − s ( t − s ) Q (0 , t ) , and Q is a Gaussian process whose covariance structure is the following Cov ( Q ( a , b ) , Q ( a , b )) = (cid:40) (min( b , b ) − max( a , a )) if max( a , a ) ≤ min( b , b )0 otherwise In general, we can also consider some non-constant boundary function w ( t ), that is, nT max k = n +1 T n, ( k ) w ( k/n − D −→ sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] G ( s, t ) w ( t − . We take the double supremums here to control the familywise error rate. Therefore, wereject the null hypothesis if T n, ( k ) > c α w ( k/n − 1) for some k ∈ { n + 3 , . . . , nT } . The sizecan be calibrated by choosing a c α , such that P (cid:32) sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] G ( s, t ) w ( t − > c α (cid:33) = α. Different choices of w ( t ) have been considered in Dette and G¨osmann (2019). • (T1) w ( t ) = 1, • (T2) w ( t ) = ( t + 1) , • (T3) w ( t ) = ( t + 1) · max (cid:110)(cid:16) tt +1 (cid:17) / , − (cid:111) .These w ( t )s are motivated by the law of iterated logarithm and are used to reduce the stop-ping delay under the alternative. Based on our simulation results and real data applications,the choice of w ( t ) among the above three candidates does not seem to have a big impact onpower and detection delay. So in practice, for closed-end procedure, any choice would work.The detailed comparisons are shown in the simulation studies in Section 4..1 L -norm-based monitoring statistics Remark The current method can be generalized to the open-end framework. For anopen-end monitoring procedure, we are interested in testing E ( X t ) = µ for t = 1 , , . . .. against the alternative E ( X t ) = (cid:40) µ < t < t µ + ∆ t > t . for some t > n . Suppose we use the same L norm based monitoring statistic at time k = n + 3 , . . . , i.e., T n, ( k ) = 1 n (cid:91) (cid:107) Σ (cid:107) F max m = n +1 ...,k − G k ( m ) . For a suitably chosen boundary function w ( · ), we expect that ∞ sup k = n +3 T n, ( k ) w ( k/n − D −→ sup t ∈ [1 , ∞ ) sup s ∈ [1 ,t ] G ( s, t ) w ( t − , as n → ∞ . The critical value can be determined by P (cid:32) sup t ∈ [1 , ∞ ) sup s ∈ [1 ,t ] G ( s, t ) w ( t − > c α (cid:33) = α. We reject the null hypothesis if T n, ( k ) > c α w ( k/n − 1) for some k ∈ { n + 3 , . . . } . In practice,we can approximate critical values c α based on the procedure we used for simulating thecritical values in the closed-end procedure, by using a large T , say T = 200. Note that theboundary function used for open-end monitoring needs to satisfy certain smoothness anddecay rate assumptions and the above three we used for the closed-end procedure are nolonger applicable; see Assumption 2.4 in G¨osmann et al. (2020) and related discussions.The following theorem provides theoretical analysis for the power of the L -norm-basedmonitoring procedure. Theorem 2. Suppose that Assumptions 1 and 2 hold. Further assume that the change pointlocation is at (cid:98) nr (cid:99) for some r ∈ (1 , T ) , we have1. When n ∆ T ∆ || Σ || F → , max k = n +3 ,...nT T n, ( k ) D −→ sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] G ( s, t ) . 2. When n ∆ T ∆ || Σ || F → b ∈ (0 + ∞ ) , max k = n +3 ,...nT T n, ( k ) D −→ ˜ T = sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] [ G ( s, t ) + b Λ( s, t )] , .1 L -norm-based monitoring statistics where Λ( s, t ) = ( t − r ) s s ≤ rr ( t − s ) s > r otherwise . 3. When n ∆ T ∆ || Σ || F → ∞ , max k = n +3 ,...nT T n, ( k ) D −→ ∞ . Theorem 2 implies that, under the local alternative where n ∆ T ∆ || Σ || F → 0, the proposedmonitoring procedure has trivial power. For the diverging alternative where n ∆ T ∆ || Σ || F → + ∞ ,the test has power converging to 1. When the strength corresponding to the change falls inbetween, the test has power between ( α, One challenge for the proposed monitoring statistic T n, ( k ) is that it needs to be recomputedat each given time k . The brute force calculation of the test statistics has O ( n p ) timecomplexity and O(1) space complexity. In this section, we develop a recursive algorithmto efficiently update the monitoring statistic, which greatly improves the computationalefficiency for online monitoring. More specifically, we propose a recursive algorithm toupdate G k ( m ), which is a major component to compute the monitoring statistic T n, ( k ) asfollows: G k ( m ) = ( k − m )( k − m − (cid:88) ≤ i 2+ ( n + 1) n [( B n +3 − B n +1 ) T ( B n +3 − B n +1 ) − ( C n +3 − C n +1 )] / − nB Tn +1 ( B n +3 − B n +1 ) . Increase index from k to k + 1: Fix index m , compute B k +1 and C k +1 : B k +1 = B k + X k +1 , C k +1 = C k + X Tk +1 X k +1 . The statistic for the pair ( m, k + 1) is G k +1 ( m ) = ( k − m + 1)( k − m )( B Tm B m − C m ) / m ( m − B k +1 − B m ) T ( B k +1 − B m )) − ( C k +1 − C m )] / − ( m − k − m ) m (cid:88) i =1 B Tm ( B k +1 − B m ) . Increase index from m to m + 1: For fixed index k , all B i and C i for i = n . . . , k are already recorded. The statistic for pair ( m + 1 , k ) is G k ( m + 1) = ( k − m − k − m − B Tm +1 B m +1 − C m +1 ) / 2+ ( m + 1) m [( B k − B m +1 ) T ( B k − B m +1 )) − ( C k − C m +1 )] / − ( k − m − mB Tm +1 ( B k − B m +1 ) . The algorithm should start with ( m, k ) = ( n + 1 , n + 3), increase the second index k first andthen increase along the first index m . The recursive formulation reduces the time complexityto O ( n p ) with additional space complexity O ( np ). L q -norm-based monitoring statistics In this section, we generalize the monitoring statistic from L -norm to L q -norm. As has beenshown in the previous analysis, the power of the L -norm-based monitoring statistic dependson quantity (cid:107) ∆ (cid:107) , which is sensitive to dense alternatives. However, in real applications,we usually do not know a priori if the mean change is dense or not. As an approximation,we can consider a similar test statistic targeting (cid:107) ∆ (cid:107) q , for q ∈ N . When q is large, we.2 L q -norm-based monitoring statisticsare essentially testing against sparse alternatives. As a special case, if we let q → ∞ ,lim q →∞ (cid:107) ∆ (cid:107) q = (cid:107) ∆ (cid:107) ∞ , we only target on the largest element (in absolute value) of the ∆. To define the monitoring statistics, we adopt the idea used in Zhang et al. (2020) withoutapplying self-normalization. Self-normalization requires more extensive computation andcan be avoided by using the Phase I data to obtain a ratio consistent estimator for (cid:107) Σ (cid:107) q .Also, as pointed out by Shao (2015), self-normalization can result in a slight loss of power.Essentially, we can construct a L q -norm-based test statistic at time k = n + q + 1 , . . . , nT , T n,q ( k ) = 1 (cid:113) n q (cid:91) (cid:107) Σ (cid:107) qq max m = n +1 ,...,k − q p (cid:88) l =1 ∗ (cid:88) ≤ i ,...,i q ≤ m ∗ (cid:88) m +1 ≤ j ,...,j q ≤ k ( X i ,l − X j ,l ) · · · ( X i q ,l − X j q ,l )= 1 (cid:113) n q (cid:91) (cid:107) Σ (cid:107) qq max m = n +1 ,...,k − q U n,q ( k, m ) , where (cid:91) (cid:107) Σ (cid:107) qq is a ratio-consistent estimator of (cid:107) Σ (cid:107) qq . In this subsection, we study the asymptotic property of the L q -norm-based test statistics.First, we impose the following conditions in Zhang et al. (2020) to facilitate the asymptoticanalysis. Assumption 3. Let X t = µ + Z t . Suppose Z t are i.i.d copies of Z with mean 0 andcovariance matrix Σ. There exists c > n such that inf i =1 ...,p V ar ( Z ,i ) ≥ c Assumption 4. Z has up to 8-th moments, with sup ≤ j ≤ p E [ Z ,j ] ≤ C , and for h = 2 . . . C h depending on h only and a constant r > | cum ( Z ,l . . . , Z ,l h ) | ≤ C (1 ∨ max ≤ i 90% 0.756 3.235 0.204 0.867 0.141 0.59295% 1.264 3.711 0.331 0.973 0.232 0.67699% 2.715 4.635 0.706 1.196 0.485 0.837 Theorem 3. Under Assumptions 3 and 4, nT max k = n + q +1 T n,q ( k ) d −→ ˜ T q := sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] G q ( s, t ) , where G q ( s, t ) = q (cid:88) c =0 ( − q − c (cid:18) qc (cid:19) s q − c ( t − s ) c Q q,c ( s ; [0 , t ]) , and Q q,c ( r ; [ a, b ]) is a Gaussian process with covariance structure cov ( Q q,c ( r ; [ a , b ]) , Q q,c ( r ; [ a , b ])) = (cid:18) Cc (cid:19) c !( q − c )!( r − A ) c ( R − r ) C − c ( b − R ) q − C , where A = max( a , a ) , c = min( c , c ) , C = max( c , c ) and b = min( b , b ) . Two processes Q q ,c and Q q ,c are mutually independent if q (cid:54) = q ∈ N . The limiting null distribution is pivotal and its critical values can be simulated basedon the following equation, P (cid:32) sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] G q ( s, t ) w ( t − > c α (cid:33) = α. We reject the H when T n,q ( k ) > c α w ( k/n − 1) for k = n + q + 1 , . . . , nT . We tabulatethe critical values for T = 2, q = 2 , L q -norm-based monitoring procedure in Theorem 4. Theorem 4. Suppose that Assumptions 3 and 4 hold and the change point location is at (cid:98) nr (cid:99) for some r ∈ (1 , T ) ,1. When n q/ (cid:107) ∆ (cid:107) qq (cid:107) Σ (cid:107) q/ q → , max nTk = n + q +1 T n,q ( k ) D −→ ˜ T q ; .2 L q -norm-based monitoring statistics 2. When n q/ (cid:107) ∆ (cid:107) qq (cid:107) Σ (cid:107) q/ q → γ ∈ (0 , + ∞ ) , nT max k = n + q +1 T n,q ( k ) D −→ sup t ∈ [1 ,T ] sup s ∈ [1 ,t ] [ G q ( s, t ) + γJ q ( s ; [0 , t ])] , where J q ( s ; [0 , t ]) = r q ( t − s ) q r ≤ s < ts q ( t − r ) q s ≤ r < t otherwise ; 3. When n q/ (cid:107) ∆ (cid:107) qq (cid:107) Σ (cid:107) q/ q → ∞ , max nTk = n + q +1 T n,q ( k ) D −→ ∞ . Analogous to the q = 2 case, the power of the test depends on (cid:107) ∆ (cid:107) q . Therefore, forlarge q , the proposed test is sensitive to sparse alternatives. Similar to the L -based-test statistics, we would like to extend the recursive formulation to L q -norm-based test statistic. According to Zhang et al.(2020), under the null, the process U n,q ( k, m ) can be simplified as U n,q ( k, m ) = q (cid:88) c =0 ( − q − c (cid:18) qc (cid:19) P m − − cq − c P k − m − q + cc S n,q,c ( m ; 1 , k ) , where P kl = k ! / ( k − l )! and S n,q,c ( m ; s, k ) = p (cid:88) l =1 ∗ (cid:88) s ≤ i ,...,i c ≤ m ∗ (cid:88) m +1 ≤ j ,...,j q − c ≤ k c (cid:89) t =1 X i t ,l q − c (cid:89) g =1 X j g ,l . Since S n,q,c ( m ; 1 , k )’s are the major building blocks of our final test statistic and needto be computed at each time k , we need to find efficient ways to calculate them recursively.One key element is the sum of product terms like B ( c, m, l ) := ∗ (cid:88) ≤ i ,...,i c ≤ m c (cid:89) t =1 X i t ,l , and M ( c, m, k, l ) := ∗ (cid:88) m ≤ j ,...,j c ≤ k c (cid:89) g =1 X j g ,l . .3 Combining multiple L q -norm-based statisticsWhen we increase from m to m + 1, ∗ (cid:88) ≤ i ,...,i c ≤ m +1 c (cid:89) t =1 X i t ,l = ∗ (cid:88) ≤ i ,...,i c ≤ m c (cid:89) t =1 X i t ,l + X m +1 ,l · ∗ (cid:88) ≤ i ,...,i c − ≤ m c − (cid:89) t =1 X i t ,l . We can derive the following recursive relationship for B ( c, k, l ), i.e. B ( c, m + 1 , l ) = B ( c, m, l ) + B ( c − , m, l ) · X k +1 ,l . (2.1)There is similar recursive relationship for M ( c, m, k, l ), M ( c, m + 1 , k, l ) = M ( c, m, k, l ) + X m +1 ,l M ( c − , m, k, l ) . (2.2)To enable the recursive computation, for each c = 0 , . . . , q , we maintain a matrix tostore the cumulative sums.1. Initialization : Starting with c = 0 and c = 1, for all l = 1 , . . . , p , we initialize B (0 , k + 1 , l ) , . . . , B (0 , k + q, l ) = 0 and calculate B (1 , k + 1 , l ) = k +1 (cid:88) i =1 X i,l , . . . , B (1 , k + q, l ) = k + q (cid:88) i =1 X i,l . Then we can recursively calculate B ( c, i, l ) for all c = 0 , . . . , q and i ≤ k + q based onEquation 2.1.2. Update from B ( c, k, l ) to B ( c, k + 1 , l ): We let B (0 , k + 1 , l ) = B (0 , k, l ) + X k +1 ,l and obtain the result for other B ( c, k + 1 , l ) ( c ≤ q ) using Equation 2.1.3. Update from M ( c, m, k, l ) to M ( c, m +1 , k, l ): Fix index k , for any n +1 ≤ m ≤ k − q , l = 1 , . . . , p , we let M (0 , m, k, l ) = 0 and calculate M (1 , m, k, l ) = k (cid:88) i = m X i,l . All other M ( c, m, k, l ) where c ≤ q and n +1 ≤ m ≤ k − q , can be obtained via Equation2.2. Construct the test statistic T n,q ( k + 1) using B ( c, k, l )’s and M ( c, m, k, l )’s andrepeat from step 2.The time complexity of the recursive formulation is O ( n pq ) with space complexity O ( npq ). L q -norm-based statistics In this section, we propose to combine multiple L q statistics to detect both dense and sparsealternatives. More specifically, based on theoretical results in Zhang et al. (2020), the U-process for different q ’s are asymptotically independent, which implies that { T n,q } nTk = n + q +1 aresymptotically independent for q ∈ N . Therefore, max nTk = n + q +1 T n,q ( k ) are asymptoticallyindependent for q ∈ I , where I ⊂ N , say I = { , } . Thus we can combine the monitoringprocedure for different q ’s and adjust for the asymptotic size. In general, if we want tocombine a set of q ∈ I , we can adjust the size of each individual test to be 1 − (1 − α ) / | I | given the asymptotic independence and reject the null if any of the monitoring statisticsexceeds its critical value. In Zhang et al.(2020) they provide a discussion on the poweranalysis for the identity covariance matrix case, and showed that the adaptive test enjoysgood overall power.In practice, there is this issue of which q to use. Based on the recommendation in Zhanget al. (2020), we set q = 6. As mentioned in the latter paper, using larger q leads to moretrimming and more computational cost. As we demonstrate in the simulations, using q = 6and combining with q = 2 show a very promising performance; see Section 4 for more details. 3. Ratio-consistent estimator for (cid:107) Σ (cid:107) qq Notice that the test statistic T n ( k ) requires a ratio-consistent estimator for (cid:107) Σ (cid:107) qq . For ex-ample, when q = 2, this can be simplified to (cid:107) Σ (cid:107) F . The ratio-consistent estimator for (cid:107) Σ (cid:107) F has been proposed in Chen and Qin (2010), but it seems difficult to generalize to (cid:107) Σ (cid:107) qq .In this section, we introduce a new class of ratio-consistent estimator for (cid:107) Σ (cid:107) qq based onU-statistics. We first show the result when q = 2 and generalize it to q ∈ N later.Assume { X t } nt =1 ∈ R p are i.i.d. random vectors with mean µ and variance Σ. Define (cid:91) (cid:107) Σ (cid:107) F = 14 (cid:0) n (cid:1) (cid:88) ≤ j Under Assumption 1 and Cum(4) ≤ C (cid:107) Σ (cid:107) F in Assumption 2, (cid:91) (cid:107) Σ (cid:107) F is a ratio-consistent estimator of (cid:107) Σ (cid:107) F . i.e. (cid:91) (cid:107) Σ (cid:107) F / (cid:107) Σ (cid:107) F p −→ . Now we extend this idea to general q ∈ N . We let (cid:91) (cid:107) Σ (cid:107) qq = 12 q (cid:0) n q (cid:1) p (cid:88) l ,l =1 (cid:88) ≤ i < ···
Assumption 5. We assume that1. there exists c > i =1 ,...,p Σ i,i > c ;2. there exists C > r > h = 2 , , ≤ l ≤ · · · ≤ l h ≤ p , | cum ( X ,l , ..., X ,l h ) | ≤ C (1 ∨ ( l h − l )) − r . Notice that Assumption 5(2) is required for the ratio consistency, which is weaker thanAssumption 4. The Assumptions 1-5 required for our theory do not state the explicit rela-tionship between n and p . For example, when Σ = I p , which means there is no cross-sectionaldependence, all the assumptions are satisfied and ( n, p ) can go to infinity freely without anyrestrictions. When there is cross-sectional dependence, our assumptions may implicitly re-strict the relative scale of n and p . In general, larger p is a blessing in our setting and itmakes the asymptotic approximation more accurate and larger n is always preferred owingto large sample approximation. On the other hand, the computational cost increase whenboth the dimension and the sample size get large. Theorem 6. Under Assumption 5, (cid:91) (cid:107) Σ (cid:107) qq is a ratio-consistent estimator of (cid:107) Σ (cid:107) qq , i.e., (cid:91) (cid:107) Σ (cid:107) qq / (cid:107) Σ (cid:107) qq p −→ . It is worth noting that implementing the above estimator may be time-consuming forlarge q . In practice, we can always take a random sample of all possible indices and form anincomplete U-statistic to approximate. The consistency of incomplete U-statistic can alsobe established but not pursued for simplicity. 4. Simulation Results We compare the monitoring procedures for q = 2 , q = 6 and q = (2 , 6) combined. Weconsider ( n, p ) = (100 , 50) with T = 2, where the observations X i ∼ N ( µ i , Σ) are generatedndependently over time. We consider four possible choices of Σ,Σ ij = ρ | i − j | for ρ = 0 , . , . , . , to evaluate the performance of the monitoring scheme for independent-components settingor under weak and strong dependence between components. All tests have nominal size α = 0 . H , there is no change point, µ i = 0 for all i . For the alternative, weconsider µ i = (cid:112) δ/r ( r , p − r ) for i = ( (cid:98) . n (cid:99) + 1) , . . . , nT . Under the dense alternative, weset ( δ, r ) = (1 , p ) , (2 , p ). Under the sparse alternative, we set ( δ, r ) = (1 , , (1 , n independent Gaussiansample from N ( , I p × p ) and calculate Mei and LZM monitoring statistics. We empiricallydetermine the critical value such that the empirical rejection rate is 10% based on 2500simulated datasets. For Mei’s methods, we need to specify the distribution after the changepoint, which we set it to be the distribution under the alternative ( δ, r ) = (1 , p ). For LZM’smethod, we use the same setting in Liu et al. (2019) and set b = log(10) , ρ = 0 . , t = 4 and s = 1.Table 2 shows the size of the monitoring procedure for the benchmark methods and theproposed methods for three different boundary functions T1, T2, T3 introduced in Section2.1 under different correlation coefficients ρ . Notice that the size is noticeably worse for ρ = 0 . 8. This is partially due to the poor performance in the ratio-consistent estimator sinceits variance increases as the cross-sectional dependence increases. Also, please note that thesize seems to go in different directions for q = 2 and q = 6 as the correlation increases. Thecombined test, on the other hand, balances out such distortions. To make sure this is only afinite sample behavior, we increase ( n, p ) from (100 , 50) to (200 , L q based tests.Table 3 provides the power result (left column) and ADT (right column) for differenttests under dense alternatives. As expected, the L -based test demonstrates higher powercompared to that of the L -based test. The power of the combined test falls in betweenand is closer to the power of L -based test. As the correlation increases, the powers of alltests decrease due to reduced signal. Among three different boundary functions, T2 seemsto be the one with the shortest average delay time (ADT) with a slight sacrifice in power.Mei’s method is only better than the L based test when there is no strong cross-sectionaldependence, and is generally worse than all other methods and have relatively longer delayable 2: Size of different monitoring proceduresT1 T2 T3 α = 0 . L L Comb L L Comb L L Comb ρ = 0 0.094 0.105 0.086 0.048 0.067 0.093 0.045 0.071 0.097 0.045 0.070 ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . 8, Mei’s method lost the power completely. LZM in general has the slightly shorterdetection delay but at the cost of a much lower power compared with L based test and thecombined test. This means the LZM is quicker in signaling an alarm when a change pointis detected. Although LZM showed good power for the strong cross-sectional dependencecase compared with the combined test, it comes at the price of much distorted size. This isrelated to the fact that LZM assumes all data streams are independent.Table 4 provides power of different tests under sparse alternatives. The L -based test andthe combined test are very comparable in power and L -based test exhibits inferior power inmost settings as expected. One interesting observation is that for the case ( δ, r ) = (1 , L -based test still shows slightly higher power than the L -based test when ρ = 0 . 2, whichmeans that for this particular setting, the change is not “sparse” enough. As the correlationincreases, we observe a noticeable drop in power, which is similar to the dense alternativesetting and is again attributed to the reduced signal size. Among three different boundaryfunctions, T2 still has shortest average delay time with a slight power loss compared to othertwo boundary functions. Mei’s method has worse power since it is designed for the densesignals and the distribution under the alternative is misspecified. By comparison, LZM givesconsistently good power and short delay time across all settings. However, the good powerunder strong cross-sectional dependence is still offset by the severe size distortion under thenull.Apart from evaluating the size and power of the monitoring procedure, we also comparethe computational cost of the recursive formulation versus the brute force approach. Forthe case of ( n, p ) = (100 , H , and the average run-time of the brute forceapproach is 13.07 times of that for the recursive algorithm under the alternative. The codesare implemented in R. This demonstrates the substantial efficiency gain from the recursivecomputational algorithm.able 3: Power under dense alternativesPower Mei LZM L L Comb α = 0 . δ, r ) power ADT power ADT w ( t ) power ADT power ADT power ADT ρ = 0 . , p ) 0.852 72.9 0.628 38.0 T1 0.958 51.9 0.295 64.6 0.926 55.0T2 0.951 44.3 0.284 63.0 0.921 47.7T3 0.953 46.8 0.286 63.4 0.921 50.2(2 , p ) 0.999 69.3 1.000 15.1 T1 1.000 27.5 0.919 56.2 1.000 29.5T2 1.000 20.4 0.919 54.3 1.000 21.9T3 1.000 22.9 0.920 54.9 1.000 24.7 ρ = 0 . , p ) 0.740 73.3 0.675 38.2 T1 0.935 51.8 0.302 64.4 0.907 54.9T2 0.930 44.2 0.291 62.9 0.906 47.7T3 0.933 46.7 0.294 63.5 0.903 50.3(2 , p ) 1.000 69.9 1.000 15.6 T1 1.000 28.0 0.884 56.6 1.000 30.0T2 1.000 20.8 0.884 54.8 1.000 22.3T3 1.000 23.4 0.883 55.3 1.000 25.2 ρ = 0 . , p ) 0.243 74.1 0.715 34.3 T1 0.844 52.9 0.274 63.3 0.796 55.8T2 0.843 45.2 0.267 61.5 0.787 47.9T3 0.847 47.9 0.267 62.0 0.792 50.7(2 , p ) 0.932 72.2 1.000 15.7 T1 1.000 30.7 0.864 55.9 1.000 33.0T2 1.000 23.1 0.861 54.2 1.000 24.8T3 1.000 25.7 0.861 54.8 1.000 27.8 ρ = 0 . , p ) 0.000 NA 0.803 29.0 T1 0.632 54.6 0.165 62.5 0.560 56.8T2 0.637 46.4 0.162 60.9 0.575 48.6T3 0.642 49.4 0.162 61.4 0.568 51.8(2 , p ) 0.001 74.0 0.997 16.1 T1 0.990 38.3 0.666 56.0 0.984 40.8T2 0.990 30.1 0.663 54.2 0.983 32.1T3 0.990 32.7 0.666 54.9 0.983 35.4able 4: Power under sparse alternativesPower Mei LZM L L Comb α = 0 . δ, r ) power ADT power ADT w ( t ) power ADT power ADT power ADT ρ = 0 . , 3) 0.422 74.0 0.990 27.4 T1 0.976 51.5 0.999 37.8 0.999 40.5T2 0.967 43.8 0.999 35.9 0.999 38.0T3 0.972 46.4 0.999 36.6 0.999 39.0(1 , 1) 0.400 73.9 1.000 23.4 T1 0.961 51.5 0.951 51.0 0.976 52.2T2 0.958 44.1 0.953 49.5 0.974 46.3T3 0.959 46.4 0.953 50.0 0.976 48.7 ρ = 0 . , 3) 0.274 74.1 0.990 29.1 T1 0.946 52.2 0.937 51.6 0.955 52.6T2 0.939 44.6 0.935 50.0 0.955 47.1T3 0.943 47.1 0.936 50.5 0.954 49.2(1 , 1) 0.268 74.1 1.000 23.9 T1 0.961 52.6 0.998 37.3 0.999 40.2T2 0.951 45.3 0.998 35.4 0.999 37.7T3 0.957 47.6 0.998 36.0 0.999 38.6 ρ = 0 . , 3) 0.048 74.5 0.972 28.2 T1 0.871 54.7 0.881 51.5 0.887 53.4T2 0.856 47.1 0.878 49.8 0.884 48.7T3 0.860 49.9 0.880 50.4 0.886 50.6(1 , 1) 0.036 74.3 1.000 23.2 T1 0.880 55.9 0.997 38.0 0.997 40.7T2 0.871 49.1 0.997 36.1 0.997 38.2T3 0.879 51.2 0.997 36.8 0.997 39.2 ρ = 0 . , 3) 0.000 NA 0.971 24.7 T1 0.621 58.9 0.800 52.9 0.808 55.3T2 0.610 50.6 0.801 51.3 0.802 51.5T3 0.614 53.7 0.803 51.9 0.807 53.2(1 , 1) 0.000 NA 1.000 21.5 T1 0.602 61.1 0.998 38.8 0.997 41.7T2 0.588 53.6 0.998 36.8 0.997 39.3T3 0.601 56.8 0.998 37.5 0.997 40.2 CCEPTED MANUSCRIPT F o r c e Figure 1: (a)(left panel): A forging machine with four strain gages. (b)(right panel): A sample offour-channel profiles. 30 ACCEPTED MANUSCRIPT D o w n l o a d e d by [ G e o r g i a T ec h L i b r a r y ] a t : J u l y (a) The setup of the forging machine Crank Angle (Degree) 100 200 300 400 500 C hanne l abnormalnormal Crank Angle (Degree) 100 200 300 400 500 C hanne l Crank Angle (Degree) 100 200 300 400 500 C hanne l Crank Angle (Degree) 100 200 300 400 500 C hanne l (b) Data collected in the forging machine Figure 1: Forging machine setup and the collected tonnage dataset 5. Data Illustration5.1 Tonnage dataset We first propose to apply the proposed methodology to monitor the multi-channel tonnageprofile collected in a forging process in (Lei et al., 2010), where four different strain gaugesensors are mounted at each column of the forging machine, measuring the exerted force ofthe press. The setup of the process is shown in Figure 1a. The four strain gauge sensorsrepresent the signature of the product and are used for process monitoring and changedetection in Lei et al. (2010). For example, 10 examples of the signals before the changesand after the changes are shown in Figure 1b. As mentioned by Lei et al. (2010); Yan et al.(2018), a missing part only affects a small region of the signals, which makes it very hard todetect, as shown in Figure 1b.We select a subset of the data with n = 200, where the first 130 observations are fromthe normal tonnage sample, and the samples after 130 are abnormal. We project the datato 20-dimensional space by training an anomaly basis on a holdout sample as has been donein Yan et al. (2018). The first 100 observations are treated as a Phase I stage without anychanges and we learn the 2-norm and q -norm of the covariance matrix from them. Themonitoring scheme started at observation 107 (trimming due to q = 6). The L -basedtest stopped at time 137, and estimated the possible change point location at time 128 byperforming a retrospective test at time 137. The L based test signaled slightly earlier attime 135 and also estimated the change at 128. The combined test signaled an alarm at time135 with the same estimated location. The trajectory of the L and the L test statisticsare shown in Figure 2a and 2b, respectively. Notice that, when we set the size of individualtest to be α ∗ = (1 − . / = 5 . α = 1 − α ∗ = 0 . 110 115 120 125 130 135 Plot of L −based test statistic Time L L −based stats90% threshold94.87% threshold (a) L based test for tonnage data 110 115 120 125 130 135 Plot of L −based test statistic Time L L −based stats90% threshold94.87% threshold (b) L based test for tonnage data Figure 2: Testing Statistics for tonnage data We then consider the process monitoring in a steel rolling manufacturing process. Surfacedefects such as seam defects can result in stress concentration on the bulk and may causefailures if the steel bar is used in the product. However, the rolling process is a high-speedprocess with the rolling bar moving around 200 miles per hour, providing real-time onlineanomaly detection for the high-speed rolling process is very important to prevent furtherthe product damage.The dataset is collected in such high-speed rolling process. Here, we have selected asegment near the end of the rolling bar, which contains 100 images of the rolling process. Toremove the trend, we have applied a smooth background remover and further downsamplethe image to only 16 × 64 pixels. An example of the normal image and the abnormal imageare shown in Figure 3a and 3b, respectively.We treated the first 50 observations as training set and obtained ratio-consistent esti-mators (cid:91) (cid:107) Σ (cid:107) qq . After performing the change point monitoring procedure, the L -norm-basedtest signals an alarm at the time 97 and estimated that the possible change point locationis at time 89 based on the retrospective test. On the other hand, the L based test fails todetect the change within the finite time horizon. The combined test also signals the alarmat time 97. We present the rolling image at time 91 in Figure 3b. This shows that afterdownsampling, the change is still quite sparse. The adaptive monitoring procedure is stillpowerful as long as one test has power. We also present the trajectory of the test statisticat each time point in Figure 4a and 4b. Notice that there is a downshift in the L -basedmonitoring statistic right after the estimated change. This is due to the fact that the signalis very sparse, and the construction of our proposed statistic may admit negative values fora short period of time. The negative values here should not be a major concern as the test a) Normal rolling image (b) Abnormal rolling image Figure 3: Examples of the rolling imagesstatistic should admit positive values in probability under the alternatives. We confirmedthis by adding an artificial dense change in the data. On the other hand, L -based moni-toring statistics detect the change efficiently due to the ability to capture the sparse changein the system. 6. Summary and Conclusion In this article, we propose a new methodology to monitor a mean shift in temporallyindependent high-dimensional observations. Our change point monitoring method targetsat the L q -norm of the mean change for q = 2 , , , · · · . By combining the monitoringstatistics for different values of q ∈ N , the adaptive procedure achieves overall satisfactorypower against both sparse and dense changes in the mean. This can be very desirablefrom a practitioner’s viewpoint as often we do not have the knowledge about the type ofalternatives. Compared with the recently developed methods for monitoring large-scaledata streams [e.g., Mei (2010), Xie and Siegmund (2013), Liu et al. (2019)], our method isfully nonparametric and does not require strong distributional assumptions. Furthermore,our method allows for certain cross-sectional dependence across data streams, which couldnaturally arise in many applications.To conclude, we mention a few interesting future directions. Firstly, our focus in thispaper is on the mean change, and it is natural to ask whether the method can be extendedto monitor a change in the covariance matrix. Secondly, many streaming data have weakdependence over time due to its sequential nature, and how to accommodate weak temporaldependence will be of interest. Thirdly, in the current implementation, the ratio-consistentestimators are learned from the training data and do not change as more observations areavailable. In practice, if the monitoring scheme runs for a long time without signaling analarm, it might be helpful to periodically update ratio-consistent estimators to gain efficiency, − − − Plot of L −based test statistic Time L L −based stats90% threshold94.87% threshold (a) L based test for rolling data 60 70 80 90 100 Plot of L −based test statistic Time L L −based stats90% threshold94.87% threshold (b) L based test for rolling data Figure 4: Examples of the rolling imagesespecially when the initial training sample is short. However, it may be impractical to updatethis estimator for each k since there seems no easy recursive way to update this estimatorand the associated computational cost is high. The user might need to determine howoften to update it based on the actual computational resources. Fourthly, even thoughthe proposed algorithm can detect the sparse change, in many applications, it is also animportant problem to identify which individual data stream actually experiences a change,which will be left for future research. Supplementary Materials The supplementary materials contains technical proofs for the theoretical results as well asadditional simulation results. Proof of Theorem 1. We can directly apply the results shown in Wang et al. (2019) for thepartial sum process S n ( a, b ) = (cid:98) nb (cid:99)− (cid:88) i = (cid:98) na (cid:99) +1 i (cid:88) j = (cid:98) na (cid:99) +1 X Ti +1 X j . The partial sum process (cid:110) √ n || Σ || F S n ( a, b ) (cid:111) ( a,b ) ∈ [0 ,T ] (cid:32) Q in l ∞ ([0 , T ] )here Q is a Gaussian process whose covariance structure is the following Cov ( Q ( a , b ) , ( a , b )) = (cid:40) (min( b , b ) − max( a , a )) if max( a , a ) ≤ min( b , b )0 otherwise The test statistic is a continuous transformation of the Gaussian process and the resultsstated follows. Proof of Theorem 2. We now analyze the power of the first proposed test. Suppose thechange point is at k ∗ , where k ∗ /n → r for some constant r ∈ (1 , T ). This assures that thechange point does not occur extremely early or late in the monitoring period. Under thealternative hypothesis, define a new sequence of random vectors Y i , Y i = (cid:40) X i i = 1 , . . . , k ∗ X i − ∆ i = k ∗ + 1 , . . . , n . This sequence does not have a change point. Without loss of generosity, assume Y i ’s arecentered.Suppose that n ∆ T ∆ || Σ || F → b ∈ [0 + ∞ ) . When m < k < k ∗ , G k ( m ) statistic will not be affected. It suffices to consider the case m < k ∗ < k and k ∗ < m < k . Following the decomposition in Wang et al. (2019), under thefixed alternative when k ∗ > m , G k ( m ) = G Yk ( m ) + ( k − k ∗ )( k − k ∗ − m ( m − || ∆ || − k − k ∗ )( k − m − m − m (cid:88) j =1 Y Tj ∆ − m − m − k − k ∗ ) k ∗ (cid:88) j = m +1 Y Tj ∆ .G Yn ( m ) is the statistic calculated for the Y i sequence. Let s n ( k ) = (cid:80) kj =1 Y Tj ∆. Thensup ≤ l ≤ k ≤ nT | k (cid:88) j = l Y Tj ∆ | ≤ ≤ k ≤ nT | s n ( k ) | = O p ( n / (∆ T Σ∆) / ) . The last part is obtained by Kolmogorov’s inequality. This implies that when k ∗ > m ,1 n (cid:107) Σ (cid:107) F G k ( m ) = 1 n (cid:107) Σ (cid:107) F G Yk ( m ) + ( k − k ∗ )( k − k ∗ − m ( m − n || ∆ || || Σ || F + O p ( n / (∆ T Σ∆) / || Σ || F ) . imilarly, we can show when k ∗ > m n (cid:107) Σ (cid:107) F G k ( m ) = 1 n (cid:107) Σ (cid:107) F G Yk ( m )+ k ∗ ( k ∗ − k − m )( k − m − n || ∆ || || Σ || F + O p ( n / (∆ T Σ∆) / || Σ || F ) . The last part is converging to 0 in probability. Therefore, the test statistic T n can be viewedas an extension to the original process. The second terms are also a process depend on m and k ∗ . Under the fixed alternative, the G k ( m ) converge to the process1 n (cid:107) Σ (cid:107) F { G (cid:98) nt (cid:99) ( (cid:98) ns (cid:99) ) } s ∈ [0 , → G ( s, t ) + b Λ( s, t ) , where Λ( s, t ) = ( t − r ) s s ≤ rr ( t − s ) s > r otherwise . This implies that, when b = 0, the process is the same with the null process, and the proposedmonitoring scheme will have trivial power. When the b is not zero, since the remainder termis positive, we will have non -trivial power.When n ∆ T ∆ || Σ || F → ∞ . Following above decomposition, we havemax k T n ( k ) ≥ T n ( k ∗ ) = 1 n (cid:107) Σ (cid:107) F D YnT ( k ∗ ) + O ( n || ∆ || || Σ || F ) → ∞ Since the first term is pivotal and is bounded in probability, the test have power convergingto 1. Proof of Theorem 3. We can directly apply the results in Theorem 2.1 and 2.2 in Zhang etal.(2020), which stated that for S n,q,c ( r ; [ a, b ]) = p (cid:88) l =1 ∗ (cid:88) (cid:98) na (cid:99) +1 ≤ i ,...,i c ≤(cid:98) nr (cid:99) ∗ (cid:88) (cid:98) nr (cid:99) +1 ≤ j ,...,j q − c ≤(cid:98) nb (cid:99) c (cid:89) t =1 X i t ,l q − c (cid:89) g =1 X j g ,l , we have 1 (cid:112) n q (cid:107) Σ (cid:107) qq S n,q,c ( r ; [ a, b ]) (cid:32) Q q,c ( r ; [ a, b ]) , where Q q,c is the Gaussian process stated in Theorem 4. The monitoring statistic is acontinuous transformation of process S n,q,c ’s and the asymptotic result follows. roof of Theorem 4. We first discuss the case when n q/ (cid:107) ∆ (cid:107) qq (cid:107) Σ (cid:107) q/ q → γ ∈ [0 , + ∞ ) and the truechange point is at location k ∗ = (cid:98) nr (cid:99) . Here we adopt the process convergence results inTheorem 2.3 of Zhang et al.(2020), which stated that for ( k, m ) = ( (cid:98) ns (cid:99) , (cid:98) nt (cid:99) ),1 (cid:112) n q (cid:107) Σ (cid:107) qq D n,q ( s ; [0 , b ]) = 1 (cid:112) n q (cid:107) Σ (cid:107) qq p (cid:88) l =1 ∗ (cid:88) ≤ i ,...,i q ≤ k ∗ (cid:88) k +1 ≤ j ,...,j q ≤ m ( X i ,l − X j ,l ) · · · ( X i q ,l − X j q ,l ) , (cid:32) G q ( s, t ) + γJ q ( s ; [0 , t ])where J q ( s ; [0 , t ]) = r q ( t − s ) q r ≤ s < ts q ( t − r ) q s ≤ r < t otherwise Therefore, by continuous mapping theorem, when γ ∈ [0 , + ∞ ), the results in the theoremhold.For the case n q/ (cid:107) ∆ (cid:107) qq (cid:107) Σ (cid:107) q/ q → + ∞ max k T n,q ( k ) ≥ T n,q ( k ∗ ) = 1 n (cid:107) Σ (cid:107) F D YnT ( k ∗ ) + C n q/ (cid:107) ∆ (cid:107) qq (cid:107) Σ (cid:107) q/ q → ∞ Proof of Theorem 5. By straightforward calculation, we have (cid:91) (cid:107) Σ (cid:107) F = 14 (cid:0) n (cid:1) (cid:88) ≤ j 4. By similar arguments, it is obvious to see that E [ I n,i / (cid:107) Σ (cid:107) F ] = 1 / i = 1 , , , 4, and E [ I n,i / (cid:107) Σ (cid:107) F ] = 0 for i = 5 , ..., I n,i / (cid:107) Σ (cid:107) F → p i =1 , , , 4, and I n,i / (cid:107) Σ (cid:107) F → p 0, for i = 5 , ..., 10. Since some of the I n,i share very similarstructures, we will only present the proof for (1) 4 I n, / (cid:107) Σ (cid:107) F → p I n, / (cid:107) Σ (cid:107) F → p E [16 I n, / (cid:107) Σ (cid:107) F ] → 1. To see this, E [16 I n, / (cid:107) Σ (cid:107) F ] = 1 (cid:0) n (cid:1) (cid:107) Σ (cid:107) F (cid:88) ≤ j 1, and (1) is proved. By similar arguments, 4 I n,i / (cid:107) Σ (cid:107) F → p i = 2 , , E [ I n, / (cid:107) Σ (cid:107) F ] → 0. To see this, E [ I n, / (cid:107) Σ (cid:107) F ] = 14 (cid:0) n (cid:1) (cid:107) Σ (cid:107) F (cid:88) ≤ j 0. And by similar arguments we can prove that I n,i / (cid:107) Σ (cid:107) F → p i = 6 , ..., 10. Combine the above results, we have (cid:91) (cid:107) Σ (cid:107) F / (cid:107) Σ (cid:107) F → p 1. This completesthe proof. Proof of Theorem 6. We can rewrite (cid:91) (cid:107) Σ (cid:107) qq as (cid:91) (cid:107) Σ (cid:107) qq = 12 q (cid:0) n q (cid:1) p (cid:88) l ,l =1 (cid:88) ≤ i < ···
1. As we observe thatmost of terms are structurally very similar, we shall only present the proof for I ( i , ..., i q ) → p E [ (cid:80) pl ,l =1 (cid:81) qk =1 X t k ,l X t k ,l / (cid:107) Σ (cid:107) qq ] = 1. This indicates that toshow (1), it suffices to show that E [ I ( i , ..., i q ) ] → 1. To show this, E [ I ( i , ..., i q ) ] = 1 (cid:0) n q (cid:1) (cid:107) Σ (cid:107) qq (cid:88) ≤ i < ···
0. Specifically, E [ J ( t , s , ..., t q , s q ) ] = 1 (cid:0) n q (cid:1) (cid:107) Σ (cid:107) qq (cid:88) ≤ i < ···
This is because the cumulant of a single randomvariable with mean zero is also zero.3. Every B ∈ π must contain only one distinct temporal index. Otherwise (cid:81) B ∈ π cum ( X i,l : ( i, l ) ∈ B ) = 0.The above properties imply that for ∀ π ∈ S c and ∀ B ∈ π , cum ( X i,l : ( i, l ) ∈ B ) has to be oneof the followings: cum ( X ,l , X ,l , X ,l , X ,l ), cum ( X ,l , X ,l , X ,l ), cum ( X ,l , X ,l , X ,l ), cum ( X ,l , X ,l , X ,l ), cum ( X ,l , X ,l , X ,l ), Σ l ,l , Σ l ,l , Σ l ,l , Σ l ,l , Σ l ,l , Σ l ,l .If we assume l ≤ l ≤ l ≤ l , it can be shown that (cid:89) B ∈ π cum ( X i,l : ( i, l ) ∈ B ) ≤ C (1 ∨ ( l − l )) − r (1 ∨ ( l − l )) − r , (6.5)for some generic positive constant C and any partition π . To see this, we notice that atleast one k = 1 , ..., q , say k , such that t k (cid:54) = s k and t (cid:48) k (cid:54) = s (cid:48) k . For every π ∈ S c thereexists B , B ∈ π such that ( t k , l ) ∈ B and ( s (cid:48) k , l ) ∈ B . Based on the third propertyabove, all other elements in B must have the same temporal index as t k . And because ofthe first property above, all i k , j k for k (cid:54) = k and s k are different from t k . This implies thatthe spatial indices for all other elements in B have to be either l or l , not l and l . Forthe same reason, the spatial indices for all other elements in B can only be either l or l .Therefore, cum ( X i,l : ( i, l ) ∈ B ) ∈ { cum ( X ,l , X ,l , X ,l ) , Σ l ,l , Σ l ,l } , and cum ( X i,l : ( i, l ) ∈ B ) ∈ { cum ( X ,l , X ,l , X ,l ) , Σ l ,l , Σ l ,l } . Under Assumption (5.2), cum ( X i,l : ( i, l ) ∈ B ) ≤ C (1 ∨ ( l − l )) − r and cum ( X i,l : ( i, l ) ∈ B ) ≤ C (1 ∨ ( l − l )) − r . And the joint cumulants are uniformly bounded above for those B ∈ π \ { B , B } . Thus Equation 6.5 is proved.Furthermore, define Ind ( t , s , ..., t q , s q , t (cid:48) , s (cid:48) , ..., t (cid:48) q , s (cid:48) q ) as the indicator function corre-sponding to the event that for every k = 1 , , .., q that t k (cid:54) = s k , there exists k (cid:48) = 1 , ..., q suchthat t k = t (cid:48) k (cid:48) or t k = s (cid:48) k (cid:48) , then E (cid:2)(cid:81) pk =1 X t k ,l X s k ,l X t (cid:48) k ,l X s (cid:48) k ,l (cid:3) (cid:54) = 0 only if Ind ( t , s , ..., t q , s q , t (cid:48) , s (cid:48) , ..., t (cid:48) q , s (cid:48) q ) = 1 . t is also easy to see that (cid:88) ≤ i < ···
Xiaofeng Shao acknowleges partial support from NSF grants DMS-1807032 and DMS-2014018.Hao Yan acknowleges partial support from NSF grants DMS-1830363 and CMMI-1922739.EFERENCES References Aminikhanghahi, S. and D. J. Cook (2017). A survey of methods for time series change point detection. Knowledgeand Information Systems 51 (2), 339–367.Aue, A., S. H¨ormann, L. Horv´ath, M. Huˇskov´a, and J. G. Steinebach (2012). Sequential testing for the stability ofhigh-frequency portfolio betas. Econometric Theory 28 (4), 804–837.Aue, A. and L. Horv´ath (2013). Structural breaks in time series. Journal of Time Series Analysis 34 (1), 1–16.Brown, R. L., J. Durbin, and J. M. Evans (1975). Techniques for testing the constancy of regression relationshipsover time. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 37 (2), 149–163.B¨ucher, A., J.-D. Fermanian, and I. Kojadinovic (2019). Combining cumulative sum change-point detection tests forassessing the stationarity of univariate time series. Journal of Time Series Analysis 40 (1), 124–150.Chen, S. and Y. Qin (2010). A two sample test for high dimensional data with application to gene-set testing. TheAnnals of Statistics 38 , 808–835.Cho, H. and P. Fryzlewicz (2015). Multiple-change-point detection for high dimensional time series via sparsifiedbinary segmentation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77 (2), 475–507.Chu, C.-S. J., M. Stinchcombe, and H. White (1996). Monitoring structural change. Econometrica: Journal of theEconometric Society 64 , 1045–1065.Dette, H. and J. G¨osmann (2019). A likelihood ratio approach to sequential change point detection for a generalclass of parameters. Journal of the American Statistical Association , 1–17.Enikeeva, F. and Z. Harchaoui (2019). High-dimensional change-point detection under sparse alternatives. TheAnnals of Statistics 47 (4), 2051–2079.Fremdt, S. (2015). Page’s sequential procedure for change-point detection in time series regression. Statistics 49 (1),128–155.G¨osmann, J., T. Kley, and H. Dette (2020). A new approach for open-end sequential change point monitoring. Journal of Time Series Analysis, to appear .He, Y., G. Xu, C. Wu, and W. Pan (2018). Asymptotically independent u-statistics in high-dimensional testing. arXiv preprint arXiv:1809.00411 .Horv´ath, L. and M. Huˇskov´a (2012). Change-point detection in panel data. Journal of Time Series Analysis 33 (4),631–648.Horv´ath, L., M. Huˇskov´a, P. Kokoszka, and J. Steinebach (2004). Monitoring changes in linear models. Journal ofStatistical Planning and Inference 126 (1), 225–251.Jirak, M. (2015). Uniform change point tests in high dimension. The Annals of Statistics 43 (6), 2451–2483.Kirch, C., B. Muhsal, and H. Ombao (2015). Detection of changes in multivariate time series with application to eegdata. Journal of the American Statistical Association 110 (511), 1197–1216.Lai, T. L. (1995). Sequential changepoint detection in quality control and dynamical systems. Journal of RoyalStatistical Society, Series B (Statistical Methodology) 57 , 613–658.Lai, T. L. (2001). Sequential analysis: Some classical problems and new challenges. Statistica Sinica 11 , 303–408.Lei, Y., Z. Zhang, and J. Jin (2010). Automatic tonnage monitoring for missing part detection in multi-operationforging processes. Journal of Manufacturing Science and Engineering 132 (5).L´evy-Leduc, C. and F. Roueff (2009). Detection and localization of change-points in high-dimensional network traffic EFERENCES data. The Annals of Applied Statistics 3 (2), 637–662.Li, J. (2020). Efficient global monitoring statistics for high-dimensional data. Quality Reliability Engineering Inter-national 36 , 18–32.Liu, K., R. Zhang, and Y. Mei (2019). Scalable sum-shrinkage schemes for distributed monitoring large-scale datastreams. Statistica Sinica 29 , 1–22.Lorden, G. (1971). Procedures for reacting to a change in distribution. Annals of Mathematical Statistics 42 ,1897–1908.MacNeill, I. B. (1974). Tests for change of parameter at unknown times and distributions of some related functionalson brownian motion. The Annals of Statistics 2 (5), 950–962.Matteson, D. S. and N. A. James (2014). A nonparametric approach for multiple change point analysis of multivariatedata. Journal of the American Statistical Association 109 (505), 334–345.Mei, Y. (2010). Efficient scalable schemes for monitoring a large number of data streams. Biometrika 97 (2), 419–433.Page, E. S. (1954). Continuous inspection schemes. Biometrika 41 (1/2), 100–115.Perron, P. (2006). Dealing with structural breaks. Palgrave Handbook of Econometrics Vol. 1: Econometric Theory,K. Patterson and T.C. Mills (eds.), Palgrave Macmillan , 278–352.Polunchenko, A. S. and A. G. Tartakovsky (2012). State-of-the-art in sequential change-point detection. Methodologyand computing in applied probability 14 (3), 649–684.Shao, X. (2010). A self-normalized approach to confidence interval construction in time series. Journal of RoyalStatistical Society, Series B 72 , 343–366.Shao, X. (2015). Self-normalization for time series: a review of recent developments. Journal of the AmericanStatistical Association 110 , 1797–1817.Shao, X. and W. B. Wu (2007). Asymptotic spectral theory for nonlinear time series. The Annals of Statistics 35 (4),1773–1801.Shao, X. and X. Zhang (2010). Testing for change points in time series. Journal of the American StatisticalAssociation 105 , 1228–1240.Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics 16 , 117–186.Wang, R. and X. Shao (2020). Dating the break in high-dimensional data. Available athttps://arxiv.org/pdf/2002.04115.pdf .Wang, R., S. Volgushev, and X. Shao (2019). Inference for change points in high dimensional data. arXiv preprintarXiv:1905.08446 .Wang, T. and R. J. Samworth (2018). High dimensional change point estimation via sparse projection. Journal ofthe Royal Statistical Society: Series B (Statistical Methodology) 80 (1), 57–83.Wang, Y. and Y. Mei (2015). Large-scale multi-stream quickest change detection via shrinkage post-change estima-tion. IEEE Transactions on Information Theory 61 (12), 6926–6938.Wied, D. and P. Galeano (2013). Monitoring correlation change in a sequence of random variables. Journal ofStatistical Planning and Inference 143 (1), 186–196.Wu, W. B. (2005). Nonlinear system theory: Another look at dependence. Proceedings of the National Academy ofSciences 102 (40), 14150–14154.Wu, W. B. and X. Shao (2004). Limit theorems for iterated random functions. Journal of Applied Probability 41 (2),425–436. EFERENCES Xie, Y. and D. Siegmund (2013). Sequential multi-sensor change-point detection. The Annals of Statistics 41 (2),670–692.Yan, H., K. Paynabar, and J. Shi (2014). Image-based process monitoring using low-rank tensor decomposition. IEEE Transactions on Automation Science and Engineering 12 (1), 216–227.Yan, H., K. Paynabar, and J. Shi (2018). Real-time monitoring of high-dimensional functional data streams viaspatio-temporal smooth sparse decomposition. Technometrics 60 (2), 181–197.Yu, M. and X. Chen (2017). Finite sample change point inference and identification for high-dimensional meanvectors. arXiv preprint arXiv:1711.08747 .Yu, M. and X. Chen (2019). A robust bootstrap change point test for high-dimensional location parameter. arXivpreprint arXiv:1904.03372 .Zhang, Y., R. Wang, and X. Shao (2020). Adaptive change point inference for high-dimensional data. Preprint .Zou, C., Z. Wang, X. Zi, and W. Jiang (2015). An efficient online monitoring method for high-dimensional datastreams.