A Fast and Efficient Change-point Detection Framework based on Approximate k-Nearest Neighbor Graphs
By Yi-Wei Liu and Hao Chen
University of California, Davis
Change-point analysis is thriving in this big data era to address problems arising in many fields where massive data sequences are collected to study complicated phenomena over time. It plays an important role in processing these data by segmenting a long sequence into homogeneous parts for follow-up studies. The task requires the method to be able to process large datasets quickly and to deal with various types of changes for high-dimensional and non-Euclidean data. We propose a novel approach making use of approximate k-nearest neighbor information of the observations, and derive an analytic formula to control the type I error. The time complexity of the method is O(dn log n) for an n-length sequence of d-dimensional data. Moreover, we incorporate a useful pattern of data in high dimension so that the proposed method can detect various types of changes in the sequence.
1. Introduction.
With advances in technologies, scientists in many fields are collecting massive data for studying complex phenomena over time and/or space. Such data often involve sequences of high-dimensional or non-Euclidean measurements that cannot be analyzed through traditional approaches. Insights on such data often come from segmentation/change-point analysis, which divides the sequence into homogeneous temporal or spatial segments. These are crucial early steps in understanding the data and in detecting anomalous events. Change-point analysis has been extensively studied for univariate data (see Carlstein, Müller and Siegmund (1994); Chen and Gupta (2011) for a review). However, methods for multivariate data are limited in many ways (Chan et al., 2015; Heard et al., 2010; James, James and Siegmund, 1992; Siegmund, Yakir and Zhang, 2011; Srivastava and Worsley, 1986; Wang et al., 2013; Wang and Samworth, 2018; Xie and Siegmund, 2013; Zhang et al., 2010), and there is little research for non-Euclidean data (Chen and Zhang, 2015; Chu and Chen, 2019; Harchaoui, Moulines and Bach, 2009; Matteson and James, 2014). Many modern applications require effective and fast change-point detection for high-dimensional/non-Euclidean data. The following are some examples.
Network evolution: Network data have become increasingly popular. For example, phone calls, emails, or online chat records can be used to construct networks depicting social interactions among individuals (Eagle, Pentland and Lazer, 2009; Kossinets and Watts, 2006). In biological science, the study of gene-interaction networks is omnipresent (Li et al., 2011; Lu et al., 2011). A crucial part of these studies is identifying how the networks evolve in time. For the above examples, the observation at each time point is a graphical encoding of the underlying network. Researchers might be interested in asking whether there is a change in the network structure over time.
Public transportation demand analysis: The study of public goods allocation is drawing attention as tons of data on public activities can be collected easily nowadays, such as the NYC Taxi and Limousine Commission data. Take transportation as an example: studies are performed to better match the supply with the aggregate demand. Such matching can be achieved by characterizing whether the public demands for certain transportation are distinct in different time intervals. Here, the observation at each time point could be a heatmap illustrating the volume associated with pick-up and drop-off locations.
Neuropixels data processing: As the development of Neuropixels probes enlightens the field of neuroscience, the challenges in analyzing such large-scale and high-density data cannot be over-emphasized. Neuropixels data usually consist of hundreds or thousands of long stretches of neuron spiking activities recorded simultaneously (Chen, Chen and Deng, 2019). Scientists and researchers are striving to extract meaningful information from Neuropixels recordings (Stringer et al., 2019). It is of scientific interest to detect points in time at which the population spiking activity exhibits simultaneous changes, as well as changes that only occur in a subset of the neural population.

In the above examples, the observation can be a network, an image, or a long vector. Let the sequence of observations be {y_t : t = 1, ..., n}, indexed by some meaningful order, such as time or location. Then change-point detection can be formulated as testing the null hypothesis of homogeneity:

H_0: y_t ∼ F_0, t = 1, ..., n,  (1)

against the alternative that there exists a change-point τ:

H_1: ∃ 1 ≤ τ < n, y_t ∼ F_0 for t ≤ τ, and y_t ∼ F_1 otherwise.  (2)

Here, F_0 and F_1 are two different probability measures. In practice, we usually do not know what F_0 and F_1 are, or what distributional family they belong to. This makes the parametric methods proposed for high-dimensional data not applicable. Some non-parametric methods have been proposed, and they fall mainly into three categories:

1. Kernel-based methods (Harchaoui, Moulines and Bach, 2009)
2. Distance-based methods (Matteson and James, 2014)
3. Graph-based methods (Chen and Zhang, 2015; Chu and Chen, 2019)

The kernel-based method depends on the choice of kernel and suffers from increased dimension. In addition, due to the nature of kernel methods, there is currently no effective way of controlling the type I error under the change-point setting. It is extremely time-consuming to apply the method when one would like to control the type I error.
This makes the method not practically useful. The distance-based method was proposed in (Matteson and James, 2014). The authors use all the pairwise distances among observations to find change-points. Computing all pairwise distances takes O(dn²) time for d-dimensional data. Moreover, there is no analytic formula to control the type I error, so it is time-consuming to run the method with proper type I error control (see Table 1). On the other hand, graph-based methods (Chen and Zhang, 2015; Chu and Chen, 2019) provide analytic formulas to control the type I error and are much faster to run.

Existing graph-based methods utilize an undirected graph constructed among the observations. Some common choices are the minimum spanning tree (MST), where all observations are connected with the total distance minimized; the minimum distance pairing (MDP), where the observations are partitioned into n/2 pairs with the total within-pair distance minimized; and their denser counterparts, the k-MST, k-MDP, and undirected k-NN graphs. Take the k-MST for example: it is the union of the 1st, ..., kth MSTs, where the 1st MST is the MST, and the jth MST is a spanning tree connecting all observations such that the sum of the edges in the tree is minimized under the constraint that it does not contain any edge in the 1st, ..., (j−1)th MSTs. Among these choices, the k-MST is preferred as it in general has a higher power than the others (Chen and Zhang, 2015). Nevertheless, it can take a long time to construct the k-MST when n is large. In this work, we propose a new statistic on a directed approximate k-NN graph. As the directed k-NN graph contains more useful information than the undirected k-NN graph, we find the new method to have power on par with the existing method on the k-MST. There are multiple existing algorithms to construct the directed approximate k-NN graph, and the computational cost can easily be achieved at O(dn log n). As the directed graph cannot be handled by the existing graph-based framework, we work out a new framework to handle it.
In particular, we work out the limiting distribution of the new statistic and derive an analytic formula to control the type I error, so that the time complexity of the method, after the directed approximate k-NN graph is obtained, is O(n). Thus, the overall time complexity of the new method is the same as that for obtaining the approximate k-NN graph.
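For illustration, the directed k-NN graph itself can be built by brute force in O(dn²) time; practical implementations replace the inner scan with a k-d tree or another (approximate) nearest-neighbor index to reach roughly O(dn log n). The following is a minimal exact brute-force sketch, not the paper's implementation:

```python
import math

def directed_knn_graph(points, k):
    """Directed k-NN graph: node i points to its k nearest neighbors
    under Euclidean distance. This brute-force version costs O(d n^2);
    practical implementations use k-d trees or approximate nearest-neighbor
    indexes to get close to O(d n log n)."""
    n = len(points)
    edges = []
    for i in range(n):
        # distance from i to every other point, sorted ascending
        dists = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(n) if j != i
        )
        edges.extend((i, j) for _, j in dists[:k])
    return edges  # list of directed edges (i, j): "y_i points to y_j"
```

Swapping the sorted scan for a spatial index changes only the neighbor lookup; the downstream change-point machinery is unaffected.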
2. Main results.
First, we compare the computational cost of our method to the other three state-of-the-art methods: the graph-based method with the max-type statistic on the 5-MST (the recommended setting in (Chu and Chen, 2019)), the distance-based method (ecp) (Matteson and James, 2014), and the kernel-based method (Harchaoui, Moulines and Bach, 2009). For our method, we use the k-d tree algorithm (Beygelzimer et al., 2019) to approximate the directed 5-NN graph. We use the 5-NN graph as it contains a similar number of edges to the 5-MST, to make the comparison with the existing graph-based method fair. The observations are generated i.i.d. from a multivariate t-distribution with dimension d = 100. The results are presented in Table 1. We also check the empirical size of these methods (Table 2). From these tables, we can see that the kernel-based method is slow to run and cannot control the type I error. The other three methods can control the type I error well. The graph-based methods are much faster than ecp, and the new method is more than 2-fold faster than the existing graph-based method on the 5-MST for n = 2,000 or above.

Next, we compare the power of the three methods that could control the type
Table 1
Time comparison: average time cost (± standard deviation) from simulation runs for each choice of n (n = 1,000; 2,000; 5,000; 10,000; 20,000; 30,000), for the new method, the graph-based method on the 5-MST, ecp, and the kernel-based method. The environment where the experiments are conducted: CPU: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz / RAM: DDR3 @ 1600MHz / OS: Scientific Linux 6.10 / 2.6.32 Linux kernel.

Table 2
Empirical size: Fractions of simulation runs (out of 10,000 simulations) that the null hypothesis is rejected at level α when there is no change-point in the sequence (n = 1,000).

I error. The length of the sequence is set to be n = 1,000 with a change-point at a quarter of the sequence, τ = 250, for all simulations. We compare the power under four different scenarios. The specific amounts of change were chosen so that the tests are of moderate power to be comparable. We see that the new test is the most powerful or on par with the most powerful test over various types of changes. More simulations are provided in the supplement.

Scenario 1: The observations are generated from a d-dimensional multivariate Gaussian distribution with mean zero and covariance matrix Σ, where Σ_ii i.i.d. ∼ U(1, ·) and Σ_ij = ((h+1)^{2H} + (h−1)^{2H} − 2h^{2H})/2, with h = |i−j| and H = 0.85. After the change, all elements in the mean vector shift by 0.02, and the covariance matrix becomes aΣ. We use ∆ to denote the change in the mean vector and ‖∆‖ to denote its L2 norm.

Scenario 2: The observations are generated from a d-dimensional multivariate t distribution with mean zero and the same covariance matrix as Scenario 1. After the change, all elements in the mean vector ∆ shift by the same amount, and the covariance matrix becomes aΣ.

Scenario 3: The observations are generated from a d-dimensional multivariate t distribution with mean zero and covariance matrix Σ with Σ_ij = ρ_0^{|i−j|}. After the change, the covariance matrix becomes Σ_ij = ρ^{|i−j|}, ρ ≠ ρ_0. Note that here only the off-diagonal elements of Σ change.

Scenario 4: The observations are generated from d-dimensional independent exponential distributions with scale parameter 1. After the change, all the d scale parameters become λ.

Table 3
Power comparison: Fractions of times (out of 100) the null hypothesis is rejected at a fixed significance level α for various data dimensions (d = 25, 100, 500, 1000, 2000) and sizes of change (‖∆‖ and a for Scenarios 1 and 2, ρ for Scenario 3, and λ for Scenario 4).
3. New statistic.
Given a directed similarity graph G = {(i, j) : y_i points to y_j}, let R_{G,1}(t) be the number of edges in G connecting observations that both come before or at t, and R_{G,2}(t) be the number of edges in G connecting observations that both come after t:

R_{G,1}(t) = ∑_{(i,j)∈G} 1{i ≤ t, j ≤ t}  and  R_{G,2}(t) = ∑_{(i,j)∈G} 1{i > t, j > t},

with 1_A the indicator function for the event A. Figure 1 illustrates the computation of R_{G,1}(t) and R_{G,2}(t) on a toy example. For the directed approximate k-NN graph, each observation points to k other observations. Similar to the undirected graph versions in (Chen and Zhang, 2015) and (Chu and Chen, 2019), we can define four edge-count scan statistics (original/weighted/generalized/max-type) for directed graphs. Here, we focus on the max-type edge-count scan statistic and leave the other three to the supplement.

Since our method is non-parametric, we work under the permutation null distribution that places 1/n! probability on each of the n! permutations of {y_i : i = 1, ..., n}. With no further specification, we use P, E, Var, and
Cov to denote probability, expectation, variance, and covariance, respectively, under the permutation null distribution. For each candidate t of the true change-point τ, the max-type edge-count test statistic is defined as

M(t) = max(Z_w(t), |Z_diff(t)|),  (3)

where

Z_w(t) = (R_w(t) − E(R_w(t))) / √Var(R_w(t))  and  Z_diff(t) = (R_diff(t) − E(R_diff(t))) / √Var(R_diff(t)),

with

R_w(t) = ((n−t−1)/(n−2)) R_{G,1}(t) + ((t−1)/(n−2)) R_{G,2}(t)  and  R_diff(t) = R_{G,1}(t) − R_{G,2}(t).

The null hypothesis of homogeneity is rejected if the scan statistic

max_{n_0 ≤ t ≤ n_1} M(t)  (4)
Fig 1. The computation of R_{G,1}(t) and R_{G,2}(t) at three different values of t. The first 10 observations and the second 10 observations are drawn from two bivariate Gaussian distributions with different mean vectors and identity covariance matrix I_2. Here, G is the directed 2-NN graph based on Euclidean distance. Each t divides the observations into two groups: one group for observations before t (squares) and the other group for observations after t (circles). Red edges connect observations before t, and their count is R_{G,1}(t); blue edges connect observations after t, and their count is R_{G,2}(t). Notice that as t changes, the group identities change but the graph G does not change.

with n_0 and n_1 pre-specified, is larger than the critical value for a given significance level.

From the definition of M(t), we see that either a large value of Z_w(t) or a large value of |Z_diff(t)| would contribute to the scan statistic. They correspond to two major exhibitions of the graph when there is a change at t. When {y_1, ..., y_t} and {y_{t+1}, ..., y_n} are from the same distribution, they are well mixed, and R_{G,1}(t) and R_{G,2}(t) would be close to their null expectations (Figure 2 (a)). When they are from different distributions, one common exhibition is that observations from the same distribution are more likely to be connected in G. Figure 2 (b) plots a typical directed 2-NN graph under this alternative, and we see that there are very few edges connecting the two groups. When this happens, Z_w(t) is large. For moderate- to high-dimensional data, another exhibition of the graph is common under the alternative, shown in Figure 2 (c). We see that R_{G,1}(t) is much larger than its null expectation but R_{G,2}(t) is much smaller than its null expectation (very few blue edges). This happens due to the curse of dimensionality. Here, the dimension is d = 100. The circles are from a distribution with a larger variance than that of the squares.
As the volume of a d-dimensional ball increases exponentially in d, circles are sparsely scattered and tend to find their nearest neighbors among the squares. The Z_diff(t) part in our statistic is effective in capturing this pattern. The absolute value is taken to cover the two possible scenarios in opposite directions showcased in Figure 2 (c) and (d).

Fig 2. Directed 2-NN graphs on 100-dimensional data visualized by the ggnet2 function. Here, the first half of the observations (squares) are drawn from the standard 100-dimensional Gaussian distribution, and the second half (circles) are drawn from N(µ, a²I) with (a) µ = 0, a = 1; (b) a constant mean shift with a = 1; (c) µ = 0 and a > 1; (d) µ = 0 and a < 1. Same edge coloring scheme as in Figure 1.

Theorem 1. The expectations, variances, and covariance of R_{G,1}(t) and R_{G,2}(t) on the directed approximate k-NN graph G are:

E(R_{G,1}(t)) = nk p_1(t),  E(R_{G,2}(t)) = nk q_1(t),

Var(R_{G,1}(t)) = (c^(1) + c^(2)) p_1(t) + (c^(3) + c^(4) + c^(5) + c^(6)) p_2(t) + c^(7) p_3(t) − (nk p_1(t))²,

Var(R_{G,2}(t)) = (c^(1) + c^(2)) q_1(t) + (c^(3) + c^(4) + c^(5) + c^(6)) q_2(t) + c^(7) q_3(t) − (nk q_1(t))²,

Cov(R_{G,1}(t), R_{G,2}(t)) = c^(7) r(t) − (nk p_1(t))(nk q_1(t)),

with

p_1(t) = t(t−1)/(n(n−1)),  p_2(t) = t(t−1)(t−2)/(n(n−1)(n−2)),  p_3(t) = t(t−1)(t−2)(t−3)/(n(n−1)(n−2)(n−3)),

q_1(t) = (n−t)(n−t−1)/(n(n−1)),  q_2(t) = (n−t)(n−t−1)(n−t−2)/(n(n−1)(n−2)),  q_3(t) = (n−t)(n−t−1)(n−t−2)(n−t−3)/(n(n−1)(n−2)(n−3)),

r(t) = t(t−1)(n−t)(n−t−1)/(n(n−1)(n−2)(n−3)),

where c^(1), ..., c^(7) are quantities of the graph G, defined as:

c^(1) = nk,  c^(2) = ∑_{i=1}^{n} ∑_{j∈D_i} 1{(i,j)∈G},  c^(3) = c^(4) = ∑_{i=1}^{n} ∑_{j∈D_i} (k − 1{(i,j)∈G}),

c^(5) = nk(k−1),  c^(6) = ∑_{i=1}^{n} (|D_i|² − |D_i|),  c^(7) = (nk)² − ∑_{m=1}^{6} c^(m).
Here, D_i is the set of indices of the observations that point toward observation y_i, and |D_i| is the cardinality of the set D_i, that is, the in-degree of observation y_i.

Theorem 1 can be proved by combinatorial analysis. The expectations are relatively simple to obtain due to the linearity of expectation. For the variances and the covariance, we have to figure out the numbers of the possible configurations of pairs of edges plotted in Figure 3. A detailed proof of the theorem is in the supplement.

Next, we show in Theorem 2 when the max-type edge-count scan statistic M(t) in (3) is well-defined; its proof is in the supplement.

Theorem 2. The max-type edge-count scan statistic {M(t)}_{t=1,...,n−1} on an approximate k-NN graph is well-defined when n is sufficiently large and not every observation has the same in-degree (i.e., there exists an i, 1 ≤ i ≤ n, such that |D_i| ≠ k).
Fig 3. Seven possible configurations of two edges (i, j), (u, v) randomly chosen, with replacement, from a directed graph: (1) the two edges degenerate into one ((i, j) = (u, v)); (2) two opposite edges (the two end nodes point to each other); (3)-(6) four different configurations with the two edges sharing one node; (7) two edges without any shared node.
4. Analytic type I error control.
Given the max-type edge-count scan statistic, the next question is how large it needs to be to constitute sufficient evidence against the null hypothesis of homogeneity. In other words, we are concerned with the tail probability of the scan statistic under H_0:

P(max_{n_0≤t≤n_1} M(t) > b) = 1 − P(max_{n_0≤t≤n_1} Z_w(t) < b) P(max_{n_0≤t≤n_1} |Z_diff(t)| < b)  (5)

= 1 − (1 − P(max_{n_0≤t≤n_1} Z_w(t) > b)) (1 − P(max_{n_0≤t≤n_1} |Z_diff(t)| > b)).  (6)

For a small n, the probability (5) can be obtained by doing permutations directly. However, when n is large, doing permutations can be very time-consuming. Hence, we derive analytic formulas to approximate the probability based on the asymptotic properties of the scan statistic. We first work out the limiting distributions of {Z_w([nu]) : 0 < u < 1} and {Z_diff([nu]) : 0 < u < 1} jointly. On a directed graph, we use e = (e−, e+) to denote an edge pointing from node e− to node e+. Let A_e = G_{e−} ∪ G_{e+} be the subgraph in G that connects to either node e− or node e+, and let B_e = ∪_{e*∈A_e} A_{e*} be the subgraph in G that connects to any edge in A_e. In the following, we write a_n = O(b_n) when a_n has the same order as b_n, and a_n = o(b_n) when a_n has order smaller than b_n.

Theorem 3. If |G| = O(n^β) with 1 ≤ β < 1.25, ∑_{e∈G} |A_e||B_e| = o(n^{1.5β}), ∑_{e∈G} |A_e|² = o(n^{β+0.5}), and ∑_{i=1}^{n} |D_i|² − k²n = O(∑_{i=1}^{n} |D_i|²), then, as n → ∞, {Z_w([nu]) : 0 < u < 1} and {Z_diff([nu]) : 0 < u < 1} converge to independent Gaussian processes in finite-dimensional distributions.
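As noted above, for small n the tail probability (5) can be estimated by running permutations directly, at the cost of recomputing the scan statistic for every permuted sequence. A minimal sketch of this brute-force route, using a simple mean-difference scan as a hypothetical stand-in for the graph-based statistic (names and defaults here are illustrative, not the paper's):

```python
import random

def scan_stat(x):
    """Max over candidate change-points t of the absolute difference in
    means before and after t -- a simple stand-in for the graph-based scan."""
    n = len(x)
    best = 0.0
    for t in range(2, n - 1):
        gap = abs(sum(x[:t]) / t - sum(x[t:]) / (n - t))
        best = max(best, gap)
    return best

def permutation_pvalue(x, n_perm=200, seed=0):
    """Permutation null: reshuffle the observation order and recompute the
    scan statistic; the p-value is the fraction of permuted statistics at
    least as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    obs = scan_stat(x)
    count = sum(
        scan_stat(rng.sample(x, len(x))) >= obs for _ in range(n_perm)
    )
    return (count + 1) / (n_perm + 1)
```

Each permutation repeats the full scan, so the cost grows linearly in the number of permutations; this is the expense the analytic approximation avoids.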
Here, we do not consider β < 1. For a scalar x, we use [x] to denote the largest integer no greater than x.

Based on Theorem 3, we have

P(max_{n_0≤t≤n_1} Z_w(t) > b) ≈ b φ(b) ∫_{n_0}^{n_1} C_w(t) ν(√(2b² C_w(t))) dt,  (7)

P(max_{n_0≤t≤n_1} |Z_diff(t)| > b) ≈ 2b φ(b) ∫_{n_0}^{n_1} C_diff(t) ν(√(2b² C_diff(t))) dt,  (8)

where the function ν(·) can be estimated numerically as

ν(x) ≈ (2/x)(Φ(x/2) − 0.5) / ((x/2)Φ(x/2) + φ(x/2)),

with φ(·) and Φ(·) being the probability density function and the cumulative distribution function of the standard normal distribution, respectively, and

C_j(t) = lim_{s↗t} ∂ρ_j(s,t)/∂s,  with ρ_j(s,t) = Cov(Z_j(s), Z_j(t)), j = w or diff.

The explicit expressions for C_w(t) and C_diff(t) can be derived and simplified; in particular, C_diff(t) = n/(2t(n−t)). The details are in the supplement.
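The approximation above reduces p-value computation to a one-dimensional integral, which can be discretized over the integer candidate change-points. A numerical sketch, assuming the ν(·) approximation as stated; the C(t) used in the example is the C_diff-type function n/(2t(n−t)) and is meant purely as an illustration:

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def nu(x):
    """Numerical approximation of the overshoot function nu(x):
    nu(x) ~ (2/x)(Phi(x/2) - 0.5) / ((x/2) Phi(x/2) + phi(x/2))."""
    if x <= 0:
        return 1.0
    return ((2.0 / x) * (norm_cdf(x / 2) - 0.5)
            / ((x / 2) * norm_cdf(x / 2) + norm_pdf(x / 2)))

def tail_prob(b, n0, n1, C):
    """Approximate P(max_{n0<=t<=n1} Z(t) > b) by discretizing
    b * phi(b) * integral over [n0, n1] of C(t) nu(sqrt(2 b^2 C(t))) dt."""
    s = sum(C(t) * nu(math.sqrt(2.0 * b * b * C(t))) for t in range(n0, n1 + 1))
    return b * norm_pdf(b) * s
```

The sum costs O(n), so once the graph quantities are in hand the whole p-value evaluation is linear in the sequence length.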
Table 4
Critical values for the scan statistic max_{n_0≤t≤n_1} M(t) based on the 3-NN graph at α = 0.05, where A1 denotes the analytic approximation, A2 the analytic approximation with skewness correction, and Per the critical value from 10,000 permutations. Entries below are A2/Per pairs for n_0 = 100, 75, 50, 25 (with n_1 = n − n_0).

Multivariate Gaussian:
d = 10: 3.26/3.26, 3.31/3.35, 3.39/3.43, 3.52/3.60
d = 100: 3.29/3.29, 3.36/3.40, 3.45/3.52, 3.62/3.78
d = 1,000: 3.31/3.31, 3.40/3.42, 3.51/3.60, 3.70/3.94

Multivariate t:
d = 10: 3.26/3.26, 3.32/3.34, 3.39/3.44, 3.51/3.60
d = 100: 3.30/3.30, 3.35/3.39, 3.44/3.51, 3.60/3.79
d = 1,000: 3.30/3.30, 3.46/3.54, 3.59/3.75, 3.80/4.27

Multivariate log-normal:
d = 10: 3.27/3.28, 3.32/3.33, 3.40/3.41, 3.52/3.58
d = 100: 3.28/3.28, 3.35/3.37, 3.43/3.48, 3.58/3.70
d = 1,000: 3.36/3.41, 3.45/3.58, 3.58/3.81, 3.72/4.07
Table 4 shows the performance of the analytic p-value approximation of the max-type edge-count scan statistic (5) under different settings. We examine three different underlying distributions (multivariate Gaussian, multivariate t, and multivariate log-normal distributions) with different data dimensions (d = 10, 100, 1,000). Per is the critical value obtained from doing 10,000 permutations. This can be deemed as close to the true critical value. A1 presents the critical values given by plugging (7) and (8) into (6). When n_0 is small, the critical values from the analytic formulas are not that close to those obtained through permutations. This is due to the fact that Z_w(t) and Z_diff(t) converge to their limiting normal distributions slowly when t is close to 1 or n. We perform skewness correction to improve the p-value approximations. Column A2 lists the skewness-corrected critical values. The skewness correction protocol is briefly discussed below.

Since the convergence of Z_w(t) and Z_diff(t) to the limiting distribution depends on t, we adopt a treatment similar to (Chen and Zhang, 2015; Chu and Chen, 2019) in that the skewness correction terms depend on t. The tail probability after skewness correction is approximated by (with j = w, diff)

P(max_{n_0≤t≤n_1} Z_j(t) > b) ≈ b φ(b) ∫_{n_0}^{n_1} S_j(t) C_j(t) ν(√(2b² C_j(t))) dt,  (9)

where the skewness correction term is

S_j(t) = exp{(b − θ̂_{b,j}(t))²/2 + γ_j(t) θ̂_{b,j}(t)³/6} / √(1 + γ_j(t) θ̂_{b,j}(t)),

with γ_j(t) = E(Z_j(t)³) and θ̂_{b,j}(t) = (−1 + √(1 + 2b γ_j(t))) / γ_j(t).

Performing skewness correction requires the computation of the third moments of the edge-count test statistics, that is, E(Z_w(t)³) and E(Z_diff(t)³). Since these test statistics are linear combinations of R_{G,1}(t) and R_{G,2}(t), all we need is to compute the third moments involving R_{G,1}(t) and R_{G,2}(t). The computation involves the analysis of the configurations of each set of three edges in the graph. For an undirected graph, there are 8 possible such configurations, as discussed in (Chen and Zhang, 2015). However, for a directed graph, there are 24 possible configurations for a set of three edges (Figure 4).
We derive analytic formulas for the numbers of those configurations on the directed k-NN graphs, and they are listed in the supplement.

Fig 4. Twenty-four possible configurations of three edges randomly chosen, with replacement, from a directed graph.
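For a single t, the correction factor only needs the threshold b and the third moment γ = E(Z(t)³). A sketch of the computation, where the exact constants follow the Chen-and-Zhang-style correction as reconstructed here and should be treated as an assumption rather than the paper's definitive protocol:

```python
import math

def theta_hat(b, gamma):
    """Solve theta + gamma * theta^2 / 2 = b for theta:
    theta = (-1 + sqrt(1 + 2*b*gamma)) / gamma, with limit b as gamma -> 0."""
    if abs(gamma) < 1e-12:
        return b
    return (-1.0 + math.sqrt(1.0 + 2.0 * b * gamma)) / gamma

def skewness_correction(b, gamma):
    """Skewness-correction factor S(t) given threshold b and the third
    moment gamma = E(Z(t)^3). With gamma = 0 the factor reduces to 1,
    recovering the uncorrected approximation."""
    th = theta_hat(b, gamma)
    return (math.exp(0.5 * (b - th) ** 2 + gamma * th ** 3 / 6.0)
            / math.sqrt(1.0 + gamma * th))
```

Multiplying this factor into the integrand of the tail approximation, t by t, yields the skewness-corrected p-value.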
5. Conclusion.
We propose a new non-parametric framework for change-point analysis using approximate k-NN information. There are fast algorithms available for obtaining the approximate k-NN graph, and the time complexity can easily be achieved at O(dn log n). In constructing the test statistic, we take into account a pattern caused by the curse of dimensionality. As a result, the new test can detect various types of changes for data in moderate to high dimensions, such as a change in mean and/or variance, and a change in the covariance matrix. There is no distributional assumption on the data: the distribution could be heavy-tailed and/or skewed, the dimension of the data could be much higher than the number of observations, and the change could be global or in a sparse/dense subset of coordinates. We also derive analytic formulas to compute the test statistic and to approximate the p-value, so that the entire algorithm is of time complexity O(dn log n). Our method is so far the fastest change-point detection method available for detecting various types of changes in long sequences of moderate- to high-dimensional data with proper control on false discoveries.

The proposed framework can easily be extended to non-Euclidean data: as long as a similarity measure can be defined on the sample space, an (approximate) k-NN graph can be constructed and our test can be applied. In this work, we focus on the single change-point alternative. In practice, a sequence could have more than one change-point. Our method can also be extended to deal with multiple change-points by incorporating binary segmentation or wild binary segmentation techniques (Fryzlewicz, 2020; Fryzlewicz et al., 2014; Vostrikova, 1981).

References.
Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D. and Li, S. (2019). FNN: Fast nearest neighbor search algorithms and applications. R package.
Carlstein, E. G., Müller, H.-G. and Siegmund, D. (1994). Change-point Problems. IMS.
Chan, H. P., Walther, G. et al. (2015). Optimal detection of multi-sample aligned sparse signals. The Annals of Statistics.
Chen, H., Chen, S. and Deng, X. (2019). A universal nonparametric event detection framework for Neuropixels data. bioRxiv.
Chen, J. and Gupta, A. K. (2011). Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance. Springer Science & Business Media.
Chen, H. and Zhang, N. (2015). Graph-based change-point detection. The Annals of Statistics.
Chu, L. and Chen, H. (2019). Asymptotic distribution-free change-point detection for multivariate and non-Euclidean data. The Annals of Statistics.
Eagle, N., Pentland, A. S. and Lazer, D. (2009). Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences.
Fryzlewicz, P. (2020). Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection. Journal of the Korean Statistical Society.
Fryzlewicz, P. et al. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics.
Harchaoui, Z., Moulines, E. and Bach, F. R. (2009). Kernel change-point analysis. In Advances in Neural Information Processing Systems.
Heard, N. A., Weston, D. J., Platanioti, K., Hand, D. J. et al. (2010). Bayesian anomaly detection methods for social networks. The Annals of Applied Statistics.
James, B., James, K. L. and Siegmund, D. (1992). Asymptotic approximations for likelihood ratio tests and confidence regions for a change-point in the mean of a multivariate normal distribution. Statistica Sinica.
Kossinets, G. and Watts, D. J. (2006). Empirical analysis of an evolving social network. Science.
Li, Z., Li, P., Krishnan, A. and Liu, J. (2011). Large-scale dynamic gene regulatory network inference combining differential equation models with local dynamic Bayesian network analysis. Bioinformatics.
Lu, T., Liang, H., Li, H. and Wu, H. (2011). High-dimensional ODEs coupled with mixed-effects modeling techniques for dynamic gene regulatory network identification. Journal of the American Statistical Association.
Matteson, D. S. and James, N. A. (2014). A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association.
Siegmund, D., Yakir, B. and Zhang, N. R. (2011). Detecting simultaneous variant intervals in aligned sequences. The Annals of Applied Statistics.
Srivastava, M. and Worsley, K. J. (1986). Likelihood ratio tests for a change in the multivariate normal mean. Journal of the American Statistical Association.
Stringer, C., Pachitariu, M., Steinmetz, N., Reddy, C. B., Carandini, M. and Harris, K. D. (2019). Spontaneous behaviors drive multidimensional, brainwide activity. Science, eaav7893.
Vostrikova, L. Y. (1981). Detecting disorder in multidimensional random processes.
Wang, T. and Samworth, R. J. (2018). High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Wang, H., Tang, M., Park, Y. and Priebe, C. E. (2013). Locality statistics for anomaly detection in time series of graphs. IEEE Transactions on Signal Processing.
Xie, Y. and Siegmund, D. (2013). Sequential multi-sensor change-point detection.
Zhang, N. R., Siegmund, D. O., Ji, H. and Li, J. Z. (2010). Detecting simultaneous changepoints in multiple sequences. Biometrika, 97.