A Comparison of Single and Multiple Changepoint Techniques for Time Series Data
Xueheng Shi, Colin Gallagher, Robert Lund and Rebecca Killick

January 7, 2021
Abstract
This paper describes and compares several prominent single and multiple changepoint techniques for time series data. Due to their importance in inferential matters, changepoint research on correlated data has accelerated recently. Unfortunately, small perturbations in model assumptions can drastically alter changepoint conclusions; for example, heavy positive correlation in a time series can be misattributed to a mean shift should correlation be ignored. This paper considers both single and multiple changepoint techniques. The paper begins by examining cumulative sum (CUSUM) and likelihood ratio tests and their variants for the single changepoint problem; here, various statistics, boundary cropping scenarios, and scaling methods (e.g., scaling to an extreme value or Brownian bridge limit) are compared. A recently developed test based on summing squared CUSUM statistics over all times is shown to have realistic Type I errors and superior detection power. The paper then turns to the multiple changepoint setting. Here, penalized likelihoods drive the discourse, with AIC, BIC, mBIC, and MDL penalties being considered. Binary and wild binary segmentation techniques are also compared. We introduce a new distance metric specifically designed to compare two multiple changepoint segmentations. Algorithmic and computational concerns are discussed and simulations are provided to support all conclusions. In the end, the multiple changepoint setting admits no clear methodological winner, performance depending on the particular scenario. Nonetheless, some practical guidance emerges.
1 Introduction

Changepoints (abrupt shifts) arise in many time series due to changes in recording equipment, observers, etc. In climatology, temperature trends computed from raw data can be misleading if homogeneity adjustments for station relocation moves and gauge changes are not a priori made to the record. Lu and Lund (2007) give an example where trend conclusions reverse when changepoint information is neglected. Cases with multiple changepoints are also frequently encountered; for example, in climatology, United States weather stations average about six station moves and/or gauge changes per century of operation (Menne et al., 2009).

This paper intends to guide the researcher on the best changepoint techniques to use in common time series scenarios. Assumptions are crucial in changepoint analyses and can significantly alter conclusions; here, correlation issues take center stage. It is known that changepoint inferences made from positively correlated series can be spurious if correlation is not taken into account. Even lag one correlations as small as 0.25 can have deleterious consequences for changepoint conclusions (Lund et al., 2007).

This paper's primary contribution is to extend/modify many of the popular changepoint methods for IID data to correlated settings. Much of our work lies with developing methods that put all techniques, to the best extent possible, on the same footing in time series settings. For example, we will see that single changepoint tests generally work best when applied to estimated versions of the series' one-step-ahead prediction residuals, computed under a null hypothesis of no changepoints. Because of this, tests that handle one-step-ahead prediction residuals need to be developed.
Two other novel contributions in this article are: (1) developing and proposing a new single changepoint test based on the square of the cumulative sum of one-step-ahead prediction residuals (see Section 2.2), and (2) presenting a new distance that compares multiple changepoint segmentations (see Section 5.1). The comparative aspect of the paper is yet another contribution, and there is much to compare. In addition to comparing different statistics via Type I errors and powers, the paper also compares different asymptotic scaling methods.

Academic changepoint research commenced with the single changepoint case for independent and identically distributed (IID) data in Page (1955). The subject is now vast, with hundreds of papers devoted to the topic. With our lofty objectives, some concessions are necessary. Foremost, this paper examines mean shift changepoints only; that is, while series mean levels are allowed to abruptly shift, the variances and correlations of the series are held constant (stationary) in time. Changepoints can also occur in variances (volatilities) (Chapman et al., 2020), in the series' correlation structures (Davis et al., 2006), or even in the marginal distribution of the series (Gallagher et al., 2012). Secondarily, the simulation results reported here are for Gaussian series only. Thirdly, we compare the most common types of techniques within the literature, notably excluding those based on energy statistics (Matteson and James, 2014), moving sums (Eichinger and Kirch, 2018), and U-statistics (Dehling et al., 2015).

The rest of this paper proceeds as follows. Section 2 overviews single changepoint detection methods, typically referred to as at most one changepoint (AMOC) tests. Here, a variety of test statistics and their scalings are reviewed and adapted to the time series setting.
We specifically discuss two methods for modifying changepoint techniques based on IID data: (1) retain the IID test statistic and modify the limiting distribution for any correlation; and (2) modify the test statistic to account for the correlation. Section 3 compares AMOC detectors in a simulation study. Thereafter, we move to the case of multiple changepoints, where performance assessment becomes more challenging. Here, a novel changepoint configuration distance specifically designed for our comparisons is developed. Simulations in Section 5 consider a variety of multiple changepoint configurations. We summarize results in Section 6 with recommendations for practitioners.

2 Single Changepoint Techniques
Let {X_t}_{t=1}^N be the observed time series and γ(h) = Cov(X_{t+h}, X_t) be the lag h autocovariance of the series. We wish to test whether there exists a change in the mean structure while assuming the second order structure is constant over time. An AMOC model with the changepoint occurring at the unknown time k is

    X_t = μ + ε_t,       for 1 ≤ t ≤ k,
    X_t = μ + Δ + ε_t,   for k + 1 ≤ t ≤ N,      (1)

where μ is an unknown location parameter, Δ is the magnitude of the mean shift at time k, and {ε_t} is a stationary time series with zero mean and lag h autocovariance γ(h). A hypothesis test for this scenario is

    H_0: Δ = 0   versus   H_A: Δ ≠ 0 for some k ∈ {1, ..., N − 1}.      (2)

When the {ε_t} are independent, cumulative sum (CUSUM) and likelihood ratio tests (LRT) are well understood; see Csörgő and Horváth (1997) and Chen and Gupta (2011). When incorporating general stationary autocovariance aspects into a changepoint testing framework, there are two common strategies: (1) keep the IID test statistic and identify any changes in the limiting distribution induced by the correlation; and (2) incorporate the autocovariance within the test statistic. Antoch et al. (1997) provide a summary of the first approach for many common changepoint statistics and provide simulations indicating how autocorrelation impacts the performance of the hypothesis tests; Kirch (2007) uses resampling techniques to improve the finite sample performance of these tests. Robbins et al. (2011) show that estimating and using the autocorrelation (the second approach) is preferable with CUSUM and LRTs.

The CUSUM method was first introduced by Page (1955) and compares sample means before and after each admissible changepoint time via the statistic

    max_{1 ≤ k < N} CUSUM_X(k/N),   where   CUSUM_X(k/N) = N^{-1/2} | Σ_{t=1}^{k} X_t − (k/N) Σ_{t=1}^{N} X_t |,

scaled, in the correlated case, by an estimate of the long-run standard deviation. An alternative applies the same statistic to the estimated one-step-ahead prediction residuals {Ẑ_t}, computed under the null hypothesis of no changepoints. The estimator σ̂² = N^{-1} Σ_{t=1}^{N} Ẑ_t² is used to estimate the variance of Z_t.
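As a concrete illustration, the CUSUM process above takes only a few lines to compute. The sketch below (the function name is ours, not the paper's) returns the unscaled maximal statistic and its maximizing time; in the correlated setting it would be applied to estimated one-step-ahead prediction residuals and divided by a long-run standard deviation estimate.

```python
import numpy as np

def cusum_statistic(x):
    """Unscaled CUSUM statistic: max over k of |S_k - (k/N) S_N| / sqrt(N).

    Dividing by a long-run standard deviation estimate gives the test
    statistic whose null limit is the supremum of a Brownian bridge.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.cumsum(x)                           # S_1, ..., S_N
    k = np.arange(1, n)                        # admissible times 1, ..., N-1
    cusum = np.abs(s[:-1] - (k / n) * s[-1]) / np.sqrt(n)
    k_hat = int(np.argmax(cusum)) + 1          # estimated changepoint time
    return cusum.max(), k_hat
```

For a noise-free level shift the statistic peaks exactly at the shift time, which makes the sketch easy to sanity-check by hand.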
The residual CUSUM statistic at time k is CUSUM_Z(k/N), the CUSUM process computed from the estimated one-step-ahead prediction residuals {Ẑ_t}; the test rejects for large values of max_{1 ≤ k < N} CUSUM_Z(k/N)/σ̂.

Theorem 2.2.1. Assume that {X_t} follows (1), {ε_t} has the causal linear representation assumed in Theorem 1, and η̂² is a null hypothesis based consistent estimator of η², the long-run variance defined in (4). Then under H_0,

    SCUSUM_X = (1/N) Σ_{k=1}^{N} [ CUSUM_X(k/N) / η̂ ]²  →_D  ∫_0^1 B(t)² dt.      (11)

Our second approach for incorporating correlation uses the one-step-ahead prediction residuals in place of the original data. The SCUSUM test statistic for this scheme is

    SCUSUM_Z := (1/N) Σ_{k=1}^{N} [ CUSUM_Z(k/N) / σ̂ ]².      (12)

The asymptotic distribution of (12) can be derived from Theorem 2.1.2 via the continuous mapping theorem.

Theorem 2.2.2. With CUSUM_Z defined as in Theorem 2.1.2, under the null hypothesis of no changepoints,

    SCUSUM_Z = (1/N) Σ_{k=1}^{N} [ CUSUM_Z(k/N) / σ̂ ]²  →_D  ∫_0^1 B(t)² dt.      (13)

The distribution of ∫_0^1 B(t)² dt was investigated in Tolmatz (2002). We note that Bai (1993) proposed using the sum of the squares of partial sums of ARMA residuals to detect a single changepoint in autocorrelated data; this test statistic converges to the integral of a squared Brownian motion rather than the integral of a squared Brownian bridge. To our knowledge, the variant in (12) has not previously been proposed nor studied in the literature.

The differences between the CUSUM and CUSUM_Z statistics were investigated in Robbins et al. (2011); their simulations indicate that the latter statistic is superior to the former in terms of Type I error and power. Our simulations confirm this finding. As such, in the remainder of the paper, we do not consider SCUSUM (without the subscript Z) tests further.

While CUSUM tests are nonparametric, LRTs are inherently parametric.
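Before turning to LRTs, the SCUSUM statistic in (12) can be sketched in code. This is an illustrative implementation (function name ours): it averages the squared, variance-scaled CUSUM process over all times, assuming the input is already a sequence of one-step-ahead prediction residuals.

```python
import numpy as np

def scusum_statistic(z):
    """Sum-of-squared-CUSUMs statistic of (12), up to the residual input.

    z is intended to be one-step-ahead prediction residuals; the null-based
    variance estimate is sigma2_hat = mean(z**2), as in the text.  The null
    limit is the integral (not the supremum) of a squared Brownian bridge.
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    s = np.cumsum(z)
    k = np.arange(1, n + 1)
    cusum = (s - (k / n) * s[-1]) / np.sqrt(n)   # signed CUSUM_Z(k/N)
    sigma2_hat = np.mean(z ** 2)                  # null-based variance estimate
    return np.mean(cusum ** 2 / sigma2_hat)
```

Because every time k contributes, the statistic accumulates evidence over the whole record rather than relying on a single extreme ordinate.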
Several error distributions have been considered by previous authors, by far the most common being normal; this is the distribution considered here.

The LRT compares the likelihood under the null hypothesis to likelihoods under alternatives with a changepoint. The LRT statistic for a changepoint has the general form Λ = max_{1 ≤ k < N} Λ_k, where Λ_k is the likelihood ratio statistic for a changepoint at time k. A Darling–Erdős type extreme value scaling sets

    W_U = a(log N) Λ^{1/2} − b(log N),

with a(x) = (2 log x)^{1/2} and b(x) = 2 log x + (1/2) log log x − (1/2) log π. Then under H_0,

    lim_{N→∞} P(W_U ≤ x) = exp(−2 exp(−x)),   −∞ < x < ∞.      (16)

Specifically, H_0 is rejected when W_U is too large to be explained by the distribution in (16).

Another way of scaling the Λ_k statistics involves cropping boundary times. Like the CUSUM test, the LRT is volatile at times near the boundaries. In fact, Λ →_D ∞ as N → ∞ should the maximum be taken over the entire range 1 ≤ k < N under the null hypothesis of no changepoints. A common cropped LRT simply truncates admissible times near the boundaries; for example, with 0 < ℓ < h < 1, ℓ being close to zero and h being close to unity, set

    U_crop = max_{ℓ ≤ k/N ≤ h} Λ_k.      (17)

Robbins et al. (2011) show that

    U_crop →_D sup_{ℓ ≤ t ≤ h} B(t)² / ( t(1 − t) ).      (18)

As the next section shows, LRTs are not competitive in changepoint detection problems. While simulations are presented for the above extreme value test in the next section, simulations for cropped LRTs are delegated to the supplementary material; both methods perform poorly.

As a final comment here, deriving an LRT for independent data and then replacing the data with one-step-ahead prediction residuals, another avenue for dealing with dependence, does not yield a methodologically distinct path. Specifically, if one derives an LRT statistic for independent series and then substitutes one-step-ahead prediction residuals in place of the original data, the limit law in (18) again arises. The boundaries again must be cropped to ensure a proper limiting distribution.
The discussion around (1.4.22)–(1.4.27) in Csörgő and Horváth (1997) provides more detail on this route; see also Lavielle and Moulines (2000) for more on LRTs for correlated data.

3 Single Changepoint Simulations

This section investigates the finite sample performance of the Section 2 tests (cropped CUSUMz, CUSUMz, SCUSUMz, LRT) through simulation. Results for the cropped test statistics are delegated to the supplementary material; results for the others are presented here.

Desirable tests have reasonable (non-inflated) false detection rates when no changepoints exist, and large detection powers when a changepoint is present, regardless of the degree of correlation. For each statistic under consideration, the impact of autocorrelation on the Type I error is first explored. We then examine detection powers of the tests when a changepoint exists. First order Gaussian autoregressions (AR(1)) are considered here with σ = 1; other structures are examined in the supplementary material.

Figure 1 summarizes results for N = 1,000 across varying AR(1) correlation parameters φ. Our conclusions do not vary for different N; see the supplementary material. Figure 1 shows that the only method to retain a controlled Type I error across all φ is the SCUSUMz. The LRT is the worst performing method, being far too conservative for all φ except 0.95, when it becomes highly inflated. The CUSUMz method is also slightly conservative, becoming more so as φ increases. Since we are using the asymptotic distribution for each of the 10,000 test statistics, we would expect the 0.05 Type I error to be reasonably maintained.

Figure 1: Type I Errors for an AR(1) Series with Different φ When N = 1,000.

We now consider test detection powers. In general, the detection power of an AMOC test depends on the degree of correlation, the size of the mean shift, and the location of the changepoint time (Robbins et al., 2016). Figure 2 (Δ = 0.3) shows empirical powers based on 10,000 independent simulated Gaussian series of length N = 1,000 for each φ when the mean shift lies in the center of the series (time 501). The figures demonstrate the drastic effects of autocorrelation on the power of changepoint tests. While the LRT had the highest empirical power when φ = 0.95, the estimated changepoint location of the LRT is biased and more variable than that for the CUSUM_Z and SCUSUM_Z tests; see Figures 3 and 5. The LRT test also has a Type I error far exceeding 0.05; as such, its higher power does not imply better overall performance. Overall, the CUSUM_Z and SCUSUM_Z tests are more powerful than the others. Note also that SCUSUM_Z has higher power than CUSUM_Z for each φ considered. Additional simulations (not shown) duplicate this conclusion for other sample sizes. The SCUSUMz statistic is clearly the best test.

Figure 2: Detection Powers for an AR(1) Series with Different φ. Here, N = 1,000 and Δ = 0.3.

Figure 3: Boxplots of Detected Changepoint Locations for an AR(1) Series with Different φ. Here, N = 1,000 and Δ = 0.3.

Figure 4: Detection Power for an AR(1) Series with Different φ (N = 1,000).

Figure 5: Detected Changepoint Locations for an AR(1) Series with Different φ (N = 1,000).

Our final AMOC simulations vary the changepoint location τ; here, φ is fixed at 0.5. Figure 6 displays empirical powers. The largest detection powers occur when the changepoint is near the center of the record, as expected, with power decreasing as the changepoint time moves towards a boundary. The SCUSUM_Z appears to be the most accurate overall. However, the LRT test is preferable when the changepoint occurs near the beginning of the record.
Figure 6: A Graph of τ/N Against Power with N = 500 and φ = 0.5.

4 Multiple Changepoint Techniques

Now suppose that {X_t}_{t=1}^N has an unknown number of changepoints, denoted by m, occurring at the unknown ordered times 1 < τ_1 < τ_2 < ⋯ < τ_m < N. Boundary conditions take τ_0 = 0 and τ_{m+1} = N. These m changepoints partition the series into m + 1 distinct regimes, the i-th regime having its own distinct mean and containing the data points {X_{τ_i + 1}, ..., X_{τ_{i+1}}}. The model can be written as X_t = κ_t + ε_t, where κ_t = μ_{r(t)} and r(t) denotes the regime index at time t, which takes values in {0, 1, ..., m}, and {ε_t} is a stationary, causal, and invertible ARMA(p, q) time series that applies to all regimes. Observe that

    κ_t = μ_0,   1 ≤ t ≤ τ_1,
          μ_1,   τ_1 + 1 ≤ t ≤ τ_2,
          ⋮
          μ_m,   τ_m + 1 ≤ t ≤ N.

There are many challenges in the multiple changepoint problem. Here, estimation of a global autocovariance function that applies to all regimes (considered further in Section 4.2) is difficult. One also has to estimate an unknown number of changepoints, their locations, and all segment parameters in a computationally feasible manner for some of the techniques.

While many authors have considered multiple changepoint issues, most assume IID {ε_t}. For IID errors, dynamic programming based approaches (Auger and Lawrence, 1989; Killick et al., 2012), model selection methods using LASSO (Harchaoui and Lévy-Leduc, 2010; Shen et al., 2014), and moving sum statistics (Kirch and Muhsal, 2014) have all been applied to multiple changepoint problems; this list is not exhaustive.
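The piecewise-mean model above is straightforward to simulate, which is how series like those in our figures are produced. The sketch below (function name and defaults are ours) generates X_t = κ_t + ε_t with AR(1) errors; extending it to general ARMA(p, q) errors is routine.

```python
import numpy as np

def simulate_shifted_ar1(n, tau, mu, phi=0.5, sigma=1.0, rng=None):
    """Simulate X_t = kappa_t + eps_t with AR(1) errors (illustrative sketch).

    tau : ordered interior changepoint times (0 < tau_1 < ... < tau_m < n)
    mu  : the m + 1 regime means mu_0, ..., mu_m
    """
    rng = np.random.default_rng(rng)
    e = rng.normal(0.0, sigma, n)
    eps = np.zeros(n)
    for t in range(1, n):                      # AR(1) recursion for the errors
        eps[t] = phi * eps[t - 1] + e[t]
    kappa = np.zeros(n)
    bounds = [0] + list(tau) + [n]
    for i in range(len(mu)):                   # piecewise constant mean
        kappa[bounds[i]:bounds[i + 1]] = mu[i]
    return kappa + eps
```

Setting sigma = 0 recovers the pure mean function κ_t, a convenient way to check the regime bookkeeping.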
As in the AMOC setting, techniques for independent data may not work well for dependent series (Davis et al., 2006; Li and Lund, 2012; Chakar et al., 2017).

The multiple changepoint techniques considered here can be put into two broad categories: (1) recursive segmentation and algorithmic methods using AMOC techniques, and (2) direct approaches that fit all series subsegments jointly. The two approaches are completely different in their perspective. Elaborating, recursive techniques employ AMOC single changepoint methods in an iterative manner, identifying at most one additional changepoint in each subsegment at each recursion level. In contrast, direct techniques model and estimate the multiple changepoint configuration jointly; here, penalization methods typically drive the discourse. No hypothesis testing paradigm underlies any direct approach. Some multiple changepoint techniques apply only to special time series structures. For example, Chakar et al. (2017) is exclusively designed for AR(1) series. Their techniques are not considered here as they cannot be applied to all of our considered scenarios.

Recursive segmentation approaches first focus on finding a single (usually the most prominent) changepoint, thereafter iterating in some manner to identify additional changepoints. The primary tool here is binary segmentation (Scott and Knott, 1974), which provides a multiple changepoint configuration estimate via any AMOC method. Binary segmentation first tests the entire series for a single changepoint. Should a changepoint be found, the series is split about the changepoint time into two subsegments that are then analyzed for additional changepoints using the AMOC strategy. The process is repeated until no subsegment tests positive for a changepoint. Binary segmentation works best when the changepoints are well separated and the segment means are distinct.
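The recursion just described is generic: any AMOC test can be plugged in. The sketch below is ours, not the paper's implementation; the minimum segment length guard and the test interface are illustrative assumptions.

```python
def binary_segmentation(x, amoc_test, lo=0, hi=None):
    """Generic binary segmentation (sketch).

    amoc_test(segment) should return (reject, k), where reject indicates a
    detected changepoint and k is its position within the segment.  Any AMOC
    detector, e.g. one applied to prediction residuals, can be plugged in.
    """
    if hi is None:
        hi = len(x)
    if hi - lo < 4:                      # too short to split (arbitrary guard)
        return []
    reject, k = amoc_test(x[lo:hi])
    if not reject:
        return []
    cp = lo + k                          # changepoint in full-series time
    return (binary_segmentation(x, amoc_test, lo, cp)
            + [cp]
            + binary_segmentation(x, amoc_test, cp, hi))
```

With a crude mean-gap AMOC test, a noise-free staircase series is split at exactly its two level shifts, illustrating the recursion.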
In our comparisons, the AMOC statistic adopted for recursive segmentation is the SCUSUM test applied to one-step-ahead prediction residuals, which won our AMOC comparisons in the previous section.

Extensions of binary segmentation abound and include circular binary segmentation (Olshen et al., 2004), which seeks to identify a segment of data that has a distinct mean from the rest of the series. A popular binary segmentation extension considered here is wild binary segmentation (WBS) (Fryzlewicz, 2014). WBS samples subsegments of the entire data of varying lengths and performs an AMOC test on each sampled subsegment. Fryzlewicz (2014) suggests sampling at least 9N² log(N²δ⁻¹)/δ² subsegments, where δ is the minimum spacing between changepoints (see Assumption 3.2 of Fryzlewicz (2014)), as this produces a high probability of drawing a favorable subsegment. WBS is a randomized search and hence may return different segmentations on different runs. In our simulations, WBS employs a standard CUSUM test since its threshold was developed particularly for standard CUSUM methods. In addition, the threshold constant C = 1.3 is used.

For our work, the autocovariance of the series is assumed constant across time, applying to all series subsegments. This autocovariance function will be needed to decorrelate the series before applying any binary segmentation search methods to the one-step-ahead prediction residuals. Unfortunately, accurate estimation of the autocovariance function requires knowledge of the underlying mean structure. In the single changepoint case, the long-run variance defined in (4) arises in the limit laws; however, this does not extend to multiple changepoint settings as no theoretical equivalent of (5) exists.

In our setup, the second order (covariance) model parameters are deemed nuisance parameters and are estimated using the entire data sequence.
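These nuisance parameters can be estimated in a mean-shift-robust way by working with first differences, as developed in detail below: each shift contaminates at most one difference. A sketch of the resulting lagged autocorrelation estimates (function name ours):

```python
import numpy as np

def diff_autocorr(x, max_lag):
    """Sample autocorrelations of the first difference d_t = X_t - X_{t-1}.

    A sparse set of mean shifts touches at most one difference per shift, so
    these estimates are nearly immune to an unknown changepoint configuration.
    """
    d = np.diff(np.asarray(x, dtype=float))
    n = len(d)
    denom = np.sum(d ** 2)
    rho = [np.sum(d[:n - h] * d[h:]) / denom for h in range(max_lag + 1)]
    return np.array(rho)                        # rho[0] == 1 by construction
```

As a side note, for an AR(1) error process a short calculation gives ρ_d(1) = −(1 − φ)/2, so φ can be recovered as 1 + 2ρ̂_d(1); the general AR(p) mapping is the moment system discussed in the text.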
To account for the impact of unknown mean shifts on these estimators, Yule–Walker type moment equations will be used on the first order difference of {X_t}. The first order difference X_t − X_{t−1} is used because E[X_t − X_{t−1}] = 0 unless a changepoint occurs at time t. Define d_t = X_t − X_{t−1} and note that d_t = ε_t − ε_{t−1} except when time t is a changepoint. Let γ_d(h) = Cov(d_t, d_{t−h}) and ρ_d(h) = γ_d(h)/γ_d(0). For the AR(p) case, which is our primary interest, estimators of the AR(p) parameters formed from {d_t} take the form

    φ̂ = M̂⁻¹ ρ̂_d,      (19)

where φ = (φ_1, ..., φ_p)′, ρ̂_d collects the lagged sample autocorrelations of {d_t}, and M is a p × p matrix whose entries are linear combinations of ρ_d(0), ..., ρ_d(p − 1); its exact form is given in Shi et al. (forthcoming). The elements in M̂ and ρ̂_d simply replace ρ_d(h) with

    ρ̂_d(h) = γ̂_d(h)/γ̂_d(0) = Σ_{t=2}^{N−h} (X_t − X_{t−1})(X_{t+h} − X_{t+h−1}) / Σ_{t=2}^{N} (X_t − X_{t−1})².

While Shi et al. (forthcoming) discuss these AR(p) estimators in detail, the intuition behind them is that if m is small relative to N, then the mean shifts will have negligible impact on the estimated covariance structure of the differences since X_t − X_{t−1} = ε_t − ε_{t−1} except at times t that are changepoint times. Shi et al. (forthcoming) demonstrate that this estimate of the covariance outperforms alternatives such as direct and windowed estimation. Due to this, the Yule–Walker moment estimators in (19) will be used in our simulations to decorrelate the series.

4.3 Direct Modelling

Direct modelling approaches analyze the whole series at once, optimizing an objective function with a penalty term that controls the number of changepoints. The techniques seek a changepoint configuration that minimizes

    F(m; τ_1, ..., τ_m) := C(m; τ_1, ..., τ_m) + P(m; τ_1, ..., τ_m),      (20)
where C is the cost of putting m changepoints at the times τ_1, ..., τ_m and P is a penalty term to prevent over-fitting. There are many ways to define the cost and penalties. A frequently used cost is the negative log-likelihood. Here, we will use

    C(m; τ_1, ..., τ_m) = −2 ln L_opt(θ | m; τ_1, ..., τ_m),

where L_opt(θ | m; τ_1, ..., τ_m) is the time series likelihood (Gaussian based) optimized over all parameters θ given that m changepoints occur at the times τ_1, ..., τ_m.

Penalties can be constructed in a variety of ways. Common penalties include minimum description lengths (MDL), the modified Bayesian Information Criterion (mBIC), and the classic BIC penalty. AIC is another popular penalty, despite it not providing consistent estimates of the number or locations of the changepoint(s). Of these four penalties, AIC and BIC are simple multiples of the number of changepoints, while the MDL and mBIC further incorporate changepoint time information. The forms of these penalties are listed in Table 1.

Table 1: Penalized Likelihood Objective Functions

    Criterion   Objective Function
    AIC         N ln(σ̂²) + 2(2m + 3)
    BIC         N ln(σ̂²) + (2m + 2) ln(N)
    mBIC        N ln(σ̂²) + m ln(N) + Σ_{i=1}^{m+1} ln((τ_i − τ_{i−1})/N)
    MDL         N ln(σ̂²) + ln(m) + Σ_{i=1}^{m+1} ln(τ_i − τ_{i−1}) + Σ_{i=2}^{m} ln(τ_i)

Here, σ̂² is the estimated white noise variance of the {ε_t} process that drives the ARMA errors. MDL penalties are based on information theory and are discussed further in Davis et al. (2006) and Li and Lund (2012). The mBIC penalty is developed in Zhang and Siegmund (2007). These two penalties are taken as zero when m = 0.
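For a fixed configuration, evaluating the Table 1 objectives is simple bookkeeping. The sketch below (function name ours) assumes the fitted white noise variance has already been obtained from the optimized likelihood, and takes the mBIC and MDL penalties as zero when m = 0, as in the text.

```python
import numpy as np

def penalized_objectives(n, sigma2_hat, taus):
    """Evaluate the Table 1 objective functions for one configuration (sketch).

    taus holds the ordered interior changepoint times; sigma2_hat is the
    fitted white noise variance of the error process.
    """
    m = len(taus)
    bounds = [0] + list(taus) + [n]
    seg = np.diff(bounds).astype(float)          # segment lengths tau_i - tau_{i-1}
    base = n * np.log(sigma2_hat)
    out = {"AIC": base + 2 * (2 * m + 3),
           "BIC": base + (2 * m + 2) * np.log(n)}
    if m == 0:                                   # mBIC and MDL penalties vanish
        out["mBIC"] = base
        out["MDL"] = base
    else:
        out["mBIC"] = base + m * np.log(n) + np.sum(np.log(seg / n))
        out["MDL"] = (base + np.log(m) + np.sum(np.log(seg))
                      + sum(np.log(t) for t in taus[1:]))
    return out
```

With σ̂² = 1 the base term vanishes, so the returned values are exactly the penalties, which makes the four criteria easy to compare by hand.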
The mBIC penalty tends to be larger than the MDL penalty for the same changepoint configuration; as such, MDL will often select models with more changepoints than mBIC.

With penalized likelihood approaches, a computational bottleneck arises. Since there are (N−1 choose m) different admissible changepoint configurations in a series with m changepoints (time N cannot be a changepoint), there are 2^{N−1} different changepoint configurations to consider when analyzing the entire series. This huge count makes an exhaustive model search, one that evaluates all admissible changepoint configurations, virtually impossible to conduct, even when N is as small as 100. Unfortunately, PELT (Killick et al., 2012) and FPOP (Maidstone et al., 2017), two rapid dynamic programming based techniques, require the objective function to be additive over distinct regimes. The presence of global parameters like the autocovariance function violates this restriction. Regime-additive likelihoods will not arise when {ε_t} is ARMA(p, q), although Bai and Perron (1998) argue that any boundary contribution is negligible if the ARMA parameters are allowed to change at each changepoint time (this is not the case here). Instead, a genetic algorithm will be used to optimize our penalized likelihoods exactly (including global autocovariances).

Elaborating on the above, optimal changepoint configurations estimated by penalized likelihoods can be found through a genetic algorithm (GA) search (Davis et al., 2006; Li and Lund, 2012). A GA is an intelligent random walk search that tends to avoid evaluating clearly suboptimal changepoint configurations. Unfortunately, the objective function in (20) is not convex, and its optimization is problematic; additional research is needed to optimize these penalized likelihoods. Our GA encodes the changepoint configuration into a binary string and uses the R GA package from Scrucca (2013). This GA has worked reliably in our experience.
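The 2^{N−1} blow-up is easy to feel in code. The sketch below exhaustively scores every configuration with a BIC-style penalty; it uses a simplified IID least-squares cost rather than the paper's time series likelihood, and is only feasible for tiny N, which is precisely why genetic algorithms or dynamic programming are needed in practice.

```python
from itertools import combinations
from math import log

def exhaustive_search(x, penalty=lambda m, n: (2 * m + 2) * log(n)):
    """Exhaustive penalized least-squares search (BIC-style penalty by default).

    Scores all 2^(N-1) changepoint configurations of the series x.  The IID
    least-squares cost here is a stand-in for the optimized time series
    likelihood used in the text.
    """
    n = len(x)
    best_cost, best_cfg = float("inf"), None
    for m in range(n):                           # number of changepoints
        for taus in combinations(range(1, n), m):
            bounds = (0,) + taus + (n,)
            rss = 0.0
            for a, b in zip(bounds, bounds[1:]):  # residuals about segment means
                seg = x[a:b]
                mu = sum(seg) / len(seg)
                rss += sum((v - mu) ** 2 for v in seg)
            sigma2 = max(rss / n, 1e-12)          # guard against log(0)
            cost = n * log(sigma2) + penalty(m, n)
            if cost < best_cost:
                best_cost, best_cfg = cost, taus
    return list(best_cfg)
```

Even N = 12 already requires scoring 2,048 configurations; doubling N squares the count, so an exhaustive search at N = 100 is hopeless.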
5 Multiple Changepoint Simulations

In presenting simulation results for different scenarios, we only present graphics that are judged informative. In general, for each simulation case considered, graphics of distances, average numbers of detected changepoints, and empirical probabilities of estimating the correct number of changepoints were produced. The supplementary material contains any graphics that are not included here. Similarly, we focus on unit mean shift sizes in the main text body unless otherwise noted; results for different shift sizes are presented in the supplementary material.

The changepoint configurations that we consider are illustrated in Figure 7, which shows sample time series generated with the various mean shift configurations. These configurations range from scenarios with no or few changepoints to those with a large number of changepoints. All series have length N = 500.

AIC performs miserably in all our scenarios, always selecting an excessive number of changepoints. Since plotting AIC results would distort all other graphical comparisons, AIC results are not presented, to accentuate differences between the remaining methods.

Figure 7: AR(1) Time Series with Different Changepoint Settings. Panels: (a) one changepoint; (b) three changepoints; (c) three changepoints; (d) nine changepoints; (e) Keyblade; (f) random changepoints.

Before presenting our simulations, we discuss how to compare an estimated multiple changepoint segmentation to its true value. The estimated multiple changepoint configuration could have a different number of changepoints than the true configuration. For a single changepoint method, such a comparison is easy: examine first whether the method flags a changepoint, and then any distance from the true changepoint time.
With multiple changepoint configurations, this comparison is complicated by the fact that different segmentations may have different numbers of changepoints: which changepoint times in one particular configuration correspond to those in another may be nebulous.

To compare different methods, a distance between the two changepoint configurations C_1 = (m; τ_1, ..., τ_m) and C_2 = (k; η_1, ..., η_k) will now be developed. Several distances have been utilized in the multiple changepoint field. Some, such as the mean squared error (MSE) of the fitted means, the V-measure, or the Hausdorff distance, are not specific to changepoint problems. Others, such as the number of changepoints or true/false positive detection rates, are more tailored to the changepoint problem. However, each of these statistics quantifies only one aspect of the fit. For example, the MSE could be low, but the number of changepoints could still be overestimated; or the number of changepoints could be correct, but their locations could be inaccurate. As such, we introduce a new changepoint-specific distance balancing the two key components of multiple changepoint analysis: (1) the number of changepoints and (2) their individual locations.

To balance the number and location aspects of changepoint configurations, two components in our distance are needed. The first measures the discrepancy in the numbers of changepoints in the two configurations, for which we use the absolute difference. The second component measures discrepancies in the changepoint times. This is trickier to quantify as the number of changepoints may be different in the two configurations and some sort of "matching procedure" is needed. For two changepoint segmentations, C_1 and C_2, the distance used here is

    d(C_1, C_2) = |m − k| + min A(C_1, C_2).      (21)

The term |m − k| assigns the difference in changepoint numbers for any mismatch in the total number of changepoints.
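A small sketch makes the two components of (21) concrete. Instead of a linear assignment solver, the snippet below (function name ours) matches the smaller configuration injectively into the larger one by brute force over permutations, which is adequate for a handful of changepoints; a proper assignment algorithm scales better.

```python
from itertools import permutations

def config_distance(tau, eta, n):
    """Distance (21) between two changepoint configurations (brute-force sketch).

    tau and eta are the two lists of changepoint times and n is the series
    length; the location cost |tau_i - eta_j| / n mirrors the text.
    """
    a, b = sorted([list(tau), list(eta)], key=len)     # ensure len(a) <= len(b)
    mismatch = abs(len(tau) - len(eta))                # the |m - k| term
    if not a:
        return float(mismatch)
    best = min(sum(abs(x - y) for x, y in zip(a, perm)) / n
               for perm in permutations(b, len(a)))
    return mismatch + best
```

For example, comparing one changepoint at 100 against two at {100, 200} in a series of length 500 gives distance 1: the count mismatch contributes 1 and the matched location contributes 0.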
The term min A(C_1, C_2) reflects the smallest cost that matches changepoint locations in C_1 to those in C_2. This term can be computed via the following linear assignment problem:

    A(C_1, C_2) = Σ_{i=1}^{k} Σ_{j=1}^{m} c_{i,j} I_{i,j},

subject to the constraints Σ_{i=1}^{k} I_{i,j} = 1 for j ∈ {1, ..., m} and Σ_{j=1}^{m} I_{i,j} ≤ 1 for i ∈ {1, ..., k}. Here, the cost of assigning τ_i to η_j is taken simply as c_{i,j} = |τ_i − η_j|/N, and I_{i,j} ∈ {0, 1} is the decision variable, with I_{i,j} = 1 if τ_i is assigned to η_j and I_{i,j} = 0 otherwise. This linear assignment problem can be efficiently solved with algorithms in Burkard et al. (2012).

One can verify that (21) defines a legitimate distance satisfying the triangle inequality. The larger the distance, the worse the two configurations correspond to one another. The term min A(C_1, C_2) can be shown to be bounded by unity and measures how closely the two changepoint configurations match up to one another. When both configurations have many changepoints, the distance is dominated by the |m − k| term. In our simulations, estimated multiple changepoint configurations will be compared to the true changepoint configuration with this distance.

Many modern multiple changepoint simulation studies increasingly focus on cases with a large number of changepoints, eschewing single and no changepoint scenarios. We include such scenarios here to help illuminate the differences between the methods.

Our first simulation considers the changepoint free case in an AR(1) Gaussian series having various correlation parameters φ, N = 500, and σ = 1. Figure 8 shows probabilities of falsely declaring one or more changepoints over 10,000 independent simulations. Unlike the single changepoint case, the methods here are not aiming to control any false positive rate.

Figure 8: Empirical False Positive Detection Rates for an AR(1) Series with Various φ. Methods: BIC+GA, BS(SCUSUMz), mBIC+GA, MDL+GA, WBS(C=1.3).
The results show that BIC, mBIC, and binary segmentation perform best, with WBS and MDL performing significantly worse. It is worth noting that WBS has a markedly higher false positive rate, an issue discussed further in Lund and Shi (2020). Binary segmentation is arguably best here, an expected finding since there are no changepoints (an AMOC test applied to the series' one-step-ahead prediction residuals should not see a changepoint and stops any recursion at its onset). All methods perform better with negative φ than with positive φ; performance of all methods degrades as φ moves upwards towards unity (as expected).

We now move to simulations with one changepoint in the same AR(1) setup above. The changepoint is placed in the middle of the series, at time t = 251. Figure 9 shows the average distances between the estimated changepoint configurations and the true configuration. While there are no huge discrepancies between the methods, for heavily correlated series, binary segmentation is the worst and MDL and mBIC the best. Again, all tests degrade as φ approaches unity. MDL has the least variability across φ. Compared to the single changepoint results, the multiple changepoint penalties are more conservative than the LRT. Also, since the average distance is less than unity, the correct number of changepoints is often being identified.

Figure 9: Average Distances for an AR(1) Series with Varying φ. Truth: One Changepoint in the Middle Moving the Series Upwards.

Our next case moves to a setting with three mean shifts, partitioning the series into four equal-length regimes. The changepoints occur at times 126, 251, and 376, with each changepoint shifting the series upward by one unit (up-up-up). As before, Figure 10 reports average distances.
MDL performs the worst for negative φ, while the other methods perform similarly. Perhaps surprisingly, binary segmentation starts to degrade when φ becomes positive, with the others also degrading, but to a lesser extent. BIC performs best across all φ.

Figure 10: Average Distances for an AR(1) Series with Different φ. Truth: Three Equally Spaced Changepoints Moving the Series Up-Up-Up.

Three Alternating Changepoints

Next, we consider another three changepoint configuration, the changepoint times again being equally spaced, but this time moving the series up, then down, and then up again (up-down-up). Figure 11 reports the distances. All methods have a harder time than with the last up-up-up changepoint configuration. In this setting, binary segmentation becomes fooled and estimates too few changepoints; mBIC also does not do as well as the other methods. MDL and WBS work better, the surprise winner being BIC.

Figure 11: Average Distances for an AR(1) Series with Varying φ. Truth: Three Equally Spaced Changepoints Moving the Series Up-Down-Up.

Next, we move to cases with nine changepoints. Our first set of simulations equally spaces all changepoint times in the record, each moving the series higher (all up). Because the changepoints are more difficult to detect, we have increased the absolute mean shift magnitude to two units; this induces more separation between the methods, allowing for an easier comparison. Figure 12 displays distances for this setting. The winners are BIC and MDL; the losers are WBS and binary segmentation.
Figure 12: Average Distances for an AR(1) Series with Varying φ. Truth: Nine Changepoints, All Up.

Our next set of simulations again considers nine changepoints, but the directions of the equally spaced mean shifts of magnitude two now alternate in an up-down-up-down-up-down-up-down-up fashion (alternating). Figure 13 displays the results. The best method here is again BIC, with WBS doing better than in the previous setting; mBIC is a laggard and binary segmentation is again the worst.

Figure 13: Average Distances for an AR(1) Series with Varying φ. Truth: Nine Alternating Changepoints.

Nine Keyblade Changepoints

As a different type of setup, we consider a nine changepoint setting where the sizes of the nine mean shifts vary, their shift directions vary, and the changepoint times are not equally spaced. Figure 7(d) shows our chosen pattern for $E[X_t]$, which we call a "keyblade". The distances in Figure 14 reveal BIC and MDL as winners, and WBS and binary segmentation as inferior.

Figure 14: Average Distances for the Keyblade AR(1) Series with Varying φ. Truth: Nine Changepoints.

We now consider settings with a random number of changepoints simulated from a Poisson distribution with a mean of five. The locations of any mean shifts are placed uniformly in the set $\{1, \ldots, N\}$ without replacement. The mean of each segment is simulated from a normal distribution with a zero mean and a standard deviation of 1.5.
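A random configuration of this type can be drawn as follows. This is a sketch under our own reading of the setup, not the authors' code: in particular, we restrict shift locations to {2, ..., N} so that every shift has data on both sides, and we take σ as the marginal (not innovation) standard deviation of the AR(1) errors; the paper's exact conventions may differ.

```python
import numpy as np

def random_mean_configuration(N, rate=5.0, mean_sd=1.5, rng=None):
    """Draw a random changepoint configuration: a Poisson(rate) number of
    changepoints, locations sampled uniformly without replacement, and an
    independent N(0, mean_sd**2) mean for each resulting regime.
    Returns (mu, times), where mu is the length-N mean function."""
    rng = np.random.default_rng() if rng is None else rng
    m = min(rng.poisson(rate), N - 1)          # at most N - 1 visible shifts
    # Assumption: admissible shift times are {2, ..., N}.
    times = np.sort(rng.choice(np.arange(2, N + 1), size=m, replace=False))
    bounds = np.concatenate(([1], times, [N + 1]))   # regime start times
    mu = np.empty(N)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mu[lo - 1:hi - 1] = rng.normal(0.0, mean_sd)  # one mean per regime
    return mu, times

def ar1_noise(N, phi, sigma, rng):
    """Stationary zero-mean Gaussian AR(1) errors with lag-one correlation
    phi (|phi| < 1) and marginal standard deviation sigma."""
    e = np.empty(N)
    e[0] = rng.normal(0.0, sigma)              # start in stationarity
    innov_sd = sigma * np.sqrt(1.0 - phi ** 2)
    for t in range(1, N):
        e[t] = phi * e[t - 1] + rng.normal(0.0, innov_sd)
    return e

rng = np.random.default_rng(7)
mu, times = random_mean_configuration(500, rng=rng)
x = mu + ar1_noise(500, phi=0.5, sigma=1.0, rng=rng)   # one simulated series
```

Each simulated series can then be handed to the competing methods and the estimated configuration scored against `times` with the distance in (21).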
Figure 15 summarizes the results: BIC and MDL are again superior and binary segmentation inferior.

Figure 15: Average Distance between the Estimated and True Changepoint Locations.

It was surprising to us how well the simple BIC penalty has done so far, especially since this penalty does not depend on the changepoint times. To examine this issue further, we fix the AR(1) parameter at φ = 0.5 and let N vary, with one and three changepoints. Here, the changepoints induce equal-length regimes, all mean shift sizes are of unit magnitude, and their directions alternate, the first direction being upwards. Table 2 reports average BIC and mBIC distances when N ∈ {500, 1000, 2500}. As the sample size increases, the additional penalty that mBIC places on the segment lengths results in fewer changepoints being identified than for BIC. As N grows, there is a tendency for BIC to add (erroneous) changepoints in some samples. Thus, as the number of changepoints and N grow, mBIC is better than BIC. This leads us to recommend mBIC over BIC for larger N or larger numbers of changepoints.

Table 2: Comparison of BIC and mBIC. Truth: m changepoints, all of unit magnitude, placed in alternating directions that equally space the record for an AR(1) series with varying length N. Here, σ = 1 and φ = 0.5.

              m = 1              m = 3
              BIC      mBIC      BIC      mBIC
N = 500       0.227    0.125     1.270    2.
N = 1000      0.126    0.066     0.311    0.
N = 2500      0.121    0.047     0.123    0.

We next fix N = 500, φ = 0.5, and σ = 1, and consider three alternating changepoints placed at the times 126, 251, and 376, varying the mean shift magnitude ∆. Average distances over 1,000 simulations are reported in Table 3. As the mean shift magnitude increases, all methods improve. BIC and MDL, two frequent winners in past scenarios, perform worst when the mean shift size is largest; moreover, WBS and binary segmentation, two frequent past losers, perform best.
mBIC reports the smallest average distance when ∆ ≥ 2.

         BIC+GA   mBIC+GA   MDL+GA   BS(SCUSUMz)   WBS(C=1.3)
∆ = 1    1.269    2.424     1.948    2.702         1.686
∆ = 2    0.140    0.051     0.209    0.843         0.149
∆ = 3    0.126    0.042     0.188    0.077         0.079

Table 3: Average Distances for an AR(1) Series with Varying Mean Shift Magnitudes.

Our final simulation task considers other autoregressive error structures. We begin with AR(2) errors and the case of no changepoints. Table 4 shows false positive rates of signaling one or more changepoints when in truth none exist, for various AR(2) parameters φ1 and φ2. In this and all four tables below, 1,000 independent simulations are conducted, N = 500, σ = 1, and all mean shift sizes are two units (this adds information to the cases above, where mean shifts were of unit size). All four tables are discussed in tandem after they are presented.

Table 4: False Positive Rates for an AR(2) Series with Varying {φ1, φ2}. Truth: No Changepoints.

Table 5 reports average distances for the AR(2) scenario of the last table, but now with three changepoints. The three shifts induce four equal-length regimes and shift the series mean in an up-down-up manner.

Table 5: Average Distances for an AR(2) Series with Varying {φ1, φ2}. Truth: Three Alternating Changepoints of Size ∆ = 2.

Table 6 shows false positive rates of signaling one or more changepoints when in truth there are none, for various parameter choices in an AR(4) series.

Table 6: False Positive Rates for an AR(4) Series with Varying {φ1, φ2, φ3, φ4}. Truth: No Changepoints.

Finally, Table 7 reports average distances over 1,000 independent simulations for the same AR(4) scenario above. The mean shift specifications are repeated from Table 5.
Table 7: Average Distances for AR(4) Errors with Varying {φ1, φ2, φ3, φ4}. Truth: Three Alternating Changepoints of Size ∆ = 2.

In the above tables, when there are no changepoints, binary segmentation appears best and MDL and WBS worst, repeating the conclusions for AR(1) errors. In the tables with three changepoints and heavily positively correlated errors, MDL, BIC, and WBS all do comparatively well; when the correlation becomes negative, the situation reverses and mBIC and binary segmentation are best. These aspects were also seen for the AR(1) series, although we did not remark on the negatively correlated results there.

To summarize our overall conclusions on multiple changepoints, the following points emerge:

• AIC and binary segmentation are not competitive. Binary segmentation worked well only when no or few changepoints existed and worsened when multiple mean shifts act in opposite directions. We do not recommend either of these techniques.

• Although it does not depend on the changepoint configuration, BIC is surprisingly good across a wide range of scenarios. However, as N gets larger, mBIC becomes superior.

• MDL was often the best penalized likelihood technique in heavily correlated scenarios, but does not work as well with negatively correlated series. MDL also tends to lose to mBIC when the changepoint mean shift sizes are large or when changepoints are infrequent.

• MDL and WBS techniques should be used with caution if there is a possibility that no changes are present, as they have high false positive rates.

• BIC and mBIC perform well in low-frequency changepoint settings.

We close with one more comment that is not apparent from the reported results. The MDL penalty works reasonably in a large variety of positively correlated scenarios. However, when it is wrong, it has a tendency to place several changepoint times very close to one another, typically near the beginning of the record.
This is an attempt by the method to flag an outlier. If one imposes a minimum spacing between changepoint times to combat this, the method becomes much better.

This paper presented a systematic comparison of common single and multiple changepoint techniques in time series settings. Previous work had demonstrated how blindly applying techniques that assume IID data can lead to erroneous conclusions. Here, we focused on how IID methods could be modified for the time series setting, either by correcting the asymptotic distribution or by modifying the test statistic.

In constructing our comprehensive approach, a summary of the major available techniques was made in a single manuscript; hence, this paper has utility as a reference. A new changepoint distance was also developed that combines the two important features of changepoint detection, identification of the correct number and location(s) of the changepoints, within a single metric.

In the single changepoint case, it was found that the best techniques apply IID methods to the time series of one-step-ahead prediction residuals. The best performing single changepoint detection method was the sum of squared CUSUM statistics in Bai (1993). Extreme value based asymptotic tests exhibited poor detection power.

In the multiple changepoint case, conclusions were more nebulous; however, binary segmentation and AIC are not recommended. The penalized likelihoods MDL, mBIC, and BIC are all worthy of additional study. WBS methods also performed reasonably and deserve additional attention, especially given their relatively recent entrance into the literature. At this point, it is still not clear whether purely algorithmic techniques can beat penalized likelihood methods. It is our view that one should use BIC penalized likelihood methods for large numbers of changepoints and/or small data lengths, with mBIC recommended for smaller numbers of changepoints and/or longer data records.
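To make the recommended penalized likelihood recipe concrete, here is a minimal sketch for the Gaussian mean-shift model. As assumptions of our own, it swaps the paper's genetic algorithm search for an exact segment-neighborhood dynamic program (Auger and Lawrence, 1989) and uses a simple Gaussian BIC of the form N log(SSE/N) + m log N; it also ignores autocorrelation, so in practice it would be applied to one-step-ahead prediction residuals as discussed above.

```python
import numpy as np

def sn_rss(x, max_m):
    """Segment-neighborhood dynamic program: sn_rss(x, M)[m] is the smallest
    residual sum of squares over all piecewise-constant mean fits with m
    changepoints.  O(M * N^2) time; fine for illustration."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(a, b):  # SSE of x[a:b] about its own mean (0 <= a < b <= N)
        return s2[b] - s2[a] - (s[b] - s[a]) ** 2 / (b - a)

    cost = np.full((max_m + 1, N + 1), np.inf)
    cost[0, 0] = 0.0
    for t in range(1, N + 1):
        cost[0, t] = sse(0, t)
    for m in range(1, max_m + 1):
        for t in range(m + 1, N + 1):
            cost[m, t] = min(cost[m - 1, u] + sse(u, t) for u in range(m, t))
    return cost[:, N]

def choose_m_by_bic(x, max_m=5):
    """Pick the changepoint count minimizing N*log(SSE_m/N) + m*log(N)."""
    N = len(x)
    rss = sn_rss(x, max_m)
    bic = [N * np.log(rss[m] / N) + m * np.log(N) for m in range(max_m + 1)]
    return int(np.argmin(bic))

# Toy example: one mean shift of size three in white noise.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
x[100:] += 3.0
m_hat = choose_m_by_bic(x, max_m=3)
```

Swapping the BIC line for an mBIC or MDL penalty changes only the model-selection step; the dynamic program (or a genetic algorithm, for location-dependent penalties on long series) is unchanged.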
References

Antoch, J., Hušková, M., and Prášková, Z. (1997). Effect of dependence on statistics for determination of change. Journal of Statistical Planning and Inference, 60(2):291–310.

Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology, 51(1):39–54.

Bai, J. (1993). On the partial sums of residuals in autoregressive and moving average models. Journal of Time Series Analysis, 14(3):247–260.

Bai, J. and Perron, P. (1998). Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78.

Brockwell, P. and Davis, R. (1991). Time Series: Theory and Methods. Springer-Verlag, 2nd edition.

Burkard, R., Dell'Amico, M., and Martello, S. (2012). Assignment Problems, Revised Reprint, volume 106. SIAM.

Chakar, S., Lebarbier, E., Lévy-Leduc, C., and Robin, S. (2017). A robust approach for estimating change-points in the mean of an AR(1) process. Bernoulli, 23(2):1408–1447.

Chapman, J.-L., Eckley, I. A., and Killick, R. (2020). A nonparametric approach to detecting changes in variance in locally stationary time series. Environmetrics, 31(1):e2576.

Chen, J. and Gupta, A. K. (2011). Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance. Springer Science & Business Media.

Csörgő, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis. John Wiley & Sons, Chichester.

Davis, R. A., Lee, T. C. M., and Rodriguez-Yam, G. A. (2006). Structural break estimation for nonstationary time series models. Journal of the American Statistical Association, 101(473):223–239.

Dehling, H., Fried, R., Garcia, I., and Wendler, M. (2015). Change-point detection under dependence based on two-sample U-statistics. In Asymptotic Laws and Methods in Stochastics, pages 195–220. Springer.

Eichinger, B. and Kirch, C. (2018). A MOSUM procedure for the estimation of multiple random change points.
Bernoulli, 24(1):526–564.

Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281.

Gallagher, C., Lund, R., and Robbins, M. (2012). Changepoint detection in daily precipitation data. Environmetrics, 23(5):407–419.

Harchaoui, Z. and Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493.

Jandhyala, V., Fotopoulos, S., MacNeill, I., and Liu, P. (2013). Inference for single and multiple change-points in time series. Journal of Time Series Analysis, 34(4):423–446.

Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.

Kirch, C. (2007). Resampling in the frequency domain of time series to determine critical values for change-point tests. Statistics & Risk Modeling, 25(3):1–25.

Kirch, C. and Muhsal, B. (2014). A MOSUM procedure for the estimation of multiple random change points. Preprint.

Lavielle, M. and Moulines, E. (2000). Least-squares estimation of an unknown number of shifts in a time series. Journal of Time Series Analysis, 21(1):33–59.

Li, S. and Lund, R. (2012). Multiple changepoint detection via genetic algorithms. Journal of Climate, 25(2):674–686.

Lu, Q. and Lund, R. B. (2007). Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35(3):447–458.

Lund, R. and Shi, X. (2020). Short communication: "Detecting possibly frequent change-points: Wild binary segmentation 2". arXiv preprint arXiv:2006.10845.

Lund, R., Wang, X. L., Lu, Q. Q., Reeves, J., Gallagher, C., and Feng, Y. (2007). Changepoint detection in periodic and autocorrelated time series. Journal of Climate, 20(20):5178–5190.

MacNeill, I. B. (1974). Tests for change of parameter at unknown times and distributions of some related functionals on Brownian motion.
The Annals of Statistics, pages 950–962.

Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2017). On optimal multiple changepoint algorithms for large data. Statistics and Computing, 27(2):519–533.

Matteson, D. S. and James, N. A. (2014). A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association, 109(505):334–345.

Menne, M. J., Williams, C. N., Jr., and Vose, R. S. (2009). The U.S. Historical Climatology Network monthly temperature data, version 2. Bulletin of the American Meteorological Society, 90(7):993–1008.

Olshen, A. B., Venkatraman, E., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572.

Page, E. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika, 42(3/4):523–527.

Robbins, M., Gallagher, C., Lund, R., and Aue, A. (2011). Mean shift testing in correlated data. Journal of Time Series Analysis, 32(5):498–511.

Robbins, M. W., Gallagher, C. M., and Lund, R. B. (2016). A general regression changepoint test for time series data. Journal of the American Statistical Association, 111(514):670–683.

Scott, A. J. and Knott, M. (1974). A cluster analysis method for grouping means in the analysis of variance. Biometrics, pages 507–512.

Scrucca, L. (2013). GA: a package for genetic algorithms in R. Journal of Statistical Software, 53(4):1–37.

Shen, J., Gallagher, C. M., and Lu, Q. (2014). Detection of multiple undocumented change-points using adaptive lasso. Journal of Applied Statistics, 41(6):1161–1173.

Shi, X., Lund, R., Killick, R., and Gallagher, C. (Forthcoming). Autocovariance estimation in the presence of changepoints by differencing. Journal of Time Series Analysis.

Stoica, P. and Moses, R. L. (2005). Spectral Analysis of Signals. Pearson Prentice Hall, Upper Saddle River, NJ.

Tolmatz, L. (2002).
On the distribution of the square integral of the Brownian bridge. The Annals of Probability, 30(1):253–269.

Zhang, N. R. and Siegmund, D. O. (2007). A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data.