A Comparison of Single and Multiple Changepoint Techniques for Time Series Data
Xueheng Shi, Colin Gallagher, Robert Lund and Rebecca Killick

January 7, 2021
Abstract
This paper describes and compares several prominent single and multiple changepoint techniques for time series data. Due to their importance in inferential matters, changepoint research on correlated data has accelerated recently. Unfortunately, small perturbations in model assumptions can drastically alter changepoint conclusions; for example, heavy positive correlation in a time series can be misattributed to a mean shift should correlation be ignored. This paper considers both single and multiple changepoint techniques. The paper begins by examining cumulative sum (CUSUM) and likelihood ratio tests and their variants for the single changepoint problem; here, various statistics, boundary cropping scenarios, and scaling methods (e.g., scaling to an extreme value or Brownian bridge limit) are compared. A recently developed test based on summing squared CUSUM statistics over all times is shown to have realistic Type I errors and superior detection power. The paper then turns to the multiple changepoint setting. Here, penalized likelihoods drive the discourse, with AIC, BIC, mBIC, and MDL penalties being considered. Binary and wild binary segmentation techniques are also compared. We introduce a new distance metric specifically designed to compare two multiple changepoint segmentations. Algorithmic and computational concerns are discussed and simulations are provided to support all conclusions. In the end, the multiple changepoint setting admits no clear methodological winner, performance depending on the particular scenario. Nonetheless, some practical guidance emerges.
1 Introduction

Changepoints (abrupt shifts) arise in many time series due to changes in recording equipment, observers, etc. In climatology, temperature trends computed from raw data can be misleading if homogeneity adjustments for station relocation moves and gauge changes are not a priori made to the record. Lu and Lund (2007) give an example where trend conclusions reverse when changepoint information is neglected. Cases with multiple changepoints are also frequently encountered; for example, in climatology, United States weather stations average about six station moves and/or gauge changes per century of operation (Menne et al., 2009).

This paper intends to guide the researcher on the best changepoint techniques to use in common time series scenarios. Assumptions are crucial in changepoint analyses and can significantly alter conclusions; here, correlation issues take center stage. It is known that changepoint inferences made from positively correlated series can be spurious if correlation is not taken into account. Even lag one correlations as small as 0.25 can have deleterious consequences for changepoint conclusions (Lund et al., 2007).

This paper's primary contribution is to extend/modify many of the popular changepoint methods for IID data to correlated settings. Much of our work lies with developing methods that put all techniques, to the best extent possible, on the same footing in time series settings. For example, we will see that single changepoint tests generally work best when applied to estimated versions of the series' one-step-ahead prediction residuals, computed under a null hypothesis of no changepoints. Because of this, tests that handle one-step-ahead prediction residuals need to be developed.
Two other novel contributions in this article are: (1) developing and proposing a new single changepoint test based on the square of the cumulative sum of one-step-ahead prediction residuals (see Section 2.2), and (2) presenting a new distance that compares multiple changepoint segmentations (see Section 5.1). The comparative aspect of the paper is yet another contribution, and there is much to compare. In addition to comparing different statistics via Type I errors and powers, the paper also compares different asymptotic scaling methods.

Academic changepoint research commenced with the single changepoint case for independent and identically distributed (IID) data in Page (1955). The subject is now vast, with hundreds of papers devoted to the topic. With our lofty objectives, some concessions are necessary. Foremost, this paper examines mean shift changepoints only; that is, while series mean levels are allowed to abruptly shift, the variances and correlations of the series are held constant (stationary) in time. Changepoints can also occur in variances (volatilities) (Chapman et al., 2020), in the series' correlation structures (Davis et al., 2006), or even in the marginal distribution of the series (Gallagher et al., 2012). Secondarily, the simulation results reported here are for Gaussian series only. Thirdly, we compare the most common types of techniques within the literature, notably excluding those based on energy statistics (Matteson and James, 2014), moving sums (Eichinger and Kirch, 2018), and U-statistics (Dehling et al., 2015).

The rest of this paper proceeds as follows. Section 2 overviews single changepoint detection methods, typically referred to as at most one changepoint (AMOC) tests. Here, a variety of test statistics and their scalings are reviewed and adapted to the time series setting.
We specifically discuss two methods for modifying changepoint techniques based on IID data: (1) retain the IID test statistic and modify the limiting distribution for any correlation; and (2) modify the test statistic to account for the correlation. Section 3 compares AMOC detectors in a simulation study. Thereafter, we move to the case of multiple changepoints, where performance assessment becomes more challenging. Here, a novel changepoint configuration distance specifically designed for our comparisons is developed. Simulations in Section 5 consider a variety of multiple changepoint configurations. We summarize results in Section 6 with recommendations for practitioners.

2 Single Changepoint Techniques
Let {X_t}_{t=1}^N be the observed time series and γ(h) = Cov(X_{t+h}, X_t) be the lag h autocovariance of the series. We wish to test whether there exists a change in the mean structure while assuming the second order structure is constant over time. An AMOC model with the changepoint occurring at the unknown time k is

    X_t = μ + ε_t,       for 1 ≤ t ≤ k,
    X_t = μ + Δ + ε_t,   for k + 1 ≤ t ≤ N,      (1)

where μ is an unknown location parameter, Δ is the magnitude of the mean shift at time k, and {ε_t} is a stationary time series with zero mean and lag h autocovariance γ(h). A hypothesis test for this scenario is

    H_0: Δ = 0   versus   H_A: Δ ≠ 0 for some k ∈ {1, ..., N − 1}.      (2)

When the {ε_t} are independent, cumulative sum (CUSUM) and likelihood ratio tests (LRT) are well understood; see Csörgő and Horváth (1997) and Chen and Gupta (2011). When incorporating general stationary autocovariance aspects into a changepoint testing framework, there are two common strategies: (1) keep the IID test statistic and identify any changes in the limiting distribution induced by the correlation; and (2) incorporate the autocovariance within the test statistic. Antoch et al. (1997) provide a summary of the first approach for many common changepoint statistics and provide simulations indicating how autocorrelation impacts the performance of the hypothesis tests; Kirch (2007) uses resampling techniques to improve the finite sample performance of these tests. Robbins et al. (2011) show that estimating and using the autocorrelation (the second approach) is preferable with CUSUM and LRTs.

The CUSUM method was first introduced by Page (1955) and compares sample means before and after each admissible changepoint time via the statistic

    max_{1 ≤ k < N} CUSUM_X(k/N),   where   CUSUM_X(k/N) = N^{-1/2} | Σ_{t=1}^{k} X_t − (k/N) Σ_{t=1}^{N} X_t |,

scaled, in the correlated case, by an estimate of the long-run standard deviation. An alternative applies the same statistic to the estimated one-step-ahead prediction residuals {Ẑ_t}, computed under the null hypothesis of no changepoints. The estimator σ̂² = N^{-1} Σ_{t=1}^{N} Ẑ_t² is used to estimate the variance of Z_t.
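As a concrete illustration, the CUSUM process above takes only a few lines to compute. The sketch below (the function name is ours, not the paper's) returns the unscaled maximal statistic and its maximizing time; in the correlated setting it would be applied to estimated one-step-ahead prediction residuals and divided by a long-run standard deviation estimate.

```python
import numpy as np

def cusum_statistic(x):
    """Unscaled CUSUM statistic: max over k of |S_k - (k/N) S_N| / sqrt(N).

    Dividing by a long-run standard deviation estimate gives the test
    statistic whose null limit is the supremum of a Brownian bridge.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.cumsum(x)                           # S_1, ..., S_N
    k = np.arange(1, n)                        # admissible times 1, ..., N-1
    cusum = np.abs(s[:-1] - (k / n) * s[-1]) / np.sqrt(n)
    k_hat = int(np.argmax(cusum)) + 1          # estimated changepoint time
    return cusum.max(), k_hat
```

For a noise-free level shift the statistic peaks exactly at the shift time, which makes the sketch easy to sanity-check by hand.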
The residual CUSUM statistic at time k is CUSUM_Z(k/N), the CUSUM process computed from the estimated one-step-ahead prediction residuals {Ẑ_t}; the test rejects for large values of max_{1 ≤ k < N} CUSUM_Z(k/N)/σ̂.

Theorem 2.2.1. Assume that {X_t} follows (1), {ε_t} has the causal linear representation assumed in Theorem 1, and η̂² is a null hypothesis based consistent estimator of η², the long-run variance defined in (4). Then under H_0,

    SCUSUM_X = (1/N) Σ_{k=1}^{N} [ CUSUM_X(k/N) / η̂ ]²  →_D  ∫_0^1 B(t)² dt.      (11)

Our second approach for incorporating correlation uses the one-step-ahead prediction residuals in place of the original data. The SCUSUM test statistic for this scheme is

    SCUSUM_Z := (1/N) Σ_{k=1}^{N} [ CUSUM_Z(k/N) / σ̂ ]².      (12)

The asymptotic distribution of (12) can be derived from Theorem 2.1.2 via the continuous mapping theorem.

Theorem 2.2.2. With CUSUM_Z defined as in Theorem 2.1.2, under the null hypothesis of no changepoints,

    SCUSUM_Z = (1/N) Σ_{k=1}^{N} [ CUSUM_Z(k/N) / σ̂ ]²  →_D  ∫_0^1 B(t)² dt.      (13)

The distribution of ∫_0^1 B(t)² dt was investigated in Tolmatz (2002). We note that Bai (1993) proposed using the sum of the squares of partial sums of ARMA residuals to detect a single changepoint in autocorrelated data; this test statistic converges to the integral of a squared Brownian motion rather than the integral of a squared Brownian bridge. To our knowledge, the variant in (12) has not previously been proposed nor studied in the literature.

The differences between the CUSUM and CUSUM_Z statistics were investigated in Robbins et al. (2011); their simulations indicate that the latter statistic is superior to the former in terms of Type I error and power. Our simulations confirm this finding. As such, in the remainder of the paper, we do not consider SCUSUM (without the subscript Z) tests further.

While CUSUM tests are nonparametric, LRTs are inherently parametric.
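Before turning to LRTs, the SCUSUM statistic in (12) can be sketched in code. This is an illustrative implementation (function name ours): it averages the squared, variance-scaled CUSUM process over all times, assuming the input is already a sequence of one-step-ahead prediction residuals.

```python
import numpy as np

def scusum_statistic(z):
    """Sum-of-squared-CUSUMs statistic of (12), up to the residual input.

    z is intended to be one-step-ahead prediction residuals; the null-based
    variance estimate is sigma2_hat = mean(z**2), as in the text.  The null
    limit is the integral (not the supremum) of a squared Brownian bridge.
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    s = np.cumsum(z)
    k = np.arange(1, n + 1)
    cusum = (s - (k / n) * s[-1]) / np.sqrt(n)   # signed CUSUM_Z(k/N)
    sigma2_hat = np.mean(z ** 2)                  # null-based variance estimate
    return np.mean(cusum ** 2 / sigma2_hat)
```

Because every time k contributes, the statistic accumulates evidence over the whole record rather than relying on a single extreme ordinate.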
Several error distributions have been considered by previous authors, by far the most common being normal; this is the distribution considered here.

The LRT compares the likelihood under the null hypothesis to likelihoods under alternatives with a changepoint. The LRT statistic for a changepoint has the general form Λ = max_{1 ≤ k < N} Λ_k, where Λ_k is the likelihood ratio statistic for a changepoint at time k. A Darling–Erdős type extreme value scaling sets

    W_U = a(log N) Λ^{1/2} − b(log N),

with a(x) = (2 log x)^{1/2} and b(x) = 2 log x + (1/2) log log x − (1/2) log π. Then under H_0,

    lim_{N→∞} P(W_U ≤ x) = exp(−2 exp(−x)),   −∞ < x < ∞.      (16)

Specifically, H_0 is rejected when W_U is too large to be explained by the distribution in (16).

Another way of scaling the Λ_k statistics involves cropping boundary times. Like the CUSUM test, the LRT is volatile at times near the boundaries. In fact, Λ →_D ∞ as N → ∞ should the maximum be taken over the entire range 1 ≤ k < N under the null hypothesis of no changepoints. A common cropped LRT simply truncates admissible times near the boundaries; for example, with 0 < ℓ < h < 1, ℓ being close to zero and h being close to unity, set

    U_crop = max_{ℓ ≤ k/N ≤ h} Λ_k.      (17)

Robbins et al. (2011) show that

    U_crop →_D sup_{ℓ ≤ t ≤ h} B(t)² / ( t(1 − t) ).      (18)

As the next section shows, LRTs are not competitive in changepoint detection problems. While simulations are presented for the above extreme value test in the next section, simulations for cropped LRTs are delegated to the supplementary material; both methods perform poorly.

As a final comment here, deriving an LRT for independent data and then replacing the data with one-step-ahead prediction residuals, another avenue for dealing with dependence, does not yield a methodologically distinct path. Specifically, if one derives an LRT statistic for independent series and then substitutes one-step-ahead prediction residuals in place of the original data, the limit law in (18) again arises. The boundaries again must be cropped to ensure a proper limiting distribution.
The discussion around (1.4.22)–(1.4.27) in Csörgő and Horváth (1997) provides more detail on this route; see also Lavielle and Moulines (2000) for more on LRTs for correlated data.

3 Single Changepoint Simulations

This section investigates the finite sample performance of the Section 2 tests (cropped CUSUMz, CUSUMz, SCUSUMz, LRT) through simulation. Results for the cropped test statistics are delegated to the supplementary material; results for the others are presented here.

Desirable tests have reasonable (non-inflated) false detection rates when no changepoints exist, and large detection powers when a changepoint is present, regardless of the degree of correlation. For each statistic under consideration, the impact of autocorrelation on the Type I error is first explored. We then examine detection powers of the tests when a changepoint exists. First order Gaussian autoregressions (AR(1)) are considered here with σ = 1; other structures are examined in the supplementary material.

Figure 1 summarizes results for N = 1,000 across varying AR(1) correlation parameters φ. Our conclusions do not vary for different N; see the supplementary material. Figure 1 shows that the only method to retain a controlled Type I error across all φ is the SCUSUMz. The LRT is the worst performing method, being far too conservative for all φ except 0.95, when it becomes highly inflated. The CUSUMz method is also slightly conservative, becoming more so as φ increases. Since we are using the asymptotic distribution for each of the 10,000 test statistics, we would expect the 0.05 Type I error to be reasonably maintained.

Figure 1: Type I Errors for an AR(1) Series with Different φ When N = 1,000.

We now consider test detection powers. In general, the detection power of an AMOC test depends on the degree of correlation, the size of the mean shift, and the location of the changepoint time (Robbins et al., 2016). Figure 2 (Δ = 0.3) shows empirical powers based on 10,000 independent simulated Gaussian series of length N = 1,000 for each φ when the mean shift lies in the center of the series (time 501). The figures demonstrate the drastic effects of autocorrelation on the power of changepoint tests. While the LRT had the highest empirical power when φ = 0.95, the estimated changepoint location of the LRT is biased and more variable than that for the CUSUM_Z and SCUSUM_Z tests; see Figures 3 and 5. The LRT test also has a Type I error far exceeding 0.05; as such, its higher power does not imply better overall performance. Overall, the CUSUM_Z and SCUSUM_Z tests are more powerful than the others. Note also that SCUSUM_Z has higher power than CUSUM_Z for each φ considered. Additional simulations (not shown) duplicate this conclusion for other sample sizes. The SCUSUMz statistic is clearly the best test.

Figure 2: Detection Powers for an AR(1) Series with Different φ. Here, N = 1,000 and Δ = 0.3.

Figure 3: Boxplots of Detected Changepoint Locations for an AR(1) Series with Different φ. Here, N = 1,000 and Δ = 0.3.

Figure 4: Detection Power for an AR(1) Series with Different φ (N = 1,000).

Figure 5: Detected Changepoint Locations for an AR(1) Series with Different φ (N = 1,000).

Our final AMOC simulations vary the changepoint location τ; here, φ is fixed at 0.5. Figure 6 displays empirical powers. The largest detection powers occur when the changepoint is near the center of the record, as expected, with power decreasing as the changepoint time moves towards a boundary. The SCUSUM_Z appears to be the most accurate overall. However, the LRT test is preferable when the changepoint occurs near the beginning of the record.
Figure 6: A Graph of τ/N Against Power with N = 500 and φ = 0.5.

4 Multiple Changepoint Techniques

Now suppose that {X_t}_{t=1}^N has an unknown number of changepoints, denoted by m, occurring at the unknown ordered times 1 < τ_1 < τ_2 < ⋯ < τ_m < N. Boundary conditions take τ_0 = 0 and τ_{m+1} = N. These m changepoints partition the series into m + 1 distinct regimes, the i-th regime having its own distinct mean and containing the data points {X_{τ_i + 1}, ..., X_{τ_{i+1}}}. The model can be written as X_t = κ_t + ε_t, where κ_t = μ_{r(t)} and r(t) denotes the regime index at time t, which takes values in {0, 1, ..., m}, and {ε_t} is a stationary, causal, and invertible ARMA(p, q) time series that applies to all regimes. Observe that

    κ_t = μ_0,   1 ≤ t ≤ τ_1,
          μ_1,   τ_1 + 1 ≤ t ≤ τ_2,
          ⋮
          μ_m,   τ_m + 1 ≤ t ≤ N.

There are many challenges in the multiple changepoint problem. Here, estimation of a global autocovariance function that applies to all regimes (considered further in Section 4.2) is difficult. One also has to estimate an unknown number of changepoints, their locations, and all segment parameters in a computationally feasible manner for some of the techniques.

While many authors have considered multiple changepoint issues, most assume IID {ε_t}. For IID errors, dynamic programming based approaches (Auger and Lawrence, 1989; Killick et al., 2012), model selection methods using LASSO (Harchaoui and Lévy-Leduc, 2010; Shen et al., 2014), and moving sum statistics (Kirch and Muhsal, 2014) have all been applied to multiple changepoint problems; this list is not exhaustive.
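The piecewise-mean model above is straightforward to simulate, which is how series like those in our figures are produced. The sketch below (function name and defaults are ours) generates X_t = κ_t + ε_t with AR(1) errors; extending it to general ARMA(p, q) errors is routine.

```python
import numpy as np

def simulate_shifted_ar1(n, tau, mu, phi=0.5, sigma=1.0, rng=None):
    """Simulate X_t = kappa_t + eps_t with AR(1) errors (illustrative sketch).

    tau : ordered interior changepoint times (0 < tau_1 < ... < tau_m < n)
    mu  : the m + 1 regime means mu_0, ..., mu_m
    """
    rng = np.random.default_rng(rng)
    e = rng.normal(0.0, sigma, n)
    eps = np.zeros(n)
    for t in range(1, n):                      # AR(1) recursion for the errors
        eps[t] = phi * eps[t - 1] + e[t]
    kappa = np.zeros(n)
    bounds = [0] + list(tau) + [n]
    for i in range(len(mu)):                   # piecewise constant mean
        kappa[bounds[i]:bounds[i + 1]] = mu[i]
    return kappa + eps
```

Setting sigma = 0 recovers the pure mean function κ_t, a convenient way to check the regime bookkeeping.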
As in the AMOC setting, techniques for independent data may not work well for dependent series (Davis et al., 2006; Li and Lund, 2012; Chakar et al., 2017).

The multiple changepoint techniques considered here can be put into two broad categories: (1) recursive segmentation and algorithmic methods using AMOC techniques, and (2) direct approaches that fit all series subsegments jointly. The two approaches are completely different in their perspective. Elaborating, recursive techniques employ AMOC single changepoint methods in an iterative manner, identifying at most one additional changepoint in each subsegment at each recursion level. In contrast, direct techniques model and estimate the multiple changepoint configuration jointly; here, penalization methods typically drive the discourse. No hypothesis testing paradigm underlies any direct approach. Some multiple changepoint techniques apply only to special time series structures. For example, Chakar et al. (2017) is exclusively designed for AR(1) series. Their techniques are not considered here as they cannot be applied to all of our considered scenarios.

Recursive segmentation approaches first focus on finding a single (usually the most prominent) changepoint, thereafter iterating in some manner to identify additional changepoints. The primary tool here is binary segmentation (Scott and Knott, 1974), which provides a multiple changepoint configuration estimate via any AMOC method. Binary segmentation first tests the entire series for a single changepoint. Should a changepoint be found, the series is split about the changepoint time into two subsegments that are then analyzed for additional changepoints using the AMOC strategy. The process is repeated until no subsegment tests positive for a changepoint. Binary segmentation works best when the changepoints are well separated and the segment means are distinct.
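The recursion just described is generic: any AMOC test can be plugged in. The sketch below is ours, not the paper's implementation; the minimum segment length guard and the test interface are illustrative assumptions.

```python
def binary_segmentation(x, amoc_test, lo=0, hi=None):
    """Generic binary segmentation (sketch).

    amoc_test(segment) should return (reject, k), where reject indicates a
    detected changepoint and k is its position within the segment.  Any AMOC
    detector, e.g. one applied to prediction residuals, can be plugged in.
    """
    if hi is None:
        hi = len(x)
    if hi - lo < 4:                      # too short to split (arbitrary guard)
        return []
    reject, k = amoc_test(x[lo:hi])
    if not reject:
        return []
    cp = lo + k                          # changepoint in full-series time
    return (binary_segmentation(x, amoc_test, lo, cp)
            + [cp]
            + binary_segmentation(x, amoc_test, cp, hi))
```

With a crude mean-gap AMOC test, a noise-free staircase series is split at exactly its two level shifts, illustrating the recursion.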
In our comparisons, the AMOC statistic adopted for recursive segmentation is the SCUSUM test applied to one-step-ahead prediction residuals, which won our AMOC comparisons in the previous section.

Extensions of binary segmentation abound and include circular binary segmentation (Olshen et al., 2004), which seeks to identify a segment of data that has a distinct mean from the rest of the series. A popular binary segmentation extension considered here is wild binary segmentation (WBS) (Fryzlewicz, 2014). WBS samples subsegments of the entire data of varying lengths and performs an AMOC test on each sampled subsegment. Fryzlewicz (2014) suggests sampling at least 9N² log(N²δ⁻¹)/δ² subsegments, where δ is the minimum spacing between changepoints (see Assumption 3.2 of Fryzlewicz (2014)), as this produces a high probability of drawing a favorable subsegment. WBS is a randomized search and hence may return different segmentations on different runs. In our simulations, WBS employs a standard CUSUM test since its threshold was developed particularly for standard CUSUM methods. In addition, the threshold constant C = 1.3 is used.

For our work, the autocovariance of the series is assumed constant across time, applying to all series subsegments. This autocovariance function will be needed to decorrelate the series before applying any binary segmentation search methods to the one-step-ahead prediction residuals. Unfortunately, accurate estimation of the autocovariance function requires knowledge of the underlying mean structure. In the single changepoint case, the long-run variance defined in (4) arises in the limit laws; however, this does not extend to multiple changepoint settings as no theoretical equivalent of (5) exists.

In our setup, the second order (covariance) model parameters are deemed nuisance parameters and are estimated using the entire data sequence.
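These nuisance parameters can be estimated in a mean-shift-robust way by working with first differences, as developed in detail below: each shift contaminates at most one difference. A sketch of the resulting lagged autocorrelation estimates (function name ours):

```python
import numpy as np

def diff_autocorr(x, max_lag):
    """Sample autocorrelations of the first difference d_t = X_t - X_{t-1}.

    A sparse set of mean shifts touches at most one difference per shift, so
    these estimates are nearly immune to an unknown changepoint configuration.
    """
    d = np.diff(np.asarray(x, dtype=float))
    n = len(d)
    denom = np.sum(d ** 2)
    rho = [np.sum(d[:n - h] * d[h:]) / denom for h in range(max_lag + 1)]
    return np.array(rho)                        # rho[0] == 1 by construction
```

As a side note, for an AR(1) error process a short calculation gives ρ_d(1) = −(1 − φ)/2, so φ can be recovered as 1 + 2ρ̂_d(1); the general AR(p) mapping is the moment system discussed in the text.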
To account for the impact of unknown mean shifts on these estimators, Yule–Walker type moment equations will be used on the first order difference of {X_t}. The first order difference X_t − X_{t−1} is used because E[X_t − X_{t−1}] = 0 unless a changepoint occurs at time t. Define d_t = X_t − X_{t−1} and note that d_t = ε_t − ε_{t−1} except when time t is a changepoint. Let γ_d(h) = Cov(d_t, d_{t−h}) and ρ_d(h) = γ_d(h)/γ_d(0). For the AR(p) case, which is our primary interest, estimators of the AR(p) parameters formed from {d_t} take the form

    φ̂ = M̂⁻¹ ρ̂_d,      (19)

where φ = (φ_1, ..., φ_p)′, ρ̂_d collects the lagged sample autocorrelations of {d_t}, and M is a p × p matrix whose entries are linear combinations of ρ_d(0), ..., ρ_d(p − 1); its exact form is given in Shi et al. (forthcoming). The elements in M̂ and ρ̂_d simply replace ρ_d(h) with

    ρ̂_d(h) = γ̂_d(h)/γ̂_d(0) = Σ_{t=2}^{N−h} (X_t − X_{t−1})(X_{t+h} − X_{t+h−1}) / Σ_{t=2}^{N} (X_t − X_{t−1})².

While Shi et al. (forthcoming) discuss these AR(p) estimators in detail, the intuition behind them is that if m is small relative to N, then the mean shifts will have negligible impact on the estimated covariance structure of the differences since X_t − X_{t−1} = ε_t − ε_{t−1} except at times t that are changepoint times. Shi et al. (forthcoming) demonstrate that this estimate of the covariance outperforms alternatives such as direct and windowed estimation. Due to this, the Yule–Walker moment estimators in (19) will be used in our simulations to decorrelate the series.

4.3 Direct Modelling

Direct modelling approaches analyze the whole series at once, optimizing an objective function with a penalty term that controls the number of changepoints. The techniques seek a changepoint configuration that minimizes

    F(m; τ_1, ..., τ_m) := C(m; τ_1, ..., τ_m) + P(m; τ_1, ..., τ_m),      (20)
where C is the cost of putting m changepoints at the times τ_1, ..., τ_m and P is a penalty term to prevent over-fitting. There are many ways to define the cost and penalties. A frequently used cost is the negative log-likelihood. Here, we will use

    C(m; τ_1, ..., τ_m) = −2 ln L_opt(θ | m; τ_1, ..., τ_m),

where L_opt(θ | m; τ_1, ..., τ_m) is the time series likelihood (Gaussian based) optimized over all parameters θ given that m changepoints occur at the times τ_1, ..., τ_m.

Penalties can be constructed in a variety of ways. Common penalties include minimum description lengths (MDL), the modified Bayesian Information Criterion (mBIC), and the classic BIC penalty. AIC is another popular penalty, despite it not providing consistent estimates of the number or locations of the changepoint(s). Of these four penalties, AIC and BIC are simple multiples of the number of changepoints, while the MDL and mBIC further incorporate changepoint time information. The forms of these penalties are listed in Table 1.

Table 1: Penalized Likelihood Objective Functions

    Criterion   Objective Function
    AIC         N ln(σ̂²) + 2(2m + 3)
    BIC         N ln(σ̂²) + (2m + 2) ln(N)
    mBIC        N ln(σ̂²) + m ln(N) + Σ_{i=1}^{m+1} ln((τ_i − τ_{i−1})/N)
    MDL         N ln(σ̂²) + ln(m) + Σ_{i=1}^{m+1} ln(τ_i − τ_{i−1}) + Σ_{i=2}^{m} ln(τ_i)

Here, σ̂² is the estimated white noise variance of the {ε_t} process that drives the ARMA errors. MDL penalties are based on information theory and are discussed further in Davis et al. (2006) and Li and Lund (2012). The mBIC penalty is developed in Zhang and Siegmund (2007). These two penalties are taken as zero when m = 0.
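For a fixed configuration, evaluating the Table 1 objectives is simple bookkeeping. The sketch below (function name ours) assumes the fitted white noise variance has already been obtained from the optimized likelihood, and takes the mBIC and MDL penalties as zero when m = 0, as in the text.

```python
import numpy as np

def penalized_objectives(n, sigma2_hat, taus):
    """Evaluate the Table 1 objective functions for one configuration (sketch).

    taus holds the ordered interior changepoint times; sigma2_hat is the
    fitted white noise variance of the error process.
    """
    m = len(taus)
    bounds = [0] + list(taus) + [n]
    seg = np.diff(bounds).astype(float)          # segment lengths tau_i - tau_{i-1}
    base = n * np.log(sigma2_hat)
    out = {"AIC": base + 2 * (2 * m + 3),
           "BIC": base + (2 * m + 2) * np.log(n)}
    if m == 0:                                   # mBIC and MDL penalties vanish
        out["mBIC"] = base
        out["MDL"] = base
    else:
        out["mBIC"] = base + m * np.log(n) + np.sum(np.log(seg / n))
        out["MDL"] = (base + np.log(m) + np.sum(np.log(seg))
                      + sum(np.log(t) for t in taus[1:]))
    return out
```

With σ̂² = 1 the base term vanishes, so the returned values are exactly the penalties, which makes the four criteria easy to compare by hand.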
The mBIC penalty tends to be larger than the MDL penalty for the same changepoint configuration; as such, MDL will often select models with more changepoints than mBIC.

With penalized likelihood approaches, a computational bottleneck arises. Since there are (N−1 choose m) different admissible changepoint configurations in a series with m changepoints (time N cannot be a changepoint), there are 2^{N−1} different changepoint configurations to consider when analyzing the entire series. This huge count makes an exhaustive model search, one that evaluates all admissible changepoint configurations, virtually impossible to conduct, even when N is as small as 100. Unfortunately, PELT (Killick et al., 2012) and FPOP (Maidstone et al., 2017), two rapid dynamic programming based techniques, require the objective function to be additive over distinct regimes. The presence of global parameters like the autocovariance function violates this restriction. Regime-additive likelihoods will not arise when {ε_t} is ARMA(p, q), although Bai and Perron (1998) argue that any boundary contribution is negligible if the ARMA parameters are allowed to change at each changepoint time (this is not the case here). Instead, a genetic algorithm will be used to optimize our penalized likelihoods exactly (including global autocovariances).

Elaborating on the above, optimal changepoint configurations estimated by penalized likelihoods can be found through a genetic algorithm (GA) search (Davis et al., 2006; Li and Lund, 2012). A GA is an intelligent random walk search that tends to avoid evaluating clearly suboptimal changepoint configurations. Unfortunately, the objective function in (20) is not convex, and its optimization is problematic; additional research is needed to optimize these penalized likelihoods. Our GA encodes the changepoint configuration into a binary string and uses the R GA package from Scrucca (2013). This GA has worked reliably in our experience.
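The 2^{N−1} blow-up is easy to feel in code. The sketch below exhaustively scores every configuration with a BIC-style penalty; it uses a simplified IID least-squares cost rather than the paper's time series likelihood, and is only feasible for tiny N, which is precisely why genetic algorithms or dynamic programming are needed in practice.

```python
from itertools import combinations
from math import log

def exhaustive_search(x, penalty=lambda m, n: (2 * m + 2) * log(n)):
    """Exhaustive penalized least-squares search (BIC-style penalty by default).

    Scores all 2^(N-1) changepoint configurations of the series x.  The IID
    least-squares cost here is a stand-in for the optimized time series
    likelihood used in the text.
    """
    n = len(x)
    best_cost, best_cfg = float("inf"), None
    for m in range(n):                           # number of changepoints
        for taus in combinations(range(1, n), m):
            bounds = (0,) + taus + (n,)
            rss = 0.0
            for a, b in zip(bounds, bounds[1:]):  # residuals about segment means
                seg = x[a:b]
                mu = sum(seg) / len(seg)
                rss += sum((v - mu) ** 2 for v in seg)
            sigma2 = max(rss / n, 1e-12)          # guard against log(0)
            cost = n * log(sigma2) + penalty(m, n)
            if cost < best_cost:
                best_cost, best_cfg = cost, taus
    return list(best_cfg)
```

Even N = 12 already requires scoring 2,048 configurations; doubling N squares the count, so an exhaustive search at N = 100 is hopeless.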
5 Multiple Changepoint Simulations

In presenting simulation results for different scenarios, we only present graphics that are judged informative. In general, for each simulation case considered, graphics of distances, average numbers of detected changepoints, and empirical probabilities of estimating the correct number of changepoints were produced. The supplementary material contains any graphics that are not included here. Similarly, we focus on unit mean shift sizes in the main text body unless otherwise noted; results for different shift sizes are presented in the supplementary material.

The changepoint configurations that we consider are illustrated in Figure 7, which shows sample time series generated with the various mean shift configurations. These configurations range from scenarios with no or few changepoints to those with a large number of changepoints. All series have length N = 500.

AIC performs miserably in all our scenarios, always selecting an excessive number of changepoints. Since plotting AIC results would distort all other graphical comparisons, AIC results are not presented, to accentuate differences between the remaining methods.

Figure 7: AR(1) Time Series with Different Changepoint Settings. Panels: (a) one changepoint; (b) three changepoints; (c) three changepoints; (d) nine changepoints; (e) Keyblade; (f) random changepoints.

Before presenting our simulations, we discuss how to compare an estimated multiple changepoint segmentation to its true value. The estimated multiple changepoint configuration could have a different number of changepoints than the true configuration. For a single changepoint method, such a comparison is easy: examine first whether the method flags a changepoint, and then any distance from the true changepoint time.
With multiple changepoint configurations, this comparison is complicated by the fact that different segmentations may have different numbers of changepoints: which changepoint times in one particular configuration correspond to those in another may be nebulous.

To compare different methods, a distance between the two changepoint configurations C_1 = (m; τ_1, ..., τ_m) and C_2 = (k; η_1, ..., η_k) will now be developed. Several distances have been utilized in the multiple changepoint field. Some, such as the mean squared error (MSE) of the fitted means, the V-measure, or the Hausdorff distance, are not specific to changepoint problems. Others, such as the number of changepoints or true/false positive detection rates, are more tailored to the changepoint problem. However, each of these statistics quantifies only one aspect of the fit. For example, the MSE could be low, but the number of changepoints could still be overestimated; or the number of changepoints could be correct, but their locations could be inaccurate. As such, we introduce a new changepoint-specific distance balancing the two key components of multiple changepoint analysis: (1) the number of changepoints and (2) their individual locations.

To balance the number and location aspects of changepoint configurations, two components in our distance are needed. The first measures the discrepancy in the numbers of changepoints in the two configurations, for which we use the absolute difference. The second component measures discrepancies in the changepoint times. This is trickier to quantify as the number of changepoints may be different in the two configurations and some sort of "matching procedure" is needed. For two changepoint segmentations, C_1 and C_2, the distance used here is

    d(C_1, C_2) = |m − k| + min A(C_1, C_2).      (21)

The term |m − k| assigns the difference in changepoint numbers for any mismatch in the total number of changepoints.
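A small sketch makes the two components of (21) concrete. Instead of a linear assignment solver, the snippet below (function name ours) matches the smaller configuration injectively into the larger one by brute force over permutations, which is adequate for a handful of changepoints; a proper assignment algorithm scales better.

```python
from itertools import permutations

def config_distance(tau, eta, n):
    """Distance (21) between two changepoint configurations (brute-force sketch).

    tau and eta are the two lists of changepoint times and n is the series
    length; the location cost |tau_i - eta_j| / n mirrors the text.
    """
    a, b = sorted([list(tau), list(eta)], key=len)     # ensure len(a) <= len(b)
    mismatch = abs(len(tau) - len(eta))                # the |m - k| term
    if not a:
        return float(mismatch)
    best = min(sum(abs(x - y) for x, y in zip(a, perm)) / n
               for perm in permutations(b, len(a)))
    return mismatch + best
```

For example, comparing one changepoint at 100 against two at {100, 200} in a series of length 500 gives distance 1: the count mismatch contributes 1 and the matched location contributes 0.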
The term min A(C_1, C_2) reflects the smallest cost that matches changepoint locations in C_1 to those in C_2. This term can be computed via the following linear assignment problem:

    A(C_1, C_2) = Σ_{i=1}^{k} Σ_{j=1}^{m} c_{i,j} I_{i,j},

subject to the constraints Σ_{i=1}^{k} I_{i,j} = 1 for j ∈ {1, ..., m} and Σ_{j=1}^{m} I_{i,j} ≤ 1 for i ∈ {1, ..., k}. Here, the cost of assigning τ_i to η_j is taken simply as c_{i,j} = |τ_i − η_j|/N, and I_{i,j} ∈ {0, 1} is the decision variable, with I_{i,j} = 1 if τ_i is assigned to η_j and I_{i,j} = 0 otherwise. This linear assignment problem can be efficiently solved with algorithms in Burkard et al. (2012).

One can verify that (21) defines a legitimate distance satisfying the triangle inequality. The larger the distance, the worse the two configurations correspond to one another. The term min A(C_1, C_2) can be shown to be bounded by unity and measures how closely the two changepoint configurations match up to one another. When both configurations have many changepoints, the distance is dominated by the |m − k| term. In our simulations, estimated multiple changepoint configurations will be compared to the true changepoint configuration with this distance.

Many modern multiple changepoint simulation studies increasingly focus on cases with a large number of changepoints, eschewing single and no changepoint scenarios. We include such scenarios here to help illuminate the differences between the methods.

Our first simulation considers the changepoint free case in an AR(1) Gaussian series having various correlation parameters φ, N = 500, and σ = 1. Figure 8 shows probabilities of falsely declaring one or more changepoints over 10,000 independent simulations. Unlike the single changepoint case, the methods here are not aiming to control any false positive rate.

Figure 8: Empirical False Positive Detection Rates for an AR(1) Series with Various φ. Methods: BIC+GA, BS(SCUSUMz), mBIC+GA, MDL+GA, WBS(C=1.3).
The results show that BIC, mBIC, and binary segmentation perform best, with WBS and MDL performing significantly worse. It is worth noting that WBS has a markedly higher false positive rate, an issue discussed further in Lund and Shi (2020). Binary segmentation is arguably best here, an expected finding since there are no changepoints (an AMOC test applied to the series' one-step-ahead prediction residuals should not see a changepoint and stops any recursion at its onset). All methods perform better with negative φ than with positive φ; performance of all methods degrades as φ moves upwards towards unity (as expected).

We now move to simulations with one changepoint in the same AR(1) setup above. The changepoint is placed in the middle of the series, at time t = 251. Figure 9 shows the average distances between the estimated changepoint configurations and the true configuration. While there are no huge discrepancies between the methods, for heavily correlated series, binary segmentation is the worst and MDL and mBIC the best. Again, all tests degrade as φ approaches unity. MDL has the least variability across φ. Compared to the single changepoint results, the multiple changepoint penalties are more conservative than the LRT. Also, since the average distance is less than unity, the correct number of changepoints is often being identified.

Figure 9: Average Distances for an AR(1) Series with Varying φ. Truth: One Changepoint in the Middle Moving the Series Upwards.

Our next case moves to a setting with three mean shifts, partitioning the series into four equal-length regimes. The changepoints occur at times 126, 251, and 376, with each changepoint shifting the series upward by one unit (up-up-up). As before, Figure 10 reports average distances.
MDL performs the worst for negative φ, while the other methods perform similarly. Perhaps surprisingly, binary segmentation starts to degrade when φ becomes positive, with the others also degrading, but to a lesser extent. BIC performs best across all φ.

Figure 10: Average Distances for an AR(1) Series with Different φ. Truth: Three Equally Spaced Changepoints Moving the Series Up-Up-Up.

Three Alternating Changepoints

Next, we consider another three changepoint configuration, the changepoint times again being equally spaced, but this time moving the series up, then down, and then up again (up-down-up). Figure 11 reports the distances. All methods have a harder time than with the last up-up-up changepoint configuration. In this setting, binary segmentation becomes fooled and estimates too few changepoints; mBIC also does not do as well as the other methods. MDL and WBS work better, the surprise winner being BIC.

Figure 11: Average Distances for an AR(1) Series with Varying φ. Truth: Three Equally Spaced Changepoints Moving the Series Up-Down-Up.

Next, we move to cases with nine changepoints. Our first set of simulations equally spaces all changepoint times in the record, each moving the series higher (all up). Because the changepoints are more difficult to detect, we have increased the absolute mean shift magnitude to two units; this induces more separation between the methods, allowing for an easier comparison. Figure 12 displays distances for this setting. The winners are BIC and MDL; the losers are WBS and binary segmentation.
Figure 12: Average Distances for an AR(1) Series with Varying φ. Truth: Nine Changepoints, All Up.

Our next set of simulations again considers nine changepoints, but the directions of the equally spaced mean shifts of magnitude two now alternate in an up-down-up-down-up-down-up-down-up fashion (alternating). Figure 13 displays the results. The best method here is again BIC, with WBS doing better than in the previous setting; mBIC is a laggard and binary segmentation is again the worst.

Figure 13: Average Distances for an AR(1) Series with Varying φ. Truth: Nine Alternating Changepoints.

Nine Keyblade Changepoints

As a different type of setup, we consider a nine changepoint setting where the sizes of the nine mean shifts vary, their shift directions vary, and the changepoint times are not equally spaced. Figure 7(d) shows our chosen pattern for $E[X_t]$, which we call a "keyblade". The distances in Figure 14 reveal BIC and MDL as winners, and WBS and binary segmentation as inferior.

Figure 14: Average Distances for the Keyblade AR(1) Series with Varying φ. Truth: Nine Changepoints.

We now consider settings with a random number of changepoints simulated from a Poisson distribution with a mean of five. The locations of any mean shifts are placed uniformly in the set $\{1, \ldots, N\}$ without replacement. The mean of each segment is simulated from a normal distribution with a zero mean and a standard deviation of 1.5.
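A random configuration of this type can be drawn as follows. This is a sketch under our own reading of the setup, not the authors' code: in particular, we restrict shift locations to {2, ..., N} so that every shift has data on both sides, and we take σ as the marginal (not innovation) standard deviation of the AR(1) errors; the paper's exact conventions may differ.

```python
import numpy as np

def random_mean_configuration(N, rate=5.0, mean_sd=1.5, rng=None):
    """Draw a random changepoint configuration: a Poisson(rate) number of
    changepoints, locations sampled uniformly without replacement, and an
    independent N(0, mean_sd**2) mean for each resulting regime.
    Returns (mu, times), where mu is the length-N mean function."""
    rng = np.random.default_rng() if rng is None else rng
    m = min(rng.poisson(rate), N - 1)          # at most N - 1 visible shifts
    # Assumption: admissible shift times are {2, ..., N}.
    times = np.sort(rng.choice(np.arange(2, N + 1), size=m, replace=False))
    bounds = np.concatenate(([1], times, [N + 1]))   # regime start times
    mu = np.empty(N)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mu[lo - 1:hi - 1] = rng.normal(0.0, mean_sd)  # one mean per regime
    return mu, times

def ar1_noise(N, phi, sigma, rng):
    """Stationary zero-mean Gaussian AR(1) errors with lag-one correlation
    phi (|phi| < 1) and marginal standard deviation sigma."""
    e = np.empty(N)
    e[0] = rng.normal(0.0, sigma)              # start in stationarity
    innov_sd = sigma * np.sqrt(1.0 - phi ** 2)
    for t in range(1, N):
        e[t] = phi * e[t - 1] + rng.normal(0.0, innov_sd)
    return e

rng = np.random.default_rng(7)
mu, times = random_mean_configuration(500, rng=rng)
x = mu + ar1_noise(500, phi=0.5, sigma=1.0, rng=rng)   # one simulated series
```

Each simulated series can then be handed to the competing methods and the estimated configuration scored against `times` with the distance in (21).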
Figure 15 summarizes the results: BIC and MDL are again superior and binary segmentation inferior.

Figure 15: Average Distance between the Estimated and True Changepoint Locations.

It was surprising to us how well the simple BIC penalty has done so far, especially since this penalty does not depend on the changepoint times. To examine this issue further, we fix the AR(1) parameter at φ = 0.5 and let N vary, with one and three changepoints. Here, the changepoints induce equal-length regimes, all mean shift sizes are of unit magnitude, and their directions alternate, the first direction being upwards. Table 2 reports average BIC and mBIC distances when N ∈ {500, 1000, 2500}. As the sample size increases, the additional penalty that mBIC places on the segment lengths results in fewer changepoints being identified than for BIC. As N grows, there is a tendency for BIC to add (erroneous) changepoints in some samples. Thus, as the number of changepoints and N grow, mBIC is better than BIC. This leads us to recommend mBIC over BIC for larger N or larger numbers of changepoints.

Table 2: Comparison of BIC and mBIC. Truth: m changepoints, all of unit magnitude, placed in alternating directions that equally space the record for an AR(1) series with varying length N. Here, σ = 1 and φ = 0.5.

              m = 1              m = 3
              BIC      mBIC      BIC      mBIC
N = 500       0.227    0.125     1.270    2.
N = 1000      0.126    0.066     0.311    0.
N = 2500      0.121    0.047     0.123    0.

We next fix N = 500, φ = 0.5, and σ = 1, and consider three alternating changepoints placed at the times 126, 251, and 376, varying the mean shift magnitude ∆. Average distances over 1,000 simulations are reported in Table 3. As the mean shift magnitude increases, all methods improve. BIC and MDL, two frequent winners in past scenarios, perform worst when the mean shift size is largest; moreover, WBS and binary segmentation, two frequent past losers, perform best.
mBIC reports the smallest average distance when ∆ ≥ 2.

         BIC+GA   mBIC+GA   MDL+GA   BS(SCUSUMz)   WBS(C=1.3)
∆ = 1    1.269    2.424     1.948    2.702         1.686
∆ = 2    0.140    0.051     0.209    0.843         0.149
∆ = 3    0.126    0.042     0.188    0.077         0.079

Table 3: Average Distances for an AR(1) Series with Varying Mean Shift Magnitudes.

Our final simulation task considers other autoregressive error structures. We begin with AR(2) errors and the case of no changepoints. Table 4 shows false positive rates of signaling one or more changepoints when in truth none exist, for various AR(2) parameters φ1 and φ2. In this and all four tables below, 1,000 independent simulations are conducted, N = 500, σ = 1, and all mean shift sizes are two units (this adds information to the cases above, where mean shifts were of unit size). All four tables are discussed in tandem after they are presented.

Table 4: False Positive Rates for an AR(2) Series with Varying {φ1, φ2}. Truth: No Changepoints.

Table 5 reports average distances for the AR(2) scenario of the last table, but now with three changepoints. The three shifts induce four equal-length regimes and shift the series mean in an up-down-up manner.

Table 5: Average Distances for an AR(2) Series with Varying {φ1, φ2}. Truth: Three Alternating Changepoints of Size ∆ = 2.

Table 6 shows false positive rates of signaling one or more changepoints when in truth there are none, for various parameter choices in an AR(4) series.

Table 6: False Positive Rates for an AR(4) Series with Varying {φ1, φ2, φ3, φ4}. Truth: No Changepoints.

Finally, Table 7 reports average distances over 1,000 independent simulations for the same AR(4) scenario above. The mean shift specifications are repeated from Table 5.
Table 7: Average Distances for AR(4) Errors with Varying {φ1, φ2, φ3, φ4}. Truth: Three Alternating Changepoints of Size ∆ = 2.

In the above tables, when there are no changepoints, binary segmentation appears best and MDL and WBS worst, repeating the conclusions for AR(1) errors. In the tables with three changepoints and heavily positively correlated errors, MDL, BIC, and WBS all do comparatively well; when the correlation becomes negative, the situation reverses and mBIC and binary segmentation are best. These aspects were also seen for the AR(1) series, although we did not remark on the negatively correlated results there.

To summarize our overall conclusions on multiple changepoints, the following points emerge:

• AIC and binary segmentation are not competitive. Binary segmentation worked well only when no or few changepoints existed and worsened when multiple mean shifts act in opposite directions. We do not recommend either of these techniques.

• Although it does not depend on the changepoint configuration, BIC is surprisingly good across a wide range of scenarios. However, as N gets larger, mBIC becomes superior.

• MDL was often the best penalized likelihood technique in heavily correlated scenarios, but does not work as well with negatively correlated series. MDL also tends to lose to mBIC when the changepoint mean shift sizes are large or when changepoints are infrequent.

• MDL and WBS techniques should be used with caution if there is a possibility that no changes are present, as they have high false positive rates.

• BIC and mBIC perform well in low-frequency changepoint settings.

We close with one more comment that is not apparent from the reported results. The MDL penalty works reasonably in a large variety of positively correlated scenarios. However, when it is wrong, it has a tendency to place several changepoint times very close to one another, typically near the beginning of the record.
This is an attempt by the method to flag an outlier. If one imposes a minimum spacing between changepoint times to combat this, the method becomes much better.

This paper presented a systematic comparison of common single and multiple changepoint techniques in time series settings. Previous work had demonstrated how blindly applying techniques that assume IID data can lead to erroneous conclusions. Here, we focused on how IID methods could be modified for the time series setting, either by correcting the asymptotic distribution or by modifying the test statistic.

In constructing our comprehensive approach, a summary of the major available techniques was made in a single manuscript; hence, this paper has utility as a reference. A new changepoint distance was also developed that combines the two important features of changepoint detection, identification of the correct number and location(s) of the changepoints, within a single metric.

In the single changepoint case, it was found that the best techniques apply IID methods to the time series of one-step-ahead prediction residuals. The best performing single changepoint detection method was the sum of squared CUSUM statistics in Bai (1993). Extreme value based asymptotic tests exhibited poor detection power.

In the multiple changepoint case, conclusions were more nebulous; however, binary segmentation and AIC are not recommended. The penalized likelihoods MDL, mBIC, and BIC are all worthy of additional study. WBS methods also performed reasonably and deserve additional attention, especially given their relatively recent entrance into the literature. At this point, it is still not clear whether purely algorithmic techniques can beat penalized likelihood methods. It is our view that one should use BIC penalized likelihood methods for large numbers of changepoints and/or small data lengths, with mBIC recommended for smaller numbers of changepoints and/or longer data records.
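To make the recommended penalized likelihood recipe concrete, here is a minimal sketch for the Gaussian mean-shift model. As assumptions of our own, it swaps the paper's genetic algorithm search for an exact segment-neighborhood dynamic program (Auger and Lawrence, 1989) and uses a simple Gaussian BIC of the form N log(SSE/N) + m log N; it also ignores autocorrelation, so in practice it would be applied to one-step-ahead prediction residuals as discussed above.

```python
import numpy as np

def sn_rss(x, max_m):
    """Segment-neighborhood dynamic program: sn_rss(x, M)[m] is the smallest
    residual sum of squares over all piecewise-constant mean fits with m
    changepoints.  O(M * N^2) time; fine for illustration."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(a, b):  # SSE of x[a:b] about its own mean (0 <= a < b <= N)
        return s2[b] - s2[a] - (s[b] - s[a]) ** 2 / (b - a)

    cost = np.full((max_m + 1, N + 1), np.inf)
    cost[0, 0] = 0.0
    for t in range(1, N + 1):
        cost[0, t] = sse(0, t)
    for m in range(1, max_m + 1):
        for t in range(m + 1, N + 1):
            cost[m, t] = min(cost[m - 1, u] + sse(u, t) for u in range(m, t))
    return cost[:, N]

def choose_m_by_bic(x, max_m=5):
    """Pick the changepoint count minimizing N*log(SSE_m/N) + m*log(N)."""
    N = len(x)
    rss = sn_rss(x, max_m)
    bic = [N * np.log(rss[m] / N) + m * np.log(N) for m in range(max_m + 1)]
    return int(np.argmin(bic))

# Toy example: one mean shift of size three in white noise.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
x[100:] += 3.0
m_hat = choose_m_by_bic(x, max_m=3)
```

Swapping the BIC line for an mBIC or MDL penalty changes only the model-selection step; the dynamic program (or a genetic algorithm, for location-dependent penalties on long series) is unchanged.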
References

Antoch, J., Hušková, M., and Prášková, Z. (1997). Effect of dependence on statistics for determination of change. Journal of Statistical Planning and Inference, 60(2):291–310.

Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology, 51(1):39–54.

Bai, J. (1993). On the partial sums of residuals in autoregressive and moving average models. Journal of Time Series Analysis, 14(3):247–260.

Bai, J. and Perron, P. (1998). Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78.

Brockwell, P. and Davis, R. (1991). Time Series: Theory and Methods. Springer-Verlag, 2nd edition.

Burkard, R., Dell'Amico, M., and Martello, S. (2012). Assignment Problems, Revised Reprint, volume 106. SIAM.

Chakar, S., Lebarbier, E., Lévy-Leduc, C., and Robin, S. (2017). A robust approach for estimating change-points in the mean of an AR(1) process. Bernoulli, 23(2):1408–1447.

Chapman, J.-L., Eckley, I. A., and Killick, R. (2020). A nonparametric approach to detecting changes in variance in locally stationary time series. Environmetrics, 31(1):e2576.

Chen, J. and Gupta, A. K. (2011). Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance. Springer Science & Business Media.

Csörgő, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis. John Wiley & Sons, Chichester.

Davis, R. A., Lee, T. C. M., and Rodriguez-Yam, G. A. (2006). Structural break estimation for nonstationary time series models. Journal of the American Statistical Association, 101(473):223–239.

Dehling, H., Fried, R., Garcia, I., and Wendler, M. (2015). Change-point detection under dependence based on two-sample U-statistics. In Asymptotic Laws and Methods in Stochastics, pages 195–220. Springer.

Eichinger, B. and Kirch, C. (2018). A MOSUM procedure for the estimation of multiple random change points.
Bernoulli, 24(1):526–564.

Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281.

Gallagher, C., Lund, R., and Robbins, M. (2012). Changepoint detection in daily precipitation data. Environmetrics, 23(5):407–419.

Harchaoui, Z. and Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493.

Jandhyala, V., Fotopoulos, S., MacNeill, I., and Liu, P. (2013). Inference for single and multiple change-points in time series. Journal of Time Series Analysis, 34(4):423–446.

Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.

Kirch, C. (2007). Resampling in the frequency domain of time series to determine critical values for change-point tests. Statistics & Risk Modeling, 25(3):1–25.

Kirch, C. and Muhsal, B. (2014). A MOSUM procedure for the estimation of multiple random change points. Preprint.

Lavielle, M. and Moulines, E. (2000). Least-squares estimation of an unknown number of shifts in a time series. Journal of Time Series Analysis, 21(1):33–59.

Li, S. and Lund, R. (2012). Multiple changepoint detection via genetic algorithms. Journal of Climate, 25(2):674–686.

Lu, Q. and Lund, R. B. (2007). Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35(3):447–458.

Lund, R. and Shi, X. (2020). Short communication: "Detecting possibly frequent change-points: Wild binary segmentation 2". arXiv preprint arXiv:2006.10845.

Lund, R., Wang, X. L., Lu, Q. Q., Reeves, J., Gallagher, C., and Feng, Y. (2007). Changepoint detection in periodic and autocorrelated time series. Journal of Climate, 20(20):5178–5190.

MacNeill, I. B. (1974). Tests for change of parameter at unknown times and distributions of some related functionals on Brownian motion.
The Annals of Statistics, pages 950–962.

Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2017). On optimal multiple changepoint algorithms for large data. Statistics and Computing, 27(2):519–533.

Matteson, D. S. and James, N. A. (2014). A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association, 109(505):334–345.

Menne, M. J., Williams, C. N., Jr., and Vose, R. S. (2009). The U.S. Historical Climatology Network monthly temperature data, version 2. Bulletin of the American Meteorological Society, 90(7):993–1008.

Olshen, A. B., Venkatraman, E., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572.

Page, E. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika, 42(3/4):523–527.

Robbins, M., Gallagher, C., Lund, R., and Aue, A. (2011). Mean shift testing in correlated data. Journal of Time Series Analysis, 32(5):498–511.

Robbins, M. W., Gallagher, C. M., and Lund, R. B. (2016). A general regression changepoint test for time series data. Journal of the American Statistical Association, 111(514):670–683.

Scott, A. J. and Knott, M. (1974). A cluster analysis method for grouping means in the analysis of variance. Biometrics, pages 507–512.

Scrucca, L. (2013). GA: a package for genetic algorithms in R. Journal of Statistical Software, 53(4):1–37.

Shen, J., Gallagher, C. M., and Lu, Q. (2014). Detection of multiple undocumented change-points using adaptive lasso. Journal of Applied Statistics, 41(6):1161–1173.

Shi, X., Lund, R., Killick, R., and Gallagher, C. (Forthcoming). Autocovariance estimation in the presence of changepoints by differencing. Journal of Time Series Analysis.

Stoica, P. and Moses, R. L. (2005). Spectral Analysis of Signals. Pearson Prentice Hall, Upper Saddle River, NJ.

Tolmatz, L. (2002).
On the distribution of the square integral of the Brownian bridge. The Annals of Probability, 30(1):253–269.

Zhang, N. R. and Siegmund, D. O. (2007). A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data.