Lagged Exact Bayesian Online Changepoint Detection with Parameter Estimation
Michael Byrd, Linh Nghiem, Jing Cao
Department of Statistical Science
Southern Methodist University
Abstract
Identifying changes in the generative process of sequential data, known as changepoint detection, has become an increasingly important topic for a wide variety of fields. A recently developed approach, which we call EXact Online Bayesian Changepoint Detection (EXO), has shown reasonable results with efficient computation for real-time updates. The method is based on a forward recursive message-passing algorithm. However, the detected changepoints from these methods are unstable. We propose a new algorithm called Lagged EXact Online Bayesian Changepoint Detection (LEXO) that improves the accuracy and stability of the detection by incorporating ℓ time lags into the inference. The new algorithm adds a recursive backward step to the forward EXO and has computational complexity linear in the number of added lags. Estimation of parameters associated with regimes is also developed. Simulation studies with three common changepoint models show that the detected changepoints from LEXO are much more stable, and that parameter estimates from LEXO have considerably lower MSE than those from EXO. We illustrate the applicability of the methods with two real-world data examples comparing EXO and LEXO.
Keywords: recursion, regime switching, forward-backward

1 Introduction
Changepoint detection is the process of detecting abrupt changes in the generative process of sequential data. Analysis of this form has been used in many applications, including climate change (Reeves et al., 2007; Li and Lund, 2012), robotics (Niekum et al., 2015), and telecommunication (Shields et al., 2017). We consider the case where the data sequence is assumed to be grouped into non-overlapping blocks, as in Barry and Hartigan (1992), each of which is often called a regime. Each regime is assumed to be generated from a fixed process. Boundaries between two adjacent regimes are called changepoints; the goal of changepoint detection is to determine where these boundaries are located in observations of sequential data. While detecting changepoints, it is also often of interest to estimate the parameters associated with regimes in order to quantify the changes that have occurred.

Changepoint detection methods can be divided into offline and online methods. An offline method, like Killick et al. (2012), Saatçi et al. (2010), and Maidstone et al. (2017), needs to observe the whole sequence of data before making an inference about the generative process of the sequence. On the contrary, an online method aims to update the inference at each time point with the newly observed data value from the sequence. In particular, from a Bayesian viewpoint, online changepoint detection started with the two similar methods developed by Adams and MacKay (2007) and Fearnhead (2006). Each of the two methods can be derived directly from the other; more importantly, both give exact and online Bayesian posterior inference, and we henceforth refer to them as the EXact Online Bayesian Changepoint Algorithm (EXO). In the past decade, EXO has spawned a wealth of literature developing and expanding on the methodology, with examples including particle filters (Fearnhead and Liu, 2007), regime classification (Ranganathan, 2012), online Thompson sampling (Mellor and Shapiro, 2013), and hyperparameter tuning (Caron et al., 2012).
EXO has been shown to be effective in detecting changepoints in real-world examples, as in Adams and MacKay (2007) and Fearnhead (2006). We propose the Lagged EXact Online Bayesian Changepoint Algorithm (LEXO), which improves EXO by introducing a time lag into the online inference. Whereas EXO uses the data from time 1 to t to make an inference on the current run length only at time t, our proposed method incorporates a backward inference on the run length at time t − ℓ, for any ℓ ∈ N. This backward inference has been used extensively in dynamic linear time series models, such as Kalman smoothers and Hidden Markov Model smoothers; see Douc et al. (2014). As we will illustrate, the main advantage of incorporating this backward inference is to stabilize the inference made by EXO. This stability is important for estimating parameters of regimes; indeed, parameter estimation from LEXO has a considerably smaller MSE than that from EXO. Moreover, we show that LEXO can be computed recursively with computational complexity linear in the number of lags.

The paper is developed in the following way. In Section 2, we review the theory of EXO and develop the theory of LEXO for the run length. In Section 3, we develop parameter estimation for both EXO and LEXO. In Section 4, we illustrate the performance of EXO and LEXO via an extensive simulation study for three common changepoint models. We illustrate the applicability of LEXO in Section 5 with two real data examples and conclude the paper in Section 6.

Let the stochastic process {X_{τ(t)}}_{t=1}^∞ be observed sequentially as x_t at equally spaced intervals of time, t = 1, 2, . . . , of an unknown length. The realizations of the process form non-overlapping regimes, where each observation is generated strictly from one regime; let τ(t) index each regime, so that time t uniquely identifies the regime from which the observation was generated.
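As a concrete, hypothetical illustration of such a regime-switching process, the following sketch simulates a sequence whose mean jumps at geometrically distributed gaps. The function name and parameter values are illustrative choices, not taken from the paper:

```python
import numpy as np

def simulate_regimes(n, lam_gap=50.0, mu_step=1.0, seed=0):
    """Simulate a unit-variance Gaussian sequence whose mean jumps by
    mu_step at changepoints; gaps between changepoints are geometric
    with mean lam_gap (an illustrative choice of gap distribution)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    mu, t, changepoints = 0.0, 0, []
    while t < n:
        gap = rng.geometric(1.0 / lam_gap)   # length of the current regime
        end = min(t + gap, n)
        x[t:end] = rng.normal(mu, 1.0, size=end - t)
        t = end
        if t < n:                            # a new regime starts at index t
            changepoints.append(t)
            mu += mu_step
    return x, changepoints
```

Each returned changepoint index is the start of a new regime, matching the definition used below.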
Further, assume that observations generated within the same regime are iid, and also independent of observations in other regimes; this formulation is known as a product partition model, see Barry and Hartigan (1992). For a given regime starting at t*, we assume that

x_{t*}, x_{t*+1}, . . . ~ iid f(x | η_{τ(t*)}).

Here, f denotes the probability distribution with regime-specific parameter η_{τ(t*)}. We additionally assume that the underlying distribution does not change with time, only the underlying parameter of the distribution. A changepoint is defined as the starting point of a new regime. The run length at a given time, r_t, is defined as the number of steps taken since the last changepoint occurred. Formally, a changepoint occurs at t when τ_t = τ_{t−1} + 1, and the run length at t is r_t = t − t* for t ≥ t*, where t* is the latest changepoint before time t.

The EXO algorithm computes the exact posterior distribution for the current run length at each time point, and changepoints are learned from these distributions. This gives a flexible model for a practitioner to make informed decisions about the occurrence of a changepoint at each time point. For instance, if the concentration of the mass of the run length's distribution shifts greatly toward 0 at time t + 1 after being mostly concentrated around the maximum possible run length at time t, then this would indicate a change in the regime.

Following the same notation as in Adams and MacKay (2007), we denote r_t, x_{a:b}, and x^{(r)}_t as the current run length at time t, the set of observations from time a to time b, and the set of observations associated with the regime at time t, respectively. Then, the goal is to find

P(r_t | x_{1:t}) = P(r_t, x_{1:t}) / P(x_{1:t}),    (1)

corresponding to the posterior distribution of the current run length at time t given all the observed data up to time t.
As shown in Adams and MacKay (2007) and Fearnhead (2006), exact posterior inference can be found by noting

θ_t ≜ P(r_t, x_{1:t})
    = Σ_{r_{t−1}=0}^{t−1} P(r_t, r_{t−1}, x_{1:t})
    = Σ_{r_{t−1}=0}^{t−1} P(r_t, x_t | r_{t−1}, x_{1:t−1}) P(r_{t−1}, x_{1:t−1})
    = Σ_{r_{t−1}=0}^{t−1} P(r_t | r_{t−1}) [prior] × P(x_t | r_t, x^{(r)}_{t−1}) [likelihood] × θ_{t−1}.    (2)

Equation (2) defines a forward message-passing scheme from θ_{t−1} to θ_t.

Figure 1: The run length illustrated for t = 1, . . . , 7. The blue shaded node gives r_4 = 1, where the red path signifies the forward paths possible to reach that run length under the model assumptions. Additionally, the green lines illustrate the backward paths for LEXO-3, again under the model assumptions.

The likelihood depends only on x^{(r)}_t, and the prior is specified in terms of a hazard function H:

P(r_t | r_{t−1}) = H(r_{t−1} + 1)       if r_t = 0,
                   1 − H(r_{t−1} + 1)   if r_t = r_{t−1} + 1,
                   0                    otherwise.

Here,

H(x) = P_gap(g = x) / Σ_{t=x}^∞ P_gap(g = t),

and P_gap(g) is a prior distribution for the interval between changepoints. As in Adams and MacKay (2007), this prior distribution is chosen to be the geometric distribution with time scale λ_gap. Thus, the hazard function is constant at the prior hazard rate 1/λ_gap. Figure 1 illustrates the path of calculation under the given hazard, where, in red, all possible routes to get a run length of 1 at time 4 are shown. The EXO calculation incorporates all of these possible paths, giving the exact probability of reaching that point under every path.

EXO has been incorporated in many settings, using, for example, Gaussian processes (Knoblauch and Damoulas, 2018) and Thompson sampling (Mellor and Shapiro, 2013), but the forward message-passing scheme of EXO is most convenient when the data are assumed to follow a distribution that belongs to the exponential family with conjugate priors. In such a situation, the likelihood in equation (2) is characterized by a finite number of sufficient statistics, which in turn can be calculated incrementally as data arrive. More specifically, exponential family likelihoods have the form

P(x | η) = h(x) exp(η^T U(x) − A(η)),

where

A(η) = log ∫ h(x) exp(η^T U(x)) dx

and U(x) denotes the sufficient statistics for η. The conjugate prior for η has the form

P(η | χ, ν) = h̃(η) exp(η^T χ − ν A(η) − Ã(χ, ν)),

where χ and ν denote the hyperparameters.
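For the geometric gap prior, the hazard works out to the constant 1/λ_gap; a quick numerical check (a sketch, with λ_gap = 50 used only as an illustrative value):

```python
def hazard_geometric(x, lam_gap=50.0):
    """H(x) = P_gap(g = x) / sum_{t >= x} P_gap(g = t) for a geometric
    gap prior with success probability p = 1/lam_gap, support g = 1, 2, ...
    The tail sum has the closed form (1 - p)^(x - 1)."""
    p = 1.0 / lam_gap
    pmf = p * (1.0 - p) ** (x - 1)      # P_gap(g = x)
    tail = (1.0 - p) ** (x - 1)         # P_gap(g >= x)
    return pmf / tail                   # = p, independent of x
```

The constant hazard is what lets the prior term in equation (2) reduce to the two fixed weights H and 1 − H.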
So, when conditioned on r_t = r, the posterior distribution P(η_t | r_t = r, x^{(r)}_{t−1}) has exactly the same form as the prior, but with the new parameters χ^{(r)}_t and ν^{(r)}_t given by

χ^{(r)}_t = χ_prior + Σ_{x ∈ x^{(r)}_{t−1}} U(x)    and    ν^{(r)}_t = ν_prior + r_t.

Lastly, the likelihood P(x_t | r_t, x^{(r)}_{t−1}) can be calculated by marginalizing the parameter η_{τ(t)}. Note that, in this updating scheme, if r_t = 0 then x_t starts a new regime, which implies the likelihood is calculated under the prior values of the hyperparameters. The EXO algorithm, as developed by Adams and MacKay (2007), is summarized in Algorithm 1.

Algorithm 1 EXO Algorithm
  Initialize: P(r_1 = 0) = 1; P(x_1, r_1) = P(x_1 | χ^{(0)}_1, ν^{(0)}_1), with χ^{(0)}_1 = χ_prior, ν^{(0)}_1 = ν_prior.
  Update sufficient statistics: χ^{(0)}_2 = χ_prior, ν^{(0)}_2 = ν_prior; χ^{(1)}_2 = χ^{(0)}_1 + U(x_1), ν^{(1)}_2 = ν^{(0)}_1 + 1.
  for t = 2, 3, . . . do
    Calculate predictive probabilities: π^{(r)}_t = P(x_t | r_t = r, x^{(r)}_{t−1}) = P(x_t | χ^{(r)}_t, ν^{(r)}_t).
    Calculate growth probabilities: P(r_t = r_{t−1} + 1, x_{1:t}) = P(r_{t−1}, x_{1:t−1}) π^{(r_t)}_t (1 − H).
    Calculate changepoint probabilities: P(r_t = 0, x_{1:t}) = Σ_{r_{t−1}} P(r_{t−1}, x_{1:t−1}) π^{(0)}_t H.
    Calculate marginal probability: P(x_{1:t}) = Σ_{r=0}^{t−1} P(r_t = r, x_{1:t}).
    Normalize: P(r_t = r | x_{1:t}) = P(r_t = r, x_{1:t}) / P(x_{1:t}), for r = 0, . . . , t − 1.
    Update sufficient statistics: χ^{(0)}_{t+1} = χ_prior, ν^{(0)}_{t+1} = ν_prior; χ^{(r_t+1)}_{t+1} = χ^{(r_t)}_t + U(x_t), ν^{(r_t+1)}_{t+1} = ν^{(r_t)}_t + 1.
  end for

We briefly mention the standard pruning procedure that is often used with EXO, for example in Caron et al. (2012). Often, many values of the posterior run length distribution are negligibly small, and those falling below a small threshold are set to 0. This methodology often leads to a massive reduction in computation and memory, and is easily extended to the LEXO framework below.

Note that EXO uses data from time 1 to t to make an inference at time t; hence the algorithm only makes a forward pass. We consider the incorporation of a backward pass for estimating the regime quantities at time t − ℓ, i.e., r_{t−ℓ} and η_{t−ℓ}, for some ℓ ∈ N. We refer to this methodology as lagged exact online inference with ℓ lags (LEXO-ℓ). In other words, at each time point in the sequence, we refine the inference made in the past. Our algorithm remains online in the sense that once a new datum is observed, the past data need not be reused to update the model. Moreover, the calculations consist of simple operations on already computed values from the EXO calculation, which can be used to give more refined estimates of r_{t−ℓ}, and hence η_{t−ℓ}, over time. Figure 1 illustrates the incorporation of three lags, shown in green, into the previous example of getting a run length of 1 at time t = 4.
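Before turning to the lagged inference, the forward pass of Algorithm 1 can be sketched concretely. The following is a minimal runnable sketch for the special case of a unit-variance Gaussian likelihood with a conjugate Normal prior on the mean; the paper itself is stated for general exponential families, and the prior parameters here are illustrative assumptions:

```python
import numpy as np

def exo_gaussian_mean(x, hazard=1.0 / 50, mu0=0.0, kappa0=1.0):
    """EXO forward pass for x_t ~ N(mu, 1) with mu ~ N(mu0, 1/kappa0).
    Returns R with rows R[t] = P(r_t | x_{1:t}) (normalized each step)."""
    T = len(x)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0                               # before any data, r = 0
    mu = np.array([mu0])                        # posterior means, one per run length
    kappa = np.array([kappa0])                  # effective sample sizes, one per run length
    for t, xt in enumerate(x):
        # predictive density of x_t under each candidate run length
        var = 1.0 + 1.0 / kappa
        pred = np.exp(-0.5 * (xt - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        growth = R[t, : t + 1] * pred * (1.0 - hazard)      # r -> r + 1
        cp = (R[t, : t + 1] * pred[0] * hazard).sum()       # any r -> 0, prior predictive
        R[t + 1, 1 : t + 2] = growth
        R[t + 1, 0] = cp
        R[t + 1] /= R[t + 1].sum()
        # conjugate update; prepend the prior for the new r = 0 hypothesis
        mu = np.append(mu0, (kappa * mu + xt) / (kappa + 1.0))
        kappa = np.append(kappa0, kappa + 1.0)
    return R
```

The pruning step described above would correspond to zeroing out entries of R[t + 1] below a small threshold before normalizing.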
With the incorporation of LEXO-3, an additional 7 possible paths from times 5, 6, and 7 are incorporated, in addition to the 2 possible forward paths already computed via EXO. An example backward path from r_4 = 1 is (r_5, r_6, r_7) = (0, 1, 2).

Theorem 1.
Given a sequence of data {x_t}_{t=1}^∞ following a product partition model, LEXO-ℓ gives exact posterior distributions for the run length for any ℓ ∈ N. Moreover, these run length distributions are updated recursively, with computational complexity O(ℓt).

Proof. We start with ℓ = 1, so the goal is to compute the distribution P(r_{t−1} | x_{1:t}). By the chain rule, we have

P(r_{t−1} | x_{1:t}) = Σ_{r_t} P(r_{t−1} | r_t, x_{1:t}) P(r_t | x_{1:t}),    (3)

where r_t ∈ {0, r_{t−1} + 1}. The second term in the sum of (3) is computed by EXO at time t. For the first term, if r_t ≠ 0, then P(r_{t−1} = r_t − 1 | r_t, x_{1:t}) = 1. If r_t = 0, then the chain rule gives

P(r_{t−1} | r_t = 0, x_{1:t}) ∝ P(r_{t−1}, r_t = 0, x_{1:t})
    ∝ P(x_t | r_{t−1}, r_t = 0, x_{1:t−1}) P(r_t = 0 | r_{t−1}, x_{1:t−1}) P(r_{t−1}, x_{1:t−1})
    ∝ P(x_t | r_t = 0, x_{1:t−1}) P(r_{t−1}, x_{1:t−1})
    ∝ P(r_{t−1} | x_{1:t−1}),

which is exactly the run length distribution computed by EXO at time t − 1. In the above derivation, the third proportionality follows because P(r_t = 0 | r_{t−1}, x_{1:t−1}) = H, which is a constant. The last proportionality follows because P(x_t | r_t = 0, x_{1:t−1}) does not depend on r_{t−1}. Intuitively, if we have a new regime at t, all the past information before that becomes irrelevant. Putting everything together, we have

P(r_{t−1} = r | x_{1:t}) = P(r_t = r + 1 | x_{1:t}) + P(r_{t−1} = r | x_{1:t−1}) P(r_t = 0 | x_{1:t})

for r = 0, . . . , t − 1. The above shows that the run length distribution at time t − 1 from LEXO-1 is completely determined by the run length distributions at times t − 1 and t from EXO. Generally, for any ℓ ≥ 2, the run length distribution of LEXO-ℓ at time t − ℓ is completely determined by the run length distribution at time t − ℓ from EXO and the run length distribution at time t − (ℓ − 1) from LEXO-(ℓ − 1):

P(r_{t−ℓ} | x_{1:t}) = Σ_{r_{t−ℓ+1}} P(r_{t−ℓ} | r_{t−ℓ+1}, x_{1:t}) P(r_{t−ℓ+1} | x_{1:t}),    (4)

where r_{t−ℓ+1} ∈ {0, r_{t−ℓ} + 1}. The second term in (4) is obtained from the computation of LEXO-(ℓ − 1) at time t − ℓ + 1. For the first term, if r_{t−ℓ+1} ≠ 0, then P(r_{t−ℓ} = r_{t−ℓ+1} − 1 | r_{t−ℓ+1}, x_{1:t}) = 1. If r_{t−ℓ+1} = 0, then we start a new regime at t − ℓ + 1, so P(x_{(t−ℓ+1):t} | r_{t−ℓ+1} = 0, x_{1:t−ℓ}) does not depend on r_{t−ℓ}. Then we have

P(r_{t−ℓ} | r_{t−ℓ+1} = 0, x_{1:t}) ∝ P(r_{t−ℓ}, x_{1:t−ℓ}) ∝ P(r_{t−ℓ} | x_{1:t−ℓ}),

which is exactly the run length distribution computed by EXO at time t − ℓ. Putting everything together, we then have

P(r_{t−ℓ} = r | x_{1:t}) = P(r_{t−ℓ+1} = r + 1 | x_{1:t}) + P(r_{t−ℓ} = r | x_{1:t−ℓ}) P(r_{t−ℓ+1} = 0 | x_{1:t}),

for r = 0, . . . , t − ℓ − 1. The above also shows that, for any ℓ ≥ 1, the LEXO-ℓ algorithm can be computed recursively, starting from EXO. Note that going from lag ℓ − 1 to lag ℓ requires exactly one computation of complexity O(t), so the entire procedure is O(ℓt).

Another natural question while detecting changepoints is how to estimate the parameter η_t at each time t. In this section, we consider how to find the posterior distribution of these parameters from LEXO-ℓ. First we start with EXO (ℓ = 0):

P(η_t | x_{1:t}) = Σ_{r_t} P(η_t | r_t, x_{1:t}) P(r_t | x_{1:t}),

where the second term, P(r_t | x_{1:t}), is computed by the EXO algorithm. For the first term, given r_t = r, this is the posterior distribution of η_t computed from x^{(r)}_t, i.e., as if only r independent observations from the most recent regime were observed.

For LEXO-ℓ, the parameter estimation process is more involved. The theorem below specifies that the posterior distribution of η_t can be computed recursively.

Theorem 2.
Given a sequence of data {x_t}_{t=1}^∞ following a product partition model, the posterior distribution of η_t from LEXO-ℓ is a mixture of mixture models and can be computed exactly and recursively for all t and ℓ.

Proof. For any lag ℓ ≥ 1, we can write

P(η_{t−ℓ} | x_{1:t}) = Σ_{r_{t−ℓ}} P(η_{t−ℓ} | r_{t−ℓ}, x_{1:t}) P(r_{t−ℓ} | x_{1:t}),    (5)

so the posterior distribution of η_{t−ℓ} is a mixture of components, P(η_{t−ℓ} | r_{t−ℓ}, x_{1:t}), which are weighted by the posterior distribution of the run length from LEXO-ℓ. We show below that each component P(η_{t−ℓ} | r_{t−ℓ}, x_{1:t}) is itself a mixture of two other distributions. Indeed, we have

P(η_{t−ℓ} | r_{t−ℓ}, x_{1:t}) = Σ_{r_{t−ℓ+1}} P(η_{t−ℓ} | r_{t−ℓ+1}, r_{t−ℓ}, x_{1:t}) P(r_{t−ℓ+1} | r_{t−ℓ}, x_{1:t}),    (6)

where r_{t−ℓ+1} ∈ {0, r_{t−ℓ} + 1}. For the first term in the sum of (6), if r_{t−ℓ+1} = 0 then a new regime begins at t − ℓ + 1, which implies η_{t−ℓ} is independent of x_{(t−ℓ+1):t}. Denote P(η_{t−ℓ} | r_{t−ℓ}, x_{1:t}) = η^{(ℓ)}_{t−ℓ}, where the superscript (ℓ) in η^{(ℓ)}_{t−ℓ} indicates that η_{t−ℓ} is estimated with ℓ more observations after time t − ℓ. We then have

P(η_{t−ℓ} | r_{t−ℓ+1} = 0, r_{t−ℓ}, x_{1:t}) = P(η_{t−ℓ} | r_{t−ℓ}, x_{1:t−ℓ}) = η^{(0)}_{t−ℓ},

which is the parameter estimate from EXO at time t − ℓ. If r_{t−ℓ+1} = r_{t−ℓ} + 1 > 0, then x_{t−ℓ} and x_{t−ℓ+1} are from the same regime; this implies

P(η_{t−ℓ} | r_{t−ℓ+1} = r_{t−ℓ} + 1, r_{t−ℓ}, x_{1:t}) = P(η_{t−ℓ+1} | r_{t−ℓ+1}, x_{1:t}) = η^{(ℓ−1)}_{t−ℓ+1},

which is the estimate at time t − (ℓ − 1) with ℓ − 1 more observations. For the weights in (6), by Bayes' rule, P(r_{t−ℓ+1} | r_{t−ℓ}, x_{1:t}) ∝ P(r_{t−ℓ} | r_{t−ℓ+1}, x_{1:t}) P(r_{t−ℓ+1} | x_{1:t}). If r_{t−ℓ+1} = 0, then by the previous argument, P(r_{t−ℓ} | r_{t−ℓ+1} = 0, x_{1:t}) = P(r_{t−ℓ} | x_{1:t−ℓ}). If r_{t−ℓ+1} ≠ 0, then P(r_{t−ℓ} = r_{t−ℓ+1} − 1 | r_{t−ℓ+1}, x_{1:t}) = 1. Therefore,

P(r_{t−ℓ+1} = 0 | r_{t−ℓ}, x_{1:t}) ∝ P(r_{t−ℓ} | x_{1:t−ℓ}) P(r_{t−ℓ+1} = 0 | x_{1:t}) ≜ α_1,

and

P(r_{t−ℓ+1} = r_{t−ℓ} + 1 | r_{t−ℓ}, x_{1:t}) ∝ P(r_{t−ℓ+1} = r_{t−ℓ} + 1 | x_{1:t}) ≜ α_2.

Letting α = α_1 / (α_1 + α_2), we now have an explicit form for the distribution of η^{(ℓ)}_{t−ℓ} as

η^{(ℓ)}_{t−ℓ} = α η^{(0)}_{t−ℓ} + (1 − α) η^{(ℓ−1)}_{t−ℓ+1}.

That is, η^{(ℓ)}_{t−ℓ} is itself a mixture of two distributions: 1) the posterior distribution for η_{t−ℓ} from EXO at time t − ℓ, and 2) the posterior distribution for η_{t−ℓ+1} from LEXO-(ℓ − 1). Hence η^{(ℓ)}_{t−ℓ}, and therefore P(η_{t−ℓ} | x_{1:t}), can be computed recursively.

Applying the properties of mixture models, we can compute the posterior Kth moment of P(η_{t−ℓ} | x_{1:t}) in the two following stages:

• Stage 1: Compute E([η^{(ℓ)}_{t−ℓ}]^K) recursively as

E([η^{(ℓ)}_{t−ℓ}]^K) = α E([η^{(0)}_{t−ℓ}]^K) + (1 − α) E([η^{(ℓ−1)}_{t−ℓ+1}]^K).

• Stage 2: Compute the posterior moment:

E([η_{t−ℓ} | x_{1:t}]^K) = Σ_{r_{t−ℓ}} E([η^{(ℓ)}_{t−ℓ}]^K) P(r_{t−ℓ} | x_{1:t}).
To demonstrate the effect of incorporating lags into the online framework of EXO, we investigate LEXO for changepoint detection under three different settings: a Normal model with mean shift, a Normal model with precision (inverse variance) shift, and a Poisson model (both mean and variance shift). Formally, in the same order:

1. X_{τ(t)} ~ N(µ_{τ(t)}, 1), where µ_1 = 0 and changes such that µ_{τ(t)} = µ_{τ(t)−1} + 1.
2. X_{τ(t)} ~ N(0, 1/ξ_{τ(t)}), where ξ_1 = 16 and changes such that ξ_{τ(t)} is a fixed fraction of ξ_{τ(t)−1}.
3. X_{τ(t)} ~ Poisson(λ_{τ(t)}), where λ_1 = 1 and changes such that λ_{τ(t)} = 4 + λ_{τ(t)−1}.

The Normal mean and precision shifts represent common applications of an online streaming scenario. The Poisson setting illustrates a common counting problem that can be difficult to detect due to the increasing variability of the model. An example of each of the three processes can be seen in Figure 2. For all settings, we impose 5 equally-spaced changepoints; for each setting, 1000 samples were generated.

Figure 2: Random draw of each of the three simulated processes considered in the simulation studies. (a) Mean Shift, (b) Precision Shift, (c) Poisson.

For each sample, we investigate the performance of the run length estimates and the parameter estimates made by EXO and LEXO-ℓ, for ℓ = 1, . . . ,
30. The hyperparameters for the prior were chosen to be non-informative, and the hazard rate was chosen to be H = 1/50. The difference in estimated run length is inspected via the median, across the 1000 samples, of the maximum a posteriori (MAP) estimate from the posterior distribution of the run length, which is plotted at each time point. This plot illustrates a common way for practitioners to decide that a changepoint has occurred, see Adams and MacKay (2007). Additionally, the parameter estimates are compared at each time point across simulations by calculating the ratio of the posterior MSE of EXO and LEXO. At time t − ℓ, given the true parameter η_t, the posterior MSE of LEXO-ℓ is defined as

MSE^{(ℓ)}_t = (E(η_{t−ℓ} | x_{1:t}) − η_t)^2 + Var(η_{t−ℓ} | x_{1:t}).

Note that LEXO performs better than EXO in terms of parameter estimation when this ratio is greater than 1. The average posterior mean across the 1000 samples of each parameter is also plotted at each time point.

Regarding the run length distributions of EXO and LEXO, the run length MAP plots can be seen in Figure 3. For each of the processes, a clear shift can be observed from LEXO, correcting overstepping from EXO's run length MAP. Additionally, the shift converges toward the true changepoint in the process when allowed enough lags to refine the uncertainty in the estimate. It seems evident that LEXO refines EXO appropriately, not over-correcting past where a process changed.

Figure 3: Median run length, as indicated by the MAP, from 1000 simulations of each process, from EXO (dashed gray line) and LEXO-ℓ, for ℓ = 5 (solid blue line), ℓ = 15 (solid purple line), and ℓ = 30 (solid green line). (a) Mean Shift, (b) Precision Shift, (c) Poisson.

Inspecting the parameter estimates also shows clear performance gains from LEXO. Beginning with the MSE ratio plots, shown in the left column of Figure 4, a substantial improvement in the estimates is apparent. Each of the three processes shows an improvement in terms of the MSE ratio, where it is not uncommon to see more than five times better performance. Noticeably, the MSE ratio is largest for the precision shift model under LEXO-30. This can be attributed to more data being incorporated at each point from future observations, as well as more concentration around the correct run length in the estimated run length distribution.

Next, the average mean plots are presented in the right column of Figure 4. The improvement in the parameter estimates is evident from the plots, where the estimates get closer and closer to the truth. EXO's estimate for the precision shift model is so poor that it does not fully appear on a plot with reasonable dimensions. However, after incorporating just 5 lags, LEXO begins to closely follow the true changes in the model. The Poisson setting behaves similarly to the mean shift model.

Finally, Table 1 reports the average MSE ratio of EXO over LEXO-ℓ at three time points, t = 197, 200, and 220, corresponding to right before, at, and a fair bit after a changepoint. Overall, most of the MSE ratios are greater than 1; for the normal precision shift model in particular, the ratio is much larger than 1. This shows a considerable benefit of LEXO over EXO. However, right before a changepoint (t = 197), adding a large number of lags can have a detrimental effect on the performance of the estimates, with the MSE ratio falling below 1.
This occurs because a large number of lags includes too many points from the next regime. However, the detection of regime changes does not deteriorate with higher lags. A reasonable solution is to use as high a lag as possible for the detection of changepoints, while using a smaller lag for moment calculations.
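The posterior MSE above decomposes into the squared bias of the posterior mean plus the posterior variance; a small sketch with hypothetical numbers (not taken from Table 1):

```python
def posterior_mse(post_mean, post_var, eta_true):
    """MSE_t^(l) = (E[eta | x] - eta_true)^2 + Var[eta | x]."""
    return (post_mean - eta_true) ** 2 + post_var

# Hypothetical EXO vs. lagged estimates of a true parameter eta = 1.0:
# the lagged estimate is both closer to the truth and less uncertain.
mse_exo = posterior_mse(0.90, 0.20, 1.0)    # 0.21
mse_lexo = posterior_mse(1.02, 0.05, 1.0)   # 0.0504
ratio = mse_exo / mse_lexo                  # > 1 favors LEXO
```

A ratio greater than 1 is the criterion used throughout the simulation comparisons.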
Figure 4: The left plots represent the ratio of average posterior MSE of EXO vs. LEXO-ℓ at each time point, MSE^{(0)}_t / MSE^{(ℓ)}_t, while the right plots show the average posterior mean across 1000 simulations for EXO (dashed gray line), ℓ = 5 (solid blue line), ℓ = 15 (solid purple line), and ℓ = 30 (solid green line), along with the true value of the parameter (solid black line). (a) Mean Shift, (b) Precision Shift, (c) Poisson.

Table 1: MSE ratio of EXO over LEXO-ℓ given at t = 197 (right before a changepoint), t = 200 (at a changepoint), and t = 220 (in the middle of a regime). Columns: Type, Time, MSE ratio of EXO over LEXO-ℓ.

In the period from mid-1972 to mid-1975, several major political events occurred that had potentially significant impacts on the US macroeconomy, including the Watergate scandal and the OPEC embargo. We examine the daily returns of the Dow Jones Industrial Average from July 3, 1972 to June 30, 1975 to detect abrupt changes in the underlying distribution of the return. The daily return of day t with closing price p^close_t is defined as

R_t = p^close_t / p^close_{t−1} − 1,

which was modeled as a Gaussian distribution with mean 0 and unknown variance. The data on closing prices are publicly available through Google Finance. Of specific interest is the change in the volatility of the return; in other words, we focus on estimating the precision (inverse variance) of the underlying Gaussian distribution. This dataset is also analyzed in the Adams and MacKay (2007) paper and is plotted in Figure 5 (top).

We ran EXO and LEXO up to lag ℓ = 100 on the dataset, with a gamma prior on the precision with a shape of 1 and a rate close to 0. The hazard rate was held fixed, and results are reported for ℓ = 5, 30, 50, and 100.
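The return definition above is straightforward to compute from a price series; a sketch with hypothetical prices:

```python
import numpy as np

def daily_returns(close):
    """R_t = p^close_t / p^close_{t-1} - 1 for a series of closing prices."""
    close = np.asarray(close, dtype=float)
    return close[1:] / close[:-1] - 1.0

# hypothetical closing prices: up 1%, then down 1%
returns = daily_returns([100.0, 101.0, 99.99])
```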
Figure 5: Dow Jones Industrial Average analysis plots. Top: raw data, with 0 plotted horizontally in red and the locations of the changepoints found by LEXO-100 in orange. Middle: run length for EXO and LEXO-5, -30, -50, and -100. Bottom: posterior mean of the precision, at each time, for each lag.

Figure 5 shows that the higher lags (50 and 100) detect two additional changepoints, near t = 500 and at t = 591. Additionally, the MAP of the run length distribution and the posterior means of the precision given by LEXO-ℓ are much more stable than those given by EXO. Table 2 below lists the detected changepoints and the events that are likely associated with them.

Table 2: Dates and events that are potentially associated with the detected changepoints for the Dow Jones return dataset.

EXO   LEXO-100   Event   Description                                  Date
208   178        142     US involvement with Vietnam War ends         1/27/1973
346   330        327     OPEC oil embargo begins                      10/19/1973
424   384        379     Nixon refuses to give tapes to the Senate    1/3/1974
NA    505        504     Threshold Test Ban Treaty signed             7/4/1974
544   526        526     Nixon's involvement in Watergate revealed    8/5/1974
NA    591        591     Democrats take control in Congress           11/5/1974
657   601        602     AT&T anti-trust suit filed                   11/20/1974
Note: The first two columns give the index at which a sudden MAP shift occurs for EXO and LEXO-100, respectively. For EXO, we ignore random jumps that appear to be mistakes from the procedure. The Event column is the index at which the associated event takes place; if an event occurred on a weekend, it is assigned to the Friday of that week.
Coal mining disasters

This classic dataset gives counts of coal mining accidents that killed ten or more people over the 112 years from March 15, 1851 to March 22, 1962. We model the number of accidents x_t in each year of the period as following a Poisson distribution with unknown rate. We ran EXO and LEXO up to lag ℓ = 30 on the dataset, with a gamma prior with shape and rate of 1 and 10−, respectively. Figure 6 (top) shows the posterior mean of the rate parameter overlaid with the data. Figure 6 (bottom) shows the run length associated with the MAP of the run length distribution.
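The Poisson-Gamma pair is again conjugate, so the within-regime posterior for the rate is available in closed form. A minimal sketch (illustrative Python, not the authors' R code; the counts and default prior values are hypothetical):

```python
# Sketch (illustrative): conjugate Gamma-Poisson update for yearly accident
# counts x_t ~ Poisson(lam) with lam ~ Gamma(shape, rate). The default prior
# values are assumptions, since the paper's rate exponent is illegible.

def poisson_rate_posterior(counts, shape=1.0, rate=1e-4):
    """With x_1..x_n ~ Poisson(lam) and lam ~ Gamma(shape, rate),
    conjugacy gives lam | x ~ Gamma(shape + sum(x), rate + n)."""
    post_shape = shape + sum(counts)
    post_rate = rate + len(counts)
    return post_shape, post_rate

# Posterior mean of the rate within a hypothesized regime:
counts = [4, 3, 5, 2]              # hypothetical yearly accident counts
a, b = poisson_rate_posterior(counts)
post_mean = a / b                  # shrinks toward the prior for short regimes
```

With a nearly flat prior (small rate), the posterior mean is close to the sample mean, which is why the plotted posterior means track the per-regime averages of the data.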
Figure 6: The coal mining analysis plots. Top: the data overlaid with the posterior means of the Poisson rate parameter for the corresponding LEXO. Bottom: the run length found by the MAP of the posterior run length distribution.

Figure 6 shows that the MAP of the run length distribution and the posterior mean of the rate given by EXO are very unstable. The more lags added, the more stable the plots become. The lines corresponding to LEXO-15, LEXO-25, and LEXO-30 are stable, with no sporadic changes in the MAP of the run length distribution over time. The MAP of the run length seemingly converges to t = 41, as LEXO-25 and LEXO-30 both find and settle at this point. Historically, this point has been identified by other changepoint methodologies, corresponding to the introduction of the Coal Mines Regulation Act in 1887. The higher lags illustrate the mean change after the regulation, showing a drop in the rate from about 3 to around 1.

Conclusion

The introduction of a backward pass into the online changepoint framework brings much-needed refinement to the inference. The EXO framework is naturally susceptible to outliers and can take several time points to find a new regime. We illustrate that LEXO-ℓ can refine the run length distributions estimated by EXO, including some level of convergence to the true changepoints and better parameter estimates. The LEXO framework is an iterative process, allowing continually updated regime estimates for the practitioner to use. Moreover, for high-frequency datasets where many offline methods relying on Markov chain Monte Carlo (MCMC) are not scalable, LEXO can be used with reasonable confidence and computational power.
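For reference, the forward EXO recursion that LEXO's backward pass refines can be sketched as below (illustrative Python, not the authors' code; `pred_prob(x, r)` stands in for the conjugate predictive density p(x_t | run length r) and is an assumed interface):

```python
# Minimal sketch of the forward run-length recursion of Adams and MacKay
# (2007) with a constant hazard H. `pred_prob(x, r)` is a placeholder for
# the predictive density under run length r; it is an assumption, not an
# actual API from the paper's code.

def exo_step(run_probs, x, pred_prob, hazard=0.01):
    """One forward update of the run-length distribution.

    run_probs[r] = P(r_{t-1} = r | x_{1:t-1}); returns P(r_t | x_{1:t}).
    """
    grown = [p * pred_prob(x, r) * (1.0 - hazard)
             for r, p in enumerate(run_probs)]           # run continues
    cp = sum(p * pred_prob(x, r) * hazard
             for r, p in enumerate(run_probs))           # changepoint: r_t = 0
    new = [cp] + grown
    z = sum(new)
    return [p / z for p in new]                          # normalize
```

LEXO-ℓ adds a backward pass over the most recent ℓ time points, which is what stabilizes the run-length estimates relative to this purely forward filter.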
SUPPLEMENTAL MATERIAL
R code implementing the LEXO algorithm for the three changepoint models is available on GitHub at github.com/lnghiemum/LEXO.

References
Adams, R. P. and MacKay, D. J. C. (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.

Barry, D. and Hartigan, J. A. (1992). Product partition models for change point problems. The Annals of Statistics, pages 260–279.

Caron, F., Doucet, A., and Gottardo, R. (2012). On-line changepoint detection and parameter estimation with application to genomic data. Statistics and Computing.

Douc, R., Moulines, E., and Stoffer, D. (2014). Nonlinear Time Series: Theory, Methods and Applications with R Examples. Chapman and Hall/CRC.

Fearnhead, P. (2006). Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing.

Fearnhead, P. and Liu, Z. (2007). On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association.

Knoblauch, J. and Damoulas, T. (2018). Spatio-temporal Bayesian on-line changepoint detection with model selection. arXiv preprint arXiv:1805.05383.

Li, S. and Lund, R. (2012). Multiple changepoint detection via genetic algorithms. Journal of Climate.

Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2017). On optimal multiple changepoint algorithms for large data. Statistics and Computing.

Mellor, J. and Shapiro, J. (2013). Thompson sampling in switching environments with Bayesian online change detection. In Artificial Intelligence and Statistics, pages 442–450.

Niekum, S., Osentoski, S., Atkeson, C. G., and Barto, A. G. (2015). Online Bayesian changepoint detection for articulated motion models. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 1468–1475. IEEE.

Ranganathan, A. (2012). PLISS: labeling places using online changepoint detection. Autonomous Robots.

Reeves, J., Chen, J., Wang, X. L., Lund, R., and Lu, Q. Q. (2007). A review and comparison of changepoint detection techniques for climate data. Journal of Applied Meteorology and Climatology.

Saatçi, Y., Turner, R. D., and Rasmussen, C. E. (2010). Gaussian process change point models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 927–934.

Shields, A., Doody, P., and Scully, T. (2017). Application of multiple change point detection methods to large urban telecommunication networks. In