On Posterior Consistency of Bayesian Changepoint Models
Nilabja Guha, University of Massachusetts Lowell
Jyotishka Datta, Virginia Polytechnic Institute and State University
Abstract
While there have been many recent developments in Bayesian model selection and variable selection for high-dimensional linear models, the corresponding literature in the presence of change points is sparse, unlike its frequentist counterpart. We consider a hierarchical Bayesian linear model in which the active set of covariates that affects the observations through a mean model can vary between different time segments. Such structures arise in the social and economic sciences, for example a sudden change in house prices driven by external economic factors, or changes in crime rates driven by social and built-environment factors. Using an appropriate adaptive prior, we outline the development of a hierarchical Bayesian methodology that can select the true change points as well as the true covariates with high probability. We provide the first detailed theoretical analysis of posterior consistency, with or without covariates, under suitable conditions. Gibbs sampling techniques provide an efficient computational strategy. We also present a small-sample simulation study as well as an application to crime forecasting.
1 Introduction

In many applications in economics and the social sciences, the observed variable depends on covariates through a mean structure, and that mean structure changes with time based on changes in some latent, unobserved factor such as an economic phenomenon or a public policy change (Datta et al., 2019). The dependence on a covariate may be local (present only in some time segments) or global. Consistently selecting the change points and the covariates remains an important problem, one that provides insight into the underlying sociological and economic factors.

There is a huge and influential literature, both Bayesian and frequentist, on changepoint detection, with applications in diverse areas, dating back several decades. Initial attempts at changepoint detection using cumulative sums date back to Page (1955, 1957). Shortly after, changepoint detection for the location parameter, primarily within the Gaussian observation model, was studied by several authors including Chernoff and Zacks (1964), Gardner (1969), and Srivastava, among others. More recently, Adams and MacKay (2007) proposed Bayesian online changepoint detection based on recursive run length estimation.

With high-throughput data becoming routine in modern scientific studies, the problem of variable selection in a high-dimensional changing linear model becomes important, where a key inferential goal is to identify the potentially different set of 'active' variables within each segment. Recent papers that address this problem include the frequentist approach of Lee et al. (2016) and the Bayesian treatment in Datta et al. (2019). In Lee et al. (2016), a lasso penalization approach is used for changing high-dimensional linear regression while selecting the relevant regressors under a sparsity assumption, while in Datta et al. (2019) this is handled by using the shrinking and diffusing prior (Narisetty et al., 2014) in each segment for variable selection. As Datta et al. (2019) point out, the decomposability of the likelihood for the changing linear regression model also makes it easy to incorporate other Bayesian variable selection priors: for example, one could use a spike-and-slab prior (Mitchell and Beauchamp, 1988) or a variety of shrinkage priors that have become quite popular for sparse variable selection and estimation (Polson and Scott, 2010b; Bhadra et al., 2019).

As the main focus of this article is theoretical guarantees, it is worthwhile to briefly mention a few recent theoretical results that are relevant to our present discourse. Roughly speaking, the recent theoretical advances can be broadly classified into two major thrusts: (a) optimality properties of estimators of the underlying piecewise mean, and (b) optimality properties of estimators of the changepoint locations.

For the first kind, Gao et al. (2017) established sharp nonasymptotic risk bounds for least squares estimators when the underlying mean has a piecewise constant structure, and observed a phase change phenomenon as the number of changepoints grows. Let Θ_k denote the model of all piecewise constant θ with at most k − 1 changepoints, and let θ̂(Θ_k) be the least squares estimator (LSE) under this model. Now consider a possibly misspecified LSE θ̂(Θ_k) when the true θ* ∈ Θ_{k*}. Gao et al. (2017) provided sharp risk bounds that are minimax when k = k*, i.e.,
inf_{θ̂} sup_{θ ∈ Θ_k} E ||θ̂ − θ||² ≍ σ² { log log n (k = 2) + k log(en/k) (k > 2) }.

Martin and Shen (2017) developed an efficient empirical Bayes strategy for the piecewise constant sequence model and showed that the resulting posterior distribution attains a similar optimal rate as in Gao et al. (2017). Theorem 1 of Martin and Shen (2017) states that, under a data-driven prior on the elements of θ and a truncated geometric prior distribution on the number of changepoints, the empirical Bayes posterior distribution Π_n of θ ∈ R^n satisfies sup_{θ*} E_{θ*} Π_n({θ ∈ R^n : ||θ − θ*|| > M_n ε_n(θ*)}) → 0 as n → ∞, where M_n is any sequence with M_n → ∞ and ε_n(θ*) is the target rate, similar to the one in Gao et al. (2017). Martin and Shen (2017) point out that the concentration rate of the empirical Bayes posterior distribution has one phase transition, from k = 1 to k ≥ 2, unlike the two phase transitions noted by Gao et al. (2017), and conjecture that this phenomenon might be a characteristic of all Bayesian approaches to piecewise constant changepoint detection. Liu et al. (2019) extend Gao et al. (2017) to multidimensional change-point detection, where the location θ ∈ R^{p×n} can change in at most s out of p coordinates at some time point t ∈ {1, . . . , n}.

For estimating the locations of change points in models involving exponential families, Frick et al. (2014) proposed the multiscale estimator SMUCE, which attains the minimax rate O(n^{−1}) up to a logarithmic factor. Frick et al. (2014) also constructed asymptotically honest confidence sets for the number and locations of change points, and provided sufficient conditions for the SMUCE method to detect change points with probability approaching one in the presence of 'vanishing signals' as n → ∞.

Despite these remarkable advances, there are essentially no theoretical guarantees for changepoints in high-dimensional linear models concerning consistency in model selection. Here we provide the following theoretical substantiations:

(i) We show that, under the default Bayesian hierarchy, it is possible to recover both the true change point locations and the true non-zero covariates with high probability, under mild conditions on the covariates and the maximum model size.

(ii) Specifically, we prove formal posterior consistency results for model selection and changepoint recovery via Bayes factors, both for the piecewise constant model and for changing high-dimensional linear regression. We also prove that the minimax rate of O(n^{−1}) (Frick et al., 2014) is attained by the Bayes estimators for change point recovery.

(iii) Finally, we show that the empirical Bayes estimator attains the same optimal rate of convergence as the full Bayes solution, under the assumption of the same, but unknown, error variance σ² across different segments.

To our knowledge, this is the first theoretical substantiation of the superior performance of Bayesian methods in this specific methodological context.

2 Model

Consider the canonical high-dimensional regression set-up with an n-dimensional response y and an n × p design matrix X, with p ≫ n, where y_i ∼ N(x_i'β_j, σ²) for covariate vector x_i at time point t_i, i = 1, . . . , n. Let β_1, β_2, . . . , β_l, with β_i ≠ β_{i+1} for i = 1, · · · , l − 1, be the values of the coefficient vector, where β_{k+1} is the value of the coefficient vector for ⌊nt_k⌋ < i ≤ ⌊nt_{k+1}⌋, k = 0, 1, . . .
, l − 1, where t_1, . . . , t_{l−1} are the locations of the change points, t_0 = 0 and t_l = 1, and let β_{j,m} denote the mth component of β_j.

Spike-and-slab priors:
For the changing linear regression, we want to incorporate covariates into the model, with selection of the relevant predictors for each time segment between two changepoints. Our aim is to simultaneously select the true non-zero covariates as well as infer the correct number and positions of the changepoints. Our framework allows different priors that enable variable selection. The natural Bayesian solution is to put a spike-and-slab prior on β_j, which ensures selection of covariates (Mitchell and Beauchamp, 1988).

Let I^y_i be the indicator associated with y_i, where I^y_i = 1 denotes a change point at the ith epoch/location. Given the I^y_i's, i = 1, . . . , n, let l = 1 + Σ_{i=1}^n I^y_i be the number of partitions based on the change points, and let P_j denote the jth partition. The hierarchical model can be written as:

y_i | β_j, σ² ∼ N(x_i'β_j, σ²), i ∈ P_j;  π(σ²) ∝ 1/σ²;
β_{j,m} | I^β_m, τ_j ∼ (1 − I^β_m) δ_{0} + I^β_m N(0, τ_j²), for i ∈ P_j, j = 1, . . . , l;
I^β_m ∼ Bernoulli(p̃_m), and I^y_i ∼ Bernoulli(p_n),     (2.1)

where the I^y_i's are independent Bernoulli indicators of whether the ith observation y_i is associated with a change point, and the I^β_m's are independent Bernoulli indicator random variables indicating whether the mth covariate is included in the true model, or, analogously, whether the mth parameter in the jth partition, β_{j,m}, is non-zero.

Here variable selection can be done via the posterior inclusion probability (PIP) for each β_j within each time segment. The inclusion probability P(β_j ≠ 0 | y) is used to select the relevant predictors. Spike-and-slab priors have proven optimality properties for high-dimensional linear models, as shown by Castillo et al. (2015), although they come with a higher computational burden due to the need to explore a high-dimensional parameter space. Alternatively, the global-local shrinkage priors (Polson and Scott, 2010a; Bhadra et al., 2019), which have been proven optimal for variable selection (Datta and Ghosh, 2013; Ghosh et al., 2016), can be used.

The simpler case of a Gaussian sequence model with no covariates is presented first to build intuition, and will later be generalized to include covariates. This leads to the following hierarchical model:

y_i | θ_i, σ² ∼ N(θ_i, σ²);  π(σ²) ∝ 1/σ²;
θ_i | µ_j, τ² ∼ N(µ_j, τ²), for i ∈ P_j, j = 1, . . . , l;
µ_j ∼ N(0, V), V ≫ 1, and π(τ²) ∝ 1/τ²;  I^y_i ∼ Bernoulli(p_n),     (2.2)

where I^y_i indicates whether y_i is a change point, the µ_j's are independent given the I^y_i's, the θ_i's are independent given the µ_j's and τ², and the y_i's are independent given the θ_i's and σ². It is assumed that p_n ∼ n^{−1}c_n, so that for n observations the expected number of change points E[Σ_{i=1}^n I^y_i] is of the order of c_n. Here c_n controls the expected number of change points a priori, and letting c_n increase to infinity provides the flexibility to model multiple, and potentially a large number of, change points for large n.

3 Main results

In this section we establish change-point detection consistency, in the presence of covariates, and establish the rates of convergence. Let l* be the true number of change points and l* + 1 the corresponding number of partitions. Let m*_β denote the true covariate combination. Let M* = M*_{l*_{t*},m*_β} = M*_{l*,m*_β} denote the model with both the true covariate combination and the true change-point locations, where for convenience we drop the suffix t* in l*_{t*}.
Similarly, let M*_{l*,m̃*_β} be a model with the true change-point locations and m̃*_β ⊃ m*_β. For a generic model with change-point locations t_l = (t_1, · · · , t_l) and covariate combination m_β, we write M^{t_l}_{l,m_β}, or M_{l,m_β}, dropping the index t_l for notational convenience.

For a model M_{l,m_β} and the partition P_j corresponding to change points t_{j−1} and t_j, let X_j be the corresponding design matrix. Let q_n = o(n) be the maximum number of covariates permissible in a model. Then we can state the following relatively straightforward result about the marginal distribution.

Proposition 1. Under the model given in (2.1), for τ_j = O(1) and known σ², we have

log L(Y_n | M_{l,m_β}) = −(1/2) Σ_j log(|X_j'X_j|) − (1/(2σ²)) Σ_j Σ_{i∈P_j} (y_i − x_i'β̂_j)² + c_{l,n} + O(1),     (3.1)

where β̂_j is the least squares estimator of the coefficient vector based only on the observations lying in P_j, and c_{l,n} = −(n/2) log σ² + c_l, with c_l = O(kl), where k = #{m_β} denotes the size of the model m_β.
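To make (3.1) concrete, the following minimal sketch (ours, in Python; the function names are illustrative) computes the segment-wise approximation for a given set of changepoints: per-segment least squares residuals plus the log-determinant complexity term. The constant c_{l,n} is dropped, so only differences between segmentations (log Bayes factors) are meaningful; each segment must contain at least as many observations as covariates for the least squares fit to exist.

```python
# A sketch of the approximation (3.1): each segment contributes its
# log-determinant complexity term and scaled residual sum of squares.
# The additive constant c_{l,n} is omitted, so only differences between
# two segmentations (log Bayes factors) are interpretable.
import numpy as np

def approx_log_marginal(y, X, cps, sigma2):
    """cps: sorted interior changepoint indices; segment j is rows [a, b)."""
    bounds = [0, *cps, len(y)]
    total = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        Xj, yj = X[a:b], y[a:b]
        G = Xj.T @ Xj                              # X_j' X_j
        beta_hat = np.linalg.solve(G, Xj.T @ yj)   # segment-wise LSE
        rss = np.sum((yj - Xj @ beta_hat) ** 2)
        total += -0.5 * np.linalg.slogdet(G)[1] - rss / (2.0 * sigma2)
    return total

# Two-segment example: the true changepoint at i = 100 beats a wrong one.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta1, beta2 = np.array([1.0, 2.0, 0.0]), np.array([-1.0, 0.0, 1.5])
y = np.concatenate([X[:100] @ beta1, X[100:] @ beta2]) + rng.standard_normal(n)
print(approx_log_marginal(y, X, [100], 1.0) -
      approx_log_marginal(y, X, [60], 1.0))        # positive log Bayes factor
```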
Remark 2. For τ_j ≫ 1 in (2.1), we have

log L(Y_n | M_{l,m_β}) = −(1/2) Σ_j log(|X_j'X_j|) − (1/(2σ²)) Σ_j Σ_{i∈P_j} (y_i − x_i'β̂_j)² + c'_{l,n},

where c'_{l,n} = −(n/2) log σ² + c'_l, c'_l = O(kl). In light of this result, it suffices to prove our results for a flat normal prior on the coefficients, which will streamline the derivations that follow.

Since many important applications, as well as recent developments such as Frick et al. (2014), use the equal and known variance set-up, we establish our results first for this set-up, and then extend them to the case of unknown variance.

For a model M_{l*,m_β} with the same number of change points as the true model, let P_j be the jth partition of i = 1, · · · , n, and let P*_j be the jth partition under M*. Let X_{j∩j*} be the design matrix corresponding to P_j ∩ P*_j and covariate combination m_β ∪ m_β*, let X_{j−j*} correspond to P_j − P*_j, and let X_{j∩k*} correspond to P_j ∩ P*_k. Let β*_i be the true coefficient vector for y_i, with β*_i = β*_j on P_j ∩ P*_j. Note that these design matrices are constructed when P_j ∩ P*_j, P_j − P*_j, and P_j ∩ P*_k are nonempty, respectively. Let β̂_j be the least squares estimator based on P_j with covariate choice given by m_β, with the entries corresponding to m_β ∪ m_β* − m_β set to zero. Similarly, β̃*_i and β̃*_j are defined by filling the entries not in m_β* with zeros.

Then it follows, after some calculation, that

log P(Y_n | M_{l*,m_β}) = −(1/2) Σ_j log(|X_j'X_j|) − (1/(2σ²)) [ Σ_i (y_i − θ*_i)²
+ Σ_{j: P_j∩P*_j ≠ φ} (β̃*_j − E[β̂_j])' X'_{j∩j*} X_{j∩j*} (β̃*_j − E[β̂_j])
+ Σ_j Σ_{k: k≠j; P_j∩P*_k ≠ φ} (β̃*_k − E[β̂_j])' X'_{j∩k*} X_{j∩k*} (β̃*_k − E[β̂_j])
+ Σ_j (β̂_j − E[β̂_j])' X'_j X_j (β̂_j − E[β̂_j]) ] − (1/σ²)[E_1 + E_2 + E_3] + c_{l,n} + O(1),     (3.2)

where E_1, E_2, E_3 are the cross-product terms and θ*_i is the true value of E[y_i]. Hence, accounting for the bias terms and bounding the cross-product terms, we can show Bayes factor consistency whenever the proposed partition is not a refinement of the true partition, for any covariate combination. If l > l*, an expression similar to (3.2) can be derived by comparing M_{l,m_β} with the model whose partition/change points correspond to a refinement of the true partition with l change points. A refinement of the true partition is a partition corresponding to change points t̃_1, . . . , t̃_l, with l > l* and t*_i ∈ {t̃_1, . . . , t̃_l} for i = 1, . . . , l*.

If the proposed partition is a refinement of the true partition, and the covariate combination contains the true set of covariates, then we obtain terms asymptotically similar to Schwarz's BIC (Schwarz et al., 1978), which guarantees Bayes factor consistency.

Next, we make the following assumptions on the covariates and the model size:

(A1) The covariates X_j are uniformly bounded.

(A2) The coefficients β_j are uniformly bounded below and above.

(A3) (q_n log n)² ≺ nε_n and q_n ≼ n^c, c < 1/2, for a sequence ε_n → 0 such that nε_n → ∞. For two positive sequences {a_n}_{n≥1}, {b_n}_{n≥1}, a_n ≺ b_n if a_n/b_n → 0 as n ↑ ∞.

(A4) τ²_j = V_n = O(1).

(A5) 0 < t*_1 < · · · < t*_{l*} < 1 are the true locations of the change points (i.e.
⌊nt*_j⌋ ≤ τ*_j ≤ ⌊nt*_j⌋ + 1).

(A6) Let P̃_1, P̃_2, · · · , P̃_k be k disjoint subsets of {1, . . . , n} of sizes n_1, n_2, · · · , n_k, and let X_1, X_2, · · · , X_k be the design matrices corresponding to the observations indexed by P̃_1, · · · , P̃_k for some model based on q_n = o(n) many coefficients. For n_1, . . . , n_k → ∞ and q_n/n_1, . . . , q_n/n_k → 0, assume

a n_i < λ_A < b n_i,   a n_i/(n_1 + n_2 + · · · + n_k) < λ_C < b n_i/(n_1 + n_2 + · · · + n_k),

where λ_A and λ_C denote the eigenvalues of A = X_i'X_i and C = (X_1'X_1 + X_2'X_2 + · · · + X_k'X_k)^{−1} X_i'X_i, respectively, and a, b > 0.

We can now state the following result for the number of change points:

Theorem 3.
Let 0 < t*_1 < · · · < t*_{l*} < 1 be the locations of the true change points, let M* be the true model corresponding to the true covariates and true change points, and let 0 < t_1 < · · · < t_l < 1 be the locations of the change points for an alternative model with l ≠ l*. Let M_{l,m_β} be the corresponding model with change points at t_1 < · · · < t_l and some covariate combination m_β with at most q_n covariates. Then, under A1–A6, for (2.1), BF(M_{l,m_β}, M*) → 0 in probability as n → ∞.

Next, let 0 < t*_1 < · · · < t*_{l*} < 1 be the locations of the true change points and M* the true model corresponding to the true covariates and true change points. Let 0 < t_1 < · · · < t_{l*} < 1 be the locations of the change points for an alternative model with some covariate combination, and assume max_i |t*_i − t_i| > ε_n. Let M^{ε_n}_{l*} be the corresponding model with change points at t_1, · · · , t_{l*} and some covariate combination m_β with at most q_n covariates. Then we show that, even for ε_n ∼ n^{−1} up to some logarithmic factors, if q_n increases at a logarithmic rate, we still have consistency, where for two positive sequences {a_n}_{n≥1}, {b_n}_{n≥1}, a_n ∼ b_n if a_n/b_n is bounded away from zero and infinity. In particular:

Theorem 4.
Under A1–A6, for (2.1), BF(M^{ε_n}_{l*,m_β}, M*) → 0 in probability as n → ∞.

Note that A3 requires q_n log n ≺ √(nε_n). If q_n = O(log n), then ε_n ∼ n^{−1} up to logarithmic factors satisfies this condition: (q_n log n)² is then of the order (log n)^4, so any ε_n ≽ n^{−1}(log n)^{4+δ}, δ > 0, suffices. Under stronger conditions, the Bayes factor consistency holds uniformly over the covariate choice. The results also hold under misspecification of the variance parameter.
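As a quick numerical illustration of this Bayes factor consistency (our sketch, with illustrative data-generating values; it is not the paper's experiment), the snippet below uses the approximation (3.1) with known σ² = 1 and tracks the log Bayes factor of a model carrying one spurious extra changepoint against the true single-changepoint model as n grows: it drifts to −∞, at roughly the rate (q/2) log n.

```python
# Log Bayes factor of an over-segmented model (extra changepoint at 3n/4)
# versus the true model (single changepoint at n/2), via approximation (3.1).
import numpy as np

def seg_score(y, X, cps, sigma2=1.0):
    bounds = [0, *cps, len(y)]
    s = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        G = X[a:b].T @ X[a:b]
        bhat = np.linalg.solve(G, X[a:b].T @ y[a:b])
        s += (-0.5 * np.linalg.slogdet(G)[1]
              - np.sum((y[a:b] - X[a:b] @ bhat) ** 2) / (2 * sigma2))
    return s

rng = np.random.default_rng(2)
for n in [200, 800, 3200]:
    X = rng.standard_normal((n, 2))
    beta = np.where(np.arange(n)[:, None] < n // 2, [1.0, 0.0], [0.0, 1.5])
    y = np.sum(X * beta, axis=1) + rng.standard_normal(n)
    log_bf = seg_score(y, X, [n // 2, 3 * n // 4]) - seg_score(y, X, [n // 2])
    print(n, round(log_bf, 2))   # increasingly negative as n grows
```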
Remark 5. The condition of bounded covariates in (A1) can be relaxed. For example, if we instead impose sub-exponential-type tail conditions on the distributions of the covariates, that is, if F_m(·) is the CDF of the mth covariate and −log(1 − F_m(t)) ≽ t^α for some α > 0, then Theorem 4 holds under a slightly modified (A3) (up to log factors). For two positive sequences {a_n}_{n≥1}, {b_n}_{n≥1}, a_n ≼ b_n if a_n ≤ K b_n for some constant K > 0.

Instead of using a known variance σ², if we use σ̂² = n^{−1} Σ_j ||Y_j − Ŷ_j||², it can be shown that the conclusions of Theorems 3 and 4 continue to hold. Here Y_j is the observation vector for P_j and Ŷ_j is the least squares fit for the proposed model m_β based on the observations and covariates in P_j. This result is addressed in the following theorems.
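Concretely, a minimal sketch (ours; illustrative names) of this plug-in estimator pools the residual sums of squares of the segment-wise least squares fits of the proposed model:

```python
# sigma2_hat = n^{-1} * sum_j ||Y_j - Y_j_hat||^2, with least squares fits
# computed separately within each proposed segment.
import numpy as np

def sigma2_hat(y, X, cps):
    bounds = [0, *cps, len(y)]
    rss = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        bhat, *_ = np.linalg.lstsq(X[a:b], y[a:b], rcond=None)
        rss += np.sum((y[a:b] - X[a:b] @ bhat) ** 2)
    return rss / len(y)
```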
Theorem 6. Let 0 < t*_1 < · · · < t*_{l*} < 1 be the locations of the true change points, M* the true model corresponding to the true covariates and true change points, and 0 < t_1 < · · · < t_l < 1 the locations of the change points for an alternative model with l ≠ l*. Let M_{l,m_β} be the corresponding model with change points at t_1 < · · · < t_l and some covariate combination with at most q_n covariates. Then, under A1–A6, for (2.1), with the empirical estimator of σ², BF(M_{l,m_β}, M*) → 0 in probability as n → ∞.

Theorem 7.
Under A1–A6, for (2.1), with the empirical estimator of σ², BF(M^{ε_n}_{l*}, M*) → 0 in probability as n → ∞.

Remark 8.
Theorems 6 and 7 hold under misspecification, that is, if the true variance parameter is not the same over the different partitions P*_j.

Next, we address the variable selection issue. We have already shown that, under any covariate combination, a model with incorrectly selected change points has Bayes factor converging to zero relative to the model with the true change points and covariate combination. Showing variable selection consistency with the change points set to the true change points then boils down to showing the consistency of variable selection alone. Let Π(M_{l*,m_β} | ·) be the posterior probability of the model with covariate combination m_β and change points at the true change point locations t*_1, · · · , t*_{l*}, and let P_n(M_{l*,m_β} : M*) be the ratio of the posterior probabilities Π(M_{l*,m_β} | ·) and Π(M* | ·) based on n observations. Similarly, we can define P_n(M_{l,m_β} : M*). We have the following result regarding variable selection.

Theorem 9.
Under A1–A6, for (2.1), BF(M_{l*,m_β}, M*) → 0 in probability for m_β ≠ m_β*.

To show consistency over all possible variable choices, we assume the following:

(B1) −log P(I^β_m = 1) ∼ (log n)^α for some α > 1.

(B2) √n ≻ q_n^{1.5} log n and √(nε_n) ≻ q_n^{1.5} log n.

Remark 10.
Condition B1 imposes a stronger penalty on larger models, which induces selection consistency uniformly over the covariate choice and model size. Similarly, B2 is needed to bound the Bayes factors uniformly over covariate choices. We consider the models M^{ε_n}_{l*,m_β} defined as before. Then we have the following.

Theorem 11.
Under assumptions A1–A6 and B1, B2, for (2.1), sup Σ_{m_β ≠ m*_β} P_n(M^{ε_n}_{l*,m_β} : M*) → 0 in probability, where the supremum is over the possible change point selections and ε_n converges to zero.

For any model with change points t_1 < · · · < t_l we can state the following result for variable selection.

Theorem 12.
Let M_{l,m_β} be the corresponding model with change points at t_1 < · · · < t_l and some covariate combination m_β. Under A1–A6, B1, B2, for (2.1), Σ_{m_β: m_β ≠ m_β*} P_n(M_{l,m_β} : M*) → 0 in probability.

Remark 13.
Theorems 11 and 12 hold under unknown σ², with the empirical estimator σ̂² as in Theorems 6 and 7, and under misspecification of the equal-variance assumption as in Remark 8.

Remark 14 (Covariate-free case).
One special case of the model given in (2.2) is the covariate-free case, the simple mean model. For such a model, the Bayes factor consistency result of Theorem 4 holds with ε_n going to zero as slowly as ε_n ≽ n^{−1}(log n)², which matches the frequentist minimax rate up to logarithmic factors.

4 Simulation studies

In this section we demonstrate the performance of the Bayesian hierarchical models described in (2.2) and (2.1), for the simpler special case of the piecewise constant mean model and for the changing linear model, respectively. First, we show the recovery of the true mean as well as the true change-point locations for the simple changing mean model; we then consider a case with the number of covariates d_n ∼ n, where we show that the Bayesian model achieves accuracy in both variable selection and change point estimation. Our goal here is not to establish superiority of the Bayesian method over extant methods, but rather to show that the methods are not just theoretically optimal, they also have satisfactory small-sample performance.

4.1 Piecewise constant mean model

We consider an example originally reported in Frick et al. (2014), and compared against an empirical Bayes procedure in Martin and Shen (2017), with change points for a piecewise constant Gaussian sequence model. Here the data-generating model is

y_i = θ_i + ε_i,  ε_i ∼iid N(0, σ²),  i = 1, . . . , n (= 497),     (4.1)

with the true mean being:
θ_i = −0.18 for 1 ≤ i ≤ 138;  0.08 for 139 ≤ i ≤ 225;  1.07 for 226 ≤ i ≤ 242;  −0.53 for 243 ≤ i ≤ 299;  0.16 for 300 ≤ i ≤ 308;  −0.69 for 309 ≤ i ≤ 333;  −0.16 for 334 ≤ i ≤ 497.

The sequence of true means θ_i is depicted in Fig. 1.

Figure 1: True mean parameter values used in Frick et al. (2014).
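For reproducibility, the test signal (4.1) can be simulated as below (our sketch; the segment levels and boundaries follow the reconstruction above, and the noise standard deviation is an assumed value, since the original specification is garbled in the source).

```python
# Piecewise constant test signal with 6 changepoints and n = 497.
import numpy as np

segments = [(138, -0.18), (225, 0.08), (242, 1.07), (299, -0.53),
            (308, 0.16), (333, -0.69), (497, -0.16)]   # (last index, level)

starts = [0] + [end for end, _ in segments[:-1]]
theta = np.concatenate([np.full(end - start, level)
                        for (end, level), start in zip(segments, starts)])

rng = np.random.default_rng(3)
y = theta + rng.normal(0.0, 0.2, size=len(theta))      # sd 0.2 is assumed
assert len(y) == 497
```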
We compare the recovery and estimation performance of the proposed method with two candidates: a frequentist method (the pruned exact linear time method, or PELT; Killick et al., 2012) and the empirical Bayes approach of Martin and Shen (2017) (EB). We describe these two candidates briefly.

For PELT, consider ordered data points y_1, y_2, . . . , y_n and m change points τ_1, . . . , τ_m that divide the data into m + 1 partitions. The change-point detection method then seeks to minimize

Σ_{i=1}^{m+1} C(y_{(τ_{i−1}+1):τ_i}) + pen(n) f(m),

where C is a cost function and pen(n) × f(m) is the penalty applied to prevent over-fitting. For observations y_1, . . . , y_n ∼ f(y | θ) with some unknown underlying parameter θ, the PELT method uses the negative log-likelihood as the cost function: C(y_{(t+1):s}) = −max_θ Σ_{i=t+1}^s log f(y_i | θ). The penalty is chosen based on the inferential goal; e.g., pen(n) = log(n) is the popular BIC penalty, and f(m) = m assumes that the penalization is linear in the number of change points. When m is not too large, BIC favors a parsimonious model and can be shown to be model selection consistent.

The empirical Bayes (EB) approach in Martin and Shen (2017) works via specification of priors on block-specific parameter vectors (θ_B) and block configurations, where the prior centers on the mean parameters are data-dependent. In particular, the mean parameter in each block is assumed to be Gaussian, centered at the maximum likelihood estimate based on the observations in that block; the block configuration follows a discrete uniform distribution over partition points, with the configuration size (number of blocks) following a truncated geometric distribution.

For the results shown below, we do not assume known σ², and we allow the variances to differ across partitions. We use V = 1 and I_i ∼ Bernoulli(1/n), and calculate the posterior summaries from Markov chain Monte Carlo samples after discarding a burn-in. The fitted mean and the posterior probabilities for the partitions are given in Fig. 2.

Figure 2: Top row: change-point recovery by EB/PELT. Bottom row, left: posterior mean (red) and mean based on estimated change-points (blue); middle: posterior probability of the number of partitions; right: posterior probability of a change point at i = 1, . . . , n.
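The penalized cost minimization that PELT solves can be sketched with the un-pruned "optimal partitioning" recursion below (our sketch; PELT adds a pruning step that makes the expected cost linear in n). The Gaussian cost with known variance reduces to the residual sum of squares about the segment mean, and pen = 2 log n plays the role of a BIC-type penalty; all names are illustrative.

```python
# Optimal partitioning: F[s] = min_t { F[t] + C(y[t:s]) + pen }, O(n^2).
import numpy as np

def optimal_partitioning(y, pen):
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(t, s):                # RSS of y[t:s] about its segment mean
        return c2[s] - c2[t] - (c1[s] - c1[t]) ** 2 / (s - t)

    F = np.full(n + 1, np.inf)     # best penalized cost for y[:s]
    F[0] = -pen
    last = np.zeros(n + 1, dtype=int)
    for s in range(1, n + 1):
        cand = [F[t] + cost(t, s) + pen for t in range(s)]
        t_best = int(np.argmin(cand))
        F[s], last[s] = cand[t_best], t_best
    cps, s = [], n                 # backtrack segment right endpoints
    while s > 0:
        cps.append(s)
        s = last[s]
    return sorted(cps)[:-1]        # interior changepoints only

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(optimal_partitioning(y, pen=2 * np.log(len(y))))   # break near 100
```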
4.2 Changing linear regression

Consider a changing linear regression problem where the underlying linear model changes between different observation windows or epochs. Here the parameters of interest are the coefficient vector β as well as the number and locations of the change points. Let us fix the dimensions of the observations and covariates at n = 250, p = 250. Suppose the true locations of the change points, as fractions of the total number of observations, are t*_1 = 0.3 and t*_2 = 0.7. Finally, let the covariates x_{i,1}, . . . , x_{i,p} for the ith observation (i = 1, . . . , n) be generated from independent standard normal distributions. The data-generating model used for this experiment is:

y_i = x_{i,1} + 2x_{i,2} + 1.5 e_i   if i ≤ 75,
y_i = 1 + 2x_{i,2} + 0.5 e_i    if 75 < i ≤ 175,
y_i = −0.5 x_{i,1} − x_{i,2} + e_i   if 175 < i ≤ 250,   i = 1, . . . , 250.

Here e_i ∼iid N(0, 1), the true change points occur at positions i = 75 and i = 175 as mentioned before, and the proportion of change points among the observations is p_n = 2/n.

Figure 3: Changing linear regression example showing the performance of the hierarchical Bayesian model: (a) change point recovery; (b), (c) globally active variable selection performance.
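The data-generating model above can be simulated as follows (our sketch; the active covariate indices and several coefficient/noise values are as reconstructed in the display, so they should be treated as illustrative rather than the paper's exact values).

```python
# Changing linear regression: n = p = 250, changepoints at i = 75 and 175.
import numpy as np

rng = np.random.default_rng(5)
n, p = 250, 250
X = rng.standard_normal((n, p))
e = rng.standard_normal(n)

y = np.empty(n)
y[:75] = X[:75, 0] + 2 * X[:75, 1] + 1.5 * e[:75]         # segment 1
y[75:175] = 1 + 2 * X[75:175, 1] + 0.5 * e[75:175]        # segment 2
y[175:] = -0.5 * X[175:, 0] - X[175:, 1] + e[175:]        # segment 3
```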
We use a spike-and-slab prior on the regression coefficients to detect the global set of covariates and the change points. We run the Markov chain Monte Carlo sampler and calculate the posterior modes and means for the two sets of indicator variables, for the change points and the non-null β_j's. The posterior mean estimates are plotted in Fig. 3. Figure 3a shows that both change points, at i/n = 0.3 and 0.7, are recovered with high posterior probability. Figures 3b and 3c show that the global set of active β_j's is also recovered with high probability. In particular, from Fig. 4, almost 93% of the posterior samples give the correct number of partitions (3), and almost 90% of the posterior samples select the right model. Thus, the numerical results are in concurrence with our theoretical proofs of consistency in model selection and change-point detection for a changing linear regression in §3.

Figure 4: Regression example: posterior distributions of (a) the number of change points and (b) the number of active covariates.
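To give a flavor of the Gibbs updates behind these posterior summaries, the sketch below (ours; names and hyperparameters are illustrative, and this is only one ingredient of a full sampler for (2.1)) computes the conditional inclusion probability of a single coefficient within a segment, integrating β_m out of the slab N(0, τ²) and comparing against the point mass at zero.

```python
# One spike-and-slab inclusion update: with the partial residual r (the
# response minus the contribution of all other included covariates),
# P(I_m = 1 | r) follows from the marginal likelihood ratio of the slab
# N(0, tau2) versus the spike at zero (Sherman-Morrison in one dimension).
import numpy as np

def inclusion_probability(r, x, sigma2, tau2, p_incl):
    s, z = x @ x, x @ r
    log_lr = (-0.5 * np.log1p(tau2 * s / sigma2)
              + tau2 * z ** 2 / (2 * sigma2 * (sigma2 + tau2 * s)))
    log_odds = np.log(p_incl / (1 - p_incl)) + log_lr
    return 1.0 / (1.0 + np.exp(-log_odds))

rng = np.random.default_rng(8)
x = rng.standard_normal(200)
r_signal = 1.5 * x + rng.standard_normal(200)   # residual with a real effect
r_noise = rng.standard_normal(200)              # residual with no effect
print(inclusion_probability(r_signal, x, 1.0, 1.0, 0.1))  # close to 1
print(inclusion_probability(r_noise, x, 1.0, 1.0, 0.1))   # close to 0
```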
4.3 Changing linear regression with an AR(1) component

This example considers a change in the linear structure when an autoregressive, time-dependent component is present. Such a scenario may arise in economic applications: for example, a housing price index may change with covariates such as stock market returns, while an autoregressive structure is present in the response variable, i.e., the price. We use a model similar to the previous one, with n = 300, p = 250, but add a first-order autoregressive component (AR(1)) with autocorrelation ρ. As before, the true change points occur at positions i = 90 and i = 210, i.e., at relative positions t*_1 = 0.3 and t*_2 = 0.7. We generate data from the following model:
y_i = ρ y_{i−1} + x_{i,1} + 3x_{i,2} + 1.5 e_i   if i ≤ 90,
y_i = 1 + ρ y_{i−1} + 2x_{i,2} + 0.5 e_i    if 90 < i ≤ 210,
y_i = −ρ y_{i−1} + x_{i,1} − x_{i,2} + e_i   if 210 < i ≤ 300,

with ρ = 0.5 and the x_{i,j}'s and e_i's i.i.d. N(0, 1). It should be noted that our theoretical results from §3 continue to hold in this situation, as long as condition (A6) holds for the new design matrix, accounting for the autoregressive structure.

We use the spike-and-slab prior on the coefficients and the parameter ρ, and a computational scheme similar to that of the example in §4.2. The posterior probabilities for variable selection and change-point detection are given in Fig. 5, where it can be seen that the true change point locations (vide Fig. 5a) and the active variables (vide
Fig. 5b and Fig. 5c) are selected with high probability. The AR(1) component is also selected, with a high estimated posterior inclusion probability.

Figure 5: Changing linear regression example showing the performance of the hierarchical Bayesian model in the presence of the AR(1) component: (a) change point recovery; (b), (c) globally active variable selection performance.
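Computationally, the AR(1) term simply enters as one more column of the design matrix, the lagged response, so the same spike-and-slab machinery applies once condition (A6) is checked on the augmented design. A minimal sketch (ours):

```python
# Augment the design with y_{i-1}; the first observation is dropped.
import numpy as np

def augment_with_lag(y, X):
    return np.column_stack([y[:-1], X[1:]]), y[1:]

rng = np.random.default_rng(6)
X, y = rng.standard_normal((300, 5)), rng.standard_normal(300)
X_aug, y_resp = augment_with_lag(y, X)
print(X_aug.shape, y_resp.shape)   # (299, 6) (299,)
```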
5 Real Data Applications

Next, the proposed method is applied to detecting change points in crime data from Little Rock, AR. Among all larger cities in the United States, Little Rock ranked in the top ten for the highest violent crime (7th) and property crime (4th) rates in 2015 (Chillar and Drawve, 2020).

For the piecewise constant mean model there are competing methods against which the proposed method can be compared; as mentioned earlier, there is essentially no comprehensive framework for model selection under change points. The data are preprocessed by a square root transformation and standardization.

To compare and contrast the proposed method on the weekly burglary and breaking-and-entering data from Little Rock for 2017, we use the empirical Bayes method proposed by Martin and Shen (2017) as well as the PELT (pruned exact linear time) method of Killick et al. (2012). As it is impossible to know whether there should be any "true" change points in the 2017 data in the absence of additional information, this should be regarded as a preliminary exploratory analysis. As Fig. 6 suggests, the change points recovered by the proposed method mostly agree with those found by the PELT method, while the empirical Bayes method appears to be conservative and does not detect any change points in these data.
Figure 6: Changepoints detected by the PELT-BIC method (top panel) and by the proposed approach and the empirical Bayes approach of Martin and Shen (2017) (bottom panel), for the burglary and breaking-and-entering activities in Little Rock in 2017. Data are standardized for the bottom panel.

Appendix

Proof of Theorem 3
We show our result first for a model M_{l,m_β} = M*_{l,m_β} with l change points whose partitions P_j are a refinement of the true partitions P*_j, and whose covariate combination contains the true covariate combination, i.e., m_β ⊃ m_β*. That is, if t_1, · · · , t_l are the proposed change points and t̃_1, · · · , t̃_l are the change points corresponding to some refinement of the true partition t*_1, · · · , t*_{l*}, then t_i = t̃_i for all i. We show that BF(M*_{l,m_β}, M*_{l*,m_β}) → 0 and BF(M*_{l*,m_β}, M*) → 0 (Case i).

Next we treat the case where t_1, . . . , t_l are the proposed change points and t̃_1, . . . , t̃_l are the change points corresponding to any refinement of the true partition t*_1, . . . , t*_{l*}, with max_i |t_i − t̃_i| > ε > 0 (Case ii). Without loss of generality, we assume ε < min_i |t_i − t_{i−1}|, i = 1, . . . , l.

Then we treat the case where m_β does not contain the true covariate combination on some partition, for l ≥ l* (Case iii).

Finally, we treat the case l < l* (Case iv).

Case (i)
We show BF(M*_{l,m_β}, M*_{l*,m_β}) → 0 and BF(M*_{l*,m_β}, M*) → 0 in probability.

BF(M*_{l,m_β}, M*_{l*,m_β}) → 0. Write P*_j, for some j, as the union of P*_{j_1}, · · · , P*_{j_k}, k > 1, where P*_{j_i} = P_{j'} for some j' ∈ {1, · · · , l + 1}. Let Y^{j*}_n be the vector of the y_i's in P*_j and X_{j*} the corresponding covariate matrix. Similarly, let X_{j*_i} be the covariate matrix for P*_{j_i}. Let Ŷ^{j*}_n and Ŷ^{j*_i}_n be the corresponding least squares fits based on P*_j and the P*_{j_i}'s, and let Ŷ^{j*}_{n,i} be the sub-vector of Ŷ^{j*}_n corresponding to the observations in P*_{j_i}.

Now,
−log L(Y^{j*}_n | M*_{l,m_β}) + log L(Y^{j*}_n | M*_{l*,m_β})
= −(1/2) log det(X'_{j*}X_{j*}) + (1/2) Σ_{i=1}^k log det(X'_{j*_i}X_{j*_i}) − (1/(2σ²)) ||Y^{j*}_n − Ŷ^{j*}_n||² + Σ_{i=1}^k (1/(2σ²)) ||Y^{j*_i}_n − Ŷ^{j*_i}_n||² + c_{l,l*},

where c_{l,l'} = O(k_0(l − l')) and k_0 = #{m_β} is the number of covariates in m_β. Now,

||Y^{j*}_n − Ŷ^{j*}_n||² = Σ_{i=1}^k ||Y^{j*_i}_n − Ŷ^{j*}_{n,i}||² = Σ_{i=1}^k ||Y^{j*_i}_n − Ŷ^{j*_i}_n + Ŷ^{j*_i}_n − Ŷ^{j*}_{n,i}||² = Σ_{i=1}^k [ ||Y^{j*_i}_n − Ŷ^{j*_i}_n||² + ||Ŷ^{j*_i}_n − Ŷ^{j*}_{n,i}||² ],

where the cross terms vanish because, writing δ_{i,m} for the mth component of (β̂_j − β̂_{j_i}) and x_{i',m} for the mth component of the covariate vector x_{i'} of observation i', we have, by the normal equations, Σ_{i'∈P*_{j_i}} (y_{i'} − x_{i'}'β̂_{j_i}) x_{i'}'(β̂_j − β̂_{j_i}) = Σ_m δ_{i,m} Σ_{i'∈P*_{j_i}} (y_{i'} − x_{i'}'β̂_{j_i}) x_{i',m} = 0.

Next,

||Ŷ^{j*_i}_n − Ŷ^{j*}_{n,i}||² ≤ 2[ (β̂_j − β̃*_j)'(X'_{j*_i}X_{j*_i})(β̂_j − β̃*_j) + (β̂_{j_i} − β̃*_j)'(X'_{j*_i}X_{j*_i})(β̂_{j_i} − β̃*_j) ],

where β̃*_j is the true coefficient vector corresponding to P*_j, with zeros in place of the covariates that are in m_β but not in m_β*. We have (β̂_{j_i} − β̃*_j)'(X'_{j*_i}X_{j*_i})(β̂_{j_i} − β̃*_j) ∼ σ²χ²_q for q = #{m_β}, q ≤ q_n. Again, (β̂_j − β̃*_j)'(X'_{j*_i}X_{j*_i})(β̂_j − β̃*_j) = Z'A^{1/2}(X'_{j*_i}X_{j*_i})A^{1/2}Z, where Z ∼ N(0, σ²I_q) and A^{−1} = X'_{j*}X_{j*}. Given that the eigenvalues of A^{1/2}(X'_{j*_i}X_{j*_i})A^{1/2} are bounded away from infinity, we have (β̂_j − β̃*_j)'(X'_{j*_i}X_{j*_i})(β̂_j − β̃*_j) = O_p(q_n).

Next, we consider the determinant terms. Let n*_j be the number of observations in P*_j; similarly define n*_{j_i} for the P*_{j_i}'s, so that Σ_i n*_{j_i} = n*_j, with liminf n*_{j_i}/n > 0. Let lim_{n↑∞} n*_{j_i}/n*_j = α_i. From the fact that log det(X'_{j*_i}X_{j*_i}) = q[log n*_j + log α_i] + e_q for large n, where |e_q| ≤ q[|log a| + |log b|], we have
−log L(Y^{j*}_n | M*_{l,m_β}) + log L(Y^{j*}_n | M*_{l*,m_β}) ≥ (1/2)(k − 1)q log n − q O_p(1) + C

for a generic constant C, and hence BF(M*_{l,m_β}, M*_{l*,m_β}) → 0 in probability.

BF(M*_{l*,m_β}, M*) → 0. For each partition P*_j,

log L(Y^{j*}_n | M*) − log L(Y^{j*}_n | M*_{l*,m_β}) ∼ (1/2)(q − #{m_β*}) log n − (1/(2σ²)) χ²_{q−#{m_β*}} + (q − #{m_β*}) C'

for a generic constant C', where q = #{m_β}. Hence the result follows.

Case (ii)
From (3.2) we have E_1 = Σ_j E_{1,j}, E_2 = Σ_j E_{2,j}, and E_3 = Σ_j E_{3,j}, where, for e_i = y_i − θ*_i,

E_{1,j} = Σ_{i∈P_j} e_i (x_i'E[β̂_j] − x_i'β̃*_i);   E_{2,j} = Σ_{i∈P_j} e_i (x_i'β̂_j − x_i'E[β̂_j]);
E_{3,j} = Σ_{i∈P_j} (x_i'E[β̂_j] − x_i'β̃*_i)(x_i'β̂_j − x_i'E[β̂_j]).     (6.1)

Let δ_{n,im} be the mth component of E[β̂_j] − β̃*_i, and δ_{n,Em} the mth component of β̂_j − E[β̂_j]. Note that δ_{n,im} can take l* + 1 different values, according as i ∈ P*_j for some j ∈ {1, · · · , l* + 1}; we denote it by δ_{nm,l'} for i ∈ P_j ∩ P*_{l'}.

Note that E[β̂_j] is a linear combination of the β̃*_j's, of the form (Σ_{i∈I_j} B_i'B_i)^{−1}(Σ_{i∈I_j} B_i'B_i β̃*_i), where I_j is a subset of {1, · · · , l* + 1} with i ∈ I_j if P*_i ∩ P_j ≠ φ, and B_i is the covariate matrix for the observations in P*_i ∩ P_j. Hence E[β̂_j] is bounded in each component, by condition A6. For E_{1,j} and E_{2,j} we use the fact that, for an n-dimensional multivariate normal vector with bounded variances, the maximum absolute value is bounded by log n for large n. Then,

|E_{1,j}| = | Σ_{m=1}^q Σ_{l'=1}^{l*+1} δ_{nm,l'} Σ_{i∈P_j∩P*_{l'}} e_i x_{i,m} | = o_p(q√n log n),
|E_{2,j}| = | Σ_{m=1}^q (√n δ_{n,Em}) (1/√n) Σ_{i∈P_j} e_i x_{i,m} | = o_p(q log n),
|E_{3,j}| = | (1/√n) Σ_{l'=1}^{l*+1} Σ_{i∈P_j∩P*_{l'}} Σ_{m,m'} δ_{nm,l'} (√n δ_{n,Em'}) x_{i,m} x_{i,m'} | = o_p(√n q log n).

Without loss of generality, we assume l = l* (otherwise, we can argue with a refinement of the true partition with change points t̃_1, · · · , t̃_l such that |t_j − t̃_j| > ε and t̃_j ∈ {t*_1, · · · , t*_{l*}}, and then use the result proved in Case (i)).

Let |t_j − t*_j| > ε and denote the model by M^ε_{l*,m_β}. Then,
−log L(Y_n | M^ε_{l*,m_β}) + log L(Y_n | M*_{l*,m_β})
= −(1/2)[ Σ_j log(det(X'_{j*}X_{j*})) − Σ_j log(det(X'_jX_j)) ]
+ (1/(2σ²)) Σ_{j: P_j∩P*_j ≠ φ} (β̃*_j − E[β̂_j])' X'_{j∩j*} X_{j∩j*} (β̃*_j − E[β̂_j])
+ (1/(2σ²)) Σ_j Σ_{k: k≠j, P_j∩P*_k ≠ φ} (β̃*_k − E[β̂_j])' X'_{j∩k*} X_{j∩k*} (β̃*_k − E[β̂_j])
+ (1/(2σ²)) Σ_j (β̂_j − E[β̂_j])' X'_j X_j (β̂_j − E[β̂_j])
− (1/(2σ²)) Σ_j (β̂*_j − E[β̂*_j])' X'_{j*} X_{j*} (β̂*_j − E[β̂*_j]) − R_n,     (6.2)

where R_n = o_p(log n √n q_n) and Σ_j (β̂*_j − E[β̂*_j])' X'_{j*} X_{j*} (β̂*_j − E[β̂*_j]) ∼ σ²χ²_{ql*}.

Since β*_j ≠ β*_{j+1} for j = 1, · · · , l*, at least one of the first two quadratic-form sums is computed at a nonzero vector for some j. The first quadratic form is based on ∼ n(1 − ε) observations and the second on ∼ nε observations, and therefore, by A6,

−log L(Y_n | M^ε_{l*,m_β}) + log L(Y_n | M*_{l*,m_β}) ≽ nε − R_n → ∞,

which proves our claim.

Case (iii)
Case (iii) follows similarly to the last step of Case (ii), as the bounds on E_1, E_2, E_3 are of the same order as before, and we have E[β̂_j] ≠ β̃*_j for some j.

Case (iv)
We have

log P(Y_n | M_{l,m_β}) = −(1/2) Σ_j log(|X'_jX_j|) − (1/(2σ²))[ Σ_i (y_i − θ*_i)² + Σ_j Σ_{j*: P_j∩P*_{j*} ≠ φ} (β̃*_{j*} − E[β̂_j])' X'_{j∩j*} X_{j∩j*} (β̃*_{j*} − E[β̂_j]) + Σ_j (β̂_j − E[β̂_j])' X'_j X_j (β̂_j − E[β̂_j]) ] − (1/σ²)[E_1 + E_2 + E_3] + c_{l,n} + O(1),

where X_{j∩j*} is the design matrix corresponding to P_j ∩ P*_{j*} when P_j ∩ P*_{j*} ≠ φ, j* ∈ {1, . . . , l* + 1}. For l < l*, we have min{ L([t_{i−1}, t_i] ∩ [t*_{j−1}, t*_j]), L([t_{i−1}, t_i] ∩ [t*_j, t*_{j+1}]) } > 0 for some i, j, where t_1, . . . , t_l are the change points in M_{l,m_β} and L(·) is the Lebesgue measure. Then Σ_j Σ_{j*: P_j∩P*_{j*} ≠ φ} (β̃*_{j*} − E[β̂_j])' X'_{j∩j*} X_{j∩j*} (β̃*_{j*} − E[β̂_j]) ∼ O(n). The bounds on E_1, E_2, E_3 from Case (ii) continue to hold, and hence −log P(Y_n | M_{l,m_β}) + log P(Y_n | M*) ≽ cn with probability tending to one for some c > 0, which proves our claim.

Proof of Proposition 1
Marginalizing over the coefficient vector on P_j, we have

log L(Y^j_n | M_{l,m_β}) = −(n_j/2) log(2πσ²) − (1/2) log det(X'_jX_j + S_β) + (1/2) log det(S_β) − (1/(2σ²))[ Y^j_n'Y^j_n − Y^j_n'X_j(X'_jX_j + S_β)^{−1}X'_jY^j_n ],     (6.3)

where σ²S_β^{−1} is the prior variance-covariance matrix of the coefficients in m_β.

Next, we consider (X'_jX_j + S_β)^{−1}. Let AA = X'_jX_j, where A is a positive definite matrix with eigenvalues of the order of √n_j. Then, using the Neumann series expansion (Horn and Johnson, 2012, p. 348), we arrive at:

(X'_jX_j + S_β)^{−1} = A^{−1}(I_q + A^{−1}S_βA^{−1})^{−1}A^{−1} = A^{−1}(I_q − B + B² − B³ + · · ·)A^{−1},     (6.4)

where B = A^{−1}S_βA^{−1} has eigenvalues of the order n_j^{−1}, and the above expansion is valid for sufficiently large n. Hence,

Y^j_n'X_j(X'_jX_j + S_β)^{−1}X'_jY^j_n = Y^j_n'X_jA^{−1}A^{−1}X'_jY^j_n + Σ_{k=1}^∞ (−1)^k Y^j_n'X_jA^{−1}B^kA^{−1}X'_jY^j_n.     (6.5)

Note that ||Y^j_n||² ≤ 2[ ||θ_n||² + ||e_n||² ] ≼ n with probability one, where θ_n is the vector of the θ_i's and e_n the vector of the e_i's. The eigenvalues of A^{−1}B^kA^{−1} are of the order n_j^{−(k+1)}. Hence |Y^j_n'X_jA^{−1}B^kA^{−1}X'_jY^j_n| ≼ n² n^{−(k+1)}, and therefore Σ_{k=1}^∞ |Y^j_n'X_jA^{−1}B^kA^{−1}X'_jY^j_n| is O(1) with probability one.

Let λ_i/n_j be the eigenvalues of B for i = 1, · · · , q, where the λ_i are positive and bounded. Again, log det(X'_jX_j + S_β) = 2 log(det(A)) + log(det(I + B)) = log det(X'_jX_j) + Σ_{i=1}^q log(1 + λ_i/n_j) = log det(X'_jX_j) + O(q/n). Hence, combining the determinant calculation with the residual calculation from equation (6.5), the result follows.
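The Neumann-series step (6.4)–(6.5) is easy to sanity-check numerically (our sketch): with the eigenvalues of B of order n^{−1}, a few terms of A^{−1}(I − B + B² − · · ·)A^{−1} already reproduce (X'X + S_β)^{−1} to near machine precision.

```python
# Numerical check of (6.4): (X'X + S)^{-1} = A^{-1}(I - B + B^2 - ...)A^{-1}
# with A = (X'X)^{1/2} and B = A^{-1} S A^{-1}.
import numpy as np

rng = np.random.default_rng(7)
n, q = 2000, 4
X = rng.standard_normal((n, q))
S = np.eye(q)                              # prior precision scale S_beta

G = X.T @ X
w, U = np.linalg.eigh(G)
A_inv = U @ np.diag(w ** -0.5) @ U.T       # A^{-1}, where A A = X'X
B = A_inv @ S @ A_inv                      # eigenvalues of order 1/n

series, term = np.zeros((q, q)), np.eye(q)
for _ in range(6):                         # I - B + B^2 - ... (6 terms)
    series += term
    term = -term @ B
approx = A_inv @ series @ A_inv
print(np.max(np.abs(approx - np.linalg.inv(G + S))))   # ~ machine precision
```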
Proof of Theorem 4

We assume that m_β ⊇ m_β*. For the case where m_β does not contain the true covariates, the proof follows as in Case (iii) of the proof of Theorem 3. The proof for m_β ⊇ m_β* is as follows.

From equation (6.1), we decompose E_{1,j} and E_{3,j} into two parts each, E_{1,j∩j*}, E_{1,j−j*} and E_{3,j∩j*}, E_{3,j−j*}, where E_{1,j∩j*} = Σ_{i∈P_j∩P*_j} e_i(x_i'E[β̂_j] − x_i'β̃*_j) and E_{1,j−j*} = Σ_{i∈P_j−P*_j} e_i(x_i'E[β̂_j] − x_i'β̃*_i); E_{3,j∩j*} and E_{3,j−j*} are defined similarly.

As in the proof of Theorem 3, let δ_{n,im} be the mth component of E[β̂_j] − β̃*_i and δ_{n,Em} the mth component of β̂_j − E[β̂_j]; δ_{n,im} can take l* + 1 different values, according as i ∈ P*_j for some j ∈ {1, · · · , l* + 1}, and is denoted δ_{nm,l'} for i ∈ P_j ∩ P*_{l'}.

Note that the number of observations in P_j − P*_j is ≼ nε_n, and δ_{n,im} on P_j ∩ P*_j is of the order ε_n (this follows from Lemma 15). Hence,

|E_{1,j∩j*}| ≤ Σ_m |δ_{nm,j}| n^{1/2} · n^{−1/2} | Σ_{i∈P_j∩P*_j} e_i x_{i,m} | ≼ ε_n n^{1/2} q_n log n

almost surely. Next,

|E_{1,j−j*}| ≤ Σ_{l'} Σ_m |δ_{nm,l'}| √(nε_n) · (1/√(nε_n)) | Σ_{i∈P_j∩P*_{l'}; l'≠j} e_i x_{i,m} | ≼ q_n √(nε_n) log n

almost surely. By a similar calculation, with probability one,

|E_{3,j∩j*}| ≤ Σ_{m,m'} |δ_{nm,j}| n^{−1/2} | Σ_{i∈P_j∩P*_j} x_{i,m'} (n^{1/2}δ_{n,Em'}) x_{i,m} | ≼ ε_n n^{1/2} q_n (log n)²,

and

|E_{3,j−j*}| ≤ Σ_{l'} Σ_{m,m'} |δ_{nm,l'}| n^{−1/2} | Σ_{i∈(P_j−P*_j)∩P*_{l'}: l'≠j} x_{i,m'} (n^{1/2}δ_{n,Em'}) x_{i,m} | ≼ ε_n n^{1/2} q_n (log n)².

As in the proof of Theorem 3, here we use the fact that the maximum absolute value of an n-dimensional multivariate normal vector is less than log n for large n.

Hence, from equation (6.2), using the stricter bounds for E_1, E_2, E_3 derived above,

−log L(Y_n | M^{ε_n}_{l*,m_β}) + log L(Y_n | M*_{l*,m_β}) ≽ nε_n − √(nε_n) q_n log n − q_n √(nε_n) log n + q_n log n → ∞     (6.6)

as n → ∞, which proves our result, since BF(M*_{l*,m_β}, M*) → 0 in probability as in Theorem 3 (Case i, second part), by BIC-type quantities for each partition P*_j.

Lemma 15.
Under the setting of Theorem 4, we have ||E[β̂_j] − β̃*_i||_∞ ≼ ε_n for i ∈ P_j ∩ P*_j.

Proof.
Let Z_j be the design matrix corresponding to the observations in P_j ∩ P*_j, and Z_{l'} the matrix corresponding to the observations in P_j ∩ P*_{l'}, l' = 1, · · · , l* + 1, l' ≠ j. Note that the number of observations corresponding to P_j ∩ P*_{l'} is less than nε_n + 1. Then,

E[β̂_j] = (Z'_jZ_j + Σ_{l'≠j} Z'_{l'}Z_{l'})^{−1}(X'_j E[Y^j_n]) = (CC + D)^{−1}(Z'_jZ_j β̃*_j + Σ_{l'≠j} Z'_{l'}Z_{l'} β̃*_{l'}),

where C is a positive definite matrix with eigenvalues of the order of √n_j when ε_n → 0, CC = Z'_jZ_j, and D = Σ_{l'≠j} Z'_{l'}Z_{l'}. Now, writing (CC + D)^{−1} = C^{−1}(I + C^{−1}DC^{−1})^{−1}C^{−1} and using the fact that the eigenvalues of B = C^{−1}DC^{−1} are of the order ε_n for large n, we have, similarly to equation (6.4),

(CC + D)^{−1} = C^{−1}(I − B + B² − · · ·)C^{−1}.

Hence,

E[β̂_j] = C^{−1}C^{−1}CC β̃*_j + C^{−1}C^{−1} Σ_{l'≠j} Z'_{l'}Z_{l'} β̃*_{l'} + Σ_{k=1}^∞ (−1)^k C^{−1}B^kC^{−1}(Z'_jZ_j β̃*_j + Σ_{l'≠j} Z'_{l'}Z_{l'} β̃*_{l'}).

The matrices C^{−1}B^kC^{−1} have eigenvalues at most of the order b^k ε_n^k n^{−1}, and C^{−1}C^{−1}Z'_{l'}Z_{l'} has eigenvalues at most of the order ε_n. The β*_j and β̃*_j are bounded. Hence,

||C^{−1}C^{−1} Σ_{l'≠j} Z'_{l'}Z_{l'} β̃*_{l'}||_∞ < c* ε_n  and  ||C^{−1}B^kC^{−1}(Z'_jZ_j β̃*_j + Σ_{l'≠j} Z'_{l'}Z_{l'} β̃*_{l'})||_∞ ≤ c* b^k (ε_n^k + b ε_n^{k+1}),

where c* > 0 is a universal constant (using A6). Hence, for i ∈ P_j ∩ P*_j, ||E[β̂_j] − β̃*_i||_∞ = ||E[β̂_j] − β̃*_j||_∞ ≼ ε_n.

Proof of Theorems 6 and 7
We use σ̂²_{M*} = ||Y_n − Ŷ_n||²/n for the true model with change points at the t*_i, and let σ̂²_{M_l} = σ̂²_{M_{l,m_β}} be the estimate corresponding to a model with l change points and covariate combination m_β.

From the earlier calculation in Proposition 1, replacing σ² in each partition P_j by σ̂²,

log L(Y_n | M_{l,m_β}) = −(1/2) Σ_j log(det(X'_jX_j)) − (n/2) log σ̂²_{M_l} − Σ_j (n_j/2) + O(lq_n).

Hence,
log L(Y_n | M_{l,m_β}) − log L(Y_n | M*)
= −(1/2)[ Σ_j log(det(X'_jX_j)) − Σ_j log(det(X'_{j*}X_{j*})) ] − (n/2) log(σ̂²_{M_l}/σ̂²_{M*}) + O(lq_n)
= −(1/2) Σ_j log(det(X'_jX_j)) + (1/2) Σ_j log(det(X'_{j*}X_{j*})) − (n/2) log(1 + (σ̂²_{M_l} − σ̂²_{M*})/σ̂²_{M*}) + O(lq_n).

Now, σ̂²_{M*} → σ² in probability, and (σ̂²_{M^ε_l} − σ̂²_{M*})/σ̂²_{M*} > 0 and (σ̂²_{M^{ε_n}_{l*}} − σ̂²_{M*})/σ̂²_{M*} > 0 for large n under the set-ups of Theorems 6 and 7, for M^ε_{l,m_β} and M^{ε_n}_{l*,m_β} respectively (from the proofs of Theorems 3 and 4).

For Theorem 6, for large n we have

log L(Y_n | M^ε_{l,m_β}) − log L(Y_n | M*) ≼ −(1/2) Σ_j log(det(X'_jX_j)) + (1/2) Σ_j log(det(X'_{j*}X_{j*})) − (nc/2) (σ̂²_{M^ε_l} − σ̂²_{M*})/σ̂²_{M*} ≼ −c_1 nε

for constants c, c_1 > 0, since σ̂²_{M^ε_l} is bounded with probability one and log(1 + (σ̂²_{M^ε_l} − σ̂²_{M*})/σ̂²_{M*}) > c (σ̂²_{M^ε_l} − σ̂²_{M*})/σ̂²_{M*} for some small positive constant c. The proofs for the case where m_β does not contain the true covariate combination, and for the case l < l*, follow similarly.

Under the set-up of Theorem 6, for a refinement of the true partition with a covariate combination containing the true covariates, (σ̂²_{M_l} − σ̂²_{M*})/σ̂²_{M*} → 0 and log(1 + (σ̂²_{M_l} − σ̂²_{M*})/σ̂²_{M*}) ∼ (σ̂²_{M_l} − σ̂²_{M*})/σ̂²_{M*}. Hence the proof of Theorem 3 (Case i) can be repeated, and we have log L(Y_n | M_{l,m_β}) − log L(Y_n | M*) ≼ −q_n log n.

For Theorem 7, we note that log(1 + (σ̂²_{M^{ε_n}_{l*}} − σ̂²_{M*})/σ̂²_{M*}) ≽ (σ̂²_{M^{ε_n}_{l*}} − σ̂²_{M*})/σ̂²_{M*} for large n, as σ̂²_{M^{ε_n}_{l*}} − σ̂²_{M*} ≥ 0 and is of the order ε_n for m_β ⊃ m_β*, using the fact that log(1 + x)/x → 1 as x → 0. Hence the proof of Theorem 4 can be repeated.

Addressing Remark 8

If σ² = σ²_j on P*_j, then σ̂²_{M*} converges in probability to Σ_{j=1}^{l*+1} (n*_j/n) σ²_j > 0. Hence the proofs of Theorems 6 and 7 go through.

Proof of Theorem 12
Let M_{l*,t*,m_β} (denoted M_{l*,m_β} for convenience) be the model with covariate combination m_β and the true change points t*_1, · · · , t*_{l*}. First we show that Σ_{m_β: m_β ≠ m_β*} Π(M_{l*,m_β} | ·)/Π(M* | ·) → 0 in probability (Part 1). Then we show that Σ_{m_β: m_β ≠ m_β*} Π(M^ξ_{l,m_β} | ·)/Π(M* | ·) → 0, where M^ξ_{l,m_β} is a model with change points t_1, t_2, · · · , t_l such that, over refinements of the true partition with change points t̃_1, · · · , t̃_l, we have inf_{t̃_1,···,t̃_l} max_i |t_i − t̃_i| = ξ > 0 (Part 2). Next, we address the case where t_1, . . . , t_l are the change points and l < l* (Part 3). For the case where t_1, · · · , t_l is a refinement of t*_1, · · · , t*_{l*}, the proof follows from Part 1 by considering M*_l (true covariates and change points t_1, . . . , t_l) in place of M*, concluding that Σ_{m_β: m_β ≠ m_β*} Π(M_{l,m_β} | ·)/Π(M*_l | ·) → 0 in probability, and using the fact that Π(M*_l | ·)/Π(M* | ·) → 0 in probability.

Part 1: Σ_{m_β: m_β ≠ m_β*} Π(M_{l*,m_β} | ·)/Π(M* | ·) → 0

Let R^{(j)}_t be the residual sum of squares for P*_j under M*. First let m_β ⊃ m_β*, with #{m_β} = q and #{m_β*} = t. Then, for the residual sum of squares R^{(j)}_m under m_β, we show that

P(R^{(j)}_t − R^{(j)}_m ≥ (q − t) + (q − t)α(log n)^α for some m_β ⊃ m_β*, #{m_β} = q) ≤ d_n^{q−t} e^{−c(q−t)α(log n)^α}

for some universal constant c > 0, for any α > 0. Hence,

P(R^{(j)}_t − R^{(j)}_m ≥ (q − t) + (q − t)α(log n)^α for some m_β ⊃ m_β*, #{m_β} = q, for some q) ≤ Σ_{q=t+1}^{q_n} d_n^{q−t} e^{−cα(q−t)(log n)^α}.

Here d_n = p, the number of available covariates (which depends on n), with log d_n = O(log n) (by assumption), and δ_n = Σ_{q=t+1}^{q_n} d_n^{q−t} e^{−cα(q−t)(log n)^α} → 0.

Similarly, for any m_β ⊅ m_β*, let R^{(j)}_{m'} be the residual sum of squares for m_β ∪ m_β*. A conservative bound is given by

P(R^{(j)}_t − R^{(j)}_{m'} ≥ q + αq(log n)^α for some m_β ⊅ m_β*, #{m_β} = q, for some q) ≤ Σ_{q=1}^{q_n} d_n^q e^{−cαq(log n)^α}.

Now consider any m̃_β missing at least one of the true covariates. Let R^{(j)}_{m'} be the residual sum of squares for m̃_β ∪ m_β* and R^{(j)}_m that for m̃_β. Then,

R^{(j)}_m − R^{(j)}_{m'} = (β̂^{(j)}_{m'} − β̂^{(j)}_m)' X'_{j*} X_{j*} (β̂^{(j)}_{m'} − β̂^{(j)}_m),

where β̂^{(j)}_{m'}, β̂^{(j)}_m are the corresponding least squares estimates for m̃_β ∪ m_β* and m̃_β respectively, with the entries of β̂^{(j)}_m corresponding to coefficients in m̃_β ∪ m_β* but not in m̃_β filled with zeros, and X_{j*} is the design matrix for P*_j with the covariates in m̃_β ∪ m_β*. Let δ > 0 be the minimum absolute value of the true coefficient vector over all P*_j. Then the least squares estimate of a coefficient present in the true model has absolute value less than δ/2 with probability less than e^{−c_0 n} for some universal constant c_0 > 0, and hence the probability that some covariate belonging to the true model has coefficient estimate less than δ/2 for some m̃_β ∪ m_β* is less than d_n^{q_n} e^{−c_0 n} = o(δ_n). Similarly, a covariate not present in the true model has coefficient estimate with absolute value less than δ/2 with probability 1 − o(δ_n), over all possible covariate combinations.
Hence,

P(R^{(j)}_m − R^{(j)}_{m'} ≽ n for all such covariate combinations) ≥ 1 − δ_n,
P(R^{(j)}_m − R^{(j)}_t ≽ n for all such covariate combinations) ≥ 1 − δ_n.     (6.7)

Hence, for m_β ⊃ m_β*, with probability at least 1 − δ_n,
log Π(M_{l*,t*,m_β} | ·) − log Π(M* | ·) ≤ (q − t) log p̃_n − (1/2)(q − t)(1 − α(log n)^{α−1}) log n + d(q − t)

uniformly over m_β, where d does not depend on m_β, and −log p̃_n ∼ (log n)^α with α > 1 by B1. Hence, choosing α > 0 small enough in the tail bound and summing over m_β ⊃ m_β*,

Σ_{m_β ⊃ m_β*; #{m_β} ≤ q_n} Π(M_{l*,m_β} | ·)/Π(M* | ·) ≤ Σ_{q>t} e^{(q−t) log p̃_n} binom(d_n, q−t) e^{−(1/2)(1−α(log n)^{α−1})(q−t) log n} d^{q−t} ≼ p̃_n^{α'} → 0

as n → ∞, for some 0 < α' < 1, where d is a universal constant.

Similarly, summing over m_β ⊅ m_β* gives, outside a set of probability δ_n, for some constant d' > 0,

Σ_{m_β ⊅ m_β*; #{m_β} ≤ q_n} Π(M_{l*,t*,m_β} | ·)/Π(M* | ·) ≼ e^{−d'n} → 0.

As these results hold outside a set of probability O(δ_n) with δ_n → 0, this proves our claim.

Part 2: Σ_{m_β: m_β ≠ m_β*} Π(M^ξ_{l,m_β} | ·)/Π(M* | ·) → 0

Using the calculation from Theorem 11 with ξ in place of ε_n, or repeating the argument of the proof of Theorem 11 for Case (ii) of the proof of Theorem 3, we can bound the cross-product terms, and bound the chi-square term ||X_jβ̂_j − X_jE[β̂_j]||² as in Part 1 of this proof or in Theorem 11, outside a set of probability approaching zero. Therefore, outside this set,

log L(Y_n | M^ξ_{l,m_β}) − log L(Y_n | M*_{l,t̃}) ≼ −c_1 n

uniformly over covariate choices, for some c_1 > 0, where M*_{l,t̃} denotes the model with the true covariate combination and change points t̃_1, · · · , t̃_l for any refinement of the true partition corresponding to t*_1, · · · , t*_{l*}. Hence,

Σ_{m_β ≠ m_β*} Π(M^ξ_{l,m_β} | ·)/Π(M*_{l,t̃} | ·) ≤ Σ_{q=t+1}^{q_n} e^{(q−t) log p̃_n} binom(d_n, q−t) e^{−c'n} → 0

in probability, where c' > 0 is a constant. We have already shown that Π(M*_{l,t̃} | ·)/Π(M* | ·) → 0 in probability, which proves our claim.

Part 3: Σ_{m_β: m_β ≠ m_β*} Π(M_{l,m_β} | ·)/Π(M* | ·) → 0 for l < l*

Using the calculation from the proof of Theorem 11, we bound the cross-product terms and the chi-square term over all covariate choices, as in the last part, and using Case (iv) of the proof of Theorem 3, outside a set of probability approaching zero, log L(Y_n | M_{l,m_β}) − log L(Y_n | M*) ≼ −c_2 n uniformly over covariate choices, for some c_2 > 0. Hence Σ_{m_β: m_β ≠ m_β*} Π(M_{l,m_β} | ·)/Π(M* | ·) ≼ e^{−c''n} → 0 in probability, where c'' > 0.

Proof of Theorem 11
Note that for a subset S of {1, · · · , n} of cardinality n_s, we have P(|n_s^{−1/2} Σ_{i∈S} e_i x_{ij}| ≥ √q log n) ≤ e^{−cq(log n)²} for a universal constant c > 0, for q covariates. A similar bound can be derived for each coordinate of √n(β̂_j − E[β̂_j]). Hence, over all possible covariate and change point choices,

|E_1|, |E_2|, |E_3| ≼ max{ √(nε_n) q_n √(q_n) log n, ε_n √n q_n √(q_n) log n }

outside a set of probability at most δ̃_n ∼ n^{l*} Σ_{q=1}^{q_n} d_n^q e^{−cq(log n)²} → 0.

Again, ||X_jβ̂_j − X_jE[β̂_j]||² ≤ q + αq(log n)^α outside a set of probability approaching zero, from Part 1 of the proof of Theorem 12. Therefore, outside a set of probability approaching zero,

−2σ² log L(Y_n | M^{ε_n}_{l*,m_β}) + 2σ² log L(Y_n | M*) ≽ nε_n − √(nε_n) q_n^{1.5} log n − √(nε_n) q_n^{1.5} log n + q_n log n → ∞.

Note that in the above display we have a universal lower bound of the order nε_n for the quadratic term corresponding to Σ_{j: P_j∩P*_j ≠ φ}(β̃*_j − E[β̂_j])'X'_{j∩j*}X_{j∩j*}(β̃*_j − E[β̂_j]) + Σ_j Σ_{k≠j: P_j∩P*_k ≠ φ}(β̃*_k − E[β̂_j])'X'_{j∩k*}X_{j∩k*}(β̃*_k − E[β̂_j]), from equations (6.2) and (6.6), over all covariate and change point choices, as a result of the eigenvalue condition given in A6. Hence, summing over the possible covariate choices,

Σ_{m_β ⊃ m_β*} P_n(M^{ε_n}_{l*,m_β} : M*) ≼ q_n e^{−α' nε_n + q_n log d_n} → 0

in probability, uniformly over change point choices, where α' > 0 is a constant.

Similarly, for m_β ⊅ m_β*, outside a set of probability approaching zero we have −2σ² log L(Y_n | M^{ε_n}_{l*,m_β}) + 2σ² log L(Y_n | M*) ≽ n, from a calculation similar to the one leading to equation (6.7). Hence Σ_{m_β ⊅ m_β*} P_n(M^{ε_n}_{l*,m_β} : M*) → 0, which concludes our claim.

Addressing Remark 13

Remark 13 follows from the proofs of Theorems 6 and 7.
Proof of Theorem 9
The proof of Theorem 9 follows directly from the proof of Theorem 12.
References
Adams, R. P. and MacKay, D. J. (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. T. (2019). Lasso meets horseshoe: A survey. Statistical Science, forthcoming.

Carlin, B. P., Gelfand, A. E., and Smith, A. F. (1992). Hierarchical Bayesian analysis of changepoint problems. Journal of the Royal Statistical Society: Series C (Applied Statistics).
Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics.
Chernoff, H. and Zacks, S. (1964). Estimating the current mean of a normal distribution which is subjected to changes in time. The Annals of Mathematical Statistics.
Journal of Econometrics.
Chillar and Drawve (2020). Policing: A Journal of Policy and Practice.
Csörgő, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis. John Wiley & Sons, Chichester.

Datta, A., Zou, H., and Banerjee, S. (2019). Bayesian high-dimensional regression for change point analysis. Statistics and Its Interface.
Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis.

Frick, K., Munk, A., and Sieling, H. (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
The Annals of Statistics.

Gao, C., Han, F., and Zhang, C.-H. (2017). arXiv preprint arXiv:1705.06386.

Gardner, L. (1969). On detecting changes in the mean of normal variates. The Annals of Mathematical Statistics.
Ghosh et al. (2016). Bayesian Analysis.
Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association.
Lee, S., Seo, M. H., and Shin, Y. (2016). The lasso for high dimensional regression with a possible change point. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Liu, H., Gao, C., and Samworth, R. J. (2019). arXiv preprint arXiv:1907.10012.

Martin, R. and Shen, W. (2017). Asymptotically optimal empirical Bayes inference in a piecewise constant sequence model. arXiv preprint arXiv:1712.03848.

Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association.
Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics.
Page, E. S. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika.

Page, E. S. (1957). On problems in which a change in a parameter occurs at an unknown point. Biometrika.

Polson, N. G. and Scott, J. G. (2010a). arXiv preprint arXiv:1010.5223.

Polson, N. G. and Scott, J. G. (2010b). Shrink globally, act locally: Sparse Bayesian regularization and prediction. Bayesian Statistics, 9.
Journal of Applied Statistical Science.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics.

Sankhyā: The Indian Journal of Statistics, Series A, pages 173–186.

Sen, P. K. (1980). Asymptotic theory of some tests for a possible change in the regression slope occurring at an unknown time point. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete.
Biometrika.

Statistical Distributions in Scientific Work, pages 181–191. Springer.

Stephens, D. (1994). Bayesian retrospective multiple-changepoint identification. Journal of the Royal Statistical Society: Series C (Applied Statistics).
Journal of Econometrics.

Doklady Akademii Nauk, volume 259, pages 270–274. Russian Academy of Sciences.

Zacks, S. (1983). Survey of classical and Bayesian approaches to the change-point problem: fixed sample and sequential procedures of testing and estimation. In Recent Advances in Statistics. Academic Press.