Localising change points in piecewise polynomials of general degrees
Yi Yu (Department of Statistics, University of Warwick) and Sabyasachi Chatterjee (Department of Statistics, University of Illinois Urbana-Champaign)

July 21, 2020
Abstract
In this paper we are concerned with a sequence of univariate random variables with piecewise-polynomial means and independent sub-Gaussian noise. The underlying polynomials are allowed to be of arbitrary but fixed degrees. We propose a two-step estimation procedure based on $\ell_0$-penalisation and provide upper bounds on the localisation error. We complement these results by deriving information-theoretic lower bounds, which show that our two-step estimators are nearly minimax rate-optimal. We also show that our estimator enjoys near-optimally adaptive performance, attaining individual localisation errors that depend on the level of smoothness at individual change points of the underlying signal.

1 Introduction

We are concerned with the model $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$, where, for each $i \in \{1, \ldots, n\}$,
$$y_i = f(i/n) + \varepsilon_i, \quad (1)$$
$f: [0, 1] \to \mathbb{R}$ is an unknown piecewise-polynomial function and the $\varepsilon_i$'s are independent mean-zero sub-Gaussian random variables. To be specific, associated with $f(\cdot)$ there is a sequence of strictly increasing integers $\{\eta_k\}_{k=0}^{K+1}$, with $\eta_0 = 1$ and $\eta_{K+1} = n + 1$, such that $f(\cdot)$ restricted to each interval $[\eta_k/n, \eta_{k+1}/n)$, $k = 0, \ldots, K$, is a polynomial of degree at most $r \in \mathbb{N}$. The maximum degree $r$ is assumed to be arbitrary but fixed, and the number of change points $K$ is allowed to diverge as the sample size $n$ grows unbounded. The goal of this paper is to estimate $\{\eta_k\}_{k=1}^{K}$, called the change points of $f(\cdot)$, accurately, and to understand the fundamental limits in detecting and localising these change points. An illustrative simulation of model (1) is given below.
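To fix ideas, the following is a minimal Python sketch of model (1). All concrete choices here (the sample size, the change point locations, the piecewise-linear coefficients and the Gaussian noise, one instance of sub-Gaussian noise) are hypothetical and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of model (1): n = 300, K = 2 change points at
# eta_1 = 101 and eta_2 = 201, degree r = 1 (piecewise linear), Gaussian noise.
n, sigma = 300, 0.5
eta = [1, 101, 201, n + 1]                      # eta_0 = 1, eta_{K+1} = n + 1
coefs = [(0.0, 1.0), (2.0, -1.0), (0.0, 0.5)]   # (intercept, slope) per segment

x = np.arange(1, n + 1) / n                     # design points i/n
f = np.empty(n)
for (s, e), (a0, a1) in zip(zip(eta[:-1], eta[1:]), coefs):
    f[s - 1:e - 1] = a0 + a1 * x[s - 1:e - 1]   # piece on [eta_k/n, eta_{k+1}/n)

y = f + sigma * rng.standard_normal(n)          # y_i = f(i/n) + epsilon_i
```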
The work in this paper falls within the general topic of change point analysis, which has a long history and is still being actively studied. In change point analysis, one assumes that the underlying distributions change at a set of unknown time points, called change points, and stay the same between two consecutive change points. A closely related problem is change point detection in piecewise-constant signals. This has been studied thoroughly in Chan and Walther (2013), Frick et al. (2014), Dümbgen and Spokoiny (2001), Dümbgen and Walther (2008), Li et al. (2017), Jeng et al. (2012) and Wang et al. (2020a), among others. Recently, Fearnhead et al. (2019) studied change point analysis in piecewise-linear signals. Our work can be seen as a generalisation of the aforementioned results, allowing for polynomials of arbitrary degrees. Detailed discussions regarding comparisons with Wang et al. (2020a) and Fearnhead and Rigaill (2018) will be provided after we present our main results.

Beyond univariate sequences, the existing work on change point analysis includes studies on high-dimensional models (e.g. Dette et al., 2018; Wang et al., 2017; Wang and Samworth, 2018), network models (e.g. Bhattacharjee et al., 2018; Cribben and Yu, 2017; Wang et al., 2018) and nonparametric models (e.g. Garreau and Arlot, 2018; Padilla et al., 2019a,b).

To divert slightly, it is worth mentioning that, instead of focusing on estimating the locations of the change points, a complementary problem is to estimate the whole underlying piecewise polynomial function itself. This is a canonical problem in nonparametric regression and also has a long history. The piecewise polynomial function is typically assumed to satisfy certain regularity at the change points. The classical settings assume that the degrees of the underlying polynomials take particular values and that the change points, referred to as knots, are at fixed locations; see e.g. Green and Silverman (1993) and Wahba (1990). More recent regression methods have focussed on fitting piecewise polynomials where the knots are not fixed beforehand but are estimated from the data (e.g. Guntuboyina et al., 2020; Mammen and van de Geer, 1997; Shen et al., 2020; Tibshirani, 2014).

In this paper, we focus on estimating the locations of the change points accurately, allowing for general and different degrees of polynomials within $f(\cdot)$, a diverging number of change points, and different smoothness at different change points. This framework, to the best of our knowledge, is the most flexible one in both the change point analysis and spline regression literatures. In the rest of this paper, we first formalise the problem and introduce the algorithm in Section 1.1, followed by a list of contributions in Section 1.2. The main results are collected in Section 2, with more discussions in Section 3 and the proofs in the Appendices.

1.1 The two-step estimation procedure

In order to estimate the change points of $f(\cdot)$, we propose a two-step estimator. The estimator is defined in this subsection, after introducing the necessary notation used throughout this paper.

Let $\Pi$ be any interval partition of $\{1, \ldots, n\}$, i.e. a collection of $|\Pi| \geq 1$ disjoint integer intervals covering $\{1, \ldots, n\}$,
$$\Pi = \bigl\{ \{1, \ldots, s_1 - 1\}, \{s_1, \ldots, s_2 - 1\}, \ldots, \{s_{|\Pi|-1}, \ldots, n\} \bigr\},$$
for some integers $1 = s_0 < s_1 < \cdots < s_{|\Pi|-1} \leq n < s_{|\Pi|} = n + 1$, with $|\cdot|$ denoting the cardinality of a set. For any such partition $\Pi$, we denote by $\eta(\Pi) = \{s_1, \ldots, s_{|\Pi|-1}\}$ its change points. Let $\mathcal{P}_n$ be the collection of all such interval partitions of $\{1, \ldots, n\}$.

For any fixed $\lambda > 0$ and data $y \in \mathbb{R}^n$, let the estimated partition be
$$\widehat{\Pi} \in \operatorname*{argmin}_{\Pi \in \mathcal{P}_n} G(\Pi, \lambda), \quad (2)$$
where
$$G(\Pi, \lambda) = \sum_{I \in \Pi} \|y_I - P_I y_I\|^2 + \lambda |\Pi| = \sum_{I \in \Pi} H(y, I) + \lambda |\Pi|, \quad (3)$$
with the notation therein introduced below.
• The norm $\|\cdot\|$ denotes the $\ell_2$-norm of a vector.
• For any interval $I = \{s, \ldots, e\} \subset \{1, \ldots, n\}$, let $y_I = (y_i, i \in I)^\top \in \mathbb{R}^{|I|}$ be the data vector on the interval $I$ and let $P_I$ be the projection matrix
$$P_I = U_{I,r} (U_{I,r}^\top U_{I,r})^{-1} U_{I,r}^\top, \quad (4)$$
with
$$U_{I,r} = \begin{pmatrix} 1 & s/n & \cdots & (s/n)^r \\ \vdots & \vdots & \ddots & \vdots \\ 1 & e/n & \cdots & (e/n)^r \end{pmatrix} \in \mathbb{R}^{(e-s+1) \times (r+1)}. \quad (5)$$

We can see that the loss function $G(\cdot, \cdot)$ is a penalised residual sum of squares. The penalty is imposed on the cardinality of the partition, which is in fact an $\ell_0$ penalisation. The residual sum of squares consists of the residuals after projecting the data onto the discrete polynomial space. The initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ are defined to be $\eta(\widehat{\Pi})$, the change points of $\widehat{\Pi}$. A sketch of computing $H(y, I)$ follows.
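The residual sum of squares $H(y, I)$ in (3) is easy to compute. The sketch below is our illustration under the reading of (3)-(5) above, not the authors' code: it fits the degree-$r$ polynomial by least squares, which gives the same value as explicitly forming the projection matrix $P_I$ in (4) but is numerically more stable. Later sketches reuse this function.

```python
import numpy as np

def H(y, s, e, r, n):
    """Residual sum of squares H(y, I) = ||y_I - P_I y_I||^2 for the interval
    I = {s, ..., e} (1-based, inclusive).  The columns of U are
    (1, i/n, ..., (i/n)^r), matching U_{I,r} in (5); solving the least squares
    problem is equivalent to projecting onto the column space of U."""
    i = np.arange(s, e + 1)
    U = np.vander(i / n, N=r + 1, increasing=True)
    yI = np.asarray(y)[s - 1:e]
    coef, *_ = np.linalg.lstsq(U, yI, rcond=None)
    return float(np.sum((yI - U @ coef) ** 2))
```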
With the estimated partition $\widehat{\Pi}$ and its associated change points $\eta(\widehat{\Pi})$, provided that $|\eta(\widehat{\Pi})| \geq 1$, we proceed to the second-step estimation. For any $k \in \{1, \ldots, \widehat{K}\}$, let
$$s_k = \lfloor \widetilde{\eta}_{k-1}/2 \rfloor + \lfloor \widetilde{\eta}_k/2 \rfloor, \quad e_k = \lfloor \widetilde{\eta}_k/2 \rfloor + \lfloor \widetilde{\eta}_{k+1}/2 \rfloor, \quad I_k = [s_k, e_k),$$
with $\widetilde{\eta}_0 = 1$ and $\widetilde{\eta}_{\widehat{K}+1} = n + 1$. For any $k \in \{1, \ldots, \widehat{K}\}$, we define
$$\widehat{\eta}_k = \operatorname*{argmin}_{t \in I_k \setminus \{s_k\}} \bigl\{ H(y, [s_k, t)) + H(y, [t, e_k)) \bigr\}, \quad (6)$$
where $H(\cdot, \cdot)$ is defined in (3). The updated estimators $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$ are our final estimators.

In summary, this two-step algorithm proceeds by solving the optimisation problem (2), providing a set of initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$. With the initial estimators at hand, a parallelisable second step works on every triplet $(\widetilde{\eta}_{k-1}, \widetilde{\eta}_k, \widetilde{\eta}_{k+1})$, $k \in \{1, \ldots, \widehat{K}\}$, to refine $\widetilde{\eta}_k$ and yield $\widehat{\eta}_k$. This update does not change the number of estimated change points. To facilitate later references to our two-step algorithm, we present the full procedure in Algorithm 1; a code sketch follows the algorithm.
Algorithm 1: Two-step estimation

INPUT: data $\{y_i\}_{i=1}^n$, tuning parameter $\lambda > 0$.
  $\widehat{\Pi} \leftarrow \operatorname*{argmin}_{\Pi \in \mathcal{P}_n} G(\Pi, \lambda)$    ⊲ see (3)
  $B \leftarrow \eta(\widehat{\Pi})$    ⊲ the initial estimators
  if $B \neq \emptyset$ then
    $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}} \leftarrow$ update $B$ based on (6)    ⊲ the final estimators
  end if
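The following is a minimal sketch of Algorithm 1 (our illustration, not the authors' implementation), reusing the function H from the previous sketch. The first step solves (2) by the standard dynamic programme for minimal partitioning discussed in Remark 1 below; the second step implements the refinement (6).

```python
import numpy as np

def two_step(y, r, lam):
    """Sketch of Algorithm 1: l_0-penalised partitioning (2) followed by the
    local refinement (6).  Assumes H(y, s, e, r, n) from the previous sketch."""
    n = len(y)
    # Step 1: dynamic programme.  B[e] is the minimal value of G over
    # partitions of {1, ..., e}; prev[e] stores the start of the last interval.
    B = np.full(n + 1, np.inf)
    B[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for e in range(1, n + 1):
        for s in range(1, e + 1):               # last interval {s, ..., e}
            cost = B[s - 1] + H(y, s, e, r, n) + lam
            if cost < B[e]:
                B[e], prev[e] = cost, s
    eta_tilde, e = [], n                        # backtrack to read off eta(Pi_hat)
    while e > 0:
        s = prev[e]
        if s > 1:
            eta_tilde.append(s)
        e = s - 1
    eta_tilde.sort()                            # initial estimators
    # Step 2: refine each initial estimator within I_k = [s_k, e_k), cf. (6).
    bounds = [1] + eta_tilde + [n + 1]
    eta_hat = []
    for k in range(1, len(bounds) - 1):
        s_k = bounds[k - 1] // 2 + bounds[k] // 2
        e_k = bounds[k] // 2 + bounds[k + 1] // 2
        eta_hat.append(min(
            range(s_k + 1, e_k),
            key=lambda t: H(y, s_k, t - 1, r, n) + H(y, t, e_k - 1, r, n)))
    return eta_tilde, eta_hat
```

On the simulated data above, `two_step(y, r=1, lam=...)` returns both sets of estimators; the $O(n^2)$ loop structure of the first step, with each evaluation of H costing $O(n)$, matches the $O(n^3)$ total cost discussed in Remark 1 below.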
We conclude this subsection with two remarks, on the optimisation problem (2) and on the polynomial degree input $r$, respectively.

Remark 1 (The optimisation problem (2)). The uniqueness of the solution of (2) is not guaranteed in general, but the properties we are to present regarding the change point estimators hold for any solution. In fact, under some mild conditions, for instance the existence of densities of the noise distribution, one can show that the minimiser of (2) is unique almost surely.

The optimisation problem (2), with a general loss function, is known as the minimal partitioning problem (e.g. Algorithm 1 in Friedrich et al., 2008), and can be solved by a dynamic programming approach in polynomial time. The computational cost is of order $O(n^2 \, \mathrm{Cost}(n))$, where $\mathrm{Cost}(n)$ is the computational cost of calculating $H(y, I)$ for a given interval $I$. To be specific, for (2), $\mathrm{Cost}(n) = O(n)$, where the hidden constants depend on the polynomial degree $r$; therefore the total computational cost is $O(n^3)$. A reference where the computational cost and the dynamic programming algorithm are explicitly mentioned is Lemma 1.1 in Chatterjee and Goswami (2019).

We would like to mention that the minimal partitioning problem has previously been used in the change point analysis literature for other models, including Fearnhead and Rigaill (2018), Killick et al. (2012), Wang et al. (2020a), Wang et al. (2019) and Wang et al. (2020b), among others. In the spline regression literature, the $\ell_0$ penalisation is also exploited, for instance, in Shen et al. (2020) and Chatterjee and Goswami (2019), to name but a few.

Remark 2 (The polynomial degree upper bound $r$). The degree $r$ is in fact an input of the algorithm: one needs to specify $r$ in (2) and (6). Usually, when we define a degree-$d$ polynomial, we let
$$g(x) = \sum_{l=0}^{d} c_l x^l, \quad x \in \mathbb{R},$$
with $\{c_l\}_{l=0}^{d} \subset \mathbb{R}$ and $c_d \neq 0$. If $c_d = 0$, then $g(\cdot)$ is regarded as a degenerate degree-$d$ polynomial. In this paper, we do not insist on the highest-degree coefficient being nonzero. With this flexibility, in practice, as long as the input $r$ is not smaller than the largest degree of the underlying polynomials, the performance guarantees of the algorithm still hold. However, the larger the input $r$ is, the more costly the optimisation becomes. More regarding this point will be discussed after we present our main theorem.

1.2 Main contributions

To conclude this section, we summarise our contributions in this paper.

Firstly, to the best of our knowledge, this is the first paper studying change point localisation in piecewise polynomials with general degrees. The model we are concerned with enjoys great flexibility: we allow the number of change points and the variance of the noise sequence to diverge, and the differences between two consecutive distinct polynomials to vanish, as the sample size grows unbounded.

Secondly, we propose a two-step estimation procedure for the change points, detailed in Algorithm 1. The first step is a version of the minimal partitioning problem (e.g. Friedrich et al., 2008), and the second step is a parallelisable update. The first step can be done in $O(n^3)$ time and the second step in $O(n^2)$ time.

Thirdly, we provide theoretical guarantees for the change point estimators returned by Algorithm 1. To the best of our knowledge, this is the first time that change point localisation rates have been established for piecewise polynomials with general degrees.
Prior to this paper, the state-of-the-art results concerned only piecewise-constant signals; as for piecewise-linear signals, existing work has studied signals that are necessarily continuous. In contrast, we allow the underlying contiguous polynomials to possess different smoothness at different change points. This is reflected in our localisation error bound for each individual change point. In short, we show that our change point estimator enjoys nearly optimal adaptive localisation rates.

Lastly, in a fully finite-sample framework, we provide information-theoretic lower bounds characterising the fundamental difficulty of the problem, showing that our estimators are nearly minimax rate-optimal. To the best of our knowledge, even in the piecewise-linear case, previous minimax lower bounds focussed only on the scaling in the sample size $n$, whereas we derive a minimax lower bound involving all the parameters of the problem. More detailed comparisons with the existing literature are given in Section 3.

2 Main results

In this section, we investigate the theoretical properties of the initial and the final estimators returned by Algorithm 1.
2.1 Model parameterisation

In the change point analysis literature, the difficulty of the change point estimation task can be characterised by two key model parameters: the minimal spacing between two consecutive change points and the minimal difference between two consecutive underlying distributions. In this paper, the underlying distributions are determined by the polynomial coefficients. For two different polynomials of degree at most $r$, the difference boils down to the difference between two $(r+1)$-dimensional vectors consisting of the polynomial coefficients. To characterise this difference, for any integers $r, K \geq 0$, we adopt the following reparameterisation for any piecewise polynomial function $f(\cdot) \in \mathcal{F}_n^{r,K}$, where
$$\mathcal{F}_n^{r,K} = \Bigl\{ f(\cdot): [0,1] \to \mathbb{R} :\; 1 = \eta_0 < \eta_1 < \cdots < \eta_K \leq n < \eta_{K+1} = n+1, \text{ s.t. } \forall k \in \{0, 1, \ldots, K\},$$
$$f|_{[\eta_k/n, \eta_{k+1}/n)}: [\eta_k/n, \eta_{k+1}/n) \to \mathbb{R}, \text{ with } f|_{[\eta_k/n, \eta_{k+1}/n)}(x) = f(x), \text{ is a polynomial of degree at most } r \Bigr\}. \quad (7)$$
In particular, any $f(\cdot) \in \mathcal{F}_n^{r,K}$ is right-continuous with left limits.

Definition 1.
Let $f(\cdot) \in \mathcal{F}_n^{r,K}$ and let $\{\eta_k\}_{k=0}^{K+1} \subset \{1, \ldots, n+1\}$ be the collection of change points of $f(\cdot)$, with $\eta_0 = 1$ and $\eta_{K+1} = n+1$. For any $k \in \{1, \ldots, K\}$, let $f|_{[\eta_{k-1}/n, \eta_{k+1}/n)}(\cdot): [\eta_{k-1}/n, \eta_{k+1}/n) \to \mathbb{R}$ be the restriction of $f(\cdot)$ on $[\eta_{k-1}/n, \eta_{k+1}/n)$. Define the reparameterisation of $f|_{[\eta_{k-1}/n, \eta_{k+1}/n)}(\cdot)$ as
$$f(x) = \begin{cases} \sum_{l=0}^{r} a_l (x - \eta_k/n)^l, & x \in [\eta_{k-1}/n, \eta_k/n), \\ \sum_{l=0}^{r} b_l (x - \eta_k/n)^l, & x \in [\eta_k/n, \eta_{k+1}/n), \end{cases} \quad (8)$$
where $\{a_l, b_l\}_{l=0}^{r} \subset \mathbb{R}$. Define the jump associated with the change point $\eta_k$ as
$$\kappa_k = |a_{r_k} - b_{r_k}| > 0, \quad \text{where } r_k = \min\{l = 0, \ldots, r :\; a_l \neq b_l\}. \quad (9)$$

We define the jump associated with each change point of $f(\cdot) \in \mathcal{F}_n^{r,K}$ in Definition 1. The definition is based on a reparameterisation of two consecutive polynomials. Using the notation in Definition 1, due to the definition (7), $f(\cdot)$ is a polynomial of degree at most $r$ on each $[\eta_k/n, \eta_{k+1}/n)$, $k \in \{0, \ldots, K\}$. This enables the reparameterisation (8).

With the reparameterisation (8), it is easy to see that, if $f(\cdot)$ is $d$-times differentiable but not $(d+1)$-times differentiable at $\eta_k/n$, for $d \in \{-1, 0, \ldots, r-1\}$, then $r_k = d + 1$. Here we use the convention that $f(\cdot)$ being $(-1)$-times differentiable at $x$ means that $f(\cdot)$ is not continuous at $x$.

There are two key advantages of using Definition 1 to characterise the difference. Firstly, we allow for a full range of smoothness at the change points. Detecting change points in piecewise-linear models was studied in Fearnhead et al. (2019), but continuity at the change points is imposed there. Translated into our notation, that means $r_k = 1$ for all $k \in \{1, \ldots, K\}$. Our formulation covers this continuous case but also allows for discontinuity. Most importantly, we allow each change point to have its individual smoothness indicator $r_k$.

Secondly, we seek the smallest among all degrees with different coefficients, i.e. in (9) we define $r_k$ to be the minimum of all $l$ with $a_l \neq b_l$. There are apparently other alternatives: for instance, one can choose the maximum among all such $l$, or one can instead define $\kappa_k$ to be a vector norm (say the $\ell_2$-norm) of the difference of the coefficient vectors. We claim that Definition 1 yields the sharpest localisation rates when estimating the change points under the weakest conditions. More discussions on this will be available after we present our main theorem. A small illustrative sketch of Definition 1 follows.
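As an illustration of Definition 1, the following hypothetical snippet computes $r_k$ and $\kappa_k$ from the two coefficient vectors $\{a_l\}$ and $\{b_l\}$ of the reparameterisation (8), together with the signal strength $\rho_k$ of Definition 2 below; all numbers are made up.

```python
import numpy as np

def jump(a, b, tol=1e-12):
    """Smoothness index r_k and jump size kappa_k of Definition 1, given the
    coefficient vectors a, b (length r + 1) of the expansions (8) around
    eta_k / n."""
    diff = np.abs(np.asarray(a, float) - np.asarray(b, float))
    differing = np.flatnonzero(diff > tol)
    if differing.size == 0:
        raise ValueError("identical coefficients: eta_k is not a change point")
    r_k = int(differing[0])      # smallest degree with differing coefficients
    return r_k, float(diff[r_k])

# A kink: continuous (a_0 = b_0) but with different slopes, so r_k = 1.
r_k, kappa_k = jump(a=[1.0, 2.0, 0.0], b=[1.0, -1.0, 0.0])  # r_k = 1, kappa_k = 3
n, Delta = 1000, 200
rho_k = kappa_k ** 2 * Delta ** (2 * r_k + 1) / n ** (2 * r_k)
```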
2.2 Upper bounds

In this section, we present our main theorem providing theoretical guarantees on the output of Algorithm 1, with the assumptions collected in Assumption 1.

Assumption 1 (Model assumptions). Assume that the data $\{y_i\}_{i=1}^n$ are generated from (1), where $f(\cdot)$ belongs to $\mathcal{F}_n^{r,K}$ defined in (7) and the $\varepsilon_i$'s are independent mean-zero sub-Gaussian random variables with $\max_{i=1,\ldots,n} \|\varepsilon_i\|_{\psi_2} \leq \sigma$. (We denote by $\|\cdot\|_{\psi_2}$ the sub-Gaussian or Orlicz-$\psi_2$ norm: for any random variable $X$, $\|X\|_{\psi_2} = \inf\{t > 0:\; \mathbb{E}\exp(X^2/t^2) \leq 2\}$.) We denote the collection of all change points of $f(\cdot)$ by $\{\eta_1, \ldots, \eta_K\}$, satisfying
$$\Delta = \min_{k \in \{1, \ldots, K+1\}} (\eta_k - \eta_{k-1}) > 0, \quad \text{where } \eta_0 = 1 \text{ and } \eta_{K+1} = n+1.$$
In addition, let $\kappa = \min_{k=1,\ldots,K} \kappa_k > 0$, where $\kappa_k$ is defined in Definition 1.

The problem is now completely characterised by the sample size $n$, the maximum degree $r$, the number of change points $K$, the upper bound on the fluctuations $\sigma$, the minimal spacing $\Delta$, the jump sizes $\{\kappa_k\}$ and the associated smoothness levels $\{r_k\}$. In this paper, we allow the maximum degree $r$ to be arbitrary but fixed, i.e. not a function of the sample size $n$. We allow the number of change points $K$ and the fluctuation bound $\sigma$ to diverge, and the minimal spacing $\Delta$ and the jump size $\kappa$ to vanish, as the sample size grows unbounded. We formalise the signal strength in Definition 2 using these model parameters.

Definition 2. Under Assumption 1, for any $k \in \{1, \ldots, K\}$, define the signal strength of the change point $\eta_k$ to be $\rho_k = \kappa_k^2 \Delta^{2r_k+1} n^{-2r_k}$, and the overall signal strength parameter to be $\rho = \min_{k=1,\ldots,K} \rho_k$.

The signal strength defined in Definition 2 is in line with those used in other change point detection problems. More discussions will be provided after we state the main result in Theorem 3.
Theorem 3.
Let the data $\{y_i\}_{i=1}^n$ satisfy Assumption 1. Let $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ and $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$ be the initial and final estimators of Algorithm 1, with inputs $\{y_i\}_{i=1}^n$ and tuning parameter $\lambda$. If
$$\lambda = c_{\mathrm{noise}} K \sigma^2 \log(n) \quad \text{and} \quad \rho \geq c_{\mathrm{signal}} \lambda, \quad (10)$$
then we have that $\mathbb{P}\{\mathcal{E}\} \geq 1 - n^{-c_{\mathrm{prob}}}$, where
$$\mathcal{E} = \Biggl\{ \widehat{K} = K, \; |\widetilde{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{K n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)} \text{ and } |\widehat{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}, \; \forall k \in \{1, \ldots, K\} \Biggr\}.$$
The constants $c_{\mathrm{prob}}, c_{\mathrm{noise}}, c_{\mathrm{signal}}, c_{\mathrm{error}} > 0$ are all absolute constants.

Remark 3 (Tracking constants). All the absolute constants $c_{\mathrm{prob}}, c_{\mathrm{noise}}, c_{\mathrm{signal}}, c_{\mathrm{error}}$ can be tracked in the proofs, although we do not claim their optimality. The hierarchy of the constants is as follows. We first determine the constant $c_{\mathrm{prob}} > 0$, which only depends on the maximum degree $r$. Given $c_{\mathrm{prob}}$, we can determine $c_{\mathrm{noise}}$, which only depends on $c_{\mathrm{prob}}$. With $c_{\mathrm{prob}}$ and $c_{\mathrm{noise}}$ at hand, we can determine $c_{\mathrm{signal}} > 0$. Lastly, the constant $c_{\mathrm{error}} > 0$ depends on $c_{\mathrm{signal}}$, $c_{\mathrm{noise}}$ and $c_{\mathrm{prob}}$. We note that the larger $c_{\mathrm{signal}}$ is, the smaller $c_{\mathrm{error}}$ is.

From Theorem 3 we can see that the final estimators $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$ improve upon the initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ by removing $K$, the dependence on the number of change points, from their localisation error upper bounds. It is possible that this $K$ term is merely an artefact of our current proof, in which case the second-step update would be unnecessary; see Section 3 for more on this issue. With our current proof technique, however, we do need the second-step update to obtain the improved localisation error bound.

As for each individual change point $\eta_k$, $k \in \{1, \ldots, K\}$, the localisation error of the final estimator $\widehat{\eta}_k$ is
$$|\widehat{\eta}_k - \eta_k| \lesssim n^{2r_k/(2r_k+1)} \Bigl\{ \frac{\sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}. \quad (11)$$
The upper bound in (11) is a decreasing function of $\kappa_k$. Under mild conditions, the dominating term in the upper bound in (11) is $n^{2r_k/(2r_k+1)}$, which is an increasing function of $r_k$. Recall Definition 1, where the jump is determined by the smallest possible degree. Together with (11), we can see that if we instead defined the jump in Definition 1 using, say, the largest possible degree, or other vector norms of the coefficient vectors' difference, then the corresponding localisation rate would inflate accordingly.

Let us consider a concrete case where $K = 1$, $r = 1$ and the only change point is $\eta$, with jump size $\kappa$. A question that can be asked now is the following: is it easier to estimate the change point location when the underlying $f(\cdot)$ is continuous at $\eta$ or discontinuous at $\eta$? Theorem 3 shows that it is easier when $f(\cdot)$ is discontinuous at $\eta$, under mild conditions, e.g. when $\sigma \kappa^{-1}$ is of order $O(1)$. In this particular case, the localisation error bound is of order
$$\frac{\sigma^2 \log(n)}{\kappa^2}$$
in the case of discontinuity, and of order
$$n^{2/3} \Bigl\{ \frac{\sigma^2 \log(n)}{\kappa^2} \Bigr\}^{1/3}$$
in the continuous case. The dependence of the localisation error bound on $r_k$ and $\kappa_k$ derived in Theorem 3 is nearly minimax rate-optimal up to logarithmic factors. This will follow from Lemma 4 in Section 2.3.

The condition (10) is a de facto signal-to-noise ratio condition. The tuning parameter $\lambda$ reflects the fluctuations of the noise, and it is set to be smaller than the signal strength $\rho$. A closer comparison unveils that we essentially require
$$\min_{k=1,\ldots,K} \kappa_k^2 \Delta^{2r_k+1} n^{-2r_k} \gtrsim K \sigma^2 \log(n). \quad (12)$$
This includes a wide range of parameter settings; we list a few situations here. For any two positive sequences $a_n, b_n$, we write $a_n \asymp b_n$ if $a_n/b_n$ stays bounded away from $0$ and $\infty$ as $n$ diverges.
• If $K, \sigma \asymp 1$ and $\Delta \asymp n$, then we allow the jump size $\kappa$ to be of order $\{\log(n)/n\}^{1/2}$, which vanishes as $n$ diverges.
• If $K, \sigma, \kappa \asymp 1$, then we allow the minimal spacing $\Delta \asymp \max_{k=1,\ldots,K} n^{2r_k/(2r_k+1)} \log^{1/(2r_k+1)}(n)$. If all $r_k = 0$, this means $\Delta \asymp \log(n)$; if all $r_k = 1$, this means $\Delta \asymp n^{2/3} \log^{1/3}(n)$.
• The number of change points $K$ is allowed to diverge, provided that
$$K \lesssim \{\sigma^2 \log(n)\}^{-1} \min_{k=1,\ldots,K} \kappa_k^2 \Delta^{2r_k+1} n^{-2r_k}.$$

2.3 Lower bounds

In this section, we aim to provide information-theoretic lower bounds characterising the fundamental difficulty of localising change points in the model defined in Assumption 1. In the change point analysis literature, in terms of localising the change point locations, there are two aspects of interest: one is the minimax lower bound on the localisation error and the other is on the signal strength. For simplicity, in this section, we assume that $K = 1$ and $r_1 = r$.

As for these two aspects, in Theorem 3 we show that, provided
$$\kappa^2 \Delta^{2r+1} n^{-2r} \gtrsim K \sigma^2 \log(n), \quad (13)$$
the output of Algorithm 1 has localisation error upper bounded by $\{n^{2r} \sigma^2 \log(n)/\kappa^2\}^{1/(2r+1)}$. In this section, we investigate the optimality of these results.

Lemma 4.
Under Assumption 1, assume that there exists one and only one change point and that $r_1 = r$. Let $P_{\kappa, \Delta, \sigma, r, n}$ denote the joint distribution of the data. Consider the class
$$\mathcal{Q} = \bigl\{ P_{\kappa, \Delta, \sigma, r, n} :\; \Delta < n/2, \; \kappa^2 \Delta^{2r+1} \geq \sigma^2 n^{2r} \zeta_n \bigr\},$$
for any diverging sequence $\{\zeta_n\}$. Then, for all $n$ large enough, it holds that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{Q}} \mathbb{E}_P\bigl( |\widehat{\eta} - \eta(P)| \bigr) \geq \max\Biggl\{ 1, \; \Bigl[ \frac{c n^{2r} \sigma^2}{\kappa^2} \Bigr]^{1/(2r+1)} \Biggr\},$$
where $\eta(P)$ is the location of the change point of the distribution $P$, the infimum is taken over all measurable functions of the data, $\widehat{\eta}$ is the estimated change point and $0 < c < \infty$ is an absolute constant.

Lemma 4 shows that the final estimators provided by Algorithm 1 are nearly optimal in terms of the localisation error, save for a logarithmic factor.
Lemma 5.
Under Assumption 1, assume that there exists one and only one change point and that $r_1 = r$. Let $P_{\kappa, \Delta, \sigma, r, n}$ denote the joint distribution of the data. For a small enough $\xi > 0$, consider the class
$$\mathcal{P} = \Biggl\{ P_{\kappa, \Delta, \sigma, r, n} :\; \Delta = \min\biggl( \Bigl\lfloor \bigl( \xi n^{2r} \kappa^{-2} \sigma^2 \bigr)^{1/(2r+1)} \Bigr\rfloor, \; n/3 \biggr) \Biggr\}.$$
Then we have
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl( |\widehat{\eta} - \eta(P)| \bigr) \geq cn,$$
where $\eta(P)$ is the location of the change point of the distribution $P$, the infimum is taken over all measurable functions of the data, $\widehat{\eta}$ is the estimated change point and $0 < c < \infty$ is a constant depending on $\xi$.

Lemma 5 shows that, if $\kappa^2 \Delta^{2r+1} n^{-2r} \lesssim \sigma^2$, then no algorithm is guaranteed to be consistent, in the sense that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Bigl( \frac{|\widehat{\eta} - \eta(P)|}{n} \Bigr) \gtrsim 1.$$
This means that, besides the logarithmic factor, Lemma 5 and Theorem 3 leave a gap in terms of $K$. To be specific, it remains unclear what results one would obtain if
$$\sigma^2 \lesssim \kappa^2 \Delta^{2r+1} n^{-2r} \lesssim K \sigma^2. \quad (14)$$
This gap only exists when we allow $K$ to diverge. We provide some conjectures in line with this discussion in Section 3.1.

3 Discussion

In this paper, we investigate change point localisation in piecewise polynomial signals. We allow for a general framework and provide individual localisation errors, associated with the individual smoothness at each change point. A two-step algorithm, consisting of solving a minimal partitioning problem and an updating step, is proposed. The outputs are shown to be nearly optimal, as supported by the information-theoretic lower bounds. To conclude this paper, we discuss some unresolved aspects of our work while comparing our results to some particularly relevant existing literature.
3.1 Comparison with Wang et al. (2020a)

Wang et al. (2020a) studied change point localisation in piecewise-constant signals. They studied the $\ell_0$-penalised least squares method and proved that it is nearly minimax optimal in terms of both the signal strength condition and the localisation error. In contrast, with our proof technique, we have been able to generalise this result to higher degree polynomials up to a factor depending on $K$, the number of true change points. This can be seen in the localisation error bound of our initial estimators in Theorem 3, and also in our required signal strength condition (12). In our paper, with general degree polynomials, the localisation near-optimality is secured via an extra updating step, and a gap remains between the upper and lower bounds for the required signal strength condition. This gap is not present if $K$ is assumed to be $O(1)$, but is present if $K$ is allowed to diverge.

We explain why the proof in Wang et al. (2020a) cannot be fully generalised to our setting. Recall the definition of $H(v, I)$ in (3), a residual sum of squares. In our analysis, a crucial role is played by the term
$$Q\{\mathbb{E}(y); I_1, I_2\} = H\{\mathbb{E}(y), I_1 \cup I_2\} - H\{\mathbb{E}(y), I_1\} - H\{\mathbb{E}(y), I_2\},$$
where $I_1, I_2$ are two contiguous intervals of $\{1, \ldots, n\}$. Ideally, one needs to be able to upper and lower bound $Q\{\mathbb{E}(y); I_1, I_2\}$ when $y$ is defined in (1) and its corresponding $f(\cdot)$ is one degree-$r$ polynomial on $I_1$ and another degree-$r$ polynomial on $I_2$. In the case $r = 0$, i.e. the piecewise-constant case, one can write the exact expression
$$Q\{\mathbb{E}(y); I_1, I_2\} = \frac{|I_1||I_2|}{|I_1| + |I_2|} \Biggl( |I_1|^{-1} \sum_{i \in I_1} \mathbb{E}(y_i) - |I_2|^{-1} \sum_{i \in I_2} \mathbb{E}(y_i) \Biggr)^2.$$
In addition, it holds that
$$\frac{1}{2}\min\{|I_1|, |I_2|\} \leq \frac{|I_1||I_2|}{|I_1| + |I_2|} \leq \min\{|I_1|, |I_2|\}.$$
Therefore, it follows that
$$\frac{1}{2}\min\{|I_1|, |I_2|\}\,\kappa^2 \leq Q\{\mathbb{E}(y); I_1, I_2\} \leq \min\{|I_1|, |I_2|\}\,\kappa^2, \quad (15)$$
where $\kappa$ represents the absolute difference between the values of $\mathbb{E}(y_i)$, $i \in I_1$, and $\mathbb{E}(y_i)$, $i \in I_2$.

For general $r$, by adopting an elegant result in Shen et al. (2020), one can in fact generalise (15) to obtain that
$$C_1 \frac{\min\{|I_1|^{2r+1}, |I_2|^{2r+1}\}}{n^{2r}}\, \kappa^2 \leq Q\{\mathbb{E}(y); I_1, I_2\} \leq C_2 \frac{\min\{|I_1|^{2r+1}, |I_2|^{2r+1}\}}{n^{2r}}\, \kappa^2, \quad (16)$$
where $0 < C_1 < C_2 < \infty$ are two absolute constants and $\kappa$ is the absolute difference of the $r$th degree coefficients of $\mathbb{E}(y)$ on $I_1$ and $I_2$. The problem, however, is that the constants $C_1$ and $C_2$ are not explicit: we can only show the existence of such constants. Even if we could track these two constants down, in order to generalise the argument of Wang et al. (2020a) we would still need to show that $C_1$ and $C_2$ are close enough. At this moment, it is not clear to us how to resolve this issue. We can only conjecture that, for all $r \in \mathbb{N}$, the $\ell_0$-penalised least squares method itself would be nearly optimal in terms of both the signal strength condition and the localisation error, and our second-step update would not be needed. From a practical point of view, our second step can be done in $O(n^2)$ time, which is negligible compared to the $O(n^3)$ time required to solve the penalised least squares problem. The computational overhead of our second step is thus minor.
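As a quick numerical sanity check of (15) (illustrative only, reusing the function H from the sketch in Section 1.1), consider a piecewise-constant signal with a jump of size $\kappa$ between two intervals of lengths 30 and 70; the computed $Q\{\mathbb{E}(y); I_1, I_2\}$ equals $|I_1||I_2|/(|I_1|+|I_2|)\,\kappa^2$ and falls inside the claimed bounds.

```python
import numpy as np

def Qval(theta, s, tau, e, r, n):
    """Q{theta; I1, I2} = H(theta, I1 u I2) - H(theta, I1) - H(theta, I2),
    with I1 = {s, ..., tau - 1} and I2 = {tau, ..., e}."""
    return (H(theta, s, e, r, n) - H(theta, s, tau - 1, r, n)
            - H(theta, tau, e, r, n))

n, kappa = 100, 1.5
theta = np.concatenate([np.zeros(30), kappa * np.ones(70)])   # E(y)
q = Qval(theta, s=1, tau=31, e=100, r=0, n=n)
lo, hi = 0.5 * 30 * kappa ** 2, 30 * kappa ** 2               # bounds in (15)
assert lo <= q <= hi    # here q = 30 * 70 / 100 * kappa^2 = 47.25
```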
3.2 Comparison with Fearnhead et al. (2019)

Fearnhead et al. (2019) showed that a penalised least squares method for change point localisation works well for piecewise-linear signals. This work inspired us to investigate piecewise polynomial signals of higher degrees. Even in the piecewise-linear case, there are some differences between our work and Fearnhead et al. (2019). The algorithm provided in Fearnhead et al. (2019) can be seen as solving a variant of the penalised least squares problem mentioned in this paper. In fact, the dynamic programming algorithm in Fearnhead et al. (2019) appears to be more sophisticated than what would be required to solve our problem, because their algorithm is tailored specifically for continuous piecewise-linear functions, and maintaining continuity makes the dynamic programme more involved. Translated into our notation, Fearnhead et al. (2019) assume $r_k = 1$ for all $k \in \{1, \ldots, K\}$. Our formulation is more general, as we do not impose continuity or any other kind of smoothness at the change points, and our estimator adapts near-optimally to the level of smoothness at the change points. The theoretical results in Fearnhead et al. (2019) are derived under the conditions $K, \sigma \asymp 1$. Under these conditions, translated into our notation, their results read: provided that $(\kappa/n)^2 \Delta^3 \gtrsim \log(n)$, the localisation error is $\log^{1/3}(n)\, (n/\kappa)^{2/3}$. Both are consistent with the results we have obtained in this paper.

3.3 Comparison with Raimondo (1998)

Raimondo (1998) studied the minimax rates of change point localisation in a nonparametric setting. The main focus there is how the minimax localisation rates change with $\alpha$, the degree of discontinuity in a Hölder sense. Due to its nonparametric nature, the class of functions considered in Raimondo (1998) is more general than the piecewise polynomial class we discuss here. However, the measures of regularity $r_k$ we define in Definition 1 are exactly the same as the parameter $\alpha$ in Raimondo (1998) if one restricts attention to polynomials. Having drawn this connection, and translated into our notation, Raimondo (1998) in fact shows that the minimax lower bound on the localisation error is of order
$$\bigl\{ n^{2r} \log^{\eta}(n) \bigr\}^{1/(2r+1)}, \quad \forall \eta > 0.$$
This is a lower bound for a larger class of functions than ours, but the dependence on $n$ is the same as ours up to a poly-logarithmic factor. Since Raimondo (1998) assumes all the other parameters to be of order $O(1)$, our minimax lower bounds add value, as they involve all the relevant problem parameters and not just the sample size $n$.

References
Bhattacharjee, M., Banerjee, M. and Michailidis, G. (2018). Change point estimation in a dynamic stochastic block model. arXiv preprint arXiv:1812.03090.

Chan, H. P. and Walther, G. (2013). Detection with the scan and the average likelihood ratio. Statistica Sinica.

Chatterjee, S. and Goswami, S. (2019). Adaptive estimation of multivariate piecewise polynomials and bounded variation functions by optimal decision trees. arXiv preprint arXiv:1911.11562.

Cribben, I. and Yu, Y. (2017). Estimating whole-brain dynamics by using spectral clustering. Journal of the Royal Statistical Society: Series C (Applied Statistics).

Dette, H., Pan, G. M. and Yang, Q. (2018). Estimating a change point in a sequence of very high-dimensional covariance matrices. arXiv preprint.

Dümbgen, L. and Spokoiny, V. G. (2001). Multiscale testing of qualitative hypotheses. Annals of Statistics.

Dümbgen, L. and Walther, G. (2008). Multiscale inference about a density. The Annals of Statistics.

Fearnhead, P., Maidstone, R. and Letchford, A. (2019). Detecting changes in slope with an L0 penalty. Journal of Computational and Graphical Statistics.

Fearnhead, P. and Rigaill, G. (2018). Changepoint detection in the presence of outliers. Journal of the American Statistical Association.

Frick, K., Munk, A. and Sieling, H. (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Friedrich, F., Kempe, A., Liebscher, V. and Winkler, G. (2008). Complexity penalized M-estimation: fast computation. Journal of Computational and Graphical Statistics.

Garreau, D. and Arlot, S. (2018). Consistent change-point detection with kernels. Electronic Journal of Statistics.

Green, P. J. and Silverman, B. W. (1993). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press.

Guntuboyina, A., Lieu, D., Chatterjee, S. and Sen, B. (2020). Adaptive risk bounds in univariate total variation denoising and trend filtering. The Annals of Statistics.

Jeng, X. J., Cai, T. T. and Li, H. (2012). Simultaneous discovery of rare and common segment variants. Biometrika.

Killick, R., Fearnhead, P. and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association.

Li, H., Guo, Q. and Munk, A. (2017). Multiscale change-point segmentation: beyond step functions. arXiv preprint arXiv:1708.03942.

Mammen, E. and van de Geer, S. (1997). Locally adaptive regression splines. The Annals of Statistics.

Padilla, O. H. M., Yu, Y., Wang, D. and Rinaldo, A. (2019a). Optimal nonparametric change point detection and localization. arXiv preprint arXiv:1905.10019.

Padilla, O. H. M., Yu, Y., Wang, D. and Rinaldo, A. (2019b). Optimal nonparametric multivariate change point detection and localization. arXiv preprint arXiv:1910.13289.

Raimondo, M. (1998). Minimax estimation of sharp change points. Annals of Statistics.

Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability.

Shen, Y., Han, Q. and Han, F. (2020). On a phase transition in general order spline regression. arXiv preprint arXiv:2004.10922.

Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics.

Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.

Wahba, G. (1990). Spline Models for Observational Data. SIAM.

Wang, D., Yu, Y. and Rinaldo, A. (2017). Optimal covariance change point localization in high dimension. arXiv preprint arXiv:1712.09912.

Wang, D., Yu, Y. and Rinaldo, A. (2018). Optimal change point detection and localization in sparse dynamic networks. arXiv preprint arXiv:1809.09602.

Wang, D., Yu, Y. and Rinaldo, A. (2020a). Univariate mean change point detection: penalization, CUSUM and optimality. Electronic Journal of Statistics.

Wang, D., Yu, Y., Rinaldo, A. and Willett, R. (2019). Localizing changes in high-dimensional vector autoregressive processes. arXiv preprint arXiv:1909.06359.

Wang, D., Yu, Y. and Willett, R. (2020b). Detecting abrupt changes in high-dimensional self-exciting Poisson processes. arXiv preprint arXiv:2006.03572.

Wang, T. and Samworth, R. J. (2018). High-dimensional changepoint estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam. Springer, 423-435.

Appendices
We include all the proofs in the Appendices. Some preparatory results are provided in Appendix A. Appendix B contains the proof of Theorem 3. The lower bound results, Lemmas 4 and 5, are proved in Appendix C.
A Preparatory Results
The following notation will be used throughout the proofs. For any $I = \{s, \ldots, e\} \subset \{1, \ldots, n\}$, recall the projection matrix $P_I$ defined in (4) using the matrix $U_{I,r}$ defined in (5). We recall the notation
$$H(v, I) = \|v_I\|^2 - \|P_I v_I\|^2 = \|v_I - P_I v_I\|^2,$$
for any vector $v \in \mathbb{R}^n$, where $v_I = (v_i, i \in I)^\top \in \mathbb{R}^{|I|}$. For any contiguous intervals $I, J \subset \{1, \ldots, n\}$ and any vector $v \in \mathbb{R}^n$, define
$$Q(v; I, J) = H(v, I \cup J) - H(v, I) - H(v, J) = \|P_I v_I\|^2 + \|P_J v_J\|^2 - \|P_{I \cup J} v_{I \cup J}\|^2.$$

Lemma 6.
Let $I$ be any nonempty interval subset of $\{1, \ldots, n\}$. For any $k \in \{1, \ldots, |I|\}$ and any partition of $I$, $I = \cup_{l=1}^{k} I_l$, satisfying $I_s \cap I_u = \emptyset$ for any $s, u \in \{1, \ldots, k\}$ with $s \neq u$, it holds for any vector $v \in \mathbb{R}^n$ that
$$H(v, I) \geq \sum_{l=1}^{k} H(v, I_l).$$

Proof.
The claim holds due to the fact that
$$H(v, I) = \|v_I - P_I v_I\|^2 = \sum_{l=1}^{k} \|v_{I_l} - (P_I v_I)_{I_l}\|^2 \geq \sum_{l=1}^{k} \|v_{I_l} - P_{I_l} v_{I_l}\|^2.$$

Lemma 7.
Let $y \in \mathbb{R}^n$ satisfy $y = \theta + \varepsilon$ with $\mathbb{E}(y) = \theta$. Let $I, J$ be two contiguous interval subsets of $\{1, \ldots, n\}$. It holds that
$$\sqrt{Q(y; I, J)} \geq \Bigl| \sqrt{Q(\theta; I, J)} - \sqrt{Q(\varepsilon; I, J)} \Bigr|.$$

Proof.
First observe that $Q(y; I, J)$ is a quadratic form in $y$. Moreover, it is a positive semidefinite quadratic form, as $Q(y; I, J) \geq 0$ for any $y \in \mathbb{R}^n$ by Lemma 6. Therefore, we can write $Q(y; I, J) = y^\top A y$ for a positive semidefinite matrix $A \in \mathbb{R}^{n \times n}$. Denoting by $A^{1/2}$ the square root matrix of $A$, satisfying $A^{1/2} A^{1/2} = A$, we can write $Q(y; I, J) = \|A^{1/2} y\|^2$. It then holds that
$$\sqrt{Q(y; I, J)} = \|A^{1/2}(\theta + \varepsilon)\| \geq \max\bigl\{ \|A^{1/2}\theta\| - \|A^{1/2}\varepsilon\|, \; \|A^{1/2}\varepsilon\| - \|A^{1/2}\theta\| \bigr\},$$
which leads to the final claim.

Lemma 8 (Lemma E.1 in Shen et al. (2020)). There exists an absolute constant $c_{\mathrm{poly}}$ depending only on $r$ such that, for any integers $n$ and $m$ satisfying
$$r + 1 \leq m \leq n \quad (17)$$
and any real sequence $\{a_\ell\}_{\ell=0}^{r}$,
$$\sum_{i=1}^{m} \Bigl[ a_0 + a_1 \Bigl(\frac{i}{n}\Bigr) + \cdots + a_r \Bigl(\frac{i}{n}\Bigr)^r \Bigr]^2 \geq c_{\mathrm{poly}} \max_{d=0,\ldots,r} \frac{a_d^2 m^{2d+1}}{n^{2d}}.$$

Lemma 8 is a direct consequence of Lemma E.1 in Shen et al. (2020); we omit its proof here.

Proposition 9.
Let $I = \{s, \ldots, \tau - 1\}$ and $J = \{\tau, \ldots, e\}$ be two contiguous interval subsets of $\{1, \ldots, n\}$ such that $\min\{|I|, |J|\} \geq r + 1$. Let $\theta = (\theta_i, i = 1, \ldots, n)^\top \in \mathbb{R}^n$ be a piecewise discretised polynomial, i.e. $\theta_i = f(i/n)$, where $f(\cdot)$ is a polynomial of degree at most $r$ on $[s/n, \tau/n)$ and a polynomial of degree at most $r$ on $[\tau/n, e/n)$. Let $\theta_{I \cup J}$, $\theta$ restricted on $I \cup J$, be reparametrised as
$$\theta_i = \begin{cases} \sum_{l=0}^{r} a_l (i/n - \tau/n)^l, & i \in I, \\ \sum_{l=0}^{r} b_l (i/n - \tau/n)^l, & i \in J. \end{cases}$$
Let $d = \min\{l = 0, \ldots, r :\; a_l \neq b_l\}$. Then there exists an absolute constant $c_{\mathrm{poly}}$ depending only on $r$ such that
$$Q(\theta; I, J) \geq c_{\mathrm{poly}} (a_d - b_d)^2 \frac{\min\{|I|^{2d+1}, |J|^{2d+1}\}}{n^{2d}}.$$

Proof.
For any fixed $d \in \{0, \ldots, r\}$ and any $\kappa > 0$, let
$$\mathcal{A}_d = \Biggl\{ v \in \mathbb{R}^{|I \cup J|} :\; \exists \{c_{1,l}, c_{2,l},\; l = 0, \ldots, r\} \subset \mathbb{R} \text{ s.t. } v_i = \begin{cases} \sum_{l=0}^{r} c_{1,l} (i/n - \tau/n)^l, & i \in I, \\ \sum_{l=0}^{r} c_{2,l} (i/n - \tau/n)^l, & i \in J, \end{cases} \text{ and } |c_{1,d} - c_{2,d}| \geq \kappa \Biggr\}. \quad (18)$$
In words, $\mathcal{A}_d$ is the set of vectors which are discretised polynomials of degree at most $r$ on the interval $I/n$ and (different) polynomials of degree at most $r$ on the interval $J/n$, with the $d$th order coefficients at least $\kappa$ apart.

For $v \in \mathcal{A}_d$, since $v$ is a discretised polynomial on $I/n$ and on $J/n$ separately, we have $H(v, I) = H(v, J) = 0$ and hence
$$Q(v; I, J) = \|v_{I \cup J} - P_{I \cup J} v_{I \cup J}\|^2.$$
In addition, we claim that
$$\min_{v \in \mathcal{A}_d} \|v_{I \cup J} - P_{I \cup J} v_{I \cup J}\|^2 = \min_{v \in \mathcal{A}_d} \|v_{I \cup J}\|^2.$$
This is due to the following. Since orthogonal projections cannot increase the $\ell_2$ norm, the left-hand side is at most the right-hand side. As for the other direction, observe that the vector $v_{I \cup J} - P_{I \cup J} v_{I \cup J}$ also belongs to the set $\mathcal{A}_d$, because subtracting one common polynomial of degree at most $r$ from both pieces leaves the coefficient gap $|c_{1,d} - c_{2,d}|$ unchanged.

It now suffices to lower bound $\min_{v \in \mathcal{A}_d} \|v_{I \cup J}\|^2$. For any $v \in \mathcal{A}_d$, it holds that
$$\|v_{I \cup J}\|^2 = \|v_I\|^2 + \|v_J\|^2 \geq \frac{c_{\mathrm{poly}}}{n^{2d}} \bigl( c_{1,d}^2 |I|^{2d+1} + c_{2,d}^2 |J|^{2d+1} \bigr) \geq \frac{c_{\mathrm{poly}}}{n^{2d}} \frac{\kappa^2}{4} \min\{|I|^{2d+1}, |J|^{2d+1}\},$$
where $c_{1,d}$ and $c_{2,d}$ are the $d$th order coefficients of $v$ as defined in (18), the first inequality is due to Lemma 8, and the second inequality follows from the fact that $|c_{1,d} - c_{2,d}| \geq \kappa$ implies $\max\{|c_{1,d}|, |c_{2,d}|\} \geq \kappa/2$. Applying the above with $d$ as in the statement and $\kappa = |a_d - b_d|$, and absorbing the factor $1/4$ into $c_{\mathrm{poly}}$, completes the proof.

Lemma 10 (High probability event). Under Assumption 1, there exists an absolute constant $c_{\mathrm{prob}} > 0$ depending on $r$, and an absolute constant $c_{\mathrm{noise}} > 0$ depending only on $c_{\mathrm{prob}}$, such that
$$\mathbb{P}\Biggl\{ \max_{\substack{I = [s, e] \\ 1 \leq s \leq e \leq n}} \varepsilon_I^\top P_I \varepsilon_I \leq c_{\mathrm{noise}} \sigma^2 \log(n) \Biggr\} \geq 1 - n^{-c_{\mathrm{prob}}}.$$

Proof. For any interval $I \subset \{1, \ldots, n\}$, there exists an absolute positive constant $c > 0$ such that, for any $t > 0$,
$$\mathbb{P}\bigl\{ \varepsilon_I^\top P_I \varepsilon_I - \mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I) \geq t \bigr\} \leq \exp\Biggl[ -c \min\Biggl\{ \frac{t^2}{\sigma^4 \|P_I\|_{\mathrm{F}}^2}, \; \frac{t}{\sigma^2 \|P_I\|_{\mathrm{op}}} \Biggr\} \Biggr],$$
which is due to the Hanson-Wright inequality (e.g. Theorem 1.1 in Rudelson and Vershynin, 2013). Since $P_I$ is a rank-$(r+1)$ orthogonal projection matrix, we have $\|P_I\|_{\mathrm{F}}^2 = r + 1$ and $\|P_I\|_{\mathrm{op}} = 1$. Then
$$\mathbb{P}\bigl\{ \varepsilon_I^\top P_I \varepsilon_I - \mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I) \geq t \bigr\} \leq \exp\Biggl[ -c \min\Biggl\{ \frac{t^2}{\sigma^4 (r+1)}, \; \frac{t}{\sigma^2} \Biggr\} \Biggr].$$
In addition, we have that $\mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I) \leq \mathrm{tr}(P_I) \max_{i=1,\ldots,n} \mathbb{E}(\varepsilon_i^2) \leq (r+1)\sigma^2$. For a large enough absolute constant $C > 0$, letting $t = C\sigma^2 \log(n)$ and applying a union bound argument over all possible $I$ (of which there are at most $n^2$), we obtain the claimed bound.

B Proof of Theorem 3

In this section, we provide the proof of Theorem 3. We first prove that, under an appropriate deterministic choice of the tuning parameter $\lambda$ and some deterministic conditions on the other parameters, the desired localisation error is attained. We then conclude the proof using Lemma 10, under which all these required conditions hold. For any $\tau > 0$, define the event
$$\mathcal{M}(\tau) = \Biggl\{ \max_{\substack{I = [s, e] \\ 1 \leq s \leq e \leq n}} \varepsilon_I^\top P_I \varepsilon_I \leq \tau \Biggr\}. \quad (19)$$

Proof of Theorem 3. It follows from Lemma 10 that
$$\mathbb{P}\bigl[ \mathcal{M}\{c_{\mathrm{noise}} \sigma^2 \log(n)\} \bigr] \geq 1 - n^{-c_{\mathrm{prob}}},$$
where $\mathcal{M}(\cdot)$ is defined in (19). On the event $\mathcal{M}\{c_{\mathrm{noise}} \sigma^2 \log(n)\}$, it follows from Proposition 11 that $\widehat{K} = K$ and
$$|\widetilde{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{K n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}, \quad \forall k \in \{1, \ldots, K\}.$$
In addition, due to (10), it holds that $\max_{k=1,\ldots,K} |\widetilde{\eta}_k - \eta_k| < \Delta/5$. Then it follows from Lemma 17 that
$$|\widehat{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}, \quad \forall k \in \{1, \ldots, K\}.$$
This completes the proof.

B.1 The initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$

The following proposition is our main intermediate result used to prove Theorem 3.

Proposition 11. Let the data $\{y_i\}_{i=1}^n$ satisfy Assumption 1. Let $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ be the initial estimators of Algorithm 1, with inputs $\{y_i\}_{i=1}^n$ and tuning parameter $\lambda$. On the event $\mathcal{M}(\tau)$ defined in (19), for any $\tau > 0$, let
$$\lambda > (4K + 5)\tau. \quad (20)$$
Assume that
$$\rho > C_r \max\{\lambda, r+1\}, \quad (21)$$
where $C_r > 0$ is a large enough constant depending only on $r$. Then, for any $k \in \{1, \ldots, K\}$, there exists an absolute constant $c > 0$ such that
$$|\widetilde{\eta}_k - \eta_k| \leq c \Bigl( \frac{n^{2r_k} \max\{\lambda, r+1\}}{\kappa_k^2} \Bigr)^{1/(2r_k+1)}.$$

Remark 4. Note that Proposition 11 is a completely deterministic result.
In particular, no probabilistic assumption is needed on the noise variables. The constants in the proposition can be tracked explicitly in the proof, although they are not optimal in any sense; they serve mainly to emphasise the deterministic nature of the result and to aid in understanding the relative choices of the different problem parameters.

Proof of Proposition 11. We will show that:
(a) any $I = [s, e) \in \widehat{\Pi}$ contains no more than two true change points;
(b) for any two consecutive intervals $I, J \in \widehat{\Pi}$, the interval $I \cup J$ contains at least one true change point;
(c) for any $I = [s, e) \in \widehat{\Pi}$, if there are exactly two true change points contained in $I$, i.e. $\eta_{k-1} < s \leq \eta_k < \eta_{k+1} < e \leq \eta_{k+2}$, then
$$\eta_k - s \leq c \Bigl( \frac{n^{2r_k} \max\{\lambda, r+1\}}{\kappa_k^2} \Bigr)^{1/(2r_k+1)} \quad \text{and} \quad e - \eta_{k+1} \leq c \Bigl( \frac{n^{2r_{k+1}} \max\{\lambda, r+1\}}{\kappa_{k+1}^2} \Bigr)^{1/(2r_{k+1}+1)};$$
(d) for any $I = [s, e) \in \widehat{\Pi}$, if there is exactly one true change point contained in $I$, i.e. $\eta_{k-1} < s \leq \eta_k < e \leq \eta_{k+1}$, then
$$\min\{\eta_k - s, \; e - \eta_k\} \leq c \Bigl( \frac{n^{2r_k} \max\{\lambda, r+1\}}{\kappa_k^2} \Bigr)^{1/(2r_k+1)};$$
(e) $|\widehat{\Pi}| = |\Pi^*|$, i.e. $\widehat{K} = K$, where $\Pi^*$ denotes the true partition induced by $\{\eta_k\}_{k=1}^{K}$.
To illustrate the common argument, consider the situation in (d). Let $I_1 = [s, \eta_k)$, $I_2 = [\eta_k, e)$ and $\widetilde{\Pi} = \widehat{\Pi} \cup \{I_1, I_2\} \setminus \{I\}$. It holds that
$$0 \leq G(\widetilde{\Pi}, \lambda) - G(\widehat{\Pi}, \lambda) = \lambda - \{H(y, I) - H(y, I_1) - H(y, I_2)\} = \lambda - Q(y; I_1, I_2)$$
$$\leq \lambda - \Bigl| \sqrt{Q(\theta; I_1, I_2)} - \sqrt{Q(\varepsilon; I_1, I_2)} \Bigr|^2 \leq \lambda - Q(\theta; I_1, I_2)/2 + Q(\varepsilon; I_1, I_2) \leq \lambda + 6\tau - Q(\theta; I_1, I_2)/2,$$
where the first inequality follows from the definition of $\widehat{\Pi}$, the second inequality follows from Lemma 7, the third inequality follows from $(a - b)^2 \geq a^2/2 - b^2$ for any $a, b \in \mathbb{R}$, and the last follows from the definition of the event $\mathcal{M}(\tau)$. Applying Proposition 9, we conclude that
$$c_{\mathrm{poly}} \frac{\kappa_k^2}{n^{2r_k}} \min\{|I_1|^{2r_k+1}, |I_2|^{2r_k+1}\} \leq \max\{12\tau + 2\lambda, \; r+1\},$$
where the $r+1$ branch covers the case $\min\{|I_1|, |I_2|\} \leq r$, in which Proposition 9 is not applicable.

Lemma 16 (Part (e) in the proof of Proposition 11). Under all the assumptions in Proposition 11, assuming that $|\Pi^*| \leq |\widehat{\Pi}| \leq 3|\Pi^*|$, it holds that $|\widehat{\Pi}| = |\Pi^*|$.

Proof. To ease notation, for any interval partition $\mathcal{P}$ of $\{1, \ldots, n\}$ and any $v \in \mathbb{R}^n$, we let
$$S(v, \mathcal{P}) = \sum_{I \in \mathcal{P}} H(v, I).$$
Using this notation, we first note that $\|\varepsilon\|^2 \geq S(y, \Pi^*)$, since for any $I \in \Pi^*$, $H(y, I) = H(\varepsilon, I) \leq \|\varepsilon_I\|^2$. Let $\widehat{\Pi} \cap \Pi^*$ be the intersection of the partitions $\widehat{\Pi}$ and $\Pi^*$, i.e. the partition formed by all nonempty pairwise intersections of their members. It then holds that
$$\|\varepsilon\|^2 + (K+1)\lambda \geq S(y, \Pi^*) + (K+1)\lambda \geq S(y, \widehat{\Pi}) + (\widehat{K}+1)\lambda \geq S(y, \widehat{\Pi} \cap \Pi^*) + (\widehat{K}+1)\lambda, \quad (22)$$
where the second inequality follows from the definition of $\widehat{\Pi}$ and the last inequality is due to Lemma 6. On the other hand, we have that
$$\|\varepsilon\|^2 - S(y, \widehat{\Pi} \cap \Pi^*) = \|\varepsilon\|^2 - S(\varepsilon, \widehat{\Pi} \cap \Pi^*) \leq (\widehat{K} + K + 2)\tau, \quad (23)$$
where the identity is due to the fact that $\theta$ is a polynomial of degree at most $r$ on every member of $\widehat{\Pi} \cap \Pi^*$, and the inequality holds on the event $\mathcal{M}(\tau)$, noticing that $|\widehat{\Pi} \cap \Pi^*| \leq \widehat{K} + K + 2$. Combining (22) and (23), we have that
$$(\widehat{K} - K)\lambda \leq (\widehat{K} + K + 2)\tau \leq (4K + 5)\tau,$$
where the last inequality is due to $|\widehat{\Pi}| \leq 3|\Pi^*|$. Since we also have $|\widehat{\Pi}| \geq |\Pi^*|$, the last display implies that $\widehat{K} = K$, for otherwise it contradicts (20).

B.2 The final estimators $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$

The following lemma shows that the update step in Algorithm 1 can significantly improve the initial estimators.

Lemma 17. Let the data $\{y_i\}_{i=1}^n$ satisfy Assumption 1. For any set $\{\nu_k\}_{k=1}^{K}$ satisfying $\max_{k=1,\ldots,K} |\nu_k - \eta_k| \leq \Delta/5$, with $\nu_0 = 1$ and $\nu_{K+1} = n+1$, define
$$s_k = \lfloor \nu_{k-1}/2 \rfloor + \lfloor \nu_k/2 \rfloor, \quad e_k = \lfloor \nu_k/2 \rfloor + \lfloor \nu_{k+1}/2 \rfloor \quad \text{and} \quad I_k = [s_k, e_k), \quad \forall k \in \{1, \ldots, K\}.$$
Let
$$\widehat{\eta}_k = \operatorname*{argmin}_{t \in I_k \setminus \{s_k\}} \bigl\{ H(y, [s_k, t)) + H(y, [t, e_k)) \bigr\}, \quad \forall k \in \{1, \ldots, K\}.$$
For any $\tau > 0$, if
$$\rho > C_r' \, 10^{2r+1} \tau, \quad (24)$$
for a large enough constant $C_r' > 0$ depending only on $r$, then on the event $\mathcal{M}(\tau)$ it holds that, for an absolute constant $c > 0$,
$$|\widehat{\eta}_k - \eta_k| \leq c \Bigl\{ \frac{n^{2r_k} \tau}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}.$$

Proof. For any $k \in \{1, \ldots, K\}$, suppressing the integer roundings in the definitions of $s_k$ and $e_k$ for readability, we have that
$$\eta_k - s_k = (\eta_k - \nu_k) + \frac{\nu_k - \eta_k}{2} + \frac{\eta_k - \eta_{k-1}}{2} + \frac{\eta_{k-1} - \nu_{k-1}}{2} \geq -\frac{\Delta}{5} - \frac{\Delta}{10} + \frac{\Delta}{2} - \frac{\Delta}{10} = \frac{\Delta}{10},$$
and
$$s_k - \eta_{k-1} = \frac{\nu_k - \eta_k}{2} + \frac{\eta_k - \eta_{k-1}}{2} + \frac{\eta_{k-1} - \nu_{k-1}}{2} + (\nu_{k-1} - \eta_{k-1}) \geq -\frac{\Delta}{10} + \frac{\Delta}{2} - \frac{\Delta}{10} - \frac{\Delta}{5} = \frac{\Delta}{10}.$$
Using identical arguments, it also holds that $\min\{e_k - \eta_k, \; \eta_{k+1} - e_k\} \geq \Delta/10$. Without loss of generality, assume $s_k < \widehat{\eta}_k < \eta_k < e_k$. Let $J_1 = [s_k, \widehat{\eta}_k)$, $J_2 = [\widehat{\eta}_k, \eta_k)$ and $J_3 = [\eta_k, e_k)$. By the definition of $\widehat{\eta}_k$, we have that
$$H(y, J_1 \cup J_2) + H(y, J_3) \geq H(y, J_1) + H(y, J_2 \cup J_3) = H(y, J_1) + H(y, J_2) + H(y, J_3) + Q(y; J_2, J_3).$$
We then have that
$$c_{\mathrm{poly}} \frac{\kappa_k^2}{n^{2r_k}} \min\{|J_2|^{2r_k+1}, |J_3|^{2r_k+1}\} \leq Q(y; J_2, J_3) \leq Q(y; J_1, J_2) = Q(\varepsilon; J_1, J_2) \leq 2\tau,$$
where the first inequality is due to Proposition 9, the second inequality follows by rearranging the previous display, the identity holds since $\theta$ is one polynomial of degree at most $r$ on $J_1 \cup J_2 \subset [\eta_{k-1}, \eta_k)$, and the last inequality holds on the event $\mathcal{M}(\tau)$. Since $|J_3| \geq \Delta/10$, the final claim follows from (24).

C Proofs of Lemmas 4 and 5

Proof of Lemma 4. Let $P_0$ denote the joint distribution of the independent random variables $\{y_i\}_{i=1}^n$ such that
$$y_i \sim \begin{cases} \mathcal{N}(0, \sigma^2), & i \in \{1, \ldots, \Delta\}, \\ \mathcal{N}\bigl(\kappa (i/n - \Delta/n)^r, \sigma^2\bigr), & i \in \{\Delta+1, \ldots, n\}. \end{cases}$$
Let $P_1$ denote the joint distribution of the independent random variables $\{z_i\}_{i=1}^n$ such that
$$z_i \sim \begin{cases} \mathcal{N}(0, \sigma^2), & i \in \{1, \ldots, \Delta + \delta\}, \\ \mathcal{N}\bigl(\kappa (i/n - \Delta/n)^r, \sigma^2\bigr), & i \in \{\Delta + \delta + 1, \ldots, n\}, \end{cases}$$
where $\delta$ is a positive integer no larger than $n - \Delta - 1$.

As for $P_0$, it is easy to see that
$$\mathbb{E}(y_i) = \begin{cases} 0, & i \in \{1, \ldots, \Delta\}, \\ \kappa \{(i - \Delta)/n\}^r, & i \in \{\Delta+1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_0$ satisfies $\eta(P_0) = \Delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $r$, and the jump size $\kappa_1 = \kappa$.

As for $P_1$, we have that
$$\mathbb{E}(z_i) = \begin{cases} 0, & i \in \{1, \ldots, \Delta + \delta\}, \\ \kappa \Bigl\{\frac{i - \Delta}{n}\Bigr\}^r = \kappa \sum_{l=0}^{r} \binom{r}{l} \Bigl( \frac{i - \Delta - \delta}{n} \Bigr)^{r-l} \Bigl( \frac{\delta}{n} \Bigr)^l, & i \in \{\Delta + \delta + 1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_1$ satisfies $\eta(P_1) = \Delta + \delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $0$, and the jump size $\kappa_1 = \kappa (\delta/n)^r$.

It then follows from Le Cam's lemma (e.g. Yu, 1997), a standard reduction of estimation to two-point testing, and Lemma 2.6 in Tsybakov (2009), a form of Pinsker's inequality, that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{Q}} \mathbb{E}_P(|\widehat{\eta} - \eta|) \geq \frac{\delta}{2} \bigl( 1 - d_{\mathrm{TV}}(P_0, P_1) \bigr) \geq \frac{\delta}{2} \Biggl( 1 - \sqrt{\frac{\kappa^2}{4\sigma^2 n^{2r}} \sum_{i=\Delta+1}^{\Delta+\delta} (i - \Delta)^{2r}} \Biggr) = \frac{\delta}{2} \Biggl( 1 - \sqrt{\frac{\kappa^2}{4\sigma^2 n^{2r}} \sum_{i=1}^{\delta} i^{2r}} \Biggr)$$
$$\geq \frac{\delta}{2} \Biggl( 1 - \sqrt{\frac{\kappa^2}{4\sigma^2 n^{2r}} \int_{1}^{\delta+1} x^{2r} \, \mathrm{d}x} \Biggr) \geq \frac{\delta}{2} \Biggl\{ 1 - \sqrt{\frac{\kappa^2 (\delta+1)^{2r+1}}{4(2r+1) \sigma^2 n^{2r}}} \Biggr\} \geq \frac{\delta}{2} \Biggl\{ 1 - \sqrt{\frac{c_1 \kappa^2 \delta^{2r+1}}{\sigma^2 n^{2r}}} \Biggr\},$$
for an absolute constant $c_1 > 0$. We set
$$\delta = \max\Biggl\{ \Biggl\lfloor \biggl( \frac{c \sigma^2 n^{2r}}{\kappa^2} \biggr)^{1/(2r+1)} \Biggr\rfloor, \; 1 \Biggr\}$$
and complete the proof.

Proof of Lemma 5. Let $P_0$ denote the joint distribution of the independent random variables $\{y_i\}_{i=1}^n$ such that
$$y_i \sim \begin{cases} \mathcal{N}\bigl(\kappa (i/n - \Delta/n)^r, \sigma^2\bigr), & i \in \{1, \ldots, \Delta\}, \\ \mathcal{N}(0, \sigma^2), & i \in \{\Delta+1, \ldots, n\}. \end{cases}$$
Let $P_1$ denote the joint distribution of the independent random variables $\{z_i\}_{i=1}^n$ such that
$$z_i \sim \begin{cases} \mathcal{N}(0, \sigma^2), & i \in \{1, \ldots, n - \Delta\}, \\ \mathcal{N}\bigl(\kappa \{i/n - (n-\Delta)/n\}^r, \sigma^2\bigr), & i \in \{n - \Delta + 1, \ldots, n\}. \end{cases}$$
As for $P_0$, it is easy to see that
$$\mathbb{E}(y_i) = \begin{cases} \kappa \{(i - \Delta)/n\}^r, & i \in \{1, \ldots, \Delta\}, \\ 0, & i \in \{\Delta+1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_0$ satisfies $\eta(P_0) = \Delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $r$, and the jump size $\kappa_1 = \kappa$. As for $P_1$, it is easy to see that
$$\mathbb{E}(z_i) = \begin{cases} 0, & i \in \{1, \ldots, n - \Delta\}, \\ \kappa \{(i - n + \Delta)/n\}^r, & i \in \{n - \Delta + 1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_1$ satisfies $\eta(P_1) = n - \Delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $r$, and the jump size $\kappa_1 = \kappa$.

Since $\Delta \leq n/3$, we have $|\eta(P_0) - \eta(P_1)| = n - 2\Delta \geq n/3$, and it follows from Le Cam's lemma (e.g. Yu, 1997) and Lemma 2.6 in Tsybakov (2009) that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P(|\widehat{\eta} - \eta|) \geq \frac{n}{6} \bigl( 1 - d_{\mathrm{TV}}(P_0, P_1) \bigr) \geq \frac{n}{12} \exp\{ -\mathrm{KL}(P_0, P_1) \}.$$
Since both $P_0$ and $P_1$ are product measures, it holds that
$$\mathrm{KL}(P_0, P_1) = \frac{\kappa^2}{2\sigma^2} \Biggl\{ \sum_{i=1}^{\Delta} \Bigl( \frac{i - \Delta}{n} \Bigr)^{2r} + \sum_{i=n-\Delta+1}^{n} \Bigl( \frac{i - n + \Delta}{n} \Bigr)^{2r} \Biggr\} \leq \frac{\kappa^2}{\sigma^2} \sum_{i=1}^{\Delta} \Bigl( \frac{i}{n} \Bigr)^{2r} \leq \frac{\kappa^2}{\sigma^2 n^{2r}} \int_{1}^{\Delta+1} x^{2r} \, \mathrm{d}x \leq \frac{c_2 \kappa^2 \Delta^{2r+1}}{\sigma^2 n^{2r}} \leq c_2 \xi,$$
for an absolute constant $c_2 > 0$, where the last inequality follows from the definition of $\Delta$ in the class $\mathcal{P}$. Therefore
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P(|\widehat{\eta} - \eta|) \geq \frac{n}{12} e^{-c_2 \xi} = cn,$$
which completes the proof.