Localising change points in piecewise polynomials of general degrees
Yi Yu (Department of Statistics, University of Warwick) and Sabyasachi Chatterjee (Department of Statistics, University of Illinois Urbana-Champaign)

July 21, 2020
Abstract
In this paper we are concerned with a sequence of univariate random variables with piecewise-polynomial means and independent sub-Gaussian noise. The underlying polynomials are allowed to be of arbitrary but fixed degrees. We propose a two-step estimation procedure based on $\ell_0$-penalisation and provide upper bounds on the localisation error. We complement these results by deriving information-theoretic lower bounds, which show that our two-step estimators are nearly minimax rate-optimal. We also show that our estimator enjoys near-optimally adaptive performance, attaining individual localisation errors that depend on the level of smoothness at individual change points of the underlying signal.

1 Introduction

We are concerned with the model $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$, where, for each $i \in \{1, \ldots, n\}$,
$$y_i = f(i/n) + \varepsilon_i, \quad (1)$$
$f: [0, 1] \to \mathbb{R}$ is an unknown piecewise-polynomial function and the $\varepsilon_i$'s are independent mean-zero sub-Gaussian random variables. To be specific, associated with $f(\cdot)$ there is a sequence of strictly increasing integers $\{\eta_k\}_{k=0}^{K+1}$, with $\eta_0 = 1$ and $\eta_{K+1} = n + 1$, such that $f(\cdot)$ restricted to each interval $[\eta_k/n, \eta_{k+1}/n)$, $k = 0, \ldots, K$, is a polynomial of degree at most $r \in \mathbb{N}$. The maximum degree $r$ is assumed to be arbitrary but fixed, and the number of change points $K$ is allowed to diverge as the sample size $n$ grows unbounded. The goal of this paper is to estimate $\{\eta_k\}_{k=1}^{K}$, called the change points of $f(\cdot)$, accurately, and to understand the fundamental limits in detecting and localising these change points. An illustrative simulation of model (1) is given below.
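To fix ideas, the following is a minimal Python sketch of model (1). All concrete choices here (the sample size, the change point locations, the piecewise-linear coefficients and the Gaussian noise, one instance of sub-Gaussian noise) are hypothetical and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of model (1): n = 300, K = 2 change points at
# eta_1 = 101 and eta_2 = 201, degree r = 1 (piecewise linear), Gaussian noise.
n, sigma = 300, 0.5
eta = [1, 101, 201, n + 1]                      # eta_0 = 1, eta_{K+1} = n + 1
coefs = [(0.0, 1.0), (2.0, -1.0), (0.0, 0.5)]   # (intercept, slope) per segment

x = np.arange(1, n + 1) / n                     # design points i/n
f = np.empty(n)
for (s, e), (a0, a1) in zip(zip(eta[:-1], eta[1:]), coefs):
    f[s - 1:e - 1] = a0 + a1 * x[s - 1:e - 1]   # piece on [eta_k/n, eta_{k+1}/n)

y = f + sigma * rng.standard_normal(n)          # y_i = f(i/n) + epsilon_i
```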
The work in this paper falls within the general topic of change point analysis, which has a long history and is still being actively studied. In change point analysis, one assumes that the underlying distributions change at a set of unknown time points, called change points, and stay the same between two consecutive change points. A closely related problem is change point detection in piecewise-constant signals. This has been studied thoroughly in Chan and Walther (2013), Frick et al. (2014), Dümbgen and Spokoiny (2001), Dümbgen and Walther (2008), Li et al. (2017), Jeng et al. (2012) and Wang et al. (2020a), among others. Recently, Fearnhead et al. (2019) studied change point analysis in piecewise-linear signals. Our work can be seen as a generalisation of the aforementioned results, allowing for polynomials of arbitrary degrees. Detailed discussions regarding comparisons with Wang et al. (2020a) and Fearnhead and Rigaill (2018) will be provided after we present our main results.

Beyond univariate sequences, the existing work on change point analysis includes studies on high-dimensional models (e.g. Dette et al., 2018; Wang et al., 2017; Wang and Samworth, 2018), network models (e.g. Bhattacharjee et al., 2018; Cribben and Yu, 2017; Wang et al., 2018) and nonparametric models (e.g. Garreau and Arlot, 2018; Padilla et al., 2019a,b).

To divert slightly, it is worth mentioning that, instead of focusing on estimating the locations of the change points, a complementary problem is to estimate the whole underlying piecewise polynomial function itself. This is a canonical problem in nonparametric regression and also has a long history. The piecewise polynomial function is typically assumed to satisfy certain regularity at the change points. The classical settings assume that the degrees of the underlying polynomials take particular values and that the change points, referred to as knots, are at fixed locations; see e.g. Green and Silverman (1993) and Wahba (1990). More recent regression methods have focussed on fitting piecewise polynomials where the knots are not fixed beforehand but are estimated from the data (e.g. Guntuboyina et al., 2020; Mammen and van de Geer, 1997; Shen et al., 2020; Tibshirani, 2014).

In this paper, we focus on estimating the locations of the change points accurately, allowing for general and different degrees of polynomials within $f(\cdot)$, a diverging number of change points, and different smoothness at different change points. This framework, to the best of our knowledge, is the most flexible one in both the change point analysis and spline regression literatures. In the rest of this paper, we first formalise the problem and introduce the algorithm in Section 1.1, followed by a list of contributions in Section 1.2. The main results are collected in Section 2, with more discussions in Section 3 and the proofs in the Appendices.

1.1 The two-step estimation procedure

In order to estimate the change points of $f(\cdot)$, we propose a two-step estimator. The estimator is defined in this subsection, after introducing the necessary notation used throughout this paper.

Let $\Pi$ be any interval partition of $\{1, \ldots, n\}$, i.e. a collection of $|\Pi| \geq 1$ disjoint integer intervals covering $\{1, \ldots, n\}$,
$$\Pi = \bigl\{ \{1, \ldots, s_1 - 1\}, \{s_1, \ldots, s_2 - 1\}, \ldots, \{s_{|\Pi|-1}, \ldots, n\} \bigr\},$$
for some integers $1 = s_0 < s_1 < \cdots < s_{|\Pi|-1} \leq n < s_{|\Pi|} = n + 1$, with $|\cdot|$ denoting the cardinality of a set. For any such partition $\Pi$, we denote by $\eta(\Pi) = \{s_1, \ldots, s_{|\Pi|-1}\}$ its change points. Let $\mathcal{P}_n$ be the collection of all such interval partitions of $\{1, \ldots, n\}$.

For any fixed $\lambda > 0$ and data $y \in \mathbb{R}^n$, let the estimated partition be
$$\widehat{\Pi} \in \operatorname*{argmin}_{\Pi \in \mathcal{P}_n} G(\Pi, \lambda), \quad (2)$$
where
$$G(\Pi, \lambda) = \sum_{I \in \Pi} \|y_I - P_I y_I\|^2 + \lambda |\Pi| = \sum_{I \in \Pi} H(y, I) + \lambda |\Pi|, \quad (3)$$
with the notation therein introduced below.
• The norm $\|\cdot\|$ denotes the $\ell_2$-norm of a vector.
• For any interval $I = \{s, \ldots, e\} \subset \{1, \ldots, n\}$, let $y_I = (y_i, i \in I)^\top \in \mathbb{R}^{|I|}$ be the data vector on the interval $I$ and let $P_I$ be the projection matrix
$$P_I = U_{I,r} (U_{I,r}^\top U_{I,r})^{-1} U_{I,r}^\top, \quad (4)$$
with
$$U_{I,r} = \begin{pmatrix} 1 & s/n & \cdots & (s/n)^r \\ \vdots & \vdots & \ddots & \vdots \\ 1 & e/n & \cdots & (e/n)^r \end{pmatrix} \in \mathbb{R}^{(e-s+1) \times (r+1)}. \quad (5)$$

We can see that the loss function $G(\cdot, \cdot)$ is a penalised residual sum of squares. The penalty is imposed on the cardinality of the partition, which is in fact an $\ell_0$ penalisation. The residual sum of squares consists of the residuals after projecting the data onto the discrete polynomial space. The initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ are defined to be $\eta(\widehat{\Pi})$, the change points of $\widehat{\Pi}$. A sketch of computing $H(y, I)$ follows.
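The residual sum of squares $H(y, I)$ in (3) is easy to compute. The sketch below is our illustration under the reading of (3)-(5) above, not the authors' code: it fits the degree-$r$ polynomial by least squares, which gives the same value as explicitly forming the projection matrix $P_I$ in (4) but is numerically more stable. Later sketches reuse this function.

```python
import numpy as np

def H(y, s, e, r, n):
    """Residual sum of squares H(y, I) = ||y_I - P_I y_I||^2 for the interval
    I = {s, ..., e} (1-based, inclusive).  The columns of U are
    (1, i/n, ..., (i/n)^r), matching U_{I,r} in (5); solving the least squares
    problem is equivalent to projecting onto the column space of U."""
    i = np.arange(s, e + 1)
    U = np.vander(i / n, N=r + 1, increasing=True)
    yI = np.asarray(y)[s - 1:e]
    coef, *_ = np.linalg.lstsq(U, yI, rcond=None)
    return float(np.sum((yI - U @ coef) ** 2))
```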
With the estimated partition $\widehat{\Pi}$ and its associated change points $\eta(\widehat{\Pi})$, provided that $|\eta(\widehat{\Pi})| \geq 1$, we proceed to the second-step estimation. For any $k \in \{1, \ldots, \widehat{K}\}$, let
$$s_k = \lfloor \widetilde{\eta}_{k-1}/2 \rfloor + \lfloor \widetilde{\eta}_k/2 \rfloor, \quad e_k = \lfloor \widetilde{\eta}_k/2 \rfloor + \lfloor \widetilde{\eta}_{k+1}/2 \rfloor, \quad I_k = [s_k, e_k),$$
with $\widetilde{\eta}_0 = 1$ and $\widetilde{\eta}_{\widehat{K}+1} = n + 1$. For any $k \in \{1, \ldots, \widehat{K}\}$, we define
$$\widehat{\eta}_k = \operatorname*{argmin}_{t \in I_k \setminus \{s_k\}} \bigl\{ H(y, [s_k, t)) + H(y, [t, e_k)) \bigr\}, \quad (6)$$
where $H(\cdot, \cdot)$ is defined in (3). The updated estimators $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$ are our final estimators.

In summary, this two-step algorithm proceeds by solving the optimisation problem (2), providing a set of initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$. With the initial estimators at hand, a parallelisable second step works on every triplet $(\widetilde{\eta}_{k-1}, \widetilde{\eta}_k, \widetilde{\eta}_{k+1})$, $k \in \{1, \ldots, \widehat{K}\}$, to refine $\widetilde{\eta}_k$ and yield $\widehat{\eta}_k$. This update does not change the number of estimated change points. To facilitate later references to our two-step algorithm, we present the full procedure in Algorithm 1; a code sketch follows the algorithm.
Algorithm 1: Two-step estimation

INPUT: data $\{y_i\}_{i=1}^n$, tuning parameter $\lambda > 0$.
  $\widehat{\Pi} \leftarrow \operatorname*{argmin}_{\Pi \in \mathcal{P}_n} G(\Pi, \lambda)$    ⊲ see (3)
  $B \leftarrow \eta(\widehat{\Pi})$    ⊲ the initial estimators
  if $B \neq \emptyset$ then
    $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}} \leftarrow$ update $B$ based on (6)    ⊲ the final estimators
  end if
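The following is a minimal sketch of Algorithm 1 (our illustration, not the authors' implementation), reusing the function H from the previous sketch. The first step solves (2) by the standard dynamic programme for minimal partitioning discussed in Remark 1 below; the second step implements the refinement (6).

```python
import numpy as np

def two_step(y, r, lam):
    """Sketch of Algorithm 1: l_0-penalised partitioning (2) followed by the
    local refinement (6).  Assumes H(y, s, e, r, n) from the previous sketch."""
    n = len(y)
    # Step 1: dynamic programme.  B[e] is the minimal value of G over
    # partitions of {1, ..., e}; prev[e] stores the start of the last interval.
    B = np.full(n + 1, np.inf)
    B[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for e in range(1, n + 1):
        for s in range(1, e + 1):               # last interval {s, ..., e}
            cost = B[s - 1] + H(y, s, e, r, n) + lam
            if cost < B[e]:
                B[e], prev[e] = cost, s
    eta_tilde, e = [], n                        # backtrack to read off eta(Pi_hat)
    while e > 0:
        s = prev[e]
        if s > 1:
            eta_tilde.append(s)
        e = s - 1
    eta_tilde.sort()                            # initial estimators
    # Step 2: refine each initial estimator within I_k = [s_k, e_k), cf. (6).
    bounds = [1] + eta_tilde + [n + 1]
    eta_hat = []
    for k in range(1, len(bounds) - 1):
        s_k = bounds[k - 1] // 2 + bounds[k] // 2
        e_k = bounds[k] // 2 + bounds[k + 1] // 2
        eta_hat.append(min(
            range(s_k + 1, e_k),
            key=lambda t: H(y, s_k, t - 1, r, n) + H(y, t, e_k - 1, r, n)))
    return eta_tilde, eta_hat
```

On the simulated data above, `two_step(y, r=1, lam=...)` returns both sets of estimators; the $O(n^2)$ loop structure of the first step, with each evaluation of H costing $O(n)$, matches the $O(n^3)$ total cost discussed in Remark 1 below.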
We conclude this subsection with two remarks, on the optimisation problem (2) and on the polynomial degree input $r$, respectively.

Remark 1 (The optimisation problem (2)). The uniqueness of the solution of (2) is not guaranteed in general, but the properties we are to present regarding the change point estimators hold for any solution. In fact, under some mild conditions, for instance the existence of densities of the noise distribution, one can show that the minimiser of (2) is unique almost surely.

The optimisation problem (2), with a general loss function, is known as the minimal partitioning problem (e.g. Algorithm 1 in Friedrich et al., 2008), and can be solved by a dynamic programming approach in polynomial time. The computational cost is of order $O(n^2 \, \mathrm{Cost}(n))$, where $\mathrm{Cost}(n)$ is the computational cost of calculating $H(y, I)$ for a given interval $I$. To be specific, for (2), $\mathrm{Cost}(n) = O(n)$, where the hidden constants depend on the polynomial degree $r$; therefore the total computational cost is $O(n^3)$. A reference where the computational cost and the dynamic programming algorithm are explicitly mentioned is Lemma 1.1 in Chatterjee and Goswami (2019).

We would like to mention that the minimal partitioning problem has previously been used in the change point analysis literature for other models, including Fearnhead and Rigaill (2018), Killick et al. (2012), Wang et al. (2020a), Wang et al. (2019) and Wang et al. (2020b), among others. In the spline regression literature, the $\ell_0$ penalisation is also exploited, for instance, in Shen et al. (2020) and Chatterjee and Goswami (2019), to name but a few.

Remark 2 (The polynomial degree upper bound $r$). The degree $r$ is in fact an input of the algorithm: one needs to specify $r$ in (2) and (6). Usually, when we define a degree-$d$ polynomial, we let
$$g(x) = \sum_{l=0}^{d} c_l x^l, \quad x \in \mathbb{R},$$
with $\{c_l\}_{l=0}^{d} \subset \mathbb{R}$ and $c_d \neq 0$. If $c_d = 0$, then $g(\cdot)$ is regarded as a degenerate degree-$d$ polynomial. In this paper, we do not insist on the highest-degree coefficient being nonzero. With this flexibility, in practice, as long as the input $r$ is not smaller than the largest degree of the underlying polynomials, the performance guarantees of the algorithm still hold. However, the larger the input $r$ is, the more costly the optimisation becomes. More regarding this point will be discussed after we present our main theorem.

1.2 Main contributions

To conclude this section, we summarise our contributions in this paper.

Firstly, to the best of our knowledge, this is the first paper studying change point localisation in piecewise polynomials with general degrees. The model we are concerned with enjoys great flexibility: we allow the number of change points and the variance of the noise sequence to diverge, and the differences between two consecutive distinct polynomials to vanish, as the sample size grows unbounded.

Secondly, we propose a two-step estimation procedure for the change points, detailed in Algorithm 1. The first step is a version of the minimal partitioning problem (e.g. Friedrich et al., 2008), and the second step is a parallelisable update. The first step can be done in $O(n^3)$ time and the second step in $O(n^2)$ time.

Thirdly, we provide theoretical guarantees for the change point estimators returned by Algorithm 1. To the best of our knowledge, this is the first time that change point localisation rates have been established for piecewise polynomials with general degrees.
Prior to this paper, the state-of-the-art results concerned only piecewise-constant signals; as for piecewise-linear signals, existing work has studied signals that are necessarily continuous. In contrast, we allow the underlying contiguous polynomials to possess different smoothness at different change points. This is reflected in our localisation error bound for each individual change point. In short, we show that our change point estimator enjoys nearly optimal adaptive localisation rates.

Lastly, in a fully finite-sample framework, we provide information-theoretic lower bounds characterising the fundamental difficulty of the problem, showing that our estimators are nearly minimax rate-optimal. To the best of our knowledge, even in the piecewise-linear case, previous minimax lower bounds focussed only on the scaling in the sample size $n$, whereas we derive a minimax lower bound involving all the parameters of the problem. More detailed comparisons with the existing literature are given in Section 3.

2 Main results

In this section, we investigate the theoretical properties of the initial and the final estimators returned by Algorithm 1.
2.1 Model parameterisation

In the change point analysis literature, the difficulty of the change point estimation task can be characterised by two key model parameters: the minimal spacing between two consecutive change points and the minimal difference between two consecutive underlying distributions. In this paper, the underlying distributions are determined by the polynomial coefficients. For two different polynomials of degree at most $r$, the difference boils down to the difference between two $(r+1)$-dimensional vectors consisting of the polynomial coefficients. To characterise this difference, for any integers $r, K \geq 0$, we adopt the following reparameterisation for any piecewise polynomial function $f(\cdot) \in \mathcal{F}_n^{r,K}$, where
$$\mathcal{F}_n^{r,K} = \Bigl\{ f(\cdot): [0,1] \to \mathbb{R} :\; 1 = \eta_0 < \eta_1 < \cdots < \eta_K \leq n < \eta_{K+1} = n+1, \text{ s.t. } \forall k \in \{0, 1, \ldots, K\},$$
$$f|_{[\eta_k/n, \eta_{k+1}/n)}: [\eta_k/n, \eta_{k+1}/n) \to \mathbb{R}, \text{ with } f|_{[\eta_k/n, \eta_{k+1}/n)}(x) = f(x), \text{ is a polynomial of degree at most } r \Bigr\}. \quad (7)$$
In particular, any $f(\cdot) \in \mathcal{F}_n^{r,K}$ is right-continuous with left limits.

Definition 1.
Let $f(\cdot) \in \mathcal{F}_n^{r,K}$ and let $\{\eta_k\}_{k=0}^{K+1} \subset \{1, \ldots, n+1\}$ be the collection of change points of $f(\cdot)$, with $\eta_0 = 1$ and $\eta_{K+1} = n+1$. For any $k \in \{1, \ldots, K\}$, let $f|_{[\eta_{k-1}/n, \eta_{k+1}/n)}(\cdot): [\eta_{k-1}/n, \eta_{k+1}/n) \to \mathbb{R}$ be the restriction of $f(\cdot)$ on $[\eta_{k-1}/n, \eta_{k+1}/n)$. Define the reparameterisation of $f|_{[\eta_{k-1}/n, \eta_{k+1}/n)}(\cdot)$ as
$$f(x) = \begin{cases} \sum_{l=0}^{r} a_l (x - \eta_k/n)^l, & x \in [\eta_{k-1}/n, \eta_k/n), \\ \sum_{l=0}^{r} b_l (x - \eta_k/n)^l, & x \in [\eta_k/n, \eta_{k+1}/n), \end{cases} \quad (8)$$
where $\{a_l, b_l\}_{l=0}^{r} \subset \mathbb{R}$. Define the jump associated with the change point $\eta_k$ as
$$\kappa_k = |a_{r_k} - b_{r_k}| > 0, \quad \text{where } r_k = \min\{l = 0, \ldots, r :\; a_l \neq b_l\}. \quad (9)$$

We define the jump associated with each change point of $f(\cdot) \in \mathcal{F}_n^{r,K}$ in Definition 1. The definition is based on a reparameterisation of two consecutive polynomials. Using the notation in Definition 1, due to the definition (7), $f(\cdot)$ is a polynomial of degree at most $r$ on each $[\eta_k/n, \eta_{k+1}/n)$, $k \in \{0, \ldots, K\}$. This enables the reparameterisation (8).

With the reparameterisation (8), it is easy to see that, if $f(\cdot)$ is $d$-times differentiable but not $(d+1)$-times differentiable at $\eta_k/n$, for $d \in \{-1, 0, \ldots, r-1\}$, then $r_k = d + 1$. Here we use the convention that $f(\cdot)$ being $(-1)$-times differentiable at $x$ means that $f(\cdot)$ is not continuous at $x$.

There are two key advantages of using Definition 1 to characterise the difference. Firstly, we allow for a full range of smoothness at the change points. Detecting change points in piecewise-linear models was studied in Fearnhead et al. (2019), but continuity at the change points is imposed there. Translated into our notation, that means $r_k = 1$ for all $k \in \{1, \ldots, K\}$. Our formulation covers this continuous case but also allows for discontinuity. Most importantly, we allow each change point to have its individual smoothness indicator $r_k$.

Secondly, we seek the smallest among all degrees with different coefficients, i.e. in (9) we define $r_k$ to be the minimum of all $l$ with $a_l \neq b_l$. There are apparently other alternatives: for instance, one can choose the maximum among all such $l$, or one can instead define $\kappa_k$ to be a vector norm (say the $\ell_2$-norm) of the difference of the coefficient vectors. We claim that Definition 1 yields the sharpest localisation rates when estimating the change points under the weakest conditions. More discussions on this will be available after we present our main theorem. A small illustrative sketch of Definition 1 follows.
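As an illustration of Definition 1, the following hypothetical snippet computes $r_k$ and $\kappa_k$ from the two coefficient vectors $\{a_l\}$ and $\{b_l\}$ of the reparameterisation (8), together with the signal strength $\rho_k$ of Definition 2 below; all numbers are made up.

```python
import numpy as np

def jump(a, b, tol=1e-12):
    """Smoothness index r_k and jump size kappa_k of Definition 1, given the
    coefficient vectors a, b (length r + 1) of the expansions (8) around
    eta_k / n."""
    diff = np.abs(np.asarray(a, float) - np.asarray(b, float))
    differing = np.flatnonzero(diff > tol)
    if differing.size == 0:
        raise ValueError("identical coefficients: eta_k is not a change point")
    r_k = int(differing[0])      # smallest degree with differing coefficients
    return r_k, float(diff[r_k])

# A kink: continuous (a_0 = b_0) but with different slopes, so r_k = 1.
r_k, kappa_k = jump(a=[1.0, 2.0, 0.0], b=[1.0, -1.0, 0.0])  # r_k = 1, kappa_k = 3
n, Delta = 1000, 200
rho_k = kappa_k ** 2 * Delta ** (2 * r_k + 1) / n ** (2 * r_k)
```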
2.2 Upper bounds

In this section, we present our main theorem providing theoretical guarantees on the output of Algorithm 1, with the assumptions collected in Assumption 1.

Assumption 1 (Model assumptions). Assume that the data $\{y_i\}_{i=1}^n$ are generated from (1), where $f(\cdot)$ belongs to $\mathcal{F}_n^{r,K}$ defined in (7) and the $\varepsilon_i$'s are independent mean-zero sub-Gaussian random variables with $\max_{i=1,\ldots,n} \|\varepsilon_i\|_{\psi_2} \leq \sigma$. (We denote by $\|\cdot\|_{\psi_2}$ the sub-Gaussian or Orlicz-$\psi_2$ norm: for any random variable $X$, $\|X\|_{\psi_2} = \inf\{t > 0:\; \mathbb{E}\exp(X^2/t^2) \leq 2\}$.) We denote the collection of all change points of $f(\cdot)$ by $\{\eta_1, \ldots, \eta_K\}$, satisfying
$$\Delta = \min_{k \in \{1, \ldots, K+1\}} (\eta_k - \eta_{k-1}) > 0, \quad \text{where } \eta_0 = 1 \text{ and } \eta_{K+1} = n+1.$$
In addition, let $\kappa = \min_{k=1,\ldots,K} \kappa_k > 0$, where $\kappa_k$ is defined in Definition 1.

The problem is now completely characterised by the sample size $n$, the maximum degree $r$, the number of change points $K$, the upper bound on the fluctuations $\sigma$, the minimal spacing $\Delta$, the jump sizes $\{\kappa_k\}$ and the associated smoothness levels $\{r_k\}$. In this paper, we allow the maximum degree $r$ to be arbitrary but fixed, i.e. not a function of the sample size $n$. We allow the number of change points $K$ and the fluctuation bound $\sigma$ to diverge, and the minimal spacing $\Delta$ and the jump size $\kappa$ to vanish, as the sample size grows unbounded. We formalise the signal strength in Definition 2 using these model parameters.

Definition 2. Under Assumption 1, for any $k \in \{1, \ldots, K\}$, define the signal strength of the change point $\eta_k$ to be $\rho_k = \kappa_k^2 \Delta^{2r_k+1} n^{-2r_k}$, and the overall signal strength parameter to be $\rho = \min_{k=1,\ldots,K} \rho_k$.

The signal strength defined in Definition 2 is in line with those used in other change point detection problems. More discussions will be provided after we state the main result in Theorem 3.
Theorem 3.
Let the data $\{y_i\}_{i=1}^n$ satisfy Assumption 1. Let $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ and $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$ be the initial and final estimators of Algorithm 1, with inputs $\{y_i\}_{i=1}^n$ and tuning parameter $\lambda$. If
$$\lambda = c_{\mathrm{noise}} K \sigma^2 \log(n) \quad \text{and} \quad \rho \geq c_{\mathrm{signal}} \lambda, \quad (10)$$
then we have that $\mathbb{P}\{\mathcal{E}\} \geq 1 - n^{-c_{\mathrm{prob}}}$, where
$$\mathcal{E} = \Biggl\{ \widehat{K} = K, \; |\widetilde{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{K n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)} \text{ and } |\widehat{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}, \; \forall k \in \{1, \ldots, K\} \Biggr\}.$$
The constants $c_{\mathrm{prob}}, c_{\mathrm{noise}}, c_{\mathrm{signal}}, c_{\mathrm{error}} > 0$ are all absolute constants.

Remark 3 (Tracking constants). All the absolute constants $c_{\mathrm{prob}}, c_{\mathrm{noise}}, c_{\mathrm{signal}}, c_{\mathrm{error}}$ can be tracked in the proofs, although we do not claim their optimality. The hierarchy of the constants is as follows. We first determine the constant $c_{\mathrm{prob}} > 0$, which only depends on the maximum degree $r$. Given $c_{\mathrm{prob}}$, we can determine $c_{\mathrm{noise}}$, which only depends on $c_{\mathrm{prob}}$. With $c_{\mathrm{prob}}$ and $c_{\mathrm{noise}}$ at hand, we can determine $c_{\mathrm{signal}} > 0$. Lastly, the constant $c_{\mathrm{error}} > 0$ depends on $c_{\mathrm{signal}}$, $c_{\mathrm{noise}}$ and $c_{\mathrm{prob}}$. We note that the larger $c_{\mathrm{signal}}$ is, the smaller $c_{\mathrm{error}}$ is.

From Theorem 3 we can see that the final estimators $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$ improve upon the initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ by removing $K$, the dependence on the number of change points, from their localisation error upper bounds. It is possible that this $K$ term is merely an artefact of our current proof, in which case the second-step update would be unnecessary; see Section 3 for more on this issue. With our current proof technique, however, we do need the second-step update to obtain the improved localisation error bound.

As for each individual change point $\eta_k$, $k \in \{1, \ldots, K\}$, the localisation error of the final estimator $\widehat{\eta}_k$ is
$$|\widehat{\eta}_k - \eta_k| \lesssim n^{2r_k/(2r_k+1)} \Bigl\{ \frac{\sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}. \quad (11)$$
The upper bound in (11) is a decreasing function of $\kappa_k$. Under mild conditions, the dominating term in the upper bound in (11) is $n^{2r_k/(2r_k+1)}$, which is an increasing function of $r_k$. Recall Definition 1, where the jump is determined by the smallest possible degree. Together with (11), we can see that if we instead defined the jump in Definition 1 using, say, the largest possible degree, or other vector norms of the coefficient vectors' difference, then the corresponding localisation rate would inflate accordingly.

Let us consider a concrete case where $K = 1$, $r = 1$ and the only change point is $\eta$, with jump size $\kappa$. A question that can be asked now is the following: is it easier to estimate the change point location when the underlying $f(\cdot)$ is continuous at $\eta$ or discontinuous at $\eta$? Theorem 3 shows that it is easier when $f(\cdot)$ is discontinuous at $\eta$, under mild conditions, e.g. when $\sigma \kappa^{-1}$ is of order $O(1)$. In this particular case, the localisation error bound is of order
$$\frac{\sigma^2 \log(n)}{\kappa^2}$$
in the case of discontinuity, and of order
$$n^{2/3} \Bigl\{ \frac{\sigma^2 \log(n)}{\kappa^2} \Bigr\}^{1/3}$$
in the continuous case. The dependence of the localisation error bound on $r_k$ and $\kappa_k$ derived in Theorem 3 is nearly minimax rate-optimal up to logarithmic factors. This will follow from Lemma 4 in Section 2.3.

The condition (10) is a de facto signal-to-noise ratio condition. The tuning parameter $\lambda$ reflects the fluctuations of the noise, and it is set to be smaller than the signal strength $\rho$. A closer comparison unveils that we essentially require
$$\min_{k=1,\ldots,K} \kappa_k^2 \Delta^{2r_k+1} n^{-2r_k} \gtrsim K \sigma^2 \log(n). \quad (12)$$
This includes a wide range of parameter settings; we list a few situations here. For any two positive sequences $a_n, b_n$, we write $a_n \asymp b_n$ if $a_n/b_n$ stays bounded away from $0$ and $\infty$ as $n$ diverges.
• If $K, \sigma \asymp 1$ and $\Delta \asymp n$, then we allow the jump size $\kappa$ to be of order $\{\log(n)/n\}^{1/2}$, which vanishes as $n$ diverges.
• If $K, \sigma, \kappa \asymp 1$, then we allow the minimal spacing $\Delta \asymp \max_{k=1,\ldots,K} n^{2r_k/(2r_k+1)} \log^{1/(2r_k+1)}(n)$. If all $r_k = 0$, this means $\Delta \asymp \log(n)$; if all $r_k = 1$, this means $\Delta \asymp n^{2/3} \log^{1/3}(n)$.
• The number of change points $K$ is allowed to diverge, provided that
$$K \lesssim \{\sigma^2 \log(n)\}^{-1} \min_{k=1,\ldots,K} \kappa_k^2 \Delta^{2r_k+1} n^{-2r_k}.$$

2.3 Lower bounds

In this section, we aim to provide information-theoretic lower bounds characterising the fundamental difficulty of localising change points in the model defined in Assumption 1. In the change point analysis literature, in terms of localising the change point locations, there are two aspects of interest: one is the minimax lower bound on the localisation error and the other is on the signal strength. For simplicity, in this section, we assume that $K = 1$ and $r_1 = r$.

As for these two aspects, in Theorem 3 we show that, provided
$$\kappa^2 \Delta^{2r+1} n^{-2r} \gtrsim K \sigma^2 \log(n), \quad (13)$$
the output of Algorithm 1 has localisation error upper bounded by $\{n^{2r} \sigma^2 \log(n)/\kappa^2\}^{1/(2r+1)}$. In this section, we investigate the optimality of these results.

Lemma 4.
Under Assumption 1, assume that there exists one and only one change point and that $r_1 = r$. Let $P_{\kappa, \Delta, \sigma, r, n}$ denote the joint distribution of the data. Consider the class
$$\mathcal{Q} = \bigl\{ P_{\kappa, \Delta, \sigma, r, n} :\; \Delta < n/2, \; \kappa^2 \Delta^{2r+1} \geq \sigma^2 n^{2r} \zeta_n \bigr\},$$
for any diverging sequence $\{\zeta_n\}$. Then, for all $n$ large enough, it holds that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{Q}} \mathbb{E}_P\bigl( |\widehat{\eta} - \eta(P)| \bigr) \geq \max\Biggl\{ 1, \; \Bigl[ \frac{c n^{2r} \sigma^2}{\kappa^2} \Bigr]^{1/(2r+1)} \Biggr\},$$
where $\eta(P)$ is the location of the change point of the distribution $P$, the infimum is taken over all measurable functions of the data, $\widehat{\eta}$ is the estimated change point and $0 < c < \infty$ is an absolute constant.

Lemma 4 shows that the final estimators provided by Algorithm 1 are nearly optimal in terms of the localisation error, save for a logarithmic factor.
Lemma 5.
Under Assumption 1, assume that there exists one and only one change point and that $r_1 = r$. Let $P_{\kappa, \Delta, \sigma, r, n}$ denote the joint distribution of the data. For a small enough $\xi > 0$, consider the class
$$\mathcal{P} = \Biggl\{ P_{\kappa, \Delta, \sigma, r, n} :\; \Delta = \min\biggl( \Bigl\lfloor \bigl( \xi n^{2r} \kappa^{-2} \sigma^2 \bigr)^{1/(2r+1)} \Bigr\rfloor, \; n/3 \biggr) \Biggr\}.$$
Then we have
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl( |\widehat{\eta} - \eta(P)| \bigr) \geq cn,$$
where $\eta(P)$ is the location of the change point of the distribution $P$, the infimum is taken over all measurable functions of the data, $\widehat{\eta}$ is the estimated change point and $0 < c < \infty$ is a constant depending on $\xi$.

Lemma 5 shows that, if $\kappa^2 \Delta^{2r+1} n^{-2r} \lesssim \sigma^2$, then no algorithm is guaranteed to be consistent, in the sense that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Bigl( \frac{|\widehat{\eta} - \eta(P)|}{n} \Bigr) \gtrsim 1.$$
This means that, besides the logarithmic factor, Lemma 5 and Theorem 3 leave a gap in terms of $K$. To be specific, it remains unclear what results one would obtain if
$$\sigma^2 \lesssim \kappa^2 \Delta^{2r+1} n^{-2r} \lesssim K \sigma^2. \quad (14)$$
This gap only exists when we allow $K$ to diverge. We provide some conjectures in line with this discussion in Section 3.1.

3 Discussion

In this paper, we investigate change point localisation in piecewise polynomial signals. We allow for a general framework and provide individual localisation errors, associated with the individual smoothness at each change point. A two-step algorithm, consisting of solving a minimal partitioning problem and an updating step, is proposed. The outputs are shown to be nearly optimal, as supported by the information-theoretic lower bounds. To conclude this paper, we discuss some unresolved aspects of our work while comparing our results to some particularly relevant existing literature.
3.1 Comparison with Wang et al. (2020a)

Wang et al. (2020a) studied change point localisation in piecewise-constant signals. They studied the $\ell_0$-penalised least squares method and proved that it is nearly minimax optimal in terms of both the signal strength condition and the localisation error. In contrast, with our proof technique, we have been able to generalise this result to higher degree polynomials up to a factor depending on $K$, the number of true change points. This can be seen in the localisation error bound of our initial estimators in Theorem 3, and also in our required signal strength condition (12). In our paper, with general degree polynomials, the localisation near-optimality is secured via an extra updating step, and a gap remains between the upper and lower bounds for the required signal strength condition. This gap is not present if $K$ is assumed to be $O(1)$, but is present if $K$ is allowed to diverge.

We explain why the proof in Wang et al. (2020a) cannot be fully generalised to our setting. Recall the definition of $H(v, I)$ in (3), a residual sum of squares. In our analysis, a crucial role is played by the term
$$Q\{\mathbb{E}(y); I_1, I_2\} = H\{\mathbb{E}(y), I_1 \cup I_2\} - H\{\mathbb{E}(y), I_1\} - H\{\mathbb{E}(y), I_2\},$$
where $I_1, I_2$ are two contiguous intervals of $\{1, \ldots, n\}$. Ideally, one needs to be able to upper and lower bound $Q\{\mathbb{E}(y); I_1, I_2\}$ when $y$ is defined in (1) and its corresponding $f(\cdot)$ is one degree-$r$ polynomial on $I_1$ and another degree-$r$ polynomial on $I_2$. In the case $r = 0$, i.e. the piecewise-constant case, one can write the exact expression
$$Q\{\mathbb{E}(y); I_1, I_2\} = \frac{|I_1||I_2|}{|I_1| + |I_2|} \Biggl( |I_1|^{-1} \sum_{i \in I_1} \mathbb{E}(y_i) - |I_2|^{-1} \sum_{i \in I_2} \mathbb{E}(y_i) \Biggr)^2.$$
In addition, it holds that
$$\frac{1}{2}\min\{|I_1|, |I_2|\} \leq \frac{|I_1||I_2|}{|I_1| + |I_2|} \leq \min\{|I_1|, |I_2|\}.$$
Therefore, it follows that
$$\frac{1}{2}\min\{|I_1|, |I_2|\}\,\kappa^2 \leq Q\{\mathbb{E}(y); I_1, I_2\} \leq \min\{|I_1|, |I_2|\}\,\kappa^2, \quad (15)$$
where $\kappa$ represents the absolute difference between the values of $\mathbb{E}(y_i)$, $i \in I_1$, and $\mathbb{E}(y_i)$, $i \in I_2$.

For general $r$, by adopting an elegant result in Shen et al. (2020), one can in fact generalise (15) to obtain that
$$C_1 \frac{\min\{|I_1|^{2r+1}, |I_2|^{2r+1}\}}{n^{2r}}\, \kappa^2 \leq Q\{\mathbb{E}(y); I_1, I_2\} \leq C_2 \frac{\min\{|I_1|^{2r+1}, |I_2|^{2r+1}\}}{n^{2r}}\, \kappa^2, \quad (16)$$
where $0 < C_1 < C_2 < \infty$ are two absolute constants and $\kappa$ is the absolute difference of the $r$th degree coefficients of $\mathbb{E}(y)$ on $I_1$ and $I_2$. The problem, however, is that the constants $C_1$ and $C_2$ are not explicit: we can only show the existence of such constants. Even if we could track these two constants down, in order to generalise the argument of Wang et al. (2020a) we would still need to show that $C_1$ and $C_2$ are close enough. At this moment, it is not clear to us how to resolve this issue. We can only conjecture that, for all $r \in \mathbb{N}$, the $\ell_0$-penalised least squares method itself would be nearly optimal in terms of both the signal strength condition and the localisation error, and our second-step update would not be needed. From a practical point of view, our second step can be done in $O(n^2)$ time, which is negligible compared to the $O(n^3)$ time required to solve the penalised least squares problem. The computational overhead of our second step is thus minor.
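As a quick numerical sanity check of (15) (illustrative only, reusing the function H from the sketch in Section 1.1), consider a piecewise-constant signal with a jump of size $\kappa$ between two intervals of lengths 30 and 70; the computed $Q\{\mathbb{E}(y); I_1, I_2\}$ equals $|I_1||I_2|/(|I_1|+|I_2|)\,\kappa^2$ and falls inside the claimed bounds.

```python
import numpy as np

def Qval(theta, s, tau, e, r, n):
    """Q{theta; I1, I2} = H(theta, I1 u I2) - H(theta, I1) - H(theta, I2),
    with I1 = {s, ..., tau - 1} and I2 = {tau, ..., e}."""
    return (H(theta, s, e, r, n) - H(theta, s, tau - 1, r, n)
            - H(theta, tau, e, r, n))

n, kappa = 100, 1.5
theta = np.concatenate([np.zeros(30), kappa * np.ones(70)])   # E(y)
q = Qval(theta, s=1, tau=31, e=100, r=0, n=n)
lo, hi = 0.5 * 30 * kappa ** 2, 30 * kappa ** 2               # bounds in (15)
assert lo <= q <= hi    # here q = 30 * 70 / 100 * kappa^2 = 47.25
```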
3.2 Comparison with Fearnhead et al. (2019)

Fearnhead et al. (2019) showed that a penalised least squares method for change point localisation works well for piecewise-linear signals. This work inspired us to investigate piecewise polynomial signals of higher degrees. Even in the piecewise-linear case, there are some differences between our work and Fearnhead et al. (2019). The algorithm provided in Fearnhead et al. (2019) can be seen as solving a variant of the penalised least squares problem mentioned in this paper. In fact, the dynamic programming algorithm in Fearnhead et al. (2019) appears to be more sophisticated than what would be required to solve our problem, because their algorithm is tailored specifically for continuous piecewise-linear functions, and maintaining continuity makes the dynamic programme more involved. Translated into our notation, Fearnhead et al. (2019) assume $r_k = 1$ for all $k \in \{1, \ldots, K\}$. Our formulation is more general, as we do not impose continuity or any other kind of smoothness at the change points, and our estimator adapts near-optimally to the level of smoothness at the change points. The theoretical results in Fearnhead et al. (2019) are derived under the conditions $K, \sigma \asymp 1$. Under these conditions, translated into our notation, their results read: provided that $(\kappa/n)^2 \Delta^3 \gtrsim \log(n)$, the localisation error is $\log^{1/3}(n)\, (n/\kappa)^{2/3}$. Both are consistent with the results we have obtained in this paper.

3.3 Comparison with Raimondo (1998)

Raimondo (1998) studied the minimax rates of change point localisation in a nonparametric setting. The main focus there is how the minimax localisation rates change with $\alpha$, the degree of discontinuity in a Hölder sense. Due to its nonparametric nature, the class of functions considered in Raimondo (1998) is more general than the piecewise polynomial class we discuss here. However, the measures of regularity $r_k$ we define in Definition 1 are exactly the same as the parameter $\alpha$ in Raimondo (1998) if one restricts attention to polynomials. Having drawn this connection, and translated into our notation, Raimondo (1998) in fact shows that the minimax lower bound on the localisation error is of order
$$\bigl\{ n^{2r} \log^{\eta}(n) \bigr\}^{1/(2r+1)}, \quad \forall \eta > 0.$$
This is a lower bound for a larger class of functions than ours, but the dependence on $n$ is the same as ours up to a poly-logarithmic factor. Since Raimondo (1998) assumes all the other parameters to be of order $O(1)$, our minimax lower bounds add value, as they involve all the relevant problem parameters and not just the sample size $n$.

References
Bhattacharjee, M., Banerjee, M. and Michailidis, G. (2018). Change point estimation in a dynamic stochastic block model. arXiv preprint arXiv:1812.03090.

Chan, H. P. and Walther, G. (2013). Detection with the scan and the average likelihood ratio. Statistica Sinica.

Chatterjee, S. and Goswami, S. (2019). Adaptive estimation of multivariate piecewise polynomials and bounded variation functions by optimal decision trees. arXiv preprint arXiv:1911.11562.

Cribben, I. and Yu, Y. (2017). Estimating whole-brain dynamics by using spectral clustering. Journal of the Royal Statistical Society: Series C (Applied Statistics).

Dette, H., Pan, G. M. and Yang, Q. (2018). Estimating a change point in a sequence of very high-dimensional covariance matrices. arXiv preprint.

Dümbgen, L. and Spokoiny, V. G. (2001). Multiscale testing of qualitative hypotheses. Annals of Statistics.

Dümbgen, L. and Walther, G. (2008). Multiscale inference about a density. The Annals of Statistics.

Fearnhead, P., Maidstone, R. and Letchford, A. (2019). Detecting changes in slope with an L0 penalty. Journal of Computational and Graphical Statistics.

Fearnhead, P. and Rigaill, G. (2018). Changepoint detection in the presence of outliers. Journal of the American Statistical Association.

Frick, K., Munk, A. and Sieling, H. (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Friedrich, F., Kempe, A., Liebscher, V. and Winkler, G. (2008). Complexity penalized M-estimation: fast computation. Journal of Computational and Graphical Statistics.

Garreau, D. and Arlot, S. (2018). Consistent change-point detection with kernels. Electronic Journal of Statistics.

Green, P. J. and Silverman, B. W. (1993). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press.

Guntuboyina, A., Lieu, D., Chatterjee, S. and Sen, B. (2020). Adaptive risk bounds in univariate total variation denoising and trend filtering. The Annals of Statistics.

Jeng, X. J., Cai, T. T. and Li, H. (2012). Simultaneous discovery of rare and common segment variants. Biometrika.

Killick, R., Fearnhead, P. and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association.

Li, H., Guo, Q. and Munk, A. (2017). Multiscale change-point segmentation: beyond step functions. arXiv preprint arXiv:1708.03942.

Mammen, E. and van de Geer, S. (1997). Locally adaptive regression splines. The Annals of Statistics.

Padilla, O. H. M., Yu, Y., Wang, D. and Rinaldo, A. (2019a). Optimal nonparametric change point detection and localization. arXiv preprint arXiv:1905.10019.

Padilla, O. H. M., Yu, Y., Wang, D. and Rinaldo, A. (2019b). Optimal nonparametric multivariate change point detection and localization. arXiv preprint arXiv:1910.13289.

Raimondo, M. (1998). Minimax estimation of sharp change points. Annals of Statistics.

Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability.

Shen, Y., Han, Q. and Han, F. (2020). On a phase transition in general order spline regression. arXiv preprint arXiv:2004.10922.

Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics.

Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.

Wahba, G. (1990). Spline Models for Observational Data. SIAM.

Wang, D., Yu, Y. and Rinaldo, A. (2017). Optimal covariance change point localization in high dimension. arXiv preprint arXiv:1712.09912.

Wang, D., Yu, Y. and Rinaldo, A. (2018). Optimal change point detection and localization in sparse dynamic networks. arXiv preprint arXiv:1809.09602.

Wang, D., Yu, Y. and Rinaldo, A. (2020a). Univariate mean change point detection: penalization, CUSUM and optimality. Electronic Journal of Statistics.

Wang, D., Yu, Y., Rinaldo, A. and Willett, R. (2019). Localizing changes in high-dimensional vector autoregressive processes. arXiv preprint arXiv:1909.06359.

Wang, D., Yu, Y. and Willett, R. (2020b). Detecting abrupt changes in high-dimensional self-exciting Poisson processes. arXiv preprint arXiv:2006.03572.

Wang, T. and Samworth, R. J. (2018). High-dimensional changepoint estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam. Springer, 423-435.

Appendices
We include all the proofs in the Appendices. Some preparatory results are provided in Appendix A. Appendix B contains the proof of Theorem 3. The lower bound results, Lemmas 4 and 5, are proved in Appendix C.
A Preparatory Results
The following notation will be used throughout the proofs. For any $I = \{s, \ldots, e\} \subset \{1, \ldots, n\}$, recall the projection matrix $P_I$ defined in (4) using the matrix $U_{I,r}$ defined in (5). We recall the notation
$$H(v, I) = \|v_I\|^2 - \|P_I v_I\|^2 = \|v_I - P_I v_I\|^2,$$
for any vector $v \in \mathbb{R}^n$, where $v_I = (v_i, i \in I)^\top \in \mathbb{R}^{|I|}$. For any contiguous intervals $I, J \subset \{1, \ldots, n\}$ and any vector $v \in \mathbb{R}^n$, define
$$Q(v; I, J) = H(v, I \cup J) - H(v, I) - H(v, J) = \|P_I v_I\|^2 + \|P_J v_J\|^2 - \|P_{I \cup J} v_{I \cup J}\|^2.$$

Lemma 6.
Let $I$ be any nonempty interval subset of $\{1, \ldots, n\}$. For any $k \in \{1, \ldots, |I|\}$ and any partition of $I$, $I = \cup_{l=1}^{k} I_l$, satisfying $I_s \cap I_u = \emptyset$ for any $s, u \in \{1, \ldots, k\}$ with $s \neq u$, it holds for any vector $v \in \mathbb{R}^n$ that
$$H(v, I) \geq \sum_{l=1}^{k} H(v, I_l).$$

Proof.
The claim holds due to the fact that
$$H(v, I) = \|v_I - P_I v_I\|^2 = \sum_{l=1}^{k} \|v_{I_l} - (P_I v_I)_{I_l}\|^2 \geq \sum_{l=1}^{k} \|v_{I_l} - P_{I_l} v_{I_l}\|^2.$$

Lemma 7.
Let $y \in \mathbb{R}^n$ satisfy $y = \theta + \varepsilon$ with $\mathbb{E}(y) = \theta$. Let $I, J$ be two contiguous interval subsets of $\{1, \ldots, n\}$. It holds that
$$\sqrt{Q(y; I, J)} \geq \Bigl| \sqrt{Q(\theta; I, J)} - \sqrt{Q(\varepsilon; I, J)} \Bigr|.$$

Proof.
First observe that $Q(y; I, J)$ is a quadratic form in $y$. Moreover, it is a positive semidefinite quadratic form, as $Q(y; I, J) \geq 0$ for any $y \in \mathbb{R}^n$ by Lemma 6. Therefore, we can write $Q(y; I, J) = y^\top A y$ for a positive semidefinite matrix $A \in \mathbb{R}^{n \times n}$. Denoting by $A^{1/2}$ the square root matrix of $A$, satisfying $A^{1/2} A^{1/2} = A$, we can write $Q(y; I, J) = \|A^{1/2} y\|^2$. It then holds that
$$\sqrt{Q(y; I, J)} = \|A^{1/2}(\theta + \varepsilon)\| \geq \max\bigl\{ \|A^{1/2}\theta\| - \|A^{1/2}\varepsilon\|, \; \|A^{1/2}\varepsilon\| - \|A^{1/2}\theta\| \bigr\},$$
which leads to the final claim.

Lemma 8 (Lemma E.1 in Shen et al. (2020)). There exists an absolute constant $c_{\mathrm{poly}}$ depending only on $r$ such that, for any integers $n$ and $m$ satisfying
$$r + 1 \leq m \leq n \quad (17)$$
and any real sequence $\{a_\ell\}_{\ell=0}^{r}$,
$$\sum_{i=1}^{m} \Bigl[ a_0 + a_1 \Bigl(\frac{i}{n}\Bigr) + \cdots + a_r \Bigl(\frac{i}{n}\Bigr)^r \Bigr]^2 \geq c_{\mathrm{poly}} \max_{d=0,\ldots,r} \frac{a_d^2 m^{2d+1}}{n^{2d}}.$$

Lemma 8 is a direct consequence of Lemma E.1 in Shen et al. (2020); we omit its proof here.

Proposition 9.
Let $I = \{s, \ldots, \tau - 1\}$ and $J = \{\tau, \ldots, e\}$ be two contiguous interval subsets of $\{1, \ldots, n\}$ such that $\min\{|I|, |J|\} \geq r + 1$. Let $\theta = (\theta_i, i = 1, \ldots, n)^\top \in \mathbb{R}^n$ be a piecewise discretised polynomial, i.e. $\theta_i = f(i/n)$, where $f(\cdot)$ is a polynomial of degree at most $r$ on $[s/n, \tau/n)$ and a polynomial of degree at most $r$ on $[\tau/n, e/n)$. Let $\theta_{I \cup J}$, $\theta$ restricted on $I \cup J$, be reparametrised as
$$\theta_i = \begin{cases} \sum_{l=0}^{r} a_l (i/n - \tau/n)^l, & i \in I, \\ \sum_{l=0}^{r} b_l (i/n - \tau/n)^l, & i \in J. \end{cases}$$
Let $d = \min\{l = 0, \ldots, r :\; a_l \neq b_l\}$. Then there exists an absolute constant $c_{\mathrm{poly}}$ depending only on $r$ such that
$$Q(\theta; I, J) \geq c_{\mathrm{poly}} (a_d - b_d)^2 \frac{\min\{|I|^{2d+1}, |J|^{2d+1}\}}{n^{2d}}.$$

Proof.
For any fixed $d \in \{0, \ldots, r\}$ and any $\kappa > 0$, let
$$\mathcal{A}_d = \Biggl\{ v \in \mathbb{R}^{|I \cup J|} :\; \exists \{c_{1,l}, c_{2,l},\; l = 0, \ldots, r\} \subset \mathbb{R} \text{ s.t. } v_i = \begin{cases} \sum_{l=0}^{r} c_{1,l} (i/n - \tau/n)^l, & i \in I, \\ \sum_{l=0}^{r} c_{2,l} (i/n - \tau/n)^l, & i \in J, \end{cases} \text{ and } |c_{1,d} - c_{2,d}| \geq \kappa \Biggr\}. \quad (18)$$
In words, $\mathcal{A}_d$ is the set of vectors which are discretised polynomials of degree at most $r$ on the interval $I/n$ and (different) polynomials of degree at most $r$ on the interval $J/n$, with the $d$th order coefficients at least $\kappa$ apart.

For $v \in \mathcal{A}_d$, since $v$ is a discretised polynomial on $I/n$ and on $J/n$ separately, we have $H(v, I) = H(v, J) = 0$ and hence
$$Q(v; I, J) = \|v_{I \cup J} - P_{I \cup J} v_{I \cup J}\|^2.$$
In addition, we claim that
$$\min_{v \in \mathcal{A}_d} \|v_{I \cup J} - P_{I \cup J} v_{I \cup J}\|^2 = \min_{v \in \mathcal{A}_d} \|v_{I \cup J}\|^2.$$
This is due to the following. Since orthogonal projections cannot increase the $\ell_2$ norm, the left-hand side is at most the right-hand side. As for the other direction, observe that the vector $v_{I \cup J} - P_{I \cup J} v_{I \cup J}$ also belongs to the set $\mathcal{A}_d$, because subtracting one common polynomial of degree at most $r$ from both pieces leaves the coefficient gap $|c_{1,d} - c_{2,d}|$ unchanged.

It now suffices to lower bound $\min_{v \in \mathcal{A}_d} \|v_{I \cup J}\|^2$. For any $v \in \mathcal{A}_d$, it holds that
$$\|v_{I \cup J}\|^2 = \|v_I\|^2 + \|v_J\|^2 \geq \frac{c_{\mathrm{poly}}}{n^{2d}} \bigl( c_{1,d}^2 |I|^{2d+1} + c_{2,d}^2 |J|^{2d+1} \bigr) \geq \frac{c_{\mathrm{poly}}}{n^{2d}} \frac{\kappa^2}{4} \min\{|I|^{2d+1}, |J|^{2d+1}\},$$
where $c_{1,d}$ and $c_{2,d}$ are the $d$th order coefficients of $v$ as defined in (18), the first inequality is due to Lemma 8, and the second inequality follows from the fact that $|c_{1,d} - c_{2,d}| \geq \kappa$ implies $\max\{|c_{1,d}|, |c_{2,d}|\} \geq \kappa/2$. Applying the above with $d$ as in the statement and $\kappa = |a_d - b_d|$, and absorbing the factor $1/4$ into $c_{\mathrm{poly}}$, completes the proof.

Lemma 10 (High probability event). Under Assumption 1, there exists an absolute constant $c_{\mathrm{prob}} > 0$ depending on $r$, and an absolute constant $c_{\mathrm{noise}} > 0$ depending only on $c_{\mathrm{prob}}$, such that
$$\mathbb{P}\Biggl\{ \max_{\substack{I = [s, e] \\ 1 \leq s \leq e \leq n}} \varepsilon_I^\top P_I \varepsilon_I \leq c_{\mathrm{noise}} \sigma^2 \log(n) \Biggr\} \geq 1 - n^{-c_{\mathrm{prob}}}.$$

Proof. For any interval $I \subset \{1, \ldots, n\}$, there exists an absolute positive constant $c > 0$ such that, for any $t > 0$,
$$\mathbb{P}\bigl\{ \varepsilon_I^\top P_I \varepsilon_I - \mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I) \geq t \bigr\} \leq \exp\Biggl[ -c \min\Biggl\{ \frac{t^2}{\sigma^4 \|P_I\|_{\mathrm{F}}^2}, \; \frac{t}{\sigma^2 \|P_I\|_{\mathrm{op}}} \Biggr\} \Biggr],$$
which is due to the Hanson-Wright inequality (e.g. Theorem 1.1 in Rudelson and Vershynin, 2013). Since $P_I$ is a rank-$(r+1)$ orthogonal projection matrix, we have $\|P_I\|_{\mathrm{F}}^2 = r + 1$ and $\|P_I\|_{\mathrm{op}} = 1$. Then
$$\mathbb{P}\bigl\{ \varepsilon_I^\top P_I \varepsilon_I - \mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I) \geq t \bigr\} \leq \exp\Biggl[ -c \min\Biggl\{ \frac{t^2}{\sigma^4 (r+1)}, \; \frac{t}{\sigma^2} \Biggr\} \Biggr].$$
In addition, we have that $\mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I) \leq \mathrm{tr}(P_I) \max_{i=1,\ldots,n} \mathbb{E}(\varepsilon_i^2) \leq (r+1)\sigma^2$. For a large enough absolute constant $C > 0$, letting $t = C\sigma^2 \log(n)$ and applying a union bound argument over all possible $I$ (of which there are at most $n^2$), we obtain the claimed bound.

B Proof of Theorem 3

In this section, we provide the proof of Theorem 3. We first prove that, under an appropriate deterministic choice of the tuning parameter $\lambda$ and some deterministic conditions on the other parameters, the desired localisation error is attained. We then conclude the proof using Lemma 10, under which all these required conditions hold. For any $\tau > 0$, define the event
$$\mathcal{M}(\tau) = \Biggl\{ \max_{\substack{I = [s, e] \\ 1 \leq s \leq e \leq n}} \varepsilon_I^\top P_I \varepsilon_I \leq \tau \Biggr\}. \quad (19)$$

Proof of Theorem 3. It follows from Lemma 10 that
$$\mathbb{P}\bigl[ \mathcal{M}\{c_{\mathrm{noise}} \sigma^2 \log(n)\} \bigr] \geq 1 - n^{-c_{\mathrm{prob}}},$$
where $\mathcal{M}(\cdot)$ is defined in (19). On the event $\mathcal{M}\{c_{\mathrm{noise}} \sigma^2 \log(n)\}$, it follows from Proposition 11 that $\widehat{K} = K$ and
$$|\widetilde{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{K n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}, \quad \forall k \in \{1, \ldots, K\}.$$
In addition, due to (10), it holds that $\max_{k=1,\ldots,K} |\widetilde{\eta}_k - \eta_k| < \Delta/5$. Then it follows from Lemma 17 that
$$|\widehat{\eta}_k - \eta_k| \leq c_{\mathrm{error}} \Bigl\{ \frac{n^{2r_k} \sigma^2 \log(n)}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}, \quad \forall k \in \{1, \ldots, K\}.$$
This completes the proof.

B.1 The initial estimators $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$

The following proposition is our main intermediate result used to prove Theorem 3.

Proposition 11. Let the data $\{y_i\}_{i=1}^n$ satisfy Assumption 1. Let $\{\widetilde{\eta}_k\}_{k=1}^{\widehat{K}}$ be the initial estimators of Algorithm 1, with inputs $\{y_i\}_{i=1}^n$ and tuning parameter $\lambda$. On the event $\mathcal{M}(\tau)$ defined in (19), for any $\tau > 0$, let
$$\lambda > (4K + 5)\tau. \quad (20)$$
Assume that
$$\rho > C_r \max\{\lambda, r+1\}, \quad (21)$$
where $C_r > 0$ is a large enough constant depending only on $r$. Then, for any $k \in \{1, \ldots, K\}$, there exists an absolute constant $c > 0$ such that
$$|\widetilde{\eta}_k - \eta_k| \leq c \Bigl( \frac{n^{2r_k} \max\{\lambda, r+1\}}{\kappa_k^2} \Bigr)^{1/(2r_k+1)}.$$

Remark 4. Note that Proposition 11 is a completely deterministic result.
In particular, no probabilistic assumption is needed on the noise variables. The constants in the proposition can be tracked explicitly in the proof, although they are not optimal in any sense; they serve mainly to emphasise the deterministic nature of the result and to aid in understanding the relative choices of the different problem parameters.

Proof of Proposition 11. We will show that:
(a) any $I = [s, e) \in \widehat{\Pi}$ contains no more than two true change points;
(b) for any two consecutive intervals $I, J \in \widehat{\Pi}$, the interval $I \cup J$ contains at least one true change point;
(c) for any $I = [s, e) \in \widehat{\Pi}$, if there are exactly two true change points contained in $I$, i.e. $\eta_{k-1} < s \leq \eta_k < \eta_{k+1} < e \leq \eta_{k+2}$, then
$$\eta_k - s \leq c \Bigl( \frac{n^{2r_k} \max\{\lambda, r+1\}}{\kappa_k^2} \Bigr)^{1/(2r_k+1)} \quad \text{and} \quad e - \eta_{k+1} \leq c \Bigl( \frac{n^{2r_{k+1}} \max\{\lambda, r+1\}}{\kappa_{k+1}^2} \Bigr)^{1/(2r_{k+1}+1)};$$
(d) for any $I = [s, e) \in \widehat{\Pi}$, if there is exactly one true change point contained in $I$, i.e. $\eta_{k-1} < s \leq \eta_k < e \leq \eta_{k+1}$, then
$$\min\{\eta_k - s, \; e - \eta_k\} \leq c \Bigl( \frac{n^{2r_k} \max\{\lambda, r+1\}}{\kappa_k^2} \Bigr)^{1/(2r_k+1)};$$
(e) $|\widehat{\Pi}| = |\Pi^*|$, i.e. $\widehat{K} = K$, where $\Pi^*$ denotes the true partition induced by $\{\eta_k\}_{k=1}^{K}$.
To illustrate the common argument, consider the situation in (d). Let $I_1 = [s, \eta_k)$, $I_2 = [\eta_k, e)$ and $\widetilde{\Pi} = \widehat{\Pi} \cup \{I_1, I_2\} \setminus \{I\}$. It holds that
$$0 \leq G(\widetilde{\Pi}, \lambda) - G(\widehat{\Pi}, \lambda) = \lambda - \{H(y, I) - H(y, I_1) - H(y, I_2)\} = \lambda - Q(y; I_1, I_2)$$
$$\leq \lambda - \Bigl| \sqrt{Q(\theta; I_1, I_2)} - \sqrt{Q(\varepsilon; I_1, I_2)} \Bigr|^2 \leq \lambda - Q(\theta; I_1, I_2)/2 + Q(\varepsilon; I_1, I_2) \leq \lambda + 6\tau - Q(\theta; I_1, I_2)/2,$$
where the first inequality follows from the definition of $\widehat{\Pi}$, the second inequality follows from Lemma 7, the third inequality follows from $(a - b)^2 \geq a^2/2 - b^2$ for any $a, b \in \mathbb{R}$, and the last follows from the definition of the event $\mathcal{M}(\tau)$. Applying Proposition 9, we conclude that
$$c_{\mathrm{poly}} \frac{\kappa_k^2}{n^{2r_k}} \min\{|I_1|^{2r_k+1}, |I_2|^{2r_k+1}\} \leq \max\{12\tau + 2\lambda, \; r+1\},$$
where the $r+1$ branch covers the case $\min\{|I_1|, |I_2|\} \leq r$, in which Proposition 9 is not applicable.

Lemma 16 (Part (e) in the proof of Proposition 11). Under all the assumptions in Proposition 11, assuming that $|\Pi^*| \leq |\widehat{\Pi}| \leq 3|\Pi^*|$, it holds that $|\widehat{\Pi}| = |\Pi^*|$.

Proof. To ease notation, for any interval partition $\mathcal{P}$ of $\{1, \ldots, n\}$ and any $v \in \mathbb{R}^n$, we let
$$S(v, \mathcal{P}) = \sum_{I \in \mathcal{P}} H(v, I).$$
Using this notation, we first note that $\|\varepsilon\|^2 \geq S(y, \Pi^*)$, since for any $I \in \Pi^*$, $H(y, I) = H(\varepsilon, I) \leq \|\varepsilon_I\|^2$. Let $\widehat{\Pi} \cap \Pi^*$ be the intersection of the partitions $\widehat{\Pi}$ and $\Pi^*$, i.e. the partition formed by all nonempty pairwise intersections of their members. It then holds that
$$\|\varepsilon\|^2 + (K+1)\lambda \geq S(y, \Pi^*) + (K+1)\lambda \geq S(y, \widehat{\Pi}) + (\widehat{K}+1)\lambda \geq S(y, \widehat{\Pi} \cap \Pi^*) + (\widehat{K}+1)\lambda, \quad (22)$$
where the second inequality follows from the definition of $\widehat{\Pi}$ and the last inequality is due to Lemma 6. On the other hand, we have that
$$\|\varepsilon\|^2 - S(y, \widehat{\Pi} \cap \Pi^*) = \|\varepsilon\|^2 - S(\varepsilon, \widehat{\Pi} \cap \Pi^*) \leq (\widehat{K} + K + 2)\tau, \quad (23)$$
where the identity is due to the fact that $\theta$ is a polynomial of degree at most $r$ on every member of $\widehat{\Pi} \cap \Pi^*$, and the inequality holds on the event $\mathcal{M}(\tau)$, noticing that $|\widehat{\Pi} \cap \Pi^*| \leq \widehat{K} + K + 2$. Combining (22) and (23), we have that
$$(\widehat{K} - K)\lambda \leq (\widehat{K} + K + 2)\tau \leq (4K + 5)\tau,$$
where the last inequality is due to $|\widehat{\Pi}| \leq 3|\Pi^*|$. Since we also have $|\widehat{\Pi}| \geq |\Pi^*|$, the last display implies that $\widehat{K} = K$, for otherwise it contradicts (20).

B.2 The final estimators $\{\widehat{\eta}_k\}_{k=1}^{\widehat{K}}$

The following lemma shows that the update step in Algorithm 1 can significantly improve the initial estimators.

Lemma 17. Let the data $\{y_i\}_{i=1}^n$ satisfy Assumption 1. For any set $\{\nu_k\}_{k=1}^{K}$ satisfying $\max_{k=1,\ldots,K} |\nu_k - \eta_k| \leq \Delta/5$, with $\nu_0 = 1$ and $\nu_{K+1} = n+1$, define
$$s_k = \lfloor \nu_{k-1}/2 \rfloor + \lfloor \nu_k/2 \rfloor, \quad e_k = \lfloor \nu_k/2 \rfloor + \lfloor \nu_{k+1}/2 \rfloor \quad \text{and} \quad I_k = [s_k, e_k), \quad \forall k \in \{1, \ldots, K\}.$$
Let
$$\widehat{\eta}_k = \operatorname*{argmin}_{t \in I_k \setminus \{s_k\}} \bigl\{ H(y, [s_k, t)) + H(y, [t, e_k)) \bigr\}, \quad \forall k \in \{1, \ldots, K\}.$$
For any $\tau > 0$, if
$$\rho > C_r' \, 10^{2r+1} \tau, \quad (24)$$
for a large enough constant $C_r' > 0$ depending only on $r$, then on the event $\mathcal{M}(\tau)$ it holds that, for an absolute constant $c > 0$,
$$|\widehat{\eta}_k - \eta_k| \leq c \Bigl\{ \frac{n^{2r_k} \tau}{\kappa_k^2} \Bigr\}^{1/(2r_k+1)}.$$

Proof. For any $k \in \{1, \ldots, K\}$, suppressing the integer roundings in the definitions of $s_k$ and $e_k$ for readability, we have that
$$\eta_k - s_k = (\eta_k - \nu_k) + \frac{\nu_k - \eta_k}{2} + \frac{\eta_k - \eta_{k-1}}{2} + \frac{\eta_{k-1} - \nu_{k-1}}{2} \geq -\frac{\Delta}{5} - \frac{\Delta}{10} + \frac{\Delta}{2} - \frac{\Delta}{10} = \frac{\Delta}{10},$$
and
$$s_k - \eta_{k-1} = \frac{\nu_k - \eta_k}{2} + \frac{\eta_k - \eta_{k-1}}{2} + \frac{\eta_{k-1} - \nu_{k-1}}{2} + (\nu_{k-1} - \eta_{k-1}) \geq -\frac{\Delta}{10} + \frac{\Delta}{2} - \frac{\Delta}{10} - \frac{\Delta}{5} = \frac{\Delta}{10}.$$
Using identical arguments, it also holds that $\min\{e_k - \eta_k, \; \eta_{k+1} - e_k\} \geq \Delta/10$. Without loss of generality, assume $s_k < \widehat{\eta}_k < \eta_k < e_k$. Let $J_1 = [s_k, \widehat{\eta}_k)$, $J_2 = [\widehat{\eta}_k, \eta_k)$ and $J_3 = [\eta_k, e_k)$. By the definition of $\widehat{\eta}_k$, we have that
$$H(y, J_1 \cup J_2) + H(y, J_3) \geq H(y, J_1) + H(y, J_2 \cup J_3) = H(y, J_1) + H(y, J_2) + H(y, J_3) + Q(y; J_2, J_3).$$
We then have that
$$c_{\mathrm{poly}} \frac{\kappa_k^2}{n^{2r_k}} \min\{|J_2|^{2r_k+1}, |J_3|^{2r_k+1}\} \leq Q(y; J_2, J_3) \leq Q(y; J_1, J_2) = Q(\varepsilon; J_1, J_2) \leq 2\tau,$$
where the first inequality is due to Proposition 9, the second inequality follows by rearranging the previous display, the identity holds since $\theta$ is one polynomial of degree at most $r$ on $J_1 \cup J_2 \subset [\eta_{k-1}, \eta_k)$, and the last inequality holds on the event $\mathcal{M}(\tau)$. Since $|J_3| \geq \Delta/10$, the final claim follows from (24).

C Proofs of Lemmas 4 and 5

Proof of Lemma 4. Let $P_0$ denote the joint distribution of the independent random variables $\{y_i\}_{i=1}^n$ such that
$$y_i \sim \begin{cases} \mathcal{N}(0, \sigma^2), & i \in \{1, \ldots, \Delta\}, \\ \mathcal{N}\bigl(\kappa (i/n - \Delta/n)^r, \sigma^2\bigr), & i \in \{\Delta+1, \ldots, n\}. \end{cases}$$
Let $P_1$ denote the joint distribution of the independent random variables $\{z_i\}_{i=1}^n$ such that
$$z_i \sim \begin{cases} \mathcal{N}(0, \sigma^2), & i \in \{1, \ldots, \Delta + \delta\}, \\ \mathcal{N}\bigl(\kappa (i/n - \Delta/n)^r, \sigma^2\bigr), & i \in \{\Delta + \delta + 1, \ldots, n\}, \end{cases}$$
where $\delta$ is a positive integer no larger than $n - \Delta - 1$.

As for $P_0$, it is easy to see that
$$\mathbb{E}(y_i) = \begin{cases} 0, & i \in \{1, \ldots, \Delta\}, \\ \kappa \{(i - \Delta)/n\}^r, & i \in \{\Delta+1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_0$ satisfies $\eta(P_0) = \Delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $r$, and the jump size $\kappa_1 = \kappa$.

As for $P_1$, we have that
$$\mathbb{E}(z_i) = \begin{cases} 0, & i \in \{1, \ldots, \Delta + \delta\}, \\ \kappa \Bigl\{\frac{i - \Delta}{n}\Bigr\}^r = \kappa \sum_{l=0}^{r} \binom{r}{l} \Bigl( \frac{i - \Delta - \delta}{n} \Bigr)^{r-l} \Bigl( \frac{\delta}{n} \Bigr)^l, & i \in \{\Delta + \delta + 1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_1$ satisfies $\eta(P_1) = \Delta + \delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $0$, and the jump size $\kappa_1 = \kappa (\delta/n)^r$.

It then follows from Le Cam's lemma (e.g. Yu, 1997), a standard reduction of estimation to two-point testing, and Lemma 2.6 in Tsybakov (2009), a form of Pinsker's inequality, that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{Q}} \mathbb{E}_P(|\widehat{\eta} - \eta|) \geq \frac{\delta}{2} \bigl( 1 - d_{\mathrm{TV}}(P_0, P_1) \bigr) \geq \frac{\delta}{2} \Biggl( 1 - \sqrt{\frac{\kappa^2}{4\sigma^2 n^{2r}} \sum_{i=\Delta+1}^{\Delta+\delta} (i - \Delta)^{2r}} \Biggr) = \frac{\delta}{2} \Biggl( 1 - \sqrt{\frac{\kappa^2}{4\sigma^2 n^{2r}} \sum_{i=1}^{\delta} i^{2r}} \Biggr)$$
$$\geq \frac{\delta}{2} \Biggl( 1 - \sqrt{\frac{\kappa^2}{4\sigma^2 n^{2r}} \int_{1}^{\delta+1} x^{2r} \, \mathrm{d}x} \Biggr) \geq \frac{\delta}{2} \Biggl\{ 1 - \sqrt{\frac{\kappa^2 (\delta+1)^{2r+1}}{4(2r+1) \sigma^2 n^{2r}}} \Biggr\} \geq \frac{\delta}{2} \Biggl\{ 1 - \sqrt{\frac{c_1 \kappa^2 \delta^{2r+1}}{\sigma^2 n^{2r}}} \Biggr\},$$
for an absolute constant $c_1 > 0$. We set
$$\delta = \max\Biggl\{ \Biggl\lfloor \biggl( \frac{c \sigma^2 n^{2r}}{\kappa^2} \biggr)^{1/(2r+1)} \Biggr\rfloor, \; 1 \Biggr\}$$
and complete the proof.

Proof of Lemma 5. Let $P_0$ denote the joint distribution of the independent random variables $\{y_i\}_{i=1}^n$ such that
$$y_i \sim \begin{cases} \mathcal{N}\bigl(\kappa (i/n - \Delta/n)^r, \sigma^2\bigr), & i \in \{1, \ldots, \Delta\}, \\ \mathcal{N}(0, \sigma^2), & i \in \{\Delta+1, \ldots, n\}. \end{cases}$$
Let $P_1$ denote the joint distribution of the independent random variables $\{z_i\}_{i=1}^n$ such that
$$z_i \sim \begin{cases} \mathcal{N}(0, \sigma^2), & i \in \{1, \ldots, n - \Delta\}, \\ \mathcal{N}\bigl(\kappa \{i/n - (n-\Delta)/n\}^r, \sigma^2\bigr), & i \in \{n - \Delta + 1, \ldots, n\}. \end{cases}$$
As for $P_0$, it is easy to see that
$$\mathbb{E}(y_i) = \begin{cases} \kappa \{(i - \Delta)/n\}^r, & i \in \{1, \ldots, \Delta\}, \\ 0, & i \in \{\Delta+1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_0$ satisfies $\eta(P_0) = \Delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $r$, and the jump size $\kappa_1 = \kappa$. As for $P_1$, it is easy to see that
$$\mathbb{E}(z_i) = \begin{cases} 0, & i \in \{1, \ldots, n - \Delta\}, \\ \kappa \{(i - n + \Delta)/n\}^r, & i \in \{n - \Delta + 1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_1$ satisfies $\eta(P_1) = n - \Delta + 1$. Recalling Definition 1, we also know that the corresponding smallest order $r_1$ equals $r$, and the jump size $\kappa_1 = \kappa$.

Since $\Delta \leq n/3$, we have $|\eta(P_0) - \eta(P_1)| = n - 2\Delta \geq n/3$, and it follows from Le Cam's lemma (e.g. Yu, 1997) and Lemma 2.6 in Tsybakov (2009) that
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P(|\widehat{\eta} - \eta|) \geq \frac{n}{6} \bigl( 1 - d_{\mathrm{TV}}(P_0, P_1) \bigr) \geq \frac{n}{12} \exp\{ -\mathrm{KL}(P_0, P_1) \}.$$
Since both $P_0$ and $P_1$ are product measures, it holds that
$$\mathrm{KL}(P_0, P_1) = \frac{\kappa^2}{2\sigma^2} \Biggl\{ \sum_{i=1}^{\Delta} \Bigl( \frac{i - \Delta}{n} \Bigr)^{2r} + \sum_{i=n-\Delta+1}^{n} \Bigl( \frac{i - n + \Delta}{n} \Bigr)^{2r} \Biggr\} \leq \frac{\kappa^2}{\sigma^2} \sum_{i=1}^{\Delta} \Bigl( \frac{i}{n} \Bigr)^{2r} \leq \frac{\kappa^2}{\sigma^2 n^{2r}} \int_{1}^{\Delta+1} x^{2r} \, \mathrm{d}x \leq \frac{c_2 \kappa^2 \Delta^{2r+1}}{\sigma^2 n^{2r}} \leq c_2 \xi,$$
for an absolute constant $c_2 > 0$, where the last inequality follows from the definition of $\Delta$ in the class $\mathcal{P}$. Therefore
$$\inf_{\widehat{\eta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P(|\widehat{\eta} - \eta|) \geq \frac{n}{12} e^{-c_2 \xi} = cn,$$
which completes the proof.