Detection of Change Points in Piecewise Polynomial Signals Using Trend Filtering
Reza Valiollahi Mehrizi and Shojaeddin Chenouri Department of Statistics and Actuarial Science, University of Waterloo
Abstract
While many approaches have been proposed for discovering abrupt changes in piecewise constant signals, few methods are available to capture these changes in piecewise polynomial signals. In this paper, we propose a change point detection method, PRUTF, based on trend filtering. By providing a comprehensive dual solution path for trend filtering, PRUTF allows us to discover change points of the underlying signal for either a given value of the regularization parameter or a specific number of steps of the algorithm. We demonstrate that the dual solution path constitutes a Gaussian bridge process that enables us to derive an exact and efficient stopping rule for terminating the search algorithm. We also prove that the estimates produced by this algorithm are asymptotically consistent in pattern recovery. This result holds even in the case of staircases (consecutive change points of the same sign) in the signal. Finally, we investigate the performance of our proposed method for various signals and then compare its performance against some state-of-the-art methods in the context of change point detection. We apply our method to three real-world datasets including the UK House Price Index (HPI), the GISS Surface Temperature Analysis (GISTEMP) and the Coronavirus disease (COVID-19) pandemic.
KEYWORDS: Change point detection, Trend filtering, Gaussian bridge process, Pattern recovery, COVID-19.
The problem of change point detection is more than sixty years old. It was first studied by Page (1954, 1955), and since then, has been of interest to many scientists including statisticians. Many of the earlier developments concerned the existence of at most one change point; however, considerable attention in recent years has been given to multiple change point analysis, which has found applications in many fields such as finance and econometrics, Bai and Perron (2003); Hansen (2001), bioinformatics and genomics, Futschik et al. (2014); Lavielle (2005), climatology, Liu et al. (2010); Pezzatti et al. (2013), and technology, Siris and Papagalou (2004); Oudre et al. (2011); Lung-Yut-Fong et al. (2012); Ranganathan (2012); Galceran et al. (2017). Consequently, there is a vast and rich literature on the subject. In the following, we only review a body of literature on a retrospective change point framework closely related to our work and refer the interested readers to Eckley et al. (2011); Lee (2010); Horváth and Rice (2014); Truong et al. (2018) for more comprehensive reviews.

We consider the univariate signal plus noise model

y_i = f_i + ε_i,  i = 1, ..., n,  (1)

where f_i = f(i/n) is a deterministic and unknown signal with equally spaced input points over the interval [0, 1]. The errors ε_1, ..., ε_n are assumed to be independently and identically distributed Gaussian random variables with mean zero and finite variance σ². We assume that f(·) undergoes J unknown and distinct changes at point fractions 0 = ω_0 < ω_1 < ... < ω_J < ω_{J+1} = 1, where the number of change point fractions, J, can grow with the sample size n. Additionally, we assume that f(·) is a piecewise polynomial function of order r = 0, 1, 2, .... These assumptions imply that, associated with ω_1, ..., ω_J, there are change point locations 0 = τ_0 < τ_1 < ... < τ_J < τ_{J+1} = n, which partition the entire signal f = (f_1, ..., f_n) into J + 1 segments.
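As a concrete illustration of model (1), the sketch below simulates a piecewise polynomial (here, piecewise linear) signal with two change points and adds Gaussian noise. The setup, segment slopes, and change point fractions are all illustrative choices, not taken from the paper's experiments.

```python
import numpy as np

def piecewise_linear_signal(n, breaks, slopes, intercept=0.0):
    """Piecewise linear signal f_i = f(i/n) on an equally spaced grid.

    `breaks` are change point fractions in (0, 1); each segment gets its own
    slope and restarts at `intercept`, so both the level and the slope may
    jump at a change point (no continuity constraint).
    """
    x = np.arange(1, n + 1) / n
    f = np.full(n, intercept, dtype=float)
    edges = np.concatenate(([0.0], breaks, [1.0]))
    for k, slope in enumerate(slopes):
        mask = (x > edges[k]) & (x <= edges[k + 1])
        f[mask] = intercept + slope * (x[mask] - edges[k])
    return f

rng = np.random.default_rng(0)
n = 300
f = piecewise_linear_signal(n, breaks=[1 / 3, 2 / 3], slopes=[3.0, -2.0, 4.0])
y = f + rng.normal(scale=0.1, size=n)   # model (1): y_i = f_i + eps_i
```

Within each segment the second differences of f vanish, so they are nonzero only in a small window around each change point, which is what the penalties introduced below exploit.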
More specifically, any subsignal of f within segments created by the change points follows an r-degree polynomial structure with or without a continuity constraint at the change points. For more detail, see Figure 2. Change in the level of a piecewise constant signal, known as the canonical multiple change point problem, and change in the slope of a piecewise linear signal are examples of the problem under consideration in this paper. In change point analysis, the objective is to estimate the number of change points, J, as well as their locations τ = {τ_1, ..., τ_J}, based on the observations y = (y_1, ..., y_n).

The canonical multiple change point problem, where the signal f is modelled as a piecewise constant function, has been extensively studied in the literature. In this framework, there are many approaches and we only attempt to list a selection of them here. The majority of these techniques seek to identify all change points at once by solving an optimization problem consisting of a loss function, often the negative log-likelihood, and a penalty criterion. Yao (1988); Yao and Au (1989) used the squared error loss along with the Schwarz Information Criterion (SIC) as a penalty function to consistently estimate the bounded number of change points and their locations for data drawn from a Gaussian distribution. Within the same setting, the incorporation of various penalty functions, including the Modified Information Criterion (MIC) Pan and Chen (2006), modified Bayesian Information Criterion (mBIC) Zhang and Siegmund (2007), Simultaneous Information Theoretic Criterion (SITC) Wu (2008) and modified SIC Ciuperca (2011, 2014), has been studied. Specific algorithms such as Optimal Partitioning Auger and Lawrence (1989), Segment Neighbourhood Jackson et al. (2005), and pruning approaches such as PELT Killick et al.
(2012) and PDPa Rigaill (2015) were developed to solve such optimization problems.

Apart from penalty-based techniques, another frequently used class of change point detection approaches encompasses the greedy search procedures, which search sequentially for one single change point at a time. The most popular methods in this class are Binary Segmentation Vostrikova (1981) and its variants such as Circular Binary Segmentation (CBS) Olshen et al. (2004), and Wild Binary Segmentation (WBS) Fryzlewicz et al. (2014). In recent years, researchers have attempted to improve Binary Segmentation's performance from statistical and computational viewpoints. Fryzlewicz et al. (2018) suggested a backward (bottom-up) mechanism, called Tail Greedy Unbalanced Haar (TGUH), which is computationally fast and statistically consistent in estimating both the number and the locations of change points. Also, Fryzlewicz (2018) introduced Wild Binary Segmentation 2 (WBS2) to deal with the shortcomings of WBS in datasets with frequent changes. It has been shown that the method is fast in run time and accurate in detection.

Beyond the canonical change point problem, signals in which f is modelled as a piecewise polynomial of order r ≥ 1 have also been studied via ℓ1-penalized least squares procedures. In continuous piecewise linear models, Kim et al. (2009) developed a methodology called ℓ1-trend filtering. Furthermore, Baranowski et al. (2019) put forward the method of Narrowest Over Threshold (NOT), and Anastasiou and Fryzlewicz (2019) developed an approach called Isolate-Detect (ID), which both provide asymptotically consistent estimators of the number and locations of change points.

Our goal in this paper is to introduce a unifying method covering the canonical change point problem and beyond. More precisely, the method is capable of detecting change points in piecewise polynomial signals of order r (r = 0, 1, 2, ...
) with and without continuity constraints at the locations of change points.

The detection of change points in a sequence of data can be formulated as a penalized regression fitting problem. According to our notation, the quantity f_τ − f_{τ+1} is nonzero if the signal f undergoes a change at point τ, and is zero otherwise. Moreover, if we assume that change points are sparse, that is, the number of locations where f changes, J, is much smaller than the number of observations n, change points can be estimated using the one-dimensional fused lasso problem

min_{f ∈ R^n} (1/2) ||y − f||_2^2 + λ Σ_{i=1}^{n−1} |f_{i+1} − f_i|,

where f = (f_1, ..., f_n). This formulation of the canonical change point problem was first considered in Huang et al. (2005) and was applied to analyze a DNA copy number dataset. Harchaoui and Lévy-Leduc (2010) considered the same formulation and proved the consistency of the respective change point estimates when the number of change points is bounded. Employing the sparse fused lasso, which is composed of both the ℓ1-norm and the total variation seminorm penalties, Rinaldo et al. (2009) proposed a sparse piecewise constant fit and established the consistency of the corresponding estimates when the variance of the noise terms vanishes and the minimum magnitude of jumps is bounded from below. However, Rojas and Wahlberg (2014) argued that the consistency results achieved by Rinaldo et al. (2009) are incorrect when a frequently viewed pattern, called staircase, exists in the signal. The staircase phenomenon occurs in a piecewise constant model when there are either two consecutive downward jumps or two consecutive upward jumps in its mean structure. The staircase pattern will be discussed in more detail in Section 6.
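To make the fused lasso formulation concrete, the following minimal numpy sketch (a generic first-order scheme for illustration, not the path algorithm proposed in this paper) solves the fused lasso through its box-constrained dual by projected gradient and reads change points off the nonzero first differences of the fit.

```python
import numpy as np

def fused_lasso_pg(y, lam, n_iter=20000):
    """Fused lasso fit via projected gradient on the dual problem.

    Dual: minimize 0.5 * ||y - D^T u||^2  subject to  ||u||_inf <= lam,
    where D is the (n-1) x n first difference matrix. Step size 1/4 is safe
    because the largest eigenvalue of D D^T is below 4.
    """
    n = len(y)
    D = np.diff(np.eye(n), axis=0)               # row i: f_{i+1} - f_i
    u = np.zeros(n - 1)
    for _ in range(n_iter):
        grad = -D @ (y - D.T @ u)                # gradient of the dual objective
        u = np.clip(u - 0.25 * grad, -lam, lam)  # gradient step + box projection
    return y - D.T @ u                           # primal fit from the dual

y = np.r_[np.zeros(20), 5.0 * np.ones(20)]       # one clean jump, no noise
f_hat = fused_lasso_pg(y, lam=1.0)
jumps = np.abs(np.diff(f_hat))
print(int(np.argmax(jumps)))                     # → 19 (0-based): jump between positions 19 and 20
```

On this toy signal the fit keeps a single large difference at the true jump while shrinking each segment toward its neighbour by λ divided by the segment length, which is the behaviour the consistency results cited above quantify.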
Additionally, Qian and Jia (2016) showed that the lasso problem of Tibshirani (1996), when derived by transforming the fused lasso, does not satisfy the Irrepresentable Condition (Zhao and Yu (2006)) that is necessary and sufficient for exact pattern recovery. In particular, Qian and Jia (2016) proposed an approach called preconditioned fused lasso based on the puffer transformation of Jia et al. (2015) and established that it can recover the exact pattern with probability approaching one.

A similar approach to that of the piecewise constant signals can be considered for estimating change points in piecewise polynomial signals. In particular, a positive integer τ is a change location in an r-th degree piecewise polynomial signal f if the τ-th element of the vector D^(r+1) f is non-zero, denoted by [D^(r+1) f]_τ ≠ 0. Here D^(r+1) is a penalty matrix that will be defined in Section 2.2. Hence, change points can be estimated from the nonzero elements of D^(r+1) f̂, where f̂ is the solution of

min_{f ∈ R^n} (1/2) ||y − f||_2^2 + λ ||D^(r+1) f||_1.  (2)

The aforementioned problem was first studied by Steidl et al. (2006) in the context of image processing and was called higher order total variation regularization. It was later rediscovered by Kim et al. (2009) and termed trend filtering in the nonparametric regression setting. Kim et al. (2009) specifically explored linear trend filtering (r = 1), which fits piecewise linear models. Tibshirani et al. (2014) extensively studied trend filtering and compared its performance with smoothing splines Green and Silverman (1993) and locally adaptive regression splines Mammen et al. (1997) in the context of nonparametric regression. Tibshirani et al.
(2014) also established that trend filtering enjoys the desirable and strong theoretical properties of locally adaptive regression splines while being computationally less intensive due to its banded penalty matrix. Moreover, trend filtering has an adaptive knot selection property, which makes it well suited for change point analysis.

From a computational and algorithmic standpoint, Kim et al. (2009) described a Primal-Dual Interior Point (PDIP) method for deriving the estimates of the linear trend filtering problem at a fixed value of λ. This can easily be carried over to the trend filtering problem of any order. Wang et al. (2014) suggested an algorithm based on a falling factorial basis, while Ramdas and Tibshirani (2016) derived an algorithm based on the Alternating Direction Method of Multipliers (ADMM) discussed in Boyd et al. (2011). The computational complexity of all these algorithms is of order O(n).

In this paper, we develop a new methodology called Pattern Recovery Using Trend Filtering (PRUTF) for identifying unknown change points in piecewise polynomial signals with no continuity restriction at change point locations. Therefore, a change point is defined as a sudden jump in the signal and all its derivatives up to order r. Figure 2 displays such change points for various r. In this paper, we make the following contributions.

• We propose a generic dual solution path algorithm for trend filtering along its regularization parameter. This solution path, whose basic idea is borrowed from Tibshirani et al. (2011), enables us to determine change points at each level of the regularization parameter. Our algorithm, PRUTF, is different from that of Tibshirani et al. (2011) as we remove (r + 1) coordinates of the dual variables after locating each change point. This adjustment to the algorithm allows us to have independent dual variables between each pair of neighbouring change points.
Besides, the elimination of (r + 1) coordinates at each step leads to a faster implementation of the algorithm.

• We establish a stopping criterion that plays an essential role in the PRUTF algorithm used to find change points. Notably, we show that the dual variables of trend filtering between consecutive change points constitute a Gaussian bridge process. This finding allows us to introduce a threshold for terminating our proposed algorithm.

• If the signal contains a staircase pattern, we prove that the method is statistically inconsistent, which makes it unfavourable. After explaining the reason for this behaviour, we modify our algorithm to produce estimates that are consistent in terms of both the number and locations of change points.

This paper is organized as follows: we first describe how to characterize the dual optimization problem of trend filtering. In Section 3, we develop our main algorithm, PRUTF, to use in constructing the dual solution path of trend filtering and, in turn, identifying the locations of change points. Section 4 discusses the properties of this dual solution path. We establish that the dual variables derived from the solution path form a Gaussian bridge process, which makes them favourable for statistical inference. Applying these properties, we develop a stopping rule for the change point search algorithm in Section 5. The quality of the PRUTF algorithm is validated in terms of pattern recovery of the true signal in Section 6. It is established that the proposed technique in its naive form fails to consistently identify the true signal when a special pattern, called staircase, is present in the signal. Section 7 elaborates on how to modify the algorithm in order to estimate the true pattern consistently. Simulation results and real-world applications are presented in Section 8. We conclude the paper with a discussion in Section 9.
We begin this section by setting up notation that will be used throughout this article. For an m × n matrix A, we denote its rows by A_1, ..., A_m and express the matrix as A = (A_1^T, ..., A_m^T)^T. Now for the set of indices I = {i_1, ..., i_k} ⊆ {1, ..., m}, the notation A_I = (A_{i_1}^T, ..., A_{i_k}^T)^T represents the submatrix of A whose row labels are in the set I. In a similar manner, for a vector a of length m, we let a_I = (a_{i_1}, ..., a_{i_k})^T denote the subvector of a whose coordinate labels are in I. We write A_{−I} and a_{−I} to denote A_{{1,...,m}\I} and a_{{1,...,m}\I}, respectively, where J\I is the set of indices in J but not in I. Furthermore, for selecting the i-th row of A, the notation [A]_i is used, and for its (i, j)-th element, the notation [A]_{ij}. Also, [a]_i extracts the i-th element of the vector a. We write diag(A) to denote the vector of the main diagonal entries of the matrix A. Moreover, for a real number x, ⌊x⌋ denotes the greatest integer less than or equal to x, and ⌈x⌉ denotes the least integer greater than or equal to x. For a set A, the indicator function is denoted by 1(A).

Recall the trend filtering problem

min_{f ∈ R^n} (1/2) ||y − f||_2^2 + λ ||D^(r+1) f||_1,  (3)

where λ ≥ 0 is the regularization parameter and the (n − r − 1) × n penalty matrix D^(r+1) is the difference operator of order (r + 1). For r = 0, the first order difference matrix D^(1) is defined as

D^(1) = [ −1  1  0  ⋯  0 ]
        [  0 −1  1  ⋯  0 ]
        [  ⋮      ⋱      ]
        [  0  ⋯  0 −1  1 ],

and for r ≥ 1, the difference operator of order r + 1 can be computed recursively by D^(r+1) = D^(1) × D^(r). Notice that, in this matrix multiplication, we consider only the submatrix of D^(1) consisting of its first n − r − 1 rows and first n − r columns. Figure 1 displays the trend filtering fits for r = 1, 2, 3.
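The recursion D^(r+1) = D^(1) × D^(r) is straightforward to implement; the sketch below (an illustrative helper, not code from the paper) builds the dense operator with numpy and checks it on the second difference case.

```python
import numpy as np

def diff_operator(n, order):
    """Difference operator D^(order) of shape (n - order, n).

    order = r + 1 gives the penalty matrix of r-th order trend filtering.
    D^(1) is the first difference matrix; higher orders follow the recursion
    D^(k+1) = D^(1) x D^(k), where the left factor is a first difference
    matrix of the appropriate (smaller) size.
    """
    D = np.diff(np.eye(n), axis=0)             # D^(1): (n-1) x n
    for k in range(1, order):
        D1 = np.diff(np.eye(n - k), axis=0)    # first differences, smaller size
        D = D1 @ D
    return D

D2 = diff_operator(6, 2)                       # second differences, shape (4, 6)
print(D2[0])                                   # → [ 1. -2.  1.  0.  0.  0.]
```

As a sanity check, applying D^(3) to a quadratic sequence returns exactly zero, since third differences annihilate degree-2 polynomials.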
Figure 1:
Trend filtering solutions for r = 1, 2, 3, producing (a) piecewise linear, (b) piecewise quadratic and (c) piecewise cubic fits, respectively.

Although the objective function in the r-th order trend filtering problem (3) is strictly convex, and thus the minimization has a guaranteed unique solution, the penalty term is not differentiable in f, so solving the optimization in its current form is difficult. To overcome this difficulty, we follow the argument in Tibshirani et al. (2011) and convert this optimization problem into its dual form. Since the objective function in the primal problem is strictly convex with no constraint, strong duality holds, meaning that the primal and the dual solutions coincide. The trend filtering problem (3) can be rewritten as

min_{f ∈ R^n} (1/2) ||y − f||_2^2 + λ ||z||_1,  subject to  z = Df,

where, for ease of notation, we use D = D^(r+1). For any given λ > 0, the Lagrangian is

L(f, z, u) = (1/2) ||y − f||_2^2 + λ ||z||_1 + u^T (Df − z),

and thus the dual function is given by

g(u) = inf_{f ∈ R^n, z ∈ R^m} L(f, z, u),

which is a concave function defined on R^m, where m = n − r − 1, taking values in R ∪ {−∞, ∞}. The vectors f and u are called the primal and dual variables, respectively. Taking the derivative of the Lagrangian L(f, z, u) with respect to f and setting it equal to zero, we obtain

f = y − D^T u.  (4)

Now substituting this back into the Lagrangian L(f, z, u) and performing certain algebraic manipulations, we obtain

L*(z, u) = inf_{f ∈ R^n} L(f, z, u) = −(1/2) ||y − D^T u||_2^2 + (1/2) ||y||_2^2 + λ ||z||_1 − u^T z.

Minimizing L*(z, u), or equivalently maximizing u^T z − λ ||z||_1, with respect to z ∈ R^m leads us to the dual function g(u). Notice that sup_z {u^T z − λ ||z||_1} is the conjugate of the function λ ||z||_1 in the context of conjugate convex functions; see Brezis (2010) and Boyd and Vandenberghe (2004). This conjugate function is given by

sup_z {u^T z − λ ||z||_1} = 0 if ||u||_∞ ≤ λ, and ∞ otherwise.

From all these, the dual function is given as

g(u) = −(1/2) ||y − D^T u||_2^2 + (1/2) ||y||_2^2,  for ||u||_∞ ≤ λ,

and thus the dual problem is to find the maximum of the dual function g(u). This is equivalent to

min_{u ∈ R^m} ||y − D^T u||_2^2  subject to  ||u||_∞ ≤ λ.  (5)

The constraint in (5) is an ℓ_∞-ball, or a hypercube centered at the origin with boundaries given by the set {−λ, +λ}^m. Since the matrix D is of full row rank, the problem (5) is strictly convex and has a unique solution; see Ali et al. (2019). In addition, notice that the dimension of the dual vector u is m, which is smaller than that of the primal vector f and may lead to relatively faster computations.
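The dual (5) is a least squares problem over a hypercube, so even a naive projected gradient loop can solve it; the sketch below (a dense illustration, not the paper's path algorithm) does so for r = 1 and recovers the primal fit via (4). With λ so large that the constraint is inactive, the fit reduces to the least squares polynomial of degree r, i.e., all second differences of f̂ vanish.

```python
import numpy as np

def trend_filter_dual_pg(y, lam, r=1, n_iter=100000):
    """Solve the dual (5), min ||y - D^T u||^2 s.t. ||u||_inf <= lam,
    by projected gradient, with D = D^(r+1); recover the primal via (4)."""
    n = len(y)
    D = np.eye(n)
    for _ in range(r + 1):                     # build D^(r+1) by repeated differencing
        D = np.diff(D, axis=0)
    step = 1.0 / (4.0 ** (r + 1))              # ||D D^T|| <= 4^(r+1), so this step is safe
    u = np.zeros(n - r - 1)
    for _ in range(n_iter):
        u = u + step * (D @ (y - D.T @ u))     # gradient step on the dual objective
        u = np.clip(u, -lam, lam)              # project onto the hypercube
    return y - D.T @ u                         # primal solution via (4)

y = np.array([0.0, 0.9, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.1])
f_hat = trend_filter_dual_pg(y, lam=1e6)       # constraint inactive: straight-line fit
print(np.max(np.abs(np.diff(f_hat, 2))) < 1e-4)   # → True
```

The iteration count is deliberately generous because this dense first-order scheme converges slowly on the ill-conditioned matrix D D^T; the path algorithm developed below avoids such iterative solves altogether.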
The connection between the primal and the dual solutions is given by the equations

û_λ = λ γ̂,  (6)
f̂_λ = y − D^T û_λ,  (7)

where γ̂ ∈ R^m is a subgradient of ||x||_1 computed at x = D f̂_λ. This subgradient is given by

γ̂_i ∈ {+1} if [D f̂_λ]_i > 0,  {−1} if [D f̂_λ]_i < 0,  [−1, +1] if [D f̂_λ]_i = 0.  (8)

The statements in Equations (6)-(8) are equivalent to the KKT optimality conditions of the primal problem (3). The dual problem (5) demonstrates that D^T û_λ is the projection, P_C(y), of y onto the convex polyhedron (or hypercube here) C = {x ∈ R^m : ||x||_∞ ≤ λ}. From this, the primal solution (7) can be rewritten in the form (I − P_C)(y), representing the residual projection map of y onto the polyhedron C.

Our idea of applying trend filtering to discover change points in piecewise polynomial signals is inspired by Rinaldo et al. (2009) and (Rinaldo, 2014), in which change point detection is studied using the fused lasso. Besides extending to piecewise polynomial signals, the novelty of our work is in providing an exact stopping criterion, which is based on the Gaussian bridge property of the trend filtering dual variables. In addition, we propose an algorithm which, unlike that proposed in Rinaldo et al. (2009), always produces consistent change points even in the presence of staircase patterns.

In this section, we construct and study the solution path of the dual variables û_λ as the regularization parameter decreases from λ = ∞ to λ = 0. This dual solution path identifies the corresponding primal solution using (7). Our PRUTF algorithm is given to compute the entire dual solution path. For any given λ, we call a coordinate of û_λ a boundary coordinate if it is a vertex of the polyhedron C = {x ∈ R^m : ||x||_∞ ≤ λ}, meaning that its absolute value becomes λ.
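For intuition on the residual projection map (I − P_C): projecting a point onto a hypercube clips each coordinate to [−λ, λ], so the residual is exactly coordinatewise soft-thresholding. The two-line numpy check below illustrates this in the simplest setting where the map acts directly on the coordinates (in the trend filtering problem itself the projection acts through D^T, as in (7)).

```python
import numpy as np

lam = 1.5
x = np.array([-3.0, -1.0, 0.4, 2.2])
proj = np.clip(x, -lam, lam)                          # P_C: projection onto the hypercube
resid = x - proj                                      # (I - P_C)(x)
soft = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)  # coordinatewise soft-thresholding
print(np.allclose(resid, soft))                       # → True
```

Coordinates with |x_i| ≤ λ are absorbed by the projection (zero residual), while coordinates that hit the boundary leave a nonzero residual; this mirrors how boundary coordinates of û_λ generate the nonzero entries of D f̂_λ.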
In the process of constructing the solution path, we trace several sets, introduced below.

• The set B = B(λ), called the boundary set, contains the boundary coordinates identified by û_λ.
• For any λ, the vector s_B = s_B(λ), called the sign vector, represents collectively the signs of the boundary points in B(λ).
• The set A = A(λ), called the augmented boundary set, contains the boundary coordinates in B(λ) as well as the first r_a = ⌊(r + 1)/2⌋ coordinates immediately after them.
• For any λ, the vector s_A = s_A(λ) represents collectively the signs of the augmented boundary points in A(λ).

In the following, we discuss the need for the augmented boundary set A. We begin by studying the structure of the dual vector u = Df in a piecewise polynomial signal of order r, where the signal is partitioned into a number of blocks defined by the positions of the change points. Because the signal f is a piecewise polynomial of order r, to compute the i-th coordinate of the vector u, we need r_b = ⌈(r + 1)/2⌉ − 1 points immediately before the i-th element of f as well as r_a = ⌊(r + 1)/2⌋ points immediately after it. Consequently, the first r_a elements of Df within each block cannot be computed. Moreover, within each block, the last r_b + 1 elements of Df are all nonzero due to the existence of a change point. This observation is depicted in Figure 2 for r = 0, 1, 2,
3. To explain this point clearly, consider the case of r = 2 in Figure 2, in which the structure of Df is shown and the true change points are 6 and 13. As can be seen, the points on the boundary – the nonzero coordinates of Df – are B(λ) = (5, 6, 12, 13), with their respective signs s_B(λ) = (1, 1, −1, −1). Moreover, Df does not exist at points 7 and 14. The augmented boundary set contains these points as well as the boundary points; that is, A(λ) = (5, 6, 7, 12, 13, 14), and the signs of the coordinates in A(λ) are given by s_A(λ) = (1, 1, 1, −1, −1, −1). For a given λ, we call the coordinates that belong to the augmented boundary set A(λ) the boundary coordinates, and the rest, the interior coordinates.

At the j-th iteration with λ = λ_j, we assume that the boundary set and its corresponding sign vector are B = B(λ_j) and s_B = s_B(λ_j), respectively. Furthermore, we assume the augmented boundary set and its sign vector are A = A(λ_j) and s_A = s_A(λ_j), respectively. The dual coordinates can be split into boundary coordinates û_{λ_j, A} and interior coordinates û_{λ_j, −A}. Recall from Section 2.1 that û_{λ_j, A} represents the subvector of û_{λ_j} with coordinate labels in the set A, and û_{λ_j, −A} represents the subvector of û_{λ_j} with coordinate labels in the set {1, 2, ..., m}\A. It is apparent from the definition of the boundary coordinates that

û_{λ_j, A} = λ_j s_A.  (9)

Replacing the boundary coordinates with λ_j s_A in (5) and solving the resulting quadratic problem with respect to the interior coordinates leads to their least squares estimates, given by

û_{λ_j, −A} = (D_{−A} D_{−A}^T)^{−1} D_{−A} (y − λ_j D_A^T s_A).  (10)

It should be noted that, for simplicity, we denote (D_A)^T and (D_{−A})^T by D_A^T and D_{−A}^T, respectively. Notice that in expression (10), the first term, (D_{−A} D_{−A}^T)^{−1} D_{−A} y, simply yields the least squares estimate of regressing the response vector y on the design matrix D_{−A}^T. The second term, −λ_j (D_{−A} D_{−A}^T)^{−1} D_{−A} D_A^T s_A, can be interpreted as a shrinkage term due to the condition ||u||_∞ ≤ λ.
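Equations (9) and (10) amount to a block solve of (5) with the boundary coordinates pinned at ±λ. The fragment below (with a toy boundary set chosen purely for illustration) evaluates (10) for the fused lasso case r = 0 and checks it against a direct least squares solve.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 12, 0.7
y = rng.normal(size=n)
D = np.diff(np.eye(n), axis=0)           # D^(1) for r = 0, shape (n-1) x n

A = np.array([4])                         # toy augmented boundary set (0-based)
sA = np.array([1.0])                      # signs of the boundary coordinates
mask = np.ones(n - 1, dtype=bool)
mask[A] = False
D_A, D_I = D[A], D[mask]                  # boundary rows / interior rows

# Equation (10): interior coordinates given u_A = lam * s_A
G = D_I @ D_I.T
u_I = np.linalg.solve(G, D_I @ (y - lam * D_A.T @ sA))

# Direct check: least squares of (y - lam * D_A^T s_A) on the design D_I^T
u_ls, *_ = np.linalg.lstsq(D_I.T, y - lam * D_A.T @ sA, rcond=None)
print(np.allclose(u_I, u_ls))             # → True
```

Since D has full row rank, D_I D_I^T is invertible and the two computations coincide, which is exactly the least-squares-plus-shrinkage reading of (10) given above.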
The expression (10) holds for λ ≤ λ_j until either an interior coordinate joins the boundary or a coordinate in the boundary set leaves the boundary. The following argument explains how to specify the values of λ at which the interior coordinates change. We define the joining time associated with the interior coordinate i ∈ {1, 2, ..., m}\A as the time at which this interior coordinate joins the boundary.

Figure 2: Structure of Df for piecewise polynomial signals with various orders r = 0, 1, 2, 3: (a) piecewise constant, (b) piecewise linear, (c) piecewise quadratic, (d) piecewise cubic. The grey lines display the true signals with two change points at the locations 6 and 13. Empty circles represent the indices at which Df is not defined.

To determine the next joining time, we reduce the value of λ in a linear direction starting from λ_j and solve û_{λ, −A} = (±λ, ..., ±λ)^T. Note that the right-hand side of (10) can be expressed as a − λ b, where

a = (D_{−A} D_{−A}^T)^{−1} D_{−A} y,  b = (D_{−A} D_{−A}^T)^{−1} D_{−A} D_A^T s_A.  (11)

The joining time for every i ∈ {1, 2, ..., m}\A is hence the solution of the equation a_i − λ b_i = ±λ with respect to λ, which is given by

λ_i^join = a_i / (b_i ± 1),  i ∈ {1, 2, ..., m}\A.

Note that λ_i^join is uniquely defined because only one of the signs ±1 yields λ_i^join ∈ [0, λ_j].

Now we turn our attention to the characterization of a coordinate which leaves the boundary set B. For i ∈ B, the leaving time is defined as the time at which the coordinate i leaves the boundary set B. Since s_B is the sign vector of the changes captured by [D f̂]_B, we have diag(s_B) [D f̂]_B > 0, which in turn, along with Equation (7), implies diag(s_B) [D (y − D^T û_λ)]_B > 0. Here, for any vector η, diag(η) denotes the diagonal matrix with the diagonal elements given by η, and η > 0 holds element-wise. Therefore, a coordinate i ∈ B leaves the boundary set B if diag(s_B) [D (y − D^T û_λ)]_B > 0 is violated. Using the relation [D (y − D^T û_λ)]_B = D_B (y − D^T û_λ) and the decomposition D^T û_λ = D_A^T û_{λ, A} + D_{−A}^T û_{λ, −A}, we obtain

diag(s_B) [D (y − D^T û_λ)]_B = c − λ d,  (12)

where

c = diag(s_B) D_B (y − D_{−A}^T a),  d = diag(s_B) D_B (D_A^T s_A − D_{−A}^T b).  (13)

Hence, a leaving time is obtained from the inequality c_i − λ d_i > 0:

λ_i^leave = c_i / d_i  if c_i < 0 and d_i < 0,  and λ_i^leave = 0 otherwise.

The conditions in the equation above are due to the fact that, at the j-th iteration with λ ≤ λ_j, the expression c_i − λ d_i > 0 can only be violated for i ∈ B if both c_i and d_i are negative. An alternative way to determine the next leaving time is to use the KKT optimality conditions of (5). We refer the reader to the supplementary materials of Tibshirani et al. (2011).

The following algorithm, PRUTF, describes the process of constructing the entire dual solution path of trend filtering.

Algorithm 1 (PRUTF)
1. Initialize the set of change point locations as τ = ∅, the empty set.

2. At step j = 1, initialize the boundary set B_1 = {τ_1 − r_b, τ_1 − r_b + 1, ..., τ_1} and its associated sign vector s_{B_1} = {s_1, ..., s_1}, both with cardinality r_b + 1, where τ_1 is obtained by

τ_1 = argmax_{i = 1, ..., m} |û_i|,  (14)

and s_1 = sign(û_{τ_1}), where û_i is the i-th element of the vector û = (D D^T)^{−1} D y. The updated set of change point locations is now τ = {τ_1}. We also record the first joining time λ_1 = |û_{τ_1}| and keep track of the augmented boundary set A_1 = {τ_1 − r_b, ..., τ_1 + r_a} and its corresponding sign vector s_{A_1} = {s_1, ..., s_1} of length r + 1. The dual solution is regarded as û(λ) = (D D^T)^{−1} D y, for λ ≥ λ_1.

3. For step j = 2, 3, ...,

(a) Obtain the pair (τ_j^join, s_j^join) from

(τ_j^join, s_j^join) = argmax_{i ∉ A_{j−1}, s ∈ {−1, 1}}  a_i / (s + b_i) · 1{0 ≤ a_i / (s + b_i) ≤ λ_{j−1}},  (15)

and set the next joining time λ_j^join as the value of a_i / (s + b_i), for i = τ_j^join and s = s_j^join.

(b) Obtain the pair (τ_j^leave, s_j^leave) from

(τ_j^leave, s_j^leave) = argmax_{i ∈ B_{j−1}, s ∈ {−1, 1}}  c_i / d_i · 1{c_i < 0, d_i < 0},  (16)

and assign the next leaving time λ_j^leave as the value of c_i / d_i, for i = τ_j^leave and s = s_j^leave.

(c) Let λ_j = max{λ_j^join, λ_j^leave}; then the boundary set B_j and its sign vector s_{B_j} are updated in the following fashion:

– Either append {τ_j^join − r_b, τ_j^join − r_b + 1, ..., τ_j^join} and the corresponding signs {s_j^join, ..., s_j^join} to B_{j−1} and s_{B_{j−1}}, respectively, provided that λ_j = λ_j^join. Also, add τ_j^join to τ_{j−1}.

– Or remove {τ_j^leave, τ_j^leave + 1, ..., τ_j^leave + r_b} and the corresponding signs {s_j^leave, ..., s_j^leave} from B_{j−1} and s_{B_{j−1}}, respectively, provided that λ_j = λ_j^leave.
Also, remove τ_j^leave from τ_{j−1}.

In the same manner, the augmented boundary set A_j and its sign vector s_{A_j} are formed by adding {τ_j^join − r_b, ..., τ_j^join + r_a} and {s_j^join, ..., s_j^join} to A_{j−1} and s_{A_{j−1}}, respectively, if λ_j = λ_j^join, or, otherwise, by removing the associated sets {τ_j^leave, ..., τ_j^leave + r} and {s_j^leave, ..., s_j^leave} from A_{j−1} and s_{A_{j−1}}. Thus, the dual solution is computed as û_{−A_j}(λ) = a − λ b for the interior coordinates and û_{A_j}(λ) = λ s_{A_j} for the boundary coordinates, over λ_j ≤ λ ≤ λ_{j−1}.

4. Repeat step 3 while λ_j > 0. The critical points λ_1 ≥ λ_2 ≥ ... ≥ 0 specify the entire dual solution path.

Remark 1
Notice that the vector τ derived by the PRUTF algorithm represents the locations of change points for the dual problem of trend filtering. In order to obtain the locations of change points in the primal problem of trend filtering, we must add r_b to every element of τ, that is, {τ_1 + r_b, τ_2 + r_b, ...}. This relationship between the primal and dual change point sets is visible from Figure 2.

Remark 2
For r = 0, the case of the fused lasso, Lemma 1 of Tibshirani et al. (2011), known as the boundary lemma, is satisfied since the matrix D D^T is diagonally dominant, meaning that [D D^T]_{i,i} ≥ Σ_{j ≠ i} |[D D^T]_{i,j}|, for i = 1, ..., m. This lemma states that once a coordinate joins the boundary, it stays on the boundary for the rest of the path. Consequently, part (b) of step 3 in Algorithm 1 is unnecessary, and hence the next leaving time in part (c) is set to zero, i.e., λ_j^leave = 0, for every step j. However, the boundary lemma is not satisfied for r = 1, 2, 3, ....

Remark 3
There is a subtle and important distinction between our proposed algorithm, PRUTF, and the one presented in Tibshirani et al. (2011). The latter work studies the generalized lasso problem for an arbitrary penalty matrix D (unlike the D used in trend filtering, which must have a certain structure). The proposed algorithm in Tibshirani et al. (2011) relies on adding or removing only one coordinate to or from the boundary set at every step. The key attribute of our algorithm is to add or remove r + 1 coordinates to or from the augmented boundary set, an approach inspired by the argument presented at the beginning of this section. Essentially, this attribute makes PRUTF, presented in Algorithm 1, well suited for change point analysis. It is important to mention that the trend filtering solution path requires at least r + 1 data points between neighbouring change points.

Remark 4
For a given λ, equations (9) and (10) give the values of the dual variables in û_λ. They demonstrate that the dual solution path is a piecewise linear function of λ, with changes in slope at the joining or leaving times λ_1 ≥ λ_2 ≥ ... ≥ 0.

Remark 5
The number of iterations required for PRUTF, presented in Algorithm 1, is at most Σ_{i=0}^{p} C(m, i(r+1)) 2^i, where p = ⌈m/(r+1)⌉. However, this upper bound for the number of iterations is usually very loose. It comes from the following realization, discovered by Osborne et al. (2000) and later improved by Mairal and Yu (2012): any pair (A, s_A) appears at most once throughout the solution path. In other words, if (A, s_A) is visited in one iteration of the algorithm, neither (A, s_A) nor (A, -s_A) can reappear for the rest of the algorithm. Interestingly, this fact implies that once a coordinate enters the boundary set, it cannot immediately leave the boundary set at the next step.

An important component of the methodology that we develop in this paper involves computing algebraic expressions based on the matrix D = D^(r+1). In this section, we describe the properties of such expressions. To begin with, let A = {A_1, ..., A_J} and s_A = {s_1, ..., s_J} be the augmented boundary set and its corresponding sign vector, respectively, after a number of iterations of Algorithm 1, where A_j = {τ_j - r_b, τ_j - r_b + 1, ..., τ_j + r_a} and s_j = {s_j, ..., s_j} for j = 1, ..., J. This augmented boundary set corresponds to J change points {τ_1, ..., τ_J} that partition all the dual variables into J + 1 blocks B_j = {τ_j + 1, ..., τ_{j+1}} for j = 0, 1, ..., J, with the conventions that τ_0 = 0 and τ_{J+1} = m. In the following, we list some properties of matrix multiplications involving D.

• It follows from the definition of the matrix D that it is a banded Toeplitz matrix with bandwidth r + 1. It turns out that the matrix DD^T has the same property, meaning that it is a square banded Toeplitz matrix. Moreover, its 2(r + 1) + 1 nonzero row elements are consecutive binomial coefficients of order 2(r + 1) with alternating signs.
In other words, the (i, j)-th element of DD^T for i ≥ j is (-1)^{i-j} C(2(r+1), r+1+i-j). An example is given in Figure 3a. This is a symmetric, nonsingular and positive definite matrix; see Demetriou and Lipitakis (2001).

• The matrix D_{-A} D^T_{-A} is a block diagonal matrix whose diagonal submatrices correspond to the J + 1 blocks. More precisely, the j-th submatrix on the diagonal of D_{-A} D^T_{-A} is the matrix formed by the first (τ_{j+1} - τ_j - r) rows and columns of DD^T; see Figure 3b. Notice that, due to its non-singularity, D_{-A} D^T_{-A} is always invertible. In fact, both (D_{-A} D^T_{-A})^{-1} and (D_{-A} D^T_{-A})^{-1} D_{-A} are block diagonal matrices. Another interesting result is that every row of the matrix (D_{-A} D^T_{-A})^{-1} D_{-A} is a contrast vector, meaning that for any t = 1, ..., m,

Σ_{i=1}^{n} [ (D_{-A} D^T_{-A})^{-1} D_{-A} ](t, i) = 0.

• Another interesting term in analyzing the behaviour of the dual variables is D^T_A s_A. It can be shown that the vector D^T_A s_A can be partitioned into J + 1 subvectors associated with the change points τ_j, j = 1, ..., J. The subvector associated with τ_j, for j = 2, ..., J - 1, is D^T_{A_j} s_{A_j}, whose elements are zero except for the first consecutive r + 1 and the last consecutive r + 1 entries. The first r + 1 nonzero elements of D^T_{A_j} s_{A_j} are the binomial coefficients in the expansion of s_j (x - 1)^{r+1}, and its last r + 1 elements are the binomial coefficients in the expansion of -s_{j+1} (x - 1)^{r+1}. Furthermore, the first r + 1 elements of the first subvector and the last r + 1 elements of the last subvector are also equal to zero. For example, for a piecewise cubic signal (r = 3) with two change points (τ_1, τ_2) and signs (-1, +1), the nonzero entries of D^T_A s_A are the binomial coefficients of -(x - 1)^4 placed around τ_1 and of (x - 1)^4 placed around τ_2.
Consequently, the structure of D^T_{A_j} s_{A_j} allows us to write D^T_A s_A = Σ_{j=0}^{J} D^T_{A_j} s_j.

Equation (9) says that the absolute values of the boundary coordinates are λ, that is,

û(t; λ) = λ s_j for t ∈ A_j.   (17)
(a) Structure of DD^T. (b) Structure of D_{-A} D^T_{-A}.

Figure 3:
Structure of quadratic forms of the matrix D.

On the other hand, the values of the interior coordinates are given by

û(t; λ) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t ( y - λ D^T_{A_1} s_1 ),  for 1 ≤ t < τ_1 - r_b,
û(t; λ) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t ( y - λ ( D^T_{A_{j+1}} s_{j+1} + D^T_{A_j} s_j ) ),  for τ_j + r_a < t < τ_{j+1} - r_b,
û(t; λ) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t ( y - λ D^T_{A_J} s_J ),  for τ_J + r_a < t ≤ m.   (18)

For a given λ, the dual variables û(t; λ) for t = 0, ..., m + 1 can be collectively viewed as a random bridge, that is, a conditioned random walk with drift whose end points are set to zero. Moreover, û(t; λ) is bounded between -λ and λ. The quantity û(t; λ) can also be decomposed into a sum of several smaller random bridges, formed by the blocks created from the change points. Recall that the last consecutive r + 1 elements of the block B_j are λ s_j, for any j = 0, 1, ..., J. Hence, for t = τ_j + r_a, ..., τ_{j+1} - r_b, the random bridge associated with the j-th block is given by

û_j(t; λ) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t ( y - λ [ D^T_{A_{j+1}} s_{j+1} + D^T_{A_j} s_j ] ),  j = 0, ..., J,   (19)

with the conventions s_0 = s_{J+1} = 0 ∈ R^{r+1}. It is important to note that, similar to û(t; λ), the process û_j(t; λ) satisfies the conditions û_j(τ_j + r_a; λ) = λ s_j and û_j(τ_{j+1} - r_b; λ) = λ s_{j+1}. From (19), the process û_j(t; λ) is composed of the stochastic term

û^st_j(t) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t y,   (20)

and the drift term

û^dr_j(t; λ) = -λ [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t ( D^T_{A_{j+1}} s_{j+1} + D^T_{A_j} s_j ).
(21)

According to model (1) with Gaussian noise, it turns out that the discrete time stochastic process û^st_j(t) can be embedded in a continuous time Gaussian bridge process which is bounded between -λ and λ. The following theorem describes the characteristics of this process.

Theorem 1 Suppose the observation vector y is drawn from the model (1), where the error vector ε has a Gaussian distribution with mean zero and covariance matrix σ² I. For given D and A,

(a) The continuous time stochastic processes W_j = { W_j(t) : (τ_j + r_a)/m ≤ t ≤ (τ_{j+1} - r_b)/m }, where

W_j(t) = (τ_{j+1} - τ_j - r)^{-(r+1)/2} Σ_{i = τ_j + r_a}^{⌊mt⌋} C(r + i - 1, r) y_i,   (22)

are Gaussian processes with stationary and independent increments.

(b) Let

W̃_j(t) = (τ_{j+1} - τ_j - r)^{-(r+1)/2} [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_{⌊mt⌋} y;   (23)

then the stochastic process W̃_j = { W̃_j(t) : (τ_j + r_a)/m ≤ t ≤ (τ_{j+1} - r_b)/m } is a Gaussian bridge process derived from W_j by conditioning on both events W_j((τ_j + r_a)/m) = 0 and W_j((τ_{j+1} - r_b)/m) = 0, where W_j is given in part (a). The mean of W̃_j is zero and its covariance function is given by

Cov( W̃_j(t), W̃_j(t′) ) = σ² [ (D_{-A}D^T_{-A})^{-1} ]( ⌊mt⌋, ⌊mt′⌋ ),   (24)

for any (τ_j + r_a)/m ≤ t, t′ ≤ (τ_{j+1} - r_b)/m.

(c) The process W̃_j is independent of any other process W̃_{j′} for any j′ ≠ j.

This theorem can be extended to the case of non-Gaussian random variables and therefore establishes a Donsker-type central limit theorem for W_j and W̃_j. A proof is given in Appendix A.1. Theorem 1 guarantees that the dual variable process associated with the j-th block, i.e.
û_j = { û(⌊mt⌋; λ) : (τ_j + r_a)/m ≤ t ≤ (τ_{j+1} - r_b)/m }, is a Gaussian bridge process with the drift term

-λ [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_{⌊mt⌋} ( D^T_{A_{j+1}} s_{j+1} + D^T_{A_j} s_j ),   (25)

and the covariance function stated in (24).

Recall that a standard Brownian bridge process defined on the interval [a, b] is a standard Brownian motion B(t) conditioned on the event B(a) = B(b) = 0. It is often constructed from a Brownian motion B(t) with B(a) = 0 by setting

B̃(t) = B(t) - ((t - a)/(b - a)) B(b).

The mean and covariance functions of B̃(t) are given by E[B̃(t)] = 0 and

Cov( B̃(s), B̃(t) ) = min{s - a, t - a} - (s - a)(t - a)/(b - a),

for any s, t ∈ [a, b], respectively. A Gaussian bridge process is an extension of the Brownian bridge process in which the Brownian motion B(t), in the definition of the Brownian bridge B̃(t), is replaced by a more general Gaussian process G(t). See, for example, Gasbarra et al. (2007).

Remark 6
The celebrated Donsker theorem (Donsker, 1951) states that the partial sum process of a sequence of i.i.d. random variables, with mean zero and variance 1, converges weakly to a Brownian bridge process; see van der Vaart and Wellner (1996) or Billingsley (2013). A version of Theorem 1 involving non-Gaussian random variables would extend this result to weighted partial sum processes and show that the limiting process is a Gaussian bridge with a certain covariance structure. So the Gaussian assumption in Theorem 1 is not restrictive. It is also interesting to show that for r = 0, 1, the process û^st_j(⌊mt⌋) boils down to its respective CUSUM process. To show this, consider the interval [(τ_j + r_a)/m, (τ_{j+1} - r_b)/m].

• For piecewise constant signals, i.e. r = 0, the quantity [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_{⌊mt⌋} y can be written as the inner product of y with a weight vector that is zero outside the j-th block, equals 1 - (⌊mt⌋ - τ_j)/(τ_{j+1} - τ_j) on the coordinates τ_j + 1, ..., ⌊mt⌋, and equals -(⌊mt⌋ - τ_j)/(τ_{j+1} - τ_j) on the coordinates ⌊mt⌋ + 1, ..., τ_{j+1}. Notice that this is the CUSUM statistic for the j-th segment, that is,

Σ_{k = τ_j + 1}^{⌊mt⌋} [ y_k - ȳ_{(τ_j+1):τ_{j+1}} ],   (26)

where ȳ_{(τ_j+1):τ_{j+1}} is the sample average of (y_{τ_j+1}, ..., y_{τ_{j+1}}). It is well known that the CUSUM statistic (26) converges weakly to the Brownian bridge.
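This equivalence is easy to check numerically for a single segment (A = ∅). A toy verification (our own sketch, assuming the difference convention (Dy)_i = y_{i+1} - y_i, under which the dual process equals the centered cumulative sums up to sign):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
y = rng.normal(size=n)

# r = 0: D is the first-order difference operator, (Dy)_i = y_{i+1} - y_i
D = np.diff(np.eye(n), axis=0)

# Stochastic term of the dual variables over a single segment (A empty)
u = np.linalg.solve(D @ D.T, D @ y)

# CUSUM of the centered observations, as in (26)
cusum = np.cumsum(y - y.mean())[:-1]

# The two processes coincide up to the sign convention chosen for D
assert np.allclose(u, -cusum)
print("dual stochastic term matches the CUSUM statistic")
```

The sign flip is purely an artifact of how the rows of D are oriented; the extrema and excursion magnitudes of the two processes are identical.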
In addition, for any (τ_j + r_a)/m ≤ t′ ≤ t ≤ (τ_{j+1} - r_b)/m, the covariance function becomes

[ (D_{-A}D^T_{-A})^{-1} ]( ⌊mt⌋, ⌊mt′⌋ ) = (⌊mt′⌋ - τ_j) - (⌊mt′⌋ - τ_j)(⌊mt⌋ - τ_j)/(τ_{j+1} - τ_j),

which is identical to the covariance function of the Brownian bridge.

• For piecewise linear signals, i.e. r = 1, the quantity [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_{⌊mt⌋} y reduces to

Σ_{k = τ_j + 1}^{⌊mt⌋} k [ y_k - f̂_k ],   (27)

where f̂ is the ordinary least squares fit of the simple linear regression of (y_{τ_j+1}, ..., y_{τ_{j+1}}) onto (τ_j + 1, ..., τ_{j+1}). As proved in Theorem 1, the statistic (27) is a Gaussian bridge process. Furthermore, using the results in Hoskins and Ponzo (1972), for any (τ_j + r_a)/m ≤ t′ ≤ t ≤ (τ_{j+1} - r_b)/m, the covariance function of this Gaussian bridge process is given by

[ (D_{-A}D^T_{-A})^{-1} ]( ⌊mt⌋, ⌊mt′⌋ ) = [ (∆_j - ⌊mt⌋ + τ_j)(∆_j - ⌊mt⌋ + τ_j + 1) / ( 3 ∆_j (∆_j + 1)(∆_j + 2) ) ] × (⌊mt′⌋ - τ_j)(⌊mt′⌋ - τ_j + 1) × [ (⌊mt⌋ - τ_j + 1)(⌊mt′⌋ - τ_j - ∆_j + 2) - (⌊mt⌋ - τ_j)(⌊mt′⌋ - τ_j + 2) ∆_j ],

where ∆_j = τ_{j+1} - τ_j.

This section concerns developing a stopping criterion for the PRUTF algorithm. We provide tools for deriving a threshold value at which the PRUTF algorithm terminates the search if no dual variable exceeds this threshold. Consider the dual variables at the first step of the algorithm, i.e. û^st(t) = [ (DD^T)^{-1} D ]_t y for t = 0, ...
, m, which correspond to A = ∅. It turns out that û^st(t) is a stochastic process whose local minima and maxima are attained at the change points. This structure is displayed by the cyan-colored lines in Figure 4 for both piecewise constant (r = 0) and piecewise linear (r = 1) signals. As the PRUTF algorithm detects more change points and forms the augmented boundary set A, the local minima or maxima corresponding to these change points are removed from the stochastic process

û^st_{-A}(t) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t y = Σ_{j=0}^{J} û^st_j(t) 1{t ∈ B_j},   (28)

for t = 1, ..., m - |A|. This fact is shown by the olive-colored lines in Figure 4. The last equality in (28) expresses that û^st_{-A}(t) is the stochastic term of the dual variables for all the interior coordinates and is derived by stacking the stochastic terms of the dual variables associated with the j-th block, û^st_j(t), as defined in (20), for j = 0, ..., J. This behaviour suggests a way to introduce a stopping rule for the PRUTF algorithm. As can be seen from the orange-colored lines of Figure 4, if all true change points are captured by the algorithm and stored in the augmented set A, the resulting process û^st_{-A}(t) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t y, for t = 0, ..., m - |A|, contains no noticeable optimum points and tends to fluctuate close to the zero line (x-axis). We terminate the search in Algorithm 1 at step j by checking whether the maximum of |û^st_{-A_j}(t)|, for t = 0, ..., m - |A_j|, is smaller than a certain threshold. To specify this threshold exactly, as suggested by Theorem 1, we need to calculate the excursion probabilities of a Gaussian bridge process.
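The removal behaviour described above can be reproduced in a toy noiseless example (our own sketch for r = 0): the full dual process has its extremum exactly at the change point, and deleting the corresponding row of D, mimicking D_{-A}, collapses the remaining dual process onto the zero line.

```python
import numpy as np

n = 50
y = np.concatenate([np.zeros(25), np.ones(25)])   # noiseless single step at 25

D = np.diff(np.eye(n), axis=0)                    # r = 0 difference operator
DDt = D @ D.T                                     # banded Toeplitz matrix
assert np.allclose(np.diag(DDt), 2)               # 2 on the diagonal for r = 0

# Full dual process: its extremum sits at the change point
u_full = np.linalg.solve(DDt, D @ y)
assert int(np.argmax(np.abs(u_full))) == 24       # dual coordinate of the jump

# Remove the boundary row (mimicking D_{-A}); with every change captured,
# the remaining dual process collapses onto the zero line
D_minus = np.delete(D, 24, axis=0)
u_rest = np.linalg.solve(D_minus @ D_minus.T, D_minus @ y)
assert np.abs(u_rest).max() < 1e-10

print(np.abs(u_full).max())  # approximately 12.5
```

With noise present, u_rest would not vanish exactly but would fluctuate near zero, which is precisely the regime the stopping rule below is designed to detect.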
As stated in Adler and Taylor (2009), analytic formulas for the excursion probabilities are known to be available only for a small number of Gaussian processes.

(a) Piecewise constant with r = 0. (b) Piecewise linear with r = 1.

Figure 4:
The cyan-colored lines show the dual variables for the full matrix D. Dual variables computed after removing the rows of the matrix D associated with τ_1, that is D_{-A}, are displayed by the olive-colored lines. The augmented boundary set A corresponding to τ_1 and τ_2 results in the dual variables shown by the orange-colored lines.

One such Gaussian process is the Brownian bridge. It is well known that for the Brownian bridge process B(t) defined on the interval [a, b],

Pr( sup_{a ≤ t ≤ b} |B(t)| ≥ x ) = 2 Σ_{i=1}^{∞} (-1)^{i+1} exp( -2 i² x² / (b - a) ).   (29)

See, for example, Adler and Taylor (2009) and Shorack and Wellner (2009). So for piecewise constant signals, the required threshold for stopping Algorithm 1 can be obtained from (29) for a suitably chosen interval [a, b]. That is, for a given value α, we choose x_α such that Pr( sup_{a ≤ t ≤ b} |B(t)| ≥ x_α ) = α. Therefore, for r = 0 and a = 0, b = 1, we stop Algorithm 1 at iteration j if

max_{0 ≤ t ≤ 1} | û^st_{-A_j}(⌊kt⌋) | ≤ σ x_α √k, for k = m - |A_j|. For r >
0, the threshold is obtained in a similar fashion. Although the excursion probabilities of Gaussian bridge processes are not known in general, we notice that by adopting the steps of the proof of (29) in Beghin and Orsingher (1999), we can establish a similar formula for the Gaussian bridge process G(t) in Theorem 1 as

Pr( sup_{a ≤ t ≤ b} |G(t)| ≥ x ) = 2 Σ_{i=1}^{∞} (-1)^{i+1} exp( -2 i² x² / S_r(k) ),   (30)

where k = m - |A_j| and the quantity S_r(k) is the k-th diagonal element of the matrix (D_{-A_j} D^T_{-A_j})^{-1}. So, we stop Algorithm 1 at iteration j if

max_{0 ≤ t ≤ 1} | û^st_{-A_j}(⌊kt⌋) | ≤ σ x_α (k - r)^{(r+1)/2}, for k = m - |A_j|,   (31)

where x_α is derived from the equation

2 Σ_{i=1}^{∞} (-1)^{i+1} exp( -2 i² x_α² / S_r(k) ) = α.   (32)

The main purpose of this section is to investigate whether the method can recover features of the true signal f. We also demonstrate conditions under which the structure of the estimated signal f̂ matches that of the true signal f. To assess the performance of trend filtering in recovering the true signal, we first define what we mean by pattern recovery.

Definition 1 (Pattern Recovery): A trend filtering estimate f̂ recovers the pattern of the true signal f if

sign([D f̂]_i) = sign([D f]_i), for i = 1, ..., m,   (33)

where m = n - r - 1 is the number of rows of the matrix D. We use the notation f̂ =_pr f to briefly denote the pattern recovery property of f̂. In the asymptotic framework, a trend filtering estimate is called pattern consistent if

Pr( f̂ =_pr f ) → 1 as n → ∞,   (34)

where we write f̂ = f̂(n) to denote its dependence on the sample size n.
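Returning briefly to the stopping rule, the threshold x_α appearing in (29) and (32) can be obtained by numerically inverting the alternating series. A sketch for the Brownian bridge case (a = 0, b = 1, so the series scale is b - a = 1); the helper names are our own, and for α = 0.05 the inversion recovers the familiar critical value x_α ≈ 1.358:

```python
import math

def bridge_tail(x, scale=1.0, terms=100):
    """Series (29): Pr(sup |B(t)| >= x) for a Brownian bridge on an
    interval with b - a = scale. Accurate when x*x/scale is not too
    small (the alternating series converges slowly near zero)."""
    return 2.0 * sum((-1) ** (i + 1) * math.exp(-2.0 * i * i * x * x / scale)
                     for i in range(1, terms + 1))

def critical_value(alpha, scale=1.0):
    """Find x_alpha with bridge_tail(x_alpha) = alpha by bisection."""
    lo, hi = 0.3, 10.0          # bridge_tail(0.3) is close to 1 for scale = 1,
    for _ in range(200):        # so the root lies inside this bracket
        mid = 0.5 * (lo + hi)
        if bridge_tail(mid, scale) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x_alpha = critical_value(0.05)
print(round(x_alpha, 3))  # approximately 1.358
```

Replacing `scale` by S_r(k) gives the analogous inversion for the Gaussian bridge formula, under the reconstructed form of (32) above.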
Pattern recovery is very similar to the concept of sign recovery in the lasso (Zhao and Yu (2006); Wainwright (2009)), as it deals with the specification of both the locations of non-zero coefficients and their signs.

The problem of pattern recovery has been studied for the special case of the fused lasso in several papers. Rinaldo et al. (2009) derived conditions under which the fused lasso consistently identifies the true pattern. This was contradicted by Rojas and Wahlberg (2014), who argued that the fused lasso does not always succeed in discovering the exact change points. Rojas and Wahlberg (2014) showed that the fused lasso can be reformulated as the usual lasso, for which necessary conditions for exact sign recovery have been established in the literature. They then proved that one such necessary condition, known as the irrepresentable condition, is not satisfied for the transformed lasso when there is a specific pattern called a staircase (Definition 2). Corrections to Rinaldo et al. (2009) appeared in Rinaldo (2014). Later on, Qian and Jia (2016) proposed a method called the puffer transformation, which is shown to be consistent in specifying the exact change points, including in the presence of staircases.

In the remaining part of this section, we use the dual variables to demonstrate the situations in which trend filtering recovers signal patterns correctly. Exact pattern recovery implies that the dual variables are comprised of J + 1 consecutive bounded processes whose endpoints correspond to the true change points. The following theorem describes the situations in which exact pattern recovery can be attained. A particular case of this result in the context of piecewise constant signals was established in Rinaldo (2014).

Theorem 2
Exact pattern recovery in trend filtering occurs when the discrete time processes { û^st_j(t), t = τ_j + r_a, ..., τ_{j+1} - r_b } for j = 0, ..., J satisfy the following conditions simultaneously with probability one:

(a) First block constraint: for t = 1, ..., τ_1 - r_b,

-λ ( 1 - [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t D^T_{A_1} 1 ) ≤ û^st_0(t) ≤ λ ( 1 + [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t D^T_{A_1} 1 ).   (35)

(b) Last block constraint: for t = τ_J + r_a, ..., m,

-λ ( 1 + [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t D^T_{A_J} 1 ) ≤ û^st_J(t) ≤ λ ( 1 - [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t D^T_{A_J} 1 ).   (36)

(c) Interior block constraints: for t = τ_j + r_a, ..., τ_{j+1} - r_b, if s_{j+1} ≠ s_j,

-λ ( 1 - [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t ( D^T_{A_{j+1}} - D^T_{A_j} ) 1 ) ≤ û^st_j(t) ≤ λ ( 1 + [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t ( D^T_{A_{j+1}} - D^T_{A_j} ) 1 ),   (37)

and if s_{j+1} = s_j, which corresponds to a staircase block, û^st_j(t) ≤ 0 or û^st_j(t) ≥ 0. In the foregoing equations, 1 = 1_{r+1} is a vector of size r + 1 whose elements are all 1. The proof of this theorem is given in Appendix A.2.

We analyze the performance of the PRUTF algorithm in pattern recovery in two different scenarios:

• signals with staircase patterns,
• signals without staircase patterns.

To the authors' knowledge, Rojas and Wahlberg (2014) is the first paper to carefully investigate the staircase pattern for piecewise constant signals in the change point analysis setting. In Rojas and Wahlberg (2014), a staircase pattern refers to the phenomenon of equal signs in two consecutive changes.
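The pattern recovery criterion (33) itself is mechanical to evaluate. A hypothetical helper (the function names are our own), shown here for a piecewise constant signal:

```python
import numpy as np

def diff_matrix(n, order):
    """Discrete difference operator D^(order), shape (n - order, n)."""
    D = np.eye(n)
    for _ in range(order):
        D = np.diff(D, axis=0)
    return D

def recovers_pattern(f_hat, f, r):
    """Definition 1: sign([D f_hat]_i) == sign([D f]_i) for all i,
    with D = D^(r+1); rounding guards against floating point noise."""
    D = diff_matrix(len(f), r + 1)
    return np.array_equal(np.sign(np.round(D @ f_hat, 10)),
                          np.sign(np.round(D @ f, 10)))

f = np.concatenate([np.zeros(5), np.ones(5)])        # one jump, r = 0
assert recovers_pattern(f, f, r=0)                    # exact estimate
assert not recovers_pattern(np.zeros(10), f, r=0)     # flat estimate misses it
print("pattern recovery checks passed")
```

Note that pattern recovery is a statement about locations and signs of changes only; an estimate can recover the pattern while being biased in level.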
We extend this concept to the general case, which covers any piecewise polynomial signal of order r, by applying the penalty matrix D = D^(r+1).

(a) Piecewise constant signal with staircase block (50, 80]. (b)
Piecewise linear signal with staircase block (20, 55].
Figure 5:
Piecewise constant and piecewise linear signals with staircase patterns at blocks (50, 80] and (20, 55] (shown by different colors) and their corresponding dual variables.
Definition 2
Suppose that the true signal f is a piecewise polynomial of order r with change points at the locations τ = (τ_1, ..., τ_J). Moreover, let B = {B_0, ..., B_J} be the blocks created by the change points τ. A staircase occurs in block B_j, j = 1, ..., J - 1, if

sign( [Df]_{τ_j} ) = sign( [Df]_{τ_{j+1}} ).   (38)

The drift term (21) plays a key role in assessing the performance of PRUTF in pattern recovery. For a staircase block, say B_j, this drift becomes simply λ s_j, which is constant in t over the entire block. Consequently, the interior dual variables û_j(t; λ) for the block B_j contain only the stochastic term û^st_j(t) = [ (D_{-A}D^T_{-A})^{-1} D_{-A} ]_t y, which fluctuates around the line λ s_j. Recall that the KKT conditions for the dual problem of trend filtering require û_j(t; λ) to stay within the lines -λ and λ. Thus, the trend filtering estimate is sensitive to the variability of the random noise and identifies change points once û^st_j(t) touches the ±λ boundaries. Examples of piecewise constant and piecewise linear signals, along with their corresponding dual variables, are depicted in Figure 5, in which the above argument can be clearly seen.

The following theorem investigates the consistency of trend filtering in pattern recovery, both with and without staircases. Specifically, it shows that for a signal without a staircase, the exact pattern recovery conditions stated in Theorem 2 are satisfied with probability one. On the other hand, in the presence of staircases in the signal, the probability of these conditions holding, which is equivalent to the probability of a Gaussian bridge process never crossing the zero line, converges to zero.

Theorem 3 The following statements hold for the trend filtering estimates:

(a) (Non-staircase blocks):
Suppose there is no staircase block in the true signal f. If for some ε > 0, the conditions

λ < L_min δ_min, and λ/σ > (1 + ε) L_max √( 2 log((J + 1) L_max) ),   (39)

hold, where L_min = min_{j=0,...,J} (τ_{j+1} - τ_j - r)^{r+1}, L_max = max_{j=0,...,J} (τ_{j+1} - τ_j - r)^{r+1} and δ_min = min_{j=0,...,J} |f_{τ_{j+1}} - f_{τ_j}|, then the PRUTF algorithm guarantees exact pattern recovery with probability approaching one, that is, Pr( f̂ =_pr f ) → 1.

(b) (Staircase blocks): If the true signal f contains at least one staircase block, then for any λ > 0 and any sample size n, the probability of exact pattern recovery converges to zero, that is, Pr( f̂ =_pr f ) → 0.

A proof is given in Appendix A.3. According to Theorem 3, if there is no staircase pattern in the underlying signal, trend filtering yields pattern consistent estimates. Given the results in Theorem 3, a natural question is whether Algorithm 1 can be modified to enjoy consistent pattern recovery in every case. In the next section, we present an effective remedy based on altering the sign of a change associated with a staircase block.
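The staircase condition (38) can be checked directly from the differenced signal. A hypothetical helper (our own naming) that flags pairs of consecutive change points carrying equal signs:

```python
import numpy as np

def staircase_blocks(f, r=0):
    """Return pairs of consecutive change points of f (a piecewise
    polynomial of order r) whose (r+1)-th difference signs agree,
    i.e. the staircase blocks of Definition 2."""
    D = np.eye(len(f))
    for _ in range(r + 1):
        D = np.diff(D, axis=0)
    d = np.round(D @ f, 10)
    idx = np.flatnonzero(d)                  # change point locations (dual scale)
    signs = np.sign(d[idx])
    return [(int(idx[j]), int(idx[j + 1]))   # block with equal consecutive signs
            for j in range(len(idx) - 1) if signs[j] == signs[j + 1]]

# Piecewise constant signal with an upward staircase: 0 -> 1 -> 2 -> 1
f = np.concatenate([np.zeros(4), np.ones(4), 2 * np.ones(4), np.ones(4)])
print(staircase_blocks(f, r=0))  # [(3, 7)]: the block between the two +1 jumps
```

Running such a check on a candidate signal indicates in advance whether part (b) of Theorem 3 applies, i.e. whether the unmodified algorithm is at risk of false discoveries.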
In this section, we modify the PRUTF algorithm in such a way that it produces consistent estimates of the number and locations of change points even in the presence of staircase patterns. As previously mentioned, for a staircase block, the drift term (21) is constant, which leads to false discoveries of change points. This is shown in Figure 6 for a piecewise constant signal of size n = 100 with true change points at τ = (15, 40, 50, 80); the staircase block (50, 80] leads to three false discoveries at the locations 52, 54 and 76.

The inconsistency of PRUTF in the presence of a staircase, as established in Theorem 3, stems from the fact that the change signs of the two consecutive change points at both ends of the staircase block are identical; that is, for the staircase block B_j, sign([Df]_{τ_j}) = sign([Df]_{τ_{j+1}}). Therefore, a question arises: can we modify Algorithm 1 in such a way that the change signs of two neighbouring change points never become equal, while still yielding the solution path of trend filtering? We suggest a simple but very efficient answer to this question. Once a new change point is identified, we check whether its r-th order difference sign is the same as that of the change points immediately before and after it. If these change signs are not identical, the procedure continues to search for the next change point. Otherwise, the modified PRUTF algorithm replaces the sign of the neighbouring change point with zero. This replacement of the sign prevents the drift term (21) from becoming zero. This idea is implemented for the above signal, and the result is displayed in Figure 7. As shown in panel (b), the sign of the first change point at location 50 is set to zero since its sign is identical to the sign of the second

(a) The first true change point at τ = 50. (b) The second true change point at τ = 15. (c) The third true change point at τ = 80. (d) The fourth true change point at τ = 40.

Figure 6:
The process of detecting change points using Algorithm 1 for a signal with a staircase pattern. In panel (b), there are three falsely detected change points (52, 54, 76), which is due to the staircase block (50, 80].

change point at 15. This sign replacement removes the false discoveries that appeared in panel (b) of Figure 6.

Based on the above argument, the trend filtering solution path presented in Algorithm 1 can be modified as follows to avoid false discoveries and to produce consistent pattern recovery.

Algorithm 2 (Modified PRUTF)

(a) The first true change point at τ = 50. (b) The second true change point at τ = 15. (c) The third true change point at τ = 80. (d) The fourth true change point at τ = 40.

Figure 7:
Steps of the modified trend filtering algorithm until all four true change points are identified.

1. Execute step 1 of Algorithm 1.
2. Execute step 2 of Algorithm 1.
3. (a) Execute part (a) of step 3 in Algorithm 1 to obtain τ_j^join and its sign s_j^join. At this point, the algorithm checks whether s_j^join is identical to the signs of the change points just before and after τ_j^join. If so, set the sign of the change point that is identical to s_j^join to zero in s_{A_{j-1}} and s_{B_{j-1}}. Then, repeat part (a) of step 3 again to obtain new τ_j^join and s_j^join and update the sets A_j and B_j.
   (b) Execute part (b) of step 3 in Algorithm 1.
   (c) Execute part (c) of step 3 in Algorithm 1.
4. Repeat step 3 until either λ_j reaches 0 or a stopping rule is met.

Remark 7
In part (a) of step 3 of the modified PRUTF, presented in Algorithm 2, it is impossible for the sign s_j^join of the new change point to be identical to the sign of either of its immediate neighbouring change points, because the algorithm checks the equality of signs at each previous step; if they are equal, the sign of the immediate neighbouring change point is set to zero.

Recall that the KKT optimality conditions for solutions of the trend filtering problem in (5) require the dual variables û_λ to be less than or equal to λ in absolute value, i.e., |û_λ| ≤ λ. This condition still holds when we replace the sign values (+1 or -1) with 0. Consequently, we have the following theorem.
Theorem 4
The modified PRUTF algorithm presented in Algorithm 2 is a solution path of trend filtering.
For brevity, we do not provide the proof of Theorem 4 here; we refer the reader to the similar arguments for the LARS algorithm of the lasso in Tibshirani et al. (2013).

(a)
A piecewise constant signal with blocks 4 and 8 as staircase blocks.
A piecewise linear signal with blocks 3 and 5 as staircase blocks.
Figure 8:
The frequency plots of estimated change points using the PRUTF and mPRUTF algorithms.
It is worth pointing out that the modified PRUTF (mPRUTF) algorithm requires slightly more computation than the original PRUTF algorithm. The increase in computation time depends directly on the number of staircase blocks in the underlying signal. To show how mPRUTF resolves the problem of false discovery in signals with staircases, we ran both algorithms on 1000 datasets generated from a piecewise constant and a piecewise linear signal. The frequency plots of the estimated change points for both algorithms are presented in Figure 8. The figure reveals that the original algorithm produces false discoveries within staircase blocks for both signals, whereas mPRUTF resolves this issue.
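The sign-replacement bookkeeping at the heart of Algorithm 2 can be sketched in a few lines. This is our own simplified rendering, not the authors' implementation: when a newly joined change point carries the same sign as an immediate neighbour, that neighbour's sign is set to zero.

```python
import bisect

def zero_matching_neighbours(taus, signs, tau_new, s_new):
    """Insert a new change point into the sorted list; if an immediate
    neighbour carries the same sign, set that neighbour's sign to 0
    (cf. step 3(a) of the modified PRUTF)."""
    pos = bisect.bisect_left(taus, tau_new)
    taus = taus[:pos] + [tau_new] + taus[pos:]
    signs = signs[:pos] + [s_new] + signs[pos:]
    for nb in (pos - 1, pos + 1):            # immediate neighbours
        if 0 <= nb < len(signs) and signs[nb] == s_new:
            signs[nb] = 0                    # break the staircase pattern
    return taus, signs

# Change points at 15 and 80 with signs (+1, -1); a new +1 change joins
# at 50, so the matching left neighbour has its sign zeroed.
taus, signs = zero_matching_neighbours([15, 80], [1, -1], 50, 1)
print(taus, signs)  # [15, 50, 80] [0, 1, -1]
```

In the full algorithm the zeroed sign also feeds back into the drift term (21), which is what suppresses the spurious boundary touches inside staircase blocks.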
In this section, we provide numerical studies to demonstrate the effectiveness and performance of the proposed algorithm, PRUTF. We begin with a simulation study and then provide real data analyses.
In this section, we investigate the performance of our proposed method, PRUTF or mPRUTF,by a simulation study. We consider two scenarios, namely piecewise constant and piecewiselinear signals with staircase patterns. We compare our method to some powerful state-of-the-art approaches in change point analysis. These methods, a list of their available packages onCRAN, and their applicability for different scenarios are listed in Table 1.
Method   Reference                          R Package     PWC   PWL
PELT     Killick et al. (2012)              changepoint   ✓     ✗
WBS      Fryzlewicz et al. (2014)           wbs           ✓     ✗
SMUCE    Frick et al. (2014)                stepR         ✓     ✗
NOT      Baranowski et al. (2019)           not           ✓     ✓
ID       Anastasiou and Fryzlewicz (2019)   IDetect       ✓     ✓

Table 1:
List of change point detection and estimation methods with their packages in R. The last two columns indicate which methods can be applied to piecewise constant (PWC) or/and piecewise linear (PWL) signals.
We adopted the simulation setting of Baranowski et al. (2019) and consider piecewise constant and piecewise linear signals as follows.

(i) A piecewise constant signal (PWC) of length n = 2024 with J = 8 change points. The locations of the true change points are τ = (205, 308, 512, 820, 902, 1332, 1557, 1659), with jump sizes -1.464, 1.656, 1.098, 0.830, -1.537, 0.768, 1.574, -1.335, 1.537. We set the starting intercept to 0.

(ii) A piecewise linear signal (PWL) of size n = 1408 with J = 7 change points. The true change points are located at τ = (256, 512, 768, 1024, 1152, 1280, 1344). The

(a) PWC signal with staircases at blocks (512, 820] and (1557, 1659]. (b) PWL signal with staircases at blocks (512, 768] and (1024, 1152].

Figure 9:
Two signals with the generated samples used in the simulation study. corresponding intercepts and slopes for 8 created blocks by τ are 0.111, 0.553, -0.481,3.002,-7.169, -0.030, 7.217, -0.958 and − × − , × − , − × − , − × − , × − , × − , − × − , × − , respectively.Figure 9 displays the true PWC and PWL signals, with their representative datasets generatedusing model (1). We note that both PWC and PWL signals contain two staircase blocks. Theseblocks for the PWC signal are (512 , , , , N (0 , σ ). Moreover, we set the significancelevel to α = 0 .
05 for the stopping criterion of (32) provided in Section 5 to terminate thealgorithm.In order to explore the impact of different noise levels on the change point methods, werun each simulation for various values of σ in { . , , . , , . , , . , , . , } . We run thesimulation N = 2000 times and report the results for each change point technique in termsof estimates of the number of change points, estimates of the mean square error given byMSE = N − (cid:80) Ni =1 ( (cid:98) f i − f i ) , estimates of the scaled Hausdorff distance given by d H = 1 N max (cid:40) max j =0 , ..., J min i =0 , ..., (cid:98) J | (cid:98) τ i − τ j | , max i =0 , ..., (cid:98) J min j =0 , ..., J | (cid:98) τ i − τ j | (cid:41) , a) Average number of change points. (b)
MSE of signal estimation. (c)
Hausdorff distance of the estimations. (d)
Computation time in seconds.
Figure 10:
The estimated average number of change points, MSE and Hausdorff distance, aswell as the computation time of various methods for PWC signal. The results are provided fordifferent values of the noise variability σ . and the computation time in seconds. These quantities are frequently used to assess the per-formance of a change point detection techniques in the literature, for example, see Baranowskiet al. (2019), Anastasiou and Fryzlewicz (2019). The signal estimate, (cid:98) f , is computed by theleast square fit of a polynomial of order r to the observations within segments created by eachmethod. We also remark that the tuning parameters and stopping criteria for the methods29 a) Average number of change points. (b)
MSE of signal estimation. (c)
Hausdorff distance of the estimations. (d)
Computation time in seconds.
Figure 11:
The estimated average number of change points, MSE and Hausdorff distance, aswell as the computation time of various methods for PWL signal. The results are provided fordifferent values of the noise variability σ . listed in Table 1 are set to the default values by the packages.The results for the PWC and PWL signals are presented in Figure 10 and Figure 11, respec-tively. In the case of piecewise constant signal, as in Figure 10, mPRUTF performs comparableto WBS, NOT and ID in terms of the average number of change points, MSE and Hausdorffdistance up to σ = 3, and outperforms them as σ increases. From a computational point of view,30PRUTF takes a slightly longer time, mainly due to the matrix D multiplications, however,this computation time decreases as noise level σ increases.In the case of piecewise linear signal, mPRUTF is only compared to NOT and ID methods,which are applicable to the piecewise polynomials of order r ≥
1. Figure 11 reveals thatmPRUTF outperforms both NOT and ID in terms of the average number of change points,MSE and the Hausdorff distance. As shown in Panel (d) of Figure 11, the computation time ofmPRUTF decreases faster as the value of σ increases.Simulation results for most of the scenarios indicate that mPRUTF is among the mostcompetitive change point detection approaches in the literature. mPRUTF performs well interms of the estimation of the number of change points, their locations, as well as the truesignals. We have analyzed UK HPI and GISTEMP and COVID-19 datasets using our proposed algo-rithm. Because σ is unknown for these real datasets, we applied median absolute deviation(MAD) proposed by Hampel (1974), to robustly estimate σ . More specifically, a MAD estimateof σ in presence of piecewise constant signals is given by (cid:98) σ = Median( D (1) y ) / [ √ − (0 . (cid:98) σ = Median( D (2) y ) / [ √ − (0 . − ( · ) representsthe inverse cumulative density function of the standard normal distribution. Example 1
The UK House Price Index (HPI) is a National Statistic that shows changes in the value of residential properties in England, Scotland, Wales and Northern Ireland. The HPI measures the price changes of residential housing by calculating the price of completed house sale transactions as a percentage change from some specific start date. The UK HPI uses a hedonic regression model as the statistical approach to produce estimates of the change in house prices for each period. For more details, see https://landregistry.data.gov.uk/app/ukhpi. Many researchers, including Fryzlewicz (2018), Fryzlewicz (2018, tail-greedy decompositions) and Anastasiou and Fryzlewicz (2019), have studied the UK HPI dataset in carrying out change point analysis. In the current study, we consider monthly percentage changes in the UK HPI at Tower Hamlets (an east borough of London) from January 1996 to November 2018.

The dataset and its piecewise constant fit derived by our proposed method are presented in Figure 12a. Our method finds five change points, located at December 2002, April 2008 and August 2009 (attributable to the Credit Crunch and Financial Crisis), May 2012 (attributable to the London 2012 Summer Olympics) and August 2015 (attributable to regulatory and tax changes, as well as lower net migration from the EU).
Example 2
The Goddard Institute for Space Studies (GISS) monitors broad global changes around the world. The GISS Surface Temperature Analysis (GISTEMP) is an estimate of global surface temperature change. The analysis of GISTEMP data uses temperature anomalies rather than actual temperatures; a temperature anomaly is the difference from an average or baseline temperature. The baseline temperature is typically computed by averaging thirty or more years of temperature data (1951 to 1980 in the current dataset). A positive anomaly indicates that the observed temperature was warmer than the baseline, while a negative anomaly indicates that it was cooler. For more details, see Hansen et al. (2010) and Lenssen et al. (2019). The GISTEMP dataset has been explored frequently in the change point literature; see, for example, Ruggieri (2013), James and Matteson (2015) and Baranowski et al. (2019).

Figure 12b displays the monthly land-ocean temperature anomalies recorded from January 1880 to August 2019 (see https://data.giss.nasa.gov/gistemp). The plot reveals the presence of a linear trend with several change points in the dataset. For the GISTEMP dataset, we identify nine change points, located at October 1900, March 1911, April 1915, June 1935, April 1944, December 1946, June 1976, September 2015 and December 2017; the resulting piecewise linear fit is presented in Figure 12b.

Figure 12: The time series and fitted signals for the Tower Hamlets HPI and GISTEMP datasets presented in Examples 1 and 2: (a) the UK HPI dataset with its piecewise constant fit; (b) the GISTEMP dataset with its piecewise linear fit.
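The MAD-based estimate of σ used for these real-data fits can be sketched as follows. The function name is hypothetical; the √2 and √6 scalings follow from Var(D⁽¹⁾y) = 2σ² and Var(D⁽²⁾y) = 6σ² for i.i.d. N(0, σ²) noise (the √6 factor for second differences is inferred from this variance, not quoted verbatim from the paper).

```python
import numpy as np
from statistics import NormalDist

def mad_sigma(y, order=1):
    """MAD noise estimate from first (order=1) or second (order=2)
    differences of y: Median(|D^(order) y|) / (scale * Phi^{-1}(0.75))."""
    d = np.diff(np.asarray(y, dtype=float), n=order)
    scale = {1: np.sqrt(2.0), 2: np.sqrt(6.0)}[order]  # sd of differenced noise / sigma
    return float(np.median(np.abs(d)) / (scale * NormalDist().inv_cdf(0.75)))
```

Differencing removes the piecewise constant (order 1) or piecewise linear (order 2) trend except at the few change points, and the median is insensitive to those few outlying differences, which is what makes the estimate robust.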
Example 3
Since the initial outbreak of Novel Coronavirus Disease 2019 (COVID-19) in Wuhan, China, in mid-November 2019, the virus has spread rapidly throughout the world. The pandemic has infected 21.26 million people and killed more than 761,000 (https://covid19.who.int/), greatly stressing public health systems and adversely influencing global society and economies. Every country has therefore attempted to slow the transmission rate through various regional and national policies, such as declarations of national emergency, quarantines and mass testing. Of vital interest to governments is understanding the pattern of epidemic growth and assessing the effectiveness of the policies undertaken. A scientist can investigate these matters by analyzing sequences of COVID-19 infection data. Change point detection is one possible framework for studying the behaviour of COVID-19 infection curves: by detecting the locations of alterations in the curves, change point analysis gives insight into changes in the transmission rate or in the efficiency of interventions, and enables us to raise warning signals if the disease pattern changes. We have applied our modified PRUTF algorithm to detect change points in the log-scale cumulative number of confirmed cases for Australia, Brazil, Canada, Germany, Italy, Spain, the United Kingdom and the United States, and then fitted a piecewise linear model to the transformed data using the selected change points. Figure 13 displays the locations of our detected change points as well as the estimated linear trends for COVID-19 data from late February to August 18.

Figure 13: The estimated linear trends and change point locations for the number of confirmed COVID-19 cases in the selected countries.

As can be seen from Figure 13, the COVID-19 growth rate decreases after the second change point in most countries. The growth curves were successfully flattened in most countries by early May, with the exception of Brazil.
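Given a set of change point locations, a continuous piecewise linear trend can be fitted to the log-scale cumulative counts by least squares on a hinge basis. The sketch below is illustrative only: the synthetic series and knot location are hypothetical, not real COVID-19 data or the paper's estimates.

```python
import numpy as np

def fit_piecewise_linear(t, y, knots):
    """Least squares fit of a continuous piecewise linear function with
    fixed break locations: design columns 1, t, and (t - k)_+ per knot."""
    t = np.asarray(t, dtype=float)
    X = np.column_stack([np.ones_like(t), t] +
                        [np.maximum(t - k, 0.0) for k in knots])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return X @ coef

# synthetic "log cumulative cases": growth rate drops at t = 40
t = np.arange(100.0)
log_cases = np.where(t < 40, 0.2 * t, 8.0 + 0.05 * (t - 40))
fit = fit_piecewise_linear(t, log_cases, knots=[40])
```

On the log scale, each slope corresponds to an exponential growth rate, so a downward kink at a change point is exactly the "flattening of the curve" discussed in the text.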
The second wave of the pandemic had already begun in countries such as the United States, Australia and Spain in early July; these countries have undergone two change points since the beginning of the second wave.

This paper proposed an algorithm, PRUTF, to detect change points in piecewise polynomial signals using trend filtering. We demonstrated that the dual solution path produced by the PRUTF algorithm forms a Gaussian bridge process for any given value of the regularization parameter λ. This result allowed us to derive an efficient stopping rule for terminating the search algorithm, which is vital in change point analysis. We then proved that, when there is no staircase block in the signal, the method guarantees consistent pattern recovery; however, it fails to do so when the underlying signal contains a staircase. To address this shortcoming, we suggested a modification of the procedure for constructing the solution path, which effectively prevents false discoveries of change points. Evidence from both simulation studies and real data analyses reveals the accuracy and the high detection power of the proposed method.

Appendix A
A.1 Proof of Theorem 1

a) For ε_1, ..., ε_n, ... i.i.d. ∼ N(0, σ²) and a sequence q_1, ..., q_n, ... of real numbers, let

ν_k = Var( Σ_{i=1}^{k} q_i ε_i ) = σ² Σ_{i=1}^{k} q_i², for k ≥ 1.

Define the partial weighted sum process {S_n(t) : 0 ≤ t ≤ 1} by

S_n(t) = (1/√ν_n) Σ_{i=1}^{⌊nt⌋} q_i ε_i, for 0 ≤ t ≤ 1.

For any k ≥ 1 and any 0 < t_1 < t_2 < ··· < t_k ≤ 1, the vector (S_n(t_1), ..., S_n(t_k)) has a multivariate normal distribution, and therefore {S_n(t) : 0 ≤ t ≤ 1} is a Gaussian process for any given n. Since the dual variables D^T u = y − f in (4) can be rewritten as u(⌊nt⌋) = Σ_{i=1}^{⌊nt⌋} C(r+i−1, r) ε_i, the weights for the process W_j(t) are q_i = C(r+i−1, r), and ν_n = σ² Σ_{i=1}^{n} C(r+i−1, r)². Moreover, from the definition of W_j(t) and the independence of the ε_i, it is clear that the increments of W_j are stationary and independent. Recall that the structure of the true signal f remains unchanged within the j-th block, meaning that [D_{−A}]_t f = 0 for any t = τ_j + r_a, ..., τ_{j+1} − r_b, which in turn implies [(D_{−A} D_{−A}^T)^{−1} D_{−A}]_t f = 0. This proves that the mean of W_j is zero.

b) Conditioning W_j on W_j((τ_j + r_a)/m) = 0 and W_j((τ_{j+1} − r_b)/m) = 0 gives rise to the Gaussian bridge process W_j.

c) Recall that (D_{−A} D_{−A}^T)^{−1} is a block diagonal matrix, so the covariance matrix between two distinct blocks is zero. This completes the proof of the theorem.

A.2 Proof of Theorem 2

a) For t = 1, ..., τ_1 − r_b and both signs ±1, the KKT conditions require the dual variables û(t) to lie between −λ and λ, that is,

−λ ≤ û_0^{st}(t) − λ [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t D_{A_1}^T s_1 ≤ λ, (A.1)

which yields the constraint for the first block in (35).

b) Similarly, for t = τ_J + r_a, ..., m, the constraint becomes

−λ ≤ û_J^{st}(t) + λ [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t D_{A_J}^T s_J ≤ λ, (A.2)

which leads to the result of (36).

c) For t = τ_j + r_a, ..., τ_{j+1} − r_b and j = 1, ..., J − 1,

−λ s_j ≤ û_j^{st}(t) − λ [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t ( D_{A_{j+1}}^T s_{j+1} + D_{A_j}^T s_j ) ≤ λ s_{j+1}. (A.3)

Since the stochastic process û_j^{st}(t) is symmetric around zero, this constraint reduces to (37) when s_{j+1} and s_j have opposite signs. Otherwise, for s_{j+1} = s_j, which accounts for a staircase in block j, the constraint becomes one-sided, bounding û_j^{st}(t) from above when s_j = 1 and from below when s_j = −1.

A.3 Proof of Theorem 3

(a) It suffices to establish that the probability of violating the KKT conditions for a signal with no staircase block converges to zero.
To this end, let P_j denote the probability of violating the KKT conditions within block j. By the symmetry of û_j^{st}(t),

P_j ≤ 2 Pr( û_j^{st}(t) ≥ λ [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t ( D_{A_{j+1}}^T − D_{A_j}^T ), for some t ∈ B_j )
   = 2 Pr( ∪_{t ∈ B_j} { û_j^{st}(t) ≥ λ [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t ( D_{A_{j+1}}^T − D_{A_j}^T ) } )
   ≤ 2 Σ_{t=τ_j+r_a+1}^{τ_{j+1}−r_b} Pr( û_j^{st}(t) ≥ λ [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t ( D_{A_{j+1}}^T − D_{A_j}^T ) )
   ≤ 2 Σ_{t=τ_j+r_a+1}^{τ_{j+1}−r_b} exp{ −λ² ξ_t² / (2 σ² L_j) }, (A.4)

where

ξ_t = ( [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t ( D_{A_{j+1}}^T − D_{A_j}^T ) / √( [ (D_{−A} D_{−A}^T)^{−1} ]_{tt} ) ) (τ_{j+1} − τ_j − r)^{(r+1)/2}, (A.5)

and

L_j = (τ_{j+1} − τ_j − r)^{r+1}. (A.6)

The last inequality in (A.4) is obtained from the Gaussian tail bound. It can be shown that ξ_t ≥ 1 for t = τ_j + r_a + 1, ..., τ_{j+1} − r_b; hence

P_j ≤ 2 Σ_{t=τ_j+r_a+1}^{τ_{j+1}−r_b} exp{ −λ² ξ_t² / (2 σ² L_j) } ≤ 2 L_j exp{ −λ² / (2 σ² L_j) }, (A.7)

which converges to zero if λ / (σ √L_j) → ∞. Applying the same argument to û_j^{st}(t) for all j = 0, ..., J, the probability of violating the KKT conditions is bounded by

2 Pr( ∪_{t ∈ B_j} { û_j^{st}(t) ≥ λ [ (D_{−A} D_{−A}^T)^{−1} D_{−A} ]_t ( D_{A_{j+1}}^T − D_{A_j}^T ) }, for some j = 0, ..., J )
   ≤ 2 (J + 1) L_max exp{ −λ² / (2 σ² L_max) }
   = exp{ −λ² / (2 σ² L_max) + log( 2 (J + 1) L_max ) }. (A.8)

Equation (A.8) states that the KKT conditions are violated with probability approaching zero provided λ ≤ L_min δ_min and λ / σ > (1 + ε) √( 2 L_max log( 2 (J + 1) L_max ) ), for any ε > 0.
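The Gaussian tail bound invoked in the last step of (A.4) is the standard Chernoff bound; a short derivation, applied here under the assumption that the variance of û_j^{st}(t) is bounded by σ² L_j, is:

```latex
% Chernoff bound for a Gaussian tail: for Z \sim N(0, v) and a > 0,
\Pr(Z \ge a)
  \;\le\; \inf_{s > 0} e^{-sa}\,\mathbb{E}\, e^{sZ}
  \;=\; \inf_{s > 0} \exp\!\left(-sa + \tfrac{s^2 v}{2}\right)
  \;=\; \exp\!\left(-\frac{a^2}{2v}\right),
% taking v \le \sigma^2 L_j and a = \lambda\,\xi_t\,(\text{up to scaling})
% yields the exponent -\lambda^2 \xi_t^2 / (2\sigma^2 L_j) in (A.4).
```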
(b) As previously mentioned, in staircase blocks the violation of the KKT conditions boils down to a Gaussian bridge process crossing the zero line. Suppose the j-th block is a staircase block; PRUTF can then attain exact discovery only if û_j^{st}(t) does not change sign for (τ_j + r_a)/m < t < (τ_{j+1} − r_b)/m. For s_j = 1, say, the probability of this event equals Pr( max_{0 ≤ t ≤ L_j} û_j^{st}(t) ≤ 0 ). According to Beghin and Orsingher (1999),

Pr( max_{0 ≤ t ≤ L_j} û_j^{st}(t) ≤ a ) = 1 − exp( −2 a² / S_r(L_j) ), (A.9)

where S_r(L_j) is the L_j-th diagonal element of the matrix σ² ( D_{A_j} D_{A_j}^T )^{−1}. As a result, this probability converges to zero as a vanishes, which implies that trend filtering fails to consistently discover the true pattern in the presence of staircases.

Bibliography
Adler, R. J. and Taylor, J. E. (2009). Random Fields and Geometry. Springer.
Ali, A. and Tibshirani, R. J. (2019). The generalized lasso problem and uniqueness. Electronic Journal of Statistics, 13(2):2307–2347.
Aminikhanghahi, S. and Cook, D. J. (2017). A survey of methods for time series change point detection. Knowledge and Information Systems, 51(2):339–367.
Anastasiou, A. and Fryzlewicz, P. (2019). Detecting multiple generalized change-points by isolating single ones. arXiv preprint arXiv:1901.10852.
Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology, 51(1):39–54.
Bai, J. (1997). Estimating multiple breaks one at a time. Econometric Theory, 13(3):315–352.
Bai, J. and Perron, P. (2003). Computation and analysis of multiple structural change models. Journal of Applied Econometrics, 18(1):1–22.
Baranowski, R., Chen, Y., and Fryzlewicz, P. (2019). Narrowest-over-threshold detection of multiple change points and change-point-like features. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(3):649–672.
Beghin, L. and Orsingher, E. (1999). On the maximum of the generalized Brownian bridge. Lithuanian Mathematical Journal, 39(2):157–167.
Bianchi, M., Boyle, M., and Hollingsworth, D. (1999). A comparison of methods for trend estimation. Applied Economics Letters, 6(2):103–109.
Billingsley, P. (2013). Convergence of Probability Measures. John Wiley and Sons.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Brezis, H. (2010). Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer.
Ciuperca, G. (2011). A general criterion to determine the number of change-points. Statistics & Probability Letters, 81(8):1267–1275.
Ciuperca, G. (2014). Model selection by lasso methods in a change-point model. Statistical Papers, 55(2):349–374.
Demetriou, I. and Lipitakis, E. (2001). Certain positive definite submatrices that arise from binomial coefficient matrices. Applied Numerical Mathematics, 36(2-3):219–229.
Donsker, M. D. (1951). An invariance principle for certain probability limit theorems.
Eckley, I. A., Fearnhead, P., and Killick, R. (2011). Analysis of changepoint models. Bayesian Time Series Models, pages 205–224.
Frick, K., Munk, A., and Sieling, H. (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3):495–580.
Fryzlewicz, P. (2018). Detecting possibly frequent change-points: Wild binary segmentation 2 and steepest-drop model selection. arXiv preprint arXiv:1812.06880.
Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281.
Fryzlewicz, P. (2018). Tail-greedy bottom-up data decompositions and fast multiple change-point detection. The Annals of Statistics, 46(6B):3390–3421.
Futschik, A., Hotz, T., Munk, A., and Sieling, H. (2014). Multiscale DNA partitioning: statistical evidence for segments. Bioinformatics, 30(16):2255–2262.
Galceran, E., Cunningham, A. G., Eustice, R. M., and Olson, E. (2017). Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment. Autonomous Robots, 41(6):1367–1382.
Gasbarra, D., Sottinen, T., and Valkeila, E. (2007). Gaussian bridges. In Stochastic Analysis and Applications, pages 361–382. Springer.
Green, P. J. and Silverman, B. W. (1993). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall/CRC.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393.
Hansen, B. E. (2001). The new econometrics of structural change: dating breaks in US labour productivity. Journal of Economic Perspectives, 15(4):117–128.
Hansen, J., Ruedy, R., Sato, M., and Lo, K. (2010). Global surface temperature change. Reviews of Geophysics, 48(4).
Harchaoui, Z. and Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493.
Horváth, L. and Rice, G. (2014). Extensions of some classical methods in change point analysis. Test, 23(2):219–255.
Hoskins, W. and Ponzo, P. (1972). Some properties of a class of band matrices. Mathematics of Computation, 26(118):393–400.
Huang, T., Wu, B., Lizardi, P., and Zhao, H. (2005). Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics, 21(20):3811–3817.
Jackson, B., Scargle, J. D., Barnes, D., Arabhi, S., Alt, A., Gioumousis, P., Gwin, E., Sangtrakulcharoen, P., Tan, L., and Tsai, T. T. (2005). An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters, 12(2):105–108.
James, N. A. and Matteson, D. S. (2015). Change points via probabilistically pruned objectives. arXiv preprint arXiv:1505.04302.
Jia, J. and Rohe, K. (2015). Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9(1):1150–1172.
Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.
Kim, S.-J., Koh, K., Boyd, S., and Gorinevsky, D. (2009). ℓ1 trend filtering. SIAM Review, 51(2):339–360.
Lavielle, M. (2005). Using penalized contrasts for the change-point problem. Signal Processing, 85(8):1501–1510.
Lee, T.-S. (2010). Change-point problems: bibliography and review. Journal of Statistical Theory and Practice, 4(4):643–662.
Lenssen, N. J., Schmidt, G. A., Hansen, J. E., Menne, M. J., Persin, A., Ruedy, R., and Zyss, D. (2019). Improvements in the uncertainty model in the Goddard Institute for Space Studies surface temperature (GISTEMP) analysis. Journal of Geophysical Research: Atmospheres.
Liu, D., Chen, X., Lian, Y., and Lou, Z. (2010). Impacts of climate change and human activities on surface runoff in the Dongjiang River basin of China. Hydrological Processes: An International Journal, 24(11):1487–1495.
Lung-Yut-Fong, A., Lévy-Leduc, C., and Cappé, O. (2012). Distributed detection/localization of change-points in high-dimensional network traffic data. Statistics and Computing, 22(2):485–496.
Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2017). On optimal multiple change-point algorithms for large data. Statistics and Computing, 27(2):519–533.
Mairal, J. and Yu, B. (2012). Complexity analysis of the lasso regularization path. arXiv preprint arXiv:1205.0079.
Mammen, E. and van de Geer, S. (1997). Locally adaptive regression splines. The Annals of Statistics, 25(1):387–413.
Olshen, A. B., Venkatraman, E., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337.
Oudre, L., Lung-Yut-Fong, A., and Bianchi, P. (2011). Segmentation of accelerometer signals recorded during continuous treadmill walking. In , pages 1564–1568. IEEE.
Page, E. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika, 42(3/4):523–527.
Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2):100–115.
Pan, J. and Chen, J. (2006). Application of modified information criterion to multiple change point problems. Journal of Multivariate Analysis, 97(10):2221–2241.
Pezzatti, G. B., Zumbrunnen, T., Bürgi, M., Ambrosetti, P., and Conedera, M. (2013). Fire regime shifts as a consequence of fire policy and socio-economic development: an analysis based on the change point approach. Forest Policy and Economics, 29:7–18.
Qian, J. and Jia, J. (2016). On stepwise pattern recovery of the fused lasso. Computational Statistics & Data Analysis, 94:221–237.
Ramdas, A. and Tibshirani, R. J. (2016). Fast and flexible ADMM algorithms for trend filtering. Journal of Computational and Graphical Statistics, 25(3):839–858.
Ranganathan, A. (2012). PLISS: labeling places using online changepoint detection. Autonomous Robots, 32(4):351–368.
Rigaill, G. (2015). A pruned dynamic programming algorithm to recover the best segmentations with 1 to K_max change-points. Journal de la Société Française de Statistique, 156(4):180–205.
Rinaldo, A. (2014). Corrections to properties and refinements of the fused lasso.
Rinaldo, A. (2009). Properties and refinements of the fused lasso. The Annals of Statistics, 37(5B):2922–2952.
Robbins, M. W., Lund, R. B., Gallagher, C. M., and Lu, Q. (2011). Changepoints in the North Atlantic tropical cyclone record. Journal of the American Statistical Association, 106(493):89–99.
Rojas, C. R. and Wahlberg, B. (2014). On change point detection using the fused lasso method. arXiv preprint arXiv:1401.5408.
Ruggieri, E. (2013). A Bayesian approach to detecting change points in climatic records. International Journal of Climatology, 33(2):520–528.
Shorack, G. R. and Wellner, J. A. (2009). Empirical Processes with Applications to Statistics. SIAM.
Siris, V. A. and Papagalou, F. (2004). Application of anomaly detection algorithms for detecting SYN flooding attacks. In IEEE Global Telecommunications Conference, 2004 (GLOBECOM '04), volume 4, pages 2050–2054. IEEE.
Stasinopoulos, D. and Rigby, R. (1992). Detecting break points in generalised linear models. Computational Statistics & Data Analysis, 13(4):461–471.
Steidl, G., Didas, S., and Neumann, J. (2006). Splines in higher order TV regularization. International Journal of Computer Vision, 70(3):241–255.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
Tibshirani, R. J. (2013). The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490.
Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323.
Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335–1371.
Truong, C., Oudre, L., and Vayatis, N. (2018). A review of change point detection methods. arXiv preprint arXiv:1801.00718.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
Vostrikova, L. Y. (1981). Detecting disorder in multidimensional random processes. In Doklady Akademii Nauk, volume 259, pages 270–274. Russian Academy of Sciences.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202.
Wang, Y.-X., Smola, A., and Tibshirani, R. (2014). The falling factorial basis and its statistical applications. In International Conference on Machine Learning, pages 730–738.
Wu, Y. (2008). Simultaneous change point analysis and variable selection in a regression problem. Journal of Multivariate Analysis, 99(9):2154–2171.
Yao, Y.-C. (1988). Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters, 6(3):181–189.
Yao, Y.-C. and Au, S.-T. (1989). Least-squares estimation of a step function. Sankhyā: The Indian Journal of Statistics, Series A, pages 370–381.
Zhang, N. R. and Siegmund, D. O. (2007). A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63(1):22–32.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso.