Scalable Multiple Changepoint Detection for Functional Data Sequences
Trevor Harris, Bo Li, J. Derek Tucker. August 6, 2020.
Abstract
We propose the Multiple Changepoint Isolation (MCI) method for detecting multiple changes in the mean and covariance of a functional process. We first introduce a pair of projections to represent the high and low frequency features of the data. We then apply total variation denoising and introduce a new regionalization procedure to split the projections into multiple regions. Denoising and regionalizing act to isolate each changepoint into its own region, so that the classical univariate CUSUM statistic can be applied region-wise to find all changepoints. Simulations show that our method accurately detects the number and locations of changepoints under many different scenarios. These include light and heavy tailed data, data with symmetric and skewed distributions, sparsely and densely sampled changepoints, and both mean and covariance changes. We show that our method outperforms a recent multiple functional changepoint detector and several univariate changepoint detectors applied to our proposed projections. We also show that MCI is more robust than existing approaches, and scales linearly with sample size. Finally, we demonstrate our method on a large time series of water vapor mixing ratio profiles from atmospheric emitted radiance interferometer measurements.
Keywords: functional changepoints, robust procedures, time series (time domain), total variation denoising, AERI, CUSUM
Short title: Functional Changepoint Detection. Department of Statistics, University of Illinois at Urbana-Champaign; Sandia National Laboratories, Albuquerque, NM.

Introduction
The statistical analysis of functional time series has become increasingly important to many scientific fields including Climatology (Shang and Hyndman, 2011), Finance (Kokoszka and Zhang, 2012), Geophysics (Hörmann and Kokoszka, 2012), Demography (Hyndman and Booth, 2008), and Manufacturing (Woodall, 2007). A functional time series is a sequence of infinite dimensional objects, such as curves and surfaces, observed over time. A functional time series is analogous to finite dimensional univariate or multivariate time series, except that entire functions are observed at each time point. Just as in univariate and multivariate time series data, a functional time series may also experience abrupt changes in its generating process. These abrupt changes, or changepoints, can complicate statistical analysis by invalidating stationarity assumptions. They can also be interesting in their own right because they reveal unexpected heterogeneous patterns. Identifying changepoints in functional time series has, therefore, become an important issue and has started to gain increasing interest due to the surge in available functional data.

Within the functional data analysis (FDA) literature, changepoint detection has largely focused on the At Most One Change (AMOC) problem. In Berkes et al. (2009) a Cumulative Sum (CUSUM) test was proposed for independent functional data, which was further studied in Aue et al. (2009), where its asymptotic properties were developed. This test was then extended to weakly dependent functional data by Hörmann et al. (2010) and epidemic changes by Aston and Kirch (2012). Zhang et al. (2011) introduced a test for changes in the mean of weakly dependent functional data using self-normalization to alleviate the use of asymptotic control. Later, Sharipov et al. (2016) similarly developed a sequential block bootstrap procedure for these methods. More recently, Gromenko et al. (2017) considered changes in spatially correlated functional data, and Aue et al.
(2018) proposed a fully functional method for finding a change in the mean without losing information due to dimension reduction.

Detecting multiple changepoints in a functional time series has received relatively scant attention compared to the AMOC problem. Recently, Li and Ghosal (2018) proposed a Bayesian method for directly identifying multiple changepoints in the mean, by transforming the functional data into wavelets and identifying changes in the wavelet coefficient processes. In Chiou et al. (2019), a dynamic segmentation method for finding multiple changepoints in the mean was proposed, which used dynamic programming and backward elimination to find an optimal set of changepoints. Alternatively, multiple changepoints have also been identified by adapting AMOC methods with a binary segmentation algorithm to recursively partition the functional time series (Berkes et al., 2009; Aue et al., 2018). The consistency of binary segmentation approaches was recently shown in Rice and Zhang (2019).

Despite the advances, the computational complexity of these multiple changepoint detection methods has been an obstacle for their wide application to large functional data. Bayesian methods that rely on Markov Chain Monte Carlo sampling are intrinsically burdensome because they typically require an enormous number of samples to reach convergence. Ordinary dynamic programming and binary segmentation algorithms scale quadratically and log-linearly, respectively, with the data's sample size. As larger and larger functional time series data sets are curated, methods that scale linearly with sample size are called for to meet the computational demand. In the univariate and multivariate changepoint detection literature, optimal linear-time methods for multiple changepoint detection have already emerged.
These include the Pruned Exact Linear Time (PELT) algorithm (Killick et al., 2012), the Functional Pruning Optimal Partitioning (FPOP) algorithm (Maidstone et al., 2017), and the robust Functional Pruning Optimal Partitioning (r-FPOP) algorithm (Fearnhead and Rigaill, 2019).

Another limitation of the existing functional changepoint detection methods is that most methods can only detect changes in the functional process's mean. While mean changes are the most conspicuous, covariance changes are equally important and may also occur. The need for detecting covariance changes has already been noticed and tackled in the univariate time series literature, where methods targeting both mean and variance changes, such as PELT and FPOP, have been developed. However, methods that target both mean and covariance changes in a functional process are still not available. This motivates us to develop a functional changepoint approach that can take the covariance into account. Lastly, most previous methods were developed under the assumption that the data follows a Gaussian process. Their performance on non-Gaussian, skewed, or heavy tailed data, which is often encountered in practice, is not well studied and could be suboptimal.

For example, remote sounding instruments such as Atmospheric Emitted Radiance Interferometers (AERI) are currently providing a wealth of functional time series data (see Figure 1), while also posing many challenges to existing changepoint detectors (Kulla and Ritter, 2019; Sakai et al., 2019). For instance, the AERI data in Figure 1 exhibit heavy tails and positive kurtosis due to being a ratio of densities. AERIs also monitor the atmosphere continuously, which leads to a long time series of functional data, often on the order of hundreds of thousands to millions of functions. Finding changepoints in the AERI data is a daunting task for existing detectors due to their non-Gaussianity and immense sample sizes.
However, identifying changepoints is crucial for detecting atmospheric events such as air parcel mixing, rapid fluxes in aerosols, and evaporation or precipitation events. Therefore, it is necessary to develop changepoint detection methods that are robust to violations of Normality and that scale to millions of observations.

We propose the Multiple Changepoint Isolation (MCI) method to detect multiple changepoints in a functional time series's mean and covariance function. Our method is based on a combination of two univariate projections of the functional data representing the high and low frequency features of the data, an adaptive linear time regionalization procedure to isolate changepoints into regions, and the univariate CUSUM statistic for AMOC detection within each region. Our method remedies three outstanding issues in the functional changepoint detection literature by (1) having linear time computational complexity, (2) taking broader types of changepoints into account, and (3) being robust to heavier or asymmetric tails than Gaussian processes have. We show in Section 4 that our method is consistently accurate and powerful under an array of settings, and that our new detection procedure is better adapted to our projections than existing univariate changepoint detectors.

Figure 1: AERI water vapor mixing ratio profiles from August 9th to September 6th, 2008. The x-axis is time, the y-axis is altitude, and the color values indicate the density of water vapor at each altitude and time. Each vertical bar is an entire function.

Let {f_t, t ∈ 1, ..., N} be a sequence of continuous functions in L²([0, 1]) satisfying ∫₀¹ |f_t(s)|² ds < ∞. The interval [0, 1] is assumed to be the domain without loss of generality, and we further assume that each function f_t ∈ {f_t, t ∈ 1, ..., N} is observed on the same finite grid of points 0 < s_1 < ... < s_m < 1. Let f_1, ..., f_N follow the process

f_t ∼ F(µ_t, Σ_t),    (1)

where F(µ_t, Σ_t) refers to a functional process with mean function µ_t and covariance function Σ_t. We assume that F(·) is piecewise weakly stationary, that is, µ_{t+1} = µ_t and Σ_{t+1} = Σ_t for all t except when t is a changepoint. To allow for changepoints, we assume there are M_µ < N and M_Σ < N time points where µ_{t+1} ≠ µ_t and Σ_{t+1} ≠ Σ_t respectively. M_µ and M_Σ are both assumed to be unknown non-negative integers.

Let τ^µ_j with j = 1, ..., M_µ and τ^Σ_k with k = 1, ..., M_Σ denote the times where µ_t and Σ_t "break" or "change", i.e., where µ_{τ^µ_j + 1} ≠ µ_{τ^µ_j} and Σ_{τ^Σ_k + 1} ≠ Σ_{τ^Σ_k}. For simplicity and ease of notation, we union the two sets of changepoints into a combined sequence τ_1, ..., τ_M, where M ≤ M_µ + M_Σ < N. M is potentially smaller than M_µ + M_Σ because there can be overlap between the τ^µ_j and the τ^Σ_k. The goal of our method is to estimate all changepoint locations τ_1, ..., τ_M from the sequence {f_t, t ∈ 1, ..., N}.

Given a functional sequence {f_t, t ∈ 1, ..., N}, we propose the Multiple Changepoint Isolation (MCI) method to identify all changepoints τ_1, ..., τ_M in model (1). Our procedure consists of four major steps.

1. Project the functional data onto the real line to reduce the functional data into finite dimensions. Two projections are proposed to obtain two separate univariate time series (Section 3.1).

2. Use total variation denoising (TVD) (Rudin et al., 1992) to identify a set of potential changepoint locations in each projection (Section 3.2).

3. Split each projection into multiple regions based on the locations found in step two and a new "changeset regionalization" procedure.
This procedure isolates each changepoint to its own region, which reduces the multiple changepoint problem into a sequence of single changepoint problems, one for each region (Section 3.3).

4. Apply the univariate CUSUM statistic (Page, 1954) to each region in each projection to find all changepoints in the functional sequences. We account for multiple testing with the Benjamini-Hochberg false discovery rate adjustment (Benjamini and Hochberg, 1995) (Section 3.4).

The first step is to project the functional data onto the real line. Projection is a classic technique of functional data analysis because it reduces the infinite dimensions of functional data to a finite dimensional space. In doing so, projections can often act as feature extractors that reveal major patterns and trends in the data.

The predominant technique for projecting functional data is the Karhunen-Loève (KL) expansion (Ramsay, 2004). The KL expansion states that any f_t ∈ L²([0, 1]) can be written as

f_t(s) = µ(s) + Σ_{i=1}^∞ α_{i,t} φ_i(s),    (2)

where s ∈ [0, 1], µ ∈ L²([0, 1]) is the mean function of f_t, and the φ_i ∈ L²([0, 1]) are orthonormal eigenfunctions. Truncating the expansion at some p ≥ 1 allows us to represent each f_t with just the set of coefficients α_t = (α_{1,t}, ..., α_{p,t}), also known as the functional principal components (FPC). This could allow us to reduce the problem of finding changepoints in the functional sequence {f_t, t ∈ 1, ..., N} to finding changepoints in the multivariate sequence {α_t, t ∈ 1, ..., N}. Some variations on FPC can be found in Tucker et al. (2013) and Lee and Jung (2016), for when the data are not aligned.

The distinguishing feature of FPC analysis is that eigenfunctions are sorted by order of descending eigenvalues. Consequently, φ_1 represents the primary direction of variation, φ_2 represents the next primary direction of variation orthogonal to φ_1, and so on. Generally, only a small number (p) of eigenfunctions are required to represent the majority of the variation in the data. However, selecting p can require a potentially expensive cross-validation step.
For changepoint detection, we argue that only the first FPC will be necessary and that the information in the trailing FPCs can be expressed through a single alternative projection.

The first FPC represents the greatest variation in a functional time series, so all trailing FPCs beyond the first only capture smaller scale variation. The trailing FPCs will, therefore, naturally have a lower signal-to-noise ratio for identifying changepoints, which leads to two adverse effects: increased estimation error and reduced test power. Estimation error increases because changepoints will often manifest in multiple FPCs simultaneously, meaning changepoints may get repeatedly estimated on progressively noisier data. This leads to many spurious changepoint estimates near the true ones. Test power decreases because more FPCs mean more tests will be performed, so the necessary false discovery rate correction will have to control for a higher number of tests. This directly leads to a power loss for each individual test.

For these reasons, we choose to employ only the first FPC in the projection. However, this does not mean we simply ignore the information in the trailing FPCs. Instead, we employ an alternative projection that can more powerfully capture the smaller scale variation. To do this, we propose using the arc length (Rudin et al., 1964) of each function as a projection. The arc length of a differentiable function f_t : [0, 1] → R^n is defined as

β_t = ∫₀¹ |∇f_t(s)| ds,    (3)

where ∇ is the differential operator. On a real valued differentiable function, the arc length is equivalent to the Total Variation norm of the function, which measures the entire spectrum of variability in the function. The arc length can be highly sensitive to changes in the covariance, or "high frequency" features, of the data because the differencing operator tends to exaggerate high frequency features. The arc length can, therefore, act as a compressed view of the higher order KL coefficients.
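As a concrete illustration, the arc length projection in (3) can be approximated directly on the observation grid, where the integral of |f′(s)| reduces to a total variation sum; a minimal numpy sketch (the function name and test curves are ours, not from the paper):

```python
import numpy as np

def arc_length(f_vals):
    """Approximate the arc length projection (Eq. 3) on a finite grid:
    the integral of |f'(s)| ds reduces to the total variation sum
    of |f(s_{i+1}) - f(s_i)| over consecutive grid points."""
    return float(np.sum(np.abs(np.diff(f_vals))))

s = np.linspace(0.0, 1.0, 1001)
smooth = arc_length(np.sin(2 * np.pi * s))    # one cycle: beta close to 4
rough = arc_length(np.sin(20 * np.pi * s))    # ten cycles, same amplitude: close to 40
```

Both curves have unit amplitude, but the rougher one has roughly ten times the arc length, which is why β_t reacts strongly to high frequency, covariance-driven changes that barely move the first FPC.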
In contrast, the first FPC is sensitive to low frequency features because it only captures the dominant mode of variation in the data.

The arc length, as a total variation measure, has frequently appeared in the signal processing and numerical methods literature as part of the total variation denoising (Rudin et al., 1992) and total variation diminishing procedures (Versteeg and Malalasekera, 2007). To our knowledge, our work is the first use of arc length in an FDA context for measuring the high frequency variability in curves.

We summarize our dimension reduction scheme as follows. First, project {f_t, t ∈ 1, ..., N} onto its first KL coefficient, which we denote by the sequence {α_t, t ∈ 1, ..., N}. Next, compute the arc length of each f_t in {f_t, t ∈ 1, ..., N} and denote this sequence as {β_t, t ∈ 1, ..., N}. The two univariate sequences, {α_t, t ∈ 1, ..., N} and {β_t, t ∈ 1, ..., N}, respectively represent the low and high frequency features of the sequence. These projections reduce the functional changepoint problem to finding and synthesizing changepoints in two univariate sequences.

The second step is to primitively segment the projected time series into regions of at most one change (AMOC) using total variation denoising (TVD). TVD has been frequently used (Rojas and Wahlberg, 2014; Chan et al., 2014; Hyun et al., 2016; Lin et al., 2016) for detecting changepoints in sequences of univariate random variables. This is because TVD estimates a piecewise constant fit with the estimated jumps often occurring at or near the true changepoints.

Let Y = {y_t, t ∈ 1, ..., N} generically denote a univariate projection of the functional sequence {f_t, t ∈ 1, ..., N}, i.e., {y_t, t ∈ 1, ..., N} could be {α_t, t ∈ 1, ..., N} or {β_t, t ∈ 1, ..., N}. We assume the following signal-plus-noise model for Y:

Y = θ + ε,    (4)

where Y is an N × 1 vector with entries y_t, θ is a piecewise constant N × 1 vector with entries θ_t, and the error term ε has a finite second moment.
Because θ is piecewise constant, the changepoints in Y are reflected in the jumps in θ. We use TVD to estimate θ as follows:

θ̂ = argmin_{θ ∈ R^N} ‖Y − θ‖² + λ‖θ‖_TV    (5)
  = argmin_{θ ∈ R^N} ‖Y − θ‖² + λ Σ_{t=1}^{N−1} |θ_{t+1} − θ_t|,    (6)

where ‖ · ‖ is the Euclidean norm, ‖ · ‖_TV is the Total Variation norm, and λ is a tuning parameter that controls the degree of regularization. We adopt a Bayesian Information Criterion (BIC) (Yao, 1988) minimization approach to choose an appropriate λ; see the algorithm in Alg. 3. The estimate θ̂ will necessarily be a piecewise constant vector, because the penalty term λ Σ_{t=1}^{N−1} |θ_{t+1} − θ_t| induces sparsity on the differences of θ (Tibshirani et al., 2005). This gives rise to TVD's alternative formulation as the fused lasso, a special case of the generalized lasso (Tibshirani et al., 2011).

The TVD estimator is minimax rate optimal when the true signal θ has bounded total variation (Mammen et al., 1997; Donoho et al., 1998). In the one-dimensional case considered here, the TVD estimator is also optimally adaptive to the piecewise constant signal (Guntuboyina et al., 2017). Optimal adaptivity means that if the true signal θ is piecewise constant with k steps, then the estimator θ̂ will approach the oracle estimator, which a priori knows the true locations of the k jumps, at a rate O(k/n) (Lin et al., 2016; Dalalyan et al., 2017). TVD estimation is also extremely computationally efficient since direct TVD solvers with linear time complexity exist (Condat, 2013). Together, these results indicate that the TVD estimator is nearly optimal for piecewise signal recovery both theoretically and computationally and is, therefore, appealing for prescreening changepoints.

We say TVD is an optimal changepoint pre-screener, but not an optimal changepoint detector, due to its suboptimal changepoint detection behavior noted by Fryzlewicz et al. (2014).
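To make the objective in (5) and (6) concrete, here is a small self-contained sketch that solves the fused-lasso form of TVD with a generic ADMM splitting. This is only an illustration of the optimization problem; the paper uses a direct linear-time solver (Condat, 2013), and the solver name and parameter values below are ours:

```python
import numpy as np

def tvd_admm(y, lam, rho=1.0, n_iter=500):
    """Illustrative ADMM solver for the fused-lasso form of TVD (Eqs. 5-6):
    minimize ||y - theta||^2 + lam * sum_t |theta_{t+1} - theta_t|."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)            # (n-1) x n first-difference matrix
    theta, z, u = y.copy(), D @ y, np.zeros(n - 1)
    # theta-update solves (2I + rho D'D) theta = 2y + rho D'(z - u)
    A_inv = np.linalg.inv(2 * np.eye(n) + rho * D.T @ D)
    for _ in range(n_iter):
        theta = A_inv @ (2 * y + rho * D.T @ (z - u))
        w = D @ theta + u
        z = np.sign(w) * np.maximum(np.abs(w) - lam / rho, 0.0)  # soft threshold
        u = u + D @ theta - z
    return theta

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(50), 5 * np.ones(50)]) + 0.3 * rng.standard_normal(100)
theta_hat = tvd_admm(y, lam=2.0)
jump = int(np.argmax(np.abs(np.diff(theta_hat))))  # index of the largest fitted jump
```

The fit is nearly flat within each segment, and the largest fitted jump lands at the true break between the two segments, which is exactly the pre-screening behavior the method relies on.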
The two main issues with TVD are that it generally overestimates the number of changepoints and that it lacks exact support recovery. This leads to TVD only being ε-sign consistent (Rojas and Wahlberg, 2014); that is, the true break locations will only be contained within an ε radius of the estimated breaks. We mitigate the first issue in Section 3.3 by introducing a refinement procedure, as was done in Chan et al. (2014) and Hyun et al. (2016), to group together "nearby" changepoints. Non-exact support recovery is corrected in Section 3.4 by applying CUSUM to the regions surrounding the grouped changepoints.

Instead of using TVD directly to estimate changepoint locations, we use TVD jumps as a guide for finding regions of the time series that are likely to contain AMOC. Once these regions are specified, the standard CUSUM statistic can be applied within each region to test which regions contain changepoints and estimate their locations. This approach is similar to the method in Lin et al. (2016), which uses TVD to prescreen for potential changepoints, and then refines the screened changepoints with a post-processing procedure. The difference between our method and theirs is that we introduce a new refinement procedure for the TVD estimate, called "changeset regionalization", and we base our post-processing procedure on the CUSUM statistic rather than filtering and thresholding. The latter frees us from resorting to permutations to estimate the threshold, allowing us to enjoy a significant computational advantage.

Recall Y = {y_t, t ∈ 1, ..., N} denotes a univariate process and let θ̂ = {θ̂_t, t ∈ 1, ..., N} denote its TVD estimate as in (5). We define the set A_θ̂ = {t : θ̂_t ≠ θ̂_{t+1}} as the set of jumps in θ̂. The set A_θ̂ is sub-optimal for partitioning Y into regions because TVD tends to split single large breaks into multiple, tightly grouped, smaller breaks (Rojas and Wahlberg, 2014). To remedy this, we introduce the set B_θ̂,c.
Definition 3.1.
Let c > 0 and let P(A_θ̂) denote the power set of A_θ̂. We define the set B_θ̂,c ∈ P(A_θ̂) as the set meeting the following two conditions:

1. d_H(b_i, b_j) > c for any b_i ≠ b_j ∈ B_θ̂,c, where d_H(·, ·) denotes the Hausdorff distance.

2. |B_θ̂,c| ≥ |B′_θ̂,c| for all B′_θ̂,c ∈ P(A_θ̂) meeting condition 1. Here | · | denotes the cardinality of the set.

Given a value c, each element of the set B_θ̂,c is constructed by merging nearby changepoints into one set of changepoints, called a "changeset". This procedure is specified in Algorithm 2 as well as illustrated in Figure 2. The value c can either be specified directly or estimated from the data as part of Algorithm 3. We take the latter approach in this paper. Each changeset in B_θ̂,c is assumed to have AMOC and to be in the correct temporal order without loss of generality. After that, we define regions based on the changesets. If eventually M changesets are formed, we will accordingly define M regions along the univariate process Y with each region containing only a single changeset. This indicates that each region contains AMOC, i.e., each region isolates a changepoint away from the others. The i-th region within 1, ..., M, R_i, is defined as the following interval:

R_i = (sup b_{i−1}, inf b_{i+1}),    (7)

where b_i ∈ B_θ̂,c, sup b_0 = 1, inf b_{M+1} = N, and M = |B_θ̂,c| is the cardinality of B_θ̂,c. Each interval R_i is the largest possible interval within 1, ..., N that contains b_i but no other elements of B_θ̂,c. Figure 2 diagrams the process of changeset regionalization.

Our definition of R_i naturally has overlapping adjacent regions, as can be seen in Figure 2. We find that overlapping regions, as opposed to nonoverlapping partitions, are quite beneficial for changepoint analysis. The overlap gives each region extra data to improve the power and accuracy of its CUSUM estimate. This is particularly helpful when changepoints are close to one another.
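A sketch of how the changesets and overlapping regions of (7) could be constructed. We use a simple greedy merge of jumps within distance c, which is a stand-in for the paper's Algorithm 2 and its Hausdorff-distance condition; the function name is ours:

```python
def changeset_regions(jumps, N, c):
    """Sketch of changeset regionalization (Definition 3.1 and Eq. 7):
    merge TVD jump times closer than c into "changesets", then form
    overlapping regions, each isolating one changeset.
    `jumps` is a sorted list of jump times from the TVD fit (the set A)."""
    # Greedily merge consecutive jumps within distance c into changesets.
    changesets = [[jumps[0]]]
    for t in jumps[1:]:
        if t - changesets[-1][-1] <= c:
            changesets[-1].append(t)
        else:
            changesets.append([t])
    # Region i runs from the last jump of the previous changeset to the
    # first jump of the next changeset (Eq. 7), so adjacent regions overlap.
    regions = []
    for i, b in enumerate(changesets):
        lo = changesets[i - 1][-1] if i > 0 else 1
        hi = changesets[i + 1][0] if i + 1 < len(changesets) else N
        regions.append((lo, hi))
    return changesets, regions

sets, regs = changeset_regions([10, 12, 50, 52, 90], N=120, c=5)
# sets -> [[10, 12], [50, 52], [90]]
# regs -> [(1, 50), (12, 90), (52, 120)]
```

As in Figure 2, the five raw TVD jumps collapse into three changesets, and each region contains exactly one changeset, so the classical AMOC machinery applies region by region.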
Figure 2: Panel A: The polished TVD fit (red line) over a time series (grey), with breaks in the TVD demarcated with black dashes along the bottom axis. Each black dash is an element of A_θ̂. Panel B: Black dashes (changepoints) have been grouped together into three "changesets" according to Definition 3.1, which leads to three distinct regions under (7). Changesets (and regional bounding boxes) are color coordinated and each set of same colored changepoints is one element of B_θ̂,c. Note that regions do not contain neighboring changesets, which is why we say the regions "isolate" the changepoints. Panel C: Close-up view of each of the three regions. The standard CUSUM will be applied to regions 1, 2, and 3 to find all changepoints.

The final step of our procedure is to detect changepoints contained within the regions obtained in Section 3.3. A suite of classical and well studied methods for estimating AMOC is based on the CUSUM statistic (Page, 1954). The basic CUSUM statistic can be used to first evaluate the existence of a single change in the mean of a univariate process Y = {y_t, t ∈ 1, ..., n} through the following hypothesis test:

H_0 : E(y_1) = · · · = E(y_n),
H_A : E(y_1) = · · · = E(y_k) ≠ E(y_{k+1}) = · · · = E(y_n),

where 1 < k < n is unknown. A CUSUM process is defined as

T_n(⌊nr⌋) = n^{−1/2} Σ_{t=1}^{⌊nr⌋} (y_t − ȳ_n),   r ∈ [0, 1],

where ȳ_n = n^{−1} Σ_{t=1}^n y_t. The CUSUM statistic is defined as the supremum of the scaled CUSUM process

CS_n = sup_{r ∈ [0,1]} |T_n(⌊nr⌋) / σ̂_n|,    (8)

where σ̂_n is a consistent estimator of the long run variance σ² = lim_{n→∞} n var(ȳ_n). We approximate σ̂²_n with Var(Y − θ̂), where θ̂ is the TVD estimate of Y. Under H_0,

CS_n →_D sup_{r ∈ [0,1]} |B(r) − rB(1)|,

where B(·) is a Brownian motion, so B(r) − rB(1) is a Brownian bridge on [0, 1]. The distribution of sup_{r ∈ [0,1]} |B(r) − rB(1)| can be computed using the well known Kolmogorov distribution (Shao and Zhang, 2010):

P(CS_n < t) = 1 − 2 Σ_{j=1}^∞ (−1)^{j−1} e^{−2j²t²}.    (9)

If the test rejects H_0, then the estimated changepoint location is k* = arg sup_{r ∈ [0,1]} |T_n(⌊nr⌋)|.

We apply CUSUM independently to each of the regions found via the changeset regionalization procedure described in Section 3.3. Each region, therefore, has an associated p-value and a potential estimated changepoint location. Due to the issue of multiple testing, we adjust the p-values using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) to control the CUSUM's False Discovery Rate (FDR) at a pre-specified level. We retain all estimated changepoints with an adjusted p-value below the pre-specified level. We summarize all four steps in this section in the following algorithm (Alg. 1).

Algorithm 1:
Multiple Changepoint Isolation
Input: functions f_1, ..., f_N; significance level α
Output: changepoint locations

1: for each projection do
2:   Project f_1, ..., f_N onto the real line (Sec. 3.1)
3:   Estimate free parameters c and k (Alg. 3)
4:   Coarsely segment the projected data with TVD, with λ = c√N (Eqn. 5)
5:   Combine segments into regions using changeset regionalization, with ε = k√N (Alg. 2)
6:   Compute the CUSUM statistic on each segment to get potential changepoints (Eqn. 8)
7:   Adjust p-values with an FDR correction; retain changepoints with p-values less than α
8: end
9: Concatenate and sort all projection based changepoints together
10: Average together any changepoints in the concatenated set within √N of each other
11: Return all remaining changepoints

The parameter λ (line 4) is chosen to scale with √N, as was done in Lin et al. (2016) (see Remark 5), to ensure the TVD is well estimated while still filtering out a majority of the extraneous changepoints. Rojas and Wahlberg (2014) derived that ε (line 5) depends on sample size N as ε = kN. However, directly using k made the numerical estimation challenging due to the small magnitudes of k. Instead, we make ε depend on √N and find that this works well for all of the scenarios we considered. The √N tolerance (line 10) was found to work well for reducing the number of redundant changepoints between the arclength and FPC projections in all scenarios.

Our method's computational complexity grows linearly with sample size because each sub-step of the method grows linearly in time, and the number of sub-steps does not grow with sample size. For the FPC decomposition, we use the FACE algorithm (Xiao et al., 2016), which is linear in sample size and function length. Computing the arclength of each function requires only a single pass over the data, so the arclength projection is linear as well. For TVD estimation, we use the Condat algorithm (Condat, 2013), which is linear in time.
Refining the segmentation over the data with changeset regionalization can be done with a single iteration using Algorithm 2. Finally, the CUSUM is computable in linear time, and the overlapping regions only cause CUSUM to be run (nearly) twice over the data set. The optimization procedure for λ and ε uses a grid search over a fixed sequence of c and k values, ensuring that the optimization step is also linear in sample size.

We investigate MCI's empirical performance in detecting changes in the mean and covariance of a functional process. We consider the case when there are no changepoints, when there are only a few changepoints (the "sparse" setting), and when there are many changepoints (the "dense" setting). We also consider data with light and heavy tails, and with symmetric and skewed distributions.
We start by simulating symmetric functional data. We use a Gaussian process (GP) model and a t-process (TP) model to simulate symmetric light-tailed and heavy-tailed functional data respectively. Let Z_t denote the symmetric process. We have

Z_t = µ_t + ε,

where µ_t ∈ L²([0, 1]) is the mean function and ε follows a zero-mean GP or TP with Matérn covariance function (Stein, 2012),

C(x, x′) = σ² (2^{1−ν} / Γ(ν)) (‖x − x′‖ / r)^ν K_ν(‖x − x′‖ / r),

with variance σ², range parameter r, smoothness parameter ν, and K_ν(·) a modified Bessel function of the second kind of order ν. If ε follows a TP, then an additional degrees of freedom parameter df is required; we set df = 3 in all simulations. We set the smoothness parameter ν = 1 to ensure mean square differentiability of the sample paths so that the arc length projections are meaningful.

To generate a skewed functional process {Y_t, t ∈ 1, ..., N} on the domain [0, 1], we transform Z_t via

Y_t(s) = log(1 + e^{Z_t(s)}).

We call the transformed Gaussian process and transformed t-process a log-sum Gaussian process (LS-GP) and a log-sum t-process (LS-TP) respectively. All results in Section 4.3 are based on LS-GP and LS-TP simulated data, because many real datasets, including the data we analyze here, are skewed. Results under a GP or TP show similar patterns as the LS-GP and LS-TP simulations, although results improved overall because skewed distributions are more challenging for changepoint detectors. We defer the GP and TP results to Section B.1 in the appendix.

We vary the mean µ, the variance σ², and the range r in the simulation over the values listed in Table 1. The parameter ν is not varied because it acts on the process in nearly the same way as r. We specify possible mean functions µ for the LS-GP or LS-TP following the simulations in Chiou et al. (2019).
We specify possible mean functions µ for the LS-GP or LS-TP followingthe simulations in Chiou et al. (2019). ψ ( t ) = 5 t − exp(1 − t ) ,ψ ( t ) = 0 . − t − . t − . t − . t − . ,ψ ( t ) = ψ ( t ) + 0 . πt ) ,ψ ( t ) = 1 + 3 t − t + 0 . πt ) ,ψ ( t ) = 1 + 3 t − t . Differences between ψ functions, as measured by the L norm, yield different scales of16igure 3: Five possible latent mean functions ψ – ψ for the LS-GP and LS-TP. Differencesbetween ψ functions yield different magnitudes of changepoints, i.e. ψ to ψ is a largeeasily detectable change while ψ to ψ is much smaller and harder to detect.change between the mean functions. The change from ψ to ψ is the largest, ψ to ψ is moderate, and ψ to ψ and ψ to ψ are both small. Additionally, these ψ functionsrepresent changes in both magnitude and the shape of the functional process, as shown inFigure 3.To generate M randomly spaced changepoints in either µ , σ , or r of an iid LS-GP (orLS-TP) sequence, we first randomly sample M + 1 parameter values and M + 1 segmentlengths (i.e. segment sample sizes). We generically denote the sequence of parameter valuesas θ , ..., θ M +1 and the sequence of segment lengths as n , ..., n M +1 . Only one parameter isvaried at a time, while the others are fixed at prespecified values. To generate the functionaltime series, we sample n LS-GP’s (or LS-TPs) with parameter θ = θ , then n LS-GP’s(or LS-TPs) with parameter θ = θ , and so on for the M + 1 segments. Each segment isconcatenated together so that the boundaries between segments represent changepoints inthe functional time series. Parameter values are sampled from Table 1, and sampling is doneso that consecutive values will are not the same. Segment lengths are samples from eithera Unif[500 , , µ ψ , ψ , ψ , ψ , ψ σ r ν µ indicates the mean function,while σ , r, and ν are scalar valued parameters in the Mat´ern covariance function. 
We ran 500 simulations for each parameter setting and calculated the Annotation error (Truong et al., 2019) and the Energy distance (Székely, 2003) between the true changepoints and the estimated changepoints, which we call the "Energy error". The Annotation error, widely used for assessing changepoint detection, measures the difference between the number of detected changepoints and the number of true changepoints.
Definition 4.1.
Let X = {x₁, …, x_n} and Y = {y₁, …, y_m} be two sets; then the Annotation distance between X and Y is

d_A(X, Y) = |n − m|.  (10)

A low Annotation error means that the algorithm consistently estimates the number of changepoints correctly.

The Energy error is non-standard compared to the usual Hausdorff metric for changepoint comparison (Truong et al., 2019), but we find that it offers a more comprehensive depiction of the differences between the true and estimated changepoints when there are multiple changepoints. This is because the Hausdorff metric tends to overweight single changepoint errors rather than assess overall performance. We provide details on the difference between the Hausdorff metric and the Energy distance in Appendix B.2. We also provide simulation results under the Hausdorff metric in the Supplement. The Energy distance between two sets is defined as follows.

Definition 4.2.
Let X = {x₁, …, x_n} and Y = {y₁, …, y_m} be two sets; then the Energy distance between X and Y is

d_E(X, Y) = (2/(nm)) Σ_{i=1}^{n} Σ_{j=1}^{m} |x_i − y_j| − (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} |x_i − x_j| − (1/m²) Σ_{i=1}^{m} Σ_{j=1}^{m} |y_i − y_j|.  (11)

A low Energy distance between the estimated and true changepoints, i.e., a low Energy error, means that the estimated changepoints are very similar to the true changepoints. In all simulation settings, we compare the detectors on their ability to achieve low Annotation and low Energy errors.

Because the DSBE method proposed by Chiou et al. (2019) is the only multiple changepoint detection method that is computationally comparable to MCI, we compare our method to DSBE. Additionally, we include comparisons against three univariate, linear-time changepoint detectors applied to the two projections: the first functional principal component and the arc length. This allows us to gauge the advantages of our detection method on top of the projections. The three methods are the PELT algorithm (Killick et al., 2012), the r-FPOP algorithm (Fearnhead and Rigaill, 2019), and the WBS procedure (Fryzlewicz, 2014). For PELT, we used the cpt.meanvar() function in the changepoint package with default settings and method = "PELT". For r-FPOP, we used the Rob_seg.std() function from the robseg package with the L1 loss and λ = 5 log(n); the authors recommend that λ scale with the log of the sample size, and the multiplier 5 gave the best overall results. For WBS, we used the wbs() function from the wbs package with default settings, with changepoints found via ssic.penalty minimization. For DSBE, we used the authors' provided code with number of changepoint candidates K = 50, minimum segment length b = 5, and significance threshold α/K. DSBE is only applied to the mean change simulations because it was designed to detect only mean changes.
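Definitions 4.1 and 4.2 translate directly into code. The sketch below is an illustrative Python rendering (assuming each changepoint set is given as a vector of locations), not the evaluation code used in the paper.

```python
import numpy as np

def annotation_error(x, y):
    """Definition 4.1: difference in the number of changepoints."""
    return abs(len(x) - len(y))

def energy_distance(x, y):
    """Definition 4.2: Energy distance between two changepoint sets."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cross = np.abs(x[:, None] - y[None, :]).mean()      # (2/nm) double sum
    within_x = np.abs(x[:, None] - x[None, :]).mean()   # (1/n^2) double sum
    within_y = np.abs(y[:, None] - y[None, :]).mean()   # (1/m^2) double sum
    return 2 * cross - within_x - within_y
```

The Energy distance is zero exactly when the two sets have identical empirical distributions, which is why it summarizes overall agreement rather than a single worst-case deviation.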
Figure 4: Annotation error when no changepoints are present in the data. MCI and DSBE nearly always detect 0 changepoints in both the light tailed (LS-GP) and the heavy tailed (LS-TP) settings. PELT and WBS detect an average of 55 and 25 changepoints, respectively, when no changepoints exist and the functional process is heavy tailed. FPOP performs better than PELT and WBS, but still detects around six false changepoints on average. Error values are plotted on the log(1 + error) scale.

We first consider the situation where there are no changepoints, i.e., M = 0. Functional observations are generated from either a LS-GP or LS-TP with constant mean function µ = 0 and Matérn covariance with σ² = 1, r = 0.2, and ν = 1. Figure 4 shows the distribution of Annotation errors under the LS-GP and LS-TP for each method.

All methods have an almost uniformly zero Annotation error under the LS-GP, meaning that all methods have an essentially zero false positive rate. However, under the LS-TP, only MCI and DSBE maintain their almost uniformly near-zero error rates, while the Annotation errors for FPOP, PELT, and WBS all increase dramatically. PELT and WBS have particularly high Annotation error rates, indicating their sensitivity to larger random fluctuations caused by the heavy tailed generating process. FPOP is more robust than PELT and WBS, due to FPOP using the robust L1 loss.

Figure 5: Annotation and Energy error rates for sparse changes in the mean. Under a heavy tailed log-sum t-process (LS-TP), only MCI is able to maintain a small error rate. Error values are plotted on the log(1 + error) scale.

We next consider the situation when the changepoints are relatively far apart, i.e., the changepoints are sparse. We simulate data as described in Section 4.1 with M = 5 changepoints and segment lengths sampled from Unif[5000, ·]. For changepoints in the mean µ, we fix the covariance parameters to σ² = 1, ν = 1, and r = 0.2. For changepoints in variance, we fix µ = 0, ν = 1, and r = 0.2, and for changepoints in range we fix µ = 0, ν = 1, and σ² = 1.

Figure 5 summarizes each method's Annotation error and Energy error in detecting changes in the mean. Our MCI method maintains the overall lowest Annotation error and the overall lowest Energy error. Together, this shows that MCI accurately estimates the number and location of the changepoints, whether the error process is light or heavy tailed. The univariate methods applied to our projections have reasonably low Annotation error rates on the LS-GP, but show extremely high Annotation error rates on the LS-TP due to overestimation, while their Energy error rates are high on both. DSBE sees the least deterioration from LS-GP to LS-TP compared to the alternative methods, although its Energy error rates were already very high under the LS-GP.

Figure 6: Annotation and Energy error rates for sparse changes in the covariance parameters of a LS-GP and a LS-TP. MCI performs the best in all cases. Error values are plotted on the log(1 + error) scale.

We then investigated changes in the variance σ² and the range r of the Matérn covariance function for the LS-GP and LS-TP. The results are summarized in Figure 6. DSBE was not included in these simulations because it was not designed to detect changes in covariance. MCI again achieves the lowest overall error rates among all methods and for both parameters. Even with the heavy tailed LS-TP data, MCI maintains a remarkable, nearly zero Annotation error. Although MCI's Energy error increases with the LS-TP data, it is still several times smaller than that of the other methods in detecting range changes. Variance changes were difficult for all methods across the board.

Finally, we consider the situation when the changepoints are relatively close to each other, i.e., when changepoints are dense.
To simulate data with dense changepoints, we set M = 50 and sample segment lengths from Unif[500, ·].

To demonstrate MCI's linear computation time, we conduct a simulation study to empirically assess the run time growth of MCI over an increasing sequence of sample sizes. We take the simulation with a mean zero GP and covariance parameters σ² = 1, r = 0.2, and ν = 1 as an example. We consider sample sizes from N = 1000 to N = 7000 in 1000-unit increments, and run the simulation 1000 times for each sample size. We show the run time of MCI under each of the seven sample sizes in Figure 9. The run time of MCI exhibits a linear trend with the sample size.

5 Application to Profiles of Water Vapor
We now revisit the motivating data example illustrated in the Introduction and apply ourMCI method to this data.
The water vapor mixing ratio is the water vapor density over the dry air density in a given atmospheric unit. It is an important variable in Meteorology for distinguishing individual air masses, monitoring the effects of soil evapotranspiration and large water body evaporation (North et al., 2014), and for the early detection of heavy precipitation events (Sakai et al., 2019). Changepoint detection is useful for identifying sudden changes in an air parcel's water vapor content due to precipitation events and air parcel mixing. Retrospectively identifying sudden changes in the water vapor profiles' structure is often a necessary first step before constructing statistical models for identifying precipitation events.

We apply our functional changepoint detector to the water vapor mixing ratio profiles collected from the Atmospheric Emitted Radiance Interferometer (AERI) instrument at the Lamont, Oklahoma Facility. The AERI instruments are maintained by the U.S. Department of Energy's (DOE) Atmospheric Radiation Measurement (ARM) Program to collect high-resolution atmospheric profile data (Stokes and Schwartz, 1994). In this dataset, each profile consists of 58 measurements of the water vapor mixing ratio along a single atmospheric column from 0 to 44,000 meters above ground level in Lamont, Oklahoma. Whole profiles were collected every 8 minutes, thus providing near-continuous monitoring of atmospheric conditions. We removed the top 18 altitude points, representing the 11,000 to 44,000-meter range, due to the extremely high rate of measurement errors in this range. Therefore, we only consider the 40 measurements from 0 to 10,000 meters.

For our analysis, we consider the entire time series of water vapor profiles from January 4th, 2007 to March 10th, 2014. This period corresponds to 234,062 profiles, each sampled at the same 40 altitudes. To illustrate the data, we plotted the water vapor profiles from August 9th through September 6th, 2008 in Figure 1. Each vertical line represents an individual profile, with colors indicating the value of the profile at each altitude. Abrupt increases and decreases in water vapor along time are visible, indicating rapid changes from high density to low density and vice versa. These changes could be caused by sudden precipitation events and air mass mixing.

Figure 10: Marginal distributions at two altitude bands in the AERI profile data (450 meters and 1800 meters). Non-Gaussian behavior such as skewness and heavy tails is observed.

In Figure 10, we show the marginal distribution at two altitude bands (450 and 1850 meters) across all profiles. For this plot, we standardized each profile by removing the profile mean and dividing by the profile standard deviation so that mean and variance changes would not obscure the marginal distribution. Even after standardization, the marginals show non-Gaussian behavior, including heavy tails and excess kurtosis.
We applied our MCI method to the water vapor mixing ratio profiles and found 210 changepoints. Figure 11 shows four examples of the changepoints identified in the first functional principal component, and Figure 12 shows four examples of changepoints identified in the arc length projection.

Figure 11: Four examples of the changepoints detected by MCI in the first FPC of the AERI profiles. The black curve is the projected values, the red vertical lines mark the estimated changepoints, and the red horizontal lines indicate the segment means between changepoints.

A number of commonalities between the two plots are apparent. The first is that panel A (top left) and panel D (bottom right) are highly similar in appearance, and the changepoints identified in these regions are similar between the two projections. This happens because the first FPC and the arc lengths are not necessarily orthogonal to each other, and changepoints may manifest in both the low and high frequency spectrum. Another common feature is that MCI is robust to independence violations and heavy tails in both the arc length and the first FPC. This can be seen in the right half of panel C (bottom left) in both Figures 11 and 12, where both time series exhibit an autoregressive structure, yet MCI does not detect extra changepoints due to the spurious mean changes induced by the correlations. Heavy tailed behavior can be observed in panels A, B, and D, where large "spikes" in the time series appear. MCI does not detect these anomalies as changepoints in either projection.

Figures 11 and 12 also show that there are many differences between the two projections, meaning that the first FPC and the arc length measure very different aspects of the data.

Figure 12: Four examples of the changepoints detected by MCI in the arc length of the AERI profiles. The black curve is the projected values, the red vertical lines mark the estimated changepoints, and the red horizontal lines indicate the segment means between changepoints.

Panels B and C are almost completely different in the two plots, including the locations of the detected changepoints. MCI was, therefore, able to pick up a broader range of changepoints than the first FPC alone would allow, because of the arc length projection.
We propose the Multiple Changepoint Isolation method for detecting multiple changepoints in a functional time series. The changes can be either in the mean or in the dependence structure of the functional data. Our method first introduces a pair of projections to extract the functional process's high frequency and low frequency features. We then use TVD and a new changeset regionalization procedure to identify regions of the time series that likely contain at most one changepoint. CUSUM is then applied to each region individually to identify the potential changepoint. Our extensive simulations show that the MCI method is accurate, runs in linear time, and is robust to the data distribution. We finally demonstrate MCI on water vapor mixing ratios over time.

Using the leading FPC coefficient and the arc length as the two projections effectively preserves the major low frequency and high frequency variability in the functional data sequence. Our method is, therefore, able to detect a broad range of changepoints. In contrast, entirely functional metric based approaches are, in general, only powerful against changes in the mean (Aue et al., 2018). Our effective data reduction, compared to using several FPCs, also reduces the number of tests in the multiple testing context and thus helps retain high power in changepoint detection. More projections would lead to more tests and less power.

Our changepoint detection differs from existing TVD (or fused lasso) based approaches in two significant ways. First, we use TVD only as a screening method to eliminate locations that are very unlikely to be changepoints. Second, we propose a changeset regionalization procedure to further filter out extraneous changepoints and identify proper regions in which to apply the CUSUM statistic.
This strategy is more efficient and precise than directly using TVD for changepoint detection, since TVD has only been shown to be ε-consistent (Rojas and Wahlberg, 2014).

In future work, we would like to consider the theoretical underpinnings of the MCI method more rigorously. For instance, it remains to be shown whether MCI is a consistent estimator of the changepoints, or whether MCI is asymptotically powerful, though our simulations seem to imply those properties. We could further study MCI's robustness and compare its current form with alternatives using robust changepoint statistics. The procedure to optimize the tuning parameters λ and ε (Alg. 3) is also heuristic and not guaranteed to find globally optimal parameters. A better algorithm, with optimality guarantees, may be possible. However, optimal parameters are not strictly necessary for optimal changepoint detection, so studying the trade-off between parameter optimization and changepoint detection may also be interesting.

References

Aston, J. A. and Kirch, C. (2012). Detecting and estimating changes in dependent functional data. Journal of Multivariate Analysis, 109:204–220.

Aue, A., Gabrys, R., Horváth, L., and Kokoszka, P. (2009). Estimation of a change-point in the mean function of functional data. Journal of Multivariate Analysis, 100(10):2254–2269.

Aue, A., Rice, G., and Sönmez, O. (2018). Detecting and dating structural breaks in functional data without dimension reduction. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):509–529.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Berkes, I., Gabrys, R., Horváth, L., and Kokoszka, P. (2009). Detecting changes in the mean of functional observations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):927–946.

Chan, N. H., Yau, C. Y., and Zhang, R.-M. (2014). Group lasso for structural break time series. Journal of the American Statistical Association, 109(506):590–599.

Chiou, J.-M., Chen, Y.-T., and Hsing, T. (2019). Identifying multiple changes for a functional data sequence with application to freeway traffic segmentation. The Annals of Applied Statistics, 13(3):1430–1463.

Condat, L. (2013). A direct algorithm for 1-D total variation denoising. IEEE Signal Processing Letters, 20(11):1054–1057.

Dalalyan, A. S., Hebiri, M., and Lederer, J. (2017). On the prediction performance of the lasso. Bernoulli, 23(1):552–581.

Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3):879–921.

Fearnhead, P. and Rigaill, G. (2019). Changepoint detection in the presence of outliers. Journal of the American Statistical Association, 114(525):169–183.

Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281.

Gromenko, O., Kokoszka, P., and Reimherr, M. (2017). Detection of change in the spatiotemporal mean function. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):29–50.

Guntuboyina, A., Lieu, D., Chatterjee, S., and Sen, B. (2017). Adaptive risk bounds in univariate total variation denoising and trend filtering. arXiv preprint arXiv:1702.05113.

Hörmann, S. and Kokoszka, P. (2012). Functional time series. In Handbook of Statistics, volume 30, pages 157–186. Elsevier.

Hörmann, S. and Kokoszka, P. (2010). Weakly dependent functional data. The Annals of Statistics, 38(3):1845–1884.

Hyndman, R. J. and Booth, H. (2008). Stochastic population forecasts using functional data models for mortality, fertility and migration. International Journal of Forecasting, 24(3):323–342.

Hyun, S., G'Sell, M., and Tibshirani, R. J. (2016). Exact post-selection inference for changepoint detection and other generalized lasso problems. arXiv preprint arXiv:1606.03552.

Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.

Kokoszka, P. and Zhang, X. (2012). Functional prediction of intraday cumulative returns. Statistical Modelling, 12(4):377–398.

Kulla, B. S. and Ritter, C. (2019). Water vapor calibration: Using a Raman lidar and radiosoundings to obtain highly resolved water vapor profiles. Remote Sensing, 11(6):616.

Lee, S. and Jung, S. (2016). Combined analysis of amplitude and phase variations in functional data. arXiv preprint arXiv:1603.01775.

Li, X. and Ghosal, S. (2018). Bayesian change point detection for functional data. arXiv preprint arXiv:1808.01236.

Lin, K., Sharpnack, J., Rinaldo, A., and Tibshirani, R. J. (2016). Approximate recovery in changepoint problems, from ℓ₂ estimation error rates. arXiv preprint arXiv:1606.06746.

Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2017). On optimal multiple changepoint algorithms for large data. Statistics and Computing, 27(2):519–533.

Mammen, E. and van de Geer, S. (1997). Locally adaptive regression splines. The Annals of Statistics, 25(1):387–413.

North, G. R., Pyle, J. A., and Zhang, F. (2014). Encyclopedia of Atmospheric Sciences, volume 1. Elsevier.

Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2):100–115.

Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences, 4.

Rice, G. and Zhang, C. (2019). Consistency of binary segmentation for multiple change-points estimation with functional data. arXiv preprint arXiv:2001.00093.

Rojas, C. R. and Wahlberg, B. (2014). On change point detection using the fused lasso method. arXiv preprint arXiv:1401.5408.

Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268.

Rudin, W. (1964). Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York.

Sakai, T., Nagai, T., Izumi, T., Yoshida, S., and Shoji, Y. (2019). Automated compact mobile Raman lidar for water vapor measurement: instrument description and validation by comparison with radiosonde, GNSS, and high-resolution objective analysis. Atmospheric Measurement Techniques, 12(1).

Shang, H. L. and Hyndman, R. J. (2011). Nonparametric time series forecasting with dynamic updating. Mathematics and Computers in Simulation, 81(7):1310–1324.

Shao, X. and Zhang, X. (2010). Testing for change points in time series. Journal of the American Statistical Association, 105(491):1228–1240.

Sharipov, O., Tewes, J., and Wendler, M. (2016). Sequential block bootstrap in a Hilbert space with application to change point analysis. Canadian Journal of Statistics, 44(3):300–322.

Stein, M. L. (2012). Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media.

Stokes, G. M. and Schwartz, S. E. (1994). The Atmospheric Radiation Measurement (ARM) Program: Programmatic background and design of the cloud and radiation test bed. Bulletin of the American Meteorological Society, 75(7):1201–1222.

Székely, G. J. (2003). E-statistics: The energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report, 3(05):1–18.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108.

Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335–1371.

Truong, C., Oudre, L., and Vayatis, N. (2019). Selective review of offline change point detection methods. Signal Processing, page 107299.

Tucker, J. D., Wu, W., and Srivastava, A. (2013). Generative models for functional data using phase and amplitude separation. Computational Statistics & Data Analysis, 61:50–66.

Versteeg, H. K. and Malalasekera, W. (2007). An Introduction to Computational Fluid Dynamics: The Finite Volume Method. Pearson Education.

Woodall, W. H. (2007). Current research on profile monitoring. Production, 17(3):420–425.

Xiao, L., Zipunnikov, V., Ruppert, D., and Crainiceanu, C. (2016). Fast covariance estimation for high-dimensional functional data. Statistics and Computing, 26(1-2):409–421.

Yao, Y.-C. (1988). Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters, 6(3):181–189.

Zhang, X., Shao, X., Hayhoe, K., and Wuebbles, D. J. (2011). Testing the structural stability of temporally dependent functional observations and application to climate projections. Electronic Journal of Statistics, 5:1765–1796.

A Additional Algorithms
We now introduce several algorithms mentioned in the body of the manuscript. The first is the changeset regionalization procedure, Algorithm 2, which merges "nearby" changepoints into groups called changesets.
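Before the formal listing, the merging step can be sketched in a few lines. This is a hypothetical Python rendering; like Algorithm 2, it assumes the changepoint list is sorted in increasing order.

```python
def regionalize(changepoints, c):
    """Merge sorted TVD changepoints within distance c into changesets."""
    if not changepoints:
        return []
    sets = [[changepoints[0]]]          # start with the first changepoint
    for prev, cur in zip(changepoints, changepoints[1:]):
        if abs(cur - prev) < c:
            sets[-1].append(cur)        # link into the current changeset
        else:
            sets.append([cur])          # start a new changeset
    return sets
```

Because consecutive differences are checked once each, this is a single pass over the list, i.e., linear in the number of TVD changepoints.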
Algorithm 2: Changeset regionalization

Input: TVD changepoints A_θ = {t₁, …, t_p} and link radius c
Output: Changeset B_{θ,c}

    Initialize j = 1
    Initialize B_j = {t₁}
    for i ← 2 to p do
        if |t_i − t_{i−1}| < c then
            B_j ← B_j ∪ {t_i}
        else
            Initialize B_{j+1} = {t_i}
            Increment j ← j + 1
        end
    end
    Return B_{θ,c} = {B₁, …, B_m}

Algorithm 2 starts with the singleton set containing the first changepoint, t₁, detected by TVD. It then checks whether the next changepoint, t₂, is within c of t₁, and either adds t₂ to the set containing t₁ or starts a new set containing only t₂. It then proceeds down the list of changepoints until every changepoint belongs to a set. Since the changepoints are assumed to be in order, this requires only a single pass over the list of changepoints.

The second algorithm, Alg. 3, tunes the hyperparameters of TVD and Alg. 2. That is, we write the regularization parameter of TVD as λ = c√N and the linkage parameter of Alg. 2 as ε = k√N, and try to find optimal values for c and k. Alg. 3 essentially performs coordinate descent on c and k to find the pair that minimizes the BIC of a step-wise function with jumps at the changepoints (found using c and k).

B Additional Simulations
We investigate MCI's empirical performance in detecting changes in the mean and covariance of a symmetric functional process. We again consider the case when there are no changepoints, when there are only a few changepoints (the "sparse" setting), and when there are many changepoints (the "dense" setting). The exact same generation process is used as in Section 4.1, except that the log-sum transformation Y_t(s) = log(1 + e^{Z_t(s)}) is omitted.

Algorithm 3: Optimize parameters

Input: Projections y₁, …, y_N and significance level α
Output: free parameters c and k

    Initialize k* = 1
    for c over a grid of candidate values do
        Segment y₁, …, y_N with TVD with λ = c√N
        Regionalize the TVD segmentation with Alg. 2 with ε = k*√N
        Compute CUSUM on each region and adjust p-values with the FDR correction
        Denote t₁, …, t_M as the changepoints found via CUSUM with p-values < α
        Fit a step-wise function with steps at t₁, …, t_M
        Compute and save the BIC of the fitted function
    end
    Let c* be the c value with the minimal BIC
    for k over a grid of candidate values do
        Segment y₁, …, y_N with TVD with λ = c*√N
        Regionalize the TVD segmentation with Alg. 2 with ε = k√N
        Compute CUSUM on each region and adjust p-values with the FDR correction
        Denote t₁, …, t_M as the changepoints found via CUSUM with p-values < α
        Fit a step-wise function with steps at t₁, …, t_M
        Compute and save the BIC of the fitted function
    end
    Let k* be the k value with the minimal BIC
    Return c*, k*

B.1 Assessment results
B.1.1 No changepoints
We first consider the situation where there are no changepoints, i.e., M = 0. Functional observations are generated from either a GP or TP with constant mean function µ = 0 and Matérn covariance with σ² = 1, r = 0.2, and ν = 1. Figure 13 shows the results for each process.

All methods have an almost uniformly zero Annotation error under the Gaussian process, meaning that all methods have an essentially zero false positive rate. However, under a t-process, only our method MCI and DSBE maintain their almost uniformly near-zero error rates, while the Annotation errors for FPOP, PELT, and WBS all increase dramatically. PELT and WBS have particularly high Annotation error rates, indicating their sensitivity to larger random fluctuations caused by the heavy tailed generating process. FPOP is more robust than PELT and WBS, due to FPOP using the robust L1 loss. Under a t-process (TP), only MCI is able to maintain a small error rate.

B.1.2 Sparse changepoints
We next consider the situation when the changepoints are relatively far apart, i.e., the changepoints are sparse. We simulate data as described in Section 4.1 with M = 5 changepoints and segment lengths sampled from Unif[5000, ·]. For changepoints in the mean, we fix the covariance parameters to σ² = 1, ν = 1, and r = 0.2. For changepoints in variance, we fix µ = 0, ν = 1, and r = 0.2, and for changepoints in range we fix µ = 0, ν = 1, and σ² = 1.

In Figure 14, we summarize each method's Annotation error and Energy error in detecting changes in the mean. Our MCI method maintains the overall lowest Annotation error and the overall lowest Energy error. Together, these plots show that MCI accurately estimates the number and location of the changepoints, whether the error process is light or heavy tailed. The other methods perform similarly to MCI on the GP data, but then experience massive increases in both Annotation and Energy error rates on the TP data.

Figure 15: Annotation and Energy error rates for sparse changes in the covariance parameters of a GP and a TP.

We then consider changes in the variance σ² and the range r of the Matérn covariance function, respectively. The results are summarized in Figure 15. DSBE was not included in these simulations since it was not designed to detect changes in covariance. MCI again achieves the lowest overall error rates among all methods and for both parameters. Even with the heavy tailed TP data, MCI maintains a remarkable, nearly zero Annotation error. Although MCI's Energy error increases with the TP data, it is still several times smaller than that of the other methods. It is worth mentioning that MCI is a powerful detector for changes in either range or variance, whereas the other methods exhibit wildly different skill on these two parameters when the data are Gaussian.

B.1.3 Dense changepoints
Finally, we consider the situation when the changepoints are relatively close to each other, i.e., when changepoints are dense. To simulate data with dense changepoints, we set M = 50 and sample segment lengths from Unif[500, ·].

B.2 Hausdorff vs Energy error
The Hausdorff metric only looks at the "worst-case" estimation error, and so it is difficult for it to say anything about the average estimation error. If a changepoint detector

Figure 18: Annotation and Energy error rates for sparse changes in the mean function. Under a Gaussian process (GP), error rates are comparable between MCI, PELT, and WBS. Under a t