Inferring serial correlation with dynamic backgrounds
Song Wei^a, Yao Xie^a, and Dobromir Rahnev^b
^a School of Industrial and Systems Engineering, ^b School of Psychology, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, U.S.A.
February 2, 2021
Abstract
Sequential data with serial correlation and an unknown, unstructured, and dynamic background is ubiquitous in neuroscience, psychology, and econometrics. Inferring serial correlation for such data is a fundamental challenge in statistics. We propose a total variation constrained least square estimator coupled with hypothesis tests to infer the serial correlation in the presence of an unknown and unstructured dynamic background. The total variation constraint on the dynamic background encourages a piecewise constant structure, which can approximate a wide range of dynamic backgrounds. The tuning parameter is selected via the Ljung-Box test to control the bias-variance trade-off. We establish a non-asymptotic upper bound for the estimation error through variational inequalities. We also derive a lower error bound via Fano's method and show that the proposed method is near-optimal. Numerical simulations and a real study in psychology demonstrate the excellent performance of our proposed method compared with the state-of-the-art.
Keywords:
Autoregressive time series; high-dimensional lasso; non-stationarity; total variation constraint; variational inequality.
Serial correlation and serial dependence have been central to time series analysis [Hong, 2010]. Modern time-series data from neuroscience, psychology, and economics usually contain both substantial serial dependence and a non-stationary drift [Akrami et al., 2018, Wexler et al., 2015, Moskowitz et al., 2012, Fischer and Whitney, 2014, Cicchini et al., 2018, McIlhagga, 2008, Rahnev et al., 2015]. A well-known example comes from human reaction times, which are thought to be autocorrelated but also to drift throughout an experiment [Laming, 1968]. The drift can be due to many factors, such as becoming better at the task, increased tiredness, and fluctuations in attention or arousal. None of these influences takes a specific parametric form. While some (e.g., learning or fatigue) are likely to be monotonic, others (e.g., fluctuations in attention) can be expected to waver unpredictably. This non-stationary background drift is thus typically considered a nuisance variable.

It is typically of strong scientific interest to infer the presence and/or assess the strength of serial correlation in the presence of an unknown and unstructured dynamic background. The magnitude of autocorrelation has direct implications for many scientific theories. For example, Fischer and Whitney [2014] proposed that the human brain creates a "perceptual continuity field" where the subjective percept at one point in time directly influences the percept within a subsequent 15-second window. Such effects are known as "serial dependence" and are an active area of research within psychology and neuroscience. Progress in this and related endeavors depends on one's ability to estimate the magnitude of autocorrelation in a time series, even in the presence of substantial unstructured drift.

The most popular tool to handle serial correlation is the autoregressive time series model. However, the presence of even a small drift can induce strong biases in the autoregressive coefficients. For example, unmodeled background drift can masquerade as autocorrelation, as illustrated in the first panel of Figure 1. This issue has been pointed out before by Dutilh et al. [2012], but no solution exists to date. Techniques have been developed for tracking an unknown dynamic background with minimal structural assumptions [Hodrick and Prescott, 1997, Kim et al., 2009, Harchaoui and Lévy-Leduc, 2010], but these approaches do not estimate the serial correlation. Thus, we currently lack an efficient method to capture the autocorrelation strength in a time series in the presence of highly unstructured dynamic drifts.
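To see the masquerading effect concretely, consider the following minimal simulation in Python (a sketch with illustrative settings of our own choosing, not the paper's): it generates an ar(1) series around a slowly drifting background and fits a plain ar(1) model by least squares, and the naive lag-1 coefficient estimate typically comes out well above the true value.

    import numpy as np

    rng = np.random.default_rng(0)
    T, alpha, sigma, delta0 = 2000, 0.1, 0.2, 0.05

    # A slowly drifting background: random walk with steps in [-delta0/2, delta0/2].
    f = np.cumsum(delta0 * (rng.random(T) - 0.5))

    # AR(1) observations around the drifting background, as in model (1) below.
    x = np.zeros(T)
    for i in range(1, T):
        x[i] = f[i] + alpha * x[i - 1] + sigma * rng.standard_normal()

    # Naive fit of x_i on x_{i-1} (plus intercept) that ignores the background.
    A = np.vstack([x[:-1], np.ones(T - 1)]).T
    alpha_naive = np.linalg.lstsq(A, x[1:], rcond=None)[0][0]
    print(f"true alpha = {alpha}, naive estimate = {alpha_naive:.3f}")  # inflated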
[Figure 1 about here: three columns of panels showing the observations and true background together with the fitted ar(1) model and fitted background, for $\delta = 0$, a well-chosen $\delta$, and an overly large $\delta$.]
Figure 1: An example showing that proper modeling of the dynamic background is important in capturing serial correlation. The estimate $\hat\alpha$ is specified at the top of each column, with the ground truth $\alpha = 0.1$; $\delta$ is a hyperparameter that controls the data fit and model complexity. In the first panel, we directly fit an ar(1) model while ignoring the dynamic background, leading to an over-estimated $\hat\alpha$. In the third panel, the result overfits the dynamic background, leading to an under-estimated autoregressive coefficient. The second panel is the desired result obtained by our method.

Motivated by this, we consider the following problem. Assume a sequence of observations $x_1, \ldots, x_T$ over a time horizon $T$, generated from the underlying non-stationary ar(p) time series model:
$$x_i = f_i + \sum_{j=1}^{p} \alpha_j x_{i-j} + \varepsilon_i, \quad i = 1, \ldots, T, \qquad (1)$$
where $\varepsilon_1, \ldots, \varepsilon_T$ are i.i.d. sub-Gaussian random noise with zero mean and variance $\sigma^2$, $\alpha_1, \ldots, \alpha_p$ are the autoregressive coefficients, $f_1, \ldots, f_T$ form the deterministic dynamic background, and $x_{-p+1}, \ldots, x_0$ are the known history. The goal is to infer the presence of, and/or estimate, the unknown autoregressive coefficients and the dynamic background simultaneously from data. To keep our model general, we do not impose parametric or distributional assumptions on the dynamic background $f_i$'s.

In this paper, we present a new convex-optimization-based method to estimate the autoregressive coefficients for sequential data in the presence of an unknown dynamic background, coupled with the Ljung-Box test for model diagnosis. We cast the problem as minimizing the least square error with a total variation constraint on the dynamic background, which encourages a piecewise constant structure and can approximate a wide range of unstructured drifts with good precision. We establish performance guarantees for the $\ell_2$ recovery error of the coefficients. To efficiently tune the hyperparameter and control the bias-variance trade-off, we adopt the Ljung-Box test [Ljung and Box, 1978]. Extensive numerical experiments are performed to validate the effectiveness of the proposed method. We also test our method on a real psychology dataset to demonstrate that it can infer whether or not there is a statistically significant correlation.

The rest of the paper is organized as follows. In the remainder of this section, we discuss related works. Section 2 presents the proposed method. Section 3 contains the main theoretical results, including non-asymptotic bounds on the $\ell_2$ estimation error for the ar(1) model, for ease of presentation. We discuss how to extend the results to ar(p) models in Section 4. Section 5 contains simulation results that demonstrate the good performance of our method and validate the theoretical results. Section 6 presents a real-data study from a psychology experiment. Finally, Section 7 summarizes the paper.

Standard time series models [Brockwell et al., 1991] such as autoregressive and moving average models do not include dynamic backgrounds. On the other hand, classical approaches to capturing a dynamic background usually make strong structural assumptions, such as a linear trend, a periodic trend [Clark, 1987], or a hidden Markov model [Hamilton, 1989]. Our problem involves a highly unstructured background. This requires a new solution approach; moreover, existing theory does not apply because the unstructured dynamic background leads to a non-stationary time series, which does not satisfy the strong-mixing condition.
This prevents us from using asymptotic results from the classic time series literature.

Recent works on similar problems also use convex optimization to fit the dynamic background while making few structural assumptions. This line of work typically solves a least square problem with various penalties or constraints that encourage desired structures on the fitted background, which can approximate the unknown ground truth. For instance, the H-P filter [Hodrick and Prescott, 1997] imposes an $\ell_2$ penalty on the second-order difference to encourage a smooth background; Kim et al. [2009] considered a variant of the H-P filter with an $\ell_1$ regularization function to capture a piecewise linear background. Another related work [Harchaoui and Lévy-Leduc, 2010] considered change-point detection in the means using least square estimation with a total variation penalty; since the number of change points is unknown, that work essentially estimates a piecewise constant background. While many advances have been achieved, these existing works have not considered serial correlation together with the dynamic background.

Our proposed method is related to variable fusion [Land and Friedman, 1997] and the fused lasso [Tibshirani et al., 2005]. Here the unstructured, dynamic background leads to a high-dimensional problem: we have $T$ equations and $T + p$ variables, so the optimal solution is not unique. Thus, we borrow the analytical techniques used in analyzing the high-dimensional lasso, particularly the restricted eigenvalue conditions for the design matrix [Bickel et al., 2009, Meinshausen and Yu, 2009, Van De Geer and Bühlmann, 2009], to derive the theoretical results, while further exploiting the special structure of our design matrix.

There are two closely related recent works: Xu [2008] used polynomials to approximate the dynamic background, and Zhang et al. [2020] developed an online forecasting algorithm based on least square estimation with an $\ell_2$ variable fusion constraint. These works do not explicitly consider highly unstructured backgrounds. We compare with both methods via numerical simulations in Section 5 and show the advantage of our approach when there are dynamic, unstructured backgrounds; moreover, we also present a method for hyperparameter selection based on the Ljung-Box test.

2 Proposed method

2.1 Total variation constrained least square estimator

Consider a total variation constrained least square estimator that estimates the autoregressive coefficients and the dynamic background simultaneously, obtained by solving the following convex optimization problem:
$$\underset{\alpha_1,\ldots,\alpha_p,\,f_1,\ldots,f_T}{\text{minimize}} \quad \frac{1}{T}\sum_{i=1}^{T}\Big(x_i - \sum_{j=1}^{p}\alpha_j x_{i-j} - f_i\Big)^2 \quad \text{subject to} \quad \sum_{i=1}^{T-1}|f_{i+1} - f_i| < \delta, \qquad (2)$$
where $\delta$ is a user-specified hyperparameter (the selection of $\delta$ is discussed in Section 2.2). As discussed for the problem formulation (1), different from the conventional autoregressive model, here we consider an unknown and time-varying background. Since the number of observations and the number of parameters grow at the same rate as the time horizon $T$ increases, we cannot uniquely recover the parameters from the available observations. Thus, we impose a total variation constraint on the dynamic background, essentially choosing the solution with the smallest variation. Such an approach can serve as a good approximation to a broad class of unstructured, dynamic backgrounds.

2.2 Selection of the tuning parameter

We will show that the choice of the hyperparameter $\delta$ in (2) critically impacts its solution (the recovered dynamic background and the AR coefficients).
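For a fixed $\delta$, problem (2) is a standard convex program that can be handed to an off-the-shelf solver. Below is a minimal sketch in Python with cvxpy (our experiments use the Matlab cvx package [Grant and Boyd, 2014]; the helper name tv_lse and all details here are only an illustration):

    import cvxpy as cp
    import numpy as np

    def tv_lse(x, x0, delta, p=1):
        # Solve problem (2) for a fixed tuning parameter delta.
        # x: observations x_1..x_T; x0: known history x_{1-p}..x_0.
        T = len(x)
        xx = np.concatenate([x0, x])            # x_{1-p}, ..., x_T
        alpha = cp.Variable(p)
        f = cp.Variable(T)
        # Lagged copies of the data give the AR part sum_j alpha_j * x_{i-j}.
        lags = np.column_stack([xx[p - j: p - j + T] for j in range(1, p + 1)])
        resid = x - lags @ alpha - f
        objective = cp.Minimize(cp.sum_squares(resid) / T)
        constraints = [cp.sum(cp.abs(f[1:] - f[:-1])) <= delta]
        cp.Problem(objective, constraints).solve()
        return alpha.value, f.value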
As illustrated in the first panel of Figure 1, setting $\delta = 0$ results in a very simple ar(1) model but a very biased estimate $\hat\alpha$; this model clearly under-fits the data. On the other hand, the third panel of Figure 1 shows that when $\delta$ is too large, the fitted model has a small empirical loss but an overfitted background, which still results in a very biased $\hat\alpha$. From the second panel of Figure 1, we can see that the fitted piecewise constant background faithfully captures the dynamics, and this model yields a very accurate estimate $\hat\alpha$.

Figure 1 illustrates that $\delta$ controls the bias-variance trade-off: a larger $\delta$ leads to a smaller fitting error but an overfitted background, and thus the estimated AR coefficients are biased. Therefore, we cannot use the fitting error to tune $\delta$. Instead, we choose $\delta$ by the Ljung-Box test, which assesses the model's goodness-of-fit by checking the remaining serial correlation in the residual sequence. The test provides a $p$-value to quantify the goodness-of-fit. Since a larger $p$-value indicates less remaining serial correlation in the residuals, we select the $\delta$ with the maximum $p$-value. Details of the parameter tuning procedure can be found in Appendix A.

Here we want to comment that we cannot use the popular cross-validation technique to choose $\delta$. Cross-validation splits the data into training data and testing data. Typically, the model for the training data and the test data are identical; thus, the cross-validation error can be used to estimate the actual test error. However, since our background is dynamic and differs between the training and test data, we cannot apply a model fitted on training data to the test data to tune the hyperparameter.

2.3 Bootstrap confidence intervals

Finally, we present two bootstrap methods to construct confidence intervals for the autoregressive coefficients. For serially correlated data, we cannot use the conventional bootstrap for i.i.d. data [Efron, 1992]; instead we use the following techniques: (i) the wild bootstrap [Wu, 1986], which resamples from the fitted residuals, and (ii) a variant [Künsch, 1989] of the local moving block bootstrap, which is designed for non-stationary time series. Details of both methods can be found in Appendix A.

3 Theoretical guarantees for ar(1) sequences

Now we present the main theoretical results, including the upper and lower bounds for the $\ell_2$ recovery error of the parameters estimated with (2). We start with some necessary definitions. Define a sequence as being $\varepsilon$-recoverable if the $\ell_2$ recovery error using (2) is smaller than $\varepsilon$. The collection of recoverable sequences forms the $\varepsilon$-recoverable region.

We make the following assumptions to ensure that the dynamic background does not change too drastically. (i) The background contains at most $s$ changes over the time horizon $T$, i.e., it consists of at most $s + 1$ pieces. In other words, the rate of change of the dynamic background is on the order of $s/T$, and changes do not happen very often. (ii) The magnitude of each change is upper-bounded by $\delta_0$:
$$|\Delta_i| \le \delta_0, \quad i = 2, \ldots, T, \qquad (3)$$
where $\Delta_i = f_i - f_{i-1}$, $i = 2, \ldots, T$, are the one-step changes of the dynamic background.

It is known that the least square estimator is consistent for any stationary autoregressive time series. Intuitively, the "smaller" the dynamic background, the "closer" the sequence is to its stationary counterpart. A fundamental question is: what ranges of dynamic backgrounds and autoregressive coefficients can be estimated accurately?
We answer this question via Theorems 1 and 2, which establish upper and lower bounds on the recovery error that depend on the number of changes $s$ and the size of the changes $\delta_0$. However, in our setting, $s(T)$ is non-decreasing in the time horizon $T$. Thus, we cannot expect the recovery error to shrink to zero as $T$ increases, as in the usual asymptotic analysis.

Theorem 1 establishes a sufficient condition ensuring that the $\ell_2$ recovery error does not exceed $\varepsilon$:
$$\min\Big\{\tilde{C} s^{3/2}\delta_0,\ \sqrt{\mathrm{vol}(S)/\pi}\Big\} + \sqrt{s}\,\delta_0 \ \le\ \varepsilon - \delta \ =: \varepsilon_\delta, \qquad (4)$$
where $\tilde{C}$ is a positive constant, $S$ is a user-specified set in which the true parameter resides, and $\mathrm{vol}(\cdot)$ denotes the volume of a set. Therefore, (4) is a sufficient condition that the $s$ and $\delta_0$ of an $\varepsilon$-recoverable sequence need to satisfy, and it thus defines the boundary of the $\varepsilon$-recoverable region, as illustrated in Figure 2. The recoverable region is the union of the blue and green regions in Figure 2: the green region does not vary, but the blue region shrinks with increasing $\mathrm{vol}(S)$ and eventually vanishes once $\sqrt{\mathrm{vol}(S)/\pi}$ exceeds $\varepsilon_\delta$. This can be explained as follows: when the unknown coefficients reside in a larger $S$, it is more difficult to recover the true parameters to the same accuracy $\varepsilon$, which leads to a smaller recoverable region. Moreover, as illustrated in Figure 2, within the $\varepsilon$-recoverable region there exists a collection of instances (the region shaded in red dashed lines) where the best achievable performance (lower bound) meets the upper bound of our proposed estimator over the finite time horizon.
[Figure 2 about here: three panels depicting the $\varepsilon$-recoverable region (the union of a blue and a green region) as $\mathrm{vol}(S)$ increases.]
$$1 - 2\exp\left\{-\frac{C_1 + C_2\tilde{C}M + C_3\tilde{C}^2M^2 + (\log 2)/T}{C_4\,m^2}\right\}, \qquad (12)$$
where $C_1$, $C_2$ and $C_3$ are positive constants depending only on $\delta_0$ and $\sigma$.

Under assumptions (9) and (10), the naive lower bound is of order $O(1/\sqrt{T})$, and therefore the constant-order lower bound is tighter; we denote it by $C_{\mathrm{lower}}$. Besides, (8) ensures that the upper bound is at most $\sqrt{\mathrm{vol}(S)/\pi}$ plus an $O(1/\sqrt{T})$ term. Thus, we can also obtain a constant-order upper bound $C_{\mathrm{upper}}$. In this special case, the $\ell_2$ estimation error stays within a constant-order interval, $\|\hat\beta_T - \beta\|_2 \in [C_{\mathrm{lower}}, C_{\mathrm{upper}}]$, with constant probability for $t = 1, \ldots, T$. We illustrate this constant-order interval by the region shaded in green in Figure 3.

Figure 3 shows that the upper bound stays close to Fano's lower bound on a finite time horizon, which demonstrates the near-optimality of the proposed method. Besides, the green region in Figure 3 illustrates the constant-probability estimation error trajectories for the instances within the red dashed region in Figure 2.
[Figure 3 about here: estimation error bounds versus the time horizon $t$, showing the naive lower bound, Fano's lower bound, and the upper bound, with the constant-order interval shaded in green.]
3.5 Restricted eigenvalue condition and variational inequality
Consider the penalized form of (7):
$$\hat\beta_T = \arg\min_{\beta \in \mathbb{R}^{T+1}} \frac{1}{T}\|x_T - X\beta\|_2^2 + \lambda\|\Delta\|_1, \qquad (13)$$
where $\lambda$ is the tuning parameter. By Lagrangian duality, we can show that (13) is equivalent to (7). It is known [Wainwright, 2019] that there is a one-to-one correspondence between $\delta$ and $\lambda$: if $\hat\beta_T = \hat\beta_T(\lambda)$ minimizes (13), then it also minimizes (7) with $\delta = \|\hat\Delta\|_1$.

The formulation (13) links our problem to the high-dimensional lasso [Wainwright, 2019]. This connection motivates us to invoke the restricted eigenvalue condition due to Bickel et al. [2009], Van De Geer and Bühlmann [2009] for the design matrix to bound the $\ell_2$ estimation error, since the restricted eigenvalue condition is the weakest known sufficient condition according to Raskutti et al. [2010]. Although there have been works verifying restricted eigenvalue conditions for $x_{T-1}$ [Loh and Wainwright, 2011, Basu and Michailidis, 2015, Wu and Wu, 2016] or $L$ [Harchaoui and Lévy-Leduc, 2010], they cannot be directly applied to our setting. Specifically, expanding $x_{T-1}$ with the square matrix $L$ in the design matrix leads to rank-deficiency of $X^\top X$, so simply exploiting the structure of $x_{T-1}$ cannot address the problem.

Partition the index set $\{1, \ldots, T+1\}$ into three disjoint parts $I_i$ ($i = 1, 2, 3$), where $I_1 = \{1, 2\}$, $I_2$ contains the indices of the non-zero $\Delta_i$'s, and $I_3$ contains the indices of the zeros in $\beta$. When an index set $I$ is used as the subscript of a vector, we keep all entries with indices from $I$ intact and zero out the entries with indices from its complement $I^c$. Since the $\ell_1$ constraint does not encourage sparsity on $\alpha$ and $\mu$, we modify the definition of the restricted eigenvalues as follows:
$$\phi_{\min}(u) = \min_{e \in R_1} \frac{\|Xe\|_2^2}{T\|e\|_2^2}, \qquad \phi_{\max}(u) = \max_{e \in R_2} \frac{\|Xe\|_2^2}{T\|e\|_2^2}, \qquad (14)$$
where $R_1 = \{e : 2 = \|e_{I_1}\|_0 \le \|e\|_0 \le u\}$ and $R_2 = \{e : 1 \le \|e\|_0 \le u,\ \|e_{I_1}\|_2 = 0\}$.

Remark 1. The smallest restricted eigenvalue can be understood as follows: take the columns of $X$ indexed by 1, 2 and another $u - 2$ indices from $I_1^c$ to form a new matrix $\tilde{X}$; then $\phi_{\min}(u)$ is the smallest among the eigenvalues of all possible $\tilde{X}^\top\tilde{X}/T$'s. Similarly, $\phi_{\max}(u)$ is the largest eigenvalue of $\tilde{X}^\top\tilde{X}/T$, where $\tilde{X}$ is composed of $u$ columns of $X$ with indices chosen from $I_1^c$.

In the following analysis, we use a recently developed technique based on variational inequalities [Juditsky and Nemirovski, 2019, Juditsky et al., 2020] to establish the upper bound. Consider the gradient field of the objective function in (7): $F_{x_T}(z) = (A[x_T]z - a[x_T])/T$, where $A[x_T] = X^\top X$ and $a[x_T] = X^\top x_T$. This vector field is affine and monotone, since we can verify that the symmetric matrix $A[x_T]$ is positive semi-definite. The minimizer of (7), i.e., $\hat\beta_T$, is in fact the solution to the following variational inequality:
$$\text{find } z \in \mathcal{X}: \quad \langle F_{x_T}(w), w - z\rangle \ge 0, \quad \forall w \in \mathcal{X}. \qquad \mathrm{VI}[F_{x_T}, \mathcal{X}]$$
Moreover, $\beta$ is a zero of $\tilde{F}_{x_T}(z)$ and the solution to the following variational inequality:
$$\text{find } z \in \mathcal{X}: \quad \langle \tilde{F}_{x_T}(w), w - z\rangle \ge 0, \quad \forall w \in \mathcal{X}, \qquad \mathrm{VI}[\tilde{F}_{x_T}, \mathcal{X}]$$
where $\tilde{F}_{x_T}(z) = (A[x_T]z - A[x_T]\beta)/T$.
We can see that $F_{x_T}(z)$ and $\tilde{F}_{x_T}(z)$ differ only in the following constant term:
$$\eta = F_{x_T}(\beta) - \tilde{F}_{x_T}(\beta) = F_{x_T}(\beta) = (A[x_T]\beta - a[x_T])/T.$$
Intuitively, the difference between $\mathrm{VI}[F_{x_T}, \mathcal{X}]$ and $\mathrm{VI}[\tilde{F}_{x_T}, \mathcal{X}]$ should reflect the difference between the solutions to those two variational inequalities, i.e., our estimator $\hat\beta_T$ and the ground truth $\beta$. We show below how to bound the $\ell_2$ estimation error using $\|\eta\|_\infty$.

The following proposition establishes the error bound for the autocorrelation coefficient and the initial dynamic coefficient, combined.

Proposition 1 (Upper bound on the $\ell_2$ estimation error for $\alpha$ and $\mu$). For $\hat\beta_T$ defined by (7), for $A_1$, $A_2$, $A_3$ and the tuning parameter $\delta$ as in Theorem 1, there exists $k \in (0, 1)$ such that $k\lambda = O\big((\log T)^{3/2}/T^{1/2}\big)$, and with probability at least $1 - (2T)^{1-A_1} - (2T)^{1-A_2^2/A_1} - (2T)^{1-A_3^2/A_1}$ we have
$$\sqrt{(\hat\alpha - \alpha)^2 + (\hat\mu - \mu)^2} \ \le\ \max\left\{\frac{\sqrt{2}}{\kappa^2}\left(\frac{2\|\eta\|_\infty}{1-k} + \kappa C_{\delta,\delta_0,s}\right),\ \delta + s\delta_0\right\}, \qquad (15)$$
where $\delta_0$ is the magnitude of the one-step changes of the dynamic background in (3), and
$$\kappa = \sqrt{\phi_{\min}(2)} - \frac{\sqrt{2}\,k\sqrt{T-1}}{(1-k)\sqrt{T}}, \qquad C_{\delta,\delta_0,s} = \frac{2s\delta_0\sqrt{T-1}}{(1-k)\sqrt{T}} + \sqrt{\frac{(s-1)(T-s+1)}{T}}\,(s\delta_0 + \delta).$$
Here, $\phi_{\min}(\cdot)$ is defined in (14). Moreover, we have $\|\eta\|_\infty \le C_0(\log T)^{3/2}/T^{1/2}$, where $C_0 = C_0(A_1, A_2, A_3)$ is a positive constant.

Next, we establish the lower bound on the $\ell_2$ estimation error using Fano's method.

Proposition 2 (Lower bound on the $\ell_2$ estimation error). For any estimator $\tilde\beta_T$ and constant $C_0 > 0$, we have
$$\sup_{\beta \in \Theta_T} \mathrm{pr}\left(\|\tilde\beta_T - \beta\|_2 \ge C_0\right) \ \ge\ 1 - \frac{C_1 T + C_2\,\delta_0(T)\sum_{t=2}^{T}s(t) + C_3\,\delta_0^2(T)\sum_{t=2}^{T}s(t) + \log 2}{s(T)\log(1/C_0)},$$
where $C_1$, $C_2$ and $C_3$ are positive constants depending only on $\delta_0$ and $\sigma$. We can show that Theorem 2 follows from this proposition. The key steps in proving the proposition are to (i) find a large enough $\varepsilon$-packing of $\Theta_T$ and (ii) upper bound the Kullback-Leibler (KL) divergence over this packing.

4 Extension to ar(p) sequences

So far we have focused on the analysis of ar(1) sequences; we now discuss how to extend it to the general case. For the ar(p) case, we need to change several terms in (5) (defined for ar(1)): the design matrix becomes $X = (x_{T-1}, \ldots, x_{-p+1:T-p}, L) \in \mathbb{R}^{T\times(T+p)}$, where $L \in \mathbb{R}^{T\times T}$ remains the lower triangular matrix of ones, and the coefficient vector becomes $\beta = (\alpha_p^\top, \mu, \Delta_2, \ldots, \Delta_T)^\top$, where $\alpha_p = (\alpha_1, \ldots, \alpha_p)^\top$.
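As a concrete illustration of this reparametrization (a sketch; the helper name design_matrix is ours), the design matrix can be assembled so that the background satisfies $f = L(\mu, \Delta_2, \ldots, \Delta_T)^\top$, i.e., $f_i = \mu + \sum_{k=2}^{i}\Delta_k$:

    import numpy as np

    def design_matrix(x, x0, p):
        # X = (lagged data columns, L), as in Section 4. L is the T x T
        # lower-triangular matrix of ones, so L @ (mu, Delta_2, ..., Delta_T)
        # produces the cumulative-sum background f_1, ..., f_T.
        T = len(x)
        xx = np.concatenate([x0, x])        # x_{1-p}, ..., x_T
        lags = np.column_stack([xx[p - j: p - j + T] for j in range(1, p + 1)])
        L = np.tril(np.ones((T, T)))
        return np.hstack([lags, L])         # shape (T, T + p)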
We can solve a convex optimization problem similar to the one defined in (7) to estimate the parameters, except that the hypothesis class $\mathcal{X}$ is defined differently:
$$\mathcal{X} = \{\beta : (\alpha_p^\top, \mu) \in S_p,\ \|\Delta\|_1 < \delta\},$$
where $S_p = \{(\alpha_p^\top, \mu) : \|\alpha_p\|_2^2 + \mu^2 \le \delta_{p+1,s}^2\}$. Moreover, we redefine $I_1 = \{1, \ldots, p+1\}$, while the definitions of $I_2$ and $I_3$ remain as in Section 3.5. The restricted eigenvalues are also defined as in (14), except that the error $e$ is restricted to $R_1 = \{e : p + 1 = \|e_{I_1}\|_0 \le \|e\|_0 \le u\}$ when calculating $\phi_{\min}(u)$. With these definitions, we can show the following upper bound on the $\ell_2$ recovery error.

Theorem 3 (Upper bound on the $\ell_2$ estimation error for the ar(p) case). For $\hat\beta_T$ defined by (7) and for all $A_1 > 1$, $A_2 > \sqrt{A_1}$ and $A_3 > 0$, for any selected tuning parameter $\delta$, with probability at least $1 - (2T)^{1-A_1} - (2T)^{1-A_2^2/A_1} - p(2T)^{1-A_3^2/A_1}$, we have
$$\|\hat\beta_T - \beta\|_2 \ \le\ \min\left\{\tilde{C}\sqrt{s}\max\{s\delta_0, \delta\},\ \left(\Gamma\Big(\frac{p+3}{2}\Big)\,\mathrm{vol}(S_p)\Big/\pi^{\frac{p+1}{2}}\right)^{\frac{1}{p+1}}\right\} + \delta + \sqrt{s}\,\delta_0, \qquad (16)$$
where $\tilde{C}$ is a positive constant depending on $A_1$, $A_2$ and $A_3$, and $\Gamma(\cdot)$ is the gamma function.

Since the expression (16) for the upper bound in the ar(p) case is similar to that in Theorem 1, the discussion of the $\varepsilon$-recoverable region, which is solely determined by the upper bound on the estimation error, is similar too. For lower bounding the estimation error via Fano's method, we can use a similar proof strategy to that of Proposition 2 or Lemma 6 (although the details are more tedious to specify): (i) express $x_t$ in terms of $\beta$ and $\varepsilon_t$; (ii) derive the joint distribution of $x_T$ from that expression; and (iii) bound the KL divergence.

5 Numerical experiments
In this section, we perform comprehensive numerical simulations to (i) show that our proposed method works well in practice; (ii) validate our theoretical findings regarding algorithm performance; (iii) compare with existing methods; and (iv) demonstrate the good performance of the two proposed bootstrap methods. Recall that our work's primary focus is to estimate the autoregressive coefficients; we therefore focus on this in the following four experiments.
Experiment 1.
First, we show that our proposed estimation method can accurately recover $\alpha$ from non-stationary ar(1) time series under various settings: $\alpha \in \{0.05, 0.1\}$, $\sigma \in \{0.1, 0.2\}$, $\delta_0 \in \{0.05, 0.1\}$ and $T = 5000$. The dynamic background is generated by $f_i = \sum_{k=1}^{i}\delta_0(U_k - 0.5)$, $i = 1, \ldots, T$, where $U_k \in [0, 1]$, $k = 1, \ldots, T$, is a sequence of i.i.d. uniform random numbers. As discussed above, the accuracy depends on both $s$ and $\delta_0$. Here, we consider an extreme case in which the background changes at every step (which is supposed to be the most challenging case). Moreover, we also present the results with $\delta$ selected by the Durbin-Watson test [Durbin and Watson, 1992] as an alternative. (Details on the Durbin-Watson test can be found in Section B.1 of Appendix B.) The convex program (2) is solved with the cvx package [Grant and Boyd, 2014], and we tune the hyperparameter $\delta$ by the Ljung-Box test and the Durbin-Watson test, respectively. We repeat the experiment 20 times for each setting, and plot the mean square error of $\hat\alpha$ and the $p$-values of the Ljung-Box and Durbin-Watson tests for different $\delta$'s in Figure 4.

The results in Figure 4 show that $\hat\alpha$ decreases as $\delta$ increases, and there is a specific value of $\delta$ that leads to the smallest mean square error for estimating $\alpha$. In the figure, the red dots in the first two rows correspond to the best achievable $\delta$'s in mean square error. In the last two rows, the red dots indicate the $\delta$'s selected by our proposed tuning procedure. We observe that (i) the best achievable $\delta$'s regarding accuracy and mean square error are roughly the same, and (ii) our proposed tuning procedure, based on either the Ljung-Box test or the Durbin-Watson test, can select the best $\delta$.
Table 1: Summary of the information of the red dots in Figure 4. The left three columns report the average (standard deviation) of $\hat\alpha$; the right three report the mean square error of $\hat\alpha$.

Setting $(\alpha, \delta_0, \sigma)$   |  ε-optimal              |  Ljung-Box              |  Durbin-Watson          |  ε-optimal  |  Ljung-Box  |  Durbin-Watson
(0.05, 0.05, 0.10)                     |  4.13×10⁻² (2.33×10⁻²)  |  4.13×10⁻² (2.33×10⁻²)  |  4.13×10⁻² (2.33×10⁻²)  |  6.19×10⁻⁴  |  …          |  …
(0.05, 0.05, 0.20)                     |  3.12×10⁻² (1.60×10⁻²)  |  3.12×10⁻² (1.60×10⁻²)  |  3.12×10⁻² (1.60×10⁻²)  |  6.09×10⁻⁴  |  …          |  …
(0.05, 0.10, 0.10)                     |  3.91×10⁻² (2.03×10⁻²)  |  3.91×10⁻² (2.03×10⁻²)  |  5.04×10⁻² (2.60×10⁻²)  |  5.33×10⁻⁴  |  …          |  …
(0.05, 0.10, 0.20)                     |  3.69×10⁻² (2.02×10⁻²)  |  3.69×10⁻² (2.02×10⁻²)  |  4.64×10⁻² (2.44×10⁻²)  |  5.81×10⁻⁴  |  …          |  …
(0.10, 0.05, 0.10)                     |  8.47×10⁻² (2.02×10⁻²)  |  8.47×10⁻² (2.02×10⁻²)  |  8.47×10⁻² (2.02×10⁻²)  |  6.42×10⁻⁴  |  …          |  …
(0.10, 0.05, 0.20)                     |  8.01×10⁻² (1.65×10⁻²)  |  8.01×10⁻² (1.65×10⁻²)  |  8.01×10⁻² (1.65×10⁻²)  |  6.68×10⁻⁴  |  …          |  …
(0.10, 0.10, 0.10)                     |  9.21×10⁻² (2.66×10⁻²)  |  8.14×10⁻² (2.41×10⁻²)  |  8.14×10⁻² (2.41×10⁻²)  |  7.68×10⁻⁴  |  …          |  …
(0.10, 0.10, 0.20)                     |  7.83×10⁻² (2.64×10⁻²)  |  8.64×10⁻² (3.21×10⁻²)  |  8.64×10⁻² (3.21×10⁻²)  |  1.17×10⁻³  |  …          |  …

Table 1 summarizes the optimal and the selected $\delta$'s (corresponding to the red dots) in Figure 4: (i) the average and the standard deviation of $\hat\alpha$ obtained with the $\varepsilon$-optimal (in the sense of accuracy) $\delta$ and with the $\delta$ selected by the Ljung-Box and Durbin-Watson tests, and (ii) the mean square error of $\hat\alpha$ obtained with the $\varepsilon$-optimal (in the sense of mean square error) $\delta$ and with the $\delta$ selected by the Ljung-Box and Durbin-Watson tests.

[Figure 4 about here: for each setting $(\alpha, \delta_0, \sigma)$, the log mean square error of $\hat\alpha$ and the Ljung-Box and Durbin-Watson $p$-values as functions of $\delta$, with red dots marking the best achievable and the selected $\delta$'s.]
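For reference, the simulation loop behind Figure 4 and Table 1 can be sketched as follows (illustrative sizes, far smaller than the paper's $T = 5000$ and 20 repetitions; tv_lse is the solver sketched in Section 2):

    import numpy as np

    def run_once(rng, alpha=0.05, delta0=0.05, sigma=0.1, T=500,
                 deltas=(0.0, 0.005, 0.01, 0.02)):
        # One replication: simulate model (1) with a uniform-increment
        # background, then estimate alpha with TV-LSE for each delta.
        f = np.cumsum(delta0 * (rng.random(T) - 0.5))
        x = np.zeros(T + 1)                     # x[0] is the known history
        for i in range(1, T + 1):
            x[i] = f[i - 1] + alpha * x[i - 1] + sigma * rng.standard_normal()
        return [tv_lse(x[1:], x[:1], d)[0][0] for d in deltas]

    rng = np.random.default_rng(1)
    est = np.array([run_once(rng) for _ in range(5)])
    mse = ((est - 0.05) ** 2).mean(axis=0)      # mean square error per delta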
We compare our method with the method of Zhang et al. [2020]. In the following, we refer to their method as the "$\ell_2$ variant," since it is obtained by solving a convex program with the same objective function as (2) except for a different constraint:
$$\sum_{i=1}^{T-1}(f_{i+1} - f_i)^2 < \delta.$$
Again, $\delta \ge 0$ is the tuning parameter. We should mention that Zhang et al. [2020] did not have a systematic way to tune $\delta$; here we enhance their method by adding our statistical-test-based hyperparameter tuning as well.

The piecewise linear dynamic background is generated by $f_i = \sum_{k=1}^{i}\delta_0(U_k - 0.5)$, $i = 1, \ldots, T$. Here $U_k = u_i$ for all $k \in \{k_i, \ldots, k_{i+1}\}$, $i = 0, \ldots, s-1$, where $k_0 < k_1 < \cdots$ are the segment endpoints.

[Figure about here: comparison of TV-LSE and the $\ell_2$ variant in two cases, showing the observations, fitted models and fitted backgrounds, with the estimates $\hat\alpha$ reported above each panel.]

Finally, we compare the confidence intervals obtained via the two bootstrap methods. We adopt the following experimental setting: $\alpha = 0.1$, $\sigma = 0.1$, $T = 1000$. The dynamic drift is piecewise constant with $\delta_0 = 0.1$, $s = 100$. The number of bootstrap replications is $N = 100$; we use standard normal random numbers as the $v_t$'s in the residual-based wild bootstrap; for the local block bootstrap, we choose block size $b = 20$ and local neighborhood length $B = 50$. We illustrate one replication by plotting the histograms of the $\hat\alpha$'s from the bootstrap samples in Figure 7.

From 50 repetitions of the above procedure, we find: (i) the coverage accuracies of the 90% and 95% confidence intervals are 0.84 and 0.90 for the residual-based wild bootstrap and 0.84 and 0.88 for the local block bootstrap; (ii) the average lengths of the 90% and 95% confidence intervals are 0.10 and 0.12 for the residual-based wild bootstrap and 0.095 and 0.114 for the local block bootstrap. The coverage accuracy is slightly below the nominal level since $T = 1000$ is relatively small. The comparison indicates that the local block bootstrap tends to yield smaller confidence intervals but slightly lower coverage accuracy.

[Figure 7 about here: histograms of the bootstrap $\hat\alpha$'s for the residual-based wild bootstrap (90% CI = [0.0281, 0.134], 95% CI = [0.0248, 0.15]) and the local block bootstrap (90% CI = [0.0439, 0.139], 95% CI = [0.0115, 0.154]), with the estimate from the actual sample, the ground truth, and the CI boundaries marked.]

6 Real-data study

We use the dataset Maniscalco_2017_expt1 from the Confidence Database [Rahnev et al., 2020], chosen based on the fact that it has RT data included and features a large number of trials per subject.

The data come from an experiment in which human subjects made a series of 1000 perceptual judgments over a period of about one hour. Participants were seated in front of a computer and made their responses using a standard keyboard. The task, which is standard in the field, consisted of deciding whether a briefly presented (33 ms) noisy sinusoidal grating was oriented clockwise or counterclockwise from vertical. Subjects responded as quickly as possible but without sacrificing accuracy. The experimenters recorded each judgment's reaction time (that is, the time from the onset of the visual stimulus to the button press used to indicate the subject's response), thus creating a time series of 1000 values for each subject.
Data were obtained from 28 subjects. We first pre-process the raw data by dealing with missing values and obvious outliers. To be precise, we treat RTs that exceed 10 times the interquartile range (i.e., the difference between the 75th and 25th percentiles) as outliers and the rest as normal observations. Since naively omitting missing values in time series data would break the serial correlation, we use the median of the normal observations to impute the missing values. The same median is used to replace all outliers. We propose a data-adaptive procedure to tune $\delta$ by applying the Ljung-Box test to the logarithm of the original residuals, since they are strongly right-skewed.
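A minimal sketch of this pre-processing step (under one reading of the outlier rule; the helper name preprocess_rt is ours):

    import numpy as np

    def preprocess_rt(rt):
        # Treat RTs above 10x the interquartile range as outliers; replace
        # outliers and missing values with the median of the normal
        # observations, preserving the series length and hence the serial
        # structure of the data.
        rt = np.asarray(rt, dtype=float)
        q1, q3 = np.nanpercentile(rt, [25, 75])
        normal = ~np.isnan(rt) & (rt <= 10 * (q3 - q1))
        rt[~normal] = np.median(rt[normal])
        return rt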
We plot the results for all 28 subjects in Figure 8; more details on why we choose the logarithm transform are given in Appendix D. The confidence intervals are constructed via bootstrapping with the same bootstrapping parameters as in our simulations.

Figure 8: Experimental results on reaction times for all 28 subjects. Subjects 1 to 28 are organized from left to right and top to bottom. The blue, red and yellow lines correspond to the raw RT values, the fitted ar(1) model and the fitted dynamic background, respectively. At the top of each panel, we report the $\delta$ selected by the Ljung-Box test on the logarithm of the residuals, the estimated ar(1) coefficient, and the 90% and 95% confidence intervals based on the residual-based wild bootstrap and local block bootstrap samples. Overall, we observe the presence of substantial drift that varies significantly between subjects but is recovered very well by our proposed method.

[Figure about here: per-subject estimates $\hat\alpha$ with 90% confidence intervals from the proposed method, compared with regular ar(1) model fits.]

Acknowledgments

The first two authors are supported by NSF CCF-1650913, NSF DMS-1938106, and NSF DMS-1830210. The last author is supported by NIH R01MH119189, NIH R21MH122825, and the Office of Naval Research N00014-20-1-2622.

References

Athena Akrami, Charles D. Kopec, Mathew E. Diamond, and Carlos D. Brody. Posterior parietal cortex represents sensory history and mediates its effects on behaviour. Nature, 554(7692):368–372, 2018.

Sumanta Basu and George Michailidis. Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567, 2015.

Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Peter J Brockwell, Richard A Davis, and Stephen E Fienberg. Time Series: Theory and Methods. Springer Science & Business Media, 1991.

Guido Marco Cicchini, Kyriaki Mikellidou, and David C Burr. The functional role of serial dependence.
Proceedings of the Royal Society B: Biological Sciences, 285(1890):20181722, 2018.

Peter K Clark. The cyclical component of US economic activity. The Quarterly Journal of Economics, 102(4):797–814, 1987.

James Durbin and Geoffrey S Watson. Testing for serial correlation in least squares regression. I. In Breakthroughs in Statistics, pages 237–259. Springer, 1992.

Gilles Dutilh, Don Van Ravenzwaaij, Sander Nieuwenhuis, Han L.J. Van der Maas, Birte U. Forstmann, and Eric-Jan Wagenmakers. How to measure post-error slowing: A confound and a simple solution. Journal of Mathematical Psychology, 56(3):208–216, 2012.

Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pages 569–593. Springer, 1992.

Jason Fischer and David Whitney. Serial dependence in visual perception. Nature Neuroscience, 17(5):738–743, 2014.

Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1, 2014.

James D Hamilton. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica: Journal of the Econometric Society, pages 357–384, 1989.

Zaïd Harchaoui and Céline Lévy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493, 2010.

Robert J Hodrick and Edward C Prescott. Postwar US business cycles: an empirical investigation. Journal of Money, Credit, and Banking, pages 1–16, 1997.

Yongmiao Hong. Serial correlation and serial dependence. In Macroeconometrics and Time Series Analysis, pages 227–244. Springer, 2010.

Anatoli Juditsky, Arkadi Nemirovski, Liyan Xie, and Yao Xie. Convex recovery of marked spatio-temporal point processes. arXiv preprint arXiv:2003.12935, 2020.

Anatoli B Juditsky and A S Nemirovski. Signal recovery by stochastic optimization. Automation and Remote Control, 80(10):1878–1893, 2019.

Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. $\ell_1$ trend filtering. SIAM Review, 51(2):339–360, 2009.

Hans R Künsch. The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3):1217–1241, 1989.

Donald R J Laming. Information Theory of Choice-Reaction Times. Academic Press, New York, 1968.

Stephanie R Land and Jerome H Friedman. Variable fusion: A new adaptive signal regression method. Technical report, Department of Statistics, Carnegie Mellon University, 1997.

Greta M Ljung and George EP Box. On a measure of lack of fit in time series models. Biometrika, 65(2):297–303, 1978.

Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.

W. McIlhagga. Serial correlations and 1/f power spectra in visual search reaction times. Journal of Vision, 8(9):5, 2008.

Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.

Tobias J. Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. Time series momentum. Journal of Financial Economics, 104(2):228–250, 2012.
Efstathios Paparoditis and Dimitris N Politis. Local block bootstrap. Comptes Rendus Mathematique, 335(11):959–962, 2002.

Dobromir Rahnev, Ai Koizumi, Li Yan McCurdy, Mark D'Esposito, and Hakwan Lau. Confidence leak in perceptual decision making. Psychological Science, 26(11):1664–1680, 2015.

Dobromir Rahnev, Kobe Desender, Alan L. F. Lee, William T. Adler, David Aguilar-Lleyda, Başak Akdoğan, Polina Arbuzova, Lauren Y. Atlas, Fuat Balcı, Ji Won Bang, Indrit Bègue, Damian P. Birney, Timothy F. Brady, Joshua Calder-Travis, Andrey Chetverikov, Torin K. Clark, Karen Davranche, Rachel N. Denison, Troy C. Dildine, Kit S. Double, Yalçın A. Duyan, Nathan Faivre, Kaitlyn Fallow, Elisa Filevich, Thibault Gajdos, Regan M. Gallagher, Vincent de Gardelle, Sabina Gherman, Nadia Haddara, Marine Hainguerlot, Tzu-Yu Hsu, Xiao Hu, Iñaki Iturrate, Matt Jaquiery, Justin Kantner, Marcin Koculak, Mahiko Konishi, Christina Koß, Peter D. Kvam, Sze Chai Kwok, Maël Lebreton, Karolina M. Lempert, Chien Ming Lo, Liang Luo, Brian Maniscalco, Antonio Martin, Sébastien Massoni, Julian Matthews, Audrey Mazancieux, Daniel M. Merfeld, Denis O'Hora, Eleanor R. Palser, Borysław Paulewicz, Michael Pereira, Caroline Peters, Marios G. Philiastides, Gerit Pfuhl, Fernanda Prieto, Manuel Rausch, Samuel Recht, Gabriel Reyes, Marion Rouault, Jérôme Sackur, Saeedeh Sadeghi, Jason Samaha, Tricia X. F. Seow, Medha Shekhar, Maxine T. Sherman, Marta Siedlecka, Zuzanna Skóra, Chen Song, David Soto, Sai Sun, Jeroen J. A. van Boxtel, Shuo Wang, Christoph T. Weidemann, Gabriel Weindel, Michał Wierzchoń, Xinming Xu, Qun Ye, Jiwon Yeon, Futing Zou, and Ariel Zylberberg. The confidence database. Nature Human Behaviour, 4(3):317–325, 2020.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Methodological), 67(1):91–108, 2005.

Sara A Van De Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

M. Wexler, M. Duyck, and P. Mamassian. Persistent states in vision break universality and time invariance. Proceedings of the National Academy of Sciences, 112(48):14990–14995, 2015.

Chien-Fu Jeff Wu. Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics, 14(4):1261–1295, 1986.

Wei-Biao Wu and Ying Nian Wu. Performance bounds for parameter estimates of high-dimensional linear models with correlated errors. Electronic Journal of Statistics, 10(1):352–379, 2016.

Ke-Li Xu. Bootstrapping autoregression under non-stationary volatility. The Econometrics Journal, 11(1):1–26, 2008.

Kaimeng Zhang, Chi Tim Ng, and Myung Hwan Na. Real time prediction of irregular periodic time series data.
Journal of Forecasting, 39(3):501–511, 2020.

A Hyperparameter tuning and bootstrap confidence interval

Our proposed hyperparameter tuning procedure is: (i) set an interval $[\delta_\ell, \delta_u]$ in which we believe the best $\delta$ lies, based on prior knowledge; (ii) for any $\varepsilon > 0$, to make sure the Euclidean distance between the selected $\delta$ and the optimal one is less than $\varepsilon$, divide this interval into $n = \lfloor(\delta_u - \delta_\ell)/\varepsilon\rfloor$ parts of equal length $\varepsilon$ and denote the endpoints by $\delta^{(1)}, \ldots, \delta^{(n+1)}$; (iii) for each $\delta^{(j)}$, fit the proposed estimator defined by (2) and construct the residual sequence $r_i = x_i - \sum_{j=1}^{p}\hat\alpha_j x_{i-j} - \hat{f}_i$, $i = 1, \ldots, T$; (iv) apply the (lag-$p$) Ljung-Box test to the residual sequence to obtain a $p$-value $p_j$; (v) the $\varepsilon$-optimal tuning parameter is $\delta^{(j^*)}$ with $j^* = \arg\max_{j \in \{1, \ldots, n+1\}} p_j$. Further details on the Ljung-Box test can be found in Section B.1 of Appendix B.

Next, we present how to construct a bootstrap confidence interval. For our first method, the residual-based wild bootstrap: (i) we first perform the proposed tuning procedure to obtain the tuning parameter $\delta$ and the corresponding estimates $\hat\alpha_j$'s and $\hat{f}_i$'s; (ii) we then calculate the residuals $\hat{r}_i$'s as in step (iii) of the tuning procedure; (iii) a residual-based wild bootstrap sample is constructed recursively through (1) with the $\hat\alpha_j$'s, the $\hat{f}_i$'s and $\tilde{r}_i = \hat{r}_i v_i$, where the $v_i$'s are i.i.d. random numbers with zero mean and unit variance. As for the local block bootstrap, we first choose an integer block size $b$ and a local neighborhood size $B$. We partition the $T$ samples into $M = \lceil T/b\rceil$ blocks. Then, for $m = 0, \ldots, M - 1$, the local block bootstrap sample is $\tilde{x}_{mb+j} = x_{I_m + j - 1}$, $j = 1, \ldots, b$, where $I_m$ is a uniform random integer drawn from $\{\max(1, mb - B), \ldots, \min(T - b + 1, mb + B)\}$. In Paparoditis and Politis [2002], it is required that (i) $b/B \to 0$ as $b \to \infty$, and (ii) as $T \to \infty$, $T/B \to \infty$ but $B \to \infty$.

After obtaining a bootstrap sample, we apply the proposed tuning procedure to this pseudo-series with $\delta_\ell = \delta - n\varepsilon$ and $\delta_u = \delta + n\varepsilon$ to obtain estimates $\tilde\alpha_j$'s (we choose $n = 2$ in the numerical simulations). We then repeat this procedure $N$ times and construct a confidence interval from the empirical distribution of the $\tilde\alpha_j$'s. For bootstrap samples, we only need to search around the $\varepsilon$-optimal $\delta$ for the optimal tuning parameter of the pseudo-series, since it closely resembles the actual observations; this helps reduce the computational cost of bootstrapping.
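A minimal sketch of the residual-based wild bootstrap for the ar(1) case (normal wild weights, as in our simulations; the function name is our own):

    import numpy as np

    def wild_bootstrap_series(x0, alpha_hat, f_hat, resid, rng):
        # Multiply fitted residuals by i.i.d. zero-mean, unit-variance
        # weights and regenerate the series recursively through model (1).
        T = len(f_hat)
        r = resid * rng.standard_normal(T)      # wild weights v_i
        x = np.empty(T + 1)
        x[0] = x0                               # known history
        for i in range(1, T + 1):
            x[i] = f_hat[i - 1] + alpha_hat * x[i - 1] + r[i - 1]
        return x[1:]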
B Background knowledge

B.1 Ljung-Box test and Durbin-Watson test

The Ljung-Box test, sometimes known as the Ljung-Box Q test, is designed to test whether serial correlation remains in the residual sequence. The null hypothesis is $H_0$: the data are independently distributed. The test statistic is
$$Q = T(T+2)\sum_{k=1}^{h}\frac{\hat\rho_k^2}{T-k},$$
where $T$ is the sample size, $\hat\rho_k$ is the sample autocorrelation at lag $k$, and $h$ is the number of lags being tested. For a sequence $\{x_1, \ldots, x_T\}$, the sample autocorrelation $\hat\rho_k$ is defined as
$$\hat\rho_k = \frac{\hat\gamma(k)}{\hat\gamma(0)}, \qquad \hat\gamma(k) = \frac{1}{T}\sum_{t=1}^{T-|k|}\big(x_{t+|k|} - \bar{x}\big)(x_t - \bar{x}).$$
Here, $\{x_1, \ldots, x_T\}$ is the residual sequence when the Ljung-Box test is used for model diagnosis. Under $H_0$, the test statistic asymptotically follows a $\chi^2(h)$ distribution, and the $p$-value of the Ljung-Box test is $\mathrm{pr}(\chi^2(h) > Q)$.

The Durbin-Watson test serves the same purpose. For residuals $e_t = \rho e_{t-1} + \nu_t$, the test statistic is
$$d = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T}e_t^2}.$$
It tests the null hypothesis $H_0: \rho = 0$ against the alternative $H_1: \rho \ne 0$.

B.2 Golden-section search

Golden-section search is an efficient and robust technique for finding an extremum (minimum or maximum) of a function inside a specified interval. For any given $\delta$, if we solve the convex program (2), calculate the residual sequence and perform the hypothesis test on it as mentioned in Section 2.2, we obtain a $p$-value. That is, we have a mapping from $\delta$ to $p$, which we denote by $p = f(\delta)$. Our numerical experiments (Figure 4) show that $f$ is unimodal. Therefore, we can speed up the parameter tuning procedure by golden-section search. The detailed steps are provided in Algorithm 1 below. Compared to the $\lfloor(\delta_u - \delta_\ell)/\varepsilon\rfloor + 1$ evaluations of the proposed grid procedure, golden-section search achieves $\varepsilon$-optimality with just $\lfloor\log(\varepsilon/(\delta_u - \delta_\ell))/\log(0.618)\rfloor + 1$ evaluations.

Algorithm 1: Hyperparameter tuning procedure, a golden-section search variant.
Input: observations $x_1, \ldots, x_T$, given history $x_{-p+1}, \ldots, x_0$, a pre-specified interval $[\delta_\ell, \delta_u]$ in which to search for the best $\delta$, and a tolerance $\varepsilon > 0$.
Output: $\varepsilon$-optimal hyperparameter $\delta$.
1. Determine two intermediate points $\delta_1 = \delta_\ell + d$ and $\delta_2 = \delta_u - d$, where $d = \frac{\sqrt{5}-1}{2}(\delta_u - \delta_\ell)$.
2. For $k = 1, 2$: fit the proposed estimator defined by (2) with hyperparameter $\delta_k$; construct the residual sequence $r_i^{(k)} = x_i - \sum_{j=1}^{p}\hat\alpha_j^{(k)} x_{i-j} - \hat{f}_i^{(k)}$, $i = 1, \ldots, T$; apply the (lag-$p$) Ljung-Box test to $\{r_i^{(k)}\}_{i=1}^{T}$ to obtain a $p$-value $p_k = f(\delta_k)$.
3. If $f(\delta_1) > f(\delta_2)$, update
$$\delta_\ell = \delta_2, \quad \delta_2 = \delta_1, \quad \delta_u = \delta_u, \quad \delta_1 = \delta_\ell + \frac{\sqrt{5}-1}{2}(\delta_u - \delta_\ell);$$
otherwise, update
$$\delta_\ell = \delta_\ell, \quad \delta_u = \delta_1, \quad \delta_1 = \delta_2, \quad \delta_2 = \delta_u - \frac{\sqrt{5}-1}{2}(\delta_u - \delta_\ell).$$
4. If $\delta_u - \delta_\ell < \varepsilon$, set $\delta_{\max} = (\delta_u + \delta_\ell)/2$ and stop; otherwise, go back to step 2.
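A sketch of Algorithm 1 in Python (assuming a recent statsmodels for the Ljung-Box test; pval should map $\delta$ to a p-value by solving (2), e.g. via the tv_lse sketch of Section 2, and testing its residuals):

    import numpy as np
    from statsmodels.stats.diagnostic import acorr_ljungbox

    def ljung_box_pvalue(resid, h=1):
        # p-value of the lag-h Ljung-Box test on a residual sequence.
        return float(acorr_ljungbox(resid, lags=[h])["lb_pvalue"].iloc[0])

    def golden_section_delta(pval, lo, hi, eps=1e-3):
        # Golden-section search for the delta maximizing the (assumed
        # unimodal) mapping pval: delta -> Ljung-Box p-value.
        g = (np.sqrt(5) - 1) / 2
        d1, d2 = hi - g * (hi - lo), lo + g * (hi - lo)   # d1 < d2
        p1, p2 = pval(d1), pval(d2)
        while hi - lo >= eps:
            if p1 > p2:            # maximum lies in [lo, d2]
                hi, d2, p2 = d2, d1, p1
                d1 = hi - g * (hi - lo)
                p1 = pval(d1)
            else:                  # maximum lies in [d1, hi]
                lo, d1, p1 = d1, d2, p2
                d2 = lo + g * (hi - lo)
                p2 = pval(d2)
        return (lo + hi) / 2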
Since (cid:107) (cid:98) ∆ (cid:107) ≤ (cid:107) (cid:98) ∆ (cid:107) ≤ δ and (cid:107) ∆ (cid:107) ≤ √ sδ , by triangle inequality, we have (cid:107) e I ∪ I (cid:107) = (cid:107) (cid:98) ∆ − ∆ (cid:107) ≤ (cid:107) (cid:98) ∆ (cid:107) + (cid:107) ∆ (cid:107) ≤ δ + √ sδ . Again, by triangle inequality, (cid:107) (cid:98) β T − β (cid:107) ≤ (cid:107) e I (cid:107) + (cid:107) e I ∪ I (cid:107) . We complete the proof.The proof of Proposition 1 is highly involved. We sketch its proof as follows: Proof of Proposition 1. We first state four very useful lemmas. Lemma 1 (High probability bounds for sub-Gaussian noise) . For sub-Gaussian randomnoise ε , . . . , ε T i.i.d. ∼ subG( σ ) and x , . . . , x T generated by (5) (given x ), for all A > , A > √ A and A > , define events A = (cid:26) | ε i | ≤ (cid:113) A σ log(2 T ) , i = 1 , . . . , T (cid:27) , and A = (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) i = j ε i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ A σ √ T log(2 T ) , j = 1 , . . . , T (cid:41) , 25e have pr ( A ) ≥ − (2 T ) − A , pr ( A |A ) ≥ − (2 T ) − A /A . Furthermore, if we assume there exists a constant C > such that | f i | ≤ C (cid:112) log T , i = 1 , . . . , T, (17)define event A = (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) i =1 ε i x i − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ A σ ( c + 1) log(2 T ) (cid:112) T log(2 T ) / (1 − α ) (cid:41) , where c > is a constant such that | f i | ≤ c (cid:112) A σ log(2 T ) , i = 1 , . . . , T , then we will have pr ( A |A ) ≥ − T ) − A /A . By Lemma 1, we will have pr( A ∩ A ∩ A ) = pr( A ) (1 − pr( A c ∪ A c |A )) > − (2 T ) − A − (2 T ) − A /A − T ) − A /A . This means event A = A ∩A ∩A holds with probability at least − (2 T ) − A − (2 T ) − A /A − T ) − A /A . Lemma 2 (Restricted (cid:96) estimation error) . Under assumption (17), for our proposed esti-mator (cid:98) β T , as defined in (7) or equivalently (13), if we choose k ∈ (0 , and tuning parameter λ such that kλ = O (cid:0) (log T ) / /T / (cid:1) , then on event A , the estimation error e = (cid:98) β T − β satisfies: (cid:107) e I (cid:107) ≤ min (cid:26) k − k (cid:107) e I (cid:107) , sδ − k (cid:27) + k − k (cid:107) e I (cid:107) . Lemma 3. Under assumption (17), on event A , for any integer m ≤ T + 1 − | J | ,we have √ T (cid:107) X e (cid:107) ≥ (cid:18)(cid:112) φ min (2) − (cid:112) φ max ( m ) k − k (cid:19) (cid:107) e I (cid:107) − (cid:32) (cid:112) φ max ( m )1 − k + (cid:115) ( s − (cid:18) − s − T (cid:19)(cid:33) sδ − (cid:115) ( s − (cid:18) − s − T (cid:19) δ, (18)where φ min ( · ) and φ max ( · ) are defined in (14).For simplicity, in the following we denote κ = (cid:112) φ min (2) − k (cid:112) φ max ( m ) / (1 − k ) and C ( δ, δ , s, m ) = 2 (cid:112) φ max ( m ) sδ − k + (cid:115) ( s − (cid:18) − s − T (cid:19) ( sδ + δ ) . We further denote J = I ∪ I , i.e. the set of indices for all non-zero coefficients.26 emma 4. Under assumption (17), on event A , we have (cid:107) X e (cid:107) /T ≤ (cid:107) η (cid:107) ∞ (cid:107) e J (cid:107) / (1 − k ) , (19)Additionally, we have (cid:107) η (cid:107) ∞ = O (cid:0) (log T ) / /T / (cid:1) .Here, we consider two cases: (i) (cid:107) e I (cid:107) ≤ (cid:107) e I (cid:107) and (ii) (cid:107) e I (cid:107) > (cid:107) e I (cid:107) . In case (i), wehave (cid:107) e I (cid:107) ≤ (cid:107) e I (cid:107) ≤ (cid:107) e I (cid:107) = (cid:107) (cid:98) ∆ I − ∆ I (cid:107) ≤ (cid:107) (cid:98) ∆ I (cid:107) + (cid:107) ∆ (cid:107) ≤ δ + sδ . 
In case (ii), $\|e_J\|_2^2 = \|e_{I_1}\|_2^2 + \|e_{I_2}\|_2^2 < 2\|e_{I_1}\|_2^2$. By (18) and (19), we have
$$\frac{2}{1-k}\|\eta\|_\infty\|e_J\|_1 \ge \big(\kappa\|e_{I_1}\|_2 - C(\delta, \delta_2, s, m)\big)^2 \ge \kappa^2\|e_{I_1}\|_2^2 - 2\kappa C(\delta, \delta_2, s, m)\|e_{I_1}\|_2 \ge \frac{\kappa^2}{\sqrt2}\|e_{I_1}\|_2\|e_J\|_2 - 2\kappa C(\delta, \delta_2, s, m)\|e_J\|_2.$$
Rearranging the terms in the above inequality (using $\|e_J\|_1 \le \sqrt{s+1}\,\|e_J\|_2$) and choosing $m = 1$, we have
$$\|e_{I_1}\|_2 \le \frac{\sqrt2}{\kappa^2}\left(\frac{2\sqrt{s+1}\,\|\eta\|_\infty}{1-k} + 2\kappa C(\delta, \delta_2, s, 1)\right).$$
Denote $C_{\delta,\delta_2,s} = C(\delta, \delta_2, s, 1)$. Since $\varphi_{\max}(1) = 1 - 1/T$, combining the results above proves (15).

Proofs of Lemmas in the proof of Proposition 1

Proof of Lemma 1. For sub-Gaussian random noise $\varepsilon_i \sim \mathrm{subG}(\sigma^2)$, we will have
$$\mathrm{pr}(|\varepsilon_i| \le c_0,\ i = 1, \ldots, T) \ge 1 - 2T\exp\left\{-\frac{c_0^2}{2\sigma^2}\right\}.$$
Setting $c_0 = \sqrt{2A_1\sigma^2\log(2T)}$, we prove the first inequality.

By the uniform upper bound (17) on the dynamic background, we can find a constant $c$ such that the dynamic background is uniformly bounded by $c\,c_0$. Thus, on the event $\mathcal A_1$ we will get
$$-(c+1)c_0 \le x_i - \alpha^* x_{i-1} \le (c+1)c_0, \quad i = 1, \ldots, T.$$
By the convergence of geometric series we have $|x_i| \le (c+1)c_0/(1-\alpha^*)$ and thus we have
$$|x_{i-1}\varepsilon_i| \le \frac{(c+1)c_0^2}{1-\alpha^*} = c_1, \quad i = 1, \ldots, T.$$
Since $E[x_{i-1}\varepsilon_i \mid x_{i-1}] = x_{i-1}E[\varepsilon_i] = 0$ and
$$\mathrm{Var}(x_{i-1}\varepsilon_i \mid x_{i-1}) = x_{i-1}^2\sigma^2 \le \left(\frac{(c+1)c_0}{1-\alpha^*}\right)^2\sigma^2,$$
$\{x_{i-1}\varepsilon_i\}$ is a bounded martingale difference sequence with respect to the filtration $\{\sigma(x_0, \ldots, x_{i-1})\}$. By the Azuma–Hoeffding inequality, we have
$$\mathrm{pr}\left(\frac{1}{T}\Big|\sum_{i=1}^T x_{i-1}\varepsilon_i\Big| \ge c_3\right) \le 2\exp\left\{-\frac{Tc_3^2}{2c_1^2}\right\}.$$
Setting
$$c_3 = \frac{2\sqrt{2A_1A_3}\,\sigma^2(c+1)}{1-\alpha^*}\log(2T)\sqrt{\frac{\log(2T)}{T}}, \quad (20)$$
we prove the third inequality.

Similarly, on the event $\mathcal A_1$, by the Azuma–Hoeffding inequality we will obtain
$$\mathrm{pr}\left(\frac{1}{T}\Big|\sum_{i=j}^T \varepsilon_i\Big| \ge c_2\right) \le 2\exp\left\{-\frac{T^2c_2^2}{2(T-j+1)c_0^2}\right\} < 2\exp\left\{-\frac{Tc_2^2}{4A_1\sigma^2\log(2T)}\right\}, \quad j = 1, \ldots, T.$$
Therefore,
$$\mathrm{pr}\left(\frac{1}{T}\Big|\sum_{i=j}^T \varepsilon_i\Big| < c_2,\ j = 1, \ldots, T\right) \ge 1 - \sum_{j=1}^T \mathrm{pr}\left(\frac{1}{T}\Big|\sum_{i=j}^T \varepsilon_i\Big| \ge c_2\right) > 1 - 2T\exp\left\{-\frac{Tc_2^2}{4A_1\sigma^2\log(2T)}\right\},$$
where the first inequality comes from the union bound. Again, setting
$$c_2 = \frac{2A_2\sigma\log(2T)}{\sqrt T}, \quad (21)$$
we prove the second inequality.
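The scaling in Lemma 1 is easy to check empirically. Below is a minimal Monte Carlo sanity check (not part of the formal argument), assuming Gaussian noise; the constants A1 and A2 are illustrative choices satisfying the lemma's conditions.

    import numpy as np

    rng = np.random.default_rng(0)
    T, sigma = 2000, 1.0
    A1, A2 = 2.0, 3.0        # illustrative constants with A1 > 1 and A2 > sqrt(2*A1)
    trials, violations = 500, 0

    for _ in range(trials):
        eps = sigma * rng.standard_normal(T)
        # event A_1: every |eps_i| is below sqrt(2*A1*sigma^2*log(2T))
        ok1 = np.all(np.abs(eps) <= np.sqrt(2 * A1 * sigma**2 * np.log(2 * T)))
        # event A_2: every backward partial sum is below 2*A2*sigma*sqrt(T)*log(2T)
        tails = np.abs(np.cumsum(eps[::-1]))   # |sum_{i=j}^T eps_i| for j = T, ..., 1
        ok2 = np.all(tails <= 2 * A2 * sigma * np.sqrt(T) * np.log(2 * T))
        violations += not (ok1 and ok2)

    print(f"empirical violation rate: {violations / trials:.3f}")  # close to zero

With these constants the lemma predicts a failure probability polynomially small in $T$, and the empirical violation rate is indeed essentially zero.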
Proof of Lemma 2. By the definition (13), we have
$$\frac{1}{T}\|x_T - X\hat\beta_T\|_2^2 + \lambda\|\hat\Delta\|_1 \le \frac{1}{T}\|x_T - X\beta^*\|_2^2 + \lambda\|\Delta^*\|_1.$$
Rearranging terms, we will get
$$\frac{1}{T}\|X(\hat\beta_T - \beta^*)\|_2^2 \le \lambda\big(\|\Delta^*\|_1 - \|\hat\Delta\|_1\big) + \frac{2}{T}\varepsilon_T^{\mathrm T}X(\hat\beta_T - \beta^*).$$
If we choose $k\lambda$ as follows,
$$k\lambda = 2c_3 = \frac{4\sqrt{2A_1A_3}\,\sigma^2(c+1)}{1-\alpha^*}\log(2T)\sqrt{\frac{\log(2T)}{T}} = O\left(\frac{(\log T)^{3/2}}{T^{1/2}}\right),$$
we have $k\lambda/2 = c_3 > c_2$ for $T$ large enough, where $c_3$ and $c_2$ are defined in (20) and (21), respectively. Then on the event $\mathcal A$, we have
$$\frac{1}{T}\big|\varepsilon_T^{\mathrm T}X\big| = \frac{1}{T}\left(\Big|\sum_{i=1}^T \varepsilon_i x_{i-1}\Big|,\ \Big|\sum_{i=1}^T \varepsilon_i\Big|,\ \Big|\sum_{i=2}^T \varepsilon_i\Big|,\ \ldots,\ |\varepsilon_T|\right)^{\mathrm T} \le \frac{k\lambda}{2}\,\mathbf 1, \quad (22)$$
where $\mathbf 1 \in \mathbb R^{T+1}$ is the vector of ones and the inequality is entrywise. Thus we will obtain $\|\eta\|_\infty = \|\varepsilon_T^{\mathrm T}X/T\|_\infty \le k\lambda/2$ and
$$\frac{1}{T}\|X(\hat\beta_T - \beta^*)\|_2^2 \le \lambda\big(\|\Delta^*\|_1 - \|\hat\Delta\|_1\big) + k\lambda\|\hat\beta_T - \beta^*\|_1.$$
Since the left-hand side is non-negative, adding $\lambda(1-k)\|e_{I_2\cup I_3}\|_1$ to both sides of this inequality and dividing by $\lambda$, we will get
$$(1-k)\|e_{I_2\cup I_3}\|_1 \le \big(\|\Delta^*\|_1 - \|\hat\Delta\|_1 + \|e_{I_2\cup I_3}\|_1\big) + k\|e_{I_1}\|_1. \quad (23)$$
Since $\Delta^* = \beta^*_{I_2\cup I_3}$, we have $e_{I_2\cup I_3} = \hat\Delta - \Delta^*$. By the sparse structure we know
$$\|\Delta^*\|_1 - \|\hat\Delta\|_1 + \|e_{I_2\cup I_3}\|_1 \le 2\|\Delta^*\|_1 \le 2s\delta_2. \quad (24)$$
Meanwhile, since $\Delta^*$ takes value zero on the index set $I_3$, we have $\hat\Delta_{I_3} = e_{I_3}$ and thus $\|\hat\Delta_{I_3}\|_1 = \|e_{I_3}\|_1$. Therefore, we have
$$\|\Delta^*\|_1 - \|\hat\Delta\|_1 + \|e_{I_2\cup I_3}\|_1 = \|\Delta^*_{I_2}\|_1 - \|\hat\Delta_{I_2}\|_1 + \|e_{I_2}\|_1 \le 2\|e_{I_2}\|_1. \quad (25)$$
Plugging (24) and (25) back into (23) and subtracting $(1-k)\|e_{I_2}\|_1$ from both sides, we will get
$$\|e_{I_3}\|_1 \le \min\left\{\frac{1+k}{1-k}\|e_{I_2}\|_1,\ \frac{2s\delta_2}{1-k}\right\} + \frac{k}{1-k}\|e_{I_1}\|_1.$$
We complete the proof.

Proof of Lemma 3. By (7), we have
$$\|e_{I_2}\|_2 \le \|e_{I_2}\|_1 = \|\hat\Delta_{I_2} - \Delta^*_{I_2}\|_1 \le \|\hat\Delta_{I_2}\|_1 + \|\Delta^*\|_1 \le \delta + s\delta_2.$$
Partition the index set $J^c$ into $L$ disjoint sets, $J^c = \cup_{\ell=1}^L J_\ell$, where $|J_1| = \cdots = |J_{L-1}| = m$ and $|J_L| \le m$. Since $\sum_{\ell=1}^L \|e_{J_\ell}\|_2 \le \sum_{\ell=1}^L \|e_{J_\ell}\|_1 = \|e_{J^c}\|_1$, we get
$$\frac{1}{\sqrt T}\|Xe\|_2 \ge \frac{1}{\sqrt T}\|Xe_J\|_2 - \frac{1}{\sqrt T}\|Xe_{J^c}\|_2 \ge \sqrt{\varphi_{\min}(2)}\|e_{I_1}\|_2 - \sqrt{\varphi_{\max}(s-1)}\|e_{I_2}\|_2 - \sqrt{\varphi_{\max}(m)}\sum_{\ell=1}^L\|e_{J_\ell}\|_2$$
$$\ge \sqrt{\varphi_{\min}(2)}\|e_{I_1}\|_2 - \sqrt{(s-1)\Big(1-\frac{s-1}{T}\Big)}(\delta + s\delta_2) - \sqrt{\varphi_{\max}(m)}\|e_{J^c}\|_1.$$
Since $I_3 = J^c$ and $\|e_{I_1}\|_1 \le \sqrt2\|e_{I_1}\|_2$, by Lemma 2 we have
$$\frac{1}{\sqrt T}\|Xe\|_2 \ge \left(\sqrt{\varphi_{\min}(2)} - \frac{\sqrt2\,k}{1-k}\sqrt{\varphi_{\max}(m)}\right)\|e_{I_1}\|_2 - \left(\frac{2\sqrt{\varphi_{\max}(m)}}{1-k} + \sqrt{(s-1)\Big(1-\frac{s-1}{T}\Big)}\right)s\delta_2 - \sqrt{(s-1)\Big(1-\frac{s-1}{T}\Big)}\,\delta.$$
Denote $\kappa = \sqrt{\varphi_{\min}(2)} - \frac{\sqrt2\,k}{1-k}\sqrt{\varphi_{\max}(m)}$ and we complete the proof.
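The restricted eigenvalues $\varphi_{\min}(2)$ and $\varphi_{\max}(1)$ are concrete, computable quantities. The sketch below builds the design matrix for an ar(1) model with a piecewise-constant background (first column the lagged series, second a column of ones, remaining columns the step vectors carrying the increments $\Delta_i$) and evaluates them; it assumes, per the remark in the proof of Theorem 1, that $\varphi_{\min}(2)$ is the smallest eigenvalue of $\tilde X^{\mathrm T}\tilde X/T$ and that $\varphi_{\max}(m)$ is taken over the $\Delta$-columns. All numbers are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    T, alpha, mu = 500, 0.3, 0.5
    x = np.zeros(T + 1)
    f = np.where(np.arange(T + 1) > T // 2, 1.0, 0.0)   # one background jump
    for t in range(1, T + 1):
        x[t] = alpha * x[t - 1] + mu + f[t] + 0.1 * rng.standard_normal()

    # design: col 1 = lagged series, col 2 = ones, cols 3..T+1 = step vectors
    steps = np.tril(np.ones((T, T - 1)), k=-1)          # steps[t, i] = 1{t >= i + 1}
    X = np.column_stack([x[:-1], np.ones(T), steps])

    # phi_min(2): smallest eigenvalue of the Gram matrix of the first two columns
    Xt = X[:, :2]
    phi_min2 = np.linalg.eigvalsh(Xt.T @ Xt / T).min()

    # phi_max(1): largest normalized squared column norm over the Delta-block
    phi_max1 = (np.sum(X[:, 2:] ** 2, axis=0) / T).max()

    print(phi_min2, phi_max1, 1 - 1 / T)   # phi_max(1) equals 1 - 1/T

The last line verifies the identity $\varphi_{\max}(1) = 1 - 1/T$ used at the end of the proof of Proposition 1: the longest step column contains $T-1$ ones.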
Proof of Lemma 4. Since $\hat\beta_T$ is a solution to $\mathrm{VI}[F_{x_T}, \mathcal X]$, the weak VI, and the vector field $F_{x_T}(\cdot)$ is continuous, $\hat\beta_T$ is also a solution to the strong VI. That is, $\hat\beta_T$ also satisfies
$$\langle F_{x_T}(\hat\beta_T), w - \hat\beta_T\rangle \ge 0, \quad \forall w \in \mathcal X.$$
In particular, we have $\langle F_{x_T}(\hat\beta_T), \beta^* - \hat\beta_T\rangle \ge 0$. Meanwhile, we have
$$F_{x_T}(\hat\beta_T) = F_{x_T}(\beta^*) - A[x_T](\beta^* - \hat\beta_T)/T.$$
Therefore, we will have
$$\left\langle F_{x_T}(\beta^*) - \frac{1}{T}A[x_T](\beta^* - \hat\beta_T),\ \beta^* - \hat\beta_T\right\rangle \ge 0.$$
Rearranging terms and recalling that $\eta = F_{x_T}(\beta^*)$, we will get
$$(\beta^* - \hat\beta_T)^{\mathrm T}\big(A[x_T]/T\big)(\beta^* - \hat\beta_T) \le \langle\eta, \beta^* - \hat\beta_T\rangle \le \|\eta\|_\infty\|\beta^* - \hat\beta_T\|_1, \quad (26)$$
where the last inequality comes from Hölder's inequality. Noticing that $A[x_T] = X^{\mathrm T}X$, we can re-express the inequality above as
$$\frac{1}{T}\|Xe\|_2^2 \le \|\eta\|_\infty\|e\|_1 = \|\eta\|_\infty\big(\|e_J\|_1 + \|e_{J^c}\|_1\big) \le \frac{2}{1-k}\|\eta\|_\infty\|e_J\|_1,$$
where the last inequality comes from Lemma 2. By (22) and the choice of $k\lambda$, we get
$$\|\eta\|_\infty \le \frac{k\lambda}{2} = \frac{2\sqrt{2A_1A_3}\,\sigma^2(c+1)}{1-\alpha^*}\log(2T)\sqrt{\frac{\log(2T)}{T}} = O\left(\frac{(\log T)^{3/2}}{T^{1/2}}\right).$$
We complete the proof.

C.2 Proof of Theorem 2

Proof of Theorem 2. By Proposition 2, to make the $\ell_2$ error lower bounded by $C_0$ with probability greater than $1 - C_1$, we need
$$C_0 = \frac{1}{2}\exp\left\{-\frac{C_2T + C_3\delta_2(T)\sum_{t=2}^T s(t) + C_4\delta_2^2(T)\sum_{t=2}^T s^2(t) + \log 2}{C_1\,s(T)}\right\}. \quad (27)$$
Since $s(t) \le t$, we will have a lower bound decreasing (w.r.t. $T$) at an approximately exponential rate. Thus, without any condition, the naive bound will be tighter than the one we just derived if $\sqrt{s(t)}\,\delta_2(t)$ goes to infinity. To make sure the lower bound $C_0$ we derive in (27) is of constant order, we need $s(t)$ at least of order $t$, i.e., condition (9). However, this makes $\sum_{t=2}^T s(t) = \Theta(T^2)$, and we further need $\delta_2(t)$ small enough when $t \in \{2, \ldots, T\}$, i.e., condition (10).

Proof of Proposition 2. First, we find a large enough $\varepsilon$-packing via the following lemma.

Lemma 5. Let $(V, \|\cdot\|)$ be a normed space. For $\Theta \subset V \subset \mathbb R^d$, we have
$$\left(\frac{1}{\varepsilon}\right)^d\frac{\mathrm{vol}(\Theta)}{\mathrm{vol}(B)} \le N(\Theta, \|\cdot\|, \varepsilon) \le \left(\frac{3}{\varepsilon}\right)^d\frac{\mathrm{vol}(\Theta)}{\mathrm{vol}(B)},$$
where $B$ is the unit norm ball and $N(\Theta, \|\cdot\|, \varepsilon) = \max\{m : \exists\ \varepsilon\text{-packing of }\Theta\text{ of size }m\}$ is the packing number.

Recall that the coefficient vector space is $\Theta_T = \{\beta : (\alpha, \mu) \in S, \Delta \in \mathcal B\}$. Since $\delta_s$ is constant, $\Theta_T$ will have a constant-order volume, even though $\delta_2$ can be very small. Thus, by Lemma 5 we can find an $\varepsilon$-packing $\mathcal N = \{\beta_1, \ldots, \beta_N\} \subset \Theta_T$ such that
$$N \ge C_5\left(\frac{1}{\varepsilon}\right)^{s(T)}, \quad (28)$$
where $C_5$ is some positive constant.

Lemma 6. For any $\varepsilon$-packing $\mathcal N = \{\beta_1, \ldots, \beta_N\} \subset \Theta_T$, if the random noise is normally distributed, then the upper bound on the KL divergence between the joint distributions of $x_T$ generated by (5) with coefficients chosen from $\mathcal N$ is
$$\max_{i,j\in[N]}\mathrm{KL}(p_{\beta_i}\,\|\,p_{\beta_j}) \le C_2T + C_3\delta_2(T)\sum_{t=2}^T s(t) + C_4\delta_2^2(T)\sum_{t=2}^T s^2(t),$$
where $p_\beta$ is the joint probability density function (p.d.f.) of $x_T$ generated by (5) with coefficient $\beta$, and $C_2$, $C_3$ and $C_4$ are some positive constants dependent on $\delta_s$.

Lemma 7 (Fano's inequality). Let $\mathcal P = \{P_1, \ldots, P_N\}$. For any random variable $Z$ taking values in $[N]$, we have
$$\frac{1}{N}\sum_{i=1}^N P_i(Z \ne i) \ge 1 - \frac{N^{-2}\sum_{i,j\in[N]}\mathrm{KL}(P_i\|P_j) + \log 2}{\log N}, \quad (29)$$
where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler (KL) divergence.

By Fano's inequality (29), we have, for any random variable $Z$,
$$\frac{1}{N}\sum_{i=1}^N \mathrm{pr}_{\beta_i}(Z \ne i) \ge 1 - \frac{\max_{i,j\in[N]}\mathrm{KL}(p_i\|p_j) + \log 2}{\log N}. \quad (30)$$
For any estimator $\tilde\beta_T$, define
$$\hat\psi = \psi(\tilde\beta_T) = \mathop{\mathrm{argmin}}_{i\in[N]}\|\tilde\beta_T - \beta_i\|_2, \quad (31)$$
which is the index of the element closest to $\tilde\beta_T$ (in the $\ell_2$-norm sense) in the $\varepsilon$-packing $\mathcal N$. Therefore, for any $\hat\psi \ne i$, we have
$$\|\tilde\beta_T - \beta_i\|_2 \ge \|\beta_{\hat\psi} - \beta_i\|_2 - \|\tilde\beta_T - \beta_{\hat\psi}\|_2 \ge \|\beta_{\hat\psi} - \beta_i\|_2 - \|\tilde\beta_T - \beta_i\|_2,$$
where the last inequality comes from (31). Rearranging terms in the inequality above, we will have
$$\|\tilde\beta_T - \beta_i\|_2 \ge \frac{1}{2}\|\beta_{\hat\psi} - \beta_i\|_2 \ge \frac{\varepsilon}{2},$$
where the last inequality comes from the definition of an $\varepsilon$-packing, i.e., $\min_{i\ne j}\|\beta_i - \beta_j\|_2 > \varepsilon$. This means that when $\beta^* = \beta_i$, the event $\{\hat\psi \ne i\}$ is a subset of the event $\{\|\tilde\beta_T - \beta_i\|_2 \ge \varepsilon/2\}$. Therefore, we have
$$\sup_{\beta\in\Theta_T}\mathrm{pr}_\beta\big(\|\tilde\beta_T - \beta\|_2 \ge \varepsilon/2\big) \ge \sup_{\beta\in\mathcal N}\mathrm{pr}_\beta\big(\|\tilde\beta_T - \beta\|_2 \ge \varepsilon/2\big) \ge \max_{i\in[N]}\mathrm{pr}_{\beta_i}\big(\hat\psi \ne i\big) \ge \frac{1}{N}\sum_{i=1}^N\mathrm{pr}_{\beta_i}\big(\hat\psi \ne i\big).$$
Taking $Z = \hat\psi$ and $\varepsilon = 2C_0$ in (12), by (28) and (30), we complete the proof.

Proof of Lemma 6. For $x_T$ generated by (5) with $\beta = (\alpha, \mu, \Delta_2, \ldots, \Delta_T)^{\mathrm T}$, we can derive that
$$x_t = \frac{1-\alpha^t}{1-\alpha}\mu + \sum_{i=2}^t\frac{1-\alpha^{t+1-i}}{1-\alpha}\Delta_i + \sum_{i=1}^t\alpha^{t-i}\varepsilon_i, \quad t = 1, \ldots, T, \quad (32)$$
where for $t = 1$ the second term is zero. We further denote
$$\tau(t) = \frac{1-\alpha^t}{1-\alpha}\mu + \sum_{i=2}^t\frac{1-\alpha^{t+1-i}}{1-\alpha}\Delta_i \quad \text{and} \quad B(t) = \sum_{i=1}^t\alpha^{t-i}\varepsilon_i.$$
Therefore, if the random noise in (5) is Gaussian (we take $\sigma = 1$ for simplicity), then the joint distribution of $x_T$ will be $N(\tau, \Sigma)$, where $\tau = (\tau(1), \tau(2), \ldots, \tau(T))^{\mathrm T}$, $\Sigma = P_\alpha P_\alpha^{\mathrm T}$ and
$$P_\alpha = \begin{pmatrix} 1 & & & & \\ \alpha & 1 & & & \\ \alpha^2 & \alpha & 1 & & \\ \vdots & & \ddots & \ddots & \\ \alpha^{T-1} & \alpha^{T-2} & \cdots & \alpha & 1 \end{pmatrix}.$$
By some simple algebra, we will obtain $\det(\Sigma) = \det(P_\alpha P_\alpha^{\mathrm T}) = \det(P_\alpha)^2 = 1$ and
$$\Sigma^{-1} = \begin{pmatrix} \alpha^2+1 & -\alpha & & & \\ -\alpha & \alpha^2+1 & -\alpha & & \\ & \ddots & \ddots & \ddots & \\ & & -\alpha & \alpha^2+1 & -\alpha \\ & & & -\alpha & 1 \end{pmatrix}.$$
Arbitrarily choose two distinct coefficients from $\mathcal N$; without loss of generality, we denote them by $\beta_i$ ($i = 1, 2$). Given $x_0$, denote the joint p.d.f. of $x_T$ generated by (5) with coefficient $\beta$ by $p(x_T \mid x_0; \beta)$. For simplicity, we denote $p_i = p(x_T \mid x_0; \beta_i)$ for $\beta_i \in \mathcal N$. By the derivation above, $p_i$ is the joint p.d.f. of $N(\tau_i, \Sigma_i)$. Then the KL divergence between these two $T$-dimensional multivariate Gaussian distributions is
$$\mathrm{KL}\big(N(\tau_1, \Sigma_1)\,\|\,N(\tau_2, \Sigma_2)\big) = \int \log\frac{p_1(x)}{p_2(x)}\,p_1(x)\,dx$$
$$= \int\left[\frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2}(x-\tau_1)^{\mathrm T}\Sigma_1^{-1}(x-\tau_1) + \frac{1}{2}(x-\tau_2)^{\mathrm T}\Sigma_2^{-1}(x-\tau_2)\right]p_1(x)\,dx$$
$$= \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2}\mathrm{tr}\big\{E[(x-\tau_1)(x-\tau_1)^{\mathrm T}]\Sigma_1^{-1}\big\} + \frac{1}{2}E\big[(x-\tau_2)^{\mathrm T}\Sigma_2^{-1}(x-\tau_2)\big]$$
$$= \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2}\mathrm{tr}\{I_T\} + \frac{1}{2}(\tau_1-\tau_2)^{\mathrm T}\Sigma_2^{-1}(\tau_1-\tau_2) + \frac{1}{2}\mathrm{tr}\big\{\Sigma_2^{-1}\Sigma_1\big\}$$
$$= \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - T + \mathrm{tr}\big\{\Sigma_2^{-1}\Sigma_1\big\} + (\tau_1-\tau_2)^{\mathrm T}\Sigma_2^{-1}(\tau_1-\tau_2)\right]. \quad (33)$$
Since $\det(\Sigma_1) = \det(\Sigma_2) = 1$, we have
$$\mathrm{KL}(p_1\|p_2) = \frac{1}{2}\left[\mathrm{tr}\big\{\Sigma_2^{-1}\Sigma_1\big\} - T + (\tau_1-\tau_2)^{\mathrm T}\Sigma_2^{-1}(\tau_1-\tau_2)\right]. \quad (34)$$
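The covariance algebra above is easy to verify numerically. A minimal sketch (assuming unit-variance Gaussian noise and zero background, so that only the trace term in (34) survives) builds $P_\alpha$, checks the tridiagonal form of $\Sigma^{-1}$, and evaluates the KL divergence between two ar(1) path distributions:

    import numpy as np

    def P_alpha(a, T):
        """Lower-triangular Toeplitz matrix with (i, j) entry a^(i-j) for i >= j."""
        i, j = np.indices((T, T))
        return np.where(i >= j, a ** np.maximum(i - j, 0), 0.0)

    T, a1, a2 = 200, 0.3, 0.5
    P1, P2 = P_alpha(a1, T), P_alpha(a2, T)
    S1, S2 = P1 @ P1.T, P2 @ P2.T

    # Sigma^{-1} is tridiagonal: a^2 + 1 on the diagonal (1 in the last entry), -a off it
    S2inv = np.linalg.inv(S2)
    tri = (a2**2 + 1) * np.eye(T) - a2 * (np.eye(T, k=1) + np.eye(T, k=-1))
    tri[-1, -1] = 1.0
    assert np.allclose(S2inv, tri)

    # KL divergence (34) with tau1 = tau2 = 0: only the trace term survives
    kl = 0.5 * (np.trace(S2inv @ S1) - T)
    print(kl)   # nonnegative and growing linearly in T, matching the C2*T term in Lemma 6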
On one hand, from the explicit forms of $\Sigma_1$ and $\Sigma_2$, we can derive the explicit form of the diagonal elements of $\Sigma_2^{-1}\Sigma_1$. For $i = 2, \ldots, T-1$, we have
$$\big(\Sigma_2^{-1}\Sigma_1\big)_{i,i} = \sum_{k=1}^T\big(\Sigma_2^{-1}\big)_{k,i}\big(\Sigma_1\big)_{k,i} = -\alpha_2\Big(\big(\Sigma_1\big)_{i-1,i} + \big(\Sigma_1\big)_{i+1,i}\Big) + (\alpha_2^2+1)\big(\Sigma_1\big)_{i,i}$$
$$= -\alpha_1\alpha_2\left(\frac{1-\alpha_1^{2(i-1)}}{1-\alpha_1^2} + \frac{1-\alpha_1^{2i}}{1-\alpha_1^2}\right) + (\alpha_2^2+1)\frac{1-\alpha_1^{2i}}{1-\alpha_1^2} \le \frac{\alpha_2^2 + 2|\alpha_1\alpha_2| + 1}{1-\alpha_1^2} \le \frac{\delta_s^2 + 2\delta_s + 1}{1-\delta_s^2}.$$
Similarly, we can derive the expressions for $\big(\Sigma_2^{-1}\Sigma_1\big)_{1,1}$ and $\big(\Sigma_2^{-1}\Sigma_1\big)_{T,T}$ and upper bound them by constants. This means all diagonal elements are uniformly bounded by a constant. Thus, we have
$$\mathrm{tr}\big\{\Sigma_2^{-1}\Sigma_1\big\} - T \le (c_4 - 1)T, \quad (35)$$
where the constant $c_4$ is the uniform upper bound, and we can show $c_4 > 1$.

On the other hand, for $i = 1, 2$, we have
$$|\tau_i(t)| = \left|\frac{1-\alpha_i^t}{1-\alpha_i}\mu_i + \sum_{j=2}^t\frac{1-\alpha_i^{t+1-j}}{1-\alpha_i}\Delta_{i,j}\right| \le \frac{2\big(\delta_s + s(t)\delta_2\big)}{1-\delta_s}.$$
Therefore, we have
$$|\tau_1 - \tau_2| \le \frac{2}{1-\delta_s}\big(\delta_s + s(1)\delta_2, \ldots, \delta_s + s(T)\delta_2\big)^{\mathrm T} = \frac{2}{1-\delta_s}\big(\delta_s\mathbf 1 + \delta_2 s_T\big),$$
where $\mathbf 1 \in \mathbb R^T$ is the vector of ones, $s_T = (s(1), \ldots, s(T))^{\mathrm T}$, and the inequality is entrywise. Denoting by $\tilde\Sigma^{-1}$ the matrix obtained from $\Sigma_2^{-1}$ by replacing the off-diagonal entries $-\alpha_2$ with $|\alpha_2|$, we can get
$$\left|(\tau_1-\tau_2)^{\mathrm T}\Sigma_2^{-1}(\tau_1-\tau_2)\right| \le |\tau_1-\tau_2|^{\mathrm T}\tilde\Sigma^{-1}|\tau_1-\tau_2| = \left(\frac{2}{1-\delta_s}\right)^2\left(\delta_s^2\,\mathbf 1^{\mathrm T}\tilde\Sigma^{-1}\mathbf 1 + 2\delta_s\delta_2\,\mathbf 1^{\mathrm T}\tilde\Sigma^{-1}s_T + \delta_2^2\,s_T^{\mathrm T}\tilde\Sigma^{-1}s_T\right).$$
We can upper bound the last three terms above as follows (notice that $s(1) = 0$):
$$\mathbf 1^{\mathrm T}\tilde\Sigma^{-1}\mathbf 1 \le (1+|\alpha_2|)^2T; \qquad \mathbf 1^{\mathrm T}\tilde\Sigma^{-1}s_T \le (1+|\alpha_2|)^2\big(s(2) + \cdots + s(T)\big);$$
$$s_T^{\mathrm T}\tilde\Sigma^{-1}s_T = \sum_{i=2}^{T-2}\Big(|\alpha_2|\big(s(i)+s(i+2)\big) + (\alpha_2^2+1)s(i+1)\Big)s(i+1) + (\alpha_2^2+1)s(2)^2 + |\alpha_2|\big(s(2)s(3) + s(T-1)s(T)\big) + s(T)^2.$$
By (34), (35) and the last four inequalities, we prove Lemma 6.

D Additional experiments

D.1 Numerical simulation

We set $\alpha^* = 0.1$, $\sigma = 0.1$, $\delta_2 = 0.1$ and choose $s \in \{20, 200, 1000\}$. For each $s$, we plot $\hat\alpha$ and the $\delta$ selected by the Ljung-Box test with respect to time $T$. The results are shown in Figure 10.

[Figure 10: $\hat\alpha$ (top) and the selected $\delta$, on a log scale (bottom), versus $T$; each panel title reports $\alpha^*$, $\|\Delta^*\|_1$, $\delta_2$ and $s$.]

Apart from the piecewise-constant function class, polynomials are another highly expressive function class. Xu [2008] proposed to use $n$-th order polynomials (n-poly) to approximate the unstructured dynamics in non-stationary autoregressive time series; the autoregressive coefficients and polynomial coefficients are then estimated via ordinary least squares (OLS). However, he did not give instructions on how to choose $n$ in practice. Here, we choose three values of $n$ and compare n-poly with our proposed method under the setting $\alpha^* = 0.1$, $\sigma = 0.1$, $s = 2000$, $\delta_2 = 0.1$, $\|\Delta^*\|_1 = 24.8777$. The results are plotted in Figure 13. From Figure 13, we can see that none of the polynomial methods considered here yields an accurate estimate for the ar(1) series with highly unstructured dynamics; a minimal sketch of this baseline is given below.
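The n-poly baseline is straightforward to reproduce: regress $x_t$ on its lag together with polynomial time trends and read off the AR coefficient from the OLS solution. A minimal sketch (the data-generating numbers here are illustrative, not those of Figure 13):

    import numpy as np

    def npoly_ar1(x, n):
        """OLS fit of x_t ~ alpha * x_{t-1} + polynomial-in-t trend of order n."""
        T = len(x) - 1
        t = np.arange(1, T + 1) / T                       # rescaled time in (0, 1]
        basis = np.vander(t, n + 1, increasing=True)      # columns 1, t, ..., t^n
        D = np.column_stack([x[:-1], basis])              # lag plus trend columns
        coef, *_ = np.linalg.lstsq(D, x[1:], rcond=None)
        return coef[0], basis @ coef[1:]                  # (alpha_hat, fitted trend)

    # illustrative series: ar(1) with a single background jump, which no
    # low-order polynomial can track well
    rng = np.random.default_rng(2)
    T, alpha = 2000, 0.1
    x = np.zeros(T + 1)
    for t in range(1, T + 1):
        jump = 1.0 if t > T // 2 else 0.0
        x[t] = alpha * x[t - 1] + jump + 0.1 * rng.standard_normal()

    for n in (2, 5, 10):
        print(n, npoly_ar1(x, n)[0])   # alpha_hat typically deviates from 0.1

Because a global polynomial of moderate order cannot track a piecewise-constant background, the residual trend leaks into the lagged regressor and biases $\hat\alpha$.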
[Figure 13: the proposed estimator versus the n-poly baselines; each panel title reports $\hat\alpha$, $\|\Delta\|_1$, the Ljung-Box $p$-value $p$, and the selected $\delta$.]

D.2 Logarithm transform of residuals

Here, we take subject 23 as an example to show in detail why we choose to use the logarithm transform. First, we directly apply our proposed estimator to the RT sequence with the hyperparameter selected by the Ljung-Box test, as detailed in the proposed tuning procedure. Since we do not have the ground truth, we can only assess the goodness-of-fit by how closely our residual sequence resembles white noise. We plot the histogram as well as the QQ-plot of the fitted residual sequence; these two plots are shown in the first row of Figure 14.

The histogram shows that the residuals are right-skewed; in fact, this is true for nearly all subjects. The Ljung-Box test is commonly used in autoregressive integrated moving average (ARIMA) modeling, which requires a Gaussian random noise assumption, and this assumption clearly breaks down in this study. Therefore, the $p$-value of the Ljung-Box test applied directly to the residual sequence may not be a reasonable metric for the goodness-of-fit, which undermines the validity of the $\delta$ selected by the Ljung-Box test. Nevertheless, testing for remaining serial correlation in the residual sequence is the ultimate goal of applying the Ljung-Box test. Thus, we can transform the residuals to more closely approximate a Gaussian distribution and then apply the Ljung-Box test to the transformed residuals to check for serial correlation.

For right-skewed data, the most commonly used transforms are the cube root and the logarithm. We apply both transforms here. The transforms are performed by first subtracting a multiple of the minimum of the residuals from the residual sequence (to make sure we obtain meaningful values after the logarithm), and then applying the cube root or logarithm transform to this sequence.

We perform the aforementioned hyperparameter tuning procedure in the proposed tuning procedure for the original and transformed residuals. More precisely, the $p$-value in step 2.(ii) is obtained by applying the Ljung-Box test to the original, cube-root, and logarithm residuals; a sketch of this step is given below.
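A minimal sketch of the transform-then-test step, assuming a residual vector r with a negative minimum (as in this dataset); the shift factor 1.5 is an illustrative choice, and any constant making the shifted sequence positive works:

    import numpy as np
    from statsmodels.stats.diagnostic import acorr_ljungbox

    def transformed_pvalues(r, lag=1):
        """Lag-`lag` Ljung-Box p-values of the original, cube-root, and log residuals."""
        shifted = r - 1.5 * r.min()          # shift so the sequence is strictly positive
        out = {}
        for name, seq in (("ori", r), ("cr", np.cbrt(shifted)), ("log", np.log(shifted))):
            out[name] = acorr_ljungbox(seq, lags=[lag], return_df=True)["lb_pvalue"].iloc[0]
        return out

    # usage: p = transformed_pvalues(residuals); the largest p-value indicates
    # the transform under which the residuals look most like white noise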
Foreach method, we denote the selected hyperparameter δ and the maximum of p -value to be37 200 400 600 800 100000.511.522.5 -0.5 0 0.5 1 1.50100200300 -4 -2 0 2 4-1-0.500.511.5 p = 0.412890 200 400 600 800 100000.511.522.5 0.4 0.6 0.8 1 1.2050100150200250 -4 -2 0 2 40.40.60.811.2 p = 0.678720 200 400 600 800 100000.511.522.5 -2.5 -2 -1.5 -1 -0.5 0 0.5050100150200250 -4 -2 0 2 4-3-2-101 p = 0.95079 Figure 14: Experimental results of applying our proposed estimator to subject 23 withhyperparameter δ ori (top), δ cr (middle) and δ log (bottom). The first column plots rawobservation (blue), fitted ar (1) model (red) and fitted dynamic background (yellow) withhyperparameter δ and estimated ar (1) coefficient (cid:98) α on the top; the second and third columnplot the histogram and quantile-quantile (QQ) plot of (original, cube root of and logarithmof) residuals (with p ori , p cr , p log on the top). 38 δ ori , p ori ) , ( δ cr , p cr ) , ( δ log , p log ) , respectively. We illustrate all these three methods on subject23 by plotting the fitted ar (1) model, fitted dynamic background, histogram and QQ-plot ofthe residual sequence in Figure 14.Figure 14 shows that for subject 23 (i) from the first column, the first method clearlyunderfits the dynamic background; (ii) from the second column, the last histogram is muchmore symmetric and closely resembles p.d.f. of normal distribution; (iii) from the thirdcolumn, the last method has larger p -value, indicating less serial correlation remained inresidual sequence. This again shows that why we use p -value to select the hyperparameter —it is a easy-to-use metric which correctly indicates whether the dynamic background is fittedproperly. Moreover, we see that the third method, i.e. using logarithm transform, is the bestfor subject 23. In fact, logarithm transform the best for almost all subjects in the sense that p log is the largest among p ori , p cr , p log . We also observe that for those subjects that p log is notthe largest, the tuning parameter δδ