On Multivariate Singular Spectrum Analysis and its Variants
Anish Agarwal∗, Abdullah Alomar, Devavrat Shah
MIT
Abstract
We analyze a variant of multivariate singular spectrum analysis (mSSA), a widely used multivariate time series method, which we find to perform competitively with respect to state-of-the-art neural network time series methods (LSTM, DeepAR [16]). Its restriction to a single time series, singular spectrum analysis (SSA), has been analyzed recently [10, 2]. Despite its popularity, a theoretical understanding of mSSA is absent. Towards this, we introduce a natural spatio-temporal factor model to analyze mSSA. We establish that the in-sample prediction error for imputation and forecasting under mSSA scales as 1/√(NT), for N time series with T observations per time series. In contrast, for SSA the error scales as 1/√T, and for matrix factorization based time series methods (cf. [25, 15]) the error scales as 1/min(N, T). We utilize an online learning framework to analyze the one-step-ahead prediction error of mSSA and establish that it has sub-linear regret with respect to the in-sample forecasting error. By applying mSSA to the square of the time series observations, we furnish an algorithm to estimate the time-varying variance of a time series and establish that its in-sample imputation/forecasting error scales as 1/√(NT). To establish our results, we make three technical contributions. First, we establish that the "stacked" Page matrix time series representation, the core data structure in mSSA, has an approximate low-rank structure for a large class of time series models used in practice under the spatio-temporal factor model. Second, we extend the theory of online convex optimization to address the variant in which the constraints are time-varying. Third, we extend the prediction error analysis of Principal Component Regression beyond the recent work of [3] to the setting where the covariate matrix is approximately low-rank.

∗ All authors are with the Massachusetts Institute of Technology during the course of this work.
Their affiliations include the Department of EECS, LIDS, IDSS, the Statistics and Data Science Center, and CSAIL. Their email addresses: {anish90, aalomar, devavrat}@mit.edu.

Preprint. Under review.

Contents

2.2 (G, ε)-Hankel Representable Time Series Dynamics
 2.2.1 (G, ε)-Hankel Time Series "Algebra"
 2.2.2 (G, ε)-LRF Time Series
 2.2.3 "Smooth", Periodic Time Series - C^k_{T-per}
 2.2.4 Time Series with Latent Variable Model Structure
2.3 Stacked Hankel Matrix of mSSA is (Approximately) Low-Rank
A.1 Mean Estimation
 A.1.1 Datasets
 A.1.2 Mean Imputation
 A.1.3 Mean Forecasting
A.2 Variance Estimation
 A.2.1 Datasets
 A.2.2 Variance Imputation and Forecasting
A.3 Algorithm Parameters and Settings
B Proofs - (G, ε)-Hankel Representability of Different Time Series Dynamics
 B.1 Proof of Proposition 2.1
 B.2 Proof of Proposition 2.2
 B.3 Proof of Proposition 2.4
  B.3.1 Helper Lemmas for Proposition 2.4
  B.3.2 Completing the Proof of Proposition 2.4
 B.4 Proof of Proposition 2.5
C Proofs - Stacked Hankel Matrix is Approximately Low-Rank
 C.1 Helper Lemmas
 C.2 Proof of Proposition 2.6
D Imputation and Forecasting Error Analysis - Proof Notation
 D.1 Induced Linear Operator
E Concentration Inequality Lemmas
 E.0.1 Classic Results
 E.0.2 High Probability Events for Imputation and Forecasting
F HSVT Error
G Proofs - Imputation Analysis
 G.1 Proof of Theorem 4.1
 G.2 Proof of Theorem 4.3
H Proofs - Forecasting Analysis
 H.1 Forecasting - Helper Lemmas
 H.2 Proof of Theorem 4.2
 H.3 Proof of Theorem 4.4
I Proofs - Regret Analysis
 I.1 Regret - Helper Lemmas
  I.1.1 Bounding ∆(T + tL, n)
 I.2 Proof of Theorem 4.5

1 Introduction
Multivariate time series data is ubiquitous and of great interest across many application areas, including cyber-physical systems, finance, retail, and healthcare, to name a few. The goal across these domains can be summarized as accurate imputation and forecasting of a multivariate time series in the presence of noisy and/or missing data, along with providing meaningful uncertainty estimates.
Setup.
We consider a discrete-time setting with time indexed as t ∈ Z. Let f_n : Z → R, n ∈ [N] := {1, ..., N}, be the N ∈ N latent time series of interest. For t ∈ [T] and n ∈ [N], with probability ρ ∈ (0, 1], we observe the random variable X_n(t), where X_n(t) = f_n(t) + η_n(t). While the underlying time series f_n is of course strongly correlated over time, we assume the per-step noise terms η_n(t) are independent mean-zero random variables with time-varying variance, E[η_n^2(t)] := σ_n^2(t). Note σ_n^2(t) is a key parameter of interest in time series analysis for uncertainty quantification, especially in applications where the time series exhibits time-varying volatility (see [5]). We consider a time series model where f_n(t) and σ_n^2(t) satisfy a spatio-temporal factor model, as described in detail in Section 2. Goal.
The objective is two-fold, for n ∈ [N]: (i) imputation - estimating f_n(t), σ_n^2(t) for all t ∈ [T]; (ii) forecasting - learning a model to forecast f_n(t), σ_n^2(t) for t > T. We begin by describing SSA, a popular method in the literature for both imputation and forecasting (see [10, 2]) of a univariate time series. For the purpose of describing SSA, we assume access to only one time series X(t). The variant of SSA considered in [2] performs three key steps: (a) transform X(t), t ∈ [T], into an L × (T/L)-dimensional Page matrix by placing non-overlapping contiguous segments of size
L > 1 (an algorithmic hyper-parameter) as columns; (b) perform Principal Component Analysis (PCA), or equivalently Hard Singular Value Thresholding (HSVT), on the resulting Page matrix (the number of principal components retained is an algorithmic hyper-parameter) to impute the time series; and (c) perform Ordinary Least Squares (OLS) on the retained principal components and the last row of the Page matrix to learn a linear forecasting model.

Multivariate Singular Spectrum Analysis (mSSA).
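The three SSA steps above, and the stacked-Page extension that defines mSSA, can be sketched numerically. This is a minimal illustration, not the authors' reference implementation; the synthetic series, hyper-parameters L and k, and helper names are all choices made for the sketch.

```python
import numpy as np

def page_matrix(x, L):
    """Step (a): non-overlapping length-L segments of x as columns (L x T//L)."""
    P = len(x) // L
    return x[:P * L].reshape(P, L).T

def hsvt(M, k):
    """Step (b): hard singular value thresholding -- keep the top-k singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def fit_forecaster(M):
    """Step (c): OLS of the last row of the (denoised) Page matrix on the rest."""
    beta, *_ = np.linalg.lstsq(M[:-1].T, M[-1], rcond=None)
    return beta

# SSA on a single noisy harmonic
rng = np.random.default_rng(0)
t = np.arange(400)
x = np.cos(2 * np.pi * t / 20) + 0.1 * rng.standard_normal(t.size)
M_hat = hsvt(page_matrix(x, L=20), k=3)      # imputed/denoised values
beta = fit_forecaster(M_hat)                 # linear forecasting model

# mSSA: the same pipeline applied to the stacked Page matrix of N series
X = np.stack([x, np.roll(x, 5), 0.5 * x])    # 3 correlated series (toy example)
stacked = np.hstack([page_matrix(row, 20) for row in X])   # L x (N * T/L)
stacked_hat = hsvt(stacked, k=3)
```

Note the only change from SSA to mSSA is the column-wise concatenation of the per-series Page matrices before the PCA/HSVT and OLS subroutines.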
Often, multiple related time series are available, e.g., prices of different stocks or sensor readings from cyber-physical systems. In this setting, the challenge is that one wants a method that exploits both the temporal structure (i.e., the structure that exists within a time series) and the 'spatial' structure (i.e., the structure that exists across time series). Towards this goal, mSSA is a natural extension of the SSA method in which, in Step 1 of the algorithm, a "stacked" Page matrix of dimension L × (NT/L) is constructed from the multivariate time series data by concatenating (column-wise) the Page matrices induced by each time series. The subsequent two steps of mSSA remain equivalent to the SSA method introduced above.

Why Study mSSA?
First, mSSA is a well-established algorithm that is widely used in practice, yet it lacks theoretical understanding. Second, we show that the variant of mSSA we propose has rather impressive empirical effectiveness. As seen in Table 2, despite mSSA's simplicity, for both imputation and forecasting it performs competitively with the current best neural network based time series libraries on a variety of time series datasets. In Section A we include details on the experimental setup and the datasets used. Third, mSSA significantly outperforms SSA - thus, somehow, the simple additional step of concatenating the various Page matrices allows mSSA to exploit the additional "spatial" structure in the data that SSA cannot. This begs the question of when and why does mSSA work? In Section 4, we establish that both the imputation and (in-sample) forecasting prediction errors scale as (NT)^{-1/2}. In particular, by first performing the core data transformation in mSSA of constructing the "stacked" Page matrix, we see the error scales with the product of N and T. Our analysis allows one to reason about the sample complexity gains from exploiting both the low-rank spatial structure amongst the multivariate time series and the temporal structure within each time series. For example, in

Formally, for a time series f(t) with T observations, the Page matrix induced by it, M^f ∈ R^{L × (T/L)}, is defined as M^f_{ij} := f(i + (j - 1)L); the Hankel matrix induced by it, H^f ∈ R^{L × (T - L + 1)}, is defined as H^f_{ij} := f(i + j - 1).

The mSSA algorithm described here is a variant of the standard SSA/mSSA method (cf. [10]); however, the two core subroutines of performing PCA and subsequently OLS on the matrix induced from the time series remain the same. For a detailed comparison, please refer to our literature review.

We note that in companion work ([1]), we build a scalable, open-source time series prediction system (in line with the growing field of AutoML) using an incremental variant of mSSA.
Via comprehensive empirical testing, we show mSSA outperforms all listed methods in Table 2, both in terms of statistical and, just as importantly, computational performance. We include Table 2 in this work as evidence of mSSA's surprisingly high empirical effectiveness.
Table 1: Summary of our theoretical results. The first two columns indicate method functionality; the remaining columns give error/regret scalings.

| Method | Multivariate time series | Variance estimation | Mean imputation | Mean forecasting | Variance imputation | Variance forecasting | Regret |
| --- | --- | --- | --- | --- | --- | --- | --- |
| This work | Yes | Yes | (NT)^{-1/2} | (NT)^{-1/2} | (NT)^{-1/2} | (NT)^{-1/2} | see Cor. 4.1 |
| mSSA - literature | Yes | No | - | - | - | - | - |
| SSA [10, 2] | No | No | T^{-1/2} | - | - | - | - |
| Neural network [16, 6] | Yes | Yes | - | - | - | - | - |
| TRMF [25, 15] | Yes | No | (min(N,T))^{-1} | - | - | - | - |

Table 2: mSSA statistically outperforms SSA and other state-of-the-art algorithms, including LSTMs and DeepAR, across many datasets. We use the average normalized root mean squared error (NRMSE) as our metric.
| Method | Mean imp.: Electricity | Traffic | Synthetic | Financial | Mean for.: Electricity | Traffic | Synthetic | Financial | Var. imp.: Synthetic | Var. for.: Synthetic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mSSA | | | | | | | | | | |
| SSA | 0.519 | 0.608 | 0.626 | 0.466 | 0.552 | 0.704 | 0.522 | 0.592 | 0.154 | 0.209 |
| LSTM | NA | NA | NA | NA | 0.551 | | | | | |
SSA we do not get this additional scaling given by the number of time series, N, which might help explain mSSA's vastly superior empirical performance, as seen in Table 2. Further, recent results from the low-rank matrix factorization time series literature (see [25, 15]) show imputation error scales as 1/min(N, T) (see Theorem 2 of [15]): this does not lead to consistency unless both N and T grow, e.g., with access to only a single (N = 1) time series the bound does not decay. See Table 1 for a summary of our theoretical results.

Approximate Low-Rank Hankel with Spatio-Temporal Factor Model.
In Section 2.1, we propose a spatio-temporal factor model which brings together two heavily-studied models from well-established yet disparate literatures: the factor model used for panel data (i.e., multivariate time series data) in econometrics, and the low-rank Hankel model in the SSA literature. Under this model, in Proposition 2.6, we show that the stacked Page (and Hankel) matrix, the key data representation in mSSA, is indeed (approximately) low-rank. In Section 2.2, we show many important time series models (e.g., any differentiable, periodic function) have this approximate low-rank structure. Towards that, we develop a representation "calculus" for this model class by establishing that it is closed under component-wise addition and multiplication (cf. Proposition 2.1). Further, as empirical evidence, on real time series datasets we find the stacked Page matrix does indeed have (approximate) low-rank structure (see Table 3), and hence fits within our proposed model. The spatio-temporal factor model is in line with that introduced in [25]. However, as stated above, it is the additional stacked Page matrix data transformation in mSSA which allows us to prove that the prediction error scales with the product of N and T rather than min(N, T) as in [25].

Algorithm for Multivariate Time-Varying Variance Estimation.
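As a preview of the variance estimation algorithm analyzed in this section, here is a minimal sketch: apply the same imputation subroutine once to the observations and once to their squares, then subtract the squared mean estimate. The stacked-Page/HSVT subroutine, the synthetic data, and all hyper-parameters below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def stacked_page(X, L):
    """Stacked Page matrix: per-series Page matrices concatenated column-wise."""
    N, T = X.shape
    P = T // L
    return np.hstack([X[n, :P * L].reshape(P, L).T for n in range(N)])

def unstack(M, N, T, L):
    """Invert stacked_page back to an N x T array (assumes T divisible by L)."""
    P = T // L
    return np.stack([M[:, n * P:(n + 1) * P].T.reshape(-1) for n in range(N)])

def hsvt(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def estimate_mean_and_variance(X, L, k):
    """Mean: impute E[X]. Variance: impute E[X^2], subtract mean^2, clip at 0."""
    N, T = X.shape
    mean_hat = unstack(hsvt(stacked_page(X, L), k), N, T, L)
    second_hat = unstack(hsvt(stacked_page(X ** 2, L), k), N, T, L)
    return mean_hat, np.maximum(second_hat - mean_hat ** 2, 0.0)

# toy data: harmonic means with a slowly varying noise variance
rng = np.random.default_rng(0)
N, T = 3, 240
t = np.arange(T)
f = np.stack([np.cos(2 * np.pi * t / (12 * (n + 1))) for n in range(N)])
sigma = 0.3 + 0.2 * np.sin(2 * np.pi * t / 60) ** 2
X = f + sigma * rng.standard_normal((N, T))
mean_hat, var_hat = estimate_mean_and_variance(X, L=24, k=6)
```

The same "meta" structure admits any imputation subroutine in place of `hsvt` on the stacked Page matrix, which is why the text notes the approach also applies to methods such as [25].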
Estimating the time-varying variance parameter (a la GARCH-like models) is a key problem in time series analysis. Yet, to the best of our knowledge, there does not exist a theoretically grounded method which accurately estimates this variance parameter without making restrictive parametric assumptions. The challenge is that the time-varying variance is not directly observed, not even a noisy version of it. Hence, we believe an important algorithmic and theoretical contribution is the extension of mSSA we propose (see Section 3.1). We show it performs consistent imputation/forecasting of the time-varying variance parameter for a broad class of time series models, at rate (NT)^{-1/2} (see Section 4.3). Technically, we reduce the problem of estimating the time-varying variance parameter to that of accurately estimating the heteroscedastic variance parameters of a matrix, and furnish a simple two-step matrix estimation algorithm to do so. We note the simple "meta" variance estimation algorithm we propose allows for variance estimation using other time series methods that do imputation; e.g., in Appendix A.2 we use the method in [25] to do variance estimation via this approach. It could also be of independent interest in the matrix estimation literature to produce prediction intervals.

Online Variant of mSSA.
Traditionally in time series analysis, the metric of evaluation is either parameter estimation (where explicit parametric assumptions are made about the generating process) or the in-sample prediction error. For the online variant of mSSA, we establish that the average regret with respect to the in-sample forecasting error decays with N and T (see Corollary 4.1). To do so, we establish that an online variant of Principal Component Regression, a key step in mSSA, has sub-linear regret in an error-in-variables regression setting. This required extending the theory of online convex optimization (OCO) to deal with time-varying constraints, a variation not addressed by existing results on OCO. We believe that our regret bound is not tight (with respect to the scaling of T), and improving it is an important direction for future research.

(There seems to be a typo in Corollary 2 of [25] in applying Theorem 2: the square in the Frobenius-norm error is missing.)

Related Work. Given the ubiquity of multivariate time series analysis, we cannot possibly do justice to the entire literature. Hence, we focus on a few techniques that are most relevant to compare against, either theoretically or empirically.
SSA.
For a detailed analysis of the standard SSA method, please refer to [10]. The main steps of SSA are: Step 1 - create a Hankel matrix from the time series data; Step 2 - perform a Singular Value Decomposition (SVD) of it; Step 3 - group the singular values based on the user's belief about the model that generated the process; Step 4 - perform diagonal averaging to "Hankelize" the grouped rank-1 matrices outputted by the SVD, creating a set of time series; Step 5 - learn a linear model for each "Hankelized" time series for the purpose of forecasting. The theoretical analysis of this original SSA method has focused on proving that many univariate time series have a low-rank Hankel representation, and secondly on defining sufficient (asymptotic) conditions for when the singular values of the various time series components are separable (thereby justifying Step 3 of the method). In [2], the authors extend the class of time series considered to those with an approximate low-rank Hankel structure. Some subtle but core alterations to the SSA procedure were made. In Step 1, the Page matrix rather than the Hankel matrix is used, which allows independence assumptions on the noise to hold. In Steps 2-3, instead of computing the SVD of the matrix and grouping the singular values, a single threshold is picked for the singular values (i.e., simply performing HSVT on the induced Page matrix). In Step 4, subsequently, a single linear forecasting model is learnt rather than a separate linear forecaster for each grouped time series. Under this setting, they perform a finite-sample analysis of this variant of SSA. However, the analysis in [2] falls short, even for the univariate case, as they do not explicitly show consistency of the SSA estimator for forecasting.

mSSA. mSSA is the natural extension of SSA for multivariate time series data and has been employed in a variety of applications with some empirical success (see [13, 12, 14]).
However, theoretical analysis (and even practical guidance) regarding when and why it works is severely limited.
Matrix Based Multivariate Time Series Methods.
There is a recent line of work in time series analysis (see [22, 25]) where multiple time series are viewed collectively as a matrix and some form of matrix factorization is performed. Most such methods make strong prior model assumptions on the underlying time series, and the algorithm changes based on the assumed generating model. Further, finite-sample analysis of such methods is usually lacking. We highlight one method, Temporal Regularized Matrix Factorization (TRMF) (see [25]), which we directly compare against due to its popularity and because it performs both imputation and forecasting. In [25], the authors provide a finite-sample imputation analysis for an instance of the model considered in this work. In addition to the model being restrictive, the imputation error scales as 1/min(N, T), which is weaker than our imputation error of 1/√(NT). For example, for N = Θ(1), their error bound remains Θ(1) for any T, while ours decays with T as 1/√T.

Other Relevant Time Series Methods.
Recently, with the advent of deep learning, neural network (NN) based approaches have become the most popular and empirically effective. Some industry-standard neural network methods include LSTMs from the Keras library (a standard NN library, see [6]) and DeepAR (an industry-leading NN library for time series analysis, see [16]).
Time-Varying Variance Estimation.
The time-varying variance is a key input parameter in many sequential prediction/decision-making algorithms. For example, in control systems, the widely used Kalman filter uses an estimate of the per-step variance for both filtering and smoothing. Similarly, in finance, the time-varying variance of each financial instrument in a portfolio is necessary for risk adjustment. The key challenge in estimating the variance of a time series (which itself might very well be time-varying) is that, unlike the time series itself, we do not get to observe it directly (nor even a noisy version of it). Despite the vast time series literature, existing algorithms to estimate time-varying variance are mostly heuristics and/or make restrictive, parametric assumptions about how the variance (and the underlying mean) evolves (e.g., ARCH/GARCH models); see [5]. Hence, provable finite-sample guarantees for these previous methods are highly restricted.
Panel Data Models - Econometrics.
In a traditional panel data factor model in econometrics, for n ∈ [N] and t ∈ Z, X_n(t) = Σ_{r=1}^R U_{nr} W_{tr} + η_{nt}, where U_{n,·}, W_{t,·} ∈ R^R for some fixed R ≥ 1. The R-dimensional vectors U_{n,·}, W_{t,·} are referred to as the low-dimensional "factor loadings" associated with "space" and time respectively, and η_{nt} is the independent noise per measurement. Despite the ubiquity of such factor models for multivariate time series data, the time series dynamics associated with the latent factors W_{tr} are not explicitly modeled, a possible additional structure that can be further exploited. In the model proposed in this work, with an interest towards analyzing mSSA, we extend this standard factor model by considering a time series model class for each latent time series factor, W_{·,r} ∈ R^T for r ∈ [R]. In particular, we consider time series model classes where the Hankel matrix induced by each W_{·,r} has an approximate low-rank structure (see Definition 2.1). This (approximately) low-rank Hankel model class is inspired by and extends the SSA literature (see [10, 2]). We show that this approximate low-rank Hankel representation includes many important time series dynamics, such as any finite sum and product of: harmonics; polynomials; any differentiable periodic function; any time series with a sufficiently "smooth" non-linear latent variable model representation (see Section 2.2). In short, the spatio-temporal factor model we propose can be thought of as a synthesis of two standard time series models: (1) a low-rank factor model, traditionally used to analyze multivariate time series data in the econometrics literature; (2) approximately low-rank Hankel models for the "time factors", a representation traditionally used to analyze univariate time series data in the SSA literature. Details of the model we propose can be found in Section 2.
This section is organized into three parts: (i) Section 2.1 describes the spatio-temporal factor model we use; (ii) Section 2.2 describes the family of time series dynamics we consider, specifically for the latent factors associated with time; (iii) Section 2.3 shows, theoretically and empirically, that the stacked Page (and Hankel) matrix induced as a result of the core data transformation of mSSA is (approximately) low-rank.
Define M^f ∈ R^{N × T} as M^f_n(t) = f_n(t). We use the following spatio-temporal factor model.

Property 2.1.
Let M^f satisfy M^f_n(t) = Σ_{r=1}^R U^f_{nr} W^f_r(t), where W^f_r(·) : Z → R, |U^f_{nr}| ≤ Γ1, |W^f_r(·)| ≤ Γ2.

Interpretation. In words, there are R "fundamental" time series denoted W^f_r(·) : Z → R, r ∈ [R], and for each n ∈ [N], M^f_n(·) is obtained through a weighted (by U^f_{nr}) combination of these R time series. Γ1, Γ2 are standard boundedness assumptions on the underlying latent time-varying means. In the econometrics literature, particularly to model panel data, the R-dimensional vectors U_{n,·}, W_{t,·} are referred to as the low-dimensional "factor loadings" associated with "space" and time respectively, and η_{nt} is the independent noise per measurement. Despite the ubiquity of such factor models in time series analysis, the time series dynamics associated with the latent factors W_{tr} are not explicitly modeled. This is exactly what a spatio-temporal model aims to address. We specify and motivate the class of time series models we consider for W^f_r(·) below.

Time Series Model Class Considered: (G, ε)-Hankel Representable Time Series. We now describe the approximately low-rank Hankel model we consider for the latent time series factors, W^f_r(·).

Definition 2.1.
For any T ≥ 1, let the Hankel matrix H^f ∈ R^{T × T} induced by a time series f be defined as H^f_{ij} := f(i + j - 1), i, j ∈ [T]. A time series f is said to be (G, ε)-Hankel representable if there exists H^{f(lr)} ∈ R^{T × T} such that (i) rank(H^{f(lr)}) ≤ G; (ii) ‖H^f - H^{f(lr)}‖_max ≤ ε.

Property 2.2.
For r ∈ [R], W^f_r(·) is (G^f_r, ε^f_r)-Hankel representable, where G^f_r ≤ G^{(1)}_max and |ε^f_r| ≤ ε^{(1)}_max. In Section 2.2, we establish that a large class of time series models admit such a (G, ε)-Hankel representation.

Spatio-Temporal Model for Latent Multivariate Time-Varying Noise Variance.
Let σ_n : Z → R represent the time-varying variances, i.e., σ_n^2(t) = E[η_n^2(t)] for (n, t) ∈ [N] × Z. We denote the collection of time-varying variances by σ. Analogous to M^f, the latent time-varying variance is described through M^σ ∈ R^{N × T}, which captures the spatial and temporal structure in the data.

Property 2.3.
Let M^σ satisfy M^σ_n(t) = Σ_{r=1}^R U^σ_{nr} W^σ_r(t), where W^σ_r(·) : Z → R, |U^σ_{nr}| ≤ Γ1, |W^σ_r(·)| ≤ Γ2.

Property 2.4.
For r ∈ [R], W^σ_r(·) is (G^σ_r, ε^σ_r)-Hankel representable, where G^σ_r ≤ G^{(2)}_max and |ε^σ_r| ≤ ε^{(2)}_max.

Noisy, Sparse Observation Model.
We now state assumptions on how the observations of M^f are corrupted by noise and missingness.

Property 2.5. We observe Z^X ∈ R^{N × T}, where for (n, t) ∈ [N] × [T] and ρ ∈ (0, 1], we have Z^X_n(t) = X_n(t) with probability ρ and ⋆ with probability 1 - ρ. Define Z^{X²} ∈ R^{N × T} as Z^{X²}_n(t) = (Z^X_n(t))².

Property 2.6. The η_n(t) are independent mean-zero sub-gaussian random variables such that ‖η_n(t)‖_{ψ2} ≤ γ.

2.2 (G, ε)-Hankel Representable Time Series Dynamics

In this section, we show that this approximate low-rank Hankel representation includes many important, standard time series dynamics considered in the literature, including: (i) any finite sum of harmonics, polynomials and exponentials (Proposition 2.3); (ii) any differentiable periodic function (Proposition 2.4), a heavily utilized time series model in signal processing; (iii) any time series with a Hölder continuous non-linear latent variable model Hankel representation (Proposition 2.5). Furthermore, we show the class of (G, ε)-Hankel representable time series is closed under component-wise addition and multiplication (Proposition 2.1). Importantly, we establish that the more measurements we collect (i.e., as T grows), the approximation error for all these time series dynamics decays to zero.

2.2.1 (G, ε)-Hankel Time Series "Algebra"

Sums of Time Series. For two time series f1 and f2, let f1 + f2 represent the time series induced by adding the values of f1 and f2 component-wise, i.e., (f1 + f2)(t) = f1(t) + f2(t) for all t ∈ Z.
Products of Time Series. For two time series f1 and f2, let f1 ∘ f2 represent the time series induced by multiplying the values of f1 and f2 component-wise, i.e., (f1 ∘ f2)(t) = f1(t) × f2(t) for all t ∈ Z. In the following proposition, we establish that (G, ε)-Hankel representable time series are "closed" under component-wise sums and products.

Proposition 2.1.
Let f1 and f2 be two time series which are (G1, ε1)-Hankel and (G2, ε2)-Hankel representable, respectively. Then f1 + f2 is (G1 + G2, ε1 + ε2)-Hankel representable. And f1 ∘ f2 is (G1 G2, max(ε1, ε2) · max(‖f1‖_∞, ‖f2‖_∞))-Hankel representable. See Appendix B.1 for the proof of Proposition 2.1.

2.2.2 (G, ε)-LRF Time Series

Definition 2.2 ((G, ε)-LRF). A time series f is said to be a (G, ε)-Linear Recurrent Formula (LRF) if for all T ∈ Z and t ∈ [T], f(t) = f′(t) + ε(t), where: (i) f′(t) = Σ_{l=1}^G α_l f′(t - l); (ii) |ε(t)| ≤ ε.

We now establish that (G, ε)-LRFs are (G, ε)-Hankel representable (see Appendix B.2 for the proof).

Proposition 2.2. If f is a time series that is (G, ε)-LRF representable, then it is (G, ε)-Hankel representable.
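Definition 2.1 and the closure properties of Proposition 2.1 can be checked numerically in the exact (ε = 0) case; the harmonic frequencies and matrix size below are arbitrary choices for the sketch.

```python
import numpy as np

def hankel(f, T=60):
    """H_ij = f(i + j - 1), i, j in [T] (Definition 2.1, 1-indexed)."""
    t = np.arange(1, T + 1)
    return f(t[:, None] + t[None, :] - 1)

f1 = lambda t: np.cos(2 * np.pi * t / 5)    # a single harmonic: rank-2 Hankel
f2 = lambda t: np.cos(2 * np.pi * t / 11)

r1 = np.linalg.matrix_rank(hankel(f1))                             # 2
r2 = np.linalg.matrix_rank(hankel(f2))                             # 2
r_sum = np.linalg.matrix_rank(hankel(lambda t: f1(t) + f2(t)))     # <= r1 + r2
r_prod = np.linalg.matrix_rank(hankel(lambda t: f1(t) * f2(t)))    # <= r1 * r2
```

Here both bounds are attained: the sum of two distinct harmonics has Hankel rank 4, and the product (by the product-to-sum identity) is again a sum of two harmonics, so its Hankel rank is also 4.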
Finite Product of Harmonics, Polynomials and Exponentials. LRFs cover a broad class of time series, including any finite sum of products of harmonics, polynomials and exponentials. The order G of the LRF induced by such a time series is quantified through the following proposition.

Proposition 2.3 (Proposition 5.2 in [2]).
Let P_{m_a} denote a polynomial of degree m_a. Then,

f(t) = Σ_{a=1}^A exp(α_a t) · cos(2π ω_a t + φ_a) · P_{m_a}(t)

is (G, 0)-LRF representable, where G ≤ A(m_max + 1)(m_max + 2) and m_max = max_{a ∈ [A]} m_a.

Remark 2.1.
We remark, importantly, that given a time series of K harmonics, the order of the LRF induced by it is independent of the periods of the K harmonics. Rather, the order of the LRF scales only with the number of harmonics.

2.2.3 "Smooth", Periodic Time Series - C^k_{T-per}
Here we establish, perhaps surprisingly, that a broad class of time series, essentially any periodic function that is sufficiently "smooth" (i.e., in C^k_{R-per} as in Definition 2.3), has a (G, ε)-LRF representation. Intuitively, this is possible because, by Fourier analysis, it is well-established that such functions are well-approximated by finite sums of harmonics, which are themselves LRFs (by Proposition 2.3).

Definition 2.3 (C^k_{R-per}-smoothness). A time series f is said to be in C^k_{R-per} if f is R-periodic (i.e., f(t + R) = f(t)) and the k-th derivative of f, denoted f^(k), exists and is continuous.

Proposition 2.4.
Let f be a time series such that f ∈ C^k_{R-per}. Then for any G ∈ N, f is

(G, C(k, R) ‖f^(k)‖_2 / G^{k - 1/2})-Hankel representable.

Here C(k, R) is a term that depends only on k and R, and ‖f^(k)‖_2^2 = (1/R) ∫_0^R (f^(k)(t))^2 dt. See Appendix B.3 for the proof of Proposition 2.4.
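Proposition 2.4 predicts that a smooth periodic function, even one that is not a finite sum of harmonics, admits a rapidly improving low-rank Hankel approximation as G grows. A quick numerical check of this singular-value decay; the test function exp(cos(·)) and the tail index are arbitrary choices for the sketch.

```python
import numpy as np

T = 100
t = np.arange(1, 2 * T)                       # values f(1), ..., f(2T - 1)
f = np.exp(np.cos(2 * np.pi * t / 24))        # smooth, 24-periodic, infinitely many harmonics
H = f[np.arange(T)[:, None] + np.arange(T)[None, :]]   # H_ij = f(i + j - 1)
s = np.linalg.svd(H, compute_uv=False)
# Relative size of the tail singular values: the best rank-G approximation
# error collapses quickly because the Fourier coefficients of f decay factorially.
rel_tail = s[12] / s[0]
```

For this function the 13th singular value is already several orders of magnitude below the largest, so a small G suffices for a (G, ε)-Hankel representation with tiny ε.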
2.2.4 Time Series with Latent Variable Model Structure

We now show that if a time series has a Latent Variable Model (LVM) representation, and the latent function is Hölder continuous, then it has a (G, ε)-Hankel representation. We first define the Hölder class of functions; note this class of functions is widely adopted in the non-parametric regression literature [20]. Given a function g : [0, 1]^K → R and a multi-index κ, let the partial derivative of g at x ∈ [0, 1]^K (if it exists) be denoted ∇_κ g(x) = ∂^{|κ|} g(x) / (∂x)^κ.

Definition 2.4 ((α, L)-Hölder Class). Let α, L be two positive numbers. The Hölder class H(α, L) on [0, 1]^K is defined as the set of functions g : [0, 1]^K → R whose partial derivatives satisfy, for all x, x′ ∈ [0, 1]^K,

Σ_{κ : |κ| = ⌊α⌋} (1 / κ!) |∇_κ g(x) - ∇_κ g(x′)| ≤ L ‖x - x′‖_∞^{α - ⌊α⌋}.    (1)

Here ⌊α⌋ refers to the greatest integer strictly smaller than α.

Remark 2.2.
Note that if α ∈ (0, 1], then (1) is equivalent to the (α, L)-Lipschitz condition: for all x, x′ ∈ [0, 1]^K, |g(x) - g(x′)| ≤ L ‖x - x′‖_∞^{α - ⌊α⌋}. However, for α > 1, (α, L)-Hölder smoothness no longer implies (α, L)-Lipschitz smoothness.

Property 2.7 (Time Series with (α, L)-Hölder Smooth LVM Representation). Let H^f ∈ R^{T × T} be the Hankel matrix induced by a time series f; recall H^f_{t,s} := f(t + s - 1). A time series f is said to have an (α, L)-Hölder smooth LVM representation if the Hankel matrix H^f has the representation H^f_{t,s} = g(θ_t, ω_s), where θ_t, ω_s ∈ [0, 1]^K are latent parameters. Moreover, for all ω_s, g(·, ω_s) ∈ H(α, L) as defined in (1). As stated earlier, the domain of the latent parameters θ_t, ω_s in Property 2.7 is easily extended to any compact subset of R^K by appropriate rescaling.

Remark 2.3.
Note that a $(G, 0)$-Hankel representable time series, denoted by $f$, has the following representation (see Proposition C.2): $H^f_{ij} = (a^{(i)})^T b^{(j)}$ for some latent vectors $a^{(i)}, b^{(j)} \in \mathbb{R}^G$. Hence, a $(G, 0)$-Hankel representable time series is an instance of a time series that satisfies Property 2.7 for all $\alpha \in \mathbb{N}$ and $L = C$, for some absolute positive constant $C$. One can thus think of time series that satisfy Property 2.7 as generalizations where the Hankel entries are (sufficiently smooth) non-linear functions of the latent factors, instead of just inner products of them (as one would have in a low-rank Hankel matrix).

Proposition 2.5.
Let a time series $f$ satisfy Property 2.7 with parameters $\alpha, L$. Then for all $\epsilon > 0$, $f$ is $\big(C(\alpha, K)\,(1/\epsilon)^K,\; L\epsilon^\alpha\big)$-Hankel representable. Here $C(\alpha, K)$ is a term that depends only on $\alpha$ and $K$. See Appendix B.4 for the proof of Proposition 2.5.
The notation below is a formal description of the stacked Page and Hankel matrices one obtains by performing the core mSSA data transformation (i.e., concatenating the induced matrices column-wise). The only difference between the two is that the stacked Hankel matrix contains overlapping columns (while the stacked Page matrix utilized in the proposed variant of mSSA does not). For a multivariate time series $g$ with $N$ time series and $T$ observations, let $L \in \mathbb{N}$ be a hyper-parameter and let $P = \lfloor T/L \rfloor$. For simplicity, assume $T/L = \lfloor T/L \rfloor$, i.e., $L \times P = T$. Let $\bar P := N \times P$. Let $M^g \in \mathbb{R}^{L \times \bar P}$ and $H^g \in \mathbb{R}^{L \times (N \times (T - L + 1))}$ be the stacked Page and Hankel matrices induced by $g$ (with hyper-parameter $L \ge 1$); for $l \in [L]$, $k_1 \in [P]$, $k_2 \in [T - L + 1]$, $n \in [N]$,
$$M^g_{l,\, k_1 + P(n-1)} = g_n\big(l + (k_1 - 1)L\big), \qquad H^g_{l,\, k_2 + (T - L + 1)(n-1)} = g_n(l + k_2 - 1).$$
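Concretely, the stacked Page matrix construction above can be sketched in a few lines of NumPy (a minimal sketch; the function name and interface are ours, not from the paper):

```python
import numpy as np

def stacked_page_matrix(X, L):
    """Build the stacked Page matrix of a multivariate time series.

    X : array of shape (N, T), i.e., N time series with T observations each.
    L : number of rows of the Page matrix (for simplicity, assume L divides T).
    Returns an L x (N * T // L) matrix: each series is cut into
    non-overlapping length-L segments, which become columns; the per-series
    Page matrices are then concatenated column-wise.
    """
    N, T = X.shape
    assert T % L == 0, "for simplicity, assume L divides T"
    P = T // L
    # Reshape each series into P columns of length L, then stack over series.
    pages = [X[n].reshape(P, L).T for n in range(N)]  # each is L x P
    return np.concatenate(pages, axis=1)              # L x (N * P)
```

For instance, with $N = 2$, $T = 6$ and $L = 3$, each series contributes $P = 2$ non-overlapping columns, giving a $3 \times 4$ stacked Page matrix.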
Theoretical Justification For Stacked Hankel Matrix Transformation. Proposition 2.6 shows that the stacked Page and Hankel matrices, the core data structures of interest in mSSA, have an approximate low-rank representation under our proposed model. This approximate low-rank structure of $H^f$ (and $H^\sigma$) is what crucially allows us to connect the analysis of mSSA to recent advances in matrix estimation and high-dimensional statistics. See Appendix C for a proof of Proposition 2.6.

Proposition 2.6.
Let Properties 2.1 and 2.2 hold. Let $M^f, H^f$ be defined w.r.t. $f$. Then there exists a matrix $H^{f(lr)}$ such that $\mathrm{rank}(H^{f(lr)}) \le R_1 G^{(1)}_{\max}$ and $\|H^f - H^{f(lr)}\|_{\max} \le R_1 \epsilon^{(1)}_{\max} \Gamma_1$. As an immediate consequence, there exists $M^{f(lr)}$ such that $\mathrm{rank}(M^{f(lr)}) \le R_1 G^{(1)}_{\max}$ and $\|M^f - M^{f(lr)}\|_{\max} \le R_1 \epsilon^{(1)}_{\max} \Gamma_1$.

Corollary 2.1.
Let Properties 2.3 and 2.4 hold. Let $M^\sigma, H^\sigma$ be defined w.r.t. $\sigma$. Then there exists a matrix $H^{\sigma(lr)}$ such that $\mathrm{rank}(H^{\sigma(lr)}) \le R_2 G^{(2)}_{\max}$ and $\|H^\sigma - H^{\sigma(lr)}\|_{\max} \le R_2 \epsilon^{(2)}_{\max} \Gamma_2$. As an immediate consequence, there exists $M^{\sigma(lr)}$ such that $\mathrm{rank}(M^{\sigma(lr)}) \le R_2 G^{(2)}_{\max}$ and $\|M^\sigma - M^{\sigma(lr)}\|_{\max} \le R_2 \epsilon^{(2)}_{\max} \Gamma_2$.

Interpretation. We see that the (approximate) rank of the stacked Hankel matrix scales no faster than the product of the rank of the factor model, $R$, and the maximum (approximate) rank, $G^{(1)}_{\max}$, of the Hankel matrices induced by the latent factors $W^f_r(\cdot)$, for $r \in [R]$. Crucially, the rank scales neither with the number of time series, $N$, nor with the number of measurements, $T$. Indeed, we see in Table 3 that across standard benchmark datasets, the (approximate) rank of the stacked Hankel matrix scales very slowly with the number of time series, $N$, and the number of measurements, $T$. Hence these standard multivariate time series datasets seem to fit within our proposed model.

Table 3: Across standard benchmarks, the effective rank of the stacked Hankel matrix scales slowly with the number of time series. Effective rank is defined as the number of singular values needed to capture a given fraction of the spectral energy (each cell reports it at two thresholds).

Dataset     | N = 1 | N = 10 | N = 100 | N = 350
Electricity | 19/24 | 37/43  | 44/60   | 31/52
Financial   | 1/1   | 3/3    | 3/4     | 6/9
Traffic     | 14/16 | 32/65  | 69/224  | 116/296
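The effective-rank metric reported in Table 3 can be computed directly from the singular values of the stacked Hankel (or Page) matrix; a minimal sketch, where the helper name and the energy threshold parameter are ours:

```python
import numpy as np

def effective_rank(M, energy=0.9):
    """Number of singular values needed to capture an `energy` fraction
    of the squared spectral energy of M (a hypothetical helper mirroring
    the metric reported in Table 3)."""
    s = np.linalg.svd(M, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)   # cumulative spectral energy
    return int(np.searchsorted(cum, energy) + 1)
```

For an exactly rank-1 matrix this returns 1 regardless of the threshold, while for a matrix with a flat spectrum it approaches the full rank.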
Some Necessary Notation. Recall from Section 2 that we assume access to observations $Z^X_n(1:T) \in \mathbb{R}^T$ for $n \in [N]$. Additionally, let $\hat\rho$ denote the fraction of observed entries of $Z^X$, i.e., $\hat\rho := \frac{1}{NT}\big(\sum_{(n,t) \in [N]\times[T]} \mathbf{1}(Z^X_n(t) \neq \star)\big) \vee \frac{1}{NT}$. For a matrix $A \in \mathbb{R}^{L \times \bar P}$, let $A_L \in \mathbb{R}^{\bar P}$ refer to the last row of $A$ and let $\bar A \in \mathbb{R}^{(L-1)\times\bar P}$ refer to the sub-matrix induced by retaining its first $L-1$ rows.

Imputation.
The key steps of mean imputation are as follows.
1. (Form Page Matrix) Transform $X_1(1:T), \dots, X_N(1:T)$ into a stacked Page matrix $Z^X \in \mathbb{R}^{L \times \bar P}$ with $L \le \bar P$. Fill all missing entries in the matrix with $0$.
2. (Singular Value Thresholding) Let the SVD of $Z^X$ be $USV^T$, where $U \in \mathbb{R}^{L\times L}$, $V \in \mathbb{R}^{\bar P \times L}$ contain the left and right singular vectors, and $S = \mathrm{diag}(s_1, \dots, s_L)$ is the diagonal matrix of singular values $s_1 \ge \dots \ge s_L \ge 0$. Obtain $\widehat M^{f(\text{Impute})} = U S_k V^T$, where $S_k = \mathrm{diag}(s_1, \dots, s_k, 0, \dots, 0)$ for some $k \in [L]$.
3. (Output) $\hat f_n(i + (j-1)L) := \widehat M^{f(\text{Impute})}_{i,\, j + P(n-1)}$, for $i \in [L]$, $j \in [P]$, $n \in [N]$.
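The imputation steps above (zero-filling followed by hard singular value thresholding) can be sketched as follows; `svt_impute` is our hypothetical name, and the number of retained singular values `k` is assumed given:

```python
import numpy as np

def svt_impute(Z, k):
    """Mean imputation by hard singular value thresholding (a sketch of
    Steps 1-2 above). Z is the stacked Page matrix, with missing entries
    marked as np.nan."""
    Z = np.where(np.isnan(Z), 0.0, Z)          # Step 1: fill missing with 0
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s[k:] = 0.0                                # Step 2: keep top-k singular values
    return (U * s) @ Vt                        # de-noised estimate of M^f
```

The returned matrix is the de-noised Page matrix; the Output step simply reads individual entries back off it.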
Forecasting. Forecasting includes an additional step of fitting a linear model on the de-noised matrix.
1. (Form Sub-Matrices) Let $\bar Z^X \in \mathbb{R}^{(L-1)\times\bar P}$ be the sub-matrix of $Z^X$ obtained by removing its last row, $Z^X_L$.
2. (Singular Value Thresholding) Let the SVD of $\bar Z^X$ be $\bar U \bar S \bar V^T$, where $\bar U \in \mathbb{R}^{(L-1)\times(L-1)}$, $\bar V \in \mathbb{R}^{\bar P\times(L-1)}$ contain the left and right singular vectors, and $\bar S = \mathrm{diag}(\bar s_1, \dots, \bar s_{L-1})$ is the diagonal matrix of singular values $\bar s_1 \ge \dots \ge \bar s_{L-1} \ge 0$. Obtain $\widehat{\bar M}^f = \bar U \bar S_k \bar V^T$ by setting all but the top $k$ singular values to $0$, i.e., $\bar S_k = \mathrm{diag}(\bar s_1, \dots, \bar s_k, 0, \dots, 0)$ for some $k \in [L-1]$.
3. (Linear Regression) $\hat\beta \in \arg\min_{b \in \mathbb{R}^{L-1}} \|Z^X_L - (\widehat{\bar M}^f)^T b\|_2^2$.
4. (Output) For $s \in \{L, 2L, \dots\}$ and $n \in [N]$, $\hat f_n(s) := X_n(s-L+1 : s-1)^T \hat\beta$.
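The forecasting steps above amount to principal component regression on the de-noised sub-matrix; a minimal sketch (the interface is ours, and the matrix is assumed fully observed for simplicity):

```python
import numpy as np

def pcr_forecast_model(Z, k):
    """Fit the forecasting linear model (a sketch of Steps 1-3 above).

    Z : L x Pbar stacked Page matrix.
    Returns beta regressing the last row of Z on the SVT-denoised
    first L-1 rows, i.e., principal component regression.
    """
    Z_bar, z_last = Z[:-1, :], Z[-1, :]        # Step 1: split off last row
    U, s, Vt = np.linalg.svd(Z_bar, full_matrices=False)
    s[k:] = 0.0                                # Step 2: keep top-k singular values
    M_bar = (U * s) @ Vt                       # de-noised covariates
    beta, *_ = np.linalg.lstsq(M_bar.T, z_last, rcond=None)  # Step 3
    return beta

# Step 4 (Output): a forecast is the inner product of beta with the
# previous L-1 observations of the series being predicted.
```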
Variance Estimation Algorithm. For variance estimation, the mean estimation algorithm is run twice: once on $Z^X$ and once on $Z^{X^2}$ (the entry-wise square). To estimate the time-varying variance (for both forecasting and imputation), a simple post-processing step is done where the square of the estimate produced from running the algorithm on $Z^X$ is subtracted from the estimate produced from $Z^{X^2}$. These steps can be viewed as a "meta"-algorithm for variance estimation. Indeed, in Appendix A.2, to compare the performance of mSSA for variance estimation with other algorithms, we utilize the method in [25] for variance estimation via this "meta"-algorithm.

Imputation. Below are the steps for variance imputation using mSSA.
1. (Impute $Z^X$, $Z^{X^2}$) Use the mean imputation algorithm on $X_1(1:T), \dots, X_N(1:T)$ (i.e., $Z^X$) and on $X_1^2(1:T), \dots, X_N^2(1:T)$ (i.e., $Z^{X^2}$), to produce the de-noised Page matrices $\widehat M^f$ and $\widehat M^{f^2+\sigma^2}$, respectively.
2. (Output) Construct $\widehat M^{f^2} \in \mathbb{R}^{L\times\bar P}$, where $\widehat M^{f^2}_{ij} := (\widehat M^f_{ij})^2$, and produce estimates $\hat\sigma^2_n(i + (j-1)L) := \widehat M^{f^2+\sigma^2}_{i,\, j+P(n-1)} - \widehat M^{f^2}_{i,\, j+P(n-1)}$, for $i \in [L]$, $j \in [P]$.

Forecasting. Like mean forecasting, it involves an additional step after imputation.
1. (Forecast with $\bar Z^X, Z^X_L, \bar Z^{X^2}, Z^{X^2}_L$) Use the mean forecasting algorithm on $X_1(1:T), \dots, X_N(1:T)$ and on $X_1^2(1:T), \dots, X_N^2(1:T)$ to produce forecast estimates $\hat f_n(T+1)$ and $\widehat{f_n^2 + \sigma_n^2}(T+1)$, respectively.
2. (Output) Produce the variance estimate $\hat\sigma^2_n(T+1) := \widehat{f_n^2+\sigma_n^2}(T+1) - (\hat f_n(T+1))^2$.

The aim of this section is to formally tie mSSA to an OCO setting. First, we briefly describe the setup and dynamics of OCO relevant to us. At each step $t \in \mathbb{N}$, we receive a convex set $\Omega_t$ and choose an element in it, denoted $\beta_t \in \Omega_t$. Subsequent to choosing $\beta_t$, we receive a cost function $c_t$ and incur cost $c_t(\beta_t)$.
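Returning briefly to the variance-estimation meta-algorithm described above: it reduces to two SVT passes plus an entry-wise subtraction. A minimal sketch (function names ours):

```python
import numpy as np

def variance_impute(Z, k1, k2):
    """Variance imputation meta-algorithm (a sketch): run mean imputation
    on Z and on Z**2, then subtract the squared mean estimate.
    k1, k2 are the (hypothetical) numbers of singular values kept for
    each of the two runs."""
    def svt(A, k):
        A = np.where(np.isnan(A), 0.0, A)
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        s[k:] = 0.0
        return (U * s) @ Vt

    f_hat = svt(Z, k1)          # estimate of the mean f
    f2s_hat = svt(Z**2, k2)     # estimate of f^2 + sigma^2
    return f2s_hat - f_hat**2   # estimate of the variance sigma^2
```

The same subtraction applied to forecasts (rather than imputed entries) gives the variance forecasting variant.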
Note the key difference from the traditional OCO setup is that the convex set $\Omega_t$ varies over time, a setting that has received limited attention in the online learning literature.

Necessary Notation for online-mSSA.
We begin by introducing some notation. Let $g$ be a multivariate time series comprising $N$ time series. For any $t \in \mathbb{N}$, let $Z^g(t) \in \mathbb{R}^{N\times t}$ refer to the first $t$ (noisy, sparse) observations of $g$. For each time step $t \in \mathbb{N}$, we assume we get access to the observations of the various time series in sequence. Let $(t, n) \in \mathbb{N}\times[N]$ be a double index denoting the current time step, $t$, and the number of time series $n \in [N]$ we have observed at this time step thus far. Note that the total number of observations by index $(t, n)$ equals $N(t-1) + n$. Let the stacked Page matrix induced by the current observations be denoted $Z^g(t,n) \in \mathbb{R}^{L \times \lfloor (N(t-1)+n)/L \rfloor}$. Let $[Z^g(t,n)]_L$ refer to the last row of the stacked Page matrix and let $\bar Z^g(t,n)$ be the sub-matrix of $Z^g(t,n)$ obtained by removing the last row, $[Z^g(t,n)]_L$. For any $k \in [L-1]$, let $\bar U_k(t,n) \in \mathbb{R}^{(L-1)\times k}$ refer to the first $k$ left singular vectors of $\bar Z^g(t,n)$. Let $\Omega_k(t,n)$ refer to the linear subspace spanned by $\bar U_k(t,n)$.

mSSA (i.e., PCR) as Regularized Linear Regression. For $(t,n) \in \mathbb{N}\times[N]$, if we ran the forecasting algorithm in Section 3.1 with all available data, define $\beta^*(t,n)$ as the resulting linear model (i.e., $\hat\beta$ from Step 3 of the forecasting algorithm in Section 3.1). For any fixed $k \in [L-1]$, where $k$ is the number of singular vectors of $\bar Z^g(t,n)$ retained, it can be verified that $\beta^*(t,n)$ is the solution of
$$\beta^*(t,n) = \arg\min_{b \in \Omega_k(t,n)} \big\|[Z^g(t,n)]_L - (\bar Z^g(t,n))^T b\big\|_2^2.$$

Interpretation.
The forecasting algorithm in Section 3.1 essentially performs PCR (see Proposition 3.1 of [3]). It is a standard result that PCR is a form of regularized linear regression, where the linear model $\beta^*(t,n)$ is constrained to lie in the subspace spanned by the first $k$ singular vectors (i.e., $\Omega_k(t,n)$) of the covariate matrix (i.e., $\bar Z^g(t,n)$). For Section 4, we denote by $\widehat M^{f(\text{Forecast})}_L \in \mathbb{R}^{\bar P}$ the vector with entries $\widehat M^{f(\text{Forecast})}_L(s + (n-1)P) := \hat f_n(s)$ for $n \in [N]$, $s \in [P]$.

OCO Dynamics for mSSA Setting.
We make the following assumptions: (i) we have access to $T$ data points for each of the $N$ time series before we start making one-step-ahead forecasts using mSSA; (ii) for each time series $n \in [N]$, we make a forecast for $H$ time steps, at points $\{T+L, T+2L, \dots, T+H\times L\}$; (iii) we assume $k$, the number of principal components we retain at each step, is also specified beforehand (hence, $\Omega_k(T+tL, n)$ for $t \in [H]$ is the induced convex set from which we pick the per-step linear model). We can now specify the OCO framework for our mSSA setting. For notational simplicity, we drop the dependence on $g$ below. For $n \in [N]$ and $t \in [H]$ (where $Z_n(T+tL)$ is the $(T+tL)$-th observation of the $n$-th time series):
1. Pick $\hat\beta(T+tL, n)$ from $\Omega_k(T+tL, n)$;
2. Incur cost $c_{(T+tL,n)}(\hat\beta(T+tL,n)) := \big[Z_n(T+tL) - \big(Z_n(T+(t-1)L+1 : T+tL-1)\big)^T \hat\beta(T+tL,n)\big]^2$.

online-mSSA. We utilize an online gradient descent variant of the mSSA algorithm: for $n \in [N]$,
1. Initialize $\hat\beta(T+L, n)$ and compute the cost $c_{(T+L,n)}(\hat\beta(T+L,n))$.
2. Update, for $t \in [1:H]$:
(i) $\tilde\beta(T+(t+1)L, n) = \hat\beta(T+tL, n) - \delta\, \nabla c_{(T+tL,n)}(\hat\beta(T+tL,n))$;
(ii) $\hat\beta(T+(t+1)L, n) = \arg\min_{w \in \Omega_k(T+(t+1)L, n)} \|w - \tilde\beta(T+(t+1)L, n)\|_2$.
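A single round of the online-mSSA update above, i.e., a gradient step on the squared forecasting loss followed by a Euclidean projection onto the current subspace $\Omega_k(t,n)$, can be sketched as follows (interface ours; the projection onto a subspace with orthonormal basis $U_k$ is $U_k U_k^T$):

```python
import numpy as np

def online_mssa_step(beta, x, y, U_k, delta):
    """One round of projected online gradient descent in online-mSSA
    (a sketch; the function name and interface are ours).

    beta : current linear model.
    x, y : covariates (the previous L-1 observations) and target.
    U_k  : orthonormal basis of the current constraint subspace
           Omega_k(t, n), i.e., the top-k left singular vectors of the
           running stacked Page matrix, recomputed each round.
    delta: step size.
    """
    grad = 2.0 * (x @ beta - y) * x          # gradient of the squared loss
    beta_tilde = beta - delta * grad         # gradient step
    return U_k @ (U_k.T @ beta_tilde)        # project onto span(U_k)
```

Note the projection target changes every round because `U_k` is recomputed from the growing Page matrix; this time-varying constraint set is exactly what separates this setting from standard projected OGD.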
Algorithm Intuition. In effect, we utilize the standard projected online gradient descent algorithm. As stated earlier, the key difference is that in our setting the domain $\Omega_k(t,n)$ changes at each time step. We give guidance on the choice of $\delta$ when instantiating the regret bound in Theorem 4.1. From Section 2.3, recall the definitions of $L, P, \bar P, M^g, M^g_L$ and $\bar M^g$. For an estimate $\widehat M^{g(\text{Impute})}$ of $M^g$, our imputation error metric is
$$\mathrm{MSE}(M^g, \widehat M^{g(\text{Impute})}) := \frac{1}{NT}\, \mathbb{E}\, \big\|M^g - \widehat M^{g(\text{Impute})}\big\|_F^2,$$
where the expectation is over the noise and $\|\cdot\|_F$ refers to the Frobenius norm.

Forecasting.
For forecasting, we evaluate mSSA through two error metrics: an "offline" benchmark (i.e., in-sample prediction error), and an online regret analysis (relative to this offline benchmark).
Offline benchmark (in-sample error).
Recall that $M^g_L \in \mathbb{R}^{\bar P}$ refers to the last row of the induced stacked Page matrix. The "offline" benchmark corresponds to the in-sample prediction error with respect to $M^g_L$; i.e., for an estimate $\widehat M^{g(\text{Forecast})}_L$, we define the error as
$$\mathrm{MSE}(M^g, \widehat M^{g(\text{Forecast})}) := \frac{1}{\bar P}\, \mathbb{E}\, \big\|M^g_L - \widehat M^{g(\text{Forecast})}_L\big\|_2^2.$$

Regret Metric.
Recall the definitions of $c_t(\cdot)$, $\beta^*(T+H\times L, N)$ and $\hat\beta(T+H\times L, N)$ from Section 3.2. $\hat\beta(T+tL, n)$ for $(n,t) \in [N]\times[H]$ are the estimates produced by online-mSSA. $\beta^*(T+H\times L, N)$ is the estimate produced if one had access to all $(T+H\times L)\times N$ data points; its error is exactly the in-sample prediction error, i.e., the offline benchmark defined above. Regret is thus defined as
$$\text{regret} := \sum_{n=1}^N \sum_{t=1}^H \big[c_t(\hat\beta(T+tL, n)) - c_t(\beta^*(T+H\times L, N))\big].$$
(We make forecasts only at multiples of $L$, i.e., $M^g_L$ contains entries of the latent time series at $\{L, 2L, \dots, T\}$. This is easily circumvented by creating $L$ different forecasting models, one for each offset $L+i$, $i \in \{0, 1, \dots, L-1\}$; for ease of exposition, we keep a single model. The corresponding algorithm and theoretical results remain identical. This restriction is likely a limitation of our analysis technique and is irrelevant in practice, as evidenced in our experiments.)

$M^{f(lr)}, M^{f^2+\sigma^2(lr)}$. Recall the definitions of $M^{f(lr)}$ and $M^{\sigma(lr)}$ from Proposition 2.6 and Corollary 2.1, respectively. Denote the rank of $M^{f(lr)}$ as $r^{(1)}$. For $i \in [r^{(1)}]$, let $\tau^{(1)}_i$ denote the $i$-th singular value of $M^{f(lr)}$, ordered by magnitude. Define $M^{f^2+\sigma^2(lr)}$ as the entry-wise addition of $M^{f^2(lr)}$ and $M^{\sigma(lr)}$, where $M^{f^2(lr)}$ is the entry-wise square of $M^{f(lr)}$. Denote the rank of $M^{f^2+\sigma^2(lr)}$ as $r^{(2)}$. For $i \in [r^{(2)}]$, let $\tau^{(2)}_i$ denote the $i$-th singular value of $M^{f^2+\sigma^2(lr)}$.

Property 4.1.
The non-zero singular values, $\tau^{(1)}_i$, of $M^{f(lr)}$ are well-balanced, i.e., $(\tau^{(1)}_i)^2 = \Theta(TN/r^{(1)})$. Similarly, the non-zero singular values, $\tau^{(2)}_i$, of $M^{f^2+\sigma^2(lr)}$ are well-balanced, i.e., $(\tau^{(2)}_i)^2 = \Theta(TN/r^{(2)})$.

Interpretation. A natural setting in which Property 4.1 holds is when the entries of $M^{f(lr)}$ are $\Theta(1)$ and its non-zero singular values satisfy $s_i = \Theta(\zeta)$ for some $\zeta$. Then $C r^{(1)} \zeta^2 = \|M^{f(lr)}\|_F^2 = \Theta(TN)$ for some constant $C$, i.e., $(\tau^{(1)}_i)^2 = \Theta(TN/r^{(1)})$. An identical argument applies to $M^{f^2+\sigma^2(lr)}$. Further, see Proposition 4.2 of [3] for another canonical example of when Property 4.1 holds.

Theorem 4.1 (mSSA Mean Estimation: Imputation). Let Properties 2.1, 2.2, 2.5, 2.6 and 4.1 hold. Let $\rho \ge C\log(L\bar P)/(L\bar P)$ for an absolute constant $C > 0$, $L \le \bar P$, and $\mathrm{rank}(\widehat M^f) = r^{(1)}$. Let $C$ be a term that depends polynomially on $\Gamma_1, \Gamma_2$. Then,
$$\mathrm{MSE}(M^f, \widehat M^{f(\text{Impute})}) \le \frac{C\, \gamma^2\, R\, G^{(1)}_{\max}}{\rho^2} \left(\frac{1}{\sqrt{TN}} + \big(\epsilon^{(1)}_{\max}\big)^2\right) \log(\bar P).$$

Theorem 4.2 (mSSA Mean Estimation: Forecasting). Let the conditions of Theorem 4.1 hold. Then,
$$\mathrm{MSE}(M^f, \widehat M^{f(\text{Forecast})}) \le \frac{C\, \gamma^2\, R\, \big(G^{(1)}_{\max}\big)^2}{\rho^2} \left(\frac{1}{\sqrt{TN}} + \big(\epsilon^{(1)}_{\max}\big)^2\right) \log(\bar P).$$

Interpretation.
The mSSA imputation and (in-sample) forecasting error bounds in Theorems 4.1 and 4.2 scale as $O\big((1/\sqrt{NT}) + (\epsilon^{(1)}_{\max})^2\big)$. In both results the bound is dominated by the error introduced in Step 2 of the respective algorithms, i.e., the SVT step in Section 3.1. It is straightforward to verify that for the time series dynamics listed in Section 2.2, the approximation error term vanishes as we collect more data. The assumption $\mathrm{rank}(\widehat M^f) = r^{(1)}$ requires that the number of principal components is chosen correctly, i.e., equal to the (approximate) rank of the stacked Page matrix. In Theorem H.1 in Appendix H, we generalize the above results to the setting where the number of principal components is misspecified, i.e., $\mathrm{rank}(\widehat M^f) \neq r^{(1)}$.

Technical Innovations.
The key technical challenge in proving Theorem 4.1 is establishing that SVT achieves prediction consistency when the covariate matrix is well-approximated in an entry-wise sense by a low-rank matrix (see Lemma F.1). [3] establishes the effectiveness of PCR only when the covariate matrix is well-approximated in operator norm by a low-rank matrix; hence, the existing bound in [3] on the operator norm error could diverge even when the entry-wise error is diminishing. The key technical challenge in proving Theorem 4.2 is establishing that an (approximate) linear relationship between $\bar M^f$ and $M^f_L$ exists (see Proposition H.1). This is what motivates learning a linear model in Step 3 of the forecasting algorithm in Section 3.1.

Theorem 4.3 (mSSA Variance Estimation: Imputation). Let the conditions of Theorem 4.1 hold. Let Properties 2.3 and 2.4 hold and $\mathrm{rank}(\widehat M^{f^2+\sigma^2}) = r^{(2)}$. Then,
$$\mathrm{MSE}(M^\sigma, \widehat M^{\sigma(\text{Impute})}) \le \frac{C\, \gamma^2\, R_1 R_2\, \big(G^{(1)}_{\max}\big)^2 \big(G^{(2)}_{\max}\big)^2}{\rho^2} \left(\frac{1}{\sqrt{TN}} + \big(\epsilon^{(1)}_{\max}\big)^2 + \big(\epsilon^{(2)}_{\max}\big)^2\right) \log(\bar P).$$

Theorem 4.4 (mSSA Variance Estimation: Forecasting). Let the conditions of Theorem 4.3 hold. Then,
$$\mathrm{MSE}(M^\sigma, \widehat M^{\sigma(\text{Forecast})}) \le \frac{C\, \gamma^2\, R_1 R_2\, \big(G^{(1)}_{\max}\big)^2 \big(G^{(2)}_{\max}\big)^2}{\rho^2} \left(\frac{1}{\sqrt{TN}} + \big(\epsilon^{(1)}_{\max}\big)^2 + \big(\epsilon^{(2)}_{\max}\big)^2\right) \log(\bar P).$$

Interpretation.
The variance estimation bounds inherit the bounds for mean estimation, plus an extra term that arises from the error in estimating $M^{f^2+\sigma^2}$. This also leads to a higher polynomial dependence on the model parameters $R_1, R_2, G^{(1)}_{\max}, G^{(2)}_{\max}$.

We follow the same notation as in Section 3.2; in particular, for the latent time series $f$ and index $(t,n) \in \mathbb{N}\times[N]$, recall the definitions of $Z^f(t)$, $Z^f(t,n)$, $\bar Z^f(t,n)$, $[Z^f(t,n)]_L$ and $\Omega_k(t,n)$ for $k \in [L-1]$. Let $v_1(t,n), \dots, v_k(t,n)$ be an orthonormal basis of $\Omega_k(t,n)$, and let $v_{k+1}(t,n), \dots, v_{L-1}(t,n)$ be additional orthonormal vectors completing a basis of $\mathbb{R}^{L-1}$. Let $Z(t,n,j) := [\bar Z^f(t,n)]_{\cdot,j}$. Define $a(t,n,j) \in \mathbb{R}^{L-1}$ via $Z(t,n,j) = \sum_{i=1}^{L-1} a_i(t,n,j)\, v_i(t,n)$. Further, we define $M^f(t,n)$, $\bar M^f(t,n)$, $[M^f(t,n)]_L$ and $\bar\Omega_k(t,n)$, for $k \in [L-1]$, analogously to the above, but now with respect to the underlying latent mean of $f$. Let $\tau_1(t,n), \dots, \tau_k(t,n)$ be the first $k$ singular values of $\bar M^f(t,n)$. For a linear subspace $\Omega$, let $\mathcal{P}_\Omega$ be the associated projection operator (or matrix).

Theorem 4.5 (Online mSSA: Regret Analysis). Recall $T = L\times P$. Let Properties 2.1, 2.2, 2.5 and 2.6 hold. Let $C$ be a term that depends only polynomially on $\Gamma_1, \Gamma_2, R, G^{(1)}_{\max}$ and $\gamma$. Further, for $n \in [N]$, $t \in [H+1]$, for some $\epsilon_1, \epsilon_2 \in (0,1)$ and some $k \in [L-1]$, let the following hold:
A.1 $\mathcal{P}_{\bar\Omega_k(T+tL,n)} = \mathcal{P}_{\bar\Omega_k(T+HL,N)}$;
A.2 $\tau_k(T+tL,n) = \Omega\big(\sqrt{LNP/k}\big)$, $\tau_{k+1}(T+tL,n) = 0$;
A.3 $L = P^{1-\epsilon_1}$, $H = P^{1-\epsilon_2}$;
A.4 $\|\hat\beta(T+tL,n)\|_2 \le C\,k$, for $\hat\beta(T+tL,n) \in \Omega_k(t,n)$;
A.5 $\max_{i,j} |a_i(T+tL,n,j)| \le C\sqrt{L/k}$;
A.6 $|Z_n(T+tL)| \le C$.
Picking $\delta = L\sqrt{N/H} + \sqrt{H/P}$, we get that with probability at least $1 - 1/(NT)$,
$$\frac{1}{NH}\,\text{regret} \le C\, k^{3/2}\, \sqrt{\frac{P^{-\epsilon_1+\epsilon_2}}{N} + P^{-\epsilon_2 - 0.5\epsilon_1}}.$$

Corollary 4.1.
Let the conditions of Theorem 4.5 hold. Pick $\epsilon_1, \epsilon_2$ to be appropriate absolute constants (chosen to balance the two terms in the bound of Theorem 4.5). Then, with probability at least $1 - 1/(NT)$,
$$\frac{1}{NH}\,\text{regret} \le \frac{C\, k^{3/2}}{(NT)^{c}}$$
for an absolute constant $c > 0$; i.e., the average regret vanishes as $N$ and $T$ grow.

Justification of Assumptions.
A.1 This is the key assumption. In essence, it requires that $T$ is large enough that the latent $k$-dimensional subspaces of the stacked Page matrix induced by the time series $f$, i.e., $\bar M^f(t,n)$, are no longer varying. An interpretation of this assumption is that, for matrix-factorization based time series algorithms such as mSSA, it serves as an analog of the i.i.d. assumption on the underlying generating process made in classical generalization error analysis (e.g., Rademacher complexity analysis).
A.2 This is analogous to Property 4.1 holding.
A.3 This requires that the number of steps, $H$, for which we do "online" forecasting grows sub-linearly in the number of observations during the "offline" phase of mSSA; the same goes for $L$, which is intuitively the allowed model complexity.
A.4 This is a standard boundedness assumption, i.e., the linear model should not scale faster than the dimension of the subspace we project onto.
A.5 This ensures consistency with Assumption A.2, i.e., the coefficients $a(\cdot)$ we get after projection onto $\bar\Omega_k(T+tL,n)$ are well-balanced.
A.6 This can easily be verified to hold with high probability using standard concentration inequalities for sub-Gaussian random variables.

Technical Innovation.
Note the key technical difficulty of our setting is that the subspaces $\Omega_k(T+tL,n)$ vary over time (due to noise in the observations), a setting not studied closely in the online learning literature.

Conclusion
In this work, we provide theoretical justification for mSSA, a method heavily used in practice but with limited theoretical understanding. Using a spatio-temporal factor model, we establish that the finite-sample error for imputation and forecasting scales as $1/\sqrt{NT}$ for $N$ time series with $T$ observations each. The key technical contributions are: (a) establishing that the stacked Page matrix, the core data representation in mSSA, is approximately low-rank for a large model class; (b) advancing the finite-sample analysis of PCR to obtain tight results for mSSA; (c) furnishing a time-varying variance estimation algorithm with provable guarantees; (d) introducing an online learning framework for quantifying the 'generalization' error in time series analysis; (e) extending the theory of OCO to allow for a time-varying constraint set. As an important direction for future research, it is worth exploring the sample-complexity gains from a variant of mSSA that utilizes the higher-dimensional 'tensor' structure induced by the Page matrices of multiple time series, rather than "stacking" them into a larger matrix.

References

[1] A. Agarwal, A. Alomar, and D. Shah. TSPS: A real-time time series prediction system.
Working paper, 2020.
[2] A. Agarwal, M. J. Amjad, D. Shah, and D. Shen. Model agnostic time series analysis via matrix estimation.
Proceedings of the ACM on Measurement and Analysis of Computing Systems , 2(3):40, 2018.[3] A. Agarwal, D. Shah, D. Shen, and D. Song. On robustness of principal component regression. In
Advances in Neural Information Processing Systems , pages 9889–9900, 2019.[4] S. Bernstein.
The Theory of Probabilities . Gastehizdat Publishing House, 1946.[5] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity.
Journal of Econometrics, 31(3):307–327, 1986.
[6] F. Chollet. keras. https://github.com/fchollet/keras, 2015.
[7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III.
SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[8] Facebook. Prophet. https://facebook.github.io/prophet/. Online; accessed 25 February 2020.
[9] M. Gavish and D. L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
[10] N. Golyandina, V. Nekrutkin, and A. A. Zhigljavsky.
Analysis of time series structure: SSA and relatedtechniques . Chapman and Hall/CRC, 2001.[11] L. Grafakos.
Classical Fourier Analysis, volume 2. Springer, 2008.
[12] H. Hassani, S. Heravi, and A. Zhigljavsky. Forecasting UK industrial production with multivariate singular spectrum analysis.
Journal of Forecasting , 32(5):395–408, 2013.[13] H. Hassani and R. Mahmoudvand. Multivariate singular spectrum analysis: A general view and newvector forecasting approach.
International Journal of Energy and Statistics , 1(01):55–83, 2013.[14] V. Oropeza and M. Sacchi. Simultaneous seismic data denoising and reconstruction via multichannelsingular spectrum analysis.
Geophysics , 76(3):V25–V32, 2011.[15] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon. Collaborative filtering with graph information:Consistency and scalable methods. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, andR. Garnett, editors,
Advances in Neural Information Processing Systems 28 , pages 2107–2115. CurranAssociates, Inc., 2015.[16] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. Deepar: Probabilistic forecasting withautoregressive recurrent networks.
International Journal of Forecasting , 2019.[17] R. Sen, H.-F. Yu, and I. S. Dhillon. Think globally, act locally: A deep neural network approach tohigh-dimensional time series forecasting. In
Advances in Neural Information Processing Systems ,pages 4838–4847, 2019.[18] A. Trindade. UCI machine learning repository - individual household electric power consumption data set.[19] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprintarXiv:1011.3027 , 2010.[20] L. Wasserman.
All of nonparametric statistics . Springer, 2006.[21] P.- ˚A. Wedin. Perturbation bounds in connection with singular value decomposition.
BIT Numerical Mathematics, 12(1):99–111, 1972.
[22] K. W. Wilson, B. Raj, and P. Smaragdis. Regularized non-negative matrix factorization with temporal dependencies for speech denoising. In
Ninth Annual Conference of the International SpeechCommunication Association , 2008.[23] WRDS. The trade and quote (taq) database.[24] J. Xu. Rates of convergence of spectral methods for graphon estimation. arXiv preprintarXiv:1709.03183 , 2017.[25] H.-F. Yu, N. Rao, and I. S. Dhillon. Temporal regularized matrix factorization for high-dimensionaltime series prediction. In
Advances in Neural Information Processing Systems, pages 847–855, 2016.

Overview of Appendix.
The supplementary material to the main body of this work primarily provides detailed technical proofs ofthe results stated in the main body.Section A provides details of the experiments conducted to generate the empirical results reported in Table2 in the main body. Sections B and C provide supporting technical proofs for the fundamental representationresult that forms the basis of our results.Sections D, E and F provide supporting notations and results that are utilized in Section G to provide proofof Imputation analysis and in Section H to provide proof of forecasting analysis (for both mean and variance).Finally, Section I provides proofs associated with regret analysis.
A Experiments
In this section, we detail the experimental setup and the datasets used to produce the results in Table 2. Specifically, Section A.1 describes the mean imputation and forecasting experiments in detail. Section A.2 describes the variance imputation and forecasting experiments. Finally, Section A.3 provides details about the parameters and implementations of all algorithms used in the experiments. Note that in all experiments, Normalized Root Mean Squared Error (NRMSE) is used as the accuracy metric. In particular, each time series is normalized to have zero mean and unit variance before calculating the RMSE. We use this metric as it weighs the error on each time series equally.
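The NRMSE metric described above can be sketched as follows (a hypothetical helper mirroring the description; each series is standardized by its own mean and standard deviation before the RMSE is computed):

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Normalized RMSE: standardize by the true series' mean and standard
    deviation before computing the RMSE, so every series is weighted
    equally regardless of its scale."""
    mu, sd = np.mean(y_true), np.std(y_true)
    return np.sqrt(np.mean(((y_true - mu) / sd - (y_pred - mu) / sd) ** 2))
```

Equivalently, this is the ordinary RMSE divided by the standard deviation of the true series.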
A.1 Mean Estimation

A.1.1 Datasets.
In the mean estimation experiments, we use three real-world datasets and one synthetic dataset. The descriptions of, and the preprocessing performed on, these four datasets are as follows:
Electricity Dataset.
This is a public dataset obtained from the UCI repository, which contains the 15-minute electricity load of 370 households ([18]). As was done in [25], [17], [16], we aggregate the data into hourly intervals and use the first 25968 time points for training. The goal here is to produce 24-hour-ahead forecasts for the next seven days (i.e., 24-step-ahead forecasts).
Traffic Dataset.
This public dataset, obtained from the UCI repository, contains the occupancy rates of traffic lanes in San Francisco ([18]). The data is sampled every 15 minutes, but to be consistent with previous work in [25], [17], we aggregate the data into hourly intervals and use the first 10392 time points for training. The goal here is to produce 24-hour-ahead forecasts for the next seven days (i.e., 24-step-ahead forecasts).
Financial Dataset.
This dataset is obtained from Wharton Research Data Services (WRDS) and contains the average daily stock prices of 839 companies from October 2004 to November 2019 [23]. The dataset was preprocessed to remove stocks with any null values, or those with an average price below $30 across the aforementioned period. This was done simply to constrain the number of time series for ease of experimentation, and we end up with 839 time series (i.e., stock prices of listed companies), each with 3993 daily readings. For forecasting, we consider predicting 180 time points ahead, one point at a time, training on the first 3813 time points. The goal here is to produce one-day-ahead forecasts for the next 180 days (i.e., 1-step-ahead forecasts); we choose to do so as this is a standard goal in finance.
Synthetic Dataset.
We generate the observation tensor $X \in \mathbb{R}^{n\times m\times T}$ by first randomly generating the two vectors $U = [u_1, \dots, u_n] \in \mathbb{R}^n$ and $V = [v_1, \dots, v_m] \in \mathbb{R}^m$. Then, we generate $r$ mixtures of harmonics, where each mixture $g_k(t)$, $k \in [r]$, is generated as $g_k(t) = \sum_h \alpha_h \cos(\omega_h t / T)$, with the parameters selected randomly such that $\alpha_h \in [-1, 1]$ and $\omega_h$ is drawn from a fixed range starting at 1. Then each value in the observation tensor is constructed as
$$X_{i,j}(t) = \sum_{k=1}^r u_i v_j g_k(t),$$
where $r$ is the tensor rank, $i \in [n]$, $j \in [m]$. In our experiment, we select $n = 20$, $m = 20$, $T = 15000$, and $r = 4$. This gives us 400 time series, each with 15000 observations. In the forecasting experiments, we use the first 14000 points for training. The goal here is to do 10-step-ahead forecasts for the final 1000 points.

Figure 1: mSSA vs. TRMF vs. SSA: imputation performance on standard multivariate time series benchmarks and synthetic data. Figures 1 (a)-(d) show the imputation accuracy of mSSA, TRMF and SSA as we vary the fraction of missing values; Figures 1 (e)-(h) show the imputation accuracy as we vary the noise level (with 50% of values missing).
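The synthetic generation procedure above can be sketched as follows; the number of harmonics per mixture and the upper bound on the frequencies $\omega_h$ are assumptions, since the exact values are not fully recoverable from the text:

```python
import numpy as np

def make_synthetic(n=20, m=20, T=15000, r=4, n_harmonics=3,
                   omega_max=1000.0, seed=0):
    """Generate the synthetic low-rank observation tensor described above
    (a sketch: `n_harmonics` and `omega_max` are assumed values)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(-1, 1, size=n)
    V = rng.uniform(-1, 1, size=m)
    t = np.arange(T)
    # r mixtures of harmonics: g_k(t) = sum_h alpha_h * cos(omega_h * t / T)
    alpha = rng.uniform(-1, 1, size=(r, n_harmonics))
    omega = rng.uniform(1, omega_max, size=(r, n_harmonics))
    G = np.einsum('kh,kht->kt', alpha, np.cos(omega[:, :, None] * t / T))
    # X[i, j, t] = sum_k u_i * v_j * g_k(t)  (rank-r multilinear structure)
    return np.einsum('i,j,kt->ijt', U, V, G)
```

By construction every fiber $X_{i,j}(\cdot)$ is a scalar multiple of the same mixture of harmonics, which is what makes the induced Page matrices low-rank.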
A.1.2 Mean Imputation.

Setup.
We test the robustness of imputation performance by adding two sources of corruption to the data: varying the percentage of observed values, and varying the amount of noise perturbing the observations. We evaluate imputation performance by how accurately missing values are recovered. We compare the performance of mSSA with TRMF, a method that achieves state-of-the-art imputation performance. Further, we compare with the SSA variant introduced in [2]. See details about the implementations of, and parameters used for, TRMF and SSA in Section A.3.
Results.
Figures 1 (a)-(d) show the imputation error on the aforementioned datasets as we vary the fraction of missing values, while Figures 1 (e)-(h) show the imputation error as we vary σ, the standard deviation of the Gaussian noise. As we vary the fraction of missing values and noise levels, mSSA outperforms both TRMF and SSA in ∼80% of the experiments run. The average NRMSE across all experiments for each dataset is reported in Table 2, where mSSA outperforms every other method across all datasets.
A.1.3 Mean Forecasting.

Setup.
We test the forecasting accuracy of the proposed mSSA variant against several state-of-the-art algorithms. For each dataset, we split the data into training and testing sets as outlined in Section A.1.1. As in the imputation experiments, we vary the conditions of the datasets by varying the percentage of observed values and the noise levels. We compare against several methods, namely: (i) SSA, specifically the variant introduced in [2]; (ii) LSTM, available through the Keras library (a standard deep learning library); (iii) DeepAR, a state-of-the-art deep learning method for multivariate time series [16]; (iv) TRMF, a matrix factorization approach with temporal regularization that has gained popularity recently [25]; (v) Prophet, Facebook's time series forecasting library [8]. Refer to Section A.3 for details about the implementations of these algorithms and the selected hyper-parameters.
Results.
Figures 2 (a)-(d) show the forecasting accuracy of mSSA versus these methods on the aforementioned datasets as we vary the fraction of missing values, while Figures 2 (e)-(h) show the forecasting accuracy as we vary the standard deviation of the added Gaussian noise. We see that as we vary the fraction of missing values and the noise level, mSSA is the best performing method in ∼40% of the experiments. In terms of the average NRMSE across all experiments, we find that mSSA outperforms every other method across all datasets except for the traffic dataset (see Table 2).
Figure 2: mSSA forecasting performance on standard multivariate time series benchmarks is competitive with, or outperforms, industry-standard methods as we vary the amount of missing data and the noise level. Figures 2 (a)-(d) show the forecasting accuracy of all methods on the four datasets with a varying fraction of missing values; Figures 2 (e)-(h) show the forecasting accuracy on the same four datasets with varying noise levels.
A.2 Variance Estimation

A.2.1 Datasets
Since we do not directly observe the variance in real-world data (indeed, this is a key reason why effective time-varying variance estimation algorithms are limited in the literature), we restrict the experiments in this section to synthetically generated data. In particular, with k ∈ {1, 2, 3, 4}, we generate three different sets of time series as follows: (i) four harmonics, g^har_k(t) = Σ_i α_i cos(ω_i t / T), where the α_i and ω_i are sampled uniformly at random from fixed intervals; (ii) four AR processes, g^AR_k(t) = Σ_i φ_i g^AR_k(t − i) + ε(t), where ε(t) is zero-mean Gaussian noise and the φ_i are sampled uniformly at random; (iii) four trends, g^trend_k(t) = η t, where η is sampled uniformly at random from a fixed interval. We then sample two random vectors U = [u_1, ..., u_m] and V = [v_1, ..., v_m] from a uniform distribution. Using these two vectors, we generate three tensors as follows: (i) a mixture of harmonics: F_{i,j}(t) = Σ_{k=1}^4 u_i v_j g^har_k(t); (ii) a mixture of harmonics + trend: F_{i,j}(t) = Σ_{k=1}^4 u_i v_j (g^har_k(t) + g^trend_k(t)); (iii) a mixture of harmonics + trend + AR: F_{i,j}(t) = Σ_{k=1}^4 u_i v_j (g^trend_k(t) + g^har_k(t) + g^AR_k(t)). For each tensor, we normalize its values to lie between 0 and 1 and then use three observation models as follows: (i) Gaussian: we generate three observation tensors where each observation is defined as X_{i,j}(t) = F_{i,j}(t) + ε(t), with ε(t) ∼ N(0, F_{i,j}(t)); (ii) Bernoulli: we generate three observation tensors where each observation X_{i,j}(t) is drawn from a Bernoulli distribution with mean F_{i,j}(t), i.e., X_{i,j}(t) ∼ Bernoulli(F_{i,j}(t)); (iii) Poisson: we generate three observation tensors where each observation X_{i,j}(t) is drawn from a Poisson distribution with mean F_{i,j}(t), i.e., X_{i,j}(t) ∼ Pois(F_{i,j}(t)). This gives us a total of nine observation tensors for our variance experiments.

A.2.2 Variance Imputation and Forecasting

Setup.
For both variance imputation and forecasting, we test the algorithms' performance as we vary the fraction of observed data, p, for each mixture/observation model combination. For forecasting, we forecast the last 1000 time steps of each time series using 1-step-ahead forecasts. For imputation, we compare with TRMF, which, as we stated earlier, achieves state-of-the-art results for the imputation of multivariate time series data [25]. While the algorithm does not explicitly estimate the variance, we adapt the method using the "meta"-algorithm described in Section 3.1. Refer to Section A.3 for details on how we use TRMF for variance imputation. For forecasting, we compare with DeepAR's native functionality for variance forecasting. Specifically, DeepAR uses a Bayesian prior and subsequently uses Monte Carlo sampling to sample time series trajectories from the posterior. Further, for both imputation and forecasting, we also compare with SSA [2], where we impute and forecast the variance of each time series, one at a time.

Results.
In Table 4 we see that for imputation, the variance imputation version of the mSSA algorithm outperforms both TRMF and SSA across all (except one) combinations of time series dynamics, fraction of observed values, and observation models. In particular, the ratio of TRMF's NRMSE to mSSA's ranges between [0.97, 2.15], and the ratio of SSA's NRMSE to mSSA's ranges between [1.04, 3.46].

Table 4: Variance imputation error in NRMSE
Observation Model | mSSA (p = 1.0 / 0.8 / 0.5) | TRMF (p = 1.0 / 0.8 / 0.5) | SSA (p = 1.0 / 0.8 / 0.5)
[table entries omitted]
In Table 5 we see that for forecasting, the variance forecasting version of the mSSA algorithm outperforms DeepAR and SSA in all cases. In particular, the ratio of DeepAR's NRMSE to mSSA's ranges between [1.01, 976.97], with significant improvements in the case of the Poisson observation model. Further, the ratio of SSA's NRMSE to mSSA's ranges between [1.03, 4.38]. In Table 2, we report the median NRMSE for each method across all experiments.

Table 5: Variance forecasting error in NRMSE
Observation Model | mSSA (p = 1.0 / 0.8 / 0.5) | DeepAR (p = 1.0 / 0.8 / 0.5) | SSA (p = 1.0 / 0.8 / 0.5)
[table entries omitted]
A.3 Algorithms Parameters and Settings
In this section, we describe in more detail the algorithms used throughout the experiments, as well as the hyper-parameters and implementation used for each method.

mSSA & SSA.
Since the SSA variant described in [2] is a special case of our proposed mSSA algorithm, we use our mSSA implementation to perform the SSA experiments, but without "stacking" the Page matrices induced by the various time series. For all experiments we choose k, the number of retained singular values, in a data-driven manner. Specifically, we choose k based on the thresholding procedure outlined in [9], where the threshold is determined by the median of the singular values and the shape of the matrix. Finally, as guided by our theoretical results, we choose L = 1000 for all experiments, except for the experiments on the financial dataset, where we choose L = 500.

DeepAR.
We use the default parameters of the "DeepAREstimator" algorithm provided by the GluonTS package.
LSTM.
Across all datasets, we use an LSTM network with three hidden layers, with 45 neurons per layer, as is done in [17]. We use the Keras implementation of LSTM.
Prophet.
We use Prophet's Python library with the default parameters [8].
TRMF.
We use the implementation provided by the authors in the GitHub repository associated with the paper [25]. For the parameter k, which represents the chosen rank of the T × N time series matrix, we use k = 60 for the electricity dataset, k = 40 for the traffic dataset, and k = 20 for the financial and synthetic datasets as well as the variance experiments. The lag indices include the last day and the same weekday in the last week for the traffic and electricity data. For the financial and synthetic datasets, the lag indices include the last 30 points. For variance estimation, we produce an estimate of the variance by performing TRMF twice, just as we do for mSSA. Specifically, we do the following for each n ∈ [N] (where N is the number of time series of interest): (i) impute the observations X_n(t) using TRMF to produce f̂^TRMF_n(t), an estimate of the underlying mean; (ii) impute the squared observations X_n(t)² using TRMF to produce the estimate (f²_n + σ²_n)^TRMF(t); (iii) produce the variance estimate (σ̂²_n)^TRMF(t) = (f²_n + σ²_n)^TRMF(t) − (f̂^TRMF_n(t))².

B Proofs - (G, ε)-Hankel Representability of Different Time Series Dynamics

B.1 Proof of Proposition 2.1
Proof.
Noting that for any two matrices A and B it holds that rank(A + B) ≤ rank(A) + rank(B) and rank(A ◦ B) ≤ rank(A) · rank(B), where ◦ denotes the Hadamard product, completes the proof.

B.2 Proof of Proposition 2.2
Proof.
Proof is immediate from Definitions 2.1 and 2.2.
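The two rank inequalities invoked in the proof of Proposition 2.1 above are easy to check numerically; the matrix sizes and ranks below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
# Random matrices of rank 2 and 3, respectively.
A = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
B = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))

rank = np.linalg.matrix_rank
assert rank(A + B) <= rank(A) + rank(B)   # subadditivity under addition
assert rank(A * B) <= rank(A) * rank(B)   # A * B is the Hadamard product in NumPy
```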
B.3 Proof of Proposition 2.4

B.3.1 Helper Lemmas for Proposition 2.4
We begin by stating some classic results from Fourier analysis. To do so, we introduce some notation.

C[0, R] and L²[0, R] functions. C[0, R] is the set of real-valued, continuous functions defined on [0, R]. L²[0, R] is the set of square-integrable functions, i.e., those with ∫₀^R f(t)² dt < ∞.

Inner product of functions in L²[0, R]. L²[0, R] is an inner product space endowed with the inner product ⟨f, g⟩ := (2/R) ∫₀^R f(t) g(t) dt and the associated norm ‖f‖ := ((2/R) ∫₀^R f(t)² dt)^{1/2}.

Fourier representation of functions in L²[0, R]. For a function f in L²[0, R], define the partial Fourier sum S_G as

S_G(t) = a_0 + Σ_{n=1}^G ( a_n cos(2πnt/R) + b_n sin(2πnt/R) ),  (2)

where a_0, a_n, b_n are the Fourier coefficients of f:

a_0 := (1/R) ∫₀^R f(t) dt,
a_n := ⟨f, cos(2πn·/R)⟩ = (2/R) ∫₀^R f(t) cos(2πnt/R) dt,
b_n := ⟨f, sin(2πn·/R)⟩ = (2/R) ∫₀^R f(t) sin(2πnt/R) dt.

We now state a classic result from Fourier analysis.
Theorem B.1 ([11]). If f ∈ C^k_{per-R} for k ≥ 1, then S_G converges pointwise to f, i.e., for all t ∈ R,

lim_{G→∞} S_G(t) = f(t).

We next show that if f is k-times differentiable, then its Fourier coefficients decay rapidly. Precisely,

Lemma B.1. If f ∈ C^k_{per-R} for k ≥ 1, then for k' ∈ [k] the Fourier coefficients of f^{(k')}, the k'-th derivative of f, satisfy

a^{(k')}_0 = 0,  a^{(k')}_n = (2πn/R) b^{(k'−1)}_n,  b^{(k')}_n = −(2πn/R) a^{(k'−1)}_n.

Proof. We show it for a^{(1)}_n; the extension to a^{(k')}_n for k' ∈ [k] follows by induction in a straightforward manner:

a^{(1)}_n = ⟨f^{(1)}, cos(2πnt/R)⟩ = (2/R) ∫₀^R f^{(1)}(t) cos(2πnt/R) dt
 =(a) (2/R) ( [f(t) cos(2πnt/R)]₀^R + (2πn/R) ∫₀^R f(t) sin(2πnt/R) dt ) = (2πn/R) b^{(0)}_n,

where (a) follows by integration by parts, and the boundary term vanishes by the R-periodicity of f. The identity for b^{(k')}_n follows in a similar fashion, as does that for a^{(k')}_0.

B.3.2 Completing Proof of Proposition 2.4
Proof.
For G ∈ N, let S_G be defined as in (2). Then for t ∈ R,

|f(t) − S_G(t)| =(a) | Σ_{n=G+1}^∞ ( a_n cos(2πnt/R) + b_n sin(2πnt/R) ) |
 ≤ Σ_{n=G+1}^∞ ( |a_n| + |b_n| )
 ≤(b) Σ_{n=G+1}^∞ (R/(2πn))^k ( |a^{(k)}_n| + |b^{(k)}_n| )
 ≤(c) √2 (R/(2π))^k ( Σ_{n=G+1}^∞ n^{−2k} )^{1/2} ( Σ_{n=G+1}^∞ ( |a^{(k)}_n|² + |b^{(k)}_n|² ) )^{1/2}
 ≤ √2 (R/(2π))^k G^{−(k−0.5)} ( Σ_{n=G+1}^∞ ( |a^{(k)}_n|² + |b^{(k)}_n|² ) )^{1/2}
 ≤(d) √2 (R/(2π))^k ‖f^{(k)}‖ / G^{k−0.5} =: C(k, R) ‖f^{(k)}‖ / G^{k−0.5},

where C(k, R) is a constant that depends only on k and R; (a) follows from Theorem B.1; (b) follows from Lemma B.1; (c) follows from the Cauchy-Schwarz inequality; (d) follows from Bessel's inequality. Hence S_G, a sum of G harmonics, gives a uniform approximation of f with error at most C(k, R)‖f^{(k)}‖/G^{k−0.5}. Noting that G harmonics can be represented by an order-G LRF (by Proposition 2.3) completes the proof.
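The uniform approximation of a smooth periodic function by its partial Fourier sum S_G, and the rapid decay of the error in G used above, can be observed numerically; the test function below is an arbitrary smooth R-periodic choice, and the integrals are discretized on a uniform grid:

```python
import numpy as np

R = 1.0
t = np.linspace(0.0, R, 4096, endpoint=False)
f = np.exp(np.sin(2 * np.pi * t / R))    # a smooth R-periodic test function

def partial_fourier_sum(f, t, R, G):
    # Numerical partial Fourier sum S_G as in (2), with coefficients
    # a_n, b_n computed by discretizing the defining integrals.
    S = np.full_like(t, f.mean())        # a_0 term
    for n in range(1, G + 1):
        c, s = np.cos(2 * np.pi * n * t / R), np.sin(2 * np.pi * n * t / R)
        S += 2 * np.mean(f * c) * c + 2 * np.mean(f * s) * s
    return S

errs = [np.max(np.abs(f - partial_fourier_sum(f, t, R, G))) for G in (2, 4, 8)]
assert errs[0] > errs[1] > errs[2]       # uniform error decays rapidly in G
```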
B.4 Proof of Proposition 2.5
This analysis is closely adapted from [24] and is stated for completeness.
Proof.
Step 1: Partitioning the space [0, 1]^K. Let E denote a partition of the cube [0, 1]^K into a finite number (denoted by |E|) of cubes ∆. Let ℓ ∈ N. We say P_{E,ℓ}: [0, 1]^K → R is a piecewise polynomial of degree ℓ if

P_{E,ℓ}(θ) = Σ_{∆∈E} P_{∆,ℓ}(θ) 1(θ ∈ ∆),  (3)

where P_{∆,ℓ}: [0, 1]^K → R denotes a polynomial of degree at most ℓ. It suffices to consider an equal partition of [0, 1]^K. More precisely, for any k ∈ N, we partition the set [0, 1] into k half-open intervals of length 1/k, i.e., [0, 1] ⊆ ∪_{i=1}^k [(i−1)/k, i/k). It follows that [0, 1]^K can be partitioned into k^K cubes of the form ⊗_{j=1}^K [(i_j−1)/k, i_j/k) with i_j ∈ [k]. Let E_k be such a partition, with I_1, I_2, ..., I_{k^K} denoting all such cubes and z_1, z_2, ..., z_{k^K} ∈ R^K denoting the centers of those cubes.

Step 2: Taylor expansion of g(·, ω_s). For Step 2 of the proof, to reduce notational overload, we suppress the dependence on ω_s and, abusing notation, write g(·) = g(·, ω_s). For every I_i with 1 ≤ i ≤ k^K, define P_{I_i,ℓ}(x) as the degree-ℓ Taylor series expansion of g(x) at the point z_i:

P_{I_i,ℓ}(x) = Σ_{κ: |κ|≤ℓ} (1/κ!) (x − z_i)^κ ∇^κ g(z_i),  (4)

where κ = (κ_1, ..., κ_K) is a multi-index with κ! = Π_{i=1}^K κ_i!, and ∇^κ g(z_i) is the partial derivative defined in Section 2.2.4. Note that, as with g, P_{I_i,ℓ}(x) really refers to P_{I_i,ℓ}(x, ω_s). Now we define a degree-ℓ piecewise polynomial as in (3), i.e.,

P_{E_k,ℓ}(x) = Σ_{i=1}^{k^K} P_{I_i,ℓ}(x) 1(x ∈ I_i).
(5)

For the remainder of the proof, let ℓ = ⌊α⌋ (recall that ⌊α⌋ refers to the largest integer strictly smaller than α). Since g ∈ H(α, L), it follows that

sup_{x∈X} |g(x) − P_{E_k,ℓ}(x)| = sup_{1≤i≤k^K} sup_{x∈I_i} |g(x) − P_{I_i,ℓ}(x)|
 =(a) sup_{1≤i≤k^K} sup_{x∈I_i} | Σ_{κ: |κ|=ℓ} ((∇^κ g(z'_i) − ∇^κ g(z_i))/κ!) (x − z_i)^κ |
 ≤(b) sup_{1≤i≤k^K} sup_{x∈I_i} ‖x − z_i‖^ℓ_∞ Σ_{κ: |κ|=ℓ} (1/κ!) | ∇^κ g(z'_i) − ∇^κ g(z_i) |
 ≤(c) L k^{−α},

where (a) follows from the multivariate version of Taylor's theorem (using the Lagrange form of the remainder), with z'_i ∈ [0, 1]^K a point on the segment joining z_i and x; (b) follows from Hölder's inequality; (c) follows from Property 2.7.
Step 3: Constructing a low-rank approximation of the time series Hankel matrix using P_{E_k,ℓ}(·, ω_s). Recall the Hankel matrix H ∈ R^{L×T} induced by the original time series, where H_{ts} = g(θ_t, ω_s) and g(·, ω_s) ∈ H(α, L). We now construct a low-rank approximation of it using P_{E_k,ℓ}(·, ω_s). Define H^{(lr)} ∈ R^{L×T} by [H^{(lr)}]_{t,s} = P_{E_k,ℓ}(θ_t, ω_s). By Step 2, we have that for all t ∈ [L], s ∈ [T],

| H_{ts} − [H^{(lr)}]_{t,s} | ≤ L k^{−α}.

It remains to bound the rank of H^{(lr)}. Note that since P_{E_k,ℓ}(θ_t, ω_s) is a piecewise polynomial of degree ℓ = ⌊α⌋, it has a decomposition of the form

[H^{(lr)}]_{t,s} = P_{E_k,ℓ}(θ_t, ω_s) = Σ_{i=1}^{k^K} ⟨Φ(θ_t), β_{I_i,s}⟩ 1(θ_t ∈ I_i),

where Φ(θ) = (1, θ_1, ..., θ_K, ..., θ_1^ℓ, ..., θ_K^ℓ)^T is the vector of all monomials of degree less than or equal to ℓ. The number of such monomials is easily shown to equal C(α, K) := Σ_{i=0}^{⌊α⌋} binom(i + K − 1, K − 1). Thus the rank of H^{(lr)} is bounded by k^K C(α, K). Choosing k so that the approximation error above is at most ε completes the proof.

C Proofs - Stacked Hankel Matrix is Approximately Low-Rank
C.1 Helper Lemmas

Lemma C.1.
Let Property 2.1 hold. Then ‖M^f‖_max ≤ R Γ₁ Γ₂.

Proof.
Proof is immediate.
Lemma C.2.
Let Property 2.3 hold. Then ‖M^σ‖_max ≤ R Γ₁ Γ₂.

Proof.
Proof is immediate.
Proposition C.1.
Let f be a (G, 0)-LRF. Then for s ∈ {1, ..., L} and t ∈ {0, L, ..., (P−1)L}, f admits the representation

f(s + t) = Σ_{g=1}^G α_g a_g(s) b_g(t)  (6)

for some scalars α_g and functions a_g: [L] → R and b_g: [P] → R.

Proof. Note that the Page matrix M^f corresponding to the time series f has rank at most G. Thus the singular value decomposition of M^f has the form M^f = Σ_{g=1}^G α_g a'_g b'^T_g, where the α_g are the singular values, and a'_g ∈ R^L, b'_g ∈ R^P are the left and right singular vectors of M^f, respectively. Thus M^f_{ij} has the form M^f_{ij} = f(i + (j−1)L) = Σ_{g=1}^G α_g a'_g(i) b'_g(j). Identifying a_g, b_g with a'_g, b'_g, respectively, completes the proof.

Proposition C.2.
Let f be a (G, 0)-Hankel. Then for s ∈ {1, ..., L} and t ∈ {0, L, ..., (P−1)L}, f admits the representation

f(s + t) = Σ_{g=1}^G α_g a_g(s) b_g(t)  (7)

for some scalars α_g and functions a_g: [L] → R and b_g: [P] → R.

Proof. Identical to that of Proposition C.1.
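The rank-G structure of the Page matrix underlying Propositions C.1 and C.2 can be verified numerically for a simple LRF; the frequencies and dimensions below are arbitrary:

```python
import numpy as np

T, L = 400, 20
t = np.arange(T)
# A sum of two harmonics is an order-4 LRF (Proposition 2.3).
f = np.cos(0.3 * t) + 0.5 * np.sin(1.1 * t)

P = T // L
M = f.reshape(P, L).T                   # Page matrix: M[i, j] = f(i + j*L)
assert np.linalg.matrix_rank(M, tol=1e-8) == 4
```

The SVD of M then yields exactly the factorization f(s + t) = Σ_g α_g a_g(s) b_g(t) used in the proof.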
C.2 Proof of Proposition 2.6
Proof.
To reduce notational complexity, we suppress the superscript f for the remainder of the proof. For l ∈ [L], k ∈ [T/L], n ∈ [N] we have

SP[l, k + (T/L)(n−1)] = H_n(l + (k−1)L) = Σ_{r=1}^R U_{nr} W_r(l + (k−1)L)
 =(a) Σ_{r=1}^R U_{nr} ( Σ_{g=1}^{G_r} α_{rg} a_{rg}(l) b_{rg}((k−1)L) + ε_r(l + (k−1)L) )
 = Σ_{r=1}^R Σ_{g=1}^{G_r} a_{rg}(l) [ U_{nr} · α_{rg} · b_{rg}((k−1)L) ] + Σ_{r=1}^R U_{nr} ε_r(l + (k−1)L),

where (a) follows directly from Property 2.2 and Proposition C.2; here ε_r(l + (k−1)L) = [H^{W_r} − H^{W_r(lr)}]_{(l,k)}. Further,

| Σ_{r=1}^R U_{nr} ε_r(l + (k−1)L) | ≤ ε^{(1)}_max Σ_{r=1}^R |U_{nr}| ≤ R ε^{(1)}_max Γ₁.

This completes the proof.
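The conclusion of Proposition 2.6 — that the column-wise stacked Page matrix of N series driven by R low-rank-representable latent series has rank bounded independently of N — can be illustrated as follows (the latent harmonics, loadings, and dimensions are arbitrary choices):

```python
import numpy as np

T, L, N = 400, 20, 6
t = np.arange(T)
rng = np.random.default_rng(2)

# R = 2 latent series, each a single harmonic (an order-2 LRF).
W = np.vstack([np.cos(0.3 * t), np.sin(0.7 * t)])
U = rng.standard_normal((N, 2))          # loadings of the N observed series
series = U @ W                           # series[n] = sum_r U[n, r] * W_r

P = T // L
stacked = np.hstack([series[n].reshape(P, L).T for n in range(N)])  # L x (N*P)
# Every column lies in a 4-dimensional subspace of R^L, regardless of N.
assert np.linalg.matrix_rank(stacked, tol=1e-8) <= 4
```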
D Imputation and Forecasting Error Analysis - Proof Notation
D.1 Induced Linear Operator
Consider a matrix B ∈ R^{N×p} with singular value decomposition B = Σ_{i=1}^{N∧p} σ_i(B) x_i y_i^T. Here, σ_i(B) are the singular values of B, and x_i, y_i are the left and right singular vectors, respectively.

Hard Singular Value Thresholding.
To that end, given any λ > 0, we define the map HSVT_λ: R^{N×p} → R^{N×p}, which simply shaves off the input matrix's singular values that are below the threshold λ. Precisely,

HSVT_λ(B) = Σ_{i=1}^{N∧p} σ_i(B) 1(σ_i(B) ≥ λ) x_i y_i^T.  (8)

Induced Linear Operator.
With a specific choice of λ ≥ 0, we can define a function ϕ^B_λ: R^p → R^p as follows: for any row vector w ∈ R^{1×p},

ϕ^B_λ(w) = Σ_{i=1}^{N∧p} 1(σ_i(B) ≥ λ) y_i y_i^T w^T.  (9)

Note that ϕ^B_λ is a linear operator and it depends on the tuple (B, λ); more precisely, on the singular values and right singular vectors of B, as well as the threshold λ. If λ = 0, then we adopt the shorthand notation ϕ^B = ϕ^B_0.

Lemma D.1 (Lemma 35 of [3]). Let B ∈ R^{N×p} and λ ≥ 0 be given. Then for any j ∈ [N],

ϕ^B_λ(B_{j,·}) = HSVT_λ(B)^T_{j,·},  (10)

where B_{j,·} ∈ R^{1×p} represents the j-th row of B, and HSVT_λ(B)_{j,·} ∈ R^{1×p} represents the j-th row of the matrix obtained after applying HSVT to B with threshold λ.

Proof. By (9), the orthonormality of the right singular vectors, and B^T_{j,·} = B^T e_j, with e_j ∈ R^N the vector with j-th entry 1 and all other entries 0, we have

ϕ^B_λ(B_{j,·}) = Σ_{i=1}^{N∧p} 1(σ_i(B) ≥ λ) y_i y_i^T B^T e_j
 = Σ_{i,i'=1}^{N∧p} σ_{i'}(B) 1(σ_i(B) ≥ λ) y_i y_i^T y_{i'} x^T_{i'} e_j
 = Σ_{i,i'=1}^{N∧p} σ_{i'}(B) 1(σ_i(B) ≥ λ) y_i δ_{ii'} x^T_{i'} e_j
 = Σ_{i=1}^{N∧p} σ_i(B) 1(σ_i(B) ≥ λ) y_i x^T_i e_j = HSVT_λ(B)^T e_j = HSVT_λ(B)^T_{j,·}.

E Concentration Inequalities Lemmas
E.0.1 Classic Results

Theorem E.1 (Bernstein's inequality, [4]).
Suppose that X_1, ..., X_n are independent random variables with zero mean, and M is a constant such that |X_i| ≤ M with probability one for each i. Let S := Σ_{i=1}^n X_i and v := Var(S). Then for any t ≥ 0,

P(|S| ≥ t) ≤ 2 exp( −3t² / (6v + 2Mt) ).

Theorem E.2 (Norm of matrices with sub-gaussian entries, [19]).
Let A be an m × n random matrix whose entries A_ij are independent, mean-zero, sub-gaussian random variables. Then, for any t > 0, we have

‖A‖ ≤ CK(√m + √n + t)

with probability at least 1 − 2exp(−t²). Here, K = max_{i,j} ‖A_ij‖_{ψ₂}.

Lemma E.1 (Maximum of a sequence of random variables, [19]).
Let X_1, X_2, ..., X_n be a sequence of random variables, which are not necessarily independent, and which satisfy E[|X_i|^p]^{1/p} ≤ K p^β for some K, β > 0 and all i, p. Then, for every n ≥ 2,

E max_{i≤n} |X_i| ≤ CK log^β(n).  (11)

Remark E.1.
Lemma E.1 implies that if X_1, ..., X_n are ψ_α random variables with ‖X_i‖_{ψ_α} ≤ K_α for all i ∈ [n], then

E max_{i≤n} |X_i| ≤ C K_α log^{1/α}(n).

E.0.2 High Probability Events for Imputation and Forecasting

Setup.
Let X be an L × P̄ random matrix (with L ≤ P̄) whose entries X_ij are independent sub-gaussian random variables with E[X] = M and ‖X_ij‖_{ψ₂} ≤ σ. Let Y denote the L × P̄ matrix whose entries Y_ij are defined as

Y_ij = X_ij with probability p, and Y_ij = 0 with probability 1 − p,

for some p ∈ (0, 1]. Let p̂ = max{ (1/(LP̄)) Σ_{i=1}^L Σ_{j=1}^{P̄} 1(X_ij observed), 1/(LP̄) }.

High Probability Events. Define events E₁ to E₅, for some positive absolute constant C, as

E₁ := { |p̂ − p| ≤ p/2 },  (12)
E₂ := { ‖Y − pM‖ ≤ Cσ√P̄ },  (13)
E₃ := { ‖Y − pM‖_{∞,2} ≤ Cσ√P̄ },  (14)
E₄ := { max_{j∈[L]} ‖ϕ^B_{λ_k}(Y_{j,·} − pM_{j,·})‖₂² ≤ Cσ²k log(P̄) },  (15)
E₅ := { (1 − √(log(LP̄)/(LP̄p))) p ≤ p̂ ≤ (1 − √(log(LP̄)/(LP̄p)))^{−1} p }.  (16)

Here, B ∈ R^{L×P̄} is an arbitrary matrix with singular value decomposition B = Σ_{i=1}^L σ_i(B) x_i y_i^T, where σ_i(B) are the singular values of B and x_i, y_i are the left and right singular vectors, respectively. Recall the definition of ϕ^B_{λ_k} in (9).

Lemma E.2.
For some positive constant c > 0,

P(E₁) ≥ 1 − 2e^{−cLP̄p} − (1 − p)^{LP̄},  (17)
P(E₂) ≥ 1 − 2e^{−P̄},  (18)
P(E₃) ≥ 1 − 2e^{−P̄},  (19)
P(E₄) ≥ 1 − 1/(LP̄),  (20)
P(E₅) ≥ 1 − 2/(LP̄).  (21)

Proof.
Bounding E₁. Let p̃ = (1/(LP̄)) Σ_{i=1}^L Σ_{j=1}^{P̄} 1(X_ij observed), which implies E[p̃] = p. We define the event Ẽ := {p̂ = p̃}. Thus, we have that

P(E₁^c) = P(E₁^c ∩ Ẽ) + P(E₁^c ∩ Ẽ^c) = P(|p̃ − p| ≥ p/2) + P(E₁^c ∩ Ẽ^c)
 ≤ P(|p̃ − p| ≥ p/2) + P(Ẽ^c) = P(|p̃ − p| ≥ p/2) + (1 − p)^{LP̄},

where the final equality follows from the independence of the observations and the fact that p̂ ≠ p̃ only if we have no observations at all. By Bernstein's inequality, we have that P(|p̃ − p| ≤ p/2) ≥ 1 − 2e^{−cLP̄p}.

Bounding E₂. Since E[Y_ij] = pM_ij, Theorem E.2 yields P(E₂) ≥ 1 − 2e^{−P̄}.

Bounding E₃. Observing that ‖Y_{j,·} − pM_{j,·}‖₂ ≤ ‖Y − pM‖ for every j, Theorem E.2 is sufficient to show (19).

Bounding E₄. Recall that y_i ∈ R^{P̄} is the i-th right singular vector of B. Then,

‖ϕ^B_{λ_k}(Y_{j,·} − pM_{j,·})‖₂² = Σ_{i=1}^k ‖y_i y_i^T (Y_{j,·} − pM_{j,·})^T‖₂² = Σ_{i=1}^k ( y_i^T (Y_{j,·} − pM_{j,·})^T )² = Σ_{i=1}^k Z_i²,

where Z_i = y_i^T (Y_{j,·} − pM_{j,·})^T. By the definition of the ψ₂ norm of a random variable, and since y_i is a unit-norm vector, it follows that ‖Z_i‖_{ψ₂} ≤ ‖Y_{j,·} − pM_{j,·}‖_{ψ₂}. The entries of Y_{j,·} − pM_{j,·} are mean-zero and independent, with ψ₂ norm bounded by √C σ for some absolute constant C > 0; using Lemma H.10 of [3], it follows that

P( Σ_{i=1}^k Z_i² > t ) ≤ 2k exp( −t/(kCσ²) ).  (22)

Therefore, for the choice t = Cσ²k log(P̄) (with a large enough constant C > 0, and since L ≤ P̄) and a union bound over j ∈ [L], we have that

P(E₄^c) ≤ 1/(LP̄).  (23)

Bounding E₅. Recall the definition of p̂. By the binomial Chernoff bound, for ε > 1,

P(p̂ > εp) ≤ exp( −((ε−1)²/(ε+1)) LP̄p ), and P(p̂ < p/ε) ≤ exp( −((ε−1)²/(2ε²)) LP̄p ).
By the union bound,

P( p/ε ≤ p̂ ≤ εp ) ≥ 1 − P(p̂ > εp) − P(p̂ < p/ε).

Noticing that (ε−1)²/(2ε²) ≤ (ε−1)²/(ε+1) for all ε > 1, and substituting ε = (1 − √(log(LP̄)/(LP̄p)))^{−1}, completes the proof.

Corollary E.1.
Let E := E₂ ∩ E₃. Then,

P(E^c) ≤ C₁ e^{−c₁P̄},  (24)

where C₁ and c₁ are positive constants independent of L and P̄.

Corollary E.2.
Let E := E₂ ∩ E₃ ∩ E₄ ∩ E₅. Then,

P(E^c) ≤ C₂/(LP̄),  (25)

where C₂ is an absolute positive constant, independent of L and P̄.

F HSVT Error
Lemma F.1.
Let M̄^f = M̄^{f(lr)} + E^{f(lr)}. Recall that for r ∈ [L−1], τ_r and µ_r denote the r-th singular value and right singular vector of M̄^{f(lr)}, respectively. Suppose that:

1. ‖Z̄^X − ρM̄^f‖ ≤ ∆ for some ∆ ≥ 0;
2. (1/ε)ρ ≤ ρ̂ ≤ ερ for some ε ≥ 1.

Let M̄^{f(k)(lr)} = HSVT_{τ_k}(M̄^{f(lr)}). Let M̂^{f(k)} = (1/ρ̂) HSVT_{s_k}(Z̄^X), where s_k is the k-th singular value of Z̄^X. Then for any j ∈ [L−1],

‖M̂^{f(k)}_{j,·} − M̄^f_{j,·}‖₂² ≤ (Cε²/ρ²) ( ∆/(ρ(τ_k − τ_{k+1})) + ‖E^{f(lr)}‖/(τ_k − τ_{k+1}) )² ( ‖Z̄^X_{j,·} − ρM̄^f_{j,·}‖₂² + ‖[M̄^{f(k)(lr)}]_{j,·}‖₂² )
 + (4ε²/ρ²) ‖ϕ^{M̄^{f(k)(lr)}}(Z̄^X_{j,·} − ρM̄^f_{j,·})‖₂² + 2(ε − 1)² ‖M̄^f_{j,·}‖₂² + 4‖[E^{f(lr)}]_{j,·}‖₂² + 4‖[M̄^{f(lr)} − M̄^{f(k)(lr)}]_{j,·}‖₂²,  (26)

for an absolute constant C > 0.

Proof.
For ease of exposition, for the remainder of the proof, let A = M̄^f; Z = Z̄^X; Â = M̂^{f(k)}; A_{(lr)} = M̄^{f(lr)}; A^k_{(lr)} = M̄^{f(k)(lr)}; E₁ = M̄^f − M̄^{f(lr)}; and E₂ = M̄^f − M̄^{f(k)(lr)}. We will use the notation λ* = s_k, the k-th singular value of Z̄^X, for simplicity. Further, recall that s_r and u_r denote the r-th singular value and right singular vector of Z̄^X, respectively. We prove the lemma in three steps.

Step 1. Fix a row index j ∈ [L−1]. Observe that

Â_{j,·} − A_{j,·} = ( Â_{j,·} − ϕ^Z_{λ*}(A_{j,·}) ) + ( ϕ^Z_{λ*}(A_{j,·}) − A_{j,·} ).

By definition (see (9)), ϕ^Z_{λ*}: R^{P̄} → R^{P̄} is the projection operator onto the span of the top k right singular vectors of Z, namely span{u_1, ..., u_k}. Therefore, ϕ^Z_{λ*}(A_{j,·}) − A_{j,·} ∈ span{u_1, ..., u_k}^⊥. By choice, rank(Â) = k; hence, by Lemma D.1,

Â_{j,·} − ϕ^Z_{λ*}(A_{j,·}) = (1/ρ̂) ϕ^Z_{λ*}(Z_{j,·}) − ϕ^Z_{λ*}(A_{j,·}) ∈ span{u_1, ..., u_k}.

Hence, ⟨Â_{j,·} − ϕ^Z_{λ*}(A_{j,·}), ϕ^Z_{λ*}(A_{j,·}) − A_{j,·}⟩ = 0, and

‖Â_{j,·} − A_{j,·}‖₂² = ‖Â_{j,·} − ϕ^Z_{λ*}(A_{j,·})‖₂² + ‖ϕ^Z_{λ*}(A_{j,·}) − A_{j,·}‖₂²  (27)

by the Pythagorean theorem.

Step 2.
We begin by bounding the first term on the right-hand side of (27). Again applying Lemma D.1, we can rewrite

Â_{j,·} − ϕ^Z_{λ*}(A_{j,·}) = (1/ρ̂) ϕ^Z_{λ*}(Z_{j,·}) − ϕ^Z_{λ*}(A_{j,·}) = ϕ^Z_{λ*}( (1/ρ̂) Z_{j,·} − A_{j,·} ) = (1/ρ̂) ϕ^Z_{λ*}(Z_{j,·} − ρA_{j,·}) + ((ρ − ρ̂)/ρ̂) ϕ^Z_{λ*}(A_{j,·}).

Using the parallelogram law (or, equivalently, combining the Cauchy-Schwarz and AM-GM inequalities), we obtain

‖Â_{j,·} − ϕ^Z_{λ*}(A_{j,·})‖₂² ≤ (2/ρ̂²) ‖ϕ^Z_{λ*}(Z_{j,·} − ρA_{j,·})‖₂² + 2((ρ − ρ̂)/ρ̂)² ‖ϕ^Z_{λ*}(A_{j,·})‖₂²
 ≤ (2ε²/ρ²) ‖ϕ^Z_{λ*}(Z_{j,·} − ρA_{j,·})‖₂² + 2(ε − 1)² ‖A_{j,·}‖₂²,  (28)

because Condition 2 implies ρ̂ ≥ ρ/ε and ((ρ − ρ̂)/ρ̂)² ≤ (ε − 1)², and ϕ^Z_{λ*} is a projection. The first term of (28) can further be decomposed as

‖ϕ^Z_{λ*}(Z_{j,·} − ρA_{j,·})‖₂² ≤ 2 ‖ϕ^Z_{λ*}(Z_{j,·} − ρA_{j,·}) − ϕ^{A^k_{(lr)}}(Z_{j,·} − ρA_{j,·})‖₂² + 2 ‖ϕ^{A^k_{(lr)}}(Z_{j,·} − ρA_{j,·})‖₂².  (29)

We now bound the first term on the right-hand side of (29) separately.
First, we apply the Davis-Kahan sin Θ theorem (see [7, 21]) to arrive at the following inequality:

‖P_{u_1,...,u_k} − P_{µ_1,...,µ_k}‖ ≤ ‖Z − ρA_{(lr)}‖ / (ρτ_k − ρτ_{k+1})  (30)
 ≤ ‖Z − ρA‖ / (ρτ_k − ρτ_{k+1}) + ‖ρA − ρA_{(lr)}‖ / (ρτ_k − ρτ_{k+1})  (31)
 ≤ ∆ / (ρ(τ_k − τ_{k+1})) + ‖A − A_{(lr)}‖ / (τ_k − τ_{k+1}),  (32)

where P_{u_1,...,u_k} and P_{µ_1,...,µ_k} denote the projection operators onto the spans of the top k right singular vectors of Z and A_{(lr)}, respectively. Note that we utilized Condition 1 to bound ‖Z − ρA‖ ≤ ∆. It then follows that

‖ϕ^Z_{λ*}(Z_{j,·} − ρA_{j,·}) − ϕ^{A^k_{(lr)}}(Z_{j,·} − ρA_{j,·})‖₂ ≤ ‖P_{u_1,...,u_k} − P_{µ_1,...,µ_k}‖ ‖Z_{j,·} − ρA_{j,·}‖₂
 ≤ ( ∆/(ρ(τ_k − τ_{k+1})) + ‖A − A_{(lr)}‖/(τ_k − τ_{k+1}) ) ‖Z_{j,·} − ρA_{j,·}‖₂.

Combining the inequalities above, we have

‖Â_{j,·} − ϕ^Z_{λ*}(A_{j,·})‖₂² ≤ (4ε²/ρ²) ( ∆/(ρ(τ_k − τ_{k+1})) + ‖A − A_{(lr)}‖/(τ_k − τ_{k+1}) )² ‖Z_{j,·} − ρA_{j,·}‖₂² + (4ε²/ρ²) ‖ϕ^{A^k_{(lr)}}(Z_{j,·} − ρA_{j,·})‖₂² + 2(ε − 1)² ‖A_{j,·}‖₂².  (33)

Step 3.
We now bound the second term of (27). Recalling A = A^k_{(lr)} + E₂, where E₂ := M̄^f − M̄^{f(k)(lr)}, and using (30),

‖ϕ^Z_{λ*}(A_{j,·}) − A_{j,·}‖₂² = ‖ϕ^Z_{λ*}([A^k_{(lr)}]_{j,·} + [E₂]_{j,·}) − [A^k_{(lr)}]_{j,·} − [E₂]_{j,·}‖₂²
 ≤ 2 ‖ϕ^Z_{λ*}([A^k_{(lr)}]_{j,·}) − [A^k_{(lr)}]_{j,·}‖₂² + 2 ‖ϕ^Z_{λ*}([E₂]_{j,·}) − [E₂]_{j,·}‖₂²
 = 2 ‖ϕ^Z_{λ*}([A^k_{(lr)}]_{j,·}) − ϕ^{A^k_{(lr)}}([A^k_{(lr)}]_{j,·})‖₂² + 2 ‖ϕ^Z_{λ*}([E₂]_{j,·}) − [E₂]_{j,·}‖₂²
 ≤ 2 ‖P_{u_1,...,u_k} − P_{µ_1,...,µ_k}‖² ‖[A^k_{(lr)}]_{j,·}‖₂² + 2 ‖[E₂]_{j,·}‖₂²
 ≤ 2 ( ∆/(ρ(τ_k − τ_{k+1})) + ‖M̄^f − M̄^{f(lr)}‖/(τ_k − τ_{k+1}) )² ‖[A^k_{(lr)}]_{j,·}‖₂² + 2 ‖[E₂]_{j,·}‖₂²,  (34)

where we used that ϕ^{A^k_{(lr)}} acts as the identity on the rows of A^k_{(lr)}, and that I − ϕ^Z_{λ*} is a projection (so its operator norm is at most one). Inserting (33) and (34) back into (27), and collecting terms, completes the proof.

Corollary F.1.
Let the conditions of Lemma F.1 hold. Then for any $j \in [L-1]$,
\[
\big\|\widehat{M}^{f(k)}_{j,\cdot} - \bar{M}^{f}_{j,\cdot}\big\|_2
\le \frac{2\varepsilon}{\rho}\left(\frac{\Delta + \sqrt{L\bar{P}}\,\big(R\,\epsilon^{(1)}_{\max}\,\Gamma_1\big)}{\tau_k - \tau_{k+1}}\right)\left(\big\|\bar{Z}^{X}_{j,\cdot} - \rho\,\bar{M}^{f}_{j,\cdot}\big\|_2 + \big\|[\bar{M}^{f(k)(lr)}]_{j,\cdot}\big\|_2\right)
+ \frac{4\varepsilon}{\rho}\big\|\varphi^{\bar{M}^{f(k)(lr)}}\big(\bar{Z}^{X}_{j,\cdot} - \rho\,\bar{M}^{f}_{j,\cdot}\big)\big\|_2
+ 2(\varepsilon - 1)\big\|\bar{M}^{f}_{j,\cdot}\big\|_2
+ 4\sqrt{\bar{P}}\,\big(R\,\epsilon^{(1)}_{\max}\,\Gamma_1\big)
+ 4\big\|[\bar{M}^{f(lr)} - \bar{M}^{f(k)(lr)}]_{j,\cdot}\big\|_2 . \tag{35}
\]

Proof.
Immediate from Lemma F.1 and Proposition 2.6.
Proposition F.1.
Assume Properties 2.1, 2.5, 2.6 and 2.2 hold. Then,
\[
\mathbb{E}\,\big\|\bar{M}^{f} - \widehat{M}^{f(k)}\big\|_{2,\infty}^2
\le \frac{C^{*}(\gamma, R, \Gamma_1, \Gamma_2)}{\rho^2}\left(\frac{L\bar{P}^2}{(\tau_k - \tau_{k+1})^2} + k\bar{P} + \frac{L\bar{P}^2\big((\epsilon^{(1)}_{\max})^2 + (\epsilon^{(1)}_{\max})^4\big)}{(\tau_k - \tau_{k+1})^2}\right)\log(L\bar{P})
+ 4\max_{j\in[L-1]}\big\|[\bar{M}^{f(lr)} - \bar{M}^{f(k)(lr)}]_{j,\cdot}\big\|_2^2 , \tag{36}
\]
where $C^{*}(\gamma, R, \Gamma_1, \Gamma_2)$ is a term that depends only on $\gamma, R, \Gamma_1, \Gamma_2$.

Proof.
Notation.
For ease of exposition, for the remainder of the proof, let: $A = \bar{M}^{f}$; $Z = \bar{Z}^{X}$; $\widehat{A} = \widehat{M}^{f(k)}$; $A^{(lr)} = \bar{M}^{f(lr)}$; $A_k^{(lr)} = \bar{M}^{f(k)(lr)}$; and $E = \bar{M}^{f(lr)} - \bar{M}^{f(k)(lr)}$.

High Probability Conditioning Event.
Let $E_0 := E_1 \cap E_2 \cap E_3 \cap E_4$, where $E_1$ to $E_4$ are defined in (13) to (16), respectively. Then,
\[
\mathbb{E}\big[\|\widehat{A} - A\|_{2,\infty}^2\big]
= \mathbb{E}\Big[\max_{j\in[L-1]}\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\|_2^2\Big]
= \mathbb{E}\Big[\max_{j\in[L-1]}\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\|_2^2\,\mathbf{1}(E_0)\Big]
+ \mathbb{E}\Big[\max_{j\in[L-1]}\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\|_2^2\,\mathbf{1}(E_0^c)\Big]. \tag{37}
\]

Upper bound on the first term in (37). First note that $\varepsilon \le 2$ under event $E_0$ (on which $\widehat{\rho} \ge \rho/2$); and that
\[
\big\|[A_k^{(lr)}]_{j,\cdot}\big\|_2 \le \big\|A^{(lr)}_{j,\cdot}\big\|_2 \overset{(a)}{\le} \big(R\,\Gamma_1(\Gamma_2 + \epsilon^{(1)}_{\max})\big)\sqrt{\bar{P}},
\]
where (a) follows by Lemma C.1 and Proposition 2.6. Then, conditioned on event $E_0$, and by Corollary F.1,
\[
\max_{j\in[L-1]}\big\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\big\|_2
\le \frac{C}{\rho}\left(\frac{\gamma\sqrt{L\bar{P}} + \sqrt{L\bar{P}}\,\big(R\,\epsilon^{(1)}_{\max}\,\Gamma_1\big)}{\tau_k - \tau_{k+1}}\right)\left(\gamma\sqrt{\bar{P}} + \big(R\,\Gamma_1(\Gamma_2 + \epsilon^{(1)}_{\max})\big)\sqrt{\bar{P}}\right)
+ \frac{C}{\rho}\Big(\gamma\sqrt{k\log(L\bar{P})}\Big)
+ 2\sqrt{\bar{P}}\,\big(R\,\epsilon^{(1)}_{\max}\,\Gamma_1\big)
+ 4\max_{j\in[L-1]}\|E_{j,\cdot}\|_2 . \tag{38}
\]
Simplifying (38) by collecting terms, we have
\[
\max_{j\in[L-1]}\big\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\big\|_2^2
\le \frac{C^{*}(\gamma, R, \Gamma_1, \Gamma_2)}{\rho^2}\left(\frac{L\bar{P}^2}{(\tau_k - \tau_{k+1})^2} + k\bar{P} + \frac{L\bar{P}^2\big((\epsilon^{(1)}_{\max})^2 + (\epsilon^{(1)}_{\max})^4\big)}{(\tau_k - \tau_{k+1})^2}\right)\log(L\bar{P})
+ 4\max_{j\in[L-1]}\|E_{j,\cdot}\|_2^2 , \tag{39}
\]
where $C^{*}(\gamma, R, \Gamma_1, \Gamma_2)$ is a term that depends only on $\gamma, R, \Gamma_1, \Gamma_2$. Since $\mathbf{1}(E_0) \le 1$, we have
\[
\mathbb{E}\Big[\max_{j\in[L-1]}\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\|_2^2\,\mathbf{1}(E_0)\Big]
\le \frac{C^{*}(\gamma, R, \Gamma_1, \Gamma_2)}{\rho^2}\left(\frac{L\bar{P}^2}{(\tau_k - \tau_{k+1})^2} + k\bar{P} + \frac{L\bar{P}^2\big((\epsilon^{(1)}_{\max})^2 + (\epsilon^{(1)}_{\max})^4\big)}{(\tau_k - \tau_{k+1})^2}\right)\log(L\bar{P})
+ 4\max_{j\in[L-1]}\|E_{j,\cdot}\|_2^2 . \tag{40}
\]

Upper bound on the second term in (37). To begin with, we note that for any $j \in [L-1]$, $\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\|_2 \le \|\widehat{A}_{j,\cdot}\|_2 + \|A_{j,\cdot}\|_2$ by the triangle inequality. By the model assumption, the covariates are bounded (Property 2.1), and $\|A_{j,\cdot}\|_2 \le (R\,\Gamma_1\Gamma_2)\sqrt{\bar{P}}$ for all $j \in [L-1]$. For the remainder of the proof, for ease of notation, let $\Gamma := R\,\Gamma_1\Gamma_2$. By definition, for any $j \in [L-1]$, $\widehat{A}_{j,\cdot} = \frac{1}{\widehat{\rho}}\,\mathrm{HSVT}_{\lambda}(Z)_{j,\cdot}$ for a given threshold $\lambda = s_k$, the $k$-th singular value of $Z$. Therefore,
\[
\|\widehat{A}_{j,\cdot}\|_2 = \frac{1}{\widehat{\rho}}\,\big\|\mathrm{HSVT}_{\lambda}(Z)_{j,\cdot}\big\|_2
\overset{(a)}{\le} \bar{P}L\,\big\|\mathrm{HSVT}_{\lambda}(Z)_{j,\cdot}\big\|_2
\le \bar{P}L\,\|Z_{j,\cdot}\|_2 .
\]
Here, (a) follows from $\widehat{\rho} \ge \frac{1}{\bar{P}L}$. Hence,
\[
\max_{j\in[L-1]}\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\|_2
\le \max_{j\in[L-1]}\|\widehat{A}_{j,\cdot}\|_2 + \max_{j\in[L-1]}\|A_{j,\cdot}\|_2
\le \bar{P}L\max_{j\in[L-1]}\|Z_{j,\cdot}\|_2 + \Gamma\sqrt{\bar{P}}
\le \big(\bar{P}L + 1\big)\sqrt{\bar{P}}\,\Gamma + \bar{P}L\sqrt{\bar{P}}\max_{ij}|\eta_{ij}|
\le 2\bar{P}^{3/2}L\Big(\Gamma + \max_{ij}|\eta_{ij}|\Big), \tag{41}
\]
because $\max_{j}\|Z_{j,\cdot}\|_2 \le \sqrt{\bar{P}}\max_{i,j}|Z_{ij}| \le \sqrt{\bar{P}}\max_{i,j}|A_{ij} + \eta_{ij}| \le \sqrt{\bar{P}}\big(\Gamma + \max_{i,j}|\eta_{ij}|\big)$. Now we apply the Cauchy-Schwarz inequality on $\mathbb{E}\big[\max_{j\in[L-1]}\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\|_2^2\,\mathbf{1}(E_0^c)\big]$ to obtain
\begin{align*}
\mathbb{E}\Big[\max_{j\in[L-1]}\big\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\big\|_2^2\,\mathbf{1}(E_0^c)\Big]
&\le \sqrt{\mathbb{E}\Big[\max_{j\in[L-1]}\big\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\big\|_2^4\Big]}\cdot\sqrt{\mathbb{E}\big[\mathbf{1}(E_0^c)\big]}
= \sqrt{\mathbb{E}\Big[\max_{j\in[L-1]}\big\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\big\|_2^4\Big]}\cdot\sqrt{P(E_0^c)}\\
&\overset{(a)}{\le} 4\bar{P}^{3}L^{2}\sqrt{\mathbb{E}\Big[\big(\Gamma + \max_{ij}|\eta_{ij}|\big)^4\Big]}\cdot\sqrt{P(E_0^c)}
\overset{(b)}{\le} 4\sqrt{8}\,\bar{P}^{3}L^{2}\sqrt{\Gamma^4 + \mathbb{E}\big[\max_{ij}|\eta_{ij}|^4\big]}\cdot\sqrt{P(E_0^c)}\\
&\overset{(c)}{\le} 4\sqrt{8}\,\bar{P}^{3}L^{2}\Big(\Gamma^2 + \sqrt{\mathbb{E}\big[\max_{ij}|\eta_{ij}|^4\big]}\Big)\cdot\sqrt{P(E_0^c)} . \tag{42}
\end{align*}
Here, (a) follows from (41); (b) follows from Jensen's inequality,
\[
\mathbb{E}\Big[\big(\Gamma + \max_{ij}|\eta_{ij}|\big)^4\Big]
\le \mathbb{E}\Big[8\Big(\Gamma^4 + \big(\max_{ij}|\eta_{ij}|\big)^4\Big)\Big]
= 8\Big(\Gamma^4 + \mathbb{E}\big[\max_{ij}|\eta_{ij}|^4\big]\Big);
\]
and (c) follows from the trivial inequality $\sqrt{A+B} \le \sqrt{A} + \sqrt{B}$ for any $A, B \ge 0$. Now it remains to find an upper bound for $\mathbb{E}\big[\max_{ij}|\eta_{ij}|^4\big]$. Note that for $\theta \ge 1$, $\eta_{ij}$ being a $\psi_2$-random variable implies that $|\eta_{ij}|^{\theta}$ is a $\psi_{2/\theta}$-random variable.
With the choice of $\theta = 4$, we have that
\[
\mathbb{E}\Big[\max_{ij}|\eta_{ij}|^4\Big] \le C'\gamma^4\log^2(\bar{P}L) \tag{43}
\]
for some $C' > 0$ by Lemma E.1 (also see Remark E.1). Inserting (43) into (42) yields
\[
\mathbb{E}\Big[\max_{j\in[L-1]}\big\|\widehat{A}_{j,\cdot} - A_{j,\cdot}\big\|_2^2\,\mathbf{1}(E_0^c)\Big]
\le C\,\bar{P}^{3}L^{2}\Big(\Gamma^2 + C'^{1/2}\gamma^2\log(\bar{P}L)\Big)\cdot\sqrt{P(E_0^c)}
\overset{(a)}{\le} \frac{C\big(\Gamma^2 + C'^{1/2}\gamma^2\log(\bar{P}L)\big)}{\bar{P}L}, \tag{44}
\]
where (a) follows from recalling that $P(E_0^c)$ decays polynomially in $\bar{P}L$ at a fast enough rate.

Concluding the Proof.
Thus, combining (40) and (44) in (37), and noticing that the term in (44) is of smaller order than that in (40), by redefining $C^{*}(\gamma, R, \Gamma_1, \Gamma_2)$ appropriately we obtain the desired bound:
\[
\mathbb{E}\Big[\big\|\widehat{A} - A\big\|_{2,\infty}^2\Big]
\le \frac{C^{*}(\gamma, R, \Gamma_1, \Gamma_2)}{\rho^2}\left(\frac{L\bar{P}^2}{(\tau_k - \tau_{k+1})^2} + k\bar{P} + \frac{L\bar{P}^2\big((\epsilon^{(1)}_{\max})^2 + (\epsilon^{(1)}_{\max})^4\big)}{(\tau_k - \tau_{k+1})^2}\right)\log(L\bar{P})
+ 4\max_{j\in[L-1]}\|E_{j,\cdot}\|_2^2 . \tag{45}
\]

G Proofs - Imputation Analysis
G.1 Proof of Theorem 4.1
Proof.
Observe that for any matrix $A \in \mathbb{R}^{m\times n}$, $\frac{1}{n}\|A\|_{2,\infty}^2 \ge \frac{1}{mn}\|A\|_F^2$. Then using Property 4.1 and Proposition F.1, and simplifying terms, gives the desired result.
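The only matrix fact used in this proof is that the average squared row norm is bounded by the maximum squared row norm, $\frac{1}{mn}\|A\|_F^2 \le \frac{1}{n}\|A\|_{2,\infty}^2$. A one-off numerical sanity check (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 15, 7
A = rng.standard_normal((m, n))

fro_sq = np.linalg.norm(A, "fro") ** 2        # ||A||_F^2: sum of all squared entries
row_max_sq = np.max(np.sum(A ** 2, axis=1))   # ||A||_{2,inf}^2: largest squared row norm

# Average squared row norm <= maximum squared row norm:
# (1/(mn)) ||A||_F^2 <= (1/n) ||A||_{2,inf}^2.
assert fro_sq / (m * n) <= row_max_sq / n
print("norm comparison holds")
```

This is how a $\|\cdot\|_{2,\infty}$ bound (Proposition F.1) translates into the mean-squared (Frobenius) error bound of Theorem 4.1.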
G.2 Proof of Theorem 4.3
Proof.
For the purposes of the proof, let $N > 1$ and pick an arbitrary ordering of the $N$ time series, denoted as $f_1, \dots, f_N$. For $i \in [L]$ and $j \in [\bar{P}]$, define $[M^{f^2}]_{ij} := ([M^f]_{ij})^2$ and $[M^{f^2+\sigma^2}]_{ij} := [M^{\sigma^2} + M^{f^2}]_{ij}$. Then,
\begin{align*}
\mathbb{E}\big\|\widehat{M}^{\sigma^2}(\text{Impute}) - M^{\sigma^2}\big\|_F^2
&= \mathbb{E}\big\|\widehat{M}^{f^2+\sigma^2} - \widehat{M}^{f^2} - M^{\sigma^2}\big\|_F^2
= \mathbb{E}\big\|\widehat{M}^{f^2+\sigma^2} - \widehat{M}^{f^2} - M^{f^2+\sigma^2} + M^{f^2}\big\|_F^2\\
&\le 2\,\mathbb{E}\big\|\widehat{M}^{f^2} - M^{f^2}\big\|_F^2 + 2\,\mathbb{E}\big\|\widehat{M}^{f^2+\sigma^2} - M^{f^2+\sigma^2}\big\|_F^2 .
\end{align*}

First term: $\mathbb{E}\|\widehat{M}^{f^2} - M^{f^2}\|_F^2$.
\begin{align*}
\mathbb{E}\big\|\widehat{M}^{f^2} - M^{f^2}\big\|_F^2
&= \mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(f_n^2(t) - \hat{f}_n^2(t)\Big)^2
= \mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(f_n(t) - \hat{f}_n(t)\Big)^2\Big(f_n(t) + \hat{f}_n(t)\Big)^2\\
&\le \max_{n\in[N],\,t\in[T]}\Big(f_n(t) + \hat{f}_n(t)\Big)^2\;\mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(f_n(t) - \hat{f}_n(t)\Big)^2\\
&\overset{(a)}{\le} 4\big(R\,\Gamma_1\Gamma_2\big)^2\,\mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(f_n(t) - \hat{f}_n(t)\Big)^2
= 4\big(R\,\Gamma_1\Gamma_2\big)^2\,\mathbb{E}\big\|\widehat{M}^{f}(\text{Impute}) - M^{f}\big\|_F^2 .
\end{align*}

Second term: $\mathbb{E}\|\widehat{M}^{f^2+\sigma^2} - M^{f^2+\sigma^2}\|_F^2$. Note,
\[
\big\|X_n^2(t)\big\|_{\psi_1} = \big\|f_n^2(t) + 2f_n(t)\eta_n(t) + \eta_n^2(t)\big\|_{\psi_1}
\le \big\|f_n(t)\big\|_{\psi_2}^2 + 2\big\|\eta_n(t)\big\|_{\psi_2}^2
\le \big(R\,\Gamma_1\Gamma_2\big)^2 + 2\gamma^2 .
\]
Further, by Corollary 2.1 and Proposition 2.1, we immediately have that there exists a matrix $M^{f^2+\sigma^2(lr)} \in \mathbb{R}^{L\times\bar{P}}$ such that
\[
\mathrm{rank}\big(M^{f^2+\sigma^2(lr)}\big) \le \big(R\,G^{(1)}_{\max}\big)^2 + R\,G^{(2)}_{\max},
\qquad
\big\|M^{f^2+\sigma^2} - M^{f^2+\sigma^2(lr)}\big\|_{\max} \le \big(R\,\epsilon^{(1)}_{\max}\,\Gamma_1\Gamma_2\big) + R\,\epsilon^{(2)}_{\max}\,\Gamma_2 .
\]
Then, by a straightforward modification of the proof of Theorem 4.1, we have
\[
\mathrm{MSE}\big(M^{f^2+\sigma^2}, \widehat{M}^{f^2+\sigma^2}\big)
\le \frac{C\,R^2\big(G^{(1)}_{\max}\big)^2 G^{(2)}_{\max}}{\rho^4}\left(\frac{1}{\sqrt{TN}} + \big(\epsilon^{(1)}_{\max}\big)^2 + \big(\epsilon^{(2)}_{\max}\big)^2\right)\log(\bar{P}).
\]
Adding the bounds we have for the first and second terms completes the proof.
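The estimator analyzed here computes $\widehat{M}^{\sigma^2} = \widehat{M}^{f^2+\sigma^2} - \widehat{M}^{f^2}$: de-noise the squared observations to estimate $f^2 + \sigma^2$, then subtract the square of the estimate of $f$. Below is a toy sketch of that subtraction scheme, with plain rank-$k$ truncation of the Page matrix standing in for the full mSSA pipeline (the sinusoidal signal, noise model, and ranks are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
L, P = 50, 200
t = np.arange(L * P)

f = np.sin(2 * np.pi * t / 25)                    # latent mean signal
sigma = 0.5 + 0.3 * np.cos(2 * np.pi * t / 50)    # latent time-varying std deviation
X = f + sigma * rng.standard_normal(t.size)       # observations

def rank_k_page_estimate(series, L, k):
    # Page matrix (non-overlapping length-L columns), truncated to rank k via SVD.
    page = series.reshape(-1, L).T
    U, s, Vt = np.linalg.svd(page, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

f_hat = rank_k_page_estimate(X, L, k=2)                       # estimate of f from E[X]
var_hat = rank_k_page_estimate(X ** 2, L, k=2) - f_hat ** 2   # E[X^2] - f^2 = sigma^2

err = np.mean((var_hat.T.ravel() - sigma ** 2) ** 2)
print(f"MSE of variance estimate: {err:.4f}")
```

The two applications of the same de-noising primitive (on $X$ and on $X^2$) mirror exactly how the two error terms above are combined.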
H Proofs - Forecasting Analysis
H.1 Forecasting - Helper Lemmas

Proposition H.1.
Let Properties 2.1 and 2.2 hold. Then there exists $\beta^* \in \mathbb{R}^{L-1}$, with $\|\beta^*\|_1 \le C\,R\,G^{(1)}_{\max}$, such that
\[
\big\|(\bar{M}^{f})^{T}\beta^* - M^{f}_{L}\big\|_{\infty} \le C\,R^2\big(G^{(1)}_{\max} + 1\big)\,\epsilon^{(1)}_{\max}\,\Gamma_1 . \tag{46}
\]
Here $C$ is an absolute constant.

Proof. To reduce notational complexity, we suppress the superscript $f$ for the remainder of the proof. Let $H$ and $H^{(lr)}$ be defined as in Proposition 2.6. Since $\mathrm{rank}(H^{(lr)}) \le R\,G^{(1)}_{\max}$, it must be the case that within the last $R\,G^{(1)}_{\max} + 1$ rows, there exists at least one row (which we denote as $r^*$) that can be written as a linear combination of at most $R\,G^{(1)}_{\max}$ rows above it (which we denote as $r_1, \dots, r_{R G^{(1)}_{\max}}$). Solely for the purposes of the remainder of the proof (and without any loss of generality), we redefine $M$, $H$, $H^{(lr)}$ assuming access to data for $t \in [-T : 2T]$ (instead of $[1 : T]$). Specifically, there exist $\theta_l \in \mathbb{R}$ for $l \in [R\,G^{(1)}_{\max}]$ such that for all $t \in [0 : T]$,
\[
[H^{(lr)}]_{(r^*, t)} = \sum_{l=1}^{R G^{(1)}_{\max}} \theta_l\,[H^{(lr)}]_{(r_l, t)} . \tag{47}
\]
Here is where we use the fact that we redefined $M$, $H$, $H^{(lr)}$ with respect to $t \in [-T : 2T]$.
Otherwise, we could only claim the equality in (47) for $t \in [L : T - R\,G^{(1)}_{\max}]$. Now, for all $t \in [0 : T]$,
\begin{align*}
\bigg|[H]_{(r^*,t)} - \sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H]_{(r_l,t)}\bigg|
&= \bigg|[H]_{(r^*,t)} \pm [H^{(lr)}]_{(r^*,t)} - \sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H]_{(r_l,t)} \pm \sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H^{(lr)}]_{(r_l,t)}\bigg|\\
&\le \Big|[H]_{(r^*,t)} - [H^{(lr)}]_{(r^*,t)}\Big|
+ \bigg|\sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H]_{(r_l,t)} - \sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H^{(lr)}]_{(r_l,t)}\bigg|
+ \bigg|[H^{(lr)}]_{(r^*,t)} - \sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H^{(lr)}]_{(r_l,t)}\bigg|\\
&= \Big|[H]_{(r^*,t)} - [H^{(lr)}]_{(r^*,t)}\Big|
+ \bigg|\sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H]_{(r_l,t)} - \sum_{l=1}^{R G^{(1)}_{\max}}\theta_l\,[H^{(lr)}]_{(r_l,t)}\bigg|\\
&\overset{(a)}{\le} C\Big(R\,\epsilon^{(1)}_{\max}\,\Gamma_1 + \big(R\,G^{(1)}_{\max}\big)\,R\,\epsilon^{(1)}_{\max}\,\Gamma_1\Big)
\le C\,R^2\big(G^{(1)}_{\max} + 1\big)\,\epsilon^{(1)}_{\max}\,\Gamma_1 ,
\end{align*}
where (a) follows from Proposition 2.6, $C$ is an absolute constant, and the third term in the second line vanishes by (47). Observing that every entry of $M^{f}_{L}$ appears in $[H]_{(r^*,\cdot)}$, and letting $\beta^* := (\theta_1, \cdots, \theta_{R G^{(1)}_{\max}}, 0, \dots, 0)$, completes the proof by redefining constants appropriately.

Proposition H.2.
Assume Properties 2.1, 2.5, 2.6 and 2.2 hold. Then for some absolute constant $C \ge 1$,
\[
\mathrm{MSE}\big(M^{f}, \widehat{M}^{f}(\text{Forecast})\big)
\le C\Big(R^2\big(G^{(1)}_{\max}+1\big)\,\epsilon^{(1)}_{\max}\,\Gamma_1\Big)^2
+ C\big(R\,G^{(1)}_{\max}\big)^2\,\frac{\mathbb{E}\big\|\bar{M}^{f} - \widehat{M}^{f,k}\big\|_{2,\infty}^2}{\bar{P}}
+ \frac{\tilde{K}\,k}{\bar{P}},
\]
where $\widehat{M}^{f,k}$ is the estimate obtained in the first step of the forecasting algorithm in Section 3.1 using threshold $k \ge 1$ for the number of singular values, and $\tilde{K}$ is a function of $\gamma, R, \Gamma_1, \Gamma_2, \rho^{-1}$.

Proof. Let $Q := \rho(\bar{M}^{f})^{T}$ and $\widehat{Q} := \rho(\widehat{M}^{f,k})^{T}$. Define $\eta_L \in \mathbb{R}^{\bar{P}}$, where $\eta_L := Z^{X}_{L} - \rho M^{f}_{L}$. Each entry in $Z^{X}_{L}$ is equal to the noisy version of the underlying time series with probability $\rho$, and $0$ otherwise; and the noisy version has additive noise added to the corresponding entry in the component of $M^{f}_{L}$. Therefore, $\mathbb{E}[\eta_L] = 0$, and using Property 2.6 as well as Lemma G.2 in [3], it follows that the coordinates of $\eta_L$ are independent mean-zero sub-gaussian random variables such that $\|\eta_L(s)\|_{\psi_2} \le C'(\gamma + R\,\Gamma_1\Gamma_2)$ for $s \in [\bar{P}]$, where $C' > 0$ is an absolute constant. Define $K(\gamma, R, \Gamma_1, \Gamma_2) := C'(\gamma + R\,\Gamma_1\Gamma_2)^2$. Let $\beta^*$ below be defined as in Proposition H.1. Then note that, by the definition of $\hat{\beta}$ in the algorithm,
\[
\big\|Z^{X}_{L} - \widehat{Q}\hat{\beta}\big\|_2^2
\le \big\|Z^{X}_{L} - \widehat{Q}\beta^*\big\|_2^2
= \big\|\rho M^{f}_{L} - \widehat{Q}\beta^*\big\|_2^2 + \|\eta_L\|_2^2 + 2\,\eta_L^{T}\big(\rho M^{f}_{L} - \widehat{Q}\beta^*\big). \tag{48}
\]
Moreover,
\[
\big\|Z^{X}_{L} - \widehat{Q}\hat{\beta}\big\|_2^2
= \big\|\rho M^{f}_{L} - \widehat{Q}\hat{\beta}\big\|_2^2 + \|\eta_L\|_2^2 - 2\,\eta_L^{T}\big(\widehat{Q}\hat{\beta} - \rho M^{f}_{L}\big). \tag{49}
\]
Combining (48) and (49) and taking expectations, we have
\[
\mathbb{E}\big\|\rho M^{f}_{L} - \widehat{Q}\hat{\beta}\big\|_2^2
\le \mathbb{E}\big\|\rho M^{f}_{L} - \widehat{Q}\beta^*\big\|_2^2 + 2\,\mathbb{E}\big[\eta_L^{T}\widehat{Q}(\hat{\beta} - \beta^*)\big]. \tag{50}
\]
Let us bound the final term on the right hand side of (50).
Under our independence assumptions, observe that
\[
\mathbb{E}\big[\eta_L^{T}\widehat{Q}\big]\beta^* = \mathbb{E}\big[\eta_L^{T}\big]\,\mathbb{E}\big[\widehat{Q}\big]\beta^* = 0 . \tag{51}
\]
Recall $\hat{\beta} = \widehat{Q}^{\dagger}Z^{X}_{L} = \rho\,\widehat{Q}^{\dagger}M^{f}_{L} + \widehat{Q}^{\dagger}\eta_L$. Using the cyclic and linearity properties of the trace operator (coupled with similar independence arguments), we further have
\begin{align*}
\mathbb{E}\big[\eta_L^{T}\widehat{Q}\hat{\beta}\big]
&= \mathbb{E}\big[\rho\,\eta_L^{T}\widehat{Q}\widehat{Q}^{\dagger}\big]M^{f}_{L} + \mathbb{E}\big[\eta_L^{T}\widehat{Q}\widehat{Q}^{\dagger}\eta_L\big]
= \mathbb{E}\Big[\mathrm{Tr}\Big(\eta_L^{T}\widehat{Q}\widehat{Q}^{\dagger}\eta_L\Big)\Big]
= \mathbb{E}\Big[\mathrm{Tr}\Big(\widehat{Q}\widehat{Q}^{\dagger}\eta_L\eta_L^{T}\Big)\Big]\\
&= \mathrm{Tr}\Big(\mathbb{E}\big[\widehat{Q}\widehat{Q}^{\dagger}\big]\cdot\mathbb{E}\big[\eta_L\eta_L^{T}\big]\Big)
\overset{(a)}{\le} C_{\gamma}\,K(\rho, \gamma)\,\mathbb{E}\Big[\mathrm{Tr}\Big(\widehat{Q}\widehat{Q}^{\dagger}\Big)\Big], \tag{52}
\end{align*}
where (a) follows from Property 2.6. Here $C_{\gamma}$ is an absolute constant that may depend on $\gamma$ as well as $\rho$. Let $\widehat{Q} = USV^{T}$ be the singular value decomposition of $\widehat{Q}$. Then
\[
\widehat{Q}\widehat{Q}^{\dagger} = USV^{T}VS^{\dagger}U^{T} = U\tilde{I}U^{T} . \tag{53}
\]
Here, $\tilde{I}$ is a block diagonal matrix whose nonzero entries on the diagonal take the value $1$. Plugging (53) into (52), and using the fact that the trace of a square matrix is equal to the sum of its eigenvalues,
\[
\mathbb{E}\Big[\mathrm{Tr}\Big(\widehat{Q}\widehat{Q}^{\dagger}\Big)\Big] = \mathbb{E}\big[\mathrm{rank}(\widehat{Q})\big] = k . \tag{54}
\]
We now turn our attention to the first term on the right hand side of (50).
We obtain
\begin{align*}
\big\|\rho M^{f}_{L} - \widehat{Q}\beta^*\big\|_2^2
&= \big\|\rho M^{f}_{L} - (Q - Q + \widehat{Q})\beta^*\big\|_2^2
\le 2\big\|\rho M^{f}_{L} - Q\beta^*\big\|_2^2 + 2\big\|(Q - \widehat{Q})\beta^*\big\|_2^2\\
&\overset{(a)}{\le} 2\rho^2\Big(C\,R^2\big(G^{(1)}_{\max}+1\big)\,\epsilon^{(1)}_{\max}\,\Gamma_1\Big)^2\bar{P} + 2\big\|(Q - \widehat{Q})\beta^*\big\|_2^2 ,
\end{align*}
where (a) follows from Proposition H.1. We also have,
\begin{align}
\mathbb{E}\big\|(Q - \widehat{Q})\beta^*\big\|_2^2
&= \rho^2\,\mathbb{E}\Big\|\big(\bar{M}^{f} - \widehat{M}^{f,k}\big)^{T}\beta^*\Big\|_2^2 \tag{55}\\
&\le \rho^2\,\|\beta^*\|_1^2\,\mathbb{E}\big\|\bar{M}^{f} - \widehat{M}^{f,k}\big\|_{2,\infty}^2 \tag{56}\\
&\overset{(b)}{\le} \rho^2\big(C\,R\,G^{(1)}_{\max}\big)^2\,\mathbb{E}\big\|\bar{M}^{f} - \widehat{M}^{f,k}\big\|_{2,\infty}^2 , \tag{57}
\end{align}
where (b) follows from Proposition H.1. Collecting all the terms together, dividing by $\rho^2 \times \bar{P}$, using $\tilde{K}(\gamma, R, \Gamma_1, \Gamma_2, \rho^{-1}) = K(\gamma, R, \Gamma_1, \Gamma_2)/\rho^2$, and redefining the constants, we obtain our desired result.

Theorem H.1.
Assume Properties 2.1, 2.5, 2.6 and 2.2 hold. Further, let (i) $\bar{P} \ge L$. Then,
\[
\mathrm{MSE}\big(M^{f}, \widehat{M}^{f}(\text{Forecast})\big)
\le \frac{C^{*}\big(\gamma, R, \Gamma_1, \Gamma_2, (G^{(1)}_{\max})^2\big)}{\rho^2}\left(\frac{1}{\bar{P}} + \frac{\bar{P}L\big((\epsilon^{(1)}_{\max})^2 + (\epsilon^{(1)}_{\max})^4\big)}{(\tau_k - \tau_{k+1})^2} + \frac{k}{\bar{P}}\,\tilde{K}\big(\gamma, R, \Gamma_1, \Gamma_2, \rho^{-1}\big) + \big(\epsilon^{(1)}_{\max}\big)^2\right)\log(\bar{P})
+ \frac{C\big(R\,G^{(1)}_{\max}\big)^2}{\bar{P}}\,\big\|\bar{M}^{f(lr)} - \bar{M}^{f(k)(lr)}\big\|_{2,\infty}^2 ,
\]
where $C^{*}(\gamma, R, \Gamma_1, \Gamma_2)$ is a term that depends only on $\gamma, R, \Gamma_1, \Gamma_2$, and $C \ge 1$ is an absolute constant. Here, $\tilde{K}(\gamma, R, \Gamma_1, \Gamma_2, \rho^{-1})$ is analogously defined to $C^{*}(\gamma, R, \Gamma_1, \Gamma_2)$.

Proof. Immediate from Propositions H.2 and F.1.
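The forecasting estimator analyzed in this section is principal component regression on the Page matrix: de-noise the top $L-1$ rows by truncating at the top $k$ singular values (HSVT), then regress the last row on them via the pseudo-inverse, $\hat{\beta} = \widehat{Q}^{\dagger} Z_L$. A minimal sketch under illustrative assumptions (dimensions, rank, and noise level are arbitrary choices, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(3)
L, P, k = 20, 500, 3

# Rank-k "stacked Page matrix": rows 0..L-2 are the regressors, row L-1 is the target.
M = rng.standard_normal((L, k)) @ rng.standard_normal((k, P))
Z = M + 0.1 * rng.standard_normal((L, P))        # noisy observations

# Step 1: de-noise the regressor rows by truncating to the top k singular values.
U, s, Vt = np.linalg.svd(Z[:-1], full_matrices=False)
Q_hat = ((U[:, :k] * s[:k]) @ Vt[:k]).T          # P x (L-1) design matrix

# Step 2: regress the (noisy) last row on the de-noised rows via the pseudo-inverse.
beta_hat = np.linalg.pinv(Q_hat) @ Z[-1]
forecast = Q_hat @ beta_hat

mse = np.mean((forecast - M[-1]) ** 2)
print(f"forecast MSE: {mse:.5f}")
```

Because the last row of a rank-$k$ matrix lies in the span of the rows above it (the existence of $\beta^*$ from Proposition H.1), the regression recovers the latent last row despite fitting against noisy targets, which is the content of Theorem H.1.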
H.2 Proof of Theorem 4.2
Proof.
Immediate from Theorem H.1 by simplifying terms and using Property 4.1.

H.3 Proof of Theorem 4.4
Proof.
For the purposes of the proof, let $N > 1$ and pick an arbitrary ordering of the $N$ time series, denoted as $f_1, \dots, f_N$. Let $\widehat{M}^{\sigma^2(\text{Forecast})}_{L}, \widehat{M}^{f^2(\text{Forecast})}_{L}, \widehat{M}^{f^2+\sigma^2(\text{Forecast})}_{L} \in \mathbb{R}^{\bar{P}}$ be the induced vectors corresponding to the ordering of the $N$ time series, chosen from $\widehat{M}^{\sigma^2}_{n}(\text{Forecast}), \widehat{M}^{f^2}_{n}(\text{Forecast}), \widehat{M}^{f^2+\sigma^2}_{n}(\text{Forecast})$, respectively. Let $M^{\sigma^2}_{L}, M^{f^2}_{L}, M^{f^2+\sigma^2}_{L} \in \mathbb{R}^{\bar{P}}$ be analogously defined but with respect to the latent time series $\sigma^2$, $f^2$ and $f^2 + \sigma^2$. Here $f^2$ is a component-wise squaring of the time series $f$. Then,
\begin{align*}
\mathbb{E}\big\|\widehat{M}^{\sigma^2(\text{Forecast})}_{L} - M^{\sigma^2}_{L}\big\|_2^2
&= \mathbb{E}\big\|\widehat{M}^{f^2+\sigma^2(\text{Forecast})}_{L} - \widehat{M}^{f^2(\text{Forecast})}_{L} - M^{\sigma^2}_{L}\big\|_2^2\\
&= \mathbb{E}\big\|\widehat{M}^{f^2+\sigma^2(\text{Forecast})}_{L} - \widehat{M}^{f^2(\text{Forecast})}_{L} - M^{f^2+\sigma^2}_{L} + M^{f^2}_{L}\big\|_2^2\\
&\le 2\,\mathbb{E}\big\|\widehat{M}^{f^2(\text{Forecast})}_{L} - M^{f^2}_{L}\big\|_2^2 + 2\,\mathbb{E}\big\|\widehat{M}^{f^2+\sigma^2(\text{Forecast})}_{L} - M^{f^2+\sigma^2}_{L}\big\|_2^2 .
\end{align*}

First term: $\mathbb{E}\|\widehat{M}^{f^2(\text{Forecast})}_{L} - M^{f^2}_{L}\|_2^2$.
\begin{align*}
\mathbb{E}\big\|\widehat{M}^{f^2(\text{Forecast})}_{L} - M^{f^2}_{L}\big\|_2^2
&= \mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{P}\Big(f_n^2(tL) - \hat{f}_n^2(tL)\Big)^2
= \mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{P}\Big(f_n(tL) - \hat{f}_n(tL)\Big)^2\Big(f_n(tL) + \hat{f}_n(tL)\Big)^2\\
&\le \max_{n\in[N],\,t\in[P]}\Big(f_n(tL) + \hat{f}_n(tL)\Big)^2\;\mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{P}\Big(f_n(tL) - \hat{f}_n(tL)\Big)^2\\
&\overset{(a)}{\le} 4\big(R\,\Gamma_1\Gamma_2\big)^2\,\mathbb{E}\sum_{n=1}^{N}\sum_{t=1}^{P}\Big(f_n(tL) - \hat{f}_n(tL)\Big)^2
= 4\big(R\,\Gamma_1\Gamma_2\big)^2\,\mathbb{E}\big\|\widehat{M}^{f(\text{Forecast})} - M^{f}\big\|_2^2 ,
\end{align*}
where (a) follows from Lemma C.1.
Second term: $\mathbb{E}\|\widehat{M}^{f^2+\sigma^2(\text{Forecast})}_{L} - M^{f^2+\sigma^2}_{L}\|_2^2$. Note that for all $t \in [T]$ and $n \in [N]$,
\[
\big\|X_n^2(tL)\big\|_{\psi_1} = \big\|f_n^2(tL) + 2f_n(tL)\eta_n(tL) + \eta_n^2(tL)\big\|_{\psi_1}
\le \big\|f_n(tL)\big\|_{\psi_2}^2 + 2\big\|\eta_n(tL)\big\|_{\psi_2}^2
\le \big(R\,\Gamma_1\Gamma_2\big)^2 + 2\gamma^2 .
\]
By Corollary 2.1 and Proposition 2.1, we immediately have that there exists a matrix $M^{f^2+\sigma^2(lr)} \in \mathbb{R}^{L\times\bar{P}}$ such that
\[
\mathrm{rank}\big(M^{f^2+\sigma^2(lr)}\big) \le \big(R\,G^{(1)}_{\max}\big)^2 + R\,G^{(2)}_{\max},
\qquad
\big\|M^{f^2+\sigma^2} - M^{f^2+\sigma^2(lr)}\big\|_{\max} \le \big(R\,\epsilon^{(1)}_{\max}\,\Gamma_1\Gamma_2\big) + R\,\epsilon^{(2)}_{\max}\,\Gamma_2 .
\]
Recall that $\bar{M}^{\sigma^2}, \bar{M}^{f^2}, \bar{M}^{f^2+\sigma^2} \in \mathbb{R}^{(L-1)\times\bar{P}}$ refer to the first $L-1$ rows of the induced stacked Page matrices with respect to the latent time series $\sigma^2$, $f^2$ and $f^2+\sigma^2$. Note that $M^{\sigma^2}_{L}, M^{f^2}_{L}, M^{f^2+\sigma^2}_{L} \in \mathbb{R}^{\bar{P}}$ defined above are the last rows. Let $R^* = (R\,G^{(1)}_{\max})^2 + R\,G^{(2)}_{\max}$ and recall $\epsilon^* = (R\,\epsilon^{(1)}_{\max}\,\Gamma_1\Gamma_2) + R\,\epsilon^{(2)}_{\max}\,\Gamma_2$. Then, by a straightforward modification of the proof of Proposition H.1, we see that there exists a $\beta^* \in \mathbb{R}^{L-1}$, with $\|\beta^*\|_1 \le C\,R^*$, such that
\[
\big\|(\bar{M}^{f^2+\sigma^2})^{T}\beta^* - M^{f^2+\sigma^2}_{L}\big\|_{\infty} \le C\,(R^* + 1)\,\epsilon^* . \tag{58}
\]
Here $C$ is an absolute constant (in short, the linear approximation error scales as the product of the approximate rank, $R^*$, and the low-rank Hankel approximation error, $\epsilon^*$). By showing the existence of $\beta^*$, we can now apply a straightforward modification of the proof of Theorem 4.2 to get
\[
\mathrm{MSE}\big(M^{f^2+\sigma^2}, \widehat{M}^{f^2+\sigma^2}(\text{Forecast})\big)
\le \frac{C_{\gamma}\,R^2\big(G^{(1)}_{\max}\big)^2 G^{(2)}_{\max}}{\rho^4}\left(\frac{1}{\sqrt{TN}} + \big(\epsilon^{(1)}_{\max}\big)^2 + \big(\epsilon^{(2)}_{\max}\big)^2\right)\log(\bar{P}).
\]
Adding the bounds we have for the first and second terms completes the proof.

I Proofs - Regret Analysis
I.1 Regret - Helper Lemmas

Proposition I.1.
For $n \in [N]$ and $t \in [H]$, assume:
• $\|\mathcal{P}_{\Omega_k(T+tL,n)} - \mathcal{P}_{\Omega_k(T+HL,N)}\| \le \Delta(T+tL,n)$;
• $\|\hat{\beta}(T+tL,n)\|_2 \le B$, for all $\hat{\beta}(T+tL,n) \in \Omega_k(T+tL,n)$;
• $\|\nabla c_{(T+tL,n)}(\hat{\beta}(T+tL,n))\|_2 \le D$.
Then, for a large enough absolute constant $C > 0$,
\[
\mathrm{regret} \le \frac{C}{\delta}\left(B^2 + NH\delta^2 D^2 + B^2\sum_{n=1}^{N}\sum_{t=1}^{H}\Delta(T+tL,n)\right). \tag{59}
\]

Proof.
For simplicity, write $\beta^* := \beta^*(T + H\times L, N)$ as in the definition of regret, and let $\beta^*(T+tL,n)$ be the projection of $\beta^*$ onto $\Omega_k(T+tL,n)$. Then for $n \in [N]$ and $t \in [H]$,
\begin{align*}
c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big) - c_{(T+tL,n)}\big(\beta^*\big)
&\le \nabla c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big)^{T}\big(\hat{\beta}(T+tL,n) - \beta^*\big), \quad\text{due to convexity of } c_{(T+tL,n)}(\cdot)\\
&= \frac{1}{\delta}\Big(\hat{\beta}(T+tL,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\Big)^{T}\big(\hat{\beta}(T+tL,n) - \beta^*\big), \quad\text{due to the update definition in online-mSSA}\\
&= \frac{1}{2\delta}\Big(\big\|\beta^* - \hat{\beta}(T+tL,n)\big\|_2^2 - \big\|\beta^* - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\|_2^2 + \big\|\hat{\beta}(T+tL,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\|_2^2\Big)\\
&= \frac{1}{2\delta}\Big(\big\|\beta^* - \hat{\beta}(T+tL,n)\big\|_2^2 - \big\|\beta^* - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\|_2^2 + \delta^2\big\|\nabla c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big)\big\|_2^2\Big),
\end{align*}
where we again use the definition of online-mSSA. First,
\begin{align*}
\big\|\beta^* - \hat{\beta}(T+tL,n)\big\|_2^2
&= \big\|\beta^* - \beta^*(T+tL,n) + \beta^*(T+tL,n) - \hat{\beta}(T+tL,n)\big\|_2^2\\
&= \big\|\beta^* - \beta^*(T+tL,n)\big\|_2^2 + \big\|\beta^*(T+tL,n) - \hat{\beta}(T+tL,n)\big\|_2^2 + 2\big\langle\beta^* - \beta^*(T+tL,n),\,\beta^*(T+tL,n) - \hat{\beta}(T+tL,n)\big\rangle\\
&= \big\|\beta^* - \beta^*(T+tL,n)\big\|_2^2 + \big\|\beta^*(T+tL,n) - \hat{\beta}(T+tL,n)\big\|_2^2 .
\end{align*}
In the above, $\langle\beta^* - \beta^*(T+tL,n),\,\beta^*(T+tL,n) - \hat{\beta}(T+tL,n)\rangle = 0$ due to $\beta^* - \beta^*(T+tL,n)$ being orthogonal to any vector in $\Omega_k(T+tL,n)$, including $\beta^*(T+tL,n) - \hat{\beta}(T+tL,n)$. Next,
\begin{align*}
\big\|\beta^* - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\|_2^2
&= \big\|\beta^* - \beta^*(T+(t+1)L,n) + \beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\|_2^2\\
&= \big\|\beta^* - \beta^*(T+(t+1)L,n)\big\|_2^2 + \big\|\beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\|_2^2\\
&\quad + 2\big\langle\beta^* - \beta^*(T+(t+1)L,n),\,\beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\rangle\\
&\ge \big\|\beta^* - \beta^*(T+(t+1)L,n)\big\|_2^2 + \big\|\beta^*(T+(t+1)L,n) - \hat{\beta}(T+(t+1)L,n)\big\|_2^2\\
&\quad + 2\big\langle\beta^* - \beta^*(T+(t+1)L,n),\,\beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\rangle .
\end{align*}
In the above, the inequality follows since $\hat{\beta}(T+(t+1)L,n)$ is the projection of $\hat{\tilde{\beta}}(T+(t+1)L,n)$ onto the linear subspace $\Omega_k(T+(t+1)L,n)$, which contains $\beta^*(T+(t+1)L,n)$; hence the inequality for $\|\cdot\|_2^2$.
Hence we have,
\begin{align*}
2\delta\cdot\Big(c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big) - c_{(T+tL,n)}\big(\beta^*\big)\Big)
&\le \big\|\beta^* - \beta^*(T+tL,n)\big\|_2^2 - \big\|\beta^* - \beta^*(T+(t+1)L,n)\big\|_2^2\\
&\quad + \big\|\beta^*(T+tL,n) - \hat{\beta}(T+tL,n)\big\|_2^2 - \big\|\beta^*(T+(t+1)L,n) - \hat{\beta}(T+(t+1)L,n)\big\|_2^2\\
&\quad - 2\big\langle\beta^* - \beta^*(T+(t+1)L,n),\,\beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\rangle\\
&\quad + \delta^2\big\|\nabla c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big)\big\|_2^2 .
\end{align*}
Summing over $n \in [N]$ and $t \in [H]$,
\begin{align*}
2\delta\cdot\sum_{n=1}^{N}\sum_{t=1}^{H}\Big(c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big) - c_{(T+tL,n)}\big(\beta^*\big)\Big)
&\le \big\|\beta^* - \beta^*(T+L,1)\big\|_2^2 + \big\|\beta^*(T+L,1) - \hat{\beta}(T+L,1)\big\|_2^2\\
&\quad + \sum_{n=1}^{N}\sum_{t=1}^{H}\Big(2\big\langle\beta^*(T+(t+1)L,n) - \beta^*,\,\beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\rangle + \delta^2\big\|\nabla c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big)\big\|_2^2\Big)\\
&\le C B^2 + NH\delta^2 D^2 + \sum_{n=1}^{N}\sum_{t=1}^{H}2\big\langle\beta^*(T+(t+1)L,n) - \beta^*,\,\beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\rangle .
\end{align*}
For any $u \in \mathbb{R}^{L-1}$, $t \in [H]$, and using the definitions of $\beta^*(T+(t+1)L,n)$ and $\beta^*$, we obtain
\[
\big\langle\beta^*(T+(t+1)L,n) - \beta^*,\,u\big\rangle
= \big\langle\mathcal{P}_{\Omega_k(T+(t+1)L,n)}\beta^* - \mathcal{P}_{\Omega_k(T+HL,N)}\beta^*,\,u\big\rangle
\le \big\|\mathcal{P}_{\Omega_k(T+(t+1)L,n)} - \mathcal{P}_{\Omega_k(T+HL,N)}\big\|\,\|\beta^*\|_2\,\|u\|_2 .
\]
Hence,
\[
\sum_{n=1}^{N}\sum_{t=1}^{H}2\big\langle\beta^*(T+(t+1)L,n) - \beta^*,\,\beta^*(T+(t+1)L,n) - \hat{\tilde{\beta}}(T+(t+1)L,n)\big\rangle
\le C B^2\sum_{n=1}^{N}\sum_{t=1}^{H}\Delta(T+tL,n),
\]
which completes the proof.

Proposition I.2 (Bounding $D$). Let $B, D$ be defined as in Proposition I.1. Let $a_i(T+tL,n,j)$ and $Z_n(T+tL)$ be defined as in Section 4.4. Assume:
• $\max_{i\in[L-1],\,t\in[H+1],\,n\in[N],\,j\in[\bar{P}+t]}|a_i(T+tL,n,j)| = D_1$;
• $\max_{t\in[H+1],\,n\in[N]}|Z_n(T+tL)| = D_2$.
Then,
\[
D \le C\cdot k^{3/2}\cdot B\cdot D_1^2\cdot D_2 , \tag{60}
\]
where $C$ is an absolute constant.

Proof.
\begin{align*}
\big\|\nabla c_{(T+tL,n)}\big(\hat{\beta}(T+tL,n)\big)\big\|_2
&= \bigg\|\nabla_{\hat{\beta}(T+tL,n)}\Big(Z_n(T+tL) - \big(Z_n(T+(t-1)L+1 : T+tL-1)\big)^{T}\hat{\beta}(T+tL,n)\Big)^2\bigg\|_2\\
&= \bigg\|\nabla_{\hat{\beta}(T+tL,n)}\Big(Z_n(T+tL) - \sum_{i=1}^{k}a_i(T+tL,n,j)\,(v_i(t,n))^{T}\hat{\beta}(T+tL,n)\Big)^2\bigg\|_2\\
&= 2\,\bigg|Z_n(T+tL) - \sum_{i=1}^{k}a_i(T+tL,n,j)\,(v_i(t,n))^{T}\hat{\beta}(T+tL,n)\bigg|\;\bigg\|\sum_{i=1}^{k}a_i(T+tL,n,j)\,v_i(t,n)\bigg\|_2\\
&\le 2\Big(|Z_n(T+tL)| + k\,B\,D_1\Big)\sqrt{k}\,D_1
\le 2\Big(D_2 + k\,B\,D_1\Big)\sqrt{k}\,D_1
\le C\cdot k^{3/2}\cdot B\cdot D_1^2\cdot D_2 ,
\end{align*}
where $C$ is an absolute constant.

I.1.1 Bounding $\Delta(T+tL,n)$

Proposition I.3. Recall that $\tau_1(T+tL,n), \dots, \tau_k(T+tL,n)$ are the first $k$ singular values of $\bar{M}^{f}(T+tL,n)$. Assume, for $n \in [N]$ and $t \in [H+1]$:
• $\mathcal{P}_{\bar{\Omega}_k(T+tL,n)} = \mathcal{P}_{\bar{\Omega}_k(T+HL,N)}$;
• Properties 2.1, 2.5, 2.6 and 2.2 hold.
Let $C$ be a large enough constant defined as in Theorem 4.5. Then, with probability at least $1 - 1/(NT)$,
\[
\big\|\mathcal{P}_{\Omega_k(T+tL,n)} - \mathcal{P}_{\Omega_k(T+HL,N)}\big\|
\le \frac{C\sqrt{LNH}}{\tau_k(T+tL,n) - \tau_{k+1}(T+tL,n) - C\big(\sqrt{L} + \sqrt{N(P+H)}\big)} .
\]

Proof.
Let $\tilde{\tau}_1(T+tL,n), \dots, \tilde{\tau}_k(T+tL,n)$ be the first $k$ (ordered by magnitude) singular values of $\bar{Z}^{f}(T+tL,n)$. Define $\breve{\bar{Z}}^{f}(T+tL,n) \in \mathbb{R}^{L\times(N(P+H))}$, which is induced from $\bar{Z}^{f}(T+tL,n)$ by stacking columns of all $0$s to the right of it to make it of dimension $L\times(N(P+H))$. Let $E(T+tL,n) := \bar{Z}^{f}(T+HL,N) - \breve{\bar{Z}}^{f}(T+tL,n)$. By the Davis-Kahan $\sin\Theta$ theorem, we have
\[
\big\|\mathcal{P}_{\Omega_k(T+tL,n)} - \mathcal{P}_{\Omega_k(T+HL,N)}\big\|
\le \frac{C\,\|E(T+tL,n)\|_2}{\tilde{\tau}_k(T+tL,n) - \tilde{\tau}_{k+1}(T+tL,n)}
\le \frac{C\sqrt{NLH}}{\tilde{\tau}_k(T+tL,n) - \tilde{\tau}_{k+1}(T+tL,n)} .
\]
By the Cauchy interlacing theorem and an application of Theorem E.2 on $\bar{Z}^{f}(T+tL,n) - \bar{M}^{f}(T+tL,n)$, with an appropriately chosen large enough $C > 0$, it follows that with probability at least $1 - 1/(NT)$, we have that for $i \in [L-1]$,
\[
\big|\tilde{\tau}_i(T+tL,n) - \tau_i(T+tL,n)\big| \le C\big(\sqrt{L} + \sqrt{N(P+H)}\big).
\]
Hence,
\[
\frac{C^*\sqrt{NLH}}{\tilde{\tau}_k(T+tL,n) - \tilde{\tau}_{k+1}(T+tL,n)}
\le \frac{C^*\sqrt{NLH}}{\tau_k(T+tL,n) - \tau_{k+1}(T+tL,n) - C\big(\sqrt{L} + \sqrt{N(P+H)}\big)} .
\]

Corollary I.1.
For $n \in [N]$ and $t \in [H+1]$, assume:
• the conditions of Proposition I.3 hold;
• $\tau_k(T+tL,n) = \Omega\Big(\sqrt{\frac{LNP}{k}}\Big)$ and $\tau_{k+1}(T+tL,n) = 0$;
• $H = o(P)$.
Then for $n \in [N]$, $t \in [H]$, we have $\|\mathcal{P}_{\Omega_k(T+tL,n)} - \mathcal{P}_{\Omega_k(T+HL,N)}\| \le C\sqrt{\frac{kH}{P}}$.

Proof. Using Proposition I.3, we have
\[
\big\|\mathcal{P}_{\Omega_k(T+tL,n)} - \mathcal{P}_{\Omega_k(T+HL,N)}\big\|
\le \frac{C\sqrt{LNH}}{\tau_k(T+tL,n) - \tau_{k+1}(T+tL,n) - C\big(\sqrt{L} + \sqrt{N(P+H)}\big)}
\le C\sqrt{\frac{kH}{P}} .
\]

I.2 Proof of Theorem 4.5

Proof.
From Propositions I.1, I.2 and Corollary I.1, we have
\[
\frac{1}{NH}\,\mathrm{regret}
\le C\,k^{3}\left(\frac{1}{\delta}\Big(\frac{1}{NH} + \sqrt{\frac{H}{P}}\Big) + \delta L\right), \tag{61}
\]
with probability at least $1 - 1/(NT)$ by the union bound. Recall $L = P^{1-\epsilon_1}$ and $H = P^{1-\epsilon_2}$. Then,
\[
\frac{1}{NH} + \sqrt{\frac{H}{P}} = \frac{1}{NP^{1-\epsilon_2}} + \frac{1}{P^{0.5\epsilon_2}} .
\]
Setting
\[
\delta = \frac{1}{\sqrt{L}}\sqrt{\frac{1}{NH} + \sqrt{\frac{H}{P}}} = \frac{1}{\sqrt{L}}\sqrt{\frac{1}{NP^{1-\epsilon_2}} + \frac{1}{P^{0.5\epsilon_2}}},
\]
we get
\[
\frac{1}{\delta}\Big(\frac{1}{NH} + \sqrt{\frac{H}{P}}\Big) + \delta L
= 2\sqrt{L}\sqrt{\frac{1}{NP^{1-\epsilon_2}} + \frac{1}{P^{0.5\epsilon_2}}}
= 2\,P^{\frac{1-\epsilon_1}{2}}\sqrt{\frac{1}{NP^{1-\epsilon_2}} + \frac{1}{P^{0.5\epsilon_2}}}
= 2\sqrt{\frac{P^{\epsilon_2-\epsilon_1}}{N} + P^{1-\epsilon_1-0.5\epsilon_2}},
\]
which yields the claimed regret bound.
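The online-mSSA update analyzed throughout this section is projected online gradient descent: take a gradient step of size $\delta$ on the current squared forecasting error, then project back onto the (slowly drifting) subspace $\Omega_k$. Below is a generic sketch of that loop, with a fixed subspace and synthetic quadratic losses as simplifying assumptions (in online-mSSA the subspace is re-estimated and drifts, which is what the $\Delta(T+tL,n)$ terms account for):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, steps, delta = 20, 5, 2000, 0.01

# A fixed rank-k subspace standing in for Omega_k.
V = np.linalg.qr(rng.standard_normal((d, k)))[0]
proj = V @ V.T
beta_star = proj @ rng.standard_normal(d)    # comparator living inside the subspace

beta = np.zeros(d)
regret = 0.0
for _ in range(steps):
    x = rng.standard_normal(d)
    y = x @ beta_star + 0.1 * rng.standard_normal()
    regret += (x @ beta - y) ** 2 - (x @ beta_star - y) ** 2  # loss of played iterate
    grad = 2 * (x @ beta - y) * x            # gradient of the squared forecast error
    beta = proj @ (beta - delta * grad)      # gradient step, then project onto Omega_k

print(f"average regret over {steps} steps: {regret / steps:.4f}")
```

The telescoping argument in Proposition I.1 is exactly the standard analysis of this loop, with extra inner-product terms appearing because the projection target changes from round to round.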