Tight Risk Bound for High Dimensional Time Series Completion
PIERRE ALQUIER*, NICOLAS MARIE†, AND AMÉLIE ROSIER⋄

Abstract.
Initially designed for independent data, low-rank matrix completion was successfully applied in many domains to the reconstruction of partially observed high-dimensional time series. However, there is a lack of theory to support the application of these methods to dependent data. In this paper, we propose a general model for multivariate, partially observed time series. We show that the least-squares method with a rank penalty leads to a reconstruction error of the same order as for independent data. Moreover, when the time series has some additional properties such as periodicity or smoothness, the rate can actually be faster than in the independent case.
Contents
1. Introduction
2. Setting of the problem and notations
3. Risk bound on $\widehat{\mathbf{T}}_{k,\tau}$
4. Model selection
5. Numerical experiments
6. Proofs
6.1. Exponential inequality
6.2. A preliminary non-explicit risk bound
6.3. Proof of Theorem 3.4
6.4. Proof of Theorem 4.1
References

1. Introduction
Low-rank matrix completion methods were studied in depth in the past ten years. This was partly motivated by the popularity of the Netflix prize [8] in the machine learning community. The first theoretical papers on the topic covered matrix recovery from a few entries observed exactly [12, 13, 24]. The same problem was studied with noisy observations in [10, 11, 25, 21]. The minimax rate of estimation was derived by [28]. Since then, many estimators and many variants of this problem were studied in the statistical literature, see [37, 26, 30, 27, 33, 41, 16, 14, 4] for instance.

High-dimensional time series often have strong correlations, and it is thus natural to assume that the matrix that contains such a series is low-rank (exactly, or approximately). Many econometric models are designed to generate series with such a structure. For example, the factor model studied in [29, 31, 32, 20, 15, 22] can be interpreted as a high-dimensional autoregressive (AR) process with a low-rank transition matrix. This model (and variants) was used and studied in signal processing [7] and statistics [37, 1]. Other papers focused on a simpler model where the series is represented by a deterministic low-rank trend matrix plus some possibly correlated noise. This model was used by [45] to perform prediction, and studied in [3].

It is thus tempting to use low-rank matrix completion algorithms to recover partially observed high-dimensional time series, and this was indeed done in many applications: [44, 42, 18] used low-rank matrix completion to reconstruct data from multiple sensors. Similar techniques were used by [35, 34] to recover the electricity consumption of many households from partial observations, by [5] on panel data in economics, and by [38, 6] for policy evaluation. Some algorithms were proposed to take into account the temporal updates of the observations (see [40]). However, it is important to note that 1) all the aforementioned theory on matrix completion, for example [28], was only developed for independent observations, and 2) most papers using these techniques on time series did not provide any theoretical justification that they can be used on dependent observations. One must however mention that [19] obtained theoretical results for univariate time series prediction by embedding the time series into a Hankel matrix and using low-rank matrix completion.

In this paper, we study low-rank matrix completion for partially observed high-dimensional time series that indeed exhibit a temporal dependence. We provide a risk bound for the reconstruction of a rank-$k$ matrix, and a model selection procedure for the case where the rank $k$ is unknown. Under the assumption that the univariate series are $\varphi$-mixing, we prove that we can reconstruct the matrix with an error of the same order as in the i.i.d. case in [28]. If, moreover, the time series has some additional properties, such as the ones studied in [3] (periodicity or smoothness), the error can even be smaller than in the i.i.d. case. This is confirmed by a short simulation study.

From a technical point of view, we start by reducing the matrix completion problem to a structured regression problem as in [33]. But contrary to [33], we have here dependent observations. We thus follow the technique of [2] to obtain risk bounds for dependent observations.
In [2], it is shown that one can obtain risk bounds for dependent observations that are similar to the risk bounds for independent observations under a $\varphi$-mixing assumption, using Samson's version of Bernstein's inequality [39]. For model selection, we follow the guidelines of [36]: we introduce a penalty proportional to the rank. Using the previous risk bounds, we show that this leads to an optimal rank selection. The implementation of our procedure is based on the R package softImpute [23].

The paper is organized as follows. In Section 2, we introduce our model and the notations used throughout the paper. In Section 3, we provide the risk analysis when the rank $k$ is known. We then describe our rank selection procedure in Section 4 and show that it satisfies a sharp oracle inequality. The numerical experiments are in Section 5. All the proofs are gathered in Section 6.

2. Setting of the problem and notations
Consider $d, T \in \mathbb{N}^*$ and a $d \times T$ random matrix $\mathbf{M}$. Assume that the rows $\mathbf{M}_{1,.}, \dots, \mathbf{M}_{d,.}$ are time series and that $Y_1, \dots, Y_n$ are $n \in \{1, \dots, d \times T\}$ noisy entries of the matrix $\mathbf{M}$:

(1)  $Y_i = \mathrm{trace}(\mathbf{X}_i^* \mathbf{M}) + \xi_i$ ;  $i \in \{1, \dots, n\}$,

where $\mathbf{X}_1, \dots, \mathbf{X}_n$ are i.i.d. random matrices distributed on $\mathcal{X} := \{e_{\mathbb{R}^d}(j) e_{\mathbb{R}^T}(t)^* ;\ 1 \leq j \leq d \text{ and } 1 \leq t \leq T\}$, and $\xi_1, \dots, \xi_n$ are i.i.d. centered random variables, with standard deviation $\sigma_\xi > 0$, such that $\mathbf{X}_i$ and $\xi_i$ are independent for every $i \in \{1, \dots, n\}$.

Let us now describe the time series structure of each $\mathbf{M}_{1,.}, \dots, \mathbf{M}_{d,.}$. We assume that each series $\mathbf{M}_{j,.}$ can be decomposed as a deterministic component $\Theta_{j,.}$ plus some random noise $\varepsilon_{j,.}$. The noise can exhibit some temporal dependence: $\varepsilon_{j,t}$ will not be independent from $\varepsilon_{j,t'}$ in general. Moreover, as discussed in [3], $\Theta_{j,.}$ can have some more structure: $\Theta_{j,.} = \mathbf{T}_{0;j,.} \Lambda$ for some known matrix $\Lambda$. Examples of such structures (smoothness, periodicity) are discussed below. This gives:

(2)  $\mathbf{M} = \Theta + \varepsilon$ with $\Theta = \mathbf{T}_0 \Lambda$,

where $\varepsilon$ is a $d \times T$ random matrix having i.i.d. and centered rows, $\Lambda \in \mathcal{M}_{\tau,T}(\mathbb{C})$ ($\tau \leq T$) is known, and $\mathbf{T}_0$ is an unknown element of $\mathcal{M}_{d,\tau}(\mathbb{R})$ such that

(3)  $\sup_{j,t} |\mathbf{T}_{0;j,t}| \leq \dfrac{m}{m_\Lambda \tau}$ with $m > 0$ and $m_\Lambda = \sup_{t_1,t_2} |\Lambda_{t_1,t_2}|$.

Note that this leads to $\sup_{j,t} |\Theta_{j,t}| \leq m$.

We now make the additional assumption that the deterministic component is low-rank, reflecting the strong correlation between the different series. Precisely, we assume that $\mathbf{T}_0$ is of rank $k \in \{1, \dots, d \wedge \tau\}$: $\mathbf{T}_0 = \mathbf{U}\mathbf{V}$ with $\mathbf{U} \in \mathcal{M}_{d,k}(\mathbb{R})$ and $\mathbf{V} \in \mathcal{M}_{k,\tau}(\mathbb{R})$. The rows of the matrix $\mathbf{V}$ may be understood as latent factors. By Equations (1) and (2), for any $i \in \{1, \dots, n\}$,

(4)  $Y_i = \mathrm{trace}(\mathbf{X}_i^* \Theta) + \overline{\xi}_i$ with $\overline{\xi}_i := \mathrm{trace}(\mathbf{X}_i^* \varepsilon) + \xi_i$.

It is reasonable to assume that $\mathbf{X}_i$ and $\xi_i$, which are random terms related to the observation instrument, are independent of $\varepsilon$, which is the stochastic component of the observed process. Then, since $\xi_i$ is a centered random variable and $\varepsilon$ is a centered random matrix,

$\mathbb{E}(\overline{\xi}_i) = \mathbb{E}(\langle \mathbf{X}_i, \varepsilon \rangle_F) + \mathbb{E}(\xi_i) = \sum_{j=1}^d \sum_{t=1}^T \mathbb{E}((\mathbf{X}_i)_{j,t})\,\mathbb{E}(\varepsilon_{j,t}) = 0$.

This legitimates the following least-squares estimator of the matrix $\Theta$:

(5)  $\widehat{\Theta}_{k,\tau} = \widehat{\mathbf{T}}_{k,\tau}\Lambda$ with $\widehat{\mathbf{T}}_{k,\tau} \in \arg\min_{\mathbf{T} \in \mathcal{S}_{k,\tau}} r_n(\mathbf{T}\Lambda)$,

where $\mathcal{S}_{k,\tau}$ is a subset of

$\mathcal{M}_{d,k,\tau} := \left\{\mathbf{U}\mathbf{V} ;\ (\mathbf{U},\mathbf{V}) \in \mathcal{M}_{d,k}(\mathbb{R}) \times \mathcal{M}_{k,\tau}(\mathbb{R}) \text{ s.t. } \sup_{j,\ell}|\mathbf{U}_{j,\ell}| \leq \sqrt{\frac{m}{k\tau m_\Lambda}} \text{ and } \sup_{\ell,t}|\mathbf{V}_{\ell,t}| \leq \sqrt{\frac{m}{k\tau m_\Lambda}}\right\}$,

and

$r_n(\mathbf{A}) := \frac{1}{n}\sum_{i=1}^n (Y_i - \langle\mathbf{X}_i, \mathbf{A}\rangle_F)^2$ ;  $\forall \mathbf{A} \in \mathcal{M}_{d,T}(\mathbb{R})$.

Let us conclude this section with two examples of matrices $\Lambda$ corresponding to usual time series structures. On the one hand, if the trend of the multivariate time series $\mathbf{M}$ is $\tau$-periodic, with $T \in \tau\mathbb{N}^*$, one can take $\Lambda = (I_\tau | \cdots | I_\tau)$, and then $m_\Lambda = 1$. On the other hand, assume that for any $j \in \{1, \dots, d\}$, the trend of $\mathbf{M}_{j,.}$ is a sample on $\{1/T, 2/T, \dots, 1\}$ of a function $f_j : [0,1] \to \mathbb{R}$ belonging to a Hilbert space $H$. In this case, if $(e_n)_{n \in \mathbb{Z}}$ is a Hilbert basis of $H$, one can take $\Lambda = (e_n(t/T))_{(n,t) \in \{-N,\dots,N\} \times \{1,\dots,T\}}$. For instance, if $f_j \in \mathbb{L}^2([0,1], \mathbb{R})$, a natural choice is the Fourier basis $e_n(x) = e^{2i\pi n x}$, and then $m_\Lambda = 1$. Such a setting will result in smooth trends.
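For illustration, here is a minimal R sketch of these two structure matrices (ours, not taken from the paper's code; the dimension values are arbitrary):

```r
# Minimal R sketch (not from the paper's code) of the two structure matrices
# Lambda discussed above; tau, TT and N are arbitrary illustrative values.
tau <- 4
TT  <- 12                      # T in the paper; renamed to avoid R's T/TRUE alias

# Periodic case: Lambda = (I_tau | ... | I_tau), so each row of T0 is repeated
# with period tau in Theta = T0 %*% Lambda; here m_Lambda = 1.
Lambda_periodic <- do.call(cbind, replicate(TT / tau, diag(tau), simplify = FALSE))

# Smooth case: Fourier design with tau = 2N + 1 frequencies,
# Lambda[n, t] = e_n(t/TT) with e_n(x) = exp(2i*pi*n*x); again m_Lambda = 1.
N <- 3
Lambda_fourier <- outer(-N:N, 1:TT, function(n, t) exp(2i * pi * n * t / TT))

dim(Lambda_periodic)  # tau x TT
dim(Lambda_fourier)   # (2N + 1) x TT, complex-valued
```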
Notations and basic definitions.
Throughout the paper, $\mathcal{M}_{d,T}(\mathbb{R})$ is equipped with the Frobenius scalar product

$\langle .,. \rangle_F : (\mathbf{A}, \mathbf{B}) \in \mathcal{M}_{d,T}(\mathbb{R})^2 \longmapsto \mathrm{trace}(\mathbf{A}^*\mathbf{B}) = \sum_{j,t} \mathbf{A}_{j,t}\mathbf{B}_{j,t}$

or with the spectral norm

$\|.\|_{\mathrm{op}} : \mathbf{A} \in \mathcal{M}_{d,T}(\mathbb{R}) \longmapsto \sup_{\|x\|=1}\|\mathbf{A}x\| = \sigma_1(\mathbf{A})$.

Let us finally recall the definition of the $\varphi$-mixing condition on stochastic processes. Given two $\sigma$-algebras $\mathcal{A}$ and $\mathcal{B}$, we define the $\varphi$-mixing coefficient between $\mathcal{A}$ and $\mathcal{B}$ by

$\varphi(\mathcal{A},\mathcal{B}) := \sup\{|\mathbb{P}(B) - \mathbb{P}(B|A)| ;\ (A,B) \in \mathcal{A}\times\mathcal{B},\ \mathbb{P}(A) \neq 0\}$.

When $\mathcal{A}$ and $\mathcal{B}$ are independent, $\varphi(\mathcal{A},\mathcal{B}) = 0$; more generally, this coefficient measures how dependent $\mathcal{A}$ and $\mathcal{B}$ are. Given a process $(Z_t)_{t \in \mathbb{Z}}$, we define its $\varphi$-mixing coefficients by

$\varphi_Z(i) := \sup\{\varphi(\sigma(Z_h, h \leq t), \sigma(Z_\ell, \ell \geq t+i)) ;\ t \in \mathbb{Z}\}$.

Some properties and examples of applications of $\varphi$-mixing coefficients can be found in [17].

3. Risk bound on $\widehat{\mathbf{T}}_{k,\tau}$

First of all, since $\mathbf{X}_1, \dots, \mathbf{X}_n$ are i.i.d. $\mathcal{X}$-valued random matrices, there exists a probability measure $\Pi$ on $\mathcal{X}$ such that $\mathbb{P}_{\mathbf{X}_i} = \Pi$ for all $i \in \{1, \dots, n\}$. In addition to the two norms on $\mathcal{M}_{d,T}(\mathbb{R})$ introduced above, let us consider the scalar product $\langle .,. \rangle_{F,\Pi}$ defined on $\mathcal{M}_{d,T}(\mathbb{R})$ by

$\langle\mathbf{A},\mathbf{B}\rangle_{F,\Pi} := \int_{\mathcal{M}_{d,T}(\mathbb{R})} \langle X,\mathbf{A}\rangle_F \langle X,\mathbf{B}\rangle_F\,\Pi(dX)$ ;  $\forall \mathbf{A},\mathbf{B} \in \mathcal{M}_{d,T}(\mathbb{R})$.

Remarks:
(1) For any deterministic $d \times T$ matrices $\mathbf{A}$ and $\mathbf{B}$, $\langle\mathbf{A},\mathbf{B}\rangle_{F,\Pi} = \mathbb{E}(\langle\mathbf{A},\mathbf{B}\rangle_n)$, where $\langle .,. \rangle_n$ is the empirical scalar product on $\mathcal{M}_{d,T}(\mathbb{R})$ defined by

$\langle\mathbf{A},\mathbf{B}\rangle_n := \frac{1}{n}\sum_{i=1}^n \langle\mathbf{X}_i,\mathbf{A}\rangle_F\langle\mathbf{X}_i,\mathbf{B}\rangle_F$.

However, note that this relationship between $\langle .,. \rangle_{F,\Pi}$ and $\langle .,. \rangle_n$ does not hold anymore when $\mathbf{A}$ and $\mathbf{B}$ are random matrices.
(2) Note that if the sampling distribution $\Pi$ is uniform, then $\|.\|_{F,\Pi} = (dT)^{-1/2}\|.\|_F$.
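As a quick numerical illustration of Remark (2), the following R sketch (ours, with arbitrary dimensions) checks that the empirical squared norm concentrates around $(dT)^{-1}\|\mathbf{A}\|_F^2$ under uniform sampling:

```r
# Sketch checking that E(<A, A>_n) = ||A||_{F,Pi}^2 = ||A||_F^2 / (d*T) under
# uniform sampling; d, TT and n are arbitrary illustrative values.
set.seed(1)
d <- 50; TT <- 40; n <- 5000
A <- matrix(rnorm(d * TT), d, TT)

# Each mask X_i = e_j e_t^* is encoded by the couple chi_i = (j, t), drawn uniformly.
j <- sample.int(d,  n, replace = TRUE)
t <- sample.int(TT, n, replace = TRUE)

# <X_i, A>_F is simply the sampled entry A[j_i, t_i].
mean(A[cbind(j, t)]^2)   # empirical norm <A, A>_n
sum(A^2) / (d * TT)      # ||A||_{F,Pi}^2; the two values are close for large n
```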
Notation. For every $i \in \{1, \dots, n\}$, let $\chi_i$ be the couple of coordinates of the nonzero element of $\mathbf{X}_i$, which is an $E$-valued random variable with $E = \{1, \dots, d\} \times \{1, \dots, T\}$.

In the sequel, $\varepsilon$, $\xi_1, \dots, \xi_n$ and $\mathbf{X}_1, \dots, \mathbf{X}_n$ fulfill the following additional conditions.

Assumption 3.1.
The rows of $\varepsilon$ are independent and identically distributed. There is a process $(\varepsilon_t)_{t \in \mathbb{Z}}$ such that each $\varepsilon_{j,.}$ has the same distribution as $(\varepsilon_1, \dots, \varepsilon_T)$, and such that

$\Phi_\varepsilon := 1 + \sum_{i=1}^n \varphi_\varepsilon(i)^{1/2} < \infty$.

Assumption 3.2.
There exist two deterministic constants $m_\varepsilon > 0$ and $m_\xi > 0$ such that

$\sup_{j,t}|\varepsilon_{j,t}| \leq m_\varepsilon$ and $\sup_{i \in \{1,\dots,n\}}|\xi_i| \leq m_\xi$ a.s.

Assumption 3.3.
There is a constant $c_\Pi > 0$ such that

$\Pi(\{e_{\mathbb{R}^d}(j)e_{\mathbb{R}^T}(t)^*\}) \leq \frac{c_\Pi}{dT}$ ;  $\forall (j,t) \in E$.

Note that when the sampling distribution $\Pi$ is uniform, Assumption 3.3 is trivially satisfied with $c_\Pi = 1$. Moreover, note that under Assumption 3.2, $Y_i = \Theta_{\chi_i} + \varepsilon_{\chi_i} + \xi_i$ for every $i \in \{1, \dots, n\}$, and so

$\sup_{i \in \{1,\dots,n\}}|Y_i| \leq m + m_\varepsilon + m_\xi$ a.s.

Theorem 3.4.
Let $\alpha \in (0,1)$. Under Assumptions 3.1, 3.2 and 3.3, if $n \geq (d\tau)^{1/2}(k(d+\tau))^{-1}$, then

$\|\widehat\Theta_{k,\tau} - \Theta\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T} \in \mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + c_{3.4}\left[\frac{k(d+\tau)\log(n)}{n} + \frac{1}{n}\log\left(\frac{2}{\alpha}\right)\right]$

with probability larger than $1-\alpha$, where $c_{3.4}$ is a constant depending only on $m$, $m_\xi$, $m_\varepsilon$, $m_\Lambda$, $\Phi_\varepsilon$ and $c_\Pi$.

Actually, from the proof of the theorem, we know $c_{3.4}$ explicitly. Indeed, $c_{3.4} = 4c_{6.4,1} + 9c_{6.4,2}\,m\,m_\Lambda^{-1}$, where $c_{6.4,1}$ and $c_{6.4,2}$ are constants (made explicit in Theorem 6.4 in Section 6) depending themselves only on $m$, $m_\xi$, $m_\varepsilon$, $m_\Lambda$, $\Phi_\varepsilon$ and $c_\Pi$.
Remark.
The $\varphi$-mixing assumption (Assumption 3.1) is known to be restrictive; we refer the reader to [17], where it is compared to other mixing conditions. Some examples are provided in Examples 7, 8 and 9 in [2], including stationary AR processes with a noise that has a density with respect to the Lebesgue measure on $[-c,c]$. Interestingly, [2] also discusses weaker notions of dependence. Under these conditions, we could here apply the inequalities used in [2], but it is important to note that this would prevent us from taking $\lambda$ of the order of $n$ in the proof of Proposition 6.1. In other words, this would deteriorate the rates of convergence. A complete study of all the possible dependence conditions on $\varepsilon$ goes beyond the scope of this paper.

Finally, let us focus on the rate of convergence, in general and in the specific case of time series with smooth trends belonging to a Sobolev ellipsoid. First, note that the constant $c_{3.4}$ in Theorem 3.4 does not depend on $\Lambda$:

$c_{3.4} = 4c_{6.4,1} + 9c_{6.4,2}\,m\,m_\Lambda^{-1} = 128(c_{6.1}^{-1}\wedge\lambda^*)^{-1} + 36m(3m+m_\varepsilon+m_\xi) = 128((4\max\{m^2, m_\xi^2, m_\varepsilon^2, m_\varepsilon^2\Phi_\varepsilon^2 c_\Pi\}) \vee (16m\max\{m, m_\varepsilon, m_\xi\})) + 36m(3m+m_\varepsilon+m_\xi)$.

So, the variance term in the risk bound on $\widehat\Theta_{k,\tau}$ depends on the time series structure through $\tau$ only. In the specific cases of periodic or smooth trends, as mentioned at the end of Section 2, $m_\Lambda = 1$, and then $\mathcal{S}_{k,\tau}$ is a subset of

$\mathcal{M}_{d,k,\tau} = \left\{\mathbf{U}\mathbf{V} ;\ (\mathbf{U},\mathbf{V}) \in \mathcal{M}_{d,k}(\mathbb{R})\times\mathcal{M}_{k,\tau}(\mathbb{R}) \text{ s.t. } \sup_{j,\ell}|\mathbf{U}_{j,\ell}| \leq \sqrt{\frac{m}{k\tau}} \text{ and } \sup_{\ell,t}|\mathbf{V}_{\ell,t}| \leq \sqrt{\frac{m}{k\tau}}\right\}$.

If $\mathbf{T}_0 \in \mathcal{S}_{k,\tau}$, then the bias term in the risk bound on $\widehat\Theta_{k,\tau}$ is null and

$\|\widehat\Theta_{k,\tau} - \Theta\|_{F,\Pi}^2 = O\left(\frac{k(d+\tau)\log(n)}{n}\right)$.

In other words, the rate of convergence is thus $k(d+\tau)\log(n)/n$. This is to be compared with the rate in the i.i.d. case: $k(d+T)\log(n)/n$. First, when our series does not have a given structure, $\tau = T$ and the rates are the same. However, when there is a strong structure, for example when the series is periodic, we have $\tau \ll T$ and our rate is actually better.

Now, consider the Fourier basis $(e_n)_{n \in \mathbb{Z}}$ and $\tau = 2N+1$ with $N \in \mathbb{N}^*$. In the sequel, assume that

$\Lambda = \left(e_n\left(\frac{t}{T}\right)\right)_{(n,t)\in\{-N,\dots,N\}\times\{1,\dots,T\}}$

and

$\mathcal{S}_{k,\tau} = \mathcal{S}_{k,\beta,L} := \{\mathbf{T} \in \mathcal{M}_{d,k,\tau} : \forall j = 1,\dots,d,\ \exists f_j \in W(\beta,L),\ \forall n = -N,\dots,N,\ \mathbf{T}_{j,n} = c_n(f_j)\}$,

where $\beta \in \mathbb{N}^*$, $L > 0$,

$W(\beta,L) := \left\{f \in \mathcal{C}^{\beta-1}([0,1],\mathbb{R}) : \int_0^1 f^{(\beta)}(x)^2\,dx \leq L^2\right\}$

is a Sobolev ellipsoid, and $c_n(\varphi)$ is the Fourier coefficient of order $n \in \mathbb{Z}$ of $\varphi \in W(\beta,L)$. Thanks to Tsybakov [43], Chapter 1, there exists a constant $c_{\beta,L} > 0$ such that for every $f \in W(\beta,L)$,

$\frac{1}{T}\sum_{t=1}^T\left|f\left(\frac{t}{T}\right) - \sum_{n=-N}^N c_n(f)e_n\left(\frac{t}{T}\right)\right|^2 \leq c_{\beta,L}N^{-2\beta}$.

So, if $\Theta = (f_j(t/T))_{j,t}$ with $f_j \in W(\beta,L)$ for $j = 1,\dots,d$, and if the sampling distribution $\Pi$ is uniform, then

$\min_{\mathbf{T}\in\mathcal{S}_{k,\beta,L}}\|\mathbf{T}\Lambda - \Theta\|_{F,\Pi}^2 = \frac{1}{dT}\sum_{j,t}\left|f_j\left(\frac{t}{T}\right) - \sum_{n=-N}^N c_n(f_j)e_n\left(\frac{t}{T}\right)\right|^2 \leq c_{\beta,L}N^{-2\beta}$.
By Theorem 3.4, for $n \geq (d\tau)^{1/2}(k(d+\tau))^{-1}$, with probability larger than $1-\alpha$,

$\|\widehat\Theta_{k,\tau} - \Theta\|_{F,\Pi}^2 \leq 3c_{\beta,L}N^{-2\beta} + c_{3.4}\left[\frac{k(d+2N+1)\log(n)}{n} + \frac{1}{n}\log\left(\frac{2}{\alpha}\right)\right]$.

Therefore, by assuming that $\beta$ is known, the bias-variance tradeoff is reached for

$N = N_{\mathrm{opt}} := \left\lfloor\left(\frac{2\beta c_{\beta,L}}{c_{3.4}k}\cdot\frac{n}{\log(n)}\right)^{1/(2\beta+1)}\right\rfloor$,

and with probability larger than $1-\alpha$,

$\|\widehat\Theta_{k,\tau} - \Theta\|_{F,\Pi}^2 \leq 3c_{\beta,L}N_{\mathrm{opt}}^{-2\beta} + 2c_{3.4}\frac{kN_{\mathrm{opt}}\log(n)}{n} + c_{3.4}\left[\frac{k(d+1)\log(n)}{n} + \frac{1}{n}\log\left(\frac{2}{\alpha}\right)\right]$
$= c\left[c_{\beta,L}^{1/(2\beta+1)}\left(\frac{k\log(n)}{n}\right)^{2\beta/(2\beta+1)} + \frac{k(d+1)\log(n)}{n} + \frac{1}{n}\log\left(\frac{2}{\alpha}\right)\right]$

with $c = [(3(2\beta)^{-2\beta/(2\beta+1)} + 2(2\beta)^{1/(2\beta+1)})\,c_{3.4}^{2\beta/(2\beta+1)}] \vee c_{3.4}$.
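Since $c_{\beta,L}$ and $c_{3.4}$ are unknown in practice, the following small R helper (ours, with placeholder constants) simply evaluates the tradeoff formula above:

```r
# Sketch of the bias-variance tradeoff for the number of Fourier frequencies;
# c_bias and c_var stand for c_{beta,L} and c_{3.4}, unknown in practice.
N_opt <- function(n, k, beta, c_bias = 1, c_var = 1) {
  floor((2 * beta * c_bias / (c_var * k) * n / log(n))^(1 / (2 * beta + 1)))
}
N_opt(n = 1e5, k = 5, beta = 2)  # tau = 2 * N_opt + 1 frequencies are kept
```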
4. Model selection

The purpose of this section is to provide a selection method for the parameter $k$. First, for the sake of readability, $\mathcal{S}_{k,\tau}$ and $\widehat{\mathbf{T}}_{k,\tau}$ are respectively denoted by $\mathcal{S}_k$ and $\widehat{\mathbf{T}}_k$ in the sequel. The adaptive estimator studied here is $\widehat\Theta := \widehat{\mathbf{T}}\Lambda$, where $\widehat{\mathbf{T}} := \widehat{\mathbf{T}}_{\widehat k}$,

$\widehat k \in \arg\min_{k\in\mathcal{K}}\left\{r_n(\widehat{\mathbf{T}}_k\Lambda) + \mathrm{pen}(k)\right\}$ with $\mathcal{K} = \{1, \dots, k^*\} \subset \mathbb{N}^*$,

and

$\mathrm{pen}(k) := 16c_{\mathrm{cal}}\frac{\log(n)}{n}k(d+\tau)$ with $c_{\mathrm{cal}} = 2\left(\frac{1}{c_{6.1}}\wedge\lambda^*\right)^{-1}$.
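For illustration, a minimal R sketch of this selection rule (our own, built on softImpute as in Section 5; `pen_const` stands for the factor $16c_{\mathrm{cal}}$, which is unknown in practice) could be:

```r
# Hedged sketch of the penalized rank selection; Y_obs is a d x T matrix with
# NAs at unobserved entries and Lambda the known tau x T structure matrix.
library(softImpute)

select_rank <- function(Y_obs, Lambda, k_max, pen_const = 1) {
  d <- nrow(Y_obs); tau <- nrow(Lambda); n <- sum(!is.na(Y_obs))
  crit <- sapply(seq_len(k_max), function(k) {
    fit   <- softImpute(Y_obs, rank.max = k, lambda = 0)        # rank-k least squares fit
    Theta <- as.matrix(fit$u) %*% (fit$d * t(as.matrix(fit$v))) # low-rank reconstruction
    r_n   <- mean((Y_obs - Theta)^2, na.rm = TRUE)              # empirical risk r_n on observed entries
    r_n + pen_const * log(n) / n * k * (d + tau)                # pen(k) proportional to k(d + tau)
  })
  which.min(crit)
}
```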
Theorem 4.1. Under Assumptions 3.1, 3.2 and 3.3, if $n \geq (d\tau)^{1/2}(d+\tau)^{-1}$, then

$\|\widehat\Theta - \Theta\|_{F,\Pi}^2 \leq \min_{k\in\mathcal{K}}\left\{3\min_{\mathbf{T}\in\mathcal{S}_k}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + c_{4.1,1}\frac{k(d+\tau)\log(n)}{n}\right\} + \frac{c_{4.1,1}}{n}\log\left(\frac{2k^*}{\alpha}\right) + c_{4.1,2}\frac{d^{1/2}\tau^{1/2}}{n}$

with probability larger than $1-\alpha$, where $c_{4.1,1} = 4c_{3.4} + 16c_{\mathrm{cal}}$ and $c_{4.1,2} = 9c_{6.4,2}\,m\,m_\Lambda^{-1}$.

5. Numerical experiments
This section deals with numerical experiments on the estimator of the matrix $\mathbf{T}_0$ introduced in Section 2. The R package softImpute is used. Our experiments are done on data simulated in the following way:
(1) We generate a matrix $\mathbf{T}_0 = \mathbf{U}\mathbf{V}$ with $\mathbf{U} \in \mathcal{M}_{d,k}(\mathbb{R})$ and $\mathbf{V} \in \mathcal{M}_{k,\tau}(\mathbb{R})$. The entries of $\mathbf{U}$ and $\mathbf{V}$ are generated independently by simulating i.i.d. $\mathcal{N}(0,1)$ random variables.
(2) We multiply $\mathbf{T}_0$ by a known matrix $\Lambda \in \mathcal{M}_{\tau,T}(\mathbb{R})$. This matrix depends on the time series structure assumed on $\mathbf{M}$. Here, we consider the periodic case: $T = p\tau$, $p \in \mathbb{N}^*$ and $\Lambda = (I_\tau|\dots|I_\tau)$. We use the notation $\Lambda^+$ for the pseudo-inverse of $\Lambda$, which satisfies $\Lambda^+ = \Lambda^*(\Lambda\Lambda^*)^{-1}$ because $\Lambda$ is of full rank $\tau$.
(3) The matrix $\mathbf{M}$ is then obtained by adding a matrix $\varepsilon$ such that $\varepsilon_{1,.}, \dots, \varepsilon_{d,.}$ are generated independently by simulating i.i.d. AR(1) processes with compactly supported errors, in order to meet the $\varphi$-mixing condition. To keep the noise relatively small, so as to obtain a relevant estimation at the end, we multiply $\varepsilon$ by a small coefficient $\sigma_\varepsilon$.

Only 30% of the entries of $\mathbf{M}$, taken at random, are observed. This sample of entries is then corrupted by i.i.d. Gaussian observation errors $\xi_1, \dots, \xi_n$. Note that we keep the same percentage of observed entries throughout this section, so the number $n$ of corrupted entries will vary according to the dimension $d \times T$.

Given the observed entries, our goal is to complete the missing values of the matrix and check whether they correspond to the simulated data. The output given by the function complete of softImpute needs to be multiplied by $\Lambda^+$ in order to obtain an estimator of the matrix $\mathbf{T}_0$; the whole protocol is sketched below. We will evaluate the MSE of the estimator with respect to several parameters and show that there is a gain in taking into account the time series structure in the model. As expected, the more $\Theta$ is perturbed, either with $\varepsilon$ or $\xi_1, \dots, \xi_n$, the more difficult it is to reconstruct the matrix. In the same way, increasing the value of the rank $k$ leads to a worse estimation. Finally, we study the effect of replacing the uniform errors in each AR(1) by Gaussian ones.
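Here is a minimal R sketch of this protocol (ours; the noise levels, the AR(1) coefficient and the reduced dimensions are placeholder choices, since the text does not fix all of them):

```r
# Minimal sketch of the simulation protocol above; sigma_eps, the observation
# noise level and the AR(1) coefficient are placeholder values.
library(softImpute)
set.seed(42)

d <- 200; tau <- 20; p <- 10; TT <- p * tau; k <- 2

# (1) Low-rank factor matrix T0 = U V with i.i.d. N(0,1) entries.
U  <- matrix(rnorm(d * k), d, k)
V  <- matrix(rnorm(k * tau), k, tau)
T0 <- U %*% V

# (2) Periodic structure matrix Lambda = (I_tau | ... | I_tau) and its
# pseudo-inverse Lambda^+ = Lambda^* (Lambda Lambda^*)^{-1}.
Lambda      <- do.call(cbind, replicate(p, diag(tau), simplify = FALSE))
Lambda_plus <- t(Lambda) %*% solve(Lambda %*% t(Lambda))
Theta       <- T0 %*% Lambda

# (3) AR(1) rows with bounded (uniform) innovations, so that the phi-mixing
# condition holds; a burn-in removes the influence of the initial value.
ar1_bounded <- function(len, a = 0.5, burn = 100) {
  e <- runif(len + burn, -1, 1)
  x <- numeric(len + burn)
  for (s in 2:(len + burn)) x[s] <- a * x[s - 1] + e[s]
  x[-(1:burn)]
}
sigma_eps <- 0.1
M <- Theta + sigma_eps * t(replicate(d, ar1_bounded(TT)))

# Observe 30% of the entries at random, corrupted by i.i.d. Gaussian noise.
mask <- matrix(runif(d * TT) < 0.3, d, TT)
Y    <- ifelse(mask, M + matrix(rnorm(d * TT, sd = 0.1), d, TT), NA)

# Complete with softImpute, then recover T_hat = Theta_hat %*% Lambda^+.
fit       <- softImpute(Y, rank.max = k, lambda = 0)
Theta_hat <- fit$u %*% (fit$d * t(fit$v))
T_hat     <- Theta_hat %*% Lambda_plus
mean((T_hat - T0)^2)  # MSE on the factor matrix
```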
The first experiments are done with $d = 1000$ and $p = 10$. Here are the MSE obtained for three values of the dimension $T$ (100, 500 and 1000), three values of the rank $k$ (2, 5 and 9), and two kinds of errors in the AR(1) processes involved in $\varepsilon$ (uniform vs. Gaussian):

[Table 1. MSE for $k = 2$, with respect to $d \times T \in \{1000\times 100,\ 1000\times 500,\ 1000\times 1000\}$, for uniform and Gaussian errors.]

[Table 2. MSE for $k = 5$, with respect to $d \times T \in \{1000\times 100,\ 1000\times 500,\ 1000\times 1000\}$, for uniform and Gaussian errors.]

[Table 3. MSE for $k = 9$, with respect to $d \times T \in \{1000\times 100,\ 1000\times 500,\ 1000\times 1000\}$, for uniform and Gaussian errors.]

Thus, both the rank $k$ and the nature of the errors in $\varepsilon$ seem to play a key role in the reduction of the MSE. Regarding the dimension $T$ ($k$ and $d$ being fixed), our numerical results are consistent with the theoretical rate of convergence of order $O(k(d+\tau)\log(n)/n)$ obtained in Theorem 3.4. Indeed, the MSE shrinks when $T$ increases, whatever the value of the rank $k$ or the errors considered, which confirms that $T$ has no impact on the MSE when we add the time series structure in our model. On the contrary, in the model without time series structure, the MSE increases when $T$ increases, which is also consistent with the theoretical rate of convergence of order $O(k(d+T)\log(n)/n)$.

For each tested value of $k$ and $T$, the MSE is smaller for Gaussian errors than for uniform errors in $\varepsilon$. The gap between the MSEs, especially when the dimension $T$ goes from 100 to 500, is huge when the rank $k$ is high. Increasing the rank $k$ significantly degrades the MSE, even with Gaussian errors and a high value of $T$.

Another interesting study consists in comparing the MSE with or without (classic model) taking into account the time series structure of the dataset. This means taking $\mathbf{M} = \mathbf{U}\mathbf{V}\Lambda + \varepsilon$ or $\mathbf{M} = \mathbf{U}\mathbf{V} + \varepsilon$ in Model (1). On time series data, the MSE obtained with the classic model is expected to be worse than the one obtained with our model. The following experiment shows the evolution of the MSE with respect to the rank $k$ for both models. We take $d = T = 1000$, the $\xi_i$'s are i.i.d. Gaussian random variables, and $\varepsilon_{1,.}, \dots, \varepsilon_{d,.}$ are i.i.d. AR(1) processes with Gaussian errors. Finally, recall that $p = 10$, so $\tau = 100$ in our model.

[Figure 1. Models (time series, solid line, vs. classic, dotted line): MSEs with respect to the rank $k$.]

As expected, the MSE is much better for the model taking into account the time series structure. As we said, the estimation seems to be more precise with Gaussian errors in $\varepsilon$, and the more $\Theta$ is perturbed via $\varepsilon$ or $\xi_1, \dots, \xi_n$, the more the completion process is complicated and the MSE degrades. So, we now evaluate the consequence on the MSE of changing the value of $\sigma_\varepsilon$. For both models (with or without taking into account the time series structure), the following figure shows the evolution of the MSE with respect to $\sigma_\varepsilon$ when the errors in $\varepsilon$ are Gaussian random variables and all the other parameters remain the same as previously. Note that this time, the MSE is not multiplied by 100 and we kept the original values.

[Figure 2. Models (time series, solid line, vs. classic, dotted line): MSEs with respect to $\sigma_\varepsilon$, Gaussian errors.]

Once again, as expected, the MSE with our model is smaller than the one with the classic model for each value of $\sigma_\varepsilon$. The fact that the MSE increases with respect to $\sigma_\varepsilon$ for both models illustrates that more noise always complicates the completion process. In our experiments, the values of $\sigma_\varepsilon$ range from 0.01 to 1, and we can notice that, even with $\sigma_\varepsilon$ close to 1, the MSE sticks to very small values with our model, which means a good estimation. We have the following results:

[Table 4. MIN and MAX values reached by the MSE with Gaussian errors in $\varepsilon$, for the models with and without time series structure.]

Let us do the same experiment but with uniform errors in $\varepsilon$.

[Figure 3. Models (time series, solid line, vs. classic, dotted line): MSEs with respect to $\sigma_\varepsilon$, uniform errors.]

The shape of the curves is pretty much the same as in the previous graph: the MSE with our model is still smaller than with the classic model. However, this time the MSE reaches higher values for both models:

[Table 5. MIN and MAX values reached by the MSE with uniform errors in $\varepsilon$, for the models with and without time series structure.]

6. Proofs
This section is organized as follows. We first state an exponential inequality that will serve as a basis for all the proofs. From this inequality, we prove Theorem 6.4, a prototype of Theorem 3.4 that holds when the set $\mathcal{S}_{k,\tau}$ is finite, or infinite but compact, by using $\epsilon$-nets ($\epsilon > 0$). In the proof of Theorem 3.4, we provide an explicit risk bound by using the $\epsilon$-net $\mathcal{S}_{k,\tau}^\epsilon$ of $\mathcal{S}_{k,\tau}$ constructed in Candès and Plan [11], Lemma 3.1.

6.1. Exponential inequality. This section deals with the proof of the following exponential inequality, the cornerstone of the paper, which is derived from the usual Bernstein inequality and its extension to $\varphi$-mixing processes due to Samson [39]. In the sequel, $\mathbf{T}_0$ denotes the matrix of (2), so that $\Theta = \mathbf{T}_0\Lambda$.

Proposition 6.1. Under Assumptions 3.1, 3.2 and 3.3,

(6)  $\mathbb{E}\left[\exp\left(\lambda\left(\left(1 + \frac{c_{6.1}\lambda}{n}\right)(R(\mathbf{T}_0\Lambda) - R(\mathbf{T}\Lambda)) + r_n(\mathbf{T}\Lambda) - r_n(\mathbf{T}_0\Lambda)\right)\right)\right] \leq 1$

and

(7)  $\mathbb{E}\left[\exp\left(\lambda\left(\left(1 - \frac{c_{6.1}\lambda}{n}\right)(R(\mathbf{T}\Lambda) - R(\mathbf{T}_0\Lambda)) + r_n(\mathbf{T}_0\Lambda) - r_n(\mathbf{T}\Lambda)\right)\right)\right] \leq 1$

for every $\mathbf{T} \in \mathcal{S}_{k,\tau}$ and $\lambda \in (0, n\lambda^*)$, where

$R(\mathbf{A}) := \mathbb{E}(|Y_1 - \langle\mathbf{X}_1,\mathbf{A}\rangle_F|^2)$ ;  $\forall\mathbf{A} \in \mathcal{M}_{d,T}(\mathbb{R})$,

$c_{6.1} = 4\max\{m^2, m_\xi^2, m_\varepsilon^2, m_\varepsilon^2\Phi_\varepsilon^2 c_\Pi\}$ and $\lambda^* = (16m\max\{m, m_\varepsilon, m_\xi\})^{-1}$.

Proof of Proposition 6.1. The proof relies on Bernstein's inequality as stated in [9], which we recall in the following lemma.
Lemma 6.2. Let $T_1, \dots, T_n$ be some independent and real-valued random variables. Assume that there are $v > 0$ and $c > 0$ such that

$\sum_{i=1}^n \mathbb{E}(T_i^2) \leq v$ and, for any $q \geq 3$, $\sum_{i=1}^n \mathbb{E}((T_i)_+^q) \leq \frac{q!}{2}vc^{q-2}$.

Then, for every $\lambda \in (0, 1/c)$,

$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n(T_i - \mathbb{E}(T_i))\right)\right] \leq \exp\left(\frac{v\lambda^2}{2(1-c\lambda)}\right)$.

We will also use a variant of this inequality for time series due to Samson, stated in the proof of Theorem 3 in [39].
Lemma 6.3. Consider $m \in \mathbb{N}^*$, $M > 0$, a stationary sequence of $\mathbb{R}^m$-valued random variables $Z = (Z_t)_{t\in\mathbb{Z}}$, and

$\Phi_Z := 1 + \sum_{t=1}^T \varphi_Z(t)^{1/2}$,

where the $\varphi_Z(t)$, $t \in \mathbb{Z}$, are the $\varphi$-mixing coefficients of $Z$. For every smooth and convex function $f : [0,M]^T \to \mathbb{R}$ such that $\|\nabla f\| \leq L$ a.e., for any $\lambda > 0$,

$\mathbb{E}(\exp(\lambda(f(Z_1,\dots,Z_T) - \mathbb{E}[f(Z_1,\dots,Z_T)]))) \leq \exp\left(\frac{\lambda^2 L^2\Phi_Z^2 M^2}{2}\right)$.

Let $\mathbf{T} \in \mathcal{S}_{k,\tau}$ be arbitrarily chosen. Consider the deterministic map $\mathbf{X} : E \to \mathcal{M}_{d,T}(\mathbb{R})$ such that $\mathbf{X}_i = \mathbf{X}(\chi_i)$ for all $i \in \{1,\dots,n\}$, let $\Xi_i := (\xi_i, \chi_i)$ for any $i \in \{1,\dots,n\}$, and let $h : \mathbb{R}\times E \to \mathbb{R}$ be the map defined by

$h(x,y) := \frac{1}{n}\left(2x\langle\mathbf{X}(y), (\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F + \langle\mathbf{X}(y), (\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2\right)$ ;  $\forall(x,y) \in \mathbb{R}\times E$.
Note that

$h(\Xi_i) = \frac{1}{n}\left(2\overline\xi_i\langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F + \langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2\right) = \frac{1}{n}\left((\overline\xi_i + \langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F)^2 - \overline\xi_i^2\right) = \frac{1}{n}\left((Y_i - \langle\mathbf{X}_i,\mathbf{T}\Lambda\rangle_F)^2 - (Y_i - \langle\mathbf{X}_i,\mathbf{T}_0\Lambda\rangle_F)^2\right)$

and

$\sum_{i=1}^n(h(\Xi_i) - \mathbb{E}(h(\Xi_i))) = r_n(\mathbf{T}\Lambda) - r_n(\mathbf{T}_0\Lambda) + R(\mathbf{T}_0\Lambda) - R(\mathbf{T}\Lambda)$.

Now, replacing $\overline\xi_i$ by its expression in terms of $\mathbf{X}_i$, $\xi_i$ and $\varepsilon$,

$\sum_{i=1}^n(h(\Xi_i) - \mathbb{E}(h(\Xi_i))) = \sum_{i=1}^n\frac{2}{n}\xi_i\langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F + \sum_{i=1}^n\frac{2}{n}\langle\mathbf{X}_i,\varepsilon\rangle_F\langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F + \sum_{i=1}^n\left(\frac{1}{n}\langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2 - \mathbb{E}(h(\Xi_i))\right) =: \sum_{i=1}^n A_i + \sum_{i=1}^n B_i + \sum_{i=1}^n(C_i - \mathbb{E}(h(\Xi_i)))$.

In order to conclude by using Lemmas 6.2 and 6.3, let us provide suitable bounds for the exponential moments of each term of this decomposition.

• Bounds for the $A_i$'s and the $C_i$'s. First, note that since $\mathbf{X}_1$, $\xi_1$ and $\varepsilon$ are independent,

(8)  $R(\mathbf{T}\Lambda) - R(\mathbf{T}_0\Lambda) = \mathbb{E}((Y_1-\langle\mathbf{X}_1,\mathbf{T}\Lambda\rangle_F)^2 - (Y_1-\langle\mathbf{X}_1,\mathbf{T}_0\Lambda\rangle_F)^2) = 2\mathbb{E}(\overline\xi_1\langle\mathbf{X}_1,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F) + \mathbb{E}(\langle\mathbf{X}_1,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2) = 2\langle\mathbb{E}(\langle\mathbf{X}_1,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F\mathbf{X}_1),\mathbb{E}(\varepsilon)\rangle_F + 2\mathbb{E}(\xi_1)\mathbb{E}(\langle\mathbf{X}_1,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F) + \|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 = \|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2$.

On the one hand, $|A_i| \leq 4m_\xi m/n$ and

$\sum_{i=1}^n\mathbb{E}(A_i^2) \leq \frac{4}{n}m_\xi^2\mathbb{E}(\langle\mathbf{X}_1,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2) = \frac{4}{n}m_\xi^2(R(\mathbf{T}\Lambda) - R(\mathbf{T}_0\Lambda))$

thanks to Equality (8). So, we can use Lemma 6.2 with $v = \frac{4}{n}m_\xi^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))$ and $c = \frac{4m_\xi m}{n}$ to obtain

$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n A_i\right)\right] \leq \exp\left[\frac{2m_\xi^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))\lambda^2}{n - 4m_\xi m\lambda}\right]$

for any $\lambda \in (0, n/(4m_\xi m))$. On the other hand, in the same way, $|C_i| \leq 4m^2/n$ and

$\sum_{i=1}^n\mathbb{E}(C_i^2) \leq \frac{4m^2}{n}\mathbb{E}(\langle\mathbf{X}_1,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2) = \frac{4}{n}m^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))$

thanks to Equality (8). So, we can use Lemma 6.2 with $v = \frac{4}{n}m^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))$ and $c = \frac{4m^2}{n}$ to obtain

$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n(C_i - \mathbb{E}(h(\Xi_i)))\right)\right] \leq \exp\left[\frac{2m^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))\lambda^2}{n - 4m^2\lambda}\right]$

for any $\lambda \in (0, n/(4m^2))$.

• Bounds for the $B_i$'s. First, write

$\sum_{i=1}^n B_i = \sum_{i=1}^n(B_i - \mathbb{E}(B_i|\varepsilon)) + \sum_{i=1}^n\mathbb{E}(B_i|\varepsilon) =: \sum_{i=1}^n D_i + \sum_{i=1}^n E_i$,

and note that

(9)  $\mathbb{E}(B_i|\varepsilon) = \frac{2}{n}\mathbb{E}(\langle\mathbf{X}_i,\varepsilon\rangle_F\langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F|\varepsilon) = \frac{2}{n}\sum_{j,t}\mathbb{E}(\mathbf{1}_{\chi_i=(j,t)})[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}\varepsilon_{j,t} = \frac{2}{n}\sum_{j,t}p_{j,t}[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}\varepsilon_{j,t}$

and

(10)  $\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 = \mathbb{E}(\langle\mathbf{X}_i,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2) = \mathbb{E}([(\mathbf{T}_0-\mathbf{T})\Lambda]_{\chi_i}^2) = \sum_{j,t}p_{j,t}[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}^2$,

where $p_{j,t} := \mathbb{P}(\chi_1 = (j,t)) = \Pi(\{e_{\mathbb{R}^d}(j)e_{\mathbb{R}^T}(t)^*\})$ for every $(j,t) \in E$. On the one hand, given $\varepsilon$, the $D_i$'s are i.i.d., $|D_i| \leq 8m_\varepsilon m/n$ and

$\sum_{i=1}^n\mathbb{E}(D_i^2|\varepsilon) \leq \sum_{i=1}^n\mathbb{E}(B_i^2|\varepsilon) \leq \frac{4}{n}m_\varepsilon^2\mathbb{E}(\langle\mathbf{X}_1,(\mathbf{T}_0-\mathbf{T})\Lambda\rangle_F^2) = \frac{4}{n}m_\varepsilon^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))$

thanks to Equality (8).
So, conditionally on $\varepsilon$, we can apply Lemma 6.2 with $v = \frac{4}{n}m_\varepsilon^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))$ and $c = \frac{8m_\varepsilon m}{n}$ to obtain

$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n D_i\right)\Bigg|\varepsilon\right] \leq \exp\left[\frac{2m_\varepsilon^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))\lambda^2}{n - 8m_\varepsilon m\lambda}\right]$

for any $\lambda \in (0, n/(8m_\varepsilon m))$. Taking the expectation of both sides gives

$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n D_i\right)\right] \leq \exp\left[\frac{2m_\varepsilon^2(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))\lambda^2}{n - 8m_\varepsilon m\lambda}\right]$.

On the other hand, let us focus on the $E_i$'s. Thanks to Equality (9) and since the rows of $\varepsilon$ are independent,

$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n E_i\right)\right] = \mathbb{E}\left[\exp\left(2\lambda\sum_{j,t}p_{j,t}[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}\varepsilon_{j,t}\right)\right] = \prod_{j=1}^d\mathbb{E}\left[\exp\left(2\lambda\sum_{t=1}^T p_{j,t}[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}\varepsilon_{j,t}\right)\right]$.

Now, for any $j \in \{1,\dots,d\}$, let us apply Lemma 6.3 to $(\varepsilon_{j,1},\dots,\varepsilon_{j,T})$, which is a sample of a $\varphi$-mixing sequence, and to the function $f_j : [0,m_\varepsilon]^T \to \mathbb{R}$ defined by

$f_j(u_1,\dots,u_T) := 2\sum_{t=1}^T p_{j,t}[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}u_t$ ;  $\forall u \in [0,m_\varepsilon]^T$.
Since

$\|\nabla f_j(u_1,\dots,u_T)\|^2 = 4\sum_{t=1}^T p_{j,t}^2[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}^2$ ;  $\forall u \in [0,m_\varepsilon]^T$,

by Lemma 6.3,

$\mathbb{E}\left[\exp\left(2\lambda\sum_{t=1}^T p_{j,t}[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}\varepsilon_{j,t}\right)\right] = \mathbb{E}(\exp(\lambda(f_j(\varepsilon_{j,1},\dots,\varepsilon_{j,T}) - \mathbb{E}[f_j(\varepsilon_{j,1},\dots,\varepsilon_{j,T})]))) \leq \exp\left(2m_\varepsilon^2\lambda^2\Phi_\varepsilon^2\sum_{t=1}^T p_{j,t}^2[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}^2\right)$.

Thus, for any $\lambda > 0$, by Equalities (8) and (10) together with Assumption 3.3 and $n \leq dT$,

$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n E_i\right)\right] \leq \prod_{j=1}^d\exp\left(2m_\varepsilon^2\lambda^2\Phi_\varepsilon^2\sum_{t=1}^T p_{j,t}^2[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}^2\right) \leq \exp\left(2m_\varepsilon^2\lambda^2\Phi_\varepsilon^2\frac{c_\Pi}{dT}\sum_{j,t}p_{j,t}[(\mathbf{T}_0-\mathbf{T})\Lambda]_{j,t}^2\right) \leq \exp\left[2m_\varepsilon^2\Phi_\varepsilon^2 c_\Pi\frac{\lambda^2}{n}(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))\right]$.

Therefore, these bounds together with Jensen's inequality give

$\mathbb{E}\left[\exp\left(\lambda(r_n(\mathbf{T}\Lambda) - r_n(\mathbf{T}_0\Lambda) + R(\mathbf{T}_0\Lambda) - R(\mathbf{T}\Lambda))\right)\right] = \mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n(h(\Xi_i)-\mathbb{E}(h(\Xi_i)))\right)\right] \leq \frac{1}{4}\left(\mathbb{E}\left[e^{4\lambda\sum_i A_i}\right] + \mathbb{E}\left[e^{4\lambda\sum_i(C_i-\mathbb{E}(h(\Xi_i)))}\right] + \mathbb{E}\left[e^{4\lambda\sum_i D_i}\right] + \mathbb{E}\left[e^{4\lambda\sum_i E_i}\right]\right) \leq \exp\left[c_\lambda\frac{\lambda^2}{n}(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda))\right]$,

where $c_\lambda$ is the maximum of the four factors obtained by plugging $4\lambda$ into the previous bounds. In particular, for $\lambda < n\lambda^* = n(16m\max\{m, m_\varepsilon, m_\xi\})^{-1}$, we have $c_\lambda \leq 4\max\{m^2, m_\xi^2, m_\varepsilon^2, m_\varepsilon^2\Phi_\varepsilon^2 c_\Pi\} = c_{6.1}$, which is exactly Inequality (6). This ends the proof of the first inequality; the second one, (7), is obtained in the same way by applying the above arguments to $-h$. □
6.2. A preliminary non-explicit risk bound.
We now provide a simpler version of Theorem 3.4 that holds in the case where $\mathcal{S}_{k,\tau}$ is finite: this is (1) in the following theorem. When this is not the case, we provide a similar bound using a general $\epsilon$-net: this is (2) in the theorem.

Theorem 6.4.
Consider $\alpha \in\ ]0,1[$.
(1) Under Assumptions 3.1, 3.2 and 3.3, if $|\mathcal{S}_{k,\tau}| < \infty$, then

$\|\widehat\Theta_{k,\tau} - \Theta\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + \frac{c_{6.4,1}}{n}\log\left(\frac{2}{\alpha}|\mathcal{S}_{k,\tau}|\right)$

with probability larger than $1-\alpha$, where $c_{6.4,1} = 32(c_{6.1}^{-1}\wedge\lambda^*)^{-1}$.
(2) Under Assumptions 3.1, 3.2 and 3.3, for every $\epsilon > 0$, there exists a finite subset $\mathcal{S}_{k,\tau}^\epsilon$ of $\mathcal{S}_{k,\tau}$ such that

$\|\widehat\Theta_{k,\tau} - \Theta\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + \frac{c_{6.4,1}}{n}\log\left(\frac{2}{\alpha}|\mathcal{S}_{k,\tau}^\epsilon|\right) + c_{6.4,2}\tau\epsilon$

with probability larger than $1-\alpha$, where $c_{6.4,2} = 4(3m+m_\varepsilon+m_\xi)m_\Lambda$.

Proof of Theorem 6.4. (1) Assume that $|\mathcal{S}_{k,\tau}| < \infty$. For any $x > 0$, $\lambda \in (0, n\lambda^*)$ and $\mathcal{S} \subset \mathcal{M}_{d,\tau}(\mathbb{R})$, consider the events

$\Omega^-_{x,\lambda,\mathcal{S}}(\mathbf{T}) := \left\{\left(1 - \frac{c_{6.1}\lambda}{n}\right)\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 - (r_n(\mathbf{T}\Lambda) - r_n(\mathbf{T}_0\Lambda)) > x\right\}$, $\mathbf{T} \in \mathcal{S}$,

and $\Omega^-_{x,\lambda,\mathcal{S}} := \bigcup_{\mathbf{T}\in\mathcal{S}}\Omega^-_{x,\lambda,\mathcal{S}}(\mathbf{T})$. By Markov's inequality together with Proposition 6.1, Inequality (7),

$\mathbb{P}(\Omega^-_{x,\lambda,\mathcal{S}_{k,\tau}}) \leq \sum_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\mathbb{P}\left(\exp\left(\lambda\left(\left(1-\frac{c_{6.1}\lambda}{n}\right)(R(\mathbf{T}\Lambda)-R(\mathbf{T}_0\Lambda)) - (r_n(\mathbf{T}\Lambda)-r_n(\mathbf{T}_0\Lambda))\right)\right) > e^{\lambda x}\right) \leq |\mathcal{S}_{k,\tau}|e^{-\lambda x}$.

In the same way, with

$\Omega^+_{x,\lambda,\mathcal{S}}(\mathbf{T}) := \left\{-\left(1 + \frac{c_{6.1}\lambda}{n}\right)\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + r_n(\mathbf{T}\Lambda) - r_n(\mathbf{T}_0\Lambda) > x\right\}$, $\mathbf{T} \in \mathcal{S}$,

and $\Omega^+_{x,\lambda,\mathcal{S}} := \bigcup_{\mathbf{T}\in\mathcal{S}}\Omega^+_{x,\lambda,\mathcal{S}}(\mathbf{T})$, by Markov's inequality together with Proposition 6.1, Inequality (6), $\mathbb{P}(\Omega^+_{x,\lambda,\mathcal{S}_{k,\tau}}) \leq |\mathcal{S}_{k,\tau}|e^{-\lambda x}$. Then, $\mathbb{P}(\Omega_{x,\lambda,\mathcal{S}_{k,\tau}}) \geq 1 - 2|\mathcal{S}_{k,\tau}|e^{-\lambda x}$ with

$\Omega_{x,\lambda,\mathcal{S}} := (\Omega^-_{x,\lambda,\mathcal{S}})^c \cap (\Omega^+_{x,\lambda,\mathcal{S}})^c \subset \Omega^-_{x,\lambda,\mathcal{S}}(\widehat{\mathbf{T}}_{k,\tau})^c \cap \Omega^+_{x,\lambda,\mathcal{S}}(\widehat{\mathbf{T}}_{k,\tau})^c =: \Omega_{x,\lambda,\mathcal{S}_{k,\tau}}(\widehat{\mathbf{T}}_{k,\tau})$.
Moreover, on the event $\Omega_{x,\lambda,\mathcal{S}_{k,\tau}}$, by the definition of $\widehat{\mathbf{T}}_{k,\tau}$,

$\|\widehat\Theta_{k,\tau}-\Theta\|_{F,\Pi}^2 \leq \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}(r_n(\widehat{\mathbf{T}}_{k,\tau}\Lambda) - r_n(\mathbf{T}_0\Lambda) + 4x) = \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}\left(\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\{r_n(\mathbf{T}\Lambda)-r_n(\mathbf{T}_0\Lambda)\} + 4x\right) \leq \frac{1+c_{6.1}\lambda n^{-1}}{1-c_{6.1}\lambda n^{-1}}\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + \frac{8x}{1-c_{6.1}\lambda n^{-1}}$.

So, for any $\alpha \in\ ]0,1[$, with probability larger than $1-\alpha$,

$\|\widehat\Theta_{k,\tau}-\Theta\|_{F,\Pi}^2 \leq \frac{1+c_{6.1}\lambda n^{-1}}{1-c_{6.1}\lambda n^{-1}}\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + \frac{8\lambda^{-1}\log(2\alpha^{-1}|\mathcal{S}_{k,\tau}|)}{1-c_{6.1}\lambda n^{-1}}$.

Now, let us take

$\lambda = \frac{n}{2}\left(\frac{1}{c_{6.1}}\wedge\lambda^*\right) \in (0, n\lambda^*)$ and $x = \frac{1}{\lambda}\log\left(\frac{2}{\alpha}|\mathcal{S}_{k,\tau}|\right)$.

In particular, $c_{6.1}\lambda n^{-1} \leq 1/2$, and then

$\frac{1+c_{6.1}\lambda n^{-1}}{1-c_{6.1}\lambda n^{-1}} \leq 3$ and $\frac{\lambda^{-1}}{1-c_{6.1}\lambda n^{-1}} \leq \frac{4}{n}\left(\frac{1}{c_{6.1}}\wedge\lambda^*\right)^{-1}$.

Therefore, with probability larger than $1-\alpha$,

$\|\widehat\Theta_{k,\tau}-\Theta\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + 32\left(\frac{1}{c_{6.1}}\wedge\lambda^*\right)^{-1}\frac{1}{n}\log\left(\frac{2}{\alpha}|\mathcal{S}_{k,\tau}|\right)$.

(2) Now, assume that $|\mathcal{S}_{k,\tau}| = \infty$. Since $\dim(\mathcal{M}_{d,\tau}(\mathbb{R})) < \infty$ and $\mathcal{S}_{k,\tau}$ is a bounded subset of $\mathcal{M}_{d,\tau}(\mathbb{R})$ (equipped with $\mathbf{T} \mapsto \sup_{j,t}|\mathbf{T}_{j,t}|$), $\mathcal{S}_{k,\tau}$ is compact in $(\mathcal{M}_{d,\tau}(\mathbb{R}), \|.\|_F)$. Then, for any $\epsilon > 0$, there exists a finite subset $\mathcal{S}_{k,\tau}^\epsilon$ of $\mathcal{S}_{k,\tau}$ such that

(11)  $\forall\mathbf{T}\in\mathcal{S}_{k,\tau}$, $\exists\mathbf{T}^\epsilon\in\mathcal{S}_{k,\tau}^\epsilon$ : $\|\mathbf{T}-\mathbf{T}^\epsilon\|_F \leq \epsilon$.

On the one hand, for any $\mathbf{T}\in\mathcal{S}_{k,\tau}$ and $\mathbf{T}^\epsilon\in\mathcal{S}_{k,\tau}^\epsilon$ satisfying (11), since $\langle\mathbf{X}_i,(\mathbf{T}-\mathbf{T}^\epsilon)\Lambda\rangle_F = \langle\mathbf{X}_i\Lambda^*,\mathbf{T}-\mathbf{T}^\epsilon\rangle_F$ for every $i\in\{1,\dots,n\}$,

(12)  $|r_n(\mathbf{T}\Lambda) - r_n(\mathbf{T}^\epsilon\Lambda)| \leq \frac{1}{n}\sum_{i=1}^n|\langle\mathbf{X}_i,(\mathbf{T}-\mathbf{T}^\epsilon)\Lambda\rangle_F(2Y_i - \langle\mathbf{X}_i,(\mathbf{T}+\mathbf{T}^\epsilon)\Lambda\rangle_F)| \leq \frac{\epsilon}{n}\sum_{i=1}^n m_\Lambda\tau\left(2|Y_i| + \sup_{j,t}\sum_{\ell=1}^\tau|(\mathbf{T}+\mathbf{T}^\epsilon)_{j,\ell}\Lambda_{\ell,t}|\right) \leq c_1\tau\epsilon$

with $c_1 = 2(2m+m_\varepsilon+m_\xi)m_\Lambda$, and thanks to Equality (8),

(13)  $|R(\mathbf{T}\Lambda) - R(\mathbf{T}^\epsilon\Lambda)| = |\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 - \|(\mathbf{T}^\epsilon-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2| \leq \mathbb{E}(|\langle\mathbf{X}_1,(\mathbf{T}-\mathbf{T}^\epsilon)\Lambda\rangle_F\langle\mathbf{X}_1,(\mathbf{T}+\mathbf{T}^\epsilon-2\mathbf{T}_0)\Lambda\rangle_F|) \leq c_2\tau\epsilon$

with $c_2 = 4m\,m_\Lambda$. On the other hand, consider

(14)  $\widehat{\mathbf{T}}_{k,\tau}^\epsilon \in \arg\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}^\epsilon}\|\mathbf{T}-\widehat{\mathbf{T}}_{k,\tau}\|_F$.

On the event $\Omega_{x,\lambda,\mathcal{S}_{k,\tau}^\epsilon}$ with $x > 0$ and $\lambda \in (0, n\lambda^*)$, by the definitions of $\widehat{\mathbf{T}}_{k,\tau}^\epsilon$ and $\widehat{\mathbf{T}}_{k,\tau}$, and thanks to Inequalities (12) and (13),

$\|\widehat\Theta_{k,\tau}-\Theta\|_{F,\Pi}^2 \leq \|(\widehat{\mathbf{T}}_{k,\tau}^\epsilon - \mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + c_2\tau\epsilon \leq \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}(r_n(\widehat{\mathbf{T}}_{k,\tau}^\epsilon\Lambda) - r_n(\mathbf{T}_0\Lambda) + 4x) + c_2\tau\epsilon \leq \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}(r_n(\widehat{\mathbf{T}}_{k,\tau}\Lambda) - r_n(\mathbf{T}_0\Lambda) + c_1\tau\epsilon + 4x) + c_2\tau\epsilon = \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}\left(\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\{r_n(\mathbf{T}\Lambda)-r_n(\mathbf{T}_0\Lambda)\} + c_1\tau\epsilon + 4x\right) + c_2\tau\epsilon \leq \frac{1+c_{6.1}\lambda n^{-1}}{1-c_{6.1}\lambda n^{-1}}\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + \frac{8x}{1-c_{6.1}\lambda n^{-1}} + \left(\frac{c_1}{1-c_{6.1}\lambda n^{-1}} + c_2\right)\tau\epsilon$.

Therefore, by taking $\lambda = \frac{n}{2}\left(\frac{1}{c_{6.1}}\wedge\lambda^*\right)$ and $x = \frac{1}{\lambda}\log\left(\frac{2}{\alpha}|\mathcal{S}_{k,\tau}^\epsilon|\right)$, as in the proof of Theorem 6.4.(1), with probability larger than $1-\alpha$,

$\|\widehat\Theta_{k,\tau}-\Theta\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + 32\left(\frac{1}{c_{6.1}}\wedge\lambda^*\right)^{-1}\frac{1}{n}\log\left(\frac{2}{\alpha}|\mathcal{S}_{k,\tau}^\epsilon|\right) + (2c_1+c_2)\tau\epsilon$. □

6.3. Proof of Theorem 3.4.
The proof is dissected into two steps.
Step 1.
Consider

$\mathcal{M}_{d,\tau,k}(\mathbb{R}) := \{\mathbf{T}\in\mathcal{M}_{d,\tau}(\mathbb{R}) : \mathrm{rank}(\mathbf{T}) = k\}$.

For every $\mathbf{T}\in\mathcal{M}_{d,\tau,k}(\mathbb{R})$ and $\rho > 0$, let us denote the closed ball (resp. the sphere) of center $\mathbf{T}$ and of radius $\rho$ of $\mathcal{M}_{d,\tau,k}(\mathbb{R})$ by $B_k(\mathbf{T},\rho)$ (resp. $S_k(\mathbf{T},\rho)$). For any $\epsilon > 0$, thanks to Candès and Plan [11], Lemma 3.1, there exists an $\epsilon$-net $S_k^\epsilon(0,1)$ covering $S_k(0,1)$ and such that

$|S_k^\epsilon(0,1)| \leq \left(\frac{9}{\epsilon}\right)^{k(d+\tau+1)}$.

Then, for every $\rho > 0$, there exists an $\epsilon$-net $S_k^\epsilon(0,\rho)$ covering $S_k(0,\rho)$ and such that

$|S_k^\epsilon(0,\rho)| \leq \left(\frac{9\rho}{\epsilon}\right)^{k(d+\tau+1)}$.

Moreover, for any $\rho^* > 0$, $B_k(0,\rho^*) = \bigcup_{\rho\in[0,\rho^*]}S_k(0,\rho)$. So,

$B_k^\epsilon(0,\rho^*) := \bigcup_{j=0}^{[\rho^*/\epsilon]+1}S_k^\epsilon(0,j\epsilon)$

is an $\epsilon$-net covering $B_k(0,\rho^*)$ and such that

$|B_k^\epsilon(0,\rho^*)| \leq \sum_{j=0}^{[\rho^*/\epsilon]+1}|S_k^\epsilon(0,j\epsilon)| \leq \left(\left[\frac{\rho^*}{\epsilon}\right]+2\right)\left(\frac{9\rho^*}{\epsilon}\right)^{k(d+\tau+1)}$.

If in addition $\rho^* \geq \epsilon$, then

$|B_k^\epsilon(0,\rho^*)| \leq \frac{3\rho^*}{\epsilon}\left(\frac{9\rho^*}{\epsilon}\right)^{k(d+\tau+1)} \leq \left(\frac{9\rho^*}{\epsilon}\right)^{2k(d+\tau)}$.
Step 2.
For any $\mathbf{T}\in\mathcal{S}_{k,\tau}$, $\sup_{j,t}|\mathbf{T}_{j,t}| \leq m/(m_\Lambda\tau)$. Then,

$\|\mathbf{T}\|_F = \left(\sum_{j=1}^d\sum_{t=1}^\tau\mathbf{T}_{j,t}^2\right)^{1/2} \leq \rho^*_{d,\tau} := c_3\left(\frac{d}{\tau}\right)^{1/2}$ with $c_3 = \frac{m}{m_\Lambda}$.

So, $\mathcal{S}_{k,\tau} \subset B_k(0,\rho^*_{d,\tau})$, and by the first step of the proof, there exists an $\epsilon$-net $\mathcal{S}_{k,\tau}^\epsilon$ covering $\mathcal{S}_{k,\tau}$ and such that

$|\mathcal{S}_{k,\tau}^\epsilon| \leq \left(\frac{9\rho^*_{d,\tau}}{\epsilon}\right)^{2k(d+\tau)} = \left(\frac{9c_3 d^{1/2}\tau^{-1/2}}{\epsilon}\right)^{2k(d+\tau)}$.

By taking $\epsilon = 9c_3 d^{1/2}\tau^{-1/2}n^{-1}$, thanks to Theorem 6.4.(2), with probability larger than $1-\alpha$,

$\|\widehat\Theta_{k,\tau}-\Theta\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + \frac{c_{6.4,1}}{n}\left[\log\left(\frac{2}{\alpha}\right) + 2k(d+\tau)\log\left(\frac{9c_3 d^{1/2}\tau^{-1/2}}{\epsilon}\right)\right] + c_{6.4,2}\tau\epsilon = 3\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + \frac{c_{6.4,1}}{n}\left[\log\left(\frac{2}{\alpha}\right) + 2k(d+\tau)\log(n)\right] + 9c_3 c_{6.4,2}\frac{d^{1/2}\tau^{1/2}}{n}$.

Therefore, since $n \geq (d\tau)^{1/2}(k(d+\tau))^{-1}$, with probability larger than $1-\alpha$,

$\|\widehat\Theta_{k,\tau}-\Theta\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T}\in\mathcal{S}_{k,\tau}}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + (4c_{6.4,1} + 9c_3 c_{6.4,2})\frac{k(d+\tau)\log(n)}{n} + \frac{c_{6.4,1}}{n}\log\left(\frac{2}{\alpha}\right)$. □

6.4. Proof of Theorem 4.1.
For any $k\in\mathcal{K}$, let $\mathcal{S}_k^\epsilon := \mathcal{S}_{k,\tau}^\epsilon$ be the $\epsilon$-net introduced in the proof of Theorem 3.4, and recall that for $\epsilon = 9m\,m_\Lambda^{-1}d^{1/2}\tau^{-1/2}n^{-1}$,

$|\mathcal{S}_k^\epsilon| \leq \left(\frac{9m\,m_\Lambda^{-1}d^{1/2}\tau^{-1/2}}{\epsilon}\right)^{2k(d+\tau)} = n^{2k(d+\tau)}$.

Then, for $\alpha\in(0,1)$ and $x_{k,\epsilon} := \lambda^{-1}\log(2\alpha^{-1}|\mathcal{K}|\cdot|\mathcal{S}_k^\epsilon|)$ with $\lambda = nc_{\mathrm{cal}}^{-1} \in (0, n\lambda^*)$,

(15)  $4x_{k,\epsilon} - \mathrm{pen}(k) = \frac{4c_{\mathrm{cal}}}{n}\log\left(\frac{2}{\alpha}|\mathcal{K}|\cdot|\mathcal{S}_k^\epsilon|\right) - 16c_{\mathrm{cal}}\frac{\log(n)}{n}k(d+\tau) \leq \frac{4c_{\mathrm{cal}}}{n}\left[2k(d+\tau)\log(n) + \log\left(\frac{2}{\alpha}|\mathcal{K}|\right)\right] - 16c_{\mathrm{cal}}\frac{\log(n)}{n}k(d+\tau) \leq \frac{4c_{\mathrm{cal}}}{n}\log\left(\frac{2}{\alpha}|\mathcal{K}|\right) =: m_n$.

Now, consider the event $\Omega_{\lambda,\epsilon} := (\Omega^-_{\lambda,\epsilon})^c\cap(\Omega^+_{\lambda,\epsilon})^c$ with

$\Omega^-_{\lambda,\epsilon} := \bigcup_{k\in\mathcal{K}}\bigcup_{\mathbf{T}\in\mathcal{S}_k^\epsilon}\Omega^-_{x_{k,\epsilon},\lambda,\mathcal{S}_k^\epsilon}(\mathbf{T})$ and $\Omega^+_{\lambda,\epsilon} := \bigcup_{k\in\mathcal{K}}\bigcup_{\mathbf{T}\in\mathcal{S}_k^\epsilon}\Omega^+_{x_{k,\epsilon},\lambda,\mathcal{S}_k^\epsilon}(\mathbf{T})$.

So,

$\mathbb{P}(\Omega^c_{\lambda,\epsilon}) \leq \sum_{k\in\mathcal{K}}\sum_{\mathbf{T}\in\mathcal{S}_k^\epsilon}[\mathbb{P}(\Omega^-_{x_{k,\epsilon},\lambda,\mathcal{S}_k^\epsilon}(\mathbf{T})) + \mathbb{P}(\Omega^+_{x_{k,\epsilon},\lambda,\mathcal{S}_k^\epsilon}(\mathbf{T}))] \leq 2\sum_{k\in\mathcal{K}}|\mathcal{S}_k^\epsilon|e^{-\lambda x_{k,\epsilon}} = \alpha$

and $\Omega_{\lambda,\epsilon} \subset \Omega_{x_{\widehat k,\epsilon},\lambda,\mathcal{S}_{\widehat k}^\epsilon}(\widehat{\mathbf{T}}_{\widehat k}^\epsilon)$, where $\widehat{\mathbf{T}}_k^\epsilon$ is a solution of the minimization problem (14) for every $k\in\mathcal{K}$. On the event $\Omega_{\lambda,\epsilon}$, by the definition of $\widehat k$, and thanks to Inequalities (12), (13) and (14),

$\|\widehat\Theta-\Theta\|_{F,\Pi}^2 \leq \|(\widehat{\mathbf{T}}_{\widehat k}^\epsilon - \mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + c_2\tau\epsilon \leq \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}(r_n(\widehat{\mathbf{T}}_{\widehat k}^\epsilon\Lambda) - r_n(\mathbf{T}_0\Lambda) + 4x_{\widehat k,\epsilon}) + c_2\tau\epsilon \leq \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}(r_n(\widehat{\mathbf{T}}_{\widehat k}\Lambda) - r_n(\mathbf{T}_0\Lambda) + c_1\tau\epsilon + 4x_{\widehat k,\epsilon}) + c_2\tau\epsilon = \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}\left(\min_{k\in\mathcal{K}}\{r_n(\widehat{\mathbf{T}}_k\Lambda) - r_n(\mathbf{T}_0\Lambda) + \mathrm{pen}(k)\} + c_1\tau\epsilon + 4x_{\widehat k,\epsilon} - \mathrm{pen}(\widehat k)\right) + c_2\tau\epsilon \leq \left(1-\frac{c_{6.1}\lambda}{n}\right)^{-1}\min_{k\in\mathcal{K}}\left\{\left(1+\frac{c_{6.1}\lambda}{n}\right)\|(\widehat{\mathbf{T}}_k-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + 4x_{k,\epsilon} + \mathrm{pen}(k)\right\} + \frac{m_n + c_1\tau\epsilon}{1-c_{6.1}\lambda n^{-1}} + c_2\tau\epsilon \leq \min_{k\in\mathcal{K}}\{3\|(\widehat{\mathbf{T}}_k-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + 2\,\mathrm{pen}(k)\} + 4m_n + (2c_1+c_2)\tau\epsilon$.

Moreover, by following the proofs of Theorem 6.4 and Theorem 3.4 on the same event $\Omega_{\lambda,\epsilon}$,

$\|(\widehat{\mathbf{T}}_k-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 \leq 3\min_{\mathbf{T}\in\mathcal{S}_k}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + c_{3.4}\left[\frac{k(d+\tau)\log(n)}{n} + \frac{1}{n}\log\left(\frac{2}{\alpha}|\mathcal{K}|\right)\right]$

for every $k\in\mathcal{K}$. Therefore, after rearranging the constants, with probability larger than $1-\alpha$,

$\|\widehat\Theta-\Theta\|_{F,\Pi}^2 \leq \min_{k\in\mathcal{K}}\left\{3\min_{\mathbf{T}\in\mathcal{S}_k}\|(\mathbf{T}-\mathbf{T}_0)\Lambda\|_{F,\Pi}^2 + (4c_{3.4}+16c_{\mathrm{cal}})\frac{k(d+\tau)\log(n)}{n}\right\} + \frac{4c_{3.4}+16c_{\mathrm{cal}}}{n}\log\left(\frac{2}{\alpha}|\mathcal{K}|\right) + 9(2c_1+c_2)\frac{m}{m_\Lambda}\cdot\frac{d^{1/2}\tau^{1/2}}{n}$. □

References

[1] Alquier, P., Bertin, K., Doukhan, P. and Garnier, R.
High-dimensional VAR with low-rank transition. Statistics and Computing 30, 1139-1153, 2020.
[2] Alquier, P., Li, X. and Wintenberger, O. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modeling 1, 65-93, 2013.
[3] Alquier, P. and Marie, N. Matrix factorization for multivariate time series analysis. Electronic Journal of Statistics 13(2), 4346-4366, 2019.
[4] Alquier, P. and Ridgway, J. Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics 48(3), 1475-1497, 2020.
[5] Athey, S., Bayati, M., Doudchenko, N., Imbens, G. and Khosravi, K. Matrix completion methods for causal panel data models (No. w25132). National Bureau of Economic Research, 2018.
[6] Bai, J. and Ng, S. Matrix completion, counterfactuals, and factor analysis of missing data. ArXiv preprint arXiv:1910.06677, 2019.
[7] Basu, S., Li, X. and Michailidis, G. Low rank and structured modeling of high-dimensional vector autoregressions. IEEE Transactions on Signal Processing 67(5), 1207-1222, 2019.
[8] Bennett, J. and Lanning, S. The Netflix prize. In Proceedings of KDD Cup and Workshop, page 35, 2007.
[9] Boucheron, S., Lugosi, G. and Massart, P. Concentration Inequalities. Oxford University Press, 2013.
[10] Candès, E. J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE 98(6), 925-936, 2010.
[11] Candès, E. J. and Plan, Y. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57(4), 2342-2359, 2011.
[12] Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717-772, 2009.
[13] Candès, E. J. and Tao, T. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory 56(5), 2053-2080, 2010.
[14] Carpentier, A., Klopp, O., Löffler, M. and Nickl, R. Adaptive confidence sets for matrix completion. Bernoulli 24(4A), 2429-2460, 2018.
[15] Chan, J., Leon-Gonzalez, R. and Strachan, R. W. Invariant inference and efficient computation in the static factor model. Journal of the American Statistical Association 113(522), 819-828, 2018.
[16] Cottet, V. and Alquier, P. 1-bit matrix completion: PAC-Bayesian analysis of a variational approximation. Machine Learning 107(3), 579-603, 2018.
[17] Doukhan, P. Mixing: Properties and Examples (Vol. 85). Springer Science & Business Media, 1994.
[18] Eshkevari, S. S. and Pakzad, S. N. Signal reconstruction from mobile sensors network using matrix completion approach. In Topics in Modal Analysis & Testing, Volume 8 (pp. 61-75). Springer, Cham, 2020.
[19] Gillard, J. and Usevich, K. Structured low-rank matrix completion for forecasting in time series analysis. International Journal of Forecasting 34(4), 582-597, 2018.
[20] Giordani, P., Pitt, M. and Kohn, R. Bayesian inference for time series state space models. In: Geweke, J., Koop, G., Van Dijk, H. (eds.) Oxford Handbook of Bayesian Econometrics. Oxford University Press, Oxford, 2011.
[21] Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548-1566, 2011.
[22] Hallin, M. and Lippi, M. Factor models in high-dimensional time series: a time-domain approach. Stochastic Processes and their Applications 123(7), 2678-2695, 2013.
[23] Hastie, T. and Mazumder, R. R package softImpute, 2013.
[24] Keshavan, R. H., Montanari, A. and Oh, S. Matrix completion from a few entries. IEEE Transactions on Information Theory 56(6), 2980-2998, 2010.
[25] Keshavan, R. H., Montanari, A. and Oh, S. Matrix completion from noisy entries. The Journal of Machine Learning Research 11, 2057-2078, 2010.
[26] Klopp, O. Noisy low-rank matrix completion with general sampling distribution. Bernoulli 20(1), 282-303, 2014.
[27] Klopp, O., Lounici, K. and Tsybakov, A. B. Robust matrix completion. Probability Theory and Related Fields 169(1-2), 523-564, 2017.
[28] Koltchinskii, V., Lounici, K. and Tsybakov, A. B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39(5), 2302-2329, 2011.
[29] Koop, G. and Potter, S. Forecasting in dynamic factor models using Bayesian model averaging. The Econometrics Journal 7(2), 550-565, 2004.
[30] Lafond, J., Klopp, O., Moulines, E. and Salmon, J. Probabilistic low-rank matrix completion on finite alphabets. Advances in Neural Information Processing Systems 27, 1727-1735, 2014.
[31] Lam, C. and Yao, Q. Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40(2), 694-726, 2012.
[32] Lam, C., Yao, Q. and Bathia, N. Estimation of latent factors for high-dimensional time series. Biometrika 98(4), 901-918, 2011.
[33] Mai, T. T. and Alquier, P. A Bayesian approach for noisy matrix completion: optimal rate under general sampling distribution. Electronic Journal of Statistics 9(1), 823-841, 2015.
[34] Mei, J., De Castro, Y., Goude, Y., Azais, J.-M. and Hébrail, G. Nonnegative matrix factorization with side information for time series recovery and prediction. IEEE Transactions on Knowledge and Data Engineering 31(3), 493-506, 2018.
[35] Mei, J., De Castro, Y., Goude, Y. and Hébrail, G. Nonnegative matrix factorization for time series recovery from a few temporal aggregates. Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 2382-2390, 2017.
[36] Massart, P. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, edited by Jean Picard.
[37] Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: optimal bounds with noise. The Journal of Machine Learning Research 13(1), 1665-1697, 2012.
[38] Poulos, J. State-building through public land disposal? An application of matrix completion for counterfactual prediction. ArXiv preprint arXiv:1903.08028, 2019.
[39] Samson, P.-M. Concentration of measure inequalities for Markov chains and φ-mixing processes. The Annals of Probability 28(1), 416-461, 2000.
[40] Shi, W., Zhu, Y., Philip, S. Y., Huang, T., Wang, C., Mao, Y. and Chen, Y. Temporal dynamic matrix factorization for missing data prediction in large scale coevolving time series. IEEE Access 4, 6719-6732, 2016.
[41] Suzuki, T. Convergence rate of Bayesian tensor estimator and its minimax optimality. Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), JMLR Workshop and Conference Proceedings 37, 1273-1282, 2015.
[42] Tsagkatakis, G., Beferull-Lozano, B. and Tsakalides, P. Singular spectrum-based matrix completion for time series recovery and prediction. EURASIP Journal on Advances in Signal Processing 2016(1), 66, 2016.
[43] Tsybakov, A. Introduction to Nonparametric Estimation. Springer, 2009.
[44] Xie, K., Ning, X., Wang, X., Xie, D., Cao, J., Xie, G. and Wen, J. Recover corrupted data in sensor networks: a matrix completion solution. IEEE Transactions on Mobile Computing 16(5), 1434-1448, 2016.
[45] Yu, H.-F., Rao, N. and Dhillon, I. S. Temporal regularized matrix factorization for high-dimensional time series prediction. Advances in Neural Information Processing Systems 29, 847-855, 2016.

* RIKEN AIP, Tokyo, Japan
Email address: [email protected]
†, ⋄ Laboratoire Modal'X, Université Paris Nanterre, Nanterre, France
Email address: [email protected]
⋄ ESME Sudria, Paris, France
Email address: