Sparse constrained projection approximation subspace tracking
Denis Belomestny, Ekaterina Krymova ∗†‡§
November 27, 2018
Abstract
In this paper we revisit the well-known constrained projection approximation subspace tracking algorithm (CPAST) and derive, for the first time, non-asymptotic error bounds. Furthermore, we introduce a novel sparse modification of CPAST which is able to exploit sparsity in the underlying covariance structure. We present a non-asymptotic analysis of the proposed algorithm and study its empirical performance on simulated and real data.
∗ The authors are with the Faculty of Mathematics at the University of Duisburg-Essen, Essen, Germany.
† D. Belomestny, e-mail: [email protected]
‡ E. Krymova, e-mail: [email protected]
§ This work is funded by the German Research Foundation (DFG), Collaborative Research Center 823, Subprojects B3 and C5. The first author was also supported by the RSF grant 18-11-00132.

1 Introduction

Subspace tracking methods are intensively used in the statistical and signal processing communities. Given observations of a multidimensional signal, one is interested in estimating or tracking the subspace spanned by the eigenvectors corresponding to the first largest eigenvalues of the signal covariance matrix. Over the past few decades many variations of the original projection approximation subspace tracking (PAST) method [1] were developed, which found applications in data compression, filtering, speech enhancement, etc. (see [2] and references therein). Despite the popularity of subspace tracking methods, only partial results are known about their convergence. The asymptotic convergence of the PAST algorithm was first established in [3, 4] using a general theory of stability for ordinary differential equations. However, no finite-sample error bounds are available in the literature. Furthermore, in the case of a high-dimensional signal the empirical covariance matrix estimator performs poorly if the number of observations is small. A common way to improve the estimation quality in this case is to impose some kind of sparsity assumption on the signal itself or on the eigensubspace of the underlying covariance matrix. In [5] a sparse modification of the orthogonal iteration scheme for a fixed number of observations was proposed.
A thorough analysis in [5] shows that, under appropriate sparsity assumptions on the leading eigenvectors, the orthogonal iteration scheme combined with thresholding allows one to perform dimension reduction in the high-dimensional setting. Our main goal is to propose a novel modification of the constrained projection approximation subspace tracking method (CPAST) [6], called sparse constrained projection approximation subspace tracking (SCPAST), which can be used for efficient subspace tracking in the case of a high-dimensional sparse signal and a small number of available observations. Another contribution of our paper is a non-asymptotic convergence analysis of the CPAST and SCPAST algorithms showing the advantage of SCPAST in the case of a sparse covariance structure. Last but not least, we analyse the numerical performance of SCPAST on simulated and real data. In particular, the problem of tracking the leading subspace of a music signal is considered.

The structure of the paper is as follows. In Section 2 we introduce our observational model and formulate the main assumptions. Section 3 first reviews the CPAST algorithm and then provides non-asymptotic error bounds for CPAST in a "stationary" case. In Section 4 we introduce our sparse constrained projection approximation subspace tracking method and prove non-asymptotic upper bounds for the estimation error. A numerical study of the proposed algorithm is presented in Section 5. Finally, the proofs are collected in Section 6.

2 Main setup
One important problem in signal processing is adaptive estimation of a dominant subspace given incoming noisy observations. Specifically, one considers the model

x(t) = s(t) + σ(t) ξ(t),  t = 1, …, T,  (1)

where the observations x(t) ∈ R^n contain the signal s(t) ∈ R^n corrupted by a vector ξ(t) ∈ R^n with independent standard Gaussian components. The signal s(t) is usually modelled as s(t) = A(t) η(t), where A(t) is a deterministic n × d matrix of rank d with d ≪ n, and η(t) is a random vector in R^d independent of ξ(t) such that E[η(t)] = 0 and E[η_i²(t)] = 1, i = 1, …, d. Under these assumptions the process x(t) has a covariance matrix Σ(t) which may be decomposed in the following way:

Σ(t) = E[x(t) x^⊤(t)] = A(t) A^⊤(t) + σ²(t) I_n,  (2)

where I_n stands for the unit matrix in R^n. Note that the matrix A(t) A^⊤(t) has rank d, and by the singular value decomposition (SVD)

A(t) A^⊤(t) = Σ_{i=1}^{d} λ_i(t) v_i(t) v_i^⊤(t),

where v_i(t) ∈ R^n, i = 1, …, d, are the eigenvectors of A(t) A^⊤(t) corresponding to the eigenvalues λ_1(t) ≥ λ_2(t) ≥ ⋯ ≥ λ_d(t) > 0. It follows from (2) that the first d eigenvalues of Σ(t) are λ_1(t) + σ²(t), …, λ_d(t) + σ²(t), whereas the remaining n − d eigenvalues are equal to σ²(t). Since λ_d(t) > 0, the subspace corresponding to the first d eigenvectors of A(t) A^⊤(t) is identifiable. Subspace tracking methods aim to estimate the subspace span(v_1(t), …, v_d(t)) based on the observations (x(k))_{k=1}^{t}. The overall number of observations T is assumed to be fixed and known.

Relying on a heuristic assumption of slowly (in time) varying Σ(t), subspace tracking methods use the following estimator of the covariance matrix (up to scaling):

Σ̂_γ(t) = Σ_{i=0}^{t} γ^{t−i} x(i) x^⊤(i),  (3)

where 0 < γ ≤ 1 is a forgetting factor. The estimator Σ̂_γ(t) can adapt to changes in Σ(t) by discounting the past observations. In the stationary regime, that is, if Σ(t) is a constant matrix, one would use γ = 1. It is well known that in the case of Gaussian independent noise the estimator Σ̂_1(t) is consistent.

For the general model (1) and the non-stationary case, the constrained projection approximation subspace tracking (CPAST) method allows one to iteratively compute a matrix V̂_γ(t), t = 1, …, T, containing the first d leading eigenvectors of the matrix Σ̂_γ(t) (see (3)) based on the sequentially arriving observations x(j), j = 1, …, t. The procedure starts with some initial approximation V̂_γ(0) = V̂_0 and consists of the following two steps:

• multiplication: compute the n × d matrix Σ̂_{γ,V}(t) = Σ̂_γ(t) V̂_γ(t − 1);

• orthogonalization: compute an estimator V̂_γ(t) of the matrix V(t) containing the d leading eigenvectors via

V̂_γ(t) = Σ̂_{γ,V}(t) [Σ̂_{γ,V}^⊤(t) Σ̂_{γ,V}(t)]^{−1/2}.
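As an illustration, the two steps above together with the exponentially weighted covariance update (3) can be sketched in NumPy as follows. This is a minimal sketch under our own naming, not the authors' implementation; in particular the inverse square root is computed from a full eigendecomposition of the small d × d matrix Y^⊤Y rather than by a fast recursive update.

```python
import numpy as np

def update_cov(sigma_prev, x, gamma=1.0):
    """Rank-one update of the weighted covariance (3):
    Sigma_hat(t) = gamma * Sigma_hat(t-1) + x(t) x(t)^T."""
    return gamma * sigma_prev + np.outer(x, x)

def cpast_step(sigma_hat, v_prev):
    """One CPAST iteration given the current covariance estimate
    sigma_hat (n x n) and the previous basis v_prev (n x d)."""
    y = sigma_hat @ v_prev                 # multiplication step
    w, q = np.linalg.eigh(y.T @ y)         # Y^T Y = Q diag(w) Q^T
    inv_sqrt = q @ np.diag(w ** -0.5) @ q.T
    return y @ inv_sqrt                    # orthogonalization step
```

On a covariance with a well-separated d-dimensional spike, iterating `cpast_step` with a fixed covariance reproduces the orthogonal iteration scheme and converges to the leading eigenspace.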
In the ”stationary” case (γ = 1) the method may be regarded as an ”online” version of the orthogonal iteration scheme (see [7]) for computing the eigen-subspace of a non-negatively definite matrix. With the use of the Sherman-Morrison-Woodbury formula for the inversion at each time t, one has to perform O(nd) operations to compute the updated matrix V̂_γ(t) given V̂_γ(t − 1), Σ̂_γ(t − 1) and x(t).
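The Sherman-Morrison-Woodbury identity mentioned above is what makes such a cheap update possible: when the covariance estimate changes by a rank-one term x x^⊤, the inverse of a matrix A + x x^⊤ can be refreshed from A^{−1} without a new O(n³) inversion. A minimal sketch of the rank-one case (our own helper, not the authors' implementation):

```python
import numpy as np

def sherman_morrison_update(a_inv, x):
    """Inverse of (A + x x^T) from A^{-1} in O(n^2) operations:
    (A + x x^T)^{-1} = A^{-1} - (A^{-1} x x^T A^{-1}) / (1 + x^T A^{-1} x)."""
    ax = a_inv @ x
    return a_inv - np.outer(ax, ax) / (1.0 + x @ ax)
```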
1) and x ( t ). Throughout this section we consider the stationary case where Σ( t ) = Σ, A ( t ) = A , v i ( t ) = v i , λ i ( t ) = λ i , i = 1 , . . . , d , σ ( t ) = σ . In this situationone would like to keep all the available information to estimate V , that is, touse the estimator (3) for Σ with γ = 1. For notational simplicity, from now4n we skip the dependence on γ and use the notation (cid:98) Σ( t ) for the empiricalcovariance matrix (cid:98) Σ( t ) = t (cid:80) ti =1 x ( i ) x (cid:62) ( i ) and (cid:98) V ( t ) for CPAST estimator.Thus in the stationary case the CPAST estimator takes the form (cid:98) V ( t ) = [ (cid:98) Σ( t ) (cid:98) V ( t − (cid:98) V (cid:62) ( t − (cid:62) (cid:98) Σ ( t ) (cid:98) V ( t − − / . (4)We assume that the random vectors η ( t ) and ξ ( t ) have independent N (0 , t = 1 , . . . , T . Under these assumptions the covariance matrix(2) becomes Σ = d (cid:88) i =1 λ i v i v (cid:62) i + σ I n = V Λ d V (cid:62) + σ I n , (5)where V is n × d matrix with columns { v i } di =1 , Λ d is d × d diagonal matrix with { λ i } di =1 on the diagonal. Note that the observational model (1) in stationarycase can be alternatively written as the so-called spike model x ( t ) = d (cid:88) i =1 (cid:112) λ i u i ( t ) v i + σξ ( t ) , (6)where u i ( t ) are i.i.d. standard Gaussian random variables independent from ξ ( t ).For our non-asymptotic error analysis of CPAST, we assume that d , σ and λ i , i = 1 , . . . , d are known. With the known σ we can always normalizethe data and therefore without loss of generality we can assume that σ = 1.The typical condition while analyzing the quality of the eigenvectors es-timation is the so-called spectral gap condition, which says that the adjacenteigenvectors explain distinguishably different portion of the variance in thedata, namely there exists τ ≥
1, such that for all j = 1 , . . . , d,τ ( λ j − λ j +1 ) ≥ λ , where λ d +1 = 0 by definition. Since our goal is the estimation of the d -dimensional subspace of the first eigenvectors, and we are not interested inthe estimation of each particular eigenvector, we need only the condition forthe separation of this d -dimensional subspace, namely that the gap between λ d and λ d +1 is sufficiently large: τ λ d ≥ λ . (7)5efine a distance l between two subspaces W and Q spanning orthonormalcolumns w , . . . , w d and q , . . . , q d correspondingly via l ( W, Q ) = l ( W , Q ) = (cid:107) W W (cid:62) − QQ (cid:62) (cid:107) , (8)where the nuclear norm (cid:107) A (cid:107) of a matrix A ∈ R n × d is defined as (cid:107) A (cid:107) = sup x ∈ R d (cid:107) Ax (cid:107) (cid:107) x (cid:107) ,and W = { w , . . . , w d } and Q = { q , . . . , q d } are the matrices in R n × d withorthonormal columns.The next result shows that with high probability the subspace which spansthe CPAST estimator (cid:98) V ( t ) is close, in terms of l, to the subspace spanning V when the number of observation is large enough. We assume that the initialestimator (cid:98) V ( t ) = (cid:98) V is constructed from t first observations by means ofthe singular value decomposition of (cid:98) Σ( t ). Theorem 1.
Suppose that the spectral gap condition (7) holds and

√t_0 ≥ √2 R_max (λ_1 + 1)/λ_d,  where R_max = 5 √(n − d) + 5 √6 √ln(n ∨ T).

Then after t − t_0 iterations we get, with probability at least 1 − C(n ∨ t)^{−2},

l(V, V̂(t)) ≤ C_1 ((λ_d + 1)/λ_d²) ((n − d)/t) + C_2 ((λ_1 + 1)/λ_d²) (log(n ∨ t)/t),  (9)

where C, C_1 are absolute constants and C_2 depends on τ.

Remark 1.
The second term on the right-hand side of (9) corresponds to the error of separating the first d eigenvectors from the rest. The first term is an average error of estimating all components of the d leading eigenvectors. It originates from the interaction of the noise terms with the different coordinates, see [8].

We assume that in the stationary case (5) the first d leading eigenvectors v_i, i = 1, …, d, of Σ have most of their entries close to zero. Namely, we suppose that each v_i fulfils the so-called weak-l_r ball condition [9, 10], that is, for some r ∈ (0, 2),

|v_i|_(k) ≤ s_i k^{−1/r},  k = 1, …, n,

where |v_i|_(k) is the k-th largest coordinate of v_i. The weak-l_r ball condition is known to be more general than the l_r ball condition (which is ‖q‖_r ≤ s for q ∈ R^n, r ∈ (0, 2), s ≥ 0). We use a thresholding function g(x, β) with a thresholding parameter β > 0, defined for x ∈ R via

x − β ≤ g(x, β) ≤ x + β,  g(x, β) 1_{|x| ≤ β} = 0.  (10)

For example, the so-called hard-thresholding function g_H(x, β) given by

g_H(x, β) = x 1(|x| > β)  (11)

and the so-called soft-thresholding function defined as

g_S(x, β) = (|x| − β)_+ · sign(x)

fulfil the conditions (10). When β is a vector with components β_i, i = 1, …, d, and V is a matrix with columns v_i ∈ R^n, i = 1, …, d, we denote by g(V, β) the n × d matrix with the elements {g(v_ij, β_i)}, i = 1, …, d, j = 1, …, n.

Our primal goal is to propose a subspace tracking method for estimating a d-dimensional subspace of the process under the weak-l_r ball assumption on the leading eigenvectors of the covariance matrix Σ and to analyze its convergence. Our sparse modification of CPAST relies on the orthogonal iteration scheme with an additional thresholding step (cf. [5, 10]). From now on, by a slight abuse of notation, we will denote by V̂(t) the iterative estimator obtained with the help of the modified CPAST, given t observations. To get the initial approximation V̂(t_0), we use the following modification of a standard SPCA scheme, see [5, 10].

1. First compute the empirical covariance Σ̂(t_0) based on t_0 observations: Σ̂(t_0) = (1/t_0) Σ_{i=1}^{t_0} x(i) x^⊤(i).
2. Define a set of indices G corresponding to large enough diagonal elements of Σ̂(t_0):

G = { k : Σ̂_kk(t_0) > γ log(n ∨ t_0)/t_0 }  for γ ≥ √(log(n ∨ T)/log(n ∨ t_0)).

3. Let Σ̂_G(t_0) be the submatrix of Σ̂(t_0) corresponding to the row and column indices in G × G.

4. As the estimator at step zero, take the first d eigenvectors of Σ̂_G(t_0), completed with zeros in the coordinates {1, …, n} \ G to vectors of length n.

Now we describe the sparse modification of CPAST, which we call SCPAST. We start with V̂(t_0) obtained by the above procedure. Then for t = t_0 + 1, …, T we perform the following steps:

1. multiplication: Υ̂(t) = Σ̂(t) V̂(t − 1);

2. thresholding: define a matrix Υ̂^β(t) = g(Υ̂(t), β(t)), where g is a thresholding function satisfying (10) and β(t) is the corresponding thresholding vector;

3. orthogonalization: V̂(t) = Υ̂^β(t) [Υ̂^{β⊤}(t) Υ̂^β(t)]^{−1/2}.

First we define the thresholding parameter β(t) as follows. For t = t_0 + 1, …, T and a ≥ √(log(n ∨ T)/log(n ∨ t)), the components β_i(t), i = 1, …, d, of the vector β(t) are given by

β_i(t) = a √((λ_i + 1) log(n ∨ t)/t).  (12)

The motivation for thresholding the column vectors of Υ̂(t) comes from the following connection between the sparsity of the leading eigenvectors v_j, j = 1, …, d, and the vector ζ^v with components ζ^v_k = √(Σ_{j=1}^{d} λ_j v_jk²), k = 1, …, n ((ζ^v_k)² is the variance of the k-th coordinate of the signal part [12]): the weak-l_r sparsity of the vector ζ^v implies the weak-l_r sparsity of v_j, j = 1, …, d.

Suppose that d and the eigenvalues λ_1, …, λ_d are known. In the case of unknown d and λ_1, …, λ_d one might first estimate the eigenvalues of Σ̂_G(t_0) defined in the previous section and then select the largest set of eigenvalues satisfying the spectral gap condition (7) with some parameter τ (see [5] for more details).

Denote by S(t) the set of indices of “large” eigenvector components (S stands for ”signal”), that is, for a fixed t,

S(t) = { j : |v_ij| ≥ b h_i √(log(n ∨ t)/t) for some i = 1, …, d },

where h_i = √(λ_i + 1)/λ_i and b is a numerical constant multiple of a √τ √d. In fact, the quantity h_i²/t is an estimate of the noise variance in the entries of the i-th leading eigenvector [8]. The number of “large” entries of the first d leading eigenvectors to estimate thus might be estimated by the cardinality of S(t), which we denote by card(S(t)). One can bound card(S(t)) as

card(S(t)) ≤ Σ_{i=1}^{d} card(S_i(t)),  where S_i(t) = { j : |v_ij| ≥ b h_i √(log(n ∨ t)/t) }.

From Lemma 14 (see Appendix B) we see that

d ≤ card(S(t)) ≤ C M(t),

where C depends on b, r and

M(t) = n ∧ [ Σ_{j=1}^{d} s_j^r h_j^r (log(n ∨ t)/t)^{−r/2} ].  (13)

Note that in the sparse case the number of non-zero components card(S(t)) is much smaller than n. For example, if ‖v_j‖_r ≤ s, j = 1, …, d, then

M(t) ≤ n ∧ [ d s^r h_d^r (log(n ∨ t)/t)^{−r/2} ].

The quantity s^r h_j^r is often referred to as the effective dimension of the vector v_j. Thus M(t) is the number of effective coordinates of v_j, j = 1, …, d, in the case of disjoint S_j(t). Since h_d²/t is an upper bound for the estimation error of the components of the first d leading eigenvectors, the right-hand side of the above inequality gives, up to a logarithmic term, the overall number of components of the d leading eigenvectors to estimate. The next theorem gives non-asymptotic bounds for the distance between V and V̂(t).

Theorem 2.
Let

√t_0 ≥ (C_1 h_d M^{1/2}(T) + C_2) ((λ_1 + 1)/λ_d) √log(n ∨ T),  (14)

where C_1 depends on τ in (7), r, a, and C_2 depends on τ. After t iterations one has, with probability at least 1 − C(n ∨ t)^{−2},

l(V, V̂(t)) ≤ C_1 h_d² M(t) log(n ∨ t)/t + C_2 ((λ_1 + 1)/λ_d²) log(n ∨ t)/t  (15)

with some absolute constant C > 0.

Remark 2.
The second term in (15) is the same as in the non-sparse case, see Theorem 1. This term is always present, as an error of separating the first d eigenvectors from the remaining ones, regardless of how sparse they are. The first term in (15) and (9) is responsible for the interaction of the noise with different coordinates of the signal. The average error of estimating one entry of the first d leading eigenvectors based on t observations can be bounded by (λ_d + 1)/(λ_d² t), see [13]. The number of components to be estimated in SCPAST for each vector is bounded by M(t) (see (13)), which is small compared to n in the sparse case. Thus, the first term in (15) can be significantly smaller than the first one in (9), provided the first d leading eigenvectors are sparse. Note also that the computational complexity of SCPAST at each step t = t_0 + 1, …, T is O(d card(S(t))), with probability given by Theorem 2.

To illustrate the advantage of using SCPAST in the sparse case, we generate T = 2000 observations from (6) for the case of a single spike, that is, d = 1.

Figure 1: The components of the leading eigenvector to recover: (a) the step function; (b)–(d) the error l(v_1, v̂_1) for the three values of λ_1 and n = 1024.

Our aim is to estimate the leading eigenvector v_1. We use three functions, depicted in subplots (a) of Figs. 1–3, with different sparsity levels in the wavelet domain. The observations are generated for noise variance 1 and three values of the maximal eigenvalue λ_1. We used the Symmlet 8 basis from the Matlab package SPCALab to transform the initial data into the wavelet domain. We applied CPAST and SCPAST for the recovery of the wavelet coefficients of the vector v_1, then transformed the estimates back to the initial domain and computed the error l(v_1, v̂_1) depending on the number of observations. The results for the hard thresholding (11) with a = 1 are shown in panels (b)–(d) of Figs. 1–3.
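For concreteness, the thresholding rules (10)–(11) and a single SCPAST iteration built on them can be sketched as follows. This is a minimal NumPy sketch under our own naming; the per-column thresholds are passed in as a vector, as in (12), and the toy values below are ours, not those of the paper's experiments.

```python
import numpy as np

def g_hard(x, beta):
    """Hard thresholding (11): keep x where |x| > beta, else 0."""
    return np.where(np.abs(x) > beta, x, 0.0)

def g_soft(x, beta):
    """Soft thresholding: shrink x towards zero by beta."""
    return np.sign(x) * np.maximum(np.abs(x) - beta, 0.0)

def scpast_step(sigma_hat, v_prev, beta, g=g_hard):
    """One SCPAST iteration: multiplication, columnwise thresholding
    with the vector beta (one threshold per column), orthogonalization."""
    y = sigma_hat @ v_prev                      # multiplication
    y = np.column_stack([g(y[:, i], beta[i])    # thresholding
                         for i in range(y.shape[1])])
    w, q = np.linalg.eigh(y.T @ y)              # orthogonalization
    return y @ (q @ np.diag(w ** -0.5) @ q.T)
```

Both `g_hard` and `g_soft` satisfy (10): they vanish on |x| ≤ β and move x by at most β; the thresholding step is what zeroes out the coordinates outside the "signal" set.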
Figure 2: The components of the leading eigenvector to recover: (a) the three-peaks function; (b)–(d) the error l(v_1, v̂_1) for the three values of λ_1.

Figure 3: The components of the leading eigenvector to recover: (a) the one-peak function; (b)–(d) the error l(v_1, v̂_1) for the three values of λ_1.

Natural acoustic signals like musical ones exhibit a highly varying temporal structure, so there is a need for adaptive unsupervised methods of signal processing which reduce the complexity of the signal. In [14] a method was proposed which reduces the spectral complexity of music signals using adaptive segmentation of the signal in the spectral domain for principal component analysis, for listeners with cochlear hearing loss. In the following we apply CPAST and SCPAST as an alternative method for the complexity reduction of music signals. To illustrate the use of SCPAST and CPAST we set the memory parameter γ slightly below one and track the leading eigenvector v_1. Fig. 5 contains the results of the recovery of the leading eigenvector with 168 components. The results show that the SCPAST method allows one to obtain a sparse representation of the leading eigenvectors and seems promising for the construction of structure-preserving compressed representations of signals.

Denote by V̄ a matrix with n − d column vectors v_i, i = d + 1, …, n, which complete the orthonormal columns {v_i}_{i=1}^{d} of the matrix V to an orthonormal basis in R^n. Denote by X(t) a matrix with the columns {x(i)}_{i=1}^{t}. From (6) one gets the representation

X(t) = V Λ_d^{1/2} U^⊤(t) + σ Ξ(t),  t = 1, …, T,  (16)

Figure 4: CQT-spectrogram of Bach's Siciliano for oboe and piano. The deep blue color corresponds to zero values, the red color corresponds to higher values.

where U(t) ∈ R^{t×d} and Ξ(t) ∈ R^{n×t} are matrices with independent N(0, 1) entries, V is the orthonormal matrix with columns {v_i}_{i=1}^{n}, and Λ_d is the diagonal matrix with λ_i, i = 1, …, d, on the diagonal. Denote the set of indices of the small components of the leading eigenvectors by N(t) = {1, …, n} \ S(t) (where N stands for “noise”). From (16) the empirical covariance matrix can be decomposed as

Σ̂(t) = (1/t) V Λ_d^{1/2} U^⊤(t) U(t) Λ_d^{1/2} V^⊤ + (1/t) Ξ(t) Ξ^⊤(t) + (1/t) V Λ_d^{1/2} U^⊤(t) Ξ^⊤(t) + (1/t) Ξ(t) U(t) Λ_d^{1/2} V^⊤.  (17)

It is well known [7] that the distance (8) between subspaces 𝒲 and 𝒬, spanned by n × d matrices with orthonormal columns W and Q respectively, is related to the d-th principal angle between the subspaces 𝒲 and 𝒬 as

l(W, Q) = sin φ_d(W, Q),

where the principal angles 0 ≤ φ_1 ≤ ⋯ ≤ φ_d between 𝒲 and 𝒬 are recursively defined as [16]

φ_i(W, Q) = arccos ( ⟨x_i, y_i⟩ / (‖x_i‖ ‖y_i‖) ),

where {x_i, y_i} minimize the angle between x and y over x ∈ 𝒲, y ∈ 𝒬 with x ⊥ x_j, y ⊥ y_j, j < i.
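Numerically, the identity l(W, Q) = sin φ_d(W, Q) is convenient to check via the singular values of W^⊤Q, whose arccosines are exactly the principal angles. A small sketch with our own helper names:

```python
import numpy as np

def subspace_distance(w, q):
    """l(W, Q) = ||W W^T - Q Q^T|| in the operator norm, as in (8)."""
    return np.linalg.norm(w @ w.T - q @ q.T, 2)

def largest_principal_angle(w, q):
    """phi_d(W, Q): arccos of the smallest singular value of W^T Q."""
    s = np.linalg.svd(w.T @ q, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))
```

For equal-dimensional subspaces the two quantities agree up to floating-point error, which is why the proofs below can switch freely between l and sin φ_d.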
Lemma 1. For CPAST (4), with probability 1 − C(n ∨ t)^{−2},

r(t) ≤ [ (λ_{d+1} + 1) tan φ_d(V̂(t − 1), V) + (λ_1 + 1)^{1/2} E(t) sec φ_d(V̂(t − 1), V) ] / [ λ_d + 1 − (λ_1 + 1) E(t) sec φ_d(V̂(t − 1), V) ],  (20)

where E(t) = 5 √((n − d)/t) + 5 √6 √(log(n ∨ t)/t).

The following lemma gives a bound for the error l(V̂(u), V) depending on the error of the previous iteration l(V̂(u − 1), V) for all u ∈ {t_0 + 1, …, t}.

Lemma 2.
With probability greater than 1 − C(n ∨ t)^{−2}, for all u ∈ {t_0 + 1, …, t},

r(u) ≤ [ α_1 r(u − 1) + α_2 R(t)/√u ] / [ √(1 − r²(u − 1)) − α_3 R(t)/√u ],  (21)

where r(u) = sin φ_d(V̂(u), V), R(t) = 5 √(n − d) + 5 √6 √log(n ∨ t), and

α_1 = 1/(λ_d + 1),  α_2 = √(λ_1 + 1)/(λ_d + 1),  α_3 = (λ_1 + 1)/(λ_d + 1).  (22)

Given Lemma 2 it is possible to derive a bound for l(V̂(t), V). First let us state a result which allows one to bound the error of the initial estimate V̂(t_0).

Lemma 3.
Let V̂(t_0) be the matrix containing the first d leading eigenvectors of the matrix Σ̂(t_0). Then with probability 1 − C(n ∨ T)^{−2},

r(t_0) = l(V̂(t_0), V) ≤ α_0/√t_0,

where α_0 = R_max (λ_1 + 1)/λ_d, with R_max = R(T).

The following lemma gives the error of CPAST after observing K vectors x_i, i = t_0 + 1, …, t_0 + K, based on the recursive bound (21). Note that the proof of Lemma 4 also ensures that the denominator in (21) is bounded away from zero.

Lemma 4.
Suppose that r(t_0) ≤ α_0/√t_0, where √t_0 ≥ R_max α_3 (1 − α_1)^{−1/2}. Then for K ≥ K_0(ρ, t_0),

r(t_0 + K) ≤ (α_2/(1 − α_1)) R(t_0 + K)/√(K + t_0).  (23)

The statement of Theorem 1 follows from Lemma 4 applied to the inequality (21) with the coefficients (22), which holds with probability 1 − C(n ∨ t)^{−2}, together with the initial conditions given by Lemma 3.

Define Σ̂°(t), the oracle version of Σ̂(t), and the corresponding expectation Σ°(t) = E Σ̂°(t) as follows:

Σ̂°(t) = [ Σ̂_S(t)  0 ; 0  I_{N(t)} ],  Σ°(t) = [ Σ_S(t)  0 ; 0  I_{N(t)} ],  (24)

where Σ̂_S(t) and Σ_S(t) are the submatrices of size card(S(t)) × card(S(t)) with column and row indices from S(t). The identity matrix I_{N(t)} has size card(N(t)) × card(N(t)). Here we assume without loss of generality that the indices in S(t) are always smaller than those in N(t).

First we obtain the oracle sequence V̂°(t) of solutions by iterating SCPAST with the matrices Σ̂°(t) instead of Σ̂(t). We define the initial estimate V̂°(t_0) by the initialization steps of Section 6.1 applied to the matrix Σ̂°(t_0), and then bound sin φ_d(V̂°(t_0), V). Denote the result of thresholding the columns of the matrix Υ̂°(t) = Σ̂°(t) V̂°(t − 1), with the thresholding parameters given by the vector β(t), by Υ̂°,β(t) = g(Υ̂°(t), β(t)).

Denote by V_{S(t)} the submatrix obtained by selecting the rows of V with indices in S(t). Denote by V_k, k = 1, …, n, the rows of V (recall that the columns are v_j, j = 1, …, d). For the estimators of V we omit the dependence of S(t) on t, as the estimator itself depends on t; that is, V̂_S(t) is the matrix of the rows of V̂(t) with indices from S(t).

The following bound for the oracle error r(t) = l(V̂°(t), V) of the SCPAST method is analogous to Lemma 2.

Lemma 5.
For u = t_0 + 1, …, t, with probability greater than 1 − C(n ∨ t)^{−2} the following bound holds true:

r(u) ≤ [ α_1 r(u − 1) + α_2 R°(t)/√u ] / [ √(1 − r²(u − 1)) − α_3 R°(t)/√u ],  (25)

where

α_1 = (λ_d + 1)^{−1},  α_2 = α_3 = (λ_1 + 1)/(λ_d + 1),  (26)

R°(t) = C_1 h_d M^{1/2}(t) √log(n ∨ t) + C_2 √log(n ∨ t), where C_1 and C_2 are constants and C_1 depends on r, d, a, τ.

In the sparse case a result similar to Lemma 3 holds true, giving a bound on the error of the initial oracle estimator.
Lemma 6.
The error of the initial oracle estimator is bounded as follows:

r(t_0) ≤ α_0/√t_0,

with probability 1 − C(n ∨ T)^{−2}, where

α_0 = ( C_1 (λ_1/λ_d) h_d M^{1/2}(t_0) + C_2 (λ_1 + 1)/λ_d ) √log(n ∨ T),

where C_1 and C_2 are constants and C_1 depends on r.

Lemma 7. After t iterations, t = t_0 + 1, …, T, with probability 1 − C(n ∨ t)^{−2} one has

l(V, V̂°(t)) ≤ C_1 h_d² M(t) log(n ∨ t)/t + C_2 ((λ_1 + 1)/λ_d²) log(n ∨ t)/t,

where C_1 depends on d, r, τ, a, and C_2 depends on τ.

The convergence of the oracle scheme does not immediately imply the convergence of the SCPAST estimators. The following two lemmas state that with high probability V̂°(t) = V̂(t). Thus the bound in Lemma 7 holds for SCPAST and Theorem 2 is justified.

Lemma 8.
For γ ≥ √(log(n ∨ T)/log(n ∨ t_0)), with probability 1 − C(n ∨ T)^{−2} the initial oracle estimate coincides with the initial SPCA estimate, that is, V̂°(t_0) = V̂(t_0).

Lemma 9.
With probability 1 − C(n ∨ t)^{−2}, for u = t_0 + 1, …, t, the oracle SCPAST and SCPAST solutions coincide: V̂°(u) = V̂(u).

Conclusions
We developed a new method, SCPAST, based on the constrained projection approximation subspace tracking method, for subspace tracking under sparsity assumptions on the underlying signal eigen-subspace. The thresholding step was introduced in order to ensure the sparsity of the solution. We presented non-asymptotic bounds for the errors of subspace recovery with SCPAST and CPAST, as well as empirical studies of the methods. The results of the experiments show that the SCPAST method allows one to obtain a sparse representation of the leading eigenvector of music signals and might be used for adaptive compression of musical signals in the spectral domain.

Appendix A. Proofs of Lemmas
Proof of Lemma 1
From the definition (19),

tan φ_d(Σ̂(t) V̂(t − 1), V) = max_{‖x‖=1} ‖V̄^⊤ Σ̂(t) V̂(t − 1) x‖ / ‖V^⊤ Σ̂(t) V̂(t − 1) x‖.

Using Σ̂(t) = Σ̂(t) − Σ + Σ, the latter maximum can be bounded by

max_{‖x‖=1} [ ‖V̄^⊤ V̂(t − 1) x‖ + ‖V̄^⊤ (Σ̂(t) − Σ) V̂(t − 1) x‖ ] / [ (λ_d + 1) ‖V^⊤ V̂(t − 1) x‖ − ‖V^⊤ (Σ̂(t) − Σ) V̂(t − 1) x‖ ].

Using (18),

max_{‖x‖=1} ‖V̄^⊤ (Σ̂(t) − Σ) V̂(t − 1) x‖ / ‖V^⊤ V̂(t − 1) x‖ ≤ ‖V̄^⊤ (Σ̂(t) − Σ)‖ / cos φ_d(V̂(t − 1), V),

max_{‖x‖=1} ‖V^⊤ (Σ̂(t) − Σ) V̂(t − 1) x‖ / ‖V^⊤ V̂(t − 1) x‖ ≤ ‖V^⊤ (Σ̂(t) − Σ)‖ / cos φ_d(V̂(t − 1), V).

From (17),

Σ̂(t) = Σ + (1/t) Ξ(t) Ξ^⊤(t) − I_n + (1/t) V Λ_d^{1/2} U^⊤(t) Ξ^⊤(t) + (1/t) Ξ(t) U(t) Λ_d^{1/2} V^⊤ + V Λ_d^{1/2} ( (1/t) U^⊤(t) U(t) − I_d ) Λ_d^{1/2} V^⊤.  (27)

Therefore, using V̄ V̄^⊤ + V V^⊤ = I_n,

‖V̄^⊤ (Σ̂(t) − Σ)‖ ≤ ‖(1/t) V̄^⊤ Ξ(t) [V̄^⊤ Ξ(t)]^⊤ − I_{n−d}‖ + (1/t) ‖V̄^⊤ Ξ(t) [V^⊤ Ξ(t)]^⊤‖ + (√λ_1/t) ‖V̄^⊤ Ξ(t) U(t)‖,

‖V^⊤ (Σ̂(t) − Σ)‖ ≤ λ_1 ‖(1/t) U^⊤(t) U(t) − I_d‖ + ‖(1/t) V^⊤ Ξ(t) [V̄^⊤ Ξ(t)]^⊤‖ + ‖(1/t) V^⊤ Ξ(t) [V^⊤ Ξ(t)]^⊤ − I_d‖ + 2 √λ_1 ‖(1/t) Ξ(t) U(t)‖.

Using √λ_1 ∨ 1 ≤ √(λ_1 + 1) ≤ λ_1 + 1, and Lemma 10 and Lemma 11 with p = √6, for big enough t,

‖V̄^⊤ (Σ̂(t) − Σ)‖ ≤ √(λ_1 + 1) E(t),  ‖V^⊤ (Σ̂(t) − Σ)‖ ≤ (λ_1 + 1) E(t),  (28)

where E(t) = 5 √((n − d)/t) + 5 √6 √(log(n ∨ t)/t), with probability 1 − C(n ∨ t)^{−2}, where C is a constant. The statement of the lemma follows from the observation that tan φ_d(V̂(t), V) = tan φ_d(Σ̂(t) V̂(t − 1), V).

Proof of Lemma 2
Lemma 1 gives a probabilistic bound on the error of the subspace estimate l(V̂(t), V) in terms of the previous iteration l(V̂(t − 1), V). Our goal is to bound l(V̂(t), V) for t ∈ {t_0 + 1, …, T}. Due to Lemmas 10 and 11 we get, for p > 0 and u = t_0 + 1, …, t: the term √u ‖(1/u) V^⊤ Ξ(u) [V^⊤ Ξ(u)]^⊤ − I_d‖ is bounded from above by 3(√d + p √log(n ∨ t)); the term √u ‖(1/u) V̄^⊤ Ξ(u) [V̄^⊤ Ξ(u)]^⊤ − I_{n−d}‖ by 3(√(n − d) + p √log(n ∨ t)); and the term √u ‖(1/u) V^⊤ Ξ(u) [V̄^⊤ Ξ(u)]^⊤‖ by

√( p log(n ∨ t)/√u ) ( √(n − d) + √d + p √log(n ∨ t) ).

Finally, √u ‖(1/u) V̄^⊤ Ξ(u) U(u)‖ is bounded by

√( p log(n ∨ t)/√u ) ( √(n − d) + √d + p √log(n ∨ t) ).

Each of the bounds holds with probability 1 − (n ∨ t)^{−2} for p = √6. Using the union bound we get the statement of the lemma for the intersection of events, with probability 1 − C(t − t_0)(n ∨ t)^{−2}.

Proof of Lemma 3
The proof is based on the Davis $\sin\theta$ Theorem 4, Lemma 10 (see Appendix B), and Weyl's theorem [17]. From the Davis $\sin\theta$ theorem
\[
l(V,\widehat V(t)) \le \frac{\|(\widehat\Sigma(t)-\Sigma)V\|}{\lambda_d+1-\lambda_{d+1}(\widehat\Sigma(t))}, \tag{29}
\]
where $\lambda_{d+1}(A)$ is the $(d+1)$-th singular value of a matrix $A$, i.e. the square root of the $(d+1)$-th eigenvalue of $A^\top A$. Weyl's theorem gives, for $j=1,\dots,n$,
\[
|\lambda_j+1-\lambda_j(\widehat\Sigma(t))| \le \|\widehat\Sigma(t)-\Sigma\|.
\]
Therefore the denominator in (29) may be bounded as
\[
|\lambda_d+1-\lambda_{d+1}(\widehat\Sigma(t))| \ge \lambda_d-\|\widehat\Sigma(t)-\Sigma\|.
\]
From (27)
\[
\|\widehat\Sigma(t)-\Sigma\| \le \lambda_1\Big\|\frac1tU(t)^\top U(t)-I_d\Big\| + \Big\|\frac1t\Xi(t)\Xi^\top(t)-I_n\Big\| + 2\sqrt{\lambda_1}\Big\|\frac1tU^\top(t)\Xi^\top(t)\Big\|,
\]
and by Lemmas 10 and 11
\[
|\lambda_d+1-\lambda_{d+1}(\widehat\Sigma(t))| \ge (1+o(1))\lambda_d.
\]
From (28) and Lemma 10 one has that with probability $1-C(n\vee t)^{-1}$
\[
\|(\widehat\Sigma(t)-\Sigma)V\| \le (\lambda_1+1)\Big(\sqrt{\frac nt}+\sqrt{\frac{\log(n\vee t)}{t}}\Big).
\]
Combining the last two inequalities we get the statement of the lemma.
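The $\sin\theta$ argument above can be sanity-checked numerically. The sketch below is a minimal illustration with hypothetical dimensions and spike strengths (not the paper's setup): it builds a spiked covariance $\Sigma = V\Lambda V^\top + I_n$, perturbs it, and verifies that the $\sin\theta$ distance between the leading subspaces is dominated by $\|(\widehat\Sigma-\Sigma)V\|$ divided by the spectral gap, as in (29).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 3
lam = np.array([10.0, 8.0, 6.0])               # hypothetical spike strengths

# spiked covariance Sigma = V diag(lam) V^T + I_n
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
V = Q[:, :d]
Sigma = V @ np.diag(lam) @ V.T + np.eye(n)

# symmetric perturbation playing the role of Sigma_hat - Sigma
G = rng.standard_normal((n, n))
Sigma_hat = Sigma + 0.1 * (G + G.T) / 2

# leading d eigenvectors of the perturbed matrix
w, U = np.linalg.eigh(Sigma_hat)               # ascending eigenvalues
V_hat = U[:, -d:]

# sin of the largest principal angle between the two subspaces
s = np.linalg.svd(V_hat.T @ V, compute_uv=False)
sin_theta = np.sqrt(max(0.0, 1.0 - s.min() ** 2))

# Davis sin(theta)-type bound: ||(Sigma_hat - Sigma) V|| over the gap;
# the d-th eigenvalue of Sigma is lam[d-1] + 1 because of the identity noise part
gap = (lam[-1] + 1.0) - np.sort(w)[-(d + 1)]
bound = np.linalg.norm((Sigma_hat - Sigma) @ V, 2) / gap
print(sin_theta, bound)
```

The gap in the denominator matches (29): the $d$-th eigenvalue of $\Sigma$ is $\lambda_d+1$, from which the $(d+1)$-th eigenvalue of the perturbed matrix is subtracted.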
Proof of Lemma 4
First we prove (23) for the pair $(t_0,r(t_0))$ which satisfies, by induction for all $k=1,\dots,K$ and some $\rho\in(\alpha_1,1)$,
\[
\sqrt{1-\Big(r(t_0)+R\,\frac{\alpha_2}{\alpha_1}\,\frac{\rho}{1-\rho}\,\frac1{\sqrt{t_0}}\Big)^2} - \frac{\alpha_2R}{\sqrt{t_0}} > \frac{\alpha_1}{\rho}. \tag{30}
\]
We have for $K=1$
\[
r(t_0+1) \le \frac{\alpha_1r(t_0)+\frac{\alpha_2R}{\sqrt{t_0+1}}}{\sqrt{1-r^2(t_0)}-\frac{\alpha_2R}{\sqrt{t_0+1}}} \le \rho\,r(t_0)+\frac{\rho\alpha_2}{\alpha_1}\,\frac{R}{\sqrt{t_0+1}}.
\]
Furthermore, suppose that (30) holds for $K=L$; then
\[
r(t_0+L) \le \rho^Lr(t_0) + R\,\frac{\alpha_2}{\alpha_1}\sum_{k=1}^{L}\frac{\rho^{L+1-k}}{\sqrt{t_0+k}}
\]
and
\[
r(t_0+L+1) \le \frac{\alpha_1r(t_0+L)+\frac{\alpha_2R}{\sqrt{t_0+L+1}}}{\sqrt{1-r^2(t_0+L)}-\frac{\alpha_2R}{\sqrt{t_0+L+1}}} \le \rho^{L+1}r(t_0) + R\,\frac{\alpha_2}{\alpha_1}\sum_{k=1}^{L+1}\frac{\rho^{L+1-k}}{\sqrt{t_0+k}}.
\]
A sufficient condition for the above formula to hold reads as
\[
\sqrt{1-\Big(\rho^{k-1}r(t_0)+\frac{\alpha_2R}{\alpha_1}\sum_{j=1}^{k-1}\frac{\rho^{k-j}}{\sqrt{t_0+j}}\Big)^2} - \frac{\alpha_2R}{\sqrt{t_0+k}} > \frac{\alpha_1}{\rho}.
\]
Note that $\sum_{j=1}^{k-1}\frac{\rho^{k-j}}{\sqrt{t_0+j}} \le \frac{\rho}{1-\rho}\frac1{\sqrt{t_0}}$; therefore the above condition is fulfilled given (30). Furthermore,
\[
r(t_0+K+1) \le \rho^{K+1}r(t_0) + R\,\frac{\alpha_2}{\alpha_1}\sum_{k=1}^{K+1}\frac{\rho^{K+1-k}}{\sqrt{t_0+k}},
\]
where for $K>K(\rho)$ and $j_{K,\rho} = \log(K)/(2\log(1/\rho))$
\[
\sum_{k=1}^{K}\frac{\rho^{K-k}}{\sqrt{t_0+k}} \le \sum_{j=0}^{j_{K,\rho}}\frac{\rho^{j}}{\sqrt{t_0+K-j}} + \sum_{j=j_{K,\rho}+1}^{K-1}\frac{\rho^{j}}{\sqrt{t_0+K-j}} \le \frac1{(1-\rho)\sqrt{t_0+K-j_{K,\rho}}} + \frac1{(1-\rho)\sqrt{K+t_0}},
\]
and
\[
r(t_0+K+1) \lesssim \frac{\alpha_2}{\alpha_1}\,\frac{\rho}{1-\rho}\,\frac{R}{\sqrt{K+t_0+1}}.
\]
From (30) the condition on the starting value $r(t_0)$ is
\[
r(t_0) \le \sqrt{1-\Big(\frac{\alpha_2R}{\sqrt{t_0}}+\frac{\alpha_1}{\rho}\Big)^2} - \frac{R\alpha_2}{\alpha_1}\,\frac{\rho}{1-\rho}\,\frac1{\sqrt{t_0}}. \tag{31}
\]
Thus the condition on the number of initial observations $t_0$ for (31) to be satisfied, given $r(t_0)\le\alpha_0/\sqrt{t_0}$, reads as
\[
\frac{\alpha_0}{\sqrt{t_0}} \le \sqrt{1-\Big(\frac{\alpha_2R}{\sqrt{t_0}}+\frac{\alpha_1}{\rho}\Big)^2} - R\,\frac{\alpha_2}{\alpha_1}\,\frac{\rho}{1-\rho}\,\frac1{\sqrt{t_0}}.
\]
Therefore, taking into account (26) and $\rho>\alpha_1$, a sufficient condition on $\sqrt{t_0}$ is
\[
\sqrt{t_0} \ge \frac{\alpha_0+\alpha_2R\big(1-\frac{\alpha_1}{\rho}\big)+2R\,\frac{\alpha_2}{\alpha_1}\,\frac{\rho}{1-\rho}}{\sqrt{1-\frac{\alpha_1^2}{\rho^2}}},
\]
with $\alpha_0 = \frac{R_{\max}\alpha_2}{1-\alpha_1}$. Set $\rho=\rho(\epsilon)=1-\epsilon(1-\alpha_1)$. It is easy to check that $\alpha_1<\rho(\epsilon)<1$ for $\epsilon\in(0,1)$. Moreover, $R_{\max}=R(T)$; therefore
\[
\sqrt{t_0} \gtrsim \frac{R_{\max}\,\alpha_2}{\epsilon(1-\alpha_1)^{3/2}\sqrt{1-\epsilon}}.
\]
The value $K(\rho)$ might be defined by $\rho^{K} \le \frac{\alpha_2}{\alpha_1}\,\frac{R}{\sqrt{t_0+K}}$; thus (using $|\ln(1-x)|\ge x$ for $x\in(0,1)$)
\[
K \ge \frac1{\epsilon(1-\alpha_1)}\ln\Big(\frac{\alpha_1}{\alpha_2}\sqrt T\Big).
\]
Put $\epsilon=1/2$.

Proof of Lemma 5
Using the triangle inequality,
\[
\sqrt{l(\widehat V^\circ(t),V)} \le \sqrt{l(\widehat\Upsilon^\circ(t),V)} + \sqrt{l(\widehat\Upsilon^\circ(t),\widehat V^\circ(t))}. \tag{32}
\]
We bound the first term as $l^{1/2}(\widehat\Upsilon^\circ(t),V) \le \tan\phi_d(\widehat\Sigma^\circ(t)\widehat V^\circ(t-1),V)$. Using the variational definition of $\tan\phi_d$,
\[
l^{1/2}(\widehat\Upsilon^\circ(t),V) \le \max_{\|x\|=1}\frac{\|\bar V^\top\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)x\|}{\|V^\top\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)x\|}.
\]
The right hand side may be bounded with
\[
\max_{\|x\|=1}\frac{\|\bar V^\top\Sigma\widehat V^\circ(t-1)x\|+\|\bar V^\top(\widehat\Sigma^\circ(t)-\Sigma)\widehat V^\circ(t-1)x\|}{\|V^\top\Sigma\widehat V^\circ(t-1)x\|-\|V^\top(\widehat\Sigma^\circ(t)-\Sigma)\widehat V^\circ(t-1)x\|}.
\]
The triangle inequality gives
\[
\|(\widehat\Sigma^\circ(t)-\Sigma)\bar V\| \le \|(\widehat\Sigma^\circ(t)-\Sigma^\circ(t))\bar V\| + \|(\Sigma^\circ(t)-\Sigma)\bar V\|.
\]
Note that
\[
\widehat\Sigma^\circ(t)-\Sigma^\circ(t) = \begin{bmatrix}\widehat\Sigma_{S(t)}-\Sigma_{S(t)} & 0\\ 0 & 0\end{bmatrix}, \tag{33}
\]
\[
\Sigma^\circ(t)-\Sigma = \begin{bmatrix}0 & -V_{S(t)}\Lambda_dV_{N(t)}^\top\\ -V_{N(t)}\Lambda_dV_{S(t)}^\top & -V_{N(t)}\Lambda_dV_{N(t)}^\top\end{bmatrix}, \tag{34}
\]
where $V_{S(t)}$ is a submatrix of $V$ with the row indices in $S(t)$. Decompose $\widehat\Sigma_{S(t)}-\Sigma_{S(t)}$ using (27) and (33):
\[
\|(\widehat\Sigma^\circ(t)-\Sigma^\circ(t))\bar V\| \le \lambda_1\|\bar V_{S(t)}^\top V_{S(t)}\|\Big\|\frac1tU(t)^\top U(t)-I_d\Big\| + \Big\|\frac1t\Xi_S(t)\Xi_S^\top(t)-I_{S(t)}\Big\| + 2\sqrt{\lambda_1}\Big\|\frac1tU(t)^\top\Xi_S^\top(t)\Big\|,
\]
where $\Xi_S(t)$ is a $t\times\mathrm{card}(S(t))$ matrix and $U(t)$ is a $t\times d$ matrix; the elements of both matrices are i.i.d. $\mathcal N(0,1)$. Since $\bar V^\top V = \bar V_{S(t)}^\top V_{S(t)} + \bar V_{N(t)}^\top V_{N(t)} = 0$, we may bound
\[
\|\bar V_{S(t)}^\top V_{S(t)}\| \le \|\bar V_{N(t)}^\top V_{N(t)}\| \le \|V_{N(t)}\| \le \|V_{N(t)}\|_F,
\]
where $\|\cdot\|_F$ is the Frobenius norm, i.e. $\|A\|_F = \sqrt{\mathrm{tr}(A^\top A)}$ for any matrix $A$. The quantity $\|V_{N(t)}\|_F$ is small since it depends only on the components of the eigenvectors below the corresponding thresholds (see Lemma 13 and definition (13)):
\[
\|V_{N(t)}\|_F^2 = \sum_{i=1}^d\|v_{i,N(t)}\|^2 \le \sum_{j=1}^d\Bigg[\frac{2^{-r}s_j^r\,t^{r/2}}{(bh_j)^r[\log(n\vee t)]^{r/2}}\wedge n\Bigg]\,b^2h_j\frac{\log(n\vee t)}{t} \le CM(t)h_d\frac{\log(n\vee t)}{t},
\]
where $C$ depends on $d$ and $r$. From Lemmas 10, 11 and 12 (see Appendix B), with probability $1-C(n\vee t)^{-1}$ one can bound
\[
\|(\widehat\Sigma^\circ(t)-\Sigma^\circ(t))\bar V\| \le C_1\lambda_1h_d\frac{M^{1/2}(t)}{\sqrt t} + C_2(\sqrt{\lambda_1}\vee1)\frac{\sqrt{\log(n\vee t)}}{\sqrt t}.
\]
From (34), $\|\bar V^\top(\Sigma^\circ(t)-\Sigma)\| \le \lambda_1\|V_{N(t)}\|$. Thus
\[
\|(\widehat\Sigma^\circ(t)-\Sigma)\bar V\| \le C_1\lambda_1h_dM^{1/2}(t)\sqrt{\frac{\log(n\vee t)}{t}} + C_2(\sqrt{\lambda_1}\vee1)\sqrt{\frac{\log(n\vee t)}{t}}, \tag{35}
\]
where $C_1$ depends on $r$, $d$ and $C_2$ is a constant. Similarly, from (34), $\|V^\top(\Sigma^\circ(t)-\Sigma)\| \le \lambda_1\|V_{N(t)}\|(1+o(1))$ and
\[
\|V^\top(\widehat\Sigma^\circ(t)-\Sigma)\| \le C_1\lambda_1h_dM^{1/2}(t)\sqrt{\frac{\log(n\vee t)}{t}} + C_2(\sqrt{\lambda_1}\vee1)\sqrt{\frac{\log(n\vee t)}{t}}, \tag{36}
\]
where $C_1$ depends on $r$, $d$ and $C_2$ is a constant. The bound on $l(\widehat\Upsilon^\circ(t),\widehat V^\circ(t)) = l(\widehat\Upsilon^\circ(t),\widehat\Upsilon^{\circ,\beta}(t))$ relies on Wedin's $\sin\theta$ Theorem 3 (see Appendix B):
\[
l(\widehat\Upsilon^\circ(t),\widehat\Upsilon^{\circ,\beta}(t)) \le \frac{\|\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)-\widehat\Sigma^{\circ,\beta}(t)\|}{\lambda_d\big(\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)\big)}. \tag{37}
\]
Note that $\|\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)-\widehat\Upsilon^{\circ,\beta}(t)\| \le \|Z(t)\|_F$, where $Z(t)$ is a matrix with the entries $Z_{ij}(t) = \beta_j(t)$ if $i\in S(t)$ and $Z_{ij}(t) = 0$ if $i\in N(t)$. Thus $\|\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)-\widehat\Upsilon^{\circ,\beta}(t)\| \le CM^{1/2}(t)\sum_{i=1}^d\beta_i(t)$, and from (12)
\[
\sum_{i=1}^d\beta_i(t) \le a\sqrt{\frac{\log(n\vee t)}{t}}\sum_{i=1}^d\sqrt{\lambda_i+1} \le da\lambda_d\sqrt{h_d}\sqrt{\frac{\log(n\vee t)}{t}}.
\]
That is,
\[
\|\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)-\widehat\Upsilon^{\circ,\beta}(t)\| \le C'M^{1/2}(t)\lambda_d\sqrt{h_d}\sqrt{\frac{\log(n\vee t)}{t}},
\]
where $C'$ depends on $d$, $a$ and $r$. To bound the denominator of (37), note that one may decompose $\|\Sigma\widehat V^\circ(t-1)x\|^2 = \|\Sigma z_1\|^2+\|\Sigma z_2\|^2$, where $\widehat V^\circ(t-1)x = z_1+z_2$ with $z_1\in\mathrm{ran}(V)$ and $z_2\in\mathrm{ran}(\bar V)$. Thus $\|\Sigma\widehat V^\circ(t-1)x\| \ge \|\Sigma z_1\|$. Using $z_1\in\mathrm{ran}(V)$ one has $\|\Sigma z_1\| \ge (\lambda_d+1)\|z_1\| \ge (\lambda_d+1)\cos\phi_d(V,\widehat V^\circ(t-1))$ and
\[
\lambda_d\big(\widehat\Sigma^\circ(t)\widehat V^\circ(t-1)\big) \ge (\lambda_d+1)\cos\phi_d(V,\widehat V^\circ(t-1)) - \|\widehat\Sigma^\circ(t)-\Sigma^\circ(t)\| - \lambda_1\|V_{N(t)}\|.
\]
Therefore
\[
l^{1/2}(\widehat\Upsilon^\circ(t),\widehat V^\circ(t)) \le \frac{\lambda_dC'M^{1/2}(t)\sqrt{h_d}\sqrt{\frac{\log(n\vee t)}{t}}}{(\lambda_d+1)\cos\phi_d(V,\widehat V^\circ(t-1)) - E^\circ(t)},
\]
where $E^\circ(t) = \big(C_1\lambda_1h_dM^{1/2}(t)+C_2(\sqrt{\lambda_1}\vee1)\big)\sqrt{\frac{\log(n\vee t)}{t}}$. Combining the above inequality, (35), (36), (32) and the spectral gap condition (7), we get the result in the flavour of (1), that is, with probability $1-C(n\vee t)^{-1}$ for one step of the SCPAST algorithm. To get the bounds for $u=t_0+1,\dots,t$ simultaneously, similarly to Lemma 2 we define the events, each of which occurs with probability $1-C(n\vee t)^{-1}$: $\sqrt u\,\big\|\frac1uU(u)^\top U(u)-I_d\big\|$ is bounded from above by $2(\sqrt d+p\sqrt{\log(n\vee t)})$; $\sqrt u\,\big\|\frac1u\Xi_S(u)\Xi_S^\top(u)-I_{S(u)}\big\|$ by $2(\sqrt{\mathrm{card}(S(t))}+p\sqrt{\log(n\vee t)})$; and $\sqrt u\,\big\|\frac1uU(u)^\top\Xi_S^\top(u)\big\|$ by
\[
\frac{\sqrt{p\log(n\vee t)}}{\sqrt u}\big(\sqrt{\mathrm{card}(S(t))}+\sqrt d+p\sqrt{\log(n\vee t)}\big).
\]
Taking the intersection of the above events for $u=t_0+1,\dots,t$ and using Lemma 12, we get the statement of the lemma.

Proof of Lemma 6
Using the Wedin $\sin\theta$ Theorem 3 (Appendix B),
\[
l(V,\widehat V^\circ(t)) \le \frac{\|V^\top(\Sigma-\widehat\Sigma^\circ(t))\|}{\lambda_d-\lambda_{d+1}(\widehat\Sigma^\circ(t))}. \tag{38}
\]
Using Weyl's theorem [17] it may be shown that $\lambda_{d+1}(\widehat\Sigma^\circ(t)) = \lambda_{d+1}+o(\lambda_d)$, and thus $|\lambda_d-\lambda_{d+1}(\widehat\Sigma^\circ(t))| \ge \lambda_d(1+o(1))$. From (36), with probability $1-(n\vee T)^{-1}$,
\[
\|(\widehat\Sigma^\circ(t_0)-\Sigma)V\| \le C_1\lambda_1h_dM^{1/2}(t_0)\sqrt{\frac{\log(n\vee t_0)}{t_0}} + C_2(\sqrt{\lambda_1}\vee1)\sqrt{\frac{\log(n\vee T)}{t_0}}.
\]
Thus $r(t_0)\le\alpha_0/\sqrt{t_0}$ holds with probability $1-(n\vee T)^{-1}$, where
\[
\alpha_0 = \Bigg(\frac{C_1\lambda_1h_dM^{1/2}(t_0)+C_2\sqrt{\lambda_1+1}}{\lambda_d}\Bigg)\sqrt{\log(n\vee T)}.
\]

Proof of Lemma 7

The proof follows from Lemma 4 applied to (25) with $\alpha_1 = \frac{\lambda_{d+1}+1}{\lambda_d+1}$, $\alpha_2 = \frac{\lambda_1+1}{\lambda_d+1}$, and initial conditions given by Lemma 6.

Proof of Lemma 8
Following [12], define $\eta_j = \sum_{i=1}^d\lambda_iv_{ji}^2$, $j=1,\dots,n$, and for $0<a_-<1$
\[
G_+ = \Big\{j:\ \eta_j > a_-\gamma\sqrt{\tfrac{\log(n\vee t)}{t}}\Big\}.
\]
To show that $\widehat V^\circ(t) = \widehat V(t)$ one has to prove that, for a proper choice of $\gamma$ and $a_-$, it holds that $G\subseteq G_+\subseteq S(t)$ with probability $1-C(n\vee T)^{-1}$. We first note that $\widehat\Sigma_{jj}(t) \sim (1+\sum_{i=1}^d\lambda_iv_{ji}^2)\,\xi/t$, where $\xi$ is a $\chi^2_t$ random variable. Therefore
\begin{align*}
P(G\not\subset G_+) &= P\Big(\bigcup_{j\notin G_+}\Big\{\widehat\Sigma_{jj}(t)>1+\gamma\sqrt{\tfrac{\log(n\vee t)}{t}}\Big\}\Big) \le \sum_{j\notin G_+}P\Big(\widehat\Sigma_{jj}(t)>1+\gamma\sqrt{\tfrac{\log(n\vee t)}{t}}\Big)\\
&\le n\,P\Bigg(\frac\xi t-1>\frac{\gamma(1-a_-)\sqrt{\tfrac{\log(n\vee t)}{t}}}{1+a_-\gamma\sqrt{\tfrac{\log(n\vee t)}{t}}}\Bigg) \le \frac{\sqrt n}{\gamma}\exp\Bigg(-\frac{\gamma^2(1-a_-)^2\log(n\vee t)}{4\Big(1+a_-\gamma\sqrt{\tfrac{\log(n\vee t)}{t}}\Big)^2}\Bigg)\\
&\le \frac{\sqrt n}{\gamma}\,(n\vee t)^{-(\gamma^2(1-a_-)^2/4)(1+o(1))}.
\end{align*}
Thus $G\subset G_+$ holds with probability $1-C(n\vee T)^{-1}$, e.g. for $a_-$ bounded away from one and $\gamma\gtrsim\sqrt{\log(n\vee T)/\log(n\vee t)}$. Note that for any $j\in G_+$ there exists $i\in\{1,\dots,d\}$ with $\lambda_iv_{ji}^2 \ge \frac{a_-\gamma}{d}\sqrt{\tfrac{\log(n\vee t)}{t}}$; thus for $G_+\subset S(t)$ to hold it is sufficient that
\[
\frac{a_-\gamma}{d}\,\lambda_i\sqrt{\frac{\log(n\vee T)}{t}} > b^2\,\frac{\lambda_i+1}{\lambda_i}\,\frac{\log(n\vee t)}{t}.
\]
Hence for all $t\le T$, $G\cap S(t) = G$, that is, $\widehat V^\circ(t) = \widehat V(t)$ with probability $1-C(n\vee T)^{-1}$.

Proof of Lemma 9
From Lemma 8, with probability $1-(n\vee T)^{-1}$ the results of the original and the oracle versions of the zero-step estimation procedure coincide, that is, $\widehat V^\circ(t_0) = \widehat V(t_0)$. First let us show that a similar statement holds for $\widehat V^\circ(t_0+1)$ and $\widehat V(t_0+1)$. Denote $t_1 = t_0+1$. On the event on which $\widehat V^\circ(t_0) = \widehat V(t_0)$ holds, it is true that
\[
\widehat\Upsilon(t_1) = \widehat\Sigma(t_1)\widehat V(t_0) = \widehat\Sigma(t_1)\widehat V^\circ(t_0).
\]
From the construction of $\widehat V^\circ(t_0)$, the submatrix $\widehat V^\circ_{N(t_0)}$ has zero entries. Note that $S(t_0)\subseteq S(t_1)$ and $N(t_1)\subseteq N(t_0)$. Thus
\[
\widehat\Upsilon(t_1)_{k,l} = \widehat\Sigma_{k,S(t_0)}\,\widehat v^\circ_{l,S(t_0)}, \tag{39}
\]
where $\widehat v^\circ_{l,S(t_0)}$ is a vector of size $\mathrm{card}(S(t_0))$ containing the components of $\widehat v^\circ_l(t_0)$ indexed by $S(t_0)$, and $\widehat\Sigma_{k,S(t_0)}$ is a row containing the components of the $k$-th row of $\widehat\Sigma(t_1)$ indexed by $S(t_0)$. Let us show that for $k\in N(t_1)$, with high probability,
\[
\widehat\Upsilon(t_1)_{k,l} \le \beta_l(t_1) = a\sqrt{\frac{(\lambda_l+1)\log(n\vee t_1)}{t_1}},
\]
that is, during the thresholding step the components from $N(t_1)$ would be set to zero with high probability. From (16),
\[
t_1\widehat\Sigma_{k,S(t_0)} = V_k\Lambda_d^{1/2}U(t_1)^\top U(t_1)\Lambda_d^{1/2}V_{S(t_0)}^\top + \Xi_k(t_1)\Xi_{S(t_0)}^\top(t_1) + V_k\Lambda_d^{1/2}U(t_1)^\top\Xi_{S(t_0)}^\top(t_1) + \Xi_k(t_1)U(t_1)\Lambda_d^{1/2}V_{S(t_0)}^\top, \tag{40}
\]
where $\Xi_k(t_1)$ is the $k$-th row of $\Xi(t_1)$. Denote by $V^\circ_{S(t_0)}$ a matrix containing the first $d$ eigenvectors of $\Sigma_{S(t_0)}$ as columns (recall (24)) and by $\bar V^\circ_{S(t_0)}$ a matrix with $\mathrm{card}(S(t_0))-d$ columns which complete the columns of $V^\circ_{S(t_0)}$ to an orthonormal basis in $\mathbb R^{\mathrm{card}(S(t_0))}$. Note that
\[
\bar V^\circ_{S(t_0)}\bar V^{\circ,\top}_{S(t_0)} + V^\circ_{S(t_0)}V^{\circ,\top}_{S(t_0)} = I_{S(t_0)}.
\]
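The sets $S(\cdot)$ and $N(\cdot)$ used throughout this proof come from diagonal screening of the empirical covariance. As a standalone numerical illustration (hypothetical sizes and a hand-picked threshold of the form $1+\gamma\sqrt{\log(n\vee t)/t}$, in the spirit of the proof of Lemma 8, not the paper's exact constants), thresholding the diagonal recovers the support of a sparse spike:

```python
import numpy as np

rng = np.random.default_rng(3)
n, t, d, lam = 200, 500, 1, 10.0

# sparse leading eigenvector supported on the first 10 coordinates
v = np.zeros(n)
v[:10] = 1.0 / np.sqrt(10)

# spiked model: x_i = sqrt(lam) * u_i * v^T + xi_i
U = rng.standard_normal((t, d))
Xi = rng.standard_normal((t, n))
X = np.sqrt(lam) * U @ v[None, :] + Xi

# diagonal of the empirical covariance; signal coordinates have mean 1 + lam * v_j^2
diag = (X ** 2).mean(axis=0)
gamma = 3.0
threshold = 1.0 + gamma * np.sqrt(np.log(max(n, t)) / t)
S = np.flatnonzero(diag > threshold)
print(S)
```

With these margins (signal diagonal mean $2.0$ against a threshold near $1.33$) the selected set coincides with the true support for essentially every draw.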
Inserting this identity (applied to $V^\top_{S(t_0)}$) in view of (40), one gets
\[
\widehat\Upsilon(t_1)_{k,l} = q_{11}+q_{12}+q_{13}+q_{14}+q_{21}+q_{22}+q_{23}+q_{24},
\]
where the terms $q_{1i}$ listed below depend on $\bar V^\circ_{S(t_0)}$ and the terms $q_{2i}$ depend on $V^\circ_{S(t_0)}$. Let us first bound the terms $q_{11}$ and $q_{21}$. To this end, we add and subtract $V_k\Lambda_dV^\top_{S(t_0)}$ in the first term in (40) and use that $\big\|\frac1{t_1}U(t_1)^\top U(t_1)-I_d\big\| = o(1)$; thus
\[
|q_{11}| \le (1+o(1))\,\|V_k\Lambda_d\|\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|, \tag{41}
\]
where it was also used that $\|V^\top_{S(t_0)}\bar V^\circ_{S(t_0)}\|\le1$. Recall that $k\in N(t_1)$, that is, $|v_{ki}| \le b\sqrt{h_i\frac{\log(n\vee t_1)}{t_1}}$. Using the definition of $\beta_k(t)$ (recall (12)),
\[
\|V_k\Lambda_d\| \le \frac ba\,\beta_k(t_1)\sqrt{\sum_{i=1}^d\frac{\lambda_i+1}{\lambda_k+1}}, \tag{42}
\]
\[
\|V_k\Lambda_d^{1/2}\| \le \frac ba\,\beta_k(t_1)\sqrt{\sum_{i=1}^d\frac{\lambda_i+1}{(\lambda_k+1)\lambda_i}}. \tag{43}
\]
Thus, using (42) and (43), the term (41) may be bounded as
\[
|q_{11}| \le (1+o(1))\,\frac ba\,\beta_k(t_1)\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|\sqrt{\sum_{i=1}^d\frac{\lambda_i+1}{\lambda_l+1}},
\]
and in the same way it can be shown that
\[
|q_{21}| \le (1+o(1))\,\frac ba\,\beta_k(t_1)\,\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|\sqrt{\sum_{i=1}^d\frac{\lambda_i+1}{\lambda_l+1}}.
\]
Next,
\[
|q_{12}| = \frac1{t_1}\big|\Xi_k(t_1)\Xi^\top_{S(t_0)}(t_1)\bar V^\circ_{S(t_0)}\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\big| \le \frac1{t_1}\,|\zeta(k,l,S(t_0))|\,\|\Xi^\top_{S(t_0)}(t_1)\|\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|,
\]
where
\[
\zeta(k,l,S(t_0)) = \frac{t_1\,q_{12}}{\|\Xi^\top_{S(t_0)}(t_1)\bar V^\circ_{S(t_0)}\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|}.
\]
Note that $\widehat v^\circ_{l,S(t_0)}$ is the $l$-th eigenvector of $\widehat\Sigma_{S(t_0)}$ and doesn't depend on $\Xi_k(t_1)$, $k\in N(t_1)$; and since $N(t_1)\subseteq N(t_0)$, $\Xi_k(t_1)$ is independent from $\Xi_{S(t_0)}(t_1)$; thus $\zeta(k,l,S(t_0))$ has the $\mathcal N(0,1)$ distribution. Define the events
\[
\{|\zeta(k,l,S(t_0))|\le\sqrt{c\log(n\vee t_1)}\}\quad\text{and}\quad\{\|\Xi_{S(t_0)}(t_1)\| \le \sqrt{t_1}+\sqrt{\mathrm{card}(S(t_0))}+2\sqrt{\log(n\vee t_1)}\}.
\]
For big enough $t_1$ (guaranteed by (14)), $t_1$ dominates $\mathrm{card}(S(t_0))$ and $\log(n\vee t_1)$. Thus
\[
|q_{12}| \le \frac1{t_1}\,|\zeta(k,l,S(t_0))|\,\|\Xi^\top_{S(t_0)}(t_1)\|\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| \le \frac1a\Big(\frac c{\lambda_l+1}\Big)^{1/2}\beta_l(t_1)\sqrt{\frac{\log(n\vee t_0)}{\log(n\vee t_1)}}\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|.
\]
On the events defined in the end of the proof of Lemma 5, the bound for the term $q_{13}$ is as follows:
\[
|q_{13}| \le \frac1{t_1}\,\|V_k\Lambda_d^{1/2}\|\,\|U(t_1)^\top\Xi^\top_{S(t_0)}(t_1)\|\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| \le \frac ba\,\frac{\mathrm{card}(S(t_0))\sqrt{\lambda_d}}{t_1}\,\sqrt{\frac{\log(n\vee t_0)}{\log(n\vee t_1)}}\,\beta_l(t_1)\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|.
\]
From $\frac{\lambda_d\,\mathrm{card}(S(t_0))}{t_1} = o(1)$ (see the supplementary materials for [5], p. 16) it follows that $|q_{13}| = o(\beta_l(t_1))$. To bound the term $q_{14}$ one may utilize the same argument as for $q_{12}$:
\[
|q_{14}| \le \frac1{t_1}\,|g(k,l,S(t_0))|\,\|U(t_1)\Lambda_d^{1/2}\|\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|,
\]
where
\[
g(k,l,S(t_0)) = \frac{t_1\,q_{14}}{\|U(t_1)\Lambda_d^{1/2}V^\top_{S(t_0)}\bar V^\circ_{S(t_0)}\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|}
\]
is independent from $U(t_1)$ and $\widehat v^\circ_{l,S(t_0)}$; furthermore, $g(k,l,S(t_0))$ has the $\mathcal N(0,1)$ distribution. Define the events $\{\|U(t_1)\|\le\sqrt{t_1}+\sqrt d+2\sqrt{\log(n\vee t_1)}\}$ and $\{|g(k,l,S(t_0))|\le\sqrt{c\log(n\vee t_1)}\}$. On these events, with probability $1-C(n\vee t_1)^{-1}$,
\[
|q_{14}| \le \frac1{t_1}\,|g(k,l,S(t_0))|\,\|U(t_1)\|\,\|\Lambda_d^{1/2}\|\,\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| \le \frac1a\Big(\frac{c\lambda_1}{\lambda_l+1}\Big)^{1/2}\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|\,\beta_l(t_1)\sqrt{\frac{\log(n\vee t_0)}{\log(n\vee t_1)}}.
\]
In a similar way the term $q_{22}$ may be bounded as follows:
\[
|q_{22}| \le \Big\|\frac1{t_1}\Xi_k(t_1)\Xi^\top_{S(t_0)}(t_1)V^\circ_{S(t_0)}\Big\|\,\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| \le \frac ba(1+o(1))\sqrt{\frac c{\lambda_l+1}}\,\beta_l(t_1)\sqrt{\frac{\log(n\vee t_0)}{\log(n\vee t_1)}}\,\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|.
\]
The bound on the term $q_{23}$ due to (43) reads as
\[
|q_{23}| = \frac1{t_1}\,\|V_k\Lambda_d^{1/2}\|\,\|U(t_1)^\top\Xi^\top_{S(t_0)}(t_1)V^\circ_{S(t_0)}\|\,\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| \le \sqrt{\frac{\log(n\vee t_1)}{t_1}}\,\|V_k\Lambda_d^{1/2}\|\,\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| = o(\beta_l(t_1)).
\]
Similarly to the case of $q_{13}$, one can show that
\[
|q_{24}| \le \Big\|\frac1{t_1}\Xi_k(t_1)U(t_1)\Big\|\,\|\Lambda_d^{1/2}V^\top_{S(t_0)}V^\circ_{S(t_0)}\|\,\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| \le \sqrt{\frac{c\lambda_1}{\lambda_l+1}}\,\frac1a\,\beta_l(t_1)\sqrt{\frac{\log(n\vee t_0)}{\log(n\vee t_1)}}\,\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\|.
\]
Note that (see [5]) $\|V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| = 1+o(1)$ and $\|\bar V^{\circ,\top}_{S(t_0)}\widehat v^\circ_{l,S(t_0)}\| = o(1)$, that is, from the above bounds, $\sum_{i=1}^4|q_{1i}| = o\big(\sum_{i=1}^4|q_{2i}|\big)$.
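All the events above are instances of Gaussian operator-norm concentration, with Lemma 10 of Appendix B as the workhorse. A quick Monte Carlo sketch (hypothetical sizes $t=2000$, $n=50$; `E_bound` implements the envelope $E(t,n,p)$ in the form stated in Lemma 10) checks the bound for one draw:

```python
import numpy as np

# E(t, n, p) as in Lemma 10 (Appendix B), with delta_{n,t} = log(n v t)
def E_bound(t, n, p):
    delta = np.log(max(n, t))
    a = np.sqrt(n / t) + p * np.sqrt(delta) / np.sqrt(t)
    return 3.0 * max(a, a ** 2)

rng = np.random.default_rng(1)
t, n, p = 2000, 50, np.sqrt(6)
X = rng.standard_normal((t, n))

# spectral-norm deviation of the sample Gram matrix from the identity
dev = np.linalg.norm(X.T @ X / t - np.eye(n), 2)
print(dev, E_bound(t, n, p))
```

The observed deviation is of order $2\sqrt{n/t}$ (the Bai-Yin scale), comfortably below the envelope $E(t,n,p)$.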
Therefore
\[
\widehat\Upsilon(t_1)_{k,l} \le \frac ba\,\beta_l(t_1)\sqrt{\sum_{j=1}^d\frac{\lambda_j+1}{\lambda_l+1}} + \frac1a\,\beta_l(t_1)\sqrt{\frac{c\log(n\vee t_0)}{\log(n\vee t_1)}}\sqrt{\frac{\lambda_1+1}{\lambda_l+1}}.
\]
Note that $\sqrt{\sum_{j=1}^d\frac{\lambda_j+1}{\lambda_l+1}} \le \sqrt\tau\sqrt d$, since $\lambda_1/\lambda_j\le\tau$. Bounding $\log(n\vee t_0)$ by $\log(n\vee T)$, it suffices to take $a \ge \sqrt{c\log(n\vee T)/\log(n\vee t_0)}$ and $b$ proportional to $\big(a-\sqrt{c\log(n\vee T)/\log(n\vee t_0)}\big)\big/\big(a\sqrt\tau\sqrt d\big)$. Therefore one gets $|\widehat\Upsilon(t_1)_{k,l}| \le \beta_l(t_1)$ for all $k\in N(t_1)$ and $\widehat V^\circ_N(t_0+1) = 0$, and so $\widehat V^\circ_N(t_0+1) = \widehat V_N(t_0+1)$. To show that $\widehat V^\circ_N(u) = \widehat V_N(u)$, $u = t_0+2,\dots,t$, we consider the events defined for the standard normal random variables $\zeta(k,l,S(u))$ and $g(k,l,S(u))$:
\[
\{|\zeta(k,l,S(u))|\le\sqrt{c\log(n\vee t)}\},\qquad\{|g(k,l,S(u))|\le\sqrt{c\log(n\vee t)}\}.
\]
Using the union bound,
\begin{align*}
P\Big(\bigcap_{l\in N(u),\,k=1,\dots,d,\,u=t_0+1,\dots,t}\big\{|\zeta(k,l,S(u))|\le\sqrt{c\log(n\vee t)}\big\}\Big) &\ge 1-\sum_{k=1}^d\sum_{u=t_0+1}^t\sum_{l\in N(u)}P\big\{|\zeta(k,l,S(u))|>\sqrt{c\log(n\vee t)}\big\}\\
&\ge 1-nd\,(t-t_0)\,P\big\{|\zeta(k,l,S(t))|>\sqrt{c\log(n\vee t)}\big\}\\
&\ge 1-\frac{Cn(t-t_0)}{\sqrt{\log(n\vee t)}}\,(n\vee t)^{-c/2}.
\end{align*}
Take $c$ big enough.

Appendix B. Concentration of the spectral norm of the perturbation

Denote $\delta_{n,t} = \log(n\vee t)$.

Lemma 10 ([18]). Let $X$ be a $t\times n$ matrix with i.i.d. $\mathcal N(0,1)$ entries. The following result holds true:
\[
P\Big(\Big\|\frac1tX^\top X-I_n\Big\| \ge E(t,n,p)\Big) \le (n\vee t)^{-p^2/2},
\]
where
\[
E(t,n,p) = 3\max\Bigg(\sqrt{\frac nt}+p\,\frac{\sqrt{\delta_{n,t}}}{\sqrt t},\ \Bigg[\sqrt{\frac nt}+p\,\frac{\sqrt{\delta_{n,t}}}{\sqrt t}\Bigg]^2\Bigg).
\]

Lemma 11 ([19]). Let $X$ and $Y$ be $t\times q$ and $t\times m$ matrices, $q>m$, with i.i.d.
$\mathcal N(0,1)$ entries. Then for any $0<x<1$ and $c>0$
\[
P\big(\|X^\top Y\| \ge t\,E(t,q,m,x,c)\big) \le e^{-c\delta_{n,t}} + qe^{-x\delta_{n,t}},
\]
where
\[
E(t,q,m,x,c) = \sqrt{\frac{x\,\delta_{n,t}}{t}}\Bigg(\sqrt{\frac qt}+\sqrt{\frac mt}+c\,\frac{\sqrt{\delta_{n,t}}}{\sqrt t}\Bigg).
\]

Lemma 12 ([5]). There exist constants $\tilde C$ and $\bar C$ depending on $r$ such that
\[
E(t,\mathrm{card}(S(t)),p) \le \tilde C\,\frac{\lambda_1M^{1/2}(t)}{\sqrt t}\,h_d + \tilde C\sqrt{\frac{\delta_{n,t}}{t}},
\]
\[
E(t,\mathrm{card}(S(t)),d,x,p) \le \bar C\,\frac{M^{1/2}(t)}{\sqrt t}\,h_d + \bar C\sqrt{\frac{\delta_{n,t}}{t}}.
\]

Theorem 3 (Wedin $\sin\theta$). Let $A$ and $B$ be $n\times k$, $n\ge k$, full-column-rank matrices. Let the columns of an $n\times(n-k)$ matrix $U$ form an orthonormal basis of the orthogonal complement of the range of $B$. If $\lambda_{\min}(A)\ge\epsilon>0$, then
\[
l(A,B) \le \frac{\|A^\top U\|}{\epsilon} \le \frac{\|B-A\|}{\epsilon}.
\]

Theorem 4 (Davis $\sin\theta$, [20]). Let $A$ and $B$ be symmetric matrices with the decompositions $A = W_1\Lambda_1W_1^\top + W_2\Lambda_2W_2^\top$ and $B = U_1\Delta_1U_1^\top + U_2\Delta_2U_2^\top$, with the conditions that $[U_1,U_2]$ is orthogonal, $W_1$ is orthonormal and $W_1^\top W_2 = 0$, the eigenvalues of $\Lambda_1$ are contained in the interval $(a_1,a_2)$, and the eigenvalues of $\Delta_2$ lie outside of the interval $(a_1-\epsilon,a_2+\epsilon)$ for some $\epsilon>0$. Then
\[
l(W_1,U_1) \le \frac{\|U_2^\top(B-A)W_1\|}{\epsilon\,\lambda_{\min}(W_1)}.
\]

Lemma 13.
The norms of the subvectors $v_{j,N(t)}$ of $v_j$ satisfy
\[
\|v_{j,N(t)}\|^2 \le \Bigg[2^{-r}s_j^r\,(b\,h_j(t))^{-r}\Big(\frac{\delta_{n,t}}{t}\Big)^{-r/2}\wedge n\Bigg]\,b^2h_j(t)\,\frac{\delta_{n,t}}{t}.
\]

Lemma 14.
A bound on the effective dimension $\mathrm{card}(S(t))$ is given by
\[
d \le \mathrm{card}(S(t)) \le CM(t) = C\Bigg[n\wedge\sum_{j=1}^d s_j^r\,h_j^{-r}\Big(\frac{\delta_{n,t}}{t}\Big)^{-r/2}\Bigg].
\]

Lemma 15.
For a $\chi^2_t$ random variable $\zeta_t$ the following bounds hold [21]:
\begin{align*}
P(\zeta_t > t(1+\varepsilon)) &\le e^{-3t\varepsilon^2/16}, && 0<\varepsilon<1/2,\\
P(\zeta_t < t(1-\varepsilon)) &\le e^{-t\varepsilon^2/4}, && 0<\varepsilon<1,\\
P(\zeta_t > t(1+\varepsilon)) &\le \frac{\sqrt2}{\varepsilon\sqrt t}\,e^{-t\varepsilon^2/4}, && 0<\varepsilon<\sqrt t,\ t>1.
\end{align*}

Acknowledgment
This work is funded by the German Research Foundation (DFG), Collaborative Research Center 823, Subprojects B3 and C5.
References

[1] B. Yang, "Projection approximation subspace tracking," IEEE Transactions on Signal Processing, vol. 43, no. 1, pp. 95-107, 1995.
[2] J.-P. Delmas, "Subspace tracking for signal processing," 2010.
[3] B. Yang, "Convergence analysis of the subspace tracking algorithms PAST and PASTd," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, 1996, pp. 1759-1762.
[4] B. Yang, "Asymptotic convergence analysis of the projection approximation subspace tracking algorithms," Signal Processing, vol. 50, no. 1-2, pp. 123-136, 1996.
[5] Z. Ma et al., "Sparse principal component analysis and iterative thresholding," The Annals of Statistics, vol. 41, no. 2, pp. 772-801, 2013.
[6] A. Valizadeh and M. Karimi, "Fast subspace tracking algorithm based on the constrained projection approximation," EURASIP Journal on Advances in Signal Processing, vol. 2009, p. 9, 2009.
[7] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012, vol. 3.
[8] A. Birnbaum, I. M. Johnstone, B. Nadler, and D. Paul, "Minimax bounds for sparse PCA with noisy high-dimensional data," Annals of Statistics, vol. 41, no. 3, p. 1055, 2013.
[9] D. L. Donoho, "Unconditional bases are optimal bases for data compression and for statistical estimation," Applied and Computational Harmonic Analysis, vol. 1, no. 1, pp. 100-115, 1993.
[10] I. M. Johnstone and A. Y. Lu, "Sparse principal components analysis," Unpublished manuscript, vol. 7, 2004.
[11] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348-1360, 2001.
[12] D. Paul and I. M. Johnstone, "Augmented sparse principal component analysis for high dimensional data," arXiv preprint arXiv:1202.1242, 2012.
[13] D. Paul, "Nonparametric estimation of principal components," 2004.
[14] E. Krymova, A. Nagathil, D. Belomestny, and R. Martin, "Segmentation of music signals based on explained variance ratio for applications in spectral complexity reduction," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 206-210.
[15] J. C. Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425-434, 1991.
[16] M. Hardt and E. Price, "The noisy power method: A meta algorithm with applications," in Advances in Neural Information Processing Systems, 2014, pp. 2861-2869.
[17] G. Stewart and J.-G. Sun, "Matrix perturbation theory (Computer Science and Scientific Computing)," 1990.
[18] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," arXiv preprint arXiv:1011.3027, 2010.
[19] K. R. Davidson and S. J. Szarek, "Local operator theory, random matrices and Banach spaces," Handbook of the Geometry of Banach Spaces, vol. 1, pp. 317-366, 2001.
[20] C. Davis and W. M. Kahan, "The rotation of eigenvectors by a perturbation. III," SIAM Journal on Numerical Analysis, vol. 7, no. 1, pp. 1-46, 1970.
[21] I. M. Johnstone, "Chi-square oracle inequalities,"