Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation From Undersampled Data
Dejiao Zhang and Laura Balzano
[email protected] and [email protected]
Department of Electrical Engineering and Computer Science, University of Michigan
Abstract
Subspace learning and matrix factorization problems have many applications in science and engineering, and efficient algorithms are critical as dataset sizes continue to grow. Many relevant problem formulations are non-convex, and in a variety of contexts it has been observed that solving the non-convex problem directly is not only efficient but reliably accurate. We discuss convergence theory for a particular method: first-order incremental gradient descent constrained to the Grassmannian. The output of the algorithm is an orthonormal basis for a d-dimensional subspace spanned by an input streaming data matrix. We study two sampling cases: where each data vector of the streaming matrix is fully sampled, or where it is undersampled by a sampling matrix A_t ∈ R^{m×n} with m ≪ n. Our results cover two cases, where A_t is Gaussian or a subset of rows of the identity matrix. We propose an adaptive stepsize scheme that depends only on the sampled data and algorithm outputs. We prove that with fully sampled data, the stepsize scheme maximizes the improvement of our convergence metric at each iteration, and this method converges from any random initialization to the true subspace, despite the non-convex formulation and orthogonality constraints. For the case of undersampled data, we establish monotonic expected improvement on the defined convergence metric for each iteration with high probability.

∗ The work of both authors in this publication was supported by the U.S. Army Research Office under grant number W911NF1410634.

1 Introduction

Low-rank matrix factorization is an essential tool for high-dimensional inference with fewer measurements than variables of interest, where low-dimensional models are necessary to perform accurate and stable inference. Many modern problems fit this paradigm, where signals are undersampled because of sensor failure, resource constraints, or privacy concerns. Suppose we wish to factorize a matrix M = U W^T when
we only get a small number of linear measurements of M. Solving for the subspace basis U can be computationally burdensome in this undersampled problem and related regularized problems. Many algorithms that attempt to speed up computation are solving a non-convex optimization problem, and therefore come with few guarantees. The Singular Value Decomposition (SVD) provides the solution to the non-convex matrix factorization problem with full data, and there are several highly successful algorithms for solving it [17]. Unfortunately, these algorithms cannot easily be extended to problems with incomplete observations of the matrix. Recently, several results have been published with first-of-their-kind guarantees for a variety of different gradient-type algorithms on non-convex matrix factorization problems [2, 10, 13, 15, 19, 20, 34]. These new algorithms, being gradient-based, are well suited to extensions of the SVD where the matrix is not fully sampled and where we include different cost functions or regularizers. For example, gradient methods for the SVD may extend to Robust PCA [12, 18, 31], Sparse PCA [14], or even ℓ1 PCA [11]. However, almost none of these results gives guarantees in the streaming problem, where data can only be accessed one partial column vector at a time. This is a critical problem in the modern machine learning context with massive data and comparatively limited memory, and in applications where data are collected continuously and must be processed in real time. The existing theoretical results for the streaming problem significantly overestimate the number of samples needed for convergence for typical algorithms.

Our contribution is to provide a global convergence result for d-dimensional subspace estimation using an incremental gradient algorithm performed on the Grassmannian, the space of all d-dimensional subspaces of R^n, denoted by G(n, d). Subspace estimation is a special case of matrix factorization with orthogonality constraints, where we seek to estimate only the subspace spanned by the columns of the left matrix factor U ∈ R^{n×d}. Our result demonstrates that, for fully sampled data without noise, this gradient algorithm converges globally to the global minimizer almost surely, i.e., it converges from any random initialization to the global minimizer. For undersampled data, including compressively sampled data and missing data, we provide results showing monotonic improvement in expectation on the metric of convergence for each iteration.

To the best of our knowledge, [33] provided the first global convergence result for an incremental gradient descent method on the Grassmannian. Compared to [33], our contributions are the following: (1) simplifying the analysis of [33] and providing a tighter bound on the number of iterations required to converge to the global optimal point; (2) extending the results to undersampled data, including both compressive measurements and missing data.

The paper is organized as follows. The problem formulation and the GROUSE algorithm are described in Section 2. The global convergence result for fully sampled data is presented in Section 4, the convergence behavior of GROUSE with undersampled data is studied in Section 5, and the corresponding proofs are provided in Sections 8.1, 8.2 and 8.3. Experiment results are in Section 6, and the conclusion follows in Section 7.

2 Problem Setting
In this paper, we consider the problem of learning a low-dimensional subspace representation from streaming data. Specifically, we are given a sequence of observations x_t = A_t v_t, where A_t ∈ R^{m×n} (m ≤ n) are sampling matrices that are given for each observation, and the vectors v_t ∈ R^n are drawn from a continuous distribution with support on the true subspace, spanned by Ū ∈ R^{n×d} with orthonormal columns, i.e., v_t = Ū s_t with s_t ∈ R^d. We study three different sampling frameworks: the fully sampled case with A_t being the identity matrix, the compressively sampled case with A_t ∈ R^{m×n} (m ≪ n) being random Gaussian matrices, and the missing data case where each row of A_t (m ≪ n) is uniformly sampled from the rows of the identity matrix.

We formulate subspace estimation as a non-convex optimization problem as follows. Let U ∈ R^{n×d} be a matrix with orthonormal columns. Then we want to solve:

    minimize_{U ∈ R^{n×d}}  Σ_{t=1}^{T} min_{w_t} ‖A_t U w_t − x_t‖²    (1)
    subject to  span(U) ∈ G(n, d).

This problem is non-convex firstly because of the product of the two variables U and w_t, and secondly because the optimization is over the Grassmannian G(n, d), the non-convex set of all d-dimensional subspaces of R^n. We study an online algorithm to solve this problem, where we process one observation at a time and perform a rank-one update to generate a sequence of estimates U_t, with the goal that R(U_t) → R(Ū), where R(·) denotes the column range.

We can see the relationship between our problem and the well-studied low-rank matrix recovery problem. Let W ∈ R^{d×T} and M = [v_1, ..., v_T] ∈ R^{n×T}; then (1) is equivalent to

    minimize_{U ∈ R^{n×d}, W ∈ R^{d×T}}  ‖A(U W) − A(M)‖²    (2)
    subject to  span(U) ∈ G(n, d),

where A : R^{n×T} → R^{mT} is a linear operator. Our algorithm can be thought of as an incremental algorithm for solving this problem as well. Motivated by the recent success of directly solving non-convex factorization problems (as we discuss in the related work below), we study the natural incremental gradient descent algorithm [9] applied to (1) directly. Since the optimization variable in our problem is a subspace, we constrain the gradient descent to the Grassmannian G(n, d). The resulting algorithm is called GROUSE (Grassmannian Rank-One Update Subspace Estimation) and is described in Algorithm 1. This description differs from its initial introduction in [5] in that it extends the missing data case to a more general sampling framework.
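For concreteness, the following Python sketch (ours, not the paper's) generates one streaming observation x_t = A_t v_t under each of the three sampling frameworks described above; the function names are illustrative.

```python
import numpy as np

def planted_subspace(n, d, rng):
    """Orthonormal basis Ubar of a random d-dimensional subspace of R^n."""
    Ubar, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return Ubar

def observation(Ubar, mode, m, rng):
    """One streaming observation x_t = A_t v_t with v_t = Ubar s_t."""
    n, d = Ubar.shape
    s = rng.standard_normal(d)          # zero mean, unit variance (Assumption 1)
    v = Ubar @ s                        # full data vector on the true subspace
    if mode == "full":                  # A_t = I_n
        A = np.eye(n)
    elif mode == "gaussian":            # compressive: i.i.d. N(0, 1/n) entries
        A = rng.standard_normal((m, n)) / np.sqrt(n)
    elif mode == "missing":             # each row drawn uniformly from rows of I_n
        A = np.eye(n)[rng.choice(n, size=m, replace=True)]
    return A, A @ v
```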
Algorithm 1 GROUSE: Grassmannian Rank-One Update Subspace Estimation
  Given U_0, an n × d matrix with orthonormal columns, 0 < d < n; set t := 0;
  repeat
    Given the sampling matrix A_t : R^n → R^m and the observation x_t = A_t v_t;
    Define w_t := argmin_a ‖A_t U_t a − x_t‖²;
    Define p_t := U_t w_t, r̃_t := x_t − A_t p_t, and r_t := A_t^T r̃_t;
    Using the step size
        θ_t = arctan( ‖r_t‖ / ‖p_t‖ ),    (3)
    update with a gradient step on the Grassmannian:
        U_{t+1} := U_t + ( y_t/‖y_t‖ − p_t/‖p_t‖ ) w_t^T/‖w_t‖,    (4)
    where y_t/‖y_t‖ = (p_t/‖p_t‖) cos(θ_t) + (r_t/‖r_t‖) sin(θ_t);
    t := t + 1;
  until termination

At each step, the GROUSE algorithm receives a vector x_t = A_t v_t and tries to minimize the inconsistency between R(U) and the true subspace R(Ū) with respect to the information revealed in the sampled vector x_t, i.e.,

    F(U; t) = min_a ‖A_t U a − x_t‖².    (5)

In order to do so, GROUSE forms the gradient of F with respect to U evaluated at the current estimate U_t, and takes a step in the direction of the negative gradient restricted to the Grassmannian. The derivation of the incremental gradient descent update rule on the Grassmannian is found in [5, 8], and we summarize it here.

To compute the gradient of F on the Grassmannian, we first need the derivative of F with respect to U evaluated at U_t. As we will prove later, under mild conditions A_t U_t has full column rank with high probability. Therefore, the derivative is

    dF/dU = −2 A_t^T r̃_t w_t^T,    (6)

where r̃_t := x_t − A_t U_t w_t denotes the residual vector with respect to the sampled vector x_t, and w_t is the least-squares solution of (5). Using Equation (2.70) in [16], the gradient of F on the Grassmannian then follows as

    ∇F = (I − U_t U_t^T) dF/dU = −2 (I − U_t U_t^T) A_t^T r̃_t w_t^T = −2 A_t^T r̃_t w_t^T.    (7)

The final equality follows because r̃_t ⊥ A_t U_t, which can be verified using the definitions of w_t and r̃_t. According to Eq. (2.65) in [16], a gradient step along the geodesic with tangent vector −∇F can then be formed as a function of the singular values and singular vectors of ∇F. For this specific case of our rank-one ∇F given in (7), the update rule follows as

    U(η) = U_t + [ (cos(η_t σ_t) − 1) U_t w_t/‖w_t‖ + sin(η_t σ_t) A_t^T r̃_t/‖A_t^T r̃_t‖ ] w_t^T/‖w_t‖,    (8)

where η_t > 0 is the chosen step size at iteration t, p_t := U_t w_t is the predicted value of the projection of the vector v_t onto R(U_t), and σ_t = ‖A_t^T r̃_t‖ ‖p_t‖. By leveraging the fact that r̃_t ⊥ A_t U_t and p_t ∈ R(U_t), it is easy to verify that the rank-one update (8) maintains orthogonality, U(η)^T U(η) = I_d, and tilts R(U_t) to a new point on the Grassmannian.

In summary, for each observation the GROUSE algorithm works as follows: it projects the data vector onto the current estimate of the true subspace with respect to the sampling matrix A_t, to get either the exact (when A_t = I_n) or approximate projection p_t and residual r_t = A_t^T r̃_t. Then GROUSE updates the current estimate with a rank-one step as described by (4). In the present work, we propose an adaptive stepsize framework that sets the stepsize based only on the sampled data and the algorithm outputs. More specifically, at each iteration a stepsize η_t is chosen such that η_t σ_t = arctan( ‖r_t‖ / ‖p_t‖ ).
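The following sketch implements one pass of the update (3)-(4) with the adaptive step size. It is a minimal transcription of Algorithm 1 (not the authors' reference code) and assumes A_t U_t has full column rank.

```python
import numpy as np

def grouse_step(U, A, x):
    """One GROUSE update: returns U_{t+1} given the current basis U (n x d),
    sampling matrix A (m x n), and observation x = A v (length m)."""
    AU = A @ U
    w, *_ = np.linalg.lstsq(AU, x, rcond=None)   # w_t = argmin_a ||A U a - x||^2
    p = U @ w                                    # predicted projection p_t
    r_tilde = x - AU @ w                         # residual in the measurement space
    r = A.T @ r_tilde                            # r_t = A^T r_tilde
    norm_p, norm_r, norm_w = np.linalg.norm(p), np.linalg.norm(r), np.linalg.norm(w)
    if norm_r < 1e-12 or norm_p < 1e-12:         # nothing to learn from this sample
        return U
    theta = np.arctan(norm_r / norm_p)           # adaptive step size (3)
    y_unit = np.cos(theta) * p / norm_p + np.sin(theta) * r / norm_r
    return U + np.outer(y_unit - p / norm_p, w / norm_w)   # rank-one update (4)
```

Because r_t ⊥ R(U_t), the update keeps U_{t+1}^T U_{t+1} = I_d up to rounding error; in long runs one may re-orthonormalize occasionally to control accumulated numerical error.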
As shown in Section 4, the proposed stepsize scheme isgreedy for the fully sampled data, i.e., it maximizes the improvement of our definedconvergence metric at each iteration. For the undersampled data, we establish a localconvergence result by showing that, with the proposed stepsize, GROUSE moves thecurrent estimated subspace towards the true subspace with high probability despite thenonconvex nature of the problem and undersampled data. Many recent results have shown theoretical support for directly solving non-convex ma-trix factorization problems with gradient or alternating minimization methods. Amongthe incremental methods [15] is the one closest to ours, where the authors considerrecovering a positive semidefinite matrix with undersampled data. They propose a stepsize scheme with which they prove global convergence results from a randomly gen-erated initialization. However, their choice of step size depends on the knowledge ofsome parameters that are likely to be unknown in practical problems. Without thisknowledge, the results only hold with sufficiently small step size that implies signifi-cantly slower convergence. In contrast, while our work applies more narrowly to thesubspace estimation problem, our convergence results are much more accurate to whatis seen in practice, using a step size that only depends on the observations and out-puts of the algorithms. Other work that has looked at incremental methods has focusedonly on fully sampled vectors. For example, [4] invokes a martingale-based argumentto derive the global convergence rate of the proposed incremental PCA method to thesingle top eigenvector in the fully sampled case. In contrast, [3] estimates the best d -dimensional subspace in the fully sampled case and provides a global convergenceresult by relaxing the non-convex problem to a convex one. We seek to identify the d dimensional subspace by solving the non-convex problem directly.5he results in this paper are very closely related to our previous work [7] and [33].In [7], we prove that, within a local region of the true subspace, an expected improve-ment of their defined convergence metric for each iteration of GROUSE can be ob-tained. In contrast, we establish global convergence results to a global minimizer fromany random initialization for fully sampled data, and extend the local convergence re-sults to compressively sampled data. We also expand the local convergence results in[7] to a much less conservative region, and we provide a much simpler analysis frame-work that can be applied to different sampling strategies. Moreover, for each iterationof the GROUSE algorithm, the expected improvement on the convergence metric de-fined in [7] only holds locally in both theory and practice, while our theoretical resultprovides a tighter bound for the global convergence behavior of GROUSE over a vari-ety of simulations. This suggests that our result has more promise to be extended to aglobal result for both missing data and compressively sampled data. In [33], we onlyconsider fully sampled data and study the global convergence behavior of GROUSE interms of two phases with two different convergence metrics. In comparison, we extendthe results to undersampled data and provide an unifying analysis framework with onlyone single convergence metric for both fully sampled and undersampled data. Withthis much simpler analysis framework, the global convergence result of full data casewe provide in this paper is still tighter than that in [33]. 
Hence the focus of this paperis more general and the analysis is more concise than that in [33].Turning to batch methods, [26, 20] provided the first theoretical guarantee for analternating minimization algorithm for low-rank matrix recovery in the undersampledcase. Under typical assumptions required for the matrix recovery problems [25], theyestablished geometric convergence to the global optimal solution. Earlier work [21, 23]considered the same undersampled problem formulation and established convergenceguarantees for a steepest descent method (and a preconditioned version) on the fullgradient, performed on the Grassmannian. [13, 10, 34] considered low rank semidef-inite matrix estimation problems, where they reparamterized the underlying matrix as M = U U T , and update U via a first order gradient descent method. However, allthese results require batch processing and a decent initialization that is close enoughto the optimal point, resulting in a heavy computational burden and precluding prob-lems with streaming data. We study random initialization, and our algorithm has fast,computationally efficient updates that can be performed in an online context.Lastly, several convergence results for optimization on general Riemannian mani-folds, including several special cases for the Grassmannian, can be found in [1]. Mostof the results are very general; they include global convergence rates to local optimafor steepest descent, conjugate gradient, and trust region methods, to name a few. Weinstead focus on solving the problem in (1) and provide global convergence rates to theglobal minimum.Before we present the main results, we first call out the following notation whichwe use throughout the paper. For notational convenience, we will drop the iterationsubscript except our convergence metric ζ t defined in Definition 1 hereafter. Notation
We use R(M) to denote the column space of a matrix M and P_M to denote the orthogonal projection onto R(M). I_n denotes the identity matrix in R^{n×n}, and M_i denotes the i-th row of the matrix M. In this paper, unless otherwise specified, ‖·‖ denotes the ℓ_2 norm. R(Ū) and R(U) denote the true subspace and our estimated subspace respectively, where both Ū and U are matrices in R^{n×d} with orthonormal columns. We also use v∥ and v⊥ to denote the projection and residual of the underlying full vector v ∈ R^n onto the estimated subspace R(U), i.e., v∥ = U U^T v and v⊥ = v − v∥. Note that these two quantities are in general unknown in the undersampled data case. We define them so as to relate the intermediate quantities, determined by the algorithm and the sampled data, to the improvement on our defined convergence metric.

In this section, we first define our convergence metric and describe an assumption on the streaming data needed to establish our results. Subsequently, we state a fundamental result that is essential to quantify the improvement on the convergence metric over GROUSE iterates.
Definition 1 (Determinant similarity). Our measure of similarity between R(U) and R(Ū) is ζ ∈ [0, 1], defined as

    ζ := det( Ū^T U U^T Ū ) = ∏_{k=1}^{d} cos² φ_k,

where φ_k denotes the k-th principal angle between R(Ū) and R(U) (see [27], Chapter 5), given by cos φ_k = σ_k( Ū^T U ), with σ_k denoting the k-th singular value of Ū^T U.

The convergence metric ζ increases to one as our estimate R(U) converges to R(Ū), i.e., as all principal angles between the two subspaces go to zero. Compared to other convergence metrics, defined either as ‖(I − Ū Ū^T) U‖_F² = d − ‖Ū^T U‖_F² = Σ_{k=1}^{d} sin² φ_k or as 1 − ‖Ū^T U‖_2² = sin² φ_1, our convergence metric ζ measures the similarity instead of the discrepancy between R(U) and R(Ū). In other words, ζ achieves its maximum value of one when R(U) converges to R(Ū), while the typical subspace distance is zero when the subspaces are equal. Also note that ζ = 0 if and only if at least one of the principal angles is a right angle. That is, all stationary points U_stat of the full data problem except the true subspace have det( Ū^T U_stat U_stat^T Ū ) = 0 [32, 8].
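As a quick numerical illustration (ours, not from the paper), the determinant similarity can be computed directly from the singular values of Ū^T U:

```python
import numpy as np

def determinant_similarity(Ubar, U):
    """zeta = det(Ubar^T U U^T Ubar) = prod_k cos^2(phi_k), for orthonormal Ubar, U."""
    s = np.linalg.svd(Ubar.T @ U, compute_uv=False)   # s_k = cos(phi_k)
    return float(np.prod(s**2))

# Sanity check on a random pair of subspaces:
rng = np.random.default_rng(0)
Ubar, _ = np.linalg.qr(rng.standard_normal((200, 5)))
U, _ = np.linalg.qr(rng.standard_normal((200, 5)))
zeta = determinant_similarity(Ubar, U)
assert 0.0 <= zeta <= 1.0
assert np.isclose(determinant_similarity(Ubar, Ubar), 1.0)
```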
Assumption 1. For the underlying data v = Ū s, we assume the entries of s are uncorrelated and each has zero mean and unit variance.

Given this assumption, we have the following lemma, which relates the projection v∥ and the projection residual v⊥ to the improvement in the convergence metric ζ_t. As we will show in the following sections, this lemma is crucial for establishing the expected improvement on our convergence metric ζ_t for all of the sampling frameworks considered in this work. The proof is provided in Section 8.1.

Lemma 1.
Let v∥ and v⊥ denote the projection and residual of the full data sample v onto the current estimate R(U). Then given Assumption 1, for each iteration of GROUSE we have

    E[ ‖v⊥‖² / ‖v∥‖² | U ] ≥ E[ ‖v⊥‖² / ‖v‖² | U ] ≥ (1 − ζ_t)/d.    (9)

Although both the projection (v∥) and the projection residual (v⊥) are in general unknown for undersampled data, we can relate the approximate projection residual A^T r̃ to the true residual v⊥ by leveraging either random matrix theory or the incoherence property of the underlying subspace R(Ū). Therefore, the above lemma provides a unifying step to quantify the improvement on the convergence metric for all cases considered in the present work.

4 Convergence with Fully Sampled Data

In this section, we consider fully sampled data, i.e., A = I_n. The corresponding proofs for these results can be found in Section 8.2. We start by deriving a greedy step size scheme for each iteration t that maximizes the improvement on our convergence metric ζ_t. For each update we prove the following:

    ζ_{t+1} / ζ_t = ( cos θ + (‖v⊥‖ / ‖v∥‖) sin θ )².    (10)

It then follows that

    θ* = argmax_θ ζ_{t+1}/ζ_t = arctan( ‖v⊥‖ / ‖v∥‖ ).    (11)

This is equivalent to (3) in the fully sampled setting A_t = I_n. Using θ*, we obtain monotonic improvement on the determinant similarity, quantified by the following lemma.

Lemma 2 (Monotonicity for the fully sampled noiseless case). For fully sampled data, choosing step size θ* = arctan( ‖v⊥‖/‖v∥‖ ), after one iteration of GROUSE we obtain

    ζ_{t+1} / ζ_t = 1 + ‖v⊥‖² / ‖v∥‖² ≥ 1.

Under the mild assumption that each data vector is randomly sampled from the underlying subspace, we obtain strict improvement on ζ_t at each iteration provided ‖v⊥‖ > 0 and ‖v∥‖ > 0; we will discuss this further below. Note that conditioned on U, ζ_t is deterministic. Therefore, according to Lemma 1 and Lemma 2, we obtain the following lower bound on the expected improvement on ζ_t for each iteration.

Lemma 3 (Expected improvement on ζ_t). When fully sampled data satisfying Assumption 1 are input to GROUSE (Algorithm 1), the expected improvement after one update step is given as:

    E[ ζ_{t+1} | U ] ≥ ( 1 + (1 − ζ_t)/d ) ζ_t.

All stationary points U_stat other than the true subspace have det( Ū^T U_stat U_stat^T Ū ) = 0, because they have at least one direction orthogonal to Ū [8]. Therefore, if the initial point U_0 has determinant similarity with Ū strictly greater than zero, then we are guaranteed to stay away from the other stationary points, since GROUSE increases the determinant similarity monotonically, according to Lemma 2. We therefore may initialize GROUSE with U_0 drawn uniformly from the Grassmannian, e.g., as the orthonormal basis of a random matrix V ∈ R^{n×d} with independent standard Gaussian entries, which guarantees ζ_0 > 0 with probability one.

Lemma 4. [24] Initialize the starting point U_0 of GROUSE as the orthonormalization of an n × d matrix with entries being standard normal random variables. Then

    E[ζ_0] = E[ det( U_0^T Ū Ū^T U_0 ) ] = C ( d/(ne) )^d,

where C > 0 is a constant.

Finally, with Lemmas 3 and 4 providing the expected initial value E[ζ_0] and the expected convergence rate of ζ_t, we establish a global convergence result for GROUSE.
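Before stating the global result, the following short simulation (ours) checks the monotone per-iteration improvement of Lemma 2 for fully sampled data, reusing grouse_step and determinant_similarity from the earlier sketches; the problem sizes are illustrative.

```python
import numpy as np

n, d, iters = 100, 5, 400
rng = np.random.default_rng(1)
Ubar, _ = np.linalg.qr(rng.standard_normal((n, d)))   # true subspace
U, _ = np.linalg.qr(rng.standard_normal((n, d)))      # random initialization
I_n = np.eye(n)

zeta = determinant_similarity(Ubar, U)
for t in range(iters):
    v = Ubar @ rng.standard_normal(d)                 # fully sampled data vector
    U = grouse_step(U, I_n, v)
    zeta_next = determinant_similarity(Ubar, U)
    assert zeta_next >= zeta - 1e-9                   # Lemma 2: monotone improvement
    zeta = zeta_next
print(f"determinant similarity after {iters} iterations: {zeta:.6f}")
```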
Theorem 5 (Global Convergence of GROUSE). Let 1 ≥ ζ* > 0 be the desired accuracy of our estimated subspace. Initialize U_0 of GROUSE as the range of an n × d matrix with i.i.d. standard normal entries. Then for any ρ > 0, after

    K ≥ K_1 + K_2 = (2d/ρ + 1) τ d log(n) + 2d log( 1/(2ρ(1 − ζ*)) )

iterations of GROUSE (Algorithm 1), P( ζ_K ≥ ζ* ) ≥ 1 − ρ, where

    τ = 1 + ( log((1 − ρ/2)/C) + d log(e/d) ) / (d log n),

with C the constant of Lemma 4.

The proof is provided in Section 8.2, where we show that the iteration complexity is a combination of the iterations required in two phases: K_1 = (2d/ρ + 1) τ d log(n) is the number of iterations required by GROUSE to achieve ζ_t ≥ 1/2 from a random initialization U_0, and K_2 = 2d log( 1/(2ρ(1 − ζ*)) ) is the number of additional iterations required by GROUSE to converge to the given accuracy ζ* from ζ_{K_1} = 1/2. Compared with the similar result in [33], our result is tighter while the analysis is much more concise. In [33], the corresponding bound on the number of iterations required to reach a given accuracy is larger by polynomial factors. The authors also use a two-phase analysis strategy, but analyze a different convergence metric in each phase. By leveraging a relationship between the two convergence metrics, they combine the convergence results of each phase to give a global result. Our analysis avoids this complication by focusing on the determinant similarity metric.

We note that Theorem 5 requires fully observed noiseless data, which is not very practical in many cases. However, our result provides the first convergence guarantee for a Grassmannian gradient descent method for subspace estimation with streaming data. It is an important initial step for further studies of more general cases, including undersampled data and noisy data with outliers. In the following section, we analyze the convergence behavior of GROUSE for undersampled data. We leave the corrupted data case as future work.

5 Convergence with Undersampled Data

In this section, we consider undersampled data, where each vector v is subsampled by a sampling matrix A ∈ R^{m×n} with the number of measurements much smaller than the ambient dimension (m ≪ n). We study two typical cases: compressively sampled data, where the A are random Gaussian matrices, and missing data, where each row of A is uniformly sampled from the identity matrix I_n ∈ R^{n×n}.

We first outline several elementary facts that help us understand how the GROUSE algorithm navigates the Grassmannian with undersampled data. The proofs can be found in Section 8.3. Suppose AU has full column rank; then the projection coefficients w are found by the least-squares solution w := argmin_a ‖AUa − x‖², i.e., w = (U^T A^T A U)^{−1} U^T A^T x. Note that x = Av; therefore we can further decompose the projection coefficients w as w = w∥ + w⊥, where

    w∥ = (U^T A^T A U)^{−1} U^T A^T A v∥,   w⊥ = (U^T A^T A U)^{−1} U^T A^T A v⊥.    (12)

This decomposition explicitly shows the perturbation induced by the undersampling framework: Av⊥ is not in general perpendicular to AU, even though v⊥ is orthogonal to R(U). We now use this perturbation to show how the approximate projection p and residual r deviate from the exact ones obtained by projecting the full data sample v onto the current estimate R(U).

Lemma 6.
Given Eq. (12), let p = p∥ + p⊥ with p∥ = U w∥ and p⊥ = U w⊥. Then

    p∥ = v∥   and   r = A^T A v⊥ − A^T P_{AU}( A v⊥ ).    (13)

Proof.
Let a = U^T v∥; then a is the unique solution to U w = v∥, given that U has full column rank. Since AU also has full column rank, b = (U^T A^T A U)^{−1} U^T A^T A v∥ is also the unique solution to AU w = A v∥. It then follows that AU a = A v∥ = AU b, and therefore a = b. The second statement follows from the fact that A v∥ = AU w∥ ∈ R(AU). Hence r̃ = (I_m − P_{AU}) A v = (I_m − P_{AU}) A v⊥; recall that P_{AU} denotes the orthogonal projection operator onto the column space of AU. This together with r = A^T r̃ completes the proof.

Below we lower bound the improvement on ζ_t as a function of the key quantities r, r̃ and p. Compared to Lemma 2, Lemma 6 and Lemma 7 highlight how the perturbations induced by the undersampling framework influence the improvement on ζ_t at each iteration. Being able to analyze and bound the quantities that include these perturbations is the key to establishing the expected improvement on ζ_t for undersampled data.

Lemma 7.
Suppose AU has full column rank. Then for each iteration of GROUSE we have

    ζ_{t+1} / ζ_t ≥ 1 + ( 2‖r̃‖² − ‖r‖² ) / ‖p‖² + 2Δ / ‖p‖²,    (14)

where Δ = w⊥^T ( Ū^T U )^{−1} Ū^T r with w⊥ = (U^T A^T A U)^{−1} U^T A^T A v⊥.

Next, we proceed by bounding the key quantities in Lemma 7 so as to establish convergence results for both compressively sampled data and missing data.
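Before specializing to the two sampling models, the following sketch (ours) computes the quantities appearing in Lemmas 6 and 7 for one undersampled observation. It assumes both U and AU have full column rank, and it uses the true basis Ū, which is available only in simulation.

```python
import numpy as np

def grouse_quantities(Ubar, U, A, v):
    """Key quantities of Lemmas 6-7 for one undersampled observation x = A v."""
    x = A @ v
    AU = A @ U
    w = np.linalg.solve(AU.T @ AU, AU.T @ x)         # least-squares coefficients
    p = U @ w                                        # approximate projection
    r_tilde = x - AU @ w                             # residual, orthogonal to R(AU)
    r = A.T @ r_tilde
    # exact projection / residual of the full vector (unknown to the algorithm)
    v_par = U @ (U.T @ v)
    v_perp = v - v_par
    w_perp = np.linalg.solve(AU.T @ AU, AU.T @ (A @ v_perp))
    delta = w_perp @ np.linalg.solve(Ubar.T @ U, Ubar.T @ r)   # Delta of Lemma 7
    return w, p, r_tilde, r, delta
```

In simulation, Lemma 6 can be verified by checking that U @ (w - w_perp) matches the exact projection v_par.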
This section presents convergence results for compressively sampled data. We use an approach that merges linear algebra with random matrix theory to establish an expected rate of improvement on the determinant similarity ζ_t at each iteration. We show that, under mild conditions, the determinant similarity increases in expectation at a rate similar to that of the fully sampled case, roughly scaled by m/n. Detailed proofs for this section are provided in Section 8.3.1.

Theorem 8.
Suppose each sampling matrix A has i.i.d Gaussian entries distributedas N (0 , /n ) . Let φ d denote the largest principal angle between R ( U ) and R ( ¯ U ) ,then with probability exceeding − exp (cid:16) − dδ (cid:17) − exp (cid:16) − mδ + d log (cid:0) δ (cid:1)(cid:17) − (4 d +2) exp (cid:16) − mδ (cid:17) we obtain E v (cid:2) ζ t +1 (cid:12)(cid:12) U (cid:3) ≥ (cid:18) γ (cid:18) − γ dm (cid:19) mn − ζ t d (cid:19) ζ t , where γ = (1 − δ ) ( − δ √ mn ) (cid:16) q δ − δ dm (cid:17) and γ = (cid:18) φ d )+ δ d cos( φd ) ( − δ √ mn ) √ (1+ δ ) d/m (cid:19) δ − δ . Now let β = δ )(1 − δ ) (1 − δ ) , further suppose m ≥ d · max (cid:26) δ log (cid:18) n /d δ (cid:19) , β (tan φ d + δ cos φ d d ) (cid:18) tan φ d + δ cos φ d d + 12 (cid:19)(cid:27) , then with probability at least − /n − exp (cid:0) − dδ / (cid:1) we have E v (cid:2) ζ t +1 (cid:12)(cid:12) U (cid:3) ≥ (cid:18) γ mn − ζ t d (cid:19) ζ t . ζ t can be obtained with high probability as long as the number of samples is enough.As shown in Theorem 8, our theory for GROUSE requires more measurements when R ( U ) is far away from R ( ¯ U ) , in which case cos φ d =: ε is very small. In the high di-mensional setting where m ≪ n , compared to the fully sampled data case, the expectedimprovement on ζ t is approximately scaled down by mn . As we will show, this scalingfactor is mainly determined by the relative amount of effective information stored inthe approximated projection residual. On the other hand, due to the perturbation anduncertainty induced by the compressed sampling framework, the improvement on thedeterminant similarity given by the lower bound in Lemma 7 is neither monotonic norglobal. This is the main hurdle to pass before we can provide a global convergenceresult for undersampled data. However, despite of these difficulties, we are still able toestablish Theorem 8 which shows that, with reasonable number of measurements, theexpected improvement on the convergence metric is monotonic with high probabilityas long as our estimate R ( U ) is not too far away from the true subspace R ( ¯ U ) .To prove Theorem 8, we provide the following intermediate results to quantifythe key quantities in Lemma 7 with high probability, where probability is taken withrespect to the random Gaussian sampling matrix A . Lemma 9.
Under the same conditions as Theorem 8, with probability at least − exp (cid:16) − mδ (cid:17) − exp (cid:16) − mδ (cid:17) − exp (cid:16) − dδ (cid:17) we obtain k e r k ≥ (1 − δ ) (cid:18) − β dm (cid:19) mn k v ⊥ k (15) k e r k − k r k ≥ (1 − δ ) (cid:18) − δ r mn (cid:19) (cid:18) − β dm (cid:19) mn k v ⊥ k (16) where δ , δ ∈ (0 , , and β = δ − δ . To interpret the above results, note that k e r k = k ( I m − P AU ) Av ⊥ k = k Av ⊥ k − kP AU ( Av ⊥ ) k . (17)where the first equality follows by the fact that ( I m − P AU ) Av k = 0 as we argued be-fore, and the second equality holds since P AU is an orthogonal projection onto R ( AU ) .Then by leveraging the concentration property of random projection, we can prove that k e r k concentrates around its expectation m − dn k v ⊥ k with high probability. Also notethat k r k ≤ k A k k e r k , hence the second statement (16) can be established by theconcentration result of k e r k and that of k A k according to the random matrix theory.Next we establish high probability bounds on k p k and ∆ . Then Theorem 8 followsnaturally by first replacing the key quantities in Lemma 7 with their high probabilitybounds, and then taking the expectation over the uncertainty of the underlying full data v t . Lemma 10.
With the same conditions as Theorem 8, for any δ ∈ (0 , , we have k p k ≤ r δ − δ dm ! k v k ith probability at least − exp (cid:16) − dδ (cid:17) − exp (cid:16) − mδ + d log (cid:16) δ (cid:17)(cid:17) . Lemma 11.
With the same conditions as Theorem 8, let δ , δ ∈ (0 , , then ∆ ≤ r δ − δ dm (cid:18) tan( φ d ) + δ d cos( φ d ) (cid:19) mn k v ⊥ k holds with probability at least − exp (cid:16) − dδ (cid:17) − exp (cid:16) − mδ + d log (cid:16) δ (cid:17)(cid:17) − d exp (cid:16) − mδ (cid:17) . Lemma 10 shows that k p k doesn’t diverge significantly from k v k as long as m ≥ d . This together with Lemma 7 and Lemma 9 imply that the required num-ber of measurements in Theorem 8 is mainly determined by that required by Lemma11 so as to prevent ∆ diverging too far from mn k v ⊥ k . As a result, the improvement onthe determinant similarity is still dominated by the magnitude of the projection residualover that of the projection, which is proportional to that of the full data case scaled bythe sampling density. On the other hand, Lemma 11 implies that, in order to guarantee ∆ to be much smaller than mn k v ⊥ k , the number of required measurements increasesalong with first principal angle between the estimated subspace R ( U ) and the true sub-space R ( ¯ U ) .For the sake of completeness, we sketch the proof of Theorem 8 here, and thedetailed proof is provided in Section 8.3.1. Proof sketch of Theorem 8.
Let η = δ − δ dm , η = (1 − δ ) (cid:0) − δ p mn (cid:1) and η =tan( φ d ) + δ d cos( φ d ) , then plugging in the results in Lemmas 9, 10 and 11 into Lemma7 with δ = δ = δ = δ yields, ζ t +1 ζ t ≥ γ (cid:18) − γ dm (cid:19) mn k v ⊥ k k v k ≥ γ (cid:18) − γ dm (cid:19) mn − ζ t d (18)where γ = (1 − δ ) ( − δ √ mn ) (cid:16) q δ − δ dm (cid:17) and γ = (cid:18) tan( φ d )+ δ d cos( φd ) ( − δ √ mn ) √ (1 − δ ) d/m (cid:19) δ − δ .The first probability bound is obtained by taking the union bound of those quantitiesused to generate Lemma 9 to Lemma 11, which can be lower bounded by − exp (cid:18) − dδ (cid:19) − exp (cid:18) − mδ
32 + d log (cid:18) δ (cid:19)(cid:19) − (4 d + 2) exp (cid:18) − mδ (cid:19) (19)Next we establish the complexity bound on m . As we will prove in Section 8.3.1, γ dm < is equivalent to the following, m ≥ δ )(1 − δ ) (1 − δ ) (cid:16) ε + δ p ε d (cid:17) (cid:18) ε + δ p ε d + 12 (cid:19) d (20)13o establish another bound on m , m ≥ δ log (cid:16) n /d δ (cid:17) d implies the following, exp (cid:18) − mδ
32 + d log (cid:18) δ (cid:19)(cid:19) ≤ exp( − log n ) = 1 n (21) (4 d + 2) exp (cid:18) − mδ (cid:19) ≤ (4 d + 2) n (cid:18) δ (cid:19) d ≪ n (22)(21) and (22) complete the proof for the bound on m and justify the simplification ofthe probability bound in (19). In this section, we study the convergence of GROUSE for the missing data case. Weshow that within the local region of the true subspace, we obtain an expected mono-tonic improvement on our defined convergence metric with high probability. We use Ω to denote the indices of observed entries for each data vector, and we assume Ω is uni-formly sampled over { , , . . . , n } with replacement. In other words, we assume eachrow of the sampling matrices A is uniformly sampled from the rows of identity matrix I n with replacement. We use the notation Av =: v Ω , AU =: U Ω . Again our results arewith high probability with respect to A , in this case with respect to the random drawof rows of I n , and in expectation with respect to the random data v . Please refer toSection 8.3.2 for the proofs of this section.Before we present our main results, we first call out the typical incoherence as-sumption on the underlying data. Definition 2.
A subspace R(U) is incoherent with parameter µ if

    max_{i ∈ {1,...,n}} ‖P_U e_i‖² ≤ µd/n,

where e_i is the i-th canonical basis vector and P_U is the projection operator onto the column space of U. Note that 1 ≤ µ ≤ n/d. In line with the above definition, the incoherence parameter of a vector z ∈ R^n is defined as

    µ(z) = n ‖z‖_∞² / ‖z‖².    (23)

In this section, we assume the true subspace R(Ū) is incoherent with parameter µ, and use µ(U), µ(v⊥) to denote the incoherence parameters of R(U) and v⊥ respectively. We now show the expected improvement of ζ_t in a local region of the true subspace.

Theorem 12.
Suppose P dk =1 sin φ k ≤ dµ n and | Ω | = m . If m > max (cid:26) dµ (cid:16) √ dn (cid:17) , µ ( v ⊥ ) log ( n ) , (cid:16) p µ ( v ⊥ ) log( n ) (cid:17) dµ (cid:27) hen with probability at least − n we have E v (cid:2) ζ t +1 (cid:12)(cid:12) U (cid:3) ≥ mn − ζ t d . This theorem shows that, within the local region of the true subspace, expectedimprovement on ζ t can be obtained with high probability. As is implied by the theorem,this local region gets enlarged if the true subspace is more coherent, which may seemat first counterintuitive. However, the required number of measurements also increasesas we increase µ . In the extreme case, when m increases to n , the local convergenceresults can be extended to a global result, as we proved for the full data case in Section4. On the other hand, compared to Theorem 8, the convergence result for the missingdata case holds within a more conservative local region of the true subspace. This gapis induced by the challenge of maintaining the incoherence property of our estimates R ( U ) , for which we had to consider the worst case. We leave the extension of the localconvergence results to global results as future work.In order to compare our result to the local convergence result in [Corollary 2.15,[7]], consider the following corollary. Corollary 13.
Define the determinant discrepancy as κ t = 1 − ζ t , then under the sameconditions as Theorem 12, we have E v (cid:2) κ t +1 (cid:12)(cid:12) κ t (cid:3) ≤ (cid:18) − (cid:18) − dµ n (cid:19) mnd (cid:19) κ t with probability exceeding − /n . Recall that ≤ µ ≤ nd , therefore the expected linear decay rate of κ t is at least − mnd . In [7] (Corollary 2.15), a similar linear convergence result is established interms of the Frobenius norm discrepancy between R ( ¯ U ) and R ( U ) , denoted as ǫ t = P di =1 sin φ d . However, their result only holds when ǫ t ≤ (8 × − ) mn d which ismore conservative than our assumption in Theorem 12. Moreover, as we mentionedpreviously, empirical evidence shows the lower bound in Theorem 12 holds for everyiteration from any random initialization. In contrast, in [7], even for numerical resultsexpected linear improvements only hold within the local region of the true subspace.Now we present the following intermediate results for the proof of Theorem 12.Note that in this missing data case, the projection residual r Ω of v Ω onto U Ω is mappedback to R n by zero padding the entries at the indices that are not in Ω . Therefore,unlike Lemma 11 of the compressively sampled data case, here k e r k = k r k = k r Ω k .Therefore, (14) becomes ζ t +1 ζ t ≥ k r Ω k k p k + 2 ∆ k p k . (24)Now similarly to the compressively sampled data case, we proceed by establishingconcentration results for the key quantities k r k , k p k and ∆ respectively.15 emma 14 ([6], Theorem 1) . Let δ > , and suppose m ≥ dµ ( U ) log (2 d/δ ) . Then,with probability exceeding − δ , k r Ω k ≥ (1 − α ) mn k v ⊥ k where α = q µ ( v ⊥ ) m log (cid:0) δ (cid:1) + ( β +1) − γ dµ ( U ) m , β = q µ ( v ⊥ ) log (cid:0) δ (cid:1) , and γ = q dµ ( U )3 m log (2 d/δ ) . Lemma 15.
Let δ > . Under the same condition on m as Lemma 14, with probabilityat least − δ we have k p k ≤ β + 11 − γ r dµ ( U ) m ! k v k where β and γ equal to those defined in Lemma 14. Lemma 16.
Let δ > . Under the same condition on m as Lemma 14, with probabilityat least − δ we have | ∆ | ≤ η cos φ d r sin φ d + dµ m r dµ ( U ) m mn k v ⊥ k where η = (1+ β )(1+ β )1 − γ , β = q µ ( v ⊥ ) log (cid:0) δ (cid:1) dµ dµ + m sin φ d , and β and γ equalto those defined in Lemma 14. Lemma 14 shows that the concentration of k r k = k r Ω k does not only depend onthe sampling framework, but also on the incoherence property of the current estimateand the true projection residual, i.e., µ ( U ) and µ ( v ⊥ ) . To see this clearly, recall that k r Ω k = k v ⊥ , Ω k − kP U Ω ( v ⊥ , Ω ) k , hence the incoherence property of v ⊥ and R ( U ) directly influences the concentration of k r Ω k . On the other hand, for compressivedata, the Gaussian distributed sampling matrices yield tight concentration results for k p k , k r Ω k and ∆ . Therefore, the upper bounds of the key quantities established inLemmas 14, 15 and 16 are not as tight as those for the compressive data except theextreme case where µ ( U ) = µ ( v ⊥ ) = 1 , i.e., both R ( U ) and v ⊥ are incoherent.As shown in the above lemmas, in order to establish concentration of the key quan-tities in (24), it is essential for the subspaces generated by GROUSE to be incoherentover iterates. It has been proven in [7] that within the local region of R ( ¯ U ) , the inco-herence of R ( U ) can be bounded by that of R ( ¯ U ) . Lemma 17 ([7], Lemma 2.5) . Suppose P dk =1 sin φ k ≤ d n µ , then µ ( U ) ≤ µ . Now we are ready to prove Theorem 12. We sketch the proof here, and a detailedproof is provided in Section 8.3.2.
Proof sketch of Theorem 12.
Given the condition required by Theorem 12, we have sin φ d ≤ p dµ / n and cos φ d ≥ p − dµ / n . This together with Lemma167 and Lemma 16 yield | ∆ | ≤ η dµ n k v ⊥ k . Also for β in Lemma 16, β ≤ p µ ( v ⊥ ) log(1 /δ ) = β . Hence, | ∆ | ≤
115 (1 + β ) − γ dµ n k v ⊥ k . (25)Letting η = (1+ β ) − γ dµ m and α = q µ ( v ⊥ ) m log (cid:0) δ (cid:1) , then applying this definitiontogether with Lemma 17 to Lemma 15 and Lemma 14 yields k p k ≤ (cid:18) r η − γ (cid:19) k v k (26) k r Ω k ≥ (1 − α − η ) mn k v ⊥ k (27)Now applying (25), (26) and (27) to (24) we have ζ t +1 ζ t ≥ − α − η )(1 + p η / (1 − γ )) mn k v ⊥ k k v k (28)with probability at least − δ . The probability bound is obtained by taking the unionbound of those generating Lemmas 14, 15 and 16, as we can see in the proofs in Section8.3.2 this union bound is at least − δ .Letting η = (1 − α − η )(1+ √ η / (1 − γ )) , then η > is equivalent to − α − η > .This further gives that if m satisfies the condition in Theorem 12, then η > . Nowtaking expectation with respect to v yields, E v (cid:2) ζ t +1 (cid:12)(cid:12) U (cid:3) ≥ (cid:18) mn E (cid:20) k v ⊥ k k v k (cid:12)(cid:12) U (cid:21)(cid:19) ζ t ≥ (cid:18) mn − ζ t d (cid:19) ζ t (29)where the last inequality follows from Lemma 1. Finally choosing δ to be /n completesthe proof. In this section, we demonstrate that our theoretical results match the empirical con-vergence behavior of GROUSE. We generate the underlying data matrix M = (cid:2) v v . . . v T (cid:3) as M = ¯ U W . For both the fully sampled data case and com-pressively sampled data case, the underlying signals are generated from a sparse sub-space, demonstrating that incoherence assumptions are not required by our results forthese two cases. Specifically, the underlying subspace of each trial is set to be a sparsesubspace, as the range of an n × d matrix ¯ U with sparsity on the order of log( n ) n . Forthe missing data case, we generate the underlying subspace as the range of an n × d matrix with i.i.d standard normal distribution. The entries of the coefficient matrix W for all three cases are generated as i.i.d N (0 , satisfying Assumption 1. We also wantto mention that we run GROUSE with random initialization for all of the plots in thissection. 17 , , dn d Var (cid:2) K/ ( d log( n ) + d log(1 − ζ ∗ )) (cid:3) · −
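The following sketch (ours) mirrors this experimental setup: a sparse or dense planted subspace, coefficients drawn i.i.d. N(0, 1) as in Assumption 1, and GROUSE run from a random initialization. It reuses observation, grouse_step and determinant_similarity from the earlier sketches; the sparsity level, sample sizes and iteration counts are illustrative rather than the paper's exact values.

```python
import numpy as np

def planted_basis(n, d, sparse, rng):
    """Orthonormal basis: sparse columns (coherent) or dense Gaussian (incoherent)."""
    if sparse:
        B = np.zeros((n, d))
        k = int(np.ceil(np.log(n)))              # about log(n) nonzeros per column
        for j in range(d):
            rows = rng.choice(n, size=k, replace=False)
            B[rows, j] = rng.standard_normal(k)
    else:
        B = rng.standard_normal((n, d))
    Q, _ = np.linalg.qr(B)
    return Q

def run_trial(n=5000, d=10, m=None, mode="missing", iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    # sparse (coherent) subspace for the full and compressive cases, dense for missing data
    Ubar = planted_basis(n, d, sparse=(mode != "missing"), rng=rng)
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # random initialization
    m = m if m is not None else int(2 * d * np.log(n))
    zetas = []
    for _ in range(iters):
        A, x = observation(Ubar, mode, m, rng)
        U = grouse_step(U, A, x)
        zetas.append(determinant_similarity(Ubar, U))
    return zetas
```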
Figure 1: Illustration of the bounds on K in Theorem 5 compared to their values in practice, averaged over trials with different n and d. We show the ratio of K to the bound d log(n) + d log(1/(1 − ζ*)); the two panels show the empirical mean and variance of this ratio over the trials.
[Figure 2 plots the convergence metric against iterations for the compressively sampled data and missing data cases, for m = d log(n), 2d log(n), 4d log(n), and 8d log(n), with the theoretical bounds overlaid.]
Figure 2: Illustration of expected improvement on ζ given by Theorem 8 (left) andTheorem 12 (right) over trials. We set n = 5000 , d = 10 . The diamonds denotethe lower bound on expected convergence rates described in Theorem 8 and Theorem12. We first examine our global convergence result, i.e., Theorem 5, for the fullysampled data in Figure 1. We run GROUSE to convergence for a required accu-racy ζ ∗ = 1 − e -4 and show the ratio of K to the simplified bound of Theorem5, d log( n ) + d log − ζ ∗ . We run GROUSE over trials and show the mean andvariance. We can see that, for fixed n , our theoretical results become more and moreloose as we increase the dimension of the underlying subspace. However, comparedto the empirical mean, the empirical variance is very small. This indicates that the re-lationship between our theoretical upper bounds and the actual iterations required by18 . . m log ( n ) (b): Compressively Sampled Data . .
15 0 . Figure 3: Illustration of our heuristic bounds on K (the actual iterations required byGROUSE to converge to the given accuracy) over different d , m and n , averaged over20 trials. In this simulation, we run GROUSE from a random initialization to conver-gence for a required accuracy ζ ∗ = 1 − e -3. We show the ratio of K to the heuristicbound nm (cid:0) d log( n ) + d log(1 − ζ ∗ ) (cid:1) . In (a) and (b) , we set d = 50 and examine K over m and n for both missing data (a) and compressively sampled data (b) . In (c) and (d) , we set n = 10000 and examine K over m and d for both missing data (c) and compressively sampled data (d) . In these plots, we use the dark red to indicate thefailure of convergence.GROUSE is stable.Next we examine our theoretical results (Theorem 8 and Theorem 12) for the ex-pected improvement on ζ t for the undersampled case in Figure 2. We set n = 5000 and d = 10 . We run GROUSE over different sampling numbers m . The plots are obtainedby averaging over trials. We can see that our theoretical bounds on the expected im-provement on ζ t for both missing data and compressively sampled data are tight fromany random initialization, although we have only established local convergence resultsfor both cases. Also note that Theorem 8 and Theorem 12 indicate that the expected19mprovement on the determinant similarity has a similar form to that of the fully sam-pled case roughly scaled by the sampling density ( m/n ) . These together motivate usto approximate the required iterations to achieve a given accuracy as that required bythe fully sampled case times the reciprocal of sampling density, n/m : ( n/m ) · (cid:0) d log( n ) + d log(1 − ζ ∗ ) (cid:1) . As we see in Figure 3, when m is slightly larger than d , the empirical mean of the ratioof the actual iterations required by GROUSE to our heuristic bound is similar to that ofthe full data case. We leave the rigorous proof of this heuristic as future work. We analyze a manifold incremental gradient descent algorithm applied to a particularnon-convex optimization formulation for recovering a low-dimensional subspace fromstreaming data sampled from that subspace. We provide a simplified analysis as com-pared to [33], showing global convergence of the algorithm to the global minimizer forfully sampled data. We show that with probability exceeding − ρ , the gradient al-gorithm converges from any random initialization to any desired accuracy of ζ ∗ within O (cid:16) d log( n ) ρ + d log (cid:16) ρ (1 − ζ ∗ ) (cid:17)(cid:17) iterations.With undersampled data, we show that expected improvement on our convergencemetric ζ t can be obtained with high probability for each iteration t . We prove that,comparing with fully sampled data, the expected improvement on determinant similar-ity is roughly proportional to the sampling density. With compressively sampled datathis expected improvement holds from any random initialization, while it only holdswithin the local region of the true subspace for the missing data case. Establishingglobal convergence results similar to those for fully sampled data remains as futurework. We start by providing the following lemma that we will use regularly in the manip-ulation of the matrix ¯ U T U . It also provides us with more insight into our metric ofdeterminant similarity between the subspaces. The proof can be found in [27]. Lemma 18 ([27], Theorem 5.2) . 
Let U, ¯ U ∈ R n × d with orthonormal columns, thenthere are unitary matrices Q , ¯ Y , and Y such that Q ¯ U ¯ Y := dd Id n − d and QU Y := dd Γ d Σ n − d where Γ = diag (cos φ , . . . , cos φ d ) , Σ = diag (sin φ , . . . , sin φ d ) with φ i being the i th principal angle between R ( U ) and R ( ¯ U ) defined in Definition 1. Lemma 19 ([7], Lemma 2.12) . Given any matrix Q ∈ R d × d suppose that x ∈ R d is arandom vector with entries identically distributed, zero-mean, and uncorrelated, then E (cid:20) x T Qxx T x (cid:21) = 1 d tr ( Q ) Lemma 20 ([15], Lemma 16) . Let X = [ X , · · · , X d ] with X i ∈ [0 , , i = 1 , . . . , d ,then d − d X i =1 X i ≥ − Π di =1 X i Proof of Lemma 1.
According to Lemma 19 and Lemma 20 we have the following E (cid:20) k v ⊥ k k v k (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) = E (cid:20) k ¯ U s k − k U U T ¯ U s k k ¯ U s k (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) ϑ = E (cid:20) s T ¯ Y ( I − Γ ) ¯ Y T ss T s (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) ϑ = 1 d tr (cid:0) I − Γ (cid:1) ϑ ≥ − ζ t d (30)where ϑ follows by Lemma 18 and k ¯ U s k = k s k , ϑ from Lemma 19, and ϑ fromLemma 20 with X i = cos φ i . In this section we prove the results of Section 4. We start by proving Eq (10), thedeterministic expression for the change in determinant similarity from one step of theGROUSE algorithm to the next. Using this expression, we prove the GROUSE mono-tonic improvement of Lemma 2, expected improvement of Lemma 3, and finally theglobal convergence Theorem 5.Recall that y k y k = cos( θ ) v k k v k k + sin( θ ) v ⊥ k v ⊥ k in Algorithm 1. Then according to theGROUSE update in (4) we have det (cid:0) ¯ U T U t +1 (cid:1) = det ¯ U T U + ¯ U T y k y k − ¯ U T v k k v k k ! w T k w k ! ϑ = det (cid:0) ¯ U T U (cid:1) w T ( ¯ U T U ) − k w k ¯ U T y k y k − ¯ U T v k k v k k !! ϑ = det (cid:0) ¯ U T U (cid:1) w T ( ¯ U T U ) − ¯ U T y k y kk w k ϑ = det (cid:0) ¯ U T U (cid:1) (cid:18) cos θ + k v ⊥ kk v k k sin θ (cid:19) (31)21here ϑ follows from the Schur complement, i.e., that for any invertible matrix M we have det (cid:0) M + ab T (cid:1) = det( M ) (cid:0) b T M − a (cid:1) ; ϑ and ϑ hold since k v k k = k U w k = k w k and the following w T ( ¯ U T U ) − ¯ U T v k w = U T ¯ Us = v T v k = k v k k (32a) w T ( ¯ U T U ) − ¯ U T v ⊥ w = U T ¯ Us = v T v ⊥ = k v ⊥ k . (32b)Given this, the proof of Lemma 2 follows directly from the above proof and thegreedy step size derived in Eq. (11). Proof of Lemma 2.
By using θ = arctan (cid:16) k v ⊥ kk v k k (cid:17) , we have cos θ = k v k kk v k and sin θ = k v ⊥ kk v k . This together with (31) gives det (cid:0) ¯ U T U t +1 (cid:1) = det (cid:0) ¯ U T U (cid:1) k v kk v k k . Therefore, ζ t +1 ζ t = det ( ¯ U T U t +1 ) det ( ¯ U T U ) = k v k k v k k = 1 + k v ⊥ k k v k k . Proof of Lemma 3.
Lemma 3 follows directly from 1 and 2, i.e., E (cid:20) ζ t +1 ζ t (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) = 1 + E (cid:20) k v ⊥ k k v k k (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) ≥ E (cid:20) k v ⊥ k k v k (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) ≥ − ζ t d (33)Note that, given U , ζ t is a constant, hence completes the proof.With the above results, we are ready to prove Theorem 5. Proof of Theorem 5.
Let κ t = 1 − ζ t denote the determinant discrepancy between R ( ¯ U ) and R ( U ) . According to Lemma 3 we have the following: E (cid:20) ζ t +1 ζ t (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) ≥ − ζ t d (34a) E (cid:20) κ t +1 κ t (cid:12)(cid:12)(cid:12)(cid:12) U (cid:21) ≤ − − κ t d (34b)Therefore, the expected convergence rate of ζ t is faster when R ( U ) is far away from R ( ¯ U ) , while that of κ t is faster when R ( U ) is close to R ( ¯ U ) . This motivates us to splitthe analysis into two phases, bounding the number of iterations in each phase. We firstuse (34a) to get the necessary K iterations for GROUSE to converge to a local regionof global optimal point from a random initialization. From there, we obtain the nec-essary K iterations for GROUSE to converge to the required accuracy by leveraging(34b).According to Lemma 2, ζ t is a non-decreasing sequence. Therefore, there exists T ≥ such that ζ t ≤ − ρ , ∀ t ≤ T where ρ ∈ (0 , . Then together with Lemma 3we obtain the following: for any t ≤ T , E (cid:2) ζ t +1 (cid:12)(cid:12) U (cid:3) ≥ (cid:18) − ζ t d (cid:19) ζ t ≥ (cid:16) ρ d (cid:17) ζ t . E [ ζ t +1 ] ≥ (cid:16) ρ d (cid:17) E [ ζ t ] Therefore after K ≥ (2 d/ρ + 1) log − ρ E [ ζ ] iterations of GROUSE we have E [ ζ K ] ≥ (cid:16) ρ d (cid:17) K E [ ζ ] ≥ (cid:18)(cid:16) ρ d (cid:17) dρ +1 (cid:19) log − ρ E [ ζ E [ ζ ] ≥ E [ ζ ] e log − ρ E [ ζ = 1 − ρ Therefore, P (cid:18) ζ K ≥ (cid:19) = 1 − P (cid:18) − ζ K ≥ (cid:19) ϑ ≥ − E [1 − ζ K ]1 / ≥ − ρ (35)where ϑ follows by applying Markov inequality to the nonnegative random variable − ζ K . If T ≤ K , then (35) automatically holds. Therefore, together with E [ ζ ] derived from Lemma 4, we obtain log (cid:16) − ρ/ E [ ζ ] (cid:17) = log (cid:18) − ρ/ C ( dne ) d (cid:19) = τ d log( n ) with τ = 1 + log (1 − ρ/ C + d log( e/d ) d log n .Now with probability at least − ρ , ζ t ≥ for all t ≥ K , i.e., κ t ≤ for all t ≥ K . So using (34b) we have the following: E (cid:2) κ t +1 (cid:12)(cid:12) U (cid:3) ≤ (cid:18) − − κ t d (cid:19) κ t ≤ (cid:18) − d (cid:19) κ t . Taking expectation of both sides, we have E [ κ t +1 ] ≤ (cid:18) − d (cid:19) E [ κ t ] . After K ≥ d log / ρ (1 − ζ ∗ ) ≥ d log E [ η K ] ρ (1 − ζ ∗ ) additional iterations of GROUSE weobtain E [ κ t + K ] ≤ (cid:18) − d (cid:19) K E [ κ K ] ≤ (cid:18) − d (cid:19) d log E [ κK ρ (1 − ζ ∗ ) E [ κ K ] ≤ ρ (1 − ζ ∗ ) . Hence following a similar argument as before we have P ( ζ K + K ≥ ζ ∗ ) = 1 − P ( κ K + K ≥ − ζ ∗ ) ≥ − E [ κ K + K ]1 − ζ ∗ ≥ − ρ . (36)(35) and (36) together complete the proof.23 .3 Proof of Undersampled Data In this section, we prove our main results for undersampled data. We again start byproving a result for the deterministic expression for the change in determinant simi-larity from one step of the GROUSE algorithm to the next, in this case a lower boundgiven by Lemma 7.
Proof of Lemma 7.
Note that, w T ( ¯ U T U ) − ¯ U T p = w T ( ¯ U T U ) − ¯ U T U w = k p k (37a) w T ( ¯ U T U ) − ¯ U T r ϑ = s T ¯ U T U ( ¯ U T U ) − ¯ U T r = v T A T e r ϑ = k e r k (37b)where ϑ follows by Lemma 6 and ϑ holds since v T A T e r = v T A T ( I m − P AU ) e r = k e r k . As a consequence, we have the following det (cid:0) ¯ U T U t +1 (cid:1) = det (cid:18) ¯ U T U + ¯ U T (cid:18) p + r k p + r k − p k p k (cid:19) w T k w k (cid:19) ϑ = det( ¯ U T U ) w T ( ¯ U T U ) − ¯ U T ( p + r ) k p k p k p k + k r k = det( ¯ U T U ) k p k + k r k + k e r k − k r k + ∆ k p k p k p k + k r k where ∆ = w T (cid:0) ¯ U T U (cid:1) − ¯ U T r ; and ϑ follows by the Schur complement det (cid:0) M + ab T (cid:1) = det( M ) (cid:0) b T M − a (cid:1) for any invertible M ∈ R n × n and a, b ∈ R n . Hence ζ t +1 ζ t = det (cid:0) ¯ U T U t +1 (cid:1) det (cid:0) ¯ U T U (cid:1) ! ϑ ≥ k r k k p k + 2 k e r k − k r k k p k + 2 ∆ k p k where ϑ holds since ( c + d ) ≥ c + 2 cd with c = k p k + k r k k p k √ k p k + k r k , d = k e r k −k r k +∆ k p k √ k p k + k r k .In the following sections, we proceed by establishing the convergence results ofmissing data and compressively sampled data by bounding the key quantities in Lemma7. We start by showing how the results on the key quantities in Lemmas 9, 10 and 11 leadto the main result of the compressively sampled data case.
Proof of Theorem 8.
Let
$$\eta_1 = \sqrt{\tfrac{1+\delta}{1-\delta}\cdot\tfrac{d}{m}}, \qquad \eta_2 = (1-\delta)\Big(1 - 3(1+\delta)\sqrt{\tfrac{m}{n}}\Big), \qquad \eta_3 = \frac{\tan(\phi_d) + \delta\sqrt{d}}{\cos(\phi_d)}.$$
Plugging the results of Lemmas 9, 10 and 11 into Lemma 7 with $\delta_1 = \delta_2 = \delta_3 = \delta$ yields
$$\frac{\zeta_{t+1}}{\zeta_t} \;\ge\; 1 + \frac{2\|e_r\|^2 - \|r\|^2}{\|p\|^2} + \frac{2\Delta}{\|p\|^2} \;\ge\; 1 + \frac{\eta_2\big(1-\eta_1^2\big) - 2\eta_1\eta_3}{(1+\eta_1)^2}\,\frac{m}{n}\,\frac{\|v_\perp\|^2}{\|v\|^2}.$$
Taking the conditional expectation over $v$ and applying Lemma 1, which gives $\mathbb{E}\big[\|v_\perp\|^2/\|v\|^2 \,\big|\, U_t\big] \ge (1-\zeta_t)/d$, we obtain
$$\mathbb{E}\Big[\tfrac{\zeta_{t+1}}{\zeta_t}\,\Big|\,U_t\Big] \;\ge\; 1 + \frac{\eta_2\big(1-\eta_1^2\big) - 2\eta_1\eta_3}{(1+\eta_1)^2}\,\frac{m}{n}\,\frac{1-\zeta_t}{d}, \qquad (38)$$
which is the claimed bound. The probability bound is obtained by taking the union bound over the events (in Lemmas 21, 23, 24, 28 and Corollary 26) used to establish Lemmas 9 to 11. This union bound is
$$1 - \exp\Big(-\tfrac{m\delta^2}{8}\Big) - \exp\Big(-\tfrac{d\delta^2}{8}\Big) - \exp\Big(-\tfrac{m\delta^2}{32} + d\log\tfrac{24}{\delta}\Big) - (4d+1)\exp\Big(-\tfrac{m\delta^2}{8}\Big)$$
$$>\; 1 - \exp\Big(-\tfrac{d\delta^2}{8}\Big) - \exp\Big(-\tfrac{m\delta^2}{32} + d\log\tfrac{24}{\delta}\Big) - (4d+2)\exp\Big(-\tfrac{m\delta^2}{8}\Big). \qquad (39)$$

To obtain the complexity bound on $m$, note that the improvement factor in (38) is positive if and only if $\eta_2(1-\eta_1^2) > 2\eta_1\eta_3$. Since $\eta_1^2 = \tfrac{1+\delta}{1-\delta}\tfrac{d}{m}$, this is a quadratic condition in $\sqrt{m}$: writing $\varepsilon = \tan(\phi_d)$, it takes the form $\alpha_1 d + 2\alpha_2\alpha_3\sqrt{d}\sqrt{m} < m$ with $\alpha_1, \alpha_2, \alpha_3$ depending only on $\delta$, $\varepsilon$ and $\delta\sqrt{d}$ through $\eta_1, \eta_2, \eta_3$. Completing the square in $\sqrt{m}$ shows that it suffices to take
$$m \;\ge\; \beta_0\big(\varepsilon + \delta\sqrt{d}\,\big)\big(\varepsilon + \delta\sqrt{d} + 12\big)\, d, \qquad (40)$$
where $\beta_0$ is a constant depending only on $\delta$.

To establish the other bound on $m$, observe that $m \ge \tfrac{32}{\delta^2}\big(3\log n + d\log\tfrac{24}{\delta}\big)$ implies the following:
$$\exp\Big(-\tfrac{m\delta^2}{32} + d\log\tfrac{24}{\delta}\Big) \;\le\; \exp(-3\log n) \;=\; \frac{1}{n^3}, \qquad (41)$$
$$(4d+2)\exp\Big(-\tfrac{m\delta^2}{8}\Big) \;\le\; \frac{4d+2}{n^{12}}\Big(\tfrac{\delta}{24}\Big)^{4d} \;\to\; 0. \qquad (42)$$
(41) and (42) complete the proof of the bound on $m$ and justify the simplification of the probability bound in (39).

Next we prove the intermediate lemmas of Section 5.1, i.e., we bound the key quantities in Lemma 7. For this we need the following concentration results.
Lemma 21. Let $A \in \mathbb{R}^{m\times n}$ have i.i.d. Gaussian entries distributed as $\mathcal{N}(0, 1/n)$, and let $v \in \mathbb{R}^n$ be a fixed vector. Then for any $\delta \in (0,1)$, with probability at least $1 - 2\exp(-m\delta^2/8)$ we have $(1-\delta)\tfrac{m}{n}\|v\|^2 \le \|Av\|^2 \le (1+\delta)\tfrac{m}{n}\|v\|^2$; more precisely,
$$P\Big(\|Av\|^2 > (1+\delta)\tfrac{m}{n}\|v\|^2\Big) < \exp\Big(-\tfrac{m\delta^2}{8}\Big), \qquad P\Big(\|Av\|^2 < (1-\delta)\tfrac{m}{n}\|v\|^2\Big) < \exp\Big(-\tfrac{m\delta^2}{8}\Big).$$

Proof. Note that $Av$ is a random vector with i.i.d. entries distributed as $\mathcal{N}(0, \|v\|^2/n)$. Therefore $\tfrac{n\|Av\|^2}{\|v\|^2}$ follows a chi-squared distribution with $m$ degrees of freedom, which yields [30]
$$P\left[\frac{n\|Av\|^2}{m\|v\|^2} - 1 > \delta\right] < \exp(-m\delta^2/8), \qquad P\left[\frac{n\|Av\|^2}{m\|v\|^2} - 1 < -\delta\right] < \exp(-m\delta^2/8).$$
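A short Monte Carlo sketch makes the statement of Lemma 21 tangible; the sample sizes and $\delta$ below are illustrative choices, not values used anywhere in the analysis.

```python
import numpy as np

# Monte Carlo check of Lemma 21: for A with i.i.d. N(0, 1/n) entries,
# ||A v||^2 concentrates around (m/n) ||v||^2.
rng = np.random.default_rng(1)
n, m, delta, trials = 500, 60, 0.3, 2000

v = rng.standard_normal(n)
ratios = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((m, n)) / np.sqrt(n)      # entries ~ N(0, 1/n)
    ratios[t] = np.linalg.norm(A @ v) ** 2 / ((m / n) * np.linalg.norm(v) ** 2)

emp_upper = np.mean(ratios > 1 + delta)
emp_lower = np.mean(ratios < 1 - delta)
bound = np.exp(-m * delta ** 2 / 8)                    # one-sided bound from Lemma 21
print(f"empirical tails: {emp_upper:.4f} (upper), {emp_lower:.4f} (lower); bound {bound:.4f}")
```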
Lemma 22. Let $A \in \mathbb{R}^{m\times n}$ be a random matrix whose entries are independent and identically distributed Gaussian random variables with mean zero and variance $\gamma^2$. Let $z_1, z_2 \in \mathbb{R}^n$ be such that $z_1 \perp z_2$. Then $Az_1$ and $Az_2$ are independent of each other.

Proof. Let $a_i^T$ denote the $i$-th row of $A$ and let $M = Az_1 z_2^T A^T$. Then we have
$$\mathbb{E}[M]_{ii} = \mathbb{E}\big[a_i^T z_1 z_2^T a_i\big] = z_2^T \mathbb{E}[a_i a_i^T] z_1 = \gamma^2 z_2^T z_1 = 0,$$
$$\mathbb{E}[M]_{ij} = \mathbb{E}\big[a_i^T z_1 z_2^T a_j\big] = z_2^T \mathbb{E}[a_j a_i^T] z_1 = 0 \quad (i \ne j),$$
since distinct rows of $A$ are independent with zero mean. Therefore $Az_1$ and $Az_2$ are uncorrelated. This, together with the fact that $(Az_1, Az_2)$ is jointly Gaussian, implies that $Az_1$ and $Az_2$ are independent.

Lemma 23 ([29], Corollary 5.35). Let $A$ be an $n \times m$ matrix ($n \ge m$) whose entries are independent standard normal random variables. Then for every $h \ge 0$, with probability at least $1 - 2\exp(-h^2/2)$, one has
$$\sqrt{n} - \sqrt{m} - h \;\le\; \sigma_{\min}(A) \;\le\; \sigma_{\max}(A) \;\le\; \sqrt{n} + \sqrt{m} + h, \qquad (43)$$
where $\sigma_{\min}, \sigma_{\max}$ denote the smallest and largest singular values of $A$.
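Lemma 23 is easy to probe empirically; the sketch below counts how often the extreme singular values of a standard Gaussian matrix fall inside the stated interval. Dimensions and $h$ are illustrative.

```python
import numpy as np

# Empirical check of Lemma 23 (non-asymptotic singular value bounds, [29] Cor. 5.35).
rng = np.random.default_rng(2)
n, m, h, trials = 400, 20, 3.0, 500

inside = 0
for _ in range(trials):
    A = rng.standard_normal((n, m))
    s = np.linalg.svd(A, compute_uv=False)            # s[0] = sigma_max, s[-1] = sigma_min
    inside += (np.sqrt(n) - np.sqrt(m) - h <= s[-1]) and (s[0] <= np.sqrt(n) + np.sqrt(m) + h)

print(f"fraction of trials inside the bounds: {inside / trials:.3f} "
      f"(Lemma 23 guarantees at least {1 - 2*np.exp(-h**2/2):.3f})")
```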
With the above results, we can state the following intermediate result quantifying $\|P_{AU}(Av_\perp)\|^2$, a key quantity used in the proofs of Lemmas 9, 10 and 11.

Lemma 24. Let $A \in \mathbb{R}^{m\times n}$ have i.i.d. Gaussian entries distributed as $\mathcal{N}(0, 1/n)$. Then for any $\delta \in (0,1)$ we have
$$\|P_{AU}(Av_\perp)\|^2 \;\le\; (1+\delta)\,\tfrac{d}{n}\,\|v_\perp\|^2$$
with probability at least $1 - \exp(-d\delta^2/8)$.

Proof. Note that $Av_\perp$ is a Gaussian random vector with i.i.d. entries distributed as $\mathcal{N}(0, \|v_\perp\|^2/n)$, and $AU$ is a Gaussian random matrix with i.i.d. entries distributed as $\mathcal{N}(0, 1/n)$. Since $v_\perp$ is orthogonal to the columns of $U$, Lemma 22 implies that $AU$ and $Av_\perp$ are independent of each other. Therefore $y = P_{AU}(Av_\perp)$ is the projection of $Av_\perp$ onto an independent random $d$-dimensional subspace. By the rotation invariance of the distribution of $Av_\perp$, $\|P_{AU}(Av_\perp)\|^2$ is distributed as the squared length of the projection of $Av_\perp$ onto its first $d$ coordinates. Hence
$$P\left(\|P_{AU}(Av_\perp)\|^2 = \sum_{k=1}^d y_k^2 \le (1+\delta)\tfrac{d}{n}\|v_\perp\|^2\right) \;\ge\; 1 - \exp\Big(-\tfrac{d\delta^2}{8}\Big). \qquad (44)$$
As in the proof of Lemma 21, the probability bound follows from the concentration bound for the chi-squared distribution with $d$ degrees of freedom.
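The $d/n$ scaling in Lemma 24 (rather than the $m/n$ scaling of Lemma 21) is the reason undersampling only mildly inflates the residual; the following Monte Carlo sketch, with illustrative parameters, shows the concentration directly.

```python
import numpy as np

# Monte Carlo check of Lemma 24: the projection of A v_perp onto R(AU) has squared norm
# concentrating around (d/n) ||v_perp||^2 when v_perp is orthogonal to R(U).
rng = np.random.default_rng(3)
n, m, d, trials, delta = 400, 50, 5, 1000, 0.5

U, _ = np.linalg.qr(rng.standard_normal((n, d)))        # orthonormal basis of R(U)
x = rng.standard_normal(n)
v_perp = x - U @ (U.T @ x)                              # component orthogonal to R(U)

vals = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((m, n)) / np.sqrt(n)
    Q, _ = np.linalg.qr(A @ U)                          # orthonormal basis of R(AU)
    proj = Q @ (Q.T @ (A @ v_perp))
    vals[t] = np.linalg.norm(proj) ** 2 / ((d / n) * np.linalg.norm(v_perp) ** 2)

print(f"mean ratio {vals.mean():.3f} (should be near 1); "
      f"P(ratio > 1+delta) = {np.mean(vals > 1 + delta):.3f} vs bound {np.exp(-d*delta**2/8):.3f}")
```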
Now we show that Lemma 9 follows directly from Lemmas 21, 23 and 24.

Proof of Lemma 9. According to Lemmas 21 and 24, we have
$$\|e_r\|^2 = \|(I_m - P_{AU})Av_\perp\|^2 = \|Av_\perp\|^2 - \|P_{AU}(Av_\perp)\|^2 \;\ge\; (1-\delta)\tfrac{m}{n}\|v_\perp\|^2 - (1+\delta)\tfrac{d}{n}\|v_\perp\|^2 \;=\; (1-\delta)\Big(1 - \tfrac{1+\delta}{1-\delta}\tfrac{d}{m}\Big)\tfrac{m}{n}\|v_\perp\|^2, \qquad (45)$$
which holds with probability at least $1 - \exp(-m\delta^2/8) - \exp(-d\delta^2/8)$. As for the second part of Lemma 9, we have
$$2\|e_r\|^2 - \|r\|^2 = 2\|e_r\|^2 - \|A^T e_r\|^2 \;\ge\; \big(2 - \sigma_{\max}^2(A)\big)\|e_r\|^2 \;\overset{\vartheta_1}{\ge}\; \Big(1 - 3(1+\delta)\sqrt{\tfrac{m}{n}}\Big)\|e_r\|^2 \;\ge\; \Big(1 - 3(1+\delta)\sqrt{\tfrac{m}{n}}\Big)(1-\delta)\Big(1 - \tfrac{1+\delta}{1-\delta}\tfrac{d}{m}\Big)\tfrac{m}{n}\|v_\perp\|^2, \qquad (46)$$
where $\vartheta_1$ follows from Lemma 23 applied to $\sqrt{n}\,A^T$ (whose entries are standard normal) with $h = \delta\sqrt{m}$, giving $\sigma_{\max}(A) \le 1 + (1+\delta)\sqrt{m/n}$ and hence, since $m \ll n$, $\sigma_{\max}^2(A) \le 1 + 3(1+\delta)\sqrt{m/n}$. The probability bound $1 - \exp(-m\delta^2/8) - \exp(-d\delta^2/8) - 2\exp(-m\delta^2/2)$ is obtained by taking the union bound over (45) and $\vartheta_1$.

To prove Lemma 10 and Lemma 11, we need the following additional results, which are implied by Lemma 21; the corresponding proofs are provided at the end of this section.
Corollary 25. Under the conditions of Lemma 21, for any $x, y \in \mathbb{R}^n$ and $\delta \in (0,1)$, with probability exceeding $1 - 2e^{-m\delta^2/8}$ we have
$$\tfrac{m}{n}\big(x^T y - \delta\|x\|\|y\|\big) \;\le\; x^T A^T A y \;\le\; \tfrac{m}{n}\big(x^T y + \delta\|x\|\|y\|\big).$$
Corollary 26. Under the conditions of Lemma 21, for any fixed $d$-dimensional subspace $R(U)$ we have, simultaneously for all vectors $v \in R(U)$,
$$P\Big(\|Av\|^2 > (1+\delta)\tfrac{m}{n}\|v\|^2\Big) < \exp\Big(-\tfrac{m\delta^2}{32} + d\log\tfrac{24}{\delta}\Big), \qquad P\Big(\|Av\|^2 < (1-\delta)\tfrac{m}{n}\|v\|^2\Big) < \exp\Big(-\tfrac{m\delta^2}{32} + d\log\tfrac{24}{\delta}\Big).$$

Given Corollaries 25 and 26, we prove Lemma 10 and Lemma 11 by first establishing the following intermediate results bounding the key components of $p$ and $\Delta$. Lemma 27.
Let $\tilde w = (U^T A^T A U)^{-1} U^T A^T A v_\perp$. Then
$$P\left(\|\tilde w\| \le \sqrt{\tfrac{1+\delta}{1-\delta}\cdot\tfrac{d}{m}}\,\|v_\perp\|\right) \;\ge\; 1 - \exp\Big(-\tfrac{d\delta^2}{8}\Big) - \exp\Big(-\tfrac{m\delta^2}{32} + d\log\tfrac{24}{\delta}\Big).$$

Proof. Since $U \in \mathbb{R}^{n\times d}$ has orthonormal columns, we have $\|\tilde w\| = \|U\tilde w\|$. It then follows that
$$\|U\tilde w\| \;\overset{\vartheta_1}{\le}\; \frac{\|AU\tilde w\|}{\sqrt{(1-\delta)m/n}} \;\overset{\vartheta_2}{\le}\; \sqrt{\tfrac{1+\delta}{1-\delta}\cdot\tfrac{d}{m}}\,\|v_\perp\|,$$
where $\vartheta_1$ follows from Corollary 26 applied to $U\tilde w \in R(U)$, and $\vartheta_2$ follows from Lemma 24, since
$$\|AU\tilde w\| = \|P_{AU}(Av_\perp)\| \le \sqrt{(1+\delta)\tfrac{d}{n}}\,\|v_\perp\|.$$
The probability bound is obtained by applying the union bound over $\vartheta_1$ and $\vartheta_2$. Lemma 28.
Let $\phi_d$ denote the largest principal angle between $R(U)$ and $R(\bar U)$. Then
$$P\Big(\big\|\bar U^T A^T A v_\perp\big\| \le (\sin\phi_d + \sqrt{d}\,\delta)\tfrac{m}{n}\|v_\perp\|\Big) \;\ge\; 1 - 2d\exp\Big(-\tfrac{m\delta^2}{8}\Big).$$
Proof of Lemma 28.
Let $\bar u_k$ denote the $k$-th column of $\bar U$, and let $\delta \in (0,1)$. Then
$$\big\|\bar U^T A^T A v_\perp\big\| = \Big\|\bar U^T\Big(A^TA - \tfrac{m}{n}I_n\Big)v_\perp + \tfrac{m}{n}\bar U^T v_\perp\Big\| \;\le\; \tfrac{m}{n}\|\bar U^T v_\perp\| + \Big\|\bar U^T\Big(A^TA - \tfrac{m}{n}I_n\Big)v_\perp\Big\|$$
$$= \tfrac{m}{n}\|\bar U^T v_\perp\| + \sqrt{\sum_{k=1}^d\Big(\bar u_k^T A^T A v_\perp - \tfrac{m}{n}\bar u_k^T v_\perp\Big)^2} \;\overset{\vartheta_1}{\le}\; \tfrac{m}{n}\|\bar U^T v_\perp\| + \sqrt{\sum_{k=1}^d\Big(\delta\tfrac{m}{n}\|\bar u_k\|\|v_\perp\|\Big)^2} \;\overset{\vartheta_2}{\le}\; \sin\phi_d\,\tfrac{m}{n}\|v_\perp\| + \sqrt{d}\,\delta\,\tfrac{m}{n}\|v_\perp\|, \qquad (47)$$
where $\vartheta_1$ follows from Corollary 25; $\vartheta_2$ follows from Lemma 34 and the fact that $\sqrt{\sum_{k=1}^d(\delta\tfrac{m}{n}\|\bar u_k\|\|v_\perp\|)^2} = \sqrt{d}\,\delta\tfrac{m}{n}\|v_\perp\|$ since $\|\bar u_k\| = 1$; and the probability bound is obtained by taking the union bound over the $d$ applications of Corollary 25.

We are ready to prove Lemma 10 and Lemma 11. Proof of Lemma 10.
Let $\eta_1 = \sqrt{\tfrac{1+\delta}{1-\delta}\cdot\tfrac{d}{m}}$. Then, according to Lemma 6 and Lemma 27, we have
$$\|p\| = \|U U^T v + U\tilde w\| \;\le\; \|v_\parallel\| + \|U\tilde w\| \;\le\; \|v_\parallel\| + \eta_1\|v_\perp\| \;\le\; (1 + \eta_1)\|v\|,$$
with probability at least
$$1 - \exp\Big(-\tfrac{m\delta^2}{32} + d\log\tfrac{24}{\delta}\Big) - \exp\Big(-\tfrac{d\delta^2}{8}\Big).$$
Here the probability bound follows from Lemma 27 with the same choice of $\delta$, which completes the proof. Proof of Lemma 11.
According to the definition of $\Delta$, Lemma 11 is a direct consequence of Lemma 27, Lemma 28 and Lemma 34. Indeed,
$$|\Delta| = \big|\tilde w^T(\bar U^T U)^{-1}\bar U^T A^T(I_m - P_{AU})Av_\perp\big| \;\le\; \|\tilde w\|\,\big\|(\bar U^T U)^{-1}\big\|_2\,\big\|\bar U^T A^T(I_m - P_{AU})Av_\perp\big\|$$
$$\overset{\vartheta_1}{\le}\; \|\tilde w\|\,\big\|(\bar U^T U)^{-1}\big\|_2\,\big\|\bar U^T A^T A v_\perp\big\| \;\overset{\vartheta_2}{\le}\; \frac{1}{\cos(\phi_d)}\sqrt{\tfrac{1+\delta}{1-\delta}\cdot\tfrac{d}{m}}\,\big(\sin(\phi_d) + \sqrt{d}\,\delta\big)\,\tfrac{m}{n}\,\|v_\perp\|^2, \qquad (48)$$
where $\vartheta_1$ holds since $\|\bar U^T A^T(I_m - P_{AU})Av_\perp\| \le \|\bar U^T A^T A v_\perp\|$; $\vartheta_2$ follows from Lemma 27 and Lemma 28 together with $\|(\bar U^T U)^{-1}\|_2 \le 1/\cos(\phi_d)$; and the probability bound is obtained by taking the union bound over those in Lemma 27 and Lemma 28.

Finally, we prove the auxiliary results Corollary 25 and Corollary 26. The key idea for proving Corollary 26 is a covering-number argument that applies Lemma 21 to a net of the given $d$-dimensional subspace $R(U)$; this is a common strategy in compressed sensing. Proof of Corollary 26.
Without loss of generality we restrict to $\|v\| = 1$. From covering numbers [28], there exists a finite set $Q \subset R(U)$ with at most $(24/\delta)^d$ points such that $\|q\| = 1$ for all $q \in Q$, and for every $x \in R(U)$ with $\|x\| = 1$ we can find $q \in Q$ such that $\|x - q\| \le \delta/8$. Applying Lemma 21 to the points in $Q$ with parameter $\delta/2$ and using the standard union bound, with probability at least $1 - (24/\delta)^d\exp(-m\delta^2/32)$ we have, for all $q \in Q$,
$$(1 - \delta/2)\tfrac{m}{n}\|q\|^2 \le \|Aq\|^2 \le (1 + \delta/2)\tfrac{m}{n}\|q\|^2, \qquad\text{i.e.,}\qquad \sqrt{1-\delta/2}\sqrt{\tfrac{m}{n}} \le \|Aq\| \le \sqrt{1+\delta/2}\sqrt{\tfrac{m}{n}}. \qquad (49)$$
Define $\gamma^*$ as the smallest number such that
$$\|Ax\| \le \sqrt{1+\gamma^*}\sqrt{\tfrac{m}{n}} \qquad \text{for all } x \in R(U) \text{ with } \|x\| = 1. \qquad (50)$$
Since for any such $x$ we can find $q \in Q$ with $\|x - q\| \le \delta/8$, we have
$$\|Ax\| \le \|Aq\| + \|A(x-q)\| \le \sqrt{1+\delta/2}\sqrt{\tfrac{m}{n}} + \sqrt{1+\gamma^*}\sqrt{\tfrac{m}{n}}\,\tfrac{\delta}{8}.$$
Since $\gamma^*$ is the smallest number for which (50) holds, we have $\sqrt{1+\gamma^*} \le \sqrt{1+\delta/2} + \sqrt{1+\gamma^*}\,\delta/8$, hence
$$\sqrt{1+\gamma^*} \;\le\; \frac{\sqrt{1+\delta/2}}{1 - \delta/8} \;\le\; \sqrt{1+\delta}. \qquad (51)$$
Similarly, the lower bound follows from
$$\|Ax\| \ge \|Aq\| - \|A(x-q)\| \ge \sqrt{1-\delta/2}\sqrt{\tfrac{m}{n}} - \sqrt{1+\gamma^*}\,\tfrac{\delta}{8}\sqrt{\tfrac{m}{n}} \ge \Big(\sqrt{1-\delta/2} - \sqrt{1+\delta}\,\tfrac{\delta}{8}\Big)\sqrt{\tfrac{m}{n}} \ge \sqrt{1-\delta}\sqrt{\tfrac{m}{n}}.$$
This completes the proof.
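Because the supremum of $\|Av\|$ over unit vectors $v \in R(U)$ equals $\sigma_{\max}(AU)$, the uniform statement of Corollary 26 can be probed numerically without building an explicit net. The sketch below uses illustrative sizes only; at these small values of $m$ the probability constant in the corollary is very loose, so only the empirical supremum is reported.

```python
import numpy as np

# sup_{v in R(U), ||v||=1} ||A v||^2 = sigma_max(AU)^2, compared against (1+delta) m/n.
rng = np.random.default_rng(4)
n, m, d, delta, trials = 2000, 400, 5, 0.5, 200

U, _ = np.linalg.qr(rng.standard_normal((n, d)))
sups = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((m, n)) / np.sqrt(n)
    sups[t] = np.linalg.svd(A @ U, compute_uv=False)[0] ** 2

print(f"max observed sup ||Av||^2 over {trials} trials: {sups.max():.4f}")
print(f"threshold (1+delta)*m/n: {(1 + delta) * m / n:.4f}; "
      f"typical value (sqrt(m)+sqrt(d))^2/n: {(np.sqrt(m) + np.sqrt(d))**2 / n:.4f}")
```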
Proof of Corollary 25.
Note that
$$\frac{x^T A^T A y}{\|x\|\|y\|} \;=\; \frac14\left(\Big\|A\Big(\tfrac{x}{\|x\|} + \tfrac{y}{\|y\|}\Big)\Big\|^2 - \Big\|A\Big(\tfrac{x}{\|x\|} - \tfrac{y}{\|y\|}\Big)\Big\|^2\right).$$
Applying Lemma 21 to each of the two terms separately and applying the union bound, we have
$$P\left[\frac{x^T A^T A y}{\|x\|\|y\|} \le \tfrac{m}{n}\Big(\tfrac{x^T y}{\|x\|\|y\|} - \delta\Big)\right] \;\le\; P\left[\frac{x^T A^T A y}{\|x\|\|y\|} \le \frac14\left((1-\delta)\tfrac{m}{n}\Big\|\tfrac{x}{\|x\|}+\tfrac{y}{\|y\|}\Big\|^2 - (1+\delta)\tfrac{m}{n}\Big\|\tfrac{x}{\|x\|}-\tfrac{y}{\|y\|}\Big\|^2\right)\right] \;<\; \exp\Big(-\tfrac{m\delta^2}{8}\Big). \qquad (52)$$
Similarly,
$$P\left[\frac{x^T A^T A y}{\|x\|\|y\|} \ge \tfrac{m}{n}\Big(\tfrac{x^T y}{\|x\|\|y\|} + \delta\Big)\right] \;\le\; P\left[\frac{x^T A^T A y}{\|x\|\|y\|} \ge \frac14\left((1+\delta)\tfrac{m}{n}\Big\|\tfrac{x}{\|x\|}+\tfrac{y}{\|y\|}\Big\|^2 - (1-\delta)\tfrac{m}{n}\Big\|\tfrac{x}{\|x\|}-\tfrac{y}{\|y\|}\Big\|^2\right)\right] \;<\; \exp\Big(-\tfrac{m\delta^2}{8}\Big). \qquad (53)$$
(52) and (53) together complete the proof.
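The polarization identity at the start of this proof is purely algebraic and can be checked on a random instance; the sizes below are arbitrary.

```python
import numpy as np

# Check of the polarization identity used in the proof of Corollary 25:
#   x^T A^T A y / (||x|| ||y||) = (1/4)( ||A(x~ + y~)||^2 - ||A(x~ - y~)||^2 ),
# where x~ = x/||x||, y~ = y/||y||.
rng = np.random.default_rng(5)
m, n = 30, 100
A = rng.standard_normal((m, n)) / np.sqrt(n)
x, y = rng.standard_normal(n), rng.standard_normal(n)

xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
lhs = x @ A.T @ A @ y / (np.linalg.norm(x) * np.linalg.norm(y))
rhs = 0.25 * (np.linalg.norm(A @ (xn + yn)) ** 2 - np.linalg.norm(A @ (xn - yn)) ** 2)
print(lhs, rhs, np.isclose(lhs, rhs))
```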
8.3.2 Proofs for Missing Data

Here we again bound the quantities in Lemma 7, Equation (14), this time assuming $A$ represents an entry-wise observation operation and assuming incoherence of the signals of interest. As we show below, in the proof of Theorem 12 we put together the bounds given by Lemmas 14, 15 and 16, all of which are also proved in this section, along with Lemma 17 for completeness. We start by proving the main result for missing data.

Proof of Theorem 12. Under the condition required by Theorem 12, we have $\sin^2\phi_d \le d\mu_0/n$ and $\cos^2\phi_d \ge 1 - d\mu_0/n$, where $\mu_0$ denotes the incoherence parameter of $R(\bar U)$. Together with Lemma 17 and Lemma 16, this yields
$$|\Delta| \;\le\; \frac{(1+\beta_1)(1+\beta_2)}{(1-\gamma)\cos\phi_d}\sqrt{\tfrac{m\sin^2\phi_d}{d\mu_0}+1}\;\frac{d\sqrt{\mu_0\,\mu(U)}}{n}\,\|v_\perp\|^2 \;\le\; \frac{(1+\beta_1)(1+\beta_2)}{1-\gamma}\,\frac{C_0\, d\mu_0}{n}\,\|v_\perp\|^2, \qquad (54)$$
where $C_0$ collects the numerical factors $\sqrt{m\sin^2\phi_d/(d\mu_0)+1} \le \sqrt{m/n+1} \le \sqrt{2}$, $1/\cos\phi_d$, and the constant from Lemma 17 bounding $\mu(U)$ in terms of $\mu_0$. Moreover, for $\beta_2$ in Lemma 16 we have $\beta_2 \le \sqrt{2\mu(v_\perp)\log(1/\delta)} =: \bar\beta_2$.

Letting $\eta_1 = \tfrac{(1+\beta_1)^2}{(1-\gamma)^2}\tfrac{d\mu(U)}{m}$ and $\alpha = \sqrt{\tfrac{2\mu(v_\perp)}{m}\log\tfrac1\delta}$, and applying Lemma 14 and Lemma 15 (using Lemma 17 to control $\mu(U)$ by $\mu_0$ where needed), we obtain
$$\|p\| \le \big(1+\sqrt{\eta_1}\big)\|v\|, \qquad (55)$$
$$\|r_\Omega\|^2 \ge \big(1-\alpha-(1-\gamma)\eta_1\big)\tfrac{m}{n}\|v_\perp\|^2. \qquad (56)$$
Now applying (54), (55) and (56) to (24), we obtain
$$\frac{\zeta_{t+1}}{\zeta_t} \;\ge\; 1 + \frac{1-\alpha-(1-\gamma)\eta_1}{(1+\sqrt{\eta_1})^2}\,\frac{m}{n}\frac{\|v_\perp\|^2}{\|v\|^2} - \frac{\eta_3}{(1+\sqrt{\eta_1})^2}\,\frac{m}{n}\frac{\|v_\perp\|^2}{\|v\|^2} \;\ge\; 1 + \frac{1-\alpha-(1-\gamma)\eta_1 - \eta_3}{(1+\sqrt{\eta_1})^2}\,\frac{m}{n}\frac{\|v_\perp\|^2}{\|v\|^2}, \qquad (57)$$
where $\eta_3 := \tfrac{2C_0(1+\beta_1)(1+\bar\beta_2)}{1-\gamma}\tfrac{d\mu_0}{m}$ accounts for the $\Delta$ term via (54). This holds with probability at least $1 - 4\delta$; the probability bound is obtained by taking the union bound over the events generating Lemmas 14, 15 and 16, as can be seen from their proofs in Section 8.3.2.

Letting $\eta := \tfrac{1-\alpha-(1-\gamma)\eta_1-\eta_3}{(1+\sqrt{\eta_1})^2}$, we have $\eta > 0$ if and only if $1-\alpha-(1-\gamma)\eta_1-\eta_3 > 0$, for which the following suffices: if
$$m \;>\; C\max\left( d\mu(U)\log\tfrac{2d}{\delta},\;\; \mu(v_\perp)\log\tfrac1\delta,\;\; d\mu_0\sqrt{\mu(v_\perp)\log\tfrac1\delta} \right) \qquad (58)$$
for a sufficiently large constant $C$, then $\eta > 0$. Under this condition, taking the expectation with respect to $v$ yields
$$\mathbb{E}_v\Big[\tfrac{\zeta_{t+1}}{\zeta_t}\,\Big|\,U_t\Big] \;\ge\; 1 + \eta\,\frac{m}{n}\,\mathbb{E}\Big[\tfrac{\|v_\perp\|^2}{\|v\|^2}\,\Big|\,U_t\Big] \;\ge\; 1 + \eta\,\frac{m}{n}\,\frac{1-\zeta_t}{d}, \qquad (59)$$
where the last inequality follows from Lemma 1. Finally, choosing $\delta = 1/n$ completes the proof.

We then prove Corollary 13, the result that allows comparison between our convergence rate and that of [7]. Proof of Corollary 13.
Let $X = [X_1, \dots, X_d]$ with $X_i = \sin^2\phi_i$, and let $f(X) = 1 - \sum_{i=1}^d X_i - \prod_{i=1}^d(1-X_i)$. Then
$$\frac{\partial f(X)}{\partial X_i} = -1 + \prod_{j\ne i}(1-X_j) \le 0,$$
so $f(X)$ is non-increasing in each component and therefore $f(X) \le f(0) = 0$. It follows that
$$\zeta_t = \prod_{i=1}^d(1-X_i) \;\ge\; 1 - \sum_{i=1}^d X_i \;\ge\; 1 - \frac{d\mu_0}{n}, \qquad (60)$$
where the last inequality uses the incoherence condition of Theorem 12. With a slight modification of Theorem 12 we obtain
$$\mathbb{E}\big[\kappa_{t+1}\,\big|\,U_t\big] \;\le\; \Big(1 - \eta\,\frac{m}{n}\,\frac{\zeta_t}{d}\Big)\kappa_t. \qquad (61)$$
(60) and (61) together complete the proof.

We now focus on proving the key lemmas used to establish Theorem 12, for which we need the following lemmas; their proofs can be found in [6].

Lemma 29 ([6]). Let $\delta > 0$ and suppose $m \ge \tfrac83 d\mu(U)\log(2d/\delta)$. Then
$$P\left(\big\|(U_\Omega^T U_\Omega)^{-1}\big\|_2 \le \frac{n}{(1-\gamma)m}\right) \;\ge\; 1 - \delta, \qquad \gamma = \sqrt{\frac{8d\mu(U)}{3m}\log\frac{2d}{\delta}}.$$

Lemma 30 ([6], Lemma 1). Let $\alpha = \sqrt{\tfrac{2\mu(v_\perp)}{m}\log\tfrac1\delta}$. Then
$$P\Big(\|v_{\perp,\Omega}\|^2 \ge (1-\alpha)\tfrac{m}{n}\|v_\perp\|^2\Big) \;\ge\; 1 - \delta.$$

Lemma 31 ([6], Lemma 2). Let $\mu(U), \mu(v_\perp)$ denote the incoherence parameters of $R(U)$ and $v_\perp$, let $\delta \in (0,1)$, and let $\beta_1 = \sqrt{2\mu(v_\perp)\log(1/\delta)}$. Then
$$P\left(\big\|U_\Omega^T v_{\perp,\Omega}\big\|^2 \le (\beta_1+1)^2\,\tfrac{m}{n}\,\tfrac{d\mu(U)}{n}\,\|v_\perp\|^2\right) \;\ge\; 1 - \delta.$$
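The entry-wise sampling lemmas above are easy to illustrate: the sketch below estimates how tightly $\|y_\Omega\|^2$ tracks $\tfrac{m}{n}\|y\|^2$ for a reasonably incoherent vector under uniform sampling with replacement. All parameters are illustrative.

```python
import numpy as np

# Monte Carlo illustration of Lemma 30: ||y_Omega||^2 concentrates around (m/n)||y||^2,
# with fluctuations governed by the incoherence mu(y) = n ||y||_inf^2 / ||y||^2.
rng = np.random.default_rng(6)
n, m, trials = 1000, 100, 2000

y = rng.standard_normal(n)                       # a fairly incoherent vector
mu_y = n * np.max(np.abs(y)) ** 2 / np.linalg.norm(y) ** 2
ratios = np.empty(trials)
for t in range(trials):
    omega = rng.choice(n, size=m, replace=True)  # sampling with replacement, as in the proofs
    ratios[t] = np.linalg.norm(y[omega]) ** 2 / ((m / n) * np.linalg.norm(y) ** 2)

print(f"mu(y) = {mu_y:.1f}, mean ratio = {ratios.mean():.3f}, "
      f"5th percentile = {np.quantile(ratios, 0.05):.3f}")
```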
Now we are ready to prove Lemmas 14, 15 and 16.

Proof of Lemma 14.
According to Lemmas 30, 31 and 29, we have
$$\|r_\Omega\|^2 = \|v_{\perp,\Omega}\|^2 - v_{\perp,\Omega}^T U_\Omega(U_\Omega^T U_\Omega)^{-1}U_\Omega^T v_{\perp,\Omega} \;\ge\; \|v_{\perp,\Omega}\|^2 - \big\|(U_\Omega^T U_\Omega)^{-1}\big\|_2\,\|U_\Omega^T v_{\perp,\Omega}\|^2 \;\overset{\vartheta_1}{\ge}\; \Big(1 - \alpha - \frac{(\beta_1+1)^2}{1-\gamma}\,\frac{d\mu(U)}{m}\Big)\frac{m}{n}\|v_\perp\|^2,$$
with probability at least $1 - 3\delta$.

Proof of Lemma 15. Lemma 31 and Lemma 29 together give
$$\|U\tilde w\| = \big\|(U_\Omega^T U_\Omega)^{-1}U_\Omega^T v_{\perp,\Omega}\big\| \;\le\; \big\|(U_\Omega^T U_\Omega)^{-1}\big\|_2\,\big\|U_\Omega^T v_{\perp,\Omega}\big\| \;\le\; \frac{\beta_1+1}{1-\gamma}\sqrt{\frac{d\mu(U)}{m}}\,\|v_\perp\|,$$
which holds with probability exceeding $1 - 2\delta$. Therefore,
$$\|p\| \;\le\; \|v_\parallel\| + \|U\tilde w\| \;\le\; \Big(1 + \frac{\beta_1+1}{1-\gamma}\sqrt{\frac{d\mu(U)}{m}}\Big)\|v\|.$$
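Lemma 29 is the workhorse behind both bounds above; the following sketch checks numerically that the inverse Gram matrix of a subsampled incoherent basis has spectral norm on the order of $n/m$. Parameters are illustrative.

```python
import numpy as np

# Monte Carlo illustration of Lemma 29: ||(U_Omega^T U_Omega)^{-1}||_2 is of order n/m
# for an incoherent subspace and uniform sampling with replacement.
rng = np.random.default_rng(7)
n, d, m, trials = 1000, 5, 100, 500

U, _ = np.linalg.qr(rng.standard_normal((n, d)))     # random (hence incoherent) subspace
vals = np.empty(trials)
for t in range(trials):
    omega = rng.choice(n, size=m, replace=True)
    UO = U[omega, :]
    vals[t] = np.linalg.norm(np.linalg.inv(UO.T @ UO), 2)

print(f"median spectral norm of the inverse Gram matrix: {np.median(vals):.1f}; n/m = {n/m:.1f}")
```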
We also need the following lemma for the proof of Lemma 16; its proof is provided at the end of this section.

Lemma 32.
Let $\beta_2 = \sqrt{2\mu(v_\perp)\log\tfrac1\delta\cdot\tfrac{d\mu_0}{d\mu_0 + m\sin^2\phi_d}}$, where again $\mu_0$ denotes the incoherence parameter of $R(\bar U)$. Then
$$P\left(\big\|\bar U_\Omega^T v_{\perp,\Omega}\big\| \le (1+\beta_2)\sqrt{\frac{m}{n}\frac{d\mu_0}{n}}\sqrt{\frac{m\sin^2\phi_d}{d\mu_0}+1}\;\|v_\perp\|\right) \;\ge\; 1 - \delta.$$
Proof of Lemma 16.
Note that $|\Delta| = \|\Delta\|$, for which we have the following:
$$\|\Delta\| = \big\|\tilde w^T(\bar U^T U)^{-1}\bar U^T r\big\| = \Big\|v_{\perp,\Omega}^T U_\Omega(U_\Omega^T U_\Omega)^{-1}(\bar U^T U)^{-1}\bar U_\Omega^T(I - P_{U_\Omega})v_{\perp,\Omega}\Big\|$$
$$\le\; \big\|v_{\perp,\Omega}^T U_\Omega\big\|\,\big\|(U_\Omega^T U_\Omega)^{-1}\big\|_2\,\big\|(\bar U^T U)^{-1}\big\|_2\,\big\|\bar U_\Omega^T(I - P_{U_\Omega})v_{\perp,\Omega}\big\|$$
$$\overset{\vartheta_1}{\le}\; \frac{1}{\cos\phi_d}\,\big\|v_{\perp,\Omega}^T U_\Omega\big\|\,\big\|(U_\Omega^T U_\Omega)^{-1}\big\|_2\,\big\|\bar U_\Omega^T v_{\perp,\Omega}\big\|$$
$$\le\; \frac{1}{\cos\phi_d}\,(\beta_1+1)\sqrt{\frac{m}{n}\frac{d\mu(U)}{n}}\;\frac{n}{(1-\gamma)m}\;(1+\beta_2)\sqrt{\frac{m}{n}\frac{d\mu_0}{n}}\sqrt{\frac{m\sin^2\phi_d}{d\mu_0}+1}\;\|v_\perp\|^2$$
$$\overset{\vartheta_2}{=}\; \frac{(1+\beta_1)(1+\beta_2)}{(1-\gamma)\cos\phi_d}\sqrt{\frac{m\sin^2\phi_d}{d\mu_0}+1}\;\sqrt{\frac{d\mu_0}{n}}\sqrt{\frac{d\mu(U)}{n}}\,\|v_\perp\|^2,$$
where $\vartheta_1$ holds since $\|\bar U_\Omega^T(I - P_{U_\Omega})v_{\perp,\Omega}\| \le \|\bar U_\Omega^T v_{\perp,\Omega}\|$ and $\|(\bar U^T U)^{-1}\|_2 \le 1/\cos\phi_d$, and $\vartheta_2$ follows by putting Lemmas 31, 29 and 32 together.

We also prove Lemma 17 for completeness. Before that, we first recall the following lemma, whose proof can be found in [7].

Lemma 33 ([7]). There exists an orthogonal matrix $V \in \mathbb{R}^{d\times d}$ such that
$$\sum_{k=1}^d\sin^2\phi_k \;\le\; \big\|\bar U V - U\big\|_F^2 \;\le\; 2\sum_{k=1}^d\sin^2\phi_k.$$
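Lemma 33 can be illustrated with the Procrustes-optimal alignment between the two bases, computed from the SVD of $\bar U^T U$; the sketch below checks the sandwich numerically on a randomly perturbed subspace (dimensions illustrative only).

```python
import numpy as np

# Numerical illustration of Lemma 33: with the Procrustes-optimal orthogonal V,
# ||Ubar V - U||_F^2 lies between sum_k sin^2(phi_k) and twice that sum.
rng = np.random.default_rng(8)
n, d = 200, 5

Ubar, _ = np.linalg.qr(rng.standard_normal((n, d)))
U, _ = np.linalg.qr(Ubar + 0.3 * rng.standard_normal((n, d)))   # a nearby subspace

W, s, Zt = np.linalg.svd(Ubar.T @ U)        # s = cos(phi_1), ..., cos(phi_d)
V = W @ Zt                                  # Procrustes-optimal orthogonal alignment
sin2 = np.sum(1.0 - s ** 2)                 # sum of squared principal-angle sines
gap = np.linalg.norm(Ubar @ V - U, "fro") ** 2

print(f"sum sin^2(phi) = {sin2:.4f}, ||Ubar V - U||_F^2 = {gap:.4f}, 2*sum = {2*sin2:.4f}")
```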
Proof of Lemma 17. According to Lemma 33 we have
$$\|U_i\| \;\le\; \|\bar U_i V\| + \|\bar U_i V - U_i\| \;\le\; \|\bar U_i\| + \sqrt{2\sum_{k=1}^d\sin^2\phi_k} \;\le\; \big(1+\sqrt{2}\big)\sqrt{\frac{d\mu_0}{n}}.$$
It hence follows that $\|U_i\|^2 \le (1+\sqrt{2})^2\,d\mu_0/n \le 6\,d\mu_0/n$.

We need the following lemma and McDiarmid's inequality to prove Lemma 32.

Lemma 34. $\|\bar U^T v_\perp\| \le \sin(\phi_d)\|v_\perp\|$, where $\phi_d$ denotes the largest principal angle between $R(\bar U)$ and $R(U)$.

Proof. According to the definition of $v_\perp$ and Lemma 18, we have
$$\|\bar U^T v_\perp\|^2 = \big\|\bar U^T(I - UU^T)\bar U s\big\|^2 = s^T\bar Y\bar\Sigma^2\bar Y^T s \;\overset{\vartheta_1}{\le}\; \sin^2\phi_d\; s^T\bar Y\bar\Sigma\bar Y^T s = \sin^2\phi_d\,\|v_\perp\|^2,$$
where $\bar Y$ and $\bar\Sigma$ are as defined in Lemma 18, and the last equality holds since $\|v_\perp\|^2 = \|s\|^2 - v^T U U^T v = s^T\bar Y\bar\Sigma\bar Y^T s$.

Theorem 35 (McDiarmid's Inequality [22]). Let $X_1, \dots, X_n$ be independent random variables, and assume $f$ is a function for which there exist $t_i$, $i = 1, \dots, n$, satisfying
$$\sup_{x_1,\dots,x_n,\hat x_i}\big|f(x_1,\dots,x_n) - f(x_1,\dots,\hat x_i,\dots,x_n)\big| \;\le\; t_i,$$
where $\hat x_i$ indicates replacing the sample value $x_i$ with any other of its possible values. Write $Y := f(X_1,\dots,X_n)$. Then for any $\epsilon > 0$,
$$P[Y \ge \mathbb{E}Y + \epsilon] \le \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n t_i^2}\Big), \qquad P[Y \le \mathbb{E}Y - \epsilon] \le \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n t_i^2}\Big).$$

Proof of Lemma 32. We use McDiarmid's inequality. For simplicity of notation, denote $v_\perp$ by $y$. Let $X_i = y_{\Omega(i)}\,\bar U_{\Omega(i)}^T \in \mathbb{R}^d$, where $\bar U_{\Omega(i)}$ denotes the $\Omega(i)$-th row of $\bar U$, and let $f(X_1,\dots,X_m) = \|\sum_{i=1}^m X_i\| = \|\bar U_\Omega^T y_\Omega\|$. Then $|f(x_1,\dots,x_m) - f(x_1,\dots,\hat x_k,\dots,x_m)|$ can be bounded via
$$\left|\Big\|\sum_{i=1}^m X_i\Big\| - \Big\|\sum_{i\ne k} X_i + \hat X_k\Big\|\right| \;\le\; \big\|X_k - \hat X_k\big\| \;\le\; \|X_k\| + \|\hat X_k\| \;\le\; 2\|y\|_\infty\sqrt{\frac{d\mu_0}{n}}. \qquad (62)$$
We next calculate $\mathbb{E}[f(X_1,\dots,X_m)] = \mathbb{E}\|\sum_{i=1}^m X_i\|$. Note that
$$\mathbb{E}\Big\|\sum_{i=1}^m X_i\Big\|^2 = \mathbb{E}\Big[\sum_{i=1}^m\|X_i\|^2\Big] + \mathbb{E}\Big[\sum_{i=1}^m\sum_{j\ne i}X_i^T X_j\Big]. \qquad (63)$$
Recall that we assume the samples are taken uniformly with replacement.
This, together with the fact that $\|\bar U_i\| = \|P_{R(\bar U)}(e_i)\| \le \sqrt{d\mu_0/n}$, yields the following:
$$\mathbb{E}\Big[\sum_{i=1}^m\|X_i\|^2\Big] = \sum_{i=1}^m\mathbb{E}\big[\|y_{\Omega(i)}\bar U_{\Omega(i)}^T\|^2\big] = \sum_{i=1}^m\sum_{k=1}^n\|\bar U_k\|^2 y_k^2\,P\{\Omega(i)=k\} \;\le\; \frac{m}{n}\frac{d\mu_0}{n}\|y\|^2, \qquad (64)$$
$$\mathbb{E}\Big[\sum_{i=1}^m\sum_{j\ne i}X_i^T X_j\Big] = \sum_{i=1}^m\sum_{j\ne i}\sum_{k=1}^n\sum_{k'=1}^n y_k\,\bar U_k\bar U_{k'}^T\,y_{k'}\,P\{\Omega(j)=k'\}P\{\Omega(i)=k\} = \frac{m(m-1)}{n^2}\|\bar U^T y\|^2 \;\le\; \frac{m^2}{n^2}\sin^2\phi_d\,\|y\|^2, \qquad (65)$$
where the last inequality holds by Lemma 34. (63), (64) and (65) together with Jensen's inequality imply
$$\mathbb{E}\Big\|\sum_{i=1}^m X_i\Big\| \;\le\; \sqrt{\frac{m}{n}\Big(\frac{m}{n}\sin^2\phi_d + \frac{d\mu_0}{n}\Big)}\,\|y\| \;=\; \sqrt{\frac{m}{n}\frac{d\mu_0}{n}}\sqrt{\frac{m\sin^2\phi_d}{d\mu_0}+1}\,\|y\|. \qquad (66)$$
Let $\epsilon = \beta_2\sqrt{\tfrac{m}{n}\tfrac{d\mu_0}{n}}\sqrt{\tfrac{m\sin^2\phi_d}{d\mu_0}+1}\,\|y\|$. Then (62) and (66) together with Theorem 35 give
$$P\left(\|\bar U_\Omega^T y_\Omega\| \ge (1+\beta_2)\sqrt{\frac{m}{n}\frac{d\mu_0}{n}}\sqrt{\frac{m\sin^2\phi_d}{d\mu_0}+1}\,\|y\|\right) \;\le\; \exp\left(-\frac{\beta_2^2\,\frac{m}{n}\frac{d\mu_0}{n}\big(\frac{m\sin^2\phi_d}{d\mu_0}+1\big)\|y\|^2}{2m\|y\|_\infty^2\,d\mu_0/n}\right) = \exp\left(-\frac{\beta_2^2\big(\frac{m\sin^2\phi_d}{d\mu_0}+1\big)\|y\|^2}{2n\|y\|_\infty^2}\right) = \delta, \qquad (67)$$
where the last equality follows by substituting our definition of $\mu(y)$ from Eq. (23) and the definition of $\beta_2$.
References

[1] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[2] Diego Armentano, Carlos Beltrán, and Michael Shub. Average polynomial time for eigenvector computations. arXiv preprint arXiv:1410.2179, 2014.
[3] Raman Arora, Andy Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems, pages 1815–1823, 2013.
[4] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems, pages 3174–3182, 2013.
[5] Laura Balzano, Robert Nowak, and Benjamin Recht. Online identification and tracking of subspaces from highly incomplete information. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 704–711. IEEE, 2010.
[6] Laura Balzano, Benjamin Recht, and Robert Nowak. High-dimensional matched subspace detection when data are missing. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pages 1638–1642. IEEE, 2010.
[7] Laura Balzano and Stephen J Wright. Local convergence of an algorithm for subspace identification from partial data. Foundations of Computational Mathematics, pages 1–36, 2014.
[8] Laura Kathryn Balzano. Handling Missing Data in High-Dimensional Subspace Modeling. PhD thesis, University of Wisconsin – Madison, 2012.
[9] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
[10] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. arXiv preprint arXiv:1509.03917, 2015.
[11] J Paul Brooks, JH Dulá, and Edward L Boone. A pure l1-norm principal component analysis. Computational Statistics & Data Analysis, 61:83–98, 2013.
[12] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[13] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
[14] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9:1269–1294, 2008.
[15] Christopher D De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2332–2341, 2015.
[16] Alan Edelman, Tomás A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[17] Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 4th edition, 2012.
[18] Jun He, Laura Balzano, and Arthur Szlam. Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1568–1575. IEEE, 2012.
[19] Prateek Jain, Chi Jin, Sham M Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Conference on Learning Theory, pages 1147–1164, 2016.
[20] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
[21] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. Information Theory, IEEE Transactions on, 56(6):2980–2998, 2010.
[22] Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.
[23] Thanh Ngo and Yousef Saad. Scaled gradients on Grassmann manifolds for matrix completion. In Advances in Neural Information Processing Systems, pages 1412–1420, 2012.
[24] Hoi H Nguyen, Van Vu, et al. Random matrices: Law of the determinant. The Annals of Probability, 42(1):146–167, 2014.
[25] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[26] R. H. Keshavan. Efficient Algorithms for Collaborative Filtering. PhD thesis, Stanford University, 2012.
[27] Gilbert W Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
[28] Stanislaw J Szarek. Metric entropy of homogeneous spaces. arXiv preprint math/9701213, 1997.
[29] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[30] Martin Wainwright. Basic tail and concentration bounds, 2015.
[31] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.
[32] Bin Yang. Projection approximation subspace tracking. IEEE Transactions on Signal Processing, 43(1):95–107, 1995.
[33] Dejiao Zhang and Laura Balzano. Global convergence of a Grassmannian gradient descent algorithm for subspace estimation. In AISTATS, pages 1460–1468, 2016.
[34] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, 2015.