Incrementally Updated Spectral Embeddings
Vasileios Charisopoulos∗, Austin R. Benson†, and Anil Damle‡

∗Department of Operations Research & Information Engineering, Cornell University, Ithaca, NY 14850. Email: [email protected]
†Department of Computer Science, Cornell University, Ithaca, NY 14850. Email: [email protected]
‡Department of Computer Science, Cornell University, Ithaca, NY 14850. Email: [email protected]

Abstract.
Several fundamental tasks in data science rely on computing an extremal eigenspace of size $r \ll n$, where $n$ is the underlying problem dimension. For example, spectral clustering and PCA both require the computation of the leading $r$-dimensional subspace. Often, this process is repeated over time due to the possible temporal nature of the data; e.g., graphs representing relations in a social network may change over time, and feature vectors may be added, removed or updated in a dataset. Therefore, it is important to efficiently carry out the computations involved to keep up with frequent changes in the underlying data and also to dynamically determine a reasonable size for the subspace of interest. We present a complete computational pipeline for efficiently updating spectral embeddings in a variety of contexts. Our basic approach is to "seed" iterative methods for eigenproblems with the most recent subspace estimate to significantly reduce the computations involved, in contrast with a naïve approach which recomputes the subspace of interest from scratch at every step. In this setting, we provide various bounds on the number of iterations common eigensolvers need to perform in order to update the extremal eigenspace to a sufficient tolerance. We also incorporate a criterion for determining the size of the subspace based on successive eigenvalue ratios. We demonstrate the merits of our approach on the tasks of spectral clustering of temporally evolving graphs and PCA of an incrementally updated data matrix.

Key words. spectral methods, iterative methods, temporal data, matrix perturbation.
1. Introduction.
In the big data era, scientists and engineers need to operate on massive datasets on a daily basis, fueling essential algorithms for commercial or scientific applications. These datasets can contain millions or billions of data points, making the task of extracting meaningful information especially challenging [36]; moreover, it is often the case that the data are also high-dimensional, which can significantly affect the time and storage required to work with such datasets. These computational challenges strongly motivate the need to work with "summarized" versions of data that facilitate fast and memory-efficient computations. Several popular techniques in data science have been devised to address this problem, such as sketching [68], dimensionality reduction [4, 29, 54], and limited-memory/stochastic algorithms in optimization (see, e.g., [7] and references therein).

A popular approach to remedy some of the aforementioned difficulties revolves around using spectral information associated with the problem. For example, many high-dimensional problems involve data that can be modeled as graphs encoding interactions between entities in social networks [46], gene interaction data [45], and product recommendation networks [35]. A task of fundamental importance is clustering the nodes of the network into communities or clusters of similar nodes [50, 59].
Spectral clustering is a popular approach for this problem that computes the $r$ leading eigenvectors of the (symmetrically normalized) adjacency matrix, uses this subspace as a low-dimensional representation of nodes in the graph, and then feeds this representation to a point cloud clustering algorithm such as k-means [43]. Determining $r$ on the fly is often done by comparing the ratios of successive eigenvalues [65].

In a similar vein, linear dimensionality reduction is often tackled via Principal Component Analysis (PCA) [29], which projects the data matrix $X$ to a low-dimensional coordinate system spanned by the leading eigenvectors of the covariance matrix $X^\top X$. The $i$th principal component is precisely the projection of the data matrix $X$ onto the $i$th eigenvector. In nonlinear dimensionality reduction, a common assumption is that the data lie on a low-dimensional manifold embedded in high-dimensional space. The method of Laplacian eigenmaps [6], which operates under this assumption, constructs a weighted adjacency matrix and solves a small sequence of eigenproblems to compute a lower-dimensional embedding of the data points.

Going one step further, many datasets get incrementally updated over time. For example, graph data can change in many ways: new links form between entities, old connections cease to exist, and new nodes are introduced to a network [64, 33]. In such cases, old clustering assignments may no longer accurately reflect the community structure of the network, and therefore need to be updated [44]. In other cases, the dimensionality of data may increase as more informative features as well as new data become available [5, 37]. Such updates reduce the fidelity of the previously computed low-dimensional embeddings. In both cases, we are faced with a pressing question: has the quality of the spectral embedding degraded, and if so, can we update it efficiently?
The main theoretical tools for answering the first part of this question come from matrix perturbation theory [61], which provides worst-case bounds for the distance between eigenspaces. However, algorithms proposed for updating spectral embeddings may not necessarily utilize those results. For example, if the original (unperturbed) data matrix and its updates are low rank, then one can update the thin SVD in practically linear time [8]; however, this is rarely the case in practice. Other "direct" approaches under low-rank updates are often inapplicable since they require the full set of singular values and vectors to be available for the unperturbed matrix [9, 26]. When the data matrix is sparse, a natural candidate may be updating the sparse Cholesky or $LDL^\top$ decompositions [12]; however, these rely on heuristics, may not preserve sparsity across multiple updates, and it is unclear how to adapt these methods to maintain a low-dimensional subspace.

Several of the aforementioned approaches share a common prohibitive requirement: knowledge of the full set of eigenvalues and eigenvectors at the beginning of an update step. If one is willing to forego guarantees on the accuracy of the updated embedding, it is at least empirically possible to exploit the first-order expansion of eigenvalues, where explicit knowledge of the full set is no longer required [44]. On the downside, this entails significant uncertainty about the quality of the maintained embedding. A more promising and relevant heuristic is that of warm-starting iterative eigenvalue methods. Many eigensolvers are typically initialized using a random guess, but when the target is a sequence of similar or related eigenproblems, it is natural to expect that the previous solution might be a good initial estimate. This heuristic has been applied towards solving sequences of linear systems [31, 47], sometimes referred to as recycled Krylov methods, and problems arising from successive linearizations of nonlinear eigenvalue problems [58]. Somewhat surprisingly, despite the empirical success of warm starting, it is often overlooked in favor of other heuristics [44] or matrix sketching to reduce per-iteration cost [25, 27].

Given the preceding discussion, an ideal method for incrementally computing spectral embeddings should:
1. incur minimal additional work per modification, ideally independent of the underlying problem's dimension;
2. require minimal additional storage or extensive pre-computation (this excludes classical approaches that assume knowledge of the full set of eigenvalues [9]); and
3. take advantage of any features of the underlying matrix, such as sparsity or structure (e.g., Toeplitz matrices).
With regards to Item 1, it is important to be able to bound the amount of work required to update a spectral embedding on the fly, before performing the actual update.

In this paper, we show how warm-started iterative methods can naturally satisfy all of the above criteria under a minimal assumption on eigenvalue decay, formalized in Section 3. We show that when the perturbations are "small", using the previous subspace estimate as a "seed" for warm-starting entails essentially $O(1)$ additional iterations per update, even when the previous estimates have been computed inexactly (orders of magnitude above floating point error). We provide an adaptive algorithm for updating the spectral embedding over time, as well as computable bounds on the number of iterations required after each modification to the underlying matrix, for two standard iterative methods (eigensolvers). Because the main bottleneck operation of iterative methods is usually matrix-vector multiplication [18, 22, 57], the proposed algorithm naturally accelerates whenever the underlying matrix is structured or sparse. It also enables tracking a proxy for the appropriate subspace dimension, which can aid in determining the correct size of the spectral embedding on the fly.

In Section 4, we present a few concrete applications which fit into the incremental framework outlined above. In these settings, we are able to utilize properties of the perturbations to derive further a priori bounds on their effects on subspace distance.
For example, when the perturbations are outer products of Gaussian random vectors, the subspace distance is affected by a factor of at most $\tilde{O}(\sqrt{r/n})$ with high probability. We also present experiments on real and synthetic data which validate the effectiveness of our approach.

Finally, we stress that we do not intend to provide a replacement for or compete with established eigensolvers, but rather describe a concrete pipeline for incremental spectral embeddings with well-understood worst-case guarantees. Indeed, our method only benefits from advances in software aimed at solving eigenproblems.

Notation. Throughout the paper, we endow $\mathbb{R}^n$ with the standard inner product $\langle x, y \rangle := x^\top y$ and the induced norm $\|x\| = \sqrt{\langle x, x \rangle}$. Given a matrix $A \in \mathbb{R}^{n \times n}$, we write $\|A\|_2$ for its spectral norm $\sup_{\|x\|=1} \|Ax\|$ and $\|A\|_F := \sqrt{\operatorname{tr}(A^\top A)}$ for its Frobenius norm. We let $\mathbb{O}_{n,k} := \{Q \in \mathbb{R}^{n \times k} \mid Q^\top Q = I\}$ denote the set of real $n \times k$ matrices with orthonormal columns, and drop the second subscript when $k = n$.

We write $\lambda_i(A)$ for the $i$th eigenvalue of a matrix $A$, assuming a descending order such that $\lambda_1(A) \geq \cdots \geq \lambda_n(A)$. We follow the same notation for singular values, $\sigma_i(A)$. Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we will use its spectral decomposition:

(1.1)  $A = \begin{bmatrix} V & V_\perp \end{bmatrix} \begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda_\perp \end{bmatrix} \begin{bmatrix} V^\top \\ V_\perp^\top \end{bmatrix}$,

where $[V \; V_\perp]$ are the eigenvectors of $A$ and $\begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda_\perp \end{bmatrix}$ is a diagonal matrix containing the eigenvalues of $A$. By convention, $V$ should be understood to correspond to the subspace of interest. Similarly, when $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ is an arbitrary rectangular matrix, we will use its singular value decomposition $A = U \Sigma V^\top$, where $U \in \mathbb{O}_{m,n}$, $V \in \mathbb{O}_n$ are the left and right singular vectors of $A$ and $\Sigma$ is a diagonal matrix containing the singular values of $A$. We will write $\kappa(A) := \sigma_1(A)/\sigma_n(A)$ for $A$'s condition number.

Throughout the main text, we refer to standard linear algebraic routines, which are built into numerical software packages such as LAPACK, using the following notation:
• $Q, R = \mathrm{qr}(A)$ for the (reduced) QR factorization of $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, which decomposes $A$ into an orthogonal matrix $Q \in \mathbb{O}_{m,n}$ and an upper triangular matrix $R \in \mathbb{R}^{n \times n}$.
• $V, \Lambda = \mathrm{spectral}(A)$ for the spectral decomposition of a symmetric matrix $A \in \mathbb{R}^{n \times n}$, with $V$ containing the eigenvectors and $\Lambda := \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ containing the eigenvalues of $A$, respectively. We will assume that $\lambda_1 \geq \cdots \geq \lambda_n$ in $\Lambda$.
Moreover, we use $\operatorname{diag}(x_1, \ldots, x_n)$ to refer to the diagonal matrix constructed from the vector $x = (x_1 \; \ldots \; x_n)^\top$, and when $D$ is a matrix we denote $\operatorname{diag}(D) := (D_{11} \; D_{22} \; \ldots \; D_{nn})^\top$. We also use notation from Golub and Van Loan [22] for matrix and vector slicing: $A_{:,1:r}$ denotes the submatrix of $A$ formed by taking the first $r$ columns, while $x_{1:r}$ denotes the vector formed by the leading $r$ elements of $x$.
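The two routines above map directly onto standard library calls. A minimal Julia rendering (ours; the function names are illustrative and not from the paper's released code) under the descending-order convention:

```julia
using LinearAlgebra

# Reduced QR factorization, matching the paper's qr(A) primitive.
qr_reduced(A::AbstractMatrix) = let F = qr(A)
    (Matrix(F.Q), F.R)  # Q ∈ O_{m,n} (thin), R upper triangular
end

# Spectral decomposition of a symmetric matrix with eigenvalues sorted
# in DESCENDING order, matching the paper's spectral(A) primitive.
function spectral(A::AbstractMatrix)
    F = eigen(Symmetric(A))
    p = sortperm(F.values; rev=true)
    return F.vectors[:, p], F.values[p]
end
```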
2. Iterative Methods for Eigenvalue Problems.
In this section we present two popular iterative methods for computing the eigenvalues of symmetric/Hermitian matrices that form the building blocks of our algorithms. Additionally, we review their convergence guarantees. Throughout, we assume that we are interested in the subspace corresponding to the algebraically largest eigenvalues (which are also the largest in magnitude, up to a shift). Recall that the
Ritz values are the eigenvalues of the matrix $V^\top A V$, where $V$ is an approximation to an invariant subspace of $A$. Below, we are interested in subspace distances measured in the spectral norm, $\operatorname{dist}(V, \tilde{V}) := \| V V^\top - \tilde{V} \tilde{V}^\top \|_2$, which has been the focus of the convergence analysis of iterative methods for eigenproblems, and pairs well with established matrix perturbation theory for unitarily invariant norms.

Remark 1.
While our work predominantly relies on standard notions of subspace distance, many applications may benefit from different types of control over changes to invariant subspaces. In particular, controlling the $2 \to \infty$ distance

$\operatorname{dist}_{2\to\infty}(V, \tilde{V}) := \min_{W \in \mathbb{O}_r} \| V - \tilde{V} W \|_{2\to\infty} = \min_{W \in \mathbb{O}_r} \max_{j \in [n]} \| V_{j,:} - (\tilde{V} W)_{j,:} \|_2,$

may be more applicable to settings where subspaces are interpreted row-wise (as is the case in spectral clustering). Recent work on perturbation theory for the $2 \to \infty$ norm [1, 11, 15, 19] allows aspects of our work to be extended to control incremental changes to the subspace under this metric. However, the lack of convergence theory for eigensolvers in this metric (beyond naïve bounds from norm equivalence) prevents us from performing a full analysis in this setting.

The eigensolvers sketched in Algorithms 2.1 and 2.2 perform a prescribed number of iterations $k_{\max}$. However, in practice one usually incorporates robust numerical termination criteria, which often results in fewer iterations being required. We omit such criteria here to simplify the presentation and defer the details to the books by Parlett [48] and Saad [57].

Subspace iteration.
Subspace iteration, also known as simultaneous iteration, is a well-studied generalization of the celebrated power method for computing the leading or trailing $r$ eigenvalues and eigenvectors of a symmetric matrix $A$. Starting from an initial guess $V_0$, the algorithm computes orthonormal bases for terms of the form $A^q V_0$, usually handled in practice via the QR factorization. In the special case of $r = 1$, the intuition is that higher powers of $A$ amplify the component of the vector corresponding to the dominant eigenvector (as long as the initial guess is not perpendicular to that eigenvector). Since the estimates of subspace iteration converge to the eigenspace corresponding to eigenvalues of largest magnitude (instead of algebraically largest), we assume it will be used with an appropriate shift.

Algorithm 2.1 shows a full version of subspace iteration incorporating the Rayleigh–Ritz procedure. In the algorithm, $r$ stands for the desired number of eigenpairs sought, but we may opt for running it with a larger subspace of size $\ell > r$ for reasons that will be clarified later. There are a number of modifications (e.g., "locking" converged eigenpairs, shifts) that one can incorporate to improve subspace iteration in practice [57].
Algorithm 2.1 Subspace iteration
Input: matrix $A$, dimension $\ell$, target dimension $r$, initial estimate $V_0 \in \mathbb{R}^{n \times \ell}$, iteration count $k_{\max}$.
for $k = 1, \ldots, k_{\max}$ do
  $V_k, R_k := \mathrm{qr}(A V_{k-1})$
  $Q_k, \Lambda_k := \mathrm{spectral}(V_k^\top A V_k)$
  $V_k := V_k \cdot Q_k$
end for
return $(V_k)_{:,1:r}$, $\operatorname{diag}(\Lambda_k)_{1:r}$

The following theorem characterizes the convergence of subspace iteration when $\ell = r$.
Consider the spectral decomposition of $A$ as in (1.1), assume that $|\lambda_r| > |\lambda_{r+1}|$, and let

$d := \| V V^\top - V_0 V_0^\top \|_2, \qquad \Delta := \frac{d}{\sqrt{1 - d^2}},$

where $V_0 \in \mathbb{O}_{n,r}$. Then Algorithm 2.1 initialized with $V_0$ and $\ell = r$ produces an orthogonal matrix $V_k$ such that

(2.1)  $k_{\max} \geq \frac{\log(\Delta/\varepsilon)}{\log\!\left(\left|\frac{\lambda_r}{\lambda_{r+1}}\right|\right)} \;\Rightarrow\; \| V V^\top - V_k V_k^\top \|_2 \leq \varepsilon.$

The rate of convergence of individual eigenpairs may be faster than the rate implied by Theorem 2.1; for example, the rate of convergence of the $i$th eigenvalue estimate is asymptotically controlled by $|\lambda_i/\lambda_{r+1}|$ rather than $|\lambda_r/\lambda_{r+1}|$ [60].

The amount of work suggested by Theorem 2.1 crucially relies on two quantities: the eigenvalue ratio $\rho := \left|\frac{\lambda_r}{\lambda_{r+1}}\right|$, which controls the convergence rate, as well as the initial subspace distance $d$. In this paper, we use the fact that even if $\rho$ is moderately close to 1, a small initial distance can result in fewer iterations.
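To make Algorithm 2.1 concrete, here is a minimal Julia sketch (our own rendering, not the paper's released implementation); it assumes a symmetric $A$ and omits the shifts, locking, and stopping criteria discussed above.

```julia
using LinearAlgebra

# Subspace iteration with Rayleigh-Ritz (Algorithm 2.1), warm-started at V0.
# Returns the leading r Ritz vectors and Ritz values after kmax iterations.
function subspace_iteration(A::AbstractMatrix, V0::AbstractMatrix, r::Int, kmax::Int)
    V = Matrix(qr(V0).Q)                  # orthonormalize the seed
    Λ = Float64[]
    for _ in 1:kmax
        V = Matrix(qr(A * V).Q)           # V_k, R_k := qr(A * V_{k-1})
        F = eigen(Symmetric(V' * A * V))  # Rayleigh-Ritz on the small matrix
        p = sortperm(F.values; rev=true)  # descending Ritz value order
        Λ = F.values[p]
        V = V * F.vectors[:, p]
    end
    return V[:, 1:r], Λ[1:r]
end
```

Seeding `V0` with the previous subspace estimate, as developed in Section 3, is precisely what makes each update cheap.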
Block Krylov method. Subspace iteration, while practical, discards a lot of useful information by only using the last computed power $A^q V_0$, instead of using the full (block) Krylov subspace $\operatorname{span}(V_0, A V_0, \ldots, A^q V_0)$. This is precisely the motivation for (block) Krylov methods. The well-known Lanczos method is part of this family when $V_0 = v_0$, a single vector. In practice, the Lanczos method often exhibits superlinear convergence [48, 55], making it the method of choice for eigenvalue computation. However, its performance deteriorates in the presence of repeated or clustered (i.e., very close to each other in magnitude) eigenvalues. It is well known that the single-vector Lanczos method cannot find multiple eigenvalues without deflation [57] and converges slowly for clustered eigenvalues. On the other hand, the block Lanczos method can get around this issue, provided the block size $b$ is set appropriately (i.e., $b > r$). See Parlett [48] for a discussion on the tradeoffs of block sizes.

Algorithm 2.2 presents a block Krylov method, albeit an impractical variant, as it forms the full Krylov matrix. While far from a practical implementation, this form is helpful for stating and interpreting results under the assumption that we are operating in exact arithmetic.
Algorithm 2.2 Block Krylov method
Input: $A \in \mathbb{R}^{n \times n}$, dimension $r$, block size $b \geq r$, initial matrix $V_0 \in \mathbb{R}^{n \times b}$, power $k_{\max}$.
Form the block Krylov matrix $K = [\, A V_0 \;\; (A A^\top) A V_0 \;\; \ldots \;\; (A A^\top)^{k_{\max}} A V_0 \,]$
$Q, R = \mathrm{qr}(K)$
$V_k, \Lambda_k = \mathrm{spectral}(Q^\top A Q)$
return $(Q V_k)_{:,1:r}$, $\operatorname{diag}(\Lambda_k)_{1:r}$

The block Lanczos algorithm first appeared to address the shortcomings of the single-vector Lanczos iteration [14]. Saad provides a comprehensive convergence analysis [56], while more recent work presents sharp bounds and addresses convergence to clusters of eigenvalues [39]. For our purposes, we use the analysis found in Wang et al. [66], the results of which are stated in a form easily comparable with the corresponding rate for Algorithm 2.1.
Theorem 2.2 (Adapted from Theorem 6.6 in [66]).
Consider the Schur decomposition of $A$ as in (1.1), assume that $\lambda_r > \lambda_{r+1}$, and let

$d := \| V V^\top - V_0 V_0^\top \|_2, \qquad \Delta := \frac{d}{\sqrt{1 - d^2}},$

where $V_0 \in \mathbb{O}_{n,r}$. Then Algorithm 2.2 initialized with $V_0$ produces an orthogonal matrix $V_k$ such that

(2.2)  $k_{\max} \geq \frac{\log(\Delta/\varepsilon)}{\sqrt{\left|\frac{\lambda_r}{\lambda_{r+1}}\right| - 1}} \;\Rightarrow\; \| V V^\top - V_k V_k^\top \|_2 \leq \varepsilon.$

Comparing with the rate of Theorem 2.1, the block Lanczos method is clearly superior to subspace iteration: in the challenging regime where $\lambda_r \approx \lambda_{r+1}$, $\sqrt{\frac{\lambda_r}{\lambda_{r+1}} - 1} \gg \log\!\left(\frac{\lambda_r}{\lambda_{r+1}}\right)$. As with Algorithm 2.1, it is important to keep in mind that starting from an estimate $V_0$ with $d \ll 1$ can drastically reduce the number of iterations required.
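For completeness, a direct Julia transcription of Algorithm 2.2 follows (again our own sketch; a practical implementation would use a three-term recurrence with reorthogonalization rather than forming $K$ explicitly).

```julia
using LinearAlgebra

# Naive block Krylov method (Algorithm 2.2): forms the full Krylov matrix.
# For symmetric A, (A*A')^i * A * V0 reduces to odd powers A^(2i+1) * V0.
function block_krylov(A::AbstractMatrix, V0::AbstractMatrix, r::Int, kmax::Int)
    blocks = [A * V0]
    for _ in 1:kmax
        push!(blocks, (A * A') * blocks[end])
    end
    Q = Matrix(qr(hcat(blocks...)).Q)   # orthonormal basis of the Krylov subspace
    F = eigen(Symmetric(Q' * A * Q))    # Rayleigh-Ritz
    p = sortperm(F.values; rev=true)
    return (Q * F.vectors[:, p])[:, 1:r], F.values[p][1:r]
end
```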
3. Subspace estimation with incremental updates.
In this section, we develop our incremental approach to spectral estimation, motivated by data that changes over time. As stated before, we are interested in maintaining an invariant subspace across "small" modifications to the data. More specifically, we have a matrix $A \in \mathbb{R}^{n \times n}$ and wish to dynamically update an estimate of the leading invariant subspace of $A$ corresponding to the eigenvalues of largest magnitude. This is not without loss of generality; for example, modifying subspace iteration for finding the smallest eigenpairs involves solving a linear system. However, some applications stated in terms of finding the smallest eigenpairs can be reformulated to fit our setting (see Subsection 4.1). A common example of such a problem is spectral clustering for community detection in graphs, which we rely on for intuition and for constructing various numerical examples in the experimental section. However, our approach generalizes beyond the graph setting.

We formalize the incremental updates to our matrix as follows:
Assumption 1. Let $A^{(0)} := A \in \mathbb{R}^{n \times n}$. At each time step $t \in \{1, \ldots, T\}$, we observe updates of the form $A^{(t)} := A^{(t-1)} + E^{(t)}$, where $E^{(t)}$ is random, sparse, or low-rank. We denote the eigenvectors and eigenvalues of $A^{(t)}$ by $V^{(t)}$ and $\Lambda^{(t)}$.

An additional assumption that we impose on our matrix $A$ is that there is sufficient decay on the lower end of the spectrum outside of the subspace of interest, stated as Assumption 2.

Assumption 2.
Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ be the ordered eigenvalues of $A$. There is a constant $\gamma > 0$ and an integer $r_0 \ll n$ such that for all $r \leq r_0$, there exists a $p \ll n$ satisfying

(3.1)  $|\lambda_r| \geq (1 + \gamma) |\lambda_p|.$

Intuitively, Assumption 2 implies a sort of "uniform" decay on the eigenvalues of $A$, since $\gamma$ does not depend on the particular choice of $r \leq r_0$. Indeed, if we allowed $\gamma = 0$, then (3.1) would be trivially satisfied. The challenging regime is when $\gamma \ll 1$,
as hinted by standard convergence results on iterative methods for eigenvalue problems [18, 22, 57]. This technical assumption allows us to derive a priori upper bounds on the required steps of our eigensolvers discussed above. In all the experiments presented in Section 4, Assumption 2 was verified to hold with $\gamma$ bounded away from zero and $p \leq r + 10$.

We now sketch the high-level idea behind our approach. Naturally, not all modifications to the underlying data have the same effect on the invariant subspace of interest. For instance, "small" (with respect to some matrix norm) modifications should not significantly change the leading eigenvalues and eigenvectors. Matrix perturbation theory provides us with the analytical tools to measure the worst-case behavior of these modifications [61]. Intuitively, if this behavior is sufficiently "bounded", seeding our iterative method at the previously obtained estimate should reduce the computation required to obtain the next estimate. This is the crux of our approach.

An indispensable tool for perturbation bounds is the Davis–Kahan theorem [16, 38]. In order to state it formally, we first recall the definition of principal angles between subspaces [61].
Definition 3.1. Consider matrices $V, W \in \mathbb{O}_{n,k}$ and their corresponding column spaces $\mathcal{V}, \mathcal{W}$. Denote $\sigma_i := \sigma_i(V^\top W)$, $i = 1, \ldots, k$. The principal angles between $\mathcal{V}$ and $\mathcal{W}$ are the diagonal elements of the matrix

$\Theta(V, W) := \operatorname{diag}(\cos^{-1} \sigma_1, \ldots, \cos^{-1} \sigma_k).$

Moreover, the $\sin\Theta$-distance between $\mathcal{V}$, $\mathcal{W}$ for any unitarily invariant norm $\|\cdot\|$ is

(3.2)  $\| \sin \Theta(V, W) \| := \| \operatorname{diag}(\sin(\cos^{-1} \sigma_1), \ldots, \sin(\cos^{-1} \sigma_k)) \|.$

With Definition 3.1 in hand, we can state the Davis–Kahan theorem (Theorem 3.2). While it holds for any unitarily invariant norm, we are only interested in the case where $\|\cdot\| = \|\cdot\|_2$.

Theorem 3.2 (Davis–Kahan).
Suppose $A, \hat{A}$ are symmetric and let $V, \hat{V} \in \mathbb{R}^{n \times r}$ be orthonormal bases of invariant subspaces containing eigenvectors of $A$ and $\hat{A}$, respectively. Let $\Lambda(V)$ denote the set of eigenvalues of $A$ corresponding to $V$ and $\Lambda(\hat{V}_\perp)$ the set of eigenvalues of $\hat{A}$ corresponding to $\hat{V}_\perp$. If there exists an interval $[\alpha, \beta]$ and $\delta > 0$ such that $\Lambda(V) \subset [\alpha, \beta]$ and $\Lambda(\hat{V}_\perp) \subset (-\infty, \alpha - \delta) \cup (\beta + \delta, +\infty)$, then

(3.3)  $\| \sin \Theta(V, \hat{V}) \|_2 \leq \frac{\| (\hat{A} - A) V \|_2}{\delta}.$

Moreover, if $\| A - \hat{A} \|_2 \leq \delta_r / 2$, where $\delta_r = \min \{ |\mu - \lambda| \mid \mu \in \Lambda(V_\perp), \lambda \in \Lambda(V) \}$, then

(3.4)  $\| \sin \Theta(V, \hat{V}) \|_2 \leq \frac{2 \| (\hat{A} - A) V \|_2}{\delta_r}.$

Equation (3.4) above is straightforward to derive (e.g., see the proof of [21, Corollary 2.1]). We employ Theorem 3.2 to obtain an a priori bound on the distance between previously available subspace estimates and the invariant subspace of the updated matrix. Algorithm 3.1 outlines our high-level strategy.
Algorithm 3.1
High-level algorithm for subspace updates
Input: matrix $A := A^{(0)}$, initial subspace size $r$, number of higher-order eigenvalues $q \geq 1$, threshold $\varepsilon$.
1: compute estimate $\hat{V}^{(0)} \in \mathbb{R}^{n \times r}$
2: compute estimates $\hat{\lambda}_r, \ldots, \hat{\lambda}_{r+q}$
3: for $t = 1, \ldots, T$ do
4:   $A^{(t)} := A^{(t-1)} + E^{(t)}$
5:   estimate $\| \sin \Theta(V^{(t)}, V^{(t-1)}) \|_2 \leq d_t$   ▷ Subsection 3.2.3
6:   if $d_t > \varepsilon$ then
7:     update $\hat{V}^{(t)}$, $\hat{\lambda}_r$ to tolerance $\varepsilon$   ▷ Subsection 3.2.1
8:     update higher-order eigenvalues $\hat{\lambda}_{r+1}, \ldots, \hat{\lambda}_{r+q}$ to tolerance $\varepsilon$   ▷ Subsection 3.2.2
9:     update subspace size $r$   ▷ Subsection 3.2.4
10:  end if
11: end for

In Algorithm 3.1, the constant $q \geq 1$ controls how many eigenvalue estimates beyond $\hat{\lambda}_r$ are maintained in order to track the ratio $|\lambda_r/\lambda_{r+1}|$ for our choice of eigenvalue algorithm. Step 7 updates the subspace estimate while step 8 updates the higher-order eigenvalue estimates. These updates are triggered when our proxy for (3.4) is above a specified threshold $\varepsilon$.
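Both the trigger $d_t > \varepsilon$ in line 6 and our experimental evaluation need the spectral-norm $\sin\Theta$ distance between two orthonormal bases, which is computable from the singular values of $V^\top W$ alone. A minimal Julia helper (ours):

```julia
using LinearAlgebra

# Spectral-norm sin Θ distance between equal-dimensional column spaces of
# orthonormal bases V and W: ‖V V' − W W'‖₂ = sin(largest principal angle).
function subspace_dist(V::AbstractMatrix, W::AbstractMatrix)
    σ = svdvals(V' * W)                 # cosines of the principal angles
    return sqrt(max(0.0, 1 - minimum(σ)^2))  # guard against round-off
end
```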
Remark 2 (Extension for singular subspaces). Our method is easily adapted for computing the leading singular subspace of a rectangular matrix $A \in \mathbb{R}^{m \times n}$. Via the standard dilation trick [22, 61], we can form

$S = \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix} \;\Rightarrow\; \lambda_i(S) = \pm \sigma_i(A),$

and given the (thin) SVD of $A$, $A = U \Sigma V^\top$, it is easy to verify that the $i$th eigenvector of $S$ is the concatenation of the $i$th left and right singular vectors, $u_i$ and $v_i$ (up to scaling). Alternatively, if one is interested only in the left or right singular subspace, it is also possible to run Algorithm 3.1 on $A^\top A$ or $A A^\top$ implicitly, without ever forming the complete matrix, at the expense of standard conditioning consequences.

Below, we describe how to efficiently implement each of the aforementioned steps, addressing the fact that our estimates at each step are inexact. We discuss $d_t$ last, focusing on the updates in lines 7 and 8 of Algorithm 3.1 first.

3.2.1. Updating the subspace estimate. Let us first assume that $d_t$ from step 5 has already been computed. Given the previous subspace estimate $\hat{V}^{(t-1)} \in \mathbb{R}^{n \times r}$, we can seed our eigensolver (Algorithm 2.1 or Algorithm 2.2) with $\hat{V}^{(t-1)}$. Assume that

$\big| \hat{\lambda}_r(A^{(t-1)}) - \lambda_r(A^{(t-1)}) \big| \leq \varepsilon_1, \qquad \big| \hat{\lambda}_{r+1}(A^{(t-1)}) - \lambda_{r+1}(A^{(t-1)}) \big| \leq \varepsilon_2.$

Applying Weyl's inequality (Lemma A.7), the ratio controlling convergence will be at least

(3.5)  $\frac{\lambda_r(A^{(t)})}{\lambda_{r+1}(A^{(t)})} \geq \frac{\lambda_r(A^{(t-1)}) - \|E^{(t)}\|_2}{\lambda_{r+1}(A^{(t-1)}) + \|E^{(t)}\|_2} \geq \frac{\hat{\lambda}_r - \|E^{(t)}\|_2 - \varepsilon_1}{\hat{\lambda}_{r+1} + \|E^{(t)}\|_2 + \varepsilon_2} =: \rho_t,$

and Theorems 2.1 and 2.2 say that we have to set (with $\Delta_t := d_t / \sqrt{1 - d_t^2}$)

(3.6)  $k_{\max} \geq \begin{cases} \frac{\log(\Delta_t/\varepsilon)}{\log(\rho_t)} & \text{for Algorithm 2.1} \\ \frac{\log(\Delta_t/\varepsilon)}{\sqrt{\rho_t - 1}} & \text{for Algorithm 2.2} \end{cases}$

in order to guarantee $\operatorname{dist}(V^{(t)}, \hat{V}^{(t)}) \leq \varepsilon$.

In order to set $k_{\max}$, we need to know the error bounds $\varepsilon_1$ and $\varepsilon_2$ on the accuracy of the previously computed eigenvalues $\hat{\lambda}_r$ and $\hat{\lambda}_{r+1}$. For $\hat{\lambda}_r$, we can apply Theorem A.6 to the estimate $\hat{V}^{(t-1)}$, which gives us

(3.7)  $\big| \hat{\lambda}_r(A^{(t-1)}) - \lambda_r(A^{(t-1)}) \big| \leq \rho(A^{(t-1)}) \big\| \sin \Theta(\hat{V}^{(t-1)}, V^{(t-1)}) \big\|_2^2 \leq \rho(A^{(t-1)}) \varepsilon^2,$

where $\rho(\cdot)$ denotes the spectral radius. For the higher-order eigenvalue estimate $\hat{\lambda}_{r+1}$, we can deduce a similar worst-case bound (see Subsection 3.2.2 for details):

(3.8)  $\big| \hat{\lambda}_{r+1}(A^{(t-1)}) - \lambda_{r+1}(A^{(t-1)}) \big| \leq 2 \rho(A^{(t-1)}) \varepsilon^2.$

In (3.8), the error factor is due to the combined inexactness of the "deflated" matrix and the estimate returned from the iterative method of choice.
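Under the assumptions above, evaluating the bound (3.6) is a handful of scalar operations once an upper bound on $\|E^{(t)}\|_2$ is available. A sketch (ours), assuming $\rho_t > 1$:

```julia
# Iteration bounds from (3.5)-(3.6). lam_r, lam_r1 are the previous Ritz
# values for λ_r and λ_{r+1}; normE bounds ‖E⁽ᵗ⁾‖₂; eps1, eps2 are the
# eigenvalue accuracies; d bounds the seed's subspace distance; ε is the target.
function iteration_bounds(lam_r, lam_r1, normE, eps1, eps2, d, ε)
    ρ = (lam_r - normE - eps1) / (lam_r1 + normE + eps2)  # pessimistic ρ_t
    Δ = d / sqrt(1 - d^2)
    k_sub = ceil(Int, log(Δ / ε) / log(ρ))        # subspace iteration (Alg. 2.1)
    k_bk  = ceil(Int, log(Δ / ε) / sqrt(ρ - 1))   # block Krylov (Alg. 2.2)
    return k_sub, k_bk
end
```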
3.2.2. Updating higher-order eigenvalues. The best way of updating the higher-order eigenvalue estimates $\hat{\lambda}_{r+1}, \ldots, \hat{\lambda}_{r+q}$ is not obvious. One possible approach is to "augment" the tracked subspace updated at Line 7 to have dimension $r + q$. However, because iterative eigenvalue algorithms use some orthogonalization scheme (e.g., QR factorization) that scales quadratically in the subspace dimension [22], we would incur a cost proportional to (at least) $(r + q)^2$.

A more careful approach is a 2-phase algorithm, where we lock the $r$-dimensional estimate $\hat{V}^{(t)}$ and apply an iterative method to the "deflated" matrix $(I - \hat{V}^{(t)} (\hat{V}^{(t)})^\top) A^{(t)} (I - \hat{V}^{(t)} (\hat{V}^{(t)})^\top)$, with potentially substantially smaller time and memory costs than would be required for maintaining an $(r + q)$-dimensional subspace. Even though it is evident that deflating $A^{(t)}$ this way must incur some loss of accuracy, by Lemma 3.3 this loss is negligible when $\varepsilon^2 \leq \rho(A^{(t)})^{-1}$.
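The deflated matrix never needs to be formed: each matvec costs one multiplication with $A^{(t)}$ plus $O(nr)$ work for the projections. A minimal matrix-free sketch in Julia (ours):

```julia
# Matrix-free matvec with the deflated operator (I − Vh*Vh')*A*(I − Vh*Vh'),
# where Vh (the locked estimate V̂⁽ᵗ⁾) has orthonormal columns.
function deflated_matvec(A, Vh::AbstractMatrix, x::AbstractVector)
    y = x - Vh * (Vh' * x)     # project out span(Vh)
    y = A * y                  # one multiplication with A
    return y - Vh * (Vh' * y)  # project the result again
end
```

This is also how the orthogonalization "constraints" of LOBPCG, used in Line 8 of Algorithm 3.1 (see Section 4), behave in effect.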
Lemma 3.3. Consider a matrix $A = [V \; V_\perp] \begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda_\perp \end{bmatrix} [V \; V_\perp]^\top$ and an estimate $\hat{V} \in \mathbb{O}_{n,r}$ such that $\| V V^\top - \hat{V} \hat{V}^\top \|_2 \leq \varepsilon$. Then

(3.9)  $\big| \lambda_{r+i}(A) - \lambda_i\big( (I - \hat{V} \hat{V}^\top) A (I - \hat{V} \hat{V}^\top) \big) \big| \leq \rho(A) \varepsilon^2.$

Proof. If $V$ contains a basis for the leading $r$-dimensional eigenspace, then it is obvious that $I - V V^\top$ is the projector onto the trailing $(n - r)$-dimensional eigenspace, which corresponds to a contiguous set of eigenvalues of $A$. By elementary arguments, we know that

$\lambda_{r+i}(A) = \lambda_i\big( \underbrace{(I - V V^\top) A (I - V V^\top)}_{=: \tilde{A}} \big).$

Therefore, by Theorem A.6,

$\max_i \big| \lambda_i(\tilde{A}) - \lambda_i\big( (I - \hat{V} \hat{V}^\top) A (I - \hat{V} \hat{V}^\top) \big) \big| \leq \rho(A) \| V V^\top - \hat{V} \hat{V}^\top \|_2^2 \leq \rho(A) \varepsilon^2,$

since $(I - \hat{V} \hat{V}^\top)$ is also a projector onto an $(n - r)$-dimensional subspace.

We conclude the following accuracy estimate (dropping the $(t)$ superscript for brevity):

(3.10)  $\big| \lambda_{r+i}(A) - \hat{\lambda}_{r+i}(A) \big| \leq \big| \lambda_{r+i}(A) - \lambda_i\big( (I - \hat{V} \hat{V}^\top) A (I - \hat{V} \hat{V}^\top) \big) \big| + \big| \lambda_i\big( (I - \hat{V} \hat{V}^\top) A (I - \hat{V} \hat{V}^\top) \big) - \hat{\lambda}_i\big( (I - \hat{V} \hat{V}^\top) A (I - \hat{V} \hat{V}^\top) \big) \big| \leq 2 \rho(A) \varepsilon^2,$

with the first $\rho(A) \varepsilon^2$ factor coming from an application of Lemma 3.3 and the second such factor coming from an application of Theorem A.6 when applying our iterative method with accuracy parameter $\varepsilon$ to the deflated matrix. This provides the inequality in (3.8).

When computing higher-order eigenvalues, we might not have a previously maintained estimate of the corresponding subspace. In this situation, it is common to pick a random Gaussian matrix as the seed matrix. The two following propositions provide guarantees for Algorithms 2.1 and 2.2 under this initialization scheme.

Proposition 3.4 (Corollary of Theorem 5.8 in [25]).
Let $A = V \Lambda V^\top$ be a symmetric matrix, and let $0 \leq p \leq \ell - k$. For a given $\delta \in (0, 1)$, define

$C_\delta := \frac{e \sqrt{\ell}}{p + 1} \left( \frac{2}{\delta} \right)^{\frac{1}{p+1}} \left[ \sqrt{n - \ell + p} + \sqrt{\ell} + \sqrt{2 \log\!\left(\frac{2}{\delta}\right)} \right].$

Then, with probability at least $1 - \delta$, the Ritz values $\{\mu_j\}$ returned by Algorithm 2.1 initialized with a random Gaussian matrix $(V_0)_{ij} \sim N(0, 1)$ satisfy

$\lambda_j \geq \mu_j \geq \frac{\lambda_j}{\sqrt{1 + C_\delta^2 \left( \frac{\lambda_{\ell - p + 1}}{\lambda_j} \right)^{2 k_{\max} + 2}}}, \quad \forall j = 1, \ldots, k.$

Proposition 3.5 (Theorem III.4 in [69]).
Let $A = V \Lambda V^\top$ be a symmetric matrix, and let $b = r + p$ in Algorithm 2.2 initialized with $(V_0)_{ij} \sim N(0, 1)$. Then, for $j = 1, \ldots, r$, the Ritz values $\{\mu_j\}$ returned satisfy

$\lambda_j \geq \mu_j \geq \frac{\lambda_j}{\sqrt{1 + C^2 \, T_{k_{\max}+1}\!\left( 1 + \frac{\lambda_j - \lambda_{j+p+1}}{\lambda_{j+p+1}} \right)^{-2}}},$

where $T_d$ is the $d$th degree Chebyshev polynomial and $C$ is a constant depending on the initialization matrix.

Let us incorporate Assumption 2 into Propositions 3.4 and 3.5. For the latter, it is known that Chebyshev polynomials satisfy $T_d(1 + \alpha) \geq \frac{1}{2} e^{d \sqrt{\alpha}}$ when $\alpha \ll 1$. To reach $\varepsilon$-accuracy under Gaussian initialization, we can set

(3.11)  $k_{\max} \geq \begin{cases} \frac{\log(C_1/\varepsilon)}{\log(1 + \gamma)} & \text{for Algorithm 2.1} \\ \frac{\log(C_2/\varepsilon)}{\sqrt{\gamma}} & \text{for Algorithm 2.2}, \end{cases}$

where $C_1, C_2$ are the constants from Propositions 3.4 and 3.5.

3.2.3. Estimating the subspace distance. We still need a good bound $d_t$ on the subspace distance in (3.4). Evaluating $\| (A^{(t)} - A^{(t-1)}) V \|_2$ exactly is not possible, since we only maintain approximations of $V$. A standard majorizer, often too loose in practice, is $\| A^{(t)} - A^{(t-1)} \|_2$. We can get a better upper bound by leveraging a priori information about the perturbation, such as sparsity or randomness; two such instances are handled in Section 4.

We can still get around this issue in the absence of prior information about $E^{(t)} = A^{(t)} - A^{(t-1)}$. Let us first assume that we have computed an $\varepsilon$-close subspace estimate $\hat{V}^{(t)}$ (i.e., $\operatorname{dist}(V^{(t)}, \hat{V}^{(t)}) \leq \varepsilon$). Given this information, we can compute an upper bound on the subspace distance at time $t + 1$ in terms of $\hat{V}^{(t)}$ and the perturbation $E^{(t)}$, as Lemma 3.6 shows. Again, we drop the subscripts for simplicity.
Lemma 3.6. Let $E$ be a perturbation to a matrix $A = [V \; V_\perp] \Lambda [V \; V_\perp]^\top$ and $\hat{V}$ be a subspace estimate such that $\| V V^\top - \hat{V} \hat{V}^\top \|_2 \leq \varepsilon$. Then

(3.12)  $\| E V \|_2 \leq \sqrt{ \| E \hat{V} \|_2^2 + \varepsilon \| E \|_2^2 }.$
Proof. Using the triangle inequality for the first inequality and our assumption on $V, \hat{V}$ for the second, we obtain

$\| E V \|_2 = \sqrt{\| E V V^\top E^\top \|_2} \leq \sqrt{ \| E (V V^\top - \hat{V} \hat{V}^\top) E^\top \|_2 + \| E \hat{V} \hat{V}^\top E^\top \|_2 } \leq \sqrt{ \varepsilon \| E \|_2^2 + \| E \hat{V} \|_2^2 }.$

This provides us with a proxy for the numerator of (3.3), with the estimate for the eigenvalue gap following from the discussion in Subsections 3.2.1 and 3.2.2. From (3.7) and (3.8), we know that the approximate eigengap $\hat{\delta}_r := \hat{\lambda}_r - \hat{\lambda}_{r+1}$, with $\hat{\lambda}_r$ computed from $V^\top A V$ and $\hat{\lambda}_{r+1}$ from $(I - V V^\top) A (I - V V^\top)$, satisfies

$\big| \hat{\delta}_r - \delta_r \big| \leq \varepsilon_1 + \varepsilon_2 \leq 3 \rho(A) \varepsilon^2.$

Putting everything together, we arrive at Corollary 3.7, whose proof follows immediately after applying Lemma 3.6 to bound the numerator in Theorem 3.2.
Corollary 3.7.
Given a sequence of updates satisfying Assumption 1, Algorithm 3.1 maintains an estimate $\hat{V}^{(t)}$ satisfying $\operatorname{dist}(\hat{V}^{(t)}, V^{(t)}) \leq \varepsilon$. If $\|E^{(t)}\|_2 < \frac{1}{2} \big| \lambda_r(A^{(t-1)}) - \lambda_{r+1}(A^{(t-1)}) \big|$, Line 7 of Algorithm 3.1 takes at most $k_{\max}$ iterations, with

(3.13)  $k_{\max} \geq \begin{cases} \frac{\log(\Delta_t/\varepsilon)}{\log(\rho_t)} & \text{for Algorithm 2.1} \\ \frac{\log(\Delta_t/\varepsilon)}{\sqrt{\rho_t - 1}} & \text{for Algorithm 2.2}, \end{cases}$

where $\Delta_t := \frac{d_t}{\sqrt{1 - d_t^2}}$ and $\{d_t, \rho_t\}$ are given by

(3.14)  $d_t := \frac{2 \sqrt{ \varepsilon \|E^{(t)}\|_2^2 + \| E^{(t)} \hat{V}^{(t-1)} \|_2^2 }}{\hat{\lambda}_r(A^{(t-1)}) - \hat{\lambda}_{r+1}(A^{(t-1)}) - 3 \rho(A^{(t-1)}) \varepsilon^2}, \qquad \rho_t := \frac{\hat{\lambda}_r(A^{(t-1)}) - \|E^{(t)}\|_2 - \rho(A^{(t-1)}) \varepsilon^2}{\hat{\lambda}_{r+1}(A^{(t-1)}) + \|E^{(t)}\|_2 + 2 \rho(A^{(t-1)}) \varepsilon^2}.$

Moreover, under Assumption 2, Line 8 takes at most $k'_{\max}$ iterations, with

(3.15)  $k'_{\max} \geq \begin{cases} \frac{\log(C_1/\varepsilon)}{\log(1 + \gamma)} & \text{for Algorithm 2.1} \\ \frac{\log(C_2/\varepsilon)}{\sqrt{\gamma}} & \text{for Algorithm 2.2}, \end{cases}$

where $C_1, C_2$ are the constants in Propositions 3.4 and 3.5.

Remark 3. Though the factor $\rho(A)$ above seems to introduce a non-negligible loss, several applications of interest satisfy $\rho(A) = O(\varepsilon^{-1})$. For example, spectral clustering with the normalized adjacency matrix satisfies $\rho(A) \leq 1$. On the numerical side, if we cannot assume that $\rho(A) = O(\varepsilon^{-1})$, it is sometimes possible to estimate $\rho(A^{(t)}) \leq \hat{\lambda}_1(\hat{A}^{(t-1)}) + \|E^{(t)}\|_2$ using Lemma A.7, as in the case of positive semi-definite matrices. This task is straightforward in all applications considered in Section 4.

If the assumption of the Davis–Kahan theorem fails or $d_t \geq 1$, it is straightforward to check that seeding with a Gaussian matrix results in the bound of (3.15) for Line 7, with $\rho_t$ instead of $1 + \gamma$ if $\rho_t > 1$.

3.2.4. Updating the subspace dimension. As discussed in Subsection 1.1, it is often the case that the dimension $r$ of the subspace of interest is either not known a priori or changing over time. In spectral clustering, a common heuristic to determine the "correct" number of clusters is to compute a large extremal eigenspace of the normalized adjacency matrix and observe the ratios of successive eigenvalues [65].

We can extend this criterion to enable tracking the dimension over time, as approximating higher-order eigenvalues given an $r$-dimensional invariant subspace is feasible via the procedure described in Lemma 3.3. Assume that estimates $\hat{\lambda}_1, \ldots, \hat{\lambda}_r, \ldots, \hat{\lambda}_{r+q}$ are available. We propose setting the new candidate size $r_c$ as follows:

(3.16)  $r_c := \operatorname*{argmin}_{1 \leq i \leq r+q-1} \left| \frac{\hat{\lambda}_{i+1}}{\hat{\lambda}_i} \right|.$

In other words, $r_c$ is set to the index of the smallest ratio of successive eigenvalues. In practice, we commit to a new size $r_c \neq r$ only after $T_0$ rounds of $\hat{\lambda}_{r_c+1}/\hat{\lambda}_{r_c}$ being the dominant eigenvalue ratio, to account for short-lived fluctuations.
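The criterion (3.16) is cheap to evaluate. A small Julia helper (ours) that also implements the "hold for several rounds" debouncing; the persistence threshold `T0` below is a placeholder value, since the paper does not pin it down:

```julia
mutable struct SizeTracker
    r::Int          # current committed subspace size
    candidate::Int  # most recent candidate r_c
    streak::Int     # consecutive rounds the candidate has persisted
end

# Candidate size from (3.16): index of the smallest ratio of successive
# eigenvalue estimates λ̂₁ ≥ ... ≥ λ̂_{r+q}, passed in descending order.
candidate_size(lams::Vector{Float64}) =
    argmin([abs(lams[i+1] / lams[i]) for i in 1:length(lams)-1])

# Commit to a new size only after it has been the candidate T0 rounds in a row.
function update_size!(state::SizeTracker, lams::Vector{Float64}; T0::Int=3)
    rc = candidate_size(lams)
    state.streak = (rc == state.candidate) ? state.streak + 1 : 1
    state.candidate = rc
    if rc != state.r && state.streak >= T0
        state.r = rc
    end
    return state.r
end
```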
4. Applications.
In this section, we illustrate the effectiveness of our incremental subspace estimation framework with two applications. The first application is tracking the leading subspace of the normalized adjacency matrix of a graph evolving in time, which enables clustering the underlying graph incrementally. In this case, the changes are sparse, since only a small fraction of the edges of the graph changes over time. The data matrices are also sparse, since the graphs involved have small vertex degrees, so matrix-vector multiplication is efficient. The second application is Principal Component Analysis (PCA), with deterministic or random low-rank updates. In this case, the data matrices are dense, but in some cases, we can use structure (such as Toeplitz/Hankel structure) to speed up matrix-vector multiplication. For both applications, we outline domain-specific properties which simplify some of the involved computations and can further improve the performance of our adaptive procedure.

Numerical experiments for Algorithm 2.2 use the locally optimal block preconditioned conjugate gradient method (LOBPCG) [32], which is the only available block Krylov implementation in Julia that allows seeding the subspace. LOBPCG also provides built-in support for orthogonalization "constraints", which are used in Line 8 of Algorithm 3.1 to implement matrix-vector multiplication with the deflated matrix. We do not use a preconditioner when calling LOBPCG. Finally, all of the code is freely available from https://github.com/VHarisop/inc-spectral-embeddings.

4.1. Spectral embeddings in graphs. A common problem in analyzing graph data is finding its clusters, communities, or modules. A standard approach to this problem is spectral clustering [43, 65]. This procedure applies the k-means algorithm to the rows of $V \in \mathbb{R}^{n \times r}$, where $V$ is the invariant subspace of the normalized adjacency matrix of the graph corresponding to the $r$ largest eigenvalues. There is a fundamental computational problem for spectral clustering when the graph is changing over time: if the graph is large enough, it simply becomes too expensive to compute the embedding from scratch every time. However, we often observe individual "small" (low-rank or small norm) and sparse perturbations to the graph, such as the addition of an edge. Therefore, our incremental spectral estimation framework is well-suited for this task.

For a bit more notation, we denote a general undirected graph by $G = (\mathcal{V}, \mathcal{E})$ and let $n := |\mathcal{V}|$ and $m := |\mathcal{E}|$. We write $A$ for the adjacency matrix of $G$. The degree matrix $D = \operatorname{diag}(A \mathbb{1})$, where $\mathbb{1}$ is the vector of all ones, is the diagonal matrix whose diagonal entries contain the degrees of each node. The symmetrically normalized adjacency matrix is then $\tilde{A} = D^{-1/2} A D^{-1/2}$; see the sketch below for its construction in code. Again, basic spectral clustering computes a principal subspace of $\tilde{A}$ and runs k-means on this spectral embedding to cluster the nodes.

In typical data, the adjacency matrix is sparse, and in this case, it is possible to deduce a priori bounds which can aid us in the application of the Davis–Kahan theorem as well as in the estimation of the convergence rate of the iterative methods. To formalize this, assume that the adjacency matrix $A$ is perturbed by a symmetric matrix $E \in \{0, \pm 1\}^{n \times n}$ such that the number of modifications at each vertex does not exceed the number of its currently incident edges. In that case it is possible to bound the norm of the perturbation to the normalized adjacency matrix, as Proposition 4.1 below makes precise.
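A minimal sketch (ours) of forming $\tilde{A}$ from a sparse adjacency matrix; it keeps everything sparse so that matvecs stay proportional to the number of edges:

```julia
using SparseArrays, LinearAlgebra

# Symmetrically normalized adjacency Ã = D^{-1/2} A D^{-1/2} for a sparse,
# symmetric adjacency matrix A (assumes no isolated vertices, so D^{-1/2} exists).
function normalized_adjacency(A::SparseMatrixCSC)
    d = vec(sum(A; dims=2))        # vertex degrees, D = diag(A * 1)
    s = 1.0 ./ sqrt.(d)
    return Diagonal(s) * A * Diagonal(s)
end
```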
Proposition 4.1. Suppose we observe a sequence of edge updates corresponding to a symmetric matrix $E \in \{0, \pm 1\}^{n \times n}$ such that $\frac{1}{d_i} \big| \sum_{j=1}^n E_{ij} \big| \leq \alpha < 1$ for $i = 1, \ldots, n$, and $E_{ii} = 0$. Let $\tilde{A}_{\mathrm{new}}$ denote the updated normalized adjacency matrix. Then

(4.1)  $\| \tilde{A}_{\mathrm{new}} - \tilde{A} \|_2 \leq \alpha \cdot \Big( \alpha + \sqrt{\min\{\kappa(D), \operatorname{rank}(E)\}} \, (1 + O(\alpha)) \Big) + O\!\left( \frac{\alpha(1 + \alpha)}{2} \right),$

where $\kappa(D) := \frac{\max_i d_i}{\min_j d_j}$.

The proof is in Appendix B.1. Proposition 4.1 provides theoretical justification for the intuition that updating vertices with few neighbors tends to have more severe effects on the spectrum; in contrast, if all affected vertices have large neighborhoods, we expect that $\alpha \ll 1$.

In this experiment, we use synthetic data generated by the stochastic block model [28]. The probabilistic model consists of a set of $n$ nodes and $k$ clusters or communities $\{C_i\}_{i=1}^k$, with each node belonging to exactly one community $C_i$. Edges are observed according to the following model:

(4.2)  $\mathbb{P}(v_i \text{ connected to } v_j) = \begin{cases} p, & v_i, v_j \in C_\ell \text{ for some } \ell \\ q, & \text{otherwise}. \end{cases}$

However, we only observe a graph and not the cluster identification. There is a rich literature on how spectral clustering methods can identify the latent communities [41, 53, 62].

Effect of eigenvalue precision. We initially examine how the looseness of our eigenvalue estimates affects (via the quantities $d_t, \rho_t$ in (3.14)) the bound on the number of iterations. Our experimental methodology is as follows.
• We first sample two graphs $G_s$ and $G_t$ from the stochastic block model with intra-cluster probability $p$ and inter-cluster probability $q = 0.1$, with 5 clusters in $G_s$ and 6 clusters in $G_t$.
• We then use a "network interpolation" method [52] to evolve $G_s$ towards $G_t$. At each step, we perform one edge addition or deletion on $G_s$. This update decreases the graph edit distance by 1 with probability $h$, and increases it with probability $1 - h$.
• Each update $E^{(t)}$ corresponds to 5 of the aforementioned edits, to be used in a single iteration of Algorithm 3.1. The eigensolver is run for the full number of iterations prescribed by the upper bounds instead of terminating at accuracy $\varepsilon$.

Figure 4.1 shows the number of iterations needed by Algorithm 3.1 as a function of the number of graph edits. Warm colors correspond to using an oracle for the eigenvalues involved, i.e., the real values of $\lambda_i$ (up to numerical accuracy) instead of the estimates $\hat{\lambda}_i$, which correspond to the cold colors. Interestingly, the difference is negligible, so we should not expect our numerical bounds to degrade significantly with "rough" eigenvalue estimates, except in challenging regimes where the ratio controlling convergence is already very close to 1.

All instances generated for this experiment interpolate between a graph with five clusters and one with six clusters, so we expect the eigengap $\lambda_{r+1} - \lambda_r$ to gradually decrease; therefore the lowest-dimensional example ($n = 400$) should be the most challenging, which is empirically verified. Smaller values of $\varepsilon$ also need fewer subspace iterations, as expected. Finally, by comparing the prescribed accuracy $\varepsilon$ with the final distance of the computed estimates, $\operatorname{dist}(V^{(t)}, \hat{V}^{(t)})$, we find that the latter is at least 2 orders of magnitude smaller, a somewhat unsurprising outcome given the pessimistic nature of most perturbation results used above.
Effect of sparse updates to spectrum. We take a further step to assess the impact of sparse updates via the lens provided by Proposition 4.1. More precisely, we generate an SBM with $n = 1000$ nodes and 5 communities, with parameters $p$ and $q$ chosen as in the previous experiment, and observe a stream of sparse edge updates. Since $p$ is moderately high, each of these updates results in $\alpha < 0.05$ in the notation of Proposition 4.1.

The performance of Algorithm 3.1 using the worst-case estimate of (4.1) to majorize the Davis–Kahan bound is illustrated in Figure 4.2. The estimate is at most an order of magnitude higher than the true perturbation norm. Moreover, the resulting bound on iterations is off by a constant factor. Therefore, in situations where we observe sparse updates to graphs with few isolated nodes, it is possible to get additional speedups by replacing $d_t$ from (3.14) with a less complicated proxy.

We next test our algorithms on real-world graph datasets: the college-msg [46] dataset of private communications on a Facebook-like college messaging platform, as well as subsets of the temporal-reddit-reply dataset [40] consisting of replies between users on the public social media platform reddit.com. All datasets are anonymized and contain timestamped edges representing the interactions. We make edges undirected and remove any duplicates during preprocessing. We isolate the largest connected component and work on the induced subgraph for each dataset, leading to the statistics shown in Table 4.1.
Figure 4.1. Upper bound from (3.13) on iterations performed for tracking $V \in \mathbb{R}^{n \times r}$ using Algorithm 2.1 without (cold colors) vs. with (warm colors) eigengap oracle, for $n \in \{400, 600, 800\}$ (x-axis: graph edits; y-axis: iterations). The resulting bounds exhibit negligible differences. Underlying graphs are SBMs with the same $(p, q)$ as in the text; the two panels correspond to two values of the tolerance $\varepsilon$.
Figure 4.2. Performance of Algorithm 3.1 for tracking $V \in \mathbb{R}^{n \times r}$ using the bound from (4.1) with $\alpha < 0.05$: perturbation norm $\|\tilde{A}_{\mathrm{new}} - \tilde{A}\|_2$ vs. the bound from Eq. (4.1) (left), and resulting iterations, run vs. upper bound (right). The perturbation bound is at most an order of magnitude off, with the resulting iteration bound just a constant multiple of the iterations elapsed to reach accuracy $\varepsilon$. Underlying graphs are SBMs with $n = 1000$.

For the reddit-* datasets, the small, medium and large variants correspond to keeping the first $N$ nodes of the raw data, for three increasing values of $N$, respectively. As before, we isolate the subgraph $G = (\mathcal{V}, \mathcal{E})$ corresponding to the largest connected component of the static version of the graph, and start from a version $G_s = (\mathcal{V}_s, \mathcal{E}_s)$ such that $\mathcal{V}_s = \mathcal{V}$, $\mathcal{E}_s \subset \mathcal{E}$, containing the $|\mathcal{E}_s|$ edges with the earliest timestamps. Then, we add edges in the order they were encountered in the original dataset.

We employ a regularized version of the normalized adjacency matrix [2, 30, 51, 70]. Specifically, given a regularization parameter $\tau$, the regularized adjacency matrix is $A_\tau := A + \frac{\tau}{n} \mathbb{1} \mathbb{1}^\top$. We set $\tau$ equal to 1 for the college-msg dataset and equal to the average degree ($\tau = \frac{1}{n} \sum_{i=1}^n D_{ii}$) for the reddit-* datasets; a matrix-free sketch of $A_\tau$ follows below.

Table 4.1. Summary statistics of temporal graphs used in numerical experiments. Datasets: college-msg, reddit-small, reddit-medium, reddit-large.
This regularization can improve spectral clustering by addressing the adverse effects of isolated or low-degree nodes. In particular, the large cluster of the leading eigenvalue (corresponding to connected components in the graph) can significantly degrade the performance of both iterative methods.

At each step, we use the previous estimate $\hat{V}$ of the leading subspace of $A := A_\tau$ to initialize subspace iteration. Moreover, if we detect a change in the dimension of the invariant subspace (using the criterion in Subsection 3.2.4), we follow the steps below to form $\hat{V}_{\mathrm{new}}$:
• if $r_{\mathrm{new}} < r$, we simply drop the eigenvectors $v_i$ corresponding to the smallest Ritz values $\theta_i = \lambda_i(\hat{V}^\top A \hat{V})$;
• if $r_{\mathrm{new}} > r$, we generate $r_{\mathrm{new}} - r$ new vectors and orthogonalize them against $\hat{V}$ using the (modified) Gram–Schmidt process [22].

Initially, we experiment with the college-msg dataset, as it is feasible to evaluate the true subspace distances and to compare with the performance of the randomly initialized variant at each step. Figure 4.3 shows the number of iterations run to attain the desired accuracy (checked using numerical stopping criteria), as well as the number of iterations $k_{\max}$ determined by (3.13), starting from $G_s$ and evolving towards the static version of the graph. As in the SBM case, we observe five edge modifications at a time. The plot reveals that our bounds are essentially sharp, since there is more than one occasion where the number of subspace iterations almost matches the upper bound (appearing as spikes).

We also observe that our warm-starting method provides substantial performance gains compared to random initialization. Figure 4.4 compares our pipeline with randomly initialized subspace iteration. We set $d_t := \operatorname{dist}(V, \hat{V})$, with $\hat{V}$ being the Q factor from the QR decomposition of a random Gaussian matrix, and report the number of iterations elapsed until achieving accuracy $\varepsilon$ starting from $\hat{V}$, as well as the bounds prescribed by Theorems 2.1 and 2.2 using $\rho := \frac{\lambda_r(A)}{\lambda_{r+1}(A)}$ (computed using Arpack). Our upper bound is usually 1–2 orders of magnitude tighter than the bound using random initialization, while far fewer iterations are needed to reach convergence when initializing $V_0$ with the previous estimate.

Next, we evaluate our method on the reddit-* datasets, as depicted in Figure 4.5. In this case, Algorithm 2.2 requires just a handful of iterations per update. Our upper bound is at most an order of magnitude loose and independent of the underlying problem dimension. We are thus able to handle large problems with $O(1)$ additional iterations per update, in contrast to random initialization, which would be expected to scale with dimension.

We also observe that the measured wall-clock times spent updating the subspace $V$ and the higher-order eigenvalues $\{\hat{\lambda}_{r+1}, \ldots\}$ are comparable. In fact, the majority of each step is spent on estimating $\|E\|_2$.
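Since $A_\tau$ is a sparse matrix plus a rank-one dense correction, it should never be formed explicitly; a matvec costs one sparse multiply plus $O(n)$ extra work. A sketch (ours):

```julia
using SparseArrays

# Matrix-free matvec with the regularized adjacency A_τ = A + (τ/n) * 1 * 1':
# A_τ * x = A * x + (τ/n) * sum(x) * ones(n), without densifying A.
function regularized_matvec(A::SparseMatrixCSC, τ::Real, x::AbstractVector)
    return A * x .+ (τ / size(A, 1)) * sum(x)
end
```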
Figure 4.3. Performance of Algorithm 3.1 using the block Krylov method (left) and subspace iteration (right) on the college-msg dataset at the prescribed accuracy $\varepsilon$ (x-axis: edge additions; y-axis: iterations run vs. the oracle and numerical bounds). Both plots indicate that the bound of (3.13) can be sharp, even though it is loose in general.
Figure 4.4. Iterations using the Davis–Kahan bound and Algorithm 2.1 (left) vs. Algorithm 2.2 (right), starting from $V_0$ (ours) vs. a random initial estimate (random) on the college-msg dataset, at the prescribed accuracy $\varepsilon$. Solid lines are iterations until numerical convergence, dashed lines are upper bounds. Naive seeding incurs orders of magnitude higher computational costs, which can even exceed the (conservative) upper bound of (3.13).

Importantly, the time elapsed per iteration appears to scale linearly as $n$ increases, which is the expected behavior given the edge density in Table 4.1 (here, edge density exactly controls the complexity of matrix-vector multiplication).

4.2. Principal Component Analysis. This section evaluates the performance of our adaptive method on Principal Component Analysis (PCA) [29]. In this setting, we have a data matrix $X \in \mathbb{R}^{n \times d}$, containing $n$ points in $d$ dimensions. For a target dimension $p$, we wish to compute $W \in \mathbb{R}^{d \times p}$ so that $W$'s columns are the top-$p$ eigenvectors of $X^\top X$. These columns then define a projection $X \mapsto X W \in \mathbb{R}^{n \times p}$, which can be helpful in exploratory data analysis, de-noising, clustering, and other tasks. First, we show how to improve our perturbation bounds under common update models that are applicable to incrementally updated PCA.
Figure 4.5. Performance of Algorithm 3.1 on the reddit-* datasets, using Algorithm 2.2 as the iterative method. Left to right: the small, medium, and large variants, with varying numbers of edge additions per update. Top row: number of iterations per update (run vs. upper bound), which appear to be practically constant and dimension-independent. Bottom row: time elapsed computing $V$ (core) and higher-order eigenvalues (high-order); both times are comparable to each other, with the total scaling linearly as $n$ increases.

4.2.1. Improved bounds for random updates. The perturbation bounds for subspaces employed in Algorithm 3.1 can be greatly simplified when the updates $E^{(t)}$ are random. Below, we consider both general Gaussian random matrices, as well as sums of outer products of Gaussian random vectors. We defer proofs of the technical results below to Appendix B.

Random Gaussian perturbations.
Suppose we are given a matrix $A := A^{(0)} \in \mathbb{R}^{n \times n}$ (we assume $A$ is square for simplicity, but the proof techniques extend to $A \in \mathbb{R}^{m \times n}$). Initially, let us assume that the perturbation matrix is i.i.d. Gaussian, formalized below:
Assumption 3. For each $t$, the perturbation matrix $E := E^{(t)}$ satisfies $(E)_{ij} \sim N(0, \frac{1}{n})$, with all the elements independent of each other.

To derive a priori bounds for the subspace distances, we use the analogue of Davis–Kahan for singular subspaces, due to Wedin [67]. Under Assumption 3, we show that the distances between singular subspaces are upper bounded and that the singular values of $A, \tilde{A}$ are only off by a factor of $\tilde{O}(n^{-1})$. Both statements hold with high probability, leading to Proposition 4.2.

Proposition 4.2.
Let Assumption 3 hold and let $A$, $\tilde{A} := A + E$ have SVDs given by $A = U \Sigma V^\top$, $\tilde{A} = \tilde{U} \tilde{\Sigma} \tilde{V}^\top$. Then, with probability at least $1 - c_1 \exp(-c_2 n) - n^{-\log n}$,

(4.3)  $\max\left\{ \| \sin \Theta(U, \tilde{U}) \|_2, \| \sin \Theta(V, \tilde{V}) \|_2 \right\} \leq \frac{1/\sqrt{n} + \sqrt{r/n}}{\sigma_r - \sigma_{r+1} - (C_A + \log n)/n},$

where $c_1, c_2$ are universal constants and $C_A$ is a constant only depending on $A$.

Low-rank random perturbations.
In a different setting, we observe updates to the covariance matrix $A = X X^\top$ in which the perturbation matrix $E$ is not i.i.d. Gaussian but obeys a low-rank structure, as described below:
Assumption 4. For each $t$, the random perturbation $E^{(t)}$ satisfies $E^{(t)} = \epsilon_t z_t z_t^\top$, where $z_t \sim N(0, I/n)$ and $\mathbb{P}(\epsilon_t = 1) = \mathbb{P}(\epsilon_t = -1) = 1/2$, with $\epsilon_t$ independent of $z_t$.

Suppose the above assumption holds and we want to retain information about an $r$-dimensional subspace, $r \ll n$, over time. When we apply the a priori bound from Theorem 3.2, assuming $\delta_r > \|E\|_2$, we are reduced to bounding $\| E^{(t)} V \|_2$ in the numerator, with $V \in \mathbb{R}^{n \times r}$. Using standard tools from concentration of measure, we can show the following:
Let Assumption 4 hold (dropping the subscript $t$ for simplicity), and suppose that $\delta_r > \epsilon$ for some $\epsilon \geq \|E\|_2$. Let $\tilde{A} = A + E$ and let $V, \tilde{V}$ correspond to the leading $r$-dimensional subspaces of $A, \tilde{A}$, respectively. Then, with probability at least $1 - \exp(-\epsilon^2 n) - n^{-1}$,

(4.4)  $\| \sin \Theta(V, \tilde{V}) \|_2 \leq \frac{\epsilon}{\delta_r} \left( \sqrt{\frac{r}{n}} + \sqrt{\frac{\log n}{n}} \right).$

This bound can be directly applied for the estimate $d_t$ in Algorithm 3.1.

We perform two experiments on synthetic data, starting from an appropriately scaled Gaussian random matrix $X \in \mathbb{R}^{n \times d}$, using the two variations of random updates described in Subsection 4.2.1. For each variation, we apply our adaptive method to keep track of a subspace of dimension at most $\lfloor \sqrt{n} \rfloor$, which is updated using (3.16). At each step, we record the true subspace distance as well as the refined a priori bounds derived in the aforementioned sections.

Figure 4.6 shows the performance of Algorithm 3.1. In terms of required iterations, our method clearly outperforms reinitializing with a random matrix. In fact, the upper bound derived from perturbation theory falls below the number of iterations required to reach the desired accuracy starting from a random guess, as in Figure 4.4. This means that even the worst-case performance of the warm-starting method can be significantly better than naive seeding in simple examples.

4.2.2. Singular Spectrum Analysis. An application of PCA is Singular Spectrum Analysis (SSA) [20, 23], which is primarily applied in time series analysis. To apply SSA, one specifies a window length $W$ that is expected to capture the essential behavior of the time series of length $N \geq W$, and performs the following steps (following [23]):
Figure 4.6. PCA on synthetic data. Top: $(n, d) = (1500, \cdot)$, following Assumption 3. Bottom: $(n, d) = (2000, \cdot)$, following Assumption 4. Left: iterations of Algorithm 3.1 to reach the target accuracy $\varepsilon$ (solid) vs. upper bounds (dashed), for warm-started ("ours") and randomly seeded ("random") runs. Right: oracle distance $\|VV^\top - \tilde{V}\tilde{V}^\top\|_2$ vs. the bounds from Propositions 4.2 and 4.3. Updating without warm-starting is in practice more expensive than the worst-case bounds of Algorithm 3.1, even when the subspace distance estimates are loose.
1. Form the trajectory matrix of the series $X$:
$$C_X := \frac{1}{\sqrt{N}} \begin{bmatrix} X_1 & X_2 & X_3 & \cdots & X_{N-W+1} \\ X_2 & X_3 & X_4 & \cdots & X_{N-W+2} \\ \vdots & \vdots & \vdots & & \vdots \\ X_W & X_{W+1} & X_{W+2} & \cdots & X_N \end{bmatrix} \in \mathbb{R}^{W \times (N - W + 1)}.$$
Observe that $C_X$ is a Hankel matrix, since its antidiagonals are constant, and it therefore admits fast matrix-vector multiplication.
2. Compute the truncated SVD $F_X$ of $C_X$ with target rank $r \ll \min(W, N - W + 1)$:
$$C_X \approx F_X = \sum_{i=1}^{r} u_i \sigma_i v_i^\top. \tag{4.5}$$

Figure 4.7.
Original ($X_t$) and reconstructed ($\hat{X}_t$) signals after the equivalent of 15 (left) and 30 (right) days of updates, using $r = 4$ components and accuracy $\varepsilon$. The low-rank reconstruction captures the essential behavior of the time series.
3. Average the antidiagonals of $F_X$, from which a smoothed time series can be extracted.

We applied SSA to the Household Power Consumption dataset, available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption). The dataset contains power consumption readings for a single household spanning 4 years, recorded at one-minute intervals. We preprocess the dataset by computing moving averages of active power over 30-minute windows with a forward step of 20 minutes, resulting in 3 data points per hour. We set the window length $W = 4032$ and chose the series length $N$ so that $C_X$ is the trajectory matrix for the most recent 6 months. Next, we applied Algorithm 3.1 to maintain the decomposition of the trajectory matrix $C_X$ over time, with $r$ equal to the value specified by the criterion in (3.16). Each update "step" moves the time series forward by 6 measurements (or 2 hours). The resulting dimension is determined as $r = 4$ in every time step, satisfying $\|\Sigma_r\|_F / \|\Sigma\|_F \geq 99\%$ throughout.

Figure 4.7 depicts the original and reconstructed signals after 15 and 30 days, respectively. The reconstructed curve captures the essential behavior of the original time series, having smoothed out fluctuations on smaller scales. Moreover, with the exception of the first step, all subsequent computations of the singular subspaces take at most 3 matrix-vector multiplications per step. On the other hand, naive seeding can take up to 20 matrix-vector multiplications per step.
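The three SSA steps and the warm-started update are straightforward to prototype. The sketch below is our own minimal illustration under stated assumptions: it operates on a one-dimensional NumPy array, the helper names (`trajectory_matrix`, `warm_started_svd`, `diagonal_average`) and the synthetic series are hypothetical, and a few sweeps of subspace iteration stand in for whichever iterative eigensolver Algorithm 3.1 actually wraps.

```python
import numpy as np

def trajectory_matrix(series, W):
    """Step 1: Hankel trajectory matrix, scaled by 1/sqrt(N)."""
    N = len(series)
    K = N - W + 1
    C = np.empty((W, K))
    for i in range(W):
        C[i, :] = series[i:i + K]
    return C / np.sqrt(N)

def warm_started_svd(C, V0, n_iter=3):
    """Step 2: rank-r truncated SVD of C, seeded with the previous right
    singular subspace V0 (K x r). A few subspace-iteration sweeps suffice
    when C changes little between updates; a cold start uses a random V0."""
    V = V0
    for _ in range(n_iter):
        U, _ = np.linalg.qr(C @ V)     # left subspace estimate
        V, _ = np.linalg.qr(C.T @ U)   # right subspace estimate
    B = U.T @ C @ V                    # small r x r core matrix
    Ub, s, Vbt = np.linalg.svd(B)
    return U @ Ub, s, V @ Vbt.T

def diagonal_average(F):
    """Step 3: average the antidiagonals of the rank-r approximation F,
    recovering a smoothed series of length W + K - 1."""
    W, K = F.shape
    out = np.zeros(W + K - 1)
    counts = np.zeros(W + K - 1)
    for i in range(W):
        for j in range(K):
            out[i + j] += F[i, j]
            counts[i + j] += 1
    return out / counts

# Hypothetical usage on a synthetic series; W and r are illustrative only.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)
W, r = 200, 4
C = trajectory_matrix(series, W)
V0 = np.linalg.qr(rng.standard_normal((C.shape[1], r)))[0]
U, s, V = warm_started_svd(C, V0, n_iter=10)   # cold start needs more sweeps
smoothed = diagonal_average((U * s) @ V.T) * np.sqrt(len(series))
```

On subsequent updates, passing the previous `V` back in as `V0` is the warm start that keeps the per-step cost to a handful of matrix-vector products.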
5. Discussion.
This work has detailed the theoretical and practical aspects of incrementally updating spectral embeddings via careful use of existing information, an oft-used methodology, albeit one previously employed heuristically. Concisely, we showed how the straightforward heuristic of warm-starting iterative eigensolvers with previously computed subspaces can dramatically decrease the amount of work required to update low-dimensional embeddings under small changes to the data matrix. This is a broadly applicable scenario, encompassing all manners of time-evolving or sequentially observed data. While we have focused on a few specific applications, there is a wider range of settings that fall into this regime, such as spectral ranking methods in network analysis [49] or latent semantic indexing in natural language processing [17].

A key insight, and an advantage of our proposed pipeline, is that worst-case bounds on invariant subspace perturbations can be efficiently computed before updates to the spectral embedding are carried out. This added information can be used both theoretically, to bound the number of "clean-up" iterations that need to be carried out, and practically, to determine whether the dimension of the subspace should change. Furthermore, the bounds simplify under additional assumptions on the incremental updates, such as sparsity or randomness. Presenting this work in a predominantly algorithm-agnostic manner makes the underlying conclusions applicable to any iterative method that can be warm-started. This broadens the applicability of our analysis and strengthens its implications.

We have focused on the dominant eigenspace, though Algorithm 3.1 carries over to the complementary setting. However, an added complexity may be the need to repeatedly solve closely related linear systems; the nature of the incremental updates may admit ways to address this concern, and we defer such challenges to future work. Additional challenges pertain to removing Assumption 2 on the spectral decay, though we consider it rather nonrestrictive. Nevertheless, in the presence of "challenging" spectra, warm-starting can help reduce the cost per update and still outperform, e.g., algebraic methods, especially when matrix-vector multiplication is cheap. Lastly, while we have considered the problem dimension to be fixed throughout, an assumption that may be violated in practice, we believe it is possible to pair this work with careful augmentation of the existing subspace to address that problem.

Acknowledgements.
This research was supported by NSF Award DMS-1830274. The authors would like to thank Yuekai Sun for his assistance in developing the proof strategy for Proposition 4.2.
REFERENCES [1]
E. Abbe, J. Fan, K. Wang, and Y. Zhong , Entrywise eigenvector analysis of random matrices withlow expected rank , arXiv preprint arXiv:1709.09565, (2017).[2]
A. A. Amini, A. Chen, P. J. Bickel, and E. Levina , Pseudo-likelihood methods for community de-tection in large sparse networks , Ann. Statist., 41 (2013), pp. 2097–2122, https://doi.org/10.1214/13-AOS1138.[3]
M. Argentati, A. Knyazev, C. Paige, and I. Panayotov , Bounds on changes in ritz values for a per-turbed invariant subspace of a hermitian matrix , SIAM Journal on Matrix Analysis and Applications,30 (2008), pp. 548–559, https://doi.org/10.1137/070684628.[4]
S. Balakrishnama and A. Ganapathiraju , Linear discriminant analysis-a brief tutorial , Institute forSignal and information Processing, 18 (1998), pp. 1–8.[5]
A. Balsubramani, S. Dasgupta, and Y. Freund , The fast convergence of incremental pca , in Advancesin Neural Information Processing Systems, 2013, pp. 3174–3182.[6]
M. Belkin and P. Niyogi , Laplacian eigenmaps for dimensionality reduction and data representation ,Neural computation, 15 (2003), pp. 1373–1396.[7]
L. Bottou, F. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60 (2018), pp. 223–311, https://doi.org/10.1137/16M1080173.[8] M. Brand, Fast low-rank modifications of the thin singular value decomposition, Linear Algebra and its Applications, 415 (2006), pp. 20–30.[9]
J. R. Bunch and C. P. Nielsen , Updating the singular value decomposition , Numerische Mathematik,31 (1978), pp. 111–129.[10]
E. J. Candes and Y. Plan , Tight oracle inequalities for low-rank matrix recovery from a minimal numberof noisy random measurements , IEEE Transactions on Information Theory, 57 (2011), pp. 2342–2359.[11]
J. Cape, M. Tang, and C. E. Priebe, The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics, arXiv e-prints, (2018), https://arxiv.org/abs/1705.10735.[12]
Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam , Algorithm 887: Cholmod, supernodalsparse cholesky factorization and update/downdate , ACM Transactions on Mathematical Software(TOMS), 35 (2008), p. 22.[13]
F. R. K. Chung , Spectral Graph Theory , no. 92 in CBMS Regional Conference Series in Mathematics,American Mathematical Society, 1997.[14]
J. Cullum and W. E. Donath , A block lanczos algorithm for computing the q algebraically largesteigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices , in 1974 IEEEConference on Decision and Control including the 13th Symposium on Adaptive Processes, 1974,pp. 505–509, https://doi.org/10.1109/CDC.1974.270490.[15]
A. Damle and Y. Sun , Uniform bounds for invariant subspace perturbations , arXiv e-prints, (2019),arXiv:1905.07865, https://arxiv.org/abs/1905.07865.[16]
C. Davis and W. M. Kahan , The rotation of eigenvectors by a perturbation. III , SIAM Journal onNumerical Analysis, 7 (1970), pp. 1–46.[17]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman , Indexing bylatent semantic analysis , Journal of the American Society for Information Science, 41 (1990), pp. 391–407.[18]
J. W. Demmel , Applied Numerical Linear Algebra , Society for Industrial and Applied Mathematics,1997.[19]
J. Eldridge, M. Belkin, and Y. Wang , Unperturbed: spectral analysis beyond davis-kahan , in Pro-ceedings of Algorithmic Learning Theory, F. Janoos, M. Mohri, and K. Sridharan, eds., vol. 83 ofProceedings of Machine Learning Research, PMLR, 2018, pp. 321–358, http://proceedings.mlr.press/v83/eldridge18a.html.[20]
J. B. Elsner and A. A. Tsonis , Singular Spectrum Analysis: a new tool in time series analysis , SpringerScience & Business Media, 2013.[21]
J. Fan, K. Wang, Y. Zhong, and Z. Zhu , Robust high dimensional factor models with applications tostatistical machine learning , arXiv preprint arXiv:1808.03889, (2018).[22]
G. H. Golub and C. F. Van Loan , Matrix Computations , Johns Hopkins University Press, 4th ed. ed.,2013.[23]
N. Golyandina and A. Zhigljavsky , Singular Spectrum Analysis for time series , Springer Science &Business Media, 2013.[24]
Y. Gordon , Some inequalities for gaussian processes and applications , Israel Journal of Mathematics, 50(1985), pp. 265–289, https://doi.org/10.1007/BF02759761.[25]
M. Gu , Subspace iteration randomization and singular value problems , SIAM Journal on Scientific Com-puting, 37 (2015), pp. A1139–A1173.[26]
M. Gu and S. C. Eisenstat , Efficient algorithms for updating a strong rank-revealing qr factorization ,SIAM Journal on Scientific Computing, (1996), https://doi.org/10.1137/0917055.[27]
N. Halko, P.-G. Martinsson, and J. A. Tropp , Finding structure with randomness: Probabilisticalgorithms for constructing approximate matrix decompositions , SIAM review, 53 (2011), pp. 217–288.[28]
P. W. Holland, K. B. Laskey, and S. Leinhardt , Stochastic blockmodels: First steps , Social networks,5 (1983), pp. 109–137.[29]
I. Jolliffe , Principal Component Analysis , Springer, 2nd ed., 2002.[30]
A. Joseph and B. Yu , Impact of regularization on spectral clustering , The Annals of Statistics, 44 (2016),pp. 1765–1791.[31]
M. E. Kilmer and E. De Sturler, Recycling subspace information for diffuse optical tomography, SIAM Journal on Scientific Computing, 27 (2006), pp. 2140–2166.[32] A. V. Knyazev, Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method, SIAM Journal on Scientific Computing, 23 (2001), pp. 517–541.[33]
R. Kumar, J. Novak, and A. Tomkins , Structure and evolution of online social networks , in LinkMining: Models, Algorithms, and Applications, Springer New York, 2010, pp. 337–357, https://doi.org/10.1007/978-1-4419-6515-8 13.[34]
B. Laurent and P. Massart , Adaptive estimation of a quadratic functional by model selection , Annalsof Statistics, (2000), pp. 1302–1338.[35]
J. Leskovec, L. A. Adamic, and B. A. Huberman , The dynamics of viral marketing , ACM Transac-tions on the Web (TWEB), 1 (2007), p. 5.[36]
J. Leskovec, A. Rajaraman, and J. D. Ullman , Mining of massive datasets , Cambridge universitypress, 2014.[37]
H. Li, H. Jiang, R. Barrio, X. Liao, L. Cheng, and F. Su , Incremental manifold learning by spectralembedding methods , Pattern Recognition Letters, 32 (2011), pp. 1447 – 1455, https://doi.org/10.1016/j.patrec.2011.04.004.[38]
R.-C. Li , Relative perturbation theory: II. eigenspace and singular subspace variations , SIAM Journal onMatrix Analysis and Applications, 20 (1998), pp. 471–492.[39]
R.-C. Li and L.-H. Zhang , Convergence of the block lanczos method for eigenvalue clusters , NumerischeMathematik, 131 (2015), pp. 83–113, https://doi.org/10.1007/s00211-014-0681-6.[40]
P. Liu, A. R. Benson, and M. Charikar , Sampling methods for counting temporal motifs , in Proceed-ings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19,New York, NY, USA, 2019, ACM, pp. 294–302, http://doi.acm.org/10.1145/3289600.3290988.[41]
F. McSherry , Spectral partitioning of random graphs , in Proceedings 42nd IEEE Symposium on Foun-dations of Computer Science, IEEE, 2001, https://doi.org/10.1109/sfcs.2001.959929.[42]
C. Musco and C. Musco , Randomized block krylov methods for stronger and faster approximate singularvalue decomposition , in Advances in Neural Information Processing Systems 28, C. Cortes, N. D.Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds., 2015, pp. 1396–1404.[43]
A. Y. Ng, M. I. Jordan, and Y. Weiss , On spectral clustering: Analysis and an algorithm , in Advancesin neural information processing systems, 2002, pp. 849–856.[44]
H. Ning, W. Xu, Y. Chi, Y. Gong, and T. Huang , Incremental spectral clustering with application tomonitoring of evolving blog communities , in Proceedings of the 2007 SIAM International Conferenceon Data Mining, SIAM, 2007, pp. 261–272.[45]
A. ¨Ozg¨ur, T. Vu, G. Erkan, and D. R. Radev , Identifying gene-disease associations using centralityon a literature mined gene-interaction network , Bioinformatics, 24 (2008), pp. i277–i285.[46]
P. Panzarasa, T. Opsahl, and K. M. Carley , Patterns and dynamics of users’ behavior and inter-action: Network analysis of an online community , Journal of the American Society for InformationScience and Technology, 60 (2009), pp. 911–932.[47]
M. L. Parks, E. De Sturler, G. Mackey, D. D. Johnson, and S. Maiti , Recycling krylov subspacesfor sequences of linear systems , SIAM Journal on Scientific Computing, 28 (2006), pp. 1651–1674.[48]
B. Parlett , The Symmetric Eigenvalue Problem , Classics in Applied Mathematics, Society for Industrialand Applied Mathematics, 1998, https://doi.org/10.1137/1.9781611971163.[49]
N. Perra and S. Fortunato , Spectral centrality measures in complex networks , Physical Review E, 78(2008), https://doi.org/10.1103/physreve.78.036107.[50]
M. A. Porter, J.-P. Onnela, and P. J. Mucha , Communities in networks , Notices of the AMS, 56(2009), pp. 1082–1097.[51]
T. Qin and K. Rohe , Regularized spectral clustering under the degree-corrected stochastic blockmodel , inAdvances in Neural Information Processing Systems, 2013, pp. 3120–3128.[52]
T. Reeves, A. Damle, and A. R. Benson , Network interpolation , arXiv e-prints, (2019), https://arxiv.org/abs/1905.01253.[53]
K. Rohe, S. Chatterjee, and B. Yu , Spectral clustering and the high-dimensional stochastic blockmodel ,The Annals of Statistics, 39 (2011), pp. 1878–1915, https://doi.org/10.1214/11-aos887.[54]
S. T. Roweis and L. K. Saul , Nonlinear dimensionality reduction by locally linear embedding , Science,290 (2000), pp. 2323–2326.[55]
A. Ruhe, Implementation aspects of band Lanczos algorithms for computation of eigenvalues of large sparse symmetric matrices, Mathematics of Computation, 33 (1979), pp. 680–687.[56] Y. Saad, On the rates of convergence of the Lanczos and the block-Lanczos methods, SIAM Journal on Numerical Analysis, 17 (1980), pp. 687–706, https://doi.org/10.1137/0717059.[57]
Y. Saad , Numerical Methods for Large Eigenvalue Problems , Society for Industrial and Applied Mathe-matics, 2011, https://doi.org/10.1137/1.9781611970739.[58]
P. Salas, L. Giraud, Y. Saad, and S. Moreau , Spectral recycling strategies for the solution of non-linear eigenproblems in thermoacoustics , Numerical Linear Algebra with Applications, 22 (2015),pp. 1039–1058.[59]
S. E. Schaeffer , Graph clustering , Computer Science Review, 1 (2007), pp. 27–64, https://doi.org/10.1016/j.cosrev.2007.05.001.[60]
G. W. Stewart , Accelerating the orthogonal iteration for the eigenvectors of a hermitian matrix , Nu-merische Mathematik, 13 (1969), pp. 362–376, https://doi.org/10.1007/BF02165413.[61]
G. W. Stewart and J.-g. Sun , Matrix Perturbation Theory , Academic Press, 1990.[62]
D. L. Sussman, M. Tang, D. E. Fishkind, and C. E. Priebe , A consistent adjacency spectral em-bedding for stochastic blockmodel graphs , Journal of the American Statistical Association, 107 (2012),pp. 1119–1128, https://doi.org/10.1080/01621459.2012.699795.[63]
R. Vershynin , High-dimensional probability: An introduction with applications in data science , vol. 47,Cambridge University Press, 2018.[64]
B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi , On the evolution of user interaction infacebook , in Proceedings of the 2nd ACM workshop on Online social networks, ACM Press, 2009,https://doi.org/10.1145/1592665.1592675.[65]
U. Von Luxburg , A tutorial on spectral clustering , Statistics and computing, 17 (2007), pp. 395–416.[66]
S. Wang, Z. Zhang, and T. Zhang , Improved Analyses of the Randomized Power Method and BlockLanczos Method , arXiv e-prints, (2015), https://arxiv.org/abs/1508.06429.[67]
P.-˚A. Wedin , Perturbation bounds in connection with singular value decomposition , BIT Numerical Math-ematics, 12 (1972), pp. 99–111, https://doi.org/10.1007/BF01932678.[68]
D. P. Woodruff, Sketching as a tool for numerical linear algebra, Foundations and Trends in Theoretical Computer Science, 10 (2014), pp. 1–157.[69] Q. Yuan, M. Gu, and B. Li, Superlinear convergence of randomized block Lanczos algorithm, in 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 1404–1409.[70]
Y. Zhang and K. Rohe, Understanding regularized spectral clustering via graph conductance, in Advances in Neural Information Processing Systems, 2018, pp. 10631–10640.

Appendix A. Auxiliary results.

A.1. Concentration of measure. Let us recall some results and definitions which are used in later proofs. The following lemma is very useful for bounding the norm of Gaussian random vectors. It appears as part of Lemma 1 in [34]:
Lemma A.1.
Let $(Y_1, \ldots, Y_D)$ be i.i.d. standard Gaussian variables. Let $\alpha = (a_1, \ldots, a_D)$ with $a_i \geq 0$, and define $Z := \sum_{i=1}^{D} a_i (Y_i^2 - 1)$. Then, for all $x > 0$, we have
$$\mathbb{P}\left( Z \geq 2 \|\alpha\|_2 \sqrt{x} + 2 \left(\max_i a_i\right) x \right) \leq \exp(-x). \tag{A.1}$$
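A worked specialization, stated here for convenience (only the choice of weights is taken from the later proofs): with $D = n$ and uniform weights $a_i = 1/n$, we have $\|\alpha\|_2 = 1/\sqrt{n}$ and $\max_i a_i = 1/n$, so for $z \sim \mathcal{N}(0, \tfrac{1}{n} I_n)$, whose squared norm is $\|z\|_2^2 = \tfrac{1}{n}\sum_{i=1}^n Y_i^2$,
$$\mathbb{P}\left( \|z\|_2^2 \geq 1 + 2\sqrt{\frac{x}{n}} + \frac{2x}{n} \right) \leq \exp(-x),$$
which is the form in which the lemma is applied to the rank-one updates of Assumption 4.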
When dealing with Gaussian processes, it is common to encounter the concept of Gaussian width:

Definition A.2. Let $T \subset \mathbb{R}^n$ be a bounded set and $g \sim \mathcal{N}(0, I_n)$ a standard Gaussian random vector. The Gaussian width of $T$ is defined as
$$W(T) := \mathbb{E}\left[ \sup_{x \in T} \langle g, x \rangle \right]. \tag{A.2}$$
One of several bounds on the Gaussian width of a set is outlined below. It is a straightforward consequence of [63, Exercise 7.6.1 & Lemma 7.6.3]:
Lemma A.3.
Consider $T \subset \mathbb{R}^n$ and let $\mathrm{diam}(T) := \sup_{x \in T} \|x\|_2$ and $\dim(T)$ denote its algebraic dimension. Then
$$W(T) \leq \mathrm{diam}(T) \sqrt{\dim(T)}. \tag{A.3}$$
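As a concrete instance, spelled out here for reference (the set below is the one appearing in the proofs of Lemmas B.1 and B.4): for an orthonormal $V \in \mathbb{R}^{n \times r}$, the spherical slice $T = S^{n-1} \cap \mathrm{ran}(V)$ has $\mathrm{diam}(T) \leq 1$ and $\dim(T) = r$, so
$$W\left( S^{n-1} \cap \mathrm{ran}(V) \right) \leq \sqrt{r}.$$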
Finally, it is known that Lipschitz functions of Gaussian variables concentrate well around their mean. The following theorem is standard; see, e.g., [63, Chapter 5].

Theorem A.4. Let $f: \mathbb{R}^n \to \mathbb{R}$ be an $L$-Lipschitz function, i.e., $|f(x) - f(y)| \leq L \|x - y\|_2$ for all $x, y \in \mathbb{R}^n$. Then, if $X = (X_1, \ldots, X_n) \sim \mathcal{N}(0, I_n)$, it holds that
$$\mathbb{P}\left( f(X) - \mathbb{E}[f(X)] \geq t \right) \leq \exp\left( -\frac{t^2}{2L^2} \right). \tag{A.4}$$
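To see the form in which this theorem is used later (the Lipschitz constant below is the one computed in the proof of Lemma B.1), suppose $f$ is $(1/n)$-Lipschitz with respect to the Frobenius norm of a standard Gaussian matrix $G$; then (A.4) reads
$$\mathbb{P}\left( f(G) \geq \mathbb{E}[f(G)] + t \right) \leq \exp\left( -\frac{(nt)^2}{2} \right),$$
so the choice $t = n^{-1/2}$ already yields a failure probability of $\exp(-n/2)$.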
A.2. Linear Algebra.

Theorem A.5 (Wedin's $\sin\Theta$ theorem).
Let $A, \tilde{A} \in \mathbb{R}^{m \times n}$ with SVDs given by $A = U \Sigma V^\top$, $\tilde{A} = \tilde{U} \tilde{\Sigma} \tilde{V}^\top$, and let $E := A - \tilde{A}$. Suppose there exist $\alpha, \delta > 0$ such that
$$\min_{i \in [r]} \sigma_i > \alpha + \delta, \qquad \max_{j \in [n-r]} \tilde{\sigma}_{r+j} < \alpha.$$
Denote by $U, \tilde{U}, V, \tilde{V}$ the left and right singular subspaces corresponding to the top $r$ singular values. Then, for any unitarily invariant norm $\|\cdot\|$,
$$\max\left\{ \left\| \sin\Theta(U, \tilde{U}) \right\|, \left\| \sin\Theta(V, \tilde{V}) \right\| \right\} \leq \frac{\max\left\{ \|EV\|, \|E^\top U\| \right\}}{\delta}. \tag{A.5}$$

Theorem A.6 (Theorem 3.6 in [3]). Consider a Hermitian matrix $A$ and an $A$-invariant subspace $\mathcal{X}$, corresponding to a contiguous set of eigenvalues of $A$. Then, for any other subspace $\mathcal{Y}$ with $\dim(\mathcal{X}) = \dim(\mathcal{Y})$, it holds that
$$\max_i \left| \lambda_i(X^\top A X) - \lambda_i(Y^\top A Y) \right| \leq \rho(A) \max_i \left\{ \sin \theta_i(\mathcal{X}, \mathcal{Y}) \right\},$$
where $\rho(A)$ is the spectral radius of $A$ and $X, Y$ are orthonormal bases of $\mathcal{X}, \mathcal{Y}$.

Lemma A.7 (Weyl's inequality).
Let $A, \widehat{A} \in \mathbb{R}^{n \times n}$ with $\widehat{A} = A + E$. Then, for all $k$, we have
$$\left| \sigma_k(A) - \sigma_k(\widehat{A}) \right| \leq \|E\|_2.$$
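In particular (this specialization is stated here for convenience; it is the way Weyl's inequality enters the arguments for Propositions 4.2 and 4.3), the perturbed singular-value gap appearing in the denominator of Wedin-type bounds can be controlled by the unperturbed one:
$$\sigma_r(A) - \sigma_{r+1}(\widehat{A}) \;\geq\; \sigma_r(A) - \sigma_{r+1}(A) - \|E\|_2,$$
which is how denominators such as the one in (4.3) arise once the singular-value drift is bounded with high probability.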
Lemma A.8. Let $D = \mathrm{diag}(d_1, \ldots, d_n)$ and $A \in \mathbb{R}^{n \times n}$ with $A = A^\top$. Then it holds that
$$\|DAD\|_2 \leq \sqrt{\mathrm{rank}(A)}\, \|D^2 A\|_2.$$

Proof.
Notice that we can write
$$\|D^2 A\|_F = \sup_{V : \|V\|_F = 1} \langle D^2 A, V \rangle \geq \frac{\mathrm{Tr}(A D^2 A D^2)}{\|D^2 A\|_F} = \frac{\mathrm{Tr}(DAD \cdot DAD)}{\|D^2 A\|_F} = \frac{\|DAD\|_F^2}{\|D^2 A\|_F},$$
making use of the cyclic property of the trace throughout. Rearranging shows that $\|DAD\|_F \leq \|D^2 A\|_F$. From the norm equivalence $\|A\|_2 \leq \|A\|_F \leq \sqrt{\mathrm{rank}(A)}\, \|A\|_2$, we have
$$\|DAD\|_2 \leq \|DAD\|_F \leq \|D^2 A\|_F \leq \sqrt{\mathrm{rank}(D^2 A)}\, \|D^2 A\|_2,$$
with $\mathrm{rank}(D^2 A) \leq \mathrm{rank}(A)$, which completes the proof.

A.3. Miscellanea.
Lemma A.9.
Consider the set of matrices $\mathcal{X}_r = \{ X \in \mathbb{R}^{n \times n} \mid \mathrm{rank}(X) \leq r, \ \|X\|_F \leq 1 \}$. Then the covering number of $\mathcal{X}_r$ with respect to the metric $d(M_1, M_2) := \|M_1 - M_2\|_F$ satisfies
$$N(\mathcal{X}_r, d, \varepsilon) \leq \left( \frac{9}{\varepsilon} \right)^{(2n+1)r}.$$

Proof.
Notice that we can decompose all matrices in X r using the (economic) SVD as X = U Σ V (cid:62) , (cid:107) Σ (cid:107) F ≤ , Σ ∈ R r × r , U ∈ O n ,r , V ∈ O n ,r . Then we can cover the setof admissible singular values with a ε -net of vectors in B r , which is of cardinality at most(1 + ε ) r ≤ ( ε ) r (see e.g. [63, Chapter 4]). The remainder of the proof involves covering O n,r and is identical to [10, Lemma 3.1]. Appendix B. Omitted proofs. .1. Proof of Proposition 4.1. Given E , we denote D (cid:48) := diag( (cid:15) , . . . , (cid:15) n ) where (cid:15) i := (cid:80) nj =1 E ij . Denoting the updated regularized adjacency matrix by ˜ A new , it is immediate thatwe can write(B.1) ˜ A new := ( D + D (cid:48) ) − / ( A + E )( D + D (cid:48) ) − / . The first step is to get an expression for ( D + D (cid:48) ) − / . Since both matrices are diagonal, wecan write ( D + D (cid:48) ) − = diag (cid:18)(cid:110) d i (cid:16) (cid:15) i d i (cid:17)(cid:111) ni =1 (cid:19) − = diag (cid:32)(cid:26) d i ·
11 + (cid:15) i d i (cid:27) ni =1 (cid:33) = diag (cid:32)(cid:26) d i (cid:18) − (cid:15) i d i (cid:15) i d i (cid:19)(cid:27) ni =1 (cid:33) = diag (cid:18)(cid:110) d i (cid:16) − δ i δ i (cid:17)(cid:111) ni =1 (cid:19) , (B.2) ( D + D (cid:48) ) − / = diag (cid:18)(cid:26) √ d i (cid:114) − δ i δ i (cid:27) ni =1 (cid:19) (B.3)where δ i := (cid:15) i d i in (B.2). Next, we can write (cid:113) − δ i δ i = 1 − z i , with z i ∈ (cid:104) δ i δ i ) , δ i (2+ δ i )2(1+ δ i ) (cid:105) ,based on the identity 1 − x − x ≤ √ − x ≤ − x . It then follows that ( D + D (cid:48) ) − / = D − / ( I − Z ), Z := diag( z , . . . , z n ). Expanding this in the expression of (B.1) gives us˜ A new = D − / ( I − Z )( A + E )( I − Z ) D − / = D − / ( A − AZ + ZAZ ) D − / + ( I − Z ) D − / ED − / ( I − Z )= ˜ A + Z ˜ AZ − AZ + ( I − Z ) D − / ED − / ( I − Z ) . (B.4)Therefore, to bound (cid:13)(cid:13) ˜ A new − ˜ A (cid:13)(cid:13) , we have to bound three major components: (cid:13)(cid:13) Z ˜ AZ (cid:13)(cid:13) , (cid:13)(cid:13) ˜ AZ (cid:13)(cid:13) , and (cid:13)(cid:13) D − / ED − / (cid:13)(cid:13) . For the first two terms, since λ i ( ˜ A ) ∈ [ − ,
1] [13], we have that: (cid:13)(cid:13) ˜ AZ (cid:13)(cid:13) ≤ (cid:107) Z (cid:107) = max j ∈ [ n ] | z j | ≤ α (1 + α )2 , (cid:13)(cid:13) Z ˜ AZ (cid:13)(cid:13) ≤ (cid:18) α (1 + α )2 (cid:19) , (B.5)with the bound for | z i | coming from the fact that (cid:12)(cid:12)(cid:12)(cid:12) δ i δ i ) + δ i δ i ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ α α , since | δ i | ≤ α. For the remaining term, we upper bound (cid:107) D − / ED − / (cid:107) ≤ (cid:107) D − / (cid:107) (cid:107) ED − / (cid:107) , and bythe Gershgorin Circle Theorem [22] (using the columns of ED − / ) we obtainmax i (cid:12)(cid:12) λ i ( ED − / ) (cid:12)(cid:12) ≤ max j ∈ [ n ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) k =1 ,k (cid:54) = j E kj (cid:112) d j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( ∗ ) = max j ∈ [ n ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) k =1 E jk (cid:112) d j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max j ∈ [ n ] (cid:112) d j αd j (cid:13)(cid:13) D − / ED − / (cid:13)(cid:13) ≤ (cid:13)(cid:13) D − / (cid:13)(cid:13) max j ∈ [ n ] α (cid:112) d j = α (cid:115) max i d i min j d j = α (cid:112) κ ( D ) , (B.6) This identity is valid for all x ∈ ( − , δ i / (1 + δ i ). here ( ∗ ) follows since E ii = 0 , E (cid:62) = E and the next inequality follows by our assumptions.From Lemma A.8 (applied with D := D − / , A := E ) we have (cid:13)(cid:13) D − / ED − / (cid:13)(cid:13) ≤ (cid:112) rank( E ) (cid:107) D − E (cid:107) ≤ (cid:112) rank( E ) max i ∈ [ n ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) j =1 E ij d i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:112) rank( E ) α, (B.7)with the penultimate inequality via Gershgorin’s circles. We deduce that(B.8) (cid:13)(cid:13) D − / ED − / (cid:13)(cid:13) ≤ min (cid:0)(cid:112) κ ( D ) , (cid:112) rank( E ) (cid:1) α. Combining (B.5) and (B.8) into (B.4) we complete the proof (noting that (cid:107) I − Z (cid:107) ≤ O ( α ) by submultiplicativity and the definition of Z ):(B.9) (cid:13)(cid:13) ˜ A new − ˜ A (cid:13)(cid:13) ≤ α · (cid:0) α + (cid:112) min { κ ( D ) , rank( E ) } (cid:1) + (cid:18) α (1 + α )2 (cid:19) B.2. Proof of Proposition 4.2.
The proof of this proposition is a combination of Lemmas B.1 and B.2. The former controls $\|EV\|_2$ and $\|E^\top U\|_2$, while the latter bounds the difference between the corresponding singular values with high probability.

Lemma B.1.
In the setting of Theorem A.5, under Assumption 3, it holds that
$$\max\left\{ \|EV\|_2, \|E^\top U\|_2 \right\} \leq \frac{2}{\sqrt{n}} + \frac{\sqrt{r}}{n} \tag{B.10}$$
with probability at least $1 - 2\exp(-n/2)$.

Proof. We have
$$\|EV\|_2 = \sup_{(u, v) \in S^{n-1} \times S^{r-1}} \langle u, EVv \rangle = \sup_{(u, v) \in S^{n-1} \times (S^{n-1} \cap \mathrm{ran}(V))} \langle u, Ev \rangle.$$
Then, appealing to the Gaussian flavor of Chevet's inequality due to Gordon [24], as we can write $E = \frac{1}{n} G$ with $(G)_{ij} \sim \mathcal{N}(0, 1)$,
$$\mathbb{E}\left[ \|EV\|_2 \right] \leq \frac{1}{n} \left( W(S^{n-1}) \sup_{x \in S^{n-1} \cap \mathrm{ran}(V)} \|x\|_2 + W\big(S^{n-1} \cap \mathrm{ran}(V)\big) \sup_{x \in S^{n-1}} \|x\|_2 \right) \leq \frac{1}{\sqrt{n}} + \frac{\sqrt{r}}{n}, \tag{B.11}$$
where the last inequality is an appeal to Lemma A.3 for the second term, after noticing that $\dim(\mathrm{ran}(V)) = r$, combined with the fact that
$$W(S^{n-1}) = \mathbb{E}\left[ \sup_{v : \|v\|_2 = 1} \langle g, v \rangle \right] = \mathbb{E}[\|g\|_2] \leq \sqrt{n}.$$
An identical argument gives the same result for $\|E^\top U\|_2$.

Now, using the identity $|\sup f - \sup g| \leq \sup |f - g|$ and denoting $f(X) := \sup_{u, v} \frac{1}{n} \langle u, Xv \rangle$, we recover
$$|f(X) - f(Y)| \leq \frac{1}{n} \left| \sup_{u, v} \langle u, (X - Y)v \rangle \right| \leq \frac{1}{n} \sup_{u, v} \|u\|_2 \|(X - Y)v\|_2 \leq \frac{1}{n} \|X - Y\|_F,$$
where the last two inequalities are derived using the Cauchy–Schwarz inequality. Therefore $f$ is a $(1/n)$-Lipschitz function of Gaussian random variables. Theorem A.4 gives
$$\mathbb{P}\left( f(G) \geq \frac{1}{\sqrt{n}} + \frac{\sqrt{r}}{n} + t \right) \leq \mathbb{P}\left( f(G) \geq \mathbb{E}[f] + t \right) \leq \exp\left( -\frac{(nt)^2}{2} \right), \tag{B.12}$$
hence setting $t = n^{-1/2}$ in (B.12) gives us the desired high probability bound. The proof for $\|E^\top U\|_2$ is completely analogous, and following with a union bound for the two terms we recover the desired probability.

Lemma B.2.
In the setting of Theorem
A.5 , under Assumption , it holds with probabilityat least − e − c n − n − log n that (cid:12)(cid:12) σ i ( ˜ A ) − σ i ( A ) (cid:12)(cid:12) ≤ C A n + log nn (B.13) where C A is a constant depending only on A and c is an absolute constant.Proof. Let us write A = U Σ V (cid:62) and ˆ A := A + E = ˆ U ˆΣ ˆ V (cid:62) for the singular value decom-positions of the two matrices. For a matrix M , write M uv := (cid:104) u, M v (cid:105) for brevity and observethe following decomposition: σ i ( A + E ) − σ i ( A ) = ˆ A (cid:98) u i , (cid:98) v i − A u i ,v i = A (cid:98) u i , (cid:98) v i + E (cid:98) u i , (cid:98) v i − A u i ,v i − E u i ,v i + E u i ,v i = ( E (cid:98) u i , (cid:98) v i − E u i ,v i ) + E u i ,v i + ( A (cid:98) u i − u i ,v i + A u i , (cid:98) v i − v i + A (cid:98) u i − u i , (cid:98) v i − v i )= ( E (cid:98) u i , (cid:98) v i − E u i ,v i ) + E u i ,v i − σ i ( A ) (cid:2) (1 − (cid:104) (cid:98) u i , u i (cid:105) ) + (1 − (cid:104) (cid:98) v i , v i (cid:105) ) (cid:3) + A (cid:98) u i − u i , (cid:98) v i − v i , (B.14)with the last equality above following since Av i = (cid:80) nj =1 σ j ( A ) u j v (cid:62) j v i = σ i ( A ) u i . The decom-position above hints towards the objects that we need to control to prove that σ i ( ˆ A ) − σ i ( A )is small in magnitude. To be precise, let us define the set(B.15) S δ := (cid:110) ( u, v ) ∈ S n − × S n − (cid:12)(cid:12)(cid:12) (cid:13)(cid:13) uv (cid:62) − u i v (cid:62) i (cid:13)(cid:13) F ≤ δ , (cid:104) u, u i (cid:105) ≥ , (cid:104) v, v i (cid:105) ≥ (cid:111) and the events below: E := (cid:26) (cid:107) E (cid:107) ≤ C ( x ) n (cid:27) , E := (cid:26) | E u i ,v i | ≤ C ( x ) n (cid:27) , E := (cid:26) sup u,v ∈ S δ | E u,v − E u i ,v i | ≤ δ √ n + δC ( x ) n (cid:27) , (B.16)where C , C , C only depend on x . The set S δ contains the unit vectors ( u, v ) such that E u,v is “close” to E u i ,v i in terms of L distance, with the condition (cid:104) u, u i (cid:105) , (cid:104) v, v i (cid:105) ≥ E i being false. s a first step, notice that Theorem A.5 implies(B.17) (cid:13)(cid:13) u i u (cid:62) i − ˆ u i ˆ u (cid:62) i (cid:13)(cid:13) F + (cid:13)(cid:13) v i v (cid:62) i − ˆ v i ˆ v (cid:62) i (cid:13)(cid:13) F ≤ δ (cid:107) E (cid:107) ≤ (cid:16) α − √ n (cid:17) Cn = c n , assuming that α > √ n , with the last inequality above holding with high probability since (cid:107) E (cid:107) (cid:16) √ n from elementary arguments in random matrix theory. Notice that the above inparticular is equivalent to2(1 − (cid:104) u i , ˆ u i (cid:105) ) + 2(1 − (cid:104) v i , ˆ v i (cid:105) ) ≤ c n , max { − (cid:104) u i , ˆ u i (cid:105) , − (cid:104) v i , ˆ v i (cid:105)} ≤ c n (B.18)since 1 − (cid:104) u i , ˆ u i (cid:105) ≤ − (cid:104) u i , ˆ u i (cid:105) due to all vectors being unitary and (cid:104) u i , ˆ u i (cid:105) ≥
0; the same lineof reasoning applies to (cid:104) v i , ˆ v i (cid:105) . Equivalently, the above gives us(B.19) (cid:107) u i − ˆ u i (cid:107) + (cid:107) v i − ˆ v i (cid:107) ≤ c n , max {(cid:107) u i − ˆ u i (cid:107) , (cid:107) v i − ˆ v i (cid:107) } ≤ c √ n The magnitude of the RHS in (B.19) will help us determine the correct range for δ in E .For the remainder, consider the Gaussian process { nE u,v − nE u i ,v i } ( u,v ) ∈ S n − × S n − . Fol-lowing arguments from [63, Chapter 7], since nE is equal in distribution to a standard Gaussianmatrix, we obtain using the standard L distance:(B.20) d (( u, v ) , ( u (cid:48) , v (cid:48) )) := (cid:107) nE u,v − nE u (cid:48) ,v (cid:48) (cid:107) L = (cid:13)(cid:13) uv (cid:62) − u (cid:48) v (cid:48)(cid:62) (cid:13)(cid:13) F ≤ (cid:107) u − u (cid:48) (cid:107) + (cid:107) v − v (cid:48) (cid:107) , where the inequality follows from [63, Exercise 7.3.2]. We can now define the following quan-tities, which control most of the terms above:(B.21) ξ δ := sup u,v ∈ S δ E u,v − E u i ,v i , ˜ ξ δ := sup u,v ∈ S δ | E u,v − E u i ,v i | . Since 0 ∈ S δ by setting ( u, v ) = ( u i , v i ), it follows thatsup u,v ∈ S δ E u,v − E u i ,v i , sup u,v ∈ S δ − ( E u,v − E u i ,v i ) ≥ (cid:13)(cid:13) ˜ ξ δ (cid:13)(cid:13) ψ (cid:16) (cid:107) ξ δ (cid:107) ψ , with (cid:107)·(cid:107) ψ denoting the subgaussian norm. To arrive at theresult, we need the following Lemma: Lemma B.3.
For ˜ ξ δ defined as in (B.21) , it holds that (cid:13)(cid:13) ˜ ξ δ (cid:13)(cid:13) ψ (cid:46) (cid:16) δn (cid:17) , E (cid:2) ˜ ξ δ (cid:3) (cid:46) δ √ n . Proof.
The proof of the first property is immediate since (B.20) and the fact that we areworking within S δ give ussup u,v ∈ S δ (cid:107) E u,v − E u i ,v i (cid:107) L = sup u,v ∈ S δ n (cid:13)(cid:13) uv (cid:62) − u i v (cid:62) i (cid:13)(cid:13) F ≤ δ n , ence the result for the subgaussian norm follows from the high-probability version of Dudley’stheorem [63, Theorem 8.1.6].For the latter property, let us write Z u,v = nE u,v , so that Z := nE is a standard Gaussianmatrix. Then, using Dudley’s inequality, it follows that E [ ξ δ ] = 1 n E (cid:20) sup u,v ∈ S δ Z u,v − Z u i ,v i (cid:21) ≤ c n (cid:90) δ (cid:112) log N ( S δ , d, ε )d ε, (B.22)with N ( S , d, · ) denoting the covering number of S with respect to the metric d , and the upperlimit δ following since N ( S δ , d, ε ) = 1 , ∀ ε > δ as the distance between two “points” indexedby S δ is never more than δ .To estimate N ( S δ , d, ε ), it suffices to notice that d (( u, v ) , ( u i , v i )) = (cid:13)(cid:13) uv (cid:62) − u i v (cid:62) i (cid:13)(cid:13) F ∈ { X ∈ R n × n | rank( X ) ≤ , (cid:107) X (cid:107) F ≤ δ } , which is just a scaled version of X appearing in Lemma A.9. Therefore it must satisfy N ( S δ , d, ε ) ≤ N ( δ X , d, ε ) (cid:46) (cid:16) δε (cid:17) n +2 . Substituting the above into the integral of (B.22), we recover that E [ ξ δ ] ≤ c n (cid:90) δ (cid:114) (4 n + 2) log 9 δε d ε (cid:16) δ √ n . Given Lemma B.3, we can deduce that P (cid:18) ˜ ξ δ ≥ c δ √ n + t (cid:19) ≤ exp (cid:16) − c (cid:16) nδ (cid:17) t (cid:17) ⇒ P (cid:18) ˜ ξ δ ≥ δ √ n (cid:19) ≤ exp ( − c n ) . Moreover, we know that (cid:104) u i , Ev i (cid:105) ∼ n g, g ∼ N (0 ,
1) and therefore it follows from Gaussianconcentration that P ( | E u i ,v i | ≥ t ) ≤ (cid:18) − n t (cid:19) ⇒ P (cid:16) | E u i ,v i | ≥ log nn (cid:17) ≤ n − log n . Finally, an appeal to [63, Corollary 7.3.3] gives us that P (cid:18) (cid:107) E (cid:107) ≥ c √ n (cid:19) ≤ − c n ) , so we can deduce (via a union bound) that P (cid:0)(cid:84) i =1 E i (cid:1) ≥ − − cn ) − n − log n , by setting c := max( c , c ) using the notation above. Then, returning to (B.14), we deduce that | σ i ( A + E ) − σ i ( A ) | ≤ δ √ n + log nn + σ i ( A ) Cn + (cid:107) A (cid:107) C (cid:48) n Looking back at (B.19), it is immediate that δ (cid:46) √ n , which completes the desired claim.Putting Lemmas B.1 and B.2 together and appealing to Weyl’s inequality for the sin-gular value gap, we deduce that under Assumption 3 the desired inequality must hold withprobability at least 1 − c exp( − c n ) − n − log n . .3. Proof of Proposition 4.3. The proof consists of two components. In order forTheorem 3.2 to be applicable, we need to ensure that (cid:107) E (cid:107) < δ r , so given the lower bound on δ r it would suffice to prove that (cid:107) E (cid:107) = (cid:107) ε t z t z (cid:62) t (cid:107) = (cid:107) z t (cid:107) ≤ (cid:15) , which is a straightforwardapplication of χ -concentration. The other component consists of showing that (cid:107) EV (cid:107) issmall with high probability, which we tackle by bounding its expectation and using Gaussianconcentration to obtain a high probability bound.Let us take care of the first component; we can write (cid:107) z t (cid:107) = (cid:80) ni =1 1 n g i , g i ∼ N (0 , α = n tells us that P (cid:18) (cid:107) E (cid:107) ≥ (cid:114) xn + xn (cid:19) ≤ exp( − x )Setting x = (cid:15) n when (cid:15) is small gives us that (cid:107) E (cid:107) ≤ (cid:15) √ + (cid:15) < (cid:15) with probabilityat least 1 − exp (cid:0) − (cid:15) n (cid:1) .In order to address the second component, we prove the following Lemma: Lemma B.4.
Suppose Assumption holds. Then, for any r -dimensional subspace V , thereexist constants c , c > such that P (cid:18)(cid:13)(cid:13) E ( t ) V (cid:13)(cid:13) ≤ c (cid:114) r log nn (cid:19) ≥ − exp ( − c n ) − n − r . Proof.
Denote E ≡ E ( t ) for brevity. Using the variational characterization of singularvalues, we know that (cid:107) EV (cid:107) = sup u ∈ S n − ,v ∈ S r − (cid:104) u, EV v (cid:105) = sup u ∈ S n − ,v ∈ S r − ε t (cid:10) u, z t z (cid:62) t V v (cid:11) ≤ (cid:107) z t (cid:107) sup v ∈ V S r − (cid:104) z t , v (cid:105) . Taking expectations and applying the Cauchy-Schwarz inequality: E [ (cid:107) EV (cid:107) ] ≤ E (cid:2) (cid:107) z t (cid:107) (cid:3) / E (cid:20) sup v ∈ V S r − (cid:104) z t , v (cid:105) (cid:21) / ≤ (cid:16) n W ( S n − ∩ ran( V )) (cid:17) / , where the last inequality follows from the fact that E (cid:2) (cid:107) z t (cid:107) (cid:3) = n , the definition of gaussianwidth, and the fact that V is an orthogonal matrix, hence V acting on the unit sphere mapsback to the unit sphere. However, diam( S n − ∩ ran( V )) ≤
1, and dim( T ) = dim(ran( V )) = r .An appeal to Lemma A.3 recovers E [ (cid:107) EV (cid:107) ] ≤ (cid:112) rn . We can now proceed to show that (cid:107) EV (cid:107) is small with high probability. Let us denotethe events E := {(cid:107) z t (cid:107) ≤ c } , E := (cid:26) sup v ∈ V S r − (cid:104) z t , v (cid:105) ≤ (cid:114) rn + (cid:114) r log nn (cid:27) . For the first event, we know from Lemma A.1 that P ( E c ) ≤ exp( − c n ), where c dependsonly on c . Additionally, writing z t = √ n g, g ∼ N (0 , I n ), we see that (cid:12)(cid:12)(cid:12)(cid:12) sup v ∈ V S r − (cid:104) z t , v (cid:105) − sup v ∈ V S r − (cid:104) z (cid:48) t , v (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) sup v ∈ V S r − (cid:104) z t − z (cid:48) t , v (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) ≤ √ n (cid:107) g − g (cid:48) (cid:107) , hich means that f ( X ) := √ n sup v ∈ V S r − (cid:104) X, v (cid:105) is a (1 / √ n )-Lipschitz function of Gaussianvariables. Theorem A.4 then implies P (cid:18) sup v ∈ V S r − (cid:104) z t , v (cid:105) ≥ E (cid:20) sup v ∈ V S r − (cid:104) z t , v (cid:105) (cid:21) + t (cid:19) ≤ exp (cid:18) − t L (cid:19) ⇒ P ( E c ) = P (cid:18) sup v ∈ V S r − (cid:104) z t , v (cid:105) ≥ (cid:114) rn + (cid:114) nn (cid:19) ≤ n . Taking a union bound over E c , E c we recover P (cid:18) (cid:107) EV (cid:107) ≤ c (cid:114) rn + c √ (cid:114) log nn (cid:19) ≥ − exp ( − c n ) − n − . Finally, to conclude the proof, we set c = 1 + (cid:15) √ + (cid:15) in the definition of E above, whichgives c = (cid:15) ..