DeEPCA: Decentralized Exact PCA with Linear Convergence Rate
Haishan Ye∗   Tong Zhang†

February 9, 2021

∗ Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong, Shenzhen; email: [email protected]
† Hong Kong University of Science and Technology; email: [email protected]
Abstract
Due to the rapid growth of smart agents such as weakly connected computational nodes and sensors, developing decentralized algorithms that can perform computations on local agents becomes a major research direction. This paper considers the problem of decentralized principal components analysis (PCA), which is a statistical method widely used for data analysis. We introduce a technique called subspace tracking to reduce the communication cost, and apply it to power iterations. This leads to a decentralized PCA algorithm called DeEPCA, which has a convergence rate similar to that of the centralized PCA, while achieving the best communication complexity among existing decentralized PCA algorithms. DeEPCA is the first decentralized PCA algorithm with the number of communication rounds for each power iteration independent of the target precision. Compared to existing algorithms, the proposed method is easier to tune in practice, with an improved overall communication cost. Our experiments validate the advantages of DeEPCA empirically.
1 Introduction

Principal components analysis (PCA) is a statistical data analysis method with wide applications in machine learning (Moon & Phillips, 2001; Bishop, 2006; Ding & He, 2004; Dhillon et al., 2015), data mining (Cadima et al., 2004; Lee et al., 2010; Qu et al., 2002), and engineering (Bertrand & Moonen, 2014). In recent years, because of the rapid growth of data and quick advances in network technology, developing distributed algorithms has become a more and more important research topic, due to their advantages in privacy preservation, robustness, lower communication cost, etc. (Kairouz et al., 2019; Lian et al., 2017; Nedic & Ozdaglar, 2009). There have been a number of previous studies of decentralized PCA algorithms (Scaglione et al., 2008; Kempe & McSherry, 2008; Suleiman et al., 2016; Wai et al., 2017).

In a typical decentralized PCA setting, we assume that a positive semi-definite matrix $A$ is stored at different agents. Specifically, the matrix $A$ can be decomposed as
\[ A = \frac{1}{m}\sum_{j=1}^{m} A_j, \]
where the data for $A_j$ is stored in the $j$-th agent and known only to that agent (this helps to preserve privacy). The agents form a connected and undirected network, and can communicate with their neighbors in the network to cooperatively compute the PCA of $A$.

To obtain the top-$k$ principal components of the positive semi-definite matrix $A \in \mathbb{R}^{d\times d}$, a commonly used centralized algorithm is the power method, which converges fast in practice with a linear convergence rate (Golub & Van Loan, 2012). In the implementation of decentralized PCA, a natural idea is the decentralized power method (DePM), which mimics its centralized counterpart. The main procedure of
DePM can be summarized as a local power iteration plus a multi-consensus step to synchronize the local computations (Kempe & McSherry, 2008; Raja & Bajwa, 2015; Wai et al., 2017; Wu et al., 2018). The multi-consensus step in DePM is used to achieve averaging. However, decentralized PCA algorithms based on DePM suffer from a suboptimal communication cost, and are tricky to implement in practice. For each power iteration, theoretically, DePM requires $O(\log\frac{1}{\epsilon})$ rounds of communication, where $\epsilon$ is the target precision. This communication cost becomes quite significant when $\epsilon$ is small. Although seemingly only a logarithmic factor, in practice, with a data size of merely 10000, this logarithmic factor leads to an order of magnitude more communications, which is clearly prohibitive for many applications. Moreover, one often has to gradually increase the number of communication rounds in the multi-consensus step to deal with increased precision. This strategy makes the tuning of DePM difficult for practical applications.

In this paper, we propose a new decentralized PCA algorithm that does not suffer from the weaknesses of DePM. We observe that the communication precision requirement in DePM comes from the heterogeneity of data across agents. Due to this heterogeneity, the local power method would converge to the top-$k$ principal components of the local matrix $A_j$ if no consensus step were conducted to perform averaging. To overcome the weakness of DePM, whose consensus steps in each power iteration depend on the target precision $\epsilon$, we adapt the 'gradient tracking' technique from the decentralized optimization literature so that it can be used to track the subspace in power iterations. We call this adapted technique subspace tracking. Based on the subspace tracking technique and multi-consensus, we propose Decentralized Exact PCA (DeEPCA), which achieves a linear convergence rate similar to that of centralized PCA, while the consensus steps of each power iteration are independent of the target precision $\epsilon$. We summarize our contributions as follows:

1. We propose a novel power-iteration based decentralized PCA algorithm called DeEPCA, which achieves the best known communication complexity, especially when the final error $\epsilon$ is small. Furthermore, DeEPCA is the first decentralized PCA algorithm whose consensus steps in each power iteration do not depend on the target precision $\epsilon$.

2. We show that the 'gradient tracking' technique from the decentralized optimization literature can be adapted to subspace tracking for PCA. The resulting DeEPCA algorithm can be regarded as a novel decentralized power method. Because the power method is the foundation of many matrix decomposition problems, subspace tracking and the proof technique of DeEPCA can be applied to develop communication-efficient decentralized algorithms for spectral analysis and low-rank matrix approximation.

3. The improvement is practically significant. Our experiments show that DeEPCA can achieve a linear convergence rate comparable to centralized PCA, even when only a small number of consensus steps are used in each power iteration. In contrast, the conventional decentralized PCA algorithm cannot converge to the principal components of $A$ when the number of consensus steps is not large.

2 Notation

In this section, we introduce notations and definitions that will be used throughout the paper.
Given a matrix $A = [a_{ij}] \in \mathbb{R}^{n\times d}$ and a positive integer $k \le \min\{n,d\}$, its SVD is given as
\[ A = U\Sigma V^\top = U_k\Sigma_k V_k^\top + U_{\setminus k}\Sigma_{\setminus k}V_{\setminus k}^\top, \]
where $U_k$ and $U_{\setminus k}$ contain the left singular vectors of $A$, $V_k$ and $V_{\setminus k}$ contain the right singular vectors of $A$, and $\Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_{\min\{n,d\}})$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{n,d\}} \ge 0$. Accordingly, we can define the Frobenius norm
\[ \|A\| = \sqrt{\sum_{i=1}^{\min\{n,d\}}\sigma_i^2} = \sqrt{\sum_{i,j=1}^{n,d} A(i,j)^2} \]
and the spectral norm $\|A\|_2 = \sigma_1(A)$, where $A(i,j)$ denotes the $(i,j)$-th entry of $A$. We will use $\sigma_{\max}(A)$ to denote the largest singular value and $\sigma_{\min}(A)$ to denote the smallest singular value, which may be zero. If $A$ is symmetric positive semi-definite, then it holds that $U = V$ and $\lambda_i(A) = \sigma_i(A)$, where $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$, $\lambda_{\max}(A) = \sigma_{\max}(A)$, and $\lambda_{\min}(A) = \sigma_{\min}(A)$.

Next, we introduce the angle between two subspaces $U \in \mathbb{R}^{d\times k}$ and $X \in \mathbb{R}^{d\times k}$.

Definition 1.
Let $U \in \mathbb{R}^{d\times k}$ have orthonormal columns and $X \in \mathbb{R}^{d\times k}$ have independent columns, and let $V = U_\perp$ be an orthonormal basis of the orthogonal complement of $U$. Then we define
\[ \cos\theta_k(U,X) = \min_{\|w\|=1}\frac{\|U^\top Xw\|}{\|Xw\|},\quad \sin\theta_k(U,X) = \max_{\|w\|=1}\frac{\|V^\top Xw\|}{\|Xw\|},\quad \text{and}\quad \tan\theta_k(U,X) = \max_{\|w\|=1}\frac{\|V^\top Xw\|}{\|U^\top Xw\|}. \tag{2.1} \]
If $X$ is orthonormal, then it also holds that
\[ \cos\theta_k(U,X) = \sigma_{\min}(U^\top X),\quad \sin\theta_k(U,X) = \big\|V^\top X\big\|_2,\quad \text{and}\quad \tan\theta_k(U,X) = \big\|V^\top X(U^\top X)^{-1}\big\|_2, \tag{2.2} \]
where $\|\cdot\|_2$ is the spectral norm and $\sigma_{\min}(X)$ is the smallest singular value of the matrix $X$.

The above definitions can be found in the works (Hardt & Price, 2014; Golub & Van Loan, 2012).
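For concreteness, the quantities in Eqn. (2.2) can be evaluated numerically. The following is a minimal NumPy sketch of our own (the helper name `principal_angles` is ours, not part of the original analysis):

```python
import numpy as np

def principal_angles(U, X):
    """Evaluate Eqn. (2.2): cos, sin, and tan of the k-th principal angle
    between span(U) and span(X), for orthonormal U, X in R^{d x k}."""
    d, k = U.shape
    # Columns of V span the orthogonal complement of span(U).
    V = np.linalg.svd(np.eye(d) - U @ U.T)[0][:, :d - k]
    UX = U.T @ X                                      # k x k matrix
    cos_t = np.linalg.svd(UX, compute_uv=False)[-1]   # sigma_min(U^T X)
    sin_t = np.linalg.norm(V.T @ X, 2)                # spectral norm of V^T X
    tan_t = np.linalg.norm(V.T @ X @ np.linalg.inv(UX), 2)
    return cos_t, sin_t, tan_t
```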
Let $L$ be the weight matrix associated with the network, indicating how agents are connected. We assume that the weight matrix $L$ has the following properties:

1. $L$ is symmetric with $L_{i,j} \neq 0$ if and only if agents $i$ and $j$ are connected or $i = j$.
2. $0 \preceq L \preceq I$, $L\mathbf{1} = \mathbf{1}$, and $\mathrm{null}(I - L) = \mathrm{span}(\mathbf{1})$.

We use $I$ to denote the $m\times m$ identity matrix, and $\mathbf{1} = [1,\dots,1]^\top \in \mathbb{R}^m$ denotes the vector with all ones. The weight matrix has the important property that $L^\infty = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ (Xiao & Boyd, 2004). Thus, one can achieve the effect of averaging local variables on different agents by multiple steps of local communication. Recently, Liu & Morse (2011) proposed a more efficient way to achieve averaging than the one in (Xiao & Boyd, 2004); it is described in Algorithm 3.

Algorithm 3 FastMix
Input: $W^0 = W^{-1}$, $K$, $L$; step size $\eta_w = \frac{1-\sqrt{1-\lambda_2(L)}}{1+\sqrt{1-\lambda_2(L)}}$.
for $k = 0,\dots,K$ do
    $W^{k+1} = (1+\eta_w)\,W^k L - \eta_w\,W^{k-1}$;
end for
Output: $W^K$.
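The following is a small NumPy sketch of Algorithm 3, under our reading of the garbled step-size expression above (we assume $\eta_w$ uses $\lambda_2(L)$, matching the rate in Proposition 1 below); `fast_mix` is our own name, not the authors' reference code:

```python
import numpy as np

def fast_mix(W, L_gossip, K):
    """Accelerated gossip averaging (Algorithm 3) on m stacked local
    variables W of shape (m, d, k); L_gossip is the m x m weight matrix."""
    lam2 = np.sort(np.linalg.eigvalsh(L_gossip))[-2]   # second largest eigenvalue
    eta = (1 - np.sqrt(1 - lam2)) / (1 + np.sqrt(1 - lam2))
    W_prev = W.copy()
    for _ in range(K):
        # W^{k+1} = (1 + eta) * L W^k - eta * W^{k-1}
        W_next = (1 + eta) * np.einsum('ij,jdk->idk', L_gossip, W) - eta * W_prev
        W_prev, W = W, W_next
    return W
```

Note that each round preserves the average $\frac{1}{m}\sum_j W_j$, since $L\mathbf{1}=\mathbf{1}$ and $L$ is symmetric; this is the first claim of Proposition 1 below.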
Proposition 1. Let $W^K \in \mathbb{R}^{d\times d\times m}$ be the output of Algorithm 3 applied to $W^0$, and let $\bar{W} = \frac{1}{m}\sum_{j=1}^m W^0_j \in \mathbb{R}^{d\times d}$. Then it holds that $\bar{W} = \frac{1}{m}\sum_{j=1}^m W^K_j$, and
\[ \big\|W^K - \bar{W}\otimes\mathbf{1}\big\| \le \Big(1-\sqrt{1-\lambda_2(L)}\Big)^K\,\big\|W^0 - \bar{W}\otimes\mathbf{1}\big\|, \]
where $\lambda_2(L)$ is the second largest eigenvalue of $L$, and $\otimes$ denotes the tensor outer product.

3 Decentralized Exact PCA

In this section, we propose a novel decentralized exact PCA algorithm with a linear convergence rate. First, we provide the main idea behind our algorithm.

Algorithm 1 Decentralized Exact PCA (DeEPCA)
Input: proper initial point $W^0$; FastMix parameter $K$. Initialize $S^0_j = W^0$, $W^0_j = W^0$, and set $A_j W^{-1}_j = W^0$.
for $t = 0,\dots,T$ do
    For each agent $j$, update
    \[ S^{t+1}_j = S^t_j + A_j W^t_j - A_j W^{t-1}_j. \tag{3.1} \]
    Communicate $S^{t+1}_j$ with its neighbors several times to achieve averaging, that is,
    \[ S^{t+1} = \mathrm{FastMix}(S^{t+1}, K),\quad \text{with } S^{t+1}(:,:,j) = S^{t+1}_j. \tag{3.2} \]
    For each agent $j$, compute the orthonormal basis of $S^{t+1}_j$ by QR decomposition, that is,
    \[ W^{t+1}_j = \mathrm{QR}(S^{t+1}_j),\quad \text{and}\quad W^{t+1}_j = \mathrm{SignAdjust}(W^{t+1}_j, W^0). \tag{3.3} \]
end for
Output: $W^{T+1}_j$.
Algorithm 2 SignAdjust
Input: matrices $W^t$ and $W^0$, and column number $k$.
for $i = 1,\dots,k$ do
    if $\langle W^t(:,i), W^0(:,i)\rangle < 0$ then
        Flip the sign, that is, $W^t(:,i) = -W^t(:,i)$.
    end if
end for
Output: $W^t$.

In previous works, the common algorithmic framework is to conduct a multi-consensus step to achieve averaging after each local power iteration (Raja & Bajwa, 2015; Wai et al., 2017; Kempe & McSherry, 2008), that is,
\[ W^{t+1}_j = A_j W^t_j,\quad W^{t+1} = \mathrm{MultiConsensus}(W^{t+1}),\quad W^{t+1}_j = \mathrm{QR}(W^{t+1}_j), \tag{3.4} \]
where $\mathrm{QR}(W_j)$ computes the orthonormal basis of $W_j$ by QR decomposition and $W^t \in \mathbb{R}^{d\times k\times m}$ has its $j$-th slice $W^t(:,:,j) = W^t_j$. However, algorithms in this framework must take an increasing number of consensus steps to achieve high-precision principal components, and the consensus steps of each power iteration depend on the target precision $\epsilon$. This framework is similar to the well-known DGD algorithm in decentralized optimization, which cannot converge to the optimum without increasing the number of communications in each multi-consensus step (Yuan et al., 2016; Nedic & Ozdaglar, 2009).

In decentralized optimization, to overcome the weakness of DGD, a novel technique called 'gradient tracking' was introduced recently (Qu & Li, 2017; Shi et al., 2015). Thanks to gradient tracking, several algorithms have achieved a linear convergence rate without increasing the number of multi-consensus iterations per step. In particular, a recent work on Mudag showed that gradient tracking can be used to achieve a near-optimal communication complexity up to a log factor (Ye et al., 2020).

To obtain a decentralized exact PCA algorithm with a linear convergence rate without increasing the number of communications per consensus step, we track the subspace in the proposed PCA algorithm by adapting the gradient tracking method to 'subspace tracking'. Compared with previous decentralized PCA methods (Eqn. (3.4)), we introduce an extra term $S_j$ to track the subspace of the power iterations. Combining $S_j$ with multi-consensus, we can track the subspace in the power method exactly, and we can then obtain the exact principal components $W_j$ after several power iterations. The detailed description of the resulting algorithm DeEPCA is in Algorithm 1.

Please note that Algorithm 1 conducts a sign adjustment in Eqn. (3.3), which is necessary to make DeEPCA converge stably. This is because the signs of some columns of $W^t_j$ may flip during the local power iterations, and sign flipping does not change the column space of the matrix. However, if some signs are flipped, then the outcome of the aggregation $\bar{W}^t = \frac{1}{m}\sum_j W^t_j$ will be affected.

The subspace tracking technique in our algorithm is the key to achieving the advantages of DeEPCA. The intuition behind subspace tracking comes from the observation that when $W^t_j$ and $W^{t-1}_j$ are close to the optimal subspace $U$ (where $U$ contains the top-$k$ principal components of $A$), then $A_j W^t_j - A_j W^{t-1}_j$ is close to zero. This implies that the local subspaces $S^{t+1}_j$ on different agents only vary by small perturbations. Thus, we only need a small number of consensus steps to make the $S^{t+1}_j$ consistent with each other. In fact, the idea behind subspace tracking has also been used in variance reduction methods for finite-sum stochastic optimization (Johnson & Zhang, 2013; Defazio et al., 2014).

Using subspace tracking, we can maintain highly consistent subspaces $S^{t+1}_j$ in the power iteration computation $A_j W^t_j$ without increasing the number of communication rounds per consensus step. We can show that the approximation error decreases linearly to any target precision $\epsilon$ of the power method; a minimal code sketch of one DeEPCA iteration follows.
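As an illustration of Eqns. (3.1)-(3.3), here is a minimal NumPy sketch of one DeEPCA iteration, simulating all $m$ agents in one process. The helper names `sign_adjust` and `deepca_step` are our own, and for self-containedness the multi-consensus step uses plain gossip rounds as a stand-in for FastMix (Algorithm 3); this is an illustration, not the authors' reference implementation.

```python
import numpy as np

def sign_adjust(W, W0):
    """Algorithm 2: flip the sign of any column of W whose inner product
    with the corresponding column of the initial point W0 is negative."""
    for i in range(W.shape[1]):
        if W[:, i] @ W0[:, i] < 0:
            W[:, i] = -W[:, i]
    return W

def deepca_step(S, W, W_prev, A_list, L_gossip, K, W0):
    """One DeEPCA power iteration. S, W, W_prev have shape (m, d, k);
    A_list holds the m local matrices A_j; L_gossip is the weight matrix."""
    m = len(A_list)
    # (3.1) subspace tracking: S_j <- S_j + A_j W_j^t - A_j W_j^{t-1}
    S = np.stack([S[j] + A_list[j] @ (W[j] - W_prev[j]) for j in range(m)])
    # (3.2) K consensus rounds (plain gossip stand-in for FastMix)
    for _ in range(K):
        S = np.einsum('ij,jdk->idk', L_gossip, S)
    # (3.3) local QR factorization followed by sign adjustment
    W_new = np.stack([sign_adjust(np.linalg.qr(S[j])[0], W0) for j in range(m)])
    return S, W_new, W    # returned W serves as W_prev in the next call
```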
The following lemma shows how the mean variable $\bar{S}^t = \frac{1}{m}\sum_{j=1}^m S^t_j$ converges to the top-$k$ principal components of $A$, and how the local variables $S^t_j$ converge to their mean counterpart $\bar{S}^t$.

Lemma 1. The matrix $A = \frac{1}{m}\sum_{j=1}^m A_j \in \mathbb{R}^{d\times d}$ is positive semi-definite, with $A_j$ stored on the $j$-th agent and $\|A_j\| \le L$. The agents form an undirected connected graph with weight matrix $L \in \mathbb{R}^{m\times m}$. Given a parameter $k \ge 1$, the orthonormal matrix $U \in \mathbb{R}^{d\times k}$ contains the top-$k$ principal components of $A$, and $\lambda_k$ and $\lambda_{k+1}$ are the $k$-th and $(k{+}1)$-th largest eigenvalues of $A$, respectively. Suppose $\ell(\bar{S}^0) \triangleq \tan\theta_k(U,\bar{S}^0)$, $\gamma = 1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}$, and $\ell(\bar{S}^0) < \infty$. If $\rho = \big(1-\sqrt{1-\lambda_2(L)}\big)^K$ satisfies
\[ \rho \le \min\left\{\frac{\gamma}{2},\ \frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\big(1+\gamma^{2t}\ell^2(\bar{S}^0)\big)\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)},\ \frac{\lambda_k\lambda_{k+1}+2L\lambda_k}{8Lk(\sqrt{k}+1)\sqrt{m}\,\gamma^{t-1}\ell(\bar{S}^0)\sqrt{1+\gamma^{2t}\ell^2(\bar{S}^0)}\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)}\right\} \tag{3.5} \]
for $t = 1,\dots,T+1$, then, letting $\bar{S}^t = \frac{1}{m}\sum_{j=1}^m S^t_j$, the sequences $\{\bar{S}^t\}_{t=0}^{T+1}$ and $\{S^t\}_{t=0}^{T+1}$ generated by Algorithm 1 satisfy
\[ \ell(\bar{S}^t) \le \gamma^t\cdot\ell(\bar{S}^0) \quad\text{and}\quad \frac{1}{\sqrt{m}}\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le 4\rho L(\sqrt{k}+1)\,\gamma^{t-1}\cdot\ell(\bar{S}^0), \tag{3.6} \]
and
\[ \frac{1}{\sqrt{m}}\cdot\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le \frac{\lambda_k-\lambda_{k+1}}{24(\lambda_{k+1}+2L)}\cdot\gamma^t\cdot\ell(\bar{S}^0). \tag{3.7} \]

Remark 1.
Lemma 1 shows that our DeEPCA achieves a linear convergence rate almost the same as that of the power method. Furthermore, the difference between the local variable $S^t_j$ and its mean $\bar{S}^t$ also converges to zero as the iterations proceed. This implies that the $S_j$ on different agents converge to the same subspace, so $W^t_j = \mathrm{QR}(S^t_j)$ converges to the top-$k$ principal components of $A$. Furthermore, the right-hand side of Eqn. (3.5) varies with $t$ only through factors of $\gamma^t\ell(\bar{S}^0)$ and is independent of $\epsilon$. Hence, DeEPCA requires neither increasing the consensus steps to achieve a high-precision solution nor setting the consensus steps of each power iteration according to $\epsilon$, which is required in previous work (Wai et al., 2017; Kempe & McSherry, 2008). Lemma 1 also reveals an interesting property of DeEPCA: to obtain the top-$k$ principal components of a positive semi-definite matrix $A = \frac{1}{m}\sum_{j=1}^m A_j$, DeEPCA does not require each $A_j$ to be positive semi-definite. Thus, our DeEPCA is a robust algorithm and can be applied in different settings.
By Lemma 1, we can easily obtain the iteration and communication complexities of DeEPCA to achieve $\tan\theta_k(U,W_j) \le \epsilon$ for each agent $j$. The communication complexity depends on the number of local communications, which appear as the product of $W$ and $L$ in Algorithm 3. We now give the detailed iteration and communication complexities of our algorithm in the following theorem.

Theorem 1.
Let $A$, $U$, and the graph weight matrix $L$ satisfy the properties in Lemma 1. The initial orthonormal matrix $W^0$ satisfies $\tan\theta_k(U,W^0) < \infty$. Let the parameter $K$ satisfy
\[ K \ge \frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}. \]
Given $\epsilon < 1$, to achieve $\tan\theta_k(U,W^T_j) \le \epsilon$ for $j = 1,\dots,m$, the iteration complexity $T$ is at most
\[ T = \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\cdot\max\left\{\log\frac{4\tan\theta_k(U,W^0)}{\epsilon},\ \log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\,\epsilon}\right\}. \tag{3.8} \]
The communication complexity is at most
\[ C = \frac{2\lambda_k}{(\lambda_k-\lambda_{k+1})\sqrt{1-\lambda_2(L)}}\cdot\max\left\{\log\frac{4\tan\theta_k(U,W^0)}{\epsilon},\ \log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\,\epsilon}\right\}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}. \tag{3.9} \]
Furthermore, it also holds that
\[ \Big\|W^{T+1}_j-\frac{1}{m}\sum_{j=1}^m W^{T+1}_j\Big\| \le \epsilon, \quad\text{and}\quad \tan\theta_k(U,\bar{S}^T) \le \epsilon. \tag{3.10} \]

Remark 2.
Theorem 1 shows that, for any agent $j$, $W_j$ takes $T = O\big(\frac{\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{1}{\epsilon}\big)$ iterations to converge to the top-$k$ principal components of $A$ with $\epsilon$-suboptimality. This iteration complexity is the same as that of the centralized PCA based on the power method (Golub & Van Loan, 2012). Furthermore, each power iteration of DeEPCA requires
\[ K = O\left(\frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\Big(\frac{L^2}{\lambda_k\lambda_{k+1}}\cdot\frac{\lambda_k}{\lambda_k-\lambda_{k+1}}\Big)\right) \tag{3.11} \]
consensus steps. Note that $K$ is independent of the precision parameter $\epsilon$, which shows that DeEPCA does not need to tune its consensus parameter $K$ according to $\epsilon$. This also implies that DeEPCA need not increase its consensus steps gradually to achieve high-precision principal components. In contrast, the best known consensus steps for each power iteration of previous decentralized algorithms are (Wai et al., 2017)
\[ K = O\left(\frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\Big(\frac{\lambda_k}{(\lambda_k-\lambda_{k+1})\cdot\epsilon}\Big)\right). \tag{3.12} \]
Thus, DeEPCA achieves the best communication complexity among decentralized PCA algorithms. Comparing Eqn. (3.11) and (3.12), our result is better than that of (Wai et al., 2017) by up to a $\log\frac{1}{\epsilon}$ factor. In fact, this advantage becomes large even when $\epsilon$ is only moderately small, as can be observed in our experiments. A similar advantage of EXTRA over DGD in decentralized optimization made EXTRA one of the most important algorithms in decentralized optimization (Shi et al., 2015). Furthermore, Eqn. (3.11) shows that the consensus steps depend on the ratio $L^2/(\lambda_k\lambda_{k+1})$. In fact, the value $L^2/(\lambda_k\lambda_{k+1})$ reflects the data heterogeneity, which can be observed most clearly when $k = 1$. Due to the data heterogeneity, multi-consensus is necessary in DeEPCA, which will be validated in our experiments.
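To make the comparison concrete, consider a hypothetical instance with $\frac{\lambda_k}{\lambda_k-\lambda_{k+1}} = 10$ and target precision $\epsilon = 10^{-8}$ (numbers of our choosing, for illustration only). The log factor inside Eqn. (3.12) is then $\log(10\cdot 10^{8}) \approx 20.7$ and keeps growing as $\epsilon$ shrinks, while the log factor inside Eqn. (3.11) stays fixed regardless of $\epsilon$; accumulated over the $T = O(\log\frac{1}{\epsilon})$ power iterations, this is precisely the $\log\frac{1}{\epsilon}$ factor that DeEPCA saves.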
Remark 3.
Lemma 1 shows that once $\rho$ satisfies Eqn. (3.5), $\bar{S}^t$ converges to the top-$k$ principal components of $A$ linearly. That is, with any multi-consensus scheme that satisfies Eqn. (3.5), DeEPCA achieves a linear convergence rate. Thus, although our analysis is based on undirected graphs, the results of DeEPCA can be easily extended to directed graphs, gossip models, etc.
Remark 4.
DeEPCA is a novel decentralized exact power method. Because the power method is a key tool in eigenvector computation and low-rank approximation (SVD decomposition) (Golub & Van Loan, 2012), DeEPCA provides a solid foundation for developing decentralized eigenvalue decomposition, decentralized SVD, decentralized spectral analysis, etc.
4 Convergence Analysis

In this section, we give the detailed convergence analysis of DeEPCA. For notational convenience, we first introduce local and aggregate variables.

4.1 Local and Aggregate Variables

The matrix $W^t_j \in \mathbb{R}^{d\times k}$ is the local copy of the variable $W$ for agent $j$ at the $t$-th power iteration, and we introduce its aggregate variable $W^t \in \mathbb{R}^{d\times k\times m}$ whose $j$-th slice is $W^t_j$, that is, $W^t(:,:,j) = W^t_j$. Furthermore, we introduce $G^t_j = A_j W^{t-1}_j \in \mathbb{R}^{d\times k}$ and the tracking variable $S_j \in \mathbb{R}^{d\times k}$. We also introduce the aggregate variables $G^t \in \mathbb{R}^{d\times k\times m}$ and $S^t \in \mathbb{R}^{d\times k\times m}$ of $G^t_j$ and $S^t_j$, respectively, which satisfy
\[ G^t(:,:,j) = G^t_j \quad\text{and}\quad S^t(:,:,j) = S^t_j. \tag{4.1} \]
Using the local and aggregate variables, we can represent Algorithm 1 as
\[ S^{t+1} = \mathrm{FastMix}\big(S^t + G^{t+1} - G^t,\, K\big), \tag{4.2} \]
\[ W^{t+1}_j = \mathrm{QR}(S^{t+1}_j). \tag{4.3} \]
For the convergence analysis, we further introduce the mean values
\[ \bar{W}^t = \frac{1}{m}\sum_{j=1}^m W^t_j,\quad \bar{G}^t = \frac{1}{m}\sum_{j=1}^m G^t_j,\quad \bar{S}^t = \frac{1}{m}\sum_{j=1}^m S^t_j,\quad \bar{H}^t = \frac{1}{m}\sum_{j=1}^m A_j\bar{W}^{t-1},\quad \tilde{W}^t = \mathrm{QR}(\bar{S}^t). \tag{4.4} \]

4.2 Proof Sketch

First, we give the relationship between $\bar{S}^t$, $\bar{G}^t$, and $\bar{H}^t$ in Lemma 2 and Lemma 3. These two lemmas show that $\bar{S}^t$ and $\bar{H}^t$ are close to each other, up to a perturbation of $\frac{L}{\sqrt{m}}\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\|$. Furthermore, by the definition of $\bar{H}^t$, we can obtain
\[ \bar{S}^{t+1} \approx \bar{H}^{t+1} = A\bar{W}^t. \tag{4.5} \]
If $\bar{S}^t$ is also close to $S^t_j$, then we can obtain
\[ \bar{W}^{t+1} \approx \mathrm{QR}(\bar{S}^{t+1}). \tag{4.6} \]
We can observe that Eqn. (4.5) and (4.6) are the two steps of a power iteration, up to some perturbation. Thus, based on $\bar{S}^t$, $\bar{G}^t$, and $\bar{H}^t$, DeEPCA fits into the framework of the power method with some perturbation. This is the reason why
DeEPCA will converge to the top-$k$ principal components of $A$.

Next, we will bound the error between local and mean variables (defined in Section 4.1), such as $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ (in Lemma 4) and $\|W^{t+1}-\bar{W}^{t+1}\otimes\mathbf{1}\|$ (in Lemma 6). Lemma 4 shows that $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ decays with a rate $\rho < 1$, up to the additive term $L\rho\|W^t-W^{t-1}\|$. When DeEPCA converges, $W^t$ and $W^{t-1}$ both converge to the top-$k$ principal components, that is, $\|W^t-W^{t-1}\|$ converges to zero (in Lemma 8). Thus, $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ will also converge to zero. This implies that $\|W^t-\bar{W}^t\otimes\mathbf{1}\|$ goes to zero as $t$ increases, by Lemma 6. Hence, the noisy power method described in Eqn. (4.5) and (4.6) gradually becomes the exact power method.

Finally, Lemma 7 shows that $\tan\theta_k(U,\bar{S}^t)$ converges with rate $\gamma = 1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}$ when the perturbation term $\|[\bar{S}^t]^\dagger\|\,\|S^t-\bar{S}^t\otimes\mathbf{1}\|$ is upper bounded as in Eqn. (4.12). Combining Lemma 4, Lemma 5 and Lemma 7, we use induction in the proof of Lemma 1 to show that the assumption (4.12) and Eqn. (4.13) hold for $t = 1,\dots,T+1$ when $\rho$ is properly chosen. This leads to the results of Lemma 1.

4.3 Technical Lemmas

In our analysis, we aim to show that $\tan\theta_k(U,\bar{S}^{T+1})$ and $\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\|$ converge to $\epsilon$. First, we give the relationship between $\bar{S}^t$, $\bar{G}^t$, and $\bar{H}^t$; based on these quantities, DeEPCA fits into the framework of the power method with some perturbation.
Lemma 2.
Let $\bar{W}^0$, $\bar{G}^0$, and $\bar{S}^0$ be initialized as $W^0$. Supposing $\bar{G}^t$ and $\bar{S}^t$ are defined in Eqn. (4.4) and $S^t$ updates as in Eqn. (4.2), it holds that
\[ \bar{S}^{t+1} = \bar{S}^t + \bar{G}^{t+1} - \bar{G}^t = \bar{G}^{t+1}. \]

Lemma 3.
Letting $\bar{G}^t$ and $\bar{H}^t$ be defined in Eqn. (4.4) and $\|A_j\| \le L$ for $j = 1,\dots,m$, they have the following property:
\[ \big\|\bar{G}^t-\bar{H}^t\big\| \le \frac{L}{\sqrt{m}}\,\big\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\|. \tag{4.7} \]

In the next lemmas, we will bound the error between local and mean variables (defined in Section 4.1). First, we upper bound the error $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ recursively.

Lemma 4.
Letting $S^t$ be updated as in Eqn. (4.2) and $\|A_j\| \le L$, then $S^{t+1}$ and $\bar{S}^{t+1}$ have the following property:
\[ \big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + L\rho\,\big\|W^t-W^{t-1}\big\|, \quad\text{with } \rho \triangleq \Big(1-\sqrt{1-\lambda_2(L)}\Big)^K. \tag{4.8} \]

Lemma 5.
If for $t = 0,1,\dots,t_0$ it holds that $\sigma_{\min}(U^\top\tilde{W}^t) > 0$, with $\tilde{W}$ defined in Eqn. (4.4) and $U$ being the top-$k$ principal components of $A$, then we can obtain that
\[ \sigma_{\min}(\bar{S}^{t+1}) \ge \frac{\lambda_k}{\sqrt{1+\ell^2(\bar{S}^t)}} - \frac{24L}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \tag{4.9} \]

Now, we will bound the error $\|W^t-\bar{W}^t\otimes\mathbf{1}\|$.

Lemma 6.
Assuming that $\|[\bar{S}^t]^\dagger\|\,\|\bar{S}^t-S^t_j\| \le \frac{1}{2}$ for $j = 1,\dots,m$, where $[\bar{S}^t]^\dagger$ is the pseudo-inverse of $\bar{S}^t$, then it holds that
\[ \big\|W^t-\bar{W}^t\otimes\mathbf{1}\big\| \le 12\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \tag{4.10} \]
Letting $\bar{S}^t = \tilde{W}^t\tilde{R}^t$ be the QR decomposition of $\bar{S}^t$, it also holds that
\[ \big\|\tilde{W}^t-\bar{W}^t\big\| \le \frac{12}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \tag{4.11} \]

Next, we give the convergence rate of $\bar{S}^t$ under the assumption that the error between the local variable $S^t_j$ and its mean counterpart $\bar{S}^t$ is upper bounded.

Lemma 7.
Letting $\ell(\bar{S}^0) \triangleq \tan\theta_k(U,\bar{S}^0)$ and $\gamma \triangleq 1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}$, and supposing that
\[ \frac{1}{\sqrt{m}}\cdot\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le \frac{(\lambda_k-\lambda_{k+1})\cdot\gamma^t\cdot\ell(\bar{S}^0)}{24\sqrt{1+\gamma^{2t}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)} \tag{4.12} \]
for $t = 0,1,\dots,T$, the sequence $\{\bar{S}^t\}$ generated by Algorithm 1 satisfies
\[ \ell(\bar{S}^{t+1}) \le \gamma^{t+1}\cdot\ell(\bar{S}^0). \tag{4.13} \]

Finally, we will bound the difference between $W^t$ and $W^{t-1}$.

Lemma 8.
Letting $W$ be defined in Eqn. (4.4) and $\ell(\bar{S}^0) \triangleq \tan\theta_k(U,\bar{S}^0)$, it holds that
\[ \big\|W^t-W^{t-1}\big\| \le 24\Big(\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\,\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big). \tag{4.14} \]

4.4 Proofs of Lemma 1 and Theorem 1

Using the lemmas in the previous subsection, we can prove Lemma 1 and Theorem 1 as follows.
Proof of Lemma 1.
We prove the result by induction. When $t = 0$, Eqn. (4.12) holds since each agent shares the same initialization. This implies that $\ell(\bar{S}^1) \le \gamma\cdot\ell(\bar{S}^0)$.

Now, assume that Eqn. (4.12) and (4.13) hold for $t = 0,\dots,T$. In this case, for $t = 1,\dots,T$ it holds that $\ell(\bar{S}^t) \le \gamma^t\ell(\bar{S}^0)$ and
\[ \frac{1}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le \frac{(\lambda_k-\lambda_{k+1})\,\gamma^t\,\ell(\bar{S}^0)}{24\sqrt{1+\gamma^{2t}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)} \le \frac{\lambda_k-\lambda_{k+1}}{24(\lambda_{k+1}+2L)}\cdot\gamma^t\,\ell(\bar{S}^0). \]
We will show that the result holds for $t = T+1$; it suffices to prove that Eqn. (4.12) holds for $t = T+1$. First, by Eqn. (4.8) and (4.14),
\[ \big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho L\sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big) + 24\rho L\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big), \]
and combining this with the two displayed bounds above yields
\[ \frac{1}{\sqrt{m}}\,\big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \le \frac{\rho}{\sqrt{m}}\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + 2\rho L(\sqrt{k}+1)\,\gamma^{t-1}\ell(\bar{S}^0). \]
Using this inequality recursively, together with $\|S^0-\bar{S}^0\otimes\mathbf{1}\| = 0$ (each agent shares the same initialization) and the assumption $\rho \le \gamma/2$,
\[ \frac{1}{\sqrt{m}}\,\big\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\big\| \le 2\rho L(\sqrt{k}+1)\,\ell(\bar{S}^0)\sum_{i=1}^{T}\rho^{T-i}\gamma^{i} = 2\rho L(\sqrt{k}+1)\,\ell(\bar{S}^0)\cdot\frac{\gamma^T-\gamma\rho^T}{\gamma-\rho} \le 4\rho L(\sqrt{k}+1)\,\gamma^{T-1}\ell(\bar{S}^0). \]
Furthermore, we have
\[ \sigma_{\min}(\bar{S}^{T+1}) \overset{(4.9),(4.12)}{\ge} \frac{\lambda_k}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}} - \frac{L(\lambda_k-\lambda_{k+1})\,\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)} = \frac{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
Therefore, we can obtain
\[ \frac{1}{\sqrt{m}}\,\big\|[\bar{S}^{T+1}]^\dagger\big\|\,\big\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\big\| \le \frac{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}\cdot 4\rho L(\sqrt{k}+1)\,\gamma^{T-1}\ell(\bar{S}^0). \]
First, we need to satisfy the condition in Lemma 6, that is,
\[ \big\|[\bar{S}^{T+1}]^\dagger\big\|\,\big\|S^{T+1}_j-\bar{S}^{T+1}\big\| \le \big\|[\bar{S}^{T+1}]^\dagger\big\|\,\big\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\big\| \le \frac{1}{2}. \]
Therefore, $\rho$ only needs to satisfy
\[ \rho \le \frac{1}{8Lk(\sqrt{k}+1)\sqrt{m}\,\gamma^{T-1}\ell(\bar{S}^0)}\cdot\frac{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}, \]
and to simplify the above, we only require $\rho$ to be
\[ \rho \le \frac{\lambda_k\lambda_{k+1}+2L\lambda_k}{8Lk(\sqrt{k}+1)\sqrt{m}\,\gamma^{T-1}\ell(\bar{S}^0)\,\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
To satisfy Eqn. (4.12) for $t = T+1$, $\rho$ only needs
\[ \rho \le \frac{(\lambda_k-\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\sqrt{1+\gamma^{2(T+1)}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+2}\ell(\bar{S}^0)\big)}\cdot\frac{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
To simplify the above, we only require $\rho$ to be
\[ \rho \le \frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\,\big(1+\gamma^{2T}\ell^2(\bar{S}^0)\big)\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
Since Eqn. (4.12) holds for $t = T+1$ when $\rho$ satisfies condition (3.5), Eqn. (4.13) also holds for $t = T+1$. This concludes the proof.

Using the results of Lemma 1, we can prove Theorem 1 as follows.

Proof of Theorem 1.
First, by Eqn. (4.10), Eqn. (3.7), and the condition $T \ge \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\epsilon}$, we can obtain
\[ \big\|W^T-\bar{W}^T\otimes\mathbf{1}\big\| \le \sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{2(\lambda_{k+1}+2L)}\cdot\gamma^T\,\ell(\bar{S}^0) \le \epsilon. \]
Similarly, we can obtain $\tan\theta_k(U,\bar{S}^T) \le \epsilon$. Thus we obtain the results in Eqn. (3.10). Furthermore, by the definition of angles between two subspaces, we have
\[ \tan\theta_k(U,W^T_j) \overset{(2.1)}{=} \max_{\|w\|=1}\frac{\|V^\top W^T_j w\|}{\|U^\top W^T_j w\|} \le \max_{\|w\|=1}\frac{\|V^\top\bar{W}^T w\|+\|W^T_j-\bar{W}^T\|}{\|U^\top\bar{W}^T w\|-\|W^T_j-\bar{W}^T\|} \le \max_{\|w\|=1}\frac{\|V^\top\tilde{W}^T w\|+\|\tilde{W}^T-\bar{W}^T\|+\|W^T_j-\bar{W}^T\|}{\|U^\top\tilde{W}^T w\|-\|\tilde{W}^T-\bar{W}^T\|-\|W^T_j-\bar{W}^T\|} \]
\[ \overset{(4.10),(4.11)}{\le} \max_{\|w\|=1}\frac{\|V^\top\tilde{W}^T w\|+24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|}{\|U^\top\tilde{W}^T w\|-24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|} = \frac{\tan\theta_k(U,\tilde{W}^T)+24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|/\cos\theta_k(U,\tilde{W}^T)}{1-24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|/\cos\theta_k(U,\tilde{W}^T)} \]
\[ \overset{(3.6),(3.7)}{\le} \frac{\gamma^T\tan\theta_k(U,W^0)+\sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{\lambda_k+2L}\cdot\gamma^T\tan\theta_k(U,W^0)\cdot\sqrt{1+\gamma^{2T}\tan^2\theta_k(U,W^0)}}{1-\sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{\lambda_k+2L}\cdot\gamma^T\tan\theta_k(U,W^0)\cdot\sqrt{1+\gamma^{2T}\tan^2\theta_k(U,W^0)}}. \]
Since $T \ge \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{4\tan\theta_k(U,W^0)}{\epsilon}$, it holds that $\gamma^T\tan\theta_k(U,W^0) \le \epsilon/4$. Furthermore, when $T \ge \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\epsilon}$, it holds that $\sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{\lambda_k+2L}\cdot\gamma^T\tan\theta_k(U,W^0) \le \epsilon/4$. Thus, when $\epsilon < 1$, we can obtain
\[ \tan\theta_k(U,W^T_j) \le \frac{\epsilon/4+\epsilon/4\cdot\sqrt{5/4}}{1-1/4\cdot\sqrt{5/4}} < \epsilon. \]
Since the right-hand side of Eqn. (3.5) is monotone in $t$, $\rho$ only needs to satisfy
\[ \rho \le \min\left\{\frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\big(1+\ell^2(\bar{S}^0)\big)\big(\lambda_{k+1}+2L+(\lambda_k+2L)\ell(\bar{S}^0)\big)},\ \frac{\lambda_k\lambda_{k+1}+2L\lambda_k}{8Lk(\sqrt{k}+1)\sqrt{m}\,\ell(\bar{S}^0)\sqrt{1+\ell^2(\bar{S}^0)}\big(\lambda_{k+1}+2L+(\lambda_k+2L)\ell(\bar{S}^0)\big)}\right\}. \]
Furthermore, $\rho$ only requires
\[ \rho \le \frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\big(1+\ell^2(\bar{S}^0)\big)(\lambda_k+2L)\big(1+\ell(\bar{S}^0)\big)}. \tag{4.15} \]
Replacing the definition of $\ell(\bar{S}^0)$ and using Proposition 1, we can obtain that if $K$ satisfies
\[ K \ge \frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}, \]
then the requirement on $\rho$ in Eqn. (4.15) is satisfied. Combining with the iteration complexity, we obtain the total communication complexity
\[ C = T\times K = \frac{2\lambda_k}{(\lambda_k-\lambda_{k+1})\sqrt{1-\lambda_2(L)}}\cdot\max\left\{\log\frac{4\tan\theta_k(U,W^0)}{\epsilon},\ \log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\epsilon}\right\}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}. \]

5 Experiments

In the previous sections, we presented a theoretical analysis of our algorithm. In this section, we provide empirical studies.
Experiment Setting
In our experiments, we consider random networks where each pair of agents is connected with probability $p = 0.5$. We set $L = I-\frac{M}{\lambda_{\max}(M)}$, where $M$ is the Laplacian matrix associated with a weighted graph. We set $m = 50$, that is, there are 50 agents in the network, and the gossip matrix $L$ constructed this way has a fixed spectral gap $1-\lambda_2(L)$. For 'w8a', we set $n = 800$ and $d = 300$; for 'a9a', we set $n = 600$ and $d = 123$. For each agent, $A_j$ has the following form:
\[ A = \frac{1}{m}\sum_{j=1}^m A_j, \quad\text{and}\quad A_j = \sum_{i=1}^n v_i v_i^\top, \quad\text{with } v_i = a_{(j-1)n+i}, \tag{5.1} \]
where $a_{(j-1)n+i} \in \mathbb{R}^d$ is the $((j-1)n+i)$-th input vector of the dataset.

[Figure 1: Experiment on 'w8a'. Panels (a)-(i) plot $\|S-\bar{S}\otimes\mathbf{1}\|$, $\|W-\bar{W}\otimes\mathbf{1}\|$, and $\tan\theta_k(U,W)$ (log scale) against the number of power iterations for $K = 3$, $K = 5$, and $K = 10$, comparing DeEPCA, DePCA, and CPCA.]
[Figure 2: Experiment on 'a9a'. Panels (a)-(i) plot $\|S-\bar{S}\otimes\mathbf{1}\|$, $\|W-\bar{W}\otimes\mathbf{1}\|$, and $\tan\theta_k(U,W)$ (log scale) against the number of power iterations for $K = 1$, $K = 5$, and $K = 10$, comparing DeEPCA, DePCA, and CPCA.]
Experiment Results
In our experiments, we compare DeEPCA with decentralized PCA (DePCA) (Wai et al., 2017) and centralized PCA (CPCA). We also study empirically how the consensus steps affect the convergence rate of DeEPCA; thus, we run with several different values of $K$. We report the convergence of $\|S^t-\bar{S}^t\otimes\mathbf{1}\|$, $\|W^t-\bar{W}^t\otimes\mathbf{1}\|$, and $\frac{1}{m}\sum_{j=1}^m\tan\theta_k(U,W^t_j)$ in Figure 1 and Figure 2.

Figure 1 shows that the multi-consensus step is required in our DeEPCA. When $K = 3$, DeEPCA cannot converge to the top-$k$ principal components of $A$. The number of consensus steps of DeEPCA in each power iteration should be determined by the heterogeneity of the data, just as discussed in Remark 2. Furthermore, once the consensus steps of DeEPCA are sufficient, DeEPCA achieves a fast convergence rate comparable to centralized PCA, as can be observed from Figure 1 and Figure 2. This validates our convergence analysis of DeEPCA in Theorem 1.

Figure 1 and Figure 2 also show that, without increasing its consensus steps, DePCA cannot converge to the top-$k$ principal components of $A$. Lacking subspace tracking, DePCA can only reach a high-precision solution through an increasing number of consensus steps, as can be observed from the third columns of Figure 1 and Figure 2. Comparing DeEPCA and DePCA, we can conclude that DeEPCA has great advantages in communication cost.
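For reference, the reported metric $\frac{1}{m}\sum_{j=1}^m\tan\theta_k(U,W^t_j)$ can be computed as follows; this is a self-contained sketch of ours based on Eqn. (2.2), with assumed helper names:

```python
import numpy as np

def tan_theta_k(U, W):
    """tan of the k-th principal angle between span(U) and span(W), Eqn. (2.2)."""
    d, k = U.shape
    V = np.linalg.svd(np.eye(d) - U @ U.T)[0][:, :d - k]
    return np.linalg.norm(V.T @ W @ np.linalg.inv(U.T @ W), 2)

def mean_tan_theta(U, W_stack):
    """The reported metric: average tan(theta_k) over the m agents' bases."""
    return np.mean([tan_theta_k(U, W_j) for W_j in W_stack])
```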
6 Conclusion

This paper proposed a novel decentralized PCA algorithm, DeEPCA, that achieves a linear convergence rate similar to the centralized PCA method, while the number of communications per multi-consensus step does not depend on the target precision $\epsilon$. In this way, DeEPCA achieves the best known communication complexity for decentralized PCA. Our experiments also verify the communication efficiency of DeEPCA. Although the analysis of DeEPCA is based on undirected graphs and 'FastMix', it can be easily extended to handle directed graphs, because our analysis of DeEPCA only requires averaging. As a final remark, we note that DeEPCA employs the power method, which can be applied to eigenvector finding, low-rank matrix approximation, spectral analysis, etc. Therefore, DeEPCA can be used to design communication-efficient decentralized algorithms for these problems as well.
References
Bertrand, A. & Moonen, M. (2014). Distributed adaptive estimation of covariance matrix eigenvectors in wireless sensor networks with application to distributed PCA. Signal Processing, 104, 120-135.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Cadima, J., Cerdeira, J. O., & Minhoto, M. (2004). Computational aspects of algorithms for variable selection in the context of principal components. Computational Statistics & Data Analysis, 47(2), 225-236.

Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (pp. 1646-1654).

Dhillon, P. S., Foster, D. P., & Ungar, L. H. (2015). Eigenwords: Spectral word embeddings. The Journal of Machine Learning Research, 16(1), 3035-3078.

Ding, C. & He, X. (2004). K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning (pp. 29).

Golub, G. H. & Van Loan, C. F. (2012). Matrix Computations, volume 3. JHU Press.

Hardt, M. & Price, E. (2014). The noisy power method: A meta algorithm with applications. Advances in Neural Information Processing Systems, 27, 2861-2869.

Horn, R. A. & Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Johnson, R. & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 315-323.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.

Kempe, D. & McSherry, F. (2008). A decentralized algorithm for spectral analysis. Journal of Computer and System Sciences, 74(1), 70-83.

Lee, D., Lee, W., Lee, Y., & Pawitan, Y. (2010). Super-sparse principal component analyses for high-throughput genomic data. BMC Bioinformatics, 11(1), 296.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., & Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems (pp. 5330-5340).

Liu, J. & Morse, A. S. (2011). Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2), 160-165.

Moon, H. & Phillips, P. J. (2001). Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30(3), 303-321.

Nedic, A. & Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48-61.

Qu, G. & Li, N. (2017). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245-1260.

Qu, Y., Ostrouchov, G., Samatova, N., & Geist, A. (2002). Principal component analysis for dimension reduction in massive distributed data sets. In Proceedings of IEEE International Conference on Data Mining (ICDM), volume 1318 (pp. 1788).

Raja, H. & Bajwa, W. U. (2015). Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data. IEEE Transactions on Signal Processing, 64(1), 173-188.

Scaglione, A., Pagliari, R., & Krim, H. (2008). The decentralized estimation of the sample covariance. (pp. 1722-1726). IEEE.

Shi, W., Ling, Q., Wu, G., & Yin, W. (2015). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2), 944-966.

Stewart, G. (1977). Perturbation bounds for the QR factorization of a matrix. SIAM Journal on Numerical Analysis, 14(3), 509-518.

Suleiman, W., Pesavento, M., & Zoubir, A. M. (2016). Performance analysis of the decentralized eigendecomposition and ESPRIT algorithm. IEEE Transactions on Signal Processing, 64(9), 2375-2386.

Wai, H.-T., Scaglione, A., Lafond, J., & Moulines, E. (2017). Fast and privacy preserving distributed low-rank regression. (pp. 4451-4455). IEEE.

Wu, S. X., Wai, H.-T., Li, L., & Scaglione, A. (2018). A review of distributed algorithms for principal component analysis. Proceedings of the IEEE, 106(8), 1321-1340.

Xiao, L. & Boyd, S. (2004). Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1), 65-78.

Ye, H., Luo, L., Zhou, Z., & Zhang, T. (2020). Multi-consensus decentralized accelerated gradient descent. arXiv preprint arXiv:2005.00797.

Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3), 1835-1854.
A Proof of Lemmas in Section 4.3
We will prove our lemmas in the order of their appearance.
A.1 Proof of Lemma 2
Proof of Lemma 2.
First, because the operation 'FastMix' is linear, we can obtain that
\[ \bar{S}^{t+1} = \bar{S}^t + \bar{G}^{t+1} - \bar{G}^t. \]
We prove the result by induction. When $t = 0$, it holds that $\bar{S}^0 = \bar{G}^0 = W^0$. Supposing that $\bar{S}^t = \bar{G}^t$, we then have
\[ \bar{S}^{t+1} = \bar{S}^t + \bar{G}^{t+1} - \bar{G}^t = \bar{G}^{t+1}. \]
Thus, for each $t = 0,1,\dots$, it holds that $\bar{S}^t = \bar{G}^t$.

A.2 Proof of Lemma 3
Proof of Lemma 3.
By the definition of $\bar{G}^t$ and $\bar{H}^t$ in Eqn. (4.4), we have
\[ \big\|\bar{G}^t-\bar{H}^t\big\| = \Big\|\frac{1}{m}\sum_{j=1}^m A_j W^{t-1}_j - \frac{1}{m}\sum_{j=1}^m A_j\bar{W}^{t-1}\Big\| \le \frac{1}{m}\sum_{j=1}^m\big\|A_j(W^{t-1}_j-\bar{W}^{t-1})\big\| \le \frac{1}{m}\sum_{j=1}^m\|A_j\|\cdot\big\|W^{t-1}_j-\bar{W}^{t-1}\big\| \]
\[ \le \frac{L}{m}\sum_{j=1}^m\big\|W^{t-1}_j-\bar{W}^{t-1}\big\| \le \frac{L}{\sqrt{m}}\,\big\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\|, \]
where the third inequality is because of the assumption $\|A_j\| \le L$ for $j = 1,\dots,m$, and the last step uses the Cauchy-Schwarz inequality.

A.3 Proof of Lemma 4
Proof of Lemma 4.
For notational convenience, we use $\mathcal{T}(W)$ to denote the 'FastMix' operation on $W$ used in Algorithm 1, that is, $\mathcal{T}(W) \triangleq \mathrm{FastMix}(W, K)$. Then for $W$ it holds that
\[ \big\|\mathcal{T}(W)-\bar{W}\otimes\mathbf{1}\big\| \le \rho\cdot\big\|W-\bar{W}\otimes\mathbf{1}\big\|. \tag{A.1} \]
It is obvious that the 'FastMix' operation $\mathcal{T}(\cdot)$ is linear. By the update rule of $S^t$, we have
\[ \big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \overset{(4.2)}{=} \big\|\mathcal{T}(S^t+G^{t+1}-G^t) - (\bar{S}^t+\bar{G}^{t+1}-\bar{G}^t)\otimes\mathbf{1}\big\| \overset{(A.1)}{\le} \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho\,\big\|G^{t+1}-G^t-(\bar{G}^{t+1}-\bar{G}^t)\otimes\mathbf{1}\big\| \]
\[ \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho\,\big\|G^{t+1}-G^t\big\| = \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho\sqrt{\sum_{j=1}^m\big\|A_j(W^t_j-W^{t-1}_j)\big\|^2} \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + L\rho\,\big\|W^t-W^{t-1}\big\|, \]
where the second inequality uses the fact that for any $W \in \mathbb{R}^{d\times k\times m}$,
\[ \big\|W-\bar{W}\otimes\mathbf{1}\big\|^2 = \sum_{j=1}^m\Big\|W_j-\frac{1}{m}\sum_{i=1}^m W_i\Big\|^2 = \sum_{j=1}^m\|W_j\|^2 - m\Big\|\frac{1}{m}\sum_{i=1}^m W_i\Big\|^2 \le \sum_{j=1}^m\|W_j\|^2 = \|W\|^2, \]
and the last inequality is because
\[ \sum_{j=1}^m\big\|A_j(W^t_j-W^{t-1}_j)\big\|^2 \le \sum_{j=1}^m\|A_j\|^2\,\big\|W^t_j-W^{t-1}_j\big\|^2 \le L^2\sum_{j=1}^m\big\|W^t_j-W^{t-1}_j\big\|^2 = L^2\,\big\|W^t-W^{t-1}\big\|^2. \]

A.4 Proof of Lemma 5

Proof of Lemma 5.
By the definition of $\sigma_{\min}(\bar{S}^{t+1})$ and Lemma 2, we can obtain
\[ \sigma_{\min}(\bar{S}^{t+1}) = \sigma_{\min}(\bar{G}^{t+1}) \ge \sigma_{\min}(\bar{H}^{t+1}) - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| = \sigma_{\min}(A\bar{W}^t) - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| \]
\[ \ge \sigma_{\min}(A\tilde{W}^t) - \big\|A(\tilde{W}^t-\bar{W}^t)\big\| - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| \ge \sigma_{\min}(A\tilde{W}^t) - L\,\big\|\tilde{W}^t-\bar{W}^t\big\| - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| \]
\[ \overset{(4.7),(4.10),(4.11)}{\ge} \sigma_{\min}(A\tilde{W}^t) - \frac{24L}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \]
Furthermore, we have
\[ \sigma_{\min}(A\tilde{W}^t) = \sigma_{\min}\begin{bmatrix}\Sigma_k U^\top\tilde{W}^t\\ \Sigma_{\setminus k}V^\top\tilde{W}^t\end{bmatrix} \ge \sigma_{\min}\big(\Sigma_k U^\top\tilde{W}^t\big) \ge \lambda_k\cdot\sigma_{\min}(U^\top\tilde{W}^t) \overset{(2.2)}{=} \lambda_k\cdot\cos\theta_k(U,\tilde{W}^t) = \frac{\lambda_k}{\sqrt{1+\ell^2(\bar{S}^t)}}, \]
where the first inequality is because of Corollary 7.3.6 of Horn & Johnson (2012) and the matrix $\Sigma_k U^\top\tilde{W}^t$ is non-singular. Therefore, we can obtain
\[ \sigma_{\min}(\bar{S}^{t+1}) \ge \frac{\lambda_k}{\sqrt{1+\ell^2(\bar{S}^t)}} - \frac{24L}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \]

A.5 Proof of Lemma 6
First, we give an important lemma that will be used in our proof.
Lemma 9 (Theorem 3.1 of Stewart (1977)). Let $A = QR$, where $A \in \mathbb{R}^{d\times k}$ has rank $k$ and $Q^\top Q = I$ with $I$ being the identity matrix. Let $E$ satisfy $\|A^\dagger\|\|E\| < 1$, where $A^\dagger$ is the pseudo-inverse of $A$. Moreover, let $A+E = (Q+\Delta Q)(R+\Delta R)$, where $Q+\Delta Q$ has orthonormal columns. Then it holds that
\[ \|\Delta Q\| \le \frac{\|A^\dagger\|\|E\|}{1-\|A^\dagger\|\|E\|}. \tag{A.2} \]
For notational convenience, we omit the superscript $t$. Let $S_j = W_j R_j$ and $\bar{S} = \tilde{W}\tilde{R}$ be the QR decompositions of $S_j$ and $\bar{S}$, respectively. By Lemma 9 with $A = \bar{S}$ and $E = S_j-\bar{S}$, the assumption $\|\bar{S}^\dagger\|\|\bar{S}-S_j\| \le \frac{1}{2}$ gives
\[ \|W_j-\tilde{W}\| \le \frac{\|\bar{S}^\dagger\|\|\bar{S}-S_j\|}{1-\|\bar{S}^\dagger\|\|\bar{S}-S_j\|} \le 2\,\|\bar{S}^\dagger\|\|\bar{S}-S_j\|. \]
Then we have
\[ \big\|W-\bar{W}\otimes\mathbf{1}\big\| \le \sqrt{2\sum_{j=1}^m\|W_j-\tilde{W}\|^2} + \sqrt{2m}\,\|\tilde{W}-\bar{W}\| \le 2\sqrt{2\sum_{j=1}^m\|W_j-\tilde{W}\|^2} \le 4\sqrt{2}\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\| \le 12\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\|, \]
where the second inequality holds because $\sqrt{m}\,\|\tilde{W}-\bar{W}\| = \sqrt{m}\,\big\|\frac{1}{m}\sum_i(\tilde{W}-W_i)\big\| \le \sqrt{\sum_i\|\tilde{W}-W_i\|^2}$. Moreover,
\[ \|\tilde{W}-\bar{W}\| \le \frac{1}{m}\sum_{j=1}^m\|\tilde{W}-W_j\| \le \frac{2}{m}\,\big\|\bar{S}^\dagger\big\|\sum_{j=1}^m\|\bar{S}-S_j\| \le \frac{2}{\sqrt{m}}\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\| \le \frac{12}{\sqrt{m}}\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\|. \]

A.6 Proof of Lemma 7

Proof of Lemma 7.
By the update rule of Algorithm 1, we can obtain
\[ \ell(\bar{S}^{t+1}) = \tan\theta_k(U,\bar{S}^{t+1}) = \max_{\|w\|=1}\frac{\|V^\top\bar{S}^{t+1}w\|}{\|U^\top\bar{S}^{t+1}w\|} = \max_{\|w\|=1}\frac{\|V^\top\bar{G}^{t+1}w\|}{\|U^\top\bar{G}^{t+1}w\|} \le \max_{\|w\|=1}\frac{\|V^\top\bar{H}^{t+1}w\|+\|\bar{G}^{t+1}-\bar{H}^{t+1}\|}{\|U^\top\bar{H}^{t+1}w\|-\|\bar{G}^{t+1}-\bar{H}^{t+1}\|} \]
\[ \overset{(4.7)}{\le} \max_{\|w\|=1}\frac{\|V^\top A\bar{W}^t w\|+\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|}{\|U^\top A\bar{W}^t w\|-\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|} \le \max_{\|w\|=1}\frac{\lambda_{k+1}\|V^\top\bar{W}^t w\|+\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|}{\lambda_k\|U^\top\bar{W}^t w\|-\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|} \]
\[ \le \max_{\|w\|=1}\frac{\lambda_{k+1}\|V^\top\tilde{W}^t w\|+\lambda_{k+1}\|\tilde{W}^t-\bar{W}^t\|+\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|}{\lambda_k\|U^\top\tilde{W}^t w\|-\lambda_k\|\tilde{W}^t-\bar{W}^t\|-\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|} \overset{(4.10),(4.11)}{\le} \max_{\|w\|=1}\frac{\lambda_{k+1}\|V^\top\tilde{W}^t w\|+\frac{12(\lambda_{k+1}+L)}{\sqrt{m}}\|[\bar{S}^t]^\dagger\|\|S^t-\bar{S}^t\otimes\mathbf{1}\|}{\lambda_k\|U^\top\tilde{W}^t w\|-\frac{12(\lambda_k+L)}{\sqrt{m}}\|[\bar{S}^t]^\dagger\|\|S^t-\bar{S}^t\otimes\mathbf{1}\|}. \]
Furthermore, we have
\[ \frac{1}{\|U^\top\tilde{W}^t w\|} \le \max_{\|w\|=1}\frac{1}{\|U^\top\tilde{W}^t w\|} = \frac{1}{\cos\theta_k(U,\tilde{W}^t)}. \]
Dividing numerator and denominator by $\|U^\top\tilde{W}^t w\|$ then yields
\[ \ell(\bar{S}^{t+1}) \le \frac{\lambda_{k+1}\,\ell(\bar{S}^t) + \frac{12(\lambda_{k+1}+L)}{\sqrt{m}}\,\|[\bar{S}^t]^\dagger\|\,\|S^t-\bar{S}^t\otimes\mathbf{1}\|\cdot\sqrt{1+\ell^2(\bar{S}^t)}}{\lambda_k - \frac{12(\lambda_k+L)}{\sqrt{m}}\,\|[\bar{S}^t]^\dagger\|\,\|S^t-\bar{S}^t\otimes\mathbf{1}\|\cdot\sqrt{1+\ell^2(\bar{S}^t)}}, \tag{A.3} \]
where the last step uses the fact $1+\tan^2\theta = 1/\cos^2\theta$.

Now we prove the result by induction. When $t = 0$, the $S^0_j$ are equal to each other, that is, $\|S^0-\bar{S}^0\otimes\mathbf{1}\| = 0$. Hence we can obtain
\[ \ell(\bar{S}^1) \le \frac{\lambda_{k+1}}{\lambda_k}\,\ell(\bar{S}^0) < \Big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\Big)\cdot\ell(\bar{S}^0). \]
We assume that $\ell(\bar{S}^t) \le \gamma^t\cdot\ell(\bar{S}^0)$ and Eqn. (4.12) hold. Substituting these assumptions into Eqn. (A.3), we can obtain
\[ \ell(\bar{S}^{t+1}) \le \Big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\Big)^{t+1}\cdot\ell(\bar{S}^0) = \gamma^{t+1}\cdot\ell(\bar{S}^0). \]
This concludes the proof.
A.7 Proof of Lemma 8
Proof of Lemma 8.
First, by the triangle inequality, we can obtain
\[ \big\|W^t-W^{t-1}\big\| \le \big\|W^t-\bar{W}^t\otimes\mathbf{1}\big\| + \big\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\| + \big\|\bar{W}^t\otimes\mathbf{1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\| \]
\[ \overset{(4.10)}{\le} 12\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{m}\,\big\|\bar{W}^t-\bar{W}^{t-1}\big\|. \]
Furthermore, we have
\[ \big\|\bar{W}^t-\bar{W}^{t-1}\big\| \le \big\|\bar{W}^t-U\big\| + \big\|\bar{W}^{t-1}-U\big\| \le \big\|\tilde{W}^t-U\big\| + \big\|\tilde{W}^t-\bar{W}^t\big\| + \big\|\tilde{W}^{t-1}-U\big\| + \big\|\tilde{W}^{t-1}-\bar{W}^{t-1}\big\| \]
\[ \overset{(4.11)}{\le} \big\|\tilde{W}^t-U\big\| + \big\|\tilde{W}^{t-1}-U\big\| + \frac{12}{\sqrt{m}}\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big). \]
Now we bound the value of $\|\tilde{W}^t-U\|$. Note that due to the sign adjustment in Eqn. (3.3) of Algorithm 1, $W^t$ and $W^{t-1}$ share the same direction, that is, the dot products of corresponding columns of $W^t$ and $W^{t-1}$ are positive. Thus, we can choose $U$ such that it shares the same direction with $W^t$ and $W^{t-1}$; in this case, $\tilde{W}^t$ and $\tilde{W}^{t-1}$ also share the same direction with $U$. Combining this with the definition of $\tilde{W}$ in Eqn. (4.4), we have
\[ \big\|\tilde{W}^t-U\big\|^2 = \big\|\tilde{W}^t\big\|^2 + \|U\|^2 - 2\big\langle\tilde{W}^t,U\big\rangle \le 2k - 2k\cdot\sigma_{\min}(U^\top\tilde{W}^t) = 2k\big(1-\cos\theta_k(\tilde{W}^t,U)\big) \]
\[ = 2k\Big(1-\frac{1}{\sqrt{1+\ell^2(\bar{S}^t)}}\Big) = 2k\cdot\frac{\ell^2(\bar{S}^t)}{\sqrt{1+\ell^2(\bar{S}^t)}\big(\sqrt{1+\ell^2(\bar{S}^t)}+1\big)} \le 2k\cdot\ell^2(\bar{S}^t), \]
where the first inequality is because $U^\top(:,i)\,\tilde{W}^t(:,i) > 0$ and
\[ \big\langle\tilde{W}^t,U\big\rangle = \sum_{i=1}^k U^\top(:,i)\,\tilde{W}^t(:,i) \ge k\cdot\sigma_{\min}(U^\top\tilde{W}^t). \]
Therefore, we can obtain
\[ \big\|W^t-W^{t-1}\big\| \le 12\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + 12\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big) \]
\[ = 24\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big). \]