DeEPCA: Decentralized Exact PCA with Linear Convergence Rate
Haishan Ye∗   Tong Zhang†

February 9, 2021

∗ Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong, Shenzhen; email: [email protected]
† Hong Kong University of Science and Technology; email: [email protected]
Abstract
Due to the rapid growth of smart agents such as weakly connected computational nodes and sensors, developing decentralized algorithms that can perform computations on local agents becomes a major research direction. This paper considers the problem of decentralized principal components analysis (PCA), which is a statistical method widely used for data analysis. We introduce a technique called subspace tracking to reduce the communication cost, and apply it to power iterations. This leads to a decentralized PCA algorithm called DeEPCA, which has a convergence rate similar to that of the centralized PCA, while achieving the best communication complexity among existing decentralized PCA algorithms. DeEPCA is the first decentralized PCA algorithm with the number of communication rounds for each power iteration independent of the target precision. Compared to existing algorithms, the proposed method is easier to tune in practice, with an improved overall communication cost. Our experiments validate the advantages of DeEPCA empirically.
1 Introduction

Principal components analysis (PCA) is a statistical data analysis method with wide applications in machine learning (Moon & Phillips, 2001; Bishop, 2006; Ding & He, 2004; Dhillon et al., 2015), data mining (Cadima et al., 2004; Lee et al., 2010; Qu et al., 2002), and engineering (Bertrand & Moonen, 2014). In recent years, because of the rapid growth of data and quick advances in network technology, developing distributed algorithms has become a more and more important research topic, due to their advantages in privacy preservation, robustness, lower communication cost, etc. (Kairouz et al., 2019; Lian et al., 2017; Nedic & Ozdaglar, 2009). There have been a number of previous studies of decentralized PCA algorithms (Scaglione et al., 2008; Kempe & McSherry, 2008; Suleiman et al., 2016; Wai et al., 2017).

In a typical decentralized PCA setting, we assume that a positive semi-definite matrix $A$ is stored at different agents. Specifically, the matrix $A$ can be decomposed as
\[ A = \frac{1}{m}\sum_{j=1}^{m} A_j, \]
where the data for $A_j$ is stored in the $j$-th agent and known only to that agent (this helps to preserve privacy). The agents form a connected and undirected network, and can communicate with their neighbors in the network to cooperatively compute the PCA of $A$.

To obtain the top-$k$ principal components of the positive semi-definite matrix $A \in \mathbb{R}^{d\times d}$, a commonly used centralized algorithm is the power method, which converges fast in practice with a linear convergence rate (Golub & Van Loan, 2012). In the implementation of decentralized PCA, a natural idea is the decentralized power method (DePM), which mimics its centralized counterpart. The main procedure of
DePM can be summarized as a local power iteration plus a multi-consensus step to synchronize the local computations (Kempe & McSherry, 2008; Raja & Bajwa, 2015; Wai et al., 2017; Wu et al., 2018). The multi-consensus step in DePM is used to achieve averaging. However, decentralized PCA algorithms based on DePM suffer from a suboptimal communication cost, and are tricky to implement in practice. For each power iteration, theoretically, DePM requires $O(\log\frac{1}{\epsilon})$ rounds of communication, where $\epsilon$ is the target precision. This communication cost becomes quite significant when $\epsilon$ is small. Although seemingly only a logarithmic factor, in practice, with a data size of merely 10000, this logarithmic factor leads to an order of magnitude more communications, which is clearly prohibitive for many applications. Moreover, one often has to gradually increase the number of communication rounds in the multi-consensus step to deal with increased precision. This strategy makes the tuning of DePM difficult for practical applications.

In this paper, we propose a new decentralized PCA algorithm that does not suffer from the weaknesses of DePM. We observe that the communication precision requirement in DePM comes from the heterogeneity of data across agents. Due to this heterogeneity, the local power method would converge to the top-$k$ principal components of the local matrix $A_j$ if no consensus step were conducted to perform averaging. To overcome the weakness of DePM, whose consensus steps in each power iteration depend on the target precision $\epsilon$, we adapt the 'gradient tracking' technique from the decentralized optimization literature so that it can be used to track the subspace in power iterations. We call this adapted technique subspace tracking. Based on the subspace tracking technique and multi-consensus, we propose Decentralized Exact PCA (DeEPCA), which achieves a linear convergence rate similar to that of centralized PCA, while the consensus steps of each power iteration are independent of the target precision $\epsilon$. We summarize our contributions as follows:

1. We propose a novel power-iteration based decentralized PCA algorithm called DeEPCA, which achieves the best known communication complexity, especially when the final error $\epsilon$ is small. Furthermore, DeEPCA is the first decentralized PCA algorithm whose consensus steps in each power iteration do not depend on the target precision $\epsilon$.

2. We show that the 'gradient tracking' technique from the decentralized optimization literature can be adapted to subspace tracking for PCA. The resulting DeEPCA algorithm can be regarded as a novel decentralized power method. Because the power method is the foundation of many matrix decomposition problems, subspace tracking and the proof technique of DeEPCA can be applied to develop communication-efficient decentralized algorithms for spectral analysis and low-rank matrix approximation.

3. The improvement is practically significant. Our experiments show that DeEPCA can achieve a linear convergence rate comparable to centralized PCA, even when only a small number of consensus steps are used in each power iteration. In contrast, the conventional decentralized PCA algorithm cannot converge to the principal components of $A$ when the number of consensus steps is not large.

2 Notation

In this section, we introduce notations and definitions that will be used throughout the paper.
Given a matrix $A = [a_{ij}] \in \mathbb{R}^{n\times d}$ and a positive integer $k \le \min\{n,d\}$, its SVD is given as
\[ A = U\Sigma V^\top = U_k\Sigma_k V_k^\top + U_{\setminus k}\Sigma_{\setminus k}V_{\setminus k}^\top, \]
where $U_k$ and $U_{\setminus k}$ contain the left singular vectors of $A$, $V_k$ and $V_{\setminus k}$ contain the right singular vectors of $A$, and $\Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_{\min\{n,d\}})$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{n,d\}} \ge 0$. Accordingly, we can define the Frobenius norm
\[ \|A\| = \sqrt{\sum_{i=1}^{\min\{n,d\}}\sigma_i^2} = \sqrt{\sum_{i,j=1}^{n,d} A(i,j)^2} \]
and the spectral norm $\|A\|_2 = \sigma_1(A)$, where $A(i,j)$ denotes the $(i,j)$-th entry of $A$. We will use $\sigma_{\max}(A)$ to denote the largest singular value and $\sigma_{\min}(A)$ to denote the smallest singular value, which may be zero. If $A$ is symmetric positive semi-definite, then it holds that $U = V$ and $\lambda_i(A) = \sigma_i(A)$, where $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$, $\lambda_{\max}(A) = \sigma_{\max}(A)$, and $\lambda_{\min}(A) = \sigma_{\min}(A)$.

Next, we introduce the angle between two subspaces $U \in \mathbb{R}^{d\times k}$ and $X \in \mathbb{R}^{d\times k}$.

Definition 1.
Let $U \in \mathbb{R}^{d\times k}$ have orthonormal columns and $X \in \mathbb{R}^{d\times k}$ have independent columns, and let $V = U_\perp$ be an orthonormal basis of the orthogonal complement of $U$. Then we define
\[ \cos\theta_k(U,X) = \min_{\|w\|=1}\frac{\|U^\top Xw\|}{\|Xw\|},\quad \sin\theta_k(U,X) = \max_{\|w\|=1}\frac{\|V^\top Xw\|}{\|Xw\|},\quad \text{and}\quad \tan\theta_k(U,X) = \max_{\|w\|=1}\frac{\|V^\top Xw\|}{\|U^\top Xw\|}. \tag{2.1} \]
If $X$ is orthonormal, then it also holds that
\[ \cos\theta_k(U,X) = \sigma_{\min}(U^\top X),\quad \sin\theta_k(U,X) = \big\|V^\top X\big\|_2,\quad \text{and}\quad \tan\theta_k(U,X) = \big\|V^\top X(U^\top X)^{-1}\big\|_2, \tag{2.2} \]
where $\|\cdot\|_2$ is the spectral norm and $\sigma_{\min}(X)$ is the smallest singular value of the matrix $X$.

The above definitions can be found in the works (Hardt & Price, 2014; Golub & Van Loan, 2012).
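For concreteness, the quantities in Eqn. (2.2) can be evaluated numerically. The following is a minimal NumPy sketch of our own (the helper name `principal_angles` is ours, not part of the original analysis):

```python
import numpy as np

def principal_angles(U, X):
    """Evaluate Eqn. (2.2): cos, sin, and tan of the k-th principal angle
    between span(U) and span(X), for orthonormal U, X in R^{d x k}."""
    d, k = U.shape
    # Columns of V span the orthogonal complement of span(U).
    V = np.linalg.svd(np.eye(d) - U @ U.T)[0][:, :d - k]
    UX = U.T @ X                                      # k x k matrix
    cos_t = np.linalg.svd(UX, compute_uv=False)[-1]   # sigma_min(U^T X)
    sin_t = np.linalg.norm(V.T @ X, 2)                # spectral norm of V^T X
    tan_t = np.linalg.norm(V.T @ X @ np.linalg.inv(UX), 2)
    return cos_t, sin_t, tan_t
```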
Let $L$ be the weight matrix associated with the network, indicating how agents are connected. We assume that the weight matrix $L$ has the following properties:

1. $L$ is symmetric with $L_{i,j} \neq 0$ if and only if agents $i$ and $j$ are connected or $i = j$.
2. $0 \preceq L \preceq I$, $L\mathbf{1} = \mathbf{1}$, and $\mathrm{null}(I - L) = \mathrm{span}(\mathbf{1})$.

We use $I$ to denote the $m\times m$ identity matrix, and $\mathbf{1} = [1,\dots,1]^\top \in \mathbb{R}^m$ denotes the vector with all ones. The weight matrix has the important property that $L^\infty = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ (Xiao & Boyd, 2004). Thus, one can achieve the effect of averaging local variables on different agents by multiple steps of local communication. Recently, Liu & Morse (2011) proposed a more efficient way to achieve averaging than the one in (Xiao & Boyd, 2004); it is described in Algorithm 3.

Algorithm 3 FastMix
Input: $W^0 = W^{-1}$, $K$, $L$; step size $\eta_w = \frac{1-\sqrt{1-\lambda_2(L)}}{1+\sqrt{1-\lambda_2(L)}}$.
for $k = 0,\dots,K$ do
    $W^{k+1} = (1+\eta_w)\,W^k L - \eta_w\,W^{k-1}$;
end for
Output: $W^K$.
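The following is a small NumPy sketch of Algorithm 3, under our reading of the garbled step-size expression above (we assume $\eta_w$ uses $\lambda_2(L)$, matching the rate in Proposition 1 below); `fast_mix` is our own name, not the authors' reference code:

```python
import numpy as np

def fast_mix(W, L_gossip, K):
    """Accelerated gossip averaging (Algorithm 3) on m stacked local
    variables W of shape (m, d, k); L_gossip is the m x m weight matrix."""
    lam2 = np.sort(np.linalg.eigvalsh(L_gossip))[-2]   # second largest eigenvalue
    eta = (1 - np.sqrt(1 - lam2)) / (1 + np.sqrt(1 - lam2))
    W_prev = W.copy()
    for _ in range(K):
        # W^{k+1} = (1 + eta) * L W^k - eta * W^{k-1}
        W_next = (1 + eta) * np.einsum('ij,jdk->idk', L_gossip, W) - eta * W_prev
        W_prev, W = W, W_next
    return W
```

Note that each round preserves the average $\frac{1}{m}\sum_j W_j$, since $L\mathbf{1}=\mathbf{1}$ and $L$ is symmetric; this is the first claim of Proposition 1 below.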
Proposition 1. Let $W^K \in \mathbb{R}^{d\times d\times m}$ be the output of Algorithm 3 applied to $W^0$, and let $\bar{W} = \frac{1}{m}\sum_{j=1}^m W^0_j \in \mathbb{R}^{d\times d}$. Then it holds that $\bar{W} = \frac{1}{m}\sum_{j=1}^m W^K_j$, and
\[ \big\|W^K - \bar{W}\otimes\mathbf{1}\big\| \le \Big(1-\sqrt{1-\lambda_2(L)}\Big)^K\,\big\|W^0 - \bar{W}\otimes\mathbf{1}\big\|, \]
where $\lambda_2(L)$ is the second largest eigenvalue of $L$, and $\otimes$ denotes the tensor outer product.

3 Decentralized Exact PCA

In this section, we propose a novel decentralized exact PCA algorithm with a linear convergence rate. First, we provide the main idea behind our algorithm.

Algorithm 1 Decentralized Exact PCA (DeEPCA)
Input: proper initial point $W^0$; FastMix parameter $K$. Initialize $S^0_j = W^0$, $W^0_j = W^0$, and set $A_j W^{-1}_j = W^0$.
for $t = 0,\dots,T$ do
    For each agent $j$, update
    \[ S^{t+1}_j = S^t_j + A_j W^t_j - A_j W^{t-1}_j. \tag{3.1} \]
    Communicate $S^{t+1}_j$ with its neighbors several times to achieve averaging, that is,
    \[ S^{t+1} = \mathrm{FastMix}(S^{t+1}, K),\quad \text{with } S^{t+1}(:,:,j) = S^{t+1}_j. \tag{3.2} \]
    For each agent $j$, compute the orthonormal basis of $S^{t+1}_j$ by QR decomposition, that is,
    \[ W^{t+1}_j = \mathrm{QR}(S^{t+1}_j),\quad \text{and}\quad W^{t+1}_j = \mathrm{SignAdjust}(W^{t+1}_j, W^0). \tag{3.3} \]
end for
Output: $W^{T+1}_j$.
Algorithm 2 SignAdjust
Input: matrices $W^t$ and $W^0$, and column number $k$.
for $i = 1,\dots,k$ do
    if $\langle W^t(:,i), W^0(:,i)\rangle < 0$ then
        Flip the sign, that is, $W^t(:,i) = -W^t(:,i)$.
    end if
end for
Output: $W^t$.

In previous works, the common algorithmic framework is to conduct a multi-consensus step to achieve averaging after each local power iteration (Raja & Bajwa, 2015; Wai et al., 2017; Kempe & McSherry, 2008), that is,
\[ W^{t+1}_j = A_j W^t_j,\quad W^{t+1} = \mathrm{MultiConsensus}(W^{t+1}),\quad W^{t+1}_j = \mathrm{QR}(W^{t+1}_j), \tag{3.4} \]
where $\mathrm{QR}(W_j)$ computes the orthonormal basis of $W_j$ by QR decomposition and $W^t \in \mathbb{R}^{d\times k\times m}$ has its $j$-th slice $W^t(:,:,j) = W^t_j$. However, algorithms in this framework must take an increasing number of consensus steps to achieve high-precision principal components, and the consensus steps of each power iteration depend on the target precision $\epsilon$. This framework is similar to the well-known DGD algorithm in decentralized optimization, which cannot converge to the optimum without increasing the number of communications in each multi-consensus step (Yuan et al., 2016; Nedic & Ozdaglar, 2009).

In decentralized optimization, to overcome the weakness of DGD, a novel technique called 'gradient tracking' was introduced recently (Qu & Li, 2017; Shi et al., 2015). Thanks to gradient tracking, several algorithms have achieved a linear convergence rate without increasing the number of multi-consensus iterations per step. In particular, a recent work on Mudag showed that gradient tracking can be used to achieve a near-optimal communication complexity up to a log factor (Ye et al., 2020).

To obtain a decentralized exact PCA algorithm with a linear convergence rate without increasing the number of communications per consensus step, we track the subspace in the proposed PCA algorithm by adapting the gradient tracking method to 'subspace tracking'. Compared with previous decentralized PCA methods (Eqn. (3.4)), we introduce an extra term $S_j$ to track the subspace of the power iterations. Combining $S_j$ with multi-consensus, we can track the subspace in the power method exactly, and we can then obtain the exact principal components $W_j$ after several power iterations. The detailed description of the resulting algorithm DeEPCA is in Algorithm 1.

Please note that Algorithm 1 conducts a sign adjustment in Eqn. (3.3), which is necessary to make DeEPCA converge stably. This is because the signs of some columns of $W^t_j$ may flip during the local power iterations, and sign flipping does not change the column space of the matrix. However, if some signs are flipped, then the outcome of the aggregation $\bar{W}^t = \frac{1}{m}\sum_j W^t_j$ will be affected.

The subspace tracking technique in our algorithm is the key to achieving the advantages of DeEPCA. The intuition behind subspace tracking comes from the observation that when $W^t_j$ and $W^{t-1}_j$ are close to the optimal subspace $U$ (where $U$ contains the top-$k$ principal components of $A$), then $A_j W^t_j - A_j W^{t-1}_j$ is close to zero. This implies that the local subspaces $S^{t+1}_j$ on different agents only vary by small perturbations. Thus, we only need a small number of consensus steps to make the $S^{t+1}_j$ consistent with each other. In fact, the idea behind subspace tracking has also been used in variance reduction methods for finite-sum stochastic optimization (Johnson & Zhang, 2013; Defazio et al., 2014).

Using subspace tracking, we can maintain highly consistent subspaces $S^{t+1}_j$ in the power iteration computation $A_j W^t_j$ without increasing the number of communication rounds per consensus step. We can show that the approximation error decreases linearly to any target precision $\epsilon$ of the power method; a minimal code sketch of one DeEPCA iteration follows.
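As an illustration of Eqns. (3.1)-(3.3), here is a minimal NumPy sketch of one DeEPCA iteration, simulating all $m$ agents in one process. The helper names `sign_adjust` and `deepca_step` are our own, and for self-containedness the multi-consensus step uses plain gossip rounds as a stand-in for FastMix (Algorithm 3); this is an illustration, not the authors' reference implementation.

```python
import numpy as np

def sign_adjust(W, W0):
    """Algorithm 2: flip the sign of any column of W whose inner product
    with the corresponding column of the initial point W0 is negative."""
    for i in range(W.shape[1]):
        if W[:, i] @ W0[:, i] < 0:
            W[:, i] = -W[:, i]
    return W

def deepca_step(S, W, W_prev, A_list, L_gossip, K, W0):
    """One DeEPCA power iteration. S, W, W_prev have shape (m, d, k);
    A_list holds the m local matrices A_j; L_gossip is the weight matrix."""
    m = len(A_list)
    # (3.1) subspace tracking: S_j <- S_j + A_j W_j^t - A_j W_j^{t-1}
    S = np.stack([S[j] + A_list[j] @ (W[j] - W_prev[j]) for j in range(m)])
    # (3.2) K consensus rounds (plain gossip stand-in for FastMix)
    for _ in range(K):
        S = np.einsum('ij,jdk->idk', L_gossip, S)
    # (3.3) local QR factorization followed by sign adjustment
    W_new = np.stack([sign_adjust(np.linalg.qr(S[j])[0], W0) for j in range(m)])
    return S, W_new, W    # returned W serves as W_prev in the next call
```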
The following lemma shows how the mean variable $\bar{S}^t = \frac{1}{m}\sum_{j=1}^m S^t_j$ converges to the top-$k$ principal components of $A$, and how the local variables $S^t_j$ converge to their mean counterpart $\bar{S}^t$.

Lemma 1. The matrix $A = \frac{1}{m}\sum_{j=1}^m A_j \in \mathbb{R}^{d\times d}$ is positive semi-definite, with $A_j$ stored on the $j$-th agent and $\|A_j\| \le L$. The agents form an undirected connected graph with weight matrix $L \in \mathbb{R}^{m\times m}$. Given a parameter $k \ge 1$, the orthonormal matrix $U \in \mathbb{R}^{d\times k}$ contains the top-$k$ principal components of $A$, and $\lambda_k$ and $\lambda_{k+1}$ are the $k$-th and $(k{+}1)$-th largest eigenvalues of $A$, respectively. Suppose $\ell(\bar{S}^0) \triangleq \tan\theta_k(U,\bar{S}^0)$, $\gamma = 1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}$, and $\ell(\bar{S}^0) < \infty$. If $\rho = \big(1-\sqrt{1-\lambda_2(L)}\big)^K$ satisfies
\[ \rho \le \min\left\{\frac{\gamma}{2},\ \frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\big(1+\gamma^{2t}\ell^2(\bar{S}^0)\big)\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)},\ \frac{\lambda_k\lambda_{k+1}+2L\lambda_k}{8Lk(\sqrt{k}+1)\sqrt{m}\,\gamma^{t-1}\ell(\bar{S}^0)\sqrt{1+\gamma^{2t}\ell^2(\bar{S}^0)}\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)}\right\} \tag{3.5} \]
for $t = 1,\dots,T+1$, then, letting $\bar{S}^t = \frac{1}{m}\sum_{j=1}^m S^t_j$, the sequences $\{\bar{S}^t\}_{t=0}^{T+1}$ and $\{S^t\}_{t=0}^{T+1}$ generated by Algorithm 1 satisfy
\[ \ell(\bar{S}^t) \le \gamma^t\cdot\ell(\bar{S}^0) \quad\text{and}\quad \frac{1}{\sqrt{m}}\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le 4\rho L(\sqrt{k}+1)\,\gamma^{t-1}\cdot\ell(\bar{S}^0), \tag{3.6} \]
and
\[ \frac{1}{\sqrt{m}}\cdot\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le \frac{\lambda_k-\lambda_{k+1}}{24(\lambda_{k+1}+2L)}\cdot\gamma^t\cdot\ell(\bar{S}^0). \tag{3.7} \]

Remark 1.
Lemma 1 shows that our DeEPCA achieves a linear convergence rate almost the same as that of the power method. Furthermore, the difference between the local variable $S^t_j$ and its mean $\bar{S}^t$ also converges to zero as the iterations proceed. This implies that the $S_j$ on different agents converge to the same subspace, so $W^t_j = \mathrm{QR}(S^t_j)$ converges to the top-$k$ principal components of $A$. Furthermore, the right-hand side of Eqn. (3.5) varies with $t$ only through factors of $\gamma^t\ell(\bar{S}^0)$ and is independent of $\epsilon$. Hence, DeEPCA requires neither increasing the consensus steps to achieve a high-precision solution nor setting the consensus steps of each power iteration according to $\epsilon$, which is required in previous work (Wai et al., 2017; Kempe & McSherry, 2008). Lemma 1 also reveals an interesting property of DeEPCA: to obtain the top-$k$ principal components of a positive semi-definite matrix $A = \frac{1}{m}\sum_{j=1}^m A_j$, DeEPCA does not require each $A_j$ to be positive semi-definite. Thus, our DeEPCA is a robust algorithm and can be applied in different settings.
By Lemma 1, we can easily obtain the iteration and communication complexities of DeEPCA to achieve $\tan\theta_k(U,W_j) \le \epsilon$ for each agent $j$. The communication complexity depends on the number of local communications, which appear as the product of $W$ and $L$ in Algorithm 3. We now give the detailed iteration and communication complexities of our algorithm in the following theorem.

Theorem 1.
Let $A$, $U$, and the graph weight matrix $L$ satisfy the properties in Lemma 1. The initial orthonormal matrix $W^0$ satisfies $\tan\theta_k(U,W^0) < \infty$. Let the parameter $K$ satisfy
\[ K \ge \frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}. \]
Given $\epsilon < 1$, to achieve $\tan\theta_k(U,W^T_j) \le \epsilon$ for $j = 1,\dots,m$, the iteration complexity $T$ is at most
\[ T = \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\cdot\max\left\{\log\frac{4\tan\theta_k(U,W^0)}{\epsilon},\ \log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\,\epsilon}\right\}. \tag{3.8} \]
The communication complexity is at most
\[ C = \frac{2\lambda_k}{(\lambda_k-\lambda_{k+1})\sqrt{1-\lambda_2(L)}}\cdot\max\left\{\log\frac{4\tan\theta_k(U,W^0)}{\epsilon},\ \log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\,\epsilon}\right\}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}. \tag{3.9} \]
Furthermore, it also holds that
\[ \Big\|W^{T+1}_j-\frac{1}{m}\sum_{j=1}^m W^{T+1}_j\Big\| \le \epsilon, \quad\text{and}\quad \tan\theta_k(U,\bar{S}^T) \le \epsilon. \tag{3.10} \]

Remark 2.
Theorem 1 shows that, for any agent $j$, $W_j$ takes $T = O\big(\frac{\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{1}{\epsilon}\big)$ iterations to converge to the top-$k$ principal components of $A$ with $\epsilon$-suboptimality. This iteration complexity is the same as that of the centralized PCA based on the power method (Golub & Van Loan, 2012). Furthermore, each power iteration of DeEPCA requires
\[ K = O\left(\frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\Big(\frac{L^2}{\lambda_k\lambda_{k+1}}\cdot\frac{\lambda_k}{\lambda_k-\lambda_{k+1}}\Big)\right) \tag{3.11} \]
consensus steps. Note that $K$ is independent of the precision parameter $\epsilon$, which shows that DeEPCA does not need to tune its consensus parameter $K$ according to $\epsilon$. This also implies that DeEPCA need not increase its consensus steps gradually to achieve high-precision principal components. In contrast, the best known consensus steps for each power iteration of previous decentralized algorithms are (Wai et al., 2017)
\[ K = O\left(\frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\Big(\frac{\lambda_k}{(\lambda_k-\lambda_{k+1})\cdot\epsilon}\Big)\right). \tag{3.12} \]
Thus, DeEPCA achieves the best communication complexity among decentralized PCA algorithms. Comparing Eqn. (3.11) and (3.12), our result is better than that of (Wai et al., 2017) by up to a $\log\frac{1}{\epsilon}$ factor. In fact, this advantage becomes large even when $\epsilon$ is only moderately small, as can be observed in our experiments. A similar advantage of EXTRA over DGD in decentralized optimization made EXTRA one of the most important algorithms in decentralized optimization (Shi et al., 2015). Furthermore, Eqn. (3.11) shows that the consensus steps depend on the ratio $L^2/(\lambda_k\lambda_{k+1})$. In fact, the value $L^2/(\lambda_k\lambda_{k+1})$ reflects the data heterogeneity, which can be observed most clearly when $k = 1$. Due to the data heterogeneity, multi-consensus is necessary in DeEPCA, which will be validated in our experiments.
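To make the comparison concrete, consider a hypothetical instance with $\frac{\lambda_k}{\lambda_k-\lambda_{k+1}} = 10$ and target precision $\epsilon = 10^{-8}$ (numbers of our choosing, for illustration only). The log factor inside Eqn. (3.12) is then $\log(10\cdot 10^{8}) \approx 20.7$ and keeps growing as $\epsilon$ shrinks, while the log factor inside Eqn. (3.11) stays fixed regardless of $\epsilon$; accumulated over the $T = O(\log\frac{1}{\epsilon})$ power iterations, this is precisely the $\log\frac{1}{\epsilon}$ factor that DeEPCA saves.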
Remark 3.
Lemma 1 shows that once $\rho$ satisfies Eqn. (3.5), $\bar{S}^t$ converges to the top-$k$ principal components of $A$ linearly. That is, with any multi-consensus scheme that satisfies Eqn. (3.5), DeEPCA achieves a linear convergence rate. Thus, although our analysis is based on undirected graphs, the results of DeEPCA can be easily extended to directed graphs, gossip models, etc.
Remark 4.
DeEPCA is a novel decentralized exact power method. Because the power method is a key tool in eigenvector computation and low-rank approximation (SVD decomposition) (Golub & Van Loan, 2012), DeEPCA provides a solid foundation for developing decentralized eigenvalue decomposition, decentralized SVD, decentralized spectral analysis, etc.
4 Convergence Analysis

In this section, we give the detailed convergence analysis of DeEPCA. For notational convenience, we first introduce local and aggregate variables.

4.1 Local and Aggregate Variables

The matrix $W^t_j \in \mathbb{R}^{d\times k}$ is the local copy of the variable $W$ for agent $j$ at the $t$-th power iteration, and we introduce its aggregate variable $W^t \in \mathbb{R}^{d\times k\times m}$ whose $j$-th slice is $W^t_j$, that is, $W^t(:,:,j) = W^t_j$. Furthermore, we introduce $G^t_j = A_j W^{t-1}_j \in \mathbb{R}^{d\times k}$ and the tracking variable $S_j \in \mathbb{R}^{d\times k}$. We also introduce the aggregate variables $G^t \in \mathbb{R}^{d\times k\times m}$ and $S^t \in \mathbb{R}^{d\times k\times m}$ of $G^t_j$ and $S^t_j$, respectively, which satisfy
\[ G^t(:,:,j) = G^t_j \quad\text{and}\quad S^t(:,:,j) = S^t_j. \tag{4.1} \]
Using the local and aggregate variables, we can represent Algorithm 1 as
\[ S^{t+1} = \mathrm{FastMix}\big(S^t + G^{t+1} - G^t,\, K\big), \tag{4.2} \]
\[ W^{t+1}_j = \mathrm{QR}(S^{t+1}_j). \tag{4.3} \]
For the convergence analysis, we further introduce the mean values
\[ \bar{W}^t = \frac{1}{m}\sum_{j=1}^m W^t_j,\quad \bar{G}^t = \frac{1}{m}\sum_{j=1}^m G^t_j,\quad \bar{S}^t = \frac{1}{m}\sum_{j=1}^m S^t_j,\quad \bar{H}^t = \frac{1}{m}\sum_{j=1}^m A_j\bar{W}^{t-1},\quad \tilde{W}^t = \mathrm{QR}(\bar{S}^t). \tag{4.4} \]

4.2 Proof Sketch

First, we give the relationship between $\bar{S}^t$, $\bar{G}^t$, and $\bar{H}^t$ in Lemma 2 and Lemma 3. These two lemmas show that $\bar{S}^t$ and $\bar{H}^t$ are close to each other, up to a perturbation of $\frac{L}{\sqrt{m}}\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\|$. Furthermore, by the definition of $\bar{H}^t$, we can obtain
\[ \bar{S}^{t+1} \approx \bar{H}^{t+1} = A\bar{W}^t. \tag{4.5} \]
If $\bar{S}^t$ is also close to $S^t_j$, then we can obtain
\[ \bar{W}^{t+1} \approx \mathrm{QR}(\bar{S}^{t+1}). \tag{4.6} \]
We can observe that Eqn. (4.5) and (4.6) are the two steps of a power iteration, up to some perturbation. Thus, based on $\bar{S}^t$, $\bar{G}^t$, and $\bar{H}^t$, DeEPCA fits into the framework of the power method with some perturbation. This is the reason why
DeEPCA will converge to the top-$k$ principal components of $A$.

Next, we will bound the error between local and mean variables (defined in Section 4.1), such as $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ (in Lemma 4) and $\|W^{t+1}-\bar{W}^{t+1}\otimes\mathbf{1}\|$ (in Lemma 6). Lemma 4 shows that $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ decays with a rate $\rho < 1$, up to the additive term $L\rho\|W^t-W^{t-1}\|$. When DeEPCA converges, $W^t$ and $W^{t-1}$ both converge to the top-$k$ principal components, that is, $\|W^t-W^{t-1}\|$ converges to zero (in Lemma 8). Thus, $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ will also converge to zero. This implies that $\|W^t-\bar{W}^t\otimes\mathbf{1}\|$ goes to zero as $t$ increases, by Lemma 6. Hence, the noisy power method described in Eqn. (4.5) and (4.6) gradually becomes the exact power method.

Finally, Lemma 7 shows that $\tan\theta_k(U,\bar{S}^t)$ converges with rate $\gamma = 1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}$ when the perturbation term $\|[\bar{S}^t]^\dagger\|\,\|S^t-\bar{S}^t\otimes\mathbf{1}\|$ is upper bounded as in Eqn. (4.12). Combining Lemma 4, Lemma 5 and Lemma 7, we use induction in the proof of Lemma 1 to show that the assumption (4.12) and Eqn. (4.13) hold for $t = 1,\dots,T+1$ when $\rho$ is properly chosen. This leads to the results of Lemma 1.

4.3 Technical Lemmas

In our analysis, we aim to show that $\tan\theta_k(U,\bar{S}^{T+1})$ and $\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\|$ converge to $\epsilon$. First, we give the relationship between $\bar{S}^t$, $\bar{G}^t$, and $\bar{H}^t$; based on these quantities, DeEPCA fits into the framework of the power method with some perturbation.
Lemma 2.
Let $\bar{W}^0$, $\bar{G}^0$, and $\bar{S}^0$ be initialized as $W^0$. Supposing $\bar{G}^t$ and $\bar{S}^t$ are defined in Eqn. (4.4) and $S^t$ updates as in Eqn. (4.2), it holds that
\[ \bar{S}^{t+1} = \bar{S}^t + \bar{G}^{t+1} - \bar{G}^t = \bar{G}^{t+1}. \]

Lemma 3.
Letting $\bar{G}^t$ and $\bar{H}^t$ be defined in Eqn. (4.4) and $\|A_j\| \le L$ for $j = 1,\dots,m$, they have the following property:
\[ \big\|\bar{G}^t-\bar{H}^t\big\| \le \frac{L}{\sqrt{m}}\,\big\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\|. \tag{4.7} \]

In the next lemmas, we will bound the error between local and mean variables (defined in Section 4.1). First, we upper bound the error $\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\|$ recursively.

Lemma 4.
Letting $S^t$ be updated as in Eqn. (4.2) and $\|A_j\| \le L$, then $S^{t+1}$ and $\bar{S}^{t+1}$ have the following property:
\[ \big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + L\rho\,\big\|W^t-W^{t-1}\big\|, \quad\text{with } \rho \triangleq \Big(1-\sqrt{1-\lambda_2(L)}\Big)^K. \tag{4.8} \]

Lemma 5.
If for $t = 0,1,\dots,t_0$ it holds that $\sigma_{\min}(U^\top\tilde{W}^t) > 0$, with $\tilde{W}$ defined in Eqn. (4.4) and $U$ being the top-$k$ principal components of $A$, then we can obtain that
\[ \sigma_{\min}(\bar{S}^{t+1}) \ge \frac{\lambda_k}{\sqrt{1+\ell^2(\bar{S}^t)}} - \frac{24L}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \tag{4.9} \]

Now, we will bound the error $\|W^t-\bar{W}^t\otimes\mathbf{1}\|$.

Lemma 6.
Assuming that $\|[\bar{S}^t]^\dagger\|\,\|\bar{S}^t-S^t_j\| \le \frac{1}{2}$ for $j = 1,\dots,m$, where $[\bar{S}^t]^\dagger$ is the pseudo-inverse of $\bar{S}^t$, then it holds that
\[ \big\|W^t-\bar{W}^t\otimes\mathbf{1}\big\| \le 12\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \tag{4.10} \]
Letting $\bar{S}^t = \tilde{W}^t\tilde{R}^t$ be the QR decomposition of $\bar{S}^t$, it also holds that
\[ \big\|\tilde{W}^t-\bar{W}^t\big\| \le \frac{12}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \tag{4.11} \]

Next, we give the convergence rate of $\bar{S}^t$ under the assumption that the error between the local variable $S^t_j$ and its mean counterpart $\bar{S}^t$ is upper bounded.

Lemma 7.
Letting $\ell(\bar{S}^0) \triangleq \tan\theta_k(U,\bar{S}^0)$ and $\gamma \triangleq 1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}$, and supposing that
\[ \frac{1}{\sqrt{m}}\cdot\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le \frac{(\lambda_k-\lambda_{k+1})\cdot\gamma^t\cdot\ell(\bar{S}^0)}{24\sqrt{1+\gamma^{2t}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)} \tag{4.12} \]
for $t = 0,1,\dots,T$, the sequence $\{\bar{S}^t\}$ generated by Algorithm 1 satisfies
\[ \ell(\bar{S}^{t+1}) \le \gamma^{t+1}\cdot\ell(\bar{S}^0). \tag{4.13} \]

Finally, we will bound the difference between $W^t$ and $W^{t-1}$.

Lemma 8.
Letting $W$ be defined in Eqn. (4.4) and $\ell(\bar{S}^0) \triangleq \tan\theta_k(U,\bar{S}^0)$, it holds that
\[ \big\|W^t-W^{t-1}\big\| \le 24\Big(\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\,\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big). \tag{4.14} \]

4.4 Proofs of Lemma 1 and Theorem 1

Using the lemmas in the previous subsection, we can prove Lemma 1 and Theorem 1 as follows.
Proof of Lemma 1.
We prove the result by induction. When $t = 0$, Eqn. (4.12) holds since each agent shares the same initialization. This implies that $\ell(\bar{S}^1) \le \gamma\cdot\ell(\bar{S}^0)$.

Now, assume that Eqn. (4.12) and (4.13) hold for $t = 0,\dots,T$. In this case, for $t = 1,\dots,T$ it holds that $\ell(\bar{S}^t) \le \gamma^t\ell(\bar{S}^0)$ and
\[ \frac{1}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| \le \frac{(\lambda_k-\lambda_{k+1})\,\gamma^t\,\ell(\bar{S}^0)}{24\sqrt{1+\gamma^{2t}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{t+1}\ell(\bar{S}^0)\big)} \le \frac{\lambda_k-\lambda_{k+1}}{24(\lambda_{k+1}+2L)}\cdot\gamma^t\,\ell(\bar{S}^0). \]
We will show that the result holds for $t = T+1$; it suffices to prove that Eqn. (4.12) holds for $t = T+1$. First, by Eqn. (4.8) and (4.14),
\[ \big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho L\sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big) + 24\rho L\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big), \]
and combining this with the two displayed bounds above yields
\[ \frac{1}{\sqrt{m}}\,\big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \le \frac{\rho}{\sqrt{m}}\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + 2\rho L(\sqrt{k}+1)\,\gamma^{t-1}\ell(\bar{S}^0). \]
Using this inequality recursively, together with $\|S^0-\bar{S}^0\otimes\mathbf{1}\| = 0$ (each agent shares the same initialization) and the assumption $\rho \le \gamma/2$,
\[ \frac{1}{\sqrt{m}}\,\big\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\big\| \le 2\rho L(\sqrt{k}+1)\,\ell(\bar{S}^0)\sum_{i=1}^{T}\rho^{T-i}\gamma^{i} = 2\rho L(\sqrt{k}+1)\,\ell(\bar{S}^0)\cdot\frac{\gamma^T-\gamma\rho^T}{\gamma-\rho} \le 4\rho L(\sqrt{k}+1)\,\gamma^{T-1}\ell(\bar{S}^0). \]
Furthermore, we have
\[ \sigma_{\min}(\bar{S}^{T+1}) \overset{(4.9),(4.12)}{\ge} \frac{\lambda_k}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}} - \frac{L(\lambda_k-\lambda_{k+1})\,\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)} = \frac{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
Therefore, we can obtain
\[ \frac{1}{\sqrt{m}}\,\big\|[\bar{S}^{T+1}]^\dagger\big\|\,\big\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\big\| \le \frac{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}\cdot 4\rho L(\sqrt{k}+1)\,\gamma^{T-1}\ell(\bar{S}^0). \]
First, we need to satisfy the condition in Lemma 6, that is,
\[ \big\|[\bar{S}^{T+1}]^\dagger\big\|\,\big\|S^{T+1}_j-\bar{S}^{T+1}\big\| \le \big\|[\bar{S}^{T+1}]^\dagger\big\|\,\big\|S^{T+1}-\bar{S}^{T+1}\otimes\mathbf{1}\big\| \le \frac{1}{2}. \]
Therefore, $\rho$ only needs to satisfy
\[ \rho \le \frac{1}{8Lk(\sqrt{k}+1)\sqrt{m}\,\gamma^{T-1}\ell(\bar{S}^0)}\cdot\frac{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}, \]
and to simplify the above, we only require $\rho$ to be
\[ \rho \le \frac{\lambda_k\lambda_{k+1}+2L\lambda_k}{8Lk(\sqrt{k}+1)\sqrt{m}\,\gamma^{T-1}\ell(\bar{S}^0)\,\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
To satisfy Eqn. (4.12) for $t = T+1$, $\rho$ only needs
\[ \rho \le \frac{(\lambda_k-\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\sqrt{1+\gamma^{2(T+1)}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+2}\ell(\bar{S}^0)\big)}\cdot\frac{\lambda_k\lambda_{k+1}+2L\lambda_k+\big(\frac{\lambda_k(\lambda_k+\lambda_{k+1})}{2}+2L\lambda_{k+1}\big)\gamma^T\ell(\bar{S}^0)}{\sqrt{1+\gamma^{2T}\ell^2(\bar{S}^0)}\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
To simplify the above, we only require $\rho$ to be
\[ \rho \le \frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\,\big(1+\gamma^{2T}\ell^2(\bar{S}^0)\big)\,\big(\lambda_{k+1}+2L+(\lambda_k+2L)\gamma^{T+1}\ell(\bar{S}^0)\big)}. \]
Since Eqn. (4.12) holds for $t = T+1$ when $\rho$ satisfies condition (3.5), Eqn. (4.13) also holds for $t = T+1$. This concludes the proof.

Using the results of Lemma 1, we can prove Theorem 1 as follows.

Proof of Theorem 1.
First, by Eqn. (4.10), Eqn. (3.7), and the condition $T \ge \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\epsilon}$, we can obtain
\[ \big\|W^T-\bar{W}^T\otimes\mathbf{1}\big\| \le \sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{2(\lambda_{k+1}+2L)}\cdot\gamma^T\,\ell(\bar{S}^0) \le \epsilon. \]
Similarly, we can obtain $\tan\theta_k(U,\bar{S}^T) \le \epsilon$. Thus we obtain the results in Eqn. (3.10). Furthermore, by the definition of angles between two subspaces, we have
\[ \tan\theta_k(U,W^T_j) \overset{(2.1)}{=} \max_{\|w\|=1}\frac{\|V^\top W^T_j w\|}{\|U^\top W^T_j w\|} \le \max_{\|w\|=1}\frac{\|V^\top\bar{W}^T w\|+\|W^T_j-\bar{W}^T\|}{\|U^\top\bar{W}^T w\|-\|W^T_j-\bar{W}^T\|} \le \max_{\|w\|=1}\frac{\|V^\top\tilde{W}^T w\|+\|\tilde{W}^T-\bar{W}^T\|+\|W^T_j-\bar{W}^T\|}{\|U^\top\tilde{W}^T w\|-\|\tilde{W}^T-\bar{W}^T\|-\|W^T_j-\bar{W}^T\|} \]
\[ \overset{(4.10),(4.11)}{\le} \max_{\|w\|=1}\frac{\|V^\top\tilde{W}^T w\|+24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|}{\|U^\top\tilde{W}^T w\|-24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|} = \frac{\tan\theta_k(U,\tilde{W}^T)+24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|/\cos\theta_k(U,\tilde{W}^T)}{1-24\,\|[\bar{S}^T]^\dagger\|\,\|S^T-\bar{S}^T\otimes\mathbf{1}\|/\cos\theta_k(U,\tilde{W}^T)} \]
\[ \overset{(3.6),(3.7)}{\le} \frac{\gamma^T\tan\theta_k(U,W^0)+\sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{\lambda_k+2L}\cdot\gamma^T\tan\theta_k(U,W^0)\cdot\sqrt{1+\gamma^{2T}\tan^2\theta_k(U,W^0)}}{1-\sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{\lambda_k+2L}\cdot\gamma^T\tan\theta_k(U,W^0)\cdot\sqrt{1+\gamma^{2T}\tan^2\theta_k(U,W^0)}}. \]
Since $T \ge \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{4\tan\theta_k(U,W^0)}{\epsilon}$, it holds that $\gamma^T\tan\theta_k(U,W^0) \le \epsilon/4$. Furthermore, when $T \ge \frac{2\lambda_k}{\lambda_k-\lambda_{k+1}}\log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\epsilon}$, it holds that $\sqrt{m}\cdot\frac{\lambda_k-\lambda_{k+1}}{\lambda_k+2L}\cdot\gamma^T\tan\theta_k(U,W^0) \le \epsilon/4$. Thus, when $\epsilon < 1$, we can obtain
\[ \tan\theta_k(U,W^T_j) \le \frac{\epsilon/4+\epsilon/4\cdot\sqrt{5/4}}{1-1/4\cdot\sqrt{5/4}} < \epsilon. \]
Since the right-hand side of Eqn. (3.5) is monotone in $t$, $\rho$ only needs to satisfy
\[ \rho \le \min\left\{\frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\big(1+\ell^2(\bar{S}^0)\big)\big(\lambda_{k+1}+2L+(\lambda_k+2L)\ell(\bar{S}^0)\big)},\ \frac{\lambda_k\lambda_{k+1}+2L\lambda_k}{8Lk(\sqrt{k}+1)\sqrt{m}\,\ell(\bar{S}^0)\sqrt{1+\ell^2(\bar{S}^0)}\big(\lambda_{k+1}+2L+(\lambda_k+2L)\ell(\bar{S}^0)\big)}\right\}. \]
Furthermore, $\rho$ only requires
\[ \rho \le \frac{(\lambda_k-\lambda_{k+1})(\lambda_k\lambda_{k+1}+2L\lambda_{k+1})\cdot\gamma}{96kL(\sqrt{k}+1)\big(1+\ell^2(\bar{S}^0)\big)(\lambda_k+2L)\big(1+\ell(\bar{S}^0)\big)}. \tag{4.15} \]
Replacing the definition of $\ell(\bar{S}^0)$ and using Proposition 1, we can obtain that if $K$ satisfies
\[ K \ge \frac{1}{\sqrt{1-\lambda_2(L)}}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}, \]
then the requirement on $\rho$ in Eqn. (4.15) is satisfied. Combining with the iteration complexity, we obtain the total communication complexity
\[ C = T\times K = \frac{2\lambda_k}{(\lambda_k-\lambda_{k+1})\sqrt{1-\lambda_2(L)}}\cdot\max\left\{\log\frac{4\tan\theta_k(U,W^0)}{\epsilon},\ \log\frac{4(\lambda_k+2L)\tan\theta_k(U,W^0)}{\sqrt{m}(\lambda_k-\lambda_{k+1})\epsilon}\right\}\cdot\log\frac{96kL(\sqrt{k}+1)(\lambda_k+2L)\big(1+\tan^2\theta_k(U,W^0)\big)}{\lambda_{k+1}(\lambda_k-\lambda_{k+1})\cdot\big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\big)}. \]

5 Experiments

In the previous sections, we presented a theoretical analysis of our algorithm. In this section, we provide empirical studies.
Experiment Setting
In our experiments, we consider random networks where each pair of agents is connected with probability $p = 0.5$. We set $L = I-\frac{M}{\lambda_{\max}(M)}$, where $M$ is the Laplacian matrix associated with a weighted graph. We set $m = 50$, that is, there are 50 agents in the network, and the gossip matrix $L$ constructed this way has a fixed spectral gap $1-\lambda_2(L)$. For 'w8a', we set $n = 800$ and $d = 300$; for 'a9a', we set $n = 600$ and $d = 123$. For each agent, $A_j$ has the following form:
\[ A = \frac{1}{m}\sum_{j=1}^m A_j, \quad\text{and}\quad A_j = \sum_{i=1}^n v_i v_i^\top, \quad\text{with } v_i = a_{(j-1)n+i}, \tag{5.1} \]
where $a_{(j-1)n+i} \in \mathbb{R}^d$ is the $((j-1)n+i)$-th input vector of the dataset.

[Figure 1: Experiment on 'w8a'. Panels (a)-(i) plot $\|S-\bar{S}\otimes\mathbf{1}\|$, $\|W-\bar{W}\otimes\mathbf{1}\|$, and $\tan\theta_k(U,W)$ (log scale) against the number of power iterations for $K = 3$, $K = 5$, and $K = 10$, comparing DeEPCA, DePCA, and CPCA.]
[Figure 2: Experiment on 'a9a'. Panels (a)-(i) plot $\|S-\bar{S}\otimes\mathbf{1}\|$, $\|W-\bar{W}\otimes\mathbf{1}\|$, and $\tan\theta_k(U,W)$ (log scale) against the number of power iterations for $K = 1$, $K = 5$, and $K = 10$, comparing DeEPCA, DePCA, and CPCA.]
Experiment Results
In our experiments, we compare DeEPCA with decentralized PCA (DePCA) (Wai et al., 2017) and centralized PCA (CPCA). We also study empirically how the consensus steps affect the convergence rate of DeEPCA; thus, we run with several different values of $K$. We report the convergence of $\|S^t-\bar{S}^t\otimes\mathbf{1}\|$, $\|W^t-\bar{W}^t\otimes\mathbf{1}\|$, and $\frac{1}{m}\sum_{j=1}^m\tan\theta_k(U,W^t_j)$ in Figure 1 and Figure 2.

Figure 1 shows that the multi-consensus step is required in our DeEPCA. When $K = 3$, DeEPCA cannot converge to the top-$k$ principal components of $A$. The number of consensus steps of DeEPCA in each power iteration should be determined by the heterogeneity of the data, just as discussed in Remark 2. Furthermore, once the consensus steps of DeEPCA are sufficient, DeEPCA achieves a fast convergence rate comparable to centralized PCA, as can be observed from Figure 1 and Figure 2. This validates our convergence analysis of DeEPCA in Theorem 1.

Figure 1 and Figure 2 also show that, without increasing its consensus steps, DePCA cannot converge to the top-$k$ principal components of $A$. Lacking subspace tracking, DePCA can only reach a high-precision solution through an increasing number of consensus steps, as can be observed from the third columns of Figure 1 and Figure 2. Comparing DeEPCA and DePCA, we can conclude that DeEPCA has great advantages in communication cost.
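For reference, the reported metric $\frac{1}{m}\sum_{j=1}^m\tan\theta_k(U,W^t_j)$ can be computed as follows; this is a self-contained sketch of ours based on Eqn. (2.2), with assumed helper names:

```python
import numpy as np

def tan_theta_k(U, W):
    """tan of the k-th principal angle between span(U) and span(W), Eqn. (2.2)."""
    d, k = U.shape
    V = np.linalg.svd(np.eye(d) - U @ U.T)[0][:, :d - k]
    return np.linalg.norm(V.T @ W @ np.linalg.inv(U.T @ W), 2)

def mean_tan_theta(U, W_stack):
    """The reported metric: average tan(theta_k) over the m agents' bases."""
    return np.mean([tan_theta_k(U, W_j) for W_j in W_stack])
```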
6 Conclusion

This paper proposed a novel decentralized PCA algorithm, DeEPCA, that achieves a linear convergence rate similar to the centralized PCA method, while the number of communications per multi-consensus step does not depend on the target precision $\epsilon$. In this way, DeEPCA achieves the best known communication complexity for decentralized PCA. Our experiments also verify the communication efficiency of DeEPCA. Although the analysis of DeEPCA is based on undirected graphs and 'FastMix', it can be easily extended to handle directed graphs, because our analysis of DeEPCA only requires averaging. As a final remark, we note that DeEPCA employs the power method, which can be applied to eigenvector finding, low-rank matrix approximation, spectral analysis, etc. Therefore, DeEPCA can be used to design communication-efficient decentralized algorithms for these problems as well.
References
Bertrand, A. & Moonen, M. (2014). Distributed adaptive estimation of covariance matrix eigenvectors in wireless sensor networks with application to distributed PCA. Signal Processing, 104, 120-135.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Cadima, J., Cerdeira, J. O., & Minhoto, M. (2004). Computational aspects of algorithms for variable selection in the context of principal components. Computational Statistics & Data Analysis, 47(2), 225-236.

Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (pp. 1646-1654).

Dhillon, P. S., Foster, D. P., & Ungar, L. H. (2015). Eigenwords: Spectral word embeddings. The Journal of Machine Learning Research, 16(1), 3035-3078.

Ding, C. & He, X. (2004). K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning (pp. 29).

Golub, G. H. & Van Loan, C. F. (2012). Matrix Computations, volume 3. JHU Press.

Hardt, M. & Price, E. (2014). The noisy power method: A meta algorithm with applications. Advances in Neural Information Processing Systems, 27, 2861-2869.

Horn, R. A. & Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Johnson, R. & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 315-323.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.

Kempe, D. & McSherry, F. (2008). A decentralized algorithm for spectral analysis. Journal of Computer and System Sciences, 74(1), 70-83.

Lee, D., Lee, W., Lee, Y., & Pawitan, Y. (2010). Super-sparse principal component analyses for high-throughput genomic data. BMC Bioinformatics, 11(1), 296.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., & Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems (pp. 5330-5340).

Liu, J. & Morse, A. S. (2011). Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2), 160-165.

Moon, H. & Phillips, P. J. (2001). Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30(3), 303-321.

Nedic, A. & Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48-61.

Qu, G. & Li, N. (2017). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245-1260.

Qu, Y., Ostrouchov, G., Samatova, N., & Geist, A. (2002). Principal component analysis for dimension reduction in massive distributed data sets. In Proceedings of IEEE International Conference on Data Mining (ICDM), volume 1318 (pp. 1788).

Raja, H. & Bajwa, W. U. (2015). Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data. IEEE Transactions on Signal Processing, 64(1), 173-188.

Scaglione, A., Pagliari, R., & Krim, H. (2008). The decentralized estimation of the sample covariance. (pp. 1722-1726). IEEE.

Shi, W., Ling, Q., Wu, G., & Yin, W. (2015). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2), 944-966.

Stewart, G. (1977). Perturbation bounds for the QR factorization of a matrix. SIAM Journal on Numerical Analysis, 14(3), 509-518.

Suleiman, W., Pesavento, M., & Zoubir, A. M. (2016). Performance analysis of the decentralized eigendecomposition and ESPRIT algorithm. IEEE Transactions on Signal Processing, 64(9), 2375-2386.

Wai, H.-T., Scaglione, A., Lafond, J., & Moulines, E. (2017). Fast and privacy preserving distributed low-rank regression. (pp. 4451-4455). IEEE.

Wu, S. X., Wai, H.-T., Li, L., & Scaglione, A. (2018). A review of distributed algorithms for principal component analysis. Proceedings of the IEEE, 106(8), 1321-1340.

Xiao, L. & Boyd, S. (2004). Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1), 65-78.

Ye, H., Luo, L., Zhou, Z., & Zhang, T. (2020). Multi-consensus decentralized accelerated gradient descent. arXiv preprint arXiv:2005.00797.

Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3), 1835-1854.
A Proof of Lemmas in Section 4.3
We will prove our lemmas in the order of their appearance.
A.1 Proof of Lemma 2
Proof of Lemma 2.
First, because the operation 'FastMix' is linear, we can obtain that
\[ \bar{S}^{t+1} = \bar{S}^t + \bar{G}^{t+1} - \bar{G}^t. \]
We prove the result by induction. When $t = 0$, it holds that $\bar{S}^0 = \bar{G}^0 = W^0$. Supposing that $\bar{S}^t = \bar{G}^t$, we then have
\[ \bar{S}^{t+1} = \bar{S}^t + \bar{G}^{t+1} - \bar{G}^t = \bar{G}^{t+1}. \]
Thus, for each $t = 0,1,\dots$, it holds that $\bar{S}^t = \bar{G}^t$.

A.2 Proof of Lemma 3
Proof of Lemma 3.
By the definition of $\bar{G}^t$ and $\bar{H}^t$ in Eqn. (4.4), we have
\[ \big\|\bar{G}^t-\bar{H}^t\big\| = \Big\|\frac{1}{m}\sum_{j=1}^m A_j W^{t-1}_j - \frac{1}{m}\sum_{j=1}^m A_j\bar{W}^{t-1}\Big\| \le \frac{1}{m}\sum_{j=1}^m\big\|A_j(W^{t-1}_j-\bar{W}^{t-1})\big\| \le \frac{1}{m}\sum_{j=1}^m\|A_j\|\cdot\big\|W^{t-1}_j-\bar{W}^{t-1}\big\| \]
\[ \le \frac{L}{m}\sum_{j=1}^m\big\|W^{t-1}_j-\bar{W}^{t-1}\big\| \le \frac{L}{\sqrt{m}}\,\big\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\|, \]
where the third inequality is because of the assumption $\|A_j\| \le L$ for $j = 1,\dots,m$, and the last step uses the Cauchy-Schwarz inequality.

A.3 Proof of Lemma 4
Proof of Lemma 4.
For notational convenience, we use $\mathcal{T}(W)$ to denote the 'FastMix' operation on $W$ used in Algorithm 1, that is, $\mathcal{T}(W) \triangleq \mathrm{FastMix}(W, K)$. Then for $W$ it holds that
\[ \big\|\mathcal{T}(W)-\bar{W}\otimes\mathbf{1}\big\| \le \rho\cdot\big\|W-\bar{W}\otimes\mathbf{1}\big\|. \tag{A.1} \]
It is obvious that the 'FastMix' operation $\mathcal{T}(\cdot)$ is linear. By the update rule of $S^t$, we have
\[ \big\|S^{t+1}-\bar{S}^{t+1}\otimes\mathbf{1}\big\| \overset{(4.2)}{=} \big\|\mathcal{T}(S^t+G^{t+1}-G^t) - (\bar{S}^t+\bar{G}^{t+1}-\bar{G}^t)\otimes\mathbf{1}\big\| \overset{(A.1)}{\le} \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho\,\big\|G^{t+1}-G^t-(\bar{G}^{t+1}-\bar{G}^t)\otimes\mathbf{1}\big\| \]
\[ \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho\,\big\|G^{t+1}-G^t\big\| = \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \rho\sqrt{\sum_{j=1}^m\big\|A_j(W^t_j-W^{t-1}_j)\big\|^2} \le \rho\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + L\rho\,\big\|W^t-W^{t-1}\big\|, \]
where the second inequality uses the fact that for any $W \in \mathbb{R}^{d\times k\times m}$,
\[ \big\|W-\bar{W}\otimes\mathbf{1}\big\|^2 = \sum_{j=1}^m\Big\|W_j-\frac{1}{m}\sum_{i=1}^m W_i\Big\|^2 = \sum_{j=1}^m\|W_j\|^2 - m\Big\|\frac{1}{m}\sum_{i=1}^m W_i\Big\|^2 \le \sum_{j=1}^m\|W_j\|^2 = \|W\|^2, \]
and the last inequality is because
\[ \sum_{j=1}^m\big\|A_j(W^t_j-W^{t-1}_j)\big\|^2 \le \sum_{j=1}^m\|A_j\|^2\,\big\|W^t_j-W^{t-1}_j\big\|^2 \le L^2\sum_{j=1}^m\big\|W^t_j-W^{t-1}_j\big\|^2 = L^2\,\big\|W^t-W^{t-1}\big\|^2. \]

A.4 Proof of Lemma 5

Proof of Lemma 5.
By the definition of $\sigma_{\min}(\bar{S}^{t+1})$ and Lemma 2, we can obtain
\[ \sigma_{\min}(\bar{S}^{t+1}) = \sigma_{\min}(\bar{G}^{t+1}) \ge \sigma_{\min}(\bar{H}^{t+1}) - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| = \sigma_{\min}(A\bar{W}^t) - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| \]
\[ \ge \sigma_{\min}(A\tilde{W}^t) - \big\|A(\tilde{W}^t-\bar{W}^t)\big\| - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| \ge \sigma_{\min}(A\tilde{W}^t) - L\,\big\|\tilde{W}^t-\bar{W}^t\big\| - \big\|\bar{H}^{t+1}-\bar{G}^{t+1}\big\| \]
\[ \overset{(4.7),(4.10),(4.11)}{\ge} \sigma_{\min}(A\tilde{W}^t) - \frac{24L}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \]
Furthermore, we have
\[ \sigma_{\min}(A\tilde{W}^t) = \sigma_{\min}\begin{bmatrix}\Sigma_k U^\top\tilde{W}^t\\ \Sigma_{\setminus k}V^\top\tilde{W}^t\end{bmatrix} \ge \sigma_{\min}\big(\Sigma_k U^\top\tilde{W}^t\big) \ge \lambda_k\cdot\sigma_{\min}(U^\top\tilde{W}^t) \overset{(2.2)}{=} \lambda_k\cdot\cos\theta_k(U,\tilde{W}^t) = \frac{\lambda_k}{\sqrt{1+\ell^2(\bar{S}^t)}}, \]
where the first inequality is because of Corollary 7.3.6 of Horn & Johnson (2012) and the matrix $\Sigma_k U^\top\tilde{W}^t$ is non-singular. Therefore, we can obtain
\[ \sigma_{\min}(\bar{S}^{t+1}) \ge \frac{\lambda_k}{\sqrt{1+\ell^2(\bar{S}^t)}} - \frac{24L}{\sqrt{m}}\,\big\|[\bar{S}^t]^\dagger\big\|\,\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\|. \]

A.5 Proof of Lemma 6
First, we give an important lemma that will be used in our proof.
Lemma 9 (Theorem 3.1 of Stewart (1977)). Let $A = QR$, where $A \in \mathbb{R}^{d\times k}$ has rank $k$ and $Q^\top Q = I$ with $I$ being the identity matrix. Let $E$ satisfy $\|A^\dagger\|\|E\| < 1$, where $A^\dagger$ is the pseudo-inverse of $A$. Moreover, let $A+E = (Q+\Delta Q)(R+\Delta R)$, where $Q+\Delta Q$ has orthonormal columns. Then it holds that
\[ \|\Delta Q\| \le \frac{\|A^\dagger\|\|E\|}{1-\|A^\dagger\|\|E\|}. \tag{A.2} \]
For notational convenience, we omit the superscript $t$. Let $S_j = W_j R_j$ and $\bar{S} = \tilde{W}\tilde{R}$ be the QR decompositions of $S_j$ and $\bar{S}$, respectively. By Lemma 9 with $A = \bar{S}$ and $E = S_j-\bar{S}$, the assumption $\|\bar{S}^\dagger\|\|\bar{S}-S_j\| \le \frac{1}{2}$ gives
\[ \|W_j-\tilde{W}\| \le \frac{\|\bar{S}^\dagger\|\|\bar{S}-S_j\|}{1-\|\bar{S}^\dagger\|\|\bar{S}-S_j\|} \le 2\,\|\bar{S}^\dagger\|\|\bar{S}-S_j\|. \]
Then we have
\[ \big\|W-\bar{W}\otimes\mathbf{1}\big\| \le \sqrt{2\sum_{j=1}^m\|W_j-\tilde{W}\|^2} + \sqrt{2m}\,\|\tilde{W}-\bar{W}\| \le 2\sqrt{2\sum_{j=1}^m\|W_j-\tilde{W}\|^2} \le 4\sqrt{2}\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\| \le 12\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\|, \]
where the second inequality holds because $\sqrt{m}\,\|\tilde{W}-\bar{W}\| = \sqrt{m}\,\big\|\frac{1}{m}\sum_i(\tilde{W}-W_i)\big\| \le \sqrt{\sum_i\|\tilde{W}-W_i\|^2}$. Moreover,
\[ \|\tilde{W}-\bar{W}\| \le \frac{1}{m}\sum_{j=1}^m\|\tilde{W}-W_j\| \le \frac{2}{m}\,\big\|\bar{S}^\dagger\big\|\sum_{j=1}^m\|\bar{S}-S_j\| \le \frac{2}{\sqrt{m}}\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\| \le \frac{12}{\sqrt{m}}\,\big\|\bar{S}^\dagger\big\|\,\big\|S-\bar{S}\otimes\mathbf{1}\big\|. \]

A.6 Proof of Lemma 7

Proof of Lemma 7.
By the update rule of Algorithm 1, we can obtain
\[ \ell(\bar{S}^{t+1}) = \tan\theta_k(U,\bar{S}^{t+1}) = \max_{\|w\|=1}\frac{\|V^\top\bar{S}^{t+1}w\|}{\|U^\top\bar{S}^{t+1}w\|} = \max_{\|w\|=1}\frac{\|V^\top\bar{G}^{t+1}w\|}{\|U^\top\bar{G}^{t+1}w\|} \le \max_{\|w\|=1}\frac{\|V^\top\bar{H}^{t+1}w\|+\|\bar{G}^{t+1}-\bar{H}^{t+1}\|}{\|U^\top\bar{H}^{t+1}w\|-\|\bar{G}^{t+1}-\bar{H}^{t+1}\|} \]
\[ \overset{(4.7)}{\le} \max_{\|w\|=1}\frac{\|V^\top A\bar{W}^t w\|+\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|}{\|U^\top A\bar{W}^t w\|-\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|} \le \max_{\|w\|=1}\frac{\lambda_{k+1}\|V^\top\bar{W}^t w\|+\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|}{\lambda_k\|U^\top\bar{W}^t w\|-\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|} \]
\[ \le \max_{\|w\|=1}\frac{\lambda_{k+1}\|V^\top\tilde{W}^t w\|+\lambda_{k+1}\|\tilde{W}^t-\bar{W}^t\|+\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|}{\lambda_k\|U^\top\tilde{W}^t w\|-\lambda_k\|\tilde{W}^t-\bar{W}^t\|-\frac{L}{\sqrt{m}}\|W^t-\bar{W}^t\otimes\mathbf{1}\|} \overset{(4.10),(4.11)}{\le} \max_{\|w\|=1}\frac{\lambda_{k+1}\|V^\top\tilde{W}^t w\|+\frac{12(\lambda_{k+1}+L)}{\sqrt{m}}\|[\bar{S}^t]^\dagger\|\|S^t-\bar{S}^t\otimes\mathbf{1}\|}{\lambda_k\|U^\top\tilde{W}^t w\|-\frac{12(\lambda_k+L)}{\sqrt{m}}\|[\bar{S}^t]^\dagger\|\|S^t-\bar{S}^t\otimes\mathbf{1}\|}. \]
Furthermore, we have
\[ \frac{1}{\|U^\top\tilde{W}^t w\|} \le \max_{\|w\|=1}\frac{1}{\|U^\top\tilde{W}^t w\|} = \frac{1}{\cos\theta_k(U,\tilde{W}^t)}. \]
Dividing numerator and denominator by $\|U^\top\tilde{W}^t w\|$ then yields
\[ \ell(\bar{S}^{t+1}) \le \frac{\lambda_{k+1}\,\ell(\bar{S}^t) + \frac{12(\lambda_{k+1}+L)}{\sqrt{m}}\,\|[\bar{S}^t]^\dagger\|\,\|S^t-\bar{S}^t\otimes\mathbf{1}\|\cdot\sqrt{1+\ell^2(\bar{S}^t)}}{\lambda_k - \frac{12(\lambda_k+L)}{\sqrt{m}}\,\|[\bar{S}^t]^\dagger\|\,\|S^t-\bar{S}^t\otimes\mathbf{1}\|\cdot\sqrt{1+\ell^2(\bar{S}^t)}}, \tag{A.3} \]
where the last step uses the fact $1+\tan^2\theta = 1/\cos^2\theta$.

Now we prove the result by induction. When $t = 0$, the $S^0_j$ are equal to each other, that is, $\|S^0-\bar{S}^0\otimes\mathbf{1}\| = 0$. Hence we can obtain
\[ \ell(\bar{S}^1) \le \frac{\lambda_{k+1}}{\lambda_k}\,\ell(\bar{S}^0) < \Big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\Big)\cdot\ell(\bar{S}^0). \]
We assume that $\ell(\bar{S}^t) \le \gamma^t\cdot\ell(\bar{S}^0)$ and Eqn. (4.12) hold. Substituting these assumptions into Eqn. (A.3), we can obtain
\[ \ell(\bar{S}^{t+1}) \le \Big(1-\frac{\lambda_k-\lambda_{k+1}}{2\lambda_k}\Big)^{t+1}\cdot\ell(\bar{S}^0) = \gamma^{t+1}\cdot\ell(\bar{S}^0). \]
This concludes the proof.
A.7 Proof of Lemma 8
Proof of Lemma 8.
First, by the triangle inequality, we can obtain
\[ \big\|W^t-W^{t-1}\big\| \le \big\|W^t-\bar{W}^t\otimes\mathbf{1}\big\| + \big\|W^{t-1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\| + \big\|\bar{W}^t\otimes\mathbf{1}-\bar{W}^{t-1}\otimes\mathbf{1}\big\| \]
\[ \overset{(4.10)}{\le} 12\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{m}\,\big\|\bar{W}^t-\bar{W}^{t-1}\big\|. \]
Furthermore, we have
\[ \big\|\bar{W}^t-\bar{W}^{t-1}\big\| \le \big\|\bar{W}^t-U\big\| + \big\|\bar{W}^{t-1}-U\big\| \le \big\|\tilde{W}^t-U\big\| + \big\|\tilde{W}^t-\bar{W}^t\big\| + \big\|\tilde{W}^{t-1}-U\big\| + \big\|\tilde{W}^{t-1}-\bar{W}^{t-1}\big\| \]
\[ \overset{(4.11)}{\le} \big\|\tilde{W}^t-U\big\| + \big\|\tilde{W}^{t-1}-U\big\| + \frac{12}{\sqrt{m}}\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big). \]
Now we bound the value of $\|\tilde{W}^t-U\|$. Note that due to the sign adjustment in Eqn. (3.3) of Algorithm 1, $W^t$ and $W^{t-1}$ share the same direction, that is, the dot products of corresponding columns of $W^t$ and $W^{t-1}$ are positive. Thus, we can choose $U$ such that it shares the same direction with $W^t$ and $W^{t-1}$; in this case, $\tilde{W}^t$ and $\tilde{W}^{t-1}$ also share the same direction with $U$. Combining this with the definition of $\tilde{W}$ in Eqn. (4.4), we have
\[ \big\|\tilde{W}^t-U\big\|^2 = \big\|\tilde{W}^t\big\|^2 + \|U\|^2 - 2\big\langle\tilde{W}^t,U\big\rangle \le 2k - 2k\cdot\sigma_{\min}(U^\top\tilde{W}^t) = 2k\big(1-\cos\theta_k(\tilde{W}^t,U)\big) \]
\[ = 2k\Big(1-\frac{1}{\sqrt{1+\ell^2(\bar{S}^t)}}\Big) = 2k\cdot\frac{\ell^2(\bar{S}^t)}{\sqrt{1+\ell^2(\bar{S}^t)}\big(\sqrt{1+\ell^2(\bar{S}^t)}+1\big)} \le 2k\cdot\ell^2(\bar{S}^t), \]
where the first inequality is because $U^\top(:,i)\,\tilde{W}^t(:,i) > 0$ and
\[ \big\langle\tilde{W}^t,U\big\rangle = \sum_{i=1}^k U^\top(:,i)\,\tilde{W}^t(:,i) \ge k\cdot\sigma_{\min}(U^\top\tilde{W}^t). \]
Therefore, we can obtain
\[ \big\|W^t-W^{t-1}\big\| \le 12\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + 12\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big) \]
\[ = 24\Big(\big\|[\bar{S}^t]^\dagger\big\|\big\|S^t-\bar{S}^t\otimes\mathbf{1}\big\| + \big\|[\bar{S}^{t-1}]^\dagger\big\|\big\|S^{t-1}-\bar{S}^{t-1}\otimes\mathbf{1}\big\|\Big) + \sqrt{2mk}\,\big(\ell(\bar{S}^t)+\ell(\bar{S}^{t-1})\big). \]