Decentralized Riemannian Gradient Descent on the Stiefel Manifold
Shixiang Chen, Alfredo Garcia, Mingyi Hong, Shahin Shahrampour
Abstract
We consider distributed non-convex optimization where a network of agents aims at minimizing a global function over the Stiefel manifold. The global function is represented as a finite sum of smooth local functions, where each local function is associated with one agent and agents communicate with each other over an undirected connected graph. The problem is non-convex, as the local functions are possibly non-convex (but smooth) and the Stiefel manifold is a non-convex set. We present a decentralized Riemannian stochastic gradient method (DRSGD) with the convergence rate of O(1/√K) to a stationary point. To achieve exact convergence with a constant stepsize, we also propose a decentralized Riemannian gradient tracking algorithm (DRGTA) with the convergence rate of O(1/K) to a stationary point. We use multi-step consensus to keep the iterates in the local (consensus) region. DRGTA is the first decentralized algorithm with exact convergence for distributed optimization on the Stiefel manifold.

1 Introduction

Distributed optimization has received significant attention in the past few years in machine learning, control and signal processing. There are mainly two scenarios where distributed algorithms are necessary: (i) the data is geographically distributed over networks and/or (ii) the computation on a single (centralized) server is too expensive (large-scale data setting). In this paper, we consider the following multi-agent optimization problem

  min (1/n) Σ_{i=1}^n f_i(x_i)  s.t.  x_1 = x_2 = ... = x_n,  x_i ∈ M, ∀i = 1, ..., n,   (1.1)

where f_i has an L-Lipschitz continuous gradient in Euclidean space and M := St(d, r) = {x ∈ R^{d×r} : x^⊤ x = I_r} is the Stiefel manifold. Unlike the Euclidean distributed setting, problem (1.1) is defined on the Stiefel manifold, which is a non-convex set.
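As a concrete illustration, the feasible set of (1.1) can be instantiated in a few lines of NumPy. This is our own sketch (the toy sizes d, r, n and the QR-based random-point construction are our assumptions, not from the paper); it shows only what membership in St(d, r) means for each agent's local variable:

```python
import numpy as np

def random_stiefel(d, r, rng):
    """A random feasible point on St(d, r): orthonormal columns via QR."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

rng = np.random.default_rng(0)
d, r, n = 8, 3, 4                                   # toy sizes (ours)
x = [random_stiefel(d, r, rng) for _ in range(n)]   # one local copy per agent

# Feasibility x_i^T x_i = I_r holds for every agent; the consensus constraint
# x_1 = ... = x_n is what the decentralized algorithms must enforce.
for xi in x:
    assert np.allclose(xi.T @ xi, np.eye(r), atol=1e-12)
```

Each agent only ever touches its own x_i; the coupling x_1 = ... = x_n is enforced through communication, as developed below.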
Many important applications can be written in the form (1.1), e.g., decentralized spectral analysis [15, 19, 20], dictionary learning [34], eigenvalue estimation of the covariance matrix [32] in wireless sensor networks, and deep neural networks with orthogonality constraints [4, 18, 41]. Problem (1.1) can generally represent a risk minimization. One approach to solving (1.1) is collecting all variables at a central server and running a centralized algorithm. However, when the dataset is massive (or the data dimension is large), this causes memory issues and computational burden on the central server. It is then more efficient to take a decentralized approach and use local computation based on a network topology. In this case, each local function f_i is associated with one agent in the network, and agents communicate with each other over an undirected connected graph. For example, for stochastic gradient descent (SGD), [23] show that decentralized SGD can be faster than centralized SGD, especially when training neural networks. More importantly, a central server may not exist in practice.

∗ The Wm Michael Barnes '64 Department of Industrial and Systems Engineering, Texas A&M University, College Station, TX 77843. Email addresses: [email protected] (S. Chen), [email protected] (A. Garcia), [email protected] (S. Shahrampour).
† The Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455. Email address: [email protected] (M. Hong).

1.1 Our Contributions

In this paper, we focus on the decentralized setting and design efficient decentralized algorithms to solve (1.1) over any connected undirected network. Our contributions are as follows:

(1) We show the convergence of the decentralized stochastic Riemannian gradient method (Algorithm 1) for solving (1.1).
Specifically, the iteration complexity of obtaining an ε-stationary point (see Definition 2.2) is O(1/ε²) in expectation.

(2) To achieve exact convergence with a constant stepsize, we propose a gradient tracking algorithm (DRGTA, Algorithm 2) for solving (1.1). For DRGTA, the iteration complexity of obtaining an ε-stationary point is O(1/ε).

Importantly, both of the proposed algorithms are retraction-based, and DRGTA is vector-transport-free. These two features make the algorithms computationally cheap and conceptually simple. DRGTA is the first decentralized algorithm with exact convergence for distributed optimization on the Stiefel manifold.

1.2 Related Work

Decentralized optimization has been well studied in Euclidean space. Decentralized (sub)gradient methods were studied in [10, 30, 40, 46], and a distributed dual averaging subgradient method was proposed in [12]. However, with a constant stepsize β > 0, these methods can only converge to an O(β/(1−σ))-neighborhood of a stationary point, where σ is a network parameter (see Assumption 1). To achieve exact convergence with a fixed stepsize, gradient tracking algorithms were proposed in [11, 29, 33, 37, 44, 47], to name a few. The convergence analysis can be unified via a primal-dual framework [3]. Another way to use a constant stepsize is decentralized ADMM and its variants [6, 8, 27, 38]. Also, decentralized stochastic gradient methods for non-convex smooth problems were well studied in [5, 23, 43], etc. We refer to the survey paper [28] for a complete review of the state-of-the-art algorithms and the role of network topology.

Problem (1.1) can be thought of as a constrained decentralized problem in Euclidean space, but since the Stiefel manifold constraint is non-convex, none of the above works can solve the problem. On the other hand, we can also treat (1.1) as a smooth problem over the Stiefel manifold. However, the constraint x_1 = x_2 = ...
= x_n is difficult to handle due to the lack of linearity on M. Since the Stiefel manifold is an embedded submanifold of Euclidean space, our viewpoint is to treat the problem in Euclidean space and develop new tools based on Riemannian manifold optimization [1, 7, 13]. For the optimization problem (1.1), a decentralized Riemannian gradient tracking algorithm was presented in [36]. A vector transport operation has to be used in [36], which brings not only expensive computation but also analysis difficulties. Moreover, they need to use an asymptotically infinite number of consensus steps. Other distributed algorithms were either specifically designed for the PCA problem [15, 32, 34] or use a centralized topology [14, 19, 42]. For these decentralized algorithms, diminishing stepsizes or an asymptotically infinite number of communication steps have to be used to obtain an exact solution. Different from all these works, DRGTA requires a finite number of communication rounds with a constant stepsize.

As a special case of problem (1.1), the Riemannian consensus problem is well studied; see [9, 26, 35, 39]. Recently, it was shown in [9] that the multi-step consensus algorithm (DRCS) converges linearly to the global consensus in a local region.

Definition 1.1 (Consensus). Consensus is the configuration where x_i = x_j ∈ M for all i, j ∈ [n]. We define the consensus set as follows:

  X* := {x ∈ M^n : x_1 = x_2 = ... = x_n}.   (1.2)

Specifically, the DRCS iterates {x_k} have the following convergence property in a neighborhood of X*:

  dist(x_{k+1}, X*) ≤ ϑ · dist(x_k, X*),  ϑ ∈ (0, 1),   (1.3)

where dist²(x, X*) := min_{y∈M} (1/n) Σ_{i=1}^n ||y − x_i||²_F and x^⊤ = (x_1^⊤ x_2^⊤ ... x_n^⊤). The linear rate of DRCS sheds some light on designing decentralized Riemannian gradient methods on the Stiefel manifold; more details are provided in Section 3. We have omitted the dependence on network parameters here.

2 Preliminaries
Notation:
The undirected connected graph G = (V, E) is composed of |V| = n agents. We use x to denote the collection of all local variables x_i obtained by stacking them, i.e., x^⊤ = (x_1^⊤ x_2^⊤ ... x_n^⊤). The n-fold Cartesian product of M with itself is denoted by M^n = M × ... × M. We use [n] := {1, 2, ..., n}. For x ∈ M^n, we denote the i-th block by [x]_i = x_i. We denote the tangent space of M at a point x by T_x M and the normal space by N_x M. The inner product on T_x M is induced from the Euclidean inner product ⟨x, y⟩ = Tr(x^⊤ y). Denote by ||·||_F the Frobenius norm and by ||·||₂ the operator norm. The Euclidean gradient of a function g(x) is ∇g(x) and the Riemannian gradient is grad g(x). Let I_r and 0_r be the r × r identity matrix and zero matrix, respectively, and let 1_n denote the n-dimensional vector of all ones. The network structure is modeled by a matrix W, which satisfies the following assumption.

Assumption 1.
We assume that the undirected graph G is connected and W is doubly stochastic, i.e., (i) W = W^⊤; (ii) W_ij ≥ 0 and 1 > W_ii > 0 for all i, j; (iii) the eigenvalues of W lie in (−1, 1]. The second largest singular value σ of W lies in [0, 1).

We now introduce some preliminaries on Riemannian manifolds and fundamental lemmas.
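For concreteness, one standard matrix satisfying Assumption 1 is the Metropolis constant weight matrix (also used in Section 6). The sketch below is ours (the ring graph, sizes, and helper names are assumptions); it builds W for a ring and checks conditions (i)-(iii) numerically:

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis weight matrix for an undirected graph.

    adj: (n, n) symmetric 0/1 adjacency matrix with zero diagonal.
    Returns a symmetric doubly stochastic W with positive diagonal.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Ring graph on n = 5 agents.
n = 5
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_weights(adj)
assert np.allclose(W, W.T)                  # (i) symmetry
assert np.allclose(W.sum(axis=1), 1.0)      # doubly stochastic rows
assert np.all(np.diag(W) > 0)               # (ii) positive diagonal
# Second largest singular value sigma: largest singular value of W - (1/n)11^T.
sigma = np.linalg.svd(W - np.ones((n, n)) / n, compute_uv=False)[0]
assert 0.0 <= sigma < 1.0                   # sigma in [0, 1)
```

The quantity `sigma` computed here is the network parameter σ appearing throughout the convergence results.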
Denote the Euclidean average of x_1, ..., x_n by

  x̂ := (1/n) Σ_{i=1}^n x_i.   (2.1)

To measure the degree of consensus, the error ||x_i − x̂||_F is typically used in Euclidean decentralized algorithms. Instead, here we use the induced arithmetic mean (IAM) [35] on St(d, r), defined as

  x̄ := argmin_{y ∈ St(d,r)} Σ_{i=1}^n ||y − x_i||²_F = P_St(x̂),   (IAM)

where P_St(·) is the orthogonal projection onto St(d, r). Define the stacked copy

  x̄ := 1_n ⊗ x̄.   (2.2)

Then the distance between x and X* is given by

  dist²(x, X*) = min_{y ∈ St(d,r)} (1/n) Σ_{i=1}^n ||y − x_i||²_F = (1/n) ||x − x̄||²_F.

Furthermore, we define the ℓ_{F,∞} distance between x and x̄ as

  ||x − x̄||_{F,∞} := max_{i ∈ [n]} ||x_i − x̄||_F.   (ℓ_{F,∞})

We will develop the analysis of decentralized Riemannian gradient descent by studying the error distances ||x − x̄||_F and ||x − x̄||_{F,∞}.

Next, we introduce the optimality condition on the manifold M. Consider the following centralized optimization problem over a matrix manifold M:

  min h(x)  s.t.  x ∈ M.   (2.3)

Since we use the metric on the tangent space T_x M induced from the Euclidean inner product ⟨·, ·⟩, the Riemannian gradient grad h(x) on St(d, r) is given by grad h(x) = P_{T_x M} ∇h(x), where P_{T_x M} is the orthogonal projection onto T_x M. More specifically, we have

  P_{T_x M} y = y − x (x^⊤ y + y^⊤ x)/2

for any y ∈ R^{d×r}; see [1, 13]. The necessary first-order optimality condition of problem (2.3) is given as follows.

Proposition 2.1. [7, 45] Let x ∈ M be a local optimum for (2.3). If h is differentiable at x, then grad h(x) = 0.

Therefore, x is a first-order critical point (or critical point) if grad h(x) = 0. Let x̄ be the IAM of x. We define an ε-stationary point of problem (1.1) as follows.

Definition 2.2 (ε-Stationarity).
We say that x^⊤ = (x_1^⊤ x_2^⊤ ... x_n^⊤) is an ε-stationary point of problem (1.1) if the following holds:

  (1/n) Σ_{i=1}^n ||x_i − x̄||²_F ≤ ε  and  ||grad f(x̄)||²_F ≤ ε,

where we use the notation f(x̄) = (1/n) Σ_{i=1}^n f_i(x̄).

Our goal is to develop the decentralized version of centralized Riemannian gradient descent on St(d, r). Simply speaking, centralized Riemannian gradient descent [1, 7] iterates as

  x_{k+1} = R_{x_k}(−α grad h(x_k)),

i.e., it updates along a negative Riemannian gradient direction on the tangent space, then performs an operation called a retraction R_{x_k} to ensure feasibility. We use the definition of retraction in [7, Definition 1]. The retraction is a relaxation of the exponential mapping and, more importantly, it is computationally cheaper. We also assume second-order boundedness of the retraction, which means that

  R_x(ξ) = x + ξ + O(||ξ||²).

That is, R_x(ξ) is locally a good approximation of x + ξ. This kind of approximation is good enough to take the place of the exponential map for first-order algorithms.

Lemma 2.3. [7, 24] Let R be a second-order retraction over St(d, r). Then

  ||R_x(ξ) − (x + ξ)||_F ≤ M ||ξ||²_F,  ∀x ∈ St(d, r), ∀ξ ∈ T_x M.   (P1)

Moreover, if the retraction is the polar decomposition, then for all x ∈ St(d, r) and ξ ∈ T_x M, the following inequality holds for any y ∈ St(d, r) [22, Lemma 1]:

  ||R_x(ξ) − y||_F ≤ ||x + ξ − y||_F.   (2.4)

In the sequel, retraction refers to the polar retraction to keep the analysis simple, unless otherwise noted. More details on the polar retraction are provided in Appendix A. Throughout the paper, we assume that every f_i(x) is Lipschitz smooth.

Assumption 2.
Each f_i(x) has an L-Lipschitz continuous gradient, and let D := max_{x ∈ St(d,r)} ||∇f_i(x)||_F.

Therefore, ∇f(x) is also L-Lipschitz continuous and D = max_{x ∈ St(d,r)} ||∇f(x)||_F. We have two Lipschitz-type inequalities on the Stiefel manifold similar to the Euclidean ones [31]. We provide the proof in the Appendix.

Lemma 2.4 (Lipschitz-type inequalities). For any x, y ∈ St(d, r) and ξ ∈ T_x M, if f(x) is L-Lipschitz smooth in Euclidean space, then there exists a constant L_g = L + L_n such that

  |f(y) − [f(x) + ⟨grad f(x), y − x⟩]| ≤ (L_g/2) ||y − x||²_F,   (2.5)

where L_n = max_{x ∈ St(d,r)} ||∇f(x)||₂. Moreover, with L_G := L + 2L_n, one has

  ||grad f(x) − grad f(y)||_F ≤ L_G ||y − x||_F.   (2.6)

The difference between two Riemannian gradients is not well defined on a general manifold. However, since the Stiefel manifold is embedded in Euclidean space, we are free to form it. Another inequality similar to (2.5) is the restricted Lipschitz-type gradient presented in [7, Lemma 4], but they do not provide an inequality like (2.6). One could also consider the following Lipschitz inequality (see [1, 48]):

  ||P_{x→y} grad f(x) − grad f(y)||_F ≤ L'_g d_g(x, y),

where P_{x→y} : T_x M → T_y M is a vector transport and d_g(x, y) is the geodesic distance. Since vector transport and geodesic distance bring computational and conceptual difficulties, we choose to use the form (2.6) for simplicity. In fact, L_g, L̃_g and L'_g are the same up to a constant; a detailed comparison is provided in Appendix C.1.

We will use Lemma 2.3 and Lemma 2.4 to present an analysis parallel to the decentralized Euclidean gradient methods [23, 29, 30].
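The two basic tools above, the tangent projection P_{T_x M} and the polar retraction, can be made concrete in a few lines. The following NumPy sketch is ours (names and sizes are assumptions); it implements both operations and numerically checks tangency, feasibility, and the second-order bound (P1):

```python
import numpy as np

def proj_tangent(x, y):
    """Orthogonal projection onto T_x St(d, r):
    P_{T_x}(y) = y - x (x^T y + y^T x) / 2."""
    return y - x @ (x.T @ y + y.T @ x) / 2.0

def polar_retraction(x, xi):
    """Polar retraction: the polar factor of x + xi, via thin SVD."""
    u, _, vt = np.linalg.svd(x + xi, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
d, r = 8, 3
x, _ = np.linalg.qr(rng.standard_normal((d, r)))      # a point on St(d, r)
xi = proj_tangent(x, rng.standard_normal((d, r)))     # a tangent direction

# Tangency: x^T xi + xi^T x = 0 characterizes T_x St(d, r).
assert np.allclose(x.T @ xi + xi.T @ x, 0, atol=1e-10)

# The retraction lands back on the manifold.
y = polar_retraction(x, xi)
assert np.allclose(y.T @ y, np.eye(r), atol=1e-10)

# Second-order boundedness (P1): R_x(xi) stays O(||xi||^2)-close to x + xi.
small = 1e-3 * xi
gap = np.linalg.norm(polar_retraction(x, small) - (x + small))
assert gap <= 10 * np.linalg.norm(small) ** 2
```

The factor 10 in the last check is a loose numerical margin of ours, not the constant M of Lemma 2.3.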
3 Consensus on the Stiefel Manifold

The decentralized gradient-type algorithms [23, 29, 30, 37, 40, 46] are based on the linear convergence of the consensus iteration in Euclidean space. The consensus problem over St(d, r) is to minimize the following quadratic loss on the Stiefel manifold:

  min φ_t(x) := (1/4) Σ_{i=1}^n Σ_{j=1}^n W^t_{ij} ||x_i − x_j||²_F,  s.t.  x_i ∈ M, ∀i ∈ [n],   (3.1)

where the superscript t ≥ 1 denotes the t-th power of the doubly stochastic matrix W. Note that t is introduced to provide flexibility for algorithm design and analysis, and computing W^t_{ij} corresponds to performing t steps of communication. The Riemannian gradient method DRCS proposed in [9] is given by, for any i ∈ [n],

  x_{i,k+1} = R_{x_{i,k}}(α P_{T_{x_{i,k}} M}(Σ_{j=1}^n W^t_{ij} x_{j,k})).   (3.2)

DRCS converges almost surely to consensus when r is sufficiently small relative to d. Moreover, the Q-linear rate of DRCS holds in a local region defined as follows:

  N := N_1 ∩ N_2,   (3.3)
  N_1 := {x : ||x − x̄||²_F ≤ nδ_1²},   (3.4)
  N_2 := {x : ||x − x̄||_{F,∞} ≤ δ_2},   (3.5)

where the radii satisfy δ_1 ≤ √r δ_2 with δ_2 sufficiently small (the exact constants are given in [9]).   (3.6)

The following convergence result for DRCS can be found in [9, Theorem 2]; the formal statement is provided in Fact B.1 in the Appendix.

Fact 3.1 (Informal). Under Assumption 1, for some ᾱ ∈ (0, 1], if α ≤ ᾱ and t ≥ ⌈log_σ(1/(2√n))⌉, the sequence {x_k} of (3.2) achieves consensus linearly if the initialization satisfies x_0 ∈ N defined by (3.3)-(3.6). That is, there exists ρ_t ∈ (0, 1) such that x_k ∈ N for all k ≥ 0 and

  ||x_{k+1} − x̄_{k+1}||_F ≤ ρ_t ||x_k − x̄_k||_F.   (3.7)

4 Decentralized Riemannian Stochastic Gradient Method

The results on the consensus problem on the Stiefel manifold lead us to combine the ideas of decentralized gradient methods in Euclidean space with Stiefel manifold optimization. In this section, we study a distributed Riemannian stochastic gradient method for solving problem (1.1), which is described in Algorithm 1.
The algorithm is an extension of decentralized subgradient descent [30]. Since we need to achieve consensus, the initial point x_0 should be in the consensus region N; one can simply initialize all agents from the same point. Step 5 of Algorithm 1 first performs a consensus step and then updates the local variable along the Riemannian stochastic gradient direction v_{i,k}. The consensus step and the computation of the Riemannian gradient can be done in parallel. The consensus stepsize α satisfies α ≤ ᾱ, the same as in the consensus algorithm; the constant ᾱ is given in Fact B.1 in the Appendix. Moreover, α = 1 works in practice for any W satisfying Assumption 1. If x_1 = ... = x_n = z, we denote

  f(z) := (1/n) Σ_{i=1}^n f_i(z).

Moreover, we need the following assumptions on the stochastic Riemannian gradient v_{i,k} and the stepsize β_k.

Algorithm 1
Decentralized Riemannian Stochastic Gradient Descent (DRSGD) for Solving (1.1)
1: Input: initial point x_0 ∈ N, an integer t ≥ ⌈log_σ(1/(2√n))⌉, 0 < α ≤ ᾱ, where ᾱ is given in Fact 3.1.
2: for k = 0, 1, ... do  ▷ for each node i ∈ [n], in parallel
3:   Choose the diminishing stepsize β_k = O(1/√k).
4:   Compute a stochastic Riemannian gradient v_{i,k} satisfying E v_{i,k} = grad f_i(x_{i,k}).
5:   Update x_{i,k+1} = R_{x_{i,k}}(α P_{T_{x_{i,k}} M}(Σ_{j=1}^n W^t_{ij} x_{j,k}) − β_k v_{i,k}).
6: end for

Assumption 3.
1. The stochastic gradient v_{i,k} is unbiased, i.e., E v_{i,k} = grad f_i(x_{i,k}) for all i ∈ [n], and v_{i,k} is independent of v_{j,k} for any i ≠ j. Moreover, the variance is bounded: E||v_{i,k} − grad f_i(x_{i,k})||²_F ≤ Ξ².
2. We assume a uniform upper bound D on ||v_i||_F, i.e., max_{x ∈ St(d,r)} ||v_i||_F ≤ D for each i ∈ [n].

The Lipschitz smoothness of f_i(x) in Assumption 2 and the unbiasedness of the estimate are quite standard in the literature, and Lemma 2.4 implies that grad f_i is L_G-Lipschitz continuous. The boundedness of ||v_i||_F is a weak assumption since the Stiefel manifold is compact. One common example is the finite-sum form f_i = (1/m_i) Σ_{j=1}^{m_i} f_{ij}, where each f_{ij} is smooth; then the stochastic gradient v_{i,k} is uniformly sampled from grad f_{ij}(x_{i,k}), j ∈ [m_i]. We emphasize that the uniform boundedness of the gradient is not needed for problems in Euclidean space, but Lipschitz continuity is necessary [17]. Step 5 can be seen as applying a Riemannian gradient step to the problem

  min_{x ∈ M^n} β_k f(x) + α φ_t(x).

Similar to the analysis of DGD in Euclidean space, we need to ensure that ||x_k − x̄_k||_F →
0. Hence, the effect of f should be diminishing. The following assumption on the stepsize is also needed to obtain an ε-stationary solution.¹

¹One could also exchange the order of the gradient step and the communication step, i.e., x_{i,k+1/2} = R_{x_{i,k}}(−β_k v_{i,k}), x_{i,k+1} = R_{x_{i,k+1/2}}(α P_{T_{x_{i,k+1/2}} M}(Σ_{j=1}^n W^t_{ij} x_{j,k+1/2})). Our analysis also applies to this kind of update if x_0 ∈ ρ_t N, where ρ_t N denotes the region N with shrunk radii ρ_t δ_1, ρ_t δ_2. For the Euclidean counterparts, when the graph is complete with the equal-weight matrix, the above updates coincide with a centralized gradient step; however, they are different on the Stiefel manifold.

Assumption 4 (Diminishing stepsize). The stepsize β_k > 0 is non-increasing and

  Σ_{k=0}^∞ β_k = ∞,  lim_{k→∞} β_k = 0,  lim_{k→∞} β_{k+1}/β_k = 1.

The assumption lim_{k→∞} β_{k+1}/β_k = 1 is additionally required to show the bound (1/n)||x_k − x̄_k||²_F = O(β_k² D²/(1−ρ_t)²); see Lemma D.3 in the Appendix. To proceed, we first need to guarantee that x_k ∈ N, where N is the consensus contraction region defined in (3.3). Therefore, the uniform bound D and the multi-step consensus requirement t ≥ ⌈log_σ(1/(2√n))⌉ are necessary in our convergence analysis. With appropriate stepsizes α and β_k, we get the following lemma using the consensus results in Fact 3.1. We provide the proof in the Appendix.

Lemma 4.1.
Under Assumptions 1 to 4, let the stepsize α satisfy 0 < α ≤ ᾱ, let β_k satisfy

  0 ≤ β_k ≤ min{(1−ρ_t)δ_1/D, αδ_2/D},  ∀k ≥ 0,

and let t ≥ ⌈log_σ(1/(2√n))⌉. If x_0 ∈ N, then x_k ∈ N for all k ≥ 0 generated by Algorithm 1 and

  ||x_{k+1} − x̄_{k+1}||_F ≤ ρ_t^{k+1} ||x_0 − x̄_0||_F + √n D Σ_{l=0}^k ρ_t^{k−l} β_l.

We have β_k = O((1−ρ_t)/D) when α = O(1). Note that t ≥ ⌈log_σ(1/(2√n))⌉ implies that 1/(1−ρ_t) = O(1); see Appendix B. When β_k is constant, Lemma 4.1 suggests that x_k converges linearly to an O(β_k)-neighborhood of x̄_k. We now present the convergence of Algorithm 1. The proof is based on the new Lipschitz inequalities for the Riemannian gradient in Lemma 2.4 and the properties of the retraction in Lemma 2.3; we provide it in the Appendix.

Theorem 4.2.
Under Assumptions 1 to 4, suppose x_k ∈ N, t ≥ ⌈log_σ(1/(2√n))⌉, and 0 < α ≤ ᾱ. If

  β_k = (1/√(k+1)) · min{1/L_g, αδ_2/D, (1−ρ_t)δ_1/D},   (4.1)

then

  min_{k≤K} E||grad f(x̄_k)||²_F ≤ [4(f(x̄_0) − f*) + (2L_g Ξ²/n) Σ_{k=0}^K β_k²] / Σ_{k=0}^K β_k
    + [(2CD²L_G + 4T_1 D²) Σ_{k=0}^K β_k² + 4T_2 L_g D³ Σ_{k=0}^K β_k³] / Σ_{k=0}^K β_k,   (4.2)

where C = O(1/(1−ρ_t)²) is given in Lemma D.3 in the Appendix, T_1 = 2(4√r + 6α)C² + 8M², and T_2 = 201α²C³ + 9M². Therefore, we have

  min_{k≤K} E||grad f(x̄_k)||²_F = O((f(x̄_0) − f*)/(β̃√(K+1)) + Ξ² ln(K+1)/(n√(K+1))) + O(max{D², L_G²} · (C² + T_1 + T_2)/√(K+1)),

where β̃ = min{1/L_g, (1−ρ_t)/D}.

Theorem 4.2 together with Lemma D.3 implies that the iteration complexity of obtaining an ε-stationary point as in Definition 2.2 is O(1/ε²) in expectation. The number of communication rounds per iteration is t ≥ ⌈log_σ(1/(2√n))⌉ since we need to ensure x_k ∈ N; for a sparse network, t = O(n² log n) [9]. Following [23], if we use the constant stepsize β_k = 1/(L_G + √((K+1)/n)) with K sufficiently large, we can obtain

  min_{k=0,...,K} E||grad f(x̄_k)||²_F ≤ L_G(f(x̄_0) − f*)/(K+1) + 8(f(x̄_0) − f* + L_G)Ξ/√(n(K+1)).

Hence, when K is sufficiently large, the convergence rate is O(1/√(nK)): to obtain an ε-stationary point, the computational complexity of a single node is O(1/(nε²)). However, the communication requirement t ≥ ⌈log_σ(1/(2√n))⌉ is large. In practice, we find that t = 1 performs almost the same as t = ∞, as shown in Section 6. This may be because, when the stepsize is very small, DRSGD does not deviate much from the consensus algorithm DRCS. We leave further discussion as future work.
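To make the update in step 5 of Algorithm 1 concrete, here is a minimal NumPy sketch of ours, using full-batch gradients (i.e., the deterministic DRDGD variant discussed in Section 6) on the toy loss f_i(x) = −(1/2) Tr(x^⊤ A_i^⊤ A_i x); the problem sizes, stepsize constants, and function names are our assumptions:

```python
import numpy as np

def proj_tangent(x, y):
    # P_{T_x M} y = y - x (x^T y + y^T x) / 2
    return y - x @ (x.T @ y + y.T @ x) / 2.0

def retract(x, xi):
    # polar retraction via thin SVD
    u, _, vt = np.linalg.svd(x + xi, full_matrices=False)
    return u @ vt

def drdgd(A, W, r, t=1, alpha=1.0, beta0=0.1, K=100, seed=0):
    """Algorithm 1 with full-batch gradients, f_i(x) = -1/2 tr(x^T A_i^T A_i x)."""
    rng = np.random.default_rng(seed)
    n, d = len(A), A[0].shape[1]
    x0, _ = np.linalg.qr(rng.standard_normal((d, r)))
    x = [x0.copy() for _ in range(n)]          # common initialization => x_0 in N
    Wt = np.linalg.matrix_power(W, t)          # t communication (consensus) steps
    for k in range(K):
        beta = beta0 / np.sqrt(k + 1)          # diminishing stepsize O(1/sqrt(k))
        # Riemannian gradients: project the Euclidean gradients -A_i^T (A_i x_i).
        v = [proj_tangent(x[i], -A[i].T @ (A[i] @ x[i])) for i in range(n)]
        # Consensus mixing followed by a gradient step and a retraction (step 5).
        mix = [sum(Wt[i, j] * x[j] for j in range(n)) for i in range(n)]
        x = [retract(x[i], alpha * proj_tangent(x[i], mix[i]) - beta * v[i])
             for i in range(n)]
    return x

# Tiny demo: 4 agents, distinct local data, complete-graph equal weights.
rng = np.random.default_rng(1)
A = [rng.standard_normal((50, 10)) for _ in range(4)]
x_fin = drdgd(A, np.full((4, 4), 0.25), r=3, K=20)
for xi in x_fin:
    assert np.allclose(xi.T @ xi, np.eye(3), atol=1e-8)   # iterates stay feasible
```

Replacing the full-batch gradient with a sampled one satisfying E v_{i,k} = grad f_i(x_{i,k}) gives the stochastic variant; feasibility is preserved at every iteration by the retraction, regardless of the stepsize.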
5 Decentralized Riemannian Gradient Tracking

In this section, we study a decentralized gradient tracking method, which is based on the DIGing algorithm [29, 33] for Euclidean problems. With an auxiliary gradient tracking sequence that estimates the full gradient, a constant stepsize can be used and a faster convergence rate can be shown for the Euclidean algorithms [29, 37]. We describe our method in Algorithm 2, named the Decentralized Riemannian Gradient Tracking Algorithm (DRGTA).
Algorithm 2
Decentralized Riemannian Gradient Tracking over the Stiefel Manifold (DRGTA) for Solving (1.1)
1: Input: initial point x_0 ∈ N, an integer t ≥ ⌈log_σ(1/(2√n))⌉, 0 < α ≤ ᾱ, and stepsize β according to (5.2).
2: Initialize y_{i,0} = grad f_i(x_{i,0}) on each node i ∈ [n].
3: for k = 0, 1, ... do  ▷ for each node i ∈ [n], in parallel
4:   Projection onto the tangent space: v_{i,k} = P_{T_{x_{i,k}} M} y_{i,k}.
5:   Update x_{i,k+1} = R_{x_{i,k}}(α P_{T_{x_{i,k}} M}(Σ_{j=1}^n W^t_{ij} x_{j,k}) − β v_{i,k}).
6:   Riemannian gradient tracking: y_{i,k+1} = Σ_{j=1}^n W^t_{ij} y_{j,k} + grad f_i(x_{i,k+1}) − grad f_i(x_{i,k}).
7: end for

In Algorithm 2, step 4 projects the direction y_{i,k} onto the tangent space T_{x_{i,k}} M, and this is followed by a retraction update. The sequence {y_{i,k}} approximates the Riemannian gradient grad f_i(x_{i,k}); more specifically, {y_k} tracks the average Riemannian gradient (1/n) Σ_{i=1}^n grad f_i(x_{i,k}). Although in differential geometry it is not mathematically sound to add elements of different tangent spaces, we can view grad f_i(x_{i,k}) as a projected Euclidean gradient. Note that y_{i,k} is not necessarily in the tangent space T_{x_{i,k}} M. Therefore, it is important to define v_{i,k} = P_{T_{x_{i,k}} M} y_{i,k} so that we can use the properties of the retraction in Lemma 2.3. This projection onto the tangent space, followed by the retraction, distinguishes the algorithm from Euclidean gradient tracking algorithms. Multi-step consensus of the gradient is also required in steps 5 and 6. The consensus stepsize α satisfies the same condition as in Algorithm 1.

We first briefly revisit the idea of the gradient tracking (GT) algorithm DIGing in Euclidean space. Note that if we consider the decentralized optimization problem (1.1) without the Stiefel manifold constraint, then Algorithm 2 is exactly the same as DIGing.
This is because the Riemannian gradient grad f_i reduces to the Euclidean gradient ∇f_i, and the projection onto the tangent space and the retraction are no longer needed. The main advantage of the Euclidean gradient tracking algorithm is that one can use a constant stepsize β >
0, which is due to the following observation: for all k ≥
0, it holds that

  (1/n) Σ_{i=1}^n y_{i,k} = (1/n) Σ_{i=1}^n ∇f_i(x_{i,k}),

i.e., the average of the y_{i,k} is the same as that of the ∇f_i(x_{i,k}). It can be shown that the following inexact gradient sequence converges to a stationary point [29]:

  x_{i,k+1} = Σ_{j=1}^n W_{ij} x_{j,k} − (β/n) Σ_{i=1}^n ∇f_i(x_{i,k}).

However, the average of the gradients is unavailable in the decentralized setting; therefore, GT uses (1/n) Σ_{i=1}^n y_{i,k} to approximate (1/n) Σ_{i=1}^n ∇f_i(x_{i,k}). Inspired by this, y_{i,k} is used to approximate the Riemannian gradient: if

  y_{i,k+1} = Σ_{j=1}^n W^t_{ij} y_{j,k} + grad f_i(x_{i,k+1}) − grad f_i(x_{i,k}),

then it follows that

  (1/n) Σ_{i=1}^n y_{i,k} = (1/n) Σ_{i=1}^n grad f_i(x_{i,k}),  i.e.,  ŷ_k = ĝ_k.

Therefore, {y_k} tracks the average of the Riemannian gradients, and if ||ĝ_k||_F → 0 and {x_k} achieves consensus, then x_k converges to a critical point. This is because

  ||grad f(x̄_k)||² ≤ 2||ĝ_k||² + 2||grad f(x̄_k) − ĝ_k||² ≤ 2||ĝ_k||² + (2L_G²/n) ||x_k − x̄_k||²_F.

To achieve consensus, we still need multi-step consensus in DRGTA, as in DRSGD. The multi-step consensus also helps us show the uniform boundedness of y_{i,k} and v_{i,k}, i ∈ [n], for all k ≥ 0, which is important to guarantee x_k ∈ N. Lemma 5.1 shows that the sequence stays in the consensus region N; we provide the proof in the Appendix.

Lemma 5.1 (Uniform bound on y_i and staying in N). Under Assumptions 1 and 2, let x_0 ∈ N, t ≥ ⌈log_σ(1/(2√n))⌉, let α satisfy 0 < α ≤ ᾱ, and let β satisfy

  0 ≤ β ≤ β̄ := min{(1−ρ_t)δ_1/(L_G + 2D), αδ_2/(L_G + 2D)}.

Then ||y_{i,k}||_F ≤ L_G + 2D for all i ∈ [n] and x_k ∈ N for all k ≥ 0. Moreover, we have

  (1/n) ||x_k − x̄_k||²_F ≤ C²(L_G + 2D)² β²,  k ≥ 0,

for some C = O(1/(1−ρ_t)), where C is independent of L_G and D.
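The tracking recursion and the identity ŷ_k = ĝ_k above can be sketched directly. The following NumPy code is ours (the quadratic test problem, sizes, and names are assumptions); it implements steps 4-6 of Algorithm 2 and checks at every iteration that the average of the y_{i,k} equals the average Riemannian gradient:

```python
import numpy as np

def proj_tangent(x, y):
    return y - x @ (x.T @ y + y.T @ x) / 2.0

def retract(x, xi):
    u, _, vt = np.linalg.svd(x + xi, full_matrices=False)
    return u @ vt

def drgta(A, W, r, t=1, alpha=1.0, beta=0.05, K=20, seed=0):
    """DRGTA sketch for f_i(x) = -1/2 tr(x^T A_i^T A_i x)."""
    rng = np.random.default_rng(seed)
    n, d = len(A), A[0].shape[1]
    grad = lambda i, xi: proj_tangent(xi, -A[i].T @ (A[i] @ xi))
    Wt = np.linalg.matrix_power(W, t)
    x0, _ = np.linalg.qr(rng.standard_normal((d, r)))
    x = [x0.copy() for _ in range(n)]
    g = [grad(i, x[i]) for i in range(n)]
    y = [gi.copy() for gi in g]                     # y_{i,0} = grad f_i(x_{i,0})
    for k in range(K):
        v = [proj_tangent(x[i], y[i]) for i in range(n)]           # step 4
        mix = [sum(Wt[i, j] * x[j] for j in range(n)) for i in range(n)]
        x = [retract(x[i], alpha * proj_tangent(x[i], mix[i]) - beta * v[i])
             for i in range(n)]                                     # step 5
        g_new = [grad(i, x[i]) for i in range(n)]
        y = [sum(Wt[i, j] * y[j] for j in range(n)) + g_new[i] - g[i]
             for i in range(n)]                                     # step 6
        g = g_new
        # Tracking identity: a doubly stochastic W^t preserves the average,
        # so mean(y_k) = mean(grad f_i(x_{i,k})) holds exactly by induction.
        assert np.allclose(sum(y) / n, sum(g) / n, atol=1e-6)
    return x, y

# Tiny demo: 3 agents, equal-weight complete graph.
rng = np.random.default_rng(1)
A_demo = [rng.standard_normal((15, 6)) for _ in range(3)]
x_fin, y_fin = drgta(A_demo, np.full((3, 3), 1.0 / 3), r=2, K=20)
```

The in-loop assertion is exactly the induction ȳ_{k+1} = ȳ_k + ḡ_{k+1} − ḡ_k with ȳ_0 = ḡ_0, which only uses the double stochasticity of W^t.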
We now present the O(1/ε) iteration complexity for obtaining an ε-stationary point of (1.1). The proof for DIGing can be unified by a primal-dual framework [3]; however, DRGTA cannot be rewritten in primal-dual form. Our proof is mainly established with the help of Lemma 2.4 and the properties of the IAM. We provide it in the Appendix.

Theorem 5.2.
Under Assumptions 1 and 2, let x_0 ∈ N, t ≥ ⌈log_σ(1/(2√n))⌉, 0 < α ≤ ᾱ, and

  0 < β ≤ min{β̄, 1/L_G, L_G/(2G_1 + (8C̃_1 + C̃_3)αδ_2)},   (5.2)

where β̄ is given in Lemma 5.1. Then for the sequences generated by Algorithm 2,

  min_{k=0,...,K} (1/n)||y_k||²_F ≤ (f(x̄_0) − f* + C̃_2 + G/L_G)/(β K),   (5.3)
  min_{k≤K} (1/n)||x_k − x̄_k||²_F ≤ β(f(x̄_0) − f* + C̃_2 + G/L_G)(C̃_1 + C̃_3)/K,   (5.4)
  min_{k≤K} ||grad f(x̄_k)||²_F ≤ ((16 + α²δ_2²C̃_1)(f(x̄_0) − f* + C̃_2 + G/L_G) + C̃_5 L_G)/(β K),   (5.5)

where the constants above are given by

  G = G_1 C̃_1 + G_2 C̃_3 + G_3,  G' = G_1 C̃_1 δ_2² α²/25 + C̃_3 (G_2 + 4rC²),
  C̃_1 = 2/(1 − ρ_t),  C̃_2 = (2/(1 − ρ_t)) · (1/n)||x_0 − x̄_0||²_F,
  C̃_3 = 2/(1 − σ^t),  C̃_4 = (2/(1 − σ^t)) · (1/n)||y_0 − Ĝ_0||²_F,
  C̃_5 = (8α C̃_1 C̃_3 L_G² + C̃_3) · β = O(β L_G²/(1 − σ^t)).

The constants G_1 = O(r²C²), G_2 = O(r²C²) and G_3 = O(M²) are given in Lemma E.2 in the Appendix. We have G = O(r²C²/(1−ρ_t) + M²) and G' = O(r²C²δ_2/(1−ρ_t)). Recall that β ≤ β̄ is required to guarantee that the sequence {x_k} always stays in the consensus region N, and note that ρ_t is the linear rate of Riemannian consensus, which is greater than σ^t. The stepsize β satisfies

  β = O(min{(1−ρ_t)/(L_G + 2D), (1−ρ_t)L_G/(r²C² + M²(1−ρ_t))}).

This matches the stepsize bound for DIGing [29, 33]. Theorem 5.2 then suggests that the consensus error rate is O((r²C² + M²)/L_G · (f(x̄_0) − f*)/K + ||x_0 − x̄_0||²_F/(n(1−ρ_t)K)), and the convergence rate of min_{k=0,...,K} ||grad f(x̄_k)||²_F is O((r²C² + M²)(L_G + 2D)(f(x̄_0) − f*)/(K(1−ρ_t)) + ||x_0 − x̄_0||²_F/(n(1−ρ_t)K) + r²C²δ_2²L_G/(K(1−ρ_t))). Moreover, if the initial points satisfy x_{1,0} = x_{2,0} = ... = x_{n,0}, we have C̃_2 = C̃_4 = C̃_5 = 0.

6 Numerical Experiments

We solve the following decentralized eigenvector problem:

  min_{x ∈ M^n} −(1/n) Σ_{i=1}^n Tr(x_i^⊤ A_i^⊤ A_i x_i),  s.t.  x_1 = ... = x_n,   (6.1)

where A_i ∈ R^{m_i×d}, i ∈ [n], is the local data matrix of agent i and m_i is its sample size. Denote the global data matrix by A := [A_1^⊤ A_2^⊤ ... A_n^⊤]^⊤. It is known that a global minimizer of (6.1) is given by the first r leading eigenvectors of A^⊤A = Σ_{i=1}^n A_i^⊤ A_i, denoted by x*. DRSGD and DRGTA are only proved to converge to critical points, but we find that they always converge to x* in our experiments. Denote the column space of a matrix x by [x].
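For reference, the global solution x* of (6.1) and a canonical-correlation subspace distance can be computed directly. In this sketch of ours (sizes and names are assumptions), the distance is obtained as the orthogonal-Procrustes solution of min_{Q ∈ O(r)} ||uQ − v||_F:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, r = 4, 50, 10, 3                    # toy sizes (ours)
A = [rng.standard_normal((m, d)) for _ in range(n)]

# x*: the r leading eigenvectors of A^T A = sum_i A_i^T A_i.
G = sum(Ai.T @ Ai for Ai in A)
w, V = np.linalg.eigh(G)                     # eigenvalues in ascending order
x_star = V[:, -r:]

def subspace_dist(x, y):
    """min_{Q in O(r)} ||u Q - v||_F over orthonormal bases u, v of [x], [y],
    solved in closed form via orthogonal Procrustes (SVD of u^T v)."""
    u, _ = np.linalg.qr(x)
    v, _ = np.linalg.qr(y)
    uu, _, vt = np.linalg.svd(u.T @ v)
    return np.linalg.norm(u @ (uu @ vt) - v)

assert np.allclose(x_star.T @ x_star, np.eye(r), atol=1e-10)
assert subspace_dist(x_star, x_star) < 1e-10
```

The distance is zero exactly when [x] = [y], and is invariant to the choice of basis, which is why it is a natural accuracy measure for eigenspace problems.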
To measure the quality of a solution, the distance between the column spaces $[x]$ and $[y]$ can be defined via the canonical correlations between $x \in \mathbb{R}^{d\times r}$ and $y \in \mathbb{R}^{d\times r}$ [16]. One can define it by
$$d_s(x, y) := \min_{Q\in\mathcal{O}(r)}\|uQ - v\|_F,$$
where $\mathcal{O}(r)$ is the orthogonal group, and $u$ and $v$ are orthonormal bases of $[x]$ and $[y]$, respectively. In the sequel, we fix $\alpha = 1$ and generate the initial points uniformly at random such that $x_{1,0} = \ldots = x_{n,0} \in \mathcal{M}$. If the full batch gradient is used in Algorithm 1, we call the resulting method DRDGD; otherwise, one stochastic gradient is uniformly sampled without replacement in DRSGD. For DRSGD, one epoch represents a complete pass through the dataset, while progress of the deterministic algorithms is measured in iterations. We set the maximum number of epochs of DRSGD to 200 and stop early once $d_s(\bar x_k, x^*)$ falls below a small tolerance. For DRGTA and DRDGD, we cap the number of iterations and terminate once $d_s(\bar x_k, x^*)$ or $\|\mathrm{grad}\,f(\bar x_k)\|_F$ falls below a small tolerance. We set $\beta_k = \hat\beta\, n/\sum_{i=1}^n m_i$ for DRGTA and DRDGD, where $\hat\beta$ will be specified later; for DRSGD, we set $\beta$ proportional to $\hat\beta$ with an inverse square-root decay. We select the weight matrix $W$ to be the Metropolis constant weight [37].

We report the convergence results of DRSGD, DRDGD and DRGTA with different $t$ and $\hat\beta$ on synthetic data. We fix $m_1 = \ldots = m_n = 1000$, $d = 100$ and $r = 5$, and generate $m \times n$ i.i.d. samples following the standard multivariate Gaussian distribution to obtain $A$. Let $A = USV^{\top}$ be the truncated SVD. Given an eigengap $\Delta \in (0, 1)$, we reset the singular values of $A$ to a geometric sequence driven by $\Delta$; typically, a larger $\Delta$ results in a more difficult problem.

Figure 1: Synthetic data, $n = 32$ agents, eigengap $\Delta = 0.8$. Panels: (a) DRSGD; (b) DRDGD; (c) DRGTA. y-axis: log-scale $d_s(\bar x_k, x^*)$.

In Figure 1, we show the results of DRSGD, DRDGD and DRGTA on the data with $n = 32$ and $\Delta = 0.8$. The y-axis is the log-scale distance $d_s(\bar x_k, x^*)$. The first four curves in each panel are for the ring graph, and the last one is for a complete graph with equally weighted matrix, which represents the case $t \to \infty$. In Figure 1(a), for fixed $t$, a smaller $\hat\beta$ produces higher accuracy, which corroborates Theorem 4.2. We also see that DRSGD performs almost the same for different $t \in \{1, 10, \infty\}$. For the two deterministic algorithms, DRDGD can use a larger $\hat\beta$ when more communication rounds $t$ are used, as shown in Figure 1(b),(c). DRDGD cannot achieve exact convergence with a constant stepsize, while DRGTA successfully solves the problem using $t \in \{1, 10, \infty\}$ and $\hat\beta = 0.05$.

Next, we report numerical results on different networks and data sizes. Figure 2 shows the results on the same data set as Figure 1, but on an Erdős–Rényi graph ER($n$, $p$), in which each edge is included in the graph independently with probability $p$; the Metropolis constant matrix is associated with this graph. Since ER(32, 0.3) is better connected than the ring graph, the results for different $t \in \{1, 10, \infty\}$ are almost the same, except for DRDGD with $\hat\beta = 0.05$. Moreover, the solution accuracy and convergence rate of DRDGD and DRGTA are better than those shown in Figure 1.

In Figure 3, we show the results when the initial point does not satisfy $\mathbf{x}_0 \in \mathcal{N}$. Specifically, we randomly generate $x_{1,0}, \ldots, x_{n,0}$ on $\mathcal{M}$, and the other settings are the same as in Figure 1. Surprisingly, we find that the proposed algorithms still converge. As suggested by [9, 26], the consensus algorithm can achieve global consensus with random initialization when $r \le d - 1$. The iterations of DRSGD and DRGTA are perturbations of the consensus iteration; it will be interesting to study this phenomenon further.
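The minimization over $Q \in \mathcal{O}(r)$ in $d_s$ is an orthogonal Procrustes problem, solved in closed form by the SVD of $u^{\top}v$. A minimal sketch (the function name and test sizes are ours, not from the paper):

```python
import numpy as np

def subspace_distance(x, y):
    """d_s(x, y) = min_{Q in O(r)} ||u Q - v||_F, where u, v are orthonormal bases
    of the column spaces [x], [y]; the optimal Q is W1 @ W2^T from u^T v = W1 S W2^T."""
    u, _ = np.linalg.qr(x)
    v, _ = np.linalg.qr(y)
    w1, _, w2t = np.linalg.svd(u.T @ v)
    return np.linalg.norm(u @ (w1 @ w2t) - v)

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 3))
mix = rng.standard_normal((3, 3))              # invertible mixing: same column space
assert subspace_distance(x, x @ mix) < 1e-8    # distance between identical subspaces is 0
```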
We compare our algorithms with the recently proposed decentralized Sanger's algorithm (DSA) [15], which is a Euclidean-type algorithm. For the eigenvector problem (6.1), DSA is shown to converge linearly to a neighborhood of the optimal solution. Each DSA iteration is computationally cheaper than a DRDGD iteration since there is no retraction step. For simplicity, we fix $t = 1$ and $r = 5$ in this section.

Figure 2: Synthetic data, $n = 32$ agents, eigengap $\Delta = 0.8$, graph: ER(32, 0.3). Panels: (a) DRSGD; (b) DRDGD; (c) DRGTA. y-axis: log-scale $d_s(\bar x_k, x^*)$.

Figure 3: Synthetic data, $n = 32$ agents, eigengap $\Delta = 0.8$, graph: ring. Panels: (a) DRSGD; (b) DRDGD; (c) DRGTA. y-axis: log-scale $d_s(\bar x_k, x^*)$.

We provide numerical results on the MNIST dataset [21]. The graph is again the ring, and $W$ is the Metropolis constant weight matrix. MNIST contains 60000 samples of dimension $d = 784$. We normalize the data matrix by dividing by 255 so that its entries lie in $[0, 1]$. The data set is evenly partitioned into $n$ subsets. The stepsizes of DRDGD and DRGTA are set to $\beta = \hat\beta$. The results for the MNIST data set with $n = 20, 40$ are shown in Figure 4. We see that the convergence rates of DSA and DRDGD are almost the same, and DRGTA with $\hat\beta = 0.5$ is the fastest. As $n$ becomes larger, the convergence of all algorithms slows down. Although each DSA iteration is cheaper than a DRDGD iteration, we find that for $\hat\beta = 0.5$ and $n = 20$, DSA does not converge, which is why it is absent from Figure 4(a). This is probably because DSA is not a feasible method and needs a carefully tuned stepsize.

Finally, we demonstrate the linear speedup of DRSGD for different $n$. The experiments are evaluated on an HPC cluster, where each computation node has an Intel Xeon E5-2670 v2 CPU and the nodes are connected by FDR10 InfiniBand. We use 10 CPU cores per computation node and treat one CPU core as one network node in our problem. The code is implemented in Python with mpi4py. We set the maximum number of epochs to 300 in all experiments. The stepsize is set to $\beta = \frac{\sqrt n}{\sqrt K}\hat\beta$, where $\hat\beta$ is tuned for the best performance. Figure 5 reports $\log d_s(\bar x_k, x^*)$ versus epoch and versus CPU time, respectively. As seen in Figure 5(a), the solution accuracy for $n = 16, 32, 60$ is almost the same, while the CPU time in Figure 5(b) decreases at a nearly linear ratio.
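The main building blocks of these experiments can be sketched in a few lines: the Metropolis constant weights on a ring, $t$-step communication through $W^t$, and a Riemannian consensus step with the polar retraction and $\alpha = 1$. All sizes and the iteration count below are illustrative choices of ours, and the loop is a simplified stand-in for the consensus component only, not the full DRSGD/DRGTA update:

```python
import numpy as np

def metropolis_ring(n):
    """Metropolis constant weights on a ring: W_ij = 1/(1 + max(deg_i, deg_j)) = 1/3
    for each edge, and W_ii = 1 - sum of off-diagonal entries = 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = W[i, i] = 1.0 / 3.0
    return W

def polar_retraction(x, xi):
    """R_x(xi) = (x + xi)(I + xi^T xi)^{-1/2} for a tangent vector xi."""
    vals, vecs = np.linalg.eigh(np.eye(x.shape[1]) + xi.T @ xi)
    return (x + xi) @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

def tangent_projection(x, v):
    return v - x @ (x.T @ v + v.T @ x) / 2

rng = np.random.default_rng(0)
n, d, r, t = 8, 20, 3, 2
Wt = np.linalg.matrix_power(metropolis_ring(n), t)   # t-step mixing matrix W^t
base, _ = np.linalg.qr(rng.standard_normal((d, r)))
# initialize agents near a common point, so that x_0 lies in the consensus region
xs = [polar_retraction(base, 0.01 * tangent_projection(base, rng.standard_normal((d, r))))
      for _ in range(n)]

def consensus_error(pts):
    xhat = sum(pts) / len(pts)
    return sum(np.linalg.norm(p - xhat) ** 2 for p in pts) ** 0.5

e0 = consensus_error(xs)
for _ in range(20):  # x_i <- R_{x_i}(-alpha * grad phi_{t,i}(x)), alpha = 1
    targets = [sum(Wt[i, j] * xs[j] for j in range(n)) for i in range(n)]
    xs = [polar_retraction(xs[i], -tangent_projection(xs[i], xs[i] - targets[i]))
          for i in range(n)]
e1 = consensus_error(xs)
assert e1 < 1e-2 * e0   # consensus error shrinks at a linear rate
```

Near consensus, each sweep contracts the error roughly by the Euclidean factor $\sigma_t$ of $W^t$, which is why larger $t$ speeds up the consensus component.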
Figure 4: Numerical results of DRDGD, DRGTA and DSA on the MNIST data set. Panels: (a) MNIST, $n = 20$, ring graph; (b) MNIST, $n = 40$, ring graph. y-axis: log-scale $d_s(\bar x_k, x^*)$.

Figure 5: Comparison of different numbers of nodes on MNIST. Ring graph with the Metropolis constant weight matrix, $t = 1$, $\beta = \frac{\sqrt n}{\sqrt K}\hat\beta$. Panels: (a) iteration; (b) CPU time (seconds). y-axis: log-scale $d_s(\bar x_k, x^*)$.

We discussed the decentralized optimization problem over the Stiefel manifold, proposed two decentralized Riemannian gradient methods, and established their convergence rates. Future work can be cast into the following directions. First, for the eigenvector problem (6.1), it would be interesting to establish the linear convergence of DRGTA. Second, our analysis is based on the local convergence of Riemannian consensus, which results in multi-step consensus; it would be interesting to design algorithms based on Euclidean consensus.
References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[2] P.-A. Absil, Robert Mahony, and Jochen Trumpf. An extrinsic look at the Riemannian Hessian. In International Conference on Geometric Science of Information, pages 361–368. Springer, 2013.
[3] Sulaiman A. Alghunaim, Ernest Ryu, Kun Yuan, and Ali H. Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 2020.
[4] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.
[5] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pages 344–353. PMLR, 2019.
[6] Necdet Serhat Aybat, Zi Wang, Tianyi Lin, and Shiqian Ma. Distributed linearized alternating direction method of multipliers for composite convex consensus optimization. IEEE Transactions on Automatic Control, 63(1):5–20, 2017.
[7] Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
[8] Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2014.
[9] Shixiang Chen, Alfredo Garcia, Mingyi Hong, and Shahin Shahrampour. On the local linear rate of consensus on the Stiefel manifold. arXiv preprint arXiv:2101.09346, 2021.
[10] Shixiang Chen, Alfredo Garcia, and Shahin Shahrampour. Distributed projected subgradient method for weakly convex optimization. arXiv preprint arXiv:2004.13233, 2020.
[11] Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.
[12] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
[13] Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[14] Jianqing Fan, Dong Wang, Kaizheng Wang, and Ziwei Zhu. Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009, 2019.
[15] Arpita Gang and Waheed U. Bajwa. A linearly convergent algorithm for distributed principal component analysis. arXiv preprint arXiv:2101.01300, 2021.
[16] Gene H. Golub and Hongyuan Zha. The canonical correlations of matrix pairs and their numerical computation. In Linear Algebra for Signal Processing, pages 27–49. Springer, 1995.
[17] Mingyi Hong, Siliang Zeng, Junyu Zhang, and Haoran Sun. On the divergence of decentralized non-convex optimization. arXiv preprint arXiv:2006.11662, 2020.
[18] Lei Huang, Xianglong Liu, Bo Lang, Adams Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[19] Long-Kai Huang and Sinno Pan. Communication-efficient distributed PCA by Riemannian optimization. In International Conference on Machine Learning, pages 4465–4474. PMLR, 2020.
[20] David Kempe and Frank McSherry. A decentralized algorithm for spectral analysis. Journal of Computer and System Sciences, 74(1):70–83, 2008.
[21] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[22] Xiao Li, Shixiang Chen, Zengde Deng, Qing Qu, Zhihui Zhu, and Anthony Man-Cho So. Nonsmooth optimization over Stiefel manifold: Riemannian subgradient methods. arXiv preprint arXiv:1911.05047, 2019.
[23] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
[24] H. Liu, A. M.-C. So, and W. Wu. Quadratic optimization with orthogonality constraint: Explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Mathematical Programming Series A, 178(1-2):215–262, 2019.
[25] Shuai Liu, Zhirong Qiu, and Lihua Xie. Convergence rate analysis of distributed optimization with projected subgradient algorithm. Automatica, 83:162–169, 2017.
[26] Johan Markdahl, Johan Thunberg, and Jorge Goncalves. High-dimensional Kuramoto models on Stiefel manifolds synchronize complex networks almost globally. Automatica, 113:108736, 2020.
[27] Joao F. C. Mota, Joao M. F. Xavier, Pedro M. Q. Aguiar, and Markus Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
[28] Angelia Nedić, Alex Olshevsky, and Michael G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
[29] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
[30] Angelia Nedic, Asuman Ozdaglar, and Pablo A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922–938, 2010.
[31] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[32] Federico Penna and Sławomir Stańczak. Decentralized eigenvalue algorithms for distributed signal detection in wireless networks. IEEE Transactions on Signal Processing, 63(2):427–440, 2014.
[33] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017.
[34] Haroon Raja and Waheed U. Bajwa. Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data. IEEE Transactions on Signal Processing, 64(1):173–188, 2015.
[35] Alain Sarlette and Rodolphe Sepulchre. Consensus optimization on manifolds. SIAM Journal on Control and Optimization, 48(1):56–76, 2009.
[36] Suhail M. Shah. Distributed optimization on Riemannian manifolds for multi-agent networks. arXiv preprint arXiv:1711.11196, 2017.
[37] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[38] Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
[39] Roberto Tron, Bijan Afsari, and René Vidal. Riemannian consensus for manifolds with bounded curvature. IEEE Transactions on Automatic Control, 58(4):921–934, 2012.
[40] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
[41] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning, pages 3570–3578. PMLR, 2017.
[42] Lei Wang, Xin Liu, and Yin Zhang. A distributed and secure algorithm for computing dominant SVD based on projection splitting. arXiv preprint arXiv:2012.03461, 2020.
[43] Ran Xin, Usman A. Khan, and Soummya Kar. A near-optimal stochastic gradient method for decentralized non-convex finite-sum optimization. arXiv preprint arXiv:2008.07428, 2020.
[44] Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060. IEEE, 2015.
[45] W. H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014.
[46] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
[47] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H. Sayed. Exact diffusion for distributed optimization and learning—Part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2018.
[48] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638, 2016.
A About the polar retraction

Given the polar decomposition $x + \xi = QH$, where $Q \in \mathbb{R}^{d\times r}$ has orthonormal columns and $H \in \mathbb{R}^{r\times r}$ is positive definite, the polar retraction is the polar factor
$$\mathcal{R}_x(\xi) = Q = (x + \xi)\,(I_r + \xi^{\top}\xi)^{-1/2}, \tag{A.1}$$
which is also the orthogonal projection of $x + \xi$ onto $\mathrm{St}(d, r)$. The computational complexity is $O(dr^2)$. It was shown in [24, Appendix E] that if $\|\xi\|_F \le 1$, then $M = 1$ for the polar retraction. The boundedness of $\xi$ can be verified in our convergence analysis; therefore, we take $M = 1$ in this paper.

B More details on the linear rate of consensus
The following results were provided in [9]. If there exists an integer $t \ge 1$ such that
$$\max_{i\in[n]}\Big\|\sum_{j=1}^n \big(W^t_{ij} - \tfrac1n\big)(x_j - \bar x)\Big\|_F \le \max_{i\in[n]}\sum_{j=1}^n \big|W^t_{ij} - \tfrac1n\big|\,\|\mathbf{x}-\bar{\mathbf{x}}\|_{F,\infty} \le \sigma_t\,\|\mathbf{x}-\bar{\mathbf{x}}\|_{F,\infty}, \tag{B.1}$$
then it suffices to show that the sequence $\{\mathbf{x}_k\}$ of DRCS satisfies $\mathbf{x}_k \in \mathcal{N}$ with $t \ge \lceil\log_{\sigma_2}(\tfrac{1}{2\sqrt n})\rceil$ steps of communication. Denote the smallest eigenvalue of $W^t$ by $\lambda_n(W^t)$; the constant $L_t$ is given by
$$L_t = 1 - \lambda_n(W^t). \tag{B.2}$$
It is the Lipschitz constant of $\nabla\varphi_t(\mathbf{x})$. Since $L_t \in (0, 2]$, if $\lambda_n(W^t)$ is unknown, one can use $L_t = 2$. Define the second largest eigenvalue of $W^t$ as $\lambda_2(W^t)$ and $\mu_t = 1 - \lambda_2(W^t)$. The formal statement of Fact 3.1 is given as follows.
Fact B.1. [9] Under Assumption 1, let the stepsize $\alpha$ satisfy $0 < \alpha \le \bar\alpha := \min\{\frac{\nu\Phi}{L_t}, 1, \frac{1}{M}\}$ and $t \ge \lceil\log_{\sigma_2}(\tfrac{1}{2\sqrt n})\rceil$, where $\nu \in [0, 1]$, $\Phi = 2 - \delta_2$ and $M$ is given in Lemma 2.3. The sequence $\{\mathbf{x}_k\}$ of (3.2) achieves consensus linearly if the initialization satisfies $\mathbf{x}_0 \in \mathcal{N}$ defined by (3.6). That is, we have $\mathbf{x}_k \in \mathcal{N}$ for all $k \ge 0$ and
$$\|\mathbf{x}_{k+1} - \bar{\mathbf{x}}_{k+1}\|_F \le \|\mathbf{x}_k - \alpha\,\mathrm{grad}\,\varphi_t(\mathbf{x}_k) - \bar{\mathbf{x}}_k\|_F \le \sqrt{1 - (2-\nu)\alpha\gamma_t}\;\|\mathbf{x}_k - \bar{\mathbf{x}}_k\|_F, \tag{B.3}$$
where $\gamma_t = (1 - r\delta_2^2)(1 - \delta_2^2)\mu_t \ge \frac{\mu_t}{2} \ge \frac{1-\sigma_t}{2}$. If $\nu = 1/2$, we have $\bar\alpha := \min\{\frac{\Phi}{2L_t}, 1, \frac{1}{M}\}$ and $\rho_t = \sqrt{1 - \tfrac{3}{2}\gamma_t\alpha}$.

Recall that $M$ is the constant given in Lemma 2.3; we also have $M = O(1)$, which is discussed in Appendix A. If $\alpha = 1$ is admissible, then the rate is $\rho_t = \sqrt{\sigma_t}$, which is worse than the Euclidean rate $\sigma_t$. Moreover, it was shown in [9] that in a smaller region, i.e., when $\varphi_t(\mathbf{x}) = O(\sigma_t)$ and $\|\mathbf{x} - \bar{\mathbf{x}}\| = O(1)$, it follows asymptotically that $\rho_t = \sigma_t$ with $\alpha = 1$. For simplicity, we will only discuss the convergence of our proposed algorithms using (B.3) with $\nu = 1/2$. Note that this may imply $\bar\alpha < 1$, but we find that $\alpha = 1$ always works for our proposed algorithms.

C Proofs for Section 2
Denote by $\mathcal{P}_{N_x\mathcal{M}}$ the orthogonal projection onto the normal space $N_x\mathcal{M}$. One can rewrite the projection $\mathcal{P}_{T_x\mathcal{M}}(y - x)$, for any $y \in \mathrm{St}(d, r)$ [9], as follows:
$$\mathcal{P}_{T_x\mathcal{M}}(y - x) = y - x - \mathcal{P}_{N_x\mathcal{M}}(y - x) = y - x + \frac{1}{2}\,x\,(x - y)^{\top}(x - y). \tag{P2}$$
This implies that $\mathcal{P}_{T_x\mathcal{M}}(y - x) = y - x + O(\|y - x\|^2)$. The relationship (P2) helps us to prove Lemma 2.4.
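Identity (P2) is easy to verify numerically: projecting $y - x$ onto the tangent space at $x$ via $v \mapsto v - x\,\mathrm{sym}(x^{\top}v)$ agrees with the closed form on the right-hand side whenever $x, y \in \mathrm{St}(d, r)$. A small sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 25, 3
x, _ = np.linalg.qr(rng.standard_normal((d, r)))   # a point on St(d, r)
y, _ = np.linalg.qr(rng.standard_normal((d, r)))   # another point on St(d, r)

def tangent_projection(x, v):
    """P_{T_x M}(v) = v - x * sym(x^T v) for the Stiefel manifold."""
    return v - x @ (x.T @ v + v.T @ x) / 2

lhs = tangent_projection(x, y - x)
rhs = (y - x) + 0.5 * x @ (x - y).T @ (x - y)
assert np.allclose(lhs, rhs)   # identity (P2) holds exactly for x, y on St(d, r)
```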
Proof of Lemma 2.4. First, since $\nabla f(x)$ is $L$-Lipschitz in the Euclidean space, one has
$$\big|f(y) - [f(x) + \langle\nabla f(x), y - x\rangle]\big| \le \frac{L}{2}\|y - x\|^2. \tag{C.1}$$
Since $\mathrm{grad}\,f(x) = \mathcal{P}_{T_x\mathcal{M}}\nabla f(x)$, we have
$$\langle\mathrm{grad}\,f(x), y - x\rangle = \langle\nabla f(x), \mathcal{P}_{T_x\mathcal{M}}(y - x)\rangle \overset{(P2)}{=} \langle\nabla f(x), y - x\rangle + \Big\langle\nabla f(x), \frac12 x(y - x)^{\top}(y - x)\Big\rangle.$$
Using $\big\langle\nabla f(x), \frac12 x(y-x)^{\top}(y-x)\big\rangle \le \frac12\|\nabla f(x)\|_2\cdot\|x\|_2\cdot\|x-y\|^2 \le \frac12\|\nabla f(x)\|_2\cdot\|x-y\|^2$ implies
$$\big|\langle\mathrm{grad}\,f(x), y - x\rangle - \langle\nabla f(x), y - x\rangle\big| \le \frac12\max_{x\in\mathrm{St}(d,r)}\|\nabla f(x)\|_2\cdot\|y - x\|^2, \tag{C.2}$$
where $\|\nabla f(x)\|_2$ represents the operator norm of $\nabla f(x)$. Since $\mathrm{St}(d, r)$ is a compact set and $\nabla f(x)$ is continuous, we denote $L_n = \max_{x\in\mathrm{St}(d,r)}\|\nabla f(x)\|_2$. Let $L_g = L_n + L$. Combining (C.1) with (C.2) yields
$$\big|f(y) - [f(x) + \langle\mathrm{grad}\,f(x), y - x\rangle]\big| \le \frac{L_g}{2}\|y - x\|^2. \tag{C.3}$$
Second, using $\mathrm{grad}\,f(x) = \nabla f(x) - \mathcal{P}_{N_x\mathcal{M}}\nabla f(x)$ and $\mathrm{grad}\,f(y) = \nabla f(y) - \mathcal{P}_{N_y\mathcal{M}}\nabla f(y)$ implies
$$\begin{aligned}
\|\mathrm{grad}\,f(x) - \mathrm{grad}\,f(y)\|_F &\le \|\nabla f(x) - \nabla f(y)\|_F + \|\mathcal{P}_{N_x\mathcal{M}}\nabla f(y) - \mathcal{P}_{N_y\mathcal{M}}\nabla f(y)\|_F\\
&= \|\nabla f(x) - \nabla f(y)\|_F + \frac12\big\|x(x^{\top}\nabla f(y) + \nabla f(y)^{\top}x) - y(y^{\top}\nabla f(y) + \nabla f(y)^{\top}y)\big\|_F\\
&\le \|\nabla f(x) - \nabla f(y)\|_F + 2L_n\|x - y\|_F \tag{C.4}\\
&\le (L + 2L_n)\|x - y\|_F.
\end{aligned}$$
In (C.4) we used
$$\big\|x(x^{\top}\nabla f(y) + \nabla f(y)^{\top}x) - y(y^{\top}\nabla f(y) + \nabla f(y)^{\top}y)\big\|_F \le \big\|x\big((x-y)^{\top}\nabla f(y) + \nabla f(y)^{\top}(x-y)\big)\big\|_F + \big\|(x-y)(y^{\top}\nabla f(y) + \nabla f(y)^{\top}y)\big\|_F \le 4L_n\|x - y\|_F.$$
The proof is completed.

C.1 Comparison of different Lipschitz-type inequalities
Using Taylor's theorem [1, Lemma 7.4.7], $L'_g$ corresponds to the largest eigenvalue of the Riemannian Hessian. According to [2], it follows for any $\eta \in T_x\mathcal{M}$ that
$$\mathrm{Hess}\,f(x)[\eta] = \mathcal{P}_{T_x\mathcal{M}}\big(\mathrm{D}\,\mathrm{grad}\,f(x)[\eta]\big) = \mathcal{P}_{T_x\mathcal{M}}\big(\nabla^2 f(x)\eta\big) - \eta x^{\top}\mathcal{P}_{N_x}\nabla f(x) - \frac12 x\big(\eta^{\top}\mathcal{P}_{N_x}\nabla f(x) + (\mathcal{P}_{N_x}\nabla f(x))^{\top}\eta\big), \tag{C.5}$$
where $\mathcal{P}_{N_x}$ is the orthogonal projection onto the normal space $N_x\mathcal{M}$. Since $\frac12 x\big(\eta^{\top}\mathcal{P}_{N_x}\nabla f(x) + (\mathcal{P}_{N_x}\nabla f(x))^{\top}\eta\big) \in N_x\mathcal{M}$, we have
$$\langle\eta, \mathrm{Hess}\,f(x)[\eta]\rangle = \langle\eta, \nabla^2 f(x)\eta\rangle - \langle\eta, \eta x^{\top}\mathcal{P}_{N_x}\nabla f(x)\rangle = \langle\eta, \nabla^2 f(x)\eta\rangle - \Big\langle\eta, \eta\,\frac12\big(x^{\top}\nabla f(x) + \nabla f(x)^{\top}x\big)\Big\rangle, \tag{C.6}$$
where we use $\mathcal{P}_{T_x\mathcal{M}}\big(\nabla^2 f(x)\eta\big) = \nabla^2 f(x)\eta - \mathcal{P}_{N_x}\big(\nabla^2 f(x)\eta\big)$. Therefore, we get
$$L'_g \le \lambda_{\max}\big(\nabla^2 f(x)\big) + \max_{x\in\mathrm{St}(d,r)}\|\nabla f(x)\|_2 = L + L_n. \tag{C.7}$$
The restricted inequality proposed in [7] is related to the pullback function $g(\xi) := f(\mathcal{R}_x(\xi))$, whose Lipschitz constant $\tilde L_g$ relies on the retraction. Specifically, $\tilde L_g = M_1^2 L + 2M_2L_n$, where $M_1$ is a constant related to the retraction, and $M_2$ and $L_n$ are the same constants as in Lemma 2.3.

C.2 Technical lemmas
Lemma C.1. [9] For any $\mathbf{x} \in \mathrm{St}(d,r)^n$, let $\hat x = \frac1n\sum_{i=1}^n x_i$ be the Euclidean mean and denote $\hat{\mathbf{x}} = \mathbf{1}_n \otimes \hat x$. Similarly, let $\bar{\mathbf{x}} = \mathbf{1}_n \otimes \bar x$, where $\bar x$ is the IAM defined in (IAM). Moreover, if $\|\mathbf{x} - \bar{\mathbf{x}}\|_F \le \sqrt n/2$, one has
$$\|\bar{\mathbf{x}} - \hat{\mathbf{x}}\|_F \le \frac{\sqrt r\,\|\mathbf{x} - \bar{\mathbf{x}}\|_F^2}{n}. \tag{P1}$$
The following lemma will be useful to bound the Euclidean distance between the two average points $\bar x_k$ and $\bar x_{k+1}$.

Lemma C.2. [9] Suppose $\mathbf{x}, \mathbf{y} \in \mathcal{N}_1$, where $\mathcal{N}_1$ is defined in (3.4). Then we have
$$\|\bar x - \bar y\|_F \le \frac{1}{1-\delta_1}\|\hat x - \hat y\|_F,$$
where $\bar x$ and $\bar y$ are the IAM of $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$, respectively.

We also need the following bounds for $\mathrm{grad}\,\varphi_t(\mathbf{x})$.

Lemma C.3. [9] For any $\mathbf{x} \in \mathrm{St}(d,r)^n$, it follows that
$$\Big\|\sum_{i=1}^n \mathrm{grad}\,\varphi_{t,i}(\mathbf{x})\Big\|_F \le L_t\,\|\mathbf{x} - \bar{\mathbf{x}}\|_F^2 \tag{C.8}$$
and
$$\|\mathrm{grad}\,\varphi_t(\mathbf{x})\|_F \le L_t\,\|\mathbf{x} - \bar{\mathbf{x}}\|_F, \tag{C.9}$$
where $L_t$ is the constant given in (B.2). Moreover, suppose $\mathbf{x} \in \mathcal{N}_2$, where $\mathcal{N}_2$ is defined by (3.5). We then have
$$\max_{i\in[n]}\|\mathrm{grad}\,\varphi_{t,i}(\mathbf{x})\|_F \le \delta_2. \tag{C.10}$$
Applying Lemma C.2 to the update rule of our algorithms gives the following lemma.

Lemma C.4. Suppose $\mathbf{x}_k \in \mathcal{N}$, $\mathbf{x}_{k+1} \in \mathcal{N}$ and $x_{i,k+1} = \mathcal{R}_{x_{i,k}}(-\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf{x}_k) + \beta u_{i,k})$, where $u_{i,k} \in T_{x_{i,k}}\mathcal{M}$, $0 \le \alpha \le \frac1M$ and $0 \le \beta$. Let $\mathbf{u}_k^{\top} = (u_{1,k}^{\top}\ \ldots\ u_{n,k}^{\top})$ and $\hat u_k = \frac1n\sum_{i=1}^n u_{i,k}$. It follows that
$$\|\bar x_k - \bar x_{k+1}\|_F \le \frac{1}{1-\delta_1}\Big(\frac{2L_t^2M\alpha^2 + L_t\alpha}{n}\|\mathbf{x}_k - \bar{\mathbf{x}}_k\|_F^2 + \beta\|\hat u_k\|_F + \frac{2M\beta^2}{n}\|\mathbf{u}_k\|_F^2\Big).$$
Proof.
From Lemma 2.3 and Lemma C.3, we have
$$\begin{aligned}
\|\hat{\mathbf x}_k - \hat{\mathbf x}_{k+1}\|_F &\le \Big\|\hat{\mathbf x}_k + \frac1n\sum_{i=1}^n(-\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) + \beta u_{i,k}) - \hat{\mathbf x}_{k+1}\Big\|_F + \Big\|\frac1n\sum_{i=1}^n(-\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) + \beta u_{i,k})\Big\|_F\\
&\overset{(2.4)}{\le} \frac Mn\sum_{i=1}^n\|\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \beta u_{i,k}\|^2 + \alpha\Big\|\frac1n\sum_{i=1}^n\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k)\Big\|_F + \beta\|\hat u_k\|_F\\
&\le \frac{2M\alpha^2}{n}\|\mathrm{grad}\,\varphi_t(\mathbf x_k)\|^2 + \frac{2M\beta^2}{n}\|\mathbf u_k\|^2 + \alpha\Big\|\frac1n\sum_{i=1}^n\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k)\Big\|_F + \beta\|\hat u_k\|_F\\
&\overset{(C.8),(C.9)}{\le} \frac{2L_t^2M\alpha^2 + L_t\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \frac{2M\beta^2}{n}\|\mathbf u_k\|^2 + \beta\|\hat u_k\|_F.
\end{aligned}$$
Therefore, it follows from Lemma C.2 that
$$\|\bar x_k - \bar x_{k+1}\|_F \le \frac{1}{1-\delta_1}\|\hat{\mathbf x}_k - \hat{\mathbf x}_{k+1}\|_F \le \frac{1}{1-\delta_1}\Big(\frac{2L_t^2M\alpha^2 + L_t\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \beta\|\hat u_k\|_F + \frac{2M\beta^2}{n}\|\mathbf u_k\|^2\Big),$$
where we use the fact that $\alpha \le \frac1M$.

D Proofs for Section 4
We use the notations
$$\mathbf v_k = [v_{1,k}^{\top}\ \ldots\ v_{n,k}^{\top}]^{\top},\qquad \hat v_k = \frac1n\sum_{i=1}^n v_{i,k},\qquad g_{i,k} = \mathrm{grad}\,f_i(x_{i,k})\qquad\text{and}\qquad \hat g_k = \frac1n\sum_{i=1}^n g_{i,k}.$$
The following lemma is useful to show $\mathbf x_k \in \mathcal N$ for all $k$.

Lemma D.1. [9, Lemma 11] Given any $\mathbf x \in \mathcal N_2$, where $\mathcal N_2$ is defined in (3.5), if $t \ge \lceil\log_{\sigma_2}(\tfrac{1}{2\sqrt n})\rceil$, we have
$$\max_{i\in[n]}\Big\|\sum_{j=1}^n\big(W^t_{ij} - \tfrac1n\big)x_j\Big\|_F \le \frac{\delta_2}{2}. \tag{D.1}$$
Lemma D.2.
Under the same conditions of Fact B.1, if $\mathbf x_k \in \mathcal N$ and $x_{i,k+1} = \mathcal R_{x_{i,k}}(-\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \beta_k v_{i,k})$ for all $i \in [n]$, where $v_{i,k} \in T_{x_{i,k}}\mathcal M$, the following holds:
$$\|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|_F \le \rho_t\|\mathbf x_k - \bar{\mathbf x}_k\|_F + \beta_k\|\mathbf v_k\|_F.$$
Proof.
By the definition of the IAM, we have
$$\|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|^2 \le \|\mathbf x_{k+1} - \bar{\mathbf x}_k\|^2 = \sum_{i=1}^n\big\|\mathcal R_{x_{i,k}}\big(-\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \beta_kv_{i,k}\big) - \bar x_k\big\|^2 \le \sum_{i=1}^n\big\|x_{i,k} - \alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \beta_kv_{i,k} - \bar x_k\big\|^2. \tag{D.2}$$
Let $\mathbf v_k = [v_{1,k}^{\top}\ \ldots\ v_{n,k}^{\top}]^{\top}$. Then we get
$$\|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|_F \le \|\mathbf x_k - \alpha\,\mathrm{grad}\,\varphi_t(\mathbf x_k) - \beta_k\mathbf v_k - \bar{\mathbf x}_k\|_F \le \|\mathbf x_k - \alpha\,\mathrm{grad}\,\varphi_t(\mathbf x_k) - \bar{\mathbf x}_k\|_F + \beta_k\|\mathbf v_k\|_F. \tag{D.3}$$
By combining inequality (B.3) of Fact B.1, we get
$$\|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|_F \le \rho_t\|\mathbf x_k - \bar{\mathbf x}_k\|_F + \beta_k\|\mathbf v_k\|_F. \tag{D.4}$$
The proof is completed.

Proof of Lemma 4.1. We prove that $\mathbf x_k \in \mathcal N$ for all $k \ge 0$ by induction. Given $\mathbf x_k \in \mathcal N$, let us first show $\mathbf x_{k+1} \in \mathcal N_1$. Note that $\|\mathbf v_k\|_F \le \sqrt nD$. Using Lemma D.2 yields
$$\|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|_F \le \rho_t\|\mathbf x_k - \bar{\mathbf x}_k\|_F + \beta_k\sqrt nD \le \rho_t\sqrt n\delta_1 + \beta_k\sqrt nD \le \sqrt n\delta_1, \tag{D.5}$$
where the last inequality follows from $\beta_k \le \frac{(1-\rho_t)\delta_1}{D}$. Hence $\mathbf x_{k+1} \in \mathcal N_1$. Secondly, let us verify $\mathbf x_{k+1} \in \mathcal N_2$. It follows from $\beta_k \le \frac{\alpha\delta_2}{5D}$ and $\alpha \le 1$ that
$$\|\mathbf x_{k+1} - \mathbf x_k\|_{F,\infty} \overset{(2.4)}{\le} \max_{i\in[n]}\|\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k)\|_F + \beta_kD \overset{(C.10)}{\le} \alpha\delta_2 + \tfrac15\alpha\delta_2 \le 2\alpha\delta_2.$$
Using Lemma C.4 yields
$$\|\bar x_k - \bar x_{k+1}\|_F \le \frac{1}{1-\delta_1}\Big(\frac{2L_t^2M\alpha^2 + L_t\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \beta_k\|\hat v_k\|_F + \frac{2M\beta_k^2}{n}\|\mathbf v_k\|^2\Big) \le \frac{1}{1-\delta_1}\big[(2L_t^2M\alpha^2 + L_t\alpha)\delta_1^2 + \beta_kD + 2M\beta_k^2D^2\big].$$
Furthermore, since $L_t \le 2$, $\beta_k \le \frac{\alpha\delta_2}{5D}$, $\alpha \le \frac1M$ and $\delta_1 \le \sqrt r\,\delta_2$, we get
$$\|\bar x_k - \bar x_{k+1}\|_F \le \frac{1}{1-\delta_1}\Big(10\sqrt r\,\alpha\delta_1\delta_2 + \frac25\alpha\delta_2\Big). \tag{D.6}$$
Then one has
$$\|x_{i,k+1} - \bar x_{k+1}\|_F \le \|x_{i,k+1} - \bar x_k\|_F + \|\bar x_k - \bar x_{k+1}\|_F \overset{(2.4)}{\le} \|x_{i,k} - \alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \beta_kv_{i,k} - \bar x_k\|_F + \|\bar x_k - \bar x_{k+1}\|_F \le \|x_{i,k} - \alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \bar x_k\|_F + \frac15\alpha\delta_2 + \|\bar x_k - \bar x_{k+1}\|_F. \tag{D.7}$$
Now, we proceed along the same lines as the proof of [9, Lemma 13]. We have
$$\mathrm{grad}\,\varphi_{t,i}(\mathbf x) = x_i - \sum_{j=1}^n W^t_{ij}x_j - \frac12 x_i\sum_{j=1}^n W^t_{ij}(x_i - x_j)^{\top}(x_i - x_j) \tag{D.8}$$
and
$$\begin{aligned}
\|x_{i,k} - \alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \bar x_k\|_F &\overset{(D.8)}{=} \Big\|(1-\alpha)(x_{i,k} - \bar x_k) + \alpha(\hat x_k - \bar x_k) + \alpha\sum_{j=1}^n\big(W^t_{ij} - \tfrac1n\big)x_{j,k} + \frac\alpha2 x_{i,k}\sum_{j=1}^n W^t_{ij}(x_{i,k} - x_{j,k})^{\top}(x_{i,k} - x_{j,k})\Big\|_F\\
&\le (1-\alpha)\delta_2 + \alpha\|\hat x_k - \bar x_k\|_F + \alpha\Big\|\sum_{j=1}^n\big(W^t_{ij} - \tfrac1n\big)x_{j,k}\Big\|_F + \frac\alpha2\Big\|\sum_{j=1}^n W^t_{ij}(x_{i,k} - x_{j,k})^{\top}(x_{i,k} - x_{j,k})\Big\|_F \tag{D.9}\\
&\le (1-\alpha)\delta_2 + \alpha\sqrt r\,\delta_1^2 + \alpha\Big\|\sum_{j=1}^n\big(W^t_{ij} - \tfrac1n\big)x_{j,k}\Big\|_F + 2\alpha\delta_2^2 \tag{D.10}\\
&\le \big(1 - \tfrac\alpha2\big)\delta_2 + \alpha\sqrt r\,\delta_1^2 + 2\alpha\delta_2^2, \tag{D.11}
\end{aligned}$$
where (D.9) follows from $\alpha \in [0, 1]$, and (D.11) uses (D.1). Combining (D.7), (D.11) and (D.6) gives
$$\|x_{i,k+1} - \bar x_{k+1}\|_F \le \big(1 - \tfrac\alpha2\big)\delta_2 + \alpha\sqrt r\,\delta_1^2 + 2\alpha\delta_2^2 + \frac15\alpha\delta_2 + \frac{1}{1-\delta_1}\Big(10\sqrt r\,\alpha\delta_1\delta_2 + \frac25\alpha\delta_2\Big). \tag{D.12}$$
Therefore, substituting the conditions (3.6) on $\delta_1$ and $\delta_2$ into (D.12) yields $\|x_{i,k+1} - \bar x_{k+1}\|_F \le \delta_2$. The proof of the first statement is completed. Finally, it follows from (D.5) that
$$\|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|_F \le \rho_t\|\mathbf x_k - \bar{\mathbf x}_k\|_F + \beta_k\sqrt nD \le \rho_t^{\,k+1}\|\mathbf x_0 - \bar{\mathbf x}_0\|_F + \sqrt nD\sum_{l=0}^{k}\rho_t^{\,k-l}\beta_l. \tag{D.13}$$
An immediate consequence of Lemma 4.1 is that the rate of consensus is $\|\mathbf x_k - \bar{\mathbf x}_k\| = O(\beta_k)$ if $\beta_k = O(k^{-p})$.
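This remark can be illustrated with the scalar surrogate of the consensus recursion, $e_{k+1} = \rho_t e_k + D\beta_k$: for $\beta_k \propto (k+1)^{-p}$ the ratio $e_k/\beta_k$ stays bounded (by roughly $D/(1-\rho_t)$), in line with Lemma D.3. The numerical values of $\rho_t$, $D$ and $p$ below are arbitrary:

```python
import numpy as np

rho, D, p = 0.9, 5.0, 0.5                # illustrative contraction rate, gradient bound, decay power
K = 2000
beta = 1.0 / (np.arange(K) + 1.0) ** p   # diminishing stepsizes beta_k = (k + 1)^(-p)

e = np.empty(K)
e[0] = 1.0
for k in range(K - 1):
    e[k + 1] = rho * e[k] + D * beta[k]  # scalar surrogate of the consensus-error recursion

ratio = e / beta                         # the quantity a_k = e_k / beta_k from the proof
assert ratio.max() < 2 * D / (1 - rho)   # a_k remains bounded, so e_k = O(beta_k)
```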
The proof is similar to that of [25, Proposition 8]; we provide it for completeness.

Lemma D.3. Under Assumptions 1 to 4, for Algorithm 1, if $\mathbf x_0 \in \mathcal N$, $0 < \alpha \le \min\{\frac{\Phi}{2L_t}, 1, \frac1M\}$, $t \ge \lceil\log_{\sigma_2}(\tfrac{1}{2\sqrt n})\rceil$ and
$$\beta_k = \min\Big\{\frac{\alpha\delta_2}{5D}\cdot\frac{1}{(k+1)^p},\ \frac{(1-\rho_t)\delta_1}{D}\Big\},\quad p \in (0, 1], \tag{D.14}$$
then there exists a constant $C > 0$ such that $\frac{1}{\sqrt n}\|\mathbf x_k - \bar{\mathbf x}_k\|_F \le CD\beta_k$ for any $k \ge 0$, where $C$ is independent of $D$ and $n$.

Proof of Lemma D.3. The proof relies on Lemma 4.1. Let $a_k := \frac{\|\mathbf x_k - \bar{\mathbf x}_k\|_F}{\sqrt n\,\beta_k}$. It follows from (D.13) that
$$a_{k+1} \le \rho_t a_k + D\cdot\frac{\beta_k}{\beta_{k+1}} \le \rho_t^{\,k+1-K}a_K + D\sum_{l=K}^{k}\rho_t^{\,k-l}\frac{\beta_l}{\beta_{l+1}}. \tag{D.15}$$
Recall that $\beta_0 = O(1/D)$ and $\frac{1}{\sqrt n}\|\mathbf x_0 - \bar{\mathbf x}_0\| \le \delta_1$; it follows that $a_0 \le \delta_1/\beta_0 = O(D)$. Since $\lim_{k\to\infty}\frac{\beta_{k+1}}{\beta_k} = 1$, there exists a sufficiently large $K$ such that $\frac{\beta_l}{\beta_{l+1}} \le 2$ for all $l \ge K$. For $0 \le k \le K$, there exists some $C' > 0$ such that $a_k \le C'D$, where $C'$ is independent of $D$ and $n$. For $k \ge K$, using (D.15) gives $a_k \le CD$, where $C = C' + \frac{2}{1-\rho_t}$. Hence, we get $\frac{1}{\sqrt n}\|\mathbf x_k - \bar{\mathbf x}_k\| \le CD\beta_k$ for all $k \ge 0$, where $C = O\big(\frac{1}{1-\rho_t}\big)$.

Lemma D.4.
Under Assumptions 1 to 4, suppose $\mathbf x_k \in \mathcal N$, $t \ge \lceil\log_{\sigma_2}(\tfrac{1}{2\sqrt n})\rceil$ and $0 < \alpha \le \min\{\frac{\Phi}{2L_t}, 1, \frac1M\}$. If $x_{i,k+1} = \mathcal R_{x_{i,k}}(-\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \beta_kv_{i,k})$, $0 < \beta_k \le \min\{\frac{1}{L_g}, \frac{\alpha\delta_2}{5D}\}$ and $\beta_k \ge \beta_{k+1}$, where $v_{i,k}$ satisfies Assumption 3 and $L_g$ is given in Lemma 2.4, it follows that
$$\mathbb E_kf(\bar x_{k+1}) \le f(\bar x_k) - \frac{\beta_k}{4}\|\hat g_k\|_F^2 - \frac{\beta_k}{4}\|\mathrm{grad}\,f(\bar x_k)\|_F^2 + \frac{3L_g\Xi^2}{n}\beta_k^2 + \big(C^2D^2L_G^2 + T_1D^4\big)\beta_k^3 + T_2L_gD^2\beta_k^2, \tag{D.16}$$
where $L_G$ is given in Lemma 2.4, $C$ is given in Lemma D.3, $T_1 = 2(4\sqrt r + 6\alpha)^2C^4 + 8M^2$ and $T_2 = 20\alpha^2C^2 + 9M^2$. Note that the variance term is of order $O(\frac{\Xi^2}{n}\beta_k^2)$, since the gradient batch size is $n$.

Proof of Lemma D.4. Denote the conditional expectations $\mathbb E_{i,k}v_{i,k} := \mathbb E[v_{i,k}\mid x_{i,k}]$ and $\mathbb E_k := \mathbb E[\,\cdot\mid\mathbf x_k]$. By invoking Lemma 2.4, we have
$$\begin{aligned}
\mathbb E_kf(\bar x_{k+1}) &\le f(\bar x_k) + \langle\mathrm{grad}\,f(\bar x_k), \mathbb E_k\bar x_{k+1} - \bar x_k\rangle + \frac{L_g}{2}\mathbb E_k\|\bar x_{k+1} - \bar x_k\|^2\\
&= f(\bar x_k) - \langle\mathrm{grad}\,f(\bar x_k), \beta_k\hat g_k\rangle + \langle\mathrm{grad}\,f(\bar x_k), \mathbb E_k[\bar x_{k+1} - \bar x_k + \beta_k\hat v_k]\rangle + \frac{L_g}{2}\mathbb E_k\|\bar x_{k+1} - \bar x_k\|^2\\
&= f(\bar x_k) - \frac{\beta_k}{2}\|\mathrm{grad}\,f(\bar x_k)\|^2 - \frac{\beta_k}{2}\|\hat g_k\|^2 + \frac{\beta_k}{2}\|\mathrm{grad}\,f(\bar x_k) - \hat g_k\|^2 + \langle\mathrm{grad}\,f(\bar x_k), \mathbb E_k[\bar x_{k+1} - \bar x_k + \beta_k\hat g_k]\rangle + \frac{L_g}{2}\mathbb E_k\|\bar x_{k+1} - \bar x_k\|^2,
\end{aligned} \tag{D.17}$$
where $\hat v_k = \frac1n\sum_{i=1}^n v_{i,k}$ and we use $\mathbb E_k\hat v_k = \hat g_k$ in the first equality. Note that for $\beta_k > 0$, we have
$$\langle\mathrm{grad}\,f(\bar x_k), \mathbb E_k[\bar x_{k+1} - \bar x_k + \beta_k\hat g_k]\rangle \le \frac{\beta_k}{4}\|\mathrm{grad}\,f(\bar x_k)\|^2 + \frac{1}{\beta_k}\big\|\mathbb E_k[\bar x_{k+1} - \bar x_k + \beta_k\hat g_k]\big\|^2.$$
Plugging this into (D.17) yields
$$\mathbb E_kf(\bar x_{k+1}) \le f(\bar x_k) - \frac{\beta_k}{2}\|\hat g_k\|^2 - \frac{\beta_k}{4}\|\mathrm{grad}\,f(\bar x_k)\|^2 + \underbrace{\frac{\beta_k}{2}\|\mathrm{grad}\,f(\bar x_k) - \hat g_k\|^2}_{:=a_1} + \underbrace{\frac{1}{\beta_k}\big\|\mathbb E_k[\bar x_{k+1} - \bar x_k + \beta_k\hat v_k]\big\|^2}_{:=a_2} + \underbrace{\frac{L_g}{2}\mathbb E_k\|\bar x_{k+1} - \bar x_k\|^2}_{:=a_3}. \tag{D.18}$$
For $a_1$, using Lemma 2.4 implies
$$a_1 \le \frac{\beta_k}{2}\cdot\frac1n\sum_{i=1}^n\|\mathrm{grad}\,f_i(x_{i,k}) - \mathrm{grad}\,f_i(\bar x_k)\|_F^2 \overset{(2.6)}{\le} \frac{\beta_kL_G^2}{2n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2.$$
For $a_2$: from Lemma 4.1, we have $\mathbf x_{k+1} \in \mathcal N$. One has
$$\|\bar x_{k+1} - \bar x_k + \beta_k\hat v_k\|_F \le \|\bar x_k - \hat x_k\|_F + \|\bar x_{k+1} - \hat x_{k+1}\|_F + \|\hat x_k - \beta_k\hat v_k - \hat x_{k+1}\|_F \overset{(P1)}{\le} \frac{\sqrt r}{n}\big(\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|^2\big) + \|\hat x_k - \beta_k\hat v_k - \hat x_{k+1}\|_F \le \frac{2\sqrt r}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \|\hat x_k - \beta_k\hat v_k - \hat x_{k+1}\|_F, \tag{D.19}$$
where we use $\|\mathbf x_k - \bar{\mathbf x}_k\| \ge \|\mathbf x_{k+1} - \bar{\mathbf x}_{k+1}\|$ in the last inequality. For the second term, since $v_{i,k} \in T_{x_{i,k}}\mathcal M$, we have
$$\|\hat x_k - \beta_k\hat v_k - \hat x_{k+1}\|_F \le \frac1n\sum_{i=1}^n\|x_{i,k} - \alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) - \beta_kv_{i,k} - x_{i,k+1}\|_F + \frac\alpha n\Big\|\sum_{i=1}^n\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k)\Big\|_F \overset{(2.4)}{\le} \frac Mn\sum_{i=1}^n\|\alpha\,\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k) + \beta_kv_{i,k}\|^2 + \frac\alpha n\Big\|\sum_{i=1}^n\mathrm{grad}\,\varphi_{t,i}(\mathbf x_k)\Big\|_F \overset{(C.8),(C.9)}{\le} \frac{2L_t^2M\alpha^2 + L_t\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \frac{2M\beta_k^2}{n}\|\mathbf v_k\|^2 \le \frac{10\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \frac{2M\beta_k^2}{n}\|\mathbf v_k\|^2, \tag{D.20}$$
where we use $\alpha \le \frac1M$ and $L_t \le 2$. Therefore,
$$\|\bar x_{k+1} - \bar x_k + \beta_k\hat v_k\|^2 \le 2\Big(\frac{4\sqrt r + 10\alpha}{n}\Big)^2\|\mathbf x_k - \bar{\mathbf x}_k\|^4 + 2\Big(\frac{2M\beta_k^2}{n}\Big)^2\|\mathbf v_k\|^4. \tag{D.21}$$
Then, using Jensen's inequality and $\|\mathbf v_k\|^2 \le nD^2$ implies
$$a_2 \le \frac{1}{\beta_k}\mathbb E_k\big[\|\bar x_{k+1} - \bar x_k + \beta_k\hat v_k\|^2\big] \le \frac{2}{\beta_k}\Big(\frac{4\sqrt r + 10\alpha}{n}\Big)^2\|\mathbf x_k - \bar{\mathbf x}_k\|^4 + 8M^2D^4\beta_k^3.$$
Thirdly, invoking Lemma C.4 yields
$$\|\bar x_k - \bar x_{k+1}\|_F \le \frac{1}{1-\delta_1}\Big(\frac{10\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \frac{2M\beta_k^2}{n}\|\mathbf v_k\|^2 + \beta_k\|\hat v_k\|_F\Big).$$
Hence, it follows that
$$\begin{aligned}
\frac{2}{L_g}a_3 = \mathbb E_k\|\bar x_{k+1} - \bar x_k\|^2 &\le \frac{2}{(1-\delta_1)^2}\Big(\frac{10\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + 2MD^2\beta_k^2\Big)^2 + \frac{2}{(1-\delta_1)^2}\beta_k^2\,\mathbb E_k\|\hat v_k\|^2\\
&\overset{(i)}{=} \frac{2}{(1-\delta_1)^2}\Big(\frac{10\alpha}{n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + 2MD^2\beta_k^2\Big)^2 + \frac{2}{(1-\delta_1)^2}\beta_k^2\Big(\frac{1}{n^2}\sum_{i=1}^n\mathbb E_k\|v_{i,k} - g_{i,k}\|^2 + \|\hat g_k\|^2\Big)\\
&\overset{(ii)}{\le} \frac{4}{(1-\delta_1)^2}\Big(\frac{100\alpha^2}{n^2}\|\mathbf x_k - \bar{\mathbf x}_k\|^4 + 4M^2D^4\beta_k^4\Big) + \frac{2}{(1-\delta_1)^2}\beta_k^2\Big(\frac{\Xi^2}{n} + \|\hat g_k\|^2\Big),
\end{aligned}$$
where $(i)$ and $(ii)$ hold by the independence of the $v_{i,k}$ and the bounded variance of Assumption 3, respectively. Therefore, combining $a_1$, $a_2$ and $a_3$ with (D.18) implies that
$$\begin{aligned}
\mathbb E_kf(\bar x_{k+1}) &\le f(\bar x_k) - \Big(\frac{\beta_k}{2} - \frac{L_g\beta_k^2}{(1-\delta_1)^2}\Big)\|\hat g_k\|^2 - \frac{\beta_k}{4}\|\mathrm{grad}\,f(\bar x_k)\|^2 + \frac{\beta_kL_G^2}{2n}\|\mathbf x_k - \bar{\mathbf x}_k\|^2 + \frac{2}{\beta_k}\Big(\frac{4\sqrt r + 10\alpha}{n}\Big)^2\|\mathbf x_k - \bar{\mathbf x}_k\|^4\\
&\quad + 8M^2D^4\beta_k^3 + \frac{2L_g}{(1-\delta_1)^2}\Big(\frac{100\alpha^2}{n^2}\|\mathbf x_k - \bar{\mathbf x}_k\|^4 + 4M^2D^4\beta_k^4\Big) + \frac{L_g\Xi^2}{(1-\delta_1)^2n}\beta_k^2.
\end{aligned}$$
By Lemma D.3, we have $\|\mathbf x_k - \bar{\mathbf x}_k\|^2 \le nC^2D^2\beta_k^2$.
It follows that E k f (¯ x k +1 ) ≤ f (¯ x k ) − ( β k − L g β k (1 − δ ) ) (cid:107) ˆ g k (cid:107) − β k (cid:107) grad f (¯ x k ) (cid:107) + L g Ξ (1 − δ ) n β k + (cid:20) CD L G (cid:0) √ r + 10 α ) C + 8 M (cid:1) D (cid:21) β k + 2 L g (1 − δ ) (cid:2) α C D + 4 M D (cid:3) β k ≤ f (¯ x k ) − β k (cid:107) ˆ g k (cid:107) − β k (cid:107) grad f (¯ x k ) (cid:107) + 3 L g Ξ n β k + (cid:20) CD L G (cid:0) √ r + 10 α ) C + 8 M (cid:1) D (cid:21) β k + (cid:0) α C D + 9 M D (cid:1) L g β k , where we use − δ ) ≤ .
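For intuition, one step of the update analyzed in Lemma D.4 can be sketched numerically. This is an illustrative sketch only: the QR-based retraction, the fully averaged mixing matrix, the random stand-in for a stochastic gradient, and the stepsize values are assumptions for the demo, not the exact objects or constants of the lemma.

```python
import numpy as np

def qr_retraction(x, v):
    # A retraction on the Stiefel manifold via QR factorization.
    q, rmat = np.linalg.qr(x + v)
    # Fix column signs so the factorization is unique (diag of R positive).
    signs = np.sign(np.sign(np.diag(rmat)) + 0.5)
    return q * signs

def tangent_projection(x, y):
    # P_{T_x M}(y) = y - x * sym(x^T y) for M = St(d, r).
    xty = x.T @ y
    return y - x @ ((xty + xty.T) / 2)

rng = np.random.default_rng(0)
d, r, n = 8, 3, 4                # ambient dimensions and number of agents
W = np.full((n, n), 1.0 / n)     # illustrative doubly stochastic mixing matrix
alpha, beta = 0.5, 0.1           # illustrative stepsizes

# Random feasible iterates x_i on St(d, r).
xs = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(n)]

new_xs = []
for i in range(n):
    # Riemannian consensus direction: tangent projection of x_i - sum_j W_ij x_j.
    consensus = tangent_projection(xs[i], xs[i] - sum(W[i, j] * xs[j] for j in range(n)))
    noisy_grad = rng.standard_normal((d, r))    # stand-in for a stochastic gradient of f_i
    v = tangent_projection(xs[i], noisy_grad)   # Riemannian stochastic gradient v_{i,k}
    new_xs.append(qr_retraction(xs[i], -alpha * consensus - beta * v))

# Feasibility is preserved: every new iterate stays on the Stiefel manifold.
for x in new_xs:
    assert np.allclose(x.T @ x, np.eye(r), atol=1e-8)
```

The check at the end illustrates why a retraction is used at all: however noisy the step direction, the iterates never leave the (non-convex) feasible set.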
Proof of Theorem 4.2. Using (D.16) implies
$$\mathbb{E}_k f(\bar{x}_{k+1}) \leq f(\bar{x}_k) - \frac{\beta_k}{4}\|\mathrm{grad} f(\bar{x}_k)\|_F^2 + \frac{3L_g\Xi^2}{n}\beta_k^2 + \big(C^2D^2L_g^2 + T_1D^2\big)\beta_k^3 + T_2L_gD^2\beta_k^2. \quad (D.22)$$
Taking the total expectation and telescoping give us, for any $K > 0$,
$$\frac{1}{4}\sum_{k=0}^K \beta_k\,\mathbb{E}\|\mathrm{grad} f(\bar{x}_k)\|_F^2 \leq f(\bar{x}_0) - f^* + \frac{3L_g\Xi^2}{n}\sum_{k=0}^K\beta_k^2 + \big(C^2D^2L_g^2 + T_1D^2\big)\sum_{k=0}^K\beta_k^3 + T_2L_gD^2\sum_{k=0}^K\beta_k^2,$$
where $f^* = \min_{x\in\mathrm{St}(d,r)} f(x)$. Dividing both sides by $\frac{1}{4}\sum_{k=0}^K\beta_k$ yields
$$\min_{k=0,\ldots,K}\mathbb{E}\|\mathrm{grad} f(\bar{x}_k)\|_F^2 \leq \frac{4\big(f(\bar{x}_0) - f^*\big) + \frac{12L_g\Xi^2}{n}\sum_{k=0}^K\beta_k^2 + 4\big(C^2D^2L_g^2 + T_1D^2\big)\sum_{k=0}^K\beta_k^3 + 4T_2L_gD^2\sum_{k=0}^K\beta_k^2}{\sum_{k=0}^K\beta_k}.$$
Let $\tilde{\beta} = \min\{1/L_g, (1-\rho_t)/D\}$ and notice that with $\beta_k = O\big(\min\{(1-\rho_t)/D, 1/L_G\}\cdot\frac{1}{\sqrt{k+1}}\big)$ we have
$$\frac{\sum_{k=0}^K\beta_k^2}{\sum_{k=0}^K\beta_k} = O\Big(\frac{\tilde{\beta}\ln(K+1)}{\sqrt{K+1}}\Big), \qquad \frac{\sum_{k=0}^K\beta_k^3}{\sum_{k=0}^K\beta_k} = O\Big(\frac{\tilde{\beta}^2}{\sqrt{K+1}}\Big), \qquad \frac{1}{\sum_{k=0}^K\beta_k} = O\Big(\frac{1}{\tilde{\beta}\sqrt{K+1}}\Big).$$
The proof is completed.

The following corollary follows [23]; it gives the convergence result when a constant stepsize $\beta_k$ is used.

Corollary D.5.
Under Assumptions 1 to 4, suppose $x_k \in \mathcal{N}$, $t \geq \lceil\log_{\sigma_2}(\sqrt{n})\rceil$ and $0 < \alpha \leq \bar{\alpha}$. If the constant stepsize $\beta_k \equiv \beta = \frac{1}{L_G + \Xi\sqrt{(K+1)/n}}$ is used, where
$$K+1 \geq \max\Big\{\frac{n}{\Xi^2}\Big(\max\Big\{L_G, \frac{D}{\alpha\delta}, \frac{D}{(1-\rho_t)\delta}\Big\}\Big)^2,\ \frac{n}{\Xi^2}\Big(\frac{C^2D^2L_g^2 + (2T_1 + T_2)D^2}{f(\bar{x}_0) - f^*} + 3L_G\Big)^2\Big\},$$
it follows that
$$\min_{k=0,\ldots,K}\mathbb{E}\|\mathrm{grad} f(\bar{x}_k)\|_F^2 \leq \frac{4L_G\big(f(\bar{x}_0) - f^*\big)}{K+1} + \frac{8\big(f(\bar{x}_0) - f^* + L_G\big)\Xi}{\sqrt{n(K+1)}}.$$
Proof.
Since $K+1 \geq \frac{n}{\Xi^2}\big(\max\{L_G, \frac{D}{\alpha\delta}, \frac{D}{(1-\rho_t)\delta}\}\big)^2$, we have $\beta \leq \min\{\frac{1}{L_G}, \frac{\alpha\delta}{D}, \frac{(1-\rho_t)\delta}{D}\}$, and therefore $x_k \in \mathcal{N}$ for all $k = 0, 1, \ldots, K$. Using Theorem 4.2, we have
$$\begin{aligned}
\min_{k=0,\ldots,K}\mathbb{E}\|\mathrm{grad} f(\bar{x}_k)\|_F^2 &\leq \frac{4\big(f(\bar{x}_0) - f^*\big)}{(K+1)\beta} + \frac{6L_g\beta\Xi^2}{n} + \big(2C^2D^2L_g^2 + 4T_1D^2\big)\beta^2 + 4T_2L_gD^2\beta\\
&\leq \frac{4L_G\big(f(\bar{x}_0) - f^*\big)}{K+1} + \frac{4\big(f(\bar{x}_0) - f^*\big)\Xi}{\sqrt{n(K+1)}} + \frac{6L_G\Xi^2}{nL_G + \Xi\sqrt{n(K+1)}} + \frac{2C^2D^2L_g^2 + (4T_1 + 2T_2)D^2}{2L_G + \Xi\sqrt{(K+1)/n}} \quad (D.23)\\
&\leq \frac{4L_G\big(f(\bar{x}_0) - f^*\big)}{K+1} + \frac{4\big(f(\bar{x}_0) - f^* + L_G\big)\Xi}{\sqrt{n(K+1)}} + \frac{\big(2C^2D^2L_g^2 + (4T_1 + 2T_2)D^2\big)n}{\Xi^2(K+1)}, \quad (D.24)
\end{aligned}$$
where we use $\beta \leq 1/L_G$ and $L_G \leq L_g$ in (D.23). When
$$K+1 \geq \frac{n}{\Xi^2}\Big(\frac{C^2D^2L_g^2 + (2T_1 + T_2)D^2}{f(\bar{x}_0) - f^*} + 3L_G\Big)^2,$$
the second term in (D.24) dominates the third, and we get
$$\min_{k=0,\ldots,K}\mathbb{E}\|\mathrm{grad} f(\bar{x}_k)\|_F^2 \leq \frac{4L_G\big(f(\bar{x}_0) - f^*\big)}{K+1} + \frac{8\big(f(\bar{x}_0) - f^* + L_G\big)\Xi}{\sqrt{n(K+1)}},$$
which completes the proof.

E Proofs for Section 5
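The proofs in this section revolve around the gradient tracking recursion $y_{k+1} = (W^t \otimes I)\,y_k + G_{k+1} - G_k$ and its average-preservation property $\hat{y}_k = \hat{g}_k$, which is invoked repeatedly below. That property can be checked on a Euclidean toy problem; the following is an illustrative sketch, where the quadratic local functions, the ring mixing matrix, and the stepsize are assumptions for the demo rather than part of DRGTA itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 3                               # number of agents, variable dimension

# Illustrative symmetric doubly stochastic mixing matrix on a ring graph.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

# Toy local objectives f_i(x) = 0.5 * ||x - b_i||^2, so grad f_i(x) = x - b_i.
b = rng.standard_normal((n, p))
grad = lambda X: X - b                    # row i holds grad f_i(x_i)

X = rng.standard_normal((n, p))
Y = grad(X)                               # tracker initialized at the local gradients
for k in range(10):
    X_new = W @ X - 0.1 * Y               # (Euclidean) consensus + tracked-gradient step
    Y = W @ Y + grad(X_new) - grad(X)     # gradient tracking update
    X = X_new
    # Invariant used throughout this section: the average of the y_{i,k}
    # always equals the average of the current local gradients.
    assert np.allclose(Y.mean(axis=0), grad(X).mean(axis=0))
```

The invariant holds because $W$ is doubly stochastic: multiplying by $W$ leaves the column average unchanged, and the correction $G_{k+1} - G_k$ updates it exactly.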
In this section, we use the following notation:
$$G_k := \begin{pmatrix}\mathrm{grad} f_1(x_{1,k})\\ \vdots\\ \mathrm{grad} f_n(x_{n,k})\end{pmatrix}, \quad y_k = \begin{pmatrix}y_{1,k}\\ \vdots\\ y_{n,k}\end{pmatrix}, \quad \hat{y}_k := \frac{1}{n}\sum_{i=1}^n y_{i,k}, \quad \hat{g}_k := \frac{1}{n}\sum_{i=1}^n \mathrm{grad} f_i(x_{i,k}), \quad \hat{G}_k := \mathbf{1}_n\otimes\hat{g}_k.$$

Proof of Lemma 5.1. We prove it by induction. Let $\hat{g}_{-1} = \hat{y}_0$. One has $\|y_{i,0}\|_F \leq D$ and
$$\|y_{i,0} - \hat{g}_{-1}\|_F \leq \|y_{i,0}\|_F + \|\hat{g}_{-1}\|_F \leq \|y_{i,0}\|_F + \frac{1}{n}\sum_{j=1}^n\|y_{j,0}\|_F \leq 2D$$
for all $i \in [n]$ by Assumption 2. Suppose that for some $k \geq 0$ we have $\|y_{i,k}\|_F \leq L_G + 2D$ and $\|y_{i,k} - \hat{g}_{k-1}\|_F \leq L_G + 2D$ for all $i$. We note that the bound on $v_{i,k}$ becomes $L_G + 2D$ here, since $\|v_{i,k}\|_F = \|\mathcal{P}_{T_{x_{i,k}}\mathcal{M}}\,y_{i,k}\|_F \leq \|y_{i,k}\|_F$. Following the same argument as in the proof of Lemma 4.1, we get $x_{k+1} \in \mathcal{N}$, since $0 < \alpha \leq \min\{\frac{\Phi}{2L_t}, \frac{1}{M}\}$ and $0 \leq \beta \leq \min\{\frac{(1-\rho_t)\delta}{L_G+2D}, \frac{\alpha\delta}{L_G+2D}\}$. Then, we have
$$\begin{aligned}
\|y_{i,k+1} - \hat{g}_k\|_F &= \Big\|\sum_{j=1}^n W_{i,j}^t y_{j,k} - \hat{g}_k + \mathrm{grad} f_i(x_{i,k+1}) - \mathrm{grad} f_i(x_{i,k})\Big\|_F\\
&= \Big\|\sum_{j=1}^n\Big(W_{i,j}^t - \frac{1}{n}\Big)(y_{j,k} - \hat{g}_{k-1}) + \mathrm{grad} f_i(x_{i,k+1}) - \mathrm{grad} f_i(x_{i,k})\Big\|_F\\
&\overset{(2.6)}{\leq} \sigma_t\sqrt{n}\,\max_j\|y_{j,k} - \hat{g}_{k-1}\|_F + L_G\|x_{i,k+1} - x_{i,k}\|_F\\
&\overset{(2.4)}{\leq} \sigma_t\sqrt{n}\,\max_j\|y_{j,k} - \hat{g}_{k-1}\|_F + L_G\big(\alpha\|\mathrm{grad}\,\varphi_{t,i}(x_k)\|_F + \beta\|y_{i,k}\|_F\big)\\
&\overset{(C.10)}{\leq} \frac{1}{2}\max_j\|y_{j,k} - \hat{g}_{k-1}\|_F + 2L_G\alpha\delta + L_G\alpha\delta\\
&\leq \frac{1}{2}(L_G + 2D) + 3L_G\alpha\delta \overset{(3.6)}{\leq} L_G + D,
\end{aligned}$$
where we use $\beta\|y_{i,k}\|_F \leq \beta(L_G + 2D) \leq \alpha\delta$. Hence,
$$\|y_{i,k+1}\|_F \leq \|y_{i,k+1} - \hat{g}_k\|_F + \|\hat{g}_k\|_F \leq L_G + 2D,$$
where we use $\|\hat{g}_k\|_F \leq D$. Therefore, we get $\|y_{i,k}\|_F \leq L_G + 2D$ for all $i, k$, and $x_k \in \mathcal{N}$. Using the same argument as in Lemma D.3, there exists some $C = O\big(\frac{1}{1-\rho_t}\big)$, independent of $L_G$ and $D$, such that
$$\frac{1}{\sqrt{n}}\|x_k - \bar{x}_k\| \leq C(L_G + 2D)\beta, \quad \forall k \geq 0. \quad (E.1)$$
The proof is completed.

Next, we present the relation between the consensus error and the gradient tracking error.

Lemma E.1.
Under the same conditions as Lemma 5.1, one has the following error bounds for any $k \geq 0$:

1. Successive gradient error:
$$\|G_{k+1} - G_k\|_F \leq 2\alpha L_G\|x_k - \bar{x}_k\|_F + \beta L_G\|y_k\|_F. \quad (E.2)$$

2. Successive tracking error:
$$\|y_{k+1} - \hat{G}_{k+1}\|_F \leq \sigma_t\|y_k - \hat{G}_k\|_F + \|G_{k+1} - G_k\|_F. \quad (E.3)$$

3. Successive consensus error: for $\rho_t = \sqrt{1 - \gamma_t\alpha} \in (0, 1)$,
$$\|x_{k+1} - \bar{x}_{k+1}\|_F \leq \rho_t\|x_k - \bar{x}_k\|_F + \beta\|y_k\|_F. \quad (E.4)$$

4. Associating $y_k$ and $\hat{G}_k$ with the above items:
$$\|y_k\|_F \leq \|y_k - \hat{G}_k\|_F + \|\hat{G}_k\|_F. \quad (E.5)$$

Proof of Lemma E.1. By Lemma 5.1, we know $x_k \in \mathcal{N}$ for all $k \geq 0$.

1. One has $\|G_{k+1} - G_k\|_F \leq L_G\|x_{k+1} - x_k\|_F$. By Lemma 2.3, it follows that
$$\|x_k - x_{k+1}\|_F \leq \alpha\|\mathrm{grad}\,\varphi_t(x_k)\|_F + \beta\|v_k\|_F \overset{(C.9)}{\leq} 2\alpha\|x_k - \bar{x}_k\|_F + \beta\|y_k\|_F,$$
where we use $\|v_k\|_F \leq \|y_k\|_F$. Hence, inequality (E.2) is proved.

2. Denote $J = \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$. Note that
$$\begin{aligned}
y_{k+1} - \hat{G}_{k+1} &= \big((I_n - J)\otimes I_d\big)y_{k+1} = \big((I_n - J)\otimes I_d\big)\big[(W^t\otimes I_d)y_k + G_{k+1} - G_k\big]\\
&= \big((W^t - J)\otimes I_d\big)y_k + \big((I_n - J)\otimes I_d\big)(G_{k+1} - G_k),
\end{aligned}$$
where we use $\big((I_n - J)\otimes I_d\big)(W^t\otimes I_d) = (W^t - J)\otimes I_d$. It follows that
$$\|y_{k+1} - \hat{G}_{k+1}\|_F \leq \sigma_t\|y_k - \hat{G}_k\|_F + \|G_{k+1} - G_k\|_F.$$
3. Note that $\|v_k\|_F \leq \|y_k\|_F$. The desired result then follows the same lines as Lemma D.2.

4. This follows from the triangle inequality.

To show Theorem 5.2, we first establish a descent lemma. Note that an extra term $\|\hat{G}_k\|^2 = n\|\hat{g}_k\|^2$ appears in (E.5); this is exactly the quantity we aim to bound for the optimization problem (1.1). Combining it with the following lemmas, we can quickly obtain the final convergence result.

Lemma E.2.
Under the same conditions as Lemma 5.1, it follows that
$$f(\bar{x}_{k+1}) \leq f(\bar{x}_k) - \big(\beta - 73L_G\beta^2\big)\|\hat{g}_k\|_F^2 + \frac{G_1L_G}{n}\|x_{k+1} - \bar{x}_{k+1}\|_F^2 + \frac{G_2L_G}{n}\|x_k - \bar{x}_k\|_F^2 + \frac{G_3L_G}{n}\beta^2\|y_k\|_F^2, \quad (E.6)$$
where $G_1 = \frac{r(L_G+2D)^2C^2}{2L_G^2}$, $G_2 = 1 + G_1 + \frac{2D\alpha + 8MD\alpha^2}{L_G} + 13C^2\delta^2\alpha^2$, $G_3 = \frac{2MD}{L_G} + \delta^2 + 5$, and $C$ is given in Lemma 5.1. Since $D = \max_{x\in\mathrm{St}(d,r)}\|\nabla f(x)\|_F \leq \sqrt{r}\cdot\max_{x\in\mathrm{St}(d,r)}\|\nabla f(x)\|_2$, by the choice of $\alpha$ the constants in Lemma E.2 satisfy $G_1 = O(rC^2)$, $G_2 = O(rC^2)$ and $G_3 = O(M^2)$.

Proof of Lemma E.2. It follows from Lemma 2.4 that
$$\|\hat{g}_k - \mathrm{grad} f(\bar{x}_k)\|^2 \leq \frac{1}{n}\sum_{i=1}^n\|\mathrm{grad} f_i(x_{i,k}) - \mathrm{grad} f_i(\bar{x}_k)\|^2 \leq \frac{L_G^2}{n}\|x_k - \bar{x}_k\|^2. \quad (E.7)$$
By invoking Lemma 2.4 and noting $L_g \leq L_G$, we also have
$$\begin{aligned}
f(\bar{x}_{k+1}) &\leq f(\bar{x}_k) + \langle\mathrm{grad} f(\bar{x}_k), \bar{x}_{k+1} - \bar{x}_k\rangle + \frac{L_g}{2}\|\bar{x}_{k+1} - \bar{x}_k\|^2\\
&\leq f(\bar{x}_k) + \langle\hat{g}_k, \bar{x}_{k+1} - \bar{x}_k\rangle + \langle\mathrm{grad} f(\bar{x}_k) - \hat{g}_k, \bar{x}_{k+1} - \bar{x}_k\rangle + \frac{L_G}{2}\|\bar{x}_{k+1} - \bar{x}_k\|^2\\
&\leq f(\bar{x}_k) + \langle\hat{g}_k, \bar{x}_{k+1} - \bar{x}_k\rangle + \frac{1}{2L_G}\|\mathrm{grad} f(\bar{x}_k) - \hat{g}_k\|^2 + \frac{3L_G}{2}\|\bar{x}_{k+1} - \bar{x}_k\|^2\\
&\overset{(E.7)}{\leq} f(\bar{x}_k) + \langle\hat{g}_k, \bar{x}_{k+1} - \bar{x}_k\rangle + \frac{L_G}{2n}\|x_k - \bar{x}_k\|^2 + \frac{3L_G}{2}\|\bar{x}_{k+1} - \bar{x}_k\|^2\\
&= f(\bar{x}_k) + \langle\hat{g}_k, \hat{x}_{k+1} - \hat{x}_k\rangle + \langle\hat{g}_k, \bar{x}_{k+1} - \hat{x}_{k+1} + \hat{x}_k - \bar{x}_k\rangle + \frac{L_G}{2n}\|x_k - \bar{x}_k\|^2 + \frac{3L_G}{2}\|\bar{x}_{k+1} - \bar{x}_k\|^2. \quad (E.8)
\end{aligned}$$
Note that for $\beta > 0$, Young's inequality gives
$$\langle\hat{g}_k, \bar{x}_{k+1} - \hat{x}_{k+1} + \hat{x}_k - \bar{x}_k\rangle \leq L_G\beta^2\|\hat{g}_k\|^2 + \frac{1}{2L_G\beta^2}\big(\|\hat{x}_k - \bar{x}_k\|^2 + \|\bar{x}_{k+1} - \hat{x}_{k+1}\|^2\big).$$
Hence,
$$f(\bar{x}_{k+1}) \leq f(\bar{x}_k) + \underbrace{\langle\hat{g}_k, \hat{x}_{k+1} - \hat{x}_k\rangle}_{:=b_1} + L_G\beta^2\|\hat{g}_k\|^2 + \underbrace{\frac{1}{2L_G\beta^2}\big(\|\hat{x}_k - \bar{x}_k\|^2 + \|\bar{x}_{k+1} - \hat{x}_{k+1}\|^2\big)}_{:=b_2} + \frac{L_G}{2n}\|x_k - \bar{x}_k\|^2 + \underbrace{\frac{3L_G}{2}\|\bar{x}_{k+1} - \bar{x}_k\|^2}_{:=b_3}. \quad (E.9)$$

Firstly, we have
$$b_1 = -\beta\|\hat{g}_k\|^2 + \Big\langle\hat{g}_k, \frac{1}{n}\sum_{i=1}^n\big[x_{i,k+1} - \big(x_{i,k} - \alpha\,\mathrm{grad}\,\varphi_{t,i}(x_k) - \beta v_{i,k}\big)\big]\Big\rangle + \Big\langle\hat{g}_k, \frac{1}{n}\sum_{i=1}^n\big[\beta(y_{i,k} - v_{i,k}) - \alpha\,\mathrm{grad}\,\varphi_{t,i}(x_k)\big]\Big\rangle. \quad (E.10)$$
Since $y_{i,k} - v_{i,k} \in N_{x_{i,k}}\mathcal{M}$, it follows that
$$\begin{aligned}
\Big\langle\hat{g}_k, \frac{\beta}{n}\sum_{i=1}^n(y_{i,k} - v_{i,k}) - \frac{\alpha}{n}\sum_{i=1}^n\mathrm{grad}\,\varphi_{t,i}(x_k)\Big\rangle &\overset{(C.8)}{\leq} \frac{\beta}{n}\sum_{i=1}^n\langle\hat{g}_k - \mathrm{grad} f_i(x_{i,k}), y_{i,k} - v_{i,k}\rangle + \frac{2\alpha}{n}\|\hat{g}_k\|_F\,\|x_k - \bar{x}_k\|^2\\
&\leq \frac{1}{nL_G}\sum_{i=1}^n\|\hat{g}_k - \mathrm{grad} f_i(x_{i,k})\|^2 + \frac{\beta^2L_G}{4n}\sum_{i=1}^n\|\mathcal{P}_{N_{x_{i,k}}}y_{i,k}\|^2 + \frac{2\alpha D}{n}\|x_k - \bar{x}_k\|^2\\
&\leq \frac{1}{n^2L_G}\sum_{i=1}^n\sum_{j=1}^n\|\mathrm{grad} f_j(x_{j,k}) - \mathrm{grad} f_i(x_{i,k})\|^2 + \frac{\beta^2L_G}{4n}\|y_k\|^2 + \frac{2\alpha D}{n}\|x_k - \bar{x}_k\|^2\\
&\leq \frac{L_G + 2\alpha D}{n}\|x_k - \bar{x}_k\|^2 + \frac{\beta^2L_G}{4n}\|y_k\|^2,
\end{aligned}$$
where we use Lemma 2.4 in the last inequality. This, together with (E.10) and (C.8), implies
$$\begin{aligned}
b_1 &\leq -\beta\|\hat{g}_k\|^2 + \frac{D}{n}\sum_{i=1}^n\|x_{i,k} - \alpha\,\mathrm{grad}\,\varphi_{t,i}(x_k) - \beta v_{i,k} - x_{i,k+1}\|_F + \frac{L_G + 2\alpha D}{n}\|x_k - \bar{x}_k\|^2 + \frac{\beta^2L_G}{4n}\|y_k\|^2\\
&\overset{(P1)}{\leq} -\beta\|\hat{g}_k\|^2 + \frac{MD}{n}\sum_{i=1}^n\|\alpha\,\mathrm{grad}\,\varphi_{t,i}(x_k) + \beta v_{i,k}\|^2 + \frac{L_G + 2\alpha D}{n}\|x_k - \bar{x}_k\|^2 + \frac{\beta^2L_G}{4n}\|y_k\|^2\\
&\leq -\beta\|\hat{g}_k\|^2 + \frac{2MD\alpha^2}{n}\|\mathrm{grad}\,\varphi_t(x_k)\|^2 + \frac{2MD\beta^2}{n}\|y_k\|^2 + \frac{L_G + 2\alpha D}{n}\|x_k - \bar{x}_k\|^2 + \frac{\beta^2L_G}{4n}\|y_k\|^2\\
&\leq -\beta\|\hat{g}_k\|^2 + \frac{8MD\alpha^2 + 2D\alpha + L_G}{n}\|x_k - \bar{x}_k\|^2 + \frac{(2MD + L_G)\beta^2}{n}\|y_k\|^2, \quad (E.11)
\end{aligned}$$
where we use $\|\hat{g}_k\|_F \leq D$ and $\|\mathrm{grad}\,\varphi_t(x_k)\|^2 \leq 4\|x_k - \bar{x}_k\|^2$.

Secondly, we bound $b_2$. From Lemma 5.1 we have $x_{k+1} \in \mathcal{N}$, and by (P1),
$$\|\bar{x}_k - \hat{x}_k\|^2 + \|\bar{x}_{k+1} - \hat{x}_{k+1}\|^2 \leq \frac{r}{n^2}\big(\|x_k - \bar{x}_k\|^4 + \|x_{k+1} - \bar{x}_{k+1}\|^4\big). \quad (E.12)$$
It follows from (5.1) that $\|x_k - \bar{x}_k\|^2 \leq nC^2(L_G+2D)^2\beta^2 \leq nC^2\alpha^2\delta^2$, where we use $\beta \leq \frac{\alpha\delta}{L_G+2D}$. We then obtain
$$b_2 \leq \frac{r(L_G+2D)^2C^2}{2nL_G}\big(\|x_k - \bar{x}_k\|^2 + \|x_{k+1} - \bar{x}_{k+1}\|^2\big) = \frac{G_1L_G}{n}\big(\|x_k - \bar{x}_k\|^2 + \|x_{k+1} - \bar{x}_{k+1}\|^2\big). \quad (E.13)$$

Thirdly, invoking Lemma C.4 and $\alpha \leq 1/M$ yields
$$\|\bar{x}_k - \bar{x}_{k+1}\|_F \leq 2(1-\delta)\Big[\frac{\alpha}{n}\|x_k - \bar{x}_k\|^2 + \frac{2M\beta^2}{n}\|y_k\|^2 + \beta\|\hat{v}_k\|_F\Big].$$
Then, it follows from $\beta\|y_k\|_F \leq \sqrt{n}\,\alpha\delta$ that
$$\begin{aligned}
b_3 &\leq 18L_G(1-\delta)^2\Big[\frac{\alpha^2}{n^2}\|x_k - \bar{x}_k\|^4 + \frac{4M^2\beta^4}{n^2}\|y_k\|^4 + \beta^2\|\hat{v}_k\|^2\Big]\\
&\leq 18L_G(1-\delta)^2\Big[\frac{C^2\delta^2\alpha^4}{n}\|x_k - \bar{x}_k\|^2 + \frac{4M^2\alpha^2\delta^2\beta^2}{n}\|y_k\|^2\Big] + 36L_G(1-\delta)^2\beta^2\Big(\|\hat{g}_k\|^2 + \frac{1}{n}\|y_k\|^2\Big)\\
&\leq \frac{13L_GC^2\delta^2\alpha^2}{n}\|x_k - \bar{x}_k\|^2 + 72L_G\beta^2\|\hat{g}_k\|^2 + \frac{(\delta^2 + 4)L_G\beta^2}{n}\|y_k\|^2, \quad (E.14)
\end{aligned}$$
where we use $\hat{y}_k = \hat{g}_k$, $\|\hat{v}_k - \hat{y}_k\| \leq \frac{1}{\sqrt{n}}\|y_k\|$, $\alpha \leq 1/M$ and $(1-\delta)^2 \leq 1.002$.

Finally, combining $b_1$, $b_2$, $b_3$ with (E.9) implies
$$\begin{aligned}
f(\bar{x}_{k+1}) &\leq f(\bar{x}_k) + b_1 + L_G\beta^2\|\hat{g}_k\|^2 + b_2 + \frac{L_G}{2n}\|x_k - \bar{x}_k\|^2 + b_3\\
&\leq f(\bar{x}_k) - \big(\beta - 73L_G\beta^2\big)\|\hat{g}_k\|^2 + \frac{G_1L_G}{n}\|x_{k+1} - \bar{x}_{k+1}\|^2 + \frac{G_2L_G}{n}\|x_k - \bar{x}_k\|^2 + \frac{2MD + (\delta^2 + 5)L_G}{n}\beta^2\|y_k\|^2.
\end{aligned}$$
The proof is completed.

To proceed, we need the following recursive lemma, which is helpful for combining Lemma E.1 and Lemma E.2. It is slightly different from the original one in [44].
We only change $\sqrt{\sum_{l=0}^k u_l}$ and $\sqrt{\sum_{l=0}^k w_l}$ in [44] to $\sum_{l=0}^k u_l$ and $\sum_{l=0}^k w_l$.

Lemma E.3 ([44, Lemma 2]). Let $\{u_k\}_{k\geq 0}$ and $\{w_k\}_{k\geq 0}$ be two positive scalar sequences such that for all $k \geq 0$,
$$u_{k+1} \leq \eta u_k + w_k,$$
where $\eta \in (0, 1)$ is the decaying factor. Let $\Gamma(k) = \sum_{l=0}^k u_l$ and $\Omega(k) = \sum_{l=0}^k w_l$. Then we have
$$\Gamma(k) \leq c_1\Omega(k) + c_2,$$
where $c_1 = \frac{1}{1-\eta}$ and $c_2 = \frac{u_0}{1-\eta}$.

Proof of Theorem 5.2. Applying Lemma E.3 to (E.4) yields
$$\frac{1}{n}\sum_{k=0}^K\|x_k - \bar{x}_k\|^2 \leq \tilde{C}_1\cdot\frac{\beta^2}{n}\sum_{k=0}^K\|y_k\|^2 + \tilde{C}_2, \quad (E.15)$$
where $\tilde{C}_1 = \frac{2}{(1-\rho_t)^2}$ and $\tilde{C}_2 = \frac{2}{1-\rho_t}\cdot\frac{1}{n}\|x_0 - \bar{x}_0\|^2$. It follows from Lemma E.2 that
$$\begin{aligned}
f(\bar{x}_{K+1}) &\leq f(\bar{x}_0) - \big(\beta - 73L_G\beta^2\big)\sum_{k=0}^K\|\hat{g}_k\|^2 + \frac{G_1L_G}{n}\sum_{k=1}^{K+1}\|x_k - \bar{x}_k\|^2 + \frac{G_2L_G}{n}\sum_{k=0}^K\|x_k - \bar{x}_k\|^2 + \frac{G_3L_G\beta^2}{n}\sum_{k=0}^K\|y_k\|^2\\
&\leq f(\bar{x}_0) - \big(\beta - 73L_G\beta^2\big)\sum_{k=0}^K\|\hat{g}_k\|^2 + \big(G_1\tilde{C}_1 + G_2\tilde{C}_1 + G_3\big)\frac{L_G\beta^2}{n}\sum_{k=0}^K\|y_k\|^2 + \frac{G_1\tilde{C}_1L_G\beta^2}{n}\|y_{K+1}\|^2 + \tilde{C}_2(G_1 + G_2)L_G\\
&\leq f(\bar{x}_0) - \frac{\beta}{2}\sum_{k=0}^K\|\hat{g}_k\|^2 + \frac{G_4L_G\beta^2}{n}\sum_{k=0}^K\|y_k\|^2 + G_5L_G, \quad (E.16)
\end{aligned}$$
where we use $\beta \leq \min\{\frac{1}{146L_G}, \frac{\alpha\delta}{L_G+2D}\}$, $\beta^2\|y_{K+1}\|^2 \leq n(L_G+2D)^2\beta^2 \leq n\alpha^2\delta^2$, and define $G_4 := (G_1 + G_2)\tilde{C}_1 + G_3$ and $G_5 := G_1\tilde{C}_1\delta^2\alpha^2 + \tilde{C}_2(G_1 + G_2)$.

We now associate $\|\hat{g}_k\|$ with $\|y_k\|$. By (E.5), we get
$$-\sum_{k=0}^K\|\hat{g}_k\|^2 = -\frac{1}{n}\sum_{k=0}^K\|\hat{G}_k\|^2 \leq \frac{1}{n}\sum_{k=0}^K\|y_k - \hat{G}_k\|^2 - \frac{1}{2n}\sum_{k=0}^K\|y_k\|^2. \quad (E.17)$$
Again, applying Lemma E.3 to (E.3) yields
$$\begin{aligned}
\frac{1}{n}\sum_{k=0}^K\|y_k - \hat{G}_k\|^2 &\leq \frac{\tilde{C}_3}{n}\sum_{k=0}^K\|G_{k+1} - G_k\|^2 + \tilde{C}_4 \leq \frac{\tilde{C}_3}{n}\sum_{k=0}^K\big(8\alpha^2L_G^2\|x_k - \bar{x}_k\|^2 + 2\beta^2L_G^2\|y_k\|^2\big) + \tilde{C}_4\\
&\leq \big(8\alpha^2\tilde{C}_1 + 2\big)\tilde{C}_3\,\frac{L_G^2\beta^2}{n}\sum_{k=0}^K\|y_k\|^2 + 8\alpha^2\tilde{C}_2\tilde{C}_3L_G^2 + \tilde{C}_4,
\end{aligned}$$
where $\tilde{C}_3 = \frac{2}{(1-\sigma_t)^2}$ and $\tilde{C}_4 = \frac{2}{1-\sigma_t}\cdot\frac{1}{n}\|y_0 - \hat{G}_0\|^2$. Plugging this into (E.17) implies
$$-\sum_{k=0}^K\|\hat{g}_k\|^2 \leq \Big[\big(8\alpha^2\tilde{C}_1 + 2\big)\tilde{C}_3L_G^2\beta^2 - \frac{1}{2}\Big]\frac{1}{n}\sum_{k=0}^K\|y_k\|^2 + 8\alpha^2\tilde{C}_2\tilde{C}_3L_G^2 + \tilde{C}_4. \quad (E.18)$$
Hence, it follows from (E.16) that
$$\begin{aligned}
f(\bar{x}_{K+1}) &\overset{(E.18)}{\leq} f(\bar{x}_0) - \frac{\beta}{2}\Big(\frac{1}{2} - \big[2G_4 + \big(8\alpha^2\tilde{C}_1 + 2\big)\tilde{C}_3L_G\beta\big]L_G\beta\Big)\frac{1}{n}\sum_{k=0}^K\|y_k\|^2 + \frac{\beta}{2}\big(8\alpha^2\tilde{C}_2\tilde{C}_3L_G^2 + \tilde{C}_4\big) + G_5L_G\\
&\leq f(\bar{x}_0) - \frac{\beta}{8}\cdot\frac{1}{n}\sum_{k=0}^K\|y_k\|^2 + \frac{\beta}{2}\big(8\alpha^2\tilde{C}_2\tilde{C}_3L_G^2 + \tilde{C}_4\big) + G_5L_G, \quad (E.19)
\end{aligned}$$
where the last inequality is due to $\beta \leq \frac{1}{4\big(2G_4 + (8\alpha^2\tilde{C}_1 + 2)\tilde{C}_3\big)L_G}$. Then, we get
$$\frac{\beta}{8}\sum_{k=0}^K\|\hat{g}_k\|^2 \leq \frac{\beta}{8}\cdot\frac{1}{n}\sum_{k=0}^K\|y_k\|^2 \leq f(\bar{x}_0) - f^* + \tilde{C}_5 + G_5L_G, \quad (E.20)$$
where $\tilde{C}_5 = \frac{\beta}{2}\big(8\alpha^2\tilde{C}_2\tilde{C}_3L_G^2 + \tilde{C}_4\big) = O\big(\frac{r\delta^2}{L_G(1-\sigma_t)}\big)$ and $f^* = \min_{x\in\mathrm{St}(d,r)}f(x)$. This implies
$$\min_{k=0,\ldots,K}\|\hat{g}_k\|^2 = \min_{k=0,\ldots,K}\|\hat{y}_k\|^2 \leq \min_{k=0,\ldots,K}\frac{1}{n}\|y_k\|^2 \leq \frac{8\big(f(\bar{x}_0) - f^* + \tilde{C}_5 + G_5L_G\big)}{\beta(K+1)}. \quad (E.21)$$
It then follows from (E.15) that
$$\min_{k=0,\ldots,K}\frac{1}{n}\|x_k - \bar{x}_k\|^2 \leq \frac{8\beta\tilde{C}_1\big(f(\bar{x}_0) - f^* + \tilde{C}_5 + G_5L_G\big) + \tilde{C}_2}{K+1}.$$
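As a quick sanity check, the summed recursion of Lemma E.3 can be exercised numerically on a sequence that saturates $u_{k+1} = \eta u_k + w_k$; the decay factor, horizon, initial value and the positive sequence $w_k$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
eta, K = 0.7, 200                        # decaying factor in (0, 1) and horizon
u0 = 1.5
w = rng.uniform(0.0, 1.0, size=K + 1)    # arbitrary positive sequence w_k

# Generate the sequence saturating u_{k+1} = eta * u_k + w_k (the worst case).
u = np.empty(K + 1)
u[0] = u0
for k in range(K):
    u[k + 1] = eta * u[k] + w[k]

Gamma = u.sum()                          # Γ(K) = sum_{l=0}^{K} u_l
Omega = w.sum()                          # Ω(K) = sum_{l=0}^{K} w_l

# Lemma E.3: Γ(K) <= Ω(K)/(1 - η) + u_0/(1 - η).
assert Gamma <= Omega / (1 - eta) + u0 / (1 - eta)
```

Unrolling the recursion gives $u_l \le \eta^l u_0 + \sum_{j<l}\eta^{l-1-j}w_j$; summing the geometric factors over $l$ produces exactly the constants $c_1 = c_2/u_0 = \frac{1}{1-\eta}$ checked above.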