ADOM: Accelerated Decentralized Optimization Method for Time-Varying Networks
Dmitry Kovalev, Egor Shulgin, Peter Richtárik, Alexander Rogozin, Alexander Gasnikov
Abstract
We propose ADOM – an accelerated method for smooth and strongly convex decentralized optimization over time-varying networks. ADOM uses a dual oracle, i.e., we assume access to the gradient of the Fenchel conjugate of the individual loss functions. Up to a constant factor, which depends on the network structure only, its communication complexity is the same as that of the accelerated Nesterov gradient method (Nesterov, 2003). To the best of our knowledge, only the algorithm of Rogozin et al. (2019) has a convergence rate with similar properties. However, their algorithm converges under the very restrictive assumption that the number of network changes cannot be greater than a tiny percentage of the number of iterations. This assumption is hard to satisfy in practice, as the network topology changes usually cannot be controlled. In contrast, ADOM merely requires the network to stay connected throughout time.
1. Introduction
We study the decentralized optimization problem

min_{x ∈ R^d} Σ_{i=1}^n f_i(x),   (1)

where each function f_i : R^d → R is stored on a compute node i ∈ [n] := {1, 2, . . . , n}. We assume that the nodes are connected through a communication network defined by an undirected connected graph. Each node can perform computations based on its local state and data, and can directly communicate with its neighbors only. Further, we assume the functions f_i to be smooth and strongly convex. Such decentralized optimization problems have been studied heavily (Gorbunov et al., 2020b), and arise in many applications, including estimation by sensor networks (Rabbat & Nowak, 2004), network resource allocation (Beck et al., 2014), cooperative control (Giselsson et al., 2013), distributed spectrum sensing (Bazerque & Giannakis, 2009), power system control (Gan et al., 2012) and federated learning (Li et al., 2020a; Kovalev et al., 2021).

(Affiliations: King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Moscow Institute of Physics and Technology, Dolgoprudny, Russia. Correspondence to: Dmitry Kovalev <[email protected]>. February 19, 2021.)

Figure 1. A sample time-varying network with n = 20 nodes.

When the network is not allowed to change in time, a lower communication complexity bound has been established by Scaman et al. (2017). This bound is tight, as there is a matching upper bound both in the case when a dual oracle is assumed (Scaman et al., 2017), which means that we have access to the gradient of the Fenchel conjugate of the functions f_i(x), and in the case when a primal oracle is assumed (Kovalev et al., 2020b), which means that we have access to the gradient of the functions f_i(x) themselves.

In this work, we study the situation when the links in the communication network are allowed to change over time (for an illustration, see Figure 1). Such time-varying networks (Zadeh, 1961; Kolar et al., 2010) are ubiquitous in many complex systems and practical applications. In sensor networks, for example, changes in the link structure occur when the sensors are in motion, and due to other disturbances in the wireless signal connecting pairs of nodes. We envisage that a similar regime will be supported in future-generation federated learning systems (Konečný et al., 2016; McMahan et al., 2017; Kovalev et al., 2021), where the communication pattern among pairs of mobile devices, or mobile devices and edge servers, will be dictated by their physical proximity, which naturally changes over time. Our work can be partially understood as an attempt to contribute to the algorithmic foundations of this nascent field.
As mentioned earlier, throughout this paper we restrict each function f_i(x) to be L-smooth and µ-strongly convex. That is, we require that the inequalities f_i(x) ≥ f_i(y) + ⟨∇f_i(y), x − y⟩ + (µ/2)∥x − y∥^2 and f_i(x) ≤ f_i(y) + ⟨∇f_i(y), x − y⟩ + (L/2)∥x − y∥^2 hold for all nodes i ∈ [n] and all x, y ∈ R^d. This naturally leads to the quantity

κ := L/µ,   (2)

known as the condition number of the function f. As we shall see, the current understanding of decentralized optimization over time-varying networks is insufficient even in this setting, and we believe that the key technical issues we face at present do not come from the difficulty of the function class, but from the algorithmic and modeling aspects of dealing with the decentralized and time-varying nature of the problem. Thus, focusing on smooth and strongly convex problems should not be seen as a weakness, but as a necessary step in the quest to make a significant advance in our understanding of how efficient decentralized methods should be designed in the time-varying network regime.

To the best of our knowledge, there is only a handful of algorithms for solving the decentralized optimization problem (1) that enjoy a linear convergence rate in the time-varying regime under smoothness and strong convexity assumptions. These include DIGing (Nedic et al., 2017) and the Push-Pull Gradient Method (Pu et al., 2020), which use the primal oracle, and PANDA (Maros & Jaldén, 2018), which uses the dual oracle. While linear, their rates are slow in comparison to the best methods “on the market” at present (see Table 1). A well known mechanism for improving the convergence rates of standard gradient-type methods is to apply or adapt Nesterov acceleration (Nesterov, 2003), whose goal is to reduce the dependence of the method on the condition number κ associated with the problem, the condition number χ associated with the network structure (see (14) in Section 4.3), or both. However, doing this is nontrivial in the decentralized time-varying setting.
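As a quick sanity check, the two inequalities above can be verified numerically. The snippet below uses an illustrative one-dimensional loss (a logistic term plus a quadratic, not a function from the paper), for which µ and L are known in closed form, and computes the condition number (2):

```python
import math, random

# Toy check of the inequalities defining mu-strong convexity and L-smoothness
# for f(x) = log(1 + exp(x)) + (mu/2) x^2 in one dimension.  The logistic term
# has curvature in (0, 1/4], so f is mu-strongly convex and (1/4 + mu)-smooth.
mu = 0.1
L = 0.25 + mu

def f(x):
    return math.log(1.0 + math.exp(x)) + 0.5 * mu * x * x

def grad(x):
    return 1.0 / (1.0 + math.exp(-x)) + mu * x

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    lower = f(y) + grad(y) * (x - y) + 0.5 * mu * (x - y) ** 2
    upper = f(y) + grad(y) * (x - y) + 0.5 * L * (x - y) ** 2
    assert lower - 1e-12 <= f(x) <= upper + 1e-12

kappa = L / mu  # condition number (2); here 3.5
```

For this toy loss the bounds are tight in the sense that the curvature actually ranges over (µ, L], so both inequalities are exercised.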
2. Summary of Contributions
We now briefly outline the main contributions:
In this paper we propose an accelerated algorithm, ADOM (Algorithm 2), for smooth and strongly convex decentralized optimization over time-varying networks. This algorithm uses the dual oracle, and is based on a careful generalization of the Projected Nesterov Gradient Descent method (PNGD; Algorithm 1).

We prove that ADOM enjoys the rate O(χκ^{1/2} log(1/ε)) (see Thm 1), which matches the O(κ^{1/2} log(1/ε)) rate of PNGD in the special case of a fully connected time-invariant network.

Our analysis requires several new insights and tools. First, we rely on the new observation that decentralized communication can be seen as the application of a certain contractive compression operator (see Section 4.4). This operator is linear, but may be biased, which raises significant challenges. While the use of unbiased compression operators, such as sparsification and quantization, is increasingly popular in the modern literature on distributed optimization in the parameter server framework, we only know of a handful of results combining compression with acceleration (Li et al., 2020b; Qian et al., 2020). Of these, the first handles unbiased compressors only, and the second is the only work we know of successfully combining biased communication compression and acceleration. However, their work makes use of a different acceleration mechanism from ours, and it is not clear how to extend it to decentralized optimization. We are not aware of any results combining the use of biased compressors, acceleration and decentralized communication, even if we allow the networks to be time-invariant. The observation that decentralized communication can be modeled as the application of a certain contractive compressor allows us to design a bespoke error-feedback mechanism, previously studied in other settings by Stich & Karimireddy (2019); Karimireddy et al. (2019); Beznosikov et al. (2020); Gorbunov et al. (2020a), for achieving acceleration despite dealing with a biased compressor.

While there were attempts to design accelerated algorithms that could deal with time-varying networks, only several methods provide sub-quadratic dependence on χ: Acc-DNGD (Qu & Li, 2019), Mudag (Ye et al., 2020), and the Accelerated Penalty Method (APM) (Rogozin et al., 2020; Li et al., 2018).
Acc-DNGD has O(χ^{3/2}) dependence on χ, which is worse than the linear dependence on χ shared by Mudag, APM and our method ADOM. Moreover, Acc-DNGD has O(κ^{5/7}) dependence on κ and Mudag has O(κ^{1/2} log κ) dependence, which is worse than the O(κ^{1/2}) dependence of APM and our method ADOM. Lastly, APM has a square-logarithmic dependence on 1/ε, which is worse than the dependence of all the other methods on this quantity. These results are summarized in Table 1.

(Footnote: Distributed optimization in a parameter server framework is mathematically equivalent to the setting where communication happens over a fully connected time-invariant network.)
Table 1. A review of decentralized optimization algorithms capable of working in the time-varying network regime, with guarantees. Complexity terms highlighted in red represent the best known dependencies. Our method is the only method with best known dependencies in all terms.

Algorithm | Communication complexity
DIGing (Nedic et al., 2017) | O(n^{1/2} χ^2 κ^{3/2} log(1/ε))
PANDA (Maros & Jaldén, 2018) | O(χ^2 κ^{3/2} log(1/ε))
Acc-DNGD (Qu & Li, 2019) | O(χ^{3/2} κ^{5/7} log(1/ε))
APM (Li et al., 2018) | O(χ κ^{1/2} log^2(1/ε))
Mudag (Ye et al., 2020) | O(χ κ^{1/2} log(κ) log(1/ε))
ADOM (Algorithm 2), THIS PAPER | O(χ κ^{1/2} log(1/ε))

In summary, ADOM achieves the new state-of-the-art rate for decentralized optimization over time-varying networks.
We left one relevant method out of the above comparison: the Distributed Nesterov Method (DNM) of Rogozin et al. (2019). This method has O(χ^{1/2}) dependence on χ. However, DNM converges under the very restrictive assumption requiring the number of network changes to not exceed a tiny percentage of the number of iterations. This assumption is hard to satisfy in practice, as the changes in the network topology usually cannot be controlled and happen independently of the algorithm run. In contrast, our algorithm merely requires the network to be connected at all times. In Figure 2 we give a representative comparison of the workings of our method ADOM and DNM in a regime where the number of network changes exceeds the theoretical limit. While ADOM converges, DNM often diverges, which shows that DNM is not robust to the network dynamics, and that the restrictive assumption is crucial to their analysis.
3. Problem Formulation and Projected Nesterov Gradient Descent
The design of our method is based on a particular reformu-lation of problem (1), which we now describe.
Consider the function F : (R^d)^V → R defined by

F(x) = Σ_{i∈V} f_i(x_i),   (3)

where x = (x_1, . . . , x_n) ∈ (R^d)^V and V := [n] denotes the set of compute nodes. Then F is L-smooth and µ-strongly convex since the individual functions f_i are. Consider also the so-called consensus space L ⊂ (R^d)^V defined by

L := {(x_1, . . . , x_n) ∈ (R^d)^V : x_1 = · · · = x_n}.   (4)

Using this notation, we arrive at an equivalent formulation of problem (1), which we call the primal formulation:

min_{x∈L} F(x).   (5)

Since the function F(x) is strongly convex, this problem has a unique solution, which we denote by x* ∈ L.

It is a well known fact that problem (5) has an equivalent dual formulation of the form

min_{z∈L⊥} F*(z),   (6)

where F* is the Fenchel transform of F and L⊥ ⊂ (R^d)^V is the orthogonal complement of the space L, given as follows:

L⊥ = {(z_1, . . . , z_n) ∈ (R^d)^V : Σ_{i=1}^n z_i = 0}.   (7)

Note that the function F*(z) is (1/µ)-smooth and (1/L)-strongly convex (Rockafellar, 1970). Hence, problem (6) also has a unique solution, which we denote by z* ∈ L⊥.

A natural way to tackle problem (6) is to use a projected version of Nesterov’s accelerated gradient method: Projected Nesterov Gradient Descent (PNGD) (Nesterov, 2003). This algorithm requires us to calculate the projection onto the set L⊥, which can be written in the closed form

proj_{L⊥}(g) := argmin_{z∈L⊥} ∥g − z∥ = Pg,   (8)

where g ∈ (R^d)^V and P is the orthogonal projection matrix onto the subspace L⊥. Matrix P is given as follows:

P = (I_n − (1/n) 1_n 1_n^⊤) ⊗ I_d,   (9)

where I_p denotes the p × p identity matrix, 1_n = (1, . . . , 1)^⊤ ∈ R^n, and ⊗ is the Kronecker product. Note that

P^2 = P.   (10)

With this notation, PNGD is presented as Algorithm 1. A key property of Algorithm 1 is that it converges with the accelerated rate O(√(L/µ) log(1/ε)).
However, PNGD in each iteration calculates the matrix-vector multiplication P∇F*(z_g^k), which requires full averaging, i.e., consensus, over all nodes of the communication network. In particular, this cannot be done in a decentralized fashion. In Section 5 we describe our algorithm ADOM, which in a certain sense mimics the behavior of Algorithm 1, but can be implemented in a decentralized fashion.
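The action of P can be sketched in a few lines of plain Python (blocks stored as lists of length d; the numbers are arbitrary illustrative data): P subtracts the across-node average from every block, so its output sums to zero across nodes as in (7), and it is idempotent as in (10).

```python
# Sketch of the projector P from (8)-(9): for x = (x_1, ..., x_n) in (R^d)^V,
# P x subtracts the across-node average from every block, so P x lies in the
# subspace L-perp of block vectors summing to zero, and P^2 = P.
n, d = 4, 3

def proj(x):
    # x is a list of n blocks, each a length-d list of floats
    mean = [sum(xi[t] for xi in x) / n for t in range(d)]
    return [[xi[t] - mean[t] for t in range(d)] for xi in x]

x = [[1.0, 2.0, 0.0], [3.0, -1.0, 5.0], [0.0, 0.0, 1.0], [2.0, 2.0, 2.0]]
px = proj(x)
# blocks of P x sum to zero, i.e. P x lies in L-perp, cf. (7)
for t in range(d):
    assert abs(sum(b[t] for b in px)) < 1e-12
# idempotence P(Px) = Px, cf. (10)
ppx = proj(px)
for b, c in zip(px, ppx):
    assert all(abs(u - v) < 1e-12 for u, v in zip(b, c))
```

Computing the block mean is exactly the "full averaging" step that requires global communication, which is what ADOM avoids.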
DOM
Algorithm 1 PNGD: Projected Nesterov Gradient Descent
input: z^0 ∈ L⊥, α, η, θ > 0, τ ∈ (0, 1)
set z_f^0 = z^0
for k = 0, 1, 2, . . . do
  z_g^k = τ z^k + (1 − τ) z_f^k
  z^{k+1} = z^k + ηα(z_g^k − z^k) − η P∇F*(z_g^k)
  z_f^{k+1} = z_g^k − θ P∇F*(z_g^k)
end for
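The listing above can be run directly on a toy instance. The sketch below uses illustrative quadratic losses f_i(x) = (a_i/2)(x − c_i)^2 with d = 1, so the dual gradient has the closed form (∇F*(z))_i = c_i + z_i/a_i; the step sizes are the standard accelerated-gradient choices for a (1/µ)-smooth, (1/L)-strongly convex objective, which are our assumptions rather than a prescription from this section.

```python
import math

# Runnable toy sketch of PNGD (Algorithm 1) on the dual problem (6) with d = 1
# and quadratic losses f_i(x) = (a_i/2)(x - c_i)^2, for which
# (grad F*)(z)_i = c_i + z_i / a_i in closed form.  All constants (a_i, c_i)
# are illustrative choices, not from the paper.
a = [1.0, 2.0, 1.5]            # f_i is a_i-strongly convex and a_i-smooth
c = [0.0, 3.0, 1.0]
n = len(a)
mu, L = min(a), max(a)

def grad_F_star(z):
    return [c[i] + z[i] / a[i] for i in range(n)]

def proj(v):                    # projector P onto L-perp for d = 1
    m = sum(v) / n
    return [vi - m for vi in v]

# standard accelerated-gradient constants for the (1/mu)-smooth,
# (1/L)-strongly convex dual objective F*
alpha, eta, theta, tau = 1.0 / L, math.sqrt(mu * L), mu, math.sqrt(mu / L)

z = [0.0] * n                   # z^0 in L-perp
zf = list(z)
for k in range(300):
    zg = [tau * z[i] + (1 - tau) * zf[i] for i in range(n)]
    g = proj(grad_F_star(zg))
    z = [z[i] + eta * alpha * (zg[i] - z[i]) - eta * g[i] for i in range(n)]
    zf = [zg[i] - theta * g[i] for i in range(n)]

# the dual gradient converges to the consensus solution x* of (1)
x_star = sum(a[i] * c[i] for i in range(n)) / sum(a)
assert all(abs(v - x_star) < 1e-6 for v in grad_F_star(zg))
```

Note that every iteration calls proj, i.e., performs a full average over the nodes; replacing this global step by local gossip is exactly the problem ADOM solves.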
4. Decentralized Communication
We now introduce the necessary notation, definitions and formalism needed to describe our method. Compute nodes V = [n] are connected through a communication network represented as a graph G^k = (V, E^k), where k ∈ {0, 1, 2, . . .} encodes time, and E^k ⊆ {(i, j) ∈ V × V : i ≠ j} is the set of edges at time k. In this work we assume that the graph G^k is undirected, that is, (i, j) ∈ E^k implies (j, i) ∈ E^k. We also assume that G^k is connected. For each node i ∈ V we consider the set of its neighbors at time step k: N_i^k = {j ∈ V : (i, j) ∈ E^k}. At time step k, each node i ∈ V can communicate with the nodes from the set N_i^k only. This type of communication is known as decentralized communication in the literature.

Decentralized communication between nodes is typically represented via a matrix-vector multiplication with a gossip matrix. For time-invariant networks such representations can be found in, e.g., (Kovalev et al., 2020b). For each time step k ∈ {0, 1, 2, . . .} consider a matrix Ŵ(k) ∈ R^{n×n} with the following properties:

1. Ŵ(k) is symmetric and positive semi-definite,
2. Ŵ(k)_{i,j} ≠ 0 if and only if (i, j) ∈ E^k or i = j,
3. ker Ŵ(k) = {(x_1, . . . , x_n) ∈ R^n : x_1 = · · · = x_n}.

Matrix Ŵ(k) is often called a gossip matrix. A typical example is the Laplacian of the graph G^k. Consider also the linear map W(k) : (R^d)^V → (R^d)^V, i.e., the nd × nd matrix defined by W(k) = Ŵ(k) ⊗ I_d. This matrix can be represented as a block matrix (W(k)_{i,j})_{i,j∈V}, where each block W(k)_{i,j} = Ŵ(k)_{i,j} I_d is a d × d matrix proportional to I_d. Matrix W(k) satisfies properties similar to those of Ŵ(k):

1. W(k) is symmetric and positive semi-definite,
2. W(k)_{i,j} ≠ 0 if and only if (i, j) ∈ E^k or i = j,
3. ker W(k) = L, or equivalently, range W(k) = L⊥.

With a slight abuse of language, in the rest of this paper we will refer to W(k) as a gossip matrix as well.
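A minimal sketch of the canonical example, taking the plain graph Laplacian as Ŵ(k) (the ring topology and unit edge weights are illustrative choices): symmetry and positive semi-definiteness follow because the Laplacian quadratic form is a sum of squares over edges, and consensus vectors lie in its kernel.

```python
import random

# Sketch: the graph Laplacian L = D - A as a gossip matrix W_hat(k).  For an
# undirected graph it is symmetric, positive semi-definite (its quadratic form
# is a sum of squares over edges), its sparsity pattern matches the edge set,
# and it vanishes exactly on consensus vectors -- properties 1-3 above.
n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # a ring on 5 nodes

W = [[0.0] * n for _ in range(n)]
for i, j in edges:
    W[i][j] = W[j][i] = -1.0
    W[i][i] += 1.0
    W[j][j] += 1.0

# property 1: PSD-ness via the edge-sum identity  x^T W x = sum (x_i - x_j)^2
random.seed(1)
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(n)]
    quad = sum(x[i] * sum(W[i][j] * x[j] for j in range(n)) for i in range(n))
    assert quad >= -1e-12
    assert abs(quad - sum((x[i] - x[j]) ** 2 for i, j in edges)) < 1e-9

# property 3: consensus vectors are in the kernel
ones = [1.0] * n
assert all(abs(sum(W[i][j] * ones[j] for j in range(n))) < 1e-12
           for i in range(n))
```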
Decentralized communication of vectors x_1, . . . , x_n ∈ R^d stored on the nodes among neighboring nodes at time step k can be represented as a multiplication of an nd-dimensional vector by the matrix W(k). Indeed, consider x = (x_1, . . . , x_n) ∈ (R^d)^V and y = (y_1, . . . , y_n) ∈ (R^d)^V, where each x_i is stored by node i ∈ V, and let y = W(k)x. One can observe that

y_i = Σ_{j=1}^n Ŵ(k)_{i,j} x_j = Σ_{j ∈ N_i^k ∪ {i}} Ŵ(k)_{i,j} x_j.

Hence, for each node i, the vector y_i is a linear combination of its own vector x_i and the vectors x_j stored at the neighboring nodes j ∈ N_i^k. This means that matrix-vector multiplications by the matrix W(k) can be computed in a decentralized fashion.

The condition number of the matrix Ŵ(k) is given as λ_max(Ŵ(k)) / λ_min^+(Ŵ(k)), where λ_max refers to the largest and λ_min^+ to the smallest positive eigenvalue. This quantity is known to be a measure of the connectivity of the graph G^k, and appears in the convergence rates of many decentralized algorithms. In this work we assume that this condition number is bounded for all k ∈ {0, 1, 2, . . .}. In particular, we assume that there exist constants 0 < λ_min^+ ≤ λ_max such that

λ_min^+ ≤ λ_min^+(Ŵ(k)) ≤ λ_max(Ŵ(k)) ≤ λ_max.   (11)

So, we assume that the worst case spectral behavior of the gossip matrices is bounded, and these bounds will later appear in our convergence rate for ADOM. Relation (11) can be equivalently written in the form of a linear matrix inequality involving the gossip matrix W(k):

λ_min^+ P ⪯ W(k) ⪯ λ_max P,   (12)

where P is the orthogonal projector onto the subspace range W(k) = L⊥ given by (9). Note that

P W(k) = W(k) P = W(k).   (13)

By χ we denote a bound on the condition number of the matrices W(k), k = 0, 1, 2, . . ., given by

χ := λ_max / λ_min^+.   (14)

We have just shown that decentralized communication at time step k can be represented as multiplication by the gossip matrix W(k).
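The locality of this operation can be made concrete in a few lines (again with an illustrative ring Laplacian and d = 1): each node reproduces its entry of Wx from its own value and its neighbors' values alone.

```python
# Sketch: a matrix-vector product with the gossip matrix is a purely local
# operation -- node i only needs x_j from its neighbors (and its own x_i).
# We compute y = W x once globally and once per node using only neighbor
# values; the ring Laplacian below is an illustrative choice.
n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
nbrs = {i: [] for i in range(n)}
W = [[0.0] * n for _ in range(n)]
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)
    W[i][j] = W[j][i] = -1.0
    W[i][i] += 1.0
    W[j][j] += 1.0

x = [0.5, -1.0, 2.0, 0.0, 3.0]
y_global = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
# each node i combines only its own value and its neighbors' values
y_local = [W[i][i] * x[i] + sum(W[i][j] * x[j] for j in nbrs[i])
           for i in range(n)]
assert all(abs(u - v) < 1e-12 for u, v in zip(y_global, y_local))
```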
We will now show, and this is a key insight which was the starting point of our work, that decentralized communication can also be seen as the application of a contractive compression operator.
Let Q be a linear space. A mapping C : Q → Q is called a compression operator if there exists δ ∈ (0, 1] such that

∥C(z) − z∥ ≤ (1 − δ)∥z∥ for all z ∈ Q.   (15)

The following lemma shows that matrix-vector multiplication by the gossip matrix W(k) is a contractive compression operator acting on the subspace L⊥.

Lemma 1.
Let σ ∈ (0, 1/λ_max], k ∈ {0, 1, 2, . . .}. Then the following inequality holds for all z ∈ L⊥:

∥σW(k)z − z∥ ≤ (1 − σλ_min^+)∥z∥.

There is a natural question regarding Algorithm 1: can we replace the gradient P∇F*(z_g^k) on lines 5 and 6 with its compressed version W(k)P∇F*(z_g^k), or some modification thereof, and still obtain a good convergence result? Note that (13) implies W(k)P∇F*(z_g^k) = W(k)∇F*(z_g^k). In Section 5 we will provide a positive answer to this question.

Convergence of gradient-type methods with contractive compression operators satisfying (15) was studied in several recent papers. In particular, Stich & Karimireddy (2019); Karimireddy et al. (2019); Beznosikov et al. (2020); Gorbunov et al. (2020a) study a mechanism called error feedback, which makes it possible to design variants with better convergence properties. However, these works do not study accelerated algorithms, nor do they provide any connections between compression and decentralized communication, which we do.

The only exception we are aware of is (Qian et al., 2020), which is the first work proposing an accelerated error compensated method. However, their ECLK method is more complicated than ours, uses the Katyusha momentum (Allen-Zhu, 2017; Kovalev et al., 2020a) instead of the Nesterov momentum we employ, and does not apply to decentralized optimization. Moreover, while for us it is crucial that the contractive property is enforced on a subspace only, Qian et al. (2020) require this property to hold globally.
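Lemma 1 can be checked numerically. The sketch below uses the ring Laplacian on n = 5 nodes, whose spectrum 2 − 2cos(2πk/n) is known in closed form, and the choice σ = 1/λ_max; the topology is an illustrative assumption.

```python
import math, random

# Numeric check of Lemma 1 on the ring Laplacian with n = 5: for z in L-perp
# and sigma in (0, 1/lambda_max], multiplication by sigma * W is a contractive
# compression in the sense of (15).
n = 5
eig = [2 - 2 * math.cos(2 * math.pi * k / n) for k in range(n)]
lam_max = max(eig)
lam_min = min(e for e in eig if e > 1e-9)   # smallest positive eigenvalue
sigma = 1.0 / lam_max

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
W = [[0.0] * n for _ in range(n)]
for i, j in edges:
    W[i][j] = W[j][i] = -1.0
    W[i][i] += 1.0
    W[j][j] += 1.0

random.seed(2)
for _ in range(200):
    z = [random.uniform(-1, 1) for _ in range(n)]
    m = sum(z) / n
    z = [zi - m for zi in z]                 # project onto L-perp
    Wz = [sum(W[i][j] * z[j] for j in range(n)) for i in range(n)]
    lhs = math.sqrt(sum((sigma * Wz[i] - z[i]) ** 2 for i in range(n)))
    rhs = (1 - sigma * lam_min) * math.sqrt(sum(zi * zi for zi in z))
    assert lhs <= rhs + 1e-12
```

On L⊥ the eigenvalues of σW lie in [σλ_min^+, 1], so every mode of z is shrunk by at most the factor 1 − σλ_min^+, which is exactly the contraction the code observes.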
5. ADOM: Algorithm and its Analysis
Armed with the notions and ideas described in the preceding sections, we are now ready to present our method
ADOM (Algorithm 2). As alluded to in the introduction,
ADOM is a generalization of Algorithm 1 that can be implemented in a decentralized fashion. Indeed, our algorithm does not make use of matrix-vector multiplication by P in the way Algorithm 1 does, which requires full averaging over the network. Instead, ADOM uses matrix-vector multiplication by the gossip matrix W(k), which represents a single decentralized communication round, as we discussed in Section 4.

Algorithm 2 ADOM: Accelerated Decentralized Optimization Method
input: z^0 ∈ L⊥, m^0 ∈ (R^d)^V, α, η, θ, σ > 0, τ ∈ (0, 1)
set z_f^0 = z^0
for k = 0, 1, 2, . . . do
  z_g^k = τ z^k + (1 − τ) z_f^k
  Δ^k = σ W(k)(m^k − η ∇F*(z_g^k))
  m^{k+1} = m^k − η ∇F*(z_g^k) − Δ^k
  z^{k+1} = z^k + ηα(z_g^k − z^k) + Δ^k
  z_f^{k+1} = z_g^k − θ W(k) ∇F*(z_g^k)
end for

5.1. Design and Analysis of the New Algorithm

First, we mention two lemmas, which play an important role in the convergence analysis of Algorithm 2.
Lemma 2.
For θ ≤ µ/λ_max, we have the inequality

F*(z_f^{k+1}) ≤ F*(z_g^k) − (θλ_min^+/2) ∥∇F*(z_g^k)∥_P^2.   (16)

Lemma 3.
For σ ≤ 1/λ_max, we have the inequality

∥m^k∥_P^2 ≤ (1 − σλ_min^+/2) · (2/(σλ_min^+)) ∥m^k∥_P^2 − (2/(σλ_min^+)) ∥m^{k+1}∥_P^2 + (2η^2/(σλ_min^+)^2) ∥∇F*(z_g^k)∥_P^2.   (17)

We now make a few remarks about the main steps of Algorithm 2 compared to Algorithm 1, and comment on some aspects of our convergence analysis and the role of the above lemmas in it:

• Line 4 of Algorithm 2 is unchanged compared to line 4 of Algorithm 1.

• Line 8 of Algorithm 2 corresponds to line 6 of Algorithm 1. Note that the analysis of Algorithm 1 requires an inequality of the type F*(z_f^{k+1}) ≤ F*(z_g^k) − const · ∥∇F*(z_g^k)∥_P^2. Lemma 2 establishes a similar inequality for Algorithm 2.

• Together, lines 5, 6 and 7 of Algorithm 2 form an error feedback update, which we discussed in Section 4.4 when we interpreted a decentralized communication round as the application of a contractive compression operator. A key to the theoretical analysis of this update is to make use of the so-called ghost iterate ẑ^k = z^k + Pm^k. One can observe that the ghost iterate is updated as ẑ^{k+1} = ẑ^k + ηα(z_g^k − z^k) − ηP∇F*(z_g^k), which is similar to the update on line 5 of Algorithm 1.

• Another key step in the analysis of Algorithm 2 is to bound the distance between the actual iterate z^k and the ghost iterate ẑ^k, which is equal to ∥m^k∥_P. This is done in Lemma 3 based on line 6 of Algorithm 2.

Now, we are ready to present our main theorem.
Theorem 1.
Set the parameters α, η, θ, σ, τ of Algorithm 2 to α = 1/L, η = (λ_min^+/λ_max)√(µL), θ = µ/λ_max, σ = 1/λ_max, and τ = (λ_min^+/λ_max)√(µ/L). Then there exists C > 0 such that

∥∇F*(z_g^k) − x*∥^2 ≤ C (1 − (λ_min^+/λ_max)√(µ/L))^k.   (18)

Note that the rate is O(χκ^{1/2} log(1/ε)), as previously advertised. Proofs of all our results are available in the appendix.

In this paper we compare our Algorithm 2 with the current state-of-the-art algorithms for decentralized optimization over time-varying networks. While the Accelerated Penalty Method (APM) (Li et al., 2018) and Mudag (Ye et al., 2020) were originally designed for time-invariant networks, they can be easily extended to the time-varying case. Also note that DIGing (Nedic et al., 2017), the Push-Pull Gradient Method (Pu et al., 2020) and PANDA (Maros & Jaldén, 2018) converge under slightly more general assumptions than those we used to analyze our method. However, these methods converge at a substantially slower rate due to the fact that they do not employ any acceleration mechanism. Moreover, to the best of our knowledge, no results improving the convergence rates of these algorithms under our assumptions exist in the literature.
6. Numerical Experiments
In this section we perform experiments with logistic regression for binary classification with ℓ2 regularization. That is, our loss function has the form

f_i(x) = (1/m) Σ_{j=1}^m log(1 + exp(−b_ij a_ij^⊤ x)) + (r/2)∥x∥^2,   (19)

where a_ij ∈ R^d and b_ij ∈ {−1, +1} are data points and labels, r > 0 is a regularization parameter, and m is the number of data points stored on each node. In our experiments we use the function sklearn.datasets.make_classification from the scikit-learn library for dataset generation. We generate a number of datasets consisting of 10,000 samples, distributed to the n = 100 nodes of the network with m = 100 samples on each node. We vary r to obtain different values of the condition number κ. We also vary the number of features d.

In order to simulate a time-varying network, we use geometric random graphs. That is, we generate n = 100 nodes from the uniform distribution over [0, 1]^2 ⊂ R^2 and connect each pair of nodes whose distance is less than a certain radius. Since a geometric graph is likely to be disconnected when the radius is small, we enforce connectivity by adding a minimal number of edges. We obtain a sequence of networks {G^k}_{k=0}^∞ by generating a number of random geometric graphs and switching between them in a cyclic way. For each k, the matrix W(k) is chosen to be the Laplacian of the graph G^k divided by its largest eigenvalue. We obtain different values of the time-varying network structure parameter χ by choosing different values of the radius.

One potential problem with ADOM is that it has to calculate the dual gradient ∇F*(z_g^k), which is known to be the solution of the following problem:

∇F*(z_g^k) = argmin_{x∈(R^d)^V} {F(x) − ⟨x, z_g^k⟩}.   (20)

In practice, ∇F*(z_g^k) may be hard to compute.
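One workable remedy is to approximate (20) by a few warm-started inner gradient steps. The sketch below does this for an illustrative one-dimensional logistic-plus-quadratic loss (all constants are assumptions for the example, not values from the paper):

```python
import math

# Sketch of an inexact dual gradient for (20): (grad F*)(z)_i minimizes
# f_i(x) - z_i * x, approximated by T warm-started gradient steps.  The 1-D
# loss f(x) = log(1 + exp(x)) + (mu/2) x^2 is mu-strongly convex and
# (1/4 + mu)-smooth; all constants here are illustrative.
mu = 0.5
Lsm = 0.25 + mu                       # smoothness constant of f

def grad_f(x):
    return 1.0 / (1.0 + math.exp(-x)) + mu * x

def inexact_dual_grad(z_i, x_warm, T=3, step=1.0 / Lsm):
    # T gradient steps on  x -> f(x) - z_i * x,  warm-started at the
    # previous estimate of the dual gradient
    x = x_warm
    for _ in range(T):
        x -= step * (grad_f(x) - z_i)
    return x

# warm starts make a few inner steps enough: track the stationarity residual
# |f'(x) - z_i|, which is zero at the exact dual gradient
x = 0.0
for z_i in [0.3, 0.31, 0.32, 0.33]:   # slowly drifting dual points, as along z_g^k
    x = inexact_dual_grad(z_i, x)
    residual = abs(grad_f(x) - z_i)
assert residual < 1e-3
```

Because consecutive dual points z_g^k move slowly, the previous estimate is already close to the new minimizer, which is why a very small number of inner steps suffices.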
In our experiments we solve this issue by calculating ∇F*(z_g^k) inexactly, using T iterations of Gradient Descent (GD) or Accelerated Gradient Descent (AGD) initialized with the previous estimate of ∇F*(z_g^{k−1}). It turns out that a small value of T is sufficient to obtain a good convergence rate in practice.

ADOM is stable
We first compare
ADOM with the Distributed Nesterov Method (DNM) of Rogozin et al. (2019). The condition number κ is set to 30, and the number of features is d = 40. To calculate the dual gradient ∇F*(z) we use T = 3 steps of AGD in ADOM and T = 30 steps of AGD in DNM. We switch between 2 networks every t iterations, where t ∈ {5, 10, 20, 50}. We use the following choice of networks: (i) two random geometric graphs; see Figure 2 (top row); (ii) two networks with ring and star topology; see Figure 2 (bottom row). DNM diverges in 7 out of 8 cases presented in Figure 2, while ADOM converges in all cases. However, when DNM converges, it can converge faster than
ADOM, since its communication complexity has a better dependence on χ (√χ for DNM vs. χ for ADOM).

ADOM with the state of the art: Mudag, Acc-DNGD and APM
We compare
ADOM with the following algorithms for decentralized optimization over time-varying networks, all equipped with Nesterov acceleration: Mudag (Ye et al., 2020), Acc-DNGD (Qu & Li, 2019), and APM (Li et al., 2018).
Figure 2. Comparison of DNM and ADOM on a problem with κ = 30 and d = 40. Top row: we alternate between two geometric graphs. Bottom row: we alternate between two networks, one with a ring and the other with a star topology.
Figure 3.
Comparison of Mudag, Acc-DNGD, APM and
ADOM on problems with fixed χ, d ∈ {40, 60, 80, 100} and κ ∈ {10, 100, 1000, 10000}.
Figure 4.
Comparison of Mudag, Acc-DNGD, APM and
ADOM on problems with χ ∈ {9, 32, 134, 521}, d ∈ {40, 60, 80, 100} and κ = 100.

We do not compare ADOM with PANDA (Maros & Jaldén, 2018) and DIGing (Nedic et al., 2017), because they are not accelerated and have a very slow convergence rate both in theory and in practice. We use T = 1 iteration of GD to calculate ∇F*(z_g^k) in ADOM.

We generate random datasets with the number of features d ∈ {40, 60, 80, 100}. In Figure 3 we fix the network structure parameter χ and perform a comparison for condition numbers κ ∈ {10, 100, 1000, 10000}. In Figure 4 we fix κ = 100 and perform a comparison for χ ∈ {9, 32, 134, 521}.

Overall, ADOM is better than the contenders. Acc-DNGD performs worse as the values of χ and κ grow, since it has the worst dependence on them. One can also observe that APM suffers from sub-linear convergence, which becomes clear as the number of iterations grows (see the bottom row of Figure 3), since its communication complexity is proportional to log^2(1/ε).

References
Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.

Bazerque, J. A. and Giannakis, G. B. Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Transactions on Signal Processing, 58(3):1847–1862, 2009.

Beck, A., Nedić, A., Ozdaglar, A., and Teboulle, M. An O(1/k) gradient method for network resource allocation problems. IEEE Transactions on Control of Network Systems, 1(1):64–73, 2014.

Beznosikov, A., Horváth, S., Richtárik, P., and Safaryan, M. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

Gan, L., Topcu, U., and Low, S. H. Optimal decentralized protocol for electric vehicle charging. IEEE Transactions on Power Systems, 28(2):940–951, 2012.

Giselsson, P., Doan, M. D., Keviczky, T., De Schutter, B., and Rantzer, A. Accelerated gradient methods and dual decomposition in distributed model predictive control. Automatica, 49(3):829–833, 2013.

Gorbunov, E., Kovalev, D., Makarenko, D., and Richtárik, P. Linearly converging error compensated SGD. Advances in Neural Information Processing Systems, 33, 2020a.

Gorbunov, E., Rogozin, A., Beznosikov, A., Dvinskikh, D., and Gasnikov, A. Recent theoretical advances in decentralized distributed convex optimization. arXiv preprint arXiv:2011.13259, 2020b.

Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. Error feedback fixes SignSGD and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.

Kolar, M., Song, L., Ahmed, A., and Xing, E. P. Estimating time-varying networks. The Annals of Applied Statistics, 4(1):94–123, 2010.

Konečný, J., McMahan, H. B., Yu, F., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.

Kovalev, D., Horváth, S., and Richtárik, P. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020a.

Kovalev, D., Salim, A., and Richtárik, P. Optimal and practical algorithms for smooth and strongly convex decentralized optimization. Advances in Neural Information Processing Systems, 33, 2020b.

Kovalev, D., Koloskova, A., Jaggi, M., Richtárik, P., and Stich, S. A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021), 2021.

Li, H., Fang, C., Yin, W., and Lin, Z. A sharp convergence rate analysis for distributed accelerated gradient methods. arXiv preprint arXiv:1810.01053, 2018.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020a.

Li, Z., Kovalev, D., Qian, X., and Richtárik, P. Acceleration for compressed gradient descent in distributed and federated optimization. In International Conference on Machine Learning, 2020b.

Maros, M. and Jaldén, J. PANDA: A dual linearly converging method for distributed optimization over time-varying undirected graphs. pp. 6520–6525. IEEE, 2018.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.

Nedic, A., Olshevsky, A., and Shi, W. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.

Pu, S., Shi, W., Xu, J., and Nedic, A. Push-pull gradient methods for distributed optimization in networks. IEEE Transactions on Automatic Control, 2020.

Qian, X., Richtárik, P., and Zhang, T. Error compensated distributed SGD can be accelerated. arXiv preprint arXiv:2010.00091, 2020.

Qu, G. and Li, N. Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 2019.

Rabbat, M. and Nowak, R. Distributed optimization in sensor networks. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pp. 20–27, 2004.

Rockafellar, R. T. Convex analysis, volume 36. Princeton University Press, 1970.

Rogozin, A., Uribe, C., Gasnikov, A., Malkovskii, N., and Nedich, A. Optimal distributed convex optimization on slowly time-varying graphs. IEEE Transactions on Control of Network Systems, 2019.

Rogozin, A., Lukoshkin, V., Gasnikov, A., Kovalev, D., and Shulgin, E. Towards accelerated rates for distributed optimization over time-varying networks. arXiv preprint arXiv:2009.11069, 2020.

Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulié, L. Optimal algorithms for smooth and strongly convex distributed optimization in networks. arXiv preprint arXiv:1702.08704, 2017.

Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.

Ye, H., Luo, L., Zhou, Z., and Zhang, T. Multi-consensus decentralized accelerated gradient descent. arXiv preprint arXiv:2005.00797, 2020.

Zadeh, L. A. Time-varying networks, I. Proceedings of the IRE, 49(10):1488–1503, 1961.
Appendix
A. Proof of Lemma 2
Proof.
We start with the (1/μ)-smoothness of F*:
\[
F^*(z_f^{k+1}) \le F^*(z_g^k) + \langle \nabla F^*(z_g^k), z_f^{k+1} - z_g^k \rangle + \frac{1}{2\mu}\|z_f^{k+1} - z_g^k\|^2.
\]
Using line 8 of Algorithm 2 together with (12) we get
\begin{align*}
F^*(z_f^{k+1}) &\le F^*(z_g^k) - \theta\|\nabla F^*(z_g^k)\|^2_{W(k)} + \frac{\theta^2}{2\mu}\|\nabla F^*(z_g^k)\|^2_{W(k)^2} \\
&\le F^*(z_g^k) - \frac{\theta\lambda^+_{\min}}{2}\|\nabla F^*(z_g^k)\|^2_P - \frac{\theta}{2}\|\nabla F^*(z_g^k)\|^2_{W(k)} + \frac{\theta^2\lambda_{\max}}{2\mu}\|\nabla F^*(z_g^k)\|^2_{W(k)} \\
&= F^*(z_g^k) - \frac{\theta\lambda^+_{\min}}{2}\|\nabla F^*(z_g^k)\|^2_P + \frac{\theta}{2}\left(\frac{\theta\lambda_{\max}}{\mu} - 1\right)\|\nabla F^*(z_g^k)\|^2_{W(k)}.
\end{align*}
Using the condition θ ≤ μ/λ_max we get
\[
F^*(z_f^{k+1}) \le F^*(z_g^k) - \frac{\theta\lambda^+_{\min}}{2}\|\nabla F^*(z_g^k)\|^2_P.
\]

B. Proof of Lemma 3
Proof.
Using (10) and (13) together with lines 5 and 6 of Algorithm 2 we obtain
\begin{align*}
\|m^{k+1}\|^2_P &= \|m^k - \eta\nabla F^*(z_g^k) - \Delta^k\|^2_P = \|(P - \sigma W(k))(m^k - \eta\nabla F^*(z_g^k))\|^2 \\
&= \|m^k - \eta\nabla F^*(z_g^k)\|^2_P - 2\sigma\|m^k - \eta\nabla F^*(z_g^k)\|^2_{W(k)} + \sigma^2\|m^k - \eta\nabla F^*(z_g^k)\|^2_{W(k)^2}.
\end{align*}
Using (12) we obtain
\begin{align*}
\|m^{k+1}\|^2_P &\le \|m^k - \eta\nabla F^*(z_g^k)\|^2_P - \sigma\lambda^+_{\min}\|m^k - \eta\nabla F^*(z_g^k)\|^2_P \\
&\quad - \sigma\|m^k - \eta\nabla F^*(z_g^k)\|^2_{W(k)} + \sigma^2\lambda_{\max}\|m^k - \eta\nabla F^*(z_g^k)\|^2_{W(k)} \\
&= \|m^k - \eta\nabla F^*(z_g^k)\|^2_P - \sigma\lambda^+_{\min}\|m^k - \eta\nabla F^*(z_g^k)\|^2_P + \sigma(\sigma\lambda_{\max} - 1)\|m^k - \eta\nabla F^*(z_g^k)\|^2_{W(k)}.
\end{align*}
Using the condition σ ≤ 1/λ_max we get
\[
\|m^{k+1}\|^2_P \le (1 - \sigma\lambda^+_{\min})\|m^k - \eta\nabla F^*(z_g^k)\|^2_P.
\]
Using Young's inequality we get
\begin{align*}
\|m^{k+1}\|^2_P &\le (1 - \sigma\lambda^+_{\min})\left(\left(1 + \frac{\sigma\lambda^+_{\min}}{2(1 - \sigma\lambda^+_{\min})}\right)\|m^k\|^2_P + \left(1 + \frac{2(1 - \sigma\lambda^+_{\min})}{\sigma\lambda^+_{\min}}\right)\|\eta\nabla F^*(z_g^k)\|^2_P\right) \\
&= \left(1 - \frac{\sigma\lambda^+_{\min}}{2}\right)\|m^k\|^2_P + \frac{\eta^2(1 - \sigma\lambda^+_{\min})(2 - \sigma\lambda^+_{\min})}{\sigma\lambda^+_{\min}}\|\nabla F^*(z_g^k)\|^2_P \\
&\le \left(1 - \frac{\sigma\lambda^+_{\min}}{2}\right)\|m^k\|^2_P + \frac{2\eta^2}{\sigma\lambda^+_{\min}}\|\nabla F^*(z_g^k)\|^2_P.
\end{align*}
Rearranging concludes the proof.
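The contraction at the heart of this lemma can be checked numerically. The sketch below is illustrative only: it uses a random full-rank PSD matrix in place of W(k) (so its smallest eigenvalue plays the role of λ⁺_min, and P is the identity), and random vectors in place of m^k and η∇F*(z_g^k); all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20

# Random PSD matrix standing in for W(k); full rank here for simplicity.
R = rng.standard_normal((d, d))
W = R @ R.T / d + 0.1 * np.eye(d)
lam = np.linalg.eigvalsh(W)          # ascending eigenvalues
lam_min, lam_max = lam[0], lam[-1]

sigma = 1.0 / lam_max                # step size condition of Lemma 3
m = rng.standard_normal(d)           # stands in for m^k
v = rng.standard_normal(d)           # stands in for eta * grad F*(z_g^k)

# Lines 5-6 combined: m^{k+1} = (I - sigma W)(m^k - eta grad F*(z_g^k)).
m_new = (np.eye(d) - sigma * W) @ (m - v)

# Contraction: ||m^{k+1}||^2 <= (1 - sigma * lam_min) * ||m^k - v||^2.
assert m_new @ m_new <= (1 - sigma * lam_min) * ((m - v) @ (m - v)) + 1e-9
print("contraction holds")
```

The assertion holds because every eigenvalue of I − σW lies in [0, 1 − σλ_min] when σ = 1/λ_max.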
C. New Lemma
Lemma 4.
Let
\[
\alpha = \frac{1}{2L}, \qquad (21)
\]
\[
\eta = \frac{2\lambda^+_{\min}\sqrt{\mu L}}{\lambda_{\max}}, \qquad (22)
\]
\[
\theta = \frac{\mu}{\lambda_{\max}}, \qquad (23)
\]
\[
\sigma = \frac{1}{\lambda_{\max}}, \qquad (24)
\]
\[
\tau = \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}. \qquad (25)
\]
Define the Lyapunov function
\[
\Psi^k := \|\hat z^k - z^*\|^2 + \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) + 6\|m^k\|^2_P, \qquad (26)
\]
where \(\hat z^k\) is defined by
\[
\hat z^k = z^k + P m^k. \qquad (27)
\]
Then the following inequality holds:
\[
\Psi^{k+1} \le \left(1 - \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}\right)\Psi^k. \qquad (28)
\]
Proof.
Using (27) together with lines 6 and 7 of Algorithm 2, we get
\begin{align*}
\hat z^{k+1} &= z^{k+1} + P m^{k+1} = z^k + \eta\alpha(z_g^k - z^k) + \Delta^k + P\big(m^k - \eta\nabla F^*(z_g^k) - \Delta^k\big) \\
&= z^k + P m^k + \eta\alpha(z_g^k - z^k) - \eta P\nabla F^*(z_g^k) + \Delta^k - P\Delta^k.
\end{align*}
From line 5 of Algorithm 2 and (13) it follows that \(P\Delta^k = \Delta^k\), which implies
\[
\hat z^{k+1} = z^k + P m^k + \eta\alpha(z_g^k - z^k) - \eta P\nabla F^*(z_g^k) = \hat z^k + \eta\alpha(z_g^k - z^k) - \eta P\nabla F^*(z_g^k).
\]
Hence,
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &= \|\hat z^k - z^* + \eta\alpha(z_g^k - z^k) - \eta P\nabla F^*(z_g^k)\|^2 \\
&= \|(1 - \eta\alpha)(\hat z^k - z^*) + \eta\alpha(z_g^k + P m^k - z^*)\|^2 + \eta^2\|\nabla F^*(z_g^k)\|^2_P \\
&\quad - 2\eta\langle P\nabla F^*(z_g^k), z^k + P m^k - z^* + \eta\alpha(z_g^k - z^k)\rangle \\
&\le (1 - \eta\alpha)\|\hat z^k - z^*\|^2 + \eta\alpha\|z_g^k + P m^k - z^*\|^2 + \eta^2\|\nabla F^*(z_g^k)\|^2_P \\
&\quad - 2\eta\langle\nabla F^*(z_g^k), P(z_g^k - z^*)\rangle + 2\eta(1 - \eta\alpha)\langle\nabla F^*(z_g^k), P(z_g^k - z^k)\rangle - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle \\
&\le (1 - \eta\alpha)\|\hat z^k - z^*\|^2 + 2\eta\alpha\|z_g^k - z^*\|^2 + 2\eta\alpha\|m^k\|^2_P + \eta^2\|\nabla F^*(z_g^k)\|^2_P \\
&\quad - 2\eta\langle\nabla F^*(z_g^k), P(z_g^k - z^*)\rangle + 2\eta(1 - \eta\alpha)\langle\nabla F^*(z_g^k), P(z_g^k - z^k)\rangle - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle.
\end{align*}
One can observe that \(z^k, z_g^k, z^* \in \mathcal{L}^\perp\). Hence,
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le (1 - \eta\alpha)\|\hat z^k - z^*\|^2 + 2\eta\alpha\|z_g^k - z^*\|^2 + 2\eta\alpha\|m^k\|^2_P + \eta^2\|\nabla F^*(z_g^k)\|^2_P \\
&\quad - 2\eta\langle\nabla F^*(z_g^k), z_g^k - z^*\rangle + 2\eta(1 - \eta\alpha)\langle\nabla F^*(z_g^k), z_g^k - z^k\rangle - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle.
\end{align*}
Using line 4 of Algorithm 2 we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le (1 - \eta\alpha)\|\hat z^k - z^*\|^2 + 2\eta\alpha\|z_g^k - z^*\|^2 + 2\eta\alpha\|m^k\|^2_P + \eta^2\|\nabla F^*(z_g^k)\|^2_P \\
&\quad - 2\eta\langle\nabla F^*(z_g^k), z_g^k - z^*\rangle + \frac{2\eta(1 - \eta\alpha)(1 - \tau)}{\tau}\langle\nabla F^*(z_g^k), z_f^k - z_g^k\rangle - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle.
\end{align*}
Using the convexity and (1/L)-strong convexity of F*(z) we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le (1 - \eta\alpha)\|\hat z^k - z^*\|^2 + 2\eta\alpha\|z_g^k - z^*\|^2 + 2\eta\alpha\|m^k\|^2_P + \eta^2\|\nabla F^*(z_g^k)\|^2_P \\
&\quad - 2\eta\big(F^*(z_g^k) - F^*(z^*)\big) - \frac{\eta}{L}\|z_g^k - z^*\|^2 + \frac{2\eta(1 - \eta\alpha)(1 - \tau)}{\tau}\big(F^*(z_f^k) - F^*(z_g^k)\big) - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle \\
&= (1 - \eta\alpha)\|\hat z^k - z^*\|^2 + \left(2\eta\alpha - \frac{\eta}{L}\right)\|z_g^k - z^*\|^2 + \eta^2\|\nabla F^*(z_g^k)\|^2_P - 2\eta\big(F^*(z_g^k) - F^*(z^*)\big) \\
&\quad + \frac{2\eta(1 - \eta\alpha)(1 - \tau)}{\tau}\big(F^*(z_f^k) - F^*(z_g^k)\big) - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle + 2\eta\alpha\|m^k\|^2_P.
\end{align*}
Using α defined by (21) we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \eta^2\|\nabla F^*(z_g^k)\|^2_P - 2\eta\big(F^*(z_g^k) - F^*(z^*)\big) \\
&\quad + \frac{2\eta(1 - \eta\alpha)(1 - \tau)}{\tau}\big(F^*(z_f^k) - F^*(z_g^k)\big) - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle + 2\eta\alpha\|m^k\|^2_P.
\end{align*}
Since \(F^*(z_g^k) \ge F^*(z^*)\), we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \eta^2\|\nabla F^*(z_g^k)\|^2_P - 2\eta(1 - \eta\alpha)\big(F^*(z_g^k) - F^*(z^*)\big) \\
&\quad + \frac{2\eta(1 - \eta\alpha)(1 - \tau)}{\tau}\big(F^*(z_f^k) - F^*(z_g^k)\big) - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle + 2\eta\alpha\|m^k\|^2_P \\
&= \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \eta^2\|\nabla F^*(z_g^k)\|^2_P \\
&\quad + 2\eta(1 - \eta\alpha)\left(\frac{1 - \tau}{\tau}F^*(z_f^k) + F^*(z^*) - \frac{1}{\tau}F^*(z_g^k)\right) - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle + 2\eta\alpha\|m^k\|^2_P.
\end{align*}
Using (16) and θ defined by (23) we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \left(\eta^2 - \frac{(1 - \eta\alpha)\eta\mu\lambda^+_{\min}}{\tau\lambda_{\max}}\right)\|\nabla F^*(z_g^k)\|^2_P \\
&\quad + \frac{(1 - \tau)\,2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) \\
&\quad - 2\eta\langle P\nabla F^*(z_g^k), m^k\rangle + 2\eta\alpha\|m^k\|^2_P.
\end{align*}
Using Young's inequality we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \left(\eta^2 - \frac{(1 - \eta\alpha)\eta\mu\lambda^+_{\min}}{\tau\lambda_{\max}}\right)\|\nabla F^*(z_g^k)\|^2_P \\
&\quad + \frac{(1 - \tau)\,2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) \\
&\quad + \frac{\eta^2\lambda_{\max}}{\lambda^+_{\min}}\|\nabla F^*(z_g^k)\|^2_P + \frac{\lambda^+_{\min}}{\lambda_{\max}}\|m^k\|^2_P + 2\eta\alpha\|m^k\|^2_P \\
&= \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \left(\eta^2 + \frac{\eta^2\lambda_{\max}}{\lambda^+_{\min}} - \frac{(1 - \eta\alpha)\eta\mu\lambda^+_{\min}}{\tau\lambda_{\max}}\right)\|\nabla F^*(z_g^k)\|^2_P \\
&\quad + \frac{(1 - \tau)\,2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) \\
&\quad + \left(\frac{\lambda^+_{\min}}{\lambda_{\max}} + 2\eta\alpha\right)\|m^k\|^2_P.
\end{align*}
Using (22) and (21), which imply \(\eta\alpha \le \lambda^+_{\min}/\lambda_{\max}\), we obtain
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \left(\eta^2 + \frac{\eta^2\lambda_{\max}}{\lambda^+_{\min}} - \frac{\eta\mu\lambda^+_{\min}}{2\tau\lambda_{\max}}\right)\|\nabla F^*(z_g^k)\|^2_P \\
&\quad + \frac{(1 - \tau)\,2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) + \frac{3\lambda^+_{\min}}{\lambda_{\max}}\|m^k\|^2_P.
\end{align*}
Using (17) and σ defined by (24) we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \left(\eta^2 + \frac{\eta^2\lambda_{\max}}{\lambda^+_{\min}} - \frac{\eta\mu\lambda^+_{\min}}{2\tau\lambda_{\max}}\right)\|\nabla F^*(z_g^k)\|^2_P \\
&\quad + \frac{(1 - \tau)\,2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) \\
&\quad + \left(6 - \frac{3\lambda^+_{\min}}{\lambda_{\max}}\right)\|m^k\|^2_P - 6\|m^{k+1}\|^2_P + \frac{12\eta^2\lambda_{\max}}{\lambda^+_{\min}}\|\nabla F^*(z_g^k)\|^2_P \\
&\le \left(1 - \frac{\eta}{2L}\right)\|\hat z^k - z^*\|^2 + \left(\frac{14\eta^2\lambda_{\max}}{\lambda^+_{\min}} - \frac{\eta\mu\lambda^+_{\min}}{2\tau\lambda_{\max}}\right)\|\nabla F^*(z_g^k)\|^2_P \\
&\quad + \frac{(1 - \tau)\,2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) \\
&\quad + \left(6 - \frac{3\lambda^+_{\min}}{\lambda_{\max}}\right)\|m^k\|^2_P - 6\|m^{k+1}\|^2_P.
\end{align*}
Using η defined by (22) and τ defined by (25), the coefficient of \(\|\nabla F^*(z_g^k)\|^2_P\) is nonpositive, and we get
\begin{align*}
\|\hat z^{k+1} - z^*\|^2 &\le \left(1 - \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}\right)\|\hat z^k - z^*\|^2 + \left(1 - \frac{\lambda^+_{\min}}{2\lambda_{\max}}\right)6\|m^k\|^2_P - 6\|m^{k+1}\|^2_P \\
&\quad + \left(1 - \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}\right)\frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) \\
&\le \left(1 - \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}\right)\left(\|\hat z^k - z^*\|^2 + \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) + 6\|m^k\|^2_P\right) \\
&\quad - \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^{k+1}) - F^*(z^*)\big) - 6\|m^{k+1}\|^2_P.
\end{align*}
Rearranging and using (26) concludes the proof.
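As a sanity check on the parameter choices (21)–(25) and the contraction factor in (28), a small helper can compute them for given μ, L, λ⁺_min and λ_max. The formulas below follow our reconstruction of the garbled source, and the numeric values in the example (κ = L/μ = 100, χ = λ_max/λ⁺_min = 10) are hypothetical.

```python
import math

def adom_parameters(mu, L, lam_min_plus, lam_max):
    """Parameter choices (21)-(25), as reconstructed above."""
    alpha = 1.0 / (2.0 * L)                                   # (21)
    eta = 2.0 * lam_min_plus * math.sqrt(mu * L) / lam_max    # (22)
    theta = mu / lam_max                                      # (23)
    sigma = 1.0 / lam_max                                     # (24)
    tau = (lam_min_plus / lam_max) * math.sqrt(mu / L)        # (25)
    return alpha, eta, theta, sigma, tau

def contraction_rate(mu, L, lam_min_plus, lam_max):
    """Per-iteration factor in (28); iterating it gives the
    O(chi * sqrt(L / mu) * log(1 / eps)) communication complexity,
    where chi = lam_max / lam_min_plus."""
    return 1.0 - (lam_min_plus / lam_max) * math.sqrt(mu / L)

rate = contraction_rate(mu=1.0, L=100.0, lam_min_plus=0.1, lam_max=1.0)
iters = math.ceil(math.log(1e9) / -math.log(rate))  # iterations for a 1e-9 accuracy factor
print(rate, iters)  # rate is about 0.99, iters about 2062
```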
D. Proof of Theorem 1
Proof.
Using the (1/μ)-smoothness of F* and the fact that \(\nabla F^*(z^*) = x^*\), we get
\[
\|\nabla F^*(z_g^k) - x^*\|^2 = \|\nabla F^*(z_g^k) - \nabla F^*(z^*)\|^2 \le \frac{1}{\mu^2}\|z_g^k - z^*\|^2.
\]
Using line 4 of Algorithm 2 we get
\[
\|\nabla F^*(z_g^k) - x^*\|^2 \le \frac{\tau}{\mu^2}\|z^k - z^*\|^2 + \frac{1 - \tau}{\mu^2}\|z_f^k - z^*\|^2.
\]
Using the (1/L)-strong convexity of F* we get
\[
\|\nabla F^*(z_g^k) - x^*\|^2 \le \frac{\tau}{\mu^2}\|z^k - z^*\|^2 + \frac{2(1 - \tau)L}{\mu^2}\big(F^*(z_f^k) - F^*(z^*)\big).
\]
Using (27) we get
\begin{align*}
\|\nabla F^*(z_g^k) - x^*\|^2 &\le \frac{2\tau}{\mu^2}\|\hat z^k - z^*\|^2 + \frac{2\tau}{\mu^2}\|m^k\|^2_P + \frac{2(1 - \tau)L}{\mu^2}\big(F^*(z_f^k) - F^*(z^*)\big) \\
&= \frac{2\tau}{\mu^2}\|\hat z^k - z^*\|^2 + \frac{\tau(1 - \tau)L}{\eta(1 - \eta\alpha)\mu^2}\cdot\frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) + \frac{\tau}{3\mu^2}\cdot 6\|m^k\|^2_P \\
&\le \max\left\{\frac{2\tau}{\mu^2}, \frac{\tau(1 - \tau)L}{\eta(1 - \eta\alpha)\mu^2}, \frac{\tau}{3\mu^2}\right\}\left(\|\hat z^k - z^*\|^2 + \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) + 6\|m^k\|^2_P\right) \\
&= \max\left\{\frac{2\tau}{\mu^2}, \frac{\tau(1 - \tau)L}{\eta(1 - \eta\alpha)\mu^2}\right\}\left(\|\hat z^k - z^*\|^2 + \frac{2\eta(1 - \eta\alpha)}{\tau}\big(F^*(z_f^k) - F^*(z^*)\big) + 6\|m^k\|^2_P\right).
\end{align*}
Using the definition of \(\Psi^k\) in (26) and denoting \(C = \Psi^0 \max\left\{\frac{2\tau}{\mu^2}, \frac{\tau(1 - \tau)L}{\eta(1 - \eta\alpha)\mu^2}\right\}\), we get
\[
\|\nabla F^*(z_g^k) - x^*\|^2 \le \frac{C}{\Psi^0}\Psi^k.
\]
One can observe that the conditions of Lemma 4 are satisfied. Hence, inequality (28) holds, which implies
\[
\|\nabla F^*(z_g^k) - x^*\|^2 \le \frac{C}{\Psi^0}\left(1 - \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}\right)\Psi^{k-1} \le \cdots \le \frac{C}{\Psi^0}\left(1 - \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}\right)^k\Psi^0 = C\left(1 - \frac{\lambda^+_{\min}}{\lambda_{\max}}\sqrt{\frac{\mu}{L}}\right)^k,
\]
which concludes the proof.

E. Additional Experiments
E.1. Real data
In this section, we perform experiments for the same problem (19) and network setup as in Section 6, but with LIBSVM datasets (a6a, w6a, ijcnn1) instead of the synthetic ones (see Table 2). In Figure 5, the network structure parameter χ is fixed, and the condition number κ ∈ {10, 100, 1000, 10000} changes. In Figure 6, we fix κ = 100 and perform a comparison for χ ∈ {9, 32, 134, 521}.

dataset    samples    dimension
a6a        11220      122
w6a        17188      300
ijcnn1     49990      22

Table 2.
Details of the datasets
To summarize the obtained results, ADOM outperforms all other methods for every set of parameters. This becomes even more evident on real data. One can also observe that in some cases Acc-DNGD almost does not converge. Apart from that, competing methods (such as APM and Mudag) often diverge during the first iterations, while ADOM consistently demonstrates significant progress during the initial phase. Besides, it is enough to use one iteration (T = 1) of GD to calculate ∇F*(z_g^k) in ADOM to ensure linear convergence.
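Since the dual gradient ∇F*(z) is the minimizer of F(x) − ⟨z, x⟩, it can be approximated by a few gradient descent steps on that subproblem, which is how a T = 1 oracle can be realized. A minimal sketch (function and parameter names are ours, not from the paper):

```python
import numpy as np

def approx_grad_conjugate(grad_F, z, x0, lr, T=1):
    """Approximate grad F*(z) = argmin_x [F(x) - <z, x>] by T gradient
    descent steps on this subproblem, warm-started at x0."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        x -= lr * (grad_F(x) - z)   # gradient of x -> F(x) - <z, x>
    return x

# For F(x) = ||x||^2 / 2 we have grad F*(z) = z, and a single step
# from x0 = 0 with lr = 1 recovers it exactly.
z = np.array([1.0, -2.0, 3.0])
print(approx_grad_conjugate(lambda x: x, z, np.zeros(3), lr=1.0, T=1))  # [ 1. -2.  3.]
```

In practice, warm-starting x0 at the previous iterate is what makes a single inner step sufficient.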
Figure 5.
Comparison of Mudag, Acc-DNGD, APM and
ADOM on LIBSVM datasets (a6a, w6a, ijcnn1 in separate rows) with fixed χ and κ ∈ {10, 100, 1000, 10000} (in different columns).
Figure 6.
Comparison of Mudag, Acc-DNGD, APM and
ADOM on LIBSVM datasets with χ ∈ {9, 32, 134, 521} and κ = 100.
E.2. Real networks
For the next set of experiments, we use a real-world temporal graph dataset, infectious ct1, representing social interactions, from the TUDataset collection (Morris et al., 2020). It consists of graphs G_k on n = 50 nodes. For each k, the matrix W_k is chosen to be the Laplacian of graph G_k divided by its largest eigenvalue.

Our experimental results are presented in Figure 7. We solve the regularized logistic regression problem (19) described in Section 6 for κ ∈ {10, 1000} with the same LIBSVM datasets from Section E.1. Overall, the algorithms perform in a similar fashion as in the case of the synthetic geometric graphs. Notice that for the smaller κ (κ = 10), Mudag outperforms APM after reaching a certain solution accuracy, while for κ = 1000 the situation is the opposite. The superiority of ADOM persists for every dataset and condition number κ. We use one iteration (T = 1) of GD to calculate ∇F*(z_g^k) in ADOM.
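The construction of W_k described above (graph Laplacian normalized by its largest eigenvalue) can be sketched as follows; the helper and the toy adjacency matrix are ours, for illustration only:

```python
import numpy as np

def mixing_matrix(adj):
    """Build W_k from a graph adjacency matrix as described above:
    the graph Laplacian divided by its largest eigenvalue."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj          # graph Laplacian L = D - A
    lam_max = np.linalg.eigvalsh(lap)[-1]
    return lap / lam_max

# Toy example: path graph on 3 nodes, 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
W = mixing_matrix(A)
print(np.linalg.eigvalsh(W))  # eigenvalues lie in [0, 1]; 0 for the all-ones vector
```

The normalization puts the spectrum of every W_k in [0, 1], with the kernel containing the consensus (all-ones) direction.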
Figure 7.
Comparison of Mudag, Acc-DNGD, APM and
ADOM on the temporal graph dataset infectious ct1 with κ ∈ {10, 1000}. Dataset infectious ct1 is available in