Decentralized Distributed Optimization for Saddle Point Problems
Alexander Rogozin, Aleksandr Beznosikov, Darina Dvinskikh, Dmitry Kovalev, Pavel Dvurechensky, Alexander Gasnikov
Alexander Rogozin
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
[email protected]

Alexander Beznosikov
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
[email protected]

Darina Dvinskikh
Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany
[email protected]

Dmitry Kovalev
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
[email protected]

Pavel Dvurechensky
Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany
[email protected]

Alexander Gasnikov
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
[email protected]

February 12, 2021

ABSTRACT
We consider distributed convex-concave saddle point problems over arbitrary connected undirected networks and propose a decentralized distributed algorithm for their solution. The local functions distributed across the nodes are assumed to have global and local groups of variables. For the proposed algorithm we prove non-asymptotic convergence rate estimates with explicit dependence on the network characteristics. To supplement the convergence rate analysis, we propose lower bounds for strongly-convex-strongly-concave and convex-concave saddle-point problems over arbitrary connected undirected networks. We illustrate the considered problem setting by a particular application to distributed calculation of non-regularized Wasserstein barycenters.
In the last few years, interest in saddle-point problems has grown significantly. One reason is applications to GANs [12]. Another reason is the fruitful reformulation of some classes of convex problems as saddle-point problems, which allows one to obtain a solution with better properties or even to improve the wall-clock time [17, 18]. Thus, in many applications we solve the smooth convex-concave saddle-point problem

$$\min_{p,\,\{x_i\}_{i=1}^m} \ \max_{r,\,\{y_i\}_{i=1}^m} \ \sum_{i=1}^m f_i(x_i, p, y_i, r). \quad (1)$$

A particular case of this problem is the Wasserstein barycenter problem in its saddle-point representation [10]. This problem is computationally and memory costly. To overcome these obstacles, decentralized distributed approaches are used [30, 19]. For convex optimization, the theory of decentralized distributed first-order methods is by now well developed: lower bounds on the number of communication steps and oracle calls are known, and algorithms converging according to these lower bounds (up to logarithmic factors) have been developed; see [1, 21, 27, 28, 30, 14]. To the best of our knowledge, the theory still lacks the following:

• saddle-point generalizations (no lower bounds, no algorithms with provable bounds on communication steps and oracle calls per node);
• generalizations that allow mixed arguments (individual, like $x_i, y_i$, and common, like $p, r$);
• generalizations to non-Euclidean proximal setups.

Further motivated by [10], we describe how we close these gaps in the theory in the current paper.

Firstly, to present problem (1) as a problem that can be solved by a decentralized algorithm over a network, some preliminary work is required. The objective in (1) must be rewritten as a separable function with respect to (w.r.t.) the variables $p$ and $r$, which become different for each node of the network. To reach consensus (common agreement between the network nodes) on $p$ and $r$, we introduce affine constraints (separately for the $p$ and $r$ variables), which are then brought into the objective via the Lagrange multipliers principle. Thereby we obtain a composite saddle-point problem (with two bilinear composite terms). Moreover, we bound the dual variables using Slater's arguments.

An optimal method for convex-concave problems with smooth composite terms (i.e., having Lipschitz gradient), introduced above, is the Mirror-Prox algorithm [24, 25] with an arbitrary proximal setup. The algorithm can be performed in a decentralized manner; however, it was not known whether its optimality is preserved. In this paper, we prove that Mirror-Prox remains optimal even in the decentralized case w.r.t. the dependence on the desired accuracy $\varepsilon$, but not w.r.t. other parameters, such as the condition number of the communication network.

* The work of A. Rogozin and A. Beznosikov was supported by RFBR, project number 19-31-51001. The research of A. Gasnikov was supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005.
This is explained by the lack of techniques that split the complexity bounds into the number of communication steps and the number of oracle calls per node for saddle-point problems. Such a splitting was previously done only for (strongly) convex problems, by the sliding technique, which significantly exploits the composite structure of the problem [20, 13, 9]. For strongly monotone variational inequalities, we develop this technique in the Euclidean proximal setup.

Finally, we apply our results to the ubiquitous Wasserstein barycenter problem by generalizing the results of [10] to the decentralized distributed setup. We provide a decentralized distributed algorithm that calculates a Wasserstein barycenter w.r.t. the non-regularized Wasserstein distances. The main advantage of this approach is the possibility of calculating Wasserstein barycenters to any precision, whereas approaches based on entropic regularization of the Wasserstein distances [26] become unstable when a high precision $\varepsilon$ of solving the WB problem is desired (as the regularization parameter is proportional to $\varepsilon$).

Our contributions can be summarized as follows.

• In the Euclidean proximal setup, we obtain lower bounds on the number of communication steps and oracle calls per node for (strongly) convex-concave saddle-point problems.
• We provide a decentralized execution of the Mirror-Prox algorithm and prove its optimality w.r.t. the dependence on the desired accuracy $\varepsilon$ in a general proximal setup.
• We split the complexity bounds into the number of communication steps and the number of oracle calls per node for strongly monotone variational inequalities in the Euclidean proximal setup.
• We show the benefits of presenting some convex problems as convex-concave problems on the example of the Wasserstein barycenter problem, supported by numerical experiments.

Paper Organization.
This paper is organized as follows. First, in Section 2 we describe the decentralized saddle-point problem of interest and its reformulation based on Lagrange multipliers. The algorithm based on Mirror-Prox is presented in Section 3. Section 4 provides lower complexity bounds for saddle-point problems without individual variables, and in Section 5 we show how gradient sliding may be employed to separate the communication and computational complexities. Finally, we describe the application of our theory to the particular example of Wasserstein barycenters in Section 6 and finish the paper with concluding remarks in Section 7.
Notation.
For a prox-function $d(x)$, we define the corresponding Bregman divergence $B(x, y) = d(x) - d(y) - \langle \nabla d(y), x - y\rangle$. We denote by $I_n$ the identity matrix and by $0_{n\times n}$ the zero matrix of size $n \times n$. For a norm $\|\cdot\|$, we define the dual norm $\|\cdot\|_*$ in the usual way: $\|s\|_* = \max_x \{\langle x, s\rangle : \|x\| \le 1\}$. When functions such as $\log$ or $\exp$ are applied to vectors, they are applied element-wise. For two vectors $x, y$ of the same size, $x \odot y$ and $x / y$ stand for the element-wise product and the element-wise division, respectively. We use a bold symbol for the column vector $\mathbf{x} = [x_1^\top, \dots, x_m^\top]^\top \in \mathbb{R}^{mn}$, where $x_1, \dots, x_m \in \mathbb{R}^n$. We then refer to the $i$-th component of the vector $\mathbf{x}$ as $x_i \in \mathbb{R}^n$ and to the $j$-th component of the vector $x_i$ as $[x_i]_j$.
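For instance, the negative-entropy prox-function $d(x) = \langle x, \ln x\rangle$, used later for simplex-constrained variables, yields the (generalized) Kullback-Leibler divergence as its Bregman divergence. A minimal numpy sketch (illustrative only, not part of the paper's algorithms):

```python
import numpy as np

def bregman(d, grad_d, x, y):
    # B(x, y) = d(x) - d(y) - <grad d(y), x - y>
    return d(x) - d(y) - grad_d(y) @ (x - y)

neg_entropy = lambda x: np.sum(x * np.log(x))
grad_neg_entropy = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.5, 0.3, 0.2])
B = bregman(neg_entropy, grad_neg_entropy, x, y)
# For probability vectors this equals the KL divergence sum_i x_i log(x_i / y_i)
assert np.isclose(B, np.sum(x * np.log(x / y)))
assert B > 0  # Bregman divergences of a convex d are nonnegative
```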
We study a saddle-point problem of the form

$$\min_{p \in \bar P,\ \mathbf{x} \in \mathcal{X}} \ \max_{r \in \bar R,\ \mathbf{y} \in \mathcal{Y}} \ \sum_{i=1}^m f_i(x_i, p, y_i, r), \quad (2)$$

where $\mathbf{x} = (x_1^\top \dots x_m^\top)^\top$, $\mathbf{y} = (y_1^\top \dots y_m^\top)^\top$ and $\mathcal{X} = X_1 \times \dots \times X_m$, $\mathcal{Y} = Y_1 \times \dots \times Y_m$. The variables $x_i, p, y_i, r$ have dimensions $d_x, d_p, d_y, d_r$, respectively. For each node, the individual variables $x_i, y_i$ are restricted to the sets $X_i, Y_i$, respectively.

Assumption 2.1.
• The sets $X_i, Y_i$, $i = 1, \dots, m$, and $\bar P$, $\bar R$ are convex compacts.
• Each $f_i(\cdot, \cdot, y_i, r)$ is convex on $X_i \times \bar P$ for every fixed $y_i \in Y_i$, $r \in \bar R$.
• Each $f_i(x_i, p, \cdot, \cdot)$ is concave on $Y_i \times \bar R$ for every fixed $x_i \in X_i$, $p \in \bar P$.

Such problems arise, e.g., in decentralized Wasserstein barycenter computation [11].

We seek to solve problem (2) in the following decentralized distributed setup. Each $f_i$ is stored at a separate computational agent. The agents are connected by a connected undirected network represented by a fixed graph $\mathcal{G} = (V, E)$. A pair of agents $(i, j)$ can communicate iff $(i, j) \in E$. To deal with these communication constraints, we use the standard technique of decentralized distributed optimization [16] in the following way. Each agent $i$ stores local copies $p_i, r_i$ of the global variables $p$ and $r$, and the consensus constraints $p_1 = \dots = p_m$, $r_1 = \dots = r_m$ are imposed. The next step is to obtain a tractable reformulation of (2). In order to do this, we introduce a matrix $\bar W$ associated with the network and satisfying the following

Assumption 2.2.
• $\bar W$ is symmetric positive semi-definite.
• (Network compatibility) $[\bar W]_{ij} = 0$ if $(i, j) \notin E$ and $i \ne j$.
• (Kernel property) For any $v = [v_1, \dots, v_m]^\top \in \mathbb{R}^m$, $\bar W v = 0$ if and only if $v_1 = \dots = v_m$; in other words, $\mathrm{Ker}\,\bar W = \mathrm{span}\{\mathbf{1}\}$.
A typical example of a matrix satisfying this assumption is the Laplacian matrix $\bar W \in \mathbb{R}^{m \times m}$ of the graph $\mathcal{G}$, for which a) $[\bar W]_{ij} = -1$ if $(i, j) \in E$; b) $[\bar W]_{ij} = \deg(i)$ if $i = j$; c) $[\bar W]_{ij} = 0$ otherwise. Here $\deg(i)$ is the degree of node $i$, i.e., the number of its neighbors. The communication matrix is then defined as $W \triangleq \bar W \otimes I_d$, where $\otimes$ denotes the Kronecker product and $d$ is the dimension of the variables on which the affine constraints are imposed. Thanks to Assumption 2.2, we can equivalently reformulate the consensus constraints using two different communication matrices $W_p$ (with $d = d_p$) and $W_r$ (with $d = d_r$), each satisfying the definition above. This gives us the following reformulation of problem (2):

$$\min_{\substack{\mathbf{x} \in \mathcal{X},\ \mathbf{p} \in \mathcal{P}\\ W_p \mathbf{p} = 0}} \ \max_{\substack{\mathbf{y} \in \mathcal{Y},\ \mathbf{r} \in \mathcal{R}\\ W_r \mathbf{r} = 0}} \ \sum_{i=1}^m f_i(x_i, p_i, y_i, r_i). \quad (3)$$

Here we denote $\mathbf{p} = (p_1^\top \dots p_m^\top)^\top$, $\mathbf{r} = (r_1^\top \dots r_m^\top)^\top$ and $\mathcal{P} = \bar P \times \dots \times \bar P$, $\mathcal{R} = \bar R \times \dots \times \bar R$. Since we are in the large-scale regime of distributed optimization, with first-order methods being the methods of choice, we cannot afford projecting onto the linear constraints in problem (3). Thus, we use Lagrange multipliers to lift the affine constraints into the objective. Namely, we add Lagrange multipliers $\mathbf{u} \in \mathbb{R}^{m d_r}$ and $\mathbf{z} \in \mathbb{R}^{m d_p}$ corresponding to the constraints $W_r \mathbf{r} = 0$ and $W_p \mathbf{p} = 0$, respectively. For brevity we denote $F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) = \sum_{i=1}^m f_i(x_i, p_i, y_i, r_i)$.

Theorem 2.3.
Problem (2) can be equivalently rewritten as

$$\min_{\substack{\mathbf{x}\in\mathcal{X},\ \mathbf{p}\in\mathcal{P}\\ \mathbf{u}\in\mathbb{R}^{m d_r}}} \ \max_{\substack{\mathbf{y}\in\mathcal{Y},\ \mathbf{r}\in\mathcal{R}\\ \mathbf{z}\in\mathbb{R}^{m d_p}}} \ \big[\, F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{u}, W_r \mathbf{r}\rangle + \langle \mathbf{z}, W_p \mathbf{p}\rangle \,\big] \quad (4)$$

in the sense that for any saddle point $(\mathbf{x}^*, \mathbf{p}^*, \mathbf{y}^*, \mathbf{r}^*, \mathbf{u}^*, \mathbf{z}^*)$ of (4), the point $(\mathbf{x}^*, \mathbf{p}^*, \mathbf{y}^*, \mathbf{r}^*)$ is a saddle point of (2).

The proof of Theorem 2.3 is based on classical duality theory and a careful application of the Sion-Kakutani theorem to swap the min and max operations. The details are given in Appendix A.
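As an illustration of the communication matrices entering reformulations (3) and (4), the graph Laplacian can be built directly from the edge list and the kernel property of Assumption 2.2 verified numerically. A sketch with numpy (the 4-node cycle graph here is an arbitrary example):

```python
import numpy as np

def graph_laplacian(m, edges):
    # [W]_ij = -1 for (i, j) in E, deg(i) on the diagonal, 0 otherwise
    W = np.zeros((m, m))
    for i, j in edges:
        W[i, j] = W[j, i] = -1.0
        W[i, i] += 1.0
        W[j, j] += 1.0
    return W

# 4-node cycle graph
W_bar = graph_laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
# Kernel property: W_bar v = 0 exactly on consensus vectors v_1 = ... = v_m
assert np.allclose(W_bar @ np.ones(4), 0.0)
assert not np.allclose(W_bar @ np.array([1.0, 0.0, 0.0, 0.0]), 0.0)
# Communication matrix acting on stacked d-dimensional local variables
d = 2
W_comm = np.kron(W_bar, np.eye(d))
p = np.tile(np.array([0.3, 0.7]), 4)   # identical local copies -> consensus
assert np.allclose(W_comm @ p, 0.0)
```

Since the Laplacian has zero entries for non-adjacent pairs, multiplying by `W_comm` only mixes variables of neighboring nodes, which is what makes the affine constraints implementable with local communications.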
In this section we present our main results on the decentralized distributed algorithm for saddle-point problem (2). The pseudo-code is listed as Algorithm 1, and the main convergence rate result is given in Theorem 3.6. The main idea is to use reformulation (4) and apply the Mirror-Prox algorithm [24] for its solution. This requires careful analysis in two respects. First, the Lagrange multipliers $\mathbf{z}, \mathbf{u}$ are not constrained, while the convergence rate result for the classical Mirror-Prox algorithm [24] is proved for problems on compact sets. Second, we need to show that the updates can be organized via only local communications between the nodes of the network.

We start presenting the algorithm by introducing the necessary definitions and notation, chief among which is the proximal setup [3]. Let us fix an arbitrary agent $i$. For each variable $t_i \in \{x_i, p_i, y_i, r_i\}$ we assume that there is a norm $\|t_i\|_{t;i}$, a prox-function $d_{t;i}(t_i)$ that is 1-strongly convex w.r.t. this norm, and the corresponding Bregman divergence $B_{t;i}(t_i, \breve t_i)$. For the variables $u_i, z_i$ we introduce the Euclidean norms $\|u_i\|_{u;i} = \|u_i\|_2$, $\|z_i\|_{z;i} = \|z_i\|_2$, the corresponding prox-functions $d_{u;i}(u_i) = \frac{1}{2}\|u_i\|_2^2$, $d_{z;i}(z_i) = \frac{1}{2}\|z_i\|_2^2$, and the Bregman divergences $B_{u;i}(u_i, \breve u_i) = \frac{1}{2}\|u_i - \breve u_i\|_2^2$, $B_{z;i}(z_i, \breve z_i) = \frac{1}{2}\|z_i - \breve z_i\|_2^2$. These objects allow us to define the following Mirror step:
$$\mathrm{Mirr}(g_i;\, t_i;\, \mathcal{T}_i) = \arg\min_{t \in \mathcal{T}_i} \big[\, \alpha \langle g_i, t\rangle + B_{t;i}(t, t_i) \,\big], \quad (5)$$

where $t_i \in \{x_i, p_i, u_i, y_i, r_i, z_i\}$, $\mathcal{T}_i \in \{X_i, \bar P, \mathbb{R}^{d_r}, Y_i, \bar R, \mathbb{R}^{d_p}\}$, $g_i$ (defining the step direction) is an element of the corresponding dual space, and $\alpha > 0$ is the stepsize.

Our decentralized distributed algorithm for saddle-point problem (4) is listed as Algorithm 1. In each iteration, every agent $i$ makes two Mirror-step updates in each of its six local variables. Besides using the respective gradient of the local objective $f_i$ to define the step direction, some updates aggregate the variables of other agents. Consider, for example, the update of $p_i^{k+1/2}$, in which agent $i$ needs to compute the sum $\sum_j [W_p]_{ij} z_j^k$. Thanks to Assumption 2.2, i.e., the network compatibility of the matrix $W_p$, this update requires aggregating local variables only from neighbouring agents. Thus, each iteration of the algorithm is performed in a decentralized distributed manner. In Algorithm 1, we denote $\nabla_t f_i^\ell = \nabla_t f_i(x_i^\ell, p_i^\ell, y_i^\ell, r_i^\ell)$ for $t \in \{x, p, y, r\}$ and $\ell \in \{k, k+1/2\}$.

In the next two subsections, we introduce two components needed to prove the convergence rate theorem for our algorithm: smoothness assumptions and a localization of the solution of the saddle-point problem (4).

In this subsection, we address one of the challenges of applying standard Mirror-Prox to our problem (4). As noted above, the standard analysis of Mirror-Prox requires the feasible sets to be compact. This is not the case for problem (4), since the variables $\mathbf{u}$ and $\mathbf{z}$ are unconstrained. We overcome this issue by localizing the saddle-point components $\widehat{\mathbf{u}}$ and $\widehat{\mathbf{z}}$ on bounded sets. The corresponding result is formulated in Lemma 3.3. Our analysis builds upon duality theory for convex minimization problems.
The next lemma shows that it suffices to seek a solution of the dual problem on a bounded set. In other words, putting specific constraints on the dual variables still yields an equivalent saddle-point reformulation.

Lemma 3.1.
Let $\Theta \subseteq \mathbb{R}^d$ be a compact convex set and $h(\theta): \Theta \to \mathbb{R}$ a convex differentiable function. Consider a problem with affine constraints

$$\min_{\theta \in \Theta} h(\theta) \quad \text{s.t.} \quad A\theta = b, \quad (6)$$

where $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$. Denote an optimal solution of (6) by $\theta^*$ and introduce the dual function $\varphi(\nu) = \min_{\theta \in \Theta} [h(\theta) + \langle \nu, A\theta - b\rangle]$. Then:

1. [22] The dual problem has a solution $\nu^*$ such that $\|\nu^*\|_2 \le \frac{\|\nabla h(\theta^*)\|_2}{\sigma_{\min}^+(A)} =: R_\nu$, where $\sigma_{\min}^+(A)$ denotes the minimal non-zero singular value of $A$.

2. Let $R > R_\nu$ and consider the constrained dual problem

$$\max_{\|\nu\|_2 \le R} \ \min_{\theta \in \Theta} \big[ h(\theta) + \langle \nu, A\theta - b\rangle \big] = \max_{\|\nu\|_2 \le R} \varphi(\nu). \quad (7)$$
Algorithm 1
Decentralized Mirror-Prox

for $k = 0, 1, \dots$ do
  Node $i$ does:
  Send $z_i^k, u_i^k$ to the neighbors $N(i)$
  Compute
    $x_i^{k+1/2} = \mathrm{Mirr}\big(\nabla_x f_i^k;\ x_i^k;\ X_i\big)$
    $p_i^{k+1/2} = \mathrm{Mirr}\big(\nabla_p f_i^k + \textstyle\sum_j [W_p]_{ij} z_j^k;\ p_i^k;\ \bar P\big)$
    $y_i^{k+1/2} = \mathrm{Mirr}\big(-\nabla_y f_i^k;\ y_i^k;\ Y_i\big)$
    $r_i^{k+1/2} = \mathrm{Mirr}\big(-\nabla_r f_i^k - \textstyle\sum_j [W_r]_{ij} u_j^k;\ r_i^k;\ \bar R\big)$
    $u_i^{k+1/2} = \mathrm{Mirr}\big(\textstyle\sum_j [W_r]_{ij} r_j^k;\ u_i^k;\ \mathbb{R}^{d_r}\big)$
    $z_i^{k+1/2} = \mathrm{Mirr}\big(-\textstyle\sum_j [W_p]_{ij} p_j^k;\ z_i^k;\ \mathbb{R}^{d_p}\big)$
  Send $z_i^{k+1/2}, u_i^{k+1/2}$ to the neighbors $N(i)$
  Compute
    $x_i^{k+1} = \mathrm{Mirr}\big(\nabla_x f_i^{k+1/2};\ x_i^k;\ X_i\big)$
    $p_i^{k+1} = \mathrm{Mirr}\big(\nabla_p f_i^{k+1/2} + \textstyle\sum_j [W_p]_{ij} z_j^{k+1/2};\ p_i^k;\ \bar P\big)$
    $y_i^{k+1} = \mathrm{Mirr}\big(-\nabla_y f_i^{k+1/2};\ y_i^k;\ Y_i\big)$
    $r_i^{k+1} = \mathrm{Mirr}\big(-\nabla_r f_i^{k+1/2} - \textstyle\sum_j [W_r]_{ij} u_j^{k+1/2};\ r_i^k;\ \bar R\big)$
    $u_i^{k+1} = \mathrm{Mirr}\big(\textstyle\sum_j [W_r]_{ij} r_j^{k+1/2};\ u_i^k;\ \mathbb{R}^{d_r}\big)$
    $z_i^{k+1} = \mathrm{Mirr}\big(-\textstyle\sum_j [W_p]_{ij} p_j^{k+1/2};\ z_i^k;\ \mathbb{R}^{d_p}\big)$
  For $t_i \in \{x_i, p_i, u_i, y_i, r_i, z_i\}$ compute $\widehat t_i^{\,k} = \frac{1}{k} \sum_{\ell=0}^{k-1} t_i^{\ell+1/2}$.
end for

If $(\tilde\theta, \tilde\nu)$ is a saddle point of (7), then $\tilde\theta$ is a solution of (6).

Proof details for this lemma are given in Appendix A. Lemma 3.1 allows us to localize the $\widehat{\mathbf{u}}$ and $\widehat{\mathbf{z}}$ components of a saddle point of (4) on Euclidean balls. In order to estimate the radius of each ball, we need the following assumption.

Assumption 3.2.
There exist positive scalars $M_p, M_r$ such that for all feasible $\mathbf{x} \in \mathcal{X}$, $\mathbf{p} \in \mathcal{P}$, $\mathbf{y} \in \mathcal{Y}$, $\mathbf{r} \in \mathcal{R}$ it holds that $\|\nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})\|_2 \le M_p$ and $\|\nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})\|_2 \le M_r$.

Building upon Lemma 3.1, we obtain the following result.
Lemma 3.3.
Let Assumption 3.2 hold and let

$$R_z = \frac{\alpha M_p}{\lambda_{\min}^+(W_p)}, \qquad R_u = \frac{\beta M_r}{\lambda_{\min}^+(W_r)}, \quad (8)$$

where $\alpha, \beta > 1$ are arbitrary constants and $\lambda_{\min}^+(\cdot)$ denotes the minimal non-zero eigenvalue of a matrix. Then there exists a saddle point $(\mathbf{x}^*, \mathbf{p}^*, \mathbf{y}^*, \mathbf{r}^*, \mathbf{u}^*, \mathbf{z}^*)$ of problem (4) such that $\|\mathbf{u}^*\|_2 \le R_u$ and $\|\mathbf{z}^*\|_2 \le R_z$.

The proof of Lemma 3.3 is provided in Appendix A.
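The radii in (8) are computable from the spectrum of the communication matrices. A small sketch (the path-graph Laplacian below is a stand-in example, and the constants `alpha`, `M_p` are placeholder values, not taken from the paper):

```python
import numpy as np

def lambda_min_plus(W):
    # minimal non-zero eigenvalue of a symmetric PSD matrix
    ev = np.linalg.eigvalsh(W)
    return ev[ev > 1e-9].min()

# Laplacian of a 3-node path graph as a toy communication matrix
W_p = np.array([[ 1.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  1.0]])
alpha, M_p = 2.0, 1.0          # placeholder constants for illustration
R_z = alpha * M_p / lambda_min_plus(W_p)
assert np.isclose(lambda_min_plus(W_p), 1.0)  # spectrum of the path-3 Laplacian: {0, 1, 3}
assert R_z > 0
```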
In this subsection we introduce the second important component of the convergence rate analysis, namely the smoothness assumption on the objective $F$. To set the stage, we first introduce a general definition of a Lipschitz-smooth function of two variables.

Definition 3.4.
A continuously differentiable function $G(\mathbf{v}, \mathbf{w})$ is $(L_{\mathbf{vv}}, L_{\mathbf{vw}}, L_{\mathbf{wv}}, L_{\mathbf{ww}})$-smooth if for any $\mathbf{v}, \mathbf{v}' \in \mathcal{V}$ and $\mathbf{w}, \mathbf{w}' \in \mathcal{W}$:

$$\|\nabla_{\mathbf{v}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{v}} G(\mathbf{v}', \mathbf{w})\|_{\mathbf{v},*} \le L_{\mathbf{vv}} \|\mathbf{v} - \mathbf{v}'\|_{\mathbf{v}},$$
$$\|\nabla_{\mathbf{v}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{v}} G(\mathbf{v}, \mathbf{w}')\|_{\mathbf{v},*} \le L_{\mathbf{vw}} \|\mathbf{w} - \mathbf{w}'\|_{\mathbf{w}},$$
$$\|\nabla_{\mathbf{w}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{w}} G(\mathbf{v}', \mathbf{w})\|_{\mathbf{w},*} \le L_{\mathbf{wv}} \|\mathbf{v} - \mathbf{v}'\|_{\mathbf{v}},$$
$$\|\nabla_{\mathbf{w}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{w}} G(\mathbf{v}, \mathbf{w}')\|_{\mathbf{w},*} \le L_{\mathbf{ww}} \|\mathbf{w} - \mathbf{w}'\|_{\mathbf{w}}.$$

To apply this definition in the context of problem (4), we aggregate the local norms for each variable over the agents. Formally, for $\mathbf{t} \in \{\mathbf{x}, \mathbf{p}, \mathbf{u}, \mathbf{y}, \mathbf{r}, \mathbf{z}\}$ we define $\|\mathbf{t}\|_{\mathbf{t}} = \sum_{i=1}^m \|t_i\|_{t;i}$. For the spaces of the vectors $(\mathbf{x}, \mathbf{p})$ and $(\mathbf{y}, \mathbf{r})$ we introduce, respectively, the norms $\|(\mathbf{x}, \mathbf{p})\|_{(\mathbf{x},\mathbf{p})} = \|\mathbf{x}\|_{\mathbf{x}} + \|\mathbf{p}\|_{\mathbf{p}}$ and $\|(\mathbf{y}, \mathbf{r})\|_{(\mathbf{y},\mathbf{r})} = \|\mathbf{y}\|_{\mathbf{y}} + \|\mathbf{r}\|_{\mathbf{r}}$, and the corresponding dual norms. Now, setting $\mathbf{v} = (\mathbf{x}, \mathbf{p})$, $\mathbf{w} = (\mathbf{y}, \mathbf{r})$, we make the following smoothness assumption on $F$.

Assumption 3.5.
The function $F$ in (4) is $(L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})}, L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}, L_{(\mathbf{y},\mathbf{r})(\mathbf{x},\mathbf{p})}, L_{(\mathbf{y},\mathbf{r})(\mathbf{y},\mathbf{r})})$-smooth.

For simplicity of the further derivations, it is convenient to aggregate all the variables into two blocks: minimization variables and maximization variables. To do that, we define $\xi = (\mathbf{x}^\top, \mathbf{p}^\top, \mathbf{u}^\top)^\top$ and $\eta = (\mathbf{y}^\top, \mathbf{r}^\top, \mathbf{z}^\top)^\top$, and $Q_\xi = \mathcal{X} \times \mathcal{P} \times \mathbb{R}^{m d_r}$, $Q_\eta = \mathcal{Y} \times \mathcal{R} \times \mathbb{R}^{m d_p}$. This allows us to write problem (4) in the following simplified form:

$$\min_{\xi \in Q_\xi} \ \max_{\eta \in Q_\eta} \ S(\xi, \eta), \quad (9)$$

where $S(\cdot, \eta)$ is convex for any fixed $\eta$, $S(\xi, \cdot)$ is concave for any fixed $\xi$, and the sets $Q_\xi, Q_\eta$ are closed and convex, but not bounded. Further, we introduce the aggregated norms $\|\xi\|_\xi = \|\mathbf{x}\|_{\mathbf{x}} + \|\mathbf{p}\|_{\mathbf{p}} + \|\mathbf{u}\|_2$ and $\|\eta\|_\eta = \|\mathbf{y}\|_{\mathbf{y}} + \|\mathbf{r}\|_{\mathbf{r}} + \|\mathbf{z}\|_2$. Similarly, we aggregate the prox-functions $d_\xi(\xi) = \sum_{i=1}^m (d_{x;i}(x_i) + d_{p;i}(p_i) + d_{u;i}(u_i))$ and $d_\eta(\eta) = \sum_{i=1}^m (d_{y;i}(y_i) + d_{r;i}(r_i) + d_{z;i}(z_i))$, which induce the corresponding Bregman divergences $B_\xi(\xi, \breve\xi)$, $B_\eta(\eta, \breve\eta)$. Under these definitions, the function $S(\xi, \eta)$ in (9) has the following Lipschitz constants (the detailed derivations are given in Appendix B):

$$L_{\xi\xi} = L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})}, \qquad L_{\eta\eta} = L_{(\mathbf{y},\mathbf{r})(\mathbf{y},\mathbf{r})},$$
$$L_{\xi\eta} = 2\max\big\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})},\ \|W_p\|_{2\to(\mathbf{p},*)},\ \|W_r\|_{\mathbf{r}\to 2} \big\},$$
$$L_{\eta\xi} = 2\max\big\{ L_{(\mathbf{y},\mathbf{r})(\mathbf{x},\mathbf{p})},\ \|W_r\|_{2\to(\mathbf{r},*)},\ \|W_p\|_{\mathbf{p}\to 2} \big\}.$$

Let us now discuss the second main aspect of the analysis, i.e., the unboundedness of the feasible sets $Q_\xi, Q_\eta$, which does not allow us to directly apply the standard analysis of [24].
Thanks to the special structure of problem (4) and our Assumption 3.2, we have at our disposal the bounds $R_z, R_u$ defined in (8) such that the optimal values $\mathbf{z}^*$ and $\mathbf{u}^*$ satisfy $\|\mathbf{z}^*\|_2 \le R_z$, $\|\mathbf{u}^*\|_2 \le R_u$. Thus, if we define $C_\xi = \mathcal{X} \times \mathcal{P} \times \{\mathbf{u}: \|\mathbf{u}\|_2 \le R_u\}$ and $C_\eta = \mathcal{Y} \times \mathcal{R} \times \{\mathbf{z}: \|\mathbf{z}\|_2 \le R_z\}$, then the saddle point $(\xi^*, \eta^*)$ belongs to $C_\xi \times C_\eta$. Let us define $R_\xi = \max_{\xi \in C_\xi} d_\xi(\xi) - \min_{\xi \in C_\xi} d_\xi(\xi)$ and $R_\eta = \max_{\eta \in C_\eta} d_\eta(\eta) - \min_{\eta \in C_\eta} d_\eta(\eta)$. Note that since $C_\xi, C_\eta$ are compact, the values $R_\xi, R_\eta$ are well-defined. Finally, let us define

$$L_\zeta = 2\max\big\{ L_{\xi\xi} R_\xi,\ L_{\eta\eta} R_\eta,\ L_{\xi\eta} R_\xi R_\eta,\ L_{\eta\xi} R_\xi R_\eta \big\}.$$

We are now in a position to state the main result.
Theorem 3.6.
Let Assumptions 2.1, 2.2, 3.2, and 3.5 hold. Then, after $N$ steps of Algorithm 1 with the stepsize $\alpha = 1/L_\zeta$, we have

$$\max_{\eta \in C_\eta} S(\widehat\xi^N, \eta) - \min_{\xi \in C_\xi} S(\xi, \widehat\eta^N) \le \frac{L_\zeta}{N}.$$
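As a sanity check of Definition 3.4, which underlies the smoothness constants entering Theorem 3.6, note that a purely bilinear coupling $G(v, w) = v^\top M w$ in Euclidean norms is $(0, \|M\|_2, \|M\|_2, 0)$-smooth. A quick numerical verification (an illustrative sketch, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
L_vw = np.linalg.norm(M, 2)          # spectral norm of M

grad_v = lambda w: M @ w             # grad_v G(v, w) = M w is independent of v => L_vv = 0
w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
lhs = np.linalg.norm(grad_v(w1) - grad_v(w2))
# Cross-smoothness inequality of Definition 3.4 with constant L_vw = ||M||_2
assert lhs <= L_vw * np.linalg.norm(w1 - w2) + 1e-12
```

The bilinear terms $\langle \mathbf{u}, W_r\mathbf{r}\rangle$ and $\langle \mathbf{z}, W_p\mathbf{p}\rangle$ in (4) are exactly of this type, which is why the operator norms of $W_p$, $W_r$ appear in $L_{\xi\eta}$ and $L_{\eta\xi}$.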
In this section we consider a simplified particular case of problem (2), i.e., a saddle-point problem of the form

$$\min_x \max_y \ f(x, y) = \frac{1}{m}\sum_{i=1}^m f_i(x, y), \quad (10)$$

where $f(x, y)$ is $(L_{xx}, L_{xy}, L_{yx}, L_{yy})$-smooth. Additionally, we assume that $f(x, y)$ is either $\mu_x, \mu_y$-strongly-convex-strongly-concave (i.e., $f$ is strongly convex in $x$ and strongly concave in $y$) or convex-concave (the degenerate case $\mu_x = \mu_y = 0$).

Our lower bounds are valid for a certain class of algorithms (see Assumption D.1 in the Appendix), which we define similarly to [27]. Most decentralized algorithms use the matrix $W$ (see Assumption 2.2) for attaining consensus, and the convergence estimates of these algorithms are tied to the properties of $W$. Our lower bounds involve the condition number $\chi(W) = \lambda_{\max}(W)/\lambda_{\min}^+(W)$ of this matrix. We assume that one local iteration takes time $t$ and one communication round takes time $\tau$. Then the following two theorems provide lower bounds on the total running time $T$.

Theorem 4.1.
Let $L, \mu_x, \mu_y > 0$, $\chi \ge 1$, $\varepsilon > 0$. Additionally, we assume that $L > \sqrt{\mu_x \mu_y}$. Then there exists a distributed saddle-point problem with decentralised architecture and gossip matrix $W$ such that the following statements are true: 1) the gossip matrix $W$ has $\chi(W) = \chi$; 2) $f: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is $(\mu_x, L, L, \mu_y)$-smooth and $\mu_x, \mu_y$-strongly-convex-strongly-concave; 3) the size $n \ge \max\{\log_q(\alpha/\sqrt{2}),\, k\}$, where $k = \lfloor (T - t)/(t + \tau)\rfloor + 2$, $\alpha = \mu_x \mu_y / L^2$, and $q = \frac{1}{2}\big(2 + \alpha - \sqrt{\alpha^2 + 4\alpha}\big) \in (0, 1)$. Then for any procedure satisfying Assumption D.1, one can bound the time to achieve $\varepsilon$-accuracy of the solution in the final global output:

$$T = \Omega\left( \frac{L}{\sqrt{\mu_x \mu_y}}\big(t + \sqrt{\chi}\,\tau\big)\ln(1/\varepsilon) \right).$$

Theorem 4.2.
Let $L > 0$, $\Omega_x, \Omega_y > 0$, $\chi \ge 1$, $\varepsilon > 0$. Additionally, we assume that $L > \varepsilon/(\Omega_x \Omega_y)$. Then there exists a distributed saddle-point problem with decentralised architecture and gossip matrix $W$ such that: 1) the gossip matrix $W$ has $\chi(W) = \chi$; 2) $f: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is $(\varepsilon/\Omega_x^2, L, L, \varepsilon/\Omega_y^2)$-smooth and convex-concave; 3) the size $n \ge \max\{\log_q(\alpha/\sqrt{2}),\, k\}$, where $k = \lfloor (T - t)/(t + \tau)\rfloor + 2$, $\alpha = \varepsilon/(\Omega_x \Omega_y L)$, and $q = \frac{1}{2}\big(2 + \alpha - \sqrt{\alpha^2 + 4\alpha}\big) \in (0, 1)$. Then for any procedure satisfying Assumption D.1, one can bound the time to achieve accuracy $\varepsilon$ of the solution in the final global output:

$$T = \widetilde\Omega\left( \frac{L\,\Omega_x \Omega_y}{\varepsilon}\big(t + \sqrt{\chi}\,\tau\big) \right).$$

The saddle-point problem defined in (9) may be viewed as a particular instance of the more general problem of solving a strong variational inequality (VI). That is, given a closed convex subset $Q \subset \mathbb{R}^d$ and an operator $g: Q \to \mathbb{R}^d$, we need to find a point $\zeta^* \in Q$ such that

$$\langle g(\zeta^*), \zeta - \zeta^* \rangle \ge 0 \quad \text{for all } \zeta \in Q. \quad (11)$$

We are interested in the case in which the operator $g(\zeta)$ is defined as a sum of two operators $A: Q \to \mathbb{R}^d$ and $B: Q \to \mathbb{R}^d$:

$$g(\zeta) = A(\zeta) + B(\zeta). \quad (12)$$

We assume the operators $A$ and $B$ to be monotone and $L_A$- and $L_B$-Lipschitz, respectively. That is, for all $\zeta, \theta \in Q$,

$$\langle A(\zeta) - A(\theta), \zeta - \theta \rangle \ge 0, \qquad \|A(\zeta) - A(\theta)\|_2 \le L_A \|\zeta - \theta\|_2, \quad (13a)$$
$$\langle B(\zeta) - B(\theta), \zeta - \theta \rangle \ge 0, \qquad \|B(\zeta) - B(\theta)\|_2 \le L_B \|\zeta - \theta\|_2. \quad (13b)$$

Moreover, the operator $g$ is assumed to be $\mu$-strongly monotone ($\mu > 0$), i.e., for all $\zeta, \theta \in Q$ it holds that

$$\langle g(\zeta) - g(\theta), \zeta - \theta \rangle \ge \mu \|\zeta - \theta\|_2^2. \quad (14)$$
Algorithm 2 Sliding

Require: Initial guess $\zeta^0 \in Q$, step-size $\eta > 0$
for $k = 0, 1, 2, \dots$ do
  $\nu^k = \zeta^k - \eta A(\zeta^k)$
  Find $\theta^k \in Q$ such that $\theta^k \approx \widehat\theta^k$, where $\widehat\theta^k \in Q$ is a solution to the variational inequality
    $\langle \eta B(\widehat\theta^k) + \widehat\theta^k - \nu^k,\ \zeta - \widehat\theta^k \rangle \ge 0$ for all $\zeta \in Q$.  (15)
  $\omega^k = \theta^k + \eta\,\big(A(\zeta^k) - A(\theta^k)\big)$
  $\zeta^{k+1} = \mathrm{Proj}_Q(\omega^k)$
end for

Problem (15) is solved with the Forward-Backward-Forward algorithm [29], which is run for $T$ iterations starting from the point $\zeta^k$ until accuracy $\delta > 0$ is reached. The choices of $\delta$ and $T$ are discussed in Appendix C.

Theorem 5.1.
Without loss of generality, assume $L_A \le L_B$. The total number of computations of $A(\zeta)$ in Algorithm 2 is

$$N_A = N = O\left( \frac{L_A}{\mu}\,\log \frac{\|\zeta^0 - \zeta^*\|_2^2}{\epsilon} \right).$$

The total number of computations of $B(\zeta)$ is

$$N_B = N \times T = O\left( \frac{L_B}{\mu}\,\log\frac{1}{\delta}\,\log \frac{\|\zeta^0 - \zeta^*\|_2^2}{\epsilon} \right).$$

In Appendix C, we show how Algorithm 2 allows one to split the oracle and communication complexities for a decentralized solution of saddle-point problem (4).
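A minimal sketch of the sliding scheme in Euclidean space ($Q = \mathbb{R}^d$, so the projection is the identity). For simplicity, the inner Forward-Backward-Forward solver of (15) is replaced here by a plain fixed-point iteration, which is valid when $\eta L_B < 1$; the operators and stepsizes below are illustrative assumptions, not the paper's test problem:

```python
import numpy as np

def solve_inner(B, nu, eta, iters=50):
    # Approximately solve theta + eta*B(theta) = nu (VI (15) with Q = R^d)
    # by fixed-point iteration; converges when eta * L_B < 1.
    theta = nu.copy()
    for _ in range(iters):
        theta = nu - eta * B(theta)
    return theta

def sliding(A, B, zeta0, eta, outer=60):
    zeta = zeta0.copy()
    for _ in range(outer):
        nu = zeta - eta * A(zeta)                    # forward step on A
        theta = solve_inner(B, nu, eta)              # inexact inner VI solve
        zeta = theta + eta * (A(zeta) - A(theta))    # correction step (Proj_Q = identity)
    return zeta

# Toy strongly monotone g = A + B: A skew-symmetric (monotone), B = identity (1-strongly monotone)
Askew = np.array([[0.0, 1.0], [-1.0, 0.0]])
A = lambda z: Askew @ z
B = lambda z: z
zeta = sliding(A, B, np.array([1.0, 1.0]), eta=0.5)
assert np.linalg.norm(zeta) < 1e-8   # the unique solution of this VI is zeta* = 0
```

Note how the loop matches the theorem's accounting: $A$ is evaluated twice per outer iteration, while $B$ is evaluated only inside the inner solver, which is what allows the two oracle complexities to be split.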
Our results have various applications, including the ubiquitous Wasserstein barycenter (WB) problem, which can be reformulated as a saddle-point problem of the form (2) and solved over a network of agents. Wasserstein distances and Wasserstein barycenters form a successful framework for operating with probability distributions, and with objects that can be represented as probability distributions (e.g., images, videos, texts), in various fields such as machine learning, statistics, economics, and finance. However, heavy computations are behind this superiority, and to reduce the computational complexity, entropic regularization [7] was proposed. Regularization-based methods, e.g., the celebrated Iterative Bregman Projections (IBP) algorithm [4], are however numerically unstable when one needs to solve the WB problem to a high precision $\varepsilon$ (as the regularization parameter is chosen proportionally to $\varepsilon$). For this case, the Mirror-Prox algorithm can be used, which finds a WB with respect to the original (non-regularized) Wasserstein distances. Using distributed computations decreases the problem complexity by sharing it between computational nodes.

We start with the preliminaries for the WB problem and reformulate it as a saddle-point problem. For given histograms $q_1, q_2, \dots, q_m$ from the probability simplex $\Delta_n$, a WB of these measures is a solution of the following finite-dimensional optimization problem:

$$p^* = \arg\min_{p \in \Delta_n} \frac{1}{m}\sum_{i=1}^m W(p, q_i), \quad (16)$$

where $W(p, q) = \min_{X \in U(p,q)} \langle C, X\rangle$ is the optimal transport problem between two histograms $p, q \in \Delta_n$, $C \in \mathbb{R}_+^{n\times n}$ is a given ground cost matrix, and $X$ is a transport plan belonging to the transportation polytope $U(p, q) = \{X \in \mathbb{R}_+^{n\times n}: X\mathbf{1} = p,\ X^\top\mathbf{1} = q\}$. As we consider a general cost matrix $C$, this optimal transport problem is more general than the problem defining the Wasserstein distance.
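The marginal constraints $X\mathbf{1} = p$, $X^\top\mathbf{1} = q$ defining $U(p, q)$ are linear in the vectorized plan, and can be encoded by a 0/1 incidence matrix; this encoding is exactly what the saddle-point reformulation of the WB problem relies on. A sketch assuming row-major vectorization:

```python
import numpy as np

def incidence(n):
    # A in {0,1}^(2n x n^2) with A @ vec(X) = [X @ 1; X.T @ 1] (row-major vec)
    A = np.zeros((2 * n, n * n))
    for i in range(n):
        for j in range(n):
            col = i * n + j
            A[i, col] = 1.0        # row-sum block: X @ 1  (should equal p)
            A[n + j, col] = 1.0    # column-sum block: X.T @ 1  (should equal q)
    return A

n = 3
X = np.arange(1.0, 10.0).reshape(n, n)   # arbitrary plan for illustration
A = incidence(n)
marginals = A @ X.reshape(-1)
assert np.allclose(marginals[:n], X.sum(axis=1))   # row sums
assert np.allclose(marginals[n:], X.sum(axis=0))   # column sums
```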
Following the paper [10], which develops the idea of [17], we reformulate the WB problem (16) as a saddle-point problem by introducing the stacked column vectors $b_i = (p^\top, q_i^\top)^\top$, the vectorized cost matrix $d$ of $C$, the vectorized transport plans $x_i \in \Delta_{n^2}$ of $X$, the incidence matrix $A \in \{0, 1\}^{2n \times n^2}$, and vectors $y_i \in [-1, 1]^{2n}$ ($i = 1, \dots, m$). Then (16) can be equivalently rewritten as

$$\min_{p \in \Delta_n} \frac{1}{m}\sum_{i=1}^m \ \min_{x_i \in \Delta_{n^2}} \ \max_{y_i \in [-1,1]^{2n}} \left\{ d^\top x_i + 2\|d\|_\infty \big( y_i^\top A x_i - b_i^\top y_i \big) \right\}. \quad (17)$$

For the distributed computation of Wasserstein barycenters, we introduce the stacked column vectors $\mathbf{p} = (p_1^\top, \dots, p_m^\top)^\top \in \mathcal{P} \triangleq \prod_{i=1}^m \Delta_n$, $\mathbf{x} = (x_1^\top, \dots, x_m^\top)^\top \in \mathcal{X} \triangleq \prod_{i=1}^m \Delta_{n^2}$ (where $\prod$ denotes the Cartesian product of simplices), and $\mathbf{y} = (y_1^\top, \dots, y_m^\top)^\top \in \mathcal{Y} \triangleq [-1, 1]^{2mn}$. Then we equivalently rewrite (17) as

$$\min_{\substack{\mathbf{x} \in \mathcal{X},\ \mathbf{p} \in \mathcal{P}\\ p_1 = \dots = p_m}} \ \max_{\mathbf{y} \in \mathcal{Y}} \ f(\mathbf{x}, \mathbf{p}, \mathbf{y}) \triangleq \frac{1}{m}\left\{ \mathbf{d}^\top \mathbf{x} + 2\|d\|_\infty \big( \mathbf{y}^\top \mathbf{A} \mathbf{x} - \mathbf{b}^\top \mathbf{y} \big) \right\}, \quad (18)$$

where $\mathbf{b} = (p_1^\top, q_1^\top, \dots, p_m^\top, q_m^\top)^\top$, $\mathbf{d} = (d^\top, \dots, d^\top)^\top$, and $\mathbf{A} = \mathrm{diag}\{A, \dots, A\} \in \{0, 1\}^{2mn \times mn^2}$ is a block-diagonal matrix. To enable distributed computation of this problem, the constraint $p_1 = \dots = p_m$ is replaced by $W\mathbf{p} = 0$. Finally, we introduce the Lagrangian dual variable $\mathbf{z} = (z_1^\top, \dots, z_m^\top)^\top \in \mathcal{Z} \triangleq \mathbb{R}^{nm}$, scaled by $1/m$, for the constraint $W\mathbf{p} = 0$ in problem (18) and rewrite it as follows:

$$\min_{\mathbf{x} \in \mathcal{X},\ \mathbf{p} \in \mathcal{P}} \ \max_{\mathbf{y} \in \mathcal{Y},\ \mathbf{z} \in \mathbb{R}^{nm}} \ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z}) \triangleq \frac{1}{m}\left\{ \mathbf{d}^\top \mathbf{x} + 2\|d\|_\infty \big( \mathbf{y}^\top \mathbf{A} \mathbf{x} - \mathbf{b}^\top \mathbf{y} \big) + \langle \mathbf{z}, W\mathbf{p}\rangle \right\}.$$
(19)

The gradient operator for $F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z})$ is defined by

$$\begin{pmatrix} \nabla_{\mathbf{x}} F \\ \nabla_{\mathbf{p}} F \\ -\nabla_{\mathbf{y}} F \\ -\nabla_{\mathbf{z}} F \end{pmatrix} = \frac{1}{m}\begin{pmatrix} \mathbf{d} + 2\|d\|_\infty \mathbf{A}^\top \mathbf{y} \\ W^\top \mathbf{z} - 2\|d\|_\infty \{[y_i]_{1\dots n}\}_{i=1}^m \\ -2\|d\|_\infty (\mathbf{A}\mathbf{x} - \mathbf{b}) \\ -W\mathbf{p} \end{pmatrix}.$$

Here $[y_i]_{1\dots n}$ denotes the first $n$ components of the vector $y_i \in [-1, 1]^{2n}$, and $\{[y_i]_{1\dots n}\}_{i=1}^m$ is a short form of $([y_1]_{1\dots n}^\top, [y_2]_{1\dots n}^\top, \dots, [y_m]_{1\dots n}^\top)^\top$.

Now we are in a position to apply Algorithm 1. We start with the preliminaries and the setup for the Mirror-Prox algorithm. For the space $\mathcal{U} \triangleq \mathcal{X} \times \mathcal{P} \triangleq \prod_{i=1}^m \Delta_{n^2} \times \prod_{i=1}^m \Delta_n$, we choose the norm $\|\mathbf{u}\|_{\mathcal{U}} = \sqrt{\sum_{i=1}^m \|x_i\|_1^2 + \sum_{i=1}^m \|p_i\|_1^2}$, where $\|\cdot\|_1$ is the $\ell_1$-norm, the prox-function $d_{\mathcal{U}}(\mathbf{u}) = \sum_{i=1}^m \langle x_i, \ln x_i\rangle + \sum_{i=1}^m \langle p_i, \ln p_i\rangle$, and the corresponding Bregman divergence $B_{\mathcal{U}}(\mathbf{u}, \breve{\mathbf{u}}) = \sum_{i=1}^m \langle x_i, \ln(x_i/\breve x_i)\rangle - \sum_{i=1}^m \mathbf{1}^\top (x_i - \breve x_i) + \sum_{i=1}^m \langle p_i, \ln(p_i/\breve p_i)\rangle - \sum_{i=1}^m \mathbf{1}^\top (p_i - \breve p_i)$. We endow the space $\mathcal{V} \triangleq \mathcal{Y} \times \mathcal{Z} \triangleq [-1, 1]^{2nm} \times \mathbb{R}^{nm}$ with the standard Euclidean setup: the Euclidean norm $\|\cdot\|_2$, the prox-function $d_{\mathcal{V}}(\mathbf{v}) = \frac{1}{2}\|\mathbf{v}\|_2^2$, and the corresponding Bregman divergence $B_{\mathcal{V}}(\mathbf{v}, \breve{\mathbf{v}}) = \frac{1}{2}\|\mathbf{v} - \breve{\mathbf{v}}\|_2^2$. For the space $\mathcal{U} \times \mathcal{V}$, the prox-function is defined as $a\, d_{\mathcal{U}}(\mathbf{u}) + b\, d_{\mathcal{V}}(\mathbf{v})$, with the corresponding Bregman divergence $a B_{\mathcal{U}}(\mathbf{u}, \breve{\mathbf{u}}) + b B_{\mathcal{V}}(\mathbf{v}, \breve{\mathbf{v}})$, where $a = 1/R_{\mathcal{U}}^2$, $b = 1/R_{\mathcal{V}}^2$, and $R_{\mathcal{U}}^2 = \max_{\mathbf{u} \in \mathcal{U}} d_{\mathcal{U}}(\mathbf{u}) - \min_{\mathbf{u} \in \mathcal{U}} d_{\mathcal{U}}(\mathbf{u})$, $R_{\mathcal{V}}^2 = \max_{\mathbf{v} \in \mathcal{V} \cap B_R(0)} d_{\mathcal{V}}(\mathbf{v}) - \min_{\mathbf{v} \in \mathcal{V} \cap B_R(0)} d_{\mathcal{V}}(\mathbf{v})$. Here $B_R(0)$ is the ball of radius $R$ centered at $0$.

Lemma 6.1.
Objective $F(\mathbf{u}, \mathbf{v})$ in (19) is $(L_{\mathbf{u}\mathbf{u}}, L_{\mathbf{u}\mathbf{v}}, L_{\mathbf{v}\mathbf{u}}, L_{\mathbf{v}\mathbf{v}})$-smooth with $L_{\mathbf{u}\mathbf{u}} = L_{\mathbf{v}\mathbf{v}} = 0$ and $L_{\mathbf{u}\mathbf{v}} = L_{\mathbf{v}\mathbf{u}} = \sqrt{8\|d\|_\infty^2 + \lambda_{\max}^2(W)}\,/\,m$. This lemma (the proof is in Appendix E.2) allows us to obtain the following convergence result.
Theorem 6.2.
Let $\|\mathbf{z}\|_2 \leq R$. Then $R_{\mathbf{u}} = \sqrt{3m\ln n}$ and $R_{\mathbf{v}} = \sqrt{mn + R^2/2}$ with
\[
R = \frac{\left\| \nabla_{\mathbf{p}} f(\mathbf{x}, \mathbf{p}^*, \mathbf{y}) \right\|_2}{\lambda_{\min}^+ \left( \tfrac{1}{m} W \right)} \leq \frac{2\sqrt{mn}\, \|d\|_\infty}{\lambda_{\min}^+(W)},
\]
where $\lambda_{\min}^+(W)$ is the minimal positive eigenvalue of $W$. Then after $N = 4 L_{\mathbf{u}\mathbf{v}} R_{\mathbf{u}} R_{\mathbf{v}} / \varepsilon$ iterations, Algorithm 4 from Appendix E with $\eta = (L_{\mathbf{u}\mathbf{v}} R_{\mathbf{u}} R_{\mathbf{v}})^{-1}$ outputs a pair $(\widetilde{\mathbf{u}}, \widetilde{\mathbf{v}})$ such that
\[
\max_{\mathbf{y} \in \mathcal{Y},\, \|\mathbf{z}\|_2 \leq R} F(\widetilde{\mathbf{u}}, \mathbf{y}, \mathbf{z}) \; - \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} F(\mathbf{x}, \mathbf{p}, \widetilde{\mathbf{v}}) \leq \varepsilon.
\]
The total complexity of Algorithm 4 per node is $O\!\left( n^2 \sqrt{n \ln n} / \varepsilon \right)$. The proof of Theorem 6.2 is provided in Appendix E.1.

For the star network, we can compare the complexity of Mirror-Prox with that of the IBP, which runs in $\widetilde{O}(n^2/\varepsilon^2)$ time per node [19]. Distributed Mirror-Prox has a better dependence on $\varepsilon$, namely $1/\varepsilon$, as does the accelerated IBP with $\widetilde{O}(n^2\sqrt{n}/\varepsilon)$ complexity per node [15]. However, like any regularization-based method, the (accelerated) IBP has a strong limitation: numerical instability when the regularization parameter is small. This is precisely the regime of interest when high precision is desired, as the regularization parameter must be chosen proportionally to $\varepsilon$.

Next, we illustrate the work of Algorithm 4 from Appendix E for different network architectures on Gaussian distributions and on the notMNIST dataset. We randomly generated 10 Gaussian measures with equally spaced support of 100 points and with randomly chosen means and variances. We studied the convergence of the calculated barycenters to the theoretical true barycenter [8] on different network architectures: the cycle graph, the complete graph, the star graph, and the Erdős–Rényi random graph with a fixed probability of edge creation. Figure 1 shows the convergence of Algorithm 4 (on a logarithmic scale). The slope ratio of $-1$ fits the theoretical dependence of the desired accuracy $\varepsilon$ on the number of iterations ($N \sim \varepsilon^{-1}$, Theorem 6.2).

(a) Convergence in the function value; here $p^*$ is the true barycenter.
(b) Convergence to consensus ($W\mathbf{p} = 0$) over the network.

Figure 1: Convergence of the Distributed Mirror-Prox algorithm, with the first 1000 iterations skipped.

Figure 2 illustrates that Algorithm 4 gives a better approximation of the true Wasserstein barycenter of Gaussian measures on different networks in comparison with the IBP. The regularization parameter for the IBP (from the POT Python library) is taken as small as possible while the IBP still works. The slow convergence is due to the lack of theory for convergence in the argument. Next, we illustrate the instability of the IBP algorithm for small values of the regularization parameter $\gamma$ when a high precision $\varepsilon$ of calculating a WB is desired. Figure 3 presents the Wasserstein barycenter of the letter 'B' in a variety of fonts from the notMNIST dataset.

Figure 2: Convergence of Distributed Mirror-Prox on different networks and of the IBP to the true barycenter of Gaussian measures (convergence in the argument, shown at two iteration counts).
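The Mirror-Prox iteration behind these convergence plots alternates an extrapolation step and a main step, each a Bregman proximal step; with the entropy prox-function on the simplex, both steps become multiplicative updates. Below is a minimal single-machine sketch (our own illustration, not the distributed Algorithm 4) on a small matrix game $\min_{x \in \Delta} \max_{y \in \Delta} x^\top C y$:

```python
import numpy as np

def mirror_prox_game(C, steps=2000):
    """Mirror-Prox with entropic prox on the simplex for the matrix
    game min_x max_y x^T C y; returns the averaged half-steps."""
    n, k = C.shape
    x, y = np.full(n, 1.0 / n), np.full(k, 1.0 / k)
    alpha = 1.0 / np.abs(C).max()          # ~ 1/L for the bilinear coupling
    xs, ys = np.zeros(n), np.zeros(k)
    for _ in range(steps):
        # extrapolation step: prox step from (x, y) using the gradient at (x, y)
        xh = x * np.exp(-alpha * (C @ y));   xh /= xh.sum()
        yh = y * np.exp(alpha * (C.T @ x));  yh /= yh.sum()
        # main step: prox step from (x, y) using the gradient at (xh, yh)
        x = x * np.exp(-alpha * (C @ yh));   x /= x.sum()
        y = y * np.exp(alpha * (C.T @ xh));  y /= y.sum()
        xs += xh; ys += yh
    return xs / steps, ys / steps

# rock-paper-scissors: the unique equilibrium is uniform, value 0
C = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])
x, y = mirror_prox_game(C)
gap = (C.T @ x).max() - (C @ y).min()  # duality gap of the averaged iterates
```

The averaged half-steps attain a duality gap of order $1/N$, the same $N \sim \varepsilon^{-1}$ rate observed in Figure 1.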
Figure 3: Wasserstein barycenters of the letter 'B' computed by Distributed Mirror-Prox and by the IBP with different values of the regularization parameter.
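The instability visible in Figure 3 is easy to reproduce: IBP works with the kernel $K = \exp(-C/\gamma)$, which underflows for small $\gamma$, leaving only the diagonal. Below is a minimal NumPy sketch of IBP for the barycenter (our own implementation of the scheme from [4], not the POT code used in the experiments; all parameter values are illustrative):

```python
import numpy as np

def ibp_barycenter(Q, M, gamma, iters=1000):
    """Iterative Bregman projections for the entropy-regularized
    barycenter of the columns of Q (probability vectors) w.r.t. cost M."""
    n, m = Q.shape
    K = np.exp(-M / gamma)
    V = np.ones((n, m))
    for _ in range(iters):
        U = Q / (K @ V)
        b = np.exp(np.mean(np.log(K.T @ U), axis=1))  # geometric mean
        V = b[:, None] / (K.T @ U)
    return b

n = 20
t = np.linspace(0.0, 1.0, n)
M = (t[:, None] - t[None, :]) ** 2                    # squared-distance cost
q1 = np.exp(-((t - 0.25) ** 2) / 0.01); q1 /= q1.sum()
q2 = np.exp(-((t - 0.75) ** 2) / 0.01); q2 /= q2.sum()
Q = np.stack([q1, q2], axis=1)

b = ibp_barycenter(Q, M, gamma=0.05)  # moderate gamma: a valid barycenter
# tiny gamma: the kernel underflows to a diagonal matrix, so the
# Sinkhorn-type updates lose all coupling between grid points
K_tiny = np.exp(-M / 1e-6)
```

At moderate `gamma` the result is a finite probability vector; at `gamma = 1e-6` every off-diagonal kernel entry underflows to exactly zero, which is the numerical failure mode discussed above.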
In this paper, we develop a theory for decentralized distributed smooth convex-concave saddle-point problems. This theory includes lower bounds on the number of communication rounds and oracle calls. We also generalize the Mirror-Prox algorithm to the decentralized distributed setup, with the dependence of its working time on the desired accuracy matching the lower bound. Moreover, we improve this algorithm by using a sliding technique to separate the communication and oracle complexities. Finally, for some classes of convex problems we show the benefits of reformulating them as convex-concave problems, using the example of the Wasserstein barycenter problem, in a number of numerical experiments.
References

[1] Y. Arjevani and O. Shamir. Communication complexity of distributed convex learning and optimization. arXiv preprint arXiv:1506.01900, 2015.
[2] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. 2011.
[3] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization (Lecture Notes). Personal web-page of A. Nemirovski, 2020.
[4] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
[5] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
[6] S. Bubeck. Theory of convex optimization for machine learning. arXiv preprint arXiv:1405.4980, 2014.
[7] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
[8] J. Delon and A. Desolneux. A Wasserstein-type distance in the space of Gaussian mixture models. SIAM Journal on Imaging Sciences, 13(2):936–970, 2020.
[9] D. Dvinskikh and A. Gasnikov. Decentralized and parallel primal and dual accelerated methods for stochastic convex programming problems. Journal of Inverse and Ill-posed Problems, 2021.
[10] D. Dvinskikh and D. Tiapkin. Improved complexity bounds in the Wasserstein barycenter problem. arXiv preprint arXiv:2010.04677, 2020.
[11] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. Advances in Neural Information Processing Systems, 31:10760–10770, 2018.
[12] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
[13] E. Gorbunov, D. Dvinskikh, and A. Gasnikov. Optimal decentralized distributed algorithms for stochastic convex optimization. arXiv preprint arXiv:1911.07363, 2019.
[14] E. Gorbunov, A. Rogozin, A. Beznosikov, D. Dvinskikh, and A. Gasnikov. Recent theoretical advances in decentralized distributed convex optimization. arXiv preprint arXiv:2011.13259, 2020.
[15] S. V. Guminov, Y. E. Nesterov, P. E. Dvurechensky, and A. V. Gasnikov. Accelerated primal-dual gradient descent with linesearch for convex, nonconvex, and nonsmooth optimization problems. Doklady Mathematics, 99(2):125–128, 2019.
[16] D. Jakovetić, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.
[17] A. Jambulapati, A. Sidford, and K. Tian. A direct Õ(1/ε) iteration parallel algorithm for optimal transport. Advances in Neural Information Processing Systems, 32:11359–11370, 2019.
[18] Y. Jin and A. Sidford. Efficiently solving MDPs with stochastic mirror descent. In International Conference on Machine Learning, pages 4890–4900. PMLR, 2020.
[19] A. Kroshnin, N. Tupitsa, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and C. Uribe. On the complexity of approximating Wasserstein barycenters. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3530–3540. PMLR, 2019. arXiv:1901.08686.
[20] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
[21] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.
[22] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, 180(1):237–284, 2020.
[23] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[24] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
[25] Y. Ouyang and Y. Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, pages 1–35, 2019.
[26] G. Peyré and M. Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.
[27] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3027–3036. PMLR, 2017.
[28] K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems 31, pages 2740–2749. The MIT Press, 2018.
[29] P. Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
[30] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić. A dual approach for optimal algorithms in distributed optimization over networks. Optimization Methods and Software, pages 1–40, 2020.
[31] J. Zhang, M. Hong, and S. Zhang. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.
A Missing Proofs from Section 3
A.1 Proof of Lemma 3.1
Proof.
For part 1, see [22]. Introduce the function $\psi(\theta) = \max_{\|\nu\|_2 \leq R} \left[ h(\theta) + \langle \nu, A\theta - b \rangle \right]$. Since $\Theta$ is a compact set, it holds that $\max_{\|\nu\|_2 \leq R} \varphi(\nu) = \min_{\theta \in \Theta} \psi(\theta)$ by the Sion–Kakutani theorem (see e.g. Theorem D.4.2 in [2]). Moreover, since $\nu^* \in \operatorname{Arg\,max}_{\nu \in \mathbb{R}^m} \varphi(\nu)$ and $\nu^* \in B_R(0)$, we have $\nu^* \in \operatorname{Arg\,max}_{\|\nu\|_2 \leq R} \varphi(\nu)$ and $\max_{\nu \in \mathbb{R}^m} \varphi(\nu) = \max_{\|\nu\|_2 \leq R} \varphi(\nu)$. Also note that $h(\widehat{\theta}) = \min_{\theta \in \Theta} \psi(\theta)$ by Theorem D.4.1 in [2]. Combining the three facts
\[
\widehat{\theta} \in \operatorname{Arg\,min}_{\theta \in \Theta} \psi(\theta), \qquad \nu^* \in \operatorname{Arg\,max}_{\|\nu\|_2 \leq R} \varphi(\nu), \qquad \min_{\theta \in \Theta} \psi(\theta) = \max_{\|\nu\|_2 \leq R} \varphi(\nu),
\]
we obtain that $(\widehat{\theta}, \nu^*)$ is a saddle point of (7) by Theorem D.4.1 of [2]. Therefore, for $\nu \in B_R(0)$ we have
\[
h(\widehat{\theta}) + \langle \nu^*, A\widehat{\theta} - b \rangle \geq h(\widehat{\theta}) + \langle \nu, A\widehat{\theta} - b \rangle, \quad \text{i.e.} \quad \langle \nu^* - \nu, A\widehat{\theta} - b \rangle \geq 0.
\]
Taking into account that $\|\nu^*\|_2 \leq R_\nu < R$, we conclude that $\nu^* \in \operatorname{int} B_R(0)$ and therefore $A\widehat{\theta} - b = 0$. Finally, by the definition of a saddle point,
\[
h(\widehat{\theta}) + \langle \nu^*, A\widehat{\theta} - b \rangle \leq h(\theta) + \langle \nu^*, A\theta - b \rangle \quad \forall \theta \in \Theta,
\]
hence $h(\widehat{\theta}) \leq h(\theta)$ for all $\theta \in \Theta$ with $A\theta = b$, which concludes the proof.

A.2 Proof of Theorem 2.3
We prove a lemma that generalizes Theorem 2.3.
Lemma A.1.
Introduce $\bar{R}_{\mathbf{z}} = \frac{\bar{\alpha} M_{\mathbf{p}}}{\lambda_{\min}^+(W_{\mathbf{p}})}$ and $\bar{R}_{\mathbf{u}} = \frac{\bar{\beta} M_{\mathbf{r}}}{\lambda_{\min}^+(W_{\mathbf{r}})}$, where $\bar{\alpha}, \bar{\beta} \in (1, +\infty) \cup \{+\infty\}$. Problem (3) is equivalent to
\[
\min_{\substack{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P} \\ \|\mathbf{u}\|_2 \leqslant \bar{R}_{\mathbf{u}}}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ \|\mathbf{z}\|_2 \leqslant \bar{R}_{\mathbf{z}}}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \qquad (20)
\]
in the sense that any solution of (20) is a solution of (3).

Proof. We are free to use the Sion–Kakutani theorem since the sets $\mathcal{X}, \mathcal{P}, \mathcal{Y}, \mathcal{R}$ are compact:
\[
\min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}}} \; \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})
= \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})
= \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\mathbf{x} \in \mathcal{X}} \; \min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{p} \in \mathcal{P}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}). \qquad (21)
\]
For any fixed triple $(\mathbf{x}, \mathbf{y}, \mathbf{r}) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{R}$, the function $F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})$ is convex in $\mathbf{p}$. Consider the problem
\[
\min_{\mathbf{p} \in \mathcal{P}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \quad \text{s.t.} \quad W_{\mathbf{p}} \mathbf{p} = 0 \qquad (22)
\]
and denote $\mathbf{p}^*(\mathbf{x}, \mathbf{y}, \mathbf{r}) = \arg\min_{W_{\mathbf{p}}\mathbf{p} = 0,\, \mathbf{p} \in \mathcal{P}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})$. By Lemma 3.1, part 1, there exists a solution of the dual to (22) with norm bounded by
\[
\frac{\left\| \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}^*(\mathbf{x}, \mathbf{y}, \mathbf{r}), \mathbf{y}, \mathbf{r}) \right\|_2}{\lambda_{\min}^+(W_{\mathbf{p}})} < \bar{R}_{\mathbf{z}}. \qquad (23)
\]
By Lemma 3.1 (part 2), problem (22) is equivalent to
\[
\max_{\|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}} \; \min_{\mathbf{p} \in \mathcal{P}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \qquad (24)
\]
in the sense that for any saddle point $(\widehat{\mathbf{p}}(\mathbf{x}, \mathbf{y}, \mathbf{r}), \widehat{\mathbf{z}}(\mathbf{x}, \mathbf{y}, \mathbf{r}))$ of (24), the point $\widehat{\mathbf{p}}(\mathbf{x}, \mathbf{y}, \mathbf{r})$ is a solution of (22). Returning to (21), we get
\[
\begin{aligned}
\max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\mathbf{x} \in \mathcal{X}} \; \min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{p} \in \mathcal{P}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})
&= \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\mathbf{x} \in \mathcal{X}} \left[ \max_{\|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}} \; \min_{\mathbf{p} \in \mathcal{P}} \left( F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right) \right] \\
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ W_{\mathbf{r}}\mathbf{r} = 0,\; \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \\
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \max_{\substack{\mathbf{r} \in \mathcal{R} \\ W_{\mathbf{r}}\mathbf{r} = 0}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right].
\end{aligned} \qquad (25)
\]
Now we introduce Lagrange multipliers for the constraint $W_{\mathbf{r}} \mathbf{r} = 0$ as well. Analogously to the case of the constraint $W_{\mathbf{p}} \mathbf{p} = 0$, consider the problem
\[
\max_{\mathbf{r} \in \mathcal{R}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \quad \text{s.t.} \quad W_{\mathbf{r}} \mathbf{r} = 0 \qquad (26)
\]
and introduce its solution $\mathbf{r}^*(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z}) = \arg\max_{W_{\mathbf{r}}\mathbf{r} = 0,\, \mathbf{r} \in \mathcal{R}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right]$. Analogously, the norm of the dual solution to problem (26) can be bounded as
\[
\frac{\left\| \nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}^*(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z})) \right\|_2}{\lambda_{\min}^+(W_{\mathbf{r}})} < \bar{R}_{\mathbf{u}}, \qquad \bar{\beta} \in (1, +\infty]. \qquad (27)
\]
And we get a saddle-point reformulation of (26):
\[
\min_{\|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}} \; \max_{\mathbf{r} \in \mathcal{R}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right].
\]
Substituting this reformulation into (25), we obtain
\[
\begin{aligned}
\min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \max_{\substack{\mathbf{r} \in \mathcal{R} \\ W_{\mathbf{r}}\mathbf{r} = 0}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right]
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \left[ \min_{\|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}} \; \max_{\mathbf{r} \in \mathcal{R}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right] \right] \\
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \min_{\|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right] \\
&= \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \min_{\substack{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P} \\ \|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right],
\end{aligned} \qquad (28)
\]
where $F = F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})$ for brevity, which is an equivalent reformulation of (2). Note that the cases $\bar{R}_{\mathbf{u}} = +\infty$, $\bar{R}_{\mathbf{z}} = +\infty$ are also supported by this proof. A min–max reformulation is obtained analogously by adding the Lagrange multipliers in a different order.

Proof of Theorem 2.3.
The proof immediately follows from Lemma A.1 by setting $\bar{R}_{\mathbf{u}} = \bar{R}_{\mathbf{z}} = +\infty$.

A.3 Proof of Lemma 3.3
The proof follows from a more general statement in Lemma A.1.
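The norm bound on the dual solution that drives Lemma 3.1 and Lemma A.1 can be checked numerically on a toy consensus problem (the example and all names below are ours): minimize $\sum_i (p_i - c_i)^2$ subject to $Wp = 0$ for a graph Laplacian $W$, and verify that a dual solution satisfies $\|z^*\|_2 \le \|\nabla h(p^*)\|_2 / \lambda_{\min}^+(W)$.

```python
import numpy as np

# Toy consensus problem: min_p sum_i (p_i - c_i)^2  s.t.  W p = 0
W = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])  # path Laplacian
c = np.array([1.0, 3.0, 8.0])
p_star = np.full(3, c.mean())            # consensus optimum: the average
grad = 2 * (p_star - c)                  # gradient of the objective at p*
z_star = np.linalg.pinv(W) @ (-grad)     # a dual solution: W z* = -grad
lam_min_plus = np.sort(np.linalg.eigvalsh(W))[1]  # smallest positive eigenvalue
# the minimum-norm dual solution obeys ||z*|| <= ||grad|| / lambda_min^+(W):
ok = np.linalg.norm(z_star) <= np.linalg.norm(grad) / lam_min_plus + 1e-12
```

Since the pseudoinverse returns the dual solution orthogonal to $\ker W$, the bound holds by construction; this is exactly the mechanism that makes the radii $\bar{R}_{\mathbf{u}}, \bar{R}_{\mathbf{z}}$ finite in Lemma A.1.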
B Smoothness Constants for Mirror-Prox
B.1 Estimating Lipschitz constants for $S(\xi, \eta)$

We start by deriving the Lipschitz constants $L_{\xi\xi}, L_{\eta\eta}, L_{\xi\eta}, L_{\eta\xi}$ of the function $S(\xi, \eta)$ in (9). Recall that if an operator $A$ acts from a space $V$ with norm $\|v\|_v$ to a space $W$ with norm $\|w\|_w$, then these two norms naturally induce the operator norm $\|A\|_{v \to w} = \max_v \{\|Av\|_w : \|v\|_v \leq 1\}$, which gives the inequality $\|Av\|_w \leq \|A\|_{v \to w} \|v\|_v$. Recall also that (9) is a reformulation of (4) in a simpler form using the definitions $\xi = (\mathbf{x}^\top, \mathbf{p}^\top, \mathbf{u}^\top)^\top$, $\eta = (\mathbf{y}^\top, \mathbf{r}^\top, \mathbf{z}^\top)^\top$ and $S(\xi, \eta) = F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle$. Then for the corresponding partial derivatives we have
\[
\nabla_\xi S(\xi, \eta) = \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}} \mathbf{z} \\ W_{\mathbf{r}} \mathbf{r} \end{bmatrix}, \qquad
\nabla_\eta S(\xi, \eta) = \begin{bmatrix} \nabla_{\mathbf{y}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{r}} \mathbf{u} \\ W_{\mathbf{p}} \mathbf{p} \end{bmatrix}.
\]
Recall also that $\|\xi\|_\xi^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2 + \|\mathbf{u}\|_2^2$ and $\|\eta\|_\eta^2 = \|\mathbf{y}\|_{\mathbf{y}}^2 + \|\mathbf{r}\|_{\mathbf{r}}^2 + \|\mathbf{z}\|_2^2$, which induce the dual norms $\|\xi\|_{\xi,*}^2 = \|\mathbf{x}\|_{\mathbf{x},*}^2 + \|\mathbf{p}\|_{\mathbf{p},*}^2 + \|\mathbf{u}\|_2^2$ and $\|\eta\|_{\eta,*}^2 = \|\mathbf{y}\|_{\mathbf{y},*}^2 + \|\mathbf{r}\|_{\mathbf{r},*}^2 + \|\mathbf{z}\|_2^2$. The constant $L_{\xi\xi}$ has to satisfy $\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi', \eta)\|_{\xi,*} \leq L_{\xi\xi} \|\xi - \xi'\|_\xi$ for all $\xi, \xi' \in Q_\xi$.
We have
\[
\begin{aligned}
\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi', \eta)\|_{\xi,*}
&= \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} - \left( \nabla_{\mathbf{p}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} \right) \\ W_{\mathbf{r}}\mathbf{r} - W_{\mathbf{r}}\mathbf{r} \end{bmatrix} \right\|_{\xi,*} \\
&= \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) \end{bmatrix} \right\|_{(\mathbf{x},\mathbf{p}),*}
\leq L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})} \left\| \begin{bmatrix} \mathbf{x} - \mathbf{x}' \\ \mathbf{p} - \mathbf{p}' \end{bmatrix} \right\|_{(\mathbf{x},\mathbf{p})}
\leq L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})} \|\xi - \xi'\|_\xi,
\end{aligned}
\]
where we used that $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p}),*}^2 = \|\mathbf{x}\|_{\mathbf{x},*}^2 + \|\mathbf{p}\|_{\mathbf{p},*}^2$ since $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p})}^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2$, Assumption 3.5 and Definition 3.4. Thus, $L_{\xi\xi} = L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})}$. The equality $L_{\eta\eta} = L_{(\mathbf{y},\mathbf{r})(\mathbf{y},\mathbf{r})}$ is proved in the same way. Let us estimate $L_{\xi\eta}$, which has to satisfy $\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi, \eta')\|_{\xi,*} \leq L_{\xi\eta} \|\eta - \eta'\|_\eta$ for all $\eta, \eta' \in Q_\eta$.
We have
\[
\begin{aligned}
\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi, \eta')\|_{\xi,*}^2
&= \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} - \left( \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') + W_{\mathbf{p}}\mathbf{z}' \right) \\ W_{\mathbf{r}}\mathbf{r} - W_{\mathbf{r}}\mathbf{r}' \end{bmatrix} \right\|_{\xi,*}^2 \\
&= \|\nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}')\|_{\mathbf{x},*}^2 + \|\nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') + W_{\mathbf{p}}(\mathbf{z} - \mathbf{z}')\|_{\mathbf{p},*}^2 + \|W_{\mathbf{r}}(\mathbf{r} - \mathbf{r}')\|_2^2 \\
&\leq \|\nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}')\|_{\mathbf{x},*}^2 + 2\|\nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}')\|_{\mathbf{p},*}^2 + 2\|W_{\mathbf{p}}(\mathbf{z} - \mathbf{z}')\|_{\mathbf{p},*}^2 + \|W_{\mathbf{r}}(\mathbf{r} - \mathbf{r}')\|_2^2 \\
&\leq 2 \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') \end{bmatrix} \right\|_{(\mathbf{x},\mathbf{p}),*}^2 + 2\|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2 \|\mathbf{z} - \mathbf{z}'\|_2^2 + 2\|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \|\mathbf{r} - \mathbf{r}'\|_{\mathbf{r}}^2 \\
&\leq 2 L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}^2 \left\| \begin{bmatrix} \mathbf{y} - \mathbf{y}' \\ \mathbf{r} - \mathbf{r}' \end{bmatrix} \right\|_{(\mathbf{y},\mathbf{r})}^2 + 2\|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2 \|\mathbf{z} - \mathbf{z}'\|_2^2 + 2\|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \|\mathbf{r} - \mathbf{r}'\|_{\mathbf{r}}^2 \\
&\leq 4 \max\left\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}^2,\; \|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2,\; \|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \right\} \left( \|\mathbf{y} - \mathbf{y}'\|_{\mathbf{y}}^2 + \|\mathbf{r} - \mathbf{r}'\|_{\mathbf{r}}^2 + \|\mathbf{z} - \mathbf{z}'\|_2^2 \right) \\
&= 4 \max\left\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}^2,\; \|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2,\; \|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \right\} \|\eta - \eta'\|_\eta^2,
\end{aligned}
\]
where we used the definition of $\|\xi\|_{\xi,*}$, the inequality $(a+b)^2 \leq 2a^2 + 2b^2$, the fact that $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p}),*}^2 = \|\mathbf{x}\|_{\mathbf{x},*}^2 + \|\mathbf{p}\|_{\mathbf{p},*}^2$ since $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p})}^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2$, the definition of the operator norm, Assumption 3.5 and Definition 3.4, and, finally, the definition of $\|\eta\|_\eta$. Taking the square root of the derived inequality, we obtain $L_{\xi\eta} = 2 \max\left\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})},\; \|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)},\; \|W_{\mathbf{r}}\|_{\mathbf{r} \to 2} \right\}$. The bound for $L_{\eta\xi}$ is derived in the same way.

B.2 Proof of Theorem 3.6
To prove Theorem 3.6, we first show that the iterates of Algorithm 1 naturally correspond to the iterates of a general Mirror-Prox algorithm applied to problem (4). Then we extend the standard analysis of the general Mirror-Prox algorithm to account for unbounded feasible sets. By definition, the aggregated norms $\|\xi\|_\xi^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2 + \|\mathbf{u}\|_2^2$, $\|\eta\|_\eta^2 = \|\mathbf{y}\|_{\mathbf{y}}^2 + \|\mathbf{r}\|_{\mathbf{r}}^2 + \|\mathbf{z}\|_2^2$, the aggregated prox-functions $d_\xi(\xi) = \sum_{i=1}^m \left( d_{\mathbf{x};i}(x_i) + d_{\mathbf{p};i}(p_i) + d_{\mathbf{u};i}(u_i) \right)$, $d_\eta(\eta) = \sum_{i=1}^m \left( d_{\mathbf{y};i}(y_i) + d_{\mathbf{r};i}(r_i) + d_{\mathbf{z};i}(z_i) \right)$, and the aggregated Bregman divergences $B_\xi(\xi, \breve{\xi})$, $B_\eta(\eta, \breve{\eta})$ are separable w.r.t. the local variables of each agent $i$. Further, $Q_\xi = \mathcal{X} \times \mathcal{P} \times \mathbb{R}^{n d_{\mathbf{r}}}$ and $Q_\eta = \mathcal{Y} \times \mathcal{R} \times \mathbb{R}^{n d_{\mathbf{p}}}$, where $\mathcal{P} = \bar{\mathcal{P}} \times \dots \times \bar{\mathcal{P}}$, $\mathcal{R} = \bar{\mathcal{R}} \times \dots \times \bar{\mathcal{R}}$, $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_m$, and $\mathcal{Y} = \mathcal{Y}_1 \times \dots \times \mathcal{Y}_m$. Thus, the feasible set $Q_\xi \times Q_\eta$ is also separable w.r.t. the local feasible set $\mathcal{X}_i \times \bar{\mathcal{P}} \times \mathbb{R}^{d_{\mathbf{r}}} \times \mathcal{Y}_i \times \bar{\mathcal{R}} \times \mathbb{R}^{d_{\mathbf{p}}}$ of each agent $i$. Thus, if we define the variable $\zeta = (\xi^\top, \eta^\top)^\top \in Q := Q_\xi \times Q_\eta$ and the operator
\[
g(\zeta) = \begin{bmatrix} \nabla_\xi S(\xi, \eta) \\ -\nabla_\eta S(\xi, \eta) \end{bmatrix}
= \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} \\ W_{\mathbf{r}}\mathbf{r} \\ -\nabla_{\mathbf{y}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ -\nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - W_{\mathbf{r}}\mathbf{u} \\ -W_{\mathbf{p}}\mathbf{p} \end{bmatrix},
\]
then the updates of Algorithm 1 are equivalent to the updates of the general Mirror-Prox method listed as Algorithm 3.

Algorithm 3
Mirror-Prox
Require:
Starting point $\zeta^0 \in Q$, stepsize $\alpha > 0$.
for $k = 0, 1, \dots$ do
\[
\zeta^{k+1/2} := \arg\min_{\zeta \in Q} \left\{ \alpha \langle g(\zeta^k), \zeta \rangle + B(\zeta, \zeta^k) \right\} \qquad (29)
\]
\[
\zeta^{k+1} := \arg\min_{\zeta \in Q} \left\{ \alpha \langle g(\zeta^{k+1/2}), \zeta \rangle + B(\zeta, \zeta^k) \right\} \qquad (30)
\]
Set $\widehat{\zeta}^k = \frac{1}{k} \sum_{\ell=0}^{k-1} \zeta^{\ell+1/2}$.
end for

Next, we analyze the general Mirror-Prox Algorithm 3. Recall that we defined $C_\xi = \mathcal{X} \times \mathcal{P} \times \{\mathbf{u} : \|\mathbf{u}\|_2 \leq R_{\mathbf{u}}\}$ and $C_\eta = \mathcal{Y} \times \mathcal{R} \times \{\mathbf{z} : \|\mathbf{z}\|_2 \leq R_{\mathbf{z}}\}$, whence there exists a saddle point $(\xi^*, \eta^*)$ which belongs to $C_\xi \times C_\eta$. We also defined $R_\xi^2 = \max_{\xi \in C_\xi} d_\xi(\xi) - \min_{\xi \in C_\xi} d_\xi(\xi)$ and $R_\eta^2 = \max_{\eta \in C_\eta} d_\eta(\eta) - \min_{\eta \in C_\eta} d_\eta(\eta)$. Note that since $C_\xi, C_\eta$ are compact, the values $R_\xi, R_\eta$ are well-defined. Using the weights $a = 1/R_\xi^2$, $b = 1/R_\eta^2$, we define the norm $\|\zeta\|_\zeta^2 = a\|\xi\|_\xi^2 + b\|\eta\|_\eta^2$, the corresponding prox-function $d_\zeta(\zeta)$ and Bregman divergence $B_\zeta(\zeta, \breve{\zeta}) = a B_\xi(\xi, \breve{\xi}) + b B_\eta(\eta, \breve{\eta})$, together with the corresponding dual norm $\|\zeta\|_{\zeta,*}$. Under these definitions, following the standard analysis in [6], we obtain that the operator $g(\zeta)$ is $L_\zeta$-Lipschitz continuous with respect to the norm $\|\zeta\|_\zeta$ with
\[
L_\zeta = 2 \max\left\{ L_{\xi\xi} R_\xi^2,\; L_{\eta\eta} R_\eta^2,\; L_{\xi\eta} R_\xi R_\eta,\; L_{\eta\xi} R_\xi R_\eta \right\}. \qquad (31)
\]
Next, we analyze the iterations of Algorithm 3 under the assumption that $g(\zeta)$ is $L_\zeta$-Lipschitz continuous. Fix some iteration $k \geq 0$. By the optimality conditions in (29) and (30), we have, for any $\zeta \in Q$,
\[
\langle \alpha g(\zeta^k) + \nabla d_\zeta(\zeta^{k+1/2}) - \nabla d_\zeta(\zeta^k),\, \zeta - \zeta^{k+1/2} \rangle \geq 0, \qquad (32)
\]
\[
\langle \alpha g(\zeta^{k+1/2}) + \nabla d_\zeta(\zeta^{k+1}) - \nabla d_\zeta(\zeta^k),\, \zeta - \zeta^{k+1} \rangle \geq 0. \qquad (33)
\]
Whence, for all $\zeta \in Q$,
\[
\begin{aligned}
\langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta \rangle
&= \langle g(\zeta^{k+1/2}), \zeta^{k+1} - \zeta \rangle + \langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle \\
&\overset{(33)}{\leq} \frac{1}{\alpha} \langle \nabla d_\zeta(\zeta^k) - \nabla d_\zeta(\zeta^{k+1}), \zeta^{k+1} - \zeta \rangle + \langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle \\
&= \frac{1}{\alpha} B_\zeta(\zeta, \zeta^k) - \frac{1}{\alpha} B_\zeta(\zeta, \zeta^{k+1}) - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k) + \langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle,
\end{aligned}
\]
where the last equality uses the definition of the Bregman divergence $B_\zeta(\zeta, \breve{\zeta}) = d_\zeta(\zeta) - d_\zeta(\breve{\zeta}) - \langle \nabla d_\zeta(\breve{\zeta}), \zeta - \breve{\zeta} \rangle$. Further, for all $\zeta \in Q$,
\[
\begin{aligned}
\langle g(\zeta^{k+1/2}),\, &\zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k)
= \langle g(\zeta^{k+1/2}) - g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k) + \langle g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle \\
&\overset{(32)\text{ with } \zeta = \zeta^{k+1}}{\leq} \langle g(\zeta^{k+1/2}) - g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle + \frac{1}{\alpha} \langle \nabla d_\zeta(\zeta^k) - \nabla d_\zeta(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k) \\
&= \langle g(\zeta^{k+1/2}) - g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1/2}, \zeta^k) - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^{k+1/2}) \\
&\leq \|g(\zeta^{k+1/2}) - g(\zeta^k)\|_{\zeta,*} \|\zeta^{k+1/2} - \zeta^{k+1}\|_\zeta - \frac{1}{2\alpha} \left( \|\zeta^{k+1/2} - \zeta^k\|_\zeta^2 + \|\zeta^{k+1} - \zeta^{k+1/2}\|_\zeta^2 \right) \\
&\overset{\alpha = 1/L_\zeta}{\leq} L_\zeta \|\zeta^{k+1/2} - \zeta^k\|_\zeta \|\zeta^{k+1/2} - \zeta^{k+1}\|_\zeta - \frac{L_\zeta}{2} \left( \|\zeta^{k+1/2} - \zeta^k\|_\zeta^2 + \|\zeta^{k+1/2} - \zeta^{k+1}\|_\zeta^2 \right) \leq 0,
\end{aligned}
\]
where we also used that $B_\zeta(\zeta, \breve{\zeta}) \geq \frac{1}{2}\|\zeta - \breve{\zeta}\|_\zeta^2$ and that $g(\zeta)$ is $L_\zeta$-Lipschitz continuous. Combining the above two inequalities with the choice $\alpha = 1/L_\zeta$, we obtain, for all $\zeta \in Q$ and $i \geq 0$,
\[
\langle g(\zeta^{i+1/2}), \zeta^{i+1/2} - \zeta \rangle \leq L_\zeta B_\zeta(\zeta, \zeta^i) - L_\zeta B_\zeta(\zeta, \zeta^{i+1}).
\]
Summing up these inequalities for $i$ from $0$ to $k-1$, we have
\[
\sum_{i=0}^{k-1} \langle g(\zeta^{i+1/2}), \zeta^{i+1/2} - \zeta \rangle \leq L_\zeta \left( B_\zeta(\zeta, \zeta^0) - B_\zeta(\zeta, \zeta^k) \right) \leq L_\zeta B_\zeta(\zeta, \zeta^0).
\]
Now we use the connection between $S(\xi, \eta)$ and the operator $g(\zeta)$. By convexity of $S(\xi, \eta)$ in $\xi$ and concavity of $S(\xi, \eta)$ in $\eta$, we have, for all $\xi \in Q_\xi$,
\[
\frac{1}{k} \sum_{i=0}^{k-1} \langle \nabla_\xi S(\xi^{i+1/2}, \eta^{i+1/2}), \xi^{i+1/2} - \xi \rangle \geq \frac{1}{k} \sum_{i=0}^{k-1} \left( S(\xi^{i+1/2}, \eta^{i+1/2}) - S(\xi, \eta^{i+1/2}) \right) \geq \frac{1}{k} \sum_{i=0}^{k-1} S(\xi^{i+1/2}, \eta^{i+1/2}) - S(\xi, \widehat{\eta}^k).
\]
In the same way we obtain, for all $\eta \in Q_\eta$,
\[
\frac{1}{k} \sum_{i=0}^{k-1} \langle -\nabla_\eta S(\xi^{i+1/2}, \eta^{i+1/2}), \eta^{i+1/2} - \eta \rangle \geq -\frac{1}{k} \sum_{i=0}^{k-1} S(\xi^{i+1/2}, \eta^{i+1/2}) + S(\widehat{\xi}^k, \eta).
\]
Summing these inequalities, by the definition of $g(\zeta)$ we obtain that, for all $\xi \in Q_\xi$, $\eta \in Q_\eta$ and $\zeta = (\xi, \eta)$,
\[
S(\widehat{\xi}^k, \eta) - S(\xi, \widehat{\eta}^k) \leq \frac{1}{k} \sum_{i=0}^{k-1} \langle g(\zeta^{i+1/2}), \zeta^{i+1/2} - \zeta \rangle \leq \frac{L_\zeta}{k} B_\zeta(\zeta, \zeta^0).
\]
Taking the maximum over $\xi \in C_\xi$ and $\eta \in C_\eta$ and using the definition of the Bregman divergence, we obtain
\[
\max_{\eta \in C_\eta} S(\widehat{\xi}^k, \eta) - \min_{\xi \in C_\xi} S(\xi, \widehat{\eta}^k) \leq \frac{L_\zeta}{k} \max_{\xi \in C_\xi,\, \eta \in C_\eta} B_\zeta(\zeta, \zeta^0) = \frac{L_\zeta}{k} \max_{\xi \in C_\xi,\, \eta \in C_\eta} \left( a B_\xi(\xi, \xi^0) + b B_\eta(\eta, \eta^0) \right) \leq \frac{2 L_\zeta}{k},
\]
where we used that $a = 1/R_\xi^2$, $b = 1/R_\eta^2$ with $R_\xi^2 = \max_{\xi \in C_\xi} d_\xi(\xi) - \min_{\xi \in C_\xi} d_\xi(\xi)$ and $R_\eta^2 = \max_{\eta \in C_\eta} d_\eta(\eta) - \min_{\eta \in C_\eta} d_\eta(\eta)$. This finishes the proof of Theorem 3.6. Since, by construction, there exists a saddle point $(\xi^*, \eta^*)$ such that $\xi^* \in C_\xi$ and $\eta^* \in C_\eta$, Theorem 3.6 implies that Algorithm 1 generates a good approximation to a saddle point after a sufficiently large number of iterations.

C Separating the Communication and Oracle Complexities
In this section, we describe how to apply the result of Theorem 5.1 to solve problem (4), restricting our attention to the Euclidean norm only. Since our sliding technique for variational inequalities requires strong monotonicity, we introduce a regularization term in Section C.1. The proof of Theorem 5.1 is presented in Section C.2.
C.1 Regularization of Problem (4)
Lemma C.1.
Let $h(\theta) : \mathbb{R}^d \to \mathbb{R}$ be a convex function and consider the problem
\[
\min_{\theta \in Q} h(\theta), \qquad (34)
\]
where $Q \subseteq \mathbb{R}^d$ is a convex set. Fix $\theta^0 \in Q$ and introduce the regularized problem
\[
\min_{\theta \in Q} h_\mu(\theta) = h(\theta) + \frac{\mu}{2} \left\| \theta - \theta^0 \right\|_2^2,
\]
where $0 \leqslant \mu \leqslant \frac{2\varepsilon}{\|\theta^0 - \theta^*\|_2^2}$ and $\theta^*$ denotes the solution of (34). Then it holds that $h_\mu^* - h^* \leqslant \varepsilon$.

Proof.
\[
h_\mu^* = \min_{\theta \in Q} \left( h(\theta) + \frac{\mu}{2} \|\theta - \theta^0\|_2^2 \right) \leqslant h^* + \frac{\mu}{2} \left\| \theta^* - \theta^0 \right\|_2^2 \leqslant h^* + \varepsilon.
\]

Corollary C.2.
Let $Q_\xi, Q_\eta$ be convex sets and let $S(\xi, \eta)$ be convex in $\xi$ and concave in $\eta$. Consider the saddle-point problem
\[
\min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} S(\xi, \eta) \qquad (35)
\]
and introduce its regularized version
\[
\min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} \widetilde{S}(\xi, \eta) = \min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} \left[ S(\xi, \eta) + \frac{\mu}{2} \|\xi\|_2^2 - \frac{\mu}{2} \|\eta\|_2^2 \right], \qquad (36)
\]
where $\mu$ is chosen proportionally to $\varepsilon$ as in Lemma C.1. Then
\[
0 \leqslant \min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} \widetilde{S}(\xi, \eta) - \min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} S(\xi, \eta) \leqslant \varepsilon.
\]
Applying Corollary C.2 to problem (4) with
\[
\mu = \frac{\varepsilon}{\max\left( d^2(\mathcal{X}) + d^2(\mathcal{P}) + R_{\mathbf{u}}^2,\; d^2(\mathcal{Y}) + d^2(\mathcal{R}) + R_{\mathbf{z}}^2 \right)}
\]
and Theorem 5.1 allows us to achieve the convergence rate $\widetilde{O}\!\left( \frac{L_F + \lambda_{\max}(W)}{\mu} \right)$.
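Lemma C.1 is easy to verify numerically. The sketch below (our own toy example with a quadratic $h$ and $Q = \mathbb{R}^d$; all names are ours) checks the key inequality from the proof, $h_\mu^* \leq h^* + \frac{\mu}{2}\|\theta^* - \theta^0\|_2^2$:

```python
import numpy as np

# Toy instance: h(theta) = ||theta - c||^2, so theta* = c and h* = 0.
rng = np.random.default_rng(1)
d = 5
c = rng.normal(size=d)                  # minimizer of h
theta0 = rng.normal(size=d)             # regularization center
mu = 0.3
# closed-form minimizer of h_mu(theta) = h(theta) + (mu/2) ||theta - theta0||^2
theta_mu = (2 * c + mu * theta0) / (2 + mu)
h_star_mu = np.sum((theta_mu - c) ** 2) + 0.5 * mu * np.sum((theta_mu - theta0) ** 2)
bound = 0.5 * mu * np.sum((c - theta0) ** 2)   # (mu/2) ||theta* - theta0||^2
# h_mu* <= h* + (mu/2)||theta* - theta0||^2, with h* = 0 here:
ok = h_star_mu <= bound + 1e-12
```

Choosing $\mu \leq 2\varepsilon / \|\theta^0 - \theta^*\|_2^2$ then turns this bound into $h_\mu^* - h^* \leq \varepsilon$, exactly as the lemma states.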
C.2 Convergence Proof for Sliding

Lemma C.3.
Iterates of Algorithm 2 satisfy the following inequality:
\[
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 \leq (1 - \eta\mu) \left\| \zeta^k - \zeta^* \right\|_2^2 + \left( \eta\mu + \eta^2 L_A^2 - 1 \right) \left\| \zeta^k - \theta^k \right\|_2^2 + \left( \frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu} \right) \left\| \theta^k - \widehat{\theta}^k \right\|_2^2. \qquad (37)
\]
We start by using line 2 of Algorithm 2:
\[
\begin{aligned}
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 &= \left\| \operatorname{Proj}_Q(\omega^k) - \operatorname{Proj}_Q(\zeta^*) \right\|_2^2 \leq \left\| \omega^k - \zeta^* \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \omega^k - \zeta^k, \zeta^k - \zeta^* \rangle + \left\| \omega^k - \zeta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \omega^k - \zeta^k, \theta^k - \zeta^* \rangle + 2 \langle \omega^k - \zeta^k, \zeta^k - \theta^k \rangle + \left\| \omega^k - \zeta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \omega^k - \zeta^k, \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2.
\end{aligned}
\]
Using line 2 of Algorithm 2, we get
\[
\begin{aligned}
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 &\leq \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \theta^k + \eta (A(\zeta^k) - A(\theta^k)) - \zeta^k, \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \theta^k + \eta A(\zeta^k) - \zeta^k, \theta^k - \zeta^* \rangle - 2\eta \langle A(\theta^k), \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2.
\end{aligned}
\]
Using line 2 of Algorithm 2, we get
\[
\begin{aligned}
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 &\leq \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \theta^k - \nu^k, \theta^k - \zeta^* \rangle - 2\eta \langle A(\theta^k), \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 - 2 \langle \widehat{\theta}^k - \nu^k, \zeta^* - \theta^k \rangle - 2\eta \langle A(\theta^k), \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2 + 2 \langle \theta^k - \widehat{\theta}^k, \theta^k - \zeta^* \rangle.
\end{aligned}
\]
Using (15) we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 + 2\eta\langle B(\widehat\theta^k),\zeta^*-\theta^k\rangle - 2\eta\langle A(\theta^k),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k,\theta^k-\zeta^*\rangle
\]
\[
= \|\zeta^k-\zeta^*\|^2 - 2\eta\langle B(\theta^k),\theta^k-\zeta^*\rangle - 2\eta\langle A(\theta^k),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k)),\ \theta^k-\zeta^*\rangle.
\]
Using (12) and (14) we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - 2\eta\langle F(\theta^k),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k)),\ \theta^k-\zeta^*\rangle
\]
\[
\le \|\zeta^k-\zeta^*\|^2 - 2\eta\mu\|\theta^k-\zeta^*\|^2 - 2\eta\langle F(\zeta^*),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k)),\ \theta^k-\zeta^*\rangle.
\]
Using (11) and Young's inequality we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - 2\eta\mu\|\theta^k-\zeta^*\|^2 + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \frac{\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \frac{2}{\eta\mu}\big\|\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k))\big\|^2
\]
\[
\le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \frac{4}{\eta\mu}\|\theta^k-\widehat\theta^k\|^2 + \frac{4\eta}{\mu}\|B(\theta^k)-B(\widehat\theta^k)\|^2.
\]
Using line 2 of Algorithm 2 we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \eta^2\|A(\zeta^k)-A(\theta^k)\|^2 - \|\zeta^k-\theta^k\|^2 + \frac{4}{\eta\mu}\|\theta^k-\widehat\theta^k\|^2 + \frac{4\eta}{\mu}\|B(\theta^k)-B(\widehat\theta^k)\|^2.
\]
Using (13a) and (13b) we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \eta^2 L_A^2\|\zeta^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \Big(\frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu}\Big)\|\theta^k-\widehat\theta^k\|^2.
\]
Using the inequality \(\|a+b\|^2 \ge \frac{2}{3}\|a\|^2 - 2\|b\|^2\) with \(a = \zeta^k-\zeta^*\), \(b = -(\zeta^k-\theta^k)\), we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\Big[\frac{2}{3}\|\zeta^k-\zeta^*\|^2 - 2\|\zeta^k-\theta^k\|^2\Big] + \eta^2 L_A^2\|\zeta^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \Big(\frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu}\Big)\|\theta^k-\widehat\theta^k\|^2
\]
\[
= (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \big(3\eta\mu + \eta^2 L_A^2 - 1\big)\|\zeta^k-\theta^k\|^2 + \Big(\frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu}\Big)\|\theta^k-\widehat\theta^k\|^2.
\]

Lemma C.4.
Let \(\theta^k\) be the output of the Forward-Backward-Forward algorithm for solving the auxiliary problem (15) after T iterations, started at the point \(\zeta^k\). Choosing \(\delta \in (0, 1/2]\) and the number of iterations
\[
T = O\big((1+\eta L_B)\log\tfrac{1}{\delta}\big) \tag{38}
\]
implies
\[
\|\widehat\theta^k - \theta^k\| \le 2\delta\,\|\zeta^k-\theta^k\|. \tag{39}
\]
Proof.
Problem (15) is a variational inequality of the form
\[
\langle \widehat B_k(\widehat\theta^k),\ \zeta - \widehat\theta^k\rangle \ge 0 \quad \text{for all } \zeta \in Q,
\]
where the operator \(\widehat B_k : Q \to \mathbb{R}^d\) is defined as \(\widehat B_k(\zeta) = \eta B(\zeta) + \zeta - \nu^k\). The operator \(\widehat B_k\) is 1-strongly monotone and \((1+\eta L_B)\)-Lipschitz. Hence, applying the standard linear-convergence result for the Forward-Backward-Forward algorithm with the number of iterations (38) implies
\[
\|\theta^k - \widehat\theta^k\| \le \delta\,\|\zeta^k - \widehat\theta^k\|.
\]
Moreover,
\[
\|\theta^k - \widehat\theta^k\| \le \delta\|\zeta^k - \widehat\theta^k\| \le \delta\|\zeta^k-\theta^k\| + \delta\|\theta^k-\widehat\theta^k\| \le \delta\|\zeta^k-\theta^k\| + \frac{1}{2}\|\theta^k-\widehat\theta^k\|,
\]
where the last inequality follows from \(\delta \le 1/2\). Rearranging gives (39).
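The Forward-Backward-Forward (Tseng) iteration used for the auxiliary problem can be sketched for an unconstrained strongly monotone affine operator; the matrix `M` below mimics \(\widehat B_k(\zeta) = \eta B(\zeta) + \zeta - \nu^k\), and all constants are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

# Sketch: Tseng's Forward-Backward-Forward method for the strongly monotone
# affine operator Bhat(z) = M z - b (unconstrained, so projections are identity).
# M = I + eta*(skew + PSD) mimics Bhat_k(z) = eta*B(z) + z - nu^k: it is
# 1-strongly monotone and Lipschitz. All numbers are illustrative.
eta = 0.5
S = np.array([[0.0, 1.0], [-1.0, 0.0]])      # skew-symmetric (monotone) part of B
D = np.array([[0.2, 0.0], [0.0, 0.3]])       # PSD part of B
M = np.eye(2) + eta * (S + D)                # operator matrix of Bhat
b = np.array([1.0, -2.0])
Bhat = lambda z: M @ z - b

z_star = np.linalg.solve(M, b)               # exact solution Bhat(z) = 0
L = np.linalg.norm(M, 2)                     # Lipschitz constant of Bhat
s = 0.3 / L                                  # stepsize below 1/L

z = np.zeros(2)                              # start at zeta^k = 0
for _ in range(300):
    z_half = z - s * Bhat(z)                 # forward-backward step
    z = z_half - s * (Bhat(z_half) - Bhat(z))  # correcting forward step

err0 = np.linalg.norm(0 - z_star)
err = np.linalg.norm(z - z_star)
```

The error contracts geometrically per iteration, which is the linear rate behind the \(\log(1/\delta)\) factor in (38).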
Lemma C.5.
Apply Algorithm 2 to problem (11). On each iteration of Algorithm 2, solve the auxiliary problem (15) on line 2 with T iterations of the Forward-Backward-Forward algorithm started at the point \(\zeta^k\), where T is given by (38). Choosing
\[
\delta = \min\Big\{\frac{1}{2},\ \Big[\frac{64}{\eta\mu} + \frac{64\eta L_B^2}{\mu}\Big]^{-1/2}\Big\}, \tag{40}
\]
the stepsize
\[
\eta = \min\Big\{\frac{1}{2L_A},\ \frac{1}{6\mu}\Big\}, \tag{41}
\]
and the number of iterations
\[
N = \frac{1}{\eta\mu}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2} \tag{42}
\]
implies
\[
\|\zeta^N - \zeta^*\| \le \varepsilon. \tag{43}
\]
Proof.
Using (37) together with (39), we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \big(3\eta\mu+\eta^2L_A^2-1\big)\|\zeta^k-\theta^k\|^2 + 4\delta^2\Big(\frac{4}{\eta\mu}+\frac{4\eta L_B^2}{\mu}\Big)\|\zeta^k-\theta^k\|^2.
\]
Plugging in \(\delta\) defined by (40) gives
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \big(3\eta\mu+\eta^2L_A^2-1\big)\|\zeta^k-\theta^k\|^2 + \frac{1}{4}\|\zeta^k-\theta^k\|^2 = (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \Big(3\eta\mu+\eta^2L_A^2-\frac{3}{4}\Big)\|\zeta^k-\theta^k\|^2.
\]
Plugging in \(\eta\) defined by (41) gives
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le (1-\eta\mu)\|\zeta^k-\zeta^*\|^2.
\]
Telescoping and using N defined by (42) yields (43), which concludes the proof.

Corollary C.6.
Without loss of generality assume \(L_A \le L_B\). The total number of computations of \(A(\zeta)\) is
\[
N_A = N = O\Big(\Big(\frac{L_A}{\mu}+1\Big)\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big).
\]
The total number of computations of \(B(\zeta)\) is
\[
N_B = N \times T = O\Big((1+\eta L_B)\Big(\frac{L_A}{\mu}+1\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big) = O\Big(\Big(1+\min\Big\{\frac{L_B}{\mu},\frac{L_B}{L_A}\Big\} + \frac{L_A}{\mu} + \min\Big\{\frac{L_BL_A}{\mu^2},\frac{L_B}{\mu}\Big\}\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big)
\]
\[
\le O\Big(\Big(\frac{L_A}{\mu}+\frac{L_B}{\mu}+1\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big) \le O\Big(\Big(\frac{L_B}{\mu}+1\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big).
\]
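The choice of N in (42) can be sanity-checked numerically: a \((1-\eta\mu)\)-contraction run for \(N = \frac{1}{\eta\mu}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\) iterations drives the squared distance below \(\varepsilon^2\), since \((1-x)^N \le e^{-xN}\). The constants below are illustrative:

```python
import math

# Sanity check of (42): after N = (1/(eta*mu)) * log(R2/eps^2) steps of a
# (1 - eta*mu)-contraction, the squared distance is below eps^2.
# All constants are illustrative placeholders.
mu, L_A = 0.1, 5.0
eta = min(1 / (2 * L_A), 1 / (6 * mu))   # stepsize as in the reconstructed (41)
R2 = 100.0                               # ||zeta^0 - zeta^*||^2
eps = 1e-3

N = math.ceil(1 / (eta * mu) * math.log(R2 / eps**2))
final_sq_dist = (1 - eta * mu) ** N * R2
ok = final_sq_dist <= eps**2
```

This is exactly the telescoping step at the end of the proof of Lemma C.5.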
D Proof of lower bounds in Theorem 4.1
In contrast to the main part, in this proof we adhere to the following notation:
\[
\min_x \max_y f(x,y) = \frac{1}{M}\sum_{m=1}^{M} f_m(x,y). \tag{44}
\]
The next assumption is needed to obtain lower bounds for a class of first-order methods. To describe this class, we use a Black-Box model similar to [27]. We assume that one local iteration costs t time units and one communication round costs τ time units. Information can be transmitted only along the undirected edges of the network. Communications and local updates can take place in parallel and asynchronously. More formally:

Assumption D.1.
For each machine m and each moment of time T, define the reachable sets \(H^x_{m,T} \subseteq \mathbb{R}^{n_x}\), \(H^y_{m,T} \subseteq \mathbb{R}^{n_y}\).

Initialization.
The starting points are \(x^0 = 0\), \(y^0 = 0\); hence \(H^x_{m,0} = \{0\}\) and \(H^y_{m,0} = \{0\}\) for all m.

Connection.
The reachable sets are the union of what can be obtained by a communication, \(\bar H^x_{m,T}, \bar H^y_{m,T}\), and by a local step, \(\widehat H^x_{m,T}, \widehat H^y_{m,T}\):
\[
H^x_{m,T} \subseteq \bar H^x_{m,T} \cup \widehat H^x_{m,T}, \qquad H^y_{m,T} \subseteq \bar H^y_{m,T} \cup \widehat H^y_{m,T}.
\]

Communication.
We have communication reachable sets \(\bar H^x_{m,T}, \bar H^y_{m,T}\), where the communication with the neighbors starts τ units of time earlier:
\[
\bar H^x_{m,T} \subseteq \mathrm{span}\bigcup_{(i,m)\in\mathcal{E}} H^x_{i,T-\tau}, \qquad \bar H^y_{m,T} \subseteq \mathrm{span}\bigcup_{(i,m)\in\mathcal{E}} H^y_{i,T-\tau}.
\]
The main differences between decentralized and centralized optimization are the equivalence of the nodes in the decentralized case and the fact that communications can be unscheduled. This allows a stronger use of the properties of the connection graph. One of the most popular communication schemes of this kind is the gossip algorithm [5], [23]. This approach uses a certain matrix W (see Assumption 2.2): during communications the local vectors are "weighted" by multiplication with W. The convergence of most gossip-based algorithms is governed by this matrix, and our bounds are no exception. In particular, we use the condition number
\[
\chi = \chi(W) = \frac{\lambda_1(W)}{\lambda_{M-1}(W)}.
\]

Local computation.
Locally, each device can compute \(\nabla_x f_m(x,y,\xi)\) and \(\nabla_y f_m(x,y,\xi)\) for all \(x \in H^x_{m,T-t}\), \(y \in H^y_{m,T-t}\). Then the local update is
\[
\widehat H^x_{m,T} \subseteq \mathrm{span}\big\{x,\ \nabla_x f_m(x,y,\xi) :\ \forall x \in H^x_{m,T-t},\ y \in H^y_{m,T-t}\big\}, \qquad
\widehat H^y_{m,T} \subseteq \mathrm{span}\big\{y,\ \nabla_y f_m(x,y,\xi) :\ \forall x \in H^x_{m,T-t},\ y \in H^y_{m,T-t}\big\}.
\]

Output.
When the algorithm stops (after T units of time), we have M local outputs \(x^f_m \in H^x_{m,T}\), \(y^f_m \in H^y_{m,T}\). Suppose the final global output is calculated as follows:
\[
x^f \in \mathrm{span}\Big\{\bigcup_{m=1}^M H^x_{m,T}\Big\}, \qquad y^f \in \mathrm{span}\Big\{\bigcup_{m=1}^M H^y_{m,T}\Big\}.
\]
The proofs of the theorems proceed by designing examples of "bad" functions, as well as a "bad" arrangement of them on the devices. Our example builds on the example function for the non-distributed case from [31]. Let \(B \subset V\) be a subset of the nodes of G. For \(d > 0\) we define \(B_d = \{v \in V : d(B,v) \ge d\}\), where \(d(B,v)\) is the distance between the set B and the node v. Then we construct the following arrangement of bilinear functions on the nodes:
\[
f_m(x,y) = \begin{cases}
f_1(x,y) = \dfrac{M}{2|B_d|}\cdot L\, x^\top A_1 y + \dfrac{\mu_x}{2}\|x\|^2 - \dfrac{\mu_y}{2}\|y\|^2 - \dfrac{M}{|B_d|}\cdot\dfrac{L^2}{\sqrt{\mu_x\mu_y}}\, e_1^\top y, & m \in B_d,\\[2mm]
f_2(x,y) = \dfrac{M}{2|B|}\cdot L\, x^\top A_2 y + \dfrac{\mu_x}{2}\|x\|^2 - \dfrac{\mu_y}{2}\|y\|^2, & m \in B,\\[2mm]
f_3(x,y) = \dfrac{\mu_x}{2}\|x\|^2 - \dfrac{\mu_y}{2}\|y\|^2, & \text{otherwise},
\end{cases} \tag{45}
\]
where \(e_1 = (1,0,\dots,0)^\top\), and \(A_1\), \(A_2\) are obtained by splitting the rows of the chain matrix \(2A\) (with A as in (47) below) into the odd-indexed and the even-indexed groups, respectively, the remaining rows being zero; in particular, \(A_1 + A_2 = 2A\) and \(A_1 e_1 = A_1^\top e_1 = e_1\).

Lemma D.2. If \(B_d \ne \emptyset\), then for any procedure that satisfies Assumption D.1, after T units of time only the first
\[
k = \Big\lfloor \frac{T-2t}{t+d\tau} \Big\rfloor + 2
\]
coordinates of the global output can be non-zero; the remaining n − k coordinates are strictly equal to zero.

Proof.
Consider an arbitrary moment T. According to Assumption D.1, one local step changes \(H^x\), \(H^y\) as follows:
\[
H^x_{m,T+t} \subseteq \begin{cases} \mathrm{span}\{x,\ A_1 y\}, & m\in B_d\\ \mathrm{span}\{x,\ A_2 y\}, & m\in B\\ \mathrm{span}\{x\}, & \text{otherwise,} \end{cases} \qquad
H^y_{m,T+t} \subseteq \begin{cases} \mathrm{span}\{e_1,\ y,\ A_1^\top x\}, & m\in B_d\\ \mathrm{span}\{y,\ A_2^\top x\}, & m\in B\\ \mathrm{span}\{y\}, & \text{otherwise,} \end{cases}
\]
for all \(x \in H^x_{m,T}\), \(y \in H^y_{m,T}\). Then, for any number \(H \in \mathbb{N}\) of local iterations (without communications),
\[
H^x_{m,T+tH} \subseteq \begin{cases} \mathrm{span}\{x,\ A_1 y,\ A_1A_1^\top x,\ A_1(A_1^\top A_1)y,\ \dots,\ (A_1A_1^\top)^H x,\ A_1(A_1^\top A_1)^H y\}, & m\in B_d\\ \mathrm{span}\{x,\ A_2 y,\ A_2A_2^\top x,\ A_2(A_2^\top A_2)y,\ \dots,\ (A_2A_2^\top)^H x,\ A_2(A_2^\top A_2)^H y\}, & m\in B\\ \mathrm{span}\{x\}, & \text{otherwise,} \end{cases}
\]
and similarly for \(H^y_{m,T+tH}\), with the roles of \(A_i\) and \(A_i^\top\) exchanged and with \(e_1\) added for \(m \in B_d\). Here we additionally use that \(A_1 e_1 = A_1^\top e_1 = e_1\). Now let \(H^x_{m,T}, H^y_{m,T} \subseteq \mathrm{span}\{e_1,\dots,e_k\}\) for all m. Then, for any number \(H\in\mathbb{N}\) of local iterations (without communications),
\[
H^x_{m,T+tH},\ H^y_{m,T+tH} \subseteq \begin{cases}
\mathrm{span}\{e_1,\dots,e_{2\lceil (k+1)/2\rceil - 1}\} = \begin{cases}\mathrm{span}\{e_1,\dots,e_k\}, & \text{odd } k\\ \mathrm{span}\{e_1,\dots,e_{k+1}\}, & \text{even } k,\end{cases} & m\in B_d\\[2mm]
\mathrm{span}\{e_1,\dots,e_{2\lceil k/2\rceil}\} = \begin{cases}\mathrm{span}\{e_1,\dots,e_{k+1}\}, & \text{odd } k\\ \mathrm{span}\{e_1,\dots,e_k\}, & \text{even } k,\end{cases} & m\in B\\[2mm]
\mathrm{span}\{e_1,\dots,e_k\}, & \text{otherwise.}
\end{cases} \tag{46}
\]
When deriving the last statement we used the explicit form of the matrices \(A_1^\top A_1\), \(A_2^\top A_2\), \(A_1A_1^\top\), \(A_2A_2^\top\) (all four products are tridiagonal).

This fact leads to the main idea of the proof. At the initial moment all coordinates are zero, since the starting points \(x^0, y^0\) equal 0. Using only local iterations (at least 2), one can achieve that at the nodes of \(B_d\) only the first coordinates of x and y can be non-zero, while the remaining coordinates stay strictly zero; at the other nodes everything remains strictly zero. Without communications the situation does not change. Therefore, at least d communication rounds are needed in order to obtain a non-zero first coordinate at some node of B. Using (46), by local iterations (at least 1) at a node of the set B, one can make the first and second coordinates non-zero. The process then continues according to (46). Hence, to obtain at least one node \(i \in V\) with reachable sets \(H^x_m, H^y_m \subseteq \mathrm{span}\{e_1,\dots,e_k\}\), we need at least k+1 local steps (at least 2 steps at the beginning, when we start from 0, and then at least 1 local step in the other cases, see the previous paragraph), as well as \((k-2)d\) communication rounds. In other words,
\[
\max_m \Big[\max\{k\in\mathbb{N} : \exists x\in H^x_{m,T},\ \exists y\in H^y_{m,T} \text{ s.t. } x_k \ne 0,\ y_k\ne 0\}\Big] \le \Big\lfloor \frac{T-2t}{t+d\tau}\Big\rfloor + 2.
\]
According to Assumption D.1, the final global output is the span of the union of all local outputs, whence the statement of the lemma follows. □
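The coordinate-growth mechanism behind the proof, namely that tridiagonal matrices extend the support of a vector by at most one coordinate per application, can be illustrated directly (the chain matrix `C` below is an illustrative stand-in for the operators built from \(A_1\), \(A_2\)):

```python
import numpy as np

# Tridiagonal matrices map vectors supported on the first k coordinates into
# vectors supported on the first k+1 coordinates: this is the mechanism of
# Lemma D.2. The chain matrix C is an illustrative stand-in.
n, k = 10, 4
C = np.eye(n) - np.diag(np.ones(n - 1), 1)   # 1 on diagonal, -1 on superdiagonal
CtC = C.T @ C                                # tridiagonal product

v = np.zeros(n)
v[:k] = 1.0                                  # supported on the first k coordinates
w = CtC @ v
support = np.nonzero(np.abs(w) > 1e-12)[0]   # indices of non-zero entries
```

One application reaches coordinate k (0-indexed) and no further, so each local round can "unlock" at most one new coordinate.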
For the global objective function we have
\[
f(x,y) = \frac{1}{M}\sum_{m=1}^M f_m(x,y) = \frac{1}{M}\big(|B_d|\cdot f_1(x,y) + |B|\cdot f_2(x,y) + (M-|B_d|-|B|)\cdot f_3(x,y)\big) = L\, x^\top A y + \frac{\mu_x}{2}\|x\|^2 - \frac{\mu_y}{2}\|y\|^2 - \frac{L^2}{\sqrt{\mu_x\mu_y}}\, e_1^\top y, \quad \text{with } A = \frac{1}{2}(A_1+A_2). \tag{47}
\]
The previous lemma describes the structure of any solution obtained by a procedure that satisfies Assumption D.1. The next lemma concerns the approximate solution of problem (47).

Lemma D.3 (Lemma 3.3 from [31]). Let \(\alpha = \frac{\mu_x\mu_y}{L^2}\) and let \(q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big) \in (0,1)\) be the smallest root of \(q^2-(2+\alpha)q+1=0\). Introduce the approximation \(\bar y^*\) with \(\bar y^*_i = \frac{q^i}{1-q}\). Then the error between the approximation and the true solution of (47) can be bounded as
\[
\|\bar y^* - y^*\| \le \frac{q^{n+1}}{\alpha(1-q)}.
\]
Proof:
Let us write the dual function for (47):
\[
g(y) = -\frac{1}{2}\, y^\top\Big(\frac{L^2}{\mu_x}A^\top A + \mu_y I\Big)y + \frac{L^2}{\mu_x}\, e_1^\top y,
\]
where one easily finds that \(A^\top A\) is the tridiagonal matrix with diagonal \((1,2,\dots,2)\) and off-diagonal entries −1. The optimality condition \(\nabla g(y^*) = 0\) of the dual problem reads
\[
\Big(\frac{L^2}{\mu_x}A^\top A + \mu_y I\Big)y^* = \frac{L^2}{\mu_x}e_1, \quad \text{or} \quad \big(A^\top A + \alpha I\big) y^* = e_1.
\]
Let us write this as a system of equations:
\[
(1+\alpha)y^*_1 - y^*_2 = 1, \qquad -y^*_{i-1} + (2+\alpha)y^*_i - y^*_{i+1} = 0 \ (i=2,\dots,n-1), \qquad -y^*_{n-1} + (2+\alpha)y^*_n = 0.
\]
Note that
\[
(1+\alpha)\bar y^*_1 - \bar y^*_2 = 1, \qquad -\bar y^*_{i-1} + (2+\alpha)\bar y^*_i - \bar y^*_{i+1} = 0 \ (i=2,\dots,n-1), \qquad -\bar y^*_{n-1} + (2+\alpha)\bar y^*_n = \frac{q^{n+1}}{1-q},
\]
or \(\big(A^\top A + \alpha I\big)\bar y^* = e_1 + \frac{q^{n+1}}{1-q}e_n\). Hence
\[
\bar y^* - y^* = \big(A^\top A + \alpha I\big)^{-1}\frac{q^{n+1}}{1-q}\,e_n.
\]
Together with the fact that \(\alpha^{-1}I \succeq \big(A^\top A + \alpha I\big)^{-1} \succ 0\), this implies the statement of the lemma. □

Now we formulate a key lemma (similar to Lemma 3.4 from [31]).
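Before stating it, the closed form of Lemma D.3 can be sanity-checked numerically: with the tridiagonal \(A^\top A\) reconstructed above, \(\bar y^*_i = q^i/(1-q)\) nearly solves \((A^\top A + \alpha I)y = e_1\) within the stated error bound (sizes and constants below are illustrative):

```python
import numpy as np

# Check Lemma D.3: with AtA = tridiag(-1, diag(1,2,...,2), -1),
# ybar_i = q^i/(1-q) nearly solves (AtA + alpha*I) y = e1, and
# ||ybar - y*|| <= q^(n+1)/(alpha*(1-q)). Sizes/constants are illustrative.
n, alpha = 30, 0.05
q = 0.5 * (2 + alpha - np.sqrt(alpha**2 + 4 * alpha))  # smallest root of q^2-(2+alpha)q+1=0

AtA = 2 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
AtA[0, 0] = 1.0                                        # first diagonal entry is 1
e1 = np.zeros(n)
e1[0] = 1.0

y_star = np.linalg.solve(AtA + alpha * np.eye(n), e1)  # exact dual solution
y_bar = q ** np.arange(1, n + 1) / (1 - q)             # geometric approximation

err = np.linalg.norm(y_bar - y_star)
bound = q ** (n + 1) / (alpha * (1 - q))
```

The geometric decay of \(y^*\) is exactly what makes the last coordinates of the solution exponentially hard to reach in the distributed lower bound.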
Lemma D.4. Let \(B_d \ne \emptyset\) and consider a distributed saddle-point problem of the form (45). For any time T one can choose a problem size
\[
n \ge \max\Big\{\log_q\Big(\frac{\alpha}{\sqrt{2}}\Big),\ k\Big\}, \quad \text{where } k = \Big\lfloor\frac{T-2t}{t+d\tau}\Big\rfloor + 2,\ \ \alpha = \frac{\mu_x\mu_y}{L^2},\ \ q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big)\in(0,1),
\]
such that for any procedure satisfying Assumption D.1 the following lower bound on the solution after T units of time holds:
\[
\|x_T - x^*\|^2 + \|y_T - y^*\|^2 \ge \frac{q^{2\left(\frac{T}{t+d\tau}+2\right)}}{4}\,\|y^0 - y^*\|^2.
\]
Proof:
By Lemma D.3, with q < 1 and k ≤ n we have
\[
\|y_T - \bar y^*\| \ge \sqrt{\sum_{j=k+1}^n (\bar y^*_j)^2} = \frac{q^k}{1-q}\sqrt{q^2+q^4+\dots+q^{2(n-k)}} \ge \frac{q^k}{\sqrt{2}(1-q)}\sqrt{q^2+q^4+\dots+q^{2n}} = \frac{q^k}{\sqrt{2}}\|\bar y^*\| = \frac{q^k}{\sqrt{2}}\|y^0-\bar y^*\|.
\]
By Lemma D.3, for \(n \ge \log_q\big(\frac{\alpha}{\sqrt{2}}\big)\) we can guarantee that \(\bar y^* \approx y^*\) (for more details see [31]) and
\[
\|x_T-x^*\|^2 + \|y_T-y^*\|^2 \ge \|y_T-y^*\|^2 \ge \frac{q^{2k}}{4}\|y^0-y^*\|^2 = \frac{q^{2\left(\lfloor\frac{T-2t}{t+d\tau}\rfloor+2\right)}}{4}\|y^0-y^*\|^2 \ge \frac{q^{2\left(\frac{T}{t+d\tau}+2\right)}}{4}\|y^0-y^*\|^2.
\]

Theorem D.5.
Let \(L, \mu_x, \mu_y > 0\), \(\chi \ge 1\), \(\varepsilon > 0\). Additionally, assume that \(L > \sqrt{\mu_x\mu_y}\). Then there exists a distributed saddle-point problem of M functions (M is defined in the proof) with decentralized architecture and gossip matrix W for which the following statements are true:
• the gossip matrix W has \(\chi(W) = \chi\);
• \(f = \frac{1}{M}\sum_{m=1}^M f_m : \mathbb{R}^n\times\mathbb{R}^n \to \mathbb{R}\) is \((\mu_x, L, L, \mu_y)\)-smooth and \(\mu_x\)-strongly-convex–\(\mu_y\)-strongly-concave;
• the size \(n \ge \max\big\{\log_q\big(\frac{\alpha}{\sqrt 2}\big),\ k\big\}\), where \(k = \big\lfloor\frac{T-2t}{t+\tau}\big\rfloor + 2\), \(\alpha = \frac{\mu_x\mu_y}{L^2}\) and \(q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big) \in (0,1)\).
Then, for any procedure satisfying Assumption D.1, the time to achieve accuracy ε of the solution in the final global output is bounded as
\[
T = \Omega\Big(\frac{L}{\sqrt{\mu_x\mu_y}}\big(t + \sqrt{\chi}\,\tau\big)\ln(\varepsilon^{-1})\Big).
\]
Proof:
The proof is similar to the proof of Theorem 2 from [27]. Let \(\gamma_M = \frac{1-\cos(\pi/M)}{1+\cos(\pi/M)}\); this is a decreasing sequence of positive numbers. Since \(\gamma_2 = 1\) and \(\lim_{M\to\infty} \gamma_M = 0\), there exists \(M \ge 2\) such that \(\gamma_M \ge \frac{1}{\chi} > \gamma_{M+1}\).

If \(M \ge 3\), consider the linear graph on M vertices with edge weights \(w_{1,2} = 1-a\) and \(w_{i,i+1} = 1\) for \(i \ge 2\). Let \(B = \{v_1\}\) and \(d = M-1\); then \(B_d = \{v_M\}\). Define \(f_m\) as in (45). Hence one can use the previous lemma and get
\[
\|x_T - x^*\|^2 + \|y_T - y^*\|^2 \ge \frac{q^{2\left(\frac{T}{t+d\tau}+2\right)}}{4}\,\|y^0 - y^*\|^2.
\]
Let \(W_a\) be the Laplacian of the weighted graph G. Note that for \(a = 0\) we have \(\chi(W_a) = 1/\gamma_M\), while as \(a \to 1\) the graph disconnects and \(\chi(W_a) \to \infty\). Hence there exists \(a \in [0,1)\) such that \(\chi(W_a) = \chi\). Moreover, \(\frac{1}{\chi} > \gamma_{M+1} = \tan^2\frac{\pi}{2(M+1)} \ge \frac{1}{(M+1)^2}\), so \(M+1 > \sqrt{\chi}\) and hence \(d = M-1 \ge \sqrt{\chi}/2\). Then
\[
\frac{T}{t+d\tau} = \Omega\big(\log_{1/q}(\varepsilon^{-1})\big), \qquad T = \Omega\Big(\frac{(t+\sqrt{\chi}\,\tau)\ln(\varepsilon^{-1})}{\ln(1/q)}\Big).
\]
If \(M = 2\), we take a complete graph in which the weight of one edge equals a. Let \(B = \{v_1\}\) and \(d = 1\), and let \(W_a\) be the Laplacian of G. Varying \(a \in [0,1]\) changes \(\chi(W_a)\) continuously between its extreme values, hence there exists a with \(\chi(W_a) = \chi\); moreover, \(d = 1 \ge \sqrt{\chi}/2\), since \(\chi < 3\) in this case. Finally, we have the same estimate \(T = \Omega\big(\frac{(t+\sqrt{\chi}\tau)\ln(\varepsilon^{-1})}{\ln(1/q)}\big)\). Next we bound
\[
\frac{1}{\ln(1/q)} = \frac{1}{\ln\big(1+\frac{1-q}{q}\big)} \ge \frac{q}{1-q} = \frac{2+\alpha-\sqrt{\alpha^2+4\alpha}}{\sqrt{\alpha^2+4\alpha}-\alpha} \ge \frac{1}{2}\sqrt{\frac{L^2}{\mu_x\mu_y}}.
\]
Finally, we get
\[
T = \Omega\Big(\frac{L}{\sqrt{\mu_x\mu_y}}\big(t+\sqrt{\chi}\,\tau\big)\ln(\varepsilon^{-1})\Big). \quad\square
\]

Regularized problem.
To obtain lower bounds for the convex-concave case, we use the standard trick that reduces a convex-concave problem to a strongly-convex–strongly-concave one. We work with the regularized problem
\[
f_{\mathrm{reg}}(x,y) = f(x,y) + \frac{\varepsilon}{4\Omega_x^2}\|x-x^0\|^2 - \frac{\varepsilon}{4\Omega_y^2}\|y-y^0\|^2,
\]
where \(\Omega_x, \Omega_y\) are the Euclidean diameters of the sets X, Y. It turns out that if f(x,y) is a convex-concave function, then \(f_{\mathrm{reg}}(x,y)\) is \(\big(\frac{\varepsilon}{2\Omega_x^2}, \frac{\varepsilon}{2\Omega_y^2}\big)\)-strongly-convex–strongly-concave. Solving the new problem with accuracy ε/2 then yields a solution of the original problem with accuracy ε. Example (45) is rewritten as follows:
\[
f_m(x,y) = \begin{cases}
f_1(x,y) = \dfrac{M}{2|B_d|}\cdot L\,x^\top A_1 y - \dfrac{M}{|B_d|}\cdot\dfrac{2L^2\Omega_x\Omega_y}{\varepsilon}\,e_1^\top y, & m\in B_d,\\[2mm]
f_2(x,y) = \dfrac{M}{2|B|}\cdot L\,x^\top A_2 y, & m\in B,\\[2mm]
f_3(x,y) = 0, & \text{otherwise},
\end{cases}
\]
\[
f_{\mathrm{reg},m}(x,y) = \begin{cases}
\dfrac{M}{2|B_d|}\cdot L\,x^\top A_1 y + \dfrac{\varepsilon}{4\Omega_x^2}\|x\|^2 - \dfrac{\varepsilon}{4\Omega_y^2}\|y\|^2 - \dfrac{M}{|B_d|}\cdot\dfrac{2L^2\Omega_x\Omega_y}{\varepsilon}\,e_1^\top y, & m\in B_d,\\[2mm]
\dfrac{M}{2|B|}\cdot L\,x^\top A_2 y + \dfrac{\varepsilon}{4\Omega_x^2}\|x\|^2 - \dfrac{\varepsilon}{4\Omega_y^2}\|y\|^2, & m\in B,\\[2mm]
\dfrac{\varepsilon}{4\Omega_x^2}\|x\|^2 - \dfrac{\varepsilon}{4\Omega_y^2}\|y\|^2, & \text{otherwise}.
\end{cases}
\]
The following theorem is true:
Theorem D.6.
Let \(L > 0\), \(\Omega_x, \Omega_y > 0\), \(\chi \ge 1\), \(\varepsilon > 0\). Additionally, assume that \(L > \frac{\varepsilon}{2\Omega_x\Omega_y}\). Then there exists a distributed saddle-point problem of M functions (M is defined in the proof) with decentralized architecture and gossip matrix W for which the following statements are true:
• the gossip matrix W has \(\chi(W) = \chi\);
• \(f = \frac{1}{M}\sum_{m=1}^M f_m : \mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}\) is \(\big(\frac{\varepsilon}{2\Omega_x^2}, L, L, \frac{\varepsilon}{2\Omega_y^2}\big)\)-smooth and convex–concave;
• the size \(n \ge \max\big\{\log_q\big(\frac{\alpha}{\sqrt 2}\big),\ k\big\}\), where \(k = \big\lfloor\frac{T-2t}{t+\tau}\big\rfloor+2\), \(\alpha = \frac{\varepsilon^2}{4\Omega_x^2\Omega_y^2L^2}\) and \(q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big)\in(0,1)\).
Then, for any procedure satisfying Assumption D.1, the time to achieve accuracy ε of the solution in the final global output is bounded as
\[
T = \tilde\Omega\Big(\frac{L\,\Omega_x\Omega_y}{\varepsilon}\big(t+\sqrt{\chi}\,\tau\big)\Big).
\]

E Missing Proofs from Section 6
E.1 Proof of Theorem 6.2
Proof of Theorem 6.2. The smoothness properties of F(u,v) follow from Lemma 6.1. The bound on the duality gap follows from the theory of Mirror-Prox with proper \(R_U\), \(R_V\) and \(L_{uu}\), \(L_{uv}\), \(L_{vu}\), \(L_{vv}\). To estimate R, we calculate the \(\ell_2\)-norm of the gradient of the objective in (18):
\[
\|\nabla_p f(\mathbf{x},p^*,y)\|_2^2 = \Big\|\frac{2\|d\|_\infty}{m}\{[y_i]_{1\dots n}\}_{i=1}^m\Big\|_2^2 = \sum_{i=1}^m \frac{4\|d\|_\infty^2}{m^2}\|[y_i]_{1\dots n}\|_2^2 \le \frac{4\|d\|_\infty^2\, n}{m}.
\]
Substituting \(\nabla_p f(\mathbf{x},p^*,y)\) into (23) and using the fact that \(\lambda^+_{\min}\big(\frac{1}{m}\mathbf{W}\big) = \frac{1}{m}\lambda^+_{\min}(W)\), we obtain the estimate for R. The complexity of one iteration of Algorithm 4 per node is \(O(n^2)\), since the number of non-zero elements in the matrix A is \(2n^2\). Multiplying this by the number of iterations N and roughly estimating, we obtain the total per-node complexity \(O(n^2\sqrt{n\ln n}/\varepsilon)\), which finishes the proof. □

E.2 Proof of Lemma 6.1
Proof of Lemma 6.1. F(u,v) is bilinear, hence \(L_{uu} = L_{vv} = 0\). Next, we estimate \(L_{uv}\) and \(L_{vu}\). By the definition of \(L_{uv}\) and the spaces U, V we have
\[
\|\nabla_u F(u,v) - \nabla_u F(u,v')\|_{U^*} \le L_{uv}\,\|v-v'\|_V. \tag{48}
\]
From the definition of the dual norm, it follows that
\[
\|\nabla_u F(u,v) - \nabla_u F(u,v')\|_{U^*} = \max_{\|u\|_U\le 1}\langle u,\ \nabla_u F(u,v)-\nabla_u F(u,v')\rangle.
\]
From this and (48) we get
\[
\max_{\|u\|_U\le 1}\langle u,\ \nabla_u F(u,v)-\nabla_u F(u,v')\rangle \le L_{uv}\,\|v-v'\|_V. \tag{49}
\]
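Each u-, s-, x- and p-update in Algorithm 4 below is an entropic mirror step on a probability simplex: a multiplicative-weights update followed by normalization. A minimal sketch of this building block (the gradient vector `g` and the stepsize are illustrative placeholders, not the algorithm's actual values):

```python
import numpy as np

# Entropic mirror step on the simplex, the building block of the u-, s-, x-
# and p-updates of Algorithm 4: x <- x * exp(-gamma * g) / normalizer.
# Gradient g and stepsize gamma are illustrative placeholders.
def mirror_step(x, g, gamma):
    w = x * np.exp(-gamma * (g - g.max()))   # shift by max: numerical stability,
    return w / w.sum()                       # cancelled by the normalization

x = np.full(5, 1 / 5)                        # uniform starting point
g = np.array([1.0, 0.5, -0.2, 0.0, 2.0])
x_new = mirror_step(x, g, gamma=0.7)
```

The output stays a strictly positive probability vector, and coordinates with larger gradient entries are down-weighted, which is exactly the behavior of the exponentiated updates in the listing below.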
Algorithm 4
Distributed Mirror-Prox for Wasserstein Barycenters
Require: measures \(q_1,\dots,q_m\), linearized cost vector d, incidence matrix A, stepsize η, starting points \(p_i^0 = \mathbf{1}_n/n\), \(x_1^0 = \dots = x_m^0 = \mathbf{1}_{n^2}/n^2\), \(y_1^0 = \dots = y_m^0 = 0\), \(z_1^0 = \dots = z_m^0 = 0\); \(\alpha = 2\|d\|_\infty\,\eta\,(mn^2+R^2)/m\), \(\beta = 6\|d\|_\infty\eta/\ln n\), \(\gamma = 3\eta/\ln n\), \(\theta = \eta\,(mn^2+R^2)/m\).
for \(k = 1,\dots,N-1\) do
  for \(i = 1,\dots,m\) do
    \(u_i^{k+1} = \dfrac{x_i^k \odot \exp\{-\gamma(d + 2\|d\|_\infty A^\top y_i^k)\}}{\sum_{l}[x_i^k]_l \exp\{-\gamma([d]_l + 2\|d\|_\infty[A^\top y_i^k]_l)\}}\)
    \(s_i^{k+1} = \dfrac{p_i^k \odot \exp\{\beta[y_i^k]_{1\dots n} - \frac{\eta}{\ln n}\sum_{j=1}^m W_{ij}z_j^k\}}{\sum_{l=1}^n [p_i^k]_l \exp\{\beta[y_i^k]_l - \frac{\eta}{\ln n}[\sum_{j=1}^m W_{ij}z_j^k]_l\}}\)
    \(v_i^{k+1} = y_i^k + \alpha\big(Ax_i^k - \big(\begin{smallmatrix}p_i^k\\ q_i\end{smallmatrix}\big)\big)\); project \(v_i^{k+1}\) onto \([-1,1]^{2n}\)
    \(\lambda_i^{k+1} = z_i^k + \theta\sum_{j=1}^m W_{ij}p_j^k\)
    \(x_i^{k+1} = \dfrac{x_i^k \odot \exp\{-\gamma(d+2\|d\|_\infty A^\top v_i^{k+1})\}}{\sum_{l}[x_i^k]_l\exp\{-\gamma([d]_l+2\|d\|_\infty[A^\top v_i^{k+1}]_l)\}}\)
    \(p_i^{k+1} = \dfrac{p_i^k\odot\exp\{\beta[v_i^{k+1}]_{1\dots n} - \frac{\eta}{\ln n}\sum_{j=1}^m W_{ij}\lambda_j^{k+1}\}}{\sum_{l=1}^n [p_i^k]_l\exp\{\beta[v_i^{k+1}]_l - \frac{\eta}{\ln n}[\sum_{j=1}^m W_{ij}\lambda_j^{k+1}]_l\}}\)
    \(y_i^{k+1} = y_i^k + \alpha\big(Au_i^{k+1} - \big(\begin{smallmatrix}s_i^{k+1}\\ q_i\end{smallmatrix}\big)\big)\); project \(y_i^{k+1}\) onto \([-1,1]^{2n}\)
    \(z_i^{k+1} = z_i^k + \theta\sum_{j=1}^m W_{ij}s_j^{k+1}\)
  end for
end for
Ensure: \(\widetilde u = \frac{1}{N}\sum_{k=1}^N\big((u_1^k)^\top,\dots,(u_m^k)^\top,(s_1^k)^\top,\dots,(s_m^k)^\top\big)^\top\), \(\widetilde v = \frac{1}{N}\sum_{k=1}^N\big((v_1^k)^\top,\dots,(v_m^k)^\top,(\lambda_1^k)^\top,\dots,(\lambda_m^k)^\top\big)^\top\).

By the definition of \(F(\cdot)\) and \(U = X\times P\) we have
\[
\nabla_u F = \binom{\nabla_x F}{\nabla_p F} = \frac{1}{m}\binom{d + 2\|d\|_\infty A^\top y}{W^\top z - 2\|d\|_\infty\{[y_i]_{1\dots n}\}_{i=1}^m}.
\]
From this and \(V \triangleq Y\times Z\),
\[
\nabla_u F(u,v) - \nabla_u F(u,v') = \frac{1}{m}\binom{2\|d\|_\infty A^\top(y-y')}{W^\top(z-z') - 2\|d\|_\infty\{[y_i-y_i']_{1\dots n}\}_{i=1}^m} = \frac{1}{m}\begin{pmatrix} 2\|d\|_\infty A & -2\|d\|_\infty E\\ 0_{mn\times mn^2} & W\end{pmatrix}^\top\binom{y-y'}{z-z'},
\]
where
\(E \in \{0,1\}^{2mn\times mn}\) is the block-diagonal matrix
\[
E = \mathrm{diag}\Big(\binom{I_n}{0_{n\times n}},\dots,\binom{I_n}{0_{n\times n}}\Big).
\]
From this it follows that \(\nabla_u F(\cdot)\) is linear in \(v - v'\), so (49) can be rewritten as
\[
L_{uv} = \max_{\|v-v'\|_V\le 1}\ \max_{\|u\|_U\le 1}\Big\langle u,\ \frac{1}{m}\begin{pmatrix}2\|d\|_\infty A & -2\|d\|_\infty E\\ 0 & W\end{pmatrix}^\top (v-v')\Big\rangle. \tag{50}
\]
By the same arguments we obtain the same expression for \(L_{vu}\), up to a rearrangement of the maxima. Next, we use the fact that the \(\ell_2\)-norm is its own conjugate norm. From this and (50) it follows that
\[
L_{uv} = \max_{\|u\|_U\le 1}\frac{1}{m}\Big\|\begin{pmatrix}2\|d\|_\infty A & -2\|d\|_\infty E\\ 0 & W\end{pmatrix} u\Big\|_2. \tag{51}
\]
We consider
\[
\max_{\|u\|_U\le 1}\frac{1}{m}\Big\|\begin{pmatrix}2\|d\|_\infty A & -2\|d\|_\infty E\\ 0 & W\end{pmatrix}\binom{\mathbf{x}}{p}\Big\|_2 = \max_{\|u\|_U\le 1}\frac{1}{m}\Big\|\binom{2\|d\|_\infty(A\mathbf{x}-Ep)}{Wp}\Big\|_2 \le \frac{1}{m}\sqrt{4\|d\|_\infty^2\max_{\|\mathbf{x}\|_X+\|p\|_P\le 1}\|A\mathbf{x}-Ep\|_2^2 + \max_{\|p\|_P\le 1}\|Wp\|_2^2}. \tag{52}
\]
We consider the first term under the square root in (52):
\[
\|A\mathbf{x}-Ep\|_2^2 = \sum_{i=1}^m\Big\|Ax_i - \binom{p_i}{0_n}\Big\|_2^2 \le \sum_{i=1}^m\|Ax_i\|_2^2 + \sum_{i=1}^m\|p_i\|_2^2.
\]
(53) The last bound holds because \(\langle Ax_i, (p_i^\top, 0_n^\top)^\top\rangle \ge 0\), as the entries of A, \(x_i\), \(p_i\) are non-negative. Next we take the maximum in (53):
\[
\max_{\|\mathbf{x}\|_X+\|p\|_P\le 1}\|A\mathbf{x}-Ep\|_2^2 \le \max_{\sum_{i=1}^m(\|x_i\|_1^2+\|p_i\|_1^2)\le 1}\Big(\sum_{i=1}^m\|Ax_i\|_2^2 + \sum_{i=1}^m\|p_i\|_2^2\Big) = \max_{(\alpha,\beta)\in\Delta_{2m}}\Big(\sum_{i=1}^m\alpha_i\max_{\|x_i\|_1\le 1}\|Ax_i\|_2^2 + \sum_{i=1}^m\beta_i\max_{\|p_i\|_1\le 1}\|p_i\|_2^2\Big). \tag{54}
\]
By the definition of the incidence matrix A we get \(Ax_i = (h_1^\top, h_2^\top)^\top\), where \(h_1\) and \(h_2\) satisfy \(\mathbf{1}^\top h_1 = \mathbf{1}^\top h_2 = \sum_{j}[x_i]_j = 1\), since \(x_i \in \Delta_{n^2}\) for all \(i = 1,\dots,m\). Thus
\[
\|Ax_i\|_2^2 = \|h_1\|_2^2 + \|h_2\|_2^2 \le \|h_1\|_1^2 + \|h_2\|_1^2 = 2. \tag{55}
\]
Since \(p_i \in \Delta_n\) for all \(i = 1,\dots,m\), we have
\[
\max_{\|p_i\|_1\le 1}\|p_i\|_2^2 \le \max_{\|p_i\|_1\le 1}\|p_i\|_1^2 = 1. \tag{56}
\]
Using (55) and (56) in (54) we get
\[
\max_{\|\mathbf{x}\|_X+\|p\|_P\le 1}\|A\mathbf{x}-Ep\|_2^2 \le \max_{(\alpha,\beta)\in\Delta_{2m}}\Big(2\sum_{i=1}^m\alpha_i + \sum_{i=1}^m\beta_i\Big) \le 2. \tag{57}
\]
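Before bounding the second term of (52), note that the argument below rests on the Kronecker identity \(\lambda_{\max}(W\otimes I_n) = \lambda_{\max}(W)\), which can be checked directly (the small 3-node path Laplacian is an illustrative choice):

```python
import numpy as np

# Check lambda_max(W kron I_n) = lambda_max(W): eigenvalues of a Kronecker
# product are products of eigenvalues, and I_n contributes only ones.
# The 3-node path Laplacian below is an illustrative choice.
W = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
n = 4
W_big = np.kron(W, np.eye(n))                # the "bold" W acting block-wise

lam_W = np.max(np.linalg.eigvalsh(W))        # eigenvalues of P3 Laplacian: 0, 1, 3
lam_big = np.max(np.linalg.eigvalsh(W_big))
```

The same property gives \(\lambda^+_{\min}\big(\frac{1}{m}\mathbf{W}\big) = \frac{1}{m}\lambda^+_{\min}(W)\), used in the proof of Theorem 6.2.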
Now we consider the second term of the r.h.s. of (52):
\[
\max_{\|p\|_P\le 1}\|\mathbf{W}p\|_2 = \max_{\sum_{i=1}^m\|p_i\|_1^2\le 1}\|\mathbf{W}p\|_2. \tag{58}
\]
The set \(\{\sum_{i=1}^m\|p_i\|_1^2\le 1\}\) is contained in the set \(\{\|p\|_2\le 1\}\), since the cross-product terms of \(\|p_i\|_1^2\) are non-negative. Thus we can relax the constraint in (58) as follows:
\[
\max_{\sum_{i=1}^m\|p_i\|_1^2\le 1}\|\mathbf{W}p\|_2 \le \max_{\|p\|_2\le 1}\|\mathbf{W}p\|_2 \triangleq \lambda_{\max}(\mathbf{W}) = \lambda_{\max}(W). \tag{59}
\]
The last equality holds because \(\mathbf{W} \triangleq W\otimes I_n\), by the properties of the Kronecker product for eigenvalues. Using (57) and (59) in (52) for the estimation of \(L_{uv}\) from (51), we get
\[
L_{uv} = L_{vu} = \sqrt{8\|d\|_\infty^2 + \lambda_{\max}^2(W)}\,/\,m. \quad \square
\]

F Numerical Experiments
F.1 Additional Figures for Gaussian Distributions

Figure 4: Convergence of Algorithm 4 on different network architectures. (a) Convergence in the function value; here \(p^*\) is the true barycenter. (b) Convergence to the consensus (\(\mathbf{W}p = 0\)) in the network.

Figure 5: Convergence of Algorithm 4 on different network architectures on a logarithmic scale. (a) Convergence in the function value. (b) Convergence to the consensus (\(\mathbf{W}p = 0\)) in the network.
Figure 6: Convergence of Algorithm 4 on different network architectures on a logarithmic scale from … to … iterations. (a) Convergence in the function value. (b) Convergence to the consensus (\(\mathbf{W}p = 0\)) in the network.

F.2 The notMNIST dataset
Figure 7: Letter ’B’ in a variety of fonts from the notMNIST dataset
F.3 Network architectures
The complete graph, the cycle graph, the star graph, and the Erdős–Rényi random graph with probability of edge creation p = 0.5.
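The four topologies can be instantiated, for example, through their graph Laplacians; the following sketch uses illustrative sizes and a fixed seed for the Erdős–Rényi sample:

```python
import numpy as np

# Laplacians of the four topologies used in the experiments (n nodes).
# Sizes and the seed are illustrative; p = 0.5 matches the text.
def laplacian(adj):
    return np.diag(adj.sum(axis=1)) - adj

def complete(n):
    return np.ones((n, n)) - np.eye(n)

def cycle(n):
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
    return adj

def star(n):
    adj = np.zeros((n, n))
    adj[0, 1:] = adj[1:, 0] = 1
    return adj

def erdos_renyi(n, p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    adj = np.triu(rng.random((n, n)) < p, 1).astype(float)
    return adj + adj.T

L = laplacian(complete(6))
lam_max = np.linalg.eigvalsh(L).max()        # complete-graph Laplacian: lambda_max = n
zero_row_sum = np.abs(L.sum(axis=1)).max()   # Laplacian rows sum to zero
```

These Laplacians can serve directly as the gossip matrix W, and their spectra determine the \(\chi(W)\) entering the convergence bounds.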