Decentralized Distributed Optimization for Saddle Point Problems
Alexander Rogozin, Aleksandr Beznosikov, Darina Dvinskikh, Dmitry Kovalev, Pavel Dvurechensky, Alexander Gasnikov
Alexander Rogozin
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
[email protected]

Alexander Beznosikov
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
[email protected]

Darina Dvinskikh
Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany
[email protected]

Dmitry Kovalev
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
[email protected]

Pavel Dvurechensky
Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany
[email protected]

Alexander Gasnikov
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
[email protected]

February 12, 2021

ABSTRACT
We consider distributed convex-concave saddle point problems over arbitrary connected undirected networks and propose a decentralized distributed algorithm for their solution. The local functions distributed across the nodes are assumed to have global and local groups of variables. For the proposed algorithm we prove non-asymptotic convergence rate estimates with explicit dependence on the network characteristics. To supplement the convergence rate analysis, we propose lower bounds for strongly-convex-strongly-concave and convex-concave saddle-point problems over arbitrary connected undirected networks. We illustrate the considered problem setting by a particular application to distributed calculation of non-regularized Wasserstein barycenters.
In the last few years, interest in saddle-point problems has grown significantly. One reason is applications to GANs [12]. Another reason is the fruitful reformulation of some classes of convex problems as saddle-point problems, which allows one to obtain a solution with better properties or even to improve the wall-clock time [17, 18]. Thus, in many applications we solve the smooth convex-concave saddle-point problem

$$\min_{p,\,\{x_i\}_{i=1}^m} \ \max_{r,\,\{y_i\}_{i=1}^m} \ \sum_{i=1}^m f_i(x_i, p, y_i, r). \quad (1)$$

A particular case of this problem is the Wasserstein barycenter problem in its saddle-point representation [10]. This problem is computationally and memory costly. To overcome these obstacles, decentralized distributed approaches are used [30, 19]. For convex optimization, the theory of decentralized distributed first-order methods is by now well developed: lower bounds on the number of communication steps and oracle calls are known, and algorithms converging according to these lower bounds (up to logarithmic factors) have been developed; see [1, 21, 27, 28, 30, 14]. To the best of our knowledge, the theory still lacks the following:

• saddle-point generalizations (no lower bounds, no algorithms with provable bounds on communication steps and oracle calls per node);
• generalizations that allow mixed arguments (individual, like $x_i, y_i$, and common, like $p, r$);
• generalizations to non-Euclidean proximal setups.

Further motivated by [10], we describe how we close these gaps in the theory in the current paper.

Firstly, to present problem (1) as a problem that can be solved by a decentralized algorithm over a network, some preliminary work is required. The objective in (1) must be rewritten as a separable function with respect to (w.r.t.) the variables $p$ and $r$, which become different for each node of the network. To reach consensus (common agreement between the network nodes) on $p$ and $r$, we introduce affine constraints (separately for the $p$ and $r$ variables), which are then brought into the objective via the Lagrange multipliers principle. Thereby we obtain a composite saddle-point problem (with two bilinear composite terms). Moreover, we bound the dual variables using Slater's arguments.

An optimal method for convex-concave problems with smooth composite terms (i.e., having Lipschitz gradient), introduced above, is the Mirror-Prox algorithm [24, 25] with an arbitrary proximal setup. The algorithm can be performed in a decentralized manner; however, it was not known whether its optimality is preserved. In this paper, we prove that Mirror-Prox remains optimal even in the decentralized case w.r.t. the dependence on the desired accuracy $\varepsilon$, but not w.r.t. other parameters, such as the condition number of the communication network.

* The work of A. Rogozin and A. Beznosikov was supported by RFBR, project number 19-31-51001. The research of A. Gasnikov was supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005.
This is explained by the lack of techniques that split the complexity bounds into the number of communication steps and the number of oracle calls per node for saddle-point problems. Such a splitting was previously done only for (strongly) convex problems, by the sliding technique, which significantly exploits the composite structure of the problem [20, 13, 9]. For strongly monotone variational inequalities, we develop this technique in the Euclidean proximal setup.

Finally, we apply our results to the ubiquitous Wasserstein barycenter problem by generalizing the results of [10] to the decentralized distributed setup. We provide a decentralized distributed algorithm that calculates a Wasserstein barycenter w.r.t. the non-regularized Wasserstein distances. The main advantage of this approach is the possibility of calculating Wasserstein barycenters to any precision, whereas approaches based on entropic regularization of the Wasserstein distances [26] become unstable when a high precision $\varepsilon$ of solving the WB problem is desired (as the regularization parameter is proportional to $\varepsilon$).

Our contributions can be summarized as follows.

• In the Euclidean proximal setup, we obtain lower bounds on the number of communication steps and oracle calls per node for (strongly) convex-concave saddle-point problems.
• We provide a decentralized execution of the Mirror-Prox algorithm and prove its optimality w.r.t. the dependence on the desired accuracy $\varepsilon$ in a general proximal setup.
• We split the complexity bounds into the number of communication steps and the number of oracle calls per node for strongly monotone variational inequalities in the Euclidean proximal setup.
• We show the benefits of presenting some convex problems as convex-concave problems on the example of the Wasserstein barycenter problem, supported by numerical experiments.

Paper Organization.
This paper is organized as follows. First, in Section 2 we describe the decentralized saddle-point problem of interest and its reformulation based on Lagrange multipliers. The algorithm based on Mirror-Prox is presented in Section 3. Section 4 provides lower complexity bounds for saddle-point problems without individual variables, and in Section 5 we show how gradient sliding may be employed to separate the communication and computational complexities. Finally, we describe the application of our theory to the particular example of Wasserstein barycenters in Section 6 and finish the paper with concluding remarks in Section 7.
Notation.
For a prox-function $d(x)$, we define the corresponding Bregman divergence $B(x, y) = d(x) - d(y) - \langle \nabla d(y), x - y\rangle$. We denote by $I_n$ the identity matrix and by $0_{n\times n}$ the zero matrix of size $n \times n$. For a norm $\|\cdot\|$, we define the dual norm $\|\cdot\|_*$ in the usual way: $\|s\|_* = \max_x \{\langle x, s\rangle : \|x\| \le 1\}$. When functions such as $\log$ or $\exp$ are applied to vectors, they are applied element-wise. For two vectors $x, y$ of the same size, $x \odot y$ and $x / y$ stand for the element-wise product and the element-wise division, respectively. We use a bold symbol for the column vector $\mathbf{x} = [x_1^\top, \dots, x_m^\top]^\top \in \mathbb{R}^{mn}$, where $x_1, \dots, x_m \in \mathbb{R}^n$. We then refer to the $i$-th component of the vector $\mathbf{x}$ as $x_i \in \mathbb{R}^n$ and to the $j$-th component of the vector $x_i$ as $[x_i]_j$.
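For instance, the negative-entropy prox-function $d(x) = \langle x, \ln x\rangle$, used later for simplex-constrained variables, yields the (generalized) Kullback-Leibler divergence as its Bregman divergence. A minimal numpy sketch (illustrative only, not part of the paper's algorithms):

```python
import numpy as np

def bregman(d, grad_d, x, y):
    # B(x, y) = d(x) - d(y) - <grad d(y), x - y>
    return d(x) - d(y) - grad_d(y) @ (x - y)

neg_entropy = lambda x: np.sum(x * np.log(x))
grad_neg_entropy = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.5, 0.3, 0.2])
B = bregman(neg_entropy, grad_neg_entropy, x, y)
# For probability vectors this equals the KL divergence sum_i x_i log(x_i / y_i)
assert np.isclose(B, np.sum(x * np.log(x / y)))
assert B > 0  # Bregman divergences of a convex d are nonnegative
```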
We study a saddle-point problem of the form

$$\min_{p \in \bar P,\ \mathbf{x} \in \mathcal{X}} \ \max_{r \in \bar R,\ \mathbf{y} \in \mathcal{Y}} \ \sum_{i=1}^m f_i(x_i, p, y_i, r), \quad (2)$$

where $\mathbf{x} = (x_1^\top \dots x_m^\top)^\top$, $\mathbf{y} = (y_1^\top \dots y_m^\top)^\top$ and $\mathcal{X} = X_1 \times \dots \times X_m$, $\mathcal{Y} = Y_1 \times \dots \times Y_m$. The variables $x_i, p, y_i, r$ have dimensions $d_x, d_p, d_y, d_r$, respectively. For each node, the individual variables $x_i, y_i$ are restricted to the sets $X_i, Y_i$, respectively.

Assumption 2.1.
• The sets $X_i, Y_i$, $i = 1, \dots, m$, and $\bar P$, $\bar R$ are convex compacts.
• Each $f_i(\cdot, \cdot, y_i, r)$ is convex on $X_i \times \bar P$ for every fixed $y_i \in Y_i$, $r \in \bar R$.
• Each $f_i(x_i, p, \cdot, \cdot)$ is concave on $Y_i \times \bar R$ for every fixed $x_i \in X_i$, $p \in \bar P$.

Such problems arise, e.g., in decentralized Wasserstein barycenter computation [11].

We seek to solve problem (2) in the following decentralized distributed setup. Each $f_i$ is stored at a separate computational agent. The agents are connected by a connected undirected network represented by a fixed graph $\mathcal{G} = (V, E)$. A pair of agents $(i, j)$ can communicate iff $(i, j) \in E$. To deal with these communication constraints, we use the standard technique of decentralized distributed optimization [16] in the following way. Each agent $i$ stores local copies $p_i, r_i$ of the global variables $p$ and $r$, and the consensus constraints $p_1 = \dots = p_m$, $r_1 = \dots = r_m$ are imposed. The next step is to obtain a tractable reformulation of (2). In order to do this, we introduce a matrix $\bar W$ associated with the network and satisfying the following

Assumption 2.2.
• $\bar W$ is symmetric positive semi-definite.
• (Network compatibility) $[\bar W]_{ij} = 0$ if $(i, j) \notin E$ and $i \ne j$.
• (Kernel property) For any $v = [v_1, \dots, v_m]^\top \in \mathbb{R}^m$, $\bar W v = 0$ if and only if $v_1 = \dots = v_m$; in other words, $\mathrm{Ker}\,\bar W = \mathrm{span}\{\mathbf{1}\}$.
A typical example of a matrix satisfying this assumption is the Laplacian matrix $\bar W \in \mathbb{R}^{m \times m}$ of the graph $\mathcal{G}$, for which a) $[\bar W]_{ij} = -1$ if $(i, j) \in E$; b) $[\bar W]_{ij} = \deg(i)$ if $i = j$; c) $[\bar W]_{ij} = 0$ otherwise. Here $\deg(i)$ is the degree of node $i$, i.e., the number of its neighbors. The communication matrix is then defined as $W \triangleq \bar W \otimes I_d$, where $\otimes$ denotes the Kronecker product and $d$ is the dimension of the variables on which the affine constraints are imposed. Thanks to Assumption 2.2, we can equivalently reformulate the consensus constraints using two different communication matrices $W_p$ (with $d = d_p$) and $W_r$ (with $d = d_r$), each satisfying the definition above. This gives us the following reformulation of problem (2):

$$\min_{\substack{\mathbf{x} \in \mathcal{X},\ \mathbf{p} \in \mathcal{P}\\ W_p \mathbf{p} = 0}} \ \max_{\substack{\mathbf{y} \in \mathcal{Y},\ \mathbf{r} \in \mathcal{R}\\ W_r \mathbf{r} = 0}} \ \sum_{i=1}^m f_i(x_i, p_i, y_i, r_i). \quad (3)$$

Here we denote $\mathbf{p} = (p_1^\top \dots p_m^\top)^\top$, $\mathbf{r} = (r_1^\top \dots r_m^\top)^\top$ and $\mathcal{P} = \bar P \times \dots \times \bar P$, $\mathcal{R} = \bar R \times \dots \times \bar R$. Since we are in the large-scale regime of distributed optimization, with first-order methods being the methods of choice, we cannot afford projecting onto the linear constraints in problem (3). Thus, we use Lagrange multipliers to lift the affine constraints into the objective. Namely, we add Lagrange multipliers $\mathbf{u} \in \mathbb{R}^{m d_r}$ and $\mathbf{z} \in \mathbb{R}^{m d_p}$ corresponding to the constraints $W_r \mathbf{r} = 0$ and $W_p \mathbf{p} = 0$, respectively. For brevity we denote $F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) = \sum_{i=1}^m f_i(x_i, p_i, y_i, r_i)$.

Theorem 2.3.
Problem (2) can be equivalently rewritten as

$$\min_{\substack{\mathbf{x}\in\mathcal{X},\ \mathbf{p}\in\mathcal{P}\\ \mathbf{u}\in\mathbb{R}^{m d_r}}} \ \max_{\substack{\mathbf{y}\in\mathcal{Y},\ \mathbf{r}\in\mathcal{R}\\ \mathbf{z}\in\mathbb{R}^{m d_p}}} \ \big[\, F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{u}, W_r \mathbf{r}\rangle + \langle \mathbf{z}, W_p \mathbf{p}\rangle \,\big] \quad (4)$$

in the sense that for any saddle point $(\mathbf{x}^*, \mathbf{p}^*, \mathbf{y}^*, \mathbf{r}^*, \mathbf{u}^*, \mathbf{z}^*)$ of (4), the point $(\mathbf{x}^*, \mathbf{p}^*, \mathbf{y}^*, \mathbf{r}^*)$ is a saddle point of (2).

The proof of Theorem 2.3 is based on classical duality theory and a careful application of the Sion-Kakutani theorem to swap the min and max operations. The details are given in Appendix A.
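As an illustration of the communication matrices entering reformulations (3) and (4), the graph Laplacian can be built directly from the edge list and the kernel property of Assumption 2.2 verified numerically. A sketch with numpy (the 4-node cycle graph here is an arbitrary example):

```python
import numpy as np

def graph_laplacian(m, edges):
    # [W]_ij = -1 for (i, j) in E, deg(i) on the diagonal, 0 otherwise
    W = np.zeros((m, m))
    for i, j in edges:
        W[i, j] = W[j, i] = -1.0
        W[i, i] += 1.0
        W[j, j] += 1.0
    return W

# 4-node cycle graph
W_bar = graph_laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
# Kernel property: W_bar v = 0 exactly on consensus vectors v_1 = ... = v_m
assert np.allclose(W_bar @ np.ones(4), 0.0)
assert not np.allclose(W_bar @ np.array([1.0, 0.0, 0.0, 0.0]), 0.0)
# Communication matrix acting on stacked d-dimensional local variables
d = 2
W_comm = np.kron(W_bar, np.eye(d))
p = np.tile(np.array([0.3, 0.7]), 4)   # identical local copies -> consensus
assert np.allclose(W_comm @ p, 0.0)
```

Since the Laplacian has zero entries for non-adjacent pairs, multiplying by `W_comm` only mixes variables of neighboring nodes, which is what makes the affine constraints implementable with local communications.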
In this section we present our main results on the decentralized distributed algorithm for saddle-point problem (2). The pseudo-code is listed as Algorithm 1, and the main convergence rate result is given in Theorem 3.6. The main idea is to use reformulation (4) and apply the Mirror-Prox algorithm [24] for its solution. This requires careful analysis in two respects. First, the Lagrange multipliers $\mathbf{z}, \mathbf{u}$ are not constrained, while the convergence rate result for the classical Mirror-Prox algorithm [24] is proved for problems on compact sets. Second, we need to show that the updates can be organized via only local communications between the nodes of the network.

We start presenting the algorithm by introducing the necessary definitions and notation, chief among which is the proximal setup [3]. Let us fix an arbitrary agent $i$. For each variable $t_i \in \{x_i, p_i, y_i, r_i\}$ we assume that there is a norm $\|t_i\|_{t;i}$, a prox-function $d_{t;i}(t_i)$ that is 1-strongly convex w.r.t. this norm, and the corresponding Bregman divergence $B_{t;i}(t_i, \breve t_i)$. For the variables $u_i, z_i$ we introduce the Euclidean norms $\|u_i\|_{u;i} = \|u_i\|_2$, $\|z_i\|_{z;i} = \|z_i\|_2$, the corresponding prox-functions $d_{u;i}(u_i) = \frac{1}{2}\|u_i\|_2^2$, $d_{z;i}(z_i) = \frac{1}{2}\|z_i\|_2^2$, and the Bregman divergences $B_{u;i}(u_i, \breve u_i) = \frac{1}{2}\|u_i - \breve u_i\|_2^2$, $B_{z;i}(z_i, \breve z_i) = \frac{1}{2}\|z_i - \breve z_i\|_2^2$. These objects allow us to define the following Mirror step:
$$\mathrm{Mirr}(g_i;\, t_i;\, \mathcal{T}_i) = \arg\min_{t \in \mathcal{T}_i} \big[\, \alpha \langle g_i, t\rangle + B_{t;i}(t, t_i) \,\big], \quad (5)$$

where $t_i \in \{x_i, p_i, u_i, y_i, r_i, z_i\}$, $\mathcal{T}_i \in \{X_i, \bar P, \mathbb{R}^{d_r}, Y_i, \bar R, \mathbb{R}^{d_p}\}$, $g_i$ (defining the step direction) is an element of the corresponding dual space, and $\alpha > 0$ is the stepsize.

Our decentralized distributed algorithm for saddle-point problem (4) is listed as Algorithm 1. In each iteration, every agent $i$ makes two Mirror-step updates in each of its six local variables. Besides using the respective gradient of the local objective $f_i$ to define the step direction, some updates aggregate the variables of other agents. Consider, for example, the update of $p_i^{k+1/2}$, in which agent $i$ needs to compute the sum $\sum_j [W_p]_{ij} z_j^k$. Thanks to Assumption 2.2, i.e., the network compatibility of the matrix $W_p$, this update requires aggregating local variables only from neighbouring agents. Thus, each iteration of the algorithm is performed in a decentralized distributed manner. In Algorithm 1, we denote $\nabla_t f_i^\ell = \nabla_t f_i(x_i^\ell, p_i^\ell, y_i^\ell, r_i^\ell)$ for $t \in \{x, p, y, r\}$ and $\ell \in \{k, k+1/2\}$.

In the next two subsections, we introduce two components needed to prove the convergence rate theorem for our algorithm: smoothness assumptions and a localization of the solution of the saddle-point problem (4).

In this subsection, we address one of the challenges of applying standard Mirror-Prox to our problem (4). As noted above, the standard analysis of Mirror-Prox requires the feasible sets to be compact. This is not the case for problem (4), since the variables $\mathbf{u}$ and $\mathbf{z}$ are unconstrained. We overcome this issue by localizing the saddle-point components $\widehat{\mathbf{u}}$ and $\widehat{\mathbf{z}}$ on bounded sets. The corresponding result is formulated in Lemma 3.3. Our analysis builds upon duality theory for convex minimization problems.
The next lemma shows that it suffices to seek a solution of the dual problem on a bounded set. In other words, putting specific constraints on the dual variables still yields an equivalent saddle-point reformulation.

Lemma 3.1.
Let $\Theta \subseteq \mathbb{R}^d$ be a compact convex set and $h(\theta): \Theta \to \mathbb{R}$ a convex differentiable function. Consider a problem with affine constraints

$$\min_{\theta \in \Theta} h(\theta) \quad \text{s.t.} \quad A\theta = b, \quad (6)$$

where $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$. Denote an optimal solution of (6) by $\theta^*$ and introduce the dual function $\varphi(\nu) = \min_{\theta \in \Theta} [h(\theta) + \langle \nu, A\theta - b\rangle]$. Then:

1. [22] The dual problem has a solution $\nu^*$ such that $\|\nu^*\|_2 \le \frac{\|\nabla h(\theta^*)\|_2}{\sigma_{\min}^+(A)} =: R_\nu$, where $\sigma_{\min}^+(A)$ denotes the minimal non-zero singular value of $A$.

2. Let $R > R_\nu$ and consider the constrained dual problem

$$\max_{\|\nu\|_2 \le R} \ \min_{\theta \in \Theta} \big[ h(\theta) + \langle \nu, A\theta - b\rangle \big] = \max_{\|\nu\|_2 \le R} \varphi(\nu). \quad (7)$$
Algorithm 1
Decentralized Mirror-Prox

for $k = 0, 1, \dots$ do
  Node $i$ does:
  Send $z_i^k, u_i^k$ to the neighbors $N(i)$
  Compute
    $x_i^{k+1/2} = \mathrm{Mirr}\big(\nabla_x f_i^k;\ x_i^k;\ X_i\big)$
    $p_i^{k+1/2} = \mathrm{Mirr}\big(\nabla_p f_i^k + \textstyle\sum_j [W_p]_{ij} z_j^k;\ p_i^k;\ \bar P\big)$
    $y_i^{k+1/2} = \mathrm{Mirr}\big(-\nabla_y f_i^k;\ y_i^k;\ Y_i\big)$
    $r_i^{k+1/2} = \mathrm{Mirr}\big(-\nabla_r f_i^k - \textstyle\sum_j [W_r]_{ij} u_j^k;\ r_i^k;\ \bar R\big)$
    $u_i^{k+1/2} = \mathrm{Mirr}\big(\textstyle\sum_j [W_r]_{ij} r_j^k;\ u_i^k;\ \mathbb{R}^{d_r}\big)$
    $z_i^{k+1/2} = \mathrm{Mirr}\big(-\textstyle\sum_j [W_p]_{ij} p_j^k;\ z_i^k;\ \mathbb{R}^{d_p}\big)$
  Send $z_i^{k+1/2}, u_i^{k+1/2}$ to the neighbors $N(i)$
  Compute
    $x_i^{k+1} = \mathrm{Mirr}\big(\nabla_x f_i^{k+1/2};\ x_i^k;\ X_i\big)$
    $p_i^{k+1} = \mathrm{Mirr}\big(\nabla_p f_i^{k+1/2} + \textstyle\sum_j [W_p]_{ij} z_j^{k+1/2};\ p_i^k;\ \bar P\big)$
    $y_i^{k+1} = \mathrm{Mirr}\big(-\nabla_y f_i^{k+1/2};\ y_i^k;\ Y_i\big)$
    $r_i^{k+1} = \mathrm{Mirr}\big(-\nabla_r f_i^{k+1/2} - \textstyle\sum_j [W_r]_{ij} u_j^{k+1/2};\ r_i^k;\ \bar R\big)$
    $u_i^{k+1} = \mathrm{Mirr}\big(\textstyle\sum_j [W_r]_{ij} r_j^{k+1/2};\ u_i^k;\ \mathbb{R}^{d_r}\big)$
    $z_i^{k+1} = \mathrm{Mirr}\big(-\textstyle\sum_j [W_p]_{ij} p_j^{k+1/2};\ z_i^k;\ \mathbb{R}^{d_p}\big)$
  For $t_i \in \{x_i, p_i, u_i, y_i, r_i, z_i\}$ compute $\widehat t_i^{\,k} = \frac{1}{k} \sum_{\ell=0}^{k-1} t_i^{\ell+1/2}$.
end for

If $(\tilde\theta, \tilde\nu)$ is a saddle point of (7), then $\tilde\theta$ is a solution of (6).

Proof details for this lemma are given in Appendix A. Lemma 3.1 allows us to localize the $\widehat{\mathbf{u}}$ and $\widehat{\mathbf{z}}$ components of a saddle point of (4) on Euclidean balls. In order to estimate the radius of each ball, we need the following assumption.

Assumption 3.2.
There exist positive scalars $M_p, M_r$ such that for all feasible $\mathbf{x} \in \mathcal{X}$, $\mathbf{p} \in \mathcal{P}$, $\mathbf{y} \in \mathcal{Y}$, $\mathbf{r} \in \mathcal{R}$ it holds that $\|\nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})\|_2 \le M_p$ and $\|\nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})\|_2 \le M_r$.

Building upon Lemma 3.1, we obtain the following result.
Lemma 3.3.
Let Assumption 3.2 hold and let

$$R_z = \frac{\alpha M_p}{\lambda_{\min}^+(W_p)}, \qquad R_u = \frac{\beta M_r}{\lambda_{\min}^+(W_r)}, \quad (8)$$

where $\alpha, \beta > 1$ are arbitrary constants and $\lambda_{\min}^+(\cdot)$ denotes the minimal non-zero eigenvalue of a matrix. Then there exists a saddle point $(\mathbf{x}^*, \mathbf{p}^*, \mathbf{y}^*, \mathbf{r}^*, \mathbf{u}^*, \mathbf{z}^*)$ of problem (4) such that $\|\mathbf{u}^*\|_2 \le R_u$ and $\|\mathbf{z}^*\|_2 \le R_z$.

The proof of Lemma 3.3 is provided in Appendix A.
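The radii in (8) are computable from the spectrum of the communication matrices. A small sketch (the path-graph Laplacian below is a stand-in example, and the constants `alpha`, `M_p` are placeholder values, not taken from the paper):

```python
import numpy as np

def lambda_min_plus(W):
    # minimal non-zero eigenvalue of a symmetric PSD matrix
    ev = np.linalg.eigvalsh(W)
    return ev[ev > 1e-9].min()

# Laplacian of a 3-node path graph as a toy communication matrix
W_p = np.array([[ 1.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  1.0]])
alpha, M_p = 2.0, 1.0          # placeholder constants for illustration
R_z = alpha * M_p / lambda_min_plus(W_p)
assert np.isclose(lambda_min_plus(W_p), 1.0)  # spectrum of the path-3 Laplacian: {0, 1, 3}
assert R_z > 0
```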
In this subsection we introduce the second important component of the convergence rate analysis, namely the smoothness assumption on the objective $F$. To set the stage, we first introduce a general definition of a Lipschitz-smooth function of two variables.

Definition 3.4.
A continuously differentiable function $G(\mathbf{v}, \mathbf{w})$ is $(L_{\mathbf{vv}}, L_{\mathbf{vw}}, L_{\mathbf{wv}}, L_{\mathbf{ww}})$-smooth if for any $\mathbf{v}, \mathbf{v}' \in \mathcal{V}$ and $\mathbf{w}, \mathbf{w}' \in \mathcal{W}$:

$$\|\nabla_{\mathbf{v}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{v}} G(\mathbf{v}', \mathbf{w})\|_{\mathbf{v},*} \le L_{\mathbf{vv}} \|\mathbf{v} - \mathbf{v}'\|_{\mathbf{v}},$$
$$\|\nabla_{\mathbf{v}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{v}} G(\mathbf{v}, \mathbf{w}')\|_{\mathbf{v},*} \le L_{\mathbf{vw}} \|\mathbf{w} - \mathbf{w}'\|_{\mathbf{w}},$$
$$\|\nabla_{\mathbf{w}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{w}} G(\mathbf{v}', \mathbf{w})\|_{\mathbf{w},*} \le L_{\mathbf{wv}} \|\mathbf{v} - \mathbf{v}'\|_{\mathbf{v}},$$
$$\|\nabla_{\mathbf{w}} G(\mathbf{v}, \mathbf{w}) - \nabla_{\mathbf{w}} G(\mathbf{v}, \mathbf{w}')\|_{\mathbf{w},*} \le L_{\mathbf{ww}} \|\mathbf{w} - \mathbf{w}'\|_{\mathbf{w}}.$$

To apply this definition in the context of problem (4), we aggregate the local norms for each variable over the agents. Formally, for $\mathbf{t} \in \{\mathbf{x}, \mathbf{p}, \mathbf{u}, \mathbf{y}, \mathbf{r}, \mathbf{z}\}$ we define $\|\mathbf{t}\|_{\mathbf{t}} = \sum_{i=1}^m \|t_i\|_{t;i}$. For the spaces of the vectors $(\mathbf{x}, \mathbf{p})$ and $(\mathbf{y}, \mathbf{r})$ we introduce, respectively, the norms $\|(\mathbf{x}, \mathbf{p})\|_{(\mathbf{x},\mathbf{p})} = \|\mathbf{x}\|_{\mathbf{x}} + \|\mathbf{p}\|_{\mathbf{p}}$ and $\|(\mathbf{y}, \mathbf{r})\|_{(\mathbf{y},\mathbf{r})} = \|\mathbf{y}\|_{\mathbf{y}} + \|\mathbf{r}\|_{\mathbf{r}}$, and the corresponding dual norms. Now, setting $\mathbf{v} = (\mathbf{x}, \mathbf{p})$, $\mathbf{w} = (\mathbf{y}, \mathbf{r})$, we make the following smoothness assumption on $F$.

Assumption 3.5.
The function $F$ in (4) is $(L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})}, L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}, L_{(\mathbf{y},\mathbf{r})(\mathbf{x},\mathbf{p})}, L_{(\mathbf{y},\mathbf{r})(\mathbf{y},\mathbf{r})})$-smooth.

For simplicity of the further derivations, it is convenient to aggregate all the variables into two blocks: minimization variables and maximization variables. To do that, we define $\xi = (\mathbf{x}^\top, \mathbf{p}^\top, \mathbf{u}^\top)^\top$ and $\eta = (\mathbf{y}^\top, \mathbf{r}^\top, \mathbf{z}^\top)^\top$, and $Q_\xi = \mathcal{X} \times \mathcal{P} \times \mathbb{R}^{m d_r}$, $Q_\eta = \mathcal{Y} \times \mathcal{R} \times \mathbb{R}^{m d_p}$. This allows us to write problem (4) in the following simplified form:

$$\min_{\xi \in Q_\xi} \ \max_{\eta \in Q_\eta} \ S(\xi, \eta), \quad (9)$$

where $S(\cdot, \eta)$ is convex for any fixed $\eta$, $S(\xi, \cdot)$ is concave for any fixed $\xi$, and the sets $Q_\xi, Q_\eta$ are closed and convex, but not bounded. Further, we introduce the aggregated norms $\|\xi\|_\xi = \|\mathbf{x}\|_{\mathbf{x}} + \|\mathbf{p}\|_{\mathbf{p}} + \|\mathbf{u}\|_2$ and $\|\eta\|_\eta = \|\mathbf{y}\|_{\mathbf{y}} + \|\mathbf{r}\|_{\mathbf{r}} + \|\mathbf{z}\|_2$. Similarly, we aggregate the prox-functions $d_\xi(\xi) = \sum_{i=1}^m (d_{x;i}(x_i) + d_{p;i}(p_i) + d_{u;i}(u_i))$ and $d_\eta(\eta) = \sum_{i=1}^m (d_{y;i}(y_i) + d_{r;i}(r_i) + d_{z;i}(z_i))$, which induce the corresponding Bregman divergences $B_\xi(\xi, \breve\xi)$, $B_\eta(\eta, \breve\eta)$. Under these definitions, the function $S(\xi, \eta)$ in (9) has the following Lipschitz constants (the detailed derivations are given in Appendix B):

$$L_{\xi\xi} = L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})}, \qquad L_{\eta\eta} = L_{(\mathbf{y},\mathbf{r})(\mathbf{y},\mathbf{r})},$$
$$L_{\xi\eta} = 2\max\big\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})},\ \|W_p\|_{2\to(\mathbf{p},*)},\ \|W_r\|_{\mathbf{r}\to 2} \big\},$$
$$L_{\eta\xi} = 2\max\big\{ L_{(\mathbf{y},\mathbf{r})(\mathbf{x},\mathbf{p})},\ \|W_r\|_{2\to(\mathbf{r},*)},\ \|W_p\|_{\mathbf{p}\to 2} \big\}.$$

Let us now discuss the second main aspect of the analysis, i.e., the unboundedness of the feasible sets $Q_\xi, Q_\eta$, which does not allow us to directly apply the standard analysis of [24].
Thanks to the special structure of problem (4) and our Assumption 3.2, we have at our disposal the bounds $R_z, R_u$ defined in (8) such that the optimal values $\mathbf{z}^*$ and $\mathbf{u}^*$ satisfy $\|\mathbf{z}^*\|_2 \le R_z$, $\|\mathbf{u}^*\|_2 \le R_u$. Thus, if we define $C_\xi = \mathcal{X} \times \mathcal{P} \times \{\mathbf{u}: \|\mathbf{u}\|_2 \le R_u\}$ and $C_\eta = \mathcal{Y} \times \mathcal{R} \times \{\mathbf{z}: \|\mathbf{z}\|_2 \le R_z\}$, then the saddle point $(\xi^*, \eta^*)$ belongs to $C_\xi \times C_\eta$. Let us define $R_\xi = \max_{\xi \in C_\xi} d_\xi(\xi) - \min_{\xi \in C_\xi} d_\xi(\xi)$ and $R_\eta = \max_{\eta \in C_\eta} d_\eta(\eta) - \min_{\eta \in C_\eta} d_\eta(\eta)$. Note that since $C_\xi, C_\eta$ are compact, the values $R_\xi, R_\eta$ are well-defined. Finally, let us define

$$L_\zeta = 2\max\big\{ L_{\xi\xi} R_\xi,\ L_{\eta\eta} R_\eta,\ L_{\xi\eta} R_\xi R_\eta,\ L_{\eta\xi} R_\xi R_\eta \big\}.$$

We are now in a position to state the main result.
Theorem 3.6.
Let Assumptions 2.1, 2.2, 3.2, and 3.5 hold. Then, after $N$ steps of Algorithm 1 with the stepsize $\alpha = 1/L_\zeta$, we have

$$\max_{\eta \in C_\eta} S(\widehat\xi^N, \eta) - \min_{\xi \in C_\xi} S(\xi, \widehat\eta^N) \le \frac{L_\zeta}{N}.$$
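As a sanity check of Definition 3.4, which underlies the smoothness constants entering Theorem 3.6, note that a purely bilinear coupling $G(v, w) = v^\top M w$ in Euclidean norms is $(0, \|M\|_2, \|M\|_2, 0)$-smooth. A quick numerical verification (an illustrative sketch, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
L_vw = np.linalg.norm(M, 2)          # spectral norm of M

grad_v = lambda w: M @ w             # grad_v G(v, w) = M w is independent of v => L_vv = 0
w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
lhs = np.linalg.norm(grad_v(w1) - grad_v(w2))
# Cross-smoothness inequality of Definition 3.4 with constant L_vw = ||M||_2
assert lhs <= L_vw * np.linalg.norm(w1 - w2) + 1e-12
```

The bilinear terms $\langle \mathbf{u}, W_r\mathbf{r}\rangle$ and $\langle \mathbf{z}, W_p\mathbf{p}\rangle$ in (4) are exactly of this type, which is why the operator norms of $W_p$, $W_r$ appear in $L_{\xi\eta}$ and $L_{\eta\xi}$.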
In this section we consider a simplified particular case of problem (2), i.e., a saddle-point problem of the form

$$\min_x \max_y \ f(x, y) = \frac{1}{m}\sum_{i=1}^m f_i(x, y), \quad (10)$$

where $f(x, y)$ is $(L_{xx}, L_{xy}, L_{yx}, L_{yy})$-smooth. Additionally, we assume that $f(x, y)$ is either $\mu_x, \mu_y$-strongly-convex-strongly-concave (i.e., $f$ is strongly convex in $x$ and strongly concave in $y$) or convex-concave (the degenerate case $\mu_x = \mu_y = 0$).

Our lower bounds are valid for a certain class of algorithms (see Assumption D.1 in the Appendix), which we define similarly to [27]. Most decentralized algorithms use the matrix $W$ (see Assumption 2.2) for attaining consensus, and the convergence estimates of these algorithms are tied to the properties of $W$. Our lower bounds involve the condition number $\chi(W) = \lambda_{\max}(W)/\lambda_{\min}^+(W)$ of this matrix. We assume that one local iteration takes time $t$ and one communication round takes time $\tau$. Then the following two theorems provide lower bounds on the total running time $T$.

Theorem 4.1.
Let $L, \mu_x, \mu_y > 0$, $\chi \ge 1$, $\varepsilon > 0$. Additionally, we assume that $L > \sqrt{\mu_x \mu_y}$. Then there exists a distributed saddle-point problem with decentralised architecture and gossip matrix $W$ such that the following statements are true: 1) the gossip matrix $W$ has $\chi(W) = \chi$; 2) $f: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is $(\mu_x, L, L, \mu_y)$-smooth and $\mu_x, \mu_y$-strongly-convex-strongly-concave; 3) the size $n \ge \max\{\log_q(\alpha/\sqrt{2}),\, k\}$, where $k = \lfloor (T - t)/(t + \tau)\rfloor + 2$, $\alpha = \mu_x \mu_y / L^2$, and $q = \frac{1}{2}\big(2 + \alpha - \sqrt{\alpha^2 + 4\alpha}\big) \in (0, 1)$. Then for any procedure satisfying Assumption D.1, one can bound the time to achieve $\varepsilon$-accuracy of the solution in the final global output:

$$T = \Omega\left( \frac{L}{\sqrt{\mu_x \mu_y}}\big(t + \sqrt{\chi}\,\tau\big)\ln(1/\varepsilon) \right).$$

Theorem 4.2.
Let $L > 0$, $\Omega_x, \Omega_y > 0$, $\chi \ge 1$, $\varepsilon > 0$. Additionally, we assume that $L > \varepsilon/(\Omega_x \Omega_y)$. Then there exists a distributed saddle-point problem with decentralised architecture and gossip matrix $W$ such that: 1) the gossip matrix $W$ has $\chi(W) = \chi$; 2) $f: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is $(\varepsilon/\Omega_x^2, L, L, \varepsilon/\Omega_y^2)$-smooth and convex-concave; 3) the size $n \ge \max\{\log_q(\alpha/\sqrt{2}),\, k\}$, where $k = \lfloor (T - t)/(t + \tau)\rfloor + 2$, $\alpha = \varepsilon/(\Omega_x \Omega_y L)$, and $q = \frac{1}{2}\big(2 + \alpha - \sqrt{\alpha^2 + 4\alpha}\big) \in (0, 1)$. Then for any procedure satisfying Assumption D.1, one can bound the time to achieve accuracy $\varepsilon$ of the solution in the final global output:

$$T = \widetilde\Omega\left( \frac{L\,\Omega_x \Omega_y}{\varepsilon}\big(t + \sqrt{\chi}\,\tau\big) \right).$$

The saddle-point problem defined in (9) may be viewed as a particular instance of the more general problem of solving a strong variational inequality (VI). That is, given a closed convex subset $Q \subset \mathbb{R}^d$ and an operator $g: Q \to \mathbb{R}^d$, we need to find a point $\zeta^* \in Q$ such that

$$\langle g(\zeta^*), \zeta - \zeta^* \rangle \ge 0 \quad \text{for all } \zeta \in Q. \quad (11)$$

We are interested in the case in which the operator $g(\zeta)$ is defined as a sum of two operators $A: Q \to \mathbb{R}^d$ and $B: Q \to \mathbb{R}^d$:

$$g(\zeta) = A(\zeta) + B(\zeta). \quad (12)$$

We assume the operators $A$ and $B$ to be monotone and $L_A$- and $L_B$-Lipschitz, respectively. That is, for all $\zeta, \theta \in Q$,

$$\langle A(\zeta) - A(\theta), \zeta - \theta \rangle \ge 0, \qquad \|A(\zeta) - A(\theta)\|_2 \le L_A \|\zeta - \theta\|_2, \quad (13a)$$
$$\langle B(\zeta) - B(\theta), \zeta - \theta \rangle \ge 0, \qquad \|B(\zeta) - B(\theta)\|_2 \le L_B \|\zeta - \theta\|_2. \quad (13b)$$

Moreover, the operator $g$ is assumed to be $\mu$-strongly monotone ($\mu > 0$), i.e., for all $\zeta, \theta \in Q$ it holds that

$$\langle g(\zeta) - g(\theta), \zeta - \theta \rangle \ge \mu \|\zeta - \theta\|_2^2. \quad (14)$$
Algorithm 2 Sliding

Require: Initial guess $\zeta^0 \in Q$, step-size $\eta > 0$
for $k = 0, 1, 2, \dots$ do
  $\nu^k = \zeta^k - \eta A(\zeta^k)$
  Find $\theta^k \in Q$ such that $\theta^k \approx \widehat\theta^k$, where $\widehat\theta^k \in Q$ is a solution to the variational inequality
    $\langle \eta B(\widehat\theta^k) + \widehat\theta^k - \nu^k,\ \zeta - \widehat\theta^k \rangle \ge 0$ for all $\zeta \in Q$.  (15)
  $\omega^k = \theta^k + \eta\,\big(A(\zeta^k) - A(\theta^k)\big)$
  $\zeta^{k+1} = \mathrm{Proj}_Q(\omega^k)$
end for

Problem (15) is solved with the Forward-Backward-Forward algorithm [29], which is run for $T$ iterations starting from the point $\zeta^k$ until accuracy $\delta > 0$ is reached. The choices of $\delta$ and $T$ are discussed in Appendix C.

Theorem 5.1.
Without loss of generality, assume $L_A \le L_B$. The total number of computations of $A(\zeta)$ in Algorithm 2 is

$$N_A = N = O\left( \frac{L_A}{\mu}\,\log \frac{\|\zeta^0 - \zeta^*\|_2^2}{\epsilon} \right).$$

The total number of computations of $B(\zeta)$ is

$$N_B = N \times T = O\left( \frac{L_B}{\mu}\,\log\frac{1}{\delta}\,\log \frac{\|\zeta^0 - \zeta^*\|_2^2}{\epsilon} \right).$$

In Appendix C, we show how Algorithm 2 allows one to split the oracle and communication complexities for a decentralized solution of saddle-point problem (4).
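A minimal sketch of the sliding scheme in Euclidean space ($Q = \mathbb{R}^d$, so the projection is the identity). For simplicity, the inner Forward-Backward-Forward solver of (15) is replaced here by a plain fixed-point iteration, which is valid when $\eta L_B < 1$; the operators and stepsizes below are illustrative assumptions, not the paper's test problem:

```python
import numpy as np

def solve_inner(B, nu, eta, iters=50):
    # Approximately solve theta + eta*B(theta) = nu (VI (15) with Q = R^d)
    # by fixed-point iteration; converges when eta * L_B < 1.
    theta = nu.copy()
    for _ in range(iters):
        theta = nu - eta * B(theta)
    return theta

def sliding(A, B, zeta0, eta, outer=60):
    zeta = zeta0.copy()
    for _ in range(outer):
        nu = zeta - eta * A(zeta)                    # forward step on A
        theta = solve_inner(B, nu, eta)              # inexact inner VI solve
        zeta = theta + eta * (A(zeta) - A(theta))    # correction step (Proj_Q = identity)
    return zeta

# Toy strongly monotone g = A + B: A skew-symmetric (monotone), B = identity (1-strongly monotone)
Askew = np.array([[0.0, 1.0], [-1.0, 0.0]])
A = lambda z: Askew @ z
B = lambda z: z
zeta = sliding(A, B, np.array([1.0, 1.0]), eta=0.5)
assert np.linalg.norm(zeta) < 1e-8   # the unique solution of this VI is zeta* = 0
```

Note how the loop matches the theorem's accounting: $A$ is evaluated twice per outer iteration, while $B$ is evaluated only inside the inner solver, which is what allows the two oracle complexities to be split.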
Our results have various applications, including the ubiquitous Wasserstein barycenter (WB) problem, which can be reformulated as a saddle-point problem of the form (2) and solved over a network of agents. Wasserstein distances and Wasserstein barycenters form a successful framework for operating with probability distributions, and with objects that can be represented as probability distributions (e.g., images, videos, texts), in various fields such as machine learning, statistics, economics, and finance. However, heavy computations are behind this superiority, and to reduce the computational complexity, entropic regularization [7] was proposed. Regularization-based methods, e.g., the celebrated Iterative Bregman Projections (IBP) algorithm [4], are however numerically unstable when one needs to solve the WB problem to a high precision $\varepsilon$ (as the regularization parameter is chosen proportionally to $\varepsilon$). For this case, the Mirror-Prox algorithm can be used, which finds a WB with respect to the original (non-regularized) Wasserstein distances. Using distributed computations decreases the problem complexity by sharing it between computational nodes.

We start with the preliminaries for the WB problem and reformulate it as a saddle-point problem. For given histograms $q_1, q_2, \dots, q_m$ from the probability simplex $\Delta_n$, a WB of these measures is a solution of the following finite-dimensional optimization problem:

$$p^* = \arg\min_{p \in \Delta_n} \frac{1}{m}\sum_{i=1}^m W(p, q_i), \quad (16)$$

where $W(p, q) = \min_{X \in U(p,q)} \langle C, X\rangle$ is the optimal transport problem between two histograms $p, q \in \Delta_n$, $C \in \mathbb{R}_+^{n\times n}$ is a given ground cost matrix, and $X$ is a transport plan belonging to the transportation polytope $U(p, q) = \{X \in \mathbb{R}_+^{n\times n}: X\mathbf{1} = p,\ X^\top\mathbf{1} = q\}$. As we consider a general cost matrix $C$, this optimal transport problem is more general than the problem defining the Wasserstein distance.
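The marginal constraints $X\mathbf{1} = p$, $X^\top\mathbf{1} = q$ defining $U(p, q)$ are linear in the vectorized plan, and can be encoded by a 0/1 incidence matrix; this encoding is exactly what the saddle-point reformulation of the WB problem relies on. A sketch assuming row-major vectorization:

```python
import numpy as np

def incidence(n):
    # A in {0,1}^(2n x n^2) with A @ vec(X) = [X @ 1; X.T @ 1] (row-major vec)
    A = np.zeros((2 * n, n * n))
    for i in range(n):
        for j in range(n):
            col = i * n + j
            A[i, col] = 1.0        # row-sum block: X @ 1  (should equal p)
            A[n + j, col] = 1.0    # column-sum block: X.T @ 1  (should equal q)
    return A

n = 3
X = np.arange(1.0, 10.0).reshape(n, n)   # arbitrary plan for illustration
A = incidence(n)
marginals = A @ X.reshape(-1)
assert np.allclose(marginals[:n], X.sum(axis=1))   # row sums
assert np.allclose(marginals[n:], X.sum(axis=0))   # column sums
```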
Following the paper [10], which develops the idea of [17], we reformulate the WB problem (16) as a saddle-point problem by introducing the stacked column vectors $b_i = (p^\top, q_i^\top)^\top$, the vectorized cost matrix $d$ of $C$, the vectorized transport plans $x_i \in \Delta_{n^2}$ of $X$, the incidence matrix $A \in \{0, 1\}^{2n \times n^2}$, and vectors $y_i \in [-1, 1]^{2n}$ ($i = 1, \dots, m$). Then (16) can be equivalently rewritten as

$$\min_{p \in \Delta_n} \frac{1}{m}\sum_{i=1}^m \ \min_{x_i \in \Delta_{n^2}} \ \max_{y_i \in [-1,1]^{2n}} \left\{ d^\top x_i + 2\|d\|_\infty \big( y_i^\top A x_i - b_i^\top y_i \big) \right\}. \quad (17)$$

For the distributed computation of Wasserstein barycenters, we introduce the stacked column vectors $\mathbf{p} = (p_1^\top, \dots, p_m^\top)^\top \in \mathcal{P} \triangleq \prod_{i=1}^m \Delta_n$, $\mathbf{x} = (x_1^\top, \dots, x_m^\top)^\top \in \mathcal{X} \triangleq \prod_{i=1}^m \Delta_{n^2}$ (where $\prod$ denotes the Cartesian product of simplices), and $\mathbf{y} = (y_1^\top, \dots, y_m^\top)^\top \in \mathcal{Y} \triangleq [-1, 1]^{2mn}$. Then we equivalently rewrite (17) as

$$\min_{\substack{\mathbf{x} \in \mathcal{X},\ \mathbf{p} \in \mathcal{P}\\ p_1 = \dots = p_m}} \ \max_{\mathbf{y} \in \mathcal{Y}} \ f(\mathbf{x}, \mathbf{p}, \mathbf{y}) \triangleq \frac{1}{m}\left\{ \mathbf{d}^\top \mathbf{x} + 2\|d\|_\infty \big( \mathbf{y}^\top \mathbf{A} \mathbf{x} - \mathbf{b}^\top \mathbf{y} \big) \right\}, \quad (18)$$

where $\mathbf{b} = (p_1^\top, q_1^\top, \dots, p_m^\top, q_m^\top)^\top$, $\mathbf{d} = (d^\top, \dots, d^\top)^\top$, and $\mathbf{A} = \mathrm{diag}\{A, \dots, A\} \in \{0, 1\}^{2mn \times mn^2}$ is a block-diagonal matrix. To enable distributed computation of this problem, the constraint $p_1 = \dots = p_m$ is replaced by $W\mathbf{p} = 0$. Finally, we introduce the Lagrangian dual variable $\mathbf{z} = (z_1^\top, \dots, z_m^\top)^\top \in \mathcal{Z} \triangleq \mathbb{R}^{nm}$, scaled by $1/m$, for the constraint $W\mathbf{p} = 0$ in problem (18) and rewrite it as follows:

$$\min_{\mathbf{x} \in \mathcal{X},\ \mathbf{p} \in \mathcal{P}} \ \max_{\mathbf{y} \in \mathcal{Y},\ \mathbf{z} \in \mathbb{R}^{nm}} \ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z}) \triangleq \frac{1}{m}\left\{ \mathbf{d}^\top \mathbf{x} + 2\|d\|_\infty \big( \mathbf{y}^\top \mathbf{A} \mathbf{x} - \mathbf{b}^\top \mathbf{y} \big) + \langle \mathbf{z}, W\mathbf{p}\rangle \right\}.$$
(19)

The gradient operator for $F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z})$ is defined by

$$\begin{pmatrix} \nabla_{\mathbf{x}} F \\ \nabla_{\mathbf{p}} F \\ -\nabla_{\mathbf{y}} F \\ -\nabla_{\mathbf{z}} F \end{pmatrix} = \frac{1}{m}\begin{pmatrix} \mathbf{d} + 2\|d\|_\infty \mathbf{A}^\top \mathbf{y} \\ W^\top \mathbf{z} - 2\|d\|_\infty \{[y_i]_{1\dots n}\}_{i=1}^m \\ -2\|d\|_\infty (\mathbf{A}\mathbf{x} - \mathbf{b}) \\ -W\mathbf{p} \end{pmatrix}.$$

Here $[y_i]_{1\dots n}$ denotes the first $n$ components of the vector $y_i \in [-1, 1]^{2n}$, and $\{[y_i]_{1\dots n}\}_{i=1}^m$ is a short form of $([y_1]_{1\dots n}^\top, [y_2]_{1\dots n}^\top, \dots, [y_m]_{1\dots n}^\top)^\top$.

Now we are in a position to apply Algorithm 1. We start with the preliminaries and the setup for the Mirror-Prox algorithm. For the space $\mathcal{U} \triangleq \mathcal{X} \times \mathcal{P} \triangleq \prod_{i=1}^m \Delta_{n^2} \times \prod_{i=1}^m \Delta_n$, we choose the norm $\|\mathbf{u}\|_{\mathcal{U}} = \sqrt{\sum_{i=1}^m \|x_i\|_1^2 + \sum_{i=1}^m \|p_i\|_1^2}$, where $\|\cdot\|_1$ is the $\ell_1$-norm, the prox-function $d_{\mathcal{U}}(\mathbf{u}) = \sum_{i=1}^m \langle x_i, \ln x_i\rangle + \sum_{i=1}^m \langle p_i, \ln p_i\rangle$, and the corresponding Bregman divergence $B_{\mathcal{U}}(\mathbf{u}, \breve{\mathbf{u}}) = \sum_{i=1}^m \langle x_i, \ln(x_i/\breve x_i)\rangle - \sum_{i=1}^m \mathbf{1}^\top (x_i - \breve x_i) + \sum_{i=1}^m \langle p_i, \ln(p_i/\breve p_i)\rangle - \sum_{i=1}^m \mathbf{1}^\top (p_i - \breve p_i)$. We endow the space $\mathcal{V} \triangleq \mathcal{Y} \times \mathcal{Z} \triangleq [-1, 1]^{2nm} \times \mathbb{R}^{nm}$ with the standard Euclidean setup: the Euclidean norm $\|\cdot\|_2$, the prox-function $d_{\mathcal{V}}(\mathbf{v}) = \frac{1}{2}\|\mathbf{v}\|_2^2$, and the corresponding Bregman divergence $B_{\mathcal{V}}(\mathbf{v}, \breve{\mathbf{v}}) = \frac{1}{2}\|\mathbf{v} - \breve{\mathbf{v}}\|_2^2$. For the space $\mathcal{U} \times \mathcal{V}$, the prox-function is defined as $a\, d_{\mathcal{U}}(\mathbf{u}) + b\, d_{\mathcal{V}}(\mathbf{v})$, with the corresponding Bregman divergence $a B_{\mathcal{U}}(\mathbf{u}, \breve{\mathbf{u}}) + b B_{\mathcal{V}}(\mathbf{v}, \breve{\mathbf{v}})$, where $a = 1/R_{\mathcal{U}}^2$, $b = 1/R_{\mathcal{V}}^2$, and $R_{\mathcal{U}}^2 = \max_{\mathbf{u} \in \mathcal{U}} d_{\mathcal{U}}(\mathbf{u}) - \min_{\mathbf{u} \in \mathcal{U}} d_{\mathcal{U}}(\mathbf{u})$, $R_{\mathcal{V}}^2 = \max_{\mathbf{v} \in \mathcal{V} \cap B_R(0)} d_{\mathcal{V}}(\mathbf{v}) - \min_{\mathbf{v} \in \mathcal{V} \cap B_R(0)} d_{\mathcal{V}}(\mathbf{v})$. Here $B_R(0)$ is the ball of radius $R$ centered at $0$.

Lemma 6.1.
Objective $F(\mathbf{u}, \mathbf{v})$ in (19) is $(L_{\mathbf{u}\mathbf{u}}, L_{\mathbf{u}\mathbf{v}}, L_{\mathbf{v}\mathbf{u}}, L_{\mathbf{v}\mathbf{v}})$-smooth with $L_{\mathbf{u}\mathbf{u}} = L_{\mathbf{v}\mathbf{v}} = 0$ and $L_{\mathbf{u}\mathbf{v}} = L_{\mathbf{v}\mathbf{u}} = \sqrt{8\|d\|_\infty^2 + \lambda_{\max}^2(W)}\,/\,m$. This lemma (the proof is in Appendix E.2) allows us to obtain the following convergence result.
Theorem 6.2.
Let $\|\mathbf{z}\|_2 \leq R$. Then $R_{\mathbf{u}} = \sqrt{3m\ln n}$ and $R_{\mathbf{v}} = \sqrt{mn + R^2/2}$ with
\[
R = \frac{\left\| \nabla_{\mathbf{p}} f(\mathbf{x}, \mathbf{p}^*, \mathbf{y}) \right\|_2}{\lambda_{\min}^+ \left( \tfrac{1}{m} W \right)} \leq \frac{2\sqrt{mn}\, \|d\|_\infty}{\lambda_{\min}^+(W)},
\]
where $\lambda_{\min}^+(W)$ is the minimal positive eigenvalue of $W$. Then after $N = 4 L_{\mathbf{u}\mathbf{v}} R_{\mathbf{u}} R_{\mathbf{v}} / \varepsilon$ iterations, Algorithm 4 from Appendix E with $\eta = (L_{\mathbf{u}\mathbf{v}} R_{\mathbf{u}} R_{\mathbf{v}})^{-1}$ outputs a pair $(\widetilde{\mathbf{u}}, \widetilde{\mathbf{v}})$ such that
\[
\max_{\mathbf{y} \in \mathcal{Y},\, \|\mathbf{z}\|_2 \leq R} F(\widetilde{\mathbf{u}}, \mathbf{y}, \mathbf{z}) \; - \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} F(\mathbf{x}, \mathbf{p}, \widetilde{\mathbf{v}}) \leq \varepsilon.
\]
The total complexity of Algorithm 4 per node is $O\!\left( n^2 \sqrt{n \ln n} / \varepsilon \right)$. The proof of Theorem 6.2 is provided in Appendix E.1.

For the star network, we can compare the complexity of Mirror-Prox with that of the IBP, which runs in $\widetilde{O}(n^2/\varepsilon^2)$ time per node [19]. Distributed Mirror-Prox has a better dependence on $\varepsilon$, namely $1/\varepsilon$, as does the accelerated IBP with $\widetilde{O}(n^2\sqrt{n}/\varepsilon)$ complexity per node [15]. However, like any regularization-based method, the (accelerated) IBP has a strong limitation: numerical instability when the regularization parameter is small. This is precisely the regime of interest when high precision is desired, as the regularization parameter must be chosen proportionally to $\varepsilon$.

Next, we illustrate the work of Algorithm 4 from Appendix E for different network architectures on Gaussian distributions and on the notMNIST dataset. We randomly generated 10 Gaussian measures with equally spaced support of 100 points and with randomly chosen means and variances. We studied the convergence of the calculated barycenters to the theoretical true barycenter [8] on different network architectures: the cycle graph, the complete graph, the star graph, and the Erdős–Rényi random graph with a fixed probability of edge creation. Figure 1 shows the convergence of Algorithm 4 (on a logarithmic scale). The slope ratio of $-1$ fits the theoretical dependence of the desired accuracy $\varepsilon$ on the number of iterations ($N \sim \varepsilon^{-1}$, Theorem 6.2).

(a) Convergence in the function value; here $p^*$ is the true barycenter.
(b) Convergence to consensus ($W\mathbf{p} = 0$) over the network.

Figure 1: Convergence of the Distributed Mirror-Prox algorithm, with the first 1000 iterations skipped.

Figure 2 illustrates that Algorithm 4 gives a better approximation of the true Wasserstein barycenter of Gaussian measures on different networks in comparison with the IBP. The regularization parameter for the IBP (from the POT Python library) is taken as small as possible while the IBP still works. The slow convergence is due to the lack of theory for convergence in the argument. Next, we illustrate the instability of the IBP algorithm for small values of the regularization parameter $\gamma$ when a high precision $\varepsilon$ of calculating a WB is desired. Figure 3 presents the Wasserstein barycenter of the letter 'B' in a variety of fonts from the notMNIST dataset.

Figure 2: Convergence of Distributed Mirror-Prox on different networks and of the IBP to the true barycenter of Gaussian measures (convergence in the argument, shown at two iteration counts).
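The Mirror-Prox iteration behind these convergence plots alternates an extrapolation step and a main step, each a Bregman proximal step; with the entropy prox-function on the simplex, both steps become multiplicative updates. Below is a minimal single-machine sketch (our own illustration, not the distributed Algorithm 4) on a small matrix game $\min_{x \in \Delta} \max_{y \in \Delta} x^\top C y$:

```python
import numpy as np

def mirror_prox_game(C, steps=2000):
    """Mirror-Prox with entropic prox on the simplex for the matrix
    game min_x max_y x^T C y; returns the averaged half-steps."""
    n, k = C.shape
    x, y = np.full(n, 1.0 / n), np.full(k, 1.0 / k)
    alpha = 1.0 / np.abs(C).max()          # ~ 1/L for the bilinear coupling
    xs, ys = np.zeros(n), np.zeros(k)
    for _ in range(steps):
        # extrapolation step: prox step from (x, y) using the gradient at (x, y)
        xh = x * np.exp(-alpha * (C @ y));   xh /= xh.sum()
        yh = y * np.exp(alpha * (C.T @ x));  yh /= yh.sum()
        # main step: prox step from (x, y) using the gradient at (xh, yh)
        x = x * np.exp(-alpha * (C @ yh));   x /= x.sum()
        y = y * np.exp(alpha * (C.T @ xh));  y /= y.sum()
        xs += xh; ys += yh
    return xs / steps, ys / steps

# rock-paper-scissors: the unique equilibrium is uniform, value 0
C = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])
x, y = mirror_prox_game(C)
gap = (C.T @ x).max() - (C @ y).min()  # duality gap of the averaged iterates
```

The averaged half-steps attain a duality gap of order $1/N$, the same $N \sim \varepsilon^{-1}$ rate observed in Figure 1.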
Figure 3: Wasserstein barycenters of the letter 'B' computed by Distributed Mirror-Prox and by the IBP with different values of the regularization parameter.
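The instability visible in Figure 3 is easy to reproduce: IBP works with the kernel $K = \exp(-C/\gamma)$, which underflows for small $\gamma$, leaving only the diagonal. Below is a minimal NumPy sketch of IBP for the barycenter (our own implementation of the scheme from [4], not the POT code used in the experiments; all parameter values are illustrative):

```python
import numpy as np

def ibp_barycenter(Q, M, gamma, iters=1000):
    """Iterative Bregman projections for the entropy-regularized
    barycenter of the columns of Q (probability vectors) w.r.t. cost M."""
    n, m = Q.shape
    K = np.exp(-M / gamma)
    V = np.ones((n, m))
    for _ in range(iters):
        U = Q / (K @ V)
        b = np.exp(np.mean(np.log(K.T @ U), axis=1))  # geometric mean
        V = b[:, None] / (K.T @ U)
    return b

n = 20
t = np.linspace(0.0, 1.0, n)
M = (t[:, None] - t[None, :]) ** 2                    # squared-distance cost
q1 = np.exp(-((t - 0.25) ** 2) / 0.01); q1 /= q1.sum()
q2 = np.exp(-((t - 0.75) ** 2) / 0.01); q2 /= q2.sum()
Q = np.stack([q1, q2], axis=1)

b = ibp_barycenter(Q, M, gamma=0.05)  # moderate gamma: a valid barycenter
# tiny gamma: the kernel underflows to a diagonal matrix, so the
# Sinkhorn-type updates lose all coupling between grid points
K_tiny = np.exp(-M / 1e-6)
```

At moderate `gamma` the result is a finite probability vector; at `gamma = 1e-6` every off-diagonal kernel entry underflows to exactly zero, which is the numerical failure mode discussed above.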
In this paper, we develop a theory for decentralized distributed smooth convex-concave saddle-point problems. This theory includes lower bounds on the number of communication rounds and oracle calls. We also generalize the Mirror-Prox algorithm to the decentralized distributed setup, with the dependence of its working time on the desired accuracy matching the lower bound. Moreover, we improve this algorithm by using a sliding technique to separate the communication and oracle complexities. Finally, for some classes of convex problems we show the benefits of reformulating them as convex-concave problems, using the example of the Wasserstein barycenter problem, in a number of numerical experiments.
References

[1] Y. Arjevani and O. Shamir. Communication complexity of distributed convex learning and optimization. arXiv preprint arXiv:1506.01900, 2015.
[2] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. 2011.
[3] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization (Lecture Notes). Personal web-page of A. Nemirovski, 2020.
[4] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
[5] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
[6] S. Bubeck. Theory of convex optimization for machine learning. arXiv preprint arXiv:1405.4980, 2014.
[7] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
[8] J. Delon and A. Desolneux. A Wasserstein-type distance in the space of Gaussian mixture models. SIAM Journal on Imaging Sciences, 13(2):936–970, 2020.
[9] D. Dvinskikh and A. Gasnikov. Decentralized and parallel primal and dual accelerated methods for stochastic convex programming problems. Journal of Inverse and Ill-posed Problems, 2021.
[10] D. Dvinskikh and D. Tiapkin. Improved complexity bounds in the Wasserstein barycenter problem. arXiv preprint arXiv:2010.04677, 2020.
[11] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. Advances in Neural Information Processing Systems, 31:10760–10770, 2018.
[12] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
[13] E. Gorbunov, D. Dvinskikh, and A. Gasnikov. Optimal decentralized distributed algorithms for stochastic convex optimization. arXiv preprint arXiv:1911.07363, 2019.
[14] E. Gorbunov, A. Rogozin, A. Beznosikov, D. Dvinskikh, and A. Gasnikov. Recent theoretical advances in decentralized distributed convex optimization. arXiv preprint arXiv:2011.13259, 2020.
[15] S. V. Guminov, Y. E. Nesterov, P. E. Dvurechensky, and A. V. Gasnikov. Accelerated primal-dual gradient descent with linesearch for convex, nonconvex, and nonsmooth optimization problems. Doklady Mathematics, 99(2):125–128, 2019.
[16] D. Jakovetić, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.
[17] A. Jambulapati, A. Sidford, and K. Tian. A direct Õ(1/ε) iteration parallel algorithm for optimal transport. Advances in Neural Information Processing Systems, 32:11359–11370, 2019.
[18] Y. Jin and A. Sidford. Efficiently solving MDPs with stochastic mirror descent. In International Conference on Machine Learning, pages 4890–4900. PMLR, 2020.
[19] A. Kroshnin, N. Tupitsa, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and C. Uribe. On the complexity of approximating Wasserstein barycenters. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3530–3540. PMLR, 2019. arXiv:1901.08686.
[20] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
[21] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.
[22] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, 180(1):237–284, 2020.
[23] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[24] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
[25] Y. Ouyang and Y. Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, pages 1–35, 2019.
[26] G. Peyré and M. Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.
[27] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3027–3036. PMLR, 2017.
[28] K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems 31, pages 2740–2749. The MIT Press, 2018.
[29] P. Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
[30] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić. A dual approach for optimal algorithms in distributed optimization over networks. Optimization Methods and Software, pages 1–40, 2020.
[31] J. Zhang, M. Hong, and S. Zhang. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.
A Missing Proofs from Section 3
A.1 Proof of Lemma 3.1
Proof.
For part 1, see [22]. Introduce the function $\psi(\theta) = \max_{\|\nu\|_2 \leq R} \left[ h(\theta) + \langle \nu, A\theta - b \rangle \right]$. Since $\Theta$ is a compact set, it holds that $\max_{\|\nu\|_2 \leq R} \varphi(\nu) = \min_{\theta \in \Theta} \psi(\theta)$ by the Sion–Kakutani theorem (see e.g. Theorem D.4.2 in [2]). Moreover, since $\nu^* \in \operatorname{Arg\,max}_{\nu \in \mathbb{R}^m} \varphi(\nu)$ and $\nu^* \in B_R(0)$, we have $\nu^* \in \operatorname{Arg\,max}_{\|\nu\|_2 \leq R} \varphi(\nu)$ and $\max_{\nu \in \mathbb{R}^m} \varphi(\nu) = \max_{\|\nu\|_2 \leq R} \varphi(\nu)$. Also note that $h(\widehat{\theta}) = \min_{\theta \in \Theta} \psi(\theta)$ by Theorem D.4.1 in [2]. Combining the three facts
\[
\widehat{\theta} \in \operatorname{Arg\,min}_{\theta \in \Theta} \psi(\theta), \qquad \nu^* \in \operatorname{Arg\,max}_{\|\nu\|_2 \leq R} \varphi(\nu), \qquad \min_{\theta \in \Theta} \psi(\theta) = \max_{\|\nu\|_2 \leq R} \varphi(\nu),
\]
we obtain that $(\widehat{\theta}, \nu^*)$ is a saddle point of (7) by Theorem D.4.1 of [2]. Therefore, for $\nu \in B_R(0)$ we have
\[
h(\widehat{\theta}) + \langle \nu^*, A\widehat{\theta} - b \rangle \geq h(\widehat{\theta}) + \langle \nu, A\widehat{\theta} - b \rangle, \quad \text{i.e.} \quad \langle \nu^* - \nu, A\widehat{\theta} - b \rangle \geq 0.
\]
Taking into account that $\|\nu^*\|_2 \leq R_\nu < R$, we conclude that $\nu^* \in \operatorname{int} B_R(0)$ and therefore $A\widehat{\theta} - b = 0$. Finally, by the definition of a saddle point,
\[
h(\widehat{\theta}) + \langle \nu^*, A\widehat{\theta} - b \rangle \leq h(\theta) + \langle \nu^*, A\theta - b \rangle \quad \forall \theta \in \Theta,
\]
hence $h(\widehat{\theta}) \leq h(\theta)$ for all $\theta \in \Theta$ with $A\theta = b$, which concludes the proof.

A.2 Proof of Theorem 2.3
We prove a lemma that generalizes Theorem 2.3.
Lemma A.1.
Introduce $\bar{R}_{\mathbf{z}} = \frac{\bar{\alpha} M_{\mathbf{p}}}{\lambda_{\min}^+(W_{\mathbf{p}})}$ and $\bar{R}_{\mathbf{u}} = \frac{\bar{\beta} M_{\mathbf{r}}}{\lambda_{\min}^+(W_{\mathbf{r}})}$, where $\bar{\alpha}, \bar{\beta} \in (1, +\infty) \cup \{+\infty\}$. Problem (3) is equivalent to
\[
\min_{\substack{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P} \\ \|\mathbf{u}\|_2 \leqslant \bar{R}_{\mathbf{u}}}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ \|\mathbf{z}\|_2 \leqslant \bar{R}_{\mathbf{z}}}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \qquad (20)
\]
in the sense that any solution of (20) is a solution of (3).

Proof. We are free to use the Sion–Kakutani theorem since the sets $\mathcal{X}, \mathcal{P}, \mathcal{Y}, \mathcal{R}$ are compact:
\[
\min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}}} \; \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})
= \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})
= \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\mathbf{x} \in \mathcal{X}} \; \min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{p} \in \mathcal{P}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}). \qquad (21)
\]
For any fixed triple $(\mathbf{x}, \mathbf{y}, \mathbf{r}) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{R}$, the function $F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})$ is convex in $\mathbf{p}$. Consider the problem
\[
\min_{\mathbf{p} \in \mathcal{P}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \quad \text{s.t.} \quad W_{\mathbf{p}} \mathbf{p} = 0 \qquad (22)
\]
and denote $\mathbf{p}^*(\mathbf{x}, \mathbf{y}, \mathbf{r}) = \arg\min_{W_{\mathbf{p}}\mathbf{p} = 0,\, \mathbf{p} \in \mathcal{P}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})$. By Lemma 3.1, part 1, there exists a solution of the dual to (22) with norm bounded by
\[
\frac{\left\| \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}^*(\mathbf{x}, \mathbf{y}, \mathbf{r}), \mathbf{y}, \mathbf{r}) \right\|_2}{\lambda_{\min}^+(W_{\mathbf{p}})} < \bar{R}_{\mathbf{z}}. \qquad (23)
\]
By Lemma 3.1 (part 2), problem (22) is equivalent to
\[
\max_{\|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}} \; \min_{\mathbf{p} \in \mathcal{P}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \qquad (24)
\]
in the sense that for any saddle point $(\widehat{\mathbf{p}}(\mathbf{x}, \mathbf{y}, \mathbf{r}), \widehat{\mathbf{z}}(\mathbf{x}, \mathbf{y}, \mathbf{r}))$ of (24), the point $\widehat{\mathbf{p}}(\mathbf{x}, \mathbf{y}, \mathbf{r})$ is a solution of (22). Returning to (21), we get
\[
\begin{aligned}
\max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\mathbf{x} \in \mathcal{X}} \; \min_{\substack{W_{\mathbf{p}}\mathbf{p} = 0 \\ \mathbf{p} \in \mathcal{P}}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})
&= \max_{\substack{W_{\mathbf{r}}\mathbf{r} = 0 \\ \mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R}}} \; \min_{\mathbf{x} \in \mathcal{X}} \left[ \max_{\|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}} \; \min_{\mathbf{p} \in \mathcal{P}} \left( F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right) \right] \\
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ W_{\mathbf{r}}\mathbf{r} = 0,\; \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \\
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \max_{\substack{\mathbf{r} \in \mathcal{R} \\ W_{\mathbf{r}}\mathbf{r} = 0}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right].
\end{aligned} \qquad (25)
\]
Now we introduce Lagrange multipliers for the constraint $W_{\mathbf{r}} \mathbf{r} = 0$ as well. Analogously to the case of the constraint $W_{\mathbf{p}} \mathbf{p} = 0$, consider the problem
\[
\max_{\mathbf{r} \in \mathcal{R}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right] \quad \text{s.t.} \quad W_{\mathbf{r}} \mathbf{r} = 0 \qquad (26)
\]
and introduce its solution $\mathbf{r}^*(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z}) = \arg\max_{W_{\mathbf{r}}\mathbf{r} = 0,\, \mathbf{r} \in \mathcal{R}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right]$. Analogously, the norm of the dual solution to problem (26) can be bounded as
\[
\frac{\left\| \nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}^*(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{z})) \right\|_2}{\lambda_{\min}^+(W_{\mathbf{r}})} < \bar{R}_{\mathbf{u}}, \qquad \bar{\beta} \in (1, +\infty]. \qquad (27)
\]
And we get a saddle-point reformulation of (26):
\[
\min_{\|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}} \; \max_{\mathbf{r} \in \mathcal{R}} \left[ F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right].
\]
Substituting this reformulation into (25), we obtain
\[
\begin{aligned}
\min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \max_{\substack{\mathbf{r} \in \mathcal{R} \\ W_{\mathbf{r}}\mathbf{r} = 0}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle \right]
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \left[ \min_{\|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}} \; \max_{\mathbf{r} \in \mathcal{R}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right] \right] \\
&= \min_{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P}} \; \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \min_{\|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right] \\
&= \max_{\substack{\mathbf{y} \in \mathcal{Y},\, \mathbf{r} \in \mathcal{R} \\ \|\mathbf{z}\|_2 \leq \bar{R}_{\mathbf{z}}}} \; \min_{\substack{\mathbf{x} \in \mathcal{X},\, \mathbf{p} \in \mathcal{P} \\ \|\mathbf{u}\|_2 \leq \bar{R}_{\mathbf{u}}}} \left[ F + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle \right],
\end{aligned} \qquad (28)
\]
where $F = F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r})$ for brevity, which is an equivalent reformulation of (2). Note that the cases $\bar{R}_{\mathbf{u}} = +\infty$, $\bar{R}_{\mathbf{z}} = +\infty$ are also supported by this proof. A min–max reformulation is obtained analogously by adding the Lagrange multipliers in a different order.

Proof of Theorem 2.3.
The proof immediately follows from Lemma A.1 by setting $\bar{R}_{\mathbf{u}} = \bar{R}_{\mathbf{z}} = +\infty$.

A.3 Proof of Lemma 3.3
The proof follows from a more general statement in Lemma A.1.
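The norm bound on the dual solution that drives Lemma 3.1 and Lemma A.1 can be checked numerically on a toy consensus problem (the example and all names below are ours): minimize $\sum_i (p_i - c_i)^2$ subject to $Wp = 0$ for a graph Laplacian $W$, and verify that a dual solution satisfies $\|z^*\|_2 \le \|\nabla h(p^*)\|_2 / \lambda_{\min}^+(W)$.

```python
import numpy as np

# Toy consensus problem: min_p sum_i (p_i - c_i)^2  s.t.  W p = 0
W = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])  # path Laplacian
c = np.array([1.0, 3.0, 8.0])
p_star = np.full(3, c.mean())            # consensus optimum: the average
grad = 2 * (p_star - c)                  # gradient of the objective at p*
z_star = np.linalg.pinv(W) @ (-grad)     # a dual solution: W z* = -grad
lam_min_plus = np.sort(np.linalg.eigvalsh(W))[1]  # smallest positive eigenvalue
# the minimum-norm dual solution obeys ||z*|| <= ||grad|| / lambda_min^+(W):
ok = np.linalg.norm(z_star) <= np.linalg.norm(grad) / lam_min_plus + 1e-12
```

Since the pseudoinverse returns the dual solution orthogonal to $\ker W$, the bound holds by construction; this is exactly the mechanism that makes the radii $\bar{R}_{\mathbf{u}}, \bar{R}_{\mathbf{z}}$ finite in Lemma A.1.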
B Smoothness Constants for Mirror-Prox
B.1 Estimating Lipschitz constants for $S(\xi, \eta)$

We start by deriving the Lipschitz constants $L_{\xi\xi}, L_{\eta\eta}, L_{\xi\eta}, L_{\eta\xi}$ of the function $S(\xi, \eta)$ in (9). Recall that if an operator $A$ acts from a space $V$ with norm $\|v\|_v$ to a space $W$ with norm $\|w\|_w$, then these two norms naturally induce the operator norm $\|A\|_{v \to w} = \max_v \{\|Av\|_w : \|v\|_v \leq 1\}$, which gives the inequality $\|Av\|_w \leq \|A\|_{v \to w} \|v\|_v$. Recall also that (9) is a reformulation of (4) in a simpler form using the definitions $\xi = (\mathbf{x}^\top, \mathbf{p}^\top, \mathbf{u}^\top)^\top$, $\eta = (\mathbf{y}^\top, \mathbf{r}^\top, \mathbf{z}^\top)^\top$ and $S(\xi, \eta) = F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + \langle \mathbf{u}, W_{\mathbf{r}} \mathbf{r} \rangle + \langle \mathbf{z}, W_{\mathbf{p}} \mathbf{p} \rangle$. Then for the corresponding partial derivatives we have
\[
\nabla_\xi S(\xi, \eta) = \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}} \mathbf{z} \\ W_{\mathbf{r}} \mathbf{r} \end{bmatrix}, \qquad
\nabla_\eta S(\xi, \eta) = \begin{bmatrix} \nabla_{\mathbf{y}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{r}} \mathbf{u} \\ W_{\mathbf{p}} \mathbf{p} \end{bmatrix}.
\]
Recall also that $\|\xi\|_\xi^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2 + \|\mathbf{u}\|_2^2$ and $\|\eta\|_\eta^2 = \|\mathbf{y}\|_{\mathbf{y}}^2 + \|\mathbf{r}\|_{\mathbf{r}}^2 + \|\mathbf{z}\|_2^2$, which induce the dual norms $\|\xi\|_{\xi,*}^2 = \|\mathbf{x}\|_{\mathbf{x},*}^2 + \|\mathbf{p}\|_{\mathbf{p},*}^2 + \|\mathbf{u}\|_2^2$ and $\|\eta\|_{\eta,*}^2 = \|\mathbf{y}\|_{\mathbf{y},*}^2 + \|\mathbf{r}\|_{\mathbf{r},*}^2 + \|\mathbf{z}\|_2^2$. The constant $L_{\xi\xi}$ has to satisfy $\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi', \eta)\|_{\xi,*} \leq L_{\xi\xi} \|\xi - \xi'\|_\xi$ for all $\xi, \xi' \in Q_\xi$.
We have
\[
\begin{aligned}
\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi', \eta)\|_{\xi,*}
&= \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} - \left( \nabla_{\mathbf{p}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} \right) \\ W_{\mathbf{r}}\mathbf{r} - W_{\mathbf{r}}\mathbf{r} \end{bmatrix} \right\|_{\xi,*} \\
&= \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}', \mathbf{p}', \mathbf{y}, \mathbf{r}) \end{bmatrix} \right\|_{(\mathbf{x},\mathbf{p}),*}
\leq L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})} \left\| \begin{bmatrix} \mathbf{x} - \mathbf{x}' \\ \mathbf{p} - \mathbf{p}' \end{bmatrix} \right\|_{(\mathbf{x},\mathbf{p})}
\leq L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})} \|\xi - \xi'\|_\xi,
\end{aligned}
\]
where we used that $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p}),*}^2 = \|\mathbf{x}\|_{\mathbf{x},*}^2 + \|\mathbf{p}\|_{\mathbf{p},*}^2$ since $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p})}^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2$, Assumption 3.5 and Definition 3.4. Thus, $L_{\xi\xi} = L_{(\mathbf{x},\mathbf{p})(\mathbf{x},\mathbf{p})}$. The equality $L_{\eta\eta} = L_{(\mathbf{y},\mathbf{r})(\mathbf{y},\mathbf{r})}$ is proved in the same way. Let us estimate $L_{\xi\eta}$, which has to satisfy $\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi, \eta')\|_{\xi,*} \leq L_{\xi\eta} \|\eta - \eta'\|_\eta$ for all $\eta, \eta' \in Q_\eta$.
We have
\[
\begin{aligned}
\|\nabla_\xi S(\xi, \eta) - \nabla_\xi S(\xi, \eta')\|_{\xi,*}^2
&= \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} - \left( \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') + W_{\mathbf{p}}\mathbf{z}' \right) \\ W_{\mathbf{r}}\mathbf{r} - W_{\mathbf{r}}\mathbf{r}' \end{bmatrix} \right\|_{\xi,*}^2 \\
&= \|\nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}')\|_{\mathbf{x},*}^2 + \|\nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') + W_{\mathbf{p}}(\mathbf{z} - \mathbf{z}')\|_{\mathbf{p},*}^2 + \|W_{\mathbf{r}}(\mathbf{r} - \mathbf{r}')\|_2^2 \\
&\leq \|\nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}')\|_{\mathbf{x},*}^2 + 2\|\nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}')\|_{\mathbf{p},*}^2 + 2\|W_{\mathbf{p}}(\mathbf{z} - \mathbf{z}')\|_{\mathbf{p},*}^2 + \|W_{\mathbf{r}}(\mathbf{r} - \mathbf{r}')\|_2^2 \\
&\leq 2 \left\| \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}', \mathbf{r}') \end{bmatrix} \right\|_{(\mathbf{x},\mathbf{p}),*}^2 + 2\|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2 \|\mathbf{z} - \mathbf{z}'\|_2^2 + 2\|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \|\mathbf{r} - \mathbf{r}'\|_{\mathbf{r}}^2 \\
&\leq 2 L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}^2 \left\| \begin{bmatrix} \mathbf{y} - \mathbf{y}' \\ \mathbf{r} - \mathbf{r}' \end{bmatrix} \right\|_{(\mathbf{y},\mathbf{r})}^2 + 2\|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2 \|\mathbf{z} - \mathbf{z}'\|_2^2 + 2\|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \|\mathbf{r} - \mathbf{r}'\|_{\mathbf{r}}^2 \\
&\leq 4 \max\left\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}^2,\; \|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2,\; \|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \right\} \left( \|\mathbf{y} - \mathbf{y}'\|_{\mathbf{y}}^2 + \|\mathbf{r} - \mathbf{r}'\|_{\mathbf{r}}^2 + \|\mathbf{z} - \mathbf{z}'\|_2^2 \right) \\
&= 4 \max\left\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})}^2,\; \|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)}^2,\; \|W_{\mathbf{r}}\|_{\mathbf{r} \to 2}^2 \right\} \|\eta - \eta'\|_\eta^2,
\end{aligned}
\]
where we used the definition of $\|\xi\|_{\xi,*}$, the inequality $(a+b)^2 \leq 2a^2 + 2b^2$, the fact that $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p}),*}^2 = \|\mathbf{x}\|_{\mathbf{x},*}^2 + \|\mathbf{p}\|_{\mathbf{p},*}^2$ since $\|(\mathbf{x},\mathbf{p})\|_{(\mathbf{x},\mathbf{p})}^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2$, the definition of the operator norm, Assumption 3.5 and Definition 3.4, and, finally, the definition of $\|\eta\|_\eta$. Taking the square root of the derived inequality, we obtain $L_{\xi\eta} = 2 \max\left\{ L_{(\mathbf{x},\mathbf{p})(\mathbf{y},\mathbf{r})},\; \|W_{\mathbf{p}}\|_{2 \to (\mathbf{p},*)},\; \|W_{\mathbf{r}}\|_{\mathbf{r} \to 2} \right\}$. The bound for $L_{\eta\xi}$ is derived in the same way.

B.2 Proof of Theorem 3.6
To prove Theorem 3.6, we first show that the iterates of Algorithm 1 naturally correspond to the iterates of a general Mirror-Prox algorithm applied to problem (4). Then we extend the standard analysis of the general Mirror-Prox algorithm to account for unbounded feasible sets. By definition, the aggregated norms $\|\xi\|_\xi^2 = \|\mathbf{x}\|_{\mathbf{x}}^2 + \|\mathbf{p}\|_{\mathbf{p}}^2 + \|\mathbf{u}\|_2^2$, $\|\eta\|_\eta^2 = \|\mathbf{y}\|_{\mathbf{y}}^2 + \|\mathbf{r}\|_{\mathbf{r}}^2 + \|\mathbf{z}\|_2^2$, the aggregated prox-functions $d_\xi(\xi) = \sum_{i=1}^m \left( d_{\mathbf{x};i}(x_i) + d_{\mathbf{p};i}(p_i) + d_{\mathbf{u};i}(u_i) \right)$, $d_\eta(\eta) = \sum_{i=1}^m \left( d_{\mathbf{y};i}(y_i) + d_{\mathbf{r};i}(r_i) + d_{\mathbf{z};i}(z_i) \right)$, and the aggregated Bregman divergences $B_\xi(\xi, \breve{\xi})$, $B_\eta(\eta, \breve{\eta})$ are separable w.r.t. the local variables of each agent $i$. Further, $Q_\xi = \mathcal{X} \times \mathcal{P} \times \mathbb{R}^{n d_{\mathbf{r}}}$ and $Q_\eta = \mathcal{Y} \times \mathcal{R} \times \mathbb{R}^{n d_{\mathbf{p}}}$, where $\mathcal{P} = \bar{\mathcal{P}} \times \dots \times \bar{\mathcal{P}}$, $\mathcal{R} = \bar{\mathcal{R}} \times \dots \times \bar{\mathcal{R}}$, $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_m$, and $\mathcal{Y} = \mathcal{Y}_1 \times \dots \times \mathcal{Y}_m$. Thus, the feasible set $Q_\xi \times Q_\eta$ is also separable w.r.t. the local feasible set $\mathcal{X}_i \times \bar{\mathcal{P}} \times \mathbb{R}^{d_{\mathbf{r}}} \times \mathcal{Y}_i \times \bar{\mathcal{R}} \times \mathbb{R}^{d_{\mathbf{p}}}$ of each agent $i$. Thus, if we define the variable $\zeta = (\xi^\top, \eta^\top)^\top \in Q := Q_\xi \times Q_\eta$ and the operator
\[
g(\zeta) = \begin{bmatrix} \nabla_\xi S(\xi, \eta) \\ -\nabla_\eta S(\xi, \eta) \end{bmatrix}
= \begin{bmatrix} \nabla_{\mathbf{x}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ \nabla_{\mathbf{p}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) + W_{\mathbf{p}}\mathbf{z} \\ W_{\mathbf{r}}\mathbf{r} \\ -\nabla_{\mathbf{y}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) \\ -\nabla_{\mathbf{r}} F(\mathbf{x}, \mathbf{p}, \mathbf{y}, \mathbf{r}) - W_{\mathbf{r}}\mathbf{u} \\ -W_{\mathbf{p}}\mathbf{p} \end{bmatrix},
\]
then the updates of Algorithm 1 are equivalent to the updates of the general Mirror-Prox method listed as Algorithm 3.

Algorithm 3
Mirror-Prox
Require:
Starting point $\zeta^0 \in Q$, stepsize $\alpha > 0$.
for $k = 0, 1, \dots$ do
\[
\zeta^{k+1/2} := \arg\min_{\zeta \in Q} \left\{ \alpha \langle g(\zeta^k), \zeta \rangle + B(\zeta, \zeta^k) \right\} \qquad (29)
\]
\[
\zeta^{k+1} := \arg\min_{\zeta \in Q} \left\{ \alpha \langle g(\zeta^{k+1/2}), \zeta \rangle + B(\zeta, \zeta^k) \right\} \qquad (30)
\]
Set $\widehat{\zeta}^k = \frac{1}{k} \sum_{\ell=0}^{k-1} \zeta^{\ell+1/2}$.
end for

Next, we analyze the general Mirror-Prox Algorithm 3. Recall that we defined $C_\xi = \mathcal{X} \times \mathcal{P} \times \{\mathbf{u} : \|\mathbf{u}\|_2 \leq R_{\mathbf{u}}\}$ and $C_\eta = \mathcal{Y} \times \mathcal{R} \times \{\mathbf{z} : \|\mathbf{z}\|_2 \leq R_{\mathbf{z}}\}$, whence there exists a saddle point $(\xi^*, \eta^*)$ which belongs to $C_\xi \times C_\eta$. We also defined $R_\xi^2 = \max_{\xi \in C_\xi} d_\xi(\xi) - \min_{\xi \in C_\xi} d_\xi(\xi)$ and $R_\eta^2 = \max_{\eta \in C_\eta} d_\eta(\eta) - \min_{\eta \in C_\eta} d_\eta(\eta)$. Note that since $C_\xi, C_\eta$ are compact, the values $R_\xi, R_\eta$ are well-defined. Using the weights $a = 1/R_\xi^2$, $b = 1/R_\eta^2$, we define the norm $\|\zeta\|_\zeta^2 = a\|\xi\|_\xi^2 + b\|\eta\|_\eta^2$, the corresponding prox-function $d_\zeta(\zeta)$ and Bregman divergence $B_\zeta(\zeta, \breve{\zeta}) = a B_\xi(\xi, \breve{\xi}) + b B_\eta(\eta, \breve{\eta})$, together with the corresponding dual norm $\|\zeta\|_{\zeta,*}$. Under these definitions, following the standard analysis in [6], we obtain that the operator $g(\zeta)$ is $L_\zeta$-Lipschitz continuous with respect to the norm $\|\zeta\|_\zeta$ with
\[
L_\zeta = 2 \max\left\{ L_{\xi\xi} R_\xi^2,\; L_{\eta\eta} R_\eta^2,\; L_{\xi\eta} R_\xi R_\eta,\; L_{\eta\xi} R_\xi R_\eta \right\}. \qquad (31)
\]
Next, we analyze the iterations of Algorithm 3 under the assumption that $g(\zeta)$ is $L_\zeta$-Lipschitz continuous. Fix some iteration $k \geq 0$. By the optimality conditions in (29) and (30), we have, for any $\zeta \in Q$,
\[
\langle \alpha g(\zeta^k) + \nabla d_\zeta(\zeta^{k+1/2}) - \nabla d_\zeta(\zeta^k),\, \zeta - \zeta^{k+1/2} \rangle \geq 0, \qquad (32)
\]
\[
\langle \alpha g(\zeta^{k+1/2}) + \nabla d_\zeta(\zeta^{k+1}) - \nabla d_\zeta(\zeta^k),\, \zeta - \zeta^{k+1} \rangle \geq 0. \qquad (33)
\]
Whence, for all $\zeta \in Q$,
\[
\begin{aligned}
\langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta \rangle
&= \langle g(\zeta^{k+1/2}), \zeta^{k+1} - \zeta \rangle + \langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle \\
&\overset{(33)}{\leq} \frac{1}{\alpha} \langle \nabla d_\zeta(\zeta^k) - \nabla d_\zeta(\zeta^{k+1}), \zeta^{k+1} - \zeta \rangle + \langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle \\
&= \frac{1}{\alpha} B_\zeta(\zeta, \zeta^k) - \frac{1}{\alpha} B_\zeta(\zeta, \zeta^{k+1}) - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k) + \langle g(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle,
\end{aligned}
\]
where the last equality uses the definition of the Bregman divergence $B_\zeta(\zeta, \breve{\zeta}) = d_\zeta(\zeta) - d_\zeta(\breve{\zeta}) - \langle \nabla d_\zeta(\breve{\zeta}), \zeta - \breve{\zeta} \rangle$. Further, for all $\zeta \in Q$,
\[
\begin{aligned}
\langle g(\zeta^{k+1/2}),\, &\zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k)
= \langle g(\zeta^{k+1/2}) - g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k) + \langle g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle \\
&\overset{(32)\text{ with } \zeta = \zeta^{k+1}}{\leq} \langle g(\zeta^{k+1/2}) - g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle + \frac{1}{\alpha} \langle \nabla d_\zeta(\zeta^k) - \nabla d_\zeta(\zeta^{k+1/2}), \zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^k) \\
&= \langle g(\zeta^{k+1/2}) - g(\zeta^k), \zeta^{k+1/2} - \zeta^{k+1} \rangle - \frac{1}{\alpha} B_\zeta(\zeta^{k+1/2}, \zeta^k) - \frac{1}{\alpha} B_\zeta(\zeta^{k+1}, \zeta^{k+1/2}) \\
&\leq \|g(\zeta^{k+1/2}) - g(\zeta^k)\|_{\zeta,*} \|\zeta^{k+1/2} - \zeta^{k+1}\|_\zeta - \frac{1}{2\alpha} \left( \|\zeta^{k+1/2} - \zeta^k\|_\zeta^2 + \|\zeta^{k+1} - \zeta^{k+1/2}\|_\zeta^2 \right) \\
&\overset{\alpha = 1/L_\zeta}{\leq} L_\zeta \|\zeta^{k+1/2} - \zeta^k\|_\zeta \|\zeta^{k+1/2} - \zeta^{k+1}\|_\zeta - \frac{L_\zeta}{2} \left( \|\zeta^{k+1/2} - \zeta^k\|_\zeta^2 + \|\zeta^{k+1/2} - \zeta^{k+1}\|_\zeta^2 \right) \leq 0,
\end{aligned}
\]
where we also used that $B_\zeta(\zeta, \breve{\zeta}) \geq \frac{1}{2}\|\zeta - \breve{\zeta}\|_\zeta^2$ and that $g(\zeta)$ is $L_\zeta$-Lipschitz continuous. Combining the above two inequalities with the choice $\alpha = 1/L_\zeta$, we obtain, for all $\zeta \in Q$ and $i \geq 0$,
\[
\langle g(\zeta^{i+1/2}), \zeta^{i+1/2} - \zeta \rangle \leq L_\zeta B_\zeta(\zeta, \zeta^i) - L_\zeta B_\zeta(\zeta, \zeta^{i+1}).
\]
Summing up these inequalities for $i$ from $0$ to $k-1$, we have
\[
\sum_{i=0}^{k-1} \langle g(\zeta^{i+1/2}), \zeta^{i+1/2} - \zeta \rangle \leq L_\zeta \left( B_\zeta(\zeta, \zeta^0) - B_\zeta(\zeta, \zeta^k) \right) \leq L_\zeta B_\zeta(\zeta, \zeta^0).
\]
Now we use the connection between $S(\xi, \eta)$ and the operator $g(\zeta)$. By convexity of $S(\xi, \eta)$ in $\xi$ and concavity of $S(\xi, \eta)$ in $\eta$, we have, for all $\xi \in Q_\xi$,
\[
\frac{1}{k} \sum_{i=0}^{k-1} \langle \nabla_\xi S(\xi^{i+1/2}, \eta^{i+1/2}), \xi^{i+1/2} - \xi \rangle \geq \frac{1}{k} \sum_{i=0}^{k-1} \left( S(\xi^{i+1/2}, \eta^{i+1/2}) - S(\xi, \eta^{i+1/2}) \right) \geq \frac{1}{k} \sum_{i=0}^{k-1} S(\xi^{i+1/2}, \eta^{i+1/2}) - S(\xi, \widehat{\eta}^k).
\]
In the same way we obtain, for all $\eta \in Q_\eta$,
\[
\frac{1}{k} \sum_{i=0}^{k-1} \langle -\nabla_\eta S(\xi^{i+1/2}, \eta^{i+1/2}), \eta^{i+1/2} - \eta \rangle \geq -\frac{1}{k} \sum_{i=0}^{k-1} S(\xi^{i+1/2}, \eta^{i+1/2}) + S(\widehat{\xi}^k, \eta).
\]
Summing these inequalities, by the definition of $g(\zeta)$ we obtain that, for all $\xi \in Q_\xi$, $\eta \in Q_\eta$ and $\zeta = (\xi, \eta)$,
\[
S(\widehat{\xi}^k, \eta) - S(\xi, \widehat{\eta}^k) \leq \frac{1}{k} \sum_{i=0}^{k-1} \langle g(\zeta^{i+1/2}), \zeta^{i+1/2} - \zeta \rangle \leq \frac{L_\zeta}{k} B_\zeta(\zeta, \zeta^0).
\]
Taking the maximum over $\xi \in C_\xi$ and $\eta \in C_\eta$ and using the definition of the Bregman divergence, we obtain
\[
\max_{\eta \in C_\eta} S(\widehat{\xi}^k, \eta) - \min_{\xi \in C_\xi} S(\xi, \widehat{\eta}^k) \leq \frac{L_\zeta}{k} \max_{\xi \in C_\xi,\, \eta \in C_\eta} B_\zeta(\zeta, \zeta^0) = \frac{L_\zeta}{k} \max_{\xi \in C_\xi,\, \eta \in C_\eta} \left( a B_\xi(\xi, \xi^0) + b B_\eta(\eta, \eta^0) \right) \leq \frac{2 L_\zeta}{k},
\]
where we used that $a = 1/R_\xi^2$, $b = 1/R_\eta^2$ with $R_\xi^2 = \max_{\xi \in C_\xi} d_\xi(\xi) - \min_{\xi \in C_\xi} d_\xi(\xi)$ and $R_\eta^2 = \max_{\eta \in C_\eta} d_\eta(\eta) - \min_{\eta \in C_\eta} d_\eta(\eta)$. This finishes the proof of Theorem 3.6. Since, by construction, there exists a saddle point $(\xi^*, \eta^*)$ such that $\xi^* \in C_\xi$ and $\eta^* \in C_\eta$, Theorem 3.6 implies that Algorithm 1 generates a good approximation to a saddle point after a sufficiently large number of iterations.

C Separating the Communication and Oracle Complexities
In this section, we describe how to apply the result of Theorem 5.1 to solve problem (4), restricting our attention to the Euclidean norm only. Since our sliding technique for variational inequalities requires strong monotonicity, we introduce a regularization term in Section C.1. The proof of Theorem 5.1 is presented in Section C.2.
C.1 Regularization of Problem (4)
Lemma C.1.
Let $h(\theta) : \mathbb{R}^d \to \mathbb{R}$ be a convex function and consider the problem
\[
\min_{\theta \in Q} h(\theta), \qquad (34)
\]
where $Q \subseteq \mathbb{R}^d$ is a convex set. Fix $\theta^0 \in Q$ and introduce the regularized problem
\[
\min_{\theta \in Q} h_\mu(\theta) = h(\theta) + \frac{\mu}{2} \left\| \theta - \theta^0 \right\|_2^2,
\]
where $0 \leqslant \mu \leqslant \frac{2\varepsilon}{\|\theta^0 - \theta^*\|_2^2}$ and $\theta^*$ denotes the solution of (34). Then it holds that $h_\mu^* - h^* \leqslant \varepsilon$.

Proof.
\[
h_\mu^* = \min_{\theta \in Q} \left( h(\theta) + \frac{\mu}{2} \|\theta - \theta^0\|_2^2 \right) \leqslant h^* + \frac{\mu}{2} \left\| \theta^* - \theta^0 \right\|_2^2 \leqslant h^* + \varepsilon.
\]

Corollary C.2.
Let $Q_\xi, Q_\eta$ be convex sets and let $S(\xi, \eta)$ be convex in $\xi$ and concave in $\eta$. Consider the saddle-point problem
\[
\min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} S(\xi, \eta) \qquad (35)
\]
and introduce its regularized version
\[
\min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} \widetilde{S}(\xi, \eta) = \min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} \left[ S(\xi, \eta) + \frac{\mu}{2} \|\xi\|_2^2 - \frac{\mu}{2} \|\eta\|_2^2 \right], \qquad (36)
\]
where $\mu$ is chosen proportionally to $\varepsilon$ as in Lemma C.1. Then
\[
0 \leqslant \min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} \widetilde{S}(\xi, \eta) - \min_{\xi \in Q_\xi} \max_{\eta \in Q_\eta} S(\xi, \eta) \leqslant \varepsilon.
\]
Applying Corollary C.2 to problem (4) with
\[
\mu = \frac{\varepsilon}{\max\left( d^2(\mathcal{X}) + d^2(\mathcal{P}) + R_{\mathbf{u}}^2,\; d^2(\mathcal{Y}) + d^2(\mathcal{R}) + R_{\mathbf{z}}^2 \right)}
\]
and Theorem 5.1 allows us to achieve the convergence rate $\widetilde{O}\!\left( \frac{L_F + \lambda_{\max}(W)}{\mu} \right)$.
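Lemma C.1 is easy to verify numerically. The sketch below (our own toy example with a quadratic $h$ and $Q = \mathbb{R}^d$; all names are ours) checks the key inequality from the proof, $h_\mu^* \leq h^* + \frac{\mu}{2}\|\theta^* - \theta^0\|_2^2$:

```python
import numpy as np

# Toy instance: h(theta) = ||theta - c||^2, so theta* = c and h* = 0.
rng = np.random.default_rng(1)
d = 5
c = rng.normal(size=d)                  # minimizer of h
theta0 = rng.normal(size=d)             # regularization center
mu = 0.3
# closed-form minimizer of h_mu(theta) = h(theta) + (mu/2) ||theta - theta0||^2
theta_mu = (2 * c + mu * theta0) / (2 + mu)
h_star_mu = np.sum((theta_mu - c) ** 2) + 0.5 * mu * np.sum((theta_mu - theta0) ** 2)
bound = 0.5 * mu * np.sum((c - theta0) ** 2)   # (mu/2) ||theta* - theta0||^2
# h_mu* <= h* + (mu/2)||theta* - theta0||^2, with h* = 0 here:
ok = h_star_mu <= bound + 1e-12
```

Choosing $\mu \leq 2\varepsilon / \|\theta^0 - \theta^*\|_2^2$ then turns this bound into $h_\mu^* - h^* \leq \varepsilon$, exactly as the lemma states.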
C.2 Convergence Proof for Sliding

Lemma C.3.
Iterates of Algorithm 2 satisfy the following inequality:
\[
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 \leq (1 - \eta\mu) \left\| \zeta^k - \zeta^* \right\|_2^2 + \left( \eta\mu + \eta^2 L_A^2 - 1 \right) \left\| \zeta^k - \theta^k \right\|_2^2 + \left( \frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu} \right) \left\| \theta^k - \widehat{\theta}^k \right\|_2^2. \qquad (37)
\]
We start by using line 2 of Algorithm 2:
\[
\begin{aligned}
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 &= \left\| \operatorname{Proj}_Q(\omega^k) - \operatorname{Proj}_Q(\zeta^*) \right\|_2^2 \leq \left\| \omega^k - \zeta^* \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \omega^k - \zeta^k, \zeta^k - \zeta^* \rangle + \left\| \omega^k - \zeta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \omega^k - \zeta^k, \theta^k - \zeta^* \rangle + 2 \langle \omega^k - \zeta^k, \zeta^k - \theta^k \rangle + \left\| \omega^k - \zeta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \omega^k - \zeta^k, \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2.
\end{aligned}
\]
Using line 2 of Algorithm 2, we get
\[
\begin{aligned}
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 &\leq \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \theta^k + \eta (A(\zeta^k) - A(\theta^k)) - \zeta^k, \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \theta^k + \eta A(\zeta^k) - \zeta^k, \theta^k - \zeta^* \rangle - 2\eta \langle A(\theta^k), \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2.
\end{aligned}
\]
Using line 2 of Algorithm 2, we get
\[
\begin{aligned}
\left\| \zeta^{k+1} - \zeta^* \right\|_2^2 &\leq \left\| \zeta^k - \zeta^* \right\|_2^2 + 2 \langle \theta^k - \nu^k, \theta^k - \zeta^* \rangle - 2\eta \langle A(\theta^k), \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2 \\
&= \left\| \zeta^k - \zeta^* \right\|_2^2 - 2 \langle \widehat{\theta}^k - \nu^k, \zeta^* - \theta^k \rangle - 2\eta \langle A(\theta^k), \theta^k - \zeta^* \rangle + \left\| \omega^k - \theta^k \right\|_2^2 - \left\| \zeta^k - \theta^k \right\|_2^2 + 2 \langle \theta^k - \widehat{\theta}^k, \theta^k - \zeta^* \rangle.
\end{aligned}
\]
Using (15) we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 + 2\eta\langle B(\widehat\theta^k),\zeta^*-\theta^k\rangle - 2\eta\langle A(\theta^k),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k,\theta^k-\zeta^*\rangle
\]
\[
= \|\zeta^k-\zeta^*\|^2 - 2\eta\langle B(\theta^k),\theta^k-\zeta^*\rangle - 2\eta\langle A(\theta^k),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k)),\ \theta^k-\zeta^*\rangle.
\]
Using (12) and (14) we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - 2\eta\langle F(\theta^k),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k)),\ \theta^k-\zeta^*\rangle
\]
\[
\le \|\zeta^k-\zeta^*\|^2 - 2\eta\mu\|\theta^k-\zeta^*\|^2 - 2\eta\langle F(\zeta^*),\theta^k-\zeta^*\rangle + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + 2\langle\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k)),\ \theta^k-\zeta^*\rangle.
\]
Using (11) and Young's inequality we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - 2\eta\mu\|\theta^k-\zeta^*\|^2 + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \frac{\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \frac{2}{\eta\mu}\big\|\theta^k-\widehat\theta^k + \eta(B(\theta^k)-B(\widehat\theta^k))\big\|^2
\]
\[
\le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \|\omega^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \frac{4}{\eta\mu}\|\theta^k-\widehat\theta^k\|^2 + \frac{4\eta}{\mu}\|B(\theta^k)-B(\widehat\theta^k)\|^2.
\]
Using line 2 of Algorithm 2 we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \eta^2\|A(\zeta^k)-A(\theta^k)\|^2 - \|\zeta^k-\theta^k\|^2 + \frac{4}{\eta\mu}\|\theta^k-\widehat\theta^k\|^2 + \frac{4\eta}{\mu}\|B(\theta^k)-B(\widehat\theta^k)\|^2.
\]
Using (13a) and (13b) we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\|\theta^k-\zeta^*\|^2 + \eta^2 L_A^2\|\zeta^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \Big(\frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu}\Big)\|\theta^k-\widehat\theta^k\|^2.
\]
Using the inequality \(\|a+b\|^2 \ge \frac{2}{3}\|a\|^2 - 2\|b\|^2\) with \(a = \zeta^k-\zeta^*\), \(b = -(\zeta^k-\theta^k)\), we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le \|\zeta^k-\zeta^*\|^2 - \frac{3\eta\mu}{2}\Big[\frac{2}{3}\|\zeta^k-\zeta^*\|^2 - 2\|\zeta^k-\theta^k\|^2\Big] + \eta^2 L_A^2\|\zeta^k-\theta^k\|^2 - \|\zeta^k-\theta^k\|^2 + \Big(\frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu}\Big)\|\theta^k-\widehat\theta^k\|^2
\]
\[
= (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \big(3\eta\mu + \eta^2 L_A^2 - 1\big)\|\zeta^k-\theta^k\|^2 + \Big(\frac{4}{\eta\mu} + \frac{4\eta L_B^2}{\mu}\Big)\|\theta^k-\widehat\theta^k\|^2.
\]

Lemma C.4.
Let \(\theta^k\) be the output of the Forward-Backward-Forward algorithm for solving the auxiliary problem (15) after T iterations, started at the point \(\zeta^k\). Choosing \(\delta \in (0, 1/2]\) and the number of iterations
\[
T = O\big((1+\eta L_B)\log\tfrac{1}{\delta}\big) \tag{38}
\]
implies
\[
\|\widehat\theta^k - \theta^k\| \le 2\delta\,\|\zeta^k-\theta^k\|. \tag{39}
\]
Proof.
Problem (15) is a variational inequality of the form
\[
\langle \widehat B_k(\widehat\theta^k),\ \zeta - \widehat\theta^k\rangle \ge 0 \quad \text{for all } \zeta \in Q,
\]
where the operator \(\widehat B_k : Q \to \mathbb{R}^d\) is defined as \(\widehat B_k(\zeta) = \eta B(\zeta) + \zeta - \nu^k\). The operator \(\widehat B_k\) is 1-strongly monotone and \((1+\eta L_B)\)-Lipschitz. Hence, applying the standard linear-convergence result for the Forward-Backward-Forward algorithm with the number of iterations (38) implies
\[
\|\theta^k - \widehat\theta^k\| \le \delta\,\|\zeta^k - \widehat\theta^k\|.
\]
Moreover,
\[
\|\theta^k - \widehat\theta^k\| \le \delta\|\zeta^k - \widehat\theta^k\| \le \delta\|\zeta^k-\theta^k\| + \delta\|\theta^k-\widehat\theta^k\| \le \delta\|\zeta^k-\theta^k\| + \frac{1}{2}\|\theta^k-\widehat\theta^k\|,
\]
where the last inequality follows from \(\delta \le 1/2\). Rearranging gives (39).
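The Forward-Backward-Forward (Tseng) iteration used for the auxiliary problem can be sketched for an unconstrained strongly monotone affine operator; the matrix `M` below mimics \(\widehat B_k(\zeta) = \eta B(\zeta) + \zeta - \nu^k\), and all constants are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

# Sketch: Tseng's Forward-Backward-Forward method for the strongly monotone
# affine operator Bhat(z) = M z - b (unconstrained, so projections are identity).
# M = I + eta*(skew + PSD) mimics Bhat_k(z) = eta*B(z) + z - nu^k: it is
# 1-strongly monotone and Lipschitz. All numbers are illustrative.
eta = 0.5
S = np.array([[0.0, 1.0], [-1.0, 0.0]])      # skew-symmetric (monotone) part of B
D = np.array([[0.2, 0.0], [0.0, 0.3]])       # PSD part of B
M = np.eye(2) + eta * (S + D)                # operator matrix of Bhat
b = np.array([1.0, -2.0])
Bhat = lambda z: M @ z - b

z_star = np.linalg.solve(M, b)               # exact solution Bhat(z) = 0
L = np.linalg.norm(M, 2)                     # Lipschitz constant of Bhat
s = 0.3 / L                                  # stepsize below 1/L

z = np.zeros(2)                              # start at zeta^k = 0
for _ in range(300):
    z_half = z - s * Bhat(z)                 # forward-backward step
    z = z_half - s * (Bhat(z_half) - Bhat(z))  # correcting forward step

err0 = np.linalg.norm(0 - z_star)
err = np.linalg.norm(z - z_star)
```

The error contracts geometrically per iteration, which is the linear rate behind the \(\log(1/\delta)\) factor in (38).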
Lemma C.5.
Apply Algorithm 2 to problem (11). On each iteration of Algorithm 2, solve the auxiliary problem (15) on line 2 with T iterations of the Forward-Backward-Forward algorithm started at the point \(\zeta^k\), where T is given by (38). Choosing
\[
\delta = \min\Big\{\frac{1}{2},\ \Big[\frac{64}{\eta\mu} + \frac{64\eta L_B^2}{\mu}\Big]^{-1/2}\Big\}, \tag{40}
\]
the stepsize
\[
\eta = \min\Big\{\frac{1}{2L_A},\ \frac{1}{6\mu}\Big\}, \tag{41}
\]
and the number of iterations
\[
N = \frac{1}{\eta\mu}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2} \tag{42}
\]
implies
\[
\|\zeta^N - \zeta^*\| \le \varepsilon. \tag{43}
\]
Proof.
Using (37) together with (39), we get
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \big(3\eta\mu+\eta^2L_A^2-1\big)\|\zeta^k-\theta^k\|^2 + 4\delta^2\Big(\frac{4}{\eta\mu}+\frac{4\eta L_B^2}{\mu}\Big)\|\zeta^k-\theta^k\|^2.
\]
Plugging in \(\delta\) defined by (40) gives
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \big(3\eta\mu+\eta^2L_A^2-1\big)\|\zeta^k-\theta^k\|^2 + \frac{1}{4}\|\zeta^k-\theta^k\|^2 = (1-\eta\mu)\|\zeta^k-\zeta^*\|^2 + \Big(3\eta\mu+\eta^2L_A^2-\frac{3}{4}\Big)\|\zeta^k-\theta^k\|^2.
\]
Plugging in \(\eta\) defined by (41) gives
\[
\|\zeta^{k+1}-\zeta^*\|^2 \le (1-\eta\mu)\|\zeta^k-\zeta^*\|^2.
\]
Telescoping and using N defined by (42) yields (43), which concludes the proof.

Corollary C.6.
Without loss of generality assume \(L_A \le L_B\). The total number of computations of \(A(\zeta)\) is
\[
N_A = N = O\Big(\Big(\frac{L_A}{\mu}+1\Big)\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big).
\]
The total number of computations of \(B(\zeta)\) is
\[
N_B = N \times T = O\Big((1+\eta L_B)\Big(\frac{L_A}{\mu}+1\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big) = O\Big(\Big(1+\min\Big\{\frac{L_B}{\mu},\frac{L_B}{L_A}\Big\} + \frac{L_A}{\mu} + \min\Big\{\frac{L_BL_A}{\mu^2},\frac{L_B}{\mu}\Big\}\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big)
\]
\[
\le O\Big(\Big(\frac{L_A}{\mu}+\frac{L_B}{\mu}+1\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big) \le O\Big(\Big(\frac{L_B}{\mu}+1\Big)\log\frac{1}{\delta}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\Big).
\]
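The choice of N in (42) can be sanity-checked numerically: a \((1-\eta\mu)\)-contraction run for \(N = \frac{1}{\eta\mu}\log\frac{\|\zeta^0-\zeta^*\|^2}{\varepsilon^2}\) iterations drives the squared distance below \(\varepsilon^2\), since \((1-x)^N \le e^{-xN}\). The constants below are illustrative:

```python
import math

# Sanity check of (42): after N = (1/(eta*mu)) * log(R2/eps^2) steps of a
# (1 - eta*mu)-contraction, the squared distance is below eps^2.
# All constants are illustrative placeholders.
mu, L_A = 0.1, 5.0
eta = min(1 / (2 * L_A), 1 / (6 * mu))   # stepsize as in the reconstructed (41)
R2 = 100.0                               # ||zeta^0 - zeta^*||^2
eps = 1e-3

N = math.ceil(1 / (eta * mu) * math.log(R2 / eps**2))
final_sq_dist = (1 - eta * mu) ** N * R2
ok = final_sq_dist <= eps**2
```

This is exactly the telescoping step at the end of the proof of Lemma C.5.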
D Proof of lower bounds in Theorem 4.1
In contrast to the main part, in this proof we adhere to the following notation:
\[
\min_x \max_y f(x,y) = \frac{1}{M}\sum_{m=1}^{M} f_m(x,y). \tag{44}
\]
The next assumption is needed to obtain lower bounds for a class of first-order methods. To describe this class, we use a Black-Box model similar to [27]. We assume that one local iteration costs t time units and one communication round costs τ time units. Information can be transmitted only along the undirected edges of the network. Communications and local updates can take place in parallel and asynchronously. More formally:

Assumption D.1.
For each machine m and each moment of time T, define the reachable sets \(H^x_{m,T} \subseteq \mathbb{R}^{n_x}\), \(H^y_{m,T} \subseteq \mathbb{R}^{n_y}\).

Initialization.
The starting points are \(x^0 = 0\), \(y^0 = 0\); hence \(H^x_{m,0} = \{0\}\) and \(H^y_{m,0} = \{0\}\) for all m.

Connection.
The reachable sets are the union of what can be obtained by a communication, \(\bar H^x_{m,T}, \bar H^y_{m,T}\), and by a local step, \(\widehat H^x_{m,T}, \widehat H^y_{m,T}\):
\[
H^x_{m,T} \subseteq \bar H^x_{m,T} \cup \widehat H^x_{m,T}, \qquad H^y_{m,T} \subseteq \bar H^y_{m,T} \cup \widehat H^y_{m,T}.
\]

Communication.
We have communication reachable sets \(\bar H^x_{m,T}, \bar H^y_{m,T}\), where the communication with the neighbors starts τ units of time earlier:
\[
\bar H^x_{m,T} \subseteq \mathrm{span}\bigcup_{(i,m)\in\mathcal{E}} H^x_{i,T-\tau}, \qquad \bar H^y_{m,T} \subseteq \mathrm{span}\bigcup_{(i,m)\in\mathcal{E}} H^y_{i,T-\tau}.
\]
The main differences between decentralized and centralized optimization are the equivalence of the nodes in the decentralized case and the fact that communications can be unscheduled. This allows a stronger use of the properties of the connection graph. One of the most popular communication schemes of this kind is the gossip algorithm [5], [23]. This approach uses a certain matrix W (see Assumption 2.2): during communications the local vectors are "weighted" by multiplication with W. The convergence of most gossip-based algorithms is governed by this matrix, and our bounds are no exception. In particular, we use the condition number
\[
\chi = \chi(W) = \frac{\lambda_1(W)}{\lambda_{M-1}(W)}.
\]

Local computation.
Locally, each device can compute \(\nabla_x f_m(x,y,\xi)\) and \(\nabla_y f_m(x,y,\xi)\) for all \(x \in H^x_{m,T-t}\), \(y \in H^y_{m,T-t}\). Then the local update is
\[
\widehat H^x_{m,T} \subseteq \mathrm{span}\big\{x,\ \nabla_x f_m(x,y,\xi) :\ \forall x \in H^x_{m,T-t},\ y \in H^y_{m,T-t}\big\}, \qquad
\widehat H^y_{m,T} \subseteq \mathrm{span}\big\{y,\ \nabla_y f_m(x,y,\xi) :\ \forall x \in H^x_{m,T-t},\ y \in H^y_{m,T-t}\big\}.
\]

Output.
When the algorithm stops (after T units of time), we have M local outputs \(x^f_m \in H^x_{m,T}\), \(y^f_m \in H^y_{m,T}\). Suppose the final global output is calculated as follows:
\[
x^f \in \mathrm{span}\Big\{\bigcup_{m=1}^M H^x_{m,T}\Big\}, \qquad y^f \in \mathrm{span}\Big\{\bigcup_{m=1}^M H^y_{m,T}\Big\}.
\]
The proofs of the theorems proceed by designing examples of "bad" functions, as well as a "bad" arrangement of them on the devices. Our example builds on the example function for the non-distributed case from [31]. Let \(B \subset V\) be a subset of the nodes of G. For \(d > 0\) we define \(B_d = \{v \in V : d(B,v) \ge d\}\), where \(d(B,v)\) is the distance between the set B and the node v. Then we construct the following arrangement of bilinear functions on the nodes:
\[
f_m(x,y) = \begin{cases}
f_1(x,y) = \dfrac{M}{2|B_d|}\cdot L\, x^\top A_1 y + \dfrac{\mu_x}{2}\|x\|^2 - \dfrac{\mu_y}{2}\|y\|^2 - \dfrac{M}{|B_d|}\cdot\dfrac{L^2}{\sqrt{\mu_x\mu_y}}\, e_1^\top y, & m \in B_d,\\[2mm]
f_2(x,y) = \dfrac{M}{2|B|}\cdot L\, x^\top A_2 y + \dfrac{\mu_x}{2}\|x\|^2 - \dfrac{\mu_y}{2}\|y\|^2, & m \in B,\\[2mm]
f_3(x,y) = \dfrac{\mu_x}{2}\|x\|^2 - \dfrac{\mu_y}{2}\|y\|^2, & \text{otherwise},
\end{cases} \tag{45}
\]
where \(e_1 = (1,0,\dots,0)^\top\), and \(A_1\), \(A_2\) are obtained by splitting the rows of the chain matrix \(2A\) (with A as in (47) below) into the odd-indexed and the even-indexed groups, respectively, the remaining rows being zero; in particular, \(A_1 + A_2 = 2A\) and \(A_1 e_1 = A_1^\top e_1 = e_1\).

Lemma D.2. If \(B_d \ne \emptyset\), then for any procedure that satisfies Assumption D.1, after T units of time only the first
\[
k = \Big\lfloor \frac{T-2t}{t+d\tau} \Big\rfloor + 2
\]
coordinates of the global output can be non-zero; the remaining n − k coordinates are strictly equal to zero.

Proof.
Consider an arbitrary moment T. According to Assumption D.1, one local step changes \(H^x\), \(H^y\) as follows:
\[
H^x_{m,T+t} \subseteq \begin{cases} \mathrm{span}\{x,\ A_1 y\}, & m\in B_d\\ \mathrm{span}\{x,\ A_2 y\}, & m\in B\\ \mathrm{span}\{x\}, & \text{otherwise,} \end{cases} \qquad
H^y_{m,T+t} \subseteq \begin{cases} \mathrm{span}\{e_1,\ y,\ A_1^\top x\}, & m\in B_d\\ \mathrm{span}\{y,\ A_2^\top x\}, & m\in B\\ \mathrm{span}\{y\}, & \text{otherwise,} \end{cases}
\]
for all \(x \in H^x_{m,T}\), \(y \in H^y_{m,T}\). Then, for any number \(H \in \mathbb{N}\) of local iterations (without communications),
\[
H^x_{m,T+tH} \subseteq \begin{cases} \mathrm{span}\{x,\ A_1 y,\ A_1A_1^\top x,\ A_1(A_1^\top A_1)y,\ \dots,\ (A_1A_1^\top)^H x,\ A_1(A_1^\top A_1)^H y\}, & m\in B_d\\ \mathrm{span}\{x,\ A_2 y,\ A_2A_2^\top x,\ A_2(A_2^\top A_2)y,\ \dots,\ (A_2A_2^\top)^H x,\ A_2(A_2^\top A_2)^H y\}, & m\in B\\ \mathrm{span}\{x\}, & \text{otherwise,} \end{cases}
\]
and similarly for \(H^y_{m,T+tH}\), with the roles of \(A_i\) and \(A_i^\top\) exchanged and with \(e_1\) added for \(m \in B_d\). Here we additionally use that \(A_1 e_1 = A_1^\top e_1 = e_1\). Now let \(H^x_{m,T}, H^y_{m,T} \subseteq \mathrm{span}\{e_1,\dots,e_k\}\) for all m. Then, for any number \(H\in\mathbb{N}\) of local iterations (without communications),
\[
H^x_{m,T+tH},\ H^y_{m,T+tH} \subseteq \begin{cases}
\mathrm{span}\{e_1,\dots,e_{2\lceil (k+1)/2\rceil - 1}\} = \begin{cases}\mathrm{span}\{e_1,\dots,e_k\}, & \text{odd } k\\ \mathrm{span}\{e_1,\dots,e_{k+1}\}, & \text{even } k,\end{cases} & m\in B_d\\[2mm]
\mathrm{span}\{e_1,\dots,e_{2\lceil k/2\rceil}\} = \begin{cases}\mathrm{span}\{e_1,\dots,e_{k+1}\}, & \text{odd } k\\ \mathrm{span}\{e_1,\dots,e_k\}, & \text{even } k,\end{cases} & m\in B\\[2mm]
\mathrm{span}\{e_1,\dots,e_k\}, & \text{otherwise.}
\end{cases} \tag{46}
\]
When deriving the last statement we used the explicit form of the matrices \(A_1^\top A_1\), \(A_2^\top A_2\), \(A_1A_1^\top\), \(A_2A_2^\top\) (all four products are tridiagonal).

This fact leads to the main idea of the proof. At the initial moment all coordinates are zero, since the starting points \(x^0, y^0\) equal 0. Using only local iterations (at least 2), one can achieve that at the nodes of \(B_d\) only the first coordinates of x and y can be non-zero, while the remaining coordinates stay strictly zero; at the other nodes everything remains strictly zero. Without communications the situation does not change. Therefore, at least d communication rounds are needed in order to obtain a non-zero first coordinate at some node of B. Using (46), by local iterations (at least 1) at a node of the set B, one can make the first and second coordinates non-zero. The process then continues according to (46). Hence, to obtain at least one node \(i \in V\) with reachable sets \(H^x_m, H^y_m \subseteq \mathrm{span}\{e_1,\dots,e_k\}\), we need at least k+1 local steps (at least 2 steps at the beginning, when we start from 0, and then at least 1 local step in the other cases, see the previous paragraph), as well as \((k-2)d\) communication rounds. In other words,
\[
\max_m \Big[\max\{k\in\mathbb{N} : \exists x\in H^x_{m,T},\ \exists y\in H^y_{m,T} \text{ s.t. } x_k \ne 0,\ y_k\ne 0\}\Big] \le \Big\lfloor \frac{T-2t}{t+d\tau}\Big\rfloor + 2.
\]
According to Assumption D.1, the final global output is the span of the union of all local outputs, whence the statement of the lemma follows. □
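The coordinate-growth mechanism behind the proof, namely that tridiagonal matrices extend the support of a vector by at most one coordinate per application, can be illustrated directly (the chain matrix `C` below is an illustrative stand-in for the operators built from \(A_1\), \(A_2\)):

```python
import numpy as np

# Tridiagonal matrices map vectors supported on the first k coordinates into
# vectors supported on the first k+1 coordinates: this is the mechanism of
# Lemma D.2. The chain matrix C is an illustrative stand-in.
n, k = 10, 4
C = np.eye(n) - np.diag(np.ones(n - 1), 1)   # 1 on diagonal, -1 on superdiagonal
CtC = C.T @ C                                # tridiagonal product

v = np.zeros(n)
v[:k] = 1.0                                  # supported on the first k coordinates
w = CtC @ v
support = np.nonzero(np.abs(w) > 1e-12)[0]   # indices of non-zero entries
```

One application reaches coordinate k (0-indexed) and no further, so each local round can "unlock" at most one new coordinate.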
For the global objective function we have
\[
f(x,y) = \frac{1}{M}\sum_{m=1}^M f_m(x,y) = \frac{1}{M}\big(|B_d|\cdot f_1(x,y) + |B|\cdot f_2(x,y) + (M-|B_d|-|B|)\cdot f_3(x,y)\big) = L\, x^\top A y + \frac{\mu_x}{2}\|x\|^2 - \frac{\mu_y}{2}\|y\|^2 - \frac{L^2}{\sqrt{\mu_x\mu_y}}\, e_1^\top y, \quad \text{with } A = \frac{1}{2}(A_1+A_2). \tag{47}
\]
The previous lemma describes the structure of any solution obtained by a procedure that satisfies Assumption D.1. The next lemma concerns the approximate solution of problem (47).

Lemma D.3 (Lemma 3.3 from [31]). Let \(\alpha = \frac{\mu_x\mu_y}{L^2}\) and let \(q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big) \in (0,1)\) be the smallest root of \(q^2-(2+\alpha)q+1=0\). Introduce the approximation \(\bar y^*\) with \(\bar y^*_i = \frac{q^i}{1-q}\). Then the error between the approximation and the true solution of (47) can be bounded as
\[
\|\bar y^* - y^*\| \le \frac{q^{n+1}}{\alpha(1-q)}.
\]
Proof:
Let us write the dual function for (47):
\[
g(y) = -\frac{1}{2}\, y^\top\Big(\frac{L^2}{\mu_x}A^\top A + \mu_y I\Big)y + \frac{L^2}{\mu_x}\, e_1^\top y,
\]
where one easily finds that \(A^\top A\) is the tridiagonal matrix with diagonal \((1,2,\dots,2)\) and off-diagonal entries −1. The optimality condition \(\nabla g(y^*) = 0\) of the dual problem reads
\[
\Big(\frac{L^2}{\mu_x}A^\top A + \mu_y I\Big)y^* = \frac{L^2}{\mu_x}e_1, \quad \text{or} \quad \big(A^\top A + \alpha I\big) y^* = e_1.
\]
Let us write this as a system of equations:
\[
(1+\alpha)y^*_1 - y^*_2 = 1, \qquad -y^*_{i-1} + (2+\alpha)y^*_i - y^*_{i+1} = 0 \ (i=2,\dots,n-1), \qquad -y^*_{n-1} + (2+\alpha)y^*_n = 0.
\]
Note that
\[
(1+\alpha)\bar y^*_1 - \bar y^*_2 = 1, \qquad -\bar y^*_{i-1} + (2+\alpha)\bar y^*_i - \bar y^*_{i+1} = 0 \ (i=2,\dots,n-1), \qquad -\bar y^*_{n-1} + (2+\alpha)\bar y^*_n = \frac{q^{n+1}}{1-q},
\]
or \(\big(A^\top A + \alpha I\big)\bar y^* = e_1 + \frac{q^{n+1}}{1-q}e_n\). Hence
\[
\bar y^* - y^* = \big(A^\top A + \alpha I\big)^{-1}\frac{q^{n+1}}{1-q}\,e_n.
\]
Together with the fact that \(\alpha^{-1}I \succeq \big(A^\top A + \alpha I\big)^{-1} \succ 0\), this implies the statement of the lemma. □

Now we formulate a key lemma (similar to Lemma 3.4 from [31]).
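Before stating it, the closed form of Lemma D.3 can be sanity-checked numerically: with the tridiagonal \(A^\top A\) reconstructed above, \(\bar y^*_i = q^i/(1-q)\) nearly solves \((A^\top A + \alpha I)y = e_1\) within the stated error bound (sizes and constants below are illustrative):

```python
import numpy as np

# Check Lemma D.3: with AtA = tridiag(-1, diag(1,2,...,2), -1),
# ybar_i = q^i/(1-q) nearly solves (AtA + alpha*I) y = e1, and
# ||ybar - y*|| <= q^(n+1)/(alpha*(1-q)). Sizes/constants are illustrative.
n, alpha = 30, 0.05
q = 0.5 * (2 + alpha - np.sqrt(alpha**2 + 4 * alpha))  # smallest root of q^2-(2+alpha)q+1=0

AtA = 2 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
AtA[0, 0] = 1.0                                        # first diagonal entry is 1
e1 = np.zeros(n)
e1[0] = 1.0

y_star = np.linalg.solve(AtA + alpha * np.eye(n), e1)  # exact dual solution
y_bar = q ** np.arange(1, n + 1) / (1 - q)             # geometric approximation

err = np.linalg.norm(y_bar - y_star)
bound = q ** (n + 1) / (alpha * (1 - q))
```

The geometric decay of \(y^*\) is exactly what makes the last coordinates of the solution exponentially hard to reach in the distributed lower bound.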
Lemma D.4. Let \(B_d \ne \emptyset\) and consider a distributed saddle-point problem of the form (45). For any time T one can choose a problem size
\[
n \ge \max\Big\{\log_q\Big(\frac{\alpha}{\sqrt{2}}\Big),\ k\Big\}, \quad \text{where } k = \Big\lfloor\frac{T-2t}{t+d\tau}\Big\rfloor + 2,\ \ \alpha = \frac{\mu_x\mu_y}{L^2},\ \ q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big)\in(0,1),
\]
such that for any procedure satisfying Assumption D.1 the following lower bound on the solution after T units of time holds:
\[
\|x_T - x^*\|^2 + \|y_T - y^*\|^2 \ge \frac{q^{2\left(\frac{T}{t+d\tau}+2\right)}}{4}\,\|y^0 - y^*\|^2.
\]
Proof:
By Lemma D.3, with q < 1 and k ≤ n we have
\[
\|y_T - \bar y^*\| \ge \sqrt{\sum_{j=k+1}^n (\bar y^*_j)^2} = \frac{q^k}{1-q}\sqrt{q^2+q^4+\dots+q^{2(n-k)}} \ge \frac{q^k}{\sqrt{2}(1-q)}\sqrt{q^2+q^4+\dots+q^{2n}} = \frac{q^k}{\sqrt{2}}\|\bar y^*\| = \frac{q^k}{\sqrt{2}}\|y^0-\bar y^*\|.
\]
By Lemma D.3, for \(n \ge \log_q\big(\frac{\alpha}{\sqrt{2}}\big)\) we can guarantee that \(\bar y^* \approx y^*\) (for more details see [31]) and
\[
\|x_T-x^*\|^2 + \|y_T-y^*\|^2 \ge \|y_T-y^*\|^2 \ge \frac{q^{2k}}{4}\|y^0-y^*\|^2 = \frac{q^{2\left(\lfloor\frac{T-2t}{t+d\tau}\rfloor+2\right)}}{4}\|y^0-y^*\|^2 \ge \frac{q^{2\left(\frac{T}{t+d\tau}+2\right)}}{4}\|y^0-y^*\|^2.
\]

Theorem D.5.
Let \(L, \mu_x, \mu_y > 0\), \(\chi \ge 1\), \(\varepsilon > 0\). Additionally, assume that \(L > \sqrt{\mu_x\mu_y}\). Then there exists a distributed saddle-point problem of M functions (M is defined in the proof) with decentralized architecture and gossip matrix W for which the following statements are true:
• the gossip matrix W has \(\chi(W) = \chi\);
• \(f = \frac{1}{M}\sum_{m=1}^M f_m : \mathbb{R}^n\times\mathbb{R}^n \to \mathbb{R}\) is \((\mu_x, L, L, \mu_y)\)-smooth and \(\mu_x\)-strongly-convex–\(\mu_y\)-strongly-concave;
• the size \(n \ge \max\big\{\log_q\big(\frac{\alpha}{\sqrt 2}\big),\ k\big\}\), where \(k = \big\lfloor\frac{T-2t}{t+\tau}\big\rfloor + 2\), \(\alpha = \frac{\mu_x\mu_y}{L^2}\) and \(q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big) \in (0,1)\).
Then, for any procedure satisfying Assumption D.1, the time to achieve accuracy ε of the solution in the final global output is bounded as
\[
T = \Omega\Big(\frac{L}{\sqrt{\mu_x\mu_y}}\big(t + \sqrt{\chi}\,\tau\big)\ln(\varepsilon^{-1})\Big).
\]
Proof:
The proof is similar to the proof of Theorem 2 from [27]. Let \(\gamma_M = \frac{1-\cos(\pi/M)}{1+\cos(\pi/M)}\); this is a decreasing sequence of positive numbers. Since \(\gamma_2 = 1\) and \(\lim_{M\to\infty} \gamma_M = 0\), there exists \(M \ge 2\) such that \(\gamma_M \ge \frac{1}{\chi} > \gamma_{M+1}\).

If \(M \ge 3\), consider the linear graph on M vertices with edge weights \(w_{1,2} = 1-a\) and \(w_{i,i+1} = 1\) for \(i \ge 2\). Let \(B = \{v_1\}\) and \(d = M-1\); then \(B_d = \{v_M\}\). Define \(f_m\) as in (45). Hence one can use the previous lemma and get
\[
\|x_T - x^*\|^2 + \|y_T - y^*\|^2 \ge \frac{q^{2\left(\frac{T}{t+d\tau}+2\right)}}{4}\,\|y^0 - y^*\|^2.
\]
Let \(W_a\) be the Laplacian of the weighted graph G. Note that for \(a = 0\) we have \(\chi(W_a) = 1/\gamma_M\), while as \(a \to 1\) the graph disconnects and \(\chi(W_a) \to \infty\). Hence there exists \(a \in [0,1)\) such that \(\chi(W_a) = \chi\). Moreover, \(\frac{1}{\chi} > \gamma_{M+1} = \tan^2\frac{\pi}{2(M+1)} \ge \frac{1}{(M+1)^2}\), so \(M+1 > \sqrt{\chi}\) and hence \(d = M-1 \ge \sqrt{\chi}/2\). Then
\[
\frac{T}{t+d\tau} = \Omega\big(\log_{1/q}(\varepsilon^{-1})\big), \qquad T = \Omega\Big(\frac{(t+\sqrt{\chi}\,\tau)\ln(\varepsilon^{-1})}{\ln(1/q)}\Big).
\]
If \(M = 2\), we take a complete graph in which the weight of one edge equals a. Let \(B = \{v_1\}\) and \(d = 1\), and let \(W_a\) be the Laplacian of G. Varying \(a \in [0,1]\) changes \(\chi(W_a)\) continuously between its extreme values, hence there exists a with \(\chi(W_a) = \chi\); moreover, \(d = 1 \ge \sqrt{\chi}/2\), since \(\chi < 3\) in this case. Finally, we have the same estimate \(T = \Omega\big(\frac{(t+\sqrt{\chi}\tau)\ln(\varepsilon^{-1})}{\ln(1/q)}\big)\). Next we bound
\[
\frac{1}{\ln(1/q)} = \frac{1}{\ln\big(1+\frac{1-q}{q}\big)} \ge \frac{q}{1-q} = \frac{2+\alpha-\sqrt{\alpha^2+4\alpha}}{\sqrt{\alpha^2+4\alpha}-\alpha} \ge \frac{1}{2}\sqrt{\frac{L^2}{\mu_x\mu_y}}.
\]
Finally, we get
\[
T = \Omega\Big(\frac{L}{\sqrt{\mu_x\mu_y}}\big(t+\sqrt{\chi}\,\tau\big)\ln(\varepsilon^{-1})\Big). \quad\square
\]

Regularized problem.
To obtain lower bounds for the convex-concave case, we use the standard trick that reduces a convex-concave problem to a strongly-convex–strongly-concave one. We work with the regularized problem
\[
f_{\mathrm{reg}}(x,y) = f(x,y) + \frac{\varepsilon}{4\Omega_x^2}\|x-x^0\|^2 - \frac{\varepsilon}{4\Omega_y^2}\|y-y^0\|^2,
\]
where \(\Omega_x, \Omega_y\) are the Euclidean diameters of the sets X, Y. It turns out that if f(x,y) is a convex-concave function, then \(f_{\mathrm{reg}}(x,y)\) is \(\big(\frac{\varepsilon}{2\Omega_x^2}, \frac{\varepsilon}{2\Omega_y^2}\big)\)-strongly-convex–strongly-concave. Solving the new problem with accuracy ε/2 then yields a solution of the original problem with accuracy ε. Example (45) is rewritten as follows:
\[
f_m(x,y) = \begin{cases}
f_1(x,y) = \dfrac{M}{2|B_d|}\cdot L\,x^\top A_1 y - \dfrac{M}{|B_d|}\cdot\dfrac{2L^2\Omega_x\Omega_y}{\varepsilon}\,e_1^\top y, & m\in B_d,\\[2mm]
f_2(x,y) = \dfrac{M}{2|B|}\cdot L\,x^\top A_2 y, & m\in B,\\[2mm]
f_3(x,y) = 0, & \text{otherwise},
\end{cases}
\]
\[
f_{\mathrm{reg},m}(x,y) = \begin{cases}
\dfrac{M}{2|B_d|}\cdot L\,x^\top A_1 y + \dfrac{\varepsilon}{4\Omega_x^2}\|x\|^2 - \dfrac{\varepsilon}{4\Omega_y^2}\|y\|^2 - \dfrac{M}{|B_d|}\cdot\dfrac{2L^2\Omega_x\Omega_y}{\varepsilon}\,e_1^\top y, & m\in B_d,\\[2mm]
\dfrac{M}{2|B|}\cdot L\,x^\top A_2 y + \dfrac{\varepsilon}{4\Omega_x^2}\|x\|^2 - \dfrac{\varepsilon}{4\Omega_y^2}\|y\|^2, & m\in B,\\[2mm]
\dfrac{\varepsilon}{4\Omega_x^2}\|x\|^2 - \dfrac{\varepsilon}{4\Omega_y^2}\|y\|^2, & \text{otherwise}.
\end{cases}
\]
The following theorem is true:
Theorem D.6.
Let \(L > 0\), \(\Omega_x, \Omega_y > 0\), \(\chi \ge 1\), \(\varepsilon > 0\). Additionally, assume that \(L > \frac{\varepsilon}{2\Omega_x\Omega_y}\). Then there exists a distributed saddle-point problem of M functions (M is defined in the proof) with decentralized architecture and gossip matrix W for which the following statements are true:
• the gossip matrix W has \(\chi(W) = \chi\);
• \(f = \frac{1}{M}\sum_{m=1}^M f_m : \mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}\) is \(\big(\frac{\varepsilon}{2\Omega_x^2}, L, L, \frac{\varepsilon}{2\Omega_y^2}\big)\)-smooth and convex–concave;
• the size \(n \ge \max\big\{\log_q\big(\frac{\alpha}{\sqrt 2}\big),\ k\big\}\), where \(k = \big\lfloor\frac{T-2t}{t+\tau}\big\rfloor+2\), \(\alpha = \frac{\varepsilon^2}{4\Omega_x^2\Omega_y^2L^2}\) and \(q = \frac{1}{2}\big(2+\alpha-\sqrt{\alpha^2+4\alpha}\big)\in(0,1)\).
Then, for any procedure satisfying Assumption D.1, the time to achieve accuracy ε of the solution in the final global output is bounded as
\[
T = \tilde\Omega\Big(\frac{L\,\Omega_x\Omega_y}{\varepsilon}\big(t+\sqrt{\chi}\,\tau\big)\Big).
\]

E Missing Proofs from Section 6
E.1 Proof of Theorem 6.2
Proof of Theorem 6.2. The smoothness properties of F(u,v) follow from Lemma 6.1. The bound on the duality gap follows from the theory of Mirror-Prox with proper \(R_U\), \(R_V\) and \(L_{uu}\), \(L_{uv}\), \(L_{vu}\), \(L_{vv}\). To estimate R, we calculate the \(\ell_2\)-norm of the gradient of the objective in (18):
\[
\|\nabla_p f(\mathbf{x},p^*,y)\|_2^2 = \Big\|\frac{2\|d\|_\infty}{m}\{[y_i]_{1\dots n}\}_{i=1}^m\Big\|_2^2 = \sum_{i=1}^m \frac{4\|d\|_\infty^2}{m^2}\|[y_i]_{1\dots n}\|_2^2 \le \frac{4\|d\|_\infty^2\, n}{m}.
\]
Substituting \(\nabla_p f(\mathbf{x},p^*,y)\) into (23) and using the fact that \(\lambda^+_{\min}\big(\frac{1}{m}\mathbf{W}\big) = \frac{1}{m}\lambda^+_{\min}(W)\), we obtain the estimate for R. The complexity of one iteration of Algorithm 4 per node is \(O(n^2)\), since the number of non-zero elements in the matrix A is \(2n^2\). Multiplying this by the number of iterations N and roughly estimating, we obtain the total per-node complexity \(O(n^2\sqrt{n\ln n}/\varepsilon)\), which finishes the proof. □

E.2 Proof of Lemma 6.1
Proof of Lemma 6.1. F(u,v) is bilinear, hence \(L_{uu} = L_{vv} = 0\). Next, we estimate \(L_{uv}\) and \(L_{vu}\). By the definition of \(L_{uv}\) and the spaces U, V we have
\[
\|\nabla_u F(u,v) - \nabla_u F(u,v')\|_{U^*} \le L_{uv}\,\|v-v'\|_V. \tag{48}
\]
From the definition of the dual norm, it follows that
\[
\|\nabla_u F(u,v) - \nabla_u F(u,v')\|_{U^*} = \max_{\|u\|_U\le 1}\langle u,\ \nabla_u F(u,v)-\nabla_u F(u,v')\rangle.
\]
From this and (48) we get
\[
\max_{\|u\|_U\le 1}\langle u,\ \nabla_u F(u,v)-\nabla_u F(u,v')\rangle \le L_{uv}\,\|v-v'\|_V. \tag{49}
\]
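Each u-, s-, x- and p-update in Algorithm 4 below is an entropic mirror step on a probability simplex: a multiplicative-weights update followed by normalization. A minimal sketch of this building block (the gradient vector `g` and the stepsize are illustrative placeholders, not the algorithm's actual values):

```python
import numpy as np

# Entropic mirror step on the simplex, the building block of the u-, s-, x-
# and p-updates of Algorithm 4: x <- x * exp(-gamma * g) / normalizer.
# Gradient g and stepsize gamma are illustrative placeholders.
def mirror_step(x, g, gamma):
    w = x * np.exp(-gamma * (g - g.max()))   # shift by max: numerical stability,
    return w / w.sum()                       # cancelled by the normalization

x = np.full(5, 1 / 5)                        # uniform starting point
g = np.array([1.0, 0.5, -0.2, 0.0, 2.0])
x_new = mirror_step(x, g, gamma=0.7)
```

The output stays a strictly positive probability vector, and coordinates with larger gradient entries are down-weighted, which is exactly the behavior of the exponentiated updates in the listing below.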
Algorithm 4
Distributed Mirror-Prox for Wasserstein Barycenters
Require: measures \(q_1,\dots,q_m\), linearized cost vector d, incidence matrix A, stepsize η, starting points \(p_i^0 = \mathbf{1}_n/n\), \(x_1^0 = \dots = x_m^0 = \mathbf{1}_{n^2}/n^2\), \(y_1^0 = \dots = y_m^0 = 0\), \(z_1^0 = \dots = z_m^0 = 0\); \(\alpha = 2\|d\|_\infty\,\eta\,(mn^2+R^2)/m\), \(\beta = 6\|d\|_\infty\eta/\ln n\), \(\gamma = 3\eta/\ln n\), \(\theta = \eta\,(mn^2+R^2)/m\).
for \(k = 1,\dots,N-1\) do
  for \(i = 1,\dots,m\) do
    \(u_i^{k+1} = \dfrac{x_i^k \odot \exp\{-\gamma(d + 2\|d\|_\infty A^\top y_i^k)\}}{\sum_{l}[x_i^k]_l \exp\{-\gamma([d]_l + 2\|d\|_\infty[A^\top y_i^k]_l)\}}\)
    \(s_i^{k+1} = \dfrac{p_i^k \odot \exp\{\beta[y_i^k]_{1\dots n} - \frac{\eta}{\ln n}\sum_{j=1}^m W_{ij}z_j^k\}}{\sum_{l=1}^n [p_i^k]_l \exp\{\beta[y_i^k]_l - \frac{\eta}{\ln n}[\sum_{j=1}^m W_{ij}z_j^k]_l\}}\)
    \(v_i^{k+1} = y_i^k + \alpha\big(Ax_i^k - \big(\begin{smallmatrix}p_i^k\\ q_i\end{smallmatrix}\big)\big)\); project \(v_i^{k+1}\) onto \([-1,1]^{2n}\)
    \(\lambda_i^{k+1} = z_i^k + \theta\sum_{j=1}^m W_{ij}p_j^k\)
    \(x_i^{k+1} = \dfrac{x_i^k \odot \exp\{-\gamma(d+2\|d\|_\infty A^\top v_i^{k+1})\}}{\sum_{l}[x_i^k]_l\exp\{-\gamma([d]_l+2\|d\|_\infty[A^\top v_i^{k+1}]_l)\}}\)
    \(p_i^{k+1} = \dfrac{p_i^k\odot\exp\{\beta[v_i^{k+1}]_{1\dots n} - \frac{\eta}{\ln n}\sum_{j=1}^m W_{ij}\lambda_j^{k+1}\}}{\sum_{l=1}^n [p_i^k]_l\exp\{\beta[v_i^{k+1}]_l - \frac{\eta}{\ln n}[\sum_{j=1}^m W_{ij}\lambda_j^{k+1}]_l\}}\)
    \(y_i^{k+1} = y_i^k + \alpha\big(Au_i^{k+1} - \big(\begin{smallmatrix}s_i^{k+1}\\ q_i\end{smallmatrix}\big)\big)\); project \(y_i^{k+1}\) onto \([-1,1]^{2n}\)
    \(z_i^{k+1} = z_i^k + \theta\sum_{j=1}^m W_{ij}s_j^{k+1}\)
  end for
end for
Ensure: \(\widetilde u = \frac{1}{N}\sum_{k=1}^N\big((u_1^k)^\top,\dots,(u_m^k)^\top,(s_1^k)^\top,\dots,(s_m^k)^\top\big)^\top\), \(\widetilde v = \frac{1}{N}\sum_{k=1}^N\big((v_1^k)^\top,\dots,(v_m^k)^\top,(\lambda_1^k)^\top,\dots,(\lambda_m^k)^\top\big)^\top\).

By the definition of \(F(\cdot)\) and \(U = X\times P\) we have
\[
\nabla_u F = \binom{\nabla_x F}{\nabla_p F} = \frac{1}{m}\binom{d + 2\|d\|_\infty A^\top y}{W^\top z - 2\|d\|_\infty\{[y_i]_{1\dots n}\}_{i=1}^m}.
\]
From this and \(V \triangleq Y\times Z\),
\[
\nabla_u F(u,v) - \nabla_u F(u,v') = \frac{1}{m}\binom{2\|d\|_\infty A^\top(y-y')}{W^\top(z-z') - 2\|d\|_\infty\{[y_i-y_i']_{1\dots n}\}_{i=1}^m} = \frac{1}{m}\begin{pmatrix} 2\|d\|_\infty A & -2\|d\|_\infty E\\ 0_{mn\times mn^2} & W\end{pmatrix}^\top\binom{y-y'}{z-z'},
\]
where
\(E \in \{0,1\}^{2mn\times mn}\) is the block-diagonal matrix
\[
E = \mathrm{diag}\Big(\binom{I_n}{0_{n\times n}},\dots,\binom{I_n}{0_{n\times n}}\Big).
\]
From this it follows that \(\nabla_u F(\cdot)\) is linear in \(v - v'\), so (49) can be rewritten as
\[
L_{uv} = \max_{\|v-v'\|_V\le 1}\ \max_{\|u\|_U\le 1}\Big\langle u,\ \frac{1}{m}\begin{pmatrix}2\|d\|_\infty A & -2\|d\|_\infty E\\ 0 & W\end{pmatrix}^\top (v-v')\Big\rangle. \tag{50}
\]
By the same arguments we obtain the same expression for \(L_{vu}\), up to a rearrangement of the maxima. Next, we use the fact that the \(\ell_2\)-norm is its own conjugate norm. From this and (50) it follows that
\[
L_{uv} = \max_{\|u\|_U\le 1}\frac{1}{m}\Big\|\begin{pmatrix}2\|d\|_\infty A & -2\|d\|_\infty E\\ 0 & W\end{pmatrix} u\Big\|_2. \tag{51}
\]
We consider
\[
\max_{\|u\|_U\le 1}\frac{1}{m}\Big\|\begin{pmatrix}2\|d\|_\infty A & -2\|d\|_\infty E\\ 0 & W\end{pmatrix}\binom{\mathbf{x}}{p}\Big\|_2 = \max_{\|u\|_U\le 1}\frac{1}{m}\Big\|\binom{2\|d\|_\infty(A\mathbf{x}-Ep)}{Wp}\Big\|_2 \le \frac{1}{m}\sqrt{4\|d\|_\infty^2\max_{\|\mathbf{x}\|_X+\|p\|_P\le 1}\|A\mathbf{x}-Ep\|_2^2 + \max_{\|p\|_P\le 1}\|Wp\|_2^2}. \tag{52}
\]
We consider the first term under the square root in (52):
\[
\|A\mathbf{x}-Ep\|_2^2 = \sum_{i=1}^m\Big\|Ax_i - \binom{p_i}{0_n}\Big\|_2^2 \le \sum_{i=1}^m\|Ax_i\|_2^2 + \sum_{i=1}^m\|p_i\|_2^2.
\]
(53) The last bound holds because \(\langle Ax_i, (p_i^\top, 0_n^\top)^\top\rangle \ge 0\), as the entries of A, \(x_i\), \(p_i\) are non-negative. Next we take the maximum in (53):
\[
\max_{\|\mathbf{x}\|_X+\|p\|_P\le 1}\|A\mathbf{x}-Ep\|_2^2 \le \max_{\sum_{i=1}^m(\|x_i\|_1^2+\|p_i\|_1^2)\le 1}\Big(\sum_{i=1}^m\|Ax_i\|_2^2 + \sum_{i=1}^m\|p_i\|_2^2\Big) = \max_{(\alpha,\beta)\in\Delta_{2m}}\Big(\sum_{i=1}^m\alpha_i\max_{\|x_i\|_1\le 1}\|Ax_i\|_2^2 + \sum_{i=1}^m\beta_i\max_{\|p_i\|_1\le 1}\|p_i\|_2^2\Big). \tag{54}
\]
By the definition of the incidence matrix A we get \(Ax_i = (h_1^\top, h_2^\top)^\top\), where \(h_1\) and \(h_2\) satisfy \(\mathbf{1}^\top h_1 = \mathbf{1}^\top h_2 = \sum_{j}[x_i]_j = 1\), since \(x_i \in \Delta_{n^2}\) for all \(i = 1,\dots,m\). Thus
\[
\|Ax_i\|_2^2 = \|h_1\|_2^2 + \|h_2\|_2^2 \le \|h_1\|_1^2 + \|h_2\|_1^2 = 2. \tag{55}
\]
Since \(p_i \in \Delta_n\) for all \(i = 1,\dots,m\), we have
\[
\max_{\|p_i\|_1\le 1}\|p_i\|_2^2 \le \max_{\|p_i\|_1\le 1}\|p_i\|_1^2 = 1. \tag{56}
\]
Using (55) and (56) in (54) we get
\[
\max_{\|\mathbf{x}\|_X+\|p\|_P\le 1}\|A\mathbf{x}-Ep\|_2^2 \le \max_{(\alpha,\beta)\in\Delta_{2m}}\Big(2\sum_{i=1}^m\alpha_i + \sum_{i=1}^m\beta_i\Big) \le 2. \tag{57}
\]
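Before bounding the second term of (52), note that the argument below rests on the Kronecker identity \(\lambda_{\max}(W\otimes I_n) = \lambda_{\max}(W)\), which can be checked directly (the small 3-node path Laplacian is an illustrative choice):

```python
import numpy as np

# Check lambda_max(W kron I_n) = lambda_max(W): eigenvalues of a Kronecker
# product are products of eigenvalues, and I_n contributes only ones.
# The 3-node path Laplacian below is an illustrative choice.
W = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
n = 4
W_big = np.kron(W, np.eye(n))                # the "bold" W acting block-wise

lam_W = np.max(np.linalg.eigvalsh(W))        # eigenvalues of P3 Laplacian: 0, 1, 3
lam_big = np.max(np.linalg.eigvalsh(W_big))
```

The same property gives \(\lambda^+_{\min}\big(\frac{1}{m}\mathbf{W}\big) = \frac{1}{m}\lambda^+_{\min}(W)\), used in the proof of Theorem 6.2.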
Now we consider the second term of the r.h.s. of (52):
\[
\max_{\|p\|_P\le 1}\|\mathbf{W}p\|_2 = \max_{\sum_{i=1}^m\|p_i\|_1^2\le 1}\|\mathbf{W}p\|_2. \tag{58}
\]
The set \(\{\sum_{i=1}^m\|p_i\|_1^2\le 1\}\) is contained in the set \(\{\|p\|_2\le 1\}\), since the cross-product terms of \(\|p_i\|_1^2\) are non-negative. Thus we can relax the constraint in (58) as follows:
\[
\max_{\sum_{i=1}^m\|p_i\|_1^2\le 1}\|\mathbf{W}p\|_2 \le \max_{\|p\|_2\le 1}\|\mathbf{W}p\|_2 \triangleq \lambda_{\max}(\mathbf{W}) = \lambda_{\max}(W). \tag{59}
\]
The last equality holds because \(\mathbf{W} \triangleq W\otimes I_n\), by the properties of the Kronecker product for eigenvalues. Using (57) and (59) in (52) for the estimation of \(L_{uv}\) from (51), we get
\[
L_{uv} = L_{vu} = \sqrt{8\|d\|_\infty^2 + \lambda_{\max}^2(W)}\,/\,m. \quad \square
\]

F Numerical Experiments
F.1 Additional Figures for Gaussian Distributions

Figure 4: Convergence of Algorithm 4 on different network architectures. (a) Convergence in the function value; here \(p^*\) is the true barycenter. (b) Convergence to the consensus (\(\mathbf{W}p = 0\)) in the network.

Figure 5: Convergence of Algorithm 4 on different network architectures on a logarithmic scale. (a) Convergence in the function value. (b) Convergence to the consensus (\(\mathbf{W}p = 0\)) in the network.
Figure 6: Convergence of Algorithm 4 on different network architectures on a logarithmic scale from … to … iterations. (a) Convergence in the function value. (b) Convergence to the consensus (\(\mathbf{W}p = 0\)) in the network.

F.2 The notMNIST dataset
Figure 7: Letter ’B’ in a variety of fonts from the notMNIST dataset
F.3 Network architectures
The complete graph, the cycle graph, the star graph, and the Erdős–Rényi random graph with probability of edge creation p = 0.5.
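The four topologies can be instantiated, for example, through their graph Laplacians; the following sketch uses illustrative sizes and a fixed seed for the Erdős–Rényi sample:

```python
import numpy as np

# Laplacians of the four topologies used in the experiments (n nodes).
# Sizes and the seed are illustrative; p = 0.5 matches the text.
def laplacian(adj):
    return np.diag(adj.sum(axis=1)) - adj

def complete(n):
    return np.ones((n, n)) - np.eye(n)

def cycle(n):
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
    return adj

def star(n):
    adj = np.zeros((n, n))
    adj[0, 1:] = adj[1:, 0] = 1
    return adj

def erdos_renyi(n, p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    adj = np.triu(rng.random((n, n)) < p, 1).astype(float)
    return adj + adj.T

L = laplacian(complete(6))
lam_max = np.linalg.eigvalsh(L).max()        # complete-graph Laplacian: lambda_max = n
zero_row_sum = np.abs(L.sum(axis=1)).max()   # Laplacian rows sum to zero
```

These Laplacians can serve directly as the gossip matrix W, and their spectra determine the \(\chi(W)\) entering the convergence bounds.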