Extragradient and Extrapolation Methods with Generalized Bregman Distances for Saddle Point Problems
Hui Zhang∗

January 26, 2021
Abstract
In this work, we introduce two algorithmic frameworks, named the Bregman extragradient method and the Bregman extrapolation method, for solving saddle point problems. The proposed frameworks not only include the well-known extragradient and optimistic gradient methods as special cases, but also generate new variants such as sparse extragradient and extrapolation methods. With the help of the recent concept of relative Lipschitzness and some Bregman distance related tools, we are able to show certain upper bounds in terms of Bregman distances for "regret" measures. Further, we use those bounds to deduce the convergence rate of O(1/k) for the Bregman extragradient and Bregman extrapolation methods applied to solving smooth convex-concave saddle point problems. Our theory recovers the main discovery made in [Mokhtari et al. (2020), SIAM J. Optim., 30, pp. 3230-3251] for more general algorithmic frameworks with weaker assumptions via a conceptually different approach.

Keywords. Extragradient method, extrapolation method, Bregman distance, iteration complexity, saddle point problem
AMS subject classifications.
The extragradient method is a powerful tool for solving smooth convex-concave saddle point problems. Its original scheme was introduced by Korpelevich as early as 1976 in [13]. During the past four decades, this method has been extensively developed in many directions, such as extending its applications [20], simplifying its iterates [5], and generalizing its theory [7]. Recently, due to the fact that many game and learning problems are actually equivalent to finding saddle points of min-max optimization problems, it has attracted considerable attention in the machine learning, computer science, and optimization communities.

At the algorithmic level, many works focus on modifying the original extragradient method. There are two types of progress. The first type reduces the computational complexity of each iteration. In this direction, remarkable examples include Popov's modification of the Arrow-Hurwicz method [22], Tseng's modified forward-backward splitting method [25], Malitsky's projected reflected gradient method [17] and its generalizations in [14, 18].

∗ Department of Mathematics, National University of Defense Technology, Changsha, Hunan 410073, China. Email: [email protected]

With a conceptually different approach, our theory generalizes the recent discovery in [19], which was made by formulating the extragradient and OGDA methods as approximations of the proximal point algorithm [23].

The rest of the paper is organized as follows. In Section 2, we introduce some basic convex analysis and Bregman distance related tools. In Section 3, we propose two algorithmic frameworks with generalized Bregman distances as well as some specialized variants. In Section 4, we list a group of assumptions and establish two main iterate results for the proposed algorithmic frameworks. In Section 5, we derive the convergence rate O(1/k) for the proposed algorithmic frameworks applied to solving smooth convex-concave saddle point problems.
Concluding remarks are given in Section 6. Missing proofs are postponed to the Appendix.

In this paper, we restrict our analysis to real finite dimensional spaces R^d. We use ⟨·,·⟩ to denote the inner product and ‖·‖ to denote the Euclidean norm. For a multi-variable function f(x, y), we use ∇_x f (respectively, ∇_y f) to denote the gradient of f with respect to x (respectively, y). We present some basic notation and facts about convex analysis, which will be used in our results.
Definition 2.1.
A function φ : R^d → R is convex if for any α ∈ [0, 1] and u, v ∈ R^d, we have

φ(αu + (1−α)v) ≤ αφ(u) + (1−α)φ(v);

and strongly convex with modulus µ > 0 if for any α ∈ [0, 1] and u, v ∈ R^d, we have

φ(αu + (1−α)v) ≤ αφ(u) + (1−α)φ(v) − (µ/2)α(1−α)‖u − v‖².

Further, φ is concave if −φ is convex.

Definition 2.2.
Let φ : R^d → R be a convex function. The subdifferential of φ at u ∈ R^d is defined as

∂φ(u) := {u* ∈ R^d : φ(v) ≥ φ(u) + ⟨u*, v − u⟩, ∀v ∈ R^d}.

The elements of ∂φ(u) are called the subgradients of φ at u.

The subdifferential generalizes the classical concept of the differential because of the well-known fact that ∂φ(u) = {∇φ(u)} when the function φ is differentiable. In terms of the subdifferential, the strong convexity in Definition 2.1 can be equivalently stated as [10]: for any u, v ∈ R^d and v* ∈ ∂φ(v), we have

φ(u) ≥ φ(v) + ⟨v*, u − v⟩ + (µ/2)‖u − v‖².  (2.1)

Definition 2.3.
Let φ : R^d → R be a convex function. The conjugate of φ is defined as

φ*(u*) = sup_{v ∈ R^d} {⟨u*, v⟩ − φ(v)}.

Definition 2.4.
A function φ : R^d → R is gradient-Lipschitz-continuous with modulus L > 0 if for any u, v ∈ R^d, we have ‖∇φ(u) − ∇φ(v)‖ ≤ L‖u − v‖.

The following facts are well-known and can be found in the classic textbooks [24] and [10].
Lemma 2.1.
Let φ : R^d → R be a strongly convex function with modulus µ > 0. Then we have that

• its conjugate φ* is gradient-Lipschitz-continuous with modulus 1/µ;

• the conditions φ(u) + φ*(u*) = ⟨u, u*⟩, u* ∈ ∂φ(u), and u ∈ ∂φ*(u*) are equivalent.

The Bregman distance, originally introduced in [4], is a very powerful concept in many fields where distances are involved, partially because it generalizes the traditional Euclidean distance and is able to capture local "geometrical information". During the past five decades, many variants of Bregman distances have been proposed and widely studied; see e.g. [2, 11]. The key ingredient in designing Bregman distances is to find suitable generating functions. For simplicity and generality, we choose strongly convex functions to define the Bregman distance below. We would like to highlight that these strongly convex functions are not necessarily differentiable, which is a big difference between our methods and the existing methods with Bregman distances.
Definition 2.5.
Let ω : R^d → R be a strongly convex function with modulus µ > 0. The Bregman distance D_ω^{v*}(u, v) between u, v ∈ R^d with respect to ω and a subgradient v* ∈ ∂ω(v) is defined by

D_ω^{v*}(u, v) := ω(u) − ω(v) − ⟨v*, u − v⟩.  (2.2)

In the following, we state three basic facts about the Bregman distance, which will be used later in our analysis. It should be pointed out that these results are well-known; see e.g. [11, 12]. We list them here, along with a brief proof, for completeness.

Lemma 2.2.
Let ω : R^d → R be a strongly convex function with modulus µ > 0. For any u, p, q ∈ R^d and p* ∈ ∂ω(p), q* ∈ ∂ω(q), we have that

D_ω^{p*}(u, p) − D_ω^{q*}(u, q) + D_ω^{q*}(p, q) = ⟨q* − p*, u − p⟩,  (2.3)

D_ω^{q*}(p, q) = D_{ω*}^{p}(q*, p*),  (2.4)

and

D_ω^{q*}(p, q) ≥ (µ/2)‖p − q‖².  (2.5)

Proof.
The equality (2.3) follows from (2.2), and the inequality (2.5) from (2.1) and (2.2). To obtain (2.4), we derive that

D_ω^{q*}(p, q) = ω(p) − ω(q) − ⟨q*, p − q⟩
 = ⟨p, p*⟩ − ω*(p*) − ⟨q, q*⟩ + ω*(q*) − ⟨q*, p − q⟩
 = ω*(q*) − ω*(p*) − ⟨p, q* − p*⟩
 = D_{ω*}^{p}(q*, p*),  (2.6)

where the second and fourth lines follow by using the second part of Lemma 2.1 and the conditions p* ∈ ∂ω(p), q* ∈ ∂ω(q).

In this section, we introduce two algorithmic frameworks, both of which are constructed from an operator F : R^d → R^d and a strongly convex function ω : R^d → R via certain coupled schemes.

Let u_0 ∈ R^d, a subgradient u*_0 ∈ ∂ω(u_0), and positive parameters {α_k} be given. The Bregman extragradient method, abbreviated as BEG, updates the iterates {u_k} for k ≥ 0 via

ū_k = argmin_{u ∈ R^d} {α_k⟨F(u_k), u⟩ + D_ω^{u*_k}(u, u_k)},
u_{k+1} = argmin_{u ∈ R^d} {α_k⟨F(ū_k), u⟩ + D_ω^{u*_k}(u, u_k)},
u*_{k+1} = u*_k − α_k F(ū_k).  (3.1)

Equivalently, it can be rewritten as

ū_k = ∇ω*(u*_k − α_k F(u_k)),
u_{k+1} = ∇ω*(u*_k − α_k F(ū_k)),
u*_{k+1} = u*_k − α_k F(ū_k).  (3.2)

Due to the newly introduced "parameter" ω, BEG not only includes the standard extragradient method as its special case by taking ω(u) = ½‖u‖², but also generates implicitly regularized variants. To illustrate the latter, we take ω as the augmented ℓ1-norm [15], that is,

ω(u) = γ‖u‖₁ + ½‖u‖²,

where γ is a positive constant and ‖·‖₁ is the ℓ1-norm defined as the sum of the absolute values of the entries. It is easy to see that the augmented ℓ1-norm is a strongly convex function with modulus µ = 1. Further, we have

∇ω*(·) = S_γ(·),  (3.3)

where S_γ(·) is the well-known shrinkage operator defined by

S_γ(u) := sign(u) max{|u| − γ, 0},

with sign(·), |·|, and max{·, ·} being component-wise operations for vectors.
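As a concrete illustration, the shrinkage operator above is one line of code. The following minimal sketch (our own illustration, not part of the paper) implements S_γ component-wise with NumPy; it plays the role of ∇ω* for the augmented ℓ1-norm generator.

```python
import numpy as np

def shrink(u, gamma):
    """Shrinkage (soft-thresholding) operator S_gamma from (3.3), applied
    component-wise: S_gamma(u) = sign(u) * max(|u| - gamma, 0).

    For the augmented l1-norm generator omega(u) = gamma*||u||_1 + 0.5*||u||^2,
    the gradient of the conjugate omega* coincides with S_gamma."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

u = np.array([1.5, -0.2, 0.0, -3.0])
print(shrink(u, 0.5))  # small components are zeroed, large ones shrunk by gamma
```

Note how components with magnitude below γ are set exactly to zero, which is the source of the sparsity exploited by the sparse extragradient method below.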
Now, the BEG method for this special case becomes

ū_k = S_γ(u*_k − α_k F(u_k)),
u_{k+1} = S_γ(u*_k − α_k F(ū_k)),
u*_{k+1} = u*_k − α_k F(ū_k).  (3.4)

Because the shrinkage operator S_γ(·) generates sparse vectors, we call the newly obtained scheme (3.4) the sparse extragradient method. Its remarkable advantage is that the iterates may be sparse, although the shrinkage operations also increase the computational cost. Thereby, how to balance these two sides is worthy of future study. In addition, it should be noted that although the iterates could be sparse, their averaged sequences, whose iteration complexities will be studied, may be dense.

Slightly more generally, let us consider the case of ω(u) = ψ(u) + ½‖u‖², where ψ is a convex regularization function with an "easily" computable proximal operator given by

prox_ψ(u) := argmin_v {ψ(v) + ½‖u − v‖²}.

Substituting such ω into (3.1), we immediately obtain the following regularized extragradient method:

ū_k = prox_ψ(u*_k − α_k F(u_k)),
u_{k+1} = prox_ψ(u*_k − α_k F(ū_k)),
u*_{k+1} = u*_k − α_k F(ū_k).  (3.5)

To the best of our knowledge, such variants have not appeared before.

Let u_0, u_{−1} ∈ R^d, a subgradient u*_0 ∈ ∂ω(u_0), positive parameters {α_k}, and nonnegative parameters {β_k} be given. The Bregman extrapolation method, abbreviated as BEP, updates the iterates {u_k} for k ≥ 0 via

u_{k+1} = argmin_{u ∈ R^d} {α_k⟨F(u_k) + β_k(F(u_k) − F(u_{k−1})), u⟩ + D_ω^{u*_k}(u, u_k)},
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.6)

Equivalently, it can be rewritten as

u_{k+1} = ∇ω*(u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1}))),
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.7)

BEP is general enough to include several existing algorithms as its special cases.
For example, BEP with ω = ½‖·‖² and α_k ≡ η, β_k ≡ 1 recovers the optimistic gradient descent ascent (OGDA) method [8, 19]; it is also closely related to the operator extrapolation method [14], which requires ω to be differentiable, and to the modified forward-backward splitting [18] specialized to our setting. When considering the case of ω(u) = γ‖u‖₁ + ½‖u‖², we have the following scheme, called the sparse extrapolation method:

u_{k+1} = S_γ(u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1}))),
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.8)

Corresponding to the case of ω(u) = ψ(u) + ½‖u‖², the regularized extrapolation method reads as

u_{k+1} = prox_ψ(u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1}))),
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.9)

At last, we would like to point out that the BEP method does not cover the simultaneous centripetal acceleration and alternating centripetal acceleration methods, proposed in our recent work [21] for training GANs, one of which also includes OGDA as a special case.

In order to provide a unified convergence analysis for the previously introduced algorithmic frameworks, we make the following assumptions about the operator F and the function ω.

Assumption 4.1.
Let ω : R^d → R be a strongly convex function with modulus µ > 0 and let λ be a positive parameter. The operator F is λ-relatively Lipschitz with respect to ω if for any u, v, z ∈ R^d and u* ∈ ∂ω(u), v* ∈ ∂ω(v), we have

⟨F(v) − F(u), v − z⟩ ≤ λ(D_ω^{u*}(v, u) + D_ω^{v*}(z, v)).

This assumption is a slight modification of the relative Lipschitzness recently proposed in [6], and will be a key tool in the forthcoming convergence analysis. As observed in [6], the relative Lipschitzness is a more general condition encapsulating the standard Lipschitz assumption as well as the more recent relative smoothness assumption [1, 16]. The following, shown in [6], will be required later.
Lemma 4.1. If ω is strongly convex with modulus µ and F is L-Lipschitz in the sense that for any u, v ∈ R^d we have ‖F(u) − F(v)‖ ≤ L‖u − v‖, then F is L/µ-relatively Lipschitz with respect to ω.

Assumption 4.2.
The solution set of F, defined as Y := {u ∈ R^d : F(u) = 0}, is nonempty.

Assumption 4.3.
The operator F is monotone, i.e., for any u, v ∈ R^d we have ⟨F(u) − F(v), u − v⟩ ≥ 0.

Assumption 4.4.
The conjugate function ω* : R^d → R is coercive in the sense that, for any fixed v ∈ R^d, we have ω*(u*) − ⟨v, u*⟩ → +∞ as ‖u*‖ → +∞.

It is easy to verify that Assumption 4.4 holds for ω*(u*) = ½‖u*‖². Actually, it holds for the conjugates of all strongly convex functions, as a direct consequence of Proposition 14.15 in [3].

Lemma 4.2. Let ω : R^d → R be a strongly convex function with modulus µ > 0. Then the conjugate function ω* satisfies the coercivity above.

Now, we are ready to present the main results of this study. The first result is an upper bound on the "regret measure" ⟨α_k F(ū_k), ū_k − u⟩ by the difference between the generalized Bregman distances D_ω^{u*_k}(u, u_k) and D_ω^{u*_{k+1}}(u, u_{k+1}), for the Bregman extragradient method.

Proposition 4.1.
Let {ū_k, u_k} be the iterates generated by the Bregman extragradient method introduced in (3.1). Suppose that Assumption 4.1 holds and the parameters α_k satisfy 0 < λα_k ≤ 1. Then, for any u ∈ R^d, we have

⟨α_k F(ū_k), ū_k − u⟩ ≤ D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}).  (4.1)

Moreover, if Assumptions 4.2-4.3 also hold, then the sequence {ū_k} is bounded.

The first part of this result is a modification of Lemma 3.1 in [20] (see also Proposition 1 in [6]), and the second part and what follows are partially inspired by the work [19].
Proposition 4.2.
Let {u_k} be the iterates generated by the Bregman extrapolation method introduced in (3.6) with the initial condition u_0 = u_{−1}. Suppose that Assumption 4.1 holds and the parameters α_k, β_k satisfy the condition

α_k β_k = α_{k−1},  λ(α_k + α_{k−1}) ≤ 1.  (4.2)

Then, for any u ∈ R^d, we have

α_k⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ α_k⟨F(u_{k+1}) − F(u_k), u_{k+1} − u⟩ − α_{k−1}⟨F(u_k) − F(u_{k−1}), u_k − u⟩
 + D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1})
 + λα_{k−1} D_ω^{u*_{k−1}}(u_k, u_{k−1}) − λα_k D_ω^{u*_k}(u_{k+1}, u_k).  (4.3)

Moreover, if Assumptions 4.2-4.3 also hold and there is a positive constant ρ such that λα_k ≤ 1 − ρ, then the sequence {u_k} is bounded.

To illustrate the generality of the result above, we consider the OGDA method, i.e., the special case of BEP with α_k ≡ η, β_k ≡ 1, ω = ½‖·‖². The condition on the step size η becomes 0 < η ≤ 1/(2L) due to (4.2), with µ = 1 and λ = L/µ = L because of Lemma 4.1. Note that D_ω^{v*}(u, v) = ½‖u − v‖². Under this setting, the existence of ρ such that λα_k ≤ 1 − ρ also holds. From (4.3), we obtain

⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ ⟨F(u_{k+1}) − F(u_k), u_{k+1} − u⟩ − ⟨F(u_k) − F(u_{k−1}), u_k − u⟩
 + (1/(2η))‖u − u_k‖² − (1/(2η))‖u − u_{k+1}‖²
 + (L/2)‖u_k − u_{k−1}‖² − (L/2)‖u_{k+1} − u_k‖²,  (4.4)

which is exactly Lemma 8(a) in [19]. The second part of Proposition 4.2 guarantees the boundedness of {u_k}, which recovers Lemma 8(b) in [19]. The careful reader may notice that the analogous situation for the extragradient method does not occur here. In other words, one may ask whether Lemma 11(a) in [19] can be generalized. Our answer is: it should be possible but not worth pursuing, because the current relation (4.1) is powerful enough to derive what we want.

Iteration complexity for saddle point problems
In this section, we first formulate the saddle point problem which we want to solve and then deducethe iteration complexity results.
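Before the formal development, a small numerical sketch may help fix ideas. It applies BEP with ω = ½‖·‖², α_k ≡ η, β_k ≡ 1 (the OGDA special case discussed above) to the toy problem min_x max_y f(x, y) = xy, whose unique saddle point is (0, 0); the step size and the averaging weights r_k = α_k are illustrative choices consistent with (4.2).

```python
import numpy as np

# F(z) = [grad_x f; -grad_y f] for f(x, y) = x*y, cf. (5.2); here L = 1.
def F(z):
    return np.array([z[1], -z[0]])

eta = 0.2                                  # lambda*(alpha_k + alpha_{k-1}) = 0.4 <= 1
z_prev = z = np.array([1.0, -1.0])         # initial condition u_0 = u_{-1}
acc, s = np.zeros(2), 0.0
for _ in range(2000):
    # OGDA / BEP update: u_{k+1} = u_k - eta*(2*F(u_k) - F(u_{k-1}))
    z_prev, z = z, z - eta * (2.0 * F(z) - F(z_prev))
    acc, s = acc + eta * z, s + eta        # running weighted sum with r_k = alpha_k
z_hat = acc / s                            # averaged iterate, as in (5.3)
print(np.linalg.norm(z), np.linalg.norm(z_hat))
```

Both the last iterate and the averaged iterate approach the saddle point; replacing the Euclidean update with the shrinkage operator of Section 3 would give the sparse extrapolation scheme (3.8).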
Let f : R^m × R^n → R be a given function. We consider finding a saddle point (x̄, ȳ) of the problem

min_{x ∈ R^m} max_{y ∈ R^n} f(x, y).  (5.1)

In other words, we seek a pair (x̄, ȳ) ∈ R^m × R^n, which will be called a saddle point, satisfying

f(x̄, y) ≤ f(x̄, ȳ) ≤ f(x, ȳ)

for all x ∈ R^m, y ∈ R^n. Let z = [x; y] ∈ R^{m+n} and define the operator F : R^{m+n} → R^{m+n} as

F(z) := [∇_x f(x, y); −∇_y f(x, y)].  (5.2)

In order to apply the previous theory to this specialized operator F, we restrict our attention to the problem (5.1) under the following assumption.

Assumption 5.1.
The set of all saddle point pairs of the problem (5.1), denoted by Z, is nonempty. Let positive parameters L_xx, L_xy, L_yy, L_yx be given and let L := 2 max{L_xx, L_xy, L_yy, L_yx}. The function f(x, y) in (5.1) is

• continuously differentiable in x and y,

• convex in x for any fixed y and concave in y for any fixed x, and

• gradient-Lipschitz-continuous with modulus L in the sense that

‖∇_x f(x₁, y) − ∇_x f(x₂, y)‖ ≤ L_xx‖x₁ − x₂‖ for all y,
‖∇_x f(x, y₁) − ∇_x f(x, y₂)‖ ≤ L_xy‖y₁ − y₂‖ for all x,
‖∇_y f(x, y₁) − ∇_y f(x, y₂)‖ ≤ L_yy‖y₁ − y₂‖ for all x,
‖∇_y f(x₁, y) − ∇_y f(x₂, y)‖ ≤ L_yx‖x₁ − x₂‖ for all y.

The gradient-Lipschitz-continuity above implies the standard Lipschitzness of F, which in turn implies the relative Lipschitzness of F due to Lemma 4.1. Now, by substituting the formula (5.2) for F into the algorithmic frameworks (3.1) and (3.6), saddle point problems of the form (5.1) can be solved to some degree. In particular, for the saddle point problems that satisfy Assumption 5.1, we will derive the convergence rate O(1/k) for the BEG and BEP methods. To this end, we let {z_k} be the concerned iterate sequence with z_k := [x_k; y_k] ∈ R^{m+n}. Let {r_k} be a parameter sequence with r_k > 0 and s_k := Σ_{i=0}^k r_i. We denote the iterates of {z_k} averaged by {r_k} as follows:

ẑ_k := [x̂_k; ŷ_k] := [Σ_{i=0}^k (r_i/s_k) x_i ; Σ_{i=0}^k (r_i/s_k) y_i] = Σ_{i=0}^k (r_i/s_k) z_i.  (5.3)

In terms of these notations, we have the following preliminary results for the saddle point problem (5.1). It should be pointed out that they are well known; see for example [20].

Lemma 5.1.
Consider the saddle point problem (5.1) with Assumption 5.1 and recall the definitions in (5.2)-(5.3). Let ω : R^d → R be a strongly convex function with modulus µ > 0. Then,

• Assumptions 4.2-4.3 and Assumption 4.1 with λ = L/µ hold for the operator F, and in particular Z ⊂ Y, i.e., F(z̄) = 0 for any z̄ ∈ Z;

• for any z = [x; y] ∈ R^{m+n}, we have

f(x̂_k, y) − f(x, ŷ_k) ≤ (1/s_k) Σ_{i=0}^k r_i⟨F(z_i), z_i − z⟩.  (5.4)

We remark that the gradient Lipschitz continuity in Assumption 5.1 is only a sufficient condition guaranteeing the weaker relative Lipschitzness.

The first result is about the iteration complexity of the BEG method applied to the saddle point problem with Assumption 5.1. Before presenting it, we first clarify some notation. Let {ū_k} be the iterates generated by the BEG method with initial points u_0, u*_0, µ-strongly convex ω and L-Lipschitz operator F = [∇_x f; −∇_y f], and the parameters {α_k} satisfying 0 < Lα_k ≤ µ. Let z_k = ū_k and r_k = α_k. Let z̄ = [x̄; ȳ] ∈ Z ⊂ Y be a fixed saddle point and denote R := max{max_k ‖ū_k‖, ‖z̄‖}; then R < +∞ due to the boundedness of {ū_k}. Let D := {z ∈ R^{m+n} : ‖z‖ ≤ R}.

Proposition 5.1. Let ω : R^d → R be a strongly convex function with modulus µ > 0. The BEG method with α_k satisfying 0 < Lα_k ≤ µ and initial points u_0, u*_0 being given, applied to the saddle point problem (5.1) with Assumption 5.1, converges sublinearly in the sense that

|f(x̂_k, ŷ_k) − f(x̄, ȳ)| ≤ (1/Σ_{i=0}^k α_i) max_{z ∈ D} D_ω^{u*_0}(z, u_0).  (5.5)

The second result is about the iteration complexity of the BEP method applied to the saddle point problem (5.1). Since its deduction is similar to that of the first result, we only briefly point out the differences in notation. Let {u_k} be the iterates generated by the BEP method with F = [∇_x f; −∇_y f], the initial points u_0 = u_{−1} being given, and the positive parameters α_k, β_k satisfying the condition (4.2). Let z_k = u_{k+1} and r_k = α_k. Let z̄ = [x̄; ȳ] ∈ Z ⊂ Y be a fixed saddle point. Denote R := max{max_k ‖u_k‖, ‖z̄‖}; then R < +∞ due to the boundedness of {u_k}.
Let

D := {z ∈ R^{m+n} : ‖z‖ ≤ R},

which is the smallest compact convex set of this form including the iterates {u_k} (or, say, {z_k}) and the saddle point z̄. By convexity, it also includes the averaged iterates {ẑ_k} defined in (5.3). In terms of these notations, we present the second iteration complexity result, whose proof follows directly by replacing (6.4) with (6.15) and repeating the argument for Proposition 5.1.

Proposition 5.2. Let ω : R^d → R be a strongly convex function with modulus µ > 0. The BEP method with {α_k, β_k} satisfying (4.2) and initial points u*_0, u_{−1} = u_0 being given, applied to the saddle point problem (5.1) with Assumption 5.1, converges sublinearly in the sense that

|f(x̂_k, ŷ_k) − f(x̄, ȳ)| ≤ (1/Σ_{i=0}^k α_i) max_{z ∈ D} D_ω^{u*_0}(z, u_0).  (5.6)

The iteration complexity results above show that the function value at the averaged iterates generated by BEG or BEP converges to the function value at any fixed saddle point of the problem (5.1).

Partially inspired by the elegant results in [19], we introduced two algorithmic frameworks with generalized Bregman distances and studied their iteration complexity for solving saddle point problems. With the help of the recent concept of relative Lipschitzness and Bregman distance related tools, our approach is simple and essentially different from the proximal point approach taken in [19]. Moreover, our theory is general in the sense that it applies to more general algorithmic frameworks under weaker assumptions.

Acknowledgements

This work is supported by the National Science Foundation of China (No. 11971480), the Natural Science Fund of Hunan for Excellent Youth (No. 2020JJ3038), and the Fund for NUDT Young Innovator Awards (No. 20190105).

Appendix: The missing proofs

The proof of Proposition 4.1: From the formula ū_k = ∇ω*(u*_k − α_k F(u_k)) in (3.2), we have

u*_k − α_k F(u_k) ∈ ∂ω(ū_k).
Denote ū*_k := u*_k − α_k F(u_k), so that ū*_k ∈ ∂ω(ū_k); then, using the fact that u*_k ∈ ∂ω(u_k) and applying (2.3) in Lemma 2.2, we derive that

⟨α_k F(u_k), ū_k − u⟩ = −⟨u*_k − ū*_k, u − ū_k⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{ū*_k}(u, ū_k) − D_ω^{u*_k}(ū_k, u_k).  (6.1)

Similarly, starting from u*_{k+1} = u*_k − α_k F(ū_k) in (3.2) and applying (2.3) in Lemma 2.2 again, we derive that

⟨α_k F(ū_k), u_{k+1} − u⟩ = −⟨u*_k − u*_{k+1}, u − u_{k+1}⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.2)

Substituting u_{k+1} for u in (6.1), we obtain

⟨α_k F(u_k), ū_k − u_{k+1}⟩ = D_ω^{u*_k}(u_{k+1}, u_k) − D_ω^{ū*_k}(u_{k+1}, ū_k) − D_ω^{u*_k}(ū_k, u_k).  (6.3)

Combining (6.2) and (6.3), we have

⟨α_k F(ū_k), ū_k − u⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) + α_k⟨F(ū_k) − F(u_k), ū_k − u_{k+1}⟩ − D_ω^{u*_k}(ū_k, u_k) − D_ω^{ū*_k}(u_{k+1}, ū_k).

Invoking the relative Lipschitzness assumption and noting λα_k ≤ 1, we obtain (4.1).

Now, we turn to prove the boundedness. Summing (4.1) from k = 0 to t − 1, we obtain

Σ_{k=0}^{t−1} ⟨α_k F(ū_k), ū_k − u⟩ ≤ D_ω^{u*_0}(u, u_0) − D_ω^{u*_t}(u, u_t).  (6.4)

Using the monotonicity of F in Assumption 4.3 and the condition F(ū) = 0 for any ū ∈ Y, we have that

⟨F(ū_k), ū_k − ū⟩ = ⟨F(ū_k) − F(ū), ū_k − ū⟩ ≥ 0.

This means that each term of the sum in (6.4) with u = ū is nonnegative, and hence implies

D_ω^{u*_t}(ū, u_t) ≤ D_ω^{u*_0}(ū, u_0).  (6.5)

Note from (2.4) in Lemma 2.2 that

D_ω^{u*_t}(ū, u_t) = D_{ω*}^{ū}(u*_t, ū*) = ω*(u*_t) − ω*(ū*) − ⟨ū, u*_t − ū*⟩,

where ū* ∈ ∂ω(ū). Thereby, we have

ω*(u*_t) − ⟨ū, u*_t⟩ ≤ D_ω^{u*_0}(ū, u_0) + ω*(ū*) − ⟨ū, ū*⟩.  (6.6)

This implies that for the fixed ū, the sequence {ω*(u*_t) − ⟨ū, u*_t⟩} is bounded. Together with Lemma 4.2 and Assumption 4.4, we conclude that {u*_t} is bounded.
Finally, using the strong convexity of ω and the facts in Lemma 2.1, we deduce that

‖u_t − u_0‖ = ‖∇ω*(u*_t) − ∇ω*(u*_0)‖ ≤ (1/µ)‖u*_t − u*_0‖,

from which the boundedness of {u_t} immediately follows. This completes the proof.

The proof of Proposition 4.2: First, recall from (3.7) that

u*_k − u*_{k+1} = α_k F(u_k) + α_k β_k(F(u_k) − F(u_{k−1})).

It follows that

⟨u*_k − u*_{k+1}, u_{k+1} − u⟩ = ⟨α_k F(u_k) + α_k β_k(F(u_k) − F(u_{k−1})), u_{k+1} − u⟩.  (6.7)

On the other hand, applying (2.3) in Lemma 2.2, we have

⟨u*_k − u*_{k+1}, u_{k+1} − u⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.8)

Thus, combining (6.7) and (6.8), we obtain

⟨α_k F(u_k) + α_k β_k(F(u_k) − F(u_{k−1})), u_{k+1} − u⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.9)

Let ΔF_k := F(u_k) − F(u_{k−1}); then

⟨α_k F(u_k) + α_k β_k ΔF_k, u_{k+1} − u⟩ = α_k β_k⟨ΔF_k, u_{k+1} − u⟩ − α_k⟨ΔF_{k+1}, u_{k+1} − u⟩ + α_k⟨F(u_{k+1}), u_{k+1} − u⟩
 = α_k β_k⟨ΔF_k, u_k − u⟩ + α_k β_k⟨ΔF_k, u_{k+1} − u_k⟩ − α_k⟨ΔF_{k+1}, u_{k+1} − u⟩ + α_k⟨F(u_{k+1}), u_{k+1} − u⟩.  (6.10)

Inserting (6.9) into (6.10) and rearranging terms, we have

α_k⟨F(u_{k+1}), u_{k+1} − u⟩ = α_k⟨ΔF_{k+1}, u_{k+1} − u⟩ − α_k β_k⟨ΔF_k, u_k − u⟩ + α_k β_k⟨ΔF_k, u_k − u_{k+1}⟩
 + D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.11)

Using the relative Lipschitzness assumption, we have that

⟨ΔF_k, u_k − u_{k+1}⟩ = ⟨F(u_k) − F(u_{k−1}), u_k − u_{k+1}⟩ ≤ λ(D_ω^{u*_{k−1}}(u_k, u_{k−1}) + D_ω^{u*_k}(u_{k+1}, u_k)).  (6.12)

Finally, combining (6.11) and (6.12) and noting (4.2), we obtain (4.3).

Now, we turn to prove the boundedness.
Summing up (4.3) from k = 0 to t − 1, we obtain

Σ_{k=0}^{t−1} α_k⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ α_{t−1}⟨ΔF_t, u_t − u⟩ − α_{−1}⟨ΔF_0, u_0 − u⟩
 + D_ω^{u*_0}(u, u_0) − D_ω^{u*_t}(u, u_t)
 + λα_{−1} D_ω^{u*_{−1}}(u_0, u_{−1}) − λα_{t−1} D_ω^{u*_{t−1}}(u_t, u_{t−1}).  (6.13)

Using the relative Lipschitzness assumption, we have that

⟨ΔF_t, u_t − u⟩ = ⟨F(u_t) − F(u_{t−1}), u_t − u⟩ ≤ λ(D_ω^{u*_{t−1}}(u_t, u_{t−1}) + D_ω^{u*_t}(u, u_t)).  (6.14)

Combining (6.13) and (6.14) and noting that ΔF_0 = 0 and D_ω^{u*_{−1}}(u_0, u_{−1}) = 0 due to the fact u_0 = u_{−1}, we obtain

Σ_{k=0}^{t−1} α_k⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ D_ω^{u*_0}(u, u_0) − (1 − λα_{t−1}) D_ω^{u*_t}(u, u_t).  (6.15)

Using (6.15) with u = ū ∈ Y and repeating the argument below (6.4), we deduce that

(1 − λα_{t−1}) D_ω^{u*_t}(ū, u_t) ≤ D_ω^{u*_0}(ū, u_0).  (6.16)

Repeating the argument below (6.5) and noting that 1 − λα_{t−1} ≥ ρ, we finally conclude that the sequence {u_k} is bounded. This completes the proof.

The proof of Proposition 5.1: Combining (5.4) in Lemma 5.1 and (6.4), we obtain

f(x̂_k, y) − f(x, ŷ_k) ≤ (1/s_k) Σ_{i=0}^k r_i⟨F(z_i), z_i − z⟩ ≤ (1/s_k) D_ω^{u*_0}(z, u_0).  (6.17)

In view of the definition of D, we derive that

max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) − min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) = max_{z : z ∈ D} [f(x̂_k, y) − f(x, ŷ_k)] ≤ max_{z : z ∈ D} (1/s_k) D_ω^{u*_0}(z, u_0).  (6.18)

Note that

max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) ≥ f(x̂_k, ȳ) ≥ f(x̄, ȳ)

and

min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) ≤ f(x̄, ŷ_k) ≤ f(x̄, ȳ),

due to the fact that z̄ = [x̄; ȳ] ∈ D and the definition of saddle points. Thus, together with the fact that [x̂_k; ŷ_k] ∈ D and using (6.18), we derive that

f(x̂_k, ŷ_k) − f(x̄, ȳ) ≤ max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) − min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) ≤ max_{z : z ∈ D} (1/s_k) D_ω^{u*_0}(z, u_0)

and

f(x̄, ȳ) − f(x̂_k, ŷ_k) ≤ max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) − min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) ≤ max_{z : z ∈ D} (1/s_k) D_ω^{u*_0}(z, u_0).
Therefore, the sublinear convergence (5.5) follows.

References

[1] H. H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications, Math. Oper. Res., 42 (2016), pp. 330-348.
[2] H. H. Bauschke and J. M. Borwein, Legendre functions and the method of random Bregman projections, J. Convex Anal., 4 (1997), pp. 27-67.
[3] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer International Publishing, 2017.
[4] L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics, 7 (1967), pp. 200-217.
[5] Y. Censor, A. Gibali, and S. Reich, The subgradient extragradient method for solving variational inequalities in Hilbert space, J. Optim. Theory Appl., 148 (2011), pp. 318-335.
[6] M. B. Cohen, A. Sidford, and K. Tian, Relative Lipschitzness in extragradient methods and a direct recipe for acceleration, arXiv:2011.06572 [math.OC], (2020).
[7] C. D. Dang and G. Lan, On the convergence properties of non-Euclidean extragradient methods for variational inequalities with generalized monotone operators, Comput. Optim. Appl., 60 (2015), pp. 277-310.
[8] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, Training GANs with optimism, in ICLR 2018 Conference, February 2018.
[9] J. Diakonikolas, C. Daskalakis, and M. I. Jordan, Efficient methods for structured nonconvex-nonconcave min-max optimization, arXiv:2011.00364 [math.OC], (2020).
[10] J.-B. Hiriart-Urruty and C. Lemarechal, Foundations of Convex Analysis, Springer-Verlag Publishers, 2004.
[11] K. C. Kiwiel, Free-steering relaxation methods for problems with strictly convex costs and linear constraints, Math. Oper. Res., 22 (1997), pp. 326-349.
[12] K. C. Kiwiel, Proximal minimization methods with generalized Bregman functions, SIAM J.
Control Optim., 35 (1997), pp. 1142-1168.
[13] G. M. Korpelevich, An extragradient method for finding saddle points and for other problems, Matecon, 12 (1976), pp. 747-756.
[14] G. Kotsalis, G. Lan, and T. Li, Simple and optimal methods for stochastic variational inequalities, I: Operator extrapolation, arXiv:2011.02987v3 [math.OC], (2020).
[15] M.-J. Lai and W. Yin, Augmented ℓ1 and nuclear-norm models with a globally linearly convergent algorithm, SIAM J. Imaging Sci., 6 (2013), pp. 1059-1091.
[16] H. Lu, R. M. Freund, and Y. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM J. Optim., 28 (2018), pp. 333-354.
[17] Y. Malitsky, Projected reflected gradient methods for monotone variational inequalities, SIAM J. Optim., 25 (2015), pp. 502-520.
[18] Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity, SIAM J. Optim., 30 (2020), pp. 1451-1472.
[19] A. Mokhtari, A. E. Ozdaglar, and S. Pattathil, Convergence rate of O(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems, SIAM J. Optim., 30 (2020), pp. 3230-3251.
[20] A. S. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM J. Optim., 15 (2004), pp. 229-251.
[21] W. Peng, Y. Dai, H. Zhang, and L. Cheng, Training GANs with centripetal acceleration, Optim. Method Softw., 35 (2020), pp. 955-973.
[22] L. D. Popov, A modification of the Arrow-Hurwicz method for finding saddle points, Math. Notes, (1980), pp. 845-848.
[23] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877-898.
[24] R. T. Rockafellar, Convex Analysis, Princeton University Press, 2015.
[25] P. Tseng, A modified forward-backward splitting method for maximal monotone mappings, SIAM J. Control Optim., 38, pp. 431-446.