Extragradient and Extrapolation Methods with Generalized Bregman Distances for Saddle Point Problems
Hui Zhang∗

January 26, 2021
Abstract
In this work, we introduce two algorithmic frameworks, named the Bregman extragradient method and the Bregman extrapolation method, for solving saddle point problems. The proposed frameworks not only include the well-known extragradient and optimistic gradient methods as special cases, but also generate new variants such as sparse extragradient and extrapolation methods. With the help of the recent concept of relative Lipschitzness and some Bregman distance related tools, we are able to show certain upper bounds in terms of Bregman distances for "regret" measures. Further, we use those bounds to deduce the convergence rate of O(1/k) for the Bregman extragradient and Bregman extrapolation methods applied to solving smooth convex-concave saddle point problems. Our theory recovers the main discovery made in [Mokhtari et al. (2020), SIAM J. Optim., 30, pp. 3230-3251] for more general algorithmic frameworks with weaker assumptions via a conceptually different approach.

Keywords. Extragradient method, extrapolation method, Bregman distance, iteration complexity, saddle point problem
AMS subject classifications.
The extragradient method is a powerful tool for solving smooth convex-concave saddle point problems. Its original scheme was introduced by Korpelevich as early as 1976 in [13]. During the past four decades, this method has been extensively developed in many directions, such as extending its applications [20], simplifying its iterates [5], and generalizing its theory [7]. Recently, due to the fact that many game and learning problems are actually equivalent to finding saddle points of min-max optimization problems, it has attracted considerable attention in the machine learning, computer science, and optimization communities.

At the algorithmic level, many works focus on modifying the original extragradient method. There are two types of progress. The first type reduces the computational complexity of each iteration. In this direction, remarkable examples include Popov's modification of the Arrow-Hurwicz method [22], Tseng's modified forward-backward splitting method [25], Malitsky's projected reflected gradient method [17] and its generalizations in [14, 18].

∗ Department of Mathematics, National University of Defense Technology, Changsha, Hunan 410073, China. Email: [email protected]

With a conceptually different approach, our theory generalizes the recent discovery in [19], which was made by formulating the extragradient and OGDA methods as approximations of the proximal point algorithm [23].

The rest of the paper is organized as follows. In Section 2, we introduce some basic convex analysis and Bregman distance related tools. In Section 3, we propose two algorithmic frameworks with generalized Bregman distances as well as some specialized variants. In Section 4, we list a group of assumptions and establish two main iterate results for the proposed algorithmic frameworks. In Section 5, we derive the convergence rate O(1/k) for the proposed algorithmic frameworks applied to solving smooth convex-concave saddle point problems.
Concluding remarks are given in Section 6. Missing proofs are postponed to the Appendix.

In this paper, we restrict our analysis to real finite dimensional spaces R^d. We use ⟨·,·⟩ to denote the inner product and ‖·‖ to denote the Euclidean norm. For a multi-variable function f(x, y), we use ∇_x f (respectively, ∇_y f) to denote the gradient of f with respect to x (respectively, y). We present some basic notation and facts about convex analysis, which will be used in our results.
Definition 2.1.
A function φ : R^d → R is convex if for any α ∈ [0, 1] and u, v ∈ R^d, we have

φ(αu + (1−α)v) ≤ αφ(u) + (1−α)φ(v);

and strongly convex with modulus µ > 0 if for any α ∈ [0, 1] and u, v ∈ R^d, we have

φ(αu + (1−α)v) ≤ αφ(u) + (1−α)φ(v) − (µ/2)α(1−α)‖u − v‖².

Further, φ is concave if −φ is convex.

Definition 2.2.
Let φ : R^d → R be a convex function. The subdifferential of φ at u ∈ R^d is defined as

∂φ(u) := {u* ∈ R^d : φ(v) ≥ φ(u) + ⟨u*, v − u⟩, ∀v ∈ R^d}.

The elements of ∂φ(u) are called the subgradients of φ at u.

The subdifferential generalizes the classical concept of the differential because of the well-known fact that ∂φ(u) = {∇φ(u)} when the function φ is differentiable. In terms of the subdifferential, the strong convexity in Definition 2.1 can be equivalently stated as [10]: for any u, v ∈ R^d and v* ∈ ∂φ(v), we have

φ(u) ≥ φ(v) + ⟨v*, u − v⟩ + (µ/2)‖u − v‖².  (2.1)

Definition 2.3.
Let φ : R^d → R be a convex function. The conjugate of φ is defined as

φ*(u*) = sup_{v ∈ R^d} {⟨u*, v⟩ − φ(v)}.

Definition 2.4.
A function φ : R^d → R is gradient-Lipschitz-continuous with modulus L > 0 if for any u, v ∈ R^d, we have ‖∇φ(u) − ∇φ(v)‖ ≤ L‖u − v‖.

The following facts are well-known and can be found in the classic textbooks [24] and [10].
Lemma 2.1.
Let φ : R^d → R be a strongly convex function with modulus µ > 0. Then we have that

• its conjugate φ* is gradient-Lipschitz-continuous with modulus 1/µ;

• the conditions φ(u) + φ*(u*) = ⟨u, u*⟩, u* ∈ ∂φ(u), and u ∈ ∂φ*(u*) are equivalent.

The Bregman distance, originally introduced in [4], is a very powerful concept in many fields where distances are involved, partially because it generalizes the traditional Euclidean distance and is able to capture local "geometrical information". During the past five decades, many variants of Bregman distances have been proposed and widely studied; see e.g. [2, 11]. The key ingredient in designing Bregman distances is to find suitable generating functions. For simplicity and generality, we choose strongly convex functions to define the Bregman distance below. We would like to highlight that these strongly convex functions are not necessarily differentiable, which is a big difference between our methods and the existing methods with Bregman distances.
Definition 2.5.
Let ω : R^d → R be a strongly convex function with modulus µ > 0. The Bregman distance D_ω^{v*}(u, v) between u, v ∈ R^d with respect to ω and a subgradient v* ∈ ∂ω(v) is defined by

D_ω^{v*}(u, v) := ω(u) − ω(v) − ⟨v*, u − v⟩.  (2.2)

In the following, we state three basic facts about the Bregman distance, which will be used later in our analysis. It should be pointed out that these results are well-known; see e.g. [11, 12]. We list them here, along with a brief proof, for completeness.

Lemma 2.2.
Let ω : R^d → R be a strongly convex function with modulus µ > 0. For any u, p, q ∈ R^d and p* ∈ ∂ω(p), q* ∈ ∂ω(q), we have that

D_ω^{p*}(u, p) − D_ω^{q*}(u, q) + D_ω^{q*}(p, q) = ⟨q* − p*, u − p⟩,  (2.3)

D_ω^{q*}(p, q) = D_{ω*}^{p}(q*, p*),  (2.4)

and

D_ω^{q*}(p, q) ≥ (µ/2)‖p − q‖².  (2.5)

Proof.
The equality (2.3) follows from (2.2), and the inequality (2.5) from (2.1) and (2.2). To obtain (2.4), we derive that

D_ω^{q*}(p, q) = ω(p) − ω(q) − ⟨q*, p − q⟩
 = ⟨p, p*⟩ − ω*(p*) − ⟨q, q*⟩ + ω*(q*) − ⟨q*, p − q⟩
 = ω*(q*) − ω*(p*) − ⟨p, q* − p*⟩
 = D_{ω*}^{p}(q*, p*),  (2.6)

where the second and fourth lines follow by using the second part of Lemma 2.1 and the conditions p* ∈ ∂ω(p), q* ∈ ∂ω(q).

In this section, we introduce two algorithmic frameworks, both of which are constructed from an operator F : R^d → R^d and a strongly convex function ω : R^d → R via certain coupled schemes.

Let u_0 ∈ R^d, a subgradient u*_0 ∈ ∂ω(u_0), and positive parameters {α_k} be given. The Bregman extragradient method, abbreviated as BEG, updates the iterates {u_k} for k ≥ 0 via

ū_k = argmin_{u ∈ R^d} {α_k⟨F(u_k), u⟩ + D_ω^{u*_k}(u, u_k)},
u_{k+1} = argmin_{u ∈ R^d} {α_k⟨F(ū_k), u⟩ + D_ω^{u*_k}(u, u_k)},
u*_{k+1} = u*_k − α_k F(ū_k).  (3.1)

Equivalently, it can be rewritten as

ū_k = ∇ω*(u*_k − α_k F(u_k)),
u_{k+1} = ∇ω*(u*_k − α_k F(ū_k)),
u*_{k+1} = u*_k − α_k F(ū_k).  (3.2)

Due to the newly introduced "parameter" ω, BEG not only includes the standard extragradient method as its special case by taking ω(u) = ½‖u‖², but also generates implicitly regularized variants. To illustrate the latter, we take ω as the augmented ℓ1-norm [15], that is,

ω(u) = γ‖u‖₁ + ½‖u‖²,

where γ is a positive constant and ‖·‖₁ is the ℓ1-norm defined as the sum of the absolute values of the entries. It is easy to see that the augmented ℓ1-norm is a strongly convex function with modulus µ = 1. Further, we have

∇ω*(·) = S_γ(·),  (3.3)

where S_γ(·) is the well-known shrinkage operator defined by

S_γ(u) := sign(u) max{|u| − γ, 0},

with sign(·), |·|, and max{·, ·} being component-wise operations for vectors.
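As a concrete illustration, the shrinkage operator above is one line of code. The following minimal sketch (our own illustration, not part of the paper) implements S_γ component-wise with NumPy; it plays the role of ∇ω* for the augmented ℓ1-norm generator.

```python
import numpy as np

def shrink(u, gamma):
    """Shrinkage (soft-thresholding) operator S_gamma from (3.3), applied
    component-wise: S_gamma(u) = sign(u) * max(|u| - gamma, 0).

    For the augmented l1-norm generator omega(u) = gamma*||u||_1 + 0.5*||u||^2,
    the gradient of the conjugate omega* coincides with S_gamma."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

u = np.array([1.5, -0.2, 0.0, -3.0])
print(shrink(u, 0.5))  # small components are zeroed, large ones shrunk by gamma
```

Note how components with magnitude below γ are set exactly to zero, which is the source of the sparsity exploited by the sparse extragradient method below.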
Now, the BEG method for this special case becomes

ū_k = S_γ(u*_k − α_k F(u_k)),
u_{k+1} = S_γ(u*_k − α_k F(ū_k)),
u*_{k+1} = u*_k − α_k F(ū_k).  (3.4)

Because the shrinkage operator S_γ(·) generates sparse vectors, we call the newly obtained scheme (3.4) the sparse extragradient method. Its remarkable advantage is that the iterates may be sparse, although the shrinkage operations also increase the computational cost. Thereby, how to balance these two sides is worthy of future study. In addition, it should be noted that although the iterates could be sparse, their averaged sequences, whose iteration complexities will be studied, may be dense.

Slightly more generally, let us consider the case of ω(u) = ψ(u) + ½‖u‖², where ψ is a convex regularization function with an "easily" computable proximal operator given by

prox_ψ(u) := argmin_v {ψ(v) + ½‖u − v‖²}.

Substituting such ω into (3.1), we immediately obtain the following regularized extragradient method:

ū_k = prox_ψ(u*_k − α_k F(u_k)),
u_{k+1} = prox_ψ(u*_k − α_k F(ū_k)),
u*_{k+1} = u*_k − α_k F(ū_k).  (3.5)

To the best of our knowledge, such variants have not appeared before.

Let u_0, u_{−1} ∈ R^d, a subgradient u*_0 ∈ ∂ω(u_0), positive parameters {α_k}, and nonnegative parameters {β_k} be given. The Bregman extrapolation method, abbreviated as BEP, updates the iterates {u_k} for k ≥ 0 via

u_{k+1} = argmin_{u ∈ R^d} {α_k⟨F(u_k) + β_k(F(u_k) − F(u_{k−1})), u⟩ + D_ω^{u*_k}(u, u_k)},
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.6)

Equivalently, it can be rewritten as

u_{k+1} = ∇ω*(u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1}))),
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.7)

BEP is general enough to include several existing algorithms as its special cases.
For example, BEP with ω = ½‖·‖² and α_k ≡ η, β_k ≡ 1 recovers the optimistic gradient descent ascent (OGDA) method [8, 19]; it is also closely related to the operator extrapolation method [14], which requires ω to be differentiable, and to the modified forward-backward splitting [18] specialized to our setting. When considering the case of ω(u) = γ‖u‖₁ + ½‖u‖², we have the following scheme, called the sparse extrapolation method:

u_{k+1} = S_γ(u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1}))),
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.8)

Corresponding to the case of ω(u) = ψ(u) + ½‖u‖², the regularized extrapolation method reads as

u_{k+1} = prox_ψ(u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1}))),
u*_{k+1} = u*_k − α_k F(u_k) − α_k β_k(F(u_k) − F(u_{k−1})).  (3.9)

At last, we would like to point out that the BEP method does not cover the simultaneous centripetal acceleration and alternating centripetal acceleration methods, proposed in our recent work [21] for training GANs, one of which also includes OGDA as a special case.

In order to provide a unified convergence analysis for the previously introduced algorithmic frameworks, we make the following assumptions about the operator F and the function ω.

Assumption 4.1.
Let ω : R^d → R be a strongly convex function with modulus µ > 0 and let λ be a positive parameter. The operator F is λ-relatively Lipschitz with respect to ω if for any u, v, z ∈ R^d and u* ∈ ∂ω(u), v* ∈ ∂ω(v), we have

⟨F(v) − F(u), v − z⟩ ≤ λ(D_ω^{u*}(v, u) + D_ω^{v*}(z, v)).

This assumption is a slight modification of the relative Lipschitzness recently proposed in [6], and will be a key tool in the forthcoming convergence analysis. As observed in [6], the relative Lipschitzness is a more general condition encapsulating the standard Lipschitz assumption as well as the more recent relative smoothness assumption [1, 16]. The following, shown in [6], will be required later.
Lemma 4.1. If ω is strongly convex with modulus µ and F is L-Lipschitz in the sense that for any u, v ∈ R^d we have ‖F(u) − F(v)‖ ≤ L‖u − v‖, then F is L/µ-relatively Lipschitz with respect to ω.

Assumption 4.2.
The solution set of F, defined as Y := {u ∈ R^d : F(u) = 0}, is nonempty.

Assumption 4.3.
The operator F is monotone, i.e., for any u, v ∈ R^d we have ⟨F(u) − F(v), u − v⟩ ≥ 0.

Assumption 4.4.
The conjugate function ω* : R^d → R is coercive in the sense that, for any fixed v ∈ R^d, we have ω*(u*) − ⟨v, u*⟩ → +∞ as ‖u*‖ → +∞.

It is easy to verify that Assumption 4.4 holds for ω*(u*) = ½‖u*‖². Actually, it holds for the conjugates of all strongly convex functions, as a direct consequence of Proposition 14.15 in [3].

Lemma 4.2. Let ω : R^d → R be a strongly convex function with modulus µ > 0. Then the conjugate function ω* satisfies the coercivity above.

Now, we are ready to present the main results of this study. The first result is an upper bound on the "regret measure" ⟨α_k F(ū_k), ū_k − u⟩ by the difference between the generalized Bregman distances D_ω^{u*_k}(u, u_k) and D_ω^{u*_{k+1}}(u, u_{k+1}), for the Bregman extragradient method.

Proposition 4.1.
Let {ū_k, u_k} be the iterates generated by the Bregman extragradient method introduced in (3.1). Suppose that Assumption 4.1 holds and the parameters α_k satisfy 0 < λα_k ≤ 1. Then, for any u ∈ R^d, we have

⟨α_k F(ū_k), ū_k − u⟩ ≤ D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}).  (4.1)

Moreover, if Assumptions 4.2-4.3 also hold, then the sequence {ū_k} is bounded.

The first part of this result is a modification of Lemma 3.1 in [20] (see also Proposition 1 in [6]), and the second part and what follows are partially inspired by the work [19].
Proposition 4.2.
Let {u_k} be the iterates generated by the Bregman extrapolation method introduced in (3.6) with the initial condition u_0 = u_{−1}. Suppose that Assumption 4.1 holds and the parameters α_k, β_k satisfy the condition

α_k β_k = α_{k−1},  λ(α_k + α_{k−1}) ≤ 1.  (4.2)

Then, for any u ∈ R^d, we have

α_k⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ α_k⟨F(u_{k+1}) − F(u_k), u_{k+1} − u⟩ − α_{k−1}⟨F(u_k) − F(u_{k−1}), u_k − u⟩
 + D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1})
 + λα_{k−1} D_ω^{u*_{k−1}}(u_k, u_{k−1}) − λα_k D_ω^{u*_k}(u_{k+1}, u_k).  (4.3)

Moreover, if Assumptions 4.2-4.3 also hold and there is a positive constant ρ such that λα_k ≤ 1 − ρ, then the sequence {u_k} is bounded.

To illustrate the generality of the result above, we consider the OGDA method, i.e., the special case of BEP with α_k ≡ η, β_k ≡ 1, ω = ½‖·‖². The condition on the step size η becomes 0 < η ≤ 1/(2L) due to (4.2), with µ = 1 and λ = L/µ = L because of Lemma 4.1. Note that D_ω^{v*}(u, v) = ½‖u − v‖². Under this setting, the existence of ρ such that λα_k ≤ 1 − ρ also holds. From (4.3), we obtain

⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ ⟨F(u_{k+1}) − F(u_k), u_{k+1} − u⟩ − ⟨F(u_k) − F(u_{k−1}), u_k − u⟩
 + (1/(2η))‖u − u_k‖² − (1/(2η))‖u − u_{k+1}‖²
 + (L/2)‖u_k − u_{k−1}‖² − (L/2)‖u_{k+1} − u_k‖²,  (4.4)

which is exactly Lemma 8(a) in [19]. The second part of Proposition 4.2 guarantees the boundedness of {u_k}, which recovers Lemma 8(b) in [19]. The careful reader may notice that the analogous situation for the extragradient method does not occur here. In other words, one may ask whether Lemma 11(a) in [19] can be generalized. Our answer is: it should be possible but not worth pursuing, because the current relation (4.1) is powerful enough to derive what we want.

Iteration complexity for saddle point problems
In this section, we first formulate the saddle point problem which we want to solve and then deducethe iteration complexity results.
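Before the formal development, a small numerical sketch may help fix ideas. It applies BEP with ω = ½‖·‖², α_k ≡ η, β_k ≡ 1 (the OGDA special case discussed above) to the toy problem min_x max_y f(x, y) = xy, whose unique saddle point is (0, 0); the step size and the averaging weights r_k = α_k are illustrative choices consistent with (4.2).

```python
import numpy as np

# F(z) = [grad_x f; -grad_y f] for f(x, y) = x*y, cf. (5.2); here L = 1.
def F(z):
    return np.array([z[1], -z[0]])

eta = 0.2                                  # lambda*(alpha_k + alpha_{k-1}) = 0.4 <= 1
z_prev = z = np.array([1.0, -1.0])         # initial condition u_0 = u_{-1}
acc, s = np.zeros(2), 0.0
for _ in range(2000):
    # OGDA / BEP update: u_{k+1} = u_k - eta*(2*F(u_k) - F(u_{k-1}))
    z_prev, z = z, z - eta * (2.0 * F(z) - F(z_prev))
    acc, s = acc + eta * z, s + eta        # running weighted sum with r_k = alpha_k
z_hat = acc / s                            # averaged iterate, as in (5.3)
print(np.linalg.norm(z), np.linalg.norm(z_hat))
```

Both the last iterate and the averaged iterate approach the saddle point; replacing the Euclidean update with the shrinkage operator of Section 3 would give the sparse extrapolation scheme (3.8).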
Let f : R^m × R^n → R be a given function. We consider finding a saddle point (x̄, ȳ) of the problem

min_{x ∈ R^m} max_{y ∈ R^n} f(x, y).  (5.1)

In other words, we seek a pair (x̄, ȳ) ∈ R^m × R^n, which will be called a saddle point, satisfying

f(x̄, y) ≤ f(x̄, ȳ) ≤ f(x, ȳ)

for all x ∈ R^m, y ∈ R^n. Let z = [x; y] ∈ R^{m+n} and define the operator F : R^{m+n} → R^{m+n} as

F(z) := [∇_x f(x, y); −∇_y f(x, y)].  (5.2)

In order to apply the previous theory to this specialized operator F, we restrict our attention to the problem (5.1) under the following assumption.

Assumption 5.1.
The set of all saddle point pairs of the problem (5.1), denoted by Z, is nonempty. Let positive parameters L_xx, L_xy, L_yy, L_yx be given and let L := 2 max{L_xx, L_xy, L_yy, L_yx}. The function f(x, y) in (5.1) is

• continuously differentiable in x and y,

• convex in x for any fixed y and concave in y for any fixed x, and

• gradient-Lipschitz-continuous with modulus L in the sense that

‖∇_x f(x₁, y) − ∇_x f(x₂, y)‖ ≤ L_xx‖x₁ − x₂‖ for all y,
‖∇_x f(x, y₁) − ∇_x f(x, y₂)‖ ≤ L_xy‖y₁ − y₂‖ for all x,
‖∇_y f(x, y₁) − ∇_y f(x, y₂)‖ ≤ L_yy‖y₁ − y₂‖ for all x,
‖∇_y f(x₁, y) − ∇_y f(x₂, y)‖ ≤ L_yx‖x₁ − x₂‖ for all y.

The gradient-Lipschitz-continuity above implies the standard Lipschitzness of F, which in turn implies the relative Lipschitzness of F due to Lemma 4.1. Now, by substituting the formula (5.2) for F into the algorithmic frameworks (3.1) and (3.6), saddle point problems of the form (5.1) can be solved to some degree. In particular, for the saddle point problems that satisfy Assumption 5.1, we will derive the convergence rate O(1/k) for the BEG and BEP methods. To this end, we let {z_k} be the concerned iterate sequence with z_k := [x_k; y_k] ∈ R^{m+n}. Let {r_k} be a parameter sequence with r_k > 0 and s_k := Σ_{i=0}^k r_i. We denote the iterates of {z_k} averaged by {r_k} as follows:

ẑ_k := [x̂_k; ŷ_k] := [Σ_{i=0}^k (r_i/s_k) x_i ; Σ_{i=0}^k (r_i/s_k) y_i] = Σ_{i=0}^k (r_i/s_k) z_i.  (5.3)

In terms of these notations, we have the following preliminary results for the saddle point problem (5.1). It should be pointed out that they are well known; see for example [20].

Lemma 5.1.
Consider the saddle point problem (5.1) with Assumption 5.1 and recall the definitions in (5.2)-(5.3). Let ω : R^d → R be a strongly convex function with modulus µ > 0. Then,

• Assumptions 4.2-4.3 and Assumption 4.1 with λ = L/µ hold for the operator F, and in particular Z ⊂ Y, i.e., F(z̄) = 0 for any z̄ ∈ Z;

• for any z = [x; y] ∈ R^{m+n}, we have

f(x̂_k, y) − f(x, ŷ_k) ≤ (1/s_k) Σ_{i=0}^k r_i⟨F(z_i), z_i − z⟩.  (5.4)

We remark that the gradient Lipschitz continuity in Assumption 5.1 is only a sufficient condition guaranteeing the weaker relative Lipschitzness.

The first result is about the iteration complexity of the BEG method applied to the saddle point problem with Assumption 5.1. Before presenting it, we first clarify some notation. Let {ū_k} be the iterates generated by the BEG method with initial points u_0, u*_0, µ-strongly convex ω and L-Lipschitz operator F = [∇_x f; −∇_y f], and the parameters {α_k} satisfying 0 < Lα_k ≤ µ. Let z_k = ū_k and r_k = α_k. Let z̄ = [x̄; ȳ] ∈ Z ⊂ Y be a fixed saddle point and denote R := max{max_k ‖ū_k‖, ‖z̄‖}; then R < +∞ due to the boundedness of {ū_k}. Let D := {z ∈ R^{m+n} : ‖z‖ ≤ R}.

Proposition 5.1. Let ω : R^d → R be a strongly convex function with modulus µ > 0. The BEG method with α_k satisfying 0 < Lα_k ≤ µ and initial points u_0, u*_0 being given, applied to the saddle point problem (5.1) with Assumption 5.1, converges sublinearly in the sense that

|f(x̂_k, ŷ_k) − f(x̄, ȳ)| ≤ (1/Σ_{i=0}^k α_i) max_{z ∈ D} D_ω^{u*_0}(z, u_0).  (5.5)

The second result is about the iteration complexity of the BEP method applied to the saddle point problem (5.1). Since its deduction is similar to that of the first result, we only briefly point out the differences in notation. Let {u_k} be the iterates generated by the BEP method with F = [∇_x f; −∇_y f], the initial points u_0 = u_{−1} being given, and the positive parameters α_k, β_k satisfying the condition (4.2). Let z_k = u_{k+1} and r_k = α_k. Let z̄ = [x̄; ȳ] ∈ Z ⊂ Y be a fixed saddle point. Denote R := max{max_k ‖u_k‖, ‖z̄‖}; then R < +∞ due to the boundedness of {u_k}.
Let

D := {z ∈ R^{m+n} : ‖z‖ ≤ R},

which is the smallest compact convex set of this form including the iterates {u_k} (or, say, {z_k}) and the saddle point z̄. By convexity, it also includes the averaged iterates {ẑ_k} defined in (5.3). In terms of these notations, we present the second iteration complexity result, whose proof follows directly by replacing (6.4) with (6.15) and repeating the argument for Proposition 5.1.

Proposition 5.2. Let ω : R^d → R be a strongly convex function with modulus µ > 0. The BEP method with {α_k, β_k} satisfying (4.2) and initial points u*_0, u_{−1} = u_0 being given, applied to the saddle point problem (5.1) with Assumption 5.1, converges sublinearly in the sense that

|f(x̂_k, ŷ_k) − f(x̄, ȳ)| ≤ (1/Σ_{i=0}^k α_i) max_{z ∈ D} D_ω^{u*_0}(z, u_0).  (5.6)

The iteration complexity results above show that the function value at the averaged iterates generated by BEG or BEP converges to the function value at any fixed saddle point of the problem (5.1).

Partially inspired by the elegant results in [19], we introduced two algorithmic frameworks with generalized Bregman distances and studied their iteration complexity for solving saddle point problems. With the help of the recent concept of relative Lipschitzness and Bregman distance related tools, our approach is simple and essentially different from the proximal point approach taken in [19]. Moreover, our theory is general in the sense that it applies to more general algorithmic frameworks under weaker assumptions.

Acknowledgements

This work is supported by the National Science Foundation of China (No. 11971480), the Natural Science Fund of Hunan for Excellent Youth (No. 2020JJ3038), and the Fund for NUDT Young Innovator Awards (No. 20190105).

Appendix: The missing proofs

The proof of Proposition 4.1: From the formula ū_k = ∇ω*(u*_k − α_k F(u_k)) in (3.2), we have

u*_k − α_k F(u_k) ∈ ∂ω(ū_k).
Denote ū*_k := u*_k − α_k F(u_k), so that ū*_k ∈ ∂ω(ū_k); then, using the fact that u*_k ∈ ∂ω(u_k) and applying (2.3) in Lemma 2.2, we derive that

⟨α_k F(u_k), ū_k − u⟩ = −⟨u*_k − ū*_k, u − ū_k⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{ū*_k}(u, ū_k) − D_ω^{u*_k}(ū_k, u_k).  (6.1)

Similarly, starting from u*_{k+1} = u*_k − α_k F(ū_k) in (3.2) and applying (2.3) in Lemma 2.2 again, we derive that

⟨α_k F(ū_k), u_{k+1} − u⟩ = −⟨u*_k − u*_{k+1}, u − u_{k+1}⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.2)

Substituting u_{k+1} for u in (6.1), we obtain

⟨α_k F(u_k), ū_k − u_{k+1}⟩ = D_ω^{u*_k}(u_{k+1}, u_k) − D_ω^{ū*_k}(u_{k+1}, ū_k) − D_ω^{u*_k}(ū_k, u_k).  (6.3)

Combining (6.2) and (6.3), we have

⟨α_k F(ū_k), ū_k − u⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) + α_k⟨F(ū_k) − F(u_k), ū_k − u_{k+1}⟩ − D_ω^{u*_k}(ū_k, u_k) − D_ω^{ū*_k}(u_{k+1}, ū_k).

Invoking the relative Lipschitzness assumption and noting λα_k ≤ 1, we obtain (4.1).

Now, we turn to prove the boundedness. Summing (4.1) from k = 0 to t − 1, we obtain

Σ_{k=0}^{t−1} ⟨α_k F(ū_k), ū_k − u⟩ ≤ D_ω^{u*_0}(u, u_0) − D_ω^{u*_t}(u, u_t).  (6.4)

Using the monotonicity of F in Assumption 4.3 and the condition F(ū) = 0 for any ū ∈ Y, we have that

⟨F(ū_k), ū_k − ū⟩ = ⟨F(ū_k) − F(ū), ū_k − ū⟩ ≥ 0.

This means that each term of the sum in (6.4) with u = ū is nonnegative, and hence implies

D_ω^{u*_t}(ū, u_t) ≤ D_ω^{u*_0}(ū, u_0).  (6.5)

Note from (2.4) in Lemma 2.2 that

D_ω^{u*_t}(ū, u_t) = D_{ω*}^{ū}(u*_t, ū*) = ω*(u*_t) − ω*(ū*) − ⟨ū, u*_t − ū*⟩,

where ū* ∈ ∂ω(ū). Thereby, we have

ω*(u*_t) − ⟨ū, u*_t⟩ ≤ D_ω^{u*_0}(ū, u_0) + ω*(ū*) − ⟨ū, ū*⟩.  (6.6)

This implies that for the fixed ū, the sequence {ω*(u*_t) − ⟨ū, u*_t⟩} is bounded. Together with Lemma 4.2 and Assumption 4.4, we conclude that {u*_t} is bounded.
Finally, using the strong convexity of ω and the facts in Lemma 2.1, we deduce that

‖u_t − u_0‖ = ‖∇ω*(u*_t) − ∇ω*(u*_0)‖ ≤ (1/µ)‖u*_t − u*_0‖,

from which the boundedness of {u_t} immediately follows. This completes the proof.

The proof of Proposition 4.2: First, recall from (3.7) that

u*_k − u*_{k+1} = α_k F(u_k) + α_k β_k(F(u_k) − F(u_{k−1})).

It follows that

⟨u*_k − u*_{k+1}, u_{k+1} − u⟩ = ⟨α_k F(u_k) + α_k β_k(F(u_k) − F(u_{k−1})), u_{k+1} − u⟩.  (6.7)

On the other hand, applying (2.3) in Lemma 2.2, we have

⟨u*_k − u*_{k+1}, u_{k+1} − u⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.8)

Thus, combining (6.7) and (6.8), we obtain

⟨α_k F(u_k) + α_k β_k(F(u_k) − F(u_{k−1})), u_{k+1} − u⟩ = D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.9)

Let ΔF_k := F(u_k) − F(u_{k−1}); then

⟨α_k F(u_k) + α_k β_k ΔF_k, u_{k+1} − u⟩ = α_k β_k⟨ΔF_k, u_{k+1} − u⟩ − α_k⟨ΔF_{k+1}, u_{k+1} − u⟩ + α_k⟨F(u_{k+1}), u_{k+1} − u⟩
 = α_k β_k⟨ΔF_k, u_k − u⟩ + α_k β_k⟨ΔF_k, u_{k+1} − u_k⟩ − α_k⟨ΔF_{k+1}, u_{k+1} − u⟩ + α_k⟨F(u_{k+1}), u_{k+1} − u⟩.  (6.10)

Inserting (6.9) into (6.10) and rearranging terms, we have

α_k⟨F(u_{k+1}), u_{k+1} − u⟩ = α_k⟨ΔF_{k+1}, u_{k+1} − u⟩ − α_k β_k⟨ΔF_k, u_k − u⟩ + α_k β_k⟨ΔF_k, u_k − u_{k+1}⟩
 + D_ω^{u*_k}(u, u_k) − D_ω^{u*_{k+1}}(u, u_{k+1}) − D_ω^{u*_k}(u_{k+1}, u_k).  (6.11)

Using the relative Lipschitzness assumption, we have that

⟨ΔF_k, u_k − u_{k+1}⟩ = ⟨F(u_k) − F(u_{k−1}), u_k − u_{k+1}⟩ ≤ λ(D_ω^{u*_{k−1}}(u_k, u_{k−1}) + D_ω^{u*_k}(u_{k+1}, u_k)).  (6.12)

Finally, combining (6.11) and (6.12) and noting (4.2), we obtain (4.3).

Now, we turn to prove the boundedness.
Summing up (4.3) from k = 0 to t − 1, we obtain

Σ_{k=0}^{t−1} α_k⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ α_{t−1}⟨ΔF_t, u_t − u⟩ − α_{−1}⟨ΔF_0, u_0 − u⟩
 + D_ω^{u*_0}(u, u_0) − D_ω^{u*_t}(u, u_t)
 + λα_{−1} D_ω^{u*_{−1}}(u_0, u_{−1}) − λα_{t−1} D_ω^{u*_{t−1}}(u_t, u_{t−1}).  (6.13)

Using the relative Lipschitzness assumption, we have that

⟨ΔF_t, u_t − u⟩ = ⟨F(u_t) − F(u_{t−1}), u_t − u⟩ ≤ λ(D_ω^{u*_{t−1}}(u_t, u_{t−1}) + D_ω^{u*_t}(u, u_t)).  (6.14)

Combining (6.13) and (6.14) and noting that ΔF_0 = 0 and D_ω^{u*_{−1}}(u_0, u_{−1}) = 0 due to the fact u_0 = u_{−1}, we obtain

Σ_{k=0}^{t−1} α_k⟨F(u_{k+1}), u_{k+1} − u⟩ ≤ D_ω^{u*_0}(u, u_0) − (1 − λα_{t−1}) D_ω^{u*_t}(u, u_t).  (6.15)

Using (6.15) with u = ū ∈ Y and repeating the argument below (6.4), we deduce that

(1 − λα_{t−1}) D_ω^{u*_t}(ū, u_t) ≤ D_ω^{u*_0}(ū, u_0).  (6.16)

Repeating the argument below (6.5) and noting that 1 − λα_{t−1} ≥ ρ, we finally conclude that the sequence {u_k} is bounded. This completes the proof.

The proof of Proposition 5.1: Combining (5.4) in Lemma 5.1 and (6.4), we obtain

f(x̂_k, y) − f(x, ŷ_k) ≤ (1/s_k) Σ_{i=0}^k r_i⟨F(z_i), z_i − z⟩ ≤ (1/s_k) D_ω^{u*_0}(z, u_0).  (6.17)

In view of the definition of D, we derive that

max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) − min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) = max_{z : z ∈ D} [f(x̂_k, y) − f(x, ŷ_k)] ≤ max_{z : z ∈ D} (1/s_k) D_ω^{u*_0}(z, u_0).  (6.18)

Note that

max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) ≥ f(x̂_k, ȳ) ≥ f(x̄, ȳ)

and

min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) ≤ f(x̄, ŷ_k) ≤ f(x̄, ȳ),

due to the fact that z̄ = [x̄; ȳ] ∈ D and the definition of saddle points. Thus, together with the fact that [x̂_k; ŷ_k] ∈ D and using (6.18), we derive that

f(x̂_k, ŷ_k) − f(x̄, ȳ) ≤ max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) − min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) ≤ max_{z : z ∈ D} (1/s_k) D_ω^{u*_0}(z, u_0)

and

f(x̄, ȳ) − f(x̂_k, ŷ_k) ≤ max_{y : [x̂_k; y] ∈ D} f(x̂_k, y) − min_{x : [x; ŷ_k] ∈ D} f(x, ŷ_k) ≤ max_{z : z ∈ D} (1/s_k) D_ω^{u*_0}(z, u_0).
Therefore, the sublinear convergence (5.5) follows.

References

[1] H. H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications, Math. Oper. Res., 42 (2016), pp. 330-348.
[2] H. H. Bauschke and J. M. Borwein, Legendre functions and the method of random Bregman projections, J. Convex Anal., 4 (1997), pp. 27-67.
[3] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer International Publishing, 2017.
[4] L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics, 7 (1967), pp. 200-217.
[5] Y. Censor, A. Gibali, and S. Reich, The subgradient extragradient method for solving variational inequalities in Hilbert space, J. Optim. Theory Appl., 148 (2011), pp. 318-335.
[6] M. B. Cohen, A. Sidford, and K. Tian, Relative Lipschitzness in extragradient methods and a direct recipe for acceleration, arXiv:2011.06572 [math.OC], (2020).
[7] C. D. Dang and G. Lan, On the convergence properties of non-Euclidean extragradient methods for variational inequalities with generalized monotone operators, Comput. Optim. Appl., 60 (2015), pp. 277-310.
[8] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, Training GANs with optimism, in ICLR 2018 Conference, February 2018.
[9] J. Diakonikolas, C. Daskalakis, and M. I. Jordan, Efficient methods for structured nonconvex-nonconcave min-max optimization, arXiv:2011.00364 [math.OC], (2020).
[10] J.-B. Hiriart-Urruty and C. Lemarechal, Foundations of Convex Analysis, Springer-Verlag Publishers, 2004.
[11] K. C. Kiwiel, Free-steering relaxation methods for problems with strictly convex costs and linear constraints, Math. Oper. Res., 22 (1997), pp. 326-349.
[12] K. C. Kiwiel, Proximal minimization methods with generalized Bregman functions, SIAM J.
Control Optim., 35 (1997), pp. 1142-1168.
[13] G. M. Korpelevich, An extragradient method for finding saddle points and for other problems, Matecon, 12 (1976), pp. 747-756.
[14] G. Kotsalis, G. Lan, and T. Li, Simple and optimal methods for stochastic variational inequalities, I: Operator extrapolation, arXiv:2011.02987v3 [math.OC], (2020).
[15] M.-J. Lai and W. Yin, Augmented ℓ1 and nuclear-norm models with a globally linearly convergent algorithm, SIAM J. Imaging Sci., 6 (2013), pp. 1059-1091.
[16] H. Lu, R. M. Freund, and Y. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM J. Optim., 28 (2018), pp. 333-354.
[17] Y. Malitsky, Projected reflected gradient methods for monotone variational inequalities, SIAM J. Optim., 25 (2015), pp. 502-520.
[18] Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity, SIAM J. Optim., 30 (2020), pp. 1451-1472.
[19] A. Mokhtari, A. E. Ozdaglar, and S. Pattathil, Convergence rate of O(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems, SIAM J. Optim., 30 (2020), pp. 3230-3251.
[20] A. S. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM J. Optim., 15 (2004), pp. 229-251.
[21] W. Peng, Y. Dai, H. Zhang, and L. Cheng, Training GANs with centripetal acceleration, Optim. Method Softw., 35 (2020), pp. 955-973.
[22] L. D. Popov, A modification of the Arrow-Hurwicz method for finding saddle points, Math. Notes, (1980), pp. 845-848.
[23] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877-898.
[24] R. T. Rockafellar, Convex Analysis, Princeton University Press, 2015.
[25] P. Tseng, A modified forward-backward splitting method for maximal monotone mappings, SIAM J. Control Optim., 38, pp. 431-446.