[PDF] Cyclic Coordinate Dual Averaging with Extrapolation for Generalized Variational Inequalities

Abstract

We propose the \emph{Cyclic cOordinate Dual avEraging with extRapolation (CODER)} method for generalized variational inequality problems. Such problems are fairly general and include composite convex minimization and min-max optimization as special cases. CODER is the first cyclic block coordinate method whose convergence rate is independent of the number of blocks (under a suitable Lipschitz definition), which fills the significant gap between cyclic coordinate methods and randomized ones that remained open for many years. Moreover, CODER provides the first theoretical guarantee for cyclic coordinate methods in solving generalized variational inequality problems under only monotonicity and Lipschitz continuity assumptions. To remove the dependence on the number of blocks, the analysis of CODER is based on a novel Lipschitz condition with respect to a Mahalanobis norm rather than the commonly used coordinate-wise Lipschitz condition; to be applicable to general variational inequalities, CODER leverages an extrapolation strategy inspired by the recent developments in primal-dual methods. Our theoretical results are complemented by numerical experiments, which demonstrate competitive performance of CODER compared to other coordinate methods.

Full PDF

FFast Cyclic Coordinate Dual Averaging with Extrapolationfor Generalized Variational Inequalities ∗ Chaobing Song and Jelena DiakonikolasDepartment of Computer SciencesUniversity of Wisconsin-Madison [email protected], [email protected]

Abstract

We propose the

Cyclic cOordinate Dual avEraging with extRapolation (CODER) methodfor generalized variational inequality problems. Such problems are fairly general and includecomposite convex minimization and min-max optimization as special cases. CODER is the ﬁrstcyclic block coordinate method whose convergence rate is independent of the number of blocks,which ﬁlls the signiﬁcant gap between cyclic coordinate methods and randomized ones thatremained open for many years. Moreover, CODER provides the ﬁrst theoretical guarantee forcyclic coordinate methods for solving generalized variational inequality problems under onlymonotonicity and Lipschitz continuity assumptions. To remove the dependence on the numberof blocks, the analysis of CODER is based on a novel Lipschitz condition with respect to aMahalanobis norm rather than the commonly used coordinate-wise Lipschitz condition; to beapplicable to general variational inequalities, CODER leverages an extrapolation strategy inspiredby the recent developments in primal-dual methods. Our theoretical results are complementedby numerical experiments, which demonstrate competitive performance of CODER compared toother coordinate methods.

Large-scale optimization problems are omnipresent in machine learning. The ever-increasing scale ofthe problems renders standard ﬁrst-order methods that rely on full gradient information impracticalfor many settings of interest. Fortunately, most of the standard machine learning problems possessuseful structure that makes them amenable to eﬃcient optimization methods that only access partialproblem information at a time. A speciﬁc instance are (block) coordinate methods, which rely onaccessing only a subset of coordinates of the objective function (sub)gradient at a time (Wright, 2015,Nesterov, 2012). These methods have been very popular over the past decade, ﬁnding applicationsin areas such as feature selection in high-dimensional computational statistics (Wu et al., 2008,Friedman et al., 2010, Mazumder et al., 2011), empirical risk minimization in machine learning(Zhang and Lin, 2015, Nesterov, 2012, Lin et al., 2015, Allen-Zhu et al., 2016, Alacaoglu et al., 2017,G¨urb¨uzbalaban et al., 2017, Diakonikolas and Orecchia, 2018), and distributed computing (Liuet al., 2014, Fercoq and Richt´arik, 2015, Richt´arik and Tak´aˇc, 2016).Coordinate methods are classiﬁed according to the order in which (blocks of) coordinates areselected and updated (Shi et al., 2016), generally falling into of the three main categories: (i) greedy, ∗ This research was partially supported by the Oﬃce of the Vice Chancellor for Research and Graduate Educationat the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation. a r X i v : . [ m a t h . O C ] F e b r Gauss-Southwell, methods, which greedily select coordinates that lead to the largest progress (e.g.,coordinates with the largest magnitude of the gradient, which maximize progress in function valuefor descent-type methods), (ii) randomized methods, which select (blocks of) coordinates accordingto some probability distribution over the coordinate blocks, and (iii) cyclic methods, which update(blocks of) coordinates in a cyclic order. Although greedy methods can be quite eﬀective, they aregenerally limited by the greedy selection criterion, which (except in some very specialized instances;see, e.g., Nutini et al. (2015)) requires reading full ﬁrst-order information, in each iteration. Thus,more attention in the literature has been given to randomized and cyclic methods.From the aspect of theoretical guarantees, a major advantage of randomized coordinate methods(RCM) over cyclic variants has been the simplicity with which convergence arguments can be carriedout. By sampling coordinates randomly with replacement, the expectation of a coordinate gradientis the full gradient, thus the analysis can be largely reduced to that of standard gradient descent.As a result, many variants of RCM with provable guarantees have been proposed for both convexminimization problems (Nesterov, 2012, Lin et al., 2015, Fercoq and Richt´arik, 2015, Diakonikolasand Orecchia, 2018, Allen-Zhu et al., 2016, Hanzely and Richt´arik, 2019, Nesterov and Stich, 2017)and convex-concave min-max problems (Dang and Lan, 2014, Zhang and Lin, 2015, Alacaoglu et al.,2017, Chambolle et al., 2018, Tan et al., 2018, Carmon et al., 2019, Latafat et al., 2019, Fercoqand Bianchi, 2019, Alacaoglu et al., 2020). The complexity of RCM as measured by the number oftimes full gradient information is accessed is no worse than for full-gradient ﬁrst-order methods,making RCM suitable for high-dimensional settings. However, these guarantees are attained only inexpectation or with high probability. Meanwhile, to sample the coordinates, randomized methodsmust involve generation of pseudorandom numbers from a certain probability distribution, whichmakes the implementation complicated and may dominate the cost if the coordinate update is cheap.Furthermore, in practical tasks such as training of deep neural networks, the strategy of samplingwith replacement is seldom used due to reduced performance caused by not iterating over all thecoordinates with high probability in one pass (while sampling without replacement achieves thiswith probability one) (Bottou, 2009).Compared to sampling with replacement, cyclically choosing coordinates or sampling withoutreplacement (i.e., cyclically choosing coordinate blocks with their order determined according toa random permutation) appears more natural. In fact, cyclic coordinate methods (CCMs) oftenhave better empirical performance than RCM (Beck and Tetruashvili, 2013, Chow et al., 2017,Sun and Ye, 2019). Due to its simplicity and eﬃciency, CCM has been the default approach inmany well-known software packages for high-dimensional computational statistics such as SparseNet(Mazumder et al., 2011) and GLMNet (Friedman et al., 2010).However, CCM is much harder to analyze than RCM because it is highly nontrivial to establisha connection between the (cyclically selected) coordinate gradient and full gradient. As a result,compared to RCM, there are hardly any theoretical guarantees for CCM. In the seminal paper aboutRCM, Nesterov (2012) has remarked that it is “almost impossible to estimate the rate of convergence”of cyclic coordinate descent in the general case. However, some guarantees have been provided in theliterature, albeit often under very restrictive assumptions such the isotonicity of the gradient (Sahaand Tewari, 2013) or with convergence rates that do not justify better empirical performanceof CCM over RCM (Beck and Tetruashvili, 2013). In particular, the iteration complexity resultfrom Beck and Tetruashvili (2013) for the general convex optimization setting has linear dependenceon the ambient dimension (or the number of blocks in the block coordinate setting). This lineardependence is expected, as the argument from Beck and Tetruashvili (2013) relies on treating thecyclical coordinate gradient as an approximation of full gradient of the current iterate.Beyond the setting of convex optimization, Chow et al. (2017) has provided convergence resultsfor a variant of CCM applied to unconstrained monotone variational inequality problems (VIPs),2here the operator F : R d → R d is assumed to be cocoercive. Cocoercivity is a very strongassumption, which leads to an equivalence between solving the original VIP (equivalently, ﬁnding azero of F , which is also known as the monotone inclusion problem) and ﬁnding a ﬁxed point ofa nonexpansive (1-Lipschitz) operator (see, e.g., Facchinei and Pang (2007, Chapter 12)). Thiscondition already fails to hold for bilinear matrix games, which is one of the most basic setups ofmin-max optimization. Moreover, the convergence rate of 1 /k / for reducing (cid:107) F ( x ) (cid:107) from Chowet al. (2017) is unsatisfying, as we expect faster convergence for this class of methods.As monotone VIP includes convex optimization problems as a special case, to address bothissues from Beck and Tetruashvili (2013), Chow et al. (2017), our main focus is on answering thefollowing question: Is it possible to obtain a cyclic coordinate method with dimension-independent iteration complexityfor general monotone VIPs?

We consider generalized Minty variational inequality (GMVI) problems, which ask for ﬁnding x ∗ such that (cid:104) F ( x ) , x − x ∗ (cid:105) + g ( x ) − g ( x ∗ ) ≥ , ∀ x ∈ R d , (P)where F : R d → R d is a monotone Lipschitz operator and g : R d → R is a proper, convex, lowersemicontinuous, block-separable function with an eﬃciently computable proximal operator (seeSection 2 for precise deﬁnitions).For this general problem (P), we propose a novel Cyclic cOordinate Dual avEraging withextRapolation (CODER) method. When F ( x ) is monotone and L -Lipschitz, and g ( x ) is generalconvex, to attain an (cid:15) -accurate solution to (P) deﬁned as x ∗ (cid:15) that satisﬁes (cid:104) F ( x ) , x − x ∗ (cid:15) (cid:105) + g ( x ) − g ( x ∗ (cid:15) ) ≥ − (cid:15), ∀ x ∈ R d , (P (cid:15) )CODER needs to equivalently access O ( L/(cid:15) ) full gradients, which is dimension-independent andrecovers the optimal complexity of ﬁrst-order methods for monotone Lipschitz VIPs. Moreover, if g ( x ) is assumed to be σ -strongly convex, the oracle complexity of CODER becomes O (cid:0) Lσ log (cid:15) (cid:1) .The constant L in the aforementioned bounds is deﬁned based on a novel Lipschitz condition for F w.r.t. a Mahalanobis norm that we introduce (see Assumption 1 for a precise deﬁnition), and formost cases of interest is bounded above by the more traditional Lipschitz constant of F (deﬁnedw.r.t. the standard Euclidean norm). In fact, the Lipschitz constant resulting from our analysisis often much lower than the Euclidean Lipschitz constant. Meanwhile, the value of L is stronglyinﬂuenced by the order of cyclic updates that the algorithm takes, thus it partially explains theeﬀectiveness of random permutations for CCM. Furthermore, to the best of our knowledge, our workis the ﬁrst to provide any type of convergence guarantees for CCM methods applied to generalizedVIPs. Finally, our method provides consistent guarantees for arbitrary block separation, which ishighly nontrivial in the min-max setting (see Remark 2).To prove our main result, instead of treating coordinate gradient as an approximation of the fullgradient, we consider a novel approximation strategy that relates the collection of cyclic coordinategradients from one full pass over the coordinates to a certain full implicit gradient . This collectionperspective helps us remove the linear dependence on the dimension (or number of coordinateblocks). Furthermore, the approximation of the coordinate gradient collection is more accurate thanthe full gradient of the previous iterate, thus we show that cyclic coordinate methods can dependon a smaller Lipschitz constant L . To make our results applicable to generalized monotone VIPs,we introduce an extrapolation step on the operator, which is inspired by the very recent momentumstrategy for non-bilinear convex-concave min-max problems (Hamedani and Aybat, 2018).3 .2 Related Work As discussed earlier, despite signiﬁcant research activity devoted to randomized coordinate meth-ods (Nesterov, 2012, Lin et al., 2015, Fercoq and Richt´arik, 2015, Diakonikolas and Orecchia, 2018,Allen-Zhu et al., 2016, Hanzely and Richt´arik, 2019, Nesterov and Stich, 2017, Zhang and Lin,2015, Alacaoglu et al., 2017, Tan et al., 2018), far less attention has been given to cyclic coordinatevariants, and speciﬁcally to their rigorous convergence guarantees.In particular, while convergence guarantees have been established for smooth convex optimizationproblems in Beck and Tetruashvili (2013), the obtained bounds exhibit at least linear dependence onthe number of blocks (equal to the dimension in the coordinate case). Further, the bound from Beckand Tetruashvili (2013) also scales with L max /L min , where L max and L min are the maximum and theminimum Lipschitz constants over the blocks, which is unsatisfying, as (block) coordinate methodsmainly exhibit improvements over full gradient methods when the Lipschitz constants over blocksare highly non-uniform.In general, vanilla CCM is known to be order- d slower than RCM in the worst case (Sun andYe, 2019), where d is the dimension, which is in conﬂict with its comparable and often superiorperformance compared to RCM in practice. This has led to more reﬁned analyses of CCM withsofter guarantees that explain why the worst-case examples are uncommon (G¨urb¨uzbalaban et al.,2017, Lee and Wright, 2019, Wright and Lee, 2020). However, the existing results only apply tounconstrained convex quadratic problems.By contrast to existing work, we introduce a novel extrapolation-based CCM that applies toa broad class of generalized variational inequality problems, which contains convex (composite)optimization as a special case. In the case of convex quadratic functions and unlike RCM or existingCCM methods, the results we obtain never exhibit worse complexity than the full-gradient methods,and are often of much lower complexity. Further, our method provably converges on min-maxproblems on which standard CCM and RCM methods diverge in general (see Remark 2). We consider the d -dimensional Euclidean space ( R d , (cid:107) · (cid:107) ) , where (cid:107) · (cid:107) = (cid:112) (cid:104)· , ·(cid:105) denotes the Euclideannorm, (cid:104)· , ·(cid:105) denotes the (standard) inner product, and d is assumed to be ﬁnite. Given a matrix B , the operator norm of B is deﬁned in a standard way as (cid:107) B (cid:107) = max {(cid:107) Bx (cid:107) : x ∈ R d , (cid:107) x (cid:107) ≤ } . Weuse to denote an all-zeros vector.Throughout the paper, we assume that there is a given partition of the set { , , . . . , d } intosets S i , i ∈ { , . . . , m } , where | S i | = d i > . For notational convenience, we assume that sets S i are comprised of consecutive elements from { , , . . . , d } , that is, S = { , , . . . , d } , S = { d + 1 , d + 2 , . . . , d + d } , . . . , S m = { (cid:80) m − j =1 d j + 1 , (cid:80) m − j =1 d j + 2 , . . . , (cid:80) mj =1 d j } . This assumptionis without loss of generality, as all our results are invariant to permutations of the coordinates. Foran operator F : R d → R d , we use F i to denote its coordinate components indexed by S i . We say that an operator F : R d → R d is monotone, if ∀ x , y ∈ R d , (cid:104) F ( x ) − F ( y ) , x − y (cid:105) ≥ . An operator F : R d → R d is said to be M -Lipschitz, if ∀ x , y ∈ R d , (cid:107) F ( x ) − F ( y ) (cid:107) ≤ M (cid:107) x − y (cid:107) . (1)Given a proper, convex, lower semicontinuous function g : R d → R ∪ { + ∞} , we use ∂g ( x ) todenote the subdiﬀerential set (the set of all subgradients) of g at x ∈ dom( g ) . Of particular interestto us are functions g whose proximal operator (or resolvent), deﬁned byprox g,τ ( z ) = arg min x ∈ R d (cid:110) g ( x ) + 12 τ (cid:107) x − z (cid:107) (cid:111) (2)4s eﬃciently computable for all z ∈ R d , τ > g is convex and strongly convex, we will say that g is γ -stronglyconvex for γ ≥ , if for all x , y ∈ R d and all α ∈ (0 ,

1) : g ((1 − α ) x + α y ) ≤ (1 − α ) g ( x ) + αg ( y ) − γ α (1 − α ) (cid:107) x − y (cid:107) . If g (cid:48) ( x ) ∈ ∂g ( x ) is any subgradient of g at x , then we also have, ∀ y ∈ R d ,g ( y ) ≥ g ( x ) + (cid:10) g (cid:48) ( x ) , y − x (cid:11) + γ (cid:107) y − x (cid:107) . Problem deﬁnition.

As discussed in the introduction, we consider Problem (P), under thefollowing assumption.

Assumption 1.

There exists at least one x ∗ that solves (P) . g ( x ) is γ -strongly convex, where γ ≥ , and block-separable over coordinate sets { S i } mi =1 : g ( x ) = (cid:80) mi =1 g i ( x i ) , where x i is the d i -dimensional vector comprised of the entries of x correspondingto the coordinates from S i . Each g i ( x i ) , ≤ i ≤ m, admits an eﬃciently computable proximaloperator.Operator F : R d → R d is monotone. Further, there exist positive semideﬁnite matrices Q i , ≤ i ≤ m, such that each F i ( · ) is -Lipschitz continuous w.r.t. the norm (cid:107) · (cid:107) Q i , i.e., ∀ x , y ∈ R d , (cid:107) F i ( x ) − F i ( y ) (cid:107) ≤ (cid:113) ( x − y ) T Q i ( x − y ) = (cid:107) x − y (cid:107) Q i , (3) where F i ( x ) is the d i -dimensional vector comprised of the S i coordinates of F ( x ) . Finally, (cid:13)(cid:13)(cid:13) (cid:80) mi =1 (cid:98) Q i (cid:13)(cid:13)(cid:13) = L < ∞ , where (cid:98) Q i is deﬁned by ( (cid:98) Q i ) j,k = (cid:40) ( Q i ) j,k , if min { j, k } > (cid:80) i − (cid:96) =1 d (cid:96) , , otherwise.That is, (cid:98) Q i corresponds to the matrix Q i with the ﬁrst i − blocks of rows and columns set to zero. For notational convenience, given a candidate solution x ∈ R d and an arbitrary point u ∈ R d ,we deﬁne Gap( ˆ x ; u ) := (cid:104) F ( u ) , ˆ x − u (cid:105) + g ( ˆ x ) − g ( u ) , (4)so that Gap( ˆ x ) := sup u ∈ R d Gap( ˆ x ; u ) (5)deﬁnes the error of the candidate solution ˆ x for Problem (P). In particular, if Gap( ˆ x ) ≤ (cid:15) for some (cid:15) > , then (cid:104) F ( x ) , x − ˆ x (cid:105) + g ( x ) − g ( ˆ x ) ≥ − (cid:15), ∀ x ∈ R d , which deﬁnes the (cid:15) -approximation for Problem (P). Comparison of Lipschitzness assumptions.

Standard Lipschitz assumptions that are usedfor full gradient methods are typically stated as in Eq. (1). Observe that the Lipschitz constant ofthe entire operator F under our assumptions is bounded by (cid:112) (cid:107) (cid:80) mi =1 Q i (cid:107) , as, ∀ x , y , (cid:107) F ( x ) − F ( y ) (cid:107) = m (cid:88) i =1 (cid:107) F i ( x ) − F i ( y ) (cid:107) ≤ m (cid:88) i =1 ( x − y ) T Q i ( x − y ) ≤ (cid:13)(cid:13)(cid:13) m (cid:88) i =1 Q i (cid:13)(cid:13)(cid:13) (cid:107) x − y (cid:107) .

5n the worst case for full gradient methods, it is possible that M = (cid:113)(cid:13)(cid:13) (cid:80) mi =1 Q i (cid:13)(cid:13) , and this worstcase in fact happens for many interesting examples discussed below. The guarantees that weprovide for our method are in terms of L = (cid:113)(cid:13)(cid:13) (cid:80) mi =1 (cid:98) Q i (cid:13)(cid:13) . It is not hard to show that in general (cid:13)(cid:13) (cid:80) mi =1 (cid:98) Q i (cid:13)(cid:13) ≤ (cid:13)(cid:13) (cid:80) mi =1 Q i (cid:13)(cid:13) , while it is possible that (cid:13)(cid:13) (cid:80) mi =1 (cid:98) Q i (cid:13)(cid:13) (cid:28) (cid:13)(cid:13) (cid:80) mi =1 Q i (cid:13)(cid:13) . In the literature on standard (randomized and cyclic) block coordinate methods and in thecase where F is the gradient of a convex function, the Lipschitz assumptions are typically statedas (Nesterov, 2012): (cid:107) F i ( x ) − F i ( y ) (cid:107) ≤ L i (cid:107) x − y (cid:107) , where x , y ∈ R d are restricted to only diﬀerover the i th block of coordinates. These assumptions are hard to directly compare to our Lipschitzassumptions stated in Assumption 1. What can be said is that in general L i ≤ (cid:107) Q i (cid:107) ; however, notethat our ﬁnal convergence bound is in terms of (cid:13)(cid:13) (cid:80) mi =1 (cid:98) Q i (cid:13)(cid:13) , which is incomparable to weighted sumsof L i ’s that typically appear in the convergence bounds for block coordinate methods. Further, notethat the coordinate Lipschitz assumptions used for convex optimization are generally not suitable formin-max setups. In particular, for bilinear problems, all coordinate Lipschitz constants deﬁned asin Nesterov (2012) would be zero, which does not appear meaningful, given the non-zero complexityof bilinear problems (Ouyang and Xu, 2019).To demonstrate the full power of our method and how our notion of block coordinate Lipschitznesscan provide much improved results compared to other methods on some standard classes of problemswith applications in machine learning, in the following, we provide a few illustrative examples. Example applications.

Before providing concrete example applications, we note that our problemof interest (P) captures broad classes of optimization problems, such as convex-concave min-maxoptimization min x ∈ R d max x ∈ R d Φ( x , x ) , (P MM )where Φ( x , x ) := φ ( x , x ) + g ( x ) − g ( x ) , d + d = d, φ is convex-concave and smooth, and g , g are convex and “simple” (i.e., have eﬃciently computable proximal operators), and convexcomposite optimization min x ∈ R d (cid:8) f ( x ) + g ( x ) (cid:9) , (P CO )where f is smooth and convex and g is convex and “simple.”To reduce (P MM ) to (P), it suﬃces to stack vectors x , x and deﬁne x = (cid:2) x x (cid:3) , F ( x ) = (cid:2) ∇ x φ ( x , x ) −∇ x φ ( x , x ) (cid:3) , g ( x ) = g ( x ) − g ( x ) . To reduce (P CO ) to (P), it suﬃces to take F ( x ) = ∇ f ( x ),while g is the same for both problems. See, e.g., Nemirovski (2004), Malitsky (2019) and Corollaries 1and 2 for more information. Now we provide some concrete example applications. Example 1 (Lasso) . The well-known Lasso problem min x ∈ R d (cid:107) Ax − b (cid:107) + λ (cid:107) x (cid:107) is an exampleof (P CO ) and a special case of (P) , where A = [ a , a , . . . , a n ] T ∈ R n × d , b ∈ R n , F ( x ) = A T ( Ax − b ) , g ( x ) = λ (cid:107) x (cid:107) , m = d , d = d = · · · = d m = 1 . For this setup, we have (cid:107) F ( x ) − F ( y ) (cid:107) = (cid:107) A T A ( x − y ) (cid:107) = (cid:112) ( x − y ) T ( A T A ) ( x − y ) . The tightest Lipschitz constant of F ( x ) that we canselect is (cid:107) A T A (cid:107) . Meanwhile, (cid:107) F i ( x ) − F i ( y ) (cid:107) = (cid:107) ( a i ) T A ( x − y ) (cid:107) = (cid:112) ( x − y ) T Q i ( x − y ) with Q i = A T a i ( a i ) T A . The Lipschitz constant L for our algorithm is given by L = (cid:113) (cid:107) (cid:80) di =1 (cid:98) Q i (cid:107) ≤(cid:107) A T A (cid:107) (but it can be much lower than (cid:107) A T A (cid:107) ), and the resulting iteration complexity of CODERgiven target error (cid:15) > is O ( L (cid:107) x ∗ − x (cid:107) (cid:15) ) . Example 2 (Elastic net) . Another interesting example for our setting are the elastic net regularizedproblems, which are of the form min x ∈ R d (cid:107) Ax − b (cid:107) + λ (cid:107) x (cid:107) + λ (cid:107) x (cid:107) , where λ , λ > areregularization parameters (Zou and Hastie, 2005). This is another instance of (P CO ) with f ( x ) = (cid:107) Ax − b (cid:107) and g ( x ) = λ (cid:107) x (cid:107) + λ (cid:107) x (cid:107) . Observe that in this case g is λ -strongly convex but also lgorithm 1 Cyclic cOordinate Dual avEraging with extRapolation (CODER) Input: x ∈ dom( g ) , γ ≥ , L > , m, { S , . . . , S m } . Initialization: x − = x , p = F ( x ), a = A = 0. ψ i ( x i ) = (cid:107) x i − x i (cid:107) , ≤ i ≤ m . for k = 1 to K do a k = γA k − L , A k = A k − + a k . for i = 1 to m do p ik = F i ( x k , . . . , x i − k , x ik − , . . . x mk − ). q ik = p ik + a k − a k ( F i ( x k − ) − p ik − ). x ik = arg min x i ∈ R di { ψ ik ( x i ) := ψ ik − ( x i ) + a k ( (cid:104) q ik , x i − u i (cid:105) + g i ( x i ) − g i ( u i )) } . end for end for return x K , ˜ x K = A K (cid:80) Kk =1 a k x k nonsmooth. Similarly as in the case of Lasso, the problem reduces to (P) using F ( x ) = ∇ f ( x ) , andthe same discussion as for Lasso applies for the Lipschitz constant of F . The iteration complexityof CODER for target error (cid:15) > in this case is bounded by O (cid:0) Lλ log( (cid:15) ) (cid:1) , where L ≤ (cid:107) A T A (cid:107) , butcan potentially be much smaller than (cid:107) A T A (cid:107) . Example 3 ( (cid:96) -norm regularized SVM) . When using (cid:96) -norm regularization, the formulationof support vector machine (SVM) is min x ∈ R d { n (cid:80) ni =1 max { − b i a Ti x , } + λ (cid:107) x (cid:107) } , where A = n [ a , a , . . . , a n ] T ∈ R n × d , b ∈ { , − } n , ¯ A = n [ b a , b a , . . . , b n a n ] T , and λ ≥ . Then with thefact that max { − x, } = max − ≤ y ≤ ( x − y , we know that the SVM problem is an instance of (P MM ) with F ( x , y ) = ( ¯ A T y , − ( ¯ Ax − n )) and g ( x , y ) = λ (cid:107) x (cid:107) + (cid:80) nj =1 − ≤ y i ≤ . The Lipschitzconstant L of F in this case satisﬁes L ≤ (cid:112) (cid:107) ¯ A T ¯ A (cid:107) , while the iteration complexity of CODER isbounded by O (cid:0) L(cid:15) (cid:1) . Example 4 ( (cid:96) regression) . Our results also apply to the problem of (cid:96) regression, where thegoal is to minimize (cid:107) Ax − b (cid:107) over x ∈ R d . This problem can be equivalently written as min x ∈ R d max x ∈ [0 , n (cid:104) Ax − b , x (cid:105) , which is of the form (P MM ) with F ( x ) = (cid:2) A T x − Ax (cid:3) , g ( x ) ≡ ,g ( x ) = − (cid:104) b , x (cid:105) + (cid:80) ni =1 I [0 , ( x i ) , where I [0 , is the convex indicator function of the interval [0 , (i.e., equal to zero within the interval and equal to inﬁnity outside it). In this case, L ≤ (cid:112) (cid:107) A T A (cid:107) , while the iteration complexity of CODER is O ( L (cid:107) x ∗ − x (cid:107) (cid:15) ) . In this section, we provide our main algorithmic result, summarized in Algorithm 1 (CODER), andanalyze its convergence. The algorithm can be seen as a cyclic block coordinate dual averagingmethod with extrapolation, where the q k update corresponds to the (dual) extrapolation step and x k corresponds to the dual averaging step.As is common for dual averaging-type methods, we base our analysis on the use of an estimatesequence ψ k ( x ) = (cid:80) mi =1 ψ ik ( x ) , where ψ ik is recursively deﬁned by ψ ik ( x ) := ψ ik − ( x ) + a k ( (cid:104) q ik , x i − u i (cid:105) + g i ( x i ) − g i ( u i )), ψ i ( x ) = (cid:107) x i − x i (cid:107) in Algorithm 1. Note here that u ∈ R d that appears inthe deﬁnition of ψ k is an arbitrary point. It is provided in the deﬁnition of ψ k for convenience ofthe analysis. All algorithm updates (and x k in particular) are completely independent of u , and7he algorithm can be stated without the knowledge of this point. This is clear from the deﬁnition of x k , as the arg min deﬁning it is independent of u . Unlike in most standard approaches, the estimate sequence ψ k used in our work is not directlyconstructed from our target notion of the gap (Gap( ˜ x k ; u ) or Gap( ˜ x k )). Instead, it is constructedusing the dual extrapolation step q k (in place of the standard use of F ( x k )), which requires carefulcontrol of the errors when bounding ψ k . The analysis is carried out using two core technical lemmasthat bound ψ k above and below (Lemma 1 and Lemma 2) and lead to bounds on both Gap( ˜ x k ; u )for arbitrary u ∈ dom( g ) and distance (cid:107) x k − x ∗ (cid:107) between the algorithm iterate x k and an arbitrarysolution x ∗ to (P).We start by bounding ψ k above, in the following lemma. Lemma 1.

In Algorithm 1, ∀ u ∈ dom( g ) and k ≥ ,ψ k ( x k ) ≤ (cid:107) u − x (cid:107) − γA k (cid:107) u − x k (cid:107) . Proof.

By deﬁnition, ψ k ( x ) = (cid:80) mi =1 ψ ik ( x ), where ψ ik ’s are speciﬁed in Algorithm 1. It follows that ∀ k ≥ , ψ k ( x ) = k (cid:88) j =1 a j ( (cid:104) q j , x − u (cid:105) + g ( x ) − g ( u )) + 12 (cid:107) x − x (cid:107) . = (cid:104) x − z k , x − u (cid:105) + A k ( g ( x ) − g ( u )) + 12 (cid:107) x − x (cid:107) , (6)where z k := x − (cid:80) kj =1 a j q j . Observe (from Algorithm 1) that x k = arg min x ∈ R d ψ k ( x ) . Thus, ∈ ∂ψ ( x k ) and there exists g (cid:48) ( x k ) ∈ ∂g ( x k ) such that x − z k + A k g (cid:48) ( x k ) + x k − x = . Solving the last equation for x − z k and plugging into Eq. (6), we have ψ k ( x k ) = A k ( g ( x k ) − g ( u ) − (cid:10) g (cid:48) ( x k ) , x k − u (cid:11) ) − (cid:104) x k − x , x k − u (cid:105) + 12 (cid:107) x k − x (cid:107) ≤ − A k γ (cid:107) x k − u (cid:107) − (cid:104) x k − x , x k − u (cid:105) + 12 (cid:107) x k − x (cid:107) = 12 (cid:107) u − x (cid:107) − γA k (cid:107) u − x k (cid:107) , where the second line is by γ -strong convexity of g and the last line is by (cid:104) x k − x , u − x k (cid:105) = (cid:107) x − u (cid:107) − (cid:107) x k − x (cid:107) − (cid:107) x k − u (cid:107) .Bounding ψ k ( x k ) below requires much more technical work and can be seen as the main technicalresult required for obtaining the convergence bound for Algorithm 1. The result is summarized inthe following lemma. Observe that the summation terms from the ﬁrst line are all bounded belowby Gap( x j ; u ), as F is monotone. This will be used for bounding the ﬁnal gap Gap( ˜ x k , u ) . Lemma 2.

In Algorithm 1, ∀ u ∈ dom( g ) and k ≥ ,ψ k ( x k ) ≥ k (cid:88) j =1 a j ( (cid:104) F ( x j ) , x j − u (cid:105) + g ( x j ) − g ( u )) − γA k (cid:107) u − x k (cid:107) + k (cid:88) j =1 (cid:16) γA j − (cid:107) x j − x j − (cid:107) − a j γA j (cid:107) F ( x j ) − p j (cid:107) (cid:17) . roof. By the deﬁnition of ψ k ( x ), ψ k ( x k ) = ψ k − ( x k ) + a k ( (cid:104) q k , x k − u (cid:105) + g ( x k ) − g ( u ))= ψ k − ( x k − ) + ( ψ k − ( x k ) − ψ k − ( x k − )) + a k ( (cid:104) q k , x k − u (cid:105) + g ( x k ) − g ( u )) . As ψ k − is (1 + γA k − )-strongly convex and minimized at x k − , we have ψ k − ( x k ) − ψ k − ( x k − ) ≥ γA k − (cid:107) x k − x k − (cid:107) , and, thus, ψ k ( x k ) ≥ ψ k − ( x k − ) + 1 + γA k − (cid:107) x k − x k − (cid:107) + a k ( (cid:104) q k , x k − u (cid:105) + g ( x k ) − g ( u )) . (7)Our next step is to bound a k (cid:104) q k , x k − u (cid:105) . By the deﬁnition of q k , and using simple algebraicmanipulations, a k (cid:104) q k , x k − u (cid:105) = a k (cid:68) p k + a k − a k ( F ( x k − ) − p k − ) , x k − u (cid:69) = a k (cid:104) p k − F ( x k ) , x k − u (cid:105) + a k (cid:104) F ( x k ) , x k − u (cid:105) + a k − (cid:104) F ( x k − ) − p k − , x k − x k − (cid:105) − a k − (cid:104) p k − − F ( x k − ) , x k − − u (cid:105) . (8)Further, using Cauchy-Schwarz and Young’s inequalities, we also have | a k − (cid:104) F ( x k − ) − p k − , x k − x k − (cid:105)|≤ a k − (cid:107) F ( x k − ) − p k − (cid:107)(cid:107) x k − x k − (cid:107)≤ a k − γA k − (cid:107) F ( x k − ) − p k − (cid:107) + 1 + γA k − (cid:107) x k − x k − (cid:107) . (9)Thus, combining (7)–(9), we have ψ k ( x k ) ≥ ψ k − ( x k − ) + 1 + γA k − (cid:107) x k − x k − (cid:107) − a k − γA k − (cid:107) F ( x k − ) − p k − (cid:107) + a k (cid:104) p k − F ( x k ) , x k − u (cid:105) − a k − (cid:104) p k − − F ( x k − ) , x k − − u (cid:105) + a k ( (cid:104) F ( x k ) , x k − u (cid:105) + g ( x k ) − g ( u )) . (10)Telescoping Eq. (10) from 1 to k, we now have ψ k ( x k ) ≥ ψ ( x ) + k (cid:88) j =1 (cid:16) γA j − (cid:107) x j − x j − (cid:107) − a j − γA j − (cid:107) F ( x j − ) − p j − (cid:107) (cid:17) + a k (cid:104) p k − F ( x k ) , x k − u (cid:105) − a (cid:104) p − F ( x ) , x − u (cid:105) + k (cid:88) j =1 a j ( (cid:104) F ( x j ) , x j − u (cid:105) + g ( x j ) − g ( u )) . To complete the proof, it remains to observe that ψ ( x ) = 0 , p = F ( x ) , and to bound a k (cid:104) p k − F ( x k ) , x k − u (cid:105) below by − a k γA k (cid:107) p k − F ( x k ) (cid:107) − γA k (cid:107) x k − u (cid:107) . This simply follows usingCauchy-Schwarz and Young’s inequality, as | a k (cid:104) p k − F ( x k ) , x k − u (cid:105)| ≤ a k (cid:107) p k − F ( x k ) (cid:107)(cid:107) x k − u (cid:107)≤ a k γA k (cid:107) p k − F ( x k ) (cid:107) + 1 + γA k (cid:107) x k − u (cid:107) , as claimed. 9e are now ready to state and prove our main result. Theorem 1.

In Algorithm 1, ∀ u ∈ dom( g ) and k ≥ , k (cid:88) j =1 a j ( (cid:104) F ( x j ) , x j − u (cid:105) + g ( x j ) − g ( u )) + 1 + γA k (cid:107) u − x k (cid:107) ≤ (cid:107) u − x (cid:107) . (11) In particular,

Gap( ˜ x k ; u ) ≤ A k (cid:107) u − x (cid:107) . Further, if x ∗ is any solution to Problem (P) , we also have (cid:107) x k − x ∗ (cid:107) ≤

21 + γA k (cid:107) x − x ∗ (cid:107) . In both bounds, A k ≥ max (cid:110) k L , L (cid:16) γ L (cid:17) k − (cid:111) . Proof.

Combining Lemmas 1 and 2, we have k (cid:88) j =1 a j ( (cid:104) F ( x j ) , x j − u (cid:105) + g ( x j ) − g ( u )) + 1 + γA k (cid:107) u − x k (cid:107) ≤ (cid:107) u − x (cid:107) + k (cid:88) j =1 (cid:16) a j γA j (cid:107) F ( x j ) − p j (cid:107) − γA j − (cid:107) x j − x j − (cid:107) (cid:17) . (12)To obtain the bounds in the theorem, we show that all the summation terms from the right-handside of Eq. (12) are non-positive. To do so, let ¯ x j,i = [( x j ) T , . . . , ( x i − j ) T , ( x ij − ) T , . . . , ( x mj − ) T ] T ,so that p ij = F i ( ¯ x j,i ) (i.e., ¯ x j,i corresponds to the value of vector x j in the i th iteration of the innerloop of Algorithm 1). Then, we have (cid:107) F ( x j ) − p j (cid:107) = m (cid:88) i =1 (cid:107) F i ( x j ) − p ij (cid:107) = m (cid:88) i =1 (cid:107) F i ( x j ) − F i ( ¯ x j,i ) (cid:107) ≤ m (cid:88) i =1 ( x j − ¯ x j,i ) T Q i ( x j − ¯ x j,i ) . By the deﬁnitions of (cid:98) Q i ’s and ¯ x j,i ’s, we further have( x j − ¯ x j,i ) T Q i ( x j − ¯ x j,i ) = ( x j − x j − ) T (cid:98) Q i ( x j − x j − ) , and, as a result, (cid:107) F ( x j ) − p j (cid:107) ≤ ( x j − x j − ) T (cid:16) m (cid:88) i =1 (cid:98) Q i (cid:17) ( x j − x j − ) ≤ L (cid:107) x j − x j − (cid:107) . (13)Meanwhile, by our choice of step sizes from Algorithm 1,1 + γA j − L a j γA j − ≥ L a j γA j . a j γA j (cid:107) F ( x j ) − p j (cid:107) ≤ γA j − (cid:107) x j − x j − (cid:107) for all j ≥ , and we can conclude from Eq. (12) that k (cid:88) j =1 a j ( (cid:104) F ( x j ) , x j − u (cid:105) + g ( x j ) − g ( u )) + 1 + γA k (cid:107) u − x k (cid:107) ≤ (cid:107) u − x (cid:107) . (14)Now, for the gap bound from the statement of the theorem, by monotonicity of F , we have (cid:104) F ( x j ) , x j − u (cid:105) ≥ (cid:104) F ( u ) , x j − u (cid:105) . Thus, recalling that ˜ x k = A k (cid:80) kj =1 a j x j and using Jensen’sinequality: A k Gap( ˜ x k ; u ) = A k (cid:104) F ( u ) , ˜ x k − u (cid:105) + g ( ˜ x k ) − g ( u ) ≤ k (cid:88) j =1 a j ( (cid:104) F ( x j ) , x j − u (cid:105) + g ( x j ) − g ( u )) ≤ (cid:107) u − x (cid:107) − γA k (cid:107) u − x k (cid:107) ≤ (cid:107) u − x (cid:107) , where the last two inequalities are by Eq. (14) and γA k (cid:107) u − x k (cid:107) ≥ . For the remaining bound, by the deﬁnition of x ∗ , Gap( ˜ x k ; x ∗ ) ≥ , and, thus (choosing u = x ∗ )1 + γA k (cid:107) x ∗ − x k (cid:107) ≤ (cid:107) x ∗ − x (cid:107) . Finally, as Algorithm 1 sets a j = γA j − L , A j = A j − + a j , ∀ j ≥ , we have A k ≥ k L (as γ ≥ A = 0) and A k ≥ A k − (cid:0) γ L (cid:1) ≥ A (cid:0) γ L (cid:1) k − , for all k ≥ CO ) and (P MM ) are summarized in the followingtwo corollaries. Here we only state the bounds for the optimality gap, as the bounds on (cid:107) x k − x ∗ (cid:107) are immediate from Theorem 1. Corollary 1.

Consider Problem (P CO ) , where the gradient of f is L -Lipschitz in the context ofAssumption 1 and g is γ -strongly convex for γ ≥ , and let x ∗ ∈ arg min x f ( x )+ g ( x ) . If Algorithm 1is applied to (P CO ) with F = ∇ f, then f ( ˜ x k ) + g ( ˜ x k ) − ( f ( x ∗ ) + g ( x ∗ )) ≤ (cid:107) x ∗ − x (cid:107) A k , where A k ≥ max (cid:110) k L , L (cid:16) γ L (cid:17) k − (cid:111) . Proof.

Observe that Theorem 1 applies, as F , g satisfy Assumption 1. By Jensen’s inequality andconvexity of f , f ( ˜ x k ) + g ( ˜ x k ) − ( f ( x ∗ ) + g ( x ∗ )) ≤ A k k (cid:88) i = j a j ( f ( x j ) + g ( x j ) − f ( x ∗ ) − g ( x ∗ )) ≤ A k k (cid:88) j =1 a j ( (cid:104)∇ f ( x j ) , x j − x ∗ (cid:105) + g ( x j ) − g ( x ∗ )) . As ∇ f ( x j ) = F ( x j ) , it remains to apply Eq. (11) with u = x ∗ . orollary 2. Consider Problem (P MM ) , where φ is convex-concave and its gradient is L -Lipschitzin the context of Assumption 1, and g , g are γ -strongly convex for some γ ≥ . If Algorithm 1 isapplied to (P MM ) with x = (cid:2) x x (cid:3) , F ( x ) = (cid:2) ∇ x φ ( x , x ) −∇ x φ ( x , x ) (cid:3) , and g ( x ) = g ( x ) − g ( x ) , then max y ∈ R d Φ( ˜ x ,k , y ) − min y ∈ R d Φ( y , ˜ x ,k ) ≤ D + D A k , where D = sup x , y ∈ dom( g ) (cid:107) x − y (cid:107) , D = sup x , y ∈ dom( g ) (cid:107) x − y (cid:107) , and A k ≥ max (cid:110) k L , L (cid:16) γ L (cid:17) k − (cid:111) . Proof.

Same as in the previous corollary, we apply Theorem 1 and use Eq. (11) to bound thegap. Observe that, by deﬁnition of Φ , max y ∈ R d Φ( ˜ x ,k , y ) = max y ∈ dom( g ) Φ( ˜ x ,k , y ) andmin y ∈ R d Φ( y , ˜ x ,k ) = min y ∈ dom( g ) Φ( y , ˜ x ,k ). Fix arbitrary y ∈ dom( g ) , y ∈ dom( g ) . Using Jensen’s inequality and that φ is concave in its second argument, we haveΦ( ˜ x ,k , y ) ≤ A k k (cid:88) j =1 a j Φ( x ,j , y )= 1 A k k (cid:88) j =1 a j ( φ ( x ,j , y ) + g ( x ,j ) − g ( y )) ≤ A k k (cid:88) j =1 a j ( φ ( x ,j , x ,j ) + (cid:104)∇ x φ ( x ,j , x ,j ) , y − x ,j (cid:105) + g ( x ,j ) − g ( y )) . By the same token,Φ( y , ˜ x ,k ) ≥ A k k (cid:88) j =1 a j ( φ ( x ,j , x ,j ) + (cid:104)∇ x φ ( x ,j , x ,j ) , y − x ,j (cid:105) + g ( y ) − g ( x ,j )) . Combining the bounds on Φ( ˜ x ,k , y ) , Φ( y , ˜ x ,k ) and using the deﬁnitions of F and g, we haveΦ( ˜ x ,k , y ) − Φ( y , ˜ x ,k ) ≤ k (cid:88) j =1 a j ( (cid:104) F ( x j ) , x j − y (cid:105) + g ( x j ) − g ( y )) . It remains to apply Eq. (11) and take supremum of both sides over y ∈ dom( g ) , y ∈ dom( g ) . Remark 1.

It is straightforward to extend the result from Theorem 1 to the setting where thecoordinate blocks are randomly permuted. This follows simply as, for the proof of the theorem togo through, the ordering of the blocks is irrelevant. The only thing that would change is that theLipschitz constant from Eq. (13) would depend on the random ordering of the blocks. Thus, to obtaina bound that holds in expectation (with expected L w.r.t. the random permutations of the blocks), itsuﬃces to take the expectation w.r.t. the random permutation of coordinate blocks when applyingEq. (13) and write the remaining inequalities in expectation w.r.t. all random permutations. Remark 2.

It is natural to ask whether the extrapolation step in CODER is really needed or not.Let us refer to the cyclic and randomized coordinate method variants without the extrapolationstep (i.e., with q ik = p ik in Step 8 of CODER) as the proximal CCM (PCCM) and proximal RCM PRCM). These methods only perform (block) coordinate dual-averaging steps (Step 9 of CODER).Even though PCCM can be observed to perform well in the conducted experiments (see Section 4),unlike CODER, neither PCCM nor PRCM are guaranteed to converge on the class of generalizedvariational inequalities. In particular, it is easy to construct examples on which both PCCM andPRCM diverge. Perhaps the simplest such example is the bilinear problem min x ∈ R d max y ∈ R d (cid:104) x , y (cid:105) , where each pair ( x i , y i ) is assigned to the same block. In this case, the block coordinate updates ofPCCM and PRCM boil down to (simultaneous) gradient descent-ascent updates, and, due to theseparability of the objective function, the divergent behavior of both methods follows as a simplecorollary of the results from Salimans et al. (2016), Liang and Stokes (2019). Computational considerations.

CODER, as stated in Algorithm 1, requires knowledge of theLipschitz parameter L and γ . This may seem like a limitation of our approach, especially since theLipschitzness of F assumed in our work is much diﬀerent than the traditional Lipschitz assumptionsfor either the full gradient or its (block) coordinate components. However, as we now argue, formost cases of interest this is not a concern. In particular, the strong convexity of g typically comesfrom regularization, which is a design choice and as such is typically known. On the other hand, itturns out that for our approach to work, the knowledge of the Lipschitz parameter L is not requiredat all, as this parameter can be estimated adaptively using the standard doubling trick (Nesterov,2015). This can be concluded from the fact that the only place in the analysis where the Lipschitzconstant of F is used is in Eq. (13), which allows a simple veriﬁcation and update to L wheneverthe stated inequality is not satisﬁed. Algorithm 2

Cyclic cOordinate Dual avEraging with extRapolation (CODER) Input: x ∈ dom( g ) , γ ≥ , L > , m, { S , . . . , S m } Initialization: x − = x , p = F ( x ), a = A = 0. ψ i ( x i ) = (cid:107) x i − x i (cid:107) , ≤ i ≤ m . for k = 1 to K do L k = L k − / repeat L k = 2 L k . a k = γA k − L k , A k = A k − + a k . for i = 1 to m do p ik = F i ( x k , . . . , x i − k , x ik − , . . . x mk − ) . q ik = p ik + a k − a k ( F i ( x k − ) − p ik − ). x ik = arg min x i ∈ R di (cid:8) ψ ik ( x i ) := ψ ik − ( x i ) + a k ( (cid:104) q ik , x i − u i (cid:105) + g i ( x i ) − g i ( u i )) (cid:9) . end for until (cid:107) F ( x k ) − p k (cid:107) ≤ L k (cid:107) x k − x k − (cid:107) end for return x K , ˜ x K = A K (cid:80) Kk =1 a k x k . By the argument used in the proof of Theorem 1 and the Lipschitz assumption on F (Assump-tion 1), this condition must be satisﬁed for any L ≥ (cid:13)(cid:13) (cid:80) i (cid:98) Q i (cid:13)(cid:13) . Thus, a natural approach is to startwith some initial estimate L > L and double it each time the condition from Eq. (13) fails.The total number of times that this estimate can be doubled is then bounded by log ( LL ) , and,under a mild assumption that L = O ( L ) and L is not overwhelmingly (e.g., exponentially in1 /(cid:15), n ) smaller than L , the total overhead due to estimating L is absorbed by the convergence boundfrom Theorem 1. While the doubling trick is standard, the fact that it can be applied at all in the13 = 10 − λ = 10 − λ = 10 − a9aMNIST Figure 1: Performance comparison of CODER and proximal variants of RCM and CCM, on (cid:96) -norm reg-ularized SVM problem and a9a (top row) and MNIST (bottom row) datasets. Only CODER hasprovable theoretical guarantees on this problem instance. Empirically, both CODER and PCCMoutperform PRCM, while CODER is generally the fastest of the three algorithms. setting of coordinate methods is surprising, as this is generally not possible for RCM, due to therandomized nature of the algorithm.The variant of CODER that implements this doubling trick is summarized in Algorithm 2.If one were to use a diﬀerent permutation of the blocks in each iteration (i.e., for each fullpass over all the blocks), the doubling trick would not necessarily be the best choice, as in theworst case we would be estimating the largest L over all the permutations; not the average one. Ofcourse, one could implement a bisection search for L in each iteration, but that would make theadded logarithmic cost multiplicative instead of additive. Extending CODER to a parameter-freesetting where a local Lipschitz constant can be used without a bisection search (as was done in,e.g., Malitsky (2019) for the full-vector update setting of variational inequalities) is an interestingdirection for future research. We evaluate the performance of cyclic and randomized coordinate methods on the nonsmoothconvex (cid:96) -norm-regularized SVM problem, as described in Example 3. As shown in Example 3,the min-max reformulation of this problem is an instance of the generalized variational inequalityproblem (P).For the considered min-max problem, we compare CODER to PCCM Chow et al. (2017) andPRCM. Both CODER and PCCM permute the coordinates once before each iteration and thenperform cyclic coordinate update under the ﬁxed order after permutation. The diﬀerence betweenCODER and PCCM is that PCCM does not use the extrapolation step (or equivalently is a variantof CODER obtained by setting q ik = p ik in Step 8 of Algorithm 1). PRCM chooses each coordinateuniformly at random and then performs the same dual averaging-style coordinate update as CODERand PCCM. All the compared algorithms pick one coordinate per iteration. We test all the three14lgorithms on two large scale datasets a9a and MNIST from the LIBSVM library Chang and Lin(2011). For simplicity, we normalize each data sample to unit Euclidean norm.In the experiments, we vary the (cid:96) -norm regularization parameter λ in { − , − , − } . Forall the settings, we tune the Lipschitz constants L in { /n ∗ k } ( n is the number of samples, k ∈ { , , . . . } ) and return iterate average (as it has better performance than last iterate) for all thethree algorithms. As is standard for ERM, we plot the function value gap of the primal problem interms of the number of passes over the dataset. (The optimal values were evaluated by solving (P)to high accuracy).As shown in Figure 1, both CODER and PCCM perform better than PRCM for all the cases,which veriﬁes the eﬀectiveness of cyclic coordinate updates. Further, CODER is generally fasterthan PCCM. Finally, as discussed in Remark 2, only CODER has theoretical convergence guaranteesfor general convex-concave min-max problems. We presented a novel extrapolated cyclic coordinate method CODER, which provably converges onthe class of generalized variational inequalities, which includes convex composite optimization andconvex-concave min-max optimization. CODER is the ﬁrst cyclic coordinate method that provablyconverges on this broad class of problems. Further, even on the restricted class of convex optimizationproblems, CODER provides improved convergence guarantees, based on a novel Lipschitz conditionfor the gradient. Some open questions that merit further investigation remain. For example, it is anintriguing question whether CODER can be accelerated on the class of smooth convex optimizationproblems. From a diﬀerent perspective, it would be very interesting to understand the complexityof standard optimization problem classes under our new Lipschitz condition by obtaining new oraclelower bounds.

References

Ahmet Alacaoglu, Quoc Tran Dinh, Olivier Fercoq, and Volkan Cevher. Smooth primal-dualcoordinate descent algorithms for nonsmooth convex optimization. In

Proc. NIPS’17 , 2017.Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher. Random extrapolation for primal-dualcoordinate descent. In

Proc. ICML’20 , 2020.Zeyuan Allen-Zhu, Zheng Qu, Peter Richt´arik, and Yang Yuan. Even faster accelerated coordinatedescent using non-uniform sampling. In

Proc. ICML’16 , 2016.Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods.

SIAM Journal on Optimization , 23(4):2037–2060, 2013.L´eon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublishedopen problem oﬀered to the attendance of the SLDS 2009 conference, 2009. URL http://leon.bottou.org/papers/bottou-slds-open-problem-2009 .Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In

Proc. NeurIPS’19 , 2019. For each sample of

MNIST , we reassign the label as 1 if it is in { , , . . . , } and − In experiments, all the algorithms diverge when k < k ∈ { , , } . SIAMJournal on Optimization , 28(4):2783–2808, 2018.Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.

ACMTransactions on Intelligent Systems and Technology , 2:27:1–27:27, 2011. Software available at .Yat Tin Chow, Tianyu Wu, and Wotao Yin. Cyclic coordinate-update algorithms for ﬁxed-pointproblems: Analysis and applications.

SIAM Journal on Scientiﬁc Computing , 39(4):A1280–A1300,2017.Cong Dang and Guanghui Lan. Randomized ﬁrst-order methods for saddle point optimization. arXiv preprint arXiv:1409.8625 , 2014.Jelena Diakonikolas and Lorenzo Orecchia. Alternating randomized block coordinate descent. In

Proc. ICML’18 , 2018.Francisco Facchinei and Jong-Shi Pang.

Finite-dimensional variational inequalities and complemen-tarity problems . Springer Science & Business Media, 2007.Olivier Fercoq and Pascal Bianchi. A coordinate-descent primal-dual algorithm with large step sizeand possibly nonseparable functions.

SIAM Journal on Optimization , 29(1):100–134, 2019.Olivier Fercoq and Peter Richt´arik. Accelerated, parallel, and proximal coordinate descent.

SIAMJournal on Optimization , 25(4):1997–2023, 2015.Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linearmodels via coordinate descent.

Journal of statistical software , 33(1):1, 2010.Mert G¨urb¨uzbalaban, Asuman Ozdaglar, Pablo A Parrilo, and N Denizcan Vanli. When cycliccoordinate descent outperforms randomized coordinate descent. In

Proc. NIPS’17 , 2017.Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm for generalconvex-concave saddle point problems. arXiv preprint arXiv:1803.01401 , 2018.Filip Hanzely and Peter Richt´arik. Accelerated coordinate descent with arbitrary sampling and bestrates for minibatches. In

Proc. AISTATS’19 , 2019.Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos. A new randomized block-coordinateprimal-dual proximal algorithm for distributed optimization.

IEEE Transactions on AutomaticControl , 64(10):4050–4065, 2019.Ching-Pei Lee and Stephen J Wright. Random permutations ﬁx a worst case for cyclic coordinatedescent.

IMA Journal of Numerical Analysis , 39(3):1246–1275, 2019.Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergenceof generative adversarial networks. In

Proc. AISTATS’19 , 2019.Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradi-ent method and its application to regularized empirical risk minimization.

SIAM Journal onOptimization , 25(4):2244–2273, 2015. 16i Liu, Steve Wright, Christopher R´e, Victor Bittorf, and Srikrishna Sridhar. An asynchronousparallel stochastic coordinate descent algorithm. In

Proc. ICML’14 , 2014.Yura Malitsky. Golden ratio algorithms for variational inequalities.

Mathematical Programming ,pages 1–28, 2019.Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent withnonconvex penalties.

Journal of the American Statistical Association , 106(495):1125–1138, 2011.Arkadi Nemirovski. Prox-method with rate of convergence O (1 /t ) for variational inequalities withLipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization , 15(1):229–251, 2004.Yu Nesterov. Eﬃciency of coordinate descent methods on huge-scale optimization problems.

SIAMJournal on Optimization , 22(2):341–362, 2012.Yurii Nesterov. Universal gradient methods for convex optimization problems.

MathematicalProgramming , 152(1-2):381–404, 2015.Yurii Nesterov and Sebastian U Stich. Eﬃciency of the accelerated coordinate descent method onstructured optimization problems.

SIAM Journal on Optimization , 27(1):110–123, 2017.Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinatedescent converges faster with the Gauss-Southwell rule than random selection. In

Proc. ICML’15 ,2015.Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of ﬁrst-order methods for convex-concave bilinear saddle-point problems.

Mathematical Programming , pages 1–35, 2019.Peter Richt´arik and Martin Tak´aˇc. Parallel coordinate descent methods for big data optimization.

Mathematical Programming , 156(1-2):433–484, 2016.Ankan Saha and Ambuj Tewari. On the nonasymptotic convergence of cyclic coordinate descentmethods.

SIAM Journal on Optimization , 23(1):576–601, 2013.Tim Salimans, Ian J Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.Improved techniques for training GANs. In

Proc. NIPS’16 , 2016.Hao-Jun Michael Shi, Shenyinying Tu, Yangyang Xu, and Wotao Yin. A primer on coordinatedescent algorithms. arXiv preprint arXiv:1610.00040 , 2016.Ruoyu Sun and Yinyu Ye. Worst-case complexity of cyclic coordinate descent: O ( n ) gap withrandomized version. Mathematical Programming , pages 1–34, 2019.Conghui Tan, Tong Zhang, Shiqian Ma, and Ji Liu. Stochastic primal-dual method for empiricalrisk minimization with O (1) per-iteration complexity. In Proc. NeurIPS’18 , 2018.Stephen Wright and Ching-pei Lee. Analyzing random permutations for cyclic coordinate descent.

Mathematics of Computation , 89(325):2217–2248, 2020.Stephen J Wright. Coordinate descent algorithms.

Mathematical Programming , 151(1):3–34, 2015.Tong Tong Wu, Kenneth Lange, et al. Coordinate descent algorithms for lasso penalized regression.

Annals of Applied Statistics , 2(1):224–244, 2008.17uchen Zhang and Xiao Lin. Stochastic primal-dual coordinate method for regularized empiricalrisk minimization. In

Proc. ICML’15 , 2015.Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.