Fast Cyclic Coordinate Dual Averaging with Extrapolation for Generalized Variational Inequalities∗

Chaobing Song and Jelena Diakonikolas
Department of Computer Sciences, University of Wisconsin–Madison
[email protected], [email protected]

∗ This research was partially supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation.
Abstract
We propose the Cyclic cOordinate Dual avEraging with extRapolation (CODER) method for generalized variational inequality problems. Such problems are fairly general and include composite convex minimization and min-max optimization as special cases. CODER is the first cyclic block coordinate method whose convergence rate is independent of the number of blocks, which fills the significant gap between cyclic coordinate methods and randomized ones that remained open for many years. Moreover, CODER provides the first theoretical guarantee for cyclic coordinate methods for solving generalized variational inequality problems under only monotonicity and Lipschitz continuity assumptions. To remove the dependence on the number of blocks, the analysis of CODER is based on a novel Lipschitz condition with respect to a Mahalanobis norm rather than the commonly used coordinate-wise Lipschitz condition; to be applicable to general variational inequalities, CODER leverages an extrapolation strategy inspired by the recent developments in primal-dual methods. Our theoretical results are complemented by numerical experiments, which demonstrate competitive performance of CODER compared to other coordinate methods.
1 Introduction

Large-scale optimization problems are omnipresent in machine learning. The ever-increasing scale of the problems renders standard first-order methods that rely on full gradient information impractical for many settings of interest. Fortunately, most of the standard machine learning problems possess useful structure that makes them amenable to efficient optimization methods that only access partial problem information at a time. A specific instance are (block) coordinate methods, which rely on accessing only a subset of coordinates of the objective function (sub)gradient at a time (Wright, 2015, Nesterov, 2012). These methods have been very popular over the past decade, finding applications in areas such as feature selection in high-dimensional computational statistics (Wu et al., 2008, Friedman et al., 2010, Mazumder et al., 2011), empirical risk minimization in machine learning (Zhang and Lin, 2015, Nesterov, 2012, Lin et al., 2015, Allen-Zhu et al., 2016, Alacaoglu et al., 2017, Gürbüzbalaban et al., 2017, Diakonikolas and Orecchia, 2018), and distributed computing (Liu et al., 2014, Fercoq and Richtárik, 2015, Richtárik and Takáč, 2016).

Coordinate methods are classified according to the order in which (blocks of) coordinates are selected and updated (Shi et al., 2016), generally falling into one of three main categories: (i) greedy, or Gauss-Southwell, methods, which greedily select coordinates that lead to the largest progress (e.g., coordinates with the largest magnitude of the gradient, which maximize progress in function value for descent-type methods), (ii) randomized methods, which select (blocks of) coordinates according to some probability distribution over the coordinate blocks, and (iii) cyclic methods, which update (blocks of) coordinates in a cyclic order. Although greedy methods can be quite effective, they are generally limited by the greedy selection criterion, which (except in some very specialized instances; see, e.g., Nutini et al. (2015)) requires reading full first-order information in each iteration. Thus, more attention in the literature has been given to randomized and cyclic methods.

From the aspect of theoretical guarantees, a major advantage of randomized coordinate methods (RCM) over cyclic variants has been the simplicity with which convergence arguments can be carried out. By sampling coordinates randomly with replacement, the expectation of a coordinate gradient is the full gradient, thus the analysis can be largely reduced to that of standard gradient descent. As a result, many variants of RCM with provable guarantees have been proposed for both convex minimization problems (Nesterov, 2012, Lin et al., 2015, Fercoq and Richtárik, 2015, Diakonikolas and Orecchia, 2018, Allen-Zhu et al., 2016, Hanzely and Richtárik, 2019, Nesterov and Stich, 2017) and convex-concave min-max problems (Dang and Lan, 2014, Zhang and Lin, 2015, Alacaoglu et al., 2017, Chambolle et al., 2018, Tan et al., 2018, Carmon et al., 2019, Latafat et al., 2019, Fercoq and Bianchi, 2019, Alacaoglu et al., 2020). The complexity of RCM as measured by the number of times full gradient information is accessed is no worse than for full-gradient first-order methods, making RCM suitable for high-dimensional settings.
However, these guarantees are attained only in expectation or with high probability. Meanwhile, to sample the coordinates, randomized methods must involve generation of pseudorandom numbers from a certain probability distribution, which makes the implementation complicated and may dominate the cost if the coordinate update is cheap. Furthermore, in practical tasks such as training of deep neural networks, the strategy of sampling with replacement is seldom used due to reduced performance caused by not iterating over all the coordinates with high probability in one pass (while sampling without replacement achieves this with probability one) (Bottou, 2009).

Compared to sampling with replacement, cyclically choosing coordinates or sampling without replacement (i.e., cyclically choosing coordinate blocks with their order determined according to a random permutation) appears more natural. In fact, cyclic coordinate methods (CCMs) often have better empirical performance than RCM (Beck and Tetruashvili, 2013, Chow et al., 2017, Sun and Ye, 2019). Due to its simplicity and efficiency, CCM has been the default approach in many well-known software packages for high-dimensional computational statistics such as SparseNet (Mazumder et al., 2011) and GLMNet (Friedman et al., 2010).

However, CCM is much harder to analyze than RCM because it is highly nontrivial to establish a connection between the (cyclically selected) coordinate gradient and the full gradient. As a result, compared to RCM, there are hardly any theoretical guarantees for CCM. In the seminal paper about RCM, Nesterov (2012) has remarked that it is "almost impossible to estimate the rate of convergence" of cyclic coordinate descent in the general case. However, some guarantees have been provided in the literature, albeit often under very restrictive assumptions such as the isotonicity of the gradient (Saha and Tewari, 2013) or with convergence rates that do not justify the better empirical performance of CCM over RCM (Beck and Tetruashvili, 2013). In particular, the iteration complexity result from Beck and Tetruashvili (2013) for the general convex optimization setting has linear dependence on the ambient dimension (or the number of blocks in the block coordinate setting). This linear dependence is expected, as the argument from Beck and Tetruashvili (2013) relies on treating the cyclic coordinate gradient as an approximation of the full gradient of the current iterate.

Beyond the setting of convex optimization, Chow et al. (2017) has provided convergence results for a variant of CCM applied to unconstrained monotone variational inequality problems (VIPs), where the operator F : R^d → R^d is assumed to be cocoercive. Cocoercivity is a very strong assumption, which leads to an equivalence between solving the original VIP (equivalently, finding a zero of F, which is also known as the monotone inclusion problem) and finding a fixed point of a nonexpansive (1-Lipschitz) operator (see, e.g., Facchinei and Pang (2007, Chapter 12)). This condition already fails to hold for bilinear matrix games, which is one of the most basic setups of min-max optimization. Moreover, the convergence rate of 1/k^{1/2} for reducing ‖F(x)‖ from Chow et al. (2017) is unsatisfying, as we expect faster convergence for this class of methods.

As monotone VIP includes convex optimization problems as a special case, to address both issues from Beck and Tetruashvili (2013), Chow et al. (2017), our main focus is on answering the following question:

Is it possible to obtain a cyclic coordinate method with dimension-independent iteration complexity for general monotone VIPs?
1.1 Our Contributions

We consider generalized Minty variational inequality (GMVI) problems, which ask for finding x* such that

⟨F(x), x − x*⟩ + g(x) − g(x*) ≥ 0, ∀x ∈ R^d,   (P)

where F : R^d → R^d is a monotone Lipschitz operator and g : R^d → R is a proper, convex, lower semicontinuous, block-separable function with an efficiently computable proximal operator (see Section 2 for precise definitions).

For this general problem (P), we propose a novel Cyclic cOordinate Dual avEraging with extRapolation (CODER) method. When F(x) is monotone and L-Lipschitz, and g(x) is general convex, to attain an ε-accurate solution to (P) defined as x*_ε that satisfies

⟨F(x), x − x*_ε⟩ + g(x) − g(x*_ε) ≥ −ε, ∀x ∈ R^d,   (P_ε)

CODER needs to equivalently access O(L/ε) full gradients, which is dimension-independent and recovers the optimal complexity of first-order methods for monotone Lipschitz VIPs. Moreover, if g(x) is assumed to be σ-strongly convex, the oracle complexity of CODER becomes O((L/σ) log(1/ε)). The constant L in the aforementioned bounds is defined based on a novel Lipschitz condition for F w.r.t. a Mahalanobis norm that we introduce (see Assumption 1 for a precise definition), and for most cases of interest is bounded above by the more traditional Lipschitz constant of F (defined w.r.t. the standard Euclidean norm). In fact, the Lipschitz constant resulting from our analysis is often much lower than the Euclidean Lipschitz constant. Meanwhile, the value of L is strongly influenced by the order of cyclic updates that the algorithm takes, thus it partially explains the effectiveness of random permutations for CCM. Furthermore, to the best of our knowledge, our work is the first to provide any type of convergence guarantees for CCM methods applied to generalized VIPs. Finally, our method provides consistent guarantees for arbitrary block separation, which is highly nontrivial in the min-max setting (see Remark 2).

To prove our main result, instead of treating the coordinate gradient as an approximation of the full gradient, we consider a novel approximation strategy that relates the collection of cyclic coordinate gradients from one full pass over the coordinates to a certain full implicit gradient. This collection perspective helps us remove the linear dependence on the dimension (or number of coordinate blocks). Furthermore, the approximation of the coordinate gradient collection is more accurate than the full gradient of the previous iterate, thus we show that cyclic coordinate methods can depend on a smaller Lipschitz constant L.
To make our results applicable to generalized monotone VIPs, we introduce an extrapolation step on the operator, which is inspired by the very recent momentum strategy for non-bilinear convex-concave min-max problems (Hamedani and Aybat, 2018).

1.2 Related Work

As discussed earlier, despite significant research activity devoted to randomized coordinate methods (Nesterov, 2012, Lin et al., 2015, Fercoq and Richtárik, 2015, Diakonikolas and Orecchia, 2018, Allen-Zhu et al., 2016, Hanzely and Richtárik, 2019, Nesterov and Stich, 2017, Zhang and Lin, 2015, Alacaoglu et al., 2017, Tan et al., 2018), far less attention has been given to cyclic coordinate variants, and specifically to their rigorous convergence guarantees.

In particular, while convergence guarantees have been established for smooth convex optimization problems in Beck and Tetruashvili (2013), the obtained bounds exhibit at least linear dependence on the number of blocks (equal to the dimension in the coordinate case). Further, the bound from Beck and Tetruashvili (2013) also scales with L_max/L_min, where L_max and L_min are the maximum and the minimum Lipschitz constants over the blocks, which is unsatisfying, as (block) coordinate methods mainly exhibit improvements over full gradient methods when the Lipschitz constants over blocks are highly non-uniform.

In general, vanilla CCM is known to be order-d² slower than RCM in the worst case (Sun and Ye, 2019), where d is the dimension, which is in conflict with its comparable and often superior performance compared to RCM in practice. This has led to more refined analyses of CCM with softer guarantees that explain why the worst-case examples are uncommon (Gürbüzbalaban et al., 2017, Lee and Wright, 2019, Wright and Lee, 2020). However, the existing results only apply to unconstrained convex quadratic problems.

By contrast to existing work, we introduce a novel extrapolation-based CCM that applies to a broad class of generalized variational inequality problems, which contains convex (composite) optimization as a special case. In the case of convex quadratic functions and unlike RCM or existing CCM methods, the results we obtain never exhibit worse complexity than the full-gradient methods, and are often of much lower complexity. Further, our method provably converges on min-max problems on which standard CCM and RCM methods diverge in general (see Remark 2).

2 Notation and Preliminaries

We consider the d-dimensional Euclidean space (R^d, ‖·‖), where ‖·‖ = √⟨·, ·⟩ denotes the Euclidean norm, ⟨·, ·⟩ denotes the (standard) inner product, and d is assumed to be finite. Given a matrix B, the operator norm of B is defined in a standard way as ‖B‖ = max{‖Bx‖ : x ∈ R^d, ‖x‖ ≤ 1}. We use 0 to denote an all-zeros vector.

Throughout the paper, we assume that there is a given partition of the set {1, 2, . . . , d} into sets S_i, i ∈ {1, . . . , m}, where |S_i| = d_i > 0. For notational convenience, we assume that the sets S_i are comprised of consecutive elements from {1, 2, . . . , d}, that is, S_1 = {1, 2, . . . , d_1}, S_2 = {d_1 + 1, d_1 + 2, . . . , d_1 + d_2}, . . . , S_m = {Σ_{j=1}^{m−1} d_j + 1, Σ_{j=1}^{m−1} d_j + 2, . . . , Σ_{j=1}^{m} d_j}. This assumption is without loss of generality, as all our results are invariant to permutations of the coordinates. For an operator F : R^d → R^d, we use F_i to denote its coordinate components indexed by S_i.
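To fix ideas, the block notation above can be mirrored directly in code. The following is a small illustration of our own (not from the paper): it builds consecutive index sets S_i, evaluates the block components F_i of an operator, and computes a Mahalanobis norm ‖x‖_Q of the kind used in the Lipschitz condition introduced below.

```python
import numpy as np

def make_blocks(block_sizes):
    """Return index arrays S_1, ..., S_m of consecutive coordinates."""
    ends = np.cumsum(block_sizes)
    starts = ends - np.asarray(block_sizes)
    return [np.arange(s, e) for s, e in zip(starts, ends)]

def block_component(F, x, S_i):
    """F_i(x): the coordinates of F(x) indexed by S_i."""
    return F(x)[S_i]

def mahalanobis_norm(x, Q):
    """||x||_Q = sqrt(x^T Q x) for a positive semidefinite Q."""
    return float(np.sqrt(x @ (Q @ x)))

# Tiny usage example: d = 5 split into blocks of sizes (2, 3).
blocks = make_blocks([2, 3])
A = np.random.randn(4, 5)
F = lambda x: A.T @ (A @ x)           # gradient of (1/2)||Ax||^2
x = np.random.randn(5)
print([block_component(F, x, S) for S in blocks])
print(mahalanobis_norm(x, A.T @ A))
```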
We say that an operator F : R^d → R^d is monotone if, ∀x, y ∈ R^d, ⟨F(x) − F(y), x − y⟩ ≥ 0. An operator F : R^d → R^d is said to be M-Lipschitz if, ∀x, y ∈ R^d,

‖F(x) − F(y)‖ ≤ M‖x − y‖.   (1)

Given a proper, convex, lower semicontinuous function g : R^d → R ∪ {+∞}, we use ∂g(x) to denote the subdifferential set (the set of all subgradients) of g at x ∈ dom(g). Of particular interest to us are functions g whose proximal operator (or resolvent), defined by

prox_{g,τ}(z) = argmin_{x ∈ R^d} { g(x) + (1/(2τ))‖x − z‖² },   (2)

is efficiently computable for all z ∈ R^d and τ > 0.
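Since each regularizer is only accessed through the proximal operator in Eq. (2), it may help to see it instantiated. The sketch below is ours (not from the paper) and implements the prox of an ℓ1 norm (soft-thresholding) and of a squared ℓ2 term, the two choices that appear in the examples later on.

```python
import numpy as np

def prox_l1(z, tau, lam=1.0):
    """prox_{g,tau}(z) for g(x) = lam * ||x||_1: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau * lam, 0.0)

def prox_sq_l2(z, tau, lam=1.0):
    """prox_{g,tau}(z) for g(x) = (lam/2) * ||x||^2 (lam-strongly convex)."""
    return z / (1.0 + tau * lam)

z = np.array([1.5, -0.2, 0.7])
print(prox_l1(z, tau=0.5), prox_sq_l2(z, tau=0.5))
```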
To treat the cases in which g is merely convex and strongly convex in a unified way, we say that g is γ-strongly convex for γ ≥ 0 if, for all x, y ∈ R^d and all α ∈ (0, 1),

g((1 − α)x + αy) ≤ (1 − α)g(x) + αg(y) − (γ/2)α(1 − α)‖x − y‖².

If g′(x) ∈ ∂g(x) is any subgradient of g at x, then we also have, ∀y ∈ R^d,

g(y) ≥ g(x) + ⟨g′(x), y − x⟩ + (γ/2)‖y − x‖².

Problem definition.
As discussed in the introduction, we consider Problem (P), under the following assumption.
Assumption 1.
(i) There exists at least one x* that solves (P).

(ii) g(x) is γ-strongly convex, where γ ≥ 0, and block-separable over the coordinate sets {S_i}_{i=1}^m: g(x) = Σ_{i=1}^m g_i(x^i), where x^i is the d_i-dimensional vector comprised of the entries of x corresponding to the coordinates from S_i. Each g_i(x^i), 1 ≤ i ≤ m, admits an efficiently computable proximal operator.

(iii) The operator F : R^d → R^d is monotone. Further, there exist positive semidefinite matrices Q_i, 1 ≤ i ≤ m, such that each F_i(·) is 1-Lipschitz continuous w.r.t. the norm ‖·‖_{Q_i}, i.e., ∀x, y ∈ R^d,

‖F_i(x) − F_i(y)‖ ≤ √((x − y)^T Q_i (x − y)) = ‖x − y‖_{Q_i},   (3)

where F_i(x) is the d_i-dimensional vector comprised of the S_i coordinates of F(x). Finally, ‖Σ_{i=1}^m Q̂_i‖ = L² < ∞, where Q̂_i is defined by

(Q̂_i)_{j,k} = (Q_i)_{j,k} if min{j, k} > Σ_{ℓ=1}^{i−1} d_ℓ, and (Q̂_i)_{j,k} = 0 otherwise.

That is, Q̂_i corresponds to the matrix Q_i with the first i − 1 blocks of rows and columns set to zero.

For notational convenience, given a candidate solution x̂ ∈ R^d and an arbitrary point u ∈ R^d, we define

Gap(x̂; u) := ⟨F(u), x̂ − u⟩ + g(x̂) − g(u),   (4)

so that

Gap(x̂) := sup_{u ∈ R^d} Gap(x̂; u)   (5)

defines the error of the candidate solution x̂ for Problem (P). In particular, if Gap(x̂) ≤ ε for some ε > 0, then ⟨F(x), x − x̂⟩ + g(x) − g(x̂) ≥ −ε, ∀x ∈ R^d, which defines the ε-approximation for Problem (P).
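To make the constant L in Assumption 1 concrete, the following sketch (ours; it assumes that the matrices Q_i are available explicitly, which is the case in the examples below) forms the matrices Q̂_i by zeroing out the rows and columns of the first i − 1 blocks and returns L = ‖Σ_i Q̂_i‖^{1/2}, together with the corresponding full-gradient constant √(‖Σ_i Q_i‖) for comparison.

```python
import numpy as np

def coder_lipschitz_constant(Qs, block_sizes):
    """L = ||sum_i Qhat_i||^{1/2}, where Qhat_i zeroes the rows/columns
    of Q_i belonging to the first i-1 blocks (Assumption 1)."""
    d = sum(block_sizes)
    offsets = np.cumsum([0] + list(block_sizes))   # offsets[i] = d_1 + ... + d_i
    S = np.zeros((d, d))
    for i, Q in enumerate(Qs):
        cut = offsets[i]                  # first index kept for block i+1 (0-indexed i)
        Qhat = Q.copy()
        Qhat[:cut, :] = 0.0
        Qhat[:, :cut] = 0.0
        S += Qhat
    return np.sqrt(np.linalg.norm(S, ord=2))       # spectral norm

def full_lipschitz_constant(Qs):
    """sqrt(||sum_i Q_i||), the bound on the Euclidean Lipschitz constant of F."""
    return np.sqrt(np.linalg.norm(sum(Qs), ord=2))
```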
Comparison of Lipschitzness assumptions. Standard Lipschitz assumptions that are used for full gradient methods are typically stated as in Eq. (1). Observe that the Lipschitz constant of the entire operator F under our assumptions is bounded by √(‖Σ_{i=1}^m Q_i‖), as, ∀x, y,

‖F(x) − F(y)‖² = Σ_{i=1}^m ‖F_i(x) − F_i(y)‖² ≤ Σ_{i=1}^m (x − y)^T Q_i (x − y) ≤ ‖Σ_{i=1}^m Q_i‖ ‖x − y‖².

In the worst case for full gradient methods, it is possible that M = √(‖Σ_{i=1}^m Q_i‖), and this worst case in fact happens for many interesting examples discussed below. The guarantees that we provide for our method are in terms of L = √(‖Σ_{i=1}^m Q̂_i‖). It is not hard to show that in general ‖Σ_{i=1}^m Q̂_i‖ ≤ ‖Σ_{i=1}^m Q_i‖, while it is possible that ‖Σ_{i=1}^m Q̂_i‖ ≪ ‖Σ_{i=1}^m Q_i‖.

In the literature on standard (randomized and cyclic) block coordinate methods and in the case where F is the gradient of a convex function, the Lipschitz assumptions are typically stated as (Nesterov, 2012): ‖F_i(x) − F_i(y)‖ ≤ L_i‖x − y‖, where x, y ∈ R^d are restricted to only differ over the i-th block of coordinates. These assumptions are hard to directly compare to our Lipschitz assumptions stated in Assumption 1. What can be said is that in general L_i ≤ ‖Q_i‖^{1/2}; however, note that our final convergence bound is in terms of ‖Σ_{i=1}^m Q̂_i‖, which is incomparable to the weighted sums of the L_i's that typically appear in the convergence bounds for block coordinate methods. Further, note that the coordinate Lipschitz assumptions used for convex optimization are generally not suitable for min-max setups. In particular, for bilinear problems, all coordinate Lipschitz constants defined as in Nesterov (2012) would be zero, which does not appear meaningful, given the non-zero complexity of bilinear problems (Ouyang and Xu, 2019).

To demonstrate the full power of our method and how our notion of block coordinate Lipschitzness can provide much improved results compared to other methods on some standard classes of problems with applications in machine learning, in the following we provide a few illustrative examples.
Example applications. Before providing concrete example applications, we note that our problem of interest (P) captures broad classes of optimization problems, such as convex-concave min-max optimization

min_{x^1 ∈ R^{d_1}} max_{x^2 ∈ R^{d_2}} Φ(x^1, x^2),   (P_MM)

where Φ(x^1, x^2) := φ(x^1, x^2) + g_1(x^1) − g_2(x^2), d_1 + d_2 = d, φ is convex-concave and smooth, and g_1, g_2 are convex and "simple" (i.e., have efficiently computable proximal operators), and convex composite optimization

min_{x ∈ R^d} { f(x) + g(x) },   (P_CO)

where f is smooth and convex and g is convex and "simple."

To reduce (P_MM) to (P), it suffices to stack the vectors x^1, x^2 and define x = [x^1; x^2], F(x) = [∇_{x^1}φ(x^1, x^2); −∇_{x^2}φ(x^1, x^2)], g(x) = g_1(x^1) + g_2(x^2). To reduce (P_CO) to (P), it suffices to take F(x) = ∇f(x), while g is the same for both problems. See, e.g., Nemirovski (2004), Malitsky (2019) and Corollaries 1 and 2 for more information. Now we provide some concrete example applications.

Example 1 (Lasso). The well-known Lasso problem min_{x ∈ R^d} (1/2)‖Ax − b‖² + λ‖x‖_1 is an example of (P_CO) and a special case of (P), where A = [a_1, a_2, . . . , a_n]^T ∈ R^{n×d}, b ∈ R^n, F(x) = A^T(Ax − b), g(x) = λ‖x‖_1, m = d, d_1 = d_2 = ··· = d_m = 1. For this setup, we have ‖F(x) − F(y)‖ = ‖A^T A(x − y)‖ = √((x − y)^T (A^T A)² (x − y)). The tightest Lipschitz constant of F(x) that we can select is ‖A^T A‖. Meanwhile, ‖F_i(x) − F_i(y)‖ = ‖(a^i)^T A(x − y)‖ = √((x − y)^T Q_i (x − y)) with Q_i = A^T a^i (a^i)^T A, where a^i denotes the i-th column of A. The Lipschitz constant L for our algorithm is given by L = √(‖Σ_{i=1}^d Q̂_i‖) ≤ ‖A^T A‖ (but it can be much lower than ‖A^T A‖), and the resulting iteration complexity of CODER given target error ε > 0 is O(L‖x* − x_0‖²/ε).
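For Example 1, the matrices Q_i and the resulting constant L can be computed explicitly. The sketch below is ours (not from the paper): it builds Q_i = A^T a^i (a^i)^T A from the columns a^i of A and compares L = ‖Σ_i Q̂_i‖^{1/2} with the full-gradient constant ‖A^T A‖.

```python
import numpy as np

def lasso_operator(A, b):
    """F(x) = A^T (A x - b), the gradient of (1/2)||Ax - b||^2."""
    return lambda x: A.T @ (A @ x - b)

def lasso_coder_constant(A):
    """L = ||sum_i Qhat_i||^{1/2} for the Lasso operator with size-1 blocks."""
    G = A.T @ A                         # row i of G equals (a^i)^T A
    d = A.shape[1]
    S = np.zeros((d, d))
    for i in range(d):
        Q = np.outer(G[i], G[i])        # Q_i = A^T a^i (a^i)^T A
        Q[:i, :] = 0.0                  # Qhat_i: zero the rows/columns of earlier blocks
        Q[:, :i] = 0.0
        S += Q
    return np.sqrt(np.linalg.norm(S, ord=2))

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
L_cyclic = lasso_coder_constant(A)
M_full = np.linalg.norm(A.T @ A, ord=2)     # Euclidean Lipschitz constant of F
print(L_cyclic, M_full)                     # by the discussion above, L_cyclic <= M_full
```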
The iteration complexityof CODER for target error (cid:15) > in this case is bounded by O (cid:0) Lλ log( (cid:15) ) (cid:1) , where L ≤ (cid:107) A T A (cid:107) , butcan potentially be much smaller than (cid:107) A T A (cid:107) . Example 3 ( (cid:96) -norm regularized SVM) . When using (cid:96) -norm regularization, the formulationof support vector machine (SVM) is min x ∈ R d { n (cid:80) ni =1 max { − b i a Ti x , } + λ (cid:107) x (cid:107) } , where A = n [ a , a , . . . , a n ] T ∈ R n × d , b ∈ { , − } n , ¯ A = n [ b a , b a , . . . , b n a n ] T , and λ ≥ . Then with thefact that max { − x, } = max − ≤ y ≤ ( x − y , we know that the SVM problem is an instance of (P MM ) with F ( x , y ) = ( ¯ A T y , − ( ¯ Ax − n )) and g ( x , y ) = λ (cid:107) x (cid:107) + (cid:80) nj =1 − ≤ y i ≤ . The Lipschitzconstant L of F in this case satisfies L ≤ (cid:112) (cid:107) ¯ A T ¯ A (cid:107) , while the iteration complexity of CODER isbounded by O (cid:0) L(cid:15) (cid:1) . Example 4 ( (cid:96) regression) . Our results also apply to the problem of (cid:96) regression, where thegoal is to minimize (cid:107) Ax − b (cid:107) over x ∈ R d . This problem can be equivalently written as min x ∈ R d max x ∈ [0 , n (cid:104) Ax − b , x (cid:105) , which is of the form (P MM ) with F ( x ) = (cid:2) A T x − Ax (cid:3) , g ( x ) ≡ ,g ( x ) = − (cid:104) b , x (cid:105) + (cid:80) ni =1 I [0 , ( x i ) , where I [0 , is the convex indicator function of the interval [0 , (i.e., equal to zero within the interval and equal to infinity outside it). In this case, L ≤ (cid:112) (cid:107) A T A (cid:107) , while the iteration complexity of CODER is O ( L (cid:107) x ∗ − x (cid:107) (cid:15) ) . In this section, we provide our main algorithmic result, summarized in Algorithm 1 (CODER), andanalyze its convergence. The algorithm can be seen as a cyclic block coordinate dual averagingmethod with extrapolation, where the q k update corresponds to the (dual) extrapolation step and x k corresponds to the dual averaging step.As is common for dual averaging-type methods, we base our analysis on the use of an estimatesequence ψ k ( x ) = (cid:80) mi =1 ψ ik ( x ) , where ψ ik is recursively defined by ψ ik ( x ) := ψ ik − ( x ) + a k ( (cid:104) q ik , x i − u i (cid:105) + g i ( x i ) − g i ( u i )), ψ i ( x ) = (cid:107) x i − x i (cid:107) in Algorithm 1. Note here that u ∈ R d that appears inthe definition of ψ k is an arbitrary point. It is provided in the definition of ψ k for convenience ofthe analysis. All algorithm updates (and x k in particular) are completely independent of u , and7he algorithm can be stated without the knowledge of this point. This is clear from the definition of x k , as the arg min defining it is independent of u . Unlike in most standard approaches, the estimate sequence ψ k used in our work is not directlyconstructed from our target notion of the gap (Gap( ˜ x k ; u ) or Gap( ˜ x k )). Instead, it is constructedusing the dual extrapolation step q k (in place of the standard use of F ( x k )), which requires carefulcontrol of the errors when bounding ψ k . The analysis is carried out using two core technical lemmasthat bound ψ k above and below (Lemma 1 and Lemma 2) and lead to bounds on both Gap( ˜ x k ; u )for arbitrary u ∈ dom( g ) and distance (cid:107) x k − x ∗ (cid:107) between the algorithm iterate x k and an arbitrarysolution x ∗ to (P).We start by bounding ψ k above, in the following lemma. Lemma 1.
In Algorithm 1, ∀u ∈ dom(g) and k ≥ 1,

ψ_k(x_k) ≤ (1/2)‖u − x_0‖² − ((1 + γA_k)/2)‖u − x_k‖².

Proof.
By definition, ψ_k(x) = Σ_{i=1}^m ψ_k^i(x^i), where the ψ_k^i's are specified in Algorithm 1. It follows that, ∀k ≥ 1,

ψ_k(x) = Σ_{j=1}^k a_j(⟨q_j, x − u⟩ + g(x) − g(u)) + (1/2)‖x − x_0‖²
       = ⟨x_0 − z_k, x − u⟩ + A_k(g(x) − g(u)) + (1/2)‖x − x_0‖²,   (6)

where z_k := x_0 − Σ_{j=1}^k a_j q_j. Observe (from Algorithm 1) that x_k = argmin_{x ∈ R^d} ψ_k(x). Thus, 0 ∈ ∂ψ_k(x_k) and there exists g′(x_k) ∈ ∂g(x_k) such that

x_0 − z_k + A_k g′(x_k) + x_k − x_0 = 0.

Solving the last equation for x_0 − z_k and plugging into Eq. (6), we have

ψ_k(x_k) = A_k(g(x_k) − g(u) − ⟨g′(x_k), x_k − u⟩) − ⟨x_k − x_0, x_k − u⟩ + (1/2)‖x_k − x_0‖²
         ≤ −(A_k γ/2)‖x_k − u‖² − ⟨x_k − x_0, x_k − u⟩ + (1/2)‖x_k − x_0‖²
         = (1/2)‖u − x_0‖² − ((1 + γA_k)/2)‖u − x_k‖²,

where the second line is by the γ-strong convexity of g and the last line is by ⟨x_k − x_0, u − x_k⟩ = (1/2)‖x_0 − u‖² − (1/2)‖x_k − x_0‖² − (1/2)‖x_k − u‖².

Bounding ψ_k(x_k) below requires much more technical work and can be seen as the main technical result required for obtaining the convergence bound for Algorithm 1. The result is summarized in the following lemma. Observe that the summation terms from the first line are all bounded below by Gap(x_j; u), as F is monotone. This will be used for bounding the final gap Gap(x̃_k; u).

Lemma 2.
In Algorithm 1, ∀u ∈ dom(g) and k ≥ 1,

ψ_k(x_k) ≥ Σ_{j=1}^k a_j(⟨F(x_j), x_j − u⟩ + g(x_j) − g(u)) − ((1 + γA_k)/4)‖u − x_k‖²
          + Σ_{j=1}^k ( ((1 + γA_{j−1})/4)‖x_j − x_{j−1}‖² − (a_j²/(1 + γA_j))‖F(x_j) − p_j‖² ).

Proof. By the definition of ψ_k(x),

ψ_k(x_k) = ψ_{k−1}(x_k) + a_k(⟨q_k, x_k − u⟩ + g(x_k) − g(u))
         = ψ_{k−1}(x_{k−1}) + (ψ_{k−1}(x_k) − ψ_{k−1}(x_{k−1})) + a_k(⟨q_k, x_k − u⟩ + g(x_k) − g(u)).

As ψ_{k−1} is (1 + γA_{k−1})-strongly convex and minimized at x_{k−1}, we have ψ_{k−1}(x_k) − ψ_{k−1}(x_{k−1}) ≥ ((1 + γA_{k−1})/2)‖x_k − x_{k−1}‖², and, thus,

ψ_k(x_k) ≥ ψ_{k−1}(x_{k−1}) + ((1 + γA_{k−1})/2)‖x_k − x_{k−1}‖² + a_k(⟨q_k, x_k − u⟩ + g(x_k) − g(u)).   (7)

Our next step is to bound a_k⟨q_k, x_k − u⟩. By the definition of q_k, and using simple algebraic manipulations,

a_k⟨q_k, x_k − u⟩ = a_k⟨p_k + (a_{k−1}/a_k)(F(x_{k−1}) − p_{k−1}), x_k − u⟩
  = a_k⟨p_k − F(x_k), x_k − u⟩ + a_k⟨F(x_k), x_k − u⟩
    + a_{k−1}⟨F(x_{k−1}) − p_{k−1}, x_k − x_{k−1}⟩ − a_{k−1}⟨p_{k−1} − F(x_{k−1}), x_{k−1} − u⟩.   (8)

Further, using the Cauchy-Schwarz and Young inequalities, we also have

|a_{k−1}⟨F(x_{k−1}) − p_{k−1}, x_k − x_{k−1}⟩| ≤ a_{k−1}‖F(x_{k−1}) − p_{k−1}‖ ‖x_k − x_{k−1}‖
  ≤ (a_{k−1}²/(1 + γA_{k−1}))‖F(x_{k−1}) − p_{k−1}‖² + ((1 + γA_{k−1})/4)‖x_k − x_{k−1}‖².   (9)

Thus, combining (7)–(9), we have

ψ_k(x_k) ≥ ψ_{k−1}(x_{k−1}) + ((1 + γA_{k−1})/4)‖x_k − x_{k−1}‖² − (a_{k−1}²/(1 + γA_{k−1}))‖F(x_{k−1}) − p_{k−1}‖²
  + a_k⟨p_k − F(x_k), x_k − u⟩ − a_{k−1}⟨p_{k−1} − F(x_{k−1}), x_{k−1} − u⟩
  + a_k(⟨F(x_k), x_k − u⟩ + g(x_k) − g(u)).   (10)

Telescoping Eq. (10) from 1 to k, we now have

ψ_k(x_k) ≥ ψ_0(x_0) + Σ_{j=1}^k ( ((1 + γA_{j−1})/4)‖x_j − x_{j−1}‖² − (a_{j−1}²/(1 + γA_{j−1}))‖F(x_{j−1}) − p_{j−1}‖² )
  + a_k⟨p_k − F(x_k), x_k − u⟩ − a_0⟨p_0 − F(x_0), x_0 − u⟩
  + Σ_{j=1}^k a_j(⟨F(x_j), x_j − u⟩ + g(x_j) − g(u)).

To complete the proof, it remains to observe that ψ_0(x_0) = 0, a_0 = 0, p_0 = F(x_0), and to bound a_k⟨p_k − F(x_k), x_k − u⟩ below by −(a_k²/(1 + γA_k))‖p_k − F(x_k)‖² − ((1 + γA_k)/4)‖x_k − u‖². This simply follows using the Cauchy-Schwarz and Young inequalities, as

|a_k⟨p_k − F(x_k), x_k − u⟩| ≤ a_k‖p_k − F(x_k)‖ ‖x_k − u‖ ≤ (a_k²/(1 + γA_k))‖p_k − F(x_k)‖² + ((1 + γA_k)/4)‖x_k − u‖²,

as claimed.

We are now ready to state and prove our main result.

Theorem 1.
In Algorithm 1, ∀u ∈ dom(g) and k ≥ 1,

Σ_{j=1}^k a_j(⟨F(x_j), x_j − u⟩ + g(x_j) − g(u)) + ((1 + γA_k)/4)‖u − x_k‖² ≤ (1/2)‖u − x_0‖².   (11)

In particular,

Gap(x̃_k; u) ≤ ‖u − x_0‖²/(2A_k).

Further, if x* is any solution to Problem (P), we also have

‖x_k − x*‖² ≤ (2/(1 + γA_k))‖x_0 − x*‖².

In both bounds, A_k ≥ max{ k/(2L), (1/(2L))(1 + γ/(2L))^{k−1} }.

Proof.
Combining Lemmas 1 and 2, we have

Σ_{j=1}^k a_j(⟨F(x_j), x_j − u⟩ + g(x_j) − g(u)) + ((1 + γA_k)/4)‖u − x_k‖²
  ≤ (1/2)‖u − x_0‖² + Σ_{j=1}^k ( (a_j²/(1 + γA_j))‖F(x_j) − p_j‖² − ((1 + γA_{j−1})/4)‖x_j − x_{j−1}‖² ).   (12)

To obtain the bounds in the theorem, we show that all the summation terms from the right-hand side of Eq. (12) are non-positive. To do so, let x̄_{j,i} = [(x_j^1)^T, . . . , (x_j^{i−1})^T, (x_{j−1}^i)^T, . . . , (x_{j−1}^m)^T]^T, so that p_j^i = F_i(x̄_{j,i}) (i.e., x̄_{j,i} corresponds to the value of the vector x_j in the i-th iteration of the inner loop of Algorithm 1). Then, we have

‖F(x_j) − p_j‖² = Σ_{i=1}^m ‖F_i(x_j) − p_j^i‖² = Σ_{i=1}^m ‖F_i(x_j) − F_i(x̄_{j,i})‖² ≤ Σ_{i=1}^m (x_j − x̄_{j,i})^T Q_i (x_j − x̄_{j,i}).

By the definitions of the Q̂_i's and x̄_{j,i}'s, we further have

(x_j − x̄_{j,i})^T Q_i (x_j − x̄_{j,i}) = (x_j − x_{j−1})^T Q̂_i (x_j − x_{j−1}),

and, as a result,

‖F(x_j) − p_j‖² ≤ (x_j − x_{j−1})^T ( Σ_{i=1}^m Q̂_i ) (x_j − x_{j−1}) ≤ L²‖x_j − x_{j−1}‖².   (13)

Meanwhile, by our choice of step sizes from Algorithm 1, a_j = (1 + γA_{j−1})/(2L) and A_j ≥ A_{j−1}, so

(1 + γA_{j−1})/4 = L² a_j²/(1 + γA_{j−1}) ≥ L² a_j²/(1 + γA_j).

Hence (a_j²/(1 + γA_j))‖F(x_j) − p_j‖² ≤ ((1 + γA_{j−1})/4)‖x_j − x_{j−1}‖² for all j ≥ 1, and we can conclude from Eq. (12) that

Σ_{j=1}^k a_j(⟨F(x_j), x_j − u⟩ + g(x_j) − g(u)) + ((1 + γA_k)/4)‖u − x_k‖² ≤ (1/2)‖u − x_0‖².   (14)

Now, for the gap bound from the statement of the theorem, by monotonicity of F, we have ⟨F(x_j), x_j − u⟩ ≥ ⟨F(u), x_j − u⟩. Thus, recalling that x̃_k = (1/A_k) Σ_{j=1}^k a_j x_j and using Jensen's inequality:

A_k Gap(x̃_k; u) = A_k(⟨F(u), x̃_k − u⟩ + g(x̃_k) − g(u)) ≤ Σ_{j=1}^k a_j(⟨F(x_j), x_j − u⟩ + g(x_j) − g(u))
  ≤ (1/2)‖u − x_0‖² − ((1 + γA_k)/4)‖u − x_k‖² ≤ (1/2)‖u − x_0‖²,

where the last two inequalities are by Eq. (14) and ((1 + γA_k)/4)‖u − x_k‖² ≥ 0.

For the remaining bound, by the definition of x*, Gap(x̃_k; x*) ≥ 0, and, thus (choosing u = x*),

((1 + γA_k)/4)‖x* − x_k‖² ≤ (1/2)‖x* − x_0‖².

Finally, as Algorithm 1 sets a_j = (1 + γA_{j−1})/(2L), A_j = A_{j−1} + a_j, ∀j ≥ 1, we have A_k ≥ k/(2L) (as γ ≥ 0 and A_0 = 0) and A_k ≥ A_{k−1}(1 + γ/(2L)) ≥ A_1(1 + γ/(2L))^{k−1} = (1/(2L))(1 + γ/(2L))^{k−1}, for all k ≥ 1.

The implications of Theorem 1 for Problems (P_CO) and (P_MM) are summarized in the following two corollaries. Here we only state the bounds for the optimality gap, as the bounds on ‖x_k − x*‖ are immediate from Theorem 1.
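Before moving to the corollaries, the step-size recursion used above is easy to sanity-check numerically. The sketch below (ours, not from the paper) runs a_k = (1 + γA_{k−1})/(2L), A_k = A_{k−1} + a_k and verifies the lower bound on A_k claimed in Theorem 1.

```python
def step_sizes(L, gamma, K):
    """a_k = (1 + gamma*A_{k-1}) / (2L), A_k = A_{k-1} + a_k (as in Algorithm 1)."""
    a, A = [], [0.0]
    for _ in range(K):
        a_k = (1.0 + gamma * A[-1]) / (2.0 * L)
        a.append(a_k)
        A.append(A[-1] + a_k)
    return a, A[1:]

L, gamma, K = 4.0, 0.5, 50
_, A = step_sizes(L, gamma, K)
for k, A_k in enumerate(A, start=1):
    lower = max(k / (2 * L), (1 / (2 * L)) * (1 + gamma / (2 * L)) ** (k - 1))
    assert A_k >= lower - 1e-12          # bound from Theorem 1
print("A_K =", A[-1])
```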
Corollary 1. Consider Problem (P_CO), where the gradient of f is L-Lipschitz in the context of Assumption 1 and g is γ-strongly convex for γ ≥ 0, and let x* ∈ argmin_x f(x) + g(x). If Algorithm 1 is applied to (P_CO) with F = ∇f, then

f(x̃_k) + g(x̃_k) − (f(x*) + g(x*)) ≤ ‖x* − x_0‖²/(2A_k),

where A_k ≥ max{ k/(2L), (1/(2L))(1 + γ/(2L))^{k−1} }.

Proof.
Observe that Theorem 1 applies, as F, g satisfy Assumption 1. By Jensen's inequality and the convexity of f,

f(x̃_k) + g(x̃_k) − (f(x*) + g(x*)) ≤ (1/A_k) Σ_{j=1}^k a_j( f(x_j) + g(x_j) − f(x*) − g(x*) )
  ≤ (1/A_k) Σ_{j=1}^k a_j( ⟨∇f(x_j), x_j − x*⟩ + g(x_j) − g(x*) ).

As ∇f(x_j) = F(x_j), it remains to apply Eq. (11) with u = x*.

Corollary 2. Consider Problem (P_MM), where φ is convex-concave and its gradient is L-Lipschitz in the context of Assumption 1, and g_1, g_2 are γ-strongly convex for some γ ≥ 0. If Algorithm 1 is applied to (P_MM) with x = [x^1; x^2], F(x) = [∇_{x^1}φ(x^1, x^2); −∇_{x^2}φ(x^1, x^2)], and g(x) = g_1(x^1) + g_2(x^2), then

max_{y^2 ∈ R^{d_2}} Φ(x̃_k^1, y^2) − min_{y^1 ∈ R^{d_1}} Φ(y^1, x̃_k^2) ≤ (D_1² + D_2²)/(2A_k),

where D_1 = sup_{x^1, y^1 ∈ dom(g_1)} ‖x^1 − y^1‖, D_2 = sup_{x^2, y^2 ∈ dom(g_2)} ‖x^2 − y^2‖, and A_k ≥ max{ k/(2L), (1/(2L))(1 + γ/(2L))^{k−1} }.

Proof.
Same as in the previous corollary, we apply Theorem 1 and use Eq. (11) to bound the gap. Observe that, by the definition of Φ, max_{y^2 ∈ R^{d_2}} Φ(x̃_k^1, y^2) = max_{y^2 ∈ dom(g_2)} Φ(x̃_k^1, y^2) and min_{y^1 ∈ R^{d_1}} Φ(y^1, x̃_k^2) = min_{y^1 ∈ dom(g_1)} Φ(y^1, x̃_k^2). Fix arbitrary y^1 ∈ dom(g_1), y^2 ∈ dom(g_2). Using Jensen's inequality and the fact that φ is concave in its second argument, we have

Φ(x̃_k^1, y^2) ≤ (1/A_k) Σ_{j=1}^k a_j Φ(x_j^1, y^2)
  = (1/A_k) Σ_{j=1}^k a_j( φ(x_j^1, y^2) + g_1(x_j^1) − g_2(y^2) )
  ≤ (1/A_k) Σ_{j=1}^k a_j( φ(x_j^1, x_j^2) + ⟨∇_{x^2}φ(x_j^1, x_j^2), y^2 − x_j^2⟩ + g_1(x_j^1) − g_2(y^2) ).

By the same token,

Φ(y^1, x̃_k^2) ≥ (1/A_k) Σ_{j=1}^k a_j( φ(x_j^1, x_j^2) + ⟨∇_{x^1}φ(x_j^1, x_j^2), y^1 − x_j^1⟩ + g_1(y^1) − g_2(x_j^2) ).

Combining the bounds on Φ(x̃_k^1, y^2), Φ(y^1, x̃_k^2) and using the definitions of F and g, we have

Φ(x̃_k^1, y^2) − Φ(y^1, x̃_k^2) ≤ (1/A_k) Σ_{j=1}^k a_j( ⟨F(x_j), x_j − y⟩ + g(x_j) − g(y) ),

where y = [y^1; y^2]. It remains to apply Eq. (11) and take the supremum of both sides over y^1 ∈ dom(g_1), y^2 ∈ dom(g_2).

Remark 1.
It is straightforward to extend the result from Theorem 1 to the setting where the coordinate blocks are randomly permuted. This follows simply as, for the proof of the theorem to go through, the ordering of the blocks is irrelevant. The only thing that would change is that the Lipschitz constant from Eq. (13) would depend on the random ordering of the blocks. Thus, to obtain a bound that holds in expectation (with expected L w.r.t. the random permutations of the blocks), it suffices to take the expectation w.r.t. the random permutation of coordinate blocks when applying Eq. (13) and write the remaining inequalities in expectation w.r.t. all random permutations.
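Remark 1 relies on the fact that only the value of L changes under a permutation of the blocks. The sketch below (ours, not from the paper) makes this concrete: it recomputes L = ‖Σ_i Q̂_i‖^{1/2} for Lasso-type matrices Q_i under several random orderings of the coordinates, illustrating how L depends on the cyclic order.

```python
import numpy as np

def coder_constant(Qs, order):
    """L = ||sum_i Qhat_i||^{1/2} when the (size-1) blocks are updated in `order`."""
    d = Qs[0].shape[0]
    P = np.eye(d)[list(order)]            # relabel coordinates so order[0] comes first
    S = np.zeros((d, d))
    for i, j in enumerate(order):
        Q = P @ Qs[j] @ P.T               # Q_j expressed in the permuted coordinates
        Q[:i, :] = 0.0                    # zero rows/columns of already-updated blocks
        Q[:, :i] = 0.0
        S += Q
    return np.sqrt(np.linalg.norm(S, ord=2))

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
G = A.T @ A
Qs = [np.outer(G[i], G[i]) for i in range(10)]    # Lasso-type Q_i
for _ in range(3):
    order = rng.permutation(10)
    print(coder_constant(Qs, order))
```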
Remark 2. It is natural to ask whether the extrapolation step in CODER is really needed or not. Let us refer to the cyclic and randomized coordinate method variants without the extrapolation step (i.e., with q_k^i = p_k^i in Step 8 of CODER) as the proximal CCM (PCCM) and the proximal RCM
CODER, as stated in Algorithm 1, requires knowledge of theLipschitz parameter L and γ . This may seem like a limitation of our approach, especially since theLipschitzness of F assumed in our work is much different than the traditional Lipschitz assumptionsfor either the full gradient or its (block) coordinate components. However, as we now argue, formost cases of interest this is not a concern. In particular, the strong convexity of g typically comesfrom regularization, which is a design choice and as such is typically known. On the other hand, itturns out that for our approach to work, the knowledge of the Lipschitz parameter L is not requiredat all, as this parameter can be estimated adaptively using the standard doubling trick (Nesterov,2015). This can be concluded from the fact that the only place in the analysis where the Lipschitzconstant of F is used is in Eq. (13), which allows a simple verification and update to L wheneverthe stated inequality is not satisfied. Algorithm 2
Algorithm 2 Cyclic cOordinate Dual avEraging with extRapolation (CODER) with adaptive L
  Input: x_0 ∈ dom(g), γ ≥ 0, L_0 > 0, m, {S_1, . . . , S_m}.
  Initialization: x_{−1} = x_0, p_0 = F(x_0), a_0 = A_0 = 0, ψ_0^i(x^i) = (1/2)‖x^i − x_0^i‖², 1 ≤ i ≤ m.
  for k = 1 to K do
    L_k = L_{k−1}/2
    repeat
      L_k = 2L_k.
      a_k = (1 + γA_{k−1})/(2L_k), A_k = A_{k−1} + a_k.
      for i = 1 to m do
        p_k^i = F_i(x_k^1, . . . , x_k^{i−1}, x_{k−1}^i, . . . , x_{k−1}^m).
        q_k^i = p_k^i + (a_{k−1}/a_k)(F_i(x_{k−1}) − p_{k−1}^i).
        x_k^i = argmin_{x^i ∈ R^{d_i}} { ψ_k^i(x^i) := ψ_{k−1}^i(x^i) + a_k(⟨q_k^i, x^i − u^i⟩ + g_i(x^i) − g_i(u^i)) }.
      end for
    until ‖F(x_k) − p_k‖ ≤ L_k‖x_k − x_{k−1}‖
  end for
  return x_K, x̃_K = (1/A_K) Σ_{k=1}^K a_k x_k.

By the argument used in the proof of Theorem 1 and the Lipschitz assumption on F (Assumption 1), this condition must be satisfied for any L ≥ √(‖Σ_i Q̂_i‖). Thus, a natural approach is to start with some initial estimate L_0 > 0 and double it each time the condition from Eq. (13) fails. The total number of times that this estimate can be doubled is then bounded by log_2(L/L_0), and, under a mild assumption that L_0 = O(L) and L_0 is not overwhelmingly (e.g., exponentially in 1/ε or n) smaller than L, the total overhead due to estimating L is absorbed by the convergence bound from Theorem 1. While the doubling trick is standard, the fact that it can be applied at all in the setting of coordinate methods is surprising, as this is generally not possible for RCM, due to the randomized nature of the algorithm. The variant of CODER that implements this doubling trick is summarized in Algorithm 2.

[Figure 1: Performance comparison of CODER and proximal variants of RCM and CCM on the ℓ_1-norm regularized SVM problem, for three values of the regularization parameter λ, on the a9a (top row) and MNIST (bottom row) datasets. Only CODER has provable theoretical guarantees on this problem instance. Empirically, both CODER and PCCM outperform PRCM, while CODER is generally the fastest of the three algorithms.]

If one were to use a different permutation of the blocks in each iteration (i.e., for each full pass over all the blocks), the doubling trick would not necessarily be the best choice, as in the worst case we would be estimating the largest L over all the permutations, not the average one. Of course, one could implement a bisection search for L in each iteration, but that would make the added logarithmic cost multiplicative instead of additive. Extending CODER to a parameter-free setting where a local Lipschitz constant can be used without a bisection search (as was done in, e.g., Malitsky (2019) for the full-vector update setting of variational inequalities) is an interesting direction for future research.

4 Numerical Experiments

We evaluate the performance of cyclic and randomized coordinate methods on the nonsmooth convex ℓ_1-norm-regularized SVM problem, as described in Example 3. As shown in Example 3, the min-max reformulation of this problem is an instance of the generalized variational inequality problem (P).

For the considered min-max problem, we compare CODER to PCCM (Chow et al., 2017) and PRCM. Both CODER and PCCM permute the coordinates once before each iteration and then perform cyclic coordinate updates under the fixed order after permutation. The difference between CODER and PCCM is that PCCM does not use the extrapolation step (or, equivalently, PCCM is a variant of CODER obtained by setting q_k^i = p_k^i in Step 8 of Algorithm 1).
PRCM chooses each coordinate uniformly at random and then performs the same dual averaging-style coordinate update as CODER and PCCM. All the compared algorithms pick one coordinate per iteration. We test all three algorithms on two large-scale datasets, a9a and MNIST, from the LIBSVM library (Chang and Lin, 2011). For simplicity, we normalize each data sample to unit Euclidean norm. For MNIST, we reassign the label of each sample as 1 if the digit belongs to a fixed subset of the classes and as −1 otherwise.

In the experiments, we vary the ℓ_1-norm regularization parameter λ over three values of the form 10^{−k}. For all the settings, we tune the Lipschitz constant L over a geometrically spaced grid of values proportional to 1/n (where n is the number of samples), indexed by k ∈ {0, 1, 2, . . .}, and return the iterate average (as it has better performance than the last iterate) for all three algorithms; for the smallest values of k in this grid, all the algorithms diverge. As is standard for ERM, we plot the function value gap of the primal problem in terms of the number of passes over the dataset. (The optimal values were evaluated by solving (P) to high accuracy.)

As shown in Figure 1, both CODER and PCCM perform better than PRCM in all the cases, which verifies the effectiveness of cyclic coordinate updates. Further, CODER is generally faster than PCCM. Finally, as discussed in Remark 2, only CODER has theoretical convergence guarantees for general convex-concave min-max problems.

5 Conclusion

We presented a novel extrapolated cyclic coordinate method, CODER, which provably converges on the class of generalized variational inequalities, which includes convex composite optimization and convex-concave min-max optimization. CODER is the first cyclic coordinate method that provably converges on this broad class of problems. Further, even on the restricted class of convex optimization problems, CODER provides improved convergence guarantees, based on a novel Lipschitz condition for the gradient. Some open questions that merit further investigation remain. For example, it is an intriguing question whether CODER can be accelerated on the class of smooth convex optimization problems. From a different perspective, it would be very interesting to understand the complexity of standard optimization problem classes under our new Lipschitz condition by obtaining new oracle lower bounds.
References
Ahmet Alacaoglu, Quoc Tran Dinh, Olivier Fercoq, and Volkan Cevher. Smooth primal-dual coordinate descent algorithms for nonsmooth convex optimization. In Proc. NIPS'17, 2017.

Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher. Random extrapolation for primal-dual coordinate descent. In Proc. ICML'20, 2020.

Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In Proc. ICML'16, 2016.

Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.

Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem offered to the attendance of the SLDS 2009 conference, 2009. URL http://leon.bottou.org/papers/bottou-slds-open-problem-2009.

Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In Proc. NeurIPS'19, 2019.
Antonin Chambolle, Matthias J. Ehrhardt, Peter Richtárik, and Carola-Bibiane Schönlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783–2808, 2018.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Yat Tin Chow, Tianyu Wu, and Wotao Yin. Cyclic coordinate-update algorithms for fixed-point problems: Analysis and applications. SIAM Journal on Scientific Computing, 39(4):A1280–A1300, 2017.

Cong Dang and Guanghui Lan. Randomized first-order methods for saddle point optimization. arXiv preprint arXiv:1409.8625, 2014.

Jelena Diakonikolas and Lorenzo Orecchia. Alternating randomized block coordinate descent. In Proc. ICML'18, 2018.

Francisco Facchinei and Jong-Shi Pang. Finite-dimensional variational inequalities and complementarity problems. Springer Science & Business Media, 2007.

Olivier Fercoq and Pascal Bianchi. A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM Journal on Optimization, 29(1):100–134, 2019.

Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo A. Parrilo, and N. Denizcan Vanli. When cyclic coordinate descent outperforms randomized coordinate descent. In Proc. NIPS'17, 2017.

Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm for general convex-concave saddle point problems. arXiv preprint arXiv:1803.01401, 2018.

Filip Hanzely and Peter Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In Proc. AISTATS'19, 2019.

Puya Latafat, Nikolaos M. Freris, and Panagiotis Patrinos. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Transactions on Automatic Control, 64(10):4050–4065, 2019.

Ching-Pei Lee and Stephen J. Wright. Random permutations fix a worst case for cyclic coordinate descent. IMA Journal of Numerical Analysis, 39(3):1246–1275, 2019.

Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In Proc. AISTATS'19, 2019.

Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015.

Ji Liu, Steve Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. In Proc. ICML'14, 2014.

Yura Malitsky. Golden ratio algorithms for variational inequalities. Mathematical Programming, pages 1–28, 2019.

Rahul Mazumder, Jerome H. Friedman, and Trevor Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.

Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Yurii Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.

Yurii Nesterov and Sebastian U. Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proc. ICML'15, 2015.

Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, pages 1–35, 2019.

Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.

Ankan Saha and Ambuj Tewari. On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.

Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Proc. NIPS'16, 2016.

Hao-Jun Michael Shi, Shenyinying Tu, Yangyang Xu, and Wotao Yin. A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040, 2016.

Ruoyu Sun and Yinyu Ye. Worst-case complexity of cyclic coordinate descent: O(n²) gap with randomized version. Mathematical Programming, pages 1–34, 2019.

Conghui Tan, Tong Zhang, Shiqian Ma, and Ji Liu. Stochastic primal-dual method for empirical risk minimization with O(1) per-iteration complexity. In Proc. NeurIPS'18, 2018.

Stephen Wright and Ching-pei Lee. Analyzing random permutations for cyclic coordinate descent. Mathematics of Computation, 89(325):2217–2248, 2020.

Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Tong Tong Wu, Kenneth Lange, et al. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.

Yuchen Zhang and Xiao Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proc. ICML'15, 2015.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.