Complementary Composite Minimization, Small Gradients in General Norms, and Applications to Regression Problems
Jelena Diakonikolas * University of Wisconsin-Madison [email protected]
Cristóbal Guzmán † Pontificia Universidad Católica de Chile [email protected]
Abstract
Composite minimization is a powerful framework in large-scale convex optimization, based on decoupling of the objective function into terms with structurally different properties and allowing for more flexible algorithmic design. In this work, we introduce a new algorithmic framework for complementary composite minimization, where the objective function decouples into a (weakly) smooth and a uniformly convex term. This particular form of decoupling is pervasive in statistics and machine learning, due to its link to regularization.

The main contributions of our work are summarized as follows. First, we introduce the problem of complementary composite minimization in general normed spaces; second, we provide a unified accelerated algorithmic framework to address broad classes of complementary composite minimization problems; and third, we prove that the algorithms resulting from our framework are near-optimal in most of the standard optimization settings. Additionally, we show that our algorithmic framework can be used to address the problem of making the gradients small in general normed spaces. As a concrete example, we obtain a nearly-optimal method for the standard $\ell_1$ setup (small gradients in the $\ell_\infty$ norm), essentially matching the bound of Nesterov [2012] that was previously known only for the Euclidean setup. Finally, we show that our composite methods are broadly applicable to a number of regression problems, leading to complexity bounds that are either new or match the best existing ones.

1 Introduction

No function can simultaneously be both smooth and strongly convex with respect to an $\ell_p$ norm and have a dimension-independent condition number, unless $p = 2$. This is a basic fact from convex analysis and the primary reason why, in the existing literature, smooth and strongly convex optimization is normally considered only for Euclidean (or, slightly more generally, Hilbert) spaces.
In fact, it is not only that, moving away from $p = 2$, the condition number becomes dimension-dependent, but that the dependence on the dimension is polynomial for all examples of functions we know of, unless $p$ is trivially close to two. Thus, it is tempting to assert that dimension-independent linear convergence (i.e., with logarithmic dependence on the inverse accuracy $1/\epsilon$) is reserved for Euclidean spaces, which has long been common wisdom among optimization researchers.

Contrary to this wisdom, we show that it is in fact possible to attain linear convergence even in $\ell_p$ (or, more generally, in normed vector) spaces, as long as the objective function can be decomposed into two functions with complementary properties. In particular, we show that if the objective function can be written in the following complementary composite form
$$\bar{f}(x) = f(x) + \psi(x), \qquad (1)$$

* Supported by the NSF grant CCF-2007757 and by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation.
† Partially supported by INRIA through the INRIA Associate Teams project, CORFO through the Clover 2030 Engineering Strategy - 14ENI-26862, and ANID – Millennium Science Initiative Program – NCN17 059.
More generally, it is known that the existence of a continuous uniformly convex function with growth bounded by the squared norm implies that the space has an equivalent 2-uniformly convex norm [Borwein et al., 2009]; furthermore, using duality [Zalinescu, 1983], we conclude that the existence of a smooth and strongly convex function implies that the space has equivalent 2-uniformly convex and 2-uniformly smooth norms, a rare property for a normed space (the most notable examples of spaces that are simultaneously 2-uniformly convex and 2-uniformly smooth are Hilbert spaces; see, e.g., Ball et al. [1994] for related definitions and more details).

where $f$ is convex and $L$-smooth w.r.t.
a (not necessarily Euclidean) norm $\|\cdot\|$ and $\psi$ is $m$-strongly convex w.r.t. the same norm and "simple," meaning that optimization problems of the form
$$\min_{x} \; \langle z, x \rangle + \psi(x) \qquad (2)$$
can be solved efficiently for any linear functional $z$, then $\bar{f}(x)$ can be minimized to accuracy $\epsilon > 0$ in $O\big(\sqrt{L/m}\,\log(L\phi(\bar{x}^*)/\epsilon)\big)$ iterations, where $\bar{x}^* = \operatorname{argmin}_{x} \bar{f}(x)$. As in other standard first-order iterative methods, each iteration requires one call to the gradient oracle of $f$ and one call to a solver for the problem from Eq. (2). To the best of our knowledge, such a result was previously known only for Euclidean spaces [Nesterov, 2013].

This is the basic variant of our result. We also consider more general setups in which $f$ is only weakly smooth (with Hölder-continuous gradients) and $\psi$ is uniformly convex (see Section 1.2 for specific definitions and useful properties). We refer to the resulting objective functions $\bar{f}$ as complementary composite objective functions (as the functions $f$ and $\psi$ that constitute $\bar{f}$ have complementary properties) and to the resulting optimization problems as complementary composite optimization problems. The algorithmic framework that we consider for complementary composite optimization in Section 2 is near-optimal (optimal up to logarithmic or poly-logarithmic factors) in terms of iteration complexity in most of the standard optimization settings, which we certify by providing near-matching oracle complexity lower bounds in Section 4. We now summarize some further implications of our results.
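The dimension dependence of the condition number for $p \neq 2$ is easy to observe numerically. The following sketch (our own illustration; the choices of test point and direction are ours) estimates the strong convexity modulus of $\psi(x) = \frac{1}{2}\|x\|_p^2$ w.r.t. $\|\cdot\|_p$ along a coordinate direction at the all-ones point, which for $p > 2$ decays polynomially with the dimension $d$:

```python
import numpy as np

def sc_modulus_estimate(p, d, t=1e-4):
    """Estimate the strong convexity modulus of psi(x) = 0.5*||x||_p^2
    (w.r.t. ||.||_p) along the first coordinate at the all-ones point,
    via a symmetric second difference."""
    psi = lambda x: 0.5 * np.linalg.norm(x, ord=p) ** 2
    x = np.ones(d)
    e1 = np.zeros(d)
    e1[0] = 1.0
    step_norm = np.linalg.norm(t * e1, ord=p)  # equals t here
    return (psi(x + t * e1) + psi(x - t * e1) - 2 * psi(x)) / step_norm ** 2

# For p = 4 the estimated modulus decays roughly like (p - 1) / d^(1 - 2/p),
# so the "condition number" of any smooth-and-strongly-convex objective
# built from this regularizer necessarily grows with d.
for d in (10, 1000, 10000):
    print(d, sc_modulus_estimate(4, d))
```

For $p \in (1, 2]$, by contrast, the modulus is bounded below by $p - 1$ independently of the dimension, which is exactly what our complementary composite framework exploits.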
Small gradients in $\ell_p$ and $S_p$ norms. The original motivation for complementary composite optimization in our work comes from making the gradients of smooth functions small in non-Euclidean norms. This is a fundamental optimization question, whose study was initiated by Nesterov [2012] and that is still far from being well understood. Prior to this work, (near-)optimal algorithms were known only for the Euclidean ($\ell_2$) and $\ell_\infty$ setups.

For the Euclidean setup, there are two main results: due to Kim and Fessler [2020] and due to Nesterov [2012]. The algorithm of Kim and Fessler [2020] is iteration-complexity-optimal; however, the methodology by which this algorithm was obtained is crucially Euclidean, as it relies on numerical solutions to semidefinite programs, whose formulation is made possible by assuming that the norm of the space is inner-product-induced. An alternative approach, due to Nesterov [2012], is to apply the fast gradient method to a regularized function $\bar{f}(x) = f(x) + \frac{\lambda}{2}\|x - x_0\|_2^2$ for a sufficiently small $\lambda > 0$, where $f$ is the smooth function whose gradient we hope to minimize. Under the appropriate choice of $\lambda > 0$, the resulting algorithm is near-optimal (optimal up to a logarithmic factor).

As discussed earlier, applying the fast gradient method directly to a regularized function as in Nesterov [2012] is out of the question for $p \neq 2$, as the resulting regularized objective function cannot simultaneously be smooth and strongly convex w.r.t. $\|\cdot\|_p$ without its condition number growing with the problem dimension. This is where the framework of complementary composite optimization proposed in our work comes into play. Our result also generalizes to normed matrix spaces endowed with $S_p$ (Schatten-$p$) norms.
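In the Euclidean case, the regularization trick of Nesterov [2012] is easy to illustrate numerically: minimizing the regularized objective to high accuracy forces the gradient of $f$ itself to be small, at a cost controlled by $\lambda$. The sketch below is our own illustration under an assumed quadratic $f$; an exact linear solve stands in for the fast gradient method:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
x0 = np.zeros(10)

def grad_f(x):
    # f(x) = 0.5 * ||A x - b||_2^2, a smooth convex test function
    return A.T @ (A @ x - b)

# Regularized objective: f(x) + (lam/2) * ||x - x0||_2^2.
# Its exact minimizer satisfies (A^T A + lam * I) x = A^T b + lam * x0.
lam = 1e-3
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ b + lam * x0)

# First-order optimality of the regularized problem gives
# grad_f(x_reg) = -lam * (x_reg - x0), so ||grad_f(x_reg)|| <= lam * ||x_reg - x0||.
print(np.linalg.norm(grad_f(x_reg)), lam * np.linalg.norm(x_reg - x0))
```

The smaller $\lambda$ is, the smaller the achievable gradient norm, at the price of a worse condition number for the regularized problem; this trade-off is precisely what breaks down for $\|\cdot\|_p$ with $p \neq 2$.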
As a concrete example, our approach leads to near-optimal complexity results in the $\ell_1$ and $S_1$ setups, where the gradient is minimized in the $\ell_\infty$, respectively $S_\infty$, norm. It is important to note here why strongly convex regularizers are not sufficient in general and what motivated us to consider the more general uniformly convex functions $\psi$. While for $p \in (1, 2]$ choosing $\psi(x) = \frac{1}{2}\|x\|_p^2$ (which is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_p$; see Nemirovskii and Yudin [1983], Juditsky and Nemirovski [2008]) is sufficient, when $p > 2$ the strong convexity parameter of $\frac{1}{2}\|\cdot\|_p^2$ w.r.t. $\|\cdot\|_p$ is bounded above by $1/d^{1 - 2/p}$. This is not only true for $\frac{1}{2}\|\cdot\|_p^2$, but for any convex function bounded above by a constant on a unit $\ell_p$-ball; see, e.g., [d'Aspremont et al., 2018, Example 5.1]. Thus, in this case, we work with $\psi(x) = \frac{1}{p}\|x\|_p^p$, which is only uniformly convex.

Lower complexity bounds.
We complement the development of algorithms for complementary composite minimization and minimizing the norm of the gradient with lower bounds for the oracle complexity of these problems. Our lower bounds leverage recent lower bounds for weakly smooth convex optimization from Guzmán and Nemirovski [2015], Diakonikolas and Guzmán [2020]. These existing results suffice for proving lower bounds for minimizing the norm of the gradient, and certify the near-optimality of our approach for the smooth (i.e., with Lipschitz-continuous gradient) setting, for the appropriate range of $p$. On the other hand, proving lower bounds for complementary composite optimization requires the design of an appropriate oracle model; namely, one that takes into account that our algorithm accesses the gradient oracle of $f$ and solves subproblems of type (2) w.r.t. $\psi$. With this model in place, we combine constructions

In the $\ell_\infty$ setup, a non-Euclidean variant of gradient descent is optimal in terms of iteration complexity.
The $S_p$ norm of a matrix $A$ is defined as the $\ell_p$ norm of $A$'s singular values.

Applications to regression problems.
The importance of complementary composite optimization and making the gradients small in $\ell_p$ and $S_p$ norms is perhaps best exhibited by considering some of the classical regression problems that are frequently used in statistics and machine learning. It turns out that considering these regression problems in the appropriate complementary composite form not only leads to faster algorithms in general, but also reveals some interesting properties of the solutions. For example, applying our framework to the complementary composite form of bridge regression (a generalization of lasso and ridge regression; see Section 5) leads to an interesting and well-characterized trade-off between the "goodness of fit" of the model and the $\ell_p$ norm of the regressor.

Section 5 highlights a number of interesting regression problems that can be addressed using our framework, including lasso, elastic net, (b)ridge regression, Dantzig selector, $\ell_p$ regression (with standard and correlated errors), and related spectral variants. It is important to note that a single algorithmic framework suffices for addressing all of these problems. Most of the results we obtain in this way are either conjectured or known to be unimprovable.

1.1 Related Work

Nonsmooth convex optimization problems with the composite structure of the objective function $\bar{f}(x) = f(x) + \psi(x)$, where $f$ is smooth and convex, but $\psi$ is nonsmooth, convex, and "simple," are well-studied in the optimization literature [Beck and Teboulle, 2009, Nesterov, 2013, Scheinberg et al., 2014, He et al., 2015, Gasnikov and Nesterov, 2018, and references therein]. The main benefit of exploiting the composite structure lies in the ability to recover accelerated rates for nonsmooth problems.
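As a concrete instance of this standard (non-complementary) composite setting, the following sketch applies a FISTA-style accelerated proximal gradient method to $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ plus $\psi(x) = \lambda\|x\|_1$; the problem data and stopping choices are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista(A, b, lam, iters=5000):
    """Accelerated proximal gradient (FISTA-style) for
    0.5*||Ax - b||_2^2 + lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)      # smoothness constant of f
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_next = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 8))
b = rng.standard_normal(40)
x_hat = fista(A, b, lam=0.5)

# Fixed-point (optimality) residual of the proximal-gradient map:
L = np.linalg.norm(A.T @ A, 2)
res = np.linalg.norm(x_hat - soft_threshold(x_hat - A.T @ (A @ x_hat - b) / L, 0.5 / L))
print(res)
```

Here $\psi$ is nonsmooth but prox-friendly; the complementary composite setting studied in this paper reverses the roles, with $\psi$ uniformly convex rather than merely simple.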
One of the most celebrated results in this domain is the FISTA algorithm of Beck and Teboulle [2009], together with a method based on composite gradient mapping due to Nesterov [2013], which demonstrated that accelerated convergence (with rate $1/k^2$) is possible for this class of problems.

By comparison, the literature on complementary composite minimization is scarce. For example, Nesterov [2013] proved that in a Euclidean space complementary composite optimization attains a linear convergence rate. The algorithm proposed there is different from ours, as it relies on the use of composite gradient mapping, for which the proximal operator of $\psi$ (the solution to problems of the form $\min_{x}\{\psi(x) + \frac{1}{2}\|x - z\|_2^2\}$ for all $z$; compare to Eq. (2)) is assumed to be efficiently computable. In addition to being primarily applicable to Euclidean spaces, this assumption further restricts the class of functions that can be efficiently optimized compared to our approach (see Section 2.2 for a further discussion). Another composite algorithm for which linear convergence has been proved is the celebrated method of Chambolle and Pock [2011], where proximal steps are taken w.r.t. both terms in the composite model ($f$ and $\psi$). In the case where both $f$ and $\psi$ are strongly convex, a linear convergence rate can be established. Notice that this assumption is quite different from our setting, and that this method was only investigated in the Euclidean setup.

Beyond the realm of Euclidean norms, linear convergence results have been established for functions that are relatively smooth and relatively strongly convex [Bauschke et al., 2017, 2019, Lu et al., 2018]. The class of complementary composite functions does not fall into this category.
Further, while we show accelerated rates (with square-root dependence on the appropriate notion of the condition number) for complementary composite optimization, such results are likely not attainable for relatively smooth and relatively strongly convex optimization [Dragomir et al., 2019].

The problem of minimizing the norm of the gradient has become a central question in optimization and its applications in machine learning, mainly motivated by nonconvex settings, where the norm of the gradient is useful as a stopping criterion. However, the norm of the gradient is also useful in linearly constrained convex optimization problems, where the norm of the gradient of a Fenchel dual is useful in controlling the feasibility violation in the primal [Nesterov, 2012]. Our approach for minimizing the norm of the gradient is inspired by the regularization approach proposed by Nesterov [2012]. As discussed earlier, this regularization approach is not directly applicable to non-Euclidean settings, and this is where our complementary composite framework becomes crucial.

Finally, our work is inspired by and uses fundamental results about the geometry of high-dimensional normed spaces; in particular, the fact that for $\ell_p$ and $S_p$ spaces the optimal constants of uniform convexity are known [Ball et al., 1994].

Lower bounds from [Dragomir et al., 2019] show the impossibility of acceleration for the relatively smooth setting. This is strong evidence of the impossibility of acceleration in the relatively smooth and relatively strongly convex setting.
1.2 Notation and Preliminaries

Throughout the paper, we use boldface letters to denote vectors and italic letters to denote scalars. We consider real finite-dimensional normed vector spaces $E$, endowed with a norm $\|\cdot\|$ and denoted by $(E, \|\cdot\|)$. The space dual to $(E, \|\cdot\|)$ is denoted by $(E^*, \|\cdot\|_*)$, where $\|\cdot\|_*$ is the norm dual to $\|\cdot\|$, defined in the usual way by $\|z\|_* = \sup_{x \in E:\, \|x\| \leq 1} \langle z, x \rangle$, where $\langle z, x \rangle$ denotes the evaluation of a linear functional $z$ at a point $x \in E$. As a concrete example, we may consider the $\ell_p$ space $(\mathbb{R}^d, \|\cdot\|_p)$, where $\|x\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$, $1 \leq p \leq \infty$. The space dual to $(\mathbb{R}^d, \|\cdot\|_p)$ is isometrically isomorphic to the space $(\mathbb{R}^d, \|\cdot\|_{p^*})$, where $\frac{1}{p} + \frac{1}{p^*} = 1$. Throughout, given $1 \leq p \leq \infty$, we will call $p^* = \frac{p}{p-1}$ the conjugate exponent of $p$ (notice that $1 \leq p^* \leq \infty$ and $\frac{1}{p} + \frac{1}{p^*} = 1$). The (closed) $\|\cdot\|$-norm ball centered at $x$ with radius $R > 0$ will be denoted by $B_{\|\cdot\|}(x, R)$.

We start by recalling some standard definitions from convex analysis.

Definition 1.1.
A function $f: E \to \mathbb{R}$ is said to be $(L, \kappa)$-weakly smooth w.r.t. a norm $\|\cdot\|$, where $L > 0$ and $\kappa \in (1, 2]$, if its gradients are $(L, \kappa - 1)$-Hölder continuous, i.e., if
$$(\forall x, y \in E): \quad \|\nabla f(x) - \nabla f(y)\|_* \leq L \|x - y\|^{\kappa - 1}.$$
We denote the class of $(L, \kappa)$-weakly smooth functions w.r.t. $\|\cdot\|$ by $\mathcal{F}_{\|\cdot\|}(L, \kappa)$.

Note that when $\kappa = 1$, the function may not be differentiable. Since we will only be working with functions that are proper, convex, and lower semicontinuous, we will still have that $f$ is subdifferentiable on the interior of its domain [Rockafellar, 1970, Theorem 23.4]. The definition of $(L, \kappa)$-weakly smooth functions then boils down to the bounded variation of the subgradients.

Definition 1.2.
A function $\psi: E \to \mathbb{R}$ is said to be $q$-uniformly convex w.r.t. a norm $\|\cdot\|$ with constant $\lambda$ (and we refer to such functions as $(\lambda, q)$-uniformly convex), where $\lambda \geq 0$ and $q \geq 2$, if for all $\alpha \in (0, 1)$:
$$(\forall x, y \in E): \quad \psi((1 - \alpha) x + \alpha y) \leq (1 - \alpha) \psi(x) + \alpha \psi(y) - \frac{\lambda}{q} \alpha (1 - \alpha) \|y - x\|^q.$$
We denote the class of $(\lambda, q)$-uniformly convex functions w.r.t. $\|\cdot\|$ by $\mathcal{U}_{\|\cdot\|}(\lambda, q)$.

With a slight abuse of notation, we will often use $\nabla \psi(x)$ to denote an arbitrary but fixed element of $\partial \psi(x)$. We also make a mild assumption that the subgradient oracle of $\psi$ is consistent, i.e., that it returns the same element of $\partial \psi(x)$ whenever queried at the same point $x$.

Observe that when $\lambda = 0$, uniform convexity reduces to standard convexity, while for $\lambda > 0$ and $q = 2$ we recover the definition of strong convexity. We will only be considering functions that are lower semicontinuous, convex, and proper. These properties suffice for a function to be subdifferentiable on the interior of its domain. It is then not hard to show that if $\psi$ is $(\lambda, q)$-uniformly convex w.r.t. a norm $\|\cdot\|$ and $g_x \in \partial \psi(x)$ is its subgradient at a point $x$, then
$$(\forall y \in E): \quad \psi(y) \geq \psi(x) + \langle g_x, y - x \rangle + \frac{\lambda}{q} \|y - x\|^q. \qquad (3)$$

Definition 1.3.
Let $\psi: E \to \mathbb{R} \cup \{+\infty\}$. The convex conjugate of $\psi$, denoted by $\psi^*$, is defined by
$$(\forall z \in E^*): \quad \psi^*(z) = \sup_{x \in E} \{\langle z, x \rangle - \psi(x)\}.$$

Recall that the convex conjugate of any function is convex. Some simple examples of conjugate pairs of functions that will be useful for our analysis are: (i) the univariate functions $\frac{1}{p}|\cdot|^p$ and $\frac{1}{p^*}|\cdot|^{p^*}$, where $1 < p < \infty$ (see, e.g., Borwein and Zhu [2004, Exercise 4.4.2]), and (ii) the functions $\frac{1}{2}\|\cdot\|^2$ and $\frac{1}{2}\|\cdot\|_*^2$, where the norms $\|\cdot\|$ and $\|\cdot\|_*$ are dual to each other (see, e.g., Boyd and Vandenberghe [2004, Example 3.27]). The latter example can be easily adapted to prove that the functions $\frac{1}{p}\|\cdot\|^p$ and $\frac{1}{p^*}\|\cdot\|_*^{p^*}$ are conjugates of each other, for $1 < p < \infty$.

The following auxiliary facts will be useful for our analysis.

Fact 1.4. Let $\psi: E \to \mathbb{R} \cup \{+\infty\}$ be proper, convex, and lower semicontinuous, and let $\psi^*$ be its convex conjugate. Then $\psi^*$ is proper, convex, and lower semicontinuous (and thus subdifferentiable on the interior of its domain) and
$$\forall z \in \operatorname{int} \operatorname{dom}(\psi^*): \quad g \in \partial \psi^*(z) \;\text{ if and only if }\; g \in \operatorname{argsup}_{x \in \mathbb{R}^d} \{\langle z, x \rangle - \psi(x)\}.$$

The following proposition will be repeatedly used in our analysis, and we prove it here for completeness.
Proposition 1.5.
Let $(E, \|\cdot\|)$ be a normed space with $\|\cdot\|^2: E \to \mathbb{R}$ differentiable, and let $1 < q < \infty$. Then
$$\Big\| \nabla \Big( \frac{1}{q} \|x\|^q \Big) \Big\|_* = \|x\|^{q-1} = \|x\|^{q/q^*},$$
where $q^* = \frac{q}{q-1}$ is the exponent conjugate to $q$.

Proof. We notice that $\|\cdot\|^2$ is differentiable if and only if $\|\cdot\|^q$ is differentiable [Zalinescu, 2002, Thm. 3.7.2]. Since the statement clearly holds for $x = 0$, in the following we assume that $x \neq 0$. Next, write $\frac{1}{q}\|\cdot\|^q$ as a composition of the functions $\frac{1}{q}|\cdot|^{q/2}$ and $\|\cdot\|^2$. Applying the chain rule of differentiation, we now have:
$$\nabla \Big( \frac{1}{q} \|x\|^q \Big) = \frac{1}{2} \big( \|x\|^2 \big)^{q/2 - 1} \nabla \big( \|x\|^2 \big) = \|x\|^{q-2} \, \nabla \Big( \frac{1}{2} \|x\|^2 \Big).$$
It remains to argue that $\big\| \nabla \big( \frac{1}{2} \|x\|^2 \big) \big\|_* = \|x\|$. This immediately follows by Fact 1.4, as $\frac{1}{2}\|\cdot\|^2$ and $\frac{1}{2}\|\cdot\|_*^2$ are convex conjugates of each other.

We also state here a lemma that allows approximating weakly smooth functions by weakly smooth functions of a different order. A variant of this lemma (for $p = 2$) first appeared in [Devolder et al., 2014], while the more general version stated here is from d'Aspremont et al. [2018].

Lemma 1.6.
Let $f: E \to \mathbb{R}$ be a function that is $(L, \kappa)$-weakly smooth w.r.t. some norm $\|\cdot\|$. Then for any $\delta > 0$ and
$$M \geq \Big[ \frac{p - \kappa}{p \kappa \delta} \Big]^{\frac{p - \kappa}{\kappa}} L^{\frac{p}{\kappa}} \qquad (4)$$
we have
$$(\forall x, y \in E): \quad f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{M}{p} \|y - x\|^p + \delta.$$

Finally, the following lemma will be useful when bounding the gradient norm in Section 3 (see also [Zalinescu, 2002, Section 3.5]).
Lemma 1.7.
Let $f: E \to \mathbb{R}$ be a function that is convex and $(L, \kappa)$-weakly smooth w.r.t. some norm $\|\cdot\|$. Then:
$$(\forall x, y \in E): \quad \frac{\kappa - 1}{\kappa} L^{-\frac{1}{\kappa - 1}} \|\nabla f(y) - \nabla f(x)\|_*^{\frac{\kappa}{\kappa - 1}} \leq f(y) - f(x) - \langle \nabla f(x), y - x \rangle.$$

Proof. Let $h(x)$ be any $(L, \kappa)$-weakly smooth function and let $x^* \in \operatorname{argmin}_{x \in \mathbb{R}^d} h(x)$. As $h$ is $(L, \kappa)$-weakly smooth, we have, for all $x, y \in \mathbb{R}^d$:
$$h(y) \leq h(x) + \langle \nabla h(x), y - x \rangle + \frac{L}{\kappa} \|y - x\|^{\kappa}.$$
Fixing $x \in \mathbb{R}^d$ and minimizing both sides of the last inequality w.r.t. $y \in \mathbb{R}^d$, it follows that
$$h(x^*) \leq h(x) - \frac{L^{-\kappa^*/\kappa}}{\kappa^*} \|\nabla h(x)\|_*^{\kappa^*}, \qquad (5)$$
where $\kappa^* = \frac{\kappa}{\kappa - 1}$ and we have used that the functions $\frac{L}{\kappa} \|\cdot\|^{\kappa}$ and $\frac{L^{-\kappa^*/\kappa}}{\kappa^*} \|\cdot\|_*^{\kappa^*}$ are convex conjugates of each other.

To complete the proof, it remains to apply Eq. (5) to the function $h_x(y) = f(y) - \langle \nabla f(x), y - x \rangle$ for any fixed $x \in \mathbb{R}^d$, and observe that $h_x(y)$ is convex, $(L, \kappa)$-weakly smooth, and minimized at $y = x$.

2 Complementary Composite Minimization
In this section, we consider minimizing complementary composite functions, which are of the form
$$\bar{f}(x) = f(x) + \psi(x), \qquad (6)$$
where $f$ is $(L, \kappa)$-weakly smooth w.r.t. some norm $\|\cdot\|$, $\kappa \in (1, 2]$, and $\psi$ is $(\lambda, q)$-uniformly convex w.r.t. the same norm, for some $q \geq 2$, $\lambda \geq 0$. We assume that the feasible set $X \subseteq E$ is closed, convex, and nonempty. The algorithmic framework we consider is a generalization of AGD+ from Cohen et al. [2018], stated as follows:
Generalized AGD+
$$\begin{aligned} x_k &= \frac{A_{k-1}}{A_k} y_{k-1} + \frac{a_k}{A_k} v_{k-1}, \\ v_k &= \operatorname{argmin}_{u \in X} \Big\{ \sum_{i=0}^k a_i \langle \nabla f(x_i), u - x_i \rangle + A_k \psi(u) + m \phi(u) \Big\}, \\ y_k &= \frac{A_{k-1}}{A_k} y_{k-1} + \frac{a_k}{A_k} v_k, \qquad y_0 = v_0, \quad x_0 \in X, \end{aligned} \qquad (7)$$
where $m$ and the sequence of positive numbers $\{a_k\}_{k \geq 0}$ are parameters of the algorithm specified in the convergence analysis below, $A_k = \sum_{i=0}^k a_i$, and we take $\phi(u)$ to be a function that satisfies $\phi(u) \geq \frac{1}{q} \|u - x_0\|^q$. For example, if $\lambda > 0$, we can take $\phi(u) = \frac{1}{\lambda} D_\psi(u, x_0)$. When $\lambda = 0$, we take $\phi$ to be $(1, q)$-uniformly convex.

The convergence analysis relies on the approximate duality gap technique (ADGT) of Diakonikolas and Orecchia [2019]. The main idea is to construct an upper estimate $G_k \geq \bar{f}(y_k) - \bar{f}(\bar{x}^*)$ of the true optimality gap, where $\bar{x}^* = \operatorname{argmin}_{x \in X} \bar{f}(x)$, and then argue that $A_k G_k \leq A_{k-1} G_{k-1} + E_k$, which in turn implies:
$$\bar{f}(y_k) - \bar{f}(\bar{x}^*) \leq \frac{A_0 G_0 + \sum_{i=1}^k E_i}{A_k}.$$
That is, as long as $A_0 G_0$ is bounded and the cumulative error $\sum_{i=1}^k E_i$ is either bounded or growing slowly compared to $A_k$, the optimality gap of the sequence $y_k$ converges to zero at rate $\big(A_0 G_0 + \sum_{i=1}^k E_i\big)/A_k$. The goal is, of course, to make $A_k$ as fast-growing as possible, but that turns out to be limited by the requirement that $A_k G_k$ be non-increasing or slowly increasing compared to $A_k$.

The gap $G_k$ is constructed as the difference $U_k - L_k$, where $U_k \geq \bar{f}(y_k)$ is an upper bound on $\bar{f}(y_k)$ and $L_k \leq \bar{f}(\bar{x}^*)$ is a lower bound on $\bar{f}(\bar{x}^*)$. In this particular case, we make the following choices:
$$U_k = f(y_k) + \frac{1}{A_k} \sum_{i=0}^k a_i \psi(v_i).$$
As $y_k = \frac{1}{A_k} \sum_{i=0}^k a_i v_i$, we have, by Jensen's inequality, $U_k \geq f(y_k) + \psi(y_k) = \bar{f}(y_k)$, i.e., $U_k$ is a valid upper bound on $\bar{f}(y_k)$.
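To make the updates in Eq. (7) concrete before continuing with the analysis, the following is a minimal sketch of the Euclidean special case ($X = E = \mathbb{R}^d$, $q = 2$, $\psi(x) = \frac{\lambda}{2}\|x\|_2^2$, $\phi(u) = \frac{1}{2}\|u - x_0\|_2^2$), where the subproblem defining $v_k$ has a closed form; the step-size recursion follows the smooth case ($\kappa = 2$, $M_k = L$, $\delta_k = 0$), and the test function and data are our own illustrative choices:

```python
import numpy as np

def generalized_agd_plus(grad_f, L, lam, x0, iters):
    """Sketch of Eq. (7) for psi(x) = (lam/2)||x||^2 and
    phi(u) = 0.5*||u - x0||^2 on X = R^d (smooth case, M_k = L)."""
    m = L                      # A_0 * M_0 = m with a_0 = A_0 = 1 and M_0 = L
    A = 1.0
    x = x0.copy()
    z = grad_f(x)              # running sum of a_i * grad_f(x_i); a_0 = 1
    # v = argmin_u <z, u> + A*(lam/2)*||u||^2 + (m/2)*||u - x0||^2
    v = (m * x0 - z) / (A * lam + m)
    y = v.copy()
    for _ in range(iters):
        A_prev = A
        c = max(lam * A_prev, m)
        # Largest a with a^2 / A <= c / L, where A = A_prev + a
        a = (c + np.sqrt(c ** 2 + 4 * L * c * A_prev)) / (2 * L)
        A = A_prev + a
        x = (A_prev / A) * y + (a / A) * v
        z = z + a * grad_f(x)
        v = (m * x0 - z) / (A * lam + m)
        y = (A_prev / A) * y + (a / A) * v
    return y

# Illustration on f(x) = 0.5*||Bx - b||^2 with psi(x) = (lam/2)*||x||^2
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 1.0
L = np.linalg.norm(B.T @ B, 2)
y_out = generalized_agd_plus(lambda x: B.T @ (B @ x - b), L, lam, np.zeros(10), 400)
x_star = np.linalg.solve(B.T @ B + lam * np.eye(10), B.T @ b)
print(np.linalg.norm(y_out - x_star))
```

Note that only gradients of $f$ and closed-form solutions of the $v_k$-subproblem are used, matching the oracle model described above.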
For the lower bound, we use the following inequalities:
$$\begin{aligned} \bar{f}(\bar{x}^*) &\geq \frac{1}{A_k} \sum_{i=0}^k a_i f(x_i) + \frac{1}{A_k} \sum_{i=0}^k a_i \langle \nabla f(x_i), \bar{x}^* - x_i \rangle + \psi(\bar{x}^*) + \frac{m}{A_k} \phi(\bar{x}^*) - \frac{m}{A_k} \phi(\bar{x}^*) \\ &\geq \frac{1}{A_k} \sum_{i=0}^k a_i f(x_i) + \frac{1}{A_k} \min_{u \in X} \Big\{ \sum_{i=0}^k a_i \langle \nabla f(x_i), u - x_i \rangle + A_k \psi(u) + m \phi(u) \Big\} - \frac{m}{A_k} \phi(\bar{x}^*) =: L_k, \end{aligned}$$
where the first inequality uses $f(\bar{x}^*) \geq \frac{1}{A_k} \sum_{i=0}^k a_i f(x_i) + \frac{1}{A_k} \sum_{i=0}^k a_i \langle \nabla f(x_i), \bar{x}^* - x_i \rangle$, by convexity of $f$.

We start by bounding the initial (scaled) gap $A_0 G_0$.

Lemma 2.1 (Initial Gap). For any $\delta_0 > 0$ and $M_0 = \big[ \frac{q - \kappa}{q \kappa \delta_0} \big]^{\frac{q - \kappa}{\kappa}} L^{\frac{q}{\kappa}}$, if $A_0 M_0 = m$, then
$$A_0 G_0 \leq m \phi(\bar{x}^*) + A_0 \delta_0.$$

Proof.
By definition, and using that $a_0 = A_0$,
$$\begin{aligned} A_0 G_0 &= A_0 \Big( f(y_0) + \psi(v_0) - f(x_0) - \langle \nabla f(x_0), v_0 - x_0 \rangle - \psi(v_0) - \frac{m}{A_0} \phi(v_0) \Big) + m \phi(\bar{x}^*) \\ &= A_0 \big( f(y_0) - f(x_0) - \langle \nabla f(x_0), y_0 - x_0 \rangle \big) - m \phi(y_0) + m \phi(\bar{x}^*), \end{aligned}$$
where the second line is by $y_0 = v_0$.

By assumption, $\phi(u) \geq \frac{1}{q} \|u - x_0\|^q$ for all $u$ and, in particular, $\phi(y_0) \geq \frac{1}{q} \|y_0 - x_0\|^q$. On the other hand, by $(L, \kappa)$-weak smoothness of $f$ and using Lemma 1.6, we have (with $M_0 = \big[ \frac{q - \kappa}{q \kappa \delta_0} \big]^{\frac{q - \kappa}{\kappa}} L^{\frac{q}{\kappa}}$):
$$f(y_0) - f(x_0) - \langle \nabla f(x_0), y_0 - x_0 \rangle \leq \frac{M_0}{q} \|y_0 - x_0\|^q + \delta_0.$$
Therefore,
$$A_0 G_0 \leq \big( A_0 M_0 - m \big) \frac{1}{q} \|y_0 - x_0\|^q + m \phi(\bar{x}^*) + A_0 \delta_0 = m \phi(\bar{x}^*) + A_0 \delta_0, \qquad (8)$$
as $m = A_0 M_0$.

The next step is to bound $A_k G_k - A_{k-1} G_{k-1}$, as in the following lemma.

Lemma 2.2 (Gap Evolution). Given arbitrary $\delta_k > 0$ and $M_k = \big[ \frac{q - \kappa}{q \kappa \delta_k} \big]^{\frac{q - \kappa}{\kappa}} L^{\frac{q}{\kappa}}$, if $\frac{a_k^q}{A_k^{q-1}} \leq \frac{\max\{\lambda A_{k-1}, \, m\}}{M_k}$, then
$$A_k G_k - A_{k-1} G_{k-1} \leq A_k \delta_k.$$

Proof.
To bound $A_k G_k - A_{k-1} G_{k-1}$, we first bound $A_k U_k - A_{k-1} U_{k-1}$ and $A_k L_k - A_{k-1} L_{k-1}$. By the definition of $U_k$,
$$\begin{aligned} A_k U_k - A_{k-1} U_{k-1} &= A_k f(y_k) - A_{k-1} f(y_{k-1}) + a_k \psi(v_k) \\ &= A_k \big( f(y_k) - f(x_k) \big) + A_{k-1} \big( f(x_k) - f(y_{k-1}) \big) + a_k f(x_k) + a_k \psi(v_k). \end{aligned} \qquad (9)$$
For the lower bound, define the function under the minimum in the definition of $L_k$ as $h_k(u) := \sum_{i=0}^k a_i \langle \nabla f(x_i), u - x_i \rangle + A_k \psi(u) + m \phi(u)$, so that we have:
$$A_k L_k - A_{k-1} L_{k-1} = a_k f(x_k) + h_k(v_k) - h_{k-1}(v_{k-1}). \qquad (10)$$
Observe first that
$$h_k(v_k) - h_{k-1}(v_k) = a_k \langle \nabla f(x_k), v_k - x_k \rangle + a_k \psi(v_k). \qquad (11)$$
On the other hand, using the definition of the Bregman divergence and the fact that the Bregman divergence is blind to constant and linear terms, we can bound $h_{k-1}(v_k) - h_{k-1}(v_{k-1})$ as
$$\begin{aligned} h_{k-1}(v_k) - h_{k-1}(v_{k-1}) &= \langle \nabla h_{k-1}(v_{k-1}), v_k - v_{k-1} \rangle + D_{h_{k-1}}(v_k, v_{k-1}) \\ &\geq A_{k-1} D_\psi(v_k, v_{k-1}) + m D_\phi(v_k, v_{k-1}), \end{aligned}$$
where the second line is by $v_{k-1}$ being the minimizer of $h_{k-1}$. Combining with Eqs. (10) and (11), we have:
$$A_k L_k - A_{k-1} L_{k-1} \geq a_k f(x_k) + a_k \psi(v_k) + a_k \langle \nabla f(x_k), v_k - x_k \rangle + A_{k-1} D_\psi(v_k, v_{k-1}) + m D_\phi(v_k, v_{k-1}). \qquad (12)$$
Combining Eqs. (9) and (12), we can now bound $A_k G_k - A_{k-1} G_{k-1}$ as
$$\begin{aligned} A_k G_k - A_{k-1} G_{k-1} &\leq A_k \big( f(y_k) - f(x_k) \big) + A_{k-1} \big( f(x_k) - f(y_{k-1}) \big) - a_k \langle \nabla f(x_k), v_k - x_k \rangle \\ &\qquad - A_{k-1} D_\psi(v_k, v_{k-1}) - m D_\phi(v_k, v_{k-1}) \\ &\leq A_k \big( f(y_k) - f(x_k) - \langle \nabla f(x_k), y_k - x_k \rangle \big) - A_{k-1} D_\psi(v_k, v_{k-1}) - m D_\phi(v_k, v_{k-1}), \end{aligned}$$
where we have used $f(x_k) - f(y_{k-1}) \leq \langle \nabla f(x_k), x_k - y_{k-1} \rangle$ (by convexity of $f$) and the definition of $y_k$ from Eq. (7).
Similarly as for the initial gap, we now use the weak smoothness of $f$ and Lemma 1.6 to write:
$$f(y_k) - f(x_k) - \langle \nabla f(x_k), y_k - x_k \rangle \leq \frac{M_k}{q} \|y_k - x_k\|^q + \delta_k = \frac{M_k}{q} \cdot \frac{a_k^q}{A_k^q} \|v_k - v_{k-1}\|^q + \delta_k,$$
where $M_k = \big[ \frac{q - \kappa}{q \kappa \delta_k} \big]^{\frac{q - \kappa}{\kappa}} L^{\frac{q}{\kappa}}$ and the equality is by $y_k - x_k = \frac{a_k}{A_k} (v_k - v_{k-1})$, which follows from the definition of the algorithm steps in Eq. (7).

On the other hand, as $\psi$ is $(\lambda, q)$-uniformly convex, we have that $D_\psi(v_k, v_{k-1}) \geq \frac{\lambda}{q} \|v_k - v_{k-1}\|^q$. Further, if $\lambda = 0$, we have that $D_\phi(v_k, v_{k-1}) \geq \frac{1}{q} \|v_k - v_{k-1}\|^q$. Thus:
$$A_k G_k - A_{k-1} G_{k-1} \leq \Big( \frac{M_k a_k^q}{A_k^{q-1}} - \max\{\lambda A_{k-1}, m\} \Big) \frac{1}{q} \|v_k - v_{k-1}\|^q + A_k \delta_k \leq A_k \delta_k,$$
as $\frac{a_k^q}{A_k^{q-1}} \leq \frac{\max\{\lambda A_{k-1}, \, m\}}{M_k}$.

We are now ready to state and prove the main result from this section.
Theorem 2.3.
Let $\bar{f}(x) = f(x) + \psi(x)$, where $f$ is convex and $(L, \kappa)$-weakly smooth w.r.t. a norm $\|\cdot\|$, $\kappa \in (1, 2]$, and $\psi$ is $q$-uniformly convex with constant $\lambda \geq 0$ w.r.t. the same norm, for some $q \geq 2$. Let $\bar{x}^*$ be the minimizer of $\bar{f}$. Let $x_k, v_k, y_k$ evolve according to Eq. (7) for an arbitrary initial point $x_0 \in X$, where $A_0 M_0 = m$, $a_k^q \leq \frac{\max\{\lambda A_{k-1} A_k^{q-1}, \, m A_k^{q-1}\}}{M_k}$ for $k \geq 1$, and $M_k = \big[ \frac{q - \kappa}{q \kappa \delta_k} \big]^{\frac{q - \kappa}{\kappa}} L^{\frac{q}{\kappa}}$, for $\delta_k > 0$ and $k \geq 0$. Then, for all $k \geq 0$,
$$\bar{f}(y_k) - \bar{f}(\bar{x}^*) \leq \frac{A_0 M_0 \phi(\bar{x}^*) + \sum_{i=0}^k A_i \delta_i}{A_k}.$$
In particular, for any $\epsilon > 0$, setting $\delta_k = \frac{a_k}{A_k} \cdot \frac{\epsilon}{2}$ for $k \geq 0$, $a_0 = A_0 = 1$, and $a_k^q = \frac{\max\{\lambda A_{k-1} A_k^{q-1}, \, m A_k^{q-1}\}}{M_k}$ for $k \geq 1$, we have that $\bar{f}(y_k) - \bar{f}(\bar{x}^*) \leq \epsilon$ after at most
$$k = O\bigg( \min \bigg\{ \Big( \frac{1}{\epsilon} \Big)^{\frac{q - \kappa}{q\kappa - q + \kappa}} \Big( \max \Big\{ \frac{L^{q/\kappa}}{\lambda}, 1 \Big\} \Big)^{\frac{\kappa}{q\kappa - q + \kappa}} \log \Big( \frac{L \phi(\bar{x}^*)}{\epsilon} \Big), \; \Big( \frac{L}{\epsilon} \Big)^{\frac{q}{q\kappa - q + \kappa}} \big( \phi(\bar{x}^*) \big)^{\frac{\kappa}{q\kappa - q + \kappa}} \bigg\} \bigg)$$
iterations.

Proof. The first part of the theorem follows immediately by combining Lemma 2.1 and Lemma 2.2.

For the second part, we have
$$\bar{f}(y_k) - \bar{f}(\bar{x}^*) \leq \frac{A_0 M_0 \phi(\bar{x}^*)}{A_k} + \frac{\epsilon}{2},$$
so all we need to show is that, under the step size choice from the theorem statement, we have $\frac{A_0 M_0 \phi(\bar{x}^*)}{A_k} \leq \frac{\epsilon}{2}$. As $A_0 = a_0 = 1$, we have that $\delta_0 = \frac{\epsilon}{2}$ and
$$M_0 = \Big[ \frac{2(q - \kappa)}{q \kappa \epsilon} \Big]^{\frac{q - \kappa}{\kappa}} L^{\frac{q}{\kappa}}. \qquad (13)$$
It remains to bound the growth of $A_k$. In this case, by the theorem assumption, we have $a_k^q = \frac{\max\{\lambda A_{k-1} A_k^{q-1}, \, m A_k^{q-1}\}}{M_k}$. Thus, (i) $\frac{a_k^q}{A_k^{q-1} A_{k-1}} \geq \frac{\lambda}{M_k}$ and (ii) $\frac{a_k^q}{A_k^{q-1}} \geq \frac{m}{M_k}$, and the growth of $A_k$ can be bounded below by the maximum of the growths determined by these two cases.

Consider $\frac{a_k^q}{A_k^{q-1} A_{k-1}} \geq \frac{\lambda}{M_k}$ first. As $\delta_k = \frac{a_k \epsilon}{2 A_k}$ and $M_k = \big[ \frac{q - \kappa}{q \kappa \delta_k} \big]^{\frac{q - \kappa}{\kappa}} L^{\frac{q}{\kappa}}$, the condition $\frac{a_k^q}{A_k^{q-1} A_{k-1}} \geq \frac{\lambda}{M_k}$ can be equivalently written as:
$$a_k^{\frac{q\kappa - q + \kappa}{\kappa}} \geq \Big[ \frac{2(q - \kappa)}{q \kappa \epsilon} \Big]^{-\frac{q - \kappa}{\kappa}} \frac{\lambda}{L^{q/\kappa}} \, A_k^{\frac{q\kappa - q + \kappa}{\kappa} - 1} A_{k-1}.$$
Hence, using $A_k \geq A_{k-1}$,
$$\frac{a_k}{A_{k-1}} \geq \Big[ \frac{2(q - \kappa)}{q \kappa \epsilon} \Big]^{-\frac{q - \kappa}{q\kappa - q + \kappa}} \Big( \frac{\lambda}{L^{q/\kappa}} \Big)^{\frac{\kappa}{q\kappa - q + \kappa}}.$$
As $a_k = A_k - A_{k-1}$, it follows that
$$\frac{A_k}{A_{k-1}} \geq 1 + \Big[ \frac{2(q - \kappa)}{q \kappa \epsilon} \Big]^{-\frac{q - \kappa}{q\kappa - q + \kappa}} \Big( \frac{\lambda}{L^{q/\kappa}} \Big)^{\frac{\kappa}{q\kappa - q + \kappa}},$$
further leading to
$$A_k \geq \bigg( 1 + \Big[ \frac{2(q - \kappa)}{q \kappa \epsilon} \Big]^{-\frac{q - \kappa}{q\kappa - q + \kappa}} \Big( \frac{\lambda}{L^{q/\kappa}} \Big)^{\frac{\kappa}{q\kappa - q + \kappa}} \bigg)^k.$$
On the other hand, the condition $\frac{a_k^q}{A_k^{q-1}} \geq \frac{m}{M_k}$ can be equivalently written as:
$$a_k^{\frac{q\kappa - q + \kappa}{\kappa}} A_k^{-\frac{q\kappa - q}{\kappa}} \geq \frac{m}{L^{q/\kappa}} \Big[ \frac{q \kappa \epsilon}{2(q - \kappa)} \Big]^{\frac{q - \kappa}{\kappa}} = 1,$$
where we have used the definition of $m$. This recurrence implies
$$A_k = \Omega\big( k^{\frac{q\kappa - q + \kappa}{\kappa}} \big), \qquad (14)$$
and further leads to the claimed bound on the number of iterations.

Let us point out some special cases of the bound from Theorem 2.3. When $f$ is smooth ($\kappa = 2$) and $\psi$ is $q$-uniformly convex, assuming $L^{q/2} \geq \lambda$, the bound simplifies to
$$k = O\bigg( \min \bigg\{ \Big( \frac{1}{\epsilon} \Big)^{\frac{q - 2}{q + 2}} \Big( \frac{L^q}{\lambda^2} \Big)^{\frac{1}{q + 2}} \log \Big( \frac{L \phi(\bar{x}^*)}{\epsilon} \Big), \; \Big( \frac{L}{\epsilon} \Big)^{\frac{q}{q + 2}} \big( \phi(\bar{x}^*) \big)^{\frac{2}{q + 2}} \bigg\} \bigg). \qquad (15)$$
In particular, if $\psi$ is strongly convex ($q = 2$), we recover the same bound as in the Euclidean case:
$$k = O\bigg( \min \bigg\{ \sqrt{\frac{L}{\lambda}} \log \Big( \frac{L \phi(\bar{x}^*)}{\epsilon} \Big), \; \sqrt{\frac{L \phi(\bar{x}^*)}{\epsilon}} \bigg\} \bigg). \qquad (16)$$
Note that this result uses smoothness of $f$ and strong convexity of $\psi$ with respect to the same but arbitrary norm $\|\cdot\|$. Because we do not require the same function to be simultaneously smooth and strongly convex w.r.t. $\|\cdot\|$, the resulting "condition number" $\frac{L}{\lambda}$ can be dimension-independent even for non-Euclidean norms (in particular, this will be possible for any $\ell_p$ norm with $p \in (1, 2]$).

Because $\bar{f}$ is $q$-uniformly convex, Theorem 2.3 also implies a bound on $\|y_k - \bar{x}^*\|$ whenever $\lambda > 0$, as follows.

Corollary 2.4.
Under the same assumptions as in Theorem 2.3, and assuming, in addition, that $\lambda > 0$, we have that $\|y_k - \bar{x}^*\| \leq \bar{\epsilon}$ after at most
$$k = O\bigg( \Big( \frac{q}{\lambda \bar{\epsilon}^{\,q}} \Big)^{\frac{q - \kappa}{q\kappa - q + \kappa}} \Big( \frac{L^{q/\kappa}}{\lambda} \Big)^{\frac{\kappa}{q\kappa - q + \kappa}} \log \Big( \frac{q L \phi(\bar{x}^*)}{\bar{\epsilon}^{\,q} \lambda} \Big) \bigg)$$
iterations.

Proof. By $q$-uniform convexity of $\bar{f}$ and $0 \in \partial \bar{f}(\bar{x}^*)$ (as $\bar{x}^*$ minimizes $\bar{f}$), we have $\|y_k - \bar{x}^*\|^q \leq \frac{q}{\lambda} \big( \bar{f}(y_k) - \bar{f}(\bar{x}^*) \big)$. Thus, it suffices to apply the bound from Theorem 2.3 with the accuracy parameter $\epsilon = \frac{\lambda \bar{\epsilon}^{\,q}}{q}$.

2.2 Computational Considerations

At first glance, the result from Theorem 2.3 may seem of limited applicability, as there are potentially four different parameters (
$L, \kappa, \lambda, q$) that one would need to tune. However, we now argue that this is not a constraining factor. First, for most of the applications in which one would be interested in using this framework, the function $\psi$ is a regularizing function with known uniform convexity parameters $\lambda$ and $q$ (see Section 5 for several interesting examples). Second, knowledge of the parameters $L$ and $\kappa$ is not necessary for our results; we presented the analysis assuming knowledge of these parameters so as to not over-complicate the exposition.

In particular, the only place in the analysis where the $(L, \kappa)$-weak smoothness of $f$ is used is in the inequality
$$f(y_k) \leq f(x_k) + \langle \nabla f(x_k), y_k - x_k \rangle + \frac{M_k}{q} \|y_k - x_k\|^q + \delta_k. \qquad (17)$$
Instead of explicitly computing the value of $M_k$ based on $L, \kappa$, one could maintain an estimate of $M_k$, double it whenever the inequality from Eq. (17) is not satisfied, and recompute all iteration-$k$ variables. This is a standard trick employed in optimization, due to Nesterov [2015]. Observe that, due to the $(L, \kappa)$-weak smoothness of $f$ and Lemma 1.6, there exists a sufficiently large $M_k$ for any value of $\delta_k$. In particular, under the choice $\delta_k = \frac{a_k}{A_k} \cdot \frac{\epsilon}{2}$ from Theorem 2.3, the total number of times that $M_k$ can get doubled is logarithmic in all of the problem parameters, which means that it can be absorbed in the overall convergence bound from Theorem 2.3.

Finally, the described algorithm (Generalized AGD+ from Eq. (7)) can be efficiently implemented only if the minimization problems defining $v_k$ can be solved efficiently (preferably in closed form, or with $\tilde{O}(d)$ arithmetic operations). This is indeed the case for most problems of interest. In particular, when $\psi$ is uniformly convex, we will typically take $\phi(u)$ to be the Bregman divergence $D_\psi(u, x_0)$. Then, the computation of $v_k$ boils down to solving problems of the form (2), i.e., $\min_{u \in X} \{\langle z, u \rangle + \psi(u)\}$, for a given $z$.
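For subproblems of the form $\min_{u} \{\langle z, u \rangle + \psi(u)\}$, with the separable choice $\psi(u) = \frac{1}{p}\|u\|_p^p$ over $u \in \mathbb{R}^d$, the solution is available coordinate-wise in closed form. The following sketch is our own illustrative check of this fact, not code from the paper:

```python
import numpy as np

def solve_subproblem(z, p):
    """Closed-form solution of min_u <z, u> + (1/p)*||u||_p^p over R^d:
    coordinate-wise optimality reads z_i + sign(u_i)*|u_i|^(p-1) = 0."""
    return -np.sign(z) * np.abs(z) ** (1.0 / (p - 1))

p = 4.0
rng = np.random.default_rng(2)
z = rng.standard_normal(6)
u = solve_subproblem(z, p)

obj = lambda w: z @ w + np.linalg.norm(w, ord=p) ** p / p
# Sanity check: the closed-form point should beat random perturbations of itself
vals = [obj(u + 0.1 * rng.standard_normal(6)) for _ in range(100)]
print(obj(u) <= min(vals))
```

Such coordinate-wise closed forms are exactly the kind of $\tilde{O}(d)$-time solvers that make Generalized AGD+ practical.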
Such problems are efficiently solvable whenever the convex conjugate of $\psi + I_{\mathcal{X}}$, where $I_{\mathcal{X}}$ is the indicator function of the closed convex set $\mathcal{X}$, is efficiently computable, in which case the minimizer is $\nabla(\psi + I_{\mathcal{X}})^*(-z)$. In particular, for $\mathcal{X} = \mathbb{E}$ and $\psi(x) = \frac{1}{q}\|x\|^q$, $q > 1$ (a common choice for our applications of interest; see Section 5), the minimizer is computable in closed form as $\nabla\big(\frac{1}{q^*}\|\cdot\|_*^{q^*}\big)(-z)$, where $q^* = \frac{q}{q-1}$ is the exponent dual to $q$. This should be compared to the computation of proximal maps needed in Nesterov [2013], where the minimizer would be the gradient of the infimal convolution of $\psi$ and the Euclidean norm squared, for which there are far fewer efficiently computable examples. Note that such an assumption would be sufficient for our algorithm to work in the Euclidean case (by taking $\phi(u) = \frac{1}{2}\|u - x_0\|^2$); however, it is not necessary.

3 Small Gradients in $\ell_p$ and $\mathcal{S}_p$ Spaces
We now show how to use the result from Theorem 2.3 to obtain near-optimal convergence bounds for minimizing the norm of the gradient. In particular, assuming that $f$ is $(L, \kappa)$-weakly smooth w.r.t. $\|\cdot\|_p$, to obtain the desired results we apply Theorem 2.3 to the function $\bar{f}(\cdot) = f(\cdot) + \lambda\psi_p(\cdot)$, where
$$\psi_p(x) = \begin{cases} \frac{1}{2(p-1)}\|x - x_0\|_p^2, & \text{if } p \in (1, 2],\\ \frac{1}{p}\|x - x_0\|_p^p, & \text{if } p \in (2, +\infty). \end{cases} \qquad (18)$$
The function $\psi_p$ is then $(1, \max\{2, p\})$-uniformly convex. The proof of strong convexity of $\psi_p$ when $1 < p \leq 2$ can be found, e.g., in Beck [2017, Example 5.28]. For $2 < p < +\infty$, $\psi_p$ is a separable function, hence its $p$-uniform convexity can be proved from the duality between uniform convexity and uniform smoothness [Zalinescu, 1983] and direct computation. These $\ell_p$ results also have spectral analogues, given by the Schatten spaces $\mathcal{S}_p = (\mathbb{R}^{d\times d}, \|\cdot\|_{S,p})$. Here, the functions below can be proved to be $(1, \max\{2, p\})$-uniformly convex, which is a consequence of sharp estimates of uniform convexity for Schatten spaces [Ball et al., 1994, Juditsky and Nemirovski, 2008]:
$$\Psi_{S,p}(x) = \begin{cases} \frac{1}{2(p-1)}\|x - x_0\|_{S,p}^2, & \text{if } p \in (1, 2],\\ \frac{1}{p}\|x - x_0\|_{S,p}^p, & \text{if } p \in (2, +\infty). \end{cases} \qquad (19)$$
Finally, both for $\ell_1$ and $\mathcal{S}_1$ spaces, our algorithms can work with the equivalent norm with power $p = \ln d/(\ln d - 1)$. The cost of this change of norm is at most logarithmic in $d$ for the diameter and strong convexity constants. Similarly, our results also extend to the case $p = \infty$, by similar considerations (here, using the exponent $p = \ln d$).

To obtain the results for the norm of the gradient in $\ell_p$ spaces, we can apply Theorem 2.3 with $\phi(x) = \psi_p(x)$, where $\psi_p$ is specified in Eq. (18). The result is summarized in the following theorem. The same result can be obtained for $\mathcal{S}_p$ spaces, by following the same argument as in the proof of Theorem 3.1 below, which we omit for brevity.

Theorem 3.1.
Let $f$ be a convex, $(L, \kappa)$-weakly smooth function w.r.t. a norm $\|\cdot\|_p$, where $p \in (1, \infty)$. Then, for any $\epsilon > 0$, Generalized AGD+ from Eq. (7), initialized at some point $x_0 \in \mathbb{R}^d$ and applied to $\bar{f} = f + \lambda\psi_p$, where $\psi_p$ is specified in Eq. (18),
$$\lambda = \begin{cases} \dfrac{\epsilon(p-1)}{2\|x^* - x_0\|_p}, & \text{if } p \in (1, 2],\\[4pt] \dfrac{\epsilon}{2\|x^* - x_0\|_p^{p-1}}, & \text{if } p \in (2, \infty), \end{cases}$$
and with the choice $\phi = \psi_p$, constructs a point $y_k$ with $\|\nabla f(y_k)\|_{p^*} \leq \epsilon$ in at most
$$k = \begin{cases} O\bigg( \Big(\dfrac{L}{\epsilon}\Big(\dfrac{\|x^*-x_0\|_p}{p-1}\Big)^{\kappa-1}\Big)^{\frac{\kappa}{(\kappa-1)(3\kappa-2)}} \Big(\dfrac{\kappa}{\kappa-1}\Big)^{\frac{2}{3\kappa-2}} \log\Big(\dfrac{L\|x^*-x_0\|_p}{(p-1)\epsilon}\Big) \bigg), & \text{if } p \in (1, 2],\\[8pt] O\bigg( \Big(\dfrac{L\|x^*-x_0\|_p^{\kappa-1}}{\epsilon}\Big)^{\frac{\kappa(p-1)}{(\kappa-1)(p\kappa-p+\kappa)}} \Big(\dfrac{\kappa}{\kappa-1}\Big)^{\frac{p}{p\kappa-p+\kappa}} \log\Big(\dfrac{L\|x^*-x_0\|_p^p}{\epsilon}\Big) \bigg), & \text{if } p \in (2, \infty), \end{cases}$$
iterations. In particular, when $\kappa = 2$ (i.e., when $f$ is $L$-smooth):
$$k = \begin{cases} \tilde{O}\bigg( \sqrt{\dfrac{L\|x^*-x_0\|_p}{(p-1)\epsilon}} \bigg), & \text{if } p \in (1, 2],\\[6pt] \tilde{O}\bigg( \Big(\dfrac{L\|x^*-x_0\|_p}{\epsilon}\Big)^{\frac{2(p-1)}{p+2}} \bigg), & \text{if } p \in (2, \infty), \end{cases}$$
where $\tilde{O}$ hides logarithmic factors in $L$, $\|x^*-x_0\|_p$, $\frac{1}{p-1}$, and $1/\epsilon$.

Proof. Let us first relate $\|\bar{x}^* - x_0\|_p$ to $\|x^* - x_0\|_p$, where $\bar{x}^* = \mathrm{argmin}_{x\in\mathbb{R}^d}\bar{f}(x)$, $x^* \in \mathrm{argmin}_{x\in\mathbb{R}^d}f(x)$. By the definition of $\bar{f}$:
$$0 \leq \bar{f}(x^*) - \bar{f}(\bar{x}^*) = f(x^*) - f(\bar{x}^*) + \lambda\psi_p(x^*) - \lambda\psi_p(\bar{x}^*) \leq \lambda\psi_p(x^*) - \lambda\psi_p(\bar{x}^*).$$
It follows that $\psi_p(\bar{x}^*) \leq \psi_p(x^*)$. Thus, using the definition of $\psi_p$,
$$\|\bar{x}^* - x_0\|_p \leq \|x^* - x_0\|_p. \qquad (20)$$
By the triangle inequality and $\bar{x}^* = \mathrm{argmin}_{x\in\mathbb{R}^d}\bar{f}(x)$ (which implies $\nabla\bar{f}(\bar{x}^*) = 0$),
$$\|\nabla f(y_k)\|_{p^*} \leq \|\nabla f(y_k) - \nabla f(\bar{x}^*)\|_{p^*} + \|\nabla f(\bar{x}^*)\|_{p^*} = \|\nabla f(y_k) - \nabla f(\bar{x}^*)\|_{p^*} + \|\nabla\bar{f}(\bar{x}^*) - \lambda\nabla\psi_p(\bar{x}^*)\|_{p^*} = \|\nabla f(y_k) - \nabla f(\bar{x}^*)\|_{p^*} + \lambda\|\nabla\psi_p(\bar{x}^*)\|_{p^*}.$$
(21) As $f$ is convex and $(L, \kappa)$-weakly smooth, using Lemma 1.7, we also have:
$$\begin{aligned} \frac{\kappa-1}{\kappa}L^{-\frac{1}{\kappa-1}}\|\nabla f(y_k) - \nabla f(\bar{x}^*)\|_{p^*}^{\frac{\kappa}{\kappa-1}} &\leq f(y_k) - f(\bar{x}^*) - \langle\nabla f(\bar{x}^*), y_k - \bar{x}^*\rangle\\ &= \bar{f}(y_k) - \bar{f}(\bar{x}^*) - \lambda\psi_p(y_k) + \lambda\psi_p(\bar{x}^*) - \langle\nabla\bar{f}(\bar{x}^*) - \lambda\nabla\psi_p(\bar{x}^*), y_k - \bar{x}^*\rangle\\ &= \bar{f}(y_k) - \bar{f}(\bar{x}^*) - \lambda\big(\psi_p(y_k) - \psi_p(\bar{x}^*) - \langle\nabla\psi_p(\bar{x}^*), y_k - \bar{x}^*\rangle\big)\\ &\leq \bar{f}(y_k) - \bar{f}(\bar{x}^*), \end{aligned} \qquad (22)$$
where the second line uses $\bar{f} = f + \lambda\psi_p$, the third line follows by $\nabla\bar{f}(\bar{x}^*) = 0$ (as $\bar{x}^* = \mathrm{argmin}_{x\in\mathbb{R}^d}\bar{f}(x)$), and the last inequality is by convexity of $\psi_p$.

Hence, to ensure $\|\nabla f(y_k)\|_{p^*} \leq \epsilon$, it suffices that $\lambda\|\nabla\psi_p(\bar{x}^*)\|_{p^*} \leq \frac{\epsilon}{2}$ and $\bar{f}(y_k) - \bar{f}(\bar{x}^*) \leq \big(\frac{\epsilon}{2}\big)^{\frac{\kappa}{\kappa-1}}\frac{\kappa-1}{\kappa}L^{-\frac{1}{\kappa-1}}$. The first condition determines the value of $\lambda$. Using Proposition 1.5, $\lambda\|\nabla\psi_p(\bar{x}^*)\|_{p^*} \leq \frac{\epsilon}{2}$ is equivalent to
$$\begin{cases} \frac{\lambda}{p-1}\|\bar{x}^* - x_0\|_p \leq \frac{\epsilon}{2}, & \text{if } p \in (1, 2],\\ \lambda\|\bar{x}^* - x_0\|_p^{p-1} \leq \frac{\epsilon}{2}, & \text{if } p \in (2, \infty). \end{cases}$$
Using Eq. (20), it suffices that:
$$\lambda = \begin{cases} \frac{\epsilon(p-1)}{2\|x^* - x_0\|_p}, & \text{if } p \in (1, 2],\\ \frac{\epsilon}{2\|x^* - x_0\|_p^{p-1}}, & \text{if } p \in (2, \infty). \end{cases} \qquad (23)$$
Using the choice of $\lambda$ from Eq. (23), it remains to apply Theorem 2.3 to bound the number of iterations until $\bar{f}(y_k) - \bar{f}(\bar{x}^*) \leq \big(\frac{\epsilon}{2}\big)^{\frac{\kappa}{\kappa-1}}\frac{\kappa-1}{\kappa}L^{-\frac{1}{\kappa-1}}$. Applying Theorem 2.3, we have:
$$k = O\bigg( \Big(\frac{\kappa}{\kappa-1}L^{\frac{1}{\kappa-1}}\Big(\frac{2}{\epsilon}\Big)^{\frac{\kappa}{\kappa-1}}\Big)^{\frac{q-\kappa}{q\kappa-q+\kappa}} \Big(\frac{L^{q/\kappa}}{\lambda}\Big)^{\frac{\kappa}{q\kappa-q+\kappa}} \log\Big(\frac{\kappa}{\kappa-1}\cdot\frac{L^{\frac{1}{\kappa-1}}\psi_p(\bar{x}^*)}{(\epsilon/2)^{\frac{\kappa}{\kappa-1}}}\Big) \bigg).$$
It remains to plug in the choice of $\lambda$ from Eq. (23), $q = \max\{2, p\}$, and simplify.

Remark 3.2.
Observe that, as the gradient norm minimization relies on the application of Theorem 2.3, knowledge of the parameters $L$ and $\kappa$ is not needed, as discussed in Section 2.2. The only parameter that needs to be determined is $\lambda$, which cannot be known in advance, as it would require knowing the initial distance to the optimum $\|x^* - x_0\|$. However, tuning $\lambda$ can be done at the cost of an additional $\log(\lambda_0/\lambda)$ multiplicative factor in the convergence bound. In particular, one could start with a large estimate of $\lambda$ (say, $\lambda = \lambda_0 = 1$), run the algorithm, and halt and restart with $\lambda \leftarrow \lambda/2$ each time $\|\nabla\bar{f}(y_k)\|_* \leq \epsilon$ but $\|\nabla f(y_k)\|_* > \epsilon$. This condition is sufficient because, when $\lambda$ is of the correct order, $\lambda\|\nabla\psi(y_k)\|_* = O(\lambda\|\nabla\psi(\bar{x}^*)\|_*) = O(\epsilon)$, $\|\nabla f(y_k)\|_* \leq \epsilon$, and $\|\nabla\bar{f}(y_k)\|_* \leq \|\nabla f(y_k)\|_* + \lambda\|\nabla\psi(y_k)\|_* \leq O(\epsilon)$.

4 Lower Bounds

In this section, we address the question of the optimality of our algorithmic framework, in a formal oracle model of computation. We first study the question of minimizing the norm of the gradient, which follows from a simple reduction to the complexity of minimizing the objective function and for which nearly tight lower bounds are known. In this case, the lower bounds show that our resulting algorithms are nearly optimal when $q = \kappa = 2$. In cases where we either have weaker smoothness ($\kappa < 2$) or a larger uniform convexity exponent ($q > 2$), we observe the presence of polynomial gaps in the complexity w.r.t. $1/\epsilon$.

One natural question regarding the aforementioned gaps is whether this is due to the suboptimality of the complementary composite minimization algorithm used, or of the reduction from the solution obtained by this method to obtaining a small gradient norm. In this respect, we discard the first possibility, showing sharp lower bounds for complementary composite optimization in a new composite oracle model.
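The $\lambda$-halving restart scheme from Remark 3.2 can be sketched on a toy one-dimensional instance. The quadratic $f$ and the stand-in solver below are our own illustrative assumptions (the actual method would run Generalized AGD+ until the regularized gradient is small); only the restart logic is the point:

```python
def grad_f(x, a=1.0):          # toy smooth term f(x) = (x - a)^2 / 2
    return x - a

def solve_regularized(lmbda, a=1.0):
    """Stand-in for the composite solver: returns the exact minimizer of
    f(x) + lmbda * x^2 / 2, i.e. a point where the gradient of the
    regularized objective is zero (so its norm is certainly <= eps)."""
    return a / (1.0 + lmbda)

def tune_lambda(eps, lmbda0=1.0, a=1.0):
    """Halving schedule: restart with lambda / 2 whenever the regularized
    gradient is small but the gradient of f itself is still large."""
    lmbda, restarts = lmbda0, 0
    while True:
        y = solve_regularized(lmbda, a)
        if abs(grad_f(y, a)) <= eps:   # small gradient of f itself: done
            return y, lmbda, restarts
        lmbda /= 2.0                   # lambda was too large: halve and restart
        restarts += 1

y, lmbda, restarts = tune_lambda(eps=0.1)
```

Each restart halves $\lambda$, so the number of restarts is logarithmic in $\lambda_0/\lambda$, matching the $\log(\lambda_0/\lambda)$ overhead claimed in the remark.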
Our lower bounds show that the complementary composite minimization algorithms are optimal up to factors which depend at most logarithmically on the initial distance to the optimal solution, the target accuracy, and the dimension.

Before proceeding to the specific results, we provide a short summary of classical oracle complexity in convex optimization and some techniques that will be necessary for our results. For more detailed information on the subject, we refer the reader to the thorough monograph of Nemirovskii and Yudin [1983]. In the oracle model of convex optimization, we consider a class of objectives $\mathcal{F}$, comprised of functions $f: \mathbb{E} \to \mathbb{R}$; an oracle $\mathcal{O}: \mathcal{F}\times\mathbb{E} \to \mathbb{F}$ (where $\mathbb{F}$ is a vector space); and a target accuracy $\epsilon > 0$. An algorithm $\mathcal{A}$ can be described by a sequence of functions $(\mathcal{A}_k)_{k\in\mathbb{N}}$, where $\mathcal{A}_{k+1}: (\mathbb{E}\times\mathbb{F})^{k+1} \to \mathbb{E}$, so that the algorithm sequentially interacts with the oracle, querying points
$$x_{k+1} = \mathcal{A}_{k+1}\big(x_0, \mathcal{O}(f, x_0), \ldots, x_k, \mathcal{O}(f, x_k)\big).$$
The running time of algorithm $\mathcal{A}$ is given by the minimum number of queries needed to achieve some measure of accuracy (up to a given accuracy $\epsilon > 0$), and will be denoted by $T(\mathcal{A}, f, \epsilon)$. The most classical example in optimization is achieving additive optimality gap bounded by $\epsilon$:
$$T(\mathcal{A}, f, \epsilon) = \inf\{k \geq 0 : f(x_k) \leq f^* + \epsilon\};$$
analogously, for the norm of the gradient,
$$T(\mathcal{A}, f, \epsilon) = \inf\{k \geq 0 : \|\nabla f(x_k)\|_* \leq \epsilon\}.$$
Given a measure of efficiency $T$, the worst-case oracle complexity for a problem class $\mathcal{F}$ endowed with oracle $\mathcal{O}$ is given by
$$\mathrm{Compl}(\mathcal{F}, \mathcal{O}, \epsilon) = \inf_{\mathcal{A}}\sup_{f\in\mathcal{F}} T(\mathcal{A}, f, \epsilon).$$
We provide lower complexity bounds for minimizing the norm of the gradient. For the sake of simplicity, one can think of these lower bounds for the oracle $\mathcal{O}(f, x) = \nabla f(x)$, but we point out that they hold more generally for arbitrary local oracles (more on this in the next section). In short, we reduce the problem of making the gradient small to that of approximately minimizing the objective.

Proposition 4.1.
Let $f: \mathbb{E} \to \mathbb{R}$ be a convex and differentiable function with a global minimizer $x^*$. If $\|\nabla f(x)\|_* \leq \epsilon$ and $\|x - x^*\| \leq R$, then $f(x) - f(x^*) \leq \epsilon R$.

Proof. By convexity of $f$,
$$f(x) - f(x^*) \leq \langle\nabla f(x), x - x^*\rangle \leq \|\nabla f(x)\|_*\|x - x^*\| \leq \epsilon R,$$
where the second inequality is by duality of the norms $\|\cdot\|$ and $\|\cdot\|_*$.

For the classical problem of minimizing the objective function value, lower complexity bounds for $\ell_p$-setups have been previously studied in both constrained [Guzmán and Nemirovski, 2015] and unconstrained [Diakonikolas and Guzmán, 2020] settings. Here we summarize those results.

Theorem 4.2 ([Guzmán and Nemirovski, 2015, Diakonikolas and Guzmán, 2020]). Let $1 \leq p \leq \infty$, and consider the problem class of unconstrained minimization with objectives in the class $\mathcal{F}_{\mathbb{R}^d,\|\cdot\|_p}(\kappa, L)$, whose minima are attained in $B_{\|\cdot\|_p}(0, R)$. Then, the complexity of achieving additive optimality gap $\epsilon$, for any local oracle, is bounded below by:
- $\Omega\Big(\big(\frac{LR^\kappa}{\epsilon[\ln d]^{\kappa-1}}\big)^{\frac{2}{3\kappa-2}}\Big)$ if $1 \leq p < 2$;
- $\Omega\Big(\big(\frac{LR^\kappa}{\epsilon\min\{p,\ln d\}^{\kappa-1}}\big)^{\frac{p}{\kappa p+\kappa-p}}\Big)$, if $2 \leq p < \infty$; and,
- $\Omega\Big(\big(\frac{LR^\kappa}{\epsilon[\ln d]^{\kappa-1}}\big)^{\frac{2}{3\kappa-2}}\Big)$, if $p = \infty$.

The dimension $d$ for the lower bound to hold must be at least as large as the lower bound itself.

By combining the reduction from Proposition 4.1 with the lower bounds for function minimization from Theorem 4.2, we can now immediately obtain lower bounds for minimizing the $\ell_p$ norm of the gradient, as follows.

Corollary 4.3.
Let $1 \leq p \leq \infty$, and consider the problem class with objectives in $\mathcal{F}_{\mathbb{R}^d,\|\cdot\|_p}(\kappa, L)$, whose minima are attained in $B_{\|\cdot\|_p}(0, R)$. Then, the complexity of achieving dual norm of the gradient bounded by $\epsilon$, for any local oracle, is bounded below by:
- $\Omega\Big(\big(\frac{LR^{\kappa-1}}{\epsilon[\ln d]^{\kappa-1}}\big)^{\frac{2}{3\kappa-2}}\Big)$ if $1 \leq p < 2$;
- $\Omega\Big(\big(\frac{LR^{\kappa-1}}{\epsilon\min\{p,\ln d\}^{\kappa-1}}\big)^{\frac{p}{\kappa p+\kappa-p}}\Big)$, if $2 \leq p < \infty$; and,
- $\Omega\Big(\big(\frac{LR^{\kappa-1}}{\epsilon[\ln d]^{\kappa-1}}\big)^{\frac{2}{3\kappa-2}}\Big)$, if $p = \infty$.

More precisely, to obtain this result one can use the $p$-norm smoothing construction from Guzmán and Nemirovski [2015, Section 2.3], in combination with the norm term used in Diakonikolas and Guzmán [2020, Eq. (3)]. This would lead to a smooth objective over an unconstrained domain that provides a hard function class. The dimension $d$ for the lower bound to hold must be at least as large as the lower bound itself.

Comparing to the upper bounds from Theorem 3.1, it follows that for $p \in (1, 2]$ and $\kappa = 2$, our bound is optimal up to a $\log(d)\log\big(\frac{LR}{(p-1)\epsilon}\big)$ factor; i.e., it is near-optimal. Recall that the upper bound for $p = 1$ can be obtained by applying the result from Theorem 3.1 with $p = \log(d)/[\log d - 1]$. When $p > 2$ and $\kappa = 2$, our upper bound is larger than the lower bound by a factor $\big(\frac{LR}{\epsilon}\big)^{\frac{p-2}{p+2}}\log\big(\frac{LR}{\epsilon}\big)(\min\{p, \log(d)\})^{\frac{p}{p+2}}$. The reason for the suboptimality in the $p > 2$ regime comes from the polynomial-in-$1/\epsilon$ factors in the upper bound for complementary composite minimization from Section 2, and it is a limitation of the regularization approach used in this work to obtain bounds for the norm of the gradient. In particular, we believe that it is not possible to obtain tighter bounds via an alternative analysis using the same regularization approach. Thus, it is an interesting open problem to obtain tight bounds for $p > 2$, and it may require developing completely new techniques.
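The reduction in Proposition 4.1 that underlies these corollaries can be illustrated numerically. The instance below (a Euclidean quadratic with a known minimizer, chosen by us purely for illustration) takes, for each sample, the tightest admissible $\epsilon$ and $R$ and confirms the gap bound $f(x) - f(x^*) \leq \epsilon R$:

```python
import random

# Proposition 4.1 on f(x) = 0.5 * ||x - a||^2 (Euclidean norm, which is
# self-dual): grad f(x) = x - a and x* = a, so for each random x we can
# take eps = ||grad f(x)|| and R = ||x - x*|| exactly and check the bound.
random.seed(1)
ok = True
for _ in range(1000):
    a = [random.uniform(-1, 1) for _ in range(5)]
    x = [ai + random.uniform(-0.1, 0.1) for ai in a]
    g = [xi - ai for xi, ai in zip(x, a)]
    eps = sum(v * v for v in g) ** 0.5   # ||grad f(x)||, tightest eps
    R = eps                              # here ||x - x*|| = ||grad f(x)||
    gap = 0.5 * sum(v * v for v in g)    # f(x) - f(x*)
    ok = ok and gap <= eps * R + 1e-12
```

For this quadratic the gap is exactly $\frac{1}{2}\epsilon R$, so the bound of Proposition 4.1 is loose by at most a factor of 2 on this family.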
Similar complexity gaps are encountered when $\kappa < 2$; however, it is reasonable to suspect that here the lower bounds are not sharp. In particular, when $\kappa = 1$, points with small subgradients may not even exist, which is not at all reflected in the lower bound. Therefore, it is an interesting open problem to investigate how to strengthen these lower bounds for weakly smooth function classes.

We investigate the (sub)optimality of the composite minimization algorithm in an oracle complexity model. To accurately reflect how our algorithms work (namely, using gradient information on the smooth term and regularized proximal subproblems w.r.t. the uniformly convex term), we introduce a new problem class and oracle for the complementary composite problem. We observe that existing constructions in the literature on lower bounds for nonsmooth uniformly convex optimization (e.g., Juditsky and Nesterov [2014], Srebro and Sridharan [2012]) apply to our composite setting for $\kappa = 1$. The main idea of the lower bounds in this section is to combine these constructions with local smoothing, to obtain composite functions that match our assumptions.

Assumptions 4.4.
Consider the problem class $\mathcal{P}(\mathcal{F}_{\|\cdot\|}(L,\kappa), \mathcal{U}_{\|\cdot\|}(\lambda, q), R)$, given by composite objective functions
$$(P_{f,\psi}) \qquad \min_{x\in\mathbb{E}}\big[\bar{f}(x) = f(x) + \psi(x)\big],$$
with the following assumptions:
(A.1) $f \in \mathcal{F}_{\|\cdot\|}(L, \kappa)$;
(A.2) $\psi \in \mathcal{U}_{\|\cdot\|}(\lambda, q)$; and,
(A.3) the optimal solution of $(P_{f,\psi})$ is attained within $B_{\|\cdot\|}(0, R)$.
The problem class is additionally endowed with oracles $\mathcal{O}_{\mathcal{F}}$ and $\mathcal{O}_{\mathcal{U}}$, for the function classes $\mathcal{F}_{\|\cdot\|}(L,\kappa)$ and $\mathcal{U}_{\|\cdot\|}(\lambda,q)$, respectively, which satisfy:
(O.1) $\mathcal{O}_{\mathcal{F}}$ is a local oracle: if $f, g \in \mathcal{F}_{\|\cdot\|}(L,\kappa)$ are such that there exists $r > 0$ such that they coincide in a neighborhood $B_{\|\cdot\|}(x, r)$, then $\mathcal{O}_{\mathcal{F}}(x, f) = \mathcal{O}_{\mathcal{F}}(x, g)$; and,
(O.2) $\mathcal{O}_{\mathcal{U}}$ is any oracle (not necessarily local).

In brief, we are interested in the oracle complexity of achieving $\epsilon$-optimality gap for the family of problems $(P_{f,\psi})$, where $f \in \mathcal{F}_{\|\cdot\|}(L,\kappa)$ is endowed with a local oracle, $\psi \in \mathcal{U}_{\|\cdot\|}(\lambda, q)$ is endowed with an arbitrary oracle, and the optimal solution of problem $(P_{f,\psi})$ lies in $B_{\|\cdot\|}(0, R)$. A simple observation is that in the case $\lambda = 0$, our model coincides with the classical oracle model, which was discussed in the previous section. The goal now is to prove a more general lower complexity bound for the composite model.

Before proving the theorem, we first provide some building blocks for this construction, borrowed from past work of Guzmán and Nemirovski [2015], Diakonikolas and Guzmán [2020]. In particular, our lower bound works generally for $q$-uniformly convex and locally smoothable spaces.

Assumptions 4.5.
Given the normed space $(\mathbb{E}, \|\cdot\|)$, we consider the following properties:
1. $\psi(x) = \frac{1}{q}\|x\|^q$ is $q$-uniformly convex with constant $\bar{\lambda}$ w.r.t. $\|\cdot\|$.
2. The space $(\mathbb{E}, \|\cdot\|)$ is $(\kappa, \eta, 2\eta, \bar{\mu})$-locally smoothable. That is, there exists a mapping $\mathcal{S}: \mathcal{F}_{(\mathbb{E},\|\cdot\|)}(0, 1) \to \mathcal{F}_{(\mathbb{E},\|\cdot\|)}(\kappa, \bar{\mu})$ (denoted as the smoothing operator in [Diakonikolas and Guzmán, 2020, Definition 2]), such that $\|\mathcal{S}f - f\|_\infty \leq \eta$, and this operator preserves equality of functions when they coincide on a ball of radius $2\eta$; i.e., if $f|_{B_{\|\cdot\|}(0, 2\eta)} = g|_{B_{\|\cdot\|}(0, 2\eta)}$ then $\mathcal{S}f|_{B_{\|\cdot\|}(0, \eta)} = \mathcal{S}g|_{B_{\|\cdot\|}(0, \eta)}$.
3. There exist $\Delta > 0$ and vectors $z_1, \ldots, z_M \in \mathbb{E}^*$ with $\|z_i\|_* \leq 1$, such that for all $s_1, \ldots, s_M \in \{-1, +1\}^M$,
$$\inf_{\alpha\in\Delta_M}\Big\|\sum_{i\in[M]}\alpha_i s_i z_i\Big\|_* \geq \Delta, \qquad (24)$$
where $\Delta_M = \{\alpha \in \mathbb{R}_+^M : \sum_i\alpha_i = 1\}$ is the probability simplex in $M$ dimensions.

The three properties in Assumptions 4.5 are common in the literature, and can be intuitively understood as follows. The first is the existence of a simple function that we can use as the uniformly convex term in the composite model. The second appeared in [Guzmán and Nemirovski, 2015], and provides a simple way to reduce the complexity of smooth convex optimization to its nonsmooth counterpart. We emphasize that there is a canonical way to construct smoothing operators, which is stated in Observation 4.6 below. Finally, the third property comes from the hardness constructions for nonsmooth convex optimization in Nemirovskii and Yudin [1983], which are given by piecewise linear objectives that are learned one by one by an adversarial argument. The fact that the resulting piecewise linear function has a sufficiently negative optimal value (for any adversarial choice of signs) can be directly obtained by minimax duality from Eq. (24).

We point out that $\ell_p^d$ satisfies the assumptions above when $2 \leq p < \infty$.

Observation 4.6 ([Guzmán and Nemirovski, 2015]). Let $2 \leq p < \infty$ and $\eta > 0$, and consider the space $\ell_p^d = (\mathbb{R}^d, \|\cdot\|_p)$. We now verify Assumptions 4.5 for $q = p$, $\bar{\lambda} = 1$, $\bar{\mu} = 2^{-\kappa}(\min\{p, \ln d\}/\eta)^{\kappa-1}$ and $\Delta = 1/M^{1/p}$. Indeed:
1. The $p$-uniform convexity of $\psi$ was discussed after Eq. (18).
2. The smoothing operator can be obtained by infimal convolution, with kernel function $\phi(x) = \frac{1}{2}\|x\|_r^2$ (with $r = \min\{p, \ln d\}$). We recall that the infimal convolution of two functions $f$ and $\phi$ is given by $(f \,\square\, \phi)(x) = \inf_{h\in B_p(0,1)}[f(x+h) + \phi(h)]$.
The infimal convolution above can be adapted to obtain arbitrary uniform approximation to $f$ and the preservation of equality of functions (see [Guzmán and Nemirovski, 2015, Section 2.2] for details).
3. Letting $z_i = e_i$, $i \in [M]$, be the first $M$ canonical vectors, we have
$$\Big\|\sum_{i\in[M]}\alpha_i s_i z_i\Big\|_{p^*} = \|\alpha\|_{p^*} \geq M^{1/p^*-1}\|\alpha\|_1 = M^{-1/p}.$$
This bound is achieved when $\alpha_i = 1/M$, for all $i$.

Before proving the result for $\ell_p$-spaces, we provide a general lower complexity bound for the composite setting, which we will later apply to derive the lower bounds for $\ell_p$ setups.

Lemma 4.7.
Let $(\mathbb{E}, \|\cdot\|)$ be a normed space that satisfies Assumptions 4.5 and let $\mathcal{P}(\mathcal{F}_{\|\cdot\|}(L,\kappa), \mathcal{U}_{\|\cdot\|}(\lambda, q), R)$ be a class of complementary composite problems that satisfies Assumptions 4.4. Suppose the following relations between parameters are satisfied:
(a) $2qL\bar{\lambda}/[\lambda\bar{\mu}] \leq R^{q-1}$;
(b) $(M+3)\eta \leq R$;
(c) $\frac{L}{\bar{\mu}}(M+7)\eta \leq \frac{1}{2q^*}\big(\frac{L\Delta}{\bar{\mu}}\big)^{q^*}\big(\frac{\bar{\lambda}}{\lambda}\big)^{q^*-1}$.

Then, the worst-case optimality gap for the problem class is bounded below by $\frac{1}{2q^*}\big(\frac{L\Delta}{\bar{\mu}}\big)^{q^*}\big(\frac{\bar{\lambda}}{\lambda}\big)^{q^*-1}$.

Proof.
Given $M \in \mathbb{N}$, scalars $\delta_1, \ldots, \delta_M > 0$, and $s_1, \ldots, s_M \in \{-1, +1\}$, we consider the functions
$$f_s(x) = \frac{L}{\bar{\mu}}\mathcal{S}\Big(\max_{i\in[M]}[\langle s_i z_i, \cdot\rangle - \delta_i]\Big)(x), \quad\text{and}\quad \bar{f}_s(x) = f_s(x) + (\lambda/\bar{\lambda})\psi(x),$$
where $\psi$ is given by Assumptions 4.5.

We now show that the composite objective $\bar{f}_s$ satisfies Assumptions 4.4. Properties (A.1) and (A.2) are clearly satisfied. Regarding (A.3), we prove next that the optimum of these functions lies in $B_{\|\cdot\|}(0, R)$. For this, notice that by Assumptions 4.5, Property 2:
$$\bar{f}_s(x) \geq \frac{L}{\bar{\mu}}\max_{i\in[M]}[\langle s_i z_i, x\rangle - \delta_i] - \frac{L\eta}{\bar{\mu}} + \frac{\lambda}{q\bar{\lambda}}\|x\|^q \geq \|x\|\Big[\frac{\lambda}{\bar{\lambda}q}\|x\|^{q-1} - \frac{L}{\bar{\mu}}\Big] - \frac{L}{\bar{\mu}}\big(\eta + \max_i\delta_i\big).$$
We will later show that $\eta + \max_i\delta_i \leq (M+3)\eta/2 \leq R/2$ (the last inequality by (b)); hence, for $\|x\| \geq R$,
$$\bar{f}_s(x) \geq \Big(\frac{\lambda}{\bar{\lambda}q}\|x\|^{q-1} - \frac{2L}{\bar{\mu}}\Big)\|x\| \geq 0,$$
where the last inequality follows from (a). To conclude the verification of Assumption (A.3), we now prove that $\min_{x\in\mathbb{E}}\bar{f}_s(x) < 0$. Again, by Assumptions 4.5, Property 2:
$$\begin{aligned}
\inf_{x\in\mathbb{E}}\bar{f}_s(x) &\leq \inf_{x\in\mathbb{E}}\Big(\frac{L}{\bar{\mu}}\max_{i\in[M]}[\langle s_i z_i, x\rangle - \delta_i] + \frac{L}{\bar{\mu}}\eta + \frac{\lambda}{q\bar{\lambda}}\|x\|^q\Big)\\
&= \max_{\alpha\in\Delta_M}\inf_{x\in\mathbb{E}}\Big(\Big\langle\frac{L}{\bar{\mu}}\sum_{i\in[M]}\alpha_i s_i z_i, x\Big\rangle + \frac{\lambda}{q\bar{\lambda}}\|x\|^q - \frac{L}{\bar{\mu}}\sum_{i\in[M]}\alpha_i\delta_i + \frac{L}{\bar{\mu}}\eta\Big)\\
&= \max_{\alpha\in\Delta_M} -\frac{1}{q^*}\Big(\frac{L}{\bar{\mu}}\Big)^{q^*}\Big(\frac{\bar{\lambda}}{\lambda}\Big)^{q^*-1}\Big\|\sum_{i\in[M]}\alpha_i s_i z_i\Big\|_*^{q^*} - \frac{L}{\bar{\mu}}\sum_{i\in[M]}\alpha_i\delta_i + \frac{L}{\bar{\mu}}\eta\\
&\leq -\frac{1}{q^*}\Big(\frac{L}{\bar{\mu}}\Big)^{q^*}\Big(\frac{\bar{\lambda}}{\lambda}\Big)^{q^*-1}\Delta^{q^*} + \frac{L}{\bar{\mu}}\eta.
\end{aligned}$$
Notice that the second step above follows from the Sion minimax theorem [Sion, 1958]. We conclude that the optimal value of $(P_{f,\psi})$ is negative by (c).

Following the arguments provided in Guzmán and Nemirovski [2015, Proposition 2], one can prove that for any algorithm interacting with oracle $\mathcal{O}_{\mathcal{F}}$, after $M$ steps there exists a choice of $s_1, \ldots, s_M \in \{-1, +1\}^M$ such that $\min_{t\in[M]}f_s(x_t) \geq \frac{L}{\bar{\mu}}[-\eta - \max_{i\in[M]}\delta_i]$; further, for this adversarial argument it suffices that $\min_{i\in[M]}\delta_i = 0$ and $\max_{i\in[M]}\delta_i \geq (M-1)\eta/2$. We conclude that the optimality gap after $M$ steps is bounded below by
$$\min_{t\in[M]}\bar{f}_s(x_t) - \min_{x\in\mathbb{E}}\bar{f}_s(x) \geq -\frac{L}{\bar{\mu}}(M+7)\frac{\eta}{2} + \frac{1}{q^*}\Big(\frac{L}{\bar{\mu}}\Big)^{q^*}\Big(\frac{\bar{\lambda}}{\lambda}\Big)^{q^*-1}\Delta^{q^*} \geq \frac{1}{2q^*}\Big(\frac{L\Delta}{\bar{\mu}}\Big)^{q^*}\Big(\frac{\bar{\lambda}}{\lambda}\Big)^{q^*-1},$$
where we used the third condition from the statement of the lemma.

We now proceed to the lower bounds for $\ell_p$-setups, with $2 \leq p \leq \infty$.

Theorem 4.8. Consider the space $\ell_p^d = (\mathbb{R}^d, \|\cdot\|_p)$, where $2 \leq p < \infty$. Then, the oracle complexity of the problem class $\mathcal{P} := \mathcal{P}(\mathcal{F}_{\|\cdot\|}(L,\kappa), \mathcal{U}_{\|\cdot\|}(\lambda, p), R)$, comprised of composite problems in the form $(P_{f,\psi})$ under Assumptions 4.4, is bounded below by
$$\mathrm{Compl}(\mathcal{P}, (\mathcal{O}_{\mathcal{F}}, \mathcal{O}_\psi), \epsilon) \geq \begin{cases} \big\lfloor\sqrt{L/\lambda}\big\rfloor - 1, & \text{if } p = \kappa = 2,\ \epsilon < \sqrt{\lambda L}\,R^2\min\{\lambda/L, 1\},\\[4pt] \dfrac{C(p,\kappa)}{\min\{p, \ln d\}^{\kappa-1}}\Big(\dfrac{L^p}{\lambda^\kappa\epsilon^{p-\kappa}}\Big)^{\frac{1}{\kappa p+\kappa-p}}, & \text{if } 1 \leq \kappa < p,\ p \in [2, \infty],\ \lambda \geq \tilde{\lambda}, \end{cases}$$
where
$$C(p, \kappa) := \bigg(\Big(\frac{p-1}{p}\Big)^{\kappa(p-1)}2^{\frac{(p-\kappa)(1-p)+(\kappa-2)p(2p-1)}{p-1}}\bigg)^{\frac{1}{\kappa p+\kappa-p}}$$
is bounded below by an absolute constant, and
$$\tilde{\lambda} := C\max\Bigg\{\min\{p, \ln d\}\Big(\frac{\epsilon^\kappa}{LR^\kappa}\Big)^{\kappa-1},\ \bigg(\min\{p, \ln d\}\frac{\epsilon^p}{L^{p+1}R^{(p-1)(\kappa p+\kappa-p)(\kappa-1)}}\bigg)^{\frac{\kappa-1}{\kappa p+1-p}}\Bigg\}, \qquad (25)$$
where $C > 0$ is a universal constant.

In particular, our lower bounds show that the algorithm presented in the previous sections, particularly the rates stated in Theorem 2.3, is nearly optimal.
In the case $p = \kappa = 2$, the gap between the upper and lower bounds is only a factor which grows at most logarithmically in $L\phi(\bar{x}^*)/\epsilon$, and in the case $\kappa < p$, the gap is $O\big(\log(L\phi(\bar{x}^*)/\epsilon)\cdot\min\{p, \ln d\}^{\Theta(1)}\big)$. In both cases, the gaps are quite moderate, so the proposed algorithm is proved to be nearly optimal. Finally, we would also like to emphasize that $C(p,\kappa) = \Theta(1)$ as a function of $1 < \kappa \leq 2$ and $2 \leq p \leq \infty$. Therefore, the lower bounds also apply to the case $p = \infty$.

Proof of Theorem 4.8.
By Observation 4.6, in the case of $\ell_p^d$ with $2 \leq p < \infty$, Assumptions 4.5 are satisfied with $q = p$, $\Delta = 1/M^{1/p}$, $\bar{\lambda} = 1$, and $\bar{\mu} = 2^{-\kappa}(\min\{p, \ln d\}/\eta)^{\kappa-1}$ (for given $\eta > 0$). This way, hypotheses (a), (b), (c) in Lemma 4.7 become:
(a) $\eta \leq \min\{p, \ln d\}\big(\frac{\lambda R^{p-1}}{2^{\kappa+1}pL}\big)^{\frac{1}{\kappa-1}}$;
(b) $(M+3)\eta \leq R$;
(c) $\eta^{p-\kappa} \leq \frac{2^{\kappa-p+1}L}{(p^*)^{p-1}\min\{p,\ln d\}^{\kappa-1}\lambda M(M+7)^{p-1}}$.

Case 1: $p = \kappa = 2$. In order to satisfy (c), it suffices to choose $M = \big\lfloor\sqrt{L/\lambda}\big\rfloor - 1$. Given such a choice, to satisfy (a), (b) of the lemma, we can choose
$$\eta = \min\Big\{\frac{\lambda R}{L}, \frac{R}{M+3}\Big\} \geq R\sqrt{\frac{\lambda}{L}}\min\Big\{\sqrt{\frac{\lambda}{L}}, 1\Big\}.$$
Now, under the conditions imposed above, the lemma provides an optimality gap lower bound of
$$\frac{1}{\lambda}\Big(\frac{L\eta}{\sqrt{M}}\Big)^2 \geq \sqrt{\lambda L}\,R^2\min\Big\{\frac{\lambda}{L}, 1\Big\}.$$
In conclusion, if $\epsilon < \sqrt{\lambda L}\,R^2\min\{\lambda/L, 1\}$, then $\mathrm{Compl}(\mathcal{P}, (\mathcal{O}_{\mathcal{F}}, \mathcal{O}_\psi), \epsilon) \geq \big\lfloor\sqrt{L/\lambda}\big\rfloor - 1$.

Case 2: $p > \kappa$ (where $1 < \kappa \leq 2$, $2 \leq p < \infty$). Here, to ensure (a), (b) it suffices that
$$\eta \leq \min\Big\{\frac{R}{M+3},\ \min\{p, \ln d\}\Big(\frac{\lambda R^{p-1}}{2^{\kappa+1}pL}\Big)^{\frac{1}{\kappa-1}}\Big\}. \qquad (26)$$
We will later certify that these conditions hold. On the other hand, for (c) it suffices to let
$$\eta = \bigg[\Big(\frac{p-1}{p}\Big)^{p-1}\frac{2^{\kappa-p+1}L}{\lambda\min\{p, \ln d\}^{\kappa-1}M(M+7)^{p-1}}\bigg]^{\frac{1}{p-\kappa}}.$$
Under this choice, the lemma yields an optimality gap lower bound of
$$\frac{1}{2p^*}\Big(\frac{2^{\kappa p}L^p\eta^{(\kappa-1)p}}{\lambda M\min\{p, \ln d\}^{(\kappa-1)p}}\Big)^{\frac{1}{p-1}} = \bigg(\Big(\frac{p-1}{p}\Big)^{\kappa(p-1)}2^{\frac{(p-\kappa)(1-p)+(\kappa-2)p(2p-1)}{p-1}}\cdot\frac{L^p}{\min\{p, \ln d\}^{\frac{p(\kappa-1)(\kappa p-\kappa+1)}{p-1}}\lambda^\kappa(M+7)^{\kappa p+\kappa-p}}\bigg)^{\frac{1}{p-\kappa}}.$$
Let $C(p, \kappa) := \Big(\big(\frac{p-1}{p}\big)^{\kappa(p-1)}2^{\frac{(p-\kappa)(1-p)+(\kappa-2)p(2p-1)}{p-1}}\Big)^{\frac{1}{\kappa p+\kappa-p}}$. In particular, if $\epsilon$ is smaller than the gap above, resolving for $M$ gives
$$\mathrm{Compl}(\mathcal{P}, (\mathcal{O}_{\mathcal{F}}, \mathcal{O}_\psi), \epsilon) \geq M = \frac{C(p,\kappa)}{\min\{p, \ln d\}^{\kappa-1}}\Big(\frac{L^p}{\lambda^\kappa\epsilon^{p-\kappa}}\Big)^{\frac{1}{\kappa p+\kappa-p}}, \qquad (27)$$
where we further simplified the bound, noting that $\frac{p(\kappa-1)(\kappa p-\kappa+1)}{(p-1)(p\kappa+\kappa-p)} \leq \kappa - 1$. Now, given the chosen value of $M$, we will verify that (26) holds.
For this, we note that (26) is implied by the following pair of inequalities:
$$\lambda \geq C'(p,\kappa)\min\{p, \ln d\}\Big(\frac{\epsilon^\kappa}{LR^\kappa}\Big)^{\kappa-1} \qquad (28)$$
$$\lambda \geq C''(p,\kappa)\bigg(\min\{p, \ln d\}\frac{\epsilon^p}{L^{p+1}R^{(p-1)(\kappa p+\kappa-p)(\kappa-1)}}\bigg)^{\frac{\kappa-1}{\kappa p+1-p}} \qquad (29)$$
where $C'(p,\kappa), C''(p,\kappa) \geq C > 0$ are bounded below by a universal positive constant. Therefore, there exists a universal constant $C > 0$ such that if $\lambda$ satisfies Eqs. (28) and (29) with $C'(p,\kappa), C''(p,\kappa)$ replaced by $C$, then the lower complexity bound from Eq. (27) holds.

Remark 4.9.
Observe that the lower bounds from Theorem 4.8 apply only when $\lambda$ is sufficiently large, which is consistent with the behavior of our algorithm, which for small values of $\lambda$ obtains an iteration complexity matching the classical smooth setting (as if we ignored the uniform convexity of the objective).

5 Applications to Regression Problems

We now provide some interesting applications of the results from Sections 2 and 3 to different regression problems. In typical applications, the data matrix $A$ is assumed to have fewer rows than columns, so that the system $Ax = b$, where $b$ is the vector of labels, is underdetermined, and one seeks a sparse solution $x^*$ that provides a good linear fit between the data and the labels.

One of the simplest applications of our framework is to elastic net regularization, introduced by Zou and Hastie [2005]. Elastic net regularized problems are of the form
$$\min_{x\in\mathbb{R}^d} f(x) + \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2;$$
i.e., the elastic net regularization combines the lasso and ridge regularizers. The function $f$ is assumed to be $(L, 2)$-weakly smooth (i.e., $L$-smooth) w.r.t. the Euclidean norm $\|\cdot\|_2$. It is typically chosen as either the linear least squares or the logistic loss.

We can apply the results from Section 2 to this problem with $q = \kappa = 2$, choosing $\psi(x) = \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2$ and $\phi(x) = \frac{1}{2}\|x - x_0\|_2^2$. Observe that our algorithm only needs to solve subproblems of the form
$$\min_{x\in\mathbb{R}^d}\Big\{\langle z, x\rangle + \frac{\lambda''}{2}\|x\|_2^2 + \lambda'\|x\|_1\Big\},$$
for $z \in \mathbb{R}^d$ and fixed parameters $\lambda', \lambda''$, which is computationally inexpensive, as the problem under the min is separable.

Applying Theorem 2.3, elastic net regularized problems can be solved to any accuracy $\epsilon > 0$ using
$$k = O\bigg(\min\bigg\{\sqrt{\frac{L}{\lambda_2}}\log\Big(\frac{L\|x^* - x_0\|_2^2}{\epsilon}\Big),\ \sqrt{\frac{L\|x^* - x_0\|_2^2}{\epsilon}}\bigg\}\bigg)$$
iterations, where $x^* \in \mathbb{R}^d$ is the problem minimizer.
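The separable subproblem above can be solved coordinatewise by soft-thresholding. A minimal sketch (function name and instance are ours):

```python
def elastic_net_subproblem(z, lam1, lam2):
    """Closed-form solution of the separable subproblem
        min_x  <z, x> + lam1 * ||x||_1 + (lam2 / 2) * ||x||_2^2.
    Each coordinate independently minimizes
        z_i * t + lam1 * |t| + (lam2 / 2) * t^2,
    whose solution is the soft-thresholding step
        t = -sign(z_i) * max(|z_i| - lam1, 0) / lam2."""
    out = []
    for zi in z:
        mag = max(abs(zi) - lam1, 0.0) / lam2
        out.append(-mag if zi > 0 else mag)
    return out

x = elastic_net_subproblem([3.0, -0.5, 1.0], lam1=1.0, lam2=2.0)
```

One can check first-order optimality directly: for the first coordinate, $z_1 + \lambda_1\,\mathrm{sign}(t) + \lambda_2 t = 3 - 1 - 2 = 0$ at $t = -1$.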
Bridge regression problems were originally introduced by Frank and Friedman [1993], and are defined by
$$\min_{x\in\mathbb{R}^d:\,\|x\|_p\leq t}\frac{1}{2}\|Ax - b\|_2^2, \qquad (30)$$
where $t$ is a positive scalar, $p \in [1, 2]$, $A$ is the matrix of observations, and $b$ is the vector of labels. In particular, for $p = 1$ the problem reduces to lasso, while for $p = 2$ we recover ridge regression.

Bridge regression has traditionally been used either as an interpolation between lasso and ridge regression, or to model Bayesian priors with the exponential power distribution (see Park and Casella [2008] and Hastie et al. [2009, Section 3.4.3]). The problem is often posed in the equivalent (due to Lagrangian duality) penalized (or regularized) form:
$$\min_{x\in\mathbb{R}^d}\Big\{\frac{1}{2}\|Ax - b\|_2^2 + \frac{\lambda}{p}\|x\|_p^p\Big\}.$$
Writing the regularizer as $\frac{1}{p}\|x\|_p^p$ is typically chosen due to its separable form. However, using a different parametrization, the problem from Eq. (30) is also equivalent to
$$\min_{x\in\mathbb{R}^d}\Big\{\frac{1}{2}\|Ax - b\|_2^2 + \frac{\lambda}{2}\|x\|_p^2\Big\}, \qquad (31)$$
which is more convenient for the application of our results, as $\frac{1}{2}\|x\|_p^2$ is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_p$.

Further, looking at the gradient $\nabla f(x) = A^TAx - A^Tb$ of $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, it is not hard to argue that $f(x)$ is $L_p$-smooth w.r.t. $\|\cdot\|_p$, where
$$L_p = \|A^TA\|_{p\to p^*} = \sup_{x\in\mathbb{R}^d:\,\|x\|_p\neq 0}\frac{\|A^TAx\|_{p^*}}{\|x\|_p}.$$
Namely, this follows as $\|\nabla f(x) - \nabla f(y)\|_{p^*} = \|A^TA(x - y)\|_{p^*} \leq \|A^TA\|_{p\to p^*}\|x - y\|_p$.

An interesting feature of the formulation in Eq. (31) is that it implies a certain trade-off between the $p^*$-fit of the data and the $p$-norm of the regressor. Namely, if $\bar{x}^*$ solves the problem from Eq. (31), then $\|A^T(A\bar{x}^* - b)\|_{p^*} = \lambda\|\bar{x}^*\|_p$.
(32) This simply follows by setting the gradient of $\frac{1}{2}\|Ax - b\|_2^2 + \frac{\lambda}{2}\|x\|_p^2$ to zero, and using that $\big\|\nabla\big(\frac{1}{2}\|x\|_p^2\big)\big\|_{p^*} = \|x\|_p$, $\forall x \in \mathbb{R}^d$ (see Proposition 1.5).

More recently, related problems of the form $\min_{x\in\mathbb{R}^d}\big\{\ell(x, A, b) + \lambda'\|x\|_p\big\}$, where $\ell(x, A, b)$ is a more general loss function, have been used in distributionally robust optimization (see Blanchet et al. [2019]). Again, a different parametrization of the same problem leads to the equivalent form
$$\min_{x\in\mathbb{R}^d}\Big\{\ell(x, A, b) + \frac{\lambda}{2}\|x\|_p^2\Big\}, \qquad (33)$$
and our results can be applied as long as $\ell(x, A, b)$ is $L_p$-smooth w.r.t. $\|\cdot\|_p$. Note that, by the inequalities relating $\ell_p$-norms, any function that is $L$-smooth w.r.t. $\|\cdot\|_2$ is also $L$-smooth w.r.t. $\|\cdot\|_p$ for $p \in [1, 2]$. That is, for $p \in [1, 2]$, the smoothness parameter w.r.t. $\|\cdot\|_p$ can only be lower than the smoothness parameter w.r.t. $\|\cdot\|_2$, often being significantly lower.
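The trade-off identity (32) can be checked exactly in the ridge case $p = 2$, where (31) has the closed-form solution $x = (A^TA + \lambda I)^{-1}A^Tb$ and, by the stationarity condition, $A^T(Ax - b) = -\lambda x$. The small instance below (chosen by us, plain Python, no external dependencies) verifies this:

```python
def matvec(M, v):
    return [sum(Mij * vj for Mij, vj in zip(row, v)) for row in M]

A = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
b = [1.0, 2.0, 3.0]
lam = 0.7

At = [[A[i][j] for i in range(3)] for j in range(2)]
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
Atb = matvec(At, b)

# Solve (A^T A + lam I) x = A^T b by Cramer's rule (2x2 system).
M = [[AtA[0][0] + lam, AtA[0][1]], [AtA[1][0], AtA[1][1] + lam]]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
x = [(Atb[0] * M[1][1] - M[0][1] * Atb[1]) / det,
     (M[0][0] * Atb[1] - Atb[0] * M[1][0]) / det]

resid = [ri - bi for ri, bi in zip(matvec(A, x), b)]   # A x - b
lhs = sum(v * v for v in matvec(At, resid)) ** 0.5     # ||A^T (A x - b)||_2
rhs = lam * sum(v * v for v in x) ** 0.5               # lam * ||x||_2
```

For general $p \in (1, 2)$ the identity holds in the dual norm $\|\cdot\|_{p^*}$, but the minimizer no longer has a closed form.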
A direct application of our result from Theorem 2.3 tells us that we can approximate the problem from Eq. (31) to accuracy $\epsilon > 0$ using
$$k = O\bigg(\min\bigg\{\sqrt{\frac{L_p}{\lambda(p-1)}}\log\Big(\frac{L_p\|\bar{x}^* - x_0\|_p^2}{\epsilon}\Big),\ \sqrt{\frac{L_p\|\bar{x}^* - x_0\|_p^2}{\epsilon}}\bigg\}\bigg) \qquad (34)$$
iterations of Generalized AGD+ from Eq. (7).

Further, using Corollary 2.4, we get that within the same number of iterations the output point $y_k$ of the algorithm satisfies $\|y_k - \bar{x}^*\|_p \leq \sqrt{\frac{2\epsilon}{\lambda(p-1)}}$. Additionally, for quadratic losses, using the triangle inequality and Eq. (32), we have the following "goodness of fit" guarantee:
$$\|A^T(Ay_k - b)\|_{p^*} \leq \|A^TA(y_k - \bar{x}^*)\|_{p^*} + \lambda\|\bar{x}^*\|_p \leq L_p\sqrt{\frac{2\epsilon}{\lambda(p-1)}} + \lambda\|\bar{x}^*\|_p.$$
Finally, note that it is possible to apply our algorithm to $\ell_1$ regularized problems (lasso), applying the results from Theorem 2.3 with $\psi(x) = \lambda\|x\|_1$ and $\phi(x) = \frac{1}{2}\|x - x_0\|_2^2$. In this case, as $\psi$ is not strongly convex, the resulting bound is $k = O\Big(\sqrt{\frac{L\|\bar{x}^* - x_0\|_2^2}{\epsilon}}\Big)$, which matches the iteration complexity of FISTA [Beck and Teboulle, 2009].

The Dantzig selector problem, introduced by Candès and Tao [2007], consists in solving problems of the form
$$\min_{x\in\mathbb{R}^d:\,\|x\|_1\leq t}\|A^T(Ax - b)\|_\infty, \quad\text{or, equivalently,}\quad \min_{x\in\mathbb{R}^d:\,\|A^T(Ax - b)\|_\infty\leq t}\|x\|_1,$$
where $t$ is some positive parameter. Similar to the other regression problems described in this section, the Dantzig selector problem can be considered in its unconstrained, regularized form. One variant of the problem that can be addressed with our algorithm is
$$\min_{x\in\mathbb{R}^d}\Big\{\frac{1}{2}\|A^T(Ax - b)\|_{p^*}^2 + \frac{\lambda}{2}\|x\|_p^2\Big\}, \qquad (35)$$
where $p$ is chosen sufficiently close to one, so that $\|\cdot\|_p$ closely approximates $\|\cdot\|_1$ and $\|\cdot\|_{p^*}$ closely approximates $\|\cdot\|_\infty$, where $\frac{1}{p} + \frac{1}{p^*} = 1$. In particular, when $p^* = [\log d]/\ln(1+\epsilon)$ we have that
$$(1-\epsilon)\|x\|_1 \leq \|x\|_p \leq \|x\|_1 \quad\text{and}\quad \|x\|_\infty \leq \|x\|_{p^*} \leq (1+\epsilon)\|x\|_\infty, \quad \forall x\in\mathbb{R}^d.$$
As discussed at the beginning of Section 3, in this case $\psi(x) = \frac{\lambda}{2}\|x\|_p^2$ is $\lambda(p-1) = \Theta\big(\frac{\lambda\epsilon}{\log(d)}\big)$-strongly convex w.r.t. $\|\cdot\|_p$ and, by the relationship between the norms, is also strongly convex w.r.t. $\|\cdot\|_1$ with a strong convexity constant of the same order. Further, $f(x) = \frac{1}{2}\|A^T(Ax - b)\|_{p^*}^2$ can be shown to be $L$-smooth w.r.t. $\|\cdot\|_1$, for $L = (1+\epsilon)(p^*-1)A_{\max}^2 = \Theta\big(\frac{\log d}{\epsilon}A_{\max}^2\big)$, where $A_{\max} = \max_{1\leq i,j\leq d}|(A^TA)_{ij}|$. This can be done as follows. Using that $\frac{1}{2}\|\cdot\|_{p^*}^2$ is $(p^*-1)$-smooth w.r.t. $\|\cdot\|_{p^*}$ (as $p^* > 2$), we have, $\forall x, y \in \mathbb{R}^d$,
$$\begin{aligned}
\|\nabla f(x) - \nabla f(y)\|_\infty &\leq \|\nabla f(x) - \nabla f(y)\|_{p^*} \leq (p^*-1)\|(A^TA)(x - y)\|_{p^*}\\
&\leq (p^*-1)\|A^TA\|_{1\to p^*}\|x - y\|_1 \leq (p^*-1)(1+\epsilon)\|A^TA\|_{1\to\infty}\|x - y\|_1\\
&= (1+\epsilon)(p^*-1)\max_{1\leq i,j\leq d}|(A^TA)_{ij}|\cdot\|x - y\|_1.
\end{aligned}$$
Hence, applying Theorem 2.3, we have that the problem from Eq. (35) can be approximated to arbitrary additive error $\bar{\epsilon}$ with
$$k = O\bigg(\sqrt{\frac{A_{\max}^2}{\lambda}}\,\frac{\log(d)}{\bar{\epsilon}}\log\Big(\frac{\log(d)A_{\max}^2\|\bar{x}^* - x_0\|_1^2}{\bar{\epsilon}}\Big)\bigg)$$
iterations of Generalized AGD+ from Section 2.

Similar to bridge regression, there is an interesting trade-off between the $\ell_1$ norm of the regressor and the goodness of fit revealed by the formulation we consider (Eq. (35)). In particular, using that at an optimal solution $\bar{x}^*$ the gradient of the objective from Eq. (35) is zero and using Proposition 1.5,
$$(1-\epsilon)\lambda\|\bar{x}^*\|_1 \leq \lambda\|\bar{x}^*\|_p = \lambda\Big\|\nabla\Big(\frac{1}{2}\|\bar{x}^*\|_p^2\Big)\Big\|_{p^*} = \Big\|\nabla\Big(\frac{1}{2}\|A^T(A\bar{x}^* - b)\|_{p^*}^2\Big)\Big\|_{p^*} \leq \|A^TA\|_{p\to p^*}\|A^T(A\bar{x}^* - b)\|_{p^*} \leq (1+\epsilon)^2A_{\max}\|A^T(A\bar{x}^* - b)\|_\infty.$$
Hence, $\lambda\|\bar{x}^*\|_1 \leq (1 + O(\epsilon))A_{\max}\|A^T(A\bar{x}^* - b)\|_\infty$. As the $\ell_1$ norm of the regressor is considered a proxy for sparsity, this bound provides a trade-off between the parsimony of the model and the goodness of fit, as a function of the regularization parameter $\lambda$.

$\ell_p$ Regression
Standard $\ell_p$-regression problems have as their goal finding a vector $x^*$ that minimizes $\|Ax - b\|_p$, where $p \ge 1$. When $p = 1$ or $p = \infty$, this problem can be solved using linear programming. More generally, when $p \notin \{1, 2, \infty\}$, the problem is nonlinear, and multiple approaches have been developed for solving it, including, e.g., a homotopy-based solver [Bubeck et al., 2018], solvers based on iterative refinement [Adil et al., 2019a, Adil and Sachdeva, 2020], and solvers based on the classical method of iteratively reweighted least squares [Ene and Vladu, 2019, Adil et al., 2019b]. Such solvers typically rely on fast linear system solves and attain logarithmic dependence on the inverse accuracy $1/\epsilon$, at the cost of an iteration count scaling polynomially with one of the dimensions of $A$ (typically the lower dimension, which is equal to the number of rows $m$), with each iteration requiring a constant number of linear system solves.

Here, we consider algorithmic setups in which the iteration count is dimension-independent and no linear system solves are required, but the dependence on $1/\epsilon$ is polynomial. First, for standard $\ell_p$-regression problems, we can use a non-composite variant of the algorithm (with $\psi(\cdot) = 0$), while relying on the fact that the function $\frac{1}{q}\|\cdot\|_p^q$ with $q = \min\{2, p\}$ is $(1, p)$-weakly smooth for $p \in (1, 2]$ and $(p-1, 2)$-weakly smooth for $p \ge 2$. Using this fact, it follows that the function $f_p(x) = \frac{1}{q}\|Ax - b\|_p^q$ is $(L_p, q)$-weakly smooth w.r.t. $\|\cdot\|_p$, with $L_p = \max\{p-1, 1\}\,\|A\|_{p\to p}^q$. On the other hand, the function $\phi(x) = \frac{1}{\bar{q}\min\{p-1, 1\}}\|x - x_0\|_p^{\bar{q}}$, where $\bar{q} = \max\{2, p\}$, is $(1, \bar{q})$-uniformly convex w.r.t. $\|\cdot\|_p$.
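The Hölder continuity of the gradient underlying the weak-smoothness claim can be illustrated numerically. The sketch below is not from the paper: it checks, for $p \in (1, 2]$, that the gradient of $\frac{1}{p}\|\cdot\|_p^p$ is $(p-1)$-Hölder between the $\ell_p$/$\ell_{p^*}$ norm pair; the constant $2^{2-p}$ used here is a simple crude bound (obtained componentwise from concavity of $t \mapsto t^{p-1}$), not the sharp constant behind the $(1, p)$-weak-smoothness statement in the text:

```python
import math

def lp_norm(x, p):
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def grad_lp_pow(x, p):
    # gradient of (1/p)||x||_p^p has components sign(x_i)*|x_i|^(p-1)
    return [math.copysign(abs(v) ** (p - 1), v) for v in x]

p = 1.5
p_star = p / (p - 1)  # conjugate exponent: 1/p + 1/p* = 1
pairs = [([0.4, -1.1, 2.3], [-0.2, 0.9, 1.7]),
         ([1.0, 0.0, -3.0], [0.5, 0.5, -2.5])]
for u, v in pairs:
    diff_g = [a - b for a, b in zip(grad_lp_pow(u, p), grad_lp_pow(v, p))]
    lhs = lp_norm(diff_g, p_star)
    # Hoelder bound with crude constant 2^(2-p):
    rhs = 2 ** (2 - p) * lp_norm([a - b for a, b in zip(u, v)], p) ** (p - 1)
    assert lhs <= rhs + 1e-9
```

A constant strictly greater than one is genuinely needed here: already in one dimension, taking $u = 1$, $v = -1$ gives $\|\nabla g(u) - \nabla g(v)\|_{p^*} = 2 > 2^{p-1} = \|u - v\|_p^{p-1}$ for $p < 2$.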
Thus, applying Theorem 2.3, we find that we can construct a point $y_k \in \mathbb{R}^d$ such that $f_p(y_k) - f_p(x^*) \le \epsilon$, where $x^* \in \operatorname{argmin}_{x\in\mathbb{R}^d} f_p(x)$, with at most

$k = O\Big(\Big(\tfrac{\|A\|_{p\to p}^p\,\|x^* - x_0\|_p^p}{\epsilon}\Big)^{\frac{2}{3p-2}}\Big)$, if $p \in (1, 2]$; $\quad k = O\Big(\Big(\tfrac{(p-1)\|A\|_{p\to p}^2\,\|x^* - x_0\|_p^2}{\epsilon}\Big)^{\frac{p}{p+2}}\Big)$, if $p \ge 2$

iterations of Generalized AGD+. The same result can be obtained by applying the iteration-complexity-optimal algorithms for smooth minimization over $\ell_p$-spaces [Nemirovskii and Nesterov, 1985, d'Aspremont et al., 2018]. More interesting for our framework is $\ell_p$ regression on correlated errors, described in the following.

$\ell_p$-regression on correlated errors. As argued in Candès and Tao [2007], there are multiple reasons why minimizing the correlated errors $A^T(Ax - b)$ in place of the standard errors $Ax - b$ is more meaningful for many applications. First, unlike standard errors, correlated errors are invariant to orthonormal transformations of the data. Indeed, if $U$ is a matrix with orthonormal columns, then $(UA)^T(UAx - Ub) = A^T(Ax - b)$, but the same cannot be established for the standard error $Ax - b$. Other reasons involve ensuring that the model includes explanatory variables that are highly correlated with the data, which is only possible to argue when working with correlated errors (see Candès and Tao [2007] for more information).

Within our framework, minimization of correlated errors in $\ell_p$-norms can be reduced to making the gradient small in the $\ell_p$-norm; i.e., to applying the results from Section 3. In particular, consider the function $f(x) = \frac{1}{2}\|Ax - b\|_2^2$. Then $\nabla f(x) = A^T(Ax - b)$. Further, the function $f$ is $L_{p^*}$-smooth w.r.t. $\|\cdot\|_{p^*}$, where $L_{p^*} = \|A^TA\|_{p^*\to p}$.
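The invariance of correlated errors under orthonormal transformations noted above is straightforward to verify numerically. A minimal stdlib-only sketch (not from the paper; the particular $3\times 2$ matrix $U$ below is an arbitrary example with orthonormal columns):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    return [list(c) for c in zip(*A)]

A = [[1.0, 2.0], [3.0, -1.0]]
b = [0.5, -2.0]
x = [1.0, 4.0]

t = math.pi / 5
U = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)], [0.0, 0.0]]  # orthonormal columns

# correlated errors before and after transforming the data by U
r1 = matvec(transpose(A), [av - bv for av, bv in zip(matvec(A, x), b)])
UA = matmul(U, A)
Ub = matvec(U, b)
r2 = matvec(transpose(UA), [av - bv for av, bv in zip(matvec(UA, x), Ub)])
print(r1, r2)  # identical up to floating-point error
```

The equality is immediate algebraically: $(UA)^T(UAx - Ub) = A^T U^T U (Ax - b) = A^T(Ax - b)$, since $U^T U = I$ for a matrix with orthonormal columns.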
Applying the results from Theorem 3.1, it follows that, for any $\epsilon > 0$, we can construct a vector $y_k \in \mathbb{R}^d$ with $\|A^T(Ay_k - b)\|_p \le \epsilon$, where $\frac{1}{p} + \frac{1}{p^*} = 1$, with at most

$k = \widetilde{O}\Big(\Big(\tfrac{\|A^TA\|_{p^*\to p}\|x^* - x_0\|_{p^*}}{\epsilon}\Big)^{\frac{p^*-1}{p^*}}\Big)$, if $p \in (1, 2]$; $\quad k = \widetilde{O}\Big(\sqrt{\tfrac{\|A^TA\|_{p^*\to p}\|x^* - x_0\|_{p^*}}{\epsilon}}\Big)$, if $p > 2$

iterations of Generalized AGD+, where $\widetilde{O}$ hides a factor that is logarithmic in $1/\epsilon$ and where each iteration takes time linear in the number of non-zeros of $A$.

The algorithms we propose in this work are not limited to $\ell_p$ settings, but apply more generally to uniformly convex spaces. A notable example of such spaces are the Schatten spaces $S_p := (\mathbb{R}^{d\times d}, \|\cdot\|_{S,p})$, where $\|X\|_{S,p} = \big(\sum_{j\in[d]}\sigma_j(X)^p\big)^{1/p}$ and $\sigma_1(X), \dots, \sigma_d(X)$ are the singular values of $X$. In particular, the aforementioned $\ell_p$-regression problems have their natural spectral counterparts; e.g., given a linear operator $\mathcal{A}: \mathbb{R}^{d\times d} \to \mathbb{R}^k$ and $b \in \mathbb{R}^k$,

$\min_{X\in\mathbb{R}^{d\times d}}\ \frac{1}{l}\|\mathcal{A}X - b\|_q^l + \frac{\lambda}{r}\|X\|_{S,p}^r.$

The most popular example of such a formulation comes from the nuclear norm relaxation for low-rank matrix completion [Recht et al., 2010, Chandrasekaran et al., 2012, Nesterov and Nemirovski, 2013]. We observe that the exact formulation of the problem may vary, but by virtue of Lagrangian relaxation we can interchangeably consider these different formulations as equivalent (modulo an appropriate choice of the regularization/constraint parameters).

To apply our algorithms to Schatten-norm settings, we observe that the functions below are $(1, r)$-uniformly convex, with $r = \max\{2, p\}$:

$\Psi_{S,p}(X) = \begin{cases} \frac{1}{2(p-1)}\|X\|_{S,p}^2, & \text{if } p \in (1, 2], \\ \frac{1}{p}\|X\|_{S,p}^p, & \text{if } p \in (2, +\infty). \end{cases}$
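To make the Schatten-norm definition concrete, here is a small numerical sketch (not from the paper) computing $\|X\|_{S,p}$ for a $2\times 2$ matrix, where the singular values are available in closed form as the square roots of the eigenvalues of $X^T X$:

```python
import math

def schatten_norm_2x2(X, p):
    # entries of X^T X for a 2x2 matrix X
    a = X[0][0] ** 2 + X[1][0] ** 2
    c = X[0][1] ** 2 + X[1][1] ** 2
    b = X[0][0] * X[0][1] + X[1][0] * X[1][1]
    # eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]]
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    sigmas = [math.sqrt(max((a + c + disc) / 2, 0.0)),
              math.sqrt(max((a + c - disc) / 2, 0.0))]
    return sum(s ** p for s in sigmas) ** (1.0 / p)

X = [[2.0, 0.0], [0.0, -3.0]]   # singular values 2 and 3
print(schatten_norm_2x2(X, 1))  # nuclear norm: 5.0
print(schatten_norm_2x2(X, 2))  # Frobenius norm: sqrt(13)
```

For general matrices one would, of course, compute the singular values with an SVD routine; the closed form above is only to keep the sketch dependency-free.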
On the other hand, notice that, more generally than for regression problems, for composite objectives $f(X) + \lambda\Psi_{S,p}(X - X_0)$, if the function $f$ is unitarily invariant and convex, there is a well-known formula for its subdifferential, based on the subdifferential of its vector counterpart (there is a one-to-one correspondence between unitarily invariant functions on $\mathbb{R}^{d\times d}$ and absolutely symmetric functions on $\mathbb{R}^d$) [Lewis, 1995]. Even if $f$ is not unitarily invariant, in the case of regression problems the gradients can be computed explicitly. On the other hand, the regularizer $\Psi_{S,p}$ admits efficiently computable solutions to problems from Eq. (2), given its unitary invariance (see, e.g., Beck [2017, Section 7.3.2]).

Iteration complexity bounds obtained with these regularizers are analogous to those obtained in the $\ell_p$ setting. On the other hand, the lower complexity bounds proved in Section 4 also apply to Schatten spaces by a diagonal embedding of $\ell_p^d$; hence, all the optimality/suboptimality results established for $\ell_p$ carry over to $S_p$.

Conclusion

We presented a general algorithmic framework for complementary composite optimization, where the objective function is the sum of two functions with complementary properties – (weak) smoothness and uniform/strong convexity. The framework has a number of interesting applications, including making the gradient of a smooth function small in general norms and different regression problems that frequently arise in machine learning. We also provided lower bounds that certify near-optimality of our algorithmic framework for the majority of standard $\ell_p$ and $S_p$ setups.

Some interesting questions for future work remain. For example, the regularization-based approach that we employed for gradient norm minimization leads to near-optimal oracle complexity bounds only when the objective function is smooth and the norm of the space is strongly convex (i.e., when the $p^*$-norm of the gradient is sought for $p^* \ge 2$).
The primary reason for this is that these are the only settings in which complementary composite minimization leads to linear convergence. As the bounds we obtain for complementary composite minimization are near-tight, this represents a fundamental limitation of the direct regularization-based approach. It is an open question whether the non-tight bounds for gradient norm minimization can be improved using some type of recursive regularization, as in Allen-Zhu [2018]. Of course, there are clear challenges in trying to generalize such an approach to non-Euclidean norms, caused by the fundamental limitation that non-Euclidean norms cannot be simultaneously smooth and strongly convex, as discussed at the beginning of the paper. Another interesting question is whether there exist direct (not regularization-based) algorithms for minimizing general gradient norms that converge with (near-)optimal oracle complexity.

References
Deeksha Adil and Sushant Sachdeva. Faster $p$-norm minimizing flows, via smoothed $q$-norm problems. In Proc. ACM-SIAM SODA'20, 2020.
Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for $\ell_p$-norm regression. In Proc. ACM-SIAM SODA'19, 2019a.
Deeksha Adil, Richard Peng, and Sushant Sachdeva. Fast, provably convergent IRLS algorithm for $p$-norm linear regression. In Proc. NeurIPS'19, 2019b.
Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In Proc. NeurIPS'18, 2018.
Keith Ball, Eric A. Carlen, and Elliott H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115(1):463–482, 1994.
Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2017.
Heinz H. Bauschke, Jérôme Bolte, Jiawei Chen, Marc Teboulle, and Xianfu Wang. On linear convergence of non-Euclidean gradient methods without strong convexity and Lipschitz gradient continuity. Journal of Optimization Theory and Applications, 182(3):1068–1087, 2019.
Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization, 2017.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.
J. Borwein, A. J. Guirao, P. Hájek, and J. Vanderwerff. Uniformly convex functions on Banach spaces. Proceedings of the AMS, 137(3):1081–1091, 2009.
Jonathan M. Borwein and Qiji J. Zhu. Techniques of Variational Analysis. Springer, 2004.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee, and Yuanzhi Li. An homotopy method for lp regression provably beyond self-concordance and in input-sparsity time. In Proc. ACM STOC'18, 2018.
Emmanuel Candès and Terence Tao. The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. The Annals of Statistics, 35(6):2313–2351, 2007.
Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. On acceleration with noise-corrupted gradients. In Proc. ICML'18, pages 1019–1028, 2018.
Alexandre d'Aspremont, Cristóbal Guzmán, and Martin Jaggi. Optimal affine-invariant smooth minimization algorithms. SIAM Journal on Optimization, 28(3):2384–2405, 2018.
Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
Jelena Diakonikolas and Cristóbal Guzmán. Lower bounds for parallel and randomized convex optimization. Journal of Machine Learning Research, 21(5):1–31, 2020.
Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660–689, 2019.
Radu-Alexandru Dragomir, Adrien Taylor, Alexandre d'Aspremont, and Jérôme Bolte. Optimal complexity and certification of Bregman first-order methods. arXiv preprint arXiv:1911.08510, 2019.
Alina Ene and Adrian Vladu. Improved convergence for $\ell_1$ and $\ell_\infty$ regression via iteratively reweighted least squares. In Proc. ICML'19, 2019.
Ildiko E. Frank and Jerome H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.
Alexander Vladimirovich Gasnikov and Yu. E. Nesterov. Universal method for stochastic composite optimization problems. Computational Mathematics and Mathematical Physics, 58(1):48–64, 2018.
Cristóbal Guzmán and Arkadi Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.
Niao He, Anatoli B. Juditsky, and Arkadi Nemirovski. Mirror prox algorithm for multi-term composite minimization and semi-separable problems. Computational Optimization and Applications, 61(2):275–319, 2015.
Anatoli Juditsky and Arkadii S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv preprint arXiv:0809.0813, 2008.
Anatoli Juditsky and Yuri Nesterov. Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems, 4(1):44–80, 2014.
Donghwan Kim and Jeffrey A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, pages 1–28, 2020.
A. S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2(1/2):173–183, 1995.
Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
Arkadi S. Nemirovskii and Yu. E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics, 25(2):21–30, 1985.
A. S. Nemirovskii and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
Yu Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
Yu Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
Yurii Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012.
Yurii Nesterov and Arkadi Nemirovski. On first-order algorithms for $\ell_1$/nuclear norm minimization. Acta Numerica, 22:509, 2013.
Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
R. Tyrrell Rockafellar. Convex Analysis. Princeton Mathematical Series. Princeton University Press, Princeton, N.J., 1970.
Katya Scheinberg, Donald Goldfarb, and Xi Bai. Fast first-order methods for composite convex optimization with backtracking. Foundations of Computational Mathematics, 14(3):389–417, 2014.
Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
Nathan Srebro and Karthik Sridharan. On convex optimization, fat shattering and learning. Unpublished note, 2012.
C. Zalinescu. On uniformly convex functions. Journal of Mathematical Analysis and Applications, 95:344–374, 1983.
Constantin Zalinescu. Convex Analysis in General Vector Spaces. World Scientific, 2002.
Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.