Local and Global Uniform Convexity Conditions
aa r X i v : . [ m a t h . O C ] F e b LOCAL AND GLOBAL UNIFORM CONVEXITY CONDITIONS
THOMAS KERDREUX † , ∗ , ALEXANDRE D’ASPREMONT ‡ , § , AND SEBASTIAN POKUTTA † , ∗ A BSTRACT . We review various characterizations of uniform convexity and smoothness on norm balls in finite-dimensional spaces and connect results stemming from the geometry of Banach spaces with scaling inequalities used in analyzing the convergence of optimization methods. In particular, we establish local versions of theseconditions to provide sharper insights on a recent body of complexity results in learning theory, online learning,or offline optimization, which rely on the strong convexity of the feasible set. While they have a significantimpact on complexity, these strong convexity or uniform convexity properties of feasible sets are not exploitedas thoroughly as their functional counterparts, and this work is an effort to correct this imbalance. We concludewith some practical examples in optimization and machine learning where leveraging these conditions andlocalized assumptions lead to new complexity results.
1. I
NTRODUCTION
Strong convexity or uniform convexity properties of the objective function of an optimization problemhave a significant impact on problem complexity [Nes15] and are heavily exploited by first order methods,notably in machine learning, with applications in various settings such as distributed optimization [JST + +
15, SFM +
17, Sti18], differential privacy [TTZ14, ZZMW17, CLK19, INS +
19, BFTGT19,KBGY20, FKT20], game theory [DH19, LS19, ALW19, MOP20].While the impact of strong convexity or uniform convexity of the objective function is well under-stood. That of similar conditions on the feasible set of optimization problems is a priori just as signif-icant but has been much less explored. Despite the recent growing literature leveraging such set struc-ture, which we now briefly survey, equivalent characterizations of strong convexity of sets and relatedweaker conditions are only sparsely covered. This is arguably leading to some confusion, e.g. , the no-tion of gauge sets introduced in [ALLW18] is equivalent to strong convexity [Mol20]. Another key mo-tivation of our work is that, to our knowledge, only two results [Dun79, KdP20] consider local stronglyconvex assumptions of the constraint set to describe global machine learning problem complexity, whilethese local properties have a significant impact on algorithm performance. This is surprising given thevast amount of literature around localized properties of objective functions, such as Kurdyka-Łojasiewiczproperties [BDL07] for instance, leveraged in the convergence analyses of first-order optimization methods[BDLM10, ABRS10, BNPS17, Rd20, KdP19, Ker20].Uniform convexity (UC) generalizes strong convexity to more precisely quantify the curvature of a convexset, and plays a central role in many fields. For instance, the geometry of a Banach space is greatly influencedby its unit ball’s uniform convexity, which notably drives the convergence behavior of martingales, andinduces several concentration inequalities [Pis75, Pin94, IN14].
Gauges.
For simplicity, we focus here on compact convex sets C in finite-dimensional spaces. The gaugefunction of C provides a correspondence between sets and norm-like functions [Roc70] and is defined as k x k C , inf (cid:8) λ ≥ | x ∈ λ C (cid:9) . (Gauge)For simplicity again, we will only consider centrally symmetric convex bodies with nonempty interior inwhat follows, whose gauge function induces then a norm. † Zuse Institute, Berlin, Germany. ∗ Technische Universit¨at, Berlin, Germany. ‡ CNRS UMR 8548. § D.I. ´Ecole Normale Sup´erieure, Paris, France. niformly Convex Sets in Optimization. Some feasible set structures lead to accelerated convergencerates for first-order algorithms, e.g. , projection-free algorithms. Conditional gradients, a.k.a. Frank-Wolfe(FW) algorithms, are known to enjoy accelerated convergence rates compared to the O (1 /T ) baseline whenthe set is globally strongly convex [Pol66, DR70, GH15]. However, to our knowledge, only two resultsin machine learning consider local strong convexity assumptions on the feasible set. [Dun79] proposes ageometrical condition on a given point x ∗ ∈ ∂ C ensuring accelerated convergence rates for Frank-Wolfealgorithms and [KdP20] then show that this assumption is equivalent to local strong convexity and furthergeneralizes all existing accelerated Frank-Wolfe regimes to hold also on locally uniformly convex sets.Other projection-free algorithms exist with improved guarantees on strongly convex sets, e.g. , for non-convex optimization [RBWM19], min-max problems [GJLJ17, WA18] or approximate Carath´eodory results[CP19]. The various equivalent definitions of strongly convex sets have also stimulated an interest in de-signing and analysing affine-invariant first-order methods. For instance, [dGJ18] proposed a choice of normand prox-function in the implementation of first-order accelerated methods from [Nes05] which make thesemethods affine-invariant and provably optimal for optimization problems constrained on uniformly convex ℓ p balls with p > . [KLLJS20] proposed an optimal (w.r.t. known analyses) affine-invariant analysis ofthe affine-covariant Frank-Wolfe algorithm on strongly convex sets. Their analysis rely on assumptionsthat combine scaling inequalities for strongly convex feasible sets and an affine-invariant characterization ofsmoothness [Jag13]. Finally, strong convexity for sets was also used outside of projection-free optimizationtechniques in, e.g. , [VV20, Bac20]. Uniformly Convex Sets in Machine Learning.
The global strong convexity of sets also characterizesperformance in learning theory and online learning. [HLGS16, HLGS17] studied logarithmic regret boundsof simple algorithms for online linear learning on smooth strongly convex decisions sets. [Mol20, KdP20]later extended these results to non-smooth and uniformly convex sets. [AYAS09, RT10] considered suchassumptions of the constraint set for stochastic linear bandits and [AR09, BCL18] for non-stochastic linearbandits. The global uniform convexity of the decision set has recently attracted much attention in “onlinelearning with a hint”, which is a multiplicative version of optimistic online learning. In this framework,regret bounds are obtained in terms of the uniform convexity power type of the decision set [DFHJ17,BCKP20a, BCKP20b].[KST09] studied generalization bounds of low-norm linear classes. They obtain upper bounds on theRademacher constant of the hypothesis class that depend on the strong convexity of the norm regularizing theclass. However, they expressed these results in terms of the functional strong convexity of the square of thenorm. In Section 6.2, we recall that this result is a quantitative corollary of known results in the geometricalstudy of Banach spaces: a uniformly convex space has a non-trivial Rademacher type. [EBEGT19] alsoconsider global strong convexity of the feasible region to strengthen convergence results in generalizationbounds in the
Predict-Then-Optimize framework. They notably rely on a characterization of strong convexityakin to scaling inequalities covered in (b) of Theorems 4.1-5.1.In online learning on Banach spaces, several works analyse regret bounds in terms of the martingaletype/cotype of the space [ST10, SST11], a property directly tied with uniform convexity. In fact, [ST10,SST11] relies on the fact that the martingale type of a space is related to the existence of a uniformly convexfunction on this space, see [ST10, Theorem 1]. Besides, as we recall in Section 6.2, a uniformly convexspace has also a Rademacher type (the reverse might not be true), a notion related to the martingale type.This martingale type structure has been leveraged in various applications in learning [Sch16, KCd17] as itis a central tool to derive concentration inequalities [Pis75, Pin94, Pis11]. However, our main focus hereremains on uniform convexity as it has a simple geometrical interpretation in terms of scaling inequalitieswith direct algorithmic consequences (items (b) in Theorems 4.1-5.1), and admits local versions (Theorem5.1) which also better characterize empirical performance, as opposed to martingale type/cotype properties.
Contributions.
We first provide elementary proofs of various local and global equivalent characterizationsof uniform convexity of sets. We then discuss applications in machine learning and cover some practicalexamples leveraging these alternative points of view in Section 6. Most of our results are quantitative. e then characterize the uniform convexity of a set in terms of the “angles” between normal cone di-rections and feasible directions at boundary points. These quantifications appear regularly in convergenceproofs of algorithms such as Frank-Wolfe and we call them scaling inequalities . The link with uniformconvexity is often ignored and our objective here is to explicitly quantify this connection.Finally, we derive equivalent relationships for the localized versions of UC (see Theorem 5.1) to betterexplain empirical performance in optimization methods. Related Works.
Our work connects different perspectives of uniform convexity of a set. Our Theorems 4.1-5.1 rely on several classical monographs. We refer to [Zˇa83, AP95, Zˇa02] for the study of functional uniformconvexity and smoothness, to [LT13, Bea11, DGZ93, BGHV09] for the study of the geometry of Banachspaces in terms of uniform convexity and smoothness, and to [Pis11] for results on type/cotype propertiesof a Banach space. We also invoke [GI17] for practical local characterizations of the strong convexity ofsets. Finally, we rely on [Roc70] for convex analysis references and on [Sch14] for convex geometry infinite dimensions. Whenever possible, we keep track of the precise reference to these monographs whenestablishing the results in Sections 3-5. In many cases, we have adapted the proofs to make the resultsquantitative.
Outline.
In Section 2 we group some preliminary facts and in Section 3, we recall the definition of uniformconvexity and smoothness for functions and spaces. In Section 4, we present Theorem 4.1 stating differentequivalent definitions of the uniform convexity of a norm ball in finite-dimensional spaces. Theorem 5.1 inSection 5 provides the same results but with local assumptions. Results in Section 4-5 are self-containedand proofs are elementary. However, they hold even in infinite-dimensional spaces. Finally, in Section 6,we provide three examples in offline optimization and learning theory where these different points of viewon uniform convexity lead to new results.
Notations.
The finite-dimensional ambient vector space is R m and by Int ( C ) and ∂ C , we denote the interiorof C and the boundary of C respectively. The support function of C is defined as σ C ( d ) , sup v ∈C h v ; d i . The normal cone of C at x ∗ ∈ C is defined as N C ( x ∗ ) , (cid:8) d | h d ; x − x ∗ i ≤ ∀ x ∈ C (cid:9) and the support set of C at d is F C ( d ) , (cid:8) x ∈ C | h x ; d i = σ C ( d ) (cid:9) . We write f ∗ ( y ) = sup x ∈ R m h x ; y i − f ( x ) as the Fenchel conjugate of f . We will consider convex functions f : R m → R , finite everywhere and continuous. In particular, wethen have that f ∗∗ = f . For a norm k · k , we write k x k ⋆ , sup (cid:8) h x ; y i | k y k ≤ (cid:9) to denote its dual norm.We sometimes also use k · k ⋆ . We use different star symbols to distinguish between dual norms and Fencheldual, e.g. , the Fenchel dual of a norm is not the dual norm in general. We write B k·k the unit ball and S k·k theunit sphere associated to a norm k · k . We most often consider ( p, q ) s.t. p ≥ , q ∈ ]1 , and /p + 1 /q = 1 .The p (resp. q ) parameter will hence be employed in the context of uniform convexity (resp. smoothness).2. P RELIMINARIES
We restrict the discussion to finite-dimensional spaces for simplicity. It allows for a direct analogy ofduality between a norm and its dual norm with the duality between the norm ball’s gauge function and thesupport function of the norm ball’s polar, which we detail now. Note that results similar to Theorems 4.1 and5.1 hold in infinite-dimensional Banach spaces though. We consider centrally symmetric convex bodies C with non-empty interior so that the gauge function k·k C of C is a norm [Roc70, Theorem 15.2.]. In particular,the unit ball (resp. the sphere) of k · k C corresponds to C (resp. ∂ C ), i.e. , C = B k·k C and ∂ C = S k·k C . Thefunction k · k C and σ C are every-where finite convex functions from R m to R + and, e.g. , subdifferentiable[Roc70, Theorem 23.4].A strictly convex set C is such that for any distinct ( x, y ) ∈ ∂ C , we have ( x + y ) / ∈ C \ ∂ C . Conversely, C is smooth if there is only one supporting hyperplane at each boundary point of C . The following lemmarecalls the classical relation between strict convexity of a set and differentiability of the support function[Sch14, Cor 1.7.3]. emma 2.1 (Support/Gauge Differentiability) . Consider
C ⊂ R m a compact convex set. σ C is differentiableat d ∈ R m \ { } if and only if { y | h y ; d i = σ C ( d ) } = { x } . In that case ∇ σ C ( d ) = x . In particular, if C isstrictly convex, then σ C is differentiable on R m \ { } . The polar of C is defined as C ◦ = (cid:8) d ∈ R m | h x ; d i ≤ ∀ x ∈ C (cid:9) . Importantly, the support and gaugefunction are dual to each other via the polar operation, i.e. , σ C ( · ) = k · k C ◦ [Roc70, Theorem 14.5.]. Wesystematically write x (resp. d ) for an element of C (resp. C ◦ ). This duality parallels that of a norm and itsdual. Indeed, if k · k C is a norm, then k · k C ◦ is a norm and k · k ⋆ C = k · k C ◦ [Roc70, Cor 15.1.2]. Finally, thefollowing classical lemma will be particularly useful [Asp68, Lemma 2]. Lemma 2.2.
Let p, q > s.t. p + q = 1 . Then, for any α > , we have (cid:0) ασ p C (cid:1) ∗ ( · ) = h αp ) / ( p − − α ( αp ) q i k · k q C . In particular for α = p , it means that the Fenchel conjugate of p k · k p is q k · k q⋆ .Proof of Lemma 2.2. We recall the proof for completeness. Consider ρ ∗ ( u ) , sup t> (cid:0) tu − ρ ( t ) (cid:1) . For any y , we have ρ ∗ ( k y k ⋆ ) = sup t> (cid:8) t k y k ⋆ − ρ ( t ) (cid:9) = sup t> sup x =0 h t h y ; x i / k x k − ρ ( t ) i = sup t> x =0 h t h y ; xt/ k x kik xt/ k x kk − ρ ( t ) i = sup t> x =0 n t h y ; x ik x k − ρ ( t ); k x k = t o = sup x =0 (cid:8) h y ; x i − ρ ( k x k ) (cid:9) = ( ρ ◦ k · k ) ∗ ( y ) . Also, an immediate calculation proves that for u ≥ and when ρ ( t ) = αt r with r > , we have ρ ∗ ( u ) = h αr ) / ( r − − α ( αr ) r/ ( r − i u r/ ( r − . We finally conclude noting that σ C ◦ ( · ) = k · k C and k · k C ◦ = k · k ⋆ .3. S PACES , S
ETS , F
UNCTIONS U NIFORM S MOOTHNESS AND C ONVEXITY
In this section, we introduce the necessary concepts to state the main theorems in Sections 4-5. We recallthe classical notions of uniform convexity and smoothness for functions (Section 3.1) and Banach spaces(Section 3.2). We also recall quantitative statements on the duality correspondence between smoothness anduniform convexity in each of these situations.3.1.
Uniform Convexity and Smoothness of Functions.
Uniform convexity and smoothness of functionswere introduced to analyse optimization algorithms [Pol66] and extensively studied in [Zˇa83, AP95, Zˇa02],and is now a standard assumption in the analysis of first order methods, see, e.g. , [IN14].The following equivalent definitions of uniformly smooth function are classical, see, e.g. , [Zˇa02, (i)-(iv)-(ix) of Theorem 3.5.6.], which notably shows that a continuous uniformly smooth function is Fr´echetdifferentiable. This means that a norm for instance is not uniformly smooth as it is not differentiable at , seeLemma 2.1. This explains why hypothesis (c) in Theorem 4.1 below is restricted to S k·k (1) . In the followingsections, we consider only uniform convexity and smoothness of functions to ultimately apply it to simpletransformations of the gauge and support functions. We recall self-contained proofs of the equivalences inthe definition to obtain quantitative statements. Note that whenever we invoke uniformly smooth or convexfunctions in the other sections, we will often refer to these zero-order characterization . Definition 3.1 (Uniformly Smooth Functions) . Consider a convex function f : R m → R and q ∈ ]1 , . Thefollowing assertions are equivalent(a) (Zero-order) There exists c > s.t. f is ( c, q ) -uniformly smooth with respect to k · k , i.e., for any ( x, y ) and λ ∈ [0 , f ( λx + (1 − λ ) y ) + ( c/q ) λ (1 − λ ) k x − y k q ≥ λf ( x ) + (1 − λ ) f ( y ) . b) (First-order) f is differentiable and there exists c ′ > such that for any ( x, y ) , we have f ( y ) ≤ f ( x ) + h∇ f ( x ); y − x i + c ′ q k x − y k q . (c) (H¨older gradient) f is differentiable and there exists c ′′ > such that f is ( c ′′ , q ) -H¨older-smoothw.r.t. k · k , i.e., for any ( x, y ) (cid:13)(cid:13) ∇ f ( x ) − ∇ f ( y ) (cid:13)(cid:13) ⋆ ≤ c ′′ k x − y k q − . Proof of equivalency in Definition 3.1.
We adapt the proof of [Zˇa02, Theorem 3.5.6] to our case.(a) = ⇒ (b). Let ( x, y ) ∈ R m and λ ∈ ]0 , . The zero-order condition evaluated at ( x, y ) implies that f ( y + λ ( x − y )) − f ( y ) λ + ( c/q )(1 − λ ) k x − y k p ≥ f ( x ) − f ( y ) . (1)And because f is a finite convex function, the limit of (cid:0) f ( y + λ ( x − y )) − f ( y ) (cid:1) /λ when λ converges to + exists [Roc70, Theorem 23.1.] and f ′ ( x, · ) is defined for d ∈ R m as f ′ ( x, d ) , lim λ → + f ( y + λd ) − f ( y ) λ . In particular, with d = x − y , it implies in (1) that f ′ ( y, x − y ) + ( c/q ) k x − y k q ≥ f ( x ) − f ( y ) . (2)Let us now show that f ′ ( x, · ) is linear. By definition of f ′ ( x, · ) , we have that f ′ ( x, y ) ≥ − f ′ ( x, − y ) . Letus now show that the other side inequality is also true. Summing the two versions of (2) by interchanging x and y , we obtain f ′ ( y, x − y ) + f ′ ( x, y − x ) + (2 c/q ) k x − y k q ≥ . Let u ∈ R m , t > and write ρ ( t ) = f ( x + tu ) . Then ρ ′ + ( t ) , lim λ → t + ρ ( λ ) − ρ ( t ) λ − t = f ′ ( x + tu, u ) ρ ′− ( t ) , lim λ → t − ρ ( λ ) − ρ ( t ) λ − t = − f ′ ( x + tu, − u ) . Then, because ρ is convex, we have ρ ′ + ( − t ) ≤ ρ ′− (0) ≤ ρ ′ + (0) ≤ ρ ′− ( t ) . Hence, for any t > f ′ ( x, u )+ f ′ ( x, − u ) = ρ ′ + (0) − ρ ′− (0) ≤ ρ ′− ( t ) − ρ ′ + ( − t ) = − (cid:2) f ′ ( x + tu, − u )+ f ′ ( x − tu, u ) (cid:3) ≤ cq (2 t ) q k u k q . We conclude that f ′ ( x, u ) ≤ − f ′ ( x, − u ) and finally that f ′ ( x, u ) = − f ′ ( x, − u ) . Hence f ′ ( x, · ) is abounded linear function for any x so that f is differentiable with f ′ ( x, h ) = h∇ f ( x ); h i . We conclude byletting λ converging to in (1).(b) = ⇒ (a). Write x λ = λx + (1 − λ ) y . Applying the first order at x = x λ + x − x λ and y = x λ + y − x λ ,we obtain f ( x ) ≤ f ( x λ ) + (1 − λ ) h∇ f ( x λ ); x − y i + ( c/q )(1 − λ ) q k x − y k q f ( y ) ≤ f ( x λ ) + λ h∇ f ( x λ ); y − x i + ( c/q ) λ q k x − y k q . Then, by multiplying the inequalities respectively with λ and − λ and summing then, we obtain λf ( x ) + (1 − λ ) f ( y ) ≤ f ( x λ ) + ( c/q ) λ (1 − λ ) (cid:2) (1 − λ ) q − + λ q − (cid:3) k x − y k q . Then, by symmetry of (1 − λ ) q − + λ q − and because q − ∈ ]0 , , we obtain λf ( x ) + (1 − λ ) f ( y ) ≤ f ( x λ ) + 2( c/q ) λ (1 − λ ) k x − y k q . b) = ⇒ (c). For any z , by convexity of f , we have f ( y + z ) ≥ f ( x ) + h∇ f ( x ); y + z − x i and byassumption we have f ( y + z ) ≤ f ( y ) + h∇ f ( y ); z i + c ′ q k z k q . Hence f ( y ) + h∇ f ( y ); z i + c ′ q k z k q ≥ f ( x ) + h∇ f ( x ); y + z − x i so that for any z h z ; ∇ f ( x ) − ∇ f ( y ) i − c ′ q k z k q ≤ f ( y ) − f ( x ) + h∇ f ( x ); x − y i ≤ c ′ q k x − y k q . Then, by taking the supremum over z on both sides, we obtain c ′ (cid:16) q k · k q (cid:17) ∗ (cid:0) ( ∇ f ( x ) − ∇ f ( y )) /c ′ (cid:1) ≤ c ′ q k x − y k q . With p ≥ s.t. p + q = 1 , Lemma 2.2 then implies c ′ p (cid:13)(cid:13)(cid:13) ∇ f ( x ) − ∇ f ( y ) c ′ (cid:13)(cid:13)(cid:13) p⋆ ≤ c ′ q k x − y k q . In particular, pq = q − and we obtain k∇ f ( x ) − ∇ f ( y ) k ⋆ ≤ c ′′ ( q − /p k x − y k q − . (c) = ⇒ (b). By convexity of f , we have f ( x ) ≥ f ( y ) + h∇ f ( y ); x − y i . Hence by definition of thedual norm, we obtain f ( y ) − f ( x ) − h∇ f ( x ); y − x i ≤ h∇ f ( y ) − ∇ f ( x ); y − x i ≤ k∇ f ( y ) − ∇ f ( x ) k ⋆ k y − x k . and using the H ¨older-smoothness of f we obtain f ( y ) − f ( x ) − h∇ f ( x ); y − x i ≤ c ′ k y − x k q . We now define uniform convexity of a function, see, e.g. 
, [AP95, Definition 1]. We state the results interms of subgradients as gauge or support functions are not necessarily differentiable.
Definition 3.2 (Uniformly Convex Functions) . Consider a convex function f : R m → R and p ≥ . Thefollowing assertions are equivalent(a) (Zero-order) There exists c > s.t. f is ( c, p ) -uniformly convex with respect to k · k , i.e., for any ( x, y ) and λ ∈ [0 , , we have f ( λx + (1 − λ ) y ) + ( c/p ) λ (1 − λ ) k x − y k p ≤ λf ( x ) + (1 − λ ) f ( y ) . (b) (First-order) There exists α > s.t. for any ( x, y ) ∈ C and d ∈ ∂f ( x ) , we have f ( y ) ≥ f ( x ) + h d ; y − x i + αp k x − y k p . Proof of equivalency in Definition 3.2. (a) = ⇒ (b). Let ( x, y ) ∈ R m and d ∈ ∂f ( y ) . Combining convexityof f and zero-order uniform convexity, we have f ( y ) + λ h d ; x − y i ≤ f ( y + λ ( x − y )) ≤ f ( y ) + λ ( f ( x ) − f ( y )) − ( c/p ) λ (1 − λ ) k x − y k p . Then, dividing by λ and evaluating with λ converging to zero, we have h d ; x − y i ≤ f ( x ) − f ( y ) − ( c/p ) k x − y k p . b) = ⇒ (a). Write x λ = λx + (1 − λ ) y . We apply the first-order condition at x = x λ + x − x λ and y = x λ + y − x λ . With d ∈ ∂f ( x λ ) , we have f ( x ) ≥ f ( x λ ) + (1 − λ ) h d ; x − y i + αp (1 − λ ) p k y − x k p f ( y ) ≥ f ( x λ ) + λ h d ; y − x i + αp λ p k y − x k p . Multiplying the inequalities respectively by λ and − λ and summing them, we obtain λf ( x ) + (1 − λ ) f ( y ) ≥ f ( x λ ) + αp λ (1 − λ ) (cid:2) (1 − λ ) p − + λ p (cid:3) k y − x k p . Then, by symmetry, we have that min λ ∈ [0 , (cid:2) (1 − λ ) p − + λ p (cid:3) = 1 / p − , which concludes that λf ( x ) + (1 − λ ) f ( y ) ≥ f ( x λ ) + α p − p λ (1 − λ ) k y − x k p . Uniform smoothness (US) and uniform convexity (UC) are dual properties by Fenchel conjugacy [Zˇa83,Theorem 2.1.] or [AP95, Proposition 2.6]. We recall a proof below, both for completeness and to obtainquantitative statements.
Proposition 3.3 (Uniform Smoothness and Convexity with Fenchel duality) . Consider α, c > , p ≥ and q ∈ ]1 , such that p + q = 1 , and a norm k · k with its dual norm k · k ⋆ . Let f : R m → R be a convexfunction. We have the following implications(a) If f is ( α, q ) -uniformly smooth w.r.t. k · k (Definition 3.1 (a)), then f ∗ is (1 / ( pα p − ) , p ) -uniformlyconvex w.r.t. k · k ⋆ (Definition 3.2 (a)).(b) If f is ( c, p ) -uniformly convex w.r.t. k · k , then f ∗ is (cid:0) / ( qc q − ) , q (cid:1) -uniformly smooth with respectto k · k ⋆ .Proof of Proposition 3.3. Let us prove (b), (a) follows similarly. Assume f is ( c, p ) -uniformly convex.Consider ( y , y ) ∈ R m , λ ∈ [0 , and write y λ = λy + (1 − λ ) y . Similarly, for any ( x , x ) ∈ R m , letus write x λ = λx + (1 − λ ) x , f ( x i ) = f i for i = 1 , , f ( x λ ) = f λ , f ( x λ ) ∗ = f ∗ λ etc. By definition ofconjugate functions, and using the zero-order uniform convexity of f ( · ) at x λ , we have h y λ ; x λ i ≤ f ∗ ( y λ ) + f λ ≤ f ∗ ( y λ ) − ( c/p ) λ (1 − λ ) k x − x k p + λf + (1 − λ ) f . By adding and subtracting λ (1 − λ ) h y − y ; x − x i , we obtain h y λ ; x λ i ≤ f ∗ ( y λ )+ λf +(1 − λ ) f − λ (1 − λ ) h y − y ; x − x i + λ (1 − λ ) (cid:2) h y − y ; x − x i− ( c/p ) k x − x k p (cid:3) . The right term in brackets is upper bounded by (( c/p ) k · k p ) ∗ ( y − y ) , so that h y λ ; x λ i − λf − (1 − λ ) f + λ (1 − λ ) h y − y ; x − x i ≤ f ∗ ( y λ ) + λ (1 − λ )(( c/p ) k · k p ) ∗ ( y − y ) . Note also the following equality h y λ ; x λ i + λ (1 − λ ) h y − y ; x − x i = λ h y ; x i + (1 − λ ) h y ; x i . Hence, we obtain λ h y ; x i + (1 − λ ) h y ; x i − λf − (1 − λ ) f ≤ f ∗ ( y λ ) + λ (1 − λ )(( c/p ) k · k p ) ∗ ( y − y ) λ (cid:2) h y ; x i − f (cid:3) + (1 − λ ) (cid:2) h y ; x i − f (cid:3) ≤ f ∗ ( y λ ) + λ (1 − λ )(( c/p ) k · k p ) ∗ ( y − y ) . Because the last inequality is true for any ( x , x ) , we conclude that λf ∗ ( y ) + (1 − λ ) f ∗ ( y ) ≤ f ∗ ( y λ ) + λ (1 − λ )(( c/p ) k · k p ) ∗ ( y − y ) . Lemma 2.2 implies that (( c/p ) k · k p ) ∗ ( y − y ) = qc q − k y − y k q⋆ . Finally f ∗ is (cid:0) / ( qc q − ) , q (cid:1) -uniformlysmooth with respect to k · k ⋆ .In the following proposition, we provide similar results for local notions of uniform convexity andsmoothness of a function. These are quantitative versions of [AP95, Proposition 3.2.] or [Zˇa83, (iv) &(v) Theorem 2.1.]. roposition 3.4. Consider α, c > , p ≥ and q ∈ ]1 , such that p + q = 1 . Let f : R m → R a convexfunction and ( x, d ) such that d ∈ ∂f ( x ) and x ∈ ∂f ∗ ( d ) . The following assertions are equivalent(a) For some α > , f ∗ is ( α, q ) -uniformly-smooth at d w.r.t. to k · k ⋆ , i.e., for all d , we have f ∗ ( d ) ≤ f ∗ ( d ) + h x ; d − d i + αq k d − d k q⋆ . (b) For some c > , f is ( c, p ) -uniformly convex at x w.r.t k · k , i.e., for any y , we have f ( y ) ≥ f ( x ) + h d ; y − x i + cp k y − x k p . Proof of Proposition 3.4.
First note, that since f is finite l.s.c., for d ∈ ∂f ( x ) , we have x ∈ ∂f ∗ ( d ) [Roc70, Theorem 23.5.]. Let us show that (a) = ⇒ (b), the converse follows similarly. Recall that f ( y ) = sup d ∈ R m (cid:8) h y ; d i − f ∗ ( d ) (cid:9) . Write Φ( d ) , α/q k d − d k q⋆ . Combining the uniform smooth-ness assumption on f ∗ , adding and subtracting h y ; d i and with the equality f ∗ ( d ) + f ( x ) = h d ; x i [Roc70,Theorem 23.5.], we have for any yf ( y ) ≥ sup d ∈ R m (cid:8) h y ; d i − (cid:0) f ∗ ( d ) + h x ; d − d i + Φ( d − d ) (cid:1)(cid:9) f ( y ) ≥ sup d ∈ R m (cid:8) h y ; d − d i − (cid:0) f ∗ ( d ) + h x ; d − d i + Φ( d − d ) (cid:1) + h y ; d i (cid:9) f ( y ) ≥ sup d ∈ R m (cid:8) h y − x ; d − d i − Φ( d − d ) + h y ; d i − f ∗ ( d ) (cid:9) f ( y ) ≥ sup d ∈ R m (cid:8) h y − x ; d − d i − Φ( d − d ) (cid:9) + h y ; d i + f ( x ) − h d ; x i f ( y ) ≥ f ( x ) + h d ; y − x i + Φ ∗ ( y − x ) . Write k · k = k · k C for some compact centrally symmetric convex set C with nonempty interior. Then, notethat Φ = αq σ q C . Hence, by Lemma 2.2, we have Φ ∗ ( · ) = k · k p C / ( pα / ( q − ) . Finally f ( y ) ≥ f ( x ) + h d ; y − x i + 1 pα / ( q − k y − x k p C . Uniform Convexity and Smoothness for Sets and Spaces.
Moduli of convexity and smoothness ofa norm k · k C help characterize the geometry of the normed space ( R m , k · k C ) or the convex set C . Thisconnects set uniform convexity with results in the study of Banach spaces, in the special case where C is cen-trally symmetric with nonempty interior. In Section 6.2, we provide an important use case stemming fromthis other perspective on uniform convexity. These moduli are classical objects characterizing either en-hanced convex properties of C (for uniform convexity, rotundity) or regularity of the boundary of C (uniformsmoothness). Here too, these properties are dual for a normed space and its dual space [Lin63].The (global) modulus of convexity [Cla36] is defined, for ǫ ∈ [0 , , as δ k·k C ( ǫ ) = inf { − k ( x + y ) / k C | k x k C = k y k C = 1; k x − y k C ≥ ǫ } . (3)The restriction of ǫ ∈ [0 , ensures that the infimum is defined. It measures the convexity of k · k C at midpoints on the border of C . Note that the value of δ k·k C does not change by considering ( x, y ) ∈ B k·k (1) in place of S k·k (1) , see discussion following [LT13, Definition 1.e.1]. The modulus of smoothness [Lin63]of k · k C is defined, for τ > , as ρ k·k C ( τ ) = sup { ( k x + τ y k C + k x − τ y k C ) / − | k x k C = k y k C = 1 } . (4)We can now define uniformly convex (resp. smooth) norm balls and normed spaces. Definition 3.5 (Uniformly Convex Set or Space) . Consider a compact convex set C , p ≥ and α > .Assume C is centrally symmetric with nonempty interior. C is ( α, p ) -uniformly convex iff for any ǫ ∈ [0 , δ k·k C ( ǫ ) ≥ αǫ p . In that case, we also say that the normed space ( R m , k · k C ) is uniformly convex of type p . here are other equivalent definitions of the set uniform convexity of C . We will detail some of them inTheorem 4.1, prove their equivalence and discuss their practical significance. Definition 3.6 (Uniformly Smooth Set or Space) . Consider a compact convex set C and q ∈ ]1 , . Assume C is centrally symmetric with nonempty interior. C is ( α, q ) -uniformly smooth if for any τ > , we have ρ k·k C ( τ ) ≤ ατ q . In that case, we also say that the normed space ( R m , k · k C ) is uniformly smooth of type q . When a set is ( µ, -uniformly convex (resp. ( L, -uniformly smooth), we say it is µ -strongly convex(resp. L -smooth), see [GI17, Theorem 2.1.] for a thorough review on strongly convex sets in Hilbert spaces.These properties are dual to each other, in terms of the set C and its polar C ◦ , or the norm ball and itsdual norm ball [DGZ93, Proposition IV 1.12]. The Lindenstrauss formula [Lin63, Theorem 1] leads toquantitative versions of that duality. For any τ > , we have ρ k·k C◦ ( τ ) = sup ǫ ∈ [0 , n τ ǫ − δ k·k C ( ǫ ) o . (Lindenstrauss)The following lemma [DGZ93, Proposition 1.12.] then quantifies this duality and is similar to Proposition3.3 on a function and its Fenchel conjugate. The proof directly follows from (Lindenstrauss). Proposition 3.7 (Uniform Smoothness and Convexity with dual norms) . Consider α, c > , p ≥ and q ∈ ]1 , such that p + q = 1 and a compact convex set C centrally symmetric with nonempty interior. Wehave the following implications(a) If C is ( α, q ) -uniformly smooth (Definition 3.5), then C ◦ is (1 / (cid:0) p (2 αq ) / ( q − (cid:1) , p ) -uniformly con-vex (Definition 3.6).(b) If C is ( c, p ) -uniformly convex, then C ◦ is (1 / (cid:0) q (2 αp ) q − (cid:1) , q ) -uniformly smooth.Proof of Proposition 3.7. For instance, let us prove (a). 
With (Lindenstrauss), we have for any τ > and ǫ ∈ [0 , that τ ǫ/ − δ k·k C◦ ( ǫ ) ≤ ατ q . Optimizing w.r.t. to τ , the optimal τ ∗ = ( ǫ/ (2 αq )) / ( q − leads to δ k·k C◦ ( ǫ ) ≤ p ǫ p (2 αq ) / ( q − .3.3. Local Moduli.
Local counterparts of the global moduli characterize local properties of C around apoint x ∗ ∈ ∂ C with respect to a (normalized) direction d in the normal cone N C ( x ∗ ) . As we will see,these local properties are important as they explain empirical globally accelerated convergence rates inoptimization problems where the functions or constraints do not satisfy global regularity assumptions suchas, e.g. , strong convexity [Dun79, KdP20].The local modulus of smoothness [GI17, (15)] of C at x ∗ ∈ ∂ C with respect to d ∈ S k·k C◦ (1) is defined as,for t > , ρ k·k C ( t, x ∗ , d ) = sup (cid:8) k x ∗ + tx k C − k x ∗ k C − t h d ; x i |k x k C ≤ (cid:9) . (Loc. Smoothness)Similarly to all moduli seen so far, the local modulus of smoothness is designed so that when t goes tozero, the first order terms cancel. In the following, for convenience we write ρ C for ρ k·k C . We measure the local uniform convexity at x ∗ via the local modulus of rotundity . In the equivalent characterization of theset uniform convexity, the definition of modulus of rotundity is most related with the scaling inequalities characterizations, see (b) in Theorems 4.1 and 5.1. For x ∗ ∈ ∂ C and d ∈ N C ( x ∗ ) ∩ S k·k C◦ (1) , the localmodulus of rotundity at x ∗ w.r.t. d is defined for ǫ ∈ [0 , as ν C ( ǫ, x ∗ , d ) = inf (cid:8) h d ; x ∗ − x i | x ∈ C , k x ∗ − x k C ≥ ǫ (cid:9) . (Rotundity)The following lemma makes the duality between smoothness and rotundity explicit by linking the twomoduli, to produce a local counterpart to (Lindenstrauss). We cite a version giving a quantitative dualrelationship between local modulus of smoothness and local modulus of rotundity [GI17, Theorem 2.7.]. Lemma 3.8 (Local Lindstrauss formula) . Consider x ∗ ∈ ∂ C and d ∈ S k·k C◦ (1) ∩ N C ( x ∗ ) . Then the localmodulus of smoothness and rotundity satisfy for any t > ρ C ◦ ( t, d, x ∗ ) = sup ǫ ∈ [0 , (cid:8) ǫt − ν C ( ǫ, x ∗ , d ) (cid:9) . (Loc. Lindenstrauss) roof of Lemma 3.8. Let t > , by definition of ρ C ◦ , for η > there exists d η ∈ C ◦ such that ρ C ◦ ( t, d, x ∗ ) ≤k d + td η k C ◦ − k d k C ◦ − t h x ∗ ; d η i + η . Also, by compactness of C , there exists x η ∈ ∂ C s.t. k td η + d k C ◦ = σ C ( td η + d ) = h td η + d ; x η i . Since d ∈ N C ( x ∗ ) , we have k d k C ◦ = σ C ( d ) = h d ; x ∗ i and hence ρ C ◦ ( t, d, x ∗ ) ≤ k td η + d k C ◦ − k d k C ◦ − t h x ∗ ; d η i + ηρ C ◦ ( t, d, x ∗ ) ≤ h td η + d ; x η i − h d ; x ∗ i − t h x ∗ ; d η i + ηρ C ◦ ( t, d, x ∗ ) ≤ h d ; x η − x ∗ i + h td η ; x η − x ∗ i + η ≤ h d ; x η − x ∗ i + tσ C ◦ ( x η − x ∗ ) + ηρ C ◦ ( t, d, x ∗ ) ≤ sup x ∈C (cid:8) h d ; x − x ∗ i + t k x − x ∗ k C (cid:9) + ηρ C ◦ ( t, d, x ∗ ) ≤ sup ǫ ∈ [0 , sup x ∈C (cid:8) h d ; x − x ∗ i + t k x − x ∗ k C (cid:12)(cid:12) k x − x ∗ k C = ǫ (cid:9) + ηρ C ◦ ( t, d, x ∗ ) ≤ sup ǫ ∈ [0 , (cid:8) tǫ − inf x ∈C (cid:8) h d ; x ∗ − x i | k x − x ∗ k C = ǫ (cid:9)(cid:9) + ηρ C ◦ ( t, d, x ∗ ) ≤ sup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) + η. We used that for any x, y ∈ C we have k x − y k C ∈ [0 , . Finally, last inequality is true for any η > , hence ρ C ◦ ( t, d, x ∗ ) ≤ sup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) . We now provide a similar reasoning to obtain the equality.Indeed, for λ > , there exists ǫ λ > such that sup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) ≤ tǫ λ − ν C ( ǫ λ , x ∗ , d ) + λ .Also, for η > , there exists x η ∈ C s.t. ν C ( ǫ λ , x ∗ , d ) ≥ h d ; x ∗ − x η i − η with k x η − x ∗ k C ≥ ǫ λ . 
Bycompactness of C , there exists d η ∈ C ◦ such that k x η − x ∗ k C = σ C ◦ ( x ∗ − x η ) = h x ∗ − x η ; d η i . Therefore,for t > , we havesup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) ≤ tǫ λ − h d ; x ∗ − x η i + λ + η ≤ t k x η − x ∗ k C − h d ; x ∗ − x η i + λ + η sup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) ≤ t h d η ; x ∗ − x η i − h d ; x ∗ − x η i + λ + η sup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) ≤ h x η ; d − td η i − h d ; x ∗ i + t h d η ; x ∗ i + λ + η sup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , p ) (cid:9) ≤ σ C ( d − td η ) − σ C ( d ) − t h x ∗ ; − d η i + λ + η sup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) ≤ k d − td η k C ◦ − k d k C ◦ − t h x ∗ ; − d η i + λ + η. Hence, since − d η ∈ C ◦ , for any λ, η > , we have ν C ( ǫ, x ∗ , d ) (cid:9) ≤ ρ C ◦ ( t, d, x ∗ ) λ + η . We conclude thatsup ǫ ∈ [0 , (cid:8) tǫ − ν C ( ǫ, x ∗ , d ) (cid:9) ≤ ρ C ◦ ( t, d, x ∗ ) .4. E QUIVALENCE BETWEEN G LOBAL S ET AND F UNCTIONAL A SSUMPTIONS
We expose some classical equivalence between functional and geometrical properties in Theorem 4.1below. This leads to new insights in learning theory in Section 6.2 and in optimization in Sections 6.1-6.3.Item (a) is similar to the definition appearing in most machine learning papers [GH15, HLGS16, HLGS17]and gives an intuitive understanding of set uniform convexity. The uniformly mid-convex property is equiv-alent to its continuous counterpart, see, e.g. , [Mol20, Lemma 9], but allows more concise proofs.Item (b) is an essential inequality in analysing projection-free online or offline optimization methods.There are other related and useful inequalities that can be seamlessly derived from this one, see, e.g. , Lemma6.6.Item (d)-(f) provides equivalent functional properties of the gauge and support function of C . Note that C is UC, but it is only a power of its gauge k · k C that is UC in the sense of functions. Also, the support functionis only partially H ¨older smooth as Item (d) holds on the sphere S k·k C◦ (1) . Again, it is only a specific powerof the support function that is uniformly smooth in the sense of functions without restriction on its domain.Finally, item (c) connects all other perspectives with the study of uniformly convex Banach spaces. Thisconnection is rich with hindsights, see, e.g. , Section 6.2.These results are classical and appear in many textbooks [DGZ93, LT13] often in non-quantitative, scat-tered, or too generic forms. We detail self-contained elementary proofs and provide quantitative versions in he finite-dimensional setting. Further, we only present the most practically significant equivalent charac-terizations here. In Section 5, we will provide similar quantitative results with local uniform convexity andsmoothness of C . Theorem 4.1 (Global Set Uniform Convexity) . Consider p ≥ and q ∈ ]1 , s.t. p + q = 1 . Let C be acentrally symmetric compact convex set with nonempty interior. The following assertions are equivalent(a) (Set mid-convex property) There exists α > s.t. for all ( x, y ) ∈ C we have x + y α k x − y k p C B k·k C (1) ⊂ C . (b) (Global scaling inequality) There exists α > s.t. for any ( x, y ) ∈ C × ∂ C and d ∈ R m with d ∈ N C ( y ) (or y ∈ argmax v ∈C h d ; v i ) we have h d ; y − x i ≥ α k d k C ◦ k y − x k p C . (Global-Scaling) (c) (Set Modulus UC) There exists α > s.t. C is ( α, p ) -uniformly convex (Definition 3.5), i.e., for any ǫ > , we have the following lower bound on the modulus (3) , δ k·k C ( ǫ ) ≥ αǫ p . (d) (Support H¨older-Smooth Sphere) The exists c > s.t. the support function σ C ( · ) is ( c, q − -H¨older smooth with respect to k · k C ◦ on S k·k C◦ (1) , i.e., it is differentiable on S k·k C◦ and for any ( d , d ) ∈ S k·k C◦ (1) , we have (cid:13)(cid:13) ∇ σ C ( d ) − ∇ σ C ( d ) (cid:13)(cid:13) C ≤ c (cid:13)(cid:13) d − d (cid:13)(cid:13) q − C ◦ = c (cid:13)(cid:13) d − d (cid:13)(cid:13) / ( p − C ◦ . (e) (Support US) σ q C ( · ) is differentiable on R m and there exists c > s.t. σ q C ( · ) is ( c, q ) -uniformlysmooth on R m with respect to k · k C ◦ for some c > .(f) (Gauge UC) There exists α > s.t. k · k p C is ( α, p ) -uniformly convex with respect to k · k C (Definition3.2).Proof of Theorem 4.1. (a) = ⇒ (c). Let ( x, y ) ∈ S k·k C (1) . For z = x + y k x + y k C , we have x + y + α k x − y k p C z ∈ C .Hence (cid:13)(cid:13)(cid:13) x + y (cid:13)(cid:13)(cid:13) C (cid:16) α k x − y k p C k x + y k C (cid:17) ≤ . 
This shows that − k ( x + y ) / k C ≥ α k x − y k p C and hence δ k·k C ( ǫ ) ≥ αǫ p .(c) = ⇒ (a). Recall that the modulus of convexity ρ k·k C ( ǫ ) in (3), can be written as the infimum over ( x, y ) ∈ B k·k (1) instead of S k·k (1) , see discussion following [LT13, Definition 1.e.1]. Let ( x, y ) ∈ C . Bydefinition of the modulus of convexity, we have − k ( x + y ) / k C ≥ α k x − y k p C . Hence by the triangleinequality, for any z ∈ B k·k C (1) , we have k ( x + y ) / α k x − y k p C z k C ≤ , so that ( x + y ) / α k x − y k p C z ∈C . (a) = ⇒ (b). Let x ∈ C , y ∈ ∂ C and d ∈ R m s.t. d ∈ N C ( y ) . We have y ∈ argmax v ∈C h d ; v i . Because ( x + y ) / α k x − y k p C z ∈ C , for any z ∈ B k·k C (1) , the optimality of y implies h d ; ( x + y ) / α k x − y k p C z i ≤ h d ; y i . Hence, for any z ∈ B k·k C (1) we have α k x − y k p C h d ; z i ≤ h d ; y − x i . By definition of the dual norm, wehence have α k x − y k p C k d k ⋆ C ≤ h d ; y − x i and conclude with k d k ⋆ C = k d k C ◦ . b) = ⇒ (d). Let ( d , d ) ∈ S k·k C◦ (1) and consider v d i ∈ argmax v ∈C h d i ; v i for i = 1 , . We have thatfor any x ∈ C ( h d ; v d − x i ≥ α k d k C ◦ · k v d − x k p C = α k v d − x k p C h d ; v d − x i ≥ α k d k C ◦ · k v d − x k p C = α k v d − x k p C . Then, by summing the two inequalities evaluated respectively at x = v d and x = v d , we have h d − d ; v d − v d i ≥ α k v d − v d k p C . By Cauchy-Schwartz and since v d i = ∇ σ C ( d i ) for i = 1 , (Lemma 2.1 applies because C is strictly convexand d i = 0 ), we obtain k d − d k C ◦ · k∇ σ C ( d ) − ∇ σ C ( d ) k C ≥ α k∇ σ C ( d ) − ∇ σ C ( d ) k p C , and conclude that k∇ σ C ( d ) − ∇ σ C ( d ) k C ≤ α ) / ( p − k d − d k / ( p − C ◦ . Note finally that / ( p −
1) = q − .(e) = ⇒ (c). Note that [BGHV09, (d) = ⇒ (a) of Theorem 2.2.] is not constructive and that [DGZ93,(ii) = ⇒ (i) in Lemma 5.1.] is incomplete as it only proves that the modulus of smoothness has theright lower-bound for τ ∈ [0 , / . [LT13] do not consider these aspects and [K ¨ot83, §26] neither. [Zˇa02,(iii) of Theorem 3.7.4.] bears some similarity. Recall the duality between support and gauge functions σ C ( · ) = k · k C ◦ . We now show that C ◦ is uniformly smooth by providing an upper bound on its modulusof smoothness and conclude on (c) by duality. Recall that for τ > , the modulus of smoothness of C ◦ isdefined as ρ C ◦ ( τ ) = sup (cid:8)(cid:0) k d + τ d k C ◦ + k d − τ d k C ◦ (cid:1) / − (cid:12)(cid:12) k d k C ◦ = k d k C ◦ = 1 (cid:9) . Consider ( d , d ) ∈ S k·k C◦ , since σ q C is ( c, q ) -uniformly smooth on R m and by equivalence between (a) and(b) in Definition 3.1, we have k d + τ d k q C ◦ ≤ h∇k · k q C ◦ ( d ); τ d i + 2 cq k τ d k q C ◦ k d − τ d k q C ◦ ≤ − h∇k · k q C ◦ ( d ); τ d i + 2 cq k τ d k q C ◦ . When q ∈ ]1 , , (1 + x ) /q is concave and below its tangent. In particular, (1 + x ) /q ≤ x/q . Hence,combined with k d k C ◦ = 1 , we have k d + τ d k C ◦ ≤ q h∇k · k q C ◦ ( d ); τ d i + 2 cq τ q k d − τ d k C ◦ ≤ − q h∇k · k q C ◦ ( d ); τ d i + 2 cq τ q . Then summing the two inequalities and dividing by , we obtain (cid:0) k d + τ d k C ◦ + k d − τ d k C ◦ (cid:1) / − ≤ cq τ q . Hence, C ◦ is (2 c/q , q ) -uniformly smooth. Then Proposition 3.7 (a), implies that C is (1 / (2 p (2 αq ) p − ) , p ) -uniformly convex with α = 2 c/q , i.e., C is ( q p − / (2 p − pc p − )) -uniformly convex.(f) = ⇒ (e) From Lemma 2.2, we have that (cid:0) k · k p C (cid:1) ⋆ ( · ) = h p / ( p − − p q i σ q C ( · ) . Then Item (b) ofProposition 3.3 implies that h p − p q − i σ q C ( · ) is ( c ′ , q ) -uniformly smooth on C with respect to k · k ⋆ C = k · k C ◦ and c ′ = 1 / ( qc q − ) . Hence, σ q C is ( (cid:2) p q − ( p − qc q − (cid:3) , q ) -uniformly smooth. Note also that by equivalence between(a) and (b) in Definition 3.1, we have that σ q C is differentiable. e) = ⇒ (f). Conversely, let us assume that σ q C is ( α, q ) -uniformly smooth. From Lemma 2.2, we havethat (cid:0) σ q C (cid:1) ⋆ ( · ) = h q / ( q − − q p i k · k p C ( · ) . And, with Proposition 3.3 (a), h q − q p − i k · k p C ( · ) is ( c ′ , p ) -uniformlyconvex with respect to k · k C with c ′ = 1 / ( pα p − ) . Finally, we conclude that | · k p C ( · ) is ( (cid:2) q p − ( q − pα p − (cid:3) , p ) -uniformly convex.(d) = ⇒ (e). Conversely, let us show that σ q C ( · ) is uniformly smooth. The proof follows that of [BGHV09,Theorem 2.1.]. Let us start by showing that σ q C ( · ) is differentiable on R m . For d ∈ R m \ { } , we have ∇ σ q C ( d ) = q k d k q − ∇ σ C ( d ) . Because C is strictly convex, there is a unique x ∈ ∂ C s.t. d ∈ N C ( x ) .From [Sch14, Corollary 1.7.3.], we have that ∇ σ C ( d ) = x . Because q > , when d converges to , wehave that ∇ σ q C ( d ) also converges to zero. Hence, σ q C is differentiable at zero with ∇ σ q C (0) = 0 .Let ( d , d ) ∈ R m and x i ∈ ∂ C s.t. d i ∈ N C ( x i ) , i.e. , ∇ σ C ( d i ) = x i . Because σ C is H ¨older smooth on S k·k C◦ , we have k∇ σ C ( d ) − ∇ σ C ( d ) k C ≤ c k d / k d k C ◦ − d / k d k C ◦ k / ( q − C ◦ . 
We then obtain k∇ σ q C ( d ) − ∇ σ q C ( d ) k C = k qσ q − C ( d ) ∇ σ C ( d ) − qσ q − C ( d ) ∇ σ C ( d ) k C ≤ qσ q − C ( d ) (cid:13)(cid:13) ∇ σ C ( d ) − ∇ σ C ( d ) (cid:13)(cid:13) C + q (cid:13)(cid:13) ∇ σ C ( d ) (cid:13)(cid:13) C (cid:12)(cid:12) σ q − C ( d ) − σ q − C ( d ) (cid:12)(cid:12) ≤ qc k d k q − C ◦ (cid:13)(cid:13) d / k d k C ◦ − d / k d k C ◦ (cid:13)(cid:13) q − C ◦ + q (cid:12)(cid:12) k d k q − C ◦ − k d k q − C ◦ (cid:12)(cid:12) ≤ qc (cid:13)(cid:13) d − d (cid:0) k d k C ◦ / k d k C ◦ (cid:1)(cid:13)(cid:13) q − C ◦ + q (cid:12)(cid:12) k d k q − C ◦ − k d k q − C ◦ (cid:12)(cid:12) . We have for λ , λ > and r ∈ ]0 , | λ r − λ r | ≤ | λ − λ | r [BGHV09, Lemma 2.1.]. Hence, for q − ∈ ]0 , , we have (cid:12)(cid:12) k d k q − C ◦ − k d k q − C ◦ (cid:12)(cid:12) ≤ (cid:12)(cid:12) k d k C ◦ − k d k C ◦ (cid:12)(cid:12) q − ≤ k d − d k q − C ◦ . Also, with thetriangle inequality (cid:13)(cid:13) d − d (cid:0) k d k C ◦ / k d k C ◦ (cid:1)(cid:13)(cid:13) ≤ k d − d k C ◦ + k d k C ◦ − k d k C ◦ ≤ k d − d k C ◦ . Hence k∇ σ q C ( d ) − ∇ σ q C ( d ) k C ≤ q ( c q − + 1) k d − d k q − C ◦ . Equivalence between (a) and (c) in Definition 3.1 shows that σ q C is (2 q ( c q − + 1) , q ) -uniformly smooth.(e) = ⇒ (d). Let ( d , d ) ∈ S k·k C◦ (1) . Since, for i = 1 , , ∇ σ q C ( d i ) = qσ C ( d i ) q − ∇ σ C ( d ) = qσ C ( d ) ,we directly have (because of the equivalence between (a) and (c) in Definition 3.1) k∇ σ C ( d ) − ∇ σ C ( d ) k C ≤ cq k d − d k / ( p − C ◦ . Remark 1.
From the proof of Theorem 4.1, one can obtain quantitative results. (a) and (c) are equiv-alent with the same constant. (a) with ( α, p ) implies (b) with (2 α, p ) ; (b) with ( α, p ) implies (d) with (1 / (2 α ) q − , q − ; (e) with ( c, q ) implies (c) with ( q p − / (2 p − pc p − ) , p ) ; (f) with ( α, p ) implies (e) with ( p q − / (( p − qc q − ) , q ) ; Conversely, (e) with ( α, q ) implies (f) with ( q p − / (( q − pα p − ) , p ) ; Finally, (d)with ( c, q − implies (e) with (2 q ( c q − + 1)) .
5. E
QUIVALENCE BETWEEN L OCAL S ET AND F UNCTIONAL A SSUMPTIONS
In this section, we provide equivalent characterizations of the local uniform convexity of C at x ∗ ∈ ∂ C .The results are summarized in Theorem 5.1, the analog to Theorem 4.1. We seek to articulate differentuseful views on the local uniform convexity property of a set.Item (a) is a Banach geometry definition via the local modulus of rotundity. Item (b) is a geometric local scaling inequality useful in some algorithm analysis, see for instance the Frank-Wolfe method on locallyuniformly convex sets [KdP20]. Note that a natural local version of (Global-Scaling), could be that for any d ∈ N C ( x ∗ ) , for any x ∈ C , we require h d ; x ∗ − x i ≥ α k d k C ◦ k x ∗ − x k q C . However, we opted for a weaker version in (Local-Scaling) which expresses the property only with respectto a single direction in the normal cone at the point of interest. Finally Items (b) and (e) connect these eometrical characterization with their functional counterpart, both in term of smoothness and uniform con-vexity. These results appear scattered in the literature, see, e.g. , [Zˇa83, Chapter 3.7] or [AP95, Proposition3.2.]. We expect these various equivalences to provide convergence proof of algorithms in online and offlinesettings when the decision sets or constraints sets are not globally strongly convex. We provide an exampleof such a result in Section 6.1. Theorem 5.1 (Local Set Uniform Convexity) . Consider p ≥ and q ∈ ]1 , s.t. p + q = 1 . Let C bea compact strictly convex set centrally symmetric with nonempty interior. Let x ∗ ∈ ∂ C , d ∈ N C ( x ∗ ) ∩ S k·k C◦ (1) (note S k·k C◦ (1) = ∂ C ◦ ). The following assertions are equivalent(a) (Modulus of Rotundity) There exists α > s.t. C is ( α, p ) -locally uniformly convex at x ∗ w.r.t.direction d , i.e., for any ǫ ∈ [0 , , we have ν C ( ǫ, x ∗ , d ) , inf (cid:8) h d ; x ∗ − x i | x ∈ C , k x − x ∗ k C ≥ ǫ (cid:9) ≥ αǫ p . (b) (Local scaling inequality) For any x ∈ C , we have h d ; x ∗ − x i ≥ α k x ∗ − x k p C . (Local-Scaling) (c) (Support Local H¨older-Smooth Sphere) There exists c > s.t. σ C ( · ) is ( c, q − -H¨older smooth at d on S k·k C◦ (1) w.r.t. k · k C ◦ , i.e., for any d ∈ S k·k C◦ (1) , we have (cid:13)(cid:13) ∇ σ C ( d ) − ∇ σ C ( d ) (cid:13)(cid:13) C ≤ c (cid:13)(cid:13) d − d (cid:13)(cid:13) q − C ◦ = (cid:13)(cid:13) d − d (cid:13)(cid:13) / ( p − C ◦ . (d) (Support Local US) There exists c > s.t. σ q C ( · ) is ( α, q ) -uniformly smooth at d w.r.t k · k C ◦ , i.e.,for any d ∈ R m , we have σ q C ( d ) ≤ σ q C ( d ) + q h x ∗ ; d − d i + αq k d − d k q C ◦ , where ∇ σ q C ( d ) = qx ∗ .(e) (Gauge local UC) There exists µ > s.t. k · k p C is ( µ, p ) -uniformly convex at x ∗ on C in direction d w.r.t. k · k C , i.e., for any y ∈ R m k y k p C ≥ k x ∗ k p C + p h d ; y − x ∗ i + µ k y − x ∗ k p C . Proof of Theorem 5.1.
Because C is strictly convex, σ C is differentiable on R m \ { } , see Lemma 2.1.In particular, ∇ σ C ( d ) = x ∗ since d ∈ N C ( x ∗ ) . Also, because k d k C ◦ = 1 , note that ∇ σ q C ( d ) = q k d k q − C ◦ ∇ σ C ( d ) = qx ∗ Finally, note that k · k C = σ C ◦ is not necessarily differentiable (would requireassuming that C ◦ is smooth).(a) ⇐⇒ (b) is immediate.(a) = ⇒ (c). Let us assume that C is ( α, p ) -uniformly convex at x ∗ ∈ ∂ C with respect to d ∈ S k·k C◦ (1) ∩ N C ( x ∗ ) , i.e. , for any ǫ > , ν C ( ǫ, x ∗ , d ) ≥ αǫ p . Hence, we have for any x ∈ Ch d ; x ∗ − x i ≥ α k x − x ∗ k p C . Let d ∈ S k·k C◦ (1) and x , argmax x ∈C h x ; d i (it is unique because C is strictly convex compact). Inparticular, h x ∗ − x ; d i ≤ , hence we have h d − d ; x ∗ − x i ≥ h d − d ; x ∗ − x i + h d ; x ∗ − x i | {z } ≤ = h d ; x ∗ − x i ≥ α k x ∗ − x k p C . Then, with Cauchy-Schwartz we have k d − d k C ◦ k x ∗ − x k C ≥ α k x ∗ − x k p C . Hence, k x − x ∗ k C ≤ α / ( p − k d − d k / ( p − C ◦ . ith Lemma 2.1, we have ∇ σ C ( d ) = x and x ∗ = ∇ σ C ( d ) , which concludes with q − / ( p − .(d) = ⇒ (a). Let us now assume that σ q C ( · ) is ( α, q ) -uniformly smooth at d w.r.t k·k C ◦ . Also σ C ( · ) = k·k C ◦ .Let us first prove an upper bound on the local modulus of smoothness ρ C ◦ ( t, d , x ∗ ) of C ◦ at d w.r.t. x ∗ ,see (Loc. Smoothness). By the duality formula (Loc. Lindenstrauss), we will then obtain a lower bound onthe modulus of rotundity. Recall that the local modulus of smoothness in (Loc. Smoothness) is defined forany t > , as ρ C ◦ ( t, d , x ∗ ) = sup (cid:8) k d + td k C ◦ − k d k C ◦ − t h x ∗ ; d i | d ∈ C ◦ (cid:9) . By (d), we have for any d ∈ R m k d + td k q C ◦ ≤ k d k q C ◦ + t h∇ σ q C ( d ); d i + αq t q k d k q C ◦ . Recall from the beginning of the proofs that ∇ σ q C ( d ) = qx ∗ . Then, we have by concavity of (1 + x ) /q when q ∈ ]1 , k d + td k C ◦ ≤ (cid:16) tq h x ∗ ; d i + αq t q (cid:17) q ≤ t h x ∗ ; d i + αq t q . In particular, for d ∈ C ◦ and because k d k C ◦ = 1 , we have ρ C ◦ ( t, d , x ∗ ) ≤ α/q t q . Then, with Lemma3.8, we have that for any ǫ ∈ [0 , and t > sup ǫ ∈ [0 , (cid:8) ǫt − ν C ( ǫ, x ∗ , d ) (cid:9) ≤ α/q t q . Hence, for any ǫ ∈ [0 , ν C ( ǫ, x ∗ , d ) ≥ ǫt − α/q t q . Then for t = ( qǫ/α ) / ( q − , we have ν C ( ǫ, x ∗ , d ) ≥ q p − α p − ( q − ǫ p . Therefore, C is ( q p − α p − ( q − , p ) -locally uniformly convex at x ∗ with respect to d .(c) = ⇒ (d). The proof is similar to that of (d) = ⇒ (f) in Theorem 4.1, we repeat it for completeness.First, by the very same argument, σ q C is differentiable on R m (recall that σ C is not differentiable at ). Now,consider d ∈ R m \ { } and the unique (because C is strictly convex) x ∈ ∂ C s.t. d ∈ N C ( x ) . Then,with Lemma 2.1, we have ∇ σ C ( d ) = x and with the same argument ∇ σ C ( d / k d k C ◦ ) = x . Because σ C is H ¨older smooth at d on S k·k C◦ , we have k∇ σ C ( d ) − ∇ σ C ( d ) k C ≤ c k d − d / k d k C ◦ k / ( q − C ◦ . 
We thenobtain, by adding and subtracting qσ q − C ( d ) ∇ σ C ( d ) and applying the triangle inequality k∇ σ q C ( d ) − ∇ σ q C ( d ) k C = k qσ q − C ( d ) ∇ σ C ( d ) − qσ q − C ( d ) ∇ σ C ( d ) k C ≤ qσ q − C ( d ) (cid:13)(cid:13) ∇ σ C ( d ) − ∇ σ C ( d ) (cid:13)(cid:13) C + q (cid:13)(cid:13) ∇ σ C ( d ) (cid:13)(cid:13) C (cid:12)(cid:12) σ q − C ( d ) − σ q − C ( d ) (cid:12)(cid:12) ≤ qc k d k q − C ◦ (cid:13)(cid:13) d / k d k C ◦ − d / k d k C ◦ (cid:13)(cid:13) q − C ◦ + q (cid:12)(cid:12) k d k q − C ◦ − k d k q − C ◦ (cid:12)(cid:12) ≤ qc (cid:13)(cid:13) d − d (cid:0) k d k C ◦ / k d k C ◦ (cid:1)(cid:13)(cid:13) q − C ◦ + q (cid:12)(cid:12) k d k q − C ◦ − k d k q − C ◦ (cid:12)(cid:12) . We have for λ , λ > and r ∈ ]0 , | λ r − λ r | ≤ | λ − λ | r . Hence, for q − ∈ ]0 , , we have (cid:12)(cid:12) k d k q − C ◦ − k d k q − C ◦ (cid:12)(cid:12) ≤ (cid:12)(cid:12) k d k C ◦ − k d k C ◦ (cid:12)(cid:12) q − ≤ k d − d k q − C ◦ . Also, by the triangle inequality, (cid:13)(cid:13) d − d (cid:0) k d k C ◦ / k d k C ◦ (cid:1)(cid:13)(cid:13) ≤ k d − d k C ◦ + k d k C ◦ − k d k C ◦ ≤ k d − d k C ◦ . Hence for any d ∈ R m \ { }k∇ σ q C ( d ) − ∇ σ q C ( d ) k C ≤ q ( c q − + 1) k d − d k q − C ◦ . et us now prove that this implies a first-order type definition of local smoothness. For any d , by the meanvalue theorem, there exists λ ∈ ]0 , such that σ q C ( d ) − σ q C ( d ) = h∇ σ q C ( λd + (1 − λ ) d ); d − d i = h∇ σ q C ( d ); d − d i + h∇ σ q C ( λd + (1 − λ ) d ) − ∇ σ q C ( d ); d − d i≤ h∇ σ q C ( d ); d − d i + k∇ σ q C ( λd + (1 − λ ) d ) − ∇ σ q C ( d ) k C k d − d k C ◦ ≤ h∇ σ q C ( d ); d − d i + q ( c q − + 1) k d − d k q C ◦ . Hence σ q C is ( q ( c q − + 1) , q ) -uniformly convex at d w.r.t. k · k C ◦ .Equivalence between (d) and (e) stems from Proposition 3.4. Indeed, from Lemma 2.2 we have that ( q σ q C ) ⋆ ( · ) = p k · k p C . Then, because ( x ∗ , d ) ∈ ∂ C × N C ( x ∗ ) ∩ ∂ C ◦ , we have ( x ∗ , d ) ∈ ∂ q σ q C ( d ) × ∂ p k ·k p C ( x ∗ ) and we can indeed apply Proposition 3.4.6. A PPLICATIONS
Theorems 4.1 and 5.1 offer different points of view on uniform convexity properties which yield improvedrates in optimization or learning. We now detail three situations where the equivalence relationships detailedabove lead to new results.In Section 6.1, we show that the ℓ p balls with p > are locally strongly convex on some points of theirboundaries, while not being globally strongly convex. This leads to novel linear convergence results forvanilla Frank-Wolfe algorithm on some curved sets that are not strongly convex.In Section 6.2, we leverage a result on the geometry of Banach spaces, showing the inclusion of uniformlyconvex spaces into Rademacher spaces of type q . The equivalence between the UC of norms balls and spaceUC then implies generalization bounds on low norm linear predictors.In Section 6.3, we show how the Primal Averaging Frank-Wolfe algorithm [Lan13, Algorithm 4] exhibitsaccelerated sublinear rates w.r.t. the O (1 /T ) baseline when the constraint set is uniformly convex andinf x ∈C k∇ f ( x ) k > c > . The sublinear rates are slower than those of Frank-Wolfe with exact line-searchor short-steps on uniformly convex sets but are obtained with (cheaper) pre-determined function agnosticstep-sizes, and in fact oblivious of any structure of the problem. To our knowledge, this is the only versionof Frank-Wolfe achieving accelerated convergence w.r.t. O (1 /T ) with such agnostic step-sizes.6.1. Linear Convergence Rates for Vanilla Frank-Wolfe on Non-Strongly Convex Sets.
Here, we ap-ply Theorem 5.1 to derive accelerated convergence rates of algorithms solving the following constrainedoptimization problem minimize x ∈C f ( x ) , (OPT)where f is smooth convex function and C a compact convex set. Write x ∗ a solution of (OPT). [KdP20]shows that when a local scaling inequality holds at x ∗ with p ≥ , α > , i.e. , for any x ∈ Ch−∇ f ( x ∗ ); x ∗ − x i ≥ α k∇ f ( x ∗ ) k ⋆ k x ∗ − x k p , (5)then the vanilla Frank-Wolfe algorithm has an accelerated convergence rate compared to O (1 /T ) . By op-timality, −∇ f ( x ∗ ) ∈ N C ( x ∗ ) , and (5) is ensured when (b) in Theorem 5.1 holds. While the local scalinginequalities are key to the convergence analyses, they are harder to check than the other equivalent con-ditions in Theorem 5.1. In the following lemma, we show that although ℓ p balls are not strongly convexwhen p > , there are locally strongly convex ( i.e. ( α, -locally uniformly convex) at any x ∗ ∈ ∂ℓ p (1) s.t. h x ∗ ; e i i 6 = 0 for all i , which means improved convergence rates in this subset of points. Lemma 6.1 (Local Strong Convexity of the ℓ p with p > ) . Consider p > and x = P mi =1 λ i e i ∈ ∂ℓ p (1) s.t. λ i = 0 for all i ∈ [ m ] . Then, there exists α > s.t. ℓ p (1) is ( α, -locally uniformly convex at x . roof of Lemma 6.1. Let us write k·k p the ℓ p norm. With Theorem 5.1 (e), we need to prove that f ( · ) , k·k p is ( α, -uniformly convex at x = P mi =1 λ i e i ∈ ∂ℓ p (1) s.t. λ i = 0 for all i ∈ [ m ] . Note that Item (e) ofTheorem 5.1 requires a quadratic lower bound on R m . Here, we only prove it on a compact domain.However, equivalence with Item (b) of Theorem 5.1 is also valid with such a restriction. We omit theproof. Without loss of generality, by central symmetry of ℓ p , let us assume that all λ i > . Note then that P i λ pi = 1 . f is convex and twice differentiable at x . Let us first prove that the Hessian H f ( x ) has no zeroeigenvalues. We have ∂ f∂x i ( x ) = 2( p − λ p − i + 2(2 − p ) λ p − i ∂ f∂x i ∂x j ( x ) = 2(2 − p )( λ i λ j ) p − . Hence, the Hessian of f at x is of the form H f ( x ) = 2( p − diag ( λ p − , . . . , λ p − m ) + (cid:16) − p )( λ i λ j ) p − (cid:17) ≤ i,j ≤ m . Write
$\Lambda = (\lambda_i)_{i=1,\ldots,m}$. We then have
$$H_f(x) = 2(2-p)\Big[\tfrac{p-1}{2-p}\,\mathrm{diag}(\Lambda^{p-2}) + (\Lambda^{p-1})^T\Lambda^{p-1}\Big].$$
Then, note that for an invertible diagonal matrix $D = \mathrm{diag}(d_1,\ldots,d_m)$ and a vector $h = (h_1,\ldots,h_m)$, we have
$$\det\big(D + h^T h\big) = \det(D)\det\big(I_m + D^{-1}h^T h\big) = \Big(1 + \sum_{i=1}^m \frac{h_i^2}{d_i}\Big)\prod_{i=1}^m d_i.$$
We then have
$$\det\big(H_f(x)\big) = \big(2(2-p)\big)^m\Big(\frac{p-1}{2-p}\Big)^m\Big(1 + \frac{2-p}{p-1}\sum_{i=1}^m \lambda_i^{2p-2}/\lambda_i^{p-2}\Big)\prod_{i=1}^m\lambda_i^{p-2} = \big(2(p-1)\big)^m\Big(1 + \frac{2-p}{p-1}\sum_{i=1}^m\lambda_i^p\Big)\prod_{i=1}^m\lambda_i^{p-2} = 2^m(p-1)^{m-1}\prod_{i=1}^m\lambda_i^{p-2} > 0,$$
so that $H_f(x) \succ 0$. This ensures that on the compact domain $\mathcal{C}$, there exists a value $\mu > 0$ s.t. for any $y \in \mathcal{C}$,
$$\|y\|_{\mathcal{C}}^2 \ge \|x\|_{\mathcal{C}}^2 + \langle\nabla\|\cdot\|_{\mathcal{C}}^2(x); y - x\rangle + \mu\|y - x\|_{\mathcal{C}}^2.$$
This corresponds to Theorem 5.1 (e).

When $p > 2$, the $\ell_p$ balls are not globally strongly convex. However, Lemma 6.1 shows that they are locally strongly convex at any boundary point which has no zero coordinate in the canonical basis. In the following corollary, we show that this yields linear convergence rates for the vanilla Frank-Wolfe algorithm on $\ell_p$ balls (with $p > 2$), as analysed in [KdP20].

Corollary 6.2 (Linear Rates for FW on $\ell_p$ Balls with $p > 2$). Consider a convex smooth function $f$ such that $\inf_{x\in\mathcal{C}}\|\nabla f(x)\| > c > 0$ and $\mathcal{C} = \ell_p(1)$ with $p > 2$. Assume the solution $x^*$ of (OPT) has no zero coordinates in the canonical basis. Then the Frank-Wolfe algorithm with exact line-search or short step size converges linearly.

Proof of Corollary 6.2. We use Lemma 6.1 with [KdP20, Theorem 2.5].
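The determinant computation above can be sanity-checked numerically. The following sketch is our illustration (not part of the paper's argument): it evaluates the closed-form Hessian of $\|\cdot\|_p^2$ from the proof of Lemma 6.1 at boundary points of $\ell_p(1)$, showing that its smallest eigenvalue is strictly positive when all coordinates are nonzero and degenerates towards zero as a coordinate vanishes.

```python
import numpy as np

def hessian_lp_sq(lam, p):
    """Closed-form Hessian of x -> ||x||_p^2 (from the proof of Lemma 6.1)
    at a point lam on the unit l_p sphere with strictly positive coordinates."""
    r = lam ** (p - 1)
    return 2 * (p - 1) * np.diag(lam ** (p - 2)) + 2 * (2 - p) * np.outer(r, r)

p, m = 4.0, 5
rng = np.random.default_rng(0)
lam = rng.uniform(0.1, 1.0, m)
lam /= np.sum(lam ** p) ** (1 / p)                # normalize onto the unit l_p sphere
print(np.linalg.eigvalsh(hessian_lp_sq(lam, p)).min())     # strictly positive

for eps in (1e-1, 1e-2, 1e-3):                    # one coordinate going to zero
    lam = np.concatenate(([eps], np.ones(m - 1)))
    lam /= np.sum(lam ** p) ** (1 / p)
    print(eps, np.linalg.eigvalsh(hessian_lp_sq(lam, p)).min())   # tends to 0
```

The smallest eigenvalue behaves like $2(p-1)\lambda_{\min}^{p-2}$, which is consistent with local strong convexity failing exactly at boundary points with a zero coordinate.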
6.2. Uniform Smoothness, Rademacher Type, and Generalization Bounds.
Here, we show an example where the equivalence between the uniform convexity of the gauge and the Banach space's uniform convexity provides another perspective on a generalization bound for low-norm linear predictors [KST09, Theorem 1] with strongly convex norm balls. We also generalize it to uniformly convex regularizing balls.

Consider a hypothesis class $\mathcal{F}$ of functions $f: \mathcal{X} \to \mathbb{R}$ and $n$ points $(x_i) \in \mathcal{X} \subset \mathbb{R}^m$, sampled from a distribution $\mu$ on $\mathcal{X}$. For $(\epsilon_i)$ a sequence of i.i.d. Bernoulli random variables, the Rademacher constant is defined as
$$R_n(\mathcal{F}) \triangleq \mathbb{E}_{(\epsilon_i),(x_i)}\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n f(x_i)\epsilon_i\Big|\Big]. \tag{Rademacher constant}$$
The Rademacher constant is a measure of the hypothesis class complexity, and a key quantity appearing in bounds on the generalization error [Kol01, BBL02, BM02, BBM05]. In Theorem 6.5, we obtain upper bounds on the Rademacher constants of low-norm linear predictors in finite-dimensional spaces. Such hypothesis classes are of the form $\mathcal{F}_{\mathcal{C}} = \{f: x \in \mathcal{X} \mapsto \langle x; w\rangle \mid \|w\|_{\mathcal{C}} \le 1\}$, where $\mathcal{C}$ is a compact convex centrally symmetric set with non-empty interior.
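As a quick illustration of the (Rademacher constant) definition for the linear classes $\mathcal{F}_{\mathcal{C}}$ above, note that for $\mathcal{C} = \ell_p(1)$ the supremum over $\|w\|_p \le 1$ is attained by a dual norm, which makes a Monte Carlo estimate straightforward. The sketch below is ours; the sample distribution and parameters are arbitrary choices for illustration.

```python
import numpy as np

def rademacher_constant_lp(X, p, n_mc=2000, seed=0):
    """Monte Carlo estimate of R_n(F_C) for C = l_p(1) on a fixed sample X of
    shape (n, m): sup over ||w||_p <= 1 of <w, (1/n) sum_i eps_i x_i> equals
    the dual norm ||(1/n) sum_i eps_i x_i||_q, with 1/p + 1/q = 1."""
    q = p / (p - 1)
    rng = np.random.default_rng(seed)
    eps = rng.choice([-1.0, 1.0], size=(n_mc, X.shape[0]))   # i.i.d. sign vectors
    return np.mean(np.linalg.norm(eps @ X / X.shape[0], ord=q, axis=1))

X = np.random.default_rng(1).normal(size=(200, 20))          # illustrative data
print(rademacher_constant_lp(X, p=4.0))
```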
Besides uniform convexity or smoothness, various properties have been designed to further classify Banach spaces. For instance, the definitions [DGZ93, Definition 5.8.] of Rademacher spaces of type $q \in [1,2]$ or cotype $p \in [2,+\infty[$ involve quantities very similar to the Rademacher constant. Note that Rademacher type and cotype are dual properties [LT13, Proposition 1.e.17].

Definition 6.3 (Space of Rademacher type and cotype). A space $(\mathbb{R}^m, \|\cdot\|)$ is Rademacher of type $q \in [1,2]$ if there exists $C > 0$ such that for each finite sequence $(\epsilon_i)_{i=1}^n$ of i.i.d. Bernoulli variables and any fixed finite sequence $(f_i)$ of elements of $\mathbb{R}^m$, it holds that
$$\mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_{i=1}^n \epsilon_i f_i\Big\|^q\Big) \le C\cdot\sum_{i=1}^n \|f_i\|^q. \tag{type $q$}$$
It is of cotype $p \in [2,+\infty[$ if there exists $C > 0$ such that
$$\sum_{i=1}^n \|f_i\|^p \le C\cdot\mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_{i=1}^n \epsilon_i f_i\Big\|^p\Big). \tag{cotype $p$}$$

The Rademacher type of Banach spaces has been leveraged in a variety of results in machine learning. For instance, for some classes of low-norm linear predictors, [LLNT17] connect the duality between type and cotype (of the norm defining the hypothesis class) to the duality between stable (as they define it) learning and generalization bounds of the corresponding problem. Slightly generalizing the Rademacher type, the martingale type/cotype of Banach spaces has been extensively studied in online learning. A series of works have shown the equivalence between optimal regret bounds and the martingale type of the space associated to the decision set [SST11, RS17]. Such links are not surprising, as connections between martingale properties, the study of Banach spaces, and concentration inequalities have long been known [Pis75, Pin94]; see [Pis11, BLM13] for recent references.

Uniform convexity is often invoked along with the martingale/Rademacher type property [SST11, Section 6]. Indeed, a uniformly smooth space of power type $q \in ]1,2]$ is also a Rademacher Banach space of type $q$ [DGZ93, Lemma 5.9.], while the converse is not true [Jam78]. We recall a self-contained proof of that result [LT13, Theorem 1.e.16] for finite-dimensional spaces.

Proposition 6.4 (Uniformly Smooth and Rademacher Spaces). Let $q \in ]1,2]$. A normed space $(\mathbb{R}^m, \|\cdot\|)$ that is $(\alpha,q)$-uniformly smooth is also Rademacher of type $q$.

Proof of Proposition 6.4. Let $p \ge 2$ be s.t. $1/p + 1/q = 1$. Assume that $(\mathbb{R}^m,\|\cdot\|)$ is $(\alpha,q)$-uniformly smooth with $\alpha > 0$ and $q \in ]1,2]$. Then, with Proposition 3.7 (a), we have that $(\mathbb{R}^m,\|\cdot\|_\star)$ is $\big(1/(2^p(2\alpha q)^{1/(q-1)}), p\big)$-uniformly convex. From the equivalence between (c) and (e) in Theorem 4.1, we finally have that $\|\cdot\|^q$ is $(c',q)$-uniformly smooth w.r.t. $\|\cdot\|$ (where $c'$ depends only on $(p,\alpha)$). By the first-order definition of the uniform smoothness of $\|\cdot\|^q$, we have for any $(x,h) \in (\mathbb{R}^m)^2$
$$\|x+h\|^q \le \|x\|^q + \langle\nabla\|\cdot\|^q(x); h\rangle + \frac{c'}{q}\|h\|^q \quad\text{and}\quad \|x-h\|^q \le \|x\|^q - \langle\nabla\|\cdot\|^q(x); h\rangle + \frac{c'}{q}\|h\|^q.$$
Summing these, we obtain for any $(x,h) \in (\mathbb{R}^m)^2$
$$\|x+h\|^q + \|x-h\|^q - 2\|x\|^q \le \frac{2c'}{q}\|h\|^q.$$
We now repeat the inductive argument of [DGZ93, Lemma 5.9.] and prove, for any $n \ge 1$, any finite sequence of i.i.d. Bernoulli random variables $(\epsilon_i)$ and any elements $(x_i)$ of $\mathbb{R}^m$ of size $n$, that
$$\mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_{i=1}^n \epsilon_i x_i\Big\|^q\Big) \le \frac{c'}{q}\sum_{i=1}^n \|x_i\|^q. \tag{6}$$
It is trivial for $n = 1$. Assume (6) is true for $n \ge 1$. We have
$$\mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_{i=1}^{n+1}\epsilon_i x_i\Big\|^q\Big) = \frac12\,\mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_{i=1}^n \epsilon_i x_i + x_{n+1}\Big\|^q + \Big\|\sum_{i=1}^n \epsilon_i x_i - x_{n+1}\Big\|^q\Big) \le \mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_{i=1}^n \epsilon_i x_i\Big\|^q\Big) + \frac{c'}{q}\|x_{n+1}\|^q \le \frac{c'}{q}\sum_{i=1}^{n+1}\|x_i\|^q.$$
Hence, $(\mathbb{R}^m,\|\cdot\|)$ is Rademacher of type $q$ (with constant $c'/q$).

To the best of our knowledge, [DDGS97] first points out the link between uniform convexity and the Rademacher type of the space in a learning framework. While the Rademacher type (resp. cotype) property is weaker than uniform smoothness (resp. convexity), establishing generalization results with uniform convexity/smoothness properties, as in [KST09, Theorem 1], makes the assumptions much easier to interpret. This seems not to have been exploited directly to obtain upper bounds on Rademacher constants. We now extend the results of [KST09, Theorem 1] using that insight.
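Before doing so, the (type $q$) inequality of Proposition 6.4 can be observed numerically. The following sketch is ours: it contrasts the bounded type ratio of a uniformly smooth norm ($\ell_{3/2}$) with the failure of type 2 for the non-smooth $\ell_1$ norm, on the standard extremal sequence $x_i = e_i$.

```python
import numpy as np

# Empirical look at the (type q) inequality: for the l_q norm with q in ]1,2]
# (uniformly smooth of power type q), the ratio
#   E||sum_i eps_i x_i||^q / sum_i ||x_i||^q
# stays bounded, whereas the analogous type-2 ratio for l_1 grows with n.
rng = np.random.default_rng(0)

def type_ratio(X, norm_ord, q, n_mc=4000):
    eps = rng.choice([-1.0, 1.0], size=(n_mc, X.shape[0]))
    lhs = np.mean(np.linalg.norm(eps @ X, ord=norm_ord, axis=1) ** q)
    return lhs / np.sum(np.linalg.norm(X, ord=norm_ord, axis=1) ** q)

for n in [10, 100, 1000]:
    X = np.eye(n)                              # extremal sequence x_i = e_i
    print(n,
          type_ratio(X, norm_ord=1.5, q=1.5),  # bounded: l_{3/2} is type 3/2
          type_ratio(X, norm_ord=1, q=2))      # grows like n: l_1 is not type 2
```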
Theorem 6.5. Let $p \ge 2$ and $q \in ]1,2]$ be s.t. $\frac1p + \frac1q = 1$. Consider $\mathcal{F}_{\mathcal{C}} = \{f: x\in\mathcal{X} \mapsto \langle x; w\rangle \mid \|w\|_{\mathcal{C}} \le 1\}$, where $\mathcal{C}$ is a centrally symmetric compact convex set with non-empty interior. Assume $\mathcal{C}$ is $(\alpha,p)$-uniformly convex with $\alpha > 0$. Then, there exists $C > 0$ (a function of $p$ and $\alpha$) s.t.
$$R_n(\mathcal{F}_{\mathcal{C}}) \le \frac{C^{1/q}D}{n^{1/p}}, \quad\text{where } D = \sup_{x\in\mathcal{X}}\|x\|_{\mathcal{C}^\circ}.$$

Proof of Theorem 6.5. Since $\mathcal{C}$ is $(\alpha,p)$-uniformly convex, the space normed with $\|\cdot\|_{\mathcal{C}^\circ}$ is $\big(1/(q(2\alpha p)^{q-1}), q\big)$-uniformly smooth, see Proposition 3.7 (b). Hence, with Proposition 6.4, there exists $C > 0$ (a function of $(\alpha,p)$) s.t. for any sequences $(x_i)$ and $(\epsilon_i)$ of size $n$, we have
$$\mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_i \epsilon_i x_i\Big\|_{\mathcal{C}^\circ}^q\Big) \le C\sum_i \|x_i\|_{\mathcal{C}^\circ}^q. \tag{7}$$
Then, recall that the Rademacher constant is given by
$$R_n(\mathcal{F}_{\mathcal{C}}) = \mathbb{E}_{(\epsilon_i),(x_i)}\Big[\sup_{f\in\mathcal{F}_{\mathcal{C}}}\frac1n\sum_{i=1}^n f(x_i)\epsilon_i\Big] = \mathbb{E}_{(\epsilon_i),(x_i)}\Big[\sup_{\|w\|_{\mathcal{C}}\le1}\Big\langle w; \frac1n\sum_{i=1}^n x_i\epsilon_i\Big\rangle\Big].$$
By definition of the dual norm, we have $\langle w; \frac1n\sum_{i=1}^n x_i\epsilon_i\rangle \le \|w\|_{\mathcal{C}}\,\|\frac1n\sum_{i=1}^n x_i\epsilon_i\|_{\mathcal{C}^\circ}$, hence
$$\mathbb{E}_{(\epsilon_i),(x_i)}\Big[\sup_{\|w\|_{\mathcal{C}}\le1}\Big\langle w;\frac1n\sum_{i=1}^n x_i\epsilon_i\Big\rangle\Big] \le \mathbb{E}_{(\epsilon_i),(x_i)}\Big[\Big\|\frac1n\sum_{i=1}^n x_i\epsilon_i\Big\|_{\mathcal{C}^\circ}\Big].$$
Write $\theta = \|\frac1n\sum_{i=1}^n x_i\epsilon_i\|_{\mathcal{C}^\circ}$. With $q\in]1,2]$, the function $|x|^{1/q}$ is concave on $\mathbb{R}_+$ and $\theta$ is a non-negative random variable. Hence, we have $\mathbb{E}_\epsilon[(\theta^q)^{1/q}] \le [\mathbb{E}_\epsilon(\theta^q)]^{1/q}$. This implies that
$$\mathbb{E}_{(\epsilon_i)}\Big[\sup_{\|w\|_{\mathcal{C}}\le1}\Big\langle w;\frac1n\sum_{i=1}^n x_i\epsilon_i\Big\rangle\Big] \le \frac1n\Big[\mathbb{E}_{(\epsilon_i)}\Big(\Big\|\sum_{i=1}^n x_i\epsilon_i\Big\|_{\mathcal{C}^\circ}^q\Big)\Big]^{1/q}.$$
Hence, with (7), and taking the expectation w.r.t. the data points, we have
$$R_n(\mathcal{F}_{\mathcal{C}}) \le \frac1n\,\mathbb{E}_{(x_i)}\Big[C\sum_{i=1}^n \|x_i\|_{\mathcal{C}^\circ}^q\Big]^{1/q} \le \frac1n\, n^{1/q}\, C^{1/q} D = \frac{C^{1/q}D}{n^{1/p}},$$
where $D = \sup_{x\in\mathcal{X}}\|x\|_{\mathcal{C}^\circ}$.

Upper bounds on Rademacher constants then induce generalization bounds depending on assumptions on the loss functions, see, e.g., [KST09]. Uniform convexity is stronger than Rademacher type properties, although a major difference is that uniform convexity admits (simple) localized definitions, while martingale or Rademacher type properties are inherently global assumptions. To obtain results in learning theory that depend on the local behavior of the hypothesis class around the optimal solution, current approaches study the global properties of a neighborhood of the hypothesis class around that solution, see, e.g., the local Rademacher constant [BBM05]. An alternative approach would then be to study local properties of the hypothesis class, for instance via local uniform convexity. This is one motivation for Theorem 5.1. [AFM20a, AFM20b] prove tight upper bounds on the Rademacher constant of low-norm linear predictors with $\ell_p$ balls for $p > 1$, which are instances of uniformly convex sets.
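The $n^{-1/p}$ rate of Theorem 6.5 is attained, up to constants, by an orthonormal sample: with $x_i = e_i$ in $\mathbb{R}^m$, $m \ge n$, one has $\|\sum_i\epsilon_i x_i\|_q = n^{1/q}$ for every sign pattern, hence $R_n(\mathcal{F}_{\mathcal{C}}) = n^{1/q-1} = n^{-1/p}$ exactly for $\mathcal{C} = \ell_p(1)$. A short check (our illustration):

```python
import numpy as np

# The n^(-1/p) rate in Theorem 6.5 is tight for C = l_p(1): with x_i = e_i,
# ||sum_i eps_i x_i||_q = n^(1/q) deterministically, so R_n = n^(-1/p).
p = 4.0
q = p / (p - 1)
rng = np.random.default_rng(0)
for n in [10, 100, 1000]:
    X = np.eye(n)                                 # x_i = e_i, ||x_i||_q = 1
    eps = rng.choice([-1.0, 1.0], size=(500, n))
    R_n = np.mean(np.linalg.norm(eps @ X / n, ord=q, axis=1))
    print(n, R_n, n ** (-1 / p))                  # the two columns coincide
```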
6.3. Primal Averaging Frank-Wolfe on Uniformly Convex Sets. The Primal Averaging Frank-Wolfe (PAFW) method was developed in [Lan13, Algorithm 4] (see Algorithm 1); it replaces the projection oracle in Nesterov's accelerated algorithm with a linear optimization oracle. We show here that the theoretical analysis of [Lan13, Corollary 1] holds when the constraint set $\mathcal{C}$ is uniformly convex and the norm of the gradient of $f$ is lower bounded on $\mathcal{C}$, i.e., $\inf_{x\in\mathcal{C}}\|\nabla f(x)\| > c > 0$. To our knowledge, this is the first Frank-Wolfe algorithm with accelerated convergence rates relative to the baseline $O(1/T)$ obtained with agnostic step-sizes, e.g., of the form $2/(k+2)$.
Algorithm 1: Primal Averaging Frank-Wolfe algorithm [Lan13, Algorithm 4]
Input: $x_0 \in \mathcal{C}$, $y_0 = x_0$, and $(\alpha_k) \in [0,1]^{\mathbb{N}}$.
for $k = 1, \ldots$ do
  $z_{k-1} = \frac{k-1}{k+1}\, y_{k-1} + \frac{2}{k+1}\, x_{k-1}$
  $x_k \in \mathrm{argmax}_{v\in\mathcal{C}}\,\langle -\nabla f(z_{k-1}); v\rangle$
  $y_k = (1-\alpha_k)\, y_{k-1} + \alpha_k\, x_k$
end for

[Lan13, Corollary 1] yields an accelerated convergence rate of $O(1/T^2)$ when a specific assumption on the LMO is verified. The following lemma shows that a property similar to their assumption holds for the LMO when the set $\mathcal{C}$ is uniformly convex. In Proposition 6.7, we show how this implies new convergence rates for the Primal Averaging Frank-Wolfe algorithm. This is a direct consequence of Theorem 4.1 (b). In the particular case where the set is strongly convex, this is a variation of [GI17, (i) of Theorem 2.1.].
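For concreteness, the following is a minimal sketch of Algorithm 1 (ours, not the authors' code) on an $\ell_p$ ball, where the LMO is available in closed form; the quadratic objective in the usage example is an arbitrary test problem chosen so that the gradient norm is bounded below on the feasible set.

```python
import numpy as np

def lmo_lp(c, p):
    """argmax_{||v||_p <= 1} <c, v> over the l_p ball (p > 1, c != 0):
    v_i = sign(c_i) (|c_i| / ||c||_q)^(q-1), with q = p/(p-1)."""
    q = p / (p - 1)
    return np.sign(c) * (np.abs(c) / np.linalg.norm(c, ord=q)) ** (q - 1)

def pafw(grad_f, x0, p, n_iter):
    """Algorithm 1 with the oblivious step-sizes alpha_k = 2/(k+2)."""
    x, y = x0.copy(), x0.copy()
    for k in range(1, n_iter + 1):
        z = (k - 1) / (k + 1) * y + 2 / (k + 1) * x
        x = lmo_lp(-grad_f(z), p)           # linear optimization oracle call
        alpha = 2 / (k + 2)
        y = (1 - alpha) * y + alpha * x
    return y

# Usage on a test problem: f(x) = 0.5 ||x - b||^2 over l_4(1), with b
# outside the ball so that inf_C ||grad f|| > 0.
b = np.full(20, 1.0)
y = pafw(lambda u: u - b, np.zeros(20), p=4.0, n_iter=2000)
print(0.5 * np.linalg.norm(y - b) ** 2)
```

Note that the step-sizes require no line-search and no knowledge of the smoothness constant, matching the "agnostic" setting discussed above.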
Lemma 6.6. Consider $\mathcal{C}$ a compact convex set in $\mathbb{R}^m$, $p \ge 2$, $\alpha > 0$, and $(d_1, d_2) \in (\mathbb{R}^m\setminus\{0\})^2$. Let $(v_1, v_2) \in \partial\mathcal{C}$ be s.t. $d_i \in N_{\mathcal{C}}(v_i)$ for $i = 1,2$. If $\mathcal{C}$ is $(\alpha,p)$-uniformly convex, then we have
$$\|v_1 - v_2\| \le \big[\alpha\big(\|d_1\|_\star + \|d_2\|_\star\big)\big]^{-1/(p-1)}\,\|d_1 - d_2\|_\star^{1/(p-1)}.$$

Proof of Lemma 6.6. Because $\mathcal{C}$ is $(\alpha,p)$-uniformly convex, via (b) of Theorem 4.1 applied to $(v_i, d_i)$ for $i = 1,2$, we obtain $\langle d_1; v_1 - v_2\rangle \ge \alpha\|d_1\|_\star\|v_1 - v_2\|^p$ and $\langle d_2; v_2 - v_1\rangle \ge \alpha\|d_2\|_\star\|v_1 - v_2\|^p$. Summing the two inequalities implies that $\langle d_1 - d_2; v_1 - v_2\rangle \ge \alpha(\|d_1\|_\star + \|d_2\|_\star)\|v_1 - v_2\|^p$. Finally, with Cauchy-Schwarz, we obtain $\|v_1 - v_2\| \le [\alpha(\|d_1\|_\star + \|d_2\|_\star)]^{-1/(p-1)}\|d_1 - d_2\|_\star^{1/(p-1)}$.

Hence, if the norms of the $d_i$ for $i = 1,2$ are lower bounded by $c > 0$, and the set is $(\alpha,p)$-uniformly convex with $p \in [2,+\infty[$, we obtain that the condition described in [Lan13] is valid and of the form
$$\|v_1 - v_2\| \le \frac{1}{(2\alpha c)^{1/(p-1)}}\,\|d_1 - d_2\|_\star^{1/(p-1)}.$$
When $\mathcal{C}$ is strongly convex and $\inf_{x\in\mathcal{C}}\|\nabla f(x)\| > c$, it is already known that vanilla Frank-Wolfe with short steps or exact line-search converges linearly [DR70, Dun79]. The difference is that PAFW obtains accelerated convergence with agnostic step-sizes, i.e., $\alpha_k = \frac{2}{k+2}$, which are much cheaper to implement and do not require knowledge of the smoothness constant $L$ of $f$. When the set is uniformly convex but not strongly convex, [KdP20] obtain sublinear rates for vanilla Frank-Wolfe with short steps or exact line-search. The rates in Proposition 6.7 are strictly inferior to the $O(1/T^{1/(1-1/p)})$ of [KdP20] obtained under the same structural assumptions. However, to the best of our knowledge, the accelerated convergence rates of Algorithm 1 are the only ones holding with oblivious step-sizes.
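Lemma 6.6's scaling can be checked empirically on $\mathcal{C} = \ell_p(1)$: the lemma forces $\|d_1 - d_2\|_\star / \big((\|d_1\|_\star + \|d_2\|_\star)\|v_1 - v_2\|^{p-1}\big) \ge \alpha$ for every admissible pair, so the empirical minimum of this ratio should stay bounded away from zero. A sketch (ours; Euclidean norms and a Gaussian sample, both arbitrary choices):

```python
import numpy as np

# Sanity check of Lemma 6.6 on C = l_p(1): v_i is the LMO output for
# direction d_i, so d_i lies in the normal cone N_C(v_i) by construction.
p = 4.0
q = p / (p - 1)
lmo = lambda c: np.sign(c) * (np.abs(c) / np.linalg.norm(c, ord=q)) ** (q - 1)
rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    d1, d2 = rng.normal(size=(2, 10))
    v1, v2 = lmo(d1), lmo(d2)
    r = np.linalg.norm(d1 - d2) / (
        (np.linalg.norm(d1) + np.linalg.norm(d2))
        * np.linalg.norm(v1 - v2) ** (p - 1))
    ratios.append(r)
print(min(ratios))       # an empirical upper bound on the admissible alpha
```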
Proposition 6.7. Consider $f$ a convex $L$-smooth function w.r.t. $\|\cdot\|$, $p \ge 2$, and $\alpha > 0$. Assume $\mathcal{C}$ is $(\alpha,p)$-uniformly convex and $\inf_{x\in\mathcal{C}}\|\nabla f(x)\| > c > 0$. Then the iterates $(y_k)$ of PAFW (Algorithm 1) with $\alpha_k = \frac{2}{k+2}$ satisfy
$$f(y_k) - f^* \le L\Big(\frac{LD_{\|\cdot\|}}{\alpha c}\Big)^{2/(p-1)}\cdot\begin{cases}\dfrac{1}{k^{(p+1)/(p-1)}} & \text{when } p \in ]3,+\infty[,\\[4pt] \dfrac{\log(k+1)}{k^2} & \text{when } p = 3,\\[4pt] \dfrac{2}{3-p}\cdot\dfrac{1}{k^2} & \text{when } p \in [2,3[,\end{cases}$$
where $D_{\|\cdot\|}$ is the diameter of $\mathcal{C}$ w.r.t. $\|\cdot\|$.

Proof of Proposition 6.7. From [Lan13, Theorem 7], we have
$$f(y_k) - f^* \le \frac{L}{k(k+1)}\sum_{i=1}^k \|x_i - x_{i-1}\|^2.$$
Then, from Lemma 6.6, since in Algorithm 1 the $x_i$ are such that $x_i \in \mathrm{argmax}_{x\in\mathcal{C}}\langle -\nabla f(z_{i-1}); x\rangle$, we have
$$\|x_i - x_{i-1}\| \le \big[\alpha\big(\|\nabla f(z_{i-1})\|_\star + \|\nabla f(z_{i-2})\|_\star\big)\big]^{-1/(p-1)}\,\|\nabla f(z_{i-1}) - \nabla f(z_{i-2})\|_\star^{1/(p-1)}.$$
Then, since $z_i \in \mathcal{C}$ and $\|\nabla f(z_{i-1}) - \nabla f(z_{i-2})\|_\star \le \frac{2LD_{\|\cdot\|}}{i+1}$ (see [Lan13, (4.3)]), we have
$$\|x_i - x_{i-1}\| \le \frac{(2LD_{\|\cdot\|})^{1/(p-1)}}{(2\alpha c)^{1/(p-1)}(i+1)^{1/(p-1)}} = \Big(\frac{LD_{\|\cdot\|}}{\alpha c}\Big)^{1/(p-1)}\frac{1}{(i+1)^{1/(p-1)}}.$$
Simple computations [Lan13] imply that
$$\sum_{i=1}^k \frac{1}{i^{2/(p-1)}} \le \begin{cases}(k+1)^{(p-3)/(p-1)} & \text{when } p \in ]3,+\infty[,\\ \log(k+1) & \text{when } p = 3,\\ \dfrac{2}{3-p} & \text{when } p \in [2,3[.\end{cases}$$
Hence,
$$f(y_k) - f^* \le L\Big(\frac{LD_{\|\cdot\|}}{\alpha c}\Big)^{2/(p-1)}\cdot\begin{cases}\dfrac{1}{k^{(p+1)/(p-1)}} & \text{when } p \in ]3,+\infty[,\\[4pt] \dfrac{\log(k+1)}{k^2} & \text{when } p = 3,\\[4pt] \dfrac{2}{3-p}\cdot\dfrac{1}{k^2} & \text{when } p \in [2,3[.\end{cases}$$
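As an empirical companion to Proposition 6.7 (our experiment, not from the paper), one can track the primal gap of PAFW on $\ell_4(1)$ for a quadratic whose gradient is bounded away from zero on the ball; the observed decay should be at least as fast as the predicted $k^{-(p+1)/(p-1)} = k^{-5/3}$.

```python
import numpy as np

# PAFW on the l_p ball with p = 4 for f(x) = 0.5 ||x - b||^2, with b outside
# the ball so that inf_C ||grad f|| > 0.  Proposition 6.7 predicts a primal
# gap decaying at least like k^(-5/3), i.e. by >~ 10^(5/3) ~ 46x per decade.
p, m = 4.0, 20
q = p / (p - 1)
b = np.full(m, 1.0)
lmo = lambda c: np.sign(c) * (np.abs(c) / np.linalg.norm(c, ord=q)) ** (q - 1)
x = np.zeros(m)
y = np.zeros(m)
vals, K = {}, 100_000
for k in range(1, K + 1):
    z = (k - 1) / (k + 1) * y + 2 / (k + 1) * x
    x = lmo(-(z - b))                        # grad f(z) = z - b
    y = (1 - 2 / (k + 2)) * y + 2 / (k + 2) * x
    if k in (100, 1_000, 10_000, K):
        vals[k] = 0.5 * np.linalg.norm(y - b) ** 2
f_star = vals[K]                             # proxy for f* from the long run
for k in (100, 1_000, 10_000):
    print(k, vals[k] - f_star)               # decreasing at least like k^(-5/3)
```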
Acknowledgments. TK is very much indebted to Pierre-Cyril Aubin for the many discussions around uniform convexity in a learning framework. Research reported in this paper was partially supported through the Research Campus Modal funded by the German Federal Ministry of Education and Research (fund numbers 05M14ZAM, 05M20ZBM), as well as the Deutsche Forschungsgemeinschaft (DFG) through the DFG Cluster of Excellence MATH+. AA is at the département d'informatique de l'École Normale Supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and INRIA. AA would like to acknowledge support from the ML and Optimisation joint research initiative with the fonds AXA pour la recherche and Kamet Ventures, a Google focused award, as well as funding by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

REFERENCES

[ABRS10] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
[AFM20a] Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. In International Conference on Machine Learning, pages 431–441. PMLR, 2020.
[AFM20b] Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. On the Rademacher complexity of linear hypothesis sets. arXiv preprint arXiv:2007.11045, 2020.
[ALLW18] Jacob Abernethy, Kevin Lai, Kfir Levy, and Jun-Kun Wang. Faster rates for convex-concave games. In Conference On Learning Theory, pages 1595–1625. PMLR, 2018.
[ALW19] Jacob Abernethy, Kevin Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. arXiv preprint arXiv:1906.02027, 2019.
[AP95] Dominique Azé and Jean-Paul Penot. Uniformly convex and uniformly smooth convex functions. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 4, pages 705–730, 1995.
[AR09] Jacob Abernethy and Alexander Rakhlin. Beating the adaptive bandit with high probability. In 2009 Information Theory and Applications Workshop, pages 280–289. IEEE, 2009.
[Asp68] Edgar Asplund. Fréchet differentiability of convex functions. Acta Mathematica, 121(1):31–47, 1968.
[AYAS09] Yasin Abbasi-Yadkori, András Antos, and Csaba Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. Citeseer, 2009.
[Bac20] Francis Bach. On the effectiveness of Richardson extrapolation in machine learning. arXiv preprint arXiv:2002.02835, 2020.
[BBL02] Peter Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002.
[BBM05] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[BCKP20a] Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online learning with imperfect hints. In International Conference on Machine Learning, pages 822–831. PMLR, 2020.
[BCKP20b] Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online linear optimization with many hints. arXiv preprint arXiv:2010.03082, 2020.
[BCL18] Sébastien Bubeck, Michael Cohen, and Yuanzhi Li. Sparsity, variance and curvature in multi-armed bandits. In Algorithmic Learning Theory, pages 111–127. PMLR, 2018.
[BDL07] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
[BDLM10] Jérôme Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet. Characterizations of Łojasiewicz inequalities: Subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.
[Bea11] Bernard Beauzamy. Introduction to Banach spaces and their geometry. Elsevier, 2011.
[BFTGT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems, 32:11282–11291, 2019.
[BGHV09] J. Borwein, A. Guirao, P. Hájek, and J. Vanderwerff. Uniformly convex functions on Banach spaces. Proceedings of the American Mathematical Society, 137(3):1081–1091, 2009.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
[BM02] Peter Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[BNPS17] Jérôme Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471–507, 2017.
[Cla36] James Clarkson. Uniformly convex spaces. Transactions of the American Mathematical Society, 40(3):396–414, 1936.
[CLK19] Chen Chen, Jaewoo Lee, and Dan Kifer. Renyi differentially private ERM for smooth objectives. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2037–2046, 2019.
[CP19] Cyrille Combettes and Sebastian Pokutta. Revisiting the approximate Carathéodory problem via the Frank-Wolfe algorithm. arXiv preprint arXiv:1911.04415, 2019.
[DDGS97] Michael Donahue, Christian Darken, Leonid Gurvits, and Eduardo Sontag. Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13(2):187–220, 1997.
[DFHJ17] Ofer Dekel, Arthur Flajolet, Nika Haghtalab, and Patrick Jaillet. Online learning with a hint. In Advances in Neural Information Processing Systems, pages 5299–5308, 2017.
[dGJ18] Alexandre d'Aspremont, Cristobal Guzman, and Martin Jaggi. Optimal affine-invariant smooth minimization algorithms. SIAM Journal on Optimization, 28(3):2384–2405, 2018.
[DGZ93] Robert Deville, Gilles Godefroy, and Václav Zizler. Smoothness and renormings in Banach spaces. Longman Scientific & Technical, Harlow, 1993.
[DH19] Simon Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 196–205. PMLR, 2019.
[DR70] V. F. Demyanov and A. M. Rubinov. Approximate methods in optimization problems. Modern Analytic and Computational Methods in Science and Mathematics, 1970.
[Dun79] Joseph Dunn. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17(2):187–211, 1979.
[EBEGT19] Othman El Balghiti, Adam Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. In Advances in Neural Information Processing Systems, pages 14412–14421, 2019.
[FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
[GH15] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In International Conference on Machine Learning, 2015.
[GI17] Vladimir Goncharov and Grigorii Ivanov. Strong and weak convexity of closed sets in a Hilbert space. In Operations Research, Engineering, and Cyber Security, pages 259–297. Springer, 2017.
[GJLJ17] Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. In Artificial Intelligence and Statistics, pages 362–371. PMLR, 2017.
[HLGS16] Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. In Advances in Neural Information Processing Systems, pages 4970–4978, 2016.
[HLGS17] Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. The Journal of Machine Learning Research, 18(1):5325–5355, 2017.
[IN14] Anatoli Iouditski and Yuri Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. arXiv preprint arXiv:1401.1792, 2014.
[INS+19] Roger Iyengar, Joseph Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP), pages 299–316. IEEE, 2019.
[Jag13] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[Jam78] R. C. James. Nonreflexive spaces of type 2. Israel Journal of Mathematics, 30(1-2):1–13, 1978.
[JST+14] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael Jordan. Communication-efficient distributed dual coordinate ascent. Advances in Neural Information Processing Systems, 27:3068–3076, 2014.
[KBGY20] Nurdan Kuru, İlker Birbil, Mert Gurbuzbalaban, and Sinan Yildirim. Differentially private accelerated optimization algorithms. arXiv preprint arXiv:2008.01989, 2020.
[KCd17] Thomas Kerdreux, Igor Colin, and Alexandre d'Aspremont. An approximate Shapley-Folkman theorem. arXiv preprint arXiv:1712.08559, 2017.
[KdP19] Thomas Kerdreux, Alexandre d'Aspremont, and Sebastian Pokutta. Restarting Frank-Wolfe. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1275–1283. PMLR, 2019.
[KdP20] Thomas Kerdreux, Alexandre d'Aspremont, and Sebastian Pokutta. Projection-free optimization on uniformly convex sets. arXiv preprint arXiv:2004.11053, 2020.
[Ker20] Thomas Kerdreux. Accelerating conditional gradient methods. PhD thesis, Université Paris Sciences et Lettres, 2020.
[KLLJS20] Thomas Kerdreux, Lewis Liu, Simon Lacoste-Julien, and Damien Scieur. Affine invariant analysis of Frank-Wolfe on strongly convex sets. arXiv preprint arXiv:2011.03351, 2020.
[Kol01] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
[Köt83] Gottfried Köthe. Topological vector spaces. In Topological Vector Spaces I, pages 123–201. Springer, 1983.
[KST09] Sham Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800, 2009.
[Lan13] Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550, 2013.
[Lin63] Joram Lindenstrauss. On the modulus of smoothness and divergent series in Banach spaces. The Michigan Mathematical Journal, 10(3):241–252, 1963.
[LLNT17] Tongliang Liu, Gábor Lugosi, Gergely Neu, and Dacheng Tao. Algorithmic stability and hypothesis complexity. arXiv preprint arXiv:1702.08712, 2017.
[LR15] Ching-Pei Lee and Dan Roth. Distributed box-constrained quadratic optimization for dual linear SVM. In International Conference on Machine Learning, pages 987–996, 2015.
[LS19] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 907–915. PMLR, 2019.
[LT13] Joram Lindenstrauss and Lior Tzafriri. Classical Banach spaces II: Function spaces, volume 97. Springer Science & Business Media, 2013.
[Mol20] Marco Molinaro. Curvature of feasible sets in offline and online optimization. arXiv preprint arXiv:2002.03213, 2020.
[MOP20] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507. PMLR, 2020.
[MSJ+15] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning, pages 1973–1982. PMLR, 2015.
[Nes05] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[Nes15] Yu Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
[Pin94] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679–1706, 1994.
[Pis75] Gilles Pisier. Martingales with values in uniformly convex spaces. Israel Journal of Mathematics, 20(3-4):326–350, 1975.
[Pis11] Gilles Pisier. Martingales in Banach spaces (in connection with type and cotype). Course IHP, Feb. 2–8, 2011.
[Pol66] Boris Polyak. Existence theorems and convergence of minimizing sequences in extremum problems with restrictions. In Soviet Math. Dokl., volume 7, pages 72–75, 1966.
[RBWM19] Jarrid Rector-Brooks, Jun-Kun Wang, and Barzan Mozafari. Revisiting projection-free optimization for strongly convex constraint sets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1576–1583, 2019.
[Rd20] Vincent Roulet and Alexandre d'Aspremont. Sharpness, restart, and acceleration. SIAM Journal on Optimization, 30(1):262–289, 2020.
[Roc70] Tyrrell Rockafellar. Convex analysis. Princeton University Press, 1970.
[RS17] Alexander Rakhlin and Karthik Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. In Conference on Learning Theory, pages 1704–1722. PMLR, 2017.
[RT10] Paat Rusmevichientong and John Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
[Sch14] Rolf Schneider. Convex bodies: The Brunn–Minkowski theory. Cambridge University Press, 2014.
[Sch16] Markus Schneider. Probability inequalities for kernel embeddings in sampling without replacement. In Artificial Intelligence and Statistics, pages 66–74, 2016.
[SFM+17] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. The Journal of Machine Learning Research, 18(1):8590–8638, 2017.
[SST11] Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, pages 2645–2653, 2011.
[ST10] Karthik Sridharan and Ambuj Tewari. Convex games in Banach spaces. In Conference on Learning Theory. Citeseer, 2010.
[Sti18] Sebastian Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
[TTZ14] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.
[VV20] V. M. Veliov and Phan Tu Vuong. Gradient methods on strongly convex feasible sets and optimal control of affine systems. Applied Mathematics & Optimization, 81(3):1021–1054, 2020.
[WA18] Jun-Kun Wang and Jacob Abernethy. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems, pages 3824–3834, 2018.
[ZZMW17] Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private ERM for smooth objectives. arXiv preprint arXiv:1703.09947, 2017.
[Ză83] C. Zălinescu. On uniformly convex functions. Journal of Mathematical Analysis and Applications, 95(2):344–374, 1983.
[Ză02] Constantin Zălinescu. Convex analysis in general vector spaces. World Scientific, 2002.

Zuse Institute Berlin & Technische Universität Berlin, Germany
Email address: [email protected]

CNRS & D.I., UMR 8548, École Normale Supérieure, Paris, France
Email address: [email protected]

Zuse Institute Berlin & Technische Universität Berlin, Germany
Email address: [email protected]