Online mirror descent and dual averaging: keeping pace in the dynamic case
Huang Fang, Nicholas J. A. Harvey, Victor S. Portella, Michael P. Friedlander
Abstract
Online mirror descent (OMD) and dual averaging (DA)—two fundamental algorithms for online convex optimization—are known to have very similar (and sometimes identical) performance guarantees when used with a fixed learning rate. Under dynamic learning rates, however, OMD is provably inferior to DA and suffers a linear regret, even in common settings such as prediction with expert advice. We modify the OMD algorithm through a simple technique that we call stabilization. We give essentially the same abstract regret bound for OMD with stabilization and for DA by modifying the classical OMD convergence analysis in a careful and modular way that allows for straightforward and flexible proofs. Simple corollaries of these bounds show that OMD with stabilization and DA enjoy the same performance guarantees in many applications—even under dynamic learning rates. We also shed light on the similarities between OMD and DA and show simple conditions under which stabilized-OMD and DA generate the same iterates.
1. Introduction
Online convex optimization (OCO) lies at the intersection of machine learning, convex optimization, and game theory. In OCO, a player is required to make a sequence of online decisions over discrete time steps. Each decision incurs a cost given by a convex function that is only revealed to the player after they make that decision. The goal of the player is to minimize what is known as regret: the difference between the total cost and the cost of a competitor with the benefit of hindsight. Letting T denote the number of decisions, the goal is for the player's algorithm to ensure its regret is sublinear in T.

Online mirror descent (OMD) and dual averaging (DA) are two important algorithm templates for OCO from which many classical online learning algorithms can be derived as special cases; see Shalev-Shwartz (2012) and McMahan (2017) for examples. When the number T of decisions to be made is known in advance, the performance of OMD and DA (with properly chosen constant learning rates) can be shown to be very similar (Hazan, 2016). That is, they achieve essentially the same regret bound when using the same learning rate. However, when the number of decisions is not known a priori, there is a fundamental difference in the regret guarantees for OMD and DA with a similar dynamic (time-varying) learning rate: while DA can guarantee a sublinear regret bound O(√T) for any T > 0 (Nesterov, 2009), there are instances for which OMD suffers asymptotically linear Ω(T) regret (Orabona & Pál, 2018). We summarize this discussion as follows.

* Equal contribution. Department of Computer Science, University of British Columbia, Canada. Correspondence to: Huang Fang <[email protected]>, Nicholas J. A. Harvey <[email protected]>, Victor S. Portella <[email protected]>, Michael P. Friedlander <[email protected]>. Proceedings of the International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Previously known fact.
With a dynamic learning rate, OMD does not match the performance of DA.

The purpose of this paper is to introduce a stabilization technique that bridges the gap between OMD and DA with dynamic learning rates.
Main result (informal).
With a dynamic learning rate, stabilized-OMD matches the performance of DA.

For a formal statement, see the abstract regret bounds in Theorems 4.1, 4.3, and 4.6. In Section 5 we give some applications: regret bounds with strongly convex mirror maps; and, for prediction with expert advice, anytime regret bounds with the best known constant and a first-order regret bound. Also, in Section 6 we formally compare the iterates of DA and stabilized-OMD. This sheds light on the drawbacks of OMD with a dynamic learning rate and why stabilization helps. To conclude, we derive simple conditions under which stabilized-OMD and DA generate exactly the same iterates. This is analogous to the relationship between OMD and DA with a fixed learning rate, and is evidence that stabilization may be a natural way to extend OMD to dynamic learning rates. Additionally, in Appendix I we adapt stabilized-OMD for the composite objective setting, generalizing a result of Duchi et al. (2010).
2. Related work
Mirror descent (MD) originated with Nemirovski & Yudin (1983). Beck & Teboulle (2003) give a modern treatment. Recent interest in first-order methods for large-scale problems has boosted the popularity of OMD. See, e.g., Duchi et al. (2010); Allen-Zhu & Orecchia (2016); Beck (2017). DA is due to Nesterov (2009) and was later extended to regularized problems by Xiao (2010). DA is closely related to the follow-the-regularized-leader (FTRL) algorithm. Standard references for these algorithms include Shalev-Shwartz (2012); Bubeck (2015); Hazan (2016); McMahan (2017). OMD and DA have seen an increase in popularity due to applications in online learning problems (Kakade et al., 2012; Audibert et al., 2014) and because they generalize a wide range of online learning algorithms (Shalev-Shwartz, 2012; McMahan, 2017).

Unifying views of online learning algorithms have proved useful for applications and have drawn recent attention. McMahan (2017) uses FTRL with adaptive regularizers to derive many online learning algorithms. Joulani et al. (2017) propose a unified framework to analyze online learning algorithms, even for non-convex problems. Recently, Juditsky et al. (2019) proposed a unified framework called unified mirror descent (UMD) that encompasses OMD and DA as special cases.

Despite these unifying frameworks, the differences between OMD and DA seem to have been overlooked. Only recently did Orabona & Pál (2018) look more closely at the difference between OMD and DA, presenting counterexamples which demonstrate that OMD with a dynamic learning rate can suffer linear regret even in well-studied settings such as the experts' problem.

For the problem of prediction with expert advice, Cesa-Bianchi et al. (1997) use the doubling trick to give an algorithm with a sublinear anytime regret bound, meaning a bound parameterized by T that holds for all T.
Improved anytime regret bounds were developed by Auer et al. (2002b); a simplified description of this result appears in Cesa-Bianchi & Lugosi (2006, §2.3). Sublinear anytime regret bounds for DA follow directly from the analysis of Nesterov (2009). Other expositions include Bubeck (2011, Theorem 2.4) and Gerchinovitz (2011, Proposition 2.1).

First-order regret bounds are bounds that depend on the cost of the best expert instead of on the number of decisions T. Such bounds can be proven using the doubling trick, as shown by Cesa-Bianchi et al. (1997). First-order regret bounds without the doubling trick were proven by Auer et al. (2002b). Improved constants are known; see, e.g., de Rooij et al. (2014, Theorem 8). The best known first-order regret bound, in some settings, is from a sophisticated algorithm designed by Yaroshinsky et al. (2004).
3. Formal definitions
We consider the online convex optimization problem with unknown time horizon. For each time step t ∈ {1, 2, ...} the algorithm proposes a point x_t from a closed convex set X ⊆ R^n, and an adversary simultaneously picks a convex cost function f_t to which the algorithm has access via a first-order oracle; that is, for any x ∈ X the algorithm can compute f_t(x) and a subgradient g ∈ ∂f_t(x) := {g ∈ R^n : f_t(z) ≥ f_t(x) + ⟨g, z − x⟩ ∀z ∈ X}. This function penalizes the proposal x_t by the amount f_t(x_t), which is the cost of the iteration at time t. The goal is to produce a sequence of proposals {x_t}_{t≥1} that minimizes the regret accrued up until time T against an unknown comparison point z ∈ X:

    Regret(T, z) := ∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(z).

We consider the case where the algorithm does not know the time horizon T in advance. Hence any parameters of the algorithm, including the learning rate, cannot depend on T. We assume that each function in the sequence {f_t}_{t≥1} is L-Lipschitz (continuous) over X with respect to a norm ‖·‖, and we denote the dual norm of ‖·‖ by ‖·‖_*.

Both OMD and DA are parameterized by a mirror map (for X), that is, a closed convex function of Legendre type (Rockafellar, 1970, Chapter 26) Φ : D̄ → R whose conjugate is differentiable on R^n, where D̄ ⊆ R^n is a convex set such that D ∩ ri X ≠ ∅, with D := int D̄ and ri X the relative interior of X. The gradient of the mirror map ∇Φ : D → R^n and the gradient of its conjugate ∇Φ* : R^n → D are mutually inverse bijections between the primal space D and the dual space R^n (Rockafellar, 1970, Theorem 26.5). We adopt the following notational convention: any vector in the primal space is written without a hat, such as x ∈ D; the same letter with a hat, namely x̂, denotes the corresponding dual vector, so that x̂ := ∇Φ(x) and x := ∇Φ*(x̂) for all letters x.

Given a mirror map Φ, the Bregman divergence of x ∈ D̄ and y ∈ D w.r.t. Φ is defined by

    D_Φ(x, y) := Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩.

Throughout this paper it will be convenient to use the notation

    D_Φ(a→b; c) := D_Φ(a, c) − D_Φ(b, c).

The projection operator induced by the Bregman divergence is

    Π_X^Φ(y) := argmin{D_Φ(x, y) : x ∈ X}.

A general template for optimization in the mirror descent framework is shown in Algorithm 1. OMD and DA are incarnations of this framework, differing only in how the dual variable ŷ_t is updated.

Algorithm 1
Pseudocode for OMD and DA.

Input: x_1 ∈ X ∩ D, η : N → R_{>0}.
for t = 1, 2, ... do
    Incur cost f_t(x_t) and receive ĝ_t ∈ ∂f_t(x_t)
    x̂_t = ∇Φ(x_t)
    [OMD update] ŷ_{t+1} = x̂_t − η_t ĝ_t
    [DA update]  ŷ_{t+1} = x̂_1 − η_{t+1} ∑_{i≤t} ĝ_i
    y_{t+1} = ∇Φ*(ŷ_{t+1})
    x_{t+1} = Π_X^Φ(y_{t+1})
end for
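To make the template concrete, the following sketch instantiates Algorithm 1 with the entropy mirror map on the simplex (the expert-advice setting of Section 5.2). The function names are ours and the code is an illustrative sketch, not the paper's implementation; with a single gradient step and matching learning rates, the OMD and DA updates coincide.

```python
import numpy as np

# Entropy mirror map Φ(x) = Σ_i x_i log x_i on the simplex (see Section 5.2).

def grad_phi(x):
    # ∇Φ(x)_i = ln(x_i) + 1: maps a primal point to the dual space
    return np.log(x) + 1.0

def grad_phi_conj(x_hat):
    # ∇Φ*(x̂)_i = exp(x̂_i − 1): the mutually inverse map back to the primal
    return np.exp(x_hat - 1.0)

def bregman_project(y):
    # For the entropy mirror map, the Bregman projection onto the simplex
    # reduces to normalization
    return y / y.sum()

def omd_step(x_t, g_t, eta_t):
    # OMD update: dual gradient step from x̂_t, then project
    return bregman_project(grad_phi_conj(grad_phi(x_t) - eta_t * g_t))

def da_step(x_1, grad_sum, eta_next):
    # DA update: step from x̂_1 against the running gradient sum
    return bregman_project(grad_phi_conj(grad_phi(x_1) - eta_next * grad_sum))

x1 = np.full(3, 1.0 / 3.0)
g1 = np.array([1.0, 0.0, 0.0])
x2_omd = omd_step(x1, g1, 0.5)
x2_da = da_step(x1, g1, 0.5)
```

After one round the two templates agree; they diverge only once the learning rate changes between rounds.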
4. Stabilized-OMD
Orabona & Pál (2018) showed that OMD with the standard dynamic learning rate (η_t ∝ 1/√t) can incur regret linear in T when the feasible set X has unbounded Bregman divergence, that is, sup_{x,z∈X} D_Φ(z, x) = ∞. We introduce a stabilization technique that resolves this problem, allowing OMD to support a dynamic learning rate and perform similarly to DA even when the Bregman divergence on X is unbounded.

The intuition for the idea is as follows. Suppose Z ⊆ X is a set of comparison points with respect to which we wish our algorithm to have low regret. Usually, we assume sup_{z∈Z} D_Φ(z, x_1) is bounded, that is, the initial point is not too far (with respect to the Bregman divergence) from any comparison point. Since sup_{z∈Z} D_Φ(z, x_1) is bounded (but not necessarily sup_{z∈Z, x∈X} D_Φ(z, x)), the point x_1 is the only point in X that is known to be somewhat close (w.r.t. the Bregman divergence) to all the points in Z. Thus, iterates computed by the algorithm should remain reasonably close to x_1 so that no point z ∈ Z is too far from the iterates. If there were such a point z, an adversary could later choose functions so that picking z every round would incur low loss. At the same time, OMD would take many iterations to converge to z, since consecutive OMD iterates tend to be close w.r.t. the Bregman divergence. That is, the algorithm would have high regret against z. To prevent this, the stabilization technique modifies each iterate x_t to mix in a small fraction of x_1. This idea is not entirely new: it appears, for example, in the original Exp3 algorithm (Auer et al., 2002a), although for different reasons.

There are two ways to realize the stabilization idea.

Primal Stabilization.
Replace x_t with a convex combination of x_t and x_1.

Dual Stabilization.
Replace ŷ_t with a convex combination of ŷ_t and x̂_1. (Recall from Algorithm 1 that ŷ_t is the dual iterate computed by taking a gradient step.) An illustration of dual stabilization is shown in Figure 1.

Figure 1: Illustration of the t-th iteration of DS-OMD.

After a draft of this paper was made publicly available, we were informed that an idea similar to primal stabilization had appeared in the Robust Optimistic Mirror Descent algorithm (Kangarshahi et al., 2018). Their setting is somewhat different since they perform optimistic steps. Furthermore, their results are somewhat weaker in terms of constant factors and since they cannot handle Bregman projections.

In this section we use many results regarding Bregman divergences (see Appendix A.2), and for ease of reference we state the main ones here. Let a, b, c ∈ D. A classic result is the three-point identity (Bubeck, 2015, §4):

    D_Φ(a, c) − D_Φ(b, c) + D_Φ(b, a) = ⟨â − ĉ, a − b⟩.    (4.1)

If γâ + (1 − γ)b̂ = ĉ for some γ ∈ R, then, for all u, v ∈ D̄,

    γ D_Φ(u→v; a) + (1 − γ) D_Φ(u→v; b) = D_Φ(u→v; c).    (4.2)

Finally, if p ∈ D and π := Π_X^Φ(p), then

    D_Φ(z→π; p) ≥ D_Φ(z→π; π) = D_Φ(z, π)  ∀z ∈ X.    (4.3)

Algorithm 2 gives pseudocode showing our modification of OMD to incorporate dual stabilization. Theorem 4.1 analyzes it without assuming strong convexity of Φ.

Theorem 4.1 (Regret bound for dual-stabilized OMD). Let η : N → R_{>0} be such that η_t ≥ η_{t+1} for all t ≥ 1. Define γ_t = η_{t+1}/η_t ∈ (0, 1] for all t ≥ 1. Let {f_t}_{t≥1} be a sequence of convex functions with f_t : X → R for each t ≥ 1. Let {x_t}_{t≥1} and {ŵ_t}_{t≥2} be as in Algorithm 2. Then, for all T > 0 and z ∈ X,

    Regret(T, z) ≤ ∑_{t=1}^T D_Φ(x_t→x_{t+1}; w_{t+1})/η_t + D_Φ(z, x_1)/η_{T+1}.    (4.7)

Algorithm 2
Dual-stabilized OMD (DS-OMD). The parameters γ_t control the amount of stabilization.

Input: x_1 ∈ X, η : N → R_{>0}, γ : N → (0, 1].
for t = 1, 2, ... do
    Incur cost f_t(x_t) and receive ĝ_t ∈ ∂f_t(x_t)
    x̂_t = ∇Φ(x_t)
    ŵ_{t+1} = x̂_t − η_t ĝ_t    (4.4)
    ŷ_{t+1} = γ_t ŵ_{t+1} + (1 − γ_t) x̂_1    (4.5)
    y_{t+1} = ∇Φ*(ŷ_{t+1})
    x_{t+1} = Π_X^Φ(y_{t+1})    (4.6)
end for

Note that strong convexity of Φ is not assumed. As we will see in Section 5.1, the term D_Φ(x_t→x_{t+1}; w_{t+1}) can be easily bounded when the mirror map is strongly convex. This yields sublinear regret for η_t ∝ 1/√t, which is not the case for OMD when sup_{z∈Z, x∈X} D_Φ(z, x) = +∞, where Z ⊆ X is a fixed set of comparison points.
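The update (4.4)–(4.6) is short enough to state directly in code. The sketch below again uses the entropy mirror map on the simplex; the function names are ours. Taking γ_t = 1 recovers a plain OMD step, while γ_t = 0 resets the next iterate to x_1.

```python
import numpy as np

def grad_phi(x):           # ∇Φ(x) = ln(x) + 1 (entropy mirror map)
    return np.log(x) + 1.0

def grad_phi_conj(x_hat):  # ∇Φ*(x̂) = exp(x̂ − 1)
    return np.exp(x_hat - 1.0)

def ds_omd_step(x_t, x_1, g_t, eta_t, gamma_t):
    w_hat = grad_phi(x_t) - eta_t * g_t                        # (4.4)
    y_hat = gamma_t * w_hat + (1.0 - gamma_t) * grad_phi(x_1)  # (4.5)
    y = grad_phi_conj(y_hat)
    return y / y.sum()                                         # (4.6)

x1 = np.full(4, 0.25)
xt = np.array([0.7, 0.1, 0.1, 0.1])
g = np.array([0.0, 1.0, 1.0, 1.0])
plain = ds_omd_step(xt, x1, g, eta_t=0.5, gamma_t=1.0)  # ordinary OMD step
reset = ds_omd_step(xt, x1, g, eta_t=0.5, gamma_t=0.0)  # full stabilization
```

With γ_t = 0 the dual point is exactly x̂_1, so the iterate returns to the initial distribution; intermediate values of γ_t interpolate between these extremes.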
Proof (of Theorem 4.1). The first step is the same as in the standard OMD proof. For all z ∈ X, use the subgradient inequality to deduce

    f_t(x_t) − f_t(z) ≤ ⟨ĝ_t, x_t − z⟩
        (i)  = (1/η_t) ⟨x̂_t − ŵ_{t+1}, x_t − z⟩
        (ii) = (1/η_t) (D_Φ(x_t→z; w_{t+1}) + D_Φ(z, x_t)),    (4.8)

where (i) follows from (4.4), and (ii) from (4.1).

The next step exhibits the main point of stabilization. Without stabilization we would have x_{t+1} = Π_X^Φ(w_{t+1}) and D_Φ(z, w_{t+1}) ≥ D_Φ(z, x_{t+1}) + D_Φ(x_{t+1}, w_{t+1}) by (4.3), so (4.8) would lead to a telescoping sum involving D_Φ(z, ·) if the learning rate were fixed. With a dynamic learning rate the analysis is trickier: we must obtain telescoping terms by relating D_Φ(z, w_{t+1}) to D_Φ(z, x_{t+1}). This is the purpose of the next claim.

Claim 4.2.
Assume that γ_t = η_{t+1}/η_t ∈ (0, 1]. Then

    (4.8) ≤ D_Φ(x_t→x_{t+1}; w_{t+1})/η_t + (1/η_{t+1} − 1/η_t) D_Φ(z, x_1) + D_Φ(z, x_t)/η_t − D_Φ(z, x_{t+1})/η_{t+1},

where both the coefficients (1/η_{t+1} − 1/η_t) and the final pair of terms telescope when summed over t.
Proof. First we derive the inequality

    γ_t D_Φ(z→x_{t+1}; w_{t+1}) + (1 − γ_t) D_Φ(z, x_1)
        (i) ≥ γ_t D_Φ(z→x_{t+1}; w_{t+1}) + (1 − γ_t) D_Φ(z→x_{t+1}; x_1)
            = D_Φ(z→x_{t+1}; y_{t+1})    (by (4.2) and (4.5))
            ≥ D_Φ(z, x_{t+1})    (by (4.3) and (4.6)),

where (i) uses the fact that D_Φ(x_{t+1}, x_1) ≥ 0 and γ_t ≤ 1. Rearranging and using γ_t > 0 yields

    D_Φ(z, w_{t+1}) ≥ D_Φ(x_{t+1}, w_{t+1}) − ((1 − γ_t)/γ_t) D_Φ(z, x_1) + (1/γ_t) D_Φ(z, x_{t+1}).    (4.9)

Plugging this into (4.8) yields

    (4.8) = (1/η_t) [D_Φ(x_t, w_{t+1}) − D_Φ(z, w_{t+1}) + D_Φ(z, x_t)]
      (4.9) ≤ (1/η_t) [D_Φ(x_t, w_{t+1}) − D_Φ(x_{t+1}, w_{t+1}) + (1/γ_t − 1) D_Φ(z, x_1) − (1/γ_t) D_Φ(z, x_{t+1}) + D_Φ(z, x_t)].

The claim then follows by the definition of γ_t. ∎

Summing the inequality from Claim 4.2 over t ∈ [T] proves Theorem 4.1. For completeness we show these calculations in Appendix B. ∎

Algorithm 3 gives pseudocode for the primal-stabilized OMD method, which has the following regret bound.

Algorithm 3: OMD with primal stabilization.

Input: x_1 ∈ X, η : N → R_{>0}, γ : N → (0, 1].
for t = 1, 2, ... do
    Incur cost f_t(x_t) and receive ĝ_t ∈ ∂f_t(x_t)
    x̂_t = ∇Φ(x_t)
    ŵ_{t+1} = x̂_t − η_t ĝ_t
    w_{t+1} = ∇Φ*(ŵ_{t+1})
    y_{t+1} = Π_X^Φ(w_{t+1})    (4.10)
    x_{t+1} = γ_t y_{t+1} + (1 − γ_t) x_1    (4.11)
end for
Theorem 4.3 (Regret bound for primal-stabilized OMD). For all t ≥ 1, let η : N → R_{>0} be such that η_t ≥ η_{t+1}; define γ_t = η_{t+1}/η_t ∈ (0, 1]; and let {f_t}_{t≥1} be a sequence of convex functions with f_t : X → R. Let {x_t}_{t≥1}, {y_t}_{t≥2} and {ŵ_t}_{t≥2} be as in Algorithm 3. Furthermore, assume that

    for all z ∈ X, the map D_Φ(z, ·) is convex on X.    (4.12)

Then, for all T > 0 and z ∈ X,

    Regret(T, z) ≤ ∑_{t=1}^T D_Φ(x_t→y_{t+1}; w_{t+1})/η_t + D_Φ(z, x_1)/η_{T+1}.    (4.13)

The proof is identical to the proof of Theorem 4.1, replacing D_Φ(x_t→x_{t+1}; w_{t+1}) with D_Φ(x_t→y_{t+1}; w_{t+1}) and replacing Claim 4.2 with the following claim. (The complete proof of Theorem 4.3 can be found in Appendix C.)
Claim 4.4.
Assume that γ_t = η_{t+1}/η_t ∈ (0, 1]. Then

    (4.8) ≤ D_Φ(x_t→y_{t+1}; w_{t+1})/η_t + (1/η_{t+1} − 1/η_t) D_Φ(z, x_1) + D_Φ(z, x_t)/η_t − D_Φ(z, x_{t+1})/η_{t+1},

where, as before, the middle coefficient and the final pair of terms telescope when summed over t.
Proof. First, we derive the inequality

    γ_t D_Φ(z→y_{t+1}; w_{t+1}) + (1 − γ_t) D_Φ(z, x_1)
        (i)  ≥ γ_t D_Φ(z, y_{t+1}) + (1 − γ_t) D_Φ(z, x_1)
        (ii) ≥ D_Φ(z, x_{t+1}),

where (i) follows from (4.3) and (4.10), and (ii) is by (4.11), (4.12) and γ_t ∈ (0, 1]. Rearranging and using γ_t > 0 yields

    D_Φ(z, w_{t+1}) ≥ D_Φ(y_{t+1}, w_{t+1}) − ((1 − γ_t)/γ_t) D_Φ(z, x_1) + (1/γ_t) D_Φ(z, x_{t+1}).    (4.14)

Plugging this into (4.8) yields

    (4.8) = (1/η_t) (D_Φ(x_t→z; w_{t+1}) + D_Φ(z, x_t))
     (4.14) ≤ (1/η_t) (D_Φ(x_t, w_{t+1}) − D_Φ(y_{t+1}, w_{t+1}) + (1/γ_t − 1) D_Φ(z, x_1) − (1/γ_t) D_Φ(z, x_{t+1}) + D_Φ(z, x_t)).

The claim follows by the definition of γ_t. ∎

We now show that the DA algorithm can be obtained by a small modification of dual-stabilized online mirror descent. Furthermore, our proof of Theorem 4.1 can be adapted to analyze this algorithm. The main difference between DS-OMD and DA is in the gradient step. In iteration t + 1 of DS-OMD the gradient step is taken from x̂_t (the dual counterpart of x_t):

    DS-OMD gradient step: ŵ_{t+1} := x̂_t − η_t ĝ_t.

Suppose that the algorithm is modified so that the gradient step is taken from ŷ_t, the dual point from iteration t before projection onto the feasible region (here define ŷ_1 := x̂_1). The resulting gradient step is

    Lazy gradient step: ŵ_{t+1} := ŷ_t − η_t ĝ_t.    (4.15)

As before, we set ŷ_{t+1} := γ_t ŵ_{t+1} + (1 − γ_t) x̂_1, where γ_t = η_{t+1}/η_t. Then a simple inductive proof yields the following claim.

Claim 4.5. ŵ_t = x̂_1 − η_{t−1} ∑_{i<t} ĝ_i for all t ≥ 2.

That is, the lazy variant of DS-OMD generates exactly the DA iterates of Algorithm 1, and our proof technique yields the following regret bound for DA.

Theorem 4.6 (Regret bound for DA). Let η and {f_t}_{t≥1} be as in Theorem 4.1, and let {x_t}_{t≥1} be as in Algorithm 1 with the DA update. Then, for all T > 0 and z ∈ X,

    Regret(T, z) ≤ ∑_{t=1}^T D_Φ(x_t→x_{t+1}; ∇Φ*(x̂_t − η_t ĝ_t))/η_t + D_Φ(z, x_1)/η_{T+1}.    (4.16)

The proof parallels the proof of Theorem 4.1 and can be found in Appendix D.
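Claim 4.5 is easy to check numerically: under the lazy gradient step (4.15) with γ_t = η_{t+1}/η_t, the stabilized dual iterate collapses, by induction, to the DA point x̂_1 − η_{t+1} ∑_{i≤t} ĝ_i. A sketch (entropy mirror map, random linear costs; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 50
x1_hat = np.log(np.full(n, 1.0 / n)) + 1.0  # x̂_1 = ∇Φ(x_1), entropy map
eta = lambda t: 1.0 / np.sqrt(t)            # any non-increasing learning rate

y_hat = x1_hat.copy()                       # ŷ_1 := x̂_1
grad_sum = np.zeros(n)
for t in range(1, T + 1):
    g = rng.uniform(0.0, 1.0, size=n)
    grad_sum += g
    w_hat = y_hat - eta(t) * g                        # lazy step (4.15)
    gamma = eta(t + 1) / eta(t)
    y_hat = gamma * w_hat + (1.0 - gamma) * x1_hat    # stabilization (4.5)
# After T rounds, ŷ_{T+1} equals the DA dual point of Algorithm 1
```

The stabilization mixes the shrinking factor γ_t into the accumulated gradients, which is precisely how DA rescales its gradient sum by the current learning rate.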
Interestingly, the doubling trick (Shalev-Shwartz, 2012) for OMD can be viewed as an incarnation of stabilization. To see this, set η_t := 1/√(2^⌊lg t⌋) and γ_t := 1[t + 1 is not a power of 2]. Then, for each dyadic interval of length ℓ, the first iterate is x_1 and a fixed learning rate 1/√ℓ is used. Thus, with these parameters, Algorithm 2 reduces to the doubling trick.

Note that in Theorem 4.1 the stabilization parameter γ_t used in round t ≥ 1 depends on the learning rates for rounds t and t + 1. Thus, to use stabilization as in Theorem 4.1, the learning rate for round t can depend only on information available up to round t − 1. This will be important when we derive first-order regret bounds in Section 5.2.2, where the learning rate depends on the past functions and iterates. Reindexing the learning rates could fix the problem, but then the proof of Theorem 4.1 would look syntactically odd. Although this "dependence on the future" may seem unnatural, in Section 6 we shall see that, under mild conditions, stabilized-OMD coincides with DA with dynamic learning rates. This extends the same behavior observed between OMD and DA when the learning rates are fixed, and may be seen as evidence that stabilization is a natural way to fix OMD for dynamic learning rates. Furthermore, McMahan (2017) shows this off-by-one difference among other algorithms for OCO and discusses the implications of this phenomenon.
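Concretely, the doubling-trick schedule (a fixed rate 1/√ℓ on each dyadic block of length ℓ, with a reset to x_1 at block boundaries) can be encoded in the parameters of Algorithm 2. The sketch below is one consistent choice under our reading of that schedule; the function names are ours.

```python
import math

def eta(t):
    # 1/sqrt(2^{floor(lg t)}): the rate is constant on each dyadic block
    return 1.0 / math.sqrt(2 ** int(math.log2(t)))

def gamma(t):
    # reset (gamma_t = 0) exactly when round t + 1 opens a new dyadic
    # block, i.e. when t + 1 is a power of two; otherwise gamma_t = 1
    return 0.0 if (t + 1) & t == 0 else 1.0
```

Within the block {4, ..., 7} the rate stays at 1/√4, and setting γ_t = 0 makes ŷ_{t+1} = x̂_1 in (4.5), so the next iterate restarts at x_1 exactly at rounds 2, 4, 8, 16, and so on.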
5. Applications
In this section we show that stabilized-OMD and DA enjoy the same regret bounds in several applications that involve a dynamic learning rate.
We now analyze the algorithms of the previous section in the scenario where the mirror maps are strongly convex. Let η_t, γ_t, f_t be as in the previous section. The next result is a corollary of Theorems 4.1, 4.3, and 4.6.

Corollary 5.1 (Regret bound for dual-stabilized OMD). Suppose that Φ is ρ-strongly convex on X with respect to a norm ‖·‖. Let {x_t}_{t≥1} be the iterates produced by Algorithm 1 with the DA update, or by the update rules in Algorithms 2 or 3 (for Algorithm 3, the additional assumption (4.12) is required). Then, for all T > 0 and z ∈ X,

    Regret(T, z) ≤ ∑_{t=1}^T η_t ‖ĝ_t‖_*² / (2ρ) + D_Φ(z, x_1)/η_{T+1}.

This is identical to the bound for dual averaging in Nesterov (2009, Eq. 2.15) (taking his λ_i := 1 and his β_i := 1/η_i). The proof is based on the following simple proposition, which bounds the Bregman divergence when Φ is strongly convex (Bubeck, 2015, pp. 300). The proof is given in Appendix E.

Proposition 5.2.
Suppose Φ is ρ-strongly convex on X with respect to ‖·‖. For any x, x′ ∈ X and q̂ ∈ R^n,

    D_Φ(x→x′; ∇Φ*(x̂ − q̂)) ≤ ‖q̂‖_*² / (2ρ).

Proof (of Corollary 5.1). The regret bounds proven by Theorems 4.1, 4.3 and 4.6 all involve a summation with terms of the form

    (4.7):  D_Φ(x_t→x_{t+1}; w_{t+1}),
    (4.13): D_Φ(x_t→y_{t+1}; w_{t+1}),
    (4.16): D_Φ(x_t→x_{t+1}; ∇Φ*(x̂_t − η_t ĝ_t)).

For Theorems 4.1 and 4.6, we have x_{t+1} ∈ X, whereas for Theorem 4.3 we have y_{t+1} ∈ X by (4.10). For Theorems 4.1 and 4.3 we have w_{t+1} = ∇Φ*(x̂_t − η_t ĝ_t) by (4.4) and the corresponding step in Algorithm 3. Therefore all of these terms may be bounded using Proposition 5.2 with x = x_t and q̂ = η_t ĝ_t. This yields the claimed bound. ∎

Next consider the setting of "prediction with expert advice". In this setting, D̄ is R^n_{≥0}, X is the simplex Δ_n ⊂ R^n, and the mirror map is Φ(x) := ∑_{i=1}^n x_i log x_i. (On X, Φ is the negative of the entropy function.) The gradient of the mirror map and its conjugate are

    ∇Φ(x)_i = ln(x_i) + 1  and  ∇Φ*(x̂)_i = e^{x̂_i − 1}.    (5.1)

For any two points a ∈ D̄ and b ∈ D, an easy calculation shows that D_Φ(a, b) is the generalized KL-divergence

    D_KL(a, b) = ∑_{i=1}^n a_i ln(a_i/b_i) − ‖a‖_1 + ‖b‖_1.

Note that the KL-divergence is convex in its second argument for any b ∈ D = R^n_{>0}, since the functions −ln(·) and the absolute value are both convex. This means that all the abstract regret bounds from Section 4 hold in this setting. Using them we will derive regret bounds for this setting with a little extra work. As an intermediate step, we will derive bounds that use the following function:

    Λ(a, b) := D_KL(a, b) + ‖a‖_1 − ‖b‖_1 + ln‖b‖_1 = ∑_{i=1}^n a_i ln(a_i/b_i) + ln‖b‖_1,

which is a standard tool in the analysis of algorithms for the experts' problem. For examples, see de Rooij et al. (2014, §2.1) and Cesa-Bianchi et al. (2007, Lemma 4). The next result is a corollary of Theorems 4.1, 4.3, and 4.6.

Corollary 5.3.
For all t ≥ 1, let η : N → R_{>0} be such that η_t ≥ η_{t+1}; define γ_t = η_{t+1}/η_t ∈ (0, 1]; and let {f_t}_{t≥1} be a sequence of convex functions with f_t : X → R. Let x_1 be the uniform distribution 1⃗/n and let {x_t}_{t≥1} and {ĝ_t}_{t≥1} be as in Algorithm 1 with the DA update, or as in Algorithms 2 or 3. Then, for all T > 0 and z ∈ X,

    Regret(T, z) ≤ ∑_{t=1}^T Λ(x_t, ∇Φ*(x̂_t − η_t ĝ_t))/η_t + ln n / η_{T+1}.    (5.2)

The proof is a direct consequence of the following proposition, which is proven in Appendix F.

Proposition 5.4. D_Φ(a→b; c) ≤ Λ(a, c) for a, b ∈ X, c ∈ D.

Proof (of Corollary 5.3). First, recall that D_KL is convex in its second argument, which allows us to use the bound from (4.13) for primal-stabilized OMD. As in the proof of Corollary 5.1, we first observe that the regret bounds (4.7), (4.13) and (4.16) all have sums with terms of the form D_Φ(x_t→u_t; ∇Φ*(x̂_t − η_t ĝ_t)) for some u_t ∈ X, which may be bounded using Proposition 5.4. Finally, the standard inequality sup_{z∈X} D_KL(z, x_1) ≤ ln n completes the proof. ∎

From Corollary 5.3 we now derive an anytime regret bound in the case of bounded costs. This matches the best known bound appearing in the literature; see Bubeck (2011, Theorem 2.4) and Gerchinovitz (2011, Proposition 2.1). Moreover, in Appendix G we show that this is tight for DA.
Corollary 5.5.
Define η_t = 2√(ln(n)/t) and γ_t = η_{t+1}/η_t ∈ (0, 1] for all t ≥ 1. Let {f_t := ⟨ĝ_t, ·⟩}_{t≥1} be such that ĝ_t ∈ [0, 1]^n for all t ≥ 1. Let x_1 be the uniform distribution 1⃗/n and let {x_t}_{t≥1} be as in Algorithm 1 with the DA update, or as in Algorithms 2 or 3. Then,

    Regret(T, z) ≤ √(T ln n)  ∀T ≥ 1, ∀z ∈ X.

The proof follows from Corollary 5.3 and Hoeffding's Lemma, as shown below.
Lemma 5.6 (Hoeffding's Lemma (Cesa-Bianchi & Lugosi, 2006, Lemma 2.2)). Let X be a random variable with a ≤ X ≤ b. Then, for any s ∈ R,

    ln E[e^{sX}] − s E[X] ≤ s²(b − a)²/8.

Proof (of Corollary 5.5). By (5.1) we have ∇Φ*(x̂_t − η_t ĝ_t)_i = x_t(i) exp(−η_t ĝ_t(i)) for each i ∈ [n]. This together with Lemma 5.6 for s = −η_t yields

    Λ(x_t, ∇Φ*(x̂_t − η_t ĝ_t)) = η_t ⟨ĝ_t, x_t⟩ + ln(∑_{i=1}^n x_t(i) e^{−η_t ĝ_t(i)}) ≤ η_t²/8.

Plugging this and η_t = 2√(ln(n)/t) into (5.2), we obtain

    Regret(T) ≤ √(ln n) ( (1/4) ∑_{t=1}^T 1/√t + √(T+1)/2 )
             ≤ √(ln n) ( √T/2 − 1/4 + √T/2 + 1/4 )
             = √(T ln n),

by Fact A.3 and since √(T+1) ≤ √T + 1/2 for T ≥ 1. ∎

The regret bound described in Section 5.2.1 depends on √T; this is known as a "zeroth-order" regret bound. In some scenarios the cost of the best expert up to time T can be far less than T. This makes the problem somewhat easier, and it is possible to improve the regret bound. Formally, let L*_T denote the total cost of the best expert until time T. Then L*_T ≤ T due to our assumption that all costs are at most 1. A "first-order" regret bound depends on √(L*_T) instead of √T. The only modification to the algorithm is to change the learning rate. If the costs are "smaller than expected", then intuitively time is progressing "slower than expected". We will adopt an elegant idea of Auer et al. (2002b), which is to use the algorithm's cost itself as a measure of the progression of time, and to incorporate this into the learning rate. They call this a "self-confident" learning rate.

Corollary 5.7.
Let {f_t := ⟨ĝ_t, ·⟩}_{t≥1} be such that ĝ_t ∈ [0, 1]^n for all t ≥ 1. Define γ_t = η_{t+1}/η_t ∈ (0, 1] and η_t = √(ln(n)/(1 + ∑_{i<t} ⟨ĝ_i, x_i⟩)). Let x_1 be the uniform distribution 1⃗/n and let {x_t}_{t≥1} be as in Algorithm 1 with the DA update, or as in Algorithms 2 or 3. Then,

    Regret(T, z) ≤ 2√(ln(n) L*_T) + 8 ln n  ∀T ≥ 1, ∀z ∈ X.

The main ingredient is the following alternative bound on Λ, which is proven in Appendix F.

Proposition 5.8. Let a ∈ X, q̂ ∈ [0, 1]^n and η > 0. Then Λ(a, ∇Φ*(â − η q̂)) ≤ η² ⟨a, q̂⟩ / 2.

Proof (of Corollary 5.7). Let z ∈ X. From Corollary 5.3 and Proposition 5.8, we have

    sup_{z∈X} ∑_{t=1}^T ⟨ĝ_t, x_t − z⟩ ≤ ∑_{t=1}^T (η_t/2) ⟨ĝ_t, x_t⟩ + ln n / η_{T+1}.    (5.3)

Denote the algorithm's total cost at time t by A_t = ∑_{i≤t} ⟨ĝ_i, x_i⟩. Recall that the total cost of the best expert at time T is L*_T = min_{z∈Δ_n} ∑_{t=1}^T ⟨ĝ_t, z⟩ and the learning rate is η_t = √(ln(n)/(1 + A_{t−1})). Substituting into (5.3),

    A_T − L*_T ≤ √(ln n) ( (1/2) ∑_{t=1}^T ⟨ĝ_t, x_t⟩/√(1 + A_{t−1}) + √(1 + A_T) )
              ≤ √(ln n) ( √(A_T) + √(A_T + 1) )

by Proposition A.5 with a_i = ⟨ĝ_i, x_i⟩ and u = 1. Rewriting the previous inequality, we have shown that

    A_T − L*_T ≤ 2√(ln(n) A_T) + √(ln n).

By Proposition A.7 we obtain

    A_T − L*_T ≤ 2√(ln(n) L*_T) + √(ln n) + 2(ln n)^{3/4} + 4 ln n ≤ 2√(ln(n) L*_T) + 8 ln n.

Since sup_{z∈X} Regret(T, z) = A_T − L*_T, the result follows. ∎

Comparing our bound with some existing results in the literature: the constant of 2 obtained in Corollary 5.7 is better than the constant 2√2/(√2 − 1) obtained by the doubling trick (Cesa-Bianchi & Lugosi, 2006, Exercise 2.8) and the constant 2√2 in Auer et al. (2002b), but worse than the constant √2 of the best known first-order regret bound (Yaroshinsky et al., 2004), which is obtained by a sophisticated algorithm. We also match the constant 2 of the Hedge algorithm from de Rooij et al. (2014, Theorem 8). Their result is actually more general; we could similarly generalize our analysis, but that would deviate too far from the main purpose of this paper.
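On the simplex the DA iterate has the closed form x_t ∝ exp(x̂_1 − η_t ∑_{i<t} ĝ_i) (cf. (6.2)), so both corollaries of this section are easy to exercise numerically. The sketch below (names ours, random costs in [0, 1]^n) runs exponential weights with the anytime rate of Corollary 5.5 and the self-confident rate of Corollary 5.7 and compares the observed regret to the bounds; the assertions use the stated bounds as upper limits, which random instances satisfy with a wide margin.

```python
import numpy as np

def hedge_da(costs, eta):
    """Exponential weights via the DA update: x_t ∝ exp(-eta(t) * G_{t-1})."""
    T, n = costs.shape
    G = np.zeros(n)      # running sum of cost vectors
    total = 0.0          # algorithm's accumulated cost A_t
    for t in range(1, T + 1):
        w = np.exp(-eta(t) * G)
        x = w / w.sum()
        total += float(costs[t - 1] @ x)
        G += costs[t - 1]
    return total, G

def hedge_da_self_confident(costs):
    """Same update with the self-confident rate eta_t = sqrt(ln n/(1+A_{t-1}))."""
    T, n = costs.shape
    G, A = np.zeros(n), 0.0
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(n) / (1.0 + A))
        w = np.exp(-eta_t * G)
        x = w / w.sum()
        A += float(costs[t - 1] @ x)
        G += costs[t - 1]
    return A, G

rng = np.random.default_rng(1)
n, T = 5, 400
costs = rng.uniform(0.0, 1.0, size=(T, n))

alg, G = hedge_da(costs, lambda t: 2.0 * np.sqrt(np.log(n) / t))
regret_anytime = alg - G.min()        # Corollary 5.5: at most sqrt(T ln n)

A_T, G2 = hedge_da_self_confident(costs)
L_star = G2.min()
regret_first_order = A_T - L_star     # Corollary 5.7: 2 sqrt(ln(n) L*) + 8 ln n
```

On typical random instances the realized regret is far below either bound; the bounds are worst-case guarantees over all cost sequences in [0, 1]^n.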
6. Comparing DS-OMD and DA

In this section we write the iterates of dual-stabilized OMD in two equivalent forms. First we write them in a proximal-like formulation similar to the mirror descent formulation in Beck & Teboulle (2003), shedding some light on the intuition behind dual stabilization. We then write the iterates of DS-OMD in a form very similar to the original definition of DA in Nesterov (2009). The latter allows us to understand intuitively why OMD does not play well with a dynamic step-size and to derive simple sufficient conditions under which DS-OMD and DA generate the same iterates, mimicking the relation between OMD and DA for a fixed learning rate.

Beck & Teboulle (2003) show that the iterate x_{t+1} for round t + 1 of OMD is the unique minimizer over X of η_t⟨ĝ_t, ·⟩ + D_Φ(·, x_t), where ĝ_t ∈ ∂f_t(x_t). The next proposition extends this formulation to DS-OMD, recovering the result of Beck and Teboulle when γ_t = 1. The proof, which can be found in Appendix H, is a simple application of the optimality conditions of (6.1).

Proposition 6.1. Let {f_t}_{t≥1} be a sequence of convex functions with f_t : X → R for each t ≥ 1. Let η : N → R_{>0} and γ : N → [0, 1]. Let {x_t}_{t≥1} and {ĝ_t}_{t≥1} be as in Algorithm 2. Then, for any t ≥ 1,

    {x_{t+1}} = argmin_{x∈X} ( γ_t (η_t⟨ĝ_t, x⟩ + D_Φ(x, x_t)) + (1 − γ_t) D_Φ(x, x_1) ).    (6.1)

In spite of their similar descriptions, Orabona & Pál (2018) showed that OMD and DA may behave in extremely different ways even on the well-studied experts' problem with similar choices of step-sizes. This extreme difference in behavior is not clear from the classical algorithmic description of these methods as in Algorithm 1. In the case of DA, it is well known that DA can be seen as an instance of the FTRL algorithm; see Bubeck (2015, §4.4) or Hazan (2016, §5.3.1).
More specifically, if {x_t}_{t≥1} and {ĝ_t}_{t≥1} are as in Algorithm 1 with the DA update, then for every t ≥ 1,

    {x_{t+1}} = argmin_{x∈X} ( η_{t+1} ∑_{i=1}^t ⟨ĝ_i, x⟩ − ⟨x̂_1, x⟩ + Φ(x) ).    (6.2)

(The ⟨x̂_1, x⟩ term disappears if x_1 minimizes Φ on X.) In the next theorem, proven in Appendix H, we write DS-OMD in a similar form, but with vectors from the normal cone of X creeping into the formula due to the back and forth between the primal and dual spaces. Recall that the normal cone of X at a point x ∈ X is the set N_X(x) := {p ∈ R^n : ⟨p, z − x⟩ ≤ 0 for all z ∈ X}. The result in McMahan (2017, Theorem 11) is similar but slightly more intricate due to the use of time-varying mirror maps; moreover, that result does not directly apply when we have stabilization.

Theorem 6.2. Let {f_t}_{t≥1} with f_t : X → R be a sequence of convex functions and let η : N → R_{>0} be non-increasing. Let {x_t}_{t≥1} and {ĝ_t}_{t≥1} be as in Algorithm 2. Then there are {p_t}_{t≥1} with p_t ∈ N_X(x_t) for all t ≥ 1 such that: if γ_i = 1 for all i ≥ 1, then for all t ≥ 1,

    {x_{t+1}} = argmin_{x∈X} ( ∑_{i=1}^t ⟨η_i ĝ_i + p_i, x⟩ − ⟨x̂_1, x⟩ + Φ(x) );    (6.3)

and if γ_i = η_{i+1}/η_i for all i ≥ 1, then for all t ≥ 1,

    {x_{t+1}} = argmin_{x∈X} ( η_{t+1} ∑_{i=1}^t ⟨ĝ_i + p̄_i, x⟩ − ⟨x̂_1, x⟩ + Φ(x) ),    (6.4)

where p̄_t := p_t/η_t ∈ N_X(x_t) for every t ≥ 1.

With the above theorem, we may compare the iterates of DA, OMD, and DS-OMD by comparing the formulas (6.2), (6.3), and (6.4). For the simple unconstrained case where X = R^n we have N_X(x_t) = {0} for each t ≥ 1, and DA and DS-OMD are identical. However, if the learning rate is not constant, OMD is not equivalent to the latter methods. In particular, if η_t ∝ 1/√t, (6.3) shows that the subgradients of earlier-seen functions have a bigger weight on the iterates compared to the subgradients of functions from later rounds.
In other words, OMD may be sensitive to the ordering of the functions, and adversarial orderings may affect its performance.

When X is an arbitrary convex set, DA and DS-OMD are not necessarily equivalent anymore due to the vectors from the normal cone of X. If we know that the iterates live in the relative interior of X, the next lemma (whose proof we give in Appendix H) shows that these vectors do not affect the set of minimizers from (6.4).

Lemma 6.3. For any ˚x ∈ ri X we have N_X(˚x) = (−N_X(˚x)) ∩ N_X(˚x). In particular, for any p ∈ N_X(˚x) we have ⟨p, x⟩ = ⟨p, ˚x⟩ for every x ∈ X.

With this lemma, we can easily derive simple and intuitive conditions under which DS-OMD and DA are equivalent.

Corollary 6.4. Let D ⊆ R^n be the interior of the domain of Φ, let {x_t}_{t≥1} be the DS-OMD iterates as in Algorithm 2 and let {x̄_t}_{t≥1} be the DA iterates as in Algorithm 1 with DA updates. If D ∩ X ⊆ ri X, then x_t = x̄_t for each t ≥ 1.

Proof. Let t ≥ 1. Since x_t = Π^Φ_X(y_t), where y_t is as in Algorithm 2, Lemma H.3 implies x_t ∈ D ∩ X ⊆ ri X. By Lemma 6.3, the vectors in the normal cone in (6.4) do not affect the set of minimizers, which implies that (6.2) and (6.4) are equivalent.

An important special case of the above corollary is the prediction with expert advice setting as in Section 5.2, where D = R^n_{>0} and X is the simplex Δ_n. In this setting, X ∩ D = {x ∈ (0, 1]^n : Σ_{i=1}^n x_i = 1} = ri X. By the previous corollary, DS-OMD and DA produce the same iterates in this case even for dynamic learning rates. Classical OMD and DA were already known to be equivalent in the experts' setting for a fixed learning rate (Hazan, 2016, §5.4.2). In contrast, with a dynamic learning rate, the DA and OMD iterates are certainly different, since OMD with a dynamic learning rate may have linear regret (Orabona & Pál, 2018), whereas DA has sublinear regret. 7.
Discussion

In this paper we modified OMD via stabilization in order to guarantee sublinear regret even when using the method with a dynamic learning rate. We showed that (primal and dual) stabilized-OMD recovers the regret bounds enjoyed by DA in the anytime setting, presented some applications of our results, and analyzed the similarities and differences between DS-OMD, OMD, and DA.

Our bounds for the problem of prediction with expert advice nearly match the current state of the art. A distinctive feature of our proofs is their relative simplicity compared to other results from the literature. It is our hope that the simplicity of our analysis framework allows it to be extended to other problems. Moreover, the modularity of our proofs allowed us to extend this analysis to DA, a fact interesting in its own right, since drastically different techniques are usually used to analyze DA in the literature (such as the Follow the Leader-Be the Leader lemma and the optimality conditions of (6.2); see Shalev-Shwartz (2012, Section 2.3) for an example). This, together with our analysis from Section 6, helps demystify the connections between DA and OMD, since despite having similar descriptions they had extremely different analyses and behaved wildly differently in some scenarios. We believe that a better understanding of the differences between DA and OMD will be helpful in future applications and in the design of new algorithms.

8. Acknowledgements

We thank Chris Liaw for pointing out a slight flaw in the proofs in an earlier draft of this paper. We also thank Francesco Orabona for suggesting the use of a slightly different definition of regret which allows for more nuanced statements of our results. We also express our gratitude for the detailed feedback given by the three anonymous reviewers from ICML 2020.

References

Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. November 2016.
URL https://arxiv.org/abs/1407.1537.

Audibert, J.-Y., Bubeck, S., and Lugosi, G. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002a.

Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002b.

Beck, A. First-order methods in optimization, volume 25 of MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2017.

Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

Bubeck, S. Introduction to online optimization, December 2011. Unpublished.

Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. How to use expert advice. Journal of the ACM, 44(3), May 1997.

Cesa-Bianchi, N., Mansour, Y., and Stoltz, G. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.

de Rooij, S., van Erven, T., Grünwald, P. D., and Koolen, W. M. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research (JMLR), 15:1281–1316, 2014.

Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. Composite objective mirror descent. In Proceedings of COLT, pp. 14–26, 2010.

Gerchinovitz, S. Prediction of individual sequences and prediction in the statistical framework: some links around sparse regression and aggregation techniques.
PhD thesis, Université Paris-Sud, 2011.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Joulani, P., György, A., and Szepesvári, C. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, and variational bounds. In Proceedings of ALT'17, pp. 681–720, 2017.

Juditsky, A. B., Kwon, J., and Moulines, E. Unifying mirror descent and dual averaging. October 2019. URL http://arxiv.org/abs/1910.13742.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Regularization techniques for learning with matrices. Journal of Machine Learning Research (JMLR), 13:1865–1890, 2012.

Kangarshahi, E. A., Hsieh, Y., Sahin, M. F., and Cevher, V. Let's be honest: An optimal no-regret framework for zero-sum games. In Proceedings of ICML'18, pp. 2493–2501, 2018.

McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. Journal of Machine Learning Research, 18:90:1–90:50, 2017.

Nemirovski, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.

Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

Orabona, F. and Pál, D. Scale-free online learning. Theoretical Computer Science, 716:50–69, 2018.

Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, 1970.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11, 2010.

Yaroshinsky, R., El-Yaniv, R., and Seiden, S. S. How to better use expert advice. Machine Learning, 55(3):271–309, 2004.

Supplementary Materials

A. Facts and propositions

A.1. Scalar inequalities

Fact A.1. For any a > 0 and b, x ∈ R, we have −ax² + bx ≤ b²/(4a).

Fact A.2.
e − x ≤ − x + x for x ≥ . Fact A.3. P ti =1 1 √ i ≤ √ t − for t ≥ . Fact A.4. log( x ) ≤ x − for x ≥ .The following proposition is a variant of an inequality that is frequently used in online learning; see, e.g., (Auer et al., 2002,Lemma 3.5), (McMahan, 2017, Lemma 4). Proposition A.5. Let u > and a , a , . . . , a T ∈ [0 , u ] . Then T X t =1 a t p u + P i Lemma A.6 (Sums with chain rule). Let S ⊆ R be an interval. Let F : S → R be concave and differentiable on the interiorof S . Let u ≥ and let A : { , . . . , T } → S satisfy A ( i ) − A ( i − ∈ [0 , u ] for each ≤ i ≤ T . Then T X i =1 F (cid:0) u + A ( i − (cid:1) · ( A ( i ) − A ( i − ≤ F ( A ( T )) − F ( A (0)) . As u → , the left-hand side becomes comparable to R T F ( A ( x )) A ( x ) dx , an expression that has no formal meaningsince A is only defined on integers. If this expression existed, it would equal the right-hand side by the chain rule. Proof of Lemma A.6. Since F is concave, f := F is non-increasing. Fix any ≤ i ≤ T and observe that f ( x ) ≥ f ( A ( i )) ≥ f ( u + A ( i − for all x ≤ A ( i ) . Thus f ( u + A ( i − · ( A ( i ) − A ( i − ≤ Z A ( i ) A ( i − f ( x ) dx = F ( A ( i )) − F ( A ( i − . Summing over i , the right-hand side telescopes, which yields the result. Proof of Proposition A.5. Apply Lemma A.6 with S = R ≥ , F ( x ) = 2 √ x and A ( i ) = P ≤ j ≤ i a j . Proposition A.7. Let x, y, α, β > . If x − y ≤ α √ x + β then x − y ≤ α √ y + β + α p β + α . a r X i v : . [ c s . L G ] J u l upplementary Materials Proof. The proposition’s hypothesis yields y + β + α ≥ x − α √ x + α (cid:16) √ x − α (cid:17) . Taking the square root and rearranging, √ x ≤ r y + β + α α . Squaring both sides and rearranging, x ≤ y + α r y + β + α β + α ≤ y + α √ y + α p β + β + α , by subadditivity of the square root. A.2. Bregman divergence properties The following lemma collects basic facts regarding the Bregman divergence induced by a mirror map of the Legendre type.See (Zhang, n.d.) 
and (Rockafellar, 1970, Chapter 26). For a detailed discussion on the necessity (or the lack thereof) of mirrormaps of Legendre type, see (Bubeck, 2011, Chapter 5) Lemma A.8. The Bregman divergence induced by Φ satisfies the following properties:• D Φ ( x, y ) is convex in x ;• ∇ Φ( ∇ Φ ∗ ( z )) = z and ∇ ∗ Φ( ∇ Φ( x )) = x for all x and z ;• D Φ ( x, y ) = D Φ ∗ ( ∇ Φ( y ) , ∇ Φ( x )) for all x and y . Proposition A.9. If Φ is ρ -strongly convex with respect to k·k then D Φ ( x, y ) ≥ ρ k x − y k .A.2.1. D IFFERENCES OF B REGMAN DIVERGENCES Recall that in Section 3 we defined the notation D Φ ( ab ; c ) := D Φ ( a, c ) − D Φ ( b, c ) = Φ( a ) − Φ( b ) − h ∇ Φ( c ) , a − b i . This has several useful properties, which we now discuss. Proposition A.10. D Φ ( ab ; p ) is linear in ˆ p . In particular, D Φ ( ab ; ∇ Φ ∗ (ˆ p − ˆ q )) = D Φ ( ab ; p ) + h ˆ q, a − b i ∀ ˆ q ∈ R n . Proof. Immediate from the definition. Proposition A.11. For all a, b, c, d ∈ D , D Φ ( ab ; d ) − D Φ ( ab ; c ) = h ˆ c − ˆ d, a − b i = D Φ ( ab ; d ) + D Φ ( ba ; c ) . Proof. The first equality holds from Proposition A.10 with ˆ p = ˆ c and ˆ q = ˆ c − ˆ d . The second equality holds since D Φ ( ba ; c ) = − D Φ ( ab ; c ) .An immediate consequence is the “generalized triangle inequality for Bregman divergence”. See Bubeck (Bubeck, 2015,Eq. (4.1)), Beck and Teboulle (Beck & Teboulle, 2003, Lemma 4.1) or Zhang (Zhang, n.d., Eq. (3)). Proposition A.12. For all a, b, d ∈ D , D Φ ( a, d ) − D Φ ( b, d ) + D Φ ( b, a ) = h ˆ a − ˆ d, a − b i upplementary Materials Proof. Apply Proposition A.11 with c = a and use D Φ ( a, a ) = 0 . Proposition A.13. Let a, b, c, u, v ∈ R n satisfy γ ˆ a + (1 − γ )ˆ b = ˆ c for some γ ∈ R . Then γD Φ ( uv ; a ) + (1 − γ ) D Φ ( uv ; b ) = D Φ ( uv ; c ) . Proof. 
By definition of D Φ , the claimed identity is equivalent to (1 − γ ) (cid:0) Φ( u ) − Φ( v ) − h ∇ Φ( a ) , u − v i (cid:1) + γ (cid:0) Φ( u ) − Φ( v ) − h ∇ Φ( b ) , u − v i (cid:1) = (cid:0) Φ( u ) − Φ( v ) − h ∇ Φ( c ) , u − v i (cid:1) . This equality holds by canceling Φ( u ) − Φ( v ) and by the assumption that ∇ Φ( c ) = (1 − γ ) ∇ Φ( a ) + γ ∇ Φ( b ) .The following proposition is the “Pythagorean theorem for Bregman divergence”. Recall that Π Φ X ( y ) = arg min u ∈X D Φ ( u, y ) .Proofs may be found in (Bubeck, 2015, Lemma 4.1) or (Zhang, n.d., Eq. (17)). Proposition A.14. Let X ⊂ R n be a convex set. Let p ∈ R n and π = Π Φ X ( p ) . Then D Φ ( zπ ; p ) ≥ D Φ ( zπ ; π ) = D Φ ( z, π ) ∀ z ∈ X . A generalization of the previous proposition can be obtained by using the linearity property. Proposition A.15. Let X ⊂ R n be a convex set. Let p ∈ R n and π = Π Φ X ( p ) . Then D Φ ( vπ ; ∇ Φ ∗ (ˆ p − ˆ q )) ≥ D Φ ( vπ ; Φ ∗ (ˆ π − ˆ q )) ∀ v ∈ X , ˆ q ∈ R n . Proof. D Φ ( vπ ; ∇ Φ ∗ (ˆ p − ˆ q )) = D Φ ( vπ ; p ) + h ˆ q, v − π i (by Proposition A.10) ≥ D Φ ( vπ ; π ) + h ˆ q, v − π i (by Proposition A.14) = D Φ ( vπ ; ∇ Φ ∗ (ˆ π − ˆ q )) (by Proposition A.10) B. Missing details from the proof of Theorem 4.1 Due to space constraints we had to omit some calculations from the proof of Theorem 4.1 the main body of the paper.In particular, we claimed that summing the expression from Claim 4.2 over t ∈ [ T ] yields Theorem 4.1. For the sake ofcompleteness we present the missing calculations.Summing (4.8) over t and using Claim 4.2 leads to the desired telescoping sum. T X t =1 (cid:0) f t ( x t ) − f t ( z ) (cid:1) ≤ T X t =1 D Φ ( x t x t +1 ; w t +1 ) η t + (cid:16) η t +1 − η t (cid:17) D Φ ( z, x ) + D Φ ( z, x t ) η t − D Φ ( z, x t +1 ) η t +1 ! ≤ T X t =1 D Φ ( x t x t +1 ; w t +1 ) η t + η + T X t =1 (cid:18) η t +1 − η t (cid:19)! D Φ ( z, x )= T X t =1 D Φ ( x t x t +1 ; w t +1 ) η t + D Φ ( z, x ) η T +1 . C. 
Proof of Theorem 4.3 Proof (of Theorem 4.3).Fix z ∈ X . The first step is the same as in Theorem 4.1. f t ( x t ) − f t ( z ) (i) ≤ h ˆ g t , x t − z i = 1 η t h ˆ x t − ˆ w t +1 , x t − z i upplementary Materials (ii) = 1 η t (cid:16) D Φ ( x t , w t +1 ) + D Φ ( z, x t ) − D Φ ( z, w t +1 ) (cid:17) (C.1)Here (i) is the subgradient inequality. For (ii), recall our notation ˆ x t = ∇ Φ( x t ) and ˆ w t +1 = ∇ Φ( w t +1 ) , then use thegeneralized triangle inequality for Bregman divergences (Proposition A.12).From Claim 4.4 and the definition γ t = η t +1 /η t , we obtain D Φ ( z, w t +1 ) ≥ η t η t +1 D Φ ( z, x t +1 ) − (cid:16) η t η t +1 − (cid:17) D Φ ( z, x ) + D Φ ( y t +1 , w t +1 ) . (C.2)The remainder is very similar to Theorem 4.1. Plugging (C.2) into (C.1), we obtain(C.1) ≤ η t D Φ ( x t , w t +1 ) + D Φ ( z, x t ) − η t η t +1 D Φ ( z, x t +1 ) + (cid:16) η t η t +1 − (cid:17) D Φ ( z, x ) − D Φ ( y t +1 , w t +1 ) ! = D Φ ( x t , w t +1 ) − D Φ ( y t +1 , w t +1 ) η t + D Φ ( z, x t ) η t − D Φ ( z, x t +1 ) η t +1 + (cid:16) η t +1 − η t (cid:17) D Φ ( z, x ) . (C.3)Summing (C.3) over t , the D Φ ( z, x t ) terms telescope, and we obtain T X t =1 (cid:0) f t ( x t ) − f t ( z ) (cid:1) ≤ T X t =1 D Φ ( x t , w t +1 ) − D Φ ( y t +1 , w t +1 ) η t + D Φ ( z, x t ) η t − D Φ ( z, x t +1 ) η t +1 + (cid:16) η t +1 − η t (cid:17) D Φ ( z, x ) ! ≤ T X t =1 D Φ ( x t , w t +1 ) − D Φ ( y t +1 , w t +1 ) η t + η + T X t =1 (cid:18) η t +1 − η t (cid:19)! D Φ ( z, x )= T X t =1 D Φ ( x t , w t +1 ) − D Φ ( y t +1 , w t +1 ) η t + D Φ ( z, x ) η T +1 , (C.4)as desired. D. Proof of Theorem 4.6 Proof (of Theorem 4.6). As previously mentioned, this proof parallels the proof of Theorem 4.1.Fix z ∈ X . The first step is very similar to the proof of Theorem 4.1. 
f t ( x t ) − f t ( z ) ≤ h ˆ g t , x t − z i (subgradient inequality) = 1 η t h ˆ y t − ˆ w t +1 , x t − z i (by (4.15)) = 1 η t (cid:16) D Φ ( x t , w t +1 ) − D Φ ( z, w t +1 ) + D Φ ( zx t ; y t ) (cid:17) . (D.1)where we have used Proposition A.11 instead of Proposition A.12.As in the proof of Theorem 4.1, the next step is to relate D Φ ( z, w t +1 ) to D Φ ( z, y t +1 ) so that (D.1) can be bounded using atelescoping sum. The following claim is similar to Claim 4.2. Claim D.1. Assume that γ t = η t +1 /η t ∈ (0 , . Then(D.1) ≤ D Φ ( x t x t +1 ; w t +1 ) η t + (cid:16) η t +1 − η t (cid:17)| {z } telescopes D Φ ( z, x ) + D Φ ( zx t ; y t ) η t − D Φ ( zx t +1 ; y t +1 ) η t +1 | {z } telescopes . upplementary Materials Proof. The first two steps are identical to the proof of Claim 4.2. γ t (cid:0) D Φ ( z, w t +1 ) − D Φ ( x t +1 , w t +1 ) (cid:1) + (1 − γ t ) D Φ ( z, x ) ≥ γ t D Φ ( zx t +1 ; w t +1 ) + (1 − γ t ) D Φ ( zx t +1 ; x ) (since D Φ ( x t +1 , x ) ≥ and γ t ≤ ) = D Φ ( zx t +1 ; y t +1 ) (by Proposition A.13 and (4.3)) . Rearranging and using γ t > yields D Φ ( z, w t +1 ) ≥ D Φ ( x t +1 , w t +1 ) − (cid:16) γ t − (cid:17) D Φ ( z, x ) + D Φ ( zx t +1 ; y t +1 ) γ t . (D.2)Plugging this into (D.1) yields(D.1) = 1 η t (cid:16) D Φ ( x t , w t +1 ) − D Φ ( z, w t +1 ) + D Φ ( zx t ; y t ) (cid:17) ≤ η t D Φ ( x t , w t +1 ) − D Φ ( x t +1 , w t +1 ) + (cid:16) γ t − (cid:17) D Φ ( z, x ) − D Φ ( zx t +1 ; y t +1 ) γ t + D Φ ( zx t ; y t ) ! , by (D.2). The claim follows by the definition of γ t .The final step is very similar to the proof of Theorem 4.1. Summing (D.1) over t and using Claim D.1 leads to the desiredtelescoping sum. T X t =1 (cid:0) f t ( x t ) − f t ( z ) (cid:1) ≤ T X t =1 D Φ ( x t x t +1 ; w t +1 ) η t + (cid:16) η t +1 − η t (cid:17) D Φ ( z, x ) + D Φ ( zx t ; y t ) η t − D Φ ( zx t +1 ; y t +1 ) η t +1 ! ≤ T X t =1 D Φ ( x t x t +1 ; w t +1 ) η t + η + T X t =1 (cid:18) η t +1 − η t (cid:19)! 
D Φ ( z, x )= T X t =1 D Φ ( x t x t +1 ; w t +1 ) η t + D Φ ( z, x ) η T +1 . For the second inequality we have also used that D Φ ( zx ; y ) = D Φ ( z, x ) since x = y . Thus, the above shows that Regret( T, z ) ≤ T X t =1 D Φ ( x t x t +1 ; w t +1 ) η t + D Φ ( z, x ) η T +1 ∀ T > . (D.3)Notice that (D.3) is syntactically identical to (4.7); the only difference is the definition of w t +1 in these two settings.However the bound (D.3) requires further development because, curiously, this proof has not yet used the definition of x t . Toconclude the theorem, we will provide an upper bound on (D.3) that incorporates the definition x t = Π Φ X ( y t ) . Specifically,we will control D Φ ( x t x t +1 ; w t +1 ) by applying Proposition A.15 as follows. Taking p = y t , π = x t = Π Φ X ( y t ) , v = x t +1 and ˆ q = η t ˆ g t , we obtain D Φ ( x t x t +1 ; w t +1 ) = − D Φ ( vπ ; ∇ Φ ∗ (ˆ p − ˆ q )) (since ˆ w t +1 = ˆ y t − η t ˆ g t = ˆ p − ˆ q ) ≤ − D Φ ( vπ ; ∇ Φ ∗ (ˆ π − ˆ q )) (by Proposition A.15) = D Φ ( x t x t +1 ; ∇ Φ ∗ (ˆ x t − η t ˆ g t )) . Plugging this into (D.3) completes the proof. E. Additional proofs for Section 5.1 Proof (of Proposition 5.2). First we apply Proposition A.12 with a = x , b = x and d = ∇ Φ ∗ (ˆ x − ˆ q ) to obtain D Φ ( xx ; w ) = h ˆ x − ˆ d, x − x i − D Φ ( x , x ) upplementary Materials = h ˆ q, x − x i − D Φ ( x , x ) ≤ k ˆ q k ∗ k x − x k − ρ k x − x k (definition of dual norm and Proposition A.9) ≤ k ˆ q k ∗ / ρ (by Fact A.1) . F. Additional proofs for Section 5.2 An initial observation shows that Λ is non-negative in the experts’ setting. Proposition F.1. Λ( a, b ) ≥ for all a ∈ X , b ∈ D . Proof. Let us write Λ( a, b ) = − P ni =1 a i ln b i a i + ln (cid:0) P ni =1 b i (cid:1) . Since a is a probability distribution, we may apply Jensen’sinequality to show that this expression is non-negative. Proof (of Proposition 5.4). Since a, b ∈ X we have k a k = k b k = 1 . 
Then

D_Φ(ab; c) = D_KL(a, c) − D_KL(b, c)
= (D_KL(a, c) + 1 − ‖c‖₁ + ln‖c‖₁) − (D_KL(b, c) + 1 − ‖c‖₁ + ln‖c‖₁)
= Λ(a, c) − Λ(b, c)   (by definition of Λ)
≤ Λ(a, c)   (by Proposition F.1).

Proof (of Proposition 5.8). Let b = ∇Φ*(â − ηq̂). By (5.1), b_i = a_i exp(−ηq̂_i). Then

Λ(a, ∇Φ*(â − ηq̂)) = Σ_{i=1}^n a_i ln(a_i/b_i) + ln‖b‖₁
= Σ_{i=1}^n ηa_i q̂_i + ln( Σ_{i=1}^n a_i exp(−ηq̂_i) )
≤ Σ_{i=1}^n ηa_i q̂_i + Σ_{i=1}^n a_i exp(−ηq̂_i) − 1   (by Fact A.4)
≤ Σ_{i=1}^n ηa_i q̂_i + Σ_{i=1}^n a_i (1 − ηq̂_i + η²q̂_i²/2) − 1   (by Fact A.2)
≤ η² Σ_{i=1}^n a_i q̂_i / 2,

using Σ_{i=1}^n a_i = 1 (since a ∈ X) and q̂_i² ≤ q̂_i (since q̂ ∈ [0, 1]^n).

G. Remarks on lower bounds for the experts' problem

We have seen that dual averaging achieves regret √(T ln n) for all T. Here we present a lower bound analysis for DA showing that this is the best one can hope for.

Theorem G.1. There exists a value of n such that, for every T > 0, there exists a sequence of vectors {c_i | c_i ∈ {0, 1}^n}_{i=1}^T such that

lim_{t→∞} Regret_DA(t)/√(t ln n) ≥ 1,

where Regret_DA(T) denotes the worst-case regret (that is, taking the supremum of the comparison point over the simplex) of the dual averaging algorithm used in (Bubeck, 2011, Theorem 2.4).

It is known in the literature (Cesa-Bianchi & Lugosi, 2006, §3.7) that no algorithm can achieve a regret bound better than √((T/2) ln n) for the problem of learning with expert advice (as (T, n) → ∞). Thus, there is still a √2 gap between the best upper and lower bounds (that hold for all T) for prediction with expert advice. This gap was previously pointed out by Gerchinovitz (2011, pp. 52).

Algorithm 1 Adaptive randomized weighted majority based on DA (Cesa-Bianchi & Lugosi, 2006)
Input: η: N → R
x_1 = [1/n, . . . , 1/n]
for t = 1, 2, . . .
do Incur cost f ( x t ) and receive ˆ g t ∈ ∂f t ( x t ) for j=1,2,...,n do y t +1 ,j = x ,j exp (cid:16) − η t P tk =1 ˆ g k ( j ) (cid:17) end for x t +1 = y t +1 / k y t +1 k end for Proof. The detailed algorithm described in (Bubeck, 2011, Theorem 2.4) is shown in Algorithm 1, where η t is set as p n/t ,we consider the case when n = 2 , and construct the following cost vectors: c t = [1 , | ≤ t < τ, t is odd [0 , | ≤ t < τ, t is even [1 , | τ ≤ t ≤ T, ∀ t ≥ , where τ := b T − log( T ) √ T c . Without loss of generality, we assume that τ is an odd number.Throughout the remainder of this proof, denote Regret( T ) as the worts-case regret on T rounds, that is, Regret( T ) := sup z ∈ ∆ n Regret( T, z ) . It is obvious that the second expert is the best one and our regret at time T is Regret( T ) = X ≤ t<τ c | t x t − τ − 12 + X τ ≤ t ≤ T c | t x t It is also easy to check that x t = [1 / , / | ≤ t < τ, t is odd (cid:20) 11 + exp( η t − ) , 11 + exp( − η t − ) (cid:21) | ≤ t < τ, t is even (cid:20) 11 + exp( η t − ( t − τ )) , 11 + exp( − η t − ( t − τ )) (cid:21) | τ ≤ t ≤ T Regret( T ) = X ≤ t<τ,t is even (cid:18) 11 + exp( − η t − ) − (cid:19)| {z } Term 1 + T X t = τ 11 + exp(( t − τ ) η t − ) | {z } Term 2 For Term 1, Term 1 (i) ≥ X ≤ t<τ,t is even − exp( − η t − )4= X ≤ t<τ,t is even η i − O ( η i − ) (ii) ≥ √ n √ τ + o ( τ ) where (i) is true by − x − ≥ x ∀ x ∈ (0 , and (ii) is true by using the fact that P τt =1 1 √ t ≥ √ τ − upplementary Materials By the definition of τ , we have lim T →∞ τT = 1 , thus lim T →∞ Term 1 √ T ln n = 12 . 
(G.1)For Term 2, Term 2 = T X t = τ 11 + exp(( t − τ ) η t − ) ≥ T X t = τ 11 + exp(( t − τ ) q nt − ) ≥ T X t = τ 11 + exp(( t − τ ) q nτ − ) ≥ Z Tt = τ 11 + exp(( t − τ ) q nτ − ) dt = Z log( T ) √ Ty =0 11 + exp( y q nτ − ) dy (G.2)Note that Z 11 + exp( βy ) dy = y − ln (cid:0) e βy (cid:1) β Set β = q nτ − and plug the above result to Eq G.2, we get the following,Term 2 ≥ log( T ) √ T − ln (cid:16) (cid:16)q nτ − log( T ) √ T (cid:17)(cid:17)q nτ − + ln 2 q nτ − Using the fact that ln (1 + e x ) = x + o ( x ) ,Term 2 ≥ log( T ) √ T − log( T ) √ T + o r nτ − T ) √ T ! + ln 2 r ( τ − n Note that n = 2 , thus lim T →∞ Term 2 √ T ln n = lim T →∞ ln 2 q ( τ − √ T ln 2 = 12 (G.3)Combining Eq. G.1 and Eq. G.3, we conclude that lim T →∞ Regret(T) √ T ln n = lim T →∞ Term 1 √ T ln n + lim T →∞ Term 2 √ T ln n ≥ 12 + 12 = 1 . H. Aditional proofs for Section 6 At many points throughout this section we will need to talk about optimality condition for problems where we minimize aconvex function over a convex set. Such conditions depend on the normal cone of the set on which the optimization is takingplace. Definition H.1. The normal cone to C ⊆ R n at x ∈ R n is the set N C ( x ) := { s ∈ R n | h s, y − x i ≤ ∀ y ∈ C } . upplementary Materials Lemma H.2 ((Rockafellar, 1970, Theorem 27.4)). Let h : C → R be a closed convex function such that (ri C ) ∩ (ri X ) = ∅ .Then, x ∈ arg min z ∈X h ( z ) if and only if there is ˆ g ∈ ∂h ( x ) such that − ˆ g ∈ N X ( x ) .Using the above result allows us to derive a useful characterization of points that realize the Bregman projections. Lemma H.3. Let y ∈ D and x ∈ ¯ D . Then x = Π Φ X ( y ) if and only if x ∈ D ∩ X and ∇ Φ( y ) − ∇ Φ( x ) ∈ N X ( x ) . Proof. Suppose x ∈ D ∩ X and ∇ Φ( y ) − ∇ Φ( x ) ∈ N X ( x ) . Since ∇ Φ( y ) − ∇ Φ( x ) = −∇ ( D Φ ( · , y ))( x ) , by Lemma H.2we conclude that x ∈ arg min z ∈X D ( z, y ) . Now suppose x = Π Φ X ( y ) . 
By Lemma H.2 together with the definition of Bregmandivergence, this is the case if and only if there is − g ∈ ∂ Φ( x ) such that − ( g − ∇ Φ( y )) ∈ N X ( x ) . Since Φ is of Legendretype we have ∂ Φ( z ) = ∅ for any z 6∈ D (see (Rockafellar, 1970, Theorem 26.1)). Thus, x ∈ D and g = ∇ Φ( x ) since Φ isdifferentiable. Finally, x ∈ X by the definition of Bregman projection.Before proceding to the proof of the results from Section 6, we need to state on last result about the relation of subgradientsand conjugate functions. Lemma H.4 ((Rockafellar, 1970, Theorem 23.5)). Let f : X → R , let x ∈ X and let ˆ y ∈ R n . Then ˆ y ∈ ∂f ( x ) if and only if x attains sup x ∈ R n ( h ˆ y, x i − f ( x )) = f ∗ (ˆ y ) . Proof (of Proposition 6.1). Let t ≥ and let F t : D → R be the function being minimized on the right-hand side of (6.1). Bydefinition we have x t +1 = Π Φ X ( y t +1 ) . Using the optimality conditions of the Bregman projection, we have x t +1 = Π Φ X ( y t +1 ) ⇐⇒ ˆ y t +1 − ˆ x t +1 ∈ N X ( x t +1 ) , (by Lemma H.3)By further using the definitions from Algorithm 2 we get ˆ y t +1 − ˆ x t +1 = γ t (ˆ x t − η t ˆ g t ) + (1 − γ t )ˆ x − ˆ x t +1 = γ t (ˆ x t − ˆ x t +1 − η t ˆ g t ) + (1 − γ t )(ˆ x − ˆ x t +1 )= − γ t (cid:0) ∇ ( D Φ ( · , x t ))( x t +1 ) + η t ˆ g t (cid:1) − (1 − γ t ) ∇ ( D Φ ( · , x ))( x t +1 )= −∇ F t ( x t +1 ) Thus, we have −∇ F t ( x t +1 ) ∈ N X ( x t +1 ) . By the optimality conditions from Lemma H.2 we conclude that x t +1 ∈ arg min x ∈X F t ( x ) , as desired.Theorem 6.2 is an easy consequence of the following proposition. Proposition H.5. Let { f t } t ≥ with f t : X → R be a sequence of convex functions and let η : N → R > be non-increasing.Let { x t } t ≥ and { ˆ g t } t ≥ be as in Algorithm 2. Define γ [ i,t ] := Q tj = i γ j for every i, t ∈ N . 
Then, there are { p t } t ≥ with p t ∈ N X ( x t ) for each t ≥ such that { x t +1 } = arg min x ∈X (cid:16) t X i =1 γ [ i,t ] h η i ˆ g i + p i , x i − (cid:16) γ [1 ,t ] + t X i =1 γ [ i +1 ,t ] (1 − γ i ) (cid:17) h ˆ x , x i + Φ( x ) (cid:17) , ∀ t ≥ . (H.1) Proof. First of all, in order to prove (H.1) we claim it suffices to prove that there are { p t } t ≥ with p t ∈ N X ( x t ) for each t ≥ such that ˆ y t +1 = − t X i =1 γ [ i,t ] ( η i ˆ g i + p i ) + (cid:16) γ [1 ,t ] + t X i =1 γ [ i +1 ,t ] (1 − γ i ) (cid:17) ˆ x , ∀ t ≥ . (H.2)To see the sufficiency of this claim, note that x t +1 = Π Φ X ( y t +1 )) ⇐⇒ ˆ y t +1 − ˆ x t +1 ∈ N X ( x t +1 ) (Lemma H.3) ⇐⇒ ˆ y t +1 ∈ ∂ (Φ + δ ( · | X ))( x t +1 ) ( ∂ ( δ ( · | X ))( x ) = N X ( x )) ⇐⇒ x t +1 ∈ arg max x ∈ R n (cid:0) h ˆ y t +1 , x i − Φ( x ) − δ ( x | X ) (cid:1) ( Lemma H.4 ) ⇐⇒ x t +1 ∈ arg min x ∈X (cid:0) −h ˆ y t +1 , x i + Φ( x ) (cid:1) . upplementary Materials The above together with (H.2) yields (H.1). Let us now prove (H.2) by induction on t ≥ .For t = 0 (H.2) holds trivially. Let t > . By definition, we have ˆ y t +1 = (1 − γ t )(ˆ x t − η t ˆ g t ) + γ t ˆ x . At this point, to usethe induction hypothesis, we need to write ˆ x t in function of ˆ y t . From the definition of Algorithm 2, we have x t = Π Φ X ( y t ) . ByLemma H.3, the latter holds if and only if ˆ y t − ˆ x t ∈ N X ( x t ) . That is, there is p t ∈ N X ( x t ) such that ˆ x t = ˆ y t − p t . Pluggingthese facts together and using our induction hypothesis we have ˆ y t +1 = γ t (ˆ x t − η t ˆ g t ) + (1 − γ t )ˆ x = γ t (ˆ y t − η t ˆ g t − p t ) + (1 − γ t )ˆ x I.H. = γ t (cid:16) − t − X i =1 γ [ i,t − ( η i ˆ g i + p i ) − η t ˆ g t − p t + (cid:16) γ [1 ,t − + t − X i =1 γ [ i +1 ,t − (1 − γ i ) (cid:17) ˆ x (cid:17) + (1 − γ t )ˆ x = − t X i =1 γ [ i,t ] ( η i ˆ g i + p i ) + (cid:16) γ [1 ,t ] + t X i =1 γ [ i +1 ,t ] (1 − γ i ) (cid:17) ˆ x , and this finishes the proof of (H.2). Proof (of Theorem 6.2). 
Define γ_{[i,t]} for every i, t ∈ N as in Proposition H.5. If γ_t = 1 for all t ≥ 1, then γ_{[i,t]} = 1 for any t, i ≥ 1. Moreover, if γ_t = η_{t+1}/η_t for every t ≥ 1, then for every t, i ∈ N with t ≥ i we have γ_{[i,t]} = η_{t+1}/η_i, which yields

γ_{[i,t]}(η_i ĝ_i + p_i) = η_{t+1}(ĝ_i + p_i/η_i)

and

γ_{[1,t]} + Σ_{i=1}^t γ_{[i+1,t]}(1 − γ_i) = η_{t+1}/η_1 + Σ_{i=1}^t (η_{t+1}/η_{i+1})(1 − η_{i+1}/η_i) = η_{t+1}/η_1 + η_{t+1} Σ_{i=1}^t (1/η_{i+1} − 1/η_i) = 1.

I. Dual Stabilized OMD with Composite Functions

In this section we extend dual-stabilized OMD to the case where the functions revealed at each round are composite (Xiao, 2010; Duchi et al., 2010). More specifically, at each round t ≥ 1 we see a function of the form f_t + Ψ, where f_t: X → R is convex and Lipschitz continuous and Ψ is a fixed convex function which we assume to be "easy", i.e., such that we know how to efficiently compute points in argmin_{x∈X}(D_Φ(x, x̄) + Ψ(x)). In fact, we could simply use the original dual-stabilized OMD in this setting, but this approach has some drawbacks. One issue is that subgradients of Ψ would end up appearing in the regret bound from Theorem 4.1, which is not ideal: we want bounds that are unaffected by the "easy" function Ψ. Another drawback is that we would not be using our knowledge of the structure of the functions, which is in many cases sub-optimal. For example, in an online learning problem one may artificially add the ℓ₁-norm to each function in order to induce sparsity in the iterates of the algorithm. However, using subgradients of the ℓ₁-norm instead of the norm itself does not induce sparsity (McMahan, 2011). Finally, the analysis of dual-stabilized OMD adapted to the composite setting that we develop in this section is an easy extension of the original analysis of Section 4.
This is interesting since, in the literature, algorithms for the composite setting usually require a more intricate analysis, such as the analysis of Regularized DA from (Xiao, 2010), or the use of powerful results, such as the duality between strong convexity and strong smoothness used in (McMahan, 2017). An important exception is the analysis of the composite objective mirror descent from (Duchi et al., 2010), which is already elegant. Still, it does not directly apply when we use dual-stabilization, and it uses proof techniques specially tailored to the proximal-step form of writing OMD.

In the composite setting we assume without loss of generality that X = R^n, since we may substitute Ψ by Ψ + δ(· | X), where δ(x | X) = 0 if x ∈ X and +∞ otherwise. To avoid comparing the loss of the algorithm against points outside of the effective domain of Ψ (denoted by dom Ψ) in the definition of regret, we define the Ψ-regret of the sequence of functions {f_t}_{t≥1} and iterates {x_t}_{t≥1} (against a comparison point z ∈ dom Ψ) by

Regret_Ψ(T, z) := Σ_{t=1}^T (f_t(x_t) + Ψ(x_t)) − Σ_{t=1}^T (f_t(z) + Ψ(z)),   ∀T ≥ 1.

To adapt the dual-stabilization method to this setting, we use the same idea as in (Duchi et al., 2010). Namely, we modify the proximal-like formulation of dual stabilization from Proposition 6.1 so that we do not linearize (i.e., take the subgradient of) the function Ψ, which yields

{x_{t+1}} := argmin_{x ∈ X} ( γ_t(η_t(⟨ĝ_t, x⟩ + Ψ(x)) + D_Φ(x, x_t)) + (1 − γ_t) D_Φ(x, x_1) ),   ∀t ≥ 1.

Algorithm 2 Dual-stabilized OMD with dynamic learning rate η_t and additional regularization function Ψ.
Input: x_1 ∈ argmin_{x∈R^n} Ψ(x), η: N → R_+, γ: N → [0, 1]
ŷ_1 = ∇Φ(x_1)
for t = 1, 2, . . .
do
  Incur cost $f_t(x_t)$ and receive $\hat g_t \in \partial f_t(x_t)$
  $\hat x_t = \nabla \Phi(x_t)$  ▷ map primal iterate to dual space
  $\hat w_{t+1} = \hat x_t - \eta_t \hat g_t$  ▷ gradient step in dual space (I.1)
  $\hat y_{t+1} = \gamma_t \hat w_{t+1} + (1 - \gamma_t) \hat x_1$  ▷ stabilization in dual space (I.2)
  $y_{t+1} = \nabla \Phi^*(\hat y_{t+1})$  ▷ map dual iterate to primal space
  $\alpha_{t+1} := \eta_t \gamma_t$  ▷ compute scaling factor for $\Psi$
  $x_{t+1} = \Pi^\Phi_{\alpha_{t+1} \Psi}(y_{t+1})$  ▷ project onto feasible region (I.3)
end for

Although we do not prove the equivalence for the sake of conciseness, in Algorithm 2 we present the above procedure written in a form closer to Algorithm 1. In this new algorithm, we extend the definition of Bregman projection and define the $\Psi$-Bregman projection by $\{\Pi^\Phi_\Psi(y)\} := \arg\min_{x \in \mathbb{R}^n} ( D_\Phi(x, y) + \Psi(x) )$. The next lemma shows an analogue of the generalized Pythagorean theorem for the $\Psi$-Bregman projection.

Lemma I.1. Let $\alpha > 0$ and $\bar y := \Pi^\Phi_{\alpha \Psi}(y)$. Then
\[
D_\Phi(x, \bar y) + D_\Phi(\bar y, y) \leq D_\Phi(x, y) + \alpha \big( \Psi(x) - \Psi(\bar y) \big), \qquad \forall x \in \mathbb{R}^n .
\]

Proof. By the optimality conditions of the projection, we have $\nabla \Phi(y) - \nabla \Phi(\bar y) \in \partial(\alpha \Psi)(\bar y)$. Using the generalized triangle inequality for Bregman divergences (Lemma A.8) and the subgradient inequality, we get
\[
D_\Phi(x, \bar y) + D_\Phi(\bar y, y) - D_\Phi(x, y) = \langle \nabla \Phi(y) - \nabla \Phi(\bar y), x - \bar y \rangle \overset{\text{(i)}}{\leq} \alpha \big( \Psi(x) - \Psi(\bar y) \big),
\]
where (i) follows from $\nabla \Phi(y) - \nabla \Phi(\bar y) \in \partial(\alpha \Psi)(\bar y)$ and the convexity of $\alpha \Psi(\cdot)$.

In the next theorem we show that the regret bound we have for Dual-Stabilized OMD still holds in this setting when using Algorithm 2, and the proof boils down to simple modifications of the original proof of Theorem 4.1.

Theorem I.2. Suppose that $\Phi$ is $\rho$-strongly convex with respect to a norm $\|\cdot\|$ and let $\Psi$ be a convex function such that $\operatorname{dom} \Psi \subseteq \mathcal{X}$. Set $\gamma_t = \eta_{t+1}/\eta_t$ for each $t \geq 1$. Let $x_1 \in \arg\min_{x \in \operatorname{dom} \Psi} \Psi(x)$ and let $\{x_t\}_{t \geq 1}$ be the sequence of iterates generated by Algorithm 2.
Then, for any sequence of convex functions $\{f_t\}_{t \geq 1}$ with each $f_t \colon \mathcal{X} \to \mathbb{R}$ and any $z \in \operatorname{dom} \Psi$,
\[
\operatorname{Regret}_\Psi(T, z) \leq \sum_{t=1}^{T} \frac{D_\Phi(x_t \triangleright x_{t+1}; w_{t+1})}{\eta_t} + \frac{D_\Phi(z, x_1)}{\eta_{T+1}}, \qquad \forall T \geq 1 . \tag{I.4}
\]

Proof. Let $z \in \operatorname{dom} \Psi$ and $t \in \mathbb{N}$. By (4.8), we have
\[
f_t(x_t) + \Psi(x_t) - f_t(z) - \Psi(z) \leq \frac{1}{\eta_t} \Big( D_\Phi(x_t, w_{t+1}) + D_\Phi(z, x_t) - D_\Phi(z, w_{t+1}) \Big) + \Psi(x_t) - \Psi(z) . \tag{I.5}
\]
As in Theorem 4.1, let us bound the above expression by a sum of telescoping terms.

Claim I.3. Assume that $\gamma_t = \eta_{t+1}/\eta_t \in (0, 1]$. Then
\[
\text{(I.5)} \leq \frac{D_\Phi(x_t \triangleright x_{t+1}; w_{t+1})}{\eta_t}
+ \underbrace{\Big( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Big)}_{\text{telescopes}} D_\Phi(z, x_1)
+ \underbrace{\frac{D_\Phi(z, x_t)}{\eta_t} - \frac{D_\Phi(z, x_{t+1})}{\eta_{t+1}}}_{\text{telescopes}}
+ \underbrace{\Psi(x_t) - \Psi(x_{t+1})}_{\text{telescopes}} .
\]

Proof. Fix $z \in \operatorname{dom} \Psi$. First we derive the inequality
\[
\begin{aligned}
\gamma_t \big( D_\Phi(z, w_{t+1}) - D_\Phi(x_{t+1}, w_{t+1}) \big) + (1 - \gamma_t) D_\Phi(z, x_1)
&\geq \gamma_t D_\Phi(z \triangleright x_{t+1}; w_{t+1}) + (1 - \gamma_t) D_\Phi(z \triangleright x_{t+1}; x_1) && (\text{since } D_\Phi(x_{t+1}, x_1) \geq 0 \text{ and } \gamma_t \leq 1) \\
&= D_\Phi(z \triangleright x_{t+1}; y_{t+1}) && (\text{by Proposition A.13 and (I.2)}) \\
&\geq D_\Phi(z, x_{t+1}) + \alpha_{t+1} \big( \Psi(x_{t+1}) - \Psi(z) \big) && (\text{by Lemma I.1 and (I.3)}) .
\end{aligned}
\]
Rearranging and using both $\gamma_t > 0$ and $\alpha_{t+1} = \eta_t \gamma_t$ yields
\[
D_\Phi(z, w_{t+1}) \geq D_\Phi(x_{t+1}, w_{t+1}) - \Big( \frac{1}{\gamma_t} - 1 \Big) D_\Phi(z, x_1) + \frac{1}{\gamma_t} D_\Phi(z, x_{t+1}) + \eta_t \big( \Psi(x_{t+1}) - \Psi(z) \big) . \tag{I.6}
\]
Plugging this into (I.5) yields
\[
\begin{aligned}
\text{(I.5)} &= \frac{1}{\eta_t} \Big( D_\Phi(x_t, w_{t+1}) - D_\Phi(z, w_{t+1}) + D_\Phi(z, x_t) \Big) + \Psi(x_t) - \Psi(z) \\
&\leq \frac{1}{\eta_t} \bigg( D_\Phi(x_t \triangleright x_{t+1}; w_{t+1}) + \Big( \frac{1}{\gamma_t} - 1 \Big) D_\Phi(z, x_1) - \frac{1}{\gamma_t} D_\Phi(z, x_{t+1}) + D_\Phi(z, x_t) \bigg) + \Psi(x_t) - \Psi(x_{t+1}) .
\end{aligned}
\]
The claim follows by the definition of $\gamma_t$.

The final step is very similar to the standard OMD proof. Summing (I.5) over $t$ and using Claim I.3 leads to the desired telescoping sum:
\[
\begin{aligned}
\operatorname{Regret}_\Psi(T, z)
&\leq \sum_{t=1}^{T} \bigg( \frac{D_\Phi(x_t \triangleright x_{t+1}; w_{t+1})}{\eta_t} + \Big( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Big) D_\Phi(z, x_1) + \frac{D_\Phi(z, x_t)}{\eta_t} - \frac{D_\Phi(z, x_{t+1})}{\eta_{t+1}} + \Psi(x_t) - \Psi(x_{t+1}) \bigg) \\
&\leq \sum_{t=1}^{T} \frac{D_\Phi(x_t \triangleright x_{t+1}; w_{t+1})}{\eta_t} + \frac{D_\Phi(z, x_1)}{\eta_{T+1}} + \Psi(x_1) - \Psi(x_{T+1}) \\
&\leq \sum_{t=1}^{T} \frac{D_\Phi(x_t \triangleright x_{t+1}; w_{t+1})}{\eta_t} + \frac{D_\Phi(z, x_1)}{\eta_{T+1}} ,
\end{aligned}
\]
where in the last inequality we used $x_1 \in \arg\min_{x \in \operatorname{dom} \Psi} \Psi(x)$.

References

Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Bubeck, S. Introduction to online optimization, December 2011. Unpublished.
Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. Composite objective mirror descent. In Proceedings of COLT, pp. 14–26, 2010.
Gerchinovitz, S. Prediction of individual sequences and prediction in the statistical framework: some links around sparse regression and aggregation techniques. PhD thesis, Université Paris-Sud, 2011.
McMahan, H. B. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In Proceedings of AISTATS, pp. 525–533, 2011.
McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. Journal of Machine Learning Research, 18:90:1–90:50, 2017.
Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, 1970.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11, 2010.
Zhang, X.
Bregman divergence and mirror descent. Lecture notes, n.d. Available at http://users.cecs.anu.edu.au/~xzhang/teaching/bregman.pdf.
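For concreteness, the following Python sketch instantiates Algorithm 2 in the simplest setting: $\Phi = \frac{1}{2}\|\cdot\|_2^2$, so that both mirror maps $\nabla\Phi$ and $\nabla\Phi^*$ are the identity, and $\Psi = \lambda\|\cdot\|_1$, for which the $\Psi$-Bregman projection in (I.3) reduces to soft-thresholding. The learning-rate schedule $\eta_t = c/\sqrt{t}$ and the synthetic gradients are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    # Psi-Bregman projection for Phi = 0.5*||.||^2 and Psi = (tau/alpha)*||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def dual_stabilized_omd_l1(grads, lam=0.1, c=1.0):
    """Sketch of Algorithm 2 with Phi = 0.5*||x||^2 and Psi = lam*||.||_1.

    grads[t-1] stands in for a subgradient g_t of f_t at x_t."""
    n = len(grads[0])
    eta = lambda t: c / np.sqrt(t)                    # dynamic learning rate eta_t
    x = np.zeros(n)                                   # x_1 minimizes Psi = lam*||.||_1
    x1_hat = x.copy()                                 # hat x_1 = grad Phi(x_1) = x_1
    iterates = [x.copy()]
    for t, g in enumerate(grads, start=1):
        gamma = eta(t + 1) / eta(t)                   # gamma_t = eta_{t+1}/eta_t in (0, 1]
        x_hat = x                                     # map to dual space (identity here)
        w_hat = x_hat - eta(t) * g                    # gradient step in dual space (I.1)
        y_hat = gamma * w_hat + (1 - gamma) * x1_hat  # stabilization in dual space (I.2)
        alpha = eta(t) * gamma                        # alpha_{t+1} = eta_t * gamma_t = eta_{t+1}
        x = soft_threshold(y_hat, alpha * lam)        # Psi-Bregman projection (I.3)
        iterates.append(x.copy())
    return iterates

rng = np.random.default_rng(1)
xs = dual_stabilized_omd_l1(list(rng.normal(size=(20, 4))))
```

With this choice of $\Phi$, the stabilization step is a plain convex combination of the dual gradient step and the anchor $x_1$, and the composite treatment of $\Psi$ produces exactly the sparsity-inducing behavior discussed in the introduction of this section.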