Smoothed Online Convex Optimization in High Dimensions via Online Balanced Descent
Proceedings of Machine Learning Research vol 75:1–21, 2018. 31st Annual Conference on Learning Theory.
Niangjun Chen$^{a,*}$ chennj@ihpc.a-star.edu.sg
Gautam Goel$^{b,*}$ ggoel@caltech.edu
Adam Wierman$^{b}$ adamw@caltech.edu
$^a$ Institute of High Performance Computing
$^b$ California Institute of Technology
Editors:
Sebastien Bubeck, Vianney Perchet and Philippe Rigollet
Abstract
We study smoothed online convex optimization, a version of online convex optimization where the learner incurs a penalty for changing her actions between rounds. Given a $\Omega(\sqrt{d})$ lower bound on the competitive ratio of any online algorithm, where $d$ is the dimension of the action space, we ask under what conditions this bound can be beaten. We introduce a novel algorithmic framework for this problem, Online Balanced Descent (OBD), which works by iteratively projecting the previous point onto a carefully chosen level set of the current cost function so as to balance the switching costs and hitting costs. We demonstrate the generality of the OBD framework by showing how, with different choices of “balance,” OBD can improve upon state-of-the-art performance guarantees for both competitive ratio and regret; in particular, OBD is the first algorithm to achieve a dimension-free competitive ratio, $O(1/\alpha)$, for locally polyhedral costs, where $\alpha$ measures the “steepness” of the costs. We also prove bounds on the dynamic regret of OBD when the balance is performed in the dual space that are dimension-free and imply that OBD has sublinear static regret.
1. Introduction
In this paper we develop a new algorithmic framework, Online Balanced Descent (OBD), for online convex optimization problems with switching costs, a class of problems termed smoothed online convex optimization (SOCO). Specifically, we consider a setting where a learner plays a series of rounds $1, 2, \ldots, T$. In each round, the learner observes a convex cost function $f_t$, picks a point $x_t$ from a convex set $\mathcal{X}$, and then incurs a hitting cost $f_t(x_t)$. Additionally, she incurs a switching cost for changing her actions between successive rounds, $\|x_t - x_{t-1}\|$, where $\|\cdot\|$ is a norm.

This setting generalizes classical Online Convex Optimization (OCO), and has received considerable attention in recent years as a result of the recognition that switching costs play a crucial role in many problems in learning, algorithms, control, and networking. In particular, many applications have, in reality, some cost associated with a change of action that motivates the learner to adopt “smooth” sequences of actions. For example, switching costs have received considerable attention in the $k$-armed bandit setting (Agrawal et al., 1990; Guha and Munagala, 2009; Koren et al., 2017), and the core of the Metrical Task Systems (MTS) literature is determining how to manage switching costs, e.g., the $k$-server problem (Borodin et al., 1992; Borodin and El-Yaniv, 2005).

Outside of learning, SOCO has received considerable attention in the networking and control communities. In these problems there is typically a measurable cost to changing an action. For example, one of the initial applications where SOCO was adopted is the dynamic management of service capacity in data centers (Lin et al., 2011; Lu et al., 2013), where the wear-and-tear costs of switching servers into and out of deep power-saving states is considerable.

$^*$ Niangjun Chen and Gautam Goel contributed equally to this work.
Other applications where SOCO has seen real-world deployment are the dynamic management of routing between data centers (Lin et al., 2012; Wang et al., 2014), management of electrical vehicle charging (Kim and Giannakis, 2014), video streaming (Joseph and de Veciana, 2012), speech animation (Kim et al., 2015), multi-timescale control (Goel et al., 2017), power generation planning (Badiei et al., 2015), and the thermal management of System-on-Chip (SoC) circuits (Zanini et al., 2009, 2010).
High-dimensional SOCO.
An important aspect of nearly all the problems mentioned above is that they are high-dimensional, i.e., the dimension $d$ of the action space is large. For example, in the case of dynamic management of data centers the dimension grows with the heterogeneity of the storage and compute nodes in the cluster, as well as the heterogeneity of the incoming workloads. However, the design of algorithms for high-dimensional SOCO problems has proven challenging, with fundamental lower bounds blocking progress.

Initial results on SOCO focused on finding competitive algorithms in low-dimensional settings. Specifically, Lin et al. (2011) introduced the problem in the one-dimensional case and gave a 3-competitive algorithm. A few years later, Bansal et al. (2015) gave a 2-competitive algorithm, still for the one-dimensional case. Following these papers, Antoniadis et al. (2016) claimed that SOCO is equivalent to the classical problem of Convex Body Chasing (Friedman and Linial, 1993), in the sense that a competitive algorithm for one problem implies the existence of a competitive algorithm for the other. Using this connection, they claimed to show the existence of a constant competitive algorithm for two-dimensional SOCO. However, their analysis turned out to have a bug and their claims have been retracted (Pruhs, 2018). Still, the connection to Convex Body Chasing does highlight a fundamental limitation. In particular, it is not possible to design a competitive algorithm for high-dimensional SOCO without making restrictions on the cost functions considered since, as we observe in Section 2, for general convex cost functions and $\ell_2$ switching costs, the competitive ratio of any algorithm is $\Omega(\sqrt{d})$.

The importance of high-dimensional SOCO problems in practical applications has motivated “beyond worst-case” analysis for SOCO as a way of overcoming the challenge of designing constant competitive algorithms in high dimensions. To this end, Lin et al. (2012); Andrew et al. (2013); Chen et al. (2015); Badiei et al. (2015); Chen et al. (2016) all explored the value of predictions in SOCO, highlighting that it is possible to provide constant-competitive algorithms for high-dimensional SOCO problems using algorithms that have predictions of future cost functions; e.g., Lin et al. (2012) gave an algorithm based on receding horizon control that is $1 + O(1/w)$-competitive when given $w$-step lookahead. Recently, this was revisited in the case of quadratic switching costs by Li et al. (2018), which gives an algorithm that combines receding horizon control with gradient descent to achieve a competitive ratio that decays exponentially in $w$.

In addition to the stream of work focused on competitive ratio, there is a separate stream of work focusing on the development of algorithms with small regret. With respect to classical static regret, where the comparison is with the fixed, static offline optimal, Andrew et al. (2013) showed that SOCO is no more challenging than OCO. In fact, many OCO algorithms, e.g., Online Gradient Descent (OGD), obtain bounds on regret of the same asymptotic order for SOCO as for OCO. However, the task of bounding dynamic regret in SOCO is more challenging and, to this point, the only positive results for dynamic regret rely on the use of predictions.
There have been a number of attempts to connect the communities focusing on regret and competitive ratio over the years. Blum and Burch (2000) initiated this direction by providing an analytic framework that connects OCO algorithms with MTS algorithms, allowing the derivation of regret bounds for MTS algorithms in the OCO framework and competitive analysis of OCO algorithms in the MTS framework. Two breakthroughs in this direction occurred recently. First, Buchbinder et al. (2012) used a primal-dual technique to develop an algorithm that, for the first time, provided a unified approach for algorithm design across competitive ratio and regret in the MTS setting, over a discrete action space. Second, a series of recent papers (Abernethy et al., 2010; Buchbinder et al., 2014; Bubeck et al., 2017) used techniques based on Online Mirror Descent (OMD) to provide significant advances in the analysis of the $k$-server and MTS problems. However, again there is a fundamental limit to these unifying approaches in the setting of SOCO. Andrew et al. (2013) shows that no individual algorithm can simultaneously be constant competitive and have sublinear regret. Currently, the only unifying frameworks for SOCO rely on the use of predictions, and use approaches based on receding horizon control, e.g., Chen et al. (2015); Badiei et al. (2015); Chen et al. (2016); Li et al. (2018).

Contributions of this paper.
The prior discussion highlights the challenges associated with designing algorithms for high-dimensional SOCO problems, both in terms of competitive ratio and dynamic regret. In this paper, we introduce a new, general algorithmic framework, Online Balanced Descent (OBD), that yields (i) an algorithm with a dimension-free, constant competitive ratio for locally polyhedral cost functions and $\ell_2$ switching costs and (ii) an algorithm with a dimension-free, sublinear static regret that does not depend on the size of the gradients of the cost functions. In both cases, OBD achieves these results without relying on predictions of future cost functions; it is the first algorithm to achieve such a bound on competitive ratio outside of the one-dimensional setting.

The key idea behind OBD is to move using a projection onto a carefully chosen level set at each step, chosen to “balance” the switching and hitting costs incurred. The form of “balance” is chosen carefully based on the performance metric. In particular, our results use a “balance” of the switching and hitting costs in the primal space to obtain results for the competitive ratio and a “balance” of the switching costs and the gradient of the hitting costs in the dual space to obtain results for dynamic regret. The resulting OBD algorithms are efficient to implement, even in high dimensions. They are also memoryless, i.e., they do not use any information about previous cost functions.

The technical results of the paper bound the competitive ratio and dynamic regret of OBD. In both cases we obtain results that improve the state-of-the-art. In the case of competitive ratio, we obtain the first results that break through the $\sqrt{d}$ barrier without the use of predictions. In particular, we show that OBD with $\ell_2$ switching costs yields a constant, dimension-free competitive ratio for locally polyhedral cost functions, i.e., functions which grow at least linearly away from their minimizer.
Specifically, in Theorem 7 we show that OBD has a competitive ratio of $O(1/\alpha)$, where $\alpha$ bounds the “steepness” of the costs. Note that Bansal et al. (2015) shows that no memoryless algorithm can achieve a competitive ratio better than 3 for locally polyhedral functions. By equivalence of norms in finite dimensional space, our algorithm is also competitive when the switching costs are arbitrary norms (though the exact competitive ratio may depend on $d$). Our proof of this result depends crucially on the geometry of the level sets.

In the case of dynamic regret, we obtain the first results that provide sublinear dynamic regret without the use of predictions. Further, the bounds on dynamic regret we prove are independent of the size of the gradient of the cost function. Specifically, in Corollary 11 we show that OBD has dynamic regret bounded by $O(\sqrt{LT})$, where $L$ is the total distance traveled by the offline optimal. When comparing to a static optimal, OBD achieves sublinear regret of $O(\sqrt{DT})$, where $D$ is the diameter of the feasible set. The proof again makes use of a geometric interpretation of OBD in terms of the level sets. In particular, the projection onto the level sets allows OBD to be “one step ahead” of Online Mirror Descent. Further, OBD carefully chooses the step sizes to balance the switching cost and marginal hitting cost in the dual space.
2. Online Optimization with Switching Costs
We study online convex optimization problems with switching costs, a class of problems often termed Smoothed Online Convex Optimization (SOCO). An instance of SOCO consists of a fixed convex decision/action space $\mathcal{X} \subset \mathbb{R}^d$, a norm $\|\cdot\|$ on $\mathbb{R}^d$, and a sequence of non-negative convex cost functions $f_t$, where $t = 1, \ldots, T$ and $f_t(x) = +\infty$ for all $x \notin \mathcal{X}$.

At each time $t$, an online learner observes the cost function $f_t$ and chooses a point $x_t$ in $\mathcal{X}$, incurring a hitting cost $f_t(x_t)$ and a switching cost $\|x_t - x_{t-1}\|$. The total cost incurred is thus
$$\mathrm{cost}(ALG) = \sum_{t=1}^{T} f_t(x_t) + \|x_t - x_{t-1}\|, \quad (1)$$
where $x_t$ are the decisions of online algorithm $ALG$. While we state this as unconstrained, constraints are incorporated via the decision space $\mathcal{X}$. Note that the decision variable could be taken to be matrix-valued instead of vector-valued; similarly, the switching cost could be any matrix norm.

Importantly, we have allowed the learner to see the current cost function when deciding on an action, i.e., the online algorithm observes $f_t$ before picking $x_t$. This is standard when studying switching costs due to the added complexity created by the coupling of the actions $x_t$ by the switching cost term, e.g., Andrew et al. (2013); Bansal et al. (2015), but it is different than the standard assumption in the OCO literature. While allowing the algorithm to see $f_t$ in the standard OCO setting would make the problem trivial, in the SOCO setting considerable complexity remains: the choice in round $t$ of the offline optimal depends on $f_s$ for all $s > t$. Giving the algorithm complete information about $f_t$ isolates the difficulty of the problem, eliminating the performance loss that comes from not knowing $f_t$ and focusing on the performance loss due to the coupling created by the switching costs. This is natural since it is easy to bound the extra cost incurred due to lack of knowledge of $f_t$.
In particular, as done in Blum and Burch (2000) and Buchbinder et al. (2012), the penalty due to not knowing $f_t$ is
$$f_t(x_t) - f_t(x_{t+1}) \le \langle \nabla f_t(x_t), x_t - x_{t+1} \rangle \le \|\nabla f_t(x_t)\| \, \|x_t - x_{t+1}\|.$$
Since cost functions are typically assumed to have a bounded gradient in the OCO literature, this equation gives a translation of the results in this paper to the setting where $f_t$ is not known.

Note that the SOCO model is closely related to the Metrical Task System (MTS) literature. In MTS, the cost functions $f_t$ can be arbitrary, the feasible set $\mathcal{X}$ is discrete, and the movement cost can be any metric distance. Due to the generality of the MTS setting, the results are typically pessimistic. For example, Borodin et al. (1992) shows that the competitive ratio of any deterministic algorithm must be $\Omega(n)$, where $n$ is the size of $\mathcal{X}$, and Blum et al. (1992) shows an $\Omega(\sqrt{\log(n)/\log\log(n)})$ lower bound for randomized algorithms. In comparison, SOCO restricts both the cost functions and the feasible space to be convex, though the decision space is continuous rather than discrete.
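To make the accounting in (1) concrete, here is a minimal sketch that evaluates the total cost of a decision sequence under $\ell_2$ switching costs. The quadratic tracking costs and the action sequence below are hypothetical, chosen only for illustration.

```python
import math

def soco_cost(costs, xs, x0):
    """Total cost (1): hitting costs f_t(x_t) plus l2 switching costs ||x_t - x_{t-1}||."""
    total, prev = 0.0, x0
    for f, x in zip(costs, xs):
        total += f(x) + math.dist(x, prev)
        prev = x
    return total

# Hypothetical instance: quadratic tracking costs f_t(x) = ||x - v_t||^2 in R^2.
vs = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
costs = [lambda x, v=v: (x[0] - v[0]) ** 2 + (x[1] - v[1]) ** 2 for v in vs]

# Following the minimizers exactly makes every hitting cost zero,
# so the total is pure switching cost: 1 + 1 + 1 = 3.
total = soco_cost(costs, vs, x0=(0.0, 0.0))
```

The same helper prices any other trajectory as well, e.g., never moving from $x_0$ trades all switching cost for hitting cost, which makes the tension the learner faces explicit.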
Performance Metrics.
The performance of online learning algorithms in this setting is evaluated by comparing the cost of the algorithm to the cost achievable by the offline optimal, which makes decisions with advance knowledge of the cost functions. The cost incurred by the offline optimal is
$$\mathrm{cost}(OPT) = \min_{x \in \mathcal{X}^T} \sum_{t=1}^{T} f_t(x_t) + \|x_t - x_{t-1}\|. \quad (2)$$

Sometimes, it is desirable to constrain the power of the offline optimal when performing comparisons. One common approach for this, which we adopt in this paper, is to constrain the movement the offline optimal is allowed (Blum et al., 1992; Buchbinder et al., 2012; Blum and Burch, 2000). This is natural for our setting, given the switching costs incurred by the learner. Specifically, define the $L$-constrained offline optimal as the offline optimal solution with switching cost upper bounded by $L$, i.e., the minimizer of the following (offline) problem:
$$OPT(L) = \min_{x \in \mathcal{X}^T} \sum_{t=1}^{T} f_t(x_t) + \|x_t - x_{t-1}\| \quad \text{subject to} \quad \sum_{t=1}^{T} \|x_t - x_{t-1}\| \le L.$$

For large enough $L$, $OPT(L) = OPT$. Specifically, $L = \sum_{t=1}^{T} \|x^*_t - x^*_{t-1}\|$ guarantees that $\mathrm{cost}(OPT(L)) = \mathrm{cost}(OPT)$. Further, for small enough $L$, $OPT(L)$ corresponds to the static optimal, $OPT_{STA}$, i.e., the optimal static choice such that $x_t = x_1$ for all $t$. Since the movement cost of the static optimal (assuming it takes one step from $x_0$) is $\|x_{STA} - x_0\| \le D$, $\mathrm{cost}(OPT(D)) \le \mathrm{cost}(OPT_{STA})$. Therefore $\mathrm{cost}(OPT(L))$ interpolates between the cost of the dynamic offline optimal and the static offline optimal as $L$ varies.

When comparing an online algorithm to the (constrained) offline optimal cost, two different approaches are common. The first, most typically used in the online algorithms community, is to use a multiplicative comparison. This yields the following definition of the competitive ratio.

Definition 1
An online algorithm $ALG$ is $C$-competitive if, for all sequences of cost functions $f_1, \ldots, f_T$, we have $\mathrm{cost}(ALG) \le C \cdot \mathrm{cost}(OPT)$.

As discussed in the introduction, there has been considerable work focused on designing algorithms that have a competitive ratio $C$ that is constant with respect to the dimension of the decision space, $d$. In contrast to the multiplicative comparison with the offline optimal cost in the competitive ratio, an additive comparison is most common in the online learning community. This yields the following definition of dynamic regret.

Definition 2
The $L$-(constrained) dynamic regret of an online algorithm $ALG$ is $\rho_L(T)$ if for all sequences of cost functions $f_1, \ldots, f_T$, we have $\mathrm{cost}(ALG) - \mathrm{cost}(OPT(L)) \le \rho_L(T)$. $ALG$ is said to be no-regret against $OPT(L)$ if $\rho_L(T)$ is sublinear.

As discussed above, $OPT(L)$ interpolates between the offline optimal $OPT$ and the offline static optimal $OPT_{STA}$. There are many algorithms that are known to achieve $O(\sqrt{T})$ static regret, the best possible given general convex cost functions, e.g., online gradient descent (Zinkevich, 2003) and follow the regularized leader (Xiao, 2010). While these results were proven initially for OCO, Andrew et al. (2013) shows that the same bounds hold for SOCO. In contrast, prior work on dynamic regret has focused primarily on OCO, e.g., Herbster and Warmuth (2001); Cesa-Bianchi et al. (2012); Hall and Willett (2013). For SOCO, the only positive results consider algorithms that have access to predictions of future workloads, e.g., Lin et al. (2012); Chen et al. (2015, 2016).

For both competitive ratio and dynamic regret, it is natural to ask what performance guarantees are achievable. The following lower bounds (proven in the appendix) follow from connections to Convex Body Chasing first observed by Antoniadis et al. (2016).
Proposition 3
The competitive ratio of any online algorithm for SOCO is $\Omega(\sqrt{d})$ with $\ell_2$ switching costs and $\Omega(d)$ with $\ell_\infty$ switching costs. The dynamic regret is $\Omega(d)$ in both settings.
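Before moving on, the benchmarks $OPT$ and $OPT_{STA}$ used above can be made concrete by brute force on a tiny hypothetical one-dimensional instance with a discretized action set. The costs below are our own toy example, and a real instance would use a convex solver rather than grid search.

```python
from itertools import product

# Hypothetical 1-D instance: f_t(x) = |x - v_t|, actions on a grid, x0 = 0.
grid = [i / 10 for i in range(11)]   # discretized feasible set {0.0, 0.1, ..., 1.0}
vs = [0.0, 1.0, 1.0]                 # minimizers of the hitting costs

def traj_cost(xs, x0=0.0):
    """Total cost of a trajectory: hitting cost |x - v_t| plus switching cost |x_t - x_{t-1}|."""
    cost, prev = 0.0, x0
    for v, x in zip(vs, xs):
        cost += abs(x - v) + abs(x - prev)
        prev = x
    return cost

opt = min(traj_cost(xs) for xs in product(grid, repeat=len(vs)))   # dynamic offline optimum
opt_sta = min(traj_cost([x] * len(vs)) for x in grid)              # best static point
assert opt <= opt_sta
```

On this instance the dynamic optimum pays a single unit of movement (from 0 to 1) and nothing else, while every static point pays total cost 2, illustrating the two extremes that $OPT(L)$ interpolates between as $L$ varies.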
3. Online Balanced Descent (OBD)
The core of this paper is a new algorithmic framework for online convex optimization that we term Online Balanced Descent (OBD). OBD represents the first algorithmic framework that applies to both competitive analysis and regret analysis in SOCO problems and, as such, parallels recent results that have begun to provide unified algorithmic techniques for competitiveness and regret in other areas. For example, Buchbinder et al. (2012) and Blum and Burch (2000) provided frameworks that bridged competitiveness and regret in the context of MTS problems, which have discrete action spaces, and Blum et al. (2002) did the same for decision making problems on trees and lists.

OBD builds both on ideas from online algorithms, specifically the work of Bansal et al. (2015), and online learning, specifically Online Mirror Descent (OMD) (Nemirovskii et al., 1983; Warmuth and Jagota, 1997; Bubeck et al., 2015). We begin this section with the geometric intuition underlying the design of OBD in Section 3.1. Then, in Section 3.2, we describe the details of the algorithm. Finally, in Section 3.3 we provide an illustration of how the choice of the mirror map impacts the behavior of the algorithm.
The key insight that drives the design of OBD is that it is the geometry of the level sets, not the location of the minimizer, which should guide the choice of $x_t$. This reasoning leads naturally to an algorithm for SOCO that, at each round, projects the previous point $x_{t-1}$ onto a level set of the current cost function $f_t$. However, the question of “which level set?” remains. This choice is where the name Online Balanced Descent comes from: OBD chooses a level set that balances the switching cost and hitting costs, where the notion of balance used is the heart of the algorithm.

To make the intuition described above more concrete, assume for the moment that the switching cost is defined by the $\ell_2$ norm. Suppose the algorithm has decided to project onto the $l$-level set of $f_t$ (we show how to pick $l$ in the next section). Then the action of the algorithm is the solution of the following optimization problem:
$$\text{minimize } \|x - x_{t-1}\|_2 \quad \text{subject to } f_t(x) \le l.$$
Now, let $\eta_t$ be the optimal dual variable for the inequality constraint. By the first order optimality condition, $x_t$ needs to satisfy
$$x_t = x_{t-1} - \eta_t \nabla f_t(x_t).$$

Figure 1: An illustration of the difference between OBD and OMD assuming the mirror map $\Phi(x) = \frac{1}{2}\|x\|_2^2$. (a) A step taken by OMD; contour lines represent the sub-level sets of $f_{t-1}$. (b) A step taken by OBD; contour lines represent the sub-level sets of $f_t$. While OMD steps in a direction normal to the contour line of $f_{t-1}(\cdot)$ at $x_{t-1}$, OBD steps in a direction normal to the contour line of $f_t(\cdot)$ at $x_t$.

This resembles a step of Online Gradient Descent (OGD), except that the update direction is the gradient $\nabla f_t(x_t)$ instead of $\nabla f_{t-1}(x_{t-1})$, and the step size $\eta_t$ is allowed to vary. Hence in this setting OBD can be seen as a “one step ahead” version of OGD with time-varying step sizes.

For more general switching costs a similar geometric intuition can be obtained using a mirror map $\Phi$ with respect to the norm $\|\cdot\|$. Here, $x_t$ is the solution of the following optimization in dual space where, given a convex function $\Phi$, $D_\Phi(x, y)$ is the Bregman divergence between $x$ and $y$, i.e., $D_\Phi(x, y) = \Phi(x) - \Phi(y) - \nabla\Phi(y)^T (x - y)$:
$$\text{minimize } D_\Phi(x, x_{t-1}) \quad \text{subject to } f_t(x) \le l.$$
As before, let $\eta_t$ be the optimal dual variable for the inequality constraint. The first order optimality condition implies that $x_t$ must satisfy
$$\nabla\Phi(x_t) = \nabla\Phi(x_{t-1}) - \eta_t \nabla f_t(x_t). \quad (3)$$
The form of (3) is similar to a “one step ahead” version of OMD with time-varying $\eta_t$, i.e., the update direction in the dual space is the gradient $\nabla f_t(x_t)$ instead of the gradient $\nabla f_{t-1}(x_{t-1})$. The implicit form of the update has been widely used in online learning, e.g., Kivinen and Warmuth (1997); Kulis and Bartlett (2010).

Figure 1 illustrates the difference between OBD and OMD when $\Phi(x) = \frac{1}{2}\|x\|_2^2$: OBD is normal to the destination whereas OMD is normal to the starting point. Intuitively, this is why the guarantees we obtain for OBD are stronger than what previous descent-based approaches have obtained in this setting: it is better to move in the direction determined by the level set where you land than the direction determined by the level set where you start.
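For intuition on the Bregman divergence used above, a small self-check with two standard mirror maps (examples of our own choosing, not from the paper): $\Phi(x) = \frac{1}{2}\|x\|_2^2$ gives $D_\Phi(x, y) = \frac{1}{2}\|x - y\|_2^2$, while negative entropy on the simplex recovers the KL divergence.

```python
import math

def bregman(phi, grad_phi, x, y):
    """D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>."""
    return phi(x) - phi(y) - sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))

# Phi(x) = 1/2 ||x||_2^2  =>  D_Phi(x, y) = 1/2 ||x - y||_2^2.
sq = lambda x: 0.5 * sum(v * v for v in x)
sq_grad = lambda x: x
assert abs(bregman(sq, sq_grad, (1.0, 2.0), (0.0, 0.0)) - 2.5) < 1e-9

# Negative entropy on the simplex  =>  D_Phi(p, q) = KL(p || q).
ent = lambda p: sum(v * math.log(v) for v in p)
ent_grad = lambda p: [math.log(v) + 1 for v in p]
p, q = (0.3, 0.7), (0.5, 0.5)
kl = sum(a * math.log(a / b) for a, b in zip(p, q))
assert abs(bregman(ent, ent_grad, p, q) - kl) < 1e-9
```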
The previous section gives intuition about one key aspect of the algorithm, the projection onto a level set. But, in the discussion above we assume we are projecting onto a specific $l$-sublevel set. The core of OBD is that this sublevel set is determined endogenously in order to “balance” the switching and hitting costs, as opposed to a fixed exogenous schedule of step sizes as is typical in many online descent algorithms. Informally, the operation of OBD is summarized in the meta algorithm in Algorithm 1, which uses the operator $\Pi^\Phi_K : \mathbb{R}^d \to K$ to denote the Bregman projection of $x$ onto a convex set $K$, i.e., $\Pi^\Phi_K(x) = \mathrm{argmin}_{y \in K} D_\Phi(y, x)$, where $\Phi$ is $m$-strongly convex and $M$-Lipschitz smooth in $\|\cdot\|$, i.e., $\frac{m}{2}\|x - y\|^2 \le D_\Phi(x, y) \le \frac{M}{2}\|x - y\|^2$.
Algorithm 1 Online Balanced Descent (OBD), Meta Algorithm
Require: starting point $x_0$, mirror map $\Phi$.
1: for $t = 1, \ldots, T$ do
2:     Choose a sublevel set $K_l = \{x \mid f_t(x) \le l\}$ to “balance” the switching and hitting costs.
3:     Set $x_t = \Pi^\Phi_{K_l}(x_{t-1})$.
4: end for

We term Algorithm 1 a meta algorithm because the general framework given in Algorithm 1 can be instantiated with different forms of “balance” in order to perform well for different metrics. More specifically, the notion of “balance” in Step 2 that is appropriate varies depending on whether the goal is to perform well for competitive ratio or for regret.

Our results in this paper highlight two different approaches for defining balance in OBD, based on balancing the switching cost with the hitting cost in either the primal or dual space. We balance costs in the primal space to yield a constant, dimension-free competitive algorithm for locally polyhedral cost functions (Section 4), and balance in the dual space to yield a no-regret algorithm (Section 5). We summarize these two approaches in the following and then give more complete descriptions in the corresponding technical sections.

• Primal Online Balanced Descent. The algorithm we consider in Section 4 instantiates Algorithm 1 by choosing $l$ such that $x(l) = \Pi^\Phi_{K_l}(x_{t-1})$ achieves balance between the switching cost and the hitting cost in the primal space. Specifically, for some fixed $\beta > 0$, choose $l$ such that either $x(l) = \mathrm{argmin}_x f_t(x)$ and $\|x(l) - x_{t-1}\| < \beta l$, or the following is satisfied:
$$\|x(l) - x_{t-1}\| = \beta l. \quad (4)$$

• Dual Online Balanced Descent. The algorithm we consider in Section 5 instantiates Algorithm 1 by balancing the switching cost with the size of the gradient in the dual space. Specifically, for some fixed $\eta$, we choose $l$ such that
$$\|\nabla\Phi(x(l)) - \nabla\Phi(x_{t-1})\|_* = \eta \|\nabla f_t(x(l))\|_*. \quad (5)$$

The final piece of the algorithm is computational.
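Concretely, the level $l$ in the primal balance condition (4) can be found by bisection. The sketch below is a hypothetical special case of our own: the Euclidean mirror map with tracking costs $f_t(x) = \|x - v_t\|_2$, for which $K_l$ is the ball of radius $l$ around $v_t$ and the Bregman projection reduces to Euclidean projection onto that ball.

```python
import math

def project_ball(p, center, r):
    """Euclidean projection of p onto the ball {x : ||x - center||_2 <= r}."""
    d = math.dist(p, center)
    if d <= r:
        return p
    return tuple(c + r * (pi - c) / d for pi, c in zip(p, center))

def primal_obd_step(x_prev, v, beta, tol=1e-10):
    """One Primal OBD step for f_t(x) = ||x - v||_2: bisect on l until
    ||x(l) - x_prev|| = beta * l, i.e. condition (4)."""
    D = math.dist(x_prev, v)
    if D == 0.0:
        return v                      # already at the minimizer of f_t
    lo, hi = 0.0, D                   # at l = D the projection is x_prev itself
    while hi - lo > tol:
        l = (lo + hi) / 2
        x_l = project_ball(x_prev, v, l)
        if math.dist(x_l, x_prev) > beta * l:
            lo = l                    # switching cost still too large: raise the level
        else:
            hi = l
    return project_ball(x_prev, v, (lo + hi) / 2)

# With beta = 1 the balance point solves D - l = beta * l, so l = D / 2.
x_t = primal_obd_step((1.0, 0.0), (0.0, 0.0), beta=1.0)   # approximately (0.5, 0.0)
```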
Note that the algorithm is memoryless, i.e., it does not use any information about previous cost functions. Thus, the only question about efficiency is whether the appropriate $l$ can be found efficiently. The following lemmas verify that, indeed, it is possible to compute $l$, and thus implement OBD, efficiently.

Lemma 4 The function $g(l) = \|x(l) - x_{t-1}\|$ is continuous in $l$.

Lemma 5 Consider $\Phi$ and $f_t$ that are continuously differentiable on $\mathcal{X}$. The function $h(l) = \|\nabla\Phi(x_{t-1}) - \nabla\Phi(x(l))\|_* / \|\nabla f_t(x(l))\|_*$ is continuous in $l$.

The continuity of $g(l)$ and $h(l)$ in $l$ is enough to guarantee efficient implementation of Primal and Dual OBD because it shows that an $l$ satisfying the balance conditions in the algorithms exists and, further, can be found to arbitrary precision via bisection. Proofs are included in the appendix.
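The dual balance condition (5) admits the same bisection. Below is a sketch for another hypothetical special case of our own: $\Phi(x) = \frac{1}{2}\|x\|_2^2$ (so $\nabla\Phi(x) = x$ and the dual norm is again $\ell_2$) with quadratic cost $f_t(x) = \frac{1}{2}\|x - v\|_2^2$, for which $K_l$ is the ball of radius $\sqrt{2l}$ around $v$.

```python
import math

def dual_obd_step(x_prev, v, eta, tol=1e-10):
    """One Dual OBD step for Phi(x) = 1/2 ||x||_2^2 and f(x) = 1/2 ||x - v||_2^2.
    Bisect on l until ||x(l) - x_prev|| = eta * ||grad f(x(l))||, i.e. condition (5)."""
    D = math.dist(x_prev, v)
    if D == 0.0:
        return v
    lo, hi = 0.0, 0.5 * D * D          # at l = f(x_prev) there is no movement at all
    while hi - lo > tol:
        l = (lo + hi) / 2
        r = math.sqrt(2 * l)           # K_l is the ball of radius sqrt(2l) around v
        move, grad_norm = D - r, r     # ||x(l) - x_prev|| and ||grad f(x(l))|| = r
        if move > eta * grad_norm:
            lo = l                     # moved too far relative to the gradient
        else:
            hi = l
    r = math.sqrt(lo + hi)             # radius sqrt(2 * l) at the final level (lo + hi) / 2
    return tuple(vi + r * (pi - vi) / D for pi, vi in zip(x_prev, v))

# Closed form in this special case: x_t = v + (x_prev - v) / (1 + eta).
x_t = dual_obd_step((2.0, 0.0), (0.0, 0.0), eta=1.0)   # approximately (1.0, 0.0)
```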
An important part of the design of an OBD algorithm is the choice of the mirror map. Different choices of a mirror map $\Phi$ can lead to very different behavior by the resulting algorithms. To highlight this, and give intuition for the impact of the choice, we describe three examples of mirror maps below. These examples focus on mirror maps that are commonly used for OMD, and they highlight interesting connections between the OBD framework and classical online optimization algorithms like OGD (Zinkevich, 2003) and Multiplicative Weights (Arora et al., 2012).

Euclidean norm squared:
Consider $\Phi(x) = \frac{1}{2}\|x\|_2^2$, which is both 1-strongly convex and 1-Lipschitz smooth for the $\ell_2$ norm. Note that $\nabla\Phi(x) = x$. Then, the first order condition (3) is
$$x_t = x_{t-1} - \eta_t \nabla f_t(x_t). \quad (6)$$
Interestingly, this can be interpreted as a “one-step ahead” OGD (illustrated in Figure 1). However, note that this equation should not be interpreted as an update rule since $x_t$ appears on both sides of the equation. In fact, this contrast highlights an important difference between OGD and OBD.

Mahalanobis distance squared: Consider $\Phi(x) = \frac{1}{2}\|x\|_Q^2$ for positive definite $Q$, which is 1-strongly convex and 1-Lipschitz smooth in the Mahalanobis distance $\|\cdot\|_Q$. Note that $\nabla\Phi(x) = Qx$. Then, the first order condition (3) is
$$x_t = x_{t-1} - \eta_t Q^{-1} \nabla f_t(x_t). \quad (7)$$
This is analogous to a “one step ahead” OGD where the underlying metric is a weighted $\ell_2$ metric.

Negative entropy:
If the feasible set is the $\delta$-interior of the simplex, $\mathcal{X} = P_\delta = \{x \mid \sum_{i=1}^n x_i = 1, x_i \ge \delta\}$, and the norm is the $\ell_1$ norm $\|\cdot\|_1$, the mirror map defined by the negative entropy $\Phi(x) = \sum_{i=1}^n x_i \log x_i$ is $\frac{1}{2\ln 2}$-strongly convex (by Pinsker's inequality) and $\frac{1}{\delta \ln 2}$-Lipschitz smooth (by reverse Pinsker's inequality (Sason and Verdú, 2015)). In this case, $\nabla\Phi(x) = \log x + \mathbf{1}_d$, where $\mathbf{1}_d$ represents the all-ones vector in $\mathbb{R}^d$. Then, the first order condition is
$$x_t = x_{t-1} \exp(-\eta_t \nabla f_t(x_t)), \quad (8)$$
where the product and exponential are taken componentwise. This can be viewed as a “one-step ahead” version of the multiplicative weights update. Again, this equation should not be interpreted as an update rule since $x_t$ appears on both sides of the equation.
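Although (6) and (8) are implicit, they can be solved exactly in simple cases; the sketch below uses toy costs of our own choosing. For a quadratic $f(x) = \frac{1}{2}\|x - v\|_2^2$, rearranging (6) gives $x_t = (x_{t-1} + \eta v)/(1 + \eta)$; for a linear cost $f(x) = \langle c, x \rangle$ the gradient is constant, so (8) coincides with the standard multiplicative weights update (here renormalization stands in for the projection back onto the simplex, and the $\delta$-interior constraint is ignored for simplicity).

```python
import math

# Implicit OGD step (6) for f(x) = 1/2 ||x - v||^2, i.e. grad f(x) = x - v:
# x = x_prev - eta * (x - v)  solves to  x = (x_prev + eta * v) / (1 + eta).
def implicit_ogd_step(x_prev, v, eta):
    return tuple((xp + eta * vi) / (1 + eta) for xp, vi in zip(x_prev, v))

x = implicit_ogd_step((1.0, 1.0), (0.0, 0.0), eta=1.0)
assert all(abs(xi - (xp - 1.0 * (xi - 0.0))) < 1e-12      # verify the fixed point
           for xi, xp in zip(x, (1.0, 1.0)))

# Implicit multiplicative weights step (8) for a linear cost f(x) = <c, x>:
# grad f is the constant c, so x_t = x_{t-1} * exp(-eta * c) is exact.
def mw_step(x_prev, c, eta):
    w = [xi * math.exp(-eta * ci) for xi, ci in zip(x_prev, c)]
    s = sum(w)
    return [wi / s for wi in w]        # renormalize onto the simplex

w = [0.5, 0.5]
for _ in range(10):
    w = mw_step(w, c=[1.0, 0.0], eta=0.5)
assert w[1] > 0.99                     # weight concentrates on the cheaper coordinate
```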
4. A Competitive Algorithm
In this section, we use the OBD framework to give the first algorithm with a dimension-free, constant competitive ratio for online convex optimization with switching costs in general Euclidean spaces, under mild assumptions on the structure of the cost functions. Recall that, for the most general case, where no constraints other than convexity are applied to the cost functions, Proposition 3 shows that the competitive ratio of any online algorithm must be $\Omega(\sqrt{d})$ for $\ell_2$ switching costs, i.e., must grow with the dimension $d$ of the decision space. Our goal in this section is to understand when a dimension-free, constant competitive ratio can be obtained. Thus, we are naturally led to restrict the type of cost functions we consider.

Our main result in this section is a new online algorithm whose competitive ratio is constant with respect to dimension when the cost functions are locally polyhedral, a class that includes the form of cost functions used in many applications of online convex optimization, e.g., tracking problems and penalized estimation problems. Roughly speaking, locally polyhedral functions are those that grow at least linearly as one moves away from the minimizer, at least in a small neighborhood.
Definition 6
A function f t with minimizer v t is locally α -polyhedral with respect to the norm (cid:107) · (cid:107) if there exists some α, (cid:15) > , such that for all x ∈ X such that (cid:107) x − v t (cid:107) ≤ (cid:15) , f t ( x ) − f t ( v t ) ≥ α (cid:107) x − v t (cid:107) . Note that all strictly convex functions f t which are locally α -polyhedral automatically satisfy f t ( x ) − f t ( v t ) ≥ α (cid:107) x − v t (cid:107) for all x , not just those x which are (cid:15) close to the minimizer v t .In this setting, local polyhedrality is analogous to strong convexity; instead of requiring that thecost functions grow at least quadratically as one moves away from the minimizer, the definitionrequires that cost functions grow at least linearly. The following examples illustrate the breadthof this class of functions. One important class of examples are functions of the form (cid:107) x − v t (cid:107) a where (cid:107) · (cid:107) a is an arbitrary norm; it follows from the equivalence of norms that such functionsare locally polyhedral with respect to any norm. Intuitively, such functions represent “tracking”problems, where we seek to get close to the point v t . Another important example is the class f ( x t ) = g ( x t ) + h ( x t ) where g is locally polyhedral and h is an arbitrary non-negative convexfunction whose minimizer coincides with that of g ; since f ( x t ) − f ( v t ) ≥ g ( x t ) − g ( v t ) , f isalso locally polyhedral. This lets us handle interesting functions such as f ( x t ) = (cid:107) x t (cid:107) + x (cid:48) t Qx t with Q psd, or even f ( X t ) = 2 (cid:107) X t (cid:107) ∞ − log det ( I + X t ) where the decision variable X t is aPSD matrix. Note that locally polyhedral function have previously been applied in the networkingcommunity, e.g., by Huang and Neely (2011) to study delay-throughput trade-offs for stochasticnetwork optimization and by Lin et al. 
(2012) to design online algorithms for geographical load balancing in data centers.

Let us now informally describe how the Online Balanced Descent framework described in Section 3 can be instantiated to give a competitive online algorithm for locally polyhedral cost functions. Online Balanced Descent is, in some sense, lazy: instead of moving directly towards the minimizer $v_t$, it moves to the closest point which results in a suitably large decrease in the hitting cost. This can be interpreted as projecting onto a sublevel set of the current cost function. The trick is to make sure that not too much switching cost is incurred in the process. This is accomplished by carefully picking the sublevel set so that the hitting costs and switching costs are balanced. A formal description is given in Algorithm 2. By Lemma 4, step 6 can be computed efficiently via bisection on $l$. Note that the memoryless algorithm proposed in Bansal et al. (2015) can be seen as a special case of Algorithm 2 when the decision variables are scalar.

The main result of this section is a characterization of the competitive ratio of Algorithm 2.

Theorem 7
For every $\alpha > 0$, there exists a choice of $\beta$ such that Algorithm 2 has competitive ratio at most $3 + O(1/\alpha)$ when run on locally $\alpha$-polyhedral cost functions with $\ell_2$ switching costs. More generally, let $\|\cdot\|$ be an arbitrary norm. There exists a choice of $\beta$ such that Algorithm 2 has competitive ratio at most $\frac{\max\{k_2, 1\}}{\min\{k_1, 1\}}\left(3 + O(1/\alpha)\right)$ when run on locally $\alpha$-polyhedral cost functions with switching cost $\|\cdot\|$. Here $k_1$ and $k_2$ are constants such that $k_1\|x\|_2 \le \|x\| \le k_2\|x\|_2$.

We note that in the $\ell_2$ setting Theorem 7 has a form which is connected to the best known lower bound on the competitive ratio of memoryless algorithms. In particular, Bansal et al. (2015) use a 1-dimensional example with locally polyhedral cost functions to prove the following bound.
Algorithm 2 (Primal) Online Balanced Descent
1: for $t = 1, \ldots, T$ do
2:   Observe cost function $f_t$; set $v_t = \arg\min_x f_t(x)$.
3:   if $\|x_{t-1} - v_t\| < \beta f_t(v_t)$ then
4:     Set $x_t = v_t$.
5:   else
6:     Let $x(l) = \Pi^\Phi_{K_t^l}(x_{t-1})$; increase $l$ until $\|x(l) - x_{t-1}\| = \beta l$. Here $K_t^l$ is the $l$-sublevel set of $f_t$, i.e., $K_t^l = \{x \mid f_t(x) \le l\}$.
7:     Set $x_t = x(l)$.
8:   end if
9: end for

Proposition 8 No memoryless algorithm can have a competitive ratio less than 3.
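To make the balance step concrete, here is a small sketch (our own illustration, with hypothetical parameter choices, not from the paper) of one step of Algorithm 2 for the locally polyhedral cost $f_t(x) = \alpha\|x - v_t\|_2$. Its $l$-sublevel set is the Euclidean ball of radius $l/\alpha$ around $v_t$, so the projection in step 6 is available in closed form and the balance level $l$ can be found by bisection:

```python
import numpy as np

def primal_obd_step(x_prev, v, alpha, beta, tol=1e-10):
    """One step of (Primal) OBD for f(x) = alpha * ||x - v||_2.

    The l-sublevel set of f is the ball B(v, l / alpha), so the Euclidean
    projection of x_prev onto it is explicit.  We bisect on l to enforce
    the balance condition ||x(l) - x_prev|| = beta * l.
    """
    dist = np.linalg.norm(x_prev - v)
    if dist == 0.0:
        return x_prev, 0.0
    f_prev = alpha * dist      # hitting cost of staying put
    u = (x_prev - v) / dist    # unit direction from v to x_prev

    def x_of(l):
        r = l / alpha          # radius of the l-sublevel set
        return v + min(r, dist) * u

    # g(l) = ||x(l) - x_prev|| - beta*l is decreasing: g(0) > 0, g(f_prev) < 0.
    lo, hi = 0.0, f_prev
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(x_of(mid) - x_prev) - beta * mid > 0:
            lo = mid
        else:
            hi = mid
    l = 0.5 * (lo + hi)
    return x_of(l), l

# Example: bisection recovers the closed-form level l* = alpha*dist/(1+alpha*beta).
alpha, beta = 2.0, 0.75
x_prev, v = np.array([3.0, 4.0]), np.zeros(2)
x_t, l = primal_obd_step(x_prev, v, alpha, beta)
dist = np.linalg.norm(x_prev - v)  # = 5
assert abs(l - alpha * dist / (1 + alpha * beta)) < 1e-6
assert abs(np.linalg.norm(x_t - x_prev) - beta * l) < 1e-6  # balance holds
```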
Beyond the $\ell_2$ setting, the competitive ratio in Theorem 7 is no longer dimension-free. It is interesting to note that, when the switching costs are $\ell_1$ or $\ell_\infty$ and $\alpha$ is fixed, Online Balanced Descent has a competitive ratio that is $O(\sqrt{d})$. In particular, we showed in Section 2 that for general cost functions there is an $\Omega(d)$ lower bound on the competitive ratio for SOCO with $\ell_\infty$ switching costs; hence our $O(\sqrt{d})$ result highlights that local polyhedrality is useful beyond the $\ell_2$ case.

While Theorem 7 suggests that Online Balanced Descent has a constant (dimension-free) competitive ratio only in the $\ell_2$ setting, a more detailed analysis shows that it can be constant-competitive outside of the $\ell_2$ setting as well, though for a more restrictive class of locally polyhedral functions. This is summarized in the following theorem.

Theorem 9
Let $\|\cdot\|$ be any norm such that the corresponding mirror map $\Phi$ has Bregman divergence $D_\Phi$ satisfying $m\|x - y\|^2 \le D_\Phi(x, y) \le M\|x - y\|^2$ for all $x, y \in \mathbb{R}^d$ and some positive constants $m$ and $M$. Let $\kappa = M/m$. For every $\alpha > 2\sqrt{\kappa - 1}$, there exists a choice of $\beta$ so that Algorithm 2 has competitive ratio at most $3 + O(1/\alpha)$ when run on locally $\alpha$-polyhedral cost functions with switching costs given by $\|\cdot\|$.

This theorem highlights that, for any given class of locally polyhedral cost functions, the task of finding a constant-competitive algorithm can be reduced to finding an appropriate $\Phi$. In particular, given a class of polyhedral cost functions with $\alpha > 0$ and norm $\|\cdot\|$, the problem of finding a dimension-free competitive algorithm can be reduced to finding a convex function $\Phi$ that satisfies the differential inequality
$$\|x - y\|^2 \le \Phi(x) - \Phi(y) - \langle\nabla\Phi(y), x - y\rangle \le \frac{\alpha^2 + 48}{48}\|x - y\|^2$$
for all $x, y \in X$.

We present an intuitive overview of our proof techniques here, and defer the details to the appendix. We use a potential function argument to bound the difference in costs paid by our algorithm and the offline optimal. Our potential function tracks the distance between the points $x_t$ picked by our algorithm and the points $x_t^*$ picked by the offline optimal. There are two cases to consider: either the online point or the offline point has smaller hitting cost. The first case is easy to deal with, since our algorithm is designed so that the movement cost is at most a constant $\beta$ times the hitting cost; hence if our online hitting cost is less than the offline algorithm's hitting cost, our total per-step cost will be at most a constant times what the offline algorithm paid. The second case is more difficult. The key step is Lemma 13, where we show that the potential must have decreased if the offline point has smaller hitting cost.
Algorithm 3 (Dual) Online Balanced Descent
1: for $t = 1, \ldots, T$ do
2:   Define $x(l) = \Pi^\Phi_{K^l}(x_{t-1})$; increase $l$ from $l = f_t(v_t)$ until $\|\nabla\Phi(x(l)) - \nabla\Phi(x_{t-1})\|_* = \eta\|\nabla f_t(x(l))\|_*$.
3:   Set $x_t = x(l)$.
4: end for

We use this fact to argue that the total per-step cost we charge Online Balanced Descent, namely the sum of the hitting cost, movement cost, and change in potential, must be non-positive.

The proof of Theorem 9 parallels that of Theorem 7. The key difference is the use of a more general form of Lemma 13, which uses Bregman projection to show that the potential decreases. The Bregman divergence is with respect to the mirror map induced by $\|\cdot\|$.
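As an illustration of the dual balance step (our own sketch, not from the paper, with hypothetical parameters), consider Algorithm 3 with the Euclidean mirror map $\Phi(x) = \frac{1}{2}\|x\|_2^2$, so that $\nabla\Phi$ is the identity and the balance condition becomes $\|x(l) - x_{t-1}\|_2 = \eta\|\nabla f_t(x(l))\|_2$, applied to the quadratic cost $f_t(x) = \frac{1}{2}\|x - v_t\|_2^2$, whose sublevel sets are balls:

```python
import numpy as np

def dual_obd_step(x_prev, v, eta, tol=1e-10):
    """One step of (Dual) OBD with Phi = 0.5*||.||^2 and f(x) = 0.5*||x - v||^2.

    The l-sublevel set of f is the ball B(v, sqrt(2l)); we bisect on l to
    enforce ||x(l) - x_prev|| = eta * ||grad f(x(l))||, with grad f(x) = x - v.
    """
    d = np.linalg.norm(x_prev - v)
    if d == 0.0:
        return x_prev
    u = (x_prev - v) / d

    def x_of(l):
        # Projection of x_prev onto B(v, sqrt(2l)).
        return v + min(np.sqrt(2.0 * l), d) * u

    lo, hi = 0.0, 0.5 * d * d  # at l = f(x_prev), x_prev is left unmoved
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        x = x_of(mid)
        if np.linalg.norm(x - x_prev) - eta * np.linalg.norm(x - v) > 0:
            lo = mid
        else:
            hi = mid
    return x_of(0.5 * (lo + hi))

# For this pair (Phi, f) the balanced point has the closed form
# x_t = v + (x_prev - v)/(1 + eta), i.e., an implicit (proximal) gradient step.
eta = 0.5
x_prev, v = np.array([3.0, 0.0]), np.array([0.0, 0.0])
x_t = dual_obd_step(x_prev, v, eta)
assert np.allclose(x_t, v + (x_prev - v) / (1 + eta), atol=1e-6)
```

The closed form makes visible why the dual balance is natural: it is exactly the fixed point of the update rule $\nabla\Phi(x_t) = \nabla\Phi(x_{t-1}) - \eta\nabla f_t(x_t)$ used in the proof of Theorem 10.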
5. A No-regret Algorithm
While the online algorithms community typically focuses on competitive ratio, regret is typically the focus of the online learning community. The difference in performance metrics leads to differences in the settings considered. In the previous section, we studied locally polyhedral cost functions, while here we focus on cost functions that are continuously differentiable and have a minimizer $v_t$ in the interior of the feasible set $X$.

Interestingly, it has been shown that the change in metric from competitive ratio to regret has a fundamental impact on the type of algorithms that perform well. Concretely, it has been shown that no single algorithm can perform well across (static) regret and competitive ratio (Andrew et al., 2013). Consequently, it is not surprising that a different choice of balance in OBD is needed to obtain no-regret performance guarantees. Specifically, in contrast to the results of the previous section, which focus on a form of balance in the primal setting, in this section we focus on balance in the dual setting, where we compare costs as measured in the dual norm, $\|x\|_* = \max_{\|z\| \le 1}\langle z, x\rangle$. We show that choosing $l$ to balance the switching cost in the dual space against the size of the gradient leads to an online algorithm with small dynamic regret. It is worth emphasizing that, in contrast to the results of the previous section, we balance the switching cost against the marginal hitting cost $\|\nabla f_t(x_t)\|_*$ instead of $f_t(x_t)$. A formal description of the instantiation of OBD for regret is given in Algorithm 3, which can be implemented efficiently via bisection (Lemma 5).

Theorem 10
Consider a $\Phi$ that is an $m$-strongly convex function in $\|\cdot\|$ with $\|\nabla\Phi(x)\|_*$ bounded above by $G$ and $\nabla\Phi(0) = 0$. Then the $L$-constrained dynamic regret of Algorithm 3 is at most $\frac{GL}{\eta} + \frac{T\eta}{2m}$.

While the result above does not depend on knowing the parameters of the instance, if we know the parameters $T$, $D$, and $L$ ahead of time, then we can optimize the balance parameter $\eta$ as follows.

Corollary 11
When $\eta = \sqrt{2GLm/T}$, Algorithm 3 has $L$-constrained dynamic regret at most $\sqrt{2GLT/m}$.

One interesting aspect of this result is that it has a form similar to the dynamic regret bound on OGD in Theorem 2 of Zinkevich (2003). Both are independent of the dimension of the decision
space $d$, assuming the diameter of the space is normalized to 1. The key difference is that the bound in Corollary 11 is independent of the size of the gradients of the cost functions, unlike in the case of OGD. This can be viewed as a significant benefit that results from the fact that OBD steps in a direction normal to where it lands, rather than where it starts.

Finally, note that Theorem 10 and Corollary 11 additionally provide bounds on the static regret of OBD, by setting $L = D$. In this case, Corollary 11 gives a bound of $O(\sqrt{T})$, which matches the lower bound in the setting where there are no switching costs (Hazan et al., 2016).

1. Any convex function can be approximated by a convex function with these properties, e.g., see Nesterov (2005).

References
Jacob Abernethy, Peter L. Bartlett, Niv Buchbinder, and Isabelle Stanton. A regularization approach to metrical task systems. In International Conference on Algorithmic Learning Theory, pages 270–284. Springer, 2010.

R. Agrawal, M. Hegde, and D. Teneketzis. Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic Reports, 29(4):437–459, 1990.

Lachlan Andrew, Siddharth Barman, Katrina Ligett, Minghong Lin, Adam Meyerson, Alan Roytman, and Adam Wierman. A tale of two metrics: Simultaneous bounds on competitiveness and regret. In Conference on Learning Theory, pages 741–763, 2013.

Antonios Antoniadis, Neal Barcelo, Michael Nugent, Kirk Pruhs, Kevin Schewior, and Michele Scquizzato. Chasing Convex Bodies and Functions, pages 68–81. 2016.

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Masoud Badiei, Na Li, and Adam Wierman. Online convex optimization with ramp constraints. In IEEE Conference on Decision and Control, pages 6730–6736, 2015.

Nikhil Bansal, Anupam Gupta, Ravishankar Krishnaswamy, Kirk Pruhs, Kevin Schewior, and Clifford Stein. A 2-competitive algorithm for online convex optimization with switching costs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 96–109, 2015.

Avrim Blum and Carl Burch. On-line learning and the metrical task system problem. Machine Learning, 39(1):35–58, April 2000.

Avrim Blum, Howard Karloff, Yuval Rabani, and Michael Saks. A decomposition theorem and bounds for randomized server problems. In Foundations of Computer Science, pages 197–207, 1992.

Avrim Blum, Shuchi Chawla, and Adam Kalai. Static optimality and dynamic search-optimality in lists and trees. In Proc. of ACM-SIAM Symposium on Discrete Algorithms, pages 1–8, 2002.

Allan Borodin and Ran El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 2005.

Allan Borodin, Nathan Linial, and Michael E. Saks. An optimal on-line algorithm for metrical task system. J. ACM, 39(4):745–763, October 1992.

S. Bubeck, M. B. Cohen, J. R. Lee, Y. Tat Lee, and A. Madry. k-server via multiscale entropic regularization. ArXiv e-prints, 2017.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Niv Buchbinder, Shahar Chen, Joseph Seffi Naor, and Ohad Shamir. Unified algorithms for online learning and competitive analysis. In Conference on Learning Theory, 2012.

Niv Buchbinder, Shahar Chen, and Joseph Seffi Naor. Competitive analysis via regularization. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 436–444. Society for Industrial and Applied Mathematics, 2014.

Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. A new look at shifting regret. CoRR, abs/1202.3323, 2012.

Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan L. H. Andrew. Online convex optimization using predictions. In ACM SIGMETRICS Performance Evaluation Review, volume 43, pages 191–204. ACM, 2015.

Niangjun Chen, Joshua Comden, Zhenhua Liu, Anshul Gandhi, and Adam Wierman. Using predictions in online optimization: Looking forward with an eye on the past. SIGMETRICS Perform. Eval. Rev., 44(1):193–206, June 2016.

Joel Friedman and Nathan Linial. On convex body chasing. Discrete & Computational Geometry, 9(3):293–321, March 1993.

Gautam Goel, Niangjun Chen, and Adam Wierman. Thinking fast and slow: Optimization decomposition across timescales. arXiv preprint arXiv:1704.07785, 2017.

Sudipto Guha and Kamesh Munagala. Multi-armed bandits with metric switching costs. In International Colloquium on Automata, Languages, and Programming, pages 496–507. Springer, 2009.

Eric C. Hall and Rebecca M. Willett. Dynamical models and tracking regret in online convex programming. arXiv preprint arXiv:1301.1254, 2013.

Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Mark Herbster and Manfred K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.

Longbo Huang and Michael J. Neely. Delay reduction via Lagrange multipliers in stochastic network optimization. IEEE Transactions on Automatic Control, 56(4):842–857, 2011.

Vinay Joseph and Gustavo de Veciana. Jointly optimizing multi-user rate adaptation for video transport over wireless systems: Mean-fairness-variability tradeoffs. In IEEE INFOCOM, pages 567–575, 2012.

Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2009.

Seung-Jun Kim and Georgios B. Giannakis. Real-time electricity pricing for demand response using online convex optimization. In IEEE Innovative Smart Grid Technologies, pages 1–5, 2014.

Taehwan Kim, Yisong Yue, Sarah Taylor, and Iain Matthews. A decision tree framework for spatiotemporal sequence prediction. In ACM International Conference on Knowledge Discovery and Data Mining, pages 577–586, 2015.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

Tomer Koren, Roi Livni, and Yishay Mansour. Multi-armed bandits with metric movement costs. In Advances in Neural Information Processing Systems, pages 4122–4131, 2017.

Brian Kulis and Peter L. Bartlett. Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 575–582, 2010.

Yingying Li, Guannan Qu, and Na Li. Online optimization with predictions and switching costs: Fast algorithms and the fundamental limit. arXiv preprint arXiv:1801.07780, 2018.

Minghong Lin, Adam Wierman, Lachlan L. H. Andrew, and Eno Thereska. Dynamic right-sizing for power-proportional data centers. In IEEE INFOCOM, pages 1098–1106, 2011.

Minghong Lin, Zhenhua Liu, Adam Wierman, and Lachlan L. H. Andrew. Online algorithms for geographical load balancing. In IEEE Green Computing Conference, pages 1–10, 2012.

Tan Lu, Minghua Chen, and Lachlan L. H. Andrew. Simple and effective dynamic provisioning for power-proportional data centers. IEEE Transactions on Parallel and Distributed Systems, 24(6):1161–1171, 2013.

Arkadii Nemirovskii, David Borisovich Yudin, and Edgar Ronald Dawson. Problem Complexity and Method Efficiency in Optimization. 1983.

Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, May 2005.

Kirk Pruhs. Errata, 2018. URL http://people.cs.pitt.edu/~kirk/Errata.html.

Igal Sason and Sergio Verdú. Upper bounds on the relative entropy and Rényi divergence as a function of total variation distance for finite alphabets. In IEEE Information Theory Workshop, pages 214–218, 2015.

Hao Wang, Jianwei Huang, Xiaojun Lin, and Hamed Mohsenian-Rad. Exploring smart grid and data center interactions for electric power load balancing. ACM SIGMETRICS Performance Evaluation Review, 41(3):89–94, 2014.

Manfred K. Warmuth and Arun K. Jagota. Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics, 1997.

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

Francesco Zanini, David Atienza, Luca Benini, and Giovanni De Micheli. Multicore thermal management with model predictive control. In IEEE European Conference on Circuit Theory and Design, pages 711–714, 2009.

Francesco Zanini, David Atienza, Giovanni De Micheli, and Stephen P. Boyd. Online convex optimization-based algorithm for thermal management of MPSoCs. In Great Lakes Symposium on VLSI, pages 203–208, 2010.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936, 2003.
Appendix A. Lower bounds on competitive ratio and regret
To provide insight into the difficulty of SOCO, here we give an explicit example that yields lower bounds on the competitive ratio and regret achievable in SOCO. Our example is based on the lower bound example for Convex Body Chasing in Friedman and Linial (1993); it was observed in Antoniadis et al. (2016) that Convex Body Chasing is a special case of SOCO. Starting from the origin, in each round the adversary examines the $i$-th coordinate of the algorithm's current action. If this coordinate is negative, the adversary picks the cost function which is the indicator of the hyperplane $x_i = 1$; similarly, if the coordinate is non-negative, the adversary picks the indicator corresponding to the hyperplane $x_i = -1$. Hence our online algorithm is forced to move by at least 1 unit each step, paying a total of at least $d$ cost. The offline optimal, on the other hand, simply moves to the intersection of all these hyperplanes in round 1, paying switching cost $\|(1, 1, \ldots, 1)\|$, which is $\sqrt{d}$ when the underlying norm is $\ell_2$ and $1$ when the norm is $\ell_\infty$. Hence the competitive ratio is at least $\sqrt{d}$ in the $\ell_2$ setting and $d$ in the $\ell_\infty$ setting. Further, the regret is at least $\Omega(d)$ for both the $\ell_2$ and $\ell_\infty$ settings. Note that while it may appear at first glance that our example requires an adaptive adversary, the same example applies for an oblivious adversary because the online algorithm may be assumed to be deterministic.

Appendix B. Proof of Lemma 4
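Before turning to Lemma 4, note that the Appendix A construction above is easy to simulate; the sketch below (our own illustration, playing a hypothetical greedy online strategy against the adversary with $\ell_2$ switching costs) recovers the $\sqrt{d}$ gap:

```python
import math

def lower_bound_ratio(d):
    """Simulate the hyperplane-chasing adversary of Appendix A in dimension d.

    The online player greedily moves only the examined coordinate onto the
    required hyperplane; hitting costs are 0 on the hyperplane, so all cost
    incurred is l2 movement cost.
    """
    x = [0.0] * d
    online_cost = 0.0
    target = []
    for i in range(d):                  # adversary examines each coordinate once
        t = -1.0 if x[i] >= 0 else 1.0  # indicator of the hyperplane x_i = t
        online_cost += abs(t - x[i])    # greedy move changes coordinate i only
        x[i] = t
        target.append(t)
    # Offline moves once, in round 1, to the intersection of all hyperplanes.
    offline_cost = math.sqrt(sum(t * t for t in target))
    return online_cost / offline_cost

for d in (4, 16, 100):
    assert math.isclose(lower_bound_ratio(d), math.sqrt(d))
```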
Recall the following Pythagorean inequality satisfied by projection onto convex sets using the Bregman divergence as the measure of distance, given by Bubeck et al. (2015, Lemma 4.1).
Proposition 12
For $x_t$, $x_{t-1}$, $K^l$ in step 3 of Algorithm 1, for any $y \in K^l$, we have
$$D_\Phi(x_t, x_{t-1}) + D_\Phi(y, x_t) \le D_\Phi(y, x_{t-1}).$$
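For the Euclidean mirror map $\Phi = \frac{1}{2}\|\cdot\|_2^2$, we have $D_\Phi(x, y) = \frac{1}{2}\|x - y\|_2^2$ and Proposition 12 is the familiar obtuse-angle property of projection onto a convex set. The sketch below (our own numerical check, using projection onto a ball as the convex set) verifies it:

```python
import numpy as np

rng = np.random.default_rng(1)

def bregman(x, y):
    # D_Phi(x, y) for Phi = 0.5 * ||.||_2^2
    return 0.5 * np.sum((x - y) ** 2)

center, radius = np.zeros(3), 1.0
for _ in range(1000):
    x_prev = center + rng.normal(size=3) * 5.0
    gap = np.linalg.norm(x_prev - center)
    if gap <= radius:
        continue  # x_prev already lies in K; the projection is x_prev itself
    x_t = center + radius * (x_prev - center) / gap  # projection onto the ball K
    v = rng.normal(size=3)
    y = center + radius * rng.uniform() * v / np.linalg.norm(v)  # random y in K
    assert bregman(x_t, x_prev) + bregman(y, x_t) <= bregman(y, x_prev) + 1e-9
print("Pythagorean inequality holds on all samples")
```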
2. Bansal et al. (2015) prove that randomization provides no benefit for SOCO.
In particular, note that when $\Phi$ is the squared 2-norm $\|\cdot\|_2^2$, the points $x_t$, $x_{t-1}$, and $y$ form an obtuse triangle.

Let $h(l) = D_\Phi(x(l), x_{t-1})$. We first show that $h(l)$ is convex, hence continuous, in $l$, and from there we show that $g(l)$ is continuous in $l$. To begin, we can equivalently write the variational form
$$h(l) = \min_x \max_{\lambda \ge 0} D_\Phi(x, x_{t-1}) + \lambda(f(x) - l).$$
Let $H(x, \lambda, l) = D_\Phi(x, x_{t-1}) + \lambda(f(x) - l)$. $H$ is affine, hence convex, in $l$ for any given $x$ and $\lambda$. Since maximization preserves convexity, and because $H$ is jointly convex in $(x, l)$, minimization over $x$ also preserves convexity in $l$; hence $h(l) = \min_x \max_{\lambda \ge 0} H(x, \lambda, l)$ is convex, hence continuous, in $l$. To show $g(l)$ is continuous at $l$, given any $\epsilon > 0$, let $\delta > 0$ be such that $|h(l) - h(l + \delta)| < \epsilon^2 m/2$ (this can be done as $h$ is continuous in $l$); then
$$|g(l) - g(l+\delta)| \le \|x(l) - x(l+\delta)\| \le \sqrt{\frac{2}{m} D_\Phi(x(l+\delta), x(l))} \le \sqrt{\frac{2}{m}\left(D_\Phi(x(l), x_{t-1}) - D_\Phi(x(l+\delta), x_{t-1})\right)} = \sqrt{\frac{2}{m}\,|h(l) - h(l+\delta)|} < \epsilon,$$
where the first inequality is due to the triangle inequality, the second inequality is due to the definition of the Bregman divergence and $\Phi$ being $m$-strongly convex in $\|\cdot\|$, and the third inequality is due to the fact that $x(l) \in K^{l+\delta}$ and Proposition 12. Therefore $g$ is a continuous function in $l$.

Appendix C. Proof of Lemma 5
First, we will show that $h_1(l) = \|\nabla\Phi(x_{t-1}) - \nabla\Phi(x(l))\|_*$ is continuous in $l$. For any $l_1, l_2$, $|h_1(l_1) - h_1(l_2)| \le \|\nabla\Phi(x(l_1)) - \nabla\Phi(x(l_2))\|_*$ by the triangle inequality. As $\Phi$ is continuously differentiable, we only need to show that for any $l$, given any $\epsilon > 0$, we can find $\delta > 0$ such that for all $l'$ with $|l - l'| < \delta$, $\|\nabla\Phi(x(l)) - \nabla\Phi(x(l'))\|_* < \epsilon$. By the Lipschitz smoothness of $\Phi$, we have
$$D_\Phi(x, y) \ge \frac{1}{2M}\|\nabla\Phi(x) - \nabla\Phi(y)\|_*^2. \quad (9)$$
To show (9), note that by (Kakade et al., 2009, Theorem 1), $\Phi$ is $M$-Lipschitz smooth w.r.t. $\|\cdot\|$ if and only if $\Phi^*$ is $\frac{1}{M}$-strongly convex w.r.t. $\|\cdot\|_*$. Let $u = \nabla\Phi(x)$, $v = \nabla\Phi(y)$; then by strong convexity,
$$\Phi^*(v) - \Phi^*(u) - \langle\nabla\Phi^*(u), v - u\rangle \ge \frac{1}{2M}\|v - u\|_*^2. \quad (10)$$
By the Fenchel inequality and the definition of $u, v$, we have $\Phi^*(v) = \langle v, y\rangle - \Phi(y)$ and $\Phi^*(u) = \langle u, x\rangle - \Phi(x)$. Furthermore, $\nabla\Phi^*(u) = \nabla\Phi^*(\nabla\Phi(x)) = x$; substituting these into (10) gives (9). Using (9) and Proposition 12, we can upper bound $\|\nabla\Phi(x(l)) - \nabla\Phi(x(l'))\|_*$ by
$$\|\nabla\Phi(x(l)) - \nabla\Phi(x(l'))\|_* \le \sqrt{2M\, D_\Phi(x(l), x(l'))} \le \sqrt{2M}\sqrt{D_\Phi(x(l), x_{t-1}) - D_\Phi(x(l'), x_{t-1})}.$$
However, since we have already shown that $D_\Phi(x(l), x_{t-1})$ is continuous in $l$ in the proof of Lemma 4, we can choose $l'$ sufficiently close to $l$ to make the right hand side smaller than $\epsilon$. Therefore $h_1(l) = \|\nabla\Phi(x_{t-1}) - \nabla\Phi(x(l))\|_*$ is continuous in $l$. Second, since $f_t$ is also continuously differentiable, and by the continuity of $x(l)$ in $\|\cdot\|$, $h_2(l) = \|\nabla f_t(x(l))\|_*$ is also continuous in $l$. Thus, the ratio $h_1(l)/h_2(l)$ is continuous in $l$.

Appendix D. Proof of Theorem 7
We first consider the case when the switching cost is the $\ell_2$ norm. We define $H_t = f_t(x_t)$, $M_t = \|x_t - x_{t-1}\|$, and define $H_t^*$ and $M_t^*$ analogously. We use the potential function $\phi(x_t, x_t^*) = C\|x_t - x_t^*\|$; $C$ will end up being the competitive ratio of Algorithm 2. To show Algorithm 2 is $C$-competitive, we need to show that for all $t$,
$$H_t + M_t + \phi(x_t, x_t^*) - \phi(x_{t-1}, x_{t-1}^*) \le C(H_t^* + M_t^*), \quad (11)$$
then summing up the inequality over $t$ implies the result. To begin, applying the triangle inequality, we see
$$\phi(x_t, x_t^*) - \phi(x_{t-1}, x_{t-1}^*) \le C\left(\|x_t^* - x_t\| - \|x_t^* - x_{t-1}\|\right) + CM_t^*. \quad (12)$$
Combining (12) and (11), we see that, to show Algorithm 2 is $C$-competitive, it suffices to show
$$H_t + M_t + C\left(\|x_t - x_t^*\| - \|x_t^* - x_{t-1}\|\right) \le CH_t^*. \quad (13)$$
Notice that we always have $M_t \le \beta H_t$. We divide our analysis into the following two cases:

1. $H_t \le H_t^*$: Since $M_t \le \beta H_t$, by the triangle inequality we have
$$H_t + M_t + C\left(\|x_t^* - x_t\| - \|x_t^* - x_{t-1}\|\right) \le H_t + (1 + C)M_t \le (1 + \beta(1 + C))H_t \le CH_t \le CH_t^*,$$
where in the penultimate step we assumed that $\beta$ was picked so that $\beta(1 + C) \le C - 1$.

2. $H_t > H_t^*$: In this case, it must be true that $M_t = \beta H_t$, since $H_t^*$ is strictly smaller than $H_t$, implying that our algorithm did not reach the minimizer $v_t$. We use the following lemma, proved in the appendix, which shows that the change in potential must actually be negative in this case.

Lemma 13
For Algorithm 2, when $H_t > H_t^*$ and $f_t(x) \ge \alpha\|x - v_t\|$, we have
$$\|x_t - x_t^*\| - \|x_t^* - x_{t-1}\| \le -\gamma\|x_t - x_{t-1}\|, \quad \text{where } \gamma = \sqrt{1 + \left(\frac{2}{\alpha\beta}\right)^2} - \frac{2}{\alpha\beta}.$$

Using Lemma 13, we have
$$H_t + M_t + C\left(\|x_t^* - x_t\| - \|x_t^* - x_{t-1}\|\right) \le H_t + M_t - C\gamma M_t = (1 + \beta(1 - C\gamma))H_t.$$
To show (13), it suffices to pick $\beta$ such that $\beta(1 - C\gamma) \le -1$.
Combining the inequalities obtained in cases 1 and 2, we conclude that for any $\beta \in (0, 1)$, Algorithm 2 is $C$-competitive, where
$$C = \max\left(\frac{1 + \beta}{1 - \beta},\ \frac{1 + \beta}{\beta\gamma}\right).$$
Note that the first term is increasing in $\beta$ and tends to $+\infty$ as $\beta \to 1$, and the second is decreasing in $\beta$ and tends to $+\infty$ as $\beta \to 0^+$; hence to minimize $C$ we should pick $\beta$ so that both terms are equal. We easily obtain $\beta = \frac{\alpha + 4}{2(\alpha + 2)}$, and $C = 3 + 8/\alpha$.

This result easily extends to the case when the switching cost is an arbitrary norm $\|\cdot\|_a$. Since in finite dimensions all norms are equivalent, we know that there exist some $k_1, k_2 > 0$ such that $k_1\|x\|_2 \le \|x\|_a \le k_2\|x\|_2$. We immediately obtain
$$\sum_{t=1}^T f_t(x_t) + \|x_t - x_{t-1}\|_2 \ge \min\{1, 1/k_2\}\left(\sum_{t=1}^T f_t(x_t) + \|x_t - x_{t-1}\|_a\right) \quad (14)$$
$$\sum_{t=1}^T f_t(x_t^*) + \|x_t^* - x_{t-1}^*\|_2 \le \max\{1, 1/k_1\}\left(\sum_{t=1}^T f_t(x_t^*) + \|x_t^* - x_{t-1}^*\|_a\right) \quad (15)$$
Combining (14) and (15) with the previous observation that the online cost is at most $C$ times the offline cost finishes the proof.

Appendix E. Proof of Lemma 13
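Before the proof of Lemma 13, the constants just derived in Appendix D can be cross-checked numerically. The sketch below (our own check, not part of the proof) confirms that, in the $\ell_2$ case ($\kappa = 1$), the choice $\beta = \frac{\alpha+4}{2(\alpha+2)}$ equalizes the two terms of $C$ and yields $C = 3 + 8/\alpha$:

```python
import math

def gamma(alpha, beta):
    # gamma from Lemma 13 (the l2 case, kappa = 1)
    s = 2.0 / (alpha * beta)
    return math.sqrt(1.0 + s * s) - s

for alpha in (0.5, 1.0, 2.0, 8.0):
    beta = (alpha + 4.0) / (2.0 * (alpha + 2.0))
    c1 = (1.0 + beta) / (1.0 - beta)                  # constraint from case 1
    c2 = (1.0 + beta) / (beta * gamma(alpha, beta))   # constraint from case 2
    assert math.isclose(c1, c2, rel_tol=1e-9)
    assert math.isclose(c1, 3.0 + 8.0 / alpha, rel_tol=1e-9)
```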
In this section, we actually prove a more general statement, from which Lemma 13 follows. This more general statement is not needed to prove Theorem 7, but is needed in our proof of Theorem 9.
Lemma 14
Let $\|\cdot\|$ be any norm such that the corresponding mirror map $\Phi$ has Bregman divergence satisfying $m\|x - y\|^2 \le D_\Phi(x, y) \le M\|x - y\|^2$. Let $\kappa = M/m$. For Algorithm 2, when $H_t > H_t^*$ and $f_t(x) \ge \alpha\|x - v_t\|$, we have
$$\|x_t - x_t^*\| - \|x_t^* - x_{t-1}\| \le -\gamma\|x_t - x_{t-1}\|, \quad \text{where } \gamma = \frac{1}{\sqrt{\kappa}}\sqrt{1 + \left(\frac{2}{\alpha\beta}\right)^2} - \frac{2}{\alpha\beta}.$$

Before turning to the proof, we note that $\kappa = 1$ when the norm is $\ell_2$, recovering Lemma 13 as used in the proof of Theorem 7.

Proof of Lemma 14.
By the triangle inequality, $\|x_t - x_t^*\| \le \|x_t - v_t\| + \|x_t^* - v_t\|$. To upper bound $\|x_t - x_t^*\|$, we separately bound each term on the right hand side. By the assumption that $f_t$ is polyhedral, $\|x_t - v_t\| \le \frac{1}{\alpha}f_t(x_t) = \frac{1}{\alpha}H_t$. Also, when $H_t^* < H_t$, we must have $M_t = \beta H_t$, hence $\|x_t - v_t\| \le \frac{1}{\alpha\beta}M_t$. Using a similar argument together with the fact that $H_t^* < H_t$, we have $\|x_t^* - v_t\| \le \frac{1}{\alpha\beta}M_t$. Hence
$$\|x_t - x_t^*\| \le \|x_t - v_t\| + \|x_t^* - v_t\| \le \frac{2}{\alpha\beta}M_t. \quad (16)$$
Since $x_t$ is the projection of $x_{t-1}$ onto the sublevel set $\{x \mid f_t(x) \le H_t\}$, this set must contain $x_t^*$, since by assumption $H_t^* \le H_t$. Therefore by Proposition 12 we have
$$D_\Phi(x_t^*, x_t) + D_\Phi(x_t, x_{t-1}) \le D_\Phi(x_t^*, x_{t-1});$$
since $m\|x - y\|^2 \le D_\Phi(x, y) \le M\|x - y\|^2$, it follows that
$$\|x_t - x_t^*\|^2 + \|x_t - x_{t-1}\|^2 \le \kappa\|x_t^* - x_{t-1}\|^2.$$
Let $\|x_t - x_t^*\| = rM_t$; note that by (16), $r \le \frac{2}{\alpha\beta}$. We have
$$\|x_t^* - x_{t-1}\| \ge \frac{1}{\sqrt{\kappa}}\sqrt{r^2 + 1}\,M_t. \quad (17)$$
Hence
$$\|x_t^* - x_{t-1}\| - \|x_t - x_t^*\| \ge \left(\frac{1}{\sqrt{\kappa}}\sqrt{r^2 + 1} - r\right)M_t. \quad (18)$$
Since $\frac{1}{\sqrt{\kappa}} \le 1$, $\frac{1}{\sqrt{\kappa}}\sqrt{r^2 + 1} - r$ is a decreasing function of $r$; this can be seen by taking the derivative with respect to $r$. Combining this with the fact that $r \le \frac{2}{\alpha\beta}$, we have $\|x_t - x_t^*\| - \|x_t^* - x_{t-1}\| \le -\gamma M_t$ for
$$\gamma = \frac{1}{\sqrt{\kappa}}\sqrt{1 + \left(\frac{2}{\alpha\beta}\right)^2} - \frac{2}{\alpha\beta}.$$
Note that $\gamma > 0$ when $\beta > \frac{2\sqrt{\kappa - 1}}{\alpha}$.

Appendix F. Proof of Theorem 10
Recall that we can write the update rule as
$$\nabla\Phi(x_t) = \nabla\Phi(x_{t-1}) - \eta\nabla f_t(x_t).$$
Let $\{x_t^L\}_{t=1}^T$ denote an $L$-constrained offline optimal solution. By the convexity of $f_t$, we have
$$f_t(x_t) - f_t(x_t^L) \le \langle\nabla f_t(x_t), x_t - x_t^L\rangle = \frac{1}{\eta}\langle\nabla\Phi(x_{t-1}) - \nabla\Phi(x_t), x_t - x_t^L\rangle = \frac{1}{\eta}\Big(\langle\nabla\Phi(x_{t-1}) - \nabla\Phi(x_t), x_{t-1} - x_t^L\rangle - \langle\nabla\Phi(x_t) - \nabla\Phi(x_{t-1}), x_t - x_{t-1}\rangle\Big) \quad (19)$$
Recall that the Bregman divergence satisfies the identity
$$\langle\nabla f(x) - \nabla f(y), x - z\rangle = D_f(x, y) + D_f(z, x) - D_f(z, y)$$
for all $x, y, z \in \mathbb{R}^d$. We use this identity in each of the inner products in (19) to obtain
$$f_t(x_t) - f_t(x_t^L) \le \frac{1}{\eta}\Big(D_\Phi(x_{t-1}, x_t) + D_\Phi(x_t^L, x_{t-1}) - D_\Phi(x_t^L, x_t) - D_\Phi(x_t, x_{t-1}) - D_\Phi(x_{t-1}, x_t) + D_\Phi(x_{t-1}, x_{t-1})\Big) = \frac{1}{\eta}\Big(D_\Phi(x_t^L, x_{t-1}) - D_\Phi(x_t^L, x_t)\Big) - \frac{1}{\eta}D_\Phi(x_t, x_{t-1}).$$
Notice that
$$D_\Phi(x_t^L, x_{t-1}) - D_\Phi(x_t^L, x_t) = \Phi(x_t^L) - \Phi(x_{t-1}) - \langle\nabla\Phi(x_{t-1}), x_t^L - x_{t-1}\rangle - \Phi(x_t^L) + \Phi(x_t) + \langle\nabla\Phi(x_t), x_t^L - x_t\rangle$$
$$= \big(\Phi(x_t) - \langle\nabla\Phi(x_t), x_t\rangle\big) - \big(\Phi(x_{t-1}) - \langle\nabla\Phi(x_{t-1}), x_{t-1}\rangle\big) + \langle\nabla\Phi(x_t) - \nabla\Phi(x_{t-1}), x_t^L\rangle$$
$$= D_\Phi(0, x_{t-1}) - D_\Phi(0, x_t) + \langle\nabla\Phi(x_t) - \nabla\Phi(x_{t-1}), x_t^L\rangle.$$
Summing over $t$, we have
$$\sum_{t=1}^T D_\Phi(x_t^L, x_{t-1}) - D_\Phi(x_t^L, x_t) = D_\Phi(0, x_0) - D_\Phi(0, x_T) + \sum_{t=1}^{T-1}\langle\nabla\Phi(x_t), x_t^L - x_{t+1}^L\rangle - \langle\nabla\Phi(x_0), x_1^L\rangle + \langle\nabla\Phi(x_T), x_T^L\rangle \le \sum_{t=1}^T \|\nabla\Phi(x_t)\|_* \|x_t^L - x_{t+1}^L\| \le G\sum_{t=1}^T \|x_t^L - x_{t+1}^L\| = GL,$$
where in the penultimate inequality we used the fact that $x_0 = 0$, $\nabla\Phi(0) = 0$, and $x_{T+1}^L = x_0^L = 0$. Putting it all together, and dropping the nonpositive term $-\sum_{t=1}^T \|x_t^L - x_{t-1}^L\|$, we obtain
$$\mathrm{cost}(OBD) - \mathrm{cost}(OPT(L)) = \sum_{t=1}^T\big(f_t(x_t) - f_t(x_t^L)\big) + \sum_{t=1}^T\big(\|x_t - x_{t-1}\| - \|x_t^L - x_{t-1}^L\|\big) \le \frac{GL}{\eta} + \sum_{t=1}^T\left(\|x_t - x_{t-1}\| - \frac{1}{\eta}D_\Phi(x_t, x_{t-1})\right) \le \frac{GL}{\eta} + \sum_{t=1}^T\left(\|x_t - x_{t-1}\| - \frac{m}{2\eta}\|x_t - x_{t-1}\|^2\right) \le \frac{GL}{\eta} + \frac{\eta T}{2m},$$
where the last inequality is due to completing the square and throwing away the negative terms.
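The Bregman three-point identity used above is purely algebraic and holds for any differentiable convex $f$; the sketch below (our own numerical check, using the negative-entropy function $f(x) = \sum_i x_i \log x_i$ as an arbitrary example) verifies it:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.sum(x * np.log(x))

def grad_f(x):
    return np.log(x) + 1.0

def bregman(a, b):
    # D_f(a, b) = f(a) - f(b) - <grad f(b), a - b>
    return f(a) - f(b) - grad_f(b) @ (a - b)

# Check <grad f(x) - grad f(y), x - z> = D_f(x,y) + D_f(z,x) - D_f(z,y).
for _ in range(100):
    x, y, z = (rng.uniform(0.1, 2.0, size=5) for _ in range(3))
    lhs = (grad_f(x) - grad_f(y)) @ (x - z)
    rhs = bregman(x, y) + bregman(z, x) - bregman(z, y)
    assert abs(lhs - rhs) < 1e-9
```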