Online non-convex optimization with imperfect feedback
Amélie Héliou, Matthieu Martin, Panayotis Mertikopoulos, Thibaud Rahier
Abstract. We consider the problem of online learning with non-convex losses. In terms of feedback, we assume that the learner observes – or otherwise constructs – an inexact model for the loss function encountered at each stage, and we propose a mixed-strategy learning policy based on dual averaging. In this general context, we derive a series of tight regret minimization guarantees, both for the learner's static (external) regret, as well as for the regret incurred against the best dynamic policy in hindsight. Subsequently, we apply this general template to the case where the learner only has access to the actual loss incurred at each stage of the process. This is achieved by means of a kernel-based estimator which generates an inexact model for each round's loss function using only the learner's realized losses as input.

1. Introduction
In this paper, we consider the following online learning framework:
(1) At each stage t = 1, 2, … of a repeated decision process, the learner selects an action x_t from a compact convex subset K of a Euclidean space R^n.
(2) The agent's choice of action triggers a loss ℓ_t(x_t) based on an a priori unknown loss function ℓ_t : K → R; subsequently, the process repeats.

If the loss functions ℓ_t encountered by the agent are convex, the above framework is the standard online convex optimization setting of Zinkevich [58] – for a survey, see [16, 29, 45] and references therein. In this case, simple first-order methods like online gradient descent (OGD) allow the learner to achieve O(T^{1/2}) regret after T rounds [58], a bound which is well known to be min-max optimal in this setting [1, 45]. At the same time, it is also possible to achieve tight regret minimization guarantees against dynamic comparators – such as the regret incurred against the best dynamic policy in hindsight, cf. [11, 21, 23, 30, 33] and references therein.

On the other hand, when the problem's loss functions are not convex, the situation is considerably more difficult. When the losses are generated from a stationary stochastic distribution, the problem can be seen as a version of a continuous-armed bandit in the spirit of Agrawal [3]; in this case, there exist efficient algorithms guaranteeing logarithmic regret by discretizing the problem's search domain and using a UCB-type policy [18, 37, 48]. Otherwise, in an adversarial context, an informed adversary can impose linear regret on any deterministic algorithm employed by the learner [31, 45, 50]; as a result, UCB-type approaches are no longer suitable.

In view of this impossibility result, two distinct threads of literature have emerged for online non-convex optimization. One possibility is to examine less demanding measures of regret – like the learner's local regret [31] – and focus on first-order methods that minimize it efficiently [28, 31]. Another possibility is to consider randomized algorithms, in which case achieving no regret is possible: Krichene et al. [38] showed that adapting the well-known Hedge (or multiplicative/exponential weights) algorithm to a continuum allows the learner to achieve O(T^{1/2}) regret, as in the convex case. This result is echoed in more recent works by Agarwal et al. [2] and Suggala & Netrapalli [50], who analyzed the "follow the perturbed leader" (FTPL) algorithm of Kalai & Vempala [34] with exponentially distributed perturbations and an offline optimization oracle (exact or approximate); again, the regret achieved by FTPL in this setting is O(T^{1/2}), i.e., order-equivalent to that of Hedge in a continuum.

∗ Criteo AI Lab, France.
♦ Univ. Grenoble Alpes, CNRS, Inria, LIG, 38000, Grenoble, France.
E-mail addresses: [email protected], [email protected], [email protected], [email protected].
2020 Mathematics Subject Classification. Primary 68Q32; Secondary 90C26, 91A26.
Key words and phrases. Online optimization; non-convex; dual averaging; bandit / imperfect feedback.
P. Mertikopoulos is grateful for financial support by the French National Research Agency (ANR) in the framework of the "Investissements d'avenir" program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), and MIAI@Grenoble Alpes (ANR-19-P3IA-0003). This research was also supported by the COST Action CA16228 "European Network for Game Theory" (GAMENET).

1.1. Our contributions and related work.
A crucial assumption in the above works on randomized algorithms is that, after selecting an action, the learner receives perfect information on the loss function encountered – i.e., an exact model thereof. This is an important limitation for the applicability of these methods, which led to the following question by Krichene et al. [38, p. 8]:
One question is whether one can generalize the Hedge algorithm to a bandit setting, so that sublinear regret can be achieved without the need to explicitly maintain a cover.
To address this open question, we begin by considering a general framework for randomized action selection with imperfect feedback – i.e., with an inexact model of the loss functions encountered at each stage. Our contributions in this regard are as follows:
(1) We present a flexible algorithmic template for online non-convex learning based on dual averaging with imperfect feedback [42].
(2) We provide tight regret minimization rates – both static and dynamic – under a wide range of different assumptions for the loss models available to the optimizer.
(3) We show how this framework can be extended to learning with bandit feedback, i.e., when the learner only observes their realized loss and must construct a loss model from scratch.

Viewed abstractly, the dual averaging (DA) algorithm is an "umbrella" scheme that contains Hedge as a special case for problems with a simplex-like domain. In the context of online convex optimization, the method is closely related to the well-known "follow the regularized leader" (FTRL) algorithm of Shalev-Shwartz & Singer [46], the FTPL method of Kalai & Vempala [34], "lazy" mirror descent (MD) [15, 16, 45], etc. For an appetizer to the vast literature surrounding these methods, we refer the reader to [14, 16, 42, 45, 46, 55, 57] and references therein.

In the non-convex setting, our regret minimization guarantees can be summarized as follows (see also Table 1): if the learner has access to inexact loss models that are unbiased and finite in mean square, the DA algorithm achieves in expectation a static regret bound of O(T^{1/2}). Moreover, in terms of the learner's dynamic regret, the algorithm enjoys a bound of O(T^{2/3} V_T^{1/3}), where V_T := ∑_{t=1}^{T} ‖ℓ_{t+1} − ℓ_t‖_∞ denotes the variation of the loss functions encountered over the horizon of play (cf. Section 4 for the details). Importantly, both bounds are order-optimal, even in the context of online convex optimization, cf.
[1, 11, 20].

With these general guarantees in hand, we tackle the bandit setting using a "kernel smoothing" technique in the spirit of Bubeck et al. [19]. This leads to a new algorithm, which we call bandit dual averaging (BDA), and which can be seen as a version of the DA method with biased loss models. The bias of the loss model can be controlled by tuning the "radius" of the smoothing kernel; however, this comes at the cost of increasing the model's variance – an incarnation of the well-known "bias-variance" trade-off. By resolving this trade-off, we are finally able to answer the question of Krichene et al. [38] in the positive: BDA enjoys an O(T^{(n+2)/(n+3)}) static regret bound and an O(T^{(n+3)/(n+4)} V_T^{1/(n+4)}) dynamic regret bound, without requiring an explicit discretization of the problem's search space.

Feedback   | Convex losses: Static   | Convex losses: Dynamic      | Non-convex losses: Static  | Non-convex losses: Dynamic
Exact      | O(T^{1/2}) [58]         | O(T^{2/3} V_T^{1/3}) [11]   | O(T^{1/2}) [38, 50]        | **O(T^{2/3} V_T^{1/3})**
Unbiased   | O(T^{1/2}) [58]         | O(T^{2/3} V_T^{1/3}) [11]   | **O(T^{1/2})**             | **O(T^{2/3} V_T^{1/3})**
Bandit     | O(T^{1/2}) [17, 19]     | O(T^{2/3} V_T^{1/3}) [11]   | **O(T^{(n+2)/(n+3)})**     | **O(T^{(n+3)/(n+4)} V_T^{1/(n+4)})**

Table 1: Overview of related work. As regards feedback, an "exact" model means that the learner acquires perfect knowledge of the encountered loss functions; "unbiased" refers to an inexact model that is only accurate on average; finally, "bandit" means that the learner records their incurred loss and has no other information. We only report here the best known bounds in the literature; all bounds derived in this paper are typeset in bold.

This should be contrasted with the case of online convex learning, where it is possible to achieve O(T^{3/4}) regret through the use of simultaneous perturbation stochastic approximation (SPSA) techniques [25], or even O(T^{1/2}) by means of kernel-based methods [17, 19]. Our bound represents a drastic drop from O(T^{1/2}), but this cannot be avoided: the worst-case bound for stochastic non-convex optimization is Ω(T^{(n+1)/(n+2)}) [36, 37], so our static regret bound is nearly optimal in this regard (i.e., up to a factor of O(T^{1/((n+2)(n+3))}), a term which is insignificant except for astronomically large horizons T). Correspondingly, in the case of dynamic regret minimization, the best known upper bound is O(T^{2/3} V_T^{1/3}) for online convex problems [11, 24]. We are likewise not aware of any comparable dynamic regret bounds for online non-convex problems; to the best of our knowledge, our paper is the first to derive dynamic regret guarantees for online non-convex learning with bandit feedback.

We should stress here that, as is often the case for methods based on lifting, much of the computational cost is hidden in the sampling step.
This is also the case for the proposed DA method which, like [38], implicitly assumes access to a sampling oracle. Estimating (and minimizing) the per-iteration cost of sampling is an important research direction, but one that lies beyond the scope of the current paper, so we do not address it here.

2. Setup and preliminaries
2.1. The model.
Throughout the sequel, our only blanket assumption will be as follows:
Assumption 1.
The stream of loss functions encountered is uniformly bounded Lipschitz , i.e.,there exist constants
R, L > such that:(1) | (cid:96) t ( x ) | ≤ R for all x ∈ K ; more succinctly, (cid:107) (cid:96) t (cid:107) ∞ ≤ R .(2) | (cid:96) t ( x (cid:48) ) − (cid:96) t ( x ) | ≤ L (cid:107) x (cid:48) − x (cid:107) for all x, x (cid:48) ∈ K .Other than this meager regularity requirement, we make no structural assumptionsfor (cid:96) t (such as convexity, unimodality, or otherwise). In this light, the framework underconsideration is akin to the online non-convex setting of Krichene et al. [38], Hazan et al.[31], and Suggala & Netrapalli [50]. The main difference with the setting of Krichene et al.[38] is that the problem’s domain K is assumed convex; this is done for convenience only, toavoid technical subtleties involving “uniform fatness” conditions and the like. A. HÉLIOU, M. MARTIN, P. MERTIKOPOULOS, AND T. RAHIER
In terms of playing the game, we will assume that the learner can employ mixed strategies to randomize their choice of action at each stage; however, because this mixing occurs over a continuous domain, defining this randomization requires some care. To that end, let
M ≡ M(K) denote the space of all finite signed Radon measures on K. Then, a mixed strategy is defined as an element π of the set of Radon probability measures ∆ ≡ ∆(K) ⊆ M(K) on K, and the player's expected loss under π when facing a bounded loss function ℓ ∈ L∞(K) will be denoted as

⟨ℓ, π⟩ := E_π[ℓ] = ∫_K ℓ(x) dπ(x). (1)

Remark. We should note here that ∆ contains a vast array of strategies, including atomic and singular distributions that do not admit a density. For this reason, we will write ∆_cont for the set of strategies that are absolutely continuous relative to the Lebesgue measure λ on K, and ∆_⊥ for the set of singular strategies (which are not); by Lebesgue's decomposition theorem [26], we have ∆ = ∆_cont ∪ ∆_⊥. By construction, ∆_⊥ contains the player's pure strategies, i.e., Dirac point masses δ_x that select x ∈ K with probability 1; however, it also contains pathological strategies that admit neither a density nor a point mass function – such as the Cantor distribution [26]. By contrast, the Radon–Nikodym (RN) derivative p := dπ/dλ of π exists for all π ∈ ∆_cont, so we will sometimes refer to elements of ∆_cont as "Radon–Nikodym strategies"; in particular, if π ∈ ∆_cont, we will not distinguish between π and p unless absolutely necessary to avoid confusion.

Much of our analysis will focus on strategies χ with a piecewise constant density on K, i.e., χ = ∑_{i=1}^{k} α_i 1_{C_i} for a collection of weights α_i ≥ 0 and measurable subsets C_i ⊆ K, i = 1, …, k, such that ∫_K χ = ∑_i α_i λ(C_i) = 1. These strategies will be called simple, and the space of simple strategies on K will be denoted by X ≡ X(K). A key fact regarding simple strategies is that X is dense in ∆ in the weak topology of M [26, Chap. 3]; as a result, the learner's expected loss under any mixed strategy π ∈ ∆ can be approximated within arbitrary accuracy ε > 0 by a simple strategy χ ∈ X.
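Concretely, sampling x ∼ χ from a simple strategy amounts to first picking a cell C_i with probability α_i λ(C_i) and then drawing a uniform point from that cell. A minimal sketch, assuming axis-aligned boxes for the cells (an illustrative choice, not required by the definition):

```python
import numpy as np

# Sampling from a "simple strategy", i.e. a mixed strategy with piecewise-constant
# density chi = sum_i alpha_i * 1_{C_i}, where each cell C_i is an axis-aligned box
# [lo_i, hi_i] in R^n. (Illustrative sketch; the cells and weights are made up.)
rng = np.random.default_rng(0)

def sample_simple(cells, alphas, rng):
    """Draw x ~ chi, where chi has constant density alpha_i on the box cells[i]."""
    lows = np.array([c[0] for c in cells], dtype=float)
    highs = np.array([c[1] for c in cells], dtype=float)
    vols = np.prod(highs - lows, axis=1)      # Lebesgue measure lambda(C_i)
    probs = np.asarray(alphas) * vols         # P(x in C_i) = alpha_i * lambda(C_i)
    assert abs(probs.sum() - 1.0) < 1e-9      # chi must integrate to 1
    i = rng.choice(len(cells), p=probs)       # pick a cell ...
    return rng.uniform(lows[i], highs[i])     # ... then sample uniformly inside it

# Two cells in R^2 with total mass 1: 2.0 * 0.25 + (2/3) * 0.75 = 1.
cells = [([0.0, 0.0], [0.5, 0.5]), ([0.5, 0.0], [1.0, 1.5])]
alphas = [2.0, 2.0 / 3.0]
x = sample_simple(cells, alphas, rng)
```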
In addition, when k (or n) is not too large, sampling from simple strategies can be done efficiently; for all these reasons, simple strategies will play a key role in the sequel.

2.2. Measures of regret.
With all this in hand, the regret of a learning policy π_t ∈ ∆, t = 1, 2, …, against a benchmark strategy π* ∈ ∆ is defined as

Reg_{π*}(T) = ∑_{t=1}^{T} [E_{π_t}[ℓ_t] − E_{π*}[ℓ_t]] = ∑_{t=1}^{T} ⟨ℓ_t, π_t − π*⟩, (2)

i.e., as the difference between the player's mean cumulative loss under π_t and π* over T rounds. In a slight abuse of notation, we write Reg_{p*}(T) if π* admits a density p*, and Reg_x(T) for the regret incurred against the pure strategy δ_x, x ∈ K. Then, the player's (static) regret under π_t is given by

Reg(T) = max_{x∈K} Reg_x(T) = sup_{χ∈X} Reg_χ(T), (3)

where the maximum is justified by the compactness of K and the continuity of each ℓ_t. The lemma below provides a link between pure comparators and their approximants in the spirit of Krichene et al. [38]; to streamline our discussion, we defer the proof to the supplement:
Let U be a convex neighborhood of x in K and let χ ∈ X be a simple strategy supported on U. Then, Reg_x(T) ≤ Reg_χ(T) + L diam(U) T.

This lemma will be used to bound the agent's static regret using bounds obtained for simple strategies χ ∈ X. Going beyond static comparisons of this sort, the learner's dynamic regret is defined as

DynReg(T) = ∑_{t=1}^{T} [⟨ℓ_t, π_t⟩ − min_{π∈∆} ⟨ℓ_t, π⟩] = ∑_{t=1}^{T} ⟨ℓ_t, π_t − π*_t⟩, (4)

where π*_t ∈ arg min_{π∈∆} ⟨ℓ_t, π⟩ is a "best response" to ℓ_t (that such a strategy exists is a consequence of the compactness of K and the continuity of each ℓ_t). Compared to its static counterpart, the agent's dynamic regret is a considerably more ambitious benchmark, and achieving sublinear dynamic regret is not always possible; we examine this issue in detail in Section 4.

2.3. Feedback models.
After choosing an action, the agent is only assumed to observe an inexact model û_t ∈ L∞(K) of the t-th stage loss function ℓ_t; for concreteness, we will write

û_t = ℓ_t + e_t, (5)

where the "observation error" e_t captures all sources of uncertainty in the player's model. This uncertainty could be both "random" (zero-mean) or "systematic" (non-zero-mean), so it will be convenient to decompose e_t as

e_t = z_t + b_t, (6)

where z_t is zero-mean and b_t denotes the mean of e_t.

To define all this formally, we will write F_t = F(π_1, …, π_t) for the history of the player's mixed strategies up to stage t (inclusive). The chosen action x_t and the observed model û_t are both generated after the player chooses π_t so, by default, they are not F_t-measurable. Accordingly, we will collect all randomness affecting û_t in an abstract probability law P, and we will write b_t = E[e_t | F_t] and z_t = e_t − b_t; in this way, E[z_t | F_t] = 0 by definition.

In view of all this, we will focus on the following descriptors for û_t:
a) Bias: ‖b_t‖_∞ ≤ B_t (7a)
b) Variance: E[‖z_t‖²_∞ | F_t] ≤ σ_t² (7b)
c) Mean square: E[‖û_t‖²_∞ | F_t] ≤ M_t² (7c)

In the above, B_t, σ_t and M_t are deterministic constants that are to be construed as bounds on the bias, (conditional) variance, and magnitude of the model û_t at time t. In obvious terminology, a model with B_t = 0 will be called unbiased, and an unbiased model with σ_t = 0 will be called exact.

Example 1 (Parametric models). An important application of online optimization is the case where the encountered loss functions are of the form ℓ_t(x) = ℓ(x; θ_t) for some sequence of parameter vectors θ_t ∈ R^m. In this case, the learner typically observes an estimate θ̂_t of θ_t, leading to the inexact model û_t = ℓ(·; θ̂_t).
Importantly, this means that û_t does not require infinite-dimensional feedback to be constructed. Moreover, the dependence of ℓ on θ is often linear, so if θ̂_t is an unbiased estimate of θ_t, then so is û_t.

Example 2 (Online clique prediction). As a specific incarnation of a parametric model, consider the problem of finding the largest complete subgraph – a maximum clique – of an undirected graph G = (V, E). This is a key problem in machine learning with applications to social networks [27], data mining [9], gene clustering [49], feature embedding [56], and many other fields. In the online version of the problem, the learner is asked to predict such a clique in a graph G_t that evolves over time (e.g., a social network), based on partial historical observations of the graph. Then, by the Motzkin–Straus theorem [13, 41], this boils down to an online quadratic program of the form:

maximize u_t(x) = ∑_{i,j=1}^{n} x_i A_{ij,t} x_j
subject to x_i ≥ 0, ∑_{i=1}^{n} x_i = 1, (MCP)

where A_t = (A_{ij,t})_{i,j=1}^{n} denotes the adjacency matrix of G_t. Typically, Â_t is constructed by picking a node i uniformly at random, charting out its neighbors, and letting Â_{ij,t} = |V|/2
Typically, each f i isrelatively easy to store in closed form, so ˆ (cid:96) t is an easily available unbiased model of theempirical risk function f . (cid:74) Prox-strategies and dual averaging
The class of non-convex online learning policies that we will consider is based on the general template of dual averaging (DA) / "follow the regularized leader" (FTRL) methods. Informally, this scheme can be described as follows: at each stage t = 1, 2, …, the learner plays a mixed strategy that minimizes their cumulative loss up to round t − 1 (inclusive) plus a "regularization" penalty term (hence the "regularized leader" terminology). In the rest of this section, we provide a detailed construction and description of the method.

3.1. Randomizing over discrete vs. continuous sets.
We begin by describing the dual averaging method when the underlying action set is finite, i.e., of the form A = {1, …, n}. In this case, the space of mixed strategies is the n-dimensional simplex ∆_n := ∆(A) = {p ∈ R^n_+ : ∑_{i=1}^{n} p_i = 1}, and, at each t = 1, 2, …, the dual averaging algorithm prescribes the mixed strategy

p_t ← arg min_{p∈∆_n} { η ∑_{s=1}^{t−1} ⟨û_s, p⟩ + h(p) }. (9)

In the above, η > 0 is a "learning rate" parameter and h : ∆_n → R is the method's "regularizer", assumed to be continuous and strongly convex over ∆_n. In this way, the algorithm can be seen as tracking the "best" choice up to the present, modulo a "day 0" regularization component – the "follow the regularized leader" interpretation.

In our case however, the method is to be applied to the infinite-dimensional set ∆ ≡ ∆(K) of the learner's mixed strategies, so the issue becomes considerably more involved. To illustrate the problem, consider one of the prototypical regularizer functions, the negentropy h(p) = ∑_{i=1}^{n} p_i log p_i on ∆_n. If we naïvely try to extend this definition to the infinite-dimensional space ∆(K), we immediately run into problems: First, for pure strategies, any expression of the form ∑_{x′∈K} δ_x(x′) log δ_x(x′) would be meaningless. Second, even if we focus on Radon–Nikodym strategies p ∈ ∆_cont and use the integral definition h(p) = ∫_K p log p, a density like p(x) ∝ 1/(x (log x)²) on K = [0, 1/2] has infinite negentropy, implying that even ∆_cont is too large to serve as a domain.

3.2. Formal construction of the algorithm.
To overcome the issues identified above, our starting point will be that any mixed-strategy incarnation of the dual averaging algorithm must contain at least the space
X ≡ X(K) of the player's simple strategies. To that end, let V be an ambient Banach space which contains the set of simple strategies X as an embedded subset. For technical reasons, we will also assume that the topology induced on X by the reference norm ‖·‖ of V is not weaker than the natural topology on X induced by the total variation norm; formally, ‖·‖_TV ≤ α‖·‖ for some α > 0. For example, V could be the (Banach) space M(K) of finite signed measures on K, the (Hilbert) space L²(K) of square-integrable functions on K endowed with the L² norm, or an altogether different model for X. (Since the dual space of M(K) contains L∞(K), we will also view L∞(K) as an embedded subset of V*.) We then have:

Definition 1. A regularizer on X is a lower semi-continuous (l.s.c.) convex function h : V → R ∪ {∞} such that:
(1) X is a weakly dense subset of the effective domain dom h := {p : h(p) < ∞} of h.
(2) The subdifferential ∂h of h admits a continuous selection, i.e., there exists a continuous mapping ∇h on dom ∂h := {∂h ≠ ∅} such that ∇h(q) ∈ ∂h(q) for all q ∈ dom ∂h.
(3) h is strongly convex, i.e., there exists some K > 0 such that h(p) ≥ h(q) + ⟨∇h(q), p − q⟩ + (K/2)‖p − q‖² for all p ∈ dom h, q ∈ dom ∂h.
The set Q := dom ∂h will be called the prox-domain of h; its elements will be called prox-strategies.

Remark.
For completeness, recall that the subdifferential of h at q is the set ∂h(q) = {ψ ∈ V* : h(p) ≥ h(q) + ⟨ψ, p − q⟩ for all p ∈ V}; also, lower semicontinuity means that the sublevel sets {h ≤ c} of h are closed for all c ∈ R. For more details, we refer the reader to Phelps [44].

Some prototypical examples of this general framework are as follows (with more in the supplement):

Example 4 (L² regularization). Let V = L²(K) and consider the quadratic regularizer h(p) = (1/2)‖p‖² = (1/2)∫_K p² if p ∈ ∆_cont ∩ L²(K), and h(p) = ∞ otherwise. In this case, Q = dom h = ∆_cont ∩ L²(K) and ∇h(q) = q is a continuous selection of ∂h on Q.

Example 5 (Entropic regularization). Let V = M(K) and consider the entropic regularizer h(p) = ∫_K p log p whenever p is a density with finite entropy, h(p) = ∞ otherwise. By Pinsker's inequality, h is 1-strongly convex relative to the total variation norm ‖·‖_TV on V; moreover, we have Q = {q ∈ ∆_cont : supp(q) = K} ⊊ dom h and ∇h(q) = 1 + log q on Q. In the finite-dimensional case, this regularizer forms the basis of the well-known Hedge (or multiplicative/exponential weights) algorithm [5, 6, 39, 54]; for the infinite-dimensional case, see [38, 43] (and below).

With all this in hand, the dual averaging algorithm can be described by means of the abstract recursion

y_{t+1} = y_t − û_t,
p_{t+1} = Q(η_{t+1} y_{t+1}), (DA)

where (i) t = 1, 2, … denotes the stage of the process (with the convention y_1 = û_0 = 0); (ii) p_t ∈ Q is the learner's strategy at stage t; (iii) û_t ∈ L∞(K) is the inexact model revealed at stage t; (iv) y_t ∈ L∞(K) is a "score" variable that aggregates loss models up to stage t; (v) η_t > 0 is a "learning rate" sequence; and (vi) Q : L∞(K) → Q is the method's mirror map, viz.

Q(ψ) = arg max_{p∈V} {⟨ψ, p⟩ − h(p)}.
(10)

For a pseudocode implementation, see Alg. 1 below. In the paper's supplement we also show that the method is well-posed, i.e., the arg max in (10) is attained at a valid prox-strategy p_t ∈ Q. We illustrate this with an example:

Example 6 (Logit choice). Suppose that h(p) = ∫_K p log p is the entropic regularizer of Example 5. Then, the corresponding mirror map is given in closed form by the logit choice model:

Λ(ϕ) = exp(ϕ) / ∫_K exp(ϕ) for all ϕ ∈ L∞(K). (11)

This derivation builds on a series of well-established arguments that we defer to the supplement. Clearly, ∫_K Λ(ϕ) = 1 and Λ(ϕ) > 0 as a function on K, so Λ(ϕ) is a valid prox-strategy. (When V = L²(K), we have α = √λ(K): indeed, ‖p‖²_TV = [∫_K |p|]² ≤ λ(K) ∫_K p² = λ(K)‖p‖² for p ∈ ∆_cont, by the Cauchy–Schwarz inequality.)

Algorithm 1: Dual averaging with imperfect feedback [Hedge variant: Q ← Λ]
Require: mirror map Q : L∞(K) → Q; learning rate η_t > 0
initialize: y_1 ← 0
for t = 1, 2, … do
    set p_t ← Q(η_t y_t) [p_t ← Λ(η_t y_t) for Hedge]    # update mixed strategy
    play x_t ∼ p_t                                       # choose action
    observe û_t                                          # model revealed
    set y_{t+1} ← y_t − û_t                              # update scores
end for

4. General regret bounds
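Before turning to the analysis, a toy, discretized sketch of the Hedge variant of Alg. 1 may help fix ideas: K = [0, 1] is replaced by a uniform grid, the logit map (11) becomes a softmax over grid points, and each round's model is the true loss plus zero-mean noise (the "unbiased" feedback regime). The grid size, loss functions, noise level, and horizon below are illustrative assumptions, not the paper's choices.

```python
import numpy as np

# Toy, discretized sketch of Alg. 1 (Hedge variant) on K = [0, 1].
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 201)          # discretization of K (illustrative)
y = np.zeros_like(grid)                    # score variable y_t
T = 2000
actions, losses = [], []
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)                 # eta_t ∝ t^{-1/2}
    w = np.exp(eta * y - np.max(eta * y))  # logit choice p_t = Λ(eta_t y_t)
    p = w / w.sum()
    x = rng.choice(grid, p=p)              # play x_t ~ p_t
    # A synthetic non-convex loss that drifts slowly with t (illustrative):
    ell = np.sin(8 * x) + 0.5 * np.cos(3 * x + 0.001 * t)
    # Unbiased model u_hat_t = ell_t + zero-mean noise, observed on the grid:
    u_hat = (np.sin(8 * grid) + 0.5 * np.cos(3 * grid + 0.001 * t)
             + 0.1 * rng.standard_normal(grid.size))
    y -= u_hat                             # score update y_{t+1} = y_t - u_hat_t
    actions.append(x)
    losses.append(ell)
```

With bandit feedback, the noisy model above would instead have to be constructed from the single realized loss ℓ_t(x_t) via kernel smoothing, as discussed in Section 5.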
4.1. Static regret guarantees.
We are now in a position to state our first result for (DA):
Proposition 1.
For any simple strategy χ ∈ X, Alg. 1 enjoys the bound

Reg_χ(T) ≤ η_{T+1}^{-1} [h(χ) − min h] + ∑_{t=1}^{T} ⟨e_t, χ − p_t⟩ + (1/(2K)) ∑_{t=1}^{T} η_t ‖û_t‖²_*. (12)

Proposition 1 is a "template" bound that we will use to extract static and dynamic regret guarantees in the sequel. Its proof relies on the introduction of a suitable energy function measuring the match between the learner's aggregate model y_t and the comparator χ. The main difficulty is that these variables live in completely different spaces (L∞(K) vs. X respectively), so there is no clear distance metric connecting them. However, since bounded functions ψ ∈ L∞(K) and simple strategies χ ∈ X are naturally paired via duality, they are indirectly connected via the Fenchel–Young inequality ⟨ψ, χ⟩ ≤ h(χ) + h*(ψ), where h*(ψ) = max_{p∈V} {⟨ψ, p⟩ − h(p)} denotes the convex conjugate of h, and equality holds if and only if Q(ψ) = χ. We will thus consider the energy function

E_t := η_t^{-1} [h(χ) + h*(η_t y_t) − ⟨η_t y_t, χ⟩]. (13)

By construction, E_t ≥ 0 for all t, and E_t = 0 if and only if p_t = Q(η_t y_t) = χ. More to the point, the defining property of E_t is the following recursive bound (which we prove in the supplement):
For all χ ∈ X, we have:

E_{t+1} ≤ E_t + ⟨û_t, χ − p_t⟩ + (η_{t+1}^{-1} − η_t^{-1}) [h(χ) − min h] + (η_t/(2K)) ‖û_t‖²_*. (14)

Proposition 1 is obtained by telescoping (14); subsequently, to obtain a regret bound for Alg. 1, we must relate Reg_x(T) to Reg_χ(T). This can be achieved by invoking Lemma 1, but the resulting expressions are much simpler when h is decomposable, i.e., h(p) = ∫_K θ(p(x)) dx for some C² function θ : [0, ∞) → R with θ″ > 0. In this more explicit setting, we have:
Fix x ∈ K, let C be a convex neighborhood of x in K, and suppose that Alg. 1 is run with a decomposable regularizer h(p) = ∫_K θ ∘ p. Then, letting φ(z) = zθ(1/z) for z > 0, we have:

E[Reg_x(T)] ≤ [φ(λ(C)) − φ(λ(K))]/η_{T+1} + L diam(C) T + 2 ∑_{t=1}^{T} B_t + (α²/(2K)) ∑_{t=1}^{T} η_t M_t². (15)

In particular, if Alg. 1 is run with learning rate η_t ∝ 1/t^ρ, ρ ∈ (0, 1), and inexact models such that B_t = O(1/t^β) and M_t = O(t^µ) for some β, µ ≥ 0, we have:

E[Reg(T)] = O(φ(T^{−nκ}) T^ρ + T^{1−κ} + T^{1−β} + T^{1+2µ−ρ}) for all κ ≥ 0. (16)
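To see how the exponents in (16) balance, a quick numerical sketch (illustrative, not from the paper), taking κ = ρ, B_t = 0 and M_t = O(1), and using φ(z) = 1/(2z) as for the quadratic regularizer of Example 4:

```python
# For the quadratic regularizer, phi(z) = 1/(2z), so the first term of (16) with
# kappa = rho grows as T^{(n+1) rho}, while the neighborhood term is T^{1 - rho}.
# The optimal rho equates the two exponents: (n + 1) * rho = 1 - rho.
def optimal_rho_quadratic(n):
    rho = 1.0 / (n + 2)              # solves (n + 1) * rho = 1 - rho
    regret_exponent = 1.0 - rho      # = (n + 1) / (n + 2)
    return rho, regret_exponent

rho, exponent = optimal_rho_quadratic(3)   # e.g. n = 3: rho = 1/5, exponent = 4/5
```

For the entropic regularizer, φ(z) = −log z, so the grid size only enters logarithmically and ρ = 1/2 balances T^ρ against T^{1−ρ}, which is how the dimension-free O(T^{1/2}) rate arises.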
Corollary 1.
If the learner's feedback is unbiased and bounded in mean square (i.e., B_t = 0 and sup_t M_t < ∞), running Alg. 1 with learning rate η_t ∝ 1/t^ρ guarantees

E[Reg(T)] = O(φ(T^{−nρ}) T^ρ + T^{1−ρ}). (17)

In particular, for the regularizers of Examples 4 and 5, we have:
(1) For θ(z) = (1/2)z², Alg. 1 with η_t ∝ t^{−1/(n+2)} guarantees E[Reg(T)] = O(T^{(n+1)/(n+2)}).
(2) For θ(z) = z log z, Alg. 1 with η_t ∝ t^{−1/2} guarantees E[Reg(T)] = O(T^{1/2}).

Remark. Here and in the sequel, logarithmic factors are ignored in the Landau O(·) notation. We should also stress that the role of C in Theorem 1 only has to do with the analysis of the algorithm, not with the derived bounds (which are obtained by picking a suitable C).

First, in online convex optimization, dual averaging with stochastic gradient feedback achieves O(√T) regret irrespective of the choice of regularizer, and this bound is tight [1, 16, 45]. By contrast, in the non-convex setting, the choice of regularizer has a visible impact on the regret because it affects the exponent of T: in particular, L² regularization carries a much worse dependence on T relative to the Hedge variant of Alg. 1. This is due to the term O(φ(T^{−nκ}) T^ρ) that appears in (16), and is in turn linked to the choice of the "enclosure set" C having λ(C) ∝ T^{−nκ} for some κ ≥ 0.

The negentropy regularizer (and any other regularizer with quasi-linear growth at infinity, see the supplement for additional examples) only incurs a logarithmic dependence on λ(C). Instead, the quadratic growth of the L² regularizer induces an O(1/λ(C)) term in the algorithm's regret, which is ultimately responsible for the catastrophic dependence on the dimension of K. Seeing as the bounds achieved by the Hedge variant of Alg. 1 are optimal in this regard, we will concentrate on this specific instance in the sequel.

4.2. Dynamic regret guarantees.
We now turn to the dynamic regret minimization guarantees of Alg. 1. In this regard, we note first that, in complete generality, dynamic regret minimization is not possible because an informed adversary can always impose a uniformly positive loss at each stage [45]. Because of this, dynamic regret guarantees are often stated in terms of the variation of the loss functions encountered, namely

V_T := ∑_{t=1}^{T} ‖ℓ_{t+1} − ℓ_t‖_∞ = ∑_{t=1}^{T} max_{x∈K} |ℓ_{t+1}(x) − ℓ_t(x)|, (18)

with the convention ℓ_{t+1} = ℓ_t for t = T. We then have:
Theorem 2.
Suppose that the Hedge variant of Alg. 1 is run with learning rate η_t ∝ 1/t^ρ and inexact models with B_t = O(1/t^β) and M_t = O(t^μ) for some β, μ ≥ 0. Then:

E[DynReg(T)] = O(T^{1+μ−ρ} + T^{1−β} + T^{2ρ−μ} V_T). (19)

In particular, if V_T = O(T^ν) for some ν < 1 and the learner's feedback is unbiased and bounded in mean square (i.e., B_t = 0 and sup_t M_t < ∞), the choice ρ = (1−ν)/3 guarantees

E[DynReg(T)] = O(T^{(2+ν)/3}). (20)

To the best of our knowledge, Theorem 2 provides the first dynamic regret guarantee for online non-convex problems. The main idea behind its proof is to examine the evolution of play over a series of windows of length Δ = O(T^γ) for some γ > 0. In so doing, Theorem 1 can be used to obtain a bound for the learner's regret relative to the best action x ∈ K within each window. Obviously, if the length of the window is chosen sufficiently small, aggregating the learner's regret per window will be a reasonable approximation of the learner's dynamic regret. At the same time, if the window is taken too small, the number of such windows required to cover T will be Θ(T), so this approximation becomes meaningless. As a result, to obtain a meaningful regret bound, this window-by-window examination of the algorithm must be carefully aligned with the variation V_T of the loss functions encountered by the learner. (The notion of variation employed here is due to Besbes et al. [11]. Other notions of variation have also been considered [11, 21, 23], as well as other measures of regret, cf. [30, 32]; for a survey, see [20].) Albeit intuitive, the details required to make this argument precise are fairly subtle, so we relegate the proof of Theorem 2 to the paper's supplement.

We should also observe here that the O(T^{(2+ν)/3}) bound of Theorem 2 is, in general, unimprovable, even if the losses are linear. Specifically, Besbes et al.
[11] showed that, if the learner is facing a stream of linear losses with stochastic gradient feedback (i.e., an inexact linear model), an informed adversary can still impose DynReg(T) = Ω(T^{2/3} V_T^{1/3}). Besbes et al. [11] further proposed a scheme that achieves this bound by means of a periodic restart meta-principle, which partitions the horizon of play into batches of size (T/V_T)^{2/3} and then runs an algorithm achieving O((T/V_T)^{1/3}) regret per batch. Theorem 2 differs from the results of Besbes et al. [11] in two key aspects: (a) Alg. 1 does not require a periodic restart schedule (so the learner does not forget the information accrued up to a given stage); and (b) more importantly, it applies to general online optimization problems, without a convex structure or any other structural assumptions (though with a different feedback structure).

5. Applications to online non-convex learning with bandit feedback
As an application of the inexact model framework of the previous sections, we proceed to consider the case where the learner only observes their realized loss ℓ_t(x_t) and has no other information. In this "bandit setting", an inexact model is not available and must instead be constructed on the fly.

When K is a finite set, ℓ_t is a |K|-dimensional vector, and an unbiased estimator for ℓ_t can be constructed by setting û_t(x) = [1{x = x_t}/P(x = x_t)] ℓ_t(x_t) for all x ∈ K. This "importance weighted" estimator is the basis for the EXP3 variant of the Hedge algorithm, which is known to achieve O(T^{1/2}) regret [8]. However, in the case of continuous action spaces, there is a key obstacle: if the indicator 1{x = x_t} is replaced by a Dirac point mass δ_{x_t}(x), the resulting loss model û_t ∝ δ_{x_t} would no longer be a function but a generalized (singular) distribution, so the dual averaging framework of Alg. 1 no longer applies.

To counter this, we will take a "smoothing" approach in the spirit of [19] and consider the estimator

û_t(x) = K_t(x_t, x) · ℓ_t(x_t)/p_t(x_t) (21)

where K_t: K × K → R is a (time-varying) smoothing kernel, i.e., ∫_K K_t(x, x′) dx′ = 1 for all x ∈ K. For concreteness (and sampling efficiency), we will assume that losses now take values in [0, 1], and we will focus on simple kernels that are supported on a neighborhood U_δ(x) = B_δ(x) ∩ K of x in K and are constant therein, i.e., K_δ(x, x′) = [λ(U_δ(x))]^{−1} 1{‖x′ − x‖ ≤ δ}. The "smoothing radius" δ in the definition of K_δ will play a key role in the choice of loss model being fed to Alg. 1.
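As a quick numerical sanity check of the estimator (21), the sketch below (in Python; the grid, the test loss, and all variable names are ours, purely for illustration) discretizes a one-dimensional domain and verifies that the estimator is unbiased for the δ-smoothed loss ∫_K K_δ(x′, x) ℓ_t(x′) dx′ — rather than for ℓ_t itself:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                               # grid points discretizing K = [0, 1]
grid = (np.arange(N) + 0.5) / N
dx = 1.0 / N
loss = 0.5 + 0.4 * np.sin(6 * grid)   # an arbitrary test loss with values in [0, 1]
p = np.ones(N)                        # uniform sampling density on K

delta = 0.1
# box kernel K_delta(x, x') = 1{|x' - x| <= delta} / lambda(U_delta(x))
Kmat = (np.abs(grid[:, None] - grid[None, :]) <= delta).astype(float)
Kmat /= Kmat.sum(axis=1, keepdims=True) * dx   # normalize rows: ∫ K_delta(x, x') dx' = 1

# Monte Carlo average of u_hat(x) = K_delta(x_t, x) * loss(x_t) / p(x_t)
est = np.zeros(N)
T = 20000
for _ in range(T):
    i = rng.integers(N)               # draw x_t from the (uniform) density p
    est += Kmat[i] * loss[i] / p[i]
est /= T

# E[u_hat](x) = ∫ K_delta(x', x) loss(x') dx': the delta-smoothed loss
smoothed = (Kmat * loss[:, None]).mean(axis=0)
print(float(np.max(np.abs(est - smoothed))))   # small: unbiased for the smoothed loss
```

Shrinking δ makes the smoothed loss closer to the true one (less bias) at the cost of rarer, larger kernel hits (more variance), which is exactly the trade-off governing the schedule δ_t discussed next.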
If δ is taken too small, K_δ will approach a point mass, so the induced estimator will have low bias but very high variance; at the other end of the spectrum, if δ is taken too large, the variance of the induced estimator will be low, but so will its accuracy. In view of this, we will consider a flexible smoothing schedule of the form δ_t = 1/t^μ which gradually sharpens the estimator over time as more information comes in. Then, to further protect the algorithm from getting stuck in local minima, we will also incorporate in p_t an explicit exploration term of the form ε_t/λ(K).

Putting all this together, we obtain the bandit dual averaging (BDA) algorithm presented in pseudocode form as Alg. 2 above. By employing a slight variation of the analysis presented
Algorithm 2:
Bandit dual averaging [Hedge variant: Q ← Λ]

Require: mirror map Q: L^∞(K) → Q; parameters η_t, δ_t, ε_t > 0
Initialize: y_1 ← 0
for t = 1, 2, ... do
    set p_t ← (1 − ε_t) Q(η_t y_t) + ε_t/λ(K) [Q ← Λ for Hedge]    ▷ mixed strategy
    play x_t ∼ p_t    ▷ choose action
    set û_t ← K_{δ_t}(x_t, ·) · ℓ_t(x_t)/p_t(x_t)    ▷ payoff model
    set y_{t+1} ← y_t − û_t    ▷ update scores
end for

in Section 4 (basically amounting to a tighter bound in Lemma 2), we obtain the following guarantees for Alg. 2:

Proposition 2.
Suppose that the Hedge variant of Alg. 2 is run with learning rate η_t ∝ 1/t^ρ and smoothing/exploration schedules δ_t ∝ 1/t^μ, ε_t ∝ 1/t^β respectively. Then, the learner enjoys the bound

E[Reg(T)] = O(T^ρ + T^{1−μ} + T^{1−β} + T^{1+nμ+β−ρ}). (22)

In particular, if the algorithm is run with ρ = (n+2)/(n+3) and μ = β = 1/(n+3), we obtain the bound E[Reg(T)] = O(T^{(n+2)/(n+3)}).

Proposition 3.
Suppose that the Hedge variant of Alg. 2 is run with parameters as in Proposition 2 against a stream of loss functions with variation V_T = O(T^ν). Then, the learner enjoys

E[DynReg(T)] = O(T^{1+nμ+β−ρ} + T^{1−β} + T^{1−μ} + T^{ν+2ρ−nμ−β}). (23)

In particular, if the algorithm is run with ρ = (1−ν)(n+2)/(n+4) and μ = β = (1−ν)/(n+4), we obtain the optimized bound E[DynReg(T)] = O(T^{(n+3+ν)/(n+4)}).

To the best of our knowledge, Proposition 3 is the first result of its kind for dynamic regret minimization in online non-convex problems with bandit feedback. We conjecture that the bounds of Propositions 2 and 3 can be tightened further to O(T^{(n+1)/(n+2)}) and O(T^{(n+2+ν)/(n+3)}) respectively by dropping the explicit exploration term; we defer this fine-tuning to future work.

Appendix A. Examples
In this appendix, we provide some more decomposable regularizers that are commonly used in the literature:
Example 7 (Log-barrier regularization). Let V = M(K) as above and consider the so-called Burg entropy h(p) = −∫_K log p [4]. In this case, Q = dom ∂h = {q ∈ Δ_cont : supp(q) = K} = dom h and ∇h(q) = −1/q on Q. In the finite-dimensional case, this regularizer plays a fundamental role in the affine scaling method of Karmarkar [35]; see e.g., Tseng [52], Vanderbei et al. [53] and references therein. The corresponding mirror map is obtained as follows: let L(p; λ) = ∫_K ψp + ∫_K log p − λ ∫_K p denote the Lagrangian of the problem (10), so q = Q(ψ) satisfies the first-order optimality condition

ψ + 1/q − λ = 0. (A.1)

Solving for q and integrating, we get ∫_K (λ − ψ)^{−1} = ∫_K q = 1. The function φ(λ) = ∫_K (λ − ψ)^{−1} is decreasing in λ and continuous whenever finite; moreover, since ψ ∈ L^∞(K), it follows that φ is always finite (and hence continuous) for large enough λ, and lim_{λ→∞} φ(λ) = 0. Since sup_λ φ(λ) = ∞, there exists a unique λ* such that φ(λ*) = 1, i.e., such that (A.1) holds (in practice, this can be located by a simple line search initialized at some λ > ‖ψ‖_∞). We thus get Q(ψ) = (λ* − ψ)^{−1}.

Example 8 (Tsallis entropy). A generalization of the Shannon–Gibbs entropy for nonextensive variables is the
Tsallis entropy [51], defined here as h(p) = ∫_K θ(p) where θ(z) = [γ(1−γ)]^{−1}(z − z^γ) for γ ∈ (0, 1], with the continuity convention (z − z^γ)/(1−γ) = z log z for γ = 1 (corresponding to the Shannon–Gibbs case). Working as in Example 7, we have Q = dom ∂h = {q ∈ Δ_cont : supp(q) = K} ⊊ dom h, and the corresponding mirror map q = Q(ψ) is obtained via the first-order stationarity equation

ψ − (1 − γ q^{γ−1})/(γ(1−γ)) − λ = 0. (A.2)

Then, solving for q yields Q(ψ) = (1−γ)^{1/(γ−1)} (μ − ψ)^{1/(γ−1)} with μ > ‖ψ‖_∞ chosen so that ∫_K Q(ψ) = 1.

Appendix B. Basic properties of regularizers and mirror maps
The goal of this appendix is to prove some basic results on regularizer functions and mirrormaps that will be used liberally in the sequel. Versions of the results presented here alreadyexist in the literature, but our infinite-dimensional setting introduces some subtleties thatrequire further care. For this reason, we state and prove all required results for completeness.We begin by recalling some definitions from the main part of the paper. First, we write
M ≡ M(K) for the space of all finite signed Radon measures on K equipped with the total variation norm ‖μ‖_TV = μ₊(K) + μ₋(K), where μ₊ (resp. μ₋) denotes the positive (resp. negative) part of μ coming from the Hahn–Jordan decomposition of signed measures on K. As we discussed in Section 3, we also assume given a model Banach space V containing the set of simple strategies X as an embedded subset and such that ‖·‖_TV ≤ α‖·‖ for some α > 0.

With all this in hand, we begin by discussing the well-posedness of Alg. 1. To that end, we have the following basic result:

Lemma B.1.
Let h be a regularizer on X. Then:
(1) Q(ψ) ∈ Q for all ψ ∈ V*; in particular:

q = Q(ψ) ⟺ ψ ∈ ∂h(q). (B.1)

(2) If q = Q(ψ) and p ∈ dom h, we have

⟨∇h(q), q − p⟩ ≤ ⟨ψ, q − p⟩. (B.2)

(3) The convex conjugate h*(ψ) = max_{p∈V} {⟨ψ, p⟩ − h(p)} is Fréchet differentiable and satisfies

D_v h*(ψ) = ⟨v, Q(ψ)⟩ for all ψ, v ∈ V*. (B.3)

Corollary 2.
Alg. 1 is well-posed, i.e., p_t ∈ Q for all t = 1, 2, ..., if û_t ∈ L^∞(K).

Proof (of Lemma B.1). We proceed item by item:
(1) First, since h is strongly convex and lower semi-continuous, the maximum in (10) is attained. Hence, by Fermat's rule for subdifferentials, q solves (10) if and only if ψ − ∂h(q) ∋ 0. We thus get the string of equivalences:

q = Q(ψ) ⟺ ψ − ∂h(q) ∋ 0 ⟺ ψ ∈ ∂h(q). (B.4)

In particular, this implies that ∂h(q) ≠ ∅, i.e., q ∈ dom ∂h =: Q, as claimed.
(2) To establish (B.2), it suffices to show that it holds for all p ∈ Q (by continuity). To do so, let

φ(t) = h(q + t(p − q)) − [h(q) + t⟨ψ, p − q⟩]. (B.5)
Since h is strongly convex and ψ ∈ ∂h(q) by (B.1), it follows that φ(t) ≥ 0 with equality if and only if t = 0. Moreover, note that ψ(t) = ⟨∇h(q + t(p − q)) − ψ, p − q⟩ is a continuous selection of subgradients of φ. Given that φ and ψ are both continuous on [0, 1], it follows that φ is continuously differentiable and φ′ = ψ on [0, 1]. Thus, with φ convex and φ(t) ≥ φ(0) for all t ∈ [0, 1], we conclude that

φ′(0) = ⟨∇h(q) − ψ, p − q⟩ ≥ 0,

from which our claim follows.
(3) Finally, the Fréchet differentiability of h* is a straightforward application of the envelope theorem, which is sometimes referred to in the literature as Danskin's theorem, cf. Berge [10, Chap. 4]. □

As we mentioned in the main text, much of our analysis revolves around the energy function (13) defined by means of the Fenchel–Young inequality. To formalize this, it will be convenient to introduce a more general pairing between p ∈ V and ψ ∈ V*, known as the Fenchel coupling. Following [40], this is defined as

F(p, ψ) = h(p) + h*(ψ) − ⟨ψ, p⟩ for all p ∈ dom h, ψ ∈ V*. (B.6)

The following series of lemmas gathers some basic properties of the Fenchel coupling. The first is a lower bound for the Fenchel coupling in terms of the ambient norm in V:

Lemma B.2.
Let h be a regularizer on X with strong convexity modulus K. Then, for all p ∈ dom h and all ψ ∈ V*, we have

F(p, ψ) ≥ (K/2) ‖Q(ψ) − p‖². (B.7)

Proof.
By the definition of F and the inequality (B.2), we have:

F(p, ψ) = h(p) + h*(ψ) − ⟨ψ, p⟩
  = h(p) + ⟨ψ, Q(ψ)⟩ − h(Q(ψ)) − ⟨ψ, p⟩
  ≥ h(p) − h(Q(ψ)) − ⟨∇h(Q(ψ)), p − Q(ψ)⟩
  ≥ (K/2) ‖Q(ψ) − p‖² (B.8)

where we used (B.2) in the second line, and the strong convexity of h in the last. □

Our next result is the primal-dual analogue of the so-called "three-point identity" for the Bregman divergence [22]:
Proposition B.1.
Let h be a regularizer on X, fix some p ∈ V, ψ, ψ⁺ ∈ V*, and let q = Q(ψ). Then:

F(p, ψ⁺) = F(p, ψ) + F(q, ψ⁺) + ⟨ψ⁺ − ψ, q − p⟩. (B.9)

Proof.
By definition:

F(p, ψ⁺) = h(p) + h*(ψ⁺) − ⟨ψ⁺, p⟩
F(p, ψ) = h(p) + h*(ψ) − ⟨ψ, p⟩. (B.10)

Thus, by subtracting the above, we get:

F(p, ψ⁺) − F(p, ψ) = h(p) + h*(ψ⁺) − ⟨ψ⁺, p⟩ − h(p) − h*(ψ) + ⟨ψ, p⟩
  = h*(ψ⁺) − h*(ψ) − ⟨ψ⁺ − ψ, p⟩
  = h*(ψ⁺) − ⟨ψ, Q(ψ)⟩ + h(Q(ψ)) − ⟨ψ⁺ − ψ, p⟩
  = h*(ψ⁺) − ⟨ψ, q⟩ + h(q) − ⟨ψ⁺ − ψ, p⟩
  = h*(ψ⁺) + ⟨ψ⁺ − ψ, q⟩ − ⟨ψ⁺, q⟩ + h(q) − ⟨ψ⁺ − ψ, p⟩
  = F(q, ψ⁺) + ⟨ψ⁺ − ψ, q − p⟩ (B.11)

where the third line uses the Fenchel–Young equality h*(ψ) = ⟨ψ, Q(ψ)⟩ − h(Q(ψ)), and our proof is complete. □

We are now in a position to state and prove a key inequality for the Fenchel coupling:
Proposition B.2.
Let h be a regularizer on X with convexity modulus K, fix some p ∈ dom h, and let q = Q(ψ) for some ψ ∈ V*. Then, for all v ∈ V*, we have:

F(p, ψ + v) ≤ F(p, ψ) + ⟨v, q − p⟩ + (1/(2K)) ‖v‖²_*. (B.12)

Proof.
Let q = Q(ψ), ψ⁺ = ψ + v, and q⁺ = Q(ψ⁺). Then, by the three-point identity (B.9), we have

F(p, ψ) = F(p, ψ⁺) + F(q⁺, ψ) + ⟨ψ − ψ⁺, q⁺ − p⟩. (B.13)

Hence, after rearranging:

F(p, ψ⁺) = F(p, ψ) − F(q⁺, ψ) + ⟨v, q⁺ − p⟩
  = F(p, ψ) − F(q⁺, ψ) + ⟨v, q − p⟩ + ⟨v, q⁺ − q⟩. (B.14)

By Young's inequality, we also have

⟨v, q⁺ − q⟩ ≤ (K/2) ‖q⁺ − q‖² + (1/(2K)) ‖v‖²_*. (B.15)

Thus, substituting in (B.14), we get

F(p, ψ⁺) ≤ F(p, ψ) + ⟨v, q − p⟩ + (1/(2K)) ‖v‖²_* − F(q⁺, ψ) + (K/2) ‖q⁺ − q‖². (B.16)

Our claim then follows by noting that F(q⁺, ψ) ≥ (K/2) ‖q⁺ − q‖² (cf. Lemma B.2 above). □

Appendix C. Regret derivations
Notation: from losses to payoffs.
In this appendix, we prove the general regret guarantees for Alg. 1. For notational convenience, we will switch in what follows from "losses" to "payoffs", i.e., we will assume that the learner is encountering a sequence of payoff functions u_t = −ℓ_t and gets as feedback the model û_t = −ℓ̂_t.

C.1. Basic bounds and preliminaries.
We begin by providing some template regret bounds that we will use as a toolkit in the sequel. As a warm-up, we prove the basic comparison lemma between simple and pure strategies:
Lemma 1.
Let U be a convex neighborhood of x in K and let χ ∈ X be a simple strategy supported on U. Then, Reg_x(T) ≤ Reg_χ(T) + L diam(U) T.

Proof. By Assumption 1, we have u_t(x) ≤ u_t(x′) + L‖x − x′‖ ≤ u_t(x′) + L diam(U) for all x′ ∈ U. Hence, taking expectations on both sides relative to χ, we get u_t(x) ≤ ⟨u_t, χ⟩ + L diam(U). Our claim then follows by summing over t = 1, 2, ..., T and invoking the definition of the regret. □

We now turn to the derivation of our main regret guarantees as outlined in Section 4. Much of the analysis to follow will revolve around the energy function (13) which, for convenience, we restate below in terms of the Fenchel coupling (B.6):

E_t := (1/η_t)[h(χ) + h*(η_t y_t) − ⟨η_t y_t, χ⟩] = (1/η_t) F(χ, η_t y_t). (13)

In words, E_t essentially measures the primal-dual "distance" between the benchmark strategy χ and the aggregate model y_t, taking into account the inflation of the latter by η_t in (DA).
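The Fenchel-coupling toolkit underlying this analysis can be sanity-checked in a finite-dimensional analogue, where h, h* and Q are available in closed form (discrete negentropy, log-sum-exp, and the logit map respectively). The sketch below is ours and only meant as a numerical illustration of (B.6), the three-point identity (B.9), and the key inequality (B.12):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                   # finite-dimensional analogue: the simplex in R^N

def h(p):        return float(np.sum(p * np.log(p)))        # negentropy
def h_star(psi): return float(np.log(np.sum(np.exp(psi))))  # conjugate: log-sum-exp
def Q(psi):                                                 # mirror (logit) map
    w = np.exp(psi - psi.max())
    return w / w.sum()
def F(p, psi):   return h(p) + h_star(psi) - float(np.dot(psi, p))  # Fenchel coupling (B.6)

p = rng.random(N); p /= p.sum()
psi, v = rng.normal(size=N), rng.normal(size=N)
q = Q(psi)

# three-point identity (B.9), with psi_plus = psi + v:
gap = F(p, psi + v) - (F(p, psi) + F(q, psi + v) + np.dot(v, q - p))
print(abs(gap))                         # ~ 0 (exact up to float error)

# key inequality (B.12); here K = 1 for negentropy w.r.t. the L1 norm (Pinsker),
# so the dual norm is the sup norm:
print(F(p, psi + v) <= F(p, psi) + np.dot(v, q - p) + 0.5 * np.max(np.abs(v)) ** 2)
```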
Our overall proof strategy will then be to relate the regret incurred by the optimizer to the evolution of E_t over time. To that end, an application of Abel's summation formula gives:

E_{t+1} − E_t = (1/η_{t+1}) F(χ, η_{t+1} y_{t+1}) − (1/η_t) F(χ, η_t y_t)
  = (1/η_{t+1}) F(χ, η_{t+1} y_{t+1}) − (1/η_t) F(χ, η_t y_{t+1}) (C.1a)
  + (1/η_t) F(χ, η_t y_{t+1}) − (1/η_t) F(χ, η_t y_t). (C.1b)

We now proceed to unpack the two terms (C.1a) and (C.1b) separately, beginning with the latter.

To do so, substituting p ← χ, ψ ← η_t y_t and ψ⁺ ← η_t y_{t+1} in Proposition B.1 yields

(C.1b) = (1/η_t)[F(χ, η_t y_t + η_t û_t) − F(χ, η_t y_t)]
  = (1/η_t)[F(p_t, η_t y_{t+1}) + ⟨η_t û_t, p_t − χ⟩]
  = F(p_t, η_t y_{t+1})/η_t + ⟨û_t, p_t − χ⟩ (C.2)

where we used the definition p_t = Q(η_t y_t) of p_t. We thus obtain the interim expression

E_{t+1} = E_t + (C.1a) + ⟨û_t, p_t − χ⟩ + F(p_t, η_t y_{t+1})/η_t. (C.3)

Moving forward, for the term (C.1a), the definition of the Fenchel coupling (B.6) readily yields:

(C.1a) = [1/η_{t+1} − 1/η_t] h(χ) + (1/η_{t+1}) h*(η_{t+1} y_{t+1}) − (1/η_t) h*(η_t y_{t+1}). (C.4)

Consider now the function φ(η) = η^{−1}[h*(ηψ) + min h] for arbitrary ψ ∈ L^∞(K). By Lemma B.1, h* is Fréchet differentiable with D_v h*(·) = ⟨v, Q(·)⟩ for all v ∈ V*, so a simple differentiation yields

φ′(η) = (1/η) ⟨ψ, Q(ηψ)⟩ − (1/η²)[h*(ηψ) + min h]
  = (1/η²)[⟨ηψ, Q(ηψ)⟩ − h*(ηψ) − min h]
  = (1/η²)[h(Q(ηψ)) − min h] ≥ 0, (C.5)

where we used the Fenchel–Young inequality as an equality in the last line. Since η_{t+1} ≤ η_t, the above shows that φ(η_t) ≥ φ(η_{t+1}). Hence, substituting ψ ← y_{t+1}, we ultimately obtain

(1/η_{t+1}) h*(η_{t+1} y_{t+1}) − (1/η_t) h*(η_t y_{t+1}) ≤ [1/η_t − 1/η_{t+1}] min h. (C.6)

Therefore, combining (C.3) and (C.6), we have proved the following template bound:

Lemma C.1.
For all χ ∈ X, the policy (DA) enjoys the bound

E_{t+1} ≤ E_t + ⟨û_t, p_t − χ⟩ + (1/η_{t+1} − 1/η_t)[h(χ) − min h] + (1/η_t) F(p_t, η_t y_{t+1}). (C.7)

We are now in a position to prove our basic energy inequality (restated below for convenience):

Lemma 2.
For all χ ∈ X, we have:

E_{t+1} ≤ E_t + ⟨û_t, p_t − χ⟩ + (η⁻¹_{t+1} − η⁻¹_t)[h(χ) − min h] + (η_t/(2K)) ‖û_t‖²_*. (14)

Proof.
Going back to Proposition B.2 and setting p ← p_t, ψ ← η_t y_t and v ← η_t û_t, we get

F(p_t, η_t y_{t+1}) ≤ F(p_t, η_t y_t) + ⟨η_t û_t, p_t − p_t⟩ + (η_t²/(2K)) ‖û_t‖²_* = (η_t²/(2K)) ‖û_t‖²_* (C.8)

where we used the fact that p_t = Q(η_t y_t), so F(p_t, η_t y_t) = 0 by the Fenchel–Young equality. Our claim then follows by dividing both sides by η_t and substituting in Lemma C.1. □

We will come back to these results as needed.

C.2.
Static regret guarantees.
We are now ready to prove our static regret results for Alg. 1. We begin with the precursor to our main result in that respect:
Proposition 1.
For any simple strategy χ ∈ X, Alg. 1 enjoys the bound

Reg_χ(T) ≤ η⁻¹_{T+1}[h(χ) − min h] + ∑_{t=1}^T ⟨e_t, p_t − χ⟩ + (1/(2K)) ∑_{t=1}^T η_t ‖û_t‖²_*. (12)

Proof.
Recalling the decomposition û_t = u_t + e_t for the learner's inexact models, a simple rearrangement of Lemma 2 gives

⟨u_t, χ − p_t⟩ ≤ E_t − E_{t+1} + ⟨e_t, p_t − χ⟩ + (η⁻¹_{t+1} − η⁻¹_t)[h(χ) − min h] + (η_t/(2K)) ‖û_t‖²_*. (C.9)

Thus, telescoping over t = 1, 2, ..., T, we get

Reg_χ(T) ≤ E_1 − E_{T+1} + (1/η_{T+1} − 1/η_1)[h(χ) − min h] + ∑_{t=1}^T ⟨e_t, p_t − χ⟩ + (1/(2K)) ∑_{t=1}^T η_t ‖û_t‖²_*
  ≤ (h(χ) − min h)/η_{T+1} + ∑_{t=1}^T ⟨e_t, p_t − χ⟩ + (1/(2K)) ∑_{t=1}^T η_t ‖û_t‖²_*, (C.10)

where we used the fact that E_t ≥ 0 for all t and E_1 = η_1^{−1}[h(χ) + h*(0)] = η_1^{−1}[h(χ) − min h]. □

As a simple application of Proposition 1, we get the following bound for simple comparators:
Corollary 3.
For all χ ∈ X, Alg. 1 guarantees

E[Reg_χ(T)] ≤ (h(χ) − min h)/η_{T+1} + 2 ∑_{t=1}^T B_t + (1/(2K)) ∑_{t=1}^T η_t E[‖û_t‖²_*]. (C.11)

Proof.
Simply take expectations over (12) and use the fact that

E[⟨e_t, p_t − χ⟩] = E[⟨E[e_t | F_t], p_t − χ⟩] = E[⟨b_t, p_t − χ⟩] ≤ E[‖b_t‖_∞ ‖p_t − χ‖] ≤ 2B_t. □

We are finally in a position to prove the main static regret guarantee of Alg. 1:
Theorem 1.
Fix x ∈ K, let C be a convex neighborhood of x in K, and suppose that Alg. 1 is run with a decomposable regularizer h(p) = ∫_K θ∘p. Then, letting φ(z) = zθ(1/z) for z > 0, we have:

E[Reg_x(T)] ≤ [φ(λ(C)) − φ(λ(K))]/η_{T+1} + L diam(C) T + 2 ∑_{t=1}^T B_t + (α²/(2K)) ∑_{t=1}^T η_t M_t. (15)

In particular, if Alg. 1 is run with learning rate η_t ∝ 1/t^ρ, ρ ∈ (0, 1), and inexact models such that B_t = O(1/t^β) and M_t = O(t^μ) for some β, μ ≥ 0, we have:

E[Reg(T)] = O(φ(T^{−nκ}) T^ρ + T^{1−κ} + T^{1−β} + T^{1+μ−ρ}) for all κ ≥ 0. (16)
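Before turning to the proof, the interplay between κ, ρ, β and μ in (16) can be checked mechanically. The helper below (ours, with exact rational arithmetic) recovers the two specializations of (17) — the √T Hedge rate and the (n+1)/(n+2) rate for the quadratic regularizer:

```python
from fractions import Fraction

def hedge_exponents(rho, kappa, beta, mu):
    """Exponents in (16) for negentropy, where phi(T^{-n*kappa}) is only logarithmic:
    T^rho, T^{1-kappa}, T^{1-beta}, T^{1+mu-rho} (log factors dropped)."""
    return (rho, 1 - kappa, 1 - beta, 1 + mu - rho)

def quadratic_exponents(n, rho, kappa, beta, mu):
    """Exponents in (16) for theta(z) = z^2/2, where phi(z) = 1/(2z),
    so phi(T^{-n*kappa}) contributes a factor T^{n*kappa}."""
    return (n * kappa + rho, 1 - kappa, 1 - beta, 1 + mu - rho)

half = Fraction(1, 2)
# Unbiased, bounded-mean-square feedback: mu = 0 and the T^{1-beta} term drops.
print(max(hedge_exponents(half, half, 1, 0)))           # 1/2: the Hedge rate of (17)
n = 4
rho = Fraction(1, n + 2)                                # take kappa = rho as in (17)
print(max(quadratic_exponents(n, rho, rho, 1, 0)))      # 5/6 = (n+1)/(n+2)
```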
Proof.
To simplify the proof, we will make the normalizing assumption θ(0) = 0; if this is not the case, θ can always be shifted by θ(0) for this condition to hold. [Note that Examples 4 and 5 both satisfy this convention.]

With this in mind, let C be a convex neighborhood of x in K, and let unif_C = λ(C)^{−1} 𝟙_C denote the (simple) strategy that assigns uniform probability to the elements of C and zero to all other points in K. We then have:

h(unif_C) = ∫_K θ(unif_C) = ∫_K θ(𝟙_C/λ(C)) = ∫_C θ(1/λ(C)) = λ(C) θ(1/λ(C)) = φ(λ(C)). (C.12)

Moreover, since h is decomposable and the probability constraint ∫_K χ = 1 is symmetric, the minimum of h over X will be attained at the uniform strategy unif_K = λ(K)^{−1} 𝟙_K. Thus, with X weakly dense in dom h, we obtain

min h = h(unif_K) = ∫_K θ(𝟙_K/λ(K)) = φ(λ(K)). (C.13)

In view of all this, Corollary 3 applied to χ = unif_C yields

E[Reg_χ(T)] ≤ [φ(λ(C)) − φ(λ(K))]/η_{T+1} + 2 ∑_{t=1}^T B_t + (α²/(2K)) ∑_{t=1}^T η_t M_t, (C.14)

where we used the fact that ‖·‖_TV ≤ α‖·‖, so ‖·‖_* ≤ α‖·‖_∞. The bound (15) then follows by combining the above with Lemma 1.

Regarding the bound (16), we first note that this is not a pseudo-regret bound but a bona fide bound for the learner's expected regret (so we cannot simply maximize our point-dependent bound over x ∈ K). In light of this, our first step will be to consider a "uniform" simple approximant for every x ∈ K. To that end, building on an idea by Blum & Kalai [12] and Krichene et al. [38], fix a shrinkage factor δ > 0 and let K_δ(x) = {x + δ(x′ − x) : x′ ∈ K} ⊆ K denote the homothetic transformation that shrinks K to a fraction δ of its original size and then transports it to x ∈ K. By construction, we have x ∈ K_δ(x) ⊆ K and, moreover, diam(K_δ(x)) = δ diam(K) and λ(K_δ(x)) = δⁿ λ(K).
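The computation (C.12) is easy to verify numerically; the sketch below (ours, purely illustrative) discretizes K = [0, 1], takes the negentropy θ(z) = z log z — for which φ(z) = zθ(1/z) = log(1/z) — and compares h(unif_C) against φ(λ(C)):

```python
import numpy as np

N = 1000
dx = 1.0 / N
x = (np.arange(N) + 0.5) / N          # grid discretizing K = [0, 1]

def h_entropy(p):                     # h(p) = ∫_K p log p (grid quadrature)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask]) * dx))

def phi(z):                           # phi(z) = z * theta(1/z) = log(1/z) for negentropy
    return float(np.log(1.0 / z))

C = (x > 0.2) & (x < 0.5)             # an interval of Lebesgue measure lam_C = 0.3
lam_C = float(C.sum() * dx)
unif_C = C / lam_C                    # uniform density supported on C
print(h_entropy(unif_C), phi(lam_C))  # equal: h(unif_C) = phi(lam(C)), as in (C.12)
```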
Then, letting μ_x := unif_{K_δ(x)} denote the uniform strategy supported on K_δ(x), we get

E[Reg(T)] = E[max_{x∈K} Reg_x(T)] ≤ E[max_{x∈K} Reg_{μ_x}(T)] + δL diam(K) T, (C.15)

where, in the last step, we used Lemma 1.

Now, by Proposition 1, we have

Reg_{μ_x}(T) ≤ [h(μ_x) − min h]/η_{T+1} + ∑_{t=1}^T ⟨e_t, p_t − μ_x⟩ + (1/(2K)) ∑_{t=1}^T η_t ‖û_t‖²_*
  ≤ [φ(δⁿλ(K)) − φ(λ(K))]/η_{T+1} + ∑_{t=1}^T ⟨e_t, p_t − μ_x⟩ + (α²/(2K)) ∑_{t=1}^T η_t ‖û_t‖²_∞ (C.16)

and hence

E[max_{x∈K} Reg_{μ_x}(T)] ≤ [φ(δⁿλ(K)) − φ(λ(K))]/η_{T+1} + E[max_{x∈K} ∑_{t=1}^T ⟨e_t, p_t − μ_x⟩] + (α²/(2K)) ∑_{t=1}^T η_t M_t. (C.17)

Thus, to proceed, it suffices to bound the second term of the above expression.

To do so, introduce the auxiliary process

ỹ_{t+1} = ỹ_t − z_t,  p̃_{t+1} = Q(η_{t+1} ỹ_{t+1}), (C.18)

with p̃_1 = p_1. We then have (recalling the decomposition e_t = b_t + z_t of the error into its bias and noise components):

∑_{t=1}^T ⟨e_t, p_t − μ_x⟩ = ∑_{t=1}^T ⟨e_t, (p_t − p̃_t) + (p̃_t − μ_x)⟩
  = ∑_{t=1}^T ⟨e_t, p_t − p̃_t⟩ + ∑_{t=1}^T ⟨b_t, p̃_t − μ_x⟩ + ∑_{t=1}^T ⟨z_t, p̃_t − μ_x⟩ (C.19)

so it suffices to derive a bound for each of these terms. This can be done as follows:

(1) The first term of (C.19) does not depend on x, so we have

E[max_{x∈K} ∑_{t=1}^T ⟨e_t, p_t − p̃_t⟩] = ∑_{t=1}^T E[E[⟨e_t, p_t − p̃_t⟩ | F_t]] ≤ 2 ∑_{t=1}^T B_t (C.20)

where, in the last step, we used the definition (7a) of B_t and the bound

⟨b_t, p_t − p̃_t⟩ ≤ ‖p_t − p̃_t‖ ‖b_t‖_∞ ≤ 2B_t.
(C.21)

(2) The second term of (C.19) can be similarly bounded as

E[max_{x∈K} ∑_{t=1}^T ⟨b_t, p̃_t − μ_x⟩] ≤ ∑_{t=1}^T E[‖p̃_t − μ_x‖ ‖b_t‖_∞] ≤ 2 ∑_{t=1}^T B_t. (C.22)

(3) The third term is more challenging; the main idea will be to apply Proposition 1 to the sequence −z_t, t = 1, 2, ..., viewed itself as a sequence of virtual payoff functions. Doing just that, we get:

∑_{t=1}^T ⟨z_t, p̃_t − μ_x⟩ ≤ [h(μ_x) − min h]/η_{T+1} + (1/(2K)) ∑_{t=1}^T η_t ‖z_t‖²_*
  ≤ [φ(δⁿλ(K)) − φ(λ(K))]/η_{T+1} + (α²/(2K)) ∑_{t=1}^T η_t ‖z_t‖²_∞. (C.23)

Thus, after maximizing and taking expectations, we obtain

E[max_{x∈K} ∑_{t=1}^T ⟨z_t, p̃_t − μ_x⟩] ≤ [φ(δⁿλ(K)) − φ(λ(K))]/η_{T+1} + (α²/(2K)) ∑_{t=1}^T η_t σ_t. (C.24)

Therefore, plugging Eqs. (C.20), (C.22) and (C.24) into (C.19) and substituting the result in (C.17), we finally get

E[max_{x∈K} Reg_{μ_x}(T)] ≤ 2[φ(δⁿλ(K)) − φ(λ(K))]/η_{T+1} + 4 ∑_{t=1}^T B_t + (α²/(2K)) ∑_{t=1}^T η_t (M_t + σ_t). (C.25)

The guarantee (16) then follows by taking δⁿλ(K) = T^{−nκ} for some κ ≥ 0 and plugging everything back in (C.15). □

C.3.
Dynamic regret guarantees.
We now turn to the algorithm's dynamic regret guarantees, as encoded by Theorem 2 (restated below for convenience):
Theorem 2.
Suppose that the Hedge variant of Alg. 1 is run with learning rate η_t ∝ 1/t^ρ and inexact models with B_t = O(1/t^β) and M_t = O(t^μ) for some β, μ ≥ 0. Then:

E[DynReg(T)] = O(T^{1+μ−ρ} + T^{1−β} + T^{2ρ−μ} V_T). (19)
In particular, if V_T = O(T^ν) for some ν < 1 and the learner's feedback is unbiased and bounded in mean square (i.e., B_t = 0 and sup_t M_t < ∞), the choice ρ = (1−ν)/3 guarantees

E[DynReg(T)] = O(T^{(2+ν)/3}). (20)

Proof of Theorem 2.
As we discussed in the main body of our paper, our proof strategy will be to decompose the horizon of play into m virtual segments, estimate the learner's regret over each segment, and then compare the learner's regret per segment to the corresponding dynamic regret over said segment. We stress here again that this partition is only made for the sake of the analysis, and does not involve restarting the algorithm – e.g., as in Besbes et al. [11].

To make this precise, we first partition the interval T = [1..T] into m contiguous segments T_k, k = 1, ..., m, each of length Δ (except possibly the m-th one, which might be smaller). More explicitly, take the window length to be of the form Δ = ⌈T^γ⌉ for some constant γ ∈ [0, 1] to be determined later. In this way, the number of windows is m = ⌈T/Δ⌉ = Θ(T^{1−γ}) and the k-th window will be of the form T_k = [(k−1)Δ + 1 .. kΔ] for all k = 1, ..., m−1 (the value k = m is excluded as the m-th window might be smaller). For concision, we will denote the learner's static regret over the k-th window as Reg(T_k) = max_{x∈K} ∑_{t∈T_k} ⟨u_t, δ_x − p_t⟩ (and likewise for its dynamic counterpart).

To proceed, let S ⊆ T be a sub-interval of T and write x*_S ∈ arg max_{x∈K} ∑_{s∈S} u_s(x) for any action that is optimal on average over the interval S. To ease notation, we also write x*_t ≡ x*_{{t}} ∈ arg max_{x∈K} u_t(x) for any action that is optimal at time t, and x*_k ≡ x*_{T_k} for any action that is optimal on average over the k-th window. Then, for all t ∈ T_k, k = 1, 2, ..., m, we have

⟨u_t, δ_{x*_t} − p_t⟩ = ⟨u_t, δ_{x*_k} − p_t⟩ + [u_t(x*_t) − u_t(x*_k)] (C.26)

so the learner's dynamic regret over T_k can be bounded as

DynReg(T_k) = ∑_{t∈T_k} ⟨u_t, δ_{x*_k} − p_t⟩ + ∑_{t∈T_k} [u_t(x*_t) − u_t(x*_k)]
  = Reg(T_k) + ∑_{t∈T_k} [u_t(x*_t) − u_t(x*_k)].
(C.27)

Following a batch-comparison technique originally due to Besbes et al. [11], let τ_k = min T_k denote the beginning of the k-th window, and let x*_{τ_k} denote a maximizer of the first payoff function encountered in the window T_k (this choice could of course be arbitrary). Thus, given that x*_k maximizes the per-window aggregate ∑_{t∈T_k} u_t(x), we obtain:

∑_{t∈T_k} [u_t(x*_t) − u_t(x*_k)] ≤ ∑_{t∈T_k} [u_t(x*_t) − u_t(x*_{τ_k})] ≤ |T_k| max_{t∈T_k} [u_t(x*_t) − u_t(x*_{τ_k})] ≤ 2Δ V_k, (C.28)

where we let V_k = ∑_{t∈T_k} ‖u_{t+1} − u_t‖_∞ (for the last step, note that u_t(x*_t) − u_t(x*_{τ_k}) ≤ [u_t(x*_t) − u_{τ_k}(x*_t)] + [u_{τ_k}(x*_{τ_k}) − u_t(x*_{τ_k})] ≤ 2V_k, since u_{τ_k}(x*_t) ≤ u_{τ_k}(x*_{τ_k})). In turn, combining (C.28) with (C.27), we get:

DynReg(T_k) ≤ Reg(T_k) + 2Δ V_k, (C.29)

and hence, after summing over all windows:

DynReg(T) ≤ ∑_{k=1}^m Reg(T_k) + 2Δ V_T. (C.30)

Now, Theorem 1 applied to the Hedge variant of Alg. 1 readily yields

E[Reg(T_k)] = O((kΔ)^ρ + Δ^{1−κ} + ∑_{t∈T_k} t^{−β} + ∑_{t∈T_k} t^{μ−ρ}) (C.31)

so, after summing over all windows, we have

∑_{k=1}^m E[Reg(T_k)] = O(Δ^ρ ∑_{k=1}^m k^ρ + m Δ^{1−κ} + ∑_{t=1}^T t^{−β} + ∑_{t=1}^T t^{μ−ρ})
  = O(Δ^ρ m^{1+ρ} + m Δ^{1−κ} + T^{1−β} + T^{1+μ−ρ}). (C.32)

Since Δ = O(T^γ) and m = O(T/Δ) = O(T^{1−γ}), we get

Δ^ρ m^{1+ρ} = O((mΔ)^ρ m) = O(T^{γρ} T^{(1−γ)(1+ρ)}) = O(T^{1+ρ−γ}) (C.33)

and, likewise,

m Δ^{1−κ} = O(T Δ^{−κ}) = O(T · T^{−γκ}) = O(T^{1−γκ}). (C.34)

Then, substituting in (C.32) and (C.30), we finally get the dynamic regret bound

E[DynReg(T)] = O(T^{1+ρ−γ} + T^{1−γκ} + T^{1−β} + T^{1+μ−ρ} + T^γ V_T). (C.35)

To balance the above expression, we take γ = 2ρ − μ for the window size exponent (which calibrates the first and fourth terms in the sum above) and κ = β/γ = β/(2ρ − μ) (for the second and the third). In this way, we finally obtain

E[DynReg(T)] = O(T^{1−β} + T^{1+μ−ρ} + T^{2ρ−μ} V_T) (C.36)

and our proof is complete. □

Appendix D.
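The exponent bookkeeping behind this balancing step can be replayed mechanically. The sketch below (ours, with exact rational arithmetic) checks that, for unbiased feedback with bounded mean square (μ = 0, with the T^{1−β} term dropping out) and ρ = (1−ν)/3 as in (20), the surviving terms of (C.35) all collapse to the exponent (2+ν)/3:

```python
from fractions import Fraction

def dynreg_exponents(rho, mu, beta, kappa, gamma, nu):
    """Exponents of the five terms in (C.35):
    T^{1+rho-gamma}, T^{1-gamma*kappa}, T^{1-beta}, T^{1+mu-rho}, and T^gamma * V_T = T^{gamma+nu}."""
    return (1 + rho - gamma, 1 - gamma * kappa, 1 - beta, 1 + mu - rho, gamma + nu)

nu = Fraction(1, 3)                    # V_T = O(T^{1/3})
rho = (1 - nu) / 3                     # the choice in (20)
mu = Fraction(0)
gamma = 2 * rho - mu                   # window exponent: gamma = 2*rho - mu
e = dynreg_exponents(rho, mu, Fraction(1), Fraction(1), gamma, nu)
print(e[0], e[3], e[4])                # 7/9 7/9 7/9, i.e., (2 + nu)/3
```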
Derivations for the bandit framework
In this appendix, we derive guarantees for the Hedge variant of Alg. 2 using the template bounds from Appendix C. We start by stating some preliminary results that are used in the sequel.

D.1. Preliminary results. We first present a technical bound for the convex conjugate of the entropic regularizer (more on this below):
Lemma D.1.
For all ψ, v ∈ V*, there exists ξ ∈ [0, 1] such that:

    log( ∫_K exp(ψ + v) ) ≤ log( ∫_K exp(ψ) ) + ⟨v, Λ(ψ)⟩ + (1/2) ⟨v², Λ(ψ + ξv)⟩.    (D.1)

Proof. Consider the function φ: [0, 1] → R with φ(t) = log( ∫_K exp(ψ + tv) ). By construction, φ(0) = log( ∫_K exp(ψ) ) and φ(1) = log( ∫_K exp(ψ + v) ). Thus, by a second-order Taylor expansion with Lagrange remainder, we have:

    φ(1) = φ(0) + φ′(0) + (1/2) φ″(ξ)    (D.2)

for some ξ ∈ [0, 1].

Now, for all t ∈ [0, 1], φ′(t) = ∫_K v exp(ψ + tv) / ∫_K exp(ψ + tv), which in turn gives

    φ′(0) = ∫_K v exp(ψ) / ∫_K exp(ψ) = ⟨v, Λ(ψ)⟩.    (D.3)

As for the second-order derivative of φ, we have, for all t ∈ [0, 1]:

    φ″(t) = (∂/∂t) [ ∫_K v exp(ψ + tv) / ∫_K exp(ψ + tv) ]
          = ∫_K v² exp(ψ + tv) / ∫_K exp(ψ + tv) − ( ∫_K v exp(ψ + tv) )² / ( ∫_K exp(ψ + tv) )²
          ≤ ∫_K v² exp(ψ + tv) / ∫_K exp(ψ + tv).    (D.4)

Thus, for all t ∈ [0, 1], we get

    φ″(t) ≤ ⟨v², Λ(ψ + tv)⟩.    (D.5)

Our claim then follows by injecting (D.3) and (D.5) into (D.2). ∎

In the next lemma, we present an expression of the Fenchel coupling in the specific case of the negentropy regularizer h(p) = ∫_K p log p.

Lemma D.2.
In the case of the negentropy regularizer h(p) = ∫_K p log p, the Fenchel coupling for all ψ ∈ V* and p ∈ dom h is given by

    F(p, ψ) = ∫_K p log p + log( ∫_K exp(ψ) ) − ⟨ψ, p⟩.    (D.6)

Proof. We recall the general expression of the Fenchel coupling given in (B.6):

    F(p, ψ) = h(p) + h*(ψ) − ⟨ψ, p⟩  for all p ∈ dom h, ψ ∈ V*,    (D.7)

where h*(ψ) = max_{p∈V} { ⟨ψ, p⟩ − h(p) }. In the case of the negentropy regularizer h(p) = ∫_K p log p, we have arg max_{p∈V} { ⟨ψ, p⟩ − h(p) } = Λ(ψ), and hence

    h*(ψ) = ⟨ψ, Λ(ψ)⟩ − h(Λ(ψ)).    (D.8)

Combining the above, and using the fact that log Λ(ψ) = ψ − log( ∫_K exp(ψ) ), we then get:

    h*(ψ) = ∫_K ψ Λ(ψ) − ∫_K Λ(ψ) log Λ(ψ)
          = ∫_K ψ Λ(ψ) − ∫_K Λ(ψ) ψ + log( ∫_K exp(ψ) ) ∫_K Λ(ψ)
          = log( ∫_K exp(ψ) ),

which, combined with (B.6), delivers (D.6). ∎

Finally, we state a result enabling us to control the difference between the regrets Reg(T) and ~Reg(T) induced respectively by two policies p_t and p̃_t against the same rewards and models.

Lemma D.3. For t = 1, …, T, let p_t, p̃_t be two policies with respective regrets Reg(T) and ~Reg(T) against a given sequence of models (û_t)_t for the rewards (u_t)_t. Then:

    Reg(T) ≤ ~Reg(T) + ∑_{t=1}^T ‖p_t − p̃_t‖_∞.    (D.9)

Proof.
See Slivkins [47, Chap. 6]. ∎
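As a quick sanity check of this comparison principle, the following discrete analogue (our own illustration, not part of the paper's argument) verifies that, for payoffs in [0, 1] over a finite action set, the regret gap between two mixed-strategy sequences is controlled by their cumulative L1 distance, since |⟨u_t, p̃_t − p_t⟩| ≤ ‖u_t‖_∞ ‖p_t − p̃_t‖_1:

```python
import numpy as np

# Discrete sanity check of the policy-comparison bound (our illustration):
# Reg(T) <= ~Reg(T) + sum_t ||p_t - ~p_t||_1 when payoffs lie in [0, 1].
rng = np.random.default_rng(0)
T, K = 200, 5
U = rng.random((T, K))                      # reward models u_t, values in [0, 1]
P = rng.dirichlet(np.ones(K), size=T)       # policies p_t
P_til = rng.dirichlet(np.ones(K), size=T)   # policies ~p_t

def static_regret(U, P):
    # max_x sum_t <u_t, delta_x - p_t>
    return U.sum(axis=0).max() - (U * P).sum()

gap = np.abs(P - P_til).sum(axis=1)         # ||p_t - ~p_t||_1 per round
lhs = static_regret(U, P)
rhs = static_regret(U, P_til) + gap.sum()
assert lhs <= rhs + 1e-9
```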
D.2. Hedge-specific bounds. We are now ready to adapt the template bound of Lemma C.1 to the Hedge case.
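Before stating the lemma, it may help to see the Hedge policy in concrete form. The sketch below is our own discretized stand-in (uniform grid for K, a fixed smooth reward in place of the models, and an illustrative exponent ρ = 3/4): it runs the dual-averaging score update y_{t+1} = y_t + û_t together with the logit map p_t = Λ(η_t y_t) ∝ exp(η_t y_t):

```python
import numpy as np

# Minimal sketch (our assumptions) of entropic dual averaging on a grid.
grid = np.linspace(0.0, 1.0, 101)       # stand-in for K = [0, 1]

def logit_map(psi):
    # Gibbs/logit map: Lambda(psi) proportional to exp(psi), stabilized.
    w = np.exp(psi - psi.max())
    return w / w.sum()

y = np.zeros_like(grid)                 # dual-averaging scores y_t
for t in range(1, 51):
    eta = t ** -0.75                    # eta_t = 1 / t^rho with rho = 3/4
    p = logit_map(eta * y)              # mixed strategy p_t = Lambda(eta_t y_t)
    u = 1.0 - (grid - 0.3) ** 2         # fixed smooth reward model for round t
    y += u                              # score update y_{t+1} = y_t + u_t

# The policy concentrates around the maximizer of the (here constant) reward:
assert abs(grid[p.argmax()] - 0.3) < 1e-9
assert abs(p.sum() - 1.0) < 1e-9
```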
Lemma D.4.
Assume that the regularizer h is the negentropy h(p) = ∫_K p log p and that the mirror map Q corresponds to the logit operator Λ. Then there exists ξ ∈ [0, 1] such that, for all χ ∈ X, the policy (DA) enjoys the bound:

    E_{t+1} ≤ E_t + ⟨û_t, p_t − χ⟩ + (1/η_{t+1} − 1/η_t) [h(χ) − min h] + (η_t/2) G_t(ξ),    (D.10)

where, for all ξ ∈ [0, 1], G_t(ξ) = ⟨Λ(η_t y_t + ξη_t û_t), û_t²⟩.

Proof. We know from Lemma C.1 that the policy (DA) enjoys the bound:

    E_{t+1} ≤ E_t + ⟨û_t, p_t − χ⟩ + (1/η_{t+1} − 1/η_t) [h(χ) − min h] + (1/η_t) F(p_t, η_t y_{t+1}).    (D.11)

The following lemma will help us handle the Fenchel coupling term in (D.11):

Lemma D.5. For a given t in the policy (DA), there exists ξ ∈ [0, 1] such that the following bound holds:

    F(p_t, η_t y_{t+1}) ≤ (η_t²/2) G_t(ξ).    (D.12)

Injecting the result of Lemma D.5 in Eq. (D.11) yields the stated claim. ∎

Moving forward, we are only left to prove Lemma D.5.
Proof.
Since we are in the case of the negentropy regularizer, Lemma D.2 enables us to rewrite the Fenchel coupling term of (D.11) as:

    F(p_t, η_t y_{t+1}) = ∫_K p_t log p_t + log( ∫_K exp(η_t y_{t+1}) ) − ⟨η_t y_{t+1}, p_t⟩.    (D.13)

Injecting y_{t+1} = y_t + û_t in (D.13) yields:

    F(p_t, η_t y_{t+1}) = ∫_K p_t log p_t + log( ∫_K exp(η_t y_t + η_t û_t) ) − ⟨η_t y_t, p_t⟩ − ⟨η_t û_t, p_t⟩
                        = F(p_t, η_t y_t) + log( ∫_K exp(η_t y_t + η_t û_t) ) − log( ∫_K exp(η_t y_t) ) − ⟨η_t û_t, p_t⟩
                        = log( ∫_K exp(η_t y_t + η_t û_t) ) − log( ∫_K exp(η_t y_t) ) − ⟨η_t û_t, p_t⟩,    (D.14)

where we used the fact that F(p_t, η_t y_t) = 0.

Now, by Lemma D.1 applied to ψ ← η_t y_t and v ← η_t û_t, there exists ξ ∈ [0, 1] such that

    log( ∫_K exp(η_t y_t + η_t û_t) ) ≤ log( ∫_K exp(η_t y_t) ) + η_t ⟨û_t, Λ(η_t y_t)⟩ + (η_t²/2) ⟨Λ(η_t y_t + ξη_t û_t), û_t²⟩,    (D.15)

where we used the fact that p_t = Λ(η_t y_t). Our claim then follows by injecting (D.15) into our prior expression (D.14) for the Fenchel coupling F(p_t, η_t y_{t+1}) in the case of the Hedge variant. ∎

Proposition D.1. If we run the Hedge variant of Alg. 1, there exists a sequence ξ_t ∈ [0, 1] such that:

    E[Reg_x(T)] ≤ log(λ(K)/λ(C)) / η_{T+1} + L diam(C) T + 2 ∑_{t=1}^T B_t + (1/2) ∑_{t=1}^T η_t E[G_t(ξ_t) | F_t],    (D.16)

where C is a convex neighborhood of x in K.

Proof. This result is obtained by using the template bound given in Lemma D.4 and then proceeding exactly as in the proofs of Proposition 1 and Theorem 1. ∎

We stress here that Proposition D.1 does not correspond to the Hedge instantiation of Theorem 1. Indeed, the second-order term ∑_{t=1}^T η_t E[G_t(ξ_t) | F_t] builds on results that are specific to Hedge, and is a priori considerably sharper than α_K ∑_{t=1}^T η_t M_t, the second-order term of Theorem 1.
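The Hedge-specific sharpening discussed above ultimately rests on the log-partition expansion of Lemma D.1. As an illustration (ours, in a discrete setting, and with the second-order factor ⟨v², Λ(ψ + ξv)⟩ replaced by the cruder uniform bound max_x v(x)², which dominates it for every ξ ∈ [0, 1]), the inequality can be checked numerically:

```python
import numpy as np

# Discrete check (our illustration) of the expansion behind Lemma D.1:
# log S(psi + v) <= log S(psi) + <v, Lambda(psi)> + (1/2) max v^2,
# where S(psi) = sum(exp(psi)) and Lambda(psi) = exp(psi) / S(psi).
rng = np.random.default_rng(1)

def lse(psi):
    # log-partition, stabilized
    m = psi.max()
    return m + np.log(np.exp(psi - m).sum())

def gibbs(psi):
    # discrete logit map Lambda(psi)
    w = np.exp(psi - psi.max())
    return w / w.sum()

for _ in range(100):
    psi, v = rng.normal(size=8), rng.normal(size=8)
    lhs = lse(psi + v)
    rhs = lse(psi) + gibbs(psi) @ v + 0.5 * (v ** 2).max()
    assert lhs <= rhs + 1e-9
```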
D.3. Guarantees for Alg. 2. For clarity, we begin by recalling the specific assumptions relative to Alg. 2. In particular, we are still considering throughout a dual averaging policy (DA) with a negentropy regularizer. We additionally assume that, at each round t, we receive a model û_t built according to the "smoothing" approach described in Section 5, i.e., for all t:

    û_t(x) = K_t(x_t, x) · u_t(x_t) / p_t(x_t),    (D.17)

where K_t: K × K → R is a (time-varying) smoothing kernel, i.e., ∫_K K_t(x, x′) dx′ = 1 for all x ∈ K. For concreteness (and sampling efficiency), we will assume that payoffs now take values in [0, 1], and we will focus on simple kernels that are supported on a neighborhood U_δ(x) = B_δ(x) ∩ K of x in K and are constant therein, i.e., K_δ(x, x′) = [λ(U_δ(x))]^{−1} 𝟙{‖x′ − x‖ ≤ δ}. Finally, we incorporate in p_t an explicit exploration term of the form ε_t/λ(K).

Under these assumptions, we may now bound both the bias and the variance terms in (D.16).

Lemma D.6.
The following inequality holds, where L is a uniform Lipschitz constant for the reward functions u_t (as described in Assumption 1):

    B_t ≤ L δ_t.    (D.18)

Moreover, there exists a constant C_K (depending only on the set K) such that:

    sup_{ξ∈[0,1]} E[G_t(ξ) | F_t] ≤ C_K δ_t^{−n} ε_t^{−1}.    (D.19)

Note that bounding the second-order term of Theorem 1 under the same assumptions would have yielded a δ_t^{−2n} factor instead of δ_t^{−n}, which is a strictly weaker result!

Proof.
We first prove (D.18). Using the fact that u_t(x) = ∫_K u_t(x) K_t(x_t, x) dx_t, we obtain:

    |E[û_t(x) − u_t(x) | F_t]| = | ∫_K K_t(x_t, x) (u_t(x_t)/p_t(x_t)) p_t(x_t) dx_t − ∫_K u_t(x) K_t(x_t, x) dx_t |
                               = | ∫_K (u_t(x_t) − u_t(x)) K_t(x_t, x) dx_t |
                               = [λ(U_{δ_t}(x))]^{−1} | ∫_K 𝟙{‖x_t − x‖ ≤ δ_t} (u_t(x_t) − u_t(x)) dx_t |
                               ≤ L [λ(U_{δ_t}(x))]^{−1} ∫_K 𝟙{‖x_t − x‖ ≤ δ_t} ‖x_t − x‖ dx_t ≤ L δ_t,    (D.20)

where the last step uses ∫_K 𝟙{‖x_t − x‖ ≤ δ_t} ‖x_t − x‖ dx_t ≤ λ(U_{δ_t}(x)) δ_t. This bound is uniform (it does not depend on the point x), and thus implies the stated inequality for B_t.

We now turn to (D.19). To that end, let ξ ∈ [0, 1]; we will prove a bound on E[G_t(ξ) | F_t] that is uniform in ξ. As a preliminary, it is important to note that, K being convex and compact, there exist constants C_K^M and C_K^m such that, for all x ∈ K, C_K^m δ_t^n ≤ λ(U_{δ_t}(x)) ≤ C_K^M δ_t^n.

Now, using G_t(ξ) = ⟨Λ(η_t y_t + ξη_t û_t), û_t²⟩ and û_t(x) = K_t(x_t, x) · u_t(x_t)/p_t(x_t), we may write:

    E[G_t(ξ) | F_t] = E[ ∫_K Λ(η_t y_t + ξη_t û_t)(x) K_t(x_t, x)² (u_t(x_t)²/p_t(x_t)²) dx | F_t ]
                    = ∫_K p_t(x′) (u_t(x′)²/p_t(x′)²) ( ∫_K Λ(η_t y_t + ξη_t û_t)(x) K_t(x′, x)² dx ) dx′
                    ≤ ∫_K (1/p_t(x′)) [λ(U_{δ_t}(x′))]^{−2} ( ∫_K Λ(η_t y_t + ξη_t û_t)(x) 𝟙{‖x′ − x‖ ≤ δ_t} dx ) dx′    [since u_t(x′)² ≤ 1]
                    ≤ λ(K) (C_K^m)^{−2} δ_t^{−2n} ε_t^{−1} ∫_K ( ∫_K Λ(η_t y_t + ξη_t û_t)(x) 𝟙{‖x′ − x‖ ≤ δ_t} dx ) dx′    [since p_t ≥ ε_t/λ(K)]
                    = λ(K) (C_K^m)^{−2} δ_t^{−2n} ε_t^{−1} ∫_K Λ(η_t y_t + ξη_t û_t)(x) ( ∫_K 𝟙{‖x′ − x‖ ≤ δ_t} dx′ ) dx    (Fubini)
                    ≤ λ(K) (C_K^m)^{−2} δ_t^{−2n} ε_t^{−1} C_K^M δ_t^n ∫_K Λ(η_t y_t + ξη_t û_t)(x) dx    [since ∫_K 𝟙{‖x′−x‖≤δ_t} dx′ = λ(U_{δ_t}(x)) ≤ C_K^M δ_t^n]
                    = ( λ(K) C_K^M / (C_K^m)² ) δ_t^{−n} ε_t^{−1},    (D.21)

where the last step uses ∫_K Λ(·) = 1. This bound depends only on K, and is notably independent of ξ ∈ [0, 1]; the result (D.19) follows directly. ∎

We are now ready to prove Proposition 2 and Eq. (23).
Proposition 2.
Suppose that the Hedge variant of Alg. 2 is run with learning rate η_t ∝ 1/t^ρ and smoothing/exploration schedules δ_t ∝ 1/t^µ, ε_t ∝ 1/t^β respectively. Then, the learner enjoys the bound

    E[Reg(T)] = O( T^ρ + T^{1−µ} + T^{1−β} + T^{1+nµ+β−ρ} ).    (22)

In particular, if the algorithm is run with ρ = (n+2)/(n+3) and µ = β = 1/(n+3), we obtain the bound E[Reg(T)] = O( T^{(n+2)/(n+3)} ).

Proof. Let us consider a slight modification of Alg. 2 in which:
• The models (û_t) received by the learner are the same as those generated by running Alg. 2 (even though these models were generated by Alg. 2, which does not exactly correspond to Hedge);
• At each round t, the action x_t is sampled according to p̃_t = Λ(η_t y_t), i.e., without taking into account the explicit exploration term.

The regret of this modified algorithm may be bounded using the Hedge template bound stated in Proposition D.1, since we are indeed considering the regret induced by Hedge against the sequence of reward models û_t. Then, writing ~Reg(T) for the regret induced by the policy (p̃_t)_t, we get

    E[~Reg_x(T)] ≤ log(λ(K)/λ(C)) / η_{T+1} + L diam(C) T + 2 ∑_{t=1}^T B_t + (1/2) ∑_{t=1}^T η_t E[G_t(ξ_t) | F_t].    (D.22)
Using the bounds presented in Lemma D.6, we then get:

    E[~Reg_x(T)] ≤ log(λ(K)/λ(C)) / η_{T+1} + L diam(C) T + 2L ∑_{t=1}^T δ_t + (1/2) C_K ∑_{t=1}^T η_t δ_t^{−n} ε_t^{−1}.    (D.23)

We are, however, interested in guarantees for Alg. 2, in which we play with the policy (p_t)_t, which slightly differs from the Hedge policy (p̃_t)_t. To that end, Lemma D.3 enables us to bound the difference between the regrets Reg(T) and ~Reg(T) induced by (p_t)_t and (p̃_t)_t respectively. Namely, we can write:

    Reg(T) ≤ ~Reg(T) + ∑_{t=1}^T ‖p̃_t − p_t‖_∞.    (D.24)

For any t ≥ 1 and x ∈ K, we have

    |p̃_t(x) − p_t(x)| = | p̃_t(x) − (1 − ε_t) p̃_t(x) − ε_t/λ(K) | = ε_t · |p̃_t(x) − 1/λ(K)| ≤ ε_t (1 + 1/λ(K)).    (D.25)

Injecting this in (D.24), we get

    Reg(T) ≤ ~Reg(T) + (1 + 1/λ(K)) ∑_{t=1}^T ε_t.    (D.26)

Finally, combining (D.26) with (D.23) yields:

    E[Reg_x(T)] ≤ log(λ(K)/λ(C)) / η_{T+1} + L diam(C) T + 2L ∑_{t=1}^T δ_t + (1/2) C_K ∑_{t=1}^T η_t δ_t^{−n} ε_t^{−1} + (1 + 1/λ(K)) ∑_{t=1}^T ε_t.    (D.27)

Now, using the same reasoning as in the proof of Theorem 1 with regard to the set C, and using η_t ∝ 1/t^ρ, δ_t ∝ 1/t^µ and ε_t ∝ 1/t^β, we straightforwardly get:

    E[Reg(T)] = O( T^ρ + T^{1−µ} + T^{1−β} + T^{1+nµ+β−ρ} ).

Finally, ρ = (n+2)/(n+3) and µ = β = 1/(n+3) gives the optimized bound E[Reg(T)] = O( T^{(n+2)/(n+3)} ). ∎

Proposition 3.
Suppose that the Hedge variant of Alg. 2 is run with parameters as in Proposition 2 against a stream of loss functions with variation V_T = O(T^ν). Then, the learner enjoys

    E[DynReg(T)] = O( T^{1+nµ+β−ρ} + T^{1−β} + T^{1−µ} + T^{ν+2ρ−nµ−β} ).    (23)

In particular, if the algorithm is run with ρ = (1−ν)(n+2)/(n+4) and µ = β = (1−ν)/(n+4), we obtain the optimized bound E[DynReg(T)] = O( T^{(n+3+ν)/(n+4)} ).

Proof. We use the same virtual segmentation as in the proof of Theorem 2. As a reminder, this means that we partition the interval T = [1‥T] into m contiguous segments T_k, k = 1, …, m, each of length ∆ (except possibly the m-th one, which might be smaller). More explicitly, take the window length to be of the form ∆ = ⌈T^γ⌉ for some constant γ ∈ [0, 1] to be determined later. In this way, the number of windows is m = ⌈T/∆⌉ = Θ(T^{1−γ}) and the k-th window is of the form T_k = [(k−1)∆ + 1 ‥ k∆] for all k = 1, …, m−1 (the value k = m is excluded as the m-th window might be smaller). For concision, we will denote the learner's static regret over the k-th window as Reg(T_k) = max_{x∈K} ∑_{t∈T_k} ⟨u_t, δ_x − p_t⟩ (and likewise for its dynamic counterpart).

Following the proof of Theorem 2 up to (C.30), we can still write in our bandit setting:

    DynReg(T) ≤ ∑_{k=1}^m Reg(T_k) + 2∆ V_T.    (D.28)

Now, Proposition 2 applied to the Hedge variant of Alg. 2 readily yields

    E[Reg(T_k)] = O( (k∆)^ρ + ∑_{t∈T_k} t^{−β} + ∑_{t∈T_k} t^{−µ} + ∑_{t∈T_k} t^{β+nµ−ρ} ),    (D.29)

so, after summing over all windows, we have

    ∑_{k=1}^m E[Reg(T_k)] = O( ∆^ρ ∑_{k=1}^m k^ρ + ∑_{t=1}^T t^{−β} + ∑_{t=1}^T t^{−µ} + ∑_{t=1}^T t^{β+nµ−ρ} )
                          = O( ∆^ρ m^{1+ρ} + T^{1−β} + T^{1−µ} + T^{1+β+nµ−ρ} ).    (D.30)

Since ∆ = O(T^γ) and m = O(T/∆) = O(T^{1−γ}), we get

    ∆^ρ m^{1+ρ} = O( (m∆)^ρ m ) = O( T^ρ · T^{1−γ} ) = O( T^{1+ρ−γ} ).    (D.31)

Then, substituting in (D.30) and (D.28), we finally get the dynamic regret bound

    E[DynReg(T)] = O( T^{1+ρ−γ} + T^{1−β} + T^{1−µ} + T^{1+β+nµ−ρ} + T^γ V_T ).    (D.32)

To balance the above expression, we take γ = 2ρ − β − nµ for the window size exponent (which calibrates the first and fourth terms in the sum above). In this way, we finally obtain

    E[DynReg(T)] = O( T^{1+nµ+β−ρ} + T^{1−β} + T^{1−µ} + T^{2ρ−nµ−β} V_T )    (D.33)

and our proof is complete. ∎

Appendix E. Numerical experiments
Our aim in this appendix is to provide some numerical illustrations of the theory presented in the rest of our paper. All numerical experiments were run on a machine with 48 CPUs (Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz), with 2 threads per core and 500 GB of RAM. For our simulation horizon of T iterations, we chose a reward function on [0, 1] that is a linear combination of trigonometric terms with different frequencies and amplitudes, drawn arbitrarily. Thanks to this analytic expression, we are able to calculate the learner's best action in hindsight (or instantaneously) and plot the relevant regret curves.

For illustration purposes, we compared two strategies, called "Grid" and "Kernel". The "Kernel" method is as outlined in Section 5 (cf. Alg. 2), with parameters described below. The "Grid" method involves partitioning the search space into a grid of a given mesh size (a hyperparameter of the algorithm) and then treating the problem as a finite-armed bandit; in particular, the "Grid" strategy employs the EXP3 algorithm [7] with rewards sampled at the grid points.

In Fig. 1, we plot the mean regret of both algorithms, with different hyperparameters, over T iterations. The confidence intervals are represented by the shaded areas, which correspond to the mean value of the regret modulated by the standard deviation over our sample runs of each algorithm (computed on 92 initialization seeds for sampling, kept constant across different runs for control validation). The dashed line in the figure corresponds to the theoretical bound on the Kernel algorithm's mean regret (without explicit exploration in our case). For performance evaluation purposes, we also "slice" different snapshots

Figure 1: Expected average regret, averaged over 92 realizations for each algorithm (solid lines). The variance is presented (shaded areas) by adding and removing the standard deviation (computed over the 92 seeds) from the mean. Finally, the theoretical regret bound is displayed (dashed line).
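For reference, the "Kernel" strategy can be sketched in a few lines. The snippet below is our own illustrative reimplementation (not the code used for the figures): the reward function, grid resolution, and horizon are arbitrary stand-ins, with the schedules of Proposition 2 instantiated for n = 1 (ρ = 3/4, µ = β = 1/4):

```python
import numpy as np

# Illustrative sketch (ours) of Alg. 2 ("Kernel") on K = [0, 1], discretized
# for sampling convenience; reward, grid, and horizon are stand-ins.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 201)
dx = grid[1] - grid[0]

def reward(x):
    # smooth surrogate for the paper's trigonometric reward mix
    return 0.5 + 0.5 * np.sin(3 * x) * np.cos(2 * x)

y = np.zeros_like(grid)                           # dual-averaging scores
for t in range(1, 2001):
    eta, delta, eps = t ** -0.75, 0.25 * t ** -0.25, t ** -0.25
    w = np.exp(eta * y - (eta * y).max())
    dens = (1 - eps) * w / (w.sum() * dx) + eps   # density with exploration (lambda(K) = 1)
    probs = dens * dx / (dens * dx).sum()         # grid sampling weights
    i = rng.choice(grid.size, p=probs)
    u_obs = reward(grid[i])                       # bandit feedback u_t(x_t)
    mask = np.abs(grid - grid[i]) <= delta        # support of the uniform kernel
    vol = mask.sum() * dx                         # lambda(U_delta(x_t))
    y += mask * u_obs / (vol * dens[i])           # u_hat_t = K_t(x_t, .) u_t(x_t) / p_t(x_t)

assert np.isfinite(y).all() and abs(probs.sum() - 1.0) < 1e-9
```

Over longer horizons the induced policy should concentrate around the reward maximizer; this is the behavior that the "Kernel" curves in Fig. 1 quantify.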
Figure 2: Two slices of the mean regret, averaged over 92 realizations for each algorithm. Whiskers at the 5–95% CI, boxes at the 25–75% CI, and medians displayed with vertical bars.

of the regret in Fig. 2 at two intermediate iteration counts. In both cases, we observe a dramatic drop in variance for the Kernel algorithm relative to the Grid strategy (run with a fixed number of arms, uniformly cut beforehand); we also note that the performance of the Kernel method approaches the theoretical slope that characterizes it. By contrast, the mean regret of the Grid approach seems to converge to a finite value, which indicates a much slower regret minimization rate; on the other hand, the mean regret of the Kernel method converges to 0 at the anticipated rate.

References

[1] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In
COLT '08: Proceedings of the 21st Annual Conference on Learning Theory, 2008.
[2] Naman Agarwal, Alon Gonen, and Elad Hazan. Learning in non-convex games with an optimization oracle. In COLT '19: Proceedings of the 32nd Annual Conference on Learning Theory, 2019.
[3] Rajeev Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, December 1995.
[4] Felipe Alvarez, Jérôme Bolte, and Olivier Brahic. Hessian Riemannian gradient flows in convex programming. SIAM Journal on Control and Optimization, 43(2):477–501, 2004.
[5] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
[6] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.
[7] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[8] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[9] Balabhaskar Balasundaram, Sergiy Butenko, and Illya V. Hicks. Clique relaxations in social network analysis: The maximum k-plex problem. Operations Research, 59(1):133–142, January 2011.
[10] Claude Berge. Topological Spaces. Dover, New York, 1997.
[11] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, October 2015.
[12] Avrim Blum and Adam Tauman Kalai. Universal portfolios with and without transaction costs. Machine Learning, 35(3):193–205, 1999.
[13] Immanuel M. Bomze, Marco Budinich, Panos M. Pardalos, and Marcello Pelillo. The maximum clique problem. In Handbook of Combinatorial Optimization. Springer, 1999.
[14] Mario Bravo and Panayotis Mertikopoulos. On the robustness of learning in games with stochastically perturbed payoff observations. Games and Economic Behavior, 103, John Nash Memorial issue:41–66, May 2017.
[15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–358, 2015.
[16] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[17] Sébastien Bubeck and Ronen Eldan. Multi-scale exploration of convex functions and bandit convex optimization. In COLT '16: Proceedings of the 29th Annual Conference on Learning Theory, 2016.
[18] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.
[19] Sébastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. In STOC '17: Proceedings of the 49th Annual ACM SIGACT Symposium on the Theory of Computing, 2017.
[20] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[21] Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems, volume 25, pp. 989–997, 2012.
[22] Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, August 1993.
[23] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In COLT '12: Proceedings of the 25th Annual Conference on Learning Theory, 2012.
[24] Benoit Duvocelle, Panayotis Mertikopoulos, Mathias Staudigl, and Dries Vermeulen. Learning in time-varying games. https://arxiv.org/abs/1809.03066, 2018.
[25] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA '05: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394, 2005.
[26] Gerald B. Folland. Real Analysis. Wiley-Interscience, 2nd edition, 1999.
[27] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[28] Nadav Hallak, Panayotis Mertikopoulos, and Volkan Cevher. Regret minimization in stochastic non-convex learning via a proximal-gradient approach. https://arxiv.org/abs/2010.06250, 2020.
[29] Elad Hazan. A survey: The convex optimization approach to regret minimization. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright (eds.), Optimization for Machine Learning, pp. 287–304. MIT Press, 2012.
[30] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In ICML '09: Proceedings of the 26th International Conference on Machine Learning, 2009.
[31] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In ICML '17: Proceedings of the 34th International Conference on Machine Learning, 2017.
[32] Mark Herbster and Manfred K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
[33] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In AISTATS '15: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
[34] Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, October 2005.
[35] Narendra Karmarkar. Riemannian geometry underlying interior point methods for linear programming. In Mathematical Developments Arising from Linear Programming, number 114 in Contemporary Mathematics. American Mathematical Society, 1990.
[36] Robert David Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS '04: Proceedings of the 18th Annual Conference on Neural Information Processing Systems, 2004.
[37] Robert David Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In STOC '08: Proceedings of the 40th Annual ACM Symposium on the Theory of Computing, 2008.
[38] Walid Krichene, Maximilian Balandat, Claire Tomlin, and Alexandre Bayen. The Hedge algorithm on a continuum. In ICML '15: Proceedings of the 32nd International Conference on Machine Learning, 2015.
[39] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[40] Panayotis Mertikopoulos and Zhengyuan Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507, January 2019.
[41] Theodore S. Motzkin and Ernst G. Straus. Maxima for graphs and a new proof of a theorem of Turán. Canadian Journal of Mathematics, 1965.
[42] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
[43] Steven Perkins, Panayotis Mertikopoulos, and David S. Leslie. Mixed-strategy learning with continuous action sets. IEEE Trans. Autom. Control, 62(1):379–384, January 2017.
[44] Robert Ralph Phelps. Convex Functions, Monotone Operators and Differentiability. Lecture Notes in Mathematics. Springer-Verlag, 2nd edition, 1993.
[45] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
[46] Shai Shalev-Shwartz and Yoram Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 19, pp. 1265–1272. MIT Press, 2007.
[47] Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
[48] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, November 2019.
[49] Victor Spirin and Leonid A. Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences, 2003.
[50] Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In ALT '20: Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.
[51] Constantino Tsallis. Possible generalization of Boltzmann–Gibbs statistics. Journal of Statistical Physics, 52:479–487, 1988.
[52] Paul Tseng. Convergence properties of Dikin's affine scaling algorithm for nonconvex quadratic minimization. Journal of Global Optimization, 30(2):285–300, 2004.
[53] Robert J. Vanderbei, Marc S. Meketon, and Barry A. Freedman. A modification of Karmarkar's linear programming algorithm. Algorithmica, 1(1):395–407, November 1986.
[54] Vladimir G. Vovk. Aggregating strategies. In COLT '90: Proceedings of the 3rd Workshop on Computational Learning Theory, pp. 371–383, 1990.
[55] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, October 2010.
[56] Zhisheng Zhong, Tiancheng Shen, Yibo Yang, Chao Zhang, and Zhouchen Lin. Joint sub-bands learning with clique structures for wavelet domain super-resolution. In NeurIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[57] Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Stephen P. Boyd, and Peter W. Glynn. On the convergence of mirror descent beyond stochastic convex programming. SIAM Journal on Optimization, 30(1):687–716, 2020.
[58] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML '03: Proceedings of the 20th International Conference on Machine Learning, 2003.