Optimal anytime regret with two experts
Nicholas J. A. Harvey, Christopher Liaw, Edwin Perkins, Sikander Randhawa
Abstract
The multiplicative weights method is an algorithm for the problem of prediction with expert advice. It achieves the minimax regret asymptotically if the number of experts is large, and the time horizon is known in advance. Optimal algorithms are also known if there are exactly two or three experts, and the time horizon is known in advance.

In the anytime setting, where the time horizon is not known in advance, algorithms can be obtained by the "doubling trick", but they are not optimal, let alone practical. No minimax optimal algorithm was previously known in the anytime setting, regardless of the number of experts.

We design the first minimax optimal algorithm for minimizing regret in the anytime setting. We consider the case of two experts, and prove that the optimal regret is γ√t/2 at all time steps t, where γ is a natural constant that arose 35 years ago in studying fundamental properties of Brownian motion. The algorithm is designed by considering a continuous analogue, which is solved using ideas from stochastic calculus.

∗ Email: [email protected]. University of British Columbia, Department of Computer Science.
† Email: [email protected]. University of British Columbia, Department of Computer Science.
‡ Email: [email protected]. University of British Columbia, Department of Mathematics.
§ Email: [email protected]. University of British Columbia, Department of Computer Science.

Contents

A Standard facts
A.1 Basic facts about confluent hypergeometric functions
A.2 Other standard facts
B Application to binary sequence prediction
C Technical results from Section 3
C.1 Proof of Lemma 3.1
C.2 Proof of Lemma 3.5
C.3 Proof of Lemma 3.6
D Analysis of Algorithm 1 for general cost vectors
D.1 Proof of Proposition D.2
D.2 Proof of Lemma D.4
E Additional proofs for Section 4
E.1 Large regret infinitely often
F Additional proofs for Section 5
F.1 Proof of Lemma 5.6
F.2 Proof of Lemma 5.7
F.3 Proof of Lemma 5.8
F.4 Proof of Lemma 5.10
F.5 Additional proofs from Appendix F.4
F.6 Proof of Claim 5.12
F.7 Discussion on the statement of Theorem 5.3
F.8 Continuous regret against any continuous semi-martingale
Introduction
We study the classical problem of prediction with expert advice, whose origin can be traced back as early as the 1950s [30]. The problem can be formulated as a sequential game between an adversary and an algorithm as follows. At each time t, an adversary chooses a cost for each of n possible experts. Without knowledge of the adversary's move, the algorithm must choose (perhaps randomly) one of the n experts to follow. The cost of each expert is then revealed to the algorithm, and the algorithm incurs the cost that its chosen expert incurred. The goal is to design an algorithm whose regret is small, i.e., the algorithm's expected total cost is small relative to the total cost of the best expert. In the theoretical computer science community, algorithms for this problem and its variants have been a key component in many results; we refer the reader to [3] and the references therein for a survey on some of these results.

The most well-known algorithm for the experts problem is the celebrated multiplicative weights update algorithm (MWU), which was introduced, independently, by Littlestone and Warmuth [34] and by Vovk [43]. The algorithm itself is very elegant and commonly taught in courses on algorithms, machine learning, and algorithmic game theory. An analysis of MWU shows that, in the fixed-time setting (where a time horizon T is known in advance), it achieves a regret of √((T/2) ln n) at time T, where n is the number of experts [13, 11]. This bound on the regret of MWU is known to be tight whenever n ≥ 2 is an even integer [27]. It is also known [13] that √((T/2) ln n) is asymptotically optimal for any algorithm as n, T → ∞. Hence, MWU is a minimax optimal algorithm as n, T → ∞. Interestingly, MWU is not optimal for small values of n. For n = 2, Cover [16] observed that a natural dynamic programming formulation of the problem leads to a simple analysis showing that the minimax optimal regret is √(T/(2π)).

The assumption that the time horizon T is known in advance may be problematic in some scenarios; examples include any sort of online tasks (e.g., online learning), or tasks requiring convergence over time (e.g., convergence to equilibria). These scenarios may be better suited for the anytime setting, which has the stronger requirement that the regret be controlled at all points in time. Another interesting setting is the geometric horizon setting, introduced by Gravin, Peres, and Sivan [26], in which the time horizon is a geometric random variable of known distribution. In this setting, they gave the optimal algorithm for two and three experts.

The anytime setting is the focus of this work. There is a well-known "doubling trick" [13, §4.6] that can be used to convert algorithms for the fixed-time setting to algorithms for the anytime setting. Typically, the doubling trick involves restarting the fixed-time horizon algorithm every power-of-two steps with new parameters. If the fixed-time algorithm has regret O(T^c) for some c ∈ (0,1), then the doubling trick yields an algorithm with regret O(t^c) for every t ≥ 1. On the one hand, this is a conceptually simple and generic reduction from the anytime setting to the fixed-time setting. On the other hand, this approach is inelegant, wasteful, and turns useful algorithms into algorithms of dubious practicality.

Instead of using the doubling trick, one can instead use variants of MWU with a dynamic step size; see, e.g., [12, §2.3], [37, Theorem 1], [7, §2.5]. This is a much more elegant and practical approach than the doubling trick (and is even simpler to implement). However, the analysis is somewhat different and more difficult than the standard MWU analysis, and is rarely taught. It is known that, with an appropriate choice of step sizes, MWU can guarantee a regret of √(t ln n) for all t ≥ 1 and all n ≥ 2 (see [7, Theorem 2.4] or [25, Proposition 2.1]). However, it is unknown whether √(t ln n) is the minimax optimal anytime regret, for any value of n.

Results and techniques.
This work considers the anytime setting with n = 2 experts. We show that the optimal regret is γ√t/2, where γ ≈ 1.30693 is a fundamental constant that arises in the study of Brownian motion [38]. A concise algorithm achieving the minimax regret is presented in Algorithm 1 on page 4.

This means that the algorithm minimizes the maximum, over all adversaries, of the regret. It can be shown, by modifying arguments of [27], that this is the optimal anytime analysis for MWU with step sizes c/√t.

Recently, interactions between algorithms in discrete and continuous time, although of a different sort, have been fruitful in other lines of work, e.g., [1, 8, 9, 10, 15, 19, 22, 31, 33, 44]. Secondly, we use tools of stochastic calculus to design and analyze algorithms for our continuous-time problem. Prior to this work, there has also been a line of literature which studies discounted multi-armed bandit problems by considering a Brownian approximation to the problem (see, e.g., [6, 14]).

Lastly, we use confluent hypergeometric functions to design and analyze the optimal continuous-time algorithm. These functions may seem exotic, but they turn out to be inherent to our problem since they also arise in the matching lower bound. The constant γ in the minimax regret may be defined as α(−1/2), where α is a function giving the root of a confluent hypergeometric function with certain parameters (see Claim A.5 and [38, Proposition 1(b)]).

Applications.
The first application of our techniques is to a problem in probability theory that does not involve regret at all. Let (X_t)_{t≥0} be a standard random walk. Then E[|X_τ|] ≤ γ·E[√τ] for every stopping time τ; moreover, the constant γ cannot be improved. This result is originally due to Davis [18, Eq. (3.8)], who proved it first for Brownian motion and then derived the result for random walks (via the Skorokhod embedding). We will prove this result as a consequence of our techniques in Subsection 2.4.

The prediction problem with two experts is closely related to the problem of predicting binary sequences; in fact, this was the problem originally considered by Cover [16]. A notable paper by Feder et al. [23] pursued this problem further, defining the notion of universal s-state predictors, and showing connections to Lempel-Ziv compression. They derive [23, Theorem 1 and Eq. (14)] a universal online predictor whose expected performance converges to the performance of the best single-state predictor at rate 1/√(2t) + O(1/t), where t is the sequence length. We describe a different online predictor achieving the better convergence rate γ/(2√t), and show that no other online predictor can improve the constant γ/2. We will prove this in Appendix B.

The problem may be stated formally as follows. For each integer t ≥ 1, there is a prediction task, which is said to occur at time t. The task involves a deterministic algorithm A, which must pick a vector x_t ∈ [0,1]^n, and an adversary B, which knows A and picks a vector ℓ_t ∈ [0,1]^n. The vector x_t must satisfy Σ_{j=1}^n x_{t,j} = 1 and may depend on ℓ_1, …, ℓ_{t−1} (and implicitly x_1, …, x_{t−1}). The vector ℓ_t may depend on A and on ℓ_1, …, ℓ_{t−1} (and implicitly x_1, …, x_t, since A is deterministic and known).

The dimension n denotes the number of experts. The coordinate ℓ_{t,j} denotes the cost of the j-th expert at time t.
The vector x_t may be viewed as a probability distribution, so the inner product ⟨x_t, ℓ_t⟩ is the expected cost of the algorithm at time t. Thus, the total expected cost of the algorithm up to time t is Σ_{i=1}^t ⟨x_i, ℓ_i⟩. For j ∈ [n], the total cost of the j-th expert up to time t is L_{t,j} = Σ_{i=1}^t ℓ_{i,j}. The regret at time t of algorithm A against adversary B is the difference between the algorithm's total expected cost and the total cost of the best expert, i.e.,

Regret(n, t, A, B) = Σ_{i=1}^t ⟨x_i, ℓ_i⟩ − min_{j∈[n]} L_{t,j}.

Anytime setting. This work focuses on the anytime setting, where the algorithm's objective is to minimize, for all t, the regret normalized by √t. Specifically, the minimax optimal algorithm must solve

AnytimeNormRegret(n) := inf_A sup_B sup_{t≥1} Regret(n, t, A, B) / √t.  (2.1)

As mentioned above, MWU with a time-varying step size achieves AnytimeNormRegret(n) ≤ √(ln n) for all n ≥ 2 [7, §2.5]. It is unknown whether this bound is tight, although as n → ∞ it can be loose by at most a factor √2 due to the lower bound from the fixed time horizon setting [13]. The minimax optimal anytime regret is unknown even in the case of n = 2 experts. The best known bounds at present are

0.564 ≈ √(1/π) ≤ AnytimeNormRegret(2) ≤ √(ln 2) ≈ 0.833.  (2.2)

The lower bound, due to [35], demonstrates a gap between the anytime setting and the fixed-time setting, where the optimal normalized regret is √(1/(2π)) [16]. We will show that neither inequality in (2.2) is tight.

To state our main theorem and our algorithm, we must define two special functions:

erfi(x) = (2/√π) ∫₀^x e^{z²} dz,
M(x) = e^x − √(πx) · erfi(√x).  (2.3)

The first one is the well-known imaginary error function. The second one is a confluent hypergeometric function with certain parameters, as discussed in Appendix A.
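These definitions are easy to sanity-check numerically. The following sketch is ours, not from the paper (the names `M` and `gamma` are our own): it verifies that M from (2.3) coincides with SciPy's Kummer function M(−1/2, 1/2, ·), and recovers by root-finding the constant γ defined in (2.4) below.

```python
# Sanity check: the function M of (2.3) equals the Kummer confluent
# hypergeometric function M(-1/2, 1/2, x), and the constant gamma of (2.4)
# is the smallest positive root of x -> M(x^2/2).
import math
from scipy.special import erfi, hyp1f1
from scipy.optimize import brentq

def M(x):
    # M(x) = e^x - sqrt(pi*x) * erfi(sqrt(x)), as in (2.3)
    return math.exp(x) - math.sqrt(math.pi * x) * erfi(math.sqrt(x))

# identity with the confluent hypergeometric function (cf. Appendix A)
for x in (0.1, 0.5, 1.0, 2.0):
    assert abs(M(x) - hyp1f1(-0.5, 0.5, x)) < 1e-8

# gamma = min{x > 0 : M(x^2/2) = 0}; bracket [1, 2] works since
# M(1/2) > 0 > M(2)
gamma = brentq(lambda x: M(x * x / 2), 1.0, 2.0)
print(gamma)  # ~ 1.30693
```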
A key constant used throughout this paper is γ, which is the smallest positive root of x ↦ M(x²/2), i.e.,

γ := min{ x > 0 : M(x²/2) = 0 } ≈ 1.30693….  (2.4)

It is known that the constant γ relates to the slow points of Brownian motion [36, §10.3].

Theorem 2.1 (Main result). In the anytime setting with two experts, the minimax optimal normalized regret (over deterministic algorithms A and adversaries B) is

AnytimeNormRegret(2) = inf_A sup_B sup_{t≥1} Regret(2, t, A, B) / √t = γ/2.

The proof of this theorem has two parts: an upper bound, in Section 3, which exhibits an optimal algorithm, and a lower bound, in Section 4, which exhibits an optimal randomized adversary. The algorithm is very short, and it appears below in Algorithm 1. Remarkably, the quantity γ arises in both the lower bound and the upper bound for seemingly unrelated reasons. In the lower bound γ is the maximizer in (4.3), and in the upper bound γ is the minimizer in (5.16).

Remark.
Our lower bound can be strengthened to show that, for any algorithm A,

sup_B lim sup_{t→∞} Regret(2, t, A, B) / √t ≥ γ/2.

In particular, even if A is granted a "warm-up" period during which its regret is ignored, an adversary can still force it to incur large regret afterwards. A sketch of this is in Appendix E.1.

In fact, γ is the unique positive root; see Fact A.4. Moreover, γ is the smallest value such that Brownian motion almost surely has a two-sided γ-slow point [38]; we will not use this fact.

Our algorithm is based on the function R : ℝ≥0 × ℝ → ℝ defined by

R(t, g) = { g/2                        if t = 0,
          { g/2 + κ√t · M(g²/(2t))    if t > 0 and g ≤ γ√t,
          { γ√t/2                     if t > 0 and g ≥ γ√t,

where κ = 1/(√(2π) · erfi(γ/√2))  (2.5)

and M is defined in (2.3). The function R may seem mysterious at first, but in fact arises naturally from the solution to a stochastic calculus problem in Section 5. In our usage of this function, t will correspond to the time and g will correspond to the gap between (i.e., absolute difference of) the total loss for the two experts. One may verify that R is continuous on ℝ>0 × ℝ because the second and third cases agree on the curve {(t, γ√t) : t > 0}, since γ satisfies M(γ²/2) = 0. We next define the function p to be

p(t, g) = ½ (R(t, g+1) − R(t, g−1)),  (2.6)

which is the discrete derivative of R at time t and gap g. It will be shown later that p(t, g) ∈ [0, 1/2] whenever t ≥ 1 and g ≥ 0. The algorithm constructs its distribution x_t so that p(t, g) is the probability mass assigned to the worst expert. We remark that p(t, 0) = 1/2 (Lemma 3.3) for all t ≥ 1, so that when both experts are equally good, the algorithm places equal mass on both experts.

Algorithm 1
An algorithm achieving the minimax anytime regret for two experts. It is assumed that each cost vector ℓ_t ∈ [0,1]².

Initialize L_0 ← [0, 0].
for t = 1, 2, … do
    If necessary, swap indices so that L_{t−1,1} ≥ L_{t−1,2}. The current gap is g_{t−1} ← L_{t−1,1} − L_{t−1,2}.
    Set x_t ← [ p(t, g_{t−1}), 1 − p(t, g_{t−1}) ], where p is the function defined by (2.6).
    Observe cost vector ℓ_t and incur expected cost ⟨x_t, ℓ_t⟩.
    L_t ← L_{t−1} + ℓ_t.
end for

Lower Bound.
The common approach to prove lower bounds in the experts problem is to consider a random adversary that changes the gap by ±1 at each step and to consider the regret at a fixed time T. Although we do consider a random adversary, looking at a fixed time T will not be able to yield a good lower bound. The first key idea is to replace the fixed time with a suitable stopping time. In particular, the stopping time we use is the first time that the gap process (which is evolving as a reflected random walk) crosses a c√t boundary, where c > 0 is a constant to be optimized.

To analyze this, we use an elementary identity known as Tanaka's formula for random walks that allows us to write the regret process as Regret(t) = Z_t + g_t/2, where Z_t is a martingale with Z_0 = 0 and g_t is the current gap at time t. At this point, it might seem we are ready to apply the optional stopping theorem, which states that if we have a stopping time τ then E[Z_τ] = Z_0 = 0. In particular, by choosing τ as the first time that the gap g_t exceeds the c√t boundary, one might expect that E[Regret(τ)] = E[g_τ]/2 ≥ c·E[√τ]/2. Unfortunately, the argument cannot be so simple, since the adversary is allowed to choose any c > 0 and, by taking c sufficiently large, this would violate known upper bounds on the regret.

The issue lies in the fact that the optional stopping theorem requires certain conditions on the martingale and stopping time. It turns out that the conditions used in most textbooks are too weak for us to derive the optimal regret bound. Fortunately there is a strengthening of the optional stopping theorem that leads to optimal results in our setting. Namely, if Z_t is a martingale with bounded increments (i.e., sup_{t≥0} |Z_{t+1} − Z_t| ≤ K for some K > 0) and τ is a stopping time satisfying E[√τ] < ∞, then E[Z_τ] = 0. (The crucial detail is the square root.) This result is stated formally in Theorem 4.2.
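The decomposition Regret(t) = Z_t + g_t/2 is an exact, per-path identity for the ±1 adversary whenever the algorithm plays [1/2, 1/2] at gap zero. The sketch below is ours, not from the paper: it tracks regret via the case analysis of Proposition 2.3 with an arbitrary illustrative prediction rule (the formula for `p_t` is a placeholder, not the optimal algorithm), and checks the identity at every step.

```python
# Check, path by path, that Regret(T) = Z_T + g_T/2, where
#   Z_T = sum_t (p_t - 1/2) * (g_t - g_{t-1}) * [g_{t-1} != 0]
# is a martingale with increments bounded by 1/2 (a Tanaka-style
# decomposition of the regret of the reflected random walk adversary).
import random

random.seed(0)
g, regret, Z = 0, 0.0, 0.0
for t in range(1, 10_001):
    # an arbitrary rule; any rule with p_t = 1/2 at gap 0 works here
    p_t = 0.5 if g == 0 else 0.4 / (1 + g / t)
    if g == 0:
        dg = 1            # adversary forces gap 1; algorithm pays 1/2
        regret += 0.5
    else:
        dg = random.choice([-1, 1])   # fair +-1 step of the reflected walk
        regret += p_t * dg
        Z += (p_t - 0.5) * dg
    g += dg
    assert abs(regret - (Z + g / 2)) < 1e-9   # the identity, exactly
```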
The question is now to choose as large a boundary as possible such that the associated stopping time of hitting the boundary satisfies E[√τ] < ∞. Using classical results of Breiman [4] and Greenwood and Perkins [28], we will show that the optimal choice of c is γ.

Upper Bound.
Our analysis of the upper bound uses a fairly standard, undergraduate-style potential function argument, with the function R defined in (2.5) as the potential. Specifically, we show that the change in regret from time t−1 and gap g_{t−1} to time t and gap g_t is at most R(t, g_t) − R(t−1, g_{t−1}). This implies that max_g R(t, g) is an upper bound on the regret at time t. It is not difficult to see that R(t, g) ≤ γ√t/2 for all t ≥ 1, which establishes our main upper bound. One interesting twist is that our potential function is bivariate: it depends both on the state g of the algorithm and on time t. To capture how the potential's evolution depends on time, we use a simple identity known as the discrete Itô formula.

The function R and the use of discrete Itô do not come "out of thin air"; both of these ideas come from considering a continuous-time analogue of the problem. The reason for taking this continuous viewpoint is that it brings a wealth of analytical tools that may not exist (or are more cumbersome) in the discrete setting. In order to formulate the continuous-time problem, we will assume that the continuous adversary evolves the gap between the best and worst expert as a reflected Brownian motion. This assumption is motivated by the discrete-time lower bound, since Brownian motion is the continuous-time analogue of a random walk. Using this adversary, the continuous-time regret becomes a stochastic integral.

An important tool at our disposal is the (continuous) Itô formula (Theorem 5.3), which provides an insightful decomposition of the continuous-time regret. This decomposition suggests that the algorithm should satisfy an analytic condition known as the backwards heat equation. A key resulting idea is: if the algorithm satisfies the backwards heat equation, then there is a natural potential function that upper bounds the regret of the algorithm.
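One can check the backwards-heat property concretely: in the region 0 < g < γ√t, the middle case of (2.5) satisfies ∂_t R + ½ ∂_gg R = 0, which is equivalent to Kummer's differential equation for M with parameters a = −1/2, b = 1/2. The finite-difference sketch below is ours (the helper `R_mid` implements only the middle case of (2.5)):

```python
# Finite-difference check that R(t,g) = g/2 + kappa*sqrt(t)*M(g^2/(2t))
# satisfies the backwards heat equation  dR/dt + (1/2) d^2R/dg^2 = 0
# for 0 < g < gamma*sqrt(t).
import math
from scipy.special import erfi
from scipy.optimize import brentq

def M(x):
    return math.exp(x) - math.sqrt(math.pi * x) * erfi(math.sqrt(x))

gamma = brentq(lambda x: M(x * x / 2), 1.0, 2.0)              # ~ 1.30693
kappa = 1 / (math.sqrt(2 * math.pi) * erfi(gamma / math.sqrt(2)))

def R_mid(t, g):
    # middle case of (2.5): valid for t > 0 and |g| <= gamma*sqrt(t)
    return g / 2 + kappa * math.sqrt(t) * M(g * g / (2 * t))

h = 1e-4
for (t, g) in [(4.0, 1.0), (9.0, 2.5), (25.0, 0.5)]:
    dt = (R_mid(t + h, g) - R_mid(t - h, g)) / (2 * h)          # d/dt
    dgg = (R_mid(t, g + h) - 2 * R_mid(t, g) + R_mid(t, g - h)) / h**2
    assert abs(dt + 0.5 * dgg) < 1e-4   # backwards heat equation holds
```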
This affords us a systematic approach to obtain an explicit continuous-time algorithm and a potential function that bounds the continuous algorithm's regret. To go back to the discrete setting, using the same potential function, we replace applications of Itô's formula with the discrete Itô formula. Remarkably, this leads to exactly the same regret bound as in the continuous setting.

As mentioned in Section 1, the following theorem of Davis can be proven as a corollary of our techniques. Intriguingly, the proof involves regret, despite the fact that regret does not appear in the theorem statement. Our second application, to binary sequence prediction, is discussed in Appendix B.
Theorem 2.2 (Davis [18]). Let (X_t)_{t≥0} be a standard random walk. Then E[|X_τ|] ≤ γ·E[√τ] for every stopping time τ; moreover, the constant γ cannot be improved.

Proof.
We begin by proving the first assertion. Suppose that Regret(T) is the regret process when Algorithm 1 is used against a random adversary. As discussed in Subsection 2.3, we can write the regret process as Regret(T) = Z_T + g_T/2, where Z_T is a martingale and g_T evolves as a reflected random walk. Moreover, if τ is a stopping time satisfying E[√τ] < ∞, then E[Z_τ] = 0 (see Theorem 4.2).

The upper bound in Theorem 2.1 asserts that γ√T/2 ≥ Regret(T) = Z_T + g_T/2 for any fixed T ≥ 0. Hence, γ·E[√τ]/2 ≥ E[g_τ]/2. Replacing g_τ with |X_τ| (since both g_t and |X_t| are reflected random walks), the proof of the first assertion is complete.

The fact that no constant smaller than γ is possible is a direct consequence of the results of Breiman [4] and Greenwood and Perkins [28], as mentioned in Subsection 2.3 (see also Section 4 or [18]).

Remark. Davis [18] proved Theorem 2.2 for both random walks and Brownian motion. We are also able to recover the result for Brownian motion as a corollary of our continuous-time result (Theorem 5.2). The proof is very similar to that above.
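As an illustration (ours, not from the paper), the inequality E[|X_τ|] ≤ γ·E[√τ] can be tested by Monte Carlo for a simple stopping time, say the first hitting time of ±m. For this choice the observed constant sits well below γ, which is expected: the bound is tight only for stopping times tied to the γ√t boundary.

```python
# Monte Carlo illustration of E[|X_tau|] <= gamma * E[sqrt(tau)] for the
# stopping time tau = first time the simple random walk hits +-m.
import math, random
from scipy.special import erfi
from scipy.optimize import brentq

M = lambda x: math.exp(x) - math.sqrt(math.pi * x) * erfi(math.sqrt(x))
gamma = brentq(lambda x: M(x * x / 2), 1.0, 2.0)   # ~ 1.30693

random.seed(1)
m, trials = 10, 2000
sqrt_tau = []
for _ in range(trials):
    x, t = 0, 0
    while abs(x) < m:
        x += random.choice([-1, 1])
        t += 1
    sqrt_tau.append(math.sqrt(t))      # note |X_tau| = m on every path

ratio = m / (sum(sqrt_tau) / trials)   # estimates E|X_tau| / E[sqrt(tau)]
print(round(ratio, 2))                 # comfortably below gamma ~ 1.31
assert ratio < gamma
```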
In our two-expert prediction problem, the most important scenario restricts each cost vector ℓ_t to be either [0,1] or [1,0]. This restricted scenario is equivalent to the condition

g_t − g_{t−1} ∈ {±1}  for all t ≥ 1,

where g_t := |L_{t,1} − L_{t,2}| is the gap at time t. To prove the optimal lower bound it suffices to consider this restricted scenario. The optimal upper bound will first be proven in the restricted scenario, then extended to general cost vectors in Appendix D. With the sole exception of Appendix D, we will assume the restricted scenario.

We now present an expression, valid for any algorithm, that emphasizes how the regret depends on the change in the gap. This expression will be useful in proving both the upper and lower bounds. Henceforth we will often write Regret(t) := Regret(2, t, A, B), where A and B are usually implicit from the context.

Proposition 2.3.
Assume the restricted setting in which g_t − g_{t−1} ∈ {±1} for every t ≥ 1. When g_{t−1} ≠ 0, let p_t denote the probability mass assigned by the algorithm to the worst expert; this quantity may depend arbitrarily on ℓ_1, …, ℓ_{t−1}. Then

Regret(T) = Σ_{t=1}^T p_t · (g_t − g_{t−1}) · [g_{t−1} ≠ 0] + Σ_{t=1}^T ⟨x_t, ℓ_t⟩ · [g_{t−1} = 0].  (2.7)

Furthermore, assume that if g_{t−1} = 0 then p_t = x_{t,1} = x_{t,2} = 1/2. In this case,

Regret(T) = Σ_{t=1}^T p_t · (g_t − g_{t−1}).  (2.8)

Remark.
If the cost vectors are randomly chosen so that the gap process (g_t)_{t≥0} is the absolute value of a standard random walk, then (2.7) is the Doob decomposition [32, Theorem 10.1] of the regret process (Regret(t))_{t≥0}, i.e., the first sum is a martingale and the second sum is an increasing predictable process.

Proof.
Define ∆R(t) = Regret(t) − Regret(t−1). The total cost of the best expert at time t is L*_t := min{L_{t,1}, L_{t,2}}. The change in regret at time t is the cost incurred by the algorithm minus the change in the total cost of the best expert, so ∆R(t) = ⟨x_t, ℓ_t⟩ − (L*_t − L*_{t−1}).

Case 1: g_{t−1} ≠ 0. In this case, the best expert at time t−1 remains a best expert at time t. If the worst expert incurs cost 1, then the algorithm incurs cost p_t and the best expert incurs cost 0, so ∆R(t) = p_t and g_t − g_{t−1} = 1. Otherwise, the best expert incurs cost 1 and the algorithm incurs cost 1 − p_t, so ∆R(t) = −p_t and g_t − g_{t−1} = −1. In both cases, ∆R(t) = p_t · (g_t − g_{t−1}).

Case 2: g_{t−1} = 0. Both experts are best, but one incurs no cost, so L*_t = L*_{t−1} and ∆R(t) = ⟨x_t, ℓ_t⟩.

The above two cases prove (2.7). For the last assertion, we have that ⟨x_t, ℓ_t⟩ = 1/2 = p_t · (g_t − g_{t−1}) whenever g_{t−1} = 0. Hence, we can collapse the two sums in (2.7) into one to get (2.8).

Upper bound

In this section, we prove the upper bound in Theorem 2.1 via a sequence of simple steps. We remind the reader that, for simplicity, we will assume that the gap changes by ±1 at each step, which corresponds to ℓ_t ∈ {[0,1], [1,0]}. (Here, if L_{t−1,1} ≥ L_{t−1,2} then p_t = x_{t,1}, and otherwise p_t = x_{t,2}.) The analysis can be extended to general loss vectors in [0,1]² through the use of concavity arguments. The details of this extension are not particularly enlightening, so we relegate them to Appendix D.

The proof in this section uses the potential function R which, as explained in Subsection 2.3, is defined via continuous-time arguments in Section 5. Moreover, the structure of the proof is heavily inspired by the proof in the continuous setting. Finally, we remark that the analysis of this section uses the potential function in a modular way, and could conceivably be used to analyze other algorithms.

Moving forward, we will need a few observations about the functions R and p, which were defined in Eq. (2.5) and Eq. (2.6).

Lemma 3.1.
For any t > 0, R(t, g) is concave and non-decreasing in g.

The proof of Lemma 3.1 is a calculus exercise and appears in Appendix C.1. As a consequence, we can easily get the maximum value of R(t, g) for any t.

Lemma 3.2.
For any t > 0, we have R(t, g) ≤ γ√t/2.

Proof.
Lemma 3.1 shows that R(t, g) is non-decreasing in g. By definition, R(t, g) is constant for g ≥ γ√t. It follows that max_g R(t, g) ≤ R(t, γ√t) = γ√t/2.

In the definition of the prediction task, the algorithm must produce a probability vector x_t. Recalling the definition of x_t in Algorithm 1, it is not a priori clear whether x_t is indeed a probability vector. We now verify that it is, since Lemma 3.3 implies that p(t, g) ∈ [0, 1/2] for all t, g.

Lemma 3.3.
Fix t ≥ 1. Then
(1) p(t, 0) = 1/2;
(2) p(t, g) is non-increasing in g; and
(3) p(t, g) ≥ 0.

Proof.

For the first assertion, we have

p(t, 0) = ½ (R(t, 1) − R(t, −1)) = ½ ((½ + κ√t·M(1/(2t))) + (½ − κ√t·M(1/(2t)))) = ½.

For the second equality, we used that 1 ≤ γ ≤ γ√t for all t ≥ 1. The second assertion follows from concavity of R, which was shown in Lemma 3.1, and an elementary property of concave functions (Fact A.6). The final assertion holds because R is non-decreasing in g, which was also shown in Lemma 3.1.

In this subsection we prove the upper bound of Theorem 2.1 for a restricted class of adversaries (that nevertheless capture the core of the problem). The analysis is extended to all adversaries in Appendix D.
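The functions R, p, and Algorithm 1 itself are short enough to implement directly. The sketch below is ours, written from (2.3)–(2.6): it checks the three assertions of Lemma 3.3 numerically, then runs the algorithm against one random ±1-cost adversary and confirms Regret(t) ≤ γ√t/2 along the sampled path.

```python
# Direct implementation of R (2.5), p (2.6), and Algorithm 1, run against
# random cost vectors in {[0,1],[1,0]}; the regret never exceeds
# gamma*sqrt(t)/2.
import math, random
from scipy.special import erfi
from scipy.optimize import brentq

def M(x):
    return math.exp(x) - math.sqrt(math.pi * x) * erfi(math.sqrt(x)) if x > 0 else 1.0

gamma = brentq(lambda x: M(x * x / 2), 1.0, 2.0)              # ~ 1.30693
kappa = 1 / (math.sqrt(2 * math.pi) * erfi(gamma / math.sqrt(2)))

def R(t, g):
    if t == 0:
        return g / 2
    if g >= gamma * math.sqrt(t):
        return gamma * math.sqrt(t) / 2
    return g / 2 + kappa * math.sqrt(t) * M(g * g / (2 * t))

def p(t, g):
    return (R(t, g + 1) - R(t, g - 1)) / 2

# Lemma 3.3: p(t,0) = 1/2, p non-increasing in g, p >= 0
for t in (1, 2, 10, 100):
    assert abs(p(t, 0) - 0.5) < 1e-9
    vals = [p(t, g) for g in range(0, 20)]
    assert all(v >= -1e-12 for v in vals)
    assert all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))

# Algorithm 1 against a random adversary; regret tracked via Proposition 2.3
random.seed(0)
g, regret = 0, 0.0
for t in range(1, 5001):
    pt = p(t, g)                  # mass on the (currently) worst expert
    if g == 0:
        dg = 1                    # one expert gets cost 1, the other 0
        regret += 0.5             # <x_t, l_t> = 1/2 since p(t,0) = 1/2
    else:
        dg = random.choice([-1, 1])
        regret += pt * dg         # change in regret, by Proposition 2.3
    g += dg
    assert regret <= gamma * math.sqrt(t) / 2 + 1e-9
```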
Theorem 3.4.
Let A be the algorithm described in Algorithm 1. For any adversary B such that each cost vector ℓ_t is either [0,1] or [1,0], we have

sup_{t≥1} Regret(2, t, A, B) / √t ≤ γ/2.

Our analysis may also be viewed as an amortized analysis. With this viewpoint, the algorithm incurs amortized regret at most γ(√t − √(t−1))/2 ≈ γ/(4√t) at each time step t.

Define the discrete derivatives of a bivariate function f to be

f_g(t, g) = ½ (f(t, g+1) − f(t, g−1)),
f_t(t, g) = f(t, g) − f(t−1, g),
f_gg(t, g) = ½ (f(t, g+1) + f(t, g−1)) − f(t, g).

It was remarked earlier that p(t, g) is the discrete derivative of R, and this is because

p(t, g) = R_g(t, g).  (3.1)

Lemma 3.5 (Discrete Itô formula). Let g_0, g_1, … be a sequence of real numbers satisfying |g_t − g_{t−1}| = 1. Then for any function f and any fixed time T ≥ 1, we have

f(T, g_T) − f(0, g_0) = Σ_{t=1}^T f_g(t, g_{t−1}) · (g_t − g_{t−1}) + Σ_{t=1}^T ( f_gg(t, g_{t−1}) + f_t(t, g_{t−1}) ).  (3.2)

This lemma is a small generalization of [32, Example 10.9] to accommodate a bivariate function f that depends on t. The proof is essentially identical, and appears in Appendix C.2 for completeness.

Now we show how the regret has a formula similar to (3.2). Recall that Lemma 3.3(1) guarantees p(t, 0) = 1/2, i.e., x_t = [1/2, 1/2]. Hence, (2.8) gives

Regret(T) = Σ_{t=1}^T p(t, g_{t−1}) · (g_t − g_{t−1}),  (3.3)

where g_0 = 0 and g_t ≥ 0 for all t ≥ 1. Since p = R_g, observe that the difference between (3.3) and (3.2) is the quantity f_gg(t, g_{t−1}) + f_t(t, g_{t−1}). In the continuous setting, we will see that a key idea is to try to obtain a solution satisfying (½∂_gg + ∂_t) f = 0; this is the well-known backwards heat equation. In the discrete setting, we will show that f_gg(t, g_{t−1}) + f_t(t, g_{t−1}) ≥ 0, which suffices for our purposes.

Lemma 3.6 (Discrete backwards heat inequality). R_gg(t, g) + R_t(t, g) ≥ 0 for all t ∈ ℝ≥1 and g ∈ ℝ≥0.

This lemma is the most technical part of the discrete analysis. Its proof appears in Appendix C.3. We now have all the ingredients needed to prove our main theorem (in the present special case).

Proof (of Theorem 3.4). Apply Lemma 3.5 to the function R and the sequence g_0, g_1, … of (integer) gaps produced by the adversary B. Then, for any time T ≥ 1,

R(T, g_T) − R(0, g_0)
  = Σ_{t=1}^T R_g(t, g_{t−1}) · (g_t − g_{t−1}) + Σ_{t=1}^T ( R_gg(t, g_{t−1}) + R_t(t, g_{t−1}) )   (by Lemma 3.5)
  ≥ Σ_{t=1}^T p(t, g_{t−1}) · (g_t − g_{t−1})   (by (3.1) and Lemma 3.6)
  = Regret(T)   (by (3.3)).

Since g_0 = 0 and R(0, 0) = 0, applying Lemma 3.2 shows that Regret(T) ≤ R(T, g_T) ≤ γ√T/2.

The reader at this point may be wondering why γ is the right constant to appear in the analysis. In Section 5, we will define the function R specifically to obtain γ in the preceding analysis. In the next section, our matching lower bound will prove that γ is indeed the right constant.

Lower bound
The main result of this section is the following theorem, which implies the lower bound in Theorem 2.1.
Theorem 4.1.
For any algorithm A and any ε > 0, there exists an adversary B_ε such that

sup_{t≥1} Regret(2, t, A, B_ε) / √t ≥ γ/2 − ε.  (4.1)

As remarked earlier, the sup can be replaced by a lim sup; see Appendix E.1.

It is common in the literature for regret lower bounds to be proven by random adversaries; see, e.g., [12, Theorem 3.7]. We will also consider a random adversary, but the novelty is the use of a non-trivial stopping time at which it can be shown that the regret is large.

A random adversary.
Suppose an adversary produces a sequence of cost vectors ℓ_1, ℓ_2, … ∈ {0,1}² as follows. For all t ≥ 1:

• If g_{t−1} > 0 then ℓ_t is randomly chosen to be one of the vectors [1,0] or [0,1], uniformly and independently of ℓ_1, …, ℓ_{t−1}. Thus g_t − g_{t−1} is uniform in {±1}.
• If g_{t−1} = 0 then ℓ_t = [1,0] if x_{t,1} ≥ 1/2, and ℓ_t = [0,1] if x_{t,1} < 1/2. In both cases g_t = 1.

As remarked above, the process (g_t)_{t≥0} has the same distribution as the absolute value of a standard random walk (which is also known as a reflected random walk).

We now obtain from (2.7) a lower bound on the regret of any algorithm against this adversary. The adversary's behavior when g_{t−1} = 0 ensures that ⟨x_t, ℓ_t⟩ ≥ 1/2, showing that

Regret(T) ≥ Σ_{t=1}^T p_t · (g_t − g_{t−1}) · [g_{t−1} ≠ 0]  (martingale)  +  ½ Σ_{t=1}^T [g_{t−1} = 0]  (local time)   for all T ∈ ℕ.

(Equality holds if the algorithm sets x_t = [1/2, 1/2] whenever g_{t−1} = 0.) The first sum is a martingale indexed by T. (This holds because g_t − g_{t−1} has conditional expectation 0 when g_{t−1} ≠ 0, and [g_{t−1} ≠ 0] = 0 when g_{t−1} = 0.) The second sum is called the local time of the random walk. Using Tanaka's formula [32, Ex. 10.8], the local time can be written as Σ_{t=1}^T [g_{t−1} = 0] = g_T − Z'_T, where Z'_T is a martingale with uniformly bounded increments and Z'_0 = 0. Thus, combining the two martingales, we have

Regret(t) ≥ Z_t + g_t/2  for all t ∈ ℤ≥0,  (4.2)

where Z_t is a martingale with uniformly bounded increments and Z_0 = 0.

Intuition for a stopping time.
Optional stopping theorems assert that, under some hypotheses, the expected value of a martingale at a stopping time equals its value at the start. Using such a theorem, at a stopping time τ it would hold that E[Regret(τ)] ≥ E[g_τ]/2 (under some hypotheses on τ and Z). Thus it is natural to design a stopping time τ that maximizes E[g_τ] and satisfies the hypotheses. We know from (2.2) that the optimal anytime regret at time t is Θ(√t), so one reasonable stopping time would be

    τ(c) := min{ t > 0 : g_t ≥ c√t }

for some constant c yet to be determined. If τ(c) and Z satisfy the hypotheses of the optional stopping theorem, then it will hold that E[Regret(τ(c))] ≥ (c/2)·E[√τ(c)]. From this it follows, fairly easily, that AnytimeNormRegret(2) ≥ c/2; this will be argued more carefully later.

An optional stopping theorem. The optional stopping theorems appearing in standard references require one of the following hypotheses: (i) τ is almost surely bounded, or (ii) E[τ] is bounded and the martingale has bounded increments, or (iii) the martingale is almost surely bounded and τ is almost surely finite. See, e.g., [5, Theorem 5.33], [21, Theorem 4.8.5], [32, Theorem 10.11], [29, Theorem 12.5.1], [40, Theorem II.57.4], or [45, Theorem 10.10]. These will not suffice for our purposes, and we will require the following theorem, which has a weaker hypothesis (due to the square root). We are unable to find a reference for this theorem, although it is presumably folklore, so we provide a proof in Appendix E.

Theorem 4.2.
Let Z_t be a martingale and K > 0 a constant such that |Z_t − Z_{t−1}| ≤ K almost surely for all t. Let τ be a stopping time. If E[√τ] < ∞ then E[Z_τ] = E[Z_0].

Optimizing the stopping time.
Since the martingale Z_t defined above has bounded increments, Theorem 4.2 may be applied so long as E[√τ(c)] < ∞, in which case the preceding discussion yields AnytimeNormRegret(2) ≥ c/2. So it remains to determine

    sup{ c ≥ 0 : E[√τ(c)] < ∞ },    (4.3)

where τ(c) is the first time at which a standard random walk crosses the two-sided boundary ±c√t. We will use the following result, in which M is the confluent hypergeometric function defined in Appendix A. Some discussion of our statement of this theorem appears in Appendix E.

Theorem 4.3 (Breiman [4], Theorem 2). Let c > 0 and a < 0 be such that c is the smallest positive root of the function x ↦ M(a, 1/2, x²/2). Then there exists a constant K such that Pr[τ(c) > u] ∼ K·u^a.

Recall the definition of γ in (2.4). For intuition, let us apply Theorem 4.3 with c = γ, which is defined so that it is the root for a = −1/2 (see Eq. (A.2) and Fact A.2). It then follows that

    E[√τ(γ)] = ∫₀^∞ Pr[√τ(γ) > s] ds = ∫₀^∞ Pr[τ(γ) > s²] ds ∼ K ∫^∞ s^{−1} ds,

by Theorem 4.3. This integral is infinite, so Theorem 4.2 cannot be applied to τ(γ). However, the integral is on the cusp of being finite. By slightly decreasing a below −1/2, and slightly modifying c to be the new root, we should obtain a finite integral, showing that E[√τ(c)] is finite. The following proof uses analytic properties of M to show that this is possible.

Proof (of Theorem 4.1). Fix any ε > 0 that is sufficiently small. Consider the random adversary and the stopping times τ(c) described above. By Claim A.5, there exists a_ε ∈ (−1, −1/2) and c_ε ≥ γ − ε such that c_ε is the unique positive root of z ↦ M(a_ε, 1/2, z²/2).
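The roots featuring in Theorem 4.3 and in this proof are easy to explore numerically. The following sketch (illustrative only, not part of the proof; the bisection bracket [1, 2] and the reference value 1.30693 for γ are our own choices) computes the smallest positive root of x ↦ M(a, 1/2, x²/2) directly from the power series of M:

```python
def M(a, b, z, terms=200):
    """Confluent hypergeometric function M(a, b, z) via its power series."""
    total, term = 1.0, 1.0
    for n in range(terms):
        term *= (a + n) * z / ((b + n) * (n + 1))
        total += term
    return total

def smallest_positive_root(a, lo=1.0, hi=2.0):
    """Bisection for the smallest positive root of x -> M(a, 1/2, x^2/2)."""
    f = lambda x: M(a, 0.5, x * x / 2.0)
    # M(a, 1/2, 0) = 1 > 0, and for a in (-1, 0) the map is strictly decreasing
    assert f(lo) > 0 > f(hi)
    for _ in range(80):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2.0

gamma = smallest_positive_root(-0.5)    # a = -1/2 recovers gamma ~ 1.30693
c_eps = smallest_positive_root(-0.55)   # a slightly below -1/2, as in Claim A.5
print(gamma, c_eps)                     # c_eps is slightly smaller than gamma
```

The behaviour visible here is exactly the continuity used in Claim A.5: as a decreases slightly below −1/2, the root moves continuously a little below γ.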
As in the above calculations, Theorem 4.3 shows that

    E[√τ(c_ε)] = ∫₀^∞ Pr[τ(c_ε) > s²] ds ∼ K ∫^∞ s^{2a_ε} ds < ∞,    (4.4)

since a_ε < −1/2. It follows that τ(c_ε) is almost surely finite, and therefore Regret(τ(c_ε)) and g_{τ(c_ε)} are almost surely well defined. Applying Theorem 4.2 to the martingale Z_t appearing in (4.2), we obtain that

    E[Regret(τ(c_ε))] ≥ (1/2)·E[g_{τ(c_ε)}] ≥ (1/2)·E[c_ε √τ(c_ε)].

By the probabilistic method, there exists a finite sequence of cost vectors ℓ₁, …, ℓ_t (depending on A and ε) for which the regret of A at time t is at least c_ε√t/2. The adversary B_ε (which knows A) provides this sequence of cost vectors to algorithm A, thereby proving (4.1).

5 Derivation of a continuous-time analogue of Algorithm 1
The purpose of this section is to show how the potential function R defined in (2.5) arises naturally as the solution of a stochastic calculus problem. The derivation of that function is accomplished by defining, then solving, an analogue of the regret minimization problem in continuous time. The main advantage of considering this continuous setting is the wealth of analytic methods available, such as stochastic calculus.

Continuous time regret problem.
The continuous regret problem is inspired by (2.8). Notice that, when the adversary chooses costs in {[0,1], [1,0]}, the sequence of gaps g₀, g₁, g₂, … lives in the support of a reflected random walk. The goal in the discrete case is to find an algorithm p that bounds the regret over all possible sample paths of a reflected random walk. In continuous time it is natural to consider a stochastic integral with respect to reflected Brownian motion, denoted |B_t|, instead. Our goal now is to find a continuous-time algorithm whose regret is small for almost all reflected Brownian motion paths.

Definition 5.1 (Continuous Regret). Let p : ℝ_{>0} × ℝ_{≥0} → [0,1] be a continuous function that satisfies p(t, 0) = 1/2 for every t > 0. Let B_t be a standard one-dimensional Brownian motion. Then, the continuous regret of p with respect to B is the stochastic integral

    ContRegret(T, p, B) = ∫₀^T p(t, |B_t|) d|B_t|.    (5.1)

Remark.
The condition p(t, 0) = 1/2 is due to (5.1) being inspired by (2.8), which requires this condition. In this definition we may think of p as a continuous-time algorithm and B as a continuous-time adversary. The goal for the remainder of this section is to prove the following result.

Theorem 5.2.
There exists a continuous-time algorithm p* such that

    ContRegret(T, p*, B) ≤ γ√T/2    ∀T ∈ ℝ_{≥0}, almost surely.    (5.2)

Remark.
A natural question arises upon reviewing the definition of continuous regret: what role does Brownian motion play in Definition 5.1, and is it the "correct" stochastic process to consider in order to uncover the optimal algorithm? In the analysis that follows, the only properties of reflected Brownian motion that we use are its non-negativity and that its quadratic variation is t. It turns out that one can generalize Theorem 5.2 by allowing any non-negative, continuous semi-martingale X to control the gap process, and by letting time grow at the rate of the quadratic variation of X. See Theorem F.11 in Appendix F.8 for more details.

Since ContRegret(T) evolves as a stochastic integral with respect to a semi-martingale (namely reflected Brownian motion), Itô's lemma provides an insightful decomposition. The following statement of Itô's lemma is a specialization of [39, Theorem IV.3.3] for the special case of reflected Brownian motion. (A semi-martingale is a stochastic process that can be written as the sum of a local martingale and a process of finite variation. Specifically, we are using the statement of Itô's formula that appears in Remark 1 after Theorem IV.3.3 in [39] with X_t = |B_t| and A_t = t. Note that y in their notation is t in ours and ⟨|B|, |B|⟩_t = t.)

Notation. Up to now, we have used the symbol g as the second parameter to the bivariate functions p and R. Henceforth, it will be more consistent with the usual notation in the literature to use x to denote g. We will also use the notation C^{1,2} to denote the class of bivariate functions that are continuously differentiable in their first argument and twice continuously differentiable in their second argument.

Theorem 5.3 (Itô's formula). Let f : ℝ_{≥0} × ℝ → ℝ be C^{1,2}. Then, almost surely,

    f(T, |B_T|) − f(0, |B_0|) = ∫₀^T ∂_x f(t, |B_t|) d|B_t| + ∫₀^T [ ∂_t f(t, |B_t|) + (1/2) ∂_xx f(t, |B_t|) ] dt.    (5.3)

The integrand of the second integral is an important quantity arising in PDEs and stochastic processes (see, e.g., [20, pp. 263]). We will denote it by ∗∆f(t, x) := ∂_t f(t, x) + (1/2) ∂_xx f(t, x). Some discussion about the statement of Theorem 5.3 appears in Appendix F.7.

Applying Itô's formula to the continuous regret.
Comparing (5.1) and (5.3), it is natural to assume that p = ∂_x f for a function f that is C^{1,2} with f(0,0) = 0, ∂_x f ∈ [0,1], and ∂_x f(t,0) = 1/2; the latter two conditions are needed for Definition 5.1 to be applicable. Itô's formula then yields

    ContRegret(T, p = ∂_x f, B) = ∫₀^T ∂_x f(t, |B_t|) d|B_t| = f(T, |B_T|) − ∫₀^T ∗∆f(t, |B_t|) dt.    (5.4)

Path independence and the backward heat equation.
At this point a useful idea arises: as a thought experiment, suppose that ∗∆f = 0. Then the second integral would vanish, and we would have the appealing expression ContRegret(T, p, B) = f(T, |B_T|). Moreover, since f is a deterministic function, the right-hand side depends only on |B_T| rather than the entire Brownian path B|_{[0,T]}. Thus, the same must be true of the left-hand side: at time T, the continuous regret of the algorithm p depends only on T and |B_T| (the gap). We say that such an algorithm has path independent regret. Our supposition that led to these attractive consequences was only that ∗∆f = 0, which turns out to be a well studied condition.

Definition 5.4.
Let f : ℝ_{>0} × ℝ → ℝ be a C^{1,2} function. If ∗∆f(t, x) = 0 for all (t, x) ∈ ℝ_{>0} × ℝ then we say that f satisfies the backward heat equation. A synonymous statement is that f is space-time harmonic. We may summarize the preceding discussion with the following proposition.

Proposition 5.5.
Let f : ℝ_{>0} × ℝ → ℝ be a C^{1,2} function that satisfies ∗∆f = 0 everywhere with f(0,0) = 0. Let p = ∂_x f. Then,

    ∫₀^T p(t, |B_t|) d|B_t| = f(T, |B_T|).    (5.5)

Suppose that a function f satisfies the hypothesis of Proposition 5.5 and in addition p = ∂_x f ∈ [0,1] with p(t,0) = 1/2. Then, we would have

    ContRegret(T, p, B) = f(T, |B_T|).    (5.6)

We are unable to derive a function that satisfies the properties required for (5.6) to hold along with max_{x≥0} f(T, x) ≤ γ√T/2. Instead, we will begin by relaxing the constraint that p(t, x) ∈ [0,1] and allow p(t, x) to be negative. We will overload the notation ContRegret(·) to include such functions. In the next section, we will derive a family of such functions that all achieve ContRegret(T, p, B) = f(T, |B_T|) = O(√T). This is done by setting up and solving the backward heat equation. Next, we use a smoothing argument to obtain a family of functions that all achieve ContRegret(T, p, B) = O(√T), and that do satisfy p(t, x) ∈ [0,1]. Finally, we will optimize ContRegret(T, ·, B) over this family of functions to prove Theorem 5.2.

5.2.1 Satisfying the backward heat equation

The main result of this section is the derivation of a family of functions p̃ : ℝ_{>0} × ℝ → ℝ that satisfy

    p̃(t, x) ≤ 1,  p̃(t, 0) = 1/2  and  ContRegret(T, p̃, B) = f(T, |B_T|) = O(√T),    (5.7)

but do not necessarily satisfy p̃(t, x) ≥ 0.
The first step is to find a function f which satisfies the partial differential equation ∗∆f = 0. Since the boundary condition p̃(t, 0) = 1/2 is a condition on p̃ = ∂_x f, not on f itself, it will be convenient to solve a PDE for p̃ instead, and then derive f by integrating. However, some care is needed since not all antiderivatives of p̃ (in x) will satisfy the backward heat equation. Fortunately, we have a useful lemma showing that if p̃ satisfies the backward heat equation, then we can construct an f that also does. This is proven in Appendix F.1.

Lemma 5.6.
Suppose that h : ℝ_{>0} × ℝ → ℝ is a C^{1,2} function. Define

    f(t, x) := ∫₀^x h(t, y) dy − (1/2) ∫₀^t ∂_x h(s, 0) ds.

Then,
(1) f ∈ C^{1,2},
(2) if ∗∆h = 0 over ℝ_{>0} × ℝ then ∗∆f = 0 over ℝ_{>0} × ℝ, and
(3) h = ∂_x f.

Defining boundary conditions for p̃. Obtaining a particular solution to the backward heat equation requires sufficient boundary conditions in order to uniquely identify p̃. The boundary condition mentioned above is that p̃(t, 0) = 1/2 for all t. This condition together with the backward heat equation clearly does not suffice to uniquely determine p̃. Therefore, we impose some reasonable boundary conditions on p̃.
What should the value be at the boundary? Intuitively, x ↦ p̃(t, x) should be a decreasing function because p̃ represents the weight placed on the worst expert. Therefore, it is natural to consider an "upper boundary" which specifies the point at which the difference in the experts' total costs is so great that the algorithm places zero weight on the worst expert. The upper boundary can be specified by a curve {(t, φ(t)) : t > 0} for some continuous function φ : ℝ_{>0} → ℝ_{>0}. We will incorporate this idea by requiring p̃(t, φ(t)) = 0 for all t > 0.
Where should the boundary be? One reasonable choice for the boundary is to use φ_α(t) = α√t for some constant α > 0, as this is similar to the boundary used by the random adversary in the lower bound of Section 4. These conditions are combined into the following partial differential equation:

    (backward heat equation)  ∂_t u(t, x) + (1/2) ∂_xx u(t, x) = 0    for all (t, x) ∈ ℝ_{>0} × ℝ    (5.8)
    (upper boundary)          u(t, α√t) = 0    for all t > 0    (5.9)
    (lower boundary)          u(t, 0) = 1/2    for all t > 0.    (5.10)

Next we show that the following function solves this PDE. Define p̃_α : ℝ_{>0} × ℝ → ℝ by

    p̃_α(t, x) := (1/2) ( 1 − erfi(x/√(2t)) / erfi(α/√2) ).    (5.11)

Lemma 5.7. p̃_α satisfies the following properties:
(1) p̃_α is C^{1,2} over ℝ_{>0} × ℝ,
(2) p̃_α satisfies the constraints in (5.8), (5.9) and (5.10), and
(3) for all t > 0 and all x ≥ 0, p̃_α(t, x) ≤ 1/2.

The proof of Lemma 5.7 appears in Appendix F.2. It shows that p̃_α(t, x) nearly defines a valid continuous-time algorithm, in that it satisfies the conditions of Definition 5.1 except for non-negativity. Next, we will integrate p̃_α as described in Lemma 5.6. Define the function ˜R_α : ℝ_{>0} × ℝ → ℝ as

    ˜R_α(t, x) = x/2 + κ_α √t · M₀(x²/2t)    where    κ_α = 1/(√(2π) · erfi(α/√2)).    (5.12)

Here M₀ is the confluent hypergeometric function defined in (A.2).

Lemma 5.8. ˜R_α(t, x) = ∫₀^x p̃_α(t, y) dy − (1/2) ∫₀^t ∂_x p̃_α(s, 0) ds.

The proof of Lemma 5.8 appears in Appendix F.3. By Lemma 5.7, the function p̃_α satisfies the hypothesis on the function h in Lemma 5.6. Hence, we can apply Lemma 5.6 with h = p̃_α and f = ˜R_α to assert the following properties of ˜R_α.

Lemma 5.9. ˜R_α satisfies the following properties:
(1) ˜R_α is C^{1,2},
(2) ˜R_α satisfies ∗∆˜R_α = 0 over ℝ_{>0} × ℝ,
(3) ∂_x ˜R_α(t, x) = p̃_α(t, x).

Since erfi(·) is a strictly increasing function with erfi(0) = 0, observe that p̃_α has exactly one root at x = α√t. Therefore, for every T, we have

    ContRegret(T, p̃_α, B) = ˜R_α(T, |B_T|) ≤ max_{x ≥ 0} ˜R_α(T, x) ≤ ( α/2 + κ_α M₀(α²/2) ) √T.    (5.13)

This establishes (5.7), as desired.
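The closed forms (5.11) and (5.12) lend themselves to a numerical sanity check of Lemmas 5.7 and 5.9 (purely illustrative, outside the proofs; erfi and M₀ are implemented from their power series, and the test point (t, x) = (2, 0.8) with α = 1.2 is an arbitrary choice):

```python
import math

def erfi(x, terms=80):
    """erfi(x) = (2/sqrt(pi)) * sum_k x^(2k+1) / (k! (2k+1))."""
    s, t = 0.0, x  # t holds x^(2k+1) / k!
    for k in range(terms):
        s += t / (2 * k + 1)
        t *= x * x / (k + 1)
    return 2.0 / math.sqrt(math.pi) * s

def M0(z, terms=120):
    """M(-1/2, 1/2, z) via its power series."""
    total, term = 1.0, 1.0
    a, b = -0.5, 0.5
    for n in range(terms):
        term *= (a + n) * z / ((b + n) * (n + 1))
        total += term
    return total

alpha = 1.2
kappa = 1.0 / (math.sqrt(2 * math.pi) * erfi(alpha / math.sqrt(2)))

def p_tilde(t, x):  # (5.11)
    return 0.5 * (1.0 - erfi(x / math.sqrt(2 * t)) / erfi(alpha / math.sqrt(2)))

def R_tilde(t, x):  # (5.12)
    return x / 2.0 + kappa * math.sqrt(t) * M0(x * x / (2 * t))

t, x, h = 2.0, 0.8, 1e-4
# boundary conditions (5.10) and (5.9)
assert abs(p_tilde(t, 0.0) - 0.5) < 1e-12
assert abs(p_tilde(t, alpha * math.sqrt(t))) < 1e-10
# Lemma 5.9(3): the x-derivative of R_tilde is p_tilde
dRdx = (R_tilde(t, x + h) - R_tilde(t, x - h)) / (2 * h)
assert abs(dRdx - p_tilde(t, x)) < 1e-6
# Lemma 5.9(2): backward heat equation  d_t R + (1/2) d_xx R = 0
dRdt = (R_tilde(t + h, x) - R_tilde(t - h, x)) / (2 * h)
dRdxx = (R_tilde(t, x + h) - 2 * R_tilde(t, x) + R_tilde(t, x - h)) / (h * h)
assert abs(dRdt + 0.5 * dRdxx) < 1e-4
print("checks passed")
```

The finite-difference tolerances are loose on purpose; the point is only that the normalization κ_α in (5.12) is exactly the one making ∂_x ˜R_α = p̃_α.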
The only remaining step is to modify p̃_α so that it lies in the interval [0, 1/2]. We will modify p̃_α in the most natural way: by modifying all negative values to be zero. Specifically, we set

    p_α(t, x) := 1/2 if t = 0, and (p̃_α(t, x))_+ = ( (1/2)(1 − erfi(x/√(2t))/erfi(α/√2)) )_+ if t > 0.    (5.14)

Here, we use the notation (x)_+ = max{0, x}. Note that p_α(t, 0) = 1/2 for all t > 0 and p_α(t, x) ∈ [0, 1/2] for all t, x ≥ 0. So p_α defines a valid continuous-time algorithm. From (5.14), we obtain a truncated version of ˜R_α as

    R_α(t, x) := 0 if t = 0;  ˜R_α(t, x) if t > 0 and x ≤ α√t;  ˜R_α(t, α√t) if t > 0 and x ≥ α√t.    (5.15)

It is straightforward to verify that ∂_x R_α = p_α. This is because for x ≤ α√t, we have p_α(t, x) = p̃_α(t, x) and R_α(t, x) = ˜R_α(t, x) (we have computed the derivatives in Lemma 5.9); in addition, R_α(t, x) is constant for x ≥ α√t, so its derivative is 0.
If R_α were sufficiently smooth then we could immediately apply (5.6) (or Theorem 5.3) to obtain a formula for the regret of p_α. The only flaw is that ∂_xx R_α is not well-defined on the curve {(t, α√t) : t > 0}, so R_α is not in C^{1,2} and Theorem 5.3 cannot be applied directly. The reader who believes that this issue is unlikely to be problematic may wish to take Lemma 5.10 on faith and skip ahead to Subsection 5.3.

Figure 1: The relationships between p̃_α, ˜R_α, R_{α,n}, p_α, and R_α.

Lemma 5.10.
Fix α > 0. Then, almost surely, for all T ≥ 0,

    ContRegret(T, p_α, B) ≤ R_α(T, |B_T|).

Here, we will present a high-level overview of the proof of this lemma; the details can be found in Appendix F.4. Let φ(x) be a smooth function satisfying φ(x) = 1 for x ≤ 0 and φ(x) = 0 for x ≥ 1. For n ∈ ℕ, define φ_n(x) = φ(nx) and the approximations

    R_{α,n}(t, x) := ˜R_α(t, x) φ_n(x − α√t) + ˜R_α(t, α√t)(1 − φ_n(x − α√t)).

It is relatively straightforward to check that R_{α,n}(t, x) → R_α(t, x) pointwise as n → ∞, and similarly for the derivatives. The important property is that R_{α,n} is smooth, so Itô's formula may be applied. Lemma 5.10 is then proved by taking limits and controlling the error terms.
The remainder of this section proves Theorem 5.2 by setting p* = p_α for the optimal α. By Lemma 5.10,

    ContRegret(T, ∂_x R_α, B) ≤ R_α(T, |B_T|) ≤ R_α(T, α√T),

where the last inequality is because ∂_x R_α(t, x) = p_α(t, x) is positive for x ∈ [0, α√t) and 0 for x ≥ α√t. Define

    h(α) := R_α(1, α) = α/2 + κ_α M₀(α²/2)

and note that R_α(T, α√T) = √T · h(α). Thus, the only remaining task is now to solve the following optimization problem:

    min_{α>0} h(α) = min_{α>0} { α/2 + κ_α · M₀(α²/2) }.    (5.16)

The following lemma verifies that there exists some α for which ContRegret(T, ∂_x R_α, B) ≤ γ√T/2, completing the proof of Theorem 5.2.

Lemma 5.11.
Fix T > 0. Then min_α R_α(T, α√T) = R_γ(T, γ√T) = γ√T/2.

Lemma 5.11 follows easily from the following claim, whose proof appears in Appendix F.6.

Claim 5.12. h′(α) = −exp(α²/2) / (π · erfi(α/√2)²) · M₀(α²/2). In particular, h′(α) < 0 for α ∈ (0, γ), h′(γ) = 0, and h′(α) > 0 for α ∈ (γ, ∞).

Proof of Lemma 5.11. Claim 5.12 implies that γ is the global minimizer of h(α). Therefore, for every α > 0, we have R_α(T, α√T) = √T · h(α) ≥ √T · h(γ) = R_γ(T, γ√T). This proves the first equality. The second equality is because M₀(γ²/2) = 0 by definition of γ.

A Standard facts
A.1 Basic facts about confluent hypergeometric functions
For any a, b ∈ ℝ with b ∉ ℤ_{≤0}, the confluent hypergeometric function of the first kind is defined as

    M(a, b, z) = Σ_{n=0}^∞ (a)_n z^n / ((b)_n n!),    (A.1)

where (x)_n := ∏_{i=0}^{n−1} (x + i) is the Pochhammer symbol. See, e.g., Abramowitz and Stegun [2, Eq. (13.1.2)]. For notational convenience, for i ∈ {0, 1, 2, …}, we write

    M_i(x) = M(i − 1/2, i + 1/2, x).    (A.2)

Fact A.1. If b ∉ ℤ_{≤0} then (d/dx) M(a, b, x) = (a/b) · M(a+1, b+1, x). Consequently,
(1) M₀′(x) = −M₁(x); and
(2) M₁′(x) = (1/3) · M₂(x).

Proof.
See [2, Eq. (13.4.9)].
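Fact A.1 is also easy to confirm numerically from the series (A.1); in this sketch (our own illustration) the sample point a = −1/2, b = 1/2, x = 0.7 is arbitrary:

```python
def M(a, b, z, terms=120):
    """Confluent hypergeometric function M(a, b, z) via the series (A.1)."""
    total, term = 1.0, 1.0
    for n in range(terms):
        term *= (a + n) * z / ((b + n) * (n + 1))
        total += term
    return total

a, b, x, h = -0.5, 0.5, 0.7, 1e-5
lhs = (M(a, b, x + h) - M(a, b, x - h)) / (2 * h)  # finite-difference d/dx M(a, b, x)
rhs = (a / b) * M(a + 1, b + 1, x)                 # right-hand side of Fact A.1
assert abs(lhs - rhs) < 1e-6
```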
Fact A.2.
The following identities hold:
(1) M₀(x) = −√(πx) erfi(√x) + e^x.
(2) M₁(x) = √π erfi(√x) / (2√x).
(3) M₂(x) = 3(2e^x √x − √π erfi(√x)) / (4x^{3/2}).
(4) (2/3) · M₂(x) · x + M₁(x) = e^x.

Proof. (2): See [2], equations (7.1.21) or (13.6.19), and use that erfi(x) = −i·erf(ix), where i = √−1.
(1): Differentiating the right-hand side (using the definition of erfi in (2.3)) yields −√π erfi(√x)/(2√x). So the right-hand side is an antiderivative of −M₁(x), by part (2). Thus, the identity (1) follows from Fact A.1(1) and the initial condition M₀(0) = 1.
(3): This follows directly by differentiating (2) and using Fact A.1(2).
(4): Immediate from (2) and (3).

Fact A.3.
The function M₀(x) is decreasing and concave on [0, ∞).

Remark. In fact, M₀(x) is decreasing and concave on ℝ, but we will not require this fact.

Proof. By Fact A.1, we have M₀′(x) = −M₁(x) and M₀′′(x) = −(1/3) · M₂(x). Note that the coefficients of M₁(x), M₂(x) in their Taylor series are all non-negative. As x ≥ 0, we have that M₀′(x), M₀′′(x) ≤ 0, as desired.

Fact A.4.
The function x ↦ M₀(x²/2) has a unique positive root at x = γ. Moreover, M₀(x²/2) > 0 for x ∈ (0, γ) and M₀(x²/2) < 0 for x ∈ (γ, ∞).

Proof.
The Maclaurin expansion of M₀(x²/2) is given by

    M₀(x²/2) = 1 − Σ_{k=1}^∞ x^{2k} / ((2k−1) · k! · 2^k).

Note that M₀(0) = 1. It is clear from the series expansion above (and Fact A.3) that M₀(x²/2) is strictly decreasing in x on (0, ∞) and lim_{x→∞} M₀(x²/2) = −∞. Hence, M₀(x²/2) has a positive root γ and it is unique. Finally, it is clear that M₀(x²/2) is positive on (0, γ) and negative on (γ, ∞).

Claim A.5. For any ε > 0, there exists a_ε ∈ (−1, −1/2) such that the smallest positive root c_ε of z ↦ M(a_ε, 1/2, z²/2) satisfies c_ε ≥ γ − ε.

Proof.
Following Perkins' notation [38], let λ(−c, c) be such that c is the smallest positive root of x ↦ M(−λ(−c, c), 1/2, x²/2). By [38, Proposition 1], the map c ↦ λ(−c, c) is strictly decreasing and continuous on ℝ_{>0}, so it has a continuous inverse α. From (2.4) and Fact A.2(1), we see that λ(−γ, γ) = 1/2, hence α(1/2) = γ. By continuity, for all ε > 0, there exists δ ∈ (0, 1/2) such that α(1/2 + δ) > γ − ε. Then we may take a_ε = −(1/2 + δ) and c_ε = α(1/2 + δ).

A.2 Other standard facts
Fact A.6.
Suppose f : R → R is concave. Then for any α < β , the function g ( t ) = f ( t + β ) − f ( t + α ) is non-increasing. Fact A.7.
Suppose that f : R → R is concave. Let α < β . Then f ( x ) ≥ min { f ( α ) , f ( β ) } for all x ∈ [ α, β ] . B Application to binary sequence prediction
Here we discuss the application of our results to the problem of binary sequence prediction, as discussed by Feder et al. [23]. At each time step t ≥ 1, the algorithm must randomly predict whether the next bit is a 0 or a 1, and the adversary chooses the bit's true value b_t. For any finite sequence b ∈ {0,1}*, let π_s(b) be the smallest fraction of errors achieved by any s-state predictor (that may be chosen with knowledge of b). Let π̂(b) denote the expected fraction of errors achieved by some online algorithm (or "universal sequential predictor"), whose behavior is independent of |b|.
The main objective of Feder et al. is to study algorithms for which π̂(b) approximates π_s(b). In particular, their Theorem 1 describes an algorithm for which π̂(b) − π₁(b) ≤ 1/√(2t) + 1/t for all b ∈ {0,1}*, where t = |b|. They then build on this result to approximate any s-state predictor. They appear to have made efforts to optimize the constant multiplying 1/√t; see remark 2 on page 1260 and the final paragraph of their Appendix A. We determine the optimal convergence rate for the problem considered by Feder et al.

Theorem B.1.
There is an algorithm achieving π̂(b) − π₁(b) ≤ γ/(2√t) for all b ∈ {0,1}*, where t = |b|. Moreover, no algorithm can achieve such a guarantee with a constant smaller than γ/2.

Proof sketch.
The universal sequential prediction problem reduces easily to the problem of bounding anytime regret for prediction with two experts. Intuitively, one expert always predicts that the next bit is 0, whereas the other expert always predicts that it is 1. The adversary chooses a cost vector [0,1] or [1,0] to indicate which expert's prediction was correct. The quantity t · π₁(b) equals the cost of the best expert, and t · π̂(b) equals the expected cost of the algorithm, so t · (π̂(b) − π₁(b)) equals the regret. If Algorithm 1 is used for the random prediction, then Theorem 2.1 implies the first statement of the theorem.
Conversely, for any sequential predictor, we may use our adversaries from the proof of Theorem 2.1 to generate the binary sequence (since they only use cost vectors [0,1] or [1,0]). For any ε > 0, there is an adversary that ensures that the regret is at least (γ − ε)√t/2 at some time t. It follows that there exists b ∈ {0,1}* for which π̂(b) − π₁(b) ≥ (γ − ε)/(2√t). Taking ε → 0, the second statement follows.

C Technical results from Section 3
C.1 Proof of Lemma 3.1
The following two lemmas are essentially special cases of Lemma F.1, since ˜R_γ = ˜R and R_γ = R. We restate them here without the subscript for convenience.

Lemma C.1. Consider the function ˜R(t, g) = g/2 + κ√t · M₀(g²/2t). Then

    ∂/∂g ˜R(t, g) = (1/2) ( 1 − erfi(g/√(2t)) / erfi(γ/√2) ).

Lemma C.2. ∂/∂g R(t, g) = ( (1/2) ( 1 − erfi(g/√(2t)) / erfi(γ/√2) ) )_+.

Proof (of Lemma 3.1). The fact that R(t, g) is non-decreasing in g follows from Lemma C.2. The concavity of R(t, g) (in g) follows from the fact that erfi is non-decreasing, so ∂/∂g R(t, g) is non-increasing in g.

C.2 Proof of Lemma 3.5
Proof (of Lemma 3.5). By telescoping, f(T, g_T) − f(0, g₀) = Σ_{t=1}^T ( f(t, g_t) − f(t−1, g_{t−1}) ). Consider a fixed t ∈ [T]. We can write

    f(t, g_t) − f(t−1, g_{t−1}) = ( f(t, g_t) − [f(t, g_{t−1}+1) + f(t, g_{t−1}−1)]/2 ) + ( [f(t, g_{t−1}+1) + f(t, g_{t−1}−1)]/2 − f(t−1, g_{t−1}) ).    (C.1)

For the first bracketed term, by considering the cases g_t = g_{t−1}+1 and g_t = g_{t−1}−1, we have

    f(t, g_t) − [f(t, g_{t−1}+1) + f(t, g_{t−1}−1)]/2 = ([f(t, g_{t−1}+1) − f(t, g_{t−1}−1)]/2) · (g_t − g_{t−1}) = f_g(t, g_{t−1}) · (g_t − g_{t−1}).    (C.2)

Note that the above step is the only place where the assumption |g_t − g_{t−1}| = 1 is used. For the second bracketed term, we have

    [f(t, g_{t−1}+1) + f(t, g_{t−1}−1)]/2 − f(t−1, g_{t−1}) = [f(t, g_{t−1}+1) + f(t, g_{t−1}−1) − 2f(t, g_{t−1})]/2 + ( f(t, g_{t−1}) − f(t−1, g_{t−1}) ) = (1/2) f_gg(t, g_{t−1}) + f_t(t, g_{t−1}).

This gives the desired formula.
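Since the decomposition just proved is a purely algebraic identity, it can be checked mechanically on an arbitrary test function and an arbitrary ±1-increment gap sequence (an illustrative sketch; the function f below is an arbitrary choice, and f_g, f_gg, f_t denote the discrete derivatives used above):

```python
import math, random

def f(t, g):  # an arbitrary test function (our choice)
    return math.sin(0.3 * g) + 0.05 * g * g + 0.1 * t * g

f_g  = lambda t, g: (f(t, g + 1) - f(t, g - 1)) / 2.0        # discrete d/dg
f_gg = lambda t, g: f(t, g + 1) - 2 * f(t, g) + f(t, g - 1)  # discrete d^2/dg^2
f_t  = lambda t, g: f(t, g) - f(t - 1, g)                    # discrete d/dt

random.seed(0)
g = [0]
for _ in range(200):                  # any sequence with |g_t - g_{t-1}| = 1
    g.append(g[-1] + random.choice([-1, 1]))

T = len(g) - 1
lhs = f(T, g[T]) - f(0, g[0])
rhs = sum(f_g(t, g[t - 1]) * (g[t] - g[t - 1]) for t in range(1, T + 1)) \
    + sum(0.5 * f_gg(t, g[t - 1]) + f_t(t, g[t - 1]) for t in range(1, T + 1))
assert abs(lhs - rhs) < 1e-9          # the identity of Lemma 3.5 holds exactly
```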
C.3 Proof of Lemma 3.6
Lemma C.3.
For all u ∈ [0, 1/2], we have M₀(u) ≥ √(1 − 2u).

Proof.
The Maclaurin expansion of M₀(u) is given by

    M₀(u) = 1 − Σ_{k=1}^∞ u^k / ((2k−1) · k!).

Also,

    d^k/dx^k √(1−x) = −(2k−3)!! / ( 2^k (1−x)^{(2k−1)/2} ),

where (n)!! denotes the double factorial (note that (−1)!! = 1). Hence, the Maclaurin expansion of √(1−2u) is

    √(1−2u) = 1 − Σ_{k=1}^∞ ( (2k−3)!! / k! ) u^k.

It is not hard to verify that (2k−3)!! ≥ 1/(2k−1). This implies that M₀(u) ≥ √(1−2u).

Lemma C.4.
For all z ∈ [0, 1] and x ∈ ℝ, we have

    M₀((x+z)²/2) + M₀((x−z)²/2) ≥ 2√(1−z²) · M₀( x²/(2(1−z²)) ).

Proof.
Fix z ∈ [0, 1] and consider the function

    h_z(x) = M₀((x+z)²/2) + M₀((x−z)²/2) − 2√(1−z²) · M₀( x²/(2(1−z²)) ).

Note that h_z(0) ≥ 0 by applying Lemma C.3 with u = z²/2. We will show that x = 0 is the minimizer of h_z, which implies the lemma.
Indeed, computing derivatives, we have

    h_z′(x) = −M₁((x+z)²/2) · (x+z) − M₁((x−z)²/2) · (x−z) + 2 M₁( x²/(2(1−z²)) ) · x/√(1−z²).

As h_z′(0) = 0, x = 0 is a critical point of h_z. We will now show that h_z is convex, which certifies that x = 0 is indeed a minimizer.
To obtain h_z′′, we differentiate term-by-term. Let u = (x+z)²/2. Then

    d/dx [ M₁((x+z)²/2) · (x+z) ] = M₁′(u) · (x+z)² + M₁(u) = (2/3) M₂(u) · u + M₁(u) = e^u = exp((x+z)²/2).

The first equality is by Fact A.1 and the last equality is by identity (4) in Fact A.2. We can similarly show that

    d/dx [ M₁((x−z)²/2) · (x−z) ] = exp((x−z)²/2).

(If n ∈ ℤ_{≥0}, we define (n)!! = ∏_{k=0}^{⌈n/2⌉−1} (n − 2k). If n ∈ ℤ_{<0}, we define (n)!! via the recursive relation (n)!! = (n+2)!!/(n+2), so that (−1)!! = (1)!!/1 = 1.)
Moreover,

    d/dx [ M₁( x²/(2(1−z²)) ) · x/√(1−z²) ]
    = M₁′( x²/(2(1−z²)) ) · x²/(1−z²)^{3/2} + M₁( x²/(2(1−z²)) ) / √(1−z²)
    = (1/√(1−z²)) ( (1/3) M₂( x²/(2(1−z²)) ) · x²/(1−z²) + M₁( x²/(2(1−z²)) ) )
    = exp( x²/(2(1−z²)) ) / √(1−z²),

where the first equality uses Fact A.1 and the last equality is by identity (4) in Fact A.2. Hence, we have

    h_z′′(x) = ( 2 exp( x²/(2(1−z²)) ) − ( exp((x+z)²/2) + exp((x−z)²/2) ) √(1−z²) ) / √(1−z²).

So to check that h_z′′(x) ≥ 0 for all x ∈ ℝ, it suffices to check that

    ( e^{(x+z)²/2} + e^{(x−z)²/2} ) √(1−z²) ≤ 2 e^{x²/(2(1−z²))}.
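(This elementary inequality can also be sanity-checked numerically over a grid before proving it; the grid bounds below are arbitrary and the check is illustrative only.)

```python
import math

ok = True
for zi in range(1, 10):                 # z = 0.1, ..., 0.9
    z = zi / 10.0
    for xi in range(-40, 41):           # x in [-4, 4] in steps of 0.1
        x = xi / 10.0
        lhs = (math.exp((x + z) ** 2 / 2) + math.exp((x - z) ** 2 / 2)) * math.sqrt(1 - z * z)
        rhs = 2 * math.exp(x * x / (2 * (1 - z * z)))
        ok = ok and lhs <= rhs + 1e-9   # small tolerance for floating-point noise
assert ok
```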
Indeed, we have

    ( e^{(x+z)²/2} + e^{(x−z)²/2} ) √(1−z²) ≤ ( e^{(x+z)²/2} + e^{(x−z)²/2} ) e^{−z²/2} = e^{x²/2} ( e^{xz} + e^{−xz} ) ≤ 2 e^{x²/2} e^{x²z²/2} = 2 e^{x²(1+z²)/2} ≤ 2 e^{x²/(2(1−z²))},

where the first inequality is because 1 − a ≤ e^{−a} for all a ∈ ℝ, the second inequality is because (e^a + e^{−a})/2 ≤ e^{a²/2} for all a ∈ ℝ, and the last inequality is because 1 + a ≤ 1/(1−a) for all a < 1. This proves that h_z is convex, which concludes the proof that x = 0 is a minimizer for h_z and hence completes the proof of the lemma.

Proof (of Lemma 3.6). The inequality R_t(t, g) + (1/2) R_gg(t, g) ≥ 0 is equivalent to

    R(t, g+1) + R(t, g−1) ≥ 2 R(t−1, g).    (C.3)

We first prove the claim for t = 1. In this case, the RHS of (C.3) is identically 0. On the other hand, the LHS of (C.3) is non-decreasing in g by Lemma 3.1. Hence, it suffices to prove the inequality for g = 0. With t = 1 and g = 0, we have R(1, 1) + R(1, −1) = 2κ M₀(1/2). As M₀ is decreasing (Fact A.3) and 1/2 ≤ γ²/2, we have M₀(1/2) ≥ M₀(γ²/2) = 0. So (C.3) holds for t = 1 and g ≥ 0.
For the remainder of the proof, we assume that t > 1. Observe that γ√t − 1 ≤ γ√(t−1) ≤ γ√t + 1 (since t ≥ 2). (The inequality γ√t − 1 ≤ γ√(t−1) is equivalent to √t − √(t−1) ≤ 1/γ. As t ↦ √t is concave and t ≥ 2, the LHS is maximized at t = 2 (Fact A.6). Hence, the inequality is true provided √2 − 1 ≤ 1/γ, which holds numerically since γ ≈ 1.3069 < 1/(√2 − 1) ≈ 2.414.) We will consider a few cases depending on the value of g.

Case 1: g ≤ γ√t − 1. In this case, g + 1 ≤ γ√t, g ≤ γ√(t−1), and g − 1 ≤ γ√t. Hence,

    R(t, g+1) = (g+1)/2 + κ√t · M₀((g+1)²/2t),
    R(t, g−1) = (g−1)/2 + κ√t · M₀((g−1)²/2t),
    R(t−1, g) = g/2 + κ√(t−1) · M₀( g²/(2(t−1)) ).

So (C.3) is equivalent to

    √t · M₀((g+1)²/2t) + √t · M₀((g−1)²/2t) ≥ 2√(t−1) · M₀( g²/(2(t−1)) ),    (C.4)

or, rearranging, is equivalent to

    M₀((g+1)²/2t) + M₀((g−1)²/2t) ≥ 2√(1 − 1/t) · M₀( g²/(2(t−1)) ).

The latter inequality is true by Lemma C.4 using x = g/√t and z = 1/√t ∈ (0, 1].

Case 2: γ√t − 1 ≤ g ≤ γ√(t−1). Let ˜R be the function defined in Lemma C.1. In this case, we have

    R(t, g+1) = γ√t/2 = ˜R(t, γ√t) ≥ ˜R(t, g+1) = (g+1)/2 + κ√t · M₀((g+1)²/2t).

The inequality is by Lemma C.1, which implies that ˜R(t, g+1) is non-increasing for g ∈ (γ√t − 1, ∞). Using the lower bound on R(t, g+1), (C.3) is again implied by (C.4), and we have already verified that (C.4) is true.

Case 3: γ√(t−1) ≤ g. Note that for g ≥ γ√(t−1), the functions R(t−1, g) and R(t, g+1) are constant in g but R(t, g−1) is non-decreasing in g. Hence, it suffices to check (C.3) for g = γ√(t−1), which holds by Case 2.

D Analysis of Algorithm 1 for general cost vectors
In this section, we prove the upper bound of Theorem 2.1 in full generality.
Theorem D.1.
Let A be the algorithm described in Algorithm 1. For any adversary B (allowing any cost vectors ℓ_t ∈ [0, 1]²), we have sup_{t ≥ 1} Regret(2, t, A, B)/√t ≤ γ/2.

In Subsection 3.1, since the gap was integer-valued, the identity of the best expert could only change when the gap is exactly 0 (at which time there are two best experts). In general, the gap can be real-valued, so the best expert can switch abruptly, which affects our formula for the regret. We will need to generalize Proposition 2.3 to deal with this possibility. Let ∆R(t) = Regret(t) − Regret(t−1).

Proposition D.2.
Let g_{t−1} be the gap after time t−1 but before playing an action at time t. Let g_t be the gap after time t. Let p(t, g_{t−1}) denote the probability mass assigned to the worst expert at time t. Suppose that p(t, 0) = 1/2 for all t ≥ 1.

1. If a best expert at time t−1 remains a best expert at time t, then ∆R(t) = (g_t − g_{t−1})·p(t, g_{t−1}).

2. If a best expert at time t−1 is no longer a best expert at time t, then ∆R(t) = g_t − (g_t + g_{t−1})·p(t, g_{t−1}). Moreover, g_t + g_{t−1} ≤ 1.

The proof of this is very similar to that of Proposition 2.3 and appears in Appendix D.1.

Remark.
Note that, at any specific time, the set of best experts may have size either one or two, so the choice of the best expert in Proposition D.2 may be ambiguous. However, note that if g_{t−1} = 0 (i.e., there are two best experts at time t−1) then p(t, g_{t−1}) = 1/2, so both formulas give ∆R(t) = g_t/2. On the other hand, if g_t = 0 (i.e., there are two best experts at time t) then both formulas give ∆R(t) = −g_{t−1}·p(t, g_{t−1}). Hence there is no issue with the ambiguity.

We will need the following identity, which is essentially the same as Lemma 3.5 but without specializing to the case where |g_t − g_{t−1}| = 1.

Lemma D.3.
Let g_0, g_1, g_2, … be a sequence of real numbers. Then for any function f and any fixed time T ≥ 1, we have

f(T, g_T) − f(0, g_0) = Σ_{t=1}^T [ f(t, g_t) − ½·(f(t, g_{t−1}+1) + f(t, g_{t−1}−1)) ] + Σ_{t=1}^T [ ½·f_gg(t, g_{t−1}) + f_t(t, g_{t−1}) ]. (D.1)

Proof.
The proof is identical to the proof of Lemma 3.5 except that we do not perform the simplification in (C.2).

When we assumed the gaps were integer-valued, we had ∆R(t) = R(t, g_t) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)) because both sides were equal to R_g(t, g_{t−1})·(g_t − g_{t−1}). This does not hold in the general setting, but we will be able to prove the following inequality.

Lemma D.4.
For all t ≥ 1, ∆R(t) ≤ R(t, g_t) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)).

The proof of Lemma D.4 appears in Appendix D.2. Given Lemma D.4, we can now prove our upper bound in general.

Proof (of Theorem D.1).
Fix any T ≥ 1. Then

R(T, g_T) − R(0, g_0)
= Σ_{t=1}^T [ R(t, g_t) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)) ] + Σ_{t=1}^T [ ½·R_gg(t, g_{t−1}) + R_t(t, g_{t−1}) ] (Lemma D.3)
≥ Σ_{t=1}^T ∆R(t) (Lemma D.4 and Lemma 3.6)
= Regret(T).

As g_0 = 0 and R(0, 0) = 0, we have Regret(T) ≤ R(T, g_T) ≤ γ√T/2, where the last inequality is by Lemma 3.2.

D.1 Proof of Proposition D.2
Proof (of Proposition D.2).
Fix t and, for notational convenience, let p = p(t, g_{t−1}) throughout the proof. In addition, throughout the proof, we use expert 1 to refer to the worst expert at time t−1 (chosen arbitrarily if the choice of worst expert is not unique) and use expert 2 to refer to the other expert. Let ℓ_{t,1}, ℓ_{t,2} ∈ [0, 1] be the respective losses at time t and L_{t,1}, L_{t,2} be the respective cumulative losses up to time t. Note that g_{t−1} = L_{t−1,1} − L_{t−1,2}. Finally, we set L*_t = min_{i ∈ [2]} L_{t,i}. By assumption, L*_{t−1} = L_{t−1,2}.

For the first assertion we have L*_t = L_{t,2} (because a best expert remains a best expert). Note that ℓ_{t,1} − ℓ_{t,2} = (L_{t,1} − L_{t,2}) − (L_{t−1,1} − L_{t−1,2}) = g_t − g_{t−1}. So the cost of the algorithm can be written as

p·ℓ_{t,1} + (1−p)·ℓ_{t,2} = p·(g_t − g_{t−1}) + ℓ_{t,2}.

On the other hand, L*_t − L*_{t−1} = L_{t,2} − L_{t−1,2} = ℓ_{t,2}. Subtracting this from the above display equation gives ∆R(t) = (g_t − g_{t−1})·p.

In the second assertion, we have L*_t = L_{t,1}. Again, the algorithm incurs cost p·ℓ_{t,1} + (1−p)·ℓ_{t,2}. This time, note that ℓ_{t,1} − ℓ_{t,2} = (L_{t,1} − L_{t,2}) − (L_{t−1,1} − L_{t−1,2}) = −g_t − g_{t−1}. So the algorithm incurs cost −p·(g_t + g_{t−1}) + ℓ_{t,2}. On the other hand,

L*_t − L*_{t−1} = L_{t,1} − L_{t−1,2} = L_{t,1} − L_{t−1,1} + L_{t−1,1} − L_{t−1,2} = ℓ_{t,1} + g_{t−1} = ℓ_{t,2} − g_t,

where the last equality uses the identity ℓ_{t,1} − ℓ_{t,2} = −g_t − g_{t−1}. Subtracting this last quantity from the change in the algorithm's cost gives ∆R(t) = g_t − p·(g_t + g_{t−1}).

To complete the proof for the second assertion, it remains to check that g_t + g_{t−1} ≤ 1. From above, we have the identity g_t + g_{t−1} = ℓ_{t,2} − ℓ_{t,1} ≤ ℓ_{t,2} ≤ 1, as desired.

D.2 Proof of Lemma D.4
Proof (of Lemma D.4).
Fix t ≥ 1. We will consider the two cases corresponding to the two cases in Proposition D.2.

Case 1: A best expert at time t−1 remains a best expert at time t. In this case, ∆R(t) = (g_t − g_{t−1})·p(t, g_{t−1}), so it suffices to check that

p(t, g_{t−1})·(g_t − g_{t−1}) ≤ R(t, g_t) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)). (D.2)

Rearranging, the above inequality is equivalent to

R(t, g_t) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)) − p(t, g_{t−1})·(g_t − g_{t−1}) ≥ 0.

If g_{t−1} is fixed then notice that the LHS of the above expression is concave in g_t. To see this, Lemma 3.1 implies that R(t, g_t) is concave in g_t, the second term is constant in g_t, and the last term is linear in g_t. Hence, it suffices to verify the inequality when g_t = g_{t−1} ± 1 (Fact A.7). Indeed, if |g_t − g_{t−1}| = 1 then

R(t, g_t) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)) = ½·(R(t, g_{t−1}+1) − R(t, g_{t−1}−1))·(g_t − g_{t−1}) = p(t, g_{t−1})·(g_t − g_{t−1}),

where the second equality used the definition of p.

Case 2: A best expert at time t−1 is no longer a best expert at time t. This case is nearly identical to the previous case, but in this case ∆R(t) = g_t − (g_t + g_{t−1})·p(t, g_{t−1}) with the promise that g_t + g_{t−1} ≤ 1. Hence, the inequality we need to verify is that

g_t − (g_t + g_{t−1})·p(t, g_{t−1}) ≤ R(t, g_t) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)). (D.3)

Once again, we do this via a concavity argument. Fix g_{t−1} ∈ [0, 1]. Since g_t + g_{t−1} ≤ 1, we have g_t ∈ [0, 1 − g_{t−1}]. Notice that the LHS of (D.3) is linear in g_t and the RHS of (D.3) is concave in g_t (by Lemma 3.1). Hence, it suffices to check the inequality assuming g_t ∈ {0, 1 − g_{t−1}}. Note that the case g_t = 0 is handled by Case 1 since the LHS of (D.2) and (D.3) are identical (see also the remark after Proposition D.2).

Now assume that g_t = 1 − g_{t−1}. Then (D.3) becomes

1 − g_{t−1} − p(t, g_{t−1}) ≤ R(t, 1 − g_{t−1}) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)).

Recall that p(t, g) = ½·(R(t, g+1) − R(t, g−1)), so that the above inequality is equivalent to

1 − g_{t−1} − ½·(R(t, g_{t−1}+1) − R(t, g_{t−1}−1)) ≤ R(t, 1 − g_{t−1}) − ½·(R(t, g_{t−1}+1) + R(t, g_{t−1}−1)).

Rearranging, the inequality becomes

1 ≤ g_{t−1} + R(t, 1 − g_{t−1}) − R(t, g_{t−1} − 1).

Note that g_{t−1} ≤ 1 ≤ γ√t (since t ≥ 1 and γ ≥ 1). Hence, by definition of R, the RHS of the above inequality is

g_{t−1} + R(t, 1 − g_{t−1}) − R(t, g_{t−1} − 1) = g_{t−1} + (1 − g_{t−1})/2 + κ√t·M((1 − g_{t−1})²/2t) − (g_{t−1} − 1)/2 − κ√t·M((g_{t−1} − 1)²/2t) = 1,

and obviously, 1 ≤ 1.

E Additional proofs for Section 4
Before proving Theorem 4.2, some preliminary definitions are required. For a martingale (X_t)_{t ∈ N}, define its maximum process X*_t = max_{0 ≤ s ≤ t} |X_s| and its quadratic variation process [X]_t = Σ_{1 ≤ s ≤ t} (X_s − X_{s−1})².

Theorem E.1 (Davis [17]).
There exists a constant C such that for any martingale (X_t)_{t ∈ N} with X_0 = 0, E[X*_∞] ≤ C·E[[X]_∞^{1/2}].

We will prove a more general variant of Theorem 4.2. To recover Theorem 4.2, we apply the following theorem with σ = 0 and then take expectations to get that E[Z_τ] = Z_0.

Theorem E.2.
Let (Z_t)_{t ∈ Z_{≥0}} be a martingale with respect to the filtration {F_t} and K > 0 a constant such that |Z_t − Z_{t−1}| ≤ K almost surely for all t. Let σ ≤ τ be stopping times and suppose that E[√τ] < ∞. Then the random variables Z_σ, Z_τ are almost surely well-defined and E[Z_τ | F_σ] = Z_σ.

Proof.
Define the stopped process Z_{t∧τ}, which is also a martingale [32, Theorem 10.15]. Since E[√τ] < ∞ we have Pr[τ < ∞] = 1. On the event {τ < ∞}, (Z_{t∧τ})_{t ≥ 0} has a well-defined limit, which is used as the almost sure definition of Z_τ. As {τ < ∞} ⊆ {σ < ∞}, the same argument shows that (Z_{t∧σ})_{t ≥ 0} has a well-defined limit, and we use this as the almost sure definition of Z_σ.

We claim that also Z_{t∧τ} → Z_τ ∈ L¹ and Z_{t∧σ} → Z_σ ∈ L¹ in L¹, from which the theorem concludes as follows. By the definition of conditional expectation, we need to check that E[Z_τ·1_A] = E[Z_σ·1_A] for all A ∈ F_σ. To that end, fix A ∈ F_σ and note that A ∩ {σ ≤ t} ∈ F_{σ∧t}. For any fixed t, t∧σ ≤ t∧τ ≤ t, so the optional sampling theorem [32, Theorem 10.11] applied to the stopped process yields E[Z_{t∧τ} | F_{t∧σ}] = Z_{t∧σ}. Hence,

E[Z_{τ∧t}·1_{A ∩ {σ ≤ t}}] = E[Z_{σ∧t}·1_{A ∩ {σ ≤ t}}]. (E.1)

Since Z_{τ∧t} → Z_τ in L¹, it follows that Z_{τ∧t}·1_{A ∩ {σ ≤ t}} → Z_τ·1_{A ∩ {σ < ∞}} in L¹. This is because

E[|Z_{τ∧t}·1_{A ∩ {σ ≤ t}} − Z_τ·1_{A ∩ {σ < ∞}}|] ≤ E[|Z_{τ∧t}·1_{A ∩ {σ ≤ t}} − Z_τ·1_{A ∩ {σ ≤ t}}|] + E[|Z_τ·1_{A ∩ {σ < ∞}} − Z_τ·1_{A ∩ {σ ≤ t}}|] ≤ E[|Z_{t∧τ} − Z_τ|] + E[|Z_τ|·1_{t < σ < ∞}].

The quantity E[|Z_{t∧τ} − Z_τ|] → 0 because Z_{t∧τ} → Z_τ in L¹. Next, Z_τ ∈ L¹ and 1_{t < σ < ∞} → 0 a.s., so E[|Z_τ|·1_{t < σ < ∞}] → 0 by dominated convergence. Finally, note that Z_τ·1_{A ∩ {σ < ∞}} = Z_τ·1_A as 1_{σ < ∞} = 1 a.s. Hence,

E[Z_{τ∧t}·1_{A ∩ {σ ≤ t}}] → E[Z_τ·1_A] as t → ∞. (E.2)

Similarly,

E[Z_{σ∧t}·1_{A ∩ {σ ≤ t}}] → E[Z_σ·1_A] as t → ∞. (E.3)

Combining (E.1), (E.2), and (E.3) gives E[Z_τ·1_A] = E[Z_σ·1_A], as desired.

It remains to show that Z_{τ∧t} → Z_τ ∈ L¹ and Z_{σ∧t} → Z_σ ∈ L¹ in L¹. We will only prove the convergence for Z_{τ∧t} as the two arguments are identical. The L¹ convergence is proven using the dominated convergence theorem [32, Corollary 6.26], which requires exhibiting a random variable that bounds |Z_{t∧τ}| for all t and has finite expectation. For notational convenience, let X_t = Z_{t∧τ}. Clearly |X_t| ≤ X*_t ≤ X*_∞, so it remains to show that E[X*_∞] < ∞. Using Theorem E.1 and that Z has increments bounded by K,

E[X*_∞] ≤ C·E[[X]_∞^{1/2}] = C·E[(Σ_{1 ≤ s ≤ τ} (Z_s − Z_{s−1})²)^{1/2}] ≤ CK·E[τ^{1/2}] < ∞.

The dominated convergence theorem then gives Z_{t∧τ} → Z_τ ∈ L¹ in L¹, as required.

Remark (on Theorem 4.3).
Breiman's result is not stated in exactly this form because he focused on the case a ∈ Z_{<0}, in which case M degenerates to a polynomial. One can show by direct calculation that the function θ(a) in his equation (2.6) is identical to our function M(a, 1/2, c²/2) for all a ∈ R.

An alternative approach is to use a result of Greenwood and Perkins [28, Theorem 5], which shows in a more general context that Pr[τ(c) > u] = u^{−λ(−c,c)}·π(u), where −λ(−c,c) is the largest non-positive eigenvalue of a certain Sturm–Liouville equation and π(u) is a "slowly-varying function". It is shown by Perkins [38, Proposition 1] that c is the smallest positive root of x ↦ M(−λ(−c,c), 1/2, x²/2).
A standard result [24, Lemma VIII.8.2] states that any slowly-varying function π satisfies π(u) = O(u^ε) for every ε > 0. This alternative approach suffices to prove Theorem 4.1 since (4.4) is unaffected by the slowly-varying function.

E.1 Large regret infinitely often
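As a non-rigorous illustration of the construction sketched in this subsection: the adversary makes the gap g_t evolve as a reflected ±1 random walk, and the argument relies on g_t exceeding a boundary of the form c√t infinitely often. The snippet below simulates this; the constant c = 1.0 and the function name `crossings` are ours, chosen only for illustration (the proof uses a constant c_ε close to γ).

```python
import random

def crossings(c, horizon, seed=0):
    """Simulate the gap g_t as a reflected +/-1 random walk and record the
    times t with g_t >= c * sqrt(t), i.e., hits of the square-root boundary."""
    rng = random.Random(seed)
    g, hits = 0, []
    for t in range(1, horizon + 1):
        step = rng.choice((-1, 1))
        g = abs(g + step)            # reflection of the walk at 0
        if g >= c * t ** 0.5:
            hits.append(t)
    return hits

hits = crossings(c=1.0, horizon=100_000)
# Note g_1 = 1 >= 1*sqrt(1), so t = 1 is always a hit; as the horizon grows,
# the boundary is (almost surely) crossed infinitely often.
```

Empirically, rerunning with larger horizons keeps producing new crossing times, matching the "infinitely often" behaviour that the stopping times τ_i formalize below.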
In this subsection, we sketch the following theorem.
Theorem E.3.
For any algorithm A and any ε > 0, there exists an adversary B_ε such that

lim sup_{t → ∞} Regret(2, t, A, B_ε)/√t ≥ (γ − ε)/2. (E.4)

Sketch.
We use the same adversary as in Theorem 4.1, so that Regret(t) ≥ Z_t + g_t/2, where Z_t is a martingale with Z_0 = 0 and g_t evolves as a reflected random walk. Let F_t := σ(g_1, …, g_t) be the natural filtration. Finally, let c_ε ≥ γ − ε be as in the proof of Theorem 4.1.

Define the stopping times τ_0 := 0 and τ_i := inf{t > τ_{i−1} : g_t ≥ c_ε·√t} for i ≥ 1. Note that, by the strong Markov property, for each i ≥ 1, the process {g_{τ_{i−1}+t}}_{t ≥ 0} is a reflected random walk started at position g_{τ_{i−1}} > 0. Moreover, observe that τ_i is similar to the stopping time used in Theorem 4.1 in that the asymptotics of the boundary are the same but the starting point is perturbed by a (random) additive constant. It is not hard to show (via [28, Theorem 5]) that E[√τ_i] < ∞. Hence, we can apply Theorem E.2 to obtain that E[Z_{τ_i} | F_{τ_{i−1}}] = Z_{τ_{i−1}} for all i ≥ 1.

We will now inductively construct a sequence of events which satisfy the conclusions of the theorem. To that end, define the events A_i = {τ_i < ∞, Z_{τ_i} ≥ … ≥ Z_{τ_1} ≥ 0}. For the base case, we have A_1 = {τ_1 < ∞, Z_{τ_1} ≥ 0}. In the proof of Theorem 4.1, we have already verified that Pr[A_1] > 0 (this also follows from the previous paragraph). For the inductive step, suppose that Pr[A_{i−1}] > 0. The condition that E[Z_{τ_i} | F_{τ_{i−1}}] = Z_{τ_{i−1}} implies that, for any B ∈ F_{τ_{i−1}} with Pr[B] > 0, the event B ∩ {τ_i < ∞, Z_{τ_i} ≥ Z_{τ_{i−1}}} has positive probability. Taking B = A_{i−1} implies that Pr[A_i] > 0.

To conclude, for any n ≥ 1, the event A_n has positive probability. Hence, there exists a sequence of times T_1, …, T_n < ∞ and loss vectors up to time T_n that guarantee g_{T_i} ≥ c_ε·√T_i for all i ∈ [n] and Z_{T_n} ≥ … ≥ Z_{T_1} ≥ 0. In particular, for all i ∈ [n],

Regret(T_i) ≥ Z_{T_i} + g_{T_i}/2 ≥ c_ε·√T_i/2.

As n ≥ 1 was arbitrary, the theorem follows. Verifying that E[√τ_i] < ∞ is the only non-rigorous portion of the proof.

F Additional proofs for Section 5
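Before the proofs, a purely illustrative numerical check of the closed forms used throughout this appendix: the solution q(t, x) = ½·(1 − erfi(x/√(2t))/erfi(α/√2)) of the backward heat equation from Lemma 5.7 with its boundary conditions q(t, 0) = 1/2 and q(t, α√t) = 0. The sketch is ours (not part of any proof); it implements erfi by its Maclaurin series so as to stay self-contained, and the choice α = 1.3 is an arbitrary positive constant.

```python
import math

ALPHA = 1.3  # any alpha > 0; illustrative choice only

def erfi(x):
    """Imaginary error function erfi(x) = (2/sqrt(pi)) * sum x^(2n+1)/(n!(2n+1)),
    via its Maclaurin series (accurate for the moderate |x| used here)."""
    a, s, n = x, x, 0          # a = x^(2n+1)/n!, s = running sum of a/(2n+1)
    while abs(a) > 1e-17:
        n += 1
        a *= x * x / n
        s += a / (2 * n + 1)
    return 2.0 * s / math.sqrt(math.pi)

def q(t, x):
    """Candidate solution from Lemma 5.7."""
    return 0.5 * (1.0 - erfi(x / math.sqrt(2.0 * t)) / erfi(ALPHA / math.sqrt(2.0)))

def heat_residual(t, x, h=1e-4):
    """Central finite-difference residual of the backward heat equation
    d_t q + (1/2) d_xx q = 0; should be ~0 up to discretization error."""
    dt = (q(t + h, x) - q(t - h, x)) / (2 * h)
    dxx = (q(t, x + h) - 2 * q(t, x) + q(t, x - h)) / (h * h)
    return dt + 0.5 * dxx
```

The residual is of order h² plus floating-point noise, so it is small but not exactly zero; the boundary conditions hold up to rounding.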
F.1 Proof of Lemma 5.6
Proof (of Lemma 5.6).
First, we check that f ∈ C^{1,2}. Let (t, x) ∈ R_{>0} × R. It is easy to check via standard applications of the Dominated Convergence Theorem (DCT) and the Fundamental Theorem of Calculus (FTC) that

(1) ∂_t f(t, x) = ∫_0^x ∂_t h(t, y) dy − ½·∂_x h(t, 0),
(2) ∂_x f(t, x) = h(t, x), and
(3) ∂_xx f(t, x) = ∂_x h(t, x).

All of the above partial derivatives are clearly continuous since h is C^{1,2}.

Next, we show that if *∆h(t, x) = 0 for all (t, x) ∈ R_{>0} × R, then *∆f(t, x) = 0 for all (t, x) ∈ R_{>0} × R. By DCT and FTC,

*∆f(t, x) = (∂_t + ½·∂_xx)·(∫_0^x h(t, y) dy − ½·∫_0^t ∂_x h(s, 0) ds)
= ∫_0^x ∂_t h(t, y) dy + ½·∂_xx ∫_0^x h(t, y) dy − (∂_t + ½·∂_xx)·(½·∫_0^t ∂_x h(s, 0) ds) (by DCT)
= ∫_0^x ∂_t h(t, y) dy + ½·∂_x h(t, x) − ½·∂_x h(t, 0) (by FTC)
= ∫_0^x (∂_t h(t, y) + ½·∂_yy h(t, y)) dy (by FTC)
= 0,

since ∂_t h + ½·∂_yy h = *∆h = 0. Finally, an application of FTC shows that ∂_x f(t, x) = h(t, x) for every (t, x), as y ↦ h(t, y) is continuous.

F.2 Proof of Lemma 5.7
Proof (of Lemma 5.7).
Let us assume that we can write u(t, x) = v(x/√t). Then we have ∂_t u(t, x) = −(x/(2t^{3/2}))·v′(x/√t) and ∂_xx u(t, x) = (1/t)·v″(x/√t). The backward heat equation enforces that v″(x/√t) = (x/√t)·v′(x/√t). By a change of variables (z = x/√t), we obtain the following ordinary differential equation:

v″(z) = z·v′(z). (F.1)

Hence, v′(z) = C·e^{z²/2} for some constant C. We can then integrate to obtain v(z) = ∫_0^z C·e^{y²/2} dy + D = ∫_0^{z/√2} √2·C·e^{u²} du + D, for some constant D. For the last equality, we made the change of variables u = y/√2 in the integral. Therefore, by the definition of erfi (and replacing C·√(π/2) with C), we have v(z) = C·erfi(z/√2) + D. Hence, for some constants C, D ∈ R, we have

u(t, x) = C·erfi(x/√(2t)) + D.

Plugging in the boundary condition at x = 0 and recalling that erfi(0) = 0, we see that D = 1/2. Plugging in the boundary condition that u(t, α√t) = 0 and using that D = 1/2, we see that C = −1/(2·erfi(α/√2)). Therefore, we have that the following function

q(t, x) = ½·(1 − erfi(x/√(2t))/erfi(α/√2))

satisfies the backward heat equation and the boundary conditions. Moreover, q ∈ C^{1,2} on R_{>0} × R.

F.3 Proof of Lemma 5.8

Recall that ˜R_α(t, x) = x/2 + κ_α·√t·M(x²/2t), where κ_α = 1/(√(2π)·erfi(α/√2)). First we need to compute some derivatives.

Lemma F.1.
The following identities hold for every α > 0.
1. ∂_x ˜R_α(t, x) = ˜p_α(t, x) = ½·(1 − erfi(x/√(2t))/erfi(α/√2)).
2. ∂_xx ˜R_α(t, x) = ∂_x ˜p_α(t, x) = −κ_α·exp(x²/2t)/√t.

Proof.
The proof is a straightforward calculation. We have

∂_x ˜R_α(t, x) = ½ − κ_α·(x/√t)·M(1/2, 3/2, x²/2t) = ½ − κ_α·(x/√t)·(√π·erfi(x/√(2t)))/(2·x/√(2t)) = ½·(1 − erfi(x/√(2t))/erfi(α/√2)),

where the first equality uses Fact A.1 and the second equality uses the identity (2) in Fact A.2. This proves the first identity.

For the second identity, using the definition of erfi(·), we have

∂_xx ˜R_α(t, x) = ∂_x ˜p_α(t, x) = −exp(x²/2t)/(√(2πt)·erfi(α/√2)) = −κ_α·exp(x²/2t)/√t.

Proof (of Lemma 5.8).
By the first identity in Lemma F.1, we have

∫_0^x ˜p_α(t, y) dy = ˜R_α(t, x) − ˜R_α(t, 0). (F.2)

Note that ˜R_α(t, 0) = κ_α·√t. Next, the second identity of Lemma F.1 implies that −∂_x ˜p_α(s, 0) = κ_α/√s. Hence,

−½·∫_0^t ∂_x ˜p_α(s, 0) ds = κ_α·√t = ˜R_α(t, 0). (F.3)

Summing (F.2) and (F.3) gives

∫_0^x ˜p_α(t, y) dy − ½·∫_0^t ∂_x ˜p_α(s, 0) ds = ˜R_α(t, x) − ˜R_α(t, 0) + ˜R_α(t, 0) = ˜R_α(t, x).

F.4 Proof of Lemma 5.10
The main idea of the proof is that we will approximate R_α by a sequence of smooth functions (i.e., functions in C^{1,2}).

Fix α > 0. Recall that ˜R_α(t, x) = x/2 + κ_α·√t·M(x²/2t) for t > 0, x ∈ R, where κ_α = 1/(√(2π)·erfi(α/√2)). (For t = 0, it suffices to define ˜R_α(t, x) = 0.) We also have the truncated version, R_α, defined as

R_α(t, x) =
  ˜R_α(t, x)     if t > 0 and x ≤ α√t,
  ˜R_α(t, α√t)   if t > 0 and x ≥ α√t,
  0              if t = 0.

Recall also that p_α = ∂_x R_α. For convenience, we restate the lemma.

Lemma 5.10.
Fix α > 0. Then, almost surely, for all T ≥ 0, ContRegret(T, p_α, B) ≤ R_α(T, |B_T|).

For the remainder of this section, we will write ˜f = ˜R_α and f = R_α. Let φ(x) be any non-increasing C² function satisfying φ(x) = 1 for x ≤ 0 and φ(x) = 0 for x ≥ 1. For concreteness, we may take

φ(x) =
  1                          if x ≤ 0,
  (1 − x) + sin(2πx)/(2π)    if x ∈ [0, 1],
  0                          if x ≥ 1. (F.4)

We leave it as an easy calculus exercise to verify that φ is indeed a non-increasing C² function. Next, define φ_n(x) = φ(nx) and

f_n(t, x) = ˜f(t, x)·φ_n(x − α√t) + f(t, α√t)·(1 − φ_n(x − α√t)).

Note that f_n ∈ C^{1,2} on R_{>0} × R for all n. The function f_n is a smooth approximation to f, and its limit is exactly f (= R_α).

Claim F.2.
For every t > 0, x ∈ R, lim_{n→∞} f_n(t, x) = f(t, x).

Proof.
If x ≤ α√t then φ_n(x − α√t) = 1, so f_n(t, x) = ˜f(t, x) = f(t, x). In particular, this also holds in the limit. Next, suppose that a = x − α√t > 0. If n > 1/a then φ_n(x − α√t) = 0, so f_n(t, x) = ˜f(t, α√t) = f(t, x).

Recall that our goal is to relate f(T, |B_T|) and ∫_0^T ∂_x f(t, |B_t|) d|B_t|. However, one cannot apply Itô's formula to f directly as it is not in C^{1,2}. Instead, we will apply Itô's formula to the smoothed version of f, namely f_n, and then take limits. The remainder of this section does this limiting argument carefully.

For technical reasons (namely that ˜f(t, x) has a pole when t → 0 and x ≠ 0), we will not be able to start the stochastic integral at 0. Hence, we will fix ε > 0 and, at the end of the proof, we will allow ε → 0. The following lemma bounds the stochastic integral of ∂_x f_n with respect to |B_t|.

Lemma F.3.
Almost surely, for every T ≥ ε,

∫_ε^T ∂_x f_n(t, |B_t|) d|B_t| ≤ f_n(T, |B_T|) − f_n(ε, |B_ε|) − ∫_ε^T (α/(2√t))·φ′_n(|B_t| − α√t)·(f(t, α√t) − ˜f(t, |B_t|)) dt − ½·∫_ε^T φ″_n(|B_t| − α√t)·(˜f(t, |B_t|) − f(t, α√t)) dt. (F.5)

Proof.
The proof is by Itô's formula (Theorem 5.3) applied to f_n. We have, for all T ≥ ε,

f_n(T, |B_T|) − f_n(ε, |B_ε|) = ∫_ε^T ∂_x f_n(t, |B_t|) d|B_t| + ∫_ε^T [∂_t f_n(t, |B_t|) + ½·∂_xx f_n(t, |B_t|)] dt. (F.6)

Computing derivatives of f_n, we have

∂_t f_n(t, x) = (∂_t ˜f(t, x))·φ_n(x − α√t) − (α/(2√t))·˜f(t, x)·φ′_n(x − α√t) + ∂_t(f(t, α√t))·(1 − φ_n(x − α√t)) + (α/(2√t))·f(t, α√t)·φ′_n(x − α√t), (F.7)

∂_x f_n(t, x) = (∂_x ˜f(t, x))·φ_n(x − α√t) + ˜f(t, x)·φ′_n(x − α√t) − f(t, α√t)·φ′_n(x − α√t), (F.8)

∂_xx f_n(t, x) = (∂_xx ˜f(t, x))·φ_n(x − α√t) + 2·(∂_x ˜f(t, x))·φ′_n(x − α√t) + (˜f(t, x) − f(t, α√t))·φ″_n(x − α√t). (F.9)

Recalling the notation *∆ = ∂_t + ½·∂_xx, we have

*∆f_n(t, x) = (*∆˜f(t, x))·φ_n(x − α√t) + ∂_t(f(t, α√t))·(1 − φ_n(x − α√t)) + (∂_x ˜f(t, x))·φ′_n(x − α√t) + (α/(2√t))·(f(t, α√t) − ˜f(t, x))·φ′_n(x − α√t) + ½·(˜f(t, x) − f(t, α√t))·φ″_n(x − α√t). (F.10)

By Lemma 5.9, *∆˜f = 0. By Claim F.4 below, ∂_t(f(t, α√t)) > 0. Next, observe that (∂_x ˜f(t, x))·φ′_n(x − α√t) ≥ 0. To see this, if x ≤ α√t then φ′_n(x − α√t) = 0. On the other hand, if x > α√t then φ′_n(x − α√t) ≤ 0 because φ_n is non-increasing, and ∂_x ˜f(t, x) ≤ 0 by Lemma 5.9 and (5.11). Hence, we can lower bound (F.10) by

*∆f_n(t, x) ≥ (α/(2√t))·(f(t, α√t) − ˜f(t, x))·φ′_n(x − α√t) + ½·(˜f(t, x) − f(t, α√t))·φ″_n(x − α√t). (F.11)

Plugging (F.11) into (F.6) gives

f_n(T, |B_T|) − f_n(ε, |B_ε|) ≥ ∫_ε^T ∂_x f_n(t, |B_t|) d|B_t| + ∫_ε^T (α/(2√t))·φ′_n(|B_t| − α√t)·(f(t, α√t) − ˜f(t, |B_t|)) dt + ½·∫_ε^T φ″_n(|B_t| − α√t)·(˜f(t, |B_t|) − f(t, α√t)) dt. (F.12)

Rearranging (F.12) gives the lemma.

Claim F.4.
If t > 0 then ∂_t(˜f(t, α√t)) > 0.

Proof.
Note that

˜f(t, α√t) = √t·(α/2 + M(α²/2)/(√(2π)·erfi(α/√2))) = √t·˜f(1, α).

So it suffices to check that ˜f(1, α) > 0. To see this, note that ˜f(1, 0) = κ_α > 0 and ∂_x ˜f(1, x) ≥ 0 as long as x ≤ α (by the first identity of Lemma F.1). Hence, ˜f(1, α) > 0.

At this point, we would like to take limits on both sides of (F.5). This is achieved by the following two lemmas.

Lemma F.5.
Almost surely, for every T ≥ ε,
1. lim_{n→∞} ∫_ε^T (α/(2√t))·φ′_n(|B_t| − α√t)·(f(t, α√t) − ˜f(t, |B_t|)) dt = 0; and
2. lim_{n→∞} ∫_ε^T φ″_n(|B_t| − α√t)·(f(t, α√t) − ˜f(t, |B_t|)) dt = 0.

Lemma F.6.
For every T ≥ ε,

∫_ε^T ∂_x f_n(t, |B_t|) d|B_t| → ∫_ε^T ∂_x f(t, |B_t|) d|B_t| in L² as n → ∞.

Within this section, X_n → X in L² means that E[(X_n − X)²] → 0 as n → ∞. We relegate the proofs of Lemma F.5 and Lemma F.6 to Appendix F.5. We now take limits on both sides of (F.5) to obtain the following bound on the stochastic integral of ∂_x f.

Lemma F.7.
Almost surely, for every T ≥ ε,

∫_ε^T ∂_x f(t, |B_t|) d|B_t| ≤ f(T, |B_T|) − f(ε, |B_ε|). (F.13)

Proof.
By Lemma F.6, for every T ≥ ε,

∫_ε^T ∂_x f_n(t, |B_t|) d|B_t| → ∫_ε^T ∂_x f(t, |B_t|) d|B_t| in L².

Hence, there exists a subsequence n_k such that

∫_ε^T ∂_x f_{n_k}(t, |B_t|) d|B_t| → ∫_ε^T ∂_x f(t, |B_t|) d|B_t| almost surely.

Using Lemma F.3 to bound the left-hand side and then Lemma F.5 to take limits gives that (F.13) holds for any fixed T ≥ ε. Hence, almost surely, (F.13) holds for all rational T ≥ ε. As both sides of (F.13) are continuous as a function of T, (F.13) holds for all T ≥ ε.

Proof (of Lemma 5.10).
We will work in the probability-1 set where Lemma F.7 holds (for every rational ε > 0) and t ↦ B_t is continuous.

Fix T > 0. Note that ContRegret(T, ∂_x f, B) is defined because ∂_x f ∈ [0, 1/2] and ∂_x f(t, 0) = 1/2 for all t > 0 (see (5.14)). Recalling Definition 5.1, we have, for 0 < ε ≤ T,

ContRegret(T, ∂_x f, B) = ∫_0^T ∂_x f(t, |B_t|) d|B_t| = ∫_ε^T ∂_x f(t, |B_t|) d|B_t| + ∫_0^ε ∂_x f(t, |B_t|) d|B_t| ≤ f(T, |B_T|) − f(ε, |B_ε|) + ∫_0^ε ∂_x f(t, |B_t|) d|B_t| (Lemma F.7).

The right-hand side is continuous in ε, so taking ε → 0 (and recalling that f(0, 0) = 0) gives ContRegret(T, ∂_x f, B) ≤ f(T, |B_T|).

F.5 Additional proofs from Appendix F.4
Before we prove Lemma F.5, we will need one key observation.
Lemma F.8.
Fix ε > 0. Then there is a constant C_ε > 0 (depending also on α) such that for t ≥ ε and x satisfying |x − α√t| ≤ 1,
1. |˜f(t, x) − f(t, α√t)| ≤ C_ε·(x − α√t)²; and
2. |∂_x ˜f(t, x)| ≤ C_ε·|x − α√t|.

Proof.
The key observation is that f(t, α√t) is already a first-order Taylor expansion of ˜f(t, x) (in x) about the point α√t. Indeed, ˜f(t, α√t) = f(t, α√t) and (∂_x ˜f)(t, α√t) = 0. Hence, by Taylor's Theorem (see e.g. [42, Theorem 5.15]),

|˜f(t, x) − f(t, α√t)| ≤ ½·(x − α√t)²·sup_{t ≥ ε, |x − α√t| ≤ 1} |∂_xx ˜f(t, x)|.

By the second identity in Lemma F.1, we have |∂_xx ˜f(t, x)| = κ_α·exp(x²/2t)/√t. Since t ≥ ε and x ≤ 1 + α√t, we have

|∂_xx ˜f(t, x)| ≤ κ_α·exp((1 + α√t)²/2t)/√ε = κ_α·exp(α²/2 + α/√t + 1/2t)/√ε ≤ κ_α·exp(α²/2 + α/√ε + 1/2ε)/√ε.

So one can take C_ε = κ_α·exp(α²/2 + α/√ε + 1/2ε)/√ε. This gives the first assertion.

The second assertion is similar. Indeed, since (∂_x ˜f)(t, α√t) = 0, we have

|(∂_x ˜f)(t, x)| = |(∂_x ˜f)(t, x) − (∂_x ˜f)(t, α√t)| ≤ |x − α√t|·sup_{t ≥ ε, |x − α√t| ≤ 1} |∂_xx ˜f(t, x)| ≤ C_ε·|x − α√t|.

We also need a simple claim which bounds the values of |φ′_n(x)| and |φ″_n(x)|.

Claim F.9.
There is an absolute constant C > 0 such that |φ′_n(x)| ≤ Cn and |φ″_n(x)| ≤ Cn².

Proof.
Note that φ′_n(x) = n·φ′(nx) and φ″_n(x) = n²·φ″(nx). It is easy to see, from differentiating (F.4) or by continuity and compactness arguments, that there exists C > 0 such that |φ′(x)|, |φ″(x)| ≤ C for all x ∈ R.

Proof (of Lemma F.5).
We start with the second assertion. The first assertion is similar but simpler. We claim that there exists a constant C′ (depending on ε and α) such that

|φ″_n(|B_t| − α√t)·(f(t, α√t) − ˜f(t, |B_t|))| ≤ C′·1[|B_t| − α√t ∈ [0, 1/n]]. (F.14)

Indeed, if |B_t| − α√t ∉ [0, 1/n] then φ″_n(|B_t| − α√t) = 0, so both sides of (F.14) are equal to 0. On the other hand, if |B_t| − α√t ∈ [0, 1/n] then Lemma F.8 shows that |f(t, α√t) − ˜f(t, |B_t|)| ≤ C_ε/n², where C_ε is the constant from Lemma F.8. Next, Claim F.9 gives |φ″_n(|B_t| − α√t)| ≤ Cn². So taking C′ = C_ε·C gives (F.14). Hence,

|∫_ε^T φ″_n(|B_t| − α√t)·(f(t, α√t) − ˜f(t, |B_t|)) dt| ≤ ∫_ε^T C′·1[|B_t| − α√t ∈ [0, 1/n]] dt = C′·m({t ∈ [ε, T] : |B_t| − α√t ∈ [0, 1/n]}),

where m denotes the Lebesgue measure. By continuity of measure, we have

lim_{n→∞} m({t ∈ [ε, T] : |B_t| − α√t ∈ [0, 1/n]}) = ∫_ε^T 1[|B_t| = α√t] dt = 0 a.s.

This proves the second assertion.

For the first assertion, we can use the bound (from Lemma F.8 and Claim F.9)

|φ′_n(x − α√t)·(f(t, α√t) − ˜f(t, x))| ≤ C′·n^{−1}·1[x − α√t ∈ [0, 1/n]] ≤ C′/n. (F.15)

Hence,

|∫_ε^T (α/(2√t))·φ′_n(|B_t| − α√t)·(f(t, α√t) − ˜f(t, |B_t|)) dt| ≤ ∫_ε^T (α/(2√t))·(C′/n) dt ≤ C′·α·√T/n → 0.

Proof (of Lemma F.6).
By (F.8), we have

∂_x f_n(t, x) − ∂_x f(t, x) = (∂_x ˜f(t, x)·φ_n(x − α√t) − ∂_x f(t, x)) + (φ′_n(x − α√t)·(˜f(t, x) − f(t, α√t))). (F.16)

For the first bracketed term, since ∂_x ˜f(t, x) = ∂_x f(t, x) when x ≤ α√t and ∂_x f(t, x) = 0 when x ≥ α√t, we have

|∂_x ˜f(t, x)·φ_n(x − α√t) − ∂_x f(t, x)| = |∂_x ˜f(t, x)·φ_n(x − α√t)|·1[x − α√t ∈ [0, 1/n]] ≤ C′/n,

where the final inequality is by the second assertion in Lemma F.8. The second bracketed term has been bounded in (F.15), and so we have proved

|∂_x f_n(t, x) − ∂_x f(t, x)| ≤ C″/n for all t ≥ ε and all x. (F.17)

Tanaka's formula (see [41, Theorem IV.43.3]) states that

|B_t| = ∫_0^t sign(B_s) dB_s + L_t =: W_t + L_t,

where L is the local time at zero of B and W is a Brownian motion. Recall that t ↦ L_t is a continuous non-decreasing random process which increases only on the set {t : B_t = 0}. Therefore, by the Itô isometry property, for any T ≥ ε,

E[(∫_ε^T ∂_x f_n(t, |B_t|) d|B_t| − ∫_ε^T ∂_x f(t, |B_t|) d|B_t|)²]
≤ 2·E[(∫_ε^T (∂_x f_n − ∂_x f)(t, |B_t|) dW_t)²] + 2·E[(∫_ε^T (∂_x f_n − ∂_x f)(t, |B_t|) dL_t)²]
= 2·E[∫_ε^T (∂_x f_n − ∂_x f)²(t, |B_t|) dt] + 2·E[(∫_ε^T (∂_x f_n − ∂_x f)(t, 0) dL_t)²].

Now use (F.17) to bound the right-hand side by

2·(C″/n)²·T + 2·(C″/n)²·E[L²_T] ≤ C‴·n^{−2}·T,

where the last inequality uses Tanaka's formula (and the fact that W_t is also a standard Brownian motion) to bound E[L²_T] = E[(|B_T| − W_T)²] ≤ 2·E[|B_T|²] + 2·E[|W_T|²] = 4·E[|B_T|²] = O(T). The result follows.

F.6 Proof of Claim 5.12
Proof of Claim 5.12.
Recall that

    h(α) = α/2 + M(−1/2, 1/2, α²/2) / ( √(2π) · erfi(α/√2) ).

Hence,

    h′(α) = 1/2 − α · M(1/2, 3/2, α²/2) / ( √(2π) · erfi(α/√2) ) − exp(α²/2) · M(−1/2, 1/2, α²/2) / ( π · erfi(α/√2)² )    (by Fact A.1)
          = − exp(α²/2) · M(−1/2, 1/2, α²/2) / ( π · erfi(α/√2)² )    (by Fact A.2(2)).

This proves the first assertion.

Next, observe that exp(α²/2) / ( π · erfi(α/√2)² ) is positive for all α ≠ 0. Hence, by Fact A.4, we have that h′(α) < 0 for α ∈ (0, γ), h′(γ) = 0, and h′(α) > 0 for α ∈ (γ, ∞).

F.7 Discussion on the statement of Theorem 5.3
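Two probabilistic facts invoked in this discussion, Tanaka's decomposition |B_t| = W_t + L_t with L non-decreasing, and the quadratic-variation identity ⟨|B|, |B|⟩_t = t, can be illustrated by simulation. The sketch below is not part of the argument; the horizon T = 1, step count N, and seed are arbitrary choices.

```python
# Simulated check of Tanaka's decomposition and of <|B|,|B|>_T = T (illustration only).
import numpy as np

rng = np.random.default_rng(0)
T, N = 1.0, 1_000_000
dt = T / N
dB = rng.normal(0.0, np.sqrt(dt), N)          # Brownian increments
B = np.concatenate([[0.0], np.cumsum(dB)])    # Brownian path on [0, T]

# Quadratic variation of |B|: the sum of squared increments is approximately T.
qv = np.sum(np.diff(np.abs(B))**2)
assert abs(qv - T) < 0.02

# W_t = int_0^t sign(B_s) dB_s (Ito sum), so L_t = |B_t| - W_t approximates the local time at 0.
W = np.concatenate([[0.0], np.cumsum(np.sign(B[:-1]) * dB)])
L = np.abs(B) - W

# L is non-decreasing, and it increases only on steps where B changes sign (i.e., near B = 0).
assert np.min(np.diff(L)) >= -1e-12
increases = np.diff(L) > 1e-12
sign_change = np.sign(B[:-1]) != np.sign(B[1:])
assert np.all(sign_change[increases])
```

In this discretization the increment of L at a sign-change step works out to exactly 2|B_{t+dt}| > 0 and to 0 otherwise, which is why the monotonicity check holds up to floating-point rounding.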
In this paper, we use the version of Itô's formula that appears in Remark 1 after Theorem IV.3.3 in [39]. It states that if f ∈ C^{1,2}, X is a continuous semimartingale, and A is a process with bounded variation, then

    f(A_T, X_T) − f(A_0, X_0) = ∫_0^T ∂_x f(A_t, X_t) dX_t + ∫_0^T ∂_t f(A_t, X_t) dA_t + ½ ∫_0^T ∂_xx f(A_t, X_t) d⟨X, X⟩_t.    (F.18)

In our setting, we take X_t = |B_t| and A_t = t. We now explain the notation ⟨X, X⟩.

(1) For a continuous local martingale M, ⟨M, M⟩ is the unique increasing continuous process vanishing at 0 such that M² − ⟨M, M⟩ is a martingale [39, Theorem IV.1.8].

(2) If X is a continuous semimartingale with M being the (continuous) local martingale part, then ⟨X, X⟩ = ⟨M, M⟩ [39, Definition IV.1.20].

Tanaka's formula [41, Theorem IV.43.3] asserts that |B_t| = W_t + L_t, where W_t is a Brownian motion and L_t is the local time of B_t at 0, which is an increasing, continuous, adapted process. Hence, |B_t| is a semimartingale with ⟨|B|, |B|⟩_t = ⟨W, W⟩_t = t. Plugging these into (F.18) gives

    f(T, |B_T|) − f(0, |B_0|) = ∫_0^T ∂_x f(t, |B_t|) d|B|_t + ∫_0^T [ ∂_t f(t, |B_t|) + ½ ∂_xx f(t, |B_t|) ] dt,

which is what appears in Theorem 5.3.

F.8 Continuous regret against any continuous semi-martingale
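To make the guarantee of this section concrete before the formal definitions, here is a discrete-time sketch. The gap process X is a reflected ±√dt random walk; the strategy p(t, x) = ½(1 − erfi(x/√(2t))/erfi(γ/√2)), truncated to [0, 1/2], is our reading of Eq. (5.11); following the theorem, p is clocked by the running quadratic variation [X]_t; and the Riemann sum approximates the stochastic integral defining continuous regret. Horizon, step count, seed, and the slack added for discretization error are all arbitrary choices for the demo.

```python
# Discrete-time illustration of the continuous regret bound (a sketch, not the
# formal theorem): the Riemann sum below should stay under (gamma/2) * sqrt([X]_T),
# up to discretization error.
import numpy as np
from scipy.special import erfi, hyp1f1
from scipy.optimize import brentq

# gamma ~ 1.3069, the positive root of a -> M(-1/2, 1/2, a^2/2).
gamma = brentq(lambda a: hyp1f1(-0.5, 0.5, a**2 / 2), 1.0, 2.0)

def p(t, x):
    # Truncated strategy: equals 1/2 at x = 0, decreases in x, and is 0 for x >= gamma*sqrt(t).
    raw = 0.5 * (1.0 - erfi(x / np.sqrt(2.0 * t)) / erfi(gamma / np.sqrt(2.0)))
    return float(np.clip(raw, 0.0, 0.5))

rng = np.random.default_rng(1)
T, N = 1.0, 100_000
dt = T / N
X = qv = regret = 0.0
for _ in range(N):
    step = rng.choice((-1.0, 1.0)) * np.sqrt(dt)
    X_new = abs(X + step)                   # gap process: reflected random walk
    regret += p(qv + dt, X) * (X_new - X)   # Riemann sum for the regret integral
    qv += (X_new - X)**2                    # running quadratic variation [X]_t
    X = X_new

assert p(1.0, 0.0) == 0.5
assert regret <= 0.5 * gamma * np.sqrt(qv) + 0.1   # (gamma/2) sqrt([X]_T), plus slack
```

Clocking p by the quadratic variation rather than by elapsed time is what makes the per-step Itô error cancel at leading order, mirroring the role of A_t = [X]_t in the proof.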
Recall that the continuous regret upper bound (Theorem 5.2) involved the adversary evolving the gap process as a reflected Brownian motion, which is a continuous semi-martingale. In this section, we generalize the definition of continuous regret to allow arbitrary non-negative continuous semi-martingales to control the gap process, and derive an analogue of Theorem 5.2 in this generalized setting. We use the notation [X]_t to refer to ⟨X, X⟩_t, the quadratic variation process of X, which was introduced in Appendix F.7. We begin with a generalized definition of continuous regret. (Recall that a continuous semi-martingale X is a process that can be written as X = M + N, where M is a continuous local martingale and N is a continuous adapted process of finite variation.)

Definition F.10 (Continuous Regret). Let p : R_{>0} × R_{≥0} → [0, 1/2] be a continuous function that satisfies p(t, 0) = 1/2 for every t > 0. Let X_t be a continuous, non-negative semi-martingale. Then, the continuous regret of p with respect to X is the stochastic integral

    ContRegret(T, p, X) = ∫_0^T p(t, X_t) dX_t.    (F.19)

The main result for this generalized setting is as follows.

Theorem F.11. There exists a continuous-time algorithm p∗ such that for any continuous, non-negative semi-martingale X,

    ContRegret(T, p∗, X) ≤ (γ/2) · √([X]_T)    for all T ∈ R_{≥0}, almost surely.    (F.20)

We provide an overview of the proof of this result below. For the sake of exposition, we sketch the proof of Theorem F.11 in the setting where we allow p∗ to take values in (−∞, 1/2]. Truncating p∗ as was done in Subsection 5.2.2 yields Theorem F.11 as stated.

Proof sketch. Let p∗(t, x) := p̃_γ([X]_t, x) and R(t, x) := R̃_γ(t, x). (See Eq. (5.11) and Eq. (5.12) for definitions of p̃_γ and R̃_γ.) Recall the following three important properties of R from Lemma 5.9:

(1) R is C^{1,2},
(2) R satisfies ∗∆R = 0 over R_{>0} × R,
(3) ∂_x R(t, x) = p̃_γ(t, x).

Since R is C^{1,2}, we may apply Itô's formula (specifically Eq. (F.18) with A_t = [X]_t, which is a bounded variation process since it is increasing) to obtain

    R([X]_T, X_T) = ∫_0^T ∂_x R([X]_t, X_t) dX_t + ∫_0^T ( ∂_t R + ½ ∂_xx R )([X]_t, X_t) d[X]_t
                  = ∫_0^T p∗(t, X_t) dX_t + ∫_0^T (∗∆R)([X]_t, X_t) d[X]_t    (since ∂_x R = p̃_γ)
                  = ∫_0^T p∗(t, X_t) dX_t    (since ∗∆R = 0)
                  = ContRegret(T, p∗, X).

Next, recall the upper bound on R from Eq. (5.13):

    R(t, x) = R̃_γ(t, x) ≤ ( γ/2 + κ_γ · M(−1/2, 1/2, γ²/2) ) · √t = (γ/2) · √t,

where the final equality holds because γ is a root of x ↦ M(−1/2, 1/2, x²/2). Putting everything together, we have
    ContRegret(T, p∗, X) = R([X]_T, X_T) ≤ (γ/2) · √([X]_T),

as desired.

References

[1] Sepehr Abbasi-Zadeh, Nikhil Bansal, Guru Guruganesh, Aleksandar Nikolov, Roy Schwartz, and Mohit Singh. Sticky Brownian rounding and its applications to constraint satisfaction problems. arXiv preprint arXiv:1812.07769, 2018.

[2] Milton Abramowitz and Irene A. Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1965.

[3] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[4] Leo Breiman. First exit times for a square root boundary. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory, Part 2, pages 9–16. University of California Press, 1967.

[5] Leo Breiman. Probability. SIAM, 1992.

[6] Monica Brezzi and Tze Leung Lai. Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1):87–108, 2002. ISSN 0165-1889.

[7] Sébastien Bubeck. Introduction to online optimization, December 2011. Unpublished.

[8] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee, James R. Lee, and Aleksander Mądry. k-server via multiscale entropic regularization. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 3–16. ACM, 2018.

[9] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete Comput. Geom., 59(4):757–783, June 2018. ISSN 0179-5376.

[10] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740–1766, 2011.

[11] Nicolo Cesa-Bianchi. Analysis of two gradient-based algorithms for on-line regression. Journal of Computer and System Sciences, 59(3):392–411, 1999.

[12] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[13] Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.

[14] Fu Chang and Tze Leung Lai. Optimal stopping and dynamic allocation. Advances in Applied Probability, 19(4):829–853, 1987.

[15] Chandra Chekuri, T. S. Jayram, and Jan Vondrák. On multiplicative weight updates for concave and submodular function maximization. In Proceedings of the Conference on Innovations in Theoretical Computer Science (ITCS), pages 201–210, 2015.

[16] Thomas M. Cover. Behavior of sequential predictors of binary sequences. In Proceedings of the 4th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes. Publishing House of the Czechoslovak Academy of Sciences, Prague, 1965.

[17] Burgess Davis. On the intergrability of the martingale square function. Israel Journal of Mathematics, 8:187–190, 1970. ("Intergrability" appears to be a typographical error in the title of the paper.)

[18] Burgess Davis. On the L^p norms of stochastic integrals and other martingales. Duke Math. J., 43(4):697–704, 1976.

[19] Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660–689, 2019.

[20] J. L. Doob. Classical Potential Theory and Its Probabilistic Counterparts. Springer-Verlag, 1984.

[21] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, fifth edition, 2019.

[22] Ronen Eldan and Assaf Naor. Krivine diffusions attain the Goemans–Williamson approximation ratio. arXiv preprint arXiv:1906.10615, 2019.

[23] Meier Feder, Neri Merhav, and Michael Gutman. Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38(4):1258–1270, 1992.

[24] William Feller. An Introduction to Probability Theory and Its Applications. John Wiley & Sons, second edition, 1971.

[25] Sébastien Gerchinovitz. Prediction of individual sequences and prediction in the statistical framework: some links around sparse regression and aggregation techniques. PhD thesis, Université Paris-Sud, 2011.

[26] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Towards optimal algorithms for prediction with expert advice. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 528–547. SIAM, 2016.

[27] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Tight lower bounds for multiplicative weights algorithmic families. In Proceedings of the 44th International Colloquium on Automata, Languages, and Programming (ICALP), volume 80 of LIPIcs, pages 48:1–48:14, 2017.

[28] Priscilla Greenwood and Edwin Perkins. A conditioned limit theorem for random walk and Brownian local time on square root boundaries. Annals of Probability, 11:227–261, 1983.

[29] Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. Oxford University Press, third edition, 2001.

[30] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[31] Robert Kleinberg, Georgios Piliouras, and Éva Tardos. Multiplicative updates outperform generic no-regret learning in congestion games. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 533–542, 2009.

[32] Achim Klenke. Probability Theory: A Comprehensive Course. Springer, 2008.

[33] Yin Tat Lee and Santosh S. Vempala. Geodesic walks in polytopes. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 927–940. ACM, 2017.

[34] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[35] Haipeng Luo and Robert E. Schapire. Towards minimax online learning with unknown time horizon. In Proceedings of ICML, 2014.

[36] Peter Mörters and Yuval Peres. Brownian Motion. Cambridge University Press, 2010.

[37] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

[38] Edwin Perkins. On the Hausdorff dimension of the Brownian slow points. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 64:369–399, 1983.

[39] Daniel Revuz and Marc Yor. Continuous martingales and Brownian motion, volume 293. Springer Science & Business Media, 2013.

[40] L. C. G. Rogers and David Williams. Diffusions, Markov Processes and Martingales. Volume 1: Foundations. Cambridge University Press, second edition, 2000.

[41] L. C. G. Rogers and David Williams. Diffusions, Markov Processes and Martingales. Volume 2: Itô Calculus. Cambridge University Press, second edition, 2000.

[42] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, third edition, 1976.

[43] Volodimir G. Vovk. Aggregating strategies. In Proc. of Computational Learning Theory, 1990.

[44] Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

[45] David Williams.