The Multiplicative Version of Azuma's Inequality, with an Application to Contention Analysis
William Kuszmaul and Qi Qi
Massachusetts Institute of Technology
{kuszmaul, qqi}@mit.edu

Abstract
Azuma's inequality is a tool for proving concentration bounds on random variables. The inequality can be thought of as a natural generalization of additive Chernoff bounds. On the other hand, the analogous generalization of multiplicative Chernoff bounds has, to our knowledge, never been explicitly formulated.

We formulate a multiplicative-error version of Azuma's inequality. We then show how to apply this new inequality in order to greatly simplify (and correct) the analysis of contention delays in multithreaded systems managed by randomized work stealing.
Introduction
One of the most widely used tools in algorithm analysis is the Chernoff bound, which gives a concentration inequality on sums of independent random variables. The Chernoff bound exists in many forms, but the two most common variants are the additive and multiplicative bounds:
Theorem 1 (Additive Chernoff Bound). Let $X_1, \ldots, X_n \in \{0, 1\}$ be independent random variables, and let $X = \sum_{i=1}^n X_i$. Then for any $\varepsilon > 0$,
$$\Pr[X \ge \mathbb{E}[X] + \varepsilon] \le \exp\left(-\frac{2\varepsilon^2}{n}\right).$$

Theorem 2 (Multiplicative Chernoff Bound). Let $X_1, \ldots, X_n \in \{0, 1\}$ be independent random variables. Let $X = \sum_{i=1}^n X_i$ and let $\mu = \mathbb{E}[X]$. Then for any $\delta > 0$,
$$\Pr[X \ge (1 + \delta)\mu] \le \exp\left(-\frac{\delta^2 \mu}{2 + \delta}\right),$$
and for any $0 < \delta < 1$,
$$\Pr[X \le (1 - \delta)\mu] \le \exp\left(-\frac{\delta^2 \mu}{2}\right).$$

Although the additive Chernoff bound is often convenient to use, the multiplicative bound can in some cases be much stronger. Suppose, for example, that $X_1, X_2, \ldots, X_n$ each take value 1 with probability $(\log n)/n$. By the additive bound, one can conclude that $\sum_i X_i = O(\sqrt{n \log n})$ with high probability in $n$. On the other hand, the multiplicative bound can be used to show that $\sum_i X_i = O(\log n)$ with high probability in $n$. In general, whenever $\mathbb{E}[X] \ll n$, the multiplicative bound is more powerful.

Handling dependencies with Azuma's inequality.
Chernoff bounds require that the random variables $X_1, X_2, \ldots, X_n$ be independent. In many algorithmic applications, however, the $X_i$'s are not independent, such as when analyzing algorithms in which $X_1, X_2, \ldots, X_n$ are the results of decisions made by an adaptive adversary over time. When analyzing these applications (see, e.g., [1, 2, 4, 7, 8, 17, 21, 24, 26, 33, 34, 40, 41, 43, 54]), a stronger inequality known as Azuma's inequality is often useful.
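Before turning to martingales, the additive-versus-multiplicative gap in the example above is easy to see numerically. The sketch below (our own illustration; the function names are ours) plugs the Bernoulli-$((\log n)/n)$ example into the two bounds as stated in Theorems 1 and 2:

```python
import math

def additive_chernoff(n, eps):
    """Theorem 1: Pr[X >= E[X] + eps] <= exp(-2 eps^2 / n)."""
    return math.exp(-2 * eps ** 2 / n)

def multiplicative_chernoff(mu, delta):
    """Theorem 2 (upper tail): Pr[X >= (1 + delta) mu] <= exp(-delta^2 mu / (2 + delta))."""
    return math.exp(-delta ** 2 * mu / (2 + delta))

n = 10 ** 6
mu = math.log(n)        # each X_i is 1 with probability (log n)/n, so E[X] = log n
target = 10 * mu        # bound the probability that sum X_i >= 10 log n

print(additive_chernoff(n, target - mu))   # ~0.97: the additive bound is vacuous here
print(multiplicative_chernoff(mu, 9.0))    # ~7e-45: polynomially small in n
```

For this regime the additive bound says essentially nothing, while the multiplicative bound already gives a tail that is superpolynomially small in $n$.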
Theorem 3 (Azuma's inequality). Let $Z_0, Z_1, \ldots, Z_n$ be a supermartingale, meaning that $\mathbb{E}[Z_i \mid Z_0, \ldots, Z_{i-1}] \le Z_{i-1}$. Assume additionally that $|Z_i - Z_{i-1}| \le c_i$. Then for any $\varepsilon > 0$,
$$\Pr[Z_n - Z_0 \ge \varepsilon] \le \exp\left(-\frac{\varepsilon^2}{2 \sum_{i=1}^n c_i^2}\right).$$

By applying Azuma's inequality to the exposure martingale for a sum $\sum_i X_i$ of random variables, one arrives at the following corollary, which is often useful in analyzing randomized algorithms (for direct applications of Corollary 4, see, e.g., [19-21, 28, 40, 43]).

Corollary 4. Let $X_1, X_2, \ldots, X_n$ be random variables satisfying $X_i \in [0, c_i]$. Suppose that $\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] \le p_i$ for all $i$. Then for any $\varepsilon > 0$,
$$\Pr\left[\sum_i X_i \ge \sum_i p_i + \varepsilon\right] \le \exp\left(-\frac{\varepsilon^2}{2 \sum_{i=1}^n c_i^2}\right).$$

This paper: an inequality with multiplicative error. Azuma's inequality can be viewed as a generalization of additive Chernoff bounds. In this paper, we formulate the multiplicative analogue of Azuma's inequality.
Theorem 5. Let $Z_0, Z_1, \ldots, Z_n$ be a supermartingale, meaning that $\mathbb{E}[Z_i \mid Z_0, \ldots, Z_{i-1}] \le Z_{i-1}$. Assume additionally that $-a_i \le Z_i - Z_{i-1} \le b_i$, where $a_i + b_i = c$ for some constant $c > 0$ and all $i$. Let $\mu = \sum_{i=1}^n a_i$. Then for any $\delta > 0$,
$$\Pr[Z_n - Z_0 \ge \delta\mu] \le \exp\left(-\frac{\delta^2 \mu}{(2 + \delta)c}\right).$$

This theorem yields the following corollary.
Corollary 6. Let $X_1, \ldots, X_n \in [0, c]$ be real-valued random variables with $c > 0$. Suppose that $\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] \le a_i$ for all $i$. Let $\mu = \sum_{i=1}^n a_i$. Then for any $\delta > 0$,
$$\Pr\left[\sum_i X_i \ge (1 + \delta)\mu\right] \le \exp\left(-\frac{\delta^2 \mu}{(2 + \delta)c}\right).$$

In the same way that multiplicative Chernoff bounds are in some cases much stronger than additive Chernoff bounds, the multiplicative Azuma's inequality is in some cases much stronger than the standard (additive) Azuma's inequality, as occurs, in particular, when $\sum_i a_i \ll cn$.

As far as we know, no one to date has explicitly formulated the multiplicative version of Azuma's inequality. Our work is targeted towards algorithm designers. Our hope is that Theorem 5 will simplify the task of analyzing randomized algorithms, providing an instrument that can be used in place of custom Chernoff bounds and ad-hoc combinatorial arguments.

Extensions.
We present two extensions of Theorem 5 and Corollary 6.

In Section 4, we generalize Theorem 5 so that $a_1, a_2, \ldots, a_n$ are determined by an adaptive adversary. This means that each $a_i$ can be partially a function of $Z_0, \ldots, Z_{i-1}$. As long as the $a_i$'s are restricted to satisfy $\sum_i a_i \le \mu$, the bound from Theorem 5 continues to hold. We also discuss several applications of the adaptive version of the theorem.

In Appendix A, we extend Theorem 5 to give a lower tail bound. In particular, just as Theorem 5 gives an upper tail bound for supermartingales, a similar approach gives a lower tail bound for submartingales (also with multiplicative error).

An application: work stealing.
In order to demonstrate the power of Theorem 5, we revisit a classic result in multithreaded scheduling. In the problem of scheduling multithreaded computations on parallel computers, a fundamental question is how to decide when one processor should "steal" computational threads from another. In the seminal paper Scheduling Multithreaded Computations by Work Stealing [16], Blumofe and Leiserson presented the first provably good work-stealing scheduler for multithreaded computations with dependencies. The paper has been influential to both theory and practice, amassing almost two thousand citations, and inspiring the Cilk programming language and runtime system [15, 32, 51].

One result in [16] is an analysis of the so-called $(P, M)$-recycling game, which models the contention incurred by a randomized work-stealing algorithm. By combining the analysis of the $(P, M)$-recycling game with a delay-sequence argument, the authors are able to bound the execution time and communication cost of their randomized work-stealing algorithm.

The $(P, M)$-recycling game takes place on $P$ bins which are initially empty. In each step of the game, if there are $k$ balls presently in the bins, then the player selects some value $j \in \{0, 1, \ldots, P - k\}$ and tosses $j$ balls at random into bins. At the end of each step, one ball is removed from each non-empty bin. The game continues until $M$ total tosses have been made. The goal of the player is to maximize the total delay experienced by the balls, where the delay of a ball $b$ thrown into a bin $i$ is defined to be the number of balls already present in bin $i$ at the time of $b$'s throw. Lemma 6 of [16] states that, even if the player is an adaptive adversary, the total delay is guaranteed to be at most $O(M + P \log P + P \log \varepsilon^{-1})$ with probability at least $1 - \varepsilon$.

In part due to a lack of good analytical tools, the authors of [16] attempt to analyze the $(P, M)$-recycling game via a combinatorial argument. Unfortunately, the argument fails to notice certain subtle (but important) dependencies between random variables, and consequently the analysis is incorrect.

In Section 3, we give a simple and short analysis of the $(P, M)$-recycling game using Theorem 5. We also explain why the same argument does not follow from the standard Azuma's inequality. In addition to being simpler (and more correct) than the analysis in [16], our analysis enables the slightly stronger bound of $O(M + P \log \varepsilon^{-1})$.

Related work. Although Chernoff bounds are often attributed to Herman Chernoff, they were originally formulated by Herman Rubin (see discussion in [22]). Azuma's inequality, on the other hand, was independently formulated by several different authors, including Kazuoki Azuma [5], Wassily Hoeffding [36], and Sergei Bernstein [13] (although in a slightly different form). As a result, the inequality is sometimes also referred to as the Azuma-Hoeffding inequality.

The key technique used to prove Azuma's inequality is to apply Markov's inequality to the moment generating function of a random variable. This technique is well understood and has served as the foundation for much of the work on concentration inequalities in statistics and probability theory [11, 25, 29-31, 35, 38, 42, 44-49, 52] (see [18] or [23] for a survey). Extensive work has been devoted to generalizing Azuma's inequality in various ways. For example, Bernstein-type inequalities parameterize the concentration bound by the $k$-th moments of the random variables being summed [11-13, 25, 29-31, 35, 38, 47]. Most of the research in this direction has been targeted towards applications in statistics and probability theory, rather than to theoretical computer science.

The main contribution of this paper is to explicitly formulate the multiplicative analogue of Azuma's inequality, and to discuss its application within algorithm analysis. We emphasize that the proof of the inequality is not, in itself, a substantial contribution, since the inequality is relatively straightforward to derive by combining the proof of the multiplicative Chernoff bound with that of Azuma's inequality.
Nonetheless, by presenting the theorem as a tool that can be directly referenced by algorithm designers, we hope to simplify the task of proving concentration bounds within the context of algorithm analysis.

A related approach to handling limited dependencies is to show that a sum $X$ of not necessarily independent random variables is stochastically dominated by a sum $X'$ of independent random variables (see Lemma 3 of [3]), thereby allowing for the application of Chernoff bounds to $X$.

Footnote: Not to be confused with the ball recycling game of [6].

Footnote: We thank Charles Leiserson of MIT, one of the original authors of [16], for suggesting that the analysis in [16] should be revisited, which led us to write this paper.

The Multiplicative Azuma's Inequality

In this section we prove the following theorem and corollary.
Theorem 5. Let $Z_0, Z_1, \ldots, Z_n$ be a supermartingale, meaning that $\mathbb{E}[Z_i \mid Z_0, \ldots, Z_{i-1}] \le Z_{i-1}$. Assume additionally that $-a_i \le Z_i - Z_{i-1} \le b_i$, where $a_i + b_i = c$ for some constant $c > 0$ and all $i$. Let $\mu = \sum_{i=1}^n a_i$. Then for any $\delta > 0$,
$$\Pr[Z_n - Z_0 \ge \delta\mu] \le \exp\left(-\frac{\delta^2 \mu}{(2 + \delta)c}\right).$$

Corollary 6. Let $X_1, \ldots, X_n \in [0, c]$ be real-valued random variables with $c > 0$. Suppose that $\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] \le a_i$ for all $i$. Let $\mu = \sum_{i=1}^n a_i$. Then for any $\delta > 0$,
$$\Pr\left[\sum_i X_i \ge (1 + \delta)\mu\right] \le \exp\left(-\frac{\delta^2 \mu}{(2 + \delta)c}\right).$$

We start our proof by establishing a simple inequality.
Lemma 7. For any $t > 0$ and any random variable $X$ such that $\mathbb{E}[X] \le 0$ and $-a \le X \le b$,
$$\mathbb{E}\left[e^{tX}\right] \le \exp\left(\frac{a}{a + b}\left(e^{t(a + b)} - 1\right) - ta\right).$$

Proof. Consider the linear function $f$ defined on $[-a, b]$ that passes through the points $(-a, e^{-ta})$ and $(b, e^{tb})$. Since $e^{tx}$ is convex, Jensen's inequality states that $f$ upper bounds $e^{tx}$, implying that $\mathbb{E}[e^{tX}] \le \mathbb{E}[f(X)]$. Since $f$ is linear, $\mathbb{E}[f(X)]$ depends only on $\mathbb{E}[X]$, and one can derive that
$$\mathbb{E}[f(X)] = \frac{b - \mathbb{E}[X]}{a + b}\, e^{-ta} + \frac{a + \mathbb{E}[X]}{a + b}\, e^{tb}.$$
This quantity is maximized when $\mathbb{E}[X]$ is maximized, i.e., at $\mathbb{E}[X] = 0$. Therefore,
$$\mathbb{E}[e^{tX}] \le \frac{b}{a + b} e^{-ta} + \frac{a}{a + b} e^{tb} = e^{-ta}\left(1 + \frac{a}{a + b}\left(e^{t(a + b)} - 1\right)\right) \le \exp\left(\frac{a}{a + b}\left(e^{t(a + b)} - 1\right) - ta\right),$$
where the final step uses $1 + x \le e^x$.

Proof of Theorem 5. By Markov's inequality, for any $t > 0$ and $v > 0$,
$$\Pr[Z_n - Z_0 \ge v] = \Pr\left[e^{t(Z_n - Z_0)} \ge e^{tv}\right] \le \frac{\mathbb{E}[e^{t(Z_n - Z_0)}]}{e^{tv}}. \quad (1)$$

Let $X_i = Z_i - Z_{i-1}$. Since $Z_i$ is a supermartingale, for any $i$, $\mathbb{E}[X_i \mid Z_0, \ldots, Z_{i-1}] \le 0$. Moreover, from the assumptions in the problem, $-a_i \le X_i \le b_i$. Therefore, Lemma 7 applies to $X = (X_i \mid Z_0, \ldots, Z_{i-1})$, and we have
$$\mathbb{E}[e^{tX_i} \mid Z_0, \ldots, Z_{i-1}] \le \exp\left(\frac{a_i}{c}\left(e^{tc} - 1\right) - ta_i\right). \quad (2)$$

In the following derivation, which will involve expectations of expectations, it will be important to understand which random variables each expectation is taken over. We will adopt the notation $\mathbb{E}_S[f(S)]$ to denote an expectation taken over a set of random variables $S$. Using (2) along with the law of total expectation, which states that $\mathbb{E}_{X,Y}[A] = \mathbb{E}_X[\mathbb{E}_Y[A \mid X]]$ for any random variable $A$ that is a function of random variables $X$ and $Y$, we derive
$$\begin{aligned}
\mathbb{E}_{Z_0, X_1, \ldots, X_{i-1}, X_i}\left[\prod_{j=1}^{i} e^{tX_j}\right]
&= \mathbb{E}_{Z_0, X_1, \ldots, X_{i-1}}\left[\mathbb{E}_{X_i}\left[\prod_{j=1}^{i} e^{tX_j} \,\middle|\, Z_0, X_1, \ldots, X_{i-1}\right]\right] \\
&= \mathbb{E}_{Z_0, X_1, \ldots, X_{i-1}}\left[\prod_{j=1}^{i-1} e^{tX_j} \cdot \mathbb{E}_{X_i}\left[e^{tX_i} \,\middle|\, Z_0, X_1, \ldots, X_{i-1}\right]\right] \\
&= \mathbb{E}_{Z_0, X_1, \ldots, X_{i-1}}\left[\prod_{j=1}^{i-1} e^{tX_j} \cdot \mathbb{E}_{X_i}\left[e^{tX_i} \,\middle|\, Z_0, Z_1, \ldots, Z_{i-1}\right]\right] \\
&\le \exp\left(\frac{a_i}{c}\left(e^{tc} - 1\right) - ta_i\right) \mathbb{E}_{Z_0, X_1, \ldots, X_{i-1}}\left[\prod_{j=1}^{i-1} e^{tX_j}\right].
\end{aligned}$$

By applying the above inequality iteratively, we arrive at the following:
$$\mathbb{E}[e^{t(Z_n - Z_0)}] = \mathbb{E}_{Z_0, X_1, \ldots, X_n}\left[\prod_{i=1}^{n} e^{tX_i}\right] \le \prod_{i=1}^{n} \exp\left(\frac{a_i}{c}\left(e^{tc} - 1\right) - ta_i\right) = \exp\left(\frac{\mu}{c}\left(e^{tc} - 1\right) - t\mu\right).$$

By (1), we have
$$\Pr[Z_n - Z_0 \ge v] \le \exp\left(\frac{\mu}{c}\left(e^{tc} - 1\right) - t\mu - tv\right).$$
Setting $t = (\ln(1 + \delta))/c$ and $v = \delta\mu$ for $\delta > 0$, we obtain
$$\Pr[Z_n - Z_0 \ge \delta\mu] \le \exp\left(\frac{\mu\delta}{c} - \frac{\mu}{c}\ln(1 + \delta) - \frac{\mu}{c}\delta\ln(1 + \delta)\right) = \exp\left(\frac{\mu}{c}\left(\delta - (1 + \delta)\ln(1 + \delta)\right)\right).$$

For any $\delta > 0$,
$$\delta - (1 + \delta)\ln(1 + \delta) \le -\frac{\delta^2}{2 + \delta},$$
which can be seen by inspecting the derivative of both sides.¹ As a result,
$$\Pr[Z_n - Z_0 \ge \delta\mu] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right).$$

Remark.
A stronger but more unwieldy bound may sometimes be helpful. By skipping the approximation of $\delta - (1 + \delta)\ln(1 + \delta)$, we derive
$$\Pr[Z_n - Z_0 \ge \delta\mu] \le \left(\frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}}\right)^{\mu/c}.$$

We conclude the section by proving Corollary 6.
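Before that proof, a quick numerical sanity check may be helpful. The sketch below (our own illustration, not part of the original analysis) compares the Theorem 5 bound against a Monte Carlo estimate for a concrete supermartingale: increments $Z_i - Z_{i-1} = B_i - p$ with $B_i \sim \mathrm{Bernoulli}(p)$, so that $a_i = p$, $b_i = 1 - p$, $c = 1$, and $\mu = np$:

```python
import math
import random

def azuma_mult_bound(mu, delta, c):
    """Upper-tail bound of Theorem 5: exp(-delta^2 mu / ((2 + delta) c))."""
    return math.exp(-delta ** 2 * mu / ((2 + delta) * c))

random.seed(0)
n, p, c = 500, 0.01, 1.0     # increments B_i - p lie in [-p, 1 - p], so a_i + b_i = c = 1
mu = n * p                   # mu = sum of the a_i = n p = 5
delta = 2.0
trials = 2000

# Z_n - Z_0 = (number of successes) - n p; count how often it reaches delta * mu.
hits = 0
for _ in range(trials):
    successes = sum(random.random() < p for _ in range(n))
    if successes - mu >= delta * mu:
        hits += 1

empirical = hits / trials
bound = azuma_mult_bound(mu, delta, c)
print(f"empirical tail probability: {empirical:.4f}")
print(f"Theorem 5 bound:            {bound:.4f}")   # exp(-5) ~ 0.0067
```

The empirical tail frequency comes out well below the bound, as it must; note that the bound is not tight for this example, since it must also cover adversarially correlated increments.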
Proof of Corollary 6. Define $Z_i = \sum_{j=1}^{i} (X_j - a_j)$. Note that $Z_i - Z_{i-1} = X_i - a_i$. The given condition $\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] \le a_i$ implies that
$$\mathbb{E}[Z_i - Z_{i-1} \mid Z_1, \ldots, Z_{i-1}] = \mathbb{E}[Z_i - Z_{i-1} \mid X_1, \ldots, X_{i-1}] \le 0,$$
so $Z_i$ is a supermartingale. Moreover, as each $X_i \in [0, c]$, we have that $Z_i - Z_{i-1} \ge -a_i$ and $Z_i - Z_{i-1} \le c - a_i$. Setting $\mu = \sum_{i=1}^n a_i$, Theorem 5 implies
$$\Pr[Z_n - Z_0 \ge \delta\mu] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right).$$
We may break down $Z_n - Z_0$ as
$$Z_n - Z_0 = \sum_{i=1}^{n} (Z_i - Z_{i-1}) = \sum_{i=1}^{n} (X_i - a_i) = \sum_{i=1}^{n} X_i - \mu.$$
Therefore,
$$\Pr\left[\sum_{i=1}^{n} X_i \ge (1 + \delta)\mu\right] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right).$$

¹Consider $f(x) = x/(1 + x) - \ln(1 + x) + x^2/((1 + x)(2 + x))$. Then $f(0) = 0$ and $f'(x) = -x^2/\left((1 + x)(2 + x)^2\right) \le 0$ for $x \ge 0$. Therefore, $f(x) \le 0$ for $x \ge 0$, and the inequality holds for $\delta > 0$.

Analyzing the $(P, M)$-Recycling Game

In this section we revisit the analysis of the $(P, M)$-recycling game given in [16]. We begin by defining the game and explaining why the analysis given in [16] is incorrect. Then we apply Theorem 5 to obtain a simple and correct analysis.
The $(P, M)$-recycling game is a combinatorial game in which balls labelled 1 to $P$ are tossed at random into $P$ bins. Initially, all $P$ balls are in a reservoir separate from the $P$ bins. At each step of the game, the player executes the following two operations in sequence:

1. The player chooses some of the balls in the reservoir (possibly all and possibly none). For each of these balls, the player removes it from the reservoir, selects one of the $P$ bins uniformly and independently at random, and tosses the ball into it.

2. The player inspects each of the $P$ bins in turn, and for each bin that contains at least one ball, the player removes any one of the balls in the bin and returns it to the reservoir.

The player is permitted to make a total of $M$ ball tosses. The game ends when $M$ ball tosses have been made and all balls have been removed from the bins and placed back in the reservoir. The player is allowed to base their strategy (how many/which balls to toss) on the outcomes of previous turns.

After each step $t$ of the game, some number $n_t$ of balls remain in the bins. The total delay is defined as $D = \sum_{t=1}^{T} n_t$, where $T$ is the total number of steps in the game. Equivalently, if we define the delay of a ball $b$ tossed into a bin $i$ to be the number of balls already present in bin $i$ at the time of the toss, then the total delay is the sum of the delays of all ball tosses.

We would like to give high-probability bounds on the total delay, no matter what strategy the player takes. The following bound is given by [16].
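The game as defined above is straightforward to simulate. The sketch below (our own illustration; it plays a simple non-adversarial strategy that tosses every available reservoir ball each step) tallies the total delay $D$:

```python
import random

def recycling_game(P, M, seed=0):
    """Play the (P, M)-recycling game with a simple (non-adversarial) player
    that tosses every available reservoir ball each step; returns total delay D."""
    rng = random.Random(seed)
    bins = [0] * P          # balls currently in each bin
    reservoir = P           # balls currently in the reservoir
    tosses_left = M
    total_delay = 0
    while tosses_left > 0 or sum(bins) > 0:
        # Operation 1: toss balls from the reservoir into uniformly random bins.
        to_toss = min(reservoir, tosses_left)
        for _ in range(to_toss):
            b = rng.randrange(P)
            total_delay += bins[b]      # delay = balls already present in the bin
            bins[b] += 1
        reservoir -= to_toss
        tosses_left -= to_toss
        # Operation 2: remove one ball from each non-empty bin.
        for b in range(P):
            if bins[b] > 0:
                bins[b] -= 1
                reservoir += 1
    return total_delay

P, M = 32, 10_000
D = recycling_game(P, M)
print(f"total delay D = {D} over M = {M} tosses (D/M = {D / M:.3f})")
```

With this greedy player, $D/M$ typically comes out below 1; an adaptive adversary can do somewhat better, but (as shown below) not asymptotically better than $O(M + P \log \varepsilon^{-1})$.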
Theorem 8 (Lemma 6 in [16]). For any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, the total delay in the $(P, M)$-recycling game is $O(M + P \log P + P \log \varepsilon^{-1})$.

In order to prove Theorem 8, the authors of [16] sketch a complicated combinatorial analysis of the game. Define the indicator random variable $x_{ir}$ to be 1 if the $i$-th toss of ball 1 is delayed by ball $r$, and 0 otherwise. A key component in the analysis of [16] is to show that², for any set $R \subseteq [P]$ of balls,
$$\Pr[x_{ir} = 1 \text{ for all } r \in R] \le P^{-|R|}. \quad (3)$$

Unfortunately, due to subtle dependencies between the random variables $x_{ir}$, (3) is not true (or even close to true). To see why, suppose that the player (i.e., the adversary) adopts the following strategy: Throw balls $A = \{2, 3, \ldots, P\}$ in the first step. If the balls in $A$ do not land in the same bin, then wait $P - 1$ steps (so that the bins empty out) and throw the balls in $A$ again. Continue in this way until the first step $t$ in which all of the balls in $A$ land in the same bin. At the end of step $t$, remove ball 2, leaving balls $3, 4, \ldots, P$ in the same bin as each other. Then on step $t + 1$ perform the first throw of ball 1.

If $M$ is sufficiently large so that all balls in $A$ almost certainly land together before the process ends, then the probability that the first throw of ball 1 lands in the same bin as balls $3, 4, \ldots, P$ is approximately $1/P$. In contrast, (3) claims to bound the same probability by $1/P^{P-2}$.

The difficulty of proving Theorem 8 via an ad-hoc combinatorial argument is further demonstrated by another error in the analysis of [16]. Throughout the proof, the authors define $m_i$ to be the number of times that ball $i$ is thrown, and then treat each $m_i$ as taking a fixed value.

²In fact, the analysis requires a somewhat stronger property to be shown. But for simplicity of exposition, we focus on this simpler variant.
In actuality, however, the $m_i$'s are random variables that are partially controlled by an adversary (i.e., the player of the game), meaning that the outcomes of the $m_i$'s may be linked to the outcomes of the $x_{ir}$'s. This consideration adds even further dependencies that must be considered in order to obtain a correct analysis.

We now give a simple (and correct) analysis of the $(P, M)$-recycling game using the multiplicative version of Azuma's inequality. In fact, we prove a slightly stronger bound than Theorem 8.
Theorem 9. For any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, the total delay in the $(P, M)$-recycling game is $O(M + P \log(1/\varepsilon))$.

Proof. For $i = 1, 2, \ldots, M$, define the delay $X_i$ of the $i$-th toss to be the number of balls in the bin that the $i$-th toss lands in, not counting the $i$-th toss itself. The total delay can be expressed as $D = \sum_{i=1}^{M} X_i$.

As the player's strategy can adapt to the outcomes of previous tosses, the $X_i$'s may have complicated dependencies. Nonetheless, since there are at most $P - 1$ balls in the bins other than the $i$-th toss itself, we know that $X_i \in [0, P]$. Moreover, since the toss selects a bin from $\{1, 2, \ldots, P\}$ uniformly at random, each ball present at the time of the toss has probability $1/P$ of contributing to the delay $X_i$. Thus, no matter the outcomes of $X_1, \ldots, X_{i-1}$, we have that $\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] \le (P - 1)/P \le 1$. We can therefore apply Corollary 6, with $a_i = 1$ for all $i$ and $c = P$, to deduce that
$$\Pr[D \ge (1 + \delta)M] \le \exp\left(-\frac{\delta^2 M}{(2 + \delta)P}\right). \quad (4)$$

If $M \ge P \ln(1/\varepsilon)$, we may substitute $\delta = 2$ into (4) to derive $\Pr[D \ge 3M] \le \exp(-M/P) \le \varepsilon$. If $M \le P \ln(1/\varepsilon)$, we may instead substitute $\delta = 2P \ln(1/\varepsilon)/M$. As $\delta \ge 2$, we have $\delta/(2 + \delta) \ge 1/2$, so $\Pr[D \ge M + 2P \ln(1/\varepsilon)] \le \exp(-\delta M/(2P)) = \varepsilon$. In either case, $\Pr[D \ge 3M + 2P \ln(1/\varepsilon)] \le \varepsilon$, which proves the theorem statement.

In order to fully understand the proof of Theorem 9, it is informative to consider what happens if we attempt to use (the standard) Azuma's inequality to analyze $D = \sum_{i=1}^{M} X_i$. Applying Corollary 4 with $c_i = P$ for all $i$, we get that
$$\Pr[D > (1 + \delta)M] \le \exp\left(-\frac{(\delta M)^2}{2MP^2}\right) = \exp\left(-\frac{\delta^2 M}{2P^2}\right). \quad (5)$$
In contrast, for $\delta \ge$
2, Corollary 6 gives a bound of
$$\Pr[D > (1 + \delta)M] \le \exp\left(-\frac{\delta^2 M}{(2 + \delta)P}\right) \le \exp\left(-\frac{\delta M}{2P}\right). \quad (6)$$

Since $D \le PM$ trivially, the interesting values for $\delta$ are $\delta \le P$. On the other hand, for all $\delta$ satisfying $2 \le \delta < P$, the bound given by (6) is stronger than the bound given by (5). The reason that the multiplicative version of Azuma's inequality does better than the additive version is that the random variables $X_i$ have quite small means, meaning that the $a_i$'s used by the multiplicative bound are much smaller than the $c_i$'s used by the additive bound. When $\delta$ is a constant, this results in a full factor-of-$\Theta(P)$ difference in the exponents achieved by the two bounds. It is not possible to derive an $O(M + P \log \varepsilon^{-1})$ high-probability bound from (5) alone.

Generalizing to an Adaptive Adversary

In this section, we extend Theorem 5 and Corollary 6 in order to allow the values $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ to be random variables that are determined adaptively. Formally, we define the supermartingale $Z_0, \ldots, Z_n$ with respect to a filtration, and then define $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ to be predictable processes with respect to that same filtration.

The statement of Theorem 10 uses several notions that are standard in probability theory (see, e.g., [50] and [14] for formal definitions) but less standard in theoretical computer science.

Theorem 10.
Let $Z_0, \ldots, Z_n$ be a supermartingale with respect to the filtration $F_0, \ldots, F_n$, and let $A_1, \ldots, A_n$ and $B_1, \ldots, B_n$ be predictable processes with respect to the same filtration. Suppose there exist values $c > 0$ and $\mu \ge 0$ satisfying $-A_i \le Z_i - Z_{i-1} \le B_i$, $A_i + B_i = c$, and $\sum_{i=1}^n A_i \le \mu$ (almost surely). Then for any $\delta > 0$,
$$\Pr[Z_n - Z_0 \ge \delta\mu] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right).$$

Corollary 11. Suppose that Alice constructs a sequence of random variables $X_1, \ldots, X_n$, with $X_i \in [0, c]$ for some $c > 0$, using the following iterative process. Once the outcomes of $X_1, \ldots, X_{i-1}$ are determined, Alice selects the probability distribution $D_i$ from which $X_i$ will be drawn; $X_i$ is then drawn from distribution $D_i$. Alice is an adaptive adversary in that she can adapt $D_i$ to the outcomes of $X_1, \ldots, X_{i-1}$. The only constraint on Alice is that $\sum_i \mathbb{E}[X_i \mid D_i] \le \mu$; that is, the sum of the means of the probability distributions $D_1, \ldots, D_n$ must be at most $\mu$.

If $X = \sum_i X_i$, then for any $\delta > 0$,
$$\Pr[X \ge (1 + \delta)\mu] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right).$$

Remark.
Formally, a filtration $F_0, \ldots, F_n$ is a sequence of $\sigma$-algebras such that $F_i \subseteq F_{i+1}$ for each $i$. Informally, one can simply think of the $F_i$'s as revealing "random bits". For each $i$, $F_i$ reveals the set of random bits used to determine all of $Z_0, \ldots, Z_i$, $A_1, \ldots, A_i$, and $B_1, \ldots, B_i$. The fact that $Z_0, Z_1, \ldots, Z_n$ is a martingale with respect to $F_0, F_1, \ldots, F_n$ means simply that the random bits $F_i$ determine $Z_i$ (that is, $Z_i$ is $F_i$-measurable), and that $\mathbb{E}[Z_i \mid F_{i-1}] = Z_{i-1}$. The fact that $A_1, \ldots, A_n$ and $B_1, \ldots, B_n$ are predictable processes means simply that each $A_i$ and $B_i$ is determined by the random bits $F_{i-1}$ (that is, $A_i$, $B_i$ are $F_{i-1}$-measurable).

To prove Theorem 10, we prove the following key lemma:

Lemma 12.
Let $Z_0, \ldots, Z_n$ be a supermartingale with respect to the filtration $F_0, \ldots, F_n$, and let $A_1, \ldots, A_n$ and $B_1, \ldots, B_n$ be predictable processes with respect to the same filtration. Suppose there exist values $c > 0$ and $\mu \ge 0$ satisfying $-A_i \le Z_i - Z_{i-1} \le B_i$, $A_i + B_i = c$, and $\sum_{i=1}^n A_i \le \mu$ (almost surely). Then for any $t > 0$,
$$\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_0\right] \le \exp\left(\frac{\mu}{c}\left(e^{tc} - 1\right) - t\mu\right).$$

Proof. We proceed by induction on $n$.

The base case. For $n = 0$, we have $Z_n - Z_0 = 0$, and for any $c, t > 0$, $(e^{tc} - 1)/c - t > 0$ (since $e^{tc} > 1 + tc$). Therefore, $\mu(e^{tc} - 1)/c - t\mu \ge 0 = t(Z_n - Z_0)$, and the inequality holds.

The inductive step. Assume that the statement is true for $n - 1$; we shall prove it for $n$. The law of total expectation states that for any random variable $X$ and any $\sigma$-algebras $H_1 \subseteq H_2$, $\mathbb{E}[\mathbb{E}[X \mid H_2] \mid H_1] = \mathbb{E}[X \mid H_1]$. As $\{F_i\}$ is a filtration, we know $F_0 \subseteq F_1$, and thus
$$\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_0\right] = \mathbb{E}\left[\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_1\right] \,\middle|\, F_0\right].$$
Since $e^{t(Z_1 - Z_0)}$ is $F_1$-measurable, we can pull it out of the inner expectation as follows:
$$\mathbb{E}\left[\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_1\right] \,\middle|\, F_0\right] = \mathbb{E}\left[e^{t(Z_1 - Z_0)} \cdot \mathbb{E}\left[e^{t(Z_n - Z_1)} \,\middle|\, F_1\right] \,\middle|\, F_0\right]. \quad (7)$$

Let $Z'_i = Z_{i+1}$, $F'_i = F_{i+1}$, $A'_i = A_{i+1}$, $B'_i = B_{i+1}$. We know that $Z'_0, \ldots, Z'_{n-1}$ is a supermartingale with respect to $F'_0, \ldots, F'_{n-1}$. Additionally, we know $A'_1, \ldots, A'_{n-1}$ and $B'_1, \ldots, B'_{n-1}$ are predictable processes with respect to $F'_0, \ldots, F'_{n-1}$ satisfying $-A'_i \le Z'_i - Z'_{i-1} \le B'_i$, $A'_i + B'_i = c$, and $\sum_{i=1}^{n-1} A'_i \le \mu - (A_1 \mid F_1)$. Therefore, we may apply our inductive hypothesis to derive
$$\mathbb{E}\left[e^{t(Z_n - Z_1)} \,\middle|\, F_1\right] = \mathbb{E}\left[e^{t(Z'_{n-1} - Z'_0)} \,\middle|\, F'_0\right] \le \left(\exp\left(\frac{\mu - A_1}{c}\left(e^{tc} - 1\right) - t(\mu - A_1)\right) \,\middle|\, F_1\right). \quad (8)$$

Combining (7) and (8), we find that
$$\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_0\right] \le \mathbb{E}\left[e^{t(Z_1 - Z_0)} \cdot \exp\left(\frac{\mu - A_1}{c}\left(e^{tc} - 1\right) - t(\mu - A_1)\right) \,\middle|\, F_0\right].$$
As $A_1$ is $F_0$-measurable, we can pull the exponential term out of the expectation to arrive at
$$\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_0\right] \le \left(\exp\left(\frac{\mu - A_1}{c}\left(e^{tc} - 1\right) - t(\mu - A_1)\right) \,\middle|\, F_0\right) \cdot \mathbb{E}\left[e^{t(Z_1 - Z_0)} \,\middle|\, F_0\right]. \quad (9)$$

Since $Z_i$ is a supermartingale, $\mathbb{E}[Z_1 - Z_0 \mid F_0] \le 0$. Therefore, Lemma 7 applies to $X = (Z_1 - Z_0 \mid F_0)$, $a = (A_1 \mid F_0)$, $b = (B_1 \mid F_0)$, and we have
$$\mathbb{E}\left[e^{t(Z_1 - Z_0)} \,\middle|\, F_0\right] \le \left(\exp\left(\frac{A_1}{c}\left(e^{tc} - 1\right) - tA_1\right) \,\middle|\, F_0\right). \quad (10)$$

Combining (9) and (10), we have
$$\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_0\right] \le \left(\exp\left(\frac{\mu}{c}\left(e^{tc} - 1\right) - t\mu\right) \,\middle|\, F_0\right) = \exp\left(\frac{\mu}{c}\left(e^{tc} - 1\right) - t\mu\right).$$

Proof of Theorem 10. By Lemma 12 and the law of total expectation,
$$\mathbb{E}[e^{t(Z_n - Z_0)}] = \mathbb{E}\left[\mathbb{E}\left[e^{t(Z_n - Z_0)} \,\middle|\, F_0\right]\right] \le \mathbb{E}\left[\exp\left(\frac{\mu}{c}\left(e^{tc} - 1\right) - t\mu\right)\right] = \exp\left(\frac{\mu}{c}\left(e^{tc} - 1\right) - t\mu\right).$$
The rest of the proof is identical to the proof of Theorem 5.

Corollary 11 is a straightforward application of Theorem 10.
Proof of Corollary 11. Define the filtration $F_0, F_1, \ldots, F_n$ by
$$F_i = \sigma(X_1, X_2, \ldots, X_i, D_1, D_2, \ldots, D_{i+1}).$$
That is, $F_i$ is the smallest $\sigma$-algebra with respect to which all of $X_1, X_2, \ldots, X_i$ and $D_1, D_2, \ldots, D_{i+1}$ are measurable.

Define $A_i = \mathbb{E}[X_i \mid D_i]$ to be the expected value of $X_i$ once its distribution is determined, and $B_i = c - A_i$. Define $Z_0, \ldots, Z_n$ to be given by
$$Z_i = \sum_{j=1}^{i} X_j - \sum_{j=1}^{i} A_j.$$
Since $A_i$ and $B_i$ are $D_i$-measurable and $F_{i-1}$ contains $D_i$, we know that $A_i$ and $B_i$ are also $F_{i-1}$-measurable, implying that they are predictable processes with respect to the filtration $F_0, \ldots, F_n$.

As each $X_i$ is drawn from distribution $D_i$ after all of $X_1, \ldots, X_{i-1}$ and $D_1, \ldots, D_i$ have been determined, we have $\mathbb{E}[X_i \mid F_{i-1}] = \mathbb{E}[X_i \mid D_i]$. We can then compute that
$$\mathbb{E}[Z_i \mid F_{i-1}] = \mathbb{E}[X_i - A_i + Z_{i-1} \mid F_{i-1}] = \mathbb{E}[X_i \mid F_{i-1}] - A_i + Z_{i-1} = \mathbb{E}[X_i \mid D_i] - A_i + Z_{i-1} = Z_{i-1},$$
implying that $Z_0, \ldots, Z_n$ is a martingale with respect to the filtration $F_0, \ldots, F_n$.

Finally, $\{Z_i\}$, $\{A_i\}$, and $\{B_i\}$ satisfy the requirements of Theorem 10, namely that $-A_i \le Z_i - Z_{i-1} \le B_i$, that $A_i + B_i = c$, and that $\sum_i A_i \le \mu$. Thus, by Theorem 10,
$$\Pr[Z_n \ge \delta\mu] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right).$$
Expanding out $Z_n$ gives
$$\Pr\left[\sum_i X_i \ge \sum_i A_i + \delta\mu\right] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right),$$
and thus we have
$$\Pr\left[\sum_i X_i \ge (1 + \delta)\mu\right] \le \exp\left(-\frac{\delta^2\mu}{(2 + \delta)c}\right),$$
as desired.

Applications of Theorem 10 in Concurrent Work

By allowing for an adaptive adversary, Theorem 10 naturally lends itself to applications with online adversaries. We conclude the section by briefly discussing two applications of Theorem 10 that have arisen in several of our recent concurrent works. In both cases, Theorem 10 significantly simplified the task of analyzing an algorithm.
Edge orientation in incremental forests. In [9], Bender et al. consider the problem of edge orientation in an incremental forest. In this problem, edges $e_1, e_2, \ldots, e_k$ of a forest arrive one by one, and we are responsible for maintaining an orientation of the edges (i.e., an assignment of directions to the edges) such that every vertex has out-degree at most $O(1)$. As each edge $e_i$ arrives, we may need to flip the orientations of other edges in order to accommodate the newly arrived edge. The goal in [9] is to flip at most $O(\log \log n)$ orientations per edge insertion (with high probability). We refer to an edge insertion as a step.

A key component of the algorithm in [9] is that vertices may "volunteer" to have their out-degree incremented during a given step. During each step $i$, there is a set $S_i$ of $\operatorname{polylog} n$ vertices that are eligible to volunteer, and each of these vertices volunteers with probability $1/\operatorname{polylog} n$. The algorithm is designed to satisfy the property that each vertex $v$ can appear in at most $O(\log n)$ of the $S_i$'s.

An essential piece of the analysis is to show that, for any set $S$ of size $\operatorname{polylog} n$, the number of vertices in $S$ that ever volunteer is at most $|S|/2$. During each step $i$, the expected number of vertices in $S$ that volunteer is $|S \cap S_i|/\operatorname{polylog} n$. Each $S_i$ is partially a function of the algorithm's past randomness, and thus the $S_i$'s are effectively selected by an adaptive adversary, subject to the constraint that each vertex $v$ appears in at most $O(\log n)$ of the $S_i$'s. By applying Theorem 10, one can deduce that the number of vertices in $S$ that volunteer is small (with high probability).

Note that, since $|S| = \operatorname{polylog} n$, a bound with additive error would not suffice here. Such a bound would allow the number of vertices that volunteer to deviate by $\Omega(\sqrt{n})$ from its mean, which is larger than $|S|/2$.

Task scheduling against an adaptive adversary.
Another concurrent work to ours [10] considers a scheduling problem in which the arrival of new work to be scheduled is controlled by a (mostly) adaptive adversary. In particular, although the amount of new work that arrives during each step is fixed (to $1 - \varepsilon$), the tasks to which that new work is assigned are determined by the adversary. The scheduling algorithm is then allowed to select a single task to perform 1 unit of work on. The goal is to design a scheduling algorithm that prevents the backlog (i.e., the maximum amount of unfinished work for any task) from becoming too large.

Due to the complexity of the algorithm in [10], we cannot explain in detail the application of Theorem 10. The basic idea, however, is that the adversary must decide how to allocate its resources across tasks over time, but the adversary can adapt (in an online fashion) to events that it has observed in the past. Theorem 10 allows the authors of [10] to obtain Chernoff-style bounds on the number of certain "bad events" that occur, while handling the adaptiveness of the adversary.

Acknowledgments
We thank Charles Leiserson of MIT for suggesting that the proof in [16] might warrant revisiting, and for giving useful feedback on the manuscript. We also thank Tao B. Schardl of MIT for several helpful conversations, and thank Kevin Yang of UC Berkeley for helpful discussions on probability theory.
References

[1] James Aspnes. Randomized consensus in expected O(n) total work using single-writer registers. In International Symposium on Distributed Computing, pages 363–373. Springer, 2011.
[2] Vincenzo Auletta, Ioannis Caragiannis, Christos Kaklamanis, and Pino Persiano. Randomized path coloring on binary trees. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 60–71. Springer, 2000.
[3] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. SIAM Journal on Computing, 29(1):180–200, 1999.
[4] Yossi Azar, Ilan Reuven Cohen, and Iftah Gamzu. The loss of serving in the dark. In Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 951–960, 2013.
[5] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.
[6] Michael A. Bender, Jake Christensen, Alex Conway, Martin Farach-Colton, Rob Johnson, and Meng-Tsung Tsai. Optimal ball recycling. In Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2527–2546. SIAM, 2019.
[7] Michael A. Bender, Martín Farach-Colton, and William Kuszmaul. Achieving optimal backlog in multi-processor cup games. In Fifty-First Annual ACM SIGACT Symposium on Theory of Computing, pages 1148–1157, 2019.
[8] Michael A. Bender, Tsvi Kopelowitz, William Kuszmaul, and Seth Pettie. Contention resolution without collision detection. In Fifty-Second Annual ACM SIGACT Symposium on Theory of Computing, pages 105–118, 2020.
[9] Michael A. Bender, Tsvi Kopelowitz, William Kuszmaul, Eli Porat, and Clifford Stein. Edge orientation for incremental forests and low-latency cuckoo hashing. Under submission to Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 2020.
[10] Michael A. Bender and William Kuszmaul. Randomized cup game algorithms against strong adversaries. Under submission to Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 2020.
[11] George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.
[12] Sergei Bernstein. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.
[13] Sergei N. Bernstein. On certain modifications of Chebyshev's inequality. Doklady Akademii Nauk SSSR, 17(6):275–277, 1937.
[14] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 2008.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.
[16] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), 46(5):720–748, 1999.
[17] Béla Bollobás, Christian Borgs, Jennifer T. Chayes, and Oliver Riordan. Directed scale-free graphs. In Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, volume 3, pages 132–139, 2003.
[18] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[19] Arnaud Casteigts, Yves Métivier, John Michael Robson, and Akka Zemmari. Design patterns in beeping algorithms: Examples, emulation, and analysis. Information and Computation, 264:32–51, 2019.
[20] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[21] Deeparnab Chakrabarty and C. Seshadhri. A $\tilde{O}(n)$ non-adaptive tester for unateness. arXiv preprint arXiv:1608.06980, 2016.
[22] Herman Chernoff. A career in statistics. Past, Present, and Future of Statistical Science, 29, 2014.
[23] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79–127, 2006.
[24] Kevin P. Costello, Asaf Shapira, and Prasad Tetali. Randomized greedy: new variants of some classic approximation algorithms. In Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 647–655. SIAM, 2011.
[25] Victor H. de la Peña, Michael J. Klass, and Tze Leung Lai. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, pages 1902–1933, 2004.
[26] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 202–210, 2003.
[27] Devdatt P. Dubhashi and Desh Ranjan. Balls and bins: A study in negative dependence. BRICS Report Series, 3(25), 1996.
[28] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
[29] Kacha Dzhaparidze and J. H. van Zanten. On Bernstein-type inequalities for martingales. Stochastic Processes and Their Applications, 93(1):109–117, 2001.
[30] Xiequan Fan, Ion Grama, Quansheng Liu, et al. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20, 2015.
[31] David A. Freedman. On tail probabilities for martingales. Annals of Probability, pages 100–118, 1975.
[32] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, pages 212–223, 1998.
[33] Ajay Gopinathan and Zongpeng Li. Strategyproof mechanisms for content delivery via layered multicast. In International Conference on Research in Networking, pages 82–96. Springer, 2011.
[34] L. Gyorfi, Gábor Lugosi, and Gusztáv Morvai. A simple randomized algorithm for sequential prediction of ergodic time series. IEEE Transactions on Information Theory, 45(7):2642–2650, 1999.
[35] Erich Haeusler. An exact rate of convergence in the functional central limit theorem for special martingale difference arrays. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 65(4):523–534, 1984.
[36] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
[37] Kumar Joag-Dev and Frank Proschan. Negative association of random variables with applications. The Annals of Statistics, pages 286–295, 1983.
[38] Rasul A. Khan. $L_p$-version of the Dubins–Savage inequality and some exponential inequalities. Journal of Theoretical Probability, 22(2):348, 2009.
[39] Alam Khursheed and K. M. Lai Saxena. Positive dependence in multivariate distributions. Communications in Statistics-Theory and Methods, 10(12):1183–1196, 1981.
[40] Dennis Komm, Rastislav Královič, Richard Královič, and Tobias Mömke. Randomized online algorithms with high probability guarantees. In Thirty-First International Symposium on Theoretical Aspects of Computer Science (STACS 2014). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.
[41] V. Kumar. An approximation algorithm for circular arc colouring. Algorithmica, 30(3):406–417, 2001.
[42] Emmanuel Lesigne and Dalibor Volný. Large deviations for martingales. Stochastic Processes and Their Applications, 96(1):143–159, 2001.
[43] Reut Levi, Dana Ron, and Ronitt Rubinfeld. Local algorithms for sparse spanning graphs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.
[44] Robert Liptser and Vladimir Spokoiny. Deviation probability bound for martingales with applications to statistical estimation. Statistics & Probability Letters, 46(4):347–357, 2000.
[45] Quansheng Liu and Frédérique Watbled. Exponential inequalities for martingales and asymptotic properties of the free energy of directed polymers in a random environment. Stochastic Processes and Their Applications, 119(10):3101–3132, 2009.
[46] Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.
[47] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679–1706, 1994.
[48] Emmanuel Rio et al. Extensions of the Hoeffding-Azuma inequalities. Electronic Communications in Probability, 18, 2013.
[49] Emmanuel Rio et al. On McDiarmid's concentration inequality. Electronic Communications in Probability, 18, 2013.
[50] Sebastien Roch. Modern Discrete Probability: An Essential Toolkit. Lecture notes, 2015.
[51] Tao B. Schardl, I-Ting Angelina Lee, and Charles E. Leiserson. Brief announcement: Open Cilk. In , pages 351–353, 2018.
[52] Sara A. van de Geer. On Hoeffding's inequality for dependent random variables. In Empirical Process Techniques for Dependent Data, pages 161–169. Springer, 2002.
[53] David Wajc. Negative association: definition, properties, and applications. ∼dwajc/notes/Negative%20Association.pdf, 2017.
[54] Grigory Yaroslavtsev and Samson Zhou. Approximate $F_2$-sketching of valuation functions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
A Multiplicative Lower Tail Bounds
In this section we prove a lower tail bound with multiplicative error for both the normal and the adversarial setting. Whereas Theorem 5 and Theorem 10 allow us to bound the probability of a random variable substantially exceeding its mean, Theorem 13 and Theorem 15 allow us to bound the probability of a random variable taking a substantially smaller value than its mean.
Theorem 13. Let $Z_0, Z_1, \ldots, Z_n$ be a submartingale, meaning that $\mathbb{E}[Z_i \mid Z_0, \ldots, Z_{i-1}] \ge Z_{i-1}$. Assume additionally that $-a_i \le Z_i - Z_{i-1} \le b_i$, where $a_i + b_i = c$ for some constant $c > 0$ for all $i$. Let $\mu = \sum_{i=1}^n a_i$. Then for any $0 \le \delta < 1$,
\[ \Pr[Z_n - Z_0 \le -\delta \mu] \le \exp\left( -\frac{\delta^2 \mu}{2c} \right). \]

Corollary 14. Let $X_1, \ldots, X_n \in [0, c]$ be real-valued random variables with $c > 0$. Suppose $\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] \ge a_i$ for all $i$. Let $\mu = \sum_{i=1}^n a_i$. Then for any $0 \le \delta < 1$,
\[ \Pr\left[ \sum_i X_i \le (1 - \delta)\mu \right] \le \exp\left( -\frac{\delta^2 \mu}{2c} \right). \]

Theorem 15. Let $Z_0, \ldots, Z_n$ be a submartingale with respect to the filtration $\mathcal{F}_0, \ldots, \mathcal{F}_n$, and let $A_1, \ldots, A_n$ and $B_1, \ldots, B_n$ be predictable processes with respect to the same filtration. Suppose there exist values $c > 0$ and $\mu > 0$ satisfying $-A_i \le Z_i - Z_{i-1} \le B_i$, $A_i + B_i = c$, and $\sum_{i=1}^n A_i \ge \mu$ (almost surely). Then for any $0 \le \delta < 1$,
\[ \Pr[Z_n - Z_0 \le -\delta \mu] \le \exp\left( -\frac{\delta^2 \mu}{2c} \right). \]

Corollary 16. Suppose that Alice constructs a sequence of random variables $X_1, \ldots, X_n$, with $X_i \in [0, c]$, $c > 0$, using the following iterative process. Once the outcomes of $X_1, \ldots, X_{i-1}$ are determined, Alice selects the probability distribution $\mathcal{D}_i$ from which $X_i$ will be drawn; $X_i$ is then drawn from distribution $\mathcal{D}_i$. Alice is an adaptive adversary in that she can adapt $\mathcal{D}_i$ to the outcomes of $X_1, \ldots, X_{i-1}$. The only constraint on Alice is that $\sum_i \mathbb{E}[X_i \mid \mathcal{D}_i] \ge \mu$, that is, the sum of the means of the probability distributions $\mathcal{D}_1, \ldots, \mathcal{D}_n$ must be at least $\mu$. If $X = \sum_i X_i$, then for any $0 \le \delta < 1$,
\[ \Pr[X \le (1 - \delta)\mu] \le \exp\left( -\frac{\delta^2 \mu}{2c} \right). \]

We begin by proving Theorem 13. The proof is similar to the proof for the upper tail bound, with a different approximation used.
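Before turning to the proofs, the following sketch gives an empirical sanity check of the lower tail bound. The parameters and the adversary strategy are ours (chosen for illustration; they do not come from the paper): one run is an oblivious Bernoulli process in the setting of Corollary 14, and the other is a simple adaptive adversary in the style of Corollary 16, where Alice chooses each distribution after seeing the running sum but keeps every conditional mean equal to $\mu/n$, so her constraint is met with equality.

```python
import math
import random

random.seed(1)

# Illustrative parameters (ours, not from the paper).
n, c, mu, delta = 1000, 1.0, 100.0, 0.5
bound = math.exp(-delta**2 * mu / (2 * c))   # exp(-12.5), about 3.7e-6

def oblivious_run():
    # Corollary 14 setting: X_i = c * Bernoulli(mu / (n * c)), so each
    # conditional mean is exactly a_i = mu / n and the a_i sum to mu.
    p = mu / (n * c)
    return sum(c for _ in range(n) if random.random() < p)

def adaptive_run():
    # Corollary 16 setting: Alice picks each D_i after seeing the running
    # sum.  Every D_i below has mean exactly mu / n, so the conditional
    # means sum to mu and her constraint is satisfied.
    total, m = 0.0, mu / n
    for i in range(n):
        if total > m * (i + 1):
            # Ahead of pace: gamble on the two-point distribution {0, c}.
            total += c if random.random() < m / c else 0.0
        else:
            # Behind pace: play the deterministic value m (a valid D_i).
            total += m
    return total

trials = 2000
for run in (oblivious_run, adaptive_run):
    empirical = sum(run() <= (1 - delta) * mu for _ in range(trials)) / trials
    print(run.__name__, empirical, bound)
    assert empirical <= bound
```

Both empirical frequencies should come in well under the bound; the point is only that the bound survives Alice's adaptivity, not that it is tight for these particular processes.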
Lemma 17. For any $t < 0$ and any random variable $X$ such that $\mathbb{E}[X] \ge 0$ and $-a \le X \le b$,
\[ \mathbb{E}[e^{tX}] \le \exp\left( \frac{a}{a+b} \left( e^{t(a+b)} - 1 \right) - ta \right). \]

Proof. Same as the proof of Lemma 7.
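As an aside, Lemma 17 can be spot-checked numerically (this check is ours, not part of the paper's argument) at the two-point distribution on $\{-a, b\}$ with mean zero, which is the extremal case in the standard moment-generating-function argument:

```python
import math

def mgf_two_point(a, b, t):
    # X = b with probability a/(a+b), X = -a with probability b/(a+b),
    # so that E[X] = 0 and -a <= X <= b.
    return (a * math.exp(t * b) + b * math.exp(-t * a)) / (a + b)

def lemma17_bound(a, b, t):
    # The right-hand side of Lemma 17.
    return math.exp(a / (a + b) * (math.exp(t * (a + b)) - 1) - t * a)

ok = all(
    mgf_two_point(a, b, t) <= lemma17_bound(a, b, t) + 1e-12
    for a in (0.1, 0.5, 1.0, 2.0)
    for b in (0.1, 1.0, 3.0)
    for t in (-2.0, -1.0, -0.25, -0.01)
)
print(ok)  # True
```

The two checks agree up to the single application of $1 + x \le e^x$ in the proof, which is why the gap is small for small $|t|$.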
Proof of Theorem 13. By Markov's inequality, for any $t < 0$,
\[ \Pr[Z_n - Z_0 \le v] = \Pr[t(Z_n - Z_0) \ge tv] = \Pr\left[ e^{t(Z_n - Z_0)} \ge e^{tv} \right] \le \frac{\mathbb{E}\left[e^{t(Z_n - Z_0)}\right]}{e^{tv}}. \]
Define $X_i = Z_i - Z_{i-1}$. Since $Z_i$ is a submartingale, for any $i$, $\mathbb{E}[X_i \mid Z_0, \ldots, Z_{i-1}] \ge 0$. Moreover, from the assumptions in the problem, $-a_i \le X_i \le b_i$. Therefore, Lemma 17 applies to $X = (X_i \mid Z_0, \ldots, Z_{i-1})$, and we have
\[ \mathbb{E}\left[e^{tX_i} \mid Z_0, \ldots, Z_{i-1}\right] \le \exp\left( \frac{a_i}{c} \left( e^{tc} - 1 \right) - ta_i \right) \]
for any $t < 0$. Using the same derivation as in the proof of Theorem 5, we have
\[ \Pr[Z_n - Z_0 \le v] \le \exp\left( \frac{\mu}{c} \left( e^{tc} - 1 \right) - t\mu - tv \right). \]
Plugging in $t = \ln(1 - \delta)/c$ and $v = -\delta\mu$ for $0 < \delta < 1$,
\[ \Pr[Z_n - Z_0 \le -\delta\mu] \le \exp\left( \frac{\mu}{c} \left( -\delta - (1 - \delta)\ln(1 - \delta) \right) \right). \]
For any $0 \le \delta < 1$,
\[ -\delta - (1 - \delta)\ln(1 - \delta) \le -\frac{\delta^2}{2}, \]
which can be seen by inspecting the derivative of both sides.\footnote{Consider $f(x) = -x/(1-x) - \ln(1-x) + x^2/(2(1-x))$. Then $f(0) = 0$, and $f'(x) = -x^2/(2(1-x)^2) \le 0$ for $0 \le x < 1$. Therefore, $f(x) \le 0$ for $0 \le x < 1$; multiplying $f(\delta)$ by $(1 - \delta) > 0$ shows that the inequality holds for $0 \le \delta < 1$.} As a result,
\[ \Pr[Z_n - Z_0 \le -\delta\mu] \le \exp\left( -\frac{\delta^2 \mu}{2c} \right). \]

Remark. As with the upper tail bound, we may derive a stronger but more unwieldy bound of
\[ \Pr[Z_n - Z_0 \le -\delta\mu] \le \left( \frac{e^{-\delta}}{(1 - \delta)^{(1 - \delta)}} \right)^{\mu/c}. \]

The proof of Corollary 14 is identical to the proof of Corollary 6. The proof of Theorem 15 can be obtained by combining the proofs of Theorem 10 and Theorem 13. The proof of Corollary 16 is identical to the proof of Corollary 11.