[PDF] Near-Optimal Confidence Sequences for Bounded Random Variables

Abstract

Many inference problems, such as sequential decision problems like A/B testing, adaptive sampling schemes like bandit selection, are often online in nature. The fundamental problem for online inference is to provide a sequence of confidence intervals that are valid uniformly over the growing-into-infinity sample sizes. To address this question, we provide a near-optimal confidence sequence for bounded random variables by utilizing Bentkus' concentration results. We show that it improves on the existing approaches that use the Cram{é}r-Chernoff technique such as the Hoeffding, Bernstein, and Bennett inequalities. The resulting confidence sequence is confirmed to be favorable in both synthetic coverage problems and an application to adaptive stopping algorithms.

Full PDF

NNear-Optimal Conﬁdence Sequences for BoundedRandom Variables

Arun Kumar Kuchibhotla ∗ Qinqing Zheng † University of Pennsylvania

June 8, 2020

Abstract

Many inference problems, such as sequential decision problems like A/B testing, adaptivesampling schemes like bandit selection, are often online in nature. The fundamental problemfor online inference is to provide a sequence of conﬁdence intervals that are valid uniformlyover the growing-into-inﬁnity sample sizes. To address this question, we provide a near-optimalconﬁdence sequence for bounded random variables by utilizing Bentkus’ concentration results.We show that it improves on the existing approaches that use the Cramér-Chernoﬀ techniquesuch as the Hoeﬀding, Bernstein, and Bennett inequalities. The resulting conﬁdence sequenceis conﬁrmed to be favorable in both synthetic coverage problems and an application to adaptivestopping algorithms.

The abundance of data over the decades has increased the demand for sequential algorithms andinference procedures in statistics and machine learning. For instance, when the data is too large to ﬁtin a single machine, it is natural to split data into small batches and process one at a time. Besides,many industry or laboratory data, like user behaviors on a website, patient records, temperaturehistories, are naturally generated and available in a sequential order. In both scenarios, the collectionor processing of new data can be costly, and practitioners often would like to stop data samplingwhen a required criterion is satisﬁed. This gives the pressing call for algorithms that minimize thenumber of sequential samples subject to the prescribed accuracy of the estimator is satisﬁed.Many important problems ﬁt into this framework, including sequential hypothesis testing prob-lems such as testing positiveness of the mean [29], testing equality of distributions and testingindependence [4, 28], A/B testing [18, 19], sequential probability ratio test [27], best arm iden-tiﬁcation for multi-arm bandits (MAB) [29, 28], and p ε, δ q -mean estimation [20, 16]. All theseapplications require conﬁdence sequences to determine the number of samples required for a certainguarantee.Let Y , Y , . . . be independent real-valued random variables, available sequentially, with mean µ P R . Given δ P r , s , a ´ δ conﬁdence sequence is a sequence of conﬁdence interval ConfSeq p δ q “t CI p δ q , CI p δ q , . . . u , where CI n is constructed on-the-ﬂy after observing data sample Y n , such that P p µ P CI n p δ q for all n ě q ě ´ δ. (1) ∗ Department of Statistics. Email: [email protected] . † Department of Statistics. Email: [email protected] . a r X i v : . [ m a t h . S T ] J un nlike the traditional conﬁdence interval in statistics, the guarantee (1) is non-asymptotic and isuniform over the sample sizes. Ideally, we want CI n p δ q to reduce in width as either n or δ orboth increase. Unfortunately, but not surprisingly, guarantee (1) is impossible to achieve non-trivially without further assumptions [3, 25]. In this paper, we assume that the random variablesare bounded: there exist known constants L, U P R such that P p L ď Y i ď U q “ , which yields µ P r

L, U s . Although boundedness can be replaced by tail assumptions such as sub-Gaussianity orpolynomial tails, we will restrict our discussion to the bounded case; see Section 5 for a discussion. Motivating Example.

We give a concrete example to illustrate how the conﬁdence sequencecan be applied to sequential statistical inference. Estimating the mean of a random variable is aclassic problem in statistics and widely applied to various applications. An estimator p µ is said tobe p ε, δ q -accurate for the mean µ if P p| p µ { µ ´ | ď ε q ě ´ δ [10, 20, 16]. This means that theestimator has a relative error of at most ε with probability at least ´ δ . Relative error is importantin many examples such as permanent estimation [9], estimation of volume of a convex body [11],and estimation of partition function of Gibbs distribution [17], where the unknown magnitude of µ can render absolute error unreliable. The important question we would like to answer is How many samples are required to obtain an estimator of the mean that is p ε, δ q accurate? Suppose one can construct a ´ δ conﬁdence sequence ConfSeq p δ q “ t CI n “ r s Y n ´ Q n , s Y n ` Q n s , n ě u , where s Y n is the empirical mean of the ﬁrst n samples. Mnih et al. [20] shows thatwith stopping time N “ min t n : p ´ ε q UB n ď p ` ε q LB n u , where UB n and LB n are two simplefunctions of the radius of the conﬁdence intervals Q , . . . , Q n , the estimator p µ “ p { q sign p s Y N qrp ´ ε q UB N ` p ` ε q LB N s is p ε, δ q -accurate. Contributions.

The need for sequential algorithms has triggered a surge of interest in developingsharp conﬁdence sequences. In the recent years, several conﬁdence sequences were proposed by stitching the ﬁxed sample size conﬁdence intervals [29, 20, 15], and the ﬁxed sample size intervalsare derived from the concentration inequalities such as Hoeﬀding, Bernstein, or Bennett [20, Section2]. Although the techniques of stitching are slightly diﬀerent for those methods, they are easilytransferable from one to another. The tightness of the conﬁdence sequences is mainly controlled bythe sharpness of the ﬁxed sample size concentration inequalities.To the best of our knowledge, all the existing conﬁdence sequences are built upon concentrationresults that bound the moment generating function and follow the Cramér-Chernoﬀ technique.Those concentration results are conservative and can be signiﬁcantly improved [21]. In this paper,we leverage the reﬁned concentration results introduced by Bentkus [5]. We ﬁrst develop a “maximal”version of Bentkus’ concentration inequality. Based on it, we construct the conﬁdence sequence viastitching. In honor of Bentkus, who pioneered this line of reﬁned concentration inequalities, we callour conﬁdence sequence as

Bentkus’ Conﬁdence Sequence .To summarize, major contributions of this article are four-fold. • For pointwise concentration results, we provide a near-optimal reﬁnement of the Cramér-Chernoﬀtail bound for bounded random variables based on the results of [5, 6, 22]. Unlike the Chernoﬀbounds, the reﬁned bound is optimal up to e { , i.e., there exists a distribution such that for Of course, if we take CI n p δ q “ p´8 , , then (1) is trivially satisﬁed. Y i with that distribution, our tail bound is at most e { times the true tail.Further, our bound is always smaller than the Chernoﬀ bounds. The computation of Bentkus’method is non-trivial. We provide closed-form equations for our tail bound so that the exactnumerical solution can be obtained easily. • We use these results in conjunction with a “stitching” method to construct non-asymptotic con-ﬁdence sequences. For s Y “ n ´ ř ni “ Y i , the conﬁdence interval is CI n p δ q : “ r s Y n ´ q low n p δ q , s Y n ` q up n p δ qs , for diﬀerent values q low n p δ q , q up n p δ q ě and they scale like a Var p Y q log log p n q{ n as n Ñ 8 . Theoptimal nature of our tail bounds implies that our conﬁdence sequence is always shorter in lengththan the adaptive Hoeﬀding bound [29] and the empirical Bernstein bound [20]. • Our conﬁdence sequence utilizes the variance of Y . For practical usage, we also provide a closeform upper bound on the true variance of the bounded random variables, so that our conﬁdencesequence is actionable in practice. This upper bound can even be improved using numericalsolvers. • We conducted numerical experiments to verify our theoretical claims. We also applied the con-ﬁdence sequence to the p ε, δ q -accurate mean estimation problem above, and show that it gives amuch smaller stopping time than the competing methods. Organization.

The rest of this article is organized as follows. Section 2 reviews the related work.Section 3 contains all our theoretical results. Section 4 presents the numerical experiments thatconﬁrm the superiority of our method. Section 5 summarizes the contributions and considers somefuture directions of the work.

Zhao et al. [29] propose conﬁdence sequences through Hoeﬀding’s inequality; the assumption hereis that Y i ’s are { -sub-Gaussian. For random variables supported on r L, U s , this is satisﬁed afterdividing by p U ´ L q . This conﬁdence sequence does not scale with the true variance and hence canbe conservative. Mnih et al. [20] building on [2] construct conﬁdence sequences through Bernstein’sinequality and the intervals here scale correctly with the true variance. Both these conﬁdencesequences are closed-form in nature. More recently, Howard et al. [15] uniﬁed the techniques ofobtaining conﬁdence sequences under a variety of assumptions on random variables. This workbuilds on much of the existing statistics literature and we refer the reader to this paper for adetailed historical account. The main message from this paper is that the width of any conﬁdencesequence CI n p δ q satisfying (1) must be larger than a A p log log n q{ n as n tends to inﬁnity, where A represents the common variance of Y , Y , . . . .All the conﬁdence sequences in the works mentioned above depend on bounding the momentgenerating function and follow the Cramér-Chernoﬀ technique. Such conﬁdence sequences are con-servative and can be signiﬁcantly improved [21]. To understand the deﬁciency of such concentrationinequalities, consider for example the Bernstein’s inequality: for s Y n “ ř ni “ Y i { n , P ` ? n p s Y n ´ µ q ě t ˘ ď exp ` ´ t {t A ` Bt {p ? n qu ˘ , exp p´ t {p A qq . However, the central limit theorem implies P p? n p s Y n ´ µ q ě t q « ´ Φ p t { A q where the limit behaves like exp p´ t {p A qq{ a π pp t { A q ` q [1, Formula 7.1.13].Therefore, Bernstein’s inequality and the true tail diﬀer by the scaling a π p t { A ` q which canbe signiﬁcant for large t . This explains why a further reﬁnement is possible and Bentkus [5] presentssuch reﬁned concentration inequalities. Our work essentially builds on [5, 6, 8, 22, 24] to obtain aconﬁdence sequence through the technique of stitching. For any random variable Y i with mean µ , X i “ Y i ´ µ is mean zero and hence we will mostlyrestrict to the case of mean zero random variables. The result for general µ will readily follow; seeTheorem 4. In this section, we discuss Bentkus’ concentration inequality for bounded mean zerorandom variables. After this, we present a reﬁned conﬁdence sequence that is not readily actionablebecause it depends on the true variance of random variables. Finally, we present an actionableversion where we replace the true variance by an estimated upper bound. This provides an analog ofthe empirical Bernstein conﬁdence sequence, which we call Empirical Bentkus Conﬁdence Sequence .The setting is as follows. Suppose X , X , . . . are independent random variables satisfying E r X i s “ , Var p X i q ď A i , and P p X i ą B q “ , for i ě . (2)We will ﬁrst derive concentration inequalities under the one-sided bound assumption as in (2) whichonly requires X i ď B almost surely. To derive actionable versions of the concentration inequalities(with estimated variance), we will impose a two-sided bound assumption. We now present a concentration inequality that holds uniformly over all sample sizes up to a ﬁxedtime. Our reﬁned tail bounds are based on a worst case two-point distribution satisfying (2). Deﬁnemean zero independent random variables G , G , . . . as P ` G i “ ´ A i { B ˘ “ B {p A i ` B q and P p G i “ B q “ A i {p A i ` B q . (3)These random variables satisfy Var p G i q “ A i and P p G i ą B q “ . Furthermore, G i ’s are the worstcase random variables satisfying (2) in the sense that for all n ě and x P R , sup X ,...,X n „ (2) E rp ř ni “ X i ´ x q ` s “ E rp ř ni “ G i ´ x q ` s , (4)where p a q ` “ max t a, u and the supremum is over all distributions of X i ’s satisfying (2). Moreover, P p ř ni “ G i ě x q ď sup X ,...,X n „ (2) P p ř ni “ X i ě x q ď p e { q P p ř ni “ G i ě x q , (5)where the right hand side inequality holds for all x in the support of ř ni “ G i ; it holds for all x P R if P p ř ni “ G i ě x q is replaced by its log-linear interpolation; see [5] for details. The left hand sideinequality in (5) is trivial because G i ’s satisfy (2) and the right hand side inequality is derivedusing (4). Inequalities in (5) show that concentration inequalities based on the two-point randomvariables G i are sharp up to a constant factor e { . This is unlike the classical Cramér-Chernoﬀinequalities which diﬀer from the optimal one by a factor depending on x ; see Talagrand [26].4et A “ t A , A , . . . u as the collection of standard deviations and for n ě , δ P r , s , deﬁne q p δ ; n, A , B q as the solution to inf x ď u E rp ř ni “ G i ´ x q ` sp u ´ x q ` “ δ. (6)The solution exists uniquely for δ ě P p ř ni “ G i “ nB q and is deﬁned to be nB ` if δ ă P p ř ni “ G i “ nB q . The following result provides a reﬁned concentration inequality for S n “ ř ni “ X i . It is a“maximal” version of Theorem 2.1 of [8], and we defer the proof to Appendix D. Theorem 1.

Fix n ě . If X , X , . . . , X n are independent random variables satisfying (2) , then P ˆ max ď t ď n S t ě q p δ ; n, A , B q ˙ ď δ, for any δ P r , s . (7) Further, if A “ ¨ ¨ ¨ “ A n “ A and if ˜ q p¨ ; A, B q is some function such that P p max ď t ď n S t ě n ˜ q p δ { n ; A, B qq ď δ for all δ P r , s and all X , . . . , X n satisfying (2) , then q p δ ; n, A , B q ď n ˜ q p δ { n ; A, B q for all δ P r , s . The ﬁrst part of Theorem 1 provides a ﬁnite sample valid estimate of the quantile and the secondpart implies that it is sharper than the usual concentration inequalities such as Hoeﬀding, Bernstein,Bennett or Prokhorov inequalities. To see this fact, note that P p max ď t ď n S t ě n ˜ q p δ { n ; A, B qq ď δ for all δ P r , s is equivalent to the existence of a function H p u ; A, B q such that P p max ď k ď n S t ě nu q ď H n p u ; A, B q for all u . The classical concentration inequalities mentioned above are all of thisproduct form and hence weaker than our bound. See Figure 1a for an illustration and [5, 6, 8, 22]for further discussion. Computation of q p¨ ; n, A , B q is discussed in Section 9 of Bentkus et al. [8]and we provide a detailed discussion in Appendix C. In this respect, the following result describesthe function in (6) as a piecewise smooth function in case A “ . . . “ A n “ A . Proposition 1.

Set p AB “ A {p A ` B q and Z n “ ř ni “ R i where R i „ Bernoulli p p AB q . Then inf x ď u E rp ř ni “ G i ´ x q ` s{p u ´ x q ` “ P p np AB ` u p ´ p AB q{ B ; Z n q , for all u P R , where P p x ; Z n q “ for x ď np AB and for any x ě np AB and ď k ď n ´ , P p x ; Z n q “ $’’&’’% np AB p ´ p AB qp x ´ np AB q ` np AB p ´ p AB q , if np AB ă x ď v e , v k p k ´ e k x p k ´ xe k ` v k , if v k ´ ´p k ´ q e k ´ e k ´ ´p k ´ q p k ´ ă x ď v k ´ ke k e k ´ kp k , P p Z n “ n q “ p nAB , if x ě v n ´ ´p n ´ q e n ´ e n ´ ´p n ´ q p n ´ “ n. Here p k “ P p Z n ě k q , e k “ E r Z n t Z n ě k us , and v k “ E r Z n t Z n ě k us . Based on the description in Proposition 1 and (6), computation of q p¨ ; n, A, B q follows. InAppendix C.1, we also provide a similar piecewise description of q p¨ ; n, A, B q . Although Theorem 1 leads to a uniform in sample size conﬁdence sequence until size n , it is very widefor sample sizes much smaller than n . We now use the method of stitching to obtain a conﬁdencesequence that is valid for all sample sizes and scales reasonably well with respect to the sample size.We refer the reader to Mnih et al. [20, Section 3.2] and Howard et al. [15, Section 3.1] for details.To construct a uniform over n conﬁdence sequence, we require two user-chosen parameters:5

500 1000 1500 2000 2500 3000 n − − HoeﬀdingBernsteinBentkus S n = n X i =1 X i (a) Pointwise Bounds n − − A-Hoeﬀding E - B e r n s t e i n A-Bentkus S n = n X i =1 X i (b) Uniform Bounds Figure 1: Comparison of the concentration bounds when δ “ . . X i are centered i.i.d. Bernoulli p { q . We give the true standard deviation A i “ ? { and upper bound B “ { toall the methods. (a) The average failure frequencies across 300 trials and ď n ď are: Ho-eﬀding . ˘ . , Bernstein . ˘ . , Bentkus . ˘ . . (b) A-Bentkusis computed using η “ . , h p k q “ p k ` q . ζ p . q . For 3000 trials, there is zero failure for Adap-tive Hoeﬀding and Empirical Bernstein, but for A-Bentkus (8). In both cases, all the boundshave failure frequency bounded above by δ but the Bentkus’ bound is the least conservative. Thediﬀerences between the bounds continue to grow as n increases.1. a scalar η ą , which determines the geometric spacing.2. a function h : R ` Ñ R ` such that ř k “ { h p k q ď . Ideally, { h p k q adds up to .The following result gives a uniform over n tail inequality by splitting t n ě u into Ť k ě t r η k s ď n ď t η k ` u u and then applying (7) within t r η k s ď n ď t η k ` u u . See Appendix E for the proof. Theorem 2. If X , X , . . . are independent random variables satisfying (2) , then P ˆ D n ě S n ě q ˆ δh p k n q ; c n , A , B ˙˙ ď δ, for any δ P r , s , (8) where k n : “ inf t k ě r η k s ď n ď t η k ` u u and c n : “ t η k n ` u . The choice of the spacing parameter η and stitching function h p¨q determine the shape of theconﬁdence sequence and there is no universally optimal setting. The growth rate of h p¨q determineshow the budget of δ is spent over sample sizes; a quickly growing h p¨q such as k yield conﬁdenceintervals of essentially conﬁdence for larger sample size. The choice of η determines howconservative the bound is for t r η k s ď n ď t η k ` u u ; for η too large the bound will be conservative for n close to η k . Eq. (5) shows that bound is tightest at n “ t η k ` u in each epoch. Throughout thispaper, we use η “ . and h p k q “ ζ p . qp k ` q . where ζ p¨q is the Riemann zeta function.The same stitching method used in Theorem 2 can also be used with Hoeﬀding and Bernsteininequalities as done in [29] and [2], respectively. However, given that inequality (7) is sharper than6oeﬀding and Bernstein inequalities, our bound (8) is sharper for the same spacing parameter η and stitching function h p¨q ; see Figure 1b. Stitched bounds as in Theorem 2 are always piecewiseconstant but the Hoeﬀding and Bernstein versions from [29] and [20] are smooth because they areupper bounds of the piecewise constant boundaries (obtained using n ď c n ď ηn and k n ď log η n ` ).For practical use, smoothness is immaterial and the piecewise constant versions are sharper. Theorem 2 is not actionable in its form because it involves the unknown sequence of A , A , . . . . Inthe case where A “ A “ ¨ ¨ ¨ “ A , one needs to generate an upper bound of A (for a known B ) andobtain an actionable version of Theorem 2. Finite-sample over-estimation of A requires a two-sidedbound on the X i ’s; one-sided bounds on the random variables do not suﬃce. This actionable versionis a reﬁned version of empirical Bernstein inequality that is uniform over the sample sizes.We will assume that P p B ď X i ď B q “ @ i . Deﬁne s A p δ q “ p B ´ B q{ and for n ě , δ P r , s p A n : “ t n { u ´ ř t n { u i “ p X i ´ X i ´ q { , and s A n p δ q : “ b p A n ` g ,n p δ q ` g ,n p δ q , (9)where g ,n p δ q : “ p ? n q ´ a t c n { u p B ´ B q Φ ´ ` ´ δ {p e h p k n qq ˘ , for the distribution function Φ p¨q of a standard normal random variable. We will write s A n p δ ; B, B q , when needed, to stress thedependence of s A n p δ q on B, B . Lemma F.1 shows that s A n p δ q is a valid over-estimate of A uniformlyover n and yields the following actionable bound. We defer the proof to Appendix F. Theorem 3. If X , X , . . . are mean-zero independent random variables satisfying Var p X i q “ A and P p B ď X i ď B q “ for all i ě , then for any δ , δ P r , s , P ˆ D n ě S n ě q ˆ δ h p k n q ; c n , s A ˚ n p δ q , B ˙ or A ě s A ˚ n p δ , B, B q ˙ ď δ ` δ , and P ˆ D n ě S n ď ´ q ˆ δ h p k n q ; c n , s A ˚ n p δ q , ´ B ˙ or A ě s A ˚ n p δ , B, B q ˙ ď δ ` δ , where s A ˚ n p δ q : “ min ď s ď n s A n p δ , B, B q , and k n , c n are those deﬁned in Theorem 2. Theorem 3 is an analogue of the empirical Bernstein inequality [20, Eq. (5)]. The over-estimateof A in (9) can be improved by using non-analytic expressions, but we present the version above forsimplicity; see Appendix F for details on how to improve s A n p δ q in (9).Theorem 3 can be used to construct a conﬁdence sequence as follows. Suppose Y , Y , . . . areindependent random variables with mean µ , variance A , and satisfying P p L ď Y i ď U q “ . Then X i “ Y i ´ µ is a zero mean random variable where P p L ´ µ ď X i ď U ´ µ q “ , and Theorem 3is directly applicable with B “ ´ B “ U ´ L . An interesting observation is that we can reﬁne thevalues of B and B while we are updating the conﬁdence interval for µ . Suppose we have a validupper and lower bound when we observed n data points: ´ q low n ď ř ni “ Y n ´ nµ ď q up n , this implies µ low n : “ s Y n ´ n ´ q up n ď µ ď s Y n ` n ´ q low n “ : µ up n , where s Y n is the empirical mean of Y . We thus have a valid estimate r L ´ µ up n , U ´ µ low n s of thesupport of X , and when we observe Y n ` , we can use U ´ µ low n as B and L ´ µ up n as B . Importantly,as Theorem 3 provides a uniform concentration bound, these recursively deﬁned upper and lowerbounds hold simultaneously too. This leads to the following result, proved in Appendix G.7 heorem 4. If random variables Y , Y , . . . are independent with mean µ , variance A and satisfy P p L ď Y i ď U q “ . Deﬁne µ up0 : “ U , µ low0 : “ L , and for n ě µ up n “ s Y n ` n q ˆ δ h p k n q ; c n , s A ˚ n p δ , U, L q , ´p L ´ µ up n ´ q ˙ µ low n “ s Y n ´ n q ˆ δ h p k n q ; c n , s A ˚ n p δ , U, L q , U ´ µ low n ´ ˙ Let µ up ˚ n “ min ď i ď n µ up i and µ low ˚ n “ max ď i ď n µ low i . Then for any δ , δ P r , s P ´ µ P r µ low ˚ n , µ up ˚ n s and A ď s A ˚ n p δ , U, L q for all n ě ¯ ě ´ δ ´ δ . (10)Because µ up0 “ U, µ low0 “ L , the conﬁdence intervals r µ low ˚ n , µ up ˚ n s is always a subset of r L, U s . In this section, we conduct numerical experiments to demonstrate the eﬃcacy of our method.Section 4.1 examines the coverage probability and the width of the conﬁdence intervals constructedon a synthetic data from

Bernoulli p . q ; for other cases, see Appendix B. Section 4.2 applies theconﬁdence sequences to an adaptive stopping algorithm for p ε, δ q -mean estimation.We compare our adaptive Bentkus conﬁdence sequence (10) with the adaptive Hoeﬀding [29],empirical Bernstein [20], and two other versions of empirical Bernstein inequality from [15]: Eq. (24)and Theorem 4 with the gamma-exponential boundary from Proposition 9 of [15]. We denote thesemethods by

A-Bentkus , A-Hoeffding , E-Bernstein , HRMS-Bernstein , and

HRMS-Bernstein-GE respectively. For all the experiments, we use δ “ . . For A-Bentkus , we ﬁx the spacing parameter η “ . , the stitching function h p k q “ p k ` q . ζ p . q , and δ “ δ { , δ “ δ { . In this experiment, we generate i.i.d samples Y , Y , . . . , Y i.i.d „ Bernoulli p . q and compute theconﬁdence sequences for the true mean µ “ . . Figure 2a gives an illustration of the conﬁdencesequences obtained and shows the sharpness of A-Bentkus (10).For most of the cases ( n ě ), A-Bentkus dominates the other methods. For smaller samplesizes,

A-Bentkus also closely traces

A-Hoeffding and outperforms the others. This is because thevariance estimation is likely conservative and in which case our s A ˚ n ends up using the trivial upperbound p U ´ L q{ , which is essentially what A-Hoeffding is exploiting. In fact, we have provided thesame upper bound for all the other Bernstein-type methods too, and

A-Bentkus still outperforms.This phenomenon shows the intrinsic sharpness of our bound.We repeat the above experiment 1000 times and report the average miscoverage rate: ř r “ t µ R CI p r q n for some ď n ď u . where CI p r q n is the conﬁdence interval constructed after observing Y , . . . , Y n in the r -th replication.The results are . for A-Bentkus , . for HRMS-Bernstein-GE , and for the others. All the Code is available at https://github.com/enosair/bentkus_conf_seq . n − . . . . . . . Empirical Mean

E-BernsteinHRMS-BernsteinHRMS-Bernstein-GEA-HoeﬀdingA-Bentkus (a) One Replication n . . . . . . C I W i d t h E-BernsteinHRMS-BernsteinHRMS-Bernstein-GEA-HoeﬀdingA-Bentkus (b) Average Width over 1000 Replications

Figure 2: Comparison of the 95% conﬁdence sequences for the mean when Y i „ Bernoulli p . q .Except A-Hoeffding , all other methods estimate the variance.

A-Bentkus is the conﬁdence sequencein (10).

HRMS-Bernstein-GE involves a tuning parameter ρ which is chosen to optimize the sequenceat n “ as suggested in Figure 7 of [15]. (a) shows the conﬁdence sequences from a singlereplication. (b) shows the average widths of the conﬁdence sequences over 1000 replications. Theupper and lower bounds for all the other methods are cut at and for a fair comparison.methods control the miscoverage rate by δ “ . but are all conservative. Recall from (5) thatour failure probability bound can be conservative up to a constant of e { . Furthermore, from theproofs of Theorems 2 and 4, we get that for η “ . , h p k q “ p k ` q . ζ p . q , P ` µ R CI n p δ q for some ď n ď ˘ ď ř log η p q k “ δ { h p k q ď . δ. With δ “ . , the true bound on the failure probability is . . This explains why the averagemiscoverage rate is small.We also report the average width of the conﬁdence intervals in Figure 2b. All the values arebetween and as we cut the bounds from above and below for the other methods. As mentionedabove, when n is very small A-Bentkus closely traces

A-Hoeffding and both have smaller width.Yet the advantage of

A-Hoeffding disappears for n ě and A-Bentkus enjoys smaller conﬁdenceinterval width afterwards, and

HRMS-Bernstein-GE improves slightly on

A-Bentkus after observingvery large number of samples.

We apply our conﬁdence sequence to the adaptive stopping rule for estimating the mean of a boundedrandom variable Y . The goal is to obtain an estimator p µ such that the relative error | p µ { µ ´ | isbounded by ε , and terminate the data sampling once such criterion is satisﬁed.Given s Y the empirical mean and any conﬁdence sequence centered at s Y satisfying (1), Algo-rithm 1 yields a valid stopping time and an p ε, δ q -accurate estimator; see [20, Section 3.1] for aproof. Clearly, a tighter conﬁdence sequence will require less data sampling and yields a smallerstopping time. We follow the setup in Mnih et al. [20]. The data samples are i.i.d generatedas Y i “ m ´ ř mj “ U ij , where U i , . . . , U im are i.i.d uniformly distributed in r , s . This impliesthat µ “ E r Y i s “ { and A “ Var p Y i q “ {p m q . Because Algorithm 1 requires symmetric9 =1 m=10 m=20 m=100 m=100005001000150020002500 S t o pp i n g T i m e A-HoeffdingE-BernsteinHRMS-BernsteinHRMS-Bernstein-GEA-Bentkus

Figure 3: Comparison of conﬁdence sequences for an p ε, δ q -estimator. Here ε “ . and δ “ . . Algorithm 1:

Adaptive Stopping Algorithm Initialization: n Ð , LB Ð , UB Ð 8 while p ` ε q LB ă p ´ ε q UB do n Ð n + 1 Sample Y n and compute the n -th interval in the sequence: r s Y n ´ Q n , s Y n ` Q n s Ð ConfSeq p n, δ q LB Ð max t LB , | s Y n | ´ Q n u UB Ð min t UB , | s Y n | ` Q n u return stopping time N “ n and estimator p µ “ p { q sgn p s Y N qrp ` ε q LB ` p ´ ε q UB s .intervals, we shall symmetrize the intervals returned by A-Bentkus by taking the largest devia-tion. We consider 5 cases: m “ , , , , and report the average stopping time (i.e. thenumber of samples required to achieve p ε, δ q “ p . , . q accuracy) based on trials in Fig-ure 3. HRMS-Bernstein-GE involves a tuning parameter ρ , chosen here to optimize the conﬁdencesequence at n “ (best out of , , , ). As m increases, the variance of Y i is decreasing.As expected, A-Hoeffding does not exploit the variance of random variables so the stopping timesremains roughly the same. For others, the stopping time is decreasing. It is clear that on average,

A-Bentkus is the best for all the values of m and the ratios of our stopping time to the second bestare . , . , . , . , . . In this paper, we proposed a conﬁdence sequence for bounded random variables and examined itseﬃcacy in both synthetic examples and adaptive stopping algorithms. Although our results are pre-sented for the mean, they can be applied to testing equality of distributions, testing independence [4,10ections 3 and 5], and identifying the best arm in the multi-armed bandits problem [29, 14].We assumed that X i ’s are independent and bounded and generalizations for dependence and sub-Gaussian cases are of interest. Regarding independence, Theorem 2.1 of [22] shows that Theorem 1holds even if X i ’s form a supermartingale diﬀerence sequence, i.e., assumption (2) is replaced by E r X i | X , . . . , X i ´ s ď , P p E r X i | X , . . . , X i ´ s ď A i q “ , and P p X i ą B q “ . Theorem 2 follows readily, but Theorem 3 requires further restrictions that allow estimation of A i . Regarding the boundedness assumption, which maybe restrictive for applications in statistics,ﬁnance and economics, one can replace assumption (2) by E r X i s “ , Var p X i q ď A i , and P p X i ą x q ď s F p x q for all x P R , (11)where s F p¨q is a survival function on r , , i.e., s F p¨q is non-increasing and s F p q “ and s F p8q “ .For example, s F p x q “ {t ` p x { K q α ` u , or s F p x q “ exp p´p x { K q α ` q , for some K ą ; α “ in thesecond example is sub-Gaussianity. Similar to (5), there exist random variables η i “ η i p A i , s F q satisfying (11) such that sup X ,...,X n „ (11) E »–˜ n ÿ i “ X i ´ t ¸ ` ﬁﬂ “ E »–˜ n ÿ i “ η i ´ t ¸ ` ﬁﬂ , for all n ě and t P R . Here the superemum is taken over all distributions satisfying (11).Hence, Theorem 1 can be generalized, which in turn leads to generalizations of Theorems 2 and 3.The details on the construction of η i and the corresponding conﬁdence sequence will be discussedelsewhere. AppendixA Competing Concentration Bounds

Theorem 5 (Hoeﬀding; Theorem 3.1.2 of [12]) . If X , . . . , X n are independent mean-zero randomvariables satisfying P p B ď X i ď B q “ , then P ˜ S n ě d n p B ´ B q log ˆ δ ˙ ¸ ď δ, @ δ P r , s . (There is a generalization of Hoeﬀding’s inequality that relaxes the boundedness assumption bya sub-Gaussian assumption; see [29] for details.) Theorem 6 (Adaptive Hoeﬀding; Corollary 1 of [29]) . If X , . . . , X n are independent mean-zerorandom variables satisfying P p B ď X i ď B q “ , then P ˜ D n ě S n ě p B ´ B q c . n log p log . n ` q ` log p { δ q . n ¸ ď δ, @ δ P r , s . heorem 7 (Bernstein; Theorem 3.1.7 of [12]) . If X , . . . , X n , . . . are independent random variablessatisfying (2) , then P ¨˝ S n ě gffe n ÿ i “ A i log ˆ δ ˙ ` B log ˆ δ ˙ ` B log ˆ δ ˙ ˛‚ ď δ, @ δ P r , s . Theorem 8 (Empirical Bernstein; Eq. (5) of [20]) . If X , X , . . . are independent mean zero randomvariables satisfying (2) with A “ A “ . . . “ A , then P ˆ D n ě S n ě b nη p A n log p h p k n q{p δ qq ` Bη log p h p k n q{p δ qq ˙ ď δ, where ˜ A n is the sample variance and k n is the constant deﬁned in Theorem 2. B More Simulations

B.1 Hyperparameters of Stitching

There are two hyperparameters of our stitching methods: (1) the spacing parameter η ą and (2)the power parameter c ą for the stitching function h c p k q “ ζ p c qp k ` q c where ζ p¨q is the Riemannzeta function. n η = 1 . η = 1 . Figure 4: The upper bound of S n obtained by adaptive Bentkus bound in Theorem 2 for diﬀerentvalues of η . Both the variance A “ ? { and the upper bound B “ { is known.Figure 4 illustrates that the choice of η determines how the budget δ is distributed across diﬀerentsample sizes.Figure 5 shows both the stitching function h c p¨q and corresponding upper bound A-Bentkus obtains. For a ﬁxed sample size n , the bigger h c p k n q is, the smaller budget δ { h c p k n q it obtains andhence it needs a larger upper bound. Hence, the faster h c p¨q grows, the more conservative upperbound (and corresponding, wider conﬁdence interval) one will get. B.2 Conﬁdence Sequence for

Bernoulli p . q In this section, we present a comparison of our conﬁdence sequence with

A-Hoeffding , E-Bernstein , HRMS-Bernstein , and

HRMS-Bernstein-GE on synthetic data from

Bernoulli p . q . In this case,12 k h c ( k ) = ζ ( c )( k + ) c c = 1.01c = 1.1c = 3.0 n c = 1.01c = 1.1c = 3.0 Figure 5:

Left:

The stitching function h c p¨q for diﬀerent values of c . Right:

The upper bound of S n obtained by A-Bentkus with diﬀerent values of c . Both the variance A “ { and the upperbound B “ { is known. Y , Y , . . . „ Bernoulli p . q and the variance is { . Hence in this case Hoeﬀding’s inequality issharp and nothing can be gained by variance exploitation. We note this very fact in our experiment,where our method behaves as well as A-Hoeffding for moderate to large sample sizes. Figures 6aand 6b show the comparison of conﬁdence sequences in one replication and comparison of averagewidth over 1000 replications. As in the case of

Bernoulli p . q (Section 4.1), for small sample sizes, A-Hoeffding and

A-Bentkus behave very closely and are better than all other methods but for n moderately large, the sharpness of A-Bentkus clearly pays oﬀ by outperforming

A-Hoeffding andall other methods.

C Computation of q p δ ; n, A , B q In this section we provide some details on the computation of q p δ ; n, A , B q based on [6] and [23].We will restrict to the case where A “ A “ ¨ ¨ ¨ “ A n “ ¨ ¨ ¨ “ A .For any random variable η , deﬁne P p u ; η q : “ inf x ď u E rp η ´ x q ` sp u ´ x q ` . For any

A, B , set p AB “ A {p A ` B q .Deﬁne Bernoulli random variables R , R , . . . , R n as P p R i “ q “ p AB “ ´ P p R i “ q . Set Z n “ ř ni “ R i . Z n is a binomial random variables with n trials and success probability p AB : Z n „ Bi p n, p AB q . For ď k ď n , deﬁne p k : “ P p Z n ě k q ,e k : “ E r Z n t Z n ě k us ,v k : “ E “ Z n t Z n ě k u ‰ . n . . . . . . Empirical Mean

Figure 6: Comparison of the 95% conﬁdence sequences for the mean when Y i „ Bernoulli p . q .Except A-Hoeffding , all other methods estimate the variance.

A-Bentkus is the conﬁdence sequencein (10).

HRMS-Bernstein-GE involves a tuning parameter ρ which is chosen to optimize the boundaryat n “ . (a) shows the conﬁdence sequences from a single replication. (b) shows the averagewidths of the conﬁdence sequences over 1000 replications. The upper and lower bounds for allthe other methods are cut at and for a fair comparison. The failure frequency is . for HRMS-Bernstein-GE and for the others. Proposition 2.

For all u P R , P ˜ u ; n ÿ i “ G i ¸ “ P ˆ Bu ` nA A ` B ; Z n ˙ “ P ˆ Bu ` nA A ` B ; Z n ˙ . Furthermore, for any x ě and ď k ď n ´ , P p x ; Z n q “ $’’’’&’’’’% , if x ď np AB , np AB p ´ p AB qp x ´ np AB q ` np AB p ´ p AB q , if np AB ă x ď v e , v k p k ´ e k x p k ´ xe k ` v k , if v k ´ ´p k ´ q e k ´ e k ´ ´p k ´ q p k ´ ă x ď v k ´ ke k e k ´ kp k , P p Z n “ n q “ p nAB , if x ě v n ´ ´p n ´ q e n ´ e n ´ ´p n ´ q p n ´ “ n. Formally, we can set P p x ; Z n q “ for all x ą n because P p Z n ą n q “ .Proof. The result is mostly an implication of Proposition 3.2 of [23]. It is clear that M n : “ n ÿ i “ G i d “ A ` B B ˜ n ÿ i “ R i ´ nA A ` B ¸ , where R i „ Bernoulli p A {p A ` B qq , that is, P p R i “ q “ p AB “ ´ P p R i “ q . Proposition 3.2(vi) of [23] implies that P p u ; M n q : “ P ˆ Bu ` nA A ` B ; Z n ˙ . P p x ; Z n q for all x P R . The support of Z n is given by supp p Z n q “ t , , , . . . , n u . Proposition 3.2(iv) of [23] (with α “ ) implies that P p x ; Z n q “ , if x ď np AB , P p Z n “ n q , if x ě n. Furthermore, x ÞÑ P p x ; ř ni “ R i q is strictly decreasing on p np AB , n q . Deﬁne function F p h q : R Ñ R such that F p h q : “ E r Z n p Z n ´ h q ` s E p Z n ´ h q ` . (12)For any np AB ă x ă n , let h x be the unique solution of F p h q “ x (13)(Uniqueness here is established by Proposition 3.2(ii) of [23].) Then by Proposition 3.2(iii) of [23], P p x ; Z n q “ E rp Z n ´ h x q ` sp x ´ h x q ` “ E r Z n p Z n ´ h x q ` s ´ h x E rp Z n ´ h x q ` sp x ´ h x q ` “ p x ´ h x q E rp Z n ´ h x q ` sp x ´ h x q ` “ E rp Z n ´ h x q ` sp x ´ h x q ` . (14)This holds for all nA {p A ` B q ă x ă n . We will now discuss solving (13).Proposition 3.2(i) of [23] implies that h ÞÑ F p h q is continuous and increasing.If h ď , F p h q “ E r Z n p Z n ´ h qs E r Z n ´ h s “ np AB p ´ p AB q ` n p AB ´ hnp AB np AB ´ h “ np AB ` np p ´ p AB q np ´ h . This is strictly increasing on p´8 , s , and F p q “ np AB ` p ´ p AB q . We get that for any np AB ă x ď np AB ` p ´ p AB q , F p h q “ x ô h x “ np AB ´ np AB p ´ p AB q x ´ np AB . This further implies (from (14)) that P p x ; Z n q “ E r Z n ´ h x s x ´ h x “ np AB p ´ p AB qp x ´ np AB q ` np AB p ´ p AB q , for np AB ď x ď np AB ` p ´ p AB q . ă h ă n ´ , set k “ r h s , in other words, k ´ ă h ď k . Since t Z n ě h u ô t Z n ě k u , hence E r Z n p Z n ´ h q ` s “ E r Z n t Z n ě h us ´ h E r Z n t Z n ě h us“ E r Z n t Z n ě k us ´ h E r Z n t Z n ě k us , E rp Z n ´ h q ` s “ E r Z n t Z n ě k us ´ h P p Z n ě k q . Therefore, F p h q “ E r Z n t Z n ě k us ´ h E r Z n t Z n ě k us E r Z n t Z n ě k us ´ h P p Z n ě k q“ v k ´ he k e k ´ hp k . It is not diﬃcult to verify that F p¨q is strictly increasing in p k ´ , k s and hence h x “ v k ´ xe k e k ´ xp k , if F p k ´ q ă x ď F p k q . Substituting this h x in (14) yields the value of P p x ; Z n q , that is, P p x ; Z n q “ ˆ x ´ v k ´ xe k e k ´ xp k ˙ ´ ˆ e k ´ v k ´ xe k e k ´ xp k p k ˙ “ ˆ e k ´ xp k xe k ´ x p k ´ v k ˙ ˆ e k ´ v k p k e k ´ xp k ˙ “ e k ´ v k p k xe k ´ x p k ´ v k , whenever F p k ´ q ă x ď F p k q , where F p k q “ v k ´ ke k e k ´ kp k , ď k ď n ´ . Hence for ď k ď n ´ , P p x ; Z n q “ v k p k ´ e k x p k ´ xe k ` v k , whenever v k ´ ´ p k ´ q e k ´ e k ´ ´ p k ´ q p k ´ ă x ď v k ´ ke k e k ´ kp k . Finally, we prove that F p¨q is a constant on r n ´ , n s . It is clear that F p n ´ q “ v n ´ ´ p n ´ q e n ´ e n ´ ´ p n ´ q p n ´ “ E r Z n t Z n ě n ´ us ´ p n ´ q E r Z n t Z n ě n ´ us E r Z n t Z n ě n ´ us ´ p n ´ q P p Z n ě n ´ q“ p n ´ n p n ´ qq P p Z n “ n qp n ´ p n ´ qq P p Z n “ n q “ n. Further if h ą n ´ , then p Z n ´ h q ` ą if and only if Z n “ h and hence from (12) F p h q “ E r Z n p Z n ´ h q ` s E rp Z n ´ h q ` s “ n p n ´ h q P p Z n “ n qp n ´ h q P p Z n “ n q “ n. Therefore, the function F p h q is constant on r n ´ , n s .16 − h . . . . . . . F ( h ) n = 3 , A = 0 . , B = 1 − h . . . . . . P ( x ; Z n ) n = 3 , A = 0 . , B = 1 − h − − − − − − P ( x ; Z n ) np AB v /e v − e e − p v − e e − p n = 3 , A = 0 . , B = 1 Figure 7: Examples functions F p h q and P p x ; Z n q when n “ , A “ . and B “ . . We plot P p x ; Z n q in both linear (second plot) and log (third plot) scales on the y-axis.For h ą n , we set F p h q “ n since P p Z n ą h q “ . To put all the pieces together, we have F p h q “ $’’’&’’’% np AB ` np p ´ p AB q np ´ h if h ă“ ,v r h s ´ he r h s e r h s ´ hp r h s if ă h ď n ´ ,n if h ą n ´ . Consequently, for np AB ă x ă n , h x “ F ´ p x q “ np AB ´ np AB p ´ p AB q x ´ np AB , if np AB ă x ď np AB ` p ´ p AB q , v k ´ xe k e k ´ xp k , if F p k ´ q ă x ď F p k q , ď k ď n ´ . C.1 Computation of the Quantile

Recall p AB “ A {p A ` B q , Z n “ ř ni “ R i , and ř ni “ G i is identically distributed as B ´ p A ` B qp Z n ´ np AB q . We will compute x δ such that P p x δ ; Z n q “ δ. (15)This implies that P ˜ p A ` B q x δ ´ nA B ; n ÿ i “ G i ¸ “ δ, or equivalently, q p δ ; n, A, B q “ p A ` B q x δ ´ nA B .

Hence we concentrate on solving (15). Recall that for any x ě and ď k ď n ´ , P p x ; Z n q “ $’’’’&’’’’% , if x ď np AB , np AB p ´ p AB qp x ´ np AB q ` np AB p ´ p AB q , if np AB ă x ď v e “ np AB ` p ´ p AB q , v k p k ´ e k x p k ´ xe k ` v k , if v k ´ ´p k ´ q e k ´ e k ´ ´p k ´ q p k ´ ă x ď v k ´ ke k e k ´ kp k , P p Z n “ n q “ p nAB , if x ě v n ´ ´p n ´ q e n ´ e n ´ ´p n ´ q p n ´ “ n. (16)17he function P p¨ ; Z n q is a non-increasing function and hence if δ ď p nAB , then we get x δ “ n ` ´ ;this corresponds to the last case in (16). If P p v { e ; Z n q ď δ ď , then x δ “ np AB ` c p ´ δ q np AB p ´ p AB q δ ; this corresponds to the ﬁrst and second case in (16); note that P p v { e ; Z n q “ np AB p ´ p AB q{rp ´ p AB q ` np AB p ´ p AB qs . For the remaining cases, note that if there exists a ď k ď n ´ suchthat P ˆ v k ´ ke k e k ´ kp k ; Z n ˙ ď δ ď P ˆ v k ´ ´ p k ´ q e k ´ e k ´ ´ p k ´ q p k ´ ; Z n ˙ , then v k ´ ´ p k ´ q e k ´ e k ´ ´ p k ´ q p k ´ ď x δ ď v k ´ ke k e k ´ kp k , (17)and using the closed form expression of P p¨ ; Z n q on this interval, we get x δ “ e k ` b e k ´ p k p v k ´ p v k p k ´ e k q{ δ q p k . (18)Using these calculations, one can ﬁnd k looping over ď k ď n ´ such that (17) holds. Thisapproach has a complexity of O p n q , assuming the availability of p k , e k , and v k .We now describe an approach that reduces the complexity by ﬁnding quick-to-compute upperand lower bounds on x δ . Lemmas 1.1 and 3.1 of [8] show that P p Z n ě x q ď P p x ; Z n q ď e P ˝ p Z n ě x q , (19)where P ˝ p Z n ě x q represents the log-linear interpolation of P p Z n ě x q , that is, for x P t , , . . . , n u P ˝ p Z n ě x q “ P p Z n ě x q , (20)and for x P p k ´ , k q such that x “ p ´ λ qp k ´ q ` λk , P ˝ p Z n ě x q “ p P p Z n ě k ´ qq ´ λ p P p Z n ě k qq λ . Equation (2) of [5] further shows that P ˝ p Z n ě x q ď p ´ λ q P p Z n ě k ´ q ` λ P p Z n ě k q . (21)Hence, to ﬁnd x “ x δ satisfying P p x ; Z n q “ δ , ﬁnd k P t , , . . . , n u such that P p Z n ě k q ě δ. This implies (from (19)) that P p k ; Z n q ě δ and because x ÞÑ P p x ; Z n q is decreasing, x δ ě k .Further, ﬁnd k P t , , . . . , n u such that P p Z n ě k q ď δe . P o p Z n ě k q “ P p Z n ě k q ď δ { e . Hence using (20), we get P p k ; Z n q ď δ which implies that x δ ď k . Summarizing this discussion, we get that x δ satisfying P p x δ ; Z n q “ δ also satisﬁes k ď x δ ď k , (22)where P p Z n ě k q ě δ and P p Z n ě k q ď δe . The bounds in (22) are not very useful because the closed form experssion (18) of x δ requires ﬁndingupper and lower bounds for x δ in terms of p v k ´ ke k q{p e k ´ kp k q ’s.Now we note that v k ě ke k ě k p k ñ v k ´ k e k e k ´ k p k ě k . This combined with (22) proves that k ď x δ ď k ď v k ´ k e k e k ´ k p k . The lower bound here is still not in terms of the ratios p v k ´ ke k q{p e k ´ kp k q . But given the upperbound, we can search for k ď k (by running a loop from k to 0) such that v k ´ ´ p k ´ q e k ´ e k ´ ´ p k ´ q p k ´ ď x δ ď v k ´ ke k e k ´ kp k . (23)Another approach is to make use of the lower bound in (22). Because k ď p v k ´ k e k q{p e k ´ k p k q ,there are two possibilities:1. k ď x δ ď p v k ´ k e k q{p e k ´ k p k q ;2. k ď p v k ´ k e k q{p e k ´ k p k q ă x δ .In the ﬁrst case, it suﬃces to search for k ď k such that (23). In the second case, we can searchover k ` ď k ď k as before. D Proof of Theorem 1

It is clear that p S t , F t q nt “ with F t “ σ t X , . . . , X t u is a martingale because E r S t | F t ´ s “ S t ´ ` E r X t s “ S t ´ . Consider now the process D t : “ p S t ´ x q ` for a ﬁxed x ą . The function f : y ÞÑ p y ´ x q ` is continuous and satisﬁes f p y q “ , if y ď x, p y ´ x q , if y ą x, and f p y q “ , if y ď x, , if y ą x. Therefore, f p¨q is a convex function. This implies by Jensen’s inequality that E r D t | F t ´ s “ E r f p S t q| F t ´ s ě f p S t ´ q . p D t , F q nt “ is a submartingale. Doob’s inequality now implies that P ˆ max ď t ď n S t ě u ˙ p a q “ P ˆ max ď t ď n p S t ´ x q ` ě p u ´ x q ` ˙ “ P ˆ max ď t ď n D t ě p u ´ x q ` ˙ p b q ď E r D n sp u ´ x q ` ď E rp S n ´ x q ` sp u ´ x q ` . Here equality (a) holds for every x ď u and inequality (b) holds because of Doob’s inequality.Because x ď u is arbitrary, we get P ˆ max ď t ď n S t ě u ˙ ď inf x ď u E rp S n ´ x q ` sp u ´ x q ` , and condition (2) along with Theorem 2.1 of [8] (or [22]) imply that P ˆ max ď t ď n S t ě u ˙ ď inf x ď u E rp ř ni “ G i ´ x q ` sp u ´ x q ` . The deﬁnition (6) of q p δ ; n, A , B q readily implies P ˆ max ď t ď n S t ě q p δ ; n, A , B q ˙ ď δ. This completes the proof of (7). We now prove the sharpness. Note that the condition P ˆ max ď t ď n S t ě n ˜ q p δ { n ; A, B q ˙ ď δ for all δ P r , s , is equivalent to the existence of a function x ÞÑ H p x ; A, B q such that P ˆ max ď t ď n S t ě nu ˙ ď H n p u ; A, B q , for all u. (The function δ ÞÑ ˜ q p δ { n ; A, B q is the inverse of u ÞÑ H n p u ; A, B q .) In particular, this implies that P p S n ě nu q ď H n p u ; A, B q for all u. Now, Lemma 4.7 of [6] (also see Eq. (2.8) of [13]) implies that H n p u ; A, B q ě ` BuA ˙ ´p A ` Bu q{p A ` B q ´ ´ uB ¯ ´p B ´ Bu q{p B ` A q + n “ inf h ě e ´ nhu E ” e h ř ni “ G i ı , where G , . . . , G n are independent random variables constructed through (3). Proposition 3.5 of [23]implies that inf h ě e ´ nhu E ” e h ř ni “ G i ı ě inf x ď nu E rp ř ni “ G i ´ x q ` sp nu ´ x q ` . Summarizing the inequalities, we conclude P p S n ě nu q ď inf x ď nu E rp ř ni “ G i ´ x q ` sp nu ´ x q ` ď inf h ě E ” e h ř ni “ G i ´ h p nu q ı ď H n p u ; A, B q @ u. This proves that q p δ ; n, A, B q ď n ˜ q p δ { n ; A, B q for any valid ˜ q p¨ ; A, B q .20 Proof of Theorem 2

The proof is based on (7) and a union bound. It is clear that P ˜ D t ě t ÿ i “ X i ě q p δ { h p k t q ; c t , A , B q ¸ “ P ˜ ď k “ D r η k s ď t ď t η k ` u : t ÿ i “ X i ě q p δ { h p k t q ; c t , A , B q +¸ “ P ˜ ď k “ D r η k s ď t ď t η k ` u : t ÿ i “ X i ě q p δ { h p k q ; t η k ` u , A , B q +¸ ď ÿ k “ P ˜ max r η k s ď t ď t η k ` u t ÿ i “ X i ě q p δ { h p k q ; t η k ` u , A , B q ¸ ď ÿ k “ δh p k q ď δ. F Proof of Theorem 3

Theorem 2 implies that P ˆ D n ě S n ě q ˆ δ h p k n q ; c n , A, B ˙˙ ď δ . Lemma F.1 (below) proves P ` D n ě A ě s A n p δ q ˘ ď δ . In particular this implies that P ˆ D n ě A ě min ď s ď n s A s p δ q ˙ ď δ . Combining the inequalities above with a union bound (and Lemma H.2) proves the result.

Lemma F.1.

Under the assumptions of Theorem 3, we have for any δ P r , s , P ˜ D t ě V t t { u ´ t t { u A ď ´ a t c t { u p B ´ B q A ? ´ ˆ ´ δe h p k t q ˙¸ ď δ, (24) where W i “ p X i ´ X i ´ q { and V t : “ ř t t { u i “ W i . Proof.

Fix x ě . Note that for any u ě ´ x , P ˆ max ď t ď n t V t ´ tA u ď ´ x ˙ “ P ˆ max ď t ď n p u ´ t V t ´ tA uq ` ě p u ` x q ` ˙ , ď E rp u ´ t V n ´ nA uq ` sp u ` x q ` . tp u ´ t V t ´ tA uu t ě is a submartingale. There-fore, P ˆ max ď t ď n t V t ´ tA u ď ´ x ˙ ď inf u ě´ x E rp u ´ t V n ´ nA uq ` sp u ` x q ` “ inf u ě´ x E rp u ` nA ´ V n q ` sp u ` x q ` “ inf u ě nA ´ x E rp u ´ V n q ` sp u ´ nA ` x q . Corollary 2.7 (Eq. (2.24)) of [24] implies that inf u ě nA ´ x E rp u ´ V n q ` sp u ´ nA ` x q ď P p E ,n ` Z a E ,n ; nA ´ x q “ P p E ,n ` Z a E ,n ; E ,n ´ x q , (25)where E j,t “ ř t t { u i “ E r W ji s for j “ , and Z stands for a standard normal distribution. Inequal-ity (25) is the best inequality to use and there is a more precise version; see Theorem 2.4(I) andCorollary 2.7 of [24]. With the best inequality, the following steps will lead to a reﬁned upper boundon A ; we will not pursue this direction here.It now follows from [7] that P p E ,n ` Z ? E ,n ; E ,n ´ x q ď e P ˜ Z ď ´ x a E ,n ¸ . Because X i P r

B, B s with probability 1, W i ď p B ´ B q { and hence E ,n “ E n ÿ i “ E r W i s ď p B ´ B q n ÿ i “ E r W i s “ p B ´ B q E ,n { “ n p B ´ B q A { . This implies that P ˆ max ď t ď n t V t ´ tA u ď ´ x ˙ ď e P ˆ Z ď ´ ? x ? n p B ´ B q A ˙ . Equating the right hand side to δ yields P ˆ max ď t ď n t V t ´ tA u ď ´ ? n p B ´ B q A ? ´ ˆ ´ δe ˙˙ ď δ. (26)22ecause of this maximal inequality, we can apply stitching and get (24). Note that P ˜ D t ě V t t { u ´ t t { u A ď ´ a t c t { u p B ´ B q A ? ´ ˆ ´ δe h p k t q ˙¸ “ P ˜ D t ě V t t { u ´ t t { u A ď ´ a t c t { u p B ´ B q A ? ´ ˆ ´ δe h p k t q ˙¸ “ P ˜ ď k “ D r η k s ď t ď t η k ` u : V t t { u ´ t t { u A ď ´ a t c t { u p B ´ B q A ? ´ ˆ ´ δe h p k t q ˙+¸ ď ÿ k “ P ˜ D r η k s ď t ď t η k ` u : V t t { u ´ t t { u A ď ´ a t c t { u p B ´ B q A ? ´ ˆ ´ δe h p k t q ˙¸ ď ÿ k “ δh p k q ď δ, where the last inequality follows from (26) applied to t ď t ď t c t { u u .Inequality (24) yields P ˜ tA ´ a t c t { u p B ´ B q A ? ´ ˆ ´ δe h p k t q ˙ ´ V t ď @ t ě ¸ ě ´ δ. Inequality tA ´ a t c t { u p B ´ B q A ? ´ ˆ ´ δe h p k t q ˙ ´ V t ď holds for A ą if and only if A ď g ,t ` b g ,t ` g ,t , where g ,t “ a t c t { u p B ´ B q A ? t Φ ´ ˆ ´ δe h p k t q ˙ and g ,t “ V t t { u t t { u . Hence a rewriting of (24) is P ´ A ě g ,t ` b g ,t ` g ,t @ t ě ¯ ě ´ δ. It is clear that g ,t “ O p {? t q and E r V t t { u { t t { u s “ A and hence the upper bounds above growslike A ` O p a log p h p k t qq{ t q . G Proof of Theorem 4

The assumption P p L ď X i ď U q “ implies that P p L ´ µ ď X i ´ µ ď U ´ µ q “ and henceapplying Theorem 2 with X i ´ µ and its upper bound U ´ µ yields P ˜ D n ě n ÿ i “ p X i ´ µ q ě q ˆ δ { h p k n q ; c n , A, U ´ µ ˙¸ ď δ . (27)23imilarly applying Theorem 2 with µ ´ X i and its upper bound µ ´ L yields P ˜ D n ě n ÿ i “ p µ ´ X i q ě q ˆ δ { h p k n q ; c n , A, µ ´ L ˙¸ ď δ . (28)Finally Lemma F.1 implies that P ` D n ě A ě s A ˚ n p δ ; U, L q ˘ ď δ . (29)Now combining inequalities (27), (28), and (29) yields with probability ě ´ δ ´ δ , for all n ě ´ n q ˆ δ { h p k n q ; c n , A, µ ´ L ˙ ď S n n ´ µ ď n q ˆ δ { h p k n q ; c n , A, U ´ µ ˙ , and A ď s A ˚ n p δ q . On this event, we get by using U ´ µ ď U ´ L and µ ´ L ď U ´ L , µ low0 ď µ ď µ up0 , and then recursively using µ low n ´ ď µ ď µ up n ´ , ´ n q ˆ δ { h p k n q ; c n , s A ˚ n p δ q , µ up n ´ ´ L ˙ ď S n n ´ µ ď n q ˆ δ { h p k n q ; c n , s A ˚ n p δ q , U ´ µ low n ´ ˙ . This proves the result.

H Auxiliary Results

Deﬁne M t , t ě as M t : “ ř ti “ G i , with P ` G i “ ´ A i { B ˘ “ B A i ` B and P p G i “ B q “ A i A i ` B . Lemma H.1.

For any t ě and x P R , the map p A , . . . , A t q ÞÑ E rp M t ´ x q ` s is non-decreasing.Proof. Suppose we prove that for every y P R , A ÞÑ E rp G ´ y q ` s is non-decreasing , (30)then by conditioning on G , . . . , G t and taking y “ x ` G ` ¨ ¨ ¨ ` G t , we get for A ď A E rp G p A q ´ y q ` s ď E rp G p A q ´ y q ` s . Now taking expectations on both sides with respect to G , . . . , G t implies non-decreasingness of A ÞÑ E rp M t ´ x q ` s . This implies the result.To prove (30), E rp G ´ y q ` s “ B A ` B ˆ ´ A B ´ y ˙ ` ` A A ` B p B ´ y q ` . A Ñ A { B is increasing, it suﬃces to show A { B ÞÑ E rp G ´ y q ` s is non-decreasingwith respect to A { B . Set p “ A { B and deﬁne g p p q “ ` p p´ Bp ´ y q ` ` p ` p p B ´ y q ` . Diﬀerentiating with respect to p yields B g p p qB p “ ´ p´ Bp ´ y q ` p ` p q ´ B p´ Bp ´ y q ` ` p ` p B ´ y q ` p ` p q “ ´p´ Bp ´ y q ` ´ B p ` p qp´ Bp ´ y q ` ` p B ´ y q ` p ` p q . If y ď ´ Bp then y ` Bp ă and B ´ y ą B p ` p q ą and hence B g p p qB p “ ´p Bp ` y q ` B p ` p qp Bp ` y q ` p B ´ y q p ` p q “ B ` B p ` B p p ` p q ą . If ´ Bp ă y ă B then y ` Bp ą and B ´ y ą and hence B g p p qB p “ p B ´ y q p ` p q ą . If y ą B , then B g p p q{B p “ . Hence B g p p q{B p ě for all p . This proves (30).Recall the deﬁnition of q p δ ; t, A , B q from (6). In the case of equal variances, that is, A “ A “ . . . “ A , we write A, q p δ ; t, A, B q for A , q p δ ; t, A , B q , respectively. We now prove that A ÞÑ q p δ ; t , A, B q is an non-decreasing function. Lemma H.2.

For any t ě , the function A ÞÑ q p δ ; t, A, B q is an non-decreasing function.Proof. Lemma H.1 proves that A ÞÑ E rp M t ´ x q ` s is non-decreasing. This implies that I p A ; u q isalso non-decreasing in A , where I p A ; u q : “ inf x ď u E rp M t ´ x q ` sp u ´ x q ` . Lemma 3.1 of [8] proves that I p A ; u q is also non-increasing in u . Fix A ď A . From the deﬁnitionof δ , I p A , q p δ ; t, A , B qq “ δ and I p A , q p δ ; t, A , B qq “ δ. Because I p A ; u q is non-decreasing in A , I p A ; q p δ ; t, A , B qq “ δ “ I p A ; q p δ ; t, A , B qq ď I p A ; q p δ ; t, A , B qq Hence I p A ; q p δ ; t, A , B qq ď I p A ; q p δ ; t, A , B qq and because I p A ; u q is non-increasing in u , weconclude that q p δ ; t, A , B q ď q p δ ; t, A , B q . This proves the result modulo the condition A ÞÑ E rp M t ´ x q ` s is non-decreasing. Lemma H.3.

For any δ P r , s , q p δ ; t, AB, B q “ Bq p δ ; t, A, B q . roof. Recall that q p δ ; t, AB, B q is deﬁned as the solution of inf x ď u E rp M t ´ x q ` sp u ´ x q ` “ δ, where M t is deﬁned as M t “ ř ti “ G i with P ` G i “ ´p A B q{ B ˘ “ B A B ` B “ B A ` B and , P ` G i “ B ˘ “ A B A B ` B “ A A ` B . This implies that G i d “ BG i and hence M t d “ BM t . Therefore, E rp M t ´ x q ` s “ E rp BM t ´ x q ` s “ B E rp M t ´ x { B q ` s , and inf x ď u E rp M t ´ x q ` sp u ´ x q ` “ B inf x ď u E rp M t ´ x { B q ` s B p u { B ´ x { B q ` “ inf x ď u { B E rp M t ´ x q ` sp u { B ´ x q ` . The right hand side above equals δ , when u “ Bq p δ ; t, A, B q because the deﬁnition of q p δ ; t, A, B q implies that inf x ď q p δ ; t,A,B q E rp M t ´ x q ` sp q p δ ; t, A, B q ´ x q ` “ δ. This completes the proof.

I Alternative Empirical Bentkus Conﬁdence Sequences with Esti-mated Variance

In Section 3.3, we presented one actionable version of Theorem 2, where we used an analytical upperbound on the variance A . In this section, we present an alternative empirical Bentkus conﬁdencesequence that requires numerical computation. In our initial experiments, we found solving for theupper bound of A in this way to be unstable. Because the proof technique here is very analogues tothat of the empirical Bernstein bound in [2, Eq. (48)-(50)], we present the alternative bound below.Deﬁne the empirical variance as p A n : “ n ´ ř ni “ p X i ´ s X n q , where s X n “ n ´ ř ni “ X i . For any δ , δ P r , s , deﬁne s A n : “ sup ! a ě p A n ě a ´ Bn q ´ δ h p k n q ; c n , a, B ¯ ´ n q ´ δ h p k n q ; c n , a, B ¯) . Lemma I.1 shows that s A n is an over-estimate of A uniformly over n and yields the following action-able bound. Recall that S n “ ř ni “ X i “ n s X n . 26 heorem 9. If X , X , . . . are mean-zero independent random variables satisfying Var p X i q “ A and P p| X i | ą B q “ for all i ě , then for any δ , δ P r , s , P ˆ D n ě | S n | ě q ˆ δ h p k n q ; c n , s A ˚ n , B ˙ or A ě s A ˚ n p δ q ˙ ď δ ` δ , where s A ˚ n : “ min ď s ď n s A s . Here k n and c n are same as those deﬁned in Theorem 2. This theorem is an analogue of the empirical Bernstein inequality [20, Eq. (5)]. Furthermore, theupper bound s A n on A is better than that in the Bernstein version [2, Eq. (49)-(50)]; see Lemma I.2. I.1 Proof of Theorem 9 and Comparison of Standard Deviation Estimation fromOther Inequalities

Lemma I.1. If X , X , . . . are mean-zero independent random variables satisfying Var p X i q “ A and P p| X i | ą B q “ , for all i ě , then for any δ P r , s P ¨˝ D t ě p A t ď A ´ Bt q ˆ δh p k t q ; c t , A, B ˙ ´ t ˇˇˇˇˇ t ÿ i “ X i ˇˇˇˇˇ ˛‚ ď δ. Proof.

Consider the random variable X i ´ E r X i s . These are mean zero and are bounded in absolutevalue by B . Further the variance can be bounded as Var p X i ´ E r X i sq “ E rp X i ´ E r X i sq s ď B E r| X i | s “ B A . Applying Theorem 2 with variables X i ´ E r X i s implies P ˜ D t ě t ÿ i “ ´p X i ´ E r X i sq ě q ˆ δh p k t q ; c t , AB, B ˙¸ ď δ. Lemma H.3 proves that q ˆ δh p k t q ; c t , AB, B ˙ “ Bq ˆ δh p k t q ; c t , A, B ˙ . Hence we get with probability at least ´ δ , simultaneously for all t ě t ÿ i “ p X i ´ s X t q “ t ÿ i “ X i ´ t ˜ t ÿ i “ X i ¸ ě t ÿ i “ E r X i s ´ Bq ˆ δh p k t q ; c t , A, B ˙ ´ t ˇˇˇˇˇ t ÿ i “ X i ˇˇˇˇˇ . Hence for any δ P r , s , P ¨˝ D t ě t p A t ď tA ´ Bq ˆ δh p k t q ; c t , A, B ˙ ´ t ˇˇˇˇˇ n ÿ i “ X i ˇˇˇˇˇ ˛‚ ď δ. This completes the proof. 27e will now prove Theorem 9. Theorem 2 implies that P ˜ D t ě ˇˇˇˇˇ t ÿ i “ X i ˇˇˇˇˇ ě q ˆ δ h p k t q ; c t , A , B ˙¸ ď δ , (31)Lemma I.1 implies that P ¨˝ D t ě p A t ď tt ´ A ´ Bt ´ q ˆ δ h p k t q ; c t , A, B ˙ ´ t p t ´ q ˇˇˇˇˇ t ÿ i “ X i ˇˇˇˇˇ ˛‚ ď δ . Hence with probability at least ´ δ ´ δ , simultaneously for all t ě , ˇˇˇˇˇ t ÿ i “ X i ˇˇˇˇˇ ď q ˆ δ h p k t q ; c t , A, B ˙ , p A t ď tt ´ A ´ Bt ´ q ˆ δ h p k t q ; c t , A, B ˙ ´ t p t ´ q ˇˇˇˇˇ t ÿ i “ X i ˇˇˇˇˇ On this event, A ď s A t simultaneously for all t ě which in turn implies that A ď min ď s ď t s A s alsoholds simultaneously for all t ě . Substituting this in (31) (along with Lemma H.2) implies theresult. Lemma I.2.

Suppose δ ÞÑ ˜ q p δ { n ; A, B q is a function such that P ˆ max ď t ď n S t ě n ˜ q p δ { n ; A, B q ˙ ď δ, (32) for all δ P r , s and independent random variables X , . . . , X n satisfying (2) . Deﬁne the (over)-estimator of A as ˜ A t : “ sup " a ě p A t ě a ´ Bc t t ˜ q ´ p δ {p h p k t qqq { c t ; a, B ¯ ´ c t t ˜ q ´ p δ {p h p k t qqq { c t ; a, B ¯* . Then s A n ď ˜ A n . Proof.

We have proved in Appendix D that (32) implies q p δ ; n, a, B q ď n ˜ q ´ δ { n ; a, B ¯ , for all n, a, and B . Hence if a satisﬁes p A t ě a ´ Bt q ˆ δ h p k t q ; c t , a, B ˙ ´ t q ˆ δ h p k t q ; c t , a, B ˙ , then p A n ě a ´ Bc t t ˜ q ´ p δ {p h p k t qqq { c t ; a, B ¯ ´ c t t ˜ q ´ p δ {p h p k t qqq { c t ; a, B ¯ , which implies the result. 28 eferences [1] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs,and mathematical tables , volume 55. US Government printing oﬃce, 1948.[2] J.-Y. Audibert, R. Munos, and C. Szepesvári. Exploration–exploitation tradeoﬀ using varianceestimates in multi-armed bandits.

Theoretical Computer Science , 410(19):1876–1902, 2009.[3] R. R. Bahadur and L. J. Savage. The nonexistence of certain statistical procedures in nonpara-metric problems.

The Annals of Mathematical Statistics , 27(4):1115–1122, 1956.[4] A. Balsubramani and A. Ramdas. Sequential nonparametric testing with the law of the iter-ated logarithm. In

Proceedings of the Thirty-Second Conference on Uncertainty in ArtiﬁcialIntelligence , pages 42–51, 2016.[5] V. Bentkus. A remark on the inequalities of Bernstein, Prokhorov, Bennett, Hoeﬀding, andTalagrand.

Liet. Mat. Rink , 42(3):332–342, 2002.[6] V. Bentkus. On Hoeﬀding’s inequalities.

The Annals of Probability , 32(2):1650–1673, 2004.[7] V. Bentkus. An extension of the hoeﬀding inequality to unbounded random variables.

Lithua-nian Mathematical Journal , 48(2):137–157, 2008.[8] V. Bentkus, N. Kalosha, and M. Van Zuijlen. On domination of tail probabilities of (super)martingales: explicit bounds.

Lithuanian Mathematical Journal , 46(1):1–43, 2006.[9] P. Dagum and M. Luby. Approximating the permanent of graphs with large factors.

TheoreticalComputer Science , 102(2):283–305, 1992.[10] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation.

SIAM Journal on computing , 29(5):1484–1496, 2000.[11] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximatingthe volume of convex bodies.

Journal of the ACM (JACM) , 38(1):1–17, 1991.[12] E. Giné and R. Nickl.

Mathematical foundations of inﬁnite-dimensional statistical models ,volume 40. Cambridge University Press, 2016.[13] W. Hoeﬀding. Probability inequalities for sums of bounded random variables.

Journal of theAmerican Statistical Association , 58(301):13–30, 1963.[14] S. R. Howard and A. Ramdas. Sequential estimation of quantiles with applications to a/b-testing and best-arm identiﬁcation. arXiv preprint arXiv:1906.09712 , 2019.[15] S. R. Howard, A. Ramdas, J. McAuliﬀe, and J. Sekhon. Uniform, nonparametric, non-asymptotic conﬁdence sequences. arXiv preprint arXiv:1810.08240 , 2018.[16] M. Huber. An optimal p ε, δ q -randomized approximation scheme for the mean of random vari-ables with bounded relative variance. Random Structures & Algorithms , 55(2):356–370, 2019.[17] M. Huber et al. Approximation algorithms for the normalizing constant of gibbs distributions.

The Annals of Applied Probability , 25(2):974–985, 2015.2918] R. Johari, L. Pekelis, and D. J. Walsh. Always valid inference: Bringing sequential analysis toA/B testing. arXiv preprint arXiv:1512.04922 , 2015.[19] R. Johari, P. Koomen, L. Pekelis, and D. Walsh. Peeking at A/B tests: Why it matters, andwhat to do about it. In

Proceedings of the 23rd ACM SIGKDD International Conference onKnowledge Discovery and Data Mining , pages 1517–1525, 2017.[20] V. Mnih, C. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In

Proceedings ofthe 25th international conference on Machine learning , pages 672–679, 2008.[21] T. K. Philips and R. Nelson. The moment bound is tighter than Chernoﬀ’s bound for positivetail probabilities.

The American Statistician , 49(2):175–178, 1995.[22] I. Pinelis. Binomial upper bounds on generalized moments and tail probabilities of (super)martingales with diﬀerences bounded from above. In

High dimensional probability , pages 33–52. Institute of Mathematical Statistics, 2006.[23] I. Pinelis. On the Bennett-Hoeﬀding inequality. arXiv preprint arXiv:0902.4058 , 2009.[24] I. Pinelis et al. Optimal binomial, poisson, and normal left-tail domination for sums of non-negative random variables.

Electronic Journal of Probability , 21, 2016.[25] R. Singh. Existence of bounded length conﬁdence intervals.

The Annals of MathematicalStatistics , 34(4):1474–1485, 1963.[26] M. Talagrand. The missing factor in Hoeﬀding’s inequalities. In

Annales de l’IHP Probabilitéset statistiques , volume 31, pages 689–702, 1995.[27] A. Wald.

Sequential analysis . Courier Corporation, 2004.[28] F. Yang, A. Ramdas, K. G. Jamieson, and M. J. Wainwright. A framework for multi-A(rmed)/B (andit) testing with online fdr control. In

Advances in Neural Information ProcessingSystems , pages 5957–5966, 2017.[29] S. Zhao, E. Zhou, A. Sabharwal, and S. Ermon. Adaptive concentration inequalities for se-quential decision problems. In