Near-Optimal Confidence Sequences for Bounded Random Variables
Arun Kumar Kuchibhotla ∗ Qinqing Zheng † University of Pennsylvania
June 8, 2020
Abstract
Many inference problems, such as sequential decision problems like A/B testing and adaptive sampling schemes like bandit selection, are often online in nature. The fundamental problem for online inference is to provide a sequence of confidence intervals that are valid uniformly over the growing-into-infinity sample sizes. To address this question, we provide a near-optimal confidence sequence for bounded random variables by utilizing Bentkus' concentration results. We show that it improves on the existing approaches that use the Cramér-Chernoff technique such as the Hoeffding, Bernstein, and Bennett inequalities. The resulting confidence sequence is confirmed to be favorable in both synthetic coverage problems and an application to adaptive stopping algorithms.
The abundance of data over the decades has increased the demand for sequential algorithms and inference procedures in statistics and machine learning. For instance, when the data is too large to fit in a single machine, it is natural to split the data into small batches and process one at a time. Besides, many industry or laboratory data, like user behaviors on a website, patient records, and temperature histories, are naturally generated and available in a sequential order. In both scenarios, the collection or processing of new data can be costly, and practitioners often would like to stop data sampling when a required criterion is satisfied. This gives a pressing call for algorithms that minimize the number of sequential samples subject to a prescribed accuracy of the estimator.

Many important problems fit into this framework, including sequential hypothesis testing problems such as testing positiveness of the mean [29], testing equality of distributions and testing independence [4, 28], A/B testing [18, 19], the sequential probability ratio test [27], best arm identification for multi-arm bandits (MAB) [29, 28], and (ε, δ)-mean estimation [20, 16]. All these applications require confidence sequences to determine the number of samples required for a certain guarantee.

Let Y_1, Y_2, . . . be independent real-valued random variables, available sequentially, with mean µ ∈ R. Given δ ∈ [0, 1], a 1 − δ confidence sequence is a sequence of confidence intervals ConfSeq(δ) = {CI_1(δ), CI_2(δ), . . .}, where CI_n is constructed on-the-fly after observing data sample Y_n, such that

    P( µ ∈ CI_n(δ) for all n ≥ 1 ) ≥ 1 − δ.    (1)

∗ Department of Statistics. Email: [email protected].
† Department of Statistics. Email: [email protected].

Unlike the traditional confidence interval in statistics, the guarantee (1) is non-asymptotic and is uniform over the sample sizes.
Ideally, we want CI_n(δ) to reduce in width as either n or δ or both increase. Unfortunately, but not surprisingly, guarantee (1) is impossible to achieve non-trivially without further assumptions [3, 25]. In this paper, we assume that the random variables are bounded: there exist known constants L, U ∈ R such that P(L ≤ Y_i ≤ U) = 1, which yields µ ∈ [L, U]. Although boundedness can be replaced by tail assumptions such as sub-Gaussianity or polynomial tails, we will restrict our discussion to the bounded case; see Section 5 for a discussion.

Motivating Example.
We give a concrete example to illustrate how the confidence sequence can be applied to sequential statistical inference. Estimating the mean of a random variable is a classic problem in statistics and widely applied in various applications. An estimator µ̂ is said to be (ε, δ)-accurate for the mean µ if P(|µ̂/µ − 1| ≤ ε) ≥ 1 − δ [10, 20, 16]. This means that the estimator has a relative error of at most ε with probability at least 1 − δ. Relative error is important in many examples such as permanent estimation [9], estimation of the volume of a convex body [11], and estimation of the partition function of a Gibbs distribution [17], where the unknown magnitude of µ can render absolute error unreliable. The important question we would like to answer is: how many samples are required to obtain an estimator of the mean that is (ε, δ)-accurate? Suppose one can construct a 1 − δ confidence sequence ConfSeq(δ) = {CI_n = [Ȳ_n − Q_n, Ȳ_n + Q_n], n ≥ 1}, where Ȳ_n is the empirical mean of the first n samples. Mnih et al. [20] show that with stopping time N = min{n : (1 − ε)UB_n ≤ (1 + ε)LB_n}, where UB_n and LB_n are two simple functions of the radii of the confidence intervals Q_1, . . . , Q_n, the estimator µ̂ = (1/2) sign(Ȳ_N)[(1 − ε)UB_N + (1 + ε)LB_N] is (ε, δ)-accurate.

Contributions.
The need for sequential algorithms has triggered a surge of interest in developing sharp confidence sequences. In recent years, several confidence sequences were proposed by stitching fixed sample size confidence intervals [29, 20, 15], where the fixed sample size intervals are derived from concentration inequalities such as Hoeffding, Bernstein, or Bennett [20, Section 2]. Although the stitching techniques differ slightly across these methods, they are easily transferable from one to another. The tightness of the confidence sequences is mainly controlled by the sharpness of the fixed sample size concentration inequalities.

To the best of our knowledge, all the existing confidence sequences are built upon concentration results that bound the moment generating function and follow the Cramér-Chernoff technique. Those concentration results are conservative and can be significantly improved [21]. In this paper, we leverage the refined concentration results introduced by Bentkus [5]. We first develop a "maximal" version of Bentkus' concentration inequality. Based on it, we construct the confidence sequence via stitching. In honor of Bentkus, who pioneered this line of refined concentration inequalities, we call our confidence sequence
the Bentkus Confidence Sequence.

To summarize, the major contributions of this article are four-fold.

• For pointwise concentration results, we provide a near-optimal refinement of the Cramér-Chernoff tail bound for bounded random variables based on the results of [5, 6, 22]. Unlike the Chernoff bounds, the refined bound is optimal up to e²/2, i.e., there exists a distribution such that for Y_i with that distribution, our tail bound is at most e²/2 times the true tail. Further, our bound is always smaller than the Chernoff bounds. The computation of Bentkus' method is non-trivial. We provide closed-form equations for our tail bound so that the exact numerical solution can be obtained easily.

• We use these results in conjunction with a "stitching" method to construct non-asymptotic confidence sequences. For Ȳ_n = n⁻¹ ∑_{i=1}^n Y_i, the confidence interval is CI_n(δ) := [Ȳ_n − q_n^low(δ), Ȳ_n + q_n^up(δ)], for values q_n^low(δ), q_n^up(δ) ≥ 0, and they scale like √(Var(Y) log log(n)/n) as n → ∞. The optimal nature of our tail bounds implies that our confidence sequence is always shorter in length than the adaptive Hoeffding bound [29] and the empirical Bernstein bound [20].

• Our confidence sequence utilizes the variance of Y. For practical usage, we also provide a closed-form upper bound on the true variance of the bounded random variables, so that our confidence sequence is actionable in practice. This upper bound can even be improved using numerical solvers.

• We conducted numerical experiments to verify our theoretical claims. We also applied the confidence sequence to the (ε, δ)-accurate mean estimation problem above, and show that it gives a much smaller stopping time than the competing methods.

(Of course, if we take CI_n(δ) = (−∞, ∞), then (1) is trivially satisfied.)

Organization.
The rest of this article is organized as follows. Section 2 reviews the related work. Section 3 contains all our theoretical results. Section 4 presents the numerical experiments that confirm the superiority of our method. Section 5 summarizes the contributions and considers some future directions of the work.
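As an aside, the stopping rule from the motivating example can be sketched in code. The following is our own illustrative rendering of the rule of Mnih et al. [20]; the `radius` argument is a placeholder confidence-radius function for illustration, not the construction developed in this paper.

```python
import numpy as np

# Illustrative sketch of the (eps, delta)-stopping rule from the motivating
# example, following Mnih et al. [20]. `radius(n)` supplies any valid
# confidence sequence radius Q_n; the one used below is a toy placeholder.
def adaptive_stop(sample, radius, eps, max_n=10**6):
    lb, ub, total = 0.0, np.inf, 0.0
    for n in range(1, max_n + 1):
        total += sample()
        mean = total / n
        q = radius(n)
        lb = max(lb, abs(mean) - q)   # running lower bound on |mu|
        ub = min(ub, abs(mean) + q)   # running upper bound on |mu|
        if (1 + eps) * lb >= (1 - eps) * ub:
            est = 0.5 * np.sign(mean) * ((1 + eps) * lb + (1 - eps) * ub)
            return n, est
    raise RuntimeError("did not stop within max_n samples")

rng = np.random.default_rng(0)
# Toy Hoeffding-style radius, shrinking like sqrt(log/n) (illustration only).
N, mu_hat = adaptive_stop(
    sample=lambda: rng.uniform(0, 1),
    radius=lambda n: np.sqrt(np.log(np.log(n + 2) * 100) / n),
    eps=0.1,
)
print(N, mu_hat)  # stops once the relative-error criterion is certified
```

A tighter radius makes the while-loop terminate earlier, which is exactly the motivation for the sharper confidence sequences developed below.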
Zhao et al. [29] propose confidence sequences through Hoeffding's inequality; the assumption here is that the Y_i's are 1/2-sub-Gaussian. For random variables supported on [L, U], this is satisfied after dividing by (U − L). This confidence sequence does not scale with the true variance and hence can be conservative. Mnih et al. [20], building on [2], construct confidence sequences through Bernstein's inequality, and the intervals here scale correctly with the true variance. Both these confidence sequences are closed-form in nature. More recently, Howard et al. [15] unified the techniques of obtaining confidence sequences under a variety of assumptions on the random variables. This work builds on much of the existing statistics literature and we refer the reader to this paper for a detailed historical account. The main message from this paper is that the width of any confidence sequence CI_n(δ) satisfying (1) must be larger than √(A²(log log n)/n) as n tends to infinity, where A² represents the common variance of Y_1, Y_2, . . . .

All the confidence sequences in the works mentioned above depend on bounding the moment generating function and follow the Cramér-Chernoff technique. Such confidence sequences are conservative and can be significantly improved [21]. To understand the deficiency of such concentration inequalities, consider for example Bernstein's inequality: for Ȳ_n = ∑_{i=1}^n Y_i/n,

    P( √n(Ȳ_n − µ) ≥ t ) ≤ exp( −t²/{2A² + 2Bt/(3√n)} ) ≈ exp( −t²/(2A²) ).

However, the central limit theorem implies P(√n(Ȳ_n − µ) ≥ t) ≈ 1 − Φ(t/A), where the limit behaves like exp(−t²/(2A²))/√(2π((t/A)² + 1)) [1, Formula 7.1.13]. Therefore, Bernstein's inequality and the true tail differ by the scaling √(2π((t/A)² + 1)), which can be significant for large t. This explains why a further refinement is possible, and Bentkus [5] presents such refined concentration inequalities.
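The gap between the Cramér-Chernoff-type bound and the Gaussian tail is easy to see numerically; a small illustration (our own, with an assumed standard deviation A):

```python
import numpy as np
from scipy.stats import norm

# Illustration (not from the paper's code): the ratio between the bound
# exp(-t^2 / (2 A^2)) and the Gaussian tail 1 - Phi(t/A) grows roughly
# like sqrt(2*pi*((t/A)^2 + 1)) for large t.
A = 0.5  # an assumed standard deviation, for illustration only
for t in [0.5, 1.0, 1.5, 2.0]:
    x = t / A
    ratio = np.exp(-x**2 / 2) / norm.sf(x)   # norm.sf(x) = 1 - Phi(x)
    approx = np.sqrt(2 * np.pi * (x**2 + 1))
    print(f"t={t}: ratio={ratio:.2f}, sqrt(2*pi*((t/A)^2+1))={approx:.2f}")
```

The printed ratios grow with t and track the √(2π((t/A)² + 1)) scaling, confirming that the Chernoff-style bound is loose by an unbounded factor in the tail.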
Our work essentially builds on [5, 6, 8, 22, 24] to obtain a confidence sequence through the technique of stitching.

For any random variable Y_i with mean µ, X_i = Y_i − µ is mean zero and hence we will mostly restrict to the case of mean zero random variables. The result for general µ will readily follow; see Theorem 4. In this section, we discuss Bentkus' concentration inequality for bounded mean zero random variables. After this, we present a refined confidence sequence that is not readily actionable because it depends on the true variance of the random variables. Finally, we present an actionable version where we replace the true variance by an estimated upper bound. This provides an analog of the empirical Bernstein confidence sequence, which we call the Empirical Bentkus Confidence Sequence.

The setting is as follows. Suppose X_1, X_2, . . . are independent random variables satisfying

    E[X_i] = 0,  Var(X_i) ≤ A_i²,  and  P(X_i > B) = 0,  for i ≥ 1.    (2)

We will first derive concentration inequalities under the one-sided bound assumption as in (2), which only requires X_i ≤ B almost surely. To derive actionable versions of the concentration inequalities (with estimated variance), we will impose a two-sided bound assumption.

We now present a concentration inequality that holds uniformly over all sample sizes up to a fixed time. Our refined tail bounds are based on a worst case two-point distribution satisfying (2). Define mean zero independent random variables G_1, G_2, . . . as

    P( G_i = −A_i²/B ) = B²/(A_i² + B²)  and  P( G_i = B ) = A_i²/(A_i² + B²).    (3)

These random variables satisfy Var(G_i) = A_i² and P(G_i > B) = 0. Furthermore, the G_i's are the worst case random variables satisfying (2) in the sense that for all n ≥ 1 and x ∈ R,

    sup_{X_1,...,X_n ∼ (2)} E[ ( ∑_{i=1}^n X_i − x )_+² ] = E[ ( ∑_{i=1}^n G_i − x )_+² ],    (4)

where (a)_+ = max{a, 0} and the supremum is over all distributions of the X_i's satisfying (2).
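The moment claims for the two-point distribution (3) can be verified directly; a quick sanity check with assumed values of A and B:

```python
# Sanity check (our own illustration, with assumed values of A and B) that the
# two-point distribution in (3) is mean zero with variance A^2 and support
# bounded above by B.
A, B = 0.3, 1.0
p_low = B**2 / (A**2 + B**2)    # P(G = -A^2 / B)
p_up = A**2 / (A**2 + B**2)     # P(G = B)
mean = (-A**2 / B) * p_low + B * p_up
second_moment = (A**2 / B) ** 2 * p_low + B**2 * p_up
assert abs(mean) < 1e-12                 # E[G] = 0
assert abs(second_moment - A**2) < 1e-9  # Var(G) = E[G^2] = A^2
```

The algebra behind the assertions: (−A²/B)·B²/(A² + B²) + B·A²/(A² + B²) = 0 and (A⁴/B²)·B²/(A² + B²) + B²·A²/(A² + B²) = A².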
Moreover,

    P( ∑_{i=1}^n G_i ≥ x ) ≤ sup_{X_1,...,X_n ∼ (2)} P( ∑_{i=1}^n X_i ≥ x ) ≤ (e²/2) P( ∑_{i=1}^n G_i ≥ x ),    (5)

where the right hand side inequality holds for all x in the support of ∑_{i=1}^n G_i; it holds for all x ∈ R if P(∑_{i=1}^n G_i ≥ x) is replaced by its log-linear interpolation; see [5] for details. The left hand side inequality in (5) is trivial because the G_i's satisfy (2), and the right hand side inequality is derived using (4). The inequalities in (5) show that concentration inequalities based on the two-point random variables G_i are sharp up to a constant factor e²/2. This is unlike the classical Cramér-Chernoff inequalities, which differ from the optimal one by a factor depending on x; see Talagrand [26].

Let A = {A_1, A_2, . . .} be the collection of standard deviations and, for n ≥ 1 and δ ∈ [0, 1], define q(δ; n, A, B) as the solution in u of

    inf_{x ≤ u} E[ ( ∑_{i=1}^n G_i − x )_+² ] / (u − x)² = δ.    (6)

The solution exists uniquely for δ ≥ P(∑_{i=1}^n G_i = nB) and is defined to be nB⁺ if δ < P(∑_{i=1}^n G_i = nB). The following result provides a refined concentration inequality for S_n = ∑_{i=1}^n X_i. It is a "maximal" version of Theorem 2.1 of [8], and we defer the proof to Appendix D.

Theorem 1.
Fix n ≥ 1. If X_1, X_2, . . . , X_n are independent random variables satisfying (2), then

    P( max_{1≤t≤n} S_t ≥ q(δ; n, A, B) ) ≤ δ,  for any δ ∈ [0, 1].    (7)

Further, if A_1 = · · · = A_n = A and if q̃(·; A, B) is some function such that P(max_{1≤t≤n} S_t ≥ n q̃(δ^{1/n}; A, B)) ≤ δ for all δ ∈ [0, 1] and all X_1, . . . , X_n satisfying (2), then q(δ; n, A, B) ≤ n q̃(δ^{1/n}; A, B) for all δ ∈ [0, 1].

The first part of Theorem 1 provides a finite sample valid estimate of the quantile, and the second part implies that it is sharper than the usual concentration inequalities such as the Hoeffding, Bernstein, Bennett, or Prokhorov inequalities. To see this fact, note that P(max_{1≤t≤n} S_t ≥ n q̃(δ^{1/n}; A, B)) ≤ δ for all δ ∈ [0, 1] is equivalent to the existence of a function H(u; A, B) such that P(max_{1≤t≤n} S_t ≥ nu) ≤ Hⁿ(u; A, B) for all u. The classical concentration inequalities mentioned above are all of this product form and hence weaker than our bound. See Figure 1a for an illustration and [5, 6, 8, 22] for further discussion. Computation of q(·; n, A, B) is discussed in Section 9 of Bentkus et al. [8] and we provide a detailed discussion in Appendix C. In this respect, the following result describes the function in (6) as a piecewise smooth function in the case A_1 = . . . = A_n = A.

Proposition 1.
Set p_AB = A²/(A² + B²) and Z_n = ∑_{i=1}^n R_i where the R_i ∼ Bernoulli(p_AB) are independent. Then

    inf_{x ≤ u} E[ ( ∑_{i=1}^n G_i − x )_+² ] / (u − x)² = P( np_AB + u(1 − p_AB)/B ; Z_n ),  for all u ∈ R,

where P(x; Z_n) = 1 for x ≤ np_AB, and, for x ≥ np_AB and 0 ≤ k ≤ n − 1,

    P(x; Z_n) =
      np_AB(1 − p_AB) / ( (x − np_AB)² + np_AB(1 − p_AB) ),  if np_AB < x ≤ v_1/e_1,
      (v_k p_k − e_k²) / ( x² p_k − 2x e_k + v_k ),  if (v_{k−1} − (k−1)e_{k−1})/(e_{k−1} − (k−1)p_{k−1}) < x ≤ (v_k − k e_k)/(e_k − k p_k),
      P(Z_n = n) = p_AB^n,  if x ≥ (v_{n−1} − (n−1)e_{n−1})/(e_{n−1} − (n−1)p_{n−1}).

Here p_k = P(Z_n ≥ k), e_k = E[Z_n 1{Z_n ≥ k}], and v_k = E[Z_n² 1{Z_n ≥ k}].

Based on the description in Proposition 1 and (6), computation of q(·; n, A, B) follows. In Appendix C.1, we also provide a similar piecewise description of q(·; n, A, B).

Although Theorem 1 leads to a uniform in sample size confidence sequence up to size n, it is very wide for sample sizes much smaller than n. We now use the method of stitching to obtain a confidence sequence that is valid for all sample sizes and scales reasonably well with respect to the sample size. We refer the reader to Mnih et al. [20, Section 3.2] and Howard et al. [15, Section 3.1] for details. To construct a uniform over n confidence sequence, we require two user-chosen parameters:
Figure 1: Comparison of the concentration bounds when δ = 0.05. The X_i are centered i.i.d. Bernoulli random variables; we give the true standard deviation A_i and upper bound B to all the methods. (a) Pointwise bounds: the average failure frequencies across 300 trials are below δ for all three methods. (b) Uniform bounds: A-Bentkus is computed using η = 1.1 and h(k) = (k + 1)^{1.1} ζ(1.1); over 3000 trials, there are zero failures for Adaptive Hoeffding and Empirical Bernstein, and a small nonzero failure frequency for A-Bentkus (8). In both cases, all the bounds have failure frequency bounded above by δ but the Bentkus bound is the least conservative. The differences between the bounds continue to grow as n increases.

1. a scalar η > 1, which determines the geometric spacing.
2. a function h : R⁺ → R⁺ such that ∑_{k=0}^∞ 1/h(k) ≤ 1. Ideally, 1/h(k) adds up to 1.

The following result gives a uniform over n tail inequality by splitting {n ≥ 1} into ∪_{k≥0} {⌈η^k⌉ ≤ n ≤ ⌊η^{k+1}⌋} and then applying (7) within each {⌈η^k⌉ ≤ n ≤ ⌊η^{k+1}⌋}. See Appendix E for the proof.

Theorem 2. If X_1, X_2, . . . are independent random variables satisfying (2), then

    P( ∃ n ≥ 1 : S_n ≥ q( δ/h(k_n); c_n, A, B ) ) ≤ δ,  for any δ ∈ [0, 1],    (8)

where k_n := inf{k ≥ 0 : ⌈η^k⌉ ≤ n ≤ ⌊η^{k+1}⌋} and c_n := ⌊η^{k_n+1}⌋.

The choice of the spacing parameter η and the stitching function h(·) determines the shape of the confidence sequence and there is no universally optimal setting. The growth rate of h(·) determines how the budget of δ is spent over sample sizes; a quickly growing h(·) such as 2^k yields confidence intervals of essentially zero confidence for larger sample sizes. The choice of η determines how conservative the bound is for ⌈η^k⌉ ≤ n ≤ ⌊η^{k+1}⌋; for η too large the bound will be conservative for n close to η^k. Eq.
(5) shows that the bound is tightest at n = ⌊η^{k+1}⌋ in each epoch. Throughout this paper, we use η = 1.1 and h(k) = ζ(1.1)(k + 1)^{1.1}, where ζ(·) is the Riemann zeta function.

The same stitching method used in Theorem 2 can also be used with the Hoeffding and Bernstein inequalities, as done in [29] and [2], respectively. However, given that inequality (7) is sharper than the Hoeffding and Bernstein inequalities, our bound (8) is sharper for the same spacing parameter η and stitching function h(·); see Figure 1b. Stitched bounds as in Theorem 2 are always piecewise constant, but the Hoeffding and Bernstein versions from [29] and [20] are smooth because they are upper bounds of the piecewise constant boundaries (obtained using n ≤ c_n ≤ ηn and k_n ≤ log_η n + 1). For practical use, smoothness is immaterial and the piecewise constant versions are sharper.

Theorem 2 is not actionable in its form because it involves the unknown sequence A_1, A_2, . . . . In the case where A_1 = A_2 = · · · = A, one needs to generate an upper bound of A (for a known B) and obtain an actionable version of Theorem 2. Finite-sample over-estimation of A requires a two-sided bound on the X_i's; one-sided bounds on the random variables do not suffice. This actionable version is a refined version of the empirical Bernstein inequality that is uniform over the sample sizes.

We will assume that P(B_l ≤ X_i ≤ B) = 1 for all i, where B_l denotes the lower bound. Define Ā_0(δ) = (B − B_l)/2 and, for n ≥ 1 and δ ∈ [0, 1],

    Â_n² := ⌊n/2⌋⁻¹ ∑_{i=1}^{⌊n/2⌋} (X_{2i} − X_{2i−1})²/2,  and  Ā_n(δ) := √( Â_n² + g_{1,n}(δ) ) + g_{2,n}(δ),    (9)

where g_{1,n}(δ) := (√2 n)⁻¹ √(⌊c_n/2⌋) (B − B_l)² Φ⁻¹( 1 − δ/(e h(k_n)) ), for the distribution function Φ(·) of a standard normal random variable. We will write Ā_n(δ; B_l, B), when needed, to stress the dependence of Ā_n(δ) on B_l, B. Lemma F.1 shows that Ā_n(δ) is a valid over-estimate of A uniformly over n and yields the following actionable bound. We defer the proof to Appendix F.

Theorem 3. If X_1, X_2, . . .
are mean-zero independent random variables satisfying Var(X_i) = A² and P(B_l ≤ X_i ≤ B) = 1 for all i ≥ 1, then for any δ_1, δ_2 ∈ [0, 1],

    P( ∃ n ≥ 1 : S_n ≥ q( δ_1/h(k_n); c_n, Ā*_n(δ_2), B ) or A ≥ Ā*_n(δ_2; B_l, B) ) ≤ δ_1 + δ_2,

and

    P( ∃ n ≥ 1 : S_n ≤ −q( δ_1/h(k_n); c_n, Ā*_n(δ_2), −B_l ) or A ≥ Ā*_n(δ_2; B_l, B) ) ≤ δ_1 + δ_2,

where Ā*_n(δ_2) := min_{1≤s≤n} Ā_s(δ_2; B_l, B), and k_n, c_n are those defined in Theorem 2.

Theorem 3 is an analogue of the empirical Bernstein inequality [20, Eq. (5)]. The over-estimate of A in (9) can be improved by using non-analytic expressions, but we present the version above for simplicity; see Appendix F for details on how to improve Ā_n(δ) in (9).

Theorem 3 can be used to construct a confidence sequence as follows. Suppose Y_1, Y_2, . . . are independent random variables with mean µ, variance A², and satisfying P(L ≤ Y_i ≤ U) = 1. Then X_i = Y_i − µ is a zero mean random variable with P(L − µ ≤ X_i ≤ U − µ) = 1, and Theorem 3 is directly applicable with B = −B_l = U − L. An interesting observation is that we can refine the values of B_l and B while we are updating the confidence interval for µ. Suppose we have valid upper and lower bounds after observing n data points: −q_n^low ≤ ∑_{i=1}^n Y_i − nµ ≤ q_n^up; this implies µ_n^low := Ȳ_n − n⁻¹q_n^up ≤ µ ≤ Ȳ_n + n⁻¹q_n^low =: µ_n^up, where Ȳ_n is the empirical mean of Y. We thus have a valid estimate [L − µ_n^up, U − µ_n^low] of the support of X, and when we observe Y_{n+1}, we can use U − µ_n^low as B and L − µ_n^up as B_l. Importantly, as Theorem 3 provides a uniform concentration bound, these recursively defined upper and lower bounds hold simultaneously too. This leads to the following result, proved in Appendix G.

Theorem 4. If random variables Y_1, Y_2, . . . are independent with mean µ, variance A², and satisfy P(L ≤ Y_i ≤ U) = 1.
Define µ_0^up := U, µ_0^low := L, and for n ≥ 1,

    µ_n^up = Ȳ_n + n⁻¹ q( δ_1/h(k_n); c_n, Ā*_n(δ_2; U, L), −(L − µ_{n−1}^up) ),
    µ_n^low = Ȳ_n − n⁻¹ q( δ_1/h(k_n); c_n, Ā*_n(δ_2; U, L), U − µ_{n−1}^low ).

Let µ_n^up* = min_{1≤i≤n} µ_i^up and µ_n^low* = max_{1≤i≤n} µ_i^low. Then for any δ_1, δ_2 ∈ [0, 1],

    P( µ ∈ [µ_n^low*, µ_n^up*] and A ≤ Ā*_n(δ_2; U, L) for all n ≥ 1 ) ≥ 1 − δ_1 − δ_2.    (10)

Because µ_0^up = U and µ_0^low = L, the confidence interval [µ_n^low*, µ_n^up*] is always a subset of [L, U].

In this section, we conduct numerical experiments to demonstrate the efficacy of our method. Section 4.1 examines the coverage probability and the width of the confidence intervals constructed on synthetic data from
Bernoulli(0.2); for other cases, see Appendix B. Section 4.2 applies the confidence sequences to an adaptive stopping algorithm for (ε, δ)-mean estimation.

We compare our adaptive Bentkus confidence sequence (10) with the adaptive Hoeffding [29], empirical Bernstein [20], and two other versions of the empirical Bernstein inequality from [15]: Eq. (24) and Theorem 4 with the gamma-exponential boundary from Proposition 9 of [15]. We denote these methods by
A-Bentkus, A-Hoeffding, E-Bernstein, HRMS-Bernstein, and
HRMS-Bernstein-GE, respectively. For all the experiments, we use δ = 0.05. For A-Bentkus, we fix the spacing parameter η = 1.1, the stitching function h(k) = (k + 1)^{1.1} ζ(1.1), and δ_1 = δ_2 = δ/2.

In this experiment, we generate i.i.d. samples Y_1, Y_2, . . . ∼ Bernoulli(0.2) and compute the confidence sequences for the true mean µ = 0.2. Figure 2a gives an illustration of the confidence sequences obtained and shows the sharpness of A-Bentkus (10). For most of the range of n, A-Bentkus dominates the other methods. For smaller sample sizes,
A-Bentkus also closely traces A-Hoeffding and outperforms the others. This is because the variance estimation is likely conservative, in which case our Ā*_n ends up using the trivial upper bound (U − L)/2, which is essentially what A-Hoeffding is exploiting. In fact, we have provided the same upper bound to all the other Bernstein-type methods too, and A-Bentkus still outperforms. This phenomenon shows the intrinsic sharpness of our bound.

We repeat the above experiment 1000 times and report the average miscoverage rate, i.e., the fraction of replications for which µ ∉ CI_n^(r) for some n in the observed range, where CI_n^(r) is the confidence interval constructed after observing Y_1, . . . , Y_n in the r-th replication. The observed rates are nonzero but far below δ for A-Bentkus and HRMS-Bernstein-GE, and zero for the others. (Code is available at https://github.com/enosair/bentkus_conf_seq.)

Figure 2: Comparison of the 95% confidence sequences for the mean when Y_i ∼ Bernoulli(0.2). Except A-Hoeffding, all other methods estimate the variance. A-Bentkus is the confidence sequence in (10). HRMS-Bernstein-GE involves a tuning parameter ρ which is chosen to optimize the sequence at a fixed sample size, as suggested in Figure 7 of [15]. (a) shows the confidence sequences from a single replication. (b) shows the average widths of the confidence sequences over 1000 replications. The upper and lower bounds for all the other methods are cut at 1 and 0 for a fair comparison.

All the methods control the miscoverage rate by δ = 0.05 but are all conservative. Recall from (5) that our failure probability bound can be conservative up to a constant of e²/2. Furthermore, from the proofs of Theorems 2 and 4, we get that for η = 1.1 and h(k) = (k + 1)^{1.1} ζ(1.1),

    P( µ ∉ CI_n(δ) for some n in the observed range ) ≤ ∑_{k=0}^{⌈log_η(n_max)⌉} δ/h(k),

which is well below δ for the sample sizes considered. This explains why the average miscoverage rate is small.

We also report the average width of the confidence intervals in Figure 2b. All the values are between 0 and 1 as we cut the bounds from above and below for the other methods. As mentioned above, when n is very small, A-Bentkus closely traces A-Hoeffding and both have smaller width. Yet the advantage of A-Hoeffding disappears as n grows, and A-Bentkus enjoys smaller confidence interval width afterwards, while HRMS-Bernstein-GE improves slightly on A-Bentkus after observing a very large number of samples.
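The stitched failure budget can be checked numerically; a small illustration (our own) of the δ-allocation with the choices η = 1.1 and h(k) = ζ(1.1)(k + 1)^{1.1}, where the horizon 5000 is an assumed value for illustration:

```python
import numpy as np
from scipy.special import zeta

# Numerical check (our own illustration) of the stitched failure budget:
# with h(k) = zeta(1.1) * (k+1)^1.1, the per-epoch budgets delta/h(k) sum
# to at most delta, and only the epochs covering a finite range of n
# contribute to the miscoverage probability over that range.
delta, eta = 0.05, 1.1
k = np.arange(200000)
budgets = delta / (zeta(1.1) * (k + 1) ** 1.1)
assert budgets.sum() < delta                      # full budget never exceeds delta

# Epochs covering n <= 5000 (an assumed horizon for illustration):
k_max = int(np.ceil(np.log(5000) / np.log(eta)))  # roughly log_eta(5000)
partial = budgets[: k_max + 1].sum()
print(partial)  # the effective failure bound, noticeably below delta
```

This is exactly why the observed miscoverage rates sit well below the nominal δ: only the epochs intersecting the observed range of n can contribute failure probability.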
We apply our confidence sequence to the adaptive stopping rule for estimating the mean of a bounded random variable Y. The goal is to obtain an estimator µ̂ such that the relative error |µ̂/µ − 1| is bounded by ε, and to terminate the data sampling once such a criterion is satisfied. Given the empirical mean Ȳ and any confidence sequence centered at Ȳ satisfying (1), Algorithm 1 yields a valid stopping time and an (ε, δ)-accurate estimator; see [20, Section 3.1] for a proof. Clearly, a tighter confidence sequence will require less data sampling and yield a smaller stopping time.

We follow the setup in Mnih et al. [20]. The data samples are i.i.d. generated as Y_i = m⁻¹ ∑_{j=1}^m U_ij, where U_i1, . . . , U_im are i.i.d. uniformly distributed in [0, 1]. This implies that µ = E[Y_i] = 1/2 and A² = Var(Y_i) = 1/(12m). Because Algorithm 1 requires symmetric
Figure 3: Comparison of confidence sequences for an p ε, δ q -estimator. Here ε “ . and δ “ . . Algorithm 1:
Algorithm 1: Adaptive Stopping Algorithm
Initialization: n ← 0, LB ← 0, UB ← ∞
while (1 + ε)LB < (1 − ε)UB do
    n ← n + 1
    Sample Y_n and compute the n-th interval in the sequence: [Ȳ_n − Q_n, Ȳ_n + Q_n] ← ConfSeq(n, δ)
    LB ← max{LB, |Ȳ_n| − Q_n}
    UB ← min{UB, |Ȳ_n| + Q_n}
return stopping time N = n and estimator µ̂ = (1/2) sgn(Ȳ_N)[(1 + ε)LB + (1 − ε)UB].

intervals, we shall symmetrize the intervals returned by A-Bentkus by taking the largest deviation. We consider 5 cases: m = 1, 10, 20, 100, 1000, and report the average stopping time (i.e., the number of samples required to achieve (ε, δ)-accuracy) over repeated trials in Figure 3. HRMS-Bernstein-GE involves a tuning parameter ρ, chosen here to optimize the confidence sequence at a fixed sample size (best out of a small grid of candidate values). As m increases, the variance of Y_i decreases. As expected, A-Hoeffding does not exploit the variance of the random variables, so its stopping times remain roughly the same. For the others, the stopping time is decreasing. It is clear that on average,
A-Bentkus is the best for all the values of m, with stopping times uniformly smaller than those of the second best method.

In this paper, we proposed a confidence sequence for bounded random variables and examined its efficacy in both synthetic examples and adaptive stopping algorithms. Although our results are presented for the mean, they can be applied to testing equality of distributions, testing independence [4, Sections 3 and 5], and identifying the best arm in the multi-armed bandit problem [29, 14].

We assumed that the X_i's are independent and bounded, and generalizations to dependent and sub-Gaussian cases are of interest. Regarding independence, Theorem 2.1 of [22] shows that Theorem 1 holds even if the X_i's form a supermartingale difference sequence, i.e., assumption (2) is replaced by

    E[X_i | X_1, . . . , X_{i−1}] ≤ 0,  P( E[X_i² | X_1, . . . , X_{i−1}] ≤ A_i² ) = 1,  and  P(X_i > B) = 0.

Theorem 2 follows readily, but Theorem 3 requires further restrictions that allow estimation of the A_i². Regarding the boundedness assumption, which may be restrictive for applications in statistics, finance, and economics, one can replace assumption (2) by

    E[X_i] = 0,  Var(X_i) ≤ A_i²,  and  P(X_i > x) ≤ F̄(x) for all x ∈ R,    (11)

where F̄(·) is a survival function on [0, ∞), i.e., F̄(·) is non-increasing with F̄(0) = 1 and F̄(∞) = 0. For example, F̄(x) = 1/{1 + (x/K)^α} or F̄(x) = exp(−(x/K)^α), for some K > 0; α = 2 in the second example corresponds to sub-Gaussianity. Similar to (5), there exist random variables η_i = η_i(A_i, F̄) satisfying (11) such that

    sup_{X_1,...,X_n ∼ (11)} E[ ( ∑_{i=1}^n X_i − t )_+² ] = E[ ( ∑_{i=1}^n η_i − t )_+² ],  for all n ≥ 1 and t ∈ R,

where the supremum is taken over all distributions satisfying (11). Hence, Theorem 1 can be generalized, which in turn leads to generalizations of Theorems 2 and 3. The details on the construction of the η_i and the corresponding confidence sequence will be discussed elsewhere.
Appendix

A Competing Concentration Bounds
Theorem 5 (Hoeffding; Theorem 3.1.2 of [12]). If X_1, . . . , X_n are independent mean-zero random variables satisfying P(B_l ≤ X_i ≤ B) = 1, then

    P( S_n ≥ √( n (B − B_l)² log(1/δ) / 2 ) ) ≤ δ,  ∀ δ ∈ [0, 1].

(There is a generalization of Hoeffding's inequality that relaxes the boundedness assumption to a sub-Gaussian assumption; see [29] for details.)

Theorem 6 (Adaptive Hoeffding; Corollary 1 of [29]). If X_1, X_2, . . . are independent mean-zero random variables satisfying P(B_l ≤ X_i ≤ B) = 1, then

    P( ∃ n ≥ 1 : S_n ≥ (B − B_l) √( 0.6 n log(log_{1.1} n + 1) + 0.72 n log(5.2/δ) ) ) ≤ δ,  ∀ δ ∈ [0, 1].

Theorem 7 (Bernstein; Theorem 3.1.7 of [12]). If X_1, . . . , X_n, . . . are independent random variables satisfying (2), then

    P( S_n ≥ √( 2 ∑_{i=1}^n A_i² log(1/δ) + (B²/9) log²(1/δ) ) + (B/3) log(1/δ) ) ≤ δ,  ∀ δ ∈ [0, 1].

Theorem 8 (Empirical Bernstein; Eq. (5) of [20]). If X_1, X_2, . . . are independent mean zero random variables satisfying (2) with A_1 = A_2 = . . . = A, then

    P( ∃ n ≥ 1 : S_n ≥ √( 2nη Ã_n² log(3h(k_n)/δ) ) + 3Bη log(3h(k_n)/δ) ) ≤ δ,

where Ã_n² is the sample variance and k_n is the constant defined in Theorem 2.

B More Simulations
B.1 Hyperparameters of Stitching
There are two hyperparameters of our stitching method: (1) the spacing parameter η > 1 and (2) the power parameter c > 1 for the stitching function h_c(k) = ζ(c)(k + 1)^c, where ζ(·) is the Riemann zeta function.

Figure 4: The upper bound on S_n obtained by the adaptive Bentkus bound in Theorem 2 for different values of η. Both the variance A² and the upper bound B are known.

Figure 4 illustrates that the choice of η determines how the budget δ is distributed across different sample sizes. Figure 5 shows both the stitching function h_c(·) and the corresponding upper bound A-Bentkus obtains. For a fixed sample size n, the bigger h_c(k_n) is, the smaller the budget δ/h_c(k_n) it receives and hence the larger the required upper bound. Hence, the faster h_c(·) grows, the more conservative the upper bound (and, correspondingly, the wider the confidence interval) one will get.

B.2 Confidence Sequence for Bernoulli(0.5)

In this section, we present a comparison of our confidence sequence with
A-Hoeffding , E-Bernstein , HRMS-Bernstein , and
HRMS-Bernstein-GE on synthetic data from
Bernoulli p . q . In this case,12 k h c ( k ) = ζ ( c )( k + ) c c = 1.01c = 1.1c = 3.0 n c = 1.01c = 1.1c = 3.0 Figure 5:
Left:
The stitching function h c p¨q for different values of c . Right:
The upper bound of S n obtained by A-Bentkus with different values of c . Both the variance A “ { and the upperbound B “ { is known. Y , Y , . . . „ Bernoulli p . q and the variance is { . Hence in this case Hoeffding’s inequality issharp and nothing can be gained by variance exploitation. We note this very fact in our experiment,where our method behaves as well as A-Hoeffding for moderate to large sample sizes. Figures 6aand 6b show the comparison of confidence sequences in one replication and comparison of averagewidth over 1000 replications. As in the case of
Bernoulli p . q (Section 4.1), for small sample sizes, A-Hoeffding and
A-Bentkus behave very closely and are better than all other methods but for n moderately large, the sharpness of A-Bentkus clearly pays off by outperforming
A-Hoeffding andall other methods.
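The coverage experiments above can be reproduced in spirit with a few lines of code. The sketch below is a simplification, not the paper's A-Bentkus boundary: it builds a crude but valid confidence sequence for Bernoulli(0.5) data by allocating a budget $\delta/h(n)$ with $h(n) = \zeta(2) n^2 = (\pi^2/6) n^2$ to each sample size $n$ and applying the fixed-$n$ Hoeffding bound, then checks uniform coverage empirically. The names `union_hoeffding_radius` and `fail_freq` are ours.

```python
import numpy as np

def union_hoeffding_radius(n, delta):
    # Per-n budget delta / h(n) with h(n) = (pi^2/6) * n^2, so the budgets sum
    # to delta over all n >= 1.  Each term is a fixed-n two-sided Hoeffding
    # bound for the mean of [0, 1]-valued observations:
    # 2 * exp(-2 n r^2) = delta / h(n)  <=>  r = sqrt(log(2 h(n)/delta)/(2n)).
    h_n = (np.pi ** 2 / 6.0) * n ** 2
    return np.sqrt(np.log(2.0 * h_n / delta) / (2.0 * n))

rng = np.random.default_rng(0)
delta, N, reps = 0.05, 2000, 200
ns = np.arange(1, N + 1)
rad = union_hoeffding_radius(ns, delta)
failures = 0
for _ in range(reps):
    y = rng.integers(0, 2, size=N).astype(float)     # Bernoulli(0.5) draws
    means = np.cumsum(y) / ns
    failures += np.any(np.abs(means - 0.5) > rad)    # mu = 0.5 ever missed?
fail_freq = failures / reps
```

The empirical failure frequency stays well below $\delta$, reflecting how conservative the naive per-$n$ union bound is compared with stitching.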
C Computation of $q(\delta; n, A_1^n, B)$

In this section we provide some details on the computation of $q(\delta; n, A_1^n, B)$, based on [6] and [23]. We will restrict to the case where $A_1^2 = A_2^2 = \cdots = A_n^2 = \cdots = A^2$. For any random variable $\eta$, define
\[
P_2(u; \eta) := \inf_{x \le u} \frac{E[(\eta - x)_+^2]}{(u - x)_+^2}.
\]
For any $A, B$, set $p_{AB} = A^2/(A^2 + B^2)$. Define Bernoulli random variables $R_1, R_2, \ldots, R_n$ as $P(R_i = 1) = p_{AB} = 1 - P(R_i = 0)$. Set $Z_n = \sum_{i=1}^n R_i$; then $Z_n$ is a binomial random variable with $n$ trials and success probability $p_{AB}$: $Z_n \sim \mathrm{Bin}(n, p_{AB})$. For $0 \le k \le n$, define
\[
p_k := P(Z_n \ge k), \qquad e_k := E[Z_n \mathbb{1}\{Z_n \ge k\}], \qquad v_k := E[Z_n^2 \mathbb{1}\{Z_n \ge k\}].
\]
Figure 6: Comparison of the 95% confidence sequences for the mean when $Y_i \sim$ Bernoulli(0.5): (a) one replication; (b) average width over 1000 replications. Except for A-Hoeffding, all other methods estimate the variance. A-Bentkus is the confidence sequence in (10). HRMS-Bernstein-GE involves a tuning parameter $\rho$, which is chosen to optimize the boundary at a fixed sample size. The upper and lower bounds of all the other methods are cut at 1 and 0 for a fair comparison. The failure frequency is 0 for all methods except HRMS-Bernstein-GE.

Proposition 2.
For all $u \in \mathbb{R}$,
\[
P_2\Big(u; \sum_{i=1}^n G_i\Big) = P_2\Big(\frac{Bu + nA^2}{A^2 + B^2}; Z_n\Big).
\]
Furthermore, for any $x \ge 0$ and $1 \le k \le n-1$,
\[
P_2(x; Z_n) = \begin{cases}
1, & \text{if } x \le np_{AB},\\[4pt]
\dfrac{np_{AB}(1-p_{AB})}{(x - np_{AB})^2 + np_{AB}(1-p_{AB})}, & \text{if } np_{AB} < x \le v_1/e_1,\\[4pt]
\dfrac{v_k p_k - e_k^2}{x^2 p_k - 2x e_k + v_k}, & \text{if } \dfrac{v_{k-1} - (k-1)e_{k-1}}{e_{k-1} - (k-1)p_{k-1}} < x \le \dfrac{v_k - k e_k}{e_k - k p_k},\\[4pt]
P(Z_n = n) = p_{AB}^n, & \text{if } x \ge \dfrac{v_{n-1} - (n-1)e_{n-1}}{e_{n-1} - (n-1)p_{n-1}} = n.
\end{cases}
\]
Formally, we can set $P_2(x; Z_n) = 0$ for all $x > n$ because $P(Z_n > n) = 0$.

Proof. The result is mostly an implication of Proposition 3.2 of [23]. It is clear that
\[
M_n := \sum_{i=1}^n G_i \overset{d}{=} \frac{A^2 + B^2}{B}\Big(\sum_{i=1}^n R_i - \frac{nA^2}{A^2+B^2}\Big),
\]
where $R_i \sim \mathrm{Bernoulli}(A^2/(A^2+B^2))$, that is, $P(R_i = 1) = p_{AB} = 1 - P(R_i = 0)$. Proposition 3.2(vi) of [23] implies that
\[
P_2(u; M_n) = P_2\Big(\frac{Bu + nA^2}{A^2+B^2}; Z_n\Big).
\]
It remains to compute $P_2(x; Z_n)$ for all $x \in \mathbb{R}$. The support of $Z_n$ is $\mathrm{supp}(Z_n) = \{0, 1, 2, \ldots, n\}$. Proposition 3.2(iv) of [23] (with $\alpha = 2$) implies that
\[
P_2(x; Z_n) = \begin{cases} 1, & \text{if } x \le np_{AB},\\ P(Z_n = n), & \text{if } x \ge n.\end{cases}
\]
Furthermore, $x \mapsto P_2(x; Z_n)$ is strictly decreasing on $(np_{AB}, n)$. Define the function $F: \mathbb{R} \to \mathbb{R}$ by
\[
F(h) := \frac{E[Z_n (Z_n - h)_+]}{E[(Z_n - h)_+]}. \tag{12}
\]
For any $np_{AB} < x < n$, let $h_x$ be the unique solution of
\[
F(h) = x. \tag{13}
\]
(Uniqueness here is established by Proposition 3.2(ii) of [23].) Then by Proposition 3.2(iii) of [23],
\[
P_2(x; Z_n) = \frac{E[(Z_n - h_x)_+^2]}{(x - h_x)_+^2} = \frac{E[Z_n(Z_n - h_x)_+] - h_x E[(Z_n - h_x)_+]}{(x - h_x)_+^2} = \frac{(x - h_x)\,E[(Z_n - h_x)_+]}{(x - h_x)_+^2} = \frac{E[(Z_n - h_x)_+]}{(x - h_x)_+}. \tag{14}
\]
This holds for all $nA^2/(A^2+B^2) < x < n$. We will now discuss solving (13). Proposition 3.2(i) of [23] implies that $h \mapsto F(h)$ is continuous and increasing. If $h \le 0$,
\[
F(h) = \frac{E[Z_n(Z_n - h)]}{E[Z_n - h]} = \frac{np_{AB}(1-p_{AB}) + n^2p_{AB}^2 - h n p_{AB}}{np_{AB} - h} = np_{AB} + \frac{np_{AB}(1-p_{AB})}{np_{AB} - h}.
\]
This is strictly increasing on $(-\infty, 0]$, and $F(0) = np_{AB} + (1 - p_{AB})$.
We get that for any $np_{AB} < x \le np_{AB} + (1 - p_{AB})$,
\[
F(h) = x \iff h_x = np_{AB} - \frac{np_{AB}(1-p_{AB})}{x - np_{AB}}.
\]
This further implies (from (14)) that
\[
P_2(x; Z_n) = \frac{E[Z_n - h_x]}{x - h_x} = \frac{np_{AB}(1-p_{AB})}{(x - np_{AB})^2 + np_{AB}(1-p_{AB})}, \qquad \text{for } np_{AB} < x \le np_{AB} + (1-p_{AB}).
\]
For $0 < h \le n-1$, set $k = \lceil h \rceil$; in other words, $k - 1 < h \le k$. Since $\{Z_n \ge h\} \iff \{Z_n \ge k\}$, we have
\[
E[Z_n(Z_n - h)_+] = E[Z_n^2 \mathbb{1}\{Z_n \ge k\}] - h\, E[Z_n \mathbb{1}\{Z_n \ge k\}], \qquad
E[(Z_n - h)_+] = E[Z_n \mathbb{1}\{Z_n \ge k\}] - h\, P(Z_n \ge k).
\]
Therefore,
\[
F(h) = \frac{E[Z_n^2 \mathbb{1}\{Z_n \ge k\}] - h\, E[Z_n \mathbb{1}\{Z_n \ge k\}]}{E[Z_n \mathbb{1}\{Z_n \ge k\}] - h\, P(Z_n \ge k)} = \frac{v_k - h e_k}{e_k - h p_k}.
\]
It is not difficult to verify that $F(\cdot)$ is strictly increasing on $(k-1, k]$ and hence
\[
h_x = \frac{v_k - x e_k}{e_k - x p_k}, \qquad \text{if } F(k-1) < x \le F(k).
\]
Substituting this $h_x$ in (14) yields the value of $P_2(x; Z_n)$, that is,
\[
P_2(x; Z_n) = \Big(e_k - \frac{v_k - x e_k}{e_k - x p_k}\, p_k\Big)\Big/\Big(x - \frac{v_k - x e_k}{e_k - x p_k}\Big) = \frac{e_k^2 - v_k p_k}{2 x e_k - x^2 p_k - v_k},
\]
whenever $F(k-1) < x \le F(k)$, where $F(k) = (v_k - k e_k)/(e_k - k p_k)$ for $1 \le k \le n-1$. Hence for $1 \le k \le n-1$,
\[
P_2(x; Z_n) = \frac{v_k p_k - e_k^2}{x^2 p_k - 2 x e_k + v_k}, \qquad \text{whenever } \frac{v_{k-1} - (k-1)e_{k-1}}{e_{k-1} - (k-1)p_{k-1}} < x \le \frac{v_k - k e_k}{e_k - k p_k}.
\]
Finally, we prove that $F(\cdot)$ is constant on $[n-1, n]$. It is clear that
\[
F(n-1) = \frac{v_{n-1} - (n-1)e_{n-1}}{e_{n-1} - (n-1)p_{n-1}} = \frac{(n^2 - n(n-1))\,P(Z_n = n)}{(n - (n-1))\,P(Z_n = n)} = n.
\]
Further, if $h > n-1$, then $(Z_n - h)_+ > 0$ if and only if $Z_n = n$, and hence from (12),
\[
F(h) = \frac{E[Z_n(Z_n - h)_+]}{E[(Z_n - h)_+]} = \frac{n(n-h)\,P(Z_n = n)}{(n-h)\,P(Z_n = n)} = n.
\]
Therefore, the function $F(h)$ is constant on $[n-1, n]$.
Figure 7: Example functions $F(h)$ and $P_2(x; Z_n)$ for a small binomial example ($n = 3$, $B = 1$). We plot $P_2(x; Z_n)$ in both linear (second plot) and log (third plot) scales on the y-axis.

For $h > n$, we set $F(h) = n$ since $P(Z_n > h) = 0$. To put all the pieces together, we have
\[
F(h) = \begin{cases}
np_{AB} + \dfrac{np_{AB}(1-p_{AB})}{np_{AB} - h}, & \text{if } h \le 0,\\[4pt]
\dfrac{v_{\lceil h\rceil} - h\, e_{\lceil h\rceil}}{e_{\lceil h\rceil} - h\, p_{\lceil h\rceil}}, & \text{if } 0 < h \le n-1,\\[4pt]
n, & \text{if } h > n-1.
\end{cases}
\]
Consequently, for $np_{AB} < x < n$,
\[
h_x = F^{-1}(x) = \begin{cases}
np_{AB} - \dfrac{np_{AB}(1-p_{AB})}{x - np_{AB}}, & \text{if } np_{AB} < x \le np_{AB} + (1-p_{AB}),\\[4pt]
\dfrac{v_k - x e_k}{e_k - x p_k}, & \text{if } F(k-1) < x \le F(k),\ 1 \le k \le n-1.
\end{cases}
\]

C.1 Computation of the Quantile
Recall $p_{AB} = A^2/(A^2+B^2)$, $Z_n = \sum_{i=1}^n R_i$, and that $\sum_{i=1}^n G_i$ is identically distributed as $B^{-1}(A^2+B^2)(Z_n - np_{AB})$. We will compute $x_\delta$ such that
\[
P_2(x_\delta; Z_n) = \delta. \tag{15}
\]
This implies that
\[
P_2\Big(\frac{(A^2+B^2)x_\delta - nA^2}{B}; \sum_{i=1}^n G_i\Big) = \delta, \qquad \text{or equivalently,} \qquad q(\delta; n, A^2, B) = \frac{(A^2+B^2)x_\delta - nA^2}{B}.
\]
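As a sanity check on this reduction, $P_2(\cdot; Z_n)$ and the quantile $x_\delta$ can also be computed by brute force: minimize over $h$ on a grid and bisect on $u$. This is an illustrative sketch (the names `P2` and `binom_quantile` are ours), far slower than the closed-form $O(n)$ case analysis of Proposition 2.

```python
import math

def P2(u, n, p, grid=4000):
    # P2(u; Z_n) = inf_{h <= u} E[(Z_n - h)_+^2] / (u - h)^2 for Z_n ~ Bin(n, p),
    # approximated by a grid search over h; intended for n*p < u <= n.
    pmf = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    best = 1.0
    lo = -3.0 * n                       # h far below the mean adds nothing
    for i in range(grid):
        h = lo + (u - 1e-9 - lo) * i / (grid - 1)
        num = sum(pmf[k] * (k - h) ** 2 for k in range(n + 1) if k > h)
        best = min(best, num / (u - h) ** 2)
    return best

def binom_quantile(delta, n, p, tol=1e-6):
    # Smallest u in (np, n] with P2(u; Z_n) <= delta, found by bisection;
    # P2(.; Z_n) is non-increasing on this interval.
    lo, hi = n * p, float(n)
    if P2(hi, n, p) > delta:            # delta below P(Z_n = n): cap at u = n
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if P2(mid, n, p) <= delta:
            hi = mid
        else:
            lo = mid
    return hi
```

For instance, with $n = 5$ and $p = 0.3$, the value `P2(1.85, 5, 0.3)` agrees with the Cantelli-type second case of the closed form, $np(1-p)/((x-np)^2 + np(1-p))$, to grid accuracy.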
Hence we concentrate on solving (15). Recall that for any $x \ge 0$ and $1 \le k \le n-1$,
\[
P_2(x; Z_n) = \begin{cases}
1, & \text{if } x \le np_{AB},\\[4pt]
\dfrac{np_{AB}(1-p_{AB})}{(x-np_{AB})^2 + np_{AB}(1-p_{AB})}, & \text{if } np_{AB} < x \le v_1/e_1 = np_{AB} + (1-p_{AB}),\\[4pt]
\dfrac{v_k p_k - e_k^2}{x^2 p_k - 2x e_k + v_k}, & \text{if } \dfrac{v_{k-1} - (k-1)e_{k-1}}{e_{k-1} - (k-1)p_{k-1}} < x \le \dfrac{v_k - ke_k}{e_k - kp_k},\\[4pt]
P(Z_n = n) = p_{AB}^n, & \text{if } x \ge \dfrac{v_{n-1} - (n-1)e_{n-1}}{e_{n-1} - (n-1)p_{n-1}} = n.
\end{cases} \tag{16}
\]
The function $P_2(\cdot; Z_n)$ is a non-increasing function, and hence if $\delta \le p_{AB}^n$, then we get $x_\delta = n$; this corresponds to the last case in (16). If $P_2(v_1/e_1; Z_n) \le \delta \le 1$, then
\[
x_\delta = np_{AB} + \sqrt{\frac{(1-\delta)\, np_{AB}(1-p_{AB})}{\delta}};
\]
this corresponds to the first and second cases in (16); note that $P_2(v_1/e_1; Z_n) = np_{AB}(1-p_{AB})/[(1-p_{AB})^2 + np_{AB}(1-p_{AB})]$. For the remaining cases, note that if there exists $1 \le k \le n-1$ such that
\[
P_2\Big(\frac{v_k - ke_k}{e_k - kp_k}; Z_n\Big) \le \delta \le P_2\Big(\frac{v_{k-1} - (k-1)e_{k-1}}{e_{k-1} - (k-1)p_{k-1}}; Z_n\Big),
\]
then
\[
\frac{v_{k-1} - (k-1)e_{k-1}}{e_{k-1} - (k-1)p_{k-1}} \le x_\delta \le \frac{v_k - ke_k}{e_k - kp_k}, \tag{17}
\]
and using the closed-form expression of $P_2(\cdot; Z_n)$ on this interval, we get
\[
x_\delta = \frac{e_k + \sqrt{e_k^2 - p_k \big(v_k - (v_k p_k - e_k^2)/\delta\big)}}{p_k}. \tag{18}
\]
Using these calculations, one can find $k$ by looping over $1 \le k \le n-1$ until (17) holds. This approach has a complexity of $O(n)$, assuming the availability of $p_k$, $e_k$, and $v_k$. We now describe an approach that reduces the complexity by finding quick-to-compute upper and lower bounds on $x_\delta$. Lemmas 1.1 and 3.1 of [8] show that
\[
P(Z_n \ge x) \le P_2(x; Z_n) \le e\, P^{\circ}(Z_n \ge x), \tag{19}
\]
where $P^{\circ}(Z_n \ge x)$ represents the log-linear interpolation of $P(Z_n \ge x)$; that is, for $x \in \{0, 1, \ldots, n\}$,
\[
P^{\circ}(Z_n \ge x) = P(Z_n \ge x), \tag{20}
\]
and for $x \in (k-1, k)$ such that $x = (1-\lambda)(k-1) + \lambda k$,
\[
P^{\circ}(Z_n \ge x) = \big(P(Z_n \ge k-1)\big)^{1-\lambda} \big(P(Z_n \ge k)\big)^{\lambda}.
\]
Equation (2) of [5] further shows that
\[
P^{\circ}(Z_n \ge x) \le (1-\lambda)\, P(Z_n \ge k-1) + \lambda\, P(Z_n \ge k).
\]
(21) Hence, to find $x = x_\delta$ satisfying $P_2(x; Z_n) = \delta$, find $k_1 \in \{0, 1, \ldots, n\}$ such that $P(Z_n \ge k_1) \ge \delta$. This implies (from (19)) that $P_2(k_1; Z_n) \ge \delta$, and because $x \mapsto P_2(x; Z_n)$ is decreasing, $x_\delta \ge k_1$. Further, find $k_2 \in \{0, 1, \ldots, n\}$ such that $P(Z_n \ge k_2) \le \delta/e$; then $P^{\circ}(Z_n \ge k_2) = P(Z_n \ge k_2) \le \delta/e$. Hence using (19) and (20), we get $P_2(k_2; Z_n) \le \delta$, which implies that $x_\delta \le k_2$. Summarizing this discussion, we get that $x_\delta$ satisfying $P_2(x_\delta; Z_n) = \delta$ also satisfies
\[
k_1 \le x_\delta \le k_2, \tag{22}
\]
where $P(Z_n \ge k_1) \ge \delta$ and $P(Z_n \ge k_2) \le \delta/e$. The bounds in (22) are not directly usable because the closed-form expression (18) of $x_\delta$ requires upper and lower bounds for $x_\delta$ in terms of the ratios $(v_k - ke_k)/(e_k - kp_k)$. Now we note that, for any $k$,
\[
v_k \ge k e_k \ge k^2 p_k \implies \frac{v_k - k e_k}{e_k - k p_k} \ge k.
\]
This combined with (22) proves that
\[
k_1 \le x_\delta \le k_2 \le \frac{v_{k_2} - k_2 e_{k_2}}{e_{k_2} - k_2 p_{k_2}}.
\]
The lower bound here is still not in terms of the ratios $(v_k - ke_k)/(e_k - kp_k)$. But given the upper bound, we can search for $k \le k_2$ (by running a loop from $k_2$ down to 0) such that
\[
\frac{v_{k-1} - (k-1)e_{k-1}}{e_{k-1} - (k-1)p_{k-1}} \le x_\delta \le \frac{v_k - ke_k}{e_k - kp_k}. \tag{23}
\]
Another approach is to make use of the lower bound in (22). Because $k_1 \le (v_{k_1} - k_1 e_{k_1})/(e_{k_1} - k_1 p_{k_1})$, there are two possibilities: 1. $k_1 \le x_\delta \le (v_{k_1} - k_1 e_{k_1})/(e_{k_1} - k_1 p_{k_1})$; 2. $k_1 \le (v_{k_1} - k_1 e_{k_1})/(e_{k_1} - k_1 p_{k_1}) < x_\delta$. In the first case, it suffices to search for $k \le k_1$ such that (23) holds. In the second case, we can search over $k_1 + 1 \le k \le k_2$ as before.

D Proof of Theorem 1
It is clear that $(S_t, \mathcal{F}_t)_{t=1}^n$ with $\mathcal{F}_t = \sigma\{X_1, \ldots, X_t\}$ is a martingale because $E[S_t \mid \mathcal{F}_{t-1}] = S_{t-1} + E[X_t] = S_{t-1}$. Consider now the process $D_t := (S_t - x)_+^2$ for a fixed $x > 0$. The function $f: y \mapsto (y-x)_+^2$ is continuously differentiable and satisfies
\[
f'(y) = \begin{cases} 0, & \text{if } y \le x,\\ 2(y-x), & \text{if } y > x,\end{cases} \qquad f''(y) = \begin{cases} 0, & \text{if } y < x,\\ 2, & \text{if } y > x.\end{cases}
\]
Therefore, $f(\cdot)$ is a convex function. This implies by Jensen's inequality that
\[
E[D_t \mid \mathcal{F}_{t-1}] = E[f(S_t) \mid \mathcal{F}_{t-1}] \ge f(S_{t-1}),
\]
so $(D_t, \mathcal{F}_t)_{t=1}^n$ is a submartingale. Doob's inequality now implies that
\[
P\Big(\max_{1\le t\le n} S_t \ge u\Big) \overset{(a)}{=} P\Big(\max_{1\le t\le n} (S_t - x)_+^2 \ge (u-x)_+^2\Big) = P\Big(\max_{1\le t\le n} D_t \ge (u-x)_+^2\Big) \overset{(b)}{\le} \frac{E[D_n]}{(u-x)_+^2} \le \frac{E[(S_n - x)_+^2]}{(u-x)_+^2}.
\]
Here equality (a) holds for every $x \le u$ and inequality (b) holds because of Doob's inequality. Because $x \le u$ is arbitrary, we get
\[
P\Big(\max_{1\le t\le n} S_t \ge u\Big) \le \inf_{x \le u} \frac{E[(S_n - x)_+^2]}{(u-x)_+^2},
\]
and condition (2) along with Theorem 2.1 of [8] (or [22]) implies that
\[
P\Big(\max_{1\le t\le n} S_t \ge u\Big) \le \inf_{x \le u} \frac{E[(\sum_{i=1}^n G_i - x)_+^2]}{(u-x)_+^2}.
\]
The definition (6) of $q(\delta; n, A_1^n, B)$ readily implies
\[
P\Big(\max_{1\le t\le n} S_t \ge q(\delta; n, A_1^n, B)\Big) \le \delta.
\]
This completes the proof of (7). We now prove the sharpness. Note that the condition
\[
P\Big(\max_{1\le t\le n} S_t \ge n\,\tilde{q}(\delta^{1/n}; A, B)\Big) \le \delta \quad \text{for all } \delta \in [0,1]
\]
is equivalent to the existence of a function $x \mapsto H(x; A, B)$ such that
\[
P\Big(\max_{1\le t\le n} S_t \ge nu\Big) \le H^n(u; A, B) \quad \text{for all } u.
\]
(The function $\delta \mapsto \tilde{q}(\delta^{1/n}; A, B)$ is the inverse of $u \mapsto H^n(u; A, B)$.) In particular, this implies that $P(S_n \ge nu) \le H^n(u; A, B)$ for all $u$. Now, Lemma 4.7 of [6] (also see Eq. (2.8) of [13]) implies that
\[
H^n(u; A, B) \ge \Big\{\Big(1 + \frac{Bu}{A^2}\Big)^{-(A^2 + Bu)/(A^2 + B^2)} \Big(1 - \frac{u}{B}\Big)^{-(B^2 - Bu)/(B^2 + A^2)}\Big\}^n = \inf_{h \ge 0} e^{-nhu}\, E\big[e^{h \sum_{i=1}^n G_i}\big],
\]
where $G_1, \ldots, G_n$ are independent random variables constructed through (3).
Proposition 3.5 of [23] implies that
\[
\inf_{h \ge 0} e^{-nhu}\, E\big[e^{h\sum_{i=1}^n G_i}\big] \ge \inf_{x \le nu} \frac{E[(\sum_{i=1}^n G_i - x)_+^2]}{(nu - x)_+^2}.
\]
Summarizing the inequalities, we conclude
\[
P(S_n \ge nu) \le \inf_{x \le nu} \frac{E[(\sum_{i=1}^n G_i - x)_+^2]}{(nu - x)_+^2} \le \inf_{h \ge 0} E\big[e^{h\sum_{i=1}^n G_i - hnu}\big] \le H^n(u; A, B) \qquad \forall\, u.
\]
This proves that $q(\delta; n, A^2, B) \le n\,\tilde{q}(\delta^{1/n}; A, B)$ for any valid $\tilde{q}(\cdot; A, B)$.

E Proof of Theorem 2
The proof is based on (7) and a union bound. It is clear that
\[
P\Big(\exists\, t \ge 1: \sum_{i=1}^t X_i \ge q\big(\delta/h(k_t); c_t, A^2, B\big)\Big)
= P\Big(\bigcup_{k=0}^{\infty} \Big\{\exists\, \lceil \eta^k \rceil \le t \le \lfloor \eta^{k+1} \rfloor : \sum_{i=1}^t X_i \ge q\big(\delta/h(k_t); c_t, A^2, B\big)\Big\}\Big)
\]
\[
= P\Big(\bigcup_{k=0}^{\infty} \Big\{\exists\, \lceil \eta^k \rceil \le t \le \lfloor \eta^{k+1} \rfloor : \sum_{i=1}^t X_i \ge q\big(\delta/h(k); \lfloor \eta^{k+1} \rfloor, A^2, B\big)\Big\}\Big)
\le \sum_{k=0}^{\infty} P\Big(\max_{\lceil \eta^k \rceil \le t \le \lfloor \eta^{k+1} \rfloor} \sum_{i=1}^t X_i \ge q\big(\delta/h(k); \lfloor \eta^{k+1} \rfloor, A^2, B\big)\Big)
\le \sum_{k=0}^{\infty} \frac{\delta}{h(k)} \le \delta.
\]

F Proof of Theorem 3
Theorem 2 implies that
\[
P\big(\exists\, n \ge 1: S_n \ge q(\delta_1/h(k_n); c_n, A^2, B)\big) \le \delta_1.
\]
Lemma F.1 (below) proves $P(\exists\, n \ge 1: A^2 \ge \bar{A}_n^2(\delta_2)) \le \delta_2$. In particular, this implies that
\[
P\Big(\exists\, n \ge 1: A^2 \ge \min_{1 \le s \le n} \bar{A}_s^2(\delta_2)\Big) \le \delta_2.
\]
Combining the inequalities above with a union bound (and Lemma H.2) proves the result.
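The union-bound steps in the last two proofs spend a budget $\delta/h(k)$ on epoch $k$; with the stitching function $h_c(k) = \zeta(c)(k+1)^c$ these budgets sum to exactly $\delta$, since $\sum_{k\ge 0}(k+1)^{-c} = \zeta(c)$. A quick numerical confirmation (the truncated-zeta helper is ours):

```python
import math

def zeta(c, terms=200_000):
    # Truncated Riemann zeta with an integral tail correction:
    # sum_{m > N} m^(-c) ~ integral_N^inf x^(-c) dx = N^(1-c)/(c-1).
    s = sum(m ** (-c) for m in range(1, terms + 1))
    return s + terms ** (1 - c) / (c - 1)

def h(k, c, z):
    return z * (k + 1) ** c             # stitching function h_c(k)

c = 1.1
z = zeta(c)
N = 200_000
# fraction of delta consumed by epochs 0..N-1, plus the same tail correction
spent = sum(1.0 / h(k, c, z) for k in range(N)) + (N ** (1 - c) / (c - 1)) / z
```

By construction `spent` equals 1 up to floating-point error, so the total failure probability across all epochs is exactly the nominal $\delta$.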
Lemma F.1.
Under the assumptions of Theorem 3, we have for any $\delta \in [0,1]$,
\[
P\bigg(\exists\, t \ge 2:\ V_t - \lfloor t/2 \rfloor A^2 \le -\sqrt{\frac{\lfloor c_t/2 \rfloor}{2}}\,(B_2 - B_1)\, A\, \Phi^{-1}\Big(1 - \frac{\delta}{e\, h(k_t)}\Big)\bigg) \le \delta, \tag{24}
\]
where $W_i = (X_{2i} - X_{2i-1})^2/2$ and $V_t := \sum_{i=1}^{\lfloor t/2 \rfloor} W_i$.

Proof.
Fix $x \ge 0$, and within this proof write $V_t = \sum_{i=1}^t W_i$ for the sum over the first $t$ pairs. Note that for any $u \ge -x$,
\[
P\Big(\min_{1\le t\le n}\{V_t - tA^2\} \le -x\Big) = P\Big(\max_{1\le t\le n} \big(u - \{V_t - tA^2\}\big)_+^2 \ge (u+x)_+^2\Big) \le \frac{E[(u - \{V_n - nA^2\})_+^2]}{(u+x)_+^2},
\]
because $V_t - tA^2$ is a mean-zero martingale ($E[W_i] = A^2$), $y \mapsto (u - y)_+^2$ is convex, and hence $\{(u - \{V_t - tA^2\})_+^2\}_{t \ge 1}$ is a submartingale. Therefore,
\[
P\Big(\min_{1\le t\le n}\{V_t - tA^2\} \le -x\Big) \le \inf_{u \ge -x} \frac{E[(u - \{V_n - nA^2\})_+^2]}{(u+x)_+^2} = \inf_{u \ge -x} \frac{E[(u + nA^2 - V_n)_+^2]}{(u+x)_+^2} = \inf_{u \ge nA^2 - x} \frac{E[(u - V_n)_+^2]}{(u - nA^2 + x)^2}.
\]
Corollary 2.7 (Eq. (2.24)) of [24] implies that
\[
\inf_{u \ge nA^2 - x} \frac{E[(u - V_n)_+^2]}{(u - nA^2 + x)^2} \le P_2\big(E_{1,n} - x;\ E_{1,n} + Z\sqrt{E_{2,n}}\big), \tag{25}
\]
where $E_{j,n} = \sum_{i=1}^n E[W_i^j]$ for $j = 1, 2$ (so that $E_{1,n} = nA^2$), and $Z$ stands for a standard normal random variable. Inequality (25) is not the best inequality to use and there is a more precise version; see Theorem 2.4(I) and Corollary 2.7 of [24]. With the best inequality, the following steps would lead to a refined upper bound on $A^2$; we will not pursue this direction here. It now follows from [7] that
\[
P_2\big(E_{1,n} - x;\ E_{1,n} + Z\sqrt{E_{2,n}}\big) \le e\, P\Big(Z \le -\frac{x}{\sqrt{E_{2,n}}}\Big).
\]
Because $X_i \in [B_1, B_2]$ with probability 1, $W_i \le (B_2 - B_1)^2/2$ and hence
\[
E_{2,n} = \sum_{i=1}^n E[W_i^2] \le \frac{(B_2 - B_1)^2}{2} \sum_{i=1}^n E[W_i] = \frac{(B_2 - B_1)^2\, E_{1,n}}{2} = \frac{n (B_2 - B_1)^2 A^2}{2}.
\]
This implies that
\[
P\Big(\min_{1\le t\le n}\{V_t - tA^2\} \le -x\Big) \le e\, P\Big(Z \le -\frac{\sqrt{2}\, x}{\sqrt{n}\,(B_2 - B_1)\, A}\Big).
\]
Equating the right-hand side to $\delta$ yields
\[
P\bigg(\min_{1\le t\le n}\{V_t - tA^2\} \le -\sqrt{\frac{n}{2}}\,(B_2 - B_1)\, A\, \Phi^{-1}\Big(1 - \frac{\delta}{e}\Big)\bigg) \le \delta. \tag{26}
\]
Because of this maximal inequality, we can apply stitching exactly as in Appendix E: splitting $t$ into epochs $\lceil \eta^k \rceil \le t \le \lfloor \eta^{k+1} \rfloor$, applying (26) with $\lfloor c_t/2 \rfloor$ pairs and $\delta/h(k)$ in place of $\delta$, and summing $\sum_{k \ge 0} \delta/h(k) \le \delta$ yields (24). Inequality (24) yields
\[
P\bigg(\lfloor t/2\rfloor A^2 - \sqrt{\frac{\lfloor c_t/2\rfloor}{2}}\,(B_2-B_1)\, A\,\Phi^{-1}\Big(1 - \frac{\delta}{e\, h(k_t)}\Big) - V_t \le 0 \ \ \forall\, t \ge 2\bigg) \ge 1 - \delta.
\]
The inequality
\[
\lfloor t/2\rfloor A^2 - \sqrt{\frac{\lfloor c_t/2\rfloor}{2}}\,(B_2-B_1)\, A\,\Phi^{-1}\Big(1 - \frac{\delta}{e\, h(k_t)}\Big) - V_t \le 0
\]
holds for $A > 0$ if and only if $A \le g_{1,t} + \sqrt{g_{1,t}^2 + g_{2,t}}$, where
\[
g_{1,t} = \frac{1}{2\lfloor t/2\rfloor}\sqrt{\frac{\lfloor c_t/2\rfloor}{2}}\,(B_2 - B_1)\,\Phi^{-1}\Big(1 - \frac{\delta}{e\, h(k_t)}\Big) \qquad \text{and} \qquad g_{2,t} = \frac{V_t}{\lfloor t/2 \rfloor}.
\]
Hence a rewriting of (24) is
\[
P\Big(A \le g_{1,t} + \sqrt{g_{1,t}^2 + g_{2,t}} \ \ \forall\, t \ge 2\Big) \ge 1 - \delta.
\]
It is clear that $g_{1,t} = O(1/\sqrt{t})$ and $E[V_t/\lfloor t/2\rfloor] = A^2$, and hence the upper bound above grows like $A + O(\sqrt{\log(h(k_t))/t})$.

G Proof of Theorem 4
The assumption $P(L \le X_i \le U) = 1$ implies that $P(L - \mu \le X_i - \mu \le U - \mu) = 1$, and hence applying Theorem 2 with $X_i - \mu$ and its upper bound $U - \mu$ yields
\[
P\bigg(\exists\, n \ge 1: \sum_{i=1}^n (X_i - \mu) \ge q\Big(\frac{\delta_1}{2h(k_n)}; c_n, A^2, U - \mu\Big)\bigg) \le \frac{\delta_1}{2}. \tag{27}
\]
Similarly, applying Theorem 2 with $\mu - X_i$ and its upper bound $\mu - L$ yields
\[
P\bigg(\exists\, n \ge 1: \sum_{i=1}^n (\mu - X_i) \ge q\Big(\frac{\delta_1}{2h(k_n)}; c_n, A^2, \mu - L\Big)\bigg) \le \frac{\delta_1}{2}. \tag{28}
\]
Finally, Lemma F.1 implies that
\[
P\big(\exists\, n \ge 1: A^2 \ge \bar{A}_n^{*2}(\delta_2; U, L)\big) \le \delta_2. \tag{29}
\]
Now combining inequalities (27), (28), and (29) yields, with probability $\ge 1 - \delta_1 - \delta_2$, for all $n \ge 1$,
\[
-\frac{1}{n} q\Big(\frac{\delta_1}{2h(k_n)}; c_n, A^2, \mu - L\Big) \le \frac{S_n}{n} - \mu \le \frac{1}{n} q\Big(\frac{\delta_1}{2h(k_n)}; c_n, A^2, U - \mu\Big), \qquad \text{and} \qquad A^2 \le \bar{A}_n^{*2}(\delta_2).
\]
On this event, we get, by using $U - \mu \le U - L$ and $\mu - L \le U - L$, that $\mu_0^{\mathrm{low}} \le \mu \le \mu_0^{\mathrm{up}}$, and then recursively using $\mu_{n-1}^{\mathrm{low}} \le \mu \le \mu_{n-1}^{\mathrm{up}}$,
\[
-\frac{1}{n} q\Big(\frac{\delta_1}{2h(k_n)}; c_n, \bar{A}_n^{*2}(\delta_2), \mu_{n-1}^{\mathrm{up}} - L\Big) \le \frac{S_n}{n} - \mu \le \frac{1}{n} q\Big(\frac{\delta_1}{2h(k_n)}; c_n, \bar{A}_n^{*2}(\delta_2), U - \mu_{n-1}^{\mathrm{low}}\Big).
\]
This proves the result.
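The recursion in this proof, starting from the trivial bounds $[L, U]$ and re-applying the deviation inequality with the current one-sided ranges, can be sketched numerically. For illustration we plug in a Hoeffding-type radius $b\sqrt{\log(2/\delta)/(2n)}$ in place of the Bentkus quantile $q$; `radius` and `refine` are our illustrative names, not the paper's implementation.

```python
import numpy as np

def radius(n, b, delta):
    # Hoeffding-type deviation radius for the sample mean of variables
    # whose relevant one-sided range is b (a stand-in for q(...)/n).
    return b * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def refine(mean, n, L, U, delta, iters=20):
    # Recursively tighten [mu_low, mu_up], starting from the trivial [L, U]:
    # the lower endpoint uses the current upper range U - mu_low,
    # the upper endpoint uses the current lower range mu_up - L.
    lo, up = L, U
    for _ in range(iters):
        new_lo = mean - radius(n, U - lo, delta)   # range shrinks as lo rises
        new_up = mean + radius(n, up - L, delta)   # range shrinks as up falls
        lo, up = max(lo, new_lo), min(up, new_up)
    return lo, up
```

The endpoints are monotone (lo non-decreasing, up non-increasing), so the recursion converges; the final interval is strictly narrower than the one obtained from the full range $U - L$.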
H Auxiliary Results
Define $M_t$, $t \ge 1$, as $M_t := \sum_{i=1}^t G_i$, with
\[
P\Big(G_i = -\frac{A_i^2}{B}\Big) = \frac{B^2}{A_i^2 + B^2} \qquad \text{and} \qquad P(G_i = B) = \frac{A_i^2}{A_i^2 + B^2}.
\]
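This two-point law can be checked to be mean zero with variance $A_i^2$ by exact rational arithmetic; `two_point_moments` is our helper name for the sketch below.

```python
from fractions import Fraction

def two_point_moments(A2, B):
    # Exact (mean, variance) of the two-point law:
    # G = -A2/B with prob B^2/(A2 + B^2), G = B with prob A2/(A2 + B^2),
    # where A2 stands for A^2.
    A2, B = Fraction(A2), Fraction(B)
    p_neg = B ** 2 / (A2 + B ** 2)          # P(G = -A2/B)
    p_pos = A2 / (A2 + B ** 2)              # P(G = B)
    mean = p_neg * (-A2 / B) + p_pos * B
    var = p_neg * (A2 / B) ** 2 + p_pos * B ** 2 - mean ** 2
    return mean, var
```

For any rational inputs the mean comes out exactly 0 and the variance exactly `A2`, matching the construction of $G_i$.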
For any $t \ge 1$ and $x \in \mathbb{R}$, the map $(A_1^2, \ldots, A_t^2) \mapsto E[(M_t - x)_+^2]$ is non-decreasing.

Proof. Suppose we prove that, for every $y \in \mathbb{R}$,
\[
A_1^2 \mapsto E[(G_1 - y)_+^2] \ \text{is non-decreasing}; \tag{30}
\]
then by conditioning on $G_2, \ldots, G_t$ and taking $y = x - G_2 - \cdots - G_t$, we get for $A_1^2 \le \tilde{A}_1^2$,
\[
E[(G_1(A_1^2) - y)_+^2] \le E[(G_1(\tilde{A}_1^2) - y)_+^2].
\]
Now taking expectations on both sides with respect to $G_2, \ldots, G_t$ implies non-decreasingness of $A_1^2 \mapsto E[(M_t - x)_+^2]$. This implies the result. To prove (30), write
\[
E[(G_1 - y)_+^2] = \frac{B^2}{A_1^2 + B^2}\Big(-\frac{A_1^2}{B} - y\Big)_+^2 + \frac{A_1^2}{A_1^2 + B^2}(B - y)_+^2.
\]
Because $A_1^2 \mapsto A_1^2/B^2$ is increasing, it suffices to show that this expression is non-decreasing with respect to $p = A_1^2/B^2$. Define
\[
g(p) = \frac{1}{1+p}(-Bp - y)_+^2 + \frac{p}{1+p}(B - y)_+^2.
\]
Differentiating with respect to $p$ yields
\[
\frac{\partial g(p)}{\partial p} = \frac{-(-Bp - y)_+^2 - 2B(1+p)(-Bp - y)_+ + (B - y)_+^2}{(1+p)^2}.
\]
If $y \le -Bp$, then $(-Bp - y)_+ = -(Bp + y)$ and $B - y > 0$, and hence
\[
\frac{\partial g(p)}{\partial p} = \frac{-(Bp + y)^2 + 2B(1+p)(Bp + y) + (B - y)^2}{(1+p)^2} = \frac{B^2 + 2B^2 p + B^2 p^2}{(1+p)^2} = B^2 > 0.
\]
If $-Bp < y < B$, then only the second term is non-zero and
\[
\frac{\partial g(p)}{\partial p} = \frac{(B - y)^2}{(1+p)^2} > 0.
\]
If $y \ge B$, then $\partial g(p)/\partial p = 0$. Hence $\partial g(p)/\partial p \ge 0$ for all $p$. This proves (30). Recall the definition of $q(\delta; t, A_1^t, B)$ from (6). In the case of equal variances, that is, $A_1^2 = A_2^2 = \cdots = A^2$, we write $A^2$, $q(\delta; t, A^2, B)$ for $A_1^t$, $q(\delta; t, A_1^t, B)$, respectively. We now prove that $A^2 \mapsto q(\delta; t, A^2, B)$ is a non-decreasing function.

Lemma H.2.
For any $t \ge 1$, the function $A^2 \mapsto q(\delta; t, A^2, B)$ is a non-decreasing function.

Proof. Lemma H.1 proves that $A^2 \mapsto E[(M_t - x)_+^2]$ is non-decreasing. This implies that $I(A^2; u)$ is also non-decreasing in $A^2$, where
\[
I(A^2; u) := \inf_{x \le u} \frac{E[(M_t - x)_+^2]}{(u - x)_+^2}.
\]
Lemma 3.1 of [8] proves that $I(A^2; u)$ is non-increasing in $u$. Fix $A_1^2 \le A_2^2$. From the definition of $q$,
\[
I\big(A_1^2; q(\delta; t, A_1^2, B)\big) = \delta \qquad \text{and} \qquad I\big(A_2^2; q(\delta; t, A_2^2, B)\big) = \delta.
\]
Because $I(A^2; u)$ is non-decreasing in $A^2$,
\[
I\big(A_2^2; q(\delta; t, A_2^2, B)\big) = \delta = I\big(A_1^2; q(\delta; t, A_1^2, B)\big) \le I\big(A_2^2; q(\delta; t, A_1^2, B)\big).
\]
Hence $I(A_2^2; q(\delta; t, A_2^2, B)) \le I(A_2^2; q(\delta; t, A_1^2, B))$, and because $I(A^2; u)$ is non-increasing in $u$, we conclude that $q(\delta; t, A_1^2, B) \le q(\delta; t, A_2^2, B)$. This proves the result.

Lemma H.3.
For any $\delta \in [0,1]$, $q(\delta; t, A^2 B^2, B^2) = B\, q(\delta; t, A^2, B)$.

Proof. Recall that $q(\delta; t, A^2B^2, B^2)$ is defined as the solution $u$ of
\[
\inf_{x \le u} \frac{E[(M'_t - x)_+^2]}{(u - x)_+^2} = \delta,
\]
where $M'_t = \sum_{i=1}^t G'_i$ with
\[
P\Big(G'_i = -\frac{A^2 B^2}{B^2}\Big) = \frac{B^4}{A^2 B^2 + B^4} = \frac{B^2}{A^2 + B^2} \qquad \text{and} \qquad P(G'_i = B^2) = \frac{A^2 B^2}{A^2 B^2 + B^4} = \frac{A^2}{A^2 + B^2}.
\]
This implies that $G'_i \overset{d}{=} B G_i$ and hence $M'_t \overset{d}{=} B M_t$. Therefore,
\[
E[(M'_t - x)_+^2] = E[(B M_t - x)_+^2] = B^2\, E[(M_t - x/B)_+^2],
\]
and
\[
\inf_{x \le u} \frac{E[(M'_t - x)_+^2]}{(u - x)_+^2} = B^2 \inf_{x \le u} \frac{E[(M_t - x/B)_+^2]}{B^2 (u/B - x/B)_+^2} = \inf_{x \le u/B} \frac{E[(M_t - x)_+^2]}{(u/B - x)_+^2}.
\]
The right-hand side above equals $\delta$ when $u = B\, q(\delta; t, A^2, B)$, because the definition of $q(\delta; t, A^2, B)$ implies that
\[
\inf_{x \le q(\delta; t, A^2, B)} \frac{E[(M_t - x)_+^2]}{(q(\delta; t, A^2, B) - x)_+^2} = \delta.
\]
This completes the proof.
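The distributional identity behind this proof, $G'_i \overset{d}{=} B\, G_i$, can be verified exactly by comparing the two two-point laws; `two_point` is our helper name for this sketch.

```python
from fractions import Fraction

def two_point(A2, b):
    # Support and probabilities of the Bentkus two-point law with
    # variance A2 (= A^2) and upper bound b:
    # value -A2/b with prob b^2/(A2 + b^2), value b with prob A2/(A2 + b^2).
    A2, b = Fraction(A2), Fraction(b)
    return {(-A2 / b): b ** 2 / (A2 + b ** 2), b: A2 / (A2 + b ** 2)}

A2, B = Fraction(1, 3), Fraction(2)
scaled = {B * v: p for v, p in two_point(A2, B).items()}   # law of B * G_i
direct = two_point(A2 * B ** 2, B ** 2)                    # law of G'_i
```

With exact rationals the two dictionaries coincide, confirming that scaling $G_i$ by $B$ gives precisely the two-point law built with variance $A^2 B^2$ and bound $B^2$.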
I Alternative Empirical Bentkus Confidence Sequences with Estimated Variance
In Section 3.3, we presented one actionable version of Theorem 2, where we used an analytical upper bound on the variance $A^2$. In this section, we present an alternative empirical Bentkus confidence sequence that requires numerical computation. In our initial experiments, we found solving for the upper bound of $A^2$ in this way to be unstable. Because the proof technique here is very analogous to that of the empirical Bernstein bound in [2, Eq. (48)-(50)], we present the alternative bound below. Define the empirical variance as $\hat{A}_n^2 := n^{-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$, where $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. For any $\delta_1, \delta_2 \in [0,1]$, define
\[
\bar{A}_n^2 := \sup\bigg\{a^2 \ge 0 : \hat{A}_n^2 \ge a^2 - \frac{B}{n}\, q\Big(\frac{\delta_2}{h(k_n)}; c_n, a^2, B\Big) - \frac{1}{n^2}\, q^2\Big(\frac{\delta_1}{h(k_n)}; c_n, a^2, B\Big)\bigg\}.
\]
Lemma I.1 shows that $\bar{A}_n^2$ is an over-estimate of $A^2$ uniformly over $n$ and yields the following actionable bound. Recall that $S_n = \sum_{i=1}^n X_i = n\bar{X}_n$.

Theorem 9. If $X_1, X_2, \ldots$ are mean-zero independent random variables satisfying $\mathrm{Var}(X_i) = A^2$ and $P(|X_i| > B) = 0$ for all $i \ge 1$, then for any $\delta_1, \delta_2 \in [0,1]$,
\[
P\bigg(\exists\, n \ge 1: |S_n| \ge q\Big(\frac{\delta_1}{h(k_n)}; c_n, \bar{A}_n^{*2}, B\Big) \ \text{or} \ A^2 \ge \bar{A}_n^{*2}(\delta_2)\bigg) \le \delta_1 + \delta_2,
\]
where $\bar{A}_n^{*2} := \min_{1 \le s \le n} \bar{A}_s^2$. Here $k_n$ and $c_n$ are the same as those defined in Theorem 2.

This theorem is an analogue of the empirical Bernstein inequality [20, Eq. (5)]. Furthermore, the upper bound $\bar{A}_n^2$ on $A^2$ is better than that in the Bernstein version [2, Eq. (49)-(50)]; see Lemma I.2.

I.1 Proof of Theorem 9 and Comparison of Standard Deviation Estimation from Other Inequalities
Lemma I.1. If $X_1, X_2, \ldots$ are mean-zero independent random variables satisfying $\mathrm{Var}(X_i) = A^2$ and $P(|X_i| > B) = 0$ for all $i \ge 1$, then for any $\delta \in [0,1]$,
\[
P\bigg(\exists\, t \ge 1: \hat{A}_t^2 \le A^2 - \frac{B}{t}\, q\Big(\frac{\delta}{h(k_t)}; c_t, A^2, B\Big) - \frac{1}{t^2}\Big|\sum_{i=1}^t X_i\Big|^2\bigg) \le \delta.
\]
Proof.
Consider the random variables $X_i^2 - E[X_i^2]$. These are mean zero and bounded in absolute value by $B^2$. Further, the variance can be bounded as
\[
\mathrm{Var}(X_i^2 - E[X_i^2]) = E\big[(X_i^2 - E[X_i^2])^2\big] \le B^2\, E[X_i^2] = B^2 A^2.
\]
Applying Theorem 2 with the variables $-(X_i^2 - E[X_i^2])$ implies
\[
P\bigg(\exists\, t \ge 1: \sum_{i=1}^t -(X_i^2 - E[X_i^2]) \ge q\Big(\frac{\delta}{h(k_t)}; c_t, A^2 B^2, B^2\Big)\bigg) \le \delta.
\]
Lemma H.3 proves that
\[
q\Big(\frac{\delta}{h(k_t)}; c_t, A^2 B^2, B^2\Big) = B\, q\Big(\frac{\delta}{h(k_t)}; c_t, A^2, B\Big).
\]
Hence we get, with probability at least $1 - \delta$, simultaneously for all $t \ge 1$,
\[
\sum_{i=1}^t (X_i - \bar{X}_t)^2 = \sum_{i=1}^t X_i^2 - \frac{1}{t}\Big(\sum_{i=1}^t X_i\Big)^2 \ge \sum_{i=1}^t E[X_i^2] - B\, q\Big(\frac{\delta}{h(k_t)}; c_t, A^2, B\Big) - \frac{1}{t}\Big|\sum_{i=1}^t X_i\Big|^2.
\]
Hence for any $\delta \in [0,1]$,
\[
P\bigg(\exists\, t \ge 1: t\hat{A}_t^2 \le tA^2 - B\, q\Big(\frac{\delta}{h(k_t)}; c_t, A^2, B\Big) - \frac{1}{t}\Big|\sum_{i=1}^t X_i\Big|^2\bigg) \le \delta.
\]
This completes the proof.

We will now prove Theorem 9. Theorem 2 implies that
\[
P\bigg(\exists\, t \ge 1: \Big|\sum_{i=1}^t X_i\Big| \ge q\Big(\frac{\delta_1}{h(k_t)}; c_t, A^2, B\Big)\bigg) \le \delta_1, \tag{31}
\]
and Lemma I.1 implies that
\[
P\bigg(\exists\, t \ge 1: \hat{A}_t^2 \le A^2 - \frac{B}{t}\, q\Big(\frac{\delta_2}{h(k_t)}; c_t, A^2, B\Big) - \frac{1}{t^2}\Big|\sum_{i=1}^t X_i\Big|^2\bigg) \le \delta_2.
\]
Hence with probability at least $1 - \delta_1 - \delta_2$, simultaneously for all $t \ge 1$,
\[
\Big|\sum_{i=1}^t X_i\Big| \le q\Big(\frac{\delta_1}{h(k_t)}; c_t, A^2, B\Big), \qquad \hat{A}_t^2 \ge A^2 - \frac{B}{t}\, q\Big(\frac{\delta_2}{h(k_t)}; c_t, A^2, B\Big) - \frac{1}{t^2}\Big|\sum_{i=1}^t X_i\Big|^2.
\]
On this event, $A^2 \le \bar{A}_t^2$ simultaneously for all $t \ge 1$, which in turn implies that $A^2 \le \min_{1 \le s \le t} \bar{A}_s^2$ also holds simultaneously for all $t \ge 1$. Substituting this in (31) (along with Lemma H.2) implies the result.

Lemma I.2.
Suppose $\delta \mapsto \tilde{q}(\delta^{1/n}; A, B)$ is a function such that
\[
P\Big(\max_{1\le t\le n} S_t \ge n\, \tilde{q}(\delta^{1/n}; A, B)\Big) \le \delta \tag{32}
\]
for all $\delta \in [0,1]$ and independent random variables $X_1, \ldots, X_n$ satisfying (2). Define the (over-)estimator of $A^2$ as
\[
\tilde{A}_t^2 := \sup\bigg\{a^2 \ge 0 : \hat{A}_t^2 \ge a^2 - \frac{B c_t}{t}\, \tilde{q}\big((\delta_2/h(k_t))^{1/c_t}; a^2, B\big) - \frac{c_t^2}{t^2}\, \tilde{q}^2\big((\delta_1/h(k_t))^{1/c_t}; a^2, B\big)\bigg\}.
\]
Then $\bar{A}_n^2 \le \tilde{A}_n^2$.

Proof.
We have proved in Appendix D that (32) implies
\[
q(\delta; n, a^2, B) \le n\, \tilde{q}(\delta^{1/n}; a^2, B) \qquad \text{for all } n,\ a^2,\ \text{and } B.
\]
Hence if $a^2$ satisfies
\[
\hat{A}_t^2 \ge a^2 - \frac{B}{t}\, q\Big(\frac{\delta_2}{h(k_t)}; c_t, a^2, B\Big) - \frac{1}{t^2}\, q^2\Big(\frac{\delta_1}{h(k_t)}; c_t, a^2, B\Big),
\]
then
\[
\hat{A}_t^2 \ge a^2 - \frac{B c_t}{t}\, \tilde{q}\big((\delta_2/h(k_t))^{1/c_t}; a^2, B\big) - \frac{c_t^2}{t^2}\, \tilde{q}^2\big((\delta_1/h(k_t))^{1/c_t}; a^2, B\big),
\]
which implies the result.

References

[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55. US Government Printing Office, 1948.
[2] J.-Y. Audibert, R. Munos, and C. Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.
Theoretical Computer Science, 410(19):1876–1902, 2009.
[3] R. R. Bahadur and L. J. Savage. The nonexistence of certain statistical procedures in nonparametric problems. The Annals of Mathematical Statistics, 27(4):1115–1122, 1956.
[4] A. Balsubramani and A. Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 42–51, 2016.
[5] V. Bentkus. A remark on the inequalities of Bernstein, Prokhorov, Bennett, Hoeffding, and Talagrand. Liet. Mat. Rink., 42(3):332–342, 2002.
[6] V. Bentkus. On Hoeffding's inequalities. The Annals of Probability, 32(2):1650–1673, 2004.
[7] V. Bentkus. An extension of the Hoeffding inequality to unbounded random variables. Lithuanian Mathematical Journal, 48(2):137–157, 2008.
[8] V. Bentkus, N. Kalosha, and M. Van Zuijlen. On domination of tail probabilities of (super)martingales: explicit bounds. Lithuanian Mathematical Journal, 46(1):1–43, 2006.
[9] P. Dagum and M. Luby. Approximating the permanent of graphs with large factors. Theoretical Computer Science, 102(2):283–305, 1992.
[10] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation. SIAM Journal on Computing, 29(5):1484–1496, 2000.
[11] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM (JACM), 38(1):1–17, 1991.
[12] E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models, volume 40. Cambridge University Press, 2016.
[13] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[14] S. R. Howard and A. Ramdas. Sequential estimation of quantiles with applications to A/B testing and best-arm identification. arXiv preprint arXiv:1906.09712, 2019.
[15] S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
[16] M. Huber. An optimal (ε, δ)-randomized approximation scheme for the mean of random variables with bounded relative variance. Random Structures & Algorithms, 55(2):356–370, 2019.
[17] M. Huber et al. Approximation algorithms for the normalizing constant of Gibbs distributions. The Annals of Applied Probability, 25(2):974–985, 2015.
[18] R. Johari, L. Pekelis, and D. J. Walsh. Always valid inference: Bringing sequential analysis to A/B testing. arXiv preprint arXiv:1512.04922, 2015.
[19] R. Johari, P. Koomen, L. Pekelis, and D. Walsh. Peeking at A/B tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1517–1525, 2017.
[20] V. Mnih, C. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679, 2008.
[21] T. K. Philips and R. Nelson. The moment bound is tighter than Chernoff's bound for positive tail probabilities. The American Statistician, 49(2):175–178, 1995.
[22] I. Pinelis. Binomial upper bounds on generalized moments and tail probabilities of (super)martingales with differences bounded from above. In High Dimensional Probability, pages 33–52. Institute of Mathematical Statistics, 2006.
[23] I. Pinelis. On the Bennett-Hoeffding inequality. arXiv preprint arXiv:0902.4058, 2009.
[24] I. Pinelis et al. Optimal binomial, Poisson, and normal left-tail domination for sums of nonnegative random variables. Electronic Journal of Probability, 21, 2016.
[25] R. Singh. Existence of bounded length confidence intervals. The Annals of Mathematical Statistics, 34(4):1474–1485, 1963.
[26] M. Talagrand. The missing factor in Hoeffding's inequalities. In Annales de l'IHP Probabilités et Statistiques, volume 31, pages 689–702, 1995.
[27] A. Wald. Sequential Analysis. Courier Corporation, 2004.
[28] F. Yang, A. Ramdas, K. G. Jamieson, and M. J. Wainwright. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pages 5957–5966, 2017.
[29] S. Zhao, E. Zhou, A. Sabharwal, and S. Ermon. Adaptive concentration inequalities for sequential decision problems. In Advances in Neural Information Processing Systems, 2016.