Bounds on Bayes Factors for Binomial A/B Testing
Maciej Skorski [email protected]
Keywords: Hypothesis Testing, Bayesian Statistics, A/B Testing

Abstract: Bayes factors have, in many cases, been proven to bridge classic p-value based significance testing and Bayesian analysis of posterior odds. This paper discusses this phenomenon within the binomial A/B testing setup (applicable, for example, to conversion testing). It is shown that the Bayes factor is controlled by the Jensen-Shannon divergence of the success ratios in the two tested groups, which can be further bounded by the Welch statistic. As a result, Bayesian sample bounds almost match the frequentist sample bounds. The link between the Jensen-Shannon divergence and Welch's test, as well as the derivation, are an elegant application of tools from information geometry.
A/B testing
A/B testing is the technique of collecting data from two parallel experiments and comparing them by probabilistic inference. A particularly important case is assessing which of two success-counting experiments achieves a higher success rate. This naturally applies to evaluating conversion rates on two different versions of a webpage. A typical question is whether there is a difference (also called a non-zero effect) in conversion between the groups: $p_1, p_2$ are the unknown conversion rates in the two experiments, and the task is to compare the hypotheses $H_0 = \{p_1 = p_2\}$ and $H_a = \{p_1 \neq p_2\}$, given the observed data. In the frequentist approach, one falsifies $H_0$ by the two-sample t-test [Welch, 1938]. In the Bayesian approach one evaluates the strength of both $H_0$ and $H_a$ and decides based on the ratio called the Bayes factor

$$K = \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_a]} \qquad (1)$$

which, converted by Bayes' theorem to $\frac{\Pr[H_0 \mid D]}{\Pr[H_a \mid D]} = K \cdot \frac{\Pr[H_0]}{\Pr[H_a]}$, quantifies the posterior odds and allows a researcher to choose the model more plausible given the data (usually one gives $H_0$ and $H_a$ the same chance of being considered and sets $\Pr[H_0] = \Pr[H_a] = \frac{1}{2}$). The decision rule and the confidence depend on the magnitude of $K$ [Kass and Raftery, 1995, Jeffreys, 1998]. In the Bayesian approach a hypothesis may assign an arbitrary distribution to the parameters, which is more general.
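As a small numeric illustration (the value of $K$ below is hypothetical, chosen only to show the update), the posterior odds and the posterior probability of $H_0$ follow directly from $K$ under equal prior odds:

    # Posterior odds from a Bayes factor K under Pr[H0] = Pr[Ha] = 1/2 (hypothetical K).
    K = 1 / 20                              # Bayes factor in favour of H0
    posterior_odds = K * 1.0                # prior odds are 1
    prob_H0_given_D = posterior_odds / (1 + posterior_odds)
    print(posterior_odds, prob_H0_given_D)  # 0.05, ~0.048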
Testing counts proportions

Suppose that the empirical data $D$ has $r_i = r$ runs and $r \cdot \bar{p}_i$ successes in the $i$-th experiment, $i = 1, 2$. Under the binomial counting model, the data likelihood under a hypothesis $H$ equals

$$\Pr[D \mid H] = \int \prod_{i=1}^{2} p_i^{\,r\bar{p}_i}\,(1-p_i)^{\,(1-\bar{p}_i)r} \, \mathrm{d}P_H(p_1, p_2) \qquad (2)$$

where the prior distribution $P_H(\cdot, \cdot)$ reflects what is assumed prior to seeing the data (and what will be tested); one can for example choose $\{p_1 = p_2 = 0.5\}$ for $H = H_0$ and $\{p_1 \neq p_2\}$, uniformly over all valid values of $p_1, p_2$, for $H_a$, but in practice more informative priors are used because some configurations of values are unrealistic (e.g. extremely low or high conversion). The corresponding factor $K$ can be computed, for example, with the R package BayesFactor [Morey and Rouder, 2018].
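For concreteness, here is a minimal sketch of evaluating (2) and the resulting factor $K$, assuming a point null $p_1 = p_2 = 0.5$ and independent uniform priors under the alternative (so the integral reduces to Beta functions); the counts are hypothetical, and this is not the exact prior used by the BayesFactor package:

    # Sketch: Bayes factor (1) for the binomial model (2) with simple illustrative priors.
    import numpy as np
    from scipy.special import betaln

    r = 1000                      # trials per group (hypothetical)
    s1, s2 = 560, 440             # observed successes per group (hypothetical)

    def log_lik(p, s):
        # log of p^s (1-p)^(r-s), the integrand of (2) for one group
        return s * np.log(p) + (r - s) * np.log(1 - p)

    # Null: point mass at p1 = p2 = 0.5.
    log_m_H0 = log_lik(0.5, s1) + log_lik(0.5, s2)
    # Alternative: independent Uniform(0,1) priors; the integral is a Beta function per group.
    log_m_Ha = betaln(s1 + 1, r - s1 + 1) + betaln(s2 + 1, r - s2 + 1)

    K = np.exp(log_m_H0 - log_m_Ha)
    print(K)                      # K << 1 here: strong evidence for a non-zero effect under these priors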
Problem: Bayesian A/B testing power
Estimates, neither frequentist nor Bayesian, will not be conclusive without sufficiently many samples. Frequentists widely use rules of thumb derived from t-tests. Under the Bayesian methodology this is a little more complicated, because hypotheses can be arbitrary priors over the parameters. Under the binomial A/B model, we answer the following questions:
• when, given data, can a Bayesian hypothesis of zero effect be rejected ($K \ll 1$)?
• what is the relation to the classical t-test?
This will allow us to understand the data limitations when doing Bayesian inference, and to relate them to the widespread frequentist rules of thumb. Our problem, as stated, is a question about maximizing the minimal Bayes factor. It is known that for certain problems Bayes factors can be related to frequentist p-values [Edwards et al., 1963, Kass and Raftery, 1995, Goodman, 1999], and thus they bridge the Bayesian and frequentist worlds (this should be contrasted with the widespread belief that the two methods are largely incompatible [Kruschke and Liddell, 2018]). The novel contributions of this paper are (a) bounding the Bayes factor for binomial distributions, and (b) a discussion of sample bounds for binomial A/B testing in relation to the frequentist approach.
Main result: Bayes factor and Welch’s statistic
The following theorem shows that no "zero-effect" hypothesis can be falsified unless the number of samples is large relative to a certain statistic of the dataset. This statistic turns out to be the Jensen-Shannon divergence, well known in information theory, which is in turn bounded from below in terms of Welch's t-statistic.
Theorem 1 (Bayes Factors for Binomial Testing). Consider two independent experiments, each with $r$ independent trials with unknown success probabilities $p_1$ and $p_2$ respectively. Let the observed data $D$ have $r \cdot \bar{p}_i$ successes and $r \cdot (1-\bar{p}_i)$ failures for group $i$. Then

$$\max_{H_0:\{p_1=p_2\}} \ \min_{H_a} \ \frac{\Pr[H_0 \mid D]}{\Pr[H_a \mid D]} = e^{-2r \cdot JS(\bar{p}_1, \bar{p}_2)} \qquad (3)$$

where the maximum is over null hypotheses (priors) $H_0$ over $(p_1, p_2)$ such that $p_1 = p_2$, the minimum is over all valid alternative hypotheses (priors) over $(p_1, p_2)$, and $JS$ denotes the Jensen-Shannon divergence. Moreover, the Jensen-Shannon divergence is bounded from below in terms of Welch's t-statistic (on $D$)

$$JS(\bar{p}_1, \bar{p}_2) \geqslant \frac{t_{\mathrm{Welch}}^2(r, \bar{p}_1, \bar{p}_2)}{4r} \qquad (4)$$

so that we can bound

$$\max_{H_0:\{p_1=p_2\}} \ \min_{H_a} \ \frac{\Pr[H_0 \mid D]}{\Pr[H_a \mid D]} \leqslant e^{-t_{\mathrm{Welch}}^2(r, \bar{p}_1, \bar{p}_2)/2}. \qquad (5)$$

Remark 1 (Most favorable hypotheses). Note that
• the maximally favorable alternative ($H_a$ which maximizes $\Pr[D \mid H_a]$) is $p_1 = \bar{p}_1$ and $p_2 = \bar{p}_2$;
• the maximally favorable null of the form $p_1 = p_2$ is $p_1 = p_2 = \frac{\bar{p}_1 + \bar{p}_2}{2}$.
If the null is of the form $p_1 = p_2 = x$ then the bound becomes $e^{-r \cdot KL(\bar{p}_1, x) - r \cdot KL(\bar{p}_2, x)}$.

Figure 1: Comparison of the Bayesian (6) and the frequentist (7) sample lower bounds, where $p_1 = p$ and $p_2 = p \cdot (1 + \delta)$ for $\delta = 0.1$.
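To make the exact bound in (3) concrete, it can be evaluated directly from the observed success ratios (the numbers below are hypothetical):

    # Sketch: the best-case posterior odds for a zero-effect null, equality (3).
    import numpy as np

    def H(p):                          # binary Shannon entropy (natural log)
        return -p * np.log(p) - (1 - p) * np.log(1 - p)

    def js(p, q):                      # Jensen-Shannon divergence, eq. (10)
        return H((p + q) / 2) - (H(p) + H(q)) / 2

    r = 1000                           # trials per group (hypothetical)
    p1_bar, p2_bar = 0.52, 0.48        # observed success ratios (hypothetical)

    best_odds = np.exp(-2 * r * js(p1_bar, p2_bar))   # max over nulls, min over alternatives
    print(best_odds)                   # ~0.20: even the most favourable zero-effect prior is not strongly refuted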
Application: sample bounds

The main result implies the following sample rule.
Corollary 1 (Bayesian Sample Bound). To confirm a non-zero effect ($p_1 \neq p_2$), the number of samples for the Bayesian method should be

$$r \gg \frac{1}{2\,JS(p_1, p_2)} \qquad (6)$$

Under the frequentist method the rule of thumb is $t_{\mathrm{Welch}} \gg 1$, which gives (see Section 2)

$$r \gg \frac{p_1(1-p_1) + p_2(1-p_2)}{(p_1 - p_2)^2} \qquad (7)$$

Note that both formulas need assumptions on the locations of the parameters. In particular, testing smaller effects, or effects with higher variance, requires more samples.

The bounds in Equation (6) and Equation (7) are within a constant factor of each other (a different small factor is necessary to make the bound small in both the Bayesian credibility and the p-value sense). The difference (after normalizing the constant factor) is illustrated in Figure 1, for the case when one wants to test a relative uplift of 10%.

Since high values of $t_{\mathrm{Welch}}$ mean small p-values, we conclude that the frequentist p-value bounds the Bayes factor, and small p-values are indeed evidence against a null hypothesis in a well-defined Bayesian sense. However, because of the scaling $t_{\mathrm{Welch}} \to e^{-t_{\mathrm{Welch}}^2/2}$, this is true for p-values much lower than the standard threshold of 0.05. In some sense, the Bayesian approach is more conservative and more reluctant to reject than frequentist tests; this conclusion is shared with other works [Goodman, 1999].
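A short sketch (over a few hypothetical baseline conversion rates) comparing the two sample lower bounds for a 10% relative uplift, in the spirit of Figure 1:

    # Sketch: Bayesian bound (6) vs frequentist bound (7) for a 10% relative uplift.
    import numpy as np

    def H(p):
        return -p * np.log(p) - (1 - p) * np.log(1 - p)

    def js(p, q):
        return H((p + q) / 2) - (H(p) + H(q)) / 2

    delta = 0.1
    for p1 in [0.01, 0.02, 0.05, 0.1, 0.2]:        # hypothetical baseline rates
        p2 = p1 * (1 + delta)
        r_bayes = 1 / (2 * js(p1, p2))             # bound (6)
        r_freq = (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2   # bound (7)
        print(f"{p1:.2f}  {r_bayes:10.0f}  {r_freq:10.0f}")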
Preliminaries

Entropy, Divergence
The binary cross-entropy of $p$ and $q$ is defined by

$$H(p, q) = -p \log q - (1-p) \log(1-q) \qquad (8)$$

which becomes the standard (Shannon) binary entropy when $p = q$, denoted $H(p) = H(p, p)$. The Kullback-Leibler divergence is defined as

$$KL(p, q) = H(p, q) - H(p) \qquad (9)$$

and the Jensen-Shannon divergence [Lin, 1991] is defined as

$$JS(p, q) = H\!\left(\frac{p+q}{2}\right) - \frac{H(p) + H(q)}{2} \qquad (10)$$

(always nonnegative because the entropy is concave).
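A direct transcription of definitions (8)–(10) (natural logarithm throughout, consistent with the $e^{-rH}$ likelihood expressions used later):

    # Definitions (8)-(10) for Bernoulli distributions; natural logarithm assumed.
    import numpy as np

    def cross_entropy(p, q):           # H(p, q), eq. (8)
        return -p * np.log(q) - (1 - p) * np.log(1 - q)

    def entropy(p):                    # H(p) = H(p, p)
        return cross_entropy(p, p)

    def kl(p, q):                      # KL(p, q) = H(p, q) - H(p), eq. (9)
        return cross_entropy(p, q) - entropy(p)

    def js(p, q):                      # JS(p, q), eq. (10)
        m = (p + q) / 2
        return entropy(m) - (entropy(p) + entropy(q)) / 2

    assert abs(js(0.3, 0.3)) < 1e-12 and js(0.2, 0.4) > 0   # JS vanishes iff p = q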
The following lemma shows that the cross-entropy function is convex in its second argument. This should be contrasted with the fact that the entropy function (of one argument) is concave.

Lemma 1 (Convexity of cross-entropy). For any $p$ the mapping $x \to H(p, x)$ is convex in $x$.

Proof. Since $-p \cdot \log(\cdot)$ is convex for $p \in [0, 1]$, we obtain

$$-\gamma_1 p \log x_1 - \gamma_2 p \log x_2 \geqslant -p \log(\gamma_1 x_1 + \gamma_2 x_2)$$

for any $x_1, x_2$ and any $\gamma_1, \gamma_2 \geqslant 0$ with $\gamma_1 + \gamma_2 = 1$. Replacing $x_i$ by $1 - x_i$ and $p$ by $1 - p$ in the above inequality also gives

$$-\gamma_1 (1-p) \log(1-x_1) - \gamma_2 (1-p) \log(1-x_2) \geqslant -(1-p) \log\bigl(\gamma_1 (1-x_1) + \gamma_2 (1-x_2)\bigr) = -(1-p) \log(1 - \gamma_1 x_1 - \gamma_2 x_2).$$

Adding side by side yields

$$\gamma_1 H(p, x_1) + \gamma_2 H(p, x_2) \geqslant H(p, \gamma_1 x_1 + \gamma_2 x_2),$$

which finishes the proof. The argument also works in the multivariate case, when $p$ and $x$ are probability vectors.

Lemma 2 (Quadratic bounds on KL/cross-entropy). For any $p$ it holds that

$$KL(p, x) \geqslant \left(\frac{1}{2p} + \frac{1}{2(1-p)}\right) \cdot (x - p)^2 \qquad (11)$$

Proof.
We will prove a more general version. Let $(p_i)_i$ and $(x_i)_i$ be probability vectors of the same length. By the elementary inequality

$$-\log(1+u) \geqslant -u + \frac{u^2}{2} \qquad (12)$$

we obtain

$$-\log(x_i / p_i) = -\log\bigl(1 - (p_i - x_i)/p_i\bigr) \qquad (13)$$

$$\geqslant \frac{p_i - x_i}{p_i} + \frac{1}{2}\left(\frac{p_i - x_i}{p_i}\right)^2 \qquad (14)$$

Multiplying both sides by $p_i$ and adding the inequalities side by side we obtain

$$-\sum_i p_i \log(x_i / p_i) \geqslant \sum_i (p_i - x_i) + \frac{1}{2}\sum_i \frac{(p_i - x_i)^2}{p_i} \qquad (15)$$

$$= \frac{1}{2}\sum_i \frac{(p_i - x_i)^2}{p_i} \qquad (16)$$

(the linear term vanishes since $\sum_i x_i = \sum_i p_i = 1$), which means $KL(p, x) \geqslant \frac{1}{2}\sum_i \frac{(p_i - x_i)^2}{p_i}$. The lemma follows by specializing to the vectors $(p, 1-p)$ and $(x, 1-x)$.

Welch's test

To decide whether the means in two groups are equal, under the assumption of unequal variances, one performs Welch's t-test with the statistic

$$t_{\mathrm{Welch}} = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{s_1^2}{r_1} + \dfrac{s_2^2}{r_2}}} \qquad (17)$$

where $s_i^2$ are the sample variances and $\mu_i$ the sample means for group $i = 1, 2$.
The null hypothesis is rejected when the statistic is sufficiently large (in absolute terms). In our case the formula simplifies to the following.
Claim 1.
If $r\theta_1$ and $r\theta_2$ successes out of $r$ trials have been observed, respectively, in the first and the second group, then

$$t_{\mathrm{Welch}}(r, \theta_1, \theta_2) = r^{1/2} \cdot \frac{\theta_1 - \theta_2}{\sqrt{\theta_1(1-\theta_1) + \theta_2(1-\theta_2)}} \qquad (18)$$
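As a quick sanity check (with hypothetical counts), formula (18) agrees with a general-purpose Welch test run on the corresponding 0/1 samples, up to the $r$ versus $r-1$ convention in the sample variance:

    # Sketch: compare (18) against scipy's Welch t-test on 0/1 data (hypothetical counts).
    import numpy as np
    from scipy import stats

    r = 1000
    theta1, theta2 = 0.52, 0.48                      # observed success ratios (hypothetical)

    t_formula = np.sqrt(r) * (theta1 - theta2) / np.sqrt(
        theta1 * (1 - theta1) + theta2 * (1 - theta2))

    a = np.concatenate([np.ones(int(r * theta1)), np.zeros(int(r * (1 - theta1)))])
    b = np.concatenate([np.ones(int(r * theta2)), np.zeros(int(r * (1 - theta2)))])
    t_scipy, _ = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test

    print(t_formula, t_scipy)   # close; scipy uses the unbiased (r-1) variance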
In the remainder we change the notation slightly: the unknown success rates will be $p$ and $q$, and the corresponding observed successes $r \cdot \theta_1$, $r \cdot \theta_2$.

Alternatives

Maximizing over all possible priors $P_a$ over pairs $(p, q)$ we get

$$\max_{P_a} \Pr[D \mid H_a] = c \cdot \max_{P_a} \int_{[0,1]^2} e^{-rH(\theta_1, p) - rH(\theta_2, q)} \, \mathrm{d}P_a(p, q) \qquad (19)$$

where $c = \bigl[B(r\theta_1 + 1, r(1-\theta_1) + 1) \cdot B(r\theta_2 + 1, r(1-\theta_2) + 1)\bigr]^{-1}$ is a normalizing constant (it cancels in the ratios below). This equals

$$\max_{P_a} \Pr[D \mid H_a] = c \cdot e^{-rH(\theta_1) - rH(\theta_2)} \qquad (20)$$

achieved for $P_a$ being a unit mass at $(p, q) = (\theta_1, \theta_2)$.
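The point-mass claim in (20) is just the statement that the binomial likelihood $p^{r\theta}(1-p)^{r(1-\theta)} = e^{-rH(\theta, p)}$ is maximized over $p$ at the observed ratio $p = \theta$, so no prior average can exceed that maximum. A small numeric check (hypothetical values):

    # Sketch: e^{-r H(theta, p)} is maximized over p at p = theta, so a unit mass there
    # maximizes the marginal likelihood over all priors.
    import numpy as np

    def cross_entropy(p, q):
        return -p * np.log(q) - (1 - p) * np.log(1 - q)

    r, theta = 200, 0.3                      # hypothetical
    grid = np.linspace(0.01, 0.99, 9801)
    lik = np.exp(-r * cross_entropy(theta, grid))
    print(grid[np.argmax(lik)])              # ~0.3, the observed success ratio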
Null

Let $H_0$ state that the baseline is $p$ and the effect is 0. Then we obtain

$$\Pr[D \mid H_0] = c \cdot e^{-rH(\theta_1, p) - rH(\theta_2, p)} \qquad (21)$$

with the same normalizing constant $c$.
Bayes factor

If neither of the two hypotheses is a priori preferred, that is when $\Pr[H_0] = \Pr[H_a]$, then by Bayes' theorem the posterior odds equal the Bayes factor (the likelihood ratio):

$$\frac{\Pr[H_0 \mid D]}{\Pr[H_a \mid D]} = \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_a]}. \qquad (22)$$

In turn, the likelihood ratio (in favor of $H_0$), minimized over the alternatives, equals

$$\min_{H_a} \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_a]} = e^{-r \cdot (H(\theta_1, p) + H(\theta_2, p) - H(\theta_1) - H(\theta_2))} \qquad (23)$$

(the normalizing constant $c$ cancels). Using the relation between the KL divergence and the cross-entropy we obtain

$$\min_{H_a} \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_a]} = e^{-r\,KL(\theta_1, p) - r\,KL(\theta_2, p)} \qquad (24)$$

We will use the following observation.

Claim 2.
The expression $KL(\theta_1, p) + KL(\theta_2, p)$ is minimized at $p = \theta^* = \frac{\theta_1 + \theta_2}{2}$, where it achieves the value $2\,JS(\theta_1, \theta_2)$.

Proof. We have

$$KL(\theta_1, p) + KL(\theta_2, p) = H(\theta_1, p) + H(\theta_2, p) - H(\theta_1) - H(\theta_2).$$

The existence of the minimum at $p = \theta^*$ follows from the convexity of $p \to H(\theta_1, p) + H(\theta_2, p)$, proved in Lemma 1. We note that $H(\theta_1, p) + H(\theta_2, p) = 2\,H\!\left(\frac{\theta_1 + \theta_2}{2}, p\right)$ for any $p$ (by definition), and thus for $p = \frac{\theta_1 + \theta_2}{2} = \theta^*$ we obtain $H(\theta_1, p) + H(\theta_2, p) = 2\,H(\theta^*)$ and $KL(\theta_1, p) + KL(\theta_2, p) = 2\,H(\theta^*) - H(\theta_1) - H(\theta_2)$. This, combined with the definition of the Jensen-Shannon divergence, finishes the proof.

We can now bound Equation (24) as

$$\min_{H_a} \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_a]} \leqslant e^{-2r \cdot JS(\theta_1, \theta_2)} \qquad (25)$$

with equality when the null puts all its mass at $p_1 = p_2 = \theta^*$. This proves the first part of Theorem 1.
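Claim 2 is easy to check numerically; a small sketch (with arbitrary illustrative values of $\theta_1, \theta_2$) minimizing $KL(\theta_1, p) + KL(\theta_2, p)$ over $p$:

    # Sketch: numerical check of Claim 2 for illustrative theta values.
    import numpy as np
    from scipy.optimize import minimize_scalar

    def cross_entropy(p, q):
        return -p * np.log(q) - (1 - p) * np.log(1 - q)

    def kl(p, q):
        return cross_entropy(p, q) - cross_entropy(p, p)

    def js(p, q):
        m = (p + q) / 2
        return cross_entropy(m, m) - (cross_entropy(p, p) + cross_entropy(q, q)) / 2

    t1, t2 = 0.23, 0.41                                       # hypothetical observed ratios
    res = minimize_scalar(lambda p: kl(t1, p) + kl(t2, p), bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, (t1 + t2) / 2)                               # minimizer ~ midpoint
    print(res.fun, 2 * js(t1, t2))                            # minimum value ~ 2 * JS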
Connecting the t-statistic and the Bayes factor exponent

Recall that by Claim 1, under the t-test we have

$$t_{\mathrm{Welch}}^2 \approx \frac{r \cdot |\theta_1 - \theta_2|^2}{\theta_1(1-\theta_1) + \theta_2(1-\theta_2)} \qquad (26)$$

It remains to connect $|\theta_1 - \theta_2|$ and $JS(\theta_1, \theta_2)$. By Lemma 2 we have the following refinement of Pinsker's inequality.

Claim 3.
We have $KL(\theta_1, p) \geqslant \frac{(\theta_1 - p)^2}{2\,\theta_1(1-\theta_1)}$.

Using $2\,JS(\theta_1, \theta_2) = KL(\theta_1, \theta^*) + KL(\theta_2, \theta^*)$, the inequality from Claim 3, and Welch's formula in Equation (26), we obtain

Claim 4.
We have

$$JS(\theta_1, \theta_2) \geqslant \frac{t_{\mathrm{Welch}}^2(r, \theta_1, \theta_2)}{4r} \qquad (27)$$

Proof.
Claim 3 implies

$$2\,JS(\theta_1, \theta_2) \geqslant \frac{(\theta_1 - \theta_2)^2}{8} \cdot \left(\frac{1}{\theta_1(1-\theta_1)} + \frac{1}{\theta_2(1-\theta_2)}\right) \qquad (28)$$

Since $\frac{1}{a} + \frac{1}{b} \geqslant \frac{4}{a+b}$, we recognize Welch's statistic from Equation (26) and write

$$2\,JS(\theta_1, \theta_2) \geqslant \frac{t_{\mathrm{Welch}}^2(r, \theta_1, \theta_2)}{2r} \qquad (29)$$

Combining Equation (25) and Equation (27) implies the second part of the theorem.
The author thanks Evan Miller for inspiring discussions.
REFERENCES
Edwards, W., Lindman, H., and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3):193–242.

Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12):1005–1013.

Jeffreys, H. (1998). The Theory of Probability. Oxford Classic Texts in the Physical Sciences. OUP Oxford.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.

Kruschke, J. K. and Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206.

Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.

Morey, R. D. and Rouder, J. N. (2018). BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-4.2. http://CRAN.R-project.org/package=BayesFactor.

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29(3/4):350–362.