An Evaluation Framework for Personalization Strategy Experiment Designs
C. H. Bryan Liu ([email protected])
Imperial College London & ASOS.com, UK
Emma J. McCoy
Imperial College London, UK
ABSTRACT
Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control in group assignment. We argue that getting the right experiment setup — the allocation of users to treatment/analysis groups — should take precedence over post-hoc variance reduction techniques in order to enable the scaling of the number of experiments. We present an evaluation framework that, along with a few rules of thumb, allows experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstances.
KEYWORDS
Experimentation, Experiment design, Personalization strategies, A/B testing.
1 INTRODUCTION
The use of Online Controlled Experiments (OCEs, e.g. A/B tests) has become popular in measuring the impact of products and guiding business decisions on the Web. Major companies report running thousands of OCEs on any given day, and many startups exist purely to manage OCEs. A large number of OCEs address simple variations on elements of the user experience based on random splits, e.g. showing a different colored button to users based on a user ID hash bucket. Here, we are interested in experiments that compare personalization strategies, complex sets of targeted customer interactions that are common in e-commerce and digital marketing. Examples of personalization strategies include the scheduling, budgeting and ordering of marketing activities directed at a user based on their purchase history.

Experiments for personalization strategies face two unique challenges. Firstly, strategies are often only applicable to a small fraction of the user base, and thus many simple experiment designs suffer from either a lack of sample size / statistical power, or diluted metric movement due to including irrelevant samples [2]. Secondly, as users are not randomly assigned a priori, but must qualify to be treated with a strategy via their actions or attributes, groups of users subjected to different strategies cannot be assumed to be statistically equivalent and hence are not directly comparable.

While there are a number of variance reduction techniques (including stratification and control variates [3, 7]) that partially address these challenges, the strata and control variates involved can vary dramatically from one personalization strategy experiment to another, requiring many ad hoc adjustments. As a result, such techniques may not scale well when organizations design and run hundreds or thousands of experiments at any given time.

We argue that personalization strategy experiments should focus on the assignment of users from the strategies they qualified for to the treatment/analysis groups. We call this mapping process an experiment setup. Identifying the best experiment setup increases the chance to detect any treatment effect. An experimentation framework can also reuse and switch between different setups quickly with little custom input, ensuring the operation can scale. More importantly, the process does not hinder the subsequent application of variance reduction techniques, meaning that we can still apply those techniques if required.

To date, many experiment setups exist to compare personalization strategies. An increasingly popular approach is to compare the strategies using multiple control groups — Quantcast calls it a dual control [1], and Facebook calls it a multi-cell lift study [5] (a single-cell lift study is often used to measure the incrementality of a single personalization strategy, and hence is not a representative comparison). In the two-strategy case, this involves running two experiments on two random partitions of the user base in parallel, with each experiment further splitting the respective partition into treatment/control and measuring the incrementality (the change in a metric as compared to the case where nothing is done) of each strategy. The incrementalities of the two strategies are then compared against each other.

Despite the setup above gaining traction in display advertising, there is a lack of literature on whether it (or any other candidate) is a good setup — one that has a higher sensitivity and/or apparent effect size than other setups. While Liu et al. [5] noted that multi-cell lift studies require a large number of users, they did not discuss how the number compares to other setups.
The ability to identify and adopt a better experiment setup can reduce the required sample size, and hence enable more cost-effective experimentation. We address the gap in the literature by introducing an evaluation framework that compares experiment setups given two personalization strategies. The framework is designed to be flexible so that it is able to deal with a wide range of baselines and changes in user responses presented by any pair of strategies (situations hereafter). We also recognize the need to quickly compare common setups, and provide some rules of thumb on situations where one setup will be better than another. In particular, we outline the conditions where employing a multi-cell setup, as well as metric dilution, is desirable.

To summarize, our contributions are: (i) We develop a flexible evaluation framework for personalization strategy experiments, where one can compare two experiment setups given the situation presented by two competing strategies (Sec. 2); (ii) We provide simple rules of thumb to enable experimenters who do not require the full flexibility of the framework to quickly compare common setups (Sec. 3); and (iii) We make our results useful to practitioners by making the code used in the paper (Sec. 4) publicly available at github.com/liuchbryan/experiment_design_evaluation.
Figure 1: Venn diagram of the user groups in our evaluation framework. The outer, left inner (red), and right inner (blue) boxes represent the entire user base, those who qualify for strategy 1, and those who qualify for strategy 2 respectively.
2 THE EVALUATION FRAMEWORK
We first present our evaluation framework for personalization strategy experiments. The experiments compare two personalization strategies, which we refer to as strategy 1 and strategy 2. Often one of them is the existing strategy, and the other is a new strategy we intend to test and learn from. In this section we introduce (i) how users qualifying themselves into strategies creates non-statistically-equivalent groups, (ii) how experimenters usually assign the users, and (iii) when we would consider an assignment to be better.
2.1 User Groups
As users qualify themselves into the two strategies, four disjoint groups emerge: those who qualify for neither strategy, those who qualify only for strategy 1, those who qualify only for strategy 2, and those who qualify for both strategies. We denote these (user) groups 0, 1, 2, and 3 respectively (see Fig. 1). It is perhaps obvious that we cannot assume those in different user groups are statistically equivalent and compare them directly.

We assume groups 0, 1, 2, and 3 have $n_0$, $n_1$, $n_2$, and $n_3$ users respectively. We also assume the metric has a different distribution between groups, and, within the same group, between the scenario where the group is subjected to the treatment associated with the corresponding strategy and the scenario where nothing is done (baseline). We list all group-scenario combinations in Table 1, and denote the mean and variance of the metric $(\mu_G, \sigma^2_G)$ for a combination $G$. For example, the metric for Group 1 without any intervention has mean and variance $(\mu_{C_1}, \sigma^2_{C_1})$, and that for Group 2 with the treatment prescribed under strategy 2 has mean and variance $(\mu_{I_2}, \sigma^2_{I_2})$.

2.2 Common Experiment Setups
Many experiment setups exist and are in use in different organizations. Here we introduce four common setups of varying sophistication, which we also illustrate in Fig. 2.
Setup 1 (Users in the intersection only).
The setup considers only users who qualify for both strategies. The said users are randomly split (usually 50/50) into two (analysis) groups A and B, and are prescribed the treatment specified by strategies 1 and 2 respectively. The setup is easy to implement, though it is difficult to translate any learnings obtained from the setup to other user groups (e.g. those who qualify for one strategy only) [4].

Setup 2 (All samples).
The setup is a simple A/B test which considers all users, regardless of whether they qualify for any strategy or not. The users are randomly split into two analysis groups A and B, and are prescribed the treatment specified by strategy 1 (2) if (i) they qualify under the strategy and (ii) they are in analysis group A (B). This setup is the easiest to implement but usually suffers severely from a dilution in metric [2].

Figure 2: Experiment setups overlaid on the user grouping Venn diagram in Fig. 1. The hatched boxes indicate who are included in the analysis, and the downward triangles and dots indicate who are subjected to treatment prescribed under strategies 1 and 2 respectively. See Sec. 2.2 for a detailed description.

Setup 3 (Qualified users only).
The setup is similar to Setup 2 except only those who qualified for at least one strategy ("triggered" users in some literature [2]) are included in the analysis groups. The setup sits between Setup 1 and Setup 2 in terms of user coverage, and has the advantage of capturing the largest number of useful samples while having the least metric dilution. However, the setup also prevents one from telling the incrementality of a strategy itself; one can only learn the difference in incrementalities between the two strategies.
Setup 4 (Dual control / multi-cell lift test).
As described in Section 1, the setup first splits the users randomly into two randomization groups. For the first randomization group, we consider those who qualify for strategy 1, and split them into analysis groups A1 and A2. Group A2 is prescribed the treatment specified by strategy 1, while nothing is done to group A1. We apply the same process to the second randomization group, with strategy 2 and analysis groups B1 and B2. The incrementality of strategy 1 is then estimated by the difference between groups A2 and A1, and that of strategy 2 by the difference between groups B2 and B1 (see Fig. 2).
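To make the four setups concrete, the sketch below maps a single user to an analysis group and a treatment under each setup. It is an illustration only: the function and variable names are ours and not part of the paper's released code, all random splits are assumed to be 50/50, and users a setup does not analyse are simply dropped.

```python
import random

def assign(setup, qualifies_1, qualifies_2, rng=random):
    """Return (analysis_group, treatment) for one user, or None if the user
    is excluded from the analysis. Treatment is 1, 2, or None (do nothing)."""
    coin = rng.random() < 0.5  # True -> first half of the 50/50 split
    if setup == 1:  # Setup 1: users in the intersection only
        if not (qualifies_1 and qualifies_2):
            return None
        return ("A", 1) if coin else ("B", 2)
    if setup in (2, 3):  # Setup 2: all samples; Setup 3: qualified users only
        if setup == 3 and not (qualifies_1 or qualifies_2):
            return None
        if coin:  # analysis group A: treat with strategy 1 if qualified
            return ("A", 1 if qualifies_1 else None)
        return ("B", 2 if qualifies_2 else None)
    if setup == 4:  # Setup 4: dual control / multi-cell lift test
        in_first_cell = coin          # randomization group for strategy 1
        subsplit = rng.random() < 0.5  # treatment/control within the cell
        if in_first_cell and qualifies_1:
            return ("A2", 1) if subsplit else ("A1", None)
        if (not in_first_cell) and qualifies_2:
            return ("B2", 2) if subsplit else ("B1", None)
        return None  # not analysed under this setup
    raise ValueError("unknown setup")
```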
2.3 Evaluation Criteria
There are a number of considerations when one evaluates competing experiment setups. They range from technical considerations (e.g. the complexity of setting up the setups) to business considerations (e.g. whether the incrementality of individual strategies is required).
                          Group 0   Group 1   Group 2   Group 3
Baseline (Control)        $C_0$     $C_1$     $C_2$     $C_3$
Treatment (Intervention)  /         $I_1$     $I_2$     Under strategy 1: $I_\phi$; Under strategy 2: $I_\psi$

Table 1: All group-scenario combinations in our evaluation framework for personalization strategy experiments. The columns represent the groups described in Fig. 1. The baseline represents the scenario where nothing is done. We assume those who qualify for both strategies (Group 3) can only receive the treatment associated with one of the strategies.
Here we focus on the statistical aspect and propose two evaluation criteria: (i) the actual effect size of a treatment as presented by the two analysis groups in an experiment setup, and (ii) the sensitivity of the experiment, represented by the minimum detectable effect (MDE) under a pre-specified test power. Both criteria are necessary as the former indicates whether a setup suffers from metric dilution, whereas the latter indicates whether the setup suffers from a lack of power/sample size. An ideal setup should yield a high actual effect size and a high sensitivity (i.e. a low MDE), though as we observe in the next section it is usually a trade-off. (We will use the terms "high(er) sensitivity" and "low(er) MDE" interchangeably.)

We formally define the two evaluation criteria from first principles while introducing relevant notation along the way. Let $A$ and $B$ be the two analysis groups in an experiment setup, with user responses randomly distributed with mean and variance $(\mu_A, \sigma^2_A)$ and $(\mu_B, \sigma^2_B)$ respectively. We first recall that if there are sufficient samples, the sample means of the two groups approximately follow normal distributions by the Central Limit Theorem:

  $\bar{A} \overset{\text{approx.}}{\sim} \mathcal{N}(\mu_A, \sigma^2_A / n_A), \qquad \bar{B} \overset{\text{approx.}}{\sim} \mathcal{N}(\mu_B, \sigma^2_B / n_B)$,   (1)

where $n_A$ and $n_B$ are the number of samples taken from $A$ and $B$ respectively. The difference in the sample means then also approximately follows a normal distribution:

  $\bar{D} \triangleq \bar{B} - \bar{A} \overset{\text{approx.}}{\sim} \mathcal{N}\big(\Delta \triangleq \mu_B - \mu_A,\; \sigma^2_{\bar{D}} \triangleq \sigma^2_A / n_A + \sigma^2_B / n_B\big)$.   (2)

Here, $\Delta$ is the actual effect size that we are interested in.

The definition of the MDE $\theta^*$ requires a primer on the power of a statistical test. A common null hypothesis statistical test in personalization strategy experiments uses the two-tailed hypotheses $H_0: \Delta = 0$ against $H_1: \Delta \neq 0$, with the test statistic under $H_0$ being

  $T \triangleq \bar{D} / \sigma_{\bar{D}} \overset{\text{approx.}}{\sim} \mathcal{N}(0, 1)$.   (3)

We recall the null hypothesis will be rejected if $|T| > z_{1-\alpha/2}$, where $\alpha$ is the significance level and $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of a standard normal. Under a specific alternative hypothesis $\Delta = \theta$, the power is specified as

  $1 - \beta_\theta \triangleq \Pr\big(|T| > z_{1-\alpha/2} \,\big|\, \Delta = \theta\big) \approx 1 - \Phi\big(z_{1-\alpha/2} - |\theta| / \sigma_{\bar{D}}\big)$,   (4)

where $\Phi$ denotes the cumulative distribution function of a standard normal. (The approximation in Eq. (4) is tight for experiment design purposes, where the significance level is small and the required power is well above one half in nearly all cases.) To achieve a minimum test power $\pi_{\min}$, we require that $1 - \beta_\theta > \pi_{\min}$. Substituting Eq. (4) into the inequality and rearranging to make $\theta$ the subject yields the effect sizes that the test will be able to detect with the specified power:

  $|\theta| > (z_{1-\alpha/2} - z_{1-\pi_{\min}})\, \sigma_{\bar{D}}$.   (5)

$\theta^*$ is then defined as the positive minimum $\theta$ that satisfies Ineq. (5), i.e. that specified by the RHS of the inequality.

We finally define what it means to be better under these evaluation criteria. WLOG we assume the actual effect sizes of the two competing experiment setups are positive, and say a setup $S$ is superior to another setup $R$ if, all else being equal, (i) $S$ produces a higher actual effect size ($\Delta_S > \Delta_R$) and a lower minimum detectable effect size ($\theta^*_S < \theta^*_R$), or (ii) the gain in actual effect is greater than the loss in sensitivity:

  $\Delta_S - \Delta_R > \theta^*_S - \theta^*_R$,   (6)

which means an actual effect still stands a higher chance to be observed under $S$. (If both actual effect sizes are negative, we simply swap the analysis groups. If the actual effect sizes are of opposite signs, it is likely an error.)

3 COMPARING COMMON SETUPS
Having described the evaluation framework above, in this section we use the framework to compare the common experiment setups described in Sec. 2.2. We first derive the actual effect size and MDE in Sec. 3.1, and use the results to create rules of thumb on (i) whether diluting the metric by including users who qualify for neither strategy is beneficial (Sec. 3.2) and (ii) whether dual control is a better setup for personalization strategy experiments (Sec. 3.3), two questions that are often discussed among e-commerce and marketing-focused experimenters. For brevity, we relegate most of the intermediate algebraic work when deriving the actual and minimum detectable effect sizes, as well as the conditions that lead to a setup being superior, to our supplementary document.

3.1 Actual and Minimum Detectable Effects
We first present the actual effect size and MDE of the four experiment designs. For each setup we first compute the sample size, metric mean, and metric variance of the analysis groups (we present only one of them here; expressions for the other analysis groups can be easily obtained by substituting in the corresponding user groups), which arise as a mixture of the user groups described in Sec. 2.1. We then substitute the quantities computed into the definitions of $\Delta$ (see Eq. (2)) and $\theta^*$ (see Ineq. (5)) to obtain the setup-specific actual effect size and MDE. We assume all random splits are done 50/50 in these setups to maximize the test power.

Setup 1. The setup randomly splits user group 3 into two analysis groups, each with $n_3/2$ samples. Users in analysis group $A$ are provided treatment under strategy 1, and hence the group metric has a mean and variance of $(\mu_{I_\phi}, \sigma^2_{I_\phi})$. The actual effect size and MDE for Setup 1 are hence:

  $\Delta_{S_1} = \mu_{I_\psi} - \mu_{I_\phi}$,   (7)

  $\theta^*_{S_1} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{2(\sigma^2_{I_\phi} + \sigma^2_{I_\psi}) / n_3}$.   (8)
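The two criteria are straightforward to compute once the analysis-group moments are known. Below is a minimal sketch of Ineq. (5) and Eqs. (7) and (8), assuming SciPy is available; the function names and argument conventions are ours, not the paper's.

```python
from scipy.stats import norm

def z_factor(alpha=0.05, power=0.80):
    """(z_{1-alpha/2} - z_{1-power}) from Ineq. (5); ~2.80 for 5% / 80%."""
    return norm.ppf(1 - alpha / 2) - norm.ppf(1 - power)

def mde(var_a, var_b, n_a, n_b, alpha=0.05, power=0.80):
    """Minimum detectable effect for two analysis groups (Ineq. (5))."""
    sigma_dbar = (var_a / n_a + var_b / n_b) ** 0.5
    return z_factor(alpha, power) * sigma_dbar

def setup1(mu_iphi, var_iphi, mu_ipsi, var_ipsi, n3, alpha=0.05, power=0.80):
    """Setup 1: user group 3 split 50/50 between the two strategies (Eqs. (7)-(8))."""
    delta = mu_ipsi - mu_iphi
    theta_star = mde(var_iphi, var_ipsi, n3 / 2, n3 / 2, alpha, power)
    return delta, theta_star
```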
Setup 2. This setup also contains two analysis groups, $A$ and $B$, each taking half of the population (i.e. $(n_0 + n_1 + n_2 + n_3)/2$ samples). The metric mean and variance of $A$ and $B$ are the weighted metric means and variances of the constituent user groups. As we only provide treatment to those who qualify for strategy 1 in group $A$, and likewise for group $B$ with strategy 2, each user group will give different responses, e.g. for group $A$:

  $\mu_A = (n_0 \mu_{C_0} + n_1 \mu_{I_1} + n_2 \mu_{C_2} + n_3 \mu_{I_\phi}) / (n_0 + n_1 + n_2 + n_3)$,   (9)

  $\sigma^2_A = (n_0 \sigma^2_{C_0} + n_1 \sigma^2_{I_1} + n_2 \sigma^2_{C_2} + n_3 \sigma^2_{I_\phi}) / (n_0 + n_1 + n_2 + n_3)$.   (10)

Substituting the above (and that for group $B$) into the definitions of actual effect size and MDE, we have:

  $\Delta_{S_2} = \dfrac{n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_\psi} - \mu_{I_\phi})}{n_0 + n_1 + n_2 + n_3}$,   (11)

  $\theta^*_{S_2} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \times \sqrt{\dfrac{2\big(2 n_0 \sigma^2_{C_0} + n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_\phi} + \sigma^2_{I_\psi})\big)}{(n_0 + n_1 + n_2 + n_3)^2}}$.   (12)

Setup 3. The setup is very similar to Setup 2, with members from user group 0 excluded. This leads to both analysis groups having $(n_1 + n_2 + n_3)/2$ samples, with e.g.

  $\mu_A = \dfrac{n_1 \mu_{I_1} + n_2 \mu_{C_2} + n_3 \mu_{I_\phi}}{n_1 + n_2 + n_3}, \qquad \sigma^2_A = \dfrac{n_1 \sigma^2_{I_1} + n_2 \sigma^2_{C_2} + n_3 \sigma^2_{I_\phi}}{n_1 + n_2 + n_3}$.   (13)

This leads to the following actual effect size and MDE for Setup 3:

  $\Delta_{S_3} = \dfrac{n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_\psi} - \mu_{I_\phi})}{n_1 + n_2 + n_3}$,   (14)

  $\theta^*_{S_3} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\dfrac{2\big(n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_\phi} + \sigma^2_{I_\psi})\big)}{(n_1 + n_2 + n_3)^2}}$.   (15)
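The same quantities for Setups 2 and 3 follow directly from Eqs. (11), (12), (14) and (15). A sketch is given below; the naming is ours and the group-scenario parameters are passed as a plain dictionary. Note that Setup 3 is simply Setup 2 with group 0 removed.

```python
from scipy.stats import norm

def z_factor(alpha=0.05, power=0.80):
    return norm.ppf(1 - alpha / 2) - norm.ppf(1 - power)

def setup2(p, alpha=0.05, power=0.80):
    """Actual effect and MDE for Setup 2 (Eqs. (11)-(12)).
    p holds n0..n3 plus (mu, var) for C0..C3, I1, I2, Iphi, Ipsi."""
    n = p["n0"] + p["n1"] + p["n2"] + p["n3"]
    delta = (p["n1"] * (p["mu_C1"] - p["mu_I1"])
             + p["n2"] * (p["mu_I2"] - p["mu_C2"])
             + p["n3"] * (p["mu_Ipsi"] - p["mu_Iphi"])) / n
    var_sum = (2 * p["n0"] * p["var_C0"]
               + p["n1"] * (p["var_I1"] + p["var_C1"])
               + p["n2"] * (p["var_C2"] + p["var_I2"])
               + p["n3"] * (p["var_Iphi"] + p["var_Ipsi"]))
    theta_star = z_factor(alpha, power) * (2 * var_sum / n**2) ** 0.5
    return delta, theta_star

def setup3(p, alpha=0.05, power=0.80):
    """Actual effect and MDE for Setup 3 (Eqs. (14)-(15)): drop group 0."""
    return setup2(dict(p, n0=0, var_C0=0.0), alpha, power)

# Illustrative parameter values only (not from the paper).
example = dict(n0=50_000, n1=10_000, n2=10_000, n3=5_000,
               mu_C1=1.0, mu_I1=1.1, mu_C2=1.0, mu_I2=1.2,
               mu_C3=1.0, mu_Iphi=1.1, mu_Ipsi=1.2,
               var_C0=4.0, var_C1=4.0, var_I1=4.0, var_C2=4.0,
               var_I2=4.0, var_C3=4.0, var_Iphi=4.0, var_Ipsi=4.0)
print(setup2(example), setup3(example))
```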
Setup 4. The setup is the odd one out as it has four analysis groups. Two of the analysis groups ($A_1$ and $A_2$) are drawn from those who qualified for strategy 1 and are allocated to the first randomization group, and the other two ($B_1$ and $B_2$) are drawn from those who qualified for strategy 2 and are allocated to the second randomization group. The resulting sample sizes are

  $n_{A_1} = n_{A_2} = (n_1 + n_3)/4, \qquad n_{B_1} = n_{B_2} = (n_2 + n_3)/4$.   (16)

The metric mean and variance for group $A_1$, for example, are

  $\mu_{A_1} = \dfrac{n_1 \mu_{C_1} + n_3 \mu_{C_3}}{n_1 + n_3}, \qquad \sigma^2_{A_1} = \dfrac{n_1 \sigma^2_{C_1} + n_3 \sigma^2_{C_3}}{n_1 + n_3}$.   (17)
As the setup takes the difference of differences in the metric (i.e. the difference between groups $B_2$ and $B_1$, minus the difference between groups $A_2$ and $A_1$), the actual effect size is

  $\Delta_{S_4} = (\mu_{B_2} - \mu_{B_1}) - (\mu_{A_2} - \mu_{A_1}) = \dfrac{n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_\psi} - \mu_{C_3})}{n_2 + n_3} - \dfrac{n_1(\mu_{I_1} - \mu_{C_1}) + n_3(\mu_{I_\phi} - \mu_{C_3})}{n_1 + n_3}$.   (18)

The MDE for Setup 4 is similar to that specified in the RHS of Ineq. (5), albeit with more groups:

  $\theta^*_{S_4} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\sigma^2_{A_1}/n_{A_1} + \sigma^2_{A_2}/n_{A_2} + \sigma^2_{B_1}/n_{B_1} + \sigma^2_{B_2}/n_{B_2}}$
  $\phantom{\theta^*_{S_4}} = 2\,(z_{1-\alpha/2} - z_{1-\pi_{\min}}) \times \sqrt{\dfrac{n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_3(\sigma^2_{C_3} + \sigma^2_{I_\phi})}{(n_1 + n_3)^2} + \dfrac{n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{C_3} + \sigma^2_{I_\psi})}{(n_2 + n_3)^2}}$.   (19)
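A corresponding sketch for Setup 4, following Eqs. (16)-(19); the naming is again ours, and the parameter dictionary follows the same convention as in the previous sketch.

```python
from scipy.stats import norm

def z_factor(alpha=0.05, power=0.80):
    return norm.ppf(1 - alpha / 2) - norm.ppf(1 - power)

def setup4(p, alpha=0.05, power=0.80):
    """Actual effect and MDE for Setup 4 (dual control, Eqs. (18)-(19))."""
    n13, n23 = p["n1"] + p["n3"], p["n2"] + p["n3"]
    # Difference of the two within-cell incrementality estimates (Eq. (18)).
    delta = ((p["n2"] * (p["mu_I2"] - p["mu_C2"])
              + p["n3"] * (p["mu_Ipsi"] - p["mu_C3"])) / n23
             - (p["n1"] * (p["mu_I1"] - p["mu_C1"])
                + p["n3"] * (p["mu_Iphi"] - p["mu_C3"])) / n13)
    # Each analysis group holds a quarter of its strategy's qualifiers (Eq. (16)).
    var_dbar = 4 * ((p["n1"] * (p["var_C1"] + p["var_I1"])
                     + p["n3"] * (p["var_C3"] + p["var_Iphi"])) / n13**2
                    + (p["n2"] * (p["var_C2"] + p["var_I2"])
                       + p["n3"] * (p["var_C3"] + p["var_Ipsi"])) / n23**2)
    theta_star = z_factor(alpha, power) * var_dbar ** 0.5
    return delta, theta_star
```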
3.2 Is Metric Dilution Desirable?
The use of responses from users who do not qualify for any of the strategies we are comparing, an act known as metric dilution, has stirred countless debates in experimentation teams. On one hand, responses from these users make any treatment effect less pronounced by contributing exactly zero; on the other hand, it might be necessary as one does not know who actually qualifies [5], or it might be desirable as these users can be leveraged to reduce the variance of the treatment effect estimator [2].

Here, we are interested in whether we should engage in the act of dilution given the assumed user responses prior to an experiment. This can be clarified by understanding the conditions where Setup 3 would emerge superior (as defined in Sec. 2.3) to Setup 2. By inspecting Eqs. (11) and (14), it is clear that $\Delta_{S_3} > \Delta_{S_2}$ if $n_0 > 0$. Thus, Setup 3 is superior to Setup 2 under the first criterion if $\theta^*_{S_3} < \theta^*_{S_2}$, which is the case if $\sigma^2_{C_0}$, the metric variance of users who qualify for neither strategy, is large. This can be shown by substituting Eqs. (12) and (15) into the $\theta^*$-inequality and rearranging the terms to obtain:

  $\dfrac{\big(n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_\phi} + \sigma^2_{I_\psi})\big) \cdot \big(n_0 + 2(n_1 + n_2 + n_3)\big)}{2(n_1 + n_2 + n_3)^2} < \sigma^2_{C_0}$.   (20)

If we assume the metric variance does not vary much for users who qualified for at least one strategy, i.e. $\sigma^2_{I_1} \approx \sigma^2_{C_1} \approx \cdots \approx \sigma^2_{I_\psi} \approx \sigma^2_S$, Ineq. (20) can then be simplified as

  $\sigma^2_S \left(\dfrac{n_0}{n_1 + n_2 + n_3} + 2\right) < \sigma^2_{C_0}$,   (21)

which can be used to quickly determine whether one should consider dilution at all.

If Ineq. (20) is not true (i.e. $\theta^*_{S_3} \geq \theta^*_{S_2}$), we should then consider when the second criterion (i.e. $\Delta_{S_3} - \Delta_{S_2} > \theta^*_{S_3} - \theta^*_{S_2}$) is met. Writing $\eta = n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_\psi} - \mu_{I_\phi})$, $\xi = n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_2(\sigma^2_{I_2} + \sigma^2_{C_2}) + n_3(\sigma^2_{I_\psi} + \sigma^2_{I_\phi})$, and $z = z_{1-\alpha/2} - z_{1-\pi_{\min}}$, we can substitute Eqs. (11), (12), (14), (15) into the inequality and rearrange to obtain

  $\dfrac{n_1 + n_2 + n_3}{n_0} \sqrt{2 n_0 \sigma^2_{C_0} + \xi} \;>\; \dfrac{n_0 + n_1 + n_2 + n_3}{n_0} \sqrt{\xi} - \dfrac{\eta}{\sqrt{2}\, z}$.   (22)

As the LHS of Ineq. (22) is always positive, Setup 3 is superior if the RHS $\leq 0$.
Noting $\Delta_{S_3} = \eta / (n_1 + n_2 + n_3)$ and $\theta^*_{S_3} = \sqrt{2} \cdot z \cdot \sqrt{\xi} / (n_1 + n_2 + n_3)$, this trivial case is satisfied if $\big((n_0 + n_1 + n_2 + n_3)/n_0\big) \cdot \theta^*_{S_3} \leq \Delta_{S_3}$.

If the RHS of Ineq. (22) is positive, we can safely square both sides and use the identities for $\Delta_{S_3}$ and $\theta^*_{S_3}$ to get

  $\dfrac{2 \sigma^2_{C_0}}{n_0} > \left[\left(\theta^*_{S_3} - \Delta_{S_3} + \dfrac{n_1 + n_2 + n_3}{n_0}\, \theta^*_{S_3}\right)^2 - \left(\dfrac{n_1 + n_2 + n_3}{n_0}\, \theta^*_{S_3}\right)^2\right] \bigg/ (2 z^2)$.   (23)

As the LHS is always positive, the second criterion is met if

  $\theta^*_{S_3} \leq \Delta_{S_3}$.   (24)

Note this is a weaker, and thus more easily satisfiable, condition than that introduced in the previous paragraph. This suggests an undiluted experiment setup is always superior to a diluted alternative if the experiment is already adequately powered; introducing any dilution will simply make things worse.

Failing the condition in Ineq. (24), we can always fall back to Ineq. (23). While the inequality operates in squared space, it is essentially comparing the standard error of user group 0 (LHS) — those who qualify for neither strategy — to the gap between the minimum detectable and actual effects ($\theta^*_{S_3} - \Delta_{S_3}$). The gap can be interpreted as the existing noise level; thus a higher standard error means mixing in group 0 users will introduce extra noise, and one is better off without them. Conversely, a smaller standard error means group 0 users can lower the noise level, i.e. stabilize the metric fluctuation, and one should take advantage of them.

To summarize, diluting a personalization strategy experiment setup is not helpful if (i) users who do not qualify for any strategy have a large metric variance (Ineq. (20)) or (ii) the experiment is already adequately powered (Ineq. (24)). It could help if the experiment has not yet gained sufficient power and users who do not qualify for any strategy provide low-variance responses, such that they exhibit a stabilizing effect when included in the analysis (complement of Ineq. (23)).
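The two rules of thumb above translate into a short pre-experiment check. The sketch below is our own helper (not from the paper's released code) and only encodes Ineqs. (21) and (24); when neither rule fires, one still has to fall back to the full comparison via Ineq. (23).

```python
def dilution_helpful(var_s, var_c0, n0, n1, n2, n3,
                     theta_star_s3=None, delta_s3=None):
    """Quick pre-experiment checks from Sec. 3.2.
    Returns False as soon as one rule of thumb says the undiluted Setup 3
    is superior; returns None if neither rule is conclusive."""
    # Ineq. (24): an adequately powered experiment never benefits from dilution.
    if theta_star_s3 is not None and delta_s3 is not None:
        if theta_star_s3 <= delta_s3:
            return False
    # Ineq. (21): high-variance non-qualifying users only add noise.
    if var_s * (n0 / (n1 + n2 + n3) + 2) < var_c0:
        return False
    return None  # inconclusive: compare via Ineq. (23) instead
```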
3.3 Is Dual Control a Better Setup?
Often when advertisers compare two personalization strategies, the question of whether to use a dual control / multi-cell design comes up. Proponents of such an approach celebrate its ability to tell a story by making the incrementality of an individual strategy available, while opponents voice concerns over the complexity of setting up the design. Here we are interested in whether Setup 4 (dual control) is superior to Setup 3 (a simple A/B test) from a power/detectable effect perspective, and if so, under what circumstances.

We first observe that $\theta^*_{S_4} > \theta^*_{S_3}$ is always true, and hence a dual control setup will never be superior to a simpler setup under the first criterion. This can be verified by substituting in Eqs. (19) and (15) and rearranging the terms to show the inequality is equivalent to

  $2\left(\dfrac{n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_3 \sigma^2_{I_\phi} + n_3 \sigma^2_{C_3}}{(n_1 + n_3)^2} + \dfrac{n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3 \sigma^2_{I_\psi} + n_3 \sigma^2_{C_3}}{(n_2 + n_3)^2}\right) > \dfrac{n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_\phi} + \sigma^2_{I_\psi})}{(n_1 + n_2 + n_3)^2}$,   (25)

which is always true given the $n$s are non-negative and the $\sigma^2$s are positive: not only are the coefficients of the $\sigma^2$-terms larger on the LHS than their RHS counterparts, the LHS also carries extra $\sigma^2_{C_3}$ terms with non-negative coefficients and a factor of two.

Moving on to the second evaluation criterion, we recall that Setup 4 is superior if $\Delta_{S_4} - \Delta_{S_3} > \theta^*_{S_4} - \theta^*_{S_3}$, otherwise Setup 3 is superior under the same criterion. The full flexibility of the model can be seen by substituting Eqs. (14), (15), (18), and (19) into the inequality and rearranging to obtain

  $\dfrac{n_1 \dfrac{n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_\psi} - \mu_{C_3})}{n_2 + n_3} - n_2 \dfrac{n_1(\mu_{I_1} - \mu_{C_1}) + n_3(\mu_{I_\phi} - \mu_{C_3})}{n_1 + n_3}}{\sqrt{n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_\phi} + \sigma^2_{I_\psi})}} \;>$
  $\sqrt{2}\, z \left[\sqrt{\dfrac{2\left(1 + \dfrac{n_2}{n_1 + n_3}\right)^2 \big(n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_3(\sigma^2_{C_3} + \sigma^2_{I_\phi})\big) + 2\left(1 + \dfrac{n_1}{n_2 + n_3}\right)^2 \big(n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{C_3} + \sigma^2_{I_\psi})\big)}{n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_\phi} + \sigma^2_{I_\psi})}} - 1\right]$,   (26)

where $z = z_{1-\alpha/2} - z_{1-\pi_{\min}}$.

A key observation from inspecting Ineq. (26) is that the LHS of the inequality scales along $O(\sqrt{n})$, while the RHS remains a constant. This leads to the insight that Setup 4 is more likely to be superior if the $n$s are large. Here we assume the ratio $n_1 : n_2 : n_3$ remains unchanged when we scale the number of samples, an assumption that generally holds when an organization increases its reach while maintaining its user mix. It is worth pointing out that our claim is stronger than that in previous work — we have shown that having a large user base not only fulfills the requirement of running a dual control experiment as described in [5], it also makes a dual control experiment a better setup than its simpler counterparts in terms of apparent and detectable effect sizes.

The scaling relationship can be seen more clearly if we apply some simplifying assumptions to the $\sigma^2$- and $n$-terms. If we assume the metric variances are similar across user groups (i.e. $\sigma^2_{C_1} \approx \sigma^2_{I_1} \approx \cdots \approx \sigma^2_{I_\psi} \approx \sigma^2_S$), the RHS of Ineq. (26) becomes

  $\sqrt{2}\, z \left[\sqrt{\dfrac{2(n_1 + n_2 + n_3)}{n_1 + n_3} + \dfrac{2(n_1 + n_2 + n_3)}{n_2 + n_3}} - 1\right]$,   (27)

which remains a constant if the ratio $n_1 : n_2 : n_3$ remains unchanged. If we assume the numbers of users in groups 1, 2, and 3 are similar (i.e. $n_1 = n_2 = n_3 = n$), the LHS of Ineq. (26) becomes

  $\dfrac{\sqrt{n}\,\big((\mu_{I_2} - \mu_{C_2}) - (\mu_{I_1} - \mu_{C_1}) + \mu_{I_\psi} - \mu_{I_\phi}\big)}{2\sqrt{\sigma^2_{C_1} + \sigma^2_{I_1} + \sigma^2_{C_2} + \sigma^2_{I_2} + \sigma^2_{I_\phi} + \sigma^2_{I_\psi}}}$,   (28)

which clearly scales along $O(\sqrt{n})$.

We conclude the section by providing an indication of what a large $n$ may look like. If we assume both the metric variances and the numbers of users are similar across user groups, we can rearrange Ineq. (26) to make $n$ the subject:

  $n > \big(4\sqrt{3}\,(\sqrt{6} - 1)\, z\big)^2 \dfrac{\sigma^2_S}{\Delta^2}$,   (29)

where $\Delta = (\mu_{I_2} - \mu_{C_2}) - (\mu_{I_1} - \mu_{C_1}) + \mu_{I_\psi} - \mu_{I_\phi}$ is the effect size. With a 5% significance level and 80% power, the first coefficient amounts to around 791, which is roughly 50 times the coefficient one would use to determine the sample size of a simple A/B test [6]. This suggests a dual control setup is perhaps a luxury accessible only to the largest advertising platforms and their top advertisers. For example, consider an experiment to optimize conversion rate where the baselines attain 20% (hence having a metric variance of $0.2(1 - 0.2) = 0.16$): to detect a difference in incrementality of around 0.5% in absolute terms, Ineq. (29) requires more than 5M users in each user group.
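The back-of-the-envelope calculation in Ineq. (29) is easy to script. In the sketch below the function name is ours; the usage example plugs in the 20% baseline and the roughly 0.5% (absolute) effect size from the conversion-rate example above, which reproduces the >5M figure.

```python
from scipy.stats import norm

def min_users_per_group(var_s, delta, alpha=0.05, power=0.80):
    """Rough sample size per user group for a dual control setup (Ineq. (29)),
    assuming equal group sizes and equal metric variances."""
    z = norm.ppf(1 - alpha / 2) - norm.ppf(1 - power)
    coef = (4 * 3**0.5 * (6**0.5 - 1) * z) ** 2   # ~791 at 5% / 80%
    return coef * var_s / delta**2

# 20% baseline conversion rate, 0.5 percentage point effect size.
print(min_users_per_group(0.2 * 0.8, 0.005))   # ~5.1 million users per group
```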
4 VALIDATION VIA SIMULATION
Having performed theoretical calculations for the actual and detectable effects, and derived conditions under which one experiment setup is superior to another, here we verify those calculations using simulation results. We focus on the results presented in Section 3.1, as the rest of the results follow from those calculations.

In each experiment setup evaluation, we randomly select the values of the parameters (i.e. the $\mu$s, $\sigma^2$s, and $n$s), and take 1,000 actual effect samples, each by (i) sampling the responses from the user groups under the specified parameters, (ii) computing the mean for the analysis groups, and (iii) taking the difference of the means.

We also take 100 MDE samples in separate evaluations, each by (i) sampling a critical value under the null hypothesis; (ii) computing the test power under a large number of possible effect sizes, each using the critical value and sampled metric means under the alternative hypothesis; and (iii) searching the effect size space for the value that gives the predefined power. As the power vs. effect size curve is noisy given the use of simulated power samples, we use the bisection algorithm provided by the noisyopt package to perform the search. The algorithm dynamically adjusts the number of samples taken from the same point on the curve to ensure the noise does not send us down the wrong search space.

We expect the mean of the sampled actual effect and MDE to match the theoretical value. To verify this, we perform 1,000 bootstrap resamplings on the samples obtained above to obtain an empirical bootstrap distribution of the sample mean in each evaluation. The 95% bootstrap resampling confidence interval (BRCI) should then contain the theoretical mean 95% of the time. The histogram of the percentile rank of the theoretical quantity in relation to the bootstrap samples across multiple evaluations should also follow a uniform distribution [8].

           Actual effect size     Minimum detectable effect
Setup 1    1049/1099 (95.45%)     66/81 (81.48%)
Setup 2    853/999 (85.38%)       87/106 (82.08%)
Setup 3    922/1099 (83.89%)      93/116 (80.18%)
Setup 4    240/333 (72.07%)       149/185 (80.54%)

Table 2: Number of evaluations where the theoretical value of the quantities (columns) falls within the 95% bootstrap confidence interval for each experiment setup (rows). See Section 4 for a detailed description of the evaluations.

The result is shown in Table 2. One can observe that there are more evaluations with their theoretical quantity lying outside the BRCI than expected. Upon further investigation, we observed a characteristic ∪-shape in the histograms of the percentile ranks for the actual effects. This suggests the bootstrap samples may be under-dispersed but otherwise centered on the theoretical quantities.

We also observed the histograms for MDEs curving upward to the right; this suggests that the theoretical value is a slight overestimate
(of less than 1% relative to the bootstrap mean in all cases). We believe this is likely due to a small bias in the bisection algorithm. The algorithm tests whether the mean of the power samples is less than the target power to decide which half of the search space to continue along. Given we can bisect up to 10 times in each evaluation, it is likely to see a false positive even when we set the significance level for individual comparisons to 1%. This leads to the algorithm favoring a smaller MDE sample. That said, since we have tested a wide range of parameters and the overall bias is small, we are satisfied with the theoretical quantities for experiment design purposes.
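As an illustration of the validation procedure, the sketch below reproduces the actual-effect part of the check for Setup 1: it draws 1,000 simulated actual effects, bootstraps their mean, and compares the 95% interval against the theoretical value from Eq. (7). The parameter values and the assumption of normally distributed user responses are ours, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_iphi, mu_ipsi, sd_iphi, sd_ipsi, n3 = 10.0, 10.5, 4.0, 4.0, 2000
theoretical_delta = mu_ipsi - mu_iphi            # Eq. (7)

# 1,000 actual-effect samples for Setup 1.
effects = []
for _ in range(1000):
    a = rng.normal(mu_iphi, sd_iphi, n3 // 2)    # group 3, treated w/ strategy 1
    b = rng.normal(mu_ipsi, sd_ipsi, n3 // 2)    # group 3, treated w/ strategy 2
    effects.append(b.mean() - a.mean())
effects = np.array(effects)

# 1,000 bootstrap resamplings of the sample mean and its 95% interval (BRCI).
boot = np.array([rng.choice(effects, effects.size).mean() for _ in range(1000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"theoretical {theoretical_delta:.3f}, 95% BRCI ({lo:.3f}, {hi:.3f})")
```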
5 CONCLUSION
We have addressed the problem of comparing experiment designs for personalization strategies by presenting an evaluation framework that allows experimenters to evaluate which experiment setup should be adopted given the situation. The flexible framework can easily be extended to compare setups involving more than two strategies by adding more user groups (i.e. new sets to the Venn diagram in Fig. 1). A new setup can also be incorporated quickly, as it is essentially a different weighting of the user group-scenario combinations shown in Table 1. The framework also allows the development of simple rules of thumb such as:
(i) Metric dilution should never be employed if the experiment already has sufficient power, though it can be useful if the experiment is under-powered and the non-qualifying users provide a "stabilizing effect"; and
(ii) A dual control setup is superior to simpler setups only if one has access to a user base as large as those of the largest organizations.
We have validated the theoretical results via simulations, and made the code available so that practitioners can benefit from the results immediately when designing their upcoming experiments.

ACKNOWLEDGMENTS
The work is partially funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at Imperial and Oxford (StatML.IO) and ASOS.com. The authors thank the anonymous reviewers for providing many improvements to the original manuscript.
REFERENCES
… WSDM '15 (Shanghai, China). 349–358. https://doi.org/10.1145/2684822.2685307
[3] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-experiment Data. In WSDM '13 (Rome, Italy). 123–132. https://doi.org/10.1145/2433396.2433413
[4] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In KDD '17