Unbiased variance reduction in randomized experiments
Reza Hosseini, Amir Najmi
Google Inc.
Address: 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
Email: [email protected], [email protected]
Abstract
This paper develops a flexible method for decreasing the variance of estimators of complex experiment effect metrics (e.g. ratio metrics) while retaining asymptotic unbiasedness. The method uses auxiliary information about the experiment units to decrease the variance, and can incorporate almost any predictive model (e.g. linear regression, regularization, neural networks) to adjust the estimators. The adjustment involves free parameters which can be optimized to achieve the smallest variance given the predictive model's performance. We also approximate the achievable reduction in variance mathematically in fairly general settings. Finally, we use simulations to confirm that the method works.
1 Introduction

Online experiments play a pivotal role in decision making for many technology companies and are widely used across the industry. From a business perspective, decision makers like to have the fastest results, with the smallest sample of impacted users. Therefore there has recently been some effort across the industry to develop experiment effect estimation methods which reduce the variance, e.g. Deng et al. (2013).

Most variance reduction methods come with the price of introducing bias (which persists even asymptotically) or of making restrictive modeling assumptions. One exception is discussed in Chapter 7 of Imbens and Rubin (2015), among other references: if one defines the experiment effect to be the simple expected difference between treatment (denoted by 1) and control (denoted by 0),

    τ = E{Y(1)} − E{Y(0)},   (1)

and X are some auxiliary variables which are independent of the assignment procedure (the assignment of units to control and treatment), then the estimated coefficient of τ from the linear regression equation Y = α + τW + Xβ + ε, where W = 1 for treatment units and W = 0 for control units, is an asymptotically unbiased estimator of τ with known asymptotic variance. The proof of this result depends heavily on the properties of the squared loss function, as well as on the simplicity of the experiment effect considered, as opposed to more complex metrics, e.g. the ratio metric τ = E{Y(1)}/E{Y(0)}. However, in most cases in the technology industry the metrics of interest are ratios rather than simple differences. (One reason is that it is easier to think about growth as a percent increase.) Another limitation of this approach is that estimating linear regression coefficients can be difficult when dealing with a large number of weak predictors, a setting which could be handled using, say, regularization methods (e.g. see Hastie et al. (2009)). Our focus in this work is to develop methods which remove these limitations.

A notable work regarding the simple difference (Equation 1) is the seminal work of Freedman (2008a,b), which evaluates regression adjustments using Neyman's nonparametric model (Neyman (1923)). Freedman argues that regression adjustment may introduce bias (although it is asymptotically unbiased, as discussed above), and that the adjustment may or may not improve precision. As mentioned, here we are mainly concerned with more complex metrics (beyond the simple difference), and our approach is not a regression adjustment but rather uses a prediction model to adjust the estimators. It is worth noting that using extensions of linear regression to other settings, e.g. generalized linear models, would sacrifice even the asymptotic unbiasedness, and that is one of the reasons we do not use regression adjustments in the general setting.

A related paper to this work is Deng et al. (2013), which suggests using a control variable (most of the time pre-treatment data on the same variable) which is uncorrelated with the assignment mechanism to decrease the variance while maintaining unbiasedness for the simple difference metric τ = E{Y(1)} − E{Y(0)}. We know that an unbiased estimate of this effect is Δ = Ȳ_t − Ȳ_c, where t denotes the treatment arm and c the control arm. Then assume we are given a (univariate) control variable X for which E{X_c} = E{X_t}. Then

    Δ_cv = (Ȳ_t − θ X̄_t) − (Ȳ_c − θ X̄_c)

is an unbiased estimator of τ.
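To make this concrete, here is a minimal sketch of this control-variate adjustment (our illustration, not the authors' code); the synthetic data are placeholders, and θ is set to the variance-minimizing pooled value discussed next.

```python
import numpy as np

def cuped_adjusted_diff(y_t, y_c, x_t, x_c):
    """Control-variate adjusted difference in means (Deng et al. 2013 style).

    y_t, y_c: responses on the treatment/control arms.
    x_t, x_c: pre-experiment covariate with E{X_t} = E{X_c}.
    A single theta is estimated from pooled data; the estimator
    stays unbiased for any fixed theta since E{X_t - X_c} = 0.
    """
    y = np.concatenate([y_t, y_c])
    x = np.concatenate([x_t, x_c])
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)  # variance-minimizing choice
    return (y_t.mean() - theta * x_t.mean()) - (y_c.mean() - theta * x_c.mean())

# toy usage with synthetic data; the true effect is 0.5
rng = np.random.default_rng(0)
x_t, x_c = rng.normal(10, 2, 5000), rng.normal(10, 2, 5000)
y_t = 0.5 + 0.8 * x_t + rng.normal(0, 1, 5000)
y_c = 0.8 * x_c + rng.normal(0, 1, 5000)
print(cuped_adjusted_diff(y_t, y_c, x_t, x_c))
```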
Moreover, if X and Y have non-zero correlation, then var{Y − θX} is minimized for θ = cov(Y, X)/var{X}, in which case the variance shrinks by a multiplicative factor of (1 − ρ²), where ρ is the correlation between Y and X. Of course, here we are not dealing with a single variable Y but rather with Y_t and Y_c, while we can only pick one θ. Therefore Deng et al. (2013) suggest picking a common θ by using all the data (which still works well in practice). They also discuss the application of this method to some other, more complex metrics, for example CTR (click-through rate), using the delta method in Appendix B of their paper. Our method is more general in several ways. First, we replace the variable X with a predictive model which can take multiple inputs. More importantly, it allows the experiment effect to take the much more general form τ_g = g(E{Y(1)}) − g(E{Y(0)}) for an arbitrary differentiable function g (with domain the support of the variables involved); in particular, for g = log this captures the ratio metric, since

    τ_log = log(E{Y(1)}) − log(E{Y(0)}) = log(E{Y(1)}/E{Y(0)}).

In this case, we also provide an exact equation for picking the parameter θ. Moreover, we discuss further generalizations of the method to more complex metrics, for example the sum ratio and the ratio of mean ratios, which involve more parameters, and derive numerical equations for their optimization.

The paper is organized as follows. Section 2 discusses the various ways the experiment effect can be defined and the connections between the definitions. Section 3 develops the mathematical theory for our method. Section 4 performs a simulation study to confirm that the method retains unbiasedness while decreasing the variance (as much as expected by the theory). Finally, Section 5 summarizes the results and discusses further potential applications and extensions.

2 Defining the experiment effect

The experiment effect, denoted here by τ, can be defined in many different ways depending on the application. Clearly the estimation method and the confidence interval calculation also depend on this choice, and some definitions may have advantages over others from a theoretical point of view. The most common definition is that of the simple difference, τ = E{Y(1)} − E{Y(0)}, which can be estimated using OLS; (asymptotically unbiased) estimators with smaller variance can be obtained by incorporating predictor variables X which are independent of the assignment process W into the regression, as discussed in Chapter 7 of Imbens and Rubin (2015). This simple difference has been widely considered in the literature, from the seminal work of Neyman (1923) to, more recently, Freedman (2008a,b) and Miratrix et al. (2013).

Here we introduce some general classes of experiment effects and discuss their interpretation and relationships. We denote the auxiliary information available to us by X and assume X is independent of the assignment mechanism. X is a collection of predictive variables which is used in a prediction model to predict the responses of interest appearing in the experiment effect definition.

One variation we can consider is obtained by taking the expectation over X outside:

    τ_X = E_X{E{Y(1) | X}} − E_X{E{Y(0) | X}},

which is equal to τ = E{Y(1)} − E{Y(0)} by the Law of Iterated Expectation.

Here we introduce a generalization of the simple experiment effect defined above.
For any differentiable increasing function g, we consider

    τ_g = g(E{Y(1)}) − g(E{Y(0)}),
    τ_g,X = E_X{g(E{Y(1) | X}) − g(E{Y(0) | X})}.

Note that in general τ_g,X ≠ τ_g. In most technology applications, as far as we know, metrics such as τ_g,X which depend on auxiliary variables are not considered. However, this is a common choice in the statistical literature, in observational studies and in clinical studies. For example, in Chapter 8 of Imbens and Rubin (2015) the authors mention that in general we can consider τ to be a function of Y(1), Y(0), X, W, i.e. τ = τ(Y(1), Y(0), X, W), rather than τ = τ(Y(1), Y(0)). The reason many authors consider this is the simplification obtained by assuming models; τ can then become a parameter of a well-studied model, and we give an example of that here for generalized linear models.

Let us assume that the dynamics of the system and the experiment effect can be described by a generalized linear model (GLM) with a given link function g:

    g(E{Y | x, w}) = α + (x − µ_X)β_w + λw,

where µ_X = E{X} and β_w might depend on w. Then λ = τ_g,X. This is because

    τ_g,X = E_X{g(E{Y(1) | X}) − g(E{Y(0) | X})}
          = E_X{α + (X − µ_X)β_1 + λ·1 − (α + (X − µ_X)β_0 + λ·0)}
          = E_X{(X − µ_X)(β_1 − β_0) + λ}
          = E_X{X − µ_X}(β_1 − β_0) + λ = λ.

This implies that if we choose g to be the link function of the GLM, then the GLM coefficient λ is equal to the experiment effect τ_g,X, and not necessarily τ_g. Therefore one cannot extend the regression adjustment approach (for which the estimated regression coefficient remains an asymptotically unbiased estimator of τ) to the general case if the purpose is to estimate τ_g. Note that the arbitrary dependence of τ_g,X on X is often an undesirable property for decision makers. Unlike regression adjustment, our adjustment method keeps the estimator asymptotically unbiased, as shown in the next section.

In the above we discussed the simple difference metric τ and its generalization τ_g = g(E{Y(1)}) − g(E{Y(0)}). However, we can also consider ratio effects, e.g.

    τ_r = E{Y(1)}/E{Y(0)}.

Note that g(τ_r) = τ_g for g = log, which means we can infer about the ratio metric using an appropriate g.

All the experiment effects above are functions of the means of the treatment and control variables. While allowing an arbitrary g transformation expands the class of estimators significantly, there are other metrics which are not covered, and we discuss them here.

One popular metric considered in online experiments is the "sum ratio". Consider an experiment run on two equally sized arms (with the same number of potential users), with t denoting the treatment and c the control; the sum ratio is estimated by

    τ̂ = Σ_{i∈t} Y_i / Σ_{i∈c} Y_i = (n_t/n_c) Ȳ_t/Ȳ_c,

where n_t and n_c are the numbers of users with impressions on each arm and Ȳ_t, Ȳ_c are the sample averages on each arm. Usually Y_i is a non-negative variable which can attain zero, such as watch time or money spent. Note that not all potential users appear in the sum, as some might not have any impressions of the feature. Of course, τ̂ is an estimator, and one needs to define a parameter which τ̂ is estimating.
In order to find a reasonable population parameter which τ̂ is estimating, note that τ̂ can be written as

    τ̂ = ((n_t/N)/(n_c/N)) Ȳ_t/Ȳ_c = (p̄_t/p̄_c) Ȳ_t/Ȳ_c,

where N is the (usually unobserved) number of potential users on each arm and p̄_t, p̄_c are the estimated probabilities that a user has an impression on the treatment and control arms respectively. Defining I(W) to be an indicator variable denoting whether a user on arm W has an impression of the feature (I(W) = 1) or not (I(W) = 0), the corresponding population parameter is

    τ = (E{I(1)}/E{I(0)}) × (E{Y(1) | I(1) = 1}/E{Y(0) | I(0) = 1}),

and we have τ̂ → τ as desired (by the Delta Theorem, see DasGupta (2008)). Note that with g we can also consider

    τ_g = g(E{I(1)} E{Y(1) | I(1) = 1}) − g(E{I(0)} E{Y(0) | I(0) = 1}),

which for g = log is also equal to

    τ_g = g(E{I(1)}) − g(E{I(0)}) + g(E{Y(1) | I(1) = 1}) − g(E{Y(0) | I(0) = 1}).

Another popular metric considered in online experiments is constructed by comparing the ratio of two random variables on the treatment arm to that on the control arm. More formally, consider two response variables Y_i and Z_i and the ratio of ratios

    τ̂ = (Σ_{i∈t} Y_i / Σ_{i∈t} Z_i) / (Σ_{i∈c} Y_i / Σ_{i∈c} Z_i) = (Ȳ_t/Z̄_t)/(Ȳ_c/Z̄_c),

which is an estimator for

    τ = (E{Y(1)}/E{Z(1)}) / (E{Y(0)}/E{Z(0)}).

We could also consider the transformed version for g = log,

    τ_g = g(E{Y(1)}) − g(E{Y(0)}) − (g(E{Z(1)}) − g(E{Z(0)})),

which has two components, each being a difference in one response. The method developed here also applies to these more complex cases (sum ratio and ratio of mean ratios). In the discussion section we give an example of a metric to which our method does not apply directly.

3 Mathematical theory

In this section, we develop the mathematical theory for reducing the variance for the general metrics discussed in the introduction. Subsection 3.1 finds necessary and sufficient conditions for the adjustments to leave the estimators asymptotically unbiased. Subsection 3.2 uses asymptotic theory to approximate the achievable variance reduction for one of the adjustment methods, which requires minimal assumptions on the predictive model used for adjustment.
3.1 Retaining asymptotic unbiasedness

In this subsection, we introduce various adjustments and the conditions needed for each adjustment to avoid introducing bias asymptotically. We present two general ideas for adjusting estimators while keeping them unbiased. Parts (a) and (b) of Lemma 3.1 use the fact that adding a random variable with expectation 0 to an estimator does not introduce bias. Parts (c) and (d) note that if we add a statistic with the same expected value to the estimator, we can avoid bias by weighting the two terms appropriately.

For simplicity we introduce the following notation. Consider Y to be a response of interest with two arms, control denoted by c and treatment denoted by t, and let N_t, N_c be the sample sizes on the respective arms. Then we let

    Ȳ_t = (1/N_t) Σ_{i∈t} Y_i,  Ȳ_c = (1/N_c) Σ_{i∈c} Y_i,

which are the average values of the response Y on the treatment and control arms respectively. Now assume X_i = (X_{1i}, ..., X_{pi}) is a set of p predictive variables, and consider a function h which depends only on X_i and the experiment arm:

    h: R^p × {0, 1} → R,

where 0 denotes control and 1 denotes treatment. For example, h(X_i, 0) is the value of h for given predictors X_i for a unit on the control arm. Then we use the following shorthand notation:

    h̄_t(w) = (1/N_t) Σ_{i∈t} h(X_i, w),  h̄_c(w) = (1/N_c) Σ_{i∈c} h(X_i, w),  w = 0, 1.

Also consider a simpler function f which is a function of the predictors only (and not of the experiment arm), f: R^p → R, with the shorthand notation

    f̄_t = (1/N_t) Σ_{i∈t} f(X_i),  f̄_c = (1/N_c) Σ_{i∈c} f(X_i).

Note that while the theory in this section does not require h, f to have any particular properties, in order to get good adjustments in practice h, f are typically prediction functions which return the expected value of the response given the predictor variables.

Lemma 3.1 (Unbiased Adjustment Lemma):
Suppose τ_g = g(E{Y(1)}) − g(E{Y(0)}) is the experiment effect of interest. Also suppose θ ∈ R and γ_1 + γ_2 = 1, γ_1, γ_2 ∈ R. Consider the estimator τ̂_g = g(Ȳ_t) − g(Ȳ_c), which is asymptotically unbiased in general (and unbiased if g(x) = x for all x ∈ R). Now assume h(X_i, W_i) and f(X_i) are imputation functions. Then we can consider these adjusted estimators:

(a) τ̂_adj(w) = τ̂_g − θ (g(h̄_t(w)) − g(h̄_c(w))), w = 0, 1;
(b) τ̂_adj = τ̂_g − θ (g(f̄_t) − g(f̄_c));
(c) τ̂_adj = γ_1 τ̂_g + γ_2 (g(h̄_c(1)) − g(h̄_t(0)));
(d) τ̂_adj = γ_1 (g(Ȳ_t) − g(h̄_c(0))) + γ_2 (g(h̄_t(1)) − g(Ȳ_c)).

Then:

(a) τ̂_adj(w) is asymptotically unbiased if E{g(h̄_t(w))} − E{g(h̄_c(w))} → 0, which holds in particular if E{h̄_t(w)} − E{h̄_c(w)} → 0.
(b) τ̂_adj is asymptotically unbiased if E{g(f̄_t)} − E{g(f̄_c)} → 0, which holds in particular if E{f̄_t} − E{f̄_c} → 0.
(c) τ̂_adj is asymptotically unbiased if E{g(h̄_c(1)) − g(h̄_t(0))} → τ_g.
(d) τ̂_adj is asymptotically unbiased if E{g(Ȳ_t) − g(h̄_c(0))} → τ_g and E{g(h̄_t(1)) − g(Ȳ_c)} → τ_g.

Proof. (a), (b): The first part of each is obvious; the second part follows from the Delta Theorem, see e.g. DasGupta (2008). (c), (d): Straightforward from the assumptions and the linearity of expectation.

Remark. Note that the conditions in (c) and (d) are in general much stronger than those in (a) and (b).
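As an illustration, here is a minimal sketch of the part (b) adjustment for a general g; the prediction arrays and the fixed θ are placeholders (the optimal θ is derived in Subsection 3.2).

```python
import numpy as np

def adjusted_effect(y_t, y_c, f_t, f_c, g=np.log, theta=1.0):
    """Lemma 3.1, part (b): tau_adj = tau_g_hat - theta * (g(f_bar_t) - g(f_bar_c)).

    y_t, y_c: observed responses on the treatment/control arms.
    f_t, f_c: model predictions f(X_i) for units on each arm; f must
              not use the arm label, so that under randomization
              E{f_bar_t} - E{f_bar_c} -> 0 and no bias is introduced.
    """
    tau_g_hat = g(y_t.mean()) - g(y_c.mean())   # raw estimator
    adjustment = g(f_t.mean()) - g(f_c.mean())  # asymptotically mean-zero term
    return tau_g_hat - theta * adjustment
```

For g = log, exponentiating the output gives the multiplicative form τ̂ × (f̄_t/f̄_c)^{−θ} of Corollary 3.1.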
Remark on the generality of the method in parts (a), (b).
Parts (a) and (b) open the door to a very flexible paradigm for variance reduction with a very wide class of predictive models, since E{h̄_t(w)} − E{h̄_c(w)} → 0 holds for almost any predictive model, e.g. generalized linear models, regularization methods, etc.

Remark on related alternatives to (a) and (b).
We can consider related alternatives to (a) and (b). For example, consider

    τ̂_adj = τ̂_g − Σ_{w∈{0,1}} θ_w (g(h̄_t(w)) − g(h̄_c(w))).

In this case we are adjusting the estimator by removing two terms which have asymptotic expectation zero: one is the difference of the g-transformed average model predictions on the two arms assuming both were control, and the other is the same difference assuming both were treatment. Another alternative can be obtained by considering

    τ̂_adj = τ̂_g − θ [g(γ_0 h̄_t(0) + γ_1 h̄_t(1)) − g(γ_0 h̄_c(0) + γ_1 h̄_c(1))]

for γ_0 + γ_1 = 1, γ_0, γ_1 ≥ 0. Now, if we define f(X_i) = γ_0 h(X_i, 0) + γ_1 h(X_i, 1), which is a weighted average of the predictions of h on control and treatment, we get

    τ̂_adj = τ̂_g − θ (g(f̄_t) − g(f̄_c)),

which is a special case of part (b). We can interpret f as a prediction function which tries to predict the value of the response well on both arms. This motivates fitting a prediction function to all the data (experiment and control), ignoring the arm label, as suggested by Deng et al. (2013).

Remark on training the data.
Note that even though in part (a) for w = 0 the model predicts values assuming the unit is on the control arm, we are not required to use only the control data; we can use all available data to fit a model and then predict as if all units were on the control arm.

The following corollary explicitly states the case of the ratio metric, by considering an appropriate g.

Corollary 3.1:
Suppose that for a non-negative valued response Y with positive expectation we are interested in τ_r = E{Y(1)}/E{Y(0)}. Then τ̂ = Ȳ_t/Ȳ_c is an asymptotically unbiased estimator of τ_r. Also:

(a) For either of w = 0, 1,

    τ̂_adj(w) = τ̂ × (h̄_t(w)/h̄_c(w))^{−θ}

is asymptotically unbiased if E{h̄_t(w)/h̄_c(w)} → 1.

(b) τ̂_adj = τ̂ × (f̄_t/f̄_c)^{−θ} is asymptotically unbiased if E{f̄_t/f̄_c} → 1.

(c) τ̂_adj = τ̂^{γ_1} × (h̄_c(1)/h̄_t(0))^{γ_2}, where γ_1 + γ_2 = 1, is asymptotically unbiased if E{h̄_c(1)/h̄_t(0)} → τ_r.

(d) τ̂_adj = (Ȳ_t/h̄_c(0))^{γ_1} (h̄_t(1)/Ȳ_c)^{γ_2}, where γ_1 + γ_2 = 1, is asymptotically unbiased if E{Ȳ_t/h̄_c(0)}, E{h̄_t(1)/Ȳ_c} → τ_r.

Proof.
In order to infer about τ_r we can infer about log(τ_r) = log(E{Y(1)}) − log(E{Y(0)}), which is the same as τ_g = g(E{Y(1)}) − g(E{Y(0)}) with g = log, and apply Lemma 3.1.

3.2 Reduction optimization and approximating the achievable reduction

This subsection uses asymptotic theory to approximate the reduction in the estimator variance achieved by the methods suggested above. We only work out the case of parts (a) and (b) of Lemma 3.1:

    τ̂_adj(w) = τ̂_g − θ (g(h̄_t(w)) − g(h̄_c(w))), w = 0, 1,
    τ̂_adj = τ̂_g − θ (g(f̄_t) − g(f̄_c)).

In this case we find an optimal θ (as a function of g and the predictive power of h) which achieves the maximum variance reduction. We do not pursue estimators of types (c) and (d), as they require much stronger assumptions on the model to retain unbiasedness. The main result of this subsection is given in Theorem 3.2, which relies on the Delta Theorem in the multivariate case (see DasGupta (2008)). Much more general results are proved in Section 3.3; however, the simpler cases in this section suffice for most applications, and we are able to provide more detailed solutions for them.

First, in Theorem 3.1, we find the minimizer θ of the variance of statistics of the form T = g(Y) − θ g(H). This is already a generalization of the main idea in Deng et al. (2013). However, our adjusted estimators discussed in Lemma 3.1 are more complex and involve two such differences added to each other, i.e. they are of the form

    T = g(Y) − θ g(H) − (g(Y*) − θ g(H*)),

and Theorem 3.2 optimizes θ for this case.

Theorem 3.1:
Consider the random variable T = g(Y) − θ g(H), where H, Y are random variables with finite moments, g is a real-valued differentiable function and g′(µ_H) ≠ 0. Also let ρ = cor(Y, H). Then:

(a) argmin_θ var{T} ≈ (g′(µ_Y)/g′(µ_H)) cov(Y, H)/σ_H².

(b) The minimum is then approximately (g′(µ_Y) σ_Y)² (1 − ρ²). Moreover, at the minimizing θ,

    var{g(Y) − θ g(H)}/var{g(Y)} ≈ 1 − ρ²,

which does not depend on g.

Proof.
See Appendix.
Corollary 3.2:
For g = log, the argmin is equal to (µ_H/µ_Y) cov(Y, H)/σ_H².
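In sample terms this argmin can be estimated by plugging in empirical moments; a minimal sketch (the moment estimators are an illustrative choice):

```python
import numpy as np

def optimal_theta_log(y, h):
    """Corollary 3.2 sketch: variance-minimizing theta for log(Y) - theta*log(H),
    estimated by plugging sample moments into (mu_H / mu_Y) * cov(Y, H) / var(H).

    y, h: paired samples of the response and the prediction."""
    cov_yh = np.cov(y, h, ddof=1)[0, 1]
    return (h.mean() / y.mean()) * cov_yh / np.var(h, ddof=1)
```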
Theorem 3.2:

Consider the random variables

    T_0 = g(Y) − g(Y*),
    T_1 = g(Y) − θ g(H),
    T_2 = g(Y*) − θ g(H*),
    T = T_1 − T_2 = g(Y) − θ g(H) − (g(Y*) − θ g(H*)),

where Y, H, Y*, H* are random variables with finite moments and g is a real-valued differentiable function. Let µ = (µ_Y, µ_H, µ_Y*, µ_H*) be the mean vector and (σ_Y, σ_H, σ_Y*, σ_H*) the standard deviation vector. Also assume all pairwise covariances are zero except cov(Y, H) and cov(Y*, H*), and let ρ = cor(Y, H) and ρ* = cor(Y*, H*). Then:

(a) θ_0 := argmin_θ var{T} ≈ [g′(µ_H) g′(µ_Y) cov(Y, H) + g′(µ_H*) g′(µ_Y*) cov(Y*, H*)] / [g′(µ_H)² σ_H² + g′(µ_H*)² σ_H*²].

For g = identity, θ_0 = (cov(Y, H) + cov(Y*, H*)) / (σ_H² + σ_H*²). For g = log,

    θ_0 = [cov(Y, H)/(µ_H µ_Y) + cov(Y*, H*)/(µ_H* µ_Y*)] / [σ_H²/µ_H² + σ_H*²/µ_H*²].

(b) The minimum is then approximately g′(µ_Y)² σ_Y² + g′(µ_Y*)² σ_Y*² − δ, where

    δ = [g′(µ_H) g′(µ_Y) cov(Y, H) + g′(µ_H*) g′(µ_Y*) cov(Y*, H*)]² / [g′(µ_H)² σ_H² + g′(µ_H*)² σ_H*²].

Therefore var{T_0} − var{T} → δ. For g = identity, δ = (cov(Y, H) + cov(Y*, H*))² / (σ_H² + σ_H*²). For g = log,

    δ = [cov(Y, H)/(µ_H µ_Y) + cov(Y*, H*)/(µ_H* µ_Y*)]² / [σ_H²/µ_H² + σ_H*²/µ_H*²].

(c) min{θ_1, θ_2} ≤ θ_0 ≤ max{θ_1, θ_2}, where θ_i is the argmin when minimizing the variance of T_i, i = 1, 2.

(d) If g′(µ_H)² σ_H² = g′(µ_H*)² σ_H*² and ρ g′(µ_Y) σ_Y = ρ* g′(µ_Y*) σ_Y*, we have

    var{T} = (1 − ρ²) var{g(Y)} + (1 − (ρ*)²) var{g(Y*)} ≤ (1 − min{ρ, ρ*}²) var{T_0}.

Proof.
See Appendix.

Now let us apply this theorem to our problem.
Corollary 3.3 (Variance Reduction):
Suppose τ_g = g(E{Y(1)}) − g(E{Y(0)}) is the experiment effect of interest. Also consider the consistent (raw) estimator τ̂_g = g(Ȳ_t) − g(Ȳ_c). Assume f(X_i) is an interpolation function which does not depend on the experiment arm, e.g. f(X_i) = h(X_i, 0), and consider the adjusted estimator

    τ̂_adj = τ̂_g − θ (g(f̄_t) − g(f̄_c)).

We denote f(X_i) by H_i for simplicity. We also denote (Y_i, f(X_i)) for i ∈ t (chosen at random) by (Y_t, H_t), and similarly (Y_c, H_c) for the control arm. Let ρ_c = cor(Y_c, H_c) and ρ_t = cor(Y_t, H_t). Then:

(a) The optimal θ which minimizes the variance of τ̂_adj is between θ_1 and θ_2, which minimize the variances of g(Ȳ_c) − θ g(f̄_c) and g(Ȳ_t) − θ g(f̄_t) respectively:

    θ_1 = cov(Y_c, H_c)/var{H_c},  θ_2 = cov(Y_t, H_t)/var{H_t}.

(b) Moreover, if we assume g′(µ_{H_c})² σ_{H_c}² = g′(µ_{H_t})² σ_{H_t}² and ρ_c g′(µ_{Y_c}) σ_{Y_c} = ρ_t g′(µ_{Y_t}) σ_{Y_t}, then

    var{τ̂_adj}/var{τ̂_g} ≤ 1 − min{ρ_c, ρ_t}².

Proof.
See Appendix.
Remark.
The assumptions in part (b) are only used to approximate the decrease in the variance and are not needed for the method to work; even if these assumptions do not hold, we can expect a decrease in variance. Moreover, in most practical cases these assumptions approximately hold. For most useful interpolation models we expect even stronger assumptions to hold, e.g. µ_{H_t} = µ_{H_c}, meaning that the model predictions on the treatment and control arms are the same on average; this must be true for most models, considering that the experiment units are chosen at random and independently of the X_i used by the interpolation function f.

The reduction in the confidence interval (CI) length can then be approximated by

    (raw CI length − adj CI length) / raw CI length ≈ 1 − √(1 − ρ²).   (2)

Figure 1 depicts this relationship. For example, with a correlation of 0.6 we can expect a 20% reduction, with a correlation of 0.8 we can expect a 40% reduction, and with a 0.9 correlation we get close to 50%. To get an idea of the cost savings, let us compare this to the reduction achieved by increasing the sample size from n to kn for k > 1. In this case the reduction is

    (raw CI length − adj CI length) / raw CI length = (σ/√n − σ/√(kn)) / (σ/√n) = 1 − 1/√k.

This means that to get a 20% reduction we need to multiply the sample size by about 1.5 (compared with a 0.6 correlation with adjustment), and to get a 40% reduction we need to almost triple the sample size.
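A quick numeric check of these two formulas: equating 1 − 1/√k with 1 − √(1 − ρ²) gives the equivalent sample-size multiplier k = 1/(1 − ρ²).

```python
import numpy as np

# CI-length reduction from adjustment at correlation rho (Equation 2),
# and the sample-size multiplier k that yields the same reduction:
# 1 - 1/sqrt(k) = 1 - sqrt(1 - rho^2)  <=>  k = 1 / (1 - rho^2).
for rho in (0.6, 0.8):
    reduction = 1 - np.sqrt(1 - rho**2)
    k = 1 / (1 - rho**2)  # equivalent sample-size multiplier
    print(f"rho={rho}: CI reduction {reduction:.0%}, equivalent to {k:.2f}x sample size")
```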
Fig. 1: Left: The reduction in CI (confidence interval) length as a function of the cross-validated correlation between predictions and observations. Middle: The reduction in CI length as a function of the increase in sample size. Right: The practical sample size gain compared to the cross-validated correlation.
3.3 Adjusting other complex metrics

In this subsection we discuss using these methods for the other complex metrics discussed at the end of Section 2. First consider the sum ratio,

    τ̂ = Σ_{i∈t} Y_i / Σ_{i∈c} Y_i = (n_t/n_c) Ȳ_t/Ȳ_c,

and note that since we are assuming not all potential users appear in the sum, n_t and n_c can be thought of as random variables themselves (if they are fixed by design, we can simply revert to the mean ratio case). Since the second factor above is simply the estimator of the mean ratio, we can replace it by its adjusted version by multiplying the estimator by (f̄_t/f̄_c)^{−θ}, which leaves the estimator asymptotically unbiased. In summary, we can therefore use the exact same adjustment as for the mean ratio in the sum ratio case. In the appendix we perform simulations for this case.

Next consider the ratio of ratios metric. The estimator in that case, on the log scale, is given by

    τ̂_g = g(Ȳ_t) − g(Z̄_t) − (g(Ȳ_c) − g(Z̄_c)) = g(Ȳ_t) − g(Ȳ_c) − (g(Z̄_t) − g(Z̄_c)).

For general g, the above expression can be considered a g-transformed difference of differences, which has many applications, for example when we compare the treatment-control difference with a baseline. We can adjust each of the two components on the right-hand side individually without introducing bias:

    τ̂_adj = g(Ȳ_t) − g(Ȳ_c) − θ_Y (g(f̄_{Y,t}) − g(f̄_{Y,c})) − (g(Z̄_t) − g(Z̄_c) − θ_Z (g(f̄_{Z,t}) − g(f̄_{Z,c}))).

This implies an adjusted estimator for τ of the form

    (Ȳ_t/Ȳ_c)(Z̄_c/Z̄_t)(f̄_{Y,c}/f̄_{Y,t})^{θ_Y} (f̄_{Z,t}/f̄_{Z,c})^{θ_Z}.

Note that we do not need to require θ_Y = θ_Z, and we could pick the individual θ which minimizes each term separately. However, that would not necessarily give the global argmin. Indeed, it is possible to find the optimal values for this problem and for a much more general class of metrics (using linear algebra), and we will publish those results in future manuscripts.
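A minimal sketch of this per-component adjustment, with each θ picked separately as just described; the helper below uses the g = log θ of Theorem 3.2 under an illustrative equal-arm-size simplification.

```python
import numpy as np

def theta_log(y_t, h_t, y_c, h_c):
    """Per-response theta from Theorem 3.2 with g = log, assuming equal
    arm sizes so the per-arm 1/N factors cancel out of the ratio."""
    num = (np.cov(y_t, h_t, ddof=1)[0, 1] / (h_t.mean() * y_t.mean())
           + np.cov(y_c, h_c, ddof=1)[0, 1] / (h_c.mean() * y_c.mean()))
    den = (np.var(h_t, ddof=1) / h_t.mean() ** 2
           + np.var(h_c, ddof=1) / h_c.mean() ** 2)
    return num / den

def adjusted_ratio_of_ratios(y_t, y_c, z_t, z_c, fy_t, fy_c, fz_t, fz_c):
    """Adjusted estimator for tau = (E{Y(1)}/E{Z(1)}) / (E{Y(0)}/E{Z(0)}).

    fy_*, fz_*: model predictions for the Y and Z responses on each arm."""
    th_y = theta_log(y_t, fy_t, y_c, fy_c)
    th_z = theta_log(z_t, fz_t, z_c, fz_c)
    raw = (y_t.mean() / z_t.mean()) / (y_c.mean() / z_c.mean())
    return (raw * (fy_c.mean() / fy_t.mean()) ** th_y
                * (fz_t.mean() / fz_c.mean()) ** th_z)
```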
4 Simulation study

This section performs a simulation study to test the variance reduction method in part (a) of Lemma 3.1 with g = log, which corresponds to the Mean Ratio metric. In that case the raw estimator is

    τ̂ = (Σ_{i∈t} Y_i(1)/N_t) / (Σ_{i∈c} Y_i(0)/N_c),

and the adjusted estimator is

    τ̂_adj = τ̂ × [(Σ_{i∈t} h(X_i, 0)/N_t) / (Σ_{i∈c} h(X_i, 0)/N_c)]^{−θ}.

We examine the existence of bias in the new method and the amount of variance reduction, which we compare to the theoretical reduction approximated in Corollary 3.3. More simulation studies are performed for various other scenarios, and the results are included in the appendix. We generate a large population of users (1,000,000) to play the role of the whole (super) population, and use smaller sample sizes to test the methods.

In this simulation, we assume a set of users is exposed to different versions of the same feature. Each user may have a different number of impressions (denoted by Imp in the following) of the feature. In each impression the user has a chance to spend a certain amount (denoted by A) of time/money on the feature. The user also has a chance to either have an explicit interaction (denoted by Interact) with the feature or not, each time.

We first simulate a population of 100,000 users with random user attributes (country and gender) and then, for those users, we simulate impression counts using these model assumptions:

    Imp(u) ∼ ZTP(µ), µ = exp(X(u)β + θ_1 W(u) + ε_1(u)),

where W(u) = 1 if u is in the treatment arm and zero otherwise, and ZTP stands for Zero-Truncated Poisson. The reason we use a ZTP instead of a Poisson is to avoid user imbalance in the predictors (country and gender here) between the two arms due to the experiment, which would result in bias. We use country and gender as our predictors. Figure 2 depicts the number of users in each slice of the predictors (country/gender) and in each arm. At the population level we observe approximate balance in how the users in each arm are distributed across country and gender, which means the appearance of users in the data from various slices of the prediction variables is not impacted by the experiment, as desired.

Next, conditional on the impressions, we simulate interactions and amounts. For the amount,

    A(u) | Imp(u) ∼ Exponential(µ), µ = exp(X(u)β + θ_2 W(u) + ε_2(u)),

and for the interaction,

    Interact(u) | Imp(u) ∼ Bernoulli(µ), µ = logit^{−1}(X(u)β + θ_3 W(u) + ε_3(u)),

where logit^{−1}(x) = exp(x)/(1 + exp(x)). Also, ε_1(u), ε_2(u), ε_3(u) model the user random effect and are assumed to be multivariate normal.
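A minimal sketch of this data-generating process; the coefficient values, category sets, effect sizes and random-effect covariance below are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
country = rng.choice(["US", "BR", "JP"], size=n)
gender = rng.choice(["F", "M"], size=n)
w = rng.integers(0, 2, size=n)  # arm indicator W(u): 1 = treatment

# linear predictor X(u)beta from user attributes (illustrative coefficients)
xb = 0.3 * (country == "US") + 0.2 * (gender == "F")
eps = rng.multivariate_normal([0, 0, 0], 0.1 * np.eye(3), size=n)  # user random effects

# zero-truncated Poisson impressions: resample zeros until none remain
mu_imp = np.exp(xb + 0.05 * w + eps[:, 0])
imp = rng.poisson(mu_imp)
while (zero := imp == 0).any():
    imp[zero] = rng.poisson(mu_imp[zero])

amount = rng.exponential(np.exp(xb + 0.05 * w + eps[:, 1]))               # A(u) | Imp(u)
interact = rng.binomial(1, 1 / (1 + np.exp(-(xb + 0.05 * w + eps[:, 2]))))  # Interact(u) | Imp(u)
```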
Fig. 2: Check for population imbalance across predictors.

We consider three response variables (impressions, amount and interaction) and the predictors country and gender. To test our method, we consider the Mean Ratio metric for each response variable, τ = E{Y(1)}/E{Y(0)}, defined over the (super) population of all eligible units, with t, c denoting treatment and control respectively. Then, given data from a randomized experiment, we can estimate τ by the following unbiased estimator, which we call the raw method:

    raw estimate = (Σ_{i∈t} y_i/n_t) / (Σ_{i∈c} y_i/n_c),

where n_t, n_c denote the sample sizes on the treatment and control arms respectively.

To calculate adjusted estimators, we use linear regression to fit a predictive model for each response. We fit three regression models for each response:

(i) using control data only: the model is fitted to the control arm data only;
(ii) using all data (including treatment data): the model is fitted to both arms, using the experiment label as a predictor; however, when predicting, we assume all units are on the control arm;
(iii) using all data but dropping the experiment label as a predictor;

and we use the multiplicative adjustment from Corollary 3.1, part (a):

    a = (Σ_{i∈t} h(x_i; 0)/n_t) / (Σ_{i∈c} h(x_i; 0)/n_c),

where h denotes the model prediction function and x_i the values of the predictors for user i. Note that the 0 in the second argument of h(x_i; 0) prescribes that the model assume the user is on the control arm. This is automatic for (i), but for (ii), where the experiment arm enters the model as a categorical variable, when interpolating we need to make sure to predict the response for all users as if they were on the control arm. The results for these models are given in Tables 1, 2 and 3 respectively. The cross-validated correlation for each response is given in the column "cv cor"; the optimal θ and the theoretical reduction in the standard deviation ("sd ratio") are also given. There is little difference between the model performances in this example, and we expect the adjustment methods to produce similar reductions.

Tab. 1: Prediction accuracy for various responses when using control data only.

      model formula                   cv cor   theta   sd ratio
    1 imp count: gender+country         0.67    0.97       0.74
    2 obs interact: gender+country      0.75    1.00       0.66
    3 obs amount: gender+country        0.72    1.00       0.69

Tab. 2: Prediction accuracy for various responses when using all the data and the experiment label in building the model; however, we assume all data is on the control arm when predicting.

      model formula                           cv cor   theta   sd ratio
    1 imp count: gender+country+expt id         0.69    0.99       0.72
    2 obs interact: gender+country+expt id      0.74    1.00       0.67
    3 obs amount: gender+country+expt id        0.66    1.00       0.75

Tab. 3: Prediction accuracy for various responses when using all the data in building the model but not using the experiment label.

      model formula                   cv cor   theta   sd ratio
    1 imp count: gender+country         0.65    1.00       0.76
    2 obs interact: gender+country      0.69    1.00       0.72
    3 obs amount: gender+country        0.62    1.00       0.78

Figure 3 checks whether the adjustment (on the log scale) is unbiased on average. To that end, we sample 500 users from all the data at random multiple times and calculate the adjustment in each case. We expect to see a distribution which is symmetric around zero in the absence of bias. We observe that in all cases the adjustment is unbiased.
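To make the procedure concrete, here is a minimal sketch of the raw and adjusted Mean Ratio estimators under setup (i); scikit-learn's LinearRegression and the equal-arm-size simplification of the g = log θ from Theorem 3.2 are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mean_ratio_estimates(y_t, y_c, x_t, x_c):
    """Raw and adjusted Mean Ratio estimators (Corollary 3.1, part (a)).

    x_t, x_c: numeric predictor matrices (e.g. one-hot country/gender).
    The prediction model h is fitted on control data only, setup (i)."""
    model = LinearRegression().fit(x_c, y_c)
    h_t, h_c = model.predict(x_t), model.predict(x_c)

    # theta from Theorem 3.2 with g = log (equal arm sizes assumed,
    # so the per-arm 1/N factors cancel out of the ratio)
    num = (np.cov(y_t, h_t, ddof=1)[0, 1] / (h_t.mean() * y_t.mean())
           + np.cov(y_c, h_c, ddof=1)[0, 1] / (h_c.mean() * y_c.mean()))
    den = (np.var(h_t, ddof=1) / h_t.mean() ** 2
           + np.var(h_c, ddof=1) / h_c.mean() ** 2)
    theta = num / den

    raw = y_t.mean() / y_c.mean()
    adjusted = raw * (h_t.mean() / h_c.mean()) ** (-theta)
    return raw, adjusted
```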
Fig. 3: Check for imbalance in the adjustment when g = log. The top panels use the control data only. The middle panels use all the data and the experiment label in training. The bottom panels use all the data but do not use the experiment label.

Next we vary the sample size of users from 500 to 20,000, and for each sample size we sample from the population 1000 times to calculate both the raw and the adjusted estimators 1000 times. Figure 4 depicts the mean of the estimator for the three responses in the left panels and the variance in the right panels. We only show the adjustment methods based on (i) and (ii), as (iii) was very similar to (ii). It is clear that both adjustment methods leave the estimator unbiased while reducing the variance significantly, regardless of the sample size. Figure 5 depicts the ratio of the standard deviation of the adjusted estimators to that of the raw estimator, which is approximated by

    ratio = sd(adj estimator)/sd(raw estimator) ≈ √(1 − ρ²),

using Theorem 3.1 and its corollaries. For ρ ≈ 0.70 this ratio is approximately 0.71, which is what we observe in the figure. Note that according to the theory this ratio does not depend on the sample size, and the figure confirms that in this case. Other simulation results for varying correlation values (presented in the appendix) were also consistent with the theory.
Fig. 4: Mean Ratio Metric estimator's mean and variance across sample sizes.

Fig. 5: The ratio of standard deviations of the adjusted and raw estimators for the Mean Ratio.

Finally, Figure 6 varies the sample size from 500 to 10,000. For each sample size, we sample from the population once, and for that one sample we calculate a CI using the bucketed jackknife method with 50 buckets. The bucketed jackknife is similar to the regular jackknife but leaves out one bucket for each calculation, rather than one unit. We observe that the adjusted confidence intervals are within the raw confidence intervals most of the time, while all converge toward a common value.
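A minimal sketch of a bucketed (delete-one-bucket) jackknife CI for a generic scalar estimator; the round-robin bucket assignment stands in for hashing user ids and is an illustrative choice.

```python
import numpy as np

def bucketed_jackknife_ci(data, estimator, n_buckets=50, z=1.96):
    """Delete-one-bucket jackknife CI.

    data: array of unit-level rows; estimator: function mapping rows to a
    scalar (e.g. the adjusted mean-ratio estimate on the log scale)."""
    buckets = np.arange(len(data)) % n_buckets  # stand-in for hashing user ids
    full = estimator(data)
    leave_out = np.array([estimator(data[buckets != b]) for b in range(n_buckets)])
    # jackknife variance, with buckets playing the role of units
    var = (n_buckets - 1) / n_buckets * np.sum((leave_out - leave_out.mean()) ** 2)
    se = np.sqrt(var)
    return full - z * se, full + z * se
```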
Fig. 6: Comparing the CI length convergence across varying sample sizes.
5 Discussion

This paper developed a method to decrease the variance of estimators for a wide class of experiment effect metrics while leaving the estimator (asymptotically) unbiased. Our goal was to keep the method flexible in two ways: (1) the method is applicable to a wide variety of metrics; (2) the interpolation/prediction function used in the method can be quite arbitrary, i.e. any machine learning algorithm could be used for that purpose. The second requirement is a useful one in the technology industry at the moment, where many powerful, complex machine learning algorithms have been developed whose statistical properties are almost a black box to practitioners. Given any such predictive model, we introduced adjusted estimators which remain asymptotically unbiased. The adjusted estimators also include extra parameters which can be optimized for a given predictive model.

We think these methods can have other applications beyond experiment analysis. For example, we can use these adjustments in the context of observational studies to estimate causal effects, since the adjustments perform balancing with respect to the prescribed auxiliary variables.
Acknowledgement
The first author (Reza Hosseini) would like to thank Jonathan Hennessy (Google) for numerous fruitful discussions and for going over the first draft.

A Proofs
Proof (Theorem 3.1)

(a) Define f(H, Y; θ) = g(Y) − θ g(H), and assume the bivariate distribution of (H, Y) has mean µ = (µ_H, µ_Y). Then, by a first-order Taylor approximation of f (see DasGupta (2008)),

    f(H, Y) ≈ f(µ) + ∇f(µ) · [(H, Y) − µ] = f(µ) + (∂f/∂h)(µ)(H − µ_H) + (∂f/∂y)(µ)(Y − µ_Y).

We can therefore approximate the variance of f(H, Y) by

    var{f(H, Y; θ)} ≈ (∂f/∂h)(µ)² var{H} + 2(∂f/∂h)(µ)(∂f/∂y)(µ) cov(Y, H) + (∂f/∂y)(µ)² var{Y}.

Substituting (∂f/∂h)(µ) = −θ g′(µ_H) and (∂f/∂y)(µ) = g′(µ_Y), we get

    var{f(H, Y; θ)} ≈ θ² (g′(µ_H) σ_H)² − 2 g′(µ_H) g′(µ_Y) cov(Y, H) θ + (g′(µ_Y) σ_Y)².

This expression is a convex quadratic polynomial p(θ) = aθ² + bθ + c, which is minimized at θ = −b/(2a) with minimum c − b²/(4a). Therefore

    argmin_θ var{f(H, Y; θ)} = (g′(µ_Y)/g′(µ_H)) cov(Y, H)/σ_H².

(b) The minimum is obtained by substituting the θ from (a). For the second part, note that by the Taylor approximation var{g(Y)} ≈ g′(µ_Y)² var{Y}, and the result follows.

Proof (Theorem 3.2)

To prove (a) and (b), define

    f(Y, H, Y*, H*; θ) = g(Y) − θ g(H) − (g(Y*) − θ g(H*)).

By a first-order Taylor expansion,

    f(Y, H, Y*, H*; θ) ≈ f(µ) + ∇f(µ) · [(Y, H, Y*, H*) − µ].

Therefore

    var{T} ≈ Σ_{U ∈ {Y, H, Y*, H*}} (∂f/∂u)(µ)² var{U} + 2(∂f/∂h)(µ)(∂f/∂y)(µ) cov(Y, H) + 2(∂f/∂h*)(µ)(∂f/∂y*)(µ) cov(Y*, H*) = aθ² + bθ + c,

where

    a = g′(µ_H)² σ_H² + g′(µ_H*)² σ_H*²,
    b = −2 (g′(µ_H) g′(µ_Y) cov(Y, H) + g′(µ_H*) g′(µ_Y*) cov(Y*, H*)),
    c = g′(µ_Y)² σ_Y² + g′(µ_Y*)² σ_Y*².

This is a quadratic function of θ, for which the argmin is −b/(2a) and the minimum is c − b²/(4a), which gives (a) and (b).

To prove (c), let

    a_1 = g′(µ_H) g′(µ_Y) cov(Y, H),  a_2 = g′(µ_H*) g′(µ_Y*) cov(Y*, H*),
    b_1 = g′(µ_H)² σ_H²,  b_2 = g′(µ_H*)² σ_H*².

In (a) we showed argmin_θ var{T} ≈ (a_1 + a_2)/(b_1 + b_2), and from Theorem 3.1 we have θ_i = a_i/b_i, i = 1, 2. The result follows since

    min{a_1/b_1, a_2/b_2} ≤ (a_1 + a_2)/(b_1 + b_2) ≤ max{a_1/b_1, a_2/b_2}.

To prove (d), note that the given assumptions, g′(µ_H)² σ_H² = g′(µ_H*)² σ_H*² and ρ g′(µ_Y) σ_Y = ρ* g′(µ_Y*) σ_Y*, imply a_1 = a_2 and b_1 = b_2. This allows us to rewrite δ as

    δ = (a_1 + a_2)²/(b_1 + b_2) = a_1²/b_1 + a_2²/b_2.

By part (b), we have

    var{T} ≈ g′(µ_Y)² σ_Y² + g′(µ_Y*)² σ_Y*² − (a_1 + a_2)²/(b_1 + b_2)
           = (g′(µ_Y)² σ_Y² − a_1²/b_1) + (g′(µ_Y*)² σ_Y*² − a_2²/b_2)
           = g′(µ_Y)² σ_Y² (1 − ρ²) + g′(µ_Y*)² σ_Y*² (1 − (ρ*)²)
           = (1 − ρ²) var{g(Y)} + (1 − (ρ*)²) var{g(Y*)}
           ≤ (1 − min{ρ, ρ*}²) (var{g(Y)} + var{g(Y*)})
           = (1 − min{ρ, ρ*}²) var{T_0}.

Proof (Corollary 3.3)

The adjusted estimator is equal to

    τ̂_adj = τ̂_g − θ (g(f̄_t) − g(f̄_c))   (3)
          = g(Ȳ_t) − g(Ȳ_c) − θ (g(f̄_t) − g(f̄_c))
          = (g(Ȳ_t) − θ g(f̄_t)) − (g(Ȳ_c) − θ g(f̄_c)).   (4)

The form in Equation 3 was convenient for showing unbiasedness; we use the form in Equation 4 for deriving the variance. Now we can apply Theorem 3.2 by letting Y = Ȳ_t, Y* = Ȳ_c, H = f̄_t, H* = f̄_c, and noting that

    ρ_c = cor(Y_c, H_c) = cor(Ȳ_c, f̄_c),  ρ_t = cor(Y_t, H_t) = cor(Ȳ_t, f̄_t).

B Simulations for various scenarios
Here we perform a few more simulation studies for various scenarios.
B.1 Simulation results for Sum Ratio
This subsection presents the results for the Sum Ratio metric for the same simulated data as in Tables 1, 2 and 3 (in the main text). Figure 7 depicts the mean and the standard deviation of the raw and adjusted estimators for the Sum Ratio. Figure 8 depicts the ratio of the standard deviation of the adjusted estimators to that of the raw estimator. We observe that the estimators remain unbiased while the variance is decreased, albeit the decrease in the variance is less than for the Mean Ratio case.
Fig. 7: Sum Ratio Metric estimator's mean and variance across sample sizes.

Fig. 8: The ratio of standard deviations of the adjusted and raw estimators for the Sum Ratio.
B.2 Simulation with no predictive power
This subsection considers a simulation in which the auxiliary variables have no predictive power. It is desirable that the methods introduced in this paper do not increase the variance significantly in this case, which could happen due to the extra variance introduced by the model uncertainty, which generates some variance in the adjustment. Tables 4 and 5 show the predictive power of the models for this case and the expected reduction in the variance. Figure 9 depicts the mean and the standard deviation of the raw and adjusted estimators. Figure 10 depicts the ratio of the standard deviation of the adjusted estimators to that of the raw estimator. Fortunately, we do not observe any increase in the variance using the adjusted methods.

Tab. 4: Prediction accuracy for various responses using only control data.

      model formula                   cv cor   theta   sd ratio
    1 imp count: gender+country         0.03    0.70       1.00
    2 obs interact: gender+country     -0.03    0.00       1.00
    3 obs amount: gender+country       -0.04    0.00       1.00

Tab. 5: Prediction accuracy for various responses when using all the data and the experiment label in building the model; however, we assume all data is on the control arm when predicting.

      model formula                           cv cor   theta   sd ratio
    1 imp count: gender+country+expt id         0.16    1.00       0.99
    2 obs interact: gender+country+expt id      0.14    0.67       0.99
    3 obs amount: gender+country+expt id        0.19    0.91       0.98
Fig. 9: Mean Ratio Metric estimator's mean and variance across sample sizes. This is for the scenario with no predictive power.

Fig. 10: The ratio of standard deviations of the adjusted and raw estimators for the Mean Ratio. This is for the scenario with no predictive power.

Fig. 11: Mean Ratio Metric estimator's mean and variance across sample sizes. This is for the scenario with the experiment impacting the user populations appearing in the data.

Fig. 12: Comparing the CI length convergence across varying sample sizes.
B.3 Simulation results for Ratio of Mean Ratios
We use the same simulation settings here as in the main text. The results are given in Figure 13, which shows that the adjusted estimator is unbiased while decreasing the variance.
Fig. 13: The ratio of mean ratios.
References
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics.

Deng, A., Xu, Y., Kohavi, R., and Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, February 4-8, 2013, Rome, Italy. doi:10.1145/2433396.2433413.

Freedman, D. A. (2008a). On regression adjustments in experiments with several treatments. Annals of Applied Statistics, 2:176-196.

Freedman, D. A. (2008b). On regression adjustments to experimental data. Advances in Applied Mathematics, 40:180-193.

Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning, Second Edition. Springer.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Miratrix, L. W., Sekhon, J. S., and Yu, B. (2013). Adjusting treatment effect estimates by post-stratification in randomized experiments. Journal of the Royal Statistical Society, Series B, 75(2):369-396.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes.