Experimentation for Homogeneous Policy Change

Abstract

When the Stable Unit Treatment Value Assumption (SUTVA) is violated and there is interference among units, there is not a uniquely defined Average Treatment Effect (ATE), and alternative estimands may be of interest, among them average unit-level differences in outcomes under different homogeneous treatment policies. We term this target the Homogeneous Assignment Average Treatment Effect (HAATE). We consider approaches to experimental design with multiple treatment conditions under partial interference and, given the estimand of interest, we show that difference-in-means estimators may perform better than correctly specified regression models in finite samples on root mean squared error (RMSE). With errors correlated at the cluster level, we demonstrate that two-stage randomization procedures with intra-cluster correlation of treatment strictly between zero and one may dominate one-stage randomization designs on the same metric. Simulations demonstrate performance of this approach; an application to online experiments at Facebook is discussed.


A PREPRINT

Molly Offer-Westort Drew Dimmery

February 1, 2021


When the Stable Unit Treatment Value Assumption (SUTVA) is violated, a unit’s potential outcomes are a function not only of the treatment assigned to that unit directly, but also of the treatments assigned to other units; that is, there is “interference” [Cox, 1958]. In such settings, the ATE is not uniquely defined, as a unit’s treatment status may be associated with multiple potential outcomes under different treatment assignments of other units. Researchers may be interested in considering alternative design-specific estimands, including a generalization of the ATE under interference, the Expected Average Treatment Effect (EATE) [Sävje et al., 2017]; decomposition of treatment effects into direct and indirect effects as well as overall causal effects of the intervention [Halloran and Struchiner, 1995, Hudgens and Halloran, 2008, VanderWeele and Tchetgen, 2011]; and related targets, such as the treatment effect on the uniquely treated, and spillover effects on the treated and non-treated as a function of treatment saturation [Baird et al., 2018]. Aronow and Samii [2017] develop a general approach for estimating causal effects in the presence of arbitrary but known interference.

Motivated by an analytic setting at Facebook, we consider average differences in unit-level outcomes under alternative interventions, where under counterfactuals of interest all units would be assigned the same treatment. For some research designs, isolating direct effects or discerning the form and magnitude of interference may be explicit objectives. Here, we are only interested in the average effects of alternative homogeneous policies, and the decomposition of these effects is not of specific interest. Our motivating application is the design of experiments run to tune the performance of the video player on Facebook. The substance of these experiments is minor and technical (e.g., changing how much data is pre-buffered on a video) but vital to improving the overall Facebook experience.
The goal of this experimentation is to launch a single configuration to all users and all videos which will provide the best user experience.

We consider experimental design in settings with partial interference, where units are organized into mutually exclusive clusters and where there is no interference between units in different clusters [Sobel, 2006]. Making one video on a user’s feed play more seamlessly may hurt the performance of another video, but it is implausible for a video’s performance to affect the performance of a video for another user. Estimation under partial interference with binary treatment has been considered by Hudgens and Halloran [2008], Tchetgen and VanderWeele [2012], and Liu and Hudgens [2014] for two-stage randomization designs, where treatment saturation is randomized at the cluster level and treatment or control is randomly assigned to units within clusters according to the relevant saturation level. Baird et al. [2018] discuss optimal experimental design in such settings.

arXiv preprint [stat.ME]


Our contribution is to extend discussion of estimation and experimental design to settings where the researcher intends to estimate effects of several treatment levels. We present a linear-in-means model [Moffitt et al., 2001, Bramoullé et al., 2009, Chin, 2018] for a clustered multi-treatment experiment and propose methods to estimate the HAATE. We also discuss optimal randomization procedures in such settings. Often, the standard approach to experimental design is to use a one-stage randomization procedure, either assuming no interference and randomizing at the unit level, or allowing for interference and randomizing at the cluster level so that treatment assignment within clusters is homogeneous. Given the HAATE as the estimand of interest, we identify a bias-variance trade-off and show that root mean squared error (RMSE) under two-stage randomization procedures with intra-cluster correlation of treatment strictly between zero and one may dominate one-stage randomization procedures with randomization at either the unit or cluster level, under certain conditions. We provide a principled way to make this choice by selecting from a continuum of possible intra-cluster correlations of treatment (see Figure 1). We then demonstrate performance of this approach through simulations. We discuss an application to online experiments at Facebook.

Figure 1: Intra-cluster correlation of treatment. We consider randomization procedures that may be classified according to intra-cluster correlation of treatment. The figure is a stylized representation of the proportion of each of three treatment conditions within clusters, conditional on intra-cluster correlation of treatment in an experimental design. The column to the left represents experiments with one-stage randomization at the cluster level, while the column to the right represents one-stage randomization at the unit level. In between, procedures may have intermediate intra-cluster correlation of treatment.

Consider an experiment over a set of clusters, indexed by $j = 1, \ldots, J$, each composed of $n$ individual units. (In exposition, we fix all $n_j$ as equal, i.e., $n_j = n$ for all $j$ in $1, \ldots, J$.) We assume i.i.d. draws from an infinite super-population of clusters. Let $A_j = (A_{j,1}, \ldots, A_{j,n})^T$ be the cluster-level treatment assignment vector, where $a_{j,i} \in \{0, \ldots, M\}$ determines which of $M + 1$ treatments a unit $i$ in cluster $j$ receives, with level zero being the control or status quo. Outcomes for units in cluster $j$ are defined as $Y_j = (Y_{j,1}, \ldots, Y_{j,n})^T$. The probability distribution of the random vector $A_j$ is completely determined by the experimental design set by the researcher; we discuss randomization procedures below. $A$ generalizes this concept across clusters.

Here, we relax the Stable Unit Treatment Value Assumption (SUTVA). We allow for partial interference [Sobel, 2006]; that is, the potential outcomes of a unit in group $j$ may vary with the $(M + 1)^n$ possible different $A_j$, but holding $a_j$ fixed, will not vary under any $a_{j'}$, where $j \neq j'$. Potential outcomes for unit $i$ in cluster $j$ take the form $Y_{j,i}(a_{j,i}, a_{j,-i})$, where $a_{j,-i}$ represents the vector $a_j$ with the $i$th element removed. Then, under partial interference,

$$Y_{j,i}(a_{j,i}, a_{j,-i}, a_{j'}) = Y_{j,i}(a_{j,i}, a_{j,-i}, a'_{j'})$$

for any $j' \neq j$ and for any realizations of $a_{j'}$ and $a'_{j'}$. Consequently, interference may take arbitrary form within clusters, but there is no interference among units across clusters. We assume these potential outcomes are well-defined, so that any value for $A_j$ in $\{0, \ldots, M\}^n$ could hypothetically be realized for any cluster.

For some design-specific estimands, it is necessary to marginalize over a distribution of treatment assignments. For example, Sävje et al. [2017] consider the assignment-conditional unit-level treatment effect as the effect of receiving treatment $a$, as compared to $a'$, holding all other treatments fixed at $a_{-i}$. They then marginalize over treatments $A$ following the distribution determined by some experimental design to find the expected average treatment effect.

We are interested, however, in comparing counterfactual outcomes under two specific designs which each allow for only one vector of homogeneous treatment assignments. Consider, for example, a setting where we want to change from the status quo, under which all units receive the same treatment, to some alternative policy, which will also be implemented homogeneously across all units. In this setting, we do not need to take the step of marginalizing over all treatment vectors, as each unit has a single potential outcome under each setting. We will term these homogeneous vectors $\bar{m}$, homogeneous under treatment $m$, as compared to $\bar{0}$, homogeneous under the control.

We define the homogeneous assignment unit-level treatment effect as the unit-level treatment effect of being assigned $m$ when all other units receive assignment $m$, as compared to being assigned zero when all other units receive assignment zero. This is represented as

$$\Psi_{j,i}(\bar{m}, \bar{0}) = Y_{j,i}(A_{j,i} = m, A_{-i} = \bar{m}_{-i}) - Y_{j,i}(A_{j,i} = 0, A_{-i} = \bar{0}_{-i}).$$

As shorthand, we may write

$$\Psi_{j,i}(\bar{m}, \bar{0}) = Y_{j,i}(\bar{m}) - Y_{j,i}(\bar{0}).$$
Taking the expectation over the distribution of units, our inferential target is the Homogeneous Assignment Average Treatment Effect (HAATE), which is defined as

$$\Psi^{HAATE}(\bar{m}, \bar{0}) = \mathrm{E}[Y_{j,i}(\bar{m}) - Y_{j,i}(\bar{0})].$$

Note that we could average over units within clusters and then take the expectation over the distribution of clusters for the cluster version of the HAATE, but as clusters are of uniform size, the effect is equivalent. This estimand has been considered by others, including Basse and Airoldi [2018] and Chin [2018]. This effect is well-defined. However, without assumptions on interference it is not directly estimable, as all units will be under one condition or the other, and so we can find the mean under either condition only to the exclusion of the other. Under no interference, the treatment assignment vector for other units does not affect the potential outcome, and the HAATE, the EATE, and the ATE will all be equivalent.

Having assumed partial interference, however, a unit’s potential outcomes are defined only by the assignments of treatment within its cluster, regardless of treatment assignments in other clusters. We can thus use homogeneous random assignment at the cluster level, such that each cluster has positive probability of being assigned $\bar{m}_j$ or $\bar{0}_j$, to identify the HAATE. Then, under partial interference, potential outcomes are only a function of the within-cluster treatment assignment vector, and under random assignment, treatment is independent of potential outcomes. Thus,

$$\Psi^{HAATE}(\bar{m}, \bar{0}) = \mathrm{E}[Y_{j,i}(\bar{m}) - Y_{j,i}(\bar{0})]$$

by linearity of expectations,

$$= \mathrm{E}[Y_{j,i}(\bar{m})] - \mathrm{E}[Y_{j,i}(\bar{0})]$$

by partial interference,

$$= \mathrm{E}[Y_{j,i}(\bar{m}_j)] - \mathrm{E}[Y_{j,i}(\bar{0}_j)]$$

by independence of potential outcomes with treatment under random assignment, and consistency,

$$= \mathrm{E}[Y_{j,i} \mid A_j = \bar{m}_j] - \mathrm{E}[Y_{j,i} \mid A_j = \bar{0}_j].$$
This allows us to use the difference-in-means as an unbiased estimator of the HAATE,

$$\widehat{\Psi}^{HAATE}_{DM}(\bar{m}, \bar{0}) = \frac{\sum_{j=1}^{J} \sum_{i=1}^{n} Y_{j,i} \, \mathbf{1}\{A_{j,i} = m\}}{\sum_{j=1}^{J} \sum_{i=1}^{n} \mathbf{1}\{A_{j,i} = m\}} - \frac{\sum_{j=1}^{J} \sum_{i=1}^{n} Y_{j,i} \, \mathbf{1}\{A_{j,i} = 0\}}{\sum_{j=1}^{J} \sum_{i=1}^{n} \mathbf{1}\{A_{j,i} = 0\}}.$$

This definition is flexible to fixing any treatment level as the control, and can be used for comparison of arbitrary treatment levels.
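The estimator is a ratio of sums for each arm, which is the arm's sample mean. The following is a minimal Python sketch of that computation; the function name, the array layout (clusters in rows, units in columns), and the toy data are assumptions made for illustration and do not come from the source.

```python
import numpy as np


def difference_in_means(y, a, m, control=0):
    """Mean outcome of units assigned `m` minus mean outcome of units assigned `control`."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a)
    treated = a == m
    ctrl = a == control
    if not treated.any() or not ctrl.any():
        raise ValueError("both conditions must contain at least one unit")
    return y[treated].mean() - y[ctrl].mean()


# Toy data: two clusters of three units, homogeneous assignment within each cluster.
y = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
a = np.array([[0, 0, 0],
              [1, 1, 1]])
estimate = difference_in_means(y, a, m=1)  # mean(4, 5, 6) - mean(1, 2, 3)
```

Passing a different `control` value gives the comparison between two arbitrary treatment levels mentioned above.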


We assume a simple linear-in-means model as the underlying data generating process [Manski, 1993, Bramoullé et al., 2009]. In such models, a unit’s outcome is a linear function of the mean characteristics of their group, and possibly a set of the unit’s own characteristics. Chin [2018] has developed such models to account for interference in Bernoulli randomized trials with covariates that are arbitrary functions of the treatment assignment vector; Baird et al. [2018] consider regression models for randomized treatment saturation designs with binary treatments and a fixed set of saturations determined by the researcher.

We follow in the spirit of Baird et al. [2018] and Chin [2018], with generalization to the multiple treatment setting. The model proposed by Baird et al. [2018] includes an intercept and separate indicators for directly treated units and untreated units in treated clusters at each assigned saturation level. They estimate slopes separately for treated and untreated units in treated clusters as the difference in coefficients on indicators at two different treatment saturation levels, divided by the difference in saturation levels. To allow for more flexible design selection in the multiple treatment setting, we will include slope coefficients directly in the model.

We assume that for all treatment vectors, the individual expected potential outcome is a function of the direct treatment $a_{j,i}$ received and the realized proportion of each of the treatment conditions within the cluster, $p_{j[m]}(A_j) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{A_{j,i} = m\}$. That is,

$$\mathrm{E}[Y_{j,i}(A_j)] = \mathrm{E}\left[Y_{j,i}(a_{j,i}; p_{j[0]}(A_j), \ldots, p_{j[M]}(A_j))\right].$$

Following Baird et al.
[2018], this is a modification of the stratified interference assumption [Hudgens and Halloran, 2008], allowing realized potential outcomes to depend on the units that receive each treatment assignment, but constraining expected potential outcomes to simplify representation of standard errors without network knowledge.

We also follow Baird et al. [2018] in imposing homoskedasticity over potential outcomes, with intra-cluster correlation of errors such that for all treatment vectors, the variance-covariance matrix is characterized by

$$\mathrm{Var}[Y_{j,i}(a_j)] = \sigma^2 + \tau^2,$$

$$\mathrm{Cov}[Y_{j,i}(a_j), Y_{j,i'}(a_j)] = \tau^2 \text{ for } i \neq i', \text{ and}$$

$$\mathrm{Cov}[Y_{j,i}(a_j), Y_{j',i'}(a_j)] = 0 \text{ for } j \neq j'.$$

Intra-cluster correlation of errors is then $\rho_u = \frac{\tau^2}{\tau^2 + \sigma^2}$. These two assumptions mirror Baird et al. [2018] Assumptions 1 and 2.

We propose a model,

$$Y_{j,i} = \sum_{m=0}^{M} \beta_m \times \mathbf{1}\{A_{j,i} = m\} + \sum_{m=0}^{M} \sum_{\ell=1}^{M} \delta_{m,\ell} \times p_{j[\ell]} \times \mathbf{1}\{A_{j,i} = m\} + \varepsilon_{j,i}.$$

The error term is the difference between the realized unit potential outcome as a function of the treatment vector, and the unit expected potential outcome for that treatment vector. Under the assumptions imposed above, $\mathrm{E}[\varepsilon_{j,i} \mid A_j] = 0$ [Baird et al., 2018, Lemma 1], and the OLS estimate is unbiased.

Relating the model to the potential outcomes framework, the expected outcome under homogeneous control, $\mathrm{E}[Y_{j,i}(\bar{0}_j)]$, is represented by $\beta_0$, and the expected outcome under homogeneous treatment $m$, $\mathrm{E}[Y_{j,i}(\bar{m}_j)]$, is represented by $\beta_m + \delta_{m,m}$. If the model is appropriately specified, then the HAATE is

$$\Psi^{HAATE}(\bar{m}, \bar{0}) = (\beta_m + \delta_{m,m}) - \beta_0.$$

When randomization is not homogeneous at the cluster level, the model demonstrates why the difference-in-means estimator is biased for the HAATE in this setting (Chin [2018] makes a similar observation).
Under the difference-in-means estimator, the expected outcome for a unit assigned treatment $m$ is

$$\mathrm{E}[Y_{j,i} \mid A_{j,i} = m] = \beta_m + \sum_{\ell=1}^{M} \delta_{m,\ell} \times \mathrm{E}[p_{j[\ell]} \mid A_{j,i} = m].$$

(Footnotes: To avoid indexing by units as well, and because the slope effect is modeled separately for each direct treatment condition, we include all units in the proportion so that each unit in the same cluster shares the same values of $p_{j[m]}$ for all $m$ in $0, \ldots, M$. As the number of observations per cluster is discrete, it is not necessary to assume continuous potential outcomes.)
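Subtracting the model-based HAATE, $(\beta_m + \delta_{m,m}) - \beta_0$, from the difference of these conditional expectations for treatment $m$ and for control makes the bias term explicit. This is a worked step under the linear-in-means model, added for illustration rather than taken from the source:

$$\mathrm{E}[Y_{j,i} \mid A_{j,i} = m] - \mathrm{E}[Y_{j,i} \mid A_{j,i} = 0] - \Psi^{HAATE}(\bar{m}, \bar{0}) = \sum_{\ell=1}^{M} \delta_{m,\ell} \, \mathrm{E}[p_{j[\ell]} \mid A_{j,i} = m] - \delta_{m,m} - \sum_{\ell=1}^{M} \delta_{0,\ell} \, \mathrm{E}[p_{j[\ell]} \mid A_{j,i} = 0].$$

Under homogeneous cluster-level assignment, $p_{j[m]} = 1$ and $p_{j[\ell]} = 0$ for $\ell \neq m$ whenever $A_{j,i} = m$, and $p_{j[\ell]} = 0$ for all $\ell \geq 1$ whenever $A_{j,i} = 0$, so the right-hand side is zero.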


Only when the bias in the expected outcome for a unit assigned treatment $m$ and the bias in the expected outcome for a unit assigned the control condition are perfectly offset is the estimator unbiased for the HAATE. Alternatively, if there is no interference, all of the $\delta_{m,\ell}$ terms are zero.

In this model, we have assumed that interference is a linear function of the proportion of each type of treatment within a cluster; this can be broadened to consider the type of general “interference control variables” discussed in Chin [2018], Section 3, which are a function of the vector $A_{j,-i}$. We could define such a deterministic function, following Chin [2018], as $X_i(A_{j,-i})$, to account for non-linearities in interference, endogenous effects, or to encode further information about the network structure or other ways that interference is channeled. This requires independence of potential outcomes with treatment assignment, and exogeneity assumptions on the errors equivalent to those imposed by our assumptions above.

We wish to estimate the HAATE for each of $M$ treatment conditions in comparison to the control condition; our objective is to select a design to minimize average root mean squared error (RMSE) across the $M$ estimates, with a fixed experimental sample size. RMSE is defined as $\sqrt{(\Psi - \widehat{\Psi})^2}$. We generalize the concept of coverage or saturation in the binary treatment setting [Hudgens and Halloran, 2008, Baird et al., 2018] to intra-cluster correlation of treatment in the multiple treatment setting.

We consider two-stage experimental designs, where all units have positive probability of being assigned to any of the treatment conditions in $0, \ldots, M$. The researcher first randomly and independently assigns a treatment probability vector to each cluster, defined by $\pi_j = (\pi_{j[0]}, \ldots, \pi_{j[M]})$, such that $\pi_{j[m]} \in (0, 1)$ for each respective treatment condition, and $\sum_{m=0}^{M} \pi_{j[m]} = 1$.
Treatment is then randomly and independently assigned to units within each cluster following the relevant treatment probability vector. The distribution of treatment probability vectors is set by the researcher under some experimental design.

We implement this procedure as cluster-level draws from a Dirichlet distribution, followed by within-cluster randomization according to a Multinomial distribution, parameterized by the Dirichlet draw. The distribution of treatment within clusters then follows a Dirichlet-multinomial distribution with $n$ trials; each condition follows a Beta-binomial distribution with $n$ trials. The Dirichlet distribution is parameterized by $\alpha = (\alpha_0, \ldots, \alpha_M)$.

We set all $\alpha_m$ equal to $\bar{\alpha}$ so that the unconditional expected probability of success of a given condition is $\frac{1}{M+1}$, regardless of the magnitude of each $\alpha_m$. This means that we only consider experimental designs where assignment probabilities are balanced, although in practice a researcher may wish to select from designs where, for example, a larger proportion of units are assigned to the control condition. Lower values of $\bar{\alpha}$ are associated with greater overdispersion relative to the Multinomial distribution.

Defining an indicator for each treatment level, intra-cluster correlation of treatment as a function of $\bar{\alpha}$ is the same for each treatment condition at a given level of $\bar{\alpha}$, and takes the form

$$\rho_m(\bar{\alpha}) = \frac{1}{\sqrt{((M + 1)\bar{\alpha} + 1)}}.$$

This allows us to consider permitted experimental designs along a spectrum. At one end of the spectrum, when $\bar{\alpha}$ is very small, randomization approximates cluster-level randomization, where each draw from the Dirichlet distribution results in a Multinomial distribution with probability approaching one on any given condition, and probability approaching zero on all other conditions. Intra-cluster correlation for any treatment indicator is near one.
At the other end of the spectrum, when $\bar{\alpha}$ is very large, randomization approaches unit-level randomization, where each draw from the Dirichlet distribution results in a Multinomial distribution with probability of approximately $\frac{1}{M+1}$ on each condition. Intra-cluster correlation for any treatment indicator is near zero.

Suppose we are in a setting where there may be both partial interference and intra-cluster correlation of errors. With randomization at the cluster level, the difference-in-means estimator will be unbiased for the HAATE. With randomization at the unit level, the difference-in-means estimator will produce an estimate with a smaller variance, but one that is not generally unbiased for the HAATE; rather, it is unbiased for the average direct causal effect defined by Hudgens and Halloran [2008].


In the absence of interference, however, with randomization at the unit level, the difference-in-means estimator will be unbiased for the HAATE, and variance of the estimate will be decreased relative to cluster-level homogeneous randomization. In the absence of both interference and intra-cluster correlation of errors, the two approaches are equivalent for the difference-in-means estimator. In the presence of both interference and intra-cluster correlation of errors, intermediate designs may be preferable in terms of RMSE.
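The two-stage procedure described above (a Dirichlet draw per cluster, then independent unit-level draws within the cluster) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names, the array layout, and the two values of $\bar{\alpha}$ in the usage lines are assumptions chosen only to contrast the two ends of the spectrum.

```python
import numpy as np


def two_stage_assignment(n_clusters, n_units, n_conditions, alpha_bar, rng):
    """Return an (n_clusters, n_units) array of labels in {0, ..., n_conditions - 1}."""
    # Stage 1: one treatment probability vector per cluster, Dirichlet(alpha_bar, ..., alpha_bar).
    pi = rng.dirichlet(np.full(n_conditions, alpha_bar), size=n_clusters)
    # Stage 2: independent categorical draws within each cluster via inverse-CDF sampling.
    cum = np.cumsum(pi, axis=1)
    u = rng.random((n_clusters, n_units))
    labels = (u[:, :, None] > cum[:, None, :]).sum(axis=2)
    # Guard against the last cumulative sum falling just below 1 through rounding.
    return np.minimum(labels, n_conditions - 1)


def cluster_proportions(labels, n_conditions):
    """Realized proportion of each condition within each cluster."""
    return np.stack([(labels == m).mean(axis=1) for m in range(n_conditions)], axis=1)


rng = np.random.default_rng(0)
near_cluster = two_stage_assignment(200, 50, 3, alpha_bar=0.05, rng=rng)
near_unit = two_stage_assignment(200, 50, 3, alpha_bar=50.0, rng=rng)

# Average share of the most common condition within a cluster under each design.
share_cluster = cluster_proportions(near_cluster, 3).max(axis=1).mean()
share_unit = cluster_proportions(near_unit, 3).max(axis=1).mean()
```

With a small $\bar{\alpha}$ most clusters are dominated by a single condition, while a large $\bar{\alpha}$ gives within-cluster proportions close to $\frac{1}{M+1}$, matching the two ends of the spectrum.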

For the linear-in-means model, the variance can be appropriately estimated using the cluster-robust generalization of the sandwich estimator [Eicker et al., 1963, Huber, 1967, White, 1980], or with nonparametric bootstrapping at the cluster level. Greenwald [1983] and Moulton [1986] demonstrate that increases in intra-cluster correlation of errors and intra-cluster correlation of the regressors are associated with upward adjustments to the conventional OLS variance estimate. In our setting, the intra-cluster correlation of treatment is decreasing with $\bar{\alpha}$, as we move from cluster- to unit-level randomization. This does not mean, however, that holding fixed intra-cluster correlation of errors the variance of the estimator is necessarily monotonically decreasing with $\bar{\alpha}$.

Indeed, the distribution of the design matrix also changes with $\bar{\alpha}$, and the precision of the estimate of the HAATE from the linear-in-means model also depends on our ability to estimate the slopes (in our case, the $\delta_{m,\ell}$ values). Baird et al. [2018] demonstrate the competing design features in estimating slope effects in randomized saturation designs for binary treatments. In this case, it is only necessary to have a minimum of two different interior saturations (i.e., non-homogeneous clusters) in the experimental design to estimate slope effects. With a greater distance between saturations, all else equal, the slope is more precisely estimated. However, all else is not equal; the number of untreated units in high-saturation clusters and treated units in low-saturation clusters is decreasing as the distance between saturations becomes more extreme. These factors represent a trade-off, and will also interact with intra-cluster correlation of errors, $\rho_u$. At high values of $\rho_u$, there will be less unique information learned from additional observations from within the same cluster, and so the unequal number of observations at low and high saturations will become less important.
In our design setting, with low levels of $\bar{\alpha}$, we will be in the scenario where we observe clusters with both very low and very high proportions of the respective treatment and control variables. As $\bar{\alpha}$ increases, the distribution of treatment conditions within clusters will be more balanced, with very little overdispersion in the proportion of each treatment.

For the difference-in-means estimator, we also estimate effects under an OLS model to facilitate comparable variance estimation,

$$Y_{j,i} = \sum_{m=0}^{M} \beta_m \times \mathbf{1}\{A_{j,i} = m\} + \eta_{j,i}.$$

The difference in means is estimated as $\widehat{\beta}_m - \widehat{\beta}_0$. Using cluster-robust variance estimates, we will see changes in the variance of the estimate due to the interaction of intra-cluster correlation of errors and intra-cluster correlation of treatment as discussed above. But even without intra-cluster correlation of errors, if the true data generating process is as described in the linear-in-means model, the variance of the difference-in-means estimator will change in other ways with the design of the experiment; at each value of $\bar{\alpha}$, each $\widehat{\beta}_m$ under the difference-in-means model will absorb the amount of $\delta_{m,\ell}$ associated with $\mathrm{E}[p_{j[\ell]} \mid A_{j,i} = m]$. However, the conditional distribution of $p_{j[\ell]} \mid A_{j,i} = m$ will also change.

These competing factors mean that there is not a one-size-fits-all solution for experimental design. The optimal design for a given sample size will depend on the form and magnitude of interference, as well as the intra-cluster correlation of errors. To demonstrate a range of scenarios, we develop simulations with varying intra-cluster correlation of errors and magnitude of interference. We assume 100 clusters, each with $n = 50$. Errors are normally distributed with mean zero and $\sigma$ fixed at one; $\tau^2$ varies following $\rho_u / (1 - \rho_u)$ according to the values in Table 1. Magnitude of interference is defined by $c$, values of which are also listed in Table 1.
We run 1,000 simulations at each combination of $\rho_u \times c \times \bar{\alpha}$ parameter values.

We consider a setting with three treatment conditions, motivated by the Facebook application discussed below, where there is a control condition, as well as a “high” treatment condition (treatment condition one) and a “low” treatment condition (treatment condition two). Compared to the homogeneous control setting, homogeneous treatment under the “high” treatment condition results in the highest outcomes; homogeneous treatment under the “low” treatment condition results in the lowest outcomes. Fixing direct treatment, outcomes are linearly increasing with the proportion of “high” treatment assignments in a cluster, and linearly decreasing with the proportion of “low” treatment assignments in a cluster. We let the data generating process follow the linear model with slope effects, as below:

$$\begin{aligned}
Y_{j,i} = {} & \beta_0 \times \mathbf{1}\{A_{j,i} = 0\} + \beta_1 \times \mathbf{1}\{A_{j,i} = 1\} + \beta_2 \times \mathbf{1}\{A_{j,i} = 2\} \\
& + \delta_{0,1} \times p_{j[1]} \times c \times \mathbf{1}\{A_{j,i} = 0\} + \delta_{0,2} \times c \times p_{j[2]} \times \mathbf{1}\{A_{j,i} = 0\} \\
& + \delta_{1,1} \times p_{j[1]} \times c \times \mathbf{1}\{A_{j,i} = 1\} + \delta_{1,2} \times c \times p_{j[2]} \times \mathbf{1}\{A_{j,i} = 1\} \\
& + \delta_{2,1} \times p_{j[1]} \times c \times \mathbf{1}\{A_{j,i} = 2\} + \delta_{2,2} \times c \times p_{j[2]} \times \mathbf{1}\{A_{j,i} = 2\} + \varepsilon_{j,i}.
\end{aligned}$$

Magnitude of interference is implemented as a multiplier of slope effects, $c$; larger multipliers are associated with increased levels of interference, while a multiplier of zero is associated with no interference. Consequently, expected potential outcomes under homogeneous treatment are 5 under control, […]$c$ under the “high” treatment, and […] − […]$c$ under the “low” treatment.

Table 1: Simulation parameter values

Parameter             Value
$\beta_0$             5
$\delta_{0,1}$        […]$c$
$\delta_{0,2}$        −[…]$c$
$\beta_1$             […]
$\delta_{1,1}$        $c$
$\delta_{1,2}$        −$c$
$\beta_2$             […]
$\delta_{2,1}$        […]$c$
$\delta_{2,2}$        −[…]$c$
$\sigma$              1
$\rho_u$              {0, […], […], […], […]}
$c$                   {0, […], […], […]}
$(M+1)\bar{\alpha}$   {[…], […], […], […], […], […], […], […], […], […], […], […], […], […], […], […], […]}

We estimate the HAATE both under the correctly specified linear-in-means model and the difference-in-means model, using OLS with cluster-robust variance estimates, implemented in the sandwich package [Zeileis, 2006, Berger et al., 2017] in R.

For varying levels of intra-cluster correlation of errors, $\rho_u$, and magnitude of interference, $c$, Table 2 reports the “optimal” $\rho_m$ as a function of $\bar{\alpha}$ as the design with the lowest RMSE. Results from simulations with different values of $\rho_m(\bar{\alpha})$ are reported in Appendix Table 4.
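One replication of such a simulation, for the difference-in-means model only, might look like the sketch below. It is written in Python with NumPy rather than with the R sandwich package. The values in `BETA` and `DELTA` are hypothetical stand-ins and not the Table 1 values. The covariance is a plain cluster-robust sandwich with no small-sample correction, which may differ from the adjustments applied in the source's analysis.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration (not those of Table 1).
BETA = np.array([5.0, 7.5, 2.5])      # beta_0, beta_1, beta_2
DELTA = np.array([[0.5, -0.5],        # row m holds (delta_{m,1}, delta_{m,2})
                  [1.0, -1.0],
                  [0.5, -0.5]])


def simulate_outcomes(labels, c, rho_u, rng):
    """Linear-in-means outcomes with cluster-correlated normal errors (sigma = 1)."""
    n_clusters, n_units = labels.shape
    # Realized within-cluster proportions of conditions 1 and 2, shape (J, 2).
    p = np.stack([(labels == m).mean(axis=1) for m in (1, 2)], axis=1)
    mean = BETA[labels] + c * (DELTA[labels] * p[:, None, :]).sum(axis=2)
    tau = np.sqrt(rho_u / (1.0 - rho_u))  # gives tau^2 / (tau^2 + 1) = rho_u
    cluster_shock = tau * rng.standard_normal((n_clusters, 1))
    return mean + cluster_shock + rng.standard_normal((n_clusters, n_units))


def dm_ols_cluster_robust(y, labels, n_conditions=3):
    """OLS on treatment indicators; returns coefficients and a cluster-robust covariance."""
    n_clusters, n_units = labels.shape
    X = (labels.reshape(-1, 1) == np.arange(n_conditions)).astype(float)
    yv = y.reshape(-1)
    bread = np.linalg.inv(X.T @ X)
    beta_hat = bread @ X.T @ yv
    resid = (yv - X @ beta_hat).reshape(n_clusters, n_units)
    Xc = X.reshape(n_clusters, n_units, n_conditions)
    scores = np.einsum("jik,ji->jk", Xc, resid)  # per-cluster X_j' e_j
    cov = bread @ (scores.T @ scores) @ bread
    return beta_hat, cov


def haate_dm(y, labels, m):
    """Difference-in-means estimate of the HAATE for condition m, with its standard error."""
    beta_hat, cov = dm_ols_cluster_robust(y, labels)
    est = beta_hat[m] - beta_hat[0]
    se = np.sqrt(cov[m, m] + cov[0, 0] - 2.0 * cov[m, 0])
    return est, se


rng = np.random.default_rng(1)
J, n, c, rho_u = 100, 50, 1.0, 0.1
cluster_labels = np.repeat(rng.integers(0, 3, size=(J, 1)), n, axis=1)  # homogeneous clusters
unit_labels = rng.integers(0, 3, size=(J, n))                           # unit-level randomization
true_haate = (BETA[1] + c * DELTA[1, 0]) - BETA[0]  # (beta_1 + delta_{1,1}) - beta_0

est_cluster, se_cluster = haate_dm(
    simulate_outcomes(cluster_labels, c, rho_u, rng), cluster_labels, m=1)
est_unit, se_unit = haate_dm(
    simulate_outcomes(unit_labels, c, rho_u, rng), unit_labels, m=1)
```

Under these assumed values, the cluster-randomized estimate should land near `true_haate` with a larger standard error, while the unit-randomized estimate should be more precise but pulled toward the direct-effect difference. This is the bias-variance trade-off that motivates intermediate designs.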


                                                          Optimal $\rho_m(\bar{\alpha})$   Bias            RMSE            Estimated SE    Coverage
$\rho_u$   $c$   $\Psi(\bar{1},\bar{0})$   $\Psi(\bar{2},\bar{0})$   LM      DM      LM      DM      LM      DM      LM      DM      LM      DM
0          0     2.5                       -2.5                      0.999   0.725   0.000   0.002   0.035   0.034   0.034   0.035   0.956   0.957
0.1 ...

Table 2: Simulation results for minimum RMSE design over intra-cluster correlation of errors and magnitude of interference. The “LM” columns represent the linear-in-means estimator, the “DM” columns the difference-in-means model. Empirical bias, RMSE, standard errors from cluster-robust variance estimates, and coverage of 95 percent confidence intervals for each estimator average over the two estimates, $\widehat{\Psi}(\bar{1}, \bar{0})$ and $\widehat{\Psi}(\bar{2}, \bar{0})$, and over the 1,000 iterations.
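The summary columns of Table 2 can be computed from per-iteration estimates and standard errors along the following lines. This is a hedged sketch: the function name and toy inputs are hypothetical, and the use of a normal critical value of 1.96 for the 95 percent interval is an assumption, since the source does not state how intervals were constructed.

```python
import numpy as np


def summarize_estimates(estimates, std_errors, truth, z=1.96):
    """Empirical bias, RMSE, mean estimated SE, and interval coverage across iterations."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    errors = estimates - truth
    covered = np.abs(errors) <= z * std_errors  # truth falls inside estimate +/- z * SE
    return {
        "bias": errors.mean(),
        "rmse": np.sqrt((errors ** 2).mean()),
        "mean_se": std_errors.mean(),
        "coverage": covered.mean(),
    }


# Toy inputs standing in for four simulation iterations.
summary = summarize_estimates([2.4, 2.6, 2.5, 2.9], [0.1, 0.1, 0.1, 0.1], truth=2.5)
```

Averaging the resulting dictionaries over the two estimands would give one row of the table for a given design.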


Figure 2: Simulated RMSE on intra-cluster correlation of treatment as a function of $\bar{\alpha}$. Panel (a): linear-in-means estimator. Panel (b): difference-in-means estimator. Axes in both panels: $\rho_m(\bar{\alpha})$ and RMSE. Magnitude of interference, $c$, is fixed at 0.1; intra-cluster correlation of errors $\rho_u$ varies over {[…], […], […]}. Vertical dashed lines represent the value of intra-cluster correlation of treatment, $\rho_m(\bar{\alpha})$, with minimum RMSE. The upper and lower panels do not share the same scaling on the y-axis.


In this setting, the HAATE is generally estimated with higher RMSE under the correctly specified linear-in-means estimator as compared to the difference-in-means estimator when interference is low to moderate (here, $c \leq$ […]). This is largely attributable to the greater variance of the linear-in-means estimator. However, the linear-in-means estimator generally has lower bias and correct coverage, while the difference-in-means estimator has poor to very poor coverage in the presence of interference when randomization is not at the cluster level, as demonstrated in Appendix Table 4. As well, when the magnitude of interference is relatively high (here, $c \geq$ […]), the bias component begins to outweigh the variance in RMSE, and the unbiased linear-in-means estimator becomes preferable. These simulations were all conducted with fixed $n$ and $J$; with increasing sample sizes, the fixed bias would eventually outweigh the decreasing variance at all non-zero levels of $c$.

For the linear-in-means estimator, in this setting, to minimize RMSE for the HAATE it is nearly always preferable to set the design as close to cluster-level randomization as possible ($\rho_m(\bar{\alpha}) \approx 1$). In this case, the slopes are often dropped from the linear model and are not estimated, and consequently the linear-in-means estimator and the difference-in-means estimator are effectively the same. This is the case even when there is no intra-cluster correlation of errors or interference.

For the difference-in-means estimator, when there are high levels of interference (here, $c \geq$ […]), the minimum RMSE is obtained with designs that randomize closer to the cluster level. When there is no interference but errors are correlated, it is preferable to set designs with randomization closer to the unit level. When there is no interference and no intra-cluster correlation of errors, the choice of $\bar{\alpha}$ does not matter. However, when there is an intermediate level of interference (here, $c = 0.1$
), the design trade-offs become evident at different levels of ρ u > , as is illustrated in Figure 2. For ρ u ∈ { . , . , . } , the optimal ρ m (¯ α ) is an interior value, but it is decreasing with ρ u . These methods were applied to online experiments at Facebook, where treatments were combinations of engineeringparameters determining the behavior of video playback on the platform. When a user scrolls past or interacts with videoson their Facebook timeline, multiple videos may load and play in quick succession; these parameters ensure that videosload and play with minimal error. Parameters were tuned according to the typical process for testing conﬁgurationchanges for the Facebook web video player; as such, they would typically be invisible to the user.Here, the unit of observation is the user-video. When conducting experiments to tune these parameters, the researchercan randomize conﬁgurations at the unit level, the user-video, or they may randomize at the cluster level, the user. Forthis analysis, we deployed both designs for comparison. In this setting, treatment assigned to one video may interferewith outcomes of another video for the same user, as a data-intensive conﬁguration assigned to one video may result indegraded performance for other videos. The most likely form this will take is that videos that load at nearly the sametime may saturate the user’s network connection, reducing the speed at which a single video would otherwise load.However, it is unlikely that treatments assigned to videos for one user will affect outcomes associated with videos foranother user. Consequently, the partial interference assumption is likely supported, with clustering at the user level.Figure 3 illustrates estimates of treatment effects from both unit and cluster randomized experiments for one repre-sentative outcome variable, stall counts. Both types of experiments were run over four days, on approximately equalnumbers of clusters. 
Ten treatment conditions were assigned in addition to the control, the status quo configuration of engineering parameters. These treatment conditions can broadly be categorized into “high” and “low” data usage treatments, relative to data usage under the control condition.

The difference-in-means estimator is unbiased for the HAATE under partial interference when treatment is randomized at the cluster level. However, estimates are much less precise than the estimates from the unit-randomized experiment. For the stall counts outcome, the standard error is three times larger under the cluster-randomized design as compared to the unit-randomized design. Additionally, the effects are measured to be around one and a half times more extreme under the unit-randomized design. Thus, while the unit-level design appears to induce bias, it provides the prospect of large gains in precision.

The direction of bias is also systematic with type of treatment: high data usage treatments were estimated as being even more effective under unit-randomized experiments as compared to cluster-randomized experiments, and low data usage treatments were estimated as being even less effective. This may be based on the nature of data usage: when all videos for a user are assigned a low data usage configuration, there is relatively little competition for bandwidth. When a given video is assigned a low data usage configuration and all of the other videos are assigned a balanced distribution of low and high data usage configurations, there is increased competition for bandwidth, and the given low data usage video may exhibit deterioration in performance relative to performance under user homogeneous assignment. When all videos for a user are assigned a high data usage configuration, there is a high degree of competition for bandwidth. When a given video is assigned a high data usage configuration and all of the other videos are assigned a balanced distribution of low and high data usage configurations, there is relatively decreased competition for bandwidth, and the given high data usage video may exhibit improved performance relative to performance under user homogeneous assignment.

Figure 3: Ratio effect estimate on stall count, user and user-video randomization. Axes: configuration (0 through 9) against % change in stalls relative to control (-6.0% to 6.0%); legend: randomization (User, Video). Effect estimates on stall count from user-level randomization are represented in circles, user-video randomization in triangles. The five high data usage configurations are on the left; the five low data usage configurations are on the right. Treatment effects are estimated as ratios with standard errors estimated by the delta method, but are otherwise unadjusted.

We used data from the user- and video-randomized experiments to estimate intra-cluster correlations of errors, and mean and variance of number of videos per user, as clusters were not of uniform size. Based on domain knowledge, we estimated interference as a linear effect of proportion of high data usage videos in a cluster. Table 3 reports estimates of parameters for the stall count outcome, averaged over all treatments. We used this information to conduct simulations and select a value of ᾱ for the intra-cluster correlation of treatment that would minimize RMSE, and conducted a second round of experiments following an adapted version of the Dirichlet-multinomial procedure described above. The video playback configurations we sought to optimize had five continuous parameters; in the second stage experiment, we tested 30 unique configurations, with a balanced distribution of high and low data usage configurations.

Table 3 (column headers): ICC; SEs, cluster; SEs, unit; Bias term; Var(n_j); n̄_j
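The design-selection step described above (simulate, then pick the intra-cluster correlation of treatment that minimizes RMSE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure used in the experiments: it uses two arms only, so the Dirichlet-multinomial draw reduces to a symmetric Beta-binomial; the concentration parameter `conc` is a hypothetical stand-in and need not match the paper's ᾱ; clusters have equal size; interference is linear in the treated share of the cluster; and all parameter values are invented for illustration.

```python
import numpy as np

def treatment_icc(conc):
    """Intra-cluster correlation of treatment for a symmetric Beta(conc/2, conc/2) first stage."""
    return 1.0 / (1.0 + conc)

def simulate_dm(conc, c, rho_u, tau=1.0, J=50, n=10, sims=1000, seed=0):
    """Simulated bias and RMSE of difference-in-means for the homogeneous-policy contrast.

    Assumed outcome model (illustrative only):
        y_ij = tau * z_ij + c * (treated share in cluster j) + u_j + e_ij
    so the all-treated versus all-control contrast equals tau + c.
    """
    rng = np.random.default_rng(seed)
    target = tau + c
    errors = np.empty(sims)
    for s in range(sims):
        # Stage 1: draw a treatment probability for each cluster.
        p = rng.beta(conc / 2, conc / 2, size=J)
        # Stage 2: assign units independently within each cluster.
        z = rng.random((J, n)) < p[:, None]
        share = z.mean(axis=1, keepdims=True)
        # Errors with intra-cluster correlation rho_u and unit total variance.
        u = rng.normal(0.0, np.sqrt(rho_u), size=(J, 1))
        e = rng.normal(0.0, np.sqrt(1.0 - rho_u), size=(J, n))
        y = tau * z + c * share + u + e
        errors[s] = y[z].mean() - y[~z].mean() - target
    return errors.mean(), np.sqrt(np.mean(errors ** 2))

def best_design(concs, c, rho_u, **kwargs):
    """Return the grid value with the lowest simulated RMSE, plus all (bias, RMSE) results."""
    results = {conc: simulate_dm(conc, c, rho_u, **kwargs) for conc in concs}
    best = min(results, key=lambda k: results[k][1])
    return best, results
```

For example, `best_design([0.01, 0.5, 2.0, 10.0, 1e6], c=0.1, rho_u=0.3)` returns the grid value with the lowest simulated RMSE together with the bias and RMSE at every grid point; `treatment_icc` converts each grid value to the implied intra-cluster correlation of treatment. Small `conc` approaches cluster-level randomization and large `conc` approaches unit-level randomization.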

Researchers may wish to estimate the average effect of a homogeneous treatment policy change in the presence of interference, and so will be concerned with both experimental design and estimator. When using RMSE as a metric with a finite sample, it may be preferable to use the difference-in-means estimator, even when the linear-in-means model is correctly specified. In many applied settings, it may be much more practical to use difference-in-means estimators when deploying experiments at scale and so this may be taken as a reassurance to practitioners that biased estimation using difference-in-means estimators may still have lower RMSE than correctly specified linear models that account for interference.

As a general guideline when using the difference-in-means estimator, when there is low or no interference and ρ_u > , lowest RMSE is obtained by randomizing at the unit level. When ρ_u = 0, lowest RMSE is obtained by assigning treatment at the cluster level. In the presence of sufficiently high intra-cluster correlation of errors and sufficiently low interference, researchers can locate an optimal intra-cluster correlation of treatment in terms of RMSE which is between these two extremes.

However, estimated coverage will not be correct for the HAATE in such cases, and so if correct coverage is also desired, randomization should be implemented at the cluster level. If the objective of analysis is to test for or measure interference, an approach more similar to that proposed by Baird et al. [2018], Propositions 1 and 2, would be recommended, using a model informed by domain knowledge and minimizing standard errors on estimates of slope effects.

References

Peter M Aronow and Cyrus Samii. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4):1912–1947, 2017.

Sarah Baird, J Aislinn Bohren, Craig McIntosh, and Berk Özler. Optimal design of experiments in the presence of interference. Review of Economics and Statistics, 100(5):844–860, 2018.

Guillaume W Basse and Edoardo M Airoldi. Limitations of design-based causal inference and A/B testing under arbitrary and network interference. Sociological Methodology, 48(1):136–151, 2018.

Susanne Berger, Nathaniel Graham, and Achim Zeileis. Various versatile variances: An object-oriented implementation of clustered covariances in R. Working Paper 2017-12, Working Papers in Economics and Statistics, Research Platform Empirical and Experimental Economics, Universität Innsbruck, 2017. URL http://EconPapers.RePEc.org/RePEc:inn:wpaper:2017-12.

Yann Bramoullé, Habiba Djebbari, and Bernard Fortin. Identification of peer effects through social networks. Journal of Econometrics, 150(1):41–55, 2009.

Alex Chin. Regression adjustments for estimating the global treatment effect in experiments with interference. arXiv preprint arXiv:1808.08683, 2018.

D. R. Cox. Planning of Experiments. Wiley, Oxford, 1958.

Friedhelm Eicker et al. Asymptotic normality and consistency of the least squares estimators for families of linear regressions. The Annals of Mathematical Statistics, 34(2):447–456, 1963.

Bruce C Greenwald. A general analysis of bias in the estimated standard errors of least squares coefficients. Journal of Econometrics, 22(3):323–338, 1983.

M Elizabeth Halloran and Claudio J Struchiner. Causal inference in infectious diseases. Epidemiology, 6(2):142–151, 1995.

P. J. Huber. The behaviour of maximum likelihood estimates under nonstandard conditions. In L. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233, Berkeley, 1967. University of California Press.

Michael G. Hudgens and M. Elizabeth Halloran. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832–842, 2008.

Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 14(2):495–519, 2019. doi: 10.1214/18-BA1110. URL https://doi.org/10.1214/18-BA1110.

Lan Liu and Michael G Hudgens. Large sample randomization inference of causal effects in the presence of interference. Journal of the American Statistical Association, 109(505):288–301, 2014.

Charles F Manski. Identification of endogenous social effects: The reflection problem. The Review of Economic Studies, 60(3):531–542, 1993.

Robert A Moffitt et al. Policy interventions, low-level equilibria, and social interactions. Social Dynamics, 4(45-82):6–17, 2001.

Brent R Moulton. Random group effects and the precision of regression estimates. Journal of Econometrics, 32(3):385–397, 1986.

Fredrik Sävje, Peter M Aronow, and Michael G Hudgens. Average treatment effects in the presence of unknown interference. arXiv preprint arXiv:1711.06399, 2017.

Michael E Sobel. What do randomized studies of housing mobility demonstrate? causal inference in the face of interference. Journal of the American Statistical Association, 101(476):1398–1407, 2006.

Il’ya Meerovich Sobol’. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 7(4):784–802, 1967.

Eric J Tchetgen Tchetgen and Tyler J VanderWeele. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75, 2012.

Tyler J VanderWeele and Eric J Tchetgen Tchetgen. Effect partitioning under interference in two-stage randomized vaccine trials. Statistics & Probability Letters, 81(7):861–869, 2011.

Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817–838, 1980.

Achim Zeileis. Object-oriented computation of sandwich estimators. Journal of Statistical Software, 16(9):1–16, 2006. doi: 10.18637/jss.v016.i09.


A Tables

ρ_u  c  ρ_m  Ψ(¯ , ¯ )  Ψ(¯ , ¯ )  Bias LM  Bias DM  RMSE LM  RMSE DM  Estimated SE LM  Estimated SE DM  Coverage LM  Coverage DM
0 0 0.99 2.5 -2.5  0.001 0.001 0.036 0.035 0.035 0.034 0.959 0.961
... ... ... ...  -0.001 -0.001 0.053 0.035 0.052 0.035 0.957 0.963
... ... ... ...  -0.007 0.000 0.348 0.035 0.333 0.035 0.957 0.957
0.1 ... ... ... ... ... ... ...  -0.004 -0.003 0.132 0.070 0.126 0.068 0.946 0.957
... ... ... ... ... ... ... ... ... ... ...  -0.006 -0.002 0.247 0.124 0.233 0.120 0.942 0.955
... ... ... ... ... ... ... ... ... ... ...  -0.009 -0.004 0.360 0.182 0.351 0.180 0.950 0.954
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  -0.076 -0.005 3.509 0.076 3.406 0.077 0.952 0.972
0 0.1 0.99 2.6 -2.75  -0.002 -0.001 0.035 0.035 0.035 0.035 0.964 0.963
... ... ... ...  -0.001 0.035 0.054 0.095 0.052 0.036 0.949 0.562
... ... ... ...  -0.009 0.073 0.338 0.190 0.336 0.035 0.956 0.500
0.1 ... ... ... ... ... ... ...  -0.004 0.032 0.134 0.113 0.128 0.069 0.954 0.805
... ... ... ...  -0.013 0.072 0.686 0.190 0.659 0.036 0.953 0.500
0.3 ... ... ...  -0.002 -0.002 0.170 0.166 0.165 0.163 0.957 0.956
... ... ... ... ... ... ... ...  -0.010 0.072 1.192 0.191 1.162 0.042 0.953 0.500
0.5 ... ... ... ... ... ... ... ... ... ... ...  -0.015 0.072 1.797 0.192 1.717 0.049 0.949 0.501
0.8 ... ... ...  -0.021 -0.020 0.488 0.480 0.496 0.488 0.966 0.967
... ... ... ...  -0.019 0.034 0.735 0.378 0.699 0.355 0.948 0.953
... ... ... ...  -0.073 0.073 3.565 0.203 3.407 0.078 0.947 0.568
0 0.5 0.99 3 -3.75  0.000 0.004 0.035 0.037 0.035 0.036 0.964 0.964
... ... ... ... ... ... ... ...  -0.004 0.366 0.350 0.933 0.336 0.036 0.954 0.500
0.1 ... ... ...  -0.001 0.003 0.090 0.090 0.089 0.089 0.958 0.963
... ... ... ... ... ... ... ... ... ... ...  -0.004 -0.001 0.161 0.159 0.164 0.162 0.961 0.958
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  -0.081 0.366 1.784 0.933 1.731 0.050 0.952 0.500
0.8 ... ... ... ... ... ... ...  -0.010 0.172 0.700 0.567 0.698 0.358 0.962 0.811
... ... ... ... ... ... ... ... ... ... ... ...  -0.008 0.736 0.336 1.865 0.334 0.039 0.960 0.500
0.1 ... ... ...  -0.005 0.003 0.091 0.092 0.089 0.090 0.960 0.961
... ... ... ... ... ... ... ... ... ... ...  -0.004 0.003 0.167 0.166 0.165 0.164 0.959 0.955
... ... ... ...  -0.002 0.348 0.235 0.906 0.234 0.156 0.961 0.500
... ... ... ... ... ... ... ... ... ... ...  -0.005 0.349 0.362 0.915 0.352 0.206 0.951 0.500
... ... ... ... ... ... ...  -0.031 -0.022 0.507 0.502 0.496 0.488 0.964 0.964
... ... ... ...  -0.024 0.343 0.729 0.964 0.700 0.370 0.952 0.570
... ... ... ...  -0.052 0.734 3.456 1.866 3.424 0.080 0.961 0.500

Table 4: Simulation results over intra-cluster correlation of errors, magnitude of interference, and experimental design. The “LM” columns represent the linear-in-means estimator, the “DM” columns the difference-in-means model. Empirical bias, RMSE, standard errors from cluster-robust variance estimates, and coverage of 95 percent confidence intervals for each estimator average over the two estimates, Ψ̂(¯ , ¯ ) and Ψ̂(¯ , ¯ ).
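The summary columns of Table 4 (empirical bias, RMSE, estimated standard error, and coverage) can be computed from simulation draws along the following lines. This is only a sketch under assumptions: the function names `dm_with_cluster_se` and `summarize` are hypothetical, the variance estimator is the plain cluster-robust sandwich without any small-sample correction (often called CR0), which may differ from the variant used for the reported results, and intervals use a normal critical value of 1.96.

```python
import numpy as np

def dm_with_cluster_se(y, z, cluster):
    """Difference in means via OLS of y on [1, z], with a CR0 cluster-robust standard error."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    cluster = np.asarray(cluster)
    X = np.column_stack([np.ones_like(z), z])
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    # Sum the outer products of within-cluster score vectors.
    meat = np.zeros((2, 2))
    for g in np.unique(cluster):
        idx = cluster == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    # The coefficient on z is the difference in means.
    return beta[1], np.sqrt(cov[1, 1])

def summarize(estimates, ses, target, crit=1.96):
    """Empirical bias, RMSE, mean estimated SE, and interval coverage across simulation draws."""
    estimates = np.asarray(estimates, dtype=float)
    ses = np.asarray(ses, dtype=float)
    err = estimates - target
    covered = np.abs(err) <= crit * ses
    return {
        "bias": err.mean(),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mean_se": ses.mean(),
        "coverage": covered.mean(),
    }
```

Applying `dm_with_cluster_se` to each simulated data set and passing the collected estimates and standard errors to `summarize`, with `target` set to the true homogeneous-policy contrast, yields one row of such a table. When the estimator is biased and the estimated standard error is small relative to that bias, the interval rarely covers the target, which is the pattern the DM coverage column shows under interference with unit-level randomization.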