Designing Transportable Experiments

A Preprint

My Phan (University of Massachusetts – Amherst), [email protected]
David Arbour* (Adobe Research), [email protected]
Drew Dimmery* (Facebook Core Data Science), [email protected]
Anup B. Rao* (Adobe Research), [email protected]

November 25, 2020

Abstract
We consider the problem of designing a randomized experiment on a source population to estimate the Average Treatment Effect (ATE) on a target population. We propose a novel approach which explicitly considers the target when designing the experiment on the source. Under the covariate shift assumption, we design an unbiased importance-weighted estimator for the target population's ATE. To reduce the variance of our estimator, we design a covariate balance condition (Target Balance) between the treatment and control groups based on the target population. We show that Target Balance achieves a higher variance reduction asymptotically than methods that do not consider the target population during the design phase. Our experiments illustrate that Target Balance reduces the variance even for small sample sizes.

Keywords: experiments · generalizability · experimental design

The problem of generalization is present everywhere that experiments are run. In the online environment, tests are run with the users who show up on the product while the experiment is running (and who are therefore highly active users), while inferences about user experience are most useful on the full set of users, both highly active and less active [Wang et al., 2019]. In clinical research, it is an omnipresent problem to recruit minorities into randomized trials [Fisher and Kalbaugh, 2011], making it difficult to assume that the measured effects will generalize to the larger population of interest (e.g., the United States as a whole, or people afflicted with a particular health condition). In lab experiments, the sample is often one of convenience, such as undergraduates in rich countries or pools of potential subjects available online [Henrich et al., 2010]. Field experiments in governance or development such as Dunning et al. [2019] are conducted in particular countries or communities, but the policy implications of such work stretch far beyond the borders of the study population. As in Dunning et al. [2019], the desire is not just to understand how Burkina Faso voters respond to more information about their political leaders, but to understand how voters across the world might respond to similar informational treatments. The same is true for experiments in development economics, such as microfinance [Meager, 2019], and for studies of internet phenomena [Munger, 2018]. In these cases, it is no surprise after running an experiment that generalizing the knowledge is important; indeed, generalization of knowledge to a broader population is core to the motivation for the experiment in the first place.

We presuppose that an experimenter knows ex ante the population on which they wish to draw broader inferences. The task we consider, therefore, is to design an experiment that best allows the generation of causal knowledge on this inferential target. While previous work [Hartman et al., 2015, Dehejia et al., 2019, Stuart et al., 2011, DuGoff et al., 2014] has examined corrections on the analysis side to extrapolate estimates from sample to target population, the novelty of this work is in doing this through a design-based solution. That is, if you know your goal is to generalize to a target population, we consider how that should modify experimental design.

(* equal contribution)
Contributions.
Using the Mahalanobis distance and importance weighting, we design an estimator with a balancing condition for the target distribution's ATE that is unbiased and has low variance.

• In Section 3, we introduce an importance-weighted estimator with a balance condition called Target Balance that explicitly considers the target distribution in the design phase.
• In Section 5.1 we show that using the importance-weighted estimator with Target Balance results in an unbiased estimator of the target distribution's ATE (Theorem 1).
• We analyze the variance assuming a linear model. In Section 5.2.1, we show that when the dimension of the covariates is $d = 1$, for a finite sample size $n$, Target Balance reduces the variance (Corollary 1). Moreover, among all balance criteria with rejection probability at most $\alpha$ (including balancing by only considering the source distribution, which we call Source Balance), Target Balance achieves the optimal variance reduction (Theorem 2). When $d \geq 2$ (Section 5.2.2), when the sample size is large, Target Balance reduces the variance (Theorem 3) and achieves a lower variance than Source Balance (Theorem 4).
• In Section 6 we perform experiments showing that Target Balance has small mean-squared error even for $d > 1$, small sample sizes, and a nonlinear model.

We first fix notation before proceeding to the problem setting. Upper-case letters denote random variables; lower-case letters denote values taken by them. We use bold-faced letters to denote $n$ samples and normal letters to denote a single sample. For example, $X_i \in \mathbb{R}^d$ is a random variable denoting the covariates of sample $i$; $\mathbf{X} = (X_1, \ldots, X_n)^T \in \mathbb{R}^{n \times d}$ is the random variables $X_1, \ldots, X_n$ concatenated together; $x_i \in \mathbb{R}^d$ is a value of $X_i$; and $\mathbf{x} = (x_1, \ldots, x_n)^T \in \mathbb{R}^{n \times d}$ is a value of $\mathbf{X}$.

Some random variables, like $X$, can have two different distributions, either the source distribution or the target distribution. In that case, we use $E^S X$, $\operatorname{var}^S X$ and $\operatorname{cov}^S X$ to denote the expectation, variance and covariance with respect to the source distribution, and $E^T X$, $\operatorname{var}^T X$ and $\operatorname{cov}^T X$ with respect to the target distribution. We use no superscripts when there is no confusion. For example, $E A_i$ is the expectation of the treatment assignment $A_i$ of sample $i$. For a random variable $R$, we use $E_R$, $\operatorname{var}_R$ and $\operatorname{cov}_R$ to denote the expectation, variance and covariance over the randomness of $R$. For example, $E^S_{\mathbf{X}}$ denotes the expectation over the randomness of $\mathbf{X} = (X_1, \ldots, X_n)^T$ according to the source distribution. We omit the subscripts when it is clear.

The problem considered in this paper is as follows. We assume that we are presented with two populations, referred to as the source and target populations, with corresponding densities $p_S$ and $p_T$, respectively. We further assume that we observe a set of pre-treatment covariates from the source population, $x_1, \ldots, x_n \sim p_S$. We assume that we are freely able to assign treatment, $a_1, \ldots, a_n \in \{0, 1\}$, to individuals observed in the source population and observe their outcomes, $y_1, \ldots, y_n \in \mathbb{R}$. The estimand of interest is the average treatment effect for the target population (the population of individuals which were not subject to an experiment),

$$\tau^T_Y = E^T\big[Y^{A=1} - Y^{A=0}\big], \quad (1)$$

where $Y^{A=0}, Y^{A=1}$ are the potential outcomes [Rubin, 2011], i.e., the values of $Y$ that would have been observed had treatment been observed at $A = 1$ or $A = 0$, respectively.
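To make the setting concrete, the following minimal simulation (ours, not the authors' code; the distributions, coefficients and sizes are illustrative assumptions) draws source covariates, posits linear potential outcomes, and approximates the estimand (1) by Monte Carlo over the target distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, delta = 4, 500, 0.3          # illustrative sizes, not the paper's settings

# Source and target covariate distributions (a covariate-shift pair).
draw_source = lambda m: rng.normal(0.0, 1.0, size=(m, d))
draw_target = lambda m: rng.normal(delta, 1.0, size=(m, d))

# Potential outcomes Y^0, Y^1: linear in X with mean-zero noise.
beta0, beta1 = np.ones(d), 3 * np.ones(d)
def potential_outcomes(x):
    return x @ beta0 + rng.normal(size=len(x)), x @ beta1 + rng.normal(size=len(x))

# The estimand tau^T_Y = E_T[Y^1 - Y^0], approximated on a large target sample;
# the noise terms cancel in expectation.
xt = draw_target(1_000_000)
tau_target = np.mean(xt @ (beta1 - beta0))
print(tau_target)  # ≈ (beta1 - beta0) · E_T[X] = 2 * d * delta
```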
We use $Y$ to denote $(Y^1, Y^0)$ and $Y^*$ to denote the observed outcome. In order to make this problem tractable, we will assume the following throughout the remainder of the paper:

Assumption 1. Equality of conditional densities, i.e., $p_S(Y \mid X) = p_T(Y \mid X)$ (note $p_S(X) \neq p_T(X)$ in general).

Assumption 2. Overlap between source and target distributions, i.e., $p_T(X) > 0 \Rightarrow p_S(X) > 0$.

Assumption 3. $Y^1 = \psi(X)^T \beta_1 + E^1$ and $Y^0 = \psi(X)^T \beta_0 + E^0$, where $\psi$ is a basis function and $E^1, E^0$ are mean-zero random variables. To reduce notational clutter, and without loss of generality, we will assume that $\psi$ is the identity function for the remainder of the paper.

Assumption 4.
The ratio of the pdfs, $p_T(X)/p_S(X)$, is known.
In a nested trial design in which the sampling probabilities from the population are known [Dahabreh et al., 2019], this ratio can be calculated as

$$\frac{p_T(X)}{p_S(X)} = \frac{p(X \mid S = 0)}{p(X \mid S = 1)} = \frac{p(S = 0 \mid X)}{p(S = 1 \mid X)} \cdot \frac{p(S = 1)}{p(S = 0)},$$

where $S = 1$ indicates that the unit is selected to be in the source and $S = 0$ indicates that the unit is not selected and is in the target.

These assumptions, though nontrivial, are common throughout the literature on transportability [Stuart et al., 2011, Hartman et al., 2015, Bareinboim and Pearl, 2016]. We conjecture that similar results to those in this paper will hold in the case in which importance weights are estimated with parametric convergence rates. We leave this extension as future work.

For a sample $i$, let $X_i$, $A_i$, $Y_i^a$ and $Y_i^*$ be the covariates, treatment, outcome under treatment $a$, and observed outcome. Let $n_0$ be the size of the control group (where $A_i = 0$) and $n_1$ be the size of the treatment group (where $A_i = 1$). Similar to common practice (c.f. Stuart et al. [2011], Hartman et al. [2015], Rudolph and van der Laan [2017], Buchanan et al. [2018]), we infer $\tau^T_Y$ with importance weights:

$$\hat{\tau}^T_Y := \frac{1}{n_1} \sum_{A_i = 1} W_i Y_i^* - \frac{1}{n_0} \sum_{A_i = 0} W_i Y_i^* = \frac{1}{n_1} \sum_{i=1}^n W_i A_i Y_i^1 - \frac{1}{n_0} \sum_{i=1}^n W_i (1 - A_i) Y_i^0, \quad (2)$$

where $W_i = \frac{p_T(X_i)}{p_S(X_i)}$. While Equation 2 is unbiased, the estimate can incur large variance in the presence of large importance weights.

For ease of notation we define $Z_i = 2A_i - 1 \in \{-1, 1\}$, let $\mathbf{Z}$ be the $n \times 1$ vector of random variables $Z_1, \ldots, Z_n$, and let $\mathbf{z}$ be a value taken by $\mathbf{Z}$. $Y_i = (Y_i^1, Y_i^0)$ is a random variable denoting all possible outcomes of sample $i$; $\mathbf{Y} = (Y_1, \ldots, Y_n)^T \in \mathbb{R}^{n \times 2}$ is the random variables $Y_1, \ldots, Y_n$ concatenated together; $y_i \in \mathbb{R}^2$ is a value of $Y_i$, and $\mathbf{y} = (y_1, \ldots, y_n)^T \in \mathbb{R}^{n \times 2}$ is a value of $\mathbf{Y}$. Let $w_i = \frac{p_T(x_i)}{p_S(x_i)}$ and let $\mathbf{w}$ be the $n \times n$ diagonal matrix with $\mathbf{w}(i, i) = w_i$. For a matrix $\mathbf{a}$, we use $\tilde{\mathbf{a}}$ to denote $\mathbf{w}\mathbf{a}$, where each row $i$ of $\mathbf{a}$ is multiplied by $w_i$. We consider $n_1 = n_0 = n/2$ throughout the paper.

The core contribution of this work is a procedure to estimate Equation 1 which explicitly considers the target population when designing the experiment for the source population. We focus on adapting re-randomization, an experimental design procedure which optimizes balance, i.e., the difference in means of $X$ between treatment and control groups. Specifically, rerandomization centers on a balance criterion:

Definition 1 (Target Balance). With a rejection threshold $\alpha$, define the balance condition

$$\phi^\alpha_T(\mathbf{x}, \mathbf{Z}) = \begin{cases} 1, & \text{if } M\big(\tfrac{2}{n} (\mathbf{w}\mathbf{x})^T \mathbf{Z}\big) < a(\mathbf{x}) \\ 0, & \text{otherwise} \end{cases}$$

where $M(\tfrac{2}{n}(\mathbf{w}\mathbf{x})^T \mathbf{Z})$ is a distance (defined below in Eq. 3) between the covariates associated with treatment and control given by $\mathbf{Z}$, and $a(\mathbf{x})$ is chosen such that $P(\phi^\alpha_T = 1 \mid \mathbf{x}) = 1 - \alpha$. We omit $\alpha$ and simply write $\phi_T$ when $\alpha$ is not necessary for exposition. We omit $\mathbf{x}$ and write $a$ when there is no confusion.

The full assignment procedure is then:

1. Assign $\mathbf{A}$ randomly for each person $1, \ldots, n$ such that $\sum_i a_i = n_1$. There are $\binom{n}{n_1}$ ways to choose the treatment group, each of which is equally likely.
2. If $\phi_T(\mathbf{x}, \mathbf{z}) = 0$, return to step (1).
3. Conduct the experiment with treatment assignments $\mathbf{A}$.

Following standard practice in rerandomization [Morgan et al., 2012], we will focus on a criterion based on Mahalanobis distance, but incorporating a weighting term to express our desire for balance in the target distribution rather than in the source.
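A minimal sketch of this design procedure and of the estimator in Equation 2 follows (ours, not the authors' implementation; the Gaussian source/target pair anticipates Section 6, and the function names, settings, and quantile calibration of the threshold are our assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def importance_weights(x, delta):
    """w_i = p_T(x_i) / p_S(x_i) for the Gaussian pair source N(0, I) and
    target N(delta, I). With real data these densities, or the selection
    odds p(S=0|X)/p(S=1|X) from the nested trial design, must be supplied."""
    d = x.shape[1]
    return mvn(np.full(d, delta), np.eye(d)).pdf(x) / mvn(np.zeros(d), np.eye(d)).pdf(x)

def weighted_mahalanobis(x_tilde, z):
    """M((2/n) x_tilde^T Z): the weighted mean difference standardized by its
    randomization covariance, using Cov(Z) = n/(n-1) (I - 11^T/n)."""
    n = len(z)
    u = (2.0 / n) * x_tilde.T @ z
    xc = x_tilde - x_tilde.mean(axis=0)
    cov_u = (4.0 / (n * (n - 1))) * xc.T @ xc
    return u @ np.linalg.solve(cov_u, u)

def target_balance_assignment(x, w, alpha=0.5, n_cal=1000, rng=None):
    """Steps 1-3: redraw balanced assignments until M < a, with the threshold
    a calibrated empirically so that P(accept | x) = 1 - alpha."""
    rng = rng or np.random.default_rng()
    n = len(x)
    x_tilde = w[:, None] * x
    draw_z = lambda: rng.permutation(np.repeat([1.0, -1.0], n // 2))  # n1 = n0 = n/2
    a = np.quantile([weighted_mahalanobis(x_tilde, draw_z()) for _ in range(n_cal)],
                    1 - alpha)
    z = draw_z()
    while weighted_mahalanobis(x_tilde, z) >= a:
        z = draw_z()
    return (z + 1) / 2  # back to A in {0, 1}

def ipw_estimate(y_obs, a, w):
    """Equation 2 with n1 = n0 = n/2."""
    n = len(a)
    return ((w * a * y_obs).sum() - (w * (1 - a) * y_obs).sum()) / (n / 2)
```

Under Assumption 4 the weights are known exactly; in a nested trial design they could instead be computed from the selection probabilities via the identity above.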
We refer to this weighted Mahalanobis distance as $M(\tfrac{2}{n}(\mathbf{w}\mathbf{x})^T \mathbf{Z})$, where $M(\cdot)$ is defined as

$$M(U) := U^T \operatorname{Cov}(U)^{-1} U = \|B\|^2, \quad \text{where } B = \operatorname{Cov}(U)^{-1/2}\, U. \quad (3)$$

Thus, the balance condition $M(\tfrac{2}{n}(\mathbf{w}\mathbf{x})^T \mathbf{Z}) < a$ is equivalent to truncating the squared norm of $B$ to be less than $a$. Note that this is a standardized measure of the difference in importance-weighted covariate means between treatment and control, since

$$\frac{1}{n_1} \sum_{Z_i = 1} w_i x_i - \frac{1}{n_0} \sum_{Z_i = -1} w_i x_i = \frac{2}{n} (\mathbf{w}\mathbf{x})^T \mathbf{Z}.$$

Thus, rerandomization simply rejects designs with covariate imbalance larger than a pre-specified value. The novelty in our proposed design is to reject samples based on imbalance in the target distribution rather than based on imbalance in the source distribution. The standard in the rerandomization literature is to focus on balance in the source distribution, which in our setup implies assuming that the target distribution is equal to the source distribution; importance weights in this case are all equal to one. We call this balancing condition Source Balance, which we denote by $\phi^\alpha_S(\mathbf{x}, \mathbf{Z})$.
We now explain the intuition behind using Target Balance rather than Source Balance. An unbiased estimator for the source's ATE is $\hat{\tau}^S_Y := \frac{1}{n_1} \sum_i A_i Y_i^1 - \frac{1}{n_0} \sum_i (1 - A_i) Y_i^0$. There are existing results [Li et al., 2018, Harshaw et al., 2019] that can be applied to linear models to show variance reduction of $\hat{\tau}^S_Y$ with Source Balance defined by $\mathbf{x}$.

By defining new variables $\tilde{Y}_i^a = W_i \cdot Y_i^a$ for $a \in \{0, 1\}$, the importance-weighted estimator can be expressed in a form similar to $\hat{\tau}^S_Y$: $\hat{\tau}^T_Y = \frac{1}{n_1} \sum_i A_i \tilde{Y}_i^1 - \frac{1}{n_0} \sum_i (1 - A_i) \tilde{Y}_i^0$. We are no longer in a linear setting because $\tilde{Y}_i^a = W_i (\beta_a^T X_i + E^a)$. But by defining $\tilde{X}_i = W_i X_i$ and $\tilde{E}^a = W_i E^a$ we have $\tilde{Y}_i^a = \beta_a^T \tilde{X}_i + \tilde{E}^a$, where $E[\tilde{Y}^a \mid X]$ is a linear function of $\tilde{X}$, which can be considered a feature-transformed $X$. The results of Li et al. [2018] and Harshaw et al. [2019] can now be applied to estimate $\hat{\tau}^T_Y$ with Target Balance defined by $\tilde{\mathbf{x}} = \mathbf{w}\mathbf{x}$ (a small numerical check of this reduction appears at the end of this section).

Let $\rho(\mathbf{x}, \mathbf{z}) \in \{0, 1\}$ be a function of $\mathbf{x}$ and $\mathbf{z}$ used in the re-randomization procedure. Note that since we re-sample $\mathbf{Z}$ until $\rho = 1$, only the distribution of $\mathbf{Z}$ is affected by the balance condition. Let $Z_\rho$ denote the distribution of $\mathbf{Z}$ after being accepted by the balance condition $\rho = 1$.
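This reduction is mechanical and easy to check: with $\tilde{y}_i = w_i y_i$, the weighted estimator equals the plain difference-in-means estimator applied to the transformed outcomes (a self-contained check of ours; all values are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
w = rng.uniform(0.5, 2.0, size=n)                     # stand-in importance weights
y = rng.normal(size=n)                                # observed outcomes
a = rng.permutation(np.repeat([1.0, 0.0], n // 2))    # n1 = n0 = n/2

y_tilde = w * y
weighted = (w * a * y).sum() / (n / 2) - (w * (1 - a) * y).sum() / (n / 2)
unweighted_on_tilde = (a * y_tilde).sum() / (n / 2) - ((1 - a) * y_tilde).sum() / (n / 2)
assert np.isclose(weighted, unweighted_on_tilde)      # identical by construction
```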
Our work relates to and ties together two distinct strands of research: (1) ex-post generalization of experimental results to population average effects and (2) ex-ante experimental design. We will discuss each in turn.

Generalization.
Within the literature on methods for generalization, work has generally focused on ex-post adjustments to experiments previously run.

The foundational work of Stuart et al. [2011] provides an approach based on propensity scores for generalizing the results of experimental interventions to target populations. Our work leverages this general framework, but introduces methods for optimizing an experimental design to ensure effective generalization performance of the resulting estimates. Hartman et al. [2015] similarly uses a combination of matching and weighting to generalize experimental results in-sample to a population average treatment effect on the treated. Other work has also considered weighting-based approaches to generalization [Buchanan et al., 2018].

Dehejia et al. [2019] shows how to use an outcome-modeling approach to extrapolate effects estimated in one population to another. In contrast to Hartman et al. [2015] and Stuart et al. [2011], this approach relies on modeling the outcomes and then predicting effects in different locations rather than simply reweighting data observed in-sample. Dahabreh et al. [2018] provides a variety of estimation methods to generalize to a target population, including doubly-robust methods. Rudolph and van der Laan [2017], likewise, provides a doubly-robust targeted maximum likelihood estimator for transporting effects.

There has also been work focused particularly on identification in this setting. Dahabreh et al. [2019] defines a rigorous sampling framework for describing generalizability of experimental results and identifiability conditions through the g-formula. Bareinboim and Pearl [2016] lays out a general framework, termed "data fusion", for determining identifiability of effects generalized to new populations.

Miratrix et al. [2018] and Coppock et al. [2018] challenge the premise of the necessity for generalization due to the rarity of heterogeneous treatment effects. These studies focused specifically on survey experiments, however, and it is hardly in question that many important objects of study have important heterogeneous components [Allcott, 2015, Vivalt, 2015, Dehejia et al., 2019].
Experimental design.
The standard practice for experimental design is blocking [Greevy et al., 2004], in which units are divided into clusters and a fixed number of units within each cluster is assigned to treatment. This ensures balance on the cluster indicators within the sample. Higgins et al. [2016] provides a blocking scheme based on k-nearest-neighbors that can be computed more efficiently than the "optimal" blocking of Greevy et al. [2004].

Kallus [2018] takes an optimization approach to the problem of experimental design. This work optimizes treatment allocations based on in-sample measures of balance (particularly with respect to kernel means), showing how assumptions of smoothness are necessary to improve on simple Bernoulli randomization.

Rerandomization approaches simply draw allocations randomly until one is located which meets the pre-specified balance criteria. This is also the basis of our proposed method. Morgan et al. [2012] analyzes the rerandomization procedure of discarding randomized assignments that have more in-sample imbalance, in terms of Mahalanobis distance, than a pre-specified criterion. Li et al. [2018] provides asymptotic results for rerandomization that do not rely on distributional assumptions on the covariates.

Harshaw et al. [2019] provides an efficient method for obtaining linear balance using a Gram–Schmidt walk. Their algorithm includes a robustness–balance tradeoff tunable by a parameter, and provides useful tools for analyzing experimental design which we use in our theoretical analyses in Section 5.

All aforementioned work on experimental design takes as its objective the estimation of effects on the sample (i.e., it optimizes for the sample average treatment effect). This work departs by considering the alternative objective of prioritizing estimation on a target population (i.e., the population average treatment effect).
In this section we analyze the expectation and variance of our importance-weighted estimator in Eq. 2 with Target Balance as in Definition 1.

Section 5.1 shows that using the importance-weighted estimator with Target Balance results in an unbiased estimator of the target's ATE (Theorem 1).

In Section 5.2 we analyze the variance. In Section 5.2.1, Corollary 1 shows that when the dimension of the covariates is $d = 1$, for a finite sample size $n$, Target Balance reduces the variance. Moreover, among all reasonable balance criteria with rejection probability at most $\alpha$ (including Source Balance), Target Balance achieves the optimal variance reduction (Theorem 2). Section 5.2.2 shows that when $d \geq 2$ and the sample size is large, Target Balance reduces the variance (Theorem 3) and achieves a lower variance than Source Balance (Theorem 4).

We first show that our importance-weighted estimator in Eq. 2 is an unbiased estimator of the target's ATE with Target Balance:
Theorem 1.
Let $\hat{\tau}^T_Y$ be the importance-weighted estimator in Equation 2. When $n_1 = n_0 = n/2$:

$$E^S_{\mathbf{X}, \mathbf{Y}, Z_{\phi_T}}\big[\hat{\tau}^T_Y\big] = \tau^T_Y.$$

The proof makes use of the fact that the conditional distributions of $Y$ given $X$ in the source and the target are the same ($p_S(Y \mid X) = p_T(Y \mid X)$), and therefore $\frac{p_T(X)}{p_S(X)} = \frac{p_T(X, Y)}{p_S(X, Y)}$.
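Theorem 1 can be sanity-checked with a crude Monte Carlo (our sketch for $d = 1$; the model, the settings and the ad hoc acceptance threshold are illustrative, and the acceptance rule is symmetric in $\mathbf{Z}$ as the theorem requires):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, delta, reps = 200, 0.5, 4000                    # illustrative settings

def one_run():
    x = rng.normal(0, 1, n)                        # source draw
    w = norm.pdf(x, delta, 1) / norm.pdf(x, 0, 1)  # exact density ratio
    y0, y1 = x + rng.normal(size=n), 3 * x + rng.normal(size=n)
    while True:                                    # Target Balance, d = 1
        z = rng.permutation(np.repeat([1.0, -1.0], n // 2))
        if abs(np.dot(w * x, z)) < 0.5 * np.std(w * x) * np.sqrt(n):
            break                                  # symmetric acceptance rule
    a = (z + 1) / 2
    y_obs = a * y1 + (1 - a) * y0                  # importance-weighted estimate, Eq. 2
    return (w * a * y_obs).sum() / (n / 2) - (w * (1 - a) * y_obs).sum() / (n / 2)

tau_target = 2 * delta                             # E_T[3X - X] = 2 * delta here
est = np.mean([one_run() for _ in range(reps)])
print(est, tau_target)                             # agree up to Monte Carlo error
```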
5.2.1 $d = 1$

In this section we show that when $X$ is a 1-dimensional random variable and the sample size is finite, Target Balance reduces the variance compared to complete randomization. Moreover, among all symmetric balance conditions (defined below) with rejection probability at most $\alpha$ (including Source Balance), Target Balance achieves the optimal variance reduction.

Let $\rho(\mathbf{x}, \mathbf{Z}) \in \{0, 1\}$ denote a function that depends only on $\mathbf{x}$ and $\mathbf{Z}$ and satisfies the symmetry condition $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$. This definition captures all reasonable balance conditions (including Source Balance), where $\rho = 1$ denotes acceptance and $\rho = 0$ denotes rejection. Note that the constant function $\rho(\mathbf{x}, \mathbf{Z}) = 1$ for all $\mathbf{x}, \mathbf{Z}$ also satisfies the criterion $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$, and in that case $\rho = 1$ is the entire sample space. We proceed to compare Target Balance with any $\rho$ satisfying the criterion above.

First we note that by the law of total variance:

Lemma 1.
For any function $\rho(\mathbf{x}, \mathbf{Z}) \in \{0, 1\}$ satisfying $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$:

$$\operatorname{var}^S_{\mathbf{X}, \mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y) = E^S_{\mathbf{X}}\big[\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{X})\big] + \operatorname{var}^S_{\mathbf{X}}\Big(\frac{1}{n} \sum_{i=1}^n W_i (\beta_1 - \beta_0)^T X_i\Big).$$

Note that the second term does not depend on $\rho$. Therefore we focus on analyzing the variance conditioned on $\mathbf{X} = \mathbf{x}$ in this section; the result for $\operatorname{var}^S_{\mathbf{X}, \mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y)$ follows easily from $\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x})$.

Let $C_i = \frac{Y_i^1 + Y_i^0}{2}$, $c_i = \frac{y_i^1 + y_i^0}{2}$, $\bar{\beta} = \frac{\beta_1 + \beta_0}{2}$, $E_i = \frac{E_i^1 + E_i^0}{2}$ and $\sigma^2_E = \operatorname{var}(E_i)$. The variance of the importance-weighted estimator can be written as:

Lemma 2.
Let $n_1 = n_0 = n/2$. For any function $\rho(\mathbf{x}, \mathbf{Z}) \in \{0, 1\}$ satisfying $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$:

$$\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}) = \frac{4}{n^2}\, E_{Z_\rho}\bigg[\Big(\sum_{i=1}^n Z_i w_i c_i\Big)^2 \,\bigg|\, \mathbf{x}, \mathbf{y}\bigg].$$

Using the law of total variance and the fact that $W_i C_i = W_i X_i \bar{\beta} + W_i E_i$ and $E[E_i \mid \mathbf{x}] = 0$, we have:

Lemma 3.
Let $n_1 = n_0 = n/2$. For any function $\rho(\mathbf{x}, \mathbf{Z}) \in \{0, 1\}$ satisfying $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$:

$$\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}) = \frac{4}{n^2}\, \bar{\beta}^2\, E_{Z_\rho}\bigg[\Big(\sum_{i=1}^n w_i x_i Z_i\Big)^2 \,\bigg|\, \mathbf{x}\bigg] + \frac{6}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2.$$

We note that the design affects only the first term in the above decomposition. Let $V := \frac{2}{n} \sum_i Z_i w_i x_i = \frac{2}{n} \tilde{\mathbf{x}}^T \mathbf{Z}$ and let $B := \operatorname{var}(V)^{-1/2}\, V$. Recall that the Mahalanobis distance is $M(\frac{2}{n}(\mathbf{w}\mathbf{x})^T \mathbf{Z}) = \|B\|^2$. The re-randomization procedure corresponds to truncating $B$, where $B$ is a mean-zero random variable (as the $Z_i$'s are random variables) that is symmetric about zero.

It is easy to show that the best way to truncate a symmetric random variable $B$ to minimize the variance is to truncate the tail symmetrically, $\|B\| < a$ for some threshold $a$. Therefore Target Balance reduces the variance, and among all balance conditions with rejection probability at most $\alpha$ (including Source Balance), Target Balance achieves the optimal variance reduction.

Theorem 2.
Let $n_1 = n_0 = n/2$ and $d = 1$. Let $\rho(\mathbf{x}, \mathbf{Z})$ be a function satisfying $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$ and $P(\rho = 1 \mid \mathbf{x}) \geq 1 - \alpha$. Then:

$$\operatorname{var}^S_{\mathbf{Y}, Z_{\phi^\alpha_T}}(\hat{\tau}^T_Y \mid \mathbf{x}) \leq \operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}).$$

Applying Theorem 2 with $\rho$ the constant function $\rho(\mathbf{x}, \mathbf{Z}) = 1$ for all $\mathbf{x}, \mathbf{Z}$, we have:

Corollary 1.
When $d = 1$ and $n_1 = n_0 = n/2$, using Target Balance reduces the variance compared to complete randomization:

$$\operatorname{var}^S_{\mathbf{Y}, Z_{\phi_T}}(\hat{\tau}^T_Y \mid \mathbf{x}) \leq \operatorname{var}^S_{\mathbf{Y}, \mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}).$$
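The finite-sample reduction is easy to see numerically. The sketch below (ours; the covariates, weights and outcomes are arbitrary fixed stand-ins) holds $(\mathbf{x}, \mathbf{y})$ fixed so that, per Lemma 2, only the distribution of $\frac{2}{n}\sum_i Z_i w_i c_i$ matters, and compares complete randomization against the symmetric truncation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 100, 0.5
x = rng.normal(size=n)                      # fixed covariates, d = 1
w = np.exp(0.4 * x - 0.08)                  # fixed stand-in importance weights
c = 2 * x + rng.normal(size=n)              # fixed c_i = (y1_i + y0_i) / 2
draw_z = lambda: rng.permutation(np.repeat([1.0, -1.0], n // 2))

# Calibrate the Target Balance threshold: accept iff |(wx)^T z| < a, P(accept) = 1 - alpha.
a = np.quantile([abs(np.dot(w * x, draw_z())) for _ in range(5000)], 1 - alpha)

def deviations(balanced, reps=20000):
    """Draws of the estimator's deviation (2/n) * sum_i Z_i w_i c_i (Lemma 2)."""
    out = []
    while len(out) < reps:
        z = draw_z()
        if not balanced or abs(np.dot(w * x, z)) < a:
            out.append(2.0 / n * np.dot(w * c, z))
    return np.array(out)

print(deviations(False).var(), deviations(True).var())  # Target Balance is smaller
```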
5.2.2 $d \geq 2$

In this section we show that when the sample size is large, Target Balance reduces the variance and achieves a lower variance than Source Balance. We discuss the case of finite sample size in the appendix.

In this section we condition on $\mathbf{x}$ and $\mathbf{y}$, so the randomness comes only from $\mathbf{Z}$. Similar to Section 5.2.1, first we note that by the law of total variance:

Lemma 4.
For any function $\rho(\mathbf{x}, \mathbf{Z}) \in \{0, 1\}$ satisfying $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$:

$$\operatorname{var}^S_{\mathbf{X}, \mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y) = E^S_{\mathbf{X}, \mathbf{Y}}\big[\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y})\big] + \operatorname{var}^S_{\mathbf{X}, \mathbf{Y}}\Big(\frac{1}{n} \sum_{i=1}^n W_i (Y_i^1 - Y_i^0)\Big).$$

Since the second term does not depend on $\rho$, we focus on analyzing the variance conditioned on $\mathbf{X} = \mathbf{x}$, $\mathbf{Y} = \mathbf{y}$ in this subsection; the result for $\operatorname{var}^S_{\mathbf{X}, \mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y)$ follows easily from $\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y})$.

Conditioning on $\mathbf{x}$ and $\mathbf{y}$, Li et al. [2018] state that if the following conditions (Condition 1 in Li et al. [2018]) are satisfied, a finite-population central limit theorem implies that $(\hat{\tau}^T_Y, \frac{2}{n} \tilde{\mathbf{x}}^T \mathbf{Z})$ approaches a normal distribution as $n$ goes to infinity. Let $\operatorname{avg}(\tilde{\mathbf{y}}^a)$ and $\operatorname{avg}(\tilde{\mathbf{x}})$ denote the averages of the rows of $\tilde{\mathbf{y}}^a$ and $\tilde{\mathbf{x}}$. As $n \to \infty$:

• The finite population variances and covariances $\operatorname{cov}(\tilde{\mathbf{x}})$, $\operatorname{cov}(\tilde{\mathbf{y}}^1)$, $\operatorname{cov}(\tilde{\mathbf{y}}^0)$, $\operatorname{cov}(\tilde{\mathbf{y}}^1 - \tilde{\mathbf{y}}^0)$, $\operatorname{cov}(\tilde{\mathbf{y}}^1, \tilde{\mathbf{x}})$ and $\operatorname{cov}(\tilde{\mathbf{y}}^0, \tilde{\mathbf{x}})$ have limiting values.
• $\max_{1 \leq i \leq n} |\tilde{y}^a_i - \operatorname{avg}(\tilde{\mathbf{y}}^a)|^2 / n \to 0$ for $a \in \{0, 1\}$, and $\max_{1 \leq i \leq n} \|\tilde{x}_i - \operatorname{avg}(\tilde{\mathbf{x}})\|^2 / n \to 0$.

We apply Corollary 2 in Li et al. [2018] to give the expression for the asymptotic variance of $\hat{\tau}^T_Y$ under the Mahalanobis balance condition. Let as-var denote the variance of the asymptotic sampling distribution of a sequence of random variables. Applying Corollary 2 of Li et al. [2018] to our case, with covariates $\tilde{\mathbf{x}}$ and $\mathbf{x}$ and the weighted outcomes $\tilde{\mathbf{y}}$, directly yields the following result showing that both Target Balance and Source Balance reduce the variance.

Theorem 3 (Corollary 2 in Li et al. [2018]). When $n_1 = n_0 = n/2$:

$$\text{as-var}_{Z_{\phi_S}}\big(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}\big) = \lim_{n \to \infty} \operatorname{var}_{\mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y})\, \big(1 - (1 - v_{d,a})\, R^2_{\mathbf{x}}\big),$$
$$\text{as-var}_{Z_{\phi_T}}\big(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}\big) = \lim_{n \to \infty} \operatorname{var}_{\mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y})\, \big(1 - (1 - v_{d,a})\, R^2_{\tilde{\mathbf{x}}}\big),$$

where $R_{\tilde{\mathbf{x}}} = \operatorname{Corr}(\hat{\tau}^T_Y, \frac{2}{n} \tilde{\mathbf{x}}^T \mathbf{Z})$, $R_{\mathbf{x}} = \operatorname{Corr}(\hat{\tau}^T_Y, \frac{2}{n} \mathbf{x}^T \mathbf{Z})$ and $v_{d,a} = \frac{P(\chi^2_{d+2} \leq a)}{P(\chi^2_d \leq a)}$.

We now show that Target Balance has a smaller variance than Source Balance. We use the following equivalent expressions for $R_{\tilde{\mathbf{x}}}$ and $R_{\mathbf{x}}$. Let $Q := \frac{n-1}{n}\, E[\mathbf{Z}\mathbf{Z}^T] = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, where $I_n$ is the identity matrix of dimension $n$. Recall that $c_i = \frac{y_i^1 + y_i^0}{2}$. Let $\mathbf{c} := (c_1, \cdots, c_n)$ and $\tilde{\mathbf{c}} = \mathbf{w}\mathbf{c}$. We will show that:

Lemma 5.
When $n_1 = n_0 = n/2$:

$$R_{\mathbf{x}} = \sqrt{\frac{\|Q\tilde{\mathbf{c}}\|^2 - \min_{\hat{\beta}} \|Q\tilde{\mathbf{c}} - Q\mathbf{x}\hat{\beta}\|^2}{\|Q\tilde{\mathbf{c}}\|^2}}, \qquad R_{\tilde{\mathbf{x}}} = \sqrt{\frac{\|Q\tilde{\mathbf{c}}\|^2 - \min_{\hat{\beta}} \|Q\tilde{\mathbf{c}} - Q\tilde{\mathbf{x}}\hat{\beta}\|^2}{\|Q\tilde{\mathbf{c}}\|^2}}.$$

Intuitively, $R_{\mathbf{x}}$ and $R_{\tilde{\mathbf{x}}}$ describe how well $\tilde{\mathbf{c}}$ is described by a linear function of $\mathbf{x}$ and $\tilde{\mathbf{x}}$, respectively. Because of our model, a linear model in terms of $\tilde{\mathbf{x}} = \mathbf{w}\mathbf{x}$ fits $\tilde{\mathbf{c}} = \mathbf{w}\mathbf{c}$ better than a linear model in terms of $\mathbf{x}$. Therefore $R_{\tilde{\mathbf{x}}}$ will be larger than $R_{\mathbf{x}}$, and using $\phi_T$ will result in a smaller variance than $\phi_S$.

From $W_i C_i = W_i \bar{\beta}^T X_i + W_i E_i$ with $\bar{\beta} = \frac{\beta_1 + \beta_0}{2}$, we show that with the same rejection probability $\alpha$, Target Balance has a lower variance than Source Balance.

Theorem 4.
When $n_1 = n_0 = n/2$, with the same rejection probability $\alpha$:

$$\text{as-var}_{Z_{\phi^\alpha_T}}\big(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}\big) \leq \text{as-var}_{Z_{\phi^\alpha_S}}\big(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}\big).$$
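To get a feel for the size of the reduction in Theorem 3, note that $v_{d,a}$ has a closed form in chi-squared CDFs, since $M$ is asymptotically $\chi^2_d$ under complete randomization. A small sketch of ours (the acceptance probability and the value of $R^2$ are illustrative):

```python
from scipy.stats import chi2

def v_factor(d, accept_prob):
    """v_{d,a} = P(chi2_{d+2} <= a) / P(chi2_d <= a), with the threshold a
    chosen so that P(M < a) = accept_prob under the asymptotic chi2_d law."""
    a = chi2.ppf(accept_prob, df=d)
    return chi2.cdf(a, df=d + 2) / chi2.cdf(a, df=d)

# The asymptotic variance shrinks by the factor 1 - (1 - v_{d,a}) * R^2:
for d in (1, 2, 5, 10):
    v = v_factor(d, accept_prob=0.5)
    print(d, v, 1 - (1 - v) * 0.8)   # e.g. R^2 = 0.8 under Target Balance
```

A larger $R^2$ (Target Balance) directly yields a smaller multiplier, which is the content of Theorem 4.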
[Figure 1: six panels, Linear/Nonlinear × Bias(ATE), Var(ATE), MSE(ATE), plotted against sample size. Legend: Complete Randomization, Source Balance, Target Balance; Weighted, Unweighted.]
Figure 1: Bias, variance and MSE as a function of the sample size. All unweighted estimators are biased because they measure the ATE of the source distribution. As there is no importance weight threshold, all weighted estimators are unbiased (Theorem 1), but the weighted estimator with Target Balance has the lowest variance. The y axes are in log scale.

[Figure 2: six panels, Linear/Nonlinear × Bias(ATE), Var(ATE), MSE(ATE), plotted against the distance δ between the source and the target distribution. Legend: Complete Randomization, Source Balance, Target Balance; Weighted, Unweighted.]
Figure 2: Bias, variance and MSE as a function of the distance δ (defined in Section 6) between the source and the target distribution. Because of the importance weight threshold, the bias of the importance-weighted methods increases as δ increases. If the distance is too large, the bias of the importance-weighted estimators is large, leading to high MSE. However, when the distance is not too large, the weighted estimator with Target Balance has the lowest MSE. The y axes are in log scale.

[Figure 3: six panels, Linear/Nonlinear × Bias(ATE), Var(ATE), MSE(ATE), plotted against the importance weight threshold. Legend: Complete Randomization, Source Balance, Target Balance; Weighted, Unweighted.]
Figure 3: Bias, variance and MSE as a function of the importance weight threshold. As the threshold increases, the bias of the weighted methods decreases and the variance of the weighted methods increases; therefore there is a threshold at which the MSE is minimized. The weighted estimator with Target Balance has the lowest MSE for a reasonably good threshold. The y axes are in log scale.

We perform simulations on the following two models:
Linear Model: $Y^0 = X + \text{Norm}(0, \sigma^2)$, $Y^1 = 3X + \text{Norm}(0, \sigma^2)$.

Nonlinear Model: $Y^0 = X^T X + \text{Norm}(0, \sigma^2)$, $Y^1 = 2 X^T X + \text{Norm}(0, \sigma^2)$.

We use the following source and target distributions for $X$. In the source distribution, $X \sim \text{MultivariateNorm}(0, I)$, where $I$ is the identity matrix. In the target distribution, $X \sim \text{MultivariateNorm}(0 + \delta, I)$, where $\delta$ is a parameter that will be specified later.

We randomly choose an assignment such that $n_1 = n_0 = n/2$. To select the random assignment with the top balance, instead of choosing a fixed threshold $a$, we select the rejection probability $\alpha$ as in Def. 1. To implement this, we draw $1/(1 - \alpha)$ assignments at random, calculate their Mahalanobis distances, and pick one among the smallest uniformly at random.

If the source and the target distributions are far apart, importance weighting can induce large variance. We use the weight clipping technique: if an importance weight is larger than a threshold, it is set to that threshold. This induces bias but reduces variance, and can therefore reduce mean squared error (MSE).

We compare 6 methods — (WE, CR), (WE, SB), (WE, TB), (UE, CR), (UE, SB) and (UE, TB) — by combining the following two properties (a sketch of the full pipeline follows this list):

Weighted and Unweighted:
• Weighted Estimator (WE). We consider the importance-weighted estimator in Eq. 2.
• Unweighted Estimator (UE). We consider the unweighted estimator, which is equivalent to Eq. 2 with all weights set to one.

Complete Randomization, Source Balance and Target Balance:
• Complete Randomization (CR). This is the randomized assignment without balancing.
• Source Balance (SB). This is the rerandomization algorithm seeking Source Balance.
• Target Balance (TB). This is the rerandomization algorithm seeking Target Balance as in Definition 1.
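A condensed sketch of one replication of this pipeline (our reconstruction for orientation only; the sizes, $\delta$, $\alpha$, clipping threshold and tie-breaking rule stand in for the elided settings):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(4)
d, n, delta, alpha, clip = 5, 200, 0.3, 0.9, 20.0   # illustrative settings

def one_replication():
    x = rng.normal(0.0, 1.0, size=(n, d))                       # source draws
    w = mvn(np.full(d, delta), np.eye(d)).pdf(x) / mvn(np.zeros(d), np.eye(d)).pdf(x)
    w = np.minimum(w, clip)                                     # weight clipping
    y0 = x.sum(axis=1) + rng.normal(size=n)                     # linear model
    y1 = 3 * x.sum(axis=1) + rng.normal(size=n)

    def mahalanobis(xmat, z):                                   # standardized imbalance
        u = xmat.T @ z / n
        xc = xmat - xmat.mean(axis=0)
        return u @ np.linalg.solve(xc.T @ xc / n**2, u)

    def best_of(k, xmat):                                       # rejection prob. alpha
        zs = [rng.permutation(np.repeat([1.0, -1.0], n // 2)) for _ in range(k)]
        return min(zs, key=lambda z: mahalanobis(xmat, z))      # keep the most balanced

    k = int(round(1 / (1 - alpha)))
    designs = {"CR": rng.permutation(np.repeat([1.0, -1.0], n // 2)),
               "SB": best_of(k, x),
               "TB": best_of(k, w[:, None] * x)}
    out = {}
    for name, z in designs.items():
        a = (z + 1) / 2
        y_obs = a * y1 + (1 - a) * y0
        for tag, ww in (("WE", w), ("UE", np.ones(n))):         # Eq. 2 / unweighted
            out[(tag, name)] = ((ww * a * y_obs).sum()
                                - (ww * (1 - a) * y_obs).sum()) / (n / 2)
    return out

print(one_replication())  # repeat many times to estimate bias, variance and MSE
```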
We study the MSE of our methods in relation to the following 3 parameters: the sample size $n$, the importance weight threshold, and the distance $\delta$. Recall that in the source distribution $X \sim \text{MultivariateNorm}(0, I)$, where $I$ is the identity matrix, and in the target distribution $X \sim \text{MultivariateNorm}(0 + \delta, I)$.

Sample Size. In this experiment, for both models, we vary the sample size from … to … with step size …, and set the number of covariates to …. For the linear model, $\delta = $ …; for the nonlinear model, $\delta = $ …. For each sample size we repeat the experiment … times. There is no importance weight threshold. The results are shown and discussed in Figure 1.

Distance $\delta$. In this experiment, for both models, we vary $\delta$ from … to … with step size …. We set the number of covariates to …, the sample size to …, and the importance weight threshold to …. For each value of $\delta$ we repeat the experiment … times. The results are shown and discussed in Figure 2.

Threshold. In this experiment, for both models, we vary the importance weight threshold from …, then … to … with step size …. We set the number of covariates to …, the sample size to …, and $\delta = $ …. For each threshold we repeat the experiment … times. The results are shown and discussed in Figure 3.

Across all simulations, Target Balance with the Weighted Estimator substantially reduces the MSE.

In this work, we have shown that a desire for generalizability should change the way experiments are designed and run. In particular, we argue that balance should be sought on the target population rather than on the samples in which randomization will actually be performed. We present a method for designing an experiment along these lines, and show theoretically that it is unbiased and more efficient than sample balancing.
References
Hunt Allcott. Site selection bias in program evaluation. The Quarterly Journal of Economics, 130(3):1117–1165, 2015.

Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016. doi: 10.1073/pnas.1510507113.

Ashley L Buchanan, Michael G Hudgens, Stephen R Cole, Katie R Mollan, Paul E Sax, Eric S Daar, Adaora A Adimora, Joseph J Eron, and Michael J Mugavero. Generalizing evidence from randomized trials using inverse probability of sampling weights. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(4):1193, 2018.

Alexander Coppock, Thomas J. Leeper, and Kevin J. Mullinix. Generalizability of heterogeneous treatment effect estimates across samples. Proceedings of the National Academy of Sciences, 115(49):12441–12446, 2018. doi: 10.1073/pnas.1808083115.

Issa J. Dahabreh, Sarah E. Robertson, Jon A. Steingrimsson, Elizabeth A. Stuart, and Miguel A. Hernán. Extending inferences from a randomized trial to a new target population, 2018.

Issa J. Dahabreh, Sebastien J-P. A. Haneuse, James M. Robins, Sarah E. Robertson, Ashley L. Buchanan, Elisabeth A. Stuart, and Miguel A. Hernán. Study designs for extending causal inferences from a randomized trial to a target population, 2019.

Rajeev Dehejia, Cristian Pop-Eleches, and Cyrus Samii. From local to global: External validity in a fertility natural experiment. Journal of Business & Economic Statistics, pages 1–27, 2019.

Eva H. DuGoff, Megan Schuler, and Elizabeth A. Stuart. Generalizing observational study results: Applying propensity score methods to complex surveys. Health Services Research, 49(1):284–303, 2014. doi: 10.1111/1475-6773.12090. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1475-6773.12090.

Thad Dunning, Guy Grossman, Macartan Humphreys, Susan D. Hyde, Craig McIntosh, Gareth Nellis, Claire L. Adida, Eric Arias, Clara Bicalho, Taylor C. Boas, Mark T. Buntaine, Simon Chauchard, Anirvan Chowdhury, Jessica Gottlieb, F. Daniel Hidalgo, Marcus Holmlund, Ryan Jablonski, Eric Kramon, Horacio Larreguy, Malte Lierl, John Marshall, Gwyneth McClendon, Marcus A. Melo, Daniel L. Nielson, Paula M. Pickering, Melina R. Platas, Pablo Querubín, Pia Raffler, and Neelanjan Sircar. Voter information campaigns and political accountability: Cumulative findings from a preregistered meta-analysis of coordinated trials. Science Advances, 5(7), 2019. doi: 10.1126/sciadv.aaw2612. URL https://advances.sciencemag.org/content/5/7/eaaw2612.

Jill A Fisher and Corey A Kalbaugh. Challenging assumptions about minority participation in US clinical research. American Journal of Public Health, 101(12):2217–2222, 2011.
Robert Greevy, Bo Lu, Jeffrey H Silber, and Paul Rosenbaum. Optimal multivariate matching before randomization. Biostatistics, 5(2):263–275, 2004.

Christopher Harshaw, Fredrik Sävje, Daniel Spielman, and Peng Zhang. Balancing covariates in randomized experiments using the Gram–Schmidt walk, 2019.

Erin Hartman, Richard Grieve, Roland Ramsahai, and Jasjeet S Sekhon. From SATE to PATT: Combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2015.

Joseph Henrich, Steven J Heine, and Ara Norenzayan. Most people are not WEIRD. Nature, 466(7302):29, 2010.

Michael J Higgins, Fredrik Sävje, and Jasjeet S Sekhon. Improving massive experiments with threshold blocking. Proceedings of the National Academy of Sciences, 113(27):7369–7376, 2016.

Nathan Kallus. Optimal a priori balance in the design of controlled experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):85–112, 2018.

Xinran Li, Peng Ding, and Donald B Rubin. Asymptotic theory of rerandomization in treatment–control experiments. Proceedings of the National Academy of Sciences, 115(37):9157–9162, 2018.

Rachael Meager. Understanding the average impact of microcredit expansions: A Bayesian hierarchical analysis of seven randomized experiments. American Economic Journal: Applied Economics, 11(1):57–91, January 2019. doi: 10.1257/app.20170299.

Luke W. Miratrix, Jasjeet S. Sekhon, Alexander G. Theodoridis, and Luis F. Campos. Worth weighting? How to think about and use weights in survey experiments. Political Analysis, 26(3):275–291, 2018. doi: 10.1017/pan.2018.1.

Kari Lock Morgan, Donald B Rubin, et al. Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40(2):1263–1282, 2012.

Kevin Munger. Temporal validity in online social science. 2018.

Donald B Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 2011.

Kara E Rudolph and Mark J van der Laan. Robust estimation of encouragement-design intervention effects transported across sites. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1509, 2017.

Elizabeth A. Stuart, Stephen R. Cole, Catherine P. Bradshaw, and Philip J. Leaf. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011. doi: 10.1111/j.1467-985X.2010.00673.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-985X.2010.00673.x.

Eva Vivalt. Heterogeneous treatment effects in impact evaluation. American Economic Review, 105(5):467–470, May 2015. doi: 10.1257/aer.p20151015.

Yu Wang, Somit Gupta, Jiannan Lu, Ali Mahmoudzadeh, and Sophia Liu. On heavy-user bias in A/B testing. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2425–2428, 2019.
Supplement to "Designing Transportable Experiments"
In Section A we discuss the variance reduction for $d \geq 2$ when the sample size is finite. In Section B we show the proofs for Section 5.1. In Section C we show the proofs for Section 5.2.1. In Section D we show the proofs for Section 5.2.2. In Section E we show the proofs for Appendix A.

For a random variable $R$ with value $r$, we write the expectation, variance and covariance conditioning on $r$ as a shorthand for conditioning on $R = r$. On the other hand, the expectation, variance and covariance conditioning on $R$ are functions of $R$ and therefore are random variables. For example, $E[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}]$ is a function of $\mathbf{X}$ and $\mathbf{Y}$; $E[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{y}] = E[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y} = \mathbf{y}]$ is a function of $\mathbf{X}$; while $E[\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}] = E[\hat{\tau}^T_Y \mid \mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}]$ is a value.

Conditioning on $\mathbf{x}$ and $\mathbf{y}$, the randomness comes only from $\mathbf{Z}$. Therefore $\operatorname{var}_{Z_\rho}(\,\cdot \mid \mathbf{x}, \mathbf{y})$, $\operatorname{Cov}_{Z_\rho}(\,\cdot \mid \mathbf{x}, \mathbf{y})$ and $E_{Z_\rho}(\,\cdot \mid \mathbf{x}, \mathbf{y})$ can be written as $\operatorname{var}_{\mathbf{Z}}(\,\cdot \mid \mathbf{x}, \mathbf{y}, \rho = 1)$, $\operatorname{Cov}_{\mathbf{Z}}(\,\cdot \mid \mathbf{x}, \mathbf{y}, \rho = 1)$ and $E_{\mathbf{Z}}(\,\cdot \mid \mathbf{x}, \mathbf{y}, \rho = 1)$ respectively. We use both notations in the proofs.

For a random variable $R$, we use $\operatorname{Cov}(R)^{-1/2}$ to denote the Cholesky square root of $\operatorname{Cov}(R)^{-1}$.

We restate the model and some notation here for convenience. Let the model be:

$$Y_i^1 = X_i^T \beta_1 + E_i^1, \qquad Y_i^0 = X_i^T \beta_0 + E_i^0.$$

Let $\epsilon_i^1$ and $\epsilon_i^0$ be the values taken by the random variables $E_i^1$ and $E_i^0$. Let $C_i = \frac{Y_i^1 + Y_i^0}{2}$, $\tilde{C}_i = W_i C_i$, $\mathbf{C} := (C_1, \cdots, C_n)$ and $\tilde{\mathbf{C}} = (\tilde{C}_1, \cdots, \tilde{C}_n)$. Let $c_i$, $\tilde{c}_i$, $\mathbf{c}$ and $\tilde{\mathbf{c}}$ be the values taken by $C_i$, $\tilde{C}_i$, $\mathbf{C}$ and $\tilde{\mathbf{C}}$. Then

$$C_i = X_i^T \bar{\beta} + E_i, \quad c_i = x_i^T \bar{\beta} + \epsilon_i, \quad \tilde{C}_i = \tilde{X}_i^T \bar{\beta} + \tilde{E}_i, \quad \tilde{c}_i = \tilde{x}_i^T \bar{\beta} + \tilde{\epsilon}_i,$$

where $\bar{\beta} = \frac{\beta_1 + \beta_0}{2}$, $E_i = \frac{E_i^1 + E_i^0}{2}$, $\tilde{X}_i = W_i X_i$ and $\tilde{E}_i = W_i E_i$. Let $\epsilon_i$ and $\tilde{\epsilon}_i = w_i \epsilon_i$ be the values taken by $E_i$ and $\tilde{E}_i$. Let $\tilde{\mathbf{E}} = (\tilde{E}_1, \cdots, \tilde{E}_n)$.

A Additional Results: Finite-Sample Variance Reduction for $d \geq 2$

In this section we discuss the finite-sample case when $X$ is a multivariate random variable, which is a generalization of the result of Section 5.2.1 for $d = 1$. We show that when the sample size is finite, if $\bar{\beta}$ points in all directions with equal probability, then a balance condition which also considers the target population and is similar to Target Balance achieves the optimal variance reduction in expectation over $\bar{\beta}$. The proofs are in Appendix E.

We will use the variance decomposition in matrix form, similar to Harshaw et al. [2019], and provide intuition about the effect of balancing on the variance. The following lemma is the general case ($d \geq 2$) of Lemma 3 in Section 5.2.1.

Lemma A.1.
For any function $\rho(\mathbf{x}, \mathbf{Z}) \in \{0, 1\}$ satisfying $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$:

$$\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}) = \bar{\beta}^T \operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})\, \bar{\beta} + \frac{6}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2, \qquad \text{for } V := \frac{2}{n} (\mathbf{w}\mathbf{x})^T \mathbf{Z} = \frac{2}{n} \tilde{\mathbf{x}}^T \mathbf{Z}.$$

Since the design affects only the first term in the above expression, we focus on the random variable $V$. $V$ is now a $d$-dimensional vector and $\bar{\beta}$ is unknown.

To understand the first term, we use the same decomposition of $\bar{\beta}^T \operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})\, \bar{\beta}$ as in Harshaw et al. [2019]. Let $e_1, \ldots, e_d$ and $\lambda_1, \ldots, \lambda_d$ be the normalized eigenvectors and corresponding eigenvalues of the matrix $\operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})$. Since $\operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})$ is symmetric, the eigenvectors form an orthonormal basis, so we can write $\bar{\beta}$ as a linear combination of $e_1, \ldots, e_d$ and get

$$\bar{\beta} = \|\bar{\beta}\| \sum_{i=1}^d \eta_i e_i,$$

where $\eta_i = \langle \bar{\beta}, e_i \rangle / \|\bar{\beta}\|$ is the coefficient that captures the alignment of $\bar{\beta}$ with respect to the eigenvector $e_i$. Therefore:

$$\bar{\beta}^T \operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})\, \bar{\beta} = \|\bar{\beta}\|^2 \sum_{i=1}^d \eta_i^2 \lambda_i.$$

In the worst case, $\bar{\beta}$ can align with the eigenvector of $\operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})$ with the largest eigenvalue. Therefore a good design is one with $\rho$ that minimizes the largest eigenvalue of $\operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})$. We leave this for future work. In this work we consider the average-case direction: $\bar{\beta}$ with norm $\|\bar{\beta}\| = l$ can point in any direction with equal probability. In that case, we have:

Lemma A.2.

$$E_{\|\bar{\beta}\| = l}\, \bar{\beta}^T \operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})\, \bar{\beta} = \frac{l^2}{d}\, \operatorname{Trace}\big(\operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})\big). \quad (4)$$

We can then ask for the balance event $\Omega$ which minimizes the trace of $\operatorname{Cov}_{\mathbf{Z}}(V \mid \mathbf{x}, \Omega)$, which is given in the following lemma. Note that when $d = 1$, the trace of $\operatorname{Cov}_{\mathbf{Z}}(V \mid \mathbf{x}, \Omega)$ is the variance $\operatorname{var}_{\mathbf{Z}}(V \mid \mathbf{x}, \Omega)$, and this result is the general case of minimizing the variance of a 1-dimensional random variable in Section 5.2.1.

Lemma A.3.
Let $U \in \mathbb{R}^d$ be a random variable such that $E[U] = 0$. Let $u_\alpha$ be such that $P(\|U\| < u_\alpha) = 1 - \alpha$. Let $\Omega$ be an event such that $P(\Omega) \geq 1 - \alpha$ and $E[U \mid \Omega] = 0$. Then:

$$\operatorname{Trace}\big(\operatorname{Cov}(U \mid \|U\| < u_\alpha)\big) \leq \operatorname{Trace}\big(\operatorname{Cov}(U \mid \Omega)\big).$$
It follows from Lemma A.1, Lemma A.2 and Lemma A.3 that we can minimize $E_{\bar{\beta}}\, \operatorname{var}^S_{\mathbf{Y}, \mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \Omega)$ by defining the following balance condition:

Definition 2 (Alternate Target Balance). With a rejection threshold $\alpha$, define the balance condition

$$\phi'^\alpha_T = \begin{cases} 1, & \text{if } \|V\| < a \\ 0, & \text{otherwise} \end{cases}$$

where $a$ is such that $P(\phi'^\alpha_T = 1 \mid \mathbf{x}) = 1 - \alpha$.

Recall that Target Balance uses the condition $\|B\|^2 < a$, where $B = \operatorname{Cov}_{\mathbf{Z}}(V)^{-1/2}\, V$ is the normalized version of the random variable $V$. Note that since $V = \frac{2}{n} \tilde{\mathbf{x}}^T \mathbf{Z}$, Alternate Target Balance also considers the target population in the design phase. However, Alternate Target Balance is not invariant under linear transformations of the covariates $x_i$, while Target Balance is.

We have the following theorem, which is a generalization of Theorem 2 in Section 5.2.1.

Theorem A.1.
Let $\|\bar{\beta}\| = l$, let $\bar{\beta}$ point in any direction with equal probability, and let $n_1 = n_0 = n/2$. Let $\rho(\mathbf{X}, \mathbf{Z})$ be a function satisfying $\rho(\mathbf{X}, \mathbf{Z}) = \rho(\mathbf{X}, -\mathbf{Z})$ and $P(\rho = 1 \mid \mathbf{x}) \geq 1 - \alpha$. Then:

$$E_{\bar{\beta}}\, \operatorname{var}^S_{\mathbf{Y}, Z_{\phi'^\alpha_T}}(\hat{\tau}^T_Y \mid \mathbf{x}) \leq E_{\bar{\beta}}\, \operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}).$$

Similar to Section 5.2.1, applying Theorem A.1 with $\rho$ the constant function $\rho(\mathbf{x}, \mathbf{Z}) = 1$ for all $\mathbf{x}, \mathbf{Z}$, we have:

Corollary A.1.
Let $\|\bar{\beta}\| = l$ and let $\bar{\beta}$ point in any direction with equal probability. When $n_1 = n_0 = n/2$, using Alternate Target Balance reduces the variance compared to complete randomization in expectation over $\bar{\beta}$:

$$E_{\bar{\beta}}\, \operatorname{var}^S_{Z_{\phi'_T}, \mathbf{Y}}(\hat{\tau}^T_Y \mid \mathbf{x}) \leq E_{\bar{\beta}}\, \operatorname{var}^S_{\mathbf{Z}, \mathbf{Y}}(\hat{\tau}^T_Y \mid \mathbf{x}).$$

Recall that the first term in the decomposition in Lemma A.1 is equal to

$$\bar{\beta}^T \operatorname{Cov}_{Z_\rho}(V \mid \mathbf{x})\, \bar{\beta} = \gamma^T \operatorname{Cov}_{Z_\rho}(B \mid \mathbf{x})\, \gamma = \gamma^T \operatorname{Cov}_{\mathbf{Z}}(B \mid \mathbf{x}, \rho = 1)\, \gamma,$$

where $\gamma = \operatorname{Cov}_{\mathbf{Z}}(V)^{1/2}\, \bar{\beta}$ and $B = \operatorname{Cov}_{\mathbf{Z}}(V)^{-1/2}\, V$.

When the sample size is large, $B$ converges to a standard normal distribution. Recall that Target Balance is equivalent to truncating $\|B\|^2 < a$, so $\operatorname{Cov}_{Z_{\phi_T}}(B \mid \mathbf{x})$ is the covariance of a standard normal random variable $B$ truncated by $\|B\|^2 < a$. From Theorem 3.1 in Morgan et al. [2012], when $B$ is standard normal, $\operatorname{Cov}(B \mid \mathbf{x}, \phi_T = 1) = v\, \operatorname{Cov}(B \mid \mathbf{x})$ for some $v < 1$, so the variance is reduced. However, we do not need to go through this analysis because Li et al. [2018] already provides variance-reduction results for the case when the sample size is large. In Section 5.2.2 we use the result from Li et al. [2018] directly to show that Target Balance achieves a smaller variance than Source Balance.
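For completeness, a sketch of the Alternate Target Balance condition of Definition 2 (ours; all settings illustrative). The only change relative to Target Balance is that the unstandardized norm $\|V\|$ is truncated instead of the Mahalanobis distance:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, alpha = 100, 3, 0.5
x_tilde = rng.normal(size=(n, d)) * rng.uniform(0.5, 2.0, size=(n, 1))  # stand-in wx
draw_z = lambda: rng.permutation(np.repeat([1.0, -1.0], n // 2))

v_norm = lambda z: np.linalg.norm(2.0 / n * x_tilde.T @ z)   # ||V||, V = (2/n) x~^T Z
a = np.quantile([v_norm(draw_z()) for _ in range(5000)], 1 - alpha)  # P(accept) = 1-alpha

z = draw_z()
while v_norm(z) >= a:        # Alternate Target Balance: truncate ||V|| directly
    z = draw_z()
```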
B Proofs of Section 5.1
In this section we prove Theorem 1. We make use of the following lemma from Morgan et al. [2012]:

Lemma B.1 (from the proof of Theorem 2.1 in Morgan et al. [2012]). Let $\mathbf{A} := (A_1, \ldots, A_n)^T \in \mathbb{R}^n$ and let $n_1 = n_0 = n/2$. For any function $\rho(\mathbf{x}, \mathbf{A}) \in \{0, 1\}$ satisfying $\rho(\mathbf{x}, \mathbf{A}) = \rho(\mathbf{x}, \mathbf{1} - \mathbf{A})$:

$$E^S_{\mathbf{A}}[A_i \mid \mathbf{x}, \mathbf{y}, \rho = 1] = \frac{1}{2}.$$

We also prove the following lemma in order to prove Theorem 1:
Lemma B.2.
For any function $\rho(\mathbf{x}, \mathbf{A}) \in \{0, 1\}$ satisfying $\rho(\mathbf{x}, \mathbf{A}) = \rho(\mathbf{x}, \mathbf{1} - \mathbf{A})$:

$$E_{\mathbf{A} \mid \rho = 1}[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}] = \frac{1}{n} \sum_{i=1}^n W_i (Y_i^1 - Y_i^0), \qquad E^S_{\mathbf{Y}, \mathbf{A} \mid \rho = 1}[\hat{\tau}^T_Y \mid \mathbf{X}] = \frac{1}{n} \sum_{i=1}^n W_i (\beta_1 - \beta_0)^T X_i.$$

Proof. From Lemma B.1, $E[A_i \mid \mathbf{X}, \mathbf{Y}, \rho = 1] = E[A_i \mid \mathbf{X}, \rho = 1] = \frac{1}{2}$. Therefore:

$$E_{\mathbf{A} \mid \rho=1}[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}] = \frac{2}{n} \sum_{i=1}^n E_{\mathbf{A}}\big[W_i A_i Y_i^1 \mid \mathbf{X}, \mathbf{Y}, \rho = 1\big] - \frac{2}{n} \sum_{i=1}^n E_{\mathbf{A}}\big[W_i (1 - A_i) Y_i^0 \mid \mathbf{X}, \mathbf{Y}, \rho = 1\big]$$
$$= \frac{2}{n} \sum_{i=1}^n W_i Y_i^1\, E_{\mathbf{A}}\big[A_i \mid \mathbf{X}, \mathbf{Y}, \rho = 1\big] - \frac{2}{n} \sum_{i=1}^n W_i Y_i^0\, E_{\mathbf{A}}\big[1 - A_i \mid \mathbf{X}, \mathbf{Y}, \rho = 1\big] = \frac{1}{n} \sum_{i=1}^n W_i (Y_i^1 - Y_i^0).$$

$$E^S_{\mathbf{A} \mid \rho=1, \mathbf{Y}}[\hat{\tau}^T_Y \mid \mathbf{X}] = E^S_{\mathbf{Y}}\big[E_{\mathbf{A}}[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}, \rho = 1] \mid \mathbf{X}\big] = E^S_{\mathbf{Y}}\Big[\frac{1}{n} \sum_{i=1}^n W_i (Y_i^1 - Y_i^0) \,\Big|\, \mathbf{X}\Big] = \frac{1}{n} \sum_{i=1}^n W_i (\beta_1 - \beta_0)^T X_i. \qquad \square$$
Proof of Theorem 1.
Let $D_S$ and $D_T$ be the supports of the source and target distributions. Since $p_T(X) > 0 \Rightarrow p_S(X) > 0$ and $p_T(Y \mid X) = p_S(Y \mid X)$, we have $D_T \subseteq D_S$. Using Lemma B.2:

$$E^S_{\mathbf{X}, \mathbf{Y}, Z_{\phi_T}}\big[\hat{\tau}^T_Y\big] = E^S_{\mathbf{X}, \mathbf{Y}}\, E_{\mathbf{A}_{\phi_T}}[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}] = \frac{1}{n} \sum_{i=1}^n E^S_{\mathbf{X}, \mathbf{Y}}\big[W_i (Y_i^1 - Y_i^0)\big]$$
$$= \frac{1}{n} \sum_{i=1}^n \int_{(x, y) \in D_S} \frac{p_T(x)}{p_S(x)}\, (y^1 - y^0)\, p_S(x, y)\, dx\, dy$$
$$= \frac{1}{n} \sum_{i=1}^n \int_{(x, y) \in D_S} \frac{p_T(y \mid x)\, p_T(x)}{p_S(y \mid x)\, p_S(x)}\, (y^1 - y^0)\, p_S(x, y)\, dx\, dy \qquad (\text{because } p_T(y \mid x) = p_S(y \mid x))$$
$$= \frac{1}{n} \sum_{i=1}^n \int_{(x, y) \in D_S} \frac{p_T(y, x)}{p_S(y, x)}\, (y^1 - y^0)\, p_S(x, y)\, dx\, dy = \frac{1}{n} \sum_{i=1}^n \int_{(x, y) \in D_S} p_T(x, y)\, (y^1 - y^0)\, dx\, dy$$
$$= \frac{1}{n} \sum_{i=1}^n \int_{(x, y) \in D_T} p_T(x, y)\, (y^1 - y^0)\, dx\, dy \qquad (\text{because } D_T \subseteq D_S) \qquad = \tau^T_Y. \qquad \square$$
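The change-of-measure identity at the heart of this proof, $E^S[W\, g(X, Y)] = E^T[g(X, Y)]$ with $W = p_T(X)/p_S(X)$, is easy to verify numerically (our sketch; one-dimensional Gaussians chosen for convenience):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
m, delta = 2_000_000, 0.7
xs = rng.normal(0.0, 1.0, m)                       # source draws
w = norm.pdf(xs, delta, 1) / norm.pdf(xs, 0, 1)    # W = p_T / p_S
g = lambda x: x**2 + np.sin(x)                     # arbitrary test function of X

lhs = np.mean(w * g(xs))                           # E_S[W g(X)]
rhs = np.mean(g(rng.normal(delta, 1.0, m)))        # E_T[g(X)]
print(lhs, rhs)                                    # agree up to Monte Carlo error
```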
C Proofs of Section 5.2.1

In this section we prove Lemma 1, Lemma 2, Lemma 3, Theorem 2 and Corollary 1. Note that the results in this section are the special case $d = 1$ of the results in Section A: Lemma 2 is a special case of Lemma E.1, Lemma 3 is a special case of Lemma A.1, and Theorem 2 is a special case of Theorem A.1. However, in this section we state the full proofs for the case $d = 1$ so that readers do not need to read the proofs of Section A in order to understand Section 5.2.1 in the main paper.

Proof of Lemma 1.
By the law of total variance:

$$\operatorname{var}^S_{Z_\rho, \mathbf{X}, \mathbf{Y}}(\hat{\tau}^T_Y) = E^S_{\mathbf{X}}\, \operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{X}) + \operatorname{var}^S_{\mathbf{X}}\big(E^S_{\mathbf{Y}, Z_\rho}[\hat{\tau}^T_Y \mid \mathbf{X}]\big).$$

Since $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$, from Lemma B.2:

$$E^S_{\mathbf{Y}, Z_\rho}[\hat{\tau}^T_Y \mid \mathbf{X}] = \frac{1}{n} \sum_{i=1}^n W_i (\beta_1 - \beta_0)^T X_i.$$

Therefore:

$$\operatorname{var}^S_{\mathbf{X}}\big(E^S_{\mathbf{Y}, Z_\rho}[\hat{\tau}^T_Y \mid \mathbf{X}]\big) = \operatorname{var}^S_{\mathbf{X}}\Big(\frac{1}{n} \sum_{i=1}^n W_i (\beta_1 - \beta_0)^T X_i\Big). \qquad \square$$

Proof of Lemma 2.
By definition:

$$\operatorname{var}_{\mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}, \rho = 1) = E_{\mathbf{Z}}\big[(\hat{\tau}^T_Y - E_{\mathbf{Z}}[\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}, \rho = 1])^2 \mid \mathbf{x}, \mathbf{y}, \rho = 1\big].$$

From Lemma B.2:

$$E_{\mathbf{Z}}[\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}, \rho = 1] = \frac{1}{n} \Big(\sum_{i=1}^n w_i y_i^1 - \sum_{i=1}^n w_i y_i^0\Big).$$

On the other hand, conditioning on $\mathbf{X} = \mathbf{x}$ and $\mathbf{Y} = \mathbf{y}$, and letting $y^*_i$ denote the observed outcome of sample $i$:

$$\hat{\tau}^T_Y = \frac{2}{n} \Big(\sum_{Z_i = 1} w_i y^*_i - \sum_{Z_i = -1} w_i y^*_i\Big) = \frac{2}{n} \sum_{i=1}^n w_i A_i y_i^1 - \frac{2}{n} \sum_{i=1}^n w_i (1 - A_i) y_i^0.$$

Therefore:

$$\operatorname{var}_{\mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{y}, \rho = 1) = E_{\mathbf{Z}}\bigg[\Big(\frac{2}{n} \sum_{i=1}^n w_i A_i y_i^1 - \frac{2}{n} \sum_{i=1}^n w_i (1 - A_i) y_i^0 - \frac{1}{n} \sum_{i=1}^n w_i (y_i^1 - y_i^0)\Big)^2 \,\bigg|\, \mathbf{x}, \mathbf{y}, \rho = 1\bigg]$$
$$= E_{\mathbf{Z}}\bigg[\Big(\frac{1}{n} \sum_{i=1}^n w_i (2A_i - 1) y_i^1 + \frac{1}{n} \sum_{i=1}^n w_i (2A_i - 1) y_i^0\Big)^2 \,\bigg|\, \mathbf{x}, \mathbf{y}, \rho = 1\bigg]$$
$$= \frac{4}{n^2}\, E_{\mathbf{Z}}\bigg[\Big(\sum_{i=1}^n w_i Z_i \frac{y_i^1 + y_i^0}{2}\Big)^2 \,\bigg|\, \mathbf{x}, \mathbf{y}, \rho = 1\bigg] = \frac{4}{n^2}\, E_{\mathbf{Z}}\bigg[\Big(\sum_{i=1}^n Z_i w_i c_i\Big)^2 \,\bigg|\, \mathbf{x}, \mathbf{y}, \rho = 1\bigg],$$

where $c_i = \frac{y_i^1 + y_i^0}{2}$. $\square$

Proof of Lemma 3.
By the law of total variance:

$$\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}) = E^S_{\mathbf{Y}}\big[\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}) \mid \mathbf{x}\big] + \operatorname{var}^S_{\mathbf{Y}}\big(E_{Z_\rho}[\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}] \mid \mathbf{x}\big)$$
$$= E^S_{\mathbf{Y}}\big[\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}) \mid \mathbf{x}\big] + \operatorname{var}^S_{\mathbf{Y}}\Big(\frac{1}{n} \sum_{i=1}^n w_i (Y_i^1 - Y_i^0) \,\Big|\, \mathbf{x}\Big)$$
$$= E^S_{\mathbf{Y}}\big[\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}) \mid \mathbf{x}\big] + \frac{1}{n^2} \sum_{i=1}^n w_i^2\, \operatorname{var}(E_i^1 - E_i^0) = E^S_{\mathbf{Y}}\big[\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}) \mid \mathbf{x}\big] + \frac{2}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2.$$

Recall that $\tilde{C}_i = \bar{\beta} \tilde{X}_i + \tilde{E}_i$. From Lemma 2:

$$\operatorname{var}_{\mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}, \rho = 1) = \frac{4}{n^2}\, E_{\mathbf{Z}}\bigg[\Big(\sum_{i=1}^n Z_i \tilde{C}_i\Big)^2 \,\bigg|\, \mathbf{x}, \mathbf{Y}, \rho = 1\bigg] = \frac{4}{n^2}\, E_{\mathbf{Z}}\Big[\big(\mathbf{Z}^T \tilde{\mathbf{C}}\big)^2 \,\Big|\, \mathbf{x}, \mathbf{Y}, \rho = 1\Big]$$
$$= \frac{4}{n^2}\, E_{\mathbf{Z}}\Big[\big(\bar{\beta}\, \mathbf{Z}^T \tilde{\mathbf{x}} + \mathbf{Z}^T \tilde{\mathbf{E}}\big)^2 \,\Big|\, \mathbf{x}, \mathbf{Y}, \rho = 1\Big]$$
$$= \frac{4}{n^2}\, \bar{\beta}^2\, E_{\mathbf{Z}}\Big[\big(\mathbf{Z}^T \tilde{\mathbf{x}}\big)^2 \,\Big|\, \mathbf{x}, \rho = 1\Big] + \frac{4}{n^2}\, E_{\mathbf{Z}}\Big[\big(\mathbf{Z}^T \tilde{\mathbf{E}}\big)^2 \,\Big|\, \mathbf{x}, \mathbf{Y}, \rho = 1\Big] + \frac{8}{n^2}\, E_{\mathbf{Z}}\Big[\bar{\beta}\, \tilde{\mathbf{x}}^T \mathbf{Z}\mathbf{Z}^T \tilde{\mathbf{E}} \,\Big|\, \mathbf{x}, \mathbf{Y}, \rho = 1\Big]. \quad (5)$$

Now we consider $E^S_{\mathbf{Y}}\big[\operatorname{var}_{\mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}, \rho = 1) \mid \mathbf{x}\big]$. The third term in Eq. 5 becomes:

$$\frac{8}{n^2}\, E^S_{\mathbf{Y}}\Big[E_{\mathbf{Z}}\big[\bar{\beta}\, \tilde{\mathbf{x}}^T \mathbf{Z}\mathbf{Z}^T \tilde{\mathbf{E}} \mid \mathbf{x}, \mathbf{Y}, \rho = 1\big] \,\Big|\, \mathbf{x}\Big] = \frac{8}{n^2}\, \bar{\beta}\, E_{\mathbf{Z}}\big[\tilde{\mathbf{x}}^T \mathbf{Z}\mathbf{Z}^T \mid \mathbf{x}, \rho = 1\big]\, E^S_{\mathbf{Y}}[\tilde{\mathbf{E}} \mid \mathbf{x}] = 0,$$

because $E^S_{\mathbf{Y}}[\tilde{\mathbf{E}} \mid \mathbf{x}] = 0$. The second term in Eq. 5 becomes:

$$\frac{4}{n^2}\, E^S_{\mathbf{Y}}\Big[E_{\mathbf{Z}}\big[(\mathbf{Z}^T \tilde{\mathbf{E}})^2 \mid \mathbf{x}, \mathbf{Y}, \rho = 1\big] \,\Big|\, \mathbf{x}\Big] = \frac{4}{n^2}\, E^S_{\mathbf{Y}}\bigg[E_{\mathbf{Z}}\Big[\sum_{i=1}^n (Z_i w_i E_i)^2 \,\Big|\, \mathbf{x}, \mathbf{Y}, \rho = 1\Big] \,\bigg|\, \mathbf{x}\bigg] + \frac{4}{n^2} \sum_{i \neq j} E_{\mathbf{Z}}[Z_i Z_j \mid \mathbf{x}, \rho = 1]\, w_i w_j\, E^S_{\mathbf{Y}}[E_i E_j \mid \mathbf{x}]$$
$$= \frac{4}{n^2}\, E^S_{\mathbf{Y}}\Big[\sum_{i=1}^n (w_i E_i)^2 \,\Big|\, \mathbf{x}\Big] + 0 \qquad (\text{because } E^S_{\mathbf{Y}}[E_i E_j \mid \mathbf{x}] = E^S_{\mathbf{Y}}[E_i \mid \mathbf{x}]\, E^S_{\mathbf{Y}}[E_j \mid \mathbf{x}] = 0 \text{ and } Z_i^2 = 1)$$
$$= \frac{4}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2.$$

The first term in Eq. 5 becomes:

$$\frac{4}{n^2}\, E^S_{\mathbf{Y}}\Big[\bar{\beta}^2\, E_{\mathbf{Z}}\big[(\mathbf{Z}^T \tilde{\mathbf{x}})^2 \mid \mathbf{x}, \rho = 1\big] \,\Big|\, \mathbf{x}\Big] = \frac{4}{n^2}\, \bar{\beta}^2\, E_{\mathbf{Z}}\bigg[\Big(\sum_{i=1}^n Z_i w_i x_i\Big)^2 \,\bigg|\, \mathbf{x}, \rho = 1\bigg].$$

Putting all terms together:

$$E^S_{\mathbf{Y}}\big[\operatorname{var}_{\mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}, \mathbf{Y}, \rho = 1) \mid \mathbf{x}\big] = \frac{4}{n^2}\, \bar{\beta}^2\, E_{\mathbf{Z}}\bigg[\Big(\sum_{i=1}^n Z_i w_i x_i\Big)^2 \,\bigg|\, \mathbf{x}, \rho = 1\bigg] + \frac{4}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2.$$

Therefore:

$$\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}) = \frac{4}{n^2}\, \bar{\beta}^2\, E_{\mathbf{Z}}\bigg[\Big(\sum_{i=1}^n Z_i w_i x_i\Big)^2 \,\bigg|\, \mathbf{x}, \rho = 1\bigg] + \frac{6}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2. \qquad \square$$

In order to prove Theorem 2, we show that for a random variable $U$ with $E[U] = 0$, among events $\Omega$ preserving the expectation $E[U \mid \Omega] = 0$, truncating the tail results in the smallest variance. Note that if $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$, it follows from Lemma B.1 that $E[\frac{2}{n} \tilde{\mathbf{x}}^T \mathbf{Z} \mid \rho = 1] = E[\frac{2}{n} \tilde{\mathbf{x}}^T \mathbf{Z}] = 0$. We show how to minimize the variance of such a random variable:

Lemma C.1.
Let $U \in \mathbb{R}$ be a random variable such that $E[U] = 0$. Let $u_\alpha$ be such that $P(U^2 < u_\alpha) = 1 - \alpha$. Let $\Omega$ be an event such that $P(\Omega) \geq 1 - \alpha$ and $E[U \mid \Omega] = 0$. Then:

$$E\big(U^2 \mid U^2 < u_\alpha\big) \leq E\big(U^2 \mid \Omega\big).$$
Proof. Let $p(u)$ be the pdf of $U$. Define $f(u) := p(U = u, \Omega)$; then

$$p(u \mid \Omega) = \frac{p(U = u, \Omega)}{P(\Omega)} = \frac{f(u)}{1 - \alpha}, \qquad \text{so} \qquad E[U^2 \mid \Omega] = \int_u u^2\, \frac{f(u)}{1 - \alpha}\, du.$$

We want to minimize $E(U^2 \mid \Omega)$:

$$\min_f \int_u u^2\, \frac{f(u)}{1 - \alpha}\, du \qquad \text{subject to} \qquad 0 \leq f(u) \leq p(u)\ \forall u, \qquad P(\Omega) = \int_u f(u)\, du = 1 - \alpha.$$

This is achieved by maximizing $f(u)$, i.e., setting $f(u) = p(u)$, for the smallest values of $u^2$, which is exactly setting $\Omega$ to be the event $U^2 < u_\alpha$. $\square$

Proof of Theorem 2.
Let $V := \frac{2}{n} \sum_i w_i x_i Z_i$ and $B = \operatorname{var}(V)^{-1/2}\, V$. From Lemma 3:

$$\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}) = \bar{\beta}^2\, E_{\mathbf{Z}}\big[V^2 \mid \mathbf{x}, \rho = 1\big] + \frac{6}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2 = \bar{\beta}^2\, \operatorname{var}(V)\, E_{\mathbf{Z}}\big[B^2 \mid \mathbf{x}, \rho = 1\big] + \frac{6}{n^2}\, \sigma^2_E \sum_{i=1}^n w_i^2.$$

Since $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$, from Lemma B.1 we have $E_{\mathbf{Z}}[B \mid \mathbf{x}, \rho = 1] = 0$, which satisfies the criteria of Lemma C.1. Let $\eta := 1 - P(\rho = 1 \mid \mathbf{x})$; then $\eta \leq \alpha$. Let $b_\eta$ be such that $P(B^2 < b_\eta \mid \mathbf{x}) = 1 - \eta$ and $b_\alpha$ be such that $P(B^2 < b_\alpha \mid \mathbf{x}) = 1 - \alpha$. From Lemma C.1:

$$E_{\mathbf{Z}}\big[B^2 \mid \mathbf{x}, \rho = 1\big] \geq E_{\mathbf{Z}}\big[B^2 \mid \mathbf{x}, B^2 < b_\eta\big] \geq E_{\mathbf{Z}}\big[B^2 \mid \mathbf{x}, B^2 < b_\alpha\big] \qquad (\text{because } b_\eta \geq b_\alpha)$$
$$= E_{\mathbf{Z}}\big[B^2 \mid \mathbf{x}, \phi^\alpha_T = 1\big]. \qquad \square$$

Proof of Corollary 1.
Let $\rho$ be the constant function $\rho(\mathbf{x}, \mathbf{Z}) = 1$ for all $\mathbf{x}, \mathbf{Z}$. Then:

$$\operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}) = \operatorname{var}^S_{\mathbf{Y}, \mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}).$$

From Theorem 2 we have:

$$\operatorname{var}^S_{\mathbf{Y}, Z_{\phi^\alpha_T}}(\hat{\tau}^T_Y \mid \mathbf{x}) \leq \operatorname{var}^S_{\mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{x}) = \operatorname{var}^S_{\mathbf{Y}, \mathbf{Z}}(\hat{\tau}^T_Y \mid \mathbf{x}). \qquad \square$$

D Discussion on Section 5.2.2
Proof of Lemma 4.
By the law of total variance:

$$\operatorname{var}^S_{\mathbf{X}, \mathbf{Y}, Z_\rho}(\hat{\tau}^T_Y) = E^S_{\mathbf{X}, \mathbf{Y}}\big[\operatorname{var}_{Z_\rho}(\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y})\big] + \operatorname{var}^S_{\mathbf{X}, \mathbf{Y}}\big(E_{Z_\rho}[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}]\big).$$

Since $\rho(\mathbf{x}, \mathbf{Z}) = \rho(\mathbf{x}, -\mathbf{Z})$, from Lemma B.2:

$$E_{\mathbf{Z}}[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}, \rho = 1] = \frac{1}{n} \sum_{i=1}^n W_i (Y_i^1 - Y_i^0).$$

Therefore:

$$\operatorname{var}^S_{\mathbf{X}, \mathbf{Y}}\big(E_{\mathbf{Z}}[\hat{\tau}^T_Y \mid \mathbf{X}, \mathbf{Y}, \rho = 1]\big) = \operatorname{var}^S_{\mathbf{X}, \mathbf{Y}}\Big(\frac{1}{n} \sum_{i=1}^n W_i (Y_i^1 - Y_i^0)\Big). \qquad \square$$

We now prove Lemma 5, using the following result from Harshaw et al. [2019]:
Lemma D.1 (Lemma A1 in Harshaw et al. [2019]). Let $y^*_i$ denote the observed outcome of sample $i$. Then:

$$\frac{2}{n} \Big(\sum_{z_i = 1} y^*_i - \sum_{z_i = -1} y^*_i\Big) - \frac{1}{n} \sum_{i=1}^n (y_i^1 - y_i^0) = \frac{2}{n}\, \mathbf{c}^T \mathbf{z},$$

where $c_i = \frac{y_i^1 + y_i^0}{2}$ and $\mathbf{c} := (c_1, \cdots, c_n)$.

We will also use the following lemma:
Lemma D.2.
Let $Q := \frac{n-1}{n}\, E[\mathbf{Z}\mathbf{Z}^T]$. Let $I_n$ denote the $n \times n$ identity matrix and $\mathbf{1}$ the $n$-dimensional vector of ones. Then:

$$Q = I_n - \frac{1}{n}\, \mathbf{1}\mathbf{1}^T, \qquad Q = Q^T, \qquad Q = Q^2 = Q^T Q = Q Q^T.$$

Let $\mathbf{s} \in \mathbb{R}^{n \times d}$ be a matrix. Then $Q\mathbf{s} = \mathbf{s} - \mathbf{1}\operatorname{avg}(\mathbf{s})^T$, where $\operatorname{avg}(\mathbf{s}) \in \mathbb{R}^d$ is the average of the rows of $\mathbf{s}$.

Proof. First we show that

$$E[\mathbf{Z}\mathbf{Z}^T] = \frac{n}{n-1}\Big(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)$$

by showing that $E[Z_i^2] = 1$ and $E[Z_i Z_j] = -\frac{1}{n-1}$ for $i \neq j$. First, $E[Z_i^2] = 1$ because $Z_i^2 = 1$. Since there are exactly $n/2$ samples with value $Z_i = 1$ and $n/2$ samples with value $Z_i = -1$, note that $(\sum_{i=1}^n Z_i)^2 = 0$ and

$$E\Big[\Big(\sum_{i=1}^n Z_i\Big)^2\Big] = E\Big[\sum_{i=1}^n Z_i^2\Big] + \sum_{i \neq j} E[Z_i Z_j].$$

Since all pairs $(i, j)$ with $i \neq j$ play equal roles and there are $n(n-1)$ such pairs:

$$E[Z_i Z_j] = \frac{E[(\sum_{i=1}^n Z_i)^2] - E[\sum_{i=1}^n Z_i^2]}{n(n-1)} = \frac{0 - n}{n(n-1)} = -\frac{1}{n-1}.$$

Since $Q$ is symmetric, $Q = Q^T$. We show that $Q^2 = Q$:

$$Q^2 = \Big(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)\Big(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big) = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T - \frac{1}{n}\mathbf{1}\mathbf{1}^T + \frac{1}{n^2}\mathbf{1}\mathbf{1}^T\mathbf{1}\mathbf{1}^T = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T = Q,$$

using $\mathbf{1}^T\mathbf{1} = n$. Since $Q = Q^T$, we have $Q = Q^2 = QQ^T = Q^TQ$. For the last property:

$$Q\mathbf{s} = I_n \mathbf{s} - \frac{1}{n}\mathbf{1}\mathbf{1}^T \mathbf{s} = \mathbf{s} - \mathbf{1}\operatorname{avg}(\mathbf{s})^T,$$

because $I_n \mathbf{s} = \mathbf{s}$ and $\frac{1}{n}\mathbf{1}^T \mathbf{s} = \operatorname{avg}(\mathbf{s})^T$. $\square$
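These algebraic facts are quick to confirm numerically (our check):

```python
import numpy as np

n, d = 6, 3
Q = np.eye(n) - np.ones((n, n)) / n           # Q = I_n - (1/n) 1 1^T
s = np.arange(n * d, dtype=float).reshape(n, d)

assert np.allclose(Q, Q.T)                    # symmetry
assert np.allclose(Q @ Q, Q)                  # idempotence: Q^2 = Q
assert np.allclose(Q @ s, s - s.mean(axis=0)) # Q s centers the rows of s
```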
OVEMBER
25, 2020
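A quick numerical sketch of Lemma D.2 (dimensions and values are illustrative): $Q$ is the centering projection, it matches $\frac{n-1}{n}E[ZZ^T]$ when the expectation is taken over all balanced assignments, and $Qs$ subtracts the column means of $s$.

```python
# Sanity checks for Lemma D.2 by exact enumeration of balanced assignments.
import numpy as np
from itertools import combinations
from math import comb

n, d = 6, 2
Q = np.eye(n) - np.ones((n, n)) / n
assert np.allclose(Q, Q.T) and np.allclose(Q @ Q, Q)  # symmetric, idempotent

EZZ = np.zeros((n, n))
for pos in combinations(range(n), n // 2):            # all balanced Z
    z = -np.ones(n); z[list(pos)] = 1
    EZZ += np.outer(z, z)
EZZ /= comb(n, n // 2)
assert np.allclose((n - 1) / n * EZZ, Q)              # Q = ((n-1)/n) E[Z Z^T]

s = np.arange(n * d, dtype=float).reshape(n, d)
assert np.allclose(Q @ s, s - s.mean(axis=0))         # Q s = s - 1 avg(s)
```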
Proof of Lemma 5.
For any matrix $s \in \mathbb{R}^{n \times d}$ we will compute $R_s := \operatorname{Corr}\!\big(\hat\tau_{TY}, \frac{2}{n}Z^Ts\big)$, where for any $Y \in \mathbb{R}$ and $X \in \mathbb{R}^d$, $\operatorname{Corr}(Y, X)$ is defined as:
\[
\operatorname{Corr}(Y, X) = \operatorname{Corr}(Y, X^T\beta^*) = \frac{\operatorname{Cov}(Y, X^T\beta^*)}{\sqrt{\operatorname{var}(Y)}\sqrt{\operatorname{var}(X^T\beta^*)}}, \qquad \beta^* = \arg\min_{\hat\beta} E\|Y - X^T\hat\beta\|^2.
\]
Substituting $s = x$ and $s = \tilde x$ will give us $R_x$ and $R_{\tilde x}$. Let $\tilde\delta_i = \tilde y_i^1 - \tilde y_i^0$ and $\tilde\delta := (\tilde\delta_1, \cdots, \tilde\delta_n)$. From Lemma D.1 we have:
\[
\hat\tau_{TY} = \frac{2}{n}Z^T\tilde c + \frac{1}{n}\mathbf{1}^T\tilde\delta,
\]
where $\mathbf{1} \in \mathbb{R}^n$ is the vector of ones. Note that, conditioning on $y$, $\mathbf{1}^T\tilde\delta$ is a constant independent of $Z$. Let $Q := \frac{n-1}{n}E[ZZ^T]$ and note that $Q = Q^T$ and $Q^2 = Q$. First, let us compute $\beta^* = \arg\min_{\hat\beta} E_Z\|\hat\tau_{TY} - \frac{2}{n}Z^Ts\hat\beta\|^2$. We have:
\begin{align*}
\beta^* &= \arg\min_{\hat\beta} E_Z\Big\|\frac{2}{n}Z^T\tilde c + \frac{1}{n}\mathbf{1}^T\tilde\delta - \frac{2}{n}Z^Ts\hat\beta\Big\|^2 \\
&= \arg\min_{\hat\beta} E_Z\|Z^T\tilde c - Z^Ts\hat\beta\|^2 \quad\text{(the constant term does not affect the minimizer since $E[Z] = 0$)} \\
&= \arg\min_{\hat\beta}\,(\tilde c - s\hat\beta)^T E[ZZ^T](\tilde c - s\hat\beta) \\
&= \arg\min_{\hat\beta}\,(\tilde c - s\hat\beta)^T Q(\tilde c - s\hat\beta) \\
&= \arg\min_{\hat\beta}\,(\tilde c - s\hat\beta)^T Q^TQ(\tilde c - s\hat\beta) \\
&= \arg\min_{\hat\beta}\,\|Q\tilde c - Qs\hat\beta\|^2.
\end{align*}
Using the fact that $Q = Q^TQ$, we have $\beta^* = (s^TQs)^{-1}s^TQ\tilde c$. By definition, we have
\begin{align*}
\operatorname{Corr}\!\Big(\hat\tau_{TY}, \frac{2}{n}Z^Ts\Big) &= \frac{E_Z\big[\hat\tau_{TY}\,\frac{2}{n}Z^Ts\beta^*\big] - E_Z[\hat\tau_{TY}]\,E_Z\big[\frac{2}{n}Z^Ts\beta^*\big]}{\sqrt{\operatorname{var}_Z(\hat\tau_{TY})\operatorname{var}_Z\big(\frac{2}{n}Z^Ts\beta^*\big)}} \\
&= \frac{E_Z\big[\hat\tau_{TY}\,Z^Ts\beta^*\big]}{\sqrt{\operatorname{var}_Z(\hat\tau_{TY})\operatorname{var}_Z(Z^Ts\beta^*)}} \quad\text{because } E[Z] = 0 \\
&= \frac{E_Z\big[\big(\frac{2}{n}\tilde c^TZ + \frac{1}{n}\mathbf{1}^T\tilde\delta\big)Z^Ts\beta^*\big]}{\sqrt{\operatorname{var}_Z\big(\frac{2}{n}Z^T\tilde c + \frac{1}{n}\mathbf{1}^T\tilde\delta\big)\operatorname{var}_Z(Z^Ts\beta^*)}} \\
&= \frac{E_Z\big[\big(\frac{2}{n}\tilde c^TZ\big)Z^Ts\beta^*\big]}{\sqrt{\operatorname{var}_Z\big(\frac{2}{n}Z^T\tilde c\big)\operatorname{var}_Z(Z^Ts\beta^*)}} \\
&= \frac{E_Z\big[\tilde c^TZZ^Ts\beta^*\big]}{\sqrt{\operatorname{var}_Z(Z^T\tilde c)\operatorname{var}_Z(Z^Ts\beta^*)}}
\end{align*}
For the numerator we have:
\begin{align*}
E_Z\big[\tilde c^TZZ^Ts\beta^*\big] &= \frac{n}{n-1}\,\tilde c^TQs\beta^* \\
&= \frac{n}{n-1}\,\tilde c^TQs(s^TQs)^{-1}s^TQ\tilde c \\
&= \frac{n}{n-1}\,\tilde c^TQs(s^TQs)^{-1}s^TQs(s^TQs)^{-1}s^TQ\tilde c \\
&= \frac{n}{n-1}\big(\tilde c^TQs(s^TQs)^{-1}s^TQ\big)\big(Qs(s^TQs)^{-1}s^TQ\tilde c\big) \\
&= \frac{n}{n-1}\,(\beta^{*T}s^TQ)(Qs\beta^*) = \frac{n}{n-1}\,\|Qs\beta^*\|^2.
\end{align*}
Let $u = Qs\beta^*$ and $v = Q\tilde c - Qs\beta^*$. We will show that $u$ and $v$ are orthogonal, so that by the Pythagorean identity $\|Q\tilde c\|^2 = \|u + v\|^2 = \|u\|^2 + \|v\|^2$, i.e., $\|Qs\beta^*\|^2 = \|Q\tilde c\|^2 - \|Q\tilde c - Qs\beta^*\|^2$:
\begin{align*}
v^Tu &= (Q\tilde c - Qs\beta^*)^T(Qs\beta^*) \\
&= \tilde c^TQ^TQs\beta^* - \beta^{*T}s^TQ^TQs\beta^* \\
&= \tilde c^TQs\beta^* - \|Qs\beta^*\|^2 = 0,
\end{align*}
since $\tilde c^TQs\beta^* = \tilde c^TQs(s^TQs)^{-1}s^TQ\tilde c = \|Qs\beta^*\|^2$. Therefore $\|Qs\beta^*\|^2 = \|Q\tilde c\|^2 - \|Q\tilde c - Qs\beta^*\|^2$.
For the denominator, since $E[Z] = 0$ we have:
\begin{align*}
\operatorname{var}_Z(Z^T\tilde c)\operatorname{var}_Z(Z^Ts\beta^*) &= E_Z[\tilde c^TZZ^T\tilde c]\;E_Z[\beta^{*T}s^TZZ^Ts\beta^*] \\
&= \frac{n^2}{(n-1)^2}\,(\tilde c^TQ\tilde c)(\beta^{*T}s^TQs\beta^*) \\
&= \frac{n^2}{(n-1)^2}\,(\tilde c^TQ^TQ\tilde c)(\beta^{*T}s^TQ^TQs\beta^*) \\
&= \frac{n^2}{(n-1)^2}\,\|Q\tilde c\|^2\,\|Qs\beta^*\|^2.
\end{align*}
Putting the numerator and denominator together we have:
\[
R_s = \operatorname{Corr}\!\Big(\hat\tau_{TY}, \frac{2}{n}Z^Ts\Big) = \frac{\|Qs\beta^*\|^2}{\|Q\tilde c\|\,\|Qs\beta^*\|} = \frac{\|Qs\beta^*\|}{\|Q\tilde c\|} = \frac{\sqrt{\|Q\tilde c\|^2 - \|Q\tilde c - Qs\beta^*\|^2}}{\|Q\tilde c\|}.
\]
Substituting $s = x$ and $s = \tilde x$ gives us the expressions for $R_x$ and $R_{\tilde x}$.
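The closed form can be verified exactly on a small example (illustrative values; we enumerate all balanced assignments rather than sampling):

```python
# Exact check of R_s = ||Q s beta*|| / ||Q c_tilde|| by enumerating all
# balanced assignments Z; the 2/n scaling cancels in the correlation.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, d = 8, 2
c = rng.normal(size=n)                       # plays the role of c_tilde
s = rng.normal(size=(n, d))
Q = np.eye(n) - np.ones((n, n)) / n

beta = np.linalg.solve(s.T @ Q @ s, s.T @ Q @ c)   # beta* from the proof

zs = np.array([np.where(np.isin(np.arange(n), pos), 1.0, -1.0)
               for pos in combinations(range(n), n // 2)])

a, b = zs @ c, zs @ (s @ beta)               # Z^T c and Z^T s beta* per draw
corr = (a * b).mean() / np.sqrt((a**2).mean() * (b**2).mean())
closed_form = np.linalg.norm(Q @ s @ beta) / np.linalg.norm(Q @ c)
assert np.isclose(corr, closed_form)
```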
Proof of Theorem 4.
We have $\tilde C = \tilde X^T\beta + \tilde E$, where $C = \frac{Y^1 + Y^0}{2}$, $E = \frac{E^1 + E^0}{2}$, $\beta = \frac{\beta^1 + \beta^0}{2}$, $\tilde C_i = \frac{p_T(X_i)}{p_S(X_i)}C_i$, $\tilde X_i = \frac{p_T(X_i)}{p_S(X_i)}X_i$, and $\tilde E_i = \frac{p_T(X_i)}{p_S(X_i)}E_i$. Let $S \in \mathbb{R}^d$ be a random variable independent of $E_i$ for all $i$, and let $s \in \mathbb{R}^{n \times d}$ collect the values of $n$ samples of $S$. We have:
\[
R_s^2 = \frac{\|Q\tilde c\|^2 - \min_{\hat\beta}\|Q\tilde c - Qs\hat\beta\|^2}{\|Q\tilde c\|^2}.
\]
We will show that $\lim_{n\to\infty} R_{\tilde x}^2 \ge \lim_{n\to\infty} R_s^2$. It is sufficient to show that
\[
\lim_{n\to\infty}\min_{\hat\beta}\frac{1}{n}\|Q\tilde c - Q\tilde x\hat\beta\|^2 \le \lim_{n\to\infty}\min_{\hat\beta}\frac{1}{n}\|Q\tilde c - Qs\hat\beta\|^2.
\]
From Lemma D.2, note that for any matrix $s \in \mathbb{R}^{n \times d}$ with $n$ rows, $Qs = s - \mathbf{1}\operatorname{avg}(s)$, where $\operatorname{avg}(s) \in \mathbb{R}^{1 \times d}$ is the average of the rows of $s$. We have:
\begin{align*}
\lim_{n\to\infty}\min_{\hat\beta}\frac{1}{n}\|Q\tilde c - Qs\hat\beta\|^2 &= \min_{\hat\beta}\lim_{n\to\infty}\frac{1}{n}\|Q\tilde c - Qs\hat\beta\|^2 \quad\text{when } S \text{ is bounded} \\
&= \min_{\hat\beta}\lim_{n\to\infty}\frac{1}{n}\big\|(\tilde c - \mathbf{1}\operatorname{avg}(\tilde c)) - (s - \mathbf{1}\operatorname{avg}(s))\hat\beta\big\|^2 \\
&= \min_{\hat\beta}\lim_{n\to\infty}\frac{1}{n}\big\|(\tilde c - s\hat\beta) - \mathbf{1}\big(\operatorname{avg}(\tilde c) - \operatorname{avg}(s)\hat\beta\big)\big\|^2 \\
&= \min_{\hat\beta}\lim_{n\to\infty}\frac{1}{n}\|u - \mathbf{1}\operatorname{avg}(u)\|^2 \quad\text{where } u = \tilde c - s\hat\beta \text{ collects } n \text{ samples of } U = \tilde C - S^T\hat\beta \\
&= \min_{\hat\beta}\operatorname{var}(U) \\
&= \min_{\hat\beta}\,E\big[(\tilde C - S^T\hat\beta)^2\big] - \big(E[\tilde C - S^T\hat\beta]\big)^2 \\
&= \min_{\hat\beta}\,E\big[(\tilde X^T\beta - S^T\hat\beta)^2\big] + E[\tilde E^2] - \big(E[\tilde X^T\beta - S^T\hat\beta]\big)^2 \\
&\qquad\qquad\text{because } E[\tilde E] = 0 \text{ and } \tilde E \text{ is independent of } \tilde X \text{ and } S \\
&= \min_{\hat\beta}\operatorname{var}\big(\tilde X^T\beta - S^T\hat\beta\big) + E[\tilde E^2] \;\ge\; E[\tilde E^2].
\end{align*}
When $S = \tilde X$, this lower bound is attained (take $\hat\beta = \beta$), therefore for any $S$:
\[
\lim_{n\to\infty} R_s^2 \le \lim_{n\to\infty} R_{\tilde x}^2.
\]
Substituting $s = x$:
\[
\lim_{n\to\infty} R_x^2 \le \lim_{n\to\infty} R_{\tilde x}^2.
\]
Recall that:
\[
\operatorname{as\text{-}var}_Z\!\left(\hat\tau_{TY} \,\middle|\, x, y, M\!\Big(\frac{2}{n}Z^Ts\Big) \le a\right) = \lim_{n\to\infty}\operatorname{var}(\hat\tau_{TY} \mid x, y)\big(1 - (1 - v_{d,a})R_s^2\big),
\]
where as-var is the variance of the asymptotic sampling distribution. Let $s(a)$ denote the acceptance probability $P(\phi_S = 1 \mid x)$ when using threshold $a$ in Source Balance, and let $t(a)$ denote the acceptance probability $P(\phi_T = 1 \mid x)$ when using threshold $a$ in Target Balance. We have:
\begin{align*}
\operatorname{as\text{-}var}_Z\!\left(\hat\tau_{TY} \mid x, y, \phi_S^{s(a)} = 1\right) &= \operatorname{as\text{-}var}_Z\!\left(\hat\tau_{TY} \,\middle|\, x, y, M\!\Big(\frac{2}{n}Z^Tx\Big) \le a\right) \\
&= \lim_{n\to\infty}\operatorname{var}(\hat\tau_{TY} \mid x, y)\big(1 - (1 - v_{d,a})R_x^2\big) \\
&\ge \lim_{n\to\infty}\operatorname{var}(\hat\tau_{TY} \mid x, y)\big(1 - (1 - v_{d,a})R_{\tilde x}^2\big) \\
&= \operatorname{as\text{-}var}_Z\!\left(\hat\tau_{TY} \,\middle|\, x, y, M\!\Big(\frac{2}{n}Z^T\tilde x\Big) \le a\right) \\
&= \operatorname{as\text{-}var}_Z\!\left(\hat\tau_{TY} \mid x, y, \phi_T^{t(a)} = 1\right)
\end{align*}
Now we will show that $\lim_{n\to\infty}s(a) = \lim_{n\to\infty}t(a)$. Let $U \in \mathbb{R}^d$ be a standard multivariate normal random variable. We have:
\begin{align*}
\lim_{n\to\infty}s(a) &= \lim_{n\to\infty}P\!\left(M\!\Big(\frac{2}{n}Z^Tx\Big) \le a\right) \\
&= \lim_{n\to\infty}P\big(\|B_S\|^2 < a\big) \quad\text{where } B_S = \frac{2}{n}Z^Tx\operatorname{Cov}\!\Big(\frac{2}{n}Z^Tx\Big)^{-1/2} \\
&= P\big(\|U\|^2 < a\big) \quad\text{because } B_S \text{ converges in distribution to } U \text{ by the finite-population central limit theorem.}
\end{align*}
Similarly we have:
\begin{align*}
\lim_{n\to\infty}t(a) &= \lim_{n\to\infty}P\!\left(M\!\Big(\frac{2}{n}Z^T\tilde x\Big) \le a\right) \\
&= \lim_{n\to\infty}P\big(\|B_T\|^2 < a\big) \quad\text{where } B_T := \frac{2}{n}Z^T\tilde x\operatorname{Cov}\!\Big(\frac{2}{n}Z^T\tilde x\Big)^{-1/2} \\
&= P\big(\|U\|^2 < a\big) \quad\text{because } B_T \text{ converges in distribution to } U \text{ by the finite-population central limit theorem.}
\end{align*}
Therefore $\lim_{n\to\infty}t(a) = \lim_{n\to\infty}s(a)$. When the sample size is large, at the same acceptance (hence rejection) probability, using Target Balance results in a smaller asymptotic variance than Source Balance.
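The conclusion of Theorem 4 is easy to see in simulation. The sketch below (all data-generating choices are illustrative, not from the paper's experiments) accepts the best-balanced fifth of assignment draws under each criterion and compares the resulting variance of the weighted estimator; Target Balance, which balances the importance-weighted covariate, should typically show the smaller variance.

```python
# Illustrative comparison of Source Balance (balance x) and Target
# Balance (balance w * x) at the same acceptance probability.
import numpy as np

rng = np.random.default_rng(3)
n, reps, keep = 100, 4000, 0.2               # accept the best-balanced 20%
x = rng.normal(size=n)                       # a single source covariate
w = np.exp(0.8 * x); w *= n / w.sum()        # stand-in weights p_T(x)/p_S(x)
y1, y0 = 2.0 * x + 1.0, 2.0 * x              # linear outcomes, effect 1

base = np.r_[np.ones(n // 2), -np.ones(n // 2)]
zs = np.array([rng.permutation(base) for _ in range(reps)])

def estimator_variance(balance_on):
    imbalance = np.abs(zs @ balance_on)                  # |Z^T s| statistic
    acc = zs[imbalance <= np.quantile(imbalance, keep)]
    tau_hat = (2 / n) * (acc * (w * np.where(acc == 1, y1, y0))).sum(axis=1)
    return tau_hat.var()

print(estimator_variance(x))      # Source Balance
print(estimator_variance(w * x))  # Target Balance: typically smaller
```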
E Proofs of Section A

In this section we present the proofs of Lemma A.1, Lemma A.2, Lemma A.3, Theorem A.1, and Corollary A.1. In order to prove Lemma A.1, we first prove the following lemma.
Lemma E.1 (minor changes to Lemma 1 in Harshaw et al. [2019]). Let $\tilde\epsilon_i = \tilde c_i - \beta^T\tilde x_i$ and $\tilde\epsilon = (\tilde\epsilon_1, \cdots, \tilde\epsilon_n)$. For any function $\rho(x, Z) \in \{0, 1\}$ satisfying $\rho(x, Z) = \rho(x, -Z)$:
\begin{align}
\frac{n^2}{4}\operatorname{var}_Z(\hat\tau_{TY} \mid x, y, \rho = 1) &= \operatorname{Cov}(\tilde c^TZ \mid \rho = 1) \tag{6} \\
&= \beta^T\operatorname{Cov}(\tilde x^TZ \mid \rho = 1)\beta + \operatorname{Cov}(\tilde\epsilon^TZ \mid \rho = 1) + 2\beta^T\operatorname{Cov}(\tilde x^TZ, \tilde\epsilon^TZ \mid \rho = 1) \tag{7}
\end{align}
Proof of Lemma E.1.
By definition:
\[
\operatorname{var}_Z(\hat\tau_{TY} \mid x, y, \rho = 1) = E_Z\big[(\hat\tau_{TY} - E_Z[\hat\tau_{TY} \mid x, y, \rho = 1])^2 \,\big|\, x, y, \rho = 1\big].
\]
We have:
\begin{align*}
E_Z[\hat\tau_{TY} \mid x, y, \rho = 1] &= \frac{2}{n}E_Z\!\left[\sum_{Z_i=1}w_iy^*_i - \sum_{Z_i=-1}w_iy^*_i \,\middle|\, \rho = 1\right] \\
&= \frac{2}{n}E\!\left[\sum_{i=1}^nA_iw_iy_i^1 - \sum_{i=1}^n(1 - A_i)w_iy_i^0 \,\middle|\, \rho = 1\right] \\
&= \frac{2}{n}\left(\sum_{i=1}^nE[A_i \mid \rho = 1]\,w_iy_i^1 - \sum_{i=1}^nE[1 - A_i \mid \rho = 1]\,w_iy_i^0\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^nw_iy_i^1 - \sum_{i=1}^nw_iy_i^0\right) \quad\text{because } E[A_i \mid \rho = 1] = 1/2 \text{ by Lemma B.1.}
\end{align*}
Therefore, using Lemma D.1:
\begin{align*}
\operatorname{var}_Z(\hat\tau_{TY} \mid x, y, \rho = 1) &= E_Z\!\left[\left(\frac{2}{n}\Big(\sum_{Z_i=1}w_iy^*_i - \sum_{Z_i=-1}w_iy^*_i\Big) - \frac{1}{n}\sum_{i=1}^nw_i(y_i^1 - y_i^0)\right)^2 \,\middle|\, x, y, \rho = 1\right] \\
&= \frac{4}{n^2}E[\tilde c^TZZ^T\tilde c \mid x, y, \rho = 1] \\
&= \frac{4}{n^2}\operatorname{Cov}(\tilde c^TZ \mid x, y, \rho = 1) \quad\text{because } E[\tilde c^TZ \mid x, y, \rho = 1] = 0 \text{ from Lemma B.1} \\
&= \frac{4}{n^2}\operatorname{Cov}\big((\tilde x\beta + \tilde\epsilon)^TZ \mid x, y, \rho = 1\big) \\
&= \frac{4}{n^2}\Big(\beta^T\operatorname{Cov}(\tilde x^TZ \mid x, y, \rho = 1)\beta + \operatorname{Cov}(\tilde\epsilon^TZ \mid x, y, \rho = 1) + 2\beta^T\operatorname{Cov}(\tilde x^TZ, \tilde\epsilon^TZ \mid x, y, \rho = 1)\Big).
\end{align*}
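Equation (6) holds exactly for any symmetric acceptance rule; the sketch below (illustrative values) checks it by enumerating every balanced assignment and keeping those that pass a balance threshold:

```python
# Exact check of (n^2/4) var_Z(tau_hat | rho=1) = Cov(c_tilde^T Z | rho=1).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n = 8
x = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)                 # illustrative weights
y1, y0 = x + rng.normal(size=n), rng.normal(size=n)
c = w * (y1 + y0) / 2                             # c_tilde from the text

zs = np.array([np.where(np.isin(np.arange(n), pos), 1.0, -1.0)
               for pos in combinations(range(n), n // 2)])

stat = np.abs(zs @ (w * x))                       # symmetric in Z -> -Z
acc = zs[stat <= np.quantile(stat, 0.3)]          # the acceptance rule rho
tau_hat = (2 / n) * (acc * (w * np.where(acc == 1, y1, y0))).sum(axis=1)
lhs = (n**2 / 4) * tau_hat.var()
rhs = (acc @ c).var()                             # Cov(c_tilde^T Z | rho=1)
assert np.isclose(lhs, rhs)
```

Proof of Lemma A.1.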
By the law of total variance:
\begin{align*}
\operatorname{var}_{S_Y,Z_\rho}(\hat\tau_{TY} \mid x) &= E_{S_Y}\!\big[\operatorname{var}_Z(\hat\tau_{TY} \mid x, Y, \rho = 1) \,\big|\, x\big] + \operatorname{var}_{S_Y}\!\big(E_Z[\hat\tau_{TY} \mid x, Y, \rho = 1] \,\big|\, x\big) \\
&= E_{S_Y}\!\big[\operatorname{var}_Z(\hat\tau_{TY} \mid x, Y, \rho = 1) \,\big|\, x\big] + \operatorname{var}_{S_Y}\!\left(\frac{1}{n}\sum_{i=1}^nw_i(Y_i^1 - Y_i^0) \,\middle|\, x\right) \\
&= E_{S_Y}\!\big[\operatorname{var}_Z(\hat\tau_{TY} \mid x, Y, \rho = 1) \,\big|\, x\big] + \frac{1}{n^2}\sum_{i=1}^nw_i^2\operatorname{var}(E^1 - E^0) \\
&= E_{S_Y}\!\big[\operatorname{var}_Z(\hat\tau_{TY} \mid x, Y, \rho = 1) \,\big|\, x\big] + \frac{2}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2.
\end{align*}
From Lemma E.1:
\begin{align*}
\frac{n^2}{4}\operatorname{var}_Z(\hat\tau_{TY} \mid x, y, \rho = 1) &= \beta^T\operatorname{Cov}(\tilde x^TZ \mid x, y, \rho = 1)\beta + \operatorname{Cov}(\tilde\epsilon^TZ \mid x, y, \rho = 1) + 2\beta^T\operatorname{Cov}(\tilde x^TZ, \tilde\epsilon^TZ \mid x, y, \rho = 1) \\
&= \beta^T\operatorname{Cov}(\tilde x^TZ \mid x, y, \rho = 1)\beta + \tilde\epsilon^T\operatorname{Cov}(Z \mid x, y, \rho = 1)\tilde\epsilon + 2\beta^T\operatorname{Cov}(\tilde x^TZ, Z \mid x, y, \rho = 1)\tilde\epsilon.
\end{align*}
Recall that $Y_i^1 = \beta_1^TX_i + E_i^1$ and $Y_i^0 = \beta_0^TX_i + E_i^0$. Let $E_i = \frac{E_i^1 + E_i^0}{2}$ and $\tilde E = (\tilde E_1, \cdots, \tilde E_n)$. Since $\tilde\epsilon$ is the value of $\tilde E$, we have:
\begin{align*}
\frac{n^2}{4}E_{S_Y}\!\big[\operatorname{var}_Z(\hat\tau_{TY} \mid x, Y, \rho = 1) \,\big|\, x\big] &= \beta^T\operatorname{Cov}(\tilde x^TZ \mid x, \rho = 1)\beta + E_{S_Y}\!\big[\tilde E^T\operatorname{Cov}(Z \mid x, Y, \rho = 1)\tilde E \,\big|\, x\big] \\
&\qquad + 2\beta^T\operatorname{Cov}(\tilde x^TZ, Z \mid x, \rho = 1)\,E[\tilde E \mid x] \\
&= \beta^T\operatorname{Cov}(\tilde x^TZ \mid x, \rho = 1)\beta + E_{S_Y}\!\big[\operatorname{Cov}(\tilde E^TZ \mid x, \rho = 1) \,\big|\, x\big] \quad\text{because } E[\tilde E \mid x] = 0.
\end{align*}
The second term:
\begin{align*}
E_{S_Y}\!\big[\operatorname{Cov}(\tilde E^TZ \mid x, \rho = 1) \,\big|\, x\big] &= E_{\tilde E}\!\big[E_Z[\tilde E^TZZ^T\tilde E \mid x, \rho = 1] \,\big|\, x\big] \\
&= E_{\tilde E}\!\left[\sum_{i=1}^n\sum_{j=1}^nE_Z[w_iE_iZ_iZ_jE_jw_j \mid \rho = 1] \,\middle|\, x\right] \\
&= E_{\tilde E}\!\left[\sum_{i=1}^nE[w_i^2E_i^2Z_i^2 \mid \rho = 1] + \sum_{i\ne j}E[w_iE_iZ_iZ_jE_jw_j \mid \rho = 1] \,\middle|\, x\right] \\
&= E_{\tilde E}\!\left[\sum_{i=1}^nE[w_i^2E_i^2 \mid \rho = 1] + \sum_{i\ne j}E[w_iZ_iZ_jE_jw_j \mid \rho = 1]\,E[E_i \mid \rho = 1] \,\middle|\, x\right] \quad\text{because } Z_i^2 = 1 \\
&= \sum_{i=1}^nE[w_i^2E_i^2] \quad\text{because } E[E_i \mid \rho = 1] = 0 \\
&= \sum_{i=1}^nw_i^2\sigma_E^2.
\end{align*}
Putting it all together:
\[
\operatorname{var}_{S_Y,Z_\rho}(\hat\tau_{TY} \mid x) = \frac{4}{n^2}\left(\beta^T\operatorname{Cov}(\tilde x^TZ \mid x, \rho = 1)\beta + \sum_{i=1}^nw_i^2\sigma_E^2\right) + \frac{2}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2 = \frac{4}{n^2}\beta^T\operatorname{Cov}_Z(\tilde x^TZ \mid x, \rho = 1)\beta + \frac{6}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2.
\]
Proof of Lemma A.2.
We use the same decomposition of $\beta^T\operatorname{Cov}_Z(V \mid x, \Omega)\beta$ as in Harshaw et al. [2019]. Let $e_1, \ldots, e_n$ and $\lambda_1, \ldots, \lambda_n$ be the normalized eigenvectors and corresponding eigenvalues of the matrix $\operatorname{Cov}_Z(V \mid x, \Omega)$. Since $\operatorname{Cov}_Z(V \mid x, \Omega)$ is symmetric, the eigenvectors form an orthonormal basis, so we can write $\beta$ as a linear combination of $e_1, \ldots, e_n$:
\[
\beta = \|\beta\|\sum_{i=1}^n\eta_ie_i,
\]
where $\eta_i = \langle\beta, e_i\rangle/\|\beta\|$ is the coefficient that captures the alignment of the weighted outcome $\beta$ with respect to the eigenvector $e_i$. Therefore:
\[
\beta^T\operatorname{Cov}_Z(V \mid x, \Omega)\beta = \|\beta\|^2\sum_{i=1}^n\eta_i^2\lambda_i.
\]
Then:
\[
E_\beta\big[\beta^T\operatorname{Cov}_Z(V \mid x, \Omega)\beta\big] = E_\beta\!\left[\|\beta\|^2\sum_{i=1}^n\eta_i^2\lambda_i\right] = l^2\sum_{i=1}^n\lambda_iE_\beta[\eta_i^2] = l^2\sum_{i=1}^n\lambda_iE_\theta[\cos^2(\theta)],
\]
where $\theta$ is the angle between $\beta$ and $e_i$. Since $\beta$ points in any direction with equal probability, $\theta$ is uniformly distributed on $[0, \pi]$, so $E_\theta[\cos^2(\theta)] = \frac{1}{2}$ and:
\[
E_\beta\big[\beta^T\operatorname{Cov}_Z(V \mid x, \Omega)\beta\big] = \frac{l^2}{2}\sum_{i=1}^n\lambda_i = \frac{l^2}{2}\operatorname{Trace}\big(\operatorname{Cov}_Z(V \mid x, \Omega)\big).
\]
Proof of Lemma A.3.
Let $p(u)$ be the pdf of $U$. Define $f(u)$ as follows:
\[
f(u) = p(U = u, \Omega).
\]
Then:
\[
p(u \mid \Omega) = \frac{p(U = u, \Omega)}{P(\Omega)} = \frac{f(u)}{1-\alpha}.
\]
Since $P(\Omega) = 1 - \alpha$ we have:
\[
\int_uf(u)\,du = 1 - \alpha.
\]
We have:
\[
\operatorname{Trace}\big(\operatorname{Cov}(U \mid \Omega)\big) = \operatorname{Trace}\big(E[UU^T \mid \Omega]\big) = E\big[\operatorname{Trace}(UU^T) \mid \Omega\big] = E[U^TU \mid \Omega] = \int_uu^Tu\,\frac{f(u)}{1-\alpha}\,du.
\]
We want to minimize $\operatorname{Trace}(\operatorname{Cov}(U \mid \Omega))$:
\[
\int_uu^Tu\,\frac{f(u)}{1-\alpha}\,du
\]
subject to:
\[
0 \le f(u) \le p(u)\ \ \forall u, \qquad \int_uf(u)\,du = 1 - \alpha.
\]
This is accomplished by maximizing $f(u)$, i.e., setting $f(u) = p(u)$, for the smallest values of $u^Tu$, which is equivalent to setting $\Omega$ to be the event $\|U\|^2 < u_\alpha$.
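As with the scalar case, the multivariate truncation step can be illustrated numerically (distribution and constants are arbitrary): among events of probability $1-\alpha$, the centered ball minimizes $E[U^TU \mid \Omega]$.

```python
# Monte Carlo illustration of Lemma A.3: the ball {||U||^2 < u_alpha}
# minimizes E[U^T U | Omega] among events of probability 1 - alpha.
import numpy as np

rng = np.random.default_rng(5)
alpha, d = 0.3, 3
u = rng.normal(size=(500_000, d))
sq = (u**2).sum(axis=1)                       # ||U||^2 for each draw

ball = sq < np.quantile(sq, 1 - alpha)        # the event from the proof
other = rng.random(len(u)) < 1 - alpha        # same probability, ignores U

print(sq[ball].mean())    # smallest achievable value
print(sq[other].mean())   # larger (about d, the unconditional mean)
```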
Proof of Theorem A.1.
Let $\eta := 1 - P(\rho = 1 \mid x)$; then $\eta \le \alpha$. Let $v_\eta$ be such that $P(\|V\|^2 < v_\eta \mid x) = 1 - \eta$, and $v_\alpha$ such that $P(\|V\|^2 < v_\alpha \mid x) = 1 - \alpha$. From Lemma A.1:
\begin{align}
E_\beta\operatorname{var}_{S_Y,Z_\rho}(\hat\tau_{TY} \mid x) &= \frac{4}{n^2}E_\beta\big[\beta^T\operatorname{Cov}(V \mid x, \rho = 1)\beta\big] + \frac{6}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2 \tag{8} \\
&= \frac{4}{n^2}\cdot\frac{l^2}{2}\operatorname{Trace}\big(\operatorname{Cov}(V \mid x, \rho = 1)\big) + \frac{6}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2 \tag{9} \\
&\ge \frac{4}{n^2}\cdot\frac{l^2}{2}\operatorname{Trace}\big(\operatorname{Cov}(V \mid x, \|V\|^2 < v_\eta)\big) + \frac{6}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2 \tag{10} \\
&\ge \frac{4}{n^2}\cdot\frac{l^2}{2}\operatorname{Trace}\big(\operatorname{Cov}(V \mid x, \|V\|^2 < v_\alpha)\big) + \frac{6}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2 \quad\text{because } v_\eta \ge v_\alpha \tag{11} \\
&\ge \frac{4}{n^2}\cdot\frac{l^2}{2}\operatorname{Trace}\big(\operatorname{Cov}(V \mid x, \phi_{T'}^\alpha = 1)\big) + \frac{6}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2 \tag{12} \\
&\ge \frac{4}{n^2}E_\beta\big[\beta^T\operatorname{Cov}(V \mid x, \phi_{T'}^\alpha = 1)\beta\big] + \frac{6}{n^2}\sigma_E^2\sum_{i=1}^nw_i^2 \tag{13} \\
&\ge E_\beta\operatorname{var}_{S_Y,Z_{\phi_{T'}^\alpha}}(\hat\tau_{TY} \mid x) \tag{14}
\end{align}
Proof of Corollary A.1.
Let $\rho$ be the constant function $\rho(x, Z) = 1$ for all $x, Z$. Then:
\[
\operatorname{var}_{S_Y,Z_\rho}(\hat\tau_{TY} \mid x) = \operatorname{var}_{S_Y,Z}(\hat\tau_{TY} \mid x).
\]
From Theorem A.1 we have:
\[
E_\beta\operatorname{var}_{S_Y,Z_{\phi_{T'}^\alpha}}(\hat\tau_{TY} \mid x) \le E_\beta\operatorname{var}_{S_Y,Z_\rho}(\hat\tau_{TY} \mid x) = E_\beta\operatorname{var}_{S_Y,Z}(\hat\tau_{TY} \mid x).
\]