Designing Experiments Informed by Observational Studies
Evan Rosenman and Art Owen

February 23, 2021
Abstract
The increasing availability of passively observed data has yielded a growing methodological interest in "data fusion." These methods involve merging data from observational and experimental sources to draw causal conclusions, and they typically require a precarious tradeoff between the unknown bias in the observational dataset and the often-large variance in the experimental dataset. We propose an alternative approach to leveraging observational data, which avoids this tradeoff: rather than using observational data for inference, we use it to design a more efficient experiment. We consider the case of a stratified experiment with a binary outcome, and suppose pilot estimates for the stratum potential outcome variances can be obtained from the observational study. We extend results from Zhao et al. (2019) in order to generate confidence sets for these variances, while accounting for the possibility of unmeasured confounding. Then, we pose the experimental design problem as one of regret minimization, subject to the constraints imposed by our confidence sets. We show that this problem can be converted into a convex minimization and solved using conventional methods. Lastly, we demonstrate the practical utility of our methods using data from the Women's Health Initiative.
1 Introduction

The past half-century of causal inference research has engendered a healthy skepticism toward observational data (Imbens and Rubin, 2015). In observational data sets, researchers do not control whether or not each individual receives a treatment of interest. Hence, they cannot be certain that treated individuals and untreated individuals are otherwise comparable. This challenge can be overcome only if the covariates measured in the observational data are sufficiently rich to fully explain who receives the treatment and who does not. This is a fundamentally untestable assumption, and even if it holds, careful modeling is necessary to remove the selection effect. The applied literature includes myriad examples of treatments that showed promise in observational studies only to be overturned by later randomized trials (Hartman et al., 2015). One prominent case, the effect of hormone therapy on the health of postmenopausal women, will be discussed in this manuscript (Writing Group for the Women's Health Initiative Investigators, 2002).

The "virtuous" counterpart to observational data is the well-designed experiment. Data from a randomized trial yield unbiased estimates of a causal effect without the need for problematic statistical assumptions. Yet experiments are frequently expensive and, as a consequence, generally involve fewer units. Especially if one is interested in subgroup causal effects, this means experimental estimates can be imprecise.

In this paper, we discuss an approach that allows us to leverage the availability of observational data while retaining the attractive unbiasedness properties of randomized experiments: we use the observational data not for inference, but rather to influence the design of the experiment. Our discussion will be limited to settings with binary outcomes, in which computations are tractable. We suppose the experiment has a stratified design, and seek to determine allocations of units to strata and treatment assignments.

Suppose pilot estimates of the stratum potential outcome variances are obtained from the observational study. If the outcomes are binary, we show that recent advances in sensitivity analysis from Zhao et al. (2019) can be extended to generate confidence sets for these variances, while incorporating the possibility of unmeasured confounding. Next, we pose the experimental design problem as one of regret minimization subject to the potential outcome variances lying within their confidence sets. We use a trick from von Neumann to convert the problem into a convex (though non-DCP) minimization, which can be solved using projected gradient descent. This approach can yield modest efficiency gains in the experiment, especially if there is heterogeneity in treatment effects and baseline incidence rates across strata.

The remainder of the paper proceeds as follows. Section 2 defines our notation, assumptions, and loss function. Section 3 gives our main results. These include the derivation of bias-aware confidence sets for the pilot variance estimates; the formulation of the design problem as a regret minimization; and the strategy to convert that problem into a computationally tractable one. We demonstrate the practical utility of our methods on data from the Women's Health Initiative in Section 4. Section 5 discusses future work and concludes.

2 Notation, Assumptions, and Loss Function

We suppose we have access to an observational study with units $i$ in indexing set $\mathcal{O}$ such that $|\mathcal{O}| = n_o$.
We associate with each unit $i \in \mathcal{O}$ a pair of unseen potential outcomes $(Y_i(0), Y_i(1))$; an observed covariate vector $X_i \in \mathbb{R}^p$; and a propensity score $p_i \in (0, 1)$ denoting the probability of receiving treatment. We also associate with each $i$ a treatment indicator $W_i$ and an observed outcome defined by $Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)$.

There are multiple perspectives on randomness in causal inference. In the setting of Rubin (1974), as in much of the early potential outcomes literature, all quantities are treated as fixed except the treatment assignment $W_i$. More modern approaches sometimes treat the potential outcomes $Y_i(0)$ and $Y_i(1)$ and covariates $X_i$ as random variables (see e.g. VanderWeele and Robins, 2012). Similarly, some authors treat all of the data elements (including the treatment assignment $W_i$) as random draws from a super-population (see e.g. Imbens and Rubin, 2015). Per the discussion in Chin (2019), these subtleties often have little effect on the choice of estimators, but they do affect the population to which results can be generalized.

In our setting, we assume that the RCT data has not yet been collected, so it does not make sense to talk about their fixed potential outcomes. More naturally, we treat the potential outcomes and covariates as random. Thus, we view $(X_i, Y_i(0), Y_i(1))$ as drawn from a joint distribution $F_O$. The RCT data will be denoted (with a slight abuse of notation) as $(Y_i(0), Y_i(1), X_i)$ for $i \in \mathcal{R}$, sampled from a joint distribution $F_R$. Because we are treating the potential outcomes as random variables, we can reason about their means and variances under the distribution $F_R$. We will make the following assumptions about allocation to treatment.
Assumption 1 (Allocations to Treatment). For $i \in \mathcal{O}$, $W_i \sim \mathrm{Bern}(p_i)$. For $i \in \mathcal{R}$, treatment is allocated via a simple random sample of size $n_{rkt}$ within stratum $k$, for $k = 1, \ldots, K$.

We suppose we have a fixed stratification scheme based on the covariates $X_i$. This can be derived from substantive knowledge or from applying a modern machine learning algorithm on the observational study to uncover treatment heterogeneity (e.g. Wager and Athey, 2018; Hill, 2011). The stratification is such that there are $K$ strata, indexed $k = 1, \ldots, K$, each with an associated population weight $w_1, \ldots, w_K$. Using the stratification on the observational study, we define indexing subsets $\mathcal{O}_k$ with cardinalities $n_{ok}$ to identify units in each stratum. For each stratum, define $I_k$ as the set of covariate values defining the stratum, such that $X_i \in I_k \implies i \in \mathcal{O}_k$.

Suppose we can recruit only $n_r$ total units for the RCT. We need to decide both the number of units $n_{rk}$ recruited for each stratum, subject to the constraint $\sum_k n_{rk} = n_r$, and the count of units we will assign to treatment vs. control in each stratum, such that the associated counts $n_{rkt}$ and $n_{rkc}$ sum to $n_{rk}$. Hence, our variables of interest will be $\{(n_{rkt}, n_{rkc})\}_{k=1}^K$.

Define $\mathbb{E}_R$, $\mathrm{Var}_R$, $\mathbb{E}_O$, and $\mathrm{Var}_O$ as expectations and variances under the distributions $F_R$ and $F_O$, respectively. We will need two further assumptions.

Assumption 2 (Common Potential Outcome Means). Conditional on the stratum, the potential outcome means for the two populations are equal. In other words,
$$\mathbb{E}_R(Y_i(0) \mid X_i \in I_k) = \mathbb{E}_O(Y_i(0) \mid X_i \in I_k) \quad \text{and} \quad \mathbb{E}_R(Y_i(1) \mid X_i \in I_k) = \mathbb{E}_O(Y_i(1) \mid X_i \in I_k)$$
for all $k \in 1, \ldots, K$. We denote these shared quantities as $\mu_k(0)$ and $\mu_k(1)$.

Assumption 3 (Common Potential Outcome Variances). Conditional on the stratum, the potential outcome variances for the two populations are equal. In other words,
$$\mathrm{Var}_R(Y_i(0) \mid X_i \in I_k) = \mathrm{Var}_O(Y_i(0) \mid X_i \in I_k) \quad \text{and} \quad \mathrm{Var}_R(Y_i(1) \mid X_i \in I_k) = \mathrm{Var}_O(Y_i(1) \mid X_i \in I_k)$$
for all $k \in 1, \ldots, K$. We denote these shared quantities as $\sigma^2_k(0)$ and $\sigma^2_k(1)$.

Given Assumption 2, we can define a mean effect,
$$\tau_k = \mathbb{E}_R(Y_i(1) - Y_i(0) \mid X_i \in I_k) = \mathbb{E}_O(Y_i(1) - Y_i(0) \mid X_i \in I_k) = \mu_k(1) - \mu_k(0)$$
for each $k \in 1, \ldots, K$. We can collect these values into a vector $\boldsymbol{\tau}$. Denote the associated causal estimates derived from the RCT as $\hat\tau_{rk}$ for $k = 1, \ldots, K$. We can collect these estimates into a vector $\hat{\boldsymbol{\tau}}_r$. We use a weighted $L^2$ loss when estimating the causal effects across strata,
$$L(\boldsymbol{\tau}, \hat{\boldsymbol{\tau}}_r) = \sum_k w_k (\hat\tau_k - \tau_k)^2.$$
Our goal will be to minimize the risk, defined as an expectation of the loss over both the treatment assignments and the potential outcomes. For simplicity, we suppress the subscript and write
$$R(\boldsymbol{\tau}, \hat{\boldsymbol{\tau}}_r) = \mathbb{E}\left(\sum_k w_k (\hat\tau_k - \tau_k)^2\right) = \sum_k w_k \left(\frac{\sigma^2_k(1)}{n_{rkt}} + \frac{\sigma^2_k(0)}{n_{rkc}}\right).$$
Were $(\sigma^2_k(1), \sigma^2_k(0))_{k=1}^K$ known exactly, it would be straightforward to compute optimal allocations in the RCT. The optimal choice from minimizing this quantity is simply
$$n_{rkt} = \frac{n_r \sqrt{w_k}\,\sigma_k(1)}{\sum_j \sqrt{w_j}\,(\sigma_j(1) + \sigma_j(0))}, \qquad n_{rkc} = \frac{n_r \sqrt{w_k}\,\sigma_k(0)}{\sum_j \sqrt{w_j}\,(\sigma_j(1) + \sigma_j(0))}, \tag{1}$$
which yields a risk of
$$\frac{1}{n_r}\left(\sum_k \sqrt{w_k}\,(\sigma_k(1) + \sigma_k(0))\right)^2.$$
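For concreteness, a minimal numpy sketch of the allocation rule in Equation 1 follows; the function name and example values are ours, and in practice the real-valued counts would be rounded to integers.

```python
import numpy as np

def optimal_allocation(n_r, w, sigma1, sigma0):
    """Optimal stratum allocations from Equation 1, assuming the stratum
    potential outcome standard deviations sigma_k(1), sigma_k(0) are known.
    Returns the (real-valued) treated and control counts and the risk."""
    w, sigma1, sigma0 = map(np.asarray, (w, sigma1, sigma0))
    denom = np.sum(np.sqrt(w) * (sigma1 + sigma0))
    n_rkt = n_r * np.sqrt(w) * sigma1 / denom   # treated units per stratum
    n_rkc = n_r * np.sqrt(w) * sigma0 / denom   # control units per stratum
    risk = np.sum(np.sqrt(w) * (sigma1 + sigma0)) ** 2 / n_r
    return n_rkt, n_rkc, risk

# Example: two strata; the higher-variance stratum receives more units.
n_t, n_c, risk = optimal_allocation(1000, [0.5, 0.5], [0.5, 0.2], [0.4, 0.2])
```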
Assumption 3 guarantees shared variances across the observational and RCT datasets. So we might be tempted to obtain pilot estimates of $\sigma^2_k(1)$ and $\sigma^2_k(0)$ from the observational study and then to plug them in to determine the allocation of units in the RCT. However, any estimate of the variances derived from the observational study should be treated with caution. Our assumptions do not preclude the possibility of unmeasured confounding, which can introduce substantial bias into the pilot estimation step. Hence, a framework that exclusively optimizes expected loss is incongruent with what we know about sources of uncertainty.

3 Main Results

Decision theory provides an attractive framework in the form of regret minimization, originally attributed to Bell (1982), as well as Loomes and Sugden (1982). In this framework, a decision-maker chooses between multiple prospects, and cares not only about the received payoff but also about the foregone choice. If the foregone choice would have yielded a higher payoff than the chosen one, the decision-maker experiences regret (Diecidue and Somasundaram, 2017). Decisions are made to minimize the maximum possible regret.

In our case, the decision is in how to allocate units in our RCT. One choice is an allocation informed by the observational study. The other is a "default" allocation against which we seek to compare. Denote the default values as $\tilde n_{rkt}$ and $\tilde n_{rkc}$, where a common choice would be equal allocation, $\tilde n_{rkt} = \tilde n_{rkc} = n_r/(2K)$ for all $k$; or weighted allocation, $\tilde n_{rkt} = \tilde n_{rkc} = w_k n_r / 2$ for all $k$.

Regret is defined as the difference between the risk of our chosen allocation and the default allocation,
$$\mathrm{Regret}\left(\{n_{rkt}, n_{rkc}\}_{k=1}^K\right) = \sum_k w_k \left(\sigma^2_k(1)\left(\frac{1}{n_{rkt}} - \frac{1}{\tilde n_{rkt}}\right) + \sigma^2_k(0)\left(\frac{1}{n_{rkc}} - \frac{1}{\tilde n_{rkc}}\right)\right).$$
Choosing this as our objective, we can now begin to formulate an optimization problem. Suppose we can capture our uncertainty about $(\sigma^2_k(1), \sigma^2_k(0))$ via a convex constraint, indexed by a user-defined parameter $\Gamma$,
$$(\sigma^2_k(1), \sigma^2_k(0)) \in A^{(\Gamma)}_k, \quad k = 1, \ldots, K,$$
where $A^{(\Gamma)}_k \subset \mathbb{R}^2$. We could then obtain the regret-minimizing unit allocations as the solution to
$$\min_{n_{rkt},\, n_{rkc}} \; \max_{\sigma^2_k(1),\, \sigma^2_k(0)} \; \sum_k w_k \left(\sigma^2_k(1)\left(\frac{1}{n_{rkt}} - \frac{1}{\tilde n_{rkt}}\right) + \sigma^2_k(0)\left(\frac{1}{n_{rkc}} - \frac{1}{\tilde n_{rkc}}\right)\right)$$
$$\text{subject to} \quad (\sigma^2_k(1), \sigma^2_k(0)) \in A^{(\Gamma)}_k, \quad k = 1, \ldots, K, \qquad \sum_k (n_{rkt} + n_{rkc}) = n_r. \tag{2}$$
Defining and solving Optimization Problem 2 will be the goal of the remainder of this paper.

To construct our confidence regions $A_k$, $k = 1, \ldots, K$, we will extend recent sensitivity analysis results from Zhao et al. (2019). The authors consider the case of causal estimation via stabilized inverse probability of treatment weighting (SIPW). Zhao and co-authors focus on observational studies, and consider the case where unmeasured confounding is present. To quantify this confounding, they propose a marginal sensitivity model indexed by a quantity $\Gamma$, which bounds the odds ratio between the true treatment probability (a function of the covariates and the potential outcomes) and the treatment probability marginalized over the potential outcomes (a function of the covariates only). Their method extends the widely-used Rosenbaum sensitivity model (Rosenbaum, 1987). The authors' focus is on developing valid confidence intervals for the average treatment effect even when $\Gamma$-level confounding may be present.
They offer two key insights. First, they demonstrate that for any choice of $\Gamma$, one can efficiently compute upper and lower bounds on the true potential outcome means via linear fractional programming. These bounds, referred to as the "partially identified region," quantify the possible bias in the point estimate of the ATE. Second, the authors show that the bootstrap is valid in this setting. Hence, they propose drawing repeated bootstrap replicates; computing extrema within each replicate using their linear fractional programming approach; and then taking the relevant $\alpha$-level quantiles of these extrema. This procedure yields a valid $\alpha$-level confidence region for the ATE.

We adapt this approach to our setting in the case of binary outcomes. Note that if $Y_i \in \{0, 1\}$, then potential outcome variances can be expressed directly as a function of potential outcome means, via
$$\sigma^2_k(1) = \mu_k(1) \cdot (1 - \mu_k(1)) \quad \text{and} \quad \sigma^2_k(0) = \mu_k(0) \cdot (1 - \mu_k(0)).$$
As Zhao et al. (2019) provides the necessary machinery to bound mean estimates, we can exploit this relationship between the means and variances to bound variance estimates. In particular, we can show that the bootstrap is also valid if our estimand is $\mu_k(e) \cdot (1 - \mu_k(e))$, rather than $\mu_k(e)$, for $e \in \{0, 1\}$ and $k = 1, \ldots, K$. Computing the extrema is also straightforward. Note that the function $f(x) = x \cdot (1 - x)$ is monotonically increasing in $x$ if $0 < x < 0.5$ and monotonically decreasing in $x$ if $0.5 < x < 1$. Hence, if we use the Zhao et al. (2019) method to solve for a partially identified region for $\mu_k(1)$ and $\mu_k(0)$, we can equivalently compute such intervals for $\sigma^2_k(1)$ and $\sigma^2_k(0)$.

Denote as $\hat\mu^U_k(e)$ the upper bound and $\hat\mu^L_k(e)$ the lower bound computed for a mean, for $e \in \{0, 1\}$. Denote $(\hat\sigma^2_k(e))^U$ and $(\hat\sigma^2_k(e))^L$ as the analogous quantities for the variance. We apply the following logic (implemented in the short sketch at the end of this section):

• If $\hat\mu^U_k(e) \leq 0.5$, set $(\hat\sigma^2_k(e))^L = \hat\mu^L_k(e)(1 - \hat\mu^L_k(e))$ and $(\hat\sigma^2_k(e))^U = \hat\mu^U_k(e)(1 - \hat\mu^U_k(e))$.

• If $\hat\mu^L_k(e) \geq 0.5$, set $(\hat\sigma^2_k(e))^L = \hat\mu^U_k(e)(1 - \hat\mu^U_k(e))$ and $(\hat\sigma^2_k(e))^U = \hat\mu^L_k(e)(1 - \hat\mu^L_k(e))$.

• If $\hat\mu^L_k(e) < 0.5$ and $\hat\mu^U_k(e) > 0.5$, set $(\hat\sigma^2_k(e))^L = \min\left(\hat\mu^L_k(e)(1 - \hat\mu^L_k(e)),\; \hat\mu^U_k(e)(1 - \hat\mu^U_k(e))\right)$ and $(\hat\sigma^2_k(e))^U = 0.25$.

Hence, we propose the following procedure for deriving valid confidence regions for $(\sigma^2_k(0), \sigma^2_k(1))$ for each choice of $k$:

1. Draw $B$ bootstrap replicates from the units $i \in \mathcal{O}_k$.

2. For each replicate, compute $\hat\mu^U_k(e)$ and $\hat\mu^L_k(e)$ for $e \in \{0, 1\}$ using Zhao and co-authors' linear fractional programming approach, and determine $(\hat\sigma^2_k(e))^U$ and $(\hat\sigma^2_k(e))^L$ for $e \in \{0, 1\}$ using the approach described above.

3. Each replicate can now be represented as a rectangle in $[0, 0.25] \times [0, 0.25]$, where one axis gives the value of $\hat\sigma^2_k(1)$ and the other the value of $\hat\sigma^2_k(0)$, and the vertices correspond to the extrema. Any set such that a $1 - \alpha$ proportion of the rectangles have all four corners included in the set will asymptotically form a valid $\alpha$-level confidence region.

A full proof of the validity of this method can be found in Appendix B. Note that the final step does not specify the shape of the confidence set (it need not even be convex). For simplicity, we compute the minimum volume ellipsoid containing all vertices, then shrink the ellipsoid toward its center until only $B \cdot (1 - \alpha)$ of the rectangles have all four of their vertices included. For details on constructing the ellipsoids (sometimes known as Löwner-John ellipsoids), see Boyd and Vandenberghe (2004). Observe that this is by no means the smallest valid confidence set, but it is convex and easy to work with numerically.

In Figure 1, we demonstrate this procedure on simulated data using $\Gamma = 1.5$. On the axes we plot $(\hat\sigma^2_k(0), \hat\sigma^2_k(1))$. In purple, we plot the rectangles corresponding to the extrema computed in each of 200 bootstrap replicates drawn from the data. The dashed ellipsoids represent 90% confidence sets. In the cases of strata 2 and 4, the ellipsoids extend beyond the upper bound of 0.25 in at least one direction, so we intersect the ellipsoids with the hard boundary at 0.25. The resulting final confidence sets, $A_1$, $A_2$, $A_3$, and $A_4$, are all convex.

[Figure 1: Simulated example of confidence regions in four strata under $\Gamma = 1.5$.]

The objective is convex in $n_{rkt}, n_{rkc}$ and affine (and thus concave) in $\sigma^2_k(1), \sigma^2_k(0)$. Now, having obtained convex constraints, we can invoke von Neumann's minimax theorem (Von Neumann, 1928) to switch the order of the minimization and maximization. Hence, the solution to Problem 2 is equivalent to the solution of
$$\max_{\sigma^2_k(1),\, \sigma^2_k(0)} \; \min_{n_{rkt},\, n_{rkc}} \; \sum_k w_k \left(\sigma^2_k(1)\left(\frac{1}{n_{rkt}} - \frac{1}{\tilde n_{rkt}}\right) + \sigma^2_k(0)\left(\frac{1}{n_{rkc}} - \frac{1}{\tilde n_{rkc}}\right)\right)$$
$$\text{subject to} \quad (\sigma^2_k(1), \sigma^2_k(0)) \in A^{(\Gamma)}_k, \quad k = 1, \ldots, K, \qquad \sum_k (n_{rkt} + n_{rkc}) = n_r.$$
But the inner problem has an explicit solution, given by
$$n_{rkt} = \frac{n_r \sqrt{w_k}\,\sigma_k(1)}{\sum_j \sqrt{w_j}\,(\sigma_j(1) + \sigma_j(0))}, \qquad n_{rkc} = \frac{n_r \sqrt{w_k}\,\sigma_k(0)}{\sum_j \sqrt{w_j}\,(\sigma_j(1) + \sigma_j(0))}.$$
Plugging this in yields the simplified problem
$$\max_{\sigma^2_k(1),\, \sigma^2_k(0)} \; \frac{1}{n_r}\left(\sum_k \sqrt{w_k}\,(\sigma_k(1) + \sigma_k(0))\right)^2 - \left(\sum_k w_k \left(\frac{\sigma^2_k(1)}{\tilde n_{rkt}} + \frac{\sigma^2_k(0)}{\tilde n_{rkc}}\right)\right)$$
$$\text{subject to} \quad (\sigma^2_k(1), \sigma^2_k(0)) \in A^{(\Gamma)}_k, \quad k = 1, \ldots, K. \tag{3}$$
Problem 3 is concave. See Appendix C for a detailed proof. The solution is non-trivial, owing to the fact that the problem is not DCP-compliant. Nonetheless, a simple projected gradient descent algorithm is guaranteed to converge under very mild conditions given the curvature (Iusem, 2003). Hence, we can efficiently solve this problem.
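The mean-to-variance mapping in the three cases above is mechanical; the following minimal sketch (function name ours) implements it for a single stratum and exposure level.

```python
def variance_bounds(mu_L, mu_U):
    """Map bounds [mu_L, mu_U] on a binary-outcome mean onto bounds for
    the variance mu * (1 - mu), using the monotonicity of f(x) = x(1 - x):
    increasing on (0, 0.5) and decreasing on (0.5, 1)."""
    f_L = mu_L * (1.0 - mu_L)
    f_U = mu_U * (1.0 - mu_U)
    if mu_U <= 0.5:              # f increasing over the whole interval
        return f_L, f_U
    if mu_L >= 0.5:              # f decreasing over the whole interval
        return f_U, f_L
    return min(f_L, f_U), 0.25   # interval straddles the maximum at 0.5
```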
4 Application to the Women's Health Initiative

To evaluate our methods in practice, we make use of data from the Women's Health Initiative (WHI), a study launched in 1991 of the effects of hormone therapy on postmenopausal women. The study included both a randomized controlled trial and an observational study. A total of 16,608 women were included in the trial, with half randomly selected to take 0.625 mg of estrogen and 2.5 mg of progestin, and the remainder receiving a placebo. A corresponding 53,054 women in the observational component of the WHI were deemed clinically comparable to women in the trial. About a third of these women were using estrogen plus progestin, while the remaining women in the observational study were not using hormone therapy (Prentice et al., 2005).

We investigate the effect of the treatment on incidence of coronary heart disease. The data is split into two non-overlapping subsets, which we term the "gold" and "silver" datasets. We estimate the probability of treatment for observational units via fitted propensity scores. The data split is the same as the one used in Rosenman et al. (2018). Details on the construction of these data elements can be found in Appendix Section A.2, while further details about the WHI can be found in Appendix Section A.1.

To choose our subgroups for stratification, we utilize the clinical expertise of researchers in the study's writing group. The trial protocol highlights age as an important subgroup variable to consider (Writing Group for the Women's Health Initiative Investigators, 1998), while subsequent work considered a patient's history of cardiovascular disease (Roehm, 2015). We also consider Langley scatter, a measure of solar irradiance at each woman's enrollment center, which is not plausibly related to baseline incidence or treatment effect. Langley scatter exhibits no association with the outcome in the observational control population: a Pearson's chi-squared test yields a p-value of 0.89. The analogous tests for age and history of cardiovascular disease yield p-values many orders of magnitude smaller.

The age variable has three levels, corresponding to whether a woman was in her fifties, sixties, or seventies. The cardiovascular disease history variable is binary. The Langley scatter variable has five levels, corresponding to strata between 300 and 500 Langleys of irradiance. We provide brief summaries of these variables in Tables 9, 10, and 11 in Appendix Section A.3.

The RCT gold dataset is used to estimate "gold standard" stratum causal effects. We now suppose that the observational study is being used to design an experiment of size $n_r = 1{,}000$ units.
We compare the estimates from the designed pseudo-experiments against the gold standard estimates under the unweighted $L^2$ loss.

In the design setting, we face the additional challenge of choosing the appropriate value of $\Gamma$. The WHI provides a very rich set of covariates, and our propensity model incorporates more than 50 variables spanning the demographic and clinical domains (see details in Appendix Section A.2). Hence, we will run our algorithm at a range of values of $\Gamma$ between 1 and 2.

4.2 $\Gamma = 1.5$, Fine Stratification

We show one example in detail, in which we choose $\Gamma = 1.5$. To investigate the utility of our regret-minimizing allocations, we sample pseudo-experiments of 1,000 units from the RCT silver dataset 1,000 times with replacement. We do so under three designs: equal allocation by strata; naïve allocation based on the pilot estimates; and regret-minimizing allocations. Below, we show the average $L^2$ loss when compared against the gold standard estimates derived from the RCT gold dataset. Results are shown in Figure 3. Our method yields a modest reduction in average loss (3.6%) relative to the naïve design. It also outperforms the equal design, though by a slimmer margin (1.6%). This is encouraging, especially because the design was intended to guard against worst-case loss, rather than to optimize average loss.

4.3 Performance Over Multiple Conditions

We now simulate with all possible combinations of the stratification variables. For each choice of a stratification, we select 1,000 units under equal allocation, naïve allocation, and regret-minimizing allocation at each of the values of $\Gamma$ considered above.
We then compute the $L^2$ loss versus the "gold standard" estimates derived from the RCT gold dataset.

In Table 1, we summarize the loss relative to equal allocation. We see immediately that the entries are all non-positive. This makes some intuitive sense: the objective in Problem 2 can always be set to 0 by choosing $n_{rkt} = \tilde n_{rkt}$ and $n_{rkc} = \tilde n_{rkc}$ for all $k$; hence, the algorithm is designed to guarantee that we cannot do worse than allocating equally. By the same token, many of the gains we see are modest, owing to the conservatism of the regret-minimizing approach. Notably, we seem to achieve the greatest gains when we are stratifying only on clinically relevant variables and using a relatively low value of $\Gamma$. We achieve a 5-6% risk reduction at low values of $\Gamma$ in the fourth row of the table, in which we stratify on the clinically relevant age and cardiovascular disease variables. On the other hand, the algorithm quickly defaults to recommending equal allocation when variables are not clinically relevant. In the third row, in which we stratify only on the irrelevant Langley scatter variable, the starred entries correspond to cases in which the regret-minimizing allocation is equal allocation.

[Table 1: $L^2$ loss comparisons for regret-minimizing allocations relative to equal allocation, by subgroup variable(s) and value of $\Gamma$. For starred entries, the regret-minimizing allocation defaults to equal allocation.]

In Table 2, we summarize the loss relative to naïve allocation. In this case, our method can underperform a naïve allocation derived from the observational study pilot variance estimates. This can be seen most clearly in the first row of the table, in which we stratify only on the age variable. Such underperformance is a consequence of the fact that our algorithm is defensive toward underperformance when bias and variance are present in the pilot estimates. However, there are two clear trends in the results. First, when we stratify on a variable that turns out not to be clinically relevant, like Langley scatter, the naïve allocation is essentially just recommending an allocation based on noise from the data; as a result, our regret-minimizing allocations uniformly outperform naïve allocations. Second, the regret-minimizing allocations tend to outperform the naïve allocations as the number of strata grows. We significantly outperform naïve allocation in the final row, which corresponds to stratification on all three variables and a total of 30 strata.

[Table 2: $L^2$ loss comparisons for regret-minimizing allocations relative to naïve allocation, by subgroup variable(s) and value of $\Gamma$.]

While these simulation results show modest performance gains, they are encouraging. A wise analyst would be extremely cautious about designing an RCT exclusively using observational study pilot estimates of stratum variances. Because such pilot estimates can have both bias and variance, relying too heavily upon them might waste resources. Our framework allows data from the observational study to be incorporated into the RCT design while guarding against the possibility of underperforming a default allocation.
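To make the evaluation protocol concrete, the sketch below draws one pseudo-experiment and scores it. The data layout (per-stratum outcome pools from the silver dataset) and all names are hypothetical; averaging this loss over repeated draws is what produces comparisons like those in Tables 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_experiment_loss(y1_pool, y0_pool, n_rkt, n_rkc, tau_gold, w):
    """Sample one pseudo-experiment under a given allocation and return its
    weighted L2 loss against the gold-standard stratum effects.

    y1_pool, y0_pool : lists of arrays of treated / control outcomes per
                       stratum, drawn from the RCT silver dataset
    n_rkt, n_rkc     : per-stratum treated and control sample sizes
    tau_gold         : gold-standard stratum effect estimates
    w                : stratum weights (all ones for the unweighted loss)
    """
    loss = 0.0
    for k in range(len(tau_gold)):
        t = rng.choice(y1_pool[k], size=int(n_rkt[k]), replace=True)
        c = rng.choice(y0_pool[k], size=int(n_rkc[k]), replace=True)
        loss += w[k] * ((t.mean() - c.mean()) - tau_gold[k]) ** 2
    return loss
```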
5 Discussion

We briefly discuss challenges in the more general case of $Y_i \in \mathbb{R}$. In keeping with the theme of IPW estimation, we consider estimators of the form
$$\hat\sigma^2_k(1) = \frac{\sum_{i \in \mathcal{O}_k} Y_i^2 \left(\frac{W_i}{p_i}\right)}{\sum_{i \in \mathcal{O}_k} \left(\frac{W_i}{p_i}\right)} - \left(\frac{\sum_{i \in \mathcal{O}_k} Y_i \left(\frac{W_i}{p_i}\right)}{\sum_{i \in \mathcal{O}_k} \left(\frac{W_i}{p_i}\right)}\right)^2,$$
$$\hat\sigma^2_k(0) = \frac{\sum_{i \in \mathcal{O}_k} Y_i^2 \left(\frac{1 - W_i}{1 - p_i}\right)}{\sum_{i \in \mathcal{O}_k} \left(\frac{1 - W_i}{1 - p_i}\right)} - \left(\frac{\sum_{i \in \mathcal{O}_k} Y_i \left(\frac{1 - W_i}{1 - p_i}\right)}{\sum_{i \in \mathcal{O}_k} \left(\frac{1 - W_i}{1 - p_i}\right)}\right)^2, \tag{4}$$
where $p_i$ are the true treatment probabilities. Such estimators are asymptotically unbiased.

Under the sensitivity model of Zhao et al. (2019), we suppose we estimate $p_i$ with fitted propensity scores, $\hat\pi_i$, defined as
$$\hat\pi_i = \frac{1}{1 + e^{-\hat g(X_i)}}.$$
In the typical setting in which we use logistic regression to estimate the propensity scores, $\hat g(X_i) = \hat\beta^T X_i$. We account for the possibility of $\Gamma$-level unmeasured confounding by allowing the true probability $p_i$ to satisfy
$$p_i \in \left\{\frac{1}{1 + z_i e^{-\hat g(X_i)}} \;\Big|\; \Gamma^{-1} \leq z_i \leq \Gamma\right\}.$$
Any affine transformation of our optimization variable will not change the curvature of the problem, so we redefine the problem in terms of the $v_i = p_i^{-1} = 1 + z_i e^{-\hat g(X_i)}$, an affine function of the $z_i$. We define two vectors $v_t = (v_i)_{i : W_i = 1}$ and $v_c = (v_i)_{i : W_i = 0}$, and analogously define vectors $Y_t = (Y_i)_{W_i = 1}$ and $Y_c = (Y_i)_{W_i = 0}$. Now, we can express the equations in (4) as quadratic fractional programs, e.g.
$$\hat\sigma^2_k(1) = \frac{v_t^T \Theta_t v_t}{v_t^T \mathbb{1}\mathbb{1}^T v_t}, \qquad \hat\sigma^2_k(0) = \frac{v_c^T \Theta_c v_c}{v_c^T \mathbb{1}\mathbb{1}^T v_c},$$
where $\Theta_t = Y_t^2 \mathbb{1}^T - Y_t Y_t^T$ and $\Theta_c = Y_c^2 \mathbb{1}^T - Y_c Y_c^T$, with $Y_t^2$ and $Y_c^2$ denoting elementwise squares.

We have few guarantees on the curvature of the problem: the numerators will be neither convex nor concave in the $v_e$ terms, $e \in \{0, 1\}$, as long as the vectors $\mathbb{1}$, $Y_t$, and $Y_t^2$ are linearly independent. The denominators will be convex in the $v_e$ terms. This poses a major challenge. Quadratic fractional programming problems can be solved efficiently in some special cases, but are, in general, NP-hard (Phillips, 2001). One promising avenue for future work is to apply Dinkelbach's method to transform the quadratic fractional problem into a series of quadratic programming problems (Dinkelbach, 1967). This will not immediately yield a solution because of the indefinite numerator, but it will allow us to make use of considerable recent work on new solution methods in quadratic programming (see e.g. Park and Boyd, 2017).

References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723.

Bell, D. E. (1982). Regret in decision making under uncertainty. Operations Research, 30(5):961-981.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Chin, A. (2019). Modern statistical approaches for randomized experiments under interference. PhD thesis, Stanford University.

Diecidue, E. and Somasundaram, J. (2017). Regret theory: A new foundation. Journal of Economic Theory, 172:88-119.

Dinkelbach, W. (1967). On nonlinear fractional programming. Management Science, 13(7):492-498.

Graziano, A. M. and Raulin, M. L. (1993). Research Methods: A Process of Inquiry. HarperCollins College Publishers.

Hartman, E., Grieve, R., Ramsahai, R., and Sekhon, J. S. (2015). From SATE to PATT: Combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(3):757-778.

Hays, J., Hunt, J. R., Hubbell, F. A., Anderson, G. L., Limacher, M., Allen, C., and Rossouw, J. E. (2003). The Women's Health Initiative recruitment methods and results. Annals of Epidemiology, 13(9):S18-S77.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217-240.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY, USA.

Iusem, A. N. (2003). On the convergence properties of the projected gradient method for convex optimization. Computational & Applied Mathematics, 22(1):37-52.

Loomes, G. and Sugden, R. (1982). Regret theory: An alternative theory of rational choice under uncertainty. The Economic Journal, 92(368):805-824.

Park, J. and Boyd, S. (2017). General heuristics for nonconvex quadratically constrained quadratic programming. arXiv preprint arXiv:1703.07870.

Phillips, A. T. (2001). Quadratic fractional programming: Dinkelbach's method. In Encyclopedia of Optimization, volume 4.

Prentice, R. L., Langer, R., Stefanick, M. L., Howard, B. V., Pettinger, M., Anderson, G., Barad, D., Curb, J. D., Kotchen, J., Kuller, L., et al. (2005). Combined postmenopausal hormone therapy and cardiovascular disease: Toward resolving the discrepancy between observational studies and the Women's Health Initiative clinical trial. American Journal of Epidemiology, 162(5):404-414.

Roehm, E. (2015). A reappraisal of Women's Health Initiative estrogen-alone trial: Long-term outcomes in women 50-59 years of age. Obstetrics and Gynecology International, 2015.

Rosenbaum, P. R. (1987). Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74(1):13-26.

Rosenbaum, P. R. (2009). Design of Observational Studies. Springer Series in Statistics. Springer, New York.

Rosenman, E., Owen, A. B., Baiocchi, M., and Banack, H. (2018). Propensity score methods for merging observational and experimental datasets. arXiv preprint arXiv:1804.07863.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688-701.

Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619-1637.

Van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.

VanderWeele, T. J. and Robins, J. M. (2012). Stochastic counterfactuals and stochastic sufficient causes. Statistica Sinica, 22(1):379-392.

Von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295-320.

Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228-1242.

Weyl, H. (1912). The asymptotic distribution law for the eigenvalues of linear partial differential equations (with applications to the theory of black body radiation). Mathematische Annalen, 71(4):441-479.

Writing Group for the Women's Health Initiative Investigators (1998). Design of the Women's Health Initiative clinical trial and observational study. Controlled Clinical Trials, 19(1):61-109.

Writing Group for the Women's Health Initiative Investigators (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women's Health Initiative randomized controlled trial. Journal of the American Medical Association, 288(3):321-333.

Zhao, Q., Small, D. S., and Bhattacharya, B. B. (2019). Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4):735-761.
A Appendix
A.1 Further Details about the Women’s Health Initiative
In this section we provide further details about the Women's Health Initiative data, which we use to evaluate our estimators of the effect of hormone therapy on coronary heart disease. The Women's Health Initiative is a study of postmenopausal women in the United States, consisting of randomized controlled trial and observational study components with 161,808 total women enrolled (Prentice et al., 2005). Eligibility and recruitment data for the WHI can be found in the early results papers (Hays et al., 2003; Writing Group for the Women's Health Initiative Investigators, 2002). Participants were women between 50 and 79 years old at baseline, who had a predicted survival of at least three years and were unlikely to leave their current geographic area for three years.

Women with a uterus who met various safety, adherence, and retention criteria were eligible for a combined hormone therapy trial. A total of 16,608 women were included in the trial, with 8,506 women randomized to take 0.625 milligrams of estrogen and 2.5 milligrams of progestin, and the remainder receiving a placebo. A corresponding 53,054 women in the observational component of the Women's Health Initiative had an intact uterus and were not using unopposed estrogen at baseline, thus rendering them clinically comparable (Prentice et al., 2005). About a third of these women were using estrogen plus progestin, while the remaining women in the observational study were not using hormone therapy (Prentice et al., 2005).

Participants received semiannual contacts and annual in-clinic visits for the collection of information about outcomes. Disease events, including CHD, were first self-reported and later adjudicated by physicians. We focus on outcomes during the initial phase of the study, which extended for an average of 8.16 years of follow-up in the randomized controlled trial and 7.96 years in the observational study.

The overall rate of coronary heart disease in the trial was 3.7% in the treated group (314 cases among 8,472 women reporting) versus 3.3% (269 cases among 8,065 women reporting) for women not randomized to estrogen and progestin. In the observational study, the corresponding rates were 1.6% among treated women (706 out of 17,457 women reporting) and 3.1% among control women (1,108 out of 35,408 women reporting). Our methodology compares means and not survival curves. In the initial follow-up period, death rates were relatively low in both the observational study (6.4%) and the randomized trial (5.7%). Hence, we do not correct for the possibility of these deaths censoring coronary heart disease events.

A.2 Propensity Score Construction, Covariate Balance, and Gold Standard Effects
The Women's Health Initiative researchers collected a rich set of covariates about the participants in the study. For the purposes of computational speed, we narrow to a set of 684 variables, spanning demographics, medical history, diet, physical measurements, and psychosocial data collected at baseline.

The most meaningful measure of covariate imbalance can be found by looking at clinically relevant factors. Prentice et al. (2005) identified factors that are correlated with CHD. They found that hormone therapy users in the observational study were more likely to be Caucasian or Asian/Pacific Islander, less likely to be overweight, and more likely to have a college degree. These imbalances strongly suggest that applying a naïve differencing estimate to the observational data will yield an unfairly rosy view of the effect of hormone therapy on CHD.

To generate our estimators for this dataset, we need a propensity model $e(x)$ to map the observed covariates to an estimated probability of receiving the treatment in the observational study. We used a logistic regression to generate an expressive model while limiting overfit. A forward stepping algorithm was first applied to the observational dataset to put an ordering on the variables. All 684 baseline covariates were provided as candidates to a logistic regression predicting the treatment indicator, and variables were automatically added, one at a time, based on which addition most reduced Akaike's Information Criterion (Akaike, 1974).

Using this ordering, models containing from one to 120 variables were generated. Model fit was assessed via the area under the Receiver Operating Characteristic curve. At each model size, the area under the curve was computed first for the nominal model and then computed again using a ten-fold cross-validation. This procedure generated the curves seen in Figure 4. Notably, we observe that the predictive power rises rapidly with the addition of the first twenty variables to the logistic regression model, but slows dramatically thereafter. There is also very little evidence of overfit, as the nominal area under the curve only very slightly outpaces the cross-validated area under the curve, even in models with 100 or more variables. This is likely a consequence of the large number of observations in the observational dataset.

We next applied a heuristic threshold, selecting the largest model such that the most recent variable addition increased the cross-validated area under the curve by at least one basis point (0.01%). This yields a model containing 53 variables, with area under the receiver operating characteristic curve of 82.49%, or about 1% lower than a model containing all 684 covariates. As our goal is to obtain an association between $e(x_i)$ and $W_i$, and additional variables beyond the 53rd do not materially improve this association, omission of the additional variables seems warranted.

Matching on the propensity score should reduce imbalances on clinically relevant covariates. To evaluate this effect, we use standardized differences (as advocated by Rosenbaum (2009)). Let $\bar x_{tj}$ and $\bar x_{cj}$ be the treated and control group averages for continuous covariate $j$ in the observational database before matching, and let $\hat\sigma^2_{tj}$ and $\hat\sigma^2_{cj}$ be the sample variances within those two groups. Let $\bar x_{tjk}$ and $\bar x_{cjk}$ be those averages taken over subjects $i \in \mathcal{O}_k$ and define post-stratification averages as $\tilde x_{tj} = \sum_k n_{ok} \bar x_{tjk} / n_o$ and $\tilde x_{cj} = \sum_k n_{ok} \bar x_{cjk} / n_o$. These are weighted averages of $x_{ij}$ with greater weight put on observations from treatment conditions that are underrepresented in their own strata. Rosenbaum's standardized differences for the original and reweighted data are
$$\mathrm{SD}_j = \frac{\bar x_{tj} - \bar x_{cj}}{\sqrt{(\hat\sigma^2_{tj} + \hat\sigma^2_{cj})/2}} \quad \text{and} \quad \widetilde{\mathrm{SD}}_j = \frac{\tilde x_{tj} - \tilde x_{cj}}{\sqrt{(\hat\sigma^2_{tj} + \hat\sigma^2_{cj})/2}},$$
respectively. These quantities measure the practical significance of the imbalance between groups, unlike $t$-statistics, which have a standard error in the denominator. Note that Rosenbaum uses the same denominator in both weighted and unweighted standardized differences.

We considered ten equal-width propensity score strata to evaluate the standardized differences between treated and control on risk factors listed in Prentice et al. (2005), before and after adjusting for the propensity score. With the exception of the physical functioning score, all of these covariates were included in the propensity model. Imbalance measures for the continuous covariates can be found in Table 3. As we can see, the stratification procedure reduces all standardized differences to less than 0.05 in absolute value, representing very good matches between the populations.

For categorical variables, the stratification procedure similarly reweights individual women, such that the effective proportion of women in each category changes after stratifying on the propensity score. Standardized differences can also be computed for categorical variables, using the procedure described in Graziano and Raulin (1993). We achieve similar balance on two significant categorical variables, ethnicity and smoking status, in Tables 4 and 5.

Table 3: Standardized differences (SD) between treated and control populations in the observational dataset, before and after stratification on the propensity score, for clinical risk factors for coronary heart disease.
[Table 3 contents: recoverable post-stratification means are 63.06 vs. 63.33 (age), 26.71 vs. 26.62 (BMI), 81.15 vs. 81.23 (physical functioning), and 50.35 vs. 50.33 (age at menopause), with all post-stratification standardized differences below 0.05 in absolute value.]

Table 4: Standardized differences (SD) between treated and control populations in the observational database, before and after stratification on the propensity score, for ethnicity category.

                             White   Black   Latino   AAPI   Native American   Missing/Other    SD
Before stratifying  Treated  89.0%   2.7%    2.9%     4.0%   0.2%              1.1%             0.26
                    Control  83.1%   8.1%    3.9%     2.8%   0.4%              1.5%
After stratifying   Treated  83.4%   6.9%    4.3%     3.6%   0.5%              1.4%             0.05
                    Control  84.8%   6.4%    3.6%     3.4%   0.4%              1.4%

Table 5: Standardized differences (SD) between treated and control populations in the observational database, before and after stratification on the propensity score, for smoking category.

                             Never Smoked   Past Smoker   Current Smoker    SD
Before stratifying  Treated  48.7%          46.2%         5.1%              0.11
                    Control  52.3%          41.1%         6.6%
After stratifying   Treated  50.9%          42.5%         6.6%              0.01
                    Control  51.0%          42.7%         6.3%
Lastly, consider estimation of the "gold standard" causal effect. We randomly partition the randomized trial data into two subsets of equal size, such that each contains the same number of treated and control women. We select one of these subsets and refer to it as our "gold" dataset, to be used for estimating the true causal effect. The remaining subset is referred to as the "silver" dataset, and is used for evaluating our estimators.

Because of the randomization, we find that treated and control are already well balanced on the coronary heart disease risk factors in the gold dataset, as summarized in Tables 6, 7, and 8.

[Table 6: Standardized differences (SD) between treated and control populations in the RCT gold dataset, for clinical risk factors for coronary heart disease (age, BMI, physical functioning, age at menopause).]

[Table 7: Standardized differences (SD) between treated and control populations in the RCT gold dataset, by ethnicity category.]

[Table 8: Standardized differences (SD) between treated and control populations in the RCT gold dataset, by smoking category.]
A.3 Stratification Variable Distributions
In Tables 9, 10, and 11, we provide the distributions for the variables with which we stratify in the main text.
[Table 9: Distribution of the age variable (levels 50-59, 60-69, 70-79) in the observational study, the RCT, and the RCT "silver" dataset.]

[Table 10: Distribution of the history of cardiovascular disease variable (yes/no) in the observational study, the RCT, and the RCT "silver" dataset.]

[Table 11: Distribution of the Langley scatter variable (g-cal/cm²; five strata between 300 and 500, the first being 300-325) in the observational study, the RCT, and the RCT "silver" dataset.]

B Proof of Validity of Confidence Regions
We hew closely to the proofs provided in Zhao et al. (2019). Their primary proofs consider the missing data problem, which is equivalent to estimating the mean of either of the potential outcomes. We begin by providing details of their proof and then show how it can be extended to our case.
B.1 Review of Proof in Zhao et al. (2019)
The authors define $e(x, y)$ as the probability of treatment given covariates $X = x \in \mathcal{X}$ and outcome $Y = y \in \mathbb{R}$, and compare it against the marginal treatment probability $e(x)$. They use $A$ rather than $W$ to denote a treatment indicator, so in keeping with their notation:
$$e(x, y) = \mathbb{P}(A = 1 \mid X = x, Y = y) \quad \text{and} \quad e(x) = \mathbb{P}(A = 1 \mid X = x).$$
Then, for any choice of $\Lambda > 1$, they define a collection of sensitivity models
$$\mathcal{E}(\Lambda) = \left\{0 \leq e(x, y) \leq 1 \;\Big|\; \Lambda^{-1} \leq \mathrm{OR}(e(x, y), e(x)) \leq \Lambda, \text{ for all } x, y\right\},$$
where $\mathrm{OR}(p_1, p_2) = [p_1/(1 - p_1)]/[p_2/(1 - p_2)]$ is the odds ratio. This model was originally introduced by Tan (2006). Per their Proposition 7.1, it is related to the widely used Rosenbaum sensitivity model. In keeping with that model, we use $\Gamma$ rather than $\Lambda$ to denote our sensitivity parameter in the text, but retain the notation $\Lambda$ throughout this proof.

Via their Remark 3.2, Zhao and co-authors reparameterize the problem such that each model in $\mathcal{E}(\Lambda)$ corresponds to a choice of $h(x, y)$, the logit-scale difference of the observed probability $e(x)$ and the complete data selection probability $e(x, y)$. So we can alternatively write:
$$\mathcal{E}(\Lambda) = \left\{e^{(h)}(x, y) \mid h \in \mathcal{H}(\lambda)\right\},$$
where $\lambda = \log(\Lambda)$ and $\mathcal{H}(\lambda) = \{h : \mathcal{X} \times \mathbb{R} \to \mathbb{R} \mid \|h\|_\infty \leq \lambda\}$. In words: every choice of $h \in \mathcal{H}(\lambda)$ defines, at each possible value of $X$ and $Y$, a discrepancy between $e(x)$ and $e(x, y)$. The choice of $\mathcal{H}(\lambda)$ bounds the maximum of those discrepancies. So, as $\Lambda$ grows, we are allowing for greater and greater discrepancies in these probabilities.

For each choice of $h$, they define a "shifted estimand,"
$$\mu^{(h)} = \left(\mathbb{E}\left(\frac{A}{e^{(h)}(X, Y)}\right)\right)^{-1} \mathbb{E}\left(\frac{AY}{e^{(h)}(X, Y)}\right),$$
where $A$ is the treatment indicator and the expectation is over the joint distribution of $X, Y, A$. The corresponding "shifted estimator" is given by
$$\hat\mu^{(h)} = \left(\frac{1}{n}\sum_{i=1}^n \frac{A_i}{\hat e^{(h)}(X_i, Y_i)}\right)^{-1} \frac{1}{n}\sum_{i=1}^n \frac{A_i Y_i}{\hat e^{(h)}(X_i, Y_i)}.$$
The sum is over a sample of points $(X_i, Y_i, A_i)$ drawn i.i.d. from their joint distribution. The quantity in the denominators, $\hat e^{(h)}(X_i, Y_i)$, is obtained by estimating $\mathbb{P}(A = 1 \mid X = x)$ and then shifting the estimate by $h(x_i, y_i)$ on the logit scale for all units $i$ such that $X_i = x_i$ and $Y_i = y_i$.

Now, the proof of the validity of their approach proceeds in several stages.

1. First, they consider the case where data-dependent intervals $[L^{(h)}, U^{(h)}]$ are asymptotically guaranteed to contain $\mu^{(h)}$ with $1 - \alpha$ probability. They argue that taking $L = \inf_{h \in \mathcal{H}(\lambda)} L^{(h)}$ and $U = \sup_{h \in \mathcal{H}(\lambda)} U^{(h)}$ yields an interval $[L, U]$ with asymptotic $1 - \alpha$ coverage for every value of $\mu^{(h)}$ for which $h \in \mathcal{H}(\lambda)$ (Proposition 4.1).

2. For each choice of $h \in \mathcal{H}(\lambda)$, they establish that the bootstrap is valid (Theorem 4.2).
   - First, they use the general theory of Z-estimators to show that $\hat\mu^{(h)}$ and its bootstrap analogue, $\hat{\hat\mu}^{(h)}$, are asymptotically normal with the same mean and variance (Theorem C.1 and Corollary C.2).
   - Then, they conclude that defining $L^{(h)}_B$ as the $\alpha/2$ bootstrap quantile yields $\mathbb{P}(\mu^{(h)} < L^{(h)}_B) \to \alpha/2$ over the joint distribution of $X$, $Y$, and $A$. An analogous result holds for $U^{(h)}_B$, the $1 - \alpha/2$ bootstrap quantile.

3. Denoting by $Q_\beta(\cdot)$ the $\beta$-level bootstrap quantile, they show that
$$Q_{\alpha/2}\left(\inf_{h \in \mathcal{H}(\lambda)} \hat{\hat\mu}^{(h)}\right) \leq \inf_{h \in \mathcal{H}(\lambda)} L^{(h)}_B \quad \text{and} \quad Q_{1-\alpha/2}\left(\sup_{h \in \mathcal{H}(\lambda)} \hat{\hat\mu}^{(h)}\right) \geq \sup_{h \in \mathcal{H}(\lambda)} U^{(h)}_B$$
via their Lemma 4.3.
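For intuition, a minimal sketch of the shifted estimator $\hat\mu^{(h)}$ is given below. All names are ours, and the sign of the shift follows the logit-difference definition above; this is an illustration, not the authors' implementation.

```python
import numpy as np

def shifted_sipw(A, Y, e_hat, h):
    """Shifted SIPW estimate mu_hat^(h): e_hat are fitted treatment
    probabilities and h holds the logit-scale shifts h(X_i, Y_i), with
    ||h||_inf <= lambda = log(Lambda)."""
    logit = np.log(e_hat) - np.log1p(-e_hat)
    e_h = 1.0 / (1.0 + np.exp(-(logit - h)))   # shifted probability e^(h)
    wts = A / e_h                              # stabilized IPW weights
    return np.sum(wts * Y) / np.sum(wts)
```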
B.2 Extension to Design Case

Our challenge is to extend this argument to the case where our estimand of interest is not a single $\mu$ but rather the pair $(\sigma^2_k(0), \sigma^2_k(1)) = (\mu_k(0)(1 - \mu_k(0)),\, \mu_k(1)(1 - \mu_k(1)))$. Crucially, we will now have two $h$ functions, $h_0$ and $h_1$, corresponding to each of the potential outcomes, but they both lie within $\mathcal{H}(\lambda)$. The definition of the shifted estimand under $h$ given above generalizes to the case of two shifted estimands in a straightforward way. We extend Proposition 4.1 in the following argument.
Proposition 1. Suppose there exists a data-dependent region $\beta^{(h_0, h_1)}_k \subset \mathbb{R}^2$ such that
$$\liminf_{n \to \infty} \mathbb{P}\left(\left(\sigma^{2,(h_0)}_k(0),\, \sigma^{2,(h_1)}_k(1)\right) \in \beta^{(h_0, h_1)}_k\right) \geq 1 - \alpha$$
holds for every $(h_0, h_1) \in \mathcal{H}(\lambda) \times \mathcal{H}(\lambda)$, where $\sigma^{2,(h)}_k(e) = \mu^{(h)}_k(e)(1 - \mu^{(h)}_k(e))$ for $e \in \{0, 1\}$, and $n$ is the sample size. Under these conditions, the set
$$\beta_k = \bigcup_{h_0, h_1 \in \mathcal{H}(\lambda)} \beta^{(h_0, h_1)}_k$$
is an asymptotic confidence set of $(\sigma^2_k(0), \sigma^2_k(1))$ with at least $1 - \alpha$ coverage if the true $h_0, h_1 \in \mathcal{H}(\lambda)$.

Proof. This follows from the fact that, by assumption, the true data-generating distribution corresponds to some pair $h_0, h_1 \in \mathcal{H}(\lambda)$, whose region $\beta^{(h_0, h_1)}_k$ is contained in $\beta_k$.

Next, we must show that the bootstrap is valid in our setting. We adopt the same model and regularity conditions as Theorem 4.2 of Zhao et al. (2019). In their proof of Corollary 5.1, the authors show that the pairs $(\hat\mu^{(h_0)}_k(0), \hat\mu^{(h_1)}_k(1))$ and $(\hat{\hat\mu}^{(h_0)}_k(0), \hat{\hat\mu}^{(h_1)}_k(1))$ are both jointly asymptotically normal, with the same limiting distribution. We define the function
$$f(x, y) = (x \cdot (1 - x),\; y \cdot (1 - y)).$$
We can see that applying $f(\cdot)$ to the tuple of potential outcome means will yield the potential outcome variances, and the same logic holds for applying $f(\cdot)$ to any estimator of the potential outcome means. Moreover, because $f(\cdot)$ is continuously differentiable, we can use the Delta Method to observe immediately that $(\hat\sigma^{2,(h_0)}_k(0), \hat\sigma^{2,(h_1)}_k(1))$ and $(\hat{\hat\sigma}^{2,(h_0)}_k(0), \hat{\hat\sigma}^{2,(h_1)}_k(1))$ have the same asymptotic distribution, and thus the bootstrap is valid (Van der Vaart, 2000).

Lastly, we generalize Lemma 4.3 to our setting. For each possible bootstrap replicate $b \in \{1, \ldots, N\}$, where $N = n^n$, define the quartet of points
$$\hat{\hat R}_{k,b} = \left\{\left(\inf_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(0),\; \inf_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(1)\right),\; \left(\inf_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(0),\; \sup_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(1)\right),\; \left(\sup_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(0),\; \inf_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(1)\right),\; \left(\sup_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(0),\; \sup_{h \in \mathcal{H}(\lambda)} \hat{\hat\sigma}^{2,(h)}_{k,b}(1)\right)\right\}.$$
In words, $\hat{\hat R}_{k,b}$ contains the vertices of a rectangle in $\mathbb{R}^2$ which defines the extrema of the potential outcome variances consistent with $h_0, h_1 \in \mathcal{H}(\lambda)$.

Denote as $\mathrm{Conv}(\cdot)$ the standard convex hull operator. Define a related operator,
$$\mathrm{Conv}^\star(S, \mathcal{B}) = \mathrm{Conv}\left(\bigcup_{b \in \mathcal{B}} S_b\right),$$
which takes in a set $S$ of cardinality $N_S$ as well as a set $\mathcal{B} \subset \{1, \ldots, N_S\}$. The function returns the convex hull of the points contained in the entries of $S$ indexed by $\mathcal{B}$. We choose a set $\mathcal{B}_\alpha \subseteq \{1, 2, \ldots, N = n^n\}$ such that $|\mathcal{B}_\alpha| = (1 - \alpha) N$, and we define the set $A_k = \mathrm{Conv}^\star(\{\hat{\hat R}_{k,b}\}, \mathcal{B}_\alpha)$.

Lemma 1. The set $A_k$ is an asymptotically valid confidence set.

Proof. For $1 \leq b \leq N$, where $N = n^n$ is the total number of possible bootstrap samples, we have that for every $h_0, h_1 \in \mathcal{H}(\lambda)$,
$$\left(\hat{\hat\sigma}^{2,(h_0)}_{k,b}(0),\, \hat{\hat\sigma}^{2,(h_1)}_{k,b}(1)\right) \in \mathrm{Conv}(\hat{\hat R}_{k,b}), \quad \text{for all } 1 \leq b \leq N.$$
Since this holds entrywise, it follows that any set containing a fixed proportion of the sets on the RHS must contain at least that proportion of points on the LHS, and hence
$$\mathrm{Conv}^\star\left(\left\{\left(\hat{\hat\sigma}^{2,(h_0)}_{k,b}(0),\, \hat{\hat\sigma}^{2,(h_1)}_{k,b}(1)\right)\right\}, \mathcal{B}_\alpha\right) \subseteq \mathrm{Conv}^\star\left(\left\{\mathrm{Conv}(\hat{\hat R}_{k,b})\right\}, \mathcal{B}_\alpha\right).$$
Since this holds for every $h_0, h_1 \in \mathcal{H}(\lambda)$, we can take the union on the LHS to observe
$$\bigcup_{h_0, h_1 \in \mathcal{H}(\lambda)} \mathrm{Conv}^\star\left(\left\{\left(\hat{\hat\sigma}^{2,(h_0)}_{k,b}(0),\, \hat{\hat\sigma}^{2,(h_1)}_{k,b}(1)\right)\right\}, \mathcal{B}_\alpha\right) \subseteq \mathrm{Conv}^\star\left(\left\{\mathrm{Conv}(\hat{\hat R}_{k,b})\right\}, \mathcal{B}_\alpha\right).$$
Observe that the RHS is simply $A_k$, since any ellipse containing the vertices of a rectangle will contain the convex hull of those vertices as well. On the LHS, we can make use of our bootstrap validity result to observe
$$\liminf_{n \to \infty} \mathbb{P}\left(\left(\sigma^{2,(h_0)}_k(0),\, \sigma^{2,(h_1)}_k(1)\right) \in \mathrm{Conv}^\star\left(\left\{\left(\hat{\hat\sigma}^{2,(h_0)}_{k,b}(0),\, \hat{\hat\sigma}^{2,(h_1)}_{k,b}(1)\right)\right\}, \mathcal{B}_\alpha\right)\right) \geq 1 - \alpha.$$
It follows from Proposition 1 that the LHS is a valid $1 - \alpha$ level confidence region. Hence, the right-hand side must be as well.

To conclude, we observe that our ellipsoid method must necessarily comprise a superset of a convex hull for some choice of $\mathcal{B}_\alpha$. Hence, our method will indeed generate valid confidence regions for the potential outcome variances.
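The counting step underlying the choice of $\mathcal{B}_\alpha$ is easy to operationalize. The sketch below (names ours) computes the fraction of bootstrap rectangles whose four vertices all fall inside a candidate ellipse; in the main text, the Löwner-John ellipsoid is shrunk until this fraction first reaches $1 - \alpha$.

```python
import numpy as np

def fraction_fully_inside(rects, center, shape_inv, radius):
    """Fraction of bootstrap rectangles fully contained in the ellipse
    {x : (x - c)^T S^{-1} (x - c) <= radius^2}.

    rects : array of shape (B, 4, 2), the vertices of each replicate's
            rectangle of variance extrema
    """
    d = rects - np.asarray(center)                        # (B, 4, 2)
    q = np.einsum('bvi,ij,bvj->bv', d, np.asarray(shape_inv), d)
    return np.mean(np.all(q <= radius ** 2, axis=1))
```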
C Proof of Concavity of Minimax Problem

We begin with the unweighted case, and demonstrate concavity by direct computation of the Hessian. Define
$$f(\sigma^2_k(\cdot)) = \frac{1}{n_r}\left(\sum_k \sigma_k(1) + \sigma_k(0)\right)^2 - \left(\sum_k \frac{\sigma^2_k(1)}{\tilde n_{rkt}} + \frac{\sigma^2_k(0)}{\tilde n_{rkc}}\right).$$
Here $f$ is viewed as a function of the variances $\sigma^2_k(e)$; the second term is linear in them, so the Hessian comes entirely from the first term. The Hessian is given by
$$\nabla^2 f = \frac{1}{2 n_r}\left(H + v v^T\right),$$
where
$$H = \mathrm{diag}\left(-\frac{\sum_j \sigma_j(0) + \sigma_j(1)}{\sigma_k(e)^3}\right)_{k,e} \quad \text{and} \quad v = \left(\sigma_1(0)^{-1}, \sigma_1(1)^{-1}, \ldots, \sigma_K(0)^{-1}, \sigma_K(1)^{-1}\right)^T.$$
We want to consider the eigenvalues of $H + vv^T$. First, observe that at most one eigenvalue can be nonnegative. This follows from the famed Weyl inequalities (Weyl, 1912). $H$ has all strictly negative eigenvalues, while $vv^T$, being an outer product, has one positive eigenvalue, $v^T v$, with all other eigenvalues 0. Denoting as $\lambda_i(G)$ the $i$th largest eigenvalue of matrix $G$, the Weyl inequalities tell us that
$$\lambda_2(H + vv^T) \leq \lambda_1(H) + \lambda_2(vv^T) = \lambda_1(H) < 0.$$
Hence, only one non-negative eigenvalue is possible. Next, we can use the matrix determinant lemma to observe that
$$\det(H + vv^T) = (1 + v^T H^{-1} v)\det(H),$$
and direct computation tells us that $v^T H^{-1} v = -1$. Hence, the determinant is 0, meaning at least one of our eigenvalues must be zero. Combined with our prior result, this means our maximum eigenvalue must be zero, and we conclude the Hessian is negative semidefinite. Thus, $f$ is indeed concave.

Finally, note that the extension to the weighted case is straightforward. We can simply define new variables $\tilde\sigma_k(e) = \sqrt{w_k}\,\sigma_k(e)$ for $e \in \{0, 1\}$, and then repeat the proof above using the $\tilde\sigma_k(e)$ variables. Since $\sigma^2_k(e)$ is simply an affine transformation of $\tilde\sigma^2_k(e)$, concavity in the former follows from concavity in the latter.
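As a quick numerical sanity check of the eigenvalue argument (variable names ours), one can verify that $H + vv^T$ has top eigenvalue zero and all remaining eigenvalues negative:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = rng.uniform(0.1, 0.5, size=8)   # stacked sigma_k(e) > 0 values
S = sigma.sum()
H = np.diag(-S / sigma**3)              # diagonal matrix from the proof
v = 1.0 / sigma                         # v = (sigma_1(0)^-1, ...)
eigs = np.linalg.eigvalsh(H + np.outer(v, v))  # ascending order
print(eigs[-1])                         # ~ 0, up to numerical error
print((eigs[:-1] < 0).all())            # True: other eigenvalues negative
```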