[PDF] Treatment effect estimation with Multilevel Regression and Poststratification

Abstract

Multilevel regression and poststratification (MRP) is a flexible modeling technique that has been used in a broad range of small-area estimation problems. Traditionally, MRP studies have been focused on non-causal settings, where estimating a single population value using a nonrepresentative sample was of primary interest. In this manuscript, MRP-style estimators will be evaluated in an experimental causal inference setting. We simulate a large-scale randomized control trial with a stratified cluster sampling design, and compare traditional and nonparametric treatment effect estimation methods with MRP methodology. Using MRP-style estimators, treatment effect estimates for areas as small as 1.3\% of the population have lower bias and variance than standard causal inference methods, even in the presence of treatment effect heterogeneity. The design of our simulation studies also requires us to build upon a MRP variant that allows for non-census covariates to be incorporated into poststratification.

Full PDF

TTreatment eﬀect estimation with Multilevel Regression andPoststratiﬁcation

Yuxiang Gao ∗ , Lauren Kennedy † , and Daniel Simpson ‡ Department of Statistical Sciences, University of Toronto Department of Econometrics and Business Statistics, Monash UniversityFebruary 22, 2021

Abstract

Multilevel regression and poststratiﬁcation (MRP) is a ﬂexible modeling technique that has been usedin a broad range of small-area estimation problems. Traditionally, MRP studies have been focused onnon-causal settings, where estimating a single population value using a nonrepresentative sample was ofprimary interest. In this manuscript, MRP-style estimators will be evaluated in an experimental causalinference setting. We simulate a large-scale randomized control trial with a stratiﬁed cluster samplingdesign, and compare traditional and nonparametric treatment eﬀect estimation methods with MRPmethodology. Using MRP-style estimators, treatment eﬀect estimates for areas as small as 1.3% of thepopulation have lower bias and variance than standard causal inference methods, even in the presenceof treatment eﬀect heterogeneity. The design of our simulation studies also requires us to build upon aMRP variant that allows for non-census covariates to be incorporated into poststratiﬁcation.

Keywords:

Causal inference, treatment eﬀect heterogeneity, multilevel regression and poststratiﬁcation,small-area estimation, non-census variables

Randomized control trials (RCTs) are often considered the gold standard for estimating causal treatmenteﬀects of a speciﬁc intervention. Applications of RCTs are widespread, having been applied in ﬁelds suchas healthcare, economics, political science, education and technology. Studying the statistical properties ofRCTs and the methods used to model them are active areas of research in causal inference since challengesinvolving data quality, model speciﬁcation and experimental design can arise in practice.The main purpose of this manuscript is to introduce Multilevel Regression and Poststratiﬁcation (MRP)[Gelman and Little, 1997, Little, 1993], a method has traditionally been used for public opinion modellingand small-area estimation, as a viable method in experimental causal inference. We will highlight thestrengths and limitations of modern causal inference methods along with MRP and its variants for estimat-ing conditional treatment eﬀects over subpopulation groups of various sizes, focusing speciﬁcally on scenarioswith treatment eﬀect heterogeneity. In particular, the causal inference problem we are analyzing is a simu-lated version of a large-scale stratiﬁed cluster RCT conducted in the United States in 2015–2016. Secondly, ∗ Author email: [email protected] † Author email: [email protected] ‡ Author email: [email protected]

Acknowledgements:

Yuxiang Gao was funded by the Ontario Graduate Scholarship. Daniel Simpson was funded by theNatural Sciences and Engineering Research Council of Canada and the Canadian Research Chair program. Research reportedin this publication was supported by National Institute of Aging of the National Institutes of Health under award numberR01AG067149. The content is solely the responsibility of the authors and does not necessarily represent the oﬃcial views ofthe National Institutes of Health. a r X i v : . [ s t a t . M E ] F e b e analyze a variant of MRP seen in [Lax et al., 2019, Gelman, 2018, Kastellec et al., 2015], which we call Multilevel Regression and Poststratiﬁcation with Multiple Imputation (MRP-MI) . MRP-MI allows the mod-eler to include a variable that is present in the sample but missing in the population, known as a non-censusvariable, in the poststratiﬁcation frame.

Generalization in causal inference refers to mapping treatment eﬀects from a speciﬁc study to treatmenteﬀects in a target population. This population can be a ﬁnite population of the experimental sample,a group of individuals who are in a diﬀerent location relative to the experimental sample or a group ofindividuals whose covariates were measured at a diﬀerent time relative to the experimental sample. Withoutpost-experimentation adjustments, there is a chance that treatment eﬀects from an experimental samplewill be diﬀerent than the true treatment eﬀect of the target population. An important point to note isthat, no unadjusted treatment eﬀect estimates from the experimental sample are generalizable to the targetpopulation unless the sample is equal to the target population or the sample is a random draw from thetarget population [Stuart et al., 2018]. There are diﬀerent interpretations and deﬁnitions of generalizabilityin the literature, and in this part of the introduction we will brieﬂy outline some of the diﬀerent frameworksand assumptions used to deﬁne generalizability.In [Kern et al., 2016], the estimand of interest is the Target Average Treatment Eﬀect (TATE), whichis deﬁned as the mean of the diﬀerence in potential outcomes N (cid:80) Ni =1 ( Y i (1) − Y i (0)) where N is the sizeof the target population. A sampling indicator S i ∈ { , } is deﬁned for every individual in the targetpopulation and the experimental sample. S i = 0 if the individual is in the experimental sample and 1otherwise. Estimating TATE can be done through weighting approach, or an outcome modelling approach.Through the weighting approach as outlined in [Kern et al., 2016], a score w i := ˆ e i − ˆ e i is deﬁned for everyindividual in the experimental sample, where ˆ e i is the estimated probability that individual i is in the targetpopulation. The idea behind the weights w i is that they make the experimental sample more similar tothe target population. A weighted linear regression model is then ﬁt to the experimental sample using theset of w i and then the coeﬃcient of the treatment indicator is used as the TATE estimate. The outcomemodelling approach in [Kern et al., 2016] uses a predictive model on the experimental sample and thenpredicts outcomes under treatment and control for every individual in the the target population.In [Ackerman et al., 2019], the generalization methods of weighting and outcome modelling as seen in[Kern et al., 2016] are tested. Targeted Maximum likelihood Estimation [Gruber and Van Der Laan, 2009], adoubly robust estimation method that combines both weighting and outcome modelling is used to estimatethe TATE. For a more thorough literature review of generalizing experimental sample treatment eﬀectestimates to the TATE using approaches in [Ackerman et al., 2019, Kern et al., 2016], the reader can referto [Stuart et al., 2018]. A broader literature review on experimental causal inference and generalizabilitycan be found in the recent survey [Colnet et al., 2020].Another way to generalize experimental sample estimates to a target population is through poststratiﬁca-tion, a weighting technique in survey statistics. Poststratiﬁcation is brieﬂy mentioned in [Stuart et al., 2011],but the current challenges that remain are continuous covariates or a large number of poststratiﬁcation cells.[Miratrix et al., 2018] analyzes poststratiﬁcation in survey experiments, but only analyzes scenerios whereall observed covariates are categorical.To measure representativeness between the experimental sample and the target population, various mea-sures have been developed. One method was based oﬀ the sampling propensity score s ( X ) := P ( S = 1 | X )where X is the pre-treatment covariate vector and S is a binary indicator for membership in the experimen-tal sample. [Stuart et al., 2011] deﬁne the propensity score diﬀerence, which is the diﬀerence of averagesfor s ( X i ) between the experimental sample and the target population. [Tipton, 2014] extends the usage ofsampling propensity scores by deﬁning a new metric known as the generalization index , a value between 0and 1 that is a measure of representativeness between the experimental sample and the target population.Though this manuscript is not focused on generalizing treatment eﬀect estimates but rather on recoveringtreatment eﬀect heterogeneity, we calculate the generalization index in our simulation studies to show thatindeed our experimental sample is representative and hence a high-quality sample of the target population.2 .2 Treatment eﬀect heterogeneity across diﬀerent subpopulations in the targetpopulation When there is no treatment eﬀect heterogeneity, it’s not a concern if the experimental sample is representativeor not. This is because treatment eﬀects for any subpopulation will be equal so over/undersampling certainsubpopulations does not result in diﬀerent treatment eﬀect estimates. However, any amount of treatmenteﬀect heterogeneity will bias treatment eﬀect estimates coming from a nonrepresentative experimental samplewhen the nonrepresentativeness and heterogeneity are not properly adjusted for.More speciﬁcally, estimating heterogeneous treatment eﬀects consists of estimating the non-constantfunction τ ( x ) := E ( Y (1) − Y (0) | X = x ), which is the expected value of the diﬀerence in potential outcomesconditional on pretreatment variables X = x . This expectation is taken at the superpopulation level, wherethe superpopulation is the data generating process for the ﬁnite target population. In this manuscript, wegenerate a ﬁnite target population then sample from it, but the observed sample is used to estimate CATEs atthe superpopulation level. For additional details on the connection between superpopulation causal inferenceand ﬁnite population causal inference, we refer the reader to [Ding et al., 2017, Imbens and Rubin, 2015].Modeling τ ( x ) can be done parametrically and non-parametrically. The parametric approach would beto use a linear regression model with the treatment variable interacting with other pre-treament variables.A parametric modeling approach’s beneﬁts are that it’s simple to implement and its model coeﬃcientsare interpretable. However, it’s prone to model misspeciﬁcation when τ ( x ) is nonlinear and pre-treatmentcovariate X is high-dimensional, which can result in biasing estimates of τ ( x ).Estimating τ ( x ) can be done more ﬂexibly with nonparametric regression methods [Hahn et al., 2020,Wager and Athey, 2018, Hill, 2011] and machine learning methods [K¨unzel et al., 2019]. Nonparametrictreatment eﬀect estimation methods have been used in survey experiments [Green and Kern, 2012] and large-scale randomized control trials concerning education interventions [Yeager et al., 2019, Athey and Wager, 2019].For a more extensive overview of the challenges arising from treatment eﬀect heterogeneity in experimenta-tion, we refer the reader to [Athey and Imbens, 2017].The focus of this paper will mainly be to address the challenges of treatment eﬀect heterogeneity inexperimentation when the sample is representative of the target population, so as a motivating example, wewill discuss a large-scale RCT conducted recently on the target population of public high schools in the UnitedStates, one that was representative of the target population yet exhibited treatment eﬀect heterogeneity. ThisRCT was used as a model for our simulation studies run in this manuscript. The National Study of Learning Mindsets (NSLM) [Yeager et al., 2019] was an experiment conducted on anationally representative sample of 9 th grade public high-school students in the United States during Fall2015. The growth mindset intervention was an education intervention that encouraged treated individualsto view intellectual ability not as a ﬁxed trait, but a muscle that can be trained through the sustainedeﬀort of seeking help when learning new academic content and trying new learning strategies. Indeed, thegrowth mindset intervention was shown to be eﬀective in improving the GPA of high-school students. Theoriginal NSLM analysis used linear modeling with survey weights as well as a response surface method[Hahn et al., 2020].The sampling design in NSLM ﬁrst has school strata deﬁned by levels of minority composition level andschool-achievement level. Then schools are cluster sampled from each school strata. Finally, in the chosenschools that decided to partake in the study, students were given the option to participate or not. Theaverage student-level response rate was 92 percent. The intervention was randomly assigned at the studentlevel.The intervention of interest is how eﬀective the treatment is on the GPA of individuals as measuredby the diﬀerence between post-intervention GPA and pre-intervention GPA. The causal relationship of in-terest in this manuscript is shown in the causal directed acyclic graph (DAG) [Pearl, 1995] seen in Figure1.3. This DAG is a simpliﬁed model of the NSLM that contains only the main variables in their analysis[Yeager et al., 2019]. 3 igure 1.3 (Causal DAG): Shaded nodes are ob-served variables whereas unshaded nodes are unob-served noise. Because the data collected is comingfrom a randomized experiment, all the variables ex-cept the outcome (blue post-GPA node) and thetreatment (red treatment node) are pre-treatmentvariables. The structure of the posited causal graphis incorporated in treatment eﬀect estimation meth-ods used in this manuscript. The outcome variablePost-GPA and the covariate Previous-GPA are bothcontinuous and truncated with a lower and upperbound. The rest of the observed covariates are cat-egorical.

TreatmentPost-GPA

Prev-GPASchool Race/Eth Mat.-Edu.GenderMin-Comp.School-Ach. (cid:15)

School (cid:15)

Prev-GPA (cid:15)

Post-GPA

The NSLM’s target population covariate structure is rich and detailed. Across diﬀerent covariate subpop-ulations, diﬀerent treatment eﬀects are suspected to exist. In this manuscript, we explore the limitations ofusing a representative sample to calculate treatment eﬀects for subpopulations in a population as structuredand heterogeneous as the target population of NSLM.We design a simulation study that generates a complex target population and samples from it througha stratiﬁed cluster design to generate a RCT. The sampling design has schools as the clusters, and theclusters are stratiﬁed by school-level covariates minority composition level and school-achievement level. Thesimulated population is built to match the population summary statistics reported in [Yeager et al., 2019].Section 5 contains additional details on the design of the simulation study and further information aboutthe population’s structure and how it relates to the NSLM can be found in Supplementary Material C.Additionally, we compare various causal inference methods to see the extent that these methods canrecover the true heterogeneity of the intervention in the target population. In our simulation studies, weestimate treatment eﬀects ranging from the Average Treatment Eﬀect (ATE) for the whole population toConditional Average Treatment Eﬀects (CATEs) for subpopulations as small as 1.3 percent of the population.

In this manuscript, we propose a treatment eﬀect estimation method which builds on a hierarchical model-based survey estimation method used for small-area estimation, Multilevel Regression and Poststratiﬁca-tion (MRP) [Park et al., 2004, Gelman and Little, 1997]. MRP combines a hierarchical model’s posteriordistribution of the outcome ( θ c ) Jc =1 with an external poststratiﬁcation matrix containing population sizes( N c ) Jc =1 , where c is a stratum of ﬁnest granularity in the target population. MRP then forms a posteriordistribution for the average outcome in the target population, θ ps := (cid:80) Jc =1 N c θ c (cid:80) Jc (cid:48) =1 N c (cid:48) , and summary statisticssuch as the posterior mean of θ ps can be used as the point estimate for average outcome in the tar-get population. A strength of MRP is that it’s eﬀective in estimating outcomes of small areas in thepresence of selection bias across various subpopulations of a target population [Gao et al., 2020]. MRPhas historically been a commonly used methodology in small-area estimation and public opinion model-ing [Lauderdale et al., 2020, Wang et al., 2015], however it’s applications to causal inference have been anunexplored area. A primary goal of this manuscript is to explore the extent of MRP’s eﬀectiveness inexperimental causal inference. For a more detailed step-by-step outline of MRP, we refer the reader to[Lax and Phillips, 2009].MRP requires covariates to be categorical, in order to perform Poststratiﬁcation . Current practice is todiscretize continuous covariates before using them in a MRP model. The eﬀect of the granularity of thediscretization on the MRP estimates is studied in [Gao et al., 2020]. Discretization of continuous covariatesin MRP is possible when we have population counts for each of the discretized bins. For example, if age isdiscretized into 6 groups, then the poststratiﬁcation matrix used must have population counts for each ofthe 6 age groups. 4n our simulation study, which is based on the NSLM study, we are given the poststratiﬁcation matrixfor all combinations of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education, thus providing population counts for each combination. Prev-GPA of individualsin every collected sample is reported, but no such discretization of Prev-GPA at the population level isavailable.It’s also important to note that Post-GPA and Prev-GPA in our simulation studies are both truncatedcontinuous distributions with support (0 , . − Prev-GPA as a function of the binary treatment indicatorand the covariates School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education. This is not the case in our study since the variable Post-GPA − Prev-GPA is not inthe same family of distributions that Post-GPA or Prev-GPA are in. Thus to estimate the treatment eﬀectsfrom DAG Figure 1.3, we have to use Prev-GPA as a covariate and Post-GPA as the outcome.We will utilize a variant of MRP that bypasses missing population counts for the continuous covariatePrev-GPA and also takes into account the truncated nature of both Prev-GPA and Post-GPA. We will referto this variant as

Multilevel Regression and Poststratiﬁcation with Multiple Imputation (MRP-MI) . We applyMRP-MI to to the problem of heterogeneous treatment eﬀect estimation for the causal graph in Figure 1.3.In the literature, MRP-MI is an extension of the procedure where the modeler uses MRP with non-censusvariables [Kennedy and Gelman, 2019, Lax et al., 2019, Gelman, 2018, Kastellec et al., 2015]. MRP-MI issimilar to the prediction framework carried out in example 2 of [Kennedy and Gelman, 2019]. [Kennedy and Gelman, 2019]dealt with a non-census variable by imputing it for each individual. In contrast, we predict our non-census variable at the poststratiﬁcation cell-level rather than the individual level. In [Lax et al., 2019,Kastellec et al., 2015] the non-census variable (variable with missing population counts) is political partymembership (discrete with 3 levels) whereas in this manuscript the non-census variable is Prev-GPA (con-tinuous and truncated).In this estimation problem, MRP-MI entails ﬁtting both a Prev-GPA hierarchical model and a Post-GPAhierarchical model, and then poststratifying with the posterior predictive distribution of Post-GPA afterintegrating out the posterior predictive distribution of Prev-GPA. The detailed description of MRP-MI isshown in the later Section 4.The challenge with modeling CATEs in a simulation study modelled after the target population in NSLMis that there are scenarios where some school clusters ˜ O are not observed. Hence calculating the CATE for˜ O will be an out-of-sample prediction problem. Indeed, that was the case in the original NSLM study –Only around 70 schools out of the 12000 total schools in the target population were sampled from. Insteadof calculating CATEs for all 12000 schools, the modelers only calculated CATEs for the clusters observed. Apart of this manuscript (Section 5.6) will be dedicated to showcasing the limitations of MRP-MI and otherMRP variants when it comes to calculating out-of-sample CATEs. This manuscript has the following order: Section 1 contains a summary of two common problems in ex-perimental causal inference, namely treatment eﬀect heterogeneity and generalizability of treatment eﬀectestimates. Section 2 contains our problem setup for estimating treatment eﬀects for any group in the targetpopulation, where our problem is modeled after the National Study of Learning Mindsets. Section 3 presentsthe various treatment eﬀect estimation methods that are tested in our simulation studies. Section 4 presentsthe detailed outline of our proposed treatment eﬀect estimation method. Section 5 is a description of oursimulation study and results, which includes how the target population is generated and how we sample fromthe target population. Section 6 is the conclusion. The supplementary material contains further discussionof the MRP variant (MRP-MI), connection between treatment eﬀect estimation and MRP, and simulationdesign information.

Suppose that we’re interested in estimating how eﬀective an education intervention is on the Post-GPA ofan individual. Let ( Y (0) , Y (1) , X ) be a superpopulation joint distribution for the control potential outcome,5 aternal Education Gender Race/Ethnicity School Minority Composition Index School Achievement Index N c No Female Asian 00001 Low Low 1No Female Asian 00002 Low Low 1No Female Asian 00006 Low Low 1No Female Asian 00007 Low Low 1No Female Asian 00012 Low Low 2... ... ... ... ... ... ...... ... ... ... ... ... ...Yes Male Other 11220 High High 9Yes Male Other 11221 High High 15

Table 1: Poststratiﬁcation matrix M for the population P with population sizes N c for all J combinations ofSchool-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education.Note that (cid:80) Jc =1 N c = N P .treatment potential outcome and the pre-treatment covariate. Y (0) is the Post-GPA of the individualunder no treatment and Y (1) is the Post-GPA of the individual under treatment. X is a vector containingdemographic information and the previous academic achievement scores of an individual. We’re interestedin the following superpopulation CATE estimands:1. Average Treatment Eﬀect: τ ATE := E ( Y (1) − Y (0))2. Conditional Average Treatment Eﬀect for any subpopulation O : τ CATE , O := E ( Y (1) − Y (0) | X ∈ O )3. Conditional Average Treatment Eﬀect for X = x : τ CATE ,X = x := E ( Y (1) − Y (0) | X = x )Suppose that a ﬁnite population P := ( Y m (0) , Y m (1) , X m ) N P m =1 is generated i.i.d from ( Y (0) , Y (1) , X ).Suppose that ( Y i (0) , Y i (1) , X i ) ni =1 is a sample drawn from P in a probabilistic manner. In our case,( Y i (0) , Y i (1) , X i ) ni =1 is drawn as a stratiﬁed cluster sample. Let Z i ∈ { , } be a treatment-control indi-cator variable for individual i in the sample, and Y i be the outcome variable for individual i in the sample.The details of the sampling design and assignment mechanism for ( Y i (0) , Y i (1) , X i , Z i ) ni =1 can be found inSection 5 of this manuscript.Referencing the causal DAG in Figure 1.3, we have X i = ( V i , ME i , G i , RE i , School i , SA i , MC i ), whichis the pre-treatment covariate vector for individual i consisting of Previous-GPA, Minority Composition,Gender, Race/Ethnicity, School, School Achievement level and Minority Composition level of School. SchoolAchievement level and Minority Composition level of School are school-level covariates and the other co-variates are individual-level covariates. ( Y i (0) , Y i (1) , X i , Z i ) ni =1 is a stratiﬁed cluster sample of the ﬁnitepopulation P , where clusters are schools.Finally, we will assume that we’re given a poststratiﬁcation matrix M for the target population, withpopulation counts for all J combinations of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education. Table 1 shows M . We do not have population information forPrev-GPA. To allow for identiﬁability of superpopulation CATEs and ATE, we will assume the following below for allindividuals 1 ≤ i ≤ n . For more on the role that these assumptions play in superpopulation treatment eﬀectestimation, we refer the reader to [Imbens and Rubin, 2015].1. Stable Unit Treatment Value Assumption (SUTVA) [Rubin, 1980]: We assume that there is no inter-ference between units. That is, the treatments assigned to a unit do not aﬀect the potential outcomesof another unit.2. Consistency: Y i = Z i Y i (1) + (1 − Z i ) Y i (0) almost-surely3. Ignorability: ( Y i (0) , Y i (1)) ⊥⊥ Z i

6. Overlap: 0 < P ( Z i = 1 | X i ) < Y i = Y i ( Z i ) almost-surely. This deﬁnes the observed stratiﬁedcluster sample D := ( Y i , X i , Z i ) ni =1 , consisting of the outcome, pre-treatment covariate and treatment variablerespectively. Based oﬀ D , various causal inference methods can be use to estimate the ATE and CATEs.MRP-style estimators will be compared with other standard causal inference methods for calculatingCATEs and ATE, so we motivate the need for MRP by providing the connection between MRP and super-population treatment eﬀect estimation below. Suppose that the covariate vector X of an individual in the superpopulation has J number possible valuesthat it can take on. Also, suppose we’re given a poststratiﬁcation matrix ˜ M for the target population, withpopulation counts ( N c ) Jc =1 for all J cells. Note that the CATE τ CATE , O for a subpopulation O is then equalto E ( Y (1) − Y (0) | X ∈ O ) = (cid:90) c ∈ I O E ( Y (1) − Y (0) | X = x c ) P ( X = x c | X ∈ O )= (cid:88) c ∈ I O E ( Y (1) − Y (0) | X = x c ) N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (1)where I O is the index set for subpopulation O . Thus poststratifying the CATE of every cell in subpopu-lation O is performing integration of the potential outcome surface diﬀerence over the covariate space, wherethe discrete probability measure is deﬁned by the given poststratiﬁcation weights (cid:110) N c (cid:80) c (cid:48)∈ I O N c (cid:48) (cid:111) c ∈ I O .It follows that the ATE is integrating E ( Y (1) − Y (0) | X = x c ) over all the poststratiﬁcation cells:ATE = E ( Y (1) − Y (0))= (cid:90) c E ( Y (1) − Y (0) | X = x c ) P ( X = x c )= J (cid:88) c =1 E ( Y (1) − Y (0) | X = x c ) N c (cid:80) Jc (cid:48) =1 N c (cid:48) (2)Hence calculating the superpopulation ATE and superpopulation CATEs require modelling the responsesurfaces g z ( x ) := E ( Y ( z ) | X = x ) for z = 0 ,

1, and then poststratifying the diﬀerence ∆ g ( x ) := g ( x ) − g ( x ) by weights deﬁned by { P ( X = x c ) } Jc =1 and { P ( X = x c | X ∈ O ) } c ∈ I O respectively. Various outcomemodeling methods as mentioned in the introduction can be used to model the two potential outcome surfaces g ( x ) , g ( x ), with an increasingly popular approach being nonparametric methods. Nonparametric methodssuch as BART have performed well in empirical studies for estimating treatment eﬀects in an observationalsetting such as [Hahn et al., 2020, Dorie et al., 2019].Poststratiﬁcation weights and sampling weights (Both are survey weights) are used to adjust for selectionbias of the covariates that are not the treatment covariate. On the other hand, propensity score weights incausal inference are used to adjust for non-random treatment assignment. Both can be combined togetherwhen modeling to simultaneously adjust for selection bias across non-treatment covariates and reduce con-founder bias in treatment eﬀect estimates [DuGoﬀ et al., 2014]. Weights in survey statistics and weightsin causal inference are related insofar that they both adjust for representativeness—survey weights adjustfor representativeness in the sample whereas causal inference weights adjust for representativeness betweentreatment and control groups. This connection is discussed in detail in [Mercer et al., 2017].Another connection between survey statistics and causal inference is that both ﬁelds rely on the samemachinery from missing data analysis as ways of accounting for nonresponse and confounders respectively.In Rubin’s potential outcomes framework [Rubin, 1974], the target estimand is a treatment eﬀect in thepopulation, which is deﬁned as a diﬀerence in potential outcomes. In survey statistics, the target estimand is apopulation-level quantity. Both ﬁelds require the modeler to predict missing values in the target population.In causal inference, every individual in the observed sample has a potential outcome missing that must7e predicted. In survey statistics, the outcome (and potentially some covariates) must be predicted forindividuals that did not respond. [Kang et al., 2007] provides a more detailed review of methods that canadjust for missing data in both survey estimation problems and in causal inference problems. This paperwas inspired by the commonality of both ﬁelds using the same methods as seen in [Kang et al., 2007] toadjust for missing data. MRP has historically been a method used for nonresponse adjustment in surveys,and in this paper we explore its capabilities in experimental causal inference.In a MRP application setting, we’re given the poststratiﬁcation weights (cid:110) N c (cid:80) c (cid:48)∈ I O N c (cid:48) (cid:111) c ∈ I O for subpopu-lation O . In scenarios where the modeler wants to estimate the CATE for subpopulation O but isn’t giventhe poststatiﬁcation weights, the in-sample frequencies of every cell c in subpopulation O is used instead[Hahn et al., 2020, Dorie et al., 2019, Hill, 2011]. That is, P ( X = x c | X ∈ O ) = N c (cid:80) c (cid:48)∈ I O N c (cid:48) is replaced with n c (cid:80) c (cid:48)∈ I O n c (cid:48) , where n c are the number of observations in our observed sample D = ( Y i , X i , Z i ) ni =1 that arefrom poststratiﬁcation cell c .Thus the point estimate of τ CATE , O becomes (cid:88) c ∈ I O ˆ E ( Y (1) − Y (0) | X = x c ) (cid:124) (cid:123)(cid:122) (cid:125) Point estimate from a model frequency from D (cid:122) (cid:125)(cid:124) (cid:123) ˆ P ( X = x c | X ∈ O ) = (cid:88) c ∈ I O ˆ E ( Y (1) − Y (0) | X = x c ) (cid:124) (cid:123)(cid:122) (cid:125) Point estimate from a model (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) (cid:33) (3)This can result in a poor point estimate of τ CATE , O if the covariates in the observed sample { X i } ni =1 area nonrepresentative sample of the target population. Supplementary Material B further expands on suchchallenges.Assuming that a poststratiﬁcation matrix ˜ M is available, it would be sensible to perform poststratiﬁcationwith it in the scenario of treatment eﬀect estimation. Doing so will help the modeler adjust for non-responseand sampling bias. We propose ﬁve treatment estimation methods below. The ﬁrst three treatment methods (OLS, SVY, BART)had variants of them used in the original NSLM study [Yeager et al., 2019] for ATE and CATE estimation ofvarious subpopulations, hence we decided to include them in our simulation study that modeled the NSLM’starget population and experimental design. The last two methods (BARP-I and MRP-I) are MRP-stylemethods that incorporate information from poststratiﬁcation matrix M . BARP-I and MRP-I impute aPrev-GPA estimate for every poststratiﬁcation cell in M , hence the acronym I.All of the ﬁve methods are able to estimate the ATE τ ATE and any CATE τ CATE , O where subpopulation O has atleast one observation in the sample D . The last two methods (BARP-I and MRP-I) are able toestimate CATEs for subpopulations not observed in D . BARP-I and MRP-I should be viewed more asframeworks rather than speciﬁc methods since they allow for various ways to model the outcome Post-GPAand the covariate Prev-GPA.1. Ordinary least squares (OLS)

In this manuscript, we will follow notation consistent with [Gelman and Hill, 2006]. For the i th indi-vidual, deﬁne β RE j [ i ] to be their unique race/ethnicity coeﬃcient. As well for the i th individual, deﬁne β School j [ i ] to be their unique school coeﬃcient. Fit the following regression model below: Y i ∼ N (cid:16) β + β Z Z i + Oﬀset( V i ) + β School j [ i ] + β RE j [ i ] + β ME ME i + β G G i , σ (cid:17) (4)Using the full sample D , the ATE estimate would be ˆ β Z . Calculating CATE for a speciﬁc subpopulationin the population is done by subsetting the stratiﬁed cluster sample D into the targeted subpopulationand then estimating ˆ β Z . 8. Survey regression with sampling weights deﬁned through raking (SVY) [Lumley, 2020]

Deﬁne raked weights [Lohr, 2009, Deming and Stephan, 1940] on the stratiﬁed cluster sample using theraking variables strata, gender and race/ethnicity. The population counts used for raking are deﬁnedin the poststratiﬁcation matrix of J cells. Using these raked weights w i , ﬁt the model for the stratiﬁedcluster sample: Y i ∼ N (cid:18) β + β Z Z i + Oﬀset( V i ) + β RE j [ i ] + β ME ME i + β G G i , σ w i (cid:19) (5)The weights are rescaled to sum to 1 for numerical stability. We will use the survey package[Lumley, 2020] in R to ﬁt this model. Note that there is no school factor since the weights w i al-ready account for the cluster sampling within every school stratum.Using the full sample D , the ATE estimate would be ˆ β Z . Calculating CATE for a speciﬁc subpopulationin the population is done by subsetting the stratiﬁed cluster sample D into the targeted subpopulationand then estimating ˆ β Z . The model-robust weighted standard error [Lumley and Scott, 2017] for ˆ β Z is computed using the survey package in R [Lumley, 2020].3. Bayesian Additive Regression Trees (BART) [Hill, 2011, Chipman et al., 2010]

Referencing Figure 1.3, we have pre-treament covariate X i = ( V i , ME i , G i , RE i , School i , SA i , MC i ).(a) Fit the following Bayesian Additive Regression Tree (BART) model to the stratiﬁed cluster sample D : Y i |{ M j } mj =1 , { T j } mj =1 , (cid:15) i , σ ∼ m (cid:88) j =1 g j ( Z i , X i ; T j , M j ) + (cid:15) i M j | T j ind. ∼ P Node T j ind. ∼ P Tree (cid:15) i | σ ∼ N (0 , σ ) σ ∼ P (6)where g j is the assignment function for tree j , T j is tree j where prior probabililty is assigned to anode splitting at each depth, M j := { u jb } B j b =1 are the leaf nodes for tree j with gaussian priors as-signed to each leaf u jb . Each independent tree has prior mass that favors shorter trees and leaf nodevalues near 0. There is a conjugate prior used for P , which is usually a inverse chi-squared withdata-informed hyperparameters. For additional information on the BART model speciﬁcation,refer to [Chipman et al., 2010]. The R package dbarts [Dorie et al., 2018, Chipman et al., 2010]is used for BART ﬁtting.(b) The model (6) is ﬁt to the stratiﬁed cluster sample. The posterior distribution for individual i inthe stratiﬁed cluster sample is deﬁned as τ BART ( X i ) := m (cid:88) j =1 g post j ( Z i = 1 , X i ; T j , M j ) − m (cid:88) j =1 g post j ( Z i = 0 , X i ; T j , M j ) (7)where g post j := g j |D is the posterior assignment function for tree j . For ease of notation, let τ BART i := τ BART ( X i ). Let I O ⊆ { , . . . , n } be the indices of individuals in the stratiﬁed clus-ter sample that belong to the subpopulation group O . Then the estimated distribution of theCATE for O is deﬁned as | I O | (cid:80) i ∈ I O τ BART i . Likewise, the estimated distribution for the ATE is n (cid:80) i ∈ [ n ] τ BART i . 9oint estimate and standard error come from the expected value and variance that’s calculated basedon the estimated distribution | I O | (cid:80) i ∈ I O τ BART i . This is the standard application of BART and itsvariants for causal inference, as seen in [Hahn et al., 2020, Hill, 2011].In the original NLSM study, a variant of BART (Bayesian causal forests [Hahn et al., 2020]) was usedto calculate CATEs for subpopulations of varying School Achievement Level.4. BART with poststratiﬁcation after imputing a point estimate of Prev-GPA for everypoststratiﬁcation cell (BARP-I) [Bisbee, 2019] (a) Fit the BART model in 6 to the stratiﬁed cluster sample D .(b) Let c be a cell for poststratiﬁcation matrix M and ( me c , g c , re c , s c , sa c , mc c ) be the correspondingcovariate of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education for cell c . Deﬁne ˆ v c to be a point estimate of Prev-GPA for poststratiﬁcationcell c , based oﬀ the sample D . There is no restriction on how to calculate ˆ v c – A straightforwardapproach would be to ﬁt a hierarchical regression model with Prev-GPA as the outcome and thecovariates being School-Achievement × Minority-Composition × School × Gender × Race/Eth-nicity × Maternal-Education, and then take the posterior mean of the linear predictor evaluatedat ( me c , g c , re c , s c , sa c , mc c ) as the point estimate for ˆ v c .(c) Infer the posterior predictive distributions Y rep ,n ,c ∼ p ( y ∗ | z = 1 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c , D ) Y rep ,n ,c ∼ p ( y ∗ | z = 0 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c , D )with the model in Equation 6. Do this for all cells c in poststratiﬁcation matrix M . For posteriorsof poststratiﬁcation cells with no observation in D , we use the out-of-sample capabilities of dbarts to infer the posterior predictive distributions Y rep ,n ,c and Y rep ,n ,c .(d) Use the poststratiﬁed distribution (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) to get the estimated distribution for τ ATE , and E (cid:32) (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) |D (cid:33) as the point estimate for τ ATE .In general, let I O be the index set containing poststratiﬁcation cells for the subpopulation O .This subpopulation does not need to be observed in our population. Use Equation 8ˆ τ CATE , O := (cid:80) j ∈ I O N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) (8)to get the estimated distribution for τ CATE , O , and E (ˆ τ CATE , O |D ) as the point estimate for τ CATE , O and V (ˆ τ CATE , O |D ) as the accompanying variance.5. MRP after imputing a point estimate of Prev-GPA for every poststratiﬁcation cell (MRP-I) (a) Using D , ﬁt a hierarchical linear model M Post-GPA with Post-GPA as the outcome, and Prev-GPAas a covariate. Assume the outcome is a truncated normal in (0 , . brms [B¨urkner, 2018, B¨urkner, 2017] will be used for ﬁtting M Post-GPA . For more detailson the speciﬁcation of M Post-GPA , refer to Section 5 of this manuscript.10b) Let c be a cell for poststratiﬁcation matrix M and ( me c , g c , re c , s c , sa c , mc c ) be the correspondingcovariate of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education for cell c . For every poststratiﬁcation cell c , deﬁne ˆ v c to be a point estimateof Prev-GPA for poststratiﬁcation cell c , based oﬀ the sample D (c) Using M Post-GPA , infer the posterior predictive distributions Y rep ,n ,c ∼ p ( y ∗ | z = 1 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c D ) Y rep ,n ,c ∼ p ( y ∗ | z = 0 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c , D )for every poststratiﬁcation cell c in M . For posteriors of poststratiﬁcation cells with no observationin D , we use brms to infer the posterior predictive distributions Y rep ,n ,c and Y rep ,n ,c (More detailson this is in Subsection 5.2.3).(d) Use the poststratiﬁed distribution (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) to get the estimated distribution for τ ATE , and E (cid:32) (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) |D (cid:33) as the point estimate for τ ATE .In general, let I O be the index set containing poststratiﬁcation cells for the subpopulation O .This subpopulation does not need to be observed in our population. Use Equation 8 to get theestimated distribution for τ CATE , O , and E (ˆ τ CATE , O |D ) as the point estimate for τ CATE , O and V (ˆ τ CATE , O |D ) as the accompanying variance.In our simulation studies, Step (b) for MRP-I and BARP-I is performed by ﬁrst using the observedsample D to ﬁt a hierarchical regression model M Prev-GPA with Prev-GPA as the outcome and thecovariates being School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education. M Prev-GPA in BARP-I is a BART speciﬁcation and M Prev-GPA in MRP-I is ahierarchical regression model speciﬁcation. For a more detailed speciﬁcation of M Prev-GPA for MRP-Iin our simulation studies, refer to Section 5 of this manuscript. Once M Prev-GPA is ﬁt, the posteriorpredictive distribution of Prev-GPA for every poststratiﬁcation cell c based oﬀ M Prev-GPA has its meanset as ˆ v c .Posterior predictive distributions for the outcome Post-GPA are used for methods BARP-I and MRP-I,since they are inferring Y rep ,n ,c , Y rep ,n ,c for poststratiﬁcation cells c that are not observed in D . Boththese methods are able to calculate CATEs for subpopulations not observed in D . On the other hand,OLS, SVY, BART are only able to estimate CATEs observed in D .As motivated in the previous subsection, including poststratiﬁcation-style estimators such as BARP-I and MRP-I is sensible as we’re provided a poststratiﬁcation matrix M . The current challenge ofperforming MRP in our simulation setting is that, if one wants to include non-census variable Prev-GPA as a covariate in the outcome model for MRP, then atleast one point estimate of Prev-GPA willhave to be used for every poststratiﬁcation cell in M . BARP-I and MRP-I performs this by imputing anestimated Prev-GPA quantity for every poststratiﬁcation cell. This is because M is a poststratiﬁcationmatrix of all combinations for the discrete covariates School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education, but there exists no discretization ofcontinuous covariate Prev-GPA where there is a poststratiﬁcation matrix of Prev-GPA × School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education.This challenge of having truncated continuous covariate Prev-GPA with no poststratiﬁcation matrix fora discretization of Prev-GPA motivates the application of the variant of MRP, MRP-MI. From a highlevel, MRP-MI estimates poststratiﬁcation frequencies and uses these estimated frequencies to performpoststratiﬁcation. This accounts for the full distribution of Prev-GPA in every poststratiﬁcation cell c ,unlike what MRP-I and BARP-I does by taking Prev-GPA point estimate ˆ v c .11 Multilevel Regression and Poststratiﬁcation with Multiple Im-putation

We use the MRP variant, Multilevel Regression and Poststratiﬁcation with Multiple Imputation (MRP-MI),as a viable framework for estimating treatment eﬀects for our simulation study. It can be applied when themodeler is given a poststratiﬁcation matrix ˜ M ∈ R J × K for the K -dimensional categorical covariate vector˜ C ∈ R K and wants to also poststratify by a continuous covariate ˜ V ∈ R but that continuous covariate doesnot have population counts in poststratiﬁcation matrix ˜ M .As mentioned in subsection 1.4, using Post-GPA - Prev-GPA as the outcome is not possible, thus we haveto include Prev-GPA as a covariate and Post-GPA as the outcome when modeling and estimating treatmenteﬀects in our causal inference problem (Figure 1.3). In our simulation study, the poststratiﬁcation matrix M does not contain population counts for strata of Prev-GPA, but we’d still like to poststratify by Prev-GPA × School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Educationand MRP-MI is able to do this. The three steps below outline the framework for MRP-MI. More detail andmotivation for model speciﬁcations relevant to our simulation study are in Section 5. brms is used to ﬁt thePost-GPA and Prev-GPA models used in the MRP-MI framework.1.

Model ﬁtting step:

Fit a hierarchical regression model M Post-GPA to the observed sample D =( V i , C i , Z i , Y i ) ni =1 , where V i is Prev-GPA, C i = (ME i , G i , RE i , School i , SA i , MC i ) is the categoricalcovariate, and Z i is the treatment variable and Y i is the Post-GPA outcome for individual i . For moredetails on the speciﬁcation of M Post-GPA in our simulation study, refer to Equation 19 in Section 5 ofthis manuscript.Additionally, ﬁt a hierarchical regression model M Prev-GPA to the observed sample D , where Prev-GPA V i is the outcome and the covariates are C i . For more details on the speciﬁcation of M Prev-GPA , referto Equation 16 in Section 5 of this manuscript.2.

Sampling for Post-GPA PPD step:

Let p ( y ∗ | c, z, v ∗ , D ) be the posterior predictive distributionof Post-GPA for the new datapoint ( c, z, v ∗ ) based on M Post-GPA . Let p ( v ∗ | c, D ) be the posteriorpredictive distribution of Prev-GPA for the new datapoint ( c, z ) based on M Prev-GPA . Deﬁne Y rep ,nz,c ∼ p ( y ∗ | c, z, D ), where p ( y ∗ | c, z, D ) = (cid:90) p ( y ∗ | c, z, v ∗ , D ) p ( v ∗ | c, z, D ) dv ∗ = (cid:90) p ( y ∗ | c, z, v ∗ , D ) p ( v ∗ | c, D ) dv ∗ , since M Prev-GPA does not have z as a covariate (9)Hence p ( y ∗ | c, z, D ) is the posterior predictive distribution of Post-GPA in poststratiﬁcation cell c , afterintegrating out the posterior predictive distribution of Prev-GPA for cell c . To retrieve samples from p ( y ∗ | c, z, D ), the below two-step procedure performs it: Algorithm 1:

MRP-MI

Result:

Return 1000 samples ( y ∗ jz,c ) j =1 of Y rep ,nz,c ∼ p ( y ∗ | c, z, D ) for every combination of( z, c ) m = 1000; for all combinations of ( c, z ) dofor j ∈ { , . . . , m } do Sample v ∗ jc from p ( v ∗ | c, D );Sample y ∗ jz,c from p ( y ∗ | c, z, v ∗ jc , D ) endend For poststratiﬁcation cells c not observed in D , the out-of-sample capabilities of brms are used tosample v ∗ jc and y ∗ jz,c . More details on how this is done can be found in Section 5.2.3.12he above algorithm uses the posterior predictive distribution of M Prev-GPA to impute Prev-GPA 1000times in every poststratiﬁcation cell c , hence the acronym MI (multiple imputation). Note that theabove algorithm actually generates the samples (cid:16) θ [ j ] v , v ∗ jc , θ [ j ] y , y ∗ jc,z (cid:17) j =1 , where (cid:16) θ [ j ] v (cid:17) j =1 and (cid:16) θ [ j ] y (cid:17) j =1 are posterior samples of the Prev-GPA and Post-GPA model respectively. However, this way of sam-pling introduces a dependency structure into the joint random variable (after conditioning on sample D ) (cid:16) Y rep ,n , , . . . , Y rep ,n ,J , Y rep ,n , , . . . , Y rep ,n ,J (cid:17) To make samples from this joint random variable conditionally mutually independent (when conditionedon the dataset D ), we have to use mutually exclusive posterior draws of (cid:16) θ [ j ] v , θ [ j ] y (cid:17) j =1 . This meansusing the ﬁrst J draws for Y rep ,n , , the second J draws for Y rep ,n , and so on. This wouldn’t befeasible if J is large, which in MRP applications it usually is.3. Poststratiﬁcation step:

The estimated distribution for CATE of subpopulation O based on dataset D is ˆ τ CATE , O = (cid:80) c ∈ I O N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) (10)Based oﬀ the above algorithm, we get 1000 samples of the estimator deﬁned in Equation 10 through  (cid:80) c ∈ I O N c (cid:16) y ∗ j ,c − y ∗ j ,c (cid:17)(cid:80) c (cid:48) ∈ I O N c (cid:48)  j =1 (11)We have the below proposition for the framework MRP-MI when it’s used to calculate the point estimateof the CATE for subpopulation O . The proof of this proposition is in Supplementary Material A. Proposition 1.

Let g z ( c, v ∗ ) = (cid:82) y ∗ p ( y ∗ | z, c, v ∗ , D ) dy ∗ be the expected value of the posterior predictivedistribution of Post-GPA model M Post-GPA under treatment Z = z , categorical covariate C = c and Prev-GPA V = v ∗ . Calculating E (ˆ τ CATE , O |D ) is integrating g ( c, v ∗ ) − g ( c, v ∗ ) with the continuous probabilitymeasure deﬁned by (cid:110) N c (cid:80) c (cid:48)∈ I O N c (cid:48) p ( v ∗ | c, D ) dv ∗ (cid:111) ( c,v ∗ ) ∈ I O × (0 , . .If we consider g z ( c, v ∗ ) as an estimate for E ( Y ( z ) | C = c, V = v ∗ ) then Proposition 1 shows that E (ˆ τ CATE , O |D )is poststratifying g ( c, v ∗ ) − g ( c, v ∗ ) with the poststratiﬁcation weights deﬁned by P ( C = c, V = v ∗ | ( C, V ) ∈ O ) = P ( V = v ∗ | C = c, ( C, V ) ∈ O ) P ( C = c | ( C, V ) ∈ O )= p ( v ∗ | c, D ) N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (12)This is because (cid:110) p ( v ∗ | c, D ) N c (cid:80) c (cid:48)∈ I O N c (cid:48) (cid:111) ( c,v ∗ ) ∈ I O × (0 , . can be viewed as the poststratiﬁcation weightsfor all combinations of Prev-GPA × School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education when Prev-GPA is discretized very ﬁne.In contrast to the methods BARP-I and MRP-I which impute Prev-GPA just once for every poststrat-iﬁcation cell, we see that MRP-MI (with M Post-GPA and M Prev-GPA speciﬁed in Section 5) accounts forthe full distribution of Prev-GPA in every poststratiﬁcation cell, p ( v ∗ | c, D ), by integrating it out as seen inEquation 9. The trade-oﬀ of accounting for the whole distribution of Prev-GPA in every poststratiﬁcationcell is that the computational cost of getting 1000 samples for ˆ τ CATE , O is signiﬁcantly higher.13 Simulation results

A detailed outline of the data generating process for our simulation study that modeled the NSLM studyis provided in this section. We ﬁrst generate a ﬁnite population P from a superpopulation, then utilize astratiﬁed sampling scheme to generate the observed sample D .Simulation results of calculating CATEs for subpopulations of varying sizes is presented and the sixtreatment eﬀect estimation methods are compared. We evaluate the eﬃcacy of each estimation methodwith various metrics that measure the quality of point estimation, standard error and uncertainty intervalcoverage.The individuals in NSLM were recruited through a stratiﬁed cluster sample, where the clusters are schools.Stratiﬁcation was performed on school-level categorical covariates, and the outcome variable of interest wasPost-GPA. In the end, around 70 high-schools had agreed to participate in the study, and randomizationof the growth mindset intervention was performed at the individual level of the stratiﬁed cluster sample.Despite the population of US public high-schools being 12 , th grade public high-school students in the United States. R [R Core Team, 2020] was the computing language used for the simulation studies and the ggplot2 package [Wickham, 2016] was used to generate the plots in this manuscript. brms was the library used forhierarchical model ﬁtting. [B¨urkner, 2018, B¨urkner, 2017, Carpenter et al., 2017]. P Let i be the index of an individual, k be index of the school, s be index of the stratum. V iks is the previous-GPA for student i in school k in stratum s . Y iks is the post-GPA and Z iks is the treatment-control indicator.Page 31 of the appendix of the NSLM study [Yeager et al., 2019] contained descriptive statistics of thestudent-level and school-level covariates seen in Figure 1.3. These descriptive statistics were used to deﬁnethe coeﬃcients used in the simulated version of the NSLM study as seen below.The sizes of each stratum, M s , are (2806 , , , , M s U School ks , T School ks just once: U School ks i.i.d ∼ N (0 , . ) T School ks i.i.d ∼ N (0 , . ) (13) U School ks , T School ks are on the same scale of the treatment eﬀects seen in Extended Data Figure 2 in theoriginal NSLM study [Yeager et al., 2019]. P For each stratum, s , let M ks ∼ Poisson(200) be the school size for school k within stratum s . Then for i ∈ [ M ks ], we generate the covariates along with previous and post GPA conditionally independently:14 iks ∼ Bernoulli(0 . V iks ∼ N (0 , . ( µ X [ s ] , σ X [ s ])RE iks ∼ Cat ( p RE [ s ])ME iks ∼ Bernoulli ( p ME [ s ]) Y iks ( Z iks ) | G iks , V iks , RE iks , ME iks , U School ks , T School ks ∼ N (0 , . (Max(0 , Min(4 . , V iks + τ SA [ s ] Z iks + τ MC [ s ] Z iks + τ G [G iks ] Z iks + τ RE [RE iks ] Z iks + τ ME [ME iks ] Z iks + U School ks Z iks + T School ks )) , .

6) (14)G iks is the gender indicator, RE iks is the race/ethnicity, ME iks is the maternal education indicator, X iks is the previous GPA, Y iks is the post-GPA and Z iks is the treatment indicator. The coeﬃcient vectors µ X , σ X , p RE , p ME , τ SA , τ MC , τ G , τ RE , τ ME are deﬁned in the Supplementary Material C.We can assume that this data generating process for P = ( Y m (0) , Y m (1) , X m ) N P m =1 which is around 2.2million individuals is from the data generating process of i.i.d draws from a superpopulation ( Y (0) , Y (1) , X ).This is because ﬁxing the strata sizes M s is equivalent to sampling from a 5-dimensional categorial variablewith mean being equal to 5-dimensional vector (cid:16) M s (cid:80) s (cid:48) =1 M s (cid:48) (cid:17) s =1 in the asymptotic regime. P The population P is generated once and then a high-quality stratiﬁed cluster sample D of around 12 , Stratiﬁed sample of clusters (schools):

Sample (28 , , , ,

27) schools randomly (without re-placement) from each of the 5 stratum respectively. This returns students in 140 schools.2.

Subsample clusters with response probabilities on units in clusters:

Select Poissson (cid:0) (cid:1) schools from the 140. Every student in the population has response probability invlogit − ( V iks ), theinverse-logit of the previous GPA for that student. Deﬁne this size n sample as D .The individual response probability invlogit − ( V iks ) is 92 percent. This agrees with the high responserate in the NSLM study [Yeager et al., 2019].Generating P is done using the R package DeclareDesign [Blair et al., 2019].

DeclareDesign assigns Z i at the the ﬁnite population level. That is, the assignment mechanism at the ﬁnite population level is completerandomization of ( Z i ) N P i =1 , giving (cid:80) N P i =1 Z i = N P . Because D is a very small sample of the ﬁnite population P ( D is 0.5 percent of P ), the Z i in D are approximately independent Bernoulli(0 .

5) samples. Hence theassignment mechanism for D can be thought of as a Bernoulli randomization—for every individual in D , weindependently assignment them a 50 percent probability of being in the treatment group. As expected, thestratiﬁed cluster sample D still maintains close to a 50 −

50 randomization between treatment and control.It’s clear to see that identiﬁability assumptions 1 - 4 are satisﬁed for D . SUTVA and Consistencyare satisﬁed through the data generating process for P , Ignorability is satisﬁed through the assignmentmechanism for ( Z i ) N P i =1 , and through the assignment mechanism we have P ( Z i = 1 | X i ) = 0 . P and thensample from it, we calculate CATEs and ATE at the superpopulation level.Generated D was fairly representative of the target population P . The Tipton Generalization Index[Tipton, 2014] for D along with the subgroups we analyze in each simulation iteration was calculated to benear 0 . generalize package [Ackerman et al., 2019] with random forests being theselection method). This matches the high Generalization Index of the sample in the original NSLM study. As the identiﬁability assumptions are satisﬁed for the data generating process of P , we can derive formulasfor τ CATE , O for any subpopulation O . 15iven that U School ks = u ks , T School ks = t ks for every stratum s and school k , let Y ( z ) be a random samplefrom the population deﬁned in Equation 14, under treatment Z = z . The G-formula [Hern´an and Robins, 2020]applied to the data-generating process recovers the true ATE: E ( Y ( z )) = (cid:88) s (cid:88) k E ( Y ( z ) | z, s, u ks , t ks ) M s (cid:80) s (cid:48) M s (cid:48) M s = (cid:88) s (cid:88) k (cid:88) me (cid:88) g (cid:88) re (cid:18)(cid:90) . E ( Y ( z ) | z, s, me, g, re, v, u ks , t ks ) p ( v | s ) dv (cid:19) · p ( re | s ) p ( me | s ) p ( g ) M s (cid:80) s (cid:48) M s (cid:48) M s (15)Applying the G-formula for the whole target population gives τ ATE := E ( Y (1)) − E ( Y (0)) = 0 . E ( Y ( z ) | z, s, me, g, re, k ) = (cid:82) . E ( Y ( z ) | z, s, me, g, re, v, u ks , t ks ) p V | S ( v | s ) dv is the expectedvalue of Post-GPA under Z = z for poststratiﬁcation cell School-Achievement × Minority-Composition × Gender × Race/Ethnicity × Maternal-Education × School = ( s, g, re, me, k ), after integrating out Prev-GPA. Hence summing E ( Y ( z ) | z, s, me, g, re, k ) P ( s, me, g, re, k | X ∈ O ) over all the poststratiﬁcation cells insubpopulation O returns E ( Y ( z ) | X ∈ O ), which in turn will result in the CATE τ CATE , O . M Prev-GPA and M Post-GPA which areused in BARP-I, MRP-I, MRP-MI

Prior predictive checks [Gabry and Mahr, 2021, Gabry et al., 2019, Gelman et al., 2013] as seen in Figure 1were used to calibrate the prior speciﬁcations of the hierarchical models M Prev-GPA and M Post-GPA . Fittedwith the package brms , both hierachical models assume a truncated normal outcome in the range (0 , . M Prev-GPA and M Post-GPA , models which are used in MRP-MI, MRP-Iand BARP-I. (Left) The dark-blue line corresponds to the density of Prev-GPA observations in D . The 200light-blue lines correspond to the densities of 200 prior predictive draws of Prev-GPA for individuals in D .(Right) The dark-blue line corresponds to the density of Post-GPA observations in D . The 200 light-bluelines correspond to the densities of 200 prior predictive draws of Post-GPA for individuals in D . M Prev-GPA

The Prev-GPA model is ﬁt to the stratiﬁed cluster sample D (a new ﬁt happens for every simulation iterationsince a new D is sampled from P ). 16or the i th individual, deﬁne α RE j [ i ] , α School j [ i ] , α MC j [ i ] , α SA j [ i ] , α SA,MC j [ i ] to be their unique random intercepts forthe covariates Race/Ethnicity, School, Minority Composition, School Achievement, Minority Composition × School Achievement respectively.The outcome is modeled as: V i | ME i , G i , α RE j , α School j , α MC j , α SA j , α SA,MC j ∼ N (0 , . ( β + β ME ME i + β G G i + α RE j [ i ] + α School j [ i ] + α MC j [ i ] + α SA j [ i ] + α SA,MC j [ i ] , σ (cid:17) (16)Covariates = RE, School, MC, SA, and SA,MC have atleast 3 levels and have their priors modeled as: α Covariates j | σ Covariates i.i.d ∼ N (cid:0) , ( σ Covariates ) (cid:1) σ Covariates ∼ N + (cid:0) , . (cid:1) (17)Covariates = ME and G are binary and have their priors (along with the global intercept) modeled as: β Covariates ∼ N (cid:0) , . (cid:1) β ∼ N (cid:0) . , . (cid:1) (18) M Post-GPA

The Post-GPA model is ﬁt to the stratiﬁed cluster sample D (a new ﬁt happens for every simulation iterationsince a new D is sampled from P ).For the i th individual, deﬁne γ RE j [ i ] , γ School j [ i ] , γ MC j [ i ] , γ SA j [ i ] , γ SA,MC j [ i ] to be their unique random intercepts when Z i = 1, for the covariates Race/Ethnicity, School, Minority Composition, School Achievement, MinorityComposition × School Achievement respectively.The outcome is modeled as: Y i | V i , Z i , ME i , G i , α RE j , α School j , α MC j , α SA j , α SA,MC j ∼ N (0 , . ( β + Oﬀset( V i ) + β ME ME i + β G G i + γ ME ME i Z i + γ G G i Z i + α RE j [ i ] + α School j [ i ] + α MC j [ i ] + α SA j [ i ] + α SA,MC j [ i ] + γ RE j [ i ] Z i + γ School j [ i ] Z i + γ MC j [ i ] Z i + γ SA j [ i ] Z i + γ SA,MC j [ i ] Z i , σ (cid:17) (19)Covariates = RE, School, MC, SA, and SA,MC have atleast 3 levels and have their priors modeled as: α Covariates j | σ Covariates i.i.d ∼ N (cid:0) , ( σ Covariates ) (cid:1) σ Covariates ∼ N + (cid:0) , . (cid:1) γ Covariates j | σ Covariates γ i.i.d ∼ N (0 , ( σ Covariates γ ) ) σ Covariates γ ∼ N + (0 , . )Cor( γ Covariates j , α Covariates j ) ∼ Default prior in brms (20)Covariates = ME and G are binary and have their priors (along with the global intercept) modeled as: β Covariates ∼ N (cid:0) , . (cid:1) γ Covariates ∼ N (cid:0) , . (cid:1) β ∼ N (cid:0) , . (cid:1) (21)17 .2.3 Out-of-sample prediction with M Prev-GPA and M Post-GPA

The treatment eﬀect estimation methods MRP-I and MRP-MI use M Prev-GPA and M Post-GPA to samplefrom the Post-GPA and Prev-GPA posterior predictive distributions ( Y rep,n z,c , V rep,n z,c ) of poststratiﬁcationcells c . BARP-I uses M Prev-GPA to sample from the Prev-GPA posterior predictive distribution V rep,n z,c ofpoststratiﬁcation cells c .In our simulation studies, inferring posterior predictive samples of ( Y rep,n z,c , V rep,n z,c ) for a cell c not observedin D requires sampling from new levels j of the posteriors α School j and γ School j . This is because D is a stratiﬁedcluster sample consisting of around 70 schools (school is a cluster), leaving 11000 other unobserved schools inthe target population. Sampling from the posterior of α School j for an unobserved school j is done by samplingfrom a normal distribution with mean implied by the posteriors α School j ∗ of all the observed schools j ∗ , andstandard deviation implied by the posterior distribution of σ School . Likewise, sampling from the posterior of γ School j for an unobserved school j is done by sampling from a normal distribution with mean implied by theposteriors γ School j ∗ of all the observed schools j ∗ , and standard deviation implied by the posterior distributionof σ School γ . Point estimation quality, standard error quality, and quality of uncertainty interval is evaluated through thethree metrics when estimating CATE for subpopulation O . Let M be the number of simulation iterations.We will sample D , . . . , D M from the same target population P ( P resets after each sample) in each newsimulation iteration based on the same stratiﬁed sampling scheme.For each of the six estimation methods, based on sample D m , deﬁne E (ˆ τ CATE , O |D m ) to be the pointestimate for τ CATE , O , and ˆ σ m,CATE , O to be the standard error that accompanies the point estimate. τ CATE , O (MSE) MSE measures the quality of the point estimate for each of the 6 treatment eﬀect estimation methods inour simulation study. For each method and for every simulation iteration, we calculate the empirical MSEin Equation 22: L MSE ( D , . . . , D M ) := 1 M M (cid:88) m =1 ( E (ˆ τ CATE , O |D m ) − τ CATE , O ) = (cid:32) M M (cid:88) m =1 E (ˆ τ CATE , O |D m ) − τ CATE , O (cid:33) + 1 M M (cid:88) m =1 (cid:32) E (ˆ τ CATE , O |D m ) − M M (cid:88) m (cid:48) =1 E (ˆ τ CATE , O |D m (cid:48) ) (cid:33) (22) τ CATE , O (PSR) To simultaneously evaluate the quality of a point estimate and the accompanying standard error estimate,we utilize a metric from decision theory, which only depends on the ﬁrst and the second moment of aprobabilistic forecast. Consider the following metric [Gneiting and Raftery, 2007]: L PSR ( D , . . . , D M ) := − M M (cid:88) m =1 (cid:18) τ CATE , O − E (ˆ τ CATE , O |D m )ˆ σ m,CATE , O (cid:19) − M M (cid:88) m =1 log(ˆ σ , O ) (23)A higher PSR value corresponds to a higher quality prediction based oﬀ this metric.18 .3.3 Coverage probabililty of uncertainty intervals Lastly, we will measure the quality of the uncertainty interval generated by the estimation method. Let l ( m )0 . (ˆ τ CATE , O ) , u ( m )0 . (ˆ τ CATE , O ) be the bounds of the symmetric 95-percent uncertainty interval for thepoint estimate E (ˆ τ CATE , O |D m ). The coverage probability of the uncertainty interval is then: L Coverage ( D , . . . , D M ) := 1 M M (cid:88) m =1 (cid:16) l ( m )0 . (ˆ τ CATE , O ) ≤ τ CATE , O ≤ u ( m )0 . (ˆ τ CATE , O ) (cid:17) (24) × lowest minoritycomposition and CATE for African Americans in lowest achieving schools Figure 2 shows the 5 competing methods, along with MRP-MI and their point estimates for the ATE andtwo CATEs. As the target group in the population gets smaller, the traditional treatment eﬀect estimationmethods, SVY and OLS in particular, produce more variable point estimates.As the area gets smaller, to the point where we are looking at 3.2% of the population, MRP-MI’s pointestimates of the target CATE remain tight around the truth. The frequentist regression methods SVYand OLS exhibit higher spread in point estimates (though they remain unbiased). This isn’t surprising, asBayesian methods (MRP-MI, BARP-I, BART, MRP-I) tend to outperform frequentist approaches (SVY,OLS) in the presence of limited data. In contrast to this, calculating the ATE uses the full sample D andthis corresponds to 100% of the population. All the methods except MRP-I have point estimate quality asnearly identical when calculating the ATE.Figure 2: Left: Point estimate of ATE based on the full stratiﬁed cluster sample of around 12000 observations.Middle: Point estimate of CATE for individuals in the highest-achieving schools with the lowest minoritycomposition. Right: Point estimate of CATE for individuals in the lowest-achieving schools and are African-Americans. The labels in the top-left of each plot are the proportion of the population that the grouprepresents and the expected sample size of that group in D . 100 simulations were conducted. The redhorizontal line is the true treatment eﬀect τ CATE , O for the speciﬁc subpopulation O on the plot.In Figure 2, we visualized point estimates for the ATE and two CATEs of varying sizes in the targetpopulation (size is in reference to the percent of the population in the top-left of the three plots in Figure2). Table 2 contains the MSE and PSR for each of the six methods. MRP-MI is shown to have the highestPSR based on Table 2, implying that it strikes the best balance of producing accurate point estimateswith reasonable standard errors. Based on Table 2, MRP-MI is shown to have the best MSE, except forcalculating the CATE corresponding to the group of individuals who are in the lowest achieving schools andare African-American (Race/Eth. coded as 2). However, the MSE of MRP-MI (10.70) is still close to beingthe lowest for this CATE (9.60).An important detail to note is that, even though MRP-I appears to be very similar to MRP-MI in pointestimate quality when just looking at Figure 2, we can see from Table 2 that the MSE values for the ATE19re drastically diﬀerent. MRP-MI’s MSE is more than twice as small as MRP-I’s MSE when the targetestimand is the ATE. This decrease in point estimation quality is attributed to MRP-I not accounting forthe full Prev-GPA distribution in every poststratiﬁcation cell c . It appears that this MSE diﬀerence betweenMRP-MI and MRP-I gets smaller as the target subpopulation becomes a smaller proportion of the targetpopulation. Model ATE (MSE × − ) ATE (PSR) SA3 MC1 (MSE × − ) SA3 MC1 (PSR) SA1 RE2 (MSE × − ) SA1 RE2 (PSR)OLS 1.59 7.30 7.83 6.12 50.44 4.31SVY 1.40 7.83 7.94 5.87 46.29 4.33BART 1.64 7.54 10.90 3.18 21.55 0.13BARP-I 1.97 7.55 15.17 3.67 19.11 3.05MRP-I 2.97 7.07 5.55 6.44 Table 2: Mean square error ( L MSE ) and the proper scoring rule ( L PSR ) value for the ATE, CATE for thegroup of individuals who are in the highest achieving schools with the lowest minority composition, andCATE for the group of individuals who are in the lowest achieving schools and are African-American. 100simulations were conducted. The best metric value in each column is in bold. Values are rounded to twodecimal places.The coverage probabilities in Table 3 shows that when calculating the ATE, all six methods have well-calibrated uncertainty intervals as they are all close to the nominal value of 95%. As the target subpopulationgets smaller (size is in reference to the percent of the population in the top-left of the three plots in Figure2), we see that BARP-I and BART have poor calibration. The frequentist methods SVY and OLS retainintervals with 95% coverage probability.Model ATE SA3 MC1 SA1 RE2OLS 0.93 0.93 0.94SVY 0.97 0.91 0.93BART 0.90 0.58 0.45BARP-I 0.91 0.76 0.60MRP-I 0.87 0.98 0.99MRP-MI 0.97 0.97 0.98Table 3: Coverage probabilities L coverage of 95 uncertainty intervals based on the various treatment eﬀectestimation methods. The estimands are the ATE, CATE for the group of individuals who are in the highestachieving schools with the lowest minority composition, and CATE for the group of individuals who are inthe lowest achieving schools and are African-American. 100 simulations were conducted.In general, we see that MRP-MI is able to capture the true heterogeneity of treatment eﬀects basedon Figure 2 and the MSE columns of Table 2, even for areas as small as 3.2% of the target population.MRP-MI produces low variance and low absolute bias estimates as seen in Figure 2. The coverage ofthe 95% uncertainty intervals for MRP-MI remains well-calibrated and hence close to 95% coverage whencalculating the ATE (whole target population) and calculating CATEs for smaller subpopulations of thetarget population. MRP-MI appears to have the highest quality for point estimation, standard errors, andinterval coverage.The more simplistic methods MRP-I and BARP-I do not account for the full distribution of the continuousnon-census variable Prev-GPA in every poststratiﬁcation cell as they impute only one Prev-GPA value,whereas MRP-MI accounts for the full distribution of Prev-GPA at the cost of additional computation. Theeﬀect of not accounting for the full Prev-GPA distribution is most drastic when the estimand is the ATE.MSE, PSR and conﬁdence interval coverage of BARP-I and MRP-I is worse than for MRP-MI. Studyingthe eﬀects of not accounting for the full distribution of a non-census variable is a subject of future MRPresearch. 20 .5 Estimating the CATEs of the ﬁve subpopulation groups in highest achievingschools × lowest minority composition × race/ethnicity To further test the extent that MRP-MI and the other ﬁve methods capture treatment eﬀect heterogeneityfor small areas, we look at subpopulations as small as 1.3% of the target population.Speciﬁcally, the target subpopulations are the ﬁve race/ethnicities in the stratum highest achievingschools × lowest minority composition. Figure 3 shows the CATE point estimates for CATEs of these ﬁvesubpopulations.Figure 3: Point estimates of τ CATE , O where O corresponds to the ﬁve race/ethnicities in the stratum highestachieving schools × lowest minority composition. 100 simulations were conducted. The labels in the top-leftof each plot are the proportion of the population that the group represents and the expected sample size forthat group in D . The red horizontal line is the true treatment eﬀect τ CATE , O for the speciﬁc subpopulation O on the plot.Figure 3 shows the ﬁve subpopulations ranging from 1.3% of the population to 10.3% of the population.Their expected sample sizes in D vary from 110 to 890 individuals. One average, the response rates areequal amongst the ﬁve subpopulations in the stratum of highest achieving schools with the lowest minoritycomposition based oﬀ our experimental design for collecting D .In the stratum of highest achieving schools with the lowest minority composition, the subpopulation withthe smallest expected sample size of 110 in D are Asian individuals. From Figure 3 we see that the Bayesianestimation methods (MRP-MI, MRP-I, BART, BARP-I) have lower variance amongst their point estimatesthan the frequentist methods’ (SVY, OLS) point estimates.In the stratum of highest achieving schools with the lowest minority composition, the subpopulation withthe highest expected sample size of 890 in D are White individuals. From Figure 3, we see that the Bayesianmethods (MRP-MI, MRP-I, BART, BARP-I) appear more similar to the frequentist methods (SVY, OLS).Unsurprisingly, the frequentist methods (SVY, OLS) have the highest variance estimates in the presence oflimited data such as in the subpopulations of Asian individuals and African-American individuals in Figure3. From Figure 3 MRP-MI and MRP-I appear quite similar and both have a good balance of generatinglow-variance and low-bias estimates. This implies that the MSE of MRP-MI and MRP-I should be scoredhighly amongst the competing methods. Indeed this is true from Table 4 — MRP-MI has the best MSEfor all ﬁve CATEs and MRP-I has the second best MSE for all ﬁve CATEs. The shrinkage of the Bayesiannonparametric tree-based methods (BART, BARP-I) with their default priors appear to be too strong forestimating the heterogeneous treatment eﬀects of small areas, since they are shown to be the most biasedyet have the lowest variance in the ﬁve panels of Figure 3.Table 4 further shows that MRP-MI has the best PSR values for each of the ﬁve CATEs. There is alarge drop oﬀ in estimation quality for the Bayesian nonparametric tree-based methods (BART, BARP-I),as seen by their low PSR values for calculating CATEs of Asian individuals (RE1) and individuals of otherrace/ethnicity (RE5). 21odel RE1 RE2 RE3 RE4 RE5OLS (MSE × − ) 118.90 93.98 36.09 12.07 46.79SVY (MSE × − ) 103.65 81.45 36.88 12.21 45.08BART (MSE × − ) 54.83 9.12 18.79 8.99 46.32BARP-I (MSE × − ) 54.13 8.76 19.38 8.95 49.49MRP-I (MSE × − ) 14.27 7.10 6.29 5.56 10.70MRP-MI (MSE × − ) OLS (PSR) 3.49 3.68 4.60 5.78 4.34SVY (PSR) 3.57 3.78 4.21 5.61 3.91BART (PSR) -9.05 5.94 1.46 5.88 -7.08BARP-I (PSR) -2.87 5.60 3.31 5.52 -1.98MRP-I (PSR) 5.46 6.12 6.25 6.41 5.80MRP-MI (PSR)

Table 4: Mean square error ( L MSE ) and the average proper scoring rule value ( L PSR ) based on the varioustreatment eﬀect estimation methods. The ﬁrst six rows correspond to L MSE estimates and the last six rowscorrespond to L PSR estimates. The estimands are CATEs for all ﬁve race/ethnicities in the subpopulationof highest achieving schools × lowest minority composition. RE1 corresponds to Asian individuals, RE2corresponds to African-American individuals, RE3 corresponds to Hispanic individuals, RE4 corresponds toWhite individuals, RE5 corresponds to individuals of other race/ethnicity. 100 simulations were conducted.The best metric value in each column is in bold. Values are rounded to two decimal placesTable 5 shows that MRP-MI has well-calibrated 95% uncertainty intervals despite the heterogeneity oftreatment eﬀects across the ﬁve small subpopulations. The Bayesian nonparametric tree-based methods(BARP-I, BART) have the worst calibration as their intervals’ coverage probabilities are quite far from 95percent. Model RE1 RE2 RE3 RE4 RE5OLS 0.94 0.94 0.96 0.98 0.99SVY 0.95 0.92 0.89 0.94 0.93BART 0.21 0.84 0.50 0.82 0.34BARP-I 0.28 0.90 0.72 0.89 0.38MRP-I 0.99 0.99 1.00 0.98 0.96MRP-MI 0.99 0.98 0.98 0.97 0.94Table 5: Coverage probabilities L coverage of 95 uncertainty intervals based on the various treatment eﬀectestimation methods. The estimands are CATEs for the ﬁve race/ethnicities in the subpopulation highestachieving schools × lowest minority composition. RE1 corresponds to Asian individuals, RE2 correspondsto African-American individuals, RE3 corresponds to Hispanic individuals, RE4 corresponds to White indi-viduals, RE5 corresponds to individuals of other race/ethnicity. 100 simulations were conducted.In summary, MRP-MI appears to be the most eﬀective in estimating treatment eﬀects for very largeareas (ATE) to very small areas in the target population when there is treatment eﬀect heterogeneity.The accompanying standard errors and uncertainty intervals produced by MRP-MI are also well-calibrated.Frequentist methods (SVY, OLS) appear to perform well when the observed sample D is suﬃciently largebut have extremely high variance estimates when they’re used to calculate CATEs for small-areas, as data ismore limited. Bayesian nonparametric tree-based methods (BART, BARP-I) have default priors that exhibitvery strong shrinkage when compared to hierarchical modeling approaches (MRP-MI, MRP-I) for calculatingheterogeneous CATEs in small areas and furthermore their uncertainty intervals have poor coverage whenthe target area gets smaller. A further area of research would be to perform prior calibration such as priorpredictive checks for these Bayesian nonparametric methods.Even though the point estimates Figures 2 and 3 show that MRP-I and BARP-I have comparable pointestimates to MRP-MI, we do not recommend using MRP-I and BARP-I for a causal inference problemsimilar to the one in our simulation study. They do not account for the full distribution of the continuous22ovariate Prev-GPA, whereas MRP-MI does.MRP-MI, MRP-I and BARP-I get posterior predictive samples fromˆ τ CATE , O = (cid:80) c ∈ I O N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) As mentioned in Section 4, MRP-MI introduces a dependency structure (after conditioning on D ) to (cid:16) Y rep ,n , , . . . , Y rep ,n ,J , Y rep ,n , , . . . , Y rep ,n ,J (cid:17) (25)since the same 1000 posterior samples from M Prev-GPA and M Post-GPA are used to sample from Y rep ,nz,c for every z and c . MRP-I and BARP-I use the same 1000 posterior samples from their respective outcomemodel ﬁts to get samples from Y rep ,nz,c for every z and c as well, and hence a dependency structure existsin Equation 25 for MRP-I and BARP-I as well. The accompanying standard error (cid:112) V (ˆ τ CATE , O |D ) forthe point estimate of τ CATE , O is aﬀected by the dependency structure in Equation 25 and studying thisrelationship for MRP-I, BARP-I and in particular MRP-MI, is an unexplored area. So far, the subpopulations O where CATEs are estimated are subpopulations that are observed in the strat-iﬁed cluster sample D . BARP-I, MRP-I and MRP-MI use the estimated distribution ˆ τ CATE , O to calculateCATE point estimates and this can be done even for O not observed in D . In this subsection, we showcasethe extent that MRP-MI is able to estimate τ CATE , O for all the schools in the target population P .Stratiﬁed cluster sample D in our simulation study contains around 70 schools for every new simulationiteration and thus there are around 11000 schools left unobserved. We are however given poststratiﬁcationcell sizes N c for every cell c in the population and thus are able to calculate ˆ τ CATE , O for every school O using MRP-MI (as well as MRP-I and BARP-I). Figure 4 shows MRP-MI point estimates E (ˆ τ CATE , O |D ) forevery school in our simulation studies.From the middle histogram of Figure 4, we see that point estimates of all 11221 schools E (ˆ τ CATE , O |D )from MRP-MI are not able to capture the full variation of CATEs τ CATE , O using D , even though we’regiven poststratiﬁcation counts N c for all the cells c in schools not in D . In fact, the out-of-sample CATEestimates of MRP-MI in the middle histogram appear to be quite similar to the in-sample CATE estimatesof MRP-MI in the bottom histogram. This implies that even with an external poststratiﬁcation matrix onthe full population, it remains diﬃcult to calculate treatment eﬀects for subpopulations not observed in thesample when using MRP-MI. We observed very similar challenges for MRP-I and BARP-I.The top scatterplot further reinforces the claim that MRP-MI is limited when calculating CATEs forschools not observed in D . In summary, MRP-MI (along with MRP-I and BARP-I) clusters CATE cal-culations for all the schools around the school-level clusters seen in Table 5.1 with the current amount ofinformation provided through D and poststratiﬁcation matrix M . Improving treatment eﬀect estimates ofMRP-style estimators for subpopulations not observed in the sample is an open research area. The primary goal of this manuscript was investigating the capabilities of MRP-style estimators in experi-mental causal inference. Through simulation studies, we studied the limitations of MRP-style estimatorsin the problem of estimating heterogeneous treatment eﬀects for subpopulations as small as 1.3 percent ofthe target population. We modeled our simulation studies after the National Student of Learning Mindsetsstudy [Yeager et al., 2019], an experimental dataset that was fairly representative of target population, yetexhibited treatment eﬀect heterogeneity.We also build upon a variant of MRP (MRP-MI) that is able to incorporate non-census variables (alongwith their full distributions if they’re continuous) into the poststratiﬁcation step. In our simulations thatmodeled the NSLM study, MRP-MI along with other MRP-style estimators (MRP-I and BARP-I) are23igure 4: (Top) Average of CATE point estimates from MRP-MI for all 11221 schools based on 100simulations. Every dot corresponds to a speciﬁc school. The x-axis are averages of our point estimates E (ˆ τ CATE , O |D ). The y-axis are the true CATEs τ CATE , O for every school. The dots are colour-coded by theirschool strata (Table 5.1). (Middle) The point estimates of the CATE for every school from MRP-MI for onesimulation iteration. (Bottom) The point estimates of the CATE for all schools in D from MRP-MI for thesame simulation iteration as the middle histogram.compared against modern Bayesian nonparametric causal inference methods [Hill, 2011] and frequentistregression methods [Lumley, 2020].We showed that MRP-style estimators are eﬀective at point estimation of conditional average treatmenteﬀects (CATE) for subpopulations of various sizes in the target population. When compared to the othertreatment eﬀect estimation methods in our simulations, MRP-MI strikes the best balance of having high-quality point estimates (low bias and variance in point estimates) and reasonable standard errors, alongwith well-calibrated uncertainty intervals for CATE estimation. The applications of the MRP-MI frameworkextend beyond causal inference and can be used in a traditional survey estimation problem where the modeleris solely interested in estimating population-level quantities, but this avenue of research is not considered inour manuscript.Though the main focus of this manuscript was on treatment eﬀect heterogeneity in small areas forstratiﬁed cluster experiments that are representative of the target population, we believe that MRP-styleestimators such as MRP-MI can excel in dealing with nonrepresentative experimental data as well. MRP hastraditionally been used as a model-based approach to generalizing survey estimates to a target population andwe believe that this can be extended to generalization studies in experimentation. Comparing MRP-styleestimators to propensity-score based generalization techniques in causal inference [Ackerman et al., 2020,Ackerman et al., 2019, Stuart et al., 2018] is an open area of research. Extending the analysis done in this24anuscript to the setting of observational causal inference is also an open area of research.When provided with a poststratiﬁcation matrix of the target population along with experimental data,we showed through simulations that MRP-style estimators improve upon standard causal inference methods,especially in scenarios when the modeler wants to uncover treatment eﬀect heterogeneity across small areas inthe target population. MRP-MI is an eﬀective framework for incorporating continuous non-census variablesinto poststratiﬁcation. As such, we see MRP-MI as a useful approach for modelers when dealing withexperimental data and poststratiﬁcation information on not necessarily all of the variables in the targetpopulation. References [Ackerman et al., 2020] Ackerman, B., Lesko, C. R., Siddique, J., Susukida, R., and Stuart, E. A. (2020).Generalizing randomized trial ﬁndings to a target population using complex survey population data. arXivpreprint arXiv:2003.07500 .[Ackerman et al., 2019] Ackerman, B., Schmid, I., Rudolph, K. E., Seamans, M. J., Susukida, R., Mojtabai,R., and Stuart, E. A. (2019). Implementing statistical methods for generalizing randomized trial ﬁndingsto a target population.

Addictive behaviors , 94:124–132.[Athey and Imbens, 2017] Athey, S. and Imbens, G. W. (2017). The econometrics of randomized experi-ments. In

Handbook of economic ﬁeld experiments , volume 1, pages 73–140. Elsevier.[Athey and Wager, 2019] Athey, S. and Wager, S. (2019). Estimating treatment eﬀects with causal forests:An application. arXiv preprint arXiv:1902.07409 .[Bisbee, 2019] Bisbee, J. (2019). BARP: Improving Mister P using Bayesian additive regression trees.

Amer-ican Political Science Review , 113(4):1060–1065.[Blair et al., 2019] Blair, G., Cooper, J., Coppock, A., and Humphreys, M. (2019). Declaring and diagnosingresearch designs.

American Political Science Review , 113:838–859.[B¨urkner, 2017] B¨urkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan.

Journal of Statistical Software , 80(1):1–28.[B¨urkner, 2018] B¨urkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms.

The R Journal , 10(1):395–411.[Carpenter et al., 2017] Carpenter, B., Gelman, A., Hoﬀman, M. D., Lee, D., Goodrich, B., Betancourt, M.,Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan : A probabilistic programming language.

Journal of Statistical Software , 76(1).[Chipman et al., 2010] Chipman, H. A., George, E. I., McCulloch, R. E., et al. (2010). BART: Bayesianadditive regression trees.

The Annals of Applied Statistics , 4(1):266–298.[Colnet et al., 2020] Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.-P., Josse, J.,and Yang, S. (2020). Causal inference methods for combining randomized trials and observational studies:a review. arXiv preprint arXiv:2011.08047 .[Deming and Stephan, 1940] Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment ofa sampled frequency table when the expected marginal totals are known.

The Annals of MathematicalStatistics , 11(4):427–444.[Ding et al., 2017] Ding, P., Li, X., and Miratrix, L. W. (2017). Bridging ﬁnite and super population causalinference.

Journal of Causal Inference , 5(2).[Dorie et al., 2018] Dorie, V., Chipman, H., and McCulloch, R. (2018). dbarts: Discrete Bayesian additiveregression trees sampler. 25Dorie et al., 2019] Dorie, V., Hill, J., Shalit, U., Scott, M., Cervone, D., et al. (2019). Automated versusdo-it-yourself methods for causal inference: Lessons learned from a data analysis competition.

StatisticalScience , 34(1):43–68.[DuGoﬀ et al., 2014] DuGoﬀ, E. H., Schuler, M., and Stuart, E. A. (2014). Generalizing observational studyresults: applying propensity score methods to complex surveys.

Health services research , 49(1):284–303.[Gabry and Mahr, 2021] Gabry, J. and Mahr, T. (2021). bayesplot: Plotting for bayesian models. R packageversion 1.8.0.[Gabry et al., 2019] Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., and Gelman, A. (2019). Visual-ization in Bayesian workﬂow.

Journal of the Royal Statistical Society: Series A (Statistics in Society) ,182(2):389–402.[Gao et al., 2020] Gao, Y., Kennedy, L., Simpson, D., Gelman, A., et al. (2020). Improving multilevelregression and poststratiﬁcation with structured priors.

Bayesian Analysis .[Gelman, 2018] Gelman, A. (2018). Mrp (or rpp) with non-census variables.

Statistical Modeling, CausalInference, and Social Science Blog .[Gelman et al., 2013] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B.(2013).

Bayesian data analysis . CRC press.[Gelman and Hill, 2006] Gelman, A. and Hill, J. (2006).

Data analysis using regression and multilevel/hier-archical models . Cambridge university press.[Gelman and Little, 1997] Gelman, A. and Little, T. C. (1997). Poststratiﬁcation into many categories usinghierarchical logistic regression.[Gneiting and Raftery, 2007] Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction,and estimation.

Journal of the American statistical Association , 102(477):359–378.[Green and Kern, 2012] Green, D. P. and Kern, H. L. (2012). Modeling heterogeneous treatment eﬀects insurvey experiments with bayesian additive regression trees.

Public opinion quarterly , 76(3):491–511.[Gruber and Van Der Laan, 2009] Gruber, S. and Van Der Laan, M. J. (2009). Targeted maximum likelihoodestimation: A gentle introduction.[Hahn et al., 2020] Hahn, P. R., Murray, J. S., Carvalho, C. M., et al. (2020). Bayesian regression treemodels for causal inference: regularization, confounding, and heterogeneous eﬀects.

Bayesian Analysis .[Hern´an and Robins, 2020] Hern´an, M. and Robins, J. (2020). Causal inference: What if.

Chapman &Hall/CRC; Boca Raton: 2020 .[Hill, 2011] Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference.

Journal of Computa-tional and Graphical Statistics , 20(1):217–240.[Imbens and Rubin, 2015] Imbens, G. W. and Rubin, D. B. (2015).

Causal inference in statistics, social,and biomedical sciences . Cambridge University Press.[Kang et al., 2007] Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparisonof alternative strategies for estimating a population mean from incomplete data.

Statistical science ,22(4):523–539.[Kastellec et al., 2015] Kastellec, J. P., Lax, J. R., Malecki, M., and Phillips, J. H. (2015). Polarizingthe electoral connection: Partisan representation in supreme court conﬁrmation politics.

The journal ofpolitics , 77(3):787–804.[Kennedy and Gelman, 2019] Kennedy, L. and Gelman, A. (2019). Know your population and know yourmodel: Using model-based regression and poststratiﬁcation to generalize ﬁndings beyond the observedsample. arXiv preprint arXiv:1906.11323 . 26Kern et al., 2016] Kern, H. L., Stuart, E. A., Hill, J., and Green, D. P. (2016). Assessing methods forgeneralizing experimental impact estimates to target populations.

Journal of research on educationaleﬀectiveness , 9(1):103–127.[K¨unzel et al., 2019] K¨unzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for esti-mating heterogeneous treatment eﬀects using machine learning.

Proceedings of the national academy ofsciences , 116(10):4156–4165.[Lauderdale et al., 2020] Lauderdale, B. E., Bailey, D., Blumenau, J., and Rivers, D. (2020). Model-basedpre-election polling for national and sub-national outcomes in the us and uk.

International Journal ofForecasting , 36(2):399–413.[Lax and Phillips, 2009] Lax, J. R. and Phillips, J. H. (2009). How should we estimate public opinion in thestates?

American Journal of Political Science , 53(1):107–121.[Lax et al., 2019] Lax, J. R., Phillips, J. H., Zelizer, A., et al. (2019). The party or the purse? unequalrepresentation in the us senate.

American Political Science Review , 113(4):917–940.[Little, 1993] Little, R. J. (1993). Post-stratiﬁcation: a modeler’s perspective.

Journal of the AmericanStatistical Association , 88(423):1001–1012.[Lohr, 2009] Lohr, S. L. (2009).

Sampling: design and analysis . Nelson Education.[Lumley, 2020] Lumley, T. (2020). survey: analysis of complex survey samples. R package version 4.0.[Lumley and Scott, 2017] Lumley, T. and Scott, A. (2017). Fitting regression models to survey data.

Sta-tistical Science , pages 265–278.[Mercer et al., 2017] Mercer, A. W., Kreuter, F., Keeter, S., and Stuart, E. A. (2017). Theory and practice innonprobability surveys: parallels between causal inference and survey inference.

Public Opinion Quarterly ,81(S1):250–271.[Miratrix et al., 2018] Miratrix, L. W., Sekhon, J. S., Theodoridis, A. G., and Campos, L. F. (2018). Worthweighting? how to think about and use weights in survey experiments.

Political Analysis , 26(3):275–291.[Park et al., 2004] Park, D. K., Gelman, A., and Bafumi, J. (2004). Bayesian multilevel estimation withpoststratiﬁcation: state-level estimates from national polls.

Political Analysis , 12(4):375–385.[Pearl, 1995] Pearl, J. (1995). Causal diagrams for empirical research.

Biometrika , 82(4):669–688.[R Core Team, 2020] R Core Team (2020).

R: A Language and Environment for Statistical Computing . RFoundation for Statistical Computing, Vienna, Austria.[Rubin, 1974] Rubin, D. B. (1974). Estimating causal eﬀects of treatments in randomized and nonrandomizedstudies.

Journal of educational Psychology , 66(5):688.[Rubin, 1980] Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomizationtest comment.

Journal of the American Statistical Association , 75(371):591–593.[Stuart et al., 2018] Stuart, E. A., Ackerman, B., and Westreich, D. (2018). Generalizability of randomizedtrial results to target populations: Design and analysis possibilities.

Research on social work practice ,28(5):532–537.[Stuart et al., 2011] Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J. (2011). The use of propensityscores to assess the generalizability of results from randomized trials.

Journal of the Royal StatisticalSociety: Series A (Statistics in Society) , 174(2):369–386.[Tipton, 2014] Tipton, E. (2014). How generalizable is your experiment? an index for comparing experi-mental samples and populations.

Journal of Educational and Behavioral Statistics , 39(6):478–501.27Wager and Athey, 2018] Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treat-ment eﬀects using random forests.

Journal of the American Statistical Association , 113(523):1228–1242.[Wang et al., 2015] Wang, W., Rothschild, D., Goel, S., and Gelman, A. (2015). Forecasting elections withnon-representative polls.

International Journal of Forecasting , 31(3):980–991.[Wickham, 2016] Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis . Springer-Verlag NewYork.[Yeager et al., 2019] Yeager, D. S., Hanselman, P., Walton, G. M., Murray, J. S., Crosnoe, R., Muller, C.,Tipton, E., Schneider, B., Hulleman, C. S., Hinojosa, C. P., et al. (2019). A national experiment revealswhere a growth mindset improves achievement.

Nature , 573(7774):364–369.28 upplementary Material

A Proof of Proposition 1

Proof. E (ˆ τ CATE , O |D ) = (cid:80) c ∈ I O N c E (cid:0) Y rep ,n ,c |D (cid:1) − N c E (cid:0) Y rep ,n ,c |D (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:80) c ∈ I O N c (cid:0)(cid:82) y ∗ p ( y ∗ | , c, D ) dy ∗ − (cid:82) y ∗ p ( y ∗ | , c, D ) dy ∗ (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:80) c ∈ I O N c (cid:0)(cid:82) (cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) p ( v ∗ | , c, D ) dy ∗ dv ∗ − (cid:82) (cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) p ( v ∗ | , c, D ) dy ∗ dv ∗ (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:80) c ∈ I O N c (cid:0)(cid:82) (cid:0)(cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ − (cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ (cid:1) p ( v ∗ | c, D ) dv ∗ (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:88) c ∈ I O N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:90) (cid:18)(cid:90) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ − (cid:90) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ (cid:19) p ( v ∗ | c, D ) dv ∗ = (cid:88) c ∈ I O N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:90) ( g ( c, v ∗ ) − g ( c, v ∗ )) p ( v ∗ | c, D ) dv ∗ (26)Deﬁning dµ ( c, v ∗ ) := N c (cid:80) c (cid:48)∈ I O N c (cid:48) p ( v ∗ | c, D ) dv ∗ where the support is ( c, v ∗ ) ∈ I O × (0 , . E (ˆ τ CATE , O |D ) = (cid:82) ( g ( c, v ∗ ) − g ( c, v ∗ )) dµ ( c, x ∗ ). B Further discussion on connection between MRP and treatment eﬀect estima-tion

Suppose that we have an estimation method M that produces unbiased estimates ˆ E ( Y (1) − Y (0) | X = x c )for the true treatment eﬀect τ CATE ,X = x in every poststratiﬁcation cell c .Let NR ( O ) ⊆ I O be the set of poststratiﬁcation cells such that the in-sample frequency n c ∗ (cid:80) c (cid:48)∈ I O n c (cid:48) diﬀersfrom the true frequency P ( X = x c ∗ | X ∈ O ) = N c ∗ (cid:80) c (cid:48)∈ I O N c (cid:48) . This means that the observed sample of covariatesdo not generate the true frequencies observed in the ﬁnite target population for cells in NR ( O ).From Equation 3, the point estimate of τ CATE , O minus the truth is (cid:88) c ∈ I O ˆ E ( Y (1) − Y (0) | X = x c ) (cid:124) (cid:123)(cid:122) (cid:125) Point estimate from M (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) (cid:33) − τ CATE , O = (cid:88) c ∈ I O τ CATE ,X = x (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) (cid:33) − (cid:88) c ∈ I O τ CATE ,X = x (cid:32) N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:33) = (cid:88) c ∈ NR( O ) τ CATE ,X = x (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) − N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:33) (27)Hence the bias for estimating τ CATE , O is driven by the size and direction of the true treatment eﬀects τ CATE ,X = x c ∗ and the diﬀerences in the proportions n c ∗ (cid:80) c (cid:48)∈ I O n c (cid:48) − N c ∗ (cid:80) c (cid:48)∈ I O N c (cid:48) for all c ∗ ∈ NR ( O ). The diﬀer-ences in proportions is caused by the observed sample of covariate vectors { X i } ni =1 being a nonrepresentativesample of individuals in the target population, { X i } N P i =1 .29 Coeﬃcients for the data generating process in Section 5

The coeﬃcient vectors used in our simulation study outlined in Section 5 are deﬁned as: • µ X = (2 . , . , . • σ X = (0 . , , . • p RE = (0 . , . , . , . , . p RE = (0 . , . , . , . , . p RE = (0 . , . , . , . , . p RE = c (0 . , . , . , . , . p RE = c (0 . , . , . , . , . • p ME = (0 . , . , . p ME = (0 . , . , . • τ SA = (0 . , . , . • τ MC = (0 , − . , . • τ G = (0 . , . • τ RE = (0 . , . , . , . , . • τ ME = (0 . , τ SA , τ MC , τ G , τ RE , τ ME were chosen so that the simulation setup in this manuscript produced treatmenteﬀects in the same range as the ones reported in [Yeager et al., 2019].The GitHub repository for this manuscript is: https://github.com/alexgao09/causal_mrp_publichttps://github.com/alexgao09/causal_mrp_public