Treatment effect estimation with Multilevel Regression and Poststratification
TTreatment effect estimation with Multilevel Regression andPoststratification
Yuxiang Gao ∗ , Lauren Kennedy † , and Daniel Simpson ‡ Department of Statistical Sciences, University of Toronto Department of Econometrics and Business Statistics, Monash UniversityFebruary 22, 2021
Abstract
Multilevel regression and poststratification (MRP) is a flexible modeling technique that has been usedin a broad range of small-area estimation problems. Traditionally, MRP studies have been focused onnon-causal settings, where estimating a single population value using a nonrepresentative sample was ofprimary interest. In this manuscript, MRP-style estimators will be evaluated in an experimental causalinference setting. We simulate a large-scale randomized control trial with a stratified cluster samplingdesign, and compare traditional and nonparametric treatment effect estimation methods with MRPmethodology. Using MRP-style estimators, treatment effect estimates for areas as small as 1.3% of thepopulation have lower bias and variance than standard causal inference methods, even in the presenceof treatment effect heterogeneity. The design of our simulation studies also requires us to build upon aMRP variant that allows for non-census covariates to be incorporated into poststratification.
Keywords:
Causal inference, treatment effect heterogeneity, multilevel regression and poststratification,small-area estimation, non-census variables
Randomized control trials (RCTs) are often considered the gold standard for estimating causal treatmenteffects of a specific intervention. Applications of RCTs are widespread, having been applied in fields suchas healthcare, economics, political science, education and technology. Studying the statistical properties ofRCTs and the methods used to model them are active areas of research in causal inference since challengesinvolving data quality, model specification and experimental design can arise in practice.The main purpose of this manuscript is to introduce Multilevel Regression and Poststratification (MRP)[Gelman and Little, 1997, Little, 1993], a method has traditionally been used for public opinion modellingand small-area estimation, as a viable method in experimental causal inference. We will highlight thestrengths and limitations of modern causal inference methods along with MRP and its variants for estimat-ing conditional treatment effects over subpopulation groups of various sizes, focusing specifically on scenarioswith treatment effect heterogeneity. In particular, the causal inference problem we are analyzing is a simu-lated version of a large-scale stratified cluster RCT conducted in the United States in 2015–2016. Secondly, ∗ Author email: [email protected] † Author email: [email protected] ‡ Author email: [email protected]
Acknowledgements:
Yuxiang Gao was funded by the Ontario Graduate Scholarship. Daniel Simpson was funded by theNatural Sciences and Engineering Research Council of Canada and the Canadian Research Chair program. Research reportedin this publication was supported by National Institute of Aging of the National Institutes of Health under award numberR01AG067149. The content is solely the responsibility of the authors and does not necessarily represent the official views ofthe National Institutes of Health. a r X i v : . [ s t a t . M E ] F e b e analyze a variant of MRP seen in [Lax et al., 2019, Gelman, 2018, Kastellec et al., 2015], which we call Multilevel Regression and Poststratification with Multiple Imputation (MRP-MI) . MRP-MI allows the mod-eler to include a variable that is present in the sample but missing in the population, known as a non-censusvariable, in the poststratification frame.
Generalization in causal inference refers to mapping treatment effects from a specific study to treatmenteffects in a target population. This population can be a finite population of the experimental sample,a group of individuals who are in a different location relative to the experimental sample or a group ofindividuals whose covariates were measured at a different time relative to the experimental sample. Withoutpost-experimentation adjustments, there is a chance that treatment effects from an experimental samplewill be different than the true treatment effect of the target population. An important point to note isthat, no unadjusted treatment effect estimates from the experimental sample are generalizable to the targetpopulation unless the sample is equal to the target population or the sample is a random draw from thetarget population [Stuart et al., 2018]. There are different interpretations and definitions of generalizabilityin the literature, and in this part of the introduction we will briefly outline some of the different frameworksand assumptions used to define generalizability.In [Kern et al., 2016], the estimand of interest is the Target Average Treatment Effect (TATE), whichis defined as the mean of the difference in potential outcomes N (cid:80) Ni =1 ( Y i (1) − Y i (0)) where N is the sizeof the target population. A sampling indicator S i ∈ { , } is defined for every individual in the targetpopulation and the experimental sample. S i = 0 if the individual is in the experimental sample and 1otherwise. Estimating TATE can be done through weighting approach, or an outcome modelling approach.Through the weighting approach as outlined in [Kern et al., 2016], a score w i := ˆ e i − ˆ e i is defined for everyindividual in the experimental sample, where ˆ e i is the estimated probability that individual i is in the targetpopulation. The idea behind the weights w i is that they make the experimental sample more similar tothe target population. A weighted linear regression model is then fit to the experimental sample using theset of w i and then the coefficient of the treatment indicator is used as the TATE estimate. The outcomemodelling approach in [Kern et al., 2016] uses a predictive model on the experimental sample and thenpredicts outcomes under treatment and control for every individual in the the target population.In [Ackerman et al., 2019], the generalization methods of weighting and outcome modelling as seen in[Kern et al., 2016] are tested. Targeted Maximum likelihood Estimation [Gruber and Van Der Laan, 2009], adoubly robust estimation method that combines both weighting and outcome modelling is used to estimatethe TATE. For a more thorough literature review of generalizing experimental sample treatment effectestimates to the TATE using approaches in [Ackerman et al., 2019, Kern et al., 2016], the reader can referto [Stuart et al., 2018]. A broader literature review on experimental causal inference and generalizabilitycan be found in the recent survey [Colnet et al., 2020].Another way to generalize experimental sample estimates to a target population is through poststratifica-tion, a weighting technique in survey statistics. Poststratification is briefly mentioned in [Stuart et al., 2011],but the current challenges that remain are continuous covariates or a large number of poststratification cells.[Miratrix et al., 2018] analyzes poststratification in survey experiments, but only analyzes scenerios whereall observed covariates are categorical.To measure representativeness between the experimental sample and the target population, various mea-sures have been developed. One method was based off the sampling propensity score s ( X ) := P ( S = 1 | X )where X is the pre-treatment covariate vector and S is a binary indicator for membership in the experimen-tal sample. [Stuart et al., 2011] define the propensity score difference, which is the difference of averagesfor s ( X i ) between the experimental sample and the target population. [Tipton, 2014] extends the usage ofsampling propensity scores by defining a new metric known as the generalization index , a value between 0and 1 that is a measure of representativeness between the experimental sample and the target population.Though this manuscript is not focused on generalizing treatment effect estimates but rather on recoveringtreatment effect heterogeneity, we calculate the generalization index in our simulation studies to show thatindeed our experimental sample is representative and hence a high-quality sample of the target population.2 .2 Treatment effect heterogeneity across different subpopulations in the targetpopulation When there is no treatment effect heterogeneity, it’s not a concern if the experimental sample is representativeor not. This is because treatment effects for any subpopulation will be equal so over/undersampling certainsubpopulations does not result in different treatment effect estimates. However, any amount of treatmenteffect heterogeneity will bias treatment effect estimates coming from a nonrepresentative experimental samplewhen the nonrepresentativeness and heterogeneity are not properly adjusted for.More specifically, estimating heterogeneous treatment effects consists of estimating the non-constantfunction τ ( x ) := E ( Y (1) − Y (0) | X = x ), which is the expected value of the difference in potential outcomesconditional on pretreatment variables X = x . This expectation is taken at the superpopulation level, wherethe superpopulation is the data generating process for the finite target population. In this manuscript, wegenerate a finite target population then sample from it, but the observed sample is used to estimate CATEs atthe superpopulation level. For additional details on the connection between superpopulation causal inferenceand finite population causal inference, we refer the reader to [Ding et al., 2017, Imbens and Rubin, 2015].Modeling τ ( x ) can be done parametrically and non-parametrically. The parametric approach would beto use a linear regression model with the treatment variable interacting with other pre-treament variables.A parametric modeling approach’s benefits are that it’s simple to implement and its model coefficientsare interpretable. However, it’s prone to model misspecification when τ ( x ) is nonlinear and pre-treatmentcovariate X is high-dimensional, which can result in biasing estimates of τ ( x ).Estimating τ ( x ) can be done more flexibly with nonparametric regression methods [Hahn et al., 2020,Wager and Athey, 2018, Hill, 2011] and machine learning methods [K¨unzel et al., 2019]. Nonparametrictreatment effect estimation methods have been used in survey experiments [Green and Kern, 2012] and large-scale randomized control trials concerning education interventions [Yeager et al., 2019, Athey and Wager, 2019].For a more extensive overview of the challenges arising from treatment effect heterogeneity in experimenta-tion, we refer the reader to [Athey and Imbens, 2017].The focus of this paper will mainly be to address the challenges of treatment effect heterogeneity inexperimentation when the sample is representative of the target population, so as a motivating example, wewill discuss a large-scale RCT conducted recently on the target population of public high schools in the UnitedStates, one that was representative of the target population yet exhibited treatment effect heterogeneity. ThisRCT was used as a model for our simulation studies run in this manuscript. The National Study of Learning Mindsets (NSLM) [Yeager et al., 2019] was an experiment conducted on anationally representative sample of 9 th grade public high-school students in the United States during Fall2015. The growth mindset intervention was an education intervention that encouraged treated individualsto view intellectual ability not as a fixed trait, but a muscle that can be trained through the sustainedeffort of seeking help when learning new academic content and trying new learning strategies. Indeed, thegrowth mindset intervention was shown to be effective in improving the GPA of high-school students. Theoriginal NSLM analysis used linear modeling with survey weights as well as a response surface method[Hahn et al., 2020].The sampling design in NSLM first has school strata defined by levels of minority composition level andschool-achievement level. Then schools are cluster sampled from each school strata. Finally, in the chosenschools that decided to partake in the study, students were given the option to participate or not. Theaverage student-level response rate was 92 percent. The intervention was randomly assigned at the studentlevel.The intervention of interest is how effective the treatment is on the GPA of individuals as measuredby the difference between post-intervention GPA and pre-intervention GPA. The causal relationship of in-terest in this manuscript is shown in the causal directed acyclic graph (DAG) [Pearl, 1995] seen in Figure1.3. This DAG is a simplified model of the NSLM that contains only the main variables in their analysis[Yeager et al., 2019]. 3 igure 1.3 (Causal DAG): Shaded nodes are ob-served variables whereas unshaded nodes are unob-served noise. Because the data collected is comingfrom a randomized experiment, all the variables ex-cept the outcome (blue post-GPA node) and thetreatment (red treatment node) are pre-treatmentvariables. The structure of the posited causal graphis incorporated in treatment effect estimation meth-ods used in this manuscript. The outcome variablePost-GPA and the covariate Previous-GPA are bothcontinuous and truncated with a lower and upperbound. The rest of the observed covariates are cat-egorical.
TreatmentPost-GPA
Prev-GPASchool Race/Eth Mat.-Edu.GenderMin-Comp.School-Ach. (cid:15)
School (cid:15)
Prev-GPA (cid:15)
Post-GPA
The NSLM’s target population covariate structure is rich and detailed. Across different covariate subpop-ulations, different treatment effects are suspected to exist. In this manuscript, we explore the limitations ofusing a representative sample to calculate treatment effects for subpopulations in a population as structuredand heterogeneous as the target population of NSLM.We design a simulation study that generates a complex target population and samples from it througha stratified cluster design to generate a RCT. The sampling design has schools as the clusters, and theclusters are stratified by school-level covariates minority composition level and school-achievement level. Thesimulated population is built to match the population summary statistics reported in [Yeager et al., 2019].Section 5 contains additional details on the design of the simulation study and further information aboutthe population’s structure and how it relates to the NSLM can be found in Supplementary Material C.Additionally, we compare various causal inference methods to see the extent that these methods canrecover the true heterogeneity of the intervention in the target population. In our simulation studies, weestimate treatment effects ranging from the Average Treatment Effect (ATE) for the whole population toConditional Average Treatment Effects (CATEs) for subpopulations as small as 1.3 percent of the population.
In this manuscript, we propose a treatment effect estimation method which builds on a hierarchical model-based survey estimation method used for small-area estimation, Multilevel Regression and Poststratifica-tion (MRP) [Park et al., 2004, Gelman and Little, 1997]. MRP combines a hierarchical model’s posteriordistribution of the outcome ( θ c ) Jc =1 with an external poststratification matrix containing population sizes( N c ) Jc =1 , where c is a stratum of finest granularity in the target population. MRP then forms a posteriordistribution for the average outcome in the target population, θ ps := (cid:80) Jc =1 N c θ c (cid:80) Jc (cid:48) =1 N c (cid:48) , and summary statisticssuch as the posterior mean of θ ps can be used as the point estimate for average outcome in the tar-get population. A strength of MRP is that it’s effective in estimating outcomes of small areas in thepresence of selection bias across various subpopulations of a target population [Gao et al., 2020]. MRPhas historically been a commonly used methodology in small-area estimation and public opinion model-ing [Lauderdale et al., 2020, Wang et al., 2015], however it’s applications to causal inference have been anunexplored area. A primary goal of this manuscript is to explore the extent of MRP’s effectiveness inexperimental causal inference. For a more detailed step-by-step outline of MRP, we refer the reader to[Lax and Phillips, 2009].MRP requires covariates to be categorical, in order to perform Poststratification . Current practice is todiscretize continuous covariates before using them in a MRP model. The effect of the granularity of thediscretization on the MRP estimates is studied in [Gao et al., 2020]. Discretization of continuous covariatesin MRP is possible when we have population counts for each of the discretized bins. For example, if age isdiscretized into 6 groups, then the poststratification matrix used must have population counts for each ofthe 6 age groups. 4n our simulation study, which is based on the NSLM study, we are given the poststratification matrixfor all combinations of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education, thus providing population counts for each combination. Prev-GPA of individualsin every collected sample is reported, but no such discretization of Prev-GPA at the population level isavailable.It’s also important to note that Post-GPA and Prev-GPA in our simulation studies are both truncatedcontinuous distributions with support (0 , . − Prev-GPA as a function of the binary treatment indicatorand the covariates School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education. This is not the case in our study since the variable Post-GPA − Prev-GPA is not inthe same family of distributions that Post-GPA or Prev-GPA are in. Thus to estimate the treatment effectsfrom DAG Figure 1.3, we have to use Prev-GPA as a covariate and Post-GPA as the outcome.We will utilize a variant of MRP that bypasses missing population counts for the continuous covariatePrev-GPA and also takes into account the truncated nature of both Prev-GPA and Post-GPA. We will referto this variant as
Multilevel Regression and Poststratification with Multiple Imputation (MRP-MI) . We applyMRP-MI to to the problem of heterogeneous treatment effect estimation for the causal graph in Figure 1.3.In the literature, MRP-MI is an extension of the procedure where the modeler uses MRP with non-censusvariables [Kennedy and Gelman, 2019, Lax et al., 2019, Gelman, 2018, Kastellec et al., 2015]. MRP-MI issimilar to the prediction framework carried out in example 2 of [Kennedy and Gelman, 2019]. [Kennedy and Gelman, 2019]dealt with a non-census variable by imputing it for each individual. In contrast, we predict our non-census variable at the poststratification cell-level rather than the individual level. In [Lax et al., 2019,Kastellec et al., 2015] the non-census variable (variable with missing population counts) is political partymembership (discrete with 3 levels) whereas in this manuscript the non-census variable is Prev-GPA (con-tinuous and truncated).In this estimation problem, MRP-MI entails fitting both a Prev-GPA hierarchical model and a Post-GPAhierarchical model, and then poststratifying with the posterior predictive distribution of Post-GPA afterintegrating out the posterior predictive distribution of Prev-GPA. The detailed description of MRP-MI isshown in the later Section 4.The challenge with modeling CATEs in a simulation study modelled after the target population in NSLMis that there are scenarios where some school clusters ˜ O are not observed. Hence calculating the CATE for˜ O will be an out-of-sample prediction problem. Indeed, that was the case in the original NSLM study –Only around 70 schools out of the 12000 total schools in the target population were sampled from. Insteadof calculating CATEs for all 12000 schools, the modelers only calculated CATEs for the clusters observed. Apart of this manuscript (Section 5.6) will be dedicated to showcasing the limitations of MRP-MI and otherMRP variants when it comes to calculating out-of-sample CATEs. This manuscript has the following order: Section 1 contains a summary of two common problems in ex-perimental causal inference, namely treatment effect heterogeneity and generalizability of treatment effectestimates. Section 2 contains our problem setup for estimating treatment effects for any group in the targetpopulation, where our problem is modeled after the National Study of Learning Mindsets. Section 3 presentsthe various treatment effect estimation methods that are tested in our simulation studies. Section 4 presentsthe detailed outline of our proposed treatment effect estimation method. Section 5 is a description of oursimulation study and results, which includes how the target population is generated and how we sample fromthe target population. Section 6 is the conclusion. The supplementary material contains further discussionof the MRP variant (MRP-MI), connection between treatment effect estimation and MRP, and simulationdesign information.
Suppose that we’re interested in estimating how effective an education intervention is on the Post-GPA ofan individual. Let ( Y (0) , Y (1) , X ) be a superpopulation joint distribution for the control potential outcome,5 aternal Education Gender Race/Ethnicity School Minority Composition Index School Achievement Index N c No Female Asian 00001 Low Low 1No Female Asian 00002 Low Low 1No Female Asian 00006 Low Low 1No Female Asian 00007 Low Low 1No Female Asian 00012 Low Low 2... ... ... ... ... ... ...... ... ... ... ... ... ...Yes Male Other 11220 High High 9Yes Male Other 11221 High High 15
Table 1: Poststratification matrix M for the population P with population sizes N c for all J combinations ofSchool-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education.Note that (cid:80) Jc =1 N c = N P .treatment potential outcome and the pre-treatment covariate. Y (0) is the Post-GPA of the individualunder no treatment and Y (1) is the Post-GPA of the individual under treatment. X is a vector containingdemographic information and the previous academic achievement scores of an individual. We’re interestedin the following superpopulation CATE estimands:1. Average Treatment Effect: τ ATE := E ( Y (1) − Y (0))2. Conditional Average Treatment Effect for any subpopulation O : τ CATE , O := E ( Y (1) − Y (0) | X ∈ O )3. Conditional Average Treatment Effect for X = x : τ CATE ,X = x := E ( Y (1) − Y (0) | X = x )Suppose that a finite population P := ( Y m (0) , Y m (1) , X m ) N P m =1 is generated i.i.d from ( Y (0) , Y (1) , X ).Suppose that ( Y i (0) , Y i (1) , X i ) ni =1 is a sample drawn from P in a probabilistic manner. In our case,( Y i (0) , Y i (1) , X i ) ni =1 is drawn as a stratified cluster sample. Let Z i ∈ { , } be a treatment-control indi-cator variable for individual i in the sample, and Y i be the outcome variable for individual i in the sample.The details of the sampling design and assignment mechanism for ( Y i (0) , Y i (1) , X i , Z i ) ni =1 can be found inSection 5 of this manuscript.Referencing the causal DAG in Figure 1.3, we have X i = ( V i , ME i , G i , RE i , School i , SA i , MC i ), whichis the pre-treatment covariate vector for individual i consisting of Previous-GPA, Minority Composition,Gender, Race/Ethnicity, School, School Achievement level and Minority Composition level of School. SchoolAchievement level and Minority Composition level of School are school-level covariates and the other co-variates are individual-level covariates. ( Y i (0) , Y i (1) , X i , Z i ) ni =1 is a stratified cluster sample of the finitepopulation P , where clusters are schools.Finally, we will assume that we’re given a poststratification matrix M for the target population, withpopulation counts for all J combinations of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education. Table 1 shows M . We do not have population information forPrev-GPA. To allow for identifiability of superpopulation CATEs and ATE, we will assume the following below for allindividuals 1 ≤ i ≤ n . For more on the role that these assumptions play in superpopulation treatment effectestimation, we refer the reader to [Imbens and Rubin, 2015].1. Stable Unit Treatment Value Assumption (SUTVA) [Rubin, 1980]: We assume that there is no inter-ference between units. That is, the treatments assigned to a unit do not affect the potential outcomesof another unit.2. Consistency: Y i = Z i Y i (1) + (1 − Z i ) Y i (0) almost-surely3. Ignorability: ( Y i (0) , Y i (1)) ⊥⊥ Z i
6. Overlap: 0 < P ( Z i = 1 | X i ) < Y i = Y i ( Z i ) almost-surely. This defines the observed stratifiedcluster sample D := ( Y i , X i , Z i ) ni =1 , consisting of the outcome, pre-treatment covariate and treatment variablerespectively. Based off D , various causal inference methods can be use to estimate the ATE and CATEs.MRP-style estimators will be compared with other standard causal inference methods for calculatingCATEs and ATE, so we motivate the need for MRP by providing the connection between MRP and super-population treatment effect estimation below. Suppose that the covariate vector X of an individual in the superpopulation has J number possible valuesthat it can take on. Also, suppose we’re given a poststratification matrix ˜ M for the target population, withpopulation counts ( N c ) Jc =1 for all J cells. Note that the CATE τ CATE , O for a subpopulation O is then equalto E ( Y (1) − Y (0) | X ∈ O ) = (cid:90) c ∈ I O E ( Y (1) − Y (0) | X = x c ) P ( X = x c | X ∈ O )= (cid:88) c ∈ I O E ( Y (1) − Y (0) | X = x c ) N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (1)where I O is the index set for subpopulation O . Thus poststratifying the CATE of every cell in subpopu-lation O is performing integration of the potential outcome surface difference over the covariate space, wherethe discrete probability measure is defined by the given poststratification weights (cid:110) N c (cid:80) c (cid:48)∈ I O N c (cid:48) (cid:111) c ∈ I O .It follows that the ATE is integrating E ( Y (1) − Y (0) | X = x c ) over all the poststratification cells:ATE = E ( Y (1) − Y (0))= (cid:90) c E ( Y (1) − Y (0) | X = x c ) P ( X = x c )= J (cid:88) c =1 E ( Y (1) − Y (0) | X = x c ) N c (cid:80) Jc (cid:48) =1 N c (cid:48) (2)Hence calculating the superpopulation ATE and superpopulation CATEs require modelling the responsesurfaces g z ( x ) := E ( Y ( z ) | X = x ) for z = 0 ,
1, and then poststratifying the difference ∆ g ( x ) := g ( x ) − g ( x ) by weights defined by { P ( X = x c ) } Jc =1 and { P ( X = x c | X ∈ O ) } c ∈ I O respectively. Various outcomemodeling methods as mentioned in the introduction can be used to model the two potential outcome surfaces g ( x ) , g ( x ), with an increasingly popular approach being nonparametric methods. Nonparametric methodssuch as BART have performed well in empirical studies for estimating treatment effects in an observationalsetting such as [Hahn et al., 2020, Dorie et al., 2019].Poststratification weights and sampling weights (Both are survey weights) are used to adjust for selectionbias of the covariates that are not the treatment covariate. On the other hand, propensity score weights incausal inference are used to adjust for non-random treatment assignment. Both can be combined togetherwhen modeling to simultaneously adjust for selection bias across non-treatment covariates and reduce con-founder bias in treatment effect estimates [DuGoff et al., 2014]. Weights in survey statistics and weightsin causal inference are related insofar that they both adjust for representativeness—survey weights adjustfor representativeness in the sample whereas causal inference weights adjust for representativeness betweentreatment and control groups. This connection is discussed in detail in [Mercer et al., 2017].Another connection between survey statistics and causal inference is that both fields rely on the samemachinery from missing data analysis as ways of accounting for nonresponse and confounders respectively.In Rubin’s potential outcomes framework [Rubin, 1974], the target estimand is a treatment effect in thepopulation, which is defined as a difference in potential outcomes. In survey statistics, the target estimand is apopulation-level quantity. Both fields require the modeler to predict missing values in the target population.In causal inference, every individual in the observed sample has a potential outcome missing that must7e predicted. In survey statistics, the outcome (and potentially some covariates) must be predicted forindividuals that did not respond. [Kang et al., 2007] provides a more detailed review of methods that canadjust for missing data in both survey estimation problems and in causal inference problems. This paperwas inspired by the commonality of both fields using the same methods as seen in [Kang et al., 2007] toadjust for missing data. MRP has historically been a method used for nonresponse adjustment in surveys,and in this paper we explore its capabilities in experimental causal inference.In a MRP application setting, we’re given the poststratification weights (cid:110) N c (cid:80) c (cid:48)∈ I O N c (cid:48) (cid:111) c ∈ I O for subpopu-lation O . In scenarios where the modeler wants to estimate the CATE for subpopulation O but isn’t giventhe poststatification weights, the in-sample frequencies of every cell c in subpopulation O is used instead[Hahn et al., 2020, Dorie et al., 2019, Hill, 2011]. That is, P ( X = x c | X ∈ O ) = N c (cid:80) c (cid:48)∈ I O N c (cid:48) is replaced with n c (cid:80) c (cid:48)∈ I O n c (cid:48) , where n c are the number of observations in our observed sample D = ( Y i , X i , Z i ) ni =1 that arefrom poststratification cell c .Thus the point estimate of τ CATE , O becomes (cid:88) c ∈ I O ˆ E ( Y (1) − Y (0) | X = x c ) (cid:124) (cid:123)(cid:122) (cid:125) Point estimate from a model frequency from D (cid:122) (cid:125)(cid:124) (cid:123) ˆ P ( X = x c | X ∈ O ) = (cid:88) c ∈ I O ˆ E ( Y (1) − Y (0) | X = x c ) (cid:124) (cid:123)(cid:122) (cid:125) Point estimate from a model (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) (cid:33) (3)This can result in a poor point estimate of τ CATE , O if the covariates in the observed sample { X i } ni =1 area nonrepresentative sample of the target population. Supplementary Material B further expands on suchchallenges.Assuming that a poststratification matrix ˜ M is available, it would be sensible to perform poststratificationwith it in the scenario of treatment effect estimation. Doing so will help the modeler adjust for non-responseand sampling bias. We propose five treatment estimation methods below. The first three treatment methods (OLS, SVY, BART)had variants of them used in the original NSLM study [Yeager et al., 2019] for ATE and CATE estimation ofvarious subpopulations, hence we decided to include them in our simulation study that modeled the NSLM’starget population and experimental design. The last two methods (BARP-I and MRP-I) are MRP-stylemethods that incorporate information from poststratification matrix M . BARP-I and MRP-I impute aPrev-GPA estimate for every poststratification cell in M , hence the acronym I.All of the five methods are able to estimate the ATE τ ATE and any CATE τ CATE , O where subpopulation O has atleast one observation in the sample D . The last two methods (BARP-I and MRP-I) are able toestimate CATEs for subpopulations not observed in D . BARP-I and MRP-I should be viewed more asframeworks rather than specific methods since they allow for various ways to model the outcome Post-GPAand the covariate Prev-GPA.1. Ordinary least squares (OLS)
In this manuscript, we will follow notation consistent with [Gelman and Hill, 2006]. For the i th indi-vidual, define β RE j [ i ] to be their unique race/ethnicity coefficient. As well for the i th individual, define β School j [ i ] to be their unique school coefficient. Fit the following regression model below: Y i ∼ N (cid:16) β + β Z Z i + Offset( V i ) + β School j [ i ] + β RE j [ i ] + β ME ME i + β G G i , σ (cid:17) (4)Using the full sample D , the ATE estimate would be ˆ β Z . Calculating CATE for a specific subpopulationin the population is done by subsetting the stratified cluster sample D into the targeted subpopulationand then estimating ˆ β Z . 8. Survey regression with sampling weights defined through raking (SVY) [Lumley, 2020]
Define raked weights [Lohr, 2009, Deming and Stephan, 1940] on the stratified cluster sample using theraking variables strata, gender and race/ethnicity. The population counts used for raking are definedin the poststratification matrix of J cells. Using these raked weights w i , fit the model for the stratifiedcluster sample: Y i ∼ N (cid:18) β + β Z Z i + Offset( V i ) + β RE j [ i ] + β ME ME i + β G G i , σ w i (cid:19) (5)The weights are rescaled to sum to 1 for numerical stability. We will use the survey package[Lumley, 2020] in R to fit this model. Note that there is no school factor since the weights w i al-ready account for the cluster sampling within every school stratum.Using the full sample D , the ATE estimate would be ˆ β Z . Calculating CATE for a specific subpopulationin the population is done by subsetting the stratified cluster sample D into the targeted subpopulationand then estimating ˆ β Z . The model-robust weighted standard error [Lumley and Scott, 2017] for ˆ β Z is computed using the survey package in R [Lumley, 2020].3. Bayesian Additive Regression Trees (BART) [Hill, 2011, Chipman et al., 2010]
Referencing Figure 1.3, we have pre-treament covariate X i = ( V i , ME i , G i , RE i , School i , SA i , MC i ).(a) Fit the following Bayesian Additive Regression Tree (BART) model to the stratified cluster sample D : Y i |{ M j } mj =1 , { T j } mj =1 , (cid:15) i , σ ∼ m (cid:88) j =1 g j ( Z i , X i ; T j , M j ) + (cid:15) i M j | T j ind. ∼ P Node T j ind. ∼ P Tree (cid:15) i | σ ∼ N (0 , σ ) σ ∼ P (6)where g j is the assignment function for tree j , T j is tree j where prior probabililty is assigned to anode splitting at each depth, M j := { u jb } B j b =1 are the leaf nodes for tree j with gaussian priors as-signed to each leaf u jb . Each independent tree has prior mass that favors shorter trees and leaf nodevalues near 0. There is a conjugate prior used for P , which is usually a inverse chi-squared withdata-informed hyperparameters. For additional information on the BART model specification,refer to [Chipman et al., 2010]. The R package dbarts [Dorie et al., 2018, Chipman et al., 2010]is used for BART fitting.(b) The model (6) is fit to the stratified cluster sample. The posterior distribution for individual i inthe stratified cluster sample is defined as τ BART ( X i ) := m (cid:88) j =1 g post j ( Z i = 1 , X i ; T j , M j ) − m (cid:88) j =1 g post j ( Z i = 0 , X i ; T j , M j ) (7)where g post j := g j |D is the posterior assignment function for tree j . For ease of notation, let τ BART i := τ BART ( X i ). Let I O ⊆ { , . . . , n } be the indices of individuals in the stratified clus-ter sample that belong to the subpopulation group O . Then the estimated distribution of theCATE for O is defined as | I O | (cid:80) i ∈ I O τ BART i . Likewise, the estimated distribution for the ATE is n (cid:80) i ∈ [ n ] τ BART i . 9oint estimate and standard error come from the expected value and variance that’s calculated basedon the estimated distribution | I O | (cid:80) i ∈ I O τ BART i . This is the standard application of BART and itsvariants for causal inference, as seen in [Hahn et al., 2020, Hill, 2011].In the original NLSM study, a variant of BART (Bayesian causal forests [Hahn et al., 2020]) was usedto calculate CATEs for subpopulations of varying School Achievement Level.4. BART with poststratification after imputing a point estimate of Prev-GPA for everypoststratification cell (BARP-I) [Bisbee, 2019] (a) Fit the BART model in 6 to the stratified cluster sample D .(b) Let c be a cell for poststratification matrix M and ( me c , g c , re c , s c , sa c , mc c ) be the correspondingcovariate of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education for cell c . Define ˆ v c to be a point estimate of Prev-GPA for poststratificationcell c , based off the sample D . There is no restriction on how to calculate ˆ v c – A straightforwardapproach would be to fit a hierarchical regression model with Prev-GPA as the outcome and thecovariates being School-Achievement × Minority-Composition × School × Gender × Race/Eth-nicity × Maternal-Education, and then take the posterior mean of the linear predictor evaluatedat ( me c , g c , re c , s c , sa c , mc c ) as the point estimate for ˆ v c .(c) Infer the posterior predictive distributions Y rep ,n ,c ∼ p ( y ∗ | z = 1 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c , D ) Y rep ,n ,c ∼ p ( y ∗ | z = 0 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c , D )with the model in Equation 6. Do this for all cells c in poststratification matrix M . For posteriorsof poststratification cells with no observation in D , we use the out-of-sample capabilities of dbarts to infer the posterior predictive distributions Y rep ,n ,c and Y rep ,n ,c .(d) Use the poststratified distribution (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) to get the estimated distribution for τ ATE , and E (cid:32) (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) |D (cid:33) as the point estimate for τ ATE .In general, let I O be the index set containing poststratification cells for the subpopulation O .This subpopulation does not need to be observed in our population. Use Equation 8ˆ τ CATE , O := (cid:80) j ∈ I O N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) (8)to get the estimated distribution for τ CATE , O , and E (ˆ τ CATE , O |D ) as the point estimate for τ CATE , O and V (ˆ τ CATE , O |D ) as the accompanying variance.5. MRP after imputing a point estimate of Prev-GPA for every poststratification cell (MRP-I) (a) Using D , fit a hierarchical linear model M Post-GPA with Post-GPA as the outcome, and Prev-GPAas a covariate. Assume the outcome is a truncated normal in (0 , . brms [B¨urkner, 2018, B¨urkner, 2017] will be used for fitting M Post-GPA . For more detailson the specification of M Post-GPA , refer to Section 5 of this manuscript.10b) Let c be a cell for poststratification matrix M and ( me c , g c , re c , s c , sa c , mc c ) be the correspondingcovariate of School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education for cell c . For every poststratification cell c , define ˆ v c to be a point estimateof Prev-GPA for poststratification cell c , based off the sample D (c) Using M Post-GPA , infer the posterior predictive distributions Y rep ,n ,c ∼ p ( y ∗ | z = 1 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c D ) Y rep ,n ,c ∼ p ( y ∗ | z = 0 , v = ˆ v c , me c , g c , re c , s c , sa c , mc c , D )for every poststratification cell c in M . For posteriors of poststratification cells with no observationin D , we use brms to infer the posterior predictive distributions Y rep ,n ,c and Y rep ,n ,c (More detailson this is in Subsection 5.2.3).(d) Use the poststratified distribution (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) to get the estimated distribution for τ ATE , and E (cid:32) (cid:80) Jc =1 N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) Jc (cid:48) =1 N c (cid:48) |D (cid:33) as the point estimate for τ ATE .In general, let I O be the index set containing poststratification cells for the subpopulation O .This subpopulation does not need to be observed in our population. Use Equation 8 to get theestimated distribution for τ CATE , O , and E (ˆ τ CATE , O |D ) as the point estimate for τ CATE , O and V (ˆ τ CATE , O |D ) as the accompanying variance.In our simulation studies, Step (b) for MRP-I and BARP-I is performed by first using the observedsample D to fit a hierarchical regression model M Prev-GPA with Prev-GPA as the outcome and thecovariates being School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education. M Prev-GPA in BARP-I is a BART specification and M Prev-GPA in MRP-I is ahierarchical regression model specification. For a more detailed specification of M Prev-GPA for MRP-Iin our simulation studies, refer to Section 5 of this manuscript. Once M Prev-GPA is fit, the posteriorpredictive distribution of Prev-GPA for every poststratification cell c based off M Prev-GPA has its meanset as ˆ v c .Posterior predictive distributions for the outcome Post-GPA are used for methods BARP-I and MRP-I,since they are inferring Y rep ,n ,c , Y rep ,n ,c for poststratification cells c that are not observed in D . Boththese methods are able to calculate CATEs for subpopulations not observed in D . On the other hand,OLS, SVY, BART are only able to estimate CATEs observed in D .As motivated in the previous subsection, including poststratification-style estimators such as BARP-I and MRP-I is sensible as we’re provided a poststratification matrix M . The current challenge ofperforming MRP in our simulation setting is that, if one wants to include non-census variable Prev-GPA as a covariate in the outcome model for MRP, then atleast one point estimate of Prev-GPA willhave to be used for every poststratification cell in M . BARP-I and MRP-I performs this by imputing anestimated Prev-GPA quantity for every poststratification cell. This is because M is a poststratificationmatrix of all combinations for the discrete covariates School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education, but there exists no discretization ofcontinuous covariate Prev-GPA where there is a poststratification matrix of Prev-GPA × School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education.This challenge of having truncated continuous covariate Prev-GPA with no poststratification matrix fora discretization of Prev-GPA motivates the application of the variant of MRP, MRP-MI. From a highlevel, MRP-MI estimates poststratification frequencies and uses these estimated frequencies to performpoststratification. This accounts for the full distribution of Prev-GPA in every poststratification cell c ,unlike what MRP-I and BARP-I does by taking Prev-GPA point estimate ˆ v c .11 Multilevel Regression and Poststratification with Multiple Im-putation
We use the MRP variant, Multilevel Regression and Poststratification with Multiple Imputation (MRP-MI),as a viable framework for estimating treatment effects for our simulation study. It can be applied when themodeler is given a poststratification matrix ˜ M ∈ R J × K for the K -dimensional categorical covariate vector˜ C ∈ R K and wants to also poststratify by a continuous covariate ˜ V ∈ R but that continuous covariate doesnot have population counts in poststratification matrix ˜ M .As mentioned in subsection 1.4, using Post-GPA - Prev-GPA as the outcome is not possible, thus we haveto include Prev-GPA as a covariate and Post-GPA as the outcome when modeling and estimating treatmenteffects in our causal inference problem (Figure 1.3). In our simulation study, the poststratification matrix M does not contain population counts for strata of Prev-GPA, but we’d still like to poststratify by Prev-GPA × School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Educationand MRP-MI is able to do this. The three steps below outline the framework for MRP-MI. More detail andmotivation for model specifications relevant to our simulation study are in Section 5. brms is used to fit thePost-GPA and Prev-GPA models used in the MRP-MI framework.1.
Model fitting step:
Fit a hierarchical regression model M Post-GPA to the observed sample D =( V i , C i , Z i , Y i ) ni =1 , where V i is Prev-GPA, C i = (ME i , G i , RE i , School i , SA i , MC i ) is the categoricalcovariate, and Z i is the treatment variable and Y i is the Post-GPA outcome for individual i . For moredetails on the specification of M Post-GPA in our simulation study, refer to Equation 19 in Section 5 ofthis manuscript.Additionally, fit a hierarchical regression model M Prev-GPA to the observed sample D , where Prev-GPA V i is the outcome and the covariates are C i . For more details on the specification of M Prev-GPA , referto Equation 16 in Section 5 of this manuscript.2.
Sampling for Post-GPA PPD step:
Let p ( y ∗ | c, z, v ∗ , D ) be the posterior predictive distributionof Post-GPA for the new datapoint ( c, z, v ∗ ) based on M Post-GPA . Let p ( v ∗ | c, D ) be the posteriorpredictive distribution of Prev-GPA for the new datapoint ( c, z ) based on M Prev-GPA . Define Y rep ,nz,c ∼ p ( y ∗ | c, z, D ), where p ( y ∗ | c, z, D ) = (cid:90) p ( y ∗ | c, z, v ∗ , D ) p ( v ∗ | c, z, D ) dv ∗ = (cid:90) p ( y ∗ | c, z, v ∗ , D ) p ( v ∗ | c, D ) dv ∗ , since M Prev-GPA does not have z as a covariate (9)Hence p ( y ∗ | c, z, D ) is the posterior predictive distribution of Post-GPA in poststratification cell c , afterintegrating out the posterior predictive distribution of Prev-GPA for cell c . To retrieve samples from p ( y ∗ | c, z, D ), the below two-step procedure performs it: Algorithm 1:
MRP-MI
Result:
Return 1000 samples ( y ∗ jz,c ) j =1 of Y rep ,nz,c ∼ p ( y ∗ | c, z, D ) for every combination of( z, c ) m = 1000; for all combinations of ( c, z ) dofor j ∈ { , . . . , m } do Sample v ∗ jc from p ( v ∗ | c, D );Sample y ∗ jz,c from p ( y ∗ | c, z, v ∗ jc , D ) endend For poststratification cells c not observed in D , the out-of-sample capabilities of brms are used tosample v ∗ jc and y ∗ jz,c . More details on how this is done can be found in Section 5.2.3.12he above algorithm uses the posterior predictive distribution of M Prev-GPA to impute Prev-GPA 1000times in every poststratification cell c , hence the acronym MI (multiple imputation). Note that theabove algorithm actually generates the samples (cid:16) θ [ j ] v , v ∗ jc , θ [ j ] y , y ∗ jc,z (cid:17) j =1 , where (cid:16) θ [ j ] v (cid:17) j =1 and (cid:16) θ [ j ] y (cid:17) j =1 are posterior samples of the Prev-GPA and Post-GPA model respectively. However, this way of sam-pling introduces a dependency structure into the joint random variable (after conditioning on sample D ) (cid:16) Y rep ,n , , . . . , Y rep ,n ,J , Y rep ,n , , . . . , Y rep ,n ,J (cid:17) To make samples from this joint random variable conditionally mutually independent (when conditionedon the dataset D ), we have to use mutually exclusive posterior draws of (cid:16) θ [ j ] v , θ [ j ] y (cid:17) j =1 . This meansusing the first J draws for Y rep ,n , , the second J draws for Y rep ,n , and so on. This wouldn’t befeasible if J is large, which in MRP applications it usually is.3. Poststratification step:
The estimated distribution for CATE of subpopulation O based on dataset D is ˆ τ CATE , O = (cid:80) c ∈ I O N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) (10)Based off the above algorithm, we get 1000 samples of the estimator defined in Equation 10 through (cid:80) c ∈ I O N c (cid:16) y ∗ j ,c − y ∗ j ,c (cid:17)(cid:80) c (cid:48) ∈ I O N c (cid:48) j =1 (11)We have the below proposition for the framework MRP-MI when it’s used to calculate the point estimateof the CATE for subpopulation O . The proof of this proposition is in Supplementary Material A. Proposition 1.
Let g z ( c, v ∗ ) = (cid:82) y ∗ p ( y ∗ | z, c, v ∗ , D ) dy ∗ be the expected value of the posterior predictivedistribution of Post-GPA model M Post-GPA under treatment Z = z , categorical covariate C = c and Prev-GPA V = v ∗ . Calculating E (ˆ τ CATE , O |D ) is integrating g ( c, v ∗ ) − g ( c, v ∗ ) with the continuous probabilitymeasure defined by (cid:110) N c (cid:80) c (cid:48)∈ I O N c (cid:48) p ( v ∗ | c, D ) dv ∗ (cid:111) ( c,v ∗ ) ∈ I O × (0 , . .If we consider g z ( c, v ∗ ) as an estimate for E ( Y ( z ) | C = c, V = v ∗ ) then Proposition 1 shows that E (ˆ τ CATE , O |D )is poststratifying g ( c, v ∗ ) − g ( c, v ∗ ) with the poststratification weights defined by P ( C = c, V = v ∗ | ( C, V ) ∈ O ) = P ( V = v ∗ | C = c, ( C, V ) ∈ O ) P ( C = c | ( C, V ) ∈ O )= p ( v ∗ | c, D ) N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (12)This is because (cid:110) p ( v ∗ | c, D ) N c (cid:80) c (cid:48)∈ I O N c (cid:48) (cid:111) ( c,v ∗ ) ∈ I O × (0 , . can be viewed as the poststratification weightsfor all combinations of Prev-GPA × School-Achievement × Minority-Composition × School × Gender × Race/Ethnicity × Maternal-Education when Prev-GPA is discretized very fine.In contrast to the methods BARP-I and MRP-I which impute Prev-GPA just once for every poststrat-ification cell, we see that MRP-MI (with M Post-GPA and M Prev-GPA specified in Section 5) accounts forthe full distribution of Prev-GPA in every poststratification cell, p ( v ∗ | c, D ), by integrating it out as seen inEquation 9. The trade-off of accounting for the whole distribution of Prev-GPA in every poststratificationcell is that the computational cost of getting 1000 samples for ˆ τ CATE , O is significantly higher.13 Simulation results
A detailed outline of the data generating process for our simulation study that modeled the NSLM studyis provided in this section. We first generate a finite population P from a superpopulation, then utilize astratified sampling scheme to generate the observed sample D .Simulation results of calculating CATEs for subpopulations of varying sizes is presented and the sixtreatment effect estimation methods are compared. We evaluate the efficacy of each estimation methodwith various metrics that measure the quality of point estimation, standard error and uncertainty intervalcoverage.The individuals in NSLM were recruited through a stratified cluster sample, where the clusters are schools.Stratification was performed on school-level categorical covariates, and the outcome variable of interest wasPost-GPA. In the end, around 70 high-schools had agreed to participate in the study, and randomizationof the growth mindset intervention was performed at the individual level of the stratified cluster sample.Despite the population of US public high-schools being 12 , th grade public high-school students in the United States. R [R Core Team, 2020] was the computing language used for the simulation studies and the ggplot2 package [Wickham, 2016] was used to generate the plots in this manuscript. brms was the library used forhierarchical model fitting. [B¨urkner, 2018, B¨urkner, 2017, Carpenter et al., 2017]. P Let i be the index of an individual, k be index of the school, s be index of the stratum. V iks is the previous-GPA for student i in school k in stratum s . Y iks is the post-GPA and Z iks is the treatment-control indicator.Page 31 of the appendix of the NSLM study [Yeager et al., 2019] contained descriptive statistics of thestudent-level and school-level covariates seen in Figure 1.3. These descriptive statistics were used to definethe coefficients used in the simulated version of the NSLM study as seen below.The sizes of each stratum, M s , are (2806 , , , , M s U School ks , T School ks just once: U School ks i.i.d ∼ N (0 , . ) T School ks i.i.d ∼ N (0 , . ) (13) U School ks , T School ks are on the same scale of the treatment effects seen in Extended Data Figure 2 in theoriginal NSLM study [Yeager et al., 2019]. P For each stratum, s , let M ks ∼ Poisson(200) be the school size for school k within stratum s . Then for i ∈ [ M ks ], we generate the covariates along with previous and post GPA conditionally independently:14 iks ∼ Bernoulli(0 . V iks ∼ N (0 , . ( µ X [ s ] , σ X [ s ])RE iks ∼ Cat ( p RE [ s ])ME iks ∼ Bernoulli ( p ME [ s ]) Y iks ( Z iks ) | G iks , V iks , RE iks , ME iks , U School ks , T School ks ∼ N (0 , . (Max(0 , Min(4 . , V iks + τ SA [ s ] Z iks + τ MC [ s ] Z iks + τ G [G iks ] Z iks + τ RE [RE iks ] Z iks + τ ME [ME iks ] Z iks + U School ks Z iks + T School ks )) , .
6) (14)G iks is the gender indicator, RE iks is the race/ethnicity, ME iks is the maternal education indicator, X iks is the previous GPA, Y iks is the post-GPA and Z iks is the treatment indicator. The coefficient vectors µ X , σ X , p RE , p ME , τ SA , τ MC , τ G , τ RE , τ ME are defined in the Supplementary Material C.We can assume that this data generating process for P = ( Y m (0) , Y m (1) , X m ) N P m =1 which is around 2.2million individuals is from the data generating process of i.i.d draws from a superpopulation ( Y (0) , Y (1) , X ).This is because fixing the strata sizes M s is equivalent to sampling from a 5-dimensional categorial variablewith mean being equal to 5-dimensional vector (cid:16) M s (cid:80) s (cid:48) =1 M s (cid:48) (cid:17) s =1 in the asymptotic regime. P The population P is generated once and then a high-quality stratified cluster sample D of around 12 , Stratified sample of clusters (schools):
Sample (28 , , , ,
27) schools randomly (without re-placement) from each of the 5 stratum respectively. This returns students in 140 schools.2.
Subsample clusters with response probabilities on units in clusters:
Select Poissson (cid:0) (cid:1) schools from the 140. Every student in the population has response probability invlogit − ( V iks ), theinverse-logit of the previous GPA for that student. Define this size n sample as D .The individual response probability invlogit − ( V iks ) is 92 percent. This agrees with the high responserate in the NSLM study [Yeager et al., 2019].Generating P is done using the R package DeclareDesign [Blair et al., 2019].
DeclareDesign assigns Z i at the the finite population level. That is, the assignment mechanism at the finite population level is completerandomization of ( Z i ) N P i =1 , giving (cid:80) N P i =1 Z i = N P . Because D is a very small sample of the finite population P ( D is 0.5 percent of P ), the Z i in D are approximately independent Bernoulli(0 .
5) samples. Hence theassignment mechanism for D can be thought of as a Bernoulli randomization—for every individual in D , weindependently assignment them a 50 percent probability of being in the treatment group. As expected, thestratified cluster sample D still maintains close to a 50 −
50 randomization between treatment and control.It’s clear to see that identifiability assumptions 1 - 4 are satisfied for D . SUTVA and Consistencyare satisfied through the data generating process for P , Ignorability is satisfied through the assignmentmechanism for ( Z i ) N P i =1 , and through the assignment mechanism we have P ( Z i = 1 | X i ) = 0 . P and thensample from it, we calculate CATEs and ATE at the superpopulation level.Generated D was fairly representative of the target population P . The Tipton Generalization Index[Tipton, 2014] for D along with the subgroups we analyze in each simulation iteration was calculated to benear 0 . generalize package [Ackerman et al., 2019] with random forests being theselection method). This matches the high Generalization Index of the sample in the original NSLM study. As the identifiability assumptions are satisfied for the data generating process of P , we can derive formulasfor τ CATE , O for any subpopulation O . 15iven that U School ks = u ks , T School ks = t ks for every stratum s and school k , let Y ( z ) be a random samplefrom the population defined in Equation 14, under treatment Z = z . The G-formula [Hern´an and Robins, 2020]applied to the data-generating process recovers the true ATE: E ( Y ( z )) = (cid:88) s (cid:88) k E ( Y ( z ) | z, s, u ks , t ks ) M s (cid:80) s (cid:48) M s (cid:48) M s = (cid:88) s (cid:88) k (cid:88) me (cid:88) g (cid:88) re (cid:18)(cid:90) . E ( Y ( z ) | z, s, me, g, re, v, u ks , t ks ) p ( v | s ) dv (cid:19) · p ( re | s ) p ( me | s ) p ( g ) M s (cid:80) s (cid:48) M s (cid:48) M s (15)Applying the G-formula for the whole target population gives τ ATE := E ( Y (1)) − E ( Y (0)) = 0 . E ( Y ( z ) | z, s, me, g, re, k ) = (cid:82) . E ( Y ( z ) | z, s, me, g, re, v, u ks , t ks ) p V | S ( v | s ) dv is the expectedvalue of Post-GPA under Z = z for poststratification cell School-Achievement × Minority-Composition × Gender × Race/Ethnicity × Maternal-Education × School = ( s, g, re, me, k ), after integrating out Prev-GPA. Hence summing E ( Y ( z ) | z, s, me, g, re, k ) P ( s, me, g, re, k | X ∈ O ) over all the poststratification cells insubpopulation O returns E ( Y ( z ) | X ∈ O ), which in turn will result in the CATE τ CATE , O . M Prev-GPA and M Post-GPA which areused in BARP-I, MRP-I, MRP-MI
Prior predictive checks [Gabry and Mahr, 2021, Gabry et al., 2019, Gelman et al., 2013] as seen in Figure 1were used to calibrate the prior specifications of the hierarchical models M Prev-GPA and M Post-GPA . Fittedwith the package brms , both hierachical models assume a truncated normal outcome in the range (0 , . M Prev-GPA and M Post-GPA , models which are used in MRP-MI, MRP-Iand BARP-I. (Left) The dark-blue line corresponds to the density of Prev-GPA observations in D . The 200light-blue lines correspond to the densities of 200 prior predictive draws of Prev-GPA for individuals in D .(Right) The dark-blue line corresponds to the density of Post-GPA observations in D . The 200 light-bluelines correspond to the densities of 200 prior predictive draws of Post-GPA for individuals in D . M Prev-GPA
The Prev-GPA model is fit to the stratified cluster sample D (a new fit happens for every simulation iterationsince a new D is sampled from P ). 16or the i th individual, define α RE j [ i ] , α School j [ i ] , α MC j [ i ] , α SA j [ i ] , α SA,MC j [ i ] to be their unique random intercepts forthe covariates Race/Ethnicity, School, Minority Composition, School Achievement, Minority Composition × School Achievement respectively.The outcome is modeled as: V i | ME i , G i , α RE j , α School j , α MC j , α SA j , α SA,MC j ∼ N (0 , . ( β + β ME ME i + β G G i + α RE j [ i ] + α School j [ i ] + α MC j [ i ] + α SA j [ i ] + α SA,MC j [ i ] , σ (cid:17) (16)Covariates = RE, School, MC, SA, and SA,MC have atleast 3 levels and have their priors modeled as: α Covariates j | σ Covariates i.i.d ∼ N (cid:0) , ( σ Covariates ) (cid:1) σ Covariates ∼ N + (cid:0) , . (cid:1) (17)Covariates = ME and G are binary and have their priors (along with the global intercept) modeled as: β Covariates ∼ N (cid:0) , . (cid:1) β ∼ N (cid:0) . , . (cid:1) (18) M Post-GPA
The Post-GPA model is fit to the stratified cluster sample D (a new fit happens for every simulation iterationsince a new D is sampled from P ).For the i th individual, define γ RE j [ i ] , γ School j [ i ] , γ MC j [ i ] , γ SA j [ i ] , γ SA,MC j [ i ] to be their unique random intercepts when Z i = 1, for the covariates Race/Ethnicity, School, Minority Composition, School Achievement, MinorityComposition × School Achievement respectively.The outcome is modeled as: Y i | V i , Z i , ME i , G i , α RE j , α School j , α MC j , α SA j , α SA,MC j ∼ N (0 , . ( β + Offset( V i ) + β ME ME i + β G G i + γ ME ME i Z i + γ G G i Z i + α RE j [ i ] + α School j [ i ] + α MC j [ i ] + α SA j [ i ] + α SA,MC j [ i ] + γ RE j [ i ] Z i + γ School j [ i ] Z i + γ MC j [ i ] Z i + γ SA j [ i ] Z i + γ SA,MC j [ i ] Z i , σ (cid:17) (19)Covariates = RE, School, MC, SA, and SA,MC have atleast 3 levels and have their priors modeled as: α Covariates j | σ Covariates i.i.d ∼ N (cid:0) , ( σ Covariates ) (cid:1) σ Covariates ∼ N + (cid:0) , . (cid:1) γ Covariates j | σ Covariates γ i.i.d ∼ N (0 , ( σ Covariates γ ) ) σ Covariates γ ∼ N + (0 , . )Cor( γ Covariates j , α Covariates j ) ∼ Default prior in brms (20)Covariates = ME and G are binary and have their priors (along with the global intercept) modeled as: β Covariates ∼ N (cid:0) , . (cid:1) γ Covariates ∼ N (cid:0) , . (cid:1) β ∼ N (cid:0) , . (cid:1) (21)17 .2.3 Out-of-sample prediction with M Prev-GPA and M Post-GPA
The treatment effect estimation methods MRP-I and MRP-MI use M Prev-GPA and M Post-GPA to samplefrom the Post-GPA and Prev-GPA posterior predictive distributions ( Y rep,n z,c , V rep,n z,c ) of poststratificationcells c . BARP-I uses M Prev-GPA to sample from the Prev-GPA posterior predictive distribution V rep,n z,c ofpoststratification cells c .In our simulation studies, inferring posterior predictive samples of ( Y rep,n z,c , V rep,n z,c ) for a cell c not observedin D requires sampling from new levels j of the posteriors α School j and γ School j . This is because D is a stratifiedcluster sample consisting of around 70 schools (school is a cluster), leaving 11000 other unobserved schools inthe target population. Sampling from the posterior of α School j for an unobserved school j is done by samplingfrom a normal distribution with mean implied by the posteriors α School j ∗ of all the observed schools j ∗ , andstandard deviation implied by the posterior distribution of σ School . Likewise, sampling from the posterior of γ School j for an unobserved school j is done by sampling from a normal distribution with mean implied by theposteriors γ School j ∗ of all the observed schools j ∗ , and standard deviation implied by the posterior distributionof σ School γ . Point estimation quality, standard error quality, and quality of uncertainty interval is evaluated through thethree metrics when estimating CATE for subpopulation O . Let M be the number of simulation iterations.We will sample D , . . . , D M from the same target population P ( P resets after each sample) in each newsimulation iteration based on the same stratified sampling scheme.For each of the six estimation methods, based on sample D m , define E (ˆ τ CATE , O |D m ) to be the pointestimate for τ CATE , O , and ˆ σ m,CATE , O to be the standard error that accompanies the point estimate. τ CATE , O (MSE) MSE measures the quality of the point estimate for each of the 6 treatment effect estimation methods inour simulation study. For each method and for every simulation iteration, we calculate the empirical MSEin Equation 22: L MSE ( D , . . . , D M ) := 1 M M (cid:88) m =1 ( E (ˆ τ CATE , O |D m ) − τ CATE , O ) = (cid:32) M M (cid:88) m =1 E (ˆ τ CATE , O |D m ) − τ CATE , O (cid:33) + 1 M M (cid:88) m =1 (cid:32) E (ˆ τ CATE , O |D m ) − M M (cid:88) m (cid:48) =1 E (ˆ τ CATE , O |D m (cid:48) ) (cid:33) (22) τ CATE , O (PSR) To simultaneously evaluate the quality of a point estimate and the accompanying standard error estimate,we utilize a metric from decision theory, which only depends on the first and the second moment of aprobabilistic forecast. Consider the following metric [Gneiting and Raftery, 2007]: L PSR ( D , . . . , D M ) := − M M (cid:88) m =1 (cid:18) τ CATE , O − E (ˆ τ CATE , O |D m )ˆ σ m,CATE , O (cid:19) − M M (cid:88) m =1 log(ˆ σ , O ) (23)A higher PSR value corresponds to a higher quality prediction based off this metric.18 .3.3 Coverage probabililty of uncertainty intervals Lastly, we will measure the quality of the uncertainty interval generated by the estimation method. Let l ( m )0 . (ˆ τ CATE , O ) , u ( m )0 . (ˆ τ CATE , O ) be the bounds of the symmetric 95-percent uncertainty interval for thepoint estimate E (ˆ τ CATE , O |D m ). The coverage probability of the uncertainty interval is then: L Coverage ( D , . . . , D M ) := 1 M M (cid:88) m =1 (cid:16) l ( m )0 . (ˆ τ CATE , O ) ≤ τ CATE , O ≤ u ( m )0 . (ˆ τ CATE , O ) (cid:17) (24) × lowest minoritycomposition and CATE for African Americans in lowest achieving schools Figure 2 shows the 5 competing methods, along with MRP-MI and their point estimates for the ATE andtwo CATEs. As the target group in the population gets smaller, the traditional treatment effect estimationmethods, SVY and OLS in particular, produce more variable point estimates.As the area gets smaller, to the point where we are looking at 3.2% of the population, MRP-MI’s pointestimates of the target CATE remain tight around the truth. The frequentist regression methods SVYand OLS exhibit higher spread in point estimates (though they remain unbiased). This isn’t surprising, asBayesian methods (MRP-MI, BARP-I, BART, MRP-I) tend to outperform frequentist approaches (SVY,OLS) in the presence of limited data. In contrast to this, calculating the ATE uses the full sample D andthis corresponds to 100% of the population. All the methods except MRP-I have point estimate quality asnearly identical when calculating the ATE.Figure 2: Left: Point estimate of ATE based on the full stratified cluster sample of around 12000 observations.Middle: Point estimate of CATE for individuals in the highest-achieving schools with the lowest minoritycomposition. Right: Point estimate of CATE for individuals in the lowest-achieving schools and are African-Americans. The labels in the top-left of each plot are the proportion of the population that the grouprepresents and the expected sample size of that group in D . 100 simulations were conducted. The redhorizontal line is the true treatment effect τ CATE , O for the specific subpopulation O on the plot.In Figure 2, we visualized point estimates for the ATE and two CATEs of varying sizes in the targetpopulation (size is in reference to the percent of the population in the top-left of the three plots in Figure2). Table 2 contains the MSE and PSR for each of the six methods. MRP-MI is shown to have the highestPSR based on Table 2, implying that it strikes the best balance of producing accurate point estimateswith reasonable standard errors. Based on Table 2, MRP-MI is shown to have the best MSE, except forcalculating the CATE corresponding to the group of individuals who are in the lowest achieving schools andare African-American (Race/Eth. coded as 2). However, the MSE of MRP-MI (10.70) is still close to beingthe lowest for this CATE (9.60).An important detail to note is that, even though MRP-I appears to be very similar to MRP-MI in pointestimate quality when just looking at Figure 2, we can see from Table 2 that the MSE values for the ATE19re drastically different. MRP-MI’s MSE is more than twice as small as MRP-I’s MSE when the targetestimand is the ATE. This decrease in point estimation quality is attributed to MRP-I not accounting forthe full Prev-GPA distribution in every poststratification cell c . It appears that this MSE difference betweenMRP-MI and MRP-I gets smaller as the target subpopulation becomes a smaller proportion of the targetpopulation. Model ATE (MSE × − ) ATE (PSR) SA3 MC1 (MSE × − ) SA3 MC1 (PSR) SA1 RE2 (MSE × − ) SA1 RE2 (PSR)OLS 1.59 7.30 7.83 6.12 50.44 4.31SVY 1.40 7.83 7.94 5.87 46.29 4.33BART 1.64 7.54 10.90 3.18 21.55 0.13BARP-I 1.97 7.55 15.17 3.67 19.11 3.05MRP-I 2.97 7.07 5.55 6.44 Table 2: Mean square error ( L MSE ) and the proper scoring rule ( L PSR ) value for the ATE, CATE for thegroup of individuals who are in the highest achieving schools with the lowest minority composition, andCATE for the group of individuals who are in the lowest achieving schools and are African-American. 100simulations were conducted. The best metric value in each column is in bold. Values are rounded to twodecimal places.The coverage probabilities in Table 3 shows that when calculating the ATE, all six methods have well-calibrated uncertainty intervals as they are all close to the nominal value of 95%. As the target subpopulationgets smaller (size is in reference to the percent of the population in the top-left of the three plots in Figure2), we see that BARP-I and BART have poor calibration. The frequentist methods SVY and OLS retainintervals with 95% coverage probability.Model ATE SA3 MC1 SA1 RE2OLS 0.93 0.93 0.94SVY 0.97 0.91 0.93BART 0.90 0.58 0.45BARP-I 0.91 0.76 0.60MRP-I 0.87 0.98 0.99MRP-MI 0.97 0.97 0.98Table 3: Coverage probabilities L coverage of 95 uncertainty intervals based on the various treatment effectestimation methods. The estimands are the ATE, CATE for the group of individuals who are in the highestachieving schools with the lowest minority composition, and CATE for the group of individuals who are inthe lowest achieving schools and are African-American. 100 simulations were conducted.In general, we see that MRP-MI is able to capture the true heterogeneity of treatment effects basedon Figure 2 and the MSE columns of Table 2, even for areas as small as 3.2% of the target population.MRP-MI produces low variance and low absolute bias estimates as seen in Figure 2. The coverage ofthe 95% uncertainty intervals for MRP-MI remains well-calibrated and hence close to 95% coverage whencalculating the ATE (whole target population) and calculating CATEs for smaller subpopulations of thetarget population. MRP-MI appears to have the highest quality for point estimation, standard errors, andinterval coverage.The more simplistic methods MRP-I and BARP-I do not account for the full distribution of the continuousnon-census variable Prev-GPA in every poststratification cell as they impute only one Prev-GPA value,whereas MRP-MI accounts for the full distribution of Prev-GPA at the cost of additional computation. Theeffect of not accounting for the full Prev-GPA distribution is most drastic when the estimand is the ATE.MSE, PSR and confidence interval coverage of BARP-I and MRP-I is worse than for MRP-MI. Studyingthe effects of not accounting for the full distribution of a non-census variable is a subject of future MRPresearch. 20 .5 Estimating the CATEs of the five subpopulation groups in highest achievingschools × lowest minority composition × race/ethnicity To further test the extent that MRP-MI and the other five methods capture treatment effect heterogeneityfor small areas, we look at subpopulations as small as 1.3% of the target population.Specifically, the target subpopulations are the five race/ethnicities in the stratum highest achievingschools × lowest minority composition. Figure 3 shows the CATE point estimates for CATEs of these fivesubpopulations.Figure 3: Point estimates of τ CATE , O where O corresponds to the five race/ethnicities in the stratum highestachieving schools × lowest minority composition. 100 simulations were conducted. The labels in the top-leftof each plot are the proportion of the population that the group represents and the expected sample size forthat group in D . The red horizontal line is the true treatment effect τ CATE , O for the specific subpopulation O on the plot.Figure 3 shows the five subpopulations ranging from 1.3% of the population to 10.3% of the population.Their expected sample sizes in D vary from 110 to 890 individuals. One average, the response rates areequal amongst the five subpopulations in the stratum of highest achieving schools with the lowest minoritycomposition based off our experimental design for collecting D .In the stratum of highest achieving schools with the lowest minority composition, the subpopulation withthe smallest expected sample size of 110 in D are Asian individuals. From Figure 3 we see that the Bayesianestimation methods (MRP-MI, MRP-I, BART, BARP-I) have lower variance amongst their point estimatesthan the frequentist methods’ (SVY, OLS) point estimates.In the stratum of highest achieving schools with the lowest minority composition, the subpopulation withthe highest expected sample size of 890 in D are White individuals. From Figure 3, we see that the Bayesianmethods (MRP-MI, MRP-I, BART, BARP-I) appear more similar to the frequentist methods (SVY, OLS).Unsurprisingly, the frequentist methods (SVY, OLS) have the highest variance estimates in the presence oflimited data such as in the subpopulations of Asian individuals and African-American individuals in Figure3. From Figure 3 MRP-MI and MRP-I appear quite similar and both have a good balance of generatinglow-variance and low-bias estimates. This implies that the MSE of MRP-MI and MRP-I should be scoredhighly amongst the competing methods. Indeed this is true from Table 4 — MRP-MI has the best MSEfor all five CATEs and MRP-I has the second best MSE for all five CATEs. The shrinkage of the Bayesiannonparametric tree-based methods (BART, BARP-I) with their default priors appear to be too strong forestimating the heterogeneous treatment effects of small areas, since they are shown to be the most biasedyet have the lowest variance in the five panels of Figure 3.Table 4 further shows that MRP-MI has the best PSR values for each of the five CATEs. There is alarge drop off in estimation quality for the Bayesian nonparametric tree-based methods (BART, BARP-I),as seen by their low PSR values for calculating CATEs of Asian individuals (RE1) and individuals of otherrace/ethnicity (RE5). 21odel RE1 RE2 RE3 RE4 RE5OLS (MSE × − ) 118.90 93.98 36.09 12.07 46.79SVY (MSE × − ) 103.65 81.45 36.88 12.21 45.08BART (MSE × − ) 54.83 9.12 18.79 8.99 46.32BARP-I (MSE × − ) 54.13 8.76 19.38 8.95 49.49MRP-I (MSE × − ) 14.27 7.10 6.29 5.56 10.70MRP-MI (MSE × − ) OLS (PSR) 3.49 3.68 4.60 5.78 4.34SVY (PSR) 3.57 3.78 4.21 5.61 3.91BART (PSR) -9.05 5.94 1.46 5.88 -7.08BARP-I (PSR) -2.87 5.60 3.31 5.52 -1.98MRP-I (PSR) 5.46 6.12 6.25 6.41 5.80MRP-MI (PSR)
Table 4: Mean square error ( L MSE ) and the average proper scoring rule value ( L PSR ) based on the varioustreatment effect estimation methods. The first six rows correspond to L MSE estimates and the last six rowscorrespond to L PSR estimates. The estimands are CATEs for all five race/ethnicities in the subpopulationof highest achieving schools × lowest minority composition. RE1 corresponds to Asian individuals, RE2corresponds to African-American individuals, RE3 corresponds to Hispanic individuals, RE4 corresponds toWhite individuals, RE5 corresponds to individuals of other race/ethnicity. 100 simulations were conducted.The best metric value in each column is in bold. Values are rounded to two decimal placesTable 5 shows that MRP-MI has well-calibrated 95% uncertainty intervals despite the heterogeneity oftreatment effects across the five small subpopulations. The Bayesian nonparametric tree-based methods(BARP-I, BART) have the worst calibration as their intervals’ coverage probabilities are quite far from 95percent. Model RE1 RE2 RE3 RE4 RE5OLS 0.94 0.94 0.96 0.98 0.99SVY 0.95 0.92 0.89 0.94 0.93BART 0.21 0.84 0.50 0.82 0.34BARP-I 0.28 0.90 0.72 0.89 0.38MRP-I 0.99 0.99 1.00 0.98 0.96MRP-MI 0.99 0.98 0.98 0.97 0.94Table 5: Coverage probabilities L coverage of 95 uncertainty intervals based on the various treatment effectestimation methods. The estimands are CATEs for the five race/ethnicities in the subpopulation highestachieving schools × lowest minority composition. RE1 corresponds to Asian individuals, RE2 correspondsto African-American individuals, RE3 corresponds to Hispanic individuals, RE4 corresponds to White indi-viduals, RE5 corresponds to individuals of other race/ethnicity. 100 simulations were conducted.In summary, MRP-MI appears to be the most effective in estimating treatment effects for very largeareas (ATE) to very small areas in the target population when there is treatment effect heterogeneity.The accompanying standard errors and uncertainty intervals produced by MRP-MI are also well-calibrated.Frequentist methods (SVY, OLS) appear to perform well when the observed sample D is sufficiently largebut have extremely high variance estimates when they’re used to calculate CATEs for small-areas, as data ismore limited. Bayesian nonparametric tree-based methods (BART, BARP-I) have default priors that exhibitvery strong shrinkage when compared to hierarchical modeling approaches (MRP-MI, MRP-I) for calculatingheterogeneous CATEs in small areas and furthermore their uncertainty intervals have poor coverage whenthe target area gets smaller. A further area of research would be to perform prior calibration such as priorpredictive checks for these Bayesian nonparametric methods.Even though the point estimates Figures 2 and 3 show that MRP-I and BARP-I have comparable pointestimates to MRP-MI, we do not recommend using MRP-I and BARP-I for a causal inference problemsimilar to the one in our simulation study. They do not account for the full distribution of the continuous22ovariate Prev-GPA, whereas MRP-MI does.MRP-MI, MRP-I and BARP-I get posterior predictive samples fromˆ τ CATE , O = (cid:80) c ∈ I O N c (cid:0) Y rep ,n ,c − Y rep ,n ,c (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) As mentioned in Section 4, MRP-MI introduces a dependency structure (after conditioning on D ) to (cid:16) Y rep ,n , , . . . , Y rep ,n ,J , Y rep ,n , , . . . , Y rep ,n ,J (cid:17) (25)since the same 1000 posterior samples from M Prev-GPA and M Post-GPA are used to sample from Y rep ,nz,c for every z and c . MRP-I and BARP-I use the same 1000 posterior samples from their respective outcomemodel fits to get samples from Y rep ,nz,c for every z and c as well, and hence a dependency structure existsin Equation 25 for MRP-I and BARP-I as well. The accompanying standard error (cid:112) V (ˆ τ CATE , O |D ) forthe point estimate of τ CATE , O is affected by the dependency structure in Equation 25 and studying thisrelationship for MRP-I, BARP-I and in particular MRP-MI, is an unexplored area. So far, the subpopulations O where CATEs are estimated are subpopulations that are observed in the strat-ified cluster sample D . BARP-I, MRP-I and MRP-MI use the estimated distribution ˆ τ CATE , O to calculateCATE point estimates and this can be done even for O not observed in D . In this subsection, we showcasethe extent that MRP-MI is able to estimate τ CATE , O for all the schools in the target population P .Stratified cluster sample D in our simulation study contains around 70 schools for every new simulationiteration and thus there are around 11000 schools left unobserved. We are however given poststratificationcell sizes N c for every cell c in the population and thus are able to calculate ˆ τ CATE , O for every school O using MRP-MI (as well as MRP-I and BARP-I). Figure 4 shows MRP-MI point estimates E (ˆ τ CATE , O |D ) forevery school in our simulation studies.From the middle histogram of Figure 4, we see that point estimates of all 11221 schools E (ˆ τ CATE , O |D )from MRP-MI are not able to capture the full variation of CATEs τ CATE , O using D , even though we’regiven poststratification counts N c for all the cells c in schools not in D . In fact, the out-of-sample CATEestimates of MRP-MI in the middle histogram appear to be quite similar to the in-sample CATE estimatesof MRP-MI in the bottom histogram. This implies that even with an external poststratification matrix onthe full population, it remains difficult to calculate treatment effects for subpopulations not observed in thesample when using MRP-MI. We observed very similar challenges for MRP-I and BARP-I.The top scatterplot further reinforces the claim that MRP-MI is limited when calculating CATEs forschools not observed in D . In summary, MRP-MI (along with MRP-I and BARP-I) clusters CATE cal-culations for all the schools around the school-level clusters seen in Table 5.1 with the current amount ofinformation provided through D and poststratification matrix M . Improving treatment effect estimates ofMRP-style estimators for subpopulations not observed in the sample is an open research area. The primary goal of this manuscript was investigating the capabilities of MRP-style estimators in experi-mental causal inference. Through simulation studies, we studied the limitations of MRP-style estimatorsin the problem of estimating heterogeneous treatment effects for subpopulations as small as 1.3 percent ofthe target population. We modeled our simulation studies after the National Student of Learning Mindsetsstudy [Yeager et al., 2019], an experimental dataset that was fairly representative of target population, yetexhibited treatment effect heterogeneity.We also build upon a variant of MRP (MRP-MI) that is able to incorporate non-census variables (alongwith their full distributions if they’re continuous) into the poststratification step. In our simulations thatmodeled the NSLM study, MRP-MI along with other MRP-style estimators (MRP-I and BARP-I) are23igure 4: (Top) Average of CATE point estimates from MRP-MI for all 11221 schools based on 100simulations. Every dot corresponds to a specific school. The x-axis are averages of our point estimates E (ˆ τ CATE , O |D ). The y-axis are the true CATEs τ CATE , O for every school. The dots are colour-coded by theirschool strata (Table 5.1). (Middle) The point estimates of the CATE for every school from MRP-MI for onesimulation iteration. (Bottom) The point estimates of the CATE for all schools in D from MRP-MI for thesame simulation iteration as the middle histogram.compared against modern Bayesian nonparametric causal inference methods [Hill, 2011] and frequentistregression methods [Lumley, 2020].We showed that MRP-style estimators are effective at point estimation of conditional average treatmenteffects (CATE) for subpopulations of various sizes in the target population. When compared to the othertreatment effect estimation methods in our simulations, MRP-MI strikes the best balance of having high-quality point estimates (low bias and variance in point estimates) and reasonable standard errors, alongwith well-calibrated uncertainty intervals for CATE estimation. The applications of the MRP-MI frameworkextend beyond causal inference and can be used in a traditional survey estimation problem where the modeleris solely interested in estimating population-level quantities, but this avenue of research is not considered inour manuscript.Though the main focus of this manuscript was on treatment effect heterogeneity in small areas forstratified cluster experiments that are representative of the target population, we believe that MRP-styleestimators such as MRP-MI can excel in dealing with nonrepresentative experimental data as well. MRP hastraditionally been used as a model-based approach to generalizing survey estimates to a target population andwe believe that this can be extended to generalization studies in experimentation. Comparing MRP-styleestimators to propensity-score based generalization techniques in causal inference [Ackerman et al., 2020,Ackerman et al., 2019, Stuart et al., 2018] is an open area of research. Extending the analysis done in this24anuscript to the setting of observational causal inference is also an open area of research.When provided with a poststratification matrix of the target population along with experimental data,we showed through simulations that MRP-style estimators improve upon standard causal inference methods,especially in scenarios when the modeler wants to uncover treatment effect heterogeneity across small areas inthe target population. MRP-MI is an effective framework for incorporating continuous non-census variablesinto poststratification. As such, we see MRP-MI as a useful approach for modelers when dealing withexperimental data and poststratification information on not necessarily all of the variables in the targetpopulation. References [Ackerman et al., 2020] Ackerman, B., Lesko, C. R., Siddique, J., Susukida, R., and Stuart, E. A. (2020).Generalizing randomized trial findings to a target population using complex survey population data. arXivpreprint arXiv:2003.07500 .[Ackerman et al., 2019] Ackerman, B., Schmid, I., Rudolph, K. E., Seamans, M. J., Susukida, R., Mojtabai,R., and Stuart, E. A. (2019). Implementing statistical methods for generalizing randomized trial findingsto a target population.
Addictive behaviors , 94:124–132.[Athey and Imbens, 2017] Athey, S. and Imbens, G. W. (2017). The econometrics of randomized experi-ments. In
Handbook of economic field experiments , volume 1, pages 73–140. Elsevier.[Athey and Wager, 2019] Athey, S. and Wager, S. (2019). Estimating treatment effects with causal forests:An application. arXiv preprint arXiv:1902.07409 .[Bisbee, 2019] Bisbee, J. (2019). BARP: Improving Mister P using Bayesian additive regression trees.
Amer-ican Political Science Review , 113(4):1060–1065.[Blair et al., 2019] Blair, G., Cooper, J., Coppock, A., and Humphreys, M. (2019). Declaring and diagnosingresearch designs.
American Political Science Review , 113:838–859.[B¨urkner, 2017] B¨urkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan.
Journal of Statistical Software , 80(1):1–28.[B¨urkner, 2018] B¨urkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms.
The R Journal , 10(1):395–411.[Carpenter et al., 2017] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M.,Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan : A probabilistic programming language.
Journal of Statistical Software , 76(1).[Chipman et al., 2010] Chipman, H. A., George, E. I., McCulloch, R. E., et al. (2010). BART: Bayesianadditive regression trees.
The Annals of Applied Statistics , 4(1):266–298.[Colnet et al., 2020] Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.-P., Josse, J.,and Yang, S. (2020). Causal inference methods for combining randomized trials and observational studies:a review. arXiv preprint arXiv:2011.08047 .[Deming and Stephan, 1940] Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment ofa sampled frequency table when the expected marginal totals are known.
The Annals of MathematicalStatistics , 11(4):427–444.[Ding et al., 2017] Ding, P., Li, X., and Miratrix, L. W. (2017). Bridging finite and super population causalinference.
Journal of Causal Inference , 5(2).[Dorie et al., 2018] Dorie, V., Chipman, H., and McCulloch, R. (2018). dbarts: Discrete Bayesian additiveregression trees sampler. 25Dorie et al., 2019] Dorie, V., Hill, J., Shalit, U., Scott, M., Cervone, D., et al. (2019). Automated versusdo-it-yourself methods for causal inference: Lessons learned from a data analysis competition.
StatisticalScience , 34(1):43–68.[DuGoff et al., 2014] DuGoff, E. H., Schuler, M., and Stuart, E. A. (2014). Generalizing observational studyresults: applying propensity score methods to complex surveys.
Health services research , 49(1):284–303.[Gabry and Mahr, 2021] Gabry, J. and Mahr, T. (2021). bayesplot: Plotting for bayesian models. R packageversion 1.8.0.[Gabry et al., 2019] Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., and Gelman, A. (2019). Visual-ization in Bayesian workflow.
Journal of the Royal Statistical Society: Series A (Statistics in Society) ,182(2):389–402.[Gao et al., 2020] Gao, Y., Kennedy, L., Simpson, D., Gelman, A., et al. (2020). Improving multilevelregression and poststratification with structured priors.
Bayesian Analysis .[Gelman, 2018] Gelman, A. (2018). Mrp (or rpp) with non-census variables.
Statistical Modeling, CausalInference, and Social Science Blog .[Gelman et al., 2013] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B.(2013).
Bayesian data analysis . CRC press.[Gelman and Hill, 2006] Gelman, A. and Hill, J. (2006).
Data analysis using regression and multilevel/hier-archical models . Cambridge university press.[Gelman and Little, 1997] Gelman, A. and Little, T. C. (1997). Poststratification into many categories usinghierarchical logistic regression.[Gneiting and Raftery, 2007] Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction,and estimation.
Journal of the American statistical Association , 102(477):359–378.[Green and Kern, 2012] Green, D. P. and Kern, H. L. (2012). Modeling heterogeneous treatment effects insurvey experiments with bayesian additive regression trees.
Public opinion quarterly , 76(3):491–511.[Gruber and Van Der Laan, 2009] Gruber, S. and Van Der Laan, M. J. (2009). Targeted maximum likelihoodestimation: A gentle introduction.[Hahn et al., 2020] Hahn, P. R., Murray, J. S., Carvalho, C. M., et al. (2020). Bayesian regression treemodels for causal inference: regularization, confounding, and heterogeneous effects.
Bayesian Analysis .[Hern´an and Robins, 2020] Hern´an, M. and Robins, J. (2020). Causal inference: What if.
Chapman &Hall/CRC; Boca Raton: 2020 .[Hill, 2011] Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference.
Journal of Computa-tional and Graphical Statistics , 20(1):217–240.[Imbens and Rubin, 2015] Imbens, G. W. and Rubin, D. B. (2015).
Causal inference in statistics, social,and biomedical sciences . Cambridge University Press.[Kang et al., 2007] Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparisonof alternative strategies for estimating a population mean from incomplete data.
Statistical science ,22(4):523–539.[Kastellec et al., 2015] Kastellec, J. P., Lax, J. R., Malecki, M., and Phillips, J. H. (2015). Polarizingthe electoral connection: Partisan representation in supreme court confirmation politics.
The journal ofpolitics , 77(3):787–804.[Kennedy and Gelman, 2019] Kennedy, L. and Gelman, A. (2019). Know your population and know yourmodel: Using model-based regression and poststratification to generalize findings beyond the observedsample. arXiv preprint arXiv:1906.11323 . 26Kern et al., 2016] Kern, H. L., Stuart, E. A., Hill, J., and Green, D. P. (2016). Assessing methods forgeneralizing experimental impact estimates to target populations.
Journal of research on educationaleffectiveness , 9(1):103–127.[K¨unzel et al., 2019] K¨unzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for esti-mating heterogeneous treatment effects using machine learning.
Proceedings of the national academy ofsciences , 116(10):4156–4165.[Lauderdale et al., 2020] Lauderdale, B. E., Bailey, D., Blumenau, J., and Rivers, D. (2020). Model-basedpre-election polling for national and sub-national outcomes in the us and uk.
International Journal ofForecasting , 36(2):399–413.[Lax and Phillips, 2009] Lax, J. R. and Phillips, J. H. (2009). How should we estimate public opinion in thestates?
American Journal of Political Science , 53(1):107–121.[Lax et al., 2019] Lax, J. R., Phillips, J. H., Zelizer, A., et al. (2019). The party or the purse? unequalrepresentation in the us senate.
American Political Science Review , 113(4):917–940.[Little, 1993] Little, R. J. (1993). Post-stratification: a modeler’s perspective.
Journal of the AmericanStatistical Association , 88(423):1001–1012.[Lohr, 2009] Lohr, S. L. (2009).
Sampling: design and analysis . Nelson Education.[Lumley, 2020] Lumley, T. (2020). survey: analysis of complex survey samples. R package version 4.0.[Lumley and Scott, 2017] Lumley, T. and Scott, A. (2017). Fitting regression models to survey data.
Sta-tistical Science , pages 265–278.[Mercer et al., 2017] Mercer, A. W., Kreuter, F., Keeter, S., and Stuart, E. A. (2017). Theory and practice innonprobability surveys: parallels between causal inference and survey inference.
Public Opinion Quarterly ,81(S1):250–271.[Miratrix et al., 2018] Miratrix, L. W., Sekhon, J. S., Theodoridis, A. G., and Campos, L. F. (2018). Worthweighting? how to think about and use weights in survey experiments.
Political Analysis , 26(3):275–291.[Park et al., 2004] Park, D. K., Gelman, A., and Bafumi, J. (2004). Bayesian multilevel estimation withpoststratification: state-level estimates from national polls.
Political Analysis , 12(4):375–385.[Pearl, 1995] Pearl, J. (1995). Causal diagrams for empirical research.
Biometrika , 82(4):669–688.[R Core Team, 2020] R Core Team (2020).
R: A Language and Environment for Statistical Computing . RFoundation for Statistical Computing, Vienna, Austria.[Rubin, 1974] Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomizedstudies.
Journal of educational Psychology , 66(5):688.[Rubin, 1980] Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomizationtest comment.
Journal of the American Statistical Association , 75(371):591–593.[Stuart et al., 2018] Stuart, E. A., Ackerman, B., and Westreich, D. (2018). Generalizability of randomizedtrial results to target populations: Design and analysis possibilities.
Research on social work practice ,28(5):532–537.[Stuart et al., 2011] Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J. (2011). The use of propensityscores to assess the generalizability of results from randomized trials.
Journal of the Royal StatisticalSociety: Series A (Statistics in Society) , 174(2):369–386.[Tipton, 2014] Tipton, E. (2014). How generalizable is your experiment? an index for comparing experi-mental samples and populations.
Journal of Educational and Behavioral Statistics , 39(6):478–501.27Wager and Athey, 2018] Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treat-ment effects using random forests.
Journal of the American Statistical Association , 113(523):1228–1242.[Wang et al., 2015] Wang, W., Rothschild, D., Goel, S., and Gelman, A. (2015). Forecasting elections withnon-representative polls.
International Journal of Forecasting , 31(3):980–991.[Wickham, 2016] Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis . Springer-Verlag NewYork.[Yeager et al., 2019] Yeager, D. S., Hanselman, P., Walton, G. M., Murray, J. S., Crosnoe, R., Muller, C.,Tipton, E., Schneider, B., Hulleman, C. S., Hinojosa, C. P., et al. (2019). A national experiment revealswhere a growth mindset improves achievement.
Nature , 573(7774):364–369.28 upplementary Material
A Proof of Proposition 1
Proof. E (ˆ τ CATE , O |D ) = (cid:80) c ∈ I O N c E (cid:0) Y rep ,n ,c |D (cid:1) − N c E (cid:0) Y rep ,n ,c |D (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:80) c ∈ I O N c (cid:0)(cid:82) y ∗ p ( y ∗ | , c, D ) dy ∗ − (cid:82) y ∗ p ( y ∗ | , c, D ) dy ∗ (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:80) c ∈ I O N c (cid:0)(cid:82) (cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) p ( v ∗ | , c, D ) dy ∗ dv ∗ − (cid:82) (cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) p ( v ∗ | , c, D ) dy ∗ dv ∗ (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:80) c ∈ I O N c (cid:0)(cid:82) (cid:0)(cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ − (cid:82) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ (cid:1) p ( v ∗ | c, D ) dv ∗ (cid:1)(cid:80) c (cid:48) ∈ I O N c (cid:48) = (cid:88) c ∈ I O N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:90) (cid:18)(cid:90) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ − (cid:90) y ∗ p ( y ∗ | , c, v ∗ , D ) dy ∗ (cid:19) p ( v ∗ | c, D ) dv ∗ = (cid:88) c ∈ I O N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:90) ( g ( c, v ∗ ) − g ( c, v ∗ )) p ( v ∗ | c, D ) dv ∗ (26)Defining dµ ( c, v ∗ ) := N c (cid:80) c (cid:48)∈ I O N c (cid:48) p ( v ∗ | c, D ) dv ∗ where the support is ( c, v ∗ ) ∈ I O × (0 , . E (ˆ τ CATE , O |D ) = (cid:82) ( g ( c, v ∗ ) − g ( c, v ∗ )) dµ ( c, x ∗ ). B Further discussion on connection between MRP and treatment effect estima-tion
Suppose that we have an estimation method M that produces unbiased estimates ˆ E ( Y (1) − Y (0) | X = x c )for the true treatment effect τ CATE ,X = x in every poststratification cell c .Let NR ( O ) ⊆ I O be the set of poststratification cells such that the in-sample frequency n c ∗ (cid:80) c (cid:48)∈ I O n c (cid:48) differsfrom the true frequency P ( X = x c ∗ | X ∈ O ) = N c ∗ (cid:80) c (cid:48)∈ I O N c (cid:48) . This means that the observed sample of covariatesdo not generate the true frequencies observed in the finite target population for cells in NR ( O ).From Equation 3, the point estimate of τ CATE , O minus the truth is (cid:88) c ∈ I O ˆ E ( Y (1) − Y (0) | X = x c ) (cid:124) (cid:123)(cid:122) (cid:125) Point estimate from M (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) (cid:33) − τ CATE , O = (cid:88) c ∈ I O τ CATE ,X = x (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) (cid:33) − (cid:88) c ∈ I O τ CATE ,X = x (cid:32) N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:33) = (cid:88) c ∈ NR( O ) τ CATE ,X = x (cid:32) n c (cid:80) c (cid:48) ∈ I O n c (cid:48) − N c (cid:80) c (cid:48) ∈ I O N c (cid:48) (cid:33) (27)Hence the bias for estimating τ CATE , O is driven by the size and direction of the true treatment effects τ CATE ,X = x c ∗ and the differences in the proportions n c ∗ (cid:80) c (cid:48)∈ I O n c (cid:48) − N c ∗ (cid:80) c (cid:48)∈ I O N c (cid:48) for all c ∗ ∈ NR ( O ). The differ-ences in proportions is caused by the observed sample of covariate vectors { X i } ni =1 being a nonrepresentativesample of individuals in the target population, { X i } N P i =1 .29 Coefficients for the data generating process in Section 5
The coefficient vectors used in our simulation study outlined in Section 5 are defined as: • µ X = (2 . , . , . • σ X = (0 . , , . • p RE = (0 . , . , . , . , . p RE = (0 . , . , . , . , . p RE = (0 . , . , . , . , . p RE = c (0 . , . , . , . , . p RE = c (0 . , . , . , . , . • p ME = (0 . , . , . p ME = (0 . , . , . • τ SA = (0 . , . , . • τ MC = (0 , − . , . • τ G = (0 . , . • τ RE = (0 . , . , . , . , . • τ ME = (0 . , τ SA , τ MC , τ G , τ RE , τ ME were chosen so that the simulation setup in this manuscript produced treatmenteffects in the same range as the ones reported in [Yeager et al., 2019].The GitHub repository for this manuscript is: https://github.com/alexgao09/causal_mrp_publichttps://github.com/alexgao09/causal_mrp_public