A Review of Generalizability and Transportability
Irina Degtiar and Sherri Rose
Harvard T.H. Chan School of Public Health and Stanford University
Abstract.
When assessing causal effects, determining the target population to which the results are intended to generalize is a critical decision. Randomized and observational studies each have strengths and limitations for estimating causal effects in a target population. Estimates from randomized data may have internal validity but are often not representative of the target population. Observational data may better reflect the target population, and hence be more likely to have external validity, but are subject to potential bias due to unmeasured confounding. While much of the causal inference literature has focused on addressing internal validity bias, both internal and external validity are necessary for unbiased estimates in a target population. This paper presents a framework for addressing external validity bias, including a synthesis of approaches for generalizability and transportability, the assumptions they require, as well as tests for the heterogeneity of treatment effects and differences between study and target populations.

MSC 2010 subject classifications: Primary 62-02, Research exposition; secondary 62G05, Nonparametric inference: Estimation.
Key words and phrases: generalizability, transportability, external validity, treatment effect heterogeneity, causal inference.
1. BACKGROUND
The goal of causal inference is often to gain understanding of a particular target population based on study findings. The true underlying causal effect will typically vary with the definition of the chosen target population. However, samples unrepresentative of the target population arise frequently in studies ranging from randomized controlled trials (RCTs) in clinical medicine to policy research (Bell et al., 2016; Kennedy-Martin et al., 2015; Allcott, 2015). In a clinical trial setting, physicians may be left interpreting evidence from RCTs with patients who have demographics and comorbidities that are quite different from

Irina Degtiar is a PhD candidate at the Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Ave, Boston, MA 02115, USA (email: [email protected]). Sherri Rose is an Associate Professor at the Center for Health Policy and Center for Primary Care and Outcomes Research, Stanford University, 615 Crothers Way, CA 94305, USA (email: [email protected]).
Figure 1. Internal vs. external validity biases as they relate to target, study, and analysis populations.

those of their patients. As an example, within cancer RCTs, African Americans are widely underrepresented despite being at an increased risk for many cancers (Chen and Wong, 2018). Failing to address this lack of representation can lead to inappropriate conclusions and harm (Chen et al., 2020). In a policy setting, it is important to consider the effects that can be expected in the eventual target population in order to set expectations for anticipated results and determine groups that should be targeted for an intervention.

The relationships between target, study, and analysis populations are visualized in Figure 1. The target sample is a representative sample of the target population, whereas the study population is defined by enrollment processes and inclusion or exclusion criteria. Due to these practical and scientific considerations, the study population may differ from the target population. Correspondingly, the enrolled participants who form the study sample may have different characteristics from those of the target sample. In the cancer RCT example, while a physician might care about the target population of patients that may come in to be treated by their clinic (of which the clinic's current patients are a target sample), the study sample on which they're basing their treatment recommendations may not include any African Americans. The study population is the hypothetical population that the study sample represents, which likewise includes no African Americans. Post-enrollment, further dropout and missingness may occur that create the observed analysis sample. In this case, dropout may have occurred for patients who experienced severe adverse events such that the analysis sample consists of patients who did not experience severe side effects.
There then exists a hypothetical analysis population from which the analysis sample data is a simple random sample. Hereafter, for simplicity and consistency with the literature, we will use the terms study sample and study population to be inclusive of the analysis sample and analysis populations, respectively.

Several key concepts are crucial to understand when considering extending causal inferences beyond a study sample. Generalizability focuses on the setting where the study population is a subset of the target population of interest, while transportability addresses the setting where the study population is (at
least partly) external to the target population. Internal validity is defined as an effect estimate being unbiased for the causal treatment effect in the population from which the sample is a simple random sample (i.e., moving vertically from a sample to its corresponding population in Figure 1). External validity is concerned with how well results generalize to other contexts. Specifically, that the (internally valid) effect estimate is unbiased for the causal treatment effect in a different setting, such as a target population of interest (moving laterally between populations in Figure 1). External validity bias has also been referred to as sample selection bias (Heckman, 1979; Imai, King and Stuart, 2008; Moreno-Torres et al., 2012; Bareinboim, Tian and Pearl, 2014; Haneuse, 2016).

External validity bias arises from differences between the study and target populations in (1) subject characteristics; (2) setting, such as geography or type of health center; (3) treatment, such as timing, dosage, or staff training; and (4) outcomes, such as length of follow-up or timing of measurements (Cronbach and Shapiro, 1982; Rothwell, 2005; Dekkers et al., 2010; Green and Glasgow, 2006; Burchett, Umoquit and Dobrow, 2011; Attanasio, Meghir and Szekely, 2003). The focus of most generalizability and transportability methods is on addressing differences in subject characteristics. Hence, these methods assume the remaining threats to external validity are not present in the data sources they are looking to generalize across. Namely, external validity bias then arises solely from: (1) variation in the probability of enrollment in the study, (2) heterogeneity in treatment effects, and (3) the correlation between (1) and (2) (Olsen et al., 2013).
We therefore distinguish between factors differentiating the target population from the study population (external validity bias) and those that create differences between treatment groups (internal validity bias), e.g., confounding. RCTs are frequently performed in a nonrepresentative subset of the target population and may have imperfect follow-up (challenging their external validity) and may have baseline imbalances (leading to internal validity bias). Observational studies may be susceptible to unmeasured confounding (threatening their internal validity), but may be more representative of the target population (hence having better external validity). Lack of representation in an RCT can lead to external validity bias that is larger than the internal validity bias of an observational study (Bell et al., 2016).

The optimal solution to external validity bias centers on study design, which we review briefly here, but do not cover extensively. One type of ideal study would randomly sample subjects from the target population and then randomly assign treatment to the selected individuals. However, this is usually infeasible. Alternative study designs for improving study generalizability and transportability include purposive sampling, where investigators deliberately select individuals such as for representation or heterogeneity (Shadish, Cook and Campbell, 2001; Allcott and Mullainathan, 2012); pragmatic or practical clinical trials, which aim to be representative of clinical practice (Schwartz and Lellouch, 1967; Ford and Norrie, 2016); stratified selection based on effect modifiers or propensity scores for selection (Tipton et al., 2014; Tipton, 2013a; Allcott and Mullainathan, 2012); and balanced sampling designs for site selection that select representative sites through stratified ranked sampling (Tipton and Peck, 2017).
In lieu of or in addition to study designs that address external validity bias, generalizability and transportability methods can improve the external validity
Figure 2. Overview framework for assessing and addressing external validity bias after data collection. Panels: Estimand (consider study and target populations, and with them, the estimand of interest); Assumptions (assess validity of assumptions necessary for generalizability or transportability approaches); Evaluating Generalizability (examine whether treatment effect modification exists and whether effect modifiers differ in distribution between study and target populations); Generalizability and Transportability Methods (apply methods for addressing external validity bias).

of effect estimates after data collection. This manuscript provides a review of generalizability and transportability research, synthesizing across the statistics, epidemiology, computer science, and economics literature in a more complete manner than has been done to date. Existing review literature has examined narrower subsets of the topic: generalizing or transporting to a target population from only RCT data (Stuart, Bradshaw and Leaf, 2015; Stuart, Ackerman and Westreich, 2018; Kern et al., 2016; Tipton and Olsen, 2018; Ackerman et al., 2019), identifiability rather than estimation (Bareinboim and Pearl, 2016), or meta-analysis approaches for combining summary-level information (Verde and Ohmann, 2015; Kaizar, 2015). A recent related review on combining randomized and observational data featured a simulation, real data analysis, and software guide (Colnet et al., 2020). However, these previous reviews have not summarized the full range of generalizability and transportability methods that incorporate data from randomized, observational, or a combination of randomized and observational studies, nor techniques for evaluating generalizability, as we do here. Additionally, although the importance of describing generalizability and transportability is recognized by different trial reporting guidelines (e.g., CONSORT, RECORD, STROBE), they provide no clear guidance on tests or estimation procedures (Schulz, Altman and Moher, 2010; Benchimol et al., 2015; von Elm et al., 2008).
We also contribute recommendations for methodologists and applied researchers.

The remainder of the article synthesizes considerations for assessing and addressing external validity bias after data collection (presented as a framework in Figure 2) and is organized as follows. Section 2 defines the estimand of interest, the average treatment effect in a target population, as well as alternatives. Section 3 presents key assumptions underlying many of the methods. Section 4 reviews methods for assessing treatment effect heterogeneity, thus further motivating the need for methods that enable generalizing or transporting study results to a target population. Section 5 then summarizes the analytic methods available for external validity bias correction that generate treatment effect estimates for a target population of interest. These techniques include weighting and matching, outcome regressions, and doubly robust approaches. Section 6 then concludes with guidance for both applied and methods researchers.
2. ESTIMAND
Assume, for one or more studies, the existence of outcome Y, treatment A ∈ {0, 1}, and baseline covariates X ∈ R^d. For simplicity of notation, we define X to represent all treatment effect confounders and effect modifiers (subgroups whose effects are expected to differ) that differ between study and target populations; each variable in X is both a confounder and an effect modifier. Without loss of generality, we focus on the single study setting, with S = 1 indicating selection into it. The observational unit for the study sample is O_study = {X, A, Y, S = 1}. O_study has probability distribution P_study ∈ M_study, where M_study is our collection of possible probability distributions (i.e., statistical model). We observe n_s realizations of O_study, indexed by j. The observational unit for a representative sample from the target population is given by O = {X, A, Y, S} ∼ P ∈ M. We observe n realizations of O, indexed by i. Target sample subjects who do not appear in the study sample will have S = 0. We use the terminology "selected" or "sampled" throughout the paper for simplicity although for transportability, subjects are not directly sampled into the study from the target population. For generalizability, O_study ∈ O, while for transportability, the two are disjoint sets, O_study ∉ O.

Biases are defined with respect to an estimand. We will focus on the average treatment effect in a well-defined target population of interest: the population average treatment effect (PATE). Namely, we are interested in the average outcome had everyone in the target population been assigned to treatment A = 1 compared to the outcome had everyone been assigned to treatment
A = 0. We write this as τ = E_X(E(Y | S = 1, A = 1, X) − E(Y | S = 1, A = 0, X)) = E(Y^1 − Y^0), where Y^1 and Y^0 are the potential outcomes under treatment and no treatment, respectively, and required identifiability assumptions are delineated in the next section. The corresponding estimator is given by τ̂ = 1/n Σ_{i=1}^n (Ŷ_i^1 − Ŷ_i^0). We also write Y^a to represent the potential outcome under a, with lowercase a a specific value for random variable A. Potential outcomes are either explicitly assumed in the potential outcomes framework or a consequence of the structural causal model (Rubin, 1974; Pearl, 2000). Different target populations correspond to alternative PATEs because the expectation is taken with respect to alternative distributions of covariates X. However, necessarily, we only observe outcomes in the study sample. A study therefore directly estimates the sample average treatment effect (SATE): τ_s = E(Y^1 − Y^0 | S = 1) with estimator τ̂_s = 1/n_s Σ_{j: S_j = 1} (Ŷ_j^1 − Ŷ_j^0).

When the distributions of treatment effect modifiers differ between study and target populations, the true study average effect will not equal the true target population average effect (SATE ≠ PATE) due to external validity bias. Sampling variability as well as internal validity biases can also drive estimates of the SATE further from the truth (Figure 3). Biases may differ in magnitude and may make the SATE either larger or smaller than the PATE.

We may also be interested in estimating other target parameters. For example, the population conditional average treatment effect (PCATE), τ_x = E(Y^1 − Y^0 | X), is examined in some of the estimation methods we explore later. Another parameter of interest is the population average treatment effect among the treated:
Figure 3. Illustrative example of the difference between target population and sample average treatment effects (PATE and SATE). Biases may differ in magnitude and may make the SATE either larger or smaller than the PATE.

τ_t = E(Y^1 − Y^0 | A = 1). Similar generalizability and transportability considerations presented in the following sections will apply for these and other causal estimands.
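As a concrete illustration of how SATE ≠ PATE arises, the following simulation (our own sketch; the data-generating values and names are illustrative, not from the paper) draws a target population with an effect modifier X, sets the individual treatment effect to 1 + X so the true PATE is 1, and enrolls individuals with larger X more often. The study then over-represents large treatment effects, and the SATE exceeds the PATE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target population with effect modifier X ~ N(0, 1).
# Individual treatment effect is 1 + X, so the true PATE is E[1 + X] = 1.
n = 100_000
X = rng.normal(0.0, 1.0, n)
tau_i = 1.0 + X

# Nonrandom study selection: individuals with larger X enroll more often,
# so the study sample over-represents those with larger treatment effects.
p_select = 1.0 / (1.0 + np.exp(-(X - 1.0)))
S = rng.binomial(1, p_select)

pate = tau_i.mean()          # average effect in the target population
sate = tau_i[S == 1].mean()  # average effect among the selected: biased upward
print(f"PATE = {pate:.2f}, SATE = {sate:.2f}")
```

With selection probability increasing in X, the study mean of X is positive, so the SATE lands well above 1; the gap is exactly the external validity bias described above.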
3. ASSUMPTIONS
Under the potential outcomes framework, the assumptions below are sufficient to identify the PATE using the observed study data. A corresponding set of assumptions under the structural equation model (SEM) framework has also been derived (Pearl and Bareinboim, 2014; Pearl, 2015; Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2014; Bareinboim and Tian, 2015; Bareinboim and Pearl, 2016; Correa, Tian and Bareinboim, 2018). Additional assumptions include those of no missing data or measurement error in outcome, treatment, or covariate measurements. Other target parameters of interest necessitate a similar set of assumptions.

Sufficient assumptions for identifying the PATE with respect to internal validity:

Conditional treatment exchangeability: Y^a ⊥ A | X, S = 1 for all a ∈ A, the set of all possible treatments. This condition requires no unmeasured confounding of the treatment-outcome relationship in the study. It is satisfied by perfectly randomized trials (e.g., no loss to follow-up, other informative missingness or censoring, etc.) and by observational studies that have all confounders measured. While this condition is sufficient, it is not always necessary. When estimating the PATE, it can be replaced by the weaker condition of mean conditional exchangeability of the treatment effect, E(Y^1 − Y^0 | X, A, S = 1) = E(Y^1 − Y^0 | X, S = 1) (Kern et al., 2016; Dahabreh et al., 2019a).
Positivity of treatment assignment: P(X = x | S = 1) > 0 ⇒ P(A = a | X = x, S = 1) > 0, with probability 1 for all a ∈ A. This condition entails that each subject in the study has a positive probability of receiving each version of the treatment. In combination with the conditional treatment exchangeability assumption above, this assumption is also known as strongly ignorable treatment assignment (Varadhan, Henderson and Weiss, 2016).

Stable unit treatment value assumption (SUTVA): if A = a then Y = Y^a. This assumption requires no interference between subjects and treatment version irrelevance (i.e., consistency/well-defined interventions) in the study and target populations (Dahabreh et al., 2017; Kallus, Puli and Shalit, 2018).

Following the assumptions above, identifying the PATE involves a parallel set of assumptions for external validity:
Conditional exchangeability for study selection: Y^a ⊥ S | X for all a ∈ A. This assumption is also known as exchangeability over selection and the generalizability assumption. It requires that the outcomes among individuals with the same treatment and covariate values in the study and target populations are the same (Stuart et al., 2011). All effect modifiers that differ between study and target populations must therefore be measured. This assumption would be satisfied by a study sample that is a random sample from the target population or a non-probability study sample in which all effect modifiers are measured. A weaker condition, mean conditional exchangeability of selection, E(Y^1 − Y^0 | X, S = 1) = E(Y^1 − Y^0 | X), can replace conditional exchangeability for study selection when focusing on the PATE (Kern et al., 2016; Dahabreh et al., 2019a).

Positivity of selection: P(X = x) > 0 ⇒ P(S = 1 | X = x) > 0, with probability 1. This assumption requires common support with respect to study selection; in every stratum of effect modifiers, there is a positive probability of being in the study sample (Dahabreh et al., 2017). This can be replaced by smoothing assumptions under a parametric model, for example, that the propensity score distribution has sufficient overlap or common support between the study sample and target population (Westreich et al., 2017; Tipton et al., 2017). Thus, with conditional positivity of selection we assume that all members of the target population are represented by individuals in the study. The positivity assumption in combination with the no unmeasured effect modification assumption above is also known as strongly ignorable sample selection given the observed covariates (Chan, 2017).

SUTVA for study selection: if S = s (and A = a) then Y = Y^a.
This assumption states that there is no interference between subjects selected into the study versus those not selected and that there is treatment version irrelevance between study and target samples (the same treatment is given to both) (Tipton, 2013b; Tipton et al., 2017). It necessitates no difference across study and target samples in how outcomes are measured or in how the intervention is applied, that there is a common data-generating function for the outcome across individuals in the study and target populations (i.e., that being in the study does not change treatment effects), and that the potential outcomes are not a function of the proportion of individuals selected for the study. Treatment version irrelevance in SUTVA can be replaced by the condition of having the same distribution of treatment versions between study and target populations when estimating the PATE (Lesko et al., 2017).

Similar internal and external validity assumptions are needed for transportability, with the following modifications. When the study sample is a subset of the target population (generalizability), the positivity assumption for selection will need the propensity for selection to be bounded away from 0, whereas when the sample is not a subset of the target population (transportability), the propensity to be in the target population will need to be bounded away from 0 and 1
(Tipton, 2013b). Furthermore, for transportability, the set of covariates, X, required for conditional exchangeability for study selection cannot include those that separate the study sample from the target population (e.g., hospital type if transporting results from teaching hospitals to community clinics, or geographic location if transporting between states) (Tipton, 2013b). Further distinctions are discussed by Pearl (2015) using the SEM framework. Under this framework, Pearl and Bareinboim formalize the assumptions necessary for using different transport formulas to reweight randomized data, providing graphical conditions for identifiability as well as transport formulas for randomized studies (Pearl and Bareinboim, 2014; Pearl, 2015), observational studies (Pearl and Bareinboim, 2011; Pearl, 2015; Bareinboim and Tian, 2015; Bareinboim and Pearl, 2016; Correa and Bareinboim, 2017; Correa, Tian and Bareinboim, 2018), and a combination of heterogeneous studies (Bareinboim and Pearl, 2014, 2016).
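Under the assumptions above, a standard identification argument writes the target mean potential outcome in terms of observed study data; the sketch below is consistent with the PATE expression in Section 2, though we do not claim it reproduces any particular source's derivation step for step:

```latex
\begin{align*}
E(Y^a) &= E_X\{E(Y^a \mid X)\}
  && \text{(iterated expectation over the target distribution of } X\text{)}\\
&= E_X\{E(Y^a \mid X, S = 1)\}
  && \text{(exchangeability and positivity of selection)}\\
&= E_X\{E(Y^a \mid X, A = a, S = 1)\}
  && \text{(treatment exchangeability and positivity)}\\
&= E_X\{E(Y \mid X, A = a, S = 1)\}
  && \text{(SUTVA/consistency)}
\end{align*}
```

Differencing the a = 1 and a = 0 expressions recovers τ = E_X(E(Y | S = 1, A = 1, X) − E(Y | S = 1, A = 0, X)), the PATE formula from Section 2.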
4. ASSESSING DISSIMILARITY BETWEEN TARGET AND STUDY POPULATIONS AND TESTING FOR TREATMENT EFFECT HETEROGENEITY
Numerous quantitative approaches can help evaluate the extent to which study results may be expected to generalize to the target population. These assessments examine population differences and whether treatment effect heterogeneity exists. Methods for assessing the similarity of study and target populations can broadly be categorized into those that compare baseline patient characteristics and those that compare outcomes for groups on the same treatment. For the former, many make use of the propensity score for selection, which also serves the purpose of assessing the extent to which propensity score adjustment using measured covariates can sufficiently remove baseline differences between study and target samples. However, most of these methods do not emphasize effect modifiers, hence should be combined with an assessment of whether the noted population differences correspond to heterogeneity of treatment effects. To test for heterogeneity of effects, one must first identify effect modifiers. Effect modifiers are often pre-specified by the investigator, but data-driven approaches exist as well, and will be discussed in this section.

When summary-level study data are available, assessments that examine differences in univariate covariate metrics between study and target samples can be deployed. Cahan, Cahan and Cimino (2017) propose a generalization score for evaluating clinical trials that incorporates baseline patient characteristics, the trial setting, protocol, and patient selection: it takes ratios of the mean or median values of these characteristics in the study and target samples, then averages across categories for an overall score. However, this approach does not account for any measures of dispersion, which may reflect exclusion of more heterogeneous individuals from the study. When only baseline patient characteristics are responsible for relevant study vs.
target population differences, one can perform multiplicity-adjusted univariate tests for differences in effect modifiers between study and target samples (Greenhouse et al., 2008). Alternatively, one could examine absolute standardized mean differences (SMD) for each covariate, (X̄_study − X̄)/σ_X̄, where X̄_study and X̄ are the means of baseline covariates in the study and target samples, respectively, and σ_X̄ is the standard deviation of X̄ (Tipton et al., 2017). High values indicate heavy extrapolation and reliance on correct model specification; in smaller samples, imbalances will often occur by chance (Tipton et al., 2017). With one or more RCTs, generalizability across categorical eligibility criteria can be assessed by the percent of the target sample that would have been eligible for the study or set of studies (Weng et al., 2014; He et al., 2016; Sen et al., 2016).

Joint distributions of patient characteristics can likewise be compared, such as by examining the SMD in propensity scores for selection (Stuart et al., 2011). When the propensity score is not symmetrically distributed, summarizing mean differences is insufficient. Tipton (2014) developed a generalizability index that bins propensity scores and is bounded between 0 and 1: Σ_{j=1}^k √(w_pj w_sj), with j = 1, ..., k bins, each with target sample proportions w_pj and study sample proportions w_sj. It is based on the distributions of propensity scores rather than only the averages. However, this approach requires patient-level study and target sample data. A low generalizability index score indicates substantial dissimilarity between study and target samples; related metrics, such as the C statistic, largely focus on comparing cumulative densities (Tipton, 2014; Ding, Feller and Miratrix, 2016).
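The SMD and the binned generalizability index just described take only a few lines to compute. The sketch below is illustrative, with our own simulated selection propensity scores and an assumed bin count of k = 10 (the function names are ours, not from the paper):

```python
import numpy as np

def abs_smd(x_study, x_target):
    """Absolute standardized mean difference for a single covariate."""
    return abs(x_study.mean() - x_target.mean()) / x_target.std(ddof=1)

def generalizability_index(ps_study, ps_target, k=10):
    """Binned index sum_j sqrt(w_pj * w_sj): 1 for identical propensity
    score distributions, near 0 for little overlap (Tipton, 2014)."""
    edges = np.linspace(0.0, 1.0, k + 1)
    w_s = np.histogram(ps_study, bins=edges)[0] / len(ps_study)
    w_p = np.histogram(ps_target, bins=edges)[0] / len(ps_target)
    return float(np.sum(np.sqrt(w_p * w_s)))

# Toy example: the study over-represents individuals with high selection scores.
rng = np.random.default_rng(1)
ps_target = rng.beta(2, 5, 2000)
ps_study = rng.beta(5, 2, 500)
print(generalizability_index(ps_study, ps_study))   # identical distributions: 1.0
print(generalizability_index(ps_study, ps_target))  # dissimilar: well below 1
```

Because the bin proportions are nonnegative and sum to one, comparing a sample with itself always yields exactly 1, which makes the index easy to sanity-check.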
To assess the degree of extrapolation with respect to effect modifiers, one can examine overlap in the propensity of selection distributions, such as the proportion of target sample individuals with propensity scores outside the 5th and 95th percentiles of the sample propensity scores (Tipton et al., 2017).

One can also adopt a machine learning approach for detecting covariate shift, a change in the distribution of covariates between training and test data (here, the study and target data) (Glauner et al., 2017). After creating a joint dataset with target and study sample data, a classification algorithm predicts whether the data came from the study. A dissimilarity metric surpassing a threshold of acceptability then indicates sizable dissimilarity between datasets. However, an inability to accurately predict study vs. target data origin does not rule out differences in effect modifiers. A low score might furthermore indicate an incorrect model specification or insufficient model tuning.

The tests discussed in this subsection assess differences between populations; however, they require investigator knowledge of which characteristics moderate the treatment effect (or are correlated with unmeasured effect modifiers) and what level of differences are clinically relevant. Many covariates are often tested or included in a propensity score regression for study selection. This approach prioritizes predictors that are strongly associated with study selection rather than those that exhibit strong effect modification. Investigators should therefore aim to identify relevant effect modifiers for testing or inclusion in the propensity score regression and test this subset.

When individual-level outcome data or joint distributions of group-level outcome data are available in both the study and target samples for at least one
of the treatment groups, the following methods can assess the extent to which measured effect modifiers account for population differences. One can compare the observed outcomes in the target sample to predicted outcomes using study controls (Stuart et al., 2011), or more generally, study individuals who received the same treatment (Hotz, Imbens and Mortimer, 2005): 1/n_a Σ_{i=1}^N 1(A_i = a) Y_i vs. 1/n_{s,a} Σ_{i: S_i = 1} 1(A_i = a) w_i Y_i, with weights w_i defined by weighting and matching methods discussed in Section 5.1. Hartman et al. (2015) formalize this comparison with equivalence tests. Alternatively, conditional outcomes for study and non-study target sample individuals receiving the same treatment, conditioning on measured effect modifiers, can be compared to detect unmeasured effect modification, although other identifiability assumption violations might also be at fault: E(Y | X, A = a, S = 1) vs. E(Y | X, A = a, S = 0). Possible tests include analysis of covariance, Mantel-Haenszel, U-statistic based tests, stratified log-rank, or stratified rank sum, depending on the outcome (Marcus, 1997; Hotz, Imbens and Mortimer, 2005; Luedtke, Carone and van der Laan, 2019). For example, study controls could be compared to subgroups of the target population that were known to be excluded from the study (e.g., patients who declined participation in an RCT, as done by Davis (1988)). Relatedly, unmeasured effect modification can be imperfectly tested for by disaggregating a characteristic that differentiates the study from the target sample (Allcott and Mullainathan, 2012). These outcome differences should not exceed those observed between study treatment groups (Begg, 1992).

In addition to testing for outcome differences, one can test for differences between study and target regression coefficients or between baseline hazards in a Cox regression (Pan and Schaubel, 2009).
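The benchmark comparison described above, a target-sample mean outcome under treatment a versus a selection-weighted study mean for the same arm, can be sketched as follows. This is our own illustrative simulation: the selection model is known here and the weights are simple inverse odds of selection, whereas in practice the selection propensity would be estimated and other weighting or matching schemes from Section 5.1 could be substituted:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated target sample; study enrollment depends on effect modifier X,
# so unweighted study means are biased for the non-study target group.
n = 20_000
X = rng.normal(0.0, 1.0, n)
p_s = 1.0 / (1.0 + np.exp(-(X - 0.5)))   # true selection propensity
S = rng.binomial(1, p_s)
A = rng.binomial(1, 0.5, n)              # randomized treatment
Y = 2.0 * X + A * (1.0 + X) + rng.normal(0.0, 1.0, n)

# Inverse odds of selection reweight study units toward the non-study group.
w = (1.0 - p_s) / p_s

a = 0                                    # compare the control arms
in_target = (S == 0) & (A == a)
in_study = (S == 1) & (A == a)
target_mean = Y[in_target].mean()
study_raw = Y[in_study].mean()
study_wtd = np.average(Y[in_study], weights=w[in_study])
print(f"target {target_mean:.2f}  study raw {study_raw:.2f}  weighted {study_wtd:.2f}")
```

When the weighted study mean remains far from the target mean, unmeasured effect modification or a misspecified weighting model is implicated, which is exactly what the equivalence tests of Hartman et al. (2015) formalize.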
Any identified differences in outcomes or effects will reflect sample differences unaccounted for by the outcome or weighting method, indicating unmeasured effect modification or an ineffective modeling approach. To have this comparison reflect relevant differences, study controls must be representative of the target population after weighting or regression adjustment. Hartman et al. (2015) provides a more formal set of identifiability assumptions that may be violated when each equivalence test is rejected. If unmeasured effect modification is suspected, one can perform sensitivity analysis to assess the extent to which it can impact results (Marcus, 1997; Nguyen et al., 2017, 2018; Dahabreh et al., 2019b; Andrews and Oster, 2017) or to generate bounds on the treatment effect when only partial identification is possible (Chan, 2017).

Testing for treatment effect heterogeneity

Identified population differences are relevant insofar as they correspond to differences in treatment effect modifiers. The following tests enable an investigator to assess whether treatment effects vary substantially across measured covariates. Many are suitable for use in observational or RCT data, although have largely been demonstrated in RCT data to date. While some tests require a priori specification of subgroups, others can discover them in data-driven ways, and most require individual-level data. A straightforward but often overlooked issue is that studies with enrolled patients that are homogeneous with respect to effect modifiers will have difficulty identifying heterogeneity of effects. These approaches are therefore best applied to data representative of the target populations (Gunter, Zhu and Murphy, 2011).

Tests of prespecified subgroups should focus on target population subgroups under- or over-represented in the study, or any other clinically relevant subgroup expected to exhibit effect heterogeneity.
Largely, methods for testing treatment effect heterogeneity of a priori specified subgroups exhibit limited power. Those testing several effect modifiers individually are particularly underpowered to detect significant effects once multiple testing adjustments are incorporated. One approach tests the interaction term of treatment assignment with an effect modifier in a linear model, which also requires modeling assumptions as to the linearity and additivity of effects (Fang, 2017; Gabler et al., 2009). To address this lack of power, sequential tests for identifying treatment-covariate interactions can be used with either randomized or observational data (Qian, Chakraborty and Maiti, 2019). Alternative approaches, each addressing slightly different goals, include testing whether the conditional average treatment effect is identical across predefined subgroups (Crump et al., 2008; Green and Kern, 2012), comparing subgroup effects to average effects (Simon, 1982), and identifying qualitative interactions or treatment differences exceeding a prespecified clinically significant threshold (Gail and Simon, 1985).

When effect modifiers are not known a priori, a variety of techniques can be applied for identifying subgroups with heterogeneous effects. These include techniques that identify variables that qualitatively interact with treatment (i.e., for which the optimal treatment differs by subgroup) (Gunter, Zhu and Murphy, 2011) as well as those that determine the magnitude of interaction (Chen et al., 2017; Tian et al., 2014). Various machine learning approaches can also be used to identify subgroups with heterogeneous treatment effects while minimizing modeling assumptions. Approaches that also present tests for treatment effect differences between subgroups include Bayesian additive regression trees (BART) and other classification and regression tree (CART) variants (Su et al., 2008, 2009; Lipkovich et al., 2011; Green and Kern, 2012; Athey and Imbens, 2016).
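As a minimal sketch of the interaction-term test described above, the snippet below simulates data with a genuine treatment-covariate interaction and tests the interaction coefficient in an ordinary least squares fit; the variable names and effect sizes are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"a": rng.integers(0, 2, n),   # treatment assignment
                   "x": rng.normal(size=n)})     # candidate effect modifier
# Outcome with a genuine treatment-covariate interaction (coefficient 0.8)
df["y"] = (1.0 + 0.5 * df["a"] + 0.3 * df["x"] + 0.8 * df["a"] * df["x"]
           + rng.normal(scale=0.5, size=n))

fit = smf.ols("y ~ a * x", data=df).fit()
interaction_p = fit.pvalues["a:x"]  # small p-value flags effect modification
```

With several candidate modifiers, the resulting p-values would need a multiple-testing adjustment, as noted above.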
Tree-based methods recursively partition the covariate space, growing toward terminal nodes with homogeneous outcomes. These approaches may be particularly useful when heterogeneity is a function of a more complex combination of factors.

With many effect modifiers or when effect modifiers are unknown, global tests for heterogeneity can also be used. Pearl (2015) provides conditions for identifying treatment effect heterogeneity (including heterogeneity due to unmeasured effect modifiers) for randomized trials with binary treatments, situations with no unobserved confounders, and settings with mediating instruments. Effect heterogeneity can be tested for using the baseline risk of the outcome as an effect modifier; interaction-based tests assess for differences in baseline risk between study and target population control groups (Varadhan, Henderson and Weiss, 2016; Weiss, Segal and Varadhan, 2012). These tests avoid the need for multiple testing but require outcome data in the target sample and modeling assumptions. A consistent nonparametric test also exists that assesses for constant conditional average treatment effects, $\tau_x = \tau \; \forall x \in \mathcal{X}$ (Crump et al., 2008). Additional methods, which suffer from limited power and rely on estimates of the SATE, include testing whether potential outcomes across treatment groups have equal variances and whether cumulative distribution functions of treatment and control outcomes differ by a constant shift (Fang, 2017). Global tests do not identify subgroups responsible for effect heterogeneity, although if a global test is rejected, one can then compare individual subgroups to determine which demonstrate effect heterogeneity.

If these assessments of generalizability fail and the target population is not well-represented by the study population (specifically, when strong ignorability fails), Tipton (2013b) provides several recommended paths forward.
Investigators can change the target population to one represented by the study; that is, change the estimand of interest by aligning inclusion and exclusion criteria, outcome timepoints, or treatment doses (Hernán et al., 2008; Weisberg, Hayden and Pontes, 2009). A population coverage percentage can then summarize the percent overlap between the new and original target sample propensity scores and describe relevant differences from the original target population. Investigators can alternatively retain the original target population and note the limitations of extrapolated results and the likelihood of remnant bias. It is also important to acknowledge that a different study may need to be conducted.
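One simple way to operationalize such a coverage summary is the share of original target sample propensity scores lying within the support of the redefined target's scores. This is our illustrative construction; the cited literature may define the summary differently.

```python
import numpy as np

def coverage_percentage(ps_original, ps_new):
    """Percent of original target sample individuals whose estimated
    selection propensity scores fall within the min-max range of the
    redefined target sample's scores (hypothetical helper)."""
    ps_original, ps_new = np.asarray(ps_original), np.asarray(ps_new)
    lo, hi = ps_new.min(), ps_new.max()
    return 100.0 * np.mean((ps_original >= lo) & (ps_original <= hi))
```

A low value signals that the redefined estimand discards much of the original target population.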
5. GENERALIZABILITY AND TRANSPORTABILITY METHODS FOR ESTIMATING POPULATION AVERAGE TREATMENT EFFECTS
Following the application of the methods in the previous sections, including assessing the plausibility of relevant assumptions, an analytic method is typically needed to generalize or transport results from randomized or observational data to a target population. These approaches have many parallels to those used to address internal validity bias. We revisit weighting and matching-based methods and outcome regressions in depth while additionally examining techniques that use both propensity and outcome regressions (these are often doubly robust). To mitigate external validity bias, generalizability and transportability methods address differences in the distribution of effect modifiers between study and target populations. To do so, weighting and matching-based approaches account for the probability of selection into the study rather than the probability of treatment assignment. Outcome regressions require that the treatment effect be allowed to vary across all effect modifiers, in addition to all confounders being correctly included in the regression.

Most generalizability and transportability methods have been developed for randomized data. When outcome data are available from both randomized studies and an observational study representative of the target population, their combination has the potential to overcome sensitivity to positivity violations for selection into the study (an issue that RCT data commonly face) as well as to unmeasured confounding (which may afflict observational studies). Incorporating observational data in a principled manner can also shrink mean squared error. However, many such approaches do not leverage the internal validity of RCT data; the following sections will highlight some exceptions. While most approaches require individual-level study and target sample data, the Appendix highlights approaches that only use summary-level data for either the study or target sample.
Methods that adjust for differing baseline covariate distributions between study and target samples via weighting or matching are particularly effective when effect modifiers strongly predict selection into the study. While including unnecessary covariates can decrease precision, increase the chance of extreme weights and difficult-to-match subjects, and provide no bias reduction (Nie et al., 2013), failing to include an effect modifier is typically of greater concern than including unnecessary covariates (Stuart, 2010; Dahabreh et al., 2018). Matching and reweighting methods rely strongly on common covariate support between study and target populations and perform poorly when a portion of the target population is not well-represented in the study sample or when empirical positivity violations occur. Investigators should use the estimation approach that leads to the best effect modifier balance for their study (Stuart, 2010) and strive for fewer assumptions.

Full matching and fine balance of covariate first moments (i.e., expected values) have been used in the generalizability context (Stuart et al., 2011; Bennett, Vielma and Zubizarreta, 2020). Stuart et al. (2011) fully match study and target sample individuals based on their propensity scores to form sets so that each matched set has at least one study and one target individual. Individuals' outcomes are then reweighted by the number of target sample individuals in their matched set. This approach relies heavily on the distance metric used, which can be misled by covariates that do not affect the outcome. Fine balance of covariate first moments is a nonparametric approach for larger data that can also be used with multi-valued treatments (Bennett, Vielma and Zubizarreta, 2020).
This approach matches samples to a target population to achieve fine balance on the first moments of all covariates rather than working with the propensity score.

Some implementations of these methods only match a subset of study individuals (hence showing areas of the covariate distribution without common support), while others ensure all study and target sample individuals are matched. Matching methods require calibration of the bias-variance tradeoff, such as via a caliper or by choosing the ratio of study to target individuals to match. A variety of distance metrics exist; however, none specifically target effect modifiers. With unrepresentative observational data, treatment groups can first be matched based on confounding variables before matching study pairs to the target sample based on effect modifiers, or each treatment group can be separately matched to the target sample (Bennett, Vielma and Zubizarreta, 2020).

In a low-dimensional setting with categorical or binary covariates, one can use nonparametric post-stratification (also known as direct adjustment or subclassification), as has been done in the literature with randomized data (Miettinen, 1972; Prentice et al., 2005) and with observational data in the context of instrumental variables (Angrist and Fernández-Val, 2013). Post-stratification consists of obtaining estimates for each stratum of effect modifiers, then reweighting these estimates to reflect the effect modifier distribution in the target population, i.e., $\hat{E}(Y^a) = \frac{1}{n}\sum_{l=1}^{L} n_l \bar{Y}_{al}$, where $L$ is the number of strata, $n_l$ is the target sample size in stratum $l$, $n = \sum_{l=1}^{L} n_l$, and $\bar{Y}_{al}$ is an estimate from
study sample data of the potential outcome on treatment $a$ in stratum $l$, commonly the stratum-specific sample mean for subjects on treatment $a$ (Miettinen, 1972; Prentice et al., 2005).

Post-stratification only requires stratum-specific summary data, and closed-form variance formulas are often available. However, empty strata quickly become an issue when dealing with continuous variables or many stratifying variables. Conversely, if insufficient strata are used, residual external validity bias will remain, which is particularly problematic in small samples (Tipton et al., 2017). To combat this, inference can be pooled across strata using multilevel regression with post-stratification (Pool, Abelson and Popkin, 1964; Gelman and Little, 1997; Park, Gelman and Bafumi, 2004; Kennedy and Gelman, 2019).

For higher dimensional settings or with continuous covariates, more flexible nonparametric approaches can be applied, such as maximum entropy weighting, where study strata are reweighted to the distribution in the target sample (Hartman et al., 2015). When target and study populations differ on post-treatment variables such as adherence, principal stratification can be used to estimate PATEs by classifying subjects into never-taker, always-taker, and complier categories (Frangakis, 2009).

Estimating using the propensity for study selection.
Most weighting approaches use a propensity of selection regression to construct weights. They rely on correct specification of the propensity score regression and sufficient overlap in propensity scores between study subjects and target sample individuals not in the study. These approaches have the additional advantage of allowing one set of weights to be used for treatment effects related to multiple outcomes. The most straightforward weighting approaches tend to have large variances in the presence of extreme weights, give disproportionate weight to outlier observations, and produce outcome estimates outside the support of the outcome variable. Weight standardization can address these issues, as can weight trimming, although the latter induces bias by changing the target population of interest, hence requiring a careful bias-variance trade-off.

Inverse probability of participation weighting (IPPW), a Horvitz-Thompson-like approach (Horvitz and Thompson, 1952), is the most common weighting technique for generalizability (Flores and Mitnik, 2013; Baker et al., 2013; Lesko et al., 2017; Westreich et al., 2017; Correa, Tian and Bareinboim, 2018; Dahabreh et al., 2018, 2019a). Most simply, IPPW weights the outcome for each study individual on treatment $a$ by the inverse probability (propensity) of being in the study. Weights have been developed for estimating PATEs, including those that incorporate treatment assignment to account for covariate imbalances in an RCT or for confounding in an observational study.
The observed outcomes are reweighted to obtain the potential outcomes for each treatment group $a$: $\hat{E}(Y^a) = \frac{1}{n}\sum_{i=1}^{n} w_i Y_i$ with

$w_i = \frac{1}{\pi_{s,i}} I(S_i = 1) I(A_i = a)$ for random treatment assignment (Lesko et al., 2017),

$w_i = \frac{1}{\pi_{s,i}\,\pi_{a,i}} I(S_i = 1) I(A_i = a)$ more generally (Stuart et al., 2011; Dahabreh et al., 2019a),

where $I(S_i = 1)$ is the indicator for being in the study, $I(A_i = a)$ is the indicator for being assigned treatment $a$, $\pi_{s,i} = P(S_i = 1 \mid X_i)$ is the propensity score for selection into the study, and $\pi_{a,i} = P(A_i = a \mid S_i = 1, X_i)$ is the propensity score for assignment to treatment $a$ in the study.

Individual-level data are typically required, although one can also use joint covariate distributions from group-level data (Cole and Stuart, 2010) or univariate moments (e.g., means, variances) with additional assumptions (Signorovitch et al., 2010; Phillippo et al., 2018). Because IPPW only uses study individuals on a given treatment to estimate potential outcomes for that treatment, power can become an issue, particularly for multi-level treatments. These methods also perform poorly when study selection probabilities are small, which can be a common occurrence for generalizability (Tipton, 2013b). IPPW weights have also been developed for regression parameters in a generalized linear model (Haneuse et al., 2009), as well as for Cox model hazard ratios and baseline risks (Cole and Stuart, 2010; Pan and Schaubel, 2008).

For transportability to the target population $S = 0$, odds of participation weights are used rather than inverse probability of participation weights (Westreich et al., 2017; Dahabreh et al., 2018). This corresponds to the estimator $\hat{E}(Y^a \mid S = 0) = \frac{1}{n}\sum_{i=1}^{N} w_i Y_i$ with $N = n + n_s$ and weights (Dahabreh et al., 2018):

$w_i = \frac{1 - \pi_{s,i}}{\pi_{s,i}\,\pi_{a,i}} I(S_i = 1) I(A_i = a).$
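The IPPW estimators above can be sketched in a few lines, given estimated propensities for every individual. This is a minimal implementation with our own function and argument names, not code from the cited works.

```python
import numpy as np

def ippw_estimate(y, a, s, pi_s, pi_a, treatment, transport=False, normalize=True):
    """IPPW point estimate of a potential outcome mean (sketch).
    pi_s = P(S=1 | X) and pi_a = P(A=a | S=1, X) are estimated propensities
    for every individual. transport=True uses odds-of-participation weights
    targeting the non-study population S=0. normalize=True divides by the
    sum of the weights, giving the standardized estimator bounded by the
    range of the observed outcomes."""
    y, a, s, pi_s, pi_a = map(np.asarray, (y, a, s, pi_s, pi_a))
    ind = ((s == 1) & (a == treatment)).astype(float)
    w = ind / (pi_s * pi_a)
    if transport:
        w *= (1.0 - pi_s)
    if normalize:
        return np.sum(w * y) / np.sum(w)
    denom = np.sum(s == 0) if transport else len(y)
    return np.sum(w * y) / denom
```

For instance, with all propensities equal to 0.5, each study individual on the treatment of interest receives weight 4 under the general weights.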
To address potentially unbounded outcome estimates, standardization replaces $n$ by the sum of the weights, which normalizes the weights to sum to 1 (Dahabreh et al., 2018, 2019a). The resulting estimator will be more stable, bounded by the range of the observed outcomes, and perform better when the target sample is much larger than the study.

Under regularity conditions, estimates derived using IPPW are consistent and asymptotically normal (Lunceford and Davidian, 2004; Pan and Schaubel, 2008; Cole and Stuart, 2010; Correa, Tian and Bareinboim, 2018; Buchanan et al., 2018). Variance for the IPPW estimator can be obtained through either a bootstrap approach or robust sandwich estimators. The latter may be difficult to calculate (Haneuse et al., 2009), and bootstrap methods for IPPW have been shown to perform better when there is substantial treatment effect heterogeneity or smaller sample sizes (Chen and Kaizar, 2017; Tipton et al., 2017).

Propensity scores can also be used in the context of post-stratification, weighting or matching individuals within strata. RCT individuals are divided into strata defined by their propensity scores; quintiles are commonly used, based on results showing that this approach may remove over 90% of bias (O'Muircheartaigh and Hedges, 2014). Effects are estimated using sample data within each subgroup, such as through separate regressions or a joint parametric regression with fixed effects for subgroups and interaction terms for subgroups by RCT status. Results can then be reweighted based on the number of target sample individuals in each subgroup (O'Muircheartaigh and Hedges, 2014). Alternatively, the target sample can be matched to RCT individuals within the same propensity score stratum (Tipton, 2013b).

The post-stratification estimator is asymptotically normal, and closed-form variance estimates exist for independent strata (O'Muircheartaigh and Hedges, 2014; Lunceford and Davidian, 2004).
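A bare-bones version of this quintile-based post-stratification is sketched below; stratum boundaries and tie handling vary across implementations, and the names are ours.

```python
import numpy as np

def quintile_poststratification(y_study, a_study, ps_study, ps_target, treatment):
    """Post-stratification on quintiles of the estimated selection
    propensity score (sketch). Study-stratum mean outcomes on `treatment`
    are reweighted by the number of target sample individuals per stratum."""
    edges = np.quantile(np.concatenate([ps_study, ps_target]), [0.2, 0.4, 0.6, 0.8])
    strata_study = np.digitize(ps_study, edges)
    strata_target = np.digitize(ps_target, edges)
    estimate, n_target = 0.0, len(ps_target)
    for l in range(5):
        in_stratum = (strata_study == l) & (a_study == treatment)
        n_l = np.sum(strata_target == l)
        if n_l == 0:
            continue  # no target individuals in this stratum
        if not np.any(in_stratum):
            raise ValueError(f"no study units on treatment in stratum {l}")
        estimate += (n_l / n_target) * y_study[in_stratum].mean()
    return estimate
```

The `ValueError` branch corresponds to an empirical positivity violation: a stratum of the target sample with no comparable study units.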
Compared to IPPW, strata reweighting is more likely to be numerically stable and easily implementable when treatment assignment is done at the group level (e.g., cluster-randomized trials). However, stratification implicitly assumes that treatment effects are identical for study and target patients in the same stratum; this assumption is rarely met, resulting in residual confounding and inconsistent estimates (Lunceford and Davidian, 2004). It also relies on the assumptions that treatment effect heterogeneity is fully captured by the propensity score for treatment and that outcomes are continuous and bounded. With too few strata, bias reduction will be insufficient; conversely, too many strata can lead to small strata counts and unstable estimates (Stuart, 2010; Tipton et al., 2017).

Propensity strata approaches have also been used to address violations of positivity of treatment assignment within the target sample in the setting where outcome data are available from both a randomized and an observational study (Rosenman et al., 2018). Rosenman et al. (2020) present an extension that aims to adjust for potential unmeasured confounding bias.

Outcome regressions, also known as response surface modeling, have not been as extensively developed for generalizability and transportability compared to propensity-based approaches.
Broadly speaking, outcome regression approaches fit an outcome regression in study sample data to estimate conditional means, then obtain PATEs by marginalizing over (i.e., standardizing to) the target sample covariate distribution by predicting counterfactuals for the target sample: $\hat{E}(Y^a) = \frac{1}{n}\sum_{i=1}^{n} \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i)$. If the target sample is not a simple random sample from the target population, this would be a weighted average using sampling weights (Kim et al., 2018).

Outcome regression approaches are particularly effective when effect modifiers strongly predict the outcome and when the outcome is common but selection into the study is rare. They are also convenient for exploring PCATEs. These approaches can yield better precision than weighting or matching-based methods because they can adjust for confounders, effect modifiers, and factors only predictive of the outcome, thus decreasing variance in the estimate. They are simple to implement when an outcome regression for confounding adjustment has already been fit and accounts for all relevant effect modifiers; the same regression that was used to estimate impacts within the study can then be used to predict counterfactuals in the target sample. Outcome regression methods can be used with either randomized or observational study data, but have been used most frequently with RCTs. In the presence of significant non-overlap between the target and study samples, outcome regressions rely on heavy extrapolation (Kern et al., 2016; Attanasio, Meghir and Szekely, 2003), often with no corresponding inflation of the variance to reflect uncertainty in the resulting estimates.

The simplest approach is an ordinary least squares outcome regression (Flores and Mitnik, 2013; Kern et al., 2016; Elliott and Valliant, 2017; Dahabreh et al., 2018, 2019a).
An outcome regression is fit with interaction terms between treatment and all effect modifiers before predicting counterfactual outcomes for the target sample (the marginalization step). Dahabreh et al. (2018) showed the consistency of this type of outcome regression for the PATE. For RCTs, separate regressions are recommended for each treatment group to better capture treatment effect heterogeneity (Dahabreh et al., 2019a), although this approach precludes borrowing information across treatment groups, which is possible with machine learning methods that discover treatment effect heterogeneity.

Among these machine learning techniques is BART, the most commonly used data-adaptive outcome regression approach for generalizability and transportability (Chipman, George and McCulloch, 2007, 2010; Kern et al., 2016; Hill, 2011). Tree-based methods, including BART, were briefly introduced in Section 4.3. BART models the outcome as a sum of trees with linear additive terms and a regularization prior. BART addresses external validity bias via its data-driven discovery of treatment effect heterogeneity, and strengths of the method include its ability to obtain confidence intervals from the posterior distribution (Hill, 2011; Green and Kern, 2012). However, BART credible intervals show undercoverage when the target population differs substantially from the RCT (Hill, 2011).

Data availability may challenge these outcome regression approaches. When the covariates in the target sample are not available in the study sample, or vice versa, but the SATE can be expected to be approximately unbiased for the PATE, the SATE estimates' credible intervals can be expanded to account for uncertainty in the target population covariate distribution (Hill, 2011).
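To make the marginalization step concrete, the sketch below fits an OLS outcome regression with a treatment-covariate interaction on simulated study data and standardizes its predictions to a covariate-shifted target sample; all names and data-generating values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
study = pd.DataFrame({"a": rng.integers(0, 2, 400), "x": rng.normal(size=400)})
# True model: the effect of a is 1 + x, so the PATE depends on the target's x
study["y"] = (1 + study["a"] + 0.5 * study["x"] + study["a"] * study["x"]
              + rng.normal(scale=0.3, size=400))
target = pd.DataFrame({"x": rng.normal(loc=1.0, size=300)})  # shifted modifier

fit = smf.ols("y ~ a * x", data=study).fit()
# Predict counterfactuals for the target sample under each treatment, then average
mu = {t: fit.predict(target.assign(a=t)).mean() for t in (0, 1)}
pate = mu[1] - mu[0]
```

Under this assumed data-generating model, the true target-population effect is $1 + E(X) \approx 2$, whereas the study-population effect is approximately 1.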
Here, we consider meta-analytic approaches for summary-level data as well as studies that combine individual-level data from more than one study (for example, one randomized and one observational study). Much of the literature has focused on meta-analytic techniques using summary-level study data and no target sample covariate information. This body of bias-adjusted meta-analysis methods largely does not explicitly define a target population for whom inference is desired, but rather relies on subjective investigator judgments of the levels of bias in each study, specified using bias functions or priors in a Bayesian framework. Eddy (1989) presents the first such approach, the confidence profile method for combining chains of evidence. Likelihoods are adjusted for different study designs' (investigator-specified) internal and external validity biases; uncertainty around these biases is incorporated through prior distributions. Various subsequent Bayesian hierarchical models have been developed, such as a 3-level model (Prevost, Abrams and Jones, 2000) with the levels corresponding to models of the observed evidence, variability between studies, and variability between study types (randomized vs. observational). When available, covariate information can be added to the models to address effect heterogeneity. Effectively, this estimator averages across the internal and external validity biases of the studies and therefore is only unbiased when the external validity bias in the RCT exactly 'cancels' the internal validity bias in the observational data (Kaizar, 2011).

Other meta-analysis studies leveraging summary-level data separately specify internal and external validity bias parameters for an explicit target population and down-weight studies with higher risk of bias. One such example is the bias-adjusted meta-analysis approach by Turner et al.
(2009), which presents a checklist that subjectively quantifies the extent of internal and external validity bias for each study and then weighs studies' average outcomes by the extent of bias. Greenland (2005) pools across observational case-control studies using a Bayesian meta-sensitivity model with bias parameters to separately permit consideration of misclassification, non-response, and unmeasured confounding. In the intermediate setting where individual-level data are available in the study but only covariate moments (e.g., means, variances) are available in the target setting, Phillippo et al. (2018) present an outcome regression approach for indirect treatment comparison across RCTs.

When individual-level outcome data are available in the target sample or from multiple studies, data can be combined into one joint dataset for outcome regression analysis if the outcome regression can be expected to be the same across studies (Kern et al., 2016). Such an approach can be preferable to IPPW, which uses only study and not target sample outcome data (Kern et al., 2016). However, it will be dominated by observational data results (and their potential biases) when observational subjects constitute the majority of the joint dataset, effectively resulting in a weighted average across studies, weighted by the proportion of subjects in each study.

Hierarchical Bayesian evidence synthesis is the only outcome regression approach we identified that attempts to empirically adjust for unobserved confounding when estimating effects for observational patients who are not well-represented in the RCTs (Verde et al., 2016; Verde, 2019).
Summary-level RCT data are combined with individual-level observational data through a weighting approach in which the control group event rate is assumed to be similar across all studies; a study quality bias term is added to the observational studies' outcome regression to account for unmeasured confounding or other uncontrolled biases and to inflate variance. Alternatively, Gechter (2015) derives bounds on the PATE and PCATE when transporting from an RCT to a target sample with outcome data (all untreated).

Doubly robust methods for generalizability and transportability typically combine outcome and propensity of selection regressions. They are asymptotically unbiased when at least one of these regression functions is consistently estimated, and asymptotically efficient if both are. However, if neither regression is estimated consistently, the mean squared error may be worse than using a propensity or outcome regression alone. Incorporating flexible modeling approaches can help mitigate regression misspecification. Three asymptotically locally efficient doubly robust approaches have been developed in randomized data: a targeted maximum likelihood estimator (TMLE) for instrumental variables (Rudolph and van der Laan, 2017), which is a semiparametric substitution estimator; the estimating equation-based augmented inverse probability of participation weighting (A-IPPW) estimator (Dahabreh et al., 2018, 2019a); and an augmented calibration weighting estimator that can also incorporate outcome information from the target sample when it is available (Dong et al., 2020).

The TMLE was developed for transportability in an encouragement design setting (i.e., an intervention focused on encouraging individuals in the treatment group to participate in the intervention) with instrumental variables (Rudolph and van der Laan, 2017) and has also been used for generalizability (Schmid et al., 2020).
Three different PATE estimators were developed: intent-to-treat, complier, and as-treated. All use an outcome regression to obtain an initial estimate, then adjust that estimate with a fluctuation function using a clever covariate $C$, which is derived from the efficient influence curve and incorporates the propensity of selection information in a bias reduction step. For example, for the intent-to-treat PATE, the fluctuation function takes the form logit$(\hat{E}(Y \mid S = 1, A, Z, X)) + \epsilon C$, where

$C = \frac{I(S = 1, A = a)}{P(A = a \mid S = 1, X)\, P(S = 1)} \cdot \frac{P(Z = z \mid S = 0, A = a, X)\, P(X \mid S = 0)}{P(Z = z \mid S = 1, A = a, X)\, P(X \mid S = 1)}$

and $Z$ corresponds to the intervention taken (whereas $A$ corresponds to the assigned intervention, as before). The approach allows outcome and propensity regressions to be flexibly fit, for example, using an ensemble of machine learning algorithms. Variances are calculated from the influence curve.

A-IPPW has been developed both for generalizing results to estimate PATEs for all trial-eligible individuals (Dahabreh et al., 2019a,c) and for transporting results to estimate PATEs for trial-eligible individuals not included in a trial (Dahabreh et al., 2018). Three doubly robust estimating equation-based estimators are presented: A-IPPW, A-IPPW with normalized weights that sum to 1 to ensure bounded estimates, and a weighted outcome regression estimator using participation weights. The non-normalized A-IPPW estimators are as follows, with $w_i$ the same as for IPPW:

$\frac{1}{n}\sum_{i=1}^{n} \left\{ w_i \{Y_i - \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i)\} + \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i) \right\}$ for generalizability,

$\frac{1}{n}\sum_{i=1}^{N} \left\{ w_i \{Y_i - \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i)\} + \{1 - I(S_i = 1)\} \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i) \right\}$ for transportability.

Variance can be derived using empirical sandwich estimates or using a nonparametric bootstrap.
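The two estimating equations above translate directly into code. The sketch below takes precomputed outcome-regression predictions and IPPW weights as inputs (function names are ours), illustrating the doubly robust form.

```python
import numpy as np

def aippw_generalizability(y, s, x_pred, w):
    """Non-normalized A-IPPW estimate of a potential outcome mean for the
    whole population (sketch). `x_pred` holds outcome-regression predictions
    of E(Y | S=1, A=a, X) for every individual; `w` are IPPW weights,
    nonzero only for study individuals on treatment a. Doubly robust:
    consistent if either the weights or the predictions are correct."""
    y, s, x_pred, w = map(np.asarray, (y, s, x_pred, w))
    return np.sum(w * (y - x_pred) + x_pred) / len(y)

def aippw_transportability(y, s, x_pred, w):
    """Transport version targeting the non-study population S=0: the
    outcome-model term is averaged over non-study individuals only, and
    `w` are the odds-of-participation weights."""
    y, s, x_pred, w = map(np.asarray, (y, s, x_pred, w))
    return np.sum(w * (y - x_pred) + (s == 0) * x_pred) / np.sum(s == 0)
```

When the outcome-regression predictions are exactly correct, the weighted residual term vanishes and each estimator reduces to the corresponding outcome regression estimator.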
As these estimators are partial M-estimators, they can produce estimates outside bounds if the outcome regression is not well-chosen, and they may have multiple solutions.

Several other doubly robust estimators for transportability resemble the IPPW estimator, with sampling weights derived through alternative approaches that do not rely on propensity scores (Josey et al., 2020a,b; Dong et al., 2020). For example, the semiparametric and efficient augmented weighting estimator of Dong et al. (2020) calibrates the RCT covariate distribution to match that of the sampling-weighted target sample.

An alternative reweighted outcome regression method for observational data does not claim double robustness and draws from the unsupervised domain adaptation literature. In general, unsupervised domain adaptation methods aim to make predictions for a target sample (the "target domain") when outcomes are only observed in the study sample (the "source domain"). The approach of Johansson et al. (2018) is a regularized neural network estimator for PCATE parameters that jointly learns representations from the data and a reweighting function. Representational learning creates balance between the study and target covariate distributions, and between treated and control distributions, in a representational space so that predictors use information common across these distributions and focus on covariates predictive of the outcome. In this learned representational space, results are then reweighted to minimize an upper bound on the expected value of the loss function under the target covariate distribution. Propensity scores can also be used to reweight a likelihood function, as done by Nie et al. (2013) in an RCT setting for calibrating control outcomes from prior studies to the trial target sample. Similarly, Flores and Mitnik (2013) reweight an outcome regression to the target sample.
Several methods have combined randomized and observational data sources such that they retain the internal validity of the randomized data and the external validity of the target sample observational data. These approaches broadly rely on the assumption that the relationship between unmeasured confounders and potential outcomes is the same in the RCT as in the target sample, which is a weaker assumption than the no unmeasured confounding assumption required by most of the methods described thus far. One study combined individual-level data from several RCTs to transport results to the target sample, extending the A-IPPW estimator (as well as corresponding IPPW and outcome regression estimators) to the multi-study setting (Dahabreh et al., 2019d). The remainder of the section discusses approaches that combine randomized and observational data.

When differences in effect modifiers between the RCT and target population are known (e.g., by inclusion and exclusion criteria), cross-design synthesis meta-analysis is a method for combining randomized and observational study data while capitalizing on the internal validity of the randomized data and the external validity of the observational data (Begg, 1992; Greenhouse et al., 2017). It provides a means for estimating treatment effects for patients excluded from the RCT and can use summary-level RCT data if outcomes are available by relevant patient subgroups, although it can only accommodate a limited number of strata of relevant effect modifiers.

Cross-design synthesis meta-analysis effectively assumes a constant amount of unmeasured confounding across patients eligible and ineligible for the RCTs (Kaizar, 2011).
This approach will have smaller bias than use of randomized or observational data alone under various common data scenarios and, across simulations, shows better coverage through smaller bias and increased variance (Kaizar, 2011).

When differences between RCT and target populations are less well understood, there are continuous effect modifiers, or there is a higher-dimensional set of effect modifiers, one can use Bayesian calibrated risk-adjusted regressions (Varadhan, Henderson and Weiss, 2016; Henderson, Varadhan and Weiss, 2017). This parametric approach requires individual-level information from observational and randomized studies, leveraging outcome regressions and calibration using the propensity of selection. The target population is assumed to be represented by a subset of the observational data; the RCT data are likewise assumed to be represented by a (potentially different) subset of the observational data. The calibrated risk-adjusted model performs well when there is poor overlap between RCT and target data; however, it relies on the observational dataset having substantial effect modifier overlap with both the target sample and RCT. Robust variance formulas or bootstrapping can be used to obtain confidence intervals.

A 2-step frequentist approach for consistently estimating PCATE parameters has been developed to estimate effects in a target population represented by observational data (Kallus, Puli and Shalit, 2018). It begins with outcome regressions for each treatment group of the observational data, or a flexible regression that captures effect heterogeneity. Observational data are then standardized to the RCT population before 'debiasing' their estimates using RCT data by including a correction term that can depend on measured covariates.
This method relies on the assumption that calibrating internal validity bias in the subset of the observational data distribution overlapping with RCT data appropriately calibrates the bias for the entire target sample. The 2-step approach would therefore not necessarily decrease bias if the covariate distribution is highly imbalanced, resulting in average biases that are quite different between the RCT-overlapping vs. nonoverlapping subsets of the target sample.

Lu et al. (2019) present an approach that, unlike the above methods, assumes no unmeasured confounding in the observational data when combining RCT and comprehensive cohort study data (where patients who decline randomization are enrolled in a parallel observational study). They use semiparametric double robust estimators that can incorporate flexible regressions.
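The 2-step logic can be sketched in a simplified form: fit (possibly confounded) outcome regressions on the observational data, then fit an additive correction term on the RCT data. This is only an illustration in the spirit of Kallus, Puli and Shalit (2018), not their estimator: it assumes a known randomization probability of 0.5, uses linear regressions throughout, and the names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def two_step_debias(X_obs, a_obs, y_obs, X_rct, a_rct, y_rct, p_treat=0.5):
    """Simplified 2-step 'debiasing' sketch: observational CATE plus an
    additive correction term learned from randomized data."""
    # Step 1: outcome regressions per arm on the observational data;
    # their difference is a possibly biased CATE estimate
    m1 = LinearRegression().fit(X_obs[a_obs == 1], y_obs[a_obs == 1])
    m0 = LinearRegression().fit(X_obs[a_obs == 0], y_obs[a_obs == 0])

    def tau_obs(X):
        return m1.predict(X) - m0.predict(X)

    # Step 2: build an unbiased RCT pseudo-outcome via inverse-probability
    # weighting, then regress its residual from the observational CATE
    # on covariates to learn the correction term
    pseudo = y_rct * (a_rct / p_treat - (1 - a_rct) / (1 - p_treat))
    corr = LinearRegression().fit(X_rct, pseudo - tau_obs(X_rct))

    def tau(X):
        return tau_obs(X) + corr.predict(X)

    return tau
```

In a toy simulation where unmeasured confounding inflates the observational effect by a constant, the fitted correction recovers roughly that constant, so the combined estimate tracks the true conditional effect.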
6. DISCUSSION
Obtaining unbiased estimates for a relevant target population requires applying generalizability or transportability methods in studies that meet required identifiability assumptions. The internal validity of randomized trials is not sufficient to obtain unbiased causal effects; external validity also needs to be considered. In this synthesis, we have discussed (1) sources of external validity bias and study designs to address it, (2) defining an estimand in a target population of interest, (3) the identifiability assumptions underpinning generalizability and transportability approaches, (4) a variety of approaches for quantifying the relevant dissimilarity between study and target samples and assessing treatment effect heterogeneity, and (5) a variety of matching and weighting methods, outcome regression approaches, and techniques that use both outcome and propensity regressions that generalize or transport from randomized and observational studies to a target population. These approaches have been applied across diverse settings, from RCT results transported to patients represented in registries to cluster-randomized educational intervention trials generalized to broader geographic areas. Across a variety of settings, it is important to estimate results for populations that go beyond the study population. We suggest the following considerations for researchers.

Make efforts to explicitly define the target population(s) and identify the study population from which your study sample data is a simple random sample. Describing the study population may be a difficult task, and there may not be a practically meaningful population that is representative of your study sample data. However, this clarity will allow you to compare and, when feasible, better align the study sample data to the target population.
Discussion regarding target population(s) should be guided by the ensuing decisions the study aims to inform as well as practical considerations (e.g., lack of certain subgroups in your study). These considerations may require iteration between feasibility and the desired study aims as well as careful discussion amidst study collaborators. When combining across studies, meta-analyses should likewise carefully specify target population(s) for inference and incorporate considerations of treatment effect heterogeneity or demonstrate that effect heterogeneity is not a concern. Without transparency in the target population(s), a study cannot estimate well-defined treatment effects, nor can readers judge the generalizability of study results to any other population of interest.
Plan for generalization in your study design, when feasible, including writing generalizability considerations into your grant or study objectives.
Enroll randomized study participants or design observational study inclusion and exclusion criteria such that the study sample is representative of the target population or fully captures the heterogeneity of effect modifiers. Collect data on likely treatment effect modifiers that are associated with study participation. Attempt to identify and mitigate potential sources of missingness or selection bias. If possible, collect baseline characteristics and outcome data on study nonparticipants who are part of the target population. Otherwise, identify external sources of data that might inform the composition of your target population with respect to effect modifiers and work towards aligning variables between these target sample data sources and your study.

Clearly describe the internal and external validity assumptions needed to identify the treatment effect as they relate to your study. Substantively assess the justifiability of these internal and external validity assumptions. To the extent possible, test the validity of the assumptions and perform sensitivity analyses to assess the impact of assumption violations.
Quantify the dissimilarity between the study and target populations using at least one method.
Ideally, use multiple methods, as they each tell different parts of the story: examine univariate and joint distributions of effect modifiers, differences in the propensity to participate in the study, and (if outcome information is available in the target sample) differences in outcomes between study and target subjects on the same treatment. If differences are identified, one should investigate which subpopulations drive those differences and assess whether they have heterogeneous treatment effects. In addition to examining subject characteristics, assess whether differences exist in the setting, treatment, or outcome.

To obtain causal estimates when the target and study populations differ with respect to effect modifiers, incorporate at least one generalizability or transportability estimator. Alternatively, at the minimum, assess and describe sources of effect heterogeneity and whether they are likely to differ for the target population. Derive estimates using as much data as possible (e.g., when outcome data are available, use them in a principled way). The choice of method for external validity bias adjustment may be restricted by data availability (e.g., summary-level vs. individual-level data) but should be driven by similar principles as those that guide the choice between outcome regressions, matching and weighting methods, and double robust approaches for confounding adjustment (van der Laan and Robins, 2003; Neugebauer and van der Laan, 2005; van der Laan and Rose, 2011). Flexible nonparametric and semiparametric models and estimators that use ensemble machine learning minimize the need for strict parametric assumptions and have the potential to perform the best (Kern et al., 2016).

For both methods developers and applied researchers, we recommend releasing publicly available code alongside the paper and providing details for implementation.
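As an illustration of the kind of shareable diagnostic code recommended here, the following is a minimal sketch of two of the dissimilarity checks described above: per-covariate standardized mean differences between study and target samples, and the distribution of estimated propensities of study participation. The function name and the logistic participation model are illustrative choices, not a prescribed implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def study_target_diagnostics(X_study, X_target):
    """Sketch of two dissimilarity diagnostics between study and target
    samples: standardized mean differences of covariates (pooled-SD
    denominator) and estimated propensities of study participation."""
    # Standardized mean differences for each covariate
    mean_s, mean_t = X_study.mean(axis=0), X_target.mean(axis=0)
    pooled_sd = np.sqrt((X_study.var(axis=0) + X_target.var(axis=0)) / 2)
    smd = (mean_s - mean_t) / pooled_sd
    # Propensity of study participation given covariates
    X = np.vstack([X_study, X_target])
    s = np.concatenate([np.ones(len(X_study)), np.zeros(len(X_target))])
    p = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X)[:, 1]
    # Return SMDs plus the participation propensities in each sample,
    # whose overlap can then be examined graphically
    return smd, p[s == 1], p[s == 0]
```

Large absolute standardized mean differences for likely effect modifiers, or poor overlap in the two propensity distributions, would flag subpopulations worth investigating for heterogeneous treatment effects.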
Published code facilitates replicability and accessibility of methods for future research and applied use. A substantial barrier to the adoption of new statistical methods, including advances in generalizability and transportability, is the lack of available computational tools.

While much of the causal inference literature has focused on issues of internal validity, both internal and external validity are necessary for valid inference. When treatment effect heterogeneity exists, as is often the case, study results may not hold for a target population of interest. Approaches to address internal validity biases can be borrowed to improve upon methods for addressing external validity bias. This review presents a framework for such analysis and summarizes different choices for estimators that can be used to generalize or transport results to a population different from the one under study. It brings together diverse cross-disciplinary literature to provide guidance both for applied and methods researchers. Improving the incorporation of results from observational studies, including electronic health databases, can lead to better inference for policy-relevant populations with reduced bias and improved precision.
7. ACKNOWLEDGMENTS
This research was supported by NIH New Innovator Award DP2MD012722 and NIH training grants T32LM012411 and T32ES07142. The authors thank Sebastien Haneuse, Francesca Dominici, and Laura Hatfield for helpful feedback on this work, as well as seminar and conference audiences at the Harvard Program on Causal Inference, Mathematica Policy Research, Harvard–Stanford Health Policy Data Science Lab, 2018 Harvard Data Science Initiative Conference, and 2020 NIA Workshop on Applications of Machine Learning to Improve Healthcare Delivery for Older Adults.
REFERENCES
Ackerman, B., Schmid, I., Rudolph, K. E., Seamans, M. J., Susukida, R., Mojtabai, R. and Stuart, E. A. (2019). Implementing Statistical Methods for Generalizing Randomized Trial Findings to a Target Population. Addictive Behaviors.
Allcott, H. (2015). Site Selection Bias in Program Evaluation. The Quarterly Journal of Economics.
Allcott, H. and Mullainathan, S. (2012). External Validity and Partner Selection Bias. National Bureau of Economic Research Working Paper Series.
Andrews, I. and Oster, E. (2017). Weighting for External Validity. Technical Report No. w23826, National Bureau of Economic Research, Cambridge, MA.
Angrist, J. D. and Fernández-Val, I. (2013). ExtrapoLATE-ing: External Validity and Overidentification in the LATE Framework. In Advances in Economics and Econometrics.
Athey, S. and Imbens, G. (2016). Recursive Partitioning for Heterogeneous Causal Effects. Proceedings of the National Academy of Sciences.
Attanasio, O., Meghir, C. and Szekely, M. (2003). Using Randomised Experiments and Structural Models for 'Scaling Up': Evidence from the PROGRESA Evaluation. IFS Working Paper EWP03/05.
Baker, R., Brick, J. M., Gotway Crawford, C. A., Terhanian, G., Langer, G., Bates, N. A., Battaglia, M., Couper, M. P., Dever, J. A., Gile, K. J., Tourangeau, R., Valliant, R. and Rivers, D. (2013). Summary Report of the AAPOR Task Force on Non-Probability Sampling. Journal of Survey Statistics and Methodology.
Bareinboim, E. and Pearl, J. (2014). Transportability from Multiple Environments with Limited Experiments: Completeness Results. In Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger, eds.) 280–288. Curran Associates, Inc.
Bareinboim, E. and Pearl, J. (2016). Causal Inference and the Data-Fusion Problem. Proceedings of the National Academy of Sciences.
Bareinboim, E., Tian, J. and Pearl, J. (2014). Recovering from Selection Bias in Causal and Statistical Inference. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI'14.
Bareinboim, E. and Tian, J. (2015). Recovering Causal Effects from Selection Bias. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI'15.
Begg, C. B. (1992). Cross Design Synthesis: A New Strategy for Medical Effectiveness Research. United States General Accounting Office (GAO/PEMD-92-18). Statistics in Medicine.
Bell, S. H., Olsen, R. B., Orr, L. L. and Stuart, E. A. (2016). Estimates of External Validity Bias When Impact Evaluations Select Sites Nonrandomly. Educational Evaluation and Policy Analysis.
Benchimol, E. I., Smeeth, L., Guttmann, A., Harron, K., Moher, D., Petersen, I., Sørensen, H. T., von Elm, E., Langan, S. M. and RECORD Working Committee (2015). The REporting of Studies Conducted Using Observational Routinely-Collected Health Data (RECORD) Statement. PLOS Medicine, e1001885.
Bennett, M., Vielma, J. P. and Zubizarreta, J. R. (2020). Building Representative Matched Samples with Multi-Valued Treatments in Large Observational Studies. Journal of Computational and Graphical Statistics.
Buchanan, A. L., Hudgens, M. G., Cole, S. R., Mollan, K. R., Sax, P. E., Daar, E. S., Adimora, A. A., Eron, J. J. and Mugavero, M. J. (2018). Generalizing Evidence from Randomized Trials Using Inverse Probability of Sampling Weights. Journal of the Royal Statistical Society: Series A (Statistics in Society).
Burchett, H., Umoquit, M. and Dobrow, M. (2011). How Do We Know When Research from One Setting Can Be Useful in Another? A Review of External Validity, Applicability and Transferability Frameworks. Journal of Health Services Research & Policy.
Cahan, A., Cahan, S. and Cimino, J. J. (2017). Computer-Aided Assessment of the Generalizability of Clinical Trial Results. International Journal of Medical Informatics.
Chan, W. (2017). Partially Identified Treatment Effects for Generalizability. Journal of Research on Educational Effectiveness.
Chen, Z. and Kaizar, E. (2017). On Variance Estimation for Generalizing from a Trial to a Target Population. arXiv:1704.07789 [stat].
Chen, C. and Wong, R. (2018). Black Patients Miss Out on Promising Cancer Drugs.
Chen, S., Tian, L., Cai, T. and Yu, M. (2017). A General Statistical Framework for Subgroup Identification and Comparative Treatment Scoring. Biometrics.
Chen, I. Y., Pierson, E., Rose, S., Joshi, S., Ferryman, K. and Ghassemi, M. (2020). Ethical Machine Learning in Health. arXiv:2009.10576.
Chipman, H. A., George, E. I. and McCulloch, R. (2007). Bayesian Ensemble Learning. In Advances in Neural Information Processing Systems 19 - Proceedings of the 2006 Conference.
Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics.
Cole, S. R. and Stuart, E. A. (2010). Generalizing Evidence from Randomized Clinical Trials to Target Populations: The ACTG 320 Trial. American Journal of Epidemiology.
Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.-P., Josse, J. and Yang, S. (2020). Causal Inference Methods for Combining Randomized Trials and Observational Studies: A Review. arXiv:2011.08047 [stat].
Correa, J. D. and Bareinboim, E. (2017). Causal Effect Identification by Adjustment under Confounding and Selection Biases. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. AAAI'17.
Correa, J. D., Tian, J. and Bareinboim, E. (2018). Generalized Adjustment under Confounding and Selection Biases. In AAAI.
Cronbach, L. J. and Shapiro, K. (1982). Designing Evaluations of Educational and Social Programs, 1st ed. A Joint Publication in the Jossey-Bass Series in Social and Behavioral Science & in Higher Education. Jossey-Bass, San Francisco.
Crump, R. K., Hotz, V. J., Imbens, G. W. and Mitnik, O. A. (2008). Nonparametric Tests for Treatment Effect Heterogeneity. Review of Economics and Statistics.
Dahabreh, I., Robertson, S., Stuart, E. and Hernan, M. (2017). Extending Inferences from Randomized Participants to All Eligible Individuals Using Trials Nested within Cohort Studies. arXiv:1709.04589 [stat].
Dahabreh, I. J., Robertson, S. E., Steingrimsson, J. A., Stuart, E. A. and Hernan, M. A. (2018). Extending Inferences from a Randomized Trial to a New Target Population. arXiv:1805.00550 [stat].
Dahabreh, I. J., Robertson, S. E., Tchetgen, E. J., Stuart, E. A. and Hernán, M. A. (2019a). Generalizing Causal Inferences from Individuals in Randomized Trials to All Trial-Eligible Individuals. Biometrics.
Dahabreh, I. J., Robins, J. M., Haneuse, S. J.-P. A., Saeed, I., Robertson, S. E., Stuart, E. A. and Hernán, M. A. (2019b). Sensitivity Analysis Using Bias Functions for Studies Extending Inferences from a Randomized Trial to a Target Population.
Dahabreh, I. J., Hernan, M. A., Robertson, S. E., Buchanan, A. and Steingrimsson, J. A. (2019c). Generalizing Trial Findings Using Nested Trial Designs with Sub-Sampling of Non-Randomized Individuals.
Dahabreh, I. J., Robertson, S. E., Petito, L. C., Hernán, M. A. and Steingrimsson, J. A. (2019d). Efficient and Robust Methods for Causally Interpretable Meta-Analysis: Transporting Inferences from Multiple Randomized Trials to a Target Population.
Davis, K. (1988). The Comprehensive Cohort Study: The Use of Registry Data to Confirm and Extend a Randomized Trial. Recent Results in Cancer Research.
Dekkers, O. M., von Elm, E., Algra, A., Romijn, J. A. and Vandenbroucke, J. P. (2010). How to Assess the External Validity of Therapeutic Trials: A Conceptual Approach. International Journal of Epidemiology.
Ding, P., Feller, A. and Miratrix, L. (2016). Randomization Inference for Treatment Effect Variation. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Dong, N., Stuart, E. A., Lenis, D. and Quynh Nguyen, T. (2020). Using Propensity Score Analysis of Survey Data to Estimate Population Average Treatment Effects: A Case Study Comparing Different Methods. Evaluation Review.
Eddy, D. (1989). The Confidence Profile Method: A Bayesian Method for Assessing Health Technologies. Operations Research.
Elliott, M. R. and Valliant, R. (2017). Inference for Nonprobability Samples. Statistical Science.
Fang, A. (2017). 10 Things to Know about Heterogeneous Treatment Effects.
Flores, C. A. and Mitnik, O. A. (2013). Comparing Treatments across Labor Markets: An Assessment of Nonexperimental Multiple-Treatment Strategies. The Review of Economics and Statistics.
Ford, I. and Norrie, J. (2016). Pragmatic Trials. New England Journal of Medicine.
Frangakis, C. (2009). The Calibration of Treatment Effects from Clinical Trials to Target Populations. Clinical Trials: Journal of the Society for Clinical Trials.
Gabler, N. B., Duan, N., Liao, D., Elmore, J. G., Ganiats, T. G. and Kravitz, R. L. (2009). Dealing with Heterogeneity of Treatment Effects: Is the Literature Up to the Challenge? Trials.
Gail, M. and Simon, R. (1985). Testing for Qualitative Interactions between Treatment Effects and Patient Subsets. Biometrics.
Gechter, M. (2015). Generalizing the Results from Social Experiments: Theory and Evidence from Mexico and India. Department of Economics, Pennsylvania State University. Unpublished manuscript.
Gelman, A. and Little, T. C. (1997). Poststratification into Many Categories Using Hierarchical Logistic Regression. Survey Methodology.
Glauner, P., Migliosi, A., Meira, J., Valtchev, P., State, R. and Bettinger, F. (2017). Is Big Data Sufficient for a Reliable Detection of Non-Technical Losses? arXiv:1702.03767 [cs].
Green, L. W. and Glasgow, R. E. (2006). Evaluating the Relevance, Generalization, and Applicability of Research: Issues in External Validation and Translation Methodology. Evaluation & the Health Professions.
Green, D. P. and Kern, H. L. (2012). Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees. Public Opinion Quarterly.
Greenhouse, Kelleher, K., Seltman, H. and Gardner, W. (2008). Generalizing from Clinical Trial Data: A Case Study. The Risk of Suicidality among Pediatric Antidepressant Users. Statistics in Medicine.
Greenhouse, J. B., Kaizar, E. E., Anderson, H. D., Bridge, J. A., Libby, A. M., Valuck, R. and Kelleher, K. J. (2017). Combining Information from Multiple Data Sources: An Introduction to Cross-Design Synthesis with a Case Study. In Methods in Comparative Effectiveness Research.
Greenland, S. (2005). Multiple-Bias Modelling for Analysis of Observational Data. Journal of the Royal Statistical Society: Series A.
Gunter, L., Zhu, J. and Murphy, S. A. (2011). Variable Selection for Qualitative Interactions. Statistical Methodology.
Haneuse, S. (2016). Distinguishing Selection Bias and Confounding Bias in Comparative Effectiveness Research. Medical Care, e23–e29.
Haneuse, S., Schildcrout, J., Crane, P., Sonnen, J., Breitner, J. and Larson, E. (2009). Adjustment for Selection Bias in Observational Studies with Application to the Analysis of Autopsy Data. Neuroepidemiology.
Hartman, E., Grieve, R., Ramsahai, R. and Sekhon, J. S. (2015). From Sample Average Treatment Effect to Population Average Treatment Effect on the Treated: Combining Experimental with Observational Studies to Estimate Population Treatment Effects. Journal of the Royal Statistical Society: Series A (Statistics in Society).
He, Z., Ryan, P., Hoxha, J., Wang, S., Carini, S., Sim, I. and Weng, C. (2016). Multivariate Analysis of the Population Representativeness of Related Clinical Studies. Journal of Biomedical Informatics.
Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica.
Henderson, N. C., Varadhan, R. and Weiss, C. O. (2017). Cross-Design Synthesis for Extending the Applicability of Trial Evidence When Treatment Effect Is Heterogenous: Part II. Application and External Validation. Communications in Statistics: Case Studies, Data Analysis and Applications.
Hernán, M. A., Alonso, A., Logan, R., Grodstein, F., Michels, K. B., Willett, W. C., Manson, J. E. and Robins, J. M. (2008). Observational Studies Analyzed like Randomized Experiments: An Application to Postmenopausal Hormone Therapy and Coronary Heart Disease. Epidemiology.
Hill, J. L. (2011). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics.
Horvitz, D. G. and Thompson, D. J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association.
Hotz, V. J., Imbens, G. W. and Mortimer, J. H. (2005). Predicting the Efficacy of Future Training Programs Using Past Experiences at Other Locations. Journal of Econometrics.
Imai, K., King, G. and Stuart, E. A. (2008). Misunderstandings between Experimentalists and Observationalists about Causal Inference. Journal of the Royal Statistical Society: Series A (Statistics in Society).
Johansson, F. D., Kallus, N., Shalit, U. and Sontag, D. (2018). Learning Weighted Representations for Generalization across Designs. arXiv:1802.08598 [stat].
Josey, K. P., Yang, F., Ghosh, D. and Raghavan, S. (2020a). A Calibration Approach to Transportability with Observational Data.
Josey, K. P., Berkowitz, S. A., Ghosh, D. and Raghavan, S. (2020b). Transporting Experimental Results with Entropy Balancing.
Kaizar, E. E. (2011). Estimating Treatment Effect via Simple Cross Design Synthesis. Statistics in Medicine.
Kaizar, E. E. (2015). Incorporating Both Randomized and Observational Data into a Single Analysis. Annual Review of Statistics and Its Application.
Kallus, N., Puli, A. M. and Shalit, U. (2018). Removing Hidden Confounding by Experimental Grounding. arXiv:1810.11646 [cs, stat].
Kennedy, L. and Gelman, A. (2019). Know Your Population and Know Your Model: Using Model-Based Regression and Poststratification to Generalize Findings beyond the Observed Sample. arXiv:1906.11323 [stat].
Kennedy-Martin, T., Curtis, S., Faries, D., Robinson, S. and Johnston, J. (2015). A Literature Review on the Representativeness of Randomized Controlled Trial Samples and Implications for the External Validity of Trial Results. Trials.
Kern, H. L., Stuart, E. A., Hill, J. and Green, D. P. (2016). Assessing Methods for Generalizing Experimental Impact Estimates to Target Populations. Journal of Research on Educational Effectiveness.
Kim, J. K., Park, S., Chen, Y. and Wu, C. (2018). Combining Non-Probability and Probability Survey Samples through Mass Imputation.
Lesko, C. R., Buchanan, A. L., Westreich, D., Edwards, J. K., Hudgens, M. G. and Cole, S. R. (2017). Generalizing Study Results: A Potential Outcomes Perspective. Epidemiology.
Lipkovich, I., Dmitrienko, A., Denne, J. and Enas, G. (2011). Subgroup Identification Based on Differential Effect Search - a Recursive Partitioning Method for Establishing Response to Treatment in Patient Subpopulations (SIDES). Statistics in Medicine.
Lu, Y., Scharfstein, D. O., Brooks, M. M., Quach, K. and Kennedy, E. H. (2019). Causal Inference for Comprehensive Cohort Studies.
Luedtke, A., Carone, M. and van der Laan, M. J. (2019). An Omnibus Non-parametric Test of Equality in Distribution for Unknown Functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Lunceford, J. K. and Davidian, M. (2004). Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study. Statistics in Medicine.
Marcus, S. (1997). Assessing Non-Consent Bias with Parallel Randomized and Nonrandomized Clinical Trials. Journal of Clinical Epidemiology.
Miettinen, O. S. (1972). Standardization of Risk Ratios. American Journal of Epidemiology.
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V. and Herrera, F. (2012). A Unifying View on Dataset Shift in Classification. Pattern Recognition.
Neugebauer, R. and van der Laan, M. (2005). Why Prefer Double Robust Estimators in Causal Inference? Journal of Statistical Planning and Inference.
Nguyen, T. Q., Ebnesajjad, C., Cole, S. R. and Stuart, E. A. (2017). Sensitivity Analysis for an Unobserved Moderator in RCT-to-Target-Population Generalization of Treatment Effects. Annals of Applied Statistics.
Nguyen, T. Q., Ackerman, B., Schmid, I., Cole, S. R. and Stuart, E. A. (2018). Sensitivity Analyses for Effect Modifiers Not Observed in the Target Population When Generalizing Treatment Effects from a Randomized Controlled Trial: Assumptions, Models, Effect Scales, Data Scenarios, and Implementation Details. PLOS ONE, e0208795.
Nie, L., Zhang, Z., Rubin, D. and Chu, J. (2013). Likelihood Reweighting Methods to Reduce Potential Bias in Noninferiority Trials Which Rely on Historical Data to Make Inference. The Annals of Applied Statistics.
O'Muircheartaigh, C. and Hedges, L. V. (2014). Generalizing from Unrepresentative Experiments: A Stratified Propensity Score Approach. Journal of the Royal Statistical Society: Series C (Applied Statistics).
Olsen, R. B., Orr, L. L., Bell, S. H. and Stuart, E. A. (2013). External Validity in Policy Evaluations That Choose Sites Purposively. Journal of Policy Analysis and Management.
Pan, Q. and Schaubel, D. E. (2008). Proportional Hazards Models Based on Biased Samples and Estimated Selection Probabilities. The Canadian Journal of Statistics / La Revue Canadienne de Statistique.
Pan, Q. and Schaubel, D. E. (2009). Evaluating Bias Correction in Weighted Proportional Hazards Regression. Lifetime Data Analysis.
Park, D. K., Gelman, A. and Bafumi, J. (2004). Bayesian Multilevel Estimation with Poststratification: State-Level Estimates from National Polls. Political Analysis.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Pearl, J. (2015). Generalizing Experimental Findings. Journal of Causal Inference.
Pearl, J. and Bareinboim, E. (2011). Transportability of Causal and Statistical Relations: A Formal Approach.
Pearl, J. and Bareinboim, E. (2014). External Validity: From Do-Calculus to Transportability across Populations. Statistical Science.
Phillippo, D. M., Ades, A. E., Dias, S., Palmer, S., Abrams, K. R. and Welton, N. J. (2018). Methods for Population-Adjusted Indirect Comparisons in Health Technology Appraisal. Medical Decision Making.
Pool, I., Abelson, R. and Popkin, S. (1964). Candidates, Issues and Strategies; a Computer Simulation of the 1960 Presidential Election. Massachusetts Institute of Technology Press.
Prentice, R. L., Langer, R., Stefanick, M. L., Howard, B. V., Pettinger, M., Anderson, G., Barad, D., Curb, J. D., Kotchen, J., Kuller, L., Limacher, M. and Wactawski-Wende, J. (2005). Combined Postmenopausal Hormone Therapy and Cardiovascular Disease: Toward Resolving the Discrepancy between Observational Studies and the Women's Health Initiative Clinical Trial. American Journal of Epidemiology.
Prevost, T. C., Abrams, K. R. and Jones, D. R. (2000). Hierarchical Models in Generalized Synthesis of Evidence: An Example Based on Studies of Breast Cancer Screening. Statistics in Medicine.
Qian, M., Chakraborty, B. and Maiti, R. (2019). A Sequential Significance Test for Treatment by Covariate Interactions. arXiv:1901.08738 [stat].
Rosenman, E., Owen, A. B., Baiocchi, M. and Banack, H. (2018). Propensity Score Methods for Merging Observational and Experimental Datasets. arXiv:1804.07863 [stat].
Rosenman, E., Basse, G., Owen, A. and Baiocchi, M. (2020). Combining Observational and Experimental Datasets Using Shrinkage Estimators.
Rothwell, P. M. (2005). External Validity of Randomised Controlled Trials: "To Whom Do the Results of This Trial Apply?". The Lancet.
Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology.
Rudolph, K. and van der Laan, M. (2017). Robust Estimation of Encouragement Design Intervention Effects Transported across Sites. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Schmid, I., Rudolph, K. E., Nguyen, T. Q., Hong, H., Seamans, M. J., Ackerman, B. and Stuart, E. A. (2020). Comparing the Performance of Statistical Methods That Generalize Effect Estimates from Randomized Controlled Trials to Much Larger Target Populations. Communications in Statistics - Simulation and Computation, ahead-of-print.
Schulz, K. F., Altman, D. G. and Moher, D. (2010). CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials. BMJ, c332.
Schwartz, D. and Lellouch, J. (1967). Explanatory and Pragmatic Attitudes in Therapeutical Trials. Journal of Chronic Diseases.
Sen, A., Chakrabarti, S., Goldstein, A., Wang, S., Ryan, P. B. and Weng, C. (2016). GIST 2.0: A Scalable Multi-Trait Metric for Quantifying Population Representativeness of Individual Clinical Studies. Journal of Biomedical Informatics.
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2001). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston.
Signorovitch, J. E., Wu, E. Q., Yu, A. P., Gerrits, C. M., Kantor, E., Bao, Y., Gupta, S. R. and Mulani, P. M. (2010). Comparative Effectiveness without Head-to-Head Trials: A Method for Matching-Adjusted Indirect Comparisons Applied to Psoriasis Treatment with Adalimumab or Etanercept. PharmacoEconomics.
Simon, R. (1982). Patient Subsets and Variation in Therapeutic Efficacy. British Journal of Clinical Pharmacology.
Stuart, E. A. (2010). Matching Methods for Causal Inference: A Review and a Look Forward. Statistical Science.
Stuart, E. A., Ackerman, B. and Westreich, D. (2018). Generalizability of Randomized Trial Results to Target Populations: Design and Analysis Possibilities. Research on Social Work Practice.
Stuart, E. A., Bradshaw, C. P. and Leaf, P. J. (2015). Assessing the Generalizability of Randomized Trial Results to Target Populations. Prevention Science.
Stuart, E. A., Cole, S. R., Bradshaw, C. P. and Leaf, P. J. (2011). The Use of Propensity Scores to Assess the Generalizability of Results from Randomized Trials. Journal of the Royal Statistical Society: Series A (Statistics in Society).
Su, X., Zhou, T., Yan, X., Fan, J. and Yang, S. (2008). Interaction Trees with Censored Survival Data. The International Journal of Biostatistics.
Su, X., Tsai, C., Wang, H., Nickerson, D. and Li, B. (2009). Subgroup Analysis via Recursive Partitioning. Journal of Machine Learning Research.
Tian, L., Alizadeh, A. A., Gentles, A. J. and Tibshirani, R. (2014). A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates. Journal of the American Statistical Association.
Tipton, E. (2013a). Stratified Sampling Using Cluster Analysis: A Sample Selection Strategy for Improved Generalizations from Experiments. Evaluation Review.
Tipton, E. (2013b). Improving Generalizations from Experiments Using Propensity Score Subclassification: Assumptions, Properties, and Contexts. Journal of Educational and Behavioral Statistics.
Tipton, E. (2014). How Generalizable Is Your Experiment? An Index for Comparing Experimental Samples and Populations. Journal of Educational and Behavioral Statistics.
Tipton, E. and Olsen, R. B. (2018). A Review of Statistical Methods for Generalizing from Evaluations of Educational Interventions. Educational Researcher.
Tipton, E. and Peck, L. R. (2017). A Design-Based Approach to Improve External Validity in Welfare Policy Evaluations.
Evaluation Review Tipton, E. , Hedges, L. , Vaden-Kiernan, M. , Borman, G. , Sullivan, K. and
Caverly, S. (2014).Sample Selection in Randomized Experiments: A New Method Using Propensity Score Strati-fied Sampling.
Journal of Research on Educational E ff ectiveness Tipton, E. , Hallberg, K. , Hedges, L. V. and
Chan, W. (2017). Implications of Small Samples forGeneralization: Adjustments and Rules of Thumb.
Evaluation Review Turner, R. M. , Spiegelhalter, D. J. , Smith, G. C. S. and
Thompson, S. G. (2009). Bias Modellingin Evidence Synthesis.
Journal of the Royal Statistical Society: Series A (Statistics in Society)
Van der Laan, M. J. , Laan, M. and
Robins, J. M. (2003).
Unified Methods for Censored LongitudinalData and Causality . Springer Science & Business Media. van der Laan, M. J. and
Rose, S. (2011).
Targeted Learning . Springer Series in Statistics . SpringerNew York, New York, NY.
Varadhan, R. , Henderson, N. C. and
Weiss, C. O. (2016). Cross-Design Synthesis for Extendingthe Applicability of Trial Evidence When Treatment E ff ect Is Heterogeneous: Part i. Methodol-ogy. Communications in Statistics: Case Studies, Data Analysis and Applications Verde, P. E. (2019). The Hierarchical Metaregression Approach and Learning from Clinical Evi-dence.
Biometrical Journal Verde, P. E. and
Ohmann, C. (2015). Combining Randomized and Non-Randomized Evidencein Clinical Research: A Review of Methods and Applications: Combining Randomized andNon-Randomized Evidence.
Research Synthesis Methods Verde, P. E. , Ohmann, C. , Morbach, S. and
Icks, A. (2016). Bayesian Evidence Synthesis for Ex-ploring Generalizability of Treatment E ff ects: A Case Study of Combining Randomized andNon-Randomized Results in Diabetes: Bayesian Evidence Synthesis for Exploring Generaliz-ability of Treatment E ff ects: A Case Study of Combining Randomized and Non-RandomizedResults in Di. Statistics in Medicine von Elm, E. , Altman, D. G. , Egger, M. , Pocock, S. J. , Gøtzsche, P. C. and
Vandenbroucke, J. P. (2008). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE)Statement: Guidelines for Reporting Observational Studies.
Journal of Clinical Epidemiology Weisberg, H. I. , Hayden, V. C. and
Pontes, V. P. (2009). Selection Criteria and Generalizabil-ity within the Counterfactual Framework: Explaining the Paradox of Antidepressant-InducedSuicidality?
Clinical Trials Weiss, C. O. , Segal, J. B. and
Varadhan, R. (2012). Assessing the Applicability of Trial Evidenceto a Target Sample in the Presence of Heterogeneity of Treatment E ff ect: APPLICABILITY OFTREATMENT EFFECTS. Pharmacoepidemiology and Drug Safety Weng, C. , Li, Y. , Ryan, P. , Zhang, Y. , Liu, F. , Gao, J. , Bigger, J. T. and
Hripcsak, G. (2014). ADistribution-Based Method for Assessing the Di ff erences between Clinical Trial Target Pop-ulations and Patient Populations in Electronic Health Records. Applied clinical informatics Westreich, D. , Edwards, J. K. , Lesko, C. R. , Stuart, E. and
Cole, S. R. (2017). Transportability ofTrial Results Using Inverse Odds of Sampling Weights.
American Journal of Epidemiology I. DEGTIAR ET AL.
APPENDIX: SUMMARY OF METHODS THAT ONLY REQUIRE SUMMARY-LEVEL DATA
Without access to individual patient data in the study and/or target samples, investigators are constrained in the estimators available to them. The following estimators can be applied in this setting. Investigators should nevertheless strive to make maximal use of the available data and hence use methods that incorporate individual-level data wherever it is available.
Summary-level data for both study (covariate and outcome) and target samples (covariate).
Post-stratification (Miettinen, 1972; Prentice et al., 2005) only requires joint distributions or cell counts for each stratum. Using only study and target sample means, one could also apply outcome regressions that are linear in their predictors.
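As a minimal sketch of the post-stratification idea above, stratum-specific effect estimates from the study can be reweighted by the target sample's stratum proportions. The strata, effect values, and cell counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Post-stratification sketch: weight stratum-specific study estimates
# by target-population stratum proportions. All numbers are hypothetical.

# Stratum -> estimated treatment effect in the study sample.
study_effects = {"age<65": 0.30, "age>=65": 0.10}

# Stratum cell counts in the target sample (summary-level covariate data).
target_counts = {"age<65": 4000, "age>=65": 6000}

def post_stratified_effect(effects, counts):
    """Average stratum effects, weighted by target stratum proportions."""
    total = sum(counts.values())
    return sum(effects[s] * counts[s] / total for s in effects)

print(round(post_stratified_effect(study_effects, target_counts), 4))  # 0.18
```

Only stratum-level summaries enter the computation, which is why the method needs no individual patient data; the cost is that adjustment is limited to the covariates defining the strata.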
Summary-level outcome data for both study and target samples.
Bias-adjusted meta-analysis approaches by Turner et al. (2009) and Greenland (2005) require summary-level study outcome data together with estimates of the bias in each study. When the summary-level data are stratified by effect modifiers, one can use the approaches of Eddy (1989) and Prevost, Abrams and Jones (2000). If summary-level study data are stratified by participants included vs. excluded from the study, cross-design synthesis can be used (Begg, 1992; Kaizar, 2011).
Summary-level covariate and outcome data in the study, individual-level covariate and outcome data in the target sample.
With summary-level study data and individual-level target sample data, one can use hierarchical Bayesian evidence synthesis (Verde et al., 2016; Verde, 2019).
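To make the bias-adjusted meta-analysis idea discussed above concrete, one common additive formulation shifts each study estimate by its assessed bias, inflates the study variance by the uncertainty in that bias assessment, and then pools with inverse-variance weights. This is a hedged sketch of that generic scheme, not a faithful implementation of any one cited method; all numeric inputs are hypothetical.

```python
# Bias-adjusted inverse-variance pooling (illustrative sketch).
# Each tuple: (estimate, variance, assessed additive bias, variance of bias).
studies = [
    (0.25, 0.010, 0.05, 0.002),
    (0.40, 0.020, 0.15, 0.010),
    (0.10, 0.015, 0.00, 0.001),
]

def bias_adjusted_pool(studies):
    """Pool bias-adjusted study estimates with inverse-variance weights."""
    weights, adjusted = [], []
    for est, var, bias, bias_var in studies:
        adjusted.append(est - bias)             # remove assessed bias
        weights.append(1.0 / (var + bias_var))  # inflate variance by bias uncertainty
    total_w = sum(weights)
    pooled = sum(w * a for w, a in zip(weights, adjusted)) / total_w
    pooled_var = 1.0 / total_w
    return pooled, pooled_var

pooled, pooled_var = bias_adjusted_pool(studies)
```

Studies with larger assessed bias uncertainty receive less weight, so the pooled estimate leans toward the studies judged least biased.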