A Review of Generalizability and Transportability
Irina Degtiar and Sherri Rose
Harvard T.H. Chan School of Public Health and Stanford University
Abstract.
When assessing causal effects, determining the target population to which the results are intended to generalize is a critical decision. Randomized and observational studies each have strengths and limitations for estimating causal effects in a target population. Estimates from randomized data may have internal validity but are often not representative of the target population. Observational data may better reflect the target population, and hence be more likely to have external validity, but are subject to potential bias due to unmeasured confounding. While much of the causal inference literature has focused on addressing internal validity bias, both internal and external validity are necessary for unbiased estimates in a target population. This paper presents a framework for addressing external validity bias, including a synthesis of approaches for generalizability and transportability, the assumptions they require, as well as tests for the heterogeneity of treatment effects and differences between study and target populations.

MSC 2010 subject classifications: Primary 62-02, Research exposition; secondary 62G05, Nonparametric inference: Estimation.
Key words and phrases: generalizability, transportability, external validity, treatment effect heterogeneity, causal inference.
1. BACKGROUND
The goal of causal inference is often to gain understanding of a particular target population based on study findings. The true underlying causal effect will typically vary with the definition of the chosen target population. However, samples unrepresentative of the target population arise frequently in studies ranging from randomized controlled trials (RCTs) in clinical medicine to policy research (Bell et al., 2016; Kennedy-Martin et al., 2015; Allcott, 2015). In a clinical trial setting, physicians may be left interpreting evidence from RCTs with patients who have demographics and comorbidities that are quite different from

Irina Degtiar is a PhD candidate at the Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Ave, Boston, MA 02115, USA (email: [email protected]). Sherri Rose is an Associate Professor at the Center for Health Policy and Center for Primary Care and Outcomes Research, Stanford University, 615 Crothers Way, CA 94305, USA (email: [email protected]).
Figure 1. Internal vs. external validity biases as they relate to target, study, and analysis populations.

those of their patients. As an example, within cancer RCTs, African Americans are widely underrepresented despite being at an increased risk for many cancers (Chen and Wong, 2018). Failing to address this lack of representation can lead to inappropriate conclusions and harm (Chen et al., 2020). In a policy setting, it is important to consider the effects that can be expected in the eventual target population in order to set expectations for anticipated results and determine groups that should be targeted for an intervention.

The relationships between target, study, and analysis populations are visualized in Figure 1. The target sample is a representative sample of the target population, whereas the study population is defined by enrollment processes and inclusion or exclusion criteria. Due to these practical and scientific considerations, the study population may differ from the target population. Correspondingly, the enrolled participants who form the study sample may have different characteristics from those of the target sample. In the cancer RCT example, while a physician might care about the target population of patients that may come in to be treated by their clinic (of which the clinic's current patients are a target sample), the study sample on which they're basing their treatment recommendations may not include any African Americans. The study population is the hypothetical population that the study sample represents, which likewise includes no African Americans. Post-enrollment, further dropout and missingness may occur that create the observed analysis sample. In this case, dropout may have occurred for patients who experienced severe adverse events such that the analysis sample consists of patients who did not experience severe side effects.
There then exists a hypothetical analysis population from which the analysis sample data is a simple random sample. Hereafter, for simplicity and consistency with the literature, we will use the terms study sample and study population to be inclusive of the analysis sample and analysis populations, respectively.

Several key concepts are crucial to understand when considering extending causal inferences beyond a study sample. Generalizability focuses on the setting where the study population is a subset of the target population of interest, while transportability addresses the setting where the study population is (at
least partly) external to the target population. Internal validity is defined as an effect estimate being unbiased for the causal treatment effect in the population from which the sample is a simple random sample (i.e., moving vertically from a sample to its corresponding population in Figure 1). External validity is concerned with how well results generalize to other contexts. Specifically, that the (internally valid) effect estimate is unbiased for the causal treatment effect in a different setting, such as a target population of interest (moving laterally between populations in Figure 1). External validity bias has also been referred to as sample selection bias (Heckman, 1979; Imai, King and Stuart, 2008; Moreno-Torres et al., 2012; Bareinboim, Tian and Pearl, 2014; Haneuse, 2016).

External validity bias arises from differences between the study and target populations in (1) subject characteristics; (2) setting, such as geography or type of health center; (3) treatment, such as timing, dosage, or staff training; and (4) outcomes, such as length of follow-up or timing of measurements (Cronbach and Shapiro, 1982; Rothwell, 2005; Dekkers et al., 2010; Green and Glasgow, 2006; Burchett, Umoquit and Dobrow, 2011; Attanasio, Meghir and Szekely, 2003). The focus of most generalizability and transportability methods is on addressing differences in subject characteristics. Hence, these methods assume the remaining threats to external validity are not present in the data sources they are looking to generalize across. Namely, external validity bias then arises solely from: (1) variation in the probability of enrollment in the study, (2) heterogeneity in treatment effects, and (3) the correlation between (1) and (2) (Olsen et al., 2013).
We therefore distinguish between factors differentiating the target population from the study population (external validity bias) and those that create differences between treatment groups (internal validity bias), e.g., confounding. RCTs are frequently performed in a nonrepresentative subset of the target population and may have imperfect follow-up (challenging their external validity) and may have baseline imbalances (leading to internal validity bias). Observational studies may be susceptible to unmeasured confounding (threatening their internal validity), but may be more representative of the target population (hence having better external validity). Lack of representation in an RCT can lead to external validity bias that is larger than the internal validity bias of an observational study (Bell et al., 2016).

The optimal solution to external validity bias centers on study design, which we review briefly here, but do not cover extensively. One type of ideal study would randomly sample subjects from the target population and then randomly assign treatment to the selected individuals. However, this is usually infeasible. Alternative study designs for improving study generalizability and transportability include purposive sampling, where investigators deliberately select individuals such as for representation or heterogeneity (Shadish, Cook and Campbell, 2001; Allcott and Mullainathan, 2012); pragmatic or practical clinical trials, which aim to be representative of clinical practice (Schwartz and Lellouch, 1967; Ford and Norrie, 2016); stratified selection based on effect modifiers or propensity scores for selection (Tipton et al., 2014; Tipton, 2013a; Allcott and Mullainathan, 2012); and balanced sampling designs for site selection that select representative sites through stratified ranked sampling (Tipton and Peck, 2017).
In lieu of or in addition to study designs that address external validity bias, generalizability and transportability methods can improve the external validity
Figure 2. Overview framework for assessing and addressing external validity bias after data collection. Panels: Estimand (consider study and target populations, and with them, the estimand of interest); Assumptions (assess validity of assumptions necessary for generalizability or transportability approaches); Evaluating Generalizability (examine whether treatment effect modification exists and whether effect modifiers differ in distribution between study and target populations); Generalizability and Transportability Methods (apply methods for addressing external validity bias).

of effect estimates after data collection. This manuscript provides a review of generalizability and transportability research, synthesizing across the statistics, epidemiology, computer science, and economics literature in a more complete manner than has been done to date. Existing review literature has examined narrower subsets of the topic: generalizing or transporting to a target population from only RCT data (Stuart, Bradshaw and Leaf, 2015; Stuart, Ackerman and Westreich, 2018; Kern et al., 2016; Tipton and Olsen, 2018; Ackerman et al., 2019), identifiability rather than estimation (Bareinboim and Pearl, 2016), or meta-analysis approaches for combining summary-level information (Verde and Ohmann, 2015; Kaizar, 2015). A recent related review on combining randomized and observational data featured a simulation, real data analysis, and software guide (Colnet et al., 2020). However, these previous reviews have not summarized the full range of generalizability and transportability methods that incorporate data from randomized, observational, or a combination of randomized and observational studies, nor techniques for evaluating generalizability, as we do here. Additionally, although the importance of describing generalizability and transportability is recognized by different trial reporting guidelines (e.g., CONSORT, RECORD, STROBE), they provide no clear guidance on tests or estimation procedures (Schulz, Altman and Moher, 2010; Benchimol et al., 2015; von Elm et al., 2008).
We also contribute recommendations for methodologists and applied researchers.

The remainder of the article synthesizes considerations for assessing and addressing external validity bias after data collection (presented as a framework in Figure 2) and is organized as follows. Section 2 defines the estimand of interest, the average treatment effect in a target population, as well as alternatives. Section 3 presents key assumptions underlying many of the methods. Section 4 reviews methods for assessing treatment effect heterogeneity, thus further motivating the need for methods that enable generalizing or transporting study results to a target population. Section 5 then summarizes the analytic methods available for external validity bias correction that generate treatment effect estimates for a target population of interest. These techniques include weighting and matching, outcome regressions, and doubly robust approaches. Section 6 then concludes with guidance for both applied and methods researchers.
2. ESTIMAND
Assume, for one or more studies, the existence of outcome Y, treatment A ∈ {0, 1}, and baseline covariates X ∈ R^d. For simplicity of notation, we define X to represent all treatment effect confounders and effect modifiers (subgroups whose effects are expected to differ) that differ between study and target populations; each variable in X is both a confounder and an effect modifier. Without loss of generality, we focus on the single study setting, with S = 1 indicating selection into it. The observational unit for the study sample is O_study = {X, A, Y, S = 1}. O_study has probability distribution P_study ∈ M_study, where M_study is our collection of possible probability distributions (i.e., statistical model). We observe n_s realizations of O_study, indexed by j. The observational unit for a representative sample from the target population is given by O = {X, A, Y, S} ∼ P ∈ M. We observe n realizations of O, indexed by i. Target sample subjects who do not appear in the study sample will have S = 0. We use the terminology "selected" or "sampled" throughout the paper for simplicity although for transportability, subjects are not directly sampled into the study from the target population. For generalizability, O_study ∈ O, while for transportability, the two are disjoint sets, O_study ∉ O.

Biases are defined with respect to an estimand. We will focus on the average treatment effect in a well-defined target population of interest: the population average treatment effect (PATE). Namely, we are interested in the average outcome had everyone in the target population been assigned to treatment A = 1 compared to the outcome had everyone been assigned to treatment
A = 0. We write this as τ = E_X(E(Y | S = 1, A = 1, X) − E(Y | S = 1, A = 0, X)) = E(Y^1 − Y^0), where Y^1 and Y^0 are the potential outcomes under treatment and no treatment, respectively, and required identifiability assumptions are delineated in the next section. The corresponding estimator is given by τ̂ = 1/n Σ_{i=1}^n (Ŷ_i^1 − Ŷ_i^0). We also write Y^a to represent the potential outcome under a, with lowercase a a specific value for random variable A. Potential outcomes are either explicitly assumed in the potential outcomes framework or a consequence of the structural causal model (Rubin, 1974; Pearl, 2000). Different target populations correspond to alternative PATEs because the expectation is taken with respect to alternative distributions of covariates X. However, necessarily, we only observe outcomes in the study sample. A study therefore directly estimates the sample average treatment effect (SATE): τ_s = E(Y^1 − Y^0 | S = 1) with estimator τ̂_s = 1/n_s Σ_{j: S_j = 1} (Ŷ_j^1 − Ŷ_j^0).

When the distributions of treatment effect modifiers differ between study and target populations, the true study average effect will not equal the true target population average effect (SATE ≠ PATE) due to external validity bias. Sampling variability as well as internal validity biases can also drive estimates of the SATE further from the truth (Figure 3). Biases may differ in magnitude and may make the SATE either larger or smaller than the PATE.

We may also be interested in estimating other target parameters. For example, the population conditional average treatment effect (PCATE), τ_x = E(Y^1 − Y^0 | X), is examined in some of the estimation methods we explore later. Another parameter of interest is the population average treatment effect among the treated:
Figure 3. Illustrative example of the difference between target population and sample average treatment effects (PATE and SATE). Biases may differ in magnitude and may make the SATE either larger or smaller than the PATE.

τ_t = E(Y^1 − Y^0 | A = 1). Similar generalizability and transportability considerations presented in the following sections will apply for these and other causal estimands.
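As a concrete illustration of how SATE ≠ PATE arises, the following simulation (our own sketch; the data-generating values and names are illustrative, not from the paper) draws a target population with an effect modifier X, sets the individual treatment effect to 1 + X so the true PATE is 1, and enrolls individuals with larger X more often. The study then over-represents large treatment effects, and the SATE exceeds the PATE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target population with effect modifier X ~ N(0, 1).
# Individual treatment effect is 1 + X, so the true PATE is E[1 + X] = 1.
n = 100_000
X = rng.normal(0.0, 1.0, n)
tau_i = 1.0 + X

# Nonrandom study selection: individuals with larger X enroll more often,
# so the study sample over-represents those with larger treatment effects.
p_select = 1.0 / (1.0 + np.exp(-(X - 1.0)))
S = rng.binomial(1, p_select)

pate = tau_i.mean()          # average effect in the target population
sate = tau_i[S == 1].mean()  # average effect among the selected: biased upward
print(f"PATE = {pate:.2f}, SATE = {sate:.2f}")
```

With selection probability increasing in X, the study mean of X is positive, so the SATE lands well above 1; the gap is exactly the external validity bias described above.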
3. ASSUMPTIONS
Under the potential outcomes framework, the assumptions below are sufficient to identify the PATE using the observed study data. A corresponding set of assumptions under the structural equation model (SEM) framework has also been derived (Pearl and Bareinboim, 2014; Pearl, 2015; Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2014; Bareinboim and Tian, 2015; Bareinboim and Pearl, 2016; Correa, Tian and Bareinboim, 2018). Additional assumptions include those of no missing data or measurement error in outcome, treatment, or covariate measurements. Other target parameters of interest necessitate a similar set of assumptions.

Sufficient assumptions for identifying the PATE with respect to internal validity:

Conditional treatment exchangeability: Y^a ⊥ A | X, S = 1 for all a ∈ A, the set of all possible treatments. This condition requires no unmeasured confounding of the treatment-outcome relationship in the study. It is satisfied by perfectly randomized trials (e.g., no loss to follow-up, other informative missingness or censoring, etc.) and by observational studies that have all confounders measured. While this condition is sufficient, it is not always necessary. When estimating the PATE, it can be replaced by the weaker condition of mean conditional exchangeability of the treatment effect, E(Y^1 − Y^0 | X, A, S = 1) = E(Y^1 − Y^0 | X, S = 1) (Kern et al., 2016; Dahabreh et al., 2019a).
Positivity of treatment assignment: P(X = x | S = 1) > 0 ⇒ P(A = a | X = x, S = 1) > 0, with probability 1 for all a ∈ A. This condition entails that each subject in the study has a positive probability of receiving each version of the treatment. In combination with the conditional treatment exchangeability assumption above, this assumption is also known as strongly ignorable treatment assignment (Varadhan, Henderson and Weiss, 2016).

Stable unit treatment value assumption (SUTVA): if A = a then Y = Y^a. This assumption requires no interference between subjects and treatment version irrelevance (i.e., consistency/well-defined interventions) in the study and target populations (Dahabreh et al., 2017; Kallus, Puli and Shalit, 2018).

Following the assumptions above, identifying the PATE involves a parallel set of assumptions for external validity:
Conditional exchangeability for study selection: Y^a ⊥ S | X for all a ∈ A. This assumption is also known as exchangeability over selection and the generalizability assumption. It requires that the outcomes among individuals with the same treatment and covariate values in the study and target populations are the same (Stuart et al., 2011). All effect modifiers that differ between study and target populations must therefore be measured. This assumption would be satisfied by a study sample that is a random sample from the target population or a non-probability study sample in which all effect modifiers are measured. A weaker condition, mean conditional exchangeability of selection, E(Y^1 − Y^0 | X, S = 1) = E(Y^1 − Y^0 | X), can replace conditional exchangeability for study selection when focusing on the PATE (Kern et al., 2016; Dahabreh et al., 2019a).

Positivity of selection: P(X = x) > 0 ⇒ P(S = 1 | X = x) > 0, with probability 1. This assumption requires common support with respect to study selection; in every stratum of effect modifiers, there is a positive probability of being in the study sample (Dahabreh et al., 2017). This can be replaced by smoothing assumptions under a parametric model, for example, that the propensity score distribution has sufficient overlap or common support between the study sample and target population (Westreich et al., 2017; Tipton et al., 2017). Thus, with conditional positivity of selection we assume that all members of the target population are represented by individuals in the study. The positivity assumption in combination with the no unmeasured effect modification assumption above is also known as strongly ignorable sample selection given the observed covariates (Chan, 2017).

SUTVA for study selection: if S = s (and A = a) then Y = Y^a.
This assumption states that there is no interference between subjects selected into the study versus those not selected and that there is treatment version irrelevance between study and target samples (the same treatment is given to both) (Tipton, 2013b; Tipton et al., 2017). It necessitates no difference across study and target samples in how outcomes are measured or in how the intervention is applied, that there is a common data-generating function for the outcome across individuals in the study and target populations (i.e., that being in the study does not change treatment effects), and that the potential outcomes are not a function of the proportion of individuals selected for the study. Treatment version irrelevance in SUTVA can be replaced by the condition of having the same distribution of treatment versions between study and target populations when estimating the PATE (Lesko et al., 2017).

Similar internal and external validity assumptions are needed for transportability, with the following modifications. When the study sample is a subset of the target population (generalizability), the positivity assumption for selection will need the propensity for selection to be bounded away from 0, whereas when the sample is not a subset of the target population (transportability), the propensity to be in the target population will need to be bounded away from 0 and 1
(Tipton, 2013b). Furthermore, for transportability, the set of covariates, X, required for conditional exchangeability for study selection cannot include those that separate the study sample from the target population (e.g., hospital type if transporting results from teaching hospitals to community clinics, or geographic location if transporting between states) (Tipton, 2013b). Further distinctions are discussed by Pearl (2015) using the SEM framework. Under this framework, Pearl and Bareinboim formalize the assumptions necessary for using different transport formulas to reweight randomized data, providing graphical conditions for identifiability as well as transport formulas for randomized studies (Pearl and Bareinboim, 2014; Pearl, 2015), observational studies (Pearl and Bareinboim, 2011; Pearl, 2015; Bareinboim and Tian, 2015; Bareinboim and Pearl, 2016; Correa and Bareinboim, 2017; Correa, Tian and Bareinboim, 2018), and a combination of heterogeneous studies (Bareinboim and Pearl, 2014, 2016).
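Under the assumptions above, a standard identification argument writes the target mean potential outcome in terms of observed study data; the sketch below is consistent with the PATE expression in Section 2, though we do not claim it reproduces any particular source's derivation step for step:

```latex
\begin{align*}
E(Y^a) &= E_X\{E(Y^a \mid X)\}
  && \text{(iterated expectation over the target distribution of } X\text{)}\\
&= E_X\{E(Y^a \mid X, S = 1)\}
  && \text{(exchangeability and positivity of selection)}\\
&= E_X\{E(Y^a \mid X, A = a, S = 1)\}
  && \text{(treatment exchangeability and positivity)}\\
&= E_X\{E(Y \mid X, A = a, S = 1)\}
  && \text{(SUTVA/consistency)}
\end{align*}
```

Differencing the a = 1 and a = 0 expressions recovers τ = E_X(E(Y | S = 1, A = 1, X) − E(Y | S = 1, A = 0, X)), the PATE formula from Section 2.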
4. ASSESSING DISSIMILARITY BETWEEN TARGET AND STUDY POPULATIONS AND TESTING FOR TREATMENT EFFECT HETEROGENEITY
Numerous quantitative approaches can help evaluate the extent to which study results may be expected to generalize to the target population. These assessments examine population differences and whether treatment effect heterogeneity exists. Methods for assessing the similarity of study and target populations can broadly be categorized into those that compare baseline patient characteristics and those that compare outcomes for groups on the same treatment. For the former, many make use of the propensity score for selection, which also serves the purpose of assessing the extent to which propensity score adjustment using measured covariates can sufficiently remove baseline differences between study and target samples. However, most of these methods do not emphasize effect modifiers, hence should be combined with an assessment of whether the noted population differences correspond to heterogeneity of treatment effects. To test for heterogeneity of effects, one must first identify effect modifiers. Effect modifiers are often pre-specified by the investigator, but data-driven approaches exist as well, and will be discussed in this section.

When summary-level study data are available, assessments that examine differences in univariate covariate metrics between study and target samples can be deployed. Cahan, Cahan and Cimino (2017) propose a generalization score for evaluating clinical trials that incorporates baseline patient characteristics, the trial setting, protocol, and patient selection: it takes ratios of the mean or median values of these characteristics in the study and target samples, then averages across categories for an overall score. However, this approach does not account for any measures of dispersion, which may reflect exclusion of more heterogeneous individuals from the study. When only baseline patient characteristics are responsible for relevant study vs.
target population differences, one can perform multiplicity-adjusted univariate tests for differences in effect modifiers between study and target samples (Greenhouse et al., 2008). Alternatively, one could examine absolute standardized mean differences (SMD) for each covariate, (X̄_study − X̄)/σ_X̄, where X̄_study and X̄ are the means of baseline covariates in the study and target samples, respectively, and σ_X̄ is the standard deviation of X̄ (Tipton et al., 2017). High values indicate heavy extrapolation and reliance on correct model specification; in smaller samples, imbalances will often occur by chance (Tipton et al., 2017). With one or more RCTs, generalizability across categorical eligibility criteria can be assessed by the percent of the target sample that would have been eligible for the study or set of studies (Weng et al., 2014; He et al., 2016; Sen et al., 2016).

Joint distributions of patient characteristics can likewise be compared, such as by examining the SMD in propensity scores for selection (Stuart et al., 2011). When the propensity score is not symmetrically distributed, summarizing mean differences is insufficient. Tipton (2014) developed a generalizability index that bins propensity scores and is bounded between 0 and 1: Σ_{j=1}^k √(w_pj w_sj), with j = 1, ..., k bins, each with target sample proportions w_pj and study sample proportions w_sj. It is based on the distributions of propensity scores rather than only the averages. However, this approach requires patient-level study and target sample data. A low generalizability index score indicates substantial dissimilarity between study and target samples; related metrics, such as the C statistic, largely focus on comparing cumulative densities (Tipton, 2014; Ding, Feller and Miratrix, 2016).
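The SMD and the binned generalizability index just described take only a few lines to compute. The sketch below is illustrative, with our own simulated selection propensity scores and an assumed bin count of k = 10 (the function names are ours, not from the paper):

```python
import numpy as np

def abs_smd(x_study, x_target):
    """Absolute standardized mean difference for a single covariate."""
    return abs(x_study.mean() - x_target.mean()) / x_target.std(ddof=1)

def generalizability_index(ps_study, ps_target, k=10):
    """Binned index sum_j sqrt(w_pj * w_sj): 1 for identical propensity
    score distributions, near 0 for little overlap (Tipton, 2014)."""
    edges = np.linspace(0.0, 1.0, k + 1)
    w_s = np.histogram(ps_study, bins=edges)[0] / len(ps_study)
    w_p = np.histogram(ps_target, bins=edges)[0] / len(ps_target)
    return float(np.sum(np.sqrt(w_p * w_s)))

# Toy example: the study over-represents individuals with high selection scores.
rng = np.random.default_rng(1)
ps_target = rng.beta(2, 5, 2000)
ps_study = rng.beta(5, 2, 500)
print(generalizability_index(ps_study, ps_study))   # identical distributions: 1.0
print(generalizability_index(ps_study, ps_target))  # dissimilar: well below 1
```

Because the bin proportions are nonnegative and sum to one, comparing a sample with itself always yields exactly 1, which makes the index easy to sanity-check.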
To assess the degree of extrapolation with respect to effect modifiers, one can examine overlap in the propensity of selection distributions, such as the proportion of target sample individuals with propensity scores outside the 5th and 95th percentiles of the sample propensity scores (Tipton et al., 2017).

One can also adopt a machine learning approach for detecting covariate shift, a change in the distribution of covariates between training and test data (here, the study and target data) (Glauner et al., 2017). After creating a joint dataset with target and study sample data, a classification algorithm predicts whether the data came from the study. A dissimilarity metric surpassing a threshold of acceptability then indicates sizable dissimilarity between datasets. However, an inability to accurately predict study vs. target data origin does not rule out differences in effect modifiers. A low score might furthermore indicate an incorrect model specification or insufficient model tuning.

The tests discussed in this subsection assess differences between populations; however, they require investigator knowledge of which characteristics moderate the treatment effect (or are correlated with unmeasured effect modifiers) and what level of differences are clinically relevant. Many covariates are often tested or included in a propensity score regression for study selection. This approach prioritizes predictors that are strongly associated with study selection rather than those that exhibit strong effect modification. Investigators should therefore aim to identify relevant effect modifiers for testing or inclusion in the propensity score regression and test this subset.

When individual-level outcome data or joint distributions of group-level outcome data are available in both the study and target samples for at least one
of the treatment groups, the following methods can assess the extent to which measured effect modifiers account for population differences. One can compare the observed outcomes in the target sample to predicted outcomes using study controls (Stuart et al., 2011), or more generally, study individuals who received the same treatment (Hotz, Imbens and Mortimer, 2005): 1/n_a Σ_{i=1}^N 1(A_i = a) Y_i vs. 1/n_{s,a} Σ_{i: S_i = 1} 1(A_i = a) w_i Y_i, with weights w_i defined by weighting and matching methods discussed in Section 5.1. Hartman et al. (2015) formalize this comparison with equivalence tests. Alternatively, conditional outcomes for study and non-study target sample individuals receiving the same treatment, conditioning on measured effect modifiers, can be compared to detect unmeasured effect modification, although other identifiability assumption violations might also be at fault: E(Y | X, A = a, S = 1) vs. E(Y | X, A = a, S = 0). Possible tests include analysis of covariance, Mantel-Haenszel, U-statistic based tests, stratified log-rank, or stratified rank sum, depending on the outcome (Marcus, 1997; Hotz, Imbens and Mortimer, 2005; Luedtke, Carone and van der Laan, 2019). For example, study controls could be compared to subgroups of the target population that were known to be excluded from the study (e.g., patients who declined participation in an RCT, as done by Davis (1988)). Relatedly, unmeasured effect modification can be imperfectly tested for by disaggregating a characteristic that differentiates the study from the target sample (Allcott and Mullainathan, 2012). These outcome differences should not exceed those observed between study treatment groups (Begg, 1992).

In addition to testing for outcome differences, one can test for differences between study and target regression coefficients or between baseline hazards in a Cox regression (Pan and Schaubel, 2009).
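The benchmark comparison described above, a target-sample mean outcome under treatment a versus a selection-weighted study mean for the same arm, can be sketched as follows. This is our own illustrative simulation: the selection model is known here and the weights are simple inverse odds of selection, whereas in practice the selection propensity would be estimated and other weighting or matching schemes from Section 5.1 could be substituted:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated target sample; study enrollment depends on effect modifier X,
# so unweighted study means are biased for the non-study target group.
n = 20_000
X = rng.normal(0.0, 1.0, n)
p_s = 1.0 / (1.0 + np.exp(-(X - 0.5)))   # true selection propensity
S = rng.binomial(1, p_s)
A = rng.binomial(1, 0.5, n)              # randomized treatment
Y = 2.0 * X + A * (1.0 + X) + rng.normal(0.0, 1.0, n)

# Inverse odds of selection reweight study units toward the non-study group.
w = (1.0 - p_s) / p_s

a = 0                                    # compare the control arms
in_target = (S == 0) & (A == a)
in_study = (S == 1) & (A == a)
target_mean = Y[in_target].mean()
study_raw = Y[in_study].mean()
study_wtd = np.average(Y[in_study], weights=w[in_study])
print(f"target {target_mean:.2f}  study raw {study_raw:.2f}  weighted {study_wtd:.2f}")
```

When the weighted study mean remains far from the target mean, unmeasured effect modification or a misspecified weighting model is implicated, which is exactly what the equivalence tests of Hartman et al. (2015) formalize.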
Any identified differences in outcomes or effects will reflect sample differences unaccounted for by the outcome or weighting method, indicating unmeasured effect modification or an ineffective modeling approach. To have this comparison reflect relevant differences, study controls must be representative of the target population after weighting or regression adjustment. Hartman et al. (2015) provides a more formal set of identifiability assumptions that may be violated when each equivalence test is rejected. If unmeasured effect modification is suspected, one can perform sensitivity analysis to assess the extent to which it can impact results (Marcus, 1997; Nguyen et al., 2017, 2018; Dahabreh et al., 2019b; Andrews and Oster, 2017) or to generate bounds on the treatment effect when only partial identification is possible (Chan, 2017).

Testing for treatment effect heterogeneity

Identified population differences are relevant insofar as they correspond to differences in treatment effect modifiers. The following tests enable an investigator to assess whether treatment effects vary substantially across measured covariates. Many are suitable for use in observational or RCT data, although have largely been demonstrated in RCT data to date. While some tests require a priori specification of subgroups, others can discover them in data-driven ways, and most require individual-level data. A straightforward but often overlooked issue is that studies with enrolled patients that are homogeneous with respect to effect modifiers will have difficulty identifying heterogeneity of effects. These approaches are therefore best applied to data representative of the target populations (Gunter, Zhu and Murphy, 2011).

Tests of prespecified subgroups should focus on target population subgroups under- or over-represented in the study, or any other clinically relevant subgroup expected to exhibit effect heterogeneity.
Largely, methods for testing treatment effect heterogeneity of a priori specified subgroups exhibit limited power. Those testing several effect modifiers individually are particularly underpowered to detect significant effects once multiple testing adjustments are incorporated. One approach tests the interaction term of treatment assignment with an effect modifier in a linear model, which also requires modeling assumptions as to the linearity and additivity of effects (Fang, 2017; Gabler et al., 2009). To address this lack of power, sequential tests for identifying treatment-covariate interactions can be used with either randomized or observational data (Qian, Chakraborty and Maiti, 2019). Alternative approaches, each addressing slightly different goals, include testing whether the conditional average treatment effect is identical across predefined subgroups (Crump et al., 2008; Green and Kern, 2012), comparing subgroup effects to average effects (Simon, 1982), and identifying qualitative interactions or treatment differences exceeding a prespecified clinically significant threshold (Gail and Simon, 1985).

When effect modifiers are not known a priori, a variety of techniques can be applied for identifying subgroups with heterogeneous effects. These include techniques that identify variables that qualitatively interact with treatment (i.e., for which the optimal treatment differs by subgroup) (Gunter, Zhu and Murphy, 2011) as well as those that determine the magnitude of interaction (Chen et al., 2017; Tian et al., 2014). Various machine learning approaches can also be used to identify subgroups with heterogeneous treatment effects while minimizing modeling assumptions. Approaches that also present tests for treatment effect differences between subgroups include Bayesian additive regression trees (BART) and other classification and regression tree (CART) variants (Su et al., 2008, 2009; Lipkovich et al., 2011; Green and Kern, 2012; Athey and Imbens, 2016).
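As a minimal sketch of the interaction-term test described above, the snippet below simulates data with a genuine treatment-covariate interaction and tests the interaction coefficient in an ordinary least squares fit; the variable names and effect sizes are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"a": rng.integers(0, 2, n),   # treatment assignment
                   "x": rng.normal(size=n)})     # candidate effect modifier
# Outcome with a genuine treatment-covariate interaction (coefficient 0.8)
df["y"] = (1.0 + 0.5 * df["a"] + 0.3 * df["x"] + 0.8 * df["a"] * df["x"]
           + rng.normal(scale=0.5, size=n))

fit = smf.ols("y ~ a * x", data=df).fit()
interaction_p = fit.pvalues["a:x"]  # small p-value flags effect modification
```

With several candidate modifiers, the resulting p-values would need a multiple-testing adjustment, as noted above.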
Tree-based methods recursively partition the covariate space, growing toward terminal nodes with homogeneous outcomes. These approaches may be particularly useful when heterogeneity is a function of a more complex combination of factors.

With many effect modifiers or when effect modifiers are unknown, global tests for heterogeneity can also be used. Pearl (2015) provides conditions for identifying treatment effect heterogeneity (including heterogeneity due to unmeasured effect modifiers) for randomized trials with binary treatments, situations with no unobserved confounders, and settings with mediating instruments. Effect heterogeneity can be tested for using the baseline risk of the outcome as an effect modifier; interaction-based tests assess for differences in baseline risk between study and target population control groups (Varadhan, Henderson and Weiss, 2016; Weiss, Segal and Varadhan, 2012). These tests avoid the need for multiple testing but require outcome data in the target sample and modeling assumptions. A consistent nonparametric test also exists that assesses for constant conditional average treatment effects, $\tau_x = \tau \; \forall x \in \mathcal{X}$ (Crump et al., 2008). Additional methods, which suffer from limited power and rely on estimates of the SATE, include testing whether potential outcomes across treatment groups have equal variances and whether cumulative distribution functions of treatment and control outcomes differ by a constant shift (Fang, 2017). Global tests do not identify subgroups responsible for effect heterogeneity, although if a global test is rejected, one can then compare individual subgroups to determine which demonstrate effect heterogeneity.

If these assessments of generalizability fail and the target population is not well-represented by the study population (specifically, when strong ignorability fails), Tipton (2013b) provides several recommended paths forward.
Investigators can change the target population to one represented by the study; that is, change the estimand of interest by aligning inclusion and exclusion criteria, outcome timepoints, or treatment doses (Hernán et al., 2008; Weisberg, Hayden and Pontes, 2009). A population coverage percentage can then summarize the percent overlap between the new and original target sample propensity scores and describe relevant differences from the original target population. Investigators can alternatively retain the original target population and note the limitations of extrapolated results and the likelihood of remnant bias. It is also important to acknowledge that a different study may need to be conducted.
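One simple way to operationalize such a coverage summary is the share of original target sample propensity scores lying within the support of the redefined target's scores. This is our illustrative construction; the cited literature may define the summary differently.

```python
import numpy as np

def coverage_percentage(ps_original, ps_new):
    """Percent of original target sample individuals whose estimated
    selection propensity scores fall within the min-max range of the
    redefined target sample's scores (hypothetical helper)."""
    ps_original, ps_new = np.asarray(ps_original), np.asarray(ps_new)
    lo, hi = ps_new.min(), ps_new.max()
    return 100.0 * np.mean((ps_original >= lo) & (ps_original <= hi))
```

A low value signals that the redefined estimand discards much of the original target population.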
5. GENERALIZABILITY AND TRANSPORTABILITY METHODS FOR ESTIMATING POPULATION AVERAGE TREATMENT EFFECTS
Following the application of the methods in the previous sections, including assessing the plausibility of relevant assumptions, an analytic method is typically needed to generalize or transport results from randomized or observational data to a target population. These approaches have many parallels to those used to address internal validity bias. We revisit weighting and matching-based methods and outcome regressions in depth while additionally examining techniques that use both propensity and outcome regressions (these are often doubly robust). To mitigate external validity bias, generalizability and transportability methods address differences in the distribution of effect modifiers between study and target populations. To do so, weighting and matching-based approaches account for the probability of selection into the study rather than the probability of treatment assignment. Outcome regressions require that the treatment effect be allowed to vary across all effect modifiers, in addition to all confounders being correctly included in the regression.

Most generalizability and transportability methods have been developed for randomized data. When outcome data are available from both randomized studies and an observational study representative of the target population, their combination has the potential to overcome sensitivity to positivity violations for selection into the study (an issue that RCT data commonly face) as well as to unmeasured confounding (which may afflict observational studies). Incorporating observational data in a principled manner can also shrink mean squared error. However, many such approaches do not leverage the internal validity of RCT data; the following sections will highlight some exceptions. While most approaches require individual-level study and target sample data, the Appendix highlights approaches that only use summary-level data for either the study or target sample.
Methods that adjust for differing baseline covariate distributions between study and target samples via weighting or matching are particularly effective when effect modifiers strongly predict selection into the study. While including unnecessary covariates can decrease precision, increase the chance of extreme weights and difficult-to-match subjects, and provide no bias reduction (Nie et al., 2013), failing to include an effect modifier is typically of greater concern than including unnecessary covariates (Stuart, 2010; Dahabreh et al., 2018). Matching and reweighting methods rely strongly on common covariate support between study and target populations and perform poorly when a portion of the target population is not well-represented in the study sample or when empirical positivity violations occur. Investigators should use the estimation approach that leads to the best effect modifier balance for their study (Stuart, 2010) and strive for fewer assumptions.

Full matching and fine balance of covariate first moments (i.e., expected values) have been used in the generalizability context (Stuart et al., 2011; Bennett, Vielma and Zubizarreta, 2020). Stuart et al. (2011) fully match study and target sample individuals based on their propensity scores to form sets so that each matched set has at least one study and one target individual. Individuals' outcomes are then reweighted by the number of target sample individuals in their matched set. This approach relies heavily on the distance metric used, which can be misled by covariates that do not affect the outcome. Fine balance of covariate first moments is a nonparametric approach for larger data that can also be used with multi-valued treatments (Bennett, Vielma and Zubizarreta, 2020).
This approach matches samples to a target population to achieve fine balance on the first moments of all covariates rather than working with the propensity score.

Some implementations of these methods only match a subset of study individuals (hence showing areas of the covariate distribution without common support), while others ensure all study and target sample individuals are matched. Matching methods require calibration of the bias-variance tradeoff, such as via a caliper or by choosing the ratio of study to target individuals to match. A variety of distance metrics exist; however, none specifically target effect modifiers. With unrepresentative observational data, treatment groups can first be matched based on confounding variables before matching study pairs to the target sample based on effect modifiers, or each treatment group can be separately matched to the target sample (Bennett, Vielma and Zubizarreta, 2020).

In a low-dimensional setting with categorical or binary covariates, one can use nonparametric post-stratification (also known as direct adjustment or subclassification), as has been done in the literature with randomized data (Miettinen, 1972; Prentice et al., 2005) and with observational data in the context of instrumental variables (Angrist and Fernández-Val, 2013). Post-stratification consists of obtaining estimates for each stratum of effect modifiers, then reweighting these estimates to reflect the effect modifier distribution in the target population, i.e., $\hat{E}(Y^a) = \frac{1}{n}\sum_{l=1}^{L} n_l \bar{Y}_{al}$, where $L$ is the number of strata, $n_l$ is the target sample size in stratum $l$, $n = \sum_{l=1}^{L} n_l$, and $\bar{Y}_{al}$ is an estimate from
study sample data of the potential outcome on treatment $a$ in stratum $l$, commonly the stratum-specific sample mean for subjects on treatment $a$ (Miettinen, 1972; Prentice et al., 2005).

Post-stratification only requires stratum-specific summary data, and closed-form variance formulas are often available. However, empty strata quickly become an issue when dealing with continuous variables or many stratifying variables. Conversely, if insufficient strata are used, residual external validity bias will remain, which is particularly problematic in small samples (Tipton et al., 2017). To combat this, inference can be pooled across strata using multilevel regression with post-stratification (Pool, Abelson and Popkin, 1964; Gelman and Little, 1997; Park, Gelman and Bafumi, 2004; Kennedy and Gelman, 2019).

For higher dimensional settings or with continuous covariates, more flexible nonparametric approaches can be applied, such as maximum entropy weighting, where study strata are reweighted to the distribution in the target sample (Hartman et al., 2015). When target and study populations differ on post-treatment variables such as adherence, principal stratification can be used to estimate PATEs by classifying subjects into never-taker, always-taker, and complier categories (Frangakis, 2009).

Estimating using the propensity for study selection.
Most weighting approaches use a propensity of selection regression to construct weights. They rely on correct specification of the propensity score regression and sufficient overlap in propensity scores between study subjects and target sample individuals not in the study. These approaches have the additional advantage of allowing one set of weights to be used for treatment effects related to multiple outcomes. The most straightforward weighting approaches tend to have large variances in the presence of extreme weights, give disproportionate weight to outlier observations, and produce outcome estimates outside the support of the outcome variable. Weight standardization can address these issues, as can weight trimming, although the latter induces bias by changing the target population of interest, hence requiring a careful bias-variance trade-off.

Inverse probability of participation weighting (IPPW), a Horvitz-Thompson-like approach (Horvitz and Thompson, 1952), is the most common weighting technique for generalizability (Flores and Mitnik, 2013; Baker et al., 2013; Lesko et al., 2017; Westreich et al., 2017; Correa, Tian and Bareinboim, 2018; Dahabreh et al., 2018, 2019a). Most simply, IPPW weights the outcome for each study individual on treatment $a$ by the inverse probability (propensity) of being in the study. Weights have been developed for estimating PATEs, including those that incorporate treatment assignment to account for covariate imbalances in an RCT or for confounding in an observational study.
The observed outcomes are reweighted to obtain the potential outcomes for each treatment group $a$: $\hat{E}(Y^a) = \frac{1}{n}\sum_{i=1}^{n} w_i Y_i$ with

$w_i = \frac{1}{\pi_{s,i}} I(S_i = 1) I(A_i = a)$ for random treatment assignment (Lesko et al., 2017),

$w_i = \frac{1}{\pi_{s,i}\,\pi_{a,i}} I(S_i = 1) I(A_i = a)$ more generally (Stuart et al., 2011; Dahabreh et al., 2019a),

where $I(S_i = 1)$ is the indicator for being in the study, $I(A_i = a)$ is the indicator for being assigned treatment $a$, $\pi_{s,i} = P(S_i = 1 \mid X_i)$ is the propensity score for selection into the study, and $\pi_{a,i} = P(A_i = a \mid S_i = 1, X_i)$ is the propensity score for assignment to treatment $a$ in the study.

Individual-level data are typically required, although one can also use joint covariate distributions from group-level data (Cole and Stuart, 2010) or univariate moments (e.g., means, variances) with additional assumptions (Signorovitch et al., 2010; Phillippo et al., 2018). Because IPPW only uses study individuals on a given treatment to estimate potential outcomes for that treatment, power can become an issue, particularly for multi-level treatments. These methods also perform poorly when study selection probabilities are small, which can be a common occurrence for generalizability (Tipton, 2013b). IPPW weights have also been developed for regression parameters in a generalized linear model (Haneuse et al., 2009), as well as for Cox model hazard ratios and baseline risks (Cole and Stuart, 2010; Pan and Schaubel, 2008).

For transportability to the target population $S = 0$, odds of participation weights are used rather than inverse probability of participation weights (Westreich et al., 2017; Dahabreh et al., 2018). This corresponds to the estimator $\hat{E}(Y^a \mid S = 0) = \frac{1}{n}\sum_{i=1}^{N} w_i Y_i$ with $N = n + n_s$ and weights (Dahabreh et al., 2018):

$w_i = \frac{1 - \pi_{s,i}}{\pi_{s,i}\,\pi_{a,i}} I(S_i = 1) I(A_i = a).$
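The IPPW estimators above can be sketched in a few lines, given estimated propensities for every individual. This is a minimal implementation with our own function and argument names, not code from the cited works.

```python
import numpy as np

def ippw_estimate(y, a, s, pi_s, pi_a, treatment, transport=False, normalize=True):
    """IPPW point estimate of a potential outcome mean (sketch).
    pi_s = P(S=1 | X) and pi_a = P(A=a | S=1, X) are estimated propensities
    for every individual. transport=True uses odds-of-participation weights
    targeting the non-study population S=0. normalize=True divides by the
    sum of the weights, giving the standardized estimator bounded by the
    range of the observed outcomes."""
    y, a, s, pi_s, pi_a = map(np.asarray, (y, a, s, pi_s, pi_a))
    ind = ((s == 1) & (a == treatment)).astype(float)
    w = ind / (pi_s * pi_a)
    if transport:
        w *= (1.0 - pi_s)
    if normalize:
        return np.sum(w * y) / np.sum(w)
    denom = np.sum(s == 0) if transport else len(y)
    return np.sum(w * y) / denom
```

For instance, with all propensities equal to 0.5, each study individual on the treatment of interest receives weight 4 under the general weights.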
To address potentially unbounded outcome estimates, standardization replaces $n$ by the sum of the weights, which normalizes the weights to sum to 1 (Dahabreh et al., 2018, 2019a). The resulting estimator will be more stable, bounded by the range of the observed outcomes, and perform better when the target sample is much larger than the study.

Under regularity conditions, estimates derived using IPPW are consistent and asymptotically normal (Lunceford and Davidian, 2004; Pan and Schaubel, 2008; Cole and Stuart, 2010; Correa, Tian and Bareinboim, 2018; Buchanan et al., 2018). Variance for the IPPW estimator can be obtained through either a bootstrap approach or robust sandwich estimators. The latter may be difficult to calculate (Haneuse et al., 2009), and bootstrap methods for IPPW have been shown to perform better when there is substantial treatment effect heterogeneity or smaller sample sizes (Chen and Kaizar, 2017; Tipton et al., 2017).

Propensity scores can also be used in the context of post-stratification, weighting or matching individuals within strata. RCT individuals are divided into strata defined by their propensity scores; quintiles are commonly used, based on results showing that this approach may remove over 90% of bias (O'Muircheartaigh and Hedges, 2014). Effects are estimated using sample data within each subgroup, such as through separate regressions or a joint parametric regression with fixed effects for subgroups and interaction terms for subgroups by RCT status. Results can then be reweighted based on the number of target sample individuals in each subgroup (O'Muircheartaigh and Hedges, 2014). Alternatively, the target sample can be matched to RCT individuals within the same propensity score stratum (Tipton, 2013b).

The post-stratification estimator is asymptotically normal, and closed-form variance estimates exist for independent strata (O'Muircheartaigh and Hedges, 2014; Lunceford and Davidian, 2004).
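A bare-bones version of this quintile-based post-stratification is sketched below; stratum boundaries and tie handling vary across implementations, and the names are ours.

```python
import numpy as np

def quintile_poststratification(y_study, a_study, ps_study, ps_target, treatment):
    """Post-stratification on quintiles of the estimated selection
    propensity score (sketch). Study-stratum mean outcomes on `treatment`
    are reweighted by the number of target sample individuals per stratum."""
    edges = np.quantile(np.concatenate([ps_study, ps_target]), [0.2, 0.4, 0.6, 0.8])
    strata_study = np.digitize(ps_study, edges)
    strata_target = np.digitize(ps_target, edges)
    estimate, n_target = 0.0, len(ps_target)
    for l in range(5):
        in_stratum = (strata_study == l) & (a_study == treatment)
        n_l = np.sum(strata_target == l)
        if n_l == 0:
            continue  # no target individuals in this stratum
        if not np.any(in_stratum):
            raise ValueError(f"no study units on treatment in stratum {l}")
        estimate += (n_l / n_target) * y_study[in_stratum].mean()
    return estimate
```

The `ValueError` branch corresponds to an empirical positivity violation: a stratum of the target sample with no comparable study units.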
Compared to IPPW, strata reweighting is more likely to be numerically stable and easily implementable when treatment assignment is done at the group level (e.g., cluster-randomized trials). However, stratification implicitly assumes that treatment effects are identical for study and target patients in the same stratum; this assumption is rarely met, resulting in residual confounding and inconsistent estimates (Lunceford and Davidian, 2004). It also relies on the assumptions that treatment effect heterogeneity is fully captured by the propensity score for treatment and that outcomes are continuous and bounded. With too few strata, bias reduction will be insufficient; conversely, too many strata can lead to small strata counts and unstable estimates (Stuart, 2010; Tipton et al., 2017).

Propensity strata approaches have also been used to address violations of positivity of treatment assignment within the target sample in the setting where outcome data are available from both a randomized and an observational study (Rosenman et al., 2018). Rosenman et al. (2020) present an extension that aims to adjust for potential unmeasured confounding bias.

Outcome regressions, also known as response surface modeling, have not been as extensively developed for generalizability and transportability compared to propensity-based approaches.
Broadly speaking, outcome regression approaches fit an outcome regression in study sample data to estimate conditional means, then obtain PATEs by marginalizing over (i.e., standardizing to) the target sample covariate distribution by predicting counterfactuals for the target sample: $\hat{E}(Y^a) = \frac{1}{n}\sum_{i=1}^{n} \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i)$. If the target sample is not a simple random sample from the target population, this would be a weighted average using sampling weights (Kim et al., 2018).

Outcome regression approaches are particularly effective when effect modifiers strongly predict the outcome and when the outcome is common but selection into the study is rare. They are also convenient for exploring PCATEs. These approaches can yield better precision than weighting or matching-based methods because they can adjust for confounders, effect modifiers, and factors only predictive of the outcome, thus decreasing variance in the estimate. They are simple to implement when an outcome regression for confounding adjustment has already been fit and accounts for all relevant effect modifiers; the same regression that was used to estimate impacts within the study can then be used to predict counterfactuals in the target sample. Outcome regression methods can be used with either randomized or observational study data, but have been used most frequently with RCTs. In the presence of significant non-overlap between the target and study samples, outcome regressions rely on heavy extrapolation (Kern et al., 2016; Attanasio, Meghir and Szekely, 2003), often with no corresponding inflation of the variance to reflect uncertainty in the resulting estimates.

The simplest approach is an ordinary least squares outcome regression (Flores and Mitnik, 2013; Kern et al., 2016; Elliott and Valliant, 2017; Dahabreh et al., 2018, 2019a).
An outcome regression is fit with interaction terms between treatment and all effect modifiers before predicting counterfactual outcomes for the target sample (the marginalization step). Dahabreh et al. (2018) showed the consistency of this type of outcome regression for the PATE. For RCTs, separate regressions are recommended for each treatment group to better capture treatment effect heterogeneity (Dahabreh et al., 2019a), although this approach precludes borrowing information across treatment groups, which is possible with machine learning methods that discover treatment effect heterogeneity.

Among these machine learning techniques is BART, the most commonly used data-adaptive outcome regression approach for generalizability and transportability (Chipman, George and McCulloch, 2007, 2010; Kern et al., 2016; Hill, 2011). Tree-based methods, including BART, were briefly introduced in Section 4.3. BART models the outcome as a sum of trees with linear additive terms and a regularization prior. BART addresses external validity bias via its data-driven discovery of treatment effect heterogeneity, and strengths of the method include its ability to obtain confidence intervals from the posterior distribution (Hill, 2011; Green and Kern, 2012). However, BART credible intervals show undercoverage when the target population differs substantially from the RCT (Hill, 2011).

Data availability may challenge these outcome regression approaches. When the covariates in the target sample are not available in the study sample, or vice versa, but the SATE can be expected to be approximately unbiased for the PATE, the SATE estimates' credible intervals can be expanded to account for uncertainty in the target population covariate distribution (Hill, 2011).
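To make the marginalization step concrete, the sketch below fits an OLS outcome regression with a treatment-covariate interaction on simulated study data and standardizes its predictions to a covariate-shifted target sample; all names and data-generating values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
study = pd.DataFrame({"a": rng.integers(0, 2, 400), "x": rng.normal(size=400)})
# True model: the effect of a is 1 + x, so the PATE depends on the target's x
study["y"] = (1 + study["a"] + 0.5 * study["x"] + study["a"] * study["x"]
              + rng.normal(scale=0.3, size=400))
target = pd.DataFrame({"x": rng.normal(loc=1.0, size=300)})  # shifted modifier

fit = smf.ols("y ~ a * x", data=study).fit()
# Predict counterfactuals for the target sample under each treatment, then average
mu = {t: fit.predict(target.assign(a=t)).mean() for t in (0, 1)}
pate = mu[1] - mu[0]
```

Under this assumed data-generating model, the true target-population effect is $1 + E(X) \approx 2$, whereas the study-population effect is approximately 1.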
Here, we consider meta-analytic approaches for summary-level data as well as studies that combine individual-level data from more than one study (for example, one randomized and one observational study). Much of the literature has focused on meta-analytic techniques using summary-level study data and no target sample covariate information. This body of bias-adjusted meta-analysis methods largely does not explicitly define a target population for whom inference is desired, but rather relies on subjective investigator judgments of the levels of bias in each study, specified using bias functions or priors in a Bayesian framework. Eddy (1989) presents the first such approach, the confidence profile method for combining chains of evidence. Likelihoods are adjusted for different study designs' (investigator-specified) internal and external validity biases; uncertainty around these biases is incorporated through prior distributions. Various subsequent Bayesian hierarchical models have been developed, such as a 3-level model (Prevost, Abrams and Jones, 2000) with the levels corresponding to models of the observed evidence, variability between studies, and variability between study types (randomized vs. observational). When available, covariate information can be added to the models to address effect heterogeneity. Effectively, this estimator averages across the internal and external validity biases of the studies and therefore is only unbiased when the external validity bias in the RCT exactly 'cancels' the internal validity bias in the observational data (Kaizar, 2011).

Other meta-analysis studies leveraging summary-level data separately specify internal and external validity bias parameters for an explicit target population and down-weight studies with higher risk of bias. One such example is the bias-adjusted meta-analysis approach by Turner et al.
(2009), which presents a checklist that subjectively quantifies the extent of internal and external validity bias for each study and then weighs studies' average outcomes by the extent of bias. Greenland (2005) pools across observational case-control studies using a Bayesian meta-sensitivity model with bias parameters to separately permit consideration of misclassification, non-response, and unmeasured confounding. In the intermediate setting where individual-level data are available in the study but only covariate moments (e.g., means, variances) are available in the target setting, Phillippo et al. (2018) present an outcome regression approach for indirect treatment comparison across RCTs.

When individual-level outcome data are available in the target sample or from multiple studies, data can be combined into one joint dataset for outcome regression analysis if the outcome regression can be expected to be the same across studies (Kern et al., 2016). Such an approach can be preferable to IPPW, which uses only study and not target sample outcome data (Kern et al., 2016). However, it will be dominated by observational data results (and their potential biases) when observational subjects constitute the majority of the joint dataset, effectively resulting in a weighted average across studies, weighted by the proportion of subjects in each study.

Hierarchical Bayesian evidence synthesis is the only outcome regression approach we identified that attempts to empirically adjust for unobserved confounding when estimating effects for observational patients who are not well-represented in the RCTs (Verde et al., 2016; Verde, 2019).
Summary-level RCT data are combined with individual-level observational data through a weighting approach in which the control group event rate is assumed to be similar across all studies; a study quality bias term is added to the observational studies' outcome regression to account for unmeasured confounding or other uncontrolled biases and to inflate variance. Alternatively, Gechter (2015) derives bounds on the PATE and PCATE when transporting from an RCT to a target sample with outcome data (all untreated).

Doubly robust methods for generalizability and transportability typically combine outcome and propensity of selection regressions. They are asymptotically unbiased when at least one of these regression functions is consistently estimated, and asymptotically efficient if both are. However, if neither regression is estimated consistently, the mean squared error may be worse than using a propensity or outcome regression alone. Incorporating flexible modeling approaches can help mitigate regression misspecification. Three asymptotically locally efficient doubly robust approaches have been developed in randomized data: a targeted maximum likelihood estimator (TMLE) for instrumental variables (Rudolph and van der Laan, 2017), which is a semiparametric substitution estimator; the estimating equation-based augmented inverse probability of participation weighting (A-IPPW) estimator (Dahabreh et al., 2018, 2019a); and an augmented calibration weighting estimator that can also incorporate outcome information from the target sample when it is available (Dong et al., 2020).

The TMLE was developed for transportability in an encouragement design setting (i.e., an intervention focused on encouraging individuals in the treatment group to participate in the intervention) with instrumental variables (Rudolph and van der Laan, 2017) and has also been used for generalizability (Schmid et al., 2020).
Three different PATE estimators were developed: intent-to-treat, complier, and as-treated. All use an outcome regression to obtain an initial estimate, then adjust that estimate with a fluctuation function using a clever covariate $C$, which is derived from the efficient influence curve and incorporates the propensity of selection information in a bias reduction step. For example, for the intent-to-treat PATE, the fluctuation function takes the form logit$(\hat{E}(Y \mid S = 1, A, Z, X)) + \epsilon C$, where

$C = \frac{I(S = 1, A = a)}{P(A = a \mid S = 1, X)\, P(S = 1)} \cdot \frac{P(Z = z \mid S = 0, A = a, X)\, P(X \mid S = 0)}{P(Z = z \mid S = 1, A = a, X)\, P(X \mid S = 1)}$

and $Z$ corresponds to the intervention taken (whereas $A$ corresponds to the assigned intervention, as before). The approach allows outcome and propensity regressions to be flexibly fit, for example, using an ensemble of machine learning algorithms. Variances are calculated from the influence curve.

A-IPPW has been developed both for generalizing results to estimate PATEs for all trial-eligible individuals (Dahabreh et al., 2019a,c) and for transporting results to estimate PATEs for trial-eligible individuals not included in a trial (Dahabreh et al., 2018). Three doubly robust estimating equation-based estimators are presented: A-IPPW, A-IPPW with normalized weights that sum to 1 to ensure bounded estimates, and a weighted outcome regression estimator using participation weights. The non-normalized A-IPPW estimators are as follows, with $w_i$ the same as for IPPW:

$\frac{1}{n}\sum_{i=1}^{n} \left\{ w_i \{Y_i - \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i)\} + \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i) \right\}$ for generalizability,

$\frac{1}{n}\sum_{i=1}^{N} \left\{ w_i \{Y_i - \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i)\} + \{1 - I(S_i = 1)\} \hat{E}(Y_i \mid S_i = 1, A_i = a, X_i) \right\}$ for transportability.

Variance can be derived using empirical sandwich estimates or using a nonparametric bootstrap.
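The two estimating equations above translate directly into code. The sketch below takes precomputed outcome-regression predictions and IPPW weights as inputs (function names are ours), illustrating the doubly robust form.

```python
import numpy as np

def aippw_generalizability(y, s, x_pred, w):
    """Non-normalized A-IPPW estimate of a potential outcome mean for the
    whole population (sketch). `x_pred` holds outcome-regression predictions
    of E(Y | S=1, A=a, X) for every individual; `w` are IPPW weights,
    nonzero only for study individuals on treatment a. Doubly robust:
    consistent if either the weights or the predictions are correct."""
    y, s, x_pred, w = map(np.asarray, (y, s, x_pred, w))
    return np.sum(w * (y - x_pred) + x_pred) / len(y)

def aippw_transportability(y, s, x_pred, w):
    """Transport version targeting the non-study population S=0: the
    outcome-model term is averaged over non-study individuals only, and
    `w` are the odds-of-participation weights."""
    y, s, x_pred, w = map(np.asarray, (y, s, x_pred, w))
    return np.sum(w * (y - x_pred) + (s == 0) * x_pred) / np.sum(s == 0)
```

When the outcome-regression predictions are exactly correct, the weighted residual term vanishes and each estimator reduces to the corresponding outcome regression estimator.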
As these estimators are partial M-estimators, they can produce estimates outside bounds if the outcome regression is not well-chosen, and they may have multiple solutions.

Several other doubly robust estimators for transportability resemble the IPPW estimator, with sampling weights derived through alternative approaches that do not rely on propensity scores (Josey et al., 2020a,b; Dong et al., 2020). For example, the semiparametric and efficient augmented weighting estimator of Dong et al. (2020) calibrates the RCT covariate distribution to match that of the sampling-weighted target sample.

An alternative reweighted outcome regression method for observational data does not claim double robustness and draws from the unsupervised domain adaptation literature. In general, unsupervised domain adaptation methods aim to make predictions for a target sample (the "target domain") when outcomes are only observed in the study sample (the "source domain"). The approach of Johansson et al. (2018) is a regularized neural network estimator for PCATE parameters that jointly learns representations from the data and a reweighting function. Representational learning creates balance between the study and target covariate distributions, and between treated and control distributions, in a representational space so that predictors use information common across these distributions and focus on covariates predictive of the outcome. In this learned representational space, results are then reweighted to minimize an upper bound on the expected value of the loss function under the target covariate distribution. Propensity scores can also be used to reweight a likelihood function, as done by Nie et al. (2013) in an RCT setting for calibrating control outcomes from prior studies to the trial target sample. Similarly, Flores and Mitnik (2013) reweight an outcome regression to the target sample.
Several methods have combined randomized and observational data sources such that they retain the internal validity of the randomized data and the external validity of the target sample observational data. These approaches broadly rely on the assumption that the relationship between unmeasured confounders and potential outcomes is the same in the RCT as in the target sample, which is a weaker assumption than the no unmeasured confounding assumption required by most of the methods described thus far. One study combined individual-level data from several RCTs to transport results to the target sample, extending the A-IPPW estimator (as well as corresponding IPPW and outcome regression estimators) to the multi-study setting (Dahabreh et al., 2019d). The remainder of the section discusses approaches that combine randomized and observational data.

When differences in effect modifiers between the RCT and target population are known (e.g., by inclusion and exclusion criteria), cross-design synthesis meta-analysis is a method for combining randomized and observational study data while capitalizing on the internal validity of the randomized data and the external validity of the observational data (Begg, 1992; Greenhouse et al., 2017). It provides a means for estimating treatment effects for patients excluded from the RCT and can use summary-level RCT data if outcomes are available by relevant patient subgroups, although it can only accommodate a limited number of strata of relevant effect modifiers.

Cross-design synthesis meta-analysis effectively assumes a constant amount of unmeasured confounding across patients eligible and ineligible for the RCTs (Kaizar, 2011).
This approach will have smaller bias than use of randomized or observational data alone under various common data scenarios and, across simulations, shows better coverage through smaller bias and increased variance (Kaizar, 2011).

When differences between RCT and target populations are less well understood, there are continuous effect modifiers, or there is a higher-dimensional set of effect modifiers, one can use Bayesian calibrated risk-adjusted regressions (Varadhan, Henderson and Weiss, 2016; Henderson, Varadhan and Weiss, 2017). This parametric approach requires individual-level information from observational and randomized studies, leveraging outcome regressions and calibration using the propensity of selection. The target population is assumed to be represented by a subset of the observational data; the RCT data are likewise assumed to be represented by a (potentially different) subset of the observational data. The calibrated risk-adjusted model performs well when there is poor overlap between RCT and target data; however, it relies on the observational dataset having substantial effect modifier overlap with both the target sample and RCT. Robust variance formulas or bootstrapping can be used to obtain confidence intervals.

A 2-step frequentist approach for consistently estimating PCATE parameters has been developed to estimate effects in a target population represented by observational data (Kallus, Puli and Shalit, 2018). It begins with outcome regressions for each treatment group of the observational data, or a flexible regression that captures effect heterogeneity. Observational data are then standardized to the RCT population before 'debiasing' their estimates using RCT data by including a correction term that can depend on measured covariates.
This method relies on the assumption that calibrating internal validity bias in the subset of the observational data distribution overlapping with RCT data appropriately calibrates the bias for the entire target sample. The 2-step approach would therefore not necessarily decrease bias if the covariate distribution is highly imbalanced, resulting in average biases that are quite different between the RCT-overlapping vs. nonoverlapping subsets of the target sample.

Lu et al. (2019) present an approach that, unlike the above methods, assumes no unmeasured confounding in the observational data when combining RCT and comprehensive cohort study data (where patients who decline randomization are enrolled in a parallel observational study). They use semiparametric double robust estimators that can incorporate flexible regressions.
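The 2-step logic can be sketched in a simplified form: fit (possibly confounded) outcome regressions on the observational data, then fit an additive correction term on the RCT data. This is only an illustration in the spirit of Kallus, Puli and Shalit (2018), not their estimator: it assumes a known randomization probability of 0.5, uses linear regressions throughout, and the names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def two_step_debias(X_obs, a_obs, y_obs, X_rct, a_rct, y_rct, p_treat=0.5):
    """Simplified 2-step 'debiasing' sketch: observational CATE plus an
    additive correction term learned from randomized data."""
    # Step 1: outcome regressions per arm on the observational data;
    # their difference is a possibly biased CATE estimate
    m1 = LinearRegression().fit(X_obs[a_obs == 1], y_obs[a_obs == 1])
    m0 = LinearRegression().fit(X_obs[a_obs == 0], y_obs[a_obs == 0])

    def tau_obs(X):
        return m1.predict(X) - m0.predict(X)

    # Step 2: build an unbiased RCT pseudo-outcome via inverse-probability
    # weighting, then regress its residual from the observational CATE
    # on covariates to learn the correction term
    pseudo = y_rct * (a_rct / p_treat - (1 - a_rct) / (1 - p_treat))
    corr = LinearRegression().fit(X_rct, pseudo - tau_obs(X_rct))

    def tau(X):
        return tau_obs(X) + corr.predict(X)

    return tau
```

In a toy simulation where unmeasured confounding inflates the observational effect by a constant, the fitted correction recovers roughly that constant, so the combined estimate tracks the true conditional effect.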
6. DISCUSSION
Obtaining unbiased estimates for a relevant target population requires applying generalizability or transportability methods in studies that meet required identifiability assumptions. The internal validity of randomized trials is not sufficient to obtain unbiased causal effects; external validity also needs to be considered. In this synthesis, we have discussed (1) sources of external validity bias and study designs to address it, (2) defining an estimand in a target population of interest, (3) the identifiability assumptions underpinning generalizability and transportability approaches, (4) a variety of approaches for quantifying the relevant dissimilarity between study and target samples and assessing treatment effect heterogeneity, and (5) a variety of matching and weighting methods, outcome regression approaches, and techniques that use both outcome and propensity regressions that generalize or transport from randomized and observational studies to a target population. These approaches have been applied across diverse settings, from RCT results transported to patients represented in registries to cluster-randomized educational intervention trials generalized to broader geographic areas. Across a variety of settings, it is important to estimate results for populations that go beyond the study population. We suggest the following considerations for researchers.

Make efforts to explicitly define the target population(s) and identify the study population from which your study sample data is a simple random sample. Describing the study population may be a difficult task, and there may not be a practically meaningful population that is representative of your study sample data. However, this clarity will allow you to compare and, when feasible, better align the study sample data to the target population.
Discussion regarding target population(s) should be guided by the ensuing decisions the study aims to inform as well as practical considerations (e.g., lack of certain subgroups in your study). These considerations may require iteration between feasibility and the desired study aims as well as careful discussion amidst study collaborators. When combining across studies, meta-analyses should likewise carefully specify target population(s) for inference and incorporate considerations of treatment effect heterogeneity or demonstrate that effect heterogeneity is not a concern. Without transparency in the target population(s), a study cannot estimate well-defined treatment effects, nor can readers judge the generalizability of study results to any other population of interest.
Plan for generalization in your study design, when feasible, including writing generalizability considerations into your grant or study objectives.
Enroll randomized study participants or design observational study inclusion and exclusion criteria such that the study sample is representative of the target population or fully captures the heterogeneity of effect modifiers. Collect data on likely treatment effect modifiers that are associated with study participation. Attempt to identify and mitigate potential sources of missingness or selection bias. If possible, collect baseline characteristics and outcome data on study nonparticipants who are part of the target population. Otherwise, identify external sources of data that might inform the composition of your target population with respect to effect modifiers and work towards aligning variables between these target sample data sources and your study.

Clearly describe the internal and external validity assumptions needed to identify the treatment effect as they relate to your study. Substantively assess the justifiability of these internal and external validity assumptions. To the extent possible, test the validity of the assumptions and perform sensitivity analyses to assess the impact of assumption violations.
Quantify the dissimilarity between the study and target populations using at least one method.
Ideally, use multiple methods, as they each tell different parts of the story: examine univariate and joint distributions of effect modifiers, differences in the propensity to participate in the study, and (if outcome information is available in the target sample) differences in outcomes between study and target subjects on the same treatment. If differences are identified, one should investigate which subpopulations drive those differences and assess whether they have heterogeneous treatment effects. In addition to examining subject characteristics, assess whether differences exist in the setting, treatment, or outcome.

To obtain causal estimates when the target and study populations differ with respect to effect modifiers, incorporate at least one generalizability or transportability estimator. Alternatively, at the minimum, assess and describe sources of effect heterogeneity and whether they are likely to differ for the target population. Derive estimates using as much data as possible (e.g., when outcome data are available, use them in a principled way). The choice of method for external validity bias adjustment may be restricted by data availability (e.g., summary-level vs. individual-level data) but should be driven by similar principles as those that guide the choice between outcome regressions, matching and weighting methods, and double robust approaches for confounding adjustment (van der Laan and Robins, 2003; Neugebauer and van der Laan, 2005; van der Laan and Rose, 2011). Flexible nonparametric and semiparametric models and estimators that use ensemble machine learning minimize the need for strict parametric assumptions and have the potential to perform the best (Kern et al., 2016).

For both methods developers and applied researchers, we recommend releasing publicly available code alongside the paper and providing details for implementation.
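As an illustration of the kind of shareable diagnostic code recommended here, the following is a minimal sketch of two of the dissimilarity checks described above: per-covariate standardized mean differences between study and target samples, and the distribution of estimated propensities of study participation. The function name and the logistic participation model are illustrative choices, not a prescribed implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def study_target_diagnostics(X_study, X_target):
    """Sketch of two dissimilarity diagnostics between study and target
    samples: standardized mean differences of covariates (pooled-SD
    denominator) and estimated propensities of study participation."""
    # Standardized mean differences for each covariate
    mean_s, mean_t = X_study.mean(axis=0), X_target.mean(axis=0)
    pooled_sd = np.sqrt((X_study.var(axis=0) + X_target.var(axis=0)) / 2)
    smd = (mean_s - mean_t) / pooled_sd
    # Propensity of study participation given covariates
    X = np.vstack([X_study, X_target])
    s = np.concatenate([np.ones(len(X_study)), np.zeros(len(X_target))])
    p = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X)[:, 1]
    # Return SMDs plus the participation propensities in each sample,
    # whose overlap can then be examined graphically
    return smd, p[s == 1], p[s == 0]
```

Large absolute standardized mean differences for likely effect modifiers, or poor overlap in the two propensity distributions, would flag subpopulations worth investigating for heterogeneous treatment effects.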
Published code facilitates replicability and accessibility of methods for future research and applied use. A substantial barrier to the adoption of new statistical methods, including advances in generalizability and transportability, is the lack of available computational tools.

While much of the causal inference literature has focused on issues of internal validity, both internal and external validity are necessary for valid inference. When treatment effect heterogeneity exists, as is often the case, study results may not hold for a target population of interest. Approaches to address internal validity biases can be borrowed to improve upon methods for addressing external validity bias. This review presents a framework for such analysis and summarizes different choices for estimators that can be used to generalize or transport results to a population different from the one under study. It brings together diverse cross-disciplinary literature to provide guidance both for applied and methods researchers. Improving the incorporation of results from observational studies, including electronic health databases, can lead to better inference for policy-relevant populations with reduced bias and improved precision.
7. ACKNOWLEDGMENTS
This research was supported by NIH New Innovator Award DP2MD012722 and NIH training grants T32LM012411 and T32ES07142. The authors thank Sebastien Haneuse, Francesca Dominici, and Laura Hatfield for helpful feedback on this work, as well as seminar and conference audiences at the Harvard Program on Causal Inference, Mathematica Policy Research, Harvard–Stanford Health Policy Data Science Lab, 2018 Harvard Data Science Initiative Conference, and 2020 NIA Workshop on Applications of Machine Learning to Improve Healthcare Delivery for Older Adults.
REFERENCES
Ackerman, B., Schmid, I., Rudolph, K. E., Seamans, M. J., Susukida, R., Mojtabai, R. and Stuart, E. A. (2019). Implementing Statistical Methods for Generalizing Randomized Trial Findings to a Target Population. Addictive Behaviors.
Allcott, H. (2015). Site Selection Bias in Program Evaluation. The Quarterly Journal of Economics.
Allcott, H. and Mullainathan, S. (2012). External Validity and Partner Selection Bias. National Bureau of Economic Research Working Paper Series.
Andrews, I. and Oster, E. (2017). Weighting for External Validity. Technical Report No. w23826, National Bureau of Economic Research, Cambridge, MA.
Angrist, J. D. and Fernández-Val, I. (2013). ExtrapoLATE-ing: External Validity and Overidentification in the LATE Framework. In Advances in Economics and Econometrics.
Athey, S. and Imbens, G. (2016). Recursive Partitioning for Heterogeneous Causal Effects. Proceedings of the National Academy of Sciences.
Attanasio, O., Meghir, C. and Szekely, M. (2003). Using Randomised Experiments and Structural Models for 'Scaling Up': Evidence from the PROGRESA Evaluation. IFS Working Paper EWP03/05.
Baker, R., Brick, J. M., Gotway Crawford, C. A., Terhanian, G., Langer, G., Bates, N. A., Battaglia, M., Couper, M. P., Dever, J. A., Gile, K. J., Tourangeau, R., Valliant, R. and Rivers, D. (2013). Summary Report of the AAPOR Task Force on Non-Probability Sampling. Journal of Survey Statistics and Methodology.
Bareinboim, E. and Pearl, J. (2014). Transportability from Multiple Environments with Limited Experiments: Completeness Results. In Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger, eds.) 280–288. Curran Associates, Inc.
Bareinboim, E. and Pearl, J. (2016). Causal Inference and the Data-Fusion Problem. Proceedings of the National Academy of Sciences.
Bareinboim, E., Tian, J. and Pearl, J. (2014). Recovering from Selection Bias in Causal and Statistical Inference. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI'14.
Bareinboim, E. and Tian, J. (2015). Recovering Causal Effects from Selection Bias. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI'15.
Begg, C. B. (1992). Cross Design Synthesis: A New Strategy for Medical Effectiveness Research. United States General Accounting Office (GAO/PEMD-92-18). Statistics in Medicine.
Bell, S. H., Olsen, R. B., Orr, L. L. and Stuart, E. A. (2016). Estimates of External Validity Bias When Impact Evaluations Select Sites Nonrandomly. Educational Evaluation and Policy Analysis.
Benchimol, E. I., Smeeth, L., Guttmann, A., Harron, K., Moher, D., Petersen, I., Sørensen, H. T., von Elm, E., Langan, S. M. and RECORD Working Committee (2015). The REporting of Studies Conducted Using Observational Routinely-Collected Health Data (RECORD) Statement. PLOS Medicine, e1001885.
Bennett, M., Vielma, J. P. and Zubizarreta, J. R. (2020). Building Representative Matched Samples with Multi-Valued Treatments in Large Observational Studies. Journal of Computational and Graphical Statistics.
Buchanan, A. L., Hudgens, M. G., Cole, S. R., Mollan, K. R., Sax, P. E., Daar, E. S., Adimora, A. A., Eron, J. J. and Mugavero, M. J. (2018). Generalizing Evidence from Randomized Trials Using Inverse Probability of Sampling Weights. Journal of the Royal Statistical Society: Series A (Statistics in Society).
Burchett, H., Umoquit, M. and Dobrow, M. (2011). How Do We Know When Research from One Setting Can Be Useful in Another? A Review of External Validity, Applicability and Transferability Frameworks. Journal of Health Services Research & Policy.
Cahan, A., Cahan, S. and Cimino, J. J. (2017). Computer-Aided Assessment of the Generalizability of Clinical Trial Results. International Journal of Medical Informatics.
Chan, W. (2017). Partially Identified Treatment Effects for Generalizability. Journal of Research on Educational Effectiveness.
Chen, Z. and Kaizar, E. (2017). On Variance Estimation for Generalizing from a Trial to a Target Population. arXiv:1704.07789 [stat].
Chen, C. and Wong, R. (2018). Black Patients Miss Out on Promising Cancer Drugs.
Chen, S., Tian, L., Cai, T. and Yu, M. (2017). A General Statistical Framework for Subgroup Identification and Comparative Treatment Scoring. Biometrics.
Chen, I. Y., Pierson, E., Rose, S., Joshi, S., Ferryman, K. and Ghassemi, M. (2020). Ethical Machine Learning in Health. arXiv:2009.10576.
Chipman, H. A., George, E. I. and McCulloch, R. (2007). Bayesian Ensemble Learning. In Advances in Neural Information Processing Systems 19 - Proceedings of the 2006 Conference.
Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics.
Cole, S. R. and Stuart, E. A. (2010). Generalizing Evidence from Randomized Clinical Trials to Target Populations: The ACTG 320 Trial. American Journal of Epidemiology.
Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.-P., Josse, J. and Yang, S. (2020). Causal Inference Methods for Combining Randomized Trials and Observational Studies: A Review. arXiv:2011.08047 [stat].
Correa, J. D. and Bareinboim, E. (2017). Causal Effect Identification by Adjustment under Confounding and Selection Biases. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. AAAI'17.
Correa, J. D., Tian, J. and Bareinboim, E. (2018). Generalized Adjustment under Confounding and Selection Biases. In AAAI.
Cronbach, L. J. and Shapiro, K. (1982). Designing Evaluations of Educational and Social Programs, 1st ed. A Joint Publication in the Jossey-Bass Series in Social and Behavioral Science & in Higher Education. Jossey-Bass, San Francisco.
Crump, R. K., Hotz, V. J., Imbens, G. W. and Mitnik, O. A. (2008). Nonparametric Tests for Treatment Effect Heterogeneity. Review of Economics and Statistics.
Dahabreh, I., Robertson, S., Stuart, E. and Hernan, M. (2017). Extending Inferences from Randomized Participants to All Eligible Individuals Using Trials Nested within Cohort Studies. arXiv:1709.04589 [stat].
Dahabreh, I. J., Robertson, S. E., Steingrimsson, J. A., Stuart, E. A. and Hernan, M. A. (2018). Extending Inferences from a Randomized Trial to a New Target Population. arXiv:1805.00550 [stat].
Dahabreh, I. J., Robertson, S. E., Tchetgen, E. J., Stuart, E. A. and Hernán, M. A. (2019a). Generalizing Causal Inferences from Individuals in Randomized Trials to All Trial-Eligible Individuals. Biometrics.
Dahabreh, I. J., Robins, J. M., Haneuse, S. J.-P. A., Saeed, I., Robertson, S. E., Stuart, E. A. and Hernán, M. A. (2019b). Sensitivity Analysis Using Bias Functions for Studies Extending Inferences from a Randomized Trial to a Target Population.
Dahabreh, I. J., Hernan, M. A., Robertson, S. E., Buchanan, A. and Steingrimsson, J. A. (2019c). Generalizing Trial Findings Using Nested Trial Designs with Sub-Sampling of Non-Randomized Individuals.
Dahabreh, I. J., Robertson, S. E., Petito, L. C., Hernán, M. A. and Steingrimsson, J. A. (2019d). Efficient and Robust Methods for Causally Interpretable Meta-Analysis: Transporting Inferences from Multiple Randomized Trials to a Target Population.
Davis, K. (1988). The Comprehensive Cohort Study: The Use of Registry Data to Confirm and Extend a Randomized Trial. Recent Results in Cancer Research.
Dekkers, O. M., von Elm, E., Algra, A., Romijn, J. A. and Vandenbroucke, J. P. (2010). How to Assess the External Validity of Therapeutic Trials: A Conceptual Approach. International Journal of Epidemiology.
Ding, P., Feller, A. and Miratrix, L. (2016). Randomization Inference for Treatment Effect Variation. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Dong, N., Stuart, E. A., Lenis, D. and Quynh Nguyen, T. (2020). Using Propensity Score Analysis of Survey Data to Estimate Population Average Treatment Effects: A Case Study Comparing Different Methods. Evaluation Review.
Eddy, D. (1989). The Confidence Profile Method: A Bayesian Method for Assessing Health Technologies. Operations Research.
Elliott, M. R. and Valliant, R. (2017). Inference for Nonprobability Samples. Statistical Science.
Fang, A. (2017). 10 Things to Know about Heterogeneous Treatment Effects.
Flores, C. A. and Mitnik, O. A. (2013). Comparing Treatments across Labor Markets: An Assessment of Nonexperimental Multiple-Treatment Strategies. The Review of Economics and Statistics.
Ford, I. and Norrie, J. (2016). Pragmatic Trials. New England Journal of Medicine.
Frangakis, C. (2009). The Calibration of Treatment Effects from Clinical Trials to Target Populations. Clinical Trials: Journal of the Society for Clinical Trials.
Gabler, N. B., Duan, N., Liao, D., Elmore, J. G., Ganiats, T. G. and Kravitz, R. L. (2009). Dealing with Heterogeneity of Treatment Effects: Is the Literature Up to the Challenge? Trials.
Gail, M. and Simon, R. (1985). Testing for Qualitative Interactions between Treatment Effects and Patient Subsets. Biometrics.
Gechter, M. (2015). Generalizing the Results from Social Experiments: Theory and Evidence from Mexico and India. Department of Economics, Pennsylvania State University. Unpublished manuscript.
Gelman, A. and Little, T. C. (1997). Poststratification into Many Categories Using Hierarchical Logistic Regression. Survey Methodology.
Glauner, P., Migliosi, A., Meira, J., Valtchev, P., State, R. and Bettinger, F. (2017). Is Big Data Sufficient for a Reliable Detection of Non-Technical Losses? arXiv:1702.03767 [cs].
Green, L. W. and Glasgow, R. E. (2006). Evaluating the Relevance, Generalization, and Applicability of Research: Issues in External Validation and Translation Methodology. Evaluation & the Health Professions.
Green, D. P. and Kern, H. L. (2012). Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees. Public Opinion Quarterly.
Greenhouse, Kelleher, K., Seltman, H. and Gardner, W. (2008). Generalizing from Clinical Trial Data: A Case Study. The Risk of Suicidality among Pediatric Antidepressant Users. Statistics in Medicine.
Greenhouse, J. B., Kaizar, E. E., Anderson, H. D., Bridge, J. A., Libby, A. M., Valuck, R. and Kelleher, K. J. (2017). Combining Information from Multiple Data Sources: An Introduction to Cross-Design Synthesis with a Case Study. In Methods in Comparative Effectiveness Research.
Greenland, S. (2005). Multiple-Bias Modelling for Analysis of Observational Data. Journal of the Royal Statistical Society: Series A.
Gunter, L., Zhu, J. and Murphy, S. A. (2011). Variable Selection for Qualitative Interactions. Statistical Methodology.
Haneuse, S. (2016). Distinguishing Selection Bias and Confounding Bias in Comparative Effectiveness Research. Medical Care, e23–e29.
Haneuse, S., Schildcrout, J., Crane, P., Sonnen, J., Breitner, J. and Larson, E. (2009). Adjustment for Selection Bias in Observational Studies with Application to the Analysis of Autopsy Data. Neuroepidemiology.
Hartman, E., Grieve, R., Ramsahai, R. and Sekhon, J. S. (2015). From Sample Average Treatment Effect to Population Average Treatment Effect on the Treated: Combining Experimental with Observational Studies to Estimate Population Treatment Effects. Journal of the Royal Statistical Society: Series A (Statistics in Society).
He, Z., Ryan, P., Hoxha, J., Wang, S., Carini, S., Sim, I. and Weng, C. (2016). Multivariate Analysis of the Population Representativeness of Related Clinical Studies. Journal of Biomedical Informatics.
Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica.
Henderson, N. C., Varadhan, R. and Weiss, C. O. (2017). Cross-Design Synthesis for Extending the Applicability of Trial Evidence When Treatment Effect Is Heterogenous: Part II. Application and External Validation. Communications in Statistics: Case Studies, Data Analysis and Applications.
Hernán, M. A., Alonso, A., Logan, R., Grodstein, F., Michels, K. B., Willett, W. C., Manson, J. E. and Robins, J. M. (2008). Observational Studies Analyzed like Randomized Experiments: An Application to Postmenopausal Hormone Therapy and Coronary Heart Disease. Epidemiology.
Hill, J. L. (2011). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics.
Horvitz, D. G. and Thompson, D. J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association.
Hotz, V. J., Imbens, G. W. and Mortimer, J. H. (2005). Predicting the Efficacy of Future Training Programs Using Past Experiences at Other Locations. Journal of Econometrics.
Imai, K., King, G. and Stuart, E. A. (2008). Misunderstandings between Experimentalists and Observationalists about Causal Inference. Journal of the Royal Statistical Society: Series A (Statistics in Society).
Johansson, F. D., Kallus, N., Shalit, U. and Sontag, D. (2018). Learning Weighted Representations for Generalization across Designs. arXiv:1802.08598 [stat].
Josey, K. P., Yang, F., Ghosh, D. and Raghavan, S. (2020a). A Calibration Approach to Transportability with Observational Data.
Josey, K. P., Berkowitz, S. A., Ghosh, D. and Raghavan, S. (2020b). Transporting Experimental Results with Entropy Balancing.
Kaizar, E. E. (2011). Estimating Treatment Effect via Simple Cross Design Synthesis. Statistics in Medicine.
Kaizar, E. E. (2015). Incorporating Both Randomized and Observational Data into a Single Analysis. Annual Review of Statistics and Its Application.
Kallus, N., Puli, A. M. and Shalit, U. (2018). Removing Hidden Confounding by Experimental Grounding. arXiv:1810.11646 [cs, stat].
Kennedy, L. and Gelman, A. (2019). Know Your Population and Know Your Model: Using Model-Based Regression and Poststratification to Generalize Findings beyond the Observed Sample. arXiv:1906.11323 [stat].
Kennedy-Martin, T., Curtis, S., Faries, D., Robinson, S. and Johnston, J. (2015). A Literature Review on the Representativeness of Randomized Controlled Trial Samples and Implications for the External Validity of Trial Results. Trials.
Kern, H. L., Stuart, E. A., Hill, J. and Green, D. P. (2016). Assessing Methods for Generalizing Experimental Impact Estimates to Target Populations. Journal of Research on Educational Effectiveness.
Kim, J. K., Park, S., Chen, Y. and Wu, C. (2018). Combining Non-Probability and Probability Survey Samples through Mass Imputation.
Lesko, C. R., Buchanan, A. L., Westreich, D., Edwards, J. K., Hudgens, M. G. and Cole, S. R. (2017). Generalizing Study Results: A Potential Outcomes Perspective. Epidemiology.
Lipkovich, I., Dmitrienko, A., Denne, J. and Enas, G. (2011). Subgroup Identification Based on Differential Effect Search - a Recursive Partitioning Method for Establishing Response to Treatment in Patient Subpopulations (SIDES). Statistics in Medicine.
Lu, Y., Scharfstein, D. O., Brooks, M. M., Quach, K. and Kennedy, E. H. (2019). Causal Inference for Comprehensive Cohort Studies.
Luedtke, A., Carone, M. and van der Laan, M. J. (2019). An Omnibus Non-parametric Test of Equality in Distribution for Unknown Functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Lunceford, J. K. and Davidian, M. (2004). Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study. Statistics in Medicine.
Marcus, S. (1997). Assessing Non-Consent Bias with Parallel Randomized and Nonrandomized Clinical Trials. Journal of Clinical Epidemiology.
Miettinen, O. S. (1972). Standardization of Risk Ratios. American Journal of Epidemiology.
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V. and Herrera, F. (2012). A Unifying View on Dataset Shift in Classification. Pattern Recognition.
Neugebauer, R. and van der Laan, M. (2005). Why Prefer Double Robust Estimators in Causal Inference? Journal of Statistical Planning and Inference.
Nguyen, T. Q., Ebnesajjad, C., Cole, S. R. and Stuart, E. A. (2017). Sensitivity Analysis for an Unobserved Moderator in RCT-to-Target-Population Generalization of Treatment Effects. Annals of Applied Statistics.
Nguyen, T. Q., Ackerman, B., Schmid, I., Cole, S. R. and Stuart, E. A. (2018). Sensitivity Analyses for Effect Modifiers Not Observed in the Target Population When Generalizing Treatment Effects from a Randomized Controlled Trial: Assumptions, Models, Effect Scales, Data Scenarios, and Implementation Details. PLOS ONE, e0208795.
Nie, L., Zhang, Z., Rubin, D. and Chu, J. (2013). Likelihood Reweighting Methods to Reduce Potential Bias in Noninferiority Trials Which Rely on Historical Data to Make Inference. The Annals of Applied Statistics.
O'Muircheartaigh, C. and Hedges, L. V. (2014). Generalizing from Unrepresentative Experiments: A Stratified Propensity Score Approach. Journal of the Royal Statistical Society: Series C (Applied Statistics).
Olsen, R. B., Orr, L. L., Bell, S. H. and Stuart, E. A. (2013). External Validity in Policy Evaluations That Choose Sites Purposively. Journal of Policy Analysis and Management.
Pan, Q. and Schaubel, D. E. (2008). Proportional Hazards Models Based on Biased Samples and Estimated Selection Probabilities. The Canadian Journal of Statistics / La Revue Canadienne de Statistique.
Pan, Q. and Schaubel, D. E. (2009). Evaluating Bias Correction in Weighted Proportional Hazards Regression. Lifetime Data Analysis.
Park, D. K., Gelman, A. and Bafumi, J. (2004). Bayesian Multilevel Estimation with Poststratification: State-Level Estimates from National Polls. Political Analysis.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Pearl, J. (2015). Generalizing Experimental Findings. Journal of Causal Inference.
Pearl, J. and Bareinboim, E. (2011). Transportability of Causal and Statistical Relations: A Formal Approach.
Pearl, J. and Bareinboim, E. (2014). External Validity: From Do-Calculus to Transportability across Populations. Statistical Science.
Phillippo, D. M., Ades, A. E., Dias, S., Palmer, S., Abrams, K. R. and Welton, N. J. (2018). Methods for Population-Adjusted Indirect Comparisons in Health Technology Appraisal. Medical Decision Making.
Pool, I., Abelson, R. and Popkin, S. (1964). Candidates, Issues and Strategies; a Computer Simulation of the 1960 Presidential Election. Massachusetts Institute of Technology Press.
Prentice, R. L., Langer, R., Stefanick, M. L., Howard, B. V., Pettinger, M., Anderson, G., Barad, D., Curb, J. D., Kotchen, J., Kuller, L., Limacher, M. and Wactawski-Wende, J. (2005). Combined Postmenopausal Hormone Therapy and Cardiovascular Disease: Toward Resolving the Discrepancy between Observational Studies and the Women's Health Initiative Clinical Trial. American Journal of Epidemiology.
Prevost, T. C., Abrams, K. R. and Jones, D. R. (2000). Hierarchical Models in Generalized Synthesis of Evidence: An Example Based on Studies of Breast Cancer Screening. Statistics in Medicine.
Qian, M., Chakraborty, B. and Maiti, R. (2019). A Sequential Significance Test for Treatment by Covariate Interactions. arXiv:1901.08738 [stat].
Rosenman, E., Owen, A. B., Baiocchi, M. and Banack, H. (2018). Propensity Score Methods for Merging Observational and Experimental Datasets. arXiv:1804.07863 [stat].
Rosenman, E., Basse, G., Owen, A. and Baiocchi, M. (2020). Combining Observational and Experimental Datasets Using Shrinkage Estimators.
Rothwell, P. M. (2005). External Validity of Randomised Controlled Trials: "To Whom Do the Results of This Trial Apply?". The Lancet.
Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology.
Rudolph, K. and van der Laan, M. (2017). Robust Estimation of Encouragement Design Intervention Effects Transported across Sites. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Schmid, I., Rudolph, K. E., Nguyen, T. Q., Hong, H., Seamans, M. J., Ackerman, B. and Stuart, E. A. (2020). Comparing the Performance of Statistical Methods That Generalize Effect Estimates from Randomized Controlled Trials to Much Larger Target Populations. Communications in Statistics - Simulation and Computation, ahead-of-print.
Schulz, K. F., Altman, D. G. and Moher, D. (2010). CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials. BMJ, c332.
Schwartz, D. and Lellouch, J. (1967). Explanatory and Pragmatic Attitudes in Therapeutical Trials. Journal of Chronic Diseases.
Sen, A., Chakrabarti, S., Goldstein, A., Wang, S., Ryan, P. B. and Weng, C. (2016). GIST 2.0: A Scalable Multi-Trait Metric for Quantifying Population Representativeness of Individual Clinical Studies. Journal of Biomedical Informatics.
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2001). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston.
Signorovitch, J. E., Wu, E. Q., Yu, A. P., Gerrits, C. M., Kantor, E., Bao, Y., Gupta, S. R. and Mulani, P. M. (2010). Comparative Effectiveness without Head-to-Head Trials: A Method for Matching-Adjusted Indirect Comparisons Applied to Psoriasis Treatment with Adalimumab or Etanercept. PharmacoEconomics.
Simon, R. (1982). Patient Subsets and Variation in Therapeutic Efficacy. British Journal of Clinical Pharmacology.
Stuart, E. A. (2010). Matching Methods for Causal Inference: A Review and a Look Forward. Statistical Science.
Stuart, E. A., Ackerman, B. and Westreich, D. (2018). Generalizability of Randomized Trial Results to Target Populations: Design and Analysis Possibilities. Research on Social Work Practice.
Stuart, E. A., Bradshaw, C. P. and Leaf, P. J. (2015). Assessing the Generalizability of Randomized Trial Results to Target Populations. Prevention Science.
Stuart, E. A., Cole, S. R., Bradshaw, C. P. and Leaf, P. J. (2011). The Use of Propensity Scores to Assess the Generalizability of Results from Randomized Trials. Journal of the Royal Statistical Society: Series A (Statistics in Society).
Su, X., Zhou, T., Yan, X., Fan, J. and Yang, S. (2008). Interaction Trees with Censored Survival Data. The International Journal of Biostatistics.
Su, X., Tsai, C., Wang, H., Nickerson, D. and Li, B. (2009). Subgroup Analysis via Recursive Partitioning. Journal of Machine Learning Research.
Tian, L., Alizadeh, A. A., Gentles, A. J. and Tibshirani, R. (2014). A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates. Journal of the American Statistical Association.
Tipton, E. (2013a). Stratified Sampling Using Cluster Analysis: A Sample Selection Strategy for Improved Generalizations from Experiments. Evaluation Review.
Tipton, E. (2013b). Improving Generalizations from Experiments Using Propensity Score Subclassification: Assumptions, Properties, and Contexts. Journal of Educational and Behavioral Statistics.
Tipton, E. (2014). How Generalizable Is Your Experiment? An Index for Comparing Experimental Samples and Populations. Journal of Educational and Behavioral Statistics.
Tipton, E. and Olsen, R. B. (2018). A Review of Statistical Methods for Generalizing from Evaluations of Educational Interventions. Educational Researcher.
Tipton, E. and Peck, L. R. (2017). A Design-Based Approach to Improve External Validity in Welfare Policy Evaluations.
Evaluation Review Tipton, E. , Hedges, L. , Vaden-Kiernan, M. , Borman, G. , Sullivan, K. and
Caverly, S. (2014).Sample Selection in Randomized Experiments: A New Method Using Propensity Score Strati-fied Sampling.
Journal of Research on Educational E ff ectiveness Tipton, E. , Hallberg, K. , Hedges, L. V. and
Chan, W. (2017). Implications of Small Samples forGeneralization: Adjustments and Rules of Thumb.
Evaluation Review Turner, R. M. , Spiegelhalter, D. J. , Smith, G. C. S. and
Thompson, S. G. (2009). Bias Modellingin Evidence Synthesis.
Journal of the Royal Statistical Society: Series A (Statistics in Society)
Van der Laan, M. J. , Laan, M. and
Robins, J. M. (2003).
Unified Methods for Censored LongitudinalData and Causality . Springer Science & Business Media. van der Laan, M. J. and
Rose, S. (2011).
Targeted Learning . Springer Series in Statistics . SpringerNew York, New York, NY.
Varadhan, R. , Henderson, N. C. and
Weiss, C. O. (2016). Cross-Design Synthesis for Extendingthe Applicability of Trial Evidence When Treatment E ff ect Is Heterogeneous: Part i. Methodol-ogy. Communications in Statistics: Case Studies, Data Analysis and Applications Verde, P. E. (2019). The Hierarchical Metaregression Approach and Learning from Clinical Evi-dence.
Biometrical Journal Verde, P. E. and
Ohmann, C. (2015). Combining Randomized and Non-Randomized Evidencein Clinical Research: A Review of Methods and Applications: Combining Randomized andNon-Randomized Evidence.
Research Synthesis Methods Verde, P. E. , Ohmann, C. , Morbach, S. and
Icks, A. (2016). Bayesian Evidence Synthesis for Ex-ploring Generalizability of Treatment E ff ects: A Case Study of Combining Randomized andNon-Randomized Results in Diabetes: Bayesian Evidence Synthesis for Exploring Generaliz-ability of Treatment E ff ects: A Case Study of Combining Randomized and Non-RandomizedResults in Di. Statistics in Medicine von Elm, E. , Altman, D. G. , Egger, M. , Pocock, S. J. , Gøtzsche, P. C. and
Vandenbroucke, J. P. (2008). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE)Statement: Guidelines for Reporting Observational Studies.
Journal of Clinical Epidemiology Weisberg, H. I. , Hayden, V. C. and
Pontes, V. P. (2009). Selection Criteria and Generalizabil-ity within the Counterfactual Framework: Explaining the Paradox of Antidepressant-InducedSuicidality?
Clinical Trials Weiss, C. O. , Segal, J. B. and
Varadhan, R. (2012). Assessing the Applicability of Trial Evidenceto a Target Sample in the Presence of Heterogeneity of Treatment E ff ect: APPLICABILITY OFTREATMENT EFFECTS. Pharmacoepidemiology and Drug Safety Weng, C. , Li, Y. , Ryan, P. , Zhang, Y. , Liu, F. , Gao, J. , Bigger, J. T. and
Hripcsak, G. (2014). ADistribution-Based Method for Assessing the Di ff erences between Clinical Trial Target Pop-ulations and Patient Populations in Electronic Health Records. Applied clinical informatics Westreich, D. , Edwards, J. K. , Lesko, C. R. , Stuart, E. and
Cole, S. R. (2017). Transportability ofTrial Results Using Inverse Odds of Sampling Weights.
American Journal of Epidemiology I. DEGTIAR ET AL.
APPENDIX: SUMMARY OF METHODS THAT ONLY REQUIRE SUMMARY-LEVEL DATA
Without access to individual patient data in the study and/or target samples, investigators are constrained in the estimators available to them. The following estimators can be applied in this setting. Investigators should nevertheless strive to make maximal use of the available data and hence use methods that incorporate individual-level data wherever it is available.
Summary-level data for both study (covariate and outcome) and target samples (covariate).
Post-stratification (Miettinen, 1972; Prentice et al., 2005) only requires joint distributions or cell counts for each stratum. Using only study and target sample means, one could also apply outcome regressions that are linear in their predictors.
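As a minimal sketch of the post-stratification idea above, stratum-specific effect estimates from the study can be reweighted by the target sample's stratum proportions. The strata, effect values, and cell counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Post-stratification sketch: weight stratum-specific study estimates
# by target-population stratum proportions. All numbers are hypothetical.

# Stratum -> estimated treatment effect in the study sample.
study_effects = {"age<65": 0.30, "age>=65": 0.10}

# Stratum cell counts in the target sample (summary-level covariate data).
target_counts = {"age<65": 4000, "age>=65": 6000}

def post_stratified_effect(effects, counts):
    """Average stratum effects, weighted by target stratum proportions."""
    total = sum(counts.values())
    return sum(effects[s] * counts[s] / total for s in effects)

print(round(post_stratified_effect(study_effects, target_counts), 4))  # 0.18
```

Only stratum-level summaries enter the computation, which is why the method needs no individual patient data; the cost is that adjustment is limited to the covariates defining the strata.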
Summary-level outcome data for both study and target samples.
Bias-adjusted meta-analysis approaches by Turner et al. (2009) and Greenland (2005) require summary-level study outcome data together with estimates of the bias in each study. When the summary-level data are stratified by effect modifiers, one can use the approaches of Eddy (1989) and Prevost, Abrams and Jones (2000). If summary-level study data are stratified by participants included vs. excluded from the study, cross-design synthesis can be used (Begg, 1992; Kaizar, 2011).
Summary-level covariate and outcome data in the study, individual-level covariate and outcome data in the target sample.
With summary-level study data and individual-level target sample data, one can use hierarchical Bayesian evidence synthesis (Verde et al., 2016; Verde, 2019).
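To make the bias-adjusted meta-analysis idea discussed above concrete, one common additive formulation shifts each study estimate by its assessed bias, inflates the study variance by the uncertainty in that bias assessment, and then pools with inverse-variance weights. This is a hedged sketch of that generic scheme, not a faithful implementation of any one cited method; all numeric inputs are hypothetical.

```python
# Bias-adjusted inverse-variance pooling (illustrative sketch).
# Each tuple: (estimate, variance, assessed additive bias, variance of bias).
studies = [
    (0.25, 0.010, 0.05, 0.002),
    (0.40, 0.020, 0.15, 0.010),
    (0.10, 0.015, 0.00, 0.001),
]

def bias_adjusted_pool(studies):
    """Pool bias-adjusted study estimates with inverse-variance weights."""
    weights, adjusted = [], []
    for est, var, bias, bias_var in studies:
        adjusted.append(est - bias)             # remove assessed bias
        weights.append(1.0 / (var + bias_var))  # inflate variance by bias uncertainty
    total_w = sum(weights)
    pooled = sum(w * a for w, a in zip(weights, adjusted)) / total_w
    pooled_var = 1.0 / total_w
    return pooled, pooled_var

pooled, pooled_var = bias_adjusted_pool(studies)
```

Studies with larger assessed bias uncertainty receive less weight, so the pooled estimate leans toward the studies judged least biased.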