# Covariate adjustment in subgroup analyses of randomized clinical trials: A propensity score approach

CCovariate adjustment in subgroup analyses of randomized clinical trials: A propensity score approach

Authors:

Siyun Yang (MS), Fan Li (PhD), Laine E. Thomas (PhD), Fan Li (PhD) Correspondence to

Fan Li , Department of Biostatistics, Yale University School of Public Health, New Haven, Connecticut (email: [email protected] ) Author affiliations:

Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina (Siyun Yang, MS; Laine E. Thomas, PhD); Department of Statistical Science, Duke University, Durham, North Carolina (Fan Li , PhD); Department of Biostatistics, Yale University School of Public Health, New Haven, Connecticut (Fan Li , PhD); Duke Clinical Research Institute, Durham, North Carolina (Laine E. Thomas, PhD) Conflict of interest: none declared.

Data availability statement:

The HF-ACTION data that is used as an illustrative example in this study can be accessed through NIH at https://biolincc.nhlbi.nih.gov/studies/hf_action/

Running head:

Propensity score weighting in subgroup analyses of RCTs bbreviations:

ANCOVA: analysis of covariance PS: propensity score OW: overlap weighting IPW: inverse probability weighting RCT: randomized controlled trials HTE: heterogeneous treatment effect S-ATE: Subgroup average treatment effect bstract Background : Subgroup analyses are frequently conducted in randomized clinical trials (RCT) to assess evidence of heterogeneous treatment effect across patient subpopulations. Although randomization balances covariates within subgroups in expectation, chance imbalance may be amplified in small subgroups and adversely impact the precision of subgroup analyses. Covariate adjustment in overall analysis of RCT is often conducted, via either analysis of covariance (ANCOVA) or propensity score weighting, but covariate adjustment for subgroup analysis has been rarely discussed. In this article, we develop propensity score weighting methodology for covariate adjustment to improve the precision and power of subgroup analyses in RCTs.

Methods : We extend the propensity score weighting methodology to subgroup analyses by fitting a logistic regression propensity model with pre-specified covariate-subgroup interactions. We show that overlap weighting exactly balances the covariates with interaction terms in each subgroup. Extensive simulations were performed to compare the operating characteristics of unadjusted estimator, different propensity score weighting estimators and the ANCOVA estimator. We apply these methods to the HF-ACTION trial to evaluate the effect of exercise training on 6-minute walk test in several pre-specified subgroups.

Results : Efficiency of the adjusted estimators is higher than that of the unadjusted estimator. The propensity score weighting estimator is as efficient as ANCOVA, and may be more efficient when subgroup sample size is small (N<125), or when outcome model is mis-specified. The weighting estimators with full-interaction propensity model consistently outperform traditional main-effect propensity model. onclusion : Propensity score weighting serves as a transparent and objective alternative to adjust chance imbalance of important covariates in subgroup analyses of RCTs. It is important to include the full set of covariate-subgroup interactions in the propensity score model.

Keywords : Covariate adjustment; Power; Propensity score; Overlap weighting; Randomized controlled trials; Subgroup analyses. ackground

Subgroup analyses are frequently conducted in randomized clinical trials (RCTs) to assess evidence of heterogeneous treatment effect (HTE) across patient subpopulations. When planned and interpreted properly, subgroup analyses offer valuable information about which population receives the most benefit or may be harmed by an intervention, and can inform important decisions on personalized healthcare. While various data-driven methods have been developed to search for HTE,

1, 2 the vast majority of clinical trials still report confirmatory subgroup analyses of pre-specified subpopulations. A major challenge for these analyses is a lack of power and precision, which is amplified by chance imbalances that frequently occur in small subgroups. Optimizing precision is particularly important for subgroups because precision is already limited by smaller sample sizes and small interaction effects.

4, 5

In this article, we develop an easy-to-use propensity score weighting approach to improve the power of subgroup analyses by eliminating chance imbalances. The standard confirmatory subgroup analysis of an RCT derives treatment effect estimates within pre-specified subgroups and tests for interactions between those subgroups that partition the population. While randomization balances baseline covariates within each subgroup in expectation, chance imbalance is likely to occur within subgroups due to limited sample sizes and multiple opportunities for imbalance. Recent studies have demonstrated that covariate adjustment addresses baseline imbalance and improves the precision for analyzing RCTs with continuous and binary outcomes; and time-to-event outcomes. While methods for covariate adjustment are extensively studied for estimating the overall treatment effect, discussions on covariate adjustment for estimating subgroup average treatment effect or HTE have been limited. For estimating the average treatment effect with continuous outcomes, the two main approaches for covariate adjustment are analysis of covariance (ANCOVA) and propensity score (PS) weighting. Under mild conditions, Yang and Tsiatis, and Tsiatis et al. showed that the ANCOVA estimator with covariate-by-treatment interactions achieves the semiparametric efficiency with continuous outcomes. Shen et al. demonstrated that the inverse propensity score weighting estimator under a logistic regression βworkingβ model is asymptotically equivalent to the efficient ANCOVA estimator. More recently, Zeng et al. generalized this equivalence to a large family of PS weighting estimators, including overlap weighting. Despite the asymptotic equivalence, the two methods can differ in finite sample performance. Moreover, PS weighting has two advantages over ANCOVA: first, it avoids modeling the outcome data and is thus arguably more objective; second, it preserves the marginal treatment effect estimand for non-continuous outcomes. However, to our knowledge, propensity score weighting methodology has not been discussed in the context of covariate adjustment for subgroup analyses in RCTs. This paper fills the gap; we consider two PS weighting methods, inverse probability weighting and overlap weighting.

15, 17-19

We perform extensive simulations to compare the operating characteristics of the PS weighting estimators with ANCOVA, and clarify their relative merits and limitations for subgroup analyses. We demonstrate the application of these approaches by estimating several subgroup average treatment effects in the HF-ACTION study. Causal Estimand and ANCOVA in Subgroup Analysis onsider a two-arm RCT of π individuals. Let the binary treatment π π = 1 if unit π receives the treatment. Under the potential outcome framework, each individual has two potential outcomes, π π (1) and π π (0) , corresponding to the treatment and control condition. Of the two, only the one corresponding to the actual treatment assigned is observed, denoted as π π = π π π π (1) + (1 β π π )π π (0) . Suppose a set of π½ covariates πΏ π = (π π1 , β¦ , π ππ½ ) π are collected at the baseline. We assume that πΏ π is a pre-specified list of covariates to be adjusted in the analysis. We also define a vector of π pre-specified subgroup indicators, πΊ π = (π π1 , β¦ , π ππ ) π . The subgroup variables a priori define the subgroup membership so that π ππ = 1 if unit π belongs to the π π‘β subgroup, and one unit can belong to multiple subgroups simultaneously. For generality, the subgroup indicator πΊ π may or may not depend on the set of πΏ π . The π π‘β subgroup average treatment effect (S-ATE) is defined as: π π = πΈ[π(1) β π(0)|π π = 1]. (1) In RCTs, investigators often wish to test whether there is HTE across subgroups in one-at-a-time fashion for each π . In particular, the group comparisons are defined as π ππ = 0 versus π ππ = 1 , while averaging over the levels of {π π1 , β¦ , π ππ }\{π ππ }. Therefore, the HTE across two subgroup levels can be formalized as πΏ ππ»ππΈ = πΈ[π(1) β π(0)|π π = 1] β πΈ[π(1) β π(0)|π π = 0] (2) Assuming each unit π is randomly assigned to the treatment or the control arm with a constant probability Because randomization balances observed and unobserved covariates in both the overall and subgroup samples in expectation, the within-subgroup unadjusted difference-in-means estimator for S-ATE is unbiased. πΜ ππππ΄π·π½ = β π π π π π ππππ=1 β π π π ππππ=1 β β π π (1 β π π )π ππππ=1 β (1 β π π )π ππππ=1 (3) However, due to the relatively small sample size in each subgroup, there is a higher chance for imbalance to occur within a subgroup, and adjusting for these imbalances may improve precision and power. For estimating the average treatment effect with a continuous outcome, the standard approach fits an analysis of covariance (ANCOVA) model. We extend Yang and Tsiatis, Tsiatis et al. and Linβs approach to estimate the S-ATE, by separately fitting the ANCOVA model within each subgroup to allow for subgroup-specific coefficients. This is equivalent to fitting a full interaction model on the overall sample, whose predictors include πΏ π , treatment indicator, and their interactions with the subgroup variables. We call this the βANCOVA-Sβ model: π π = π½ + π½ πΏ π + π½ πΊ π + π½ π π + π½ ( πΏ π πΊ π ) + π½ ( π π πΊ π ) + π½ ( πΏ π π π ) + π½ ( πΏ π πΊ π π π ) + π π , π = 1, β¦ , π, (4) where π½ = (π½ , π½ , π½ , π½ , π½ , π½ , π½ , π½ ) are regression coefficients, (πΏ π πΊ π ), π π πΊ π , (πΏ π π π ), (πΏ π πΊ π π π ) are vectors of all 2-way and 3-way interactions between the subgroup indicators πΊ π , general covariates πΏ π and treatment assignment π π , and the random error π π is assumed to have mean zero. Denote πΜ π (1) and πΜ π (0) as the predicted values from model (4) when setting π π = 1 and π π = 0 , respectively. We define the βANCOVA-Sβ estimator for S-ATE as πΜ ππ΄ππΆπππ΄ = β (πΜ π (1) β πΜ π (0))π ππππ=1 β π ππππ=1 (5) This estimator, πΜ ππ΄ππΆπππ΄ , is consistent for π π even if the functional forms of πΏ π is incorrectly modeled, due to the similar arguments as in estimating the overall treatment effect.

13, 14 ompared to the πΜ ππππ΄π·π½ in (3), πΜ ππ΄ππΆπππ΄ is more efficient when the adjusted covariates πΏ π are predictive of π π . Variance and confidence intervals of πΜ ππ΄ππΆπππ΄ can either be obtained using the delta method or bootstrap resampling. Propensity Score Weighting in Subgroup Analysis

One limitation of ANCOVA is the potential for inviting a βfishing expeditionβ, that is, one may search for a model specification that provides the largest overall and/or subgroup treatment effects and thus jeopardize the objectivity of statistical analysis.

14, 15

PS weighting is an attractive alternative approach for covariate adjustment;

16, 19 it achieves the same goal of improving precision, but offers better transparency and objectivity by encouraging pre-specification without access to the outcome π π . For subgroup analysis, we define the propensity score, π(πΏ, πΊ) = Pr (π = 1|πΏ, πΊ) , as the conditional probability of receiving treatment given baseline covariates πΏ and subgroup variables πΊ . When the subgroup variables πΊ is an explicit function of πΏ , πΊ can theoretically be omitted from the conditioning set. However, explicitly writing out πΊ in this definition helps clarify the role of propensity scores for subgroup analysis. In RCTs, the treatment allocation process is completely controlled and the true propensity score is a constant for all baseline and subgroup variables (i.e. π(πΏ, πΊ) = π ); this is distinct from observational studies, where the true propensity score is unknown and needs to be estimated from the data. Even so, the use of a propensity score will be helpful for RCTs. i et al. proposed a class of balancing weights to estimate the weighted average treatment effect (WATE), and Yang et al. further extended this to causal subgroup analysis in observational studies. Define β(πΏ, πΊ) to be a tilting function of the true propensity score π(πΏ, πΊ) . Under randomization, different specifications of β(πΏ, πΊ) lead to the same target population, and the subgroup WATE reduces to S-ATE. See Supplementary Material A for a complete definition of the WATE and the tilting function. We extend their work and propose the HΓ‘jek estimator for estimating S-ATE in RCTs: πΜ πβ = β π π π π π€ π1 π ππππ=1 β π π π€ π1 π ππππ=1 β β π π (1βπ π )π€ π0 π ππππ=1 β (1βπ π )π€ π0 π ππππ=1 , (6) where π€ π1 = β(πΏ π , πΊ π )/π(πΏ π , πΊ π ) for treated units and π€ π0 = β(πΏ π , πΊ π )/[1 β π(πΏ π , πΊ π )] for control units. When β(πΏ π , πΊ π ) = 1 , the balancing weights correspond to the inverse probability weights (IPW): (π€ π1 , π€ π0 ) = ( π ,πΊ π ) , π ,πΊ π ) ) ; when β(πΏ π , πΊ π ) = π(πΏ π , πΊ π )[1 β π(πΏ π , πΊ π )] , the balancing weights become the overlap weights (OW): (π€ π1 , π€ π0 ) = (1 βπ(πΏ π , πΊ π ), π(πΏ π , πΊ π ) ). Both IPW and OW estimators are unbiased for S-ATE due to randomization.

14, 23

Below we focus on these two weighting schemes. To implement PS weighting method for subgroup analysis in RCTs, one could fit a βworkingβ PS model. If the working model is specified by logistic regression, πΜ(πΏ π , πΊ π ) = πππππ‘ β1 (πΌΜ +πΏ ππ πΆΜ πΏ + πΊ ππ πΆΜ πΊ + (πΏ π πΊ π ) π πΆΜ πΏπΊ ) , where (πΌΜ , πΆΜ πΏπ» , πΆΜ πΊπ» , πΆΜ πΏπΊπ» ) π are the maximum likelihood estimators, Yang et al. showed that OW balances the covariates with interaction terms in each subgroup as well as the overall sample, β π π π ππ π ππππ=1 π€Μ π1 β β (1 β π π )π ππ π ππππ=1 π€Μ π0 = 0 , (7) for each π = 1, β¦ , π , and π = 1, β¦ , π. This property has important practical implications in subgroup analyses. For any baseline covariate-subgroup interaction included in the PS model, the ssociated chance imbalance in the subgroup sample and in the overall sample of a specific randomized trial vanishes once the OWs are applied. Moreover, the exact balance property can translate into better efficiency in estimating S-ATE with a finite sample. We illustrate the intuition of potential efficiency gain of OW in the Supplementary Material B. In contrast, the exact balance property (7) does not generally hold for IPW in finite-samples. Extending the work of Zeng et al. to subgroups, we show that the class of PS weighting estimators with β(πΏ π , πΊ π ) chosen as a smooth function of π(πΏ π , πΊ π ) is asymptotically equivalent to the βANCOVA-Sβ estimator πΜ ππ΄ππΆπππ΄ for estimating S-ATE in RCTs. In RCTs, the consistency of the πΜ πβ in (6) does not depend on the specification of the βworkingβ PS model. This is because any βworkingβ PS model leads to a consistent estimate of π(πΏ π , πΊ π ) , which equals the randomization probability in a specific RCT. Shen et al. and Williamson et al. have provided analytical arguments to show the variance reduction properties of IPW for estimating the ATE in RCTs. Similar results are obtained for OW, but showed improved finite-sample performance of OW over IPW. By similar arguments, we can show that the PS weighting estimator improves the precision of the unadjusted estimator, πΜ ππππ΄π·π½ , if the PS model adjusts for important prognostic covariates. Variance and confidence intervals of the πΜ πβ can be obtained either via M-estimation-based sandwich estimator or bootstrap resampling. Simulation Study

We consider a variety of simulation scenarios to compare the performance of the unadjusted estimator, the ANCOVA-S, the IPW and OW estimators with different PS model specifications. We evaluate these methods based on percent bias, relative efficiency, coverage rate, type I error ate and power for the associated hypothesis testing. Consistent with the recent RCTs published in top medical journals, we generate

N β {250, 500, 750, 1000} patients to represent small, moderate and large trials. In each data set, subgroup indicators πΊ π = (π π1 , β¦ , π ππ ) π are independently drawn from Bernoulli(0.25). We generate π½ independent additional baseline covariates πΏ π = (π π1 , β¦ , π ππ½ ) π , half of which are continuous, and the other half are binary. The continuous covariates are independently drawn from the standard normal distribution , and the binary covariates are drawn from Bernoulli(0.3). The treatment assignment π π is generated from Bernoulli(0.5). We simulate from two outcome models. In outcome model 1, the continuous outcome π π is generated from a normal distribution with mean πΈ(π|π, πΊ, πΏ ) specified below and standard deviation π π¦ = πΈ(π|π, πΊ, πΏ ) = π½ + π½ πΏ + π½ πΊ + π½ π + π½ (πΏπΊ) + π½ (ππΊ)+ π½ (πΏπ) + π½ (πΏπΊπ). (8) In outcome model 2, the continuous outcome π π additionally depend on the (π½ β 1) interactions between the pairs of covariates with consecutive indices, denoted by πΏ πππ‘ =(π π , β¦ , π πβ1 π π ) π : πΈ(π|π, πΊ, πΏ ) = π½ + π½ πΏ + π½ πΊ + π½ π + π½ (πΏπΊ) + π½ (ππΊ)+ π½ (πΏπ) + π½ (πΏπΊπ) + π½ πΏ πππ‘ . (9) Details of parameter specifications in the two outcome models are given in Supplementary Material C. For PS weighting estimators, the PS model specification is a key consideration. Since including subgroup-covariate interactions in the PS model leads to improved balance of covariates within subgroups, and is necessary to achieve exact subgroup covariate balance via OW, we consider wo types of logistic propensity score models. The first is the main-effects logistic model, which includes the subgroup indicator and the covariates main effects; the corresponding design matrix is ( πΏ, πΊ ). This model represents the standard practice in the literature. The second is a saturated PS model, which, besides the subgroup variables and the main effects, also includes their 2-way interactions; the corresponding design matrix is ( πΏ, πΊ , πΏπΊ) . Given the estimated propensity score, we use the HΓ‘jek-type estimator (6) for both IPW and OW. For the ANCOVA, we fit the correctly specified outcome model (8) and estimate S-ATE using πΜ ππ΄ππΆπππ΄ in (5). When the outcomes are generated from equation (9), the πΜ ππ΄ππΆπππ΄ omitting the additional consecutive covariate interactions is misspecified. Although technically the PS model is always correctly specified in RCTs, we still refer to this scenario as βmodel misspecificationβ henceforth for ease of reference. We consider four simulation scenarios, summarized in Table 1. Throughout the simulation scenarios, all estimators for S-ATE are essentially unbiased. Therefore, we focus our discussion on the relative efficiency, coverage rate, type I error rate and power. Additional details of each scenario, together with variance estimation and evaluation metrics are provided in the Supplementary Material C. Scenario 1

Scenario 1 assumes that the treatment effect is additive and homogeneous within subgroups, and the subgroup variable π represents a high-risk co-morbidity. We set π = 8 to represent the most common number of covariates included in analysis, and vary (π½ , π ) β{(0,0), (β1, 0.5)}, representing the treatment effects are null or subgroup-treatment interaction is half as the main treatment effect. Next, we set π½ , π½ to be vector of zeros, representing homogeneous treatment effect within subgroups. hen testing the null hypothesis of no treatment effect or no HTE across subgroup levels, the rejection rates under the null hypothesis and coverage rates under the alternative are close to the nominal 5% and 95% levels (Figure 1(a) and Supplementary Table 2). In Figure 2(a), the efficiency of all adjusted estimators is higher than the unadjusted estimator. Specifically, the efficiency of PS weighting with full interaction model has the same efficiency as ANCOVA. When subgroup sample size is relatively small (N<125), the OW estimator with the propensity scores estimated from the full interaction model (OW-Full) leads to higher eο¬ciency compared to ANCOVA and the IPW estimator with a full-interaction propensity model (IPW-Full). In contrast, when the propensity scores are estimated from the main-effect model, both IPW (IPW-Main) and OW (OW-Main) perform poorly in estimating S-ATE. Figure 3(a) displays the results on power analysis. For testing S-ATE in the small subgroup ( π = 1 ), all methods are comparable, but ANCOVA and IPW-Full have smaller power when sample size is less than 100. For testing the HTE, OW-Full leads to higher power when subgroup sample size is less than 125.

Scenario 2

Scenario 2 differs from scenario 1 by allowing for HTE within subgroups. To do this, we fix (π½ , π ) = (β1, 0.5), Ξ² = 0.5 Γ π , and π½ = 0.25 Γ π , where π€ is a π -vector of ones. The choice of π½ represents the interaction effect size is half as the subgroup-treatment effect π . The coverage rates of all estimators are close to the nominal level (Supplementary Table 3). Similar to Scenario 1, when subgroup sample size is less than 125, the efficiency of OW-Full is comparable to the ANCOVA, followed by the IPW-Full. The IPW-Main and OW-Main perform equally poor (Supplementary Figure 1(a)). The result of power confirms this trend. OW-Full and ANCOVA estimator have consistently higher power compared to other estimators, followed by he IPW-Full and main effect estimators. The unadjusted estimator has the lowest power throughout (Supplementary Figure 1(b)). Scenario 3

We explore the performance of various estimators under βoutcome misspecificationβ. The outcomes are generated from model (9) with π½ = βπ π¦ /(π½ β 1) (1,1,1,1,1,1,1) , representing effects of the interactions. Other regression coefficients remain the same as in Scenario 1. Again, the rejection rates under the null hypothesis and coverage rates under the alternative are close to the nominal levels (Figure 1(b) and Supplementary Table 4). Figure 2(b) demonstrates that when subgroup sample size is less than 125, OW-Full and ANCOVA estimator are comparable, both outperforming IPW-Full estimator, and the main effect estimators. OW-Full becomes slightly more eο¬cient than ANCOVA when the outcome model is misspeciο¬ed. Figure 3(b) shows when testing S-ATE in the small subgroup ( π = 1 ), OW-Full estimator has the highest power. ANCOVA and IPW-Full have even lower power than unadjusted and the main effect estimators when subgroup sample size is less than 100, but outperform them when subgroup sample size is larger than 125. For testing the HTE, OW-Full estimator leads to consistently higher power than all other estimators.

Scenario 4

In Scenario 4, we consider

π = 2 pre-specified subgroup indicators (namely 4 subgroups), which represents the common number of subgroups being analyzed in healthcare clinical trials. Our findings suggest the family-wise error rate is inflated if no adjustment is considered for the significant threshold. The extent of inflation is similar between ANCOVA, IPW-Full and OW-Full estimators. Therefore, multiple testing adjustment such as the Bonferroni correction should e considered to maintain the nominal family-wise error rate. Full description and results of Scenario 4 is provided in the Supplementary Material C and Supplementary Figure 2, 3(a).

Summary of Simulation Findings

In summary, across all the scenarios, efficiency of the adjusted estimators is higher than that of the unadjusted estimator. For the weighting estimators, the full-interaction propensity model consistently outperforms the main-effect propensity model. For large sample size, OW and IPW perform similarly, but OW is more efficient than OW in small samples (e.g. N<125). When the ANCOVA model is correctly specified, weighting methods with full-interaction propensity model is as efficient as ANCOVA, and is even more efficient when paired with OW under small subgroup sample size (N<125). When the ANCOVA outcome model is mis-specified, OW-Full outperforms the ANCOVA estimator. Lastly, family-wise error rate is inflated for all methods when multiple testing is present in subgroup analyses and adjustment may be considered. We also investigate scenarios with multiple subgroups that have HTE within subgroups, misspecified outcome model, and both HTE and misspecified outcome model (Supplementary Table 1). The results on relative efficiency and power are similar to that of the single subgroup Scenarios, and results on family-wise error rate are similar to that of Scenario 4 (Supplementary Figure 4-6).

Application to the HF-ACTION Trial

The Heart Failure: A Controlled Trial Investigating Outcomes of Exercise Training (HF-ACTION) is a patient-level randomized trial that evaluated the efficacy and safety of an exercise raining program in chronic heart failure and reduced ejection fraction patients (N=2331). We limited the HF-ACTION cohort to the 2175 (93.3%) patients who were either white or black, among whom 1724 (79.3%) patients had the 6-minute walk distance outcome measured at 3 months. The investigators pre-specified subgroups (prior to databaseline lock) based on 10 baseline covariates. Figure 4 shows a Connect-S plot of representative subgroups that have imbalanced covariate between treatment arms. While the traditional IPW-Main results in absolute standardized mean difference (ASMD) greater than 0.25 in subgroups defined by mitral regurgitation grade, angina classification, ventricular conduction and π½ -blocker usage, the OW-Full achieves exact balance in all subgroups. For illustration, we estimate the S-ATE in these imbalanced subgroups one at a time, and consider adjusting four baseline covariates: baseline 6-minute walk distance, age, race, and duration of the cardiopulmonary exercise test. Although the data were complete for most candidate covariates, <1% patients are missing baseline 6-minute walk distance or cardiopulmonary exercise test. The methods of multiple imputation were used to create 10 complete data sets using R package mice . With each imputed data set, ANCOVA-S model includes treatment indicator, subgroup variable, baseline covariates, and all 2-way interactions. To estimate the propensity scores, we fit the saturated logistic model, which includes the subgroup variable, baseline covariates, and their full set of interactions. Within-imputation variances are estimated via 1000 bootstrap samples. The final point and variance estimates are obtained by the Rubinβs rule. Figure 5 presents the S-ATE estimates, standard errors (SEs), and 95% conο¬dence intervals (CIs) of unadjusted, IPW-Full, OW-Full, and ANCOVA-S. In all subgroups, SE from the unadjusted estimator is larger than the adjusted estimators, and three adjusted estimators have comparable Es. The point estimate from the unadjusted estimator is slightly larger in most subgroups. It is worth mentioning that for testing S-ATE in four subgroups (delayed intraventricular conduction, angina class I, angina class II-IV and no π½ -blocker usage), the difference between methods may alter the conclusion about subgroup treatment effect because the unadjusted estimate includes the null, while the adjusted methods do not. Specifically, results from unadjusted estimate suggest that there is no statistically significant effect of exercise training on 6-minute walk distance at 3 months in these subgroups. In contrast, results from the adjusted methods suggest that treated patients on average walk 25, 23, 31, 22 meters longer in 6-minute walk distance than control patients in these four subgroups, separately. This discrepancy highlights the potentially significant efficiency gain from covariate adjustment in subgroup analysis. Discussion

In this paper, we introduce two PS weighting methods (IPW and OW) for covariate adjustment in subgroup analyses of RCTs and compare them with ANCOVA. Conceptually, weighting methods are more transparent and objective than ANCOVA because they avoid modeling the outcomes. Through extensive simulations, we found that for PS weighting it is crucial to include covariates predictive of the outcome, subgroup variables and the full set of their interactions in a logistic PS model to remove subgroup covariate imbalance and achieve higher power. It is also important to include the interaction terms in the PS model, otherwise the weighting methods may lead to lower power compared to ANCOVA. When the sample size is large, OW and IPW perform similarly, but OW consistently outperform IPW in terms of power with small sample size (e.g. N<125). Finally, OW with propensity from saturated logistic model outperforms the ANCOVA estimator under small subgroup sample size (e.g. N< 125) and/or when the ANCOVA odel is misspecified. This advantage of OW is largely due to its exact balance property when the logistic PS model includes covariate-subgroup interactions. As with all studies that involve multiple testing, the family-wise error rate is inflated similarly for ANCOVA, IPW and OW. Typical methods such as Bonferroni correction for multiple testing may be employed to address this.

An attractive feature of the PS weighting approach is that the propensity score specification does not involve outcomes, and thus separates design from analysis. This avoids the βfishing expeditionβ of selecting an outcome model specification that results in the most dramatic treatment effect size, as is possible in ANCOVA. This also helps with model diagnostics, allowing balance assessment. The properties of OW have been studied in both observational studies and clinical trials, theoretically and through simulations. Because the sample size is usually limited in subgroup, the exact balance property of OW emphasizes the design matrix of a logistic regression PS model should be (

πΏ, πΊ, πΏπΊ ) and guarantees the important covariates are balanced within subgroups, which translates into improved power in estimating S-ATE. In practice, choice of covariates and subgroup variables should be pre-specified based on clinical knowledge. Recent development in machine learning also potentially facilitates this process by searching for covariates that give large HTE across subgroup levels, perhaps through secondary analysis of existing databases.

In addition, one should weigh the benefit of uncovering a finer resolution of treatment effect heterogeneity with the potential risk of over-fitting with a finite sample size. As the number of subgroup variables and covariates increases, the number of all possible pairwise interactions would grow exponentially, leading to model non-convergence and unstable propensity score estimates. Therefore, in the HF-ACTION example, we construct eparate PS models for each subgroup variable one at a time, which could maintain the validity of subgroup analyses as well as avoiding over-complicated PS model. There are several possible extensions of our study. First, although we focused on a continuous outcome, the PS weighting estimators are directly applicable to RCTs with binary and categorical outcomes. In this case, the PS weighting methods preserve the definition of subgroup estimands averaged over pre-treatment covariates within the subgroup and may be preferred, for the same reason that weighting is preferred in estimating the overall ATE in RCTs.

15, 16, 19

Second, we have only examined up to two subgroup variables in our simulation study; we conjecture that the patterns with more subgroup variables are generally similar. However, it is desirable to check whether there is a difference between constructing separate PS models for each subgroup variable one at a time, or applying a variable selection procedure to choose from the full set of subgroup-covariate interactions in a PS model.

References

1. Assmann SF, Pocock SJ, Enos LE, et al. Subgroup analysis and other (mis)uses of baseline data in clinical trials.

The Lancet

Annals of internal medicine

Journal of Clinical Epidemiology

BMC medical research methodology

Biometrics

Journal of clinical epidemiology

Journal of Clinical Epidemiology

Statistics in medicine

Clinical trials (London, England)

The American Statistician

Statistics in Medicine

Statistics in medicine

Econometrica

Epidemiology (Cambridge, Mass)

Statistics in medicine

JAMA : the journal of the American Medical Association

Statistical science

Trials

The Annals of Applied Statistics

Biometrika

Statistical methods in medical research arXiv preprint arXiv:201002121

Journal of the American Statistical Association

The annals of applied statistics

American journal of epidemiology

Statistics in medicine

The American statistician

Multiple imputation for nonresponse in surveys . John Wiley & Sons, 2004. 33. Rubin DB. For objective causal inference, design trumps analysis.

The annals of applied statistics

The American economic review

The annals of applied statistics

Annals of translational medicine

Statistical methods in medical research a) (b)

Figure 1.

The Type I error rate of different estimators for testing no treatment effect of S-ATEs and HTE across subgroups under (a) Scenario 1; (b) Scenario 3. a) (b)

Figure 2.

The relative efficiency of adjusted estimators relative to unadjusted estimator for estimating S-ATEs and HTE across subgroups when π½ = β1, π = 0.5 under (a) Scenario 1, Ξ² = π½ = πββ , and model is correctly specified; (b) Scenario 3, Ξ² = π½ = π, and model is misspecified. (a) (b) Figure 3.

The power of different estimators for estimating S-ATEs and HTE across subgroups when π½ = β1, π = 0.5 under (a) Scenario 1, Ξ² = π½ = π, and model is correctly specified; (b) Scenario 3, Ξ² = π½ = π, and model is misspecified. Figure 4.

The Connect-S plot of subgroup ASMD in HF-ACTION trial. NYHA indicates New York Heart Association; MRgrade, mitral regurgitation grade; Angina, Canadian Cardiovascular Society Angina class; afib/flut: atrial fibrillation or atrial flutter; Vent Conduction, ventricular conduction; VCD, intraventricular conduction delay ; LBBB, left bundle branch block; RBBB, right bundle branch block; 6MWD: 6-minute walk distance.

Figure 5.

Subgroup average treatment effect (S-ATE) estimates on 6-minute walk distance at 3 months from the HF-ACTION study. able 1.

Summary of simulation scenarios.