Causal inference methods for small non-randomized studies: Methods and an application in COVID-19
Sarah Friedrich and Tim Friede
Department of Medical Statistics, University Medical Center Göttingen, Germany∗
[email protected], [email protected]
August 3, 2020
Abstract
The usual development cycles are too slow for the development of vaccines, diagnostics and treatments in pandemics such as the ongoing SARS-CoV-2 pandemic. Given the pressure in such a situation, there is a risk that findings of early clinical trials are overinterpreted despite their limitations in terms of size and design. Motivated by a non-randomized open-label study investigating the efficacy of hydroxychloroquine in patients with COVID-19, we describe in a unified fashion various alternative approaches to the analysis of non-randomized studies and apply them to the example study, exploring the question whether different methods might have led to different conclusions. A widely used class of tools to reduce the impact of treatment-selection bias are the so-called propensity score (PS) methods. Conditioning on the propensity score allows one to replicate the design of a randomized controlled trial, conditional on observed covariates. Extensions include the doubly robust g-computation, which is less frequently applied, in particular in clinical studies. Here, we investigate the properties of propensity score based methods, including g-computation, in small sample settings typical for early trials in a simulation study. We conclude that the doubly robust g-computation has some desirable properties and should be applied more frequently in clinical research. In the hydroxychloroquine study, g-computation resulted in a very wide confidence interval, indicating much uncertainty. We speculate that application of the method might have prevented some of the hype surrounding hydroxychloroquine in the early stages of the SARS-CoV-2 pandemic. R code for the g-computation is provided.
Keywords:
COVID-19; Causal Inference; Propensity Score; Small samples

∗ Humboldtallee 32, 37073 Göttingen, Germany

1 Introduction
Pandemic situations such as the currently ongoing SARS-CoV-2 pandemic require the fast development of diagnostics, vaccines and treatments. As the usual development programs are too long in these situations, more efficient development pathways are sought. These include more innovative approaches such as platform trials and adaptive designs [34]. Furthermore, in situations of desperate medical need such as with COVID-19, early clinical trials might receive more attention than they normally would. In March 2020, for instance, [17] published a report of a small open-label non-randomized controlled study suggesting that “hydroxychloroquine treatment is significantly associated with viral load reduction/disappearance in COVID-19 patients”. Although typically not much notice would have been taken of such a small-scale study with its methodological limitations, the treatment was hailed “a game changer” by the US president, putting pressure on the regulatory authorities to license the drug for COVID-19 [27].

In particular when a lot of importance is placed on non-randomized studies, their analyses and interpretation must be robust. Non-randomized studies might be prone to bias due to confounding. Besides covariate adjustment in regression models, a wide range of methods has been proposed to deal with this. A widely used class of tools to reduce the impact of treatment-selection bias in observational data are the so-called propensity score (PS) methods. The propensity score is defined as a participant's probability of receiving treatment given the observed covariates [28, 29]. Conditioning on the propensity score allows one to replicate the design of a randomized controlled trial, conditional on observed covariates. Extensions include the doubly robust g-computation [18, 25], which is less frequently applied, in particular in clinical studies. Doubly robust refers here to the property that it is sufficient that either the outcome or the propensity score model is correctly specified.
Hence, g-computation does not rely on correct specification of both models.

[17] did not apply any of these methods for non-randomized studies, but analyzed the trial as if it was randomized. Here, we describe in a unified fashion various alternative approaches and apply them to the data by [17]. We explore the question whether different methods might have led to different conclusions. New evidence has emerged in the meanwhile and we now know that hydroxychloroquine is not an appropriate therapy in COVID-19 [12, 30]. Here, we wonder whether a more appropriate analysis of the study by [17] could have prevented much of the hype and as a result might have saved some resources.

In the context of the analysis of clinical registries and routine data, including electronic health records, some of the methods described above have been widely applied and their characteristics explored in simulation studies. Given the applications, simulation experiments naturally considered large-scale data sets [7]. Here, we investigate the properties of propensity score based methods, including g-computation, in small sample settings typical for early trials in a simulation study. The doubly robust g-computation has some desirable properties, as the simulations will demonstrate, but has so far not gained the attention it deserves in clinical applications.

The manuscript is organized as follows. In Section 2 we introduce the study by [17], which motivated our investigations, in more detail. In Section 3 several approaches to the analysis of non-randomized trials are described and applied to the motivating example. Their properties are assessed in a simulation study, in particular in the setting of small sample sizes, in Section 4. We close with a brief discussion of the findings and the limitations of our study (Section 5).
2 Motivating example in COVID-19

[17] conducted an open-label non-randomized study investigating the efficacy and safety of hydroxychloroquine in addition to standard of care in comparison to standard of care alone. The patients in the hydroxychloroquine group were all from the coordinating centre, whereas the controls were recruited from several centres including the coordinating centre. In the coordinating centre, however, only those patients refusing therapy with hydroxychloroquine were included as controls. A total of 36 patients were included in the analyses, 20 patients receiving hydroxychloroquine and 16 control patients. Out of the 20 patients on hydroxychloroquine, 6 patients also received azithromycin. For the purpose of illustration, we only consider two treatment groups, i.e. with and without hydroxychloroquine. The primary outcome was virological clearance at Day 6 (with Day 0 being baseline). The individual participant data of the study are reported in Supplementary Table 1 of [17]. The variables included in the table are the patient's age, sex, clinical status (asymptomatic, upper respiratory infection or lower respiratory infection), duration of symptoms, and results of daily PCR testing for Days 0 to 6. [17] report virological cure at Day 6 for 14 out of 20 patients treated with hydroxychloroquine and for 2 out of 16 in the control group, a statistically significant difference in an analysis not adjusted for any covariates.

The study by [17] has not been without criticism, mainly due to its limitations in design, including the small sample size, choice of control patients, open-label treatment and study discontinuations [1]. Although some preclinical data suggested potentially beneficial effects [13], there were also some early warnings regarding potentially harmful effects [15].
In the meanwhile, data from large-scale randomized controlled trials are available suggesting that hydroxychloroquine is not suitable for post-exposure prophylaxis for or the treatment of COVID-19 [11, 12, 19]. The timeline of events is nicely depicted in Figure 1 of a review by [30].

3 Methods

We consider a binary outcome Y as well as a binary treatment A (1: experimental treatment, 0: control) and a vector of observed covariates L. In a randomized controlled trial, one would assume that due to randomization, the influence of the covariates L is the same for treated and control patients. Thus, the effect of interest in this situation would be the odds ratio

\[ \mathrm{OR} = \frac{P(Y=1 \mid A=1)/P(Y=0 \mid A=1)}{P(Y=1 \mid A=0)/P(Y=0 \mid A=0)}, \tag{3.1} \]

i.e. the ratio of the odds of having the outcome under treatment and the odds of experiencing the outcome in the control group. One way to model this is using a logistic regression model, i.e. we model the probabilities in (3.1) as

\[ P(Y=1 \mid a) = \frac{\exp(\beta_0 + \beta_1 a)}{1 + \exp(\beta_0 + \beta_1 a)}, \tag{3.2} \]

yielding an estimator of the odds ratio \(\widehat{\mathrm{OR}} = \exp(\hat\beta_1)\).

In observational studies, where allocation of the treatment is not in the hand of the investigator, this direct comparison of the treatments may no longer be fair due to the influence of other confounding factors, i.e., the distribution of the other risk factors L may differ between treated and controls. In order to imitate an RCT and to obtain valid estimates in this situation, one has to consider the so-called potential or counterfactual outcomes [18]: Let \(Y^{a=1}\) denote the outcome that would have been observed under treatment value \(a = 1\), and \(Y^{a=0}\) the outcome that would have been observed under control (\(a = 0\)). A causal effect is now defined as follows: we say that A has a causal effect on Y if \(Y^{a=1} \neq Y^{a=0}\) for an individual.
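As a quick illustration of model (3.2), the crude odds ratio for the motivating example can be reproduced in R from the aggregate counts reported in Section 2. The individual-level vectors below are a reconstruction for illustration only, not the authors' data set:

```r
# Unadjusted ("crude") odds ratio as in model (3.2), reconstructed from
# the aggregate figures: 14/20 cured under hydroxychloroquine, 2/16 cured
# under control.
y <- c(rep(1, 14), rep(0, 6),    # hydroxychloroquine group
       rep(1, 2),  rep(0, 14))   # control group
a <- c(rep(1, 20), rep(0, 16))
fit <- glm(y ~ a, family = binomial)
exp(coef(fit)["a"])  # equals the sample odds ratio (14/6)/(2/14) = 16.33
```

With a single binary covariate, the logistic regression estimator reproduces the empirical 2x2-table odds ratio exactly.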
In practice, however, only one of these outcomes is observed for an individual under study. Therefore, we can only ever estimate an average causal effect, which is present if \(P(Y^{a=1}=1) \neq P(Y^{a=0}=1)\), i.e. the probability of the outcome under treatment is different from that under control, in the population of interest [18]. In our situation with a binary outcome and a binary treatment, we thus consider the causal odds ratio

\[ \mathrm{OR}_c := \frac{P(Y^{a=1}=1)/(1 - P(Y^{a=1}=1))}{P(Y^{a=0}=1)/(1 - P(Y^{a=0}=1))} \tag{3.3} \]

as our primary outcome measure. Different approaches have been proposed to estimate \(\mathrm{OR}_c\) and we will discuss the most commonly used ones in the following.

Outcome regression

The conventional method to correct for baseline differences between groups is adjusting for all relevant patient characteristics in the outcome regression model. Thus, (3.1) becomes

\[ \mathrm{OR} = \frac{P(Y=1 \mid A=1, L_1, \ldots, L_p)/P(Y=0 \mid A=1, L_1, \ldots, L_p)}{P(Y=1 \mid A=0, L_1, \ldots, L_p)/P(Y=0 \mid A=0, L_1, \ldots, L_p)} \]

for a p-dimensional vector of baseline covariates \(L = (L_1, \ldots, L_p)\). The model for these probabilities then becomes

\[ P(Y=1 \mid a, L) = \frac{\exp(\beta_0 + \beta_1 a + \beta_2 \ell_1 + \cdots + \beta_{p+1} \ell_p)}{1 + \exp(\beta_0 + \beta_1 a + \beta_2 \ell_1 + \cdots + \beta_{p+1} \ell_p)}, \tag{3.4} \]

where \(\ell_1, \ldots, \ell_p\) denote the observed values of the covariates \(L_1, \ldots, L_p\).

In our data example, adjusting for all baseline covariates in the outcome model was not favorable due to the small number of subjects in the two groups. For comparison, we therefore included the special case of a simple logistic regression model with treatment assignment as the only covariate, i.e. model (3.2).

Propensity score methods

Several different methods have been proposed to estimate \(\mathrm{OR}_c\) in the literature, see e.g. [18] for a concise introduction. Many of these methods are based on the propensity score. The propensity score of individual i is defined as \(\hat{p}_i := \hat{P}(A_i = 1 \mid L_i)\),
i.e., the estimated probability of receiving treatment given the covariates. For all methods considered in this paper, we estimate the propensity score using a logistic regression model for treatment allocation based on all observed covariates, i.e.

\[ P(A=1 \mid L) = \frac{\exp(\beta_0 + \beta_1 \ell_1 + \cdots + \beta_p \ell_p)}{1 + \exp(\beta_0 + \beta_1 \ell_1 + \cdots + \beta_p \ell_p)}. \]
In a practical data analysis, there are several possibilities for taking the propensity score into account. We will describe the most common methods in the following and apply them to the data example.
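The propensity score model above corresponds to a standard logistic regression in R. A minimal sketch, assuming a data frame `dat` with illustrative variable names (not the authors' code):

```r
# Logistic regression of treatment on all observed baseline covariates;
# the fitted values are the estimated propensity scores \hat{p}_i.
ps_model <- glm(treatment ~ sex + age + status + time_onset,
                data = dat, family = binomial)
ps <- fitted(ps_model)
summary(ps)  # inspect overlap: values close to 0 or 1 signal instability
```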
PS covariate adjustment
According to [36] and [32], covariate adjustment using the propensity score was the most commonly used PS method in the clinical literature. In this approach, the outcome Y is regressed on the estimated propensity score \(\hat p\) and the treatment exposure A. The estimated treatment effect \(\widehat{\mathrm{OR}}_c\) is the odds ratio for treatment exposure obtained from this logistic regression, that is, we model

\[ P(Y=1 \mid A, \hat p) = \frac{\exp(\beta_0 + \beta_1 a + \beta_2 \hat p)}{1 + \exp(\beta_0 + \beta_1 a + \beta_2 \hat p)} \]

and an estimator of the causal odds ratio is given by \(\exp(\hat\beta_1)\).
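PS covariate adjustment can be sketched in two lines of R, again with illustrative variable names:

```r
# PS covariate adjustment: regress the outcome on treatment and the
# estimated propensity score; exp(beta_1) estimates the causal OR.
ps  <- fitted(glm(treatment ~ sex + age + status + time_onset,
                  data = dat, family = binomial))
fit <- glm(outcome ~ treatment + ps, data = dat, family = binomial)
exp(coef(fit)["treatment"])
```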
Matching on the propensity score

Another possibility to balance treatment allocation is to match subjects on the propensity score. The idea is to find individuals with a similar propensity score in the treatment and the control group. There are various methods to match individuals. Particularly in small-sample studies, it is impossible in practice to find exact matches. Thus, one needs to define an acceptable difference between the propensity scores of treated individuals and controls that will be used for matching. These differences are called calipers and should be small enough to allow for “a practical but meaningful equation of pairs” [2]. Following the recently published recommendations by [3], who investigated propensity matching in small sample sizes, we performed a 1:1 nearest neighbor matching without replacement on the logit of the propensity score using calipers with a maximum width of 0.2 standard deviations. In this modification of classical nearest neighbor matching, subjects are only matched if the absolute difference of their propensity scores is within the pre-specified caliper distance [7]. This distance is usually defined as a proportion of the standard deviation of the logit of the propensity score. In R, this can e.g. be performed using the MatchIt package, where the PS model, the method used for matching and the caliper can be specified. A caliper of 0.2 avoids matching dissimilar individuals. Note, however, that this setting differs from the default setting in R, where the caliper is set to 0.

When it comes to analyzing the matched data set, recommendations as to whether a matched-pair analysis is required or not are not entirely clear, see e.g. [6, 9, 14, 35] for discussions of this point. Thus, we compared three different approaches to analyze the matched data set:

1. Match unadjusted: A logistic regression model for the outcome conditional on treatment exposure as in (3.2) was implemented. This method does not account for the matched pairs.

2. Match conditional: A conditional logistic regression model accounting for the matched pairs was implemented. This is achieved by the clogit function in R.

3. Match GEE: The logistic regression model was estimated using generalized estimating equations (GEE), which allow for specification of the matched pairs and estimation of robust standard errors. This approach was implemented using the geepack package in R. There are different ways to specify the correlation structure, e.g. using an exchangeable correlation matrix.

Note that since we match individuals without replacement, the matched data set will usually be smaller than the original study, sometimes even discarding treated individuals.
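The matching step and the three analysis approaches can be sketched as follows. This is a sketch assuming MatchIt version 4 syntax and illustrative variable names, not the authors' code:

```r
library(MatchIt)
library(survival)  # clogit
library(geepack)   # geeglm

# 1:1 nearest-neighbour matching without replacement on the logit of the
# propensity score, caliper = 0.2 standard deviations.
m <- matchit(treatment ~ sex + age + status + time_onset, data = dat,
             method = "nearest", distance = "glm", link = "linear.logit",
             caliper = 0.2, replace = FALSE)
md <- match.data(m)
md <- md[order(md$subclass), ]  # group matched pairs for the GEE fit

# (1) Match unadjusted: ignores the pairing
fit1 <- glm(outcome ~ treatment, data = md, family = binomial)
# (2) Match conditional: conditional logistic regression on the pairs
fit2 <- clogit(outcome ~ treatment + strata(subclass), data = md)
# (3) Match GEE: robust standard errors accounting for the pairs
fit3 <- geeglm(outcome ~ treatment, id = subclass, data = md,
               family = binomial, corstr = "exchangeable")
```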
Inverse probability of treatment weighting (IPTW)

Inverse probability weighting uses the whole data set, but weights each individual with his or her (inverse) probability of receiving the treatment actually given. This way, it generates a pseudo-population with (almost) perfect covariate balance between treatment groups. More specifically, IPTW assigns weight \(w_i = 1/\hat p_i\) to treated subjects and weight \(w_i = 1/(1 - \hat p_i)\) to controls. The resulting pseudo-population is analyzed using weighted logistic regression with robust standard errors obtained from the sandwich package in R. Note that these standard errors could also be obtained by fitting a GEE model instead. In contrast to the GEE fitted for the matched sample above, however, we do not have clustered data here.

G-computation

The fourth possibility to account for covariate imbalance that we consider is g-computation [18, 25], also known as the parametric g-formula or direct standardization. The idea is that the marginal counterfactual risk can be written as

\[ P(Y^a = 1) = \sum_{\ell} P(Y^a = 1 \mid L = \ell)\, P(L = \ell) = \sum_{\ell} P(Y = 1 \mid L = \ell, A = a)\, P(L = \ell). \]

Here, the sum is over all values \(\ell\) of the confounder(s) L that occur in the population. The right-hand side of this equation can now be estimated using the available data on Y, A and L. More precisely, we have to predict the outcome for every person i in the population assuming

1. i was treated,
2. i was a control,

irrespective of the treatment actually received. Thus, we first fit a so-called Q-model to the data relating the outcome Y to the exposure A and to the confounders L. For a binary outcome as in our situation, this is usually a logistic regression model as in (3.4). Based on the Q-model, we then predict \(\hat P(Y=1 \mid L=\ell, A=1)\) and \(\hat P(Y=1 \mid L=\ell, A=0)\) for all individuals by artificially creating two new data sets: one where \(A = 1\) for all individuals and one where \(A = 0\) for all individuals, respectively.
The causal OR \(\widehat{\mathrm{OR}}_c\) can then be estimated using Equation (3.3), i.e. we calculate the marginal OR as

\[ \frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}. \]

Here, \(p_1 = \hat P(Y^{a=1} = 1)\) is the mean probability of the outcome if everyone were treated and \(p_0 = \hat P(Y^{a=0} = 1)\) is the mean probability of the outcome if no-one received the treatment.

Doubly robust estimation

While IP weighting requires the propensity model to be correct, i.e. a correct model for the treatment A conditional on the confounders L, the g-formula requires a correct model for the outcome Y conditional on treatment A and the confounders L, the Q-model. A doubly robust estimator, in contrast, is consistent if at least one of the two models is correctly specified. There are many types of doubly robust estimators, but we will focus on a very simple one here [18]. First, we estimate the weights \(w_i\) as described above. We then fit our Q-model to the data including an additional covariate z, where \(z_i = w_i\) if \(A_i = 1\) and \(z_i = -w_i\) if \(A_i = 0\). Finally, we again obtain a causal OR from Equation (3.3).

Confidence intervals for g-computation are usually obtained by a nonparametric bootstrap approach, i.e. by drawing with replacement from the data and analyzing each bootstrap data set like we analyzed the original data. Here, the number of bootstrap repetitions should be chosen reasonably large. We used 10,000 repetitions in the analysis of the data example and 1,000 bootstrap repetitions in the simulation study. The limits of the 95% confidence interval are obtained as the 2.5 and 97.5 percentiles of the bootstrap distribution. Note that a statistical test can be obtained similarly by calculating the test statistic in each bootstrap sample and then comparing the original test statistic to the empirical \((1-\alpha)\)-quantile of the bootstrap distribution. A p-value is obtained by counting how often the original test statistic is smaller than the bootstrap statistic and dividing this number by the number of bootstrap replications.
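The doubly robust g-computation with a percentile bootstrap confidence interval can be sketched as below. This is a condensed sketch with illustrative variable names, following the steps described above (the weights are computed exactly as for IPTW); the authors' full code is available in the linked GitHub repository:

```r
library(boot)

dr_gcomp <- function(data, idx) {
  d <- data[idx, ]
  # propensity score and the "clever covariate" z
  ps <- fitted(glm(treatment ~ sex + age + status + time_onset,
                   data = d, family = binomial))
  d$z <- ifelse(d$treatment == 1, 1 / ps, -1 / (1 - ps))
  # Q-model including treatment, confounders and z
  q <- glm(outcome ~ treatment + sex + age + status + time_onset + z,
           data = d, family = binomial)
  # predict under A = 1 and A = 0 for everyone, with z set accordingly
  d1 <- d0 <- d
  d1$treatment <- 1; d1$z <-  1 / ps
  d0$treatment <- 0; d0$z <- -1 / (1 - ps)
  p1 <- mean(predict(q, newdata = d1, type = "response"))
  p0 <- mean(predict(q, newdata = d0, type = "response"))
  (p1 / (1 - p1)) / (p0 / (1 - p0))  # causal OR, Equation (3.3)
}

b <- boot(dat, dr_gcomp, R = 10000)       # nonparametric bootstrap
quantile(b$t, c(0.025, 0.975))            # percentile 95% CI
```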
Analysis of the data example

We will now re-analyze the data example described in Section 2 using the methods detailed above. Following the authors' example, we imputed missing outcomes on Day 6 by a last-observation-carried-forward approach, i.e. patients with missing PCR were considered positive on Day 6 if they were actually positive the day(s) before [17]. Moreover, after setting time since onset of symptoms to zero for asymptomatic patients, two patients with missing time since onset remained. Both patients were classified as URTI and we used the median time since onset (3 days) for URTI patients to impute these missing values. Our PS model included sex, age, clinical status and time since onset of symptoms as explanatory variables. The Q-model for the g-computation additionally included treatment status and the covariate z described above. We used 10,000 bootstrap iterations for the calculation of the confidence interval.

The matching procedure resulted in a data set with 11 controls and 11 treated patients, thus discarding 5 controls and 9 treated patients from the analysis. Due to the small remaining sample size, the conditional logistic regression model did not converge at all, the GEE procedure (using an exchangeable correlation structure) did not produce estimates of the standard errors, and the simple model (match unadjusted) returned confidence intervals ranging from 0 to ∞. Thus, the matching procedure did not yield useful estimates and is discarded from the results displayed below. The code for this analysis is freely available on GitHub (https://github.com/smn74/CIM_COVID-19).

The distribution of the propensity scores in the two groups, estimated by a kernel density estimator, is displayed in Figure 1. As we can see, the propensity scores show a good overlap between the treatment and control group.
Moreover, the propensity scores are reasonably far from 0 and 1 to ensure stable estimation of the IPT weights [37].

As we can see in Figure 2, all four methods (the crude unadjusted as well as the three causal inference methods) yield similar point estimates, which are, however, extremely large. Moreover, the methods differ with respect to the width of the confidence intervals and statistical significance: the very wide confidence interval obtained by the doubly robust g-computation approach includes 1, whereas the other methods yielded statistically significant treatment effects. So the question remains: which method can we trust the most in this situation? In order to answer this, we have conducted a large simulation study, which will be presented in the next section.

4 Simulation study

The set-up of our simulation study closely followed [5]. The data-generating process is as follows: First, we generate n covariates \(x_1, \ldots, x_n\). We then generate the treatment status for each subject \(i = 1, \ldots, N\)
Figure 1: Propensity score distribution in the treatment (red curve) and control group (blue curve), respectively.

according to the model

\[ \mathrm{logit}(p_{i,\mathrm{treatment}}) = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_n x_{i,n}. \tag{4.1} \]

Treatment is then randomly assigned to each subject following a Bernoulli distribution with subject-specific probability of treatment assignment, \(A_i \sim \mathrm{Bernoulli}(p_{i,\mathrm{treatment}})\). Next, the outcome \(Y_i\) of each subject is simulated conditional on the treatment assignment \(A_i\) and the covariates associated with the outcome according to

\[ \mathrm{logit}(p_{i,\mathrm{outcome}}) = \alpha_0 + \beta_{\mathrm{trt}} A_i + \alpha_1 x_{i,1} + \cdots + \alpha_n x_{i,n} \tag{4.2} \]

and \(Y_i \sim \mathrm{Bernoulli}(p_{i,\mathrm{outcome}})\). Here, \(\beta_{\mathrm{trt}}\) denotes the conditional treatment effect on the log-odds scale.

Figure 2: Estimated odds ratios with 95% confidence intervals obtained by the different methods. Here, “Crude OR” refers to the simple logistic regression of treatment on the outcome, while “PS covariate” additionally includes the estimated propensity score as a covariate in the logistic regression. The ORs are displayed on a logarithmic scale.

Note that this data-generating process allows for the introduction of a conditional odds ratio given the covariates. With the propensity score methods, however, we estimate a marginal odds ratio, which is in general different from the conditional OR due to non-collapsibility of the odds ratio, see e.g. [16, 26, 33]. The marginal and conditional OR coincide if there is no treatment effect, i.e., if \(\beta_{\mathrm{trt}} = 0\). For situations where the desired marginal OR is not 1, we have determined the corresponding conditional OR for a given marginal OR as described by [5]. In particular, we use Monte Carlo simulation to estimate the conditional OR. Therefore, we randomly generate 1,000 large data sets.
For each individual, we generate the counterfactual outcomes under treatment (\(A_i = 1\)) and control (\(A_i = 0\)) and calculate the marginal OR as

\[ \frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}, \]

where \(p_1\) denotes the mean probability of the outcome if everyone were treated and \(p_0\) denotes the mean probability of the outcome if no-one received the experimental treatment, see also Equation (3.3). Using an iterative process, we modified \(\beta_{\mathrm{trt}}\) until we got close enough to the desired marginal OR.

Concerning the covariates, we considered three different scenarios.

Scenario 1: COVID-19 data example

The first scenario aimed at mimicking the data example. Thus, we generated four covariates:

1. \(x_1\) (representing sex) followed a Bernoulli distribution with parameter 0.59.
2. \(x_2\) (representing age) was drawn from a normal distribution with mean 45 and rounded to integers.
3. \(x_3\) (clinical status) was simulated as a categorical covariate with three categories, i.e. a Bin(2, ·) distribution.
4. \(x_4\) (time since onset of disease) was generated from a uniform distribution and rounded to integers.

Treatment status was then generated according to Equation (4.1) with coefficients \((\beta_0, \beta_1, \beta_2, \beta_{3,1}, \beta_{3,2}, \beta_4)\), where \(\beta_{3,1}\) and \(\beta_{3,2}\) correspond to the dummy-coded categories \(x_{3,1}\) and \(x_{3,2}\) for \(x_3 = 1\) and \(x_3 = 2\), respectively. The parameters were obtained from the data by univariate logistic regression. Note that this implies a moderate association of treatment with three of the covariates, a weak association with one and a strong association with one of the dummy-coded categories.

Similarly, the outcome was generated following Equation (4.2) with coefficients \((\alpha_0, \alpha_1, \alpha_2, \alpha_{3,1}, \alpha_{3,2}, \alpha_4)\), implying a moderate association with two of the covariates, a strong association with one and a weak association with another. The parameter \(\beta_{\mathrm{trt}}\) was varied to generate different marginal ORs in the following way: for \(\beta_{\mathrm{trt}} = 0\), marginal and conditional OR coincide and are equal to 1; a value of \(\beta_{\mathrm{trt}}\) between 0 and 1 gives a marginal OR of 2, and a value between 2 and 3 gives a true marginal OR of 10. Finally, a negative intercept \(\beta_0\)
resulted in a similar distribution of treated individuals and controls as in the original data, yielding on average a proportion of treated individuals close to that in the original study. To study the influence of more or less unbalanced treatment groups, we also varied this parameter in the simulations. In particular, we additionally considered treatment allocations of approx. 2:1 and 4:1.

Scenario 2: Unobserved confounder

The parameters in this setting are identical to Scenario 1, but we additionally added an unmeasured confounder. Thus, we simulated a covariate \(x_5\) following a standard normal distribution with a strong effect on both treatment assignment and outcome; the corresponding coefficients \(\beta_5\) and \(\alpha_5\) were set to \(\log(5)\). However, \(x_5\) entered neither the propensity score model nor the Q-model for the g-computation. For a marginal OR of 2 and 10, \(\beta_{\mathrm{trt}}\) was set to 1.1111 and 3.4793, respectively.

Scenario 3: Austin [5]

This scenario aims at reproducing some of the findings of [5]. Therefore, we used the same set-up as he did, namely simulating 9 binary covariates with different associations to treatment assignment and outcome as described in Table 1.

Table 1: Association to treatment assignment and outcome used in the simulation scenario motivated by [5] (Scenario 3).

                                     Strongly assoc.   Moderately assoc.   Not assoc.
                                     with treatment    with treatment      with treatment
Strongly associated with outcome          x1                 x2                 x3
Moderately associated with outcome        x4                 x5                 x6
Not associated with outcome               x7                 x8                 x9

Here, a strong association is represented by a coefficient of \(\log(5)\) in the respective model, while a moderate association has a coefficient of \(\log(2)\). The intercept \(\beta_0\) was chosen to obtain a balanced design with respect to treatment, and \(\alpha_0\) was set to a fixed negative value. For more details on the simulation set-up, see [5]. The propensity score model and the Q-model included all 9 covariates. For a marginal OR of 2 and 10, \(\beta_{\mathrm{trt}}\) was set to 0.9707 and 3.2625, respectively. In addition to Austin's setting with an equal treatment allocation of 1:1, we also considered a situation with approx.
4:1 treated patients.

An overview of all simulated scenarios is given in Table 2, which lists, for each scenario, the value of \(\beta_{\mathrm{trt}}\), the true marginal OR, the intercept \(\beta_0\) and the resulting average percentage of treated individuals.

To study the influence of small sample sizes on the methods, we simulated N = 40, 100 and 1,000 individuals for each scenario. Simulations were performed in R Version 3.6.3 with 2,000 simulation runs, and the bootstrap confidence intervals for the g-computation are based on 1,000 bootstrap replications. Note that while 1,000 bootstrap replications suffice in simulations, we would recommend a higher number, say 10,000, in real-life applications.

We used different measures to compare the results. With respect to the point estimators, we considered the bias on the log-scale, i.e. the difference between the true marginal log OR and the estimated log OR. Since the methods often resulted in extreme estimates of the treatment effect, we considered both the mean bias (difference between the true OR and the mean estimated treatment effect) and the median bias (difference between the true OR and the median of the estimated treatment effects). The results are displayed in Figures 3 and 4, respectively. Moreover, the mean square error (MSE) of each estimated marginal OR is displayed in Tables 3–5.

Concerning the confidence intervals, we considered the percentage of 95% confidence intervals that contained the true odds ratio (coverage probability) as well as the median length of the 95% confidence interval. These measures are displayed in Figure 5 and Tables 3–5, respectively. Finally, we also report how often the chosen models did not converge or yielded an extremely large estimated OR. These runs were excluded from the calculations and reported as failures in the tables.

For comparison, we again included the crude OR estimated by simple logistic regression including only treatment status as a covariate.
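The data-generating process of Equations (4.1) and (4.2) can be sketched as follows. The covariate distributions follow Scenario 1 where reported; the coefficient values and the unreported distribution parameters (marked below) are placeholders, not the values used in the paper:

```r
expit <- function(x) 1 / (1 + exp(-x))

simulate_trial <- function(N, beta0, beta, alpha0, alpha, beta_trt) {
  x1 <- rbinom(N, 1, 0.59)                  # sex
  x2 <- round(rnorm(N, mean = 45, sd = 15)) # age (SD assumed)
  x3 <- rbinom(N, 2, 0.3)                   # clinical status (prob. assumed,
                                            #  treated as linear for brevity)
  x4 <- round(runif(N, 0, 7))               # time since onset (range assumed)
  L <- cbind(x1, x2, x3, x4)
  A <- rbinom(N, 1, expit(beta0 + drop(L %*% beta)))               # (4.1)
  Y <- rbinom(N, 1, expit(alpha0 + beta_trt * A + drop(L %*% alpha)))  # (4.2)
  data.frame(Y, A, L)
}

dat <- simulate_trial(N = 100, beta0 = -0.5,
                      beta  = c(0.5, 0.02, -0.7, 0.1),
                      alpha0 = -1,
                      alpha = c(0.3, 0.03, 0.8, -0.1),
                      beta_trt = 0)
```

Each simulation run generates one such data set and applies all methods to it; the paper additionally dummy-codes the categorical covariate \(x_3\), which is simplified here.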
Figure 3: Displayed is the mean bias on the log-scale for the three scenarios (rows) and the three simulated marginal odds ratios (columns).

Across all scenarios considered here, we note that the matching procedure is the most prone to failure. Especially for the small sample sizes, this approach often fails, as we have also seen in the data example. These results are in line with the findings of [3], who stress the need for the development of appropriate matching methods in small-sample studies. Overall, there is not much difference between the different matching approaches. Furthermore, we found that using the default caliper, which is 0 in R, leads to extremely biased results with coverage probabilities dropping below 1% in some situations (results not shown).

Figure 4: Displayed is the median bias on the log-scale for the three scenarios (rows) and the three simulated marginal odds ratios (columns).

As expected, the number of failures decreases with increasing sample size across scenarios. Moreover, we note that the mean bias of all methods decreases with growing sample sizes. This is, however, not the case for the median bias, see Figures 3 and 4, respectively. This finding implies that with growing sample sizes, the methods lead to less extreme results in the estimation. The largest mean bias is observed for g-computation and the unadjusted crude OR.
Considering the median bias, however, the unadjusted crude OR remains positively biased across all scenarios, while g-computation no longer differs much from the other methods. In particular, it leads to the best results for Scenario 2.

In all of the considered scenarios, the CIs for the unadjusted crude OR greatly undercover the nominal 95% level, see Figure 5. This becomes especially apparent with growing sample size, which reduces the length of the CIs and thus concentrates them ever more tightly around a biased point estimate.

In Scenario 2, where we have an additional unobserved confounder, all methods lead to biased results even for large sample sizes, and coverage probabilities are far from the 95% level, see Figure 5. In this situation, the doubly robust g-computation yields by far the best results with respect to coverage.

Figure 5: Displayed is the coverage probability (in %) for the three scenarios (rows) and the three simulated marginal odds ratios (columns).

Moreover, it also has the smallest median bias (Figure 4). These results are especially important, since one can never be sure to have included all relevant confounders in a practical data analysis.

For Scenario 3, we again see that the matching methods fail for N = 40 (Table 5). The results obtained are similar to those of [5], but one has to keep in mind that he simulated data sets of size N = 10,000. Moreover, he did not include the g-computation approach in his comparisons.

Figures 6 and 7 show the median bias and coverage probabilities of the methods in Scenarios 1 and 3, respectively, where we varied the proportion of individuals who receive treatment.
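As an illustration of the doubly robust g-computation, whose good coverage properties were noted above, a minimal Python sketch on simulated data follows (the analyses in this paper use R; the data-generating model, the variable names and the use of scikit-learn here are our own): fit a propensity model, include the estimated score in the outcome model alongside treatment, predict each subject's risk under both treatment levels, and form the marginal OR from the averaged risks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=(n, 2))                       # two baseline covariates
p_treat = 1 / (1 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))
a = rng.binomial(1, p_treat)                      # non-randomized treatment
p_out = 1 / (1 + np.exp(-(-1.0 + 0.7 * a + 0.8 * x[:, 0])))
y = rng.binomial(1, p_out)                        # binary outcome

# Step 1: propensity score model (treatment ~ covariates)
ps_hat = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]

# Step 2: outcome model including both treatment and the estimated PS
om = LogisticRegression().fit(np.column_stack([a, ps_hat]), y)

# Step 3: predict every subject's risk under treatment and under control,
# average, and form the marginal odds ratio from the averaged risks
r1 = om.predict_proba(np.column_stack([np.ones(n), ps_hat]))[:, 1].mean()
r0 = om.predict_proba(np.column_stack([np.zeros(n), ps_hat]))[:, 1].mean()
or_marginal = (r1 / (1 - r1)) / (r0 / (1 - r0))
print(or_marginal)
```

A confidence interval can then be obtained, e.g., by bootstrapping all three steps.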
In Scenario 1 (Figure 6), we do not see much difference with respect to the median bias. The coverage, on the other hand, decreases with increasing imbalance for almost all methods. In Scenario 3 (Figure 7), the difference between the methods becomes more apparent with increasing imbalance. In particular, the performance of IPTW deteriorates with increasing imbalance, as demonstrated by the higher median bias and the lower coverage probability (especially for smaller sample sizes), while g-computation seems to improve slightly with growing imbalance.

Figure 6: Median bias and coverage probabilities for Scenario 1 with a true OR of 1 and different proportions of treated individuals. Note that the coverage axis is truncated from below, so the unadjusted method is not displayed for N = 1000.

Discussion

In ongoing pandemics there is an urgent unmet medical need to develop vaccines, diagnostics and treatments in a very timely fashion. Despite the time pressure, however, the standards of evidence should not unduly be lowered [10, 27]. Using a small-scale non-randomized study in COVID-19 [17] as a motivating example, we demonstrate how robust analyses can be conducted by use of appropriate causal inference methods. We speculate that application of the doubly robust g-computation in the motivating study might have cast doubt on the efficacy of hydroxychloroquine and, in the context of other criticism voiced, might consequently have dampened the early enthusiasm regarding the use of this drug in COVID-19. Ultimately, this might have saved some of the resources that were wasted on clinical trials investigating hydroxychloroquine in patients suffering from COVID-19.
More importantly, this might have prevented a shortage of hydroxychloroquine in the treatment of rheumatological disorders, which some tried to counter by revised treatment schedules to maintain the standard of care [31].

The conventional method to correct for baseline differences between groups is to adjust for all relevant patient characteristics in the outcome regression model. This is, however, not advisable for several reasons. As [28] point out, covariate adjustment works poorly in cases where, e.g., the variance of a covariate differs between the treatment and the control group. A commonly applied alternative in observational studies are propensity score methods. Since these methods were derived from a formal model for causal inference, their use allows for well-defined causal questions [23]. Moreover, propensity score methods also work as a dimension-reduction tool by combining multiple covariates into a single score [23, 29]. This is especially important in situations with a large number of covariates compared to the number of subjects under study. In our data example, adjusting for all baseline covariates in the outcome model was not advisable due to the small number of subjects in the two groups. For comparison, we therefore included the special case of a simple logistic regression model with treatment assignment as the only covariate.

Figure 7: Median bias and coverage probabilities for Scenario 3 with a true OR of 1 and different proportions of treated individuals. Note that the coverage axis is truncated from below, so the unadjusted method is not displayed for some settings.

Some comments on the estimands obtained by the different methods are in order. First, as already explained in the set-up of the simulation study, we aimed at estimating marginal treatment effects here.
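The non-collapsibility point can be verified with a small numeric check (the numbers are our own illustration): treatment A is independent of a binary covariate Z, the conditional odds ratio equals 3 in both strata of Z, and yet the marginal odds ratio, obtained by averaging the risks over Z, is smaller than 3 even though there is no confounding at all.

```python
import numpy as np

def expit(t):
    return 1 / (1 + np.exp(-t))

# Conditional model: logit P(Y=1 | A, Z) = -1 + log(3)*A + 2*Z,
# with Z ~ Bernoulli(0.5) independent of A (i.e. no confounding).
def risk(a, z):
    return expit(-1 + np.log(3) * a + 2 * z)

# Marginal risks: average the conditional risks over the distribution of Z
r1 = 0.5 * risk(1, 0) + 0.5 * risk(1, 1)
r0 = 0.5 * risk(0, 0) + 0.5 * risk(0, 1)
or_marginal = (r1 / (1 - r1)) / (r0 / (1 - r0))
print(round(or_marginal, 3))  # 2.421, strictly between 1 and the conditional OR of 3
```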
Due to the non-collapsibility of the odds ratio, these are different from conditional treatment effects, i.e., effects at the subject level [8, 22]. Second, our aim was to estimate the average causal effect in our study population. Propensity score matching, however, creates a population from which treated individuals who cannot be matched to any control patients are excluded. Thus, the effect estimate obtained corresponds to a subset of the population that is hard to describe. Since the matched population is not well characterized, it is difficult to generalize results obtained in it to the general population [18]. Finally, it is worth noting that among the methods we discussed here, only IPT weighting and g-computation can be generalized to more complex situations involving time-varying treatments [18].

Motivated by the study conducted by [17], we investigated the properties of a range of causal inference methods in small samples. As expected, this posed additional challenges to the various approaches. Interestingly, it turned out that the default settings in software implementations are often geared towards large sample sizes and need to be adjusted for applications in small-scale studies. For example, we found that the matching procedure in R with its default caliper of 0 produced extremely biased results in our small-sample simulations. SAS software, in contrast, uses a default caliper width of 0.25.

In our investigations, we focused on the non-randomized nature of the study by [17]. However, the study also suffers from other weaknesses, including the small sample size, open-label treatment and study discontinuations [1]. For instance, we did not address the problems in interpretation caused by study discontinuations here, but used the last-observation-carried-forward approach as in [17], although this approach has fallen out of favor due to its limitations, in particular its underestimation of standard errors; see, e.g.,
[24] and the references cited therein.

Besides the design of efficient trials to develop treatments for COVID-19 [34], one concern of trialists these days is the threat posed by the SARS-CoV-2 pandemic to clinical trials in non-COVID-19 indications [4, 21]. SARS-CoV-2 infections of patients in these trials, or merely the increased risk thereof, might lead to post-randomization events (or intercurrent events in the language of the ICH E9 addendum [20]) such as treatment or study discontinuations as well as adverse events, which ultimately invalidate an analysis relying on randomization. In such situations, the causal inference approaches discussed here might provide a suitable alternative analysis strategy, either as a primary or a sensitivity analysis.

Acknowledgement
Support by the DFG (grant FR 4121/2-1) is gratefully acknowledged.
Conflict of interest
The authors declare that they have no conflict of interest.
References

[1] Paul Elias Alexander, Victoria Borg Debono, Manoj J. Mammen, Alfonso Iorio, Komal Aryala, Dianna Deng, Eva Brocard, and Waleed Alhazzani. COVID-19 coronavirus research has overall low methodological quality thus far: case in point for chloroquine/hydroxychloroquine.
Journal of Clinical Epidemiology, 123:120–126, 2020.

[2] Robert P Althauser and Donald Rubin. The computerized construction of a matched sample.
American Journal of Sociology, 76(2):325–346, 1970.

[3] Anais Andrillon, Romain Pirracchio, and Sylvie Chevret. Performance of propensity score matching to estimate causal effects in small samples.
Statistical Methods in Medical Research, 29(3):644–658, 2020.

[4] Stefan D Anker, Javed Butler, Muhammad Shahzeb Khan, William T Abraham, Johann Bauersachs, Edimar Bocchi, Biykem Bozkurt, Eugene Braunwald, Vijay K Chopra, John G Cleland, Justin Ezekowitz, Gerasimos Filippatos, Tim Friede, Adrian F Hernandez, Carolyn S P Lam, JoAnn Lindenfeld, John J V McMurray, Mandeep Mehra, Marco Metra, Milton Packer, Burkert Pieske, Stuart J Pocock, Piotr Ponikowski, Giuseppe M C Rosano, John R Teerlink, Hiroyuki Tsutsui, Dirk J Van Veldhuisen, Subodh Verma, Adriaan A Voors, Janet Wittes, Faiez Zannad, Jian Zhang, Petar Seferovic, and Andrew J S Coats. Conducting clinical trials in heart failure during (and after) the COVID-19 pandemic: an Expert Consensus Position Paper from the Heart Failure Association (HFA) of the European Society of Cardiology (ESC).
European Heart Journal, 41(22):2109–2117, 2020. doi:10.1093/eurheartj/ehaa461.

[5] Peter C. Austin. The performance of different propensity score methods for estimating marginal odds ratios. Statistics in Medicine, 26(16):3078–3094, 2007. doi:10.1002/sim.2781.

[6] Peter C Austin. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Statistics in Medicine, 27(12):2037–2049, 2008.

[7] Peter C Austin. A comparison of 12 algorithms for matching on the propensity score.
Statistics in Medicine, 33(6):1057–1069, 2014.

[8] Peter C. Austin, Paul Grootendorst, Sharon-Lise T. Normand, and Geoffrey M. Anderson. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: a Monte Carlo study.
Statistics in Medicine, 26(4):754–768, 2007. doi:10.1002/sim.2618.

[9] Seunghee Baek, Seong Ho Park, Eugene Won, Yu Rang Park, and Hwa Jung Kim. Propensity score matching: a conceptual review for radiology researchers. Korean Journal of Radiology, 16(2):286–296, 2015.

[10] Howard Bauchner and Phil B. Fontanarosa. Randomized Clinical Trials and COVID-19: Managing Expectations.
JAMA, 323(22):2262–2263, 2020. doi:10.1001/jama.2020.8115.

[11] David R. Boulware, Matthew F. Pullen, Ananta S. Bangdiwala, Katelyn A. Pastick, Sarah M. Lofgren, Elizabeth C. Okafor, Caleb P. Skipper, Alanna A. Nascene, Melanie R. Nicol, Mahsa Abassi, Nicole W. Engen, Matthew P. Cheng, Derek LaBar, Sylvain A. Lother, Lauren J. MacKenzie, Glen Drobot, Nicole Marten, Ryan Zarychanski, Lauren E. Kelly, Ilan S. Schwartz, Emily G. McDonald, Radha Rajasingham, Todd C. Lee, and Kathy H. Hullsiek. A randomized trial of hydroxychloroquine as postexposure prophylaxis for COVID-19. New England Journal of Medicine, 2020. doi:10.1056/NEJMoa2016638.

[12] Alexandre B. Cavalcanti, Fernando G. Zampieri, Regis G. Rosa, Luciano C.P. Azevedo, Viviane C. Veiga, Alvaro Avezum, Lucas P. Damiani, Aline Marcadenti, Leticia Kawano-Dourado, Thiago Lisboa, Debora L. M. Junqueira, Pedro G.M. de Barros e Silva, Lucas Tramujas, Erlon O. Abreu-Silva, Ligia N. Laranjeira, Aline T. Soares, Leandro S. Echenique, Adriano J. Pereira, Flavio G.R. Freitas, Otavio C.E. Gebara, Vicente C.S. Dantas, Remo H.M. Furtado, Eveline P. Milan, Nicole A. Golin, Fabio F. Cardoso, Israel S. Maia, Conrado R. Hoffmann Filho, Adrian P.M. Kormann, Roberto B. Amazonas, Monalisa F. Bocchi de Oliveira, Ary Serpa-Neto, Maicon Falavigna, Renato D. Lopes, Flavia R. Machado, and Otavio Berwanger. Hydroxychloroquine with or without azithromycin in mild-to-moderate COVID-19. New England Journal of Medicine, 2020.
doi:10.1056/NEJMoa2019014.

[13] Andrea Cortegiani, Giulia Ingoglia, Mariachiara Ippolito, Antonino Giarratano, and Sharon Einav. A systematic review on the efficacy and safety of chloroquine for the treatment of COVID-19. Journal of Critical Care, 57:279–283, 2020.

[14] Markus C Elze, John Gregson, Usman Baber, Elizabeth Williamson, Samantha Sartori, Roxana Mehran, Melissa Nichols, Gregg W Stone, and Stuart J Pocock. Comparison of propensity score methods and covariate adjustment: evaluation in 4 cardiovascular studies.
Journal of the American College of Cardiology, 69(3):345–357, 2017.

[15] Christian Funck-Brentano, Lee S Nguyen, and Joe-Elie Salem. Retraction and republication: cardiac toxicity of hydroxychloroquine in COVID-19.
The Lancet, 2020. doi:10.1016/S0140-6736(20)31528-2.

[16] Mitchell H Gail, S Wieand, and Steven Piantadosi. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates.
Biometrika, 71(3):431–444, 1984.

[17] Philippe Gautret, Jean-Christophe Lagier, Philippe Parola, Line Meddeb, Morgane Mailhe, Barbara Doudier, Johan Courjon, Valérie Giordanengo, Vera Esteves Vieira, Hervé Tissot Dupont, et al. Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial.
International Journal of Antimicrobial Agents, 56(1), 2020. doi:10.1016/j.ijantimicag.2020.105949.

[18] Miguel Hernán and James Robins.
Causal Inference: What If. Chapman & Hall/CRC, Boca Raton, 2020.

[19] Peter Horby, Marion Mafham, Louise Linsell, Jennifer L Bell, Natalie Staplin, Jonathan R Emberson, Martin Wiselka, Andrew Ustianowski, Einas Elmahi, Benjamin Prudon, Anthony Whitehouse, Timothy Felton, John Williams, Jakki Faccenda, Jonathan Underwood, J Kenneth Baillie, Lucy Chappell, Saul N Faust, Thomas Jaki, Katie Jeffery, Wei Shen Lim, Alan Montgomery, Kathryn Rowan, Joel Tarning, James A Watson, Nicholas J White, Edmund Juszczak, Richard Haynes, and Martin J Landray. Effect of hydroxychloroquine in hospitalized patients with COVID-19: preliminary results from a multi-centre, randomized, controlled trial. medRxiv, 2020. doi:10.1101/2020.07.15.20151852.

[20] ICH. ICH E9 (R1) addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials. 2019.

[21] Cornelia Ursula Kunz, Silke Jörgens, Frank Bretz, Nigel Stallard, Kelly Van Lancker, Dong Xi, Sarah Zohar, Christoph Gerlinger, and Tim Friede. Clinical trials impacted by the COVID-19 pandemic: adaptive designs to the rescue? Statistics in Biopharmaceutical Research, 2020. doi:10.1080/19466315.2020.1799857.

[22] Huzhang Mao and Liang Li. Flexible regression approach to propensity score analysis and its relationship with matching and weighting. Statistics in Medicine, 2020. doi:10.1002/sim.8526.

[23] Daniel F McCaffrey, Beth Ann Griffin, Daniel Almirall, Mary Ellen Slaughter, Rajeev Ramchand, and Lane F Burgette. A tutorial on propensity score estimation for multiple treatments using generalized boosted models.
Statistics in Medicine, 32(19):3388–3414, 2013.

[24] Geert Molenberghs, Herbert Thijs, Ivy Jansen, Caroline Beunckens, Michael G Kenward, Craig Mallinckrodt, and Raymond J Carroll. Analyzing incomplete longitudinal clinical trial data.
Biostatistics, 5(3):445–464, 2004.

[25] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect.
Mathematical Modelling, 7(9-12):1393–1512, 1986.

[26] Laurence D Robinson and Nicholas P Jewell. Some surprising results about covariate adjustment in logistic regression models.
International Statistical Review / Revue Internationale de Statistique, pages 227–240, 1991.

[27] Benjamin N. Rome and Jerry Avorn. Drug evaluation during the COVID-19 pandemic.
New England Journal of Medicine, 382(24):2282–2284, 2020.

[28] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects.
Biometrika, 70(1):41–55, 1983.

[29] Paul R Rosenbaum and Donald B Rubin. Reducing bias in observational studies using subclassification on the propensity score.
Journal of the American Statistical Association, 79(387):516–524, 1984.

[30] Sebastian E. Sattui, Jean W. Liew, Elizabeth R. Graef, Ariella Coler-Reilly, Francis Berenbaum, Ali Duarte-Garcia, Carly Harrison, Maximilian F. Konig, Peter Korsten, Michael S. Putman, Philip C. Robinson, Emily Sirotich, Manuel F. Ugarte-Gil, Kate Webb, Kristen J. Young, Alfred H.J. Kim, and Jeffrey A. Sparks. Swinging the pendulum: lessons learned from public discourse concerning hydroxychloroquine and COVID-19.
Expert Review of Clinical Immunology, 2020. doi:10.1080/1744666X.2020.1792778.

[31] Marc H. Scheetz, Maximilian F. Konig, Philip C. Robinson, Jeffrey A. Sparks, and Alfred H.J. Kim. A Pharmacokinetics-Informed Approach to Navigating Hydroxychloroquine Shortages in Patients With Rheumatic Disease During the COVID-19 Pandemic.
ACR Open Rheumatology, 2020. doi:10.1002/acr2.11164.

[32] Baiju R. Shah, Andreas Laupacis, Janet E. Hux, and Peter C. Austin. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. Journal of Clinical Epidemiology, 58(6):550–559, 2005. doi:10.1016/j.jclinepi.2004.10.016.

[33] Arvid Sjölander, Elisabeth Dahlqwist, and Johan Zetterqvist. A note on the noncollapsibility of rate differences and rate ratios. Epidemiology, 27(3):356–359, 2016.

[34] Nigel Stallard, Lisa Hampson, Norbert Benda, Werner Brannath, Thomas Burnett, Tim Friede, Peter K. Kimani, Franz Koenig, Johannes Krisam, Pavel Mozgunov, Martin Posch, James Wason, Gernot Wassmer, John Whitehead, S. Faye Williamson, Sarah Zohar, and Thomas Jaki. Efficient adaptive designs for clinical trials of interventions for COVID-19.
Statistics in Biopharmaceutical Research, 2020. arXiv:2005.13309.

[35] Elizabeth A. Stuart. Developing practical recommendations for the use of propensity scores: discussion of 'A critical appraisal of propensity score matching in the medical literature between 1996 and 2003' by Peter Austin, Statistics in Medicine.
Statistics in Medicine, 27(12):2062–2065, 2008. doi:10.1002/sim.3207.

[36] Sherry Weitzen, Kate L Lapane, Alicia Y Toledano, Anne L Hume, and Vincent Mor. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiology and Drug Safety, 13(12):841–853, 2004.

[37] Yunji Zhou, Roland A Matsouaka, and Laine Thomas. Propensity score weighting under limited overlap and model misspecification.
Statistical Methods in Medical Research, 2020. PMID: 32693715. doi:10.1177/0962280220940334.

Table 3: Median length of the 95% confidence intervals (Length CI), mean squared error of the estimated treatment effect (MSE) and number of models (out of the simulation runs) that did not converge or resulted in an extreme OR, for Scenario 1: COVID-19. Crude OR refers to the simple logistic regression adjusting only for treatment assignment, PS covariate denotes the method including the PS in the outcome regression model and g-comp refers to the doubly robust g-computation method.
Table 4: Median length of the 95% confidence intervals (Length CI), mean squared error of the estimated treatment effect (MSE) and number of models (out of the simulation runs) that did not converge or resulted in an extreme OR, for Scenario 2: Unmeasured Confounder. Crude OR refers to the simple logistic regression adjusting only for treatment assignment, PS covariate denotes the method including the PS in the outcome regression model and g-comp refers to the doubly robust g-computation method.

Table 5: Median length of the 95% confidence intervals (Length CI), mean squared error of the estimated treatment effect (MSE) and number of models (out of the simulation runs) that did not converge or resulted in an extreme OR, for Scenario 3: Austin (2007).
Crude OR refers to the simple logistic regression adjusting only for treatment assignment, PS covariate denotes the method including the PS in the outcome regression model and g-comp refers to the doubly robust g-computation method.