Survival analysis for AdVerse events with VarYing follow-up times (SAVVY) -- comparison of adverse event risks in randomized controlled trials

Abstract

Analyses of adverse events (AEs) are an important aspect of benefit-risk and health-technology assessments of therapies. The SAVVY project aims to improve the analyses of AE data in clinical trials through the use of survival techniques appropriately dealing with varying follow-up times and competing events (CEs). In an empirical study including randomized clinical trials (RCT) from several sponsor organisations the effect of varying follow-up times and CEs on comparisons of two treatment arms with respect to AE risks is investigated. CEs definition does not only include death before AE but also end of follow-up for AEs due to events possibly related to the disease course or safety of the treatment. The comparisons of relative risks (RRs) of standard probability-based estimators to the gold-standard Aalen-Johansen estimator (AJE) or hazard-based estimators to an estimated hazard ratio (HR) from Cox regression are done descriptively, with graphical displays, and using a random effects meta-analysis. The influence of different factors on the size of the bias is investigated in a meta-regression. Ten sponsors provided 17 RCTs including 186 types of AEs. We confirm for estimation of the RR concerns regarding incidence densities: the probability transform incidence density ignoring CEs performed worst. However, accounting for CEs in an analysis that parametrically mimicked the non-parametric AJE performed better than both one minus Kaplan-Meier and AJE that only considered death as a CE. The analysis based on hazards revealed that the incidence density underestimates the HR of AE and CE of death hazard compared to the gold-standard Cox regression. Both the choice of the estimator of the AE probability and a careful definition of CEs are crucial in estimation of RRs. For RRs based on hazards, the HR based on Cox regression has better properties than the ratio of incidence densities.


Survival analysis for AdVerse events with VarYing follow-up times (SAVVY) -- comparison of adverse event risks in randomized controlled trials

Kaspar Rufibach*, Regina Stegherr*, Claudia Schmoor, Valentine Jehl, Arthur Allignol, Annette Boeckenhoff, Cornelia Dunger-Baldauf, Lewin Eisele, Thomas Künzel, Katrin Kupas, Friedhelm Leverkus, Matthias Trampisch, Yumin Zhao, Tim Friede and Jan Beyersmann**

August 13, 2020

F. Hoffmann-La Roche, Basel, Switzerland
Institute of Statistics, Ulm University, Ulm, Germany
Clinical Trials Unit, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg im Breisgau, Germany
Novartis Pharma AG, Basel, Switzerland
Merck KGaA, Darmstadt, Germany
Bayer AG, Wuppertal, Germany
Janssen-Cilag GmbH, Neuss, Germany
Bristol-Myers-Squibb GmbH & Co. KGaA, München, Germany
Pfizer, Berlin, Germany
Boehringer Ingelheim Pharma GmbH & Co. KG, Ingelheim, Germany
Eli Lilly and Company, Indianapolis, Indiana, USA
Department of Medical Statistics, University Medical Center Göttingen, Göttingen, Germany

*: joint first authorship
**: Corresponding author: Jan Beyersmann, [email protected]

Abstract

Background:

Analyses of adverse events (AEs) are an important aspect of benefit-risk and health-technology assessments of therapies. The SAVVY (Survival analysis for AdVerse events with Varying follow-up times) project aims to improve the analyses of AE data in clinical trials through the use of survival techniques appropriately dealing with varying follow-up times and competing events (CEs). In an empirical study including randomized clinical trials (RCT) from several sponsor organisations the effect of varying follow-up times and competing events on comparisons of two treatment arms with respect to AE risks is investigated.

Methods:

CEs definition does not only include death before AE but also end of follow-up for AEs due to events possibly related to the disease course or safety of the treatment. The comparisons of relative risks (RRs) of standard probability-based estimators to the gold-standard Aalen-Johansen estimator or hazard-based estimators to an estimated hazard ratio (HR) from Cox regression are done descriptively, with graphical displays, and using a random effects meta-analysis on AE level. The influence of different factors on the size of the bias is investigated in a meta-regression.

Results:

Ten sponsors provided seventeen RCTs including 186 types of investigated AEs. We find that decisions on categorizing the size of the effect based on RRs of AE probabilities according to guidelines issued by the German Institute for Quality and Efficiency in Health Care (IQWiG) crucially depend on the estimator chosen. We confirm for estimation of the RR concerns regarding incidence densities: the probability transform incidence density ignoring CEs performed worst. However, accounting for CEs in an analysis that parametrically mimicked the non-parametric Aalen-Johansen performed better than both one minus Kaplan-Meier and Aalen-Johansen that only considered death as a CE. In terms of relevance of effects, the choice of the estimator is key and more important than the features of the underlying data such as percentage of censoring, CEs, amount of follow-up, or value of the gold-standard RR. The analysis based on hazards revealed that the incidence density underestimates the HR of AE and CE of death hazard compared to the gold-standard Cox regression.

Conclusions:

Both the choice of the estimator of the cumulative AE probability and a careful definition of CEs are crucial in estimation of RRs. Categorization of evidence crucially depends on the chosen estimator. There is an urgent need to improve the guidelines of reporting AEs so that the Kaplan-Meier estimator and the incidence proportion are finally replaced by the Aalen-Johansen estimator with appropriate definition of CEs. For RRs based on hazards, the HR based on Cox regression has better properties than the ratio of incidence densities.

Keywords:

Aalen-Johansen estimator, Adverse Events, Competing Events, Drug Safety, Risk-benefit assessment, Health-technology assessment, Incidence Proportion, Incidence Density, Kaplan-Meier estimator, Randomized Clinical Trial

This paper reports findings of a methodological, empirical study of the SAVVY project group when comparing safety in terms of AEs between treatment arms. The motivation is that commonly employed methods to quantify absolute AE risk either do not account for varying follow-up times or for CEs and may be grouped into methods that either under- or overestimate cumulative AE probabilities in a time-to-first-event context. These concerns have been thoroughly investigated in a companion paper[1], considering only one trial arm in an opportunistic set of 186 types of AEs from seventeen RCTs from different disease indications. Here, we extend these considerations to comparing AE risk between arms, the challenge being that, say, the fact of overestimation in both arms of the same trial allows for both under- and overestimation of RRs when comparing arms.

The SAVVY project group is a collaborative effort from academia and pharmaceutical industry with the aim to improve the analyses of AE data in clinical trials through the use of survival techniques that account for varying follow-up times, censoring, and CEs. To this end, the companion paper has verified the relevance of using the Aalen-Johansen estimator[2] as the non-parametric gold-standard method when quantifying absolute AE risk. The reason is that the Aalen-Johansen estimator is the only (non-parametric) estimator that accounts for CEs, censoring, and varying follow-up times simultaneously, and, being non-parametric, does not rely on restrictive parametric assumptions as incidence densities do.

Briefly, in the companion paper we found that the one minus Kaplan-Meier estimator with simple censoring of CEs overestimates the cumulative AE probability. This is well-known and has been shown either empirically, see Schuster et al.[3] for a recent example in one single study, or analytically[4].

This paper builds on these and the one-sample results analyzed in the companion paper Stegherr et al.[1] as follows: First, we consider the same six estimators of AE probabilities as in Stegherr et al.[1]. Second, we then extend the results on the bias of estimation of absolute AE probabilities in one sample when compared to the gold-standard Aalen-Johansen estimator to an assessment of how these same estimators perform when estimating RRs between two randomized treatment arms. With this, we answer a question raised in Unkel et al.[5], namely which direction the bias goes for an estimator of a RR when based on biased one-sample estimators. Third, since in applications very often the RR for AEs is not (only) quantified via estimates of AE probabilities but also using an estimate of the HR of AE hazards, we extend the analysis to two hazard-based estimators of RR. Here, on the one hand, we investigate to which extent incidence densities may be used to approximate HR estimates from a semi-parametric Cox model. On the other hand, we investigate the relevance of CEs and their CE-specific hazards by investigating conclusions either based on a Cox model for time to AEs or based on a RR for a probability using the Aalen-Johansen estimator. Properties and estimands of various estimators to quantify RRs between two randomized arms, and when to prefer which, are discussed elsewhere[5, 6]. And finally, our analysis is based on 186 types of AEs from seventeen RCTs from several disease indications.

Individual trial data analyses were run within the sponsor organisation using SAS and R software provided by the academic project group members. Only aggregated data necessary for meta-analyses were shared and meta-analyses were run centrally at the academic institutions.

Arm E refers to the experimental treatment and Arm C to the control. Since for a time-to-event endpoint, both AE probabilities and the amount of censoring[7] are time-dependent, we will consider different evaluation times called $\tau$. These evaluation times either imposed no restriction, i.e., evaluated the estimators until the maximum follow-up time ($\tau_E$ and $\tau_C$ in each arm, respectively), or considered the minimum of quantiles of observed times in the two treatment arms; the quantiles were 100% (whose minimum over both arms we denote by $\tau_{\max} = \min\{\tau_E, \tau_C\}$), 90%, 60% and 30%. For computation of RRs for AE probabilities, in this paper we always use the same follow-up time in both arms, i.e. $\tau_{\max}$. The rationale to evaluate estimators at the latest meaningful time is that this reflects common practice for the estimation of AE risk using the incidence proportion, where simply the number of patients with a certain AE in a given time interval is divided by the total number of patients. Analyses for maximum follow-up (which may be arm-specific) were not performed. For hazard-based estimators we use all available data and exclusively look at the arm-specific maximum follow-up time, i.e. $\tau_E$ and $\tau_C$, also when comparing arms. The partial likelihood estimator of the HR based on Cox regression “self-adjusts” for $\tau_{\max}$ in that it only requires data up to this time point. To limit the variability of the Cox HR we limit the considered dataset to those AEs with a frequency of ≥ 10 in each arm for hazard-based analyses.

As one-sample estimators of the AE probability we use the incidence proportion, the probability transform incidence density ignoring and accounting for CE, one minus Kaplan-Meier, and the Aalen-Johansen estimator. A brief summary of one-sample estimators is given in the companion paper[1]. An even more detailed Statistical Analysis Plan is presented elsewhere [6].
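To make the five one-sample estimators concrete, the following is a minimal sketch in Python and not the SAS or R code used in the project. It assumes a hypothetical data layout (`time` as the observed time per patient and `status` coded 0 = censored, 1 = AE, 2 = CE) and uses the standard textbook forms of the estimators; all function names are illustrative.

```python
import numpy as np

def incidence_proportion(time, status, tau):
    """Share of patients with an observed AE by time tau."""
    time, status = np.asarray(time, float), np.asarray(status)
    return np.mean((status == 1) & (time <= tau))

def incidence_densities(time, status, tau):
    """AE and CE incidence densities: events up to tau per unit of time at risk."""
    time, status = np.asarray(time, float), np.asarray(status)
    time_at_risk = np.minimum(time, tau).sum()
    id_ae = np.sum((status == 1) & (time <= tau)) / time_at_risk
    id_ce = np.sum((status == 2) & (time <= tau)) / time_at_risk
    return id_ae, id_ce

def prob_transform_ignoring_ce(time, status, tau):
    """Constant-hazard AE probability that ignores CEs (one minus Kaplan-Meier-like)."""
    id_ae, _ = incidence_densities(time, status, tau)
    return 1.0 - np.exp(-id_ae * tau)

def prob_transform_accounting_ce(time, status, tau):
    """Constant-hazard AE probability that accounts for CEs (Aalen-Johansen-like)."""
    id_ae, id_ce = incidence_densities(time, status, tau)
    total = id_ae + id_ce
    if total == 0:
        return 0.0
    return id_ae / total * (1.0 - np.exp(-total * tau))

def km_and_aj(time, status, tau):
    """Return (one minus Kaplan-Meier with CEs censored, Aalen-Johansen) at tau."""
    time, status = np.asarray(time, float), np.asarray(status)
    surv_km = 1.0   # Kaplan-Meier for time to AE, CEs treated as censoring
    surv_all = 1.0  # probability of no AE and no CE just before the current time
    cif_ae = 0.0    # Aalen-Johansen cumulative incidence of AE
    for t in np.unique(time[(status > 0) & (time <= tau)]):
        n_risk = np.sum(time >= t)
        d_ae = np.sum((time == t) & (status == 1))
        d_ce = np.sum((time == t) & (status == 2))
        cif_ae += surv_all * d_ae / n_risk
        surv_all *= 1.0 - (d_ae + d_ce) / n_risk
        surv_km *= 1.0 - d_ae / n_risk
    return 1.0 - surv_km, cif_ae

# Toy data (invented for illustration): two AEs, two CEs, two censored patients.
time = [1, 2, 3, 4, 5, 6]
status = [1, 2, 0, 1, 2, 0]
one_minus_km, aj = km_and_aj(time, status, tau=6)
```

On this toy sample the incidence proportion lies below the Aalen-Johansen estimate, which in turn lies below one minus Kaplan-Meier, in line with the directions of bias discussed in the text.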

For a time-to-event endpoint, the incidence density estimates the hazard function assuming that it is constant[6, 1]. The ratio of two incidence densities therefore estimates a HR under the restrictive assumption that the hazard functions of both treatment arms are constant. The common Cox model is a semi-parametric extension which only requires the HR to be constant, but not the hazards in the arms.

If we consider a time-to-event endpoint with just one possible event, e.g. death for overall survival, then one minus the survival function, i.e. the expected proportion of patients with the event of interest over time, bears a direct relationship to the (cumulative) hazard. However, once we have to consider CEs, the direct relation between the hazard for a given event and the cumulative incidence function, which takes the role of one minus the survival (or “event-probability”) function, breaks down[8]. As a consequence, if we want to use a hazard-based analysis to quantify the effect of treatment on the event of interest and all CEs, in theory it is necessary to report all event-specific hazards, or rather the corresponding HRs. For this reason, we will not only report the performance of hazard-based RR estimators for the endpoint of interest, time to (first) AE, but also for the competing endpoints time to a CE of death and time to all CEs (see “Definition of CEs” in Stegherr et al.[1] and section below). This is in perfect analogy to considering not only a (one minus Kaplan-Meier-like) simple probability transformation of the AE incidence density, but also an (Aalen-Johansen-like) transformation accounting for CEs.
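The relationship described above can be written out as follows. The notation is ours and not taken from the paper: $\alpha(t)$ denotes the hazard of a single event type, and $\alpha_{AE}(t)$ and $\alpha_{CE}(t)$ denote the event-specific hazards of AE and CE.

```latex
% One event type only: the event probability is a function of its own hazard.
\[
  P(T \le t) = 1 - \exp\Bigl(-\int_0^t \alpha(u)\,du\Bigr)
\]

% With competing events: the cumulative incidence of AE depends on BOTH hazards.
\[
  P(T \le t,\ \text{AE first})
  = \int_0^t \exp\Bigl(-\int_0^u \bigl\{\alpha_{AE}(v) + \alpha_{CE}(v)\bigr\}\,dv\Bigr)\,
    \alpha_{AE}(u)\,du
\]

% Special case of constant hazards (the assumption behind incidence densities):
\[
  P(T \le t,\ \text{AE first})
  = \frac{\alpha_{AE}}{\alpha_{AE} + \alpha_{CE}}
    \Bigl(1 - e^{-(\alpha_{AE} + \alpha_{CE})\,t}\Bigr)
\]
```

The second expression shows why a treatment effect on the CE hazard alone can change the AE probability even when the AE hazard is identical in both arms, which is the reason for reporting all event-specific HRs.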

For given estimators $\hat{q}_E$ and $\hat{q}_C$ of AE probabilities calculated at a specific evaluation time within each treatment arm, one can consider either the risk difference, $\widehat{RD} = \hat{q}_E - \hat{q}_C$, the RR, $\widehat{RR} = \hat{q}_E / \hat{q}_C$, or the odds ratio, $\widehat{OR} = \{\hat{q}_E / (1 - \hat{q}_E)\} / \{\hat{q}_C / (1 - \hat{q}_C)\}$. In this paper, we will focus on the RR to quantify the treatment effect. The IQWiG summarizes reasons to prefer the RR over the risk difference in Appendix A[9] of their general methods. The key feature therein that leads us to prefer the RR is that the risk difference is an absolute effect measure and as such strongly depends on the baseline risk in the control arm.

Finally, we prefer the RR over the OR because it is easier to interpret in the sense that it is an immediate comparison of the cumulative AE probabilities estimated in the one-sample case[1].

Variance estimators are easily obtained via the delta rule. In the one-sample case, estimates of AE probabilities were benchmarked on the gold-standard Aalen-Johansen estimator with the primary definition of CEs, i.e. considering all clinical events described below as CEs. This is because the latter is a fully nonparametric estimator that accounts for censoring, does not rely on a constant hazard assumption, and accounts for CEs, see Table 1 in Stegherr et al.[1]. Furthermore, as is well-known, simply taking one minus Kaplan-Meier for time-to-first-AE is a biased estimator of the AE probability in presence of CEs. Here, as a straightforward extension for the comparison of AE probabilities between two arms using the RR, we benchmark the latter on the RR estimated using the Aalen-Johansen estimator in each arm.

The gold-standard for estimates of the HR will be the HR from Cox regression. This is because the latter is typically used to quantify a treatment effect not only for efficacy, but also for time-to-first-AE type endpoints. Variances of comparisons of different estimators of the HR will be obtained via bootstrapping[4]. The reason to bootstrap is that we compare different estimators of the same quantity based on one data set, leading to the different estimators being dependent.

As we base our analyses on real datasets we do not know the underlying true effects, either RRs or HRs. This is why we chose the above gold-standard estimators to benchmark the candidate estimators against. In what follows, we will still call the deviation of an estimator under consideration from its respective gold-standard “bias”, although, of course, the comparison is between estimators. We note however that, say, a comparison of one minus Kaplan-Meier and Aalen-Johansen will converge in probability towards the true or asymptotic bias when sample size tends to infinity.

The definition of events as CE (or “competing risk”) is discussed in detail in the companion paper[1]. Briefly, both death before AE and any event that would both be viewed from a patient perspective as an event of his/her course of disease or treatment and would stop the recording of the AE of interest will be viewed as a CE, including possibly disease- or safety-related loss to follow-up, withdrawal of consent, and discontinuation. The latter is our primary definition of a CE while we will also look at a CE of “death only”. Even though interpretation depends on the severity of the event, a categorization into these two types of CEs is considered here for illustrative purposes. One motivation is that, in a time-to-first-event analysis, the incidence proportion, that is, the number of AEs divided by arm size, should be unbiased in the absence of censoring. To investigate the impact of our primary CE definition, we also included an investigation of an estimator Aalen-Johansen (death only), which only treated death before AE as competing, but not the other CEs that belong to our primary definition.

An overview of how the different estimators account for the three sources of bias, i.e., censoring, no constant hazards, and CEs, is given in Table 1 in Stegherr et al.[1].

To explicitly investigate the role of censoring without the methodological complication of CEs, a composite endpoint where AEs and CEs are combined into one single event is considered. As a consequence, the gold-standard in this setting is the one minus Kaplan-Meier estimator which is compared to the incidence proportion.

Random effects meta-analysis and meta-regression

In the meta-analysis and meta-regression, the ratios of the RR estimates, either based on probability or hazard estimates, obtained with one of the other estimators divided by the RR estimate obtained with the gold-standard Aalen-Johansen estimator (for probabilities) or the Cox regression HR (for hazard based) are considered on the log-scale. The standard errors of these log-ratios are calculated with a bootstrap. Then, a normal-normal hierarchical model is fitted and the exponential of the resulting estimate can be interpreted as the average ratio of the two RR estimators.

In a meta-regression it is further investigated which variables impact this average ratio. Therefore, the proportion of censoring, the proportion of CEs, the evaluation time point $\tau$ in years, and the size of the RR under consideration estimated by the gold-standard are included as covariates in a univariable and a multivariable meta-regression. For the latter, since the sum of the proportion of censoring, the proportion of CEs, and the value of the gold-standard Aalen-Johansen estimator converges to 1, including all of them in the model would lead to collinearity. For that reason, we omit the proportion of CEs in the model. All covariates are centered in the meta-regressions.

The impact of the use of the different estimators on the conclusions derived from the comparison of treatment arms is investigated by the use of categories. These are typically derived from comparing the confidence interval (CI) of the RR to thresholds. There is no universally accepted standard how one should combine a point estimate and its associated variability, in our case RR, into evidence categories. As an example, we use a categorization motivated by the methods put forward by the IQWiG[9] for severe AEs (Table 14) to be used for the German benefit-risk assessment. In contrast to the usual IQWiG procedure, however, we do not only categorize the benefit of a therapy, but also the harm. Thereby, in this first analysis we do not distinguish between a positive and a negative treatment effect. Four categories are possible: (0) “no effect” if 1 is included in the CI, (a) “minor” (“gering”) if the upper bound of the CI is in the interval [0.9, 1) for a RR < 1 or the lower bound is in the interval (1, 1.11] for a RR > 1, (b) “considerable” (“beträchtlich”) if the upper bound of the CI is in the interval [0.75, 0.9) for a RR < 1 or the lower bound is in the interval (1.11, 1.33] for a RR > 1, and (c) “major” (“erheblich”) if the upper bound is smaller than 0.75 for a RR < 1 or the lower bound is larger than 1.33 for a RR > 1. The same categorization is used for the HR instead of RR.
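The step from two arm-wise AE probabilities to an evidence category can be sketched as follows. This is an illustration under stated assumptions, not the project code: the CI is a Wald interval on the log scale obtained with the delta rule for two independent arms, and the thresholds are taken as 0.9 and 0.75 for RR < 1 with the mirrored values 1.11 and 1.33 for RR > 1. Function names are hypothetical.

```python
import math

def rr_with_ci(q_e, se_e, q_c, se_c, z=1.959964):
    """RR = q_e / q_c with a 95% Wald CI on the log scale (delta rule).

    Assumes independent arms, so Var(log RR) ~ (se_e/q_e)^2 + (se_c/q_c)^2.
    """
    rr = q_e / q_c
    se_log_rr = math.sqrt((se_e / q_e) ** 2 + (se_c / q_c) ** 2)
    lower = rr * math.exp(-z * se_log_rr)
    upper = rr * math.exp(z * se_log_rr)
    return rr, lower, upper

def evidence_category(lower, upper):
    """Map a CI for the RR (or HR) to one of the four categories."""
    if lower <= 1.0 <= upper:
        return "no effect"
    if upper < 1.0:
        # RR < 1: the upper CI bound decides.
        if upper < 0.75:
            return "major"
        if upper < 0.9:
            return "considerable"
        return "minor"
    # RR > 1: the lower CI bound decides, with mirrored thresholds.
    if lower > 1.33:
        return "major"
    if lower > 1.11:
        return "considerable"
    return "minor"

# Invented example: AE probabilities 0.20 vs 0.10, each with standard error 0.02.
rr, lower, upper = rr_with_ci(0.20, 0.02, 0.10, 0.02)
category = evidence_category(lower, upper)
```

Because only the CI bound nearest to 1 matters, an estimator that shifts the RR slightly or changes the CI width can move an AE into a neighboring category even when the point estimates look similar.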

Ten organizations provided 17 trials including 186 types of AEs (median 8; interquartile range [3; 9]). Twelve (70.6% out of 17) trials were from oncology, nine (52.9%) were actively controlled and eight (47.1%) were placebo controlled. Median follow-up was 927 days in Arm E (interquartile range [449; 1380]), 896 days in Arm C (interquartile range [308; 1263]), and 856 days (interquartile range [308; 1234]) in both arms combined. The trials included between 200 and 7171 patients (median: 443; interquartile range [411; 1134]). In Arm E, estimated values from the Aalen-Johansen estimator ranged from 0 to 0.95 with a median of 0.09; in Arm C the range was from 0 to 0.77 with a median of 0.05. RRs based on these estimates ranged from 0.28 to 16.81 with a median of 1.71. HRs for AE hazards (restricted to AEs with ≥ 10 events in each arm) ranged from 0.14 to 10.83, with a median of 1.30.

Figure 1: Relative frequency of observed events, per treatment arm.

Figure 1 displays for the 186 types of AEs boxplots of the treatment-arm specific observed relative frequencies, i.e., the number of a specific type of event divided by the total number of patients. Events considered are “observed AE”, “observed death before AE”, “observed other CEs” (i.e. excluding death), and “observed censoring”. Within each arm, the figure illustrates a smaller amount of observed censoring compared to observed CEs. That is, AE recording often ended due to death or other CEs such as treatment discontinuation, preventing censoring of the time to AE. Comparing the arms we observe more AEs in the treatment Arm E, a comparable number of deaths, and more CEs in the control Arm C. All combined, this leads to less censoring in the control Arm C.

In this paragraph we summarize a key finding of this paper: namely, that categorization of evidence based on RR crucially depends on the estimator one uses to estimate the RR. Table 1 shows the evidence categories for our considered estimators of the RR of those AEs where neither the estimated AE probability in Arm E nor the estimated AE probability in Arm C is 0 (n = 155 types of AEs for one minus Kaplan-Meier and n = 156 types of AEs for all other estimators).

Overall, we find quite a number of switches to neighboring categories. Reasons for switches are wider CIs of the Aalen-Johansen estimator as well as RR estimates / CI bounds that are close to the cutoffs between categories. As the incidence proportion on average estimates the RR well (third column in Table 2), we see a similar number of switches to a higher (n = 8, below the diagonal in Table 1) and lower (n = 9) evidence category. Interestingly, while on average the probability transform of the incidence density accounting for CEs is approximately unbiased as well, we see twice as many effect upgradings (n = 14) as downgradings (n = 7). Quite logically, for those estimators that underestimate the RR with respect to the gold-standard Aalen-Johansen estimator, namely probability transform incidence density ignoring CEs, one minus Kaplan-Meier, and Aalen-Johansen (death only), we see relevantly more switches to a lower than higher evidence category, namely n = 41 / n = 16, 32 / 8, and 28 / 6, respectively.

Table 1: The impact of the choice of relative effect estimator for AE probabilities on qualitative conclusions. Diagonal entries are set in bold face. Deviations from the gold-standard Aalen-Johansen estimator are the off-diagonal entries. Off-diagonal zeros are omitted from the display. Columns: categories according to the gold-standard Aalen-Johansen estimator ((0) no effect, (a) minor, (b) considerable, (c) major). Row blocks, each with the same four categories: incidence proportion; probability transform incidence density ignoring CE; one minus Kaplan-Meier; probability transform incidence density accounting for CE; Aalen-Johansen (death only).

Switches between categories are rarer for a given estimator for earlier follow-up times, mainly because of increased variability (results not shown).

In summary, the choice of the estimator of the RR does have an impact on the conclusions. In what follows, we will describe the different properties of the considered estimators that ultimately lead to this relevant number of diverging conclusions.

Panel A of Figure 2 shows box plots of the ratio of the one-sample estimators defined earlier divided by the gold-standard Aalen-Johansen estimator, separately per treatment arm. Boxplots for Arm E are slightly different compared to the companion paper, because we use here $\tau_{\max}$ as evaluation time. Briefly, incidence proportion and gold-standard often perform comparably. Probability transforms of the incidence density perform worst when ignoring CE, but when accounting for CE perform much better than the other three procedures which are clearly biased with many examples of extreme overestimation. Also, the incidence proportion displays examples of biases (downwards), with underestimation of up to 67%. Comparing the two arms, overestimation of the AE probability is more pronounced in Arm C. These biases become less pronounced when looking at earlier evaluation times which prevent CEs and censoring after the respective horizon from entering calculations (results not shown).

Figure 2: Panel A: Ratios of one-sample estimators per treatment arm. The denominator is always the gold-standard Aalen-Johansen estimator. Panel B: Ratio of RRs estimated with estimator of interest and the gold-standard Aalen-Johansen estimator. Panel C: Kernel density estimates of the RR based on AE probabilities of the estimators divided by the gold-standard Aalen-Johansen estimator.

Meta-analyses of all estimators divided by the gold-standard Aalen-Johansen estimator are displayed in the first two columns of Table 2. These results confirm the visual impression gathered from the boxplots in Panel A of Figure 2, but we note that Panel A of Figure 2 also displays biases much more pronounced than the meta-analytical averages. In general, the amount of overestimation increased with later evaluation times (results for earlier evaluation times not shown). Further investigations included uni- and multivariable meta-regressions, see the companion paper[1] for Arm E results. Results for Arm C reflect the different event pattern described above and are consistent (data not shown).

Table 2: Results of the meta-analyses of the log ratio of the estimator of interest divided by the gold-standard Aalen-Johansen estimator. The first two columns show the estimated average ratio and 95% CI per treatment arm and meta-analyze the data shown in Panel A of Figure 2. The third column gives results of the meta-analyses of the response variable log ratio of the RRs estimated with the estimator of interest and the gold-standard Aalen-Johansen estimator, the estimated average ratio and 95% CI. The denominator is the RR obtained using the gold-standard Aalen-Johansen estimator. This third column relates to Panel B in Figure 2.

Estimator                                   Experimental           Control                Ratio of RR with 95% CI
Incidence Proportion                        0.974 [0.966; 0.982]   0.978 [0.970; 0.985]   0.997 [0.991; 1.002]
Probability Transform of the
  Incidence Density ignoring CE             1.817 [1.733; 1.904]   2.424 [2.249; 2.613]   0.732 [0.703; 0.763]
One minus Kaplan-Meier                      1.187 [1.161; 1.214]   1.321 [1.257; 1.389]   0.838 [0.786; 0.894]
Probability Transform of the
  Incidence Density Accounting for CE       1.099 [1.080; 1.118]   1.124 [1.093; 1.156]   0.977 [0.957; 0.997]
Aalen-Johansen (death only)                 1.146 [1.125; 1.168]   1.254 [1.201; 1.308]   0.860 [0.811; 0.911]
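Average ratios of the kind reported in Table 2 come from a random-effects model fitted to log ratios with bootstrap standard errors. The sketch below shows one way such a pooled ratio can be computed. It is an assumption on our part that a DerSimonian-Laird moment estimator is used for the between-AE variance; the text only states that a normal-normal hierarchical model is fitted, so this is a stand-in rather than the authors' method. The function name is hypothetical.

```python
import numpy as np

def random_effects_meta(log_ratios, std_errors):
    """Pool log ratios under a normal-normal model (DerSimonian-Laird estimate).

    Needs at least two log ratios. Returns the average ratio, its 95% CI,
    and the estimated between-AE variance tau2, all but tau2 on the ratio scale.
    """
    y = np.asarray(log_ratios, float)
    v = np.asarray(std_errors, float) ** 2
    w = 1.0 / v                                   # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)            # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)       # between-AE variance, truncated at 0
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return np.exp(mu), np.exp(mu - 1.96 * se), np.exp(mu + 1.96 * se), tau2

# Invented log ratios of (RR from estimator of interest) / (RR from Aalen-Johansen).
avg_ratio, ci_low, ci_high, tau2 = random_effects_meta(
    np.log([0.70, 0.75, 0.80, 0.72]), [0.05, 0.08, 0.06, 0.10]
)
```

An average ratio below 1 means that the estimator of interest yields, on average, a smaller RR than the gold-standard.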

Panel B in Figure 2 displays boxplots of ratios of RRs estimated with estimator of interest and the gold-standard Aalen-Johansen estimator. Interestingly, dividing the two biased estimates of the AE probability based on the incidence proportion, which both tend to underestimate the true AE probability, leads to an estimate of the RR that on average performs comparably to the Aalen-Johansen estimator. However, note that compared to the latter, we see instances with an overestimation of the RR of just short of a factor 3. Performance is comparable for one minus Kaplan-Meier and Aalen-Johansen (death only): they both overestimate arm-specific AE probabilities but generally underestimate the RR compared to the gold-standard. This is even more pronounced for the probability transform of the incidence density ignoring CEs: it overestimates AE probabilities most, resulting in generally largest underestimation of the RR compared to the gold-standard. Finally, the probability transform of the incidence density accounting for CEs has a performance comparable to the incidence proportion for estimation of RR. This is in spite of quite different patterns for estimation of AE probabilities as displayed in Panel A of Figure 2. Apart from shedding light on estimation quality of individual estimators the latter is a key conclusion from comparing Panel A and Panel B of Figure 2: different patterns of under- or overestimation of AE probabilities can lead to similar performance for RR.

This implies that in general, one cannot conclude how an estimator of the relative AE risk performs by looking at how the same estimator performs on estimation of arm-wise AE probabilities.

It is interesting to see that the estimators estimating a higher AE probability than the gold-standard Aalen-Johansen estimator, namely probability transform incidence density ignoring CEs, one minus Kaplan-Meier, and Aalen-Johansen (death only), yield a smaller RR compared to the gold-standard. This is because all these estimators do not appropriately take into account CEs and overestimate the AE probability the more, the more CEs there are in an arm. Since we have more CEs in Arm C, this eventually leads to an underestimation of the RR. The incidence proportion and the probability transform incidence density accounting for CEs correctly deal with CEs, leading on average to good estimation performance of the RR.

As discussed in Stegherr et al.[1] one reason for the good performance of the incidence proportion might be a high amount of CEs before possible censoring. However, not only the proportion of censoring but also the timing of the censoring are relevant, as illustrated in Stegherr et al.

Factors that influence the respective behaviour of a given estimator are discussed below.
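The mechanism described above, overestimation in both arms turning into underestimation of the RR, can be reproduced with a small calculation. The sketch assumes constant hazards and uses invented hazard values in which Arm C has the higher CE hazard; it compares the AE probability that accounts for CEs with the quantity that one minus Kaplan-Meier targets when CEs are simply censored.

```python
import math

def cif_ae(haz_ae, haz_ce, tau):
    """AE probability by tau under constant AE and CE hazards (accounts for CEs)."""
    total = haz_ae + haz_ce
    return haz_ae / total * (1.0 - math.exp(-total * tau))

def naive_ae_prob(haz_ae, tau):
    """Target of one minus Kaplan-Meier when CEs are censored: ignores the CE hazard."""
    return 1.0 - math.exp(-haz_ae * tau)

tau = 2.0
# Hypothetical hazards per year; the CE hazard is higher in the control arm.
arm_e = {"haz_ae": 0.2, "haz_ce": 0.3}
arm_c = {"haz_ae": 0.1, "haz_ce": 0.6}

true_e = cif_ae(arm_e["haz_ae"], arm_e["haz_ce"], tau)
true_c = cif_ae(arm_c["haz_ae"], arm_c["haz_ce"], tau)
naive_e = naive_ae_prob(arm_e["haz_ae"], tau)
naive_c = naive_ae_prob(arm_c["haz_ae"], tau)

over_e = naive_e / true_e      # overestimation factor in Arm E
over_c = naive_c / true_c      # overestimation factor in Arm C
rr_true = true_e / true_c      # RR based on CE-adjusted probabilities
rr_naive = naive_e / naive_c   # RR based on probabilities ignoring CEs

print(f"overestimation: Arm E {over_e:.2f}, Arm C {over_c:.2f}")
print(f"RR accounting for CEs {rr_true:.2f}, RR ignoring CEs {rr_naive:.2f}")
```

With these values both arm-wise probabilities are overestimated, more strongly in Arm C, and the RR drops from about 2.35 to about 1.82, the same direction of bias as observed empirically for the estimators that ignore CEs.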

The third column in Table 2 displays the results of the random-effects meta-analyses for the log ratio of the two RRs. These are in line with the results discussed above. The average ratio between the RR calculated with the incidence proportion and RR calculated with the Aalen-Johansen estimator is close to 1. The biggest underestimation is observed for the probability transform incidence density ignoring CE, with an average difference of the risks of about 27%. The Aalen-Johansen (death only) estimator has a more pronounced underestimation compared to its counterpart accounting for all CE, but this reduction is not as pronounced as the one using the one minus Kaplan-Meier or the probability transform of the incidence density ignoring CEs. Finally, the probability transform of the incidence density accounting for CEs underestimates only slightly more on average than the incidence proportion. In general, these differences decrease when considering earlier evaluation times (data not shown).

For the meta-regressions, we use as covariates the percentage of censoring (for both arms combined), the percentage of CEs (for both arms combined), the maximum follow-up time, and the size of the RR as estimated by the gold-standard Aalen-Johansen estimator. Covariates were centered, i.e., the row “average RR” contains the average ratio of the RR of the estimator of interest and the RR of the Aalen-Johansen estimator if the covariate takes its mean. These means were 28.6% censoring (= percentage of censored observations until τ_max), 53.8% CEs, 2.38 years (781 days) evaluation time, and a RR of 2.55. Table 3 provides results from the uni- and multivariable analysis.

We illustrate the interpretation of the parameters of the meta-regression models by the following example calculation: The average ratio of the RR calculated with the probability transform of the incidence density ignoring CEs and the RR calculated with the Aalen-Johansen estimator at τ_max under 28.6% censoring is 0.729. If a trial has 38.6% censoring, i.e. an increased censoring proportion of 10 percentage points, the average ratio is estimated as 0.729 · 1.066 = 0.777.

So far, we have primarily discussed how the different estimators perform compared to the gold-standard Aalen-Johansen estimator on average. In a first step, variability of the RRs of every estimator with respect to the gold-standard Aalen-Johansen estimator can be assessed from the switches between RR categories in Table 1 and the boxplots in Panel B in Figure 2. Here, we provide a more detailed account of the variability of the RRs using kernel density estimates of their distribution. As in the one-sample scenario, on average the incidence proportion appears to provide a good estimator of relative AE risk based on probabilities. Considering the plot of the kernel density estimates of the RRs of the AE probability in Panel C of Figure 2, the ratio of the RR based on the incidence proportion and the gold-standard is most often close to one. However, there is also a peak of the estimated kernel density at larger ratios, indicating that the estimators are not always comparable. For the RR based on the ratio of the probability transform of the incidence density accounting for CEs and the gold-standard we still have clustering around one, but not as pronounced as for the incidence proportion. The ratios of the one minus Kaplan-Meier or Aalen-Johansen (death only) estimator have fewer values close to one. For these two estimators more values are smaller than one than larger. The estimated kernel density of the probability transform of the incidence density ignoring CE has no peak at one but is bimodal with both modes below one.

Table 3: Average RR and multiplicative change by 10% increase in amount of censoring, 10% increase in CEs, one additional year of observation or a 0.1 greater RR from univariable and multivariable meta-regressions. The RR is estimated by the gold-standard Aalen-Johansen estimator.

Columns: incidence proportion; probability transform incidence density ignoring CE; one minus Kaplan-Meier; probability transform incidence density accounting for CE; Aalen-Johansen (death only).
Rows, univariable meta-regression: % censoring (average RR, 10% increase); % CEs (average RR, 10% increase); size of RR (average RR, increase of 0.1); evaluation time (average RR, one additional year).
Rows, multivariable meta-regression: average RR; % censoring (10% increase); size of RR (increase of 0.1); evaluation time (one additional year).
[Numeric entries (estimates with 95% CIs) are not recoverable from the source.]

Panel A of Figure 3 displays boxplots of the HRs calculated from Cox regression. The three boxplots display HRs for hazards of AE, all CEs, and a CE of death. Assessing the effect for the endpoint of interest, here time to AE, as well as of any CE, here time to all CEs or time to a CE of death, is generally recommended for any (hazard-based) analysis of competing risks[8].
We find that the hazard of AE is generally larger for Arm E compared to Arm C, meaning that the instantaneous risk of AE is typically higher, not unexpectedly. For the hazards of CEs, for both types, we find that the hazard in Arm E is generally lower than in Arm C, i.e. there is an effect of the experimental treatment on the CE. If we simply censored at CEs we would thus introduce arm-dependent censoring, a feature that may lead to biased effect estimates[10, 11]. We will use this to explain observations we make about the conclusions of the RR of the gold-standard Aalen-Johansen estimator and the HR of the Cox model in Table 6 below.

Boxplots of the ratios of incidence density estimates in each arm, evaluated at τ_E and τ_C, respectively, and the gold-standard HR calculated from Cox regression, are provided in Panel B in Figure 3, again for the same endpoints as in Panel A. The ratio of the incidence densities of the AE in the two arms underestimates with respect to the Cox regression HR, while for the other two endpoints on the median they turn out to be approximately unbiased compared to the Cox HR, with a tendency to overestimation when accounting for all CEs and underestimation when only accounting for a CE of death.

To appreciate the differences between the two estimators of the RR based on hazards, i.e. the incidence density ratio and the gold-standard Cox regression HR, recall the properties of the two methods: Both properly account for censoring and they properly estimate event-specific hazards, or rather the relative effect based on these. The only difference between the two methods is what they assume about the shape of the underlying hazard: the incidence density assumes them to be constant up to the considered follow-up time, which also implies that they are proportional. The gold-standard Cox regression HR only assumes them to be proportional, but not constant.
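The two assumptions can be written side by side. The notation here is assumed for illustration and is not taken from the paper: α_a(t) denotes the event-specific hazard in arm a ∈ {E, C}, d_a the number of observed events of the considered type in arm a, and T_a the total time at risk in arm a up to τ_a.

```latex
% Constant-hazard assumption behind the incidence density (ID):
\alpha_a(t) = \alpha_a \quad (t \le \tau_a), \qquad
\widehat{\mathrm{ID}}_a = \frac{d_a}{T_a}, \qquad
\frac{\widehat{\mathrm{ID}}_E}{\widehat{\mathrm{ID}}_C}
  \ \text{estimates} \ \frac{\alpha_E}{\alpha_C}.

% Proportional-hazards assumption behind the Cox regression HR:
\alpha_E(t) = \alpha_C(t)\, e^{\beta}, \qquad \mathrm{HR} = e^{\beta}.
```

Constant hazards in both arms are the special case in which α_C(t) does not depend on t. The two estimators can therefore be expected to agree only to the extent that this stronger assumption is adequate for the data at hand.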
But in addition, we also have different data patterns for the three event causes for which we show boxplots of relative hazard-based risks in Panel B in Figure 3: looking at Figure 1 we find a low proportion of events for AE and an even lower proportion of CEs of death, but a higher proportion of events for all CEs. Furthermore, within a given patient, death happens later than AE. So the results we observe in Panel B in Figure 3 are a result of different tradeoffs between all these aspects.

If we restrict follow-up to earlier timepoints, then variability increases and on average, results persist for the events of AE and CE of death, but for all CEs the ratio of incidence densities underestimates with earlier timepoints (data not shown). At earlier timepoints the number of events for all CEs also decreases, so that the tradeoff with the constant hazard assumption starts to resemble that of an event of AE.

Figure 3: Panel A: Cox regression HRs (on log-scale) for the three event types AE, all CEs, and CE of death. Panel B: Ratio of RRs estimated with estimator of interest and the gold-standard HR based on Cox regression estimator, all evaluated at τ_E and τ_C. Panel C: Plots of the kernel density estimates of the HR of the estimators divided by the gold-standard Cox-regression estimator.

Table 4 confirms these findings in univariable meta-analyses and provides an average quantification of the amount of underestimation of all three estimators relative to the Cox regression HR.

Table 4: Results of the meta-analyses of the response variable log ratio of the ratio of the estimator of interest and the HR estimated by the Cox model. Estimated average ratio and 95% CI. The denominator is the HR obtained using the Cox model.

Estimator                                  Ratio with 95% CI
Ratio incidence density AE                 0.803 [0.741; 0.871]
Ratio incidence density all events CE      0.908 [0.851; 0.969]
Ratio incidence density (death only) CE    0.958 [0.934; 0.982]

For the meta-regressions reported in Table 5, in line with what has been done for RR in Table 3, we again use as covariates the percentage of censoring, the percentage of CEs, the maximum follow-up time, and the size of the HR as estimated by the gold-standard Cox regression. Means at which covariates were centered were 29.6% censoring (= mean percentage of censored observations until τ_E and τ_C), 56.7% CEs, 2.72 years (994 days) maximum follow-up time, and a HR of 1.74. Note that these are slightly different from the ones reported above because we are using a different follow-up time here. In the univariable analyses in Table 5, we find an estimated average RR below one for time to AE and time to a CE of death, i.e. a lower average RR compared to the gold-standard Cox regression estimator, for all covariates. For time to all CEs the ratio of incidence densities overestimates on average. The most pronounced covariate effects are found for time to AE. For example, the average ratio of the HR based on the incidence density and the gold-standard Cox regression HR under 29.6% censoring is 0.825. If a trial has 39.6% censoring, i.e. an increased censoring proportion of 10 percentage points, the average ratio is estimated as 0.825 · 1.052 = 0.868.

As for the estimation of RRs based on probabilities, we provide a more detailed account of the variability of the HRs using kernel density estimates of their distribution. Considering the plot of the kernel density estimates of the ratios of the HRs in Panel C in Figure 3, the ratio of the HRs for time to AE has its highest peak just below one, but also further peaks that are even smaller than one, indicating that the estimators are not always comparable. The estimated density for the ratio of HRs for time to a CE of death is multimodal, with the highest peak further below one than for time to AE. A higher proportion of ratios of HRs for time to all CEs is close to one, but also this estimated density is multimodal. All densities are left-skewed, indicating that there is a relevant portion of AEs for which the ratio of incidence densities underestimates compared to the gold-standard Cox regression HR for time to all CEs.

Table 5: Univariable and multivariable meta-regression for the response variable log ratio of the HRs, estimated with the estimator of interest and the gold-standard Cox regression. The size of the HR is estimated by the Cox model. Note that for the incidence density of death only CE the percentage of CEs corresponds to the percentage of deaths and for the other two estimators it is the percentage of all events CEs.

                                      incidence density     incidence density     incidence density
                                      of AE                 of all events CE      of death only CE
Univariable meta-regression
% censoring      average HR           0.825 [0.784; 0.868]  1.063 [1.045; 1.081]  0.962 [0.938; 0.986]
                 10% increase         1.052 [1.031; 1.073]  0.990 [0.983; 0.996]  1.026 [1.015; 1.036]
% CEs            average HR           0.786 [0.741; 0.834]  1.061 [1.044; 1.079]  0.942 [0.913; 0.971]
                 10% increase         0.948 [0.925; 0.971]  1.011 [1.004; 1.017]  1.041 [1.013; 1.071]
size of HR       average HR           0.836 [0.790; 0.884]  1.063 [1.048; 1.078]  0.959 [0.932; 0.987]
                 increase of 0.1      1.004 [1.000; 1.008]  0.972 [0.966; 0.977]  0.996 [0.987; 1.005]
evaluation time  average HR           0.854 [0.809; 0.902]  1.053 [1.040; 1.067]  0.976 [0.944; 1.010]
                 one additional year  0.937 [0.905; 0.970]  0.945 [0.936; 0.955]  0.977 [0.954; 1.000]
Multivariable meta-regression
                 average HR           0.846 [0.809; 0.884]  1.054 [1.042; 1.066]  1.003 [0.979; 1.029]
% censoring      10% increase         1.058 [1.040; 1.076]  1.004 [0.999; 1.009]  1.038 [1.029; 1.048]
size of HR       increase of 0.1      1.005 [1.002; 1.008]  0.980 [0.974; 0.986]  0.993 [0.986; 0.999]
evaluation time  one additional year  0.933 [0.907; 0.960]  0.957 [0.948; 0.966]  0.949 [0.931; 0.968]
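The multiplicative reading of the meta-regression coefficients can be sketched as a small calculation. This assumes a log-linear model with a centered covariate, which is how the response variable (log ratio) and the reported "average" and "increase" rows suggest the coefficients should be read. The function name and its arguments are hypothetical. The numbers are the univariable Table 5 entries for the incidence density of AE and % censoring.

```python
def predicted_ratio(average_ratio, change_per_step, covariate, covariate_mean, step):
    """Predicted average ratio at `covariate` under a log-linear meta-regression.

    average_ratio   : average ratio when the covariate equals its mean
    change_per_step : multiplicative change per `step` units of the covariate
    """
    n_steps = (covariate - covariate_mean) / step
    return average_ratio * change_per_step ** n_steps


# Table 5, univariable, incidence density of AE: average 0.825 at 29.6% censoring,
# multiplicative change 1.052 per 10 percentage points of additional censoring.
ratio_at_39_6 = predicted_ratio(0.825, 1.052, covariate=39.6, covariate_mean=29.6, step=10.0)
print(round(ratio_at_39_6, 3))  # 0.868
```

A covariate value below the mean works the same way: the exponent becomes negative and the average ratio is divided by the corresponding factor.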

Table 6 provides a comparison of the conclusions drawn from the estimators of a RR for time to AE. The majority of AEs either lead to “no effect” or an effect of “major”, and these are quite consistently detected by the two methods. However, we also observe for 19/94 = 20.2% of AEs a diverging conclusion, following from the combination of bias and variability in estimation described above.

Table 6: Conclusions of the RR calculated with the ratio of incidence densities at τ_E, τ_C and conclusions of the RR calculated with the gold-standard Aalen-Johansen estimator at τ_max compared to the conclusions of the HR calculated with the Cox model at τ_E, τ_C. The table shows the analysis of those n = 94 AEs that were observed with a frequency of ≥ [threshold and table body not recoverable from the source].

We have considered two effect measures to quantify the RR of an AE in two arms: the RR based on AE probabilities evaluated at τ_max and the HR with maximum available follow-up in both arms, where Cox’ method of estimating the HR implicitly leads to an evaluation at τ_max, too. Our analyses reveal that all the considered estimators are overall inferior to the two gold-standards we considered, either the RR based on the arm-wise Aalen-Johansen estimator or the HR based on Cox regression. One question that remains is whether the qualitative conclusions drawn based on the two gold-standards are relevantly different when relying on the criteria put forward by the IQWiG (Table 14 in their general methods document[9]). Table 6 has the results. We observed quite some different classifications based on the two estimates of the RR. However, this is not a surprise, as the estimand the two methods look at is not the same: the Cox HR quantifies a relative effect based on an endpoint of AE hazard, whereas the RR based on the gold-standard Aalen-Johansen estimator is based on a comparison of probabilities at an evaluation time. The latter integrates the hazard for the endpoint of interest and the hazard for CE into one cumulative effect measure, whereas a Cox regression only considers one hazard at a time, and this is likely the primary reason for the divergent decisions in the lower part of Table 6.
Empirically, if the boxplot for the HR for the CE in Figure 3 were centered around one, then (ignoring the fact that the categorization also takes into account uncertainty) in theory the decision based on RR and HR would approximately coincide, i.e. we would have no non-diagonal entries in the lower part of Table 6. But whenever there is an effect on the CE, then it is expected that decisions diverge.

Summarizing all these analyses concerning effect quantification for AEs using Cox regression based HRs, we conclude that the ratio of incidence densities cannot be considered a uniformly good approximation of the HR based on Cox regression. We also find that categorization of the relevance of differences between treatment arms may differ depending on whether it is based on one event-specific hazard alone (Cox for AE) or on a proper probability estimator (Aalen-Johansen, integrating the two event-specific hazards).

In this section, we aim to explore how the incidence proportion and one minus Kaplan-Meier compare in the absence of CEs. To this end, we consider a composite endpoint where AEs and CEs are combined into one single event. The gold-standard in this setting is the one minus Kaplan-Meier estimator which is compared to the incidence proportion in Figure 4. Since for this endpoint we do not have CEs, the conclusions on relative effects from the probability- and hazard-based analyses are aligned in the sense of the direction of the effect.

Figure 4: Ratios of probability estimates (left and middle) and RR (right) based on incidence proportion of the composite endpoint combining AE and CE divided by the corresponding estimate of composite one minus Kaplan-Meier estimator.

As visible in the left and middle boxplot in Figure 4, in the composite endpoint analysis underestimation by the incidence proportion is more pronounced than in the analysis of the AE probability presented above.
As discussed by Stegherr et al.[1], one reason for this observation is that even in the presence of censoring, for the one minus Kaplan-Meier estimator the type of the last event is most important. If the last event is an AE or CE the one minus Kaplan-Meier estimator is equal to one, even though censoring has been observed at earlier follow-up times. The incidence proportion is only equal to one if no censoring is observed. For the RR this leads to an overestimation compared to the gold-standard one minus Kaplan-Meier estimator.
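The role of the last event can be illustrated with a small hand-rolled sketch. The data are a hypothetical toy sample of four patients, not taken from the SAVVY trials, and the function names are chosen for illustration only.

```python
def one_minus_km(times, events):
    """One minus Kaplan-Meier at the largest observed time.

    events[i] is True for a composite event (AE or CE) and False for censoring.
    """
    surv = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - d / at_risk
    return 1.0 - surv


def incidence_proportion(events):
    """Number of patients with an observed event divided by all patients."""
    return sum(events) / len(events)


# Hypothetical sample: one patient is censored at time 2, the last observation is an event.
times = [1, 2, 3, 4]
events = [True, False, True, True]

print(one_minus_km(times, events))      # 1.0
print(incidence_proportion(events))     # 0.75
```

Because the patient with the largest observed time has an event, the Kaplan-Meier curve drops to zero and one minus Kaplan-Meier equals one, whereas the incidence proportion stays below one as soon as a single patient is censored.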

Survival analysis accounting for CEs is methodologically well established, but practical use lags behind [12, 13]. Failure to account for censoring (e.g., incidence proportion) or CEs (e.g., one minus Kaplan-Meier) will generally lead to biased quantification of absolute AE risk, and the possible amount of bias has been investigated in the companion paper[1]. There, we confirmed that one minus Kaplan-Meier should not be used to estimate the cumulative AE probability, as it is bound to overestimate as a consequence of ignoring competing risks. Here, we found that this arm-wise overestimation often leads to an underestimation of the RR when comparing two arms. The same pattern is observed for the probability transform incidence density ignoring CE and Aalen-Johansen (death only), i.e. the other two estimators that do not correctly account for CEs.

For estimation of AE probabilities, the incidence proportion performed surprisingly well when compared to the gold-standard Aalen-Johansen estimator. As discussed in Stegherr et al.[1], the reason was a high amount of CEs before possible censoring, potentially related to the majority of the seventeen trials analyzed coming from oncology. These good arm-wise estimates translate into on average unbiased estimation of RR as well. However, as discussed by Stegherr et al., use of the incidence proportion implicitly assumes events to be competing as defined in the methods section. Furthermore, although on average the incidence proportion performs well, we still have 17/156 = 10.9% AE types that were categorized differently in Table 1, with one being turned from a “major” (incidence proportion) to a “no effect” with the Aalen-Johansen estimator. Finally, we found that incidence densities, typically criticized because of the restrictive constant hazard assumption, led to the worst performance when their probability transform ignored CEs.
However, accounting for CEs in an analysis that parametrically mimicked the non-parametric Aalen-Johansen performed better than both one minus Kaplan-Meier and Aalen-Johansen (death only). This confirms the conclusion from the one-sample case that ignoring CEs appeared to be worse than assuming constant hazards in our empirical study. In general, we caution against making conclusions about the amount and direction of bias for estimation of the RR based on the behavior of the one-sample estimators. Overall, in terms of relevance of effects, the choice of the estimator is key and more important than the features of the underlying data such as percentage of censoring, CEs, amount of follow-up, or the value of the gold-standard RR.

We focused on the results when evaluating estimators using the maximum follow-up time as evaluation time. When looking at earlier evaluation times where the estimators were evaluated at earlier time points defined by quantiles of the observed times (results not shown in detail), the resulting bias was, in general, less pronounced, due to a reduced relative frequency of CEs and of censoring (see Figure 1). We regarded the situation of including all data up to the maximum follow-up time as the most relevant as this is the usual practice.

Kernel estimates of densities of ratios of estimated RRs by the different estimators divided by the RR estimated by the gold-standard Aalen-Johansen estimator revealed that all estimators except the probability transform incidence density ignoring CEs have a peak just below one. However, all estimators either had further peaks away from one or the estimated density was unimodal but with high variance. This indicates that the considered estimators are not always comparable to the gold-standard Aalen-Johansen estimator.
In general, it is not obvious what feature of the data generating mechanism actually leads to observed data for which, e.g., the incidence proportion performs much worse than the Aalen-Johansen estimator; we found deviations up to a factor of 3.

Combining bias and variability for estimation of RR for AE probabilities, we analyzed the impact of using different estimators on making decisions about effect size based on the criteria put forward by IQWiG. We found that the number of AE types for which a given estimator deviates from the decision on effect size based on the gold-standard Aalen-Johansen estimator is non-negligible, and that discrepancies by more than one category also occur quite often. This is likely a consequence of both the bias we see for estimation of RR for some of the estimators and the variability.

The analysis based on hazards reveals that the incidence density underestimates the RR for time to AE and time to a CE of death compared to the gold-standard Cox regression, while no obvious bias was observed for time to all CEs. The discrepancy in conclusions with regard to effect sizes drawn based on the ratio of incidence densities and the Cox regression HR appeared to be a bit less than for the estimators of RR based on AE probabilities. Still, the ratio of incidence densities cannot be considered a uniformly good approximation of the HR based on Cox regression.

Comparing the evidence categories derived from the two gold-standard estimators, the Aalen-Johansen estimator of the RR and the Cox regression HR for an AE, we find quite some discrepancies. However, this is not surprising, as the former is based on probability estimators and as such on cumulative measures integrating the two hazards relating to the primary event of interest (AE) and the potential CE, whereas the latter is an instantaneous measure only considering the AE hazard. In other words, this comparison reiterates the importance of accounting for CEs.
There are now as many hazards as there are CEs, and outcome probabilities depend on all event-specific hazards. We are aware that in applications it might not be feasible to look at a hazard-based analysis for the AE and all CEs for every preferred term of AE, say. However, we recommend considering such an analysis at least for AEs of special interest.

Finally, in an analysis of a composite endpoint with a single event of AE or CE, we find an overestimation of RR compared to the gold-standard one minus Kaplan-Meier estimator.

Our empirical study does have shortcomings, some of which were to be anticipated in an opportunistic sample of randomized clinical trials. Inter alia, a large number of trials from oncology came with a high amount of CEs, which, in turn, led to comparable performances of arm comparisons based on incidence proportion and Aalen-Johansen. These shortcomings have been discussed in detail in the companion paper[1], but this opportunistic “real world” sample allowed us to investigate and demonstrate which biases can occur in practice when estimating a RR. The observed results motivate future empirical investigations on how to quantify RR with the aim of better generalizability. As a further point, it was not the aim of this investigation to accurately estimate RRs, but to compare different estimators. Our present study does not allow for a meaningful comparison of results in different diseases. Follow-up investigations concentrating on trials in specific disease areas are planned.

Methodological restrictions include a focus on AE occurrence in a time-to-first-event setting, which does not consider recurrent AEs and often excludes AEs after treatment discontinuation. A more detailed discussion in the companion paper[1] stresses both the need to consider more complex event histories and the need to still account for CEs in such considerations.
In other words, both AEs after treatment discontinuation and recurrent AEs will still be subject to CEs, and this must be accounted for when comparing arms.

We focused here on two effect measures, a comparison of probability estimates and the HR at different evaluation times. This is because we consider these two to be the overwhelmingly used effect measures to quantify safety effects and we are interested in assessing potential biases of those often used methods. Alternative effect measures for time-to-event endpoints could have been considered as well, a potential candidate being restricted mean survival time (RMST). However, to the best of our knowledge, RMST is primarily used in the context of efficacy analyses and also has its challenges, e.g. interpretability and choice of the restriction timepoint, see the discussion in Freidlin and Korn[14]. Furthermore, RMST may be suited for a comprehensive composite endpoint, but extension to CEs is, although possible, not straightforward.

Replacing the often used incidence proportion by the gold-standard Aalen-Johansen estimator, while conceptually and empirically indicated, requires careful discussion in trial teams to define CEs, and a more granular data collection. In addition to the date of AEs one also needs to collect dates of CEs, which may lead to more missing data, e.g. unknown date of loss to follow-up.

In line with Stegherr et al.[1], our recommendation is to “play it safe” when estimating RRs in a time-to-first-event analysis and neither hope for a small amount nor a large amount of CEs nor a favorable interplay of the distributions of the times of AEs, CEs, and censorings. In the former case, one minus Kaplan-Meier might work well, while in the latter two cases the incidence proportion might do so. However, in general the proportion of CEs cannot exclusively explain how an estimator performs compared to the gold-standard Aalen-Johansen estimator.
Therefore, playing it safe, we recommend using RR based on the Aalen-Johansen estimator for AE probabilities and the HR from Cox regression for all types of events that are typically considered in a time-to-first-event analysis. Guidelines for reporting AEs should therefore advocate the Aalen-Johansen estimator instead of incidence proportion, incidence density and one minus Kaplan-Meier. A request for results from Cox regression in guidelines should be complemented by also requesting results for CE-specific hazards.
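As an illustration of the recommended probability-based analysis, the following is a minimal sketch of an arm-wise Aalen-Johansen estimate of the AE probability and the resulting RR, contrasted with one minus Kaplan-Meier that censors at CEs. It is not the code used in the study. The status coding (0 = censored, 1 = AE, 2 = CE), the function names and the toy data are assumptions made for this example.

```python
def aalen_johansen(times, status, tau):
    """Aalen-Johansen estimate of the AE probability by time tau.

    status: 0 = censored, 1 = AE, 2 = competing event (CE).
    """
    surv = 1.0  # probability of being free of any event just before t
    cif = 0.0   # cumulative incidence of AE
    for t in sorted(set(times)):
        if t > tau:
            break
        at_risk = sum(1 for u in times if u >= t)
        d_ae = sum(1 for u, s in zip(times, status) if u == t and s == 1)
        d_all = sum(1 for u, s in zip(times, status) if u == t and s != 0)
        cif += surv * d_ae / at_risk
        surv *= 1.0 - d_all / at_risk
    return cif


def one_minus_km(times, status, tau):
    """One minus Kaplan-Meier for AE, treating CEs like censoring."""
    surv = 1.0
    for t in sorted(set(times)):
        if t > tau:
            break
        at_risk = sum(1 for u in times if u >= t)
        d_ae = sum(1 for u, s in zip(times, status) if u == t and s == 1)
        surv *= 1.0 - d_ae / at_risk
    return 1.0 - surv


# Hypothetical toy data (times, status) for the two arms; not from the SAVVY trials.
arm_e = ([1, 2, 3, 4], [1, 2, 1, 0])
arm_c = ([1, 2, 3, 4], [2, 1, 2, 0])

rr_aj = aalen_johansen(*arm_e, tau=4) / aalen_johansen(*arm_c, tau=4)
rr_km = one_minus_km(*arm_e, tau=4) / one_minus_km(*arm_c, tau=4)
print(rr_aj, rr_km)
```

In this toy sample one minus Kaplan-Meier is larger than the Aalen-Johansen estimate in both arms, and the resulting RR is smaller than the Aalen-Johansen RR. This mirrors the pattern described above, although a four-patient example cannot say anything about how often it occurs in practice.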

Data and code

Individual trial data analyses were run within the sponsor organizations using SAS and R software provided by the academic project group members. Only aggregated data necessary for meta-analyses were shared and meta-analyses were run centrally at the academic institutions.

A markdown file providing exemplary code to compute all the estimators discussed in this paper for a given dataset is available on GitHub (https://github.com/numbersman77/AEprobs). The corresponding output is available as an html file (https://numbersman77.github.io/AEprobs/SAVVY_AEprobs.html).

Funding

Not applicable.

Declaration of conﬂicting interests

KR and TK are employees of F. Hoffmann-La Roche (Basel, Switzerland). VJ and CDB are employees of Novartis Pharma AG (Basel, Switzerland). AA, AB, LE, KK, FL, MT, YZ are employees of Merck KGaA (Darmstadt, Germany), Bayer AG (Wuppertal, Germany), Janssen-Cilag GmbH (Neuss, Germany), Bristol-Myers-Squibb GmbH & Co. KGaA (München, Germany), Pfizer Deutschland (Berlin, Germany), Boehringer Ingelheim Pharma GmbH & Co. KG (Ingelheim, Germany), Eli Lilly and Company (Indianapolis, USA), respectively. TF has received personal fees for consultancies (including data monitoring committees) from Bayer, Boehringer Ingelheim, Janssen, Novartis and Roche, all outside the submitted work. JB has received personal fees for consultancy from Pfizer, all outside the submitted work. CS has received personal fees for consultancies (including data monitoring committees) from Novartis and Roche, all outside the submitted work. The companies mentioned contributed data to the empirical study. RS has declared no conflict of interest.

References

[1] Stegherr R, Schmoor C, Beyersmann J et al. Survival analysis for AdVerse events with VarYing follow-up times (SAVVY) -- estimation of adverse event risks. 2020.

[2] Allignol A, Beyersmann J and Schmoor C. Statistical issues in the analysis of adverse events in time-to-event data. Pharm Stat.

J Clin Epidemiol

https://arxiv.org/abs/2001.05709

[5] Unkel S, Amiri M, Benda N et al. On estimands and the analysis of adverse events in the presence of varying follow-up times within the benefit assessment of therapies. Pharm Stat.

Biometrical Journal

Lancet

J Clin Epidemiol

[10] Schemper M and Smith TL. A note on quantifying follow-up in studies of failure time. Controlled Clinical Trials.

Lancet (London, England)

J Clin Epidemiol

BMJ Open