On estimands and the analysis of adverse events in the presence of varying follow-up times within the benefit assessment of therapies

Abstract

The analysis of adverse events (AEs) is a key component in the assessment of a drug's safety profile. Inappropriate analysis methods may result in misleading conclusions about a therapy's safety and consequently its benefit-risk ratio. The statistical analysis of AEs is complicated by the fact that the follow-up times can vary between the patients included in a clinical trial. This paper takes as its focus the analysis of AE data in the presence of varying follow-up times within the benefit assessment of therapeutic interventions. Instead of approaching this issue directly and solely from an analysis point of view, we first discuss what should be estimated in the context of safety data, leading to the concept of estimands. Although the current discussion on estimands is mainly related to efficacy evaluation, the concept is applicable to safety endpoints as well. Within the framework of estimands, we present statistical methods for analysing AEs with the focus being on the time to the occurrence of the first AE of a specific type. We give recommendations which estimators should be used for the estimands described. Furthermore, we state practical implications of the analysis of AEs in clinical trials and give an overview of examples across different indications. We also provide a review of current practices of health technology assessment (HTA) agencies with respect to the evaluation of safety data. Finally, we describe problems with meta-analyses of AE data and sketch possible solutions.


S. Unkel*, M. Amiri, N. Benda, J. Beyersmann, D. Knoerzer, K. Kupas, F. Langer, F. Leverkus, A. Loos, C. Ose, T. Proctor, C. Schmoor, C. Schwenke, G. Skipka, K. Unnebrink, F. Voss, and T. Friede

Department of Medical Statistics, University Medical Center Goettingen, Germany
Center for Clinical Trials, University Hospital Essen, Germany
Biostatistics and Special Pharmacokinetics Unit, Federal Institute for Drugs and Medical Devices, Bonn, Germany
Institute of Statistics, Ulm University, Germany
Roche Pharma AG, Grenzach, Germany
Bristol-Myers Squibb GmbH & Co. KGaA, München, Germany
Lilly Deutschland GmbH, Bad Homburg, Germany
Pfizer Deutschland GmbH, Berlin, Germany
Merck KGaA, Darmstadt, Germany
Institute of Medical Biometry and Informatics, University of Heidelberg, Germany
Clinical Trials Unit, Faculty of Medicine and Medical Center - University of Freiburg, Germany
Schwenke Consulting: Strategies and Solutions in Statistics (SCO:SSIS), Berlin, Germany
Institute for Quality and Efficiency in Health Care, Cologne, Germany
AbbVie Deutschland GmbH & Co. KG, Ludwigshafen, Germany
Boehringer Ingelheim Pharma GmbH & Co. KG, Ingelheim, Germany

September 24, 2018

* Correspondence should be addressed to: Steffen Unkel, Department of Medical Statistics, University Medical Center Goettingen, Humboldtallee 32, 37073 Goettingen, Germany (e-mail: [email protected]).

arXiv [stat.OT]

Key words: Adverse events, benefit assessment, estimands, clinical trials, safety data.

Based on the experience with early benefit assessments by the Institute for Quality and Efficiency in Health Care (IQWiG) in Germany, we reviewed a number of examples to assess whether variable follow-up times for AEs between individual patients, treatment groups and studies are common across different indications.

In general, trials with a primary time-to-event endpoint have variable follow-up times for individual patients. However, depending on the indication, the average follow-up time can differ substantially between treatment groups, see Table 1.

*** Insert Table 1 about here ***

In oncology, study treatment is often given until disease progression, with only limited follow-up for AEs after discontinuation of trial treatment because of subsequent therapies. This results in differential follow-up times for the different treatment arms and in censored observations for the occurrence of AEs. Because the follow-up time for AEs depends on progression-free survival, an AE is more likely to be observed in the treatment group with longer progression-free survival. In such situations, a simple comparison of incidence proportions between arms is biased in favour of the inferior treatment. In time-to-event trials with average follow-up times not considerably different between treatment groups, there can still be variable follow-up times for individual patients. Examples are large trials in cardiovascular disease, metabolism (diabetes) or respiratory disease (COPD) with cardiovascular outcomes or mortality as primary endpoint, where mortality is relatively low.

Trial designs other than time-to-event trials with variable follow-up times were identified in infectious diseases and central nervous system disorders. For example, there are several trials with planned trial length in Hepatitis C that allowed shortening the treatment time either for all patients in the experimental arm or for those in the experimental arm who achieved an early response. Therefore, the follow-up time in the arm with the experimental drug was shorter than in the control group, see Table 1. Here, the follow-up times differ not only between the treatment groups but also between studies of a drug in the same indication (e.g. Hepatitis C). Comparison of AEs between treatment groups can be undertaken via incidence proportions only for the duration of the shorter treatment. It is not possible to demonstrate a potential advantage in terms of lower AE probabilities over a longer period of time. On the other hand, safety evaluations that naïvely use all AEs on treatment are biased in favour of the treatment group with shorter treatment duration. For instance, in multiple sclerosis, trials with fixed treatment duration have frequently been used, but if control patients switch to treatment in an extension study this would lead to longer follow-up in the experimental group and AE probabilities that are biased in favour of the control group. Figure 1 illustrates different scenarios of typical AE follow-up periods in clinical trials.

*** Insert Figure 1 about here ***

Adverse events are assessed on a regular basis at the visit at the beginning of the trial (V0) and during treatment (V1,...,Vn). The end of treatment (EoT) triggers a safety follow-up visit (Saf-FU) marking the last regular safety assessment. Adverse events occurring on treatment or during safety follow-up are analyzed as treatment-emergent AEs (TEAEs, marked by bold symbols). First occurrences of AEs (marked by triangles) are in general considered only when occurring during the TEAE period. Serious AEs may be reported spontaneously after the safety follow-up visit.
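To make the bias from differential follow-up concrete, the following minimal sketch assumes a constant AE hazard that is identical in both arms and no competing events, so that the expected incidence proportion after follow-up time t is 1 - exp(-hazard * t). The hazard and the follow-up durations are hypothetical values chosen for illustration and are not taken from Table 1 or from any trial.

```python
import math


def expected_incidence_proportion(hazard_per_month: float, follow_up_months: float) -> float:
    """Probability of at least one AE within the follow-up window,
    assuming a constant hazard and no competing events (illustrative assumption)."""
    return 1.0 - math.exp(-hazard_per_month * follow_up_months)


# Hypothetical setting: identical AE hazard in both arms, but the
# experimental arm is followed twice as long (e.g. because of longer
# progression-free survival).
HAZARD = 0.05                  # AEs per patient-month, equal in both arms
FOLLOW_UP_CONTROL = 6.0        # months
FOLLOW_UP_EXPERIMENTAL = 12.0  # months

p_control = expected_incidence_proportion(HAZARD, FOLLOW_UP_CONTROL)
p_experimental = expected_incidence_proportion(HAZARD, FOLLOW_UP_EXPERIMENTAL)
naive_relative_risk = p_experimental / p_control

print(f"control:             {p_control:.3f}")
print(f"experimental:        {p_experimental:.3f}")
print(f"naive relative risk: {naive_relative_risk:.2f}")
```

Although the underlying hazards are equal, the arm with the longer follow-up shows the larger incidence proportion (about 0.45 versus 0.26 here), so the naive comparison favours the arm that was observed for a shorter time.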

Drug approval usually requires a confirmatory proof of efficacy in at least two well-conducted randomized controlled trials, followed by a benefit-risk assessment showing that the treatment's benefit outweighs the expected risk associated with the new treatment. In contrast to the demonstration of efficacy as compared to a control, the benefit-risk assessment is far less standardized with respect to properly balancing benefit and expected side effects. Nevertheless, details on clinical trial specifications with respect to the investigated population, the study duration and the number of patients to be studied are given in several regulatory guidelines, such as therapeutic EMA guidelines (the CHMP clinical efficacy and safety guidelines), ICH guidelines [8, 9, 10, 11] and U.S. FDA guidelines [12].

Safety assessments in drug approval are usually done with respect to the number and proportion of patients with specific AEs that occurred in the individual clinical trials, primarily focusing on the estimated probability of experiencing a given event within the predefined study duration, potentially stratified by relevant subpopulations. The assessment is based on the limited database available at the time of the marketing authorization application, where uncertainties about important risks may either prevent authorization or lead to a requirement for post-authorization safety investigations. Due to a number of different side effects, and since the absence of evidence of an increased risk is not necessarily evidence of absence of an increased risk [13], the analysis of AEs is often crude and merely descriptive, aiming to detect safety signals, although proposals have been made in the literature on the use of more sophisticated methods [14, 15, 16]. Study duration may be too short, or different studies may have different durations leading to different rates. In addition, the database from phase III studies may be insufficient to detect infrequent but serious events. In case of uncertainty, the EMA's Pharmacovigilance Risk Assessment Committee (PRAC) may require that a post-authorisation safety study (PASS) be carried out after a medicine has been approved. Nevertheless, difficulties in the assessment at the time of marketing authorization are highly relevant, especially in indications with a small patient population and a high unmet medical need, which put high pressure on the regulatory system.

Hence, an assessment of safety often targets the probability of an AE within a subject and a given period of time. The safety of a new treatment is assessed using all data available for this treatment. The FDA, for example, asks for a document called the integrated summary of safety (ISS), as described in U.S. Food and Drug Administration (FDA) [17]. In contrast to efficacy assessments, comparative safety assessments are difficult and the appraisal of side effects is usually done in absolute terms. Even if relative (comparative) safety assessments are given and relate to individual studies, it is difficult to account for different observation times due to study discontinuation, for other events and for individual proneness that changes over time. Competing risks for a number of targeted adverse effects add another layer of complexity. For example, in two trials involving patients with type 2 diabetes and an elevated risk of cardiovascular disease, patients treated with canagliflozin had a lower risk of cardiovascular events than those who received placebo but a greater risk of amputation of toes, feet, or legs (6.3 vs. 3.4 participants with amputation per 1000 patient-years, corresponding to a hazard ratio (HR) of 1.97; 95% confidence interval: [1.41, 2.75]) [18]. In such a situation, the increased risk of amputation might be difficult to determine because of the much larger risk of mortality in comparison.

The evaluation of safety is considered an important element of health technology assessments (HTA) [19]. However, due to different approaches to HTA in different health care systems and the generally limited methodological guidance [20], the integration of safety data, and with it the analysis and interpretation of safety data, differs considerably. Some health care systems, such as the one in England, focus on the economic value of a new technology by implementing a cost-effectiveness threshold. Such thresholds are usually based on an incremental cost-effectiveness ratio (ICER) or quality-adjusted life years (QALY) by comparing against the standard of care [21].

In this context, data on AEs which are considered most relevant from the perspective of patients' quality of life (QoL) and/or costs are usually integrated by means of utility functions [22]. A comprehensive review of the current practice of economic models concludes that there appears to be an implicit assumption within modelling guidance that adverse effects are very important, but a lack of clarity about how they should be dealt with and considered for modelling purposes [23].

Other health care systems, such as the one in Germany, base their decision on the incremental medical benefit against the standard of care [24]. Usually data from clinical trials are used in order to evaluate the added benefit for all patient-relevant endpoints separately to demonstrate an overall added benefit of the drug for the population in scope. Typically, a comprehensive description of AEs is provided in those assessments. However, no specific guidelines with respect to safety analyses are in place in countries following this approach. In the following, we discuss one specific example.

At the beginning of 2011, the early benefit assessment of new drugs was introduced in Germany with the Act on the Reform of the Market for Medicinal Products (AMNOG). The Federal Joint Committee (G-BA) generally commissions the Institute for Quality and Efficiency in Health Care (IQWiG) with this type of assessment, which examines whether a new therapy shows an added benefit (a positive patient-relevant treatment effect) over the current standard therapy. The IQWiG is required to assess the extent of added benefit on the basis of a dossier submitted by the pharmaceutical company responsible. In this assessment, the qualitative and quantitative certainty of results within the evidence presented, as well as the size of observed effects and their consistency, are appraised. The general methods of IQWiG are described in the Allgemeine Methoden [24, Version 5.0]. In accordance with § 35b (1) Sentence 4 Sozialgesetzbuch (SGB) V, the following outcomes related to patient benefit are to be given appropriate consideration: increase in life expectancy, improvement in health status and quality of life, as well as reduction in disease duration and adverse effects. In the benefit assessment, all patient-relevant endpoints play a role, and for safety, particular consideration is given to serious and severe AEs as well as treatment discontinuations. In addition, AEs that are of special interest within the context of the disease or drug class considered may play a role.

For the assessment of the extent of added benefit, the effect sizes are of main interest. An effect size in this context is defined as the (relative) difference between the new treatment and the appropriate comparator therapy. It is an important step to grade the qualitative certainty of the estimated effect size, e.g. based on trial design, data quality and estimation method. Patients not included in the analyses and patients with incomplete data increase the risk of bias. Therefore, the following information is gathered and considered to assess the risk of bias: study design (randomized, open-label or double-blind), proportion of patients not considered in the analyses per study arm, proportion of patients with incomplete data per study arm (censored data, lost to follow-up, ...), reasons for censoring (informative, non-informative, competing risks), and distribution of censoring times. Generally, this information is used to assess the direction (in favour of arm x) and the strength (low or high) of the risk of bias.

In case of varying follow-up times, methods based on survival time analyses are preferred to analyses based on four-fold contingency tables [15]. To classify the risk of bias, the number of censored patients and the reasons for censoring have to be considered.
Different types of censoring are possible: "uninformatively censored", "patients with competing risks" and "informatively censored", which influence the risk of bias. We will elaborate further on this issue in Subsection 4.3.

It is paramount to agree upon the relevant target of estimation, defined by the question of what would happen to a specific patient, or what the patient's risk is with respect to a specific event or multiple events, when treated with a given drug as compared to another drug or to not being treated at all. In the context of efficacy assessments, this concept has recently been introduced within the framework of estimands in the new draft addendum R1 to the ICH E9 guideline on statistical principles in clinical trials entitled Estimands and Sensitivity Analyses in Clinical Trials; this addendum refers to the precise parameter or function of parameters to be estimated in situations where intercurrent events such as treatment discontinuation, death, rescue medication or switch to the other study treatment may influence subsequent measurements [25, 7]. We would like to stress that the above-mentioned addendum is not yet finalized, hence our exposition in the sequel can only reflect the current state of discussion. Parts of the draft addendum are viewed critically by HTA agencies [26], and the addendum does not prescribe the use of a particular estimand in a certain situation. Moreover, the discussion of the application of the estimands approach to safety data and questions related to benefit-risk assessments started only recently and is ongoing.

Four different elements are required to describe the estimand of interest: the targeted population, the endpoint (variable), the intervention effect that describes how intercurrent events that potentially influence the endpoint are accounted for, and the summary measure that summarizes the comparison of the two treatments under investigation.

Whereas the nomenclature of types of estimands has been developed and changed during the last few years, currently the following classes of estimands are discussed in the regulatory context:

• Treatment policy: treatment policy estimands do not account for any intercurrent event. The treatment effect is measured irrespective of any intercurrent event, such as treatment discontinuation or additional medication given.
• Composite: composite estimands combine the variable of interest with the intercurrent event, e.g., by defining a treatment failure by the lack of response or treatment discontinuation.
• Hypothetical: hypothetical estimands target an effect that would occur in the overall population in a hypothetical scenario in which no patient experienced the intercurrent event. For example, the effect of all patients adhering to treatment constitutes a hypothetical effect when some patients in fact do not adhere to treatment.
• Principal stratum: principal stratum estimands are defined by the subset of patients in whom the intercurrent event occurs either under one of the treatments or under both. Since a group comparison trial cannot directly identify these patients with respect to the not-administered treatment, causal inference methods using specific assumptions would be required for the analysis.
• While on treatment: while on treatment estimands relate to the effect prior to the occurrence of an intercurrent event, e.g. before intake of rescue medication or the effect while being alive.

Although the current discussion is related to the efficacy evaluation, the concept is applicable to safety endpoints, usually the occurrence of a specific side effect, as well. Considering the variable of interest as the time to a specific side effect, summary measures might be given after a specific period of time. Relevant intercurrent events are treatment discontinuation or switch, death or other side effects that may preclude the event of interest.

Whereas the basic idea of an estimand is not restricted to the efficacy assessment, different issues related to the large number of event types and the desired "equivalence proof" combined with the cautionary principle point to somewhat different difficulties. Envisaging the chances of a beneficial treatment in the sense of both efficacy and safety for a given patient may be seen as a concept of incorporating efficacy and safety in an estimand, but may still be difficult to interpret in the comparison of two treatments with different safety and efficacy patterns.
In that sense, it appears sensible to follow the lines conceived for the efficacy assessment and to clarify the precise parameters in the event analysis in the presence of other concurrent events related to the given treatment, the patient's condition or competing side effects.

The estimand framework is not specific to the clinical development of new drugs in the regulatory context but is also relevant to address needs of HTA bodies. The aim of the HTA process is the assessment of evidence as a basis for further decisions about reimbursement, pricing and market access.

Estimands are supposed to focus and describe the research question in detail. As the aims of drug approval agencies and HTA bodies differ to some extent, different estimands may be of primary interest, but some overlap in secondary considerations can be expected. The following considerations focussing on HTA may also apply to the benefit-risk assessment in drug approval. In the German early benefit assessment according to AMNOG, i.e. § 35a SGB V, there is a need for the HTA authorities to identify the estimands, which are not necessarily the same as in the regulatory context to obtain marketing authorization.

For marketing authorization the current practice, in general, is to report estimates for what could be considered while on treatment estimands to provide evidence for the safety profile of the treatment of interest. Specific information on AEs occurring in a long-term follow-up regardless of study drug adherence may be requested in special cases, but may be hampered by the limited observational period. Potentially diluting effects of treatment policy estimands in case of treatment discontinuation or switch may, depending on the treatment comparison and the disease, be anti-conservative for comparative safety assessments, hence favouring while on treatment estimands for regulatory purposes. Certainly, within the context of the recent regulatory discussion on estimands in efficacy, a new regulatory framework on estimands in safety is also needed.

HTA bodies are most interested in the treatment policy estimand, independently of occurrences like rescue therapy (e.g. rheumatoid arthritis) or subsequent therapies (e.g. oncology). However, in indications like oncology, the AEs are frequently only collected up to a certain point after the last dose of study medication. These data do not support a treatment policy estimand. In such situations, the evidence for risk assessment is less strong. It may be difficult or impossible to cover the treatment policy estimand with the data usually obtained. To obtain sufficient data for a treatment policy estimand and provide a solid basis for an early benefit assessment, the current practice of how data are collected in clinical studies needs to be changed [15].

A major challenge in oncology is the treatment change after progression of the disease, where in many cases patients enter a subsequent clinical study, e.g. in malignant melanoma where about four years ago the only treatment option was dacarbazine and most patients entered clinical trials after progression. These studies were under evaluation by the HTA bodies recently due to the time gap between study conduct and marketing authorization. However, in most studies patients are not allowed to enter a new clinical study if they are still participants of the prior study. As a consequence, they need to withdraw consent for the first study to enter the next. Therefore, AEs cannot in general be collected after progression for the first study. Exceptions to this practice have been observed [27], following a protocol recording new onset of serious adverse events up to 90 days after last dose of study treatment and those serious AEs considered related at any time after discontinuation of treatment. Another example is the German Society of Pediatric Oncology and Hematology (GPOH) [28], which records further follow-up data on children after end of treatment and thereby provides the possibility for long-term surveillance and follow-up and late effect evaluation in paediatric oncology patients. Apart from these examples, the while on treatment estimand is used in clinical trials to avoid bias due to unbalanced withdrawals in the treatment groups. For example, if the control treatment is a standard first-line treatment and a subsequent second-line study requires the standard treatment as first line, then only patients of the control group are allowed to enter the second-line study, leading to unbalanced subsequent therapy along with biases in efficacy and also in safety.

Four different scenarios, which are displayed in Figure 2, can be distinguished to describe safety estimands in an HTA system.

*** Insert Figure 2 about here ***

The scenarios in Figure 2 differ according to the lengths of the planned and observed follow-up times in the study. The definition of estimands becomes increasingly complex with more pronounced differences in follow-up time due to intercurrent events. In the first two scenarios, the planned follow-up times of AEs in the study population are similar, e.g. in trials with fixed trial lengths (see Subsection 2.1). Whereas in Scenario 1 there are no or only minor differences in observed follow-up times between individual patients within treatment groups or between treatment groups, in Scenario 2 medium to large differences in observed follow-up times do exist, due to e.g. a high level of treatment or study discontinuations. Scenarios 3 and 4 consider studies with expected differences in follow-up times by treatment group, e.g. in oncology (see Subsection 2.1), where AE reporting stops a certain number of days after the last dose of study drug and the time on study drug differs between treatment groups. In all four scenarios, HTA bodies usually aim for the treatment policy estimand. However, studies are commonly planned to collect data that are appropriate for the while on treatment estimand, which is usually the focus of regulatory agencies in the marketing authorization process. When benefit dossiers are based on the same studies, it is often not possible to provide estimates of the treatment policy estimand desired by the HTA agencies due to lack of adequate data. Hence, it is apparent that the described requirements of HTA bodies with regard to safety analyses need to be taken into account already in the planning phase of a clinical study.
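The four elements of an estimand and the five classes of estimands described in this section can be written down as a small data structure, which may help when specifying safety estimands at the planning stage. The sketch below is purely illustrative: the class names, field names and example values are hypothetical and are not prescribed by the ICH E9 draft addendum or by any HTA body.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Strategy(Enum):
    """Classes of estimands listed in the text (handling of intercurrent events)."""
    TREATMENT_POLICY = "treatment policy"
    COMPOSITE = "composite"
    HYPOTHETICAL = "hypothetical"
    PRINCIPAL_STRATUM = "principal stratum"
    WHILE_ON_TREATMENT = "while on treatment"


@dataclass(frozen=True)
class Estimand:
    """Illustrative container for the four elements describing an estimand."""
    population: str
    variable: str
    intercurrent_event_strategy: Strategy
    summary_measure: str

    def needs_follow_up_after_discontinuation(self) -> bool:
        # A treatment policy estimand measures the effect irrespective of
        # intercurrent events, so AEs must still be collected after them.
        return self.intercurrent_event_strategy is Strategy.TREATMENT_POLICY


# Hypothetical example: the same safety endpoint under two strategies.
hta_estimand = Estimand(
    population="all randomized patients",
    variable="time to first AE of a specific type",
    intercurrent_event_strategy=Strategy.TREATMENT_POLICY,
    summary_measure="hazard ratio",
)
regulatory_estimand = replace(
    hta_estimand, intercurrent_event_strategy=Strategy.WHILE_ON_TREATMENT
)

for name, estimand in [("HTA", hta_estimand), ("regulatory", regulatory_estimand)]:
    print(name, estimand.intercurrent_event_strategy.value,
          estimand.needs_follow_up_after_discontinuation())
```

Writing the strategy down explicitly makes the data requirement visible: only the treatment policy variant requires AE collection beyond treatment discontinuation, which is the gap between regulatory and HTA needs discussed above.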

The aim of this Section is to discuss methods for analysing AEs. We focus on the occurrence of the first AE of a specific type because the statistical considerations for the analysis of the first AE are relevant also for the analysis of recurrent AEs. We found a great variety of methods in the literature, some of which were not well defined, making it difficult to identify both estimator and estimand. The main methods focussing on AE occurrence in one group were described in the literature as crude rate, incidence proportion, incidence rate, exposure-adjusted incidence rate, hazard function for adverse events, Kaplan-Meier and cumulative incidence function considering competing risks. We first describe methods of estimation within one treatment group in Subsection 4.1, and subsequently the comparison of AE occurrence between samples in Subsection 4.2. The time point of AE occurrence or comparison thereof will be made explicit. A major issue is that safety data need not be completely observed over the whole study period for all patients and may be right-censored. This requires the use of time-to-event methodology [16, 14] taking different follow-up times into account, which is common in efficacy analyses, but less so for safety. In this context, it is often discussed which kind of censoring is informative, and the role of censoring will briefly be revisited in Subsection 4.3. Methods for meta-analyses of AE data are discussed in Subsection 4.4.

One major common method is the so-called "crude rate", which despite its name is in fact a proportion and is defined as $\hat{P}(\mathrm{AE}) = a/n$, where $a$ is the number of patients observed to experience at least one AE of a specific type and $n$ is the total number of study patients, see [14], [16], [29], [30], [31] and references therein. The crude rate is a correct estimator of the probability of experiencing at least one AE of the type of interest in case of complete data and identical follow-up times for all patients. With different follow-up times in different samples, the crude rate will estimate the AE probability at different points in time. This difficulty is resolved by considering the incidence proportion ([14] and references therein):

$$\hat{P}(\mathrm{AE\ in\ } [0,t]) = \frac{\sum_{u \le t} a_u}{n}, \tag{1}$$

where $a_u$ denotes the number of patients observed to experience their first AE of a specific type at time $u$. The expression (1) estimates the probability of experiencing at least one AE within the time interval $[0,t]$, but is again only valid for complete data over the considered time interval. In the presence of censoring, both the crude rate and the incidence proportion underestimate the AE probability [14]. The reason is that these methods in fact estimate the joint probability that the AE occurs and that censoring does not occur.

Some authors, e.g. Ioannidis et al. [30] and Amit et al. [32], therefore suggest using one minus the Kaplan-Meier estimator for estimating $P(\mathrm{AE\ in\ } [0,t])$, censoring the time to AE both at the end of follow-up and at competing events that preclude AE occurrence, such as death without a prior AE. However, this method overestimates the AE probability [14, 33]. The reason is that one minus the Kaplan-Meier estimator approximates a cumulative distribution function, implicitly assuming that eventually 100% of all patients experience the AE under consideration, possibly after death.
Therefore, one minus the Kaplan-Meier estimator must not be used to estimate $P(\mathrm{AE\ in\ } [0,t])$ in the presence of competing events which prevent the occurrence of the AE under consideration.

It is the Aalen-Johansen estimator [34] that generalizes the Kaplan-Meier estimator to multiple event types and nonparametrically estimates the so-called cumulative incidence function $P(\mathrm{AE\ in\ } [0,t])$, accounting for both competing events [35, 36] and the usual censoring due to end of follow-up. The Aalen-Johansen estimator for the probability of AE occurrence is ([14] and references therein):

$$\hat{P}(T \le t, \mathrm{AE}) = \sum_{u \le t} \hat{P}(T > u-) \cdot \frac{a_u}{n_u}, \tag{2}$$

where $T$ is the time until occurrence of an AE or of a competing event, whichever comes first, $\hat{P}(T > u-)$ denotes the estimate of the probability of experiencing neither an AE nor the competing event before time $u$, and $n_u$ is the number of patients at risk of an AE or a competing event just prior to $u$. The interpretation of (2) is that of a sum over the empirical probabilities of experiencing an AE at the observed event times. Here, the Kaplan-Meier method is used for estimation of $P(T > u-)$, because the definition of $T$ encompasses all competing events.

Time-to-event analyses are based on hazards, because in general follow-up times are incomplete. In fact, the sum over the quotients $a_u / n_u$ on the right-hand side of (2) is the Nelson-Aalen estimator of the cumulative AE hazard $\int_0^t \alpha_{\mathrm{AE}}(u)\,\mathrm{d}u$. For analysing AEs, the Nelson-Aalen estimator is key in three ways. Firstly, it enters the computation of the Aalen-Johansen estimator.
Secondly, it is closely linked to the Mean Cumulative Function which is based on the Nelson-Aalen estimator and is also used in safety analyses [29]. Thirdly, the Nelson-Aalen estimator is the cumulative nonparametric counterpart of the commonly used incidence rate (or incidence density) of AEs [14, 30, 31]:

$$\text{IR}_{\text{AE}} = \frac{a}{\sum_i t_i}, \tag{3}$$

where $t_i$ is the time at risk for patient $i$ and $\sum_i t_i$ denotes the population time (person-years) at risk. The incidence rate (3) is an estimator of the AE hazard $\alpha_{\text{AE}}(t)$ under a constant hazard assumption, $\alpha_{\text{AE}}(t) = \alpha_{\text{AE}}$ for all times $t$. The incidence rate is popular, because its denominator accounts for varying follow-up times. Sometimes, the exposure-adjusted incidence rate is reported by counting in the denominator only the population time during exposure to study treatment. However, it is not a probability estimator (and should better not be reported as a percentage). In fact, it is easily seen that depending on how time is measured (think of milliseconds or decades), the denominator can be made arbitrarily large or small, possibly resulting in values larger than one.

In perfect analogy to the Aalen-Johansen estimator, translating incidence rates into probability statements requires incorporating competing events (CEs), e.g. death without prior AE. For instance, $\text{IR}_{\text{AE}}$ and the incidence rate of the competing event, $\text{IR}_{\text{CE}} = c / \sum_i t_i$ (writing $c$ for the number of patients observed to experience the competing event), can be used to obtain a parametric counterpart of the Aalen-Johansen estimator. If constant event-specific hazards are assumed, the cumulative incidence function of the event type AE is explicitly given as

$$P(T \le t, \text{AE}) = \int_0^t \alpha_{\text{AE}} \exp\bigl(-(\alpha_{\text{AE}} + \alpha_{\text{CE}})\,s\bigr)\,ds = \frac{\alpha_{\text{AE}}}{\alpha_{\text{AE}} + \alpha_{\text{CE}}}\Bigl(1 - \exp\bigl(-(\alpha_{\text{AE}} + \alpha_{\text{CE}})\,t\bigr)\Bigr), \tag{4}$$

where $\alpha_{\text{CE}}$ denotes the competing event hazard.
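The estimators discussed so far can be compared on a small data set. The following is a minimal sketch, assuming a purely hypothetical toy data set of ten patients (`toy_data`) and hypothetical function names; it is meant to illustrate the incidence proportion (1), one minus the Kaplan-Meier estimator, the Aalen-Johansen estimator (2), the incidence rate (3) and the parametric formula (4), and is not an implementation taken from the cited references.

```python
from math import exp

# Hypothetical toy data: (time in days, status), where
# 0 = censored at end of follow-up, 1 = first AE, 2 = competing event
# (e.g. death without prior AE).
toy_data = [(2, 1), (3, 2), (4, 0), (5, 1), (6, 2),
            (7, 0), (8, 1), (9, 2), (10, 0), (10, 0)]


def incidence_proportion(data, t):
    """Equation (1): observed AEs up to time t divided by all patients."""
    return sum(1 for time, s in data if s == 1 and time <= t) / len(data)


def one_minus_km(data, t):
    """One minus Kaplan-Meier for AE, treating competing events as censored."""
    surv = 1.0
    for u in sorted({time for time, s in data if s == 1 and time <= t}):
        at_risk = sum(1 for time, _ in data if time >= u)
        a_u = sum(1 for time, s in data if time == u and s == 1)
        surv *= 1.0 - a_u / at_risk
    return 1.0 - surv


def aalen_johansen(data, t):
    """Equation (2): sum over event times of P(T > u-) * a_u / n_u."""
    surv = 1.0  # Kaplan-Meier estimate of P(T > u-) for the all-cause time T
    cif = 0.0
    for u in sorted({time for time, s in data if s != 0 and time <= t}):
        n_u = sum(1 for time, _ in data if time >= u)
        a_u = sum(1 for time, s in data if time == u and s == 1)
        d_u = sum(1 for time, s in data if time == u and s != 0)
        cif += surv * a_u / n_u
        surv *= 1.0 - d_u / n_u
    return cif


def incidence_rate(data, cause):
    """Equation (3): number of events of one type per unit of time at risk."""
    return sum(1 for _, s in data if s == cause) / sum(time for time, _ in data)


def parametric_cif(ir_ae, ir_ce, t):
    """Equation (4) with the two incidence rates plugged in."""
    total = ir_ae + ir_ce
    return ir_ae / total * (1.0 - exp(-total * t))


ir_ae = incidence_rate(toy_data, 1)  # per person-day
ir_ce = incidence_rate(toy_data, 2)
print("incidence proportion :", incidence_proportion(toy_data, 10))
print("Aalen-Johansen       :", aalen_johansen(toy_data, 10))
print("one minus KM         :", one_minus_km(toy_data, 10))
print("parametric CIF (4)   :", parametric_cif(ir_ae, ir_ce, 10))
print("IR per person-day    :", ir_ae)
print("IR per person-year   :", ir_ae * 365.25)
```

On these hypothetical data the incidence proportion lies below and one minus Kaplan-Meier lies above the Aalen-Johansen estimate, in line with the directions of bias described above. Rescaling the incidence rate from person-days to person-years (the factor 365.25 is an assumed convention) pushes it above one, which illustrates why it should not be read as a probability.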
By plugging in both event-specific incidence rates, the cumulative incidence function can be estimated parametrically under a constant hazards assumption.

We also note that both the Nelson-Aalen estimator and the incidence rate allow for AEs to be recurrent. In this situation, the translation into probability statements becomes more complex because of the more complicated recurrent events structure, but also with recurrent AEs competing events have to be taken into account.

When comparing two treatment groups with respect to AE occurrence, often measures like risk difference, relative risk or odds ratio of crude rates are suggested [e.g. 32]. However, if such relative measures are used in the presence of censoring and are based on biased one-sample estimators as discussed above, the result of such a comparison will be biased too, but the direction of the bias is uncertain. For instance, a ratio of incidence proportions calculated from censored data will divide something too small by something too small.

As a parametric analysis, the ratio of incidence rates is an appropriate estimator of the hazard ratio under a constant hazard assumption. The obvious semi-parametric extension is to use a Cox proportional hazards model,

$$\alpha_{\text{AE}}(t \mid Z) = \alpha_{\text{AE};0}(t) \exp(\beta_{\text{AE}}^{\top} Z), \tag{5}$$

where $\alpha_{\text{AE};0}(t)$ is an unspecified baseline AE hazard, $\beta_{\text{AE}}$ is the vector of regression coefficients and $Z$ a vector of baseline covariates including treatment group. In other words, if in (5) the only covariate is treatment group, $Z \in \{0, 1\}$, then the ratio of the incidence rate in group 1 to the incidence rate in group 0 estimates the hazard ratio $\exp(\beta_{\text{AE}})$ under the assumption of a constant baseline hazard for adverse events, $\alpha_{\text{AE};0}(t) \equiv \text{constant}$. If this assumption is in doubt, any Cox regression software technically censoring the time to AE by observed competing events will yield the usual maximum partial likelihood estimator of $\exp(\beta_{\text{AE}})$.
Technically, censoring by observed competing events is in perfect analogy to calculation of the incidence rates, but, again in analogy to the incidence rates, it does not allow for probability statements. In other words, the analysis remains somewhat incomplete without consideration of the hazard of the competing event, e.g., via a second Cox model,

$$\alpha_{\text{CE}}(t \mid Z) = \alpha_{\text{CE};0}(t) \exp(\beta_{\text{CE}}^{\top} Z),$$

which technically censors the time to the competing event by observed AEs, see Beyersmann et al. [37] for a practical in-depth discussion. A reasonable method of choice will be a Cox regression model for the event-specific hazards. The important point is that it requires as many Cox regression models as there are event-specific hazards present. Although fitting two Cox models is straightforward from a computational perspective, the presence of two hazards is not without subtleties.

We want to illustrate this using a toy example, assuming, for ease of presentation, constant hazards. We consider a treatment that modifies the AE hazard by a factor of 0.5 and the competing event hazard by a factor of 0.25. As $t \to \infty$, one can see from (4) that $P(\text{AE} \mid \text{group 1})$ in treatment group 1 becomes

$$\frac{0.5 \cdot \alpha_{\text{AE}}}{0.5 \cdot \alpha_{\text{AE}} + 0.25 \cdot \alpha_{\text{CE}}}, \tag{6}$$

where $\alpha_{\text{AE}}$ and $\alpha_{\text{CE}}$ denote the AE hazard and competing event hazard in group 0, respectively. The expression (6) is greater than the corresponding limit of $P(\text{AE} \mid \text{group 0})$, although the AE hazard has been reduced. The reason is simple. In our toy example, treatment reduces both hazards, thus delaying both events. Because the effect is larger on the competing event than on the AE, there will eventually be more AEs in treatment group 1 than in treatment group 0, such that the cumulative AE probabilities cross at some point in time. This is illustrated in Figure 3, showing the cumulative AE probabilities in group 0 and group 1 over time for the situation of constant hazards described above.

*** Insert Figure 3 about here ***

In group 0, both the AE hazard rate and the competing event hazard rate were set to 0.02 events per day, eventually leading to an AE probability of 1/2 in group 0 and of 2/3 in group 1. In group 1, the AE hazard is modified by a factor of 0.5 and the competing event hazard by a factor of 0.25.

Because multiple hazards are present and an analysis of only one hazard does not suffice for probability statements, so-called ‘direct’ approaches such as the Fine and Gray model for the so-called subdistribution hazard [38] or, easier to interpret, the proportional odds cumulative incidence function model [39] have been developed. Most popular is perhaps the Fine and Gray approach, which interprets one minus the cumulative incidence function as a survival function and fits a Cox model to the corresponding hazard, the so-called subdistribution hazard.
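Returning to the toy example with constant hazards described above, the crossing of the cumulative AE probabilities can be checked numerically. The sketch below evaluates formula (4) for both groups with the hazards stated in the text (0.02 per day in group 0, modified by factors 0.5 and 0.25 in group 1) and locates the crossing time by bisection; the function names and the bisection bracket of 1 to 1000 days are assumptions made for illustration.

```python
from math import exp

# Hazards per day as stated for the toy example.
ALPHA_AE_0, ALPHA_CE_0 = 0.02, 0.02
ALPHA_AE_1, ALPHA_CE_1 = 0.5 * ALPHA_AE_0, 0.25 * ALPHA_CE_0


def cif_ae(alpha_ae, alpha_ce, t):
    """Cumulative incidence of AE under constant hazards, equation (4)."""
    total = alpha_ae + alpha_ce
    return alpha_ae / total * (1.0 - exp(-total * t))


# Limits for t -> infinity, cf. expression (6).
limit_group0 = ALPHA_AE_0 / (ALPHA_AE_0 + ALPHA_CE_0)
limit_group1 = ALPHA_AE_1 / (ALPHA_AE_1 + ALPHA_CE_1)


def difference(t):
    """Group 1 minus group 0 cumulative AE probability at time t."""
    return cif_ae(ALPHA_AE_1, ALPHA_CE_1, t) - cif_ae(ALPHA_AE_0, ALPHA_CE_0, t)


def crossing_time(lo=1.0, hi=1000.0, tol=1e-9):
    """Bisection on the sign change of the difference (negative early, positive late)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if difference(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


t_cross = crossing_time()
print("limit group 0:", limit_group0)
print("limit group 1:", limit_group1)
print("curves cross after about", round(t_cross, 1), "days")
```

Before the crossing time the treated group shows the lower cumulative AE probability, afterwards the higher one, so the sign of a comparison of AE probabilities depends on the time point at which it is made.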
The approach is useful in that a subdistribution hazard ratio greater (smaller) than one translates into an increase (decrease) of the cumulative incidence function but is otherwise difficult to interpret [40], because the subdistribution hazard, say $\lambda(t \mid Z)$, can be expressed as

$$\lambda(t \mid Z) = \frac{P(T > t \mid Z)}{1 - P(T \le t, \text{AE} \mid Z)} \cdot \alpha_{\text{AE}}(t \mid Z),$$

which results in a complicated mixture of effects on the hazard scale and on probability scales. Alternatives include group comparisons based on confidence bands of the cumulative incidence functions [e.g. 41] or the proportional odds cumulative incidence function model [39] mentioned above. The latter is a generalization of the logistic regression model to binomial probabilities $P(T \le t, \text{AE} \mid Z)$ as a function of time $t$ and in the presence of censoring.

Our presentation so far has demonstrated that the analysis of AE occurrence in trials with varying follow-up times has to account for differences in follow-up, in particular in the form of censoring. Survival methodology should not only be used for efficacy, but also for safety analyses. However, the safety estimands of interest may still be a matter of debate (see Section 5). Above, we have demonstrated that the relationship between hazards and probabilities is more subtle when competing events that preclude AE occurrence are present. But even when this is accounted for, there may be a choice between, say, incidence rates and ‘exposure-adjusted incidence rates’ (which are also incidence rates, but with a different at-risk period, see Subsection 4.1).

Censoring is a concept that pertains to all these aspects, but it is more subtle than may seem at first glance. So far, we have argued that more standard statistical techniques not from the field of survival analysis are inappropriate because the analysis will be about AEs that are observed rather than about AEs that the patient experiences. The observation time of AEs is often restricted, e.g.
in oncological trials often to progression of disease + 30 days. The patient may experience AEs after this period but this is not reported in the case report form (CRF). Next, we have found that one minus the Kaplan-Meier estimator censoring the time to AE by competing events overestimates the cumulative AE probability, but that such a censoring approach yields a valid analysis of the AE hazard. Whether or not censoring yields a valid analysis has an impact on the estimand at hand, of course.

When survival methods for AE analysis are discussed, authors often warn against informative censoring [e.g. 15], but this discussion on censoring is somewhat complicated by inconsistent terminology in the literature.

Random censoring typically refers to the situation where time-to-event and time-to-censoring are independent random variables taking values in $[0, \infty)$ [e.g. 42, p. 30]. This is also called independent censoring or non-informative censoring [e.g. 43] in the literature, and it is not uncommon for these last two terms to be used interchangeably [e.g. 2], nor for them to refer to different censoring mechanisms [e.g. 44]. This bedevils the discussion both on AE analyses and estimands, because one must first define what is meant by, e.g., the term independent censoring, given that different authors may use the term differently, referring to different censoring mechanisms.

In our context, one will rarely be willing to assume that the time to a certain AE and the time to a competing event such as death (or progression) without prior AE are independent and present themselves as an example of random censoring. In fact, and more importantly, it is entirely unclear how to define time-to-AE for a patient who has died as the value of a random variable in the positive real numbers. Such a value would suggest that there is an AE after death, which is an awkward concept (to say the least), and we prefer an agnostic point of view.

Above, we have found that observed occurrence of a competing event can be regarded as ‘independent censoring’ in the sense that technically treating it as a censored observation allows for a correct analysis of the AE hazard. In other words, the analysis of the AE hazard does not depend on whether censoring was due to administrative closure of the study or whether it was due to a competing event. On the other hand, observed occurrence of a competing event can be regarded as ‘informative censoring’ in the sense that technically treating it as a censored observation does not allow for a correct analysis of the AE probability as in a Kaplan-Meier procedure.

These ideas are made rigorous in the counting process approach to survival analysis [45, 46, 42], see also Allignol et al.
[14] for a non-technical account. In a nutshell, censoring by a competing event is independent censoring in that it preserves the desired form of the intensity of the counting process of the event under consideration, but it becomes independent, yet informative censoring if the target parameter is the cumulative incidence function.

It is worthwhile to reflect on these concepts in oncology trials, where common endpoints are progression-free survival and overall survival. It is not uncommon that recording of AEs is stopped for patients who progress and undergo a second line treatment. Of course, these patients may still experience AEs, and it is generally assumed that the hazard of an AE after progression is different as compared to before progression. Progression then is a competing event for AE without prior progression. And progression is a competing event for death without prior progression and without prior AE. Hence, censoring by the progression event will yield a valid analysis of the hazards of the other two competing events, but any probability statement will need to account for all hazards involved. A different question, however, is what kind of censoring by progression is with respect to AE occurrence after progression. In a way, the answer is easy: if censoring by progression events yields a valid analysis of AE occurrence before progression, but if the AE hazard after progression changes, progression cannot be independent (and, hence, not non-informative) censoring with respect to AE occurrence after progression.

The argument can be made rigorously by showing that censoring by progression does not preserve the desired form of the intensity of the AE counting process, if the latter is not restricted to AEs before progression, but the bottom line is obvious: if recording of AEs is stopped for patients with diagnosed progression, inference for post-progression AEs is impossible.
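The distinction drawn in this subsection, namely that censoring by a competing event is harmless for the hazard but not for the probability, can be made concrete at the population level. The sketch below assumes constant hazards of 0.02 per day for both the AE and the competing event (the group 0 values of the toy example) and an arbitrarily chosen evaluation time of 100 days; the function names are hypothetical. It contrasts the quantity targeted by one minus Kaplan-Meier with competing events censored, $1 - \exp(-\alpha t)$, with the cumulative incidence function (4).

```python
from math import exp

ALPHA_AE = 0.02  # AE hazard per day (assumed constant)
ALPHA_CE = 0.02  # competing event hazard per day (assumed constant)
T_EVAL = 100.0   # evaluation time in days, chosen for illustration


def km_type_target(alpha, t):
    """Target of one minus Kaplan-Meier when the other event type is censored."""
    return 1.0 - exp(-alpha * t)


def cif(alpha_own, alpha_other, t):
    """Cumulative incidence function under constant hazards, equation (4)."""
    total = alpha_own + alpha_other
    return alpha_own / total * (1.0 - exp(-total * t))


naive_ae = km_type_target(ALPHA_AE, T_EVAL)
naive_ce = km_type_target(ALPHA_CE, T_EVAL)
p_ae = cif(ALPHA_AE, ALPHA_CE, T_EVAL)
p_ce = cif(ALPHA_CE, ALPHA_AE, T_EVAL)
p_event_free = exp(-(ALPHA_AE + ALPHA_CE) * T_EVAL)

print("naive 'probabilities' (AE, CE):", naive_ae, naive_ce)
print("their sum                     :", naive_ae + naive_ce)
print("cumulative incidences (AE, CE):", p_ae, p_ce)
print("CIFs plus event-free share    :", p_ae + p_ce + p_event_free)
```

Both analyses use the same hazards, so the hazard analysis itself is unaffected; only the translation into probabilities differs. The two naive quantities add up to more than one, whereas the two cumulative incidences and the event-free probability add up to exactly one.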
When data from more than one study are available it is not uncommon to naïvely pool the data across the studies by e.g. “simply combin[ing] the numerator events and the denominators for the selected studies” [47]. McEntegart [48] and later Rücker and Schumacher [49] as well as Chuang-Stein and Beltangady [50] warned of such naïve pooling as results might be biased due to Simpson’s paradox. The International Conference on Harmonization (ICH) E9 states that “any statistical procedures used to combine data across trials should be described in detail” and that “attention should be paid [...] to the proper modelling of the various sources of variation”. The use of meta-analysis techniques is encouraged [51], since these techniques allow for variation in baseline (control group) outcomes across the various studies. Random-effects meta-analysis in addition allows for variation in treatment effects across studies (so-called between-trial heterogeneity). Therefore, this type of model is appropriate to formally combine several studies in one analysis. In the context of safety analyses a number of specific problems arise [see e.g. 52], some of which will be considered in the following.

Meta-analysis can be carried out using aggregated data of the individual studies or, if available, individual patient data (IPD). IPD meta-analyses have some advantages over aggregate data meta-analyses [53], in particular with time-to-event data considered in this manuscript. If time-to-event data are considered and the meta-analysis is based on published data, it is sometimes necessary to reconstruct the data by using appropriate methods [54, 55].

As explained in Subsection 4.2, effect measures such as the risk difference, relative risk or odds ratio of crude rates are not appropriate when analyzing AEs with varying follow-up times.
Alternatives include the ratio of incidence rates, which would be appropriate for instance under a constant hazard assumption, the hazard ratio estimated in a Cox proportional hazards model or the subdistribution hazard ratio of the popular Fine & Gray model. For the purpose of meta-analyses these effect measures would be log-transformed to estimate a combined effect on the log-scale, e.g. log rate ratio or log hazard ratio. Technically this is fairly straightforward. However, some challenges arise if for example the follow-up times vary considerably between the studies. Under assumptions such as proportional hazards the formal combination of studies with different follow-up times is justified. In practice, however, such assumptions might be challenged. Whereas with a single study the hazard ratio estimated by weighted Cox regression might be interpreted as an average effect over follow-up when the proportional hazards assumption does not hold true [56], this interpretation in the presence of considerably different follow-up times across studies in a meta-analysis does not apply in the same way. Furthermore, with these arguments the variation in follow-up times across studies is likely to yield some level of between-trial heterogeneity in treatment effects.

When summarizing AE data from a clinical development programme, typically only a small number of studies is available. Meta-analysis of (very) few studies has recently attracted more attention as it is also frequently encountered in settings other than the one considered here. An overview and discussion of the various methods in the context of benefit assessments is provided by [57]. Specifically, empirical studies demonstrated that between-trial heterogeneity is likely to be present [58], suggesting the use of random-effects rather than fixed-effect meta-analysis.
With few studies, however, the between-trial heterogeneity is difficult to assess, and standard methods for random-effects meta-analysis based on normal approximation yield confidence intervals for the combined effect which are too short and which have coverage probabilities well below the nominal confidence level. This is due to an underestimation of the between-trial heterogeneity and a failure to account for the uncertainty in the estimation of the heterogeneity.
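To make the combination on the log scale concrete, the following sketch pools log hazard ratios by inverse-variance weighting, once as a fixed-effect analysis and once as a random-effects analysis. The between-trial variance is estimated with the DerSimonian-Laird moment estimator, which is not named in the text and is used here only as one familiar example of a standard method based on normal approximation, i.e. precisely the kind of method whose intervals tend to be too short with few studies. The two studies and all function names are hypothetical.

```python
from math import exp, log, sqrt

Z_95 = 1.959964  # standard normal quantile for two-sided 95% intervals


def log_hr_and_se(hr, lower, upper):
    """Recover the log hazard ratio and its standard error from a 95% CI."""
    return log(hr), (log(upper) - log(lower)) / (2.0 * Z_95)


def fixed_effect(y, se):
    """Inverse-variance weighted mean and its standard error."""
    w = [1.0 / s ** 2 for s in se]
    estimate = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return estimate, sqrt(1.0 / sum(w))


def dersimonian_laird_tau2(y, se):
    """Moment estimator of the between-trial variance, truncated at zero."""
    w = [1.0 / s ** 2 for s in se]
    estimate, _ = fixed_effect(y, se)
    q = sum(wi * (yi - estimate) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(y) - 1)) / c)


def random_effects(y, se):
    """Random-effects pooled estimate, its standard error, and tau^2."""
    tau2 = dersimonian_laird_tau2(y, se)
    w = [1.0 / (s ** 2 + tau2) for s in se]
    estimate = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return estimate, sqrt(1.0 / sum(w)), tau2


# Hypothetical study results: (HR, lower 95% limit, upper 95% limit).
studies = [(1.5, 1.1, 2.0), (0.9, 0.6, 1.3)]
y, se = zip(*(log_hr_and_se(*s) for s in studies))

fe_est, fe_se = fixed_effect(y, se)
re_est, re_se, tau2 = random_effects(y, se)

for label, est, s in (("fixed effect", fe_est, fe_se), ("random effects", re_est, re_se)):
    print(label, "HR:", exp(est), "95% CI:", exp(est - Z_95 * s), exp(est + Z_95 * s))
print("estimated tau^2:", tau2)
```

With only two studies the estimate of the between-trial variance rests on a single degree of freedom, so the random-effects interval shown here is wider than the fixed-effect interval but still ignores the uncertainty in the heterogeneity estimate.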

Bayesian random-effects meta-analysis with a weakly informative prior on the between-trial heterogeneity (and an uninformative prior on the treatment effect) has been suggested for meta-analysis with few studies [59, 60]. This avoids zero estimates of the between-trial heterogeneity and accounts for uncertainty in the estimation, yielding satisfactory coverage probabilities and interval lengths [61, 62]. Application of the DIRECT algorithm [63], which is faster than MCMC sampling and does not require inspection of convergence diagnostics, means that computations are fairly straightforward. An implementation is available in the form of the R package bayesmeta, which can be downloaded from CRAN (https://cran.r-project.org/package=bayesmeta).

Neal et al. [18] report an integrated analysis of two large-scale randomized placebo-controlled trials assessing the efficacy and safety of canagliflozin in patients with type 2 diabetes and elevated risk of cardiovascular disease. Patients were followed up for varying lengths of time as the programme was event driven with a constraint on minimum follow-up. In a stratified Cox regression, a beneficial effect of canagliflozin versus placebo on the primary outcome time to death from cardiovascular causes, nonfatal myocardial infarction, or nonfatal stroke, whichever occurred first, was demonstrated (HR = 0.86; 95% confidence interval (CI): [0.75, 0.97]). For the purpose of illustration, we consider here the AE “low trauma fracture”. Figure 4 is a forest plot of the HR with 95% CI from the two studies CANVAS and CANVAS-R.

*** Insert Figure 4 about here ***

The figure also includes a fixed-effect meta-analysis, which is in fact very similar to the stratified Cox regression reported by Neal et al. [18]. As noted by Neal et al. [18], there was considerable between-trial heterogeneity. Therefore, the use of a random-effects meta-analysis is indicated.
The forest plot includes results from three random-effects meta-analyses: modified Knapp-Hartung as a frequentist method suggested for meta-analyses with few studies [64], and Bayesian random-effects meta-analyses with two choices of the prior for the heterogeneity parameter $\tau$ [62]. As can be seen from Figure 4, the modified Knapp-Hartung method yields a very wide, non-informative interval whereas the Bayesian intervals are much shorter. In comparison to the fixed-effect model, the Bayesian intervals are considerably longer as they account for the pronounced between-trial heterogeneity.

Meta-analyses of AE data are further complicated when the events considered are rare. Normal approximations of the distributions of effects, e.g. log hazard ratios or log odds ratios, break down with low event rates. In particular, if some studies result in zero events in both arms, measures such as (log-)odds ratios cannot be calculated. Among the remedies proposed for such problems are continuity corrections [65] and models of the counts such as binomial distributions. The latter can be fitted using likelihood or Bayesian methods [59, 66, 67]. Very recently, the use of weakly informative priors for the treatment effect as well as for the between-trial heterogeneity has been shown to result in satisfactory properties in meta-analyses with few studies and rare events [68].

In the following, the different statistical methods discussed in the previous Section are put into the context of the estimand framework described in Subsection 3.1 for the specific situation of the analysis of AEs. The decision on the estimand is made on the study level. Although in the following we consider estimands of individual studies, the same principle applies to meta-analyses of AE data. A disadvantage of aggregate data meta-analyses in this context is of course that different estimands might have been used in the studies whereas in IPD meta-analyses the same or at least similar estimands can be applied to the studies.
As with Section 3, our statements are based on the estimands as described in the draft addendum R1 to the ICH E9 guideline [25], hence the estimands we are referring to in this Section may be subject to change.

We focus specifically on trials in oncology with differences in follow-up times driven by progression, discontinuation of treatment, death or end of follow-up. As progression in general leads to treatment discontinuation, we consider only the intercurrent events treatment discontinuation and death. For all estimands considered in Subsection 3.1, except the principal stratum estimand, the population is defined by the inclusion and exclusion criteria of the study and contains the treated patients (first element of estimand description). With respect to the second element of estimand description, for all estimands, the endpoint is the time to the first AE of a specific type. The estimands differ as a consequence of the following aspects: the follow-up time over which AEs are included in the analysis and the way in which the intercurrent events treatment discontinuation and death are accounted for (third element of estimand description), and by the population-level summary for the endpoint (fourth element of estimand description). Here, by ‘intercurrent events’ we really mean ‘post-randomization events’, because death is both post-randomization and terminal, but not truly intercurrent.

Considering the treatment policy estimand, the interest is in the comparison of treatment groups with respect to AE occurrence until death or end of follow-up, irrespective of the intercurrent event discontinuation of treatment. So, it includes all AEs until death or end of study, and, therefore, requires the collection of AE data after treatment discontinuation. The AE hazards of the treatment groups can be compared by calculating the hazard ratio in a Cox regression model where for patients without AE the time to AE is censored by death or by end of follow-up.
In the interpretation of the result, it has to be considered if treatment has an effect on the competing event death without prior AE, i.e. if the hazard ratio between treatment groups with respect to death without prior AE is different from one. The hazards of the competing event death without prior AE also have to be taken into account in the estimation of the AE probabilities within the treatment groups by using the Aalen-Johansen estimator. Treatment groups can also be compared with respect to the AE probabilities by estimating the difference between AE probabilities at a specified time point or over the whole study period, by calculating the subdistribution hazard ratio from a Fine and Gray model [38], or by calculating the odds ratio from a proportional odds cumulative incidence function model [39]. The decision between treatment comparison by means of hazard functions or by means of probability functions defines the fourth element of estimand description mentioned above.

The while on treatment estimand includes the AEs until discontinuation of treatment and requires the collection of AE data up to this event. Treatment groups can be compared with the same methods as used for the treatment policy estimand, now treating discontinuation of treatment before AE and death without prior AE as competing events.

The composite estimand would combine the event of interest, AE, with the intercurrent events treatment discontinuation and death without prior AE. The endpoint is then time to AE, treatment discontinuation, or death, whichever occurs first. So, AE data after treatment discontinuation are not required. However, this does not hold in general for any intercurrent event. In this situation, no competing events are present and standard survival analysis techniques can be applied, i.e.
the composite event hazards of the treatment groups can be compared by calculating the hazard ratio in a Cox regression model, and the probabilities of the composite event can be estimated by one minus the Kaplan-Meier estimator, where for patients without the composite event, the time to event is censored by end of follow-up. However, whether such a composite endpoint is an adequate safety outcome is a different question. We reiterate that any effect on the composite may be disentangled and effects on the single components of the composite may be analysed using the techniques discussed in Section 4.

The hypothetical estimand targets the effect of treatment on AE occurrence in the hypothetical scenario in which the intercurrent event treatment discontinuation would not occur. This estimand requires the collection of AE data only up to this event, under the assumption that both the AE hazard and the death hazard remain unchanged in this hypothetical situation. The estimator used for the while on treatment estimand would be valid also for the hypothetical estimand, but now handling treatment discontinuation as a pure censoring event and not as a competing risk. As the assumption is obviously both untestable and rather strong, sensitivity analyses are a minimum requirement when this estimand is targeted. An even more hypothetical estimand would target the effect of treatment on AE occurrence, assuming that neither treatment discontinuation nor death before AE occurs. This estimand is even more hypothetical in the sense that one might be able to imagine situations where treatment continuation is enforced, but enforcing the absence of death is much more speculative.

The principal stratum estimand builds on the causal framework of potential primary outcomes. A potential outcome is defined for each possible treatment assignment, but only one potential outcome is observed, namely that for the actually assigned treatment.
This framework is now extended to potential intercurrent events and targets a causal treatment comparison within subpopulations defined by potential intercurrent events. In our example, the potential intercurrent event is treatment discontinuation, while alive, as a function of time and for each possible treatment assignment. A basic principal stratum contains all patients with identical values for all potential intercurrent events. For the example of treatment discontinuation (but ignoring time until discontinuation for ease of presentation) and two possible treatment assignments, one basic principal stratum is the set of all patients whose potential intercurrent event statuses are (yes, yes); this is the set of all patients who discontinue treatment under both treatment assignments. By construction, measuring the difference between potential outcomes on such a basic principal stratum yields a causal effect ‘adjusting’ for the intercurrent events. Next, a principal stratum is a union of basic principal strata. A common example is the set of patients where the potential intercurrent events are unaffected by treatment, i.e. (yes, yes) and (no, no) in the example above, and the complement are all patients where the potential intercurrent events differ between treatments ((no, yes), (yes, no)). However, it is impossible to identify these patients in advance, and a post-hoc analysis, e.g., based only on those patients who did not stop treatment would be biased, one reason being time-dependent confounding. For the principal stratum estimand causal inference methods would be required, extending the statistical methods presented in Section 4. Causal inference methods have been developed for the analysis of time-to-event data [69, 70], including methods for competing events [71], but practical applications may be subtle. For instance, a key concept in causal reasoning is that of a treatment regimen, determined at time 0.
In our context, one may, at least theoretically, envisage enforcing a treatment regimen of no discontinuation. However, progression events (which trigger treatment discontinuation) are not controllable in terms of a ‘progression regimen’, which is one major motivation behind the principal stratum framework [72].
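For the four estimands that can be handled with the methods of Section 4, the differences described above come down to how each post-randomization event enters the analysis data set. The sketch below encodes this mapping for a single patient. The class `Patient`, the function `code_patient`, the estimand labels and the tie-breaking rule are all hypothetical and introduced only for illustration; the sketch also assumes that event times after treatment discontinuation have been recorded, which the treatment policy estimand requires but which is often not the case in practice.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

EVENT, COMPETING, CENSORED = "event", "competing", "censored"


@dataclass
class Patient:
    ae: Optional[float]               # time of first AE of interest, if observed
    discontinuation: Optional[float]  # time of treatment discontinuation, if any
    death: Optional[float]            # time of death, if observed
    end_of_follow_up: float           # administrative end of observation


def code_patient(p: Patient, estimand: str) -> Tuple[float, str]:
    """Return (analysis time, role) for one patient under the given estimand."""
    # Role of treatment discontinuation, following the descriptions in the text.
    role_of_discontinuation = {
        "treatment_policy": None,         # ignored: AEs are followed beyond it
        "while_on_treatment": COMPETING,  # competing event
        "composite": EVENT,               # part of the composite endpoint
        "hypothetical": CENSORED,         # pure censoring event
    }[estimand]
    # Death is part of the composite endpoint, otherwise a competing event.
    role_of_death = EVENT if estimand == "composite" else COMPETING

    candidates = [(p.end_of_follow_up, CENSORED)]
    if p.ae is not None:
        candidates.append((p.ae, EVENT))
    if p.death is not None:
        candidates.append((p.death, role_of_death))
    if p.discontinuation is not None and role_of_discontinuation is not None:
        candidates.append((p.discontinuation, role_of_discontinuation))

    # Assumed tie-breaking: events before competing events before censoring.
    priority = {EVENT: 0, COMPETING: 1, CENSORED: 2}
    return min(candidates, key=lambda c: (c[0], priority[c[1]]))


# A patient who stops treatment on day 30 and has a first AE on day 50.
example = Patient(ae=50.0, discontinuation=30.0, death=None, end_of_follow_up=100.0)
for name in ("treatment_policy", "while_on_treatment", "composite", "hypothetical"):
    print(name, code_patient(example, name))
```

For a hazard analysis via Cox regression, observations coded as competing are technically treated as censored, whereas the Aalen-Johansen estimator needs the competing status itself. The principal stratum estimand is deliberately left out, because it cannot be obtained by recoding observed events alone.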

The introduction of estimands provides a framework to structure the discussion about the analyses of AEs in terms of specific demands/needs and appropriateness of different analysis approaches. We formulated a framework based on safety estimands within which we proposed statistical methods including methods for evidence synthesis that map the AE data to a single value. For the described estimands we have given recommendations as to which estimators should be used. In particular, we would like to advocate the use of time-to-event methodology for the analysis of AE data, although such a proposal has been known for quite a long time [35].

As discussed in Subsection 3.2, estimands of primary interest may differ between drug approval agencies and the HTA bodies in certain instances. The regulatory agencies evaluate the safety profile of a new treatment. The assessment is based on all safety data available for the new treatment, which are summarized in the summary of safety. There, the safety profile of the new treatment is provided in detail showing safety data of all kinds (i.e. AEs, laboratory data, physical examination, vital signs, etc.) of all available studies with the new treatment to assess the benefit and risk of this treatment. In addition, the safety is assessed across the whole lifetime of the new treatment by assessing the required periodic safety updates. The standard approach is to use descriptive statistics such as absolute and relative frequencies. Common practice evaluates the safety of the treatment itself using a while on treatment estimand, which could also be formulated as a hypothetical estimand under relevant assumptions as stated in Section 5. The HTA bodies on the other hand are interested in the relative effect on AEs of the new treatment in comparison to a chosen comparator in the indication of interest.
The estimand of interest is the treatment policy estimand. In indications such as diabetes mellitus, studies may cover both types of estimands if treatment times and follow-up are similar in both treatment groups. In indications such as oncological diseases the follow-up of AEs is stopped at the planned end of the study treatment plus a certain number of days, which often results in considerable differences in follow-up times. In such scenarios the treatment policy estimand cannot be covered as AEs usually are not collected for subsequent therapies. A possible solution would be to plan studies that enable estimates for both estimands. All efforts should be undertaken while planning and conducting clinical trials to obtain similar follow-up times. In practice, however, this might not always be possible. Practical challenges include scenarios with patients moving on to different studies due to treatment failure. In such cases similar follow-up times cannot be guaranteed. Even if the regulatory agencies have other tasks than HTA bodies, the general objective is to establish beneficial treatments for patients. The common parts should be identified in order to harmonize both perspectives.

Further work, some of it under way, can be done or is required in several areas, some of which have already been mentioned. Firstly, in the present paper, we restricted ourselves to methods for analysing the time to occurrence of first AEs. Therefore, the methodologies proposed in this paper need to be revisited with a view to analysing recurrent AEs, intermittent AEs and AEs with varying severity. Secondly, our proposed methods for analysing AEs do not adequately account for the occurrence of multiple, different AEs. Approaches that account for a number of types of possibly related AEs do exist [73, 74]. Thirdly, an empirical investigation is underway to investigate in a large number of randomized controlled trials whether the different analyses of AEs lead to different decisions when comparing safety between groups.
Fourthly, it is an open question whatimpact regulators and HTA agencies might have on the development of methods for theanalysis of AE data through e.g. the ICH E9 addendum on estimands and sensitivityanalysis in clinical trials. Finally, we have demonstrated that there is a gap betweenwhat would ideally be seen in beneﬁt assessments from an HTA agency perspective andcurrent practices of how data are collected in trials with an objective to achieve marketingauthorization. It will not be easy to reconcile these two diﬀerent views. However, the9present article stimulated the discussion between diﬀerent stakeholders.

Acknowledgements

The Working Group Therapeutic Research (ATF) of the German Society for Medical Informatics, Biometrics and Epidemiology (GMDS) and the Working Group Pharmaceutical Research (APF) of the German Region of the International Biometric Society (IBS-DR) have established the joint project group “Analysis of adverse events in the presence of varying follow-up times in the context of benefit assessments”. We are grateful to all members of the ATF/APF project group as well as to Brenda Crowe and Ralf Bender for making valuable comments and suggestions on the first draft of this paper. Furthermore, we would like to thank Christian Röver for his assistance in preparing the forest plot.

References

[1] A. Agresti. Categorical Data Analysis. Wiley, Hoboken, New Jersey, 3rd edition, 2013.
[2] D. Collett. Modelling Survival Data in Medical Research. Chapman & Hall/CRC Press, Boca Raton, Florida, 3rd edition, 2015.
[3] M. Akacha, F. Bretz, D. Ohlssen, and H. Schmidli. Estimands and their role in clinical trials. Statistics in Biopharmaceutical Research, 22:1–4, 2015.
[4] A.-K. Leuchs, J. Zinserling, A. Brandt, D. Wirtz, and N. Benda. Choosing appropriate estimands in clinical trials. Therapeutic Innovation & Regulatory Science, 49:584–592, 2015.
[5] M. Akacha, F. Bretz, and S. Ruberg. Estimands in clinical trials - broadening the perspective. Statistics in Medicine, 36:5–19, 2017.
[6] International Conference on Harmonisation. ICH E2A: Clinical safety data management: definition and standards for expedited reporting, 1994. CPMP/ICH/377/95, retrieved on 28 January 2018.
[7] International Conference on Harmonisation. ICH E9: Statistical principles for clinical trials, 1998. CPMP/ICH/363/96, retrieved on 28 January 2018.
[8] International Conference on Harmonisation. ICH E1: Population exposure: the extent of population exposure to assess clinical safety for drugs intended for long-term treatment of non-life-threatening conditions, 1994. CPMP/ICH/375/95, retrieved on 30 January 2018.
[9] International Conference on Harmonisation. ICH E3: Structure and content of clinical study reports, 1995. CPMP/ICH/137/95, retrieved on 28 January 2018.
[10] International Conference on Harmonisation. ICH E5(R1): Ethnic factors in the acceptability of foreign clinical data, 1998. CPMP/ICH/289/95, retrieved on 30 January 2018.
[11] International Conference on Harmonisation. ICH M4E(R2): Revision of M4E guideline on enhancing the format and structure of benefit-risk information in ICH, 2016. CPMP/ICH/2887/1999, retrieved on 20 August 2018.
[12] U.S. Food and Drug Administration (FDA). Guidance for industry. Premarketing risk assessment, 2005. Retrieved on 31 August 2018.
[13] D. G. Altman and J. M. Bland. Statistics notes: absence of evidence is not evidence of absence. BMJ, 311:485, 1995.
[14] A. Allignol, J. Beyersmann, and C. Schmoor. Statistical issues in the analysis of adverse events in time-to-event data. Pharmaceutical Statistics, 15:297–305, 2016.
[15] R. Bender, L. Beckmann, and S. Lange. Biometrical issues in the analysis of adverse events within the benefit assessment of drugs. Pharmaceutical Statistics, 15:292–296, 2016.
[16] C. Chuang-Stein, V. Le, and W. Chen. Recent advancements in the analysis and presentation of safety data. Drug Information Journal, 35:377–397, 2001.
[17] U.S. Food and Drug Administration (FDA). Guidance for industry. Integrated summaries of effectiveness and safety: location within the common technical document, 2009. Retrieved on 3 February 2018.
[18] B. Neal, V. Perkovic, K. W. Mahaffey, D. de Zeeuw, G. Fulcher, N. Erondu, W. Shaw, G. Law, M. Desai, and D. R. Matthews. Canagliflozin and cardiovascular and renal events in type 2 diabetes. The New England Journal of Medicine, 377:644–657, 2017.
[19] R. Busse, J. Orvain, M. Velasco, M. Perleth, M. Drummond, F. Gürtner, T. Jørgensen, A. Jovell, J. Malone, A. Rüther, and C. Wild. Best practice in undertaking and reporting health technology assessments. International Journal of Technology Assessment in Health Care, 18:361–422, 2012.
[20] D. Pieper, S.-L. Antoine, J.-C. Morfeld, T. Mathes, and M. Eikermann. Methodological approaches in conducting overviews: current state in HTA agencies. Research Synthesis Methods, 5:187–199, 2014.
[21] Guide to the methods of technology appraisal 2013. National Institute for Health and Care Excellence, London, United Kingdom, 2013. Retrieved on 28 January.
[22] R. Ara and A. Wailoo. Using health state utility values in models exploring the cost-effectiveness of health technologies. Value in Health, 15:971–974, 2012.
[23] D. Craig, C. McDaid, T. Fonseca, C. Stock, S. Duffy, and N. Woolacott. Are adverse effects incorporated in economic models? An initial review of current practice. Health Technology Assessment, 13:1–71, 97–181, 2009.
[24] IQWiG. Allgemeine Methoden, Version 5.0, 2017. Institute for Quality and Efficiency in Health Care, Cologne, Germany.
[25] International Conference on Harmonisation. ICH E9(R1): Addendum to the guideline on statistical principles for clinical trials on estimands and sensitivity analysis in clinical trials, step 2, 2017. Draft guideline, retrieved on 28 January 2018.
[26] IQWiG. Comments from IQWiG on ‘ICH E9 (R1) addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials’ (EMA/CHMP/ICH/436221/2017), 2018. Institute for Quality and Efficiency in Health Care, Cologne, Germany, retrieved on 18 April 2018.
[27] C. Robert, J. Schachter, G. V. Long, A. Arance, J. J. Grob, L. Mortier, A. Daud, M. S. Carlino, C. McNeil, M. Lotem, J. Larkin, P. Lorigan, B. Neyns, C. U. Blank, O. Hamid, C. Mateus, R. Shapira-Frommer, M. Kosh, H. Zhou, N. Ibrahim, S. Ebbinghaus, A. Ribas, and the KEYNOTE-006 investigators. Pembrolizumab versus ipilimumab in advanced melanoma. The New England Journal of Medicine, 372:2521–2532, 2015.
[28] G. Calaminus and P. Kaatsch. Position paper of the society of pediatric oncology and hematology (GPOH) on (long-term) surveillance, (long-term) follow-up and late effect evaluation in pediatric oncology patients. Klinische Pädiatrie, 219:173–178, 2007.
[29] O. Siddiqui. Statistical methods to analyze adverse events data of randomized clinical trials. Journal of Biopharmaceutical Statistics, 19:889–899, 2009.
[30] J. A. Ioannidis, S. W. Evans, P. C. Gøtzsche, R. T. O’Neill, D. G. Altman, K. Schulz, and D. Moher. Better reporting of harms in randomized trials: an extension of the CONSORT statement. Annals of Internal Medicine, 141:781–788, 2004.
[31] N. Lineberry, J. A. Berlin, B. Mansi, S. Glasser, M. Berkwits, C. Klem, A. Bhattacharya, L. Citrome, R. Enck, J. Fletcher, D. Haller, T.-T. Chen, and C. Laine. Recommendations to improve adverse event reporting in clinical trial publications: a joint pharmaceutical industry/journal editor perspective. BMJ, 355:i5078, 2016.
[32] O. Amit, R. M. Heiberger, and P. W. Lane. Graphical approaches to the analysis of safety data from clinical trials. Pharmaceutical Statistics, 7:20–35, 2008.
[33] T. A. Gooley, W. Leisenring, J. Crowley, and B. E. Storer. Estimation of failure probabilities in the presence of competing risks: new representations of old estimators. Statistics in Medicine, 18:695–706, 1999.
[34] O. O. Aalen and S. Johansen. An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scandinavian Journal of Statistics, 5:141–150, 1978.
[35] R. T. O’Neill. Statistical analyses of adverse event data from clinical trials special emphasis on serious events. Drug Information Journal, 21:9–20, 1987.
[36] B. J. Crowe, A. Brueckner, C. Beasley, and P. Kulkarni. Current practices, challenges, and statistical issues with product safety labeling. Statistics in Biopharmaceutical Research, 5:180–193, 2013.
[37] J. Beyersmann, A. Allignol, and M. Schumacher. Competing Risks and Multistate Models with R. Springer, New York, 2012.
[38] J. Fine and R. Gray. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94:496–509, 1999.
[39] F. Eriksson, J. Li, T. Scheike, and M.-J. Zhang. The proportional odds cumulative incidence model for competing risks. Biometrics, 71:687–695, 2015.
[40] P. K. Andersen and N. Keiding. Interpretability and importance of functionals in competing risks and multistate models. Statistics in Medicine, 31:1074–1088, 2012.
[41] J. Beyersmann, S. D. Termini, and M. Pauly. Weak convergence of the wild bootstrap for the Aalen–Johansen estimator of the cumulative incidence function of a competing risk. Scandinavian Journal of Statistics, 40:387–402, 2013.
[42] O. O. Aalen, Ø. Borgan, and H. K. Gjessing. Survival and Event History Analysis: A Process Point of View. Springer, New York, 2008.
[43] J. O’Quigley. Proportional Hazards Regression. Springer, 2008.
[44] D. G. Kleinbaum and M. Klein. Survival Analysis: A Self-Learning Text. Springer, New York, 3rd edition, 2012.
[45] P. K. Andersen, Ø. Borgan, R. D. Gill, and N. Keiding. Statistical Models Based on Counting Processes. Springer, New York, 1993.
[46] N. Keiding and P. K. Andersen, editors. Survival and Event History Analysis. Wiley, Chichester, 2006.
[47] U.S. Food and Drug Administration (FDA). Reviewer guidance. Conducting a clinical safety review of a new product application and preparing a report on the review, 2005. Retrieved on 28 January 2018.
[48] D. J. McEntegart. Pooling in integrated safety databases. Drug Information Journal, 34:495–499, 2000.
[49] G. Rücker and M. Schumacher. Simpson’s paradox visualized: The example of the rosiglitazone meta-analysis. BMC Medical Research Methodology, 8:34, 2008.
[50] C. Chuang-Stein and M. Beltangady. Reporting cumulative proportion of subjects with an adverse event based on data from multiple studies. Pharmaceutical Statistics, 10:3–7, 2011.
[51] Council for International Organisations of Medical Sciences (CIOMS) Working group X. Evidence synthesis for meta-analysis for drug safety, 2016. Geneva, Switzerland.
[52] J. A. Berlin, B. J. Crowe, E. Whalen, H. A. Xia, C. E. Koro, and J. Kuebler. Meta-analysis of clinical trial safety data in a drug development program: answers to frequently asked questions. Clinical Trials, 10:20–31, 2013.
[53] J. F. Tierney, C. Vale, R. Riley, C. T. Smith, L. Stewart, M. Clarke, and M. Rovers. Individual participant data (IPD) meta-analyses of randomised controlled trials: Guidance on their use. PLoS Medicine, 12:e1001855, 2015.
[54] P. Guyot, A. E. Ades, M. J. N. M. Ouwens, and N. Welton. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Medical Research Methodology, 12:9, 2012.
[55] Z. Liu, B. Rich, and J. A. Hanley. Recovering the raw data behind a non-parametric survival curve. Systematic Reviews, 3:151, 2015.
[56] M. Schemper, S. Wakounig, and G. Heinze. The estimation of average hazard ratios by weighted Cox regression. Statistics in Medicine, 28:2473–2489, 2009.
[57] R. Bender, T. Friede, A. Koch, O. Kuß, P. Schlattmann, G. Schwarzer, and G. Skipka. Methods for evidence synthesis in the case of very few studies. Research Synthesis Methods, 2018. DOI: 10.1002/jrsm.1297.
[58] R. M. Turner, J. Davey, M. J. Clarke, S. G. Thompson, and J. P. T. Higgins. Predicting the extent of heterogeneity in meta-analysis, using empirical data from the Cochrane Database of Systematic Reviews. International Journal of Epidemiology, 41:818–827, 2012.
[59] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Wiley, Chichester, 2004.
[60] J. P. T. Higgins, S. G. Thompson, and D. J. Spiegelhalter. A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society Series A, 172:137–159, 2009.
[61] T. Friede, C. Röver, S. Wandel, and B. Neuenschwander. Meta-analysis of few small studies in orphan diseases. Research Synthesis Methods, 8:79–91, 2017.
[62] T. Friede, C. Röver, S. Wandel, and B. Neuenschwander. Meta-analysis of two studies in the presence of heterogeneity with applications in rare diseases. Biometrical Journal, 59:658–671, 2017.
[63] C. Röver and T. Friede. Discrete approximation of a mixture distribution via restricted divergence. Journal of Computational and Graphical Statistics, 26:217–222, 2017.
[64] C. Röver, G. Knapp, and T. Friede. Hartung-Knapp-Sidik-Jonkman approach and its modification for random-effects meta-analysis with few studies. BMC Medical Research Methodology, 15:99, 2015.
[65] M. J. Bradburn, J. J. Deeks, J. A. Berlin, and A. R. Localio. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Statistics in Medicine, 26:53–77, 2007.
[66] D. Böhning, R. Kuhnert, and S. Rattanasiri. Meta-analysis of Binary Data Using Profile Likelihood. Chapman and Hall/CRC Press, Boca Raton, Florida, 2008.
[67] O. Kuß. Statistical methods for meta-analyses including information from studies without any events - add nothing to nothing and succeed nevertheless. Statistics in Medicine, 34:1097–1116, 2015.
[68] B. K. Günhan, C. Röver, and T. Friede. Meta-analysis of few studies involving rare events, 2018. http://arxiv.org/abs/1809.04407 (submitted for publication).
[69] S. G. Baker. Analysis of survival data from a randomized trial with all-or-none compliance: estimating the cost-effectiveness of a cancer screening program. Journal of the American Statistical Association, 93:929–934, 1998.
[70] M. A. Hernán and J. M. Robins. Causal Inference. Chapman & Hall/CRC Press, Boca Raton, Florida, 2018. Forthcoming.
[71] P. K. Andersen, E. Syriopoulou, and E. T. Parner. Causal inference in survival analysis using pseudo-observations. Statistics in Medicine, 36:2669–2681, 2017.
[72] C. E. Frangakis and D. B. Rubin. Principal stratification in causal inference. Biometrics, 58:21–29, 2002.
[73] A. Güttner, J. Kübler, and I. Pigeot. Multivariate time-to-event analysis of multiple adverse events of drugs in integrated analyses. Statistics in Medicine, 26:1518–1531, 2007.
[74] S. M. Berry and D. A. Berry. Accounting for multiplicities in assessing drug safety: a three-level hierarchical mixture model. Biometrics

Dossier evaluation             Intervention               Control                         Ratio of follow-up times

Oncology (median follow-up + safety follow-up)
A14-48 prostate                16.6 months + 28 days      4.6 months + 28 days            31%
A15-17 lung                    336 + 28 days              105 + 28 days                   37%
A15-33 melanoma                168 + 90 days              63 + 90 days                    59%
A16-04 mantle cell lymphoma    14.4 months + 30 days      3.0 months + 30 days            26%

Hepatitis C (planned follow-up + safety follow-up)
A14-44                         8 - 12 weeks + 30 days     24, 28 or 48 weeks + 30 days    23% - 57%
A16-48                         12 weeks + 30 days         24 weeks + 30 days              57%

[Figure 1 panels: 1) Fixed length, no survival FU; 2) Fixed length incl. survival FU; 3) Treatment till progression incl. survival FU. Periods shown: on-treatment period, safety-FU period, FU after TEAE period. Patient outcomes shown: withdrawal, death, alive at end of FU.]

Figure 1: Description of different scenarios for typical AE follow-up (FU) in clinical trials (TEAEs: treatment emergent AEs (marked by bold symbols); EoT: end of treatment; Saf-FU: safety follow-up; V0: visit at the beginning of the trial; V1,...,Vn: visits during treatment). First occurrences of AEs are marked by triangles.

[Figure 2 flow chart: planned follow-up times in the study population (similar or different), followed by observed differences in follow-up times: none / minor (Scenario 1), medium to large (Scenario 2), minor to medium (Scenario 3), large (Scenario 4). An arrow indicates increasing complexity in defining suitable estimands.]

Figure 2: Flow chart displaying four different scenarios across indications for the consideration of safety estimands in an HTA system.

[Figure 3 plot: x-axis: time (in days); y-axis: AE probability; curves for Group 0 and Group 1.]

Figure 3: Cumulative AE probabilities for two groups and constant hazards. Although in group 1 the AE hazard is lower compared to group 0, the cumulative AE probability in group 1 is eventually greater than in group 0.
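One way in which such a crossing of cumulative AE probabilities can arise is through a competing event (for example death) with its own constant hazard. The following minimal sketch assumes this competing risks setting; the hazard values are hypothetical and are not those underlying Figure 3. With constant AE hazard a and competing hazard c, the cumulative AE probability at time t is a/(a+c) * (1 - exp(-(a+c)t)).

```python
import math


def cumulative_ae_probability(t, ae_hazard, competing_hazard):
    """Cumulative probability of a first AE by time t under constant hazards,
    when a competing event can occur first and preclude the AE."""
    total = ae_hazard + competing_hazard
    return ae_hazard / total * (1.0 - math.exp(-total * t))


# Hypothetical hazards per day (assumptions for illustration only).
# Group 1 has the LOWER AE hazard but also a much lower competing hazard.
GROUP_0 = {"ae_hazard": 0.020, "competing_hazard": 0.050}
GROUP_1 = {"ae_hazard": 0.015, "competing_hazard": 0.010}


def crossing_day(max_day=1000):
    """First whole day on which group 1's cumulative AE probability
    exceeds that of group 0, or None if this never happens."""
    for day in range(1, max_day + 1):
        p0 = cumulative_ae_probability(day, **GROUP_0)
        p1 = cumulative_ae_probability(day, **GROUP_1)
        if p1 > p0:
            return day
    return None


if __name__ == "__main__":
    for day in (1, 10, 30, 100, 365):
        p0 = cumulative_ae_probability(day, **GROUP_0)
        p1 = cumulative_ae_probability(day, **GROUP_1)
        print(f"day {day:3d}: group 0 = {p0:.3f}, group 1 = {p1:.3f}")
    print("group 1 exceeds group 0 from day", crossing_day())
```

Early on, the curves are ordered by the AE hazards; later, patients in group 1 remain at risk for longer, so their cumulative AE probability approaches a higher plateau, a/(a+c).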

[Figure 4 forest plot: hazard ratios with 95% CIs. Studies: CANVAS [1.181, 2.060]; CANVAS-R [0.516, 1.120]. Combined: fixed effect [0.975, 1.532]; mKH [0.011, 106.051]; Bayes HN(0.5) [0.467, 2.587]; Bayes HN(1.0) [0.273, 4.373].]

Figure 4: Illustrating example for meta-analyses. Forest plot of hazard ratios for low trauma fractures as observed in CANVAS and CANVAS-R with 95% CIs and four combined hazard ratios from a fixed-effect meta-analysis, modified Knapp-Hartung (mKH) meta-analysis and Bayesian random-effects meta-analysis with two half-normal (HN) priors for the heterogeneity parameter τ.
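The fixed-effect interval in the forest plot can be approximately recovered from the two study-level intervals by inverse-variance pooling on the log scale. The sketch below is a minimal illustration under two assumptions: that the first two intervals in the plot belong to CANVAS and CANVAS-R in that order, and that each 95% CI is symmetric around the log hazard ratio. The function names are ours and do not refer to any particular software.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_hr_and_se(ci_lower, ci_upper):
    """Recover the log hazard ratio and its standard error from a 95% CI,
    assuming the interval is symmetric on the log scale."""
    log_lo, log_hi = math.log(ci_lower), math.log(ci_upper)
    return (log_lo + log_hi) / 2.0, (log_hi - log_lo) / (2.0 * Z_95)


def fixed_effect_ci(study_cis):
    """Inverse-variance fixed-effect pooling of log hazard ratios.
    Returns (pooled HR, lower 95% limit, upper 95% limit)."""
    estimates = [log_hr_and_se(lo, hi) for lo, hi in study_cis]
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return (
        math.exp(pooled),
        math.exp(pooled - Z_95 * pooled_se),
        math.exp(pooled + Z_95 * pooled_se),
    )


# 95% CIs read from the forest plot; the study order is an assumption.
STUDY_CIS = [(1.181, 2.060), (0.516, 1.120)]

if __name__ == "__main__":
    hr, lower, upper = fixed_effect_ci(STUDY_CIS)
    print(f"fixed-effect HR = {hr:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Up to rounding of the published limits, the pooled interval should be close to the fixed-effect interval shown in the plot. The mKH and Bayesian intervals are much wider because they allow for between-study heterogeneity, which two studies can barely inform.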