[PDF] Defining and Estimating Subgroup Mediation Effects with Semi-Competing Risks Data

Abstract

In many medical studies, an ultimate failure event such as death is likely to be affected by the occurrence and timing of other intermediate clinical events. Both event times are subject to censoring by loss-to-follow-up but the nonterminal event may further be censored by the occurrence of the primary outcome, but not vice versa. To study the effect of an intervention on both events, the intermediate event may be viewed as a mediator, but conventional definition of direct and indirect effects is not applicable due to semi-competing risks data structure. We define three principal strata based on whether the potential intermediate event occurs before the potential failure event, which allow proper definition of direct and indirect effects in one stratum whereas total effects are defined for all strata. We discuss the identification conditions for stratum-specific effects, and proposed a semiparametric estimator based on a multivariate logistic stratum membership model and within-stratum proportional hazards models for the event times. By treating the unobserved stratum membership as a latent variable, we propose an EM algorithm for computation. We study the asymptotic properties of the estimators by the modern empirical process theory and examine the performance of the estimators in numerical studies.

Full PDF

DDeﬁning and Estimating Subgroup Mediation Eﬀects withSemi-Competing Risks Data

Fei GaoFred Hutchinson Cancer Research Center, Seattle, WA, USA.Fan Xia and Kwun Chuen Gary ChanDepartment of Biostatistics, University of Washington, Seattle, WA, USA.

Abstract

In many medical studies, an ultimate failure event such as death is likely to be af-fected by the occurrence and timing of other intermediate clinical events. Both eventtimes are subject to censoring by loss-to-follow-up but the nonterminal event mayfurther be censored by the occurrence of the primary outcome, but not vice versa.To study the eﬀect of an intervention on both events, the intermediate event maybe viewed as a mediator, but conventional deﬁnition of direct and indirect eﬀects isnot applicable due to semi-competing risks data structure. We deﬁne three principalstrata based on whether the potential intermediate event occurs before the potentialfailure event, which allow proper deﬁnition of direct and indirect eﬀects in one stratumwhereas total eﬀects are deﬁned for all strata. We discuss the identiﬁcation condi-tions for stratum-speciﬁc eﬀects, and proposed a semiparametric estimator based ona multivariate logistic stratum membership model and within-stratum proportionalhazards models for the event times. By treating the unobserved stratum membershipas a latent variable, we propose an EM algorithm for computation. We study theasymptotic properties of the estimators by the modern empirical process theory andexamine the performance of the estimators in numerical studies.

Keywords:

Illness-death model; Missing data; Principal stratiﬁcation; Proportional hazardsmodel; Survival analysis. 1 a r X i v : . [ s t a t . M E ] S e p . Introduction Evaluating the causal eﬀects of an intervention on a clinical outcome is a common theme inmany medical studies. After an overall relationship between the intervention and outcomeis established, it is often of further interest to understand the biological or mechanisticpathways that contribute to the causal treatment eﬀect. Causal mediation analysis is oftenutilized to disentangle the total treatment eﬀect by decomposing it into the indirect eﬀect,i.e., the eﬀect exerted by intermediate variables (mediators), and the direct eﬀect, i.e., theeﬀect involving pathways independent of the hypothesized mediators. A number of methodswere proposed for causal mediation analysis with survival outcomes, for a single mediatormeasured at study entry (Lange and Hansen 2011; VanderWeele 2011; Tchetgen Tchetgen2011; Lange et al. 2012) and for longitudinal mediators (Lin et al. 2017; Zheng and van derLaan 2017; Didelez 2019; Vansteelandt et al. 2019; Aalen et al. 2020).In many biomedical studies, intermediate, non-terminal landmark events are recordedin addition to the primary failure event because they are important to evaluate prognosis.Due to the ordering of the two events, the non-terminal event is subject to censoring by theoccurrence of the terminal event, but not vice versa, such that semi-competing risks data areobserved (Fine et al. 2001). In this paper, we consider a setting where a non-terminal eventmay serve as a mediator for individuals to whom the event would occur before the terminalevent. An example is a multi-center trial of allogeneic bone marrow transplants in patientswith acute leukemia (Copelan et al. 1991; Klein and Moeschberger 2006), where the primaryinterest is on the eﬀect of diﬀerent treatment regimen (methotrexate + cyclosporine vsmethylprednisolone + cyclosporine) on the survival time. The event time of an intermediateendpoint, chronic graft-versus-host disease (GVHD), is a major side eﬀect of the transplantthat can be lethal. However, some patients died without experiencing GVHD, such thatGVHD event time is subject to censoring by the death time.1ausal mediation analysis with semi-competing risks data is particularly challenging.First, the mediator is only well-deﬁned for those who would have the non-terminal event de-veloped before the occurrence of the primary event. Therefore, the conventional deﬁnitionof natural indirect and direct eﬀects based on replacing the counterfactual of mediator un-der one treatment by that under the other can hardly apply to the entire population. Thischallenge is similar to that of the ‘truncation-by-death’ problem (Zhang and Rubin 2003;Comment et al. 2019), where the primary outcome is only available if the terminal eventdoes not occur. However, there is a substantial diﬀerence in that the primary outcome ofinterest is the terminal event in our setting. Moreover, the semi-competing risks data struc-ture, that is, the primary event may censor the intermediate event but not vice versa, postsadditional challenges in the identiﬁability of parameters. Upon ﬁnishing this paper, webecame aware of the newly accepted paper by Huang (2020) which considered this problemby a counting process framework. The problem formulation, estimand and assumptionsare all very diﬀerent from our work. For instance, we do not make sequential ignorabilityassumptions on surviving subpopulations at arbitrary post-treatment time, because thoseevolving subpopulations are generally healthier than the baseline study population beforethe treatment is assigned.In this paper, we consider a novel principal stratiﬁcation approach to deﬁne causalmediation eﬀects in the subgroup where the intermediate event happens when given eithertreatment, i.e., those susceptible to the intermediate event under both treatments. Thenotations and settings are given in Sections 2.1 and 2.2. We discuss the identiﬁcationconditions needed for estimating the stratum-speciﬁc natural indirect and direct eﬀectsin Section 2.3, and proposed a semiparametric estimator based on a multivariate logisticstratum membership model and within-stratum proportional hazards models for the eventtimes in Section 2.4. By treating the unobserved stratum membership as a latent variable,we propose an EM algorithm for computation of the nonparametric maximum likelihood2stimator in Section 2.5. We also study the asymptotic properties of the estimators usingthe modern empirical process theory in Section 2.6 and examine the performance of theestimators in simulation studies Section 3. An analysis of data from a clinical trial isgiven in Section 4, and concluding remarks are included in Section 5. Proofs and detailedderivations are given in the Appendix.

2. Methods2.1. Notations for Observed Data

Let A be a binary treatment, T be the time to a primary event of interest and M bethe time to an intermediate, non-terminal event. The intermediate event time M may becensored by the occurrence of the primary event, but not vice versa, such that we observesemi-competing risks data. For example, A is a treatment for prolonging survival time, T is the time to death, and M is the time to cancer progression. The occurrence of deathmay censor the cancer progression onset, but not vice versa.Let X be a collection of baseline covariates that may be associated with either or bothevents. Let C denote a censoring time for the primary event, for example, end of follow-uptime. Then, we observe Y ≡ min( T, C ) and ∆ T = I ( T ≤ C ) for the primary event, and Z ≡ min( M, Y ) and ∆ M = I ( M ≤ Y ) for the intermediate event. The observations areversions of the counterfactual variables that we deﬁne as follows. To deﬁne causal mediation eﬀects of interest, we adopt the potential outcomes framework.In conventional causal mediation analysis based on counterfactuals, M ( a ) denotes the coun-terfactual nonterminal event time when the treatment is set to a and T ( a, m ) denotes thecounterfactual terminal event time when the treatment is set to a and the nonterminalevent time (mediator) is set to m . A comparison of T ( a, M ( a )) with T ( a, M ( a ∗ )) would3eﬁne a measure of the natural indirect eﬀect of changing the mediator from M ( a ) to M ( a ∗ ) and a comparison of T ( a, M ( a ∗ )) with T ( a ∗ , M ( a ∗ )) would deﬁne a measure of the naturaldirect eﬀect of changing the treatment from a to a ∗ . Both natural indirect and direct ef-fects involve the term T ( a, M ( a ∗ )) , i.e., the counterfactual outcome for the terminal eventtime when the treatment is set to a and the nonterminal event time is set to M ( a ∗ ) , thecounterfactual nonterminal event time when the treatment is set to a ∗ .However, these conventional deﬁnitions are inadequate for semi-competing risk settingsand needs to be modiﬁed for the following reasons. The potential non-terminal event mayor may not occur before the potential primary event time under diﬀerent treatment as-signments. When the potential primary event happens before the potential intermediateevent, the value of the mediator is not well-deﬁned (and is often set to ∞ by convention)and in such a case the potential primary event time shall not be dependent on an arbi-trary m greater than the potential primary event time. Due to these considerations, weexamine causal eﬀects based on our proposed principal stratiﬁcation approach, extendedfrom Frangakis and Rubin (2002). Intuitively, we stratify the study population into latentclasses identiﬁed by U with 3 categories based on whether they are susceptible to the non-terminal event under diﬀerent treatment assignments. For an individual with U = 1 , weassume the residual counterfactual event time T ( a, m ) − m is non-negative with probability1 for m ∈ M , where M is the support of M ( a ) . This does not only imply susceptibilityunder either treatment, but also implies T ( a, M ( a ∗ )) > M ( a ∗ ) for a (cid:54) = a ∗ . For U = 2 ,we assume M (1) = ∞ with probability 1, and T (1 , m ) is only deﬁned for m = ∞ , while ( T (0 , m ) , M (0)) is deﬁned for m ∈ M such that T (0 , m ) − m is non-negative with prob-ability 1. That is, the treatment “prevents” the subject with U = 2 from having theintermediate event. For individuals with U = 3 , we assume M (0) = M (1) = ∞ withprobability 1 and T ( a, m ) is deﬁned for a ∈ { , } with m = ∞ . That is, they are always“non-susceptible” to the non-terminal event. Here, we assume that the fourth stratum with4he nonterminal event only present in the treated does not exist. This restriction is alongthe same line as the “no deﬁer” assumption commonly adopted in the instrumental variablesmethods, suggesting that the treatment eﬀect is “monotone” and no reversed eﬀect for thesubjects (Angrist et al. 1996). Remark . The deﬁned strata (and associated stratum-speciﬁc eﬀects) are substantiallydiﬀerent from the survivors’ principal stratum (and the survivor average causal eﬀect(SACE)) that is commonly deﬁned in “truncation by death” literature (Zhang and Ru-bin 2003; Comment et al. 2019). In particular, the survivors’ principal stratum is deﬁnedby { T (0) ≥ t, T (1) ≥ t } for some ﬁxed time t in Comment et al. (2019), whereas ourdeﬁnition does not depend on any arbitrary post-treatment time t . Remark . Lin et al. (2017) explained the diﬃculties in deﬁning natural mediation eﬀectsin survival context with longitudinal mediators. They deﬁned interventional eﬀects ina discrete-time setting, where the mediators and past survival status are subject to ahypothetical intervention. They mentioned that principal stratiﬁcation as an alternativeframework to avoid such hypothetical intervention, but did not explore further. We considera diﬀerent setting that shares some of the diﬃculties, but also with unique data structureso that principal strata can be deﬁned.Using our modiﬁed deﬁnition, the convention T ( a ) = T ( a, M ( a )) still holds for allindividuals, while T ( a, M ( a ∗ )) , a (cid:54) = a ∗ , is only well-deﬁned for U = 1 . Therefore, thestratum-speciﬁc indirect and direct eﬀects can only be deﬁned for subjects with U = 1 ,while total eﬀects are still well-deﬁned for U = 2 and U = 3 . In light of these observations,for stratum with U = 1 , we deﬁne the stratum-speciﬁc natural indirect and direct eﬀectsas N IE ( t ; x ) = Pr { T (1 , M (1)) ≥ t | X = x , U = 1 } − Pr { T (1 , M (0)) ≥ t | X = x , U = 1 } (1)5nd N DE ( t ; x ) = Pr { T (1 , M (0)) ≥ t | X = x , U = 1 } − Pr { T (0 , M (0)) ≥ t | X = x , U = 1 } . (2)In stratum with U = 2 , T (1 , M (0)) is not well-deﬁned because T (1 , m ) is only deﬁned for m = ∞ and M (0) < ∞ with probability 1. However, T ( a ) = T ( a, M ( a )) is still well-deﬁnedand the stratum-speciﬁc total eﬀect is T E ( t ; x ) = Pr { T (1) ≥ t | X = x , U = 2 } − Pr { T (0) ≥ t | X = x , U = 2 } . In stratum with U = 3 , M (0) = M (1) = ∞ and T (1 , M (0)) = T (1 , M (1)) so there is noindirect eﬀect. The stratum-speciﬁc total eﬀect can still be deﬁned as T E ( t ; x ) = Pr { T (1) ≥ t | X = x , U = 3 } − Pr { T (0) ≥ t | X = x , U = 3 } . Remark . In principle, a mediator shall satisfy temporal precedence, that it shall occurbefore the primary event. Therefore, one can view that the mediator is technically absentin U = 3 , and an attempt to deﬁne mediation eﬀects would be futile. In U = 2 , thepresence of the mediator before the primary event only happens in one treatment levelwith certainty. As a result, one cannot ﬁx the mediator level at a diﬀerent treatment level,and mediation eﬀects cannot be deﬁned. Note that in U = 2 , T E can be interpreted asthe treatment eﬀect in survival among individuals whose mediating events are preventedby the treatment. To identify the stratum-speciﬁc natural indirect and direct eﬀects and stratum-speciﬁc totaleﬀects, we impose the following assumptions.

Assumption 1 (Consistency) . If A = a , then M = M ( a ) and T = T ( a ) with probability1; if A = a , and M = m , then T = T ( a, m ) with probability 1. ssumption 2 (Sequential Ignorability within Stratum) . For a = 0 , , a ∗ = 0 , , m ∈ (0 , τ ] , and u = 1 , , , { T ( a, m ) , M ( a ∗ ) } ⊥ A | X , U = u (3) and T ( a, m ) ⊥ M | A = a ∗ , X , U = u. (4)Assumptions 1 is a standard assumption for causal inference with no unmeasured con-founding. Assumption 2 is a standard assumption for mediation analysis in U = 1 . For U = 2 , , Assumption 2 reduces to the standard assumption of conditional exchangeabilitywithin stratum. Based on Assumptions 1 - 2, we are able to connect the stratum-speciﬁcnatural indirect and direct eﬀects and stratum-speciﬁc total eﬀects with the distributionof the observed data given stratum membership as follows. Theorem 1.

Under Assumptions 1 - 2, for stratum with U = 1 , the stratum-speciﬁc naturalindirect eﬀect N IE ( t ; x ) is equal to (cid:90) t { − Pr(

T < t | M = m, X = x , A = 1 , U = 1) }× (cid:8) dF M | X = x ,A =1 ,U =1 ( m ) − dF M | X = x ,A =0 ,U =1 ( m ) (cid:9) + Pr( M ≤ t | X = x , A = 0 , U = 1) − Pr( M ≤ t | X = x , A = 1 , U = 1) , and the stratum-speciﬁc natural direct eﬀect N DE ( t ; x ) is equal to (cid:90) t { Pr(

T < t | M = m, X = x , A = 0 , U = 1) − Pr(

T < t | M = m, X = x , A = 1 , U = 1) } dF M | X = x ,A =0 ,U =1 ( m ) . Under Assumptions 1 and 2, for stratum with U = 2 , the stratum-speciﬁc total eﬀect T E ( t ; x ) is equal to Pr( T ≥ t | A = 1 , X = x , U = 2) − Pr( T ≥ t | A = 0 , X = x , U = 2); nd for stratum with U = 3 , the stratum-speciﬁc total eﬀect T E ( t ; x ) is equal to Pr( T ≥ t | A = 1 , X = x , U = 3) − Pr( T ≥ t | A = 0 , X = x , U = 3) . The proof of Theorem 1 is given in Appendix A.1. Since U is unobserved, we cannotuse Theorem 1 directly to identify those stratum-speciﬁc eﬀects from observed data. To doso, one would further assume Assumption 3 (Stratum Membership Independent of Treatment) . With probability one, U is conditional independent of A given X . Assumption 4 (Restriction on Stratum-Speciﬁc Distributions) . With probability one,

Pr( M (0) = m | X = x , U = 2) = g { Pr( M (0) = m | X = x , U = 1); x } , Pr( T (0) ≥ t | M (0) = m, X = x , U = 2) = g { Pr( T (0) ≥ t | M (0) = m, X = x , U = 1); x } , Pr( T (1) ≥ t | X = x , U = 2) = g { Pr( T (1) ≥ t | X = x , U = 3); x } , for some known functions g k ( · ; x ) ( k = 1 , , . Assumption 5 (Non-Informative Censoring and Suﬃcient Follow-up) . ( M, T, U ) is con-ditionally independent of C given A and X , and the upper bound of the support of T is nolarger than that of C . Assumption 3 requires that the stratum membership is not aﬀected by the treatmentassignment A given covariates X . Assumption 4 requires some knowledge on the rela-tionships of stratum-speciﬁc event time distributions. The ﬁrst part of Assumption 5 is astandard assumption for non-informative censoring time. The second part of Assumption 5is an extension of the independent censoring and suﬃcient follow-up assumption in Mallerand Zhou (1992) for nonparametric estimation of cured proportion in censored data. Theassumption on the upper bounds of the supports ensures suﬃcient observation of the tail8ehaviour of the event times for identiﬁcation of stratum membership. By further assumingAssumptions 3 - 5, we obtain the identiﬁcation results in Theorem 2, whose proof is givenin Appendix A.2. Theorem 2.

Under Assumptions 1 - 5, the stratum-speciﬁc eﬀects can be identiﬁed, withidentiﬁcation formulas given in (A.2) - (A.5).

Theorem 2 gives the identiﬁcation result based on nonparametric models for U andfor ( M, T ) given U with minimum assumptions. In particular, Assumption 4 requiressome modeling assumptions to be made. In practice, we may consider additional modelassumptions for U and ( M, T ) to gain power in understanding the causal eﬀects. In thenext section, we extend the multistate modeling idea in the literature of semi-competingrisks data to form such a model. One way to model semi-competing risks data is to use a multistate framework (Xu et al.2010). In multistate analysis of semi-competing risks data, usually three states (states 1- 3) are involved, corresponding to healthy (state 1), illness (state 2), and death (state 3)in an illness-death model. All subjects starts at state 1. A subject enters state 2 if he/shedevelops the intermediate event, while he/she enters state 3 if he/she develops the primaryevent. In traditional illness-death model for semi-competing risks data, three processesmoving from one state to another are modeled: (1) healthy to illness (state 1 to 2), (2)illness to death (state 2 to 3), and (3) healthy to death (state 1 to 3).Here, we extend the idea and model the processes moving from one state to another indiﬀerent strata deﬁned in Section 2.2.. For subjects with U = 1 and subjects with U = 2 receiving A = 0 , the processes of healthy to illness and illness to death are involved andwe model the time to the nonterminal event M and the residual time R ≡ T − M . Weassume that M and R are conditionally independent given A, X , and U . This serves two9urposes: to obtain a tractable EM algorithm in Appendix B, and to avoid the problemof induced informative censoring caused by residual dependence between M and R (Wangand Wells 1998; Lin et al. 1999). For subjects with U = 2 receiving A = 1 and subjectswith U = 3 , the process of healthy to death is involved. This proposed model is related tobut diﬀerent from the illness-death model, in that the transition structure depends on theprincipal strata in our proposed model.Suppose that for a subject with U = 1 , the nonterminal event time follows a proportionalhazards model with hazard function given by λ (1) M ( t | A = a, X = x ) = λ ( t ) exp (cid:0) β M a + γ T M x (cid:1) , and the gap time between the occurrences of nonterminal and terminal events R follows aproportional hazards model with hazard function given by λ (1) R ( r | A = a, X = x ) = λ ( r ) exp (cid:0) β R a + γ T R x (cid:1) . Suppose that for subject with U = 2 and unexposed to treatment ( A = 0) , the nonterminalevent time follows a proportional hazards model with hazard function given by λ (2) M ( t | A = 0 , X = x ) = λ ( t ) exp (cid:0) β M + γ T M x (cid:1) , and the gap time between the occurrences of nonterminal and terminal events followsanother proportional hazards model with hazard function given by λ (2) R ( r | A = 0 , X = x ) = λ ( r ) exp (cid:0) β R + γ T R x (cid:1) . Here, subjects with U = 1 and subjects with U = 2 unexposed to treatment share the samebaseline hazard functions, although the hazard ratios for covariates may be diﬀerent. Theparameters β M and β R are the log hazard ratios of treatment on the nonterminal eventtime and gap time, respectively, for subjects with U = 1 ; the parameters β M and β R are10he log hazard ratios on the nonterminal event time and gap time, respectively, comparingsubjects with U = 1 and U = 2 who both unexposed to treatment with baseline covariatesvalue X = .For subject with U = 2 and exposed with treatment ( A = 1 ), we assume that theterminal event time follows a proportional hazards model with hazard function given by λ (2) T ( t | A = 1 , X = x ) = λ ( t ) exp (cid:0) β T + γ T T x (cid:1) . For subject with U = 3 , we suppose that the terminal event time follows a proportionalhazards model with hazard function given by λ (3) T ( t | A = a, X = x ) = λ ( t ) exp (cid:0) β T a + γ T T x (cid:1) . Note that the terminal event times for subject with U = 3 and subject with U = 2 exposedto treatment share the same baseline hazard function. The parameter β T is the log hazardratio of treatment on the terminal event time for subjects with U = 3 , while β T is the loghazard ratio of the terminal event time comparing subjects with U = 3 and A = 0 withsubjects with U = 2 and A = 1 , with the same covariates value X = .The natural indirect and direct eﬀects in stratum with U = 1 can be presented as N IE ( t | X = x ) = (cid:90) t exp (cid:110) − Λ ( t − m ) e β R + γ T R x (cid:111) λ ( m ) e γ T M x × (cid:104) e β M exp (cid:110) − Λ ( m ) e β M + γ T M x (cid:111) − exp (cid:110) − Λ ( m ) e γ T M x (cid:111)(cid:105) dm + exp (cid:110) − Λ ( t ) e β M + γ T M x (cid:111) − exp (cid:110) − Λ ( t ) e γ T M x (cid:111) and N DE ( t | X = x ) = (cid:90) t (cid:104) exp (cid:110) − Λ ( t − m ) e β R + γ T R x (cid:111) − exp (cid:110) − Λ ( t − m ) e γ T R x (cid:111)(cid:105) × λ ( m ) e γ T M x exp (cid:110) − Λ ( m ) e γ T M x (cid:111) dm, Λ ( t ) = (cid:82) t λ ( s ) ds and Λ ( t ) = (cid:82) t λ ( s ) ds . The total eﬀects in strata with U = 2 and U = 3 are given by T E ( t | X = x ) = exp (cid:110) − Λ ( t ) e β T + γ T T x (cid:111) − (cid:90) t λ ( m ) e β M + γ T M x × exp (cid:110) − Λ ( m ) e β M + γ T M x (cid:111) (cid:104) − exp (cid:110) − Λ ( t − m ) e β R + γ T R x (cid:111)(cid:105) dm and T E ( t | X = x ) = exp (cid:110) − Λ ( t ) e β T + γ T T x (cid:111) − exp (cid:110) − Λ ( t ) e γ T T x (cid:111) , where Λ ( t ) = (cid:82) t λ ( s ) ds .As in Yu et al. (2015), we consider a multinomial logistic regression model on thestratum membership. In particular, we assume w ( x ; α ) = Pr( U = 1 | X = x ) = exp (cid:0) α T1 (cid:101) x (cid:1) α T1 (cid:101) x ) + exp ( α T2 (cid:101) x ) ,w ( x ; α ) = Pr( U = 2 | X = x ) = exp (cid:0) α T2 (cid:101) x (cid:1) α T1 (cid:101) x ) + exp ( α T2 (cid:101) x ) , and w ( x ; α ) = Pr( U = 3 | X = x ) = { (cid:0) α T1 (cid:101) x (cid:1) + exp (cid:0) α T2 (cid:101) x (cid:1) } − , where α =( α T1 , α T2 ) T and (cid:101) x = (1 , x T ) T . Then, the marginalized stratum-speciﬁc natural indirect anddirect eﬀects are given by N IE ( t ) = P r { T (1 , M (1)) ≥ t | U = 1 } − P r { T (1 , M (0)) ≥ t | U = 1 } = (cid:82) N IE ( t | X = x ) w ( x ; α ) dF ( x ) (cid:82) w ( x ; α ) dF ( x ) and N DE ( t ) = P r { T (1 , M (0)) ≥ t | U = 1 } − P r { T (0 , M (0)) ≥ t | U = 1 } = (cid:82) N DE ( t | X = x ) w ( x ; α ) dF ( x ) (cid:82) w ( x ; α ) dF ( x ) , where F ( · ) is the cumulative distribution function of X .12 .5. Nonparametric Maximum Likelihood Estimation For a random sample of n subjects, the observed semi-competing risks data are given by O = {O i : i = 1 , . . . , n } , where O i = { ∆ Mi , Z i , ∆ Ti , Y i , A i , X i } . For i = 1 , . . . , n , if ∆ Mi = 1 , then the likelihood corresponding to subject i is given by (cid:101) L i ( O i ) = Pr ( U i = 1 | X i ) Pr (cid:0) Z i , Y i , ∆ Ti | U i = 1 , X i , A i (cid:1) + I ( A i = 0) Pr ( U i = 2 | X i ) Pr (cid:0) Z i , Y i , ∆ Ti | U i = 2 , X i , A i = 0 (cid:1) ; if ∆ Mi = 0 and ∆ Ti = 1 , then the likelihood corresponding to subject i is given by (cid:101) L i ( O i ) = Pr ( U i = 3 | X i ) Pr (cid:0) Y i , ∆ Ti | U i = 3 , X i (cid:1) + I ( A i = 1) Pr ( U i = 2 | X i ) Pr (cid:0) Y i , ∆ Ti | U i = 2 , X i , A i = 1 (cid:1) ; and if ∆ Mi = ∆ Ti = 0 , then the likelihood corresponding to subject i is given by (cid:101) L i ( O i ) = Pr ( U i = 1 | X i ) Pr (cid:0) Z i , ∆ Mi , Y i , ∆ Ti | U i = 1 , X i (cid:1) + Pr ( U i = 2 | X i ) (cid:8) I ( A i = 0) Pr (cid:0) Z i , ∆ Mi , Y i , ∆ Ti | U i = 2 , X i , A i = 0 (cid:1) + I ( A i = 1) Pr (cid:0) Y i , ∆ Ti | U i = 2 , X i , A i = 1 (cid:1)(cid:9) + Pr ( U i = 3 | X i ) Pr (cid:0) Y i , ∆ Ti | U i = 3 , X i (cid:1) . Therefore, the likelihood function for the observed data O is given by n (cid:89) i =1 (cid:101) L i ( O i ) ∆ Mi (cid:110)(cid:101) L i ( O i ) ∆ Ti (cid:101) L i ( O i ) − ∆ Ti (cid:111) − ∆ Mi . We consider the nonparametric maximum likelihood estimation such that the estimatorsfor Λ , Λ , and Λ are step functions. In particular, let < t < · · · < t m < ∞ be theordered sequence of event times Z i ’s with ∆ Mi = 1 ; let < t < · · · < t m < ∞ be the13rdered sequence of gap times V i ≡ Y i − Z i ’s with ∆ Mi = ∆ Ti = 1 ; and let < t < · · ·

Under regularity conditions, (cid:13)(cid:13)(cid:13)(cid:98) θ − θ (cid:13)(cid:13)(cid:13) + (cid:88) k =1 sup t ∈ [0 ,τ k ] (cid:12)(cid:12)(cid:12)(cid:98) Λ k ( t ) − Λ k ( t ) (cid:12)(cid:12)(cid:12) converges to zero almost surely. In addition, √ n { (cid:98) θ − θ , (cid:98) Λ ( · ) − Λ )( · ) , (cid:98) Λ ( · ) − Λ )( · ) , (cid:98) Λ ( · ) − Λ )( · ) } converges weakly to a zero-mean Gaussian process in the Banach space R m × l ∞ ( A ) × l ∞ ( A ) × l ∞ ( A ) , where m is the dimension of θ and A k is the unit ball in thespace of functions on [0 , τ k ] with bounded variation for k = 1 , , . Theorem 4.

Under regularity conditions, the estimators for stratum-speciﬁc eﬀects givenin (5)-(10) are consistent and asymptotically normal.

Since the form of the limiting variances of the stratum-speciﬁc eﬀects is complicated,we estimate the variance of the estimators by a nonparametric bootstrap procedure in allnumerical studies.

3. Simulation Studies

We conducted simulation studies to examine the performance of the proposed methods. Wegenerated two covariates X ∼ N (0 , and X ∼ U nif (0 , and generated the treatmentindicator A ∼ Bin (0 . to reﬂect 1:1 randomization. We set Λ ( t ) = t , Λ ( t ) = 0 . t ,and Λ ( t ) = log(1 + t ) , while the true values of the other parameters are shown in Table16, along with the simulation results. We generated a censoring time C ∼ U nif (0 , to obtain approximately 51% and 26% censoring rates for the nonterminal and terminalevents, respectively. The proportions of subjects with U = 1 , , are approximately 31%,41%, and 28%, respectively.We considered replicates with sample sizes n = 1000 and , where boot-strap samples were used for variance estimation. Table 1 shows the simulation results,where Bias, SE and SEE denote, respectively, the averaged bias, empirical standard errorand averaged standard error estimates, and CP stands for the empirical coverage prob-ability of the 95% conﬁdence intervals. All examined replications converge with a − convergence criterion. The parameter estimators are virtually unbiased. The bootstrapvariance estimator overestimates the true variability for some of the parameters, but it getsmore accurate when sample size increases.The estimators for the baseline cumulative hazard functions only take jump till the lastobservation times, such that the estimates after the last observation time is not meaningful.Therefore, to summarize the performance of the baseline hazard estimators, for every timepoint t , we only consider the replicates with last observation time greater than t . Figure 1shows the median of the estimated baseline hazard functions, among such replicates. Weplot till the time point at which at least 800 replicates have meaningful estimates. Thebias gets smaller as sample size increases.Table 2 shows the performance of the estimated stratum-speciﬁc indirect and directeﬀects in stratum with U = 1 and X = (0 . , . T , as well as the estimated total eﬀectsfor strata with U = 2 , and the same covariate values. Similarly, for any t the averagewas taken over all the replicates with estimators that have last jump time no less than t .The bias gets smaller as sample size increases. The variance estimator is accurate and thecoverage probability is close to the nominal level when sample size is large.17able 1. Simulation Results for Regression Parameters True n = 1000 n = 2000 Value Bias SE SEE CP Bias SE SEE CP β M . .

008 0 .

224 0 .

261 0 .

97 0 .

006 0 .

147 0 .

168 0 . γ M . .

010 0 .

091 0 .

095 0 .

96 0 .

005 0 .

063 0 .

063 0 . . .

014 0 .

269 0 .

296 0 .

96 0 .

010 0 .

192 0 .

195 0 . β R . − .

043 0 .

276 0 .

334 0 . − .

025 0 .

161 0 .

202 0 . γ R − . − .

011 0 .

095 0 .

102 0 . − .

004 0 .

065 0 .

067 0 . − . .

001 0 .

308 0 .

332 0 .

97 0 .

002 0 .

212 0 .

214 0 . β M − . .

005 0 .

503 0 .

599 0 . − .

002 0 .

327 0 .

380 0 . γ M . .

025 0 .

123 0 .

154 0 .

98 0 .

011 0 .

078 0 .

090 0 . . .

017 0 .

393 0 .

476 0 .

98 0 .

012 0 .

252 0 .

287 0 . β R . − .

068 0 .

586 0 .

699 0 . − .

028 0 .

358 0 .

435 0 . γ R . − .

013 0 .

187 0 .

230 0 . − .

002 0 .

115 0 .

138 0 . . − .

020 0 .

476 0 .

546 0 . − .

002 0 .

303 0 .

337 0 . β T . .

028 0 .

485 0 .

523 0 .

98 0 .

011 0 .

357 0 .

359 0 . γ T − . .

015 0 .

175 0 .

197 0 .

98 0 .

012 0 .

124 0 .

132 0 . − . − .

018 0 .

435 0 .

509 0 .

98 0 .

009 0 .

312 0 .

324 0 . β T . − .

044 0 .

410 0 .

448 0 . − .

034 0 .

334 0 .

332 0 . γ T − . − .

035 0 .

107 0 .

118 0 . − .

024 0 .

076 0 .

078 0 . . − .

017 0 .

371 0 .

402 0 . − .

029 0 .

265 0 .

261 0 . α . .

009 0 .

199 0 .

213 0 .

96 0 .

010 0 .

148 0 .

146 0 . . − .

005 0 .

113 0 .

113 0 . − .

003 0 .

078 0 .

078 0 . . − .

002 0 .

343 0 .

367 0 . − .

007 0 .

256 0 .

253 0 . α . .

025 0 .

296 0 .

321 0 .

96 0 .

010 0 .

201 0 .

211 0 . − . − .

019 0 .

155 0 .

175 0 . − .

019 0 .

108 0 .

114 0 . . − .

023 0 .

491 0 .

537 0 .

97 0 .

000 0 .

342 0 .

355 0 . NOTE: Bias, SE and SEE denote, respectively, the mean bias,empirical standard error and mean standard error estimator. CPstands for the empirical coverage probability of the 95% conﬁ-dence interval.

True n = 1000 n = 2000 t Value Bias SE SEE CP Bias SE SEE CP N DE − .

11 0 .

014 0 .

056 0 .

070 0 .

97 0 .

007 0 .

031 0 .

041 0 . − .

17 0 .

018 0 .

090 0 .

106 0 .

96 0 .

010 0 .

052 0 .

066 0 . − .

18 0 .

020 0 .

096 0 .

107 0 .

94 0 .

010 0 .

057 0 .

069 0 . N IE − . − .

001 0 .

022 0 .

025 0 .

97 0 .

000 0 .

015 0 .

017 0 . − . − .

001 0 .

016 0 .

018 0 . − .

001 0 .

011 0 .

012 0 . − . − .

001 0 .

010 0 .

011 0 .

97 0 .

000 0 .

007 0 .

007 0 . T E − . − .

036 0 .

158 0 .

175 0 . − .

022 0 .

115 0 .

126 0 . . − .

046 0 .

159 0 .

180 0 . − .

026 0 .

113 0 .

127 0 . . − .

047 0 .

136 0 .

156 0 . − .

027 0 .

097 0 .

109 0 . . − .

044 0 .

115 0 .

131 0 . − .

025 0 .

084 0 .

093 0 . T E − .

07 0 .

021 0 .

139 0 .

147 0 .

97 0 .

015 0 .

117 0 .

114 0 . − .

06 0 .

031 0 .

119 0 .

126 0 .

98 0 .

022 0 .

101 0 .

099 0 . − .

06 0 .

033 0 .

101 0 .

107 0 .

97 0 .

024 0 .

086 0 .

085 0 . − .

05 0 .

033 0 .

088 0 .

093 0 .

97 0 .

024 0 .

075 0 .

074 0 . NOTE: Bias, SE and SEE denote, respectively, the mean bias, empir-ical standard error and mean standard error estimator. CP stands forthe empirical coverage probability of the 95% conﬁdence interval. LambdaM t La m bda M ( t ) n=500n=1000n=2000true 0 2 4 6 8 10 12 . . . . . . . LambdaR t La m bda R ( t ) n=500n=1000n=2000true 0 2 4 6 8 10 12 . . . . . . . LambdaT t La m bda T ( t ) n=500n=1000n=2000true Figure 1.

Performance of the estimated baseline cumulative hazard functions.

4. Application

We consider application of the proposed methods to a prostate cancer clinical trial. NCICClinical Trials Group PR.3/Medical Research Council PR07/Intergroup T94-0110 is a ran-domized controlled trial of patients with locally advanced prostate cancer. The primary ob-jective is to determine whether the addition of radiotherapy (RT) to androgen-deprivationtherapy (ADT) prolonged overall survival, deﬁned as time from random assignment todeath from any cause. One thousand two hundred and ﬁve patients with locally advancedprostate cancer were recruited and randomly assigned between 1995 and 2005, 602 to ADTalone and 603 to ADT + RT. These patients were either with T3-4, N0/Nx, M0 prostatecancer or with T1-2 disease with either prostate-speciﬁc antigen (PSA) of more than 40 µ g/Lor PSA of 20 to 40 µ g/L plus Gleason score of 8 to 10. In the ﬁnal report of the study(Mason et al. 2015), at a median follow-up time of 8 years, 465 patients had died. Overallsurvival was signiﬁcantly improved in the patients allocated to ADT + RT (hazard ratio0.70 with 95% CI, 0.57 to 0.85; P < .001).In addition to the primary outcome of death, the study also collected data on time to20isease progression, which was deﬁned as the ﬁrst of any of the following events: biochemicalprogression, local progression, or development of metastatic disease. We analyzed the datato reveal the proportions of the treatment eﬀect on overall survival that are mediated bydisease progression. Particularly, we adjusted for initial PSA level ( <

20 vs. 20 to 50, vs. > U = 1 , ADT + RT is associated witha decreased risk of disease progression, while it is associated with an increased risk fromdisease progression to death. For stratum with U = 3 , ADT + RT is associated with adecreased risk of death. The eﬀects are not signiﬁcant at 0.05 level. For stratum with U = 1 , a subject with initial PSA level >

50 g/L is associated with signiﬁcantly increasedrisk of disease progression, compared to a similar subject with initial PSA level <

20 g/L;and a subject with Gleason score 8-10 is associated with signiﬁcantly decreased risk ofdisease progression, compared to a similar subject with Gleason score < U = 1 , , and 3 are40.1%, 25.7%, and 34.2%, respectively. To verify if the model is reasonable, we estimatedthe stratum-speciﬁc survival functions for every subject and summarize the subject-speciﬁcsurvival function by weighting them by his/her stratum membership probabilities. Weaverage the estimated survival functions for subjects assigned to ADT+RT versus ADT,and plot them against the survival function estimators from the Kaplan Meier methodsand the proportional hazards model. The results are shown in Figure 2. The estimatedpopulation-average survival functions for ADT+RT and ADT groups are similar to thosefrom the Kaplan Meier methods and the proportional hazards model, especially for time21able 3. Parameter Estimates for Regression Coeﬃcients for Event Time Processes Process U = 1 Health → Disease Disease → DeathEst SEE p -value Est SEE p -valueADT + RT − .

825 0 .

987 0 .

403 0 .

460 0 .

658 0 . Initial PSA Level (20 to 50 g/L) .

321 0 .

530 0 . − .

097 0 .

305 0 . Initial PSA Level ( >

50 g/L) .

607 0 .

566 0 .

005 0 .

065 0 .

342 0 . Gleason Score (8-10) − .

008 0 .

413 0 . − .

378 0 .

241 0 . Process U = 2 , ADTHealth → Disease Disease → DeathEst SEE p -value Est SEE p -valueIntercept − .

917 1 .

587 0 . − .

557 3 .

548 0 . Initial PSA Level (20 to 50 g/L) .

663 1 .

005 0 . − .

304 0 .

863 0 . Initial PSA Level ( >

50 g/L) .

619 0 .

904 0 . − .

104 0 .

940 0 . Gleason Score (8-10) .

674 0 .

952 0 . − .

106 3 .

381 0 . Process U = 2 , ADT + DT U = 3 Health → Death Health → DeathEst SEE p -value Est SEE p -valueIntercept − .

212 7 .

339 0 . − .

446 0 .

875 0 . Initial PSA Level (20 to 50 g/L) .

146 5 .

416 0 . − .

022 0 .

649 0 . Initial PSA Level ( >

50 g/L) .

601 5 .

545 0 . − .

883 0 .

641 0 . Gleason Score (8-10) .

624 4 .

630 0 . − .

514 0 .

486 0 . α α Est SEE p -value Est SEE p -valueIntercept .

205 0 .

441 0 .

643 0 .

291 0 .

743 0 . Initial PSA Level (20 to 50 g/L) .

090 0 .

583 0 .

877 0 .

625 1 .

054 0 . Initial PSA Level ( >

50 g/L) − .

684 0 .

517 0 .

186 0 .

004 0 .

929 0 . Gleason Score (8-10) .

134 0 .

415 0 . − .

495 0 .

732 0 . before 10 years when data are not sparse, indicating proper ﬁt of the proposed approach.Figure 3 shows the estimated marginalized stratum-speciﬁc indirect and direct eﬀects(with 95% conﬁdence intervals) for stratum with U = 1 . The estimated natural indirecteﬀect is positive and increasing over time, and the estimated natural direct eﬀect is slightlynegative over time. However, the 95% conﬁdence intervals are wide such that the stratum-speciﬁc natural indirect and direct eﬀects are not signiﬁcant diﬀerent from zero. The totaleﬀect in stratum with U = 1 is positive and increasing over time, corresponding to anincreased survival probability assigned to ADT+RT versus ADT in stratum with U = 1 .It is worth noting that the primary analysis for the data shows that overall survival wassigniﬁcantly improved in the patients allocated to ADT + RT compared to ADT. However,our analysis failed to obtain a signiﬁcant stratum-speciﬁc overall eﬀect of ADT + RT. Themain reason is that by identifying subjects to diﬀerent strata, the sample size to estimateparameters in each stratum is much smaller than that for the proportional hazards modelbased on all available subjects. In addition, the proposed model has much more parameters,such that the variability for parameter estimation signiﬁcantly increases.23 . . . . . . Time (Year) S u r v i v a l P r obab ili t y ADT + RTADTADT + RT, KMADT, KMADT + RT, PHADT, PH

Figure 2.

Estimated survival functions from the proposed, Kaplan-Meier, and proportionalhazards model approaches. − . − . . . . . Time (Year) C au s a l E ff e c t l l l ll Natural Indirect EffectNatural Direct EffectTotal Effect

Figure 3.

Estimated stratum-speciﬁc indirect and direct eﬀects in stratum with U = 1 . . Discussion Semi-competing risks data are frequently observed in medical studies, where the termi-nal event time may censor the intermediate event time but not vice versa. To deﬁneand estimate causal contrasts of the eﬀect of a treatment to the terminal and intermediateevents, we introduced a novel principal stratiﬁcation framework that distinguishes suscepti-ble and non-susceptible subjects given diﬀerent treatments, and deﬁned the natural indirectand direct eﬀects in the stratum where the times to intermediate and terminal events arewell-deﬁned given both treatments. We provided reasonable assumptions to identify thestratum-speciﬁc natural indirect and direct eﬀects, proposed a semiparametric model, andstudied an EM algorithm to obtain the nonparametric maximum likelihood estimators ofmodel parameters. We showed that the estimators are consistent and asymptotically eﬃ-cient estimated under mild regularity conditions, and their performance are satisfactory inﬁnite sample numerical studies.In identifying the stratum-speciﬁc natural indirect and direct eﬀects, we assumed thatthere are no subjects who are susceptible to the intermediate event under treatment ( A = 1 )and non-susceptible under control ( A = 0 ). This assumption may need careful examinationbased on scientiﬁc understanding of how treatment may aﬀect the intermediate event. Inour data application, we assessed this assumption by ﬁtting the proposed model withswitched treatment indicator labels of ADT+RT and ADT. The estimated probability ofbelonging to stratum with U = 2 (equivalent to the fourth stratum in the original labeling)is very low, suggesting that the assumption on non-existence of the fourth stratum maybe valid. In some applications, this fourth stratum may indeed exist. In the literature ofprincipal stratiﬁcation for uncensored data with four or more strata, the eﬀect of interestoften can only be interval identiﬁed. Interval identiﬁcation with a regression model oftenresults in a complicated solution manifold, with properties often not well understood. We26lan to explore this problem in a future study. Acknowledgement

This manuscript was prepared using data from Dataset NCT00002633-D1 from the NCTNData Archive of the National Cancer Institute (NCI) National Clinical Trials Network(NCTN). Data were originally collected from clinical trial NCT00002633 Phase III Ran-domized Trial Comparing Total Androgen Blockade Versus Total Androgen Blockade PlusPelvic Irradiation in Clinical Stage T3-4, N0, M0 Adenocarcinoma of the Prostate All anal-yses and conclusions in this manuscript are the sole responsibility of the authors and donot necessarily reﬂect the opinions or views of the clinical trial investigators, the NCTN, orthe NCI. The authors are partially funded by the U.S. National Institutes of Health grantsR01HL122212, U01AG016976 and U.S. National Science Foundation grant DMS 1711952.

APPENDIX A Identiﬁcation of Stratum-Speciﬁc EﬀectsA.1 Proof of Theorem 1

Note that (3) in Assumption 2 implies T ( a, m ) ⊥ A | M ( a ∗ ) = m ∗ , X , U = 1 . (A.1)In stratum U = 1 , for a given t ≤ τ and any a , a ∗ , we have Pr { T ( a, M ( a ∗ )) ≥ t | X = x , U = 1 } = (cid:90) Pr { T ( a, m ) ≥ t | M ( a ∗ ) = m, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) Pr { T ( a, m ) ≥ t | M ( a ∗ ) = m, A = a ∗ , X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) Pr { T ( a, m ) ≥ t | A = a ∗ , X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) Pr { T ( a, m ) ≥ t | A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m ) (cid:90) Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) t Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )+ (cid:90) ∞ t Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) t Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )+ Pr( M ( a ∗ ) ≥ t | X = x , U = 1) , where the second equality follows from (A.1), the third and ﬁfth equalities follow from(4) in Assumption 2, and the last equality follows from the fact that T ( a, m ) ≥ t withprobability one for any m ≥ t given U = 1 . Then, following (3) in Assumption 2 andAssumption 1, the proceeding expression is equal to (cid:90) t Pr { T ( a ) ≥ t, M ( a ) = m | A = a, X = x , U = 1 } Pr { M ( a ) = m | A = a, X = x , U = 1 } dF M ( a ∗ ) | A = a ∗ , X = x ,U =1 ( m )+ Pr( M ( a ∗ ) ≥ t | A = a ∗ , X = x , U = 1)= (cid:90) t Pr( T ≥ t | M = m, A = a, X = x , U = 1) dF M | A = a ∗ , X = x ,U =1 ( m )+ Pr( M > t | A = a ∗ , X = x , U = 1) . Then, the natural indirect and direct eﬀects, as deﬁned in equations (1) and (2), are equalto

N IE ( t ; x ) = (cid:90) t Pr( T ≥ t | M = m, A = 1 , X = x , U = 1) × (cid:8) dF M | A =1 , X = x ,U =1 ( m ) − dF M | A =0 , X = x ,U =1 ( m ) (cid:9) + Pr( M > t | A = 1 , X = x , U = 1) − Pr(

M > t | A = 0 , X = x , U = 1) and N DE ( t ; x ) = (cid:90) t { Pr( T ≥ t | M = m, A = 1 , X = x , U = 1) Pr( T ≥ t | M = m, A = 0 , X = x , U = 1) } dF M | A =0 , X = x ,U =1 ( m ) . For stratum with U = 2 , we have T E ( t ; x ) = Pr( T (1) ≥ t | A = 1 , X = x , U = 2) − Pr( T (0) ≥ t | A = 0 , X = x , U = 2)= Pr( T ≥ t | A = 1 , X = x , U = 2) − Pr( T ≥ t | A = 0 , X = x , U = 2) , where the two equalities follow from Assumptions 2 and 1, respectively. Similarly, forstratum with U = 3 , we have T E ( t ; x ) = Pr( T (1) ≥ t | A = 1 , X = x , U = 3) − Pr( T (0) ≥ t | A = 0 , X = x , U = 3)= Pr( T ≥ t | A = 1 , X = x , U = 3) − Pr( T ≥ t | A = 0 , X = x , U = 3) , A.2 Proof of Theorem 2

We ﬁrst consider the identiﬁcation of stratum membership probabilities, as the ﬁrst stepto identify the stratum-speciﬁc eﬀects. By the deﬁnition of the strata, we have

Pr(

M < ∞| A = 1 , X = x ) = Pr( U = 1 | X = x )Pr( M = ∞| A = 0 , X = x ) = Pr( U = 3 | X = x )Pr( M = ∞| A = 1 , X = x ) = Pr( U = 2 | X = x ) + Pr( U = 3 | X = x ) . Pr( U = u | X ) can be identiﬁed by theobserved data if there is no censoring and we can observed if M = ∞ . The identiﬁcationof the quantities in the presence of censoring is discussed in the end of the section.We further consider the distributions of the (observed) intermediate and primary eventtimes to identify terms in the deﬁnition of stratum-speciﬁc eﬀects. First, consider the caseif we observe M = m < ∞ for a subject assigned to treatment A = 1 . Since it implies M (1) < ∞ , this subject should have U = 1 with probability one. That is, for any m < ∞ , Pr( M = m | A = 1 , X = x ) = Pr( M = m, M (1) ≤ T (1) | A = 1 , X = x )= Pr( M = m, M (0) ≤ T (0) , M (1) ≤ T (1) | A = 1 , X = x )= Pr( M = m, U = 1 | A = 1 , X = x )= Pr( M = m | U = 1 , A = 1 , X = x ) Pr( U = 1 | A = 1 , X = x )= Pr( M (1) = m | X = x , U = 1) Pr( U = 1 | X = x ) , where the ﬁrst equality follows from Assumption 1, the second and third equalities followfrom the deﬁnition of strata, and last equality follows from Assumption 3. Similarly, wehave for any m ≤ t < ∞ , Pr( T ≥ t | M = m, A = 1 , X = x ) = Pr( T < t, M = m | A = 1 , X = x )Pr( M = m | A = 1 , X = x )= Pr( T ≥ t, M = m, M (1) ≤ T (1) | A = 1 , X = x )Pr( M = m, M (1) ≤ T (1) | A = 1 , X = x )= Pr( T ≥ t, M = m, U = 1 | A = 1 , X = x )Pr( M = m, U = 1 | A = 1 , X = x )= Pr( T (1) ≥ t | M (1) = m, X = x , U = 1) . For the case that we observe M = m < ∞ for a subject assigned to treatment A = 0 ,the subject would possibly have U = 1 or U = 2 , since both strata has M (0) < ∞ . Then,we have Pr( M = m | A = 0 , X = x ) Pr( M = m, M (0) ≤ T (0) | A = 0 , X = x )= Pr( M = m, M (0) ≤ T (0) , M (1) ≤ T (1) | A = 0 , X = x )+ Pr( M = m, M (0) ≤ T (0) , M (1) = ∞| A = 0 , X = x )= Pr( M = m, U = 1 | A = 0 , X = x ) + Pr( M = m, U = 2 | A = 0 , X = x )= Pr( M = m | A = 0 , X = x , U = 1) Pr( U = 1 | X = x )+ Pr( M = m | A = 0 , X = x , U = 2) Pr( U = 2 | X = x )= Pr( M (0) = m | X = x , U = 1) Pr( U = 1 | X = x )+ g (Pr( M (0) = m | X = x , U = 1); x ) Pr( U = 2 | X = x ) , where the last equality follows from Assumption 4, and Pr( T ≥ t | M = m, A = 0 , X = x )= Pr( T ≥ t, M = m | A = 0 , X = x )Pr( M = m | A = 0 , X = x )= Pr( T ≥ t, M = m, M (0) ≤ T (0) | A = 0 , X = x )Pr( M = m | A = 0 , X = x )= (cid:80) u =1 , Pr( T (0) ≥ t | M (0) = m, X = x , U = u ) Pr( M (0) = m | X = x , U = u ) Pr( U = u | X = x ) (cid:80) u =1 , Pr( M = m (0) | X = x , U = u )) Pr( U = u | X = x ) . Then, the natural indirect eﬀect can be presented as

N IE ( t ; x ) = (cid:90) t Pr( T ≥ t | M = m, A = 1 , X = x ) × { Pr( M = m | M < ∞ , A = 1 , X = x ) − h ∗ ( m ; x ) } dm + Pr( M ≤ t | M < ∞ , A = 1 , X = x ) − (cid:90) t h ∗ ( m ; x ) dm. (A.2)where h ∗ ( m ; x ) is the solution to the equation Pr( M = m | A = 0 , X = x ) = h ( m ; x ) Pr( M < ∞| A = 1 , X = x )+ g { h ( m ; x ); x } { Pr(

M < ∞| A = 0 , X = x ) − Pr(

M < ∞| A = 1 , X = x ) } . N DE ( t ; x ) = (cid:90) t { Pr( T ≥ t | M = m, A = 1 , X = x ) − h ∗ ( m, t ; x ) } h ∗ ( m ; x ) dm, (A.3)where h ∗ ( m, t ; x ) is the solution to Pr( T ≥ t, M = m | A = 0 , X = x ) = h ( m, t ; x ) h ∗ ( m ; x ) Pr( M < ∞| A = 1 , X = x )+ g { h ( m, t ; x ); x } g { h ∗ ( m ; x ); x }× { Pr(

M < ∞| A = 0 , X = x ) − Pr(

M < ∞| A = 1 , X = x ) } . By similar derivations, we have

Pr( T ≥ t, M = ∞| A = 1 , X = x )= Pr( T ≥ t | A = 1 , X = x , U = 2) Pr( U = 2 | X = x )+ Pr( T ≥ t | A = 1 , X = x , U = 3) Pr( U = 3 | X = x )= g { Pr( T (1) ≥ t | X = x , U = 2) } Pr( U = 2 | X = x )+ Pr( T (1) ≥ t | X = x , U = 3) Pr( U = 3 | X = x ) , and Pr( T ≥ t, M = ∞| A = 0 , X = x ) = Pr( T (0) ≥ t | X = x , U = 3) Pr( U = 3 | X = x ) . Then, the stratum-speciﬁc total eﬀects for strata with U = 2 and U = 3 are T E ( t ; x ) = (cid:90) t g { h ∗ ( m, t ; x ); x } g { h ∗ ( m ; x ); x } dm − g { h ∗ ( t ; x ); x } , (A.4)and T E ( t ; x ) = h ∗ ( t ; x ) − Pr( T ≥ t | M = ∞ , A = 0 , X = x ) , (A.5)where h ∗ ( t ; x ) is the solution to the equation Pr( T = t, M < ∞| A = 1 , X = x ) = h ( t ; x ) Pr( M = ∞| A = 0 , X = x ) g { h ( t ; x ); x } { Pr( M = ∞| A = 1 , X = x ) − Pr( M = ∞| A = 0 , X = x ) } . In the special case that g k ( · ; x ) are identity functions, i.e., the stratum-speciﬁc jointdistributions of { M (0) , T } are the same for strata U = 1 and U = 2 given X = x , andthe stratum-speciﬁc distributions of T (0) are the same for strata U = 2 and U = 3 given X = x , the functions h ∗ k ’s have closed form h ∗ ( m ; x ) = Pr( M = m | M < ∞ , A = 0 , X = x ) ,h ∗ ( m, t ; x ) = Pr( T ≥ t | M = m, A = 0 , X = x ) ,h ∗ ( t ; x ) = Pr( T < t | M = ∞ , A = 1 , X = x ) . Then, the stratum-speciﬁc eﬀects can be identiﬁed by

N IE ( t ; x ) = (cid:90) t Pr( T ≥ t | M = m, A = 1 , X = x ) { Pr( M = m | M < ∞ , A = 1 , X = x ) − Pr( M = m | M < ∞ , A = 0 , X = x ) } dm + Pr( M ≤ t | M < ∞ , A = 1 , X = x ) − Pr( M ≤ t | M < ∞ , A = 0 , X = x ) ,N DE ( t ; x ) = (cid:90) t { Pr( T ≥ t | M = m, A = 1 , X = x ) − Pr( T ≥ t | M = m, A = 0 , X = x ) }× Pr( M = m | M < ∞ , A = 0 , X = x ) dm,T E ( t ; x ) = (cid:90) t Pr( T ≥ t, M = m | M < ∞ , A = 1 , X = x ) dm − Pr( T ≥ t | M = ∞ , A = 0 , X = x ) , and T E ( t ; x ) = Pr( T ≥ t | M = ∞ , A = 1 , X = x ) − Pr( T ≥ t | M = ∞ , A = 0 , X = x ) . In the presence of censoring, we cannot observe if M = ∞ , such that previous formulacannot be directly applied. However, we are still able to identify the quantities if we assumenon-informative censoring and suﬃcient follow-up in strata (Assumption 5). Particularly,33e consider the marker process I ( M ≤ t ) along with the event time T . Then, based onan extension of results in Maller and Zhou (1992), the probability Pr( M = ∞| A, X ) canbe consistently estimated by the empirical value of the marker process at the last observedfailure time. By replacing terms related to Pr( M = ∞| A, X ) by their estimators, weidentify the stratum-speciﬁc mediation eﬀects and total eﬀects. APPENDIX B Details on EM Algorithm

Based on the likelihood function with known U i , we are then able to propose an EMalgorithm treating U i ( i = 1 , . . . , n ) as missing data. In particular, the complete-datalog-likelihood (with known U i for i = 1 , . . . , n ) is given by l n ( β , γ , α , Λ)= n (cid:88) i =1 (cid:34) I ( U i = 1) (cid:40) α T1 (cid:102) X i + ∆ Mi (cid:32) log Λ { Z i } + η T M W i − e η T R W i (cid:88) t l ≤ V i λ l (cid:33) +∆ Mi ∆ Ti (cid:0) log Λ { V i } + η T R W i (cid:1) − (cid:0) − ∆ Ti + ∆ Mi ∆ Ti (cid:1) e η T M W i (cid:88) t l ≤ Z i λ l (cid:41) + I ( A i = 0 , U i = 2) (cid:40) α T2 (cid:102) X i + ∆ Mi (cid:32) log Λ { Z i } + η T M (cid:102) X i − e η T R (cid:102) X i (cid:88) t l ≤ V i λ l (cid:33) +∆ Mi ∆ Ti (cid:16) log Λ { V i } + η T R (cid:102) X i (cid:17) − (cid:0) − ∆ Ti + ∆ Mi ∆ Ti (cid:1) e η T M (cid:102) X i (cid:88) t l ≤ Z i λ l (cid:41) + I ( A i = 1 , U i = 2)(1 − ∆ Mi ) (cid:40) α T2 (cid:102) X i + ∆ Ti (cid:16) log Λ { Y i } + η T T (cid:102) X i (cid:17) − e η T T (cid:102) X i (cid:88) t l ≤ Y i λ l (cid:41) + I ( U i = 3)(1 − ∆ Mi ) (cid:40) ∆ Ti (cid:0) log Λ { Y i } + η T T W i (cid:1) − e η T T W i (cid:88) t l ≤ Y i λ l (cid:41) − log (cid:110) (cid:16) α T1 (cid:102) X i (cid:17) + exp (cid:16) α T2 (cid:102) X i (cid:17)(cid:111) (cid:35) .

34n the E-step of the EM algorithm, we evaluate the conditional expectation U i for subjects i = 1 , . . . , n . In particular, (cid:99) Pr( U i = 1) =∆ Mi (cid:26) I ( A i = 1) + I ( A i = 0) B i B i + B i (cid:27) + (1 − ∆ Mi )(1 − ∆ Ti ) D i D i + D i + D i (cid:99) Pr( U i = 2) =∆ Mi I ( A i = 0) B i B i + B i + (1 − ∆ Mi )∆ Ti I ( A i = 1) C i C i + C i + (1 − ∆ Mi )(1 − ∆ Ti ) D i D i + D i + D i (cid:99) Pr( U i = 3) =(1 − ∆ Mi )∆ Ti (cid:26) I ( A i = 0) + I ( A i = 1) C i C i + C i (cid:27) + (1 − ∆ Mi )(1 − ∆ Ti ) D i D i + D i + D i where B i = exp (cid:40) α T1 (cid:102) X i + η T M W i + ∆ Ti (cid:0) η T R W i (cid:1) − e η T M W i (cid:88) t l ≤ Z i λ l − e η T R W i (cid:88) t l ≤ V i λ l (cid:41) ,B i = exp (cid:40) α T2 (cid:102) X i + η T M (cid:102) X i + ∆ Ti (cid:16) η T R (cid:102) X i (cid:17) − e η T M (cid:102) X i (cid:88) t l ≤ Z i λ l − e η T R (cid:102) X i (cid:88) t l ≤ V i λ l (cid:41) ,C i = exp (cid:32) α T2 (cid:102) X i + η T T (cid:102) X i − e η T T (cid:102) X i (cid:88) t l ≤ Y i λ l (cid:33) ,C i = exp (cid:32) η T T W i − e η T T W i (cid:88) t l ≤ Y i λ l (cid:33) ,D i = exp (cid:32) α T1 (cid:102) X i − e η T M W i (cid:88) t l ≤ Z i λ l (cid:33) ,D i = exp (cid:16) α T2 (cid:102) X i (cid:17) (cid:40) I ( A i = 0) exp (cid:32) − e η T M (cid:102) X i (cid:88) t l ≤ Z i λ l (cid:33) + I ( A i = 1) exp (cid:32) − e η T T (cid:102) X i (cid:88) t l ≤ Y i λ l (cid:33)(cid:41) , and D i = exp (cid:32) − e η T T W i (cid:88) t l ≤ Y i λ l (cid:33) .

35n the M-step of the EM algorithm, we maximize the conditional expectation of thecomplete-data log-likelihood function. In particular, we update Λ , Λ , and Λ by λ l = (cid:80) ni =1 ∆ Mi I ( Z i = t l ) (cid:80) ni =1 (1 − ∆ Ti + ∆ Mi ∆ Ti ) I ( Z i ≥ t l ) S i ,λ l = (cid:80) ni =1 ∆ Mi ∆ Ti I ( V i = t l ) (cid:80) ni =1 ∆ Mi I ( V i ≥ t l ) S i ,λ l = (cid:80) ni =1 (1 − ∆ Mi )∆ Ti I ( Y i = t l ) (cid:80) ni =1 (1 − ∆ Mi ) I ( Y i ≥ t l ) S i , where S i = (cid:99) Pr( U i = 1) e η T M W i + (cid:99) Pr( U i = 2) I ( A i = 0) e η T M (cid:102) X i ,S i = (cid:99) Pr( U i = 1) e η T R W i + (cid:99) Pr( U i = 2) I ( A i = 0) e η T R (cid:102) X i ,S i = (cid:99) Pr( U i = 2) I ( A i = 1) e η T T (cid:102) X i + (cid:99) Pr( U i = 3) e η T T W i . We update η M by solving n (cid:88) i =1 ∆ Mi (cid:40)(cid:99) Pr( U i = 1) W i − (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) (cid:99) Pr( U j = 1) e η T M W j W j (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) S j (cid:41) = , and update η M by solving n (cid:88) i =1 ∆ Mi (cid:110)(cid:99) Pr( U i = 2) I ( A i = 0) (cid:102) X i − (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) (cid:99) Pr( U j = 2) I ( A j = 0) e β M + γ T M X j (cid:102) X j (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) S j (cid:41) = . We update η R by solving n (cid:88) i =1 ∆ Mi ∆ Ti (cid:40)(cid:99) Pr( U i = 1) W i − (cid:80) nj =1 I ( R j ≥ V i )∆ Mj (cid:99) Pr( U j = 1) e η T R W j W j (cid:80) nj =1 I ( R j ≥ V i )∆ Mj S j (cid:41) = , and update η R by solving n (cid:88) i =1 ∆ Mi ∆ Ti (cid:110)(cid:99) Pr( U i = 2) I ( A i = 0) (cid:102) X i (cid:80) nj =1 I ( R j ≥ V i )∆ Mj (cid:99) Pr( U j = 2) I ( A j = 0) e η T R (cid:102) X j (cid:102) X j (cid:80) nj =1 I ( R j ≥ V i )∆ Mj S j (cid:41) = . We update η T by solving n (cid:88) i =1 (1 − ∆ Mi )∆ Ti (cid:110)(cid:99) Pr( U i = 2) I ( A i = 1) (cid:102) X i − (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) (cid:99) Pr( U j = 2) I ( A j = 1) e η T T (cid:102) X j (cid:102) X j (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) S j (cid:41) = , and update η T by solving n (cid:88) i =1 (1 − ∆ Mi )∆ Ti (cid:110)(cid:99) Pr( U i = 3) W i − (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) (cid:99) Pr( U j = 1) e η T T W j W j (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) S j (cid:41) = . Finally, we update α by solving n (cid:88) i =1 (cid:99) Pr( U i = 1) − exp (cid:16) α T1 (cid:102) X i (cid:17) (cid:16) α T1 (cid:102) X i (cid:17) + exp (cid:16) α T2 (cid:102) X i (cid:17)  (cid:102) X i = , n (cid:88) i =1 (cid:99) Pr( U i = 2) − exp (cid:16) α T2 (cid:102) X i (cid:17) (cid:16) α T1 (cid:102) X i (cid:17) + exp (cid:16) α T2 (cid:102) X i (cid:17)  (cid:102) X i = . Starting with θ = and λ kl = 1 /m k for k = 1 , , , we iterate between the E-step andthe M-step until convergence to obtain the nonparametric maximum likelihood estimators ( (cid:98) θ , (cid:98) A ) . REFERENCES

Aalen, O. O., Stensrud, M. J., Didelez, V., Daniel, R., Røysland, K., and Strohmaier, S.(2020), “Time-dependent mediators in survival analysis: Modeling direct and indirecteﬀects with the additive hazards model,”

Biometr. J. , 62, 532–549.37ngrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), “Identiﬁcation of causal eﬀectsusing instrumental variables,”

J. Am. Statist. Ass. , 91, 444–455.Comment, L., Mealli, F., Haneuse, S., and Zigler, C. (2019), “Survivor average causaleﬀects for continuous time: a principal stratiﬁcation approach to causal inference withsemicompeting risks,” arXiv preprint arXiv:1902.09304 , .Copelan, E. A., Biggs, J. C., Thompson, J. M., Crilley, P., Szer, J., Klein, J. P., Kapoor, N.,Avalos, B. R., Cunningham, I., and Atkinson, K. (1991), “Treatment for Acute MyelocyticLeukemia with Allogeneic Bone Marrow Transplantation Following Preparation withBuCy2,”

Blood , 78, 838–843.Didelez, V. (2019), “Deﬁning causal mediation with a longitudinal mediator and a survivaloutcome,”

Lifetime Data Anal. , 25, 593–610.Fine, J. P., Jiang, H., and Chappell, R. (2001), “On Semi-competing Risks Data,”

Biometrika , 88, 907–919.Frangakis, C. E., and Rubin, D. B. (2002), “Principal stratiﬁcation in causal inference,”

Biometrics , 58, 21–29.Huang, Y.-T. (2020), “Causal mediation of semicompeting risks,”

Biometrics , .Klein, J. P., and Moeschberger, M. L. (2006),

Survival Analysis: Techniques for Censoredand Truncated Data , New York: Springer.Lange, T., and Hansen, J. V. (2011), “Direct and Indirect Eﬀects in a Survival Context,”

Epidemiol. , 22, 575–581.Lange, T., Vansteelandt, S., and Bekaert, M. (2012), “A Simple Uniﬁed Approach forEstimating Natural Direct and Indirect Eﬀects,”

Am. J. Epidemiol. , 176, 190–195.38in, D., Sun, W., and Ying, Z. (1999), “Nonparametric estimation of the gap time distri-bution for serial events with censored data,”

Biometrika , 86, 59–70.Lin, S.-H., Young, J. G., Logan, R., and VanderWeele, T. J. (2017), “Mediation analysis fora survival outcome with time-varying exposures, mediators, and confounders,”

Statisticsin medicine , 36, 4153–4166.Maller, R. A., and Zhou, S. (1992), “Estimating the proportion of immunes in a censoredsample,”

Biometrika , 79, 731–739.Mason, M. D., Parulekar, W. R., Sydes, M. R., Brundage, M., Kirkbride, P., Gospodarow-icz, M., Cowan, R., Kostashuk, E. C., Anderson, J., Swanson, G. et al. (2015), “Finalreport of the intergroup randomized study of combined androgen-deprivation therapyplus radiotherapy versus androgen-deprivation therapy alone in locally advanced prostatecancer,”

J. Clin. Oncol. , 33, 2143–2150.Tchetgen Tchetgen, E. J. (2011), “On Causal Mediation Analysis with a Survival Outcome,”

Int. J. Biostat. , 7, 1–38.VanderWeele, T. J. (2011), “Causal Mediation Analysis with Survival Data,”

Epidemiol. ,22, 582–585.Vansteelandt, S., Linder, M., Vandenberghe, S., Steen, J., and Madsen, J. (2019), “Medi-ation analysis of time-to-event endpoints accounting for repeatedly measured mediatorssubject to time-varying confounding,”

Statistics in medicine , 38, 4828–4840.Wang, W., and Wells, M. T. (1998), “Nonparametric estimation of successive durationtimes under dependent censoring,”

Biometrika , 85, 561–572.Xu, J., Kalbﬂeisch, J. D., and Tai, B. (2010), “Statistical Analysis of Illness–death Processesand Semicompeting Risks Data,”

Biometrics , 66, 716–725.39u, W., Chen, K., Sobel, M. E., and Ying, Z. (2015), “Semiparametric transformationmodels for causal inference in time-to-event studies with all-or-nothing compliance,”

J.R. Statist. Soc. B , 77, 397–415.Zhang, J. L., and Rubin, D. B. (2003), “Estimation of causal eﬀects via principal stratiﬁca-tion when some outcomes are truncated by “death”,”

J. Educ. Behav. Stat. , 28, 353–368.Zheng, W., and van der Laan, M. (2017), “Longitudinal Mediation Analysis with Time-varying Mediators and Exposures, with Application to Survival Outcomes,”