Defining and Estimating Subgroup Mediation Effects with Semi-Competing Risks Data
DDefining and Estimating Subgroup Mediation Effects withSemi-Competing Risks Data
Fei GaoFred Hutchinson Cancer Research Center, Seattle, WA, USA.Fan Xia and Kwun Chuen Gary ChanDepartment of Biostatistics, University of Washington, Seattle, WA, USA.
Abstract
In many medical studies, an ultimate failure event such as death is likely to be af-fected by the occurrence and timing of other intermediate clinical events. Both eventtimes are subject to censoring by loss-to-follow-up but the nonterminal event mayfurther be censored by the occurrence of the primary outcome, but not vice versa.To study the effect of an intervention on both events, the intermediate event maybe viewed as a mediator, but conventional definition of direct and indirect effects isnot applicable due to semi-competing risks data structure. We define three principalstrata based on whether the potential intermediate event occurs before the potentialfailure event, which allow proper definition of direct and indirect effects in one stratumwhereas total effects are defined for all strata. We discuss the identification condi-tions for stratum-specific effects, and proposed a semiparametric estimator based ona multivariate logistic stratum membership model and within-stratum proportionalhazards models for the event times. By treating the unobserved stratum membershipas a latent variable, we propose an EM algorithm for computation. We study theasymptotic properties of the estimators by the modern empirical process theory andexamine the performance of the estimators in numerical studies.
Keywords:
Illness-death model; Missing data; Principal stratification; Proportional hazardsmodel; Survival analysis. 1 a r X i v : . [ s t a t . M E ] S e p . Introduction Evaluating the causal effects of an intervention on a clinical outcome is a common theme inmany medical studies. After an overall relationship between the intervention and outcomeis established, it is often of further interest to understand the biological or mechanisticpathways that contribute to the causal treatment effect. Causal mediation analysis is oftenutilized to disentangle the total treatment effect by decomposing it into the indirect effect,i.e., the effect exerted by intermediate variables (mediators), and the direct effect, i.e., theeffect involving pathways independent of the hypothesized mediators. A number of methodswere proposed for causal mediation analysis with survival outcomes, for a single mediatormeasured at study entry (Lange and Hansen 2011; VanderWeele 2011; Tchetgen Tchetgen2011; Lange et al. 2012) and for longitudinal mediators (Lin et al. 2017; Zheng and van derLaan 2017; Didelez 2019; Vansteelandt et al. 2019; Aalen et al. 2020).In many biomedical studies, intermediate, non-terminal landmark events are recordedin addition to the primary failure event because they are important to evaluate prognosis.Due to the ordering of the two events, the non-terminal event is subject to censoring by theoccurrence of the terminal event, but not vice versa, such that semi-competing risks data areobserved (Fine et al. 2001). In this paper, we consider a setting where a non-terminal eventmay serve as a mediator for individuals to whom the event would occur before the terminalevent. An example is a multi-center trial of allogeneic bone marrow transplants in patientswith acute leukemia (Copelan et al. 1991; Klein and Moeschberger 2006), where the primaryinterest is on the effect of different treatment regimen (methotrexate + cyclosporine vsmethylprednisolone + cyclosporine) on the survival time. The event time of an intermediateendpoint, chronic graft-versus-host disease (GVHD), is a major side effect of the transplantthat can be lethal. However, some patients died without experiencing GVHD, such thatGVHD event time is subject to censoring by the death time.1ausal mediation analysis with semi-competing risks data is particularly challenging.First, the mediator is only well-defined for those who would have the non-terminal event de-veloped before the occurrence of the primary event. Therefore, the conventional definitionof natural indirect and direct effects based on replacing the counterfactual of mediator un-der one treatment by that under the other can hardly apply to the entire population. Thischallenge is similar to that of the ‘truncation-by-death’ problem (Zhang and Rubin 2003;Comment et al. 2019), where the primary outcome is only available if the terminal eventdoes not occur. However, there is a substantial difference in that the primary outcome ofinterest is the terminal event in our setting. Moreover, the semi-competing risks data struc-ture, that is, the primary event may censor the intermediate event but not vice versa, postsadditional challenges in the identifiability of parameters. Upon finishing this paper, webecame aware of the newly accepted paper by Huang (2020) which considered this problemby a counting process framework. The problem formulation, estimand and assumptionsare all very different from our work. For instance, we do not make sequential ignorabilityassumptions on surviving subpopulations at arbitrary post-treatment time, because thoseevolving subpopulations are generally healthier than the baseline study population beforethe treatment is assigned.In this paper, we consider a novel principal stratification approach to define causalmediation effects in the subgroup where the intermediate event happens when given eithertreatment, i.e., those susceptible to the intermediate event under both treatments. Thenotations and settings are given in Sections 2.1 and 2.2. We discuss the identificationconditions needed for estimating the stratum-specific natural indirect and direct effectsin Section 2.3, and proposed a semiparametric estimator based on a multivariate logisticstratum membership model and within-stratum proportional hazards models for the eventtimes in Section 2.4. By treating the unobserved stratum membership as a latent variable,we propose an EM algorithm for computation of the nonparametric maximum likelihood2stimator in Section 2.5. We also study the asymptotic properties of the estimators usingthe modern empirical process theory in Section 2.6 and examine the performance of theestimators in simulation studies Section 3. An analysis of data from a clinical trial isgiven in Section 4, and concluding remarks are included in Section 5. Proofs and detailedderivations are given in the Appendix.
2. Methods2.1. Notations for Observed Data
Let A be a binary treatment, T be the time to a primary event of interest and M bethe time to an intermediate, non-terminal event. The intermediate event time M may becensored by the occurrence of the primary event, but not vice versa, such that we observesemi-competing risks data. For example, A is a treatment for prolonging survival time, T is the time to death, and M is the time to cancer progression. The occurrence of deathmay censor the cancer progression onset, but not vice versa.Let X be a collection of baseline covariates that may be associated with either or bothevents. Let C denote a censoring time for the primary event, for example, end of follow-uptime. Then, we observe Y ≡ min( T, C ) and ∆ T = I ( T ≤ C ) for the primary event, and Z ≡ min( M, Y ) and ∆ M = I ( M ≤ Y ) for the intermediate event. The observations areversions of the counterfactual variables that we define as follows. To define causal mediation effects of interest, we adopt the potential outcomes framework.In conventional causal mediation analysis based on counterfactuals, M ( a ) denotes the coun-terfactual nonterminal event time when the treatment is set to a and T ( a, m ) denotes thecounterfactual terminal event time when the treatment is set to a and the nonterminalevent time (mediator) is set to m . A comparison of T ( a, M ( a )) with T ( a, M ( a ∗ )) would3efine a measure of the natural indirect effect of changing the mediator from M ( a ) to M ( a ∗ ) and a comparison of T ( a, M ( a ∗ )) with T ( a ∗ , M ( a ∗ )) would define a measure of the naturaldirect effect of changing the treatment from a to a ∗ . Both natural indirect and direct ef-fects involve the term T ( a, M ( a ∗ )) , i.e., the counterfactual outcome for the terminal eventtime when the treatment is set to a and the nonterminal event time is set to M ( a ∗ ) , thecounterfactual nonterminal event time when the treatment is set to a ∗ .However, these conventional definitions are inadequate for semi-competing risk settingsand needs to be modified for the following reasons. The potential non-terminal event mayor may not occur before the potential primary event time under different treatment as-signments. When the potential primary event happens before the potential intermediateevent, the value of the mediator is not well-defined (and is often set to ∞ by convention)and in such a case the potential primary event time shall not be dependent on an arbi-trary m greater than the potential primary event time. Due to these considerations, weexamine causal effects based on our proposed principal stratification approach, extendedfrom Frangakis and Rubin (2002). Intuitively, we stratify the study population into latentclasses identified by U with 3 categories based on whether they are susceptible to the non-terminal event under different treatment assignments. For an individual with U = 1 , weassume the residual counterfactual event time T ( a, m ) − m is non-negative with probability1 for m ∈ M , where M is the support of M ( a ) . This does not only imply susceptibilityunder either treatment, but also implies T ( a, M ( a ∗ )) > M ( a ∗ ) for a (cid:54) = a ∗ . For U = 2 ,we assume M (1) = ∞ with probability 1, and T (1 , m ) is only defined for m = ∞ , while ( T (0 , m ) , M (0)) is defined for m ∈ M such that T (0 , m ) − m is non-negative with prob-ability 1. That is, the treatment “prevents” the subject with U = 2 from having theintermediate event. For individuals with U = 3 , we assume M (0) = M (1) = ∞ withprobability 1 and T ( a, m ) is defined for a ∈ { , } with m = ∞ . That is, they are always“non-susceptible” to the non-terminal event. Here, we assume that the fourth stratum with4he nonterminal event only present in the treated does not exist. This restriction is alongthe same line as the “no defier” assumption commonly adopted in the instrumental variablesmethods, suggesting that the treatment effect is “monotone” and no reversed effect for thesubjects (Angrist et al. 1996). Remark . The defined strata (and associated stratum-specific effects) are substantiallydifferent from the survivors’ principal stratum (and the survivor average causal effect(SACE)) that is commonly defined in “truncation by death” literature (Zhang and Ru-bin 2003; Comment et al. 2019). In particular, the survivors’ principal stratum is definedby { T (0) ≥ t, T (1) ≥ t } for some fixed time t in Comment et al. (2019), whereas ourdefinition does not depend on any arbitrary post-treatment time t . Remark . Lin et al. (2017) explained the difficulties in defining natural mediation effectsin survival context with longitudinal mediators. They defined interventional effects ina discrete-time setting, where the mediators and past survival status are subject to ahypothetical intervention. They mentioned that principal stratification as an alternativeframework to avoid such hypothetical intervention, but did not explore further. We considera different setting that shares some of the difficulties, but also with unique data structureso that principal strata can be defined.Using our modified definition, the convention T ( a ) = T ( a, M ( a )) still holds for allindividuals, while T ( a, M ( a ∗ )) , a (cid:54) = a ∗ , is only well-defined for U = 1 . Therefore, thestratum-specific indirect and direct effects can only be defined for subjects with U = 1 ,while total effects are still well-defined for U = 2 and U = 3 . In light of these observations,for stratum with U = 1 , we define the stratum-specific natural indirect and direct effectsas N IE ( t ; x ) = Pr { T (1 , M (1)) ≥ t | X = x , U = 1 } − Pr { T (1 , M (0)) ≥ t | X = x , U = 1 } (1)5nd N DE ( t ; x ) = Pr { T (1 , M (0)) ≥ t | X = x , U = 1 } − Pr { T (0 , M (0)) ≥ t | X = x , U = 1 } . (2)In stratum with U = 2 , T (1 , M (0)) is not well-defined because T (1 , m ) is only defined for m = ∞ and M (0) < ∞ with probability 1. However, T ( a ) = T ( a, M ( a )) is still well-definedand the stratum-specific total effect is T E ( t ; x ) = Pr { T (1) ≥ t | X = x , U = 2 } − Pr { T (0) ≥ t | X = x , U = 2 } . In stratum with U = 3 , M (0) = M (1) = ∞ and T (1 , M (0)) = T (1 , M (1)) so there is noindirect effect. The stratum-specific total effect can still be defined as T E ( t ; x ) = Pr { T (1) ≥ t | X = x , U = 3 } − Pr { T (0) ≥ t | X = x , U = 3 } . Remark . In principle, a mediator shall satisfy temporal precedence, that it shall occurbefore the primary event. Therefore, one can view that the mediator is technically absentin U = 3 , and an attempt to define mediation effects would be futile. In U = 2 , thepresence of the mediator before the primary event only happens in one treatment levelwith certainty. As a result, one cannot fix the mediator level at a different treatment level,and mediation effects cannot be defined. Note that in U = 2 , T E can be interpreted asthe treatment effect in survival among individuals whose mediating events are preventedby the treatment. To identify the stratum-specific natural indirect and direct effects and stratum-specific totaleffects, we impose the following assumptions.
Assumption 1 (Consistency) . If A = a , then M = M ( a ) and T = T ( a ) with probability1; if A = a , and M = m , then T = T ( a, m ) with probability 1. ssumption 2 (Sequential Ignorability within Stratum) . For a = 0 , , a ∗ = 0 , , m ∈ (0 , τ ] , and u = 1 , , , { T ( a, m ) , M ( a ∗ ) } ⊥ A | X , U = u (3) and T ( a, m ) ⊥ M | A = a ∗ , X , U = u. (4)Assumptions 1 is a standard assumption for causal inference with no unmeasured con-founding. Assumption 2 is a standard assumption for mediation analysis in U = 1 . For U = 2 , , Assumption 2 reduces to the standard assumption of conditional exchangeabilitywithin stratum. Based on Assumptions 1 - 2, we are able to connect the stratum-specificnatural indirect and direct effects and stratum-specific total effects with the distributionof the observed data given stratum membership as follows. Theorem 1.
Under Assumptions 1 - 2, for stratum with U = 1 , the stratum-specific naturalindirect effect N IE ( t ; x ) is equal to (cid:90) t { − Pr(
T < t | M = m, X = x , A = 1 , U = 1) }× (cid:8) dF M | X = x ,A =1 ,U =1 ( m ) − dF M | X = x ,A =0 ,U =1 ( m ) (cid:9) + Pr( M ≤ t | X = x , A = 0 , U = 1) − Pr( M ≤ t | X = x , A = 1 , U = 1) , and the stratum-specific natural direct effect N DE ( t ; x ) is equal to (cid:90) t { Pr(
T < t | M = m, X = x , A = 0 , U = 1) − Pr(
T < t | M = m, X = x , A = 1 , U = 1) } dF M | X = x ,A =0 ,U =1 ( m ) . Under Assumptions 1 and 2, for stratum with U = 2 , the stratum-specific total effect T E ( t ; x ) is equal to Pr( T ≥ t | A = 1 , X = x , U = 2) − Pr( T ≥ t | A = 0 , X = x , U = 2); nd for stratum with U = 3 , the stratum-specific total effect T E ( t ; x ) is equal to Pr( T ≥ t | A = 1 , X = x , U = 3) − Pr( T ≥ t | A = 0 , X = x , U = 3) . The proof of Theorem 1 is given in Appendix A.1. Since U is unobserved, we cannotuse Theorem 1 directly to identify those stratum-specific effects from observed data. To doso, one would further assume Assumption 3 (Stratum Membership Independent of Treatment) . With probability one, U is conditional independent of A given X . Assumption 4 (Restriction on Stratum-Specific Distributions) . With probability one,
Pr( M (0) = m | X = x , U = 2) = g { Pr( M (0) = m | X = x , U = 1); x } , Pr( T (0) ≥ t | M (0) = m, X = x , U = 2) = g { Pr( T (0) ≥ t | M (0) = m, X = x , U = 1); x } , Pr( T (1) ≥ t | X = x , U = 2) = g { Pr( T (1) ≥ t | X = x , U = 3); x } , for some known functions g k ( · ; x ) ( k = 1 , , . Assumption 5 (Non-Informative Censoring and Sufficient Follow-up) . ( M, T, U ) is con-ditionally independent of C given A and X , and the upper bound of the support of T is nolarger than that of C . Assumption 3 requires that the stratum membership is not affected by the treatmentassignment A given covariates X . Assumption 4 requires some knowledge on the rela-tionships of stratum-specific event time distributions. The first part of Assumption 5 is astandard assumption for non-informative censoring time. The second part of Assumption 5is an extension of the independent censoring and sufficient follow-up assumption in Mallerand Zhou (1992) for nonparametric estimation of cured proportion in censored data. Theassumption on the upper bounds of the supports ensures sufficient observation of the tail8ehaviour of the event times for identification of stratum membership. By further assumingAssumptions 3 - 5, we obtain the identification results in Theorem 2, whose proof is givenin Appendix A.2. Theorem 2.
Under Assumptions 1 - 5, the stratum-specific effects can be identified, withidentification formulas given in (A.2) - (A.5).
Theorem 2 gives the identification result based on nonparametric models for U andfor ( M, T ) given U with minimum assumptions. In particular, Assumption 4 requiressome modeling assumptions to be made. In practice, we may consider additional modelassumptions for U and ( M, T ) to gain power in understanding the causal effects. In thenext section, we extend the multistate modeling idea in the literature of semi-competingrisks data to form such a model. One way to model semi-competing risks data is to use a multistate framework (Xu et al.2010). In multistate analysis of semi-competing risks data, usually three states (states 1- 3) are involved, corresponding to healthy (state 1), illness (state 2), and death (state 3)in an illness-death model. All subjects starts at state 1. A subject enters state 2 if he/shedevelops the intermediate event, while he/she enters state 3 if he/she develops the primaryevent. In traditional illness-death model for semi-competing risks data, three processesmoving from one state to another are modeled: (1) healthy to illness (state 1 to 2), (2)illness to death (state 2 to 3), and (3) healthy to death (state 1 to 3).Here, we extend the idea and model the processes moving from one state to another indifferent strata defined in Section 2.2.. For subjects with U = 1 and subjects with U = 2 receiving A = 0 , the processes of healthy to illness and illness to death are involved andwe model the time to the nonterminal event M and the residual time R ≡ T − M . Weassume that M and R are conditionally independent given A, X , and U . This serves two9urposes: to obtain a tractable EM algorithm in Appendix B, and to avoid the problemof induced informative censoring caused by residual dependence between M and R (Wangand Wells 1998; Lin et al. 1999). For subjects with U = 2 receiving A = 1 and subjectswith U = 3 , the process of healthy to death is involved. This proposed model is related tobut different from the illness-death model, in that the transition structure depends on theprincipal strata in our proposed model.Suppose that for a subject with U = 1 , the nonterminal event time follows a proportionalhazards model with hazard function given by λ (1) M ( t | A = a, X = x ) = λ ( t ) exp (cid:0) β M a + γ T M x (cid:1) , and the gap time between the occurrences of nonterminal and terminal events R follows aproportional hazards model with hazard function given by λ (1) R ( r | A = a, X = x ) = λ ( r ) exp (cid:0) β R a + γ T R x (cid:1) . Suppose that for subject with U = 2 and unexposed to treatment ( A = 0) , the nonterminalevent time follows a proportional hazards model with hazard function given by λ (2) M ( t | A = 0 , X = x ) = λ ( t ) exp (cid:0) β M + γ T M x (cid:1) , and the gap time between the occurrences of nonterminal and terminal events followsanother proportional hazards model with hazard function given by λ (2) R ( r | A = 0 , X = x ) = λ ( r ) exp (cid:0) β R + γ T R x (cid:1) . Here, subjects with U = 1 and subjects with U = 2 unexposed to treatment share the samebaseline hazard functions, although the hazard ratios for covariates may be different. Theparameters β M and β R are the log hazard ratios of treatment on the nonterminal eventtime and gap time, respectively, for subjects with U = 1 ; the parameters β M and β R are10he log hazard ratios on the nonterminal event time and gap time, respectively, comparingsubjects with U = 1 and U = 2 who both unexposed to treatment with baseline covariatesvalue X = .For subject with U = 2 and exposed with treatment ( A = 1 ), we assume that theterminal event time follows a proportional hazards model with hazard function given by λ (2) T ( t | A = 1 , X = x ) = λ ( t ) exp (cid:0) β T + γ T T x (cid:1) . For subject with U = 3 , we suppose that the terminal event time follows a proportionalhazards model with hazard function given by λ (3) T ( t | A = a, X = x ) = λ ( t ) exp (cid:0) β T a + γ T T x (cid:1) . Note that the terminal event times for subject with U = 3 and subject with U = 2 exposedto treatment share the same baseline hazard function. The parameter β T is the log hazardratio of treatment on the terminal event time for subjects with U = 3 , while β T is the loghazard ratio of the terminal event time comparing subjects with U = 3 and A = 0 withsubjects with U = 2 and A = 1 , with the same covariates value X = .The natural indirect and direct effects in stratum with U = 1 can be presented as N IE ( t | X = x ) = (cid:90) t exp (cid:110) − Λ ( t − m ) e β R + γ T R x (cid:111) λ ( m ) e γ T M x × (cid:104) e β M exp (cid:110) − Λ ( m ) e β M + γ T M x (cid:111) − exp (cid:110) − Λ ( m ) e γ T M x (cid:111)(cid:105) dm + exp (cid:110) − Λ ( t ) e β M + γ T M x (cid:111) − exp (cid:110) − Λ ( t ) e γ T M x (cid:111) and N DE ( t | X = x ) = (cid:90) t (cid:104) exp (cid:110) − Λ ( t − m ) e β R + γ T R x (cid:111) − exp (cid:110) − Λ ( t − m ) e γ T R x (cid:111)(cid:105) × λ ( m ) e γ T M x exp (cid:110) − Λ ( m ) e γ T M x (cid:111) dm, Λ ( t ) = (cid:82) t λ ( s ) ds and Λ ( t ) = (cid:82) t λ ( s ) ds . The total effects in strata with U = 2 and U = 3 are given by T E ( t | X = x ) = exp (cid:110) − Λ ( t ) e β T + γ T T x (cid:111) − (cid:90) t λ ( m ) e β M + γ T M x × exp (cid:110) − Λ ( m ) e β M + γ T M x (cid:111) (cid:104) − exp (cid:110) − Λ ( t − m ) e β R + γ T R x (cid:111)(cid:105) dm and T E ( t | X = x ) = exp (cid:110) − Λ ( t ) e β T + γ T T x (cid:111) − exp (cid:110) − Λ ( t ) e γ T T x (cid:111) , where Λ ( t ) = (cid:82) t λ ( s ) ds .As in Yu et al. (2015), we consider a multinomial logistic regression model on thestratum membership. In particular, we assume w ( x ; α ) = Pr( U = 1 | X = x ) = exp (cid:0) α T1 (cid:101) x (cid:1) α T1 (cid:101) x ) + exp ( α T2 (cid:101) x ) ,w ( x ; α ) = Pr( U = 2 | X = x ) = exp (cid:0) α T2 (cid:101) x (cid:1) α T1 (cid:101) x ) + exp ( α T2 (cid:101) x ) , and w ( x ; α ) = Pr( U = 3 | X = x ) = { (cid:0) α T1 (cid:101) x (cid:1) + exp (cid:0) α T2 (cid:101) x (cid:1) } − , where α =( α T1 , α T2 ) T and (cid:101) x = (1 , x T ) T . Then, the marginalized stratum-specific natural indirect anddirect effects are given by N IE ( t ) = P r { T (1 , M (1)) ≥ t | U = 1 } − P r { T (1 , M (0)) ≥ t | U = 1 } = (cid:82) N IE ( t | X = x ) w ( x ; α ) dF ( x ) (cid:82) w ( x ; α ) dF ( x ) and N DE ( t ) = P r { T (1 , M (0)) ≥ t | U = 1 } − P r { T (0 , M (0)) ≥ t | U = 1 } = (cid:82) N DE ( t | X = x ) w ( x ; α ) dF ( x ) (cid:82) w ( x ; α ) dF ( x ) , where F ( · ) is the cumulative distribution function of X .12 .5. Nonparametric Maximum Likelihood Estimation For a random sample of n subjects, the observed semi-competing risks data are given by O = {O i : i = 1 , . . . , n } , where O i = { ∆ Mi , Z i , ∆ Ti , Y i , A i , X i } . For i = 1 , . . . , n , if ∆ Mi = 1 , then the likelihood corresponding to subject i is given by (cid:101) L i ( O i ) = Pr ( U i = 1 | X i ) Pr (cid:0) Z i , Y i , ∆ Ti | U i = 1 , X i , A i (cid:1) + I ( A i = 0) Pr ( U i = 2 | X i ) Pr (cid:0) Z i , Y i , ∆ Ti | U i = 2 , X i , A i = 0 (cid:1) ; if ∆ Mi = 0 and ∆ Ti = 1 , then the likelihood corresponding to subject i is given by (cid:101) L i ( O i ) = Pr ( U i = 3 | X i ) Pr (cid:0) Y i , ∆ Ti | U i = 3 , X i (cid:1) + I ( A i = 1) Pr ( U i = 2 | X i ) Pr (cid:0) Y i , ∆ Ti | U i = 2 , X i , A i = 1 (cid:1) ; and if ∆ Mi = ∆ Ti = 0 , then the likelihood corresponding to subject i is given by (cid:101) L i ( O i ) = Pr ( U i = 1 | X i ) Pr (cid:0) Z i , ∆ Mi , Y i , ∆ Ti | U i = 1 , X i (cid:1) + Pr ( U i = 2 | X i ) (cid:8) I ( A i = 0) Pr (cid:0) Z i , ∆ Mi , Y i , ∆ Ti | U i = 2 , X i , A i = 0 (cid:1) + I ( A i = 1) Pr (cid:0) Y i , ∆ Ti | U i = 2 , X i , A i = 1 (cid:1)(cid:9) + Pr ( U i = 3 | X i ) Pr (cid:0) Y i , ∆ Ti | U i = 3 , X i (cid:1) . Therefore, the likelihood function for the observed data O is given by n (cid:89) i =1 (cid:101) L i ( O i ) ∆ Mi (cid:110)(cid:101) L i ( O i ) ∆ Ti (cid:101) L i ( O i ) − ∆ Ti (cid:111) − ∆ Mi . We consider the nonparametric maximum likelihood estimation such that the estimatorsfor Λ , Λ , and Λ are step functions. In particular, let < t < · · · < t m < ∞ be theordered sequence of event times Z i ’s with ∆ Mi = 1 ; let < t < · · · < t m < ∞ be the13rdered sequence of gap times V i ≡ Y i − Z i ’s with ∆ Mi = ∆ Ti = 1 ; and let < t < · · · Under regularity conditions, (cid:13)(cid:13)(cid:13)(cid:98) θ − θ (cid:13)(cid:13)(cid:13) + (cid:88) k =1 sup t ∈ [0 ,τ k ] (cid:12)(cid:12)(cid:12)(cid:98) Λ k ( t ) − Λ k ( t ) (cid:12)(cid:12)(cid:12) converges to zero almost surely. In addition, √ n { (cid:98) θ − θ , (cid:98) Λ ( · ) − Λ )( · ) , (cid:98) Λ ( · ) − Λ )( · ) , (cid:98) Λ ( · ) − Λ )( · ) } converges weakly to a zero-mean Gaussian process in the Banach space R m × l ∞ ( A ) × l ∞ ( A ) × l ∞ ( A ) , where m is the dimension of θ and A k is the unit ball in thespace of functions on [0 , τ k ] with bounded variation for k = 1 , , . Theorem 4. Under regularity conditions, the estimators for stratum-specific effects givenin (5)-(10) are consistent and asymptotically normal. Since the form of the limiting variances of the stratum-specific effects is complicated,we estimate the variance of the estimators by a nonparametric bootstrap procedure in allnumerical studies. 3. Simulation Studies We conducted simulation studies to examine the performance of the proposed methods. Wegenerated two covariates X ∼ N (0 , and X ∼ U nif (0 , and generated the treatmentindicator A ∼ Bin (0 . to reflect 1:1 randomization. We set Λ ( t ) = t , Λ ( t ) = 0 . t ,and Λ ( t ) = log(1 + t ) , while the true values of the other parameters are shown in Table16, along with the simulation results. We generated a censoring time C ∼ U nif (0 , to obtain approximately 51% and 26% censoring rates for the nonterminal and terminalevents, respectively. The proportions of subjects with U = 1 , , are approximately 31%,41%, and 28%, respectively.We considered replicates with sample sizes n = 1000 and , where boot-strap samples were used for variance estimation. Table 1 shows the simulation results,where Bias, SE and SEE denote, respectively, the averaged bias, empirical standard errorand averaged standard error estimates, and CP stands for the empirical coverage prob-ability of the 95% confidence intervals. All examined replications converge with a − convergence criterion. The parameter estimators are virtually unbiased. The bootstrapvariance estimator overestimates the true variability for some of the parameters, but it getsmore accurate when sample size increases.The estimators for the baseline cumulative hazard functions only take jump till the lastobservation times, such that the estimates after the last observation time is not meaningful.Therefore, to summarize the performance of the baseline hazard estimators, for every timepoint t , we only consider the replicates with last observation time greater than t . Figure 1shows the median of the estimated baseline hazard functions, among such replicates. Weplot till the time point at which at least 800 replicates have meaningful estimates. Thebias gets smaller as sample size increases.Table 2 shows the performance of the estimated stratum-specific indirect and directeffects in stratum with U = 1 and X = (0 . , . T , as well as the estimated total effectsfor strata with U = 2 , and the same covariate values. Similarly, for any t the averagewas taken over all the replicates with estimators that have last jump time no less than t .The bias gets smaller as sample size increases. The variance estimator is accurate and thecoverage probability is close to the nominal level when sample size is large.17able 1. Simulation Results for Regression Parameters True n = 1000 n = 2000 Value Bias SE SEE CP Bias SE SEE CP β M . . 008 0 . 224 0 . 261 0 . 97 0 . 006 0 . 147 0 . 168 0 . γ M . . 010 0 . 091 0 . 095 0 . 96 0 . 005 0 . 063 0 . 063 0 . . . 014 0 . 269 0 . 296 0 . 96 0 . 010 0 . 192 0 . 195 0 . β R . − . 043 0 . 276 0 . 334 0 . − . 025 0 . 161 0 . 202 0 . γ R − . − . 011 0 . 095 0 . 102 0 . − . 004 0 . 065 0 . 067 0 . − . . 001 0 . 308 0 . 332 0 . 97 0 . 002 0 . 212 0 . 214 0 . β M − . . 005 0 . 503 0 . 599 0 . − . 002 0 . 327 0 . 380 0 . γ M . . 025 0 . 123 0 . 154 0 . 98 0 . 011 0 . 078 0 . 090 0 . . . 017 0 . 393 0 . 476 0 . 98 0 . 012 0 . 252 0 . 287 0 . β R . − . 068 0 . 586 0 . 699 0 . − . 028 0 . 358 0 . 435 0 . γ R . − . 013 0 . 187 0 . 230 0 . − . 002 0 . 115 0 . 138 0 . . − . 020 0 . 476 0 . 546 0 . − . 002 0 . 303 0 . 337 0 . β T . . 028 0 . 485 0 . 523 0 . 98 0 . 011 0 . 357 0 . 359 0 . γ T − . . 015 0 . 175 0 . 197 0 . 98 0 . 012 0 . 124 0 . 132 0 . − . − . 018 0 . 435 0 . 509 0 . 98 0 . 009 0 . 312 0 . 324 0 . β T . − . 044 0 . 410 0 . 448 0 . − . 034 0 . 334 0 . 332 0 . γ T − . − . 035 0 . 107 0 . 118 0 . − . 024 0 . 076 0 . 078 0 . . − . 017 0 . 371 0 . 402 0 . − . 029 0 . 265 0 . 261 0 . α . . 009 0 . 199 0 . 213 0 . 96 0 . 010 0 . 148 0 . 146 0 . . − . 005 0 . 113 0 . 113 0 . − . 003 0 . 078 0 . 078 0 . . − . 002 0 . 343 0 . 367 0 . − . 007 0 . 256 0 . 253 0 . α . . 025 0 . 296 0 . 321 0 . 96 0 . 010 0 . 201 0 . 211 0 . − . − . 019 0 . 155 0 . 175 0 . − . 019 0 . 108 0 . 114 0 . . − . 023 0 . 491 0 . 537 0 . 97 0 . 000 0 . 342 0 . 355 0 . NOTE: Bias, SE and SEE denote, respectively, the mean bias,empirical standard error and mean standard error estimator. CPstands for the empirical coverage probability of the 95% confi-dence interval. True n = 1000 n = 2000 t Value Bias SE SEE CP Bias SE SEE CP N DE − . 11 0 . 014 0 . 056 0 . 070 0 . 97 0 . 007 0 . 031 0 . 041 0 . − . 17 0 . 018 0 . 090 0 . 106 0 . 96 0 . 010 0 . 052 0 . 066 0 . − . 18 0 . 020 0 . 096 0 . 107 0 . 94 0 . 010 0 . 057 0 . 069 0 . N IE − . − . 001 0 . 022 0 . 025 0 . 97 0 . 000 0 . 015 0 . 017 0 . − . − . 001 0 . 016 0 . 018 0 . − . 001 0 . 011 0 . 012 0 . − . − . 001 0 . 010 0 . 011 0 . 97 0 . 000 0 . 007 0 . 007 0 . T E − . − . 036 0 . 158 0 . 175 0 . − . 022 0 . 115 0 . 126 0 . . − . 046 0 . 159 0 . 180 0 . − . 026 0 . 113 0 . 127 0 . . − . 047 0 . 136 0 . 156 0 . − . 027 0 . 097 0 . 109 0 . . − . 044 0 . 115 0 . 131 0 . − . 025 0 . 084 0 . 093 0 . T E − . 07 0 . 021 0 . 139 0 . 147 0 . 97 0 . 015 0 . 117 0 . 114 0 . − . 06 0 . 031 0 . 119 0 . 126 0 . 98 0 . 022 0 . 101 0 . 099 0 . − . 06 0 . 033 0 . 101 0 . 107 0 . 97 0 . 024 0 . 086 0 . 085 0 . − . 05 0 . 033 0 . 088 0 . 093 0 . 97 0 . 024 0 . 075 0 . 074 0 . NOTE: Bias, SE and SEE denote, respectively, the mean bias, empir-ical standard error and mean standard error estimator. CP stands forthe empirical coverage probability of the 95% confidence interval. LambdaM t La m bda M ( t ) n=500n=1000n=2000true 0 2 4 6 8 10 12 . . . . . . . LambdaR t La m bda R ( t ) n=500n=1000n=2000true 0 2 4 6 8 10 12 . . . . . . . LambdaT t La m bda T ( t ) n=500n=1000n=2000true Figure 1. Performance of the estimated baseline cumulative hazard functions. 4. Application We consider application of the proposed methods to a prostate cancer clinical trial. NCICClinical Trials Group PR.3/Medical Research Council PR07/Intergroup T94-0110 is a ran-domized controlled trial of patients with locally advanced prostate cancer. The primary ob-jective is to determine whether the addition of radiotherapy (RT) to androgen-deprivationtherapy (ADT) prolonged overall survival, defined as time from random assignment todeath from any cause. One thousand two hundred and five patients with locally advancedprostate cancer were recruited and randomly assigned between 1995 and 2005, 602 to ADTalone and 603 to ADT + RT. These patients were either with T3-4, N0/Nx, M0 prostatecancer or with T1-2 disease with either prostate-specific antigen (PSA) of more than 40 µ g/Lor PSA of 20 to 40 µ g/L plus Gleason score of 8 to 10. In the final report of the study(Mason et al. 2015), at a median follow-up time of 8 years, 465 patients had died. Overallsurvival was significantly improved in the patients allocated to ADT + RT (hazard ratio0.70 with 95% CI, 0.57 to 0.85; P < .001).In addition to the primary outcome of death, the study also collected data on time to20isease progression, which was defined as the first of any of the following events: biochemicalprogression, local progression, or development of metastatic disease. We analyzed the datato reveal the proportions of the treatment effect on overall survival that are mediated bydisease progression. Particularly, we adjusted for initial PSA level ( < 20 vs. 20 to 50, vs. > U = 1 , ADT + RT is associated witha decreased risk of disease progression, while it is associated with an increased risk fromdisease progression to death. For stratum with U = 3 , ADT + RT is associated with adecreased risk of death. The effects are not significant at 0.05 level. For stratum with U = 1 , a subject with initial PSA level > 50 g/L is associated with significantly increasedrisk of disease progression, compared to a similar subject with initial PSA level < 20 g/L;and a subject with Gleason score 8-10 is associated with significantly decreased risk ofdisease progression, compared to a similar subject with Gleason score < U = 1 , , and 3 are40.1%, 25.7%, and 34.2%, respectively. To verify if the model is reasonable, we estimatedthe stratum-specific survival functions for every subject and summarize the subject-specificsurvival function by weighting them by his/her stratum membership probabilities. Weaverage the estimated survival functions for subjects assigned to ADT+RT versus ADT,and plot them against the survival function estimators from the Kaplan Meier methodsand the proportional hazards model. The results are shown in Figure 2. The estimatedpopulation-average survival functions for ADT+RT and ADT groups are similar to thosefrom the Kaplan Meier methods and the proportional hazards model, especially for time21able 3. Parameter Estimates for Regression Coefficients for Event Time Processes Process U = 1 Health → Disease Disease → DeathEst SEE p -value Est SEE p -valueADT + RT − . 825 0 . 987 0 . 403 0 . 460 0 . 658 0 . Initial PSA Level (20 to 50 g/L) . 321 0 . 530 0 . − . 097 0 . 305 0 . Initial PSA Level ( > 50 g/L) . 607 0 . 566 0 . 005 0 . 065 0 . 342 0 . Gleason Score (8-10) − . 008 0 . 413 0 . − . 378 0 . 241 0 . Process U = 2 , ADTHealth → Disease Disease → DeathEst SEE p -value Est SEE p -valueIntercept − . 917 1 . 587 0 . − . 557 3 . 548 0 . Initial PSA Level (20 to 50 g/L) . 663 1 . 005 0 . − . 304 0 . 863 0 . Initial PSA Level ( > 50 g/L) . 619 0 . 904 0 . − . 104 0 . 940 0 . Gleason Score (8-10) . 674 0 . 952 0 . − . 106 3 . 381 0 . Process U = 2 , ADT + DT U = 3 Health → Death Health → DeathEst SEE p -value Est SEE p -valueIntercept − . 212 7 . 339 0 . − . 446 0 . 875 0 . Initial PSA Level (20 to 50 g/L) . 146 5 . 416 0 . − . 022 0 . 649 0 . Initial PSA Level ( > 50 g/L) . 601 5 . 545 0 . − . 883 0 . 641 0 . Gleason Score (8-10) . 624 4 . 630 0 . − . 514 0 . 486 0 . α α Est SEE p -value Est SEE p -valueIntercept . 205 0 . 441 0 . 643 0 . 291 0 . 743 0 . Initial PSA Level (20 to 50 g/L) . 090 0 . 583 0 . 877 0 . 625 1 . 054 0 . Initial PSA Level ( > 50 g/L) − . 684 0 . 517 0 . 186 0 . 004 0 . 929 0 . Gleason Score (8-10) . 134 0 . 415 0 . − . 495 0 . 732 0 . before 10 years when data are not sparse, indicating proper fit of the proposed approach.Figure 3 shows the estimated marginalized stratum-specific indirect and direct effects(with 95% confidence intervals) for stratum with U = 1 . The estimated natural indirecteffect is positive and increasing over time, and the estimated natural direct effect is slightlynegative over time. However, the 95% confidence intervals are wide such that the stratum-specific natural indirect and direct effects are not significant different from zero. The totaleffect in stratum with U = 1 is positive and increasing over time, corresponding to anincreased survival probability assigned to ADT+RT versus ADT in stratum with U = 1 .It is worth noting that the primary analysis for the data shows that overall survival wassignificantly improved in the patients allocated to ADT + RT compared to ADT. However,our analysis failed to obtain a significant stratum-specific overall effect of ADT + RT. Themain reason is that by identifying subjects to different strata, the sample size to estimateparameters in each stratum is much smaller than that for the proportional hazards modelbased on all available subjects. In addition, the proposed model has much more parameters,such that the variability for parameter estimation significantly increases.23 . . . . . . Time (Year) S u r v i v a l P r obab ili t y ADT + RTADTADT + RT, KMADT, KMADT + RT, PHADT, PH Figure 2. Estimated survival functions from the proposed, Kaplan-Meier, and proportionalhazards model approaches. − . − . . . . . Time (Year) C au s a l E ff e c t l l l ll Natural Indirect EffectNatural Direct EffectTotal Effect Figure 3. Estimated stratum-specific indirect and direct effects in stratum with U = 1 . . Discussion Semi-competing risks data are frequently observed in medical studies, where the termi-nal event time may censor the intermediate event time but not vice versa. To defineand estimate causal contrasts of the effect of a treatment to the terminal and intermediateevents, we introduced a novel principal stratification framework that distinguishes suscepti-ble and non-susceptible subjects given different treatments, and defined the natural indirectand direct effects in the stratum where the times to intermediate and terminal events arewell-defined given both treatments. We provided reasonable assumptions to identify thestratum-specific natural indirect and direct effects, proposed a semiparametric model, andstudied an EM algorithm to obtain the nonparametric maximum likelihood estimators ofmodel parameters. We showed that the estimators are consistent and asymptotically effi-cient estimated under mild regularity conditions, and their performance are satisfactory infinite sample numerical studies.In identifying the stratum-specific natural indirect and direct effects, we assumed thatthere are no subjects who are susceptible to the intermediate event under treatment ( A = 1 )and non-susceptible under control ( A = 0 ). This assumption may need careful examinationbased on scientific understanding of how treatment may affect the intermediate event. Inour data application, we assessed this assumption by fitting the proposed model withswitched treatment indicator labels of ADT+RT and ADT. The estimated probability ofbelonging to stratum with U = 2 (equivalent to the fourth stratum in the original labeling)is very low, suggesting that the assumption on non-existence of the fourth stratum maybe valid. In some applications, this fourth stratum may indeed exist. In the literature ofprincipal stratification for uncensored data with four or more strata, the effect of interestoften can only be interval identified. Interval identification with a regression model oftenresults in a complicated solution manifold, with properties often not well understood. We26lan to explore this problem in a future study. Acknowledgement This manuscript was prepared using data from Dataset NCT00002633-D1 from the NCTNData Archive of the National Cancer Institute (NCI) National Clinical Trials Network(NCTN). Data were originally collected from clinical trial NCT00002633 Phase III Ran-domized Trial Comparing Total Androgen Blockade Versus Total Androgen Blockade PlusPelvic Irradiation in Clinical Stage T3-4, N0, M0 Adenocarcinoma of the Prostate All anal-yses and conclusions in this manuscript are the sole responsibility of the authors and donot necessarily reflect the opinions or views of the clinical trial investigators, the NCTN, orthe NCI. The authors are partially funded by the U.S. National Institutes of Health grantsR01HL122212, U01AG016976 and U.S. National Science Foundation grant DMS 1711952. APPENDIX A Identification of Stratum-Specific EffectsA.1 Proof of Theorem 1 Note that (3) in Assumption 2 implies T ( a, m ) ⊥ A | M ( a ∗ ) = m ∗ , X , U = 1 . (A.1)In stratum U = 1 , for a given t ≤ τ and any a , a ∗ , we have Pr { T ( a, M ( a ∗ )) ≥ t | X = x , U = 1 } = (cid:90) Pr { T ( a, m ) ≥ t | M ( a ∗ ) = m, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) Pr { T ( a, m ) ≥ t | M ( a ∗ ) = m, A = a ∗ , X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) Pr { T ( a, m ) ≥ t | A = a ∗ , X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) Pr { T ( a, m ) ≥ t | A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m ) (cid:90) Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) t Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )+ (cid:90) ∞ t Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )= (cid:90) t Pr { T ( a, m ) ≥ t | M ( a ) = m, A = a, X = x , U = 1 } dF M ( a ∗ ) | X = x ,U =1 ( m )+ Pr( M ( a ∗ ) ≥ t | X = x , U = 1) , where the second equality follows from (A.1), the third and fifth equalities follow from(4) in Assumption 2, and the last equality follows from the fact that T ( a, m ) ≥ t withprobability one for any m ≥ t given U = 1 . Then, following (3) in Assumption 2 andAssumption 1, the proceeding expression is equal to (cid:90) t Pr { T ( a ) ≥ t, M ( a ) = m | A = a, X = x , U = 1 } Pr { M ( a ) = m | A = a, X = x , U = 1 } dF M ( a ∗ ) | A = a ∗ , X = x ,U =1 ( m )+ Pr( M ( a ∗ ) ≥ t | A = a ∗ , X = x , U = 1)= (cid:90) t Pr( T ≥ t | M = m, A = a, X = x , U = 1) dF M | A = a ∗ , X = x ,U =1 ( m )+ Pr( M > t | A = a ∗ , X = x , U = 1) . Then, the natural indirect and direct effects, as defined in equations (1) and (2), are equalto N IE ( t ; x ) = (cid:90) t Pr( T ≥ t | M = m, A = 1 , X = x , U = 1) × (cid:8) dF M | A =1 , X = x ,U =1 ( m ) − dF M | A =0 , X = x ,U =1 ( m ) (cid:9) + Pr( M > t | A = 1 , X = x , U = 1) − Pr( M > t | A = 0 , X = x , U = 1) and N DE ( t ; x ) = (cid:90) t { Pr( T ≥ t | M = m, A = 1 , X = x , U = 1) Pr( T ≥ t | M = m, A = 0 , X = x , U = 1) } dF M | A =0 , X = x ,U =1 ( m ) . For stratum with U = 2 , we have T E ( t ; x ) = Pr( T (1) ≥ t | A = 1 , X = x , U = 2) − Pr( T (0) ≥ t | A = 0 , X = x , U = 2)= Pr( T ≥ t | A = 1 , X = x , U = 2) − Pr( T ≥ t | A = 0 , X = x , U = 2) , where the two equalities follow from Assumptions 2 and 1, respectively. Similarly, forstratum with U = 3 , we have T E ( t ; x ) = Pr( T (1) ≥ t | A = 1 , X = x , U = 3) − Pr( T (0) ≥ t | A = 0 , X = x , U = 3)= Pr( T ≥ t | A = 1 , X = x , U = 3) − Pr( T ≥ t | A = 0 , X = x , U = 3) , A.2 Proof of Theorem 2 We first consider the identification of stratum membership probabilities, as the first stepto identify the stratum-specific effects. By the definition of the strata, we have Pr( M < ∞| A = 0 , X = x ) = Pr( M (0) < ∞| A = 0 , X = x )= Pr( M (0) < ∞ , M (1) < ∞| A = 0 , X = x )+ Pr( M (0) < ∞ , M (1) = ∞| A = 0 , X = x )= Pr( U = 1 | A = 0 , X = x ) + Pr( U = 2 | A = 0 , X = x )= Pr( U = 1 | X = x ) + Pr( U = 2 | X = x ) , where the last equality follows from Assumption 3. Similarly, we have the followingequalites: Pr( M < ∞| A = 1 , X = x ) = Pr( U = 1 | X = x )Pr( M = ∞| A = 0 , X = x ) = Pr( U = 3 | X = x )Pr( M = ∞| A = 1 , X = x ) = Pr( U = 2 | X = x ) + Pr( U = 3 | X = x ) . Pr( U = u | X ) can be identified by theobserved data if there is no censoring and we can observed if M = ∞ . The identificationof the quantities in the presence of censoring is discussed in the end of the section.We further consider the distributions of the (observed) intermediate and primary eventtimes to identify terms in the definition of stratum-specific effects. First, consider the caseif we observe M = m < ∞ for a subject assigned to treatment A = 1 . Since it implies M (1) < ∞ , this subject should have U = 1 with probability one. That is, for any m < ∞ , Pr( M = m | A = 1 , X = x ) = Pr( M = m, M (1) ≤ T (1) | A = 1 , X = x )= Pr( M = m, M (0) ≤ T (0) , M (1) ≤ T (1) | A = 1 , X = x )= Pr( M = m, U = 1 | A = 1 , X = x )= Pr( M = m | U = 1 , A = 1 , X = x ) Pr( U = 1 | A = 1 , X = x )= Pr( M (1) = m | X = x , U = 1) Pr( U = 1 | X = x ) , where the first equality follows from Assumption 1, the second and third equalities followfrom the definition of strata, and last equality follows from Assumption 3. Similarly, wehave for any m ≤ t < ∞ , Pr( T ≥ t | M = m, A = 1 , X = x ) = Pr( T < t, M = m | A = 1 , X = x )Pr( M = m | A = 1 , X = x )= Pr( T ≥ t, M = m, M (1) ≤ T (1) | A = 1 , X = x )Pr( M = m, M (1) ≤ T (1) | A = 1 , X = x )= Pr( T ≥ t, M = m, U = 1 | A = 1 , X = x )Pr( M = m, U = 1 | A = 1 , X = x )= Pr( T (1) ≥ t | M (1) = m, X = x , U = 1) . For the case that we observe M = m < ∞ for a subject assigned to treatment A = 0 ,the subject would possibly have U = 1 or U = 2 , since both strata has M (0) < ∞ . Then,we have Pr( M = m | A = 0 , X = x ) Pr( M = m, M (0) ≤ T (0) | A = 0 , X = x )= Pr( M = m, M (0) ≤ T (0) , M (1) ≤ T (1) | A = 0 , X = x )+ Pr( M = m, M (0) ≤ T (0) , M (1) = ∞| A = 0 , X = x )= Pr( M = m, U = 1 | A = 0 , X = x ) + Pr( M = m, U = 2 | A = 0 , X = x )= Pr( M = m | A = 0 , X = x , U = 1) Pr( U = 1 | X = x )+ Pr( M = m | A = 0 , X = x , U = 2) Pr( U = 2 | X = x )= Pr( M (0) = m | X = x , U = 1) Pr( U = 1 | X = x )+ g (Pr( M (0) = m | X = x , U = 1); x ) Pr( U = 2 | X = x ) , where the last equality follows from Assumption 4, and Pr( T ≥ t | M = m, A = 0 , X = x )= Pr( T ≥ t, M = m | A = 0 , X = x )Pr( M = m | A = 0 , X = x )= Pr( T ≥ t, M = m, M (0) ≤ T (0) | A = 0 , X = x )Pr( M = m | A = 0 , X = x )= (cid:80) u =1 , Pr( T (0) ≥ t | M (0) = m, X = x , U = u ) Pr( M (0) = m | X = x , U = u ) Pr( U = u | X = x ) (cid:80) u =1 , Pr( M = m (0) | X = x , U = u )) Pr( U = u | X = x ) . Then, the natural indirect effect can be presented as N IE ( t ; x ) = (cid:90) t Pr( T ≥ t | M = m, A = 1 , X = x ) × { Pr( M = m | M < ∞ , A = 1 , X = x ) − h ∗ ( m ; x ) } dm + Pr( M ≤ t | M < ∞ , A = 1 , X = x ) − (cid:90) t h ∗ ( m ; x ) dm. (A.2)where h ∗ ( m ; x ) is the solution to the equation Pr( M = m | A = 0 , X = x ) = h ( m ; x ) Pr( M < ∞| A = 1 , X = x )+ g { h ( m ; x ); x } { Pr( M < ∞| A = 0 , X = x ) − Pr( M < ∞| A = 1 , X = x ) } . N DE ( t ; x ) = (cid:90) t { Pr( T ≥ t | M = m, A = 1 , X = x ) − h ∗ ( m, t ; x ) } h ∗ ( m ; x ) dm, (A.3)where h ∗ ( m, t ; x ) is the solution to Pr( T ≥ t, M = m | A = 0 , X = x ) = h ( m, t ; x ) h ∗ ( m ; x ) Pr( M < ∞| A = 1 , X = x )+ g { h ( m, t ; x ); x } g { h ∗ ( m ; x ); x }× { Pr( M < ∞| A = 0 , X = x ) − Pr( M < ∞| A = 1 , X = x ) } . By similar derivations, we have Pr( T ≥ t, M = ∞| A = 1 , X = x )= Pr( T ≥ t | A = 1 , X = x , U = 2) Pr( U = 2 | X = x )+ Pr( T ≥ t | A = 1 , X = x , U = 3) Pr( U = 3 | X = x )= g { Pr( T (1) ≥ t | X = x , U = 2) } Pr( U = 2 | X = x )+ Pr( T (1) ≥ t | X = x , U = 3) Pr( U = 3 | X = x ) , and Pr( T ≥ t, M = ∞| A = 0 , X = x ) = Pr( T (0) ≥ t | X = x , U = 3) Pr( U = 3 | X = x ) . Then, the stratum-specific total effects for strata with U = 2 and U = 3 are T E ( t ; x ) = (cid:90) t g { h ∗ ( m, t ; x ); x } g { h ∗ ( m ; x ); x } dm − g { h ∗ ( t ; x ); x } , (A.4)and T E ( t ; x ) = h ∗ ( t ; x ) − Pr( T ≥ t | M = ∞ , A = 0 , X = x ) , (A.5)where h ∗ ( t ; x ) is the solution to the equation Pr( T = t, M < ∞| A = 1 , X = x ) = h ( t ; x ) Pr( M = ∞| A = 0 , X = x ) g { h ( t ; x ); x } { Pr( M = ∞| A = 1 , X = x ) − Pr( M = ∞| A = 0 , X = x ) } . In the special case that g k ( · ; x ) are identity functions, i.e., the stratum-specific jointdistributions of { M (0) , T } are the same for strata U = 1 and U = 2 given X = x , andthe stratum-specific distributions of T (0) are the same for strata U = 2 and U = 3 given X = x , the functions h ∗ k ’s have closed form h ∗ ( m ; x ) = Pr( M = m | M < ∞ , A = 0 , X = x ) ,h ∗ ( m, t ; x ) = Pr( T ≥ t | M = m, A = 0 , X = x ) ,h ∗ ( t ; x ) = Pr( T < t | M = ∞ , A = 1 , X = x ) . Then, the stratum-specific effects can be identified by N IE ( t ; x ) = (cid:90) t Pr( T ≥ t | M = m, A = 1 , X = x ) { Pr( M = m | M < ∞ , A = 1 , X = x ) − Pr( M = m | M < ∞ , A = 0 , X = x ) } dm + Pr( M ≤ t | M < ∞ , A = 1 , X = x ) − Pr( M ≤ t | M < ∞ , A = 0 , X = x ) ,N DE ( t ; x ) = (cid:90) t { Pr( T ≥ t | M = m, A = 1 , X = x ) − Pr( T ≥ t | M = m, A = 0 , X = x ) }× Pr( M = m | M < ∞ , A = 0 , X = x ) dm,T E ( t ; x ) = (cid:90) t Pr( T ≥ t, M = m | M < ∞ , A = 1 , X = x ) dm − Pr( T ≥ t | M = ∞ , A = 0 , X = x ) , and T E ( t ; x ) = Pr( T ≥ t | M = ∞ , A = 1 , X = x ) − Pr( T ≥ t | M = ∞ , A = 0 , X = x ) . In the presence of censoring, we cannot observe if M = ∞ , such that previous formulacannot be directly applied. However, we are still able to identify the quantities if we assumenon-informative censoring and sufficient follow-up in strata (Assumption 5). Particularly,33e consider the marker process I ( M ≤ t ) along with the event time T . Then, based onan extension of results in Maller and Zhou (1992), the probability Pr( M = ∞| A, X ) canbe consistently estimated by the empirical value of the marker process at the last observedfailure time. By replacing terms related to Pr( M = ∞| A, X ) by their estimators, weidentify the stratum-specific mediation effects and total effects. APPENDIX B Details on EM Algorithm Based on the likelihood function with known U i , we are then able to propose an EMalgorithm treating U i ( i = 1 , . . . , n ) as missing data. In particular, the complete-datalog-likelihood (with known U i for i = 1 , . . . , n ) is given by l n ( β , γ , α , Λ)= n (cid:88) i =1 (cid:34) I ( U i = 1) (cid:40) α T1 (cid:102) X i + ∆ Mi (cid:32) log Λ { Z i } + η T M W i − e η T R W i (cid:88) t l ≤ V i λ l (cid:33) +∆ Mi ∆ Ti (cid:0) log Λ { V i } + η T R W i (cid:1) − (cid:0) − ∆ Ti + ∆ Mi ∆ Ti (cid:1) e η T M W i (cid:88) t l ≤ Z i λ l (cid:41) + I ( A i = 0 , U i = 2) (cid:40) α T2 (cid:102) X i + ∆ Mi (cid:32) log Λ { Z i } + η T M (cid:102) X i − e η T R (cid:102) X i (cid:88) t l ≤ V i λ l (cid:33) +∆ Mi ∆ Ti (cid:16) log Λ { V i } + η T R (cid:102) X i (cid:17) − (cid:0) − ∆ Ti + ∆ Mi ∆ Ti (cid:1) e η T M (cid:102) X i (cid:88) t l ≤ Z i λ l (cid:41) + I ( A i = 1 , U i = 2)(1 − ∆ Mi ) (cid:40) α T2 (cid:102) X i + ∆ Ti (cid:16) log Λ { Y i } + η T T (cid:102) X i (cid:17) − e η T T (cid:102) X i (cid:88) t l ≤ Y i λ l (cid:41) + I ( U i = 3)(1 − ∆ Mi ) (cid:40) ∆ Ti (cid:0) log Λ { Y i } + η T T W i (cid:1) − e η T T W i (cid:88) t l ≤ Y i λ l (cid:41) − log (cid:110) (cid:16) α T1 (cid:102) X i (cid:17) + exp (cid:16) α T2 (cid:102) X i (cid:17)(cid:111) (cid:35) . 34n the E-step of the EM algorithm, we evaluate the conditional expectation U i for subjects i = 1 , . . . , n . In particular, (cid:99) Pr( U i = 1) =∆ Mi (cid:26) I ( A i = 1) + I ( A i = 0) B i B i + B i (cid:27) + (1 − ∆ Mi )(1 − ∆ Ti ) D i D i + D i + D i (cid:99) Pr( U i = 2) =∆ Mi I ( A i = 0) B i B i + B i + (1 − ∆ Mi )∆ Ti I ( A i = 1) C i C i + C i + (1 − ∆ Mi )(1 − ∆ Ti ) D i D i + D i + D i (cid:99) Pr( U i = 3) =(1 − ∆ Mi )∆ Ti (cid:26) I ( A i = 0) + I ( A i = 1) C i C i + C i (cid:27) + (1 − ∆ Mi )(1 − ∆ Ti ) D i D i + D i + D i where B i = exp (cid:40) α T1 (cid:102) X i + η T M W i + ∆ Ti (cid:0) η T R W i (cid:1) − e η T M W i (cid:88) t l ≤ Z i λ l − e η T R W i (cid:88) t l ≤ V i λ l (cid:41) ,B i = exp (cid:40) α T2 (cid:102) X i + η T M (cid:102) X i + ∆ Ti (cid:16) η T R (cid:102) X i (cid:17) − e η T M (cid:102) X i (cid:88) t l ≤ Z i λ l − e η T R (cid:102) X i (cid:88) t l ≤ V i λ l (cid:41) ,C i = exp (cid:32) α T2 (cid:102) X i + η T T (cid:102) X i − e η T T (cid:102) X i (cid:88) t l ≤ Y i λ l (cid:33) ,C i = exp (cid:32) η T T W i − e η T T W i (cid:88) t l ≤ Y i λ l (cid:33) ,D i = exp (cid:32) α T1 (cid:102) X i − e η T M W i (cid:88) t l ≤ Z i λ l (cid:33) ,D i = exp (cid:16) α T2 (cid:102) X i (cid:17) (cid:40) I ( A i = 0) exp (cid:32) − e η T M (cid:102) X i (cid:88) t l ≤ Z i λ l (cid:33) + I ( A i = 1) exp (cid:32) − e η T T (cid:102) X i (cid:88) t l ≤ Y i λ l (cid:33)(cid:41) , and D i = exp (cid:32) − e η T T W i (cid:88) t l ≤ Y i λ l (cid:33) . 35n the M-step of the EM algorithm, we maximize the conditional expectation of thecomplete-data log-likelihood function. In particular, we update Λ , Λ , and Λ by λ l = (cid:80) ni =1 ∆ Mi I ( Z i = t l ) (cid:80) ni =1 (1 − ∆ Ti + ∆ Mi ∆ Ti ) I ( Z i ≥ t l ) S i ,λ l = (cid:80) ni =1 ∆ Mi ∆ Ti I ( V i = t l ) (cid:80) ni =1 ∆ Mi I ( V i ≥ t l ) S i ,λ l = (cid:80) ni =1 (1 − ∆ Mi )∆ Ti I ( Y i = t l ) (cid:80) ni =1 (1 − ∆ Mi ) I ( Y i ≥ t l ) S i , where S i = (cid:99) Pr( U i = 1) e η T M W i + (cid:99) Pr( U i = 2) I ( A i = 0) e η T M (cid:102) X i ,S i = (cid:99) Pr( U i = 1) e η T R W i + (cid:99) Pr( U i = 2) I ( A i = 0) e η T R (cid:102) X i ,S i = (cid:99) Pr( U i = 2) I ( A i = 1) e η T T (cid:102) X i + (cid:99) Pr( U i = 3) e η T T W i . We update η M by solving n (cid:88) i =1 ∆ Mi (cid:40)(cid:99) Pr( U i = 1) W i − (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) (cid:99) Pr( U j = 1) e η T M W j W j (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) S j (cid:41) = , and update η M by solving n (cid:88) i =1 ∆ Mi (cid:110)(cid:99) Pr( U i = 2) I ( A i = 0) (cid:102) X i − (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) (cid:99) Pr( U j = 2) I ( A j = 0) e β M + γ T M X j (cid:102) X j (cid:80) nj =1 I ( Z j ≥ Z i )(1 − ∆ Tj + ∆ Mj ∆ Tj ) S j (cid:41) = . We update η R by solving n (cid:88) i =1 ∆ Mi ∆ Ti (cid:40)(cid:99) Pr( U i = 1) W i − (cid:80) nj =1 I ( R j ≥ V i )∆ Mj (cid:99) Pr( U j = 1) e η T R W j W j (cid:80) nj =1 I ( R j ≥ V i )∆ Mj S j (cid:41) = , and update η R by solving n (cid:88) i =1 ∆ Mi ∆ Ti (cid:110)(cid:99) Pr( U i = 2) I ( A i = 0) (cid:102) X i (cid:80) nj =1 I ( R j ≥ V i )∆ Mj (cid:99) Pr( U j = 2) I ( A j = 0) e η T R (cid:102) X j (cid:102) X j (cid:80) nj =1 I ( R j ≥ V i )∆ Mj S j (cid:41) = . We update η T by solving n (cid:88) i =1 (1 − ∆ Mi )∆ Ti (cid:110)(cid:99) Pr( U i = 2) I ( A i = 1) (cid:102) X i − (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) (cid:99) Pr( U j = 2) I ( A j = 1) e η T T (cid:102) X j (cid:102) X j (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) S j (cid:41) = , and update η T by solving n (cid:88) i =1 (1 − ∆ Mi )∆ Ti (cid:110)(cid:99) Pr( U i = 3) W i − (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) (cid:99) Pr( U j = 1) e η T T W j W j (cid:80) nj =1 I ( Y j ≥ Y i )(1 − ∆ Mj ) S j (cid:41) = . Finally, we update α by solving n (cid:88) i =1 (cid:99) Pr( U i = 1) − exp (cid:16) α T1 (cid:102) X i (cid:17) (cid:16) α T1 (cid:102) X i (cid:17) + exp (cid:16) α T2 (cid:102) X i (cid:17) (cid:102) X i = , n (cid:88) i =1 (cid:99) Pr( U i = 2) − exp (cid:16) α T2 (cid:102) X i (cid:17) (cid:16) α T1 (cid:102) X i (cid:17) + exp (cid:16) α T2 (cid:102) X i (cid:17) (cid:102) X i = . Starting with θ = and λ kl = 1 /m k for k = 1 , , , we iterate between the E-step andthe M-step until convergence to obtain the nonparametric maximum likelihood estimators ( (cid:98) θ , (cid:98) A ) . REFERENCES Aalen, O. O., Stensrud, M. J., Didelez, V., Daniel, R., Røysland, K., and Strohmaier, S.(2020), “Time-dependent mediators in survival analysis: Modeling direct and indirecteffects with the additive hazards model,” Biometr. J. , 62, 532–549.37ngrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), “Identification of causal effectsusing instrumental variables,” J. Am. Statist. Ass. , 91, 444–455.Comment, L., Mealli, F., Haneuse, S., and Zigler, C. (2019), “Survivor average causaleffects for continuous time: a principal stratification approach to causal inference withsemicompeting risks,” arXiv preprint arXiv:1902.09304 , .Copelan, E. A., Biggs, J. C., Thompson, J. M., Crilley, P., Szer, J., Klein, J. P., Kapoor, N.,Avalos, B. R., Cunningham, I., and Atkinson, K. (1991), “Treatment for Acute MyelocyticLeukemia with Allogeneic Bone Marrow Transplantation Following Preparation withBuCy2,” Blood , 78, 838–843.Didelez, V. (2019), “Defining causal mediation with a longitudinal mediator and a survivaloutcome,” Lifetime Data Anal. , 25, 593–610.Fine, J. P., Jiang, H., and Chappell, R. (2001), “On Semi-competing Risks Data,” Biometrika , 88, 907–919.Frangakis, C. E., and Rubin, D. B. (2002), “Principal stratification in causal inference,” Biometrics , 58, 21–29.Huang, Y.-T. (2020), “Causal mediation of semicompeting risks,” Biometrics , .Klein, J. P., and Moeschberger, M. L. (2006), Survival Analysis: Techniques for Censoredand Truncated Data , New York: Springer.Lange, T., and Hansen, J. V. (2011), “Direct and Indirect Effects in a Survival Context,” Epidemiol. , 22, 575–581.Lange, T., Vansteelandt, S., and Bekaert, M. (2012), “A Simple Unified Approach forEstimating Natural Direct and Indirect Effects,” Am. J. Epidemiol. , 176, 190–195.38in, D., Sun, W., and Ying, Z. (1999), “Nonparametric estimation of the gap time distri-bution for serial events with censored data,” Biometrika , 86, 59–70.Lin, S.-H., Young, J. G., Logan, R., and VanderWeele, T. J. (2017), “Mediation analysis fora survival outcome with time-varying exposures, mediators, and confounders,” Statisticsin medicine , 36, 4153–4166.Maller, R. A., and Zhou, S. (1992), “Estimating the proportion of immunes in a censoredsample,” Biometrika , 79, 731–739.Mason, M. D., Parulekar, W. R., Sydes, M. R., Brundage, M., Kirkbride, P., Gospodarow-icz, M., Cowan, R., Kostashuk, E. C., Anderson, J., Swanson, G. et al. (2015), “Finalreport of the intergroup randomized study of combined androgen-deprivation therapyplus radiotherapy versus androgen-deprivation therapy alone in locally advanced prostatecancer,” J. Clin. Oncol. , 33, 2143–2150.Tchetgen Tchetgen, E. J. (2011), “On Causal Mediation Analysis with a Survival Outcome,” Int. J. Biostat. , 7, 1–38.VanderWeele, T. J. (2011), “Causal Mediation Analysis with Survival Data,” Epidemiol. ,22, 582–585.Vansteelandt, S., Linder, M., Vandenberghe, S., Steen, J., and Madsen, J. (2019), “Medi-ation analysis of time-to-event endpoints accounting for repeatedly measured mediatorssubject to time-varying confounding,” Statistics in medicine , 38, 4828–4840.Wang, W., and Wells, M. T. (1998), “Nonparametric estimation of successive durationtimes under dependent censoring,” Biometrika , 85, 561–572.Xu, J., Kalbfleisch, J. D., and Tai, B. (2010), “Statistical Analysis of Illness–death Processesand Semicompeting Risks Data,” Biometrics , 66, 716–725.39u, W., Chen, K., Sobel, M. E., and Ying, Z. (2015), “Semiparametric transformationmodels for causal inference in time-to-event studies with all-or-nothing compliance,” J.R. Statist. Soc. B , 77, 397–415.Zhang, J. L., and Rubin, D. B. (2003), “Estimation of causal effects via principal stratifica-tion when some outcomes are truncated by “death”,” J. Educ. Behav. Stat. , 28, 353–368.Zheng, W., and van der Laan, M. (2017), “Longitudinal Mediation Analysis with Time-varying Mediators and Exposures, with Application to Survival Outcomes,”