Multi-stage optimal dynamic treatment regimes for survival outcomes with dependent censoring
Hunyong Cho*, Shannon T. Holloway†, and Michael R. Kosorok‡

December 8, 2020
Abstract
We propose a reinforcement learning method for estimating an optimal dynamic treatment regime for survival outcomes with dependent censoring. The estimator allows the treatment decision times to be dependent on the failure time and conditionally independent of censoring, supports a flexible number of treatment arms and treatment stages, and can maximize either the mean survival time or the survival probability at a certain time point. The estimator is constructed using generalized random survival forests, and its consistency is shown using empirical process theory. Simulations and leukemia data analysis results suggest that the new estimator brings higher expected outcomes than existing methods in various settings. An R package, dtrSurv, is available on CRAN.
Keywords:
Precision medicine; Dynamic treatment regime; Survival analysis; Reinforcement learning; Random forest; Conditionally independent censoring; Empirical process.

*Department of Biostatistics, University of North Carolina at Chapel Hill, [email protected]
†Department of Statistics, North Carolina State University
‡Department of Biostatistics, Department of Statistics and Operations Research, University of North Carolina at Chapel Hill

1 Introduction
Multi-stage treatments are becoming more prevalent in medicine. Chronic conditions such as cancer, auto-immune disease, HIV, and heart disease often require multiple treatments over a long period. For instance, cancer patients may receive multiple rounds of chemo- and/or biological therapies (Habermann et al., 2006; Huang et al., 2020). Auto-immune disease patients might receive additional treatments in response to relapses (Edwards & Cambridge, 2006; Hogan & Radhakrishnan, 2013). Traditionally, such chronic diseases have been treated following a one-size-fits-most approach, by which the standard of care is selected as the treatment or treatment plan that will likely benefit most patients with a similar condition, i.e., the treatment that is optimal only in the sense of the marginal effects.

However, in recent years, a more personalized approach to the treatment of chronic diseases has become of interest. Under this approach, patients benefit from treatment rules that incorporate the patient heterogeneity of treatment effects, so that treatment decisions are based on the individual characteristics of each patient. Dynamic treatment regimes formalize this data-driven approach to treatment by optimizing the overall outcome of interest and providing treatment suggestions that are based on the patient's information available at each point of the decision-making process (Murphy, 2003; Kosorok & Moodie, 2015; Kosorok & Laber, 2019; Tsiatis et al., 2019).
In part because they utilize all available information, such optimized dynamic treatment regimes are often found to be more beneficial than a traditional set of fixed treatment plans (Kidwell, 2015). Further, because optimal dynamic treatment regimes focus on the overall or long-term outcome, they are more advantageous than applying multiple single-stage treatment rules that only optimize the stage-level outcomes.

Finding an optimal dynamic treatment regime is a reinforcement learning problem (Sutton & Barto, 2018), wherein an action by an agent, or a physician, changes both the immediate reward and the environment of the next stage, and the learner, or the statistician, searches for the rule of actions that brings the most overall reward. Like many statistical learning methods, reinforcement learning problems are often solved using the Q-learning algorithm (Watkins, 1989; Zhu et al., 2015; Zhao et al., 2011; Zhang et al., 2017; Qian & Murphy, 2011), which defines the value as the expected cumulative sum of discounted rewards and optimizes the set of rules using backward recursion.

In precision medicine applications, the rewards are often defined in the context of survival outcomes, such as mean survival time or survival probabilities, which pose unique challenges. For survival outcomes, the failure time is often censored, and any approach should take into account the missing information. Several methods have been developed for this purpose, for example, redistribution of the failure probability as in the Kaplan-Meier estimator (Efron, 1967; Robertson & Uppuluri, 1984) or inverse probability weighting (Robins et al., 1994; Robins & Rotnitzky, 1992; Wahed & Tsiatis, 2006; Orellana et al., 2010). Further, the treatment timing may not be fixed and may even depend on a patient's status or previous treatments having association with the failure time.
In such scenarios, the three events at each stage are failure, proceeding to the next treatment, and censoring, which can be restated in a competing risks framework, in which the three dependent events compete and only the earliest event is observed. However, the full joint distribution of the events is not identifiable without knowledge of their dependence structure (Tsiatis, 1975). Thus, the dependency between the events needs to be carefully considered. For example, a cancer patient whose physical condition has improved has a longer expected survival time and at the same time may want to take another round of chemotherapy earlier than expected. If this is the case, the survival time and the treatment time have a negative association.

There are two additional challenges that arise in the analysis of survival outcomes that must be addressed. First, the total number of treatments received may vary among patients. For example, some patients may experience the failure or be censored before taking all planned treatments. Second, there could be different optimization criteria for survival times. For some patients and clinicians, the average survival time is of primary interest, while the six-month survival probability may be of greater concern for others. A method that can address different types of endpoints would be useful in practice.

Several methods regarding the estimation of dynamic treatment regimes for survival outcomes have been proposed in the literature. Goldberg & Kosorok (2012) first developed a dynamic treatment regime estimator for survival outcomes using a Q-learning framework. To address the problems introduced by a different number of treatments as well as by censoring, the authors proposed modifying the survival data so that each observation has the same number of treatment stages without missing values and showed that the value of a policy in the original problem can be represented as an expectation of the modified data.
After the data modification, the problem was put into a standard Q-learning framework using inverse probability weighting to account for censoring. However, censoring was assumed to be completely independent of all covariates and event times. The estimated Q-function and the resulting decision rules would thus be subject to bias for data with dependent censoring.

Huang et al. (2014) addressed a similar problem using backward recursion. Their motivating problem was a recurrent disease clinical trial where patients receive an initial treatment followed by a salvage treatment if patients experience either treatment resistance or relapse. Simoneau et al. (2019) extended the dynamic treatment regime estimator for continuous outcomes via weighted least squares (Wallace & Moodie, 2015) into the censored time-to-event data setting. Both of these methods use the accelerated failure time model for the failure time distribution, which carries a risk of model misspecification. Thus, a more flexible model is needed to allow for more relaxed distributional assumptions. Wahed & Thall (2013) proposed a dynamic treatment regime estimator using full specification of the likelihood, and Xu et al. (2016) developed a Bayesian alternative, where the disease progression dynamic was modeled using a dependent Dirichlet process prior with a Gaussian process measure. While the latter approach provides a more flexible modeling framework than the former, the class of regimes for both methods is confined to a fixed number of pre-defined treatment sequences.

Each of the existing methods described above has its limitations. First, not all methods allow for a flexible number of treatment stages and/or treatment levels. For example, the methods in Wahed & Thall (2013) and Huang et al. (2014) were designed for a two-stage decision problem, and the Simoneau et al. (2019) approach is restricted to binary treatments. Second, the outcomes of interest in most methods are limited to the mean survival time. Jiang et al. (2017) proposed an optimal dynamic treatment regime estimator that maximizes the t-time survival probability. However, none of the methods except Wahed & Thall (2013) permit choosing the criterion. Third, parametric models or strong structural assumptions are employed for all of these methods, with the exception of Goldberg & Kosorok (2012). Strong modeling assumptions, such as the accelerated failure time model or the proportional hazards models used by the existing methods, are subject to model misspecification, may limit the policy class, and, consequently, may incur a loss in the value of the optimal policy. Finally, the censoring assumptions of Goldberg & Kosorok (2012) are restrictive. Censoring is associated with failure time in many medical applications, but it is often reasonable to assume independence of censoring after conditioning on patient historical information.

In this manuscript, we develop a general dynamic treatment regime estimator for censored time-to-failure outcomes that addresses all of the limitations discussed for the existing estimators. Our estimator maximizes either the mean survival time or the survival probability at a certain time using backward recursion. As the objective may not be expressed as the sum of intermediate rewards, a standard Q-learning algorithm is not applicable. Instead, the conditional survival probability information, rather than the summarized Q-values, is appended to the previous stage information. A generalized random survival forest is developed for this task, where survival curves for each individual, instead of the observed survival or censoring time, are fed into the random survival forest. A general implementation of our method is available on CRAN in the R package dtrSurv.

Table 1: Assumptions of the existing and the proposed methods

method                     | Q      | |A(q)| | criterion      | T   | C           | {π}
the new method             | finite | finite | E[T ∧ τ], S(t) | NP  | CI          | flexible
Goldberg & Kosorok (2012)  | finite | finite | E[T ∧ τ]       | NP  | independent | flexible
Huang et al. (2014)        | 2+     | 2+     | E[T]           | AFT | CI          | linear
Jiang et al. (2017)        | finite | 2      | S(t)           | PH  | CI          | linear
Simoneau et al. (2019)     | finite | 2      | E[T]           | AFT | CI          | linear
Wahed & Thall (2013)       | finite | finite | E[T], S(t)     | AFT | CI          | fixed
Xu et al. (2016)           | finite | finite | E[T]           | BNP | CI          | fixed

Q, the number of stages; |A(q)|, the number of treatment arms at stage q; criterion, the target value to be optimized; T, failure time model; C, censoring assumption; {π}, policy class; 2+, can be generalized to more than two; NP, non-parametric; AFT, accelerated failure time; PH, proportional hazards; BNP, Bayesian non-parametric; CI, conditional independence; fixed, a fixed number of distinct treatment rules.

The key contributions of our work are the flexibility of the proposed method and the exposition of its theoretical properties. The proposed method affords significantly more flexibility than existing methods, as summarized in Table 1. The target value can be either the mean survival or the survival probability at a certain time. The method allows for a flexible number of treatment stages as well as a flexible number of treatment arms at each stage. Censoring can be conditionally independent. The conditional survival probability is modeled nonparametrically using a random survival forest-based algorithm. And finally, the proposed method does not require estimating the joint distribution of failure time and treatment time, nor does it require any assumptions on their dependency, though these are allowed. We further show the consistency of the random survival forests and the estimated regime values.

Before we give an outline of the paper, we briefly review random survival forests, an important component of the proposed method. A random survival forest is a nonparametric survival regression estimator that provides weighted Kaplan-Meier curves for each point in the feature space with weights given by adaptive kernel estimates.
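To make the "weighted Kaplan-Meier with adaptive kernel weights" description concrete, the following is a minimal sketch in Python (for illustration only; the paper's implementation is the R package dtrSurv). The weight vector w is a stand-in for the adaptive kernel weights that a fitted forest would place on each training subject; with uniform weights the estimator reduces to the ordinary Kaplan-Meier curve.

```python
import numpy as np

def weighted_km(x, delta, w, t):
    """Weighted Kaplan-Meier estimate of S(t).
    x: observed times, delta: event indicators (1 = failure),
    w: nonnegative weights on training subjects (kernel stand-in)."""
    surv = 1.0
    for s in np.unique(x[x <= t]):                    # distinct observed times up to t
        at_risk = w[x >= s].sum()                     # weighted number at risk at s
        deaths = w[(x == s) & (delta == 1)].sum()     # weighted failures at s
        if at_risk > 0:
            surv *= 1.0 - deaths / at_risk
    return surv

x = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 0, 1])
w = np.ones(3) / 3                       # uniform weights: ordinary Kaplan-Meier
print(weighted_km(x, delta, w, 2.5))     # -> 0.666... (one failure among three at risk)
```

Concentrating all weight on a single subject collapses the curve to that subject's step function, which is the sense in which a forest prediction is a locally weighted Kaplan-Meier curve.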
In general, a random forest is an ensemble of trees, each of which is generated by recursively partitioning the data into two daughter nodes. In the Breiman et al. (1984) Classification and Regression Trees, or CART, approach, partitions are made by searching across all variables and data midpoints, and the best cut point that maximizes the heterogeneity between the resulting daughter nodes based on the mean differences is taken. In Breiman (2001) random forests, Breiman et al. (1984) trees are built and averaged based on multiple bootstrap samples, considering only a random subset of the variables when splitting each node. This randomization, namely bootstrapping and random partitioning, brings diversity among trees, and thus the variability of the estimators can be less than that obtained in the CART approach. Geurts et al. (2006) added an alternative randomization mode by restricting the candidate threshold values to a randomly chosen cut point for each variable.

Survival trees (random survival forests) generalize trees (random forests) by allowing censored outcomes as inputs and providing covariate-conditional survival probability estimates as outputs. See Gordon & Olshen (1985) and Segal (1988) for survival trees and Hothorn et al. (2005), Ishwaran et al. (2008), and Zhu & Kosorok (2012) for random survival forests. Due to censoring, heterogeneity between two nodes cannot be ascertained by a simple mean; alternative tests such as the log-rank test (Peto & Peto, 1972; Gehan, 1965; Tarone & Ware, 1977) must be used. At each terminal node of a survival tree, the survival probability is estimated by the Kaplan-Meier procedure. A random survival forest is obtained by averaging the survival trees.

The remainder of the manuscript is organized as follows. In Section 2, we give notation and describe the dynamic treatment rule estimator. Consistency of the estimator is derived in Section 3, and its performance is illustrated through simulations in Section 4.
Section 5 is devoted to the application of the proposed method to a two-stage leukemia clinical trial. We conclude the paper by discussing future research directions in Section 6.

2 The proposed method

2.1 Notation

We assume that each of n patients can have a maximum of Q visits, or treatment stages, with a maximum study length τ > 0. At the q-th stage, q = 1, 2, ..., Q, the i-th patient, among the n(q) ≤ n available patients, receives treatment A(q)_i ∈ A(q) if he or she has survived and has not dropped out by the beginning of the stage, where A(q) is the finite treatment space for the q-th stage. Throughout this manuscript, we often drop the subject index i when it does not cause confusion. Our interest lies in estimating the survival distribution S_π of the overall failure time T of patients if they followed a dynamic treatment regime π = (π(1), π(2), ..., π(Q))⊤, and then finding the 'best' rule π*, which is the rule that optimizes a certain criterion φ. For φ, we consider φ_µ(S) = −∫_{t>0} (t ∧ τ) dS(t) and φ_{σ,t}(S) = S(t) for some 0 < t ≤ τ. Without loss of generality, we assume that a prolonged time to event is preferred; thus the objective is to maximize φ(S) over the regimes.

We allow the time to next treatment U(q)_i of the i-th patient to depend on the historical information H(q)_i ∈ H(q) available at the beginning of the q-th stage and the treatment assignment A(q)_i. At the q-th stage, we observe either the i-th patient's failure, advancement to the next treatment, or censoring. The times from the beginning of the stage to each of these events are denoted as T(q)_i, U(q)_i, and C(q)_i, respectively. The length of the stage, X(q)_i, for the i-th patient is defined by the minimum of these times, and the hypothetically uncensored stage length is defined as V(q)_i = T(q)_i ∧ U(q)_i. And, if q = Q, V(Q)_i = T(Q)_i.
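The two candidate criteria φ_µ and φ_{σ,t} defined above can be evaluated from any estimated survival curve. The following is a small numeric sketch in Python (illustrative only; the function names and the step-curve representation are ours), using the identity E[T ∧ τ] = ∫_0^τ S(t) dt for a nonnegative failure time:

```python
import numpy as np

def truncated_mean(times, surv, tau):
    """phi_mu: E[T ^ tau] = integral_0^tau S(t) dt for a right-continuous
    step survival curve with sorted jump times `times` and post-jump
    values `surv`; S(t) = 1 before the first jump."""
    keep = times < tau
    grid = np.concatenate(([0.0], times[keep], [tau]))
    vals = np.concatenate(([1.0], surv[keep]))   # S on [grid[k], grid[k+1])
    return float(np.sum(vals * np.diff(grid)))

def surv_prob(times, surv, t0):
    """phi_sigma_t: S(t0) for the same step curve."""
    idx = np.searchsorted(times, t0, side="right") - 1
    return 1.0 if idx < 0 else float(surv[idx])

# hypothetical curve: failure time uniform on {1, 2, 3}
times = np.array([1.0, 2.0, 3.0])
surv = np.array([2/3, 1/3, 0.0])
print(truncated_mean(times, surv, tau=2.5))   # E[T ^ 2.5] = (1 + 2 + 2.5)/3 = 11/6
print(surv_prob(times, surv, 2.0))            # S(2) = 1/3
```

The optimal rule at a given history simply compares these scalar summaries across the per-arm survival curve estimates.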
We denote the q-th stage censoring and treatment indicators as δ(q)_i = 1(V(q)_i ≤ C(q)_i) and γ(q)_i = 1(T(q)_i ≤ U(q)_i), respectively. Note that when C(q)_i < V(q)_i, γ(q)_i is not observable. The baseline time at stage q is defined as B(q)_i = Σ_{k=1}^{q−1} X(k)_i for q > 1, and B(1)_i = 0. Let T_i, C_i, X_i = T_i ∧ C_i, and δ_i = 1(T_i ≤ C_i) denote the overall failure time, overall censoring time, overall observed time, and overall censoring indicator, respectively. We denote Z(q)_i ∈ Z(q) as the covariate information of the i-th patient that is newly available at the beginning of the q-th stage. Thus, H(q)_i may include the historical information such as A(k)_i, B(k)_i, Z(k)_i, k = 1, 2, ..., q−1, and Z(q)_i, where d(q) denotes the dimension of H(q)_i for all i = 1, 2, ..., n(q) and q = 1, 2, ..., Q. The number n(q) of patients eligible for treatment assignment at the beginning of stage q is Σ_{i=1}^n 1(U(q−1)_i ≤ T(q−1)_i ∧ C(q−1)_i). The notation is summarized in Table 3 in Appendix 1.

2.2 The proposed estimator

As is the case in Q-learning, our estimator optimizes the intermediate stage decision assuming that all the later decisions are optimal, and thus backward recursion is used. Our method is based on the observation that the remaining life L(q) at stage q is defined recursively as a convolution of L(q+1) and the current stage length V(q), and that

S(q)(t | H(q), A(q), δ(q) = 1) = ∫ S(q+1)(t − V(q) | H(q+1), A(q+1), δ(q) = 1) dP(V(q), H(q+1), A(q+1) | H(q), A(q), δ(q) = 1),   (1)

for q = Q−1, Q−2, ..., 1, where S(q)(t | ·) = 1 for all t < 0 and S(q)(t | γ(q−1) = 1) = 1(t < 0).
In other words, assuming no censoring events, the survival probability S(q) of the remaining life L(q) at stage q can be obtained by augmenting the survival probability of the remaining life L(q+1) of the next stage by the current stage length V(q) and then marginalizing over all events that occur later than the beginning of stage q. Under the covariate-independent censoring assumption, the contribution of the censored cases is incorporated into the equation by redistributing the probability to the right (Efron, 1967):

S(q)(t | H(q), A(q)) = ∫ [ δ(q) S(q+1)(t − X(q) | H(q+1), A(q+1)) + (1 − δ(q)) { 1(X(q) > t) + 1(X(q) ≤ t) · S(q)(t | H(q+1), A(q+1)) / S(q)(X(q) | H(q+1), A(q+1)) } ] dP(X(q), H(q+1), A(q+1), δ(q) | H(q), A(q)).   (2)

Assuming the standard causal inference assumptions, which we will formally introduce in Subsection 3.2 (Assumptions 8–10), we can translate S(q)(t | H(q), A(q) = a) into the counterfactual quantity S(q)_a(t | H(q)). Thus, the regime is estimated using the following procedure: 1. estimating the conditional survival distribution S(q)(t | H(q), A(q)) of the remaining life L(q) for q = Q; 2. optimizing the survival distribution over A(q) to obtain π̂(q) and Ŝ(q)*(t | H(q)) for q = Q; 3. augmenting the estimated optimal curves by the previous stage lengths, i.e., Ŝ(q)*(t − X(q−1) | H(q)) for q = Q; and 4. repeating steps 1–3 for q = Q−1, Q−2, ..., 1. This procedure is summarized at a high level in Algorithm 1.

Result: A dynamic treatment regime estimate π̂ = (π̂(1), π̂(2), ..., π̂(Q))⊤
for stage q = Q, Q−1, ..., 1 do
    Obtain Ŝ(q)(· | H(q), A(q)) via generalized random survival forest (Algorithm 2);
    Obtain π̂(q)(h) = arg max_{a ∈ A(q)} φ(q){Ŝ(q)(· | H(q) = h, A(q) = a)};
    Obtain Ŝ(q)*(· | H(q)) = Ŝ(q)(· | H(q), A(q) = π̂(q)(H(q)));
end
Algorithm 1: the proposed dynamic treatment regime estimator

At the beginning of the last stage, there are n(Q) patients who have neither experienced failure nor been censored. The data for these patients are used to estimate the terminal stage covariate-conditional survival probability S(Q)(· | H(Q), A(Q)) for each treatment arm using the random survival forests described in detail in Subsection 2.4. Then the optimal regime for the final stage is estimated by π̂(Q) = arg max_A φ(Q){Ŝ(Q)(·, A)}, where Ŝ is the estimate of S and φ(q), q = 1, 2, ..., Q, is defined adaptively for the intermediate and terminal survival probabilities. Specifically, φ(q)_µ{S(q)(· | H(q))} = −∫_{t>0} {t ∧ (τ − B(q))} dS(q)(t | H(q)) and φ(q)_{σ,t}{S(q)(· | H(q))} = S(q)(t − B(q) | H(q)).

We denote the estimated optimized survival distribution of the q-th stage as Ŝ(q)* = Ŝ(q){·, π̂(q)(·)}. The optimized survival information of the final stage is carried back to the previous stage. Specifically, the i-th patient who is available at the (Q−1)st stage and has a stage length X(Q−1)_i is given a probabilistic augmentation so that the resulting survival probability of remaining life at the second to the last stage is Ŝ(Q)*(t − X(Q−1)_i | H(Q)_i).
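The backward loop of Algorithm 1 can be sketched in a few lines. The following Python skeleton is illustrative only (the actual implementation is the R package dtrSurv): the generalized random survival forest fit and the stochastic augmentation between stages are replaced by hypothetical, already-fitted per-arm conditional survival models on a time grid.

```python
import numpy as np

TAU = 10.0
GRID = np.linspace(0.0, TAU, 201)

def phi_mu(surv):
    """Truncated-mean criterion E[T ^ tau] = integral_0^tau S(t) dt,
    computed by the trapezoidal rule on GRID."""
    return float(np.sum((surv[:-1] + surv[1:]) / 2.0 * np.diff(GRID)))

def backward_regime(stage_models):
    """Toy skeleton of Algorithm 1.  `stage_models` lists, from stage Q
    down to stage 1, a dict mapping each arm a to a fitted conditional
    survival model h -> S(. | h, a) on GRID.  The real estimator refits
    a generalized random survival forest at every stage and carries the
    optimized, stage-length-shifted curves backward; both are elided."""
    rules = []
    for arm_models in stage_models:              # q = Q, Q-1, ..., 1
        def rule(h, arm_models=arm_models):
            values = {a: phi_mu(m(h)) for a, m in arm_models.items()}
            return max(values, key=values.get)   # argmax_a phi{S(.|h,a)}
        rules.append(rule)
    return rules[::-1]                           # reorder as (pi_1, ..., pi_Q)

# hypothetical single-stage example: exponential survival curves whose
# rate depends on the arm and a scalar covariate h
expo = lambda rate: (lambda h: np.exp(-rate * (1.0 + h) * GRID))
pi_hat = backward_regime([{0: expo(0.5), 1: expo(0.2)}])
print(pi_hat[0](0.3))   # arm 1 has the longer truncated mean -> 1
```

The per-patient argmax over arms is the only stage-level optimization; everything else in the procedure is estimation of the curves being compared.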
For those who have already experienced failure or censoring by the end of the stage, no augmentation is required, as either their remaining life has already been determined during the stage or the censoring indicator and the stage length contain all the necessary information. Next, the conditional distribution of remaining life from the beginning of the (Q−1)st stage is estimated by the random survival forests based on the n(Q−1) patients. The distinction between the last stage and all the other stages is that, given data, a patient's remaining life conditioning on the current historical information is observed in a scalar form at the last stage, while the remaining life for the other stages is not observed but is given in the form of a stochastic process. So a modification to the random survival forest estimator is required, which is discussed in Subsection 2.4. Then estimation of S(q), π(q), and S(q)* is recursively performed until we reach the first stage and obtain π̂ = (π̂(1), π̂(2), ..., π̂(Q))⊤.

2.3 A few aspects of the optimization: backward recursion and composite values

We discuss a few aspects of the optimization problem. First, we go over why backward recursion can be used in optimizing the truncated survival mean values (φ_µ) or survival probabilities (φ_{σ,t}). For some criteria, such as the median φ_med(S) = S^{−1}(0.5), this backward recursion does not guarantee an overall optimum. However, for the mean survival time φ_µ and the survival probability φ_{σ,t} at time t, backward recursion is a legitimate method according to Proposition 1, which is proved in Appendix 2.

Proposition 1 (Legitimacy of backward recursion). Pick either φ_µ or φ_{σ,t} for φ. Let {π} be an arbitrary class of decision rules and let π* = (π(1)*, π(2)*, ..., π(Q)*)⊤ satisfy φ{S(· | H(1), π*)} ≥ φ{S(· | H(1), π)} for all π ∈ {π}.
Then, for any q = 1, 2, ..., Q−1,

π(q)*(H(q)) = arg max_{a(q) ∈ A(q)} φ(q){S(q)(· | H(q), a(q), π(q+1)*, π(q+2)*, ..., π(Q)*)}.

Second, we study an alternative to the φ_{σ,t} criterion. When φ_{σ,t}(S) = S(t) is chosen as the optimization criterion, the treatment rule cannot discriminate treatment arms at any point later than t. For example, when an investigator is interested in the six-month survival probability and a patient takes her third round of treatment seven months after her onset of the disease, the optimal dynamic treatment rule could recommend any treatment option randomly, since it does not affect the patient's six-month survival probability. Thus, a composite criterion of first maximizing S(t) and second maximizing E[T ∧ τ] could be more beneficial than the simple criterion φ_{σ,t}. The composite criterion values φ_{σ,µ,t}(S) = {S(t), E[T ∧ τ]} are ordered lexicographically; i.e., φ_{σ,µ,t}(S_{π1}) > φ_{σ,µ,t}(S_{π2}) if and only if either of the following is true: S_{π1}(t) > S_{π2}(t), or S_{π1}(t) = S_{π2}(t) and E[T_{π1} ∧ τ] > E[T_{π2} ∧ τ].

2.4 Generalized random survival forests

We develop a generalized random survival forest that takes survival probabilities as the outcome variable. The high-level algorithm is provided in Algorithm 2. Suppose we are interested in estimating the marginal survival probability of a node N at stage q, where each individual's failure time information is given by the survival probability of the remaining life and the censoring status: {(S(q)_i, δ(q)_i)}_{i=1}^{n_N}, where n_N is the size of node N and S(q)_i is shorthand for Ŝ(q+1)(t − X(q)_i | H(q+1)_i). Then we modify the Kaplan-Meier estimator as

Ŝ(q)(t | N) = ∏_{s ≤ t} [ 1 − { Σ_{i∈N} δ(q)_i dS(q)_i(s) } / { Σ_{i∈N} S(q)_i(s−) } ],   (3)

where dS(q)_i(s) denotes the decrement of the curve S(q)_i at s and s− = s − ε with an arbitrarily small positive value ε. This modified Kaplan-Meier estimator will be used in both tree partitioning and terminal node survival probability estimation.
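A useful sanity check on the modified Kaplan-Meier estimator is that it reduces to the ordinary Kaplan-Meier estimator when each input curve is a degenerate step function at the subject's observed time. The following Python sketch (illustrative; the array representation of the curves by their left and right limits on a common grid is our assumption) demonstrates this:

```python
import numpy as np

def modified_km(S_left, S_right, delta):
    """Sketch of the modified Kaplan-Meier estimator in (3).
    S_left[i, j]  = S_i(s_j -), S_right[i, j] = S_i(s_j) on grid times s_j;
    delta[i] = 1 if subject i is uncensored.  Returns curve values at each s_j."""
    delta = np.asarray(delta, float)[:, None]
    d = (delta * (S_left - S_right)).sum(axis=0)    # expected failures at s_j
    Y = S_left.sum(axis=0)                          # expected number at risk at s_j
    haz = np.divide(d, Y, out=np.zeros_like(d), where=Y > 0)
    return np.cumprod(1.0 - haz)

# degenerate step curves S_i(t) = 1(t <= X_i) recover the ordinary
# Kaplan-Meier estimator for data (1, event), (2, censored), (3, event)
X = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 0, 1])
grid = np.array([1.0, 2.0, 3.0])
S_left = (X[:, None] >= grid[None, :]).astype(float)   # S_i(s-) = 1(X_i >= s)
S_right = (X[:, None] > grid[None, :]).astype(float)   # S_i(s)  = 1(X_i >  s)
print(modified_km(S_left, S_right, delta))  # -> [2/3, 2/3, 0]
```

For non-degenerate inputs, each subject contributes a fractional at-risk mass S_i(s−) and a fractional failure mass δ_i dS_i(s), which is exactly how the stochastic augmentation enters the node-level estimate.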
Result: generalized random survival forest for the q-th stage, Ŝ(q)(· | H(q), A(q))
Parameters: the number of trees n_tree, the minimum terminal node size n_min, the minimum number of events in terminal nodes n_min.E;
Input data: {(X(q)_i, γ(q)_i, δ(q)_i, H(q)_i, A(q)_i)}_{i=1}^{n(q)} and Ŝ(q+1)*(· | H(q+1));
Stochastic augmentation: S(q)_i(t) = Ŝ(q+1)*(t − X(q)_i | H(q+1) = h(q+1)_i), with S(q)_i(t) = 1(t ≤ X(q)_i) if q = Q or if δ(q)_i = 0;
for treatment arm a ∈ A(q) do
    for tree b = 1, 2, ..., n_tree do
        Recursively partition the feature space with the augmented data {(S(q)_i, δ(q)_i, H(q)_i)}_{i: A(q)_i = a} via LR (4) or MD (5) until node size < n_min or the number of failures in each node < n_min.E;
        Obtain the terminal node survival probability Ŝ(q)_{b,a}(t | H(q)) via the modified Kaplan-Meier estimator (3);
    end
    Obtain the random survival forest estimator Ŝ(q)(· | H(q), A(q) = a) = n_tree^{−1} Σ_{b=1}^{n_tree} Ŝ(q)_{b,a}(· | H(q));
end
Algorithm 2: the generalized random survival forest estimator

Now we develop the splitting rules. Many random survival forest methods use the two-sample log-rank test statistic to partition the population into subgroups with a homogeneous survival distribution (Ishwaran et al., 2008; Hothorn et al., 2005). When failure time is not observed as a scalar but is given as a probabilistic process, the log-rank test can be generalized to the generalized log-rank statistic (Cho et al., 2019):

LR(N1, N2) = [ ∫_0^τ { Y2(t) dN1(t) − Y1(t) dN2(t) } / Y(t) ] / sqrt[ ∫_0^τ Y1(t) Y2(t) dN(t) { Y(t) − dN(t) } / Y(t)^3 ],   (4)

where N1 and N2 are two candidate daughter nodes, Y_l(t) = |N_l|^{−1} Σ_{i∈N_l} S(q)_i(t−) and N_l(t) = 1 − |N_l|^{−1} Σ_{i∈N_l} S(q)_i(t) for l = 1, 2, Y(t) = λ1 Y1(t) + λ2 Y2(t), N(t) = λ1 N1(t) + λ2 N2(t), λ_l = |N_l| / (|N1| + |N2|), and |N| = Σ_{i=1}^{n(q)} 1(i ∈ N).
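The generalized log-rank statistic can be computed directly from the curve-valued inputs on a shared time grid. The sketch below is a Python illustration under our reconstruction of (4) (the discretization, array layout, and function name are assumptions, not the dtrSurv implementation); with degenerate step curves it behaves like an ordinary two-sample log-rank comparison.

```python
import numpy as np

def gen_logrank(SL1, SR1, SL2, SR2):
    """Sketch of the generalized log-rank statistic (4).  For node l,
    SLl[i, j] = S_i(s_j -) and SRl[i, j] = S_i(s_j) on a shared grid of
    candidate event times s_j."""
    n1, n2 = SL1.shape[0], SL2.shape[0]
    lam1, lam2 = n1 / (n1 + n2), n2 / (n1 + n2)
    Y1, Y2 = SL1.mean(axis=0), SL2.mean(axis=0)          # at-risk processes
    dN1 = (SL1 - SR1).mean(axis=0)                        # failure increments
    dN2 = (SL2 - SR2).mean(axis=0)
    Y = lam1 * Y1 + lam2 * Y2
    dN = lam1 * dN1 + lam2 * dN2
    ok = Y > 0
    num = np.sum((Y2 * dN1 - Y1 * dN2)[ok] / Y[ok])
    var = np.sum((Y1 * Y2 * dN * (Y - dN))[ok] / Y[ok] ** 3)
    return 0.0 if var <= 0 else float(num / np.sqrt(var))

# degenerate curves: node 1 fails at times {1, 2}, node 2 at times {3, 4}
grid = np.array([1.0, 2.0, 3.0, 4.0])
step = lambda X: ((X[:, None] >= grid).astype(float),
                  (X[:, None] > grid).astype(float))
SL1, SR1 = step(np.array([1.0, 2.0]))
SL2, SR2 = step(np.array([3.0, 4.0]))
print(gen_logrank(SL1, SR1, SL2, SR2) > 0)   # earlier failures in node 1 -> True
```

Identical daughter nodes give a statistic of zero, and swapping the two nodes flips the sign, as a two-sample contrast should.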
The generalized log-rank splitting rule chooses (N1, N2) = arg max_{N1,N2} LR(N1, N2) as the best partition.

When the optimization criterion is the truncated mean survival time φ_µ, splitting can also be done by finding the split point that maximizes the difference of the truncated mean survival times. Define

MD(N1, N2) = | ∫_{t>0} (t ∧ τ) dŜ(t | N1) − ∫_{t>0} (t ∧ τ) dŜ(t | N2) |,   (5)

where Ŝ is the modified Kaplan-Meier estimator defined in Equation (3). The mean difference splitting rule finds (N1, N2) = arg max_{N1,N2} MD(N1, N2) as the best split.

The recursive partitioning continues until each terminal node size is less than a predefined threshold k. At each terminal node N of the b-th survival tree, the survival probability estimate Ŝ(q)_b(t | N) is given by the modified Kaplan-Meier estimator in (3), b = 1, 2, ..., n_tree, where n_tree is the number of trees in the forest. The random survival forest estimate is then given by

Ŝ(q)(t | h(q)) = n_tree^{−1} Σ_{b=1}^{n_tree} Σ_{l=1}^{|{N_{b,l}}_l|} 1(h(q) ∈ N_{b,l}) Ŝ(q)(t | N_{b,l}),

where |{N_{b,l}}_l| is the size, or the number of terminal nodes, of the b-th tree.

3 Consistency

We show consistency of the proposed dynamic treatment regime estimator π̂ using empirical process theory. We first derive that the generalized random survival forest is a uniformly (over the feature space) consistent estimator. The uniformity is required since estimating the optimal treatment policy of the q-th stage relies on integrating the estimated survival curve of the (q+1)st stage across the newly accrued information (H(q+1), T(q), U(q)) | H(q) between the two stages. Then we show that the value of the estimated regime is consistent for the value of the optimal regime, where the versions of the value are defined in Subsection 3.2.

3.1 Consistency of the generalized random survival forest

We derive consistency of the generalized random survival forest outside of the context of dynamic treatment regimes.
We have the following assumptions on the data distribution that are commonly used to achieve consistency of random forests or random survival forests (Meinshausen, 2006; Wager & Walther, 2015; Cui et al., 2017; Wager & Athey, 2018). These assumptions are tailored for use later in the dynamic treatment regime setting by adding stage indicators in Subsection 3.2.
Assumption 1 (Lipschitz continuous survival probability and censoring density). There exist constants L_S and L_G such that |S(t | h1) − S(t | h2)| ≤ L_S ‖h1 − h2‖ and |g(t | h1) − g(t | h2)| ≤ L_G ‖h1 − h2‖ for all h1, h2 ∈ H and t ∈ [0, τ], where G is the censoring survival distribution and g is its derivative with respect to time.

Assumption 2 (Weakly dependent covariate values). The covariate information (Z) is given as a d_Z-dimensional vector residing in a unit hypercube, i.e., Z ∈ [0, 1]^{d_Z}, with a density f_Z such that ζ^{−1} ≤ f_Z(z) ≤ ζ for all but countably many points z ∈ [0, 1]^{d_Z} and some constant ζ ≥ 1.

Remark 1.
The distributional assumption is made for notational convenience and could be relaxed. For example, the results in this paper still hold for covariates that form a multivariate normal random vector truncated by some constants, since random forests are invariant to scale transformations. Also, the results are still valid with categorical covariates with a finite number of levels after a modification to Assumption 1: in this setting, the norm is calculated by excluding the categorical covariate coordinates. The case of categorical covariates will be discussed in the proof of Theorem 1.
We also have assumptions regarding the terminal node size and splitting rules of the tree-based estimator. The first assumption forces the terminal node size to be both sufficiently small and sufficiently large.

Assumption 3 (Terminal node size). The maximum size k of the terminal nodes grows at the rate k ≍ n^β, 0 < β < 1, where a ≍ b implies both a = O(b) and b = O(a).

Next, we give definitions of regular trees and random-split trees that are often used for consistency proofs in the random forest literature (Meinshausen, 2006; Wager & Walther, 2015; Cui et al., 2017; Wager & Athey, 2018).
Definition 1 (random-split and α-regular trees and forests). A tree is called a random-split tree if each feature is given a minimum probability ϕ/d of being the split variable at each intermediate node, where 0 < ϕ < 1. A tree is α-regular if every daughter node contains at least an α fraction of the training sample in the corresponding parent node. A random forest is called a random-split (α-regular) forest if each of its member trees is random-split (α-regular).

Assumption 4 (α-regular and random-split trees). Trees are α-regular and random-split with a constant 0 < ϕ < 1.

Now we give our uniform consistency results for the generalized random survival forests. This theorem is largely based on the Z-estimation lemma and the Donsker property of the tree kernels. A detailed proof is relegated to Appendix 3.
Theorem 1 (Uniform consistency of a random survival forest). Suppose Assumptions 1–4 hold. Let S̃′ be an estimator of S′ that is uniformly consistent in both h ∈ H and t ∈ [0, τ], where S′ satisfies S(t | h) = ∫ S′(t − (T ∧ U) | H′) dP(T, U, H′ | H = h). Then the generalized random survival forest Ŝ(t | h) built based on S̃′ is uniformly consistent. Specifically,

sup_{t ∈ [0, τ], h ∈ H} |Ŝ(t | h) − S(t | h)| → 0,

in probability as n → ∞.

Consistency of a random survival forest sequence in a dynamic treatment regime setting requires adaptive versions of Assumptions 1–3 and an additional condition that guarantees an arbitrarily large sample size for the last stage with high probability. To achieve this condition, we assume that, at each stage except the last one, each patient receives the next treatment before experiencing failure or censoring with at least a certain minimum probability.

Assumption 5 (Lipschitz continuous survival probability and censoring density at each stage). There exist constants L_S and L_G such that |S^{(q)}(t | h_1) − S^{(q)}(t | h_2)| ≤ L_S ‖h_1 − h_2‖ and |g^{(q)}(t | h_1) − g^{(q)}(t | h_2)| ≤ L_G ‖h_1 − h_2‖ for all h_1, h_2 ∈ H^{(q)}, t ∈ [0, τ − B^{(q)}], and q = 1, 2, ..., Q, where G^{(q)} is the censoring survival distribution at the qth stage and g^{(q)} is its derivative with respect to time.

Assumption 6 (Weakly dependent historical information at each stage). The patient history information H^{(q)} at stage q is given as a d^{(q)}-dimensional vector lying in a subset H^{(q)} of [0, 1]^{d^{(q)}}, and its joint and marginal densities are bounded so that ζ^{−1} ≤ f_{H^{(q)}}(h) ≤ ζ and ζ^{−1} ≤ f_{H^{(q)},j}(h_j) ≤ ζ for all but a finite number of h ∈ H^{(q)} and j = 1, 2, ..., d^{(q)}, for some constant ζ ≥ 1.

Assumption 6 modifies Assumption 2 so that the support of the history space can be smaller than a hypercube, since, for example, the baseline times, (B^{(1)} ≤ B^{(2)} ≤ B^{(3)} ≤ ...
), have a dependent support structure.

Assumption 7 (Probability of proceeding to the next stage). There exists a constant M ∈ (0, 1] such that, for all h^{(q)} ∈ H^{(q)} and q = 1, 2, ..., Q − 1, pr(U^{(q)} < T^{(q)} ∧ C^{(q)} | H^{(q)} = h^{(q)}) ≥ M.
We show consistency of the dynamic treatment regime estimator using the causal inference framework. With this framework, the theoretical properties are applicable to broader settings than sequentially randomized experiments. A counterfactual outcome is defined as the outcome that would have been obtained if a person had received a treatment option contrary to his or her actual treatment. To denote the counterfactual outcomes pertaining to a counterfactual treatment or treatment policy, we add a subscript to the corresponding random variables or survival functions. For example, S^{(q)}_π(t | H^{(q)}) is the counterfactual survival probability of the remaining life if treatment were given according to a treatment rule π to a person with history H^{(q)} at stage q. We assume the following three standard causal inference assumptions (Hernán & Robins, 2020; Rubin, 2005; Cole & Frangakis, 2009), known as the stable unit treatment value assumption, sequential ignorability, and positivity.

Assumption 8 (Stable unit treatment value assumption, or SUTVA). Each person's counterfactual dynamics, such as failure time, time to next treatment, and censoring, are not affected by the treatments or history of other patients. Moreover, each treatment has only one version, or, if there are multiple versions, their differences are irrelevant.
Assumption 9 (Sequential ignorability). Treatment assignment is given independently of the counterfactual outcomes, conditionally on the individual's historical information, where the outcomes include all future random quantities such as failure time, time to next treatment, and censoring time.
Assumption 10 (Positivity). For each stage q = 1, 2, ..., Q, given historical information, the probability of receiving each treatment a ∈ A^{(q)} is greater than a constant L > 0. That is, pr(A^{(q)} = a | H^{(q)} = h) > L for all a ∈ A^{(q)} and all h such that pr(H^{(q)} = h) > 0.

The following theorem states that the dynamic treatment regime estimator has a value consistent for the value of the optimal regime. We define the value Φ of a treatment regime π as the criterion value of the survival probability if all individuals in a population follow the treatment regime. We use Φ(π) = φ(S_π) to denote the value of π, and Φ inherits the subscript of φ.

Theorem 2 (Consistency of the dynamic treatment regime estimator). Suppose Assumptions 5–10 hold, and the optimal dynamic treatment regime estimator (π̂) is built based on the generalized random forest grown under Assumptions 3 and 4. Then, for each φ = φ_µ, φ_{σ,t},

|Φ(π*) − Φ(π̂)| → 0,

in probability as n → ∞, where π* denotes the optimal regime. The proof is provided in Appendix 3.
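For concreteness, the two criteria can be written out using the standard identity E[T ∧ τ] = ∫_0^τ pr(T > t) dt for a nonnegative T; this worked form only restates the definitions of φ_µ and φ_{σ,t} from the notation summary and introduces no new assumptions:

```latex
\phi_\mu(S) \;=\; E_S[\,T \wedge \tau\,] \;=\; \int_0^{\tau} S(t)\,dt,
\qquad
\phi_{\sigma,t}(S) \;=\; S(t).
```

Thus maximizing φ_µ maximizes the area under the survival curve up to τ, while maximizing φ_{σ,t} maximizes the height of the survival curve at the single time point t.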
Simulations
We perform simulations to study the performance of the proposed method and compare the results to those of existing methods. The overall scheme of the simulations follows the pattern in Goldberg & Kosorok (2012), where a cancer clinical trial was mimicked using tumor size and wellness dynamics. We adapt their data-generating mechanism so that dependence among censoring, tumor size, and wellness is incorporated in the mechanism, and deterministic processes are relaxed into more realistic ones.

In the hypothetical trial, patients can go through up to Q = 3 treatment rounds during the trial, with trial length τ = 10. Each patient has tumor size ρ(t) and wellness ω(t) at time t ∈ [0, τ] that affect the timing of failure or treatment. We use [t^{(q)}, t^{(q+1)}] to denote the qth stage. At each stage, patients are given either a more aggressive (A = 1) or a less aggressive (A = 0) treatment.

At the beginning of the trial, patients have uniformly distributed tumor sizes ρ(0) and wellness ω(0). At each treatment time t^{(q)}, the more aggressive treatment immediately reduces the tumor size more than the less aggressive one does, with the post-treatment tumor size ρ(t^{(q)}+) floored at zero; however, the more aggressive treatment also worsens patient wellness ω(t^{(q)}+) more than the less aggressive one. After each treatment (t^{(q)} < t ≤ t^{(q+1)}), the tumor size and wellness follow Ornstein-Uhlenbeck processes with negative dependence between the two,

dχ(t) = −χ(t) dt + dη_χ(t), χ = ρ, ω,

where (η_ρ, η_ω) is a drifted Wiener process with positive drift means and negative cross-covariance. The positive drift means imply that, on average, both tumor size and wellness grow.
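The between-treatment dynamics above can be sketched with an Euler-Maruyama discretization. All numeric constants below (initial values, drift, cross-covariance, step size) are illustrative placeholders, not the paper's actual simulation constants:

```python
import numpy as np

def simulate_stage(rho0, omega0, dt=0.01, t_max=10.0, seed=0,
                   drift=(0.1, 0.1), cross_cov=-0.5):
    """Euler-Maruyama sketch of the between-treatment dynamics: tumor
    size (rho) and wellness (omega) each mean-revert, dX = -X dt + d(eta_X),
    driven by a drifted Wiener process with positive drift means and
    negative cross-covariance (so rho and omega are negatively dependent)."""
    rng = np.random.default_rng(seed)
    mean = np.array(drift) * dt                  # drift of (eta_rho, eta_omega)
    cov = np.array([[1.0, cross_cov],
                    [cross_cov, 1.0]]) * dt      # negative dependence
    rho, omega, t = rho0, omega0, 0.0
    path = [(0.0, rho, omega)]
    while t < t_max and rho <= 1.0:              # next treatment once rho > 1
        d_eta = rng.multivariate_normal(mean, cov)
        rho += -rho * dt + d_eta[0]
        omega += -omega * dt + d_eta[1]
        t += dt
        path.append((t, rho, omega))
    return t, rho, omega, path
```

The loop stops either when the tumor size crosses the treatment threshold or at the end of the stage window, matching the stage-transition rule described below.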
Table 2 summarizes, for Scenarios 1 (base), 2 (small p), 3 (large p), and 4 (moderate censoring), the covariate dimension p and the coefficients β_F(·) and β_C(·), the log hazard ratios of the counterfactual failure and censoring times in (6) and (7), respectively.

When the tumor size is greater than 1, the patient receives the next treatment. In other words, t^{(q+1)} = min{t > t^{(q)} : ρ(t) > 1}, for q = 1, 2, ..., Q − 1. The patient failure time is governed by the hazard process

λ_F(u) = 15 [ρ(u) ω(u)^{−1} exp{g(H^{(q)})^⊤ β_F(A^{(q)})} + 1{ω(u) < 0.1}],   (6)

where g(H^{(q)}) = {1, log(B^{(q)} + 1), Z^{(q)⊤}}^⊤, β_F(a) is the log hazard ratio for the failure time given treatment a as defined in Table 2, and the indicator accelerates the failure time when the wellness drops below the critical level, 0.1. The baseline covariates at the qth stage follow a p-variate normal distribution conditional on the previous stage value: Z^{(q)} = Z^{(q−1)} + Z_0^{(q)}, where Z_0^{(q)} ∼ N_p(0_p, c_1 I_p + c_2 1_{pp}), q = 1, 2, ..., Q, 1_{pp} is the p × p matrix of ones, c_1 and c_2 are positive constants, and Z^{(0)} = 0_p.

For each stage, censoring occurs according to the hazard function

λ_C(u) = exp{g(H^{(q)})^⊤ β_C(A^{(q)})},   (7)

where the censoring log hazard ratio, β_C(a), is also defined in Table 2. The treatment is allocated according to a Bernoulli trial with propensity

π(H^{(q)}) = [1 + exp{−g(H^{(q)})^⊤ β_π}]^{−1},   (8)

where the coefficient β_π is a fixed nonzero vector for the observational data setting and β_π = 0_{p+2} for the randomized trial setting.

Simulations are run with two sample sizes, n = 300 and n = 1000. Thus, we have in total 16 distinct factorial settings (four hazard/censoring scenarios, two propensity designs, and two sample sizes). For each setting, we have n_rep = 200 simulation replicates, and we estimate the optimal treatment rule based on both the φ_µ and φ_{σ,t=5} criteria. We compare the proposed method with the methods in Goldberg & Kosorok (2012) and Simoneau et al. (2019). We further compare the results with the observed policy characterized by (8) and with the estimated optimal zero-order model, in which all patients are given the same embedded treatment regime, namely the one that is best on average over all embedded regimes. In implementing the Goldberg & Kosorok (2012) method, a linear Q-function space and a random forest-based approximation space were used separately. The method of Huang et al. (2014) is not included in these simulations because Simoneau et al. (2019) can be viewed as a doubly robust extension of their method. After estimation, the values (either the expected truncated survival time or the survival probability at t = 5) are evaluated by generating an uncensored sample of size n_eval = 10000 according to the estimated policies. The simulations were completed using the R package dtrSurv. For reproducible research, the code is available at https://github.com/hunyong/survQlearn.
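Event times under time-varying hazards such as (6) and (7) can be drawn by the standard inversion method: draw E ∼ Exp(1) and take the first time at which the cumulative hazard Λ(t) reaches E. A minimal sketch, with a generic hazard function standing in for the covariate-dependent ones (the function name and discretization are ours):

```python
import numpy as np

def draw_event_time(hazard, t_max, dt=0.01, seed=0):
    """Inverse-transform sampling from a time-varying hazard: draw
    E ~ Exp(1) and return the first t whose cumulative hazard
    Lambda(t) >= E; `hazard` is any nonnegative function of time."""
    rng = np.random.default_rng(seed)
    e = rng.exponential(1.0)
    cum, t = 0.0, 0.0
    while t < t_max:
        cum += hazard(t) * dt        # left-endpoint Riemann sum of Lambda(t)
        t += dt
        if cum >= e:
            return t                 # event at (approximately) time t
    return t_max                     # no event by t_max: administratively censored
```

With a constant hazard λ, this reproduces, up to discretization error, an Exp(λ) draw truncated at t_max.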
In Figure 1 LEFT, the simulation results show how the truncated mean survival time (φ_µ) of each policy behaves under the different settings. For most settings, on average, all estimated regimes have greater values than the zero-order model, and the zero-order model has a higher value than the observed policy. This implies that the standard of care improves the overall survival time of patients over the observed policies and that individualized treatment rules can further enhance the outcomes.

Figure 1 LEFT also shows that, among the individualized treatment rules, the proposed method outperforms the other methods in many settings. Even when the method exhibits lower or about the same performance as the other methods for the small sample size, it often has higher performance than the others for the larger sample size; see, for example, the high censoring cases (the last row) of either the observational or the RCT setting. In the settings where the dimension of the covariates is high (the third row), the method of Simoneau et al. (2019) often has high values, although sometimes estimation is not possible. This might be because of the double robustness property of the method and the fact that the true optimal decision rule might be close to a linear function, as suggested by the linear interaction in the stage-wise hazard functions in (6). The method of Simoneau et al. (2019), however, often does not provide estimates when either the sample size at the terminal stage is small due to censoring or failure, or the dimension of the covariates is large.
Method labels in Figure 1: A, the proposed method; B, Goldberg & Kosorok (2012), RF; C, Goldberg & Kosorok (2012), linear; D, Simoneau et al. (2019); E, zero-order model; F, observed policy.
Figure 1: The estimated mean truncated survival time (LEFT) and the estimated survival probability at time t = 5 (RIGHT) under each estimated policy. Each dot represents one simulation replicate. The box plots are the quartile summary of the points, and the black dots are the average values of each method. Each row of boxes represents a data-generation scenario, and each column of boxes represents a combination of a study design and a sample size.

Figure 1 RIGHT shows the simulation results in terms of the survival probability at t = 5, or φ_{σ,5}. Note that the estimated regimes for the proposed method are distinct between LEFT and RIGHT but are the same for the other methods. The proposed estimator shows better performance in terms of the policy values under most settings, giving similar interpretations as the previous results.

Data analysis

We analyze acute myeloid leukemia patient survival time using the proposed and existing methods. We use a data set collected from a clinical trial in which 210 acute myeloid leukemia patients were randomized to a frontline treatment followed by a salvage treatment that was adaptively chosen based on the patient status (Wahed & Thall, 2013; Xu et al., 2016). One of four initial treatment options was given to each patient with equal probability: the combination of fludarabine, cytosine arabinoside, and idarubicin (FAI); FAI plus all-trans retinoic acid (ATRA) (FAIA); FAI plus granulocyte colony stimulating factor (G-CSF) (FAIG); and FAI plus ATRA plus G-CSF (FAIAG). Response to the first-stage treatment took one of the following forms: complete remission followed by disease progression, failure (either partial remission or remission with subsequent relapse), or resistance to treatment. Surviving patients that did not attain complete remission were given a salvage treatment based on physician judgement.
The salvage treatments were labeled with one of two categories depending on whether or not they contained high-dose ara-C (HDAC). We assume that the salvage treatment effects are homogeneous across different versions and that the current data set contains most of the information necessary to satisfy Assumption 9.

The data set is composed of age at entry, type of leukemia characterized by cytogenetic abnormality, treatment assignment, response to the initial treatment (failure, complete remission, or resistance), duration of each treatment stage, and survival and censoring status at each treatment stage. Because both age and type of leukemia are known to have high power in predicting overall survival time, they are the main factors taken into account by clinicians when prescribing treatment in practice, thereby mitigating the potential violation of Assumption 9. Patient age at entry ranges from 21 to 87 years and is 61.1 years on average. The patients' cytogenetic abnormalities were grouped into three categories according to their prognosis (Wahed & Thall, 2013): 65 poor, 79 intermediate, and 66 good. Out of 210 patients, 198 experienced death during the study, and 12 were censored. The overall observed survival time (T ∧ C) was on average 414 days and had quartiles of 45, 184, and 443 days. We set τ = 450 days. Figure 2 gives an overview of the dynamics of the study.

We derive the optimal treatment rules using the proposed method, the Goldberg & Kosorok (2012) method, the Simoneau et al. (2019) method, and the zero-order model. The value of the observed policy, the combination of randomization and physician judgement, is estimated and compared with the other rules. As in Section 4, the Goldberg & Kosorok (2012) method was implemented using two different Q-function approximations: a linear model and a random forest. Because the Simoneau et al.
(2019) method was implemented using the R package DTRreg, the current version (1.7) of which allows binary treatment arms only, the method was applied after dichotomizing the initial treatment according to the inclusion of all-trans retinoic acid.

We compare the performance of the five models based on cross-validation, where K = 100 training data sets of size 168 (80%) are sampled without replacement, stratified according to the initial treatment, to estimate the treatment rules, and the remaining test sets are held out for evaluation.

Figure 2: Treatment stages and patient status, with the number of instances in parentheses.

The value of each treatment rule is estimated by the inverse probability (censoring) weighting approach as follows. Define

W_i(π) = [∏_{q=1}^{2} 1{π^{(q)}(H_i^{(q)}) = A_i^{(q)}} δ_i] / [p̂r(A_i^{(1)}) p̂r(A_i^{(2)} | H_i^{(2)}) ∏_{q=1}^{2} p̂r(T_i^{(q)} > C_i^{(q)} | age_i)],   (9)

where the propensity of the salvage treatment is modeled as pr(A^{(2)} = a) = {1 + exp(−H^{(2)} β)}^{−1}, and H^{(2)} is a matrix with columns of ones, B^{(2)}, A^{(1)}, age, cytogenetic leukemia type, and response to the initial treatment. Because there was an insufficient number of censored patients in the study, the censoring propensity score is modeled using only the patient age at entry. Then the value of each policy π and each criterion φ = φ_µ, φ_{σ,t} is estimated by

φ̂_µ(S_π) = Σ_{i: test set} (T_i ∧ τ) W_i(π) / Σ_{i: test set} W_i(π) and φ̂_{σ,t}(S_π) = Σ_{i: test set} 1(T_i > t) W_i(π) / Σ_{i: test set} W_i(π),

where for the second criterion we use t = 180 to evaluate the six-month survival probabilities.
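Once the weights W_i(π) in (9) are computed, the two value estimates reduce to weighted averages over the test set. A minimal sketch, assuming precomputed weights (the function and argument names are ours):

```python
import numpy as np

def ipcw_value(T, W, tau=450.0, t0=180.0):
    """Inverse-probability-weighted value estimates: given observed
    survival times T_i and weights W_i(pi) for a candidate policy pi,
    return the truncated-mean criterion and the t0-survival criterion."""
    T = np.asarray(T, dtype=float)
    W = np.asarray(W, dtype=float)
    denom = W.sum()
    phi_mu = np.sum(np.minimum(T, tau) * W) / denom      # truncated mean
    phi_sigma = np.sum((T > t0) * W) / denom             # survival at t0
    return phi_mu, phi_sigma
```

For example, with times (100, 500, 200) days and weights (1, 1, 2), the truncated mean at τ = 450 is (100 + 450 + 2·200)/4 = 237.5, and the 180-day survival probability is 3/4.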
We estimate the value of the observed policy, i.e., a hybrid of the randomized rule for the initial treatment and physician judgement for the salvage treatment, under the same evaluation procedure.

Figure 3: Estimated truncated mean survival time (LEFT) and estimated six-month survival probability (RIGHT) under the dynamic treatment regimes. Each dot represents one cross-validation replicate. The box plots are the quartile summary of the points, and the black dots are the average values of each method.

Figure 3 summarizes the value estimates of the estimated policies. Each box and scatter plot corresponds to a policy, and within each plot, each dot represents a cross-validation replicate of the estimated value (truncated mean survival time, LEFT, or six-month survival probability, RIGHT) of the corresponding policy. The results show that the proposed method achieves higher values than the other estimated policies on average, in terms of both the truncated mean survival time and the six-month survival probability. This also holds when compared with the zero-order model, implying the advantage of the proposed individualized treatment rules over the standard of care. However, the estimated policies do not have higher values than the observed policy value on average. Recalling that the second treatment assignments were made according to physician judgement, this implies that the physicians' judgement was fairly successful and that a larger sample size and more patient information, available only to physicians but not included in this data set, could enhance the performance of the methods. Despite the gap, however, the proposed method yields values closest to the observed policy values among the competing methods.
Discussion
In this paper, we introduced a new dynamic treatment regime estimator for survival outcomes. The estimator is versatile in the sense that it allows for a flexible number of treatment stages and arms, it gives a non-linear decision rule that maximizes either the mean survival time or the survival probability, and it permits dependent censoring. In this section, we discuss some considerations when using our method and how the method can be further extended.

The proposed estimator assumes that the distribution is well characterized by the treatment stages rather than by the natural passage of time. That is, the cohort at each treatment stage might include patients that received the stage treatment on day 10 as well as patients that received the stage treatment on day 1000. Thus, this method is effective when the disease dynamics are considered stationary between stage transitions or when the treatment effect carries a relatively stronger signal than the baseline disease dynamics.

Another consideration is the choice of the study length τ, which should be chosen so that the survival probabilities are reliably estimated up to τ. For stability of estimation, the chosen τ should not be too large; however, reducing τ results in information loss. If all patients are censored at a certain stage, the survival probability is not estimable; in this case, a smaller τ should be chosen so that every stage has a sufficient number of observed failure times.

One interesting extension of the proposed method is optimizing a specific quantile of the survival distribution, such as the median survival time. However, this task introduces a unique challenge with respect to the backward recursion. Specifically, the optimal decisions made for later stages may no longer be optimal once earlier decisions modify the distribution. Thus, an exhaustive search may be needed to find an optimal policy. An extension of the quantile-optimal dynamic treatment regime estimators developed by Wang et al. (2018) and Linn et al.
(2017) to right-censored data would be interesting future work.

The proposed method assumes an unrestricted policy class. In practice, however, clinicians and patients may prefer understandable and simple rules. A linear rule, for example, is often of interest. By imposing a Cox-type proportional hazards assumption on top of our method, the resulting policy class becomes a set of linear functions. The task then reduces to replacing the generalized random survival forest with a generalized Cox model, where the estimated survival probabilities, instead of the observed times with censoring indicators, are entered as the outcome variable.

In establishing the uniform consistency results, we use the fact that, under the given assumptions, the maximum size of the terminal nodes becomes arbitrarily small in probability, uniformly, as argued in Wager & Walther (2015). However, the uniform rate of convergence can be very low. As a random-split tree grows deeper and has an increasingly large number of terminal nodes, there exists, with a not-very-small probability, at least one terminal node that is created without enough splits on a certain covariate. To enhance the rate, splitting can potentially be done in a stratified fashion, by assigning a certain minimum number of splits to each variable. The effect of the tree-building rules on the rate of convergence would be an interesting future study, especially in the presence of noise variables in high-dimensional data settings.

Acknowledgement
The authors thank Peter F. Thall, Abdus S. Wahed, and Yanxun Xu for providing the leukemia data, and Donglin Zeng for bringing the composite criterion optimization into the discussion. This research was funded in part by grant P01 CA142538 from the National Cancer Institute.
Supplementary material
Supplementary Material available at Biometrika online includes the extended proofs of Theorems 1 and 2.
Appendix 1
Summary of notation
See Table 3 for the summary of notation.
Appendix 2
Proof of Theorem 1
The following Z-estimator lemma (Kosorok, 2007) plays a key role in this consistency proof. Let
Ψ : Θ → L and Ψ_n : Θ → L be maps between two normed spaces, where Θ is a parameter space, L is a normed space, ‖·‖_L denotes the uniform norm, Ψ is a fixed map, and Ψ_n is a data-dependent map.

Table 3: Summary of notation.

(Basic notation)
q: treatment stage, q = 1, 2, ..., Q.
τ: study length.
n^{(q)}: number of training examples available at stage q.
(Predictors)
A^{(q)}: action at stage q.
Z^{(q)}: covariate information newly available at the beginning of stage q.
H^{(q)}: history available at the beginning of stage q, including Z^{(q)}.
d^{(q)}: dimension of the history H^{(q)} at stage q.
π^{(q)}(H^{(q)}): decision rule at stage q given H^{(q)}.
(Stage-wise outcomes)
T^{(q)}: failure time from the beginning of stage q.
U^{(q)}: time to next treatment from the beginning of stage q.
C^{(q)}: censoring time from the beginning of stage q.
V^{(q)}: uncensored length of stage q; V^{(q)} = T^{(q)} ∧ U^{(q)}.
X^{(q)}: observed length of stage q; X^{(q)} = V^{(q)} ∧ C^{(q)}.
δ^{(q)}: censoring indicator of stage q; δ^{(q)} = 1(V^{(q)} ≤ C^{(q)}).
γ^{(q)}: treatment indicator of stage q; γ^{(q)} = 1(T^{(q)} ≤ U^{(q)}).
B^{(q)}: baseline time at stage q; B^{(q)} = Σ_{k=1}^{q−1} X^{(k)} for q > 1, B^{(1)} = 0.
L^{(q)}: remaining life at stage q; L^{(q)} = T − B^{(q)}.
(Overall outcomes)
T: overall failure time; T = Σ_{q=1}^{Q} V^{(q)}.
C: overall censoring time.
δ: overall censoring indicator; δ = 1(T ≤ C).
(Theoretical settings)
L_S, L_G: Lipschitz continuity constants of Assumption 1.
k: maximum terminal node size.
β: terminal node size rate in Assumption 3.
α: regular split constant in Assumption 4.
(Survival functions and values)
S: overall or generic failure survival function.
S^{(q)}: survival function of the remaining life at the beginning of stage q.
G: overall or generic censoring survival function.
φ_µ(S): mean truncated survival time; φ_µ(S) = E_S[T ∧ τ].
φ_{σ,t}(S): survival probability at time t; φ_{σ,t}(S) = S(t).
φ_µ^{(q)}(S^{(q)}): mean truncated survival time, given B^{(q)}; φ_µ^{(q)}(S^{(q)}) = B^{(q)} + E_{S^{(q)}}[L^{(q)} ∧ (τ − B^{(q)})].
φ_{σ,t}^{(q)}(S^{(q)}): survival probability at time t, given B^{(q)}; φ_{σ,t}^{(q)}(S^{(q)}) = S^{(q)}(t − B^{(q)}).

Lemma 1 (Consistency of Z-estimators). Let Ψ(S) = 0 for some S ∈ Θ, and assume that ‖Ψ(S_n)‖_L → 0 implies ‖S_n − S‖_L → 0 for any sequence {S_n} ∈ Θ. Then, if ‖Ψ_n(Ŝ_n)‖_L → 0 in probability for some sequence of estimators Ŝ_n ∈ Θ, and sup_{S ∈ Θ} ‖Ψ_n(S) − Ψ(S)‖_L → 0 in probability, then ‖Ŝ_n − S‖_L → 0 in probability as n → ∞.

For our specific problem, let Θ be the space of all covariate-conditional survival functions. Define the normed space L = D_{[−1,1]}{[0, τ] × R^d}, where D_A^B is the space of all right-continuous functions with left limits with range A and support B. Define

ψ_{S,t} = δ S′(t − X | H′) + (1 − δ){S(t | H)/S(X | H) ∧ 1} − S(t | H),   (10)

ψ̃_{S,t,S̃′} = δ S̃′(t − X | H′) + (1 − δ){S(t | H)/S(X | H) ∧ 1} − S(t | H),   (11)

Ψ(S) = P[ψ_{S,t} δ_h] / P[δ_h] ≡ P_{·|h} ψ_{S,t}, and   (12)

Ψ_n(S) = P_n[ψ̃_{S,t,S̃′} k_h] / P_n[k_h] ≡ P_{n,·|k_h} ψ̃_{S,t,S̃′},   (13)

where P is the population average of function values, i.e., P f = ∫ f(h) dP(h), P_n is the sample average of function values, i.e., P_n f = ∫ f(h) dP_n(h) = n^{−1} Σ_{i=1}^{n} f(H_i), S′ is a fixed survival probability that is distinct from S, and S̃′ is a data-dependent version of S′ such that S̃′ → S′ in probability as n → ∞. In the context of dynamic treatment regimes, S′ is the survival probability of the remaining life of the next stage, and H′ is the historical information available at the beginning of the next stage. Also, δ_h = 1(· = h) is the unnormalized Dirac measure, and k_h is the unnormalized forest kernel. To be more specific, k_h(·) = n_tree^{−1} Σ_{b=1}^{n_tree} 1(· ∈ L_b(h)), where n_tree is the number of trees in the forest and L_b(h) is the terminal node of the bth tree of the forest that contains the point h. We use the term 'unnormalized' to mean that they are not multiplied by the sample (or the population) size.

Then, by Lemma 1 and the following four propositions (Propositions 2 to 5), ‖Ŝ − S‖_L → 0 in probability as n → ∞.

Proposition 2. Ψ(S) = 0 for S ∈ Θ.

Proposition 3. Assume Assumptions 1 to 4. As n → ∞, ‖Ψ(Ŝ)‖_L → 0 implies ‖Ŝ − S‖_L → 0.

Proposition 4.
Assume Assumptions 1 to 4. Let Ŝ be the kernel-conditional modified Kaplan-Meier estimator, that is, the modified Kaplan-Meier estimator in (3) applied to the data with weights given by the forest kernel k_h. As n → ∞, ‖Ψ_n(Ŝ)‖_L → 0 in probability.

Proposition 5.
Assume Assumptions 1 to 4. As n → ∞, sup_{S ∈ Θ} ‖Ψ_n(S) − Ψ(S)‖_L → 0 in probability.

The proofs of Propositions 2–4 are provided in the Supplementary Material. The proof of Proposition 5 relies on an entropy calculation and the Donsker theorem. The strategy is to show that the tree kernels, and consequently the forest kernels, are Donsker, and that the ψ and ψ̃ functions are also Donsker, which we show in Proposition 6 below. Then, by empirical process theory and Assumptions 1, 2, 3, and 4, the uniform consistency is derived. Hence, we provide the related result (Proposition 6) regarding the entropy of the tree kernels before we move on to the proof of Proposition 5. The proof of Proposition 6 and the rest of the proof of Proposition 5 are given in the Supplementary Material.

Proposition 6 (Bounded entropy integral of the tree and forest kernels). The collections of the unnormalized tree and forest kernel functions are Donsker, where the tree kernels k_tree(·) are axis-aligned random hyper-rectangles and the forest kernels k_forest(·) are the means of arbitrarily many (n_tree) tree kernels.

Proof of Corollary 1. Corollary 1 follows from recursion once we show that the Qth stage random survival forest is consistent and that n^{(Q)} > cn for some constant c almost surely. For each h ∈ H^{(Q)} and treatment a ∈ A^{(Q)}, Assumption 7 guarantees that the probability of a patient making all planned visits is at least M^{Q−1}; that is, n^{(Q)} ≥ M^{Q−1} n almost surely. Uniform consistency of the stage-wise random survival forest again relies on the node diameter, sup_{h ∈ H} sup_{h′: k_h(h′) > 0} ‖h − h′‖, being arbitrarily small. Although the support in Assumption 6 is different from that in Assumption 2, the bounded marginal density condition is sufficient to follow the same lines of proof given in Meinshausen (2006) and Wager & Walther (2015). Now, by setting S′(t | h) = S̃′(t | h) = 1(t ≤ 0) for all h, Theorem 1 yields uniform consistency of Ŝ^{(Q)}(· | ·).
Consistency of the rest of the sequence then follows, since n^{(q)} ≥ n^{(Q)} for q = Q − 1, Q − 2, ..., 1.

Appendix 3

Proof of Theorem 2

Proof of Theorem 2.
The proof is largely based on the lines of proof used by Murphy (2005) and Goldberg & Kosorok (2012). We tailor their proofs, based on the Q-learning framework, to the context of this paper. We introduce notation for some intermediate stage quantities that are the analogs of the intermediate stage Q- and V-functions. Let

Ξ^{(q)}_{S,π}(H^{(q)}, a^{(q)}) = φ^{(q)}{S^{(q)}_π(· | H^{(q)}, A^{(q)} = a^{(q)})} and
Φ^{(q)}_{S,π}(H^{(q)}) = φ^{(q)}{S^{(q)}_π(· | H^{(q)}, A^{(q)} = π^{(q)}(H^{(q)}))}.

The dependency of Ξ and Φ on the choice of φ is marked with the corresponding subscript, e.g., Φ^{(q)}_{·,µ} for φ^{(q)}_µ, or is suppressed when it is clear from context.

We modify Lemmas 1 and 2 of Murphy (2005) to obtain Lemmas 2 and 3 below. Then we conclude the proof of Theorem 2 by bounding the term

E[{Ξ^{(q)}_{S*,π*}(H^{(q)}, A^{(q)}) − Ξ̂^{(q)}(H^{(q)}, A^{(q)})}²]

of Lemma 3 using the quantity

Err_{Ξ^{(q+1)}}(Ξ^{(q)}) = E[{max_{a^{(q+1)}} Ξ^{(q+1)}(H^{(q+1)}, a^{(q+1)}) − Ξ^{(q)}(H^{(q)}, A^{(q)})}²],

where a similar strategy was used in Murphy (2005) and Goldberg & Kosorok (2012), and we drop the arguments of Ξ^{(q)} and Ξ^{(q+1)} as long as no confusion arises.
We have
\[
\begin{aligned}
&\mathrm{Err}_{\hat\Xi^{(q+1)}}(\hat\Xi^{(q)}) - \mathrm{Err}_{\hat\Xi^{(q+1)}}(\Xi^{(q)}_*) \\
&\quad= E\big[\hat\Xi^{(q)2}\big] - E\big[\Xi^{(q)2}_*\big]
   + 2E\Big[(\Xi^{(q)}_* - \hat\Xi^{(q)})\max_{a^{(q+1)}}\Xi^{(q+1)}_*\Big]
   - 2E\Big[\Big(\max_{a^{(q+1)}}\Xi^{(q+1)}_* - \max_{a^{(q+1)}}\hat\Xi^{(q+1)}\Big)(\Xi^{(q)}_* - \hat\Xi^{(q)})\Big] \\
&\quad= E\big[\hat\Xi^{(q)2}\big] - E\big[\Xi^{(q)2}_*\big]
   + 2E\Big[(\Xi^{(q)}_* - \hat\Xi^{(q)})\,\Xi^{(q)}_*\Big]
   - 2E\Big[\Big(\max_{a^{(q+1)}}\Xi^{(q+1)}_* - \max_{a^{(q+1)}}\hat\Xi^{(q+1)}\Big)(\Xi^{(q)}_* - \hat\Xi^{(q)})\Big] \\
&\quad= E\big[(\hat\Xi^{(q)} - \Xi^{(q)}_*)^2\big]
   - 2E\Big[\Big(\max_{a^{(q+1)}}\Xi^{(q+1)}_* - \max_{a^{(q+1)}}\hat\Xi^{(q+1)}\Big)(\Xi^{(q)}_* - \hat\Xi^{(q)})\Big] \\
&\quad\ge E\big[(\hat\Xi^{(q)} - \Xi^{(q)}_*)^2\big]
   - 2E^{1/2}\Big[\Big(\max_{a^{(q+1)}}\Xi^{(q+1)}_* - \max_{a^{(q+1)}}\hat\Xi^{(q+1)}\Big)^2\Big]\,
     E^{1/2}\big[(\Xi^{(q)}_* - \hat\Xi^{(q)})^2\big] \\
&\quad\ge E\big[(\hat\Xi^{(q)} - \Xi^{(q)}_*)^2\big]
   - 2\sqrt{L\,E\big[(\Xi^{(q+1)}_* - \hat\Xi^{(q+1)})^2\big]}\;
     E^{1/2}\big[(\Xi^{(q)}_* - \hat\Xi^{(q)})^2\big] \\
&\quad\ge E\big[(\hat\Xi^{(q)} - \Xi^{(q)}_*)^2\big]
   - \Big\{2L\,E\big[(\Xi^{(q+1)}_* - \hat\Xi^{(q+1)})^2\big]
   + \tfrac{1}{2}E\big[(\Xi^{(q)}_* - \hat\Xi^{(q)})^2\big]\Big\} \\
&\quad= \tfrac{1}{2}E\big[(\hat\Xi^{(q)} - \Xi^{(q)}_*)^2\big]
   - 2L\,E\big[(\hat\Xi^{(q+1)} - \Xi^{(q+1)}_*)^2\big],
\end{aligned}
\]
where in the second equality,
\[
  E\Big[\max_{a^{(q+1)}} \Xi^{(q+1)}_*(H^{(q+1)}, a^{(q+1)})
  = \phi^{(q+1)}\big\{S^{(q+1)}_*(\cdot \mid H^{(q+1)}, \pi^{(q+1)}_*(H^{(q+1)}))\big\} \,\Big|\, H^{(q)}, A^{(q)}\Big]
  = \Xi^{(q)}_*(H^{(q)}, A^{(q)})
\]
was used; in the first and last inequalities, the Cauchy–Schwarz inequality and the fact that $ab \le (a^2 + b^2)/2$ were used; and the second inequality holds true since
\[
  \Big\{\max_{a^{(q+1)}}\Xi^{(q+1)}_*(a^{(q+1)}) - \max_{a^{(q+1)}}\hat\Xi^{(q+1)}(a^{(q+1)})\Big\}^2
  \le \max_{a^{(q+1)}}\big\{\Xi^{(q+1)}_*(a^{(q+1)}) - \hat\Xi^{(q+1)}(a^{(q+1)})\big\}^2
  \le L\,E\big\{\Xi^{(q+1)}_*(A^{(q+1)}) - \hat\Xi^{(q+1)}(A^{(q+1)})\big\}^2 .
\]
Recursing over stages then yields
\[
  E\big[(\hat\Xi^{(q)} - \Xi^{(q)}_*)^2\big]
  \le \sum_{q'=q}^{Q} (4L)^{q'-q}\, 2\big\{\mathrm{Err}_{\hat\Xi^{(q'+1)}}(\hat\Xi^{(q')}) - \mathrm{Err}_{\hat\Xi^{(q'+1)}}(\Xi^{(q')}_*)\big\},
  \quad q = 1, 2, \ldots, Q.
\]
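The geometric-sum bound above follows by unrolling the stage-wise inequality; a worked expansion, writing $D_q = E[(\hat\Xi^{(q)} - \Xi^{(q)}_*)^2]$ and $e_q$ for the difference of Err values at stage $q$, and using the convention that the stage-$(Q+1)$ estimation error vanishes:

```latex
% Rearranging e_q >= D_q/2 - 2L D_{q+1} gives D_q <= 2 e_q + 4L D_{q+1}.
% Iterating from stage q up to stage Q, with D_{Q+1} = 0:
\[
  D_q \;\le\; 2e_q + 4L\,D_{q+1}
      \;\le\; 2e_q + 4L\big(2e_{q+1} + 4L\,D_{q+2}\big)
      \;\le\; \cdots
      \;\le\; \sum_{q'=q}^{Q} (4L)^{q'-q}\, 2e_{q'} .
\]
```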
Thus, by Lemma 3, the error of the value function is bounded by
\[
  \Phi(\pi^*) - \Phi(\hat\pi)
  \le \sum_{q=1}^{Q} L^{q/2}
      \sqrt{\sum_{q'=q}^{Q} (4L)^{q'-q}\, 2\big\{\mathrm{Err}_{\hat\Xi^{(q'+1)}}(\hat\Xi^{(q')}) - \mathrm{Err}_{\hat\Xi^{(q'+1)}}(\Xi^{(q')}_*)\big\}}
  \le \sum_{q=1}^{Q}
      \sqrt{2^{1-2q} \sum_{q'=q}^{Q} (4L)^{q'}\big\{\mathrm{Err}_{\hat\Xi^{(q'+1)}}(\hat\Xi^{(q')}) - \mathrm{Err}_{\hat\Xi^{(q'+1)}}(\Xi^{(q')}_*)\big\}}.
  \tag{14}
\]
Now we obtain the bound of the difference in Err values:
\[
\begin{aligned}
&\mathrm{Err}_{\hat\Xi^{(q+1)}}(\hat\Xi^{(q)}) - \mathrm{Err}_{\hat\Xi^{(q+1)}}(\Xi^{(q)}_*) \\
&\quad= E\Big[\Big\{\Big(\max_{a^{(q+1)}}\hat\Xi^{(q+1)} - \hat\Xi^{(q)}\Big) + \Big(\max_{a^{(q+1)}}\hat\Xi^{(q+1)} - \Xi^{(q)}_*\Big)\Big\}\Big\{\Xi^{(q)}_* - \hat\Xi^{(q)}\Big\}\Big] \\
&\quad\le 2c(\phi)\, E\big[\big|\Xi^{(q)}_* - \hat\Xi^{(q)}\big|\big] \\
&\quad\le 2c(\phi) \sup_{h^{(q)}, a^{(q)}} \big|\Xi^{(q)}_*(h^{(q)}, a^{(q)}) - \hat\Xi^{(q)}(h^{(q)}, a^{(q)})\big| \\
&\quad\le 2c(\phi) \sup_{h^{(q)}, a^{(q)}} \phi^{(q)}\Big\{\big|S^{(q)}_*(\cdot \mid h^{(q)}, a^{(q)}) - \hat S^{(q)}(\cdot \mid h^{(q)}, a^{(q)})\big|\Big\} \\
&\quad\le 2c(\phi) \sup_{h^{(q)}, a^{(q)}} \phi^{(q)}\Big\{\sup_{t \in [0,\tau]}\big|S^{(q)}_*(t \mid h^{(q)}, a^{(q)}) - \hat S^{(q)}(t \mid h^{(q)}, a^{(q)})\big|\Big\} \\
&\quad\le 2c(\phi)\, \phi^{(q)}\Big\{\sup_{h^{(q)}, a^{(q)},\, t \in [0,\tau]}\big|S^{(q)}_*(t \mid h^{(q)}, a^{(q)}) - \hat S^{(q)}(t \mid h^{(q)}, a^{(q)})\big|\Big\},
\end{aligned}
\]
where $c(\phi_\mu) = \tau$ and $c(\phi_{\sigma,t}) = 1$; the first inequality comes from the triangle inequality and the fact that any $\Xi$ function is bounded within $[0, \tau]$ for $\phi = \phi_\mu$ and within $[0, 1]$ for $\phi = \phi_{\sigma,t}$, so that the first factor is at most $2c(\phi)$ in absolute value; and, in the fourth inequality, Fatou's lemma was used.

Given that all assumptions of Corollary 1 hold, and that the space of histories $\mathcal{H}$ incorporates the historical actions, the theorem gives us
\[
  \sup_{t \in [0,\tau],\, h^{(q)} \in \mathcal{H}^{(q)},\, a^{(q)} \in \mathcal{A}^{(q)},\, q \in \{1, 2, \ldots, Q\}}
  \big|\hat S^{(q)}(t \mid h^{(q)}, a^{(q)}) - S^{(q)}_*(t \mid h^{(q)}, a^{(q)})\big| \to 0,
\]
in probability as $n \to \infty$. Since $\phi$ is either integration over the interval $[0, \tau]$ (for $\phi = \phi_\mu$) or evaluation at $t$ (for $\phi = \phi_{\sigma,t}$), the error above is bounded by a constant multiple of the supremum error of the survival curves. Thus,
\[
  \mathrm{Err}_{\hat\Xi^{(q+1)}}(\hat\Xi^{(q)}) - \mathrm{Err}_{\hat\Xi^{(q+1)}}(\Xi^{(q)}_*) \to 0
\]
in probability as $n \to \infty$, and since in (14) $\Phi(\pi^*) - \Phi(\hat\pi)$ is bounded by finite summations and bounded transformations of these differences, the desired result holds.

Lemma 2.
For treatment regimes $\tilde\pi$ and $\pi$,
\[
  \Phi(\tilde\pi) - \Phi(\pi) = -E_{\pi}\Big\{\sum_{q=1}^{Q} \Delta^{(q)}_{S^*,\tilde\pi}(H^{(q)}, A^{(q)})\Big\},
\]
where $\Delta^{(q)}_{S,\pi}(h^{(q)}, a^{(q)}) = \Xi^{(q)}_{S,\pi}(h^{(q)}, a^{(q)}) - \Phi^{(q)}_{S,\pi}(h^{(q)})$ is the advantage of treatment $a^{(q)}$ at stage $q$, and $S^*$ is the true survival probability. This statement is true for both criteria $\phi = \phi_\mu$ and $\phi = \phi_{\sigma,t}$.

Lemma 3.
For all functions $\hat\Xi^{(q)}$ satisfying $\hat\pi^{(q)}(h^{(q)}) \in \arg\max_{a^{(q)}} \hat\Xi^{(q)}(h^{(q)}, a^{(q)})$, $q = 1, 2, \ldots, Q$, we have
\[
  \Phi(\pi^*) - \Phi(\hat\pi)
  \le \sum_{q=1}^{Q} L^{q/2}
      \sqrt{E\Big[\Big\{\Xi^{(q)}_{S^*,\pi^*}(H^{(q)}, A^{(q)}) - \hat\Xi^{(q)}(H^{(q)}, A^{(q)})\Big\}^2\Big]},
\]
where $\pi^* = (\pi^{(1)}, \ldots, \pi^{(Q)})$ is the optimal decision rule for the survival probability $S^*$. This statement is true for both criteria $\phi = \phi_\mu$ and $\phi = \phi_{\sigma,t}$.

References

Breiman, L. (2001). Random forests. Machine Learning, 5–32.

Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. Wadsworth Int. Group, 237–251.

Cho, H., Jewell, N. P. & Kosorok, M. R. (2019). Interval censored recursive forests. arXiv preprint arXiv:1912.09983.

Cole, S. R. & Frangakis, C. E. (2009). The consistency statement in causal inference: a definition or an assumption? Epidemiology, 3–5.

Cui, Y., Zhu, R., Zhou, M. & Kosorok, M. (2017). Consistency of survival tree and forest models: splitting bias and correction. arXiv preprint arXiv:1707.09631.

Edwards, J. C. & Cambridge, G. (2006). B-cell targeting in rheumatoid arthritis and other autoimmune diseases. Nature Reviews Immunology, 394–403.

Efron, B. (1967). The two sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 4.

Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika, 203–224.

Geurts, P., Ernst, D. & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 3–42.

Goldberg, Y. & Kosorok, M. R. (2012). Q-learning with censored data. Annals of Statistics, 529.

Gordon, L. & Olshen, R. A. (1985). Tree-structured survival analysis. Cancer Treatment Reports, 1065–1069.

Habermann, T. M., Weller, E. A., Morrison, V. A., Gascoyne, R. D., Cassileth, P. A., Cohn, J. B., Dakhil, S. R., Woda, B., Fisher, R. I., Peterson, B. A. et al. (2006). Rituximab-CHOP versus CHOP alone or with maintenance rituximab in older patients with diffuse large B-cell lymphoma. Journal of Clinical Oncology, 3121–3127.

Hernán, M. A. & Robins, J. M. (2020). Causal Inference: What If.

Hogan, J. & Radhakrishnan, J. (2013). The treatment of minimal change disease in adults. Journal of the American Society of Nephrology, 702–711.

Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A. & van der Laan, M. J. (2005). Survival ensembles. Biostatistics, 355–373.

Huang, H.-J., Yang, L.-Y., Tung, H.-J., Ku, F.-C., Wu, R.-C., Tang, Y.-H., Chang, W.-Y., Jung, S.-M., Wang, C.-C., Lin, C.-T. et al. (2020). Management and clinical outcomes of patients with recurrent/progressive ovarian clear cell carcinoma. Journal of the Formosan Medical Association, 793–804.

Huang, X., Ning, J. & Wahed, A. S. (2014). Optimization of individualized dynamic treatment regimes for recurrent diseases. Statistics in Medicine, 2363–2378.

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer, M. S. et al. (2008). Random survival forests. The Annals of Applied Statistics, 841–860.

Jiang, R., Lu, W., Song, R. & Davidian, M. (2017). On estimation of optimal treatment regimes for maximizing t-year survival probability. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1165–1185.

Kidwell, K. M. (2015). Chapter 2: DTRs and SMARTs: Definitions, designs, and applications. In Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. SIAM, pp. 7–23.

Kosorok, M. R. (2007). Introduction to Empirical Processes and Semiparametric Inference. Springer Science & Business Media.

Kosorok, M. R. & Laber, E. B. (2019). Precision medicine. Annual Review of Statistics and Its Application, 263–286.

Kosorok, M. R. & Moodie, E. E. (2015). Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. SIAM.

Linn, K. A., Laber, E. B. & Stefanski, L. A. (2017). Interactive Q-learning for quantiles. Journal of the American Statistical Association, 638–649.

Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 983–999.

Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 331–355.

Murphy, S. A. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 1073–1097.

Orellana, L., Rotnitzky, A. & Robins, J. M. (2010). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. The International Journal of Biostatistics.

Peto, R. & Peto, J. (1972). Asymptotically efficient rank invariant test procedures. Journal of the Royal Statistical Society: Series A (General), 185–198.

Qian, M. & Murphy, S. A. (2011). Performance guarantees for individualized treatment rules. Annals of Statistics, 1180.

Robertson, J. B. & Uppuluri, V. (1984). A generalized Kaplan–Meier estimator. The Annals of Statistics, 366–371.

Robins, J. M. & Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology. Springer, pp. 297–331.

Robins, J. M., Rotnitzky, A. & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 846–866.

Rubin, D. B. (2005). Bayesian inference for causal effects. Handbook of Statistics, 1–16.

Segal, M. R. (1988). Regression trees for censored data. Biometrics, 35–47.

Simoneau, G., Moodie, E. E., Nijjar, J. S., Platt, R. W. & Investigators, S. E. R. A. I. C. (2019). Estimating optimal dynamic treatment regimes with survival outcomes. Journal of the American Statistical Association, 1–24.

Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.

Tarone, R. E. & Ware, J. (1977). On distribution-free tests for equality of survival distributions. Biometrika, 156–160.

Tsiatis, A. (1975). A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences, 20–22.

Tsiatis, A. A., Davidian, M., Holloway, S. T. & Laber, E. B. (2019). Dynamic Treatment Regimes: Statistical Methods for Precision Medicine. CRC Press.

Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 1228–1242.

Wager, S. & Walther, G. (2015). Adaptive concentration of regression trees, with application to random forests. arXiv preprint arXiv:1503.06388.

Wahed, A. S. & Thall, P. F. (2013). Evaluating joint effects of induction–salvage treatment regimes on overall survival in acute leukaemia. Journal of the Royal Statistical Society: Series C (Applied Statistics), 67–83.

Wahed, A. S. & Tsiatis, A. A. (2006). Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data. Biometrika, 163–177.

Wallace, M. P. & Moodie, E. E. (2015). Doubly-robust dynamic treatment regimen estimation via weighted least squares. Biometrics, 636–644.

Wang, L., Zhou, Y., Song, R. & Sherwood, B. (2018). Quantile-optimal treatment regimes. Journal of the American Statistical Association, 1243–1254.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis.

Xu, Y., Müller, P., Wahed, A. S. & Thall, P. F. (2016). Bayesian nonparametric estimation for dynamic treatment regimes with sequential transition times. Journal of the American Statistical Association, 921–950.

Zhang, Y., Laber, E. B., Davidian, M. & Tsiatis, A. A. (2017). Estimation of optimal treatment regimes using lists. Journal of the American Statistical Association, (just-accepted).

Zhao, Y., Zeng, D., Socinski, M. A. & Kosorok, M. R. (2011). Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics, 1422–1433.

Zhu, R. & Kosorok, M. R. (2012). Recursively imputed survival trees. Journal of the American Statistical Association, 331–340.

Zhu, R., Zeng, D. & Kosorok, M. R. (2015). Reinforcement learning trees. Journal of the American Statistical Association 110.