An evaluation of machine learning techniques to predict the outcome of children treated for Hodgkin-Lymphoma on the AHOD0031 trial: A report from the Children's Oncology Group
Cédric Beaulac, Jeffrey S. Rosenthal, Qinglin Pei, Debra Friedman, Suzanne Wolden, David Hodgson
January 17, 2020

a Department of Statistical Sciences, University of Toronto, Toronto, Canada; b Department of Statistical Sciences, University of Toronto, Toronto, Canada; c Department of Biostatistics, University of Florida, Gainesville, USA; d Department of Pediatrics, Vanderbilt University, Nashville, USA; e Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, USA; f Department of Radiation Oncology, University of Toronto, Toronto, Canada.

Abstract

In this manuscript we analyze a data set containing information on children with Hodgkin Lymphoma (HL) enrolled on a clinical trial. Treatments received and survival status were collected together with other covariates such as demographics and clinical measurements. Our main task is to explore the potential of machine learning (ML) algorithms in a survival analysis context in order to improve over the Cox Proportional Hazard (CoxPH) model. We discuss the weaknesses of the CoxPH model we would like to improve upon, and then we introduce multiple algorithms, from well-established ones to state-of-the-art models, that address these issues. We then compare every model according to the concordance index and the Brier score. Finally, we produce a series of recommendations, based on our experience, for practitioners who would like to benefit from the recent advances in artificial intelligence.
Keywords: machine learning, case study, survival analysis, Cox proportional hazard, survival trees, neural networks, variational auto-encoders
1 Introduction

There is increasing effort in medical research to apply ML algorithms to improve treatment decisions and predict patient outcomes. In this article, we explore the potential of ML algorithms to predict the outcome of children treated for Hodgkin Lymphoma. Because we want to minimize the side effects of intensive chemotherapy or radiation therapy, a major clinical concern is how, for a given patient, we can select a treatment that eradicates the disease while keeping the intensity of the treatment, and the associated side effects, to a minimum.

In this article we introduce multiple ML algorithms adapted to our needs and compare them with the Cox proportional hazard model. As is the case with many data sets in this field, the response variable, time until death or relapse, was right-censored for patients without events, and the data set is of relatively small size (n = 1712). From a ML perspective, this can be challenging. Many ML techniques are not designed to deal with censored observations, which restricts the techniques we can include in our case study. Another challenge, mentioned above, is that medical data sets are usually smaller than those used in ML applications, and thus we have to carefully select algorithms that could perform well in this context.

We introduce the data set in section 2. In section 3 we introduce the algorithms tested. Then, in section 4 we present our experimental setup and our results. Finally, in section 5, we discuss the results thoroughly, recommend further improvements and introduce open questions.
2 Data

We have a data set of 1,712 patients treated on the Children's Oncology Group trial AHOD0031, the largest randomized trial of pediatric HL ever conducted. Each observation represents a patient suffering from Hodgkin Lymphoma. For every patient, characteristics and symptoms were collected, as well as the treatment received, for a total of 21 predictors. A table describing the predictors is in the appendix. The response is a time-to-event variable recorded in number of days, where we consider events to be either death or relapse. For patients without events, the response variable was right-censored at the time last seen, which is a well-known data structure in survival analysis. This data set and the data collection process are presented in detail by Friedman et al. (2014), who previously analyzed the same data set for other purposes.
3 Models

The Cox Proportional Hazard (CoxPH) model (Cox, 1972) serves as our benchmark model. It is widely used in the medical sciences since it is robust, easy to use and produces highly interpretable results. It is a semi-parametric model that fits the hazard function, which represents the instantaneous rate of occurrence for the event of interest, using a partial likelihood function (Cox, 1975).

The hazard function fitted by the CoxPH model contains two parts: a baseline hazard function of time and a feature component that is a linear function of the predictors. The proportional hazard assumption states that the time component and the feature component of the hazard function are proportional; in other words, the effect of the features is fixed through time. In the CoxPH model, the baseline hazard, which contains the time component, is usually left unspecified, so we cannot use the model directly to compute the hazard or to predict the survival function for a given set of covariates.

The main goal of this analysis is to test whether or not new ML models can outperform the CoxPH model. As ML models have shown great potential in many data analysis applications, it is important to test their potential to improve outcome prediction for cancer patients. We would like our selected models to improve upon at least one of the three following problems that are intrinsic to the CoxPH model. Problem (1): the proportional hazard assumption; we would like models that allow for feature effects to vary through time. Problem (2): the unspecified baseline hazard function; we would like models able to predict the survival function itself. Problem (3): the linear combination of features; we would like models that are able to grasp high orders of interaction between the variables or non-linear combinations of the features.
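To make the role of the partial likelihood concrete, here is a minimal numpy sketch of the negative log partial likelihood for the CoxPH model (Breslow convention, assuming no tied event times; the function name and data layout are illustrative, not those of any package used in this study):

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Cox negative log partial likelihood (no tied event times assumed).

    The hazard is h(t | x) = h0(t) * exp(beta' x); the baseline h0(t)
    cancels out of every risk-set ratio, which is why beta can be
    estimated while h0 stays unspecified.
    """
    order = np.argsort(time)                     # sort subjects by time
    X, event = X[order], np.asarray(event)[order].astype(bool)
    eta = X @ beta                               # linear predictor beta' x_i
    # risk[i] = sum of exp(eta_j) over the risk set {j : t_j >= t_i}
    risk = np.cumsum(np.exp(eta)[::-1])[::-1]
    return -np.sum(eta[event] - np.log(risk[event]))
```

Because the baseline hazard cancels out, minimizing this quantity estimates the coefficients without ever specifying h0(t), which is precisely why the fitted model cannot, by itself, produce a survival function (problem (2)).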
The first model to be tested is a member of the CoxPH family. One way to capture interactions between predictors in linear models, and thus make progress on problem (3), is to include interaction terms. Since typical medical data sets contain few observations and many predictors, including all interactions usually leads to model saturation. To deal with this issue we use a variable selection model. Cox-Net (Simon, Friedman, Hastie, & Tibshirani, 2011) is an extension of the now well-known lasso regression (Hastie, Tibshirani, & Friedman, 2009) implemented in the glmnet package (J. Friedman, Hastie, & Tibshirani, 2010), and is the first model we experiment with. Cox-Net is a lasso regression-style model that shrinks some model coefficients to zero and thus ensures the model is not saturated. The resulting model is as interpretable as the benchmark CoxPH model, but Cox-Net allows us to include all interactions in the base model without losing too many degrees of freedom.

Another approach based on regression models is the Multi-Task Logistic Regression (MTLR). Yu et al. (Yu, Greiner, Lin, & Baracos, 2011) proposed the MTLR model, which quickly became a benchmark in the ML community for survival analysis and has been cited by many authors (Luck, Sylvain, Cardinal, Lodi, & Bengio, 2017; Fotso, 2018; Zhao & Feng, 2019; Jinga et al., 2019). The technique directly models the survival function by combining multiple local logistic regression models while accounting for the dependency between them. By modelling the survival distribution with a sequence of dependent logistic regressions, the model captures time-varying effects of features, and thus the proportional hazard assumption is not needed. The model also grants the ability to predict survival times for individual patients. This model therefore addresses both problems (1) and (2). For our case study, we used the MTLR R-package (Haider, 2019) recently implemented by Haider.
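The core of MTLR can be sketched in a few lines of numpy. Time is discretized into m bins, each with its own parameter vector, and the survival curve is read off the normalized scores of the m + 1 monotone event sequences. This is a simplified sketch: the variable names are ours, and the smoothness penalty of Yu et al. is omitted.

```python
import numpy as np

def mtlr_survival(x, Theta, b):
    """Survival curve from an MTLR-style parameterization (sketch).

    Time is split into m bins; the event-indicator sequence
    y = (y_1, ..., y_m) is 0 before the event bin and 1 from it onward.
    P(y | x) is proportional to exp(sum_j y_j (theta_j' x + b_j)); each
    bin has its own theta_j, so feature effects may vary with time.
    Theta: (m, p) array, b: (m,) array, x: (p,) feature vector.
    """
    scores = Theta @ x + b                     # per-bin logits
    # log-score of the m + 1 legal monotone sequences: the sequence with
    # the event in bin k scores sum_{j >= k} scores[j]; all-zero scores 0
    seq_scores = np.concatenate([np.cumsum(scores[::-1])[::-1], [0.0]])
    probs = np.exp(seq_scores - seq_scores.max())
    probs /= probs.sum()                       # P(event falls in bin k | x)
    suffix = np.cumsum(probs[::-1])[::-1]      # P(event after bin k - 1 | x)
    return suffix[1:]                          # S(t_1 | x), ..., S(t_m | x)
```

With all parameters at zero the m + 1 sequences are equally likely, so the survival curve decreases linearly, which is a convenient sanity check.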
Decision trees (Breiman, Friedman, Olshen, & Stone, 1984) and random forests (Breiman, 1996, 2001) are known for their ability to detect and naturally incorporate high degrees of interaction among the predictors, which is helpful for problem (3). This family of models is well-established and makes very few assumptions about the data set, making it a natural choice for our case study.

Multiple adaptations of decision trees have been suggested for survival analysis and are commonly referred to as survival trees. The idea suggested by many authors is to modify the splitting criterion of decision trees to accommodate right-censored data. Based on previously published reviews of survival trees (LeBlanc & Crowley, 1995; Bou-Hamad, Larocque, & Ben-Ameur, 2011), we selected four techniques for the case study.

One of the oldest survival tree models implemented in R (R Core Team, 2013) is the Relative Risk Survival Tree (Leblanc & Crowley, 1992). This survival tree algorithm uses most of the architecture established by CART (Breiman et al., 1984) but also borrows ideas from the CoxPH model. The model suggested by LeBlanc et al. assumes proportional hazards and partitions the data to maximize the difference in relative risk between regions. This technique is implemented in the rpart R-package (Therneau, Atkinson, & Ripley, 2017).

We also selected a few ensemble methods. To begin, Hothorn et al. (2004) proposed a technique to aggregate survival decision trees that can produce conditional survival functions, which addresses problem (2). To predict the survival probabilities of a new observation, they use an ensemble of survival trees (Leblanc & Crowley, 1992) to determine a set of observations similar to the one in need of a prediction. They then use this set of observations to generate the Kaplan-Meier estimates for the new one. Their proposed technique is available in the ipred R-package (Peters & Hothorn, 2019). A year later, Hothorn et al.
(2005; 2007) proposed a new ensemble technique able to produce log-survival time estimates instead. We test the implementation available in the party R-package (Hothorn, Hornik, & Zeileis, 2006; Hothorn, Hornik, Strobl, & Zeileis, 2019). Finally, the latest development in random forests for survival analysis is Random Survival Forests (Ishwaran, Kogalur, Blackstone, & Lauer, 2008). This implementation of a random survival forest was shown to be consistent (Ishwaran & Kogalur, 2010) and comes with high-dimensional variable selection tools (Ishwaran, Kogalur, Gorodeski, Minn, & Lauer, 2010). The model is implemented in the randomForestSRC R-package (Ishwaran & Kogalur, 2019).
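To illustrate what a censoring-aware splitting criterion looks like, here is a minimal sketch of the two-sample log-rank statistic, a criterion commonly used to score candidate splits in survival trees. This is an illustrative criterion only: it is not the exact splitting rule of the relative-risk tree or of any specific package above.

```python
import numpy as np

def logrank_statistic(time, event, group):
    """Two-sample log-rank statistic (sketch).

    group[i] in {0, 1} indicates the side of a candidate split; a larger
    statistic means the split separates survival experience better, so a
    tree-growing procedure would pick the split maximizing it.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    group = np.asarray(group)
    O = E = V = 0.0
    for t in np.unique(time[event == 1]):        # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                        # total at risk at t
        n1 = (at_risk & (group == 1)).sum()      # at risk in group 1
        d = ((time == t) & (event == 1)).sum()   # events at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O += d1                                  # observed events, group 1
        E += d * n1 / n                          # expected under no difference
        if n > 1:                                # hypergeometric variance
            V += d * (n1 / n) * (1.0 - n1 / n) * (n - d) / (n - 1)
    return (O - E) ** 2 / V if V > 0 else 0.0
```

A split that sends all early events to one side scores much higher than one that mixes the groups, which is the behavior a survival tree exploits.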
The first state-of-the-art model we experiment with is built upon the most popular model architecture of recent years: deep neural networks. Yu et al.'s (2011) MTLR model inspired many modifications (Luck et al., 2017; Fotso, 2018; Zhao & Feng, 2019; Jinga et al., 2019) that add a deep-learning component to the model. The main purpose is to allow for interactions and non-linear effects of the predictors. For example, Fotso (2018; 2019) suggested an extension of the MTLR where a deep neural network parameterization replaces the linear parameterization, and Luck et al. (2017) proposed a neural network model that produces two outputs: one is the risk and one is the probability of observing an event in a given time bin. Unfortunately, the authors of most of these techniques (Luck et al., 2017; Zhao & Feng, 2019; Jinga et al., 2019) did not provide either their code or a package, which causes serious reproducibility problems and leads to a serious accessibility issue for practitioners. The DeepSurv architecture (Katzman et al., 2018) proposed by Katzman et al. is a direct extension of the CoxPH model where the linear function of the covariates is replaced by a deep neural network. This allows the model to grasp high-order interactions between predictors, therefore addressing problem (3). By allowing for interaction between the covariates and the treatment, the proposed model also provides a treatment recommendation procedure. Finally, the authors provided a Python library available on the first author's GitHub (Katzman, 2017).
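The structural change DeepSurv makes to CoxPH is small and worth seeing explicitly: the linear predictor beta' x is swapped for the output g(x) of a neural network, and the same partial likelihood is then optimized over the network weights. A minimal one-hidden-layer sketch follows; the shapes and names are illustrative and not those of the DeepSurv library.

```python
import numpy as np

def mlp_risk(x, W1, b1, w2, b2):
    """DeepSurv-style log-risk score (sketch).

    The CoxPH linear predictor beta' x is replaced by a small neural
    network g(x); plugging g(x_i) in place of eta_i in the Cox partial
    likelihood lets high-order interactions enter the hazard.
    x: (n, p) features; W1: (p, h); b1: (h,); w2: (h,); b2: scalar.
    """
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden layer
    return h @ w2 + b2                 # scalar log-risk g(x) per subject
```

Training then amounts to minimizing the negative log partial likelihood of CoxPH with these scores as the linear predictor, using automatic differentiation in practice.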
The final model is a latent-variable model based on the Variational Auto-Encoder (VAE) (Kingma & Welling, 2013; Kingma, 2017) architecture. Louizos et al. (2017) recently suggested a latent-variable model for causal inference. The latent variables allow for a more flexible observed-variable distribution and intuitively model the hidden patient status. Inspired by this model and by the recommendations of Nazábal et al. (2018), we implemented a latent-variable model (Beaulac, Rosenthal, & Hodgson, 2018) that adapts the VAE architecture for the purpose of survival analysis. This Survival Analysis Variational Auto-Encoder (SAVAE) uses the latent space to represent the patient's true sickness status and can produce individual patient survival functions based on the respective covariates, which should address problems (1), (2) and (3).
4 Data analysis
We use two different metrics to evaluate the various algorithms; both are well established and they evaluate different properties of the models. First, the concordance index (Harrell, Lee, & Mark, 1996) is a metric of accuracy for the ordering of the predicted survival times or hazards. Second, the Brier score (Graf, Schmoor, Sauerbrei, & Schumacher, 1999) is a metric similar to the mean squared error but adapted for right-censored observations.
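Both metrics can be computed directly from right-censored data. As a concrete point of reference, here is a minimal sketch of the concordance index; the usable-pair and tie conventions below are one common choice, not necessarily those of the cited implementations:

```python
def concordance_index(pred_time, time, event):
    """c-index: fraction of usable pairs ordered correctly (sketch).

    A pair (i, j) is usable when an order of events can be established:
    subject i has an observed event strictly before time[j] (whether j
    is an event or a censoring). Tied event times are unusable here.
    pred_time[i] is the predicted survival time (larger = later event).
    """
    usable = concordant = 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] and time[i] < time[j]:   # order is establishable
                usable += 1
                if pred_time[i] < pred_time[j]:
                    concordant += 1
                elif pred_time[i] == pred_time[j]:
                    concordant += 0.5            # common tie convention
    return concordant / usable
```

A random predictor scores about 0.5 under this definition, a perfectly ordered predictor scores 1, and a perfectly reversed one scores 0, matching the interpretation given below.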
The concordance index (c-index) was proposed by Harrell et al. (1996). It is one of the most popular performance measures for survival problems (Steck, Krishnapuram, Dehing-oberije, Lambin, & Raykar, 2008; Chen, Kodell, Cheng, & Chen, 2012; Katzman, 2017) as it elegantly accounts for the censored data. It is defined as the proportion of all usable patient pairs in which the predictions and outcomes are concordant. A pair is said to be concordant if the predicted event times have a concordant ordering with the observed event times.

Recently, Steck et al. (2008) used the c-index directly as part of the optimization procedure; their paper also elegantly presents the c-index itself using graphical models, as illustrated in figure 1. In their article it is defined as the fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects that can actually be ordered. We expect a random classification algorithm to achieve a c-index of 0.5. The further the c-index is from 0.5, the more concordant pairs of predictions the model has produced. A c-index of 1 indicates perfect predicted order.

Figure 1: Steck et al. (2008) graphical representation of the c-index computation. Filled circles represent observed points and empty circles represent censored points. This figure illustrates the pairs of points for which an order of events can be established.

Figure 1 illustrates when we can compute the concordance for a pair of data points; this is represented by an arrow. We can evaluate the order of events if both events are observed. If one of the data points is censored, then concordance can be evaluated if the censoring of the censored point happens after the event of the observed point. If the reverse happens, if both points are censored, or if both events happen at exactly the same time, then we cannot evaluate the concordance for that pair.

The Brier score established by Graf et al.
(1999) is a performance metric inspired by the mean squared error (MSE). For a survival model it is reasonable to try to predict P(T > t | X = x) = S(t | X = x), the survival probability at time t for a patient with predictors x. In Graf's notation, \hat{\pi}(t \mid x) is the predicted probability of survival at time t for a patient with characteristics x. These probabilities are used as predictions of the observed event status y = \mathbb{1}(T > t). If the data contain no censoring, the simplest definition of the Brier score would be:

BS(t) = \frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{1}(T_i > t) - \hat{\pi}(t \mid x_i) \right)^2    (1)

Assume we have a censoring survival distribution G(t) = P(C > t) and an associated Kaplan-Meier estimate \hat{G}(t). For a given fixed time t we face three different scenarios:

Case 1: T_i > t and \delta_i = 1 or \delta_i = 0; Case 2: T_i < t and \delta_i = 1; Case 3: T_i < t and \delta_i = 0,

where \delta_i = 1 if the event is observed and \delta_i = 0 if it is censored. For case 1, the event status is 1 since the patient is known to be alive at time t; the resulting contribution to the Brier score is (1 - \hat{\pi}(t \mid x_i))^2. For case 2, the event occurred before t, the event status is \mathbb{1}(T_i > t) = 0, and thus the contribution is (0 - \hat{\pi}(t \mid x_i))^2. Finally, for case 3 the censoring occurred before t and thus the contribution to the Brier score cannot be calculated. To compensate for the loss of information due to censoring, the individual contributions have to be reweighted in a similar way as in the calculation of the Kaplan-Meier estimator, leading to the following Brier score:

BS_c(t) = \frac{1}{n} \sum_{i=1}^{n} \left( (0 - \hat{\pi}(t \mid x_i))^2 \, \mathbb{1}(T_i < t, \delta_i = 1) \, \frac{1}{\hat{G}(T_i)} + (1 - \hat{\pi}(t \mid x_i))^2 \, \mathbb{1}(T_i > t) \, \frac{1}{\hat{G}(t)} \right)    (2)

4.2 Comparative results

The data set introduced in section 2 was imported in both R (R Core Team, 2013) and Python (Van Rossum & Drake Jr, 1995).
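As a point of reference, the censored Brier score of equation (2) can be sketched directly; this is a minimal pure-Python version, and the Kaplan-Meier estimate of the censoring distribution below is a simplified step-function construction:

```python
import numpy as np

def km_censoring(time, event):
    """Kaplan-Meier estimate of the censoring survival function
    G(t) = P(C > t): censoring plays the role of the 'event'."""
    order = np.argsort(time)
    t, d = np.asarray(time)[order], np.asarray(event)[order]
    n, surv, steps = len(t), 1.0, []
    for i in range(n):
        if d[i] == 0:                        # a censoring occurred
            surv *= 1.0 - 1.0 / (n - i)
        steps.append((t[i], surv))
    def G(u):                                # right-continuous step function
        s = 1.0
        for ti, si in steps:
            if ti <= u:
                s = si
        return s
    return G

def brier_score(t, pred_surv, time, event):
    """Censored Brier score BS_c(t) of equation (2) (sketch);
    pred_surv[i] is the model's predicted S(t | x_i)."""
    G = km_censoring(time, event)
    total = 0.0
    for i in range(len(time)):
        if time[i] < t and event[i] == 1:    # case 2: event before t
            total += (0.0 - pred_surv[i]) ** 2 / G(time[i])
        elif time[i] > t:                    # case 1: still at risk at t
            total += (1.0 - pred_surv[i]) ** 2 / G(t)
        # case 3 (censored before t) contributes nothing
    return total / len(time)
```

With no censoring, \hat{G} is identically 1 and the expression collapses to the uncensored Brier score of equation (1), which makes a convenient check.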
To evaluate the algorithms, we randomly divided the data set into 1500 training observations and 212 testing observations. The models were fit using the training observations and the evaluation metrics were computed on the testing observations.

As mentioned in the previous sections, the CoxPH benchmark and the conventional statistical learning models were all tested in the R language (R Core Team, 2013). They were relatively easy to use, with very little adjustment needed and clear and concise documentation. The computational speed of these algorithms was fast enough on a single CPU that we could perform 50 trials. The state-of-the-art techniques required a deeper understanding of the models, as they contain many hyper-parameters that require calibration. They were also slower to run on a single CPU.
Figure 2: Boxplots and Sinaplots of the c-index (higher is better).

Figure 2 illustrates Sinaplots (Sidiropoulos, Sohi, Rapin, & Bagger, 2017) with associated Boxplots of the c-index for the CoxPH model and the 8 competitors. We used standard boxplots in the background since they are common and easy to understand. The sinaplots superposed on them represent the actual observed metric values and convey information about the distribution of the metrics for a given technique. As mentioned earlier, the c-index ranges from 0.5 to 1, where a c-index of 1 indicates perfect predicted order. According to figure 2, no model clearly outperforms another. Random Survival Forests appears to be the best-performing model, with relatively small variance and high performance, but the difference is not statistically significant.

Figure 3: Boxplots and Sinaplots of Brier scores evaluated at 3 years (lower is better).

Since the Brier score is a metric inspired by the mean squared error, it ranges from 0 to 1, and the lower the Brier score, the better the technique. In figure 3 we once again observe that none of the new techniques significantly outperforms the CoxPH model. SAVAE has the lowest Brier score but the difference is not significant when compared to other techniques.
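The evaluation protocol behind figures 2 and 3 (repeated random 1500/212 splits, one metric value per trial) can be sketched generically; `fit` and `score` below are placeholders for any of the models and metrics discussed, not functions from any cited package:

```python
import random

def repeated_holdout(data, fit, score, n_train=1500, trials=50, seed=0):
    """Repeated random train/test evaluation (sketch).

    fit(train) -> model and score(model, test) -> float stand in for
    any survival model and either metric; returns one score per trial,
    the values summarized by the boxplots/sinaplots.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        shuffled = data[:]                   # copy, then shuffle the copy
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        results.append(score(fit(train), test))
    return results
```

Because the same random splits can be reused for every model (via the seed), the paired per-trial values support the side-by-side comparison shown in the figures.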
5 Discussion

The previous section shows that the new ML methods offer very little improvement over the benchmark CoxPH model, according to our two designated performance metrics, when patient clinical characteristics that are typically collected in clinical trials are used as predictor variables. This is an important result, as we need to evaluate the abilities of ML techniques to solve real-life data problems, and to illuminate the changes in clinical data collection that will have to occur for ML methods to be used to greatest effect in assisting outcome prediction and treatment.

Similar results on real-life data sets are observed in articles presenting new methodologies (Fotso, 2018; Luck et al., 2017; Jinga et al., 2019), where the proposed techniques provide non-significant improvements over simple models such as CoxPH. Christodoulou et al. (2019) recently performed an exhaustive review of 927 articles that discuss the development of diagnostic or prognostic clinical prediction models for binary outcomes based on clinical data. The authors of the review noted the overall poor comparison methodologies and the lack of significant difference between a simple logistic regression and state-of-the-art ML techniques in most recent publications. These results are supported by Hand (2006), who discussed in detail the potential strengths of simple models compared to state-of-the-art ML models. This raises an important question our case study highlights: is it worth using more complex models for a slight improvement?

The alternatives we proposed in section 3 are all more complicated than CoxPH in various ways. Most of the new techniques require deeper knowledge of the algorithms' behavior to correctly set the many hyper-parameters. They can produce less interpretable results due to model complexity, and often require more computing power.
Indeed, while the CoxPH model can be fit in seconds, most of the conventional statistical learning models take minutes to fit and the state-of-the-art models take hours. Finally, many of the new techniques are not widely accessible or standardized. As an open language, Python offers very little support to users, and the libraries are often not maintained, not standardized, and come with dependency issues.

Hand (2006) demonstrates the high relative performance of extremely simple methods compared to complex ones and mathematically justifies his argument. He also discusses how these slight improvements over simple models might be undesirable, as they might be attributable to overfitting, which would cause reproducibility issues on new data sets. These slight improvements might also be artificial: they may have been achieved only because the inventors of these techniques were able, through much effort, to obtain the best performance from their own techniques and not from the methods described by others. Overall, if the improvements over simple techniques are small, perhaps they are simply not an improvement, and this argument seems to be supported by both our case study and the recent review of Christodoulou et al. (2019). We recommend that practitioners keep their expectations low when it comes to some of these new models.

In contrast, significant improvements for diagnostic tasks have been accomplished using A.I. in recent years (Liu et al., 2017; Rodríguez-Ruiz et al., 2019), and thus we ask ourselves what caused this difference. There is a major difference in the style of data sets that were available. In the cited articles, images (mammographic images, gigapixel pathology images, MRI scans) are analyzed using deep convolutional neural networks (CNN) (Goodfellow, Bengio, & Courville, 2016). Models such as CNNs were developed because a special type of data was available and none of the current tools were equipped to analyze it.
Conventional techniques such as logistic regression or CoxPH are not able to grasp the signal in images, which contain a large number of highly correlated predictors that individually carry close to no information but, analyzed together, carry a great deal. As a matter of fact, the greatest strength of these models is that they are able to extract a great deal of information from a rich, but complicated, data set.

In our case study, the stratum predictor was a binary predictor indicating whether the patient had a rapid early response to the first rounds of chemotherapy. Computed tomography (CT) scans of the affected regions were analyzed before and after the first round of treatments, and this rich information was transformed into a simple binary variable. This practice is common: even in ongoing trials, patients' characteristics continue to be collected manually (often on paper forms), which dramatically limits the capacity to capture the full range of potentially useful data available for analysis. As new tools are established to extract information from data sets ever growing in both size and complexity, clinical trialists have to rethink how they gather and transform data to make sure that no information is lost, in order to utilize these new tools. Extracting and keeping as much information as possible, and taking a data-centric approach where the model is designed to analyze a specific style of data, appear to have been some of the factors in the success of CNNs.
6 Conclusion

In this article, we have identified a series of statistical and ML techniques that should alleviate some of the flaws of the well-known CoxPH model. These models were tested against a real-life data set and provided little to no improvement according to the c-index and the Brier score. Although one might anticipate that these techniques would have increased our prediction abilities, the CoxPH model instead performed comparably to modern models. These results are supported by other articles with similar findings.

It would be advantageous to try to understand theoretically when the new techniques should work and when they should not. As it currently stands, authors are not incentivized to discuss the weaknesses of their techniques, which actually slows scientific progress. It is imperative that we try to understand when some of the newest techniques perform poorly and shed light on why that is the case. It is also important to understand what made some of these new techniques successful. For example, it seems that CNNs were successful because the model was specifically built for images, a special type of data that was previously hard to handle but contained a large amount of information.

Funding details

Research reported in this work was supported by the Children's Oncology Group; by the National Cancer Institute of the National Institutes of Health under the National Clinical Trials Network (NCTN) Operations Center Grant U10CA180886 and the NCTN Statistics and Data Center Grant U10CA180899; by the St. Baldrick's Foundation; by the Natural Sciences and Engineering Research Council of Canada; and by the Ontario Graduate Scholarships.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the Children's Oncology Group, the National Institutes of Health, or the St. Baldrick's Foundation.
Disclosure statement
No potential conflict of interest was reported by the authors.

Appendix
Variable                 Type         Description
agedxyrs                 Continuous   Age of the patient at the start of the treatment
gender                   Binary       Biological gender
stage                    Categorical  Cancer stage ranging from 1 to 4
b_symptoms               Binary       Presence of B symptoms
bulk_disease             Binary       Presence of bulk disease
extralymphatic_disease   Binary       Presence of extralymphatic disease
fever                    Binary       Presence of recurrent fever
night_sweats             Binary       Presence of night sweats
weight_loss              Binary       Presence of significant weight loss (> 10%)
nodal_aggregate          Binary       Presence of a nodal aggregate
mediastinal_mass         Binary       Presence of a mediastinal mass
esron                    Continuous   Erythrocyte sedimentation rate (mm/hr)
istnon                   Continuous   Number of involved nodal sites
histology                Categorical  Histology (LP, LD, NS, MC, unknown)
albon                    Continuous   Albumin (g/dL)
hgbon                    Continuous   Hemoglobin (g/dL)
amend                    Binary
stratum                  Binary       Rapid early response to first treatment
morpho_icdo              Categorical  ICD-O morphology codes
RT                       Binary       Treatment variable: radiotherapy
DECA                     Binary       Treatment variable: intensive chemotherapy

Table 1: Predictor variables and description

References
Beaulac, C., Rosenthal, J. S., & Hodgson, D. (2018). A deep latent-variable model application to select treatment intensity in survival analysis. Proceedings of the Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.

Bou-Hamad, I., Larocque, D., & Ben-Ameur, H. (2011). A review of survival trees. Statistics Surveys, 5, 44-71. doi: 10.1214/09-SS047

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140. doi: 10.1007/BF00058655

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. doi: 10.1023/A:1010933404324

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks.

Chen, H.-C., Kodell, R. L., Cheng, K. F., & Chen, J. J. (2012). Assessment of performance of survival prediction models for cancer prognosis. BMC Medical Research Methodology, 12(1), 102. doi: 10.1186/1471-2288-12-102

Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Calster, B. V. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12-22. doi: 10.1016/j.jclinepi.2019.02.004

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological), 34(2), 187-220.

Cox, D. R. (1975). Partial likelihood. Biometrika, 62(2), 269-276. doi: 10.1093/biomet/62.2.269

Fotso, S. (2018). Deep neural networks for survival analysis based on a multi-task framework. arXiv e-prints, arXiv:1801.05512.

Fotso, S., et al. (2019). PySurvival: Open source package for survival analysis modeling.

Friedman, D. L., Chen, L., Wolden, S., Buxton, A., McCarten, K., FitzGerald, T. J., ... Schwartz, C. L. (2014). Dose-intensive response-based chemotherapy and radiation therapy for children and adolescents with newly diagnosed intermediate-risk Hodgkin lymphoma: A report from the Children's Oncology Group study AHOD0031. Journal of Clinical Oncology, 32(32), 3651-3658. doi: 10.1200/JCO.2013.52.5410 (PMID: 25311218)

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

Graf, E., Schmoor, C., Sauerbrei, W., & Schumacher, M. (1999). Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17-18), 2529-2545.

Haider, H. (2019). MTLR: Survival prediction with multi-task logistic regression [Computer software manual]. https://CRAN.R-project.org/package=MTLR (R package version 0.2.1)

Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1-14. doi: 10.1214/088342306000000060

Harrell, F. E., Lee, K. L., & Mark, D. B. (1996). Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4), 361-387.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). Springer.

Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A., & Van Der Laan, M. J. (2005). Survival ensembles. Biostatistics, 7(3), 355-373. doi: 10.1093/biostatistics/kxj011

Hothorn, T., Hornik, K., Strobl, C., & Zeileis, A. (2019). party: A laboratory for recursive partytioning [Computer software manual]. https://cran.r-project.org/web/packages/party/index.html (R package version 1.3-3)

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. doi: 10.1198/106186006X133933

Hothorn, T., Lausen, B., Benner, A., & Radespiel-Tröger, M. (2004). Bagging survival trees. Statistics in Medicine, 23(1), 77-91. doi: 10.1002/sim.1593

Ishwaran, H., & Kogalur, U. (2019). Fast unified random forests for survival, regression, and classification (RF-SRC) [Computer software manual]. https://cran.r-project.org/package=randomForestSRC (R package version 2.9.1)

Ishwaran, H., & Kogalur, U. B. (2010). Consistency of random survival forests. Statistics & Probability Letters, 80(13), 1056-1064. doi: 10.1016/j.spl.2010.02.020

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. Annals of Applied Statistics, 2(3), 841-860. doi: 10.1214/08-AOAS169

Ishwaran, H., Kogalur, U. B., Gorodeski, E. Z., Minn, A. J., & Lauer, M. S. (2010). High-dimensional variable selection for survival data. Journal of the American Statistical Association, 105(489), 205-217. doi: 10.1198/jasa.2009.tm08622

Jinga, B., Zhangh, T., Wanga, Z., Jina, Y., Liua, K., Qiua, W., ... Lia, C. (2019). A deep survival analysis method based on ranking. Artificial Intelligence in Medicine, 98, 1-9. doi: 10.1016/j.artmed.2019.06.001

Katzman, J. (2017). DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. https://github.com/jaredleekatzman/DeepSurv

Katzman, J., Shaham, U., Cloninger, A., Bates, J., Jiang, T., & Kluger, Y. (2018). DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18. doi: 10.1186/s12874-018-0482-1

Kingma, D. P. (2017). Variational inference & deep learning: A new synthesis (Unpublished doctoral dissertation). Universiteit van Amsterdam.

Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv e-prints, arXiv:1312.6114.

LeBlanc, M., & Crowley, J. (1995). A review of tree-based prognostic models. Recent Advances in Clinical Trial Design and Analysis, 113-124.

Leblanc, M. E., & Crowley, J. P. (1992). Relative risk trees for censored survival data. Biometrics, 48(2), 411-425.

Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A., ... Stumpe, M. C. (2017). Detecting cancer metastases on gigapixel pathology images. arXiv e-prints, arXiv:1703.02442.

Louizos, C., Shalit, U., Mooij, J., Sontag, D., Zemel, R., & Welling, M. (2017). Causal effect inference with deep latent-variable models. arXiv e-prints, arXiv:1705.08821.

Luck, M., Sylvain, T., Cardinal, H., Lodi, A., & Bengio, Y. (2017). Deep learning for patient-specific kidney graft survival analysis. CoRR, abs/1705.10245. http://arxiv.org/abs/1705.10245

Nazábal, A., Olmos, P. M., Ghahramani, Z., & Valera, I. (2018). Handling incomplete heterogeneous data using VAEs. arXiv, abs/1807.03653.

Peters, A., & Hothorn, T. (2019). ipred: Improved predictors [Computer software manual]. https://CRAN.R-project.org/package=ipred (R package version 0.9-9)

R Core Team. (2013). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria.

Rodríguez-Ruiz, A., Krupinski, E., Mordang, J.-J., Schilling, K., Heywang-Köbrunner, S. H., Sechopoulos, I., & Mann, R. M. (2019). Detection of breast cancer with mammography: Effect of an artificial intelligence support system.
Radiology , (2), 305-314. Retrievedfrom https://doi.org/10.1148/radiol.2018181371 (PMID: 30457482) doi:10.1148/radiol.2018181371Rodriguez-Ruiz, A., LÃˇeng, K., Gubern-Merida, A., Broeders, M., Gennaro, G., Clauser, P., . . .Sechopoulos, I. (2019, 03). Stand-Alone Artificial Intelligence for Breast Cancer Detectionin Mammography: Comparison With 101 Radiologists. JNCI: Journal of the National Can-cer Institute , (9), 916-922. Retrieved from https://doi.org/10.1093/jnci/djy222 doi: 10.1093/jnci/djy222Sidiropoulos, N., Sohi, S. H., Rapin, N., & Bagger, F. O. (2017). sinaplot: an en-hanced chart for simple and truthful representation of single observations over multi-ple classes. Retrieved from https://cran.r-project.org/web/packages/sinaplot/vignettes/SinaPlot.html
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization paths for cox’sproportional hazards model via coordinate descent.
Journal of Statistical Software, Articles , (5), 1–13. Retrieved from doi: 10.18637/jss.v039.i05Steck, H., Krishnapuram, B., Dehing-oberije, C., Lambin, P., & Raykar, V. C.(2008). On ranking in survival analysis: Bounds on the concordance in-dex. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances n neural information processing systems 20 (pp. 1209–1216). Curran Associates,Inc. Retrieved from http://papers.nips.cc/paper/3375-on-ranking-in-survival-analysis-bounds-on-the-concordance-index.pdf Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variableimportance measures: Illustrations, sources and a solution.
BMC Bioinformatics , (1), 25.Retrieved from http://dx.doi.org/10.1186/1471-2105-8-25 doi: 10.1186/1471-2105-8-25Therneau, T., Atkinson, B., & Ripley, B. (2017). rpart: Recursive partitioning and regressiontrees [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=rpart (R package version 4.1-11)Van Rossum, G., & Drake Jr, F. L. (1995). Python tutorial . Centrum voor Wiskunde en InformaticaAmsterdam, The Netherlands.Yu, C.-N., Greiner, R., Lin, H.-C., & Baracos, V. (2011). Learning patient-specificcancer survival distributions as a sequence of dependent regressors. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.),
Ad-vances in neural information processing systems 24 (pp. 1845–1853). Cur-ran Associates, Inc. Retrieved from http://papers.nips.cc/paper/4210-learning-patient-specific-cancer-survival-distributions-as-a-sequence-of-dependent-regressors.pdf
Zhao, L., & Feng, D. (2019, Aug). DNNSurv: Deep Neural Networks for Survival Analysis UsingPseudo Values. arXiv e-printsarXiv e-prints