A Simple and Interpretable Predictive Model for Healthcare
Subhadip Maji, Raghav Bali, Sree Harsha Ankem, Kishore V Ayyadevara
Subhadip Maji, [email protected], Global Solutions, Bangalore, India
Raghav Bali, [email protected], Global Solutions, Bangalore, India
Sree Harsha Ankem, [email protected], Global Solutions, Hyderabad, India
Kishore V Ayyadevara, [email protected], Global Solutions, Hyderabad, India
ABSTRACT
Deep learning based models currently dominate most state-of-the-art solutions for disease prediction. Existing works employ RNNs along with multiple levels of attention mechanisms to provide interpretability. These deep learning models, with trainable parameters running into millions, require huge amounts of compute and data to train and deploy. These requirements are sometimes so huge that they render the usage of such models unfeasible. We address these challenges by developing a simpler yet interpretable non-deep-learning based model for application to EHR data. We model and showcase our work's results on the task of predicting the first occurrence of a diagnosis, often overlooked in existing works. We push the capabilities of a tree based model and come up with a strong baseline for more sophisticated models. Its performance shows an improvement over deep learning based solutions (both with and without the first-occurrence constraint), all the while maintaining interpretability.
KEYWORDS
healthcare, disease prediction, boosted trees, deep learning, interpretable, EHR
INTRODUCTION
Deep learning has taken the world by storm and has become the go-to choice for developing solutions in areas such as image processing [12, 23], text processing [21], and even healthcare [8, 20]. The usage of RNNs (particularly LSTMs [13] and their variants) to model sequential EHR data for disease prediction has been seen in many recent works [14, 18]. Recent advancements in the deep learning space through the use of attention [2] have helped add interpretability to these otherwise sophisticated black-box models. RNNs with attention have also been successfully applied to healthcare in several works [6, 7, 11, 27]. Choi et al. [8], in their paper titled "RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism", used a reversed time attention mechanism to achieve good performance while remaining clinically interpretable for application to Electronic Health Records data. Ma et al. [17] rectified some drawbacks of RETAIN by introducing bidirectional RNNs over the normal RNN based approach to capture both the past and future medical experiences of patients. The research and development in this space is possible due to the adoption and availability of EHR data. Several works have highlighted the positive impact of such predictive works towards improving the quality of healthcare [4, 15, 26].
Deep learning models showcase exemplary performance, at times even outpacing their human counterparts. This boost in performance comes at the cost of compute requirements, training time and volume of training data. In many cases, these costs are prohibitive, both financially and otherwise.
We address these limitations by proposing a simpler yet interpretable tree based predictive model for healthcare. The following were the major motivations behind this work. First and foremost was to test the performance of non-deep-learning approaches. We wanted to develop a competitive baseline which can act as a benchmark for the highly parameterised current deep learning approaches.
Second was to provide a novel way of preparing sequential EHR data for non-deep-learning approaches. Our approach had a significant impact on overall model performance. Third, models such as RETAIN [8] provide an intuitive way of interpreting instance level results. We wanted to apply model agnostic approaches to non-deep-learning models and provide similar levels of interpretability. The final motivation was to provide an efficient and easy to deploy alternative to deep learning models while providing on-par performance and interpretability.
Our model was tested on multiple EHR datasets, each having at least a 24 month historical timeline. We experimented with different datasets and target diseases to ensure generalisable performance metrics. We also cater to first-occurrence prediction, i.e. predicting the first ever incidence of the diagnosis in consideration. This constraint adds additional complexity to the prediction task. Our experiments showcase that our simpler approach improves over deep learning based solutions (both with and without the first-occurrence constraint), all the while maintaining interpretability.
The rest of the paper is organised as follows: section 2 details the Data Preparation step, which had a significant impact on overall model performance. Section 3 describes the overall approach and model choice along with different experiments and their results. We also provide details on the choice of evaluation metric used, and present an interesting comparison with different deep learning based approaches such as RETAIN [8] and Dipole [17]. In section 4 we discuss the need for model interpretability. We also showcase instance level interpretability results using a model agnostic approach called SHAP [16].

Figure 1: Flow Diagram of the overall approach for rare disease prediction on EHR data using XGBoost and any Model Interpreter
We also showcase global interpretability results of our model. Section 5 presents commentary on the effectiveness of our work in this domain and section 6 concludes the paper.
DATA PREPARATION
Data preparation is an important aspect of this work and a major motivation. As mentioned earlier, the aim was to prepare longitudinal EHR data for the first-occurrence prediction task. To capture enough historical traits and variability, similar to the works of Choi et al., we also pick a 24 month historical period for training. For the predictions to be useful and actionable, we use a delta of 3 months between the training period and the first occurrence date of the diagnosis. Table 1 showcases a quick summary of our EHR dataset.
Let us denote a patient as p having a certain history of diagnoses, denoted as H = {t_1, t_2, ..., t_N}, where t_i is one time step or a visit in his/her history. Each time step consists of various diagnosis codes (represented as ICD codes), procedure codes (represented as CPT codes), prescriptions (represented as RX codes) and demographic details of the patient, and can be represented as t_i = {ICD_(1..N), CPT_(1..N), demographics}. Assume that the task is to predict the first occurrence of a disease d. Given that a patient was diagnosed with d at time steps t_n and t_(n+p) (where p > 0), the first occurrence of d for this patient would be time step t_n, which would be our target instance. The response variable is_d (for instance is_diabetic) would be set to 1 for such patients and 0 otherwise (i.e. patients who have never been diagnosed with diabetes).
Using the above mentioned procedure, we prepare the response variables for our population. We prepare a different dataset for each diagnosis. The feature space consists of ICD codes along with demographic attributes like age and gender. To prepare an aggregated feature vector for each patient p, we dissolve the time steps, i.e. the patient vector is represented as:

p = {ICD_1 .. ICD_N, CPT_1 .. CPT_N, RX_1 .. RX_N, age, gender}    (1)

where the value for each ICD_i depends upon the experiment being considered.
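The first-occurrence labelling described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the dates and ICD codes are hypothetical, and the real data also carries CPT/RX codes and demographics.

```python
from datetime import date

# Hypothetical diagnosis history for one patient: (visit_date, set of ICD codes).
history = [
    (date(2018, 3, 1), {"401"}),
    (date(2019, 1, 15), {"250", "401"}),   # first diabetes (250) diagnosis -> t_n
    (date(2019, 8, 20), {"250"}),          # repeat occurrence, ignored for labelling
]

def first_occurrence(history, target_code):
    """Return (label, first_date): label is 1 if target_code ever occurs,
    and first_date is the earliest visit containing it (the target instance t_n)."""
    dates = [d for d, codes in sorted(history) if target_code in codes]
    return (1, dates[0]) if dates else (0, None)

is_d, t_n = first_occurrence(history, "250")
print(is_d, t_n)  # 1 2019-01-15
```

In the actual setup, only the 24 months of history ending 3 months before t_n would then be kept as the training window for this patient.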
We experimented with two different versions. The first experiment involved setting the value of each ICD_i, CPT_i and RX_i to the count of such diagnoses, procedures or medications. So, for a given patient p_i, the feature vector would be referred to as:

p_i = {X_i, y_i} : {(n_i1, n_i2, ....., a_i, g_i, ...), y_i}    (2)

where n_ij is the ICD count value for patient i and ICD_j (similarly for CPT and RX codes); a_i, g_i and y_i are the age, gender and response value respectively for patient i, with a_i ∈ (0, ∞), g_i ∈ {Male, Female} and y_i ∈ {0, 1}. In the second experiment we treated ICD_i as a binary categorical feature. We share details on the model performance of these two different experiments in the following section.

APPROACH
Deep learning models are very effective in a majority of tasks. Before the widespread usage of such models, tree based models [3, 5, 19] were the go-to choice. The major reasons behind the popularity of tree based models were their low bias, robustness against outliers, ease of interpretability and speed of training and inference. These were our motivations as well in deciding upon XGBoost [5] as our model of choice. Being battle-tested in different scenarios such as production use-cases, academics and ML competitions further reinforced our decision. Global interpretability is another factor which contributes to understanding model behaviour. We utilised different model-agnostic instance level feature interpreters like LIME [24] and SHAP [16] as well. These instance level approaches were used to identify contributing features at each patient level. Figure 1 shows our overall approach in a flow diagram.
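The count-based aggregation of Eq. (2), which dissolves the time steps into a single patient vector, can be sketched as below. The visit contents and codes are illustrative stand-ins, not values from the paper's dataset.

```python
from collections import Counter

# Hypothetical mini-history for one patient: each visit (time step) lists the
# ICD / CPT / RX codes recorded at that visit.
visits = [
    {"ICD": ["250", "401"], "CPT": ["99213"], "RX": []},
    {"ICD": ["401"],        "CPT": ["99213"], "RX": ["841"]},
]

def aggregate_patient(visits, age, gender):
    """Dissolve the time steps into one count-based feature vector:
    each ICD/CPT/RX feature holds the number of times that code occurred."""
    counts = Counter()
    for visit in visits:
        for code_type in ("ICD", "CPT", "RX"):
            for code in visit[code_type]:
                counts[f"{code_type}_{code}"] += 1
    features = dict(counts)
    features["age"] = age
    features["gender"] = gender
    return features

print(aggregate_patient(visits, age=54, gender="Female"))
# {'ICD_250': 1, 'ICD_401': 2, 'CPT_99213': 2, 'RX_841': 1, 'age': 54, 'gender': 'Female'}
```

The binary-categorical variant of the second experiment would simply replace each count with `min(count, 1)`.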
Disease incidence is usually very low in observed patient samples. This low incidence leads to issues associated with class imbalance for classification models. Accuracy as a measure is not helpful in such scenarios and leads to models biased towards the majority class. We use Receiver Operating Characteristic - Area Under Curve, or ROC-AUC [10], as our metric of choice. ROC is a probability curve for understanding the performance of the model at different thresholds, while the area under the curve denotes the degree of separability between classes. ROC-AUC is a robust measure for imbalanced datasets. We also measure Recall@K as an additional metric and define it as follows: given the predicted probability scores across all the observations binned into deciles, Recall@30 is defined as the percentage of true cases for which the predicted probability falls in the top 3 deciles.
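The Recall@30 definition above can be implemented directly; a minimal sketch (our own illustration, with toy scores) follows:

```python
import numpy as np

def recall_at_k_deciles(y_true, y_score, k=3):
    """Recall@K deciles: fraction of true positives whose predicted
    probability falls in the top k deciles (top k*10 percent) of all scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Score threshold marking the boundary of the top k deciles.
    threshold = np.percentile(y_score, 100 - 10 * k)
    in_top = y_score >= threshold
    positives = y_true == 1
    return (positives & in_top).sum() / positives.sum()

# Toy check: both positives sit among the highest scores, so Recall@30 = 1.0.
y_true = [0] * 8 + [1, 1]
y_score = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.9, 0.95]
print(recall_at_k_deciles(y_true, y_score, k=3))  # 1.0
```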
We performed various experiments to understand model performance using ROC-AUC as our metric of choice. The aim is to prepare a classifier to identify the first occurrence of a disease in the dataset. Considering our focus is on a classification task (first occurrence of diabetes, heart failure and kidney failure respectively), we split the dataset into three parts: train, validation and test. Our datasets are highly imbalanced: the class imbalance (class 1 : class 0) stands at roughly 12:88 for diabetes, 15:85 for heart failure and 14:86 for kidney failure. Stratified sampling was performed while splitting the dataset into train, validation and test to maintain the class distribution.

Table 1: Summary of EHR Datasets utilised for experiments (Diabetes, Heart Failure, Kidney Failure)

As a first step we fit a logistic regression model on our datasets. This was done to have a baseline in place and understand the relative strength of each of the models. Since this was a binary classification task, we could directly fit a logistic regression model. The models achieved ROC-AUC values of 0.--, 0.754 and 0.731 for diabetes, heart failure and kidney failure respectively. This is quite decent performance given the simplicity of the model. The models where we utilised counts of features rather than binary categorisation achieved better performance throughout. For the rest of this section we will refer to the count based feature set as our primary dataset unless stated otherwise.
Moving ahead with this baseline, the next experiment involved fitting an XGBoost model with default parameters. XGBoost is a tree based boosting algorithm with numerous hyper-parameters such as the learning rate, number of estimators, regularisation parameters and so on. We denote this XGBoost model with default parameters as xgb_def henceforth. The XGBoost models with default settings for diabetes, heart failure and kidney failure resulted in ROC-AUC values of 0.78, 0.837 and 0.823 respectively, which is a good improvement over the logistic regression baseline.
As mentioned earlier, XGBoost has a host of hyper-parameters available for fine-tuning. Since our aim was to try and push the boundaries of non-deep-learning models, it was a logical next step to fine tune the xgb_def model. One of the ways of identifying the right values for each of the hyper-parameters is to perform a greedy search. We did not proceed with the usual grid search due to the sheer size of the hyper-parameter search space; a grid search would have been too time and effort consuming. The greedy search paradigm works as follows:
• Use the xgb_def model as the base for this greedy search.
• Let the rank ordered list of hyper-parameters be denoted as H = {learning_rate, n_estimators, ...., reg_lambda}.
• Rank order the hyper-parameters based on their importance.
• For each h_i ∈ H:
  – Fit XGBoost on the validation dataset to identify the optimal value of h_i, keeping all other hyper-parameters in H constant.
  – Update the optimal value of h_i in H.
The above process helped us achieve optimal values for each of the hyper-parameters in consideration. The optimal parameter values are mentioned in table 2 for reference. The fine-tuned XGBoost for diabetes (denoted as xgb_opt^diabetes) shows an improvement of approximately 6.5% over xgb_def, with an ROC-AUC of approximately 0.83; xgb_opt^heart and xgb_opt^kidney stand at 0.853 and 0.849 respectively.
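The greedy, one-parameter-at-a-time search above can be sketched as below. To keep the sketch self-contained it uses scikit-learn's GradientBoostingClassifier on synthetic imbalanced data; `xgboost.XGBClassifier` drops in the same way. The parameter grids are illustrative, not the grids used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (~88% negatives), stratified train/validation split.
X, y = make_classification(n_samples=800, n_features=20, weights=[0.88],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

search_order = [                       # rank-ordered hyper-parameters H
    ("learning_rate", [0.01, 0.1, 0.3]),
    ("n_estimators", [50, 100, 300]),
    ("max_depth", [2, 3, 4]),
]
best = {}                              # optimal values frozen so far

for name, grid in search_order:
    scores = {}
    for value in grid:
        # Tune h_i while keeping all previously tuned parameters fixed.
        model = GradientBoostingClassifier(random_state=0, **best, **{name: value})
        model.fit(X_tr, y_tr)
        scores[value] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    best[name] = max(scores, key=scores.get)   # freeze h_i at its best value

print(best)
```

This costs one small sweep per parameter instead of the full Cartesian grid, which is why it scales to the nine-parameter search of table 2.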
Works by Choi et al. [8] and Ma et al. [17] utilise complex attention mechanisms to showcase improvements in their results. These works compare their results against weak baselines only, and also overlook experiments concerning first occurrence prediction. We believe this additional constraint is an important one for the models to be useful in real life use cases. We also observed that the performance (across models) tends to improve drastically if this constraint is removed. This makes intuitive sense, as for many diagnoses a repeat occurrence is quite obvious. From a business and healthcare standpoint, it makes sense to predict the first occurrence in order to take any preventive/corrective action in time.
To provide a common framework, competitive baseline and useful constraints, we trained RETAIN [8] and Dipole [17] on our datasets (we used the code from https://github.com/mp2893/retain; among the three attention layers described for Dipole, the General Attention layer worked best), preparing data in the formats expected, and performed hyper-parameter tuning to report the best results on our test dataset. We observed significant improvements in ROC-AUC values for both RETAIN and Dipole as compared to the xgb_def and logistic regression baselines. This was expected, as both these models are highly parameterised and complex implementations. Also, both these works present improvements against logistic regression in their respective papers as well. The surprising aspect was the comparison with our fine-tuned XGBoost model, xgb_opt. Our proposed model shows considerable improvements as compared to RETAIN and Dipole for all three target diseases. The results are showcased in table 3 for reference.
The experiments and results outlined in the previous section utilised ICD codes truncated to 3 characters (ICD3 for short). For instance, diagnosis code 250.31 refers to Diabetes with other coma, type I [juvenile type], not stated as uncontrolled. (In table 2, only those hyper-parameters are listed which changed from their default settings; the nomenclature of the hyper-parameters follows xgboost.XGBClassifier [1].) We truncate the
same to only 250, which refers to the class of diabetes diagnoses. By doing so, we reduce the overall dimensionality of our already sparse feature set.
Even though such a grouping is helpful in reducing the impact of a sparse feature set, it leads to a loss of understanding/interpretability. To enable better and more granular interpretation, we experimented with complete ICD codes, or ICD-Full. (For our experiments, ICD-Full refers to complete codes and ICD3 refers to 3 digit codes only; do not confuse this with the version number of ICD codes.) The dataset was prepared as mentioned in the Data Preparation section, with the only difference being that the feature set consists of ICD-Full while the targets are still ICD3. This was done to ensure we have enough training samples for each class. This new dataset was used to train and tune RETAIN, Dipole and XGBoost for comparison. Similar to the previous experiments, in this case also XGBoost outperformed its more sophisticated competitors on the ROC-AUC metric. Results are shared in table 4 for reference. We attribute the improvement in performance across models to the added granularity in the feature set, all the while maintaining a similar class distribution. The results were cross validated to ensure model stability.

Important Parameters   Default Values   Diabetes   Heart Failure   Kidney Failure   Heart Failure-ICD-Full
learning_rate          0.1              0.01       0.01            0.01             0.01
n_estimators           100              6674       7059            4795             5733
max_depth              3                4          4               5                4
min_child_weight       1                6          1               4                1
gamma                  0                0.4        0               0                0.1
reg_alpha              0                1e-5       6               1                1
reg_lambda             1                100        1               10               1
subsample              1                0.9        0.95            0.8              0.7
colsample_bytree       1                0.9        0.45            0.6              0.6

Table 2: Comparison between default and optimal hyper-parameter values of XGBoost for each target disease

Figure 2: ROC-AUC plot for the fine-tuned XGBoost model (xgb_opt) on EHR data

Table 3: Comparison of results of our proposed method with recent papers (ROC-AUC and Recall@30 for diabetes, heart failure and kidney failure, across RETAIN, Dipole, Logistic Regression, xgb_base, xgb_oneHot and xgb_opt)
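The ICD3 grouping described above amounts to truncating each full code to its 3-character category; a minimal sketch with illustrative ICD-9 codes:

```python
def to_icd3(code: str) -> str:
    """Truncate a full ICD code (ICD-Full) to its 3-character category (ICD3)."""
    return code.replace(".", "")[:3]

# Illustrative ICD-9 codes, not drawn from the paper's dataset.
full_codes = ["250.31", "250.00", "401.9", "V58.69"]
print(sorted({to_icd3(c) for c in full_codes}))  # ['250', '401', 'V58']
```

Note how two distinct diabetes codes (250.31, 250.00) collapse into the single ICD3 feature 250, which is exactly the dimensionality-for-granularity trade-off discussed above.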
INTERPRETABILITY
Interpretability is an important factor when it comes to use cases such as disease prediction. Typically there is a trade-off between model performance and interpretability. Most deep learning models are highly complex and are often treated as black boxes. To overcome these limitations, works by Choi et al. [8] and Ma et al. [17] utilise attention mechanisms.
Since our fine-tuned XGBoost model, xgb_opt, is not a deep learning model, interpretability at the instance level had to be solved in a different way. For global (or dataset level) feature importance, tree based algorithms are a go-to choice. The XGBClassifier [1] also provides similar functionality out of the box. Important features of xgb_opt for first occurrence prediction of diabetes are reported
Table 4: Heart Failure prediction using ICD-Full in the feature set (ROC-AUC for RETAIN, Dipole and xgb_opt)

as ICD_I10 (Hypertension), ICD_R73 (Elevated blood glucose levels), etc., which are in line with factors leading to diabetes. The heart failure task using ICD-Full as the feature set resulted in top features such as ICD_5939 (Unspecified disorder of kidney and ureter) and ICD_7931 (Nonspecific (abnormal) findings on radiological and other examination of lung field). Figure 3 presents the top 5 features in detail for each of the target diseases.
xgb_opt outperforms its deep learning counterparts for the task of first occurrence prediction while remaining globally interpretable. One downside of XGBoost is its inability to provide instance, or in this case patient, level interpretability. To handle this scenario, we leverage a model agnostic approach by Lundberg and Lee [16]. This approach mimics the behaviour outlined in the works of Choi et al.; their work explains the theoretical motivations and workings in detail. To better understand the impact on our work, let us work through an instance of a patient from our test dataset.
Let us consider a randomly sampled patient from our test dataset for diabetes. We use xgb_opt to predict the first occurrence probability of this patient being diabetic. This particular patient turns out to be diabetic with a probability score of 0.979 (the ground truth for this patient was observed to be 1). XGBoost is supported by the SHAP framework out of the box. Upon analysing this particular instance using SHAP, we observe the following for this patient. Features such as age, LAB_4548-4_H (a diagnostic test for Haemoglobin A1c), RX_841 (diabetes testing supplies) and so on have positive SHAP values. Positive SHAP values move the logit value (classification decision) of the classifier from the base value up to approximately 3.85. These are the past diagnoses, events, lab tests or prescriptions which the model uses to arrive at a high probability (0.979) of a positive diagnosis.
One known limitation of XGBoost models as compared to their deep learning counterparts [8][17] is visit-level importance. In the data preparation step, we outlined the fact that the aggregated feature vector does not include the time aspect of a patient's history; we dissolve the time steps while preparing the patient vector p. While sequence to sequence based deep learning models can provide time-step level interpretability, our model does not have such a capability out of the box. To handle this scenario, we present a simple workaround. We first narrow down the top most important features at the patient level. The next step is to identify the visits in which these features were present. We can then mark such visits as important in identifying the final diagnosis and also provide physicians with supplementary information regarding such a decision.

DISCUSSION
Our experiments outline the effective predictive performance of XGBoost based models as compared to their sophisticated deep learning counterparts. Despite having far fewer parameters, our optimised versions were able to outperform attention based architectures such as RETAIN [8] and Dipole [17].
Such a strong baseline can be attributed to two main aspects of our experimental setup. The first and foremost is the domain and its data. We shared our results and corresponding interpretations with medical professionals. The experts were able to verify the results and the interpretations from a random sample of our test sets. They also highlighted the importance of the sequential/longitudinal nature of electronic health records. Though an important factor in a number of diagnoses (such as Alzheimer's), not all diagnoses are time dependent, especially for the target diseases in our experiments. They highlighted the fact that even though past diagnoses impact current and future health states, the time gap is not always an important factor.
This goes hand in hand with our results and the fact that a simpler model outperforms complex ones. This also highlights a gap in the data recording process. Each diagnosis in EHR datasets is associated with a visit to a doctor/medical professional and is not the actual date of incidence. Thus, the time information in an EHR dataset depends upon when a particular person visits a medical facility for diagnosis. This might include delays due to personal factors such as the perceived seriousness of symptoms, access to healthcare, pain tolerance and so on. Such variability between incidence and reporting requires more study and experiments.
The second aspect is the algorithmic standpoint. Despite successful application across various domains and data types (mostly unstructured), deep learning is yet to make a mark when it comes to tabular or structured datasets. Tree based ensembles, especially XGBoost and its variants, dominate this space [22]. Real world datasets are typically high dimensional yet sparse in nature. In other words, they can be represented in a lower dimensional space easily (say a hyperplane). This process is termed unfolding or manifold learning. Tree based boosting algorithms are highly efficient for manifold learning with hyperplane boundaries (a characteristic of tabular datasets) [9]. Another reason behind the better performance of tree based ensembles over their deep learning counterparts is their ease and speed of training. Deep learning models are over-parameterised, and even though they are termed universal function approximators, finding the optimal set of parameters is not a trivial task. They require far more training samples and time as compared to traditional methods [25].

Figure 3: Interpreting Patient Level Predictions. We used SHAP to understand features impacting model prediction probabilities. Here we have 3 randomly chosen patients for diabetes, heart failure and kidney failure prediction tasks.

Figure 4: Interpreting Patient Level Predictions. We used SHAP to understand features impacting model prediction probabilities. Here we have 3 randomly chosen patients for diabetes, heart failure and kidney failure prediction tasks.
CONCLUSION
We presented a simple and interpretable predictive model for disease prediction. Sophisticated and complex deep learning models are the focus of research work in the disease prediction domain. Choi et al. [8] present an attention based approach to prepare an interpretable disease prediction model. Their work and the likes present comparisons with weak baselines, mostly using logistic regression. The focus of this work was to push the capabilities of a tree based non-deep-learning model and come up with a strong baseline for more sophisticated models. We presented a novel data preparation pipeline which is observed to have a positive impact on overall model performance. We used ROC-AUC [10] as our evaluation metric, given the fact that the dataset in consideration is highly skewed. Our work outlined different experiments and a simple algorithm to fine-tune the XGBoost model for performance. We compared the performance of our work with that of RETAIN [8] and Dipole [17]. It was surprising to observe that our fine-tuned model outperformed these deep learning solutions by a good margin, despite the fact that both deep learning implementations were fine-tuned with respect to the dataset in consideration. We also presented strategies to interpret our model at both global and instance levels. The instance level interpretation utilised the SHAP framework by Lundberg and Lee [16]; SHAP values help us understand patient level feature importance. We also discussed the limitation of our model in identifying visit level importance and closed by providing a simple workaround for this known limitation. We leveraged the XGBoost implementation by Chen and Guestrin [5] to prepare our models.
ACKNOWLEDGEMENTS
We would like to thank Vineet Shukla and Saikumar Chintareddy for helpful discussions and inputs to improve the solution, and the whole diagnosis prediction team for their contributions.
REFERENCES
[1] [n.d.]. XGBoost Python Package. https://xgboost.readthedocs.io/en/latest/python/python_api.html
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs.CL]
[3] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32. https://doi.org/10.1023/A:1010933404324
[4] Basit Chaudhry, Jerome Wang, Shinyi Wu, Margaret Maglione, Walter Mojica, Elizabeth Roth, Sally C Morton, and Paul G Shekelle. 2006. Systematic review: impact of health information technology on quality, efficiency, and costs of medical care. Annals of Internal Medicine.
[5] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754 [cs.LG]
[6] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2015. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. JMLR Workshop and Conference Proceedings 56 (2015), 301–318.
[7] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Judith Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. 2016. Multi-layer Representation Learning for Medical Concepts. In KDD '16.
[8] Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. arXiv:1608.05745 [cs.LG]
[9] Antonio Criminisi, Jamie Shotton, and Ender Konukoglu. 2012. Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning.
[10] Jesse Davis and Mark Goadrich. 2006. The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (Pittsburgh, Pennsylvania, USA) (ICML '06). Association for Computing Machinery, New York, NY, USA, 233–240. https://doi.org/10.1145/1143844.1143874
[11] Wei Guo, Wei Ge, Lizhen Cui, Hui Li, and Lanju Kong. 2019. An Interpretable Disease Onset Predictive Model Using Crossover Attention Mechanism From Electronic Health Records. IEEE Access.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015).
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[14] Abhyuday N. Jagannatha and Hong Yu. 2016. Bidirectional RNN for Medical Event Detection in Electronic Health Records. Proceedings of the Conference. Association for Computational Linguistics. North American Chapter. Meeting.
[15] New England Journal of Medicine.
[16] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 [cs.AI]
[17] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. 2017. Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. arXiv:1706.05764 [cs.LG]
[18] G. Maragatham and Shobana Devi. 2019. LSTM Model for Prediction of Heart Failure in Big Data. Journal of Medical Systems 43 (05 2019). https://doi.org/10.1007/s10916-019-1243-3
[19] Dan H. Moore II. 1987. Classification and regression trees, by Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Brooks/Cole Publishing, Monterey, 1984, 358 pages. Cytometry 8, 5 (1987), 534–535. https://doi.org/10.1002/cyto.990080516
[20] Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. 2017. Deepr: A Convolutional Net for Medical Records. IEEE Journal of Biomedical and Health Informatics 21, 1 (Jan 2017), 22–30. https://doi.org/10.1109/jbhi.2016.2633963
[21] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683
[22] shivamb. 2018. Data Science Trends on Kaggle!!
[23] Advances in Neural Information Processing Systems (NeurIPS).
[24] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs.LG]
[25] David H Wolpert and William G Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1, 1 (1997), 67–82.
[26] Cao Xiao, Edward Choi, and Jimeng Sun. 2018. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association 25, 10 (2018), 1419–1428.
[27] Yuan Zhang, Xi Yang, Julie S. Ivy, and Min Chi. 2019. ATTAIN: Attention-based Time-Aware LSTM Networks for Disease Progression Modeling.