An Interpretable Intensive Care Unit Mortality Risk Calculator
Eugene T. Y. Ang, Milashini Nambiar, Yong Sheng Soh, Vincent Y. F. Tan
Abstract — Mortality risk is a major concern to patients who have just been discharged from the intensive care unit (ICU). Many studies have been directed at constructing machine learning models to predict such risk. Although these models are highly accurate, they are less amenable to interpretation, and clinicians are typically unable to gain further insights into the patients' health conditions and the underlying factors that influence their mortality risk. In this paper, we use patients' profiles extracted from the MIMIC-III clinical database to construct risk calculators based on different machine learning techniques such as logistic regression, decision trees, random forests and multilayer perceptrons. We perform an extensive benchmarking study that compares the most salient features as predicted by the various methods. We observe a high degree of agreement across the considered machine learning methods; in particular, the cardiac surgery recovery unit, age, and blood urea nitrogen levels are commonly predicted to be the most salient features for determining patients' mortality risks. Our work has the potential to help clinicians interpret risk predictions.
I. INTRODUCTION
Risk calculators are tools used by healthcare providers to estimate the probabilities of chronically ill or Intensive Care Unit (ICU) patients experiencing adverse clinical outcomes such as health complications, readmission to hospital, or mortality, and to then identify and screen high-risk patients. In the context of patient mortality, severity scores such as SAPS-II [1] and the sequential organ failure assessment (SOFA) [2] have traditionally been used to predict hospital mortality. These scoring systems predict mortality by feeding a fixed set of predictor variables, including various laboratory measurements and vitals, into simple models such as logistic regression. The simplicity of these models limits their accuracy, and benchmarking studies [3] have found that automated risk calculators based on machine learning, which train supervised learning models on Electronic Health Record (EHR) data, tend to be far more accurate.

However, a drawback of using nonparametric machine learning models is that they tend to be less interpretable than the simple models used in traditional scoring systems, i.e., the reasoning behind their predictions is not always transparent. This paper investigates the problem of building a risk calculator that is not only accurate but interpretable. We focus on the task of predicting the 30-day mortality of ICU patients at discharge, but note that the observations and principles applied to this work are broadly applicable to the tasks of predicting other adverse clinical outcomes using EHR data.

(Eugene T. Y. Ang, Yong Sheng Soh and Vincent Y. F. Tan are with the Department of Mathematics, National University of Singapore. Milashini Nambiar is with the Institute for Infocomm Research, Agency for Science, Technology and Research.)

There is a considerable body of literature on the problem of ICU mortality prediction. Typically, this literature uses data from the first 24 to 48 hours of a patient's stay to predict in-hospital mortality. The 2012 PhysioNet Computing in Cardiology Challenge called for machine learning solutions to address the task of predicting in-hospital mortality for patients who stayed in the ICU for at least 48 hours [4]. The winning team developed a novel Bayesian ensemble learning solution and achieved an AUC of 0.86 [5]. In the present paper, however, we study the task of 30-day mortality prediction at discharge; thus, the accuracies obtained are not directly comparable to the results in these other papers. Another work that investigates risk prediction at discharge is [6], in which the authors use LSTM models to predict the readmission of ICU patients within 30 days of their discharge, based on the last 48 hours of data from their stays.

A different stream of literature has looked at designing risk calculators that are not only accurate but interpretable. A number of papers [7], [8] have proposed attention neural networks, where attention mechanisms are used to identify important features while retaining the accuracy of deep neural networks. Other works [9]–[11] have employed methods such as Shapley values, which are also used in this paper, and local interpretable model-agnostic explanations (LIME) to enhance interpretability.
More specifically, [9] combines Shapley values with XGBoost to predict the mortality of elderly ICU patients, while [10] combines Shapley values with Convolutional Neural Networks [12] to predict ICU mortality.

A key difference between our paper and the literature on interpretable risk calculators is that rather than focusing on enhancing the interpretability of any particular machine learning method, we perform a benchmarking study that compares the reasoning of various methods. We develop a series of risk calculators using Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), k-Nearest Neighbors (k-NN), and Multilayer Perceptron (MLP); see Bishop's book [13] for detailed descriptions of these methods. In Section IV, we extract the most influential set of variables that each calculator relies on to make predictions. We observe that there is a high degree of commonality among these sets of variables, which suggests that all of these calculators rely on a similar set of underlying reasons to arrive at a prediction. From a clinical perspective, our results are reassuring because they suggest that these data-driven methods for performing prediction in clinical settings reach similar conclusions, even if the precise manner in which each method arrives at a decision can be quite different.

In Section V, we show how the set of influential factors derived in Section IV may be used to guide a clinician as to how a prediction was made. However, it is important for clinicians to understand the limitations of each model when using the risk calculator, so as to gain better insights into the patients' health conditions. As LR constructs linear decision boundaries to distinguish patient classes, this model would not be effective if the training data is severely non-linearly separable. Even though the DT model is highly interpretable, slight changes in the training data may result in a completely different tree. Unlike DT, RF is more stable; however, it requires more computational resources to construct and aggregate the various trees. Despite its simplicity, k-NN also requires a significant amount of memory to store all the training data and to compute predictions given a large dataset of patients' profiles. While the MLP can deal with training data that is not linearly separable, it is computationally intensive to train. Clinicians are thus advised to use different models according to their needs.

II. METHODOLOGY
A. Cohort Selection
The data used in this study are extracted from the MIMIC-III Critical Care Database [14], which contains the health records of patients in the Beth Israel Deaconess Medical Center ICU from 2001 to 2012. For patients with multiple admissions, we considered every admission independently. There were a total of 61532 ICU records.
B. Feature Extraction
Our choice of features followed from studies done in the MIMIC-III research community and in MIT Critical Data [15]. In this study, our target variable was a binary flag, which indicated the patient's mortality within 28 days of their discharge from the ICU. For our input features, we selected 4 different data categories, namely, demographics, laboratory measurements, severity scores and vitals.

For demographic features, we extracted the patients' height, weight, age, ethnicity, length of ICU stay, gender and the service unit that the patients were admitted to, namely the Coronary Care Unit (CCU), the Cardiac Surgery Recovery Unit (CSRU), the Medical Intensive Care Unit (MICU), the Surgical Intensive Care Unit (SICU) and the Trauma/Surgical Intensive Care Unit (TSICU). We calculated the patients' BMIs by querying the height and weight measurements that were last taken before the patients were discharged from the ICU.

For laboratory measurement features, we extracted the blood urea nitrogen (BUN), chloride, creatinine, hemoglobin, platelet, potassium, sodium, total carbon dioxide (TotalCO2) levels and white blood cell (WBC) counts of the patients. As these measurements might change due to the treatment the patients received during their ICU stay, we chose to query the measurements that were last taken before the patients were discharged from the ICU.

We also calculated severity scores such as the patients' SOFA score [2] and the estimated Glomerular Filtration Rate (eGFR). The latter was calculated based on the CKD-EPI Creatinine Equation [16].

For vitals features, we extracted the temperature, heart rate, blood oxygen level (SpO2), systolic blood pressure (SysBP), diastolic blood pressure (DiasBP) and mean arterial pressure (MAP). Due to the non-stationary nature of the time series of vital signs, we chose to take the median value of each time series.
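The eGFR feature above is derived from the CKD-EPI creatinine equation [16]. A minimal sketch of that computation is shown below; the constants follow the commonly stated 2009 CKD-EPI form and should be treated as assumptions here (the published equation also includes a race coefficient, omitted in this sketch):

```python
def egfr_ckd_epi(scr_mg_dl, age_years, is_female):
    """Estimated GFR (mL/min/1.73 m^2) via the 2009 CKD-EPI creatinine
    equation; the constants are assumptions based on its published form."""
    kappa = 0.7 if is_female else 0.9    # sex-specific creatinine scale
    alpha = -0.329 if is_female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha    # applies below the sex threshold
            * max(ratio, 1.0) ** -1.209   # applies above the sex threshold
            * 0.993 ** age_years)         # age decay factor
    if is_female:
        egfr *= 1.018
    return egfr
```

A healthy middle-aged male with serum creatinine around 0.9 mg/dL comes out near the normal range of roughly 100 mL/min/1.73 m^2 under this sketch.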
C. Data Processing
Out of the 61532 records, the majority of the entries had missing height and weight features, while a handful had missing features such as MAP, SpO2 and SysBP. Furthermore, there were anomalies in certain features such as age and weight. We define an outlier as a value that is more than 3 standard deviations from the mean. We dropped records that contained these anomalies and those that had missing feature values, except for height, leaving 36330 data entries.

We then treated missing height entries as follows. As is well known, the BMI can be expressed as the following function of the height and weight:

BMI = weight / height^2. (1)

We used ordinary least squares linear regression to regress BMI against weight, imputed the missing BMI entries using the fitted linear relationship

estimated BMI = β_0 + β_1 × weight, (2)

and recovered the height features through Eqn. (1). We found a strong positive correlation between BMI and weight, with a high R^2 value, justifying that our imputation procedure is fairly accurate.

An issue that we faced with the dataset was that the number of positively labelled records (patients who died within 28 days of discharge) represented 7.6% of the total data sample, contributing to a severe class imbalance. One approach to tackling such class imbalance is the Synthetic Minority Oversampling Technique (SMOTE) [17], which oversamples the minority class by creating new examples from the minority class in the training data set. This technique chooses a random minority class instance and finds its k nearest minority class neighbours, based on the Euclidean distance.
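The SMOTE generation step described above can be sketched as follows. This is a pure-Python illustration, not the imblearn implementation; the feature vectors, k and the seed are illustrative assumptions:

```python
import random

def smote_sample(minority, k=5, seed=0):
    """Generate one synthetic minority example as a convex combination of a
    random minority instance and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    x = rng.choice(minority)
    # Euclidean distances from x to every other minority instance
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5, y)
        for y in minority if y is not x
    )
    neighbour = rng.choice([y for _, y in ranked[:k]])
    lam = rng.random()  # interpolation weight in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbour))

# toy minority class in 2-D feature space
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
synthetic = smote_sample(minority, k=2, seed=1)
```

Because the synthetic point is a convex combination of two minority instances, it always lies on the segment between them, which is why it tends to share the characteristics of the minority class.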
In our data processing step, we used the default setting for k, where k = 5, in the imblearn library. As the generated synthetic instance is a convex combination of that instance and one of its randomly chosen neighbours, it is likely to share the characteristics of the minority class due to its proximity (according to Euclidean distance).

As some of our features were categorical, we used the SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) technique [17] to deal with the categorical features in the dataset. SMOTE-NC sets the categorical values of the generated data by choosing the most frequent category among the nearest neighbours during the generation process.

D. Model Selection and Prediction

After splitting our data so that 75% of the original dataset is treated as the training data and the remaining 25% as test data, we applied SMOTE-NC on the training set and trained the models 30 times. The features were used to train LR, DT, RF, k-NN and MLP in order to determine whether a patient would survive 28 days after discharge from the ICU. The parameters of each model were estimated using 5-fold cross-validation (CV). The hyperparameters of each model were optimised through random and grid search; in this optimisation, a separate CV on the training set was performed. The tuned models were tested on the test set, and performance was evaluated using the area under the receiver operating characteristic curve (AUC), test accuracy (ACC) and recall (REC). We then calculated the mean of the performance metrics across the 30 trials and used the standard deviation as the uncertainty bound. All modelling and analyses were performed with Python, and in particular the sklearn library.

III. MODEL PERFORMANCES
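The AUC used for evaluation in this section can be computed directly from test labels and predicted scores. A minimal sketch via the rank (Mann-Whitney U) statistic, standing in for sklearn's roc_auc_score:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) evaluates to 0.75.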
In Table I, we show the performance of our trained classifiers at predicting the mortality of an ICU patient in the test data set. With the exception of the Decision Tree model, all of our trained classifiers attained high AUC and accuracy (ACC) values. LR achieved the best performance in terms of AUC and REC, while the MLP prediction model had the best performance in terms of ACC. Our results suggest that all trained classifiers provide reasonable performance for predicting mortality.

TABLE I
PERFORMANCE ON PREDICTING ICU MORTALITY ON THE TEST DATA SET ACROSS DIFFERENT CLASSIFIERS (ROWS: AUC, ACC, REC, REPORTED AS MEAN ± STANDARD DEVIATION OVER 30 TRIALS; COLUMNS: LR, DT, RF, kNN, MLP)

IV. EXTRACTING INFLUENTIAL FEATURES

Next, we extract the most influential features from each trained classifier, and exhibit the high degree of overlap among the influential features across the different classifiers.
A. Logistic Regression
In LR, the trained classifier predicts mortality according to the following relationship:

P(Y = 1 | x) = 1 / (1 + exp(−(θ^T x + θ_0))). (3)

Here, θ represents the weights of the input features, x represents the patient's normalized data, and θ_0 represents the offset of the decision boundary. In particular, a larger value of θ^T x + θ_0 corresponds to a higher probability of mortality. As such, we can infer the most influential features as those corresponding to the coefficients θ_i that are largest in magnitude. In Table II, we show the features corresponding to the five largest positive coefficients and the five largest negative coefficients.

TABLE II
TOP MOST INFLUENTIAL FEATURES BY LOGISTIC REGRESSION

     Positive Influence          Negative Influence
1    Age         0.586           CSRU       -1.176
2    BUN         0.336           TSICU      -0.444
3    Male        0.325           SICU       -0.296
4    MICU        0.285           BMI        -0.288
5    Heart Rate  0.273           Hemoglobin -0.198
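Eqn. (3) and the coefficient-magnitude ranking behind Table II can be sketched as follows. The coefficients are the illustrative values from Table II, not the full fitted model:

```python
import math

def predict_mortality(theta, theta0, x):
    """P(Y = 1 | x) = 1 / (1 + exp(-(theta^T x + theta0))) -- Eqn. (3)."""
    z = sum(t, * 1).__class__ if False else sum(t * xi for t, xi in zip(theta, x)) + theta0
    return 1.0 / (1.0 + math.exp(-z))

def most_influential(names, theta, top=5):
    """Rank features by coefficient magnitude, as done for Table II."""
    return [n for n, _ in sorted(zip(names, theta), key=lambda p: -abs(p[1]))[:top]]

# illustrative coefficients from Table II
names = ["Age", "BUN", "Male", "CSRU", "Hemoglobin"]
theta = [0.586, 0.336, 0.325, -1.176, -0.198]
```

With normalized inputs at the population mean (all zeros) and θ_0 = 0, the predicted risk is exactly 0.5, and CSRU is ranked the single most influential feature.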
We observed from Table II that all service units except MICU had a negative influence on the mortality risk. We also observed that older patients, male patients, and patients with a higher blood urea nitrogen (BUN) level or heart rate tend to have a greater mortality risk within 28 days after their discharge. These features are possible indicators of poor renal or cardiovascular function. Similarly, patients with a higher BMI or hemoglobin level tend to have a lower mortality risk.

To increase the interpretability of the model, we penalize the logistic regression problem with an ℓ1-norm regularizer, i.e., we solve

min_{θ, θ_0}  Σ_{t=1}^{n} log(1 + exp(−y_t(θ^T x_t + θ_0))) + λ ||θ||_1. (4)

Here, λ > 0 is the regularization parameter, which we tune via CV. After tuning the hyperparameters with 5-fold CV and running the model 30 times, the best estimator achieved good test accuracy, recall and AUC. We annihilated those features which had a coefficient of zero more than half of the time [18]. We observed that several features were annihilated; thus, clinicians could determine the mortality risk from the sparse selection of influential features shown in Table III.

TABLE III
FEATURES WITH NON-ZERO COEFFICIENTS IN L1-PENALISED LR (COEFFICIENTS REPORTED AS MEAN ± STANDARD DEVIATION; AGE, WITH COEFFICIENT 0.494, RANKED FIRST)

B. Decision Tree
To obtain an interpretable model, we restricted the learned DT model to a maximum depth of 5 tiers. One approach to finding out which features are more influential in a DT model is to compute the Gini importance of each feature. This is given as the decrease in Gini impurity weighted by the node probability, where the Gini impurity of a given node is

G(p_1, p_2) = 1 − p_1^2 − p_2^2, (5)

where p_i is the probability of picking a data instance from class i ∈ {1, 2} in that node. Hence, the larger a feature's Gini importance, the more important the feature is [19]. The top 5 features with the highest Gini importance are shown in Table IV.

TABLE IV
TOP 5 FEATURES WITH THE HIGHEST GINI IMPORTANCE

     Feature          Gini Importance
1    Length of stay   0.224773
2    CSRU             0.214003
3    BUN              0.174614
4    Age              0.052457
5    eGFR             0.047148
We observed from Table IV that the length of stay, the service unit CSRU, the blood urea nitrogen level, age and the estimated GFR had the highest Gini importance. This means that most of the decisions on how to split the data were based on these features, and that the samples in the resulting child nodes were relatively pure. Hence, clinicians could obtain a reasonable prediction of the risk of mortality by looking at just this smaller subset of features.
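Eqn. (5) and the node-probability-weighted impurity decrease can be sketched for a single split as follows (the sklearn implementation aggregates this quantity over every node at which a feature is used):

```python
def gini(p1, p2):
    """Gini impurity of a node with class proportions p1, p2 -- Eqn. (5)."""
    return 1.0 - p1 ** 2 - p2 ** 2

def split_importance(n_parent, g_parent, n_left, g_left, n_right, g_right, n_total):
    """Decrease in Gini impurity from one split, weighted by the
    probability of a sample reaching the parent node."""
    decrease = (g_parent
                - (n_left / n_parent) * g_left
                - (n_right / n_parent) * g_right)
    return (n_parent / n_total) * decrease
```

A perfectly balanced node has impurity 0.5, and a split of that node into two pure children contributes the full 0.5 to the splitting feature's importance.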
C. Random Forest
A random forest is an ensemble of decision tree models, and is hence less amenable to interpretation. As such, we compute the Shapley value of each feature to infer its influence. The concept of Shapley values was first introduced in the field of cooperative game theory; it measures the marginal contribution of each feature (player, in the context of game theory) to a subset of features (coalition), averaged across all possible subsets of features [20]. The Shapley value of the i-th feature is given as

φ_i(f) = Σ_{S ⊆ {x_1, …, x_p} \ {x_i}}  [ |S|! (p − |S| − 1)! / p! ] ( f(S ∪ {x_i}) − f(S) ). (6)

Thus φ_i(f) is the contribution of the i-th feature based on f, which computes the output over the feature values in S, a subset of the features used in the model; p is the number of features, and x_i represents the corresponding feature value of the data point to be explained. Intuitively, a feature has a higher Shapley value if the inclusion of this feature in the prediction model results in a greater change in predictions, as compared to the inclusion of other features.

For every data instance, the average marginal contribution of each feature is given by its Shapley value. To determine the overall influence of a feature, we can compute the SHAP feature importance by averaging the absolute Shapley values of that feature across all data instances [21].

Note that the Shapley value requires us to sum over all possible subsets of features. This becomes intractable if the number of features is moderately large. In our numerical implementations, we approximately compute the Shapley value by averaging over a small number of randomly sampled subsets of features. For the RF model, we used the TreeExplainer class from the shap library, which disregards decision paths that involve missing features instead of computing the output over every selection of features [22]. In Fig. 1 we show the SHAP feature importance values for the top 20 features.

Fig. 1. SHAP feature importance for Random Forest
We observed from Fig. 1 that the length of stay had the largest SHAP feature importance and was responsible for changing the predicted absolute ICU mortality risk on average by around 8% (0.08 on the x-axis), regardless of class. The larger the SHAP feature importance, the bigger the feature's average marginal contribution to the risk prediction. We also observed that the length of stay, the service unit CSRU, the blood urea nitrogen level, age and the service unit MICU were the 5 most influential features. In fact, for many instances, clinicians could accurately deduce the final predicted mortality risk from our trained RF model using these 5 features alone, with the remaining features chosen randomly.
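For a small number of features, Eqn. (6) can be evaluated exactly. A sketch with a toy additive value function f, whose per-feature contributions are hypothetical numbers (for an additive f, the Shapley value of each feature recovers exactly its additive contribution):

```python
from itertools import combinations
from math import factorial

def shapley(f, features, i):
    """Exact Shapley value of feature i under value function f -- Eqn. (6)."""
    others = [j for j in features if j != i]
    p = len(features)
    phi = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = frozenset(subset)
            # coalition weight |S|! (p - |S| - 1)! / p!
            weight = factorial(len(s)) * factorial(p - len(s) - 1) / factorial(p)
            phi += weight * (f(s | {i}) - f(s))
    return phi

# toy additive model: output is the sum of per-feature contributions
contrib = {"los": 0.08, "csru": -0.06, "bun": 0.04}
f = lambda s: sum(contrib[j] for j in s)
```

The efficiency property also holds: the Shapley values of all features sum to f(all features) minus f(empty set).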
D. k-Nearest Neighbor
As our implementation of k-NN uses feature weights trained using the RF model, the most influential variables for k-NN are precisely those found using that method. As such, we omit a separate discussion of k-NN.

E. Multilayer Perceptron

The MLP is a black-box model, and hence is less amenable to interpretation. We therefore computed Shapley values using the KernelExplainer class from the shap library to interpret the model [23]. Computing the SHAP feature importance requires averaging the Shapley values of every feature across all data instances; to reduce the computational complexity, we estimated the SHAP feature importance from 500 randomly selected training data instances and repeated the process 30 times. We plot the SHAP feature importance values for MLP in Fig. 2.
Fig. 2. SHAP feature importance for MLP
We observed from Fig. 2 that the service unit CSRU had the largest SHAP feature importance and was responsible for changing the predicted absolute ICU mortality risk on average by around 8%, regardless of class. Similar to the analysis of the RF model, clinicians could accurately deduce the final predicted mortality risk by looking at the 5 most influential features, namely the service unit CSRU, the blood urea nitrogen level, age, the service unit SICU and the length of stay, with the remaining features chosen randomly.
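The SHAP feature importance plotted in Figs. 1 and 2 is the mean absolute Shapley value of each feature across the explained instances; a minimal sketch (the matrix of per-instance Shapley values is an assumed input, e.g. produced by a shap explainer):

```python
def shap_importance(shap_values):
    """Mean |Shapley value| per feature across all explained instances;
    shap_values is a list of per-instance rows, one column per feature."""
    n = len(shap_values)
    return [sum(abs(row[j]) for row in shap_values) / n
            for j in range(len(shap_values[0]))]
```

Signed per-instance values that cancel on average can still carry large importance, which is why the absolute value is taken before averaging.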
F. Comparison of Influential Features
We compare the top few most influential features extracted from the various models, and tabulate them in Table V. We observed a high degree of agreement in the most important features across all models; in particular, the service unit CSRU, age, and blood urea nitrogen levels ranked among the top 5 most influential features across all models. This suggests that these models make predictions with largely similar reasoning.

V. INTERPRETING RISK PREDICTIONS
Next, we apply the influential factors obtained in Section IV to interpret the predictions derived from our risk calculators. We focus on 4 specific test cases.

TABLE V
TOP RANKED INFLUENTIAL FEATURES ACROSS DIFFERENT MODELS

     LR          DT              RF              MLP
1    CSRU        Length of stay  Length of stay  CSRU
2    Age         CSRU            CSRU            BUN
3    TSICU       BUN             BUN             Age
4    BUN         Age             Age             SICU
5    Gender (M)  eGFR            Creatinine      Length of stay
Logistic Regression.
We recall that LR predicts the mortality risk of a patient based on the linear function θ^T x + θ_0. Hence the final prediction is primarily influenced by how far the patient's feature values deviate from the population sample, with their importance weighted by the coefficients (θ, θ_0). As such, a clinician who wishes to understand whether the LR prediction is sound can examine how far the patient's feature values deviate from the population, paying special attention to features whose coefficients have larger magnitude. As an illustration, Table VI shows, for each test patient, the numerical values corresponding to the influential features of our fitted LR model listed in Table III.

TABLE VI
RAW FEATURE VALUES OF THE 4 EXAMPLES (ROWS: SERVICE UNIT, AGE, BUN, GENDER, HEART RATE, SOFA, WBC, LENGTH OF STAY, MAP, SYSBP, TEMPERATURE, HEMOGLOBIN AND BMI, EACH SHOWN AGAINST ITS POPULATION RANGE; THE FOUR PATIENTS WERE ADMITTED TO THE CCU, SICU, MICU AND CSRU RESPECTIVELY)

Decision Tree.
A DT model makes a prediction by answering a sequence of queries, each of which checks whether a particular feature value is above or below a learned threshold. Hence, a clinician who wishes to understand whether the learned DT model makes decisions in a clinically sound fashion can study these queries and, in particular, check whether they can be backed by clinical expertise. The respective branches for all examples are shown in Table VII.

TABLE VII
DT BRANCHES OF THE 4 PATIENTS (EACH COLUMN LISTS THE SEQUENCE OF THRESHOLD QUERIES ANSWERED ALONG A PATIENT'S BRANCH, INVOLVING FEATURES SUCH AS BUN, TSICU, LENGTH OF STAY, CSRU, SpO2, AGE AND HEMOGLOBIN)
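The query tracing behind Table VII can be sketched as follows. The tree layout, features and thresholds are hypothetical, not the fitted model:

```python
# a node is (feature, threshold, left_subtree, right_subtree); a leaf is a
# mortality probability -- hypothetical structure and numbers for illustration
def trace(node, x):
    """Record the sequence of threshold queries a patient's features answer,
    and return the leaf mortality probability they reach."""
    path = []
    while isinstance(node, tuple):
        feat, thresh, left, right = node
        go_left = x[feat] <= thresh
        path.append(f"{feat} {'<=' if go_left else '>'} {thresh}")
        node = left if go_left else right
    return path, node

tree = ("BUN", 30, ("Age", 65, 0.05, 0.20), 0.60)
path, risk = trace(tree, {"BUN": 25, "Age": 70})
```

The returned path is exactly the kind of branch listed per patient in Table VII, and a clinician can sanity-check each query against clinical expertise.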
Random Forest.
As we noted in Section IV-C, RF is an ensemble method, and hence its results are less amenable to direct interpretation compared to LR and DT. As such, we rely on SHAP explanation forces to interpret the predictions provided by the RF model. Broadly speaking, Shapley values can be viewed as "forces" which either increase or decrease the mortality risk. Each Shapley value is represented as an arrow that pushes the prediction probability up or down, and the magnitude of the contribution is shown by the size of the arrow. These forces balance each other out at the actual prediction for the data instance. The SHAP explanation force plots for the examples are shown in Figs. 3 and 4.

Fig. 3. SHAP explanation force plot for the first example patient (RF)

We observed from Figs. 3 and 4 that the baseline probability shown in the plots is the average predicted probability. For the patient in Fig. 3, the marginal risk-decreasing contribution of being in the service unit CCU is the largest, according to the size of the arrow. Although the 3 most influential risk-increasing factors, namely not being in the service unit CSRU, a 3-day ICU stay and an estimated GFR of 19.28 mL/min/1.73m^2, each have a smaller absolute marginal contribution than the service unit CCU, their combined effect outweighs the risk-decreasing factor and thus leads to an overall higher risk prediction.

Fig. 4. SHAP explanation force plot for the second example patient (RF)
For the patient in Fig. 4, the marginal risk-decreasing contribution of being in the service unit CSRU is the largest, according to the size of the arrow. The risk-increasing factors had negligible marginal contributions to the overall risk prediction, whereas risk-decreasing factors such as being in the service unit CSRU, a 1-day ICU stay and a blood urea nitrogen level of 12 mg/dL had large absolute marginal contributions, as observed from the sizes of the arrows. This leads to an exceptionally low overall mortality risk.

k-Nearest Neighbors. The k-NN model makes a prediction based on the outcomes of the nearest neighbors. As the best estimator used 16 neighbors to make a prediction, we show the feature values of one test example and its 16 neighbors in Table VIII. We observed from Table VIII that the majority of the neighbors had a positive label, resulting in a positive prediction for this test example.

TABLE VIII
FEATURE VALUES OF THE TEST EXAMPLE AND ITS 16 NEIGHBORS (NEIGHBORS' NUMERICAL FEATURES ARE SUMMARISED AS MEAN ± STANDARD DEVIATION AND OMITTED BELOW)

Feature          Test example   16 neighbors
Age              73.7
Length of stay   3
SOFA             7
DiasBP           51
Heart Rate       108
MAP              62
SpO2             96
SysBP            100
Temperature      36.4
BUN              28
Chloride         92
Creatinine       2.4
Hemoglobin       8.3
Platelet         266
Potassium        4
Sodium           133
TotalCO2         31
WBC              8.3
BMI              32.7
eGFR             19.3
Gender           F              14 male, 2 female
Service Unit     CCU            14 CCU, 2 MICU
Label            Positive       15 positive, 1 negative
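The majority vote over the neighbors in Table VIII can be sketched as follows, using Euclidean distance as in the text; the feature vectors and cohort are toy assumptions:

```python
def knn_predict(train, query, k):
    """Majority vote among the k nearest neighbours by Euclidean distance;
    train is a list of (feature vector, label) pairs."""
    ranked = sorted(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], query)))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# toy cohort: (feature vector, 28-day mortality label)
train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
         ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1)]
```

A query near the positive cluster inherits the positive label, mirroring how the 15-to-1 positive vote in Table VIII yields a positive prediction.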
Multilayer Perceptron.
Our analysis for MLP is based on examining SHAP explanation force plots, and is similar to our discussion of the RF model. The SHAP explanation force plot for the example is shown in Fig. 5.
Fig. 5. SHAP explanation force plot for the example patient (MLP)
We observed from Fig. 5 that the base value shown in the plot is the average predicted probability. For this patient, the marginal risk-decreasing contribution of age is the largest, according to the size of the arrow. Although risk-decreasing factors such as age and being in the service unit SICU have large marginal contributions, these contributions are counteracted by several less influential risk-increasing factors such as the systolic blood pressure and the total carbon dioxide level. Hence, the MLP model may not be able to provide a definitive prediction of this patient's mortality risk, and clinicians could consider using other models such as DTs or k-NNs to estimate it.

VI. CONCLUSION
In this study, based on patients' profiles extracted from the MIMIC-III clinical database, we developed various risk calculators to predict the 30-day mortality risk of ICU patients at discharge. We have not only determined the common influential features that all calculators rely on to make predictions, but have also developed various methods to interpret the risk predictions. On top of achieving high performance on the test data, these calculators shared a high degree of commonality among their sets of influential features, demonstrating their effectiveness in clinical settings. As this study can be helpful in providing interpretable yet accurate predictions for clinicians, future research could explore causal models. This has the potential to provide clinicians with insights into the causal relationships between these features, and thus to improve patients' health conditions by targeting the relevant independent variables.