MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis
Margherita Rosnati∗, Vincent Fortuin
Department of Computer Science, ETH Zürich, Switzerland
Abstract
With a mortality toll of 5.4 million lives worldwide every year and a healthcare cost of more than 16 billion dollars in the USA alone, sepsis is one of the leading causes of hospital mortality and an increasing concern in the ageing western world. Recently, medical and technological advances have helped redefine the illness criteria of this disease, which is otherwise poorly understood by the medical community. Together with the rise of widely accessible Electronic Health Records, the advances in data mining and complex nonlinear algorithms are a promising avenue for the early detection of sepsis. This work contributes to the research effort in the field of automated sepsis detection with an open-access labelling of the medical MIMIC-III data set. Moreover, we propose MGP-AttTCN: a joint multitask Gaussian Process and attention-based deep learning model for the early prediction of sepsis in an interpretable manner. We show that our model outperforms the current state of the art and present evidence that different labelling heuristics lead to discrepancies in task difficulty.
Introduction

Every year, an estimated 31.5 million people worldwide contract sepsis. With a mortality rate of 17% in its benign state and 26% in its severe state [11], sepsis is one of the leading causes of hospital mortality [40], costing the healthcare system more than 16 billion dollars in the USA alone [1]. Studies have demonstrated that early treatment has a significant positive effect on the survival rate [24, 32]. In particular, Castellanos-Ortega et al. [7] demonstrated that each hour of delay in treating a patient results in a 7.6% increase in mortality.

Current screening methods, such as the Modified Early Warning System (MEWS) and the Systemic Inflammatory Response Syndrome (SIRS) criteria, have been criticised for their lack of specificity, leading to low accuracies and high false alarm rates. In 2015, the Third International Consensus Definitions for Sepsis committee [39, 36, 37] worked towards incorporating medical and technological advances into an up-to-date definition of sepsis, providing scientists with widely acknowledged illness criteria. Together with the rise of Electronic Health Records (EHR), the scientific community is now armed with both the data and the labelling techniques to experiment with novel prediction methods [20, 19, 18, 6, 10], which are already proving effective in increasing survival rates [38] and promising in decreasing costs.

The models developed so far have relied either on interpretable yet simple prediction methods, such as logistic regression [6] and decision-tree-based classifiers [29, 9], or on effective yet black-box methods such as recurrent neural networks [16]. Moreover, the results achieved by different authors are rarely comparable: although most use the MIMIC-III data set, disparities in labelling rules result in highly variable data sets (e.g. [34] have 17,898 septic patients vs.
2,577 for [10]). This work presents an attempt at reconciling interpretability and predictive performance on the sepsis prediction task and makes the following contributions:

• Gold standard for labelling. We provide a gold standard for Sepsis-3 labelling implemented on the MIMIC-III data set.

∗ [email protected]
Figure 1: Sketch of our proposed model architecture.

• Novel interpretable model. We present an explainable and end-to-end trainable model based on Multitask Gaussian Processes and attentive neural networks for the early prediction of sepsis.

• Empirical evaluation. We assess our model on real-world medical data and report superior predictive performance and interpretability compared to previous methods.

An overview of our proposed method is shown in Figure 1.
Medical time series diagnosis
Multiple researchers have tackled the task of predicting sepsis and septic shock. Works on septic shock include explorations of survival models [19] and hidden Markov models [18]. However, these models rely on the assumption that a patient has already developed sepsis and focus on patterns of the patient's further deterioration. Other authors [6, 10, 29, 9] use linear models and decision trees on engineered features to predict sepsis onset, thus failing to capture temporal patterns. More recently, Kam and Kim [23] and Raghu et al. [34] employed recurrent neural networks to better capture time dependencies. Crucially, all these models rely on either averaging or forward imputation of data points to create equidistant inputs. This transformation creates data artefacts and discards relevant uncertainty: in the medical field, the absence of data is a conscious decision made by professionals, implying an underlying belief about the patient's state. Futoma et al. [16] and Moor et al. [31] tackled this issue with Multitask Gaussian Processes (MGPs); however, their models lack the interpretability necessary in the medical field.
Irregularly sampled time series
The most common solution to missing values is forward imputation [6]. [28] utilise forward imputation coupled with a missingness indicator fed into a black-box model. Although this method retains information about data presence, it is not clear how the information is later interpreted by the model, and hence it does not meet our transparency criteria. [17] use MGPs to fit sparse medical data; however, they optimise their model for the data fit and use the parametrisation as input for a classifier, rather than optimising the model for the classification task. Both [15] and [31] use MGPs with end-to-end training, although their temporal covariance function is shared across all variables. Finally, [16] use MGPs with multiple time kernels in a similar fashion to our model, although they infer the number of kernels from hyperparameter tuning rather than from the data itself.
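For concreteness, the forward-imputation-with-indicator scheme discussed above could be sketched as follows. This is a minimal illustration; the function name and the zero default for leading missing values are our own choices, not taken from [28]:

```python
import numpy as np

def forward_impute_with_mask(series):
    """Forward-impute NaNs and return a binary missingness indicator.

    Returns (filled, mask), where mask is 1 where a value was observed
    and 0 where it was imputed. Leading NaNs default to 0 (an assumption
    of this sketch).
    """
    x = np.asarray(series, dtype=float)
    mask = (~np.isnan(x)).astype(float)  # 1 = observed, 0 = missing
    filled = x.copy()
    last = np.nan
    for i in range(len(filled)):
        if np.isnan(filled[i]):
            filled[i] = last  # carry the last observation forward
        else:
            last = filled[i]
    return np.nan_to_num(filled, nan=0.0), mask
```

Feeding both `filled` and `mask` to a classifier preserves the information about data presence, but leaves its interpretation to the black-box model.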
Attention based neural networks
Attention was first introduced in the context of machine translation [2]. Since then, the concept has been used in natural language processing [41, 42] and image analysis [30, 35]. In the same spirit, [33] used attention mechanisms to improve the performance of a time series prediction model. Although their model easily explains the variable importance, its attention mechanism is based on Long Short-Term Memory encodings of the time series. At any given time, such an encoding contains both the information of the current time point and of all previous time points seen by the recurrent model. As such, the time-domain attention does not allow for easy interpretation. More similar to our implementation is the RETAIN model [8], which generates its attention weights through reversed recurrent networks and applies them to a simple embedding of the time series. The model employs recurrent neural networks, which are slower to train and suffer from the vanishing gradient problem. Furthermore, the initial and final embeddings decrease the model's interpretability. Other authors using TCN-based attention include [27], who only attend to time.
Let us first define some notation for the problem at hand. For each patient encounter $p$, several features $y_{p,i,k}$ are recorded at times $t_{p,i,k}$ from admission, where $k \in \{1, \dots, M\}$ is the feature identifier. These features are often vital signs and laboratory results. As such, they are rarely observed at the same times. Hence, we have a sparse matrix representation of observations
$$\begin{pmatrix} y_{p,1,t_1} & \dots & y_{p,1,t_{N_p}} \\ \vdots & \ddots & \vdots \\ y_{p,M,t_1} & \dots & y_{p,M,t_{N_p}} \end{pmatrix} \quad (1)$$
where $N_p$ is the patient's observation period length. We also define static features $s_p = \{s_{p,M+1}, \dots, s_{p,M+Q}\}$ with feature identifiers $k \in \{M+1, \dots, M+Q\}$, corresponding to time-independent quantities such as age, gender and first admission unit. Finally, we define sepsis labels $l_p \in \{0, 1\}$. Given the sparsity of the data, we can define the compact representation of all observed values:
$$\{t_p, y_p, s_p, l_p\} = \big\{ \{t_{p,i,k}, y_{p,i,k}\}_{i \in \{1,\dots,N_p\},\, k \in \{1,\dots,M\}},\ \{s_{p,M+1}, \dots, s_{p,M+Q}\},\ l_p \big\} \quad (2)$$
The goal of the model is, for a given set $\{t_p, y_p, s_p\}$, to predict the label $l_p$. In order to remove clutter, we will from now on drop the patient-specific subscript $p$ from all notation, and the feature subscript $k$ from the time notation, simplifying $t_{p,i,k}$ to $t_i$.

Gaussian processes are commonly known for their ability to generate coherent function fits to a set of irregular samples by modelling the data covariance. As they easily account for uncertainty and do not require homogeneously sampled data, Gaussian processes are an ideal candidate model for dealing with relatively small amounts of medical data. Following [4], we use a Multitask Gaussian Process (MGP) to capture feature correlation, together with Li and Marlin [26]'s end-to-end training framework, in a similar manner to [15].
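As a small illustration of the compact representation in Eq. (2), each patient's observations can be stored as (time, feature, value) triplets and rearranged into the sparse matrix of Eq. (1). The helper and example record below are hypothetical sketches, not the authors' preprocessing code:

```python
import numpy as np

# Hypothetical patient record: (time in hours, feature id, value) triplets,
# the compact form of Eq. (2); feature 0 might be heart rate, feature 1 SpO2.
observations = [(0.5, 0, 80.0), (1.2, 1, 97.0), (6.0, 0, 92.0)]

def to_sparse_matrix(obs, n_features):
    """Arrange irregular (time, feature, value) triplets into the sparse
    feature-by-time matrix of Eq. (1); unobserved cells remain NaN."""
    times = sorted({t for t, _, _ in obs})
    col = {t: j for j, t in enumerate(times)}
    Y = np.full((n_features, len(times)), np.nan)
    for t, k, v in obs:
        Y[k, col[t]] = v
    return np.array(times), Y
```

The NaN cells make the sparsity explicit: most features are unobserved at most observation times.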
Given an hourly spaced time series $\{t'_i\}_{i=-N_p}^{0}$ (where 0 is the time of prediction), the MGP layer produces a set of posterior predictions for each feature, which are then fed into a classification model. We define a patient-independent prior over the true values $\{y_{i,k}\}$ through $\{f_k(t_i)\}$ such that
$$\{y_{i,k}\} \sim \mathcal{N}\big(f_k(t_i), \sigma_k^2\big) \quad (3)$$
$$\big\langle f_k(t_i), f_{k'}(t_j) \big\rangle = \sum_{l \in L} K^{k}_{l}(k, k')\, K^{tt}_{l}(t_i, t_j) \quad (4)$$
where $\{K^{tt}_{l}(t_i, t_j)\}_{l \in L}$ are time point covariances varying in smoothness, $\{K^{k}_{l}(k, k')\}_{l \in L}$ are feature covariances at a given smoothness level, independent of time, and $L$ is the set of smoothness clusters. Over all variables and time points, the multivariate model has covariance
$$\sum_{l \in L} K^{k}_{l} \otimes K^{tt}_{l} + D \otimes I \quad (5)$$
where $D = \mathrm{diag}(\sigma_k^2)$ contains the noise terms associated with each feature and $\otimes$ is the Kronecker product. The posterior over $t' = \{t'_i\}_{i=-N_p}^{0}$ is a multivariate Gaussian with mean and covariance
$$\mu = \Big( \sum_{l \in L} K^{k}_{l} \otimes K^{t't}_{l} \Big) \Big( \sum_{l \in L} K^{k}_{l} \otimes K^{tt}_{l} + D \otimes I \Big)^{-1} y$$
$$\Sigma = \sum_{l \in L} K^{k}_{l} \otimes K^{t't'}_{l} - \Big( \sum_{l \in L} K^{k}_{l} \otimes K^{t't}_{l} \Big) \Big( \sum_{l \in L} K^{k}_{l} \otimes K^{tt}_{l} + D \otimes I \Big)^{-1} \Big( \sum_{l \in L} K^{k}_{l} \otimes K^{tt'}_{l} \Big) \quad (6)$$
In order to approximate the posterior distribution, we then take Monte Carlo samples $y_{MC}$ from $Y_{MGP} \sim \mathcal{N}(\mu, \Sigma)$.

Note that there are two main feature clusters: vital signs (vitals) and laboratory results (labs). Vitals are noisier and sampled more often, whereas labs are more monotone and rarely sampled.
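The posterior computation of Eq. (6) can be sketched as follows for a dense toy case in which every feature is observed at every observation time, so the Kronecker products can be formed directly (the real data is sparse). The function names, OU lengthscale choices and shapes are illustrative, not the authors' code:

```python
import numpy as np

def ou_kernel(t1, t2, lam):
    """Ornstein-Uhlenbeck time kernel, cf. Eq. (7)."""
    return np.exp(-np.abs(t1[:, None] - t2[None, :]) / lam)

def mgp_posterior(t_obs, y_obs, t_grid, K_feat, lams, noise):
    """Posterior mean and covariance of Eq. (6) for a sum-of-Kronecker MGP.

    Toy, dense case: y_obs is (M, T) with every feature observed at every
    t_obs. K_feat is a list of (M, M) feature covariances, one per
    smoothness cluster; lams are the OU lengthscales; noise holds the
    per-feature noise variances forming the diagonal of D.
    """
    K_oo = sum(np.kron(Kf, ou_kernel(t_obs, t_obs, l)) for Kf, l in zip(K_feat, lams))
    K_go = sum(np.kron(Kf, ou_kernel(t_grid, t_obs, l)) for Kf, l in zip(K_feat, lams))
    K_gg = sum(np.kron(Kf, ou_kernel(t_grid, t_grid, l)) for Kf, l in zip(K_feat, lams))
    A = K_oo + np.kron(np.diag(noise), np.eye(len(t_obs)))  # adds the D ⊗ I noise term
    mu = K_go @ np.linalg.solve(A, y_obs.ravel())            # posterior mean
    Sigma = K_gg - K_go @ np.linalg.solve(A, K_go.T)         # posterior covariance
    return mu, Sigma

# Monte Carlo samples y_MC ~ N(mu, Sigma) can then be drawn with
# np.random.default_rng().multivariate_normal(mu, Sigma) and fed to the classifier.
```

In practice one would exploit the Kronecker structure rather than forming the full matrices, but the dense version makes the algebra of Eq. (6) explicit.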
As opposed to [16], we do not treat the number of clusters $L$ as a hyperparameter but set $L = 2$ and define
$$K^{t}_{l}(t_i, t_j) = \exp\Big( -\frac{|t_i - t_j|}{\lambda_l} \Big) \quad (7)$$
as Ornstein-Uhlenbeck (OU) kernels with lengthscales $\lambda_1$ and $\lambda_2$, each representing a cluster smoothness. OU kernels are well suited to capture local variations and do not assume infinite differentiability as squared exponential kernels do. In our case, differentiability implies a level of smoothness which does not apply to medical records and only introduces unnecessary bias. In addition, given the scarce availability of labs, all short lengthscales would be an ill fit to the data. We hence discarded kernels varying over multiple lengthscales, such as the Cauchy and the rational quadratic kernels. The $K^{k}_{l}(k, k')$ are free-form covariance matrices that are learned by gradient descent.

To feed the MGP samples into the classifier, we fix the model time window to $N = 25$ by either zero-padding or truncating the beginning of the time series. We choose to do so at the beginning of the time series in order to align prediction times as the last step of the temporal classification model. Here, we also integrate the static variables by broadcasting them over each time point.

The concept of attention was born in machine translation [2]: given an input sentence embedding $S = \{h_1, \dots, h_{|S|}\}$, the attention mechanism produces weights $\{\alpha_{i,1}, \dots, \alpha_{i,|S|}\}$ such that $\alpha_{ij} \in [0, 1]$ and $\sum_j \alpha_{ij} = 1$, and a context vector $c_i = \sum_j \alpha_{ij} h_j$ used to predict target word $i$. The weights $\alpha_{ij}$ can therefore be interpreted as the importance of the input sentence's $j$-th word for producing the $i$-th word of the translation. More recently, Choi et al. [8] applied attention to clinical time series. Given a time series $\{x_1, \dots, x_T\} \subset \mathbb{R}^r$, the authors first create a time-independent embedding of the data $\{v_1, \dots, v_T\} \subset \mathbb{R}^m$.
They then use reversed recurrent neural networks (RNNs) to create weights $\alpha \in \mathbb{R}^T$ and $\beta \in \mathbb{R}^{T \times m}$, where $\alpha_j \in [0, 1]$ and $\beta_{ij} \in [-1, 1]$, using softmax and tanh activations respectively. The context vectors then take the form $c_i = \sum_{j \leq i} \alpha_j \beta_j \odot v_j$ and are fed into a multilayer perceptron with softmax activation to yield a prediction.

The attention model we devised borrows some ideas from [8]. The interpolated data $y_{MC} \in \mathbb{R}^{N \times (M+Q)}$ is directly fed into two temporal convolutional networks (TCNs) [25], which generate embeddings $z = [z_1, \dots, z_N] \in \mathbb{R}^{N \times M}$ and $z' = [z'_1, \dots, z'_N] \in \mathbb{R}^{N \times M}$. TCNs are a class of neural networks composed of causal convolutions stacked into residual blocks. A causal convolution is a 1D convolutional layer which only takes inputs from the past to generate its output, avoiding any information leakage from the future. Residual blocks are made of two causal convolutional layers together with ReLU activation functions, dropout and L2 regularisation. The residual blocks also include an identity map from the input of the block, added to the output; as we only use up to 12 layers, this last step is omitted in our architecture. TCNs have been shown to outperform RNNs [3], are faster to train and do not suffer from vanishing gradients.
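A causal convolution and a generic residual block as described above can be sketched in a few lines of NumPy. This is an illustrative single-channel toy version; it includes the identity skip (which our architecture omits) and leaves out dropout and L2 regularisation:

```python
import numpy as np

def causal_conv1d(x, w, b=0.0):
    """1D causal convolution: the output at step t uses only x[:t+1].

    x: (T,) input signal; w: (k,) kernel. Left zero-padding guarantees
    that no information leaks in from the future.
    """
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([xp[t:t + k] @ w[::-1] for t in range(len(x))]) + b

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Two causal convolutions with ReLU activations plus an identity skip
    (dropout and regularisation omitted in this sketch)."""
    h = relu(causal_conv1d(x, w1))
    h = relu(causal_conv1d(h, w2))
    return x + h  # identity map added to the block output
```

Stacking such blocks with increasing dilation gives a TCN a large causal receptive field without any recurrence.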
Given these favourable training properties, inverting the time series similarly to [25] also becomes an unnecessary step, which we omit. We generate the attention weights $\alpha$ and $\beta$ as
$$\alpha_{j,0} = \mathrm{softmax}(z_j W_{\alpha,0} + b_{\alpha,0}) \qquad \alpha_{j,1} = \mathrm{softmax}(z_j W_{\alpha,1} + b_{\alpha,1}) \quad (8)$$
$$\beta_{j,0} = \mathrm{sigmoid}(z'_j W_{\beta,0} + b_{\beta,0}) \qquad \beta_{j,1} = \mathrm{sigmoid}(z'_j W_{\beta,1} + b_{\beta,1}) \quad (9)$$
$$W_{\alpha,0}, W_{\alpha,1} \in \mathbb{R}^{M+Q} \qquad b_{\alpha,0}, b_{\alpha,1} \in \mathbb{R} \quad (10)$$
$$W_{\beta,0}, W_{\beta,1} \in \mathbb{R}^{(M+Q) \times (M+Q)} \qquad b_{\beta,0}, b_{\beta,1} \in \mathbb{R}^{M+Q} \quad (11)$$
such that $\alpha = [\alpha_0, \alpha_1] \in \mathbb{R}^{N \times 2}$ and $\beta = [\beta_0, \beta_1] \in \mathbb{R}^{N \times (M+Q) \times 2}$. We then create two context vectors, one for each of the negative and positive label predictions,
$$c_i = \sum_{j \leq i} \alpha_{j,\delta}\, \beta_{j,\delta} \odot y_{MC,j} \in \mathbb{R}^{N \times (M+Q) \times 2}, \qquad \delta \in \{0, 1\} \quad (12)$$
where $y_{MC,j}$ is broadcast to match the dimensionality of $\beta_{j,\delta}$. We then predict the labels as
$$\hat{l}_i = \mathrm{softmax}\Big( \sum_{n}^{N} \sum_{m}^{M+Q} c_{i,nm} \Big) \in [0, 1] \quad (13)$$
(see Appendix C.2 for more information on this design choice) and use the last prediction $\hat{l}_{\mathrm{last}}$ to train the model. This can of course easily be modified to suit any specific use case.

Since the MGP output is directly multiplied by the weights in $c_i$, the classification model can be interpreted as a scoring mechanism where each past point $y_{MC,ij}$ contributes $\alpha_{i,1}\beta_{ij,1}$ to the time series being classified as positive, and $\alpha_{i,0}\beta_{ij,0}$ to the time series being classified as negative. The positive and negative scores are then normalised to represent probabilities of the positive or negative labelling. As we design both $\alpha$ and $\beta$ to be non-negative, we can directly look at the average $\alpha$ and $\beta$ over the Monte Carlo samples to see which time points and features contribute most strongly to the network's positive or negative decision.

Sepsis is defined as a life-threatening organ dysfunction caused by a dysregulated host response to infection [39].
The latter is usually interpreted as the administration of antibiotics coupled with the culture of blood samples, generating a suspicion-of-infection window, whereas the former is interpreted as a two-point increase in the Sequential Organ Failure Assessment (SOFA) score within such a suspected infection window. We make use of the MIMIC-III data set [22] and encode the Sepsis-3 criteria following Johnson and Pollard [21]'s code available on GitHub, with the help of Moor et al. [31]'s code, which the authors have generously provided.

One key difference between our assumptions and Moor et al. [31]'s is the handling of missing SOFA contributor values: if one or more SOFA contributors are missing, Moor et al. do not calculate the total score. We, on the other hand, assume such a contributor to be within a healthy norm, implying a zero contribution. In order to validate our results, we carry out all experiments using both labelling techniques.

We proceed to extract time series of case and control patients for a set of commonly recorded vitals, labs and static variables and normalise their values. Following Moor et al. [31], in order to keep the data set length-balanced, we match the time series lengths of control patients to those of case patients using the class balance ratio. In addition, we create up to seven copies of each time series and truncate the last zero to six hours of data, effectively creating early-prediction patients and augmenting our data set. We remove excessively noisy or computationally intensive data and train the model over different hyperparameters, randomly resampling an equal number of case and control patients to counteract the data set imbalance.
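The two data-processing choices described above, treating missing SOFA contributors as healthy and augmenting by truncating the final hours, can be sketched as follows. These are hypothetical helpers (assuming hourly time steps), not the actual preprocessing pipeline:

```python
def sofa_total(contributors):
    """Total SOFA score under our labelling assumption: a missing
    contributor (None) is treated as within the healthy norm, i.e. it
    contributes 0. `contributors` maps organ system -> sub-score (0-4)
    or None."""
    return sum(v for v in contributors.values() if v is not None)

def augment_by_truncation(series, max_hours=6):
    """Create up to seven early-prediction copies of an hourly series by
    dropping the last 0 to `max_hours` hours of data."""
    n = min(max_hours, len(series) - 1)
    return [series[:len(series) - h] for h in range(n + 1)]
```

Under the alternative assumption of Moor et al. [31], `sofa_total` would instead return no score whenever any contributor is `None`.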
We compare our model's performance to that of the InSight algorithm [6] and of the state-of-the-art MGP-TCN algorithm [31]. Figure 3 shows the predictive performance of the models for different time horizons.

[Figure 2 here: the MGP output and the broadcast static data are multiplied by α and β; the products give feature contribution scores in two channels (positive and negative labels), each time step's contribution summing to one, yielding a score per label in (0, 1).]

Figure 2: Interpretation of the different attention weights in our model.
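To illustrate how the attention weights combine into label scores (cf. the interpretation in Figure 2), the following is a simplified sketch for a single Monte Carlo sample. The shapes, the restriction to the final-step context, and all names are our own simplifications rather than the exact model:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_and_score(y_mc, z, z2, Wa, ba, Wb, bb):
    """Combine attention weights into per-label scores (illustrative).

    y_mc: (N, D) interpolated data, D = M + Q with statics broadcast;
    z, z2: (N, D) TCN embeddings; Wa: (D, 2), ba: (2,);
    Wb: (D, D, 2), bb: (D, 2). Only the final-step context is computed.
    """
    alpha = softmax(z @ Wa + ba, axis=0)  # (N, 2): weights over time, one column per label
    beta = 1.0 / (1.0 + np.exp(-(np.einsum('nd,dek->nek', z2, Wb) + bb)))  # (N, D, 2), in [0, 1]
    # per-feature, per-time contributions alpha * beta * y, summed causally over all steps
    c = (alpha[:, None, :] * beta * y_mc[:, :, None]).sum(axis=0)  # (D, 2)
    scores = c.sum(axis=0)  # one aggregate score per label channel
    return softmax(scores), alpha, beta
```

Because `alpha` and `beta` are non-negative, inspecting them directly (averaged over Monte Carlo samples) reveals which time points and features drive the positive or negative score.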
[Figure 3 here: area under the ROC over time to onset (6 to 0 hours) for InSight, MGP-TCN and MGP-AttTCN; one panel using Moor et al.'s labels, one panel using our labels.]

Figure 3: Performance of different models. It can be seen that our proposed labels are harder to fit than the ones by Moor et al. [31]. Moreover, our proposed model outperforms the baselines on both label sets, especially for earlier prediction horizons.
The first result is the difference in performance of models applied to the different labelling methods. The SOFA contributor assumption Moor et al. make has two main implications. Firstly, it considerably restricts the number of patients. Assuming that sicker patients receive more medical attention, the patients included are likely to be in worse condition than the septic patients excluded, and hence easier to classify. Secondly, it delays sepsis onset. For example, a patient having a severe liver failure with few other recorded vitals, followed by an overall collapse later in time, will have septic onset at the time of the liver failure in our records, whereas it will only be considered septic at the time of the overall collapse in Moor et al.'s labels. On the other hand, the labels we produce reflect the incomplete nature of medical data: even if only a part of all the potentially relevant tests are carried out at any given time, a doctor must be able to assess a patient's well-being and foresee potential complications. The difference in labels implies a discrepancy in task difficulty: Moor et al.'s labels present an easier learning problem, but they define a narrower use case in real-world scenarios.

Indeed, when assessing the performance of the different models on the two different data labellings, it becomes evident that our proposed labels are harder to fit. This means that predicting sepsis in a realistic setting in the intensive care unit is probably much harder than previous work would suggest.
We find that our MGP-AttTCN model has better performance when presented with patients further in time from sepsis onset. In the case of Moor et al.'s labels the difference is clearly noticeable, whereas with our labels it is of lower statistical significance. With our labels, both MGP-TCN and MGP-AttTCN perform more strongly than InSight. The intuition behind this result is the robustness of the models to missing data: both MGP-TCN and MGP-AttTCN account for the data uncertainty and hence perform better on lower-resolution and more irregular data.

On Moor et al.'s labels, the MGP-TCN model does not seem to significantly outperform the InSight model, suggesting that those labels might be easy enough to not require a particularly pronounced robustness to missing data. However, the additional attention of the proposed MGP-AttTCN model does seem to gain a clearer advantage here than on our labels, presumably due to a more complete set of features that can be attended to.

[Figure 4 here: covariance heatmaps over the features (blood pressures, respiration rate, heart rate, pulse oximetry, temperature, and common laboratory values) for time covariance lengthscales 1.9 and 64.4 hours.]
Figure 4: Heatmaps of the learned MGP covariance matrices between the data features for the two different smoothness clusters.
Inspecting the learned covariances (Figure 4), we notice that the two OU lengthscales converged to represent two clusters within the selected variables: a shorter lengthscale (around two hours) represents noisy data, whereas a larger lengthscale (around 64 hours) represents smoother observations. In addition, the feature covariance matrix for the short lengthscale puts more emphasis on vitals, while the one for the long lengthscale puts more emphasis on labs, fitting our initial intuition that vitals vary more rapidly. Graphically, one can observe this by inspecting the diagonals of the covariance heatmaps.

On a more granular level, the two covariance matrices also provide insights about the underlying variables. One can for instance observe that the body temperature (tempc) has a larger variance than the systolic and diastolic blood pressures (sysbp, diabp), following general clinical intuition. Moreover, we can observe correlations between different features, such as a negative correlation between temperature and heart rate, which also coincides with general medical expectation. These covariances can then, for instance, be used by the model to extrapolate a full function from a single INR observation with an inverse correlation to the pulse oximetry observations (Fig. 5).

One important benefit of our model compared to current approaches is its interpretability due to the attention mechanism. Once the samples have been drawn, the weights α and β provide us with more information about the importance of different time points and features for the model's behaviour. The attention weights for an exemplary patient trajectory are depicted in Figure 5. Overall, the absolute values of α are small for points further from the prediction time and increasingly larger closer to it. A good example of this behaviour is the fourth row in Figure 5, where feature importance increases in time.
We can also see there that different features can have opposing effects on the prediction. While the elevated heart rate close to the prediction time increases the likelihood of a sepsis prediction (first column, yellow weights), the lowered prothrombin values reduce this likelihood (third column, blue weights). Interestingly, the low prothrombin values are not actually measured in this example, but predicted by the MGP purely based on the other measured features and the learned covariances.

Finally, α × β × y_MC gives the individual score contribution of each feature at each time point. These weights are shown in the last row of the figure. It can again be seen that the attention weights are generally larger in magnitude closer to the prediction time. Moreover, about half of the features have significant non-zero attention weights, while the others seem not to be important for the prediction in this example.
[Figure 5 here: panels for Heart Rate (var. 4), Pulse oximetry (SpO2, var. 5) and Prothrombin time (INR, var. 17) over the 20 to 0 hours before prediction.]

Figure 5: Visualization of the journey of an exemplary patient trajectory through our proposed model architecture. The raw features (row 1), measured at irregular time points, are interpolated by an MGP (row 2). Samples from the MGP posterior can then be aggregated into means and variances for each feature on a fixed, regularly spaced time grid (row 3). These values are then attended to by the TCN (row 4), where positive attention weights are yellow and negative ones blue. Row 5 shows the attention weights separated by features (x-axis) and time points (y-axis).

These visualizations could be used by doctors to make an informed decision about whether or not to trust the prediction of the model for each given patient, thus facilitating the interpretability and accountability that is crucial in medical applications.
We have shown that current data sets for the early prediction of sepsis underestimate the true difficulty of the problem and proposed a new labelling for the MIMIC-III data set that corresponds more closely to a realistic intensive care setting. Moreover, we have proposed a new machine learning model, MGP-AttTCN, which outperforms the state-of-the-art approaches on the easier labels from the literature as well as on our proposed harder labels. Additionally, our model provides an interpretable attention mechanism that would allow clinicians to make more informed decisions about trusting its predictions on a case-by-case basis.

Potential avenues for future work include a more thorough discussion with clinicians to make our proposed labels even more representative of the real-world task, and architectural improvements, for instance by meta-learning the MGP prior [12], amortizing the latent MGP inference for performance gains [14], or discretizing the latent space for improved interpretability [13].
References

[1] Derek C Angus, Walter T Linde-Zwirble, Jeffrey Lidicker, Gilles Clermont, Joseph Carcillo, and Michael R Pinsky. Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Critical Care Medicine, 29(7):1303–1310, 2001.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[4] Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pages 153–160, 2008.
[5] Jacob Calvert, Jana Hoffman, Christopher Barton, David Shimabukuro, Michael Ries, Uli Chettipally, Yaniv Kerem, Melissa Jay, Samson Mataraso, and Ritankar Das. Cost and mortality impact of an algorithm-driven sepsis prediction system. Journal of Medical Economics, 20(6):646–651, 2017.
[6] Jacob S Calvert, Daniel A Price, Uli K Chettipally, Christopher W Barton, Mitchell D Feldman, Jana L Hoffman, Melissa Jay, and Ritankar Das. A computational approach to early sepsis detection. Computers in Biology and Medicine, 74:69–73, 2016.
[7] Álvaro Castellanos-Ortega, Borja Suberviola, Luis A García-Astudillo, María S Holanda, Fernando Ortiz, Javier Llorca, and Miguel Delgado-Rodríguez. Impact of the Surviving Sepsis Campaign protocols on hospital length of stay and mortality in septic shock patients: results of a three-year follow-up quasi-experimental study. Critical Care Medicine, 38(4):1036–1043, 2010.
[8] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
[9] Ryan J Delahanty, JoAnn Alvarez, Lisa M Flynn, Robert L Sherwin, and Spencer S Jones. Development and evaluation of a machine learning model for the early identification of patients at risk for sepsis. Annals of Emergency Medicine, 73(4):334–344, 2019.
[10] Thomas Desautels, Jacob Calvert, Jana Hoffman, Melissa Jay, Yaniv Kerem, Lisa Shieh, David Shimabukuro, Uli Chettipally, Mitchell D Feldman, Chris Barton, et al. Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach. JMIR Medical Informatics, 4(3), 2016.
[11] Carolin Fleischmann, André Scherag, Neill KJ Adhikari, Christiane S Hartog, Thomas Tsaganos, Peter Schlattmann, Derek C Angus, and Konrad Reinhart. Assessment of global incidence and mortality of hospital-treated sepsis: current estimates and limitations. American Journal of Respiratory and Critical Care Medicine, 193(3):259–272, 2016.
[12] Vincent Fortuin and Gunnar Rätsch. Deep mean functions for meta-learning in Gaussian processes. arXiv preprint arXiv:1901.08098, 2019.
[13] Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. SOM-VAE: Interpretable discrete representation learning on time series. arXiv preprint arXiv:1806.02199, 2018.
[14] Vincent Fortuin, Gunnar Rätsch, and Stephan Mandt. Multivariate time series imputation with variational autoencoders. arXiv preprint arXiv:1907.04155, 2019.
[15] Joseph Futoma, Sanjay Hariharan, and Katherine Heller. Learning to detect sepsis with a multitask Gaussian process RNN classifier. In Proceedings of the 34th International Conference on Machine Learning, pages 1174–1182. JMLR.org, 2017.
[16] Joseph Futoma, Sanjay Hariharan, Mark Sendak, Nathan Brajer, Meredith Clement, Armando Bedoya, Cara O'Brien, and Katherine Heller. An improved multi-output Gaussian process RNN with real-time validation for early sepsis detection. arXiv preprint arXiv:1708.05894, 2017.
[17] Marzyeh Ghassemi, Marco AF Pimentel, Tristan Naumann, Thomas Brennan, David A Clifton, Peter Szolovits, and Mengling Feng. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[18] Shameek Ghosh, Jinyan Li, Longbing Cao, and Kotagiri Ramamohanarao. Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns. Journal of Biomedical Informatics, 66:19–31, 2017.
[19] Katharine E Henry, David N Hager, Peter J Pronovost, and Suchi Saria. A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine, 7(299):299ra122–299ra122, 2015.
[20] Md Mohaimenul Islam, Tahmina Nasrin, Bruno Andreas Walther, Chieh-Chen Wu, Hsuan-Chia Yang, and Yu-Chuan Li. Prediction of sepsis patients using machine learning approach: a meta-analysis. Computer Methods and Programs in Biomedicine, 170:1–9, 2019.
[21] Alistair Johnson and Tom Pollard. sepsis3-mimic, May 2018. URL https://doi.org/10.5281/zenodo.1256723.
[22] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
[23] Hye Jin Kam and Ha Young Kim. Learning representations for the early detection of sepsis with deep neural networks. Computers in Biology and Medicine, 89:248–255, 2017.
[24] Anand Kumar, Daniel Roberts, Kenneth E Wood, Bruce Light, Joseph E Parrillo, Satendra Sharma, Robert Suppes, Daniel Feinstein, Sergio Zanotti, Leo Taiberg, et al. Duration of hypotension before initiation of effective antimicrobial therapy is the critical determinant of survival in human septic shock. Critical Care Medicine, 34(6):1589–1596, 2006.
[25] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 156–165, 2017.
[26] Steven Cheng-Xian Li and Benjamin M Marlin. A scalable end-to-end Gaussian process adapter for irregularly sampled time series classification. In Advances in Neural Information Processing Systems, pages 1804–1812, 2016.
[27] Lei Lin, Beilei Xu, Wencheng Wu, Trevor Richardson, and Edgar A Bernal. Medical time series classification with hierarchical attention-based temporal convolutional networks: A case study of myotonic dystrophy diagnosis. arXiv preprint arXiv:1903.11748, 2019.
[28] Zachary C Lipton, David C Kale, and Randall Wetzel. Modeling missing data in clinical time series with RNNs. arXiv preprint arXiv:1606.04130, 2016.
[29] Qingqing Mao, Melissa Jay, Jana L Hoffman, Jacob Calvert, Christopher Barton, David Shimabukuro, Lisa Shieh, Uli Chettipally, Grant Fletcher, Yaniv Kerem, et al. Multicentre validation of a sepsis prediction algorithm using only vital sign data in the emergency department, general ward and ICU. BMJ Open, 8(1):e017833, 2018.
[30] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[31] Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro, and Karsten Borgwardt. Early recognition of sepsis with Gaussian process temporal convolutional networks and dynamic time warping. arXiv preprint arXiv:1902.01659, 2019.
[32] H Bryant Nguyen, Stephen W Corbett, Robert Steele, Jim Banta, Robin T Clark, Sean R Hayes, Jeremy Edwards, Thomas W Cho, and William A Wittlake. Implementation of a bundle of quality indicators for the early management of severe sepsis and septic shock is associated with decreased mortality.
Critical care medicine , 35(4):1105–1112,2007.[33] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. A dual-stage attention-basedrecurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 , 2017.[34] Aniruddh Raghu, Matthieu Komorowski, and Sumeetpal Singh. Model-based reinforcement learning for sepsistreatment. arXiv preprint arXiv:1811.09602 , 2018.[35] Jo Schlemper, Ozan Oktay, Michiel Schaap, Mattias Heinrich, Bernhard Kainz, Ben Glocker, and Daniel Rueckert.Attention gated networks: Learning to leverage salient regions in medical images.
Medical image analysis , 53:197–207,2019.[36] Christopher W Seymour, Vincent X Liu, Theodore J Iwashyna, Frank M Brunkhorst, Thomas D Rea, André Scherag,Gordon Rubenfeld, Jeremy M Kahn, Manu Shankar-Hari, Mervyn Singer, et al. Assessment of clinical criteria forsepsis: for the third international consensus definitions for sepsis and septic shock (sepsis-3).
Jama , 315(8):762–774,2016.[37] Manu Shankar-Hari, Gary S Phillips, Mitchell L Levy, Christopher W Seymour, Vincent X Liu, Clifford S Deutschman,Derek C Angus, Gordon D Rubenfeld, and Mervyn Singer. Developing a new definition and assessing new clinicalcriteria for septic shock: for the third international consensus definitions for sepsis and septic shock (sepsis-3).
Jama ,315(8):775–787, 2016. 10GP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis[38] David W Shimabukuro, Christopher W Barton, Mitchell D Feldman, Samson J Mataraso, and Ritankar Das. Effect of amachine learning-based severe sepsis prediction algorithm on patient survival and hospital length of stay: a randomisedclinical trial.
BMJ open respiratory research , 4(1):e000234, 2017.[39] Mervyn Singer, Clifford S Deutschman, Christopher Warren Seymour, Manu Shankar-Hari, Djillali Annane, MichaelBauer, Rinaldo Bellomo, Gordon R Bernard, Jean-Daniel Chiche, Craig M Coopersmith, et al. The third internationalconsensus definitions for sepsis and septic shock (sepsis-3).
Jama , 315(8):801–810, 2016.[40] Jean-Louis Vincent, John C Marshall, Silvio A Ñamendys-Silva, Bruno François, Ignacio Martin-Loeches, JeffreyLipman, Konrad Reinhart, Massimo Antonelli, Peter Pickkers, Hassane Njimi, et al. Assessment of the worldwideburden of critical illness: the intensive care over nations (icon) audit.
The lancet Respiratory medicine , 2(5):380–386,2014.[41] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networksfor document classification. In
Proceedings of the 2016 conference of the North American chapter of the associationfor computational linguistics: human language technologies , pages 1480–1489, 2016.[42] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modularattention network for referring expression comprehension. In
Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition , pages 1307–1315, 2018. 11GP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis
Appendix A Data processing
A.1 Data labelling
Please refer to https://github.com/mmr12/MIMIC-III-sepsis-3-labels for details on how we derived the MIMIC-III sepsis labels.
A.2 Data extraction

Patient Inclusion
We filter for patients admitted to Intensive Care Units (ICU) who are more than 14 years old and have valid records. Case patients are those with sepsis onset during their ICU stay, whereas control patients neither developed sepsis nor have an ICD discharge code referring to sepsis. Starting with 58'976 patients, we find 14'071 control patients and 7'936 case patients using our labels, versus 1'797 cases using the labels of Moor et al. [31].
Feature extraction
Reviewing sepsis-related literature and commonly extracted laboratory and vital recordings, we extracted all features that were reported at least once for more than 75% of the included population. The final 24 dynamic features are reported in Table 3. We also extracted static features: age, gender, and first ICU admission department.
Case-control matching
As the goal is to predict sepsis prior to onset, the case data was extracted between ICU admission and sepsis onset. Note that sepsis onset happens early within the ICU stay, with the median patient getting sick 3.4 hours after admission. Patients not developing sepsis, on the other hand, are more likely to recover completely, and do so over a longer time frame; moreover, once they are close to discharge, their vitals and labs are within the norms. Hence, both the length and the values of the time series are strong discriminatory factors which ease the classification. We therefore carry out a matching strategy similar to Moor et al. [31]: following the class imbalance ratio, we associate each control time series with a case time series and truncate the control to the same length as the case, counting from ICU admission. We then discard patients with fewer than 40 data points within the selected window and, for computational tractability, truncate the first N_p − 250 initial values of each patient's time series in order to keep a maximum of 250 data points per patient.

Horizon augmentation
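The matching and truncation steps above can be sketched as follows. The function name, the list-based data layout, and the random case-to-control pairing are our own simplifications (the actual pipeline pairs series following the class-imbalance ratio); the 40-observation floor and the 250-point cap follow the text.

```python
import random

def match_controls_to_cases(case_lengths, control_series, seed=0):
    """Pair each control series with a (randomly chosen) case and truncate
    the control, from ICU admission, to the matched case's length.
    Hypothetical helper: the real pipeline matches according to the
    class-imbalance ratio rather than uniformly at random."""
    rng = random.Random(seed)
    matched = []
    for control in control_series:
        case_len = rng.choice(case_lengths)   # case whose length we mirror
        truncated = control[:case_len]        # same window from ICU admission
        if len(truncated) >= 40:              # discard overly sparse series
            matched.append(truncated[-250:])  # cap at 250 data points
    return matched
```

Truncating from admission (rather than backwards from discharge) keeps the control windows comparable to the case windows, removing series length as a trivial discriminator.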
As our goal is to predict sepsis early, we augment the data by creating new, shorter time series. For each time series, we create six copies, where each copy represents a different horizon to onset: from the h-th copy we truncate the last h hours prior to onset, for h = 1, ..., 6. In order to keep the data consistent, we once again discard time series with fewer than 40 observations. Tables 1 and 2 show the data distribution per horizon.

Table 1: Augmented Dataset Description with Moor et al. labels
Horizon to onset    N. of patients    N. of obs. per patient
[numeric entries not recoverable from the source]

Data split
Finally, we split the data into training, validation, and test sets, capturing respectively 80%, 10%, and 10% of the data. We then normalise the data by subtracting the training-set mean and dividing by the training-set standard deviation of each feature.

Table 2: Augmented Dataset Description with the authors' labels
Horizon to onset    N. of patients    N. of obs. per patient
[numeric entries not recoverable from the source]

Table 3: Extracted dynamic features

Vitals                 Labs
Sys. blood pressure    Bicarbonate    PTT
Dia. blood pressure    Creatinine     INR
Mean blood pressure    Chloride       PT
Resp. rate             Glucose        Sodium
Heart rate             Hematocrit     BUN
SpO2 pulse ox.         Hemoglobin     WBC
Temperature (C)        Lactate        Magnesium
                       Platelet       pH blood gas
                       Potassium
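The horizon augmentation described in Appendix A.2 can be sketched as follows; the `(time, value)` pair layout and the function name are our own, while the six horizons and the 40-observation floor follow the text.

```python
def augment_horizons(series, min_obs=40):
    """Create up to six copies of a case time series (a list of
    (time_in_hours, value) pairs ending at sepsis onset), each truncated
    h hours before onset for h = 1..6. Copies with fewer than `min_obs`
    remaining observations are discarded."""
    onset = series[-1][0]                     # last timestamp = onset time
    copies = []
    for horizon in range(1, 7):
        truncated = [(t, v) for t, v in series if t <= onset - horizon]
        if len(truncated) >= min_obs:         # keep data consistency
            copies.append((horizon, truncated))
    return copies
```

A densely observed series yields all six horizon copies; a sparse one loses its longest horizons first, which is why the per-horizon patient counts in Tables 1 and 2 shrink as the horizon grows.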
Appendix B Baselines
B.1 Data preparation
In order to benchmark our MGP model, we build baselines on a homogenised data sampling. For each hour and variable, we take the average of the available observations. If a given hour has no observations, we carry forward the average of the previous hour. In this manner, we generate an hourly sampled time series for each patient. We then normalise the size of each patient matrix by setting a time window of observation N. For patients having more than N observations N_p, we discard the first N_p − N observations, whereas for patients having fewer than N observations we pad the beginning of the matrix with zeros:

$$
\begin{pmatrix}
y_{p,1,t_1} & \dots & y_{p,1,t_{N_p}} \\
\vdots & & \vdots \\
y_{p,M,t_1} & \dots & y_{p,M,t_{N_p}}
\end{pmatrix}
\xrightarrow{\text{carry forward}}
\begin{pmatrix}
y_{p,1,1} & \dots & y_{p,1,N_p} \\
\vdots & & \vdots \\
y_{p,M,1} & \dots & y_{p,M,N_p}
\end{pmatrix}
\tag{14}
$$

$$
\xrightarrow{\text{normalise}}
\begin{cases}
\begin{pmatrix}
y_{p,1,N_p-N+1} & \dots & y_{p,1,N_p} \\
\vdots & & \vdots \\
y_{p,M,N_p-N+1} & \dots & y_{p,M,N_p}
\end{pmatrix} & \text{if } N_p \geq N \\[2em]
\begin{pmatrix}
0 & \dots & 0 & y_{p,1,1} & \dots & y_{p,1,N_p} \\
\vdots & & & & & \vdots \\
0 & \dots & 0 & y_{p,M,1} & \dots & y_{p,M,N_p}
\end{pmatrix} & \text{otherwise}
\end{cases}
\tag{15}
$$

We choose to align the end of the time series rather than the beginning, as the importance of a time point is relative to when a patient becomes sick, not to when they are admitted to the ICU. As a next step, we augment the data in a similar manner as for the irregularly sampled data: we create seven copies of each time series, discard the last zero to six hours from each copy, then normalise the matrix as above. We hence generate a dataset $Y_{BL} = \{Y_q\}_q = \{\{y_{q,ij}\}_{i,j=1}^{N,M}\}_q$, where $q$ indexes all augmented time series.

B.2 InSight
The InSight scoring model is one of the few machine learning algorithms to surpass the proof-of-concept stage, with multiple research, economic, and clinical trials [6, 10, 5, 29]. We therefore include it as a baseline to our model. The key concept of the model is to use a few widely available vitals, build handcrafted features, and train a simple classification model. What follows is our interpretation of the authors' method. The features are extracted over windows of six consecutive hours. For each six-hour window, we extract each variable's mean $M_i$ and difference $D_i$ (last observation minus first observation) over the window. We also extract pairwise correlations $D_{ij}$ and triplet correlations $D_{ijk}$, where $i, j, k$ index observed variables. We interpret the latter as a relaxation of the Pearson correlation: if the correlation between two variables is

$$\rho_{XY} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{16}$$

then we define the triplet correlation as

$$\rho_{XYZ} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)(Z - \mu_Z)]}{\sigma_X \sigma_Y \sigma_Z} \tag{17}$$

We then classify the differences and correlations as either positive, negligible, or negative using the quantiles of their distribution over all patients and six-hour windows observed. Note that, given the high level of data missingness, many variables are filled by forward imputation and hence have no variance over six hours. To adjust for the resulting high number of zero correlations, we calculate the quantiles of the non-zero correlations and define

$$\hat{D}_i = \begin{cases} 1 & \text{if } D_i > q^*(2/3) \\ -1 & \text{if } D_i < q^*(1/3) \\ 0 & \text{otherwise} \end{cases} \tag{18}$$

where $q^*$ is the adjusted quantile function.
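The triplet correlation and the ternarisation above can be sketched as follows. This is our own minimal implementation: we use population standard deviations and simple tertile indices, details the original description leaves unspecified.

```python
from statistics import mean, pstdev

def triplet_correlation(xs, ys, zs):
    """Relaxed three-way Pearson correlation of Eq. (17):
    E[(X-muX)(Y-muY)(Z-muZ)] / (sigmaX * sigmaY * sigmaZ)."""
    mx, my, mz = mean(xs), mean(ys), mean(zs)
    num = mean((x - mx) * (y - my) * (z - mz) for x, y, z in zip(xs, ys, zs))
    return num / (pstdev(xs) * pstdev(ys) * pstdev(zs))

def ternarise(d, nonzero_values):
    """Classify a difference/correlation as positive (1), negligible (0),
    or negative (-1) using tertiles of the non-zero values, i.e. the
    adjusted quantile function q* of Eq. (18)."""
    s = sorted(nonzero_values)
    lo, hi = s[len(s) // 3], s[2 * len(s) // 3]  # rough q*(1/3), q*(2/3)
    return 1 if d > hi else (-1 if d < lo else 0)
```

Note that a perfectly linear triplet such as three copies of the same variable has a triplet correlation of zero, since the third central moment of a symmetric sample vanishes; the measure is a relaxation of, not a generalisation of, pairwise correlation.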
We proceed in a similar manner for the pairwise and triplet correlations. In order to keep the results comparable to the AttTCN's fixed window N, we extract N − (6 − 1) consecutive six-hour windows and vectorise the resulting features, generating in total

$$n_{\text{features}} = \big(N - (6 - 1)\big) \times \Big(2M + \binom{M}{2} + \binom{M}{3}\Big) \tag{19}$$

features per patient. Although the original paper does not specify which classification method the authors employ, we infer from their description of a dimensionless score that the method is a logistic regression.

Table 4: Hyperparameter search

Hyperparameter                 min    max
MGP Monte Carlo samples        4      20
TCN kernel size                2      6
TCN number of Residual Blocks  2      12
TCN number of hidden layers    10     55
TCN dropout rate               0      0.99
TCN L2 regularisation          0      250
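To close Appendix B, the hourly resampling and fixed-window normalisation of Eqs. (14) and (15) in Appendix B.1 can be sketched for a single variable as follows; the dict-of-hours data layout and the zero initial carry value are our own assumptions.

```python
def to_fixed_window(obs, N):
    """Average the observations within each hour, carry the previous hour's
    average forward through empty hours (Eq. 14), then right-align to a
    window of N steps: truncate the earliest hours or left-pad with zeros
    (Eq. 15). `obs` maps hour index -> list of raw observations."""
    last_hour = max(obs)
    hourly, carry = [], 0.0
    for h in range(last_hour + 1):
        if obs.get(h):
            carry = sum(obs[h]) / len(obs[h])  # hourly average
        hourly.append(carry)                   # carry forward if empty
    if len(hourly) >= N:
        return hourly[-N:]                     # discard the first N_p - N hours
    return [0.0] * (N - len(hourly)) + hourly  # zero-pad the beginning
```

Right-aligning keeps the hours closest to sepsis onset at fixed positions in the input matrix, matching the end-of-series alignment argued for in Appendix B.1.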
Appendix C Model details
C.1 TCN properties
TCNs are a class of neural networks composed of causal convolutions stacked into Residual Blocks. A causal convolution is a 1D convolutional layer that only takes inputs from the past to generate its output, avoiding any data leakage. Residual Blocks consist of two causal convolutional layers together with ReLU activation functions, dropout, and L2 regularisation.
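A causal convolution as described above can be sketched in plain Python (a single-channel, fixed-kernel toy, not the trainable layer itself; `kernel[-1]` weights the current time step):

```python
def causal_conv1d(x, kernel, dilation=1):
    """1-D causal convolution: the output at time t depends only on inputs
    at times <= t, achieved by left-padding the sequence with zeros."""
    k = len(kernel)
    padded = [0.0] * ((k - 1) * dilation) + list(x)
    return [sum(kernel[j] * padded[i + j * dilation] for j in range(k))
            for i in range(len(x))]
```

In a Residual Block, two such convolutions (with learned, multi-channel kernels) are stacked with ReLU activations, dropout, and L2 regularisation, and the block's input is added back to its output.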
C.2 Static variables