Predicting Surgery Duration with Neural Heteroscedastic Regression
Nathan Ng, Rodney A Gabriel, Julian McAuley, Charles Elkan, Zachary C Lipton
Proceedings of Machine Learning for Healthcare 2017, JMLR W&C Track Volume 68
Nathan H Ng, Rodney A Gabriel, Julian McAuley, Charles Elkan, Zachary C Lipton∗
Department of Computer Science, Division of Biomedical Informatics, Department of Anesthesiology
University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
{nhng, ragabriel, jmcauley, elkan, zlipton}@ucsd.edu
Abstract
Scheduling surgeries is a challenging task due to the fundamental uncertainty of the clinical environment, as well as the risks and costs associated with under- and over-booking. We investigate neural regression algorithms to estimate the parameters of surgery case durations, focusing on the issue of heteroscedasticity. We seek to simultaneously estimate the duration of each surgery, as well as a surgery-specific notion of our uncertainty about its duration. Estimating this uncertainty can lead to more nuanced and effective scheduling strategies, as we are able to schedule surgeries more efficiently while allowing an informed and case-specific margin of error. Using surgery records from a large United States health system, we demonstrate potential improvements on the order of 20% (in terms of minutes overbooked) compared to current scheduling techniques. Moreover, we demonstrate that surgery durations are indeed heteroscedastic. We show that models that estimate case-specific uncertainty better fit the data (as measured by log likelihood). Additionally, we show that heteroscedastic predictions can more optimally trade off between over- and under-booked minutes, especially when idle minutes and scheduling collisions confer disparate costs.
1. Introduction
In the United States, healthcare is expensive and hospital resources are scarce. Healthcare expenditure now exceeds 17% of US GDP (World Bank, 2014), even as surgery wait times appear to have increased over the last decade (Bilimoria et al., 2011). One source of inefficiency (among many) is the inability to fully utilize hospital resources. Because doctors cannot accurately predict the duration of surgeries, operating rooms can become congested (when surgeries run long) or lie vacant (when they run short). Over-booking can lead to long wait times and higher labor costs (due to overtime pay), while under-booking decreases throughput, increasing the marginal cost per surgery.

At present, doctors book rooms according to a simple formula: the default time reserved is simply the mean duration of that specific procedure. The procedure code does in fact explain a significant amount of the variance in surgery durations. But by ignoring other information, we hypothesize that the medical system leaves important signals untapped.

We address this issue by developing better and more nuanced strategies for surgery case prediction. Our work focuses on a collection of surgery logs recorded in Electronic Health Records (EHRs) at a large United States health system. For each patient, we consider a collection of pre-operative features, including patient attributes (age, weight, height, sex, co-morbidities, etc.), as well as attributes of the clinical environment, such as the surgeon, surgery location, and time. For each procedure, we also know how much time was originally scheduled, in addition to the actual 'ground-truth' surgery duration, recorded after each procedure is performed.

∗ Corresponding author, website: http://zacklipton.com

We are particularly interested in developing methods that allow us to better estimate the uncertainty associated with the duration of each surgery.
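The current scheduling heuristic described above (book the historical mean duration of the procedure) is simple to state in code. The following is a minimal sketch, not the hospital's actual system; the procedure names and fallback value are illustrative:

```python
from collections import defaultdict

def fit_procedure_means(procedures, durations):
    """Current practice: predict each procedure's mean historical duration.
    Equivalent to unregularized linear regression on one-hot procedure codes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for proc, dur in zip(procedures, durations):
        totals[proc] += dur
        counts[proc] += 1
    return {proc: totals[proc] / counts[proc] for proc in totals}

def predict(means, procedure, fallback):
    # fall back to a global default for procedures unseen in training
    return means.get(procedure, fallback)
```

Because this baseline conditions only on the procedure code, any signal in the remaining pre-operative features is left on the table, which is precisely the gap the models below target.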
Typically, neural network regression objectives assume homoscedasticity, i.e., constant levels of target variability for all instances. While mathematically convenient, this assumption is clearly violated in data such as ours: intuitively, operations that typically take a long time tend to exhibit greater variance than shorter ones. For example, among the most common procedures, epidural injections are both the shortest procedures and the ones with the least variance (Figure 1). Among the same procedures, exploratory laparotomy and major burn surgery exhibit the greatest variance. All procedures exhibit long (and one-sided) tails.

To model this data, we revisit the idea of heteroscedastic neural regression, combining it with expressive, dropout-regularized neural networks. In our approach, we jointly learn all parameters of a predictive distribution. In particular, we consider Gaussian and Laplace distributions, each of which is parameterized by a mean and standard deviation. We also consider Gamma distributions, which are especially suited to survival analysis. Unlike the Gaussian and Laplace, which are long-tailed on both ends, the Gamma has a long right tail and only positive support (i.e., it assigns zero probability density to any value less than zero). The restriction to positive values suits the modeling of durations or other survival-type data. While the Gamma distribution (and the related Weibull distribution) has been applied to medical data with classical approaches (Bennett, 1983; Sahu et al., 1997), this is, to our knowledge, the first work to estimate the parameters of a Gamma distribution using modern neural network approaches.

Our heteroscedastic models better fit the data (as determined by log likelihood) compared to both current practice and neural network baselines that fail to account for heteroscedasticity.
Furthermore, our models produce reliable estimates of the variance, which can be used to schedule intelligently, especially when over-booking and under-booking confer disparate costs. These uncertainty estimates come at no cost in performance by traditional measures. The best-performing Gamma MLP model achieves a lower mean squared error than a vanilla least-squares (Gaussian) MLP, despite optimizing a different objective.
2. Dataset
Our dataset consists of patient records extracted from the EHR system at a large United States hospital. Specifically, we selected 107,755 records corresponding to surgeries that took place between 2014 and 2016. These surgeries span 995 distinct procedures and were performed by 368 distinct surgeons. Histograms of both are long-tailed: many procedures were performed only a handful of times, and many doctors performed only a few surgeries each. Moreover, the data contain several clerical mistakes in logging the durations. For example, a number of surgeries in the record were reported as running only a few minutes. Discussions with the hospital experts suggest that this may indicate either clerical errors or an inconsistently applied convention for logging canceled surgeries. Additionally, several surgeries were reported to run implausibly long, suggesting (rare) clerical errors in logging the end times of procedures. We remove all surgeries with implausibly short or long reported durations from the dataset. This preprocessing left us with roughly 80% of our original data (86,796 examples). For our experiments, we split this remaining data 80%/8%/12% for training/validation/testing.

For each surgery, we extracted a number of pre-operative features from the corresponding EHRs. We restrict attention to features that are available for a majority of patients and (to avoid target leaks) exclude any information that is charted during or following the procedure. Our features fall into several categories: patient, doctor, procedure, and context.
Patient features:
For each of our patients, we include the following features:

• Size: Patient height and weight are real-valued features. We normalize each to zero mean and unit variance.

• Age: A categorical variable, binned into ten-year-wide intervals that are open on the left side: (0, 10], (10, 20], and so on. None of the patients in our cohort are zero years old.

Figure 1: Distributions of durations for the most common procedures.

• ASA score: An ordinal score that represents the severity of a patient's illness. For example, ASA I denotes a healthy patient, ASA III denotes severe systemic disease, and ASA V denotes that the patient is moribund without surgery. ASA VI refers to a brain-dead patient in preparation for organ transplantation.

• Anesthesia Type:
This categorical feature represents the type of anesthesia applied to sedate the patient. The values assigned to this variable include General; Monitored anesthesia care (MAC), in which a patient undergoes local anesthesia together with sedation; Neuraxial; No Anesthesiologist; and other/unknown.

• Patient Class:
This categorical feature indicates the patient's current status. The values assigned to this variable include Emergency Department Encounter, Hospital Outpatient Procedure, Hospital Outpatient Surgery, Hospital Inpatient Surgery, Trauma Inpatient Admission, Inpatient Admission, and Trauma Outpatient.

• Comorbidities:
We model the following co-morbidities as binary variables: smoker status, atrial fibrillation, chronic kidney disease, chronic obstructive pulmonary disease, congestive heart failure, coronary artery disease, diabetes, hypertension, cirrhosis, obstructive sleep apnea, cardiac device, dialysis, asthma, and dementia.
Doctor:
We represent the doctor performing the procedure (categorical) using a one-hot vector. The doctor feature exhibits considerable class imbalance between the most and least prolific doctors (in the pruned dataset).

Feature                    Abbreviation      Type               Mode
Age                        age               Categorical        -
Sex                        sex               Binary             Female
Weight                     weight            Numerical          -
Height                     height            Numerical          -
Time of Day                hour              Categorical        -
Day of Week                day               Categorical        Friday
Month                      month             Categorical        March
Location                   location          Categorical        -
Patient Class              class             Categorical        Hospital Outpatient
ASA Rating                 asa               Categorical        None
Anesthesia Type            anesthesia        Categorical        General
Surgeon                    surgeon           Categorical (155)  -
Procedure                  procedure         Categorical (199)  Colonoscopy
Smoker                     smoker            Binary             No
Heart Arrhythmia           afib              Binary             No
Chronic Kidney Disease     ckd               Binary             No
Congestive Heart Failure   chf               Binary             No
Coronary Artery Disease    cad               Binary             No
Type II Diabetes           diabetes          Binary             No
Hypertension               htn               Binary             No
Liver Cirrhosis            cirrhosis         Binary             No
Sleep Apnea                osa               Binary             No
Cardiac Device             cardiac_device    Binary             No
Dialysis                   dialysis          Binary             No
Asthma                     asthma            Binary             No
Dementia                   dementia          Binary             No
Cognitive Impairment       cognitive         Binary             No

Table 1: Summary of features.
Procedure:
We represent the procedure performed as a one-hot vector. The most common operations tend to be minor GI procedures: the four most frequent procedures are colonoscopy, upper GI endoscopy, cataract removal, and abdominal paracentesis. This distribution is also long-tailed, with 11,173 colonoscopies.

Context:
We represent the context of the procedure with several categorical variables. First, we representthe hour of the day as a categorical variable with values binned into non-overlapping -hour width buckets.Second, we represent the day of the week and month of the year each as one-hot vectors. Finally, we similarlyrepresent the location of the operations as a one-hot vector.We summarize the number and kind of features in our dataset in Table 1. We handle variables withmissing values, including height, weight, and hour of the day, by incorporating missing value indicators,following previous work on clinical datasets (Lipton et al., 2016).
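The encoding just described (one-hot categorical features plus a missing-value indicator per numeric feature) can be sketched as follows. This is an illustration under assumed feature names, not the paper's exact preprocessing pipeline:

```python
def encode(record, numeric, categories):
    """Encode one surgery record as a flat feature vector: each numeric
    feature contributes (value, missing-indicator); each categorical
    feature contributes a one-hot vector over its known levels."""
    row = []
    for name in numeric:
        value = record.get(name)
        row.append(0.0 if value is None else value)   # value (imputed 0 if absent)
        row.append(1.0 if value is None else 0.0)     # missingness indicator
    for name, levels in categories.items():
        onehot = [0.0] * len(levels)
        if record.get(name) in levels:
            onehot[levels.index(record[name])] = 1.0  # unseen levels stay all-zero
        row.extend(onehot)
    return row
```

The indicator columns let the model learn a correction for records where, say, height or hour of day was never charted, rather than silently treating the imputed value as real.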
3. Methods
This paper addresses the familiar task of regression. We start by reviewing some basic preliminaries. Given a set of examples $\{x_i\}$ and corresponding labels $\{y_i\}$, we desire a model $f$ that outputs a prediction $\hat{y} = f(x)$. The task of the machine learning algorithm is to produce the function $f$ given a dataset $\mathcal{D}$ consisting of examples $X$ and labels $y$. Generally, we seek predictions that are close to $y$, as determined by some computable loss function $L$. Most often we minimize the squared loss $L = \sum_i (y_i - \hat{y}_i)^2$ over all instances $(x_i, y_i)$.

One popular method for producing such a function is to choose a class of functions $f$ parameterized by some values $\theta$. Linear models are the simplest examples of this approach. To train a linear regression model, we define $f(x) = \theta^T x$. Then we solve the optimization problem $\theta^* = \mathrm{argmin}_\theta\, L(y, \hat{y})$ over some training data and evaluate the model by its performance on previously unseen data. For linear models, the error-minimizing parameters (on the training data) can be calculated analytically. For modern deep learning models, no analytic solution exists, so optimization typically proceeds by stochastic gradient descent.

For neural network models, we change only the function $f$. In multilayer perceptrons (MLPs), for example, we transform our input through a series of matrix multiplications, each followed by a nonlinear activation function. Formally, an $L$-layer MLP for regression has the simple form

$$\hat{y} = W_L \cdot \phi(W_{L-1} \cdot \ldots \cdot \phi(W_1 \cdot x + b_1) + \ldots + b_{L-1}) + b_L,$$

where $\phi$ is an activation function such as sigmoid, tanh, or rectified linear unit (ReLU), and $\theta$ consists of the full set of parameters $W_l$ and $b_l$.

We might view the loss function (squared loss) as simply an intuitive measure of distance. Alternatively, it is possible to derive the choice of squared loss by viewing regression from a probabilistic perspective.
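Before turning to the probabilistic view, the $L$-layer MLP form above can be sketched in a few lines of NumPy. This is an illustration of the forward pass only, not the training code used in the paper:

```python
import numpy as np

def relu(z):
    # rectified linear unit, the activation phi used in our experiments
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases):
    """Forward pass of an L-layer MLP for regression:
    y_hat = W_L . phi(W_{L-1} ... phi(W_1 x + b_1) ... + b_{L-1}) + b_L."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)               # hidden layers apply the nonlinearity
    return weights[-1] @ h + biases[-1]   # linear output layer for regression
```

In practice the parameters would be fit by stochastic gradient descent on the chosen loss; here the weights are supplied directly.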
In the probabilistic view, a parametric model outputs a distribution $P(y \mid x)$. In the simplest case, we can assume that the prediction $\hat{y}$ is the mean of a Gaussian predictive distribution with some variance $\sigma^2$. In this view, we can calculate the probability density of any $y$ given $x$, and thus can choose our parameters according to the maximum likelihood principle:

$$\theta_{\mathrm{MLE}} = \mathrm{argmax}_\theta \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_i} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\hat{\sigma}_i^2}\right) = \mathrm{argmin}_\theta \sum_{i=1}^{n} \left(\log \hat{\sigma}_i + \frac{(y_i - \hat{y}_i)^2}{2\hat{\sigma}_i^2}\right). \quad (1)$$

Assuming constant $\hat{\sigma}$, this yields the familiar least-squares objective.

In this work, we relax the assumption of constant variance (homoscedasticity), predicting both $\hat{y}(\theta, x)$ and $\hat{\sigma}(\theta, x)$ simultaneously. While we apply the idea to MLPs, it is easily applied to networks of arbitrary architecture. To predict the standard deviation $\hat{\sigma}$ of the predictive distribution, we modify our MLP to have two outputs: the first output has linear activation, and we interpret it as the conditional mean $\hat{y}$; the second output models the conditional standard deviation $\hat{\sigma}$. To enforce positivity of $\hat{\sigma}$, we run this output through the softplus activation function $\mathrm{softplus}(z) = \log(1 + \exp(z))$.

We extend the same idea to Laplace distributions, which turn out to better describe the target variability for surgery duration, and whose maximum likelihood estimate corresponds to minimizing the mean absolute error (MAE). Mean absolute error corresponds to the average number of minutes over- or under-booked, and is typically the quantity of interest for this type of scheduling task. The Laplace distribution is parameterized by a scale $b = \sigma/\sqrt{2}$:

$$\theta_{\mathrm{MLE}} = \mathrm{argmax}_\theta \prod_{i=1}^{n} \frac{1}{2 b_i} \exp\left(-\frac{|y_i - \hat{y}_i|}{b_i}\right) = \mathrm{argmin}_\theta \sum_{i=1}^{n} \left(\log b_i + \frac{|y_i - \hat{y}_i|}{b_i}\right). \quad (2)$$

Finally, we apply the same technique to perform neural regression with Gamma predictive distributions. The Gamma distribution has strictly positive support and is long-tailed on the right.
Since surgeries and other survival-type data have nonnegative durations, probability distributions with nonnegative support, such as the Gamma distribution (in contrast to the real-valued support of the Gaussian and Laplace distributions), might better describe surgery duration. Formally, the time between surgeries (or their associated durations) follows a Gamma distribution when surgery start times are modeled as a Poisson process.

The Gamma distribution is parameterized by a shape parameter $k$ and a scale parameter $\Phi$:

$$\theta_{\mathrm{MLE}} = \mathrm{argmax}_\theta \prod_{i=1}^{n} \frac{1}{\Gamma(k_i)\,\Phi_i^{k_i}}\, y_i^{k_i - 1} \exp\left(-\frac{y_i}{\Phi_i}\right) = \mathrm{argmin}_\theta \sum_{i=1}^{n} \left(\log \Gamma(k_i) + k_i \log \Phi_i - (k_i - 1)\log y_i + \frac{y_i}{\Phi_i}\right). \quad (3)$$

In this case, the model needs to predict two values: $\hat{k}(\theta, x)$ and $\hat{\Phi}(\theta, x)$. As before, our MLP has two outputs, both passed through a softplus activation to enforce positivity.
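The three per-example negative log likelihoods in Equations (1)-(3) translate directly into code. The following is a minimal standard-library sketch (not the authors' released implementation) of the loss terms a heteroscedastic network would average during training:

```python
import math

def softplus(z):
    """Numerically stable softplus, used to keep predicted scale/shape
    parameters (sigma, b, k, phi) strictly positive."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)

def gaussian_nll(y, mu, sigma):
    # Eq. (1) per example, including the constant 0.5*log(2*pi) term
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + (y - mu) ** 2 / (2 * sigma ** 2)

def laplace_nll(y, mu, b):
    # Eq. (2) per example: log(2b) + |y - mu| / b
    return math.log(2 * b) + abs(y - mu) / b

def gamma_nll(y, k, phi):
    # Eq. (3) per example: log Gamma(k) + k log phi - (k - 1) log y + y / phi
    return math.lgamma(k) + k * math.log(phi) - (k - 1.0) * math.log(y) + y / phi

def gamma_mean(k, phi):
    # mean of Gamma(shape=k, scale=phi), used as y_hat when computing RMSE
    return k * phi
```

In training, the network's two unconstrained outputs would be passed through softplus to produce $\hat{\sigma}$, $b$, or $(\hat{k}, \hat{\Phi})$, and the per-example NLL would be averaged to form the loss.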
4. Experiments
We now present the basic experimental setup. For all experiments we use the same 80%/8%/12% training/validation/test split. Model weights are updated on the training set, and we choose all non-differentiable hyper-parameters and architecture details based on validation set performance. In the final tally, the majority of our features are sparse, accounted for by the one-hot representations over procedures and doctors. We express our labels (the surgery durations) as the number of hours that each procedure takes.

Baselines
We consider three sensible baselines for comparison. The first is to follow the current heuristic of predicting the average time per procedure. Note that this is equivalent to training an unregularized linear regression model with a single feature per procedure and no others. Although the main technical contribution of this paper concerns modeling heteroscedasticity, we are also generally interested in how much performance the current approach leaves untapped; this baseline helps us to address that question. We also compare against linear regression. While we tried applying $\ell$ regularization, choosing the strength of regularization $\lambda$ on holdout data, this did not lead to improved performance. Finally, we compare against traditional multilayer perceptrons. To calculate NLL for models that assume homoscedasticity, we choose the constant variance that minimizes NLL on the validation set.

Training Details
For all neural network experiments, we use MLPs with ReLU activations. We optimize each network's parameters by stochastic gradient descent with a fixed initial learning rate, halved periodically during training. To determine the architecture, we performed a grid search over the number of hidden layers (in the range 1-3) and over the number of hidden nodes, choosing between 128, 256, 384, and 512. As determined by our hyper-parameter optimization, both homoscedastic and heteroscedastic models use a single hidden layer. All models use dropout regularization.

For our basic quantitative evaluation, we report the root mean squared error (RMSE), mean absolute error (MAE), and negative log-likelihood (NLL). For heteroscedastic models, we evaluate NLL using the predicted parameters of the distribution. For the Gamma distribution, we calculate its mean as $k \cdot \Phi$ and use the mean as the prediction $\hat{y}$ for calculating RMSE. To calculate MAE, we use the median of the Gamma distribution as $\hat{y}$. Although the median of a Gamma distribution has no closed form, it can be efficiently approximated.

We summarize the results in Table 2. The heteroscedastic Gamma MLP performs best as measured by both RMSE and NLL, while the Laplace MLP performs best as measured by MAE. All heteroscedastic models outperform all homoscedastic models as determined by NLL, with the heteroscedastic Gamma MLP achieving the lowest NLL and the Laplace MLP the lowest among homoscedastic models. The significant quantity when evaluating log likelihood is the difference between NLL values, corresponding to the (log of the) likelihood ratio between two models.

Plots in Figure 2 demonstrate that the predicted deviation reliably estimates the observed error, and QQ plots (Figure 3) demonstrate that the Laplace distribution appears to fit our targets better than a Gaussian predictive distribution.
This gives some (limited) insight into why the Laplace predictive distribution might better fit our data than the Gaussian.

Model               RMSE    MAE     NLL
Current Method      -       28.87   -
Procedure Means     -       27.70   -
Linear Regression   -       25.07   -
MLP Gaussian        -       23.90   -
MLP Gaussian HS     -       24.23   -
MLP Laplace         -       -       -
MLP Laplace HS      -       23.41   -
MLP Gamma HS        -       -       -

Table 2: Performance on test-set data (lower is better). MLP models outperform alternatives at the 1% significance level or better.

Figure 2: Plots of predicted $\hat{\sigma}$ against absolute error with heteroscedastic Gaussian (a), Laplacian (b), and Gamma (c) models. Averaging over fixed-width bins (panels d, e, f) shows that $\hat{\sigma}_i$ is a reliable estimator of the observed error.
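The Gamma median used to compute MAE above has no closed form but is cheap to approximate numerically. One way to do so (a sketch, not necessarily the paper's implementation) is bisection on the Gamma CDF, computed via the series expansion of the regularized lower incomplete gamma function:

```python
import math

def gamma_cdf(x, k, phi):
    """Regularized lower incomplete gamma P(k, x/phi), computed with its
    power-series expansion (adequate for moderate k and x)."""
    t = x / phi
    if t <= 0:
        return 0.0
    # leading term t^k * exp(-t) / Gamma(k + 1), in log space for stability
    term = math.exp(k * math.log(t) - t - math.lgamma(k + 1.0))
    total, n = term, 0
    while term > 1e-15 * total:
        n += 1
        term *= t / (k + n)   # ratio of consecutive series terms
        total += term
    return total

def gamma_median(k, phi, tol=1e-10):
    """Median of Gamma(shape=k, scale=phi) by bisection on the CDF."""
    lo, hi = 0.0, k * phi + 10.0 * math.sqrt(k) * phi  # bracket well past the mean
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, k, phi) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the special case $k = 1$ (an exponential distribution), the true median is $\Phi \ln 2$, which gives a handy correctness check.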
5. Related Work
Previous work in the medical literature addresses the prediction of surgery duration (Eijkemans et al., 2010; Kayış et al., 2015; Devi et al., 2012), accounting for both patient and surgical team characteristics. To our knowledge, ours is the first paper to address the problem with modern deep learning techniques and the first to model its heteroscedasticity. The idea of neural heteroscedastic regression was first proposed by Nix and Weigend (1994), though they do not share hidden layers between the two outputs and are only concerned with Gaussian predictive distributions. Williams (1996) uses a shared hidden layer and considers the case of multivariate Gaussian distributions, for which the full covariance matrix is predicted via its Cholesky factorization. Heteroscedastic regression has a long history outside of neural networks; Le et al. (2005) address a formulation for Gaussian processes. Most related is Lakshminarayanan et al. (2016), which also revisits heteroscedastic neural regression, likewise using a softplus activation to enforce non-negativity. We show some successful modifications to the above work, such as the use of the Laplace distribution, but our more significant contribution is the application of the idea to clinical medical data.

Figure 3: QQ plots of observed error for Gaussian and Laplace noise models. The Laplace distribution better describes observed error, with shorter tails at both ends.
6. Discussion
Our results demonstrate both the efficacy of machine learning (over current approaches) and the heteroscedasticity of surgery duration data. In this section, we explore both results in greater detail. Specifically, we analyze the models to see which features are most predictive, and we examine the uncertainty estimates to see how they might be used in decision theory to lower costs.
First, we consider the importance of the various features. Perhaps the most common way to do this is to see which features correspond to the largest weights in our linear model. These results are summarized in Figure 4. Not surprisingly, the top features are dominated by procedures. In particular, pulmonary thromboendarterectomy receives the highest positive weight. This procedure involves a high risk of mortality, a full cardiopulmonary bypass, hypothermia, and full cardiac arrest. Interestingly, even after accounting for procedures, two doctors receive high weight. One (Doctor 266) receives significant negative weight, indicating unusual efficiency, while another (Doctor 296) appears to be unusually slow. For ethical reasons, we maintain the anonymity of both the doctors and their specialties.

For neural network models, we evaluate the importance of each feature group by performing an ablation analysis (Figure 5). As a group, procedure codes are again the most important features. However, location, patient class, surgeon, anesthesia, and patient sex all contribute significantly. The hour of day appears to influence the performance of our models, but the day of the week does not, and the month appears to merely introduce noise, leading to a reduction in test set performance. Interestingly, comorbidities also made little difference in performance. However, it is possible that these features apply to only a small subset of patients but are highly predictive for that subset.
Our aim in predicting the variance of the error is to provide uncertainty information that could be used to make better scheduling decisions.

Figure 4: Top 30 linear regression features sorted by coefficient magnitude.

To compare the various approaches from an economic/decision-theoretic perspective, we might consider the plausible case where the cost to over-reserve the room by one minute (procedure finishes early) differs from the cost to under-reserve the room (procedure runs over). We demonstrate how the two quantities can be traded off in Figure 6.

For models that do not output variance, we consider scheduled durations of the form $\hat{y} + k$ and $\hat{y} \cdot k$, where $k$ is a data-independent constant. In either case, by modulating $k$, one books more or less aggressively. The multiplicative approach performed better, likely because long procedures have higher variance than short ones. Given a constant $\sigma$, this approach is equivalent to selecting a certain percentile of each predicted distribution.

For heteroscedastic models, we make the trade-off by selecting a constant percentile of each predicted distribution. When the cost of over-reserving by one minute and the cost of under-reserving by one minute are equal, the problem reduces to minimizing absolute error. Around this point on the curve, the homoscedastic Laplacian outperforms all other models. However, given cost sensitivity, the heteroscedastic Gamma strictly outperforms all other models.

We are encouraged by the efficacy of simple machine learning methods both to predict the durations of surgeries and to estimate our uncertainty. We see several promising avenues for future work. Most concretely, we are currently engaged in discussions with the medical institution whose data we used about introducing a trial in which surgeries would be scheduled according to decision theory based on our estimates. Regarding methodology, we look forward to expanding this research in several directions.
First, we might extend the approach to modeling covariances and more complex interactions among multiple real-valued predictions. We might also consider problems like bounding box detection, requiring more complex neural architectures.
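The percentile-based booking rule discussed above can be made concrete with a newsvendor-style calculation. The following is an illustrative sketch, not the paper's scheduling system: when each under-reserved minute costs c_under and each over-reserved minute costs c_over, expected cost is minimized by booking at the critical fractile p = c_under / (c_under + c_over), and for a Laplace predictive distribution that quantile has a closed form:

```python
import math

def optimal_percentile(cost_under, cost_over):
    """Critical fractile minimizing expected cost when under- and
    over-reserving a minute of operating-room time carry different costs."""
    return cost_under / (cost_under + cost_over)

def laplace_quantile(p, mu, b):
    """Closed-form p-th quantile of a Laplace(mu, b) predictive distribution."""
    if p < 0.5:
        return mu + b * math.log(2.0 * p)
    return mu - b * math.log(2.0 * (1.0 - p))
```

With equal costs, p = 0.5 and the booked duration is the predicted median $\mu$, recovering the MAE-minimizing schedule; when overruns are costlier, p rises and a heteroscedastic model books a larger, case-specific margin proportional to its predicted scale b.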
References
Steve Bennett. Log-logistic regression models for survival data. Applied Statistics, pages 165–171, 1983.

Karl Y Bilimoria, Clifford Y Ko, James S Tomlinson, Andrew K Stewart, Mark S Talamonti, Denise L Hynes, David P Winchester, and David J Bentrem. Wait times for cancer surgery in the United States: trends and predictors of delays. Annals of Surgery, 253(4):779–785, 2011.

Figure 5: Ablation analysis of feature importance for neural models.

S Prasanna Devi, K Suryaprakasa Rao, and S Sai Sangeetha. Prediction of surgery times and scheduling of operation theaters in ophthalmology department. Journal of Medical Systems, 2012.

Marinus JC Eijkemans, Mark van Houdenhoven, Tien Nguyen, Eric Boersma, Ewout W Steyerberg, and Geert Kazemier. Predicting the unpredictable: a new prediction model for operating room times using individual characteristics and the surgeon's estimate. The Journal of the American Society of Anesthesiologists, 2010.

Enis Kayış, Taghi T Khaniyev, Jaap Suermondt, and Karl Sylvester. A robust estimation model for surgery durations with temporal, operational, and surgery team effects. Health Care Management Science, 2015.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.

Quoc V Le, Alex J Smola, and Stéphane Canu. Heteroscedastic Gaussian process regression. In ICML, 2005.

Zachary C Lipton, David C Kale, and Randall Wetzel. Modeling missing data in clinical time series with RNNs. In Machine Learning for Healthcare, 2016.

David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In International Conference on Neural Networks. IEEE, 1994.

Figure 6: Total over-booking and under-booking errors for different models as we adjust predicted values. Predictions were selected by considering different percentiles of the distribution predicted for each case. For homoscedastic models, this was equivalent to multiplying each prediction by a constant value.

Sujit K Sahu, Dipak K Dey, Helen Aslanidou, and Debajyoti Sinha. A Weibull regression model with gamma frailties for multivariate survival data. Lifetime Data Analysis, 3(2):123–137, 1997.

Peter M Williams. Using neural networks to model conditional multivariate densities. Neural Computation, 1996.

World Bank. Health expenditure, total (% of GDP), 2014. URL http://data.worldbank.org/indicator/SH.XPD.TOTL.ZS