Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration
Betty van Aken, Jens-Michalis Papaioannou, Manuel Mayrdorfer, Klemens Budde, Felix A. Gers, Alexander Löser
Beuth University of Applied Sciences Berlin, Charité Berlin
{bvanaken,michalis.papaioannou,gers,aloeser}@beuth-hochschule.de
{manuel.mayrdorfer,klemens.budde}@charite.de

Abstract
Outcome prediction from clinical text can prevent doctors from overlooking possible risks and help hospitals to plan capacities. We simulate patients at admission time, when decision support can be especially valuable, and contribute a novel admission to discharge task with four common outcome prediction targets: diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction. The ideal system should infer outcomes based on symptoms, pre-conditions and risk factors of a patient. We evaluate the effectiveness of language models to handle this scenario and propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources. We further present a simple method to incorporate ICD code hierarchy into the models. We show that our approach improves performance on the outcome tasks against several baselines. A detailed analysis reveals further strengths of the model, including transferability, but also weaknesses such as handling of vital values and inconsistencies in the underlying data.
Introduction

Clinical professionals make decisions about patients under strong time constraints. The patient information at hand is often unstructured, e.g. in the form of clinical notes written by other medical personnel in limited time. Clinical decision support (CDS) systems can help in these scenarios by pointing towards related cases or certain risks. Clinical outcome prediction is a fundamental task of CDS systems, in which the patient's development is predicted based on data from their Electronic Health Record (EHR). In this work we focus on textual EHR data available at admission time. Figure 1 shows a sample admission note with highlighted parts that, according to medical doctors, must be considered when evaluating a patient.
Encoding clinical notes with pre-trained language models. Neural models need to extract relevant facts from such notes and learn complex relations between them in order to associate certain clinical outcomes. Pre-trained language models such as BERT (Devlin et al., 2019) have been shown both to extract information from noisy text and to capture task-specific relations in an end-to-end fashion (Tenney et al., 2019; van Aken et al., 2019). We thus base our work on these models and pose the following questions:

• Can pre-trained language models learn to predict patient outcomes from their admission information only?
• How can we integrate knowledge about outcomes that doctors gain from medical literature and previous patients?
• How well would these models work in clinical practice? Are they able to interpret common risk factors? Where are they failing?
Simulating patients at admission time. Existing work on text-based outcome prediction focuses on progress notes after a certain time of a patient's hospitalisation (Huang et al., 2019). This is mostly due to a lack of publicly available admission notes and poses some problems: 1) Doctors might miss specific outcome risks early in admission and 2) progress notes already contain information about clinical decisions made at admission time (Boag et al., 2018). We propose to simulate newly arrived patients by extracting admission notes from MIMIC III discharge summaries. We are thus able to give doctors hints towards possible outcomes from the very beginning of an admission and can potentially prevent early mistakes. We can also help hospitals in planning resources by indicating how long a patient might stay hospitalised.

Figure 1: Admission to discharge sample that demonstrates the outcome prediction task. The model has to extract patient variables and learn complex relations between them in order to predict the clinical outcome.

[Figure 1 content, admission side:]
PRESENT ILLNESS: 58yo man w/ hx of hypertension, AFib on coumadin and NIDDM presented to ED with the worst headache of his life. He had a syncopal episode and was intubated by EMS. Medication on admission: 1mg IV ativan x 1.
PHYSICAL EXAM: Vitals: P: 92 R: 13 BP: 151/72 SaO2: 99% intubated. GCS E: 3 V: 2 M: 5. HEENT: atraumatic, normocephalic. Pupils: 4-3mm [...]
FAMILY HISTORY: Mother had stroke at age 82. Father unknown.
SOCIAL HISTORY: Lives with wife. 25py. No EtOH.
[Discharge side:]
DIAGNOSES: 430 Subarachnoid Hemorrhage; Essential Hypertension; Diabetes Mellitus [...]
PROCEDURES: 397 Endovascular Repair of Vessel; Continuous Invasive Mechanical Ventilation [...]
IN-HOSPITAL MORTALITY: Not deceased.
LENGTH OF STAY: > 14 days.
[Highlight legend: Symptoms & Vitals, General Risk Factors, Medications, Pre-Conditions]
Integrating knowledge with specialised outcome pre-training. Gururangan et al. (2020) recently emphasized the importance of domain- and task-specific pre-training for deep neural models. Consequently we propose to enhance language models pre-trained on the medical domain with a task-specific clinical outcome pre-training. Besides processing clinical language with idiosyncratic and specialized terms, our models are thus able to learn about patient trajectories and symptom-disease associations in a self-supervised manner. We derive this knowledge from two main sources: 1) previously admitted patients and their outcomes, knowledge that hospitals usually store in unlabelled clinical notes, and 2) scientific case reports and knowledge bases that describe diseases, their presentations in patients and prognoses. We introduce a method for incorporating these sources by creating a suitable pre-training objective from publicly available data.
Contributions. We summarize the major contributions of this work as follows:
1) A novel task setup for clinical outcome prediction that simulates the patient's admission state and predicts the outcome of the current admission.
2) We introduce self-supervised clinical outcome pre-training, which integrates knowledge about patient outcomes into existing language models.
3) We further propose a simple method that injects hierarchical signals into ICD code prediction.
4) We compare our approaches against multiple baselines and show that they improve performance on four relevant outcome prediction tasks with up to 1,266 classes. We show that the models are transferable by applying them to a second public dataset without additional fine-tuning.
5) We present a detailed analysis of our model that includes a manual evaluation of samples conducted by medical professionals.
Related Work

Using clinical notes for outcome prediction. Boag et al. (2018) studied the predictive value of clinical notes with simple approaches such as bag-of-words. Recent work increasingly applies neural models to compensate for the noisy nature of the data and the complexity of patterns. Hashir and Sawhney (2020) used both convolutional and recurrent layers for outcome prediction, while Jain et al. (2019) and Qiao et al. (2019) proposed attention-based approaches. Dligach et al. (2019) explored pre-training as a strategy to mitigate data sparsity in clinical setups. Si and Roberts (2019) and Suresh et al. (2018) further showed that outcome prediction benefits from a multitask setup. In contrast to earlier work we apply neural models to admission notes in an admission to discharge setup.
Pre-trained language models for the clinical domain. While pre-trained language models are successful in many areas of NLP, there has been little work on applying them to the clinical domain (Qiu et al., 2020). Alsentzer et al. (2019) and Huang et al. (2019) both pre-trained BERT-based models on clinical data. They evaluated their work on readmission prediction and other NLP tasks. We are the first to evaluate pre-trained language models on multiple clinical outcome tasks with large label sets. We further propose a novel pre-training objective specifically for the clinical domain.

Prediction of diagnoses and procedures. The majority of work on diagnosis and procedure prediction covers either single diagnoses (Liu et al., 2018; Choi et al., 2018) or coarse-grained groups (Peng et al., 2020; Sushil et al., 2018). We argue that models should predict diseases and procedures in a fine-grained manner to be beneficial for doctors. Thus we use all diagnosis and procedure codes from the data for our outcome prediction tasks.
ICD coding vs. outcome prediction. There is a variety of work in the related field of automated ICD coding (Xie et al., 2018; Falis et al., 2019). Zhang et al. (2020) recently presented a model able to identify up to 2,292 ICD codes from text. However, ICD coding differs from outcome prediction in that diseases are directly extracted from text rather than inferred from symptom descriptions and patient history. We further discuss this distinction in Section 6.
Admission to Discharge Task

Clinical outcome prediction can be defined in different ways. We approach the task from a doctor's perspective and predict the outcome of a current admission from the time of the patient's arrival at the hospital unit. We describe our setup as follows.
As our primary data source, we use the freely available MIMIC III v1.4 database (Johnson et al., 2016). It contains de-identified EHR data including clinical notes in English from the Intensive Care Unit (ICU) of Beth Israel Deaconess Medical Center in Massachusetts between 2001 and 2012. We focus our work on discharge summaries in particular and the outcome information associated with an admission. Similar to previous work, we filter out notes about newborns and remove duplicates.
The state of a patient is commonly summarized in an ongoing document, which finally concludes in a discharge summary.
Admission Notes Statistics

                   avg     std
words / doc        396.3   233.3
sentences / doc    32.5    23.1

Table 1: Number of words / sentences in MIMIC III admission notes. We see a high variation in length.
Multi-label tasks: ICD-9 codes per dataset split

             Total   Train   Val   Test
Diagnoses    1,266
Procedures     711     672   476    563

Table 2: Distribution of ICD-9 codes per dataset split (patient-wise). Note that very rare codes do not appear in each split of the dataset.
Single-label tasks: samples per class

Mortality: 0 (not deceased) / 1 (deceased)
Length of Stay (in days): ≤ 3 / 3–7 / 7–14 / > 14

Table 3: Distribution of labels for the Mortality Prediction and Length of Stay tasks. Both tasks have unbalanced class distributions.

Since we want to support clinical decisions from the beginning of a patient's stay, we simulate the state of the patient's document at admission time. We thus filter the document by sections that are known at admission such as:
Chief complaint, (History of) Present illness, Medical history, Admission Medications, Allergies, Physical exam, Family history and
Social history. We further describe the filtering in Appendix B.1. Our approach results in 48,745 admission notes. As shown in Table 1 the notes contain about 400 words on average. The selection of admission sections as well as the resulting structure of the notes were verified by medical doctors. This newly created admission dataset enables us to make predictions on the outcome of a current admission. At inference time, doctors can then use the model's predictions on textual data from newly arrived patients.
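As an illustration of this filtering, the sketch below extracts admission-time sections from a discharge summary via pattern matching on section headings. The heading list and the regular expression are simplified assumptions for illustration, not the exact rules used to build the dataset (those are described in Appendix B.1).

```python
import re

# Simplified, assumed subset of section headings known at admission time.
ADMISSION_SECTIONS = {
    "chief complaint", "history of present illness", "present illness",
    "past medical history", "medications on admission", "allergies",
    "physical exam", "family history", "social history",
}

# Assumed heading pattern: a label followed by a colon at the start of a
# line, e.g. "CHIEF COMPLAINT:".
HEADING_RE = re.compile(r"^([A-Za-z][A-Za-z' /]+):", re.MULTILINE)

def extract_admission_note(discharge_summary: str) -> str:
    """Keep only sections known at admission time; drop everything that
    reveals the hospital course or discharge information."""
    headings = list(HEADING_RE.finditer(discharge_summary))
    kept = []
    for i, match in enumerate(headings):
        name = match.group(1).strip().lower()
        end = headings[i + 1].start() if i + 1 < len(headings) else len(discharge_summary)
        if name in ADMISSION_SECTIONS:
            kept.append(discharge_summary[match.start():end].strip())
    return "\n\n".join(kept)
```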
We select four relevant tasks for outcome prediction in consultation with medical professionals. All tasks take admission notes as input.
Diagnosis prediction. A main goal of clinical outcome prediction is to support medical professionals in the process of differential diagnosis. We thus take all diagnoses associated with an admission into account and frame the task as extreme multi-label classification. Diagnoses are encoded as ICD-9 codes in the MIMIC III database. Following Choi et al. (2017), we group the 4-digit ICD-9 diagnosis codes from the database into 3-digit codes to reduce complexity while still obtaining granular suggestions. This results in a total of 1,266 diagnosis codes, which are distributed over our dataset splits as shown in Table 2. The labels are power-law distributed with a long tail of very rare codes.
Figure 2: Schematic demonstration of clinical outcome pre-training. Sources of clinical knowledge are complete patient notes (from MIMIC III, i2b2 and MTSamples) and medical articles (from PubMed, Wikipedia and MedQuAD). Based on that we create a self-supervised learning objective that teaches relations between symptoms, risk factors and outcomes. [The figure shows example input pairs, e.g. '[CLS] Former for 20-30 years. [SEP] The aorta is ectatic with eccentric ...' with the question 'Does this outcome match the patient?' (label: True), and '[CLS] ... skin lesions with sometimes itching. [SEP] Delivery of whole brain radiotherapy ...' with 'Does this treatment match the symptoms?' (label: False).]
Procedure prediction. Procedures are either diagnostics or treatments applied to a patient during a stay. Similar to diagnosis prediction, this is an extreme multi-label task. We again group the ICD-9 codes from the MIMIC III database into 3-digit codes. In total there are 711 procedure codes labelled in the database, in a power-law distribution similar to the diagnosis codes.
In-hospital mortality prediction. Predicting a patient's mortality risk is a fundamental part of the triage process. In-hospital mortality in particular describes whether a patient died during the current admission and is a binary classification task. The percentage of deceased patients in the data is around 10% (see Table 3). As some notes contain direct indications of mortality such as patient deceased within the admission sections, we apply an additional filter for those terms.
Length-of-stay prediction. The duration of an ICU stay is important information for hospitals in order to plan the allocation of resources. We group patients into four major categories regarding their length of stay: under 3 days, 3 to 7 days, 1 week to 2 weeks, and more than 2 weeks. These categories were recommended by medical doctors in order to make the results as useful as possible in clinical practice. Table 3 shows the samples per class.
We propose clinical outcome pre-training, a way to integrate knowledge about clinical outcomes into pre-trained language models. We further introduce an additional step to incorporate the ICD code hierarchy into our multi-label classification tasks.

Clinical outcome pre-training. Language model pre-training has been shown to be of use in specialised domains like the clinical one (Alsentzer et al., 2019; Huang et al., 2019). However, these models lack knowledge about patient trajectories and symptom-diagnosis relations, because their training is focused on learning language characteristics. We develop an additional pre-training step that produces Clinical Outcome Representations (CORe) in order to teach the model relations between symptoms, risk factors and clinical outcomes. Much of this knowledge is present and publicly available, e.g. in knowledge bases like Wikipedia or publication archives like PubMed. Another source is available to hospitals in the form of unlabelled clinical notes from previous patients. The suggested outcome pre-training is a way to use this knowledge to improve the model's capabilities in predicting clinical outcomes as described in 3.3. Corresponding to the way doctors gain their knowledge from both experience and medical literature, we incorporate knowledge from complete patient notes (including discharge information) and medical articles. The code to recreate the experiments and datasets described in this paper is accessible at: https://github.com/bvanaken/clinical-outcome-prediction
Training objective. Our proposed training objective (Figure 2) is strongly related to the Next Sentence Prediction (NSP) task introduced by Devlin et al. (2019). In NSP the model gets two sentences as input and predicts whether the second follows the first sentence. This way models such as BERT learn relations between sentences. We convert this setting so that the model instead learns relations between admissions and outcomes. From common sections in patient notes, we create two categories: sections that are created at admission (A) and sections that are created after admission, e.g. at discharge time (D). Given a patient note N, we split it into sections A_N ∈ A and D_N ∈ D. We remove all other sections. We then sample token sequences from these sections to get t_{N,1...k} ∈ A_N and t'_{N,1...k} ∈ D_N, where k is randomly set between 30 and 50 tokens. We then train the model to maximize P(SamePatient | X_{N,N}) and P(OtherPatient | X_{N,M}) with

    X_{N,N} = Enc(t_{N,1...k}, t'_{N,1...k})
    X_{N,M} = Enc(t_{N,1...k}, t'_{M,1...k})        (1)

with M being a randomly sampled document from the same batch and Enc referring to the BioBERT encoding. As in the original NSP setting, we apply negative sampling (X_{N,M}) for 50% of examples. We apply the same strategy to medical articles and case reports, so that A represents sections describing symptoms and risk factors, and D represents sections that describe outcomes of a disease or case.
Data sources. We create the pre-training dataset from multiple public sources. To integrate knowledge that doctors gain from previous patients and medical literature, we create two groups of sources: 1) Patients, which includes 32,721 discharge summaries from the MIMIC III training set, 5,000 publicly available medical transcriptions from the MTSamples website (https://mtsamples.com) and 4,777 clinical notes from the i2b2 challenges 2006-2012 (Uzuner et al., 2007, 2008, 2010a,b, 2011, 2012; Sun et al., 2013b,a). We exclude notes from the 2014 De-identification and Heart Disease Risk Factors Challenge in order to use this set for evaluation as described in Section 5.4. 2) Articles, composed of 9,335 case reports from PubMed Central (PMC), 2,632 articles from Wikipedia describing diseases and 1,467 article sections from the MedQuAD dataset (Abacha and Demner-Fushman, 2019) extracted from NIH websites such as cancer.gov. While Patients samples contain unaudited practical knowledge, Articles samples are built from verified general medical knowledge such as peer-reviewed studies. The sources are therefore substantially different and we evaluate their individual effect on performance in Section 5.3.
Data preparation. We create admission (A_N) and discharge parts (D_N) of the documents based on section headings. We define common sections belonging to the admission part and those belonging to the discharge part, similar to the method described in Section 3.2. We ignore sections that cannot be categorized. For section heading extraction from MIMIC III discharge summaries and MTSamples transcriptions, we apply simple rule-based approaches, which is feasible because the notes are well-structured. For Wikipedia we use headings from the WikiSection dataset (Arnold et al., 2019) filtered for disease articles only. For PubMed Central we similarly use the PubMedSection dataset (Schneider et al., 2020) and filter for section headings that indicate case reports. As i2b2 notes are less well-structured in comparison to MIMIC III discharge summaries, we use a classifier as proposed by Rosenthal et al. (2019) to determine which section a sentence belongs to. The classifier is trained on an annotated set of i2b2 notes and then applied to all other notes.
ICD code hierarchy. Diagnosis and procedure prediction requires the model to predict ICD-9 codes in a multi-label manner. ICD-9 codes are hierarchically ordered into associated groups. Figure 3 shows the code hierarchy for Malignant hypertensive renal disease with the ICD-9 code 403.0. The diagnosis has two parent groups, namely Hypertension renal disease and Diseases of the circulatory system. Diagnoses or procedures in the same group often share similar medical characteristics; therefore the hierarchical relations of a labelled code can be valuable information. This medical information is currently not integrated into the model. The same holds for the words describing the ICD-9 codes, which often represent further important signals, such as the words renal or malignant.
390-459 Diseases of the circulatory system
  401 Essential Hypertension
  403 Hypertension renal disease
    403.0 Malignant hypertensive renal disease
    403.1 Benign hypertensive renal disease

Figure 3: Example of ICD+ labelling. Malignant hypertensive renal disease is assigned to nine labels that inform about the type and group of the disease.
Enhancing training with useful additional signals. We propose a simple method, ICD+, to incorporate both associated groups and words into the model weights: Instead of only classifying 3-digit codes (as mentioned in 3.3), we let the model additionally predict the 4-digit codes and the bag of words associated with a code and its parent groups. In order to create the bag of words per code, we use the descriptions of ICD-9 codes from MIMIC III and remove all stop words. As shown in Figure 3, the ICD+ method assigns eight additional labels to the example diagnosis and therefore supplies the model with further information about the diagnosis during training. By increasing the number of labels per sample, we integrate relevant medical knowledge and enable the model to learn implicit relations between codes and code groups that share certain words. We evaluate the effectiveness of ICD+ in Section 5.
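A sketch of this label expansion, with a hypothetical description lookup and stop-word list standing in for the MIMIC III ICD-9 description tables; the exact set of parent groups included per code follows the paper's description only loosely.

```python
# Assumed stop-word list; the paper derives code descriptions from the
# MIMIC III ICD-9 tables and removes all stop words.
STOP_WORDS = {"of", "and", "the", "with"}

def icd_plus_labels(code: str, parents: list, descriptions: dict) -> set:
    """Expand a single ICD-9 label into the ICD+ label set: the code itself,
    its parent group codes, and the bag of words from the descriptions of
    the code and its parents (stop words removed)."""
    labels = {code, *parents}
    for c in labels.copy():
        for word in descriptions.get(c, "").lower().split():
            if word not in STOP_WORDS:
                labels.add(word)
    return labels

# Example based on Figure 3 (hypothetical description dict):
icd_plus_labels(
    "4030", ["403", "390-459"],
    {"4030": "Malignant hypertensive renal disease",
     "403": "Hypertension renal disease",
     "390-459": "Diseases of the circulatory system"})
```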
We pre-train the CORe model on top of BioBERT weights. We then fine-tune the model separately on the four outcome tasks. We use the same training regimen for both pre-training and fine-tuning: We tokenize the texts with WordPiece tokenization and truncate them to 512 tokens, due to the limited context length of the pre-trained models. We use early stopping and tune hyperparameters as described in Appendix C. We choose BioBERT as the base for our model because it outperforms BERT on medical tasks and has not seen data from our test set during pre-training, unlike DischargeBERT.
In the following, we introduce the baseline models that we evaluate on the novel outcome prediction tasks. In order to understand the abilities of pre-trained language models we compare their performance against more traditional approaches. The first three models (BOW, word embeddings, CNN) are trained using the hyperparameters proposed by the authors for outcome prediction tasks. The language models are fine-tuned the same way as the CORe model.
Bag-of-Words. Boag et al. (2018) show that a simple bag-of-words (BOW) approach can outperform more complex models on tasks like mortality prediction. We thus include their approach in our evaluation. We adopt their training setting except that we consider the 200 instead of 20 top tf-idf words in order to make the model converge.
Pre-trained word embeddings. Boag et al. (2018) further propose the use of pre-computed word embeddings that were trained on MIMIC III data. We use the same setting as for the BOW approach and fit a support vector machine classifier on the clinical outcome tasks.
Convolutional Neural Network (CNN). Si and Roberts (2019) built a neural network for mortality prediction with two hierarchical convolutional layers at the word and sentence levels, aggregated into a patient-level representation. We follow their approach to evaluate the model on our four admission to discharge tasks.
BioBERT. Following the success of BERT, Lee et al. (2020) further pre-trained the model on biomedical research articles from PubMed using abstracts and full-text articles. They reported improved performance on a range of biomedical text mining tasks.
ClinicalBERT and DischargeBERT. We further evaluate two public language models pre-trained on the clinical domain, with MIMIC III data in particular. Huang et al. (2019) pre-trained a BERT Base model on 100,000 random clinical notes (ClinicalBERT), while Alsentzer et al. (2019) further pre-trained BioBERT on all discharge summaries from MIMIC III (we refer to the model as DischargeBERT for simplicity).

                                        Diagnoses       Procedures     In-Hospital Mortality   Length-of-Stay
                                        (1266 classes)  (711 classes)  (2 classes)             (4 classes)
BOW (Boag et al., 2018)                 75.87           77.47          79.15                   65.83
Embeddings (Boag et al., 2018)          75.16           76.72          79.94                   66.78
CNN (Si and Roberts, 2019)              61.18           73.13          75.50                   64.49
BERT Base (Devlin et al., 2019)         82.08           85.84          81.13                   70.40
ClinicalBERT (Huang et al., 2019)       81.99           86.15          82.20                   71.14
DischargeBERT (Alsentzer et al., 2019)
BioBERT Base (Lee et al., 2020)         82.81           86.36          82.55                   71.59
BioBERT ICD+                            83.17           87.45          -                       -
CORe Articles (w/o ICD+)                (82.89)         (86.75)
CORe Patients (w/o ICD+)                (83.40)         (86.60)
CORe All (w/o ICD+)                     (83.39)         (87.15)

Table 4: Results on outcome prediction tasks in macro-averaged % AUROC. The CORe models outperform the baselines; ICD+ adds further improvement (values in parentheses are ablation results without ICD+). DischargeBERT results are printed in italics because the model has seen all test data during pre-training and is therefore slightly advantaged.
Table 4 shows performances in macro-averaged area under the receiver operating characteristic curve (AUROC). We report scores of the CORe model trained only on Articles, only on Patients, and in a combined training setting (CORe All). We evaluate diagnosis and procedure prediction both with and without the ICD+ method on BioBERT and the CORe models. In both scenarios we evaluate on 3-digit ICD codes only, in order to maintain comparability between the methods.
Pre-trained models outperform baselines. We see that the evaluated pre-trained language models clearly outperform the BOW, word embeddings and CNN approaches. We further observe that the CORe models improve scores on all tasks in comparison to the baseline models, except for DischargeBERT, which reaches a higher score in mortality prediction, probably affected by its exposure to the test data. This shows that even though the language models are trained on similar data (e.g. PubMed and/or clinical notes), the specific outcome pre-training improves the model's ability to predict clinical outcome targets. Pre-training on Patients and on Articles achieves similar improvements over the baselines, while the combined training is the most effective. An exception is procedure prediction, where pre-training on Patients achieves the highest score. A probable reason is that procedures are documented in more detail in clinical notes, especially since our selection of medical articles focuses on diseases rather than procedures.
Predicting mortality risk is easier than length of stay. We see that the models reach higher scores in the binary mortality task than in length of stay prediction. Even a simple BOW approach can reach a relatively high score, which indicates that most of the notes contain clear hints towards an increased mortality risk. On the other hand, the length of stay task is difficult due to the many factors that can contribute to the length of a patient's stay after admission, including nonclinical factors such as the patient's insurance situation (Khosravizadeh et al., 2016).
ICD hierarchy improves diagnosis and procedure predictions. Table 4 shows an ablation test without the ICD+ method (in parentheses). We see that both the BioBERT model and the CORe models improve when incorporating code hierarchy and relations through ICD+ into the training process. This is especially visible for ICD procedures, where the hierarchical and textual information, e.g. that a Nephropexy is an operation on the kidney, can add important signals during training.

i2b2 Diagnoses (% AUROC)
BioBERT ICD+     80.43
CORe Articles    81.46
CORe Patients
CORe All         81.15

Table 5: Results on the i2b2 diagnosis prediction task (5 classes) in % AUROC. The models reach similar results as on the MIMIC III data, indicating their transferability to other data sources without additional fine-tuning.

Figure 4: Impact of age on mortality prediction on 20 random samples (x-axis: age from 18 to [**Age over 90**]; the dotted line marks the averaged mortality in MIMIC III). Mortality risk and age mostly increase proportionally as intended, with certain peaks that might indicate unintended biases in the data.
In order to verify that the fine-tuned models are transferable to ICU data from other sources, we apply them to data from the i2b2 De-identification and Heart Disease Risk Factors Challenge (Stubbs et al., 2015). We convert the clinical notes to admission notes as further described in Appendix B.2, which results in 1,118 samples labelled with up to five ICD-9 codes.
Models generalize to i2b2 data. We apply our MIMIC III-based models to predict diagnosis codes for the i2b2 notes without further fine-tuning. We then evaluate based on whether the predictions contain the five mentioned ICD-9 codes. The results in macro-averaged % AUROC are shown in Table 5. Even though the clinical notes differ from the MIMIC III notes in structure and writing style, the tested models are mostly able to identify the conditions. The scores are comparable to the MIMIC III results, which shows that the models are able to generalise to data from different sources such as other hospitals.
Analysis

Clinical outcome prediction is a sensitive task. We therefore conduct an extensive analysis of the CORe All model, including a manual error analysis by medical doctors on 20 randomly chosen samples, to understand how the model would perform in clinical practice. Our demo application used for this analysis is available at: https://outcome-prediction.demo.datexis.com

                                   % AUROC
All Diagnoses                      83.54
Diagnoses Mentioned in Text
Diagnoses Not Mentioned in Text    82.35

Table 6: Analysis of the impact of directly mentioned diagnoses on the diagnosis prediction task. Mentioned diagnoses are detected more reliably, though scores on unmentioned diagnoses see only a small decrease compared to the overall score.
Mentioned vs. unmentioned diagnoses. We observe that a majority of coded diseases are already mentioned in the admission text. This is mainly due to chronic diseases (e.g. diabetes mellitus) or to conditions that were identified prior to the ICU admission (e.g. in the emergency ward). We want to know whether our model is also able to predict diagnoses that are not mentioned in the text. We annotate the admission texts with ICD-9 diagnosis codes using the methodology described by Searle et al. (2020). We then evaluate on codes that were explicitly mentioned in the text and those that were not. Table 6 shows that the model indeed extracts many diagnoses directly from the text and thus reaches a higher score on mentioned diagnoses. On the other hand, performance on non-mentioned diagnoses drops only slightly, indicating that the model has also learned to predict non-mentioned diagnoses.
How do age and gender impact predictions? Age and gender are common risk factors with significant impact on the potential clinical outcome of a patient. We want our models to learn that impact without overestimating it. We test the model's behaviour by switching age and gender throughout 20 random samples and analyse how the mortality prediction changes. For each sample we manually switch the age mention and iterate over it from 18 until [**Age over 90**] (MIMIC III de-identifies age information for patients older than 89). Figure 4 shows that the analysed samples show a high variation in mortality risk and that age only impacts the prediction partially. In all cases the prediction increases with age, as expected from a medical perspective. We also observe some peaks without a medical reason that are caused by the mortality of certain age groups in the original data (black dotted line). This demonstrates that the model does not follow medical reasoning but merely statistical observations. We similarly switch the gender mention and all pronouns in the texts and observe that the mortality prediction for male patients is increased by 5% on average, consistent with medical rationale.
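A sketch of how such a perturbation analysis can be scripted; the age regular expression and the class index for "deceased" are assumptions for illustration.

```python
import re
import torch

AGE_RE = re.compile(r"\b\d{2}yo\b")  # assumed age pattern, e.g. "58yo"

def mortality_vs_age(note: str, model, tokenizer) -> list:
    """Sweep the age mention in a note and record the predicted mortality
    risk, mirroring the perturbation analysis behind Figure 4."""
    risks = []
    for age in range(18, 90, 4):
        perturbed = AGE_RE.sub(f"{age}yo", note)
        inputs = tokenizer(perturbed, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        risks.append((age, probs[0, 1].item()))  # class 1 = deceased (assumed)
    return risks
```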
Where is the model failing?

1. Negation: While our error analysis shows that negation does not generally falsify the model's predictions, we find single samples in which especially medical-specific negations, such as abstinent from alcohol, are misinterpreted by the model, e.g. as alcohol dependence syndrome.

2. Numerical data: Wallace et al. (2019) show BERT's inability to interpret numbers. We observe this in cases where the model does not interpret life-threatening vital values (such as a temperature over 105 °F) as an increased mortality risk. Clinical notes contain many such relevant values; improving the encoding of such data is thus an important goal for future work.
Under-coded labels. Our error analysis reveals that 60% of the analysed samples are partially under-coded: they contain indicators for a diagnosis or procedure but miss the corresponding ICD-9 code. This is consistent with results from Searle et al. (2020) showing that MIMIC III is up to 35% under-coded. Additionally we find that procedures that are almost always performed in the ICU, such as Puncture of vessel, are often coded inconsistently. While a doctor can infer these labels with medical common sense, they pose a challenge to our models. We therefore suggest a critical view towards the data and welcome additional clinical datasets to compensate for noisy labels.
Multiple possible outcomes. 85% of the analysed samples contain false positive predictions that the doctors still consider medically reasonable. This demonstrates that there are many possible clinical pathways and that some might not be foreseeable at admission time. We also see many cases in which the information in the clinical note is not sufficient and therefore allows multiple interpretations. For future work, we propose including further EHR data as suggested by Khadanga et al. (2019) to extend the patient representation in these scenarios.
Conclusion

We reframe the task of clinical outcome prediction to consider the admission state of a patient and support doctors in their initial decision process. We show that current state-of-the-art language models outperform selected baselines on this task and present methods for further improvement: outcome pre-training enables our models to learn from unlabelled sources and ICD+ incorporates hierarchical and textual ICD representations into our models. For future work, we suggest considering pre-trained language models with larger context sizes (Beltagy et al., 2020; Zaheer et al., 2020) and languages other than English (Reys et al., 2020). We further encourage work on the semantic encoding of negated terms and numerical data from clinical text.
Acknowledgments
We would like to thank Anjali Grover and Sebastian Herrmann for their support throughout the project. Our work is funded by the German Federal Ministry for Economic Affairs and Energy (BMWi) under grant agreements 01MD19003B (PLASS) and 01MK2008MD (Servicemeister).
References
Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC Bioinformatics, 20(1):511:1–511:23.
Betty van Aken, Benjamin Winter, Alexander Löser, and Felix A. Gers. 2019. How Does BERT Answer Questions?: A Layer-Wise Analysis of Transformer Representations. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, pages 1823–1832, Beijing, China. ACM.
Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. ACL.
Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers, and Alexander Löser. 2019. SECTOR: A Neural Model for Coherent Topic Segmentation and Classification. Transactions of the Association for Computational Linguistics, TACL, 7:169–184.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. Computing Research Repository, arXiv/2004.05150.
Willie Boag, Dustin Doss, Tristan Naumann, and Peter Szolovits. 2018. What's in a Note? Unpacking Predictive Value in Clinical Note Representations. AMIA Summits on Translational Science Proceedings, 2018:26–34.
Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, and Jimeng Sun. 2017. Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. In Proceedings of the Machine Learning for Healthcare Conference, MLHC, volume 68 of Proceedings of Machine Learning Research, pages 286–305, Boston, Massachusetts. PMLR.
Edward Choi, Cao Xiao, Walter F. Stewart, and Jimeng Sun. 2018. MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 4552–4562, Montréal, Canada.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, USA. ACL.
Dmitriy Dligach, Majid Afshar, and Timothy A. Miller. 2019. Toward a clinical text encoder: pretraining for clinical natural language processing with applications to substance misuse. Journal of the American Medical Informatics Association, 26(11):1272–1278.
Matúš Falis, Maciej Pajak, Aneta Lisowska, Patrick Schrempf, Lucas Deckers, Shadia Mikhael, Sotirios A. Tsaftaris, and Alison O'Neil. 2019. Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis, LOUHI@EMNLP 2019, pages 168–177, Hong Kong, China. ACL.
Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 8342–8360, Online. ACL.
Mohammad Hashir and Rapinder Sawhney. 2020. Towards unstructured mortality prediction with free-text clinical notes. Journal of Biomedical Informatics, 108:103489.
Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. In Proceedings of the ACM Conference on Health, Inference, and Learning, CHIL 2020, Online. ACM.
Sarthak Jain, Ramin Mohammadi, and Byron C. Wallace. 2019. An Analysis of Attention over Clinical Notes for Predictive Tasks. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 15–21, Minneapolis, Minnesota, USA. ACL.
Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.
Swaraj Khadanga, Karan Aggarwal, Shafiq R. Joty, and Jaideep Srivastava. 2019. Using Clinical Notes with Time Series Data for ICU Management. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pages 6431–6436, Hong Kong, China. ACL.
Omid Khosravizadeh, Soudabeh Vatankhah, Peivand Bastani, Rohollah Kalhor, Samira Alirezaei, and Farzane Doosty. 2016. Factors affecting length of stay in teaching hospitals of a middle-income country. Electronic Physician, 8(10):3042–3047.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Jingshu Liu, Zachariah Zhang, and Narges Razavian. 2018. Deep EHR: Chronic Disease Prediction Using Medical Notes. In Proceedings of the Machine Learning for Healthcare Conference, MLHC 2018, volume 85 of Proceedings of Machine Learning Research, pages 440–464, Palo Alto, California, USA. PMLR.
Xueping Peng, Guodong Long, Tao Shen, Sen Wang, and Jing Jiang. 2020. Self-Attention Enhanced Patient Journey Understanding in Healthcare System. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2020, Online.
Zhi Qiao, Xian Wu, Shen Ge, and Wei Fan. 2019. MNN: Multimodal Attentional Neural Networks for Diagnosis Prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pages 5937–5943, Macao, China.
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. Science China Technological Sciences, 63:1872–1897.
Arthur D. Reys, Danilo Silva, Daniel Severo, Saulo Pedro, Marcia M. de Souza e Sá, and Guilherme A. C. Salgado. 2020. Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes. In Proceedings of the 9th Brazilian Conference on Intelligent Systems (BRACIS), Rio Grande, Brazil.
Sara Rosenthal, Ken Barker, and Zhicheng Liang. 2019. Leveraging Medical Literature for Section Prediction in Electronic Health Records. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4864–4873, Hong Kong, China. ACL.
Rudolf Schneider, Tom Oberhauser, Paul Grundmann, Felix Alexander Gers, Alexander Löser, and Steffen Staab. 2020. Is Language Modeling Enough? Evaluating Effective Embedding Combinations. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4739–4748, Marseille, France. ELRA.
Thomas Searle, Zina M. Ibrahim, and Richard J. B. Dobson. 2020. Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, BioNLP 2020, pages 76–85, Online. ACL.
Yuqi Si and Kirk Roberts. 2019. Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction. AMIA Summits on Translational Science Proceedings, 2019:779–788.
Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics, 58:S11–S19.
Amber Stubbs and Özlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics, 58:S20–S29.
Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. 2013a. Annotating temporal information in clinical narratives. Journal of Biomedical Informatics, 46(6):S5–S12.
Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. 2013b. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association, 20(5):806–813.
Harini Suresh, Jen J. Gong, and John V. Guttag. 2018. Learning Tasks for Multitask Learning: Heterogenous Patient Populations in the ICU. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, pages 802–810, London, UK. ACM.
Madhumita Sushil, Simon Šuster, Kim Luyckx, and Walter Daelemans. 2018. Patient representation learning and interpretable evaluation using clinical notes. Journal of Biomedical Informatics, 84:103–113.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA.
Özlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R. South. 2012. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association, 19(5):786–791.
Özlem Uzuner, Ira Goldstein, Yuan Luo, and Isaac S. Kohane. 2008. Viewpoint Paper: Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association, 15(1):14–24.
Özlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification. Journal of the American Medical Informatics Association, 14(5):550–563.
Özlem Uzuner, Imre Solti, and Eithon Cadag. 2010a. Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5):514–518.
Özlem Uzuner, Imre Solti, Fei Xia, and Eithon Cadag. 2010b. Community annotation experiment for ground truth generation for the i2b2 medication challenge. Journal of the American Medical Informatics Association, 17(5):519–523.
Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.
Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP Models Know Numbers? Probing Numeracy in Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pages 5306–5314, Hong Kong, China. ACL.
Pengtao Xie, Haoran Shi, Ming Zhang, and Eric P. Xing. 2018. A Neural Architecture for Automated ICD Coding. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Volume 1: Long Papers, pages 1066–1076, Melbourne, Australia. ACL.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, virtual.
Zachariah Zhang, Jingshu Liu, and Narges Razavian. 2020. BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, ClinicalNLP@EMNLP 2020, pages 24–34, Online. ACL.
A Distribution of Diagnosis and Procedure Labels

Figure 5 and Figure 6 show the distributions of labels in the diagnosis and procedure prediction training sets. Both distributions follow a power law with a long tail of rare codes.

Figure 5: Distribution of ICD-9 diagnosis codes in the MIMIC III training set.

Figure 6: Distribution of ICD-9 procedure codes in the MIMIC III training set.
B Pre-Processing Clinical Notes
B.1 Admission Notes From Discharge Summaries
We use MIMIC III discharge summaries that contain aggregated information about a patient such as doctor's assessments, relevant lab values, medications, and the patient's history. In order to filter the documents by admission sections, we first split all discharge summaries into sections with simple pattern matching. Together with clinical professionals, we then evaluated discharge summaries and identified sections that are known at admission time. We remove all other sections and thus hide information about the further hospital course and discharge of a patient. We exclude notes that do not contain any of the admission sections. We further apply a patient-wise split into train, validation and test sets with a 70/10/20 ratio.
B.2 Converting i2b2 Data into Admission to Discharge Task
The i2b2 De-identification and Heart Disease Risk Factors Challenge (Stubbs et al., 2015; Stubbs and Uzuner, 2015) introduced a dataset that contains clinical notes and discharge summaries annotated based on risk factors and disease indicators. We convert the data into an admission to discharge task by selecting five of the annotated conditions which correspond to ICD-9 codes as our labels, namely Hypertension (401), Hyperlipidemia (272), Coronary artery disease (414), Diabetes mellitus (250) and Obesity (278). Just like the MIMIC III diagnosis task, samples are annotated in a multi-label fashion. In order to convert the clinical notes to admission notes, we use the dataset from Rosenthal et al. (2019) that contains section labels per sentence. We then exclude sections that are not known at admission time, analogous to Section 3.2.
C Hyperparameter Setting
We use the following setting for pre-training and fine-tuning of the introduced Transformer-based models: We use early stopping and apply a random search for tuning the following hyperparameters on the validation set: learning rate [1e-4 ...].

Figure 7: Top 10 diagnoses by frequency with the scores reached by the CORe All model.

Figure 8: Top 10 procedures by frequency with the scores reached by the CORe All model.
D Results on Top 10 Diagnoses and Procedures
Figures 7 and 8 show the % AUROC scores of our CORe All model on the most frequent labels within the diagnosis and procedure prediction tasks. Figure 7 shows that many chronic diseases such as Essential Hypertension or Chronic ischemic heart disease are among the most common within the MIMIC III dataset and present with relatively high AUROC values. We also observe that very specific codes such as Diabetes mellitus and Bypass Anastomosis are predicted more easily compared to more general codes such as Other and unspecified anemias. Figure 8 further shows the negative influence of inconsistent labeling on standard procedures such as