Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration
Betty van Aken, Jens-Michalis Papaioannou, Manuel Mayrdorfer, Klemens Budde, Felix A. Gers, Alexander Löser
Beuth University of Applied Sciences Berlin, Charité Berlin
{bvanaken,michalis.papaioannou,gers,aloeser}@beuth-hochschule.de
{manuel.mayrdorfer,klemens.budde}@charite.de

Abstract
Outcome prediction from clinical text can prevent doctors from overlooking possible risks and help hospitals to plan capacities. We simulate patients at admission time, when decision support can be especially valuable, and contribute a novel admission to discharge task with four common outcome prediction targets: diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction. The ideal system should infer outcomes based on symptoms, pre-conditions and risk factors of a patient. We evaluate the effectiveness of language models to handle this scenario and propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources. We further present a simple method to incorporate ICD code hierarchy into the models. We show that our approach improves performance on the outcome tasks against several baselines. A detailed analysis reveals further strengths of the model, including transferability, but also weaknesses such as handling of vital values and inconsistencies in the underlying data.
Introduction

Clinical professionals make decisions about patients under strong time constraints. The patient information at hand is often unstructured, e.g. in the form of clinical notes written by other medical personnel in limited time. Clinical decision support (CDS) systems can help in these scenarios by pointing towards related cases or certain risks. Clinical outcome prediction is a fundamental task of CDS systems, in which the patient's development is predicted based on data from their Electronic Health Record (EHR). In this work we focus on textual EHR data available at admission time. Figure 1 shows a sample admission note with highlighted parts that, according to medical doctors, must be considered when evaluating a patient.
Encoding clinical notes with pre-trained language models. Neural models need to extract relevant facts from such notes and learn complex relations between them in order to associate certain clinical outcomes. Pre-trained language models such as BERT (Devlin et al., 2019) have been shown both to extract information from noisy text and to capture task-specific relations in an end-to-end fashion (Tenney et al., 2019; van Aken et al., 2019). We thus base our work on these models and pose the following questions:

• Can pre-trained language models learn to predict patient outcomes from their admission information only?
• How can we integrate knowledge about outcomes that doctors gain from medical literature and previous patients?
• How well would these models work in clinical practice? Are they able to interpret common risk factors? Where are they failing?
Simulating patients at admission time. Existing work on text-based outcome prediction focuses on progress notes after a certain time of a patient's hospitalisation (Huang et al., 2019). This is mostly due to a lack of publicly available admission notes and poses some problems: 1) Doctors might miss specific outcome risks early in admission and 2) progress notes already contain information about clinical decisions made at admission time (Boag et al., 2018). We propose to simulate newly arrived patients by extracting admission notes from MIMIC III discharge summaries. We are thus able to give doctors hints towards possible outcomes from the very beginning of an admission and can potentially prevent early mistakes. We can also help hospitals in planning resources by indicating how long a patient might stay hospitalised.

Figure 1: Admission to discharge sample that demonstrates the outcome prediction task. The model has to extract patient variables and learn complex relations between them in order to predict the clinical outcome.

[Figure 1 content, admission side:]
PRESENT ILLNESS: 58yo man w/ hx of hypertension, AFib on coumadin and NIDDM presented to ED with the worst headache of his life. He had a syncopal episode and was intubated by EMS. Medication on admission: 1mg IV ativan x 1.
PHYSICAL EXAM: Vitals: P: 92 R: 13 BP: 151/72 SaO2: 99% intubated. GCS E: 3 V: 2 M: 5. HEENT: atraumatic, normocephalic. Pupils: 4-3mm [...]
FAMILY HISTORY: Mother had stroke at age 82. Father unknown.
SOCIAL HISTORY: Lives with wife. 25py. No EtOH.
[Discharge side:]
DIAGNOSES: 430 Subarachnoid Hemorrhage; Essential Hypertension; Diabetes Mellitus [...]
PROCEDURES: 397 Endovascular Repair of Vessel; Continuous Invasive Mechanical Ventilation [...]
IN-HOSPITAL MORTALITY: Not deceased.
LENGTH OF STAY: > 14 days.
[Highlight legend: Symptoms & Vitals, General Risk Factors, Medications, Pre-Conditions]
Integrating knowledge with specialised outcome pre-training. Gururangan et al. (2020) recently emphasized the importance of domain- and task-specific pre-training for deep neural models. Consequently we propose to enhance language models pre-trained on the medical domain with a task-specific clinical outcome pre-training. Besides processing clinical language with idiosyncratic and specialized terms, our models are thus able to learn about patient trajectories and symptom-disease associations in a self-supervised manner. We derive this knowledge from two main sources: 1) previously admitted patients and their outcomes, knowledge that hospitals usually store in unlabelled clinical notes, and 2) scientific case reports and knowledge bases that describe diseases, their presentations in patients and prognoses. We introduce a method for incorporating these sources by creating a suitable pre-training objective from publicly available data.
Contributions. We summarize the major contributions of this work as follows:
1) A novel task setup for clinical outcome prediction that simulates the patient's admission state and predicts the outcome of the current admission.
2) We introduce self-supervised clinical outcome pre-training, which integrates knowledge about patient outcomes into existing language models.
3) We further propose a simple method that injects hierarchical signals into ICD code prediction.
4) We compare our approaches against multiple baselines and show that they improve performance on four relevant outcome prediction tasks with up to 1,266 classes. We show that the models are transferable by applying them to a second public dataset without additional fine-tuning.
5) We present a detailed analysis of our model that includes a manual evaluation of samples conducted by medical professionals.
Related Work

Using clinical notes for outcome prediction. Boag et al. (2018) studied the predictive value of clinical notes with simple approaches such as bag-of-words. Recent work increasingly applies neural models to compensate for the noisy nature of the data and the complexity of patterns. Hashir and Sawhney (2020) used both convolutional and recurrent layers for outcome prediction, while Jain et al. (2019) and Qiao et al. (2019) proposed attention-based approaches. Dligach et al. (2019) explored pre-training as a strategy to mitigate data sparsity in clinical setups. Si and Roberts (2019) and Suresh et al. (2018) further showed that outcome prediction benefits from a multitask setup. In contrast to earlier work we apply neural models to admission notes in an admission to discharge setup.
Pre-trained language models for the clinical domain. While pre-trained language models are successful in many areas of NLP, there has been little work on applying them to the clinical domain (Qiu et al., 2020). Alsentzer et al. (2019) and Huang et al. (2019) both pre-trained BERT-based models on clinical data. They evaluated their work on readmission prediction and other NLP tasks. We are the first to evaluate pre-trained language models on multiple clinical outcome tasks with large label sets. We further propose a novel pre-training objective specifically for the clinical domain.

Prediction of diagnoses and procedures. The majority of work on diagnosis and procedure prediction covers either single diagnoses (Liu et al., 2018; Choi et al., 2018) or coarse-grained groups (Peng et al., 2020; Sushil et al., 2018). We argue that models should predict diseases and procedures in a fine-grained manner to be beneficial for doctors. Thus we use all diagnosis and procedure codes from the data for our outcome prediction tasks.
ICD coding vs. outcome prediction. There is a variety of work in the related field of automated ICD coding (Xie et al., 2018; Falis et al., 2019). Zhang et al. (2020) recently presented a model able to identify up to 2,292 ICD codes from text. However, ICD coding differs from outcome prediction in that diseases are directly extracted from text rather than inferred from symptom descriptions and patient history. We further discuss this distinction in Section 6.
Admission to Discharge Task

Clinical outcome prediction can be defined in different ways. We approach the task from a doctor's perspective and predict the outcome of a current admission from the time of the patient's arrival at the hospital unit. We describe our setup as follows.
As our primary data source, we use the freely available MIMIC III v1.4 database (Johnson et al., 2016). It contains de-identified EHR data including clinical notes in English from the Intensive Care Unit (ICU) of Beth Israel Deaconess Medical Center in Massachusetts between 2001 and 2012. We focus our work on discharge summaries in particular and the outcome information associated with an admission. Similar to previous work, we filter out notes about newborns and remove duplicates.
The state of a patient is commonly summarized in an ongoing document, which finally concludes in a discharge summary.
Admission Notes Statistics

                   avg     std
words / doc        396.3   233.3
sentences / doc    32.5    23.1

Table 1: Number of words / sentences in MIMIC III admission notes. We see a high variation in length.
Multi-label tasks: ICD-9 codes per dataset split

             Total   Train   Val   Test
Diagnoses    1,266
Procedures     711     672   476    563

Table 2: Distribution of ICD-9 codes per dataset split (patient-wise). Note that very rare codes do not appear in each split of the dataset.
Single-label tasks: samples per class

Mortality: 0 (not deceased) / 1 (deceased)
Length of Stay (in days): ≤ 3 / 3–7 / 7–14 / > 14

Table 3: Distribution of labels for the Mortality Prediction and Length of Stay tasks. Both tasks have unbalanced class distributions.

Since we want to support clinical decisions from the beginning of a patient's stay, we simulate the state of the patient's document at admission time. We thus filter the document by sections that are known at admission such as:
Chief complaint, (History of) Present illness, Medical history, Admission Medications, Allergies, Physical exam, Family history and
Social history. We further describe the filtering in Appendix B.1. Our approach results in 48,745 admission notes. As shown in Table 1 the notes contain about 400 words on average. The selection of admission sections as well as the resulting structure of the notes were verified by medical doctors. This newly created admission dataset enables us to make predictions on the outcome of a current admission. At inference time, doctors can then use the model's predictions on textual data from newly arrived patients.
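As an illustration of this filtering, the sketch below extracts admission-time sections from a discharge summary via pattern matching on section headings. The heading list and the regular expression are simplified assumptions for illustration, not the exact rules used to build the dataset (those are described in Appendix B.1).

```python
import re

# Simplified, assumed subset of section headings known at admission time.
ADMISSION_SECTIONS = {
    "chief complaint", "history of present illness", "present illness",
    "past medical history", "medications on admission", "allergies",
    "physical exam", "family history", "social history",
}

# Assumed heading pattern: a label followed by a colon at the start of a
# line, e.g. "CHIEF COMPLAINT:".
HEADING_RE = re.compile(r"^([A-Za-z][A-Za-z' /]+):", re.MULTILINE)

def extract_admission_note(discharge_summary: str) -> str:
    """Keep only sections known at admission time; drop everything that
    reveals the hospital course or discharge information."""
    headings = list(HEADING_RE.finditer(discharge_summary))
    kept = []
    for i, match in enumerate(headings):
        name = match.group(1).strip().lower()
        end = headings[i + 1].start() if i + 1 < len(headings) else len(discharge_summary)
        if name in ADMISSION_SECTIONS:
            kept.append(discharge_summary[match.start():end].strip())
    return "\n\n".join(kept)
```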
We select four relevant tasks for outcome prediction in consultation with medical professionals. All tasks take admission notes as input.
Diagnosis prediction. A main goal of clinical outcome prediction is to support medical professionals in the process of differential diagnosis. We thus take all diagnoses associated with an admission into account and frame the task as extreme multi-label classification. Diagnoses are encoded as ICD-9 codes in the MIMIC III database. Following Choi et al. (2017), we group the 4-digit ICD-9 diagnosis codes from the database into 3-digit codes to reduce complexity while still obtaining granular suggestions. This results in a total of 1,266 diagnosis codes, which are distributed over our dataset splits as shown in Table 2. The labels are power-law distributed with a long tail of very rare codes.
Figure 2: Schematic demonstration of clinical outcome pre-training. Sources of clinical knowledge are complete patient notes (from MIMIC III, i2b2 and MTSamples) and medical articles (from PubMed, Wikipedia and MedQuAD). Based on that we create a self-supervised learning objective that teaches relations between symptoms, risk factors and outcomes. [The figure shows example input pairs, e.g. '[CLS] Former for 20-30 years. [SEP] The aorta is ectatic with eccentric ...' with the question 'Does this outcome match the patient?' (label: True), and '[CLS] ... skin lesions with sometimes itching. [SEP] Delivery of whole brain radiotherapy ...' with 'Does this treatment match the symptoms?' (label: False).]
Procedure prediction. Procedures are either diagnostics or treatments applied to a patient during a stay. Similar to diagnosis prediction, this is an extreme multi-label task. We again group the ICD-9 codes from the MIMIC III database into 3-digit codes. In total there are 711 procedure codes labelled in the database, in a power-law distribution similar to the diagnosis codes.
In-hospital mortality prediction. Predicting a patient's mortality risk is a fundamental part of the triage process. In-hospital mortality in particular describes whether a patient died during the current admission and is a binary classification task. The percentage of deceased patients in the data is around 10% (see Table 3). As some notes contain direct indications of mortality such as patient deceased within the admission sections, we apply an additional filter for those terms.
Length-of-stay prediction. The duration of an ICU stay is important information for hospitals in order to plan the allocation of resources. We group patients into four major categories regarding their length of stay: under 3 days, 3 to 7 days, 1 week to 2 weeks, and more than 2 weeks. These categories were recommended by medical doctors in order to make the results as useful as possible in clinical practice. Table 3 shows the samples per class.
We propose clinical outcome pre-training, a way to integrate knowledge about clinical outcomes into pre-trained language models. We further introduce an additional step to incorporate the ICD code hierarchy into our multi-label classification tasks.

Clinical outcome pre-training. Language model pre-training has been shown to be of use in specialised domains like the clinical one (Alsentzer et al., 2019; Huang et al., 2019). However, these models lack knowledge about patient trajectories and symptom-diagnosis relations, because their training is focused on learning language characteristics. We develop an additional pre-training step that produces Clinical Outcome Representations (CORe) in order to teach the model relations between symptoms, risk factors and clinical outcomes. Much of this knowledge is present and publicly available, e.g. in knowledge bases like Wikipedia or publication archives like PubMed. Another source is available to hospitals in the form of unlabelled clinical notes from previous patients. The suggested outcome pre-training is a way to use this knowledge to improve the model's capabilities in predicting clinical outcomes as described in 3.3. Corresponding to the way doctors gain their knowledge from both experience and medical literature, we incorporate knowledge from complete patient notes (including discharge information) and medical articles. The code to recreate the experiments and datasets described in this paper is accessible at: https://github.com/bvanaken/clinical-outcome-prediction
Training objective. Our proposed training objective (Figure 2) is strongly related to the Next Sentence Prediction (NSP) task introduced by Devlin et al. (2019). In NSP the model gets two sentences as input and predicts whether the second follows the first sentence. This way models such as BERT learn relations between sentences. We convert this setting so that the model instead learns relations between admissions and outcomes. From common sections in patient notes, we create two categories: sections that are created at admission (A) and sections that are created after admission, e.g. at discharge time (D). Given a patient note N, we split it into sections A_N ∈ A and D_N ∈ D. We remove all other sections. We then sample token sequences from these sections to get t_{N,1...k} ∈ A_N and t'_{N,1...k} ∈ D_N, where k is randomly set between 30 and 50 tokens. We then train the model to maximize P(SamePatient | X_{N,N}) and P(OtherPatient | X_{N,M}) with

    X_{N,N} = Enc(t_{N,1...k}, t'_{N,1...k})
    X_{N,M} = Enc(t_{N,1...k}, t'_{M,1...k})        (1)

with M being a randomly sampled document from the same batch and Enc referring to the BioBERT encoding. As in the original NSP setting, we apply negative sampling (X_{N,M}) for 50% of examples. We apply the same strategy to medical articles and case reports, so that A represents sections describing symptoms and risk factors, and D represents sections that describe outcomes of a disease or case.
Data sources. We create the pre-training dataset from multiple public sources. To integrate knowledge that doctors gain from previous patients and medical literature, we create two groups of sources: 1) Patients, which includes 32,721 discharge summaries from the MIMIC III training set, 5,000 publicly available medical transcriptions from the MTSamples website (https://mtsamples.com) and 4,777 clinical notes from the i2b2 challenges 2006-2012 (Uzuner et al., 2007, 2008, 2010a,b, 2011, 2012; Sun et al., 2013b,a). We exclude notes from the 2014 De-identification and Heart Disease Risk Factors Challenge in order to use this set for evaluation as described in Section 5.4. 2) Articles, composed of 9,335 case reports from PubMed Central (PMC), 2,632 articles from Wikipedia describing diseases and 1,467 article sections from the MedQuAD dataset (Abacha and Demner-Fushman, 2019) extracted from NIH websites such as cancer.gov. While Patients samples contain unaudited practical knowledge, Articles samples are built from verified general medical knowledge such as peer-reviewed studies. The sources are therefore substantially different and we evaluate their individual effect on performance in Section 5.3.
Data preparation. We create admission (A_N) and discharge parts (D_N) of the documents based on section headings. We define common sections belonging to the admission part and those belonging to the discharge part, similar to the method described in Section 3.2. We ignore sections that cannot be categorized. For section heading extraction from MIMIC III discharge summaries and MTSamples transcriptions, we apply simple rule-based approaches, which is feasible because the notes are well-structured. For Wikipedia we use headings from the WikiSection dataset (Arnold et al., 2019) filtered for disease articles only. For PubMed Central we similarly use the PubMedSection dataset (Schneider et al., 2020) and filter for section headings that indicate case reports. As i2b2 notes are less well-structured in comparison to MIMIC III discharge summaries, we use a classifier as proposed by Rosenthal et al. (2019) to determine which section a sentence belongs to. The classifier is trained on an annotated set of i2b2 notes and then applied to all other notes.
ICD code hierarchy. Diagnosis and procedure prediction requires the model to predict ICD-9 codes in a multi-label manner. ICD-9 codes are hierarchically ordered into associated groups. Figure 3 shows the code hierarchy for Malignant hypertensive renal disease with the ICD-9 code 403.0. The diagnosis has two parent groups, namely Hypertension renal disease and Diseases of the circulatory system. Diagnoses or procedures in the same group often share similar medical characteristics; therefore the hierarchical relations of a labelled code can be valuable information. This medical information is currently not integrated into the model. The same holds for the words describing the ICD-9 codes, which often represent further important signals, such as the words renal or malignant.
390-459 Diseases of the circulatory system
  401 Essential Hypertension
  403 Hypertension renal disease
    403.0 Malignant hypertensive renal disease
    403.1 Benign hypertensive renal disease

Figure 3: Example of ICD+ labelling. Malignant hypertensive renal disease is assigned to nine labels that inform about the type and group of the disease.
Enhancing training with useful additional signals. We propose a simple method, ICD+, to incorporate both associated groups and words into the model weights: Instead of only classifying 3-digit codes (as mentioned in 3.3), we let the model additionally predict the 4-digit codes and the bag of words associated with a code and its parent groups. In order to create the bag of words per code, we use the descriptions of ICD-9 codes from MIMIC III and remove all stop words. As shown in Figure 3, the ICD+ method assigns eight additional labels to the example diagnosis and therefore supplies the model with further information about the diagnosis during training. By increasing the number of labels per sample, we integrate relevant medical knowledge and enable the model to learn implicit relations between codes and code groups that share certain words. We evaluate the effectiveness of ICD+ in Section 5.
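A sketch of this label expansion, with a hypothetical description lookup and stop-word list standing in for the MIMIC III ICD-9 description tables; the exact set of parent groups included per code follows the paper's description only loosely.

```python
# Assumed stop-word list; the paper derives code descriptions from the
# MIMIC III ICD-9 tables and removes all stop words.
STOP_WORDS = {"of", "and", "the", "with"}

def icd_plus_labels(code: str, parents: list, descriptions: dict) -> set:
    """Expand a single ICD-9 label into the ICD+ label set: the code itself,
    its parent group codes, and the bag of words from the descriptions of
    the code and its parents (stop words removed)."""
    labels = {code, *parents}
    for c in labels.copy():
        for word in descriptions.get(c, "").lower().split():
            if word not in STOP_WORDS:
                labels.add(word)
    return labels

# Example based on Figure 3 (hypothetical description dict):
icd_plus_labels(
    "4030", ["403", "390-459"],
    {"4030": "Malignant hypertensive renal disease",
     "403": "Hypertension renal disease",
     "390-459": "Diseases of the circulatory system"})
```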
We pre-train the CORe model on top of BioBERT weights. We then fine-tune the model separately on the four outcome tasks. We use the same training regimen for both pre-training and fine-tuning: We tokenize the texts with WordPiece tokenization and truncate them to 512 tokens, due to the limited context length of the pre-trained models. We use early stopping and tune hyperparameters as described in Appendix C. We choose BioBERT as the base for our model because it outperforms BERT on medical tasks and has not seen data from our test set during pre-training, unlike DischargeBERT.
In the following, we introduce the baseline models that we evaluate on the novel outcome prediction tasks. In order to understand the abilities of pre-trained language models we compare their performance against more traditional approaches. The first three models (BOW, word embeddings, CNN) are trained using the hyperparameters proposed by the authors for outcome prediction tasks. The language models are fine-tuned the same way as the CORe model.
Bag-of-Words. Boag et al. (2018) show that a simple bag-of-words (BOW) approach can outperform more complex models on tasks like mortality prediction. We thus include their approach in our evaluation. We adopt their training setting except that we consider the 200 instead of 20 top tf-idf words in order to make the model converge.
Pre-trained word embeddings. Boag et al. (2018) further propose the use of pre-computed word embeddings that were trained on MIMIC III data. We use the same setting as for the BOW approach and fit a support vector machine classifier on the clinical outcome tasks.
Convolutional Neural Network (CNN). Si and Roberts (2019) built a neural network for mortality prediction with two hierarchical convolutional layers at the word and sentence levels, aggregated into a patient-level representation. We follow their approach to evaluate the model on our four admission to discharge tasks.
BioBERT. Following the success of BERT, Lee et al. (2020) further pre-trained the model on biomedical research articles from PubMed using abstracts and full-text articles. They reported improved performance on a range of biomedical text mining tasks.
ClinicalBERT and DischargeBERT. We further evaluate two public language models pre-trained on the clinical domain, with MIMIC III data in particular. Huang et al. (2019) pre-trained a BERT Base model on 100,000 random clinical notes (ClinicalBERT), while Alsentzer et al. (2019) further pre-trained BioBERT on all discharge summaries from MIMIC III (we refer to the model as DischargeBERT for simplicity).

                                        Diagnoses       Procedures     In-Hospital Mortality   Length-of-Stay
                                        (1266 classes)  (711 classes)  (2 classes)             (4 classes)
BOW (Boag et al., 2018)                 75.87           77.47          79.15                   65.83
Embeddings (Boag et al., 2018)          75.16           76.72          79.94                   66.78
CNN (Si and Roberts, 2019)              61.18           73.13          75.50                   64.49
BERT Base (Devlin et al., 2019)         82.08           85.84          81.13                   70.40
ClinicalBERT (Huang et al., 2019)       81.99           86.15          82.20                   71.14
DischargeBERT (Alsentzer et al., 2019)
BioBERT Base (Lee et al., 2020)         82.81           86.36          82.55                   71.59
BioBERT ICD+                            83.17           87.45          -                       -
CORe Articles (w/o ICD+)                (82.89)         (86.75)
CORe Patients (w/o ICD+)                (83.40)         (86.60)
CORe All (w/o ICD+)                     (83.39)         (87.15)

Table 4: Results on outcome prediction tasks in macro-averaged % AUROC. The CORe models outperform the baselines; ICD+ adds further improvement (values in parentheses are ablation results without ICD+). DischargeBERT results are printed in italics because the model has seen all test data during pre-training and is therefore slightly advantaged.
Table 4 shows performances in macro-averaged area under the receiver operating characteristic curve (AUROC). We report scores of the CORe model trained only on Articles, only on Patients, and in a combined training setting (CORe All). We evaluate diagnosis and procedure prediction both with and without the ICD+ method on BioBERT and the CORe models. In both scenarios we evaluate on 3-digit ICD codes only, in order to maintain comparability between the methods.
Pre-trained models outperform baselines. We see that the evaluated pre-trained language models clearly outperform the BOW, word embeddings and CNN approaches. We further observe that the CORe models improve scores on all tasks in comparison to the baseline models, except for DischargeBERT, which reaches a higher score in mortality prediction, probably affected by its exposure to the test data. This shows that even though the language models are trained on similar data (e.g. PubMed and/or clinical notes), the specific outcome pre-training improves the model's ability to predict clinical outcome targets. Pre-training on Patients and on Articles achieves similar improvements over the baselines, while the combined training is the most effective. An exception is procedure prediction, where pre-training on Patients achieves the highest score. A probable reason is that procedures are documented in more detail in clinical notes, especially since our selection of medical articles focuses on diseases rather than procedures.
Predicting mortality risk is easier than length of stay. We see that the models reach higher scores in the binary mortality task than in length of stay prediction. Even a simple BOW approach can reach a relatively high score, which indicates that most of the notes contain clear hints towards an increased mortality risk. On the other hand, the length of stay task is difficult due to the many factors that can contribute to the length of a patient's stay after admission, including nonclinical factors such as the patient's insurance situation (Khosravizadeh et al., 2016).
ICD hierarchy improves diagnosis and procedure predictions. Table 4 shows an ablation test without the ICD+ method (in parentheses). We see that both the BioBERT model and the CORe models improve when incorporating code hierarchy and relations through ICD+ into the training process. This is especially visible for ICD procedures, where the hierarchical and textual information, e.g. that a Nephropexy is an operation on the kidney, can add important signals during training.

i2b2 Diagnoses (% AUROC)
BioBERT ICD+     80.43
CORe Articles    81.46
CORe Patients
CORe All         81.15

Table 5: Results on the i2b2 diagnosis prediction task (5 classes) in % AUROC. The models reach similar results as on the MIMIC III data, indicating their transferability to other data sources without additional fine-tuning.

Figure 4: Impact of age on mortality prediction on 20 random samples (x-axis: age from 18 to [**Age over 90**]; the dotted line marks the averaged mortality in MIMIC III). Mortality risk and age mostly increase proportionally as intended, with certain peaks that might indicate unintended biases in the data.
In order to verify that the fine-tuned models are transferable to ICU data from other sources, we apply them to data from the i2b2 De-identification and Heart Disease Risk Factors Challenge (Stubbs et al., 2015). We convert the clinical notes to admission notes as further described in Appendix B.2, which results in 1,118 samples labelled with up to five ICD-9 codes.
Models generalize to i2b2 data. We apply our MIMIC III-based models to predict diagnosis codes for the i2b2 notes without further fine-tuning. We then evaluate based on whether the predictions contain the five mentioned ICD-9 codes. The results in macro-averaged % AUROC are shown in Table 5. Even though the clinical notes differ from the MIMIC III notes in structure and writing style, the tested models are mostly able to identify the conditions. The scores are comparable to the MIMIC III results, which shows that the models are able to generalise to data from different sources such as other hospitals.
Analysis

Clinical outcome prediction is a sensitive task. We therefore conduct an extensive analysis of the CORe All model, including a manual error analysis by medical doctors on 20 randomly chosen samples, to understand how the model would perform in clinical practice. Our demo application used for this analysis is available at: https://outcome-prediction.demo.datexis.com

                                   % AUROC
All Diagnoses                      83.54
Diagnoses Mentioned in Text
Diagnoses Not Mentioned in Text    82.35

Table 6: Analysis of the impact of directly mentioned diagnoses on the diagnosis prediction task. Mentioned diagnoses are detected more reliably, though scores on unmentioned diagnoses see only a small decrease compared to the overall score.
Mentioned vs. unmentioned diagnoses. We observe that a majority of coded diseases are already mentioned in the admission text. This is mainly due to chronic diseases (e.g. diabetes mellitus) or to conditions that were identified prior to the ICU admission (e.g. in the emergency ward). We want to know whether our model is also able to predict diagnoses that are not mentioned in the text. We annotate the admission texts with ICD-9 diagnosis codes using the methodology described by Searle et al. (2020). We then evaluate on codes that were explicitly mentioned in the text and those that were not. Table 6 shows that the model indeed extracts many diagnoses directly from the text and thus reaches a higher score on mentioned diagnoses. On the other hand, performance on non-mentioned diagnoses drops only slightly, indicating that the model has also learned to predict non-mentioned diagnoses.
How do age and gender impact predictions? Age and gender are common risk factors with significant impact on the potential clinical outcome of a patient. We want our models to learn that impact without overestimating it. We test the model's behaviour by switching age and gender throughout 20 random samples and analyse how the mortality prediction changes. For each sample we manually switch the age mention and iterate over it from 18 until [**Age over 90**] (MIMIC III de-identifies age information for patients older than 89). Figure 4 shows that the analysed samples show a high variation in mortality risk and that age only impacts the prediction partially. In all cases the prediction increases with age, as expected from a medical perspective. We also observe some peaks without a medical reason that are caused by the mortality of certain age groups in the original data (black dotted line). This demonstrates that the model does not follow medical reasoning but merely statistical observations. We similarly switch the gender mention and all pronouns in the texts and observe that the mortality prediction for male patients is increased by 5% on average, consistent with medical rationale.
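A sketch of how such a perturbation analysis can be scripted; the age regular expression and the class index for "deceased" are assumptions for illustration.

```python
import re
import torch

AGE_RE = re.compile(r"\b\d{2}yo\b")  # assumed age pattern, e.g. "58yo"

def mortality_vs_age(note: str, model, tokenizer) -> list:
    """Sweep the age mention in a note and record the predicted mortality
    risk, mirroring the perturbation analysis behind Figure 4."""
    risks = []
    for age in range(18, 90, 4):
        perturbed = AGE_RE.sub(f"{age}yo", note)
        inputs = tokenizer(perturbed, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        risks.append((age, probs[0, 1].item()))  # class 1 = deceased (assumed)
    return risks
```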
Where is the model failing?

1. Negation: While our error analysis shows that negation does not generally falsify the model's predictions, we find single samples in which especially medical-specific negations, such as abstinent from alcohol, are misinterpreted by the model, e.g. as alcohol dependence syndrome.

2. Numerical data: Wallace et al. (2019) show BERT's inability to interpret numbers. We observe this in cases where the model does not interpret life-threatening vital values (such as a temperature over 105 °F) as an increased mortality risk. Clinical notes contain many such relevant values; improving the encoding of such data is thus an important goal for future work.
Under-coded labels. Our error analysis reveals that 60% of the analysed samples are partially under-coded: they contain indicators for a diagnosis or procedure but miss the corresponding ICD-9 code. This is consistent with results from Searle et al. (2020) showing that MIMIC III is up to 35% under-coded. Additionally we find that procedures that are almost always performed in the ICU, such as Puncture of vessel, are often coded inconsistently. While a doctor can infer these labels with medical common sense, they pose a challenge to our models. We therefore suggest a critical view towards the data and welcome additional clinical datasets to compensate for noisy labels.
Multiple possible outcomes. 85% of the analysed samples contain false positive predictions that the doctors still consider medically reasonable. This demonstrates that there are many possible clinical pathways and that some might not be foreseeable at admission time. We also see many cases in which the information in the clinical note is not sufficient and therefore allows multiple interpretations. For future work, we propose including further EHR data as suggested by Khadanga et al. (2019) to extend the patient representation in these scenarios.
Conclusion

We reframe the task of clinical outcome prediction to consider the admission state of a patient and support doctors in their initial decision process. We show that current state-of-the-art language models outperform selected baselines on this task and present methods for further improvement: outcome pre-training enables our models to learn from unlabelled sources and ICD+ incorporates hierarchical and textual ICD representations into our models. For future work, we suggest considering pre-trained language models with larger context sizes (Beltagy et al., 2020; Zaheer et al., 2020) and languages other than English (Reys et al., 2020). We further encourage work on the semantic encoding of negated terms and numerical data from clinical text.
Acknowledgments
We would like to thank Anjali Grover and Sebastian Herrmann for their support throughout the project. Our work is funded by the German Federal Ministry for Economic Affairs and Energy (BMWi) under grant agreements 01MD19003B (PLASS) and 01MK2008MD (Servicemeister).
References
Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC Bioinformatics, 20(1):511:1–511:23.
Betty van Aken, Benjamin Winter, Alexander Löser, and Felix A. Gers. 2019. How Does BERT Answer Questions?: A Layer-Wise Analysis of Transformer Representations. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, pages 1823–1832, Beijing, China. ACM.
Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. ACL.
Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers, and Alexander Löser. 2019. SECTOR: A Neural Model for Coherent Topic Segmentation and Classification. Transactions of the Association for Computational Linguistics, TACL, 7:169–184.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. Computing Research Repository, arXiv/2004.05150.
Willie Boag, Dustin Doss, Tristan Naumann, and Peter Szolovits. 2018. What's in a Note? Unpacking Predictive Value in Clinical Note Representations. AMIA Summits on Translational Science Proceedings, 2018:26–34.
Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, and Jimeng Sun. 2017. Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. In Proceedings of the Machine Learning for Healthcare Conference, MLHC, volume 68 of Proceedings of Machine Learning Research, pages 286–305, Boston, Massachusetts. PMLR.
Edward Choi, Cao Xiao, Walter F. Stewart, and Jimeng Sun. 2018. MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 4552–4562, Montréal, Canada.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, USA. ACL.
Dmitriy Dligach, Majid Afshar, and Timothy A. Miller. 2019. Toward a clinical text encoder: pretraining for clinical natural language processing with applications to substance misuse. Journal of the American Medical Informatics Association, 26(11):1272–1278.
Matúš Falis, Maciej Pajak, Aneta Lisowska, Patrick Schrempf, Lucas Deckers, Shadia Mikhael, Sotirios A. Tsaftaris, and Alison O'Neil. 2019. Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis, LOUHI@EMNLP 2019, pages 168–177, Hong Kong, China. ACL.
Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 8342–8360, Online. ACL.
Mohammad Hashir and Rapinder Sawhney. 2020. Towards unstructured mortality prediction with free-text clinical notes. Journal of Biomedical Informatics, 108:103489.
Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. In Proceedings of the ACM Conference on Health, Inference, and Learning, CHIL 2020, Online. ACM.
Sarthak Jain, Ramin Mohammadi, and Byron C. Wallace. 2019. An Analysis of Attention over Clinical Notes for Predictive Tasks. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 15–21, Minneapolis, Minnesota, USA. ACL.
Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.
Swaraj Khadanga, Karan Aggarwal, Shafiq R. Joty, and Jaideep Srivastava. 2019. Using Clinical Notes with Time Series Data for ICU Management. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pages 6431–6436, Hong Kong, China. ACL.
Omid Khosravizadeh, Soudabeh Vatankhah, Peivand Bastani, Rohollah Kalhor, Samira Alirezaei, and Farzane Doosty. 2016. Factors affecting length of stay in teaching hospitals of a middle-income country. Electronic Physician, 8(10):3042–3047.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Jingshu Liu, Zachariah Zhang, and Narges Razavian. 2018. Deep EHR: Chronic Disease Prediction Using Medical Notes. In Proceedings of the Machine Learning for Healthcare Conference, MLHC 2018, volume 85 of Proceedings of Machine Learning Research, pages 440–464, Palo Alto, California, USA. PMLR.
Xueping Peng, Guodong Long, Tao Shen, Sen Wang, and Jing Jiang. 2020. Self-Attention Enhanced Patient Journey Understanding in Healthcare System. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2020, Online.
Zhi Qiao, Xian Wu, Shen Ge, and Wei Fan. 2019. MNN: Multimodal Attentional Neural Networks for Diagnosis Prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pages 5937–5943, Macao, China.
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. Science China Technological Sciences, 63:1872–1897.
Arthur D. Reys, Danilo Silva, Daniel Severo, Saulo Pedro, Marcia M. de Souza e Sá, and Guilherme A. C. Salgado. 2020. Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes. In Proceedings of the 9th Brazilian Conference on Intelligent Systems (BRACIS), Rio Grande, Brazil.
Sara Rosenthal, Ken Barker, and Zhicheng Liang. 2019. Leveraging Medical Literature for Section Prediction in Electronic Health Records. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4864–4873, Hong Kong, China. ACL.
Rudolf Schneider, Tom Oberhauser, Paul Grundmann, Felix Alexander Gers, Alexander Löser, and Steffen Staab. 2020. Is Language Modeling Enough? Evaluating Effective Embedding Combinations. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4739–4748, Marseille, France. ELRA.
Thomas Searle, Zina M. Ibrahim, and Richard J. B. Dobson. 2020. Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, BioNLP 2020, pages 76–85, Online. ACL.
Yuqi Si and Kirk Roberts. 2019. Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction. AMIA Summits on Translational Science Proceedings, 2019:779–788.
Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics, 58:S11–S19.
Amber Stubbs and Özlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics, 58:S20–S29.
Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. 2013a. Annotating temporal information in clinical narratives. Journal of Biomedical Informatics, 46(6):S5–S12.
Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. 2013b. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association, 20(5):806–813.
Harini Suresh, Jen J. Gong, and John V. Guttag. 2018. Learning Tasks for Multitask Learning: Heterogenous Patient Populations in the ICU. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, pages 802–810, London, UK. ACM.
Madhumita Sushil, Simon Šuster, Kim Luyckx, and Walter Daelemans. 2018. Patient representation learning and interpretable evaluation using clinical notes. Journal of Biomedical Informatics, 84:103–113.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA.
Özlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R. South. 2012. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association, 19(5):786–791.
Özlem Uzuner, Ira Goldstein, Yuan Luo, and Isaac S. Kohane. 2008. Viewpoint Paper: Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association, 15(1):14–24.
Özlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification. Journal of the American Medical Informatics Association, 14(5):550–563.
Özlem Uzuner, Imre Solti, and Eithon Cadag. 2010a. Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5):514–518.
Özlem Uzuner, Imre Solti, Fei Xia, and Eithon Cadag. 2010b. Community annotation experiment for ground truth generation for the i2b2 medication challenge. Journal of the American Medical Informatics Association, 17(5):519–523.
Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.
Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP Models Know Numbers? Probing Numeracy in Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pages 5306–5314, Hong Kong, China. ACL.
Pengtao Xie, Haoran Shi, Ming Zhang, and Eric P. Xing. 2018. A Neural Architecture for Automated ICD Coding. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Volume 1: Long Papers, pages 1066–1076, Melbourne, Australia. ACL.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, virtual.
Zachariah Zhang, Jingshu Liu, and Narges Razavian. 2020. BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, ClinicalNLP@EMNLP 2020, pages 24–34, Online. ACL.
A Distribution of Diagnosis and Procedure Labels

Figure 5 and Figure 6 show the distributions of labels in the diagnosis and procedure prediction training sets. Both distributions follow a power law with a long tail of rare codes.

Figure 5: Distribution of ICD-9 diagnosis codes in the MIMIC III training set.

Figure 6: Distribution of ICD-9 procedure codes in the MIMIC III training set.
B Pre-Processing Clinical Notes
B.1 Admission Notes From Discharge Summaries
We use MIMIC III discharge summaries that contain aggregated information about a patient such as doctor's assessments, relevant lab values, medications, and the patient's history. In order to filter the documents by admission sections, we first split all discharge summaries into sections with simple pattern matching. Together with clinical professionals, we then evaluated discharge summaries and identified sections that are known at admission time. We remove all other sections and thus hide information about the further hospital course and discharge of a patient. We exclude notes that do not contain any of the admission sections. We further apply a patient-wise split into train, validation and test sets with a 70/10/20 ratio.
B.2 Converting i2b2 Data into Admission to Discharge Task
The i2b2 De-identification and Heart Disease Risk Factors Challenge (Stubbs et al., 2015; Stubbs and Uzuner, 2015) introduced a dataset that contains clinical notes and discharge summaries annotated based on risk factors and disease indicators. We convert the data into an admission to discharge task by selecting five of the annotated conditions which correspond to ICD-9 codes as our labels, namely Hypertension (401), Hyperlipidemia (272), Coronary artery disease (414), Diabetes mellitus (250) and Obesity (278). Just like the MIMIC III diagnosis task, samples are annotated in a multi-label fashion. In order to convert the clinical notes to admission notes, we use the dataset from Rosenthal et al. (2019) that contains section labels per sentence. We then exclude sections that are not known at admission time, analogous to Section 3.2.
C Hyperparameter Setting
We use the following setting for pre-training and fine-tuning of the introduced Transformer-based models: We use early stopping and apply a random search for tuning the following hyperparameters on the validation set: learning rate [1e-4 ...].

Figure 7: Top 10 diagnoses by frequency with the scores reached by the CORe All model.

Figure 8: Top 10 procedures by frequency with the scores reached by the CORe All model.
D Results on Top 10 Diagnoses and Procedures
Figures 7 and 8 show the % AUROC scores of our CORe All model on the most frequent labels within the diagnosis and procedure prediction tasks. Figure 7 shows that many chronic diseases such as Essential Hypertension or Chronic ischemic heart disease are among the most common within the MIMIC III dataset and present with relatively high AUROC values. We also observe that very specific codes such as Diabetes mellitus and Bypass Anastomosis are predicted more easily compared to more general codes such as Other and unspecified anemias. Figure 8 further shows the negative influence of inconsistent labeling on standard procedures such as