Improving Clinical Outcome Predictions Using Convolution over Medical Entities with Multimodal Learning
Batuhan Bardak, Mehmet Tan*
Department of Computer Engineering, TOBB University of Economics and Technology, Ankara, Turkey
Abstract
Early prediction of mortality and length of stay (LOS) of a patient is vital for saving lives and for the management of hospital resources. The availability of electronic health records (EHR) has made a huge impact on the healthcare domain, and there have been several works on predicting clinical problems. However, many studies could not benefit from the clinical notes because of their sparse and high-dimensional nature. In this work, we extract medical entities from clinical notes and use them as additional features, besides time-series features, to improve our predictions. We propose a convolution-based multimodal architecture, which not only learns to effectively combine medical entities and time-series ICU signals of patients, but also allows us to compare the effect of different embedding techniques such as Word2Vec and FastText on medical entities. In the experiments, our proposed method robustly outperforms all other baseline models, including different multimodal architectures, for all clinical tasks. The code for the proposed method is available at https://github.com/tanlab/ConvolutionMedicalNer.

Keywords: deep learning; healthcare; EHR; NER; multimodal
Introduction

Electronic Health Record (EHR) data collected from patients who have been admitted into hospitals or intensive care units (ICU) offer a detailed overview of patients, consisting of, but not limited to, demographics, insurance, laboratory test results, and medical notes. With EHR data becoming available for researchers, there has been increasing interest in using it with deep learning algorithms. Besides the rapid progress in the deep learning area, after the Medical Information Mart for Intensive Care (MIMIC-III) [1], today's most popular public EHR database, was released, numerous studies have achieved successful results using this data set and deep learning models to predict different clinical outcomes [2, 3, 4].

Understanding the health condition of the patient by observing clinical measurements and laboratory test results, and predicting the condition of patients during their ICU stay, is a vital problem. In this paper, we focus on two common risk prediction tasks: mortality (in-hospital & in-ICU) and length of ICU stay (LOS). Both are very important clinical outcomes for determining treatment methods, planning hospital resources, and ultimately saving lives. Previous studies primarily focused on predicting clinical events using only the structured data of the patient, such as historical patient diagnoses (ICD codes) [5, 6], lab results, and patient ICU measurements [7, 8, 9], and did not benefit from the unstructured data in the EHR. The EHR data, which consists of clinical notes written by doctors, nurses, or radiologists, discharge notes, and many other sources, contains quite detailed information about patients, projecting the knowledge and inference of doctors and, in many cases, even critical details about the patient's health status. Given the importance of clinical notes, researchers want to take advantage of their rich content.
Moreover, with the recent developments in Natural Language Processing (NLP), there has been increasing interest in using clinical notes to make clinical model predictions [10, 11]. Although it may be possible to leverage clinical notes to make more accurate predictions, these notes may consist of long free text with an unusual grammatical structure and may contain redundant information. As it may be hard to process raw clinical notes because of their high-dimensional and sparse nature, extracting medical entities is required to unlock the medical information trapped in the clinical notes and to feed it into prediction models.

Named Entity Recognition (NER) is a fundamental task in NLP that focuses on information extraction, aiming to extract entities in a text and classify them into predefined classes. These classes can be locations, people, or organizations in general-purpose NER algorithms [12, 13]. There are also NER models for different domains like cybersecurity [14] or medicine [15]. Recently, several deep learning algorithms were applied to clinical texts to train clinical named entity recognition models. These clinical NER models generally try to extract medical information such as disease, drugs, dosage, and frequency.

In this paper, we argue that the integration of structured data in the EHR and medical entities positively affects the prediction of mortality and LOS. We also investigate the effect of different word representations, such as Word2Vec [16], FastText [17], and the concatenation of both representations, on medical entities. To evaluate the success of our proposed multimodal architecture, we first train models separately with structured and medical entity features. Then we apply a multimodal approach and use these features together in several ways to show the effectiveness of the proposed network. The results indicate a promising increase in performance on the mortality and LOS tasks when the medical entities are used with structured data in a multimodal approach.

In the next section, we summarize similar studies in the clinical domain, especially on predicting mortality and length of stay in the ICU. Following that, we discuss our data set, problem definitions, and the deep learning models used in this study. Finally, we report experimental results and conclude the paper with our findings.

                                                 # Patients   # Hospital Admissions   # ICU Admissions
MIMIC-III (patients > 15 years old)                38,597          49,785                  53,423
MIMIC-Extract                                      34,472          34,472                  34,472
MIMIC-Extract (at least 24+6 (gap) hours)          23,937          23,937                  23,937
Final cohort (after clinical note elimination)     21,080          21,080                  21,080

Table 1: Summary statistics of the original MIMIC-III dataset and the final cohort used in this study.
Related Work

With the rapid development of deep learning algorithms in the last decade, the number of deep learning models for various clinical predictions has increased substantially. Several studies have explored EHRs to solve clinical problems; e.g., [18] used 13 different vital measurements to classify 128 diagnoses using Long Short-Term Memory (LSTM), and DoctorAI [5] used Gated Recurrent Units (GRU) to predict multi-label diagnoses for the next visit. [19] proposed early heart failure detection using Recurrent Neural Networks (RNNs). Forecasting LOS and mortality has been a popular clinical problem for healthcare researchers in recent years. In earlier studies on mortality prediction [20, 21, 22], hand-crafted features were selected and simple machine learning models like logistic regression were used with different severity scores such as APACHE [23], SAPS-II [24], and SOFA [25]. Nowadays, with the progress in deep learning, different architectures have been applied to EHR data to predict these kinds of problems. [26] used ensemble learning to make an early mortality prediction, and [27] proposed a method to predict mortality using 12 features extracted from the vital signals in the first hour of ICU admission. Darabi et al. [28] used a convolutional neural network to predict long-term mortality risk on the MIMIC-III dataset. More recent work [8] adds attention to the deep learning model to improve its success. Another work [29] tries to predict LOS for acute coronary syndrome patients. There is a comprehensive survey on mortality prediction and LOS [30]. Despite these studies and developments, one of the major problems that healthcare researchers have experienced is that studies in the literature lack standardized preprocessing steps such as unit conversion, handling outliers and missing values, and transforming raw structured data into usable hourly time-series data.
In order to solve this problem, [31, 32, 33] carried out comprehensive benchmarks on MIMIC-III for various tasks such as mortality, LOS, readmission, and phenotyping, and made their code publicly available. Purushotham et al. [33] extract 17 features from MIMIC-III and work on hospital mortality, LOS, and ICD-9 code group predictions. They compared their proposed super learner method with feedforward and recurrent neural networks. [31] is another study that benchmarked its results on MIMIC-III; they used multi-task learning approaches for four clinical prediction tasks: risk of mortality, LOS, detecting physiologic decline, and phenotype classification. MIMIC-Extract [32] is the most recent work, an open-source pipeline for transforming MIMIC-III data into directly usable features. Their pipeline first transforms the raw vital sign and laboratory data into hourly time series and then applies preprocessing steps such as unit conversion, outlier handling, and imputation of missing data. In this study, to increase reproducibility, we used the MIMIC-Extract pipeline to featurize the MIMIC-III data.

We also use medical entities extracted from clinical notes to improve our model predictions. Clinical natural language processing and information extraction from clinical notes have been widely studied in recent years. [34, 35] proposed deep learning based multi-task learning to make clinical predictions from clinical notes. [11] compared different embedding approaches, such as Bag of Words (BoW), Word2Vec, and LSTM, for clinical note representation by evaluating the prediction performance on diagnosis prediction and mortality risk estimation. More recently, transformer-based architectures such as BERT [36] and XLNet [37] gave state-of-the-art performance on different NLP tasks. These models are pre-trained on medical data and then fine-tuned on clinical text [38, 39].
However, clinicians generally use medical jargon and shorthand when they take these clinical notes, which makes them hard to process directly. There are a number of studies in the field of clinical NLP that try to extract medical entities from clinical notes [40, 41, 42]. In this work, we use med7 [15], which was developed for free-text electronic health records. We then combine these medical entities with structured data to benefit from a multimodal approach. For a detailed overview of deep learning for natural language processing in the clinical domain, readers can refer to [43].

Multimodal learning is a key research area that uses multiple sources to predict unique tasks [44]. This approach has shown success in image captioning [45], visual question answering [46], and speech recognition [47]. In the healthcare research domain, [48] combines unstructured clinical notes and structured time-series data for predicting in-hospital mortality, decompensation, and LOS. Similarly, [49] made a unified mortality prediction and tried to explore how physiological time-series data and clinical notes can be integrated. The study by Jin et al. [50] is the closest to our work in terms of motivation. They made in-hospital mortality predictions by combining clinical notes and time-series data. Clinical notes are represented with the Doc2VecC [51] algorithm in two different ways: first, they directly combine clinical notes with time-series data; second, they use a neural network based clinical NER service to extract five types of medical entities and identify negated entities in the clinical notes. After this pre-processing, they use the same representation as the first model and reported a 2% increase in the Area Under the Curve (AUC). The differences of our paper from [50] and the main contributions of this work can be summarized as follows.
• We work with four different clinical outcomes: in-hospital mortality, in-ICU mortality, LOS > 3 days, and LOS > 7 days.

• We compare different types of word embedding methods (Word2Vec, FastText, concatenation) and discuss the effect of these methods on medical entities.

• We propose a convolution-based deep learning model for combining clinical NER features with time-series ICU features. We compare our proposed model with several benchmarks.
Methods

In this section, we begin by describing our dataset. The details of the baselines and the clinical NER model are explained next, and finally we present our proposed multimodal deep learning models.
Dataset.

We use the publicly available MIMIC-III dataset, which contains de-identified EHR data covering 58,976 unique hospital admissions and 61,532 ICU admissions from 46,520 patients in the ICU of the Beth Israel Deaconess Medical Center between 2001 and 2012. We use MIMIC-Extract [32], an open-source data extraction pipeline, to extract structured time-series features from MIMIC-III. MIMIC-Extract mainly focuses on the patient's first ICU visit, with some patient inclusion criteria: it eliminates data from patients younger than 15 years old and from stays whose LOS is not between 12 hours and 10 days. This pipeline produces a cohort of 34,472 patients and 104 clinically aggregated time-series variables. In all of our experiments, we use the first 24 hours of the patient's data after ICU admission and, like MIMIC-Extract, only consider patients with at least 30 hours of present data. In our multimodal approach, we combine medical entities with the time-series variables. Before applying the clinical NER model to the notes, we drop discharge summaries to avoid any information leak. Furthermore, we drop all clinical notes whose chart time does not exist. After these steps, we drop all patients who do not have any clinical notes within 24 hours. The preprocessing of the clinical notes is done similarly to [48]. In the train-test split, for all clinical tasks, we split the data based on class distribution with a 70%/10%/20% ratio. Statistics of the final cohort and the intermediate cohorts are summarized in Table 1.
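The stratified 70%/10%/20% split described above can be sketched in pure Python. This is an illustrative implementation, not the authors' code:

```python
import random
from collections import defaultdict

def stratified_split(ids, labels, ratios=(0.7, 0.1, 0.2), seed=42):
    """Split patient ids into train/val/test while preserving the label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, y in zip(ids, labels):
        by_label[y].append(pid)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(ratios[0] * len(group))
        n_val = int(ratios[1] * len(group))
        # each class contributes proportionally to every partition
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

Splitting each class independently guarantees that rare positive outcomes (e.g. in-ICU mortality at 7%) appear in all three partitions at roughly the same rate.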
Problem Definition.
We mainly focus on two vital clinical prediction tasks: mortality (in-hospital & in-ICU) and LOS (> 3 days and > 7 days) in the ICU. We use the same definitions of the benchmark tasks defined by MIMIC-Extract, as the following four binary classification tasks. The explanations of these tasks and the class distributions are as follows:

1. In-hospital mortality: the patient dies during the hospital stay after ICU admission (significantly imbalanced, 10.5%).

2. In-ICU mortality: the patient dies during the ICU stay after ICU admission (significantly imbalanced, 7%).

3. Length-of-stay > 3: the patient stays in the ICU longer than 3 days (slightly imbalanced, 43.2%).

4. Length-of-stay > 7: the patient stays in the ICU longer than 7 days (significantly imbalanced, 7.9%).

Baselines.

In this subsection, we discuss the time-series baseline model that we evaluate on each of our four benchmark tasks. Further, we explain the clinical NER model, the embedding approaches used to represent medical entities, and the multimodal baselines used in this study.
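The four binary targets defined above can be derived directly from a patient's outcome record. The sketch below uses illustrative field names, not MIMIC-Extract's actual column names:

```python
def make_labels(record):
    """Derive the four binary classification targets from one patient record.

    `record` is assumed to hold in-hospital/in-ICU death flags and the ICU
    length of stay in days; the keys are hypothetical.
    """
    los_days = record["icu_los_days"]
    return {
        "mort_hosp": int(record["died_in_hospital"]),  # in-hospital mortality
        "mort_icu": int(record["died_in_icu"]),        # in-ICU mortality
        "los_3": int(los_days > 3),                    # LOS > 3 days
        "los_7": int(los_days > 7),                    # LOS > 7 days
    }
```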
Time-Series Baseline.

We employ both Long Short-Term Memory (LSTM) [52] and Gated Recurrent Unit (GRU) [53] networks to capture the temporal information in the patient features. In our time-series baseline experiments, GRU showed slightly better AUROC and AUPRC performance than LSTM. A GRU cell has a reset gate $r$ and an update gate $z$; with these gates, GRU can handle the vanishing gradient problem. The mathematical formulation of the GRU model is as follows:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
$\hat{h}_t = \tanh(W_h x_t + r_t \circ U_h h_{t-1} + b_h)$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \hat{h}_t$
$\hat{y} = \mathrm{sigmoid}(W h_t + b)$

where $z_t$ and $r_t$ respectively represent the update gate and the reset gate, $\hat{h}_t$ the candidate activation, $h_t$ the current activation, and $\circ$ denotes element-wise multiplication. For predicting mortality and LOS, a sigmoid classifier is stacked on top of a one-layer GRU with 256 hidden units.

Clinical NER Model.

In this work, besides the time-series features, we also use information from clinical notes to improve clinical task prediction performance. Instead of working directly with the clinical notes, we first aim to extract medically related keywords. Recently, several notable works in the clinical domain have made their pre-trained clinical NER models publicly available [54, 55, 15]. We use a pre-trained clinical NER model, med7 [15], which was trained on the same dataset that we use in our experiments, MIMIC-III. This clinical NER model extracts seven different named entities: 'Drug', 'Strength', 'Duration', 'Route', 'Form', 'Dosage', and 'Frequency'. To represent a patient's medical entities, we try two different embedding methods, word embedding and document embedding. First, we use three different word embedding algorithms to represent each clinical NER model output and compare their performance. Second, we use the Doc2Vec [56] algorithm to represent whole documents consisting of medical entities. The detailed schemas of these two approaches are shown in Figure 1, and the statistics of the medical entities extracted by med7 from the MIMIC-III dataset for the selected patients are shown in Table 2.

Medical Entity   Total Count   Unique Count   Example
Drug             744,778       18,268         Magnesium
Strength         156,486       10,749         400mg/5ml
Form             40,885        597            suspension
Route            207,876       1,193          PO
Dosage           126,756       7,239          30ml
Frequency        71,285        3,344          bid
Duration         5,939         1,185          next 5 days

Table 2: The first column shows the type of medical entity, the second column shows the total number of related entities found in the clinical notes, and the third column shows the number of unique entities. The last column shows the output of med7 for an example sentence from the clinical notes.

Figure 1: Methodology for learning medical entity vectors. (1) The medical entities extracted from clinical notes are embedded into continuous word vectors; then we take the mean of these learned entity representations. (2) Words are removed from the clinical notes if they do not belong to any medical entity category; then we train Doc2Vec on the preprocessed clinical notes to learn a low-dimensional representation of the medical entities.

Figure 2: Overview of the proposed multimodal architecture for predicting in-hospital mortality, in-ICU mortality, LOS > 3, and LOS > 7. To extract time-series features, we use the MIMIC-Extract pipeline and feed these features through a GRU. We also preprocess the clinical notes and use med7 to extract the medical entities. A 1D CNN is applied to extract features from the medical entity representations. In the final layer, we concatenate the features extracted from the time series and the medical entities and feed them through a fully connected layer to predict the four binary clinical tasks.
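The GRU update used in the time-series baseline can be written directly in NumPy. This single-step sketch uses randomly initialized parameters and is purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step; `*` is the element-wise product in the equations."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)             # reset gate
    h_hat = np.tanh(Wh @ x_t + r * (Uh @ h_prev) + bh)   # candidate activation
    return z * h_prev + (1.0 - z) * h_hat                # new hidden state

def init_params(d_in, d_h, rng):
    mk = lambda *shape: rng.standard_normal(shape) * 0.1
    return (mk(d_h, d_in), mk(d_h, d_h), mk(d_h),
            mk(d_h, d_in), mk(d_h, d_h), mk(d_h),
            mk(d_h, d_in), mk(d_h, d_h), mk(d_h))
```

Iterating `gru_step` over the 24 hourly feature vectors of a patient (104 features per hour) yields the 256-dimensional hidden state that the sigmoid classifier consumes.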
Word Embeddings.
Different word embedding methods may capture different semantic features of the same word. In our experiments, to understand this variety, we compare the performance of Word2Vec, FastText, and the concatenation of the Word2Vec & FastText embeddings. Word2Vec [16] is a two-layer neural network that learns representations of the words in a given text in two ways: as a continuous bag-of-words (CBOW) or as a skip-gram. FastText [17] is an extension of the skip-gram model implemented by Facebook's AI Research (FAIR) lab; it can handle out-of-vocabulary (OOV) words and can learn better representations for rare words by using several character n-grams per word. We use pre-trained Word2Vec ($w_i$) and FastText ($f_i$) embeddings, which were trained on 2.8 billion words from MIMIC-III clinical notes as shown in [38]. Lastly, we design an experimental embedding approach that concatenates the Word2Vec and FastText representations horizontally ($c_i$). When the Word2Vec embedding does not exist for a given word, we use zero padding in this setting.

Document Embeddings.
Doc2Vec is an extension of the Word2Vec model that learns document-level embeddings instead of word-level ones. Before learning document-level representations, we combine the first 24 hours of a patient's clinical notes and apply the clinical NER algorithm to keep only medically related keywords in the notes. When training Doc2Vec, we use a context window size of 5 words. This algorithm produces a fixed-length feature vector ($d_i$) for each patient.

We present two different baseline multimodal approaches, with word and document embeddings, that combine time-series data and medical entities.

Multimodal with Average Representation.
This model takes the average of all medical entities associated with a patient. For each patient, there are N clinical notes, and we extract K medical entities from these N clinical notes. Each medical entity is represented by word embeddings, as explained in the Word Embeddings section. We sum the K d-dimensional medical entity representations component-wise and then divide by K. We use two different input types to train our model: the time-series data is processed through a one-layer GRU with 256 hidden units, as explained above, and the averaged representations of the medical entities are combined with the time-series feature maps learned via the GRU. In the end, these merged feature representations are fed into a fully connected layer with 256 neurons, and a sigmoid classifier is added to the model.

Multimodal with Doc2Vec Representation.
In this multimodal approach, instead of averaging medical entities, we apply the Doc2Vec algorithm to obtain a fixed-length feature vector. First, we concatenate the N clinical notes of each patient and discard keywords from these notes if they are not medical entities. Then we apply the Doc2Vec algorithm to learn a low-dimensional representation of the notes of each patient. After learning the fixed-length feature vector, we use the same architecture as in the average embedding approach.

Proposed Model.

Figure 2 describes the proposed multimodal approach, which takes advantage of 1D convolutional layers as a feature extractor on medical entities. Applying 1D Convolutional Neural Networks (CNNs) to text learns combinations of adjacent words and has shown successful results on various NLP problems [57]. In our model, K medical entities are extracted from the N clinical notes of each patient. These K medical entities are first represented as a sequence of word embeddings with different word representation techniques: Word2Vec, FastText, and the combination of the two. These entity representations $e_i \in \mathbb{R}^d$ are stacked vertically, and each patient is represented by a matrix $M \in \mathbb{R}^{k \times d}$ whose rows are the medical entity representations. This patient clinical NER entity matrix (padded where necessary) is represented as

$M_{1:k} = e_1 \otimes e_2 \otimes \ldots \otimes e_k$    (1)

where $\otimes$ is the concatenation operator, $e_i$ refers to the representation of the i-th medical entity, and k is the number of entities. We use a 1D-CNN model similar to [58] to extract features from the medical entities. We stack three consecutive 1D convolutional layers with 32, 64, and 96 filters. The kernel size is the same for all three convolutional layers. The output of the last convolutional layer is followed by a max-pooling layer. The final features of the max-pooling layer are concatenated with the features from a one-layer GRU with 256 hidden units and fed through one fully connected layer with 512 hidden units.
Experiments

In this section, we report the results of our baseline and multimodal experiments, the metrics we used for the evaluation, and details about our development platform.
Training.
For all tasks, we use the patient's first 24 hours of ICU measurements. For the multimodal architectures, we use a 0.2 dropout rate at the end of the fully connected layer. A ReLU activation function is used for nonlinearity, and an L1 norm for sparsity regularization is selected with a 0.01 scale factor. For the optimization, we use the ADAM [59] algorithm with a learning rate of 0.001. All models are trained to minimize the binary cross-entropy loss, and we independently tune the hyperparameters (number of hidden layers, hidden units, convolutional filters, filter size, learning rate, dropout rates, and regularization parameters) on the validation set. Each model is trained for 50 epochs, and early stopping is used on the validation loss. We train each model 10 times with different initialization seeds and report the average performance.

Evaluation metrics.
The clinical problems that we work on suffer from class imbalance. We use three different metrics: Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall curve (AUPRC), and F1. AUROC is a popular, robust metric for imbalanced datasets [60]. The second metric, AUPRC, does not include true negatives in its calculation, which makes it useful for data with many true negatives, such as ours. F1 is the final metric, which is the harmonic mean of precision and recall.
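These three metrics can be computed with scikit-learn. A minimal sketch follows; the 0.5 threshold used to binarize probabilities for F1 is an assumption, not stated in the text:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Return AUROC, AUPRC, and F1 for one binary clinical task."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        # average precision is the standard estimator of the PR-curve area
        "auprc": average_precision_score(y_true, y_prob),
        "f1": f1_score(y_true, (y_prob >= threshold).astype(int)),
    }
```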
Implementation Details.
The aforementioned deep learning algorithms are implemented using Keras [61], which runs TensorFlow [62] as its backend. med7 is used for extracting clinically related entities from the clinical notes. All experiments were performed on a computer with an NVIDIA Tesla K80 GPU with 24 GB of VRAM, 378 GB of RAM, and an Intel Xeon E5 2683 processor. The full code of this work is available at https://github.com/tanlab/ConvolutionMedicalNer.

Task                    Baseline Model       AUROC
In-Hospital Mortality   GRU (time series)    85.04
In-ICU Mortality        GRU (time series)    86.32
LOS > 3                 GRU (time series)    67.40
LOS > 7                 GRU (time series)    70.54

Table 3: Performance comparison of baseline methods. For all four clinical tasks, we report AUROC, AUPRC, and F1 scores and the standard deviations.

Task                    Model           AUROC
In-Hospital Mortality   Best Baseline   86.42
In-ICU Mortality        Best Baseline   87.17
LOS > 3                 Best Baseline   68.90
LOS > 7                 Best Baseline   71.63

Table 4: Proposed model performance comparison with the best baseline model. We select the highest score for each metric and each clinical task from the baseline methods.
Results

We predict four different clinical tasks with the patient's first 24 hours of ICU measurements and medical entities. Table 3 summarizes the overall performance of the baseline methods. As seen from the results, despite the strong results of the time-series GRU model, the multimodal approaches improve the performance, as expected. For in-hospital mortality prediction, we see an improvement of 1.5% AUROC, 2.5% AUPRC, and 4% F1 score compared to the time-series GRU model. For the other mortality prediction task, in-ICU mortality, the multimodal approach improves the performance by around 2% for AUROC and AUPRC and 7% for the F1 score. The multimodal approach also improves the performance on the LOS prediction tasks: both for LOS > 3 and LOS > 7, all metrics are improved by around 1.5%. Across all experiments, the time-series GRU model obtains a better F1 score only for one of the LOS tasks.

Next, we compare the results of our proposed model against the best scores taken from the baseline models. All results for the proposed model against the best baseline scores are provided in Table 4. As shown in Table 3, the multimodal approach improves the performance of the prediction tasks over the time-series model; however, we try to use the medical entities more efficiently to further improve the predictions of our models. Except for the F1 score of one LOS task, the proposed model improves on the best baseline scores.

Table 3 shows that the use of medical entity features improves the prediction performance on all clinical tasks: the multimodal baseline models increase the performance on all metrics, which indicates the benefit of using medical entities for predicting mortality and LOS. These experiments also provide an opportunity to compare the medical entity representation methods. Although there is no certain winner for all tasks, the baseline results show that, for the mortality prediction tasks, representing the medical entities with the averaging method gives better results, while for the LOS prediction tasks, representing all medical entities together with Doc2Vec is as successful as the averaging method. Furthermore, the scores in Table 3 and Table 4 give us a chance to compare the word embedding approaches. We do not observe a significant change in performance between the word embedding techniques; however, the pre-trained Word2Vec model generally achieves slightly higher scores (around 0.5%) than FastText and the experimental concatenated embeddings. Apart from these experiments and comparisons, our main motivation is finding an efficient way to combine time-series features with medical entities. Even though both baseline multimodal models improve the prediction results compared to the time-series baseline, we want to take advantage of a 1D CNN to extract better features from the medical entities.
In the literature,there have been several studies that use 1D CNN in NLP. We stack three 1D convolution oper-ation to extract the features, and then apply 1D max pooling operation over the time-step toobtain a fixed-length vector. By analyzing the results between the proposed and baseline multi-modals, we see that 1D CNN based multimodal approach give better results than the averagingand document based embedding methods. Addition to these trials, we also make experiments by14sing only medical entity features as another baseline. However, only medical entity baseline givepoor results (around less than %10 for all tasks) compared to the timeseries and multimodal, sowe do not report these results.
Conclusion

Over the past decade, there has been increased attention on improving mortality and LOS prediction performance. Predicting complications and saving patients' lives are important tasks for the healthcare system, which motivates us to work on mortality prediction. LOS is another important clinical problem for improving hospital performance and healthcare resource utilisation. In this work, we present a 1D-CNN based multimodal deep learning architecture that uses time-series features and medical entities together, and this model outperforms several baselines. Our proposed model's performance gain over the multimodal baselines is around 1-1.5% AUPRC, and the improvement over the time-series baseline is around 2.5-3% AUPRC. We also run experiments to investigate the effect of different word embedding algorithms on our clinical problems and report the results. This work can be extended in multiple directions. First, we can involve more features associated with the patient, such as prescription data and diagnosis codes, to improve the prediction performance. Second, different word embeddings, especially transformer-based techniques, can be used for learning the entity representations. Finally, more advanced attention-based deep learning architectures may be useful for these clinical tasks.
References

[1] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

[2] Marzyeh Ghassemi, Mike Wu, Michael C Hughes, Peter Szolovits, and Finale Doshi-Velez. Predicting intervention onset in the ICU with switching state space models. AMIA Summits on Translational Science Proceedings, 2017:82, 2017.

[3] Matthew BA McDermott, Tom Yan, Tristan Naumann, Nathan Hunt, Harini Suresh, Peter Szolovits, and Marzyeh Ghassemi. Semi-supervised biomedical translation with cycle Wasserstein regression GANs. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[4] Christopher Barton, Uli Chettipally, Yifan Zhou, Zirui Jiang, Anna Lynn-Palevsky, Sidney Le, Jacob Calvert, and Ritankar Das. Evaluation of a machine learning algorithm for up to 48-hour advance prediction of sepsis using six vital signs. Computers in Biology and Medicine, 109:79-84, 2019.

[5] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301-318, 2016.

[6] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504-3512, 2016.

[7] Karla L Caballero Barajas and Ram Akella. Dynamically modeling patient's health state from electronic medical records: A time series approach. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69-78, 2015.

[8] Huan Song, Deepta Rajan, Jayaraman J Thiagarajan, and Andreas Spanias. Attend and diagnose: Clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[9] Harini Suresh, Jen J Gong, and John V Guttag. Learning tasks for multitask learning: Heterogenous patient populations in the ICU. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 802-810, 2018.

[10] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695, 2018.

[11] Willie Boag, Dustin Doss, Tristan Naumann, and Peter Szolovits. What's in a note? Unpacking predictive value in clinical note representations. AMIA Summits on Translational Science Proceedings, 2018:26, 2018.

[12] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60, 2014.

[13] Matthew Honnibal and Mark Johnson. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373-1378, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[14] Houssem Gasmi, Abdelaziz Bouras, and Jannik Laval. LSTM recurrent neural networks for cybersecurity named entity recognition.
ICSEA , 11:2018, 2018.[15] Andrey Kormilitzin, Nemanja Vaci, Qiang Liu, and Alejo Nevado-Holgado. Med7: a transfer-able clinical natural language processing model for electronic health records. arXiv preprintarXiv:2003.01271 , 2020.[16] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of wordrepresentations in vector space. arXiv preprint arXiv:1301.3781 , 2013.[17] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks forefficient text classification. arXiv preprint arXiv:1607.01759 , 2016.[18] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnosewith lstm recurrent neural networks. arXiv preprint arXiv:1511.03677 , 2015.[19] Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Using recurrent neuralnetwork models for early detection of heart failure onset.
Journal of the American MedicalInformatics Association , 24(2):361–370, 2017.[20] Sujin Kim, Woojae Kim, and Rae Woong Park. A comparison of intensive care unit mor-tality prediction models through the use of data mining techniques.
Healthcare informaticsresearch , 17(4):232–243, 2011.[21] Richard Dybowski, Vanya Gant, P Weller, and R Chang. Prediction of outcome in criticallyill patients using artificial neural network synthesised by genetic algorithm.
The Lancet ,347(9009):1146–1150, 1996.[22] Leo Anthony Celi, Sean Galvin, Guido Davidzon, Joon Lee, Daniel Scott, and Roger Mark.A database-driven decision support system: customized mortality prediction.
Journal ofpersonalized medicine , 2(4):138–148, 2012.1723] William A Knaus, Jack E Zimmerman, Douglas P Wagner, Elizabeth A Draper, and Diane ELawrence. Apache-acute physiology and chronic health evaluation: a physiologically basedclassification system.
Critical care medicine , 9(8):591–597, 1981.[24] Jean-Roger Le Gall, Stanley Lemeshow, and Fabienne Saulnier. A new simplified acutephysiology score (saps ii) based on a european/north american multicenter study.
Jama ,270(24):2957–2963, 1993.[25] J-L Vincent, Rui Moreno, Jukka Takala, Sheila Willatts, Arnaldo De Mendon¸ca, Hajo Bru-ining, CK Reinhart, PeterM Suter, and Lambertius G Thijs. The sofa (sepsis-related organfailure assessment) score to describe organ dysfunction/failure, 1996.[26] Aya Awad, Mohamed Bader-El-Den, James McNicholas, and Jim Briggs. Early hospitalmortality prediction of intensive care unit patients using an ensemble learning approach.
International journal of medical informatics , 108:185–195, 2017.[27] Reza Sadeghi, Tanvi Banerjee, and William Romine. Early hospital mortality predictionusing vital signals.
Smart Health , 9:265–274, 2018.[28] Hamid R Darabi, Daniel Tsinis, Kevin Zecchini, Winthrop F Whitcomb, and AlexanderLiss. Forecasting mortality risk for patients admitted to intensive care units using machinelearning.
Procedia Computer Science , 140:306–313, 2018.[29] Alexey Yakovlev, Oleg Metsker, Sergey Kovalchuk, and Ekaterina Bologova. Prediction ofin-hospital mortality and length of stay in acute coronary syndrome patients using machine-learning methods.
Journal of the American College of Cardiology , 71(11 Supplement):A242.[30] Aya Awad, Mohamed Bader-El-Den, and James McNicholas. Patient length of stay andmortality prediction: a survey.
Health services management research , 30(2):105–120, 2017.[31] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan.Multitask learning and benchmarking with clinical time series data.
Scientific data , 6(1):1–18, 2019.[32] Shirly Wang, Matthew BA McDermott, Geeticka Chauhan, Marzyeh Ghassemi, Michael CHughes, and Tristan Naumann. Mimic-extract: A data extraction, preprocessing, and repre-sentation pipeline for mimic-iii. In
Proceedings of the ACM Conference on Health, Inference,and Learning , pages 222–235, 2020. 1833] Sanjay Purushotham, Chuizheng Meng, Zhengping Che, and Yan Liu. Benchmarking deeplearning models on large healthcare datasets.
Journal of biomedical informatics , 83:112–134,2018.[34] Yuqi Si and Kirk Roberts. Deep patient representation of clinical notes via multi-tasklearning for mortality prediction.
AMIA Summits on Translational Science Proceedings ,2019:779, 2019.[35] Jingshu Liu, Zachariah Zhang, and Narges Razavian. Deep ehr: Chronic disease predictionusing medical notes. arXiv preprint arXiv:1808.04928 , 2018.[36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprintarXiv:1810.04805 , 2018.[37] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc VLe. Xlnet: Generalized autoregressive pretraining for language understanding. In
Advancesin neural information processing systems , pages 5754–5764, 2019.[38] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notesand predicting hospital readmission. arXiv preprint arXiv:1904.05342 , 2019.[39] Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann,and Matthew McDermott. Publicly available clinical bert embeddings. arXiv preprintarXiv:1904.03323 , 2019.[40] Henghui Zhu, Ioannis Ch Paschalidis, and Amir Tahmasebi. Clinical concept extraction withcontextual word embedding. arXiv preprint arXiv:1810.10566 , 2018.[41] Parminder Bhatia, Busra Celikkaya, Mohammed Khalilia, and Selvan Senthivel. Compre-hend medical: a named entity recognition and relationship extraction web service. arXivpreprint arXiv:1910.07419 , 2019.[42] Kathleen C Fraser, Isar Nejadgholi, Berry De Bruijn, Muqun Li, Astha LaPlante, and Khal-doun Zine El Abidine. Extracting umls concepts from medical text using general and domain-specific deep learning models. arXiv preprint arXiv:1910.01274 , 2019.[43] Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, SarveshSoni, Qiong Wang, Qiang Wei, Yang Xiang, et al. Deep learning in clinical natural languageprocessing: a methodical review.
Journal of the American Medical Informatics Association ,27(3):457–470, 2020. 1944] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng.Multimodal deep learning. 2011.[45] Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. Deep fragment embeddings for bidi-rectional image sentence mapping. In
Advances in neural information processing systems ,pages 1889–1897, 2014.[46] Ilija Ilievski and Jiashi Feng. Multimodal learning and reasoning for visual question answer-ing. In
Advances in Neural Information Processing Systems , pages 551–562, 2017.[47] Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. Deep multimodal learning foraudio-visual speech recognition. In , pages 2130–2134. IEEE, 2015.[48] Swaraj Khadanga, Karan Aggarwal, Shafiq Joty, and Jaideep Srivastava. Using clinical noteswith time series data for icu management. arXiv preprint arXiv:1909.09702 , 2019.[49] Satya Narayan Shukla and Benjamin M Marlin. Integrating physiological time series andclinical notes with deep learning for improved icu mortality prediction. arXiv preprintarXiv:2003.11059 , 2020.[50] Mengqi Jin, Mohammad Taha Bahadori, Aaron Colak, Parminder Bhatia, Busra Celikkaya,Ram Bhakta, Selvan Senthivel, Mohammed Khalilia, Daniel Navarro, Borui Zhang, et al. Im-proving hospital mortality prediction with medical named entities and multimodal learning. arXiv preprint arXiv:1811.12276 , 2018.[51] Minmin Chen. Efficient vector representation for documents through corruption. arXivpreprint arXiv:1707.02377 , 2017.[52] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory.
Neural computation ,9(8):1735–1780, 1997.[53] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empiricalevaluation of gated recurrent neural networks on sequence modeling. arXiv preprintarXiv:1412.3555 , 2014.[54] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. Scispacy: Fast and robustmodels for biomedical natural language processing. arXiv preprint arXiv:1902.07669 , 2019.2055] Andriy Mulyar, Darshini Mahendran, Luke Maffey, Amy Olex, Grant Matteo, Neha Dill,Nastassja Lewinski, and Bridget McInnes. Tac srie 2018: Extracting systematic reviewinformation with medacy.
Strain , 372:338.[56] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In
International conference on machine learning , pages 1188–1196, 2014.[57] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprintarXiv:1408.5882 , 2014.[58] Hakime ¨Ozt¨urk, Arzucan ¨Ozg¨ur, and Elif Ozkirimli. Deepdta: deep drug–target bindingaffinity prediction.
Bioinformatics , 34(17):i821–i829, 2018.[59] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXivpreprint arXiv:1412.6980 , 2014.[60] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In
Proceedings of the 23rd international conference on Machine learning , pages 233–240, 2006.[61] Fran¸cois Chollet. keras. https://github.com/fchollet/kerashttps://github.com/fchollet/keras