A Meta-embedding-based Ensemble Approach for ICD Coding Prediction
Pavithra Rajendran, Alexandros Zenonos, Josh Spear, Rebecca Pope
Data Science and Engineering, KPMG
Abstract
International Classification of Diseases (ICD) codes are the de facto codes used globally for clinical coding. These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information. The problem of automatically assigning ICD codes has been approached in the literature as multilabel classification, using neural models on unstructured data. Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles. Furthermore, we exploit the geometric properties of the two sets of word vectors and combine them into a common dimensional space using meta-embedding techniques. We demonstrate the efficacy of this approach in a multimodal setting, using unstructured and structured information. We empirically show that our approach improves current state-of-the-art deep learning architectures and benefits ensemble models.
Introduction

The International Classification of Diseases (ICD) was created in 1893, when a French doctor named Jacques Bertillon defined 179 categories of causes of death. It has been revised every ten years since then and has become an important standard for information exchange in the health care sector. Importantly, it is endorsed by the World Health Organisation (WHO) and it has been widely adopted by physicians and other health care providers for reimbursement, storage and retrieval of diagnostic information.

ICD coding is a complex process in which clinical coders have to consult and navigate multiple sources to identify the right codes to use. The label space is very large, with over 15,000 codes in the ICD-9 taxonomy; the more recent ICD-10-CM/PCS has over 140,000 codes combined [World Health Organisation, 2016].

The diagnosis descriptions written by clinicians and the textual descriptions of ICD codes are written in quite different styles even when they refer to the same disease. Dealing with clinical text is challenging as it includes irrelevant information, has an informal tone, does not necessarily follow correct grammatical conventions and contains a large medical vocabulary. The definitions of ICD codes, on the other hand, are formally and precisely worded. While ICD codes are important for making clinical and financial decisions, clinical coding is time-consuming, error-prone and expensive for the reasons mentioned above.

ICD codes are organised in a hierarchical structure where the top-level codes represent generic disease categories and the bottom-level codes represent more specific diseases. A mistake may happen when the coder matches the diagnosis description to an overly generic code instead of a more specific code, or when a code is omitted altogether.
Also, it is important to note that all other illnesses or pre-existing conditions (called comorbidities) that affect a patient's care should be captured with a code.

In this work, we utilise machine learning and natural language processing techniques to help clinicians by automating the ICD coding process using information present in unstructured texts as well as structured data. In particular, we make use of the MIMIC-III dataset [Johnson et al., 2016] to empirically evaluate our approach.

Prior work on the MIMIC-III dataset has made use of deep learning methods, specifically classifying discharge summaries with their corresponding ICD codes [Mullenbach et al., 2018; Li and Yu, 2020; Xu et al., 2019]; descriptions of the ICD codes have also been used as label-based information to improve performance.

In our work, we propose a novel multimodal approach that enhances existing work without relying upon external knowledge, such as the descriptions of the ICD codes as label-based information, or upon complex architectures. Instead, we combine existing information present within various systems containing structured information along with discharge summaries. Furthermore, we exploit the geometric properties of the pretrained word embeddings and combine embeddings trained on external knowledge into a common dimensional space using meta-embedding techniques [Coates and Bollegala, 2018; Kiela et al., 2018; Bollegala et al., 2018]. Pretrained word embeddings have been successfully used in downstream NLP tasks [Kim, 2014], since they capture semantic relationships and provide better text representations. To study the potential advantage of our proposed approach, we consider prior work [Mullenbach et al., 2018; Li and Yu, 2020] utilising pretrained word embeddings specifically trained on the MIMIC dataset in their neural model.

Furthermore, models trained on structured and unstructured information in different modalities are combined in an ensemble: we build a meta-classifier trained on the predictions of the different models to make the final prediction. We empirically show that the final ensemble model outperforms the baseline methods in this multi-label classification task.

In particular, the contributions of this paper are as follows:

- We propose a simple and effective approach, which we term pre-training initialisation steps, applied to pre-trained word vectors that are fed into a CNN-based architecture for multi-label classification. We evaluate our approach on the top 32 and top 50 ICD codes, similarly to related work, and we show an improvement by utilising different initialisation steps.

- We propose an ensemble approach that utilises various modalities, combining both structured and unstructured information. Our ensemble model empirically outperforms the baseline methods.
Related Work

There has been growing interest within the field of natural language processing in learning representations of words as vectors, also known as word embeddings. Pretrained word embeddings are obtained by training on a large corpus, and two main approaches are followed for learning these vectors, namely count-based and prediction-based. The GloVe algorithm [Pennington et al., 2014] is a popular count-based approach that makes use of the co-occurrence probabilities of words. Word2Vec [Mikolov et al., 2013] is a popular prediction-based approach, in which models are trained using either CBOW (continuous bag of words) or the Skip-Gram approach. The growing interest in topics related to word embeddings is reflected by the numerous papers published in various NLP conferences.

A newer area of interest within word embeddings is to learn meta-embeddings from different pre-trained word vectors without having the actual text sources on which the word vectors were trained. One of the simplest approaches proposed is to use concatenation followed by averaging [Coates and Bollegala, 2018]. Yin et al. [2016] is one of the earliest works in meta-embedding and uses a projection layer known as 1TON for computing the meta-embeddings via a linear transformation. However, few works have studied the use of meta-embeddings for the healthcare domain [El Boukkouri et al., 2019; Chowdhury et al., 2020]. To the best of our knowledge, the application of meta-embedding techniques to multilabel classification of ICD coding has not yet been investigated. In this work we investigate several different approaches for doing so, particularly focusing on combining pretrained in-domain word embeddings with pretrained word embeddings derived from external sources such as scientific articles on the Web.

Automatic ICD coding using unstructured text data has been explored by researchers for several years, and the full breadth of learning approaches has been considered. Koopman et al.
[2015] utilised a multi-label classification approach and combined SVM classifiers via a hierarchical model to assign ICD codes to patient death certificates, first identifying whether the cause of death was due to cancer, then identifying the type of cancer and the associated ICD code. Whilst the final solution performed well, there were two key limitations to the approach: the coverage of cancers in the dataset was very imbalanced, so that cancer types associated with rarer diseases were harder to predict, and the cancer identification model was susceptible to false positives when a patient was cited as having cancer but it was not the primary cause of death.

More recently, the scope of the problem has been extended to include multiple ICD codes, and it has been addressed via multi-label methods, with researchers more often utilising deep learning approaches, generally centred around CNN- and LSTM-based architectures. Mullenbach et al. [2018] adopted a CNN architecture with a single filter, defining a per-label attention mechanism to identify the relevant parts of the latent space. The CNN architecture [Kim, 2014] has proven useful for sentence classification tasks, and in this work they empirically show that a CNN is better suited for this task. The per-label attention provided a means of scanning through the entire document without limiting it to a particular segment of the data. This approach achieved state-of-the-art results across several MIMIC datasets. In our approach, we use a similar architecture but instead focus on pre-training initialization steps and on combining structured features along with the unstructured information. Both approaches are explained in this paper, along with empirical evidence of the improved performance that our proposed approach gives.

Vu et al. [2020] proposed a BiLSTM encoder along with a per-label attention mechanism, inspired by the work of Lin et al.
[2017], which proved to perform well at generating general sentence embeddings. Vu et al. [2020] extended the attention mechanism by generalising it for multilabel classification, performing an attention hop per label.

Xie et al. [2019] proposed a text-CNN for modelling unstructured clinician notes; however, rather than implement an attention mechanism, the authors extracted features via TF-IDF from the unstructured guidelines provided to professionals when defining the ICD classifications. By including these features along with the convolved CNN layer, the authors mimicked the input that professionals would get from the guidelines. To enrich the predictions of the unstructured data model, the authors used an ensemble-based method of three models. Semi-structured data was utilised by embedding the ICD code descriptions in the same latent vector space as the diagnosis descriptions, and structured data was utilised through a decision tree model. The imbalance issues in the data were addressed via Label Smoothing Regularization, and the resulting model achieved state-of-the-art accuracy for the time, as well as improving model interpretability. However, the authors do not disclose what structured data they used specifically or what features were used.

Shi et al. [2017] propose a hierarchical deep learning model with an attention mechanism which can automatically assign ICD diagnostic codes given a written diagnosis. They also propose an attention mechanism designed to address the mismatch between the number of diagnosis descriptions and the number of assigned codes. The results show that the soft attention mechanism improves performance. However, they only focus on the top 50 ICD-9 codes.
Methodology

In this section, we explain our proposed multi-label classification approach on the discharge summaries (unstructured data) as well as the multimodal approach (structured and unstructured data) for automatically predicting the ICD coding.
In this subsection, we explain our approaches for classifying unstructured data, i.e., discharge summaries. Based on prior work [Mullenbach et al., 2018], we follow a similar deep learning approach using a convolutional neural network with a per-label attention mechanism that can span the entire document to identify the important portions of the document that correspond to the different labels. Differing from [Mullenbach et al., 2018], we propose the introduction of an additional step, which we term "pre-training initialization steps", for obtaining efficient and effective pre-trained word vectors that can provide a better representation of the text that is inputted into the CNN architecture.

Firstly, let us assume that we trained word embeddings on MIMIC discharge summaries using the Word2Vec algorithm [Mikolov et al., 2013], as (1) our purpose here is to focus on exploiting as much of the captured semantic information as possible for better utilisation of embeddings, and (2) it serves as a baseline comparison with prior work [Mullenbach et al., 2018]. In the first pre-training initialisation step, we introduce post-processing steps on the pretrained Word2Vec word vectors. The details of these post-processing steps are described below.
Post-processing word vectors
Let us consider the set of words present in a corpus, represented as w ∈ V, such that each word w is represented by a pretrained word embedding w ∈ R^k in some k-dimensional vector space.

We term the first step in the post-processing [Mu and Viswanath, 2018] MeanDiff: it computes the mean embedding vector of all embeddings of the words in V, and the embedding of each word is then recomputed by subtracting this mean from the corresponding word vector, as shown below:

\hat{w} = \frac{1}{|V|} \sum_{w \in V} w \qquad (1)

\forall w \in V: \quad \tilde{w} = w - \hat{w} \qquad (2)

Mu and Viswanath [2018] observed that the normalised variance ratio decays until some top D ≤ k components and remains constant after that, and proposed removing the top D principal components from the embeddings obtained after subtracting the mean vector (Eq. 2).

Based on this observation, the second step in the post-processing is to apply principal component analysis to the word vectors obtained using Eq. 2 and to remove the first D principal components from each individual word vector. We term this step PCADiff. It is carried out by arranging the word vectors as columns of a matrix A ∈ R^{k×|V|}. Denoting the principal components by u_1, ..., u_D, we remove the first D of them as follows:

w' = \tilde{w} - \sum_{i=1}^{D} \left( u_i^{\top} \tilde{w} \right) u_i \qquad (3)

In the above approach, we focus on the in-domain trained word embeddings, i.e. the word vectors trained on the MIMIC discharge summaries. The next approach that we propose is to use external knowledge available on the Web, i.e. PubMed scientific articles. We trained word vectors on the PubMed scientific articles using the Word2Vec algorithm. Recent work [Yin and Schütze, 2016] shows that pretrained word vectors trained on the same source of information but with different algorithms vary in the quality of the semantics they capture. We propose that the same intuition applies to using the same algorithm on different datasets.
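As a concrete illustration, the two post-processing steps (MeanDiff and PCADiff) can be sketched in a few lines of numpy. The embedding matrix below is a random stand-in for the real Word2Vec vectors, and the principal components are obtained via an SVD of the mean-centred matrix:

```python
import numpy as np

def mean_diff(E):
    """MeanDiff: subtract the vocabulary-wide mean embedding (Eqs. 1-2).
    E is a (|V|, k) matrix of word vectors, one row per word."""
    return E - E.mean(axis=0, keepdims=True)

def pca_diff(E_tilde, n_pc=2):
    """PCADiff: remove the projections onto the top-n_pc principal
    components of the mean-centred embeddings (Eq. 3)."""
    # Rows of Vt are the principal directions of the centred matrix.
    _, _, Vt = np.linalg.svd(E_tilde, full_matrices=False)
    U = Vt[:n_pc]                       # (n_pc, k) top components
    return E_tilde - (E_tilde @ U.T) @ U

# Toy vocabulary of 1000 words with k = 200 dimensions.
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 200))
E_post = pca_diff(mean_diff(E), n_pc=2)
```

After this transformation the embeddings have zero mean and no energy along the removed principal directions, which is exactly the property the post-processing aims for.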
In order to capture the semantics of both sets of word embeddings, i.e. those trained on the MIMIC discharge summaries and on the PubMed articles respectively, we combine the information in the form of "meta-embeddings", i.e. we map the different embeddings into a common meta-embedding space. While there are several approaches to obtaining meta-embeddings, in this paper we focus on the two methodologies described below.

Meta-embeddings
Let us assume, for simplicity, that we have two different sources of information, one based on the discharge summaries and the other based on the PubMed articles. Let us represent the two sources as S_1 and S_2 respectively. The two methodologies for obtaining meta-embeddings are explained below.

Averaging meta-embedding
Here, we assume that the word vectors for S_1 and S_2 were trained with the same algorithm, i.e. Word2Vec, and that the vectors of both sources have a common dimensionality d. Prior work [Coates and Bollegala, 2018] shows that averaging the embedding vectors is a useful meta-embedding technique that achieves results comparable to concatenating the vectors.

To understand this, consider the meta-embedding of each word obtained by averaging its word vectors from the two sources:

\hat{w} = \frac{w^{S_1} + w^{S_2}}{2} \qquad (4)

In this case, the Euclidean distance E_{AVG} between the meta-embeddings of two words, \hat{w} and \hat{w}', based on Eq. 4, is:

E_{AVG} = \| \hat{w} - \hat{w}' \| \qquad (5)

E_{AVG} \approx \frac{1}{2} \sqrt{E_{S_1}^2 + E_{S_2}^2 - 2 E_{S_1} E_{S_2} \cos\theta} \qquad (6)

E_{AVG} \approx \frac{1}{2} \sqrt{E_{S_1}^2 + E_{S_2}^2} \qquad (7)

where E_{S_1} and E_{S_2} denote the distances between the two words in each source space. Coates and Bollegala [2018] show that the source embedding spaces are approximately orthogonal, so that Eq. 6 reduces to Eq. 7; hence averaging captures approximately the same information as concatenation, without increasing the dimensionality. We therefore use this averaging method for combining the word embeddings into meta-embeddings.

Locally linear meta-embedding
As noted above, averaging-based meta-embeddings perform comparably to concatenation [Coates and Bollegala, 2018]. However, one of their limitations is that they do not capture the variations present within the local neighbourhood of the word vectors in the different sources. To address this issue, Bollegala et al. [2018] construct meta-embeddings in an unsupervised manner, mapping embeddings from different sources to a common meta-embedding space based on the local neighbourhood of a word in each of the sources. Two steps are performed: the embedding of a given word in each source is reconstructed from its local neighbourhood, and this reconstruction is then used to project the embeddings into a common meta-embedding space such that nearest-neighbouring words are embedded close to each other.

Let us assume we have the two sources S_1 and S_2, and let us represent their vocabularies as V_1 and V_2 respectively, with V = V_1 ∩ V_2 the common vocabulary. For each word v ∈ V_1 we represent the word vector as v^{V_1} ∈ R^{d_1}, and similarly for each word v ∈ V_2 as v^{V_2} ∈ R^{d_2}, where d_1 and d_2 are the dimensionalities of the two sources.

In the reconstruction step, for each word v ∈ V, i.e. v ∈ V_1 ∩ V_2, we obtain the k nearest neighbours of v within each of the two sources S_1 and S_2. Following prior work [Bollegala et al., 2018], this is carried out using the BallTree algorithm, since this approximate methodology reduces the time complexity of identifying the k neighbours. Let us denote the neighbour sets as N^{V_1}(v) and N^{V_2}(v) respectively. The reconstruction cost is

\Psi(W) = \sum_{i=1}^{2} \sum_{v \in V} \left\| v^{V_i} - \sum_{u \in N^{V_i}(v)} w_{uv} u^{V_i} \right\|^2 \qquad (8)

where w_{uv} = 0 if the words u and v are not k-nearest neighbours in either of the sources.
Given a word v ∈ V, for each neighbouring word u in N^{V_1}(v) and N^{V_2}(v), the reconstruction weights are learned such that the reconstruction error in Eq. 8 is minimised, i.e. minimising the sum of the local distortions in the two sources. To do this, the error gradient is computed as:

\frac{\partial \Psi(W)}{\partial w_{uv}} = -2 \sum_{i=1}^{2} \left( v^{V_i} - \sum_{x \in N^{V_i}(v)} w_{vx} x^{V_i} \right)^{\top} u^{V_i} \, \mathbb{I}\left[ u \in N^{V_i}(v) \right] \qquad (9)

Based on prior work [Bollegala et al., 2018], the weights are initialised uniformly at random for each of the k neighbours, and the optimal weights are obtained using stochastic gradient descent (SGD) with the initial learning rate set to 0.01 and the maximum number of iterations set to 100.

The weights are normalised and used in the projection step. The projection step makes use of the normalised reconstruction weights and learns the meta-embeddings of the words u, v ∈ V in a common d_P-dimensional space P, i.e. u^P, v^P ∈ R^{d_P}, such that the local neighbourhood from both sources is preserved. To do this, the projection cost given below is minimised:

\Psi(P) = \sum_{i=1}^{2} \sum_{v \in V} \left\| v^{P} - \sum_{u \in N^{V_i}(v)} w'_{uv} u^{P} \right\|^2 \qquad (10)

where

w'_{uv} = w_{uv} \sum_{i=1}^{2} \mathbb{I}\left[ u \in N^{V_i}(v) \right] \qquad (11)

and \mathbb{I}[x] is the indicator function, equal to 1 if x holds and 0 otherwise.

The meta-embeddings are obtained by computing the smallest (d_P + 1) eigenvectors of the matrix

M = (I - W')^{\top} (I - W') \qquad (12)

where the matrix W' contains the values computed using Eq. 11.

Based on the described pre-training initialization steps, we can apply the different post-processing techniques to pretrained word embeddings to obtain final embeddings. We can also combine embedding vectors (post-processed or not) from different sources into meta-embeddings. This provides us with different variants of pretrained word embeddings capturing different semantic information. These different embedding vectors can be fed into a neural model for multilabel classification of the ICD coding. For the purpose of comparison, we use the CNN-based architecture following prior work [Mullenbach et al., 2018], which is described below.

CNN Encoder
A document is represented as X = {x_1, ..., x_N} such that each word is represented using a pre-trained word vector. A convolutional neural architecture is used to encode the document at each position i as:

h_i = \tanh(W_c * x_{i:i+k-1} + b_c) \qquad (13)

where W_c ∈ R^{k × d_e × d_c} represents the convolutional filter, and k, d_e, d_c denote the filter width, input embedding dimension and filter output size respectively.

Per-Label Attention
Given the base representation of the document H = {h_1, ..., h_N}, for each label l a per-label attention vector is computed as follows:

a_l = \mathrm{softmax}(H^{\top} u_l) \qquad (14)

where u_l is the attention parameter vector for label l. This attention vector is then used to build the label-specific document representation:

v_l = \sum_{n=1}^{N} a_{l,n} h_n \qquad (15)

Classification
The probability per label is computed as follows:

\hat{y}_l = \sigma(\alpha_l^{\top} v_l + b_l) \qquad (16)

where α_l ∈ R^{d_c} is the vector containing the prediction weights for label l. The training objective minimises the binary cross-entropy loss:

\mathrm{Loss}(X, y) = -\sum_{l=1}^{L} y_l \log(\hat{y}_l) + (1 - y_l) \log(1 - \hat{y}_l) \qquad (17)

The probability scores per label are further used in combination with structured data to improve the performance of the multilabel classification, as explained in detail in the following subsection.

We utilise data from the MIMIC-III dataset, specifically admission information, lab reports, prescriptions, vital signs (chart events table) and microbiology test results. All of the aforementioned data are in tabular formats that contain information about a patient's care during their hospital admission, as well as some textual information such as medication and lab results.

Structured Data
We aggregate data from these modalities per admission, and we extract statistical properties, i.e., the mean, standard deviation, min and max of each numeric value, as well as the number of measurements taken for each patient in each admission, since we require at least some monitoring of the patient's health during their stay. In particular, we look at the 100 most common items for which measurements were taken for each patient. These include heart rate, hemoglobin, respiratory rate, creatinine, BUN, WBC, magnesium, etc.

Categorical Data
Textual information contained in tables but with no particular linguistic structure, such as medication (drug) names, is represented using Term Frequency–Inverse Document Frequency (TF-IDF) features. We use tree-based approaches, specifically eXtreme Gradient Boosting (XGBoost) models [Chen and Guestrin, 2016], for all the features we identified, but we keep the TF-IDF features in separate models.

These models are trained separately on the Train set and make predictions on a second, unseen test set, namely Test (as shown in Table 1). The test set is consequently used in an ensemble, where we use a 5-fold cross-validation method to train and test our meta-model.

Our meta-classifier is a logistic regression, which utilises the logistic (sigmoid) function to ensure the output is between 0 and 1:

g(z) = \frac{1}{1 + \exp(-z)} \qquad (18)

The logistic regression hypothesis is then defined as:

h_\beta(x) = g(\beta^{\top} x) \qquad (19)

To fit the model we use maximum likelihood, minimising

J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y_i \log(h_\beta(x_i)) - (1 - y_i) \log(1 - h_\beta(x_i)) \right] \qquad (20)

which implies we need to solve:

\frac{dJ(\beta)}{d\beta} = \frac{1}{m} \sum_{i=1}^{m} x_i \left( h_\beta(x_i) - y_i \right) = 0 \qquad (21)

Table 1: Total number of samples in the train, dev and test sets based on the 32 ICD-10 codes and 50 ICD-9 codes respectively.
Data        Train   Dev    Test
32 ICD-10   28201   3134   12430
50 ICD-9     8044    804    1725
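A minimal numpy sketch of the logistic-regression meta-classifier defined by Eqs. 18–21 is given below; the per-label probabilities from the base models are replaced by synthetic scores, and plain batch gradient descent stands in for whatever solver was actually used:

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + exp(-z))  (Eq. 18)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Fit logistic regression by gradient descent on the negative
    log-likelihood J(beta) (Eq. 20), using the gradient in Eq. 21."""
    beta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ beta) - y) / m
        beta -= lr * grad
    return beta

# Toy meta-features: stacked scores from four hypothetical base models.
rng = np.random.default_rng(0)
P = rng.uniform(size=(500, 4))
y = (P.mean(axis=1) > 0.5).astype(float)   # toy, linearly separable target
X = np.hstack([np.ones((500, 1)), P])      # prepend an intercept column
beta = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ beta) > 0.5) == y)
```

The decision rule sigmoid(X @ beta) > 0.5 is equivalent to X @ beta > 0, so only the learned direction of beta matters for classification on this toy data.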
Experiments

This section provides a brief overview of the MIMIC-III dataset that we use in our experiments, the different experimental settings for the unstructured data, i.e. discharge summaries, and the experiments using multimodal data (structured and unstructured information).
In this paper we use the well-known MIMIC-III database for empirically evaluating our approach. The dataset has electronic health records of 58,976 patients who stayed in the Intensive Care Unit (ICU) of the Beth Israel Deaconess Medical Centre from 2001 to 2012 [Johnson et al., 2016]. This includes information such as demographics, vital sign measurements, laboratory test results, procedures, medications, caregiver notes and imaging reports. For our purposes, we remove admissions that are not associated with a discharge summary. We also remove admissions that have no admission information, laboratory result data or prescriptions, as we require at least some monitoring of the patient's health during their stay. The numbers of unique admissions remaining after filtering are reported, per split, in Table 1. A discharge summary is considered very useful for understanding what happened during an admission, as it includes, but is not limited to, information about the history of illness, past medical history, medication, allergies, family and social history, the physical exam at the point of admission, a lab result summary, procedures, discharge condition and status, as well as discharge medication, follow-up plans, final diagnosis and other discharge instructions. In terms of preprocessing the summaries, we removed symbols and numbers not associated with text. In this work, our aim is to predict the ICD code classification for each of the admissions we are considering. Even though the MIMIC-III dataset includes ICD-9 mappings, we manually map those to the new ICD-10 codes and take the top 32, similarly to [Xu et al., 2019]. Also, for completeness, we experiment with the top 50 ICD-9 codes as in previous studies. Table 1 contains the number of samples present within the training, development and test sets used for experimentation.

Experiments based on both structured and unstructured data are carried out with the top 32 ICD-10 codes and the top 50 ICD-9 codes in order to align with benchmarks in the literature.
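The admission filtering described above can be sketched as follows; the record structure and field names are invented stand-ins for the real MIMIC-III tables, which would normally be joined on the admission identifier:

```python
# Hypothetical per-admission records standing in for joined MIMIC-III tables.
admissions = [
    {"hadm_id": 1, "has_summary": True,  "n_lab": 40, "n_rx": 12, "n_adm": 1},
    {"hadm_id": 2, "has_summary": False, "n_lab": 25, "n_rx": 3,  "n_adm": 1},
    {"hadm_id": 3, "has_summary": True,  "n_lab": 0,  "n_rx": 0,  "n_adm": 1},
]

def keep(adm):
    # Keep only admissions with a discharge summary and at least some
    # admission, lab and prescription data, per the filtering described above.
    return adm["has_summary"] and adm["n_adm"] > 0 and adm["n_lab"] > 0 and adm["n_rx"] > 0

filtered = [a for a in admissions if keep(a)]
```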
Unstructured Data
Experiments based on unstructured information were carried out by extracting the discharge summaries for the top 32 ICD-10 codes as well as the top 50 ICD-9 codes separately.

The CNN-based architecture (also known as CAML) [Mullenbach et al., 2018] is used for training and comparison purposes. We use this as a baseline approach, using embeddings trained on the MIMIC training set to initialise the neural model. The hyperparameters set are the embedding dimension, dropout, filter size, learning rate, batch size, number of filter maps and patience. We train the model with an early stopping criterion based on the micro-F1 score, such that training stops if the micro-F1 score does not improve for a set number of epochs.

We also experimented with the MultiResCNN architecture using the default hyperparameters [Li and Yu, 2020], aside from the embedding size, for which we used 200. Due to the complexity of the model, we report the scores at the same epoch as the best epoch achieved with the baseline CNN architecture, and found the baseline model to perform better. Henceforth, we use the baseline architecture for our further experiments. To assess the adaptability of our proposed methodology, we also report scores using the best meta-embedding technique with the MultiResCNN architecture.

In addition, we extract full scientific articles from PubMed, 672,589 articles in total, for training word embeddings [Moen and Ananiadou, 2013].

The baseline experiment reproduces the results of prior work, i.e. the per-label-attention-based CNN architecture [Mullenbach et al., 2018]; to do this, we use Word2Vec to generate word embeddings from the MIMIC training dataset. These embeddings are then used as input to train the CNN neural network.

We investigated the different "pre-training initialization steps" explained in Section 3.1 and ran experiments with the different resulting input embeddings to train the neural model.
Word embeddings trained on MIMIC discharge summaries and those trained on PubMed scientific articles are termed Word2Vec-MIMIC and Word2Vec-PubMed respectively.

1. MeanDiff-Word2Vec-MIMIC: the mean vector of the vectors in Word2Vec-MIMIC is removed from each individual word vector (Section 3.1).

2. MeanDiff-PCADiff-Word2Vec-MIMIC: the above step is followed and extended by computing the principal components and removing the first D principal components from the mean-removed word vectors. In our experiments we use D = 2, chosen based on best performance for the embedding dimension of 200. (Prior work theoretically shows that the choice of D depends on the length of the embeddings.)

3. Averaging: meta-embeddings obtained by combining Word2Vec-MIMIC and Word2Vec-PubMed using the averaging technique explained in Section 3.1. Both sets of vectors have dimensionality 200.

4. Locally Linear: meta-embeddings obtained by combining Word2Vec-MIMIC and Word2Vec-PubMed using the locally linear technique explained in Section 3.1. The number of nearest neighbours for each source is set to 1200, since this gives the best result based on prior work [Bollegala et al., 2018]. Here, the dimensions are also set to 200.

5. MeanDiff-Averaging: first, the steps in (1) are carried out on Word2Vec-MIMIC and Word2Vec-PubMed separately; the results are then combined using the technique in (3).

6. MeanDiff-PCADiff-Averaging: first, the steps in (2) are carried out on Word2Vec-MIMIC and Word2Vec-PubMed separately; the results are then combined using the technique in (3).

7. MeanDiff-Locally Linear: first, the steps in (1) are carried out on Word2Vec-MIMIC and Word2Vec-PubMed separately; the results are then combined using the technique in (4).

8. MeanDiff-PCADiff-Locally Linear: first, the steps in (2) are carried out on Word2Vec-MIMIC and Word2Vec-PubMed separately; the results are then combined using the technique in (4).

Experiments were conducted using the same hyperparameters as the baseline approach, in order to benchmark our results against prior work.
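Variants (1), (3) and (5) above can be sketched with toy vectors; the vocabulary and random vectors below are stand-ins for the real Word2Vec-MIMIC and Word2Vec-PubMed embeddings, which share the dimensionality d = 200:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["patient", "sepsis", "insulin"]
emb_mimic = {w: rng.normal(size=200) for w in vocab}    # stand-in source 1
emb_pubmed = {w: rng.normal(size=200) for w in vocab}   # stand-in source 2

def mean_diff(emb):
    # Variant (1): subtract the vocabulary-wide mean vector.
    mean = np.mean(list(emb.values()), axis=0)
    return {w: v - mean for w, v in emb.items()}

def averaging_meta(e1, e2):
    # Variant (3): average vectors over the common vocabulary (Eq. 4).
    common = set(e1) & set(e2)
    return {w: (e1[w] + e2[w]) / 2 for w in common}

# Variant (5), MeanDiff-Averaging: post-process each source, then average.
meta = averaging_meta(mean_diff(emb_mimic), mean_diff(emb_pubmed))
```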
Multimodal data
For the structured-data experiments we trained three separate models, all of them tree-based (eXtreme Gradient Boosting). For each set of experiments, i.e., top 32 ICD-10 codes and top 50 ICD-9 codes, we ran a randomised search for the best hyperparameters in a cross-validation fashion; the resulting values are shown in Table 2.

1. XGBoost Structured Data: this model utilises only information available in structured format in the MIMIC-III dataset. In particular, as mentioned above, we aggregate information at admission level.

2. XGBoost Prescription Data: this model utilises only information from the prescriptions table in the MIMIC-III dataset. In particular, we aggregate the drugs given to a patient in an admission into a single row and apply the TF-IDF technique to extract features. Our decision to exclude this from the unstructured data is because the result is not comprehensible text but rather a bag of words.

3. XGBoost Lab Exam Data: similarly to the medications, lab exams include labels, fluids and categories, describing the chemistry exams made, the type of fluid measurements that were taken and the overall category of each exam. The aim is to associate specific examinations with ICD codes. Again, due to the missing context, we treat this as a bag of words and apply TF-IDF to generate features.

All three models are trained separately but on data concerning the same patients. All models make predictions for the same Test set, as shown in Table 1.
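The bag-of-words TF-IDF featurisation used for the prescription and lab-exam models can be sketched as follows; the drug lists are invented for illustration, and the smoothed-IDF convention shown is one of several in common use (the paper does not specify which implementation was used):

```python
import math
from collections import Counter

# One invented "document" (bag of drug names) per admission.
docs = [
    ["insulin", "heparin", "aspirin"],
    ["insulin", "metformin"],
    ["heparin", "heparin", "warfarin"],
]

def tfidf(docs):
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    out = []
    for d in docs:
        tf = Counter(d)
        # Term frequency times a smoothed inverse document frequency.
        out.append({t: (c / len(d)) * math.log((1 + n) / (1 + df[t]))
                    for t, c in tf.items()})
    return out

features = tfidf(docs)
```

Rarer terms receive a larger IDF, so in the third admission the single "warfarin" outweighs the twice-occurring but more common "heparin".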
Meta data
It is critical to ensure that predictions from the different models are made on the same test set in order to train a meta-classifier, namely logistic regression. This classifier takes as inputs the predictions from the three XGBoost models above (Section 4.2) as well as from the CNN mentioned earlier (Section 4.2). We use a 5-fold cross-validation approach on the test set to measure the performance of our ensemble model.

Table 2: Hyperparameters used for the XGBoost algorithm for the top 32 ICD-10 and top 50 ICD-9 experiments.

Hyperparameter        32 ICD-10   50 ICD-9
Colsample by tree     0.85        0.98
Gamma                 0.86        0.78
Subsample             0.66        0.67
Number of estimators  2000        2000
Max depth             5           7
Min child weight      5           4
Learning rate         0.15        0.19
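Assuming each base model exposes a per-label probability score, the stacking step described above can be sketched as below. The four score columns are synthetic stand-ins for the outputs of the three XGBoost models and the CNN, for a single ICD code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)  # one ICD code's true labels

# Hypothetical probability scores from the four base models for this code,
# correlated with y to mimic models of varying usefulness.
def fake_scores(strength):
    return np.clip(0.5 + strength * (y - 0.5) + rng.normal(0, 0.2, n), 0, 1)

stacked = np.column_stack([fake_scores(s) for s in (0.3, 0.2, 0.1, 0.4)])

# Logistic-regression meta-classifier evaluated with 5-fold cross-validation,
# mirroring the evaluation protocol described above.
meta = LogisticRegression()
scores = cross_val_score(meta, stacked, y, cv=5)
print(scores.mean())
```

In the full multilabel setting, one such meta-classifier is trained per ICD code over the stacked probability columns.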
Table 3 contains the results of the different experiments conducted using the proposed approaches on unstructured data (Section 3.1) for the multilabel classification of 32 ICD-10 codes. In particular, the baseline performance of CAML [Mullenbach et al., 2018] was better than that of the more complex MultiResCNN [Li and Yu, 2020] for fewer epochs; hence, the remaining results are based on the CAML architecture. The best performance is achieved by applying the MeanDiff post-processing technique to the different embeddings and then performing the locally linear meta-embedding technique [Bollegala et al., 2018], with the results indicating that post-processing was effective only on the PubMed-trained embeddings. A likely reason is the inaccurate grammatical and semantic structure of the discharge summaries in the MIMIC dataset. Furthermore, the results indicate that the information captured by the local neighbourhood in each of the sources used for training the meta-embeddings is important for boosting performance. We find that post-processing the word embeddings by removing principal components did not provide any improvement; investigating the reasons for this is out of scope for this paper.

We find that the meta-embeddings were able to capture the different semantic information from the different sources, i.e. the discharge summaries and the external knowledge from PubMed articles. Meta-embedding techniques do not require the raw information used to originally train the embeddings; rather, the embeddings trained on this information are sufficient for training. This provides an efficient way of improving performance without adding to the complexity of the model.

Similar to Table 3, Table 5 contains the results of the different experiments conducted using the proposed approaches on unstructured data (Section 3.1) for the multilabel classification of the top 50 ICD-9 codes. We find a similar trend in performance for the post-processing techniques followed by the locally linear meta-embedding technique. Overall, our proposed approach clearly outperforms the baseline.

Table 4 presents the results of our experiments using the multimodal approach for predicting 32 ICD-10 codes; the methodology is explained in Section 3.2. From the results we can infer that structured information on its own does not perform well in comparison with the baseline approach that makes use of the features from the baseline CNN model. Similarly, Table 6 shows that structured data alone performs poorly, and that its performance improves when ensembled with predictions from other models. Crucially, we show that the locally linear meta-embedding approach achieves the highest scores when ensembled with structured data.

Our aim is to understand whether we can enhance the predictions from structured information by capturing better probability scores using our proposed approach on unstructured information. From the results we can infer that the best performance on structured data is achieved by combining the probability scores for each label obtained using the best-performing approach on the unstructured data. Overall, the results on unstructured data show that our proposed approach is effective and outperforms the baseline. In addition, our proposed approach enhances the performance of the multimodal multilabel classification.

Table 3: Micro and macro results are presented for multilabel classification of 32 ICD-10 codes on the test set. Input Embedding refers to the different input embedding vectors that are fed into the CNN-based architecture.

Input Embedding                                   Dim   Macro F1  Macro AUC  Micro F1  Micro AUC  P@8
BASELINE
Word2Vec-MIMIC (CAML [Mullenbach et al., 2018])   200   0.5554    0.8767     0.6749    0.9218     0.4034
Word2Vec-MIMIC-MultiResCNN [Li and Yu, 2020]      200   0.2519    0.7195     0.4225    0.8126     0.3151
POST-PROCESSED EMBEDDINGS
MeanDiff-Word2Vec-MIMIC                           200   0.5525    0.8749     0.6714    0.9205     0.4011
MeanDiff-PCADiff-Word2Vec-MIMIC                   200   0.5325    0.8726     0.6749    0.9214     0.4021
META-EMBEDDING: Word2Vec-MIMIC, Word2Vec-PubMed
Averaging                                         200   0.5750    0.8852     0.6819    0.9248     0.4074
Locally-linear                                    200
Locally-linear (MultiResCNN)                      200   0.4061    0.8139     0.5678    0.8813     0.3695
META-EMBEDDINGS on POST-PROCESSED EMBEDDINGS
DiffMean + Averaging                              200   0.6144    0.9043     0.6923    0.9336     0.4135
DiffMean-DiffPCA + Averaging                      200   0.6059    0.9004     0.6988    0.9341     0.4132
DiffMean + Locally Linear                         200
DiffMean-DiffPCA + Locally Linear                 200   0.6205    0.9080     0.7014    0.9372     0.4160

Table 4: Micro and macro results are presented for the multimodal multilabel classification of 32 ICD-10 codes.

Unstructured data features                        Macro F1  Macro AUC  Micro F1  Micro AUC
BASELINE                                          0.33211   0.6078     0.4521    0.6608
Word2Vec-MIMIC                                    0.5557    0.7294     0.6771    0.7921
META-EMBEDDING: Word2Vec-MIMIC, Word2Vec-PubMed
Locally linear meta-embedding                     0.5826

Table 5: Micro and macro results are presented for multilabel classification of the top 50 ICD-9 codes on the test set. Input Embedding refers to the different input embedding vectors that are fed into the CNN-based architecture.

Input Embedding                                   Dim   Macro F1  Macro AUC  Micro F1  Micro AUC  P@5
BASELINE
Word2Vec-MIMIC (CAML [Mullenbach et al., 2018])   200   0.5571    0.8693     0.6084    0.8910     0.5829
Word2Vec-MIMIC-MultiResCNN [Li and Yu, 2020]      200   0.3616    0.7643     0.5425    0.8492     0.3415
POST-PROCESSED EMBEDDINGS
MeanDiff-Word2Vec-MIMIC                           200   0.5597    0.8620     0.5974    0.8848     0.5739
MeanDiff-PCADiff-Word2Vec-MIMIC                   200   0.5544    0.8577     0.6022    0.8799     0.5730
META-EMBEDDING: Word2Vec-MIMIC, Word2Vec-PubMed
Averaging                                         200   0.5663    0.8642     0.6132    0.8894     0.5882
Locally-linear                                    200
Locally-linear (MultiResCNN)                      200   0.4475    0.8159     0.5761    0.8849     0.3716
META-EMBEDDINGS on POST-PROCESSED EMBEDDINGS
DiffMean + Averaging                              200   0.5665    0.8719     0.6079    0.8937     0.5863
DiffMean-DiffPCA + Averaging                      200   0.5489    0.8685     0.6049    0.8948     0.5631
DiffMean + Locally Linear                         200   0.5730    0.8758
DiffMean-DiffPCA + Locally Linear                 200

Table 6: Micro and macro results are presented for the multimodal multilabel classification of the top 50 ICD-9 codes.

Unstructured data features                        Macro F1  Macro AUC  Micro F1  Micro AUC
BASELINE                                          0.3940    0.6679     0.4662    0.6417
Word2Vec-MIMIC                                    0.5457    0.7179     0.6078    0.7499
META-EMBEDDING: Word2Vec-MIMIC, Word2Vec-PubMed
Locally linear meta-embedding                     0.5416
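The micro/macro F1 and precision@k figures reported in Tables 3–6 can be computed as sketched below on synthetic multilabel predictions; the helper precision_at_k is our own illustration of the P@8/P@5 metric, not part of the evaluated pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_docs, n_codes = 50, 8
y_true = rng.integers(0, 2, size=(n_docs, n_codes))  # gold ICD label matrix
scores = rng.random((n_docs, n_codes))               # model probability scores
y_pred = (scores > 0.5).astype(int)                  # thresholded predictions

# Macro averages F1 over codes; micro pools all (document, code) decisions.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)

def precision_at_k(y_true, scores, k):
    """Mean fraction of the k highest-scored codes per document that are correct."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, topk, axis=1)
    return hits.mean()

p_at_5 = precision_at_k(y_true, scores, k=5)
```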
In this work, we did not focus on the interpretability of the results; instead, we refer the reader to related work, specifically the methods in [Mullenbach et al., 2018] for predictions based on unstructured texts and [Xu et al., 2019] for multimodal data. Since these interpretability methods have been shown to be useful, we assume that their validation holds for our experiments as well. Also, in this paper we do not attempt to improve on the current architecture [Mullenbach et al., 2018]; rather, we show the benefit of utilising better features as well as structured data.
In this paper, we present a novel multimodal approach for predicting ICD codes using unstructured information along with structured information. In particular, our proposed approach, which we term pre-training initialisation steps, enhances the performance of the current state-of-the-art model on unstructured information by effectively exploiting the geometric properties of pre-trained word embeddings as well as combining external knowledge using meta-embedding techniques. We empirically show that our proposed approach can enhance the performance of current state-of-the-art approaches for multilabel classification of discharge summaries in the MIMIC-III dataset without relying on more complex architectures or using additional knowledge from the descriptions of the ICD codes. In particular, post-processing the word vectors and then combining the different pre-trained word embeddings using locally linear meta-embeddings provides the best performance. In addition, we empirically show that unstructured information enhances the performance of the multimodal multilabel classification approach.

In future, we would like to investigate softer measures of post-processing pre-trained word embeddings, such as using conceptor negation [Liu et al., 2019]. We would also like to investigate the potential to exploit the hierarchical structure of the text present in discharge summaries using hyperbolic embedding vectors [Liu et al., 2019], and to combine different hyperbolic embeddings using meta-embedding techniques [Jawanpuria et al., 2020]. Finally, we will experiment with more statistical features and other machine learning algorithms to assess whether we can further improve performance on the structured data.
References

[Bollegala et al., 2018] Danushka Bollegala, Kohei Hayashi, and Ken-ichi Kawarabayashi. Think globally, embed locally - locally linear meta-embedding of words. In Proceedings of IJCAI, pages 3970–3976, 2018.

[Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of KDD, pages 785–794, 2016.

[Chowdhury et al., 2020] Shaika Chowdhury, Chenwei Zhang, Philip S. Yu, and Yuan Luo. Med2meta: Learning representations of medical concepts with meta-embeddings. In Proceedings of HEALTHINF, pages 369–376, 2020.

[Coates and Bollegala, 2018] Joshua Coates and Danushka Bollegala. Frustratingly easy meta-embedding - computing meta-embeddings by averaging source word embeddings. In Proceedings of NAACL-HLT, pages 194–198, 2018.

[El Boukkouri et al., 2019] Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, and Pierre Zweigenbaum. Embedding strategies for specialized domains: Application to clinical entity recognition. In Proceedings of ACL, pages 295–301, 2019.

[Jawanpuria et al., 2020] Pratik Jawanpuria, N. T. V. Satya Dev, Anoop Kunchukuttan, and Bamdev Mishra. Learning geometric word meta-embeddings. In Proceedings of RepL4NLP@ACL, pages 39–44, 2020.

[Johnson et al., 2016] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

[Kiela et al., 2018] Douwe Kiela, Changhan Wang, and Kyunghyun Cho. Dynamic meta-embeddings for improved sentence representations. In Proceedings of EMNLP, pages 1466–1477, 2018.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751, 2014.

[Koopman et al., 2015] Bevan Koopman, Guido Zuccon, Anthony Nguyen, Anton Bergheim, and Narelle Grayson. Automatic ICD-10 classification of cancers from free-text death certificates. International Journal of Medical Informatics, 84(11):956–965, 2015.

[Li and Yu, 2020] Fei Li and Hong Yu. ICD coding from clinical text using multi-filter residual convolutional neural network. In Proceedings of AAAI, pages 8180–8187, 2020.

[Lin et al., 2017] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In Proceedings of ICLR, 2017.

[Liu et al., 2019] Tianlin Liu, Lyle Ungar, and João Sedoc. Unsupervised post-processing of word vectors via conceptor negation. In Proceedings of AAAI, pages 6778–6785, 2019.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119, 2013.

[Moen and Ananiadou, 2013] Hans Moen, Tapio Salakoski, and Sophia Ananiadou. Distributional semantics resources for biomedical text processing. In Proceedings of LBM, pages 39–44, 2013.

[Mu and Viswanath, 2018] Jiaqi Mu and Pramod Viswanath. All-but-the-top: Simple and effective postprocessing for word representations. In Proceedings of ICLR, 2018.

[Mullenbach et al., 2018] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. In Proceedings of NAACL-HLT, pages 1101–1111, 2018.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, 2014.

[Shi et al., 2017] Haoran Shi, Pengtao Xie, Zhiting Hu, M. Zhang, and E. Xing. Towards automated ICD coding using deep learning. ArXiv, abs/1711.04075, 2017.

[Vu et al., 2020] Thanh Vu, Dat Quoc Nguyen, and Anthony Nguyen. A label attention model for ICD coding from clinical text. In Proceedings of IJCAI, pages 3335–3341, 2020.

[Xu et al., 2019] Keyang Xu, Mike Lam, Jingzhi Pang, Xin Gao, Charlotte Band, Piyush Mathur, Frank Papay, Ashish K. Khanna, Jacek B. Cywinski, Kamal Maheshwari, et al. Multimodal machine learning for automated ICD coding. In Proceedings of the Machine Learning for Healthcare Conference, pages 197–215, 2019.

[Yin and Schütze, 2016] Wenpeng Yin and Hinrich Schütze. Learning word meta-embeddings. In Proceedings of ACL, pages 1351–1360, 2016.
A Appendix
Table 7: Mapping of ICD-9 to 32 ICD-10 codes