Inheritance-guided Hierarchical Assignment for Clinical Automatic Diagnosis
Yichao Du, Pengfei Luo, Xudong Hong, Tong Xu, Zhe Zhang, Chao Ren, Yi Zheng, Enhong Chen
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
{duyichao,pfluo,renchao}@mail.ustc.edu.cn, [email protected], {tongxu,cheneh}@ustc.edu.cn
Institution of Smart City Research (WuHu), University of Science and Technology of China, Wuhu, China
[email protected]
HUAWEI Technologies
[email protected]
Abstract.
Clinical diagnosis, which aims to assign diagnosis codes to a patient based on the clinical note, plays an essential role in clinical decision-making. Considering that manual diagnosis can be error-prone and time-consuming, many intelligent approaches based on clinical text mining have been proposed to perform automatic diagnosis. However, these methods may not achieve satisfactory results due to the following challenges. First, most diagnosis codes are rare, and their distribution is extremely unbalanced. Second, existing methods struggle to capture the correlations between diagnosis codes. Third, the lengthy clinical note leads to excessive dispersion of the key information related to codes. To tackle these challenges, we propose a novel framework that combines inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis. Specifically, we propose a hierarchical joint prediction strategy to address the challenge of unbalanced code distribution. Then, we utilize graph convolutional neural networks to obtain the correlation and semantic representations of the medical ontology. Furthermore, we introduce multiple attention mechanisms to extract crucial information. Finally, extensive experiments on the MIMIC-III dataset clearly validate the effectiveness of our method.
Keywords: clinical automatic diagnosis · hierarchical assignment · co-occurrence graph · graph convolutional network.

The clinical note is an essential part of the Electronic Health Record (EHR), which contains lengthy and terminological text records about medical history, chief complaint, current symptoms, and laboratory test results. To avoid the redundancy and ambiguity caused by free text, the World Health Organization recommends using the diagnosis codes in the International Classification of Diseases (ICD) for each disease, symptom, and sign to represent the patient's condition. The goal of clinical diagnosis is to assign the most likely diagnosis codes to the patient based on the clinical note. Traditionally, clinical diagnosis is completed by well-trained clinical coders, which is labor-intensive and error-prone because the diagnosis code system is vast and growing. For example, in the United States, about 20% of patients are misdiagnosed at the primary care level, and one-third of the misdiagnoses cause later severe injury to the patients [22].
Fig. 1. Illustration of the clinical automatic diagnosis task. The input and output of the model are the EHR and diagnosis codes, respectively. The text related to the diagnosis codes in the EHR is marked in colored font.
Consequently, automatic clinical diagnosis based on EHR has aroused widespread attention in industrial and academic circles [4]. Among the proposed methods, supervised machine learning methods were trained to learn shallow feature combinations for the clinical note [19,7]. Recently, most deep learning models have treated this task as a sequence learning problem, using Convolutional Neural Networks [16,9] and Recurrent Neural Networks [21,3] to capture complex semantic information. On this basis, medical ontology was further introduced as auxiliary knowledge. Specifically, Bai et al. [1] incorporated the disease encyclopedia of Wikipedia into the model to enhance its predictive ability. Besides, the patient's history and demographic information can also be leveraged to enhance the prediction of future admissions [20,1,14]. Although these methods have made significant progress in automatic diagnosis, they may still fail due to the following challenges:

– C1: The number of diagnosis codes is enormous, and the distribution is extremely unbalanced. For example, the MIMIC-III [6] dataset, which is widely used for automatic diagnosis, contains 8,925 codes, but 4,344 of them appear less than five times in all data. The severe long-tail distribution makes it difficult to assign proper codes to rare diseases, which may cause irreparable damage to the patients.

– C2: The correlations between diagnosis codes are greatly overlooked. However, the medical relationships between diseases can help us identify diseases that are not clearly reflected by the clinical note. As shown in Fig. 1, we can extract clues (colored fonts) from the text to assign diagnosis codes to the patient. For example, from the text "Hospital Acquired Pneumonia", we can easily infer the code "486 (Pneumonia, Organism Unspecified)". Nevertheless, it is difficult to infer the code "518.81 (Acute Respiratory Failure)" from the text alone. Fortunately, we can infer this code from its relationship with the code "486", that is, "Pneumonia, Organism Unspecified" will in all probability cause patients to develop the symptom of "Acute Respiratory Failure".

– C3: In the clinical note, only a few key fragments provide valuable information for automatic diagnosis.
For example, in the MIMIC-III dataset, clinical notes usually contain more than 1,500 tokens, but only a few tokens are related to specific diagnosis codes. Extracting crucial tokens for specific diagnosis codes is as tricky as finding a needle in a haystack.

To this end, we propose a model named Inheritance-guided Hierarchical Assignment with Co-occurrence-based Enhancement (IHCE) to address these challenges. First, for C1, we design a hierarchical assignment method based on the hierarchical inheritance structure of diagnosis codes defined by ICD, which makes assignments level by level. As shown in Fig. 2, "405.0 (Malignant secondary hypertension)" and "405.1 (Benign secondary hypertension)" are mutually exclusive. Moreover, "405.01 (Malignant renovascular hypertension)" inherits the information of "405.0". Consequently, if we assign "405.0" at the high level, we will tend to further assign "405.01" instead of the children of "405.1". With the inheritance-guided hierarchical assignment, we can use the diagnostic results of a high level to guide the low level, which addresses the challenge of unbalanced distribution. Second, for C2, we construct a co-occurrence graph based on EHR data and use a GCN to obtain the diagnosis codes' semantic representations. In this way, the representations of the diagnosis codes contain the correlations between diseases, which help us assign codes to diseases for which it is challenging to find textual clues in the clinical note. Third, for C3, we enhance the ability to extract the tokens related to the diagnosis codes based on an attention mechanism that models the interaction between the diagnosis codes' ontology representations and the clinical note. Finally, experiments on a real medical dataset show that IHCE is superior to the SOTA methods on all evaluation metrics.

Fig. 2. An example of diagnosis codes' descriptors and their hierarchical inheritance structure based on ICD.

Clinical automatic diagnosis has become a research hot spot in medicine, aiming to overcome the limitations of manual diagnosis. In recent years, deep learning technologies [21,16,9] have shown substantial advantages over traditional machine learning methods [19,7] and have been widely used for this task. Most researchers modeled this task as a multi-label text classification task based on the free text in EHR. Among them, Shi et al. [21] proposed a character-perceived LSTM network that generated written diagnosis descriptions and representations of diagnosis codes. Baumel et al. [3] proposed a hierarchical GRU with a label-dependent attention layer to alleviate the excessive-text problem. Wang et al. [23] proposed a label-word joint embedding model and applied cosine similarity to assign the codes. Moreover, some researchers incorporated external knowledge into the model [20,1,14]. For example, Knowledge Source Integration (KSI) [1] calculated the matching score between the clinical note and each knowledge document based on the intersection of clinical notes and external knowledge. Our method differs from these methods by considering the hierarchy and co-occurrence relationships to achieve better performance in automatic diagnosis.
In the past few years, the Graph Convolutional Network (GCN) [8] has been widely used to encode advanced graph structures in various tasks, such as healthcare [25,11], recommender systems [12], business analysis [10], machine translation [2], and text classification [24,18]. Specifically, in order to promote the sharing of disease information among patients, Liu et al. [11] applied a GCN on a text corpus to collect high-order neighbor information and made predictions for patients based on projection. Yao et al. [24] proposed Text-GCN, which learns representations of words and documents to improve text classification. Peng et al. [18] proposed a recursively regularized GCN to perform large-scale text classification on word co-occurrence graphs. Inspired by this, we apply a GCN to capture the correlations between diagnosis codes and to represent the medical ontology. Furthermore, we utilize the ontology representations as interactive information to improve the performance of automatic diagnosis.
For a patient, the word sequence $S = \{w_1, w_2, \ldots, w_n\}$ of the patient's clinical note is given, where $n$ is the length of $S$. Furthermore, a set of diagnosis codes $L = \{l_1, l_2, \ldots, l_{|L|}\} \in \{0, 1\}^{|L|}$ is also given to denote the diseases of the patient, where $|L|$ is the number of diagnosis codes. In addition, we introduce the hierarchical inheritance structure $\mathcal{L} = \{L^1, L^2, \ldots, L^T\}$ to expand $L$ based on external knowledge (i.e., the hierarchical inheritance structure based on ICD in Fig. 2), where $L^t = \{l^t_1, l^t_2, \ldots, l^t_{|L^t|}\}$ contains all diagnosis codes of level-$t$, and $T$ is the total number of hierarchical levels. Note that $L^T = L$, which means that the last hierarchical level is the same as the patient's diagnosis codes. With the above description, we can define the clinical automatic diagnosis task with inheritance guidance as follows:

Definition 1. Given the patient's clinical note sequence $S$ and the diagnosis codes' hierarchical inheritance structure $\mathcal{L}$, our goal is to predict the patient's diagnosis code sets $\hat{L}^t = \{\hat{l}^t_1, \hat{l}^t_2, \ldots\} \in \{0, 1\}^{|\hat{L}^t|}$ level by level, and finally use the last level $\hat{L}^T$ as the prediction of the patient's diagnosis.

As shown in Fig. 3, IHCE mainly contains three components: (1) the Document Encoding Layer (DEL), (2) the Ontology Representation Layer (ORL), and (3) the Hierarchical Prediction Layer (HPL). Specifically, we first utilize the DEL to obtain representations of the clinical note and diagnosis codes. Secondly, we apply the ORL to obtain the correlation and semantic representations of the medical ontology. Finally, we design the HPL to predict the patient's diagnosis codes based on hierarchical dependence and an attention mechanism.
The goal of the DEL is to generate unified representations for the clinical note and diagnosis codes. We first utilize the Embedding Module to encode the patient's clinical note and diagnosis codes. Then, we apply the Feature Extraction Module to enhance the semantic representation of the clinical note.
Embedding Module.
First, given the word sequence $S = \{w_1, w_2, \ldots, w_n\}$, we use the word vector matrix $E = [e_1, e_2, \ldots, e_{|E|}] \in \mathbb{R}^{|E| \times d_e}$ to obtain the word embedding sequence $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{n \times d_e}$, where $|E|$ is the size of the vocabulary and $d_e$ is the dimension of the word vectors. Similarly, we generate the diagnosis code ontology embedding for each code $l^t_i \in L^t$ by averaging the word embeddings of its descriptor sequence:

$$v^t_i = \frac{1}{|N^t_i|} \sum_{j \in N^t_i} e_j, \quad i = 1, \ldots, |L^t|, \qquad V^t = \left[v^t_1, v^t_2, \ldots, v^t_{|L^t|}\right] \in \mathbb{R}^{|L^t| \times d_e}, \quad (1)$$

where $N^t_i$ is the text descriptor index set of $l^t_i$, $v^t_i$ denotes the word embedding of $l^t_i$, and $V^t$ indicates the representations of all codes of level-$t$.

Fig. 3. The architecture of IHCE.

Feature Extraction Module.
As shown in the lower part of Fig. 3, we apply the multi-filter residual convolutional neural network [9] architecture for deep feature extraction on the clinical note's embedding matrix $X$. First, we utilize convolutional neural networks containing $m$ filters to capture patterns of different lengths in the word sequence:

$$X_k = F_k(X, W_k) = \tanh\left[\ldots, W_k^T X^{j:j+s_k-1}, \ldots\right], \quad k = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n, \quad (2)$$

where $F_k(X, W_k)$ denotes the convolution operation on the matrix $X$, $W_k \in \mathbb{R}^{(s_k \times d_e) \times d_c}$ is the parameter matrix, and $d_c$ indicates each convolutional layer's feature mapping dimension. $s_1, s_2, \ldots, s_m$ denote the different convolution kernel sizes, and $X^{j:j+s_k-1} \in \mathbb{R}^{s_k \times d_e}$ is the submatrix of $X$ from the $j$-th to the $(j+s_k-1)$-th row. Note that we set the padding and stride to $\lfloor s_k/2 \rfloor$ and 1, respectively, so the feature matrices $X_k \in \mathbb{R}^{n \times d_c}$, $k = 1, 2, \ldots, m$, are obtained. For conciseness, the bias is omitted in all formulas in this paper.

Next, we connect $m$ parallel residual blocks after the multi-filter convolutional layer, capturing longer text features by expanding the receptive field. Taking the $k$-th unit as an example, the residual block is formally defined as:

$$X^1_k = F_{k_1}(X_k, W_{k_1}) = \tanh\left[\ldots, W_{k_1}^T X_k^{j:j+s_k-1}, \ldots\right],$$
$$X^2_k = F_{k_2}(X^1_k, W_{k_2}) = \left[\ldots, W_{k_2}^T (X^1_k)^{j:j+s_k-1}, \ldots\right],$$
$$X^3_k = F_{k_3}(X_k, W_{k_3}) = \left[\ldots, W_{k_3}^T X_k^{j:j}, \ldots\right],$$
$$X^{res}_k = \tanh\left(X^2_k + X^3_k\right), \quad (3)$$

where $j = 1, 2, \ldots, n$, and $W_{k_i}$ is the weight matrix of the $k_i$-th convolution layer in the residual block; specifically, $W_{k_1} \in \mathbb{R}^{(s_k \times d_c) \times d_r}$, $W_{k_2} \in \mathbb{R}^{(s_k \times d_r) \times d_r}$, and $W_{k_3} \in \mathbb{R}^{(1 \times d_c) \times d_r}$. The output of each residual block is $X^{res}_k$, $k = 1, 2, \ldots, m$, where $d_r$ indicates the feature mapping dimension. Finally, we concatenate them by rows to obtain an enhanced representation of the clinical note:

$$X^{res} = \mathrm{concat}\left(X^{res}_1, \ldots, X^{res}_m\right) \in \mathbb{R}^{n \times d_{res}}, \quad \text{where } d_{res} = m \times d_r. \quad (4)$$

Comorbidities and complications manifest the correlations within the diagnosis code ontology and play an auxiliary role for codes that are difficult to predict from the clinical note alone. To this end, we first use co-occurrence features at each hierarchical level to construct a co-occurrence graph (co-graph) of the diagnosis code ontology. Then, we use a GCN to capture the ontology representations, which contain the correlations within the ontology. Here we take level-$t$ as an example to introduce the process.

Co-graph Construction.
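Returning briefly to the feature extractor, the multi-filter convolution and residual block of Equations (2)-(4) can be sketched with a minimal numpy implementation; one branch only, with toy dimensions and random weights standing in for the trained extractor:

```python
import numpy as np

def conv1d_same(X, W, s):
    """Same-padded 1-D convolution, stride 1, padding floor(s/2).
    X: (n, d_in); W: (s*d_in, d_out); returns (n, d_out)."""
    n, d_in = X.shape
    pad = s // 2
    Xp = np.vstack([np.zeros((pad, d_in)), X, np.zeros((pad, d_in))])
    return np.stack([Xp[j:j + s].ravel() @ W for j in range(n)])

rng = np.random.default_rng(0)
n, d_e, d_c, d_r, s = 10, 6, 5, 4, 3
X = rng.standard_normal((n, d_e))                                       # note embeddings

Xk = np.tanh(conv1d_same(X, rng.standard_normal((s * d_e, d_c)), s))    # Eq. (2)
Xk1 = np.tanh(conv1d_same(Xk, rng.standard_normal((s * d_c, d_r)), s))  # Eq. (3), first conv
Xk2 = conv1d_same(Xk1, rng.standard_normal((s * d_r, d_r)), s)          # second conv
Xk3 = conv1d_same(Xk, rng.standard_normal((1 * d_c, d_r)), 1)           # 1x1 shortcut
Xres = np.tanh(Xk2 + Xk3)                                               # residual output
```

Note how the 1x1 shortcut projects $X_k$ from $d_c$ to $d_r$ so that the addition in $X^{res}_k$ is dimensionally consistent; the full model runs $m$ such branches and concatenates them as in Equation (4).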
The co-graph is represented by $G^t = (L^t, E^t)$, where $L^t$ and $E^t$ indicate the diagnosis code set and edge set of level-$t$, respectively. For any diagnosis code $l^t_i$, if there is another code $l^t_j$ that co-appears with it in the EHR data, there is an edge $e(l^t_i, l^t_j)$ between them, whose weight is calculated as follows:

$$e(l^t_i, l^t_j) = \frac{\mathrm{count}(l^t_i, l^t_j)}{\sum_{l^t_k \in L^t} \mathrm{count}(l^t_i, l^t_k)}, \quad (5)$$

where $\mathrm{count}(\cdot, \cdot)$ indicates the number of times the two codes co-appear in the whole EHR dataset, which represents prior knowledge. The edge set $E^t$ can then be described as:

$$E^t = \left\{ e(l^t_i, l^t_j) \mid l^t_i, l^t_j \in L^t \right\}. \quad (6)$$
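The normalized co-occurrence weighting of Equation (5) can be sketched as follows; the tiny record set is invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Each record is the set of diagnosis codes of one admission (toy data).
records = [{"486", "518.81"}, {"486", "518.81", "401.9"}, {"486", "401.9"}]

# count(i, j): number of admissions in which codes i and j co-appear.
count = Counter()
for codes in records:
    for a, b in combinations(sorted(codes), 2):
        count[(a, b)] += 1
        count[(b, a)] += 1

def edge_weight(i: str, j: str) -> float:
    """Eq. (5): co-occurrence count normalized over all neighbors of i."""
    total = sum(c for (a, _), c in count.items() if a == i)
    return count[(i, j)] / total if total else 0.0
```

By construction, the outgoing weights of each node sum to 1, so the graph encodes, for each code, a distribution over its comorbid neighbors.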
Co-graph Propagation via GCN. Now we turn to representing the diagnosis codes. First, we obtain the feature matrix $H^{t,(0)} = V^t \in \mathbb{R}^{|L^t| \times d_e}$ of the diagnosis code ontology by Equation (1). For simplicity, we omit the superscript $t$ in the rest of this subsection. Then, we apply the GCN to propagate the representations of the diagnosis codes on the co-graph $G$, which takes the feature matrix $H^{(l)}$ and the matrix $\tilde{A}$ as input and updates the embedding of each code by utilizing the information of adjacent codes:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \quad (7)$$

where $\tilde{A} = A + I$, $A$ is the adjacency matrix of $G$, $I$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $W^{(l)}$ is a layer-specific trainable weight matrix. $\sigma(\cdot)$ denotes an activation function, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. $H^{(l)} \in \mathbb{R}^{|L| \times d_g}$ is the matrix of activations in the $l$-th layer, where $d_g$ indicates the hidden layer size of the GCN. The last hidden layer is used to represent the diagnosis code ontology, i.e., $H^t = H^{t,(l+1)} \in \mathbb{R}^{|L^t| \times d_g}$.

To simulate the gradual progress of human diagnosis from shallow to deep, we propose an inheritance-guided hierarchical joint learning mechanism. Specifically, according to the hierarchical structure of the codes, the patient is diagnosed progressively from coarse-grained to fine-grained.
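The single GCN layer of Equation (7) above can be sketched in numpy; the adjacency matrix, features, and weights are random stand-ins for the co-graph and ontology embeddings:

```python
import numpy as np

def gcn_layer(A, H, W):
    """Eq. (7): one GCN layer with symmetric normalization and ReLU."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                # symmetric, no self-loops
H = rng.standard_normal((5, 8))               # H^(0) = V^t (toy features)
W = rng.standard_normal((8, 6))               # layer weight matrix
H1 = gcn_layer(A, H, W)                       # H^(1), shape (5, 6)
```

Self-loops guarantee every row degree is at least 1, so the normalization never divides by zero even for isolated codes.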
Fig. 4. Hierarchical Prediction Module.
Fig. 4 shows the core module of the HPL, the Hierarchical Prediction Module (HPM). Specifically, the HPM is mainly composed of three parts: the Multi Attention Unit (MAU), the Code Predicting Unit (CPU), and the Dependency Passing Unit (DPU). For level-$t$, the input of the HPM includes three parts, i.e., the clinical note's representation $X^{res}$, the medical ontology representations $H^t$, and the dependency information $c^{t-1}$ of the previous level:

$$R^t = \mathrm{MAU}(X^{res}, H^t), \quad \tilde{Y}^t = \mathrm{CPU}(c^{t-1}, R^t), \quad c^t = \mathrm{DPU}(c^{t-1}, \tilde{Y}^t). \quad (8)$$

We first utilize the MAU to obtain the correlation representation $R^t$ between the clinical note and the medical ontology. Next, the CPU assigns the diagnosis codes $\tilde{Y}^t$ to the patient based on $R^t$ and $c^{t-1}$. Finally, the DPU generates the level dependency information $c^t$ for the next level based on the previous level's memory and the current level's assignment results. Note that we set $c^0$ to 0, since the first level has no previous level's information. Next, we introduce each unit of the HPM at level-$t$.
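The level-by-level control flow of Equation (8) can be sketched as follows; the three callables are placeholders standing in for the MAU, CPU, and DPU defined in the following subsections:

```python
def hierarchical_predict(X_res, H_levels, mau, cpu, dpu, c0):
    """Run the HPM over levels 1..T (Eq. (8)).
    H_levels: ontology representations H^1..H^T; returns Y~^1..Y~^T."""
    c, outputs = c0, []
    for H_t in H_levels:          # level-1 ... level-T
        R_t = mau(X_res, H_t)     # note/ontology interaction (MAU)
        Y_t = cpu(c, R_t)         # code assignment at this level (CPU)
        c = dpu(c, Y_t)           # dependency passed to the next level (DPU)
        outputs.append(Y_t)
    return outputs
```

The point of the loop is that each level's prediction is conditioned on the dependency vector inherited from the level above, which is how high-level results guide fine-grained assignment.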
Multi Attention Unit. By the operations above, we can obtain the clinical note representation $X^{res}$ and the medical ontology representations $H^t$. Intuitively, the patient's clinical note is composed of a large number of lengthy text descriptions, and different codes may focus on different aspects of the document. Therefore, for level-$t$, we need $|L^t|$ aspects that focus on different codes to represent the overall semantics of the whole clinical note. Next, we introduce the two attention mechanisms we use.
Ontology Guided Attention. For some diagnosis codes that are difficult to predict using only the clinical text, we can improve the prediction through interaction between the clinical note and the medical ontology. First, we pass the document feature matrix $X^{res}$ through a simple feed-forward neural network:

$$O'_t = \tanh\left(W'_t \cdot (X^{res})^T\right), \quad (9)$$

where $W'_t \in \mathbb{R}^{d_g \times d_{res}}$ is the transform matrix, $d_g$ is consistent with the column dimension of $H^t$, and $O'_t \in \mathbb{R}^{d_g \times n}$ is the intermediate result. Then, for each code $l^t \in L^t$, we generate the attention vector guided by the ontology:

$$\alpha_{l^t} = \mathrm{softmax}\left(h_{l^t} \cdot O'_t\right), \quad (10)$$

where $h_{l^t} \in H^t$ is the feature vector of label $l^t$, and $\mathrm{softmax}(\cdot)$ is the normalized exponential function applied row-wise. The attention $\alpha_{l^t} \in \mathbb{R}^{1 \times n}$ is then used to compute a vector representation for each label:

$$x^{att'}_i = \alpha_{l^t} \cdot X^{res}. \quad (11)$$

Finally, we concatenate the $x^{att'}_i$ ($i = 1, \ldots, |L^t|$) to obtain the ontology-guided document representation, denoted as $X^{att'}_t = [x^{att'}_1, x^{att'}_2, \ldots, x^{att'}_{|L^t|}] \in \mathbb{R}^{|L^t| \times d_{res}}$.
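Equations (9)-(11) can be sketched in numpy, with random tensors standing in for the trained parameters and features:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_res, d_g, L_t = 12, 7, 5, 3
X_res = rng.standard_normal((n, d_res))      # clinical note representation
H_t = rng.standard_normal((L_t, d_g))        # ontology representations H^t
W_prime = rng.standard_normal((d_g, d_res))  # W'_t

O_prime = np.tanh(W_prime @ X_res.T)         # Eq. (9), shape (d_g, n)
alpha = softmax(H_t @ O_prime)               # Eq. (10), one row per code
X_att = alpha @ X_res                        # Eq. (11), shape (|L^t|, d_res)
```

Each row of `alpha` is a distribution over the $n$ tokens, so each code pools the note into its own $d_{res}$-dimensional view.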
Code Specific Attention. Similar to ontology-guided attention, the code-specific attention is formalized as:

$$O''_t = \tanh\left(W''_t \cdot (X^{res})^T\right), \quad A''_t = \mathrm{softmax}\left(U''_t \cdot O''_t\right), \quad X^{att''}_t = A''_t \cdot X^{res}, \quad (12)$$

where $W''_t \in \mathbb{R}^{d_a \times d_{res}}$ is the intermediate parameter matrix, $d_a$ is a hyperparameter, $O''_t \in \mathbb{R}^{d_a \times n}$ is the intermediate result matrix, and $U''_t \in \mathbb{R}^{|L^t| \times d_a}$ is the code-specific attention parameter matrix. Finally, $X^{att''}_t \in \mathbb{R}^{|L^t| \times d_{res}}$ denotes the code-specific document representation.

With the above description, we take $R^t = \mathrm{concat}(X^{att'}_t, X^{att''}_t) \in \mathbb{R}^{|L^t| \times 2d_{res}}$ as the output of the MAU.
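Analogously, Equation (12) can be sketched by replacing the ontology query with a learned per-code parameter matrix; all tensors are random stand-ins:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d_res, d_a, L_t = 12, 7, 5, 3
X_res = rng.standard_normal((n, d_res))      # clinical note representation
W2 = rng.standard_normal((d_a, d_res))       # W''_t
U2 = rng.standard_normal((L_t, d_a))         # U''_t, learned per-code queries

O2 = np.tanh(W2 @ X_res.T)                   # (d_a, n)
A2 = softmax(U2 @ O2)                        # (|L^t|, n)
X_att2 = A2 @ X_res                          # (|L^t|, d_res)
```

The only structural difference from the ontology-guided variant is that the query rows come from a trained parameter matrix rather than from $H^t$; concatenating the two outputs column-wise yields $R^t$.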
Code Predicting Unit. For level-$t$, we combine the result $R^t$ of the MAU with the inherited information $c^{t-1}$ of the previous level to assign diagnosis codes to the patient. Specifically, the CPU uses a linear layer followed by a sigmoid transformation for each code:

$$X^{cls}_t = \mathrm{concat}\left(\mathrm{broadcast}(c^{t-1}), R^t\right), \quad \tilde{Y}^t = \sigma\left(X^{cls}_t \cdot W^t_y\right), \quad (13)$$

where $\mathrm{broadcast}(\cdot)$ makes matrices with different shapes compatible for arithmetic operations, $\sigma(\cdot)$ denotes an activation function, here $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$, $W^t_y \in \mathbb{R}^{(2d_{res} + d^{t-1}_c) \times 1}$ is the parameter of the CPU, and $\tilde{Y}^t \in \mathbb{R}^{|L^t| \times 1}$ is the prediction result of level-$t$.
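Equation (13) can be sketched with toy dimensions; $R^t$, $c^{t-1}$, and $W^t_y$ are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
L_t, d_res2, d_c_prev = 4, 10, 3                 # d_res2 plays 2*d_res here
R_t = rng.standard_normal((L_t, d_res2))         # MAU output
c_prev = rng.standard_normal(d_c_prev)           # dependency c^{t-1}

# Broadcast c^{t-1} to every code row, then concatenate with R^t.
X_cls = np.hstack([np.broadcast_to(c_prev, (L_t, d_c_prev)), R_t])
W_y = rng.standard_normal((d_res2 + d_c_prev, 1))
Y_t = 1.0 / (1.0 + np.exp(-(X_cls @ W_y)))       # sigmoid, per-code probability
```

Each entry of `Y_t` is an independent probability in (0, 1), matching the multi-label formulation: a code is assigned when its probability exceeds a decision threshold.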
Dependency Passing Unit. We aim to preserve important information while reducing the harm caused by error propagation from the previous level. Therefore, we employ the combination of a linear layer and a sigmoid function to imitate a gating mechanism that filters and integrates information as follows:

$$Z = \mathrm{concat}\left((\tilde{Y}^t)^T, c^{t-1}\right), \quad c^t = \sigma\left(Z \cdot W^t_{dpu}\right), \quad (14)$$

where $Z \in \mathbb{R}^{1 \times (|L^t| + d^{t-1}_c)}$ and $W^t_{dpu} \in \mathbb{R}^{(|L^t| + d^{t-1}_c) \times d^t_c}$ is the parameter matrix. We thus obtain the inter-level dependence $c^t \in \mathbb{R}^{1 \times d^t_c}$ based on the previous level's memory information and the prediction results of the current level.

For training, we combine the multi-label binary cross-entropy of all levels as the loss:

$$\mathrm{loss} = \sum_{t=1}^{T} \mathrm{loss}_t = \sum_{t=1}^{T} \sum_{i=1}^{|L^t|} \left[-y_i \log(\tilde{y}_i) - (1 - y_i) \log(1 - \tilde{y}_i)\right], \quad \tilde{y}_i \in \tilde{Y}^t, \quad (15)$$

where $\mathrm{loss}_t$ indicates the loss function of level-$t$.

In this paper, we conduct experiments on a real-world dataset, MIMIC-III, which is widely used in clinical automatic diagnosis. Following previous studies [16,9], we use the discharge summaries as the model's input and conduct experiments with both the full code set and the top 50 most common codes. Specifically, the MIMIC-III full setting includes 8,925 codes and 47,719, 1,631, and 3,372 discharge summaries for training, validation, and testing, respectively. The MIMIC-III top-50 setting includes 8,067, 1,574, and 1,730 discharge summaries for training, validation, and testing, respectively. In addition, we expand the codes from fine to coarse according to the hierarchical inheritance structure of ICD, because the EHR data only contain the finest-grained codes (i.e., level-4 in Table 1). The specific statistics are shown in Table 1. The evaluation metrics used in the experiments are Precision@K (K = 5, 8, and 15), Macro-F1, Micro-F1, Macro-AUC, and Micro-AUC.
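The level-summed binary cross-entropy of Equation (15) above can be sketched as follows; the per-level targets and predictions are toy arrays:

```python
import numpy as np

def bce(y, y_hat, eps=1e-9):
    """Multi-label binary cross-entropy for one level (inner sum of Eq. (15))."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum())

# Two hierarchy levels: coarse level with 2 codes, fine level with 4 codes.
levels_true = [np.array([1.0, 0.0]), np.array([1.0, 0.0, 0.0, 1.0])]
levels_pred = [np.array([0.9, 0.2]), np.array([0.8, 0.1, 0.3, 0.7])]

loss = sum(bce(y, p) for y, p in zip(levels_true, levels_pred))  # outer sum over t
```

Summing over levels means the gradient trains all hierarchical predictions jointly, rather than only the finest-grained level.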
Table 1.
The statistics of hierarchical levels.
We utilize PyTorch [17] to implement the IHCE model and train it on a server with 4 × V100 GPUs. For training, we use AdamW [13] and set the learning rate and weight decay to 0.0001 and 0.00005, respectively. We set the dropout probability to 0.4 and the batch size to 16. We also apply an early-stopping mechanism, in which training stops if the Micro-F1 score on the validation set does not improve for 10 continuous epochs. Since our model has a number of hyperparameters, it is infeasible to search optimal values for all of them. We keep the hyperparameters of the Feature Extraction Module consistent with Li et al. [9]. Specifically, the word embedding dimension is $d_e = 100$, the number of convolution kernels $m$ is 6, the sizes of the convolution kernels $s_1, s_2, \ldots, s_m$ are set to 3, 5, 9, 15, 19, and 25, $d_c = d_e$, and $d_r = 50$. Besides, we pre-train word embeddings on all the text in the training set using word2vec [15] as implemented by gensim (https://radimrehurek.com/gensim/). The maximum length of a token sequence is 2,500, and longer sequences are truncated. For the remaining parameters, we use grid search to find the optimal hyperparameters. Specifically, we set the number of hidden layers to 1 and the hidden layer size $d_g = 300$ for the GCN. In addition, we set $d_a = 300$ for the attention dimension and $d^t_c = 500$ ($t = 1, 2, \ldots, T-1$) for all DPUs' parameter dimensions.
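The early-stopping rule described above can be sketched as follows; the score sequence is illustrative:

```python
def stop_epoch(scores, patience=10):
    """Return the epoch index at which training stops: the first epoch whose
    validation score has not exceeded the best for `patience` epochs in a row,
    or the last epoch if the rule never triggers."""
    best, since_best = float("-inf"), 0
    for epoch, s in enumerate(scores):
        if s > best:
            best, since_best = s, 0
        else:
            since_best += 1
        if since_best >= patience:
            return epoch
    return len(scores) - 1
```

With `patience=10` and the validation Micro-F1 as the score, this matches the stopping criterion stated in the text.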
We compare IHCE with the following baselines, including machine learning and deep learning models:

– LR: a bag-of-words logistic regression model.
– H-SVM [19]: a hierarchical SVM algorithm that works from the root to the leaf nodes by utilizing the hierarchical structure of diagnosis codes.
– Bi-GRU [16]: employs bidirectional gated recurrent units to learn the clinical note's representation for the automatic diagnosis task.
– C-MemNN [20]: combines a memory network with iteratively compressed memory representations to improve diagnosis accuracy.
– C-LSTM-Att [21]: uses an LSTM-based language model to generate clinical note and diagnosis code representations, as well as an attention mechanism to resolve the mismatch between notes and codes.
– LEAM [23]: proposed for text classification; it projects labels and words into the same embedding space and uses cosine similarity to predict the labels of text.
– HARNN [5]: initially used for multi-label text classification, considering the hierarchy of categories; we apply it to automatic diagnosis.
– CNN [16]: uses a single-layer convolutional neural network and a max-pooling layer for the automatic diagnosis task.
– CAML and DR-CAML [16]: assign diagnosis codes based on clinical text by using a CNN to aggregate information across the clinical note and an attention mechanism to select the most relevant segment for each possible code; DR-CAML further uses the code text descriptions as regularization.
– MultiResCNN [9]: utilizes multi-filter convolutional neural networks and residual networks for automatic diagnosis; the SOTA model on MIMIC-III.
In this section, we compare IHCE with existing works for clinical automatic diagnosis. Table 2 shows the overall performance on the MIMIC-III full setting and the MIMIC-III top-50 setting. $T = 3$ means that our experiment is based on the last three levels (i.e., level-2 to level-4 in Table 1) of the hierarchy. Our model IHCE surpasses all baselines in both settings. The results indicate that IHCE can effectively perform clinical automatic diagnosis by exploiting the hierarchy and co-occurrence structure of the medical ontology together with the attention mechanisms. The specific analysis is as follows:

(1) In the MIMIC-III full setting, compared with the SOTA method MultiResCNN, IHCE improves Macro-AUC, Macro-F1, and Micro-F1 by 2.1%, 22.4%, and 3.8%, respectively. It is worth noting that all models have low Macro-F1 scores on the MIMIC-III full setting because the diagnosis code space is very large and the distribution is extremely unbalanced. Nevertheless, our model achieves substantial improvements in this metric compared to CAML and MultiResCNN. The reason is that IHCE considers the hierarchical inheritance structure and dependencies, so it can assist the processing of low-frequency codes based on high-level prediction results. Similarly, we observe that H-SVM, which uses a hierarchical structure, outperforms Bi-GRU, which does not, in terms of Micro-F1. However, the performance

Table 2.
Overall performance on MIMIC-III, where "-" means that the baseline did not report the result of the corresponding metric.
Models | MIMIC-III full (Macro-AUC / Micro-AUC / Macro-F1 / Micro-F1 / P@8 / P@15) | MIMIC-III top-50 (Macro-AUC / Micro-AUC / Macro-F1 / Micro-F1 / P@5)
LR | 56.1 / 93.7 / 1.1 / 27.2 / 54.2 / 41.1 | 82.9 / 86.4 / 47.7 / 53.3 / 54.6
H-SVM | - / - / - / 44.1 / - / - | - / - / - / - / -
C-MemNN | - / - / - / - / - / - | 83.3 / - / - / - / 42.0
C-LSTM-Att | - / - / - / - / - / - | - / 90.0 / - / 53.2 / -
HARNN | - / - / - / 40.5 / - / - | - / - / - / - / -
BiGRU | 82.2 / 97.1 / 3.8 / 41.7 / 58.5 / 44.5 | 82.8 / 86.8 / 48.4 / 54.9 / 59.1
LEAM | - / - / - / - / - / - | 88.1 / 91.2 / 54.0 / 61.9 / 61.2
CNN | 80.6 / 96.9 / 4.2 / 41.9 / 58.1 / 44.3 | 87.6 / 90.7 / 57.6 / 62.5 / 62.0
CAML | 89.5 / 98.6 / 8.8 / 53.9 / 70.9 / 56.1 | 87.5 / 90.9 / 53.2 / 61.4 / 60.9
DR-CAML | 89.7 / 98.5 / 8.6 / 52.9 / 69.0 / 54.8 | 88.4 / 91.6 / 57.6 / 63.3 / 61.8
MultiResCNN | 91.0 / 98.6 / 8.5 / 55.2 / 73.4 / 58.4 | 89.9 / 92.8 / 60.6 / 67.0 / 64.1
IHCE (T = 3)

of H-SVM is lower than that of CAML and MultiResCNN, because CAML and MultiResCNN utilize an attention mechanism to improve the ability to retrieve critical information. Furthermore, compared to CAML and MultiResCNN, our model has multiple attention mechanisms, so it has more robust key-information retrieval capabilities and surpasses them on all metrics.

(2) In the MIMIC-III top-50 setting, compared with the SOTA method MultiResCNN, IHCE improves Macro-F1 and Micro-F1 by 6.8% and 3.9%, respectively. Although there are only 50 diagnosis codes in the MIMIC-III top-50 setting, it still shows a slight long-tail effect. IHCE has a significant improvement on Macro-F1, indicating that our model can employ the hierarchical structure to alleviate this problem. It is worth noting that even though DR-CAML utilizes code descriptions as regularization to assist in the assignment of diagnosis codes that are difficult to predict, its effect is still limited compared to CNN. In contrast, IHCE better addresses this problem by utilizing the co-occurrence structure between codes.

In this section, we perform ablation studies to verify the effectiveness of each component of IHCE. The specific results are shown in Table 3. We observe that removing each component causes F1 to decrease, which illustrates the effectiveness of each component of our model. (1)
HPL’s effectiveness:
After removing the HPL module, the macro-averaged metrics drop significantly, indicating that the inheritance-guided hierarchical assignment mechanism introduced by IHCE plays a significant role in mitigating the long-tail effect. (2)
ORL’s effectiveness:
After ORL is removed, the overall performance of IHCE declines because the model can no longer capture disease co-occurrence relationships. This ability is beneficial for assigning diseases for which it is hard to find direct textual clues in the clinical note. (3)
Attention mechanism’s effectiveness:
We retain only the Code-Specific Attention module, which extends the attention mechanism in MultiResCNN and already improves almost all metrics. This shows that our attention mechanism can better extract essential information and prevents the search for code-related evidence from becoming a needle-in-a-haystack problem.
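The per-code attention that CAML introduced, and that a multi-attention design of this kind extends, can be sketched as follows. The array sizes and the reuse of the code query vectors as output vectors are simplifications for illustration, not the exact IHCE architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_words, d, n_codes = 120, 64, 8          # note length, hidden size, #codes (toy)

H = rng.normal(size=(n_words, d))         # encoder outputs, one vector per word
U = rng.normal(size=(n_codes, d))         # one learnable query vector per code

# Each code attends over the whole note, so the evidence for different
# codes can come from different (possibly distant) spans of the text.
alpha = softmax(U @ H.T, axis=1)          # (n_codes, n_words) attention weights
V = alpha @ H                             # (n_codes, d) code-specific summaries

# Per-code logits via a per-code output vector (reusing U here for brevity).
logits = (V * U).sum(axis=1)
probs = 1 / (1 + np.exp(-logits))         # independent sigmoid per code
print(probs.shape)                        # (8,)
```

Because every code has its own attention distribution, key information scattered across a lengthy clinical note is summarized separately for each candidate code rather than compressed into a single document vector.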
Table 3.
Ablation study results, where “w/o” indicates without.
Models               MIMIC-III full                  MIMIC-III top-50
                     Macro-AUC  Macro-F1  Micro-F1   Macro-AUC  Macro-F1  Micro-F1
MultiResCNN (SOTA)   91.0       8.5       55.2       89.9       60.6      67.0
w/o ORL&HPL          …
w/o HPL              …
w/o ORL              …
IHCE (T=3)           …
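The co-occurrence modeling that the "w/o ORL" variant removes can be approximated by a single graph-convolution step over a code co-occurrence adjacency matrix, using the standard renormalization of Kipf and Welling; the adjacency matrix and dimensions below are toy assumptions, not the actual ICD co-occurrence graph:

```python
import numpy as np

rng = np.random.default_rng(0)
n_codes, d_in, d_out = 5, 16, 16

# Symmetric co-occurrence adjacency between diagnosis codes (toy example).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

# Renormalization trick: A_hat = D^{-1/2} (A + I) D^{-1/2}
A_tilde = A + np.eye(n_codes)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

X = rng.normal(size=(n_codes, d_in))      # initial code embeddings
W = rng.normal(size=(d_in, d_out))        # layer weights

# One propagation step: each code's representation mixes in its
# co-occurring neighbours, so correlated codes get correlated embeddings.
H = np.maximum(A_hat @ X @ W, 0.0)        # ReLU(A_hat X W)
print(H.shape)                            # (5, 16)
```

Propagating embeddings along the co-occurrence graph is what lets the model assign a code even when its direct textual clues are weak, provided strongly co-occurring codes are well supported by the note.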
In the clinical automatic diagnosis task, it is important to assign the diagnosis codes of the last (deepest) level to the patient. It is also essential to evaluate the performance at different levels because, in some cases, a different granularity of codes may be required. Therefore, we compared the performance of IHCE and
Fig. 5. Performance at different levels in the hierarchy: (a) MIMIC-III top-50; (b) MIMIC-III full.
IHCE-DPU at each hierarchical level. Note that this comparison is based on T = 3. IHCE-DPU ignores the dependency between the levels by removing the DPU from the HPM. In Fig. 5, we can see that IHCE performs better than IHCE-DPU at almost all levels. Moreover, the performance on all metrics tends to decrease as the hierarchy deepens, and the trend on Macro-F1 in the MIMIC-III full setting is the most obvious. The reason is that as the level deepens, the number of codes at that level increases rapidly (e.g., the MIMIC-III full setting has 5,125 and 8,925 unique codes at level-3 and level-4, respectively, as shown in Table 1). Moreover, we can notice that IHCE reduces this negative effect compared with IHCE-DPU by modeling the dependency among the hierarchical levels.

In this section, we turn to the effect of the number of hierarchical levels, i.e., T. To that end, a series of experiments are conducted to evaluate the effectiveness under different settings. Specifically, T = n means choosing the last n levels in Table 1; for example, T = 2 means that we choose level-3 and level-4.

Fig. 6. Performance by varying the number of hierarchical levels: (a) MIMIC-III top-50; (b) MIMIC-III full.

From Fig. 6, we can conclude that the models that consider the hierarchical structure perform much better than models that do not. The performance rises as the number of levels T increases because high-level information has a guiding effect on the lower levels. However, the performance decreases when T increases further. The reason is that when the numbers of codes at different levels are not of the same order of magnitude, errors in high-level results still seriously affect the lower levels, although the DPU has a mitigating effect. Specifically, for the MIMIC-III full setting, when T = 4 the model extends to level-1, which has only 199 diagnosis codes and is not of the same order of magnitude as the other levels. For the MIMIC-III top-50 setting, the magnitudes of the levels do not differ much, so the impact of this error is also reduced.

In this paper, we proposed a novel Inheritance-guided Hierarchical Assignment with Co-occurrence-based Enhancement (IHCE) framework for clinical automatic diagnosis, which jointly exploits the code hierarchy and code co-occurrence. We utilized a GCN to capture the correlations within the medical ontology. Moreover, we proposed a hierarchical joint prediction strategy based on the attention mechanism. Experimental results on a real medical dataset show that our model achieves state-of-the-art performance with substantial improvements across different evaluation metrics. We believe that our method can also be applied to other tasks that rely on hierarchical structure and label co-occurrence, such as hierarchical multi-label classification.
Acknowledgements
This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2018YFB1402600), the National Natural Science Foundation of China (Grant No. 62072423), and the Key Research and Development Program of Anhui Province (No. 1804b06020377).
References
1. Bai, T., Vucetic, S.: Improving medical code prediction from clinical text via incorporating online knowledge sources. In: The World Wide Web Conference. pp. 72–82 (2019)
2. Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., Sima'an, K.: Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675 (2017)
3. Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes: a case study on ICD code assignment. arXiv preprint arXiv:1709.09587 (2017)
4. Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C., Corrado, G., Thrun, S., Dean, J.: A guide to deep learning in healthcare. Nature Medicine (1), 24–29 (2019)
5. Huang, W., Chen, E., Liu, Q., Chen, Y., Huang, Z., Liu, Y., Zhao, Z., Zhang, D., Wang, S.: Hierarchical multi-label text classification: An attention-based recurrent network approach. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. pp. 1051–1060 (2019)
6. Johnson, A.E., Pollard, T.J., Shen, L., Li-Wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific Data (1), 1–9 (2016)
7. Kavuluru, R., Rios, A., Lu, Y.: An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artificial Intelligence in Medicine (2), 155–166 (2015)
8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
9. Li, F., Yu, H.: ICD coding from clinical text using multi-filter residual convolutional neural network. In: AAAI. pp. 8180–8187 (2020)
10. Li, S., Zhou, J., Xu, T., Liu, H., Lu, X., Xiong, H.: Competitive analysis for points of interest. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1265–1274 (2020)
11. Liu, N., Zhang, W., Li, X., Yuan, H., Wang, J.: Coupled graph convolutional neural networks for text-oriented clinical diagnosis inference. In: International Conference on Database Systems for Advanced Applications. pp. 369–385. Springer (2020)
12. Liu, Y., Li, Z., Huang, W., Xu, T., Chen, E.H.: Exploiting structural and temporal influence for dynamic social-aware recommendation. Journal of Computer Science and Technology, 281–294 (2020)
13. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
14. Ma, F., You, Q., Xiao, H., Chitta, R., Zhou, J., Gao, J.: KAME: Knowledge-based attention model for diagnosis prediction in healthcare. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. pp. 743–752 (2018)
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
16. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable prediction of medical codes from clinical text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 1101–1111 (2018)
17. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8026–8037 (2019)
18. Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song, Y., Yang, Q.: Large-scale hierarchical text classification with recursively regularized deep graph-CNN. In: Proceedings of the 2018 World Wide Web Conference. pp. 1063–1072 (2018)
19. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association (2), 231–237 (2014)
20. Prakash, A., Zhao, S., Hasan, S.A., Datla, V., Lee, K., Qadir, A., Liu, J., Farri, O.: Condensed memory networks for clinical diagnostic inferencing. arXiv preprint arXiv:1612.01848 (2016)
21. Shi, H., Xie, P., Hu, Z., Zhang, M., Xing, E.P.: Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075 (2017)
22. Singh, H., Schiff, G.D., Graber, M.L., Onakpoya, I., Thompson, M.J.: The global burden of diagnostic errors in primary care. BMJ Quality & Safety (6), 484–494 (2017)
23. Wang, G., Li, C., Wang, W., Zhang, Y., Shen, D., Zhang, X., Henao, R., Carin, L.: Joint embedding of words and labels for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2321–2331 (2018)
24. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 7370–7377 (2019)
25. Du, Y., Xu, T., Ma, J., Chen, E., Zheng, Y., T.L., Tan, G.: An automatic ICD coding method for clinical records based on deep neural network. Big Data Research 6