Inheritance-guided Hierarchical Assignment for Clinical Automatic Diagnosis
Yichao Du, Pengfei Luo, Xudong Hong, Tong Xu, Zhe Zhang, Chao Ren, Yi Zheng, Enhong Chen
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
{duyichao,pfluo,renchao}@mail.ustc.edu.cn, [email protected], {tongxu,cheneh}@ustc.edu.cn
Institution of Smart City Research (WuHu), University of Science and Technology of China, Wuhu, China
[email protected]
HUAWEI Technologies
[email protected]
Abstract.
Clinical diagnosis, which aims to assign diagnosis codes to a patient based on the clinical note, plays an essential role in clinical decision-making. Considering that manual diagnosis can be error-prone and time-consuming, many intelligent approaches based on clinical text mining have been proposed to perform automatic diagnosis. However, these methods may not achieve satisfactory results due to the following challenges. First, most diagnosis codes are rare, and their distribution is extremely unbalanced. Second, existing methods struggle to capture the correlations between diagnosis codes. Third, the lengthy clinical note leads to excessive dispersion of the key information related to codes. To tackle these challenges, we propose a novel framework that combines inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis. Specifically, we propose a hierarchical joint prediction strategy to address the challenge of unbalanced code distribution. Then, we utilize graph convolutional neural networks to obtain the correlation and semantic representations of the medical ontology. Furthermore, we introduce multiple attention mechanisms to extract crucial information. Finally, extensive experiments on the MIMIC-III dataset clearly validate the effectiveness of our method.
Keywords: clinical automatic diagnosis · hierarchical assignment · co-occurrence graph · graph convolutional network.

The clinical note is an essential part of the Electronic Health Record (EHR), which contains lengthy and terminological text records about medical history, chief complaint, current symptoms, and laboratory test results. To avoid the redundancy and ambiguity caused by free text, the World Health Organization recommends using the diagnosis codes in the International Classification of Diseases (ICD) for each disease, symptom, and sign to represent the patient's condition. The goal of clinical diagnosis is to assign the most likely diagnosis codes to the patient based on the clinical note. Traditionally, clinical diagnosis is completed by well-trained clinical coders, which is labor-intensive and error-prone because the diagnosis code system is vast and growing. For example, in the United States, about 20% of patients are misdiagnosed at the primary care level, and one-third of the misdiagnoses cause later severe injury to the patients [22].
Fig. 1. Illustration of the clinical automatic diagnosis task. The input and output of the model are the EHR and diagnosis codes, respectively. The text related to the diagnosis codes in the EHR is marked in colored font.
Consequently, automatic clinical diagnosis based on EHR has aroused widespread attention in industrial and academic circles [4]. Among the proposed methods, supervised machine learning methods were trained to learn shallow feature combinations for the clinical note [19,7]. Recently, most deep learning models have treated this task as a sequence learning problem, using Convolutional Neural Networks [16,9] and Recurrent Neural Networks [21,3] to capture complex semantic information. On this basis, medical ontology was further introduced as auxiliary knowledge. Specifically, Bai et al. [1] incorporated the disease encyclopedia of Wikipedia into the model to enhance its predictive ability. Besides, the patient's history and demographic information can also be leveraged to enhance the prediction of future admissions [20,1,14]. Although these methods have made significant progress in automatic diagnosis, they may still fail due to the following challenges:

– C1: The number of diagnosis codes is enormous, and the distribution is extremely unbalanced. For example, the MIMIC-III [6] dataset, which is widely used for automatic diagnosis, contains 8,925 codes, but 4,344 of them appear less than five times in all data. The severe long-tail distribution makes it difficult to assign proper codes to rare diseases, which may cause irreparable damage to the patients.

– C2: The correlations between diagnosis codes are greatly overlooked. However, the medical relationships between diseases can help us identify diseases that are not clearly reflected by the clinical note. As shown in Fig. 1, we can extract clues (colored fonts) from the text to assign diagnosis codes to the patient. For example, from the text "Hospital Acquired Pneumonia", we can easily infer the code "486 (Pneumonia, Organism Unspecified)". Nevertheless, it is difficult to infer the code "518.81 (Acute Respiratory Failure)" from the text alone. Fortunately, we can infer this code from its relationship with the code "486", that is, "Pneumonia, Organism Unspecified" will in all probability cause patients to develop the symptom of "Acute Respiratory Failure".

– C3: In the clinical note, only a few key fragments provide valuable information for automatic diagnosis.
For example, in the MIMIC-III dataset, clinical notes usually contain more than 1,500 tokens, but only a few tokens are related to specific diagnosis codes. Extracting crucial tokens for specific diagnosis codes is as tricky as finding a needle in a haystack.

To this end, we propose a model named Inheritance-guided Hierarchical Assignment with Co-occurrence-based Enhancement (IHCE) to address these challenges. First, for C1, we design a hierarchical assignment method based on the hierarchical inheritance structure of diagnosis codes defined by ICD, which makes assignments level by level. As shown in Fig. 2, "405.0 (Malignant secondary hypertension)" and "405.1 (Benign secondary hypertension)" are mutually exclusive. Moreover, "405.01 (Malignant renovascular hypertension)" inherits the information of "405.0". Consequently, if we assign "405.0" at the high level, we will tend to further assign "405.01" instead of the children of "405.1". With the inheritance-guided hierarchical assignment, we can use the diagnostic results of a high level to guide the low level, which addresses the challenge of unbalanced distribution. Second, for C2, we construct a co-occurrence graph based on EHR data and use a GCN to obtain the diagnosis codes' semantic representations. In this way, the representations of the diagnosis codes contain the correlations between diseases, which help us assign codes to diseases for which it is challenging to find textual clues in the clinical note. Third, for C3, we enhance the ability to extract the tokens related to the diagnosis codes based on an attention mechanism that models the interaction between the diagnosis codes' ontology representations and the clinical note. Finally, experiments on a real medical dataset show that IHCE is superior to the SOTA methods on all evaluation metrics.

Fig. 2. An example of diagnosis codes' descriptors and their hierarchical inheritance structure based on ICD.

Clinical automatic diagnosis has become a research hot spot in medicine, aiming to overcome the limitations of manual diagnosis. In recent years, deep learning technologies [21,16,9] have shown substantial advantages over traditional machine learning methods [19,7] and have been widely used for this task. Most researchers modeled this task as a multi-label text classification task based on the free text in EHR. Among them, Shi et al. [21] proposed a character-perceived LSTM network that generated written diagnosis descriptions and representations of diagnosis codes. Baumel et al. [3] proposed a hierarchical GRU with a label-dependent attention layer to alleviate the excessive-text problem. Wang et al. [23] proposed a label-word joint embedding model and applied cosine similarity to assign the codes. Moreover, some researchers incorporated external knowledge into the model [20,1,14]. For example, Knowledge Source Integration (KSI) [1] calculated the matching score between the clinical note and each knowledge document based on the intersection of clinical notes and external knowledge. Our method differs from these methods by considering the hierarchy and co-occurrence relationships to achieve better performance in automatic diagnosis.
In the past few years, the Graph Convolutional Network (GCN) [8] has been widely used to encode advanced graph structures in various tasks, such as healthcare [25,11], recommender systems [12], business analysis [10], machine translation [2], and text classification [24,18]. Specifically, in order to promote the sharing of disease information among patients, Liu et al. [11] applied a GCN on a text corpus to collect high-order neighbor information and made predictions for patients based on projection. Yao et al. [24] proposed Text-GCN, which learns representations of words and documents to improve text classification. Peng et al. [18] proposed a recursively regularized GCN to perform large-scale text classification on word co-occurrence graphs. Inspired by this, we apply a GCN to capture the correlations between diagnosis codes and to represent the medical ontology. Furthermore, we utilize the ontology representations as interactive information to improve the performance of automatic diagnosis.
For a patient, the word sequence $S = \{w_1, w_2, \ldots, w_n\}$ of the patient's clinical note is given, where $n$ is the length of $S$. Furthermore, a set of diagnosis codes $L = \{l_1, l_2, \ldots, l_{|L|}\} \in \{0, 1\}^{|L|}$ is also given to denote the diseases of the patient, where $|L|$ is the number of diagnosis codes. In addition, we introduce the hierarchical inheritance structure $\mathcal{L} = \{L^1, L^2, \ldots, L^T\}$ to expand $L$ based on external knowledge (i.e., the hierarchical inheritance structure based on ICD in Fig. 2), where $L^t = \{l^t_1, l^t_2, \ldots, l^t_{|L^t|}\}$ contains all diagnosis codes of level-$t$, and $T$ is the total number of hierarchical levels. Note that $L^T = L$, which means that the last hierarchical level is the same as the patient's diagnosis codes. With the above description, we can define the clinical automatic diagnosis task with inheritance guidance as follows:

Definition 1. Given the patient's clinical note sequence $S$ and the diagnosis codes' hierarchical inheritance structure $\mathcal{L}$, our goal is to predict the patient's diagnosis code sets $\hat{L}^t = \{\hat{l}^t_1, \hat{l}^t_2, \ldots\} \in \{0, 1\}^{|\hat{L}^t|}$ level by level, and finally use the last level $\hat{L}^T$ as the prediction of the patient's diagnosis.

As shown in Fig. 3, IHCE mainly contains three components: (1) the Document Encoding Layer (DEL), (2) the Ontology Representation Layer (ORL), and (3) the Hierarchical Prediction Layer (HPL). Specifically, we first utilize the DEL to obtain representations of the clinical note and diagnosis codes. Secondly, we apply the ORL to obtain the correlation and semantic representations of the medical ontology. Finally, we design the HPL to predict the patient's diagnosis codes based on hierarchical dependence and an attention mechanism.
The goal of the DEL is to generate unified representations for the clinical note and diagnosis codes. We first utilize the Embedding Module to encode the patient's clinical note and diagnosis codes. Then, we apply the Feature Extraction Module to enhance the semantic representation of the clinical note.
Embedding Module.
First, given the word sequence $S = \{w_1, w_2, \ldots, w_n\}$, we use the word vector matrix $E = [e_1, e_2, \ldots, e_{|E|}] \in \mathbb{R}^{|E| \times d_e}$ to obtain the word embedding sequence $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{n \times d_e}$, where $|E|$ is the size of the vocabulary and $d_e$ is the dimension of the word vectors. Similarly, we generate the diagnosis code ontology embedding for each code $l^t_i \in L^t$ by averaging the word embeddings of its descriptor sequence:

$$v^t_i = \frac{1}{|N^t_i|} \sum_{j \in N^t_i} e_j, \quad i = 1, \ldots, |L^t|, \qquad V^t = \left[v^t_1, v^t_2, \ldots, v^t_{|L^t|}\right] \in \mathbb{R}^{|L^t| \times d_e}, \quad (1)$$

where $N^t_i$ is the text descriptor index set of $l^t_i$, $v^t_i$ denotes the word embedding of $l^t_i$, and $V^t$ indicates the representations of all codes of level-$t$.

Fig. 3. The architecture of IHCE.

Feature Extraction Module.
As shown in the lower part of Fig. 3, we apply the multi-filter residual convolutional neural network [9] architecture for deep feature extraction on the clinical note's embedding matrix $X$. First, we utilize convolutional neural networks containing $m$ filters to capture patterns of different lengths in the word sequence:

$$X_k = F_k(X, W_k) = \tanh\left[\ldots, W_k^T X^{j:j+s_k-1}, \ldots\right], \quad k = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n, \quad (2)$$

where $F_k(X, W_k)$ denotes the convolution operation on the matrix $X$, $W_k \in \mathbb{R}^{(s_k \times d_e) \times d_c}$ is the parameter matrix, and $d_c$ indicates each convolutional layer's feature mapping dimension. $s_1, s_2, \ldots, s_m$ denote the different convolution kernel sizes, and $X^{j:j+s_k-1} \in \mathbb{R}^{s_k \times d_e}$ is the submatrix of $X$ from the $j$-th to the $(j+s_k-1)$-th row. Note that we set the padding and stride to $\lfloor s_k/2 \rfloor$ and 1, respectively, so the feature matrices $X_k \in \mathbb{R}^{n \times d_c}$, $k = 1, 2, \ldots, m$, are obtained. For conciseness, the bias is omitted in all formulas in this paper.

Next, we connect $m$ parallel residual blocks after the multi-filter convolutional layer, capturing longer text features by expanding the receptive field. Taking the $k$-th unit as an example, the residual block is formally defined as:

$$X^1_k = F_{k_1}(X_k, W_{k_1}) = \tanh\left[\ldots, W_{k_1}^T X_k^{j:j+s_k-1}, \ldots\right],$$
$$X^2_k = F_{k_2}(X^1_k, W_{k_2}) = \left[\ldots, W_{k_2}^T (X^1_k)^{j:j+s_k-1}, \ldots\right],$$
$$X^3_k = F_{k_3}(X_k, W_{k_3}) = \left[\ldots, W_{k_3}^T X_k^{j:j}, \ldots\right],$$
$$X^{res}_k = \tanh\left(X^2_k + X^3_k\right), \quad (3)$$

where $j = 1, 2, \ldots, n$, and $W_{k_i}$ is the weight matrix of the $k_i$-th convolution layer in the residual block; specifically, $W_{k_1} \in \mathbb{R}^{(s_k \times d_c) \times d_r}$, $W_{k_2} \in \mathbb{R}^{(s_k \times d_r) \times d_r}$, and $W_{k_3} \in \mathbb{R}^{(1 \times d_c) \times d_r}$. The output of each residual block is $X^{res}_k$, $k = 1, 2, \ldots, m$, where $d_r$ indicates the feature mapping dimension. Finally, we concatenate them by rows to obtain an enhanced representation of the clinical note:

$$X^{res} = \mathrm{concat}\left(X^{res}_1, \ldots, X^{res}_m\right) \in \mathbb{R}^{n \times d_{res}}, \quad \text{where } d_{res} = m \times d_r. \quad (4)$$

Comorbidities and complications manifest the correlations within the diagnosis code ontology and play an auxiliary role for codes that are difficult to predict from the clinical note alone. To this end, we first use co-occurrence features at each hierarchical level to construct a co-occurrence graph (co-graph) of the diagnosis code ontology. Then, we use a GCN to capture the ontology representations, which contain the correlations within the ontology. Here we take level-$t$ as an example to introduce the process.

Co-graph Construction.
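Returning briefly to the feature extractor, the multi-filter convolution and residual block of Equations (2)-(4) can be sketched with a minimal numpy implementation; one branch only, with toy dimensions and random weights standing in for the trained extractor:

```python
import numpy as np

def conv1d_same(X, W, s):
    """Same-padded 1-D convolution, stride 1, padding floor(s/2).
    X: (n, d_in); W: (s*d_in, d_out); returns (n, d_out)."""
    n, d_in = X.shape
    pad = s // 2
    Xp = np.vstack([np.zeros((pad, d_in)), X, np.zeros((pad, d_in))])
    return np.stack([Xp[j:j + s].ravel() @ W for j in range(n)])

rng = np.random.default_rng(0)
n, d_e, d_c, d_r, s = 10, 6, 5, 4, 3
X = rng.standard_normal((n, d_e))                                       # note embeddings

Xk = np.tanh(conv1d_same(X, rng.standard_normal((s * d_e, d_c)), s))    # Eq. (2)
Xk1 = np.tanh(conv1d_same(Xk, rng.standard_normal((s * d_c, d_r)), s))  # Eq. (3), first conv
Xk2 = conv1d_same(Xk1, rng.standard_normal((s * d_r, d_r)), s)          # second conv
Xk3 = conv1d_same(Xk, rng.standard_normal((1 * d_c, d_r)), 1)           # 1x1 shortcut
Xres = np.tanh(Xk2 + Xk3)                                               # residual output
```

Note how the 1x1 shortcut projects $X_k$ from $d_c$ to $d_r$ so that the addition in $X^{res}_k$ is dimensionally consistent; the full model runs $m$ such branches and concatenates them as in Equation (4).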
The co-graph is represented by $G^t = (L^t, E^t)$, where $L^t$ and $E^t$ indicate the diagnosis code set and edge set of level-$t$, respectively. For any diagnosis code $l^t_i$, if there is another code $l^t_j$ that co-appears with it in the EHR data, there is an edge $e(l^t_i, l^t_j)$ between them, whose weight is calculated as follows:

$$e(l^t_i, l^t_j) = \frac{\mathrm{count}(l^t_i, l^t_j)}{\sum_{l^t_k \in L^t} \mathrm{count}(l^t_i, l^t_k)}, \quad (5)$$

where $\mathrm{count}(\cdot, \cdot)$ indicates the number of times the two codes co-appear in the whole EHR dataset, which represents prior knowledge. The edge set $E^t$ can then be described as:

$$E^t = \left\{ e(l^t_i, l^t_j) \mid l^t_i, l^t_j \in L^t \right\}. \quad (6)$$
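The normalized co-occurrence weighting of Equation (5) can be sketched as follows; the tiny record set is invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Each record is the set of diagnosis codes of one admission (toy data).
records = [{"486", "518.81"}, {"486", "518.81", "401.9"}, {"486", "401.9"}]

# count(i, j): number of admissions in which codes i and j co-appear.
count = Counter()
for codes in records:
    for a, b in combinations(sorted(codes), 2):
        count[(a, b)] += 1
        count[(b, a)] += 1

def edge_weight(i: str, j: str) -> float:
    """Eq. (5): co-occurrence count normalized over all neighbors of i."""
    total = sum(c for (a, _), c in count.items() if a == i)
    return count[(i, j)] / total if total else 0.0
```

By construction, the outgoing weights of each node sum to 1, so the graph encodes, for each code, a distribution over its comorbid neighbors.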
Co-graph Propagation via GCN. Now we turn to representing the diagnosis codes. First, we obtain the feature matrix $H^{t,(0)} = V^t \in \mathbb{R}^{|L^t| \times d_e}$ of the diagnosis code ontology by Equation (1). For simplicity, we omit the superscript $t$ in the rest of this subsection. Then, we apply the GCN to propagate the representations of the diagnosis codes on the co-graph $G$, which takes the feature matrix $H^{(l)}$ and the matrix $\tilde{A}$ as input and updates the embedding of each code by utilizing the information of adjacent codes:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \quad (7)$$

where $\tilde{A} = A + I$, $A$ is the adjacency matrix of $G$, $I$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $W^{(l)}$ is a layer-specific trainable weight matrix. $\sigma(\cdot)$ denotes an activation function, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. $H^{(l)} \in \mathbb{R}^{|L| \times d_g}$ is the matrix of activations in the $l$-th layer, where $d_g$ indicates the hidden layer size of the GCN. The last hidden layer is used to represent the diagnosis code ontology, i.e., $H^t = H^{t,(l+1)} \in \mathbb{R}^{|L^t| \times d_g}$.

To simulate the gradual progress of human diagnosis from shallow to deep, we propose an inheritance-guided hierarchical joint learning mechanism. Specifically, according to the hierarchical structure of the codes, the patient is diagnosed progressively from coarse-grained to fine-grained.
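The single GCN layer of Equation (7) above can be sketched in numpy; the adjacency matrix, features, and weights are random stand-ins for the co-graph and ontology embeddings:

```python
import numpy as np

def gcn_layer(A, H, W):
    """Eq. (7): one GCN layer with symmetric normalization and ReLU."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                # symmetric, no self-loops
H = rng.standard_normal((5, 8))               # H^(0) = V^t (toy features)
W = rng.standard_normal((8, 6))               # layer weight matrix
H1 = gcn_layer(A, H, W)                       # H^(1), shape (5, 6)
```

Self-loops guarantee every row degree is at least 1, so the normalization never divides by zero even for isolated codes.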
Fig. 4. Hierarchical Prediction Module.
Fig. 4 shows the core module of the HPL, the Hierarchical Prediction Module (HPM). Specifically, the HPM is mainly composed of three parts: the Multi Attention Unit (MAU), the Code Predicting Unit (CPU), and the Dependency Passing Unit (DPU). For level-$t$, the input of the HPM includes three parts, i.e., the clinical note's representation $X^{res}$, the medical ontology representations $H^t$, and the dependency information $c^{t-1}$ of the previous level:

$$R^t = \mathrm{MAU}(X^{res}, H^t), \quad \tilde{Y}^t = \mathrm{CPU}(c^{t-1}, R^t), \quad c^t = \mathrm{DPU}(c^{t-1}, \tilde{Y}^t). \quad (8)$$

We first utilize the MAU to obtain the correlation representation $R^t$ between the clinical note and the medical ontology. Next, the CPU assigns the diagnosis codes $\tilde{Y}^t$ to the patient based on $R^t$ and $c^{t-1}$. Finally, the DPU generates the level dependency information $c^t$ for the next level based on the previous level's memory and the current level's assignment results. Note that we set $c^0$ to 0, since the first level has no previous level's information. Next, we introduce each unit of the HPM at level-$t$.
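The level-by-level control flow of Equation (8) can be sketched as follows; the three callables are placeholders standing in for the MAU, CPU, and DPU defined in the following subsections:

```python
def hierarchical_predict(X_res, H_levels, mau, cpu, dpu, c0):
    """Run the HPM over levels 1..T (Eq. (8)).
    H_levels: ontology representations H^1..H^T; returns Y~^1..Y~^T."""
    c, outputs = c0, []
    for H_t in H_levels:          # level-1 ... level-T
        R_t = mau(X_res, H_t)     # note/ontology interaction (MAU)
        Y_t = cpu(c, R_t)         # code assignment at this level (CPU)
        c = dpu(c, Y_t)           # dependency passed to the next level (DPU)
        outputs.append(Y_t)
    return outputs
```

The point of the loop is that each level's prediction is conditioned on the dependency vector inherited from the level above, which is how high-level results guide fine-grained assignment.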
Multi Attention Unit. By the operations above, we can obtain the clinical note representation $X^{res}$ and the medical ontology representations $H^t$. Intuitively, the patient's clinical note is composed of a large number of lengthy text descriptions, and different codes may focus on different aspects of the document. Therefore, for level-$t$, we need $|L^t|$ aspects that focus on different codes to represent the overall semantics of the whole clinical note. Next, we introduce the two attention mechanisms we use.
Ontology Guided Attention. For some diagnosis codes that are difficult to predict using only the clinical text, we can improve the prediction through interaction between the clinical note and the medical ontology. First, we pass the document feature matrix $X^{res}$ through a simple feed-forward neural network:

$$O'_t = \tanh\left(W'_t \cdot (X^{res})^T\right), \quad (9)$$

where $W'_t \in \mathbb{R}^{d_g \times d_{res}}$ is the transform matrix, $d_g$ is consistent with the column dimension of $H^t$, and $O'_t \in \mathbb{R}^{d_g \times n}$ is the intermediate result. Then, for each code $l^t \in L^t$, we generate the attention vector guided by the ontology:

$$\alpha_{l^t} = \mathrm{softmax}\left(h_{l^t} \cdot O'_t\right), \quad (10)$$

where $h_{l^t} \in H^t$ is the feature vector of label $l^t$, and $\mathrm{softmax}(\cdot)$ is the normalized exponential function applied row-wise. The attention $\alpha_{l^t} \in \mathbb{R}^{1 \times n}$ is then used to compute a vector representation for each label:

$$x^{att'}_i = \alpha_{l^t} \cdot X^{res}. \quad (11)$$

Finally, we concatenate the $x^{att'}_i$ ($i = 1, \ldots, |L^t|$) to obtain the ontology-guided document representation, denoted as $X^{att'}_t = [x^{att'}_1, x^{att'}_2, \ldots, x^{att'}_{|L^t|}] \in \mathbb{R}^{|L^t| \times d_{res}}$.
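Equations (9)-(11) can be sketched in numpy, with random tensors standing in for the trained parameters and features:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_res, d_g, L_t = 12, 7, 5, 3
X_res = rng.standard_normal((n, d_res))      # clinical note representation
H_t = rng.standard_normal((L_t, d_g))        # ontology representations H^t
W_prime = rng.standard_normal((d_g, d_res))  # W'_t

O_prime = np.tanh(W_prime @ X_res.T)         # Eq. (9), shape (d_g, n)
alpha = softmax(H_t @ O_prime)               # Eq. (10), one row per code
X_att = alpha @ X_res                        # Eq. (11), shape (|L^t|, d_res)
```

Each row of `alpha` is a distribution over the $n$ tokens, so each code pools the note into its own $d_{res}$-dimensional view.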
Code Specific Attention. Similar to ontology-guided attention, the code-specific attention is formalized as:

$$O''_t = \tanh\left(W''_t \cdot (X^{res})^T\right), \quad A''_t = \mathrm{softmax}\left(U''_t \cdot O''_t\right), \quad X^{att''}_t = A''_t \cdot X^{res}, \quad (12)$$

where $W''_t \in \mathbb{R}^{d_a \times d_{res}}$ is the intermediate parameter matrix, $d_a$ is a hyperparameter, $O''_t \in \mathbb{R}^{d_a \times n}$ is the intermediate result matrix, and $U''_t \in \mathbb{R}^{|L^t| \times d_a}$ is the code-specific attention parameter matrix. Finally, $X^{att''}_t \in \mathbb{R}^{|L^t| \times d_{res}}$ denotes the code-specific document representation.

With the above description, we take $R^t = \mathrm{concat}(X^{att'}_t, X^{att''}_t) \in \mathbb{R}^{|L^t| \times 2d_{res}}$ as the output of the MAU.
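Analogously, Equation (12) can be sketched by replacing the ontology query with a learned per-code parameter matrix; all tensors are random stand-ins:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d_res, d_a, L_t = 12, 7, 5, 3
X_res = rng.standard_normal((n, d_res))      # clinical note representation
W2 = rng.standard_normal((d_a, d_res))       # W''_t
U2 = rng.standard_normal((L_t, d_a))         # U''_t, learned per-code queries

O2 = np.tanh(W2 @ X_res.T)                   # (d_a, n)
A2 = softmax(U2 @ O2)                        # (|L^t|, n)
X_att2 = A2 @ X_res                          # (|L^t|, d_res)
```

The only structural difference from the ontology-guided variant is that the query rows come from a trained parameter matrix rather than from $H^t$; concatenating the two outputs column-wise yields $R^t$.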
Code Predicting Unit. For level-$t$, we combine the result $R^t$ of the MAU with the inherited information $c^{t-1}$ of the previous level to assign diagnosis codes to the patient. Specifically, the CPU uses a linear layer followed by a sigmoid transformation for each code:

$$X^{cls}_t = \mathrm{concat}\left(\mathrm{broadcast}(c^{t-1}), R^t\right), \quad \tilde{Y}^t = \sigma\left(X^{cls}_t \cdot W^t_y\right), \quad (13)$$

where $\mathrm{broadcast}(\cdot)$ makes matrices with different shapes compatible for arithmetic operations, $\sigma(\cdot)$ denotes an activation function, here $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$, $W^t_y \in \mathbb{R}^{(2d_{res} + d^{t-1}_c) \times 1}$ is the parameter of the CPU, and $\tilde{Y}^t \in \mathbb{R}^{|L^t| \times 1}$ is the prediction result of level-$t$.
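Equation (13) can be sketched with toy dimensions; $R^t$, $c^{t-1}$, and $W^t_y$ are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
L_t, d_res2, d_c_prev = 4, 10, 3                 # d_res2 plays 2*d_res here
R_t = rng.standard_normal((L_t, d_res2))         # MAU output
c_prev = rng.standard_normal(d_c_prev)           # dependency c^{t-1}

# Broadcast c^{t-1} to every code row, then concatenate with R^t.
X_cls = np.hstack([np.broadcast_to(c_prev, (L_t, d_c_prev)), R_t])
W_y = rng.standard_normal((d_res2 + d_c_prev, 1))
Y_t = 1.0 / (1.0 + np.exp(-(X_cls @ W_y)))       # sigmoid, per-code probability
```

Each entry of `Y_t` is an independent probability in (0, 1), matching the multi-label formulation: a code is assigned when its probability exceeds a decision threshold.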
Dependency Passing Unit. We aim to preserve important information while reducing the harm caused by error propagation from the previous level. Therefore, we employ the combination of a linear layer and a sigmoid function to imitate a gating mechanism that filters and integrates information as follows:

$$Z = \mathrm{concat}\left((\tilde{Y}^t)^T, c^{t-1}\right), \quad c^t = \sigma\left(Z \cdot W^t_{dpu}\right), \quad (14)$$

where $Z \in \mathbb{R}^{1 \times (|L^t| + d^{t-1}_c)}$ and $W^t_{dpu} \in \mathbb{R}^{(|L^t| + d^{t-1}_c) \times d^t_c}$ is the parameter matrix. We thus obtain the inter-level dependence $c^t \in \mathbb{R}^{1 \times d^t_c}$ based on the previous level's memory information and the prediction results of the current level.

For training, we combine the multi-label binary cross-entropy of all levels as the loss:

$$\mathrm{loss} = \sum_{t=1}^{T} \mathrm{loss}_t = \sum_{t=1}^{T} \sum_{i=1}^{|L^t|} \left[-y_i \log(\tilde{y}_i) - (1 - y_i) \log(1 - \tilde{y}_i)\right], \quad \tilde{y}_i \in \tilde{Y}^t, \quad (15)$$

where $\mathrm{loss}_t$ indicates the loss function of level-$t$.

In this paper, we conduct experiments on a real-world dataset, MIMIC-III, which is widely used in clinical automatic diagnosis. Following previous studies [16,9], we use the discharge summaries as the model's input and conduct experiments with both the full code set and the top 50 most common codes. Specifically, the MIMIC-III full setting includes 8,925 codes and 47,719, 1,631, and 3,372 discharge summaries for training, validation, and testing, respectively. The MIMIC-III top-50 setting includes 8,067, 1,574, and 1,730 discharge summaries for training, validation, and testing, respectively. In addition, we expand the codes from fine to coarse according to the hierarchical inheritance structure of ICD, because the EHR data only contain the finest-grained codes (i.e., level-4 in Table 1). The specific statistics are shown in Table 1. The evaluation metrics used in the experiments are Precision@K (K = 5, 8, and 15), Macro-F1, Micro-F1, Macro-AUC, and Micro-AUC.
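The level-summed binary cross-entropy of Equation (15) above can be sketched as follows; the per-level targets and predictions are toy arrays:

```python
import numpy as np

def bce(y, y_hat, eps=1e-9):
    """Multi-label binary cross-entropy for one level (inner sum of Eq. (15))."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum())

# Two hierarchy levels: coarse level with 2 codes, fine level with 4 codes.
levels_true = [np.array([1.0, 0.0]), np.array([1.0, 0.0, 0.0, 1.0])]
levels_pred = [np.array([0.9, 0.2]), np.array([0.8, 0.1, 0.3, 0.7])]

loss = sum(bce(y, p) for y, p in zip(levels_true, levels_pred))  # outer sum over t
```

Summing over levels means the gradient trains all hierarchical predictions jointly, rather than only the finest-grained level.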
Table 1.
The statistics of hierarchical levels.
We utilize PyTorch [17] to implement the IHCE model and train it on a server with 4 × V100 GPUs. For training, we use AdamW [13] and set the learning rate and weight decay to 0.0001 and 0.00005, respectively. We set the dropout probability to 0.4 and the batch size to 16. We also apply an early-stopping mechanism, in which training stops if the Micro-F1 score on the validation set does not improve for 10 continuous epochs. Since our model has a number of hyperparameters, it is infeasible to search optimal values for all of them. We keep the hyperparameters of the Feature Extraction Module consistent with Li et al. [9]. Specifically, the word embedding dimension is $d_e = 100$, the number of convolution kernels $m$ is 6, the sizes of the convolution kernels $s_1, s_2, \ldots, s_m$ are set to 3, 5, 9, 15, 19, and 25, $d_c = d_e$, and $d_r = 50$. Besides, we pre-train word embeddings on all the text in the training set using word2vec [15] as implemented by gensim (https://radimrehurek.com/gensim/). The maximum length of a token sequence is 2,500, and longer sequences are truncated. For the remaining parameters, we use grid search to find the optimal hyperparameters. Specifically, we set the number of hidden layers to 1 and the hidden layer size $d_g = 300$ for the GCN. In addition, we set $d_a = 300$ for the attention dimension and $d^t_c = 500$ ($t = 1, 2, \ldots, T-1$) for all DPUs' parameter dimensions.
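The early-stopping rule described above can be sketched as follows; the score sequence is illustrative:

```python
def stop_epoch(scores, patience=10):
    """Return the epoch index at which training stops: the first epoch whose
    validation score has not exceeded the best for `patience` epochs in a row,
    or the last epoch if the rule never triggers."""
    best, since_best = float("-inf"), 0
    for epoch, s in enumerate(scores):
        if s > best:
            best, since_best = s, 0
        else:
            since_best += 1
        if since_best >= patience:
            return epoch
    return len(scores) - 1
```

With `patience=10` and the validation Micro-F1 as the score, this matches the stopping criterion stated in the text.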
We compare IHCE with the following baselines, including machine learning and deep learning models:

– LR: a bag-of-words logistic regression model.
– H-SVM [19]: a hierarchical SVM algorithm that works from the root to the leaf nodes by utilizing the hierarchical structure of diagnosis codes.
– Bi-GRU [16]: employs bidirectional gated recurrent units to learn the clinical note's representation for the automatic diagnosis task.
– C-MemNN [20]: combines a memory network with iteratively compressed memory representations to improve diagnosis accuracy.
– C-LSTM-Att [21]: uses an LSTM-based language model to generate clinical note and diagnosis code representations, as well as an attention mechanism to resolve the mismatch between notes and codes.
– LEAM [23]: proposed for text classification; it projects labels and words into the same embedding space and uses cosine similarity to predict the labels of text.
– HARNN [5]: initially used for multi-label text classification, considering the hierarchy of categories; we apply it to automatic diagnosis.
– CNN [16]: uses a single-layer convolutional neural network and a max-pooling layer for the automatic diagnosis task.
– CAML and DR-CAML [16]: assign diagnosis codes based on clinical text by using a CNN to aggregate information across the clinical note and an attention mechanism to select the most relevant segment for each possible code; DR-CAML further uses the code text descriptions as regularization.
– MultiResCNN [9]: utilizes multi-filter convolutional neural networks and residual networks for automatic diagnosis; the SOTA model on MIMIC-III.
In this section, we compare IHCE with existing works for clinical automatic diagnosis. Table 2 shows the overall performance on the MIMIC-III full setting and the MIMIC-III top-50 setting. $T = 3$ means that our experiment is based on the last three levels (i.e., level-2 to level-4 in Table 1) of the hierarchy. Our model IHCE surpasses all baselines in both settings. The results indicate that IHCE can effectively perform clinical automatic diagnosis by exploiting the hierarchy and co-occurrence structure of the medical ontology together with the attention mechanisms. The specific analysis is as follows:

(1) In the MIMIC-III full setting, compared with the SOTA method MultiResCNN, IHCE improves Macro-AUC, Macro-F1, and Micro-F1 by 2.1%, 22.4%, and 3.8%, respectively. It is worth noting that all models have low Macro-F1 scores on the MIMIC-III full setting because the diagnosis code space is very large and the distribution is extremely unbalanced. Nevertheless, our model achieves substantial improvements in this metric compared to CAML and MultiResCNN. The reason is that IHCE considers the hierarchical inheritance structure and dependencies, so it can assist the processing of low-frequency codes based on high-level prediction results. Similarly, we observe that H-SVM, which uses a hierarchical structure, outperforms Bi-GRU, which does not, in terms of Micro-F1. However, the performance

Table 2.
Overall performance on MIMIC-III, where "-" means that the baseline did not report the result of the corresponding metric.
Models | MIMIC-III full (Macro-AUC / Micro-AUC / Macro-F1 / Micro-F1 / P@8 / P@15) | MIMIC-III top-50 (Macro-AUC / Micro-AUC / Macro-F1 / Micro-F1 / P@5)
LR | 56.1 / 93.7 / 1.1 / 27.2 / 54.2 / 41.1 | 82.9 / 86.4 / 47.7 / 53.3 / 54.6
H-SVM | - / - / - / 44.1 / - / - | - / - / - / - / -
C-MemNN | - / - / - / - / - / - | 83.3 / - / - / - / 42.0
C-LSTM-Att | - / - / - / - / - / - | - / 90.0 / - / 53.2 / -
HARNN | - / - / - / 40.5 / - / - | - / - / - / - / -
BiGRU | 82.2 / 97.1 / 3.8 / 41.7 / 58.5 / 44.5 | 82.8 / 86.8 / 48.4 / 54.9 / 59.1
LEAM | - / - / - / - / - / - | 88.1 / 91.2 / 54.0 / 61.9 / 61.2
CNN | 80.6 / 96.9 / 4.2 / 41.9 / 58.1 / 44.3 | 87.6 / 90.7 / 57.6 / 62.5 / 62.0
CAML | 89.5 / 98.6 / 8.8 / 53.9 / 70.9 / 56.1 | 87.5 / 90.9 / 53.2 / 61.4 / 60.9
DR-CAML | 89.7 / 98.5 / 8.6 / 52.9 / 69.0 / 54.8 | 88.4 / 91.6 / 57.6 / 63.3 / 61.8
MultiResCNN | 91.0 / 98.6 / 8.5 / 55.2 / 73.4 / 58.4 | 89.9 / 92.8 / 60.6 / 67.0 / 64.1
IHCE (T = 3)

of H-SVM is lower than that of CAML and MultiResCNN, because CAML and MultiResCNN utilize an attention mechanism to improve the ability to retrieve critical information. Furthermore, compared to CAML and MultiResCNN, our model has multiple attention mechanisms, so it has more robust key-information retrieval capabilities and surpasses them on all metrics.

(2) In the MIMIC-III top-50 setting, compared with the SOTA method MultiResCNN, IHCE improves Macro-F1 and Micro-F1 by 6.8% and 3.9%, respectively. Although there are only 50 diagnosis codes in the MIMIC-III top-50 setting, it still shows a slight long-tail effect. IHCE has a significant improvement on Macro-F1, indicating that our model can employ the hierarchical structure to alleviate this problem. It is worth noting that even though DR-CAML utilizes code descriptions as regularization to assist in the assignment of diagnosis codes that are difficult to predict, its effect is still limited compared to CNN. In contrast, IHCE better addresses this problem by utilizing the co-occurrence structure between codes.

In this section, we perform ablation studies to verify the effectiveness of each component of IHCE. The specific results are shown in Table 3. We observe that removing each component causes F1 to decrease, which illustrates the effectiveness of each component of our model. (1)
HPL’s effectiveness:
After removing the HPL module, the macro-averaged metrics drop significantly, indicating that the inheritance-guided hierarchical assignment mechanism introduced by IHCE plays a significant role in mitigating the long-tail effect. (2)
ORL’s effectiveness:
After ORL is removed, the overall performance of IHCE declines because the model can no longer capture disease co-occurrence relationships. This ability is beneficial for assigning diseases for which it is hard to find direct textual clues in the clinical note. (3)
Attention mechanism’s effectiveness:
We retain only the Code-Specific Attention module, which extends the attention mechanism in MultiResCNN and already improves almost all metrics. This shows that our attention mechanism can better extract essential information and prevents the search for code-related evidence from becoming a needle-in-a-haystack problem.
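The per-code attention that CAML introduced, and that a multi-attention design of this kind extends, can be sketched as follows. The array sizes and the reuse of the code query vectors as output vectors are simplifications for illustration, not the exact IHCE architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_words, d, n_codes = 120, 64, 8          # note length, hidden size, #codes (toy)

H = rng.normal(size=(n_words, d))         # encoder outputs, one vector per word
U = rng.normal(size=(n_codes, d))         # one learnable query vector per code

# Each code attends over the whole note, so the evidence for different
# codes can come from different (possibly distant) spans of the text.
alpha = softmax(U @ H.T, axis=1)          # (n_codes, n_words) attention weights
V = alpha @ H                             # (n_codes, d) code-specific summaries

# Per-code logits via a per-code output vector (reusing U here for brevity).
logits = (V * U).sum(axis=1)
probs = 1 / (1 + np.exp(-logits))         # independent sigmoid per code
print(probs.shape)                        # (8,)
```

Because every code has its own attention distribution, key information scattered across a lengthy clinical note is summarized separately for each candidate code rather than compressed into a single document vector.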
Table 3.
Ablation study results, where “w/o” indicates without.
Models               MIMIC-III full                  MIMIC-III top-50
                     Macro-AUC  Macro-F1  Micro-F1   Macro-AUC  Macro-F1  Micro-F1
MultiResCNN (SOTA)   91.0       8.5       55.2       89.9       60.6      67.0
w/o ORL&HPL          …
w/o HPL              …
w/o ORL              …
IHCE (T=3)           …
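The co-occurrence modeling that the "w/o ORL" variant removes can be approximated by a single graph-convolution step over a code co-occurrence adjacency matrix, using the standard renormalization of Kipf and Welling; the adjacency matrix and dimensions below are toy assumptions, not the actual ICD co-occurrence graph:

```python
import numpy as np

rng = np.random.default_rng(0)
n_codes, d_in, d_out = 5, 16, 16

# Symmetric co-occurrence adjacency between diagnosis codes (toy example).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

# Renormalization trick: A_hat = D^{-1/2} (A + I) D^{-1/2}
A_tilde = A + np.eye(n_codes)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

X = rng.normal(size=(n_codes, d_in))      # initial code embeddings
W = rng.normal(size=(d_in, d_out))        # layer weights

# One propagation step: each code's representation mixes in its
# co-occurring neighbours, so correlated codes get correlated embeddings.
H = np.maximum(A_hat @ X @ W, 0.0)        # ReLU(A_hat X W)
print(H.shape)                            # (5, 16)
```

Propagating embeddings along the co-occurrence graph is what lets the model assign a code even when its direct textual clues are weak, provided strongly co-occurring codes are well supported by the note.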
In the clinical automatic diagnosis task, it is important to assign the diagnosis codes of the last (deepest) level to the patient. It is also essential to evaluate the performance at different levels because, in some cases, a different granularity of codes may be required. Therefore, we compared the performance of IHCE and
Fig. 5. Performance at different levels in the hierarchy: (a) MIMIC-III top-50; (b) MIMIC-III full.
IHCE-DPU at each hierarchical level. Note that this comparison is based on T = 3. IHCE-DPU ignores the dependency between the levels by removing the DPU from the HPM. In Fig. 5, we can see that IHCE performs better than IHCE-DPU at almost all levels. Moreover, the performance on all metrics tends to decrease as the hierarchy deepens, and the trend on Macro-F1 in the MIMIC-III full setting is the most obvious. The reason is that as the level deepens, the number of codes at that level increases rapidly (e.g., the MIMIC-III full setting has 5,125 and 8,925 unique codes at level-3 and level-4, respectively, as shown in Table 1). Moreover, we can notice that IHCE reduces this negative effect compared with IHCE-DPU by modeling the dependency among the hierarchical levels.

In this section, we turn to the effect of the number of hierarchical levels, i.e., T. To that end, a series of experiments are conducted to evaluate the effectiveness under different settings. Specifically, T = n means choosing the last n levels in Table 1; for example, T = 2 means that we choose level-3 and level-4.

Fig. 6. Performance by varying the number of hierarchical levels: (a) MIMIC-III top-50; (b) MIMIC-III full.

From Fig. 6, we can conclude that the models that consider the hierarchical structure perform much better than models that do not. The performance rises as the number of levels T increases because high-level information has a guiding effect on the lower levels. However, the performance decreases when T increases further. The reason is that when the numbers of codes at different levels are not of the same order of magnitude, errors in high-level results still seriously affect the lower levels, although the DPU has a mitigating effect. Specifically, for the MIMIC-III full setting, when T = 4 the model extends to level-1, which has only 199 diagnosis codes and is not of the same order of magnitude as the other levels. For the MIMIC-III top-50 setting, the magnitudes of the levels do not differ much, so the impact of this error is also reduced.

In this paper, we proposed a novel Inheritance-guided Hierarchical Assignment with Co-occurrence-based Enhancement (IHCE) framework for clinical automatic diagnosis, which jointly exploits the code hierarchy and code co-occurrence. We utilized a GCN to capture the correlations within the medical ontology. Moreover, we proposed a hierarchical joint prediction strategy based on the attention mechanism. Experimental results on a real medical dataset show that our model achieves state-of-the-art performance with substantial improvements across different evaluation metrics. We believe that our method can also be applied to other tasks that rely on hierarchical structure and label co-occurrence, such as hierarchical multi-label classification.
Acknowledgements
This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2018YFB1402600), the National Natural Science Foundation of China (Grant No. 62072423), and the Key Research and Development Program of Anhui Province (No. 1804b06020377).
References
1. Bai, T., Vucetic, S.: Improving medical code prediction from clinical text via incorporating online knowledge sources. In: The World Wide Web Conference. pp. 72–82 (2019)
2. Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., Sima'an, K.: Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675 (2017)
3. Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes: a case study on ICD code assignment. arXiv preprint arXiv:1709.09587 (2017)
4. Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C., Corrado, G., Thrun, S., Dean, J.: A guide to deep learning in healthcare. Nature Medicine (1), 24–29 (2019)
5. Huang, W., Chen, E., Liu, Q., Chen, Y., Huang, Z., Liu, Y., Zhao, Z., Zhang, D., Wang, S.: Hierarchical multi-label text classification: An attention-based recurrent network approach. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. pp. 1051–1060 (2019)
6. Johnson, A.E., Pollard, T.J., Shen, L., Li-Wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific Data (1), 1–9 (2016)
7. Kavuluru, R., Rios, A., Lu, Y.: An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artificial Intelligence in Medicine (2), 155–166 (2015)
8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
9. Li, F., Yu, H.: ICD coding from clinical text using multi-filter residual convolutional neural network. In: AAAI. pp. 8180–8187 (2020)
10. Li, S., Zhou, J., Xu, T., Liu, H., Lu, X., Xiong, H.: Competitive analysis for points of interest. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1265–1274 (2020)
11. Liu, N., Zhang, W., Li, X., Yuan, H., Wang, J.: Coupled graph convolutional neural networks for text-oriented clinical diagnosis inference. In: International Conference on Database Systems for Advanced Applications. pp. 369–385. Springer (2020)
12. Liu, Y., Li, Z., Huang, W., Xu, T., Chen, E.H.: Exploiting structural and temporal influence for dynamic social-aware recommendation. Journal of Computer Science and Technology, 281–294 (2020)
13. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
14. Ma, F., You, Q., Xiao, H., Chitta, R., Zhou, J., Gao, J.: KAME: Knowledge-based attention model for diagnosis prediction in healthcare. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. pp. 743–752 (2018)
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
16. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable prediction of medical codes from clinical text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 1101–1111 (2018)
17. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8026–8037 (2019)
18. Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song, Y., Yang, Q.: Large-scale hierarchical text classification with recursively regularized deep graph-CNN. In: Proceedings of the 2018 World Wide Web Conference. pp. 1063–1072 (2018)
19. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association (2), 231–237 (2014)
20. Prakash, A., Zhao, S., Hasan, S.A., Datla, V., Lee, K., Qadir, A., Liu, J., Farri, O.: Condensed memory networks for clinical diagnostic inferencing. arXiv preprint arXiv:1612.01848 (2016)
21. Shi, H., Xie, P., Hu, Z., Zhang, M., Xing, E.P.: Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075 (2017)
22. Singh, H., Schiff, G.D., Graber, M.L., Onakpoya, I., Thompson, M.J.: The global burden of diagnostic errors in primary care. BMJ Quality & Safety (6), 484–494 (2017)
23. Wang, G., Li, C., Wang, W., Zhang, Y., Shen, D., Zhang, X., Henao, R., Carin, L.: Joint embedding of words and labels for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2321–2331 (2018)
24. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 7370–7377 (2019)
25. Du, Y., Xu, T., Ma, J., Chen, E., Zheng, Y., T.L., Tan, G.: An automatic ICD coding method for clinical records based on deep neural network. Big Data Research 6