Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels
Hieu H. Pham, Tung T. Le, Dat Q. Tran, Dat T. Ngo, Ha Q. Nguyen
Medical Imaging Group, Vingroup Big Data Institute (VinBDI), 458 Minh Khai Street, Hai Ba Trung, Hanoi, Vietnam
Abstract
Chest radiography is one of the most common types of diagnostic radiology exams and is critical for the screening and diagnosis of many different thoracic diseases. Specialized algorithms have been developed to detect several specific pathologies, such as lung nodules or lung cancer. However, accurately detecting the presence of multiple diseases from chest X-rays (CXRs) is still a challenging task. This paper presents a supervised multi-label classification framework based on deep convolutional neural networks (CNNs) for predicting the presence of 14 common thoracic diseases and observations. We tackle this problem by training state-of-the-art CNNs that exploit hierarchical dependencies among abnormality labels. We also propose to use the label smoothing technique for better handling of uncertain samples, which occupy a significant portion of many CXR datasets. Our model is trained on over 200,000 CXRs of the recently released CheXpert dataset and achieves a mean area under the curve (AUC) of 0.940 in predicting 5 selected pathologies from the validation set, the highest AUC score reported to date. The proposed method is also evaluated on the independent test set of the CheXpert competition, which is composed of 500 CXR studies annotated by a panel of 5 experienced radiologists. Its performance is on average better than that of 2.6 out of 3 other individual radiologists, with a mean AUC of 0.930, which ranks first on the CheXpert leaderboard at the time of writing this paper.
Keywords:
Chest X-ray, CheXpert, Multi-label classification, Uncertainty label, Label smoothing, Label dependency, Hierarchical learning

∗ Corresponding author: [email protected] (Hieu H. Pham)

Preprint submitted to Neurocomputing, June 15, 2020
1. Introduction
Chest X-ray (CXR) is one of the most common radiological exams for diagnosing many different diseases of the lungs and heart, with millions of scans performed globally every year [1, 2]. Many of these diseases, like Pneumothorax [3], can be deadly if not diagnosed quickly and accurately enough. A computer-aided diagnosis (CAD) system that is able to correctly diagnose the most common observations from CXRs would significantly benefit many clinical practices. In this work, we investigate the problem of multi-label classification for CXRs using deep convolutional neural networks (CNNs).

There has been a recent effort to harness advances in machine learning, especially deep learning, to build a new generation of CAD systems for the classification and localization of common thoracic diseases from CXR images [4]. Several motivations are behind this transformation. First, interpreting CXRs to accurately diagnose pathologies is difficult. Even well-trained radiologists can easily make mistakes due to the challenge of distinguishing different kinds of pathologies, many of which have similar visual features [5]. Therefore, a high-precision method for common thoracic disease classification and localization can be used as a second reader to support the decision making of radiologists and to help reduce diagnostic errors. It also addresses the lack of diagnostic expertise in areas where radiologists are scarce or unavailable [6, 7]. Second, such a system can be used as a screening tool that helps reduce patients' waiting time in hospitals and allows care providers to respond to emergency situations sooner or to speed up a diagnostic imaging workflow [8]. Third, deep neural networks, in particular deep CNNs, have shown remarkable performance in various medical imaging applications [9], including the CXR interpretation task [10, 11, 12, 13].

Several deep learning-based approaches have been proposed for classifying lung diseases and have proven that they can achieve human-level performance [10, 14]. Almost all of these approaches, however, aim to detect specific diseases such as pneumonia [15], tuberculosis [16, 17], or lung cancer [18].
Meanwhile, building a unified deep learning framework for accurately detecting the presence of multiple common thoracic diseases from CXRs remains a difficult task that requires much research effort. In particular, we recognize that standard multi-label classifiers often ignore domain knowledge. For example, in the case of CXR data, how to leverage clinical taxonomies of disease patterns and how to handle uncertainty labels are still open questions that have not received much research attention. This observation motivates us to build and optimize a predictive model based on deep CNNs for CXR interpretation in which dependencies among labels and uncertainty information are taken into account during both the training and inference stages. Specifically, we develop a deep learning-based approach that puts together the ideas of conditional training [19] and label smoothing [20] into a novel training procedure for classifying 14 common lung diseases and observations. We trained our system on more than 200,000 CXRs of the CheXpert dataset [21], one of the largest CXR datasets currently available, and evaluated it on the validation set of CheXpert containing 200 studies, which were manually annotated by 3 board-certified radiologists. The proposed method is also tested against the majority vote of 5 radiologists on the hidden test set of the CheXpert competition, which contains 500 studies.

This study makes several contributions. First, we propose a novel training strategy for multi-label CXR classification that incorporates (1) a conditional training process based on a predefined disease hierarchy and (2) a smoothing regularization technique for uncertainty labels. The benefits of these two key factors are empirically demonstrated through our ablation studies. Second, we train a series of state-of-the-art CNNs on frontal-view CXRs of the CheXpert dataset for classifying 14 common thoracic diseases. Our best model, an ensemble of various CNN architectures, achieves the highest area under the ROC curve (AUC) score on both the validation set and the test set of CheXpert at the time of writing. Specifically, on the validation set, it yields an average AUC of 0.940 in predicting 5 selected lung diseases:
Atelectasis (0.909),
Cardiomegaly (0.910), Edema (0.958),
Consolidation (0.957) and
Pleural Effusion (0.964). This model improves on the baseline method reported in [21] by a large margin of 5%. On the independent test set, we obtain a mean AUC of 0.930. More importantly, the proposed deep learning model is on average more accurate than 2.6 out of 3 individual radiologists in predicting the 5 selected thoracic diseases when presented with the same data.

The rest of the paper is organized as follows. Related works on CNNs in medical imaging and the problem of multi-label classification of CXR images are reviewed in Section 2. In Section 3, we present the details of the proposed method, with a focus on how to deal with dependencies among diseases and uncertainty labels. Section 4 provides comprehensive experiments on the CheXpert dataset. Section 5 discusses the experimental results, some key findings, and the limitations of this research. Finally, Section 6 concludes the paper.
2. Related works
Thanks to the increased availability of large-scale, high-quality labeled datasets [22, 21, 23] and high-performing deep network architectures [24, 25, 26, 27], deep learning-based approaches have been able to reach, and even outperform, expert-level performance on many medical image interpretation tasks [10, 12, 11, 28, 29, 16]. Most successful applications of deep neural networks in medical imaging rely on CNNs [30, 31], which utilize convolutions to extract local features of the medical images.

For CXR interpretation, multi-label classification is a common setting in which each training example is associated with possibly more than one label [32, 33]. Due to its important role in medical imaging, a variety of approaches have been proposed in the literature. For instance, Rajpurkar et al. [10] introduced CheXNet, a DenseNet-121 model that was trained on the ChestX-ray14 dataset. Rajpurkar et al. [12] subsequently developed CheXNeXt, an improved version of CheXNet, whose performance is on par with radiologists on a total of 10 pathologies of ChestX-ray14. Yao et al. [34] proposed to combine a CNN encoder with a recurrent neural network (RNN) decoder to learn not only the visual features of the CXRs in ChestX-ray14 but also the dependencies between their labels. Another notable work based on ChestX-ray14 was by Kumar et al. [13], who presented a cascaded deep neural network to improve the performance of the multi-label classification task. Closely related to our paper is the work of Chen et al. [19], in which they proposed to use the conditional training strategy to exploit the hierarchy of lung abnormalities in the PLCO dataset [35].

Our model (Hierarchical-Learning-V1) currently takes the first place in the CheXpert competition. More information can be found at https://stanfordmlgroup.github.io/competitions/chexpert/. Updated on June 15, 2020.
In this method, a DenseNet-121 was first trained on a restricted subset of the data in which all parent nodes in the label hierarchy are positive and then finetuned on the whole data.

Recently, the availability of very large-scale CXR datasets such as CheXpert [21] and MIMIC-CXR [23] provides researchers with an ideal volume of data (224,316 scans in CheXpert and more than 350,000 in MIMIC-CXR) for developing better and more robust supervised learning algorithms. Both of these datasets were automatically labeled by the same report-mining tool with 14 common findings. Irvin et al. [21] proposed to train a 121-layer DenseNet on CheXpert with various approaches for handling the uncertainty labels. In particular, uncertainty labels were either ignored (the U-Ignore approach) or mapped to positive (the U-Ones approach) or negative (the U-Zeros approach). On average, this baseline model outperformed 1.8 out of 3 individual radiologists with an AUC of 0.907 when predicting 5 selected pathologies on a test set of 500 studies. In another work, Rubin et al. [36] introduced DualNet, novel dual convolutional networks that were jointly trained on both the frontal and lateral CXRs of MIMIC-CXR. Experiments showed that DualNet provides improved performance in classifying findings in CXR images when compared to separate (i.e., frontal and lateral) baseline classifiers.

In this paper, we adapt the conditional training approach of [19] to extensively train a series of CNN architectures on the hierarchy of the 14 CheXpert pathologies, which is totally different from that of PLCO. Our approach is significantly different from [34], as we directly exploit a predefined hierarchy of labels instead of learning it from data. Furthermore, unlike previous studies [19, 34], we also propose the use of label smoothing regularization (LSR) [20] to leverage uncertainty labels, which, as experiments will later show, significantly improves upon the uncertainty approaches originally proposed in [21].
3. Proposed Method
In this section, we present the details of the proposed method. We first give a formulation of the multi-label classification for CXRs and the evaluation protocol used in this study (Section 3.1). We then describe a new training procedure that exploits the relationship among diseases for improving model performance (Section 3.2). This section also introduces the way we use LSR to deal with uncertain samples in the training data (Section 3.3).
Our focus in this paper is to develop and evaluate a deep learning-based approach that can learn from hundreds of thousands of CXR images and make accurate diagnoses of 14 common thoracic diseases and observations from unseen samples. These categories include
Enlarged Cardiomediastinum, Cardiomegaly, Lung Opacity, Lung Lesion, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices, and
No Finding. In this multi-label learning scenario, we are given a training set D = {(x^(i), y^(i)); i = 1, ..., N} that contains N CXRs. A significant portion of the dataset comes with uncertainty labels: each input image x^(i) is associated with a label vector y^(i) whose entries y_k^(i) ∈ {0, 1, −1}, where 0, 1, and −1 correspond to negative, positive, and uncertain, respectively. Note that during the training stage, we apply various approaches to replace all uncertainty labels with positive, negative, or a smoothed version of one of these two classes. Meanwhile, the output of the model is a vector of 14 entries, each of which reflects the probability of a specific category being positive. Specifically, we train a CNN, parameterized by weights θ, that maps x^(i) to a prediction ŷ^(i) ∈ [0, 1]^14 such that the cross-entropy loss function is minimized over the training set D. Note that, instead of the softmax function, in multi-label classification the sigmoid activation function

ŷ_k = 1 / (1 + exp(−z_k)),   k = 1, ..., 14,        (1)

is applied to the logits z_k at the last layer of the CNN in order to output each of the 14 probabilities. The loss function is then given by

ℓ(θ) = − Σ_{i=1}^{N} Σ_{k=1}^{14} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ].        (2)

A validation set V = {(x^(j), y^(j)); j = 1, ..., M} containing M CXRs, annotated by a panel of radiologists, is used to evaluate the effectiveness of the proposed method. More specifically, model performance is measured by the AUC scores over 5 observations:
Atelectasis, Cardiomegaly, Consolidation, Edema, and
Pleural Effusion from the validation set of the CheXpert dataset [21], which were selected based on clinical importance and prevalence. Figure 1 shows an illustration of the task we investigate in this paper.
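As a concrete illustration of Eqs. (1) and (2), the sigmoid activation and the multi-label cross-entropy loss can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the clipping constant `eps` is our addition for numerical stability, and the loss is averaged over the batch, a common practical normalization of the sum in Eq. (2).

```python
import numpy as np

def sigmoid(z):
    """Eq. (1): element-wise sigmoid mapping the 14 logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(y_true, y_prob, eps=1e-7):
    """Eq. (2): binary cross-entropy summed over the 14 labels, averaged over the batch."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # guard against log(0)
    per_label = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_label.sum(axis=1).mean()

# Toy batch: 2 images, 14 labels, all logits zero -> every probability is 0.5.
probs = sigmoid(np.zeros((2, 14)))
loss = multilabel_bce(np.zeros((2, 14)), probs)  # 14 * ln(2) ≈ 9.704
```

Each of the 14 outputs is an independent Bernoulli probability, which is why a per-label sigmoid replaces the softmax used in single-label classification.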
Figure 1: Illustration of our classification task, which aims to build a deep learning system for predicting the probability of presence of 14 different pathologies or observations from CXRs. The relationships among labels were proposed by Irvin et al. [21].

In medical imaging, labels are often organized into hierarchies in the form of a tree or a directed acyclic graph (DAG). These hierarchies are constructed by domain experts, e.g., radiologists in the case of CXR data. Diagnoses or observations in CXRs are often conditioned upon their parent labels [37]. This important fact should be leveraged during model training and prediction. Most existing CXR classification approaches, however, treat each label independently and do not take the label structure into account. This group of algorithms is known as flat classification methods [38]. A flat learning model reveals some limitations when applied to hierarchical data, as it fails to model the dependency between diseases. For example, from Figure 1, the presence of Cardiomegaly implies the presence of
Enlarged Cardiomediastinum. Additionally, some labels at the lower levels of the hierarchy, in particular at leaf nodes, have very few positive samples, which makes a flat learning model easily biased toward the negative class.

Another group of algorithms, called hierarchical multi-label classification methods, has been proposed for leveraging the hierarchical relationships among labels in making predictions; this idea has been successfully exploited for text processing [39], visual recognition [40, 41], and genomic analysis [42]. The hierarchies are constructed so that the root nodes correspond to the most general classes (like Opacity) and the leaf nodes correspond to the most specific ones (like Pneumonia). One common approach [19] to exploiting such a hierarchy is to (1) train a classifier on conditional data, ignoring all samples with negative parent-level labels, and then (2) add these samples back to finetune the network on the whole dataset. Importantly, this strategy is not applied to the validation set, since the classifier has been trained on the full dataset during the second phase. Instead, unconditional probabilities should be computed during the inference stage.

We adapt the idea of Chen et al. [19] to the lung disease hierarchy in Figure 1, which was initially introduced in [21]. Presuming the medical validity of the hierarchy, we break the training procedure into two steps. The first step, called conditional training, aims to learn the dependent relationships between parent and child labels and to concentrate on distinguishing lower-level labels, in particular the leaf labels. In this step, a CNN is pretrained on a partial training set containing only examples whose parent labels are all positive, to classify the child labels; this procedure is illustrated in Figure 2. In the second step, transfer learning is exploited.
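The first phase of this procedure amounts to a simple row filter on the label matrix: keep only the samples whose parent-level labels are all positive. A minimal sketch, using a toy two-level hierarchy rather than the full CheXpert tree (column indices and label names here are illustrative):

```python
import numpy as np

# Toy label matrix: column 0 = "Lung Opacity" (parent), 1 = "Edema", 2 = "Pneumonia".
PARENT_COLS = [0]  # columns holding parent-level labels (illustrative)

def conditional_subset(labels, parent_cols=PARENT_COLS):
    """Boolean mask of samples kept in phase 1: all parent labels must be positive."""
    keep = np.ones(len(labels), dtype=bool)
    for c in parent_cols:
        keep &= labels[:, c] == 1
    return keep

labels = np.array([
    [1, 1, 0],  # parent positive -> used for conditional training
    [0, 0, 0],  # parent negative -> excluded in phase 1, added back in phase 2
    [1, 0, 1],  # parent positive -> used
])
mask = conditional_subset(labels)  # [True, False, True]
```

The excluded rows are not discarded: the second phase reintroduces them when the network is finetuned on the full dataset.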
Figure 2: Illustration of the key idea behind the conditional training (left). In this stage, a CNN is trained on a training set where all parent labels (red nodes) are positive, to classify leaf labels (blue nodes), which could be either positive or negative. For example, we train a CNN to classify Edema, Atelectasis, and Pneumonia on training examples where both Lung Opacity and Consolidation are positive (right).

Specifically, we freeze all the layers of the pretrained network except the last fully connected layer and then retrain it on the full dataset. This training stage aims at improving the capacity of the network in predicting parent-level labels, which could also be either positive or negative.

According to the above training strategy, the output of the network for each label can be viewed as the conditional probability that this label is positive given that its parent is positive. During the inference phase, however, all the labels should be unconditionally predicted. Thus, as a simple application of the Bayes rule, the unconditional probability of each label being positive is computed by multiplying all conditional probabilities produced by the CNN along the path from the root node to the current label.

Figure 3: An example of a tree of 4 diseases: A, B, C, and D.

For illustration, assume a tree of 4 diseases A, B, C, and D, as shown in Figure 3, and let A, B, C, and D also denote the corresponding events that these labels are positive. Suppose the tuple of conditional predictions (p(A), p(B|A), p(C|B, A), p(D|B, A)) is already provided by the network. Note that a child label being positive implies that all of its parent labels are positive too. Thus, the unconditional probability of a leaf-node label being positive is identical to the probability that all labels along the path from the leaf node back to the root node are jointly positive. In particular, the unconditional prediction for the presence of C can be computed by

p(C) = p(C, B, A)        (3)
     = p(A) p(B|A) p(C|B, A).        (4)

Similarly,

p(D) = p(D, B, A)        (5)
     = p(A) p(B|A) p(D|B, A).        (6)

It is important to note that the unconditional inference mentioned above ensures that the probability of presence of a child disease is always smaller than the probability of its parent, which is consistent with clinical taxonomies in practice.
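Eqs. (3)-(6) translate directly into code: walk from each node to the root and multiply the conditional outputs. A small sketch with made-up conditional probabilities for the tree of Figure 3:

```python
def unconditional(cond, parent):
    """cond[n]: network output p(n | parent of n positive); parent[n]: parent name or None.
    Returns p(n) for every node by multiplying conditionals along the path to the
    root (Eqs. (3)-(6))."""
    memo = {}
    def prob(n):
        if n not in memo:
            p = cond[n]
            if parent[n] is not None:
                p *= prob(parent[n])  # chain rule: p(n) = p(n | parent) * p(parent)
            memo[n] = p
        return memo[n]
    for n in cond:
        prob(n)
    return memo

# Illustrative conditional predictions for the tree A -> B -> {C, D}.
cond = {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.25}
parent = {"A": None, "B": "A", "C": "B", "D": "B"}
p = unconditional(cond, parent)  # p["C"] = 0.9 * 0.8 * 0.5 = 0.36
```

Because each unconditional probability is a product of numbers in [0, 1], a child's score can never exceed its parent's, matching the taxonomy constraint noted above.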
Another challenging issue in the multi-label classification of CXRs is that we may not have full access to the true labels of all input images in the training dataset. Considerable effort has been devoted to creating large-scale CXR datasets with more reliable ground truth, such as CheXpert [21] and MIMIC-CXR [23]. The labeling of these datasets, however, heavily depends on expert systems (i.e., keyword matching with hard-coded rules), which leave many CXR images with uncertainty labels, mainly due to unavoidable ambiguities in medical reports. Several approaches have been proposed in [21] to deal with these uncertain samples. For example, they can all be ignored (U-Ignore), all mapped to positive (U-Ones), or all mapped to negative (U-Zeros). While U-Ignore could not make use of the full list of labels on the whole dataset, both U-Ones and U-Zeros yielded only a minimal improvement on CheXpert, as reported in [21]. This may be because setting all uncertainty labels to either 1 or 0 inevitably produces many wrong labels, which misguide the model training.

In this paper, we propose to apply a recent advance in machine learning called label smoothing regularization (LSR) [43, 44] for a better handling of uncertain samples. The method has been used effectively [20] to boost the performance of multi-class classification models by smoothing out the label vector of each sample. We adapt this idea of LSR to the binary classification of a CXR into positive/negative for each of the 14 categories. Our main goal is to exploit the large amount of uncertain CXRs and, at the same time, to prevent the model from overconfident predictions on training examples that might contain mislabeled data. Specifically, the U-Ones approach is softened by mapping each uncertainty label (−1) to a random number close to 1. The proposed U-Ones+LSR approach maps the original label y_k^(i) to

ȳ_k^(i) = u,        if y_k^(i) = −1,
          y_k^(i),  otherwise,        (7)

where u ~ U(a, b) is a uniformly distributed random variable between a and b, the hyper-parameters of this approach. Similarly, we propose the U-Zeros+LSR approach, which softens U-Zeros by setting each uncertainty label to a random number u ~ U(a, b) that is close to 0.

4. Experiments

The CheXpert dataset [21] was used to develop and evaluate the proposed method. This is one of the largest public CXR datasets currently available, containing 224,316 X-ray scans of 65,240 patients. The dataset was labeled for the presence of 14 observations, including 12 common thoracic pathologies. Each observation can be assigned to either positive (1), negative (0), or uncertain (−1). The main task on CheXpert is to predict the probability of multiple observations from an input CXR. The predictive models take as input a single-view CXR and output the probability of each of the 14 observations, as shown in Figure 1. The whole dataset is divided into a training set of 223,414 studies, a validation set of 200 studies, and a test set of 500 studies. For the validation set, the ground-truth label of each study is obtained by taking the majority vote amongst the annotations of 3 board-certified radiologists. Meanwhile, each study in the test set is labeled by the consensus of 5 board-certified radiologists. The authors of CheXpert proposed an evaluation protocol over 5 observations:
Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion, which were selected from the validation set based on clinical importance and prevalence. The effectiveness of predictive models is measured by the AUC metric.
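For reference, the AUC used throughout can be computed from scores and binary labels via the Mann-Whitney statistic; a minimal NumPy sketch (equivalent to standard implementations such as scikit-learn's `roc_auc_score` for binary labels; the toy data are illustrative):

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the probability that a randomly chosen positive is scored above a
    randomly chosen negative (ties counted as 1/2), which equals the area under
    the ROC curve."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y_true = np.array([1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.3])
auc = auc_score(y_true, y_prob)  # 5 of the 6 positive/negative pairs are correctly ordered -> 5/6
```

One such score is computed per observation, and the mean over the 5 selected observations is the headline number reported in the tables below.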
The learning performance of deep neural networks on raw CXRs may be affected by irrelevant noisy areas, such as texts or irregular borders. Moreover, we observed a high ratio of CXRs with poor alignment. We therefore propose a series of preprocessing steps to reduce the effect of irrelevant factors and to focus on the lung area. Specifically, all CXRs were first rescaled to a fixed size. A template matching algorithm [45] was then used to find the location of a template chest image in the original images. Finally, the images were normalized using the mean and standard deviation of images from the ImageNet training set [31] in order to reduce source-dependent variation.

4.3. Network architecture and training methodology

The conditional training was performed after applying different approaches for uncertainty labels (i.e.,
U-Ignore, U-Ones, U-Zeros, U-Zeros+LSR, U-Ones+LSR). We used DenseNet-121 [25] as the baseline network architecture for verifying our hypotheses on the conditional training procedure (Section 3.2) and the effect of LSR (Section 3.3). In the training stage, all images were fed into the network at a fixed standard size. The final fully-connected layer is a 14-dimensional dense layer, followed by sigmoid activations applied to each of the outputs to obtain the predicted probabilities of the presence of the 14 pathology classes. We used the Adam optimizer [46] with default parameters β₁ = 0.9, β₂ = 0.999 and a batch size of 32 to find the optimal weights. The learning rate was initialized to a small value and then reduced by a factor of 10 after each epoch during the training phase. The network was initialized with a model pretrained on ImageNet [47] and then trained for 5 epochs on the conditional data, which excludes all examples containing negative parent labels. Next, we added these samples back to the training set and trained the network for 5 more epochs on the full data. During training, our goal is to minimize the binary cross-entropy loss between the ground-truth labels and the labels predicted by the network over the training samples. The proposed deep network was implemented in Python using Keras with TensorFlow as the backend. All experiments were conducted on a Windows 10 machine with a single NVIDIA GeForce RTX 2080 Ti with 11 GB of memory.

We conducted extensive ablation studies to verify the impact of the proposed conditional training procedure and LSR. Specifically, we first independently trained the baseline network with 3 label approaches: U-Ignore, U-Ones, and
U-Zeros. We then fixed the hyper-parameter settings of these runs and performed the conditional training procedure on top of them, resulting in 3 other networks: U-Ignore+CT, U-Ones+CT, and U-Zeros+CT, respectively. Next, the LSR technique was applied to the two label approaches U-Ones and U-Zeros. For U-Ones, all uncertainty labels were mapped to random numbers uniformly distributed in an interval close to 1; for U-Zeros, we labeled uncertain samples with random numbers drawn from an interval close to 0. Both of these intervals were empirically chosen. Finally, both CT and LSR were combined with U-Ones and U-Zeros using the same set of hyper-parameters, resulting in U-Ones+CT+LSR and U-Zeros+CT+LSR, respectively. Note that all of the above experiments were performed with a template matching (TM) algorithm as a preprocessing step. To isolate the effect of TM, we ran an additional experiment for the baseline U-Ignore with TM removed.
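The uncertainty-label handling used in these runs, Eq. (7), can be sketched as follows; the interval endpoints here are placeholders, not the empirically chosen values from the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_uncertain(labels, lo, hi):
    """Eq. (7): map each uncertainty label (-1) to u ~ U(lo, hi); keep 0/1 labels
    unchanged. Choose an interval near 1 for U-Ones+LSR and near 0 for U-Zeros+LSR."""
    out = labels.astype(float)  # copy; original array is left untouched
    mask = out == -1
    out[mask] = rng.uniform(lo, hi, size=mask.sum())
    return out

y = np.array([1, 0, -1, -1])
y_ones_lsr = smooth_uncertain(y, 0.8, 1.0)   # placeholder interval near 1
y_zeros_lsr = smooth_uncertain(y, 0.0, 0.2)  # placeholder interval near 0
```

Because the smoothed targets are no longer exactly 0 or 1, the cross-entropy loss of Eq. (2) never rewards fully saturated predictions on these samples, which is what discourages overconfidence.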
In a multi-label classification setting, it is hard for a single CNN model to obtain high and consistent AUC scores for all disease labels. In fact, the AUC score for each label often varies with the choice of network architecture. In order to achieve a highly accurate classifier, an ensemble technique should be explored. The key idea of ensembling is to rely on the diversity of a set of possibly weak classifiers that can be combined into a stronger classifier. To that end, we trained and evaluated a strong set of different state-of-the-art CNN models on CheXpert. The following six architectures were investigated: DenseNet-121, DenseNet-169, DenseNet-201 [25], Inception-ResNet-v2 [26], Xception [48], and NASNetLarge [27]. The ensemble model was simply obtained by averaging the outputs of all trained networks. In the inference stage, test-time augmentation (TTA) [49] was also applied: for each test CXR, we applied random transformations (amongst horizontal flipping, rotating, scaling, and shearing).

Table 1 provides the AUC scores for all experimental settings we have conducted on the CheXpert validation set. We found that the best performing DenseNet-121 model was trained with the
U-Ones+CT+LSR approach, which obtained an AUC of 0.894 on the validation set. This is a 4% improvement compared to the baseline trained with the U-Ones approach (mean AUC = 0.860). Additionally, the experimental results show that both the proposed conditional training and LSR help boost the model performance. Our final model, an ensemble of six single models, achieved an average AUC of 0.940. As shown in Table 2, this score outperforms all previous state-of-the-art results. Figure 4 plots the ROC curves of the ensemble model for the 5 pathologies on the validation set. Figure 5 illustrates some example predictions by the model during the inference stage.

The effects of using TM and TTA can be seen in Tables 3 and 4. While TM improves the mean AUC of DenseNet-121 (with the U-Ignore approach) by 0.006, removing TTA from the model ensembling only decreases the mean AUC by 0.003. These gaps are marginal and thus empirically confirm that the main source of improvement over previous methods indeed comes from our use of conditional training and LSR.

Table 1: Experimental results on the CheXpert dataset measured by the AUC metric over 200 chest radiographic studies of the validation set. CT and LSR stand for conditional training and label smoothing regularization, respectively. For each label approach, the highest AUC scores are boldfaced.

Method            Atelectasis  Cardiomegaly  Consolidation  Edema  P. Effusion  Mean
U-Ignore
U-Ignore+CT
U-Zeros
U-Zeros+CT
U-Zeros+LSR
U-Zeros+CT+LSR
U-Ones
U-Ones+CT
U-Ones+LSR
U-Ones+CT+LSR

Table 2: Performance comparison using the AUC metric between our ensemble of 6 models and previous works on the CheXpert validation set. The highest AUC scores are boldfaced.
Method Atelectasis Cardiomegaly Consolidation Edema P. Effusion Mean
U-Ignore+LP [50] 0.720 0.870 0.770 0.870 0.900 0.826
U-Ignore+BR [50] 0.720 0.880 0.770 0.870 0.900 0.828
U-Ignore+CC [50] 0.700 0.870 0.740 0.860 0.900 0.814
U-Ignore [21] 0.818 0.828 0.938 0.934 0.928 0.889
U-Zeros [21] 0.811 0.840 0.932 0.929 0.931 0.888
U-Ones [21] 0.858 0.832 0.899 0.941 0.934 0.893
U-MultiClass [21] 0.821 0.854 0.937 0.928 0.936 0.895
U-SelfTrained [21] 0.833 0.831 0.939 0.935 0.932 0.894
Ours 0.909 0.910 0.957 0.958 0.964 0.940
Table 3: AUC improvement on the CheXpert validation set by using TM for the U-Ignore training procedure.

Method        Atelectasis  Cardiomegaly  Consolidation  Edema  P. Effusion  Mean
U-Ignore
U-Ignore+TM
A crucial evaluation of any machine learning-based medical diagnosis system (ML-MDS) is to assess how well the system performs on an independent test set in comparison to human expert-level performance. To this end, we evaluated the proposed method on the hidden test set of CheXpert, which contains 500 CXRs labeled by 8 board-certified radiologists. The annotations of 3 of them were used for benchmarking radiologist performance, and the majority vote of the other 5 served as ground truth. For each of the 3 individual radiologists, the AUC scores for the 5 selected diseases (Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion) were computed against the ground truth to evaluate the radiologists' performance. We then evaluated our ensemble model on the test set and performed an ROC analysis to compare the model's performance to the radiologists'.

Table 4: AUC improvement on the CheXpert validation set by using TTA on top of model ensembling.
Method Atelectasis Cardiomegaly Consolidation Edema P. Effusion Mean
Ensemble with TTA    0.909 0.910 0.957 0.958 0.964 0.940
Ensemble without TTA 0.908 0.906 0.955 0.951 0.958 0.937
Figure 4: ROC curves of our ensemble model for the 5 pathologies on CheXpert validation set.
For more details, the ROC curves produced by the prediction model and the three radiologists' operating points were both plotted. For each disease, whether the model is superior to the radiologists' performance was determined by counting the number of radiologists' operating points lying below the ROC curve. The results show that our deep learning model, when averaged over the 5 diseases, outperforms 2.6 out of 3 radiologists with an AUC of 0.930. This is the best performance on the CheXpert leaderboard to date. (This test was conducted independently with the support of the Stanford Machine Learning Group, as the test set is not released to the public.) The attained AUC score validates the generalization capability of the trained deep learning model on an unseen dataset. Meanwhile, the number of radiologists' operating points lying under the ROC curves indicates that the proposed method is able to reach human expert-level performance, an important step towards the application of an ML-MDS in real-world scenarios.

Figure 5: Visualization of findings by the proposed network during the inference stage.
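The radiologist comparison above boils down to counting operating points that fall below the model's ROC curve. A sketch of that counting step (the curve and the operating points below are made up for illustration, not the paper's measurements):

```python
import numpy as np

def points_below_roc(fpr, tpr, operating_points):
    """Count (FPR, TPR) operating points lying strictly below the ROC curve,
    comparing each point's TPR to the curve's TPR at the same FPR (linear
    interpolation between the curve's sampled points)."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    return sum(1 for f, t in operating_points if t < np.interp(f, fpr, tpr))

# Illustrative ROC curve (must be sorted by FPR) and two radiologist points.
roc_fpr = [0.0, 0.1, 0.3, 1.0]
roc_tpr = [0.0, 0.8, 0.95, 1.0]
radiologists = [(0.15, 0.70), (0.05, 0.85)]
below = points_below_roc(roc_fpr, roc_tpr, radiologists)  # 1
```

Averaging this count over the 5 diseases and the 3 radiologists yields the "outperforms 2.6 out of 3 radiologists" summary reported above.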
5. Discussion
By training a set of strong CNNs on a large-scale dataset, we built a deep learning model that can accurately predict multiple thoracic diseases from CXRs. In particular, we empirically showed a major improvement, in terms of AUC score, from exploiting the dependencies among diseases and from applying the label smoothing technique to uncertain samples. We found that it is especially difficult to obtain a good AUC score for all diseases with a single CNN. It is also observed that the classification performance varies with the network architecture, the rate of positive/negative samples, and the visual features of the lung disease being detected. In this case, an ensemble of multiple deep learning models plays a key role in boosting the generalization of the final model and its performance. Our findings, along with recent publications [10, 11, 12, 13], continue to assert that deep learning algorithms can accurately identify the risk of many thoracic diseases and are able to assist in patient screening, diagnosis, and physician training.

5.2. Limitations
Although a highly accurate performance has been achieved, we acknowledge that the proposed method has some limitations. The conditional training strategy requires a predefined hierarchy of diseases, which is not easy to construct and is usually imperfect. Furthermore, it seems difficult to extend this idea to deeper hierarchies of diseases, for which too many examples have to be excluded from the training set in the first phase. The use of LSR in this paper with heuristically chosen hyper-parameters, while significantly improving the classification performance, is not clearly justified. In addition, TTA has the drawback of increasing inference time.

Other challenges are related to the training data. For instance, the deep learning algorithm was trained and evaluated on a CXR data source collected from a single tertiary care academic institution. Therefore, it may not yield the same level of performance when applied to data from other sources, such as other institutions with different scanners. This phenomenon is called geographic variation. To overcome it, the learning algorithm should be trained on data that are diverse in terms of regions, races, imaging protocols, etc. Next, to make a diagnosis from a CXR, doctors often rely on a broad range of additional data such as patient age, gender, medical history, clinical symptoms, and possibly CXRs from different views. This additional information should also be incorporated into the model training. Finally, CXR image quality is another problem. Taking a deeper look at the CheXpert dataset, we found a considerable rate of low-quality samples (e.g., rotated images, low resolution, samples with text, noise, etc.) that definitely hurt the model performance. In this case, a template matching-based method, as proposed in this work, may be insufficient to effectively remove all the undesired samples.
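The template matching-based filtering mentioned above can be sketched as a normalized cross-correlation (NCC) check against a reference frontal-CXR template, where a sample is rejected when its similarity falls below a threshold. The helper names and the threshold value here are illustrative assumptions, not the exact procedure used in the paper.

```python
# Sketch of a template matching-based quality filter for CXR samples.
# Images are plain 2-D lists of grayscale values of equal size.

def ncc(a, b):
    """Normalized cross-correlation of two equal-size grayscale images."""
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b for p in row]
    ma = sum(flat_a) / len(flat_a)
    mb = sum(flat_b) / len(flat_b)
    da = [p - ma for p in flat_a]  # zero-mean versions
    db = [p - mb for p in flat_b]
    num = sum(x * y for x, y in zip(da, db))
    den = (sum(x * x for x in da) * sum(y * y for y in db)) ** 0.5
    return num / den if den else 0.0

def keep_sample(image, template, threshold=0.5):
    """Accept the image only if it is similar enough to the template."""
    return ncc(image, template) >= threshold
```

Such a score-threshold filter catches grossly out-of-template inputs (e.g., rotated or non-frontal images) but, as noted, cannot be expected to reject every low-quality sample.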
A more robust preprocessing technique, such as that proposed in [51], should be applied to reject almost all out-of-distribution samples.

6. Conclusion
We presented in this paper a comprehensive approach to building a high-precision computer-aided diagnosis system for the classification of common thoracic diseases from CXRs. We investigated almost every aspect of the task, including data cleaning, network design, training, and ensembling. In particular, we introduced a new training procedure in which dependencies among diseases and uncertainty labels are effectively exploited and integrated into the training of advanced CNNs. Extensive experiments demonstrated that the proposed method outperforms the previous state-of-the-art by a large margin on the CheXpert dataset. More importantly, our deep learning algorithm exhibited a performance on par with specialists in an independent test. There are several possible mechanisms to improve the current method. The most promising direction is to increase the size and quality of the dataset. A larger, high-quality labeled dataset can help deep neural networks generalize better and reduce the need for transfer learning from ImageNet. For instance, extra training data from MIMIC-CXR [23], which uses the same labeling tool as CheXpert, should be considered. We are currently expanding this research by collecting a new large-scale CXR dataset with radiologist-labeled references from several hospitals and medical centers in Vietnam. The new dataset is needed to validate the proposed method and to confirm its usefulness in different clinical settings. We believe the cooperation between a machine learning-based medical diagnosis system and radiologists will improve the outcomes of thoracic disease diagnosis and bring benefits to clinicians and their patients.
7. Acknowledgements
This research was supported by the Vingroup Big Data Institute (VinBDI). The authors gratefully acknowledge Jeremy Irvin from the Machine Learning Group, Stanford University, for helping us evaluate the proposed method on the hidden test set of CheXpert.

References

[1] N. England, Diagnostic imaging dataset statistical release, February 2019 (accessed 30 July 2019).
[2] L. Anderson, A. Dean, D. Falzon, K. Floyd, I. Baena, C. Gilpin, P. Glaziou, Y. Hamada, T. Hiatt, A. Char, et al., Global tuberculosis report 2015, World Health Organization.
[3] N. Bellaviti, F. Bini, L. Pennacchi, G. Pepe, B. Bodini, R. Ceriani, C. D'Urbano, A. Vaghi, Increased incidence of spontaneous pneumothorax in very young people: Observations and treatment, Chest 150 (4) (2016) 560A.
[4] C. Qin, D. Yao, Y. Shi, Z. Song, Computer-aided detection in chest radiography based on artificial intelligence: A survey, Biomedical Engineering Online 17 (1) (2018) 113. doi:10.1186/s12938-018-0544-y.
[5] L. Delrue, R. Gosselin, B. Ilsen, A. Van Landeghem, J. de Mey, P. Duyck, Difficulties in the interpretation of chest radiography, in: Comparative Interpretation of CT and Standard Radiography of the Chest, Springer, 2011, pp. 27–49. doi:10.1007/978-3-540-79942-9_2.
[6] N. Crisp, L. Chen, Global supply of health professionals, New England Journal of Medicine 370 (10) (2014) 950–957. doi:10.1056/NEJMra1111610.
[7] The Atlantic, Most of the world doesn't have access to X-rays (accessed 30 July 2019).
[8] M. Annarumma, S. J. Withey, R. J. Bakewell, E. Pesce, V. Goh, G. Montana, Automated triaging of adult chest radiographs with deep artificial neural networks, Radiology 291 (1) (2019) 196–202. doi:10.1148/radiol.2018180921.
[9] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, C. I. Sánchez, A survey on deep learning in medical image analysis, Medical Image Analysis 42 (2017) 60–88. doi:10.1016/j.media.2017.07.005.
[10] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al., CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning, arXiv preprint arXiv:1711.05225.
[11] Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, Y. Yang, Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification, arXiv preprint arXiv:1801.09927.
[12] P. Rajpurkar, J. Irvin, R. L. Ball, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. P. Langlotz, et al., Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists, PLoS Medicine 15 (11) (2018) e1002686. doi:10.1371/journal.pmed.1002686.
[13] P. Kumar, M. Grewal, M. M. Srivastava, Boosted cascaded convnets for multilabel classification of thoracic diseases in chest radiographs, in: ICIAR, 2018, pp. 546–552. doi:10.1007/978-3-319-93000-8_62.
[14] E. J. Hwang, S. Park, K.-N. Jin, J. Im Kim, S. Y. Choi, J. H. Lee, J. M. Goo, J. Aum, J.-J. Yim, J. G. Cohen, et al., Development and validation of a deep learning–based automated detection algorithm for major thoracic diseases on chest radiographs, JAMA Network Open 2 (3) (2019) e191095.
[15] A. K. Jaiswal, P. Tiwari, S. Kumar, D. Gupta, A. Khanna, J. J. Rodrigues, Identifying pneumonia in chest X-rays: A deep learning approach, Measurement 145 (2019) 511–518. doi:10.1016/j.measurement.2019.05.076.
[16] P. Lakhani, B. Sundaram, Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks, Radiology 284 (2) (2017) 574–582. doi:10.1148/radiol.2017162326.
[17] F. Pasa, V. Golkov, F. Pfeiffer, D. Cremers, D. Pfeiffer, Efficient deep network architectures for fast chest X-ray tuberculosis screening and visualization, Scientific Reports 9 (1) (2019) 6268. doi:10.1038/s41598-019-42557-4.
[18] W. Ausawalaithong, A. Thirach, S. Marukatat, T. Wilaiprasitporn, Automatic lung cancer prediction from chest X-ray images using the deep learning approach, in: BMEiCON, IEEE, 2018, pp. 1–5. doi:10.1109/bmeicon.2018.8609997.
[19] H. Chen, S. Miao, D. Xu, G. D. Hager, A. P. Harrison, Deep hierarchical multi-label classification of chest X-ray images, in: MIDL, 2019, pp. 109–120.
[20] R. Müller, S. Kornblith, G. E. Hinton, When does label smoothing help?, in: NIPS, 2019, pp. 4696–4705.
[21] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. L. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, A. Y. Ng, CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, in: AAAI, 2019. doi:10.1609/aaai.v33i01.3301590.
[22] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: IEEE CVPR, 2017, pp. 2097–2106. doi:10.1109/CVPR.2017.369.
[23] A. E. Johnson, T. J. Pollard, S. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, S. Horng, MIMIC-CXR: A large publicly available database of labeled chest radiographs, arXiv preprint arXiv:1901.07042.
[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE CVPR, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[25] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: IEEE CVPR, 2017, pp. 4700–4708. doi:10.1109/CVPR.2017.243.
[26] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: AAAI, 2017. URL: http://dl.acm.org/citation.cfm?id=3298023.3298188.
[27] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, in: IEEE CVPR, 2018, pp. 8697–8710. doi:10.1109/CVPR.2018.00907.
[28] L. Shen, L. R. Margolies, J. H. Rothstein, E. Fluder, R. McBride, W. Sieh, Deep learning to improve breast cancer detection on screening mammography, Scientific Reports 9. doi:10.1038/s41598-019-48995-4.
[29] P. Huang, S. Park, R. Yan, J. Lee, L. C. Chu, C. T. Lin, A. Hussien, J. Rathmell, B. Thomas, C. Chen, et al., Added value of computer-aided CT image features for early lung cancer diagnosis with small pulmonary nodules: A matched case-control study, Radiology 286 (1) (2017) 286–295. doi:10.1148/radiol.2017162725.
[30] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324. doi:10.1109/5.726791.
[31] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.
[32] M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26 (8) (2013) 1819–1837. doi:10.1109/TKDE.2013.39.
[33] G. Tsoumakas, I. Katakis, Multi-label classification: An overview, International Journal of Data Warehousing and Mining 3 (3) (2007) 1–13. doi:10.4018/jdwm.2007070101.
[34] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, K. Lyman, Learning to diagnose from scratch by exploiting dependencies among labels, arXiv preprint arXiv:1710.10501.
[35] J. K. Gohagan, P. C. Prorok, R. B. Hayes, B.-S. Kramer, The prostate, lung, colorectal and ovarian (PLCO) cancer screening trial of the National Cancer Institute: History, organization, and status, Controlled Clinical Trials 21 (6, Supplement 1) (2000) 251S–272S. doi:10.1016/S0197-2456(00)00097-0.
[36] J. Rubin, D. Sanghavi, C. Zhao, K. Lee, A. Qadir, M. Xu-Wilson, Large scale automated reading of frontal and lateral chest X-rays using dual convolutional neural networks, arXiv preprint arXiv:1804.07839.
[37] S. Van Eeden, J. Leipsic, S. Paul Man, D. D. Sin, The relationship between lung inflammation and cardiovascular disease, American Journal of Respiratory and Critical Care Medicine 186 (1) (2012) 11–16. doi:10.1164/rccm.201203-0455PP.
[38] N. Alaydie, C. K. Reddy, F. Fotouhi, Exploiting label dependency for hierarchical multi-label classification, in: PAKDD, Springer, 2012, pp. 294–305. doi:10.1007/978-3-642-30217-6_25.
[39] R. Aly, S. Remus, C. Biemann, Hierarchical multi-label classification of text with capsule networks, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, 2019, pp. 323–330. doi:10.18653/v1/P19-2045.
[40] W. Bi, J. T. Kwok, Mandatory leaf node prediction in hierarchical multilabel classification, in: NIPS, 2012, pp. 153–161. doi:10.1109/tnnls.2014.2309437.
[41] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, Y. Yu, HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition, in: IEEE ICCV, 2015, pp. 2740–2748. doi:10.1109/ICCV.2015.314.
[42] W. Bi, J. T. Kwok, Bayes-optimal hierarchical multilabel classification, IEEE Transactions on Knowledge and Data Engineering 27 (11) (2015) 2907–2918. doi:10.1109/TKDE.2015.2441707.
[43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception architecture for computer vision, in: IEEE CVPR, 2016, pp. 2818–2826. doi:10.1109/CVPR.2016.308.
[44] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, G. Hinton, Regularizing neural networks by penalizing confident output distributions, arXiv preprint arXiv:1701.06548.
[45] R. Brunelli, Template matching techniques in computer vision: Theory and practice, Wiley Publishing, ISBN: 978-0-470-51706-2, 2009.
[46] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: IEEE CVPR, 2009, pp. 248–255.
[48] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: IEEE CVPR, 2017, pp. 1251–1258. doi:10.1109/CVPR.2017.195.
[49] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
[50] I. Allaouzi, M. B. Ahmed, A novel approach for multi-label chest X-ray classification of common thorax diseases, IEEE Access 7 (2019) 64279–64288. doi:10.1109/ACCESS.2019.2916849.