A two-step explainable approach for COVID-19 computer-aided diagnosis from chest x-ray images
Carlo Alberto Barbano, Enzo Tartaglione, Claudio Berzovini, Marco Calandri, Marco Grangetto
© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Carlo Alberto Barbano*, Enzo Tartaglione*, Claudio Berzovini†, Marco Calandri‡, Marco Grangetto*

*Computer Science Department, University of Turin, Italy
†Azienda Ospedaliera Città della Salute e della Scienza, Presidio Molinette, Turin, Italy
‡Oncology Department, University of Turin, AOU San Luigi Gonzaga, Orbassano, Italy
ABSTRACT
Early screening of patients is a critical issue in order to assess immediate and fast responses against the spread of COVID-19. The use of nasopharyngeal swabs has been considered the most viable approach; however, the result is not immediate or, in the case of fast exams, sufficiently accurate. Using Chest X-Ray (CXR) imaging for early screening potentially provides faster and more accurate responses; however, diagnosing COVID from CXRs is hard, and we should rely on deep learning support, whose decision process is, on the other hand, "black-boxed" and, for such reason, untrustworthy. We propose an explainable two-step diagnostic approach, where we first detect known pathologies (anomalies) in the lungs, on top of which we diagnose the illness. Our approach achieves promising performance in COVID detection, compatible with expert human radiologists. All of our experiments have been carried out bearing in mind that, especially for clinical applications, explainability plays a major role in building trust in machine learning algorithms.
Index Terms — Explainable AI, Chest X-ray, Deep Learning, Classification, COVID-19
1. INTRODUCTION
Early COVID diagnosis is a key element for proper treatment of the patients and prevention of the spread of the disease. Given the high tropism of COVID-19 for respiratory airways and lung epithelium, identification of lung involvement in infected patients can be relevant for treatment and monitoring of the disease. Virus testing is currently considered the only specific method of diagnosis. Nasopharyngeal swabs are easily executable and affordable and are the current standard in the diagnostic setting; their accuracy reported in the literature is influenced by the severity of the disease and the time from symptoms onset, and is reported up to 73.3% [1]. Current position papers from radiological societies (Fleischner Society, SIRM,

*This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825111, DeepHealth Project.
Fig. 1: Comparison between standard approaches to COVID diagnosis and our two-step approach.

RSNA) [2, 3, 4] do not recommend routine use of imaging for COVID-19 diagnosis; however, it has been widely demonstrated that, even at early stages of the disease, chest x-rays (CXR) can show pathological findings. In the last year, many works attempted to tackle this problem, proposing deep learning-based strategies [5, 6, 7, 8, 9]. All of the proposed approaches have some elements in common: i) the images collected during the pandemic need to be augmented with non-COVID cases from publicly available datasets; ii) some standard pre-processing is applied to the images, like lung segmentation using U-Net [10] or similar models [5], or converting the pixels of the CXR scan into Hounsfield units; iii) the deep learning model is trained for the final diagnosis using state-of-the-art approaches for deep neural networks. Despite some very optimistic results, the proposed approaches exhibit significant limitations that deserve further analysis. For example, augmenting COVID datasets with negative cases from publicly-available datasets can inject a dangerous bias, where the trained model learns to discriminate different data sources rather than actual radiological features related to the disease [5]. These unwanted effects are difficult to spot when using a "black box" model like deep learning ones, without having control over the decision process.

Fig. 2: CheXpert's radiological findings.

In this work we propose an explainable approach, mimicking the radiologists' decision process. Towards this end, we break the COVID diagnosis problem into two sub-problems. First, we train a model to detect anomalies in the lungs. These anomalies are widely known and, following [11], comprise 14 objective radiological observations which can be found in the lungs. Then, on top of these, we train a decision tree model, where the COVID diagnosis is explicit (Fig. 1).
Mimicking the radiologist's decision is more robust to biases and aims at building trust in the AI tool among physicians and patients, which can be useful for fast COVID diagnosis. Thanks to the collaboration with the radiology units of Città della Salute e della Scienza di Torino (CDSS) and San Luigi Hospital (SLG) in Turin, we collected the COvid Radiographic images DAta-set for AI (CORDA), comprising both positive and negative COVID cases as well as a ground truth based on human radiological reporting; it currently comprises almost 1000 CXRs.
2. DATASETS
In this section we introduce the datasets that will be used for our proposed approach. For our purposes, we first need to detect some objective radiological findings (we train a model on the CheXpert dataset) and then, on top of those, we train a model to elaborate the COVID diagnosis (using the CORDA dataset).
CheXpert: this is a large dataset comprising about 224k CXRs. This dataset consists of 14 different observations on the radiographic image: differently from many other datasets which are focused on disease classification based on clinical diagnosis, the main focus here is "chest radiograph interpretation", where anomalies are detected [12]. The learnable radiological findings are summarized in Fig. 2.
CORDA: this dataset was created for this study by retrospectively selecting chest x-rays performed at a dedicated Radiology Unit in CDSS and at SLG on all patients with fever or respiratory symptoms (cough, shortness of breath, dyspnea) that underwent nasopharyngeal swab to rule out COVID-19 infection. Patients' average age is 61 years (range 17-97 years old). It contains a total of 898 CXRs and can be split by collecting institution into two similarly sized subgroups: CORDA-CDSS [5], which contains a total of 447 CXRs from 386 patients, with 150 images coming from COVID-negative patients and 297 from positive ones, and CORDA-SLG, which contains the remaining 451 CXRs, with 129 COVID-positive and 322 COVID-negative images. Including data from different hospitals at test time is crucial to double-check the generalization capability of our model. The data collection is still in progress, with five other hospitals in Italy willing to contribute at the time of writing. We plan to make CORDA available for research purposes according to EU regulations as soon as possible.
3. RADIOLOGICAL REPORT
In this section we describe our proposed method to extract radiological findings from CXRs. For this task, we leverage the large-scale CheXpert dataset, which contains annotations for different kinds of common radiological findings that can be observed in CXR images (like opacity, pleural effusion, cardiomegaly, etc.). Given its high heterogeneity and high cardinality, CheXpert is perfect for our purposes: in fact, once the model is trained on this dataset, there is no need to fine-tune it for the COVID diagnosis, since it will already extract objective radiological findings.

CheXpert provides 14 different types of observations for each image in the dataset. For each class, the labels have been generated from the radiology reports associated with the studies using NLP techniques, conforming to the Fleischner Society's recommended glossary [11], and are marked as: negative (N), positive (P), uncertain (U) or blank (N/A). Following the relationship among labels illustrated in Fig. 2, as proposed by [12], we can identify 8 top-level pathologies and 6 child ones.
In order to extract the radiological findings from CXRs, a deep learning model is trained on the 14 observations. Towards this end, given the possibility of having multiple findings in the same CXR, the weighted binary cross-entropy loss is used to train the model. Typically, weights are used to compensate for class unbalancing, giving higher importance to less-represented classes. Within CheXpert, however, we also need to tackle another issue: how to treat the samples with the U label. Multiple approaches to this issue have been suggested by [12]. The most popular is to ignore all the uncertain samples, excluding them from the training process and treating them as N/A. We propose instead to include the U samples in the learning process, mapping them to maximum uncertainty (probability 0.5 of being P or N). Then, we balance P and N outcomes for every radiological finding. Table 1 shows a performance comparison between the standard approach proposed by [12] and our proposal (U-label use), for 5 salient radiological findings, using the same setting as in [12]. We observe an overall improvement in the performance, which is expected from the inclusion of the U-labeled examples. For all our experiments, we will use models trained using the U-labeled samples.

Table 1: Performance (AUC) for DenseNet-121 trained on CheXpert.

Method         Atelectasis  Cardiomegaly  Consolidation  Edema  Pleural Effusion
Baseline [12]  0.79         —             —              —      —
U-label use    —            —             —              —      —
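As a concrete illustration of the U-label handling described above, the sketch below maps CheXpert-style labels to training targets and evaluates a weighted binary cross-entropy in NumPy. The label encoding (1 = P, 0 = N, -1 = U, NaN = blank) and all function names are our own assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def make_targets_and_mask(raw):
    """Map raw labels to BCE targets: P -> 1.0, N -> 0.0,
    U -> 0.5 (maximum uncertainty), blank (N/A) -> masked out."""
    targets = raw.astype(float).copy()
    mask = ~np.isnan(targets)            # N/A entries carry no loss
    targets[targets == -1] = 0.5         # uncertain -> maximum uncertainty
    targets[~mask] = 0.0                 # dummy value; masked out below
    return targets, mask.astype(float)

def weighted_bce(probs, targets, mask, pos_weight, eps=1e-7):
    """Weighted binary cross-entropy over the 14 findings (probs in (0, 1))."""
    p = np.clip(probs, eps, 1 - eps)
    per_class = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    weights = np.where(targets >= 0.5, pos_weight, 1.0)  # up-weight positives
    return float((per_class * weights * mask).sum() / max(mask.sum(), 1.0))
```

In this sketch the per-finding positive weights would be set from the training-set class frequencies, so that under-represented findings contribute more to the loss.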
4. COVID DIAGNOSIS
The second step of the proposed approach is building the model which can actually provide a clinical diagnosis for COVID. We freeze the model obtained in Sec. 3 and use its output as image features to train a new binary classifier on the CORDA dataset. We test two different types of classifiers: a decision tree (Tree) and a neural network-based classifier (FC). The decision tree is trained on the output probabilities of the radiological reports, using the state-of-the-art CART algorithm implementation provided by the Python scikit-learn [13] package. Besides the fully explainable decision tree-based result, we also train a neural network classifier, comprising one hidden layer of size 512 and the output layer. Despite working with the same features as the decision tree, such an approach loses in explainability, but potentially enhances the performance in terms of COVID diagnosis, as we will see in Sec. 5.
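The second step can be sketched in a few lines of scikit-learn. Here synthetic features stand in for the 14 per-finding probabilities that the frozen first-step model would produce, and the toy labels are illustrative only; scikit-learn's DecisionTreeClassifier implements an optimised CART variant, as used in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for step one: each row holds the 14 per-finding probabilities
# output by the frozen CheXpert model (synthetic here).
X = rng.random((898, 14))
y = (X[:, 3] > 0.5).astype(int)   # toy rule standing in for COVID labels

# 70%-30% train/test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A shallow tree keeps the decision process readable for radiologists.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {tree.score(X_te, y_te):.2f}")
```

The depth cap is a design choice: a shallow tree can be printed and audited node by node (cf. Fig. 3), whereas the FC head trades that readability for discriminative power.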
5. RESULTS
In this section we compare the COVID diagnosis generalization capability of a direct deep learning-based approach (baseline) and our proposed two-step diagnosis, where we first detect the radiological findings, and then discriminate patients affected by COVID using either a decision tree-based diagnosis (Tree) or a deep learning-based classifier on the radiological findings (FC). The performance is tested on a subset of patients not included in the training / validation set. The assessed metrics are: balanced accuracy (BA), sensitivity, specificity and area under the ROC curve (AUC). For all of the methods we adopt a 70%-30% train-test split. For the deep learning-based strategy, SGD is used with fixed learning rate and weight decay. All of the experiments were run on NVIDIA Tesla T4 GPUs using PyTorch 1.4.

Table 2 compares the standard deep learning-based approach [5] to our two-step diagnosis. Baseline results are obtained by pre-training the model on some of the most used publicly-available datasets. We observe that the best achievable performance is very low, consisting in a BA of 0.67. A key takeaway is that trying to directly diagnose diseases such as COVID-19 from CXRs might currently be infeasible, probably given the small dataset sizes and strong selection bias in the datasets. We can clearly see how the two-step method outperforms the direct diagnosis: using the same network architecture (ResNet-18 as backbone and a fully-connected classifier on top of it), we obtain a significant increase in all of the assessed metrics. Even better results are achieved using a DenseNet-121 backbone with the fully-connected classifier. Fig. 3 graphically shows the learned decision tree (whose performance is shown in Table 2): this provides a very clear interpretation of the decision process. From the clinical and radiological perspective, these data are consistent with the COVID-19 CXR semiotics that radiologists are used to dealing with.
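The assessed metrics can be made precise with a short sketch: balanced accuracy is the mean of sensitivity (recall on positives) and specificity (recall on negatives). The toy predictions below are ours, and scikit-learn is assumed only to cross-check the hand computation.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1])            # toy COVID labels
y_prob = np.array([0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9])
y_pred = (y_prob >= 0.5).astype(int)                # threshold at 0.5

sens = (y_pred[y_true == 1] == 1).mean()            # sensitivity
spec = (y_pred[y_true == 0] == 0).mean()            # specificity
ba = (sens + spec) / 2                              # balanced accuracy
auc = roc_auc_score(y_true, y_prob)                 # threshold-free AUC

# The hand-rolled BA matches scikit-learn's definition.
assert np.isclose(ba, balanced_accuracy_score(y_true, y_pred))
print(f"sens={sens:.2f} spec={spec:.2f} BA={ba:.2f} AUC={auc:.2f}")
```

BA is preferred over plain accuracy here because the CORDA subsets are class-unbalanced, and plain accuracy would reward a classifier that favours the majority class.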
The edema feature, although unspecific, is strictly related to the interstitial involvement that is typical of COVID-19 infections, and it has been largely reported in the recent literature [14]. Indeed, in recent COVID-19 radiological papers, interstitial involvement has been reported as a ground-glass opacity appearance [15]. However, this definition is more pertinent to the CT imaging setting than to CXR; the "edema" feature can be compatible, from the radiological perspective, with the interstitial opacity of COVID-19 patients. Furthermore, the non-negligible role of cardiomegaly (or, more in general, enlarged cardiomediastinum) in the decision tree can be interesting from the clinical perspective. In fact, this can be read as additional proof that established cardiovascular disease can be a relevant risk factor for developing COVID-19 [16]. Moreover, it may be consistent with the hypotheses of a larger role of the primary cardiovascular damage observed in preliminary data from autopsies of COVID-19 patients [17].

Table 2: Results for COVID diagnosis.

Method        Backbone      Classifier  Pretrain dataset  Dataset     Sensitivity  Specificity  BA    AUC
Baseline [5]  ResNet-18     FC          none              CORDA-CDSS  —            —            —     —
              ResNet-18     FC          ChestXRay         CORDA-CDSS  0.54         0.58         0.56  0.67
Two-step      ResNet-18     FC          CheXpert          CORDA-CDSS  0.69         0.73         0.71  0.76
              DenseNet-121  FC          CheXpert          CORDA-CDSS  0.72         —            —     —
              DenseNet-121  Tree        CheXpert          CORDA-CDSS  —            —            —     —

Fig. 3: Decision Tree obtained for COVID-19 classification based on the probabilities for the 14 classes of findings. [Tree diagram: the root splits on Pneumonia, with further splits on Pleural Other, Pneumothorax, Edema, Lung Lesion, Support Devices, Cardiomegaly, Enlarged Cardiomediastinum and Atelectasis.]

Fig. 4: Grad-CAM on COVID-positive samples.

Focusing on the deep learning-based approach (FC), we observe a boost in the performance, achieving a BA of 0.75. However, this is the result of a trade-off between interpretability and discriminative power. Using Grad-CAM [18], we get hints on the areas the model focused on to take the final diagnostic decision. From Fig. 4 we observe that, on COVID-positive images, the model seems to mostly focus on the expected lung areas. Finally, to further test the reliability of our approach, we also applied our strategy to CORDA-SLG (data coming from a different hospital structure), reaching comparable and encouraging results.
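Grad-CAM itself can be sketched in a few lines of PyTorch: capture the activations and gradients of a convolutional layer via hooks, weight each channel by the spatially averaged gradient of the class score, and keep the positive part. The tiny CNN below merely stands in for the DenseNet-121 used in the paper; all names are illustrative.

```python
import torch
import torch.nn as nn

# Tiny stand-in CNN (the paper uses DenseNet-121; this is for illustration only).
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
target_layer = model[0]

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 1, 64, 64)
score = model(x)[0, 1]                 # logit of the COVID-positive class
score.backward()

weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of the gradients
cam = torch.relu((weights * acts["a"]).sum(dim=1))   # class activation map
cam = cam / cam.max().clamp(min=1e-8)                # normalise to [0, 1]
```

In practice the map would be upsampled to the CXR resolution and overlaid on the image, as in Fig. 4, to check that the evidence lies inside the lung fields.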
6. CONCLUSIONS
One of the latest challenges for both the clinical and the AI community has been applying deep learning to diagnosing COVID from CXRs. Recent works suggested the possibility of successfully tackling this problem, despite the currently small quantity of publicly available data. In this work we propose a multi-step approach, close to the physicians' diagnostic process, in which the final diagnosis is based upon detected lung pathologies. We performed our experiments on CORDA, a COVID-19 CXR dataset comprising approximately 1000 images. All of our experiments have been carried out bearing in mind that, especially for clinical applications, explainability plays a major role in building trust in machine learning algorithms, although better interpretability can come at the cost of lower prediction accuracy.
7. REFERENCES

[1] Yang Yang, Minghui Yang, Chenguang Shen, Fuxiang Wang, Jing Yuan, Jinxiu Li, Mingxia Zhang, Zhaoqin Wang, Li Xing, Jinli Wei, et al., "Laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections," medRxiv, 2020.
[2]–[4] RSNA Radiology, 2020.
[5] Enzo Tartaglione, Carlo Alberto Barbano, Claudio Berzovini, Marco Calandri, and Marco Grangetto, "Unveiling COVID-19 from chest X-ray with deep learning: A hurdles race with small data," International Journal of Environmental Research and Public Health, vol. 17, no. 18, pp. 6933, Sep 2020.
[6] Prabira Kumar Sethy and Santi Kumari Behera, "Detection of coronavirus disease (COVID-19) based on deep features," 2020.
[7] Ioannis D. Apostolopoulos and Tzani Bessiana, "COVID-19: Automatic detection from X-ray images utilizing transfer learning with convolutional neural networks," arXiv preprint arXiv:2003.11617, 2020.
[8] Ali Narin, Ceren Kaya, and Ziynet Pamuk, "Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks," arXiv preprint arXiv:2003.10849, 2020.
[9] Linda Wang, Zhong Qiu Lin, and Alexander Wong, "COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images," Scientific Reports, vol. 10, no. 1, pp. 1–12, 2020.
[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[11] David M. Hansell, Alexander A. Bankier, Heber MacMahon, Theresa C. McLoud, Nestor L. Muller, and Jacques Remy, "Fleischner Society: Glossary of terms for thoracic imaging," Radiology, vol. 246, no. 3, pp. 697–722, 2008.
[12] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al., "CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 590–597.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[14] Wei-jie Guan, Zheng-yi Ni, Yu Hu, Wen-hua Liang, Chun-quan Ou, Jian-xing He, Lei Liu, Hong Shan, Chun-liang Lei, David S. C. Hui, et al., "Clinical characteristics of coronavirus disease 2019 in China," New England Journal of Medicine, vol. 382, no. 18, pp. 1708–1720, 2020.
[15] Ho Yuen Frank Wong, Hiu Yin Sonia Lam, Ambrose Ho-Tung Fong, Siu Ting Leung, Thomas Wing-Yan Chin, Christine Shing Yen Lo, Macy Mei-Sze Lui, Jonan Chun Yin Lee, Keith Wan-Hang Chiu, Tom Chung, et al., "Frequency and distribution of chest radiographic findings in COVID-19 positive patients," Radiology, p. 201160, 2020.
[16] ESC Guidance for the Diagnosis and Management of CV Disease during the COVID-19 Pandemic, 2020.
[17] Dominic Wichmann, Jan-Peter Sperhake, Marc Lütgehetmann, Stefan Steurer, Carolin Edler, Axel Heinemann, Fabian Heinrich, Herbert Mushumba, Inga Kniep, Ann Sophie Schröder, et al., "Autopsy findings and venous thromboembolism in patients with COVID-19: a prospective cohort study," Annals of Internal Medicine, 2020.
[18] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017.