A two-step explainable approach for COVID-19 computer-aided diagnosis from chest x-ray images
Carlo Alberto Barbano, Enzo Tartaglione, Claudio Berzovini, Marco Calandri, Marco Grangetto
© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Carlo Alberto Barbano*, Enzo Tartaglione*, Claudio Berzovini†, Marco Calandri‡, Marco Grangetto*

*Computer Science Department, University of Turin, Italy
†Azienda Ospedaliera Città della Salute e della Scienza, Presidio Molinette, Turin, Italy
‡Oncology Department, University of Turin, AOU San Luigi Gonzaga, Orbassano, Italy
ABSTRACT
Early screening of patients is a critical issue in order to assess immediate and fast responses against the spread of COVID-19. The use of nasopharyngeal swabs has been considered the most viable approach; however, the result is not immediate or, in the case of fast exams, sufficiently accurate. Using Chest X-Ray (CXR) imaging for early screening potentially provides faster and more accurate responses; however, diagnosing COVID from CXRs is hard, and we should rely on deep learning support, whose decision process is, on the other hand, "black-boxed" and, for such reason, untrustworthy. We propose an explainable two-step diagnostic approach, where we first detect known pathologies (anomalies) in the lungs, on top of which we diagnose the illness. Our approach achieves promising performance in COVID detection, compatible with expert human radiologists. All of our experiments have been carried out bearing in mind that, especially for clinical applications, explainability plays a major role in building trust in machine learning algorithms.
Index Terms — Explainable AI, Chest X-ray, Deep Learning, Classification, COVID-19
1. INTRODUCTION
Early COVID diagnosis is a key element for proper treatment of the patients and prevention of the spread of the disease. Given the high tropism of COVID-19 for respiratory airways and lung epithelium, identification of lung involvement in infected patients can be relevant for treatment and monitoring of the disease. Virus testing is currently considered the only specific method of diagnosis. Nasopharyngeal swabs are easily executable and affordable and are the current standard in the diagnostic setting; their accuracy reported in the literature is influenced by the severity of the disease and the time from symptoms onset, and is reported up to 73.3% [1]. Current position papers from radiological societies (Fleischner Society, SIRM,

*This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825111, DeepHealth Project.
Fig. 1: Comparison between standard approaches to COVID diagnosis and our two-step approach.

RSNA) [2, 3, 4] do not recommend routine use of imaging for COVID-19 diagnosis; however, it has been widely demonstrated that, even at early stages of the disease, chest x-rays (CXR) can show pathological findings. In the last year, many works attempted to tackle this problem, proposing deep learning-based strategies [5, 6, 7, 8, 9]. All of the proposed approaches have some elements in common: i) the images collected during the pandemic need to be augmented with non-COVID cases from publicly available datasets; ii) some standard pre-processing is applied to the images, like lung segmentation using U-Net [10] or similar models [5], or converting the pixels of the CXR scan into Hounsfield units; iii) the deep learning model is trained for the final diagnosis using state-of-the-art approaches for deep neural networks. Despite some very optimistic results, the proposed approaches exhibit significant limitations that deserve further analysis. For example, augmenting COVID datasets with negative cases from publicly-available datasets can inject a dangerous bias, where the trained model learns to discriminate different data sources rather than actual radiological features related to the disease [5]. These unwanted effects are difficult to spot when using a "black box" model like deep learning ones, without having control over the decision process.

Fig. 2: CheXpert's radiological findings.

In this work we propose an explainable approach, mimicking the radiologists' decision process. Towards this end, we break the COVID diagnosis problem into two sub-problems. First, we train a model to detect anomalies in the lungs. These anomalies are widely known and, following [11], comprise 14 objective radiological observations which can be found in the lungs. Then, on top of these, we train a decision tree model, where the COVID diagnosis is explicit (Fig. 1).
Mimicking the radiologist's decision is more robust to biases and aims at building trust in the AI tool among physicians and patients, which can be useful for fast COVID diagnosis. Thanks to the collaboration with the radiology units of Città della Salute e della Scienza di Torino (CDSS) and San Luigi Hospital (SLG) in Turin, we collected the COvid Radiographic images DAta-set for AI (CORDA), comprising both positive and negative COVID cases as well as a ground truth based on human radiological reporting; it currently comprises almost 1000 CXRs.
2. DATASETS
In this section we introduce the datasets that will be used for our proposed approach. For our purposes, we first need to detect some objective radiological findings (we train a model on the CheXpert dataset) and then, on top of those, we train a model to elaborate the COVID diagnosis (using the CORDA dataset).
CheXpert: this is a large dataset comprising about 224k CXRs. This dataset consists of 14 different observations on the radiographic image: differently from many other datasets which are focused on disease classification based on clinical diagnosis, the main focus here is "chest radiograph interpretation", where anomalies are detected [12]. The learnable radiological findings are summarized in Fig. 2.
CORDA: this dataset was created for this study by retrospectively selecting chest x-rays performed at a dedicated Radiology Unit in CDSS and at SLG on all patients with fever or respiratory symptoms (cough, shortness of breath, dyspnea) that underwent nasopharyngeal swab to rule out COVID-19 infection. Patients' average age is 61 years (range 17-97 years old). It contains a total of 898 CXRs and can be split by collecting institution into two similarly sized subgroups: CORDA-CDSS [5], which contains a total of 447 CXRs from 386 patients, with 150 images coming from COVID-negative patients and 297 from positive ones, and CORDA-SLG, which contains the remaining 451 CXRs, with 129 COVID-positive and 322 COVID-negative images. Including data from different hospitals at test time is crucial to double-check the generalization capability of our model. The data collection is still in progress, with five other hospitals in Italy willing to contribute at the time of writing. We plan to make CORDA available for research purposes according to EU regulations as soon as possible.
3. RADIOLOGICAL REPORT
In this section we describe our proposed method to extract radiological findings from CXRs. For this task, we leverage the large-scale CheXpert dataset, which contains annotations for different kinds of common radiological findings that can be observed in CXR images (like opacity, pleural effusion, cardiomegaly, etc.). Given its high heterogeneity and high cardinality, CheXpert is perfect for our purposes: in fact, once the model is trained on this dataset, there is no need to fine-tune it for the COVID diagnosis, since it will already extract objective radiological findings.

CheXpert provides 14 different types of observations for each image in the dataset. For each class, the labels have been generated from the radiology reports associated with the studies using NLP techniques, conforming to the Fleischner Society's recommended glossary [11], and are marked as: negative (N), positive (P), uncertain (U) or blank (N/A). Following the relationship among labels illustrated in Fig. 2, as proposed by [12], we can identify 8 top-level pathologies and 6 child ones.
In order to extract the radiological findings from CXRs, a deep learning model is trained on the 14 observations. Towards this end, given the possibility of having multiple findings in the same CXR, the weighted binary cross-entropy loss is used to train the model. Typically, weights are used to compensate for class unbalancing, giving higher importance to less-represented classes. Within CheXpert, however, we also need to tackle another issue: how to treat the samples with the U label. Multiple approaches to this issue have been suggested by [12]. The most popular is to ignore all the uncertain samples, excluding them from the training process and treating them as N/A. We propose instead to include the U samples in the learning process, mapping them to maximum uncertainty (probability 0.5 of being P or N). Then, we balance P and N outcomes for every radiological finding. Table 1 shows a performance comparison between the standard approach proposed by [12] and our proposal (U-label use), for 5 salient radiological findings, using the same setting as in [12]. We observe an overall improvement in the performance, which is expected from the inclusion of the U-labeled examples. For all our experiments, we will use models trained using the U-labeled samples.

Table 1: Performance (AUC) for DenseNet-121 trained on CheXpert.

Method         Atelectasis  Cardiomegaly  Consolidation  Edema  Pleural Effusion
Baseline [12]  0.79         —             —              —      —
U-label use    —            —             —              —      —
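As a concrete illustration of the U-label handling described above, the sketch below maps CheXpert-style labels to training targets and evaluates a weighted binary cross-entropy in NumPy. The label encoding (1 = P, 0 = N, -1 = U, NaN = blank) and all function names are our own assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def make_targets_and_mask(raw):
    """Map raw labels to BCE targets: P -> 1.0, N -> 0.0,
    U -> 0.5 (maximum uncertainty), blank (N/A) -> masked out."""
    targets = raw.astype(float).copy()
    mask = ~np.isnan(targets)            # N/A entries carry no loss
    targets[targets == -1] = 0.5         # uncertain -> maximum uncertainty
    targets[~mask] = 0.0                 # dummy value; masked out below
    return targets, mask.astype(float)

def weighted_bce(probs, targets, mask, pos_weight, eps=1e-7):
    """Weighted binary cross-entropy over the 14 findings (probs in (0, 1))."""
    p = np.clip(probs, eps, 1 - eps)
    per_class = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    weights = np.where(targets >= 0.5, pos_weight, 1.0)  # up-weight positives
    return float((per_class * weights * mask).sum() / max(mask.sum(), 1.0))
```

In this sketch the per-finding positive weights would be set from the training-set class frequencies, so that under-represented findings contribute more to the loss.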
4. COVID DIAGNOSIS
The second step of the proposed approach is building the model which can actually provide a clinical diagnosis for COVID. We freeze the model obtained in Sec. 3 and use its output as image features to train a new binary classifier on the CORDA dataset. We test two different types of classifiers: a decision tree (Tree) and a neural network-based classifier (FC). The decision tree is trained on the output probabilities of the radiological reports, using the state-of-the-art CART algorithm implementation provided by the Python scikit-learn [13] package. Besides the fully explainable decision tree-based result, we also train a neural network classifier, comprising one hidden layer of size 512 and the output layer. Despite working with the same features as the decision tree, such an approach loses in explainability, but potentially enhances the performance in terms of COVID diagnosis, as we will see in Sec. 5.
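The second step can be sketched in a few lines of scikit-learn. Here synthetic features stand in for the 14 per-finding probabilities that the frozen first-step model would produce, and the toy labels are illustrative only; scikit-learn's DecisionTreeClassifier implements an optimised CART variant, as used in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for step one: each row holds the 14 per-finding probabilities
# output by the frozen CheXpert model (synthetic here).
X = rng.random((898, 14))
y = (X[:, 3] > 0.5).astype(int)   # toy rule standing in for COVID labels

# 70%-30% train/test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A shallow tree keeps the decision process readable for radiologists.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {tree.score(X_te, y_te):.2f}")
```

The depth cap is a design choice: a shallow tree can be printed and audited node by node (cf. Fig. 3), whereas the FC head trades that readability for discriminative power.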
5. RESULTS
In this section we compare the COVID diagnosis generalization capability of a direct deep learning-based approach (baseline) and our proposed two-step diagnosis, where we first detect the radiological findings, and then discriminate patients affected by COVID using either a decision tree-based diagnosis (Tree) or a deep learning-based classifier on the radiological findings (FC). The performance is tested on a subset of patients not included in the training / validation set. The assessed metrics are: balanced accuracy (BA), sensitivity, specificity and area under the ROC curve (AUC). For all of the methods we adopt a 70%-30% train-test split. For the deep learning-based strategy, SGD is used with fixed learning rate and weight decay. All of the experiments were run on NVIDIA Tesla T4 GPUs using PyTorch 1.4.

Table 2 compares the standard deep learning-based approach [5] to our two-step diagnosis. Baseline results are obtained by pre-training the model on some of the most used publicly-available datasets. We observe that the best achievable performance is very low, consisting in a BA of 0.67. A key takeaway is that trying to directly diagnose diseases such as COVID-19 from CXRs might currently be infeasible, probably given the small dataset sizes and strong selection bias in the datasets. We can clearly see how the two-step method outperforms the direct diagnosis: using the same network architecture (ResNet-18 as backbone and a fully-connected classifier on top of it), we obtain a significant increase in all of the assessed metrics. Even better results are achieved using a DenseNet-121 backbone with the fully-connected classifier. Fig. 3 graphically shows the learned decision tree (whose performance is shown in Table 2): this provides a very clear interpretation of the decision process. From the clinical and radiological perspective, these data are consistent with the COVID-19 CXR semiotics that radiologists are used to dealing with.
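The assessed metrics can be made precise with a short sketch: balanced accuracy is the mean of sensitivity (recall on positives) and specificity (recall on negatives). The toy predictions below are ours, and scikit-learn is assumed only to cross-check the hand computation.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1])            # toy COVID labels
y_prob = np.array([0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9])
y_pred = (y_prob >= 0.5).astype(int)                # threshold at 0.5

sens = (y_pred[y_true == 1] == 1).mean()            # sensitivity
spec = (y_pred[y_true == 0] == 0).mean()            # specificity
ba = (sens + spec) / 2                              # balanced accuracy
auc = roc_auc_score(y_true, y_prob)                 # threshold-free AUC

# The hand-rolled BA matches scikit-learn's definition.
assert np.isclose(ba, balanced_accuracy_score(y_true, y_pred))
print(f"sens={sens:.2f} spec={spec:.2f} BA={ba:.2f} AUC={auc:.2f}")
```

BA is preferred over plain accuracy here because the CORDA subsets are class-unbalanced, and plain accuracy would reward a classifier that favours the majority class.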
The edema feature, although unspecific, is strictly related to the interstitial involvement that is typical of COVID-19 infections, and it has been largely reported in the recent literature [14]. Indeed, in recent COVID-19 radiological papers, interstitial involvement has been reported as a ground-glass opacity appearance [15]. However, this definition is more pertinent to the CT imaging setting than to CXR; the "edema" feature can be compatible, from the radiological perspective, with the interstitial opacity of COVID-19 patients. Furthermore, the non-negligible role of cardiomegaly (or, more in general, enlarged cardiomediastinum) in the decision tree can be interesting from the clinical perspective. In fact, this can be read as additional proof that established cardiovascular disease can be a relevant risk factor for developing COVID-19 [16]. Moreover, it may be consistent with the hypotheses of a larger role of the primary cardiovascular damage observed in preliminary data from autopsies of COVID-19 patients [17].

Table 2: Results for COVID diagnosis.

Method        Backbone      Classifier  Pretrain dataset  Dataset     Sensitivity  Specificity  BA    AUC
Baseline [5]  ResNet-18     FC          none              CORDA-CDSS  —            —            —     —
              ResNet-18     FC          ChestXRay         CORDA-CDSS  0.54         0.58         0.56  0.67
Two-step      ResNet-18     FC          CheXpert          CORDA-CDSS  0.69         0.73         0.71  0.76
              DenseNet-121  FC          CheXpert          CORDA-CDSS  0.72         —            —     —
              DenseNet-121  Tree        CheXpert          CORDA-CDSS  —            —            —     —

Fig. 3: Decision Tree obtained for COVID-19 classification based on the probabilities for the 14 classes of findings. [Tree diagram: the root splits on Pneumonia, with further splits on Pleural Other, Pneumothorax, Edema, Lung Lesion, Support Devices, Cardiomegaly, Enlarged Cardiomediastinum and Atelectasis.]

Fig. 4: Grad-CAM on COVID-positive samples.

Focusing on the deep learning-based approach (FC), we observe a boost in the performance, achieving a BA of 0.75. However, this is the result of a trade-off between interpretability and discriminative power. Using Grad-CAM [18], we get hints on the areas the model focused on to take the final diagnostic decision. From Fig. 4 we observe that, on COVID-positive images, the model seems to mostly focus on the expected lung areas. Finally, to further test the reliability of our approach, we also applied our strategy to CORDA-SLG (data coming from a different hospital structure), reaching comparable and encouraging results.
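Grad-CAM itself can be sketched in a few lines of PyTorch: capture the activations and gradients of a convolutional layer via hooks, weight each channel by the spatially averaged gradient of the class score, and keep the positive part. The tiny CNN below merely stands in for the DenseNet-121 used in the paper; all names are illustrative.

```python
import torch
import torch.nn as nn

# Tiny stand-in CNN (the paper uses DenseNet-121; this is for illustration only).
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
target_layer = model[0]

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 1, 64, 64)
score = model(x)[0, 1]                 # logit of the COVID-positive class
score.backward()

weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of the gradients
cam = torch.relu((weights * acts["a"]).sum(dim=1))   # class activation map
cam = cam / cam.max().clamp(min=1e-8)                # normalise to [0, 1]
```

In practice the map would be upsampled to the CXR resolution and overlaid on the image, as in Fig. 4, to check that the evidence lies inside the lung fields.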
6. CONCLUSIONS
One of the latest challenges for both the clinical and the AI community has been applying deep learning to diagnosing COVID from CXRs. Recent works suggested the possibility of successfully tackling this problem, despite the currently small quantity of publicly available data. In this work we propose a multi-step approach, close to the physicians' diagnostic process, in which the final diagnosis is based upon detected lung pathologies. We performed our experiments on CORDA, a COVID-19 CXR dataset comprising approximately 1000 images. All of our experiments have been carried out bearing in mind that, especially for clinical applications, explainability plays a major role in building trust in machine learning algorithms, although better interpretability can come at the cost of lower prediction accuracy.
7. REFERENCES

[1] Yang Yang, Minghui Yang, Chenguang Shen, Fuxiang Wang, Jing Yuan, Jinxiu Li, Mingxia Zhang, Zhaoqin Wang, Li Xing, Jinli Wei, et al., "Laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections," medRxiv, 2020.
[2]–[4] RSNA Radiology, 2020.
[5] Enzo Tartaglione, Carlo Alberto Barbano, Claudio Berzovini, Marco Calandri, and Marco Grangetto, "Unveiling COVID-19 from chest X-ray with deep learning: A hurdles race with small data," International Journal of Environmental Research and Public Health, vol. 17, no. 18, pp. 6933, Sep 2020.
[6] Prabira Kumar Sethy and Santi Kumari Behera, "Detection of coronavirus disease (COVID-19) based on deep features," 2020.
[7] Ioannis D. Apostolopoulos and Tzani Bessiana, "COVID-19: Automatic detection from X-ray images utilizing transfer learning with convolutional neural networks," arXiv preprint arXiv:2003.11617, 2020.
[8] Ali Narin, Ceren Kaya, and Ziynet Pamuk, "Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks," arXiv preprint arXiv:2003.10849, 2020.
[9] Linda Wang, Zhong Qiu Lin, and Alexander Wong, "COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images," Scientific Reports, vol. 10, no. 1, pp. 1–12, 2020.
[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[11] David M. Hansell, Alexander A. Bankier, Heber MacMahon, Theresa C. McLoud, Nestor L. Muller, and Jacques Remy, "Fleischner Society: Glossary of terms for thoracic imaging," Radiology, vol. 246, no. 3, pp. 697–722, 2008.
[12] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al., "CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 590–597.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[14] Wei-jie Guan, Zheng-yi Ni, Yu Hu, Wen-hua Liang, Chun-quan Ou, Jian-xing He, Lei Liu, Hong Shan, Chun-liang Lei, David S. C. Hui, et al., "Clinical characteristics of coronavirus disease 2019 in China," New England Journal of Medicine, vol. 382, no. 18, pp. 1708–1720, 2020.
[15] Ho Yuen Frank Wong, Hiu Yin Sonia Lam, Ambrose Ho-Tung Fong, Siu Ting Leung, Thomas Wing-Yan Chin, Christine Shing Yen Lo, Macy Mei-Sze Lui, Jonan Chun Yin Lee, Keith Wan-Hang Chiu, Tom Chung, et al., "Frequency and distribution of chest radiographic findings in COVID-19 positive patients," Radiology, p. 201160, 2020.
[16] ESC Guidance for the Diagnosis and Management of CV Disease during the COVID-19 Pandemic, 2020.
[17] Dominic Wichmann, Jan-Peter Sperhake, Marc Lütgehetmann, Stefan Steurer, Carolin Edler, Axel Heinemann, Fabian Heinrich, Herbert Mushumba, Inga Kniep, Ann Sophie Schröder, et al., "Autopsy findings and venous thromboembolism in patients with COVID-19: a prospective cohort study," Annals of Internal Medicine, 2020.
[18] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017.