Iterative augmentation of visual evidence for weakly-supervised lesion localization in deep interpretability frameworks
Cristina González-Gonzalo, Bart Liefers, Bram van Ginneken, Clara I. Sánchez
Abstract—Interpretability of deep learning (DL) systems is gaining attention in medical imaging to increase experts' trust in the obtained predictions and facilitate their integration in clinical settings. We propose a deep visualization method to generate interpretability of DL classification tasks in medical imaging by means of visual evidence augmentation. The proposed method iteratively unveils abnormalities based on the prediction of a classifier trained only with image-level labels. For each image, initial visual evidence of the prediction is extracted with a given visual attribution technique. This provides localization of abnormalities that are then removed through selective inpainting. We iteratively apply this procedure until the system considers the image as normal. This yields augmented visual evidence, including less discriminative lesions which were not detected at first but should be considered for final diagnosis. We apply the method to grading of two retinal diseases in color fundus images: diabetic retinopathy (DR) and age-related macular degeneration (AMD). We evaluate the generated visual evidence and the performance of weakly-supervised localization of different types of DR and AMD abnormalities, both qualitatively and quantitatively. We show that the augmented visual evidence of the predictions highlights the biomarkers considered by the experts for diagnosis and improves the final localization performance, resulting in a relative increase of 11.2 ± in average performance.

Index Terms—interpretability, deep learning, visualization, weakly-supervised detection, lesion localization, color fundus imaging.
This work was supported by the Dutch Technology Foundation STW, which is part of the Netherlands Organisation for Scientific Research (NWO) and partly funded by the Ministry of Economic Affairs (Perspectief programme P15-26 'DLMedIA: Deep Learning for Medical Image Analysis'). C. González-Gonzalo, B. Liefers and C.I. Sánchez are with the A-Eye Research Group, Diagnostic Image Analysis Group, Department of Radiology and Nuclear Medicine, Radboudumc, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands, and with the Donders Institute for Brain, Cognition and Behaviour, Radboudumc, Montessorilaan 3, 6525 HR, Nijmegen, The Netherlands (e-mail: [email protected], [email protected], [email protected]). B. van Ginneken is with the Diagnostic Image Analysis Group, Department of Radiology and Nuclear Medicine, Radboudumc (e-mail: [email protected]). C.I. Sánchez is with the Department of Ophthalmology, Radboudumc, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands.

I. INTRODUCTION

DEEP learning (DL) systems in medical imaging have shown to provide high-performing approaches for diverse classification tasks in healthcare, such as screening of eye diseases [1], [2], scoring of prostate cancer [3], or detection of skin cancer [4]. Nevertheless, DL systems are often referred to as "black boxes" due to the lack of interpretability of their predictions. This is problematic in healthcare applications [5], [6], and hinders experts' trust and the integration of these systems in clinical settings as support for grading, diagnosis and treatment decisions. There is thus an increasing demand for interpretable systems in medical imaging that could further explain models' decisions.

Defining an interpretability framework as the combination of a DL system to perform a classification task and a procedure for generating explainable predictions, several such frameworks have been proposed in different medical applications and imaging modalities [4], [7]–[17]. Among the integrated procedures, those based on visual attribution have become very popular. These attribution methods provide an interpretation of the network's decision by assigning an attribution value, sometimes also called "relevance" or "contribution", to each input feature of the network depending on its estimated contribution to the network output [18]. This allows to highlight features in the input image that contribute to the output prediction; and, specifically in medical imaging, it allows for the identification of regions discriminant for the final decision and, consequently, the weakly-supervised localization of abnormalities. The localized anomalies can provide a clinical explanation of the classification output without the need for costly lesion-level annotations.

Classification of disease severity in color fundus (CF) images, the focus of this paper, is one medical application where attribution methods have been applied to generate explainable DL predictions and weakly-supervised detection of retinal lesions. In [11] and [12], saliency maps [19] were applied to justify decisions on diabetic retinopathy (DR) and age-related macular degeneration (AMD) classification tasks, respectively. In [13], integrated gradients [20] was used to generate heatmaps for the explanation of predicted DR severity levels. Class activation maps (CAM) [21] were extracted in [14] and [15], also for interpretability of DR diagnosis.

Although these interpretability frameworks have succeeded at localizing abnormal areas related to the predicted diagnosis, visual attribution based directly on neural network classifiers has been shown to localize only the most significant regions, ignoring lesions that have less influence on the classification result but could still be important for disease understanding and grading [12], [22].
For some medical imaging modalities and applications, interpretability of abnormal predictions requires the localization of different types of lesions of varying appearance and histologic composition that can be simultaneously present and be responsible for the predicted diagnosis. To overcome this, in [11] and [12] different classifiers are used in parallel, which yields localization of different types of abnormalities in separate maps. This allows for differentiation of abnormalities, but each input image must be processed several times and the interpretability of the actual disease grading remains unclear. Alternatively, to improve lesion localization, some frameworks add customized postprocessing steps [11] or fine-tuning [15] to the attribution methods, or propose tailored architectures with additional interpretation modules [16], [17]. Nevertheless, this conflicts with directly obtaining interpretability of the DL system and hinders the adaptability and generalization among DL classifiers and medical applications.

In this paper, we propose a novel deep visualization method, as an extension to [23], that iteratively unveils abnormalities responsible for anomalous predictions in order to generate a map of augmented visual evidence. At each iteration, the method guides the attention to less discriminative areas that might also be relevant for the final diagnosis, locating abnormalities of different types, shapes and sizes. Defined as a general approach, it is meant to be seamlessly integrated in diverse interpretability frameworks with different DL classifiers and visual attribution techniques, and without the need of additional customized steps.

We apply the proposed method for the interpretation of automated grading in CF images of two retinal diseases: DR and AMD [24], [25]. For each diagnosis task, we classify images by disease severity and analyze the interpretability performance when the proposed iterative augmentation is applied. We validate the initial and augmented visual evidence maps qualitatively and, in contrast to most previous approaches, we evaluate the performance for weakly-supervised localization of DR and AMD abnormalities quantitatively. We show that the method can be integrated with different visual attribution techniques and different DL classifiers.

II. METHODS
The first part of this section describes the proposed iterative visual evidence augmentation, depicted in Fig. 1. The proposed method iteratively unveils areas relevant for a final diagnosis, so as to generate exhaustive visual evidence of classification predictions and, consequently, weakly-supervised lesion-level localization. The second part of the section describes the image-level classification used to provide the DL-based decisions to be interpreted.
A. Iterative visual evidence augmentation
Let I ∈ R^(m×n×3) be an image with size m × n pixels (and 3 color channels) and a corresponding label y, F_cnn : I → ŷ ∈ R a convolutional neural network (CNN) optimized for a classification task using a development set I = {(I_1, y_1), ..., (I_s, y_s)}, and A : (R^(m×n×3), F_cnn) → M ∈ R^(m×n) an attribution method, such as the ones defined in Table I. For a given I, a prediction ŷ is obtained with F_cnn. If the image is considered abnormal (or referable in the case of retinal images), an explanation map M is generated by applying A, highlighting areas of I that are discriminant for ŷ. The explanation map M is binarized to identify the areas where selective inpainting is then applied, in order to remove abnormalities that have already been localized. This procedure is applied iteratively to increase attention to less discriminative areas and generate an augmented explanation map M, by increasing the normality of the input image in each iteration. Algorithm 1 includes the pseudocode to calculate the augmented visual evidence, and Fig. 1 shows an overview of the proposed method.

In this work, normality is defined based on the predicted value ŷ = F_cnn(I), such that an image is considered normal (or non-referable in the case of retinal images) if ŷ < th_pred. The prediction threshold th_pred is defined in a validation subset of I by means of Receiver Operating Characteristic (ROC) analysis. The maximum number of iterations T was set to 20. Regarding binarization of the explanation maps, we use the Otsu method [26] to compute th_bin and yield an adaptive thresholding. For selective inpainting, we use the Navier-Stokes method [27] with a radius r_inp of size 3, based on fluid dynamics to match gradient vectors around the boundaries of the region to be inpainted. The final augmented explanation map M is obtained by an exponentially decaying weighted sum of the iteratively generated maps M_t, with a fixed decay parameter α.
Algorithm 1: Iterative visual evidence augmentation

Input:
    Input image I
    Trained CNN classifier F_cnn
    Prediction threshold th_pred
    Maximum number of iterations T
    Selective inpainting radius r_inp
Output:
    Augmented explanation map M

Initialize t = 1 and I_t = I
Calculate initial prediction ŷ = F_cnn(I_t)
while (ŷ ≥ th_pred) and (t < T) do
    M_t = A(I_t, F_cnn)
    Binarize M_t by thresholding:
        B_t(x, y) = 1 if M_t(x, y) ≥ th_bin, 0 otherwise
    Inpaint I_t given mask B_t:
        I_t = SelectiveInpainting(I_t, B_t, r_inp)
    Calculate new prediction ŷ = F_cnn(I_t)
    t = t + 1
end while
Compute augmented explanation map M:
    M = Σ_t e^(−α·t) · M_t
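The following Python sketch mirrors Algorithm 1 under stated assumptions: predict stands in for F_cnn (returning a scalar severity score for an 8-bit color fundus image) and attribution for A; both names are ours. Otsu binarization and Navier-Stokes inpainting are taken from OpenCV, consistent with the methods cited above, and the decay parameter alpha is an assumed value, since the exact one is not legible in this copy.

import cv2
import numpy as np

def augment_visual_evidence(image, predict, attribution,
                            th_pred, T=20, r_inp=3, alpha=0.5):
    """Sketch of Algorithm 1 (iterative visual evidence augmentation).

    `predict` and `attribution` are placeholders for F_cnn and A; any
    preprocessing is assumed to happen inside `predict`. `alpha` is an
    assumed default.
    """
    maps = []
    img_t = image.copy()
    y_hat = predict(img_t)
    t = 1
    while y_hat >= th_pred and t < T:
        # Explanation map for the current (partially inpainted) image.
        m_t = attribution(img_t, predict)
        maps.append(m_t)
        # Adaptive binarization with Otsu's method.
        m_uint8 = cv2.normalize(m_t, None, 0, 255,
                                cv2.NORM_MINMAX).astype(np.uint8)
        _, b_t = cv2.threshold(m_uint8, 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Selective inpainting (Navier-Stokes) of the highlighted areas.
        img_t = cv2.inpaint(img_t, b_t, r_inp, cv2.INPAINT_NS)
        y_hat = predict(img_t)
        t += 1
    # Exponentially decaying weighted sum of the per-iteration maps.
    M = np.zeros(image.shape[:2], dtype=np.float32)
    for i, m_t in enumerate(maps, start=1):
        M += np.exp(-alpha * i) * m_t
    return M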
B. Image-level classification

The proposed iterative visual evidence augmentation must be built upon a DL classifier that reaches acceptable performance, so as to achieve reliable interpretability. F_cnn was therefore optimized for each classification task: classification of CF images for detection of DR (F_cnn^DR) and AMD (F_cnn^AMD).

Prior to classification, every CF image goes through a preprocessing stage, where the bounding box of the field of view is extracted, then rescaled to a fixed square size, and lastly, contrast enhancement based on [30] is applied to reduce local differences in lighting and among images. The contrast-enhanced image is used as input for the classifier.

The CNNs were based on the VGG-16 architecture [31], pre-trained on ImageNet. They were adapted to this input size by applying a stride of 2 in the first layer of the first convolutional block, and using a valid instead of padded convolution for the first layer of the last convolutional block. Dropout layers (p=0.5) were added in between the fully-connected layers. We followed a regression approach in which the output of a network consists of a single node, representing a continuous value which is monotonically related to the predicted disease severity. The loss was defined as the mean squared error between the prediction and the reference-standard label. For each classification task, the optimal classifier F_cnn was selected according to the performance on a validation set by means of receiver operating characteristic (ROC) analysis, computing the area under the ROC curve (AUC), in order to assure good discrimination between referable and non-referable cases. Additionally, the ability to discriminate between disease stages was measured by means of the quadratic Cohen's weighted kappa coefficient (κ) [32]. Sensitivity (SE) and specificity (SP) were computed at the optimal operating point of the system, which was considered to be the best tradeoff between the two values, i.e., the point closest to the upper left corner of the graph. This allowed for extraction of the optimal threshold th_pred for referability in the corresponding validation set.

Fig. 1. Overview of the proposed method applied to automated grading of diabetic retinopathy in color fundus images. The workflow to generate the original prediction is depicted in green; the workflow for the proposed iterative visual evidence augmentation is depicted in blue.

TABLE I: IMPLEMENTED VISUAL ATTRIBUTION METHODS

Saliency [19]
    Definition: M_SAL = ∂F_cnn(I)/∂I
    Description: It indicates which local morphology changes in the image would lead to modifications in the network's prediction.

Guided backpropagation [28]
    Definition: M_GBP = ∂F_cnn(I)/∂I s.t. R^l_i = 1[R^(l+1)_i > 0] · 1[f^l_i > 0] · R^(l+1)_i, where R^(l+1)_i = ∂F_cnn(I)/∂f^(l+1)_i, f^(l+1)_i = ReLU(f^l_i), and f^l_i is the i-th feature map at convolutional layer l.
    Description: It provides additional guidance to the signal backpropagated through ReLU activations from the higher layers, preventing the backward stream of gradients associated with neurons that decrease the activation of the output node.

Integrated gradients [20]
    Definition: M_IG = (I − Ī) ∫₀¹ ∂F_cnn(Ī + α(I − Ī))/∂I dα
    Description: The generated maps measure the contribution of each pixel in the input image to the prediction. Instead of computing only the gradient with respect to the current input value, this method computes the average gradient while the input varies linearly in several steps from a baseline image (commonly, all zeros) to its current value.

Grad-CAM [29]
    Definition: M_G-CAM = ReLU(Σ_i α^l_i f^l_i), where α^l_i = GAP(∂F_cnn(I)/∂f^l_i), f^l_i is the i-th feature map at convolutional layer l and GAP is the global average pooling operation over the two spatial dimensions.
    Description: The gradients backpropagated from the output to a selected convolutional layer are used for computing a linear combination of the forward activation maps of that layer. Only the pixels with positive influence on the output are maintained, and the map is then rescaled to the input size.

Guided Grad-CAM [29]
    Definition: M_GG-CAM = M_G-CAM ⊙ M_GBP
    Description: It combines guided backpropagation and Grad-CAM, in order to improve the localization ability of the latter method.
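As an illustration of the first and third attribution methods in Table I, here is a hedged PyTorch sketch computing a vanilla saliency map and an integrated-gradients map for a single-output regression CNN. The framework, function names and channel-aggregation choices are ours, not the authors'.

import torch

def saliency_map(model, image):
    """Vanilla saliency: gradient of the scalar output w.r.t. the input.

    `model` maps a (1, 3, H, W) tensor to a scalar prediction and `image`
    is a (3, H, W) float tensor; both are illustrative assumptions.
    """
    x = image.unsqueeze(0).clone().requires_grad_(True)
    model(x).squeeze().backward()
    # Take the maximum absolute gradient across the color channels.
    return x.grad.abs().squeeze(0).max(dim=0).values

def integrated_gradients(model, image, baseline=None, steps=50):
    """Integrated gradients with an all-zeros baseline by default."""
    if baseline is None:
        baseline = torch.zeros_like(image)
    total_grad = torch.zeros_like(image)
    for k in range(1, steps + 1):
        # Interpolate between the baseline and the input image.
        alpha = k / steps
        x = (baseline + alpha * (image - baseline)).unsqueeze(0)
        x.requires_grad_(True)
        model(x).squeeze().backward()
        total_grad += x.grad.squeeze(0)
    avg_grad = total_grad / steps
    # Attribution: (I - baseline) times the average gradient, aggregated
    # over channels for a 2-D map.
    return ((image - baseline) * avg_grad).abs().max(dim=0).values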
III. DATA

A. Image-level classification
The Kaggle DR dataset [33] was used for training, validation and testing of F_cnn^DR. Images were acquired by different CF digital cameras with varying resolution. Each image was graded by DR severity by a human reader according to the International Clinical Diabetic Retinopathy (ICDR) severity scale [34], with stages 0 (no DR), 1 (mild non-proliferative DR), 2 (moderate non-proliferative DR), 3 (severe non-proliferative DR), and 4 (proliferative DR). Categories 0 and 1 are considered non-referable DR and categories 2 to 4 referable DR. This database is divided into two sets: the Kaggle training set (35,126 images from 17,563 patients; one photograph per eye) and the Kaggle test set (53,576 images from 26,788 patients; one photograph per eye).
The classifier for AMD, F_cnn^AMD, was trained, validated and tested on the Age-Related Eye Disease Study (AREDS) dataset [35]. AREDS was designed as a long-term prospective study of AMD development and cataract in which patients were examined on a regular basis and followed for up to 12 years. The AREDS dbGaP set includes digitized CF images. In 2014, over 134,000 macula-centered CF images from 4,613 participants were added to the set (for each patient visit available, one photograph per eye with their corresponding stereo pairs). We excluded images according to the criteria in the AREDS dbGaP guidelines [35], and 133,820 images were used in this study. We adapted the grading in AREDS dbGaP, which is based on the AREDS severity scale for AMD [36], for reference grading: stage 0 (no AMD), 1 (early AMD), 2 (intermediate AMD), and 3 (advanced AMD, with presence of foveal geographic atrophy (GA) or choroidal neovascularization (CNV)). Categories 0 and 1 are considered non-referable AMD; categories 2 and 3, referable AMD.
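The referable/non-referable groupings used for both tasks reduce to a simple thresholding of the severity grade; a minimal sketch (function names are ours) is:

def to_referable_dr(icdr_grade):
    # ICDR stages 0-1 -> non-referable (0); stages 2-4 -> referable (1).
    return int(icdr_grade >= 2)

def to_referable_amd(amd_stage):
    # Adapted AREDS stages 0-1 -> non-referable (0); 2-3 -> referable (1).
    return int(amd_stage >= 2)

# Example: ICDR grades [0, 1, 2, 3, 4] map to referability [0, 0, 1, 1, 1].
assert [to_referable_dr(g) for g in range(5)] == [0, 0, 1, 1, 1]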
B. Interpretability and weakly-supervised lesion-level detection with iterative visual evidence augmentation
DiaretDB1 [37] was used for the assessment of the interpretability and weakly-supervised detection of DR abnormalities. This dataset consists of 89 CF images with areas manually delineated by four medical experts. Four different types of DR lesions were annotated: hemorrhages, microaneurysms, hard exudates and soft exudates. As proposed in [37], we defined the reference standard as binary masks containing areas labelled with an average confidence level of 75% between experts.

For the assessment of the localization of AMD lesions, we used CF images from the European Genetic Database (EUGENDA), a large multi-center database for clinical and molecular analysis of AMD [38]. AMD severity is defined for each image according to the Cologne Image Reading Center and Laboratory (CIRCL) protocol [38]. We generated a dataset divided into two groups. The first group consists of 52 images with non-advanced AMD stages [39]. Two trained graders manually outlined all visible drusen (without sub-dividing types) in each image, and the binary masks generated during consensus were used as reference standard. In order to assess lesion detection in advanced AMD cases, we created a second group with 12 images with advanced AMD (6 images with advanced dry AMD and 6 images with advanced wet AMD). One professional grader manually delineated in each image all visible AMD-related lesions. To define the reference standard, we generated two binary masks for each image in this group: drusen (including hard, soft distinct, soft indistinct and optic disk drusen) and advanced-AMD lesions (including CNV, GA and subretinal hemorrhages). In total, 64 images with manually annotated abnormalities constituted our EUGENDA dataset.
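For the DiaretDB1 reference standard, one possible way to derive the 75%-agreement masks from the four expert annotations is sketched below; the array layout and function name are our assumptions, not the dataset's official tooling.

import numpy as np

def consensus_mask(expert_masks, min_confidence=0.75):
    """Average binary expert masks and keep pixels with >= 75% agreement.

    `expert_masks` is assumed to be a list of same-shaped binary arrays,
    one per expert (four in DiaretDB1).
    """
    stack = np.stack([m.astype(np.float32) for m in expert_masks], axis=0)
    mean_confidence = stack.mean(axis=0)
    return mean_confidence >= min_confidence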
IV. EXPERIMENTAL SETUP
A. Image-level classification
The DR classifier F_cnn^DR was trained on 80% of the Kaggle training set (28,098 images) and validated on the remaining 20% (7,028 images) for 400 epochs. Regarding the training configuration, we used the Adam optimizer [40] with a learning rate of 0.0001; data augmentation and class balancing were applied during the training phase to reduce overfitting.

In order to assess the integration of the proposed iterative visual evidence augmentation with different classification network architectures, we performed an additional validation with the Inception-v3 architecture [41] for the classification task of DR grading. For this alternative DR classifier, F_cnn,iv^DR, a dropout layer (p=0.5) was placed between the final global average pooling layer and the regression node, and it was trained for 100 epochs with the training configuration used previously.

For AMD classification, we applied five-fold cross-validation: the 4,613 patients in the AREDS dataset were randomly divided into five groups, and all the images of each patient were included in the corresponding group. Each fold had an average of 26,764 images. Three folds were used for training, one for validation and one for testing, with rotation of the folds. In total, five different classifiers were trained for 80 epochs each, using the previously mentioned training configuration. We selected as F_cnn^AMD the model which yielded the best performance on its corresponding test fold.
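A hedged PyTorch sketch of the classifier configuration described in Sections II-B and IV-A is given below: an ImageNet-pretrained VGG-16 with a stride-2 first convolution, a single-node regression head with dropout, MSE loss, and Adam with a learning rate of 0.0001. The framework, the exact layer surgery and the 512×512 input size are our assumptions, not the authors' code (the valid-padding change in the last block is omitted for brevity).

import torch
import torch.nn as nn
from torchvision import models

def build_dr_regressor(dropout=0.5):
    """VGG-16 backbone adapted for severity regression (illustrative)."""
    net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    # Stride 2 in the first convolution of the first block, keeping the
    # pretrained kernel weights (weight shapes are unaffected by stride).
    old = net.features[0]
    first = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
    first.weight.data.copy_(old.weight.data)
    first.bias.data.copy_(old.bias.data)
    net.features[0] = first
    # Single-node regression head with dropout between the FC layers.
    net.classifier = nn.Sequential(
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(dropout),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(dropout),
        nn.Linear(4096, 1),
    )
    return net

model = build_dr_regressor()
criterion = nn.MSELoss()                      # regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on a dummy batch.
images = torch.randn(2, 3, 512, 512)          # input size is an assumption
targets = torch.tensor([[0.0], [3.0]])        # severity as a continuous value
loss = criterion(model(images), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()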
B. Interpretability and weakly-supervised lesion-level detection with iterative visual evidence augmentation
The images in the DiaretDB1 dataset and in the EUGENDA dataset were classified for DR and AMD severity, respectively, with the corresponding image-level classifier. Images whose disease severity prediction was over th_pred were considered referable cases and consequently eligible for interpretability and evaluation of weakly-supervised lesion detection. Similarly to [13], visual evidence of non-referable predictions does not provide meaningful information, since the proposed augmentation aims to iteratively unveil abnormalities while the prediction decreases until non-referability is reached.

The binary masks with annotated lesions were used to assess whether the obtained visual evidence highlighted actual abnormalities, and to compare between initial and augmented visual evidence. Free-response ROC (FROC) curves were used as the evaluation metric of weakly-supervised lesion localization in each dataset and were obtained as follows: the points in the interpretability maps with the highest confidence values were iteratively located and a circular detection area with radius r was defined around them. If this area overlapped with any annotated lesion in the reference standard, that lesion was considered a true positive detection; otherwise, a false positive detection. The values of the map within the detection area were then masked out, and each lesion in the reference standard detected as true positive was considered only once. For the localization of DR lesions, we defined r = 7 px (1.4% of the image dimensions); for AMD, r = 10 px (1.9% of the image dimensions). From the curves, we extracted values of average sensitivity per average of 10 false positives per image (SE/10 FPs).

In order to analyze the adaptability of the proposed iterative augmentation to different interpretability methods, we implemented different visual attribution techniques, included in Table I: saliency [19], guided backpropagation [28], integrated gradients [20], Grad-CAM [29], and Guided Grad-CAM [29]. Regarding Grad-CAM, due to the extremely coarse maps generated by this method when the gradient information from the last convolutional layer is used [29], we used the information from a shallower convolutional layer (when using VGG-16: the output of the third block's last convolutional layer (Block3_conv3); when using Inception-v3: the output of the second Inception reduction module (Mixed8)).
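The FROC-style matching described above can be sketched as follows; this is a simplified Python version under our own assumptions (connected components of the reference mask are treated as individual lesions, and the map is greedily probed at its current maximum):

import numpy as np
from scipy import ndimage

def count_detections(evidence_map, reference_mask, radius, max_fp=10):
    """Greedy matching of high-confidence points against reference lesions.

    `evidence_map` is a 2-D float map and `reference_mask` a binary mask
    whose connected components stand in for individual annotated lesions.
    Returns (number of lesions hit, number of false positives).
    """
    lesions, n_lesions = ndimage.label(reference_mask)
    detected = set()
    fp = 0
    work = evidence_map.copy()
    yy, xx = np.mgrid[:work.shape[0], :work.shape[1]]
    while fp < max_fp and len(detected) < n_lesions and work.max() > 0:
        # Take the next most confident point in the map.
        r, c = np.unravel_index(np.argmax(work), work.shape)
        area = (yy - r) ** 2 + (xx - c) ** 2 <= radius ** 2
        hit = np.unique(lesions[area & (lesions > 0)])
        if hit.size > 0:
            detected.update(int(h) for h in hit)
        else:
            fp += 1
        # Mask out the detection area so it is not selected again.
        work[area] = 0
    return len(detected), fp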
V. RESULTS

A. Image-level classification
The DR classifier F_cnn^DR obtained an AUC of 0.93, with a SE of 0.86 and SP of 0.88, on the Kaggle test set. The model achieved a κ of 0.77 for discrimination between DR stages. For the alternative classifier based on the Inception-v3 architecture, F_cnn,iv^DR, the AUC on the Kaggle test set was 0.93, SE and SP were 0.86 and 0.90, respectively, and κ was 0.80. (The ROC analyses of the DR classifiers can be found in Fig. S1, available in the supplementary files/multimedia tab.)

Regarding AMD classification, the overall performance on the AREDS dataset corresponded to an AUC of 0.97, with SE of 0.91 and SP of 0.92 at the optimal operating point; κ was 0.87. The model with the best performance on the corresponding test fold and selected as F_cnn^AMD obtained an AUC of 0.97, with SE of 0.92 and SP of 0.93, and a κ of 0.88. (The ROC analysis on the whole AREDS set, the ROC analysis of the optimal model, and the performance for each individual model can be found in Fig. S2, Fig. S3 and Table SI, available in the supplementary files/multimedia tab.)

B. Interpretability and weakly-supervised lesion-level detection with iterative visual evidence augmentation

F_cnn^DR considered 75 images of DiaretDB1 to have referable DR. Initial and augmented visual evidence were extracted for these cases. Fig. 2 shows one example from the DiaretDB1 set with the initial and augmented maps for all the implemented visual attribution methods. (An additional example from DiaretDB1 for qualitative assessment can be found in Fig. S4, available in the supplementary files/multimedia tab.) Table II includes the quantitative assessment of weakly-supervised localization of four types of DR lesions (hemorrhages, microaneurysms, hard and soft exudates) for the different methods. It contains the SE/10 FPs values for each type of DR lesion, comparing between initial and augmented visual evidence. Fig. 3 illustrates the FROC curves for the initial and augmented visual evidence per type of lesion generated with guided backpropagation, which is the method that reached the highest average performance, as observed in Table V.

When F_cnn,iv^DR was used as DR classifier, 67 images in the DiaretDB1 dataset were graded as referable DR. The quantitative results of weakly-supervised detection per DR lesion for the different visual evidence methods can be found in Table III, with and without iterative augmentation.

F_cnn^AMD graded 40 images in the EUGENDA set as referable AMD. Visual interpretability was extracted for these cases. Fig. 4 includes one example for qualitative evaluation of weakly-supervised AMD lesion localization in this set for all the implemented visual attribution methods, showing the initial and final visual evidence after iterative augmentation. (An additional example from the EUGENDA set for qualitative assessment can be found in Fig. S5, available in the supplementary files/multimedia tab.) The quantitative assessment of localization of drusen and advanced-AMD lesions can be found in Table IV. In order to analyze the influence of the advanced AMD cases on lesion localization performance, separate quantitative evaluation was carried out on the 52 images with non-advanced AMD stages in the EUGENDA set, and these results are also included in Table IV.

The global adaptability of the proposed method across classification tasks, network architectures and visual attribution methods can be observed in Table V. There is a global relative increase of 11.2 ± in the average lesion localization performance.

VI. DISCUSSION
Qualitative assessment of the visual evidence generated by the different implemented interpretability methods shows that each DL classifier is able to learn visual features relevant to the classification task at hand during the training process. For those images classified as referable, most visual features correspond to actual abnormalities. Augmented visual evidence maps show that the proposed iterative approach allows, on one hand, to emphasize and achieve better delineations of detected abnormalities, and, on the other hand, to unveil abnormalities that were not highlighted at first but are still related to referable stages and relevant for the final diagnosis, independently of anomaly appearance. This can be especially observed in severe cases, where the augmented maps differ more from the initial ones due to the larger number of iterations needed to reach non-referability.

As observed in Table V, the method can be adapted to different classification tasks, network architectures and visual attribution methods. Nevertheless, it can be observed that iterative augmentation works better when the visual attribution is not coarse, but well localized. Appropriate spatial resolution in the initial visual evidence allows to unveil abnormalities of different types, shapes and sizes, such as the ones related to retinal diseases. This can be observed when guided backpropagation is used for visual attribution. Iterative augmentation improves localization performance for AMD lesions (Table IV), as well as for all DR lesions (Table II, Table III, Fig. 3), where it reaches the highest average performance (Table V). This corresponds with sharp and localized visual evidence, as observed in Fig. 2 and Fig. 4. Fig. 5 includes additional examples for qualitative assessment of weakly-supervised lesion detection when this method is applied, highlighting the importance of good spatial resolution for yielding detailed visual evidence.

On the other hand, as observed in Fig. 2 and Fig. 4, the maps generated using Grad-CAM are hardly detailed, even when a shallower convolutional layer is used for implementation. This was also reported in [15], where CAM were applied with specific fine-tuning to improve DR lesion localization.
TABLE II: AVERAGE SE/10 FPs VALUES FOR WEAKLY-SUPERVISED LESION LOCALIZATION OF DR LESIONS USING THE VGG-16 ARCHITECTURE.

DR lesion       Visual evidence   Saliency   Guided backprop.   Integrated gradients   Grad-CAM (Block3_conv3)   Guided Grad-CAM (Block3_conv3)
Hemorrhages     Initial           0.41       0.65               0.41                   0.43                      0.66
                Augmented         0.50
Soft exudates   Initial           0.45       0.67               0.83                   0.58                      0.70
                Augmented         0.58       0.93
[remaining rows and values illegible in this copy]
Fig. 2. Example of visual evidence generated with different methods for one image of DiaretDB1, predicted as DR stage 3 with the DR classifier based on VGG-16. For each method: initial visual evidence (left) and augmented visual evidence (right).

TABLE III: AVERAGE SE/10 FPs VALUES FOR WEAKLY-SUPERVISED LESION LOCALIZATION OF DR LESIONS USING THE INCEPTION-V3 ARCHITECTURE.

DR lesion     Visual evidence   Saliency   Guided backprop.   Integrated gradients   Grad-CAM (Mixed8)   Guided Grad-CAM (Mixed8)
Hemorrhages   Initial           0.24       0.67               0.40                   0.07                0.60
              Augmented         0.32
[remaining rows and values illegible in this copy]
TABLE IV: AVERAGE SE/10 FPs VALUES FOR WEAKLY-SUPERVISED LESION LOCALIZATION OF AMD LESIONS USING THE VGG-16 ARCHITECTURE.

AMD lesion   Visual evidence   Saliency   Guided backprop.   Integrated gradients   Grad-CAM (Block3_conv3)   Guided Grad-CAM (Block3_conv3)
All EUGENDA (40/64 classified as referable AMD)
Drusen       Initial           0.22       0.38               0.37                   0.03                      0.25
             Augmented         0.26       0.44
[remaining rows and values illegible in this copy]
TABLE V: AVERAGE LESION LOCALIZATION PERFORMANCE (AVERAGE SE/10 FPs) FOR EACH VALIDATED CLASSIFICATION TASK AND NETWORK ARCHITECTURE.
Rows: DR, VGG-16; DR, Inception-v3; AMD, VGG-16 (initial and augmented visual evidence). Columns: Saliency; Guided backpropagation; Integrated gradients; Grad-CAM; Guided Grad-CAM. [mean ± std values illegible in this copy]
Shade indicates higher performance after iterative augmentation; bold indicates highest performance per classification task and architecture.
Fig. 3. Lesion localization performance of initial and augmented visual evidence per type of lesion in referable DR predictions from the DiaretDB1 dataset, when using the DR classifier based on VGG-16 and guided backpropagation for visual attribution.

Low spatial resolution prevents these methods from being a suitable option for interpretability of classification tasks that require precise lesion localization and, in these cases, augmentation does not help, as also shown quantitatively in Tables II, III and IV. Guided Grad-CAM, due to the combination with guided backpropagation, provides more localized visual evidence and good detection performance, especially for most DR lesions, although not better than using guided backpropagation alone, as seen in Table V.

As for saliency maps, which are more localized than Grad-CAM, augmentation shows visual and quantitative improvement for the detection of most lesions, although the final sensitivity values are not high. These maps were used in [11], but adjustment of the training loss and customized, complex postprocessing steps were required to reduce the inherent noise.

Integrated gradients yields better general performance than saliency and Grad-CAM, but the maps are noisier than those obtained with guided backpropagation. Iterative augmentation enhances the localization of AMD lesions, reaching the highest average performance, as seen in Table V, and of certain DR lesions. However, the coarseness and noise of the maps hinder the augmentation's performance for extremely small lesions, such as microaneurysms. Integrated gradients was used in [13], showing support for DR graders, improving confidence and time on task, although no quantitative results of lesion localization were included.

Regarding the adaptability of the proposed method to different architectures, the results in Table III show that weakly-supervised localization of lesions can be generated with different and deeper networks, such as Inception-v3, and improved by means of iterative augmentation.

To the extent of our knowledge, we provide the first quantitative evaluation of weakly-supervised localization of AMD lesions in CF images. As observed in Table IV, advanced-AMD lesions, which should never be missed in grading settings, are quickly and intensely detected with most interpretability techniques. Augmentation improves drusen detection, although the general performance is lower than for DR lesions. This might be related to different aspects. On one hand, AMD grading and annotation of related lesions pose several difficulties to human experts [42], which transfers to the training of DL systems. On the other hand, there is a wide variety of drusen types [43] that are grouped in the presented validation. Table IV illustrates an improvement in drusen detection when advanced cases are excluded, i.e., drusen present in advanced AMD stages are harder to unveil, as well as harder for experts to grade [42]. Interpretability of AMD detection will benefit from a validation with further differentiation of drusen types. This would help identify classification burdens and consequent aspects for training optimization.
Fig. 4. Example of visual evidence generated with different methods for one image of EUGENDA, predicted as AMD stage 2 (ground-truth label: AMD stage 2). For each method: initial visual evidence (left) and augmented visual evidence (right).

We used an unsupervised inpainting technique [27] which yielded satisfactory visual results and fast processing times during iterative augmentation. Future work might include more advanced inpainting techniques, at pixel-level or patch-level, or techniques trainable with healthy images, such as generative models [44] or context encoders [45].

There are other methods for visual evidence that we have not implemented but that might be interesting to consider for future comparison and integration with iterative augmentation, for instance, layer-wise relevance propagation and its variants [46], [47]. They can be directly applied to a trained classifier to extract interpretability of the predictions and might benefit from iterative augmentation.

Although the proposed method allows to generate an augmented map of visual evidence agnostic to anomaly type and appearance for each prediction, differentiation among detected abnormalities can be useful for a complete explainable diagnosis. In [12], saliency maps were extracted from three different AMD-related classifiers (presence of late AMD, drusen, and pigmentary abnormalities), yielding one interpretability map per classification task. An ensemble of classifiers for DR grading was used in [11], where one model provided the final DR grade and other models were optimized to provide a map for a given DR lesion type. These solutions allow for separate and optimized interpretability of predictions related to disease grading with respect to a certain lesion type. However, each input image must be processed several times and with multiple maps there is no global and direct interpretation of the actual disease classification. In the future, interpretability of a given classification task will benefit from using the knowledge contained in the corresponding trained network also for differentiation of the lesions included in the visual evidence maps.

The integration of other techniques might improve the usability of the proposed method and help increase trust in the output of the DL classifiers where applied, for example, quantifying and providing information about the uncertainty of the system's decisions [48], or exploiting the features learned by the system not only for visual evidence of decisions but also for semantic interpretation [49]. This would allow for a better understanding of the features learned by the classifier in the training process and their impact on the final predictions, leading to the identification of different types of lesions and how they relate to disease severity, as well as new biomarkers significant for disease diagnosis.

VII. CONCLUSION
We proposed a deep visualization method for exhaustive visual interpretability of DL classification tasks in medical imaging. The method allows to iteratively increase attention to less discriminative areas that should be considered for final diagnosis, while being adaptable to different classification tasks, network architectures and visual attribution techniques.
Fig. 5. Examples of visual evidence generated with guided backpropagation for two images in the DiaretDB1 dataset, predicted as DR stage 2 (first row) and 3 (second row), and for two images in the EUGENDA dataset, predicted as AMD stage 2 (third row; ground-truth label: AMD stage 2) and AMD stage 3 (fourth row; ground-truth label: AMD stage 3).

We showed that visual evidence of the predictions can achieve weakly-supervised lesion-level detection and include the biomarkers considered by the experts for diagnosis. Augmented visual evidence improves the final detection performance, being agnostic to anomaly type and appearance and performing better with sharp and localized initial visual attribution. This makes the proposed method a useful tool for supporting the decisions of medical DL-based classification systems, in order to increase the experts' trust and facilitate their final integration in clinical settings.
REFERENCES

[1] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros et al., "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
[2] P. M. Burlina, N. Joshi, M. Pekala, K. D. Pacheco, D. E. Freund, and N. M. Bressler, "Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks," JAMA Ophthalmol, vol. 135, no. 11, pp. 1170–1176, 2017.
[3] K. Nagpal, D. Foote, Y. Liu, P.-H. C. Chen, E. Wulczyn, F. Tan, N. Olson, J. L. Smith, A. Mohtashamian, J. H. Wren et al., "Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer," npj Digital Medicine, vol. 2, no. 1, p. 48, 2019.
[4] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, no. 7639, p. 115, 2017.
[5] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Med Image Anal, vol. 42, pp. 60–88, 2017.
[6] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, M. M. Hoffman et al., "Opportunities and obstacles for deep learning in biology and medicine," J R Soc Interface, vol. 15, no. 141, p. 20170387, 2018.
[7] H. Lee, S. Tajmir, J. Lee, M. Zissen, B. A. Yeshiwas, T. K. Alkasab, G. Choy, and S. Do, "Fully automated deep learning system for bone age assessment," J Digit Imaging, vol. 30, no. 4, pp. 427–441, 2017.
[8] D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan et al., "Identifying medical diagnoses and treatable diseases by image-based deep learning," Cell, vol. 172, no. 5, pp. 1122–1131, 2018.
[9] S. T. Kim, H. Lee, H. G. Kim, and Y. M. Ro, "ICADx: interpretable computer aided diagnosis of breast masses," in Medical Imaging, ser. Proceedings of the SPIE, vol. 10575. International Society for Optics and Photonics, 2018, p. 1057522.
[10] W. Gale, L. Oakden-Rayner, G. Carneiro, A. P. Bradley, and L. J. Palmer, "Producing radiologist-quality reports for interpretable artificial intelligence," arXiv preprint arXiv:1806.00340, 2018.
[11] G. Quellec, K. Charrière, Y. Boudi, B. Cochener, and M. Lamard, "Deep image mining for diabetic retinopathy screening," Med Image Anal, vol. 39, pp. 178–193, 2017.
[12] Y. Peng, S. Dharssi, Q. Chen, T. D. Keenan, E. Agrón, W. T. Wong, E. Y. Chew, and Z. Lu, "DeepSeeNet: A deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs," Ophthalmology, vol. 126, no. 4, pp. 565–575, 2019.
[13] R. Sayres, A. Taly, E. Rahimy, K. Blumer, D. Coz, N. Hammel, J. Krause, A. Narayanaswamy, Z. Rastegar, D. Wu et al., "Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy," Ophthalmology, 2018.
[14] R. Gargeya and T. Leng, "Automated identification of diabetic retinopathy using deep learning," Ophthalmology, vol. 124, no. 7, pp. 962–969, 2017.
[15] W. M. Gondal, J. M. Köhler, R. Grzeszick, G. A. Fink, and M. Hirsch, "Weakly-supervised localization of diabetic retinopathy lesions in retinal fundus images," in Int Conf Image Process. IEEE, 2017, pp. 2069–2073.
[16] Z. Wang, Y. Yin, J. Shi, W. Fang, H. Li, and X. Wang, "Zoom-in-net: Deep mining lesions for diabetic retinopathy detection," in Med Image Comput Comput Assist Interv, ser. Lect Notes Comput Sci. Springer, 2017, pp. 267–275.
[17] S. Keel, J. Wu, P. Y. Lee, J. Scheetz, and M. He, "Visualizing deep learning models for the detection of referable diabetic retinopathy and glaucoma," JAMA Ophthalmol, vol. 137, no. 3, pp. 288–292, 2019.
[18] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross, "Towards better understanding of gradient-based attribution methods for deep neural networks," arXiv preprint arXiv:1711.06104, 2017.
[19] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.
[20] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," in Int Conf Mach Learn. JMLR.org, 2017, pp. 3319–3328.
[21] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Comput Vis Pattern Recognit, 2016, pp. 2921–2929.
[22] C. F. Baumgartner, L. M. Koch, K. Can Tezcan, J. Xi Ang, and E. Konukoglu, "Visual feature attribution using Wasserstein GANs," in Comput Vis Pattern Recognit, 2018, pp. 8309–8319.
[23] C. González-Gonzalo, B. Liefers, B. van Ginneken, and C. I. Sánchez, "Improving weakly-supervised lesion localization with iterative saliency map refinement," in Medical Imaging with Deep Learning. [remainder of entry illegible in this copy]
[24], [25] [entries largely illegible in this copy; the surviving fragment reads] Lancet Glob Health, vol. 2, no. 2, pp. e106–e116, 2014.
[26] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans Syst Man Cybern, vol. 9, no. 1, pp. 62–66, 1979.
[27] M. Bertalmio, A. L. Bertozzi, and G. Sapiro, "Navier-Stokes, fluid dynamics, and image and video inpainting," in Comput Vis Pattern Recognit, vol. 1. IEEE, 2001, pp. I–I.
[28] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[29] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Int Conf Comp Vis, 2017, pp. 618–626.
[30] B. Graham, "Kaggle diabetic retinopathy detection competition report," University of Warwick, 2015.
[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[32] G. Hripcsak and D. F. Heitjan, "Measuring agreement in medical informatics reliability studies," J Biomed Inform. [remainder of entry illegible in this copy]
[33] [entry illegible in this copy]
[34] et al., "Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales," Ophthalmology. [authors and remainder of entry illegible in this copy]
[35] [entry illegible in this copy]
[36] Am J Ophthalmol, vol. 132, no. 5, pp. 668–681, 2001. [authors and title illegible in this copy]
[37] T. Kauppi, V. Kalesnykiene, J.-K. Kamarainen, L. Lensu, I. Sorri, A. Raninen, R. Voutilainen, H. Uusitalo, H. Kälviäinen, and J. Pietilä, "The DiaretDB1 diabetic retinopathy database and evaluation protocol," in British Machine Vision Conference, vol. 1, 2007, pp. 1–10.
[38] S. Fauser, D. Smailhodzic, A. Caramoy, J. P. van de Ven, B. Kirchhof, C. B. Hoyng, B. J. Klevering, S. Liakopoulos, and A. I. den Hollander, "Evaluation of serum lipid concentrations and genetic variants at high-density lipoprotein metabolism loci and TIMP3 in age-related macular degeneration," Invest Ophthalmol Vis Sci, vol. 52, no. 8, pp. 5525–5528, 2011.
[39] M. J. van Grinsven, Y. T. Lechanteur, J. P. van de Ven, B. van Ginneken, C. B. Hoyng, T. Theelen, and C. I. Sánchez, "Automatic drusen quantification and risk assessment of age-related macular degeneration on color fundus images," Invest Ophthalmol Vis Sci, vol. 54, no. 4, pp. 3019–3027, 2013.
[40] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Comput Vis Pattern Recognit, 2016, pp. 2818–2826.
[42] R. P. Danis, A. Domalpally, E. Y. Chew, T. E. Clemons, J. Armstrong, J. P. SanGiovanni, and F. L. Ferris, "Methods and reproducibility of grading optimized digital color fundus photographs in the Age-Related Eye Disease Study 2 (AREDS2 report number 2)," Invest Ophthalmol Vis Sci, vol. 54, no. 7, pp. 4548–4554, 2013.
[43] A. Abdelsalam, L. Del Priore, and M. A. Zarbin, "Drusen in age-related macular degeneration: pathogenesis, natural course, and laser photocoagulation-induced regression," Surv Ophthalmol, vol. 44, no. 1, pp. 1–29, 1999.
[44] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative image inpainting with contextual attention," in Comput Vis Pattern Recognit, 2018, pp. 5505–5514.
[45] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Comput Vis Pattern Recognit, 2016, pp. 2536–2544.
[46] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS One, vol. 10, no. 7, p. e0130140, 2015.
[47] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, "Explaining nonlinear classification decisions with deep Taylor decomposition," Pattern Recognit, vol. 65, pp. 211–222, 2017.
[48] C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl, "Leveraging uncertainty information from deep neural networks for disease detection," Sci Rep, vol. 7, no. 1, p. 17816, 2017.
[49] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres, "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," arXiv preprint arXiv:1711.11279.