An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization
Yiqiu Shen a, Nan Wu a, Jason Phang a, Jungkyu Park a, Kangning Liu a, Sudarshini Tyagi d, Laura Heacock b,e, S. Gene Kim b,c,e, Linda Moy b,c,e, Kyunghyun Cho a,d,f, Krzysztof J. Geras b,c,a

a Center for Data Science, New York University
b Department of Radiology, NYU School of Medicine
c Center for Advanced Imaging Innovation and Research, NYU Langone Health
d Department of Computer Science, Courant Institute, New York University
e Perlmutter Cancer Center, NYU Langone Health
f CIFAR Associate Fellow
Keywords: deep learning, breast cancer screening, weakly supervised localization, high-resolution image classification
ABSTRACT

Medical images differ from natural images in significantly higher resolutions and smaller regions of interest. Because of these differences, neural network architectures that work well for natural images might not be applicable to medical image analysis. In this work, we extend the globally-aware multiple instance classifier, a framework we proposed to address these unique properties of medical images. This model first uses a low-capacity, yet memory-efficient, network on the whole image to identify the most informative regions. It then applies another, higher-capacity network to collect details from the chosen regions. Finally, it employs a fusion module that aggregates global and local information to make a final prediction. While existing methods often require lesion segmentation during training, our model is trained with only image-level labels and can generate pixel-level saliency maps indicating possible malignant findings. We apply the model to screening mammography interpretation: predicting the presence or absence of benign and malignant lesions. On the NYU Breast Cancer Screening Dataset, consisting of more than one million images, our model achieves an AUC of 0.93 in classifying breasts with malignant findings, outperforming ResNet-34 and Faster R-CNN. Compared to ResNet-34, our model is 4.1x faster for inference while using 78.4% less GPU memory. Furthermore, we demonstrate, in a reader study, that our model surpasses radiologist-level AUC by a margin of 0.11. The proposed model is available online at https://github.com/nyukat/GMIC.
1. Introduction
Breast cancer is the second leading cause of cancer-related death among women in the United States (DeSantis et al., 2017). It was estimated that 268,600 women would be diagnosed with breast cancer and 41,760 would die in 2019 (Siegel et al., 2019). Screening mammography, a low-dose X-ray examination, is a major tool for early detection of breast cancer. A standard screening mammogram consists of two high-resolution X-rays of each breast, taken from the side (the “mediolateral” or MLO view) and from above (the “craniocaudal” or CC view), for a total of four images. Radiologists, physicians specialized in the interpretation of medical images, analyze screening mammograms for tissue abnormalities that may indicate breast cancer. Any detected abnormality leads to additional diagnostic imaging and possible tissue biopsy. A radiologist assigns a standardized assessment to each screening
mammogram per the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS), with specific follow-up recommendations for each category (Liberman and Menell, 2002).

(This paper is an extension of work originally presented at the 10th International Workshop on Machine Learning in Medical Imaging (Shen et al., 2019).)

Screening mammography interpretation is a particularly challenging task because mammograms are in very high resolutions while most asymptomatic cancer lesions are small, sparsely distributed over the breast, and may present as subtle changes in the breast tissue pattern. While randomized clinical trials have shown that screening mammography has significantly reduced breast cancer mortality (Duffy et al., 2002; Kopans, 2002), it is associated with limitations such as false positive recalls for additional imaging and subsequent false positive biopsies which result in benign, non-cancerous findings. About 10% to 20% of women who have an abnormal screening mammogram are recommended to undergo a biopsy, and only 20% to 40% of these biopsies yield a diagnosis of cancer (Kopans, 2015).

Fig. 1: Four examples of breasts that were biopsied along with the annotated findings. The breasts (from left to right) were diagnosed with benign calcifications, a benign mass, malignant calcifications, and malignant architectural distortion. While microcalcifications are common in both benign and malignant findings, their presence in a ductal distribution, such as in the third example, is a strong indicator of malignancy.

To tackle these limitations, convolutional neural networks (CNN) have been applied to assist radiologists in the analysis of screening mammography (Zhu et al., 2017; Kim et al., 2018; Kyono et al., 2018; Ribli et al., 2018; Wu et al., 2019b; McKinney et al., 2020). An overwhelming majority of existing studies on this task utilize models that were originally designed for natural images. For instance, VGGNet (Simonyan and Zisserman, 2014), designed for object classification on ImageNet (Deng et al., 2009), has been applied to breast density classification (Wu et al., 2018), and Faster R-CNN (Ren et al., 2015) has been adapted to localize suspicious findings in mammograms (Ribli et al., 2018; Févry et al., 2019).

Screening mammography is inherently different from typical natural images in a few respects. First of all, as illustrated in Figure 1, regions of interest (ROI) in mammography images, such as masses, asymmetries, and microcalcifications, are often small in comparison to the salient objects in natural images. Moreover, as suggested in multiple clinical studies (Van Gils et al., 1998; Pereira et al., 2009; Wei et al., 2011), both the local details, such as lesion shape, and the global structure, such as overall breast fibroglandular tissue density and pattern, are essential for accurate diagnosis. For instance, while microcalcifications are common in both benign and malignant findings, their presence in a ductal distribution, such as in the third example of Figure 1, is a strong indicator of malignancy. This is in contrast to typical natural images, where objects outside the most salient regions provide little information towards predicting the label of the image. In addition, mammography images are usually of much higher resolutions than typical natural images.
The most accurate deep CNN architectures for natural images are not applicable to mammography images due to the limited size of GPU memory.

To address the aforementioned issues, in this work, we extend and comprehensively evaluate the globally-aware multiple instance classifier (GMIC), whose preliminary version we proposed in Shen et al. (2019). GMIC first applies a low-capacity, yet memory-efficient, global module on the whole image to generate saliency maps that provide coarse localization of possible benign/malignant findings. As a result, GMIC is able to process screening mammography images in their original resolutions while keeping GPU memory usage manageable. In order to capture subtle patterns contained in small ROIs, GMIC then identifies the most informative regions in the image and utilizes a high-capacity local module to extract fine-grained visual details from these regions. Finally, it employs a fusion module that aggregates information from both global context and local details to predict the presence or absence of benign and malignant lesions in a breast. The specific contributions of this work are the following:

• We extend the original architecture (Shen et al., 2019) with a fusion module. The fusion module improves classification performance by effectively combining information from both global and local features. In Section 3.6, we demonstrate that the fusion module renders more accurate predictions than our original design.

• We apply the improved model to the task of screening mammography interpretation: predicting the presence or absence of benign and malignant lesions in a breast. We trained and tested our model on the NYU Breast Cancer Screening Dataset consisting of 229,426 high-resolution screening mammograms (Wu et al., 2019c). On a held-out test set of 14,148 exams, GMIC achieves an AUC of 0.93 in identifying breasts with malignant findings, outperforming baseline approaches including ResNet-34 (He et al., 2016a), Faster R-CNN (Févry et al., 2019) and DMV-CNN (Wu et al., 2019b).

• We demonstrate the clinical potential of GMIC by comparing the improved model to human experts. In the reader study, we show that it surpasses radiologist-level classification performance: the AUC for the proposed model was greater than the average AUC of the radiologists by a margin of 0.11, reducing the error approximately by half. In addition, we experiment with hybrid models that combine predictions from both GMIC and each of the radiologists separately. At radiologists’ sensitivity (62.1%), the hybrid models achieve higher specificity than the readers alone.

• The proposed model is able to localize breast lesions in a weakly supervised manner, unlike existing approaches that rely on pixel-level lesion annotations (Ribli et al., 2018; Févry et al., 2019; Wu et al., 2019b). In Section 3.5, we demonstrate that the regions highlighted by the saliency maps indeed correlate with the objects of interest.

• We demonstrate that the proposed model is computationally efficient. GMIC requires significantly less memory and is much faster to train than standard image classification models such as ResNet-34 (He et al., 2016a) and Faster R-CNN (Ren et al., 2015). Benchmarked on high-resolution screening mammography images, GMIC has 28.8% fewer parameters, uses 78.4% less GPU memory, is 4.1x faster during inference and 5.6x faster during training, as compared to ResNet-34, while being more accurate.

• We conduct a comprehensive ablation study that evaluates the effectiveness of each component of GMIC.
Moreover, we empirically measure how much performance can be improved by ensembling GMIC with Faster R-CNN and DMV-CNN. In addition, we also experiment with utilizing segmentation labels to enhance GMIC. In both experiments, we find that the improvement is marginal, suggesting that, for a large training set, image-level labels alone are sufficient for GMIC to reach favorable performance.
2. Methods
We frame the task of screening mammography interpretation as a multi-label classification problem: given a grayscale image x ∈ R^{H,W}, we predict the image-level label y = (y^b, y^m), where y^b, y^m ∈ {0, 1} indicate whether any benign/malignant lesion is present in x.

As shown in Figure 2, we propose a classification framework that resembles the diagnostic procedure of a radiologist. We first use a global network f_g to extract a feature map h_g from the input image x, i.e. we compute

    h_g = f_g(x),    (1)

which is analogous to a radiologist roughly scanning through the entire image to obtain a holistic view. We then apply a 1×1 convolution layer with sigmoid activation to transform h_g into two saliency maps A^b, A^m ∈ R^{h,w} indicating approximate locations of benign and malignant lesions. Each element A^c_{i,j} ∈ [0, 1], where c ∈ {b, m}, denotes the contribution of spatial location (i, j) towards predicting the presence of benign/malignant lesions. Let A denote the concatenation of A^b and A^m. That is, we compute A as

    A = sigm(conv_{1×1}(h_g)).    (2)

Due to limited GPU memory, in prior work, input images x are usually down-sampled (Guan et al., 2018; Yao et al., 2018; Zhong et al., 2019). For mammography images, however, down-sampling distorts important visual details such as lesion margins and blurs small microcalcifications. Instead of sacrificing the input resolution, we control memory consumption by reducing the complexity of the global network f_g. (Depending on the implementation of f_g, the resolutions (h, w) of the saliency maps are usually smaller than the resolution (H, W) of the input image. In this work, H = 2944 and W = 1920.) Because of its constrained capacity, f_g may not be able to capture all subtle patterns contained in the images at all scales. To compensate for this, we utilize a high-capacity local network f_l to extract fine-grained details from a set of informative regions. In the second stage, we use A to retrieve the K most informative patches from x:

    {x̃_k} = retrieve_roi(x, A),    (3)

where retrieve_roi denotes a heuristic patch-selection procedure described later. This procedure can be seen as an analogue to a radiologist concentrating on areas that might correspond to lesions. The fine-grained visual features contained in the chosen patches {x̃_k} are then processed using f_l and aggregated into a vector z by an aggregator f_a. That is,

    h̃_k = f_l(x̃_k)  and  z = f_a({h̃_k}).    (4)

Finally, a fusion network f_fusion combines information from both the global structure h_g and the local details z to produce a prediction ŷ. This is analogous to a radiologist comprehensively considering the global and local information to render a full diagnosis:

    ŷ = f_fusion(h_g, z).    (5)

To process high-resolution images while keeping GPU memory consumption manageable, we parameterize f_g as a ResNet-22 (Wu et al., 2019b) whose architecture is shown in Figure 2. In comparison to canonical ResNet architectures (He et al., 2016a), ResNet-22 has one more residual block and only a quarter of the filters in each convolution layer. As suggested by Tan and Le (2019), a deeper CNN has larger receptive fields and can capture richer and more complex features in high-resolution images. Narrowing the network width decreases the total number of hidden units, which reduces GPU memory consumption.
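To make the two-stage design concrete, the following is a minimal PyTorch sketch of the forward pass in Equations (1)–(5). It is an illustration under our own assumptions, not the authors' released implementation: the constructor arguments, the 256-channel global feature map, and the helpers retrieve_roi and gated_attention_pool (sketched later in this section) are placeholders, and batch handling is simplified to a single image.

```python
import torch
import torch.nn as nn

class GMICSketch(nn.Module):
    """Illustrative skeleton of the GMIC forward pass (Eqs. 1-5)."""

    def __init__(self, f_g, f_l, num_classes=2, k_patches=6,
                 feat_dim=512, g_channels=256):
        super().__init__()
        self.f_g = f_g                 # low-capacity global network (ResNet-22 in the paper)
        self.conv_1x1 = nn.Conv2d(g_channels, num_classes, kernel_size=1)
        self.f_l = f_l                 # high-capacity local network (e.g. a ResNet-18)
        self.k = k_patches
        # the fusion layer sees GMP(h_g) concatenated with the patch vector z
        self.fusion = nn.Linear(g_channels + feat_dim, num_classes)

    def forward(self, x):              # x: (1, 1, H, W), a single mammogram
        h_g = self.f_g(x)                                   # Eq. (1): global feature map
        A = torch.sigmoid(self.conv_1x1(h_g))               # Eq. (2): saliency maps A^b, A^m
        patches = retrieve_roi(x[0, 0], A[0], self.k)       # Eq. (3): K informative patches
        h_tilde = torch.stack([self.f_l(p[None, None]) for p in patches])
        z = gated_attention_pool(h_tilde.squeeze(1))        # Eq. (4): attention-weighted vector
        g = h_g.amax(dim=(2, 3))                            # global max pooling of h_g
        y_fusion = torch.sigmoid(self.fusion(torch.cat([g, z[None]], dim=1)))  # Eq. (5)
        return y_fusion, A
```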
It is difficult to define a loss function that directly compares the saliency maps A with the cancer label y, since y does not contain localization information. In order to train f_g, we use an aggregation function f_agg(A^c) : R^{h,w} → [0, 1] to transform a saliency map into an image-level class prediction:

    ŷ^c_global = f_agg(A^c).    (6)

With f_agg we can train f_g by backpropagating the gradient of the classification loss between y and ŷ_global. The design of f_agg(A^c) has been extensively studied (Durand et al., 2017). Global average pooling (GAP) would dilute the prediction, as most of the spatial locations in A^c correspond to background and provide little training signal. On the other hand, with global max pooling (GMP), the gradient is backpropagated through a single spatial location, which makes the learning process slow and unstable. In our work, we propose top t% pooling, which is a soft balance between GAP and GMP. Namely, we define the aggregation function as

    f_agg(A^c) = (1 / |H+|) Σ_{(i,j) ∈ H+} A^c_{i,j},    (7)

where H+ denotes the set containing the locations of the top t% values in A^c, and t is a hyperparameter. In all experiments, we tune t using the procedure described in Section 3.3. In fact, GAP and GMP can be viewed as two extremes of top t% pooling: GMP is equivalent to choosing t so that H+ contains a single location, and GAP is equivalent to setting t = 100%. In Section 3.6, we experiment with a range of values of t and empirically demonstrate that our parameterization of f_agg achieves performance superior to GAP and GMP.
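As a concrete reference, below is a small PyTorch sketch of top t% pooling (Eq. 7); the function name and tensor layout are our own choices. Setting t so that only one location is retained recovers GMP, and t = 100 recovers GAP.

```python
import torch

def top_t_percent_pooling(saliency: torch.Tensor, t: float) -> torch.Tensor:
    """Top t% pooling (Eq. 7): average the largest t% of values in each
    saliency map. `saliency` has shape (batch, classes, h, w); `t` is a
    percentage, e.g. t=3.0 averages the top 3% of spatial locations."""
    b, c, h, w = saliency.shape
    k = max(1, int(h * w * t / 100.0))      # |H+|: number of retained locations
    flat = saliency.flatten(start_dim=2)    # (batch, classes, h*w)
    topk, _ = flat.topk(k, dim=-1)          # values at the top-t% locations
    return topk.mean(dim=-1)                # y^c_global per class, in [0, 1]
```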
Fig. 2: Overall architecture of GMIC (left) and architecture of ResNet-22 (right). The patch map indicates the positions of ROI patches (blue squares) on the input. In ResNet-22, we use c, s, and p to denote the number of output channels, the stride, and the size of the padding. “ResBlock, c = 32, d = 2” denotes a vanilla ResBlock proposed in He et al. (2016b) with 32 output channels and a downsample skip connection that reduces the resolution by a factor of 2. In comparison to canonical ResNet architectures (He et al., 2016a), ResNet-22 has one more residual block and only a quarter of the filters in each convolution layer. Narrowing the network width decreases the total number of hidden units, which reduces GPU memory consumption.

Acquiring ROI Patches.
We designed a greedy algorithm (Algorithm 1) to retrieve K patches as ROI proposals, x̃_k ∈ R^{h_c, w_c}, from the input x, where w_c = h_c = 256 in all experiments. In each iteration, retrieve_roi greedily selects the rectangular bounding box that maximizes the criterion defined in line 7. The algorithm then maps each selected bounding box to its corresponding location on the input image. The reset rule in line 12 explicitly ensures that extracted ROI patches do not significantly overlap with each other. In Section 3.6, we show how the classification performance is impacted by K. A Python sketch of the procedure is given after Algorithm 1.

Algorithm 1: retrieve_roi
 1: Input: x ∈ R^{H,W}, A ∈ R^{h,w,2}, K
 2: Output: O = {x̃_k | x̃_k ∈ R^{h_c,w_c}}
 3: O = ∅
 4: for each class c ∈ {benign, malignant} do
 5:     Ã^c = min-max-normalization(A^c)
 6: end for
 7: A* = Σ_c Ã^c; let l denote an arbitrary (h_c·h/H) × (w_c·w/W) rectangular patch on A*, and define criterion(l, A*) = Σ_{(i,j) ∈ l} A*[i, j]
 8: for k = 1, 2, ..., K do
 9:     l* = argmax_l criterion(l, A*)
10:     L = position of l* mapped onto x
11:     O = O ∪ {patch of x at L}
12:     ∀(i, j) ∈ l*, set A*[i, j] = 0
13: end for
14: return O
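Below is a minimal single-image Python sketch of retrieve_roi, mirroring Algorithm 1 under the assumption that the box sums of the criterion can be computed with average pooling; it is an illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_roi(x, A, k=6, patch=256):
    """Greedy ROI retrieval (Algorithm 1) for a single image.

    x: (H, W) input image; A: (2, h, w) saliency maps.
    Returns k crops of size patch x patch from x.
    """
    H, W = x.shape
    _, h, w = A.shape
    # Sum of class-wise min-max normalized saliency maps (lines 4-7).
    lo = A.amin(dim=(1, 2), keepdim=True)
    hi = A.amax(dim=(1, 2), keepdim=True)
    A_star = ((A - lo) / (hi - lo + 1e-8)).sum(dim=0)[None, None]  # (1, 1, h, w)
    # Candidate box size on the saliency-map grid.
    ph, pw = max(1, patch * h // H), max(1, patch * w // W)
    crops = []
    for _ in range(k):
        # criterion(l, A*): box sums (up to a constant) at every position.
        scores = F.avg_pool2d(A_star, (ph, pw), stride=1)[0, 0]
        idx = int(torch.argmax(scores))
        i, j = idx // scores.shape[1], idx % scores.shape[1]
        # Map the box back to input-image coordinates and crop (line 10).
        top = min(i * H // h, H - patch)
        left = min(j * W // w, W - patch)
        crops.append(x[top:top + patch, left:left + patch])
        A_star[0, 0, i:i + ph, j:j + pw] = 0.0   # reset rule (line 12)
    return crops
```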
Utilizing Information from Patches. With retrieve_roi, we can focus learning on a selected set of small yet informative patches {x̃_k}. We can now apply a local network f_l with higher capacity (wider or deeper), able to utilize fine-grained visual features, to extract a representation h̃_k ∈ R^L from every patch x̃_k. We experiment with several parameterizations of f_l, including ResNet-18, ResNet-34 and ResNet-50.

Since ROI patches are retrieved using coarse saliency maps, the amount of information relevant for classification varies significantly from patch to patch. To address this issue, we use the Gated Attention Mechanism (GA) (Ilse et al., 2018), allowing the model to selectively incorporate information from all patches. Compared to other common attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015), GA uses the sigmoid function to provide a learnable non-linearity which increases model flexibility. An attention score α_k is computed for each patch:

    α_k = exp{w^T (tanh(V h̃_k^T) ⊙ sigm(U h̃_k^T))} / Σ_{j=1}^{K} exp{w^T (tanh(V h̃_j^T) ⊙ sigm(U h̃_j^T))},    (8)

where ⊙ denotes element-wise multiplication and w ∈ R^L, V ∈ R^{L×M}, U ∈ R^{L×M} are learnable parameters.
In all experiments, we set L = 512 and fix M across experiments. The patch-level representations are then combined into an attention-weighted representation

    z = Σ_{k=1}^{K} α_k h̃_k,    (9)

where the attention score α_k ∈ [0, 1] indicates the relevance of each patch x̃_k. The representation z is then passed to a fully connected layer with sigmoid activation to generate a prediction

    ŷ_local = sigm(w_local^T z),    (10)

where w_local ∈ R^{L×2} are learnable parameters.
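The gated attention of Eqs. (8)–(9) can be written compactly as a PyTorch module; a minimal sketch follows. The hidden size M is elided in the text above, so the default below is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Gated attention (Eqs. 8-9): scores each patch feature h_k and
    returns the attention-weighted representation z = sum_k alpha_k h_k."""

    def __init__(self, L=512, M=128):      # M=128 is a placeholder value
        super().__init__()
        self.V = nn.Linear(L, M, bias=False)   # tanh branch
        self.U = nn.Linear(L, M, bias=False)   # sigmoid gating branch
        self.w = nn.Linear(M, 1, bias=False)   # scoring vector

    def forward(self, h):                      # h: (K, L) patch features
        gate = torch.tanh(self.V(h)) * torch.sigmoid(self.U(h))
        alpha = torch.softmax(self.w(gate), dim=0)    # (K, 1), Eq. (8)
        z = (alpha * h).sum(dim=0)                    # (L,), Eq. (9)
        return z, alpha.squeeze(-1)
```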
Information Fusion. To combine information from both the saliency maps and the ROI patches, we apply global max pooling to h_g and concatenate the result with z. The concatenated representation is then fed into a fully connected layer with sigmoid activation to produce the final prediction:

    ŷ_fusion = sigm(w_f [GMP(h_g), z]^T),    (11)

where GMP denotes the global max pooling operator and w_f are learnable parameters.

In order to constrain the saliency maps to only highlight important regions, we impose L1 regularization on A^c to make the saliency maps sparser:

    L_reg(A^c) = Σ_{(i,j)} |A^c_{i,j}|.    (12)

Despite the relative complexity of our proposed framework, the model can be trained end-to-end using stochastic gradient descent with the following loss function, defined for a single training example as:

    L(y, ŷ) = Σ_{c ∈ {b,m}} [ BCE(y^c, ŷ^c_local) + BCE(y^c, ŷ^c_global) + BCE(y^c, ŷ^c_fusion) + β L_reg(A^c) ],    (13)

where BCE is the binary cross-entropy and β is a hyperparameter.
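Assembled from the pieces above, the training objective of Eq. (13) might be implemented as follows; this is a sketch under the assumption that all three heads output per-class probabilities, and the beta default is an arbitrary placeholder.

```python
import torch.nn.functional as F

def gmic_loss(y, y_global, y_local, y_fusion, A, beta=1e-4):
    """Eq. (13): BCE on the global, local, and fusion heads plus the
    L1 saliency regularizer of Eq. (12). All predictions and the label
    y have shape (batch, 2); A has shape (batch, 2, h, w)."""
    bce = (F.binary_cross_entropy(y_global, y)
           + F.binary_cross_entropy(y_local, y)
           + F.binary_cross_entropy(y_fusion, y))
    reg = A.abs().sum(dim=(2, 3)).mean()   # Eq. (12), averaged over batch and classes
    return bce + beta * reg
```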
3. Experiments and Results
To demonstrate the effectiveness of GMIC on high-resolution image classification, we evaluate it on the task of screening mammography interpretation: predicting the presence or absence of benign and malignant findings in a breast.
Fig. 3:
Example screening mammography exam. Each exam is associated with four images that correspond to the CC and MLO views of both the left and right breast: R-CC, L-CC, R-MLO, and L-MLO. The left breast is diagnosed with benign findings, which are highlighted in green.
We compare GMIC to a previous ResNet-like network dedicated to mammography (Wu et al., 2019b) as well as to the standard ResNet-34 (He et al., 2016a) and Faster R-CNN (Ren et al., 2015; Févry et al., 2019) in terms of classification accuracy, number of parameters, computation time, and GPU memory consumption. In addition, we also evaluate the localization performance of GMIC by qualitatively and quantitatively comparing the resulting saliency maps with the ground truth segmentations provided by the radiologists.
Dataset. The NYU Breast Cancer Screening Dataset (Wu et al., 2019c) includes 229,426 exams (1,001,093 images) from 141,472 patients. (Our retrospective study was approved by our institutional review board and was compliant with the Health Insurance Portability and Accountability Act. Informed consent was waived.) Each exam contains at least four images which correspond to the four standard views used in screening mammography: R-CC (right craniocaudal), L-CC (left craniocaudal), R-MLO (right mediolateral oblique) and L-MLO (left mediolateral oblique). An example is shown in Figure 3. Across the entire dataset (458,852 breasts), malignant findings were present in 985 breasts. For breasts with biopsied findings, radiologists also provided binary segmentation masks M^b, M^m ∈ {0,1}^{H×W}, where M^{b/m}_{i,j} = 1 if and only if pixel (i, j) belongs to a benign/malignant finding. An example of such a segmentation is shown in Figure 3. In all experiments (except for the experiments in Section 3.6 that assess the benefits of utilizing segmentation labels), segmentation labels are used only for evaluation. We found that, according to the radiologists, approximately 32% of the biopsied findings were mammographically occult.

The dataset is divided into disjoint training (186,816), validation (28,462) and test (14,148) sets. All images are cropped to 2944 × 1920 pixels. As a form of data augmentation during training, each ROI patch is rotated by an angle sampled from {0, 90, 180, 270} degrees with equal probability. No rotation is applied to the patches during the validation and test phases.

Evaluation Metrics. As each breast is associated with two images (CC and MLO views) and our model generates a prediction for each image, we define breast-level predictions as the average of the two image-level predictions. For classification performance, we report the area under the ROC curve (AUC) at the breast level. In the reader study, we also use the area under the precision-recall curve (PRAUC) to compare radiologists and the proposed model. We computed the radiologists’ sensitivity, which served as the prediction threshold used to derive the specificity of GMIC. To assess statistical significance, we performed Student’s t-test and used binomial proportion confidence intervals for specificity. To quantitatively evaluate our model’s localization ability, we calculate the Dice similarity coefficient (DSC); see the sketch below. The DSC values we report are computed as an average over images for which segmentation labels are available (i.e. images from breasts which have biopsied findings that were not mammographically occult).

In addition to accuracy, computation time and memory efficiency are also important for medical image analysis. To measure memory efficiency, we report the peak GPU memory usage during training, as in Canziani et al. (2016). Similar to Schlemper et al. (2019), we also report the run-time performance by recording the total number of floating-point operations (FLOPs) during inference and the elapsed time for forward and backward propagation. Both memory and run-time statistics are measured by benchmarking each model on a single exam (4 images), averaged across 100 exams. All experiments are conducted on an NVIDIA Tesla V100 GPU.

Model Configuration. In all experiments, we parameterize f_g as a ResNet-22 whose architecture is shown in Figure 2. We pretrain f_g on BI-RADS labels as described in Geras et al. (2017) and Wu et al. (2019b). For f_l, we experiment with three different architectures with varying levels of complexity (ResNet-18, ResNet-34, ResNet-50). For each image, we extract K = 6 ROI patches. Our implementation is available at https://github.com/nyukat/GMIC.
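For reference, the per-image DSC used above can be computed as follows; binarizing the saliency map at 0.5 is our own assumption, since the text does not spell out the thresholding rule at this point.

```python
import numpy as np

def dice_similarity(saliency: np.ndarray, mask: np.ndarray, thresh: float = 0.5) -> float:
    """Dice similarity coefficient between an upsampled saliency map and
    a ground-truth segmentation mask of the same shape."""
    pred = saliency >= thresh
    gt = mask.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0   # both empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```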
Baselines. The proposed model is compared against three baselines. We first trained ResNet-34 (He et al., 2016a) to predict the presence of malignant and benign findings in a breast. In fact, ResNet-34 is the highest-capacity model among the ResNet architectures that can process a mammogram in its original resolution while fitting in the memory of an NVIDIA Tesla V100 GPU. We also experimented with a variant of ResNet-34, replacing the fully connected classification layer with a 1×1 convolution layer followed by top t% pooling as the aggregation function. In addition, we compared our model with the Deep Multi-view CNN (DMV-CNN) proposed by Wu et al. (2019b), which has two versions. In the vanilla version, DMV-CNN applies a ResNet-based model on four standard views to generate two breast-level predictions for each exam. DMV-CNN can also be enhanced with pixel-level heatmaps generated by a patch-level classifier; however, training the patch-level classifier requires hand-annotated segmentation labels. Lastly, we also compared GMIC with the work of Févry et al. (2019), which trains a Faster R-CNN (Ren et al., 2015) that utilizes segmentation labels to localize anchor boxes that correspond to malignant or benign lesions. Unlike DMV-CNN and Faster R-CNN, which rely on segmentation labels, GMIC can be trained with only image-level labels.

Hyperparameter Tuning. To make a fair comparison between model architectures, we optimize the hyperparameters with random search (Bergstra and Bengio, 2012) for both ResNet-34 baselines and GMIC. Specifically, for all models, we search for the learning rate η on a logarithmic scale. Additionally, for GMIC and for ResNet-34 with the 1×1 convolution layer and top t% pooling, we search for the regularization weight β (also on a logarithmic scale) and for the pooling threshold t over a discrete set of percentages. For all models, we train 30 separate models using hyperparameters randomly sampled from the ranges described above. Each model is trained for 50 epochs, and we report the test performance using the weights from the training epoch that achieves the highest validation performance.
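A sketch of one draw of the random search described above; the exact search ranges are partly elided in the text, so the bounds here are illustrative placeholders.

```python
import random

def sample_hyperparameters():
    """One random-search draw (Bergstra and Bengio, 2012)."""
    log_lr = random.uniform(-5.0, -3.5)        # learning rate eta, log10 scale (placeholder bounds)
    log_beta = random.uniform(-5.5, -3.5)      # saliency L1 weight beta, log10 scale (placeholder bounds)
    t = random.choice([1, 2, 3, 5, 10, 20])    # top-t% pooling threshold, in percent
    return {"lr": 10 ** log_lr, "beta": 10 ** log_beta, "t": t}
```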
Performance. For each network architecture, we selected the top five models (referred to as top-5) from the hyperparameter tuning phase that achieved the highest validation AUC in identifying breasts with malignant findings and evaluated their performance on the held-out test set. In Table 1, we report the mean and the standard deviation of the AUC for the top-5 models of each network architecture. In general, the GMIC model outperformed all baselines. In particular, GMIC achieved a higher AUC than Faster R-CNN and DMV-CNN (with heatmaps), despite GMIC not learning with pixel-level labels. We hypothesize that GMIC’s superior performance is related to its ability to efficiently integrate both global features and local details. In Section 3.6, we empirically investigate this hypothesis with multiple ablation studies. Separately, we also observe that increasing the complexity of f_l brings a small improvement in AUC.

To further improve our results, we employed the technique of model ensembling (Dietterich, 2000). Specifically, we averaged the predictions of the top-5 models for GMIC-ResNet-18, GMIC-ResNet-34, and GMIC-ResNet-50 to produce the overall prediction of the ensemble. Our best ensemble model achieved an AUC of 0.930 in identifying breasts with malignant findings.

In addition, GMIC is efficient in both run-time complexity and memory usage. Compared to ResNet-34, GMIC-ResNet-18 has 28.8% fewer parameters, uses 78.43% less GPU memory, is 4.1x faster during inference and 5.6x faster during training. GMIC is even more markedly superior in both run-time and GPU memory usage compared to Faster R-CNN. This improvement is brought about by its design, which avoids excessive computation on the whole image while selectively focusing on informative regions.
Reader Study. To evaluate the potential clinical impact of our model, we compare the performance of GMIC to the performance of radiologists using data from the reader study conducted by Wu et al. (2019b). This study includes 14 readers: 12 attending radiologists at various levels of experience (between 2 and 30 years), a medical resident, and a medical student. Each reader was asked to provide probability estimates as well as binary predictions of malignancy for 720 screening exams (1,440 breasts). Among the 1,440 breasts, 62 breasts were associated with malignant findings and 356 breasts were associated with benign findings. Among the breasts with malignant findings, there were 21 masses, 26 calcifications, 12 asymmetries and 4 architectural distortions. The radiologists were shown only the images, with no other data.
Fig. 4: The ROC curves ((a), (b), (c)) and the precision-recall curves ((a*), (b*), (c*)) computed on the reader study dataset. (a) & (a*): curves for all 14 readers. We derive the ROC/PRC for the average reader by computing the average true positive rate and precision across all readers for every false positive rate and recall. (b) & (b*): curves for hybrid models with each single reader. The curve highlighted in blue indicates the average performance of all hybrids. (c) & (c*): comparison among GMIC, DMV-CNN, the average reader, and the average hybrid.
Comparison to Radiologists.
We calculate the AUC and PRAUC on the reader study dataset to measure the performance of the radiologists and GMIC. We obtain GMIC’s predictions by ensembling the predictions of the top-5 GMIC-ResNet-18 models. In Figure 4 ((a) and (a*)), we visualize the receiver operating characteristic curve (ROC) and precision-recall curve (PRC) for each individual reader using their probability estimates of malignancy. We also compare GMIC with DMV-CNN and the radiologists ((c) and (c*)). GMIC achieves an AUC of 0.891 and a PRAUC of 0.39, outperforming DMV-CNN (AUC: 0.876, PRAUC: 0.318).
Table 1: Comparison of the performance of GMIC and the baselines on screening mammogram interpretation. For both GMIC and ResNet-34, we report the test AUC (mean and standard deviation) of the top-5 models that achieved the highest validation AUC in identifying breasts with malignant findings. We also measure the total number of learnable parameters in millions, peak GPU memory usage (Mem, in GB) for training a single exam (4 images), time taken for forward (Fwd) and backward (Bwd) propagation in milliseconds, and the number of floating-point operations (FLOPs) in billions.

Model                    | AUC(M)     | AUC(B)     | #Params | Mem (GB) | Fwd/Bwd (ms) | FLOPs
ResNet-34 + fc           | … ± 0.026  | … ± 0.015  | 21.30M  | 13.95    | 189/459      | 1622B
ResNet-34 + 1×1 conv     | … ± 0.015  | … ± 0.008  | 21.30M  | 12.58    | 201/450      | 1625B
DMV-CNN (w/o heatmaps)   | … ± 0.008  | … ± 0.004  | 6.13M   | 2.4      | 38/86        | 65B
DMV-CNN (w/ heatmaps)    | … ± 0.003  | … ± 0.002  | 6.13M   | 2.4      | 38/86        | 65B
Faster R-CNN             | … ± 0.014  | … ± 0.008  | 104.8M  | 25.75    | 920/–        | –*
GMIC-ResNet-18           | … ± 0.007  | … ± 0.005  | 15.17M  | 3.01     | 46/82        | 122B
GMIC-ResNet-34           | … ± 0.005  | … ± 0.006  | 25.29M  | 3.45     | 58/94        | 180B
GMIC-ResNet-50           | … ± …      | … ± 0.003  | 27.95M  | 5.05     | 66/131       | 194B
GMIC-ResNet-18-ensemble  | 0.930      | …          | –       | –        | –            | –

*The implementation of Faster R-CNN by Févry et al. (2019) is not compatible with our framework of FLOPs calculation.

The AUCs associated with each individual reader range from 0.705 to 0.860 (mean: 0.778, std: 0.0435) and the PRAUCs for the readers vary from 0.244 to 0.453 (mean: 0.364, std: 0.0496). GMIC achieves a higher AUC and PRAUC than the average reader. We note that there is a limitation associated with AUC and PRAUC: while AUC and PRAUC are calculated on continuous predictions, radiologists are trained to make a diagnosis by choosing from a discrete set of BI-RADS scores (D’Orsi, 2013). Indeed, even though the readers were given the possibility to predict any number between 0% and 100%, they chose to stick to the probability thresholds corresponding to BI-RADS scores.

To compare GMIC to radiologists, we also use sensitivity and specificity as additional evaluation metrics. We first compute the radiologists’ sensitivity and specificity using the data from the reader study. We then use the average specificity and sensitivity among readers as a proxy for radiologists’ performance in a single-reader setting, and use the statistics of the consensus reading to approximate performance in a multi-reader setting. The predictions for the consensus reading are derived using majority voting. The 14 radiologists achieved an average specificity of 85.2% (std: approximately 5%) and an average sensitivity of 62.1% (std: 9%). The consensus reading yields a specificity of 94.6% and a sensitivity of approximately 76%. At the radiologists’ average sensitivity, GMIC achieves a specificity of approximately 88%, higher than the average single reader (85.2%) though lower than the consensus reading specificity (94.6%), demonstrating the potential value of GMIC as a second reader.
Fig. 5: AUC and PRAUC as a function of λ ∈ [0, 1) for hybrids between each reader and the GMIC (left) / DMV-CNN (right) ensemble. Each hybrid achieves its highest AUC/PRAUC for a different λ (marked with ♦).

Human-machine Hybrid.
To further demonstrate the clinical potential of GMIC, we create a hybrid model whose predictions are a linear combination of the predictions from each reader and the model: ŷ_hybrid = λ ŷ_reader + (1 − λ) ŷ_GMIC. We compute the AUC and PRAUC of the hybrid models by setting λ = 0.5. We note that λ = 0.5 is not necessarily the optimal weight for each of the hybrid models.
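A hybrid of this form can be evaluated with a few lines of code; the sketch below sweeps λ and reports the AUC at each value, as in Figure 5, using scikit-learn's roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hybrid_auc_curve(y_true, y_reader, y_gmic, n_grid=21):
    """AUC of y_hybrid = lam * y_reader + (1 - lam) * y_gmic over a grid
    of lambda values in [0, 1]. Returns (lambdas, aucs)."""
    lams = np.linspace(0.0, 1.0, n_grid)
    aucs = np.array([roc_auc_score(y_true, lam * y_reader + (1.0 - lam) * y_gmic)
                     for lam in lams])
    return lams, aucs
```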
Fig. 6: (a) and (a*): the distribution of the maximum AUC/PRAUC achieved by hybrids between each reader and the GMIC/DMV-CNN ensemble. (b) and (b*): the distribution of the optimal λ* that achieves the maximum AUC/PRAUC for both GMIC/DMV-CNN hybrids. GMIC hybrids achieve higher AUC and PRAUC than DMV-CNN hybrids. Moreover, GMIC plays a more important role than DMV-CNN in the hybrid models, as indicated by the distribution of λ*.

On the other hand, the performance obtained by retroactively fine-tuning λ on the reader study is not transferable to realistic clinical settings. We therefore report hybrids with λ = 0.5, which on average achieve an AUC of 0.892 (std: 0.009) and a PRAUC of 0.449 (std: 0.036), improving the radiologists’ mean AUC by 0.114 and mean PRAUC by 0.085. For each of the hybrid models, we also calculate its specificity at the average radiologists’ sensitivity (62.1%); the hybrids achieve a higher average specificity than the readers alone. This result suggests that GMIC captures different aspects of the task than radiologists and can be used as a tool to assist in interpreting breast cancer screening exams.

In addition, in Figure 5, we visualize the AUC and PRAUC achieved by combining the predictions from each of these 14 readers with GMIC ((a) and (b)) and DMV-CNN ((a*) and (b*)) for varying λ. The diamond mark on each curve indicates the λ* that achieves the highest AUC/PRAUC. As shown in the plot, the predictions of all radiologists can be improved (λ* < 1.0) by incorporating predictions from GMIC. More specifically, as shown in Figure 6 ((a) and (a*)), with the optimal λ*, GMIC hybrids achieve a higher mean AUC and mean PRAUC than the counterparts of the DMV-CNN hybrids. Figure 6 ((b) and (b*)) shows the distribution of the optimal λ* for GMIC and DMV-CNN. The average value of λ* that achieves the maximum AUC/PRAUC is lower for the GMIC hybrid models than for the DMV-CNN hybrid models, indicating that GMIC plays a more important role in the hybrid predictions.

Localization. To evaluate the localization performance of GMIC, we select the model with the highest DSC for malignancy localization on the validation set. During inference, we upsample the saliency maps using nearest-neighbour interpolation to match the resolution of the input image. Our best localization model achieves a mean test DSC of 0.325 (std: 0.231) for the localization of malignant lesions and 0.240 (std: 0.175) for the localization of benign lesions. The best localization model achieves an AUC of 0.886 in identifying breasts with malignant lesions. We observe that localization and classification performance are not perfectly correlated. The trade-off between classification and localization has been discussed in the weakly supervised object detection literature (Feng et al., 2017; Sedai et al., 2018; Yao et al., 2018).

In Figure 7, we visualize the saliency maps for four samples selected from the test set. In the first two examples, the saliency maps are highly activated on the annotated lesions, suggesting that our model is able to detect suspicious lesions without pixel-level supervision. Moreover, the attention α_k is highly concentrated on the ROI patches that overlap with the annotated lesions. In the third example, the saliency map for benign findings identifies three abnormalities. Although only the top abnormality was escalated for biopsy and hence annotated by the radiologists, the radiologist’s report confirms that the two non-biopsied findings have a high probability of benignity and a low probability of malignancy. In the fourth example, we illustrate a case in which there is some level of disagreement between our model and the annotation in the dataset. The malignancy saliency map only highlights part of a large malignant lesion with segmental coarse heterogeneous calcifications. This behavior is related to the design of f_agg: a fixed pooling threshold t cannot be optimal for all sizes of ROI. The impact of f_agg is further studied in Section 3.6. This example also illustrates that while human experts are asked to annotate the entire lesion, CNNs tend to emphasize only the most informative regions. While no benign lesion is present, the benign saliency map still highlights regions similar to those in the malignancy saliency map, but with a lower probability. In fact, calcifications with this morphology and distribution can also result from benign pathophysiology (Liberman and Menell, 2002).

In addition, we observe that GMIC is able to provide meaningful localization even when a lesion is hardly visible to radiologists in the image. In Figure 8, we illustrate a mammographically occult mammogram of a 59-year-old patient with no family history of breast cancer and dense breasts. There is an asymmetry in the left lateral breast posterior depth which appears stable compared to prior mammograms and was determined to be benign by the reading radiologist. However, the saliency map of malignant findings successfully identifies the malignant lesion on the screening mammogram. Same-day screening ultrasound (sagittal image) demonstrated a 1.2 cm irregular mass; ultrasound biopsy yielded moderate-grade invasive ductal carcinoma.
Fig. 7: Visualization of results for four examples. From left to right: input images annotated with segmentation labels (green = benign, red = malignant), locations of ROI patches (blue squares), the saliency map for the benign class, the saliency map for the malignant class, and the ROI patches with their attention scores. The top example contains a circumscribed oval mass in the left upper breast at middle depth which was diagnosed as a benign fibroadenoma by ultrasound biopsy. The second example contains an irregular mass in the right lateral breast at posterior depth which was diagnosed as an invasive ductal carcinoma by ultrasound biopsy. In the third example, the benign saliency map identifies (from top to bottom) (a) a circumscribed oval mass in the lateral breast at middle depth, (b) a smaller circumscribed oval mass in the medial breast, and (c) an asymmetry in the left central breast at middle depth. Ultrasound-guided biopsy of the finding shown in (a) yielded a benign fibroadenoma. The medial breast mass (b) was recommended for short-term follow-up by the breast radiologist. The central breast asymmetry (c) was imaging-proven stable on multiple prior mammograms and benign. The bottom example contains segmental coarse heterogeneous calcifications in the right central breast at middle depth. Stereotactic biopsy yielded high-grade ductal carcinoma in situ. We provide additional visualizations of exams with benign and malignant findings in the Appendix (Figure 14 and Figure 15).

Fig. 8: A mammographically occult example with a biopsy-proven malignant finding. From left to right: the original image, the saliency map for benign findings, the saliency map for malignant findings, and the sagittal ultrasound image of this patient. While the asymmetry in the left lateral breast posterior depth was interpreted as benign by the radiologist, a subsequent screening ultrasound and ultrasound-guided biopsy yielded a mammographically occult moderate-grade invasive ductal carcinoma. On the saliency maps, this area shows a weak probability of benignity and a high probability of malignancy.
Ablation Studies. We performed ablation studies to explore the effectiveness of the global module, local module, fusion module, patch-level attention, and the proposed top t% pooling. In addition, we also assess how much the performance of GMIC could be improved by utilizing pixel-level labels and by ensembling GMIC with DMV-CNN and Faster R-CNN. All ablation experiments are based on the GMIC-ResNet-18 model.

Table 2: Ablation study: effectiveness of incorporating both global and local features. We report the mean and standard deviation of the test AUC for the top-5 GMIC-ResNet-18 models. We experimented with 4 GMIC variants that use ŷ_global, ŷ_local, the average of ŷ_global and ŷ_local, and ŷ_fusion as predictions. The proposed design that uses ŷ_fusion as predictions outperforms all variants.

Prediction              | AUC(M)     | AUC(B)
ŷ_global                | … ± 0.009  | … ± …
ŷ_local                 | … ± 0.004  | … ± …
½(ŷ_local + ŷ_global)   | … ± 0.006  | … ± …
ŷ_fusion                | … ± …      | … ± …

Synergy of Global and Local Information.
In the preliminary version of GMIC (Shen et al., 2019), the final prediction is defined as ½(ŷ_global + ŷ_local). In this work, we enhance GMIC with a fusion module that combines signals from both global features and local details. To empirically evaluate the effectiveness of the fusion module, we compared the performance achieved using only global features (ŷ_global), only local patches (ŷ_local), the average prediction of the two modules (½(ŷ_global + ŷ_local)), and the fusion of the two (ŷ_fusion). As shown in Table 2, ŷ_fusion consistently achieved a higher AUC for classifying both benign and malignant lesions than either ŷ_global or ŷ_local. This result suggests that the fusion module helps GMIC aggregate signals from both the global and local modules. Moreover, ŷ_fusion also outperforms the ensemble prediction ½(ŷ_local + ŷ_global), which further demonstrates that the fusion module promotes an effective synergy beyond the ensembling effect created by averaging predictions over two sets of parameters.

Table 3: To evaluate the effectiveness of the patch-wise attention, we compare the proposed model with a variant (uniform) that always assigns equal attention to all patches. To investigate the importance of the localization information in the saliency maps, we trained another variant (random) that randomly selects patches from the input image. We use the GMIC-ResNet-18 model with top 3% pooling as the base model. The performance of the local module (ŷ_local) is reported.

Attention | ROI patches   | AUC(M)         | AUC(B)
uniform   | retrieve_roi  | 0.874 ± 0.008  | … ± …
gated     | random        | … ± 0.042      | … ± …
gated     | retrieve_roi  | 0.898 ± …      | … ± …
Table 4: Ablation study: effect of different choices of the aggregation function. We report the performance achieved by parameterizing f_agg as global average pooling (GAP), global max pooling (GMP), and top t% pooling. For each setting, we trained five GMIC-ResNet-18 models and report the mean and standard deviation of the AUC and DSC.

f_agg   | AUC(M)     | AUC(B)     | DSC(M)     | DSC(B)
GMP     | … ± 0.02   | … ± 0.012  | … ± 0.052  | … ± …
t = 1%  | … ± 0.01   | … ± 0.007  | … ± 0.030  | … ± …
t = 2%  | … ± 0.009  | … ± 0.007  | … ± 0.013  | … ± …
t = 3%  | … ± …      | … ± …      | … ± 0.036  | … ± …
t = 5%  | … ± 0.009  | … ± 0.002  | … ± …      | … ± …
t = 10% | … ± …      | … ± 0.008  | … ± 0.050  | … ± …
t = 20% | … ± 0.017  | … ± 0.008  | … ± 0.048  | … ± …
GAP     | … ± 0.02   | … ± 0.012  | … ± 0.006  | … ± …

ROI Proposals and Patch-wise Attention.
GMIC applies two mechanisms to control the quality of the patches provided to the local module. First, the retrieve_roi algorithm utilizes the localization information in the saliency maps and greedily selects informative patches of the input image. The selected patches are then weighted using the Gated Attention network. To evaluate the effectiveness of both mechanisms, we trained two variants: one (uniform) that always assigns an equal attention score to each patch and another (random) that randomly samples patches without using the saliency map. As shown in Table 3, if the patch-wise attention is disabled, the AUC of classifying malignant lesions decreases from 0.898 to 0.874. If the retrieve_roi algorithm is replaced with random sampling, the local module suffers a significant performance decrease. These results suggest that both the patch-wise attention and the retrieve_roi procedure are essential for the local module to make accurate predictions.

Fig. 9: The effect of t in the pooling function on the saliency maps. From left to right: the mammogram with ground truth segmentation and the saliency maps generated using GMP, top 3% pooling, top 10% pooling, top 20% pooling, and GAP. The corresponding DSC is specified below each saliency map. A benign lesion is found in the top two examples; a malignant lesion is found in the bottom two examples.

Aggregation Function.
In order to study the impact of the aggregation function, we experimented with 8 parameterizations of f_agg, including GAP, GMP, and top t% pooling with t ∈ {1, 2, 3, 5, 10, 20}. For each parameterization, we fixed the other hyperparameters and trained five GMIC-ResNet-18 models with randomly initialized weights. In Table 4, we report the AUC and DSC achieved for each value of t. GMIC-ResNet-18 achieves the highest AUC in identifying malignant cases when using top t% pooling with t = 2. The performance of top t% pooling decreases as t moves away from 2 and converges to that of GAP/GMP as t becomes large/small. This observation is consistent with the intuition that GAP and GMP are two extremes of top t% pooling. We observe a similar but less pronounced trend in the AUC of identifying benign cases.

GMIC-ResNet-18 also obtains better localization performance with top t% pooling than with GAP or GMP. The highest DSC for localizing malignant and benign lesions is achieved when t is set to 3% and 5%, respectively. To further study the effect of t, we visualize the saliency maps for four examples selected from the test set. As illustrated in Figure 9, when t is small, the saliency maps tend to highlight a small area; when t is large, the highlighted region grows. Ideally, the choice of t should reflect the true size of the lesions contained in the image, and different images could use different values of t. In future research, we propose to learn t using information within the image.

Number of ROI patches.
We experimented with GMIC while varying the number of patches K. For each setting, we trained five GMIC-ResNet-18 models with top t% pooling and report in Figure 10 the test AUC of ŷ_fusion and ŷ_local in classifying benign and malignant lesions. Increasing K improves the classification performance when K is small. The improvement is more evident for ŷ_local than for ŷ_fusion, because ŷ_fusion also utilizes global features. However, for K > 3, the classification performance saturates. This observation demonstrates a trend of diminishing marginal returns from incorporating additional ROI patches.
Utilizing Segmentation Labels.
We also assessed how much the performance of GMIC could be improved by utilizing pixel-level labels during training. Following Wu et al. (2019b), we used the pixel-level labels to train a patch-level model which classifies 256 × 256 patches, producing two heatmaps for each image: one containing an estimated probability of a malignant finding at each pixel and the other containing an estimated probability of a benign finding. In this comparison study, we concatenated the input images with these two heatmaps to train 30 GMIC-ResNet-18 models (referred to as GMIC-ResNet-18-heatmap models) using the hyperparameter optimization setting described in Section 3.3. (The two heatmap channels are only used by the global network f_g; the local network f_l does not use them.) We report the test performance of the top-5 GMIC-ResNet-18-heatmap models that achieved the highest validation AUC in identifying breasts with malignant lesions.

Fig. 10: The classification performance of GMIC-ResNet-18 with a varying number of patches K. For each K, we trained five models and report the mean and the standard deviation of the test AUC in classifying malignant (top) and benign (bottom) lesions. We show the performance of both ŷ_fusion and ŷ_local. The performance saturates for K > 3.
The top-5 GMIC-ResNet-18-heatmap models achieved a slightly higher mean AUC than the vanilla GMIC models in identifying breasts with malignant/benign lesions. As an ensemble, the top-5 GMIC-ResNet-18-heatmap models matched the performance of the vanilla GMIC models. These results suggest that with a sufficiently large dataset, image-level labels alone are powerful enough to capture most of the signal, and the additional localization information from the pixel-level segmentation labels only slightly improves the performance of GMIC. In fact, it might sometimes even bias the model towards ignoring mammographically occult findings.

Fig. 11: Example heatmaps generated by the patch-level model proposed by Wu et al. (2019b). The original image (left), the “benign” heatmap over the image (middle), and the “malignant” heatmap over the image (right).
Ensembling GMIC with Other Models.
In order to estimate a lower bound on the level of performance that is possible to achieve on this task, we built a large “super-ensemble” of models by aggregating the predictions of: a) an ensemble of the top-5 GMIC-ResNet-18 models, b) an ensemble of 5 DMV-CNN models (with heatmaps) (Wu et al., 2019b), and c) an ensemble of 3 Faster R-CNN models (Févry et al., 2019). Similar to the human-machine hybrid model, the predictions of the ensemble model are defined as

    ŷ_ensemble = λ1 ŷ_GMIC + λ2 ŷ_Faster R-CNN + λ3 ŷ_DMV-CNN,

where λ1 + λ2 + λ3 = 1. On the test set, the ensemble model with equal weights for each of its components (λ1 = λ2 = λ3 = 1/3) achieves an AUC of 0.936 in identifying breasts with malignant lesions. We note that the improvement over the top-5 GMIC-ResNet-18 ensemble (0.930) is small. We also note that utilizing this ensemble might be impractical, due to its complexity and computational cost.

We also checked what the AUC of this ensemble would be if we could tune its weighting coefficients on the test set. In Figure 12, we visualize its classification performance on the reader study dataset and the full test set for different combinations of λ1, λ2 and λ3. For the optimal combinations of λ1, λ2, and λ3 that achieve the highest AUC on both datasets, the weight associated with GMIC (λ1) is the largest; however, the two other weights are also non-negligible, suggesting that the three types of models are complementary, even though the improvement in terms of AUC is small.
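The retrospective weight tuning described above amounts to a grid search over the probability simplex; a minimal sketch with an assumed grid step follows. As the text cautions, weights tuned on the test set overstate the performance attainable in practice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_ensemble_weights(y_true, y_gmic, y_frcnn, y_dmvcnn, step=0.05):
    """Grid search over convex weights (lam1, lam2, lam3), lam1+lam2+lam3=1,
    for the three-model ensemble, as visualized in Figure 12."""
    best_w, best_auc = None, -np.inf
    for lam1 in np.arange(0.0, 1.0 + 1e-9, step):
        for lam2 in np.arange(0.0, 1.0 - lam1 + 1e-9, step):
            lam3 = 1.0 - lam1 - lam2
            mix = lam1 * y_gmic + lam2 * y_frcnn + lam3 * y_dmvcnn
            auc = roc_auc_score(y_true, mix)
            if auc > best_auc:
                best_w, best_auc = (lam1, lam2, lam3), auc
    return best_w, best_auc
```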
4. Related Work
The increased resolution of medical images has posed new challenges for machine learning. Early works applying deep neural networks to medical image classification typically utilize a CNN acting on the entire image to generate a prediction, resembling approaches developed for object classification in natural images. For instance, Roth et al. (2015) adopted a 5-layer CNN to perform anatomical classification of CT slices. A similar approach was adopted by Codella et al. (2015) to recognize melanoma on dermoscopy images. More recently, Rajpurkar et al. (2017) fine-tuned a 121-layer DenseNet (Huang et al., 2016) to classify thorax disease on chest X-ray images. However, this line of work suffers from two drawbacks. Unlike many natural images in which ROIs are sufficiently large, ROIs in medical images are typically small and sparsely distributed over the image. Applying a CNN indiscriminately over the entire image may include a considerable level of noise outside the ROI. Moreover, input images are commonly downsampled to fit in GPU memory. Aggressively downsampling medical images can distort important details, making the correct diagnosis difficult (Geras et al., 2017).

In another line of research, input images are uniformly divided into small patches. A classifier is trained and applied to each patch, and the patch-level predictions are aggregated to form an image-level prediction. This family of methods has been commonly applied to the segmentation and classification of pathology images (Campanella et al., 2019; Sun et al., 2019a,b). Coudray et al. (2018) used Inception V3 (Szegedy et al., 2016) on tiles of whole-slide histopathology images to detect adenocarcinoma and squamous cell carcinoma. Sun et al. (2019a) proposed a multi-scale patch-level classifier using dilated convolutions to localize gastric cancer regions. For breast cancer screening, Wu et al. (2019b) utilized patch-level predictions as additional input channels to classify screening mammograms. A major limitation of these methods is that many of them require lesion locations to train the patch-level classifiers, which might be expensive to obtain. Moreover, global information, such as the image structure, can be lost by dividing input images into small patches.

Instead of applying a patch-level model to all tiles, several methods have been proposed to select patches that are related to the classification task. Zhong et al. (2019) suggested selecting important patches based on a coarse attention map generated by applying a U-Net (Ronneberger et al., 2015) to downsampled input images. Guo et al. (2019) adopted a similar strategy to detect strut points in intravascular optical coherence tomography images. Guan et al. (2018) further developed this idea and proposed the attention-guided convolutional neural network (AG-CNN) that explicitly merges information from both the global image and a refined local patch to detect thorax disease on chest X-ray images. Our work is perhaps most similar to Guan et al. (2018). While AG-CNN only selects one patch for each class, our method is able to selectively aggregate information from a variable number of patches, which enables the model to learn from a broader source of signal.

Early works on breast cancer screening exam classification were computer-aided detection (CAD) systems built with hand-crafted features (Li et al., 2001; Wu et al., 2007; Masotti et al., 2009; Oliver et al., 2010). Despite their popularity, clinical studies have suggested that CAD systems do not improve diagnostic accuracy (Lehman et al., 2015).
With the advances in deep learning in the last decade (LeCun et al., 2015), neural networks have been extensively applied to assist radiologists in interpreting screening mammograms (Zhu et al., 2017; Wu et al., 2019b).
Fig. 12:
We visualize the AUC in identifying breasts with malignant findings achieved by the ensemble model with varying λ1, λ2, and λ3 on the reader study dataset (left) and the test set (right). The optimal combination of λ1, λ2, and λ3 that achieves the highest AUC is highlighted with a white diamond. The weight associated with GMIC is the largest among the three models for both datasets. On the reader study dataset, the optimal combination (λ1 = …, λ2 = …, λ3 = 0.24) achieves an AUC of 0.905. On the test set, the optimal combination (λ1 = …, λ2 = …, λ3 = 0.19)
achieves an AUC of 0.939.

Some of these approaches exploit the difference among patches to detect breast masses. Another popular way of utilizing segmentation labels is to train anchor-based object detection models. For instance, Ribli et al. (2018) and Févry et al. (2019) fine-tuned a Faster R-CNN (Ren et al., 2015) to localize lesions on mammograms. Xiao et al. (2019) integrated an object detector in a Siamese structure with explicit loss terms to differentiate anchor proposals containing lesions from those with only normal tissue. We refer the readers to Hamidinekoo et al. (2018); Gao et al. (2019); Geras et al. (2019) for comprehensive reviews of prior work on machine learning for mammography.

Recent progress demonstrates that CNN classifiers, trained with image-level labels, are able to perform semantic segmentation at the pixel level (Oquab et al., 2015; Pinheiro and Collobert, 2015; Bilen and Vedaldi, 2016; Zhou et al., 2016; Diba et al., 2017; Zeng et al., 2019). This is commonly achieved in two steps. First, a backbone CNN converts the input image into a saliency map which highlights the discriminative regions. A global pooling operator then collapses the saliency map into scalar predictions, which makes the entire model trainable end-to-end. Durand et al. (2017) devised a new pooling operator that performs pooling over both the spatial space and the class space. Wei et al. (2018) augmented the backbone network using convolution filters with varying dilation rates to address scale variation among object classes. Zhu et al. (2019) refined segmentation masks using pseudo-supervision from noisy segment proposals.

Weakly supervised object detection (WSOD) has become increasingly popular in the field of medical image analysis, as it eliminates the reliance of models on segmentation labels, which are often expensive to obtain. WSOD has been broadly utilized in medical applications including disease classification (Yao et al., 2018; Liu et al., 2019), cell segmentation (Li et al., 2019; Yoo et al., 2019), and lesion detection (Xu et al., 2014; Luo et al., 2019; Wu et al., 2019a). Schlemper et al. (2019) designed a novel attention gate unit that can be integrated with standard CNN classifiers to localize objects of interest in ultrasound images. Ouyang et al. (2019) proposed a spatial smoothing regularization to model the uncertainty associated with the segmentation mask. Kervadec et al. (2019) demonstrated that regularization terms stemming from inequality constraints can significantly improve the localization performance of a weakly supervised model. While many works still rely on weak localization labels such as point annotations (Yoo et al., 2019) and scribbles (Ji et al., 2019) to produce saliency maps, our approach requires only image-level labels that indicate the presence of an object of a given class. In addition, to make an image-level prediction, most existing models only utilize global information from the saliency maps, which often neglects fine-grained details. In contrast, our model also leverages local information from ROI patches using a dedicated network. In Section 3.6, we empirically demonstrate that the ability to focus on fine visual detail is important for classification.
5. Discussion and Conclusion
Medical images differ from typical natural images in many ways, such as much higher resolutions and smaller ROIs. Moreover, both the global structure and local details play essential roles in the classification of medical images. Because of these differences, deep neural network architectures that work well for natural images might not be applicable to many medical image classification tasks. In this work, we present a novel framework, GMIC, to classify high-resolution screening mammograms. GMIC first applies a low-capacity, yet memory-efficient, global module to the whole image to extract the global context and generate saliency maps that provide coarse localization of possible benign/malignant findings. It then identifies the most informative regions in the image and utilizes a local module with higher capacity to extract fine-grained visual details from the chosen regions. Finally, it employs a fusion module that aggregates information from both the global context and local details to produce the final prediction.
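The region-selection step can be made concrete with a short sketch. This is an assumption-laden illustration (greedy non-overlapping crops of a fixed size), not the exact retrieval procedure of the paper; image height and width are assumed to be at least the patch size.

# Greedy ROI retrieval from a saliency map: repeatedly crop around the
# hottest location, then suppress that region before the next pick.
import torch
import torch.nn.functional as F

def retrieve_roi_patches(image, saliency, k=6, patch=256):
    """image: (C, H, W) tensor; saliency: (h, w) map from the global module."""
    # upsample the coarse saliency map to the input resolution
    sal = F.interpolate(saliency[None, None], size=image.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    crops = []
    for _ in range(k):
        idx = int(torch.argmax(sal))            # flat index of the hottest pixel
        y, x = idx // sal.shape[1], idx % sal.shape[1]
        y0 = min(max(y - patch // 2, 0), sal.shape[0] - patch)
        x0 = min(max(x - patch // 2, 0), sal.shape[1] - patch)
        crops.append(image[:, y0:y0 + patch, x0:x0 + patch])
        sal[y0:y0 + patch, x0:x0 + patch] = -1.0  # suppress the chosen region
    return torch.stack(crops)                    # (k, C, patch, patch)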
Fig. 13: Learning curves for a GMIC-ResNet-18 model. The AUC for malignancy prediction on the validation set (y-axis) is shown for ŷ_fusion, ŷ_global, and ŷ_local over 50 training epochs (x-axis).
Our approach is well-suited to the unique properties of medical images. GMIC is capable of processing input images in a memory-efficient manner, and is thus able to handle medical images in their original resolutions while still using a high-capacity neural network to pick up fine visual details. Moreover, despite being trained with only image-level labels, GMIC is able to generate pixel-level saliency maps that provide additional interpretability.

We applied GMIC to interpret screening mammograms: predicting the presence or absence of malignant and benign lesions in a breast. Evaluated on a large mammography dataset, the proposed model outperforms ResNet-34 while being 4.1x faster for inference and using 78.4% less GPU memory. Moreover, we also demonstrated that our model can generate predictions that are as accurate as radiologists', given equivalent input information. Given its generic design, the proposed model could be widely applicable to various high-resolution image classification tasks. In future research, we would like to extend this framework to other imaging modalities such as ultrasound, tomosynthesis, and MRI.

In addition, we note that training GMIC is slightly more complex than training a standard ResNet model. As shown in Figure 13, the learning speeds of the global and local modules are different. As learning of the global module stabilizes, the saliency maps tend to highlight a fixed set of regions in each example, which decreases the diversity of patches provided to the local module. This causes the local module to overfit, and its validation AUC to decrease. We speculate that GMIC could benefit from a curriculum that optimally coordinates the learning of both modules. A learnable strategy, such as the one proposed in Katharopoulos and Fleuret (2019), could help to jointly train both the global and local modules.
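In practice, this failure mode can be caught by tracking the three heads' validation AUCs separately. A small hedged sketch follows, with train_one_epoch and evaluate as hypothetical helpers, not the released training code.

# Checkpoint on the fusion head while watching for the local-head
# overfitting visible in Fig. 13.
import torch

best_fusion_auc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(50):
    train_one_epoch(model, train_loader)
    # evaluate() is assumed to return validation AUCs per head,
    # e.g. {"global": 0.87, "local": 0.84, "fusion": 0.89}
    aucs = evaluate(model, val_loader)
    if aucs["fusion"] > best_fusion_auc:
        best_fusion_auc, bad_epochs = aucs["fusion"], 0
        torch.save(model.state_dict(), "best_gmic.pt")
    else:
        bad_epochs += 1  # a falling local-head AUC typically shows up here first
    if bad_epochs >= patience:
        break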
Acknowledgments

The authors would like to thank Joe Katsnelson, Mario Videna and Abdul Khaja for supporting our computing environment. We also gratefully acknowledge the support of Nvidia Corporation with the donation of some of the GPUs used in this research. This work was supported in part by grants from the National Institutes of Health (R21CA225175 and P41EB017183).
References
Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
Bergstra, J., Bengio, Y., 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13.
Bilen, H., Vedaldi, A., 2016. Weakly supervised deep detection networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854.
Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Silva, V.W.K., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J., 2019. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25, 1301–1309.
Canziani, A., Paszke, A., Culurciello, E., 2016. An analysis of deep neural network models for practical applications. arXiv:1605.07678.
Codella, N., Cai, J., Abedini, M., Garnavi, R., Halpern, A., Smith, J.R., 2015. Deep learning, sparse coding, and SVM for melanoma recognition in dermoscopy images, in: International Workshop on Machine Learning in Medical Imaging, Springer. pp. 118–126.
Coudray, N., Ocampo, P.S., Sakellaropoulos, T., Narula, N., Snuderl, M., Fenyö, D., Moreira, A.L., Razavian, N., Tsirigos, A., 2018. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature Medicine 24, 1559.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 248–255.
DeSantis, C.E., Ma, J., Goding Sauer, A., Newman, L.A., Jemal, A., 2017. Breast cancer statistics, 2017, racial disparity in mortality by state. CA: A Cancer Journal for Clinicians 67, 439–448.
Diba, A., Sharma, V., Pazandeh, A., Pirsiavash, H., Van Gool, L., 2017. Weakly supervised cascaded convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 914–922.
Dietterich, T.G., 2000. Ensemble methods in machine learning, in: International Workshop on Multiple Classifier Systems, Springer. pp. 1–15.
D'Orsi, C.J., 2013. ACR BI-RADS atlas: breast imaging reporting and data system. American College of Radiology.
Duffy, S.W., Tabár, L., Chen, H.H., Holmqvist, M., Yen, M.F., Abdsalah, S., Epstein, B., Frodis, E., Ljungberg, E., Hedborg-Melander, C., et al., 2002. The impact of organized mammography service screening on breast carcinoma mortality in seven Swedish counties: a collaborative evaluation. Cancer: Interdisciplinary International Journal of the American Cancer Society 95, 458–469.
Durand, T., Mordan, T., Thome, N., Cord, M., 2017. WILDCAT: Weakly supervised learning of deep ConvNets for image classification, pointwise localization and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 642–651.
Feng, X., Yang, J., Laine, A.F., Angelini, E.D., 2017. Discriminative localization in CNNs for weakly-supervised segmentation of pulmonary nodules, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 568–576.
Févry, T., Phang, J., Wu, N., Kim, S., Moy, L., Cho, K., Geras, K.J., 2019. Improving localization-based approaches for breast cancer screening exam classification. arXiv:1908.00615.
Gao, Y., Geras, K.J., Lewin, A.A., Moy, L., 2019. New frontiers: An update on computer-aided diagnosis for breast imaging in the age of artificial intelligence. American Journal of Roentgenology 212, 300–307.
Geras, K.J., Mann, R.M., Moy, L., 2019. Artificial intelligence for mammography and digital breast tomosynthesis: Current concepts and future perspectives. Radiology 293, 246–259.
Geras, K.J., Shen, Y., Wolfson, S., Kim, S.G., Moy, L., Cho, K., 2017. High-resolution breast cancer screening with multi-view deep convolutional neural networks. arXiv:1703.07047v2.
Guan, Q., Huang, Y., Zhong, Z., Zheng, Z., Zheng, L., Yang, Y., 2018. Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification. arXiv:1801.09927.
Guo, Y., Bi, L., Kumar, A., Gao, Y., Zhang, R., Feng, D., Wang, Q., Kim, J., 2019. Deep local-global refinement network for stent analysis in IVOCT images, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 539–546.
Hagos, Y.B., Mérida, A.G., Teuwen, J., 2018. Improving breast cancer detection using symmetry information with deep learning, in: Image Analysis for Moving Organ, Breast, and Thoracic Images. Springer, pp. 90–97.
Hamidinekoo, A., Denton, E., Rampun, A., Honnor, K., Zwiggelaar, R., 2018. Deep learning in mammography and breast histology, an overview and future trends. Medical Image Analysis 47, 45–67.
He, K., Zhang, X., Ren, S., Sun, J., 2016a. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
He, K., Zhang, X., Ren, S., Sun, J., 2016b. Identity mappings in deep residual networks, in: European Conference on Computer Vision, Springer. pp. 630–645.
Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2016. Densely connected convolutional networks. arXiv:1608.06993.
Ilse, M., Tomczak, J.M., Welling, M., 2018. Attention-based deep multiple instance learning. arXiv:1802.04712.
Ji, Z., Shen, Y., Ma, C., Gao, M., 2019. Scribble-based hierarchical weakly supervised learning for brain tumor segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 175–183.
Katharopoulos, A., Fleuret, F., 2019. Processing megapixel images with deep attention-sampling models. arXiv:1905.03711.
Kervadec, H., Dolz, J., Tang, M., Granger, E., Boykov, Y., Ayed, I.B., 2019. Constrained-CNN losses for weakly supervised segmentation. Medical Image Analysis 54, 88–99.
Kim, E.K., Kim, H.E., Han, K., Kang, B.J., Sohn, Y.M., Woo, O.H., Lee, C.W., 2018. Applying data-driven imaging biomarker in mammography for breast cancer screening: preliminary study. Scientific Reports 8.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Kooi, T., Karssemeijer, N., 2017. Classifying symmetrical differences and temporal change for the detection of malignant masses in mammography using deep neural networks. Journal of Medical Imaging 4, 044501.
Kopans, D.B., 2002. Beyond randomized controlled trials: organized mammographic screening substantially reduces breast carcinoma mortality. Cancer 94.
Kopans, D.B., 2015. An open letter to panels that are deciding guidelines for breast cancer screening. Breast Cancer Research and Treatment 151, 19–25.
Kyono, T., Gilbert, F.J., van der Schaar, M., 2018. MAMMO: A deep learning solution for facilitating radiologist-machine collaboration in breast cancer diagnosis. arXiv:1811.02661.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
Lehman, C.D., Arao, R.F., Sprague, B.L., Lee, J.M., Buist, D.S., Kerlikowske, K., Henderson, L.M., Onega, T., Tosteson, A.N., Rauscher, G.H., et al., 2016. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology 283, 49–58.
Lehman, C.D., Wellman, R.D., Buist, D.S., Kerlikowske, K., Tosteson, A.N., Miglioretti, D.L., 2015. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Internal Medicine 175, 1828–1837.
Li, C., Wang, X., Liu, W., Latecki, L.J., Wang, B., Huang, J., 2019. Weakly supervised mitosis detection in breast histopathology images using concentric loss. Medical Image Analysis 53, 165–178.
Li, L., Zheng, Y., Zhang, L., Clark, R.A., 2001. False-positive reduction in CAD mass detection using a competitive classification strategy. Medical Physics 28, 250–258.
Liberman, L., Menell, J.H., 2002. Breast imaging reporting and data system (BI-RADS). Radiologic Clinics 40, 409–430.
Liu, J., Zhao, G., Fei, Y., Zhang, M., Wang, Y., Yu, Y., 2019. Align, attend and locate: Chest X-ray diagnosis via contrast induced attention network with limited supervision, in: International Conference on Computer Vision, pp. 10632–10641.
Lotter, W., Sorensen, G., Cox, D., 2017. A multi-scale CNN and curriculum learning strategy for mammogram classification, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 169–177.
Luo, L., Chen, H., Wang, X., Dou, Q., Lin, H., Zhou, J., Li, G., Heng, P.A., 2019. Deep angular embedding and feature correlation attention for breast MRI cancer analysis. arXiv:1906.02999.
Luong, M.T., Pham, H., Manning, C.D., 2015. Effective approaches to attention-based neural machine translation. arXiv:1508.04025.
Masotti, M., Lanconelli, N., Campanini, R., 2009. Computer-aided mass detection in mammography: False positive reduction via gray-scale invariant ranklet texture features. Medical Physics 36, 311–316.
McKinney, S.M., Sieniek, M., Godbole, V., Godwin, J., Antropova, N., Ashrafian, H., Back, T., Chesus, M., Corrado, G.C., Darzi, A., et al., 2020. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94.
Oliver, A., Freixenet, J., Marti, J., Perez, E., Pont, J., Denton, E.R., Zwiggelaar, R., 2010. A review of automatic mass detection and segmentation in mammographic images. Medical Image Analysis 14, 87–110.
Oquab, M., Bottou, L., Laptev, I., Sivic, J., 2015. Is object localization for free? Weakly-supervised learning with convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685–694.
Ouyang, X., Xue, Z., Zhan, Y., Zhou, X.S., Wang, Q., Zhou, Y., Wang, Q., Cheng, J.Z., 2019. Weakly supervised segmentation framework with uncertainty: A study on pneumothorax segmentation in chest X-ray, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 613–621.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A., 2017. Automatic differentiation in PyTorch.
Pereira, S.M.P., McCormack, V.A., Moss, S.M., dos Santos Silva, I., 2009. The spatial distribution of radiodense breast tissue: a longitudinal study. Breast Cancer Research 11, R33.
Pinheiro, P.O., Collobert, R., 2015. From image-level to pixel-level labeling with convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713–1721.
Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et al., 2017. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv:1711.05225.
Rampun, A., López-Linares, K., Morrow, P.J., Scotney, B.W., Wang, H., Ocaña, I.G., Maclair, G., Zwiggelaar, R., Ballester, M.A.G., Macía, I., 2019. Breast pectoral muscle segmentation in mammograms using a modified holistically-nested edge detection network. Medical Image Analysis.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, pp. 91–99.
Ribli, D., Horváth, A., Unger, Z., Pollner, P., Csabai, I., 2018. Detecting and classifying lesions in mammograms with deep learning. Scientific Reports 8.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.
Roth, H.R., Lee, C.T., Shin, H.C., Seff, A., Kim, L., Yao, J., Lu, L., Summers, R.M., 2015. Anatomy-specific classification of medical images using deep convolutional nets. arXiv:1504.04003.
Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D., 2019. Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis 53, 197–207.
Sedai, S., Mahapatra, D., Ge, Z., Chakravorty, R., Garnavi, R., 2018. Deep multiscale convolutional feature learning for weakly supervised localization of chest pathologies in X-ray images, in: International Workshop on Machine Learning in Medical Imaging, Springer. pp. 267–275.
Shen, L., 2017. End-to-end training for whole image breast cancer diagnosis using an all convolutional design. arXiv:1711.05775.
Shen, Y., Wu, N., Phang, J., Park, J., Kim, G., Moy, L., Cho, K., Geras, K.J., 2019. Globally-aware multiple instance classifier for breast cancer screening, in: Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with International Conference on Medical Image Computing and Computer-Assisted Intervention 2019, Shenzhen, China, October 13, 2019, Proceedings, Springer. p. 18.
Siegel, R.L., Miller, K.D., Jemal, A., 2019. Cancer statistics, 2019. CA: A Cancer Journal for Clinicians 69, 7–34.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Sun, M., Zhang, G., Dang, H., Qi, X., Zhou, X., Chang, Q., 2019a. Accurate gastric cancer segmentation in digital pathology images using deformable convolution and multi-scale embedding networks. IEEE Access.
Sun, M., Zhou, W., Qi, X., Zhang, G., Girnita, L., Seregard, S., Grossniklaus, H.E., Yao, Z., Zhou, X., Stålhammar, G., 2019b. Prediction of BAP1 expression in uveal melanoma using densely-connected deep classification networks. Cancers 11, 1579.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
Tan, M., Le, Q.V., 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv:1905.11946.
Teare, P., Fishman, M., Benzaquen, O., Toledano, E., Elnekave, E., 2017. Malignancy detection on mammography using dual deep convolutional neural networks and genetically discovered false color input enhancement. Journal of Digital Imaging 30, 499–505.
Van Gils, C.H., Otten, J.D., Verbeek, A.L., Hendriks, J.H., 1998. Mammographic breast density and risk of breast cancer: masking bias or causality? European Journal of Epidemiology 14, 315–320.
Wei, J., Chan, H.P., Wu, Y.T., Zhou, C., Helvie, M.A., Tsodikov, A., Hadjiiski, L.M., Sahiner, B., 2011. Association of computerized mammographic parenchymal pattern measure with breast cancer risk: a pilot case-control study. Radiology 260, 42–49.
Wei, Y., Xiao, H., Shi, H., Jie, Z., Feng, J., Huang, T.S., 2018. Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7268–7277.
Wu, K., Du, B., Luo, M., Wen, H., Shen, Y., Feng, J., 2019a. Weakly supervised brain lesion segmentation via attentional representation learning, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 211–219.
Wu, N., Geras, K.J., Shen, Y., Su, J., Kim, S.G., Kim, E., Wolfson, S., Moy, L., Cho, K., 2018. Breast density classification with deep convolutional neural networks, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE. pp. 6682–6686.
Wu, N., Phang, J., Park, J., Shen, Y., Huang, Z., Zorin, M., Jastrzebski, S., Févry, T., Katsnelson, J., Kim, E., et al., 2019b. Deep neural networks improve radiologists' performance in breast cancer screening. arXiv:1903.08297.
Wu, N., Phang, J., Park, J., Shen, Y., Kim, S.G., Heacock, L., Moy, L., Cho, K., Geras, K.J., 2019c. The NYU Breast Cancer Screening Dataset v1.0. Technical Report. Available at https://cs.nyu.edu/~kgeras/reports/datav1.0.pdf.
Wu, Y.T., Wei, J., Hadjiiski, L.M., Sahiner, B., Zhou, C., Ge, J., Shi, J., Zhang, Y., Chan, H.P., 2007. Bilateral analysis based false positive reduction for computer-aided mass detection. Medical Physics 34, 3334–3344.
Xiao, L., Zhu, C., Liu, J., Luo, C., Liu, P., Zhao, Y., 2019. Learning from suspected target: Bootstrapping performance for breast cancer detection in mammography, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 468–476.
Xu, Y., Zhu, J.Y., Eric, I., Chang, C., Lai, M., Tu, Z., 2014. Weakly supervised histopathology cancer image segmentation and classification. Medical Image Analysis 18, 591–604.
Yao, L., Prosky, J., Poblenz, E., Covington, B., Lyman, K., 2018. Weakly supervised medical diagnosis and localization from multiple resolutions. arXiv:1803.07703.
Yoo, I., Yoo, D., Paeng, K., 2019. PseudoEdgeNet: Nuclei segmentation only with point annotations. arXiv:1906.02924.
Zeng, Y., Zhuge, Y., Lu, H., Zhang, L., 2019. Joint learning of saliency detection and weakly supervised semantic segmentation, in: International Conference on Computer Vision, pp. 7223–7233.
Zhong, Z., Li, J., Zhang, Z., Jiao, Z., Gao, X., 2019. An attention-guided deep regression model for landmark detection in cephalograms. arXiv:1906.07549.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A., 2016. Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.
Zhu, W., Lou, Q., Vang, Y.S., Xie, X., 2017. Deep multi-instance networks with sparse label assignment for whole mammogram classification, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 603–611.
Zhu, Y., Zhou, Y., Xu, H., Ye, Q., Doermann, D., Jiao, J., 2019. Learning instance activation maps for weakly supervised instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3116–3125.
Fig. 14: Additional visualizations of benign examples. We follow the same layout as described in Figure 7. Input images are annotated with segmentation labels (green = benign, red = malignant). ROI patches are shown with their attention scores.
Fig. 15: Additional visualizations of malignant examples. We follow the same layout as described in Figure 7. Input images are annotated with segmentation labels (green = benign, red = malignant). ROI patches are shown with their attention scores.