End-to-end Prostate Cancer Detection in bpMRI via 3D CNNs: Effects of Attention Mechanisms, Clinical Priori and Decoupled False Positive Reduction
Anindo Saha a,∗, Matin Hosseinzadeh a,∗, Henkjan Huisman a
a Diagnostic Image Analysis Group, Radboud University Medical Center, The Netherlands
Abstract –
We present a novel multi-stage 3D computer-aided detection and diagnosis (CAD) model for automated localization of clinically significant prostate cancer (csPCa) in bi-parametric MR imaging (bpMRI). Deep attention mechanisms drive its detection network, targeting multi-resolution, salient structures and highly discriminative feature dimensions, in order to accurately identify csPCa lesions amidst indolent cancer and the wide range of benign pathology that can afflict the prostate gland. In parallel, a decoupled residual classifier is used to achieve consistent false positive reduction, without sacrificing high sensitivity or computational efficiency. Furthermore, a probabilistic anatomical prior, which captures the spatial prevalence of csPCa as well as its zonal distinction, is computed and encoded into the CNN architecture to guide model generalization with domain-specific clinical knowledge. For 486 institutional testing scans, the 3D CAD system achieves 83.69 ± … detection sensitivity, with an agreement of kappa = … against expert readers.

Keywords – prostate cancer · magnetic resonance imaging · convolutional neural network · computer-aided detection and diagnosis · anatomical prior · deep attention
1. Introduction
Prostate cancer (PCa) is one of the most prevalent cancers in men worldwide. It is estimated that, as of January 2019, over 45% of all men living with a history of cancer in the United States had suffered from PCa (Miller et al., 2019). One of the main challenges surrounding the accurate diagnosis of PCa is its broad spectrum of clinical behavior. PCa lesions can range from low-grade, benign tumors that never progress into clinically significant disease to highly aggressive, invasive malignancies, i.e. clinically significant PCa (csPCa), that can rapidly advance towards metastasis and death (Johnson et al., 2014). In clinical practice, prostate biopsies are used to histologically assign a Gleason Score (GS) to each lesion as a measure of cancer aggressiveness (Epstein et al., 2016). Non-targeted transrectal ultrasound (TRUS) is generally employed to guide biopsy extractions, but it is severely prone to underdetection of csPCa and overdiagnosis of indolent PCa (Verma et al., 2017). Prostate MR imaging can compensate for these limitations of TRUS (Johnson et al., 2014; Israël et al., 2020; Engels et al., 2020). Negative MRI can rule out unnecessary biopsies by 23–45% (Kasivisvanathan et al., 2018; van der Leest et al., 2019; Elwenspoek et al., 2019; Rouvière et al., 2019). Prostate Imaging Reporting and Data System: Version 2 (PI-RADS v2) (Weinreb et al., 2016) is a guideline for reading and acquiring prostate MRI, following a qualitative and semi-quantitative assessment that mandates substantial expertise for proper usage. Meanwhile, csPCa can manifest as multifocal lesions of different shapes and sizes, bearing a strong resemblance to numerous non-malignant conditions (as seen in Fig. 1).

∗ Authors with equal contribution to this research. e-mail: [email protected] (Anindo Saha)
Algorithm and source code have been made publicly available at:
https://grand-challenge.org/algorithms/{to-be-announced}
https://github.com/DIAGNijmegen/{to-be-announced}
In the absence of experienced radiologists, these factors can lead to low inter-reader agreement (< …). The advent of deep convolutional neural networks (CNN) has paved the way for powerful computer-aided detection and diagnosis (CAD) systems that rival human performance (Esteva et al., 2017; McKinney et al., 2020).
Fig. 1. The challenge of discriminating csPCa due to its morphological heterogeneity. (a-b) T2-weighted imaging (T2W), (c-d) diffusion-weighted imaging (DWI) and (e-f) apparent diffusion coefficient (ADC) maps constituting the prostate bpMRI scans for two different patients are shown above, where yellow contours indicate csPCa lesions. While one of the patients has large, severe csPCa developing from both ends (top row), the other is afflicted by a single, relatively focal csPCa lesion surrounded by perceptually similar nodules of benign prostatic hyperplasia (BPH) (bottom row). Furthermore, normalized intensity histograms (right) compiled from all 2733 scans used in this study reveal a large overlap between the distributions of csPCa and non-malignant prostatic tissue for all three MRI channels.
Cao et al. (2019a) proposed FocalNet for joint csPCa detection and GS prediction. Over 5-fold cross-validation using 417 patient scans, FocalNet achieved 87.9% sensitivity at 1.0 false positive per patient. Meanwhile, Yu et al. (2020a) proposed a dual-stage 2D U-Net for csPCa detection, where the second-stage module is an integrated network for false positive reduction.

Cancerous lesions stemming from the prostatic peripheral zone (PZ) exhibit different morphology and pathology than those developing from the transitional zone (TZ) (Chen et al., 2000; Weinreb et al., 2016; Israël et al., 2020). Hosseinzadeh et al. (2019) highlight the merits of utilizing this priori through an early fusion of probabilistic zonal segmentations inside a 2D CAD system. The study demonstrated that the inclusion of PZ and TZ segmentations can introduce an average increase of 5.3% detection sensitivity, between 0.5–2.0 false positives per patient. In a separate study, Cao et al. (2019b) constructed a probabilistic 2D prevalence map from 1055 MRI slices. Depicting the typical sizes, shapes and locations of malignancy across the prostate anatomy, this map was used to weakly supervise a 2D U-Net for PCa detection. Both methods underline the value of clinical priori and anatomical features, factors known to play an equally important role in classical machine learning-based solutions (Litjens et al., 2014; Lemaître et al., 2017).

The vast majority of CAD systems for csPCa operate solely on a 2D basis, citing computational limitations and the non-isotropic imaging protocol of prostate MRI as their primary rationale. Yoo et al. (2019) tackled this challenge by employing dedicated 2D ResNets for each slice in a patient scan and aggregating all slice-level predictions with a Random Forest classifier. Aldoj et al. (2020) proposed a patch-based approach, passing highly-localized regions of interest (ROI) through a standard 3D CNN. Alkadi et al. (2019) followed a 2.5D approach as a compromise solution, sacrificing the ability to harness multiple MRI channels for an additional pseudo-spatial dimension.
In this research, we harmonize several state-of-the-art techniques from recent literature to present a novel end-to-end 3D CAD system that generates voxel-level detections of csPCa in prostate MRI. Key contributions of our study are as follows:

• We examine a detection network with dual-attention mechanisms, which can adaptively target highly discriminative feature dimensions and spatially salient prostatic structures in bpMRI, across multiple resolutions, to reach peak detection sensitivity at lower false positive rates.

• We study the effect of employing a residual patch-wise 3D classifier for decoupled false positive reduction and we investigate its utility in improving baseline specificity, without sacrificing high detection sensitivity.

• We develop a probabilistic anatomical prior, capturing the spatial prevalence and zonal distinction of csPCa from a large training dataset of 1584 MRI scans. We investigate the impact of encoding the computed prior into our CNN architecture and we evaluate its ability to guide model generalization with domain-specific clinical knowledge.

• We evaluate model performance across large, multi-institutional testing datasets: 486 institutional and 296 external patient scans annotated using PI-RADS v2 and GS grades, respectively. Our benchmark includes a consensus score of expert radiologists to assess clinical viability.
2. Material and Methods
The primary dataset was a cohort of 2436 prostate MRI scans from Radboud University Medical Center (RUMC), acquired over the period January 2016 – January 2018. All cases were paired with radiologically-estimated annotations of csPCa derived via PI-RADS v2. From here, 1584 (65%), 366 (15%) and 486 (20%) patient scans were split into training, validation and testing (TS1) sets, respectively, via double-stratified sampling. Additionally, 296 prostate bpMRI scans from Ziekenhuisgroep Twente (ZGT), acquired over the period March 2015 – January 2017, were used to curate an external testing set (TS2). TS2 annotations included biopsy-confirmed GS grades.
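The double-stratified split described above can be sketched as grouping cases by a joint stratification key and splitting each group by the target fractions. The paper does not specify the two stratification variables, so the key here (e.g. malignancy and lesion multifocality) is a hypothetical illustration:

```python
import random
from collections import defaultdict

def double_stratified_split(cases, keys, fractions=(0.65, 0.15, 0.20), seed=42):
    """Split case IDs into train/val/test sets, stratifying on a joint key.

    `cases` is a list of case IDs; `keys[i]` is a tuple of the two
    stratification variables for cases[i] (hypothetical example:
    (malignant: bool, multifocal: bool)).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for case, key in zip(cases, keys):
        groups[key].append(case)
    splits = ([], [], [])
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        splits[0].extend(members[:n_train])              # training set
        splits[1].extend(members[n_train:n_train + n_val])  # validation set
        splits[2].extend(members[n_train + n_val:])      # testing set
    return splits
```

Because each stratum is split independently, the marginal distribution of both key variables is approximately preserved across the three sets.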
Patients were biopsy-naive men (RUMC: median age: 66 yrs, IQR: 61–70; ZGT: median age: 65 yrs, IQR: 59–68) with elevated levels of PSA (RUMC: median level: 8 ng/mL, IQR: 5–11; ZGT: median level: 6.6 ng/mL, IQR: 5.1–8.7). Imaging was performed on 3T MR scanners (RUMC: …, ZGT: …; Siemens Healthineers, Erlangen). In both cases, acquisitions were obtained following standard mpMRI protocols in compliance with PI-RADS v2 (Engels et al., 2020). Given the limited role of dynamic contrast-enhanced (DCE) imaging in mpMRI, in recent years, bpMRI has emerged as a practical alternative, achieving similar performance, while saving time and the use of contrast agents (Turkbey et al., 2019; Bass et al., 2020). Similarly, in this study, we used bpMRI sequences only, which included T2-weighted (T2W) and diffusion-weighted imaging (DWI). Apparent diffusion coefficient (ADC) maps and high b-value DWI (b > …) were computed from the raw DWI scans. Prior to usage, all scans were spatially resampled to a common axial in-plane resolution of 0.5 mm and slice thickness of 3.6 mm via B-spline interpolation. Due to the standardized precautionary measures (e.g. minimal temporal difference between acquisitions, administration of antispasmodic agents to reduce bowel motility, use of rectal catheter to minimize distension, etc.) (Engels et al., 2020) taken in the imaging protocol, we observed negligible patient motion across the different sequences. Thus, no additional registration techniques were applied, in agreement with clinical recommendations (Epstein et al., 2016) and recent studies (Cao et al., 2019a).

All patient scans from RUMC and ZGT were reviewed by expert radiologists using PI-RADS v2. For this study, we flagged any detected lesions marked PI-RADS 4 or 5 as csPCa (PR). When independently assigned PI-RADS scores were discordant, a consensus was reached through joint assessment.
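The spatial resampling step above (0.5 mm in-plane, 3.6 mm slices, B-spline interpolation) can be sketched with `scipy.ndimage.zoom`, whose `order=3` setting performs cubic B-spline interpolation; the (z, y, x) axis ordering is an assumption about how the volumes are stored:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_common_grid(volume, spacing, target_spacing=(3.6, 0.5, 0.5), order=3):
    """Resample a 3D scan (depth, height, width) to the common grid used
    in the paper: 0.5 mm axial in-plane resolution, 3.6 mm slice thickness.

    `spacing` is the source voxel spacing in mm, ordered (z, y, x);
    `order=3` selects cubic B-spline interpolation in scipy.
    """
    factors = tuple(s / t for s, t in zip(spacing, target_spacing))
    return zoom(volume, factors, order=order)
```

For example, a scan stored at 1.0 mm in-plane resolution and 3.6 mm slice thickness doubles its in-plane matrix size while keeping its slice count.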
All instances of csPCa (PR) were then carefully delineated on a voxel-level basis by trained students under the supervision of expert radiologists. For the ZGT dataset, all patients underwent TRUS-guided biopsies performed by a urologist, blinded to the imaging results. In the presence of any suspicious lesions (PI-RADS 3-5), patients also underwent in-bore MRI-guided biopsies as detailed in van der Leest et al. (2019). Tissue samples were reviewed by experienced uropathologists, where cores containing cancer were assigned GS grades in compliance with the 2014 International Society of Urologic Pathology (ISUP) guidelines (Epstein et al., 2016). Any lesion graded GS > 3+3 was flagged as csPCa (GS), and subsequently delineated by trained students on a voxel-level basis.

Upon complete annotation, the RUMC and ZGT datasets contained 1527 and 210 benign cases, along with 909 and 86 malignant cases (≥ 1 csPCa lesion), respectively. Moreover, on a lesion-level basis, the RUMC dataset contained 1095 csPCa (PR) lesions (mean frequency: 1.21 lesions per malignant scan; median size: 1.05 cm³, range: 0.01–61.49 cm³), while the ZGT dataset contained 90 csPCa (GS) lesions (mean frequency: 1.05 lesions per malignant scan; median size: 1.69 cm³, range: 0.23–22.61 cm³).

Multi-class segmentations of prostatic TZ and PZ were generated for each scan in the training dataset using a multi-planar, anisotropic 3D U-Net from a separate study (Riepe et al., 2020), where the network achieved an average Dice Similarity Coefficient of 0.90 ± … .

The architecture of our proposed CAD solution comprises two parallel 3D CNNs (M1, M2) followed by a decision fusion node N_DF, as shown in Fig. 2.
[Fig. 2 schematic: T2W, DWI and ADC scans undergo intensity normalization; the multi-channel whole volume x1 [1, 144, 144, 18, 4], with early fusion of the probabilistic anatomical prior P (built from the prostate zonal segmentations and tumor annotations of all training cases), feeds the Dual-Attention U-Net detector M1 (focal loss, Adam optimizer) to produce a preliminary detection y1; multi-channel patches x2 [8, 64, 64, 8, 3] feed the residual classifier M2 (balanced cross-entropy loss, AMSBound optimizer) to produce a soft malignancy score y2 for each patch; decision fusion N_DF yields the processed detection y_DF with reduced false positives. Tensor shapes: [number of samples, width, height, depth, number of channels].]
Fig. 2. Proposed end-to-end framework for computing voxel-level detections of csPCa in validation/test samples of prostate bpMRI. The model center-crops two ROIs from the multi-channel concatenation of the patient's T2W, DWI and ADC scans for the input of its detection and classification 3D CNN sub-models (M1, M2). M1 leverages an anatomical prior P in its input x1 to synthesize spatial priori and generate a preliminary detection y1. M2 infers on a set of overlapping patches x2 and maps them to a set of probabilistic malignancy scores y2. Decision fusion node N_DF aggregates y1, y2 to produce the model output y_DF in the form of a post-processed csPCa detection map with high sensitivity and reduced false positives.

Based on our observations in previous work (Hosseinzadeh et al., 2019; Riepe et al., 2020), we opted for anisotropically-strided 3D convolutions in both M1 and M2 to process the bpMRI data, which resemble multi-channel stacks of 2D images rather than full 3D volumes. T2W and DWI channels were normalized to zero mean and unit standard deviation, while ADC channels were linearly normalized from [0, 3000] to [0, 1] in order to retain their clinically relevant numerical significance (Israël et al., 2020). Anatomical prior P, constructed using the prostate zonal segmentations and csPCa (PR) annotations in the training dataset, is encoded in M1 to infuse spatial priori. At train-time, M1 and M2 are independently optimized using different loss functions and target labels. At test-time, N_DF is used to aggregate their predictions (y1, y2) into a single output detection map y_DF.

The principal component of our proposed model is the dual-attention detection network M1, as shown in Figs. 2, 3. It is used to generate the preliminary voxel-level detection of csPCa in prostate bpMRI scans with high sensitivity. Typically, a prostate gland occupies 45–50 cm³, but it can be significantly enlarged in older males and patients afflicted by BPH (Basillote et al., 2003).
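The per-channel normalization described above can be sketched as follows; clipping ADC values outside [0, 3000] and the small epsilon in the z-score are assumptions added for numerical safety, not details stated in the paper:

```python
import numpy as np

def normalize_channels(t2w, dwi, adc):
    """Channel normalization from the paper: z-score standardization for
    T2W/DWI, and a fixed linear map from [0, 3000] to [0, 1] for ADC to
    retain its clinically relevant numerical significance."""
    def zscore(x):
        return (x - x.mean()) / (x.std() + 1e-8)  # epsilon guards flat channels (assumption)
    adc_norm = np.clip(adc, 0, 3000) / 3000.0     # clipping out-of-range ADC is an assumption
    return zscore(t2w), zscore(dwi), adc_norm
```

Note that the ADC map deliberately avoids per-scan statistics: its absolute values carry diagnostic meaning, so a fixed linear map preserves comparability across patients.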
The input ROI of M1, measuring 144 × 144 × 18 voxels per channel or nearly 336 cm³, includes and extends well beyond this window to utilize surrounding peripheral and global anatomical information. M1 trains on whole-image volumes equivalent to its total ROI, paired with fully delineated annotations of csPCa (PR) as target labels. Since the larger ROI and voxel-level labels contribute to a severe class imbalance (1:153) at train-time, we use a focal loss function to train M1. Focal loss addresses extreme class imbalance in one-stage dense detectors by weighting the contribution of easy to hard examples, alongside conventional class-weighting (Lin et al., 2017). In a similar study for joint csPCa detection in prostate MRI, the authors credited focal loss as one of the pivotal enhancements that enabled their CNN solution, titled FocalNet (Cao et al., 2019a).

For an input volume x = (x_1, x_2, ..., x_n) derived from a given scan, let us define its target label Y = (Y_1, Y_2, ..., Y_n) ∈ {0, 1}, where n represents the total number of voxels in x. We can formulate the focal loss function of M1 for a single voxel in each scan, as follows:

FL(x_i, Y_i) = −α(1 − y_i)^γ · Y_i · log(y_i) − (1 − α)(y_i)^γ · (1 − Y_i) · log(1 − y_i),  i ∈ [1, n]

Here, y_i = p(O = 1 | x_i) ∈ [0, 1] represents the probability of x_i being a malignant tissue voxel as predicted by M1, while α and γ represent weighting hyperparameters of the focal loss. At test-time, y = (y_1, y_2, ..., y_n) ∈ [0, 1], i.e. a voxel-level, probabilistic csPCa detection map for x, serves as the final output of M1 for each scan.

We choose 3D U-Net (Ronneberger et al., 2015; Çiçek et al., 2016) as the base architecture of M1, for its ability to summarize multi-resolution, global anatomical features (Dalca et al., 2018; Isensee et al., 2020) and generate an output detection map with voxel-level precision. Pre-activation residual blocks (He et al., 2016) are used at each scale of M1 for deep feature extraction.
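The voxel-wise focal loss above can be sketched directly in NumPy; the defaults α = 0.75 and γ = 2.00 follow the hyperparameters reported in the Fig. 3 legend, while the clipping epsilon is an added numerical-safety assumption:

```python
import numpy as np

def focal_loss(y_pred, y_true, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss (Lin et al., 2017) matching the paper's formulation:
    FL = -alpha * (1 - y)^gamma * Y * log(y)
         - (1 - alpha) * y^gamma * (1 - Y) * log(1 - y),
    averaged over all voxels of the flattened prediction/label maps."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0); epsilon is an assumption
    pos = -alpha * (1.0 - y_pred) ** gamma * y_true * np.log(y_pred)
    neg = -(1.0 - alpha) * y_pred ** gamma * (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(np.mean(pos + neg))
```

The modulating factors (1 − y)^γ and y^γ shrink the contribution of easy, confidently classified voxels, so the sparse malignant voxels dominate the gradient despite the 1:153 imbalance.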
The architecture of the decoder stage is adapted into that of a modified UNet++ (Zhou et al., 2020) for improved feature aggregation. UNet++ uses redesigned encoder-decoder skip connections that implicitly enable a nested ensemble configuration. In our adaptation, its characteristic property of feature fusion from multiple semantic scales is used to achieve similar performance, while dense blocks and deep supervision from the original design are forgone to remain computationally lightweight.

Two types of differentiable, soft attention mechanisms are employed in M1 to highlight salient information throughout the training process, without any additional supervision. Channel-wise Squeeze-and-Excitation (SE) attention (Hu et al., 2019; Rundo et al., 2019) is used to amplify the most discriminative feature dimensions at each resolution. Grid-attention gates (Schlemper et al., 2019) are used to automatically learn spatially important prostatic structures of varying shapes and sizes. While the former is integrated into every residual block to guide feature extraction, the latter is placed at the start of skip-connections to filter the semantic features being passed onto the decoder. During backpropagation, both attention mechanisms work collectively to suppress gradients originating from background voxels and inessential feature maps. Similar combinations of dual-attention mechanisms have reached state-of-the-art performance in semantic segmentation challenges (Fu et al., 2019) and PCa diagnosis (Yu et al., 2020b), sharing an ability to integrate local features with their global dependencies.
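The channel-wise SE attention described above can be illustrated as a minimal NumPy sketch; the dense-layer weight shapes and reduction ratio are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def squeeze_excite(features, w1, w2):
    """Squeeze-and-Excitation channel attention (Hu et al., 2019) sketch.

    `features` has shape (C, D, H, W). The squeeze step global-average-pools
    each channel into a C-vector; the excitation step passes it through two
    dense layers (w1: (C, C//r), w2: (C//r, C)) with ReLU and sigmoid,
    producing per-channel scaling factors that recalibrate the feature maps.
    """
    c = features.shape[0]
    squeezed = features.reshape(c, -1).mean(axis=1)   # squeeze: (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)           # excitation, ReLU
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # sigmoid gate, (C,)
    return features * scale[:, None, None, None]      # channel recalibration
```

In the full network these weights are learned end-to-end, so uninformative channels are driven toward scale factors near zero while discriminative ones are preserved.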
The goal of the classification network, M2, is to improve overall model specificity via independent, binary classification of each scan and its constituent segments. It is effectuated by N_DF, which factors in these predictions from M2 to locate and penalize potential false positives in the output of M1. M2 has an input ROI of 112 × 112 × 12 voxels per channel or nearly 136 cm³, tightly centered around the prostate. While training on the full ROI volume has the advantage of exploiting extensive spatial context, it results in limited supervision by the usage of a single coarse, binary label per scan. Thus, we propose patch-wise training using multiple, localized labels, to enforce fully supervised learning. We define an effective patch extraction policy as one that samples regularly across the ROI to densely cover all spatial positions. Sampled patches must also be large enough to include a sufficient amount of context for subsequent feature extraction. Random sampling within a small window, using the aforementioned criteria, poses the risk of generating highly overlapping, redundant training samples. However, a minimum level of overlap can be crucial, benefiting regions that are harder to predict by correlating semantic features from different surrounding contexts (Xiao et al., 2018). As such, we divide the ROI into a set of eight octant training samples x2, measuring 64 × 64 × 8 voxels each, with up to 7.5% overlap between neighboring patches.

For input patches x2 = (x_1, x_2, ..., x_8) derived from a given scan, let us define its set of target labels Y = (Y_1, Y_2, ..., Y_8) ∈ {0, 1}. Using a pair of complementary class weights to adjust for the patch-level class imbalance (1:4), we formulate the balanced cross-entropy loss function of M2 for a single patch in each scan, as follows:

BCE(x_i, Y_i) = −β · Y_i · log(y_i) − (1 − β)(1 − Y_i) · log(1 − y_i),  i ∈ [1, 8]

Here, y_i = p(O = 1 | x_i) ∈ [0, 1] represents the probability of x_i being a malignant patch as predicted by M2. At test-time, y2 = (y_1, y_2, ..., y_8) ∈ [0, 1], i.e. a set of probabilistic malignancy scores for x2, serves as the final output of M2 for each scan.

Transforming voxel-level annotations into patch-wise labels can introduce additional noise in the target labels used at train-time. For instance, a single octant patch contains 64 × 64 × 8 or 32768 voxels per channel.
In a naive patch extraction system, if the fully delineated ground-truth for this sample includes even a single voxel of malignant tissue, then the patch-wise label would be inaccurately assigned as malignant, despite a voxel-level imbalance of 1:32767 supporting the alternate class. Such a training pair carries high label noise and proves detrimental to the learning cycle.
[Fig. 3 legend: SE-residual blocks, grid-attention gates, transposed convolutions, concatenation, residual addition, softmax layer and focal loss (α = 0.75, γ = 2.00) computation; filter counts F = 16 to 256 across scales; spatial dimensions (width, height, depth) of [144, 144, 18], [72, 72, 18], [36, 36, 18], [18, 18, 9] and [9, 9, 9].]
Fig. 3. Architecture schematic for the Dual-Attention U-Net (M1). M1 is a modified adaptation of the UNet++ architecture (Zhou et al., 2020), utilizing a pre-activation residual backbone (He et al., 2016) with Squeeze-and-Excitation (SE) channel-wise attention mechanism (Hu et al., 2019) and grid-attention gates (Schlemper et al., 2019). All convolutional layers in the encoder and decoder stages are activated by ReLU and LeakyReLU, respectively, and use kernels of size … with L2 regularization (β = …). Both downsampling and upsampling operations throughout the network are performed via anisotropic strides. Dropout nodes (rate = …) are connected at each scale of the decoder to alleviate train-time overfitting.

We regulate patch-level label noise with a threshold τ, representing the minimum percentage of malignant tissue voxels required for a given patch to be considered malignant.

For M2, we consider CNN architectures based on residual learning for feature extraction, due to their modularity and continued success in supporting state-of-the-art segmentation and detection performance in the medical domain (Yoo et al., 2019; McKinney et al., 2020; Jiang et al., 2020).

The goal of the decision fusion node N_DF is to aggregate M1 and M2 predictions (y1, y2) into a single output y_DF, which retains the same sensitivity as y1, but improves specificity by reducing false positives. False positives in y1 are fundamentally clusters of positive values located in the benign regions of the scan. N_DF employs y2 as a means of identifying these regions. We set a threshold T_P on (1 − y_i) to classify each patch x_i, where i ∈ [1, 8]. T_P represents the minimum probability required to classify x_i as a benign patch. A high value of T_P adapts M2 as a highly sensitive classifier that yields very few false negatives, if any at all. Once all benign regions have been identified, any false positives within these patches are suppressed by multiplying their corresponding regions in y1 with a penalty factor λ.
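The label-noise threshold τ described above can be sketched as a simple rule on the malignant-voxel fraction of each patch's ground-truth mask; the default value is only an example drawn from the τ settings studied in the paper:

```python
import numpy as np

def patch_label(mask_patch, tau=0.005):
    """Assign a binary patch label from a voxel-level ground-truth mask.

    The patch is labelled malignant only if its fraction of malignant
    voxels reaches tau (e.g. tau = 0.005 corresponds to the tau = 0.5%
    setting examined in the paper). tau = 0.0 reproduces the naive policy:
    any single malignant voxel makes the patch malignant."""
    if tau == 0.0:
        return float(mask_patch.any())        # naive patch extraction
    return float(mask_patch.mean() >= tau)    # noise-regulated labelling
```

Under this rule, a 64 × 64 × 8 octant containing one stray malignant voxel (fraction ≈ 0.003%) is labelled benign for any τ above that fraction, removing the extreme 1:32767 label-noise case discussed earlier.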
The resultant detection map y_DF, i.e. essentially a post-processed y1, serves as the final output of our proposed CAD system. N_DF is limited to a simple framework of two hyperparameters only to alleviate the risk of overfitting. An appropriate combination of T_P and λ can either suppress clear false positives or facilitate an aggressive reduction scheme at the expense of fewer true positives in y_DF. In this research, we opted for the former policy to retain maximum csPCa detection sensitivity. Optimal values of T_P and λ were determined to be 0.98 and 0.90, respectively, via a coarse-to-fine hyperparameter grid search.

Parallel to recent studies in medical image computing (Gibson et al., 2018; Dalca et al., 2018; Wachinger et al., 2018; Cao et al., 2019b) on infusing spatial priori into CNN architectures, we hypothesize that M1 can benefit from an explicit anatomical prior for csPCa detection in bpMRI. To this end, we construct a probabilistic population prior P, as introduced in our previous work (Saha et al., 2020). P captures the spatial prevalence and zonal distinction of csPCa using 1584 radiologically-estimated csPCa (PR) annotations and CNN-generated prostate zonal segmentations from the training dataset. We opt for an early fusion technique to encode the clinical priori (Hosseinzadeh et al., 2019), where P is concatenated as an additional channel to every input scan passed through M1, thereby guiding its learning cycle as a spatial weight map embedded with domain-specific clinical knowledge (refer to Fig. 2).

Several experiments were conducted to statistically evaluate performance and analyze the design choices throughout the end-to-end model. We facilitated a fair comparison by maintaining an identical preprocessing, augmentation, tuning and train-validation pipeline for each candidate system in a given experiment.
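The decision fusion rule described above can be sketched as follows; T_P = 0.98 and λ = 0.90 are the reported optimal values, while the `patch_slices` bookkeeping input (mapping each patch back into the detection map) is an assumed implementation detail:

```python
import numpy as np

def decision_fusion(y1, patch_scores, patch_slices, t_p=0.98, lam=0.90):
    """Decision fusion node N_DF sketch: any patch whose benign probability
    (1 - y_i) meets the threshold T_P has its corresponding region in the
    detection map y1 multiplied by the penalty factor lambda, suppressing
    potential false positives while leaving suspicious patches untouched."""
    y_df = y1.copy()
    for score, region in zip(patch_scores, patch_slices):
        if (1.0 - score) >= t_p:      # patch confidently classified benign
            y_df[region] *= lam       # penalize detections inside it
    return y_df
```

Because λ is close to 1.0, confident detections survive the penalty and only weak activations in benign regions fall below typical operating thresholds, matching the conservative reduction policy chosen in the paper.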
Patient-based diagnosis performance was evaluated using the Receiver Operating Characteristic (ROC), where the area under the ROC (AUROC) was estimated from the normalized Wilcoxon/Mann-Whitney U statistic (Hanley and McNeil, 1982). Lesion-level performance was evaluated using the Free-Response Receiver Operating Characteristic (FROC) to address PCa multifocality, where detections sharing a minimum Dice Similarity Coefficient of 0.10 with the ground-truth annotation were considered true positives. All metrics were computed in 3D. Confidence intervals were estimated as twice the standard deviation from the mean of 5-fold cross-validation (applicable to validation sets) or 1000 replications of bootstrapping (applicable to testing sets). Statistically significant improvements were verified with a p-value on the difference in case-level AUROC and lesion-level sensitivity at clinically relevant false positive rates (0.5, 1.0) using 1000 replications of bootstrapping (Chihara et al., 2014). Bonferroni correction was used to adjust the significance level for multiple comparisons.
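The AUROC estimator referenced above (the normalized Wilcoxon/Mann-Whitney U statistic of Hanley and McNeil, 1982) can be sketched directly, since it equals the probability that a randomly drawn malignant case scores higher than a randomly drawn benign one:

```python
import numpy as np

def auroc_mann_whitney(scores_pos, scores_neg):
    """AUROC as the normalized Mann-Whitney U statistic: the fraction of
    (malignant, benign) score pairs where the malignant case ranks higher,
    counting ties as one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)
```

Pairing this estimator with resampled case lists (e.g. 1000 bootstrap replications over the test cohort) yields the confidence intervals and p-values described in the evaluation protocol.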
3. Results and Analysis
To determine the effect of the classification architecture for M2, five different 3D CNNs (ResNet-v2, Inception-ResNet-v2, Residual Attention Network, SEResNet, SEResNeXt) were implemented and tuned across their respective hyperparameters to maximize patient-based AUROC over 5-fold cross-validation. Furthermore, each candidate CNN was trained using whole-images and patches, in separate turns, to draw out a comparative analysis surrounding the merits of spatial context versus localized labels. In the latter case, we studied the effect of τ on patch-wise label assignment (refer to Section 2.2.2). We investigated four different values of τ: 0.0%, 0.1%, 0.5%, 1.0%; which correspond to minimum csPCa volumes of 9, 297, 594 and 1188 mm³, respectively. Each classifier was assessed qualitatively via 3D GradCAMs (Selvaraju et al., 2017) to ensure adequate interpretability for clinical usage.

From the results noted in Table 1, we observed that the SEResNet architecture consistently scored the highest AUROC across every training scheme. However, in each case, its performance remained statistically similar (p ≥ …) to that of the other candidates, despite the additional spatial
Fig. 4. Model interpretability of the candidate CNN architectures for classifier M2 at τ = …, for csPCa (PR) located in the prostatic TZ (center row) or PZ (top, bottom rows), as indicated by the yellow contours. Whole-image GradCAMs are generated by restitching and normalizing (min-max) the eight patch-level GradCAMs generated per case. Maximum voxel-level activation is observed in close proximity of csPCa (PR), despite training each network using patch-level binary labels only.

Table 1. Patient-based diagnosis performance of the candidate CNN architectures and training schemes (whole-image versus patch-wise training with four different values of τ to regulate label noise) for classifier M2. Performance scores indicate mean of 5-fold cross-validation, followed by 95% confidence intervals estimated as twice the standard deviation.

Model | Params | AUROC (Whole-Image) | AUROC (Patches, τ = 0.0%) | τ = 0.1% | τ = 0.5% | τ = 1.0%
ResNet-v2 (He et al., 2016) | 0.089 M | … | … | … | … | …
Inception-ResNet-v2 (Szegedy et al., 2017) | 6.121 M | … | … | … | … | …
Res. Attention Network (Wang et al., 2017) | 1.233 M | … | … | … | … | …
SEResNet (Hu et al., 2019) | 0.095 M | … | … | … | … | …
SEResNeXt (Hu et al., 2019) | 0.128 M | … | … | … | … | …

context provided per sample during whole-image training. Increasing the value of τ consistently improved performance for all candidate classifiers (up to 10% in patch-level AUROC). While we attribute this improvement to lower label noise, it is important to note that the vast majority of csPCa lesions are typically small (refer to Section 2.1.2) and entire patient cases risk being discarded from the training cycle for higher values of τ. For instance, when τ = 1.0%, any malignant patch containing less than 1188 mm³ of csPCa tissue is labelled as benign, leading to 9 patient cases with incorrect label assignment in the training dataset.
For the 3D CAD system, we chose the SEResNet patch-wise classifier trained at τ = … as M2, because at this setting performance remained statistically similar to the other τ settings (τ = {…}%), while patch-level AUROC still improved by nearly 2% relative to a naive patch extraction system (τ = 0.0%). M2 accurately targets csPCa lesions (if any) on a voxel-level basis, despite being trained on patch-level binary labels (as highlighted in Fig. 4). Further details regarding the network and training configurations of M2 are listed in Appendix A.

We analyzed the effect of the M1 architecture, in comparison to the four baseline 3D CNNs (U-SEResNet, UNet++, nnU-Net, Attention U-Net) that inspire its design. We evaluated the end-to-end 3D CAD system, along with the individual contributions of its constituent components (M1, M2, P), to examine the effects of false positive reduction and clinical priori. Additionally, we applied the ensembling heuristic of the nnU-Net framework (Isensee et al., 2020) to create CAD∗, i.e. an ensemble model comprising multiple CAD instances, and we studied its impact on overall performance. Each candidate setup was tuned over 5-fold cross-validation and benchmarked on the testing datasets (TS1, TS2).

Lesion Localization: From the FROC analysis on the institutional testing set TS1 (refer to Fig. 5), we observed that M1 reached 88.15 ± … detection sensitivity (p ≤ …)
Fig. 5. Lesion-level FROC (left) and patient-based ROC (right) analyses of csPCa (PR) (top row) / csPCa (GS) (bottom row) detection sensitivity against the number of false positives generated per patient scan using the baseline, ablated and proposed detection models on the institutional testing set TS1 (top row) and the external testing set TS2 (bottom row). Transparent areas indicate the 95% confidence intervals. Mean performance for the consensus of expert radiologists and their 95% confidence intervals are indicated by the centerpoint and length of the green markers, respectively, where all observations marked PI-RADS 4 or 5 are considered positive detections (as detailed in Section 2.3).

Adding the classifier M to M (M ⊗ M) reduced false positives by up to 12.89% (p ≤ ). The impact of M ⊗ M is illustrated in Fig. 6 through a particularly challenging patient case, where the prostate gland is afflicted by multiple, simultaneous conditions. With the inclusion of anatomical prior P in M ⊗ M, our proposed CAD system benefited from a further 3.14% increase in partial area under FROC (pAUC) between 0.10–2.50 false positives per patient, reaching 1.676 ± . It detected significantly more csPCa (PR) lesions (p ≤ ) than its component systems M and M ⊗ M, respectively, and reached a maximum detection sensitivity of 93.19 ± .

Fig. 6. (a) T2W, (b) DWI, (c) ADC scans for a patient case in the external testing set TS2, followed by its csPCa detection map as predicted by each candidate system: (d) U-SEResNet, (e) UNet++, (f) Attention U-Net, (g) nnU-Net, (h) M, (i) M ⊗ M, (j) proposed CAD, (k) proposed CAD∗. Legend: Benign Prostatic Hyperplasia (BPH); Indolent Prostate Cancer (GS ≤ 3+3); Clinically Significant Prostate Cancer (GS > 3+3). Three stand-alone detection networks (UNet++, nnU-Net, M) successfully identify the csPCa lesion, albeit with additional false positive(s).
In the case of the proposed CAD/CAD∗ system, the classifier in M ⊗ M suppresses these false positive(s) from M, while the inclusion of prior P further strengthens the confidence and boundaries of the true positive.

Overall, the proposed system produced fewer false positive occurrences than all other candidate systems.

Patient-Based Diagnosis: From the ROC analysis on the institutional testing set TS1 (refer to Fig. 5), we observed that our proposed CAD system reached 0.882 ± patient-based AUROC, significantly outperforming the baseline networks (p ≤ ), while its ability to discriminate between benign and malignant patient cases was statistically similar (p ≥ ) to that of M and M ⊗ M.

Both the FROC and ROC analyses on the external testing set TS2 (refer to Fig. 5) indicate similar patterns emerging as those observed in Section 3.2.1, but with an overall decrease in performance. Given the near-identical MRI scanners and acquisition conditions employed between both institutions (refer to Section 2.1.1), we primarily attribute this decline to the disparity between the imperfect radiologically-estimated training annotations (csPCa (PR)) and the histologically-confirmed testing annotations (csPCa (GS)) in TS2 (refer to Section 3.3 for radiologists' performance). By comparing the relative drop in performance for each candidate model, we can effectively estimate their generalization and latent understanding of csPCa, beyond our provided training samples.

Lesion Localization: At 1.0 false positive per patient, our proposed CAD system achieved 85.55 ± detection sensitivity (p ≤ ), detecting significantly more csPCa (GS) lesions than its ablated counterparts M and M ⊗ M, respectively. The 3D CAD system reached a maximum detection sensitivity of 90.03 ± , whereas that of M and M ⊗ M fell by nearly 10%. From the inclusion of P in M ⊗ M, this decline came down to only 3% for the CAD system at the same false positive rate. Furthermore, an overall 11.54% increase in pAUC was observed between 0.10–2.50 false positives per patient, relative to M ⊗ M.

Patient-Based Diagnosis: Our proposed CAD system reached 0.862 ± patient-based AUROC, significantly outperforming the baseline networks (p ≤ ) and improving upon M ⊗ M by 3.6% (p ≤ ).

Table 2.
Computational requirements (in terms of the number of trainable parameters, VRAM usage and the average time taken per patient scan during inference on a single NVIDIA RTX 2080 Ti) against the localization performance (in terms of the maximum csPCa detection sensitivity achieved and its corresponding false positive rate across both testing datasets) for each candidate detection system.
Model | Params | VRAM | Inference | Max. Sensitivity { FP Rate }, TS1 – csPCa (PR) | Max. Sensitivity { FP Rate }, TS2 – csPCa (GS)
U-SEResNet (Hu et al., 2019) | 1.615 M | . GB | . ± . s | . ± . { . } | . ± . { . }
UNet++ (Zhou et al., 2020) | 14.933 M | . GB | . ± . s | . ± . { . } | . ± . { . }
nnU-Net (Isensee et al., 2020) | 30.599 M | . GB | . ± . s | . ± . { . } | . ± . { . }
Attention U-Net (Schlemper et al., 2019) | 2.235 M | . GB | . ± . s | . ± . { . } | . ± . { . }
Dual-Attention U-Net – M | | . GB | . ± . s | . ± . { . } | . ± . { . }
M with False Positive Reduction – M ⊗ M | | . GB | . ± . s | . ± . { . } | . ± . { . }
M ⊗ M with Prior – Proposed CAD | 15.335 M | . GB | . ± . s | . ± . { . } | . ± . { . }
Ensemble of CAD – Proposed CAD∗ | | . GB | . ± . s | . ± . { . } | . ± . { . }

3.2.3. Effect of Ensembling

The ensembled prediction of CAD∗ is the weighted-average output of three member models: the 2D, 3D and two-stage cascaded 3D variants of the proposed CAD system (refer to Appendix A for detailed implementation). In comparison to the standard CAD system, CAD∗ carries 2.6× trainable parameters, occupies 2.5× VRAM for hardware acceleration and requires 1.3× inference time per patient scan (as noted in Table 2). In terms of its performance, CAD∗ demonstrated a 0.3–0.4% improvement in patient-based AUROC across both testing datasets and shared statistically similar lesion localization on TS1. It boasted a considerably large improvement in lesion detection on TS2, amounting to a 4.01% increase in pAUC between 0.10–2.50 false positives per patient (refer to Fig. 5), as well as a higher maximum detection sensitivity (91.05 ± ).

To evaluate the proposed CAD∗ system in comparison to the consensus of expert radiologists, we analyzed their relative performance on the external testing set TS2. Agreements in patient-based diagnosis were computed with Cohen's kappa. Radiologists achieved 90.72 ± , while the CAD∗ system reached 0.753 ± . Agreement of CAD∗ with the radiologists, of CAD∗ with the pathologists, and of the radiologists with the pathologists was moderate (kappa = ± , kappa = ± and kappa = ± , respectively).
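The weighted-average ensembling described above can be sketched as follows; the function name and array layout are our own illustration, with the member weights (0.60 / 0.20 / 0.20) taken from Appendix A:

```python
import numpy as np

def ensemble(member_maps, weights):
    """Weighted-average ensemble of member-model csPCa likelihood maps.

    member_maps : list of equally shaped prediction arrays, one per member
                  (e.g. the 3D, 2D and cascaded 3D variants of the CAD system).
    weights     : one scalar weight per member; should sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "ensemble weights should sum to 1"
    # Stack to (n_members, ...) and contract the member axis with the weights.
    return np.tensordot(weights, np.stack(member_maps), axes=1)
```

For example, `ensemble([p_3d, p_2d, p_cascade], [0.60, 0.20, 0.20])` would yield the CAD∗-style fused likelihood map.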
4. Discussion and Conclusion
We conclude that a detection network (M), harmonizing state-of-the-art attention mechanisms, can accurately discriminate more malignancies at the same false positive rate (refer to Section 3.2.1). Among four other recent adaptations of the 3D U-Net that are popularly used for biomedical segmentation, M detected significantly more csPCa lesions at 1.00 false positive per patient and consistently reached the highest detection sensitivity on the testing datasets between 0.10–2.50 false positives per patient (refer to Fig. 5). As soft attention mechanisms continue to evolve, supporting ease of optimization, sharing equivariance over permutations (Goyal and Bengio, 2020) and suppressing gradient updates from inaccurate annotations (Wang et al., 2017; Min et al., 2019), deep attentive models, such as M, become increasingly more applicable for csPCa detection in bpMRI (Duran et al., 2020; Yu et al., 2020b).

We conclude that a residual patch-wise 3D classifier (M) can significantly reduce false positives, without sacrificing high sensitivity. In stark contrast to ensembling, which scaled up the number of trainable parameters nearly 3× for limited improvements in performance (refer to Section 3.2.3), M produced flat increases in specificity (up to 12.89% fewer false positives per patient) across both testing datasets, while requiring less than 1% of the total parameters in our proposed CAD system (as noted in Table 2). Furthermore, as a decoupled classifier, M offers two practical advantages.
Fig. 7. Six patient cases from the external testing set TS2 and their corresponding csPCa detection maps, as predicted by the proposed CAD∗ system. Yellow contours indicate csPCa (GS) lesions, if present. While CAD∗ is able to successfully localize large, multifocal and apical instances of csPCa (GS) (left), in the presence of severe inflammation/fibrosis induced by other non-malignant conditions (e.g. BPH, prostatitis), CAD∗ can misidentify smaller lesions, resulting in false positive/negative predictions (right).

Firstly, the influence of M on the overall CAD system could be controlled via the decision fusion node N DF, such that the maximum detection sensitivity of the system was completely retained (refer to Table 2). Secondly, due to its independent training scheme, M remains highly modular, i.e. it can be easily tuned, upgraded or swapped out entirely upon future advancements, without retraining or affecting the stand-alone performance of M.

We conclude that encoding an anatomical prior (P) into the CNN architecture can guide model generalization with domain-specific clinical knowledge. Results indicated that P played the most important role in the generalization of the 3D CAD system (via M) and in retaining its performance across the multi-institutional testing datasets (refer to Section 3.2.2). Remarkably, its contribution was substantially more than any other architectural enhancement proposed in recent literature, while introducing negligible changes in the number of trainable parameters (refer to Table 2). However, it is worth noting that similar experiments with classifier M yielded no statistical improvements. Parallel to the methods proposed by Cheng et al. (2018) and Tang et al. (2019), M was designed to learn a different set of feature representations for csPCa than M, using its smaller receptive field size, patch-wise approach and decoupled optimization strategy.
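One plausible construction of such a prior, consistent with the population-based priors this work builds on but not necessarily the authors' exact procedure, is to average spatially normalized training annotations into an empirical prevalence map that is then fed to the network as an extra input channel:

```python
import numpy as np

def anatomical_prior(lesion_masks, eps=1e-6):
    """Estimate a probabilistic prior of csPCa spatial prevalence.

    lesion_masks : binary csPCa annotation masks, assumed to be already
                   registered to a common prostate-centered reference
                   frame, with shape (n_cases, D, H, W).
    Returns a voxel-wise empirical probability map, clipped away from
    exact 0/1 so it behaves well as a CNN input channel.
    """
    masks = np.asarray(lesion_masks, dtype=float)
    prior = masks.mean(axis=0)          # per-voxel fraction of positive cases
    return np.clip(prior, eps, 1.0 - eps)
```

A zonal distinction, as described in the abstract, could be obtained analogously by averaging within peripheral- and transition-zone segmentations separately; that refinement is omitted here.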
Thus, while M was trained to learn translation-covariant features for localization, M was trained to learn translation-invariant features for classification, i.e. patch-wise prediction of the presence/absence of csPCa, regardless of its spatial context in the prostate gland. We presume this key difference to be the primary reason why M was effective at independent false positive reduction, yet unable to leverage the spatial priori embedded in P. Nonetheless, our study confirmed that powerful anatomical priors, such as P, can substitute additional training data for deep learning-based CAD systems and improve model generalization, by relaying the inductive biases of csPCa in bpMRI (Goyal and Bengio, 2020).

We benchmarked our proposed architecture against a consensus of radiologists, using an external testing set graded by independent pathologists. Notably, we observed that the CAD∗ system demonstrated higher agreement with the pathologists (81.08%; kappa = ± ) than with the radiologists (kappa = ± ), underlining its ability to diagnose csPCa (GS) and generalize beyond the radiologically-estimated training annotations. Although deep learning-based systems remain inadequate as stand-alone solutions (refer to Fig. 5, 7), the moderate agreement of CAD∗ with both clinical experts, while inferring predictions relatively dissimilar to radiologists, highlights its potential to improve diagnostic certainty as a viable second reader in a screening setting (Sanford et al., 2020; Schelb et al., 2020).

The study is limited in a few aspects. All prostate scans used within the scope of this research were acquired using MRI scanners developed by the same vendor. Thus, generalizing our proposed solution to a vendor-neutral model requires special measures, such as domain adaptation (Chiou et al., 2020), to account for heterogeneous acquisition conditions. Radiologists utilize additional clinical variables (e.g. prior studies, DCE scans, PSA density levels, etc.)
to inform their diagnosis for each patient case, limiting the equity of any direct comparisons against the 3D CNNs developed in this research.

In summary, a novel automated end-to-end 3D CAD system, harmonizing several state-of-the-art methods from recent literature, was developed to diagnose and localize csPCa in bpMRI. To the best of our knowledge, this was the first demonstration of a deep learning-based 3D detection and diagnosis system for csPCa, trained using radiologically-estimated annotations only and evaluated on large, multi-institutional testing datasets. The promising results of this research motivate the ongoing development of new techniques, particularly those which factor in the breadth of clinical knowledge established in the field beyond limited training datasets, to create comprehensive CAD solutions for the clinical workflow of prostate cancer management.

Acknowledgements
The authors would like to acknowledge the contributions of Maarten de Rooij and Ilse Slootweg from Radboud University Medical Center during the annotation of fully delineated masks of prostate cancer for every bpMRI scan used in this study. This research is supported in part by the European Union H2020: ProCAncer-I project (EU grant 952159) and Siemens Healthineers (CID: C00225450). Anindo Saha is supported by a European Union EACEA: Erasmus+ grant in the Medical Imaging and Applications (MaIA) program.
References
Aldoj, N., Lukas, S., Dewey, M., Penzkofer, T., 2020. Semi-Automatic Classification of Prostate Cancer on Multi-parametric MR Imaging using a Multi-Channel 3D Convolutional Neural Network. European Radiology 30, 1243–1253.
Alkadi, R., El-Baz, A., Taher, F., Werghi, N., 2019. A 2.5D Deep Learning-Based Approach for Prostate Cancer Detection on T2-Weighted Magnetic Resonance Imaging, in: Computer Vision – ECCV 2018 Workshops, Springer International Publishing. pp. 734–739.
Basillote, J.B., Armenakas, N.A., Hochberg, D.A., Fracchia, J.A., 2003. Influence of Prostate Volume in the Detection of Prostate Cancer. Urology 61, 167–171.
Bass, E., Pantovic, A., Connor, M., Gabe, R., Ahmed, H., 2020. A Systematic Review and Meta-Analysis of the Diagnostic Accuracy of Bi-parametric Prostate MRI for Prostate Cancer in Men at Risk. Prostate Cancer and Prostatic Diseases, 1–16.
Cao, R., Mohammadian Bajgiran, A., Afshari Mirak, S., Shakeri, S., Zhong, X., Enzmann, D., Raman, S., Sung, K., 2019a. Joint Prostate Cancer Detection and Gleason Score Prediction in mp-MRI via FocalNet. IEEE Transactions on Medical Imaging 38, 2496–2506.
Cao, R., Zhong, X., Scalzo, F., Raman, S., Sung, K., 2019b. Prostate Cancer Inference via Weakly-Supervised Learning using a Large Collection of Negative MRI, in: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 434–439.
Chen, M.E., Johnston, D.A., Tang, K., Babaian, R.J., Troncoso, P., 2000. Detailed Mapping of Prostate Carcinoma Foci: Biopsy Strategy Implications. Cancer 89, 1800–1809.
Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., Huang, T., 2018. Revisiting RCNN: On Awakening the Classification Power of Faster RCNN, in: Proceedings of the European Conference on Computer Vision (ECCV).
Chihara, L.M., Hesterberg, T.C., Dobrow, R.P., 2014. Mathematical Statistics with Resampling and R & Probability: With Applications and R. John Wiley & Sons.
OCLC: 941516595.
Chiou, E., Giganti, F., Punwani, S., Kokkinos, I., Joskowicz, L., 2020. Harnessing Uncertainty in Domain Adaptation for MRI Prostate Lesion Segmentation, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Springer International Publishing. pp. 510–520.
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, Springer International Publishing. pp. 424–432.
Dalca, A.V., Guttag, J., Sabuncu, M.R., 2018. Anatomical Priors in Convolutional Networks for Unsupervised Biomedical Segmentation, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9290–9299.
Duran, A., Jodoin, P.M., Lartizien, C., 2020. Prostate Cancer Semantic Segmentation by Gleason Score Group in Bi-parametric MRI with Self Attention Model on the Peripheral Zone, in: International Conference on Medical Imaging with Deep Learning (MIDL) – Full Paper Track, Montreal, QC, Canada. pp. 193–204.
Elwenspoek, M.M.C., Sheppard, A.L., McInnes, M.D.F., Whiting, P., 2019. Comparison of Multiparametric Magnetic Resonance Imaging and Targeted Biopsy With Systematic Biopsy Alone for the Diagnosis of Prostate Cancer: A Systematic Review and Meta-analysis. JAMA Network Open 2, e198427.
Engels, R.R., Israël, B., Padhani, A.R., Barentsz, J.O., 2020. Multiparametric Magnetic Resonance Imaging for the Detection of Clinically Significant Prostate Cancer: What Urologists Need to Know. Part 1: Acquisition. European Urology 77, 457–468.
Epstein, J.I., Egevad, L., Amin, M.B., Delahunt, B., 2016. The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System. Am. J. Surg. Pathol. 40, 244–252.
Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., 2017.
Dermatologist-level Classification of Skin Cancer with Deep Neural Networks. Nature 542, 115–118.
Fu, J., Liu, J., Tian, H., Lu, H., 2019. Dual Attention Network for Scene Segmentation, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3141–3149.
Garcia-Reyes, K., Passoni, N.M., Palmeri, M.L., Kauffman, C.R., 2015. Detection of Prostate Cancer with Multiparametric MRI (mpMRI): Effect of Dedicated Reader Education on Accuracy and Confidence of Index and Anterior Cancer Diagnosis. Abdominal Imaging 40, 134–142.
Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B., Pereira, S.P., Clarkson, M.J., Barratt, D.C., 2018. Automatic Multi-Organ Segmentation on Abdominal CT With Dense V-Networks. IEEE Transactions on Medical Imaging 37, 1822–1834.
Goyal, A., Bengio, Y., 2020. Inductive Biases for Deep Learning of Higher-Level Cognition. arXiv:2011.15091.
Hanley, J.A., McNeil, B.J., 1982. The Meaning and Use of The Area Under A Receiver Operating Characteristic (ROC) Curve. Radiology 143, 29–36. PMID: 7063747.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Identity Mappings in Deep Residual Networks, in: Leibe, B., Matas, J., Sebe, N., Welling, M. (Eds.), Computer Vision – ECCV 2016, Springer International Publishing. pp. 630–645.
Hosseinzadeh, M., Brand, P., Huisman, H., 2019. Effect of Adding Probabilistic Zonal Prior in Deep Learning-based Prostate Cancer Detection, in: International Conference on Medical Imaging with Deep Learning (MIDL) – Extended Abstract Track, London, United Kingdom. pp. 1026–1034.
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E., 2019. Squeeze-and-Excitation Networks.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 7132–7141.
Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H., 2020. nnU-Net: A Self-configuring Method for Deep Learning-based Biomedical Image Segmentation. Nature Methods.
Israël, B., van der Leest, M., Sedelaar, M., Padhani, A.R., Zámecnik, P., Barentsz, J.O., 2020. Multiparametric Magnetic Resonance Imaging for the Detection of Clinically Significant Prostate Cancer: What Urologists Need to Know. Part 2: Interpretation. European Urology 77, 469–480.
Jiang, Z., Ding, C., Liu, M., Tao, D., 2020. Two-Stage Cascaded U-Net: 1st Place Solution to BraTS Challenge 2019 Segmentation Task, in: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Springer International Publishing. pp. 231–241.
Johnson, L.M., Turkbey, B., Figg, W.D., Choyke, P.L., 2014. Multiparametric MRI in Prostate Cancer Management. Nature Reviews Clinical Oncology 11, 346–353.
Kasivisvanathan, V., Rannikko, A.S., Borghi, M., Panebianco, V., 2018. MRI-Targeted or Standard Biopsy for Prostate-Cancer Diagnosis. New England Journal of Medicine 378, 1767–1777.
Kingma, D.P., Ba, J., 2015. Adam: A Method for Stochastic Optimization, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1412.6980.
Lemaître, G., Martí, R., Rastgoo, M., Mériaudeau, F., 2017. Computer-Aided Detection for Prostate Cancer Detection based on Multi-parametric Magnetic Resonance Imaging, in: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3138–3141.
Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal Loss for Dense Object Detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007.
Litjens, G., Debats, O., Barentsz, J., Karssemeijer, N., Huisman, H., 2014. Computer-aided Detection of Prostate Cancer in MRI.
IEEE Transactions on Medical Imaging 33, 1083–1092.
Luo, L., Xiong, Y., Liu, Y., 2019. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, in: International Conference on Learning Representations.
McKinney, S.M., Sieniek, M., Godbole, V., Godwin, J., 2020. International Evaluation of an AI System for Breast Cancer Screening. Nature 577, 89–94.
Miller, K.D., Nogueira, L., Mariotto, A.B., Rowland, J.H., Yabroff, K.R., Alfano, C.M., Jemal, A., Kramer, J.L., Siegel, R.L., 2019. Cancer Treatment and Survivorship Statistics, 2019. CA: A Cancer Journal for Clinicians 69, 363–385.
Min, S., Chen, X., Zha, Z.J., Wu, F., Zhang, Y., 2019. A Two-Stream Mutual Attention Network for Semi-supervised Biomedical Segmentation with Noisy Labels, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4578–4585.
Riepe, T., Hosseinzadeh, M., Brand, P., Huisman, H., 2020. Anisotropic Deep Learning Multi-planar Automatic Prostate Segmentation, in: Proceedings of the 28th International Society for Magnetic Resonance in Medicine Annual Meeting. URL: http://indexsmart.mirasmart.com/ISMRM2020/PDFfiles/3518.html.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation, in: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Springer International Publishing. pp. 234–241.
Rosenkrantz, A.B., Ginocchio, L.A., Cornfeld, D., Froemming, A.T., 2016. Interobserver Reproducibility of the PI-RADS Version 2 Lexicon: A Multicenter Study of Six Experienced Prostate Radiologists. Radiology 280, 793–804.
Rouvière, O., Puech, P., Renard-Penna, R., Claudon, M., 2019. Use of Prostate Systematic and Targeted Biopsy on the Basis of Multiparametric MRI in Biopsy-Naive Patients (MRI-FIRST): A Prospective, Multicentre, Paired Diagnostic Study. The Lancet Oncology 20, 100–109.
Rundo, L., Han, C., Nagano, Y., Zhang, J., Hataya, R., Militello, C., Tangherloni, A., Nobile, M., Ferretti, C., Besozzi, D., Gilardi, M., Vitabile, S., Mauri, G., Nakayama, H., Cazzaniga, P., 2019. USE-Net: Incorporating Squeeze-and-Excitation Blocks into U-Net for Prostate Zonal Segmentation of Multi-Institutional MRI Datasets. Neurocomputing 365, 31–43.
Saha, A., Hosseinzadeh, M., Huisman, H., 2020. Encoding Clinical Priori in 3D Convolutional Neural Networks for Prostate Cancer Detection in bpMRI, in: Medical Imaging Meets NeurIPS Workshop – 34th Conference on Neural Information Processing Systems (NeurIPS 2020). URL: https://arxiv.org/abs/2011.00263.
Sanford, T., Harmon, S.A., Turkbey, E.B., Turkbey, B., 2020. Deep-Learning-Based Artificial Intelligence for PI-RADS Classification to Assist Multiparametric Prostate MRI Interpretation: A Development Study. Journal of Magnetic Resonance Imaging.
Schelb, P., Kohl, S., Radtke, J.P., Bonekamp, D., 2019. Classification of Cancer at Prostate MRI: Deep Learning versus Clinical PI-RADS Assessment. Radiology 293, 607–617.
Schelb, P., Wang, X., Radtke, J.P., Bonekamp, D., 2020. Simulated Clinical Deployment of Fully Automatic Deep Learning for Clinical Prostate MRI Assessment. European Radiology.
Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D., 2019. Attention Gated Networks: Learning to Leverage Salient Regions in Medical Images. Medical Image Analysis 53, 197–207.
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
Smith, C.P., Harmon, S.A., Barrett, T., Bittencourt, L.K., 2019. Intra- and Interreader Reproducibility of PI-RADSv2: A Multireader Study. Journal of Magnetic Resonance Imaging 49, 1694–1703.
Smith, L.N., 2017.
Cyclical Learning Rates for Training Neural Networks, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472.
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A., 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Press. pp. 4278–4284.
Tang, H., Zhang, C., Xie, X., 2019. NoduleNet: Decoupled False Positive Reduction for Pulmonary Nodule Detection and Segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI 2019), pp. 266–274.
Turkbey, B., Rosenkrantz, A.B., Haider, M.A., Padhani, A.R., Margolis, D.J., 2019. Prostate Imaging Reporting and Data System Version 2.1: 2019 Update of Prostate Imaging Reporting and Data System Version 2. European Urology.
van der Leest, M., Cornel, E., Israël, B., Hendriks, R., 2019. Head-to-head Comparison of Transrectal Ultrasound-guided Prostate Biopsy Versus Multiparametric Prostate Resonance Imaging with Subsequent Magnetic Resonance-guided Biopsy in Biopsy-naïve Men with Elevated Prostate-specific Antigen: A Large Prospective Multicenter Clinical Study. European Urology 75, 570–578. doi: 10.1016/j.eururo.2018.11.023.
Verma, S., Choyke, P.L., Eberhardt, S.C., Oto, A., Tempany, C.M., Turkbey, B., Rosenkrantz, A.B., 2017. The Current State of MR Imaging-targeted Biopsy Techniques for Detection of Prostate Cancer. Radiology 285, 343–356.
Wachinger, C., Reuter, M., Klein, T., 2018. DeepNAT: Deep Convolutional Neural Network for Segmenting Neuroanatomy. NeuroImage 170, 434–445.
Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X., 2017. Residual Attention Network for Image Classification, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6450–6458.
Weinreb, J.C., Barentsz, J.O., Choyke, P.L., Cornud, F., 2016.
PI-RADS Prostate Imaging – Reporting and Data System: 2015, Version 2. European Urology 69, 16–40.
Westphalen, A.C., McCulloch, C.E., Anaokar, J.M., Arora, S., Rosenkrantz, A.B., 2020. Variability of the Positive Predictive Value of PI-RADS for Prostate MRI across 26 Centers: Experience of the Society of Abdominal Radiology Prostate Cancer Disease-focused Panel. Radiology 296, 76–84. PMID: 32315265.
Xiao, C., Deng, R., Li, B., Yu, F., Liu, M., Song, D., 2018. Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation, in: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018, Springer International Publishing. pp. 220–237.
Yoo, S., Gujrathi, I., Haider, M.A., Khalvati, F., 2019. Prostate Cancer Detection using Deep Convolutional Neural Networks. Scientific Reports 9, 19518.
Yu, X., Lou, B., Shi, B., Szolar, D., 2020a. False Positive Reduction Using Multiscale Contextual Features for Prostate Cancer Detection in Multi-Parametric MRI Scans, in: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1355–1359.
Yu, X., Lou, B., Zhang, D., Winkel, D., Joskowicz, L., 2020b. Deep Attentive Panoptic Model for Prostate Cancer Detection Using Biparametric MRI Scans, in: Medical Image Computing and Computer Assisted Intervention (MICCAI 2020), Springer International Publishing. pp. 594–604.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2020. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Transactions on Medical Imaging 39, 1856–1867.

Appendix A. Network Configurations
The proposed CAD/CAD∗ system, including its CNN components (M, M), was implemented in TensorFlow (Estimator, Keras APIs). Special care was taken throughout the design stage (as detailed in Section 2.2) to ensure computational efficiency, such that the end-to-end 3D system is fully trainable and deployable from a single NVIDIA RTX 2080 Ti GPU (11 GB) in less than 6 hours for the dataset used in this study.
3D Dual-Attention U-Net (M) (component of the CAD system): The network architecture (as detailed in Section 3.2.1) comprises 75 convolutional layers. Layers along the encoder and decoder stages are activated by ReLU and Leaky ReLU (α = ), respectively, and the output layer is activated by the softmax function. A dimension reduction ratio of 8 is applied to re-weight each channel inside every SE module (Hu et al., 2019). Sub-sampling kernels of size (1,1,1) are used inside every grid-based attention gate (Schlemper et al., 2019). Dropout nodes (rate = ) are connected at each scale of the decoder to alleviate overfitting. M is initialized using He uniform variance scaling (He et al., 2015) and trained using multi-channel whole-images over 40 epochs, with a minibatch size of 2 and an exponentially decaying cyclic learning rate (γ = , step size = epochs) (Smith, 2017). Focal loss (α = , γ = ) is used with the Adam optimizer (Kingma and Ba, 2015) in backpropagation through the model. Train-time augmentations include horizontal flip, rotation, horizontal/vertical translation and scaling centered along the axial plane. Test-time augmentation includes horizontal flip along the axial plane. M predictions carry a weight of 0.60 in the ensembled output of CAD∗.
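For reference, the α-balanced binary focal loss of Lin et al. (2017) used here can be written in a few lines of NumPy; the α and γ values below are illustrative placeholders, since the paper's exact settings were lost in extraction:

```python
import numpy as np

def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """Alpha-balanced binary focal loss (Lin et al., 2017), averaged per voxel.

    p : predicted csPCa probability; y : binary ground-truth label.
    alpha/gamma are illustrative, not the paper's exact values.
    """
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)   # class-balancing weight
    # The (1 - pt)^gamma factor down-weights easy, well-classified voxels.
    return -(a * (1 - pt) ** gamma * np.log(pt)).mean()
```

With gamma = 0 and alpha = 0.5, this reduces to half the standard binary cross-entropy, which makes the focusing effect easy to verify.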
3D SEResNet (M) (component of the CAD system): The network follows a relatively shallow 3D adaptation of the SEResNet architecture proposed by Hu et al. (2019), comprising 2 residual blocks with 6 convolutional layers each, followed by global average pooling and a single densely-connected layer. All layers are activated by ReLU, with the exception of the output layer, which is activated by the softmax function. A dimension reduction ratio of 8 is applied to re-weight each channel inside every SE module. M is initialized using He uniform variance scaling (He et al., 2015) and trained using multi-channel octant patches over 262 epochs. It trains with a minibatch size of 80 (equivalent to 10 full scans) and an exponentially decaying cyclic learning rate (γ = , step size = epochs) (Smith, 2017). Balanced cross-entropy loss (β = ) is used with the AMSBound optimizer (Luo et al., 2019) in backpropagation through the model. Train-time augmentations include horizontal flip, rotation, horizontal/vertical translation and scaling centered along the axial plane.
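The SE channel re-weighting applied in both networks (Hu et al., 2019) amounts to a global average pool, a two-layer bottleneck and a sigmoid gate. A minimal NumPy forward-pass sketch (weight matrices here are hypothetical stand-ins, sized for a reduction ratio of 8):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation channel re-weighting (Hu et al., 2019).

    x  : feature map of shape (C, D, H, W).
    w1 : (C, C // r) squeeze weights; w2 : (C // r, C) excite weights,
         where r is the dimension reduction ratio (8 in this paper).
    """
    c = x.shape[0]
    z = x.reshape(c, -1).mean(axis=1)          # squeeze: global average pool
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)  # excite: FC -> ReLU -> FC -> sigmoid
    return x * s[:, None, None, None]          # re-scale each channel
```

In the real networks the weights are learned and the block sits inside residual units; this sketch only illustrates the data flow.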
3D CAD (member model of the CAD∗ ensemble): The standard solution proposed in this research, comprising the detection network M, decoupled classifier M and anatomical prior P (as detailed in Section 3.2). Model predictions carry a weight of 0.60 in the ensembled output of CAD∗.
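The decision fusion node N DF that combines M and M is not specified in this excerpt; one simple fusion rule consistent with its described behavior (down-weighting, but never zeroing, detector scores in patches the classifier deems benign, so the maximum-sensitivity operating point survives) might look like the following sketch, where all names and the `floor` parameter are our own illustration:

```python
import numpy as np

def fuse(detection_map, patch_probs, patch_size, floor=0.5):
    """Illustrative decision fusion of detector and patch classifier.

    detection_map : 3D array of voxel-level csPCa likelihoods from M.
    patch_probs   : 3D array with one malignancy probability per patch from M.
    patch_size    : (d, h, w) extent of each classifier patch in voxels.
    Scores are rescaled, never zeroed, so every detection survives at a
    reduced confidence and maximum sensitivity is preserved.
    """
    fused = detection_map.copy()
    for idx in np.ndindex(patch_probs.shape):
        w = max(patch_probs[idx], floor)   # never fully suppress a patch
        sl = tuple(slice(i * s, (i + 1) * s) for i, s in zip(idx, patch_size))
        fused[sl] *= w
    return fused
```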
2D CAD (member model of the CAD∗ ensemble): Network architecture and training configuration are identical to those of the 3D CAD system, with only one exception: all modules operate with isotropically-strided 2D convolutions. Model predictions carry a weight of 0.20 in the ensembled output of CAD∗.
3D Two-Stage Cascaded CAD (member model of the CAD∗ ensemble): The network architecture of each stage and the training configuration of the overall model are identical to those of the 3D CAD system, with three exceptions. The first stage uses only half as many convolutional filters as the 3D CAD system at every resolution. The second-stage input includes the first-stage output as an additional channel. The total cost function is computed as the average loss between the intermediary first-stage and the final second-stage outputs against the same ground truth, identical to the coarse-to-fine approach proposed by Jiang et al. (2020). Model predictions carry a weight of 0.20 in the ensembled output of CAD∗.
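The cascade wiring and its averaged cost function can be sketched schematically; the stage functions, shapes and loss are our assumptions, not the exact implementation:

```python
import numpy as np

def cascade_forward(x, stage1, stage2):
    """Two-stage coarse-to-fine cascade (cf. Jiang et al., 2020).

    x : bpMRI input of shape (channels, D, H, W).
    The second-stage input is the original channels plus the coarse
    first-stage prediction, appended as an extra channel.
    """
    y1 = stage1(x)                          # coarse likelihood map, (1, D, H, W)
    x2 = np.concatenate([x, y1], axis=0)    # append prediction as a channel
    y2 = stage2(x2)                         # refined likelihood map
    return y1, y2

def cascade_loss(y1, y2, target, loss_fn):
    """Total cost: average of intermediary and final losses vs. one target."""
    return 0.5 * (loss_fn(y1, target) + loss_fn(y2, target))
```

Supervising the intermediary output alongside the final one is what makes the first stage a usable coarse detector rather than an arbitrary feature extractor.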