End-to-end Prostate Cancer Detection in bpMRI via 3D CNNs: Effects of Attention Mechanisms, Clinical Priori and Decoupled False Positive Reduction
Anindo Saha a,∗, Matin Hosseinzadeh a,∗, Henkjan Huisman a
a Diagnostic Image Analysis Group, Radboud University Medical Center, The Netherlands
Abstract –
We present a novel multi-stage 3D computer-aided detection and diagnosis (CAD) model for automated localization of clinically significant prostate cancer (csPCa) in bi-parametric MR imaging (bpMRI). Deep attention mechanisms drive its detection network, targeting multi-resolution, salient structures and highly discriminative feature dimensions, in order to accurately identify csPCa lesions amidst indolent cancer and the wide range of benign pathology that can afflict the prostate gland. In parallel, a decoupled residual classifier is used to achieve consistent false positive reduction, without sacrificing high sensitivity or computational efficiency. Furthermore, a probabilistic anatomical prior, which captures the spatial prevalence of csPCa as well as its zonal distinction, is computed and encoded into the CNN architecture to guide model generalization with domain-specific clinical knowledge. For 486 institutional testing scans, the 3D CAD system achieves 83.69 ± … detection sensitivity, with an agreement of kappa = … against expert readers.

Keywords – prostate cancer · magnetic resonance imaging · convolutional neural network · computer-aided detection and diagnosis · anatomical prior · deep attention
1. Introduction
Prostate cancer (PCa) is one of the most prevalent cancers in men worldwide. It is estimated that, as of January 2019, over 45% of all men living with a history of cancer in the United States had suffered from PCa (Miller et al., 2019). One of the main challenges surrounding the accurate diagnosis of PCa is its broad spectrum of clinical behavior. PCa lesions can range from low-grade, benign tumors that never progress into clinically significant disease to highly aggressive, invasive malignancies, i.e. clinically significant PCa (csPCa), that can rapidly advance towards metastasis and death (Johnson et al., 2014). In clinical practice, prostate biopsies are used to histologically assign a Gleason Score (GS) to each lesion as a measure of cancer aggressiveness (Epstein et al., 2016). Non-targeted transrectal ultrasound (TRUS) is generally employed to guide biopsy extractions, but it is severely prone to underdetection of csPCa and overdiagnosis of indolent PCa (Verma et al., 2017). Prostate MR imaging can compensate for these limitations of TRUS (Johnson et al., 2014; Israël et al., 2020; Engels et al., 2020). Negative MRI can rule out unnecessary biopsies by 23–45% (Kasivisvanathan et al., 2018; van der Leest et al., 2019; Elwenspoek et al., 2019; Rouvière et al., 2019). Prostate Imaging Reporting and Data System: Version 2 (PI-RADS v2) (Weinreb et al., 2016) is a guideline for reading and acquiring prostate MRI, following a qualitative and semi-quantitative assessment that mandates substantial expertise for proper usage. Meanwhile, csPCa can manifest as multifocal lesions of different shapes and sizes, bearing a strong resemblance to numerous non-malignant conditions (as seen in Fig. 1).

∗ Authors with equal contribution to this research. e-mail: [email protected] (Anindo Saha)
Algorithm and source code have been made publicly available at:
https://grand-challenge.org/algorithms/{to-be-announced}
https://github.com/DIAGNijmegen/{to-be-announced}
In the absence of experienced radiologists, these factors can lead to low inter-reader agreement (< …). The advent of deep convolutional neural networks (CNN) has paved the way for powerful computer-aided detection and diagnosis (CAD) systems that rival human performance (Esteva et al., 2017; McKinney et al., 2020).
Fig. 1. The challenge of discriminating csPCa due to its morphological heterogeneity. (a-b) T2-weighted imaging (T2W), (c-d) diffusion-weighted imaging (DWI) and (e-f) apparent diffusion coefficient (ADC) maps constituting the prostate bpMRI scans for two different patients are shown above, where yellow contours indicate csPCa lesions. While one of the patients has large, severe csPCa developing from both ends (top row), the other is afflicted by a single, relatively focal csPCa lesion surrounded by perceptually similar nodules of benign prostatic hyperplasia (BPH) (bottom row). Furthermore, normalized intensity histograms (right) compiled from all 2733 scans used in this study reveal a large overlap between the distributions of csPCa and non-malignant prostatic tissue for all three MRI channels.
Cao et al. (2019a) proposed FocalNet for joint csPCa detection and GS prediction. Over 5-fold cross-validation using 417 patient scans, FocalNet achieved 87.9% sensitivity at 1.0 false positive per patient. Meanwhile, Yu et al. (2020a) proposed a dual-stage 2D U-Net for csPCa detection, where the second-stage module is an integrated network for false positive reduction.

Cancerous lesions stemming from the prostatic peripheral zone (PZ) exhibit different morphology and pathology than those developing from the transitional zone (TZ) (Chen et al., 2000; Weinreb et al., 2016; Israël et al., 2020). Hosseinzadeh et al. (2019) highlight the merits of utilizing this priori through an early fusion of probabilistic zonal segmentations inside a 2D CAD system. The study demonstrated that the inclusion of PZ and TZ segmentations can introduce an average increase of 5.3% detection sensitivity, between 0.5–2.0 false positives per patient. In a separate study, Cao et al. (2019b) constructed a probabilistic 2D prevalence map from 1055 MRI slices. Depicting the typical sizes, shapes and locations of malignancy across the prostate anatomy, this map was used to weakly supervise a 2D U-Net for PCa detection. Both methods underline the value of clinical priori and anatomical features, factors known to play an equally important role in classical machine learning-based solutions (Litjens et al., 2014; Lemaître et al., 2017).

The vast majority of CAD systems for csPCa operate solely on a 2D basis, citing computational limitations and the non-isotropic imaging protocol of prostate MRI as their primary rationale. Yoo et al. (2019) tackled this challenge by employing dedicated 2D ResNets for each slice in a patient scan and aggregating all slice-level predictions with a Random Forest classifier. Aldoj et al. (2020) proposed a patch-based approach, passing highly-localized regions of interest (ROI) through a standard 3D CNN. Alkadi et al. (2019) followed a 2.5D approach as a compromise solution, sacrificing the ability to harness multiple MRI channels for an additional pseudo-spatial dimension.
In this research, we harmonize several state-of-the-art techniques from recent literature to present a novel end-to-end 3D CAD system that generates voxel-level detections of csPCa in prostate MRI. Key contributions of our study are as follows:

• We examine a detection network with dual-attention mechanisms, which can adaptively target highly discriminative feature dimensions and spatially salient prostatic structures in bpMRI, across multiple resolutions, to reach peak detection sensitivity at lower false positive rates.

• We study the effect of employing a residual patch-wise 3D classifier for decoupled false positive reduction and we investigate its utility in improving baseline specificity, without sacrificing high detection sensitivity.

• We develop a probabilistic anatomical prior, capturing the spatial prevalence and zonal distinction of csPCa from a large training dataset of 1584 MRI scans. We investigate the impact of encoding the computed prior into our CNN architecture and we evaluate its ability to guide model generalization with domain-specific clinical knowledge.

• We evaluate model performance across large, multi-institutional testing datasets: 486 institutional and 296 external patient scans annotated using PI-RADS v2 and GS grades, respectively. Our benchmark includes a consensus score of expert radiologists to assess clinical viability.
2. Material and Methods
The primary dataset was a cohort of 2436 prostate MRI scans from Radboud University Medical Center (RUMC), acquired over the period January 2016 – January 2018. All cases were paired with radiologically-estimated annotations of csPCa derived via PI-RADS v2. From here, 1584 (65%), 366 (15%) and 486 (20%) patient scans were split into training, validation and testing (TS1) sets, respectively, via double-stratified sampling. Additionally, 296 prostate bpMRI scans from Ziekenhuisgroep Twente (ZGT), acquired over the period March 2015 – January 2017, were used to curate an external testing set (TS2). TS2 annotations included biopsy-confirmed GS grades.
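The double-stratified split described above can be sketched as grouping cases by a joint stratification key and splitting each group by the target fractions. The paper does not specify the two stratification variables, so the key here (e.g. malignancy and lesion multifocality) is a hypothetical illustration:

```python
import random
from collections import defaultdict

def double_stratified_split(cases, keys, fractions=(0.65, 0.15, 0.20), seed=42):
    """Split case IDs into train/val/test sets, stratifying on a joint key.

    `cases` is a list of case IDs; `keys[i]` is a tuple of the two
    stratification variables for cases[i] (hypothetical example:
    (malignant: bool, multifocal: bool)).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for case, key in zip(cases, keys):
        groups[key].append(case)
    splits = ([], [], [])
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        splits[0].extend(members[:n_train])              # training set
        splits[1].extend(members[n_train:n_train + n_val])  # validation set
        splits[2].extend(members[n_train + n_val:])      # testing set
    return splits
```

Because each stratum is split independently, the marginal distribution of both key variables is approximately preserved across the three sets.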
Patients were biopsy-naive men (RUMC: median age: 66 yrs, IQR: 61–70; ZGT: median age: 65 yrs, IQR: 59–68) with elevated levels of PSA (RUMC: median level: 8 ng/mL, IQR: 5–11; ZGT: median level: 6.6 ng/mL, IQR: 5.1–8.7). Imaging was performed on 3T MR scanners (RUMC: …, ZGT: …; Siemens Healthineers, Erlangen). In both cases, acquisitions were obtained following standard mpMRI protocols in compliance with PI-RADS v2 (Engels et al., 2020). Given the limited role of dynamic contrast-enhanced (DCE) imaging in mpMRI, in recent years, bpMRI has emerged as a practical alternative, achieving similar performance, while saving time and the use of contrast agents (Turkbey et al., 2019; Bass et al., 2020). Similarly, in this study, we used bpMRI sequences only, which included T2-weighted (T2W) and diffusion-weighted imaging (DWI). Apparent diffusion coefficient (ADC) maps and high b-value DWI (b > …) were computed from the raw DWI scans. Prior to usage, all scans were spatially resampled to a common axial in-plane resolution of 0.5 mm and slice thickness of 3.6 mm via B-spline interpolation. Due to the standardized precautionary measures (e.g. minimal temporal difference between acquisitions, administration of antispasmodic agents to reduce bowel motility, use of rectal catheter to minimize distension, etc.) (Engels et al., 2020) taken in the imaging protocol, we observed negligible patient motion across the different sequences. Thus, no additional registration techniques were applied, in agreement with clinical recommendations (Epstein et al., 2016) and recent studies (Cao et al., 2019a).

All patient scans from RUMC and ZGT were reviewed by expert radiologists using PI-RADS v2. For this study, we flagged any detected lesions marked PI-RADS 4 or 5 as csPCa (PR). When independently assigned PI-RADS scores were discordant, a consensus was reached through joint assessment.
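The spatial resampling step above (0.5 mm in-plane, 3.6 mm slices, B-spline interpolation) can be sketched with `scipy.ndimage.zoom`, whose `order=3` setting performs cubic B-spline interpolation; the (z, y, x) axis ordering is an assumption about how the volumes are stored:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_common_grid(volume, spacing, target_spacing=(3.6, 0.5, 0.5), order=3):
    """Resample a 3D scan (depth, height, width) to the common grid used
    in the paper: 0.5 mm axial in-plane resolution, 3.6 mm slice thickness.

    `spacing` is the source voxel spacing in mm, ordered (z, y, x);
    `order=3` selects cubic B-spline interpolation in scipy.
    """
    factors = tuple(s / t for s, t in zip(spacing, target_spacing))
    return zoom(volume, factors, order=order)
```

For example, a scan stored at 1.0 mm in-plane resolution and 3.6 mm slice thickness doubles its in-plane matrix size while keeping its slice count.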
All instances of csPCa (PR) were then carefully delineated on a voxel-level basis by trained students under the supervision of expert radiologists. For the ZGT dataset, all patients underwent TRUS-guided biopsies performed by a urologist, blinded to the imaging results. In the presence of any suspicious lesions (PI-RADS 3-5), patients also underwent in-bore MRI-guided biopsies as detailed in van der Leest et al. (2019). Tissue samples were reviewed by experienced uropathologists, where cores containing cancer were assigned GS grades in compliance with the 2014 International Society of Urologic Pathology (ISUP) guidelines (Epstein et al., 2016). Any lesion graded GS > 3+3 was flagged as csPCa (GS), and subsequently delineated by trained students on a voxel-level basis.

Upon complete annotation, the RUMC and ZGT datasets contained 1527 and 210 benign cases, along with 909 and 86 malignant cases (≥ 1 csPCa lesion), respectively. Moreover, on a lesion-level basis, the RUMC dataset contained 1095 csPCa (PR) lesions (mean frequency: 1.21 lesions per malignant scan; median size: 1.05 cm³, range: 0.01–61.49 cm³), while the ZGT dataset contained 90 csPCa (GS) lesions (mean frequency: 1.05 lesions per malignant scan; median size: 1.69 cm³, range: 0.23–22.61 cm³).

Multi-class segmentations of prostatic TZ and PZ were generated for each scan in the training dataset using a multi-planar, anisotropic 3D U-Net from a separate study (Riepe et al., 2020), where the network achieved an average Dice Similarity Coefficient of 0.90 ± … .

The architecture of our proposed CAD solution comprises two parallel 3D CNNs (M1, M2) followed by a decision fusion node N_DF, as shown in Fig. 2.
[Fig. 2 schematic: T2W, DWI and ADC scans undergo intensity normalization; the multi-channel whole volume x1 [1, 144, 144, 18, 4], with early fusion of the probabilistic anatomical prior P (built from the prostate zonal segmentations and tumor annotations of all training cases), feeds the Dual-Attention U-Net detector M1 (focal loss, Adam optimizer) to produce a preliminary detection y1; multi-channel patches x2 [8, 64, 64, 8, 3] feed the residual classifier M2 (balanced cross-entropy loss, AMSBound optimizer) to produce a soft malignancy score y2 for each patch; decision fusion N_DF yields the processed detection y_DF with reduced false positives. Tensor shapes: [number of samples, width, height, depth, number of channels].]
Fig. 2. Proposed end-to-end framework for computing voxel-level detections of csPCa in validation/test samples of prostate bpMRI. The model center-crops two ROIs from the multi-channel concatenation of the patient's T2W, DWI and ADC scans for the input of its detection and classification 3D CNN sub-models (M1, M2). M1 leverages an anatomical prior P in its input x1 to synthesize spatial priori and generate a preliminary detection y1. M2 infers on a set of overlapping patches x2 and maps them to a set of probabilistic malignancy scores y2. Decision fusion node N_DF aggregates y1, y2 to produce the model output y_DF in the form of a post-processed csPCa detection map with high sensitivity and reduced false positives.

Based on our observations in previous work (Hosseinzadeh et al., 2019; Riepe et al., 2020), we opted for anisotropically-strided 3D convolutions in both M1 and M2 to process the bpMRI data, which resemble multi-channel stacks of 2D images rather than full 3D volumes. T2W and DWI channels were normalized to zero mean and unit standard deviation, while ADC channels were linearly normalized from [0, 3000] to [0, 1] in order to retain their clinically relevant numerical significance (Israël et al., 2020). Anatomical prior P, constructed using the prostate zonal segmentations and csPCa (PR) annotations in the training dataset, is encoded in M1 to infuse spatial priori. At train-time, M1 and M2 are independently optimized using different loss functions and target labels. At test-time, N_DF is used to aggregate their predictions (y1, y2) into a single output detection map y_DF.

The principal component of our proposed model is the dual-attention detection network M1, as shown in Figs. 2, 3. It is used to generate the preliminary voxel-level detection of csPCa in prostate bpMRI scans with high sensitivity. Typically, a prostate gland occupies 45–50 cm³, but it can be significantly enlarged in older males and patients afflicted by BPH (Basillote et al., 2003).
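The per-channel normalization described above can be sketched as follows; clipping ADC values outside [0, 3000] and the small epsilon in the z-score are assumptions added for numerical safety, not details stated in the paper:

```python
import numpy as np

def normalize_channels(t2w, dwi, adc):
    """Channel normalization from the paper: z-score standardization for
    T2W/DWI, and a fixed linear map from [0, 3000] to [0, 1] for ADC to
    retain its clinically relevant numerical significance."""
    def zscore(x):
        return (x - x.mean()) / (x.std() + 1e-8)  # epsilon guards flat channels (assumption)
    adc_norm = np.clip(adc, 0, 3000) / 3000.0     # clipping out-of-range ADC is an assumption
    return zscore(t2w), zscore(dwi), adc_norm
```

Note that the ADC map deliberately avoids per-scan statistics: its absolute values carry diagnostic meaning, so a fixed linear map preserves comparability across patients.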
The input ROI of M1, measuring 144 × 144 × 18 voxels per channel or nearly 336 cm³, includes and extends well beyond this window to utilize surrounding peripheral and global anatomical information. M1 trains on whole-image volumes equivalent to its total ROI, paired with fully delineated annotations of csPCa (PR) as target labels. Since the larger ROI and voxel-level labels contribute to a severe class imbalance (1:153) at train-time, we use a focal loss function to train M1. Focal loss addresses extreme class imbalance in one-stage dense detectors by weighting the contribution of easy to hard examples, alongside conventional class-weighting (Lin et al., 2017). In a similar study for joint csPCa detection in prostate MRI, the authors credited focal loss as one of the pivotal enhancements that enabled their CNN solution, titled FocalNet (Cao et al., 2019a).

For an input volume x = (x_1, x_2, ..., x_n) derived from a given scan, let us define its target label Y = (Y_1, Y_2, ..., Y_n) ∈ {0, 1}, where n represents the total number of voxels in x. We can formulate the focal loss function of M1 for a single voxel in each scan, as follows:

FL(x_i, Y_i) = −α(1 − y_i)^γ · Y_i · log(y_i) − (1 − α)(y_i)^γ · (1 − Y_i) · log(1 − y_i),  i ∈ [1, n]

Here, y_i = p(O = 1 | x_i) ∈ [0, 1] represents the probability of x_i being a malignant tissue voxel as predicted by M1, while α and γ represent weighting hyperparameters of the focal loss. At test-time, y = (y_1, y_2, ..., y_n) ∈ [0, 1], i.e. a voxel-level, probabilistic csPCa detection map for x, serves as the final output of M1 for each scan.

We choose 3D U-Net (Ronneberger et al., 2015; Çiçek et al., 2016) as the base architecture of M1, for its ability to summarize multi-resolution, global anatomical features (Dalca et al., 2018; Isensee et al., 2020) and generate an output detection map with voxel-level precision. Pre-activation residual blocks (He et al., 2016) are used at each scale of M1 for deep feature extraction.
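The voxel-wise focal loss above can be sketched directly in NumPy; the defaults α = 0.75 and γ = 2.00 follow the hyperparameters reported in the Fig. 3 legend, while the clipping epsilon is an added numerical-safety assumption:

```python
import numpy as np

def focal_loss(y_pred, y_true, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss (Lin et al., 2017) matching the paper's formulation:
    FL = -alpha * (1 - y)^gamma * Y * log(y)
         - (1 - alpha) * y^gamma * (1 - Y) * log(1 - y),
    averaged over all voxels of the flattened prediction/label maps."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0); epsilon is an assumption
    pos = -alpha * (1.0 - y_pred) ** gamma * y_true * np.log(y_pred)
    neg = -(1.0 - alpha) * y_pred ** gamma * (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(np.mean(pos + neg))
```

The modulating factors (1 − y)^γ and y^γ shrink the contribution of easy, confidently classified voxels, so the sparse malignant voxels dominate the gradient despite the 1:153 imbalance.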
The architecture of the decoder stage is adapted into that of a modified UNet++ (Zhou et al., 2020) for improved feature aggregation. UNet++ uses redesigned encoder-decoder skip connections that implicitly enable a nested ensemble configuration. In our adaptation, its characteristic property of feature fusion from multiple semantic scales is used to achieve similar performance, while dense blocks and deep supervision from the original design are forgone to remain computationally lightweight.

Two types of differentiable, soft attention mechanisms are employed in M1 to highlight salient information throughout the training process, without any additional supervision. Channel-wise Squeeze-and-Excitation (SE) attention (Hu et al., 2019; Rundo et al., 2019) is used to amplify the most discriminative feature dimensions at each resolution. Grid-attention gates (Schlemper et al., 2019) are used to automatically learn spatially important prostatic structures of varying shapes and sizes. While the former is integrated into every residual block to guide feature extraction, the latter is placed at the start of skip-connections to filter the semantic features being passed onto the decoder. During backpropagation, both attention mechanisms work collectively to suppress gradients originating from background voxels and inessential feature maps. Similar combinations of dual-attention mechanisms have reached state-of-the-art performance in semantic segmentation challenges (Fu et al., 2019) and PCa diagnosis (Yu et al., 2020b), sharing an ability to integrate local features with their global dependencies.
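The channel-wise SE attention described above can be illustrated as a minimal NumPy sketch; the dense-layer weight shapes and reduction ratio are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def squeeze_excite(features, w1, w2):
    """Squeeze-and-Excitation channel attention (Hu et al., 2019) sketch.

    `features` has shape (C, D, H, W). The squeeze step global-average-pools
    each channel into a C-vector; the excitation step passes it through two
    dense layers (w1: (C, C//r), w2: (C//r, C)) with ReLU and sigmoid,
    producing per-channel scaling factors that recalibrate the feature maps.
    """
    c = features.shape[0]
    squeezed = features.reshape(c, -1).mean(axis=1)   # squeeze: (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)           # excitation, ReLU
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # sigmoid gate, (C,)
    return features * scale[:, None, None, None]      # channel recalibration
```

In the full network these weights are learned end-to-end, so uninformative channels are driven toward scale factors near zero while discriminative ones are preserved.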
The goal of the classification network, M2, is to improve overall model specificity via independent, binary classification of each scan and its constituent segments. It is effectuated by N_DF, which factors in these predictions from M2 to locate and penalize potential false positives in the output of M1. M2 has an input ROI of 112 × 112 × 12 voxels per channel or nearly 136 cm³, tightly centered around the prostate. While training on the full ROI volume has the advantage of exploiting extensive spatial context, it results in limited supervision by the usage of a single coarse, binary label per scan. Thus, we propose patch-wise training using multiple, localized labels, to enforce fully supervised learning. We define an effective patch extraction policy as one that samples regularly across the ROI to densely cover all spatial positions. Sampled patches must also be large enough to include a sufficient amount of context for subsequent feature extraction. Random sampling within a small window, using the aforementioned criteria, poses the risk of generating highly overlapping, redundant training samples. However, a minimum level of overlap can be crucial, benefiting regions that are harder to predict by correlating semantic features from different surrounding contexts (Xiao et al., 2018). As such, we divide the ROI into a set of eight octant training samples x2, measuring 64 × 64 × 8 voxels each, with up to 7.5% overlap between neighboring patches.

For input patches x2 = (x_1, x_2, ..., x_8) derived from a given scan, let us define its set of target labels Y = (Y_1, Y_2, ..., Y_8) ∈ {0, 1}. Using a pair of complementary class weights to adjust for the patch-level class imbalance (1:4), we formulate the balanced cross-entropy loss function of M2 for a single patch in each scan, as follows:

BCE(x_i, Y_i) = −β · Y_i · log(y_i) − (1 − β)(1 − Y_i) · log(1 − y_i),  i ∈ [1, 8]

Here, y_i = p(O = 1 | x_i) ∈ [0, 1] represents the probability of x_i being a malignant patch as predicted by M2. At test-time, y2 = (y_1, y_2, ..., y_8) ∈ [0, 1], i.e. a set of probabilistic malignancy scores for x2, serves as the final output of M2 for each scan.

Transforming voxel-level annotations into patch-wise labels can introduce additional noise in the target labels used at train-time. For instance, a single octant patch contains 64 × 64 × 8 or 32768 voxels per channel.
In a naive patch extraction system, if the fully delineated ground-truth for this sample includes even a single voxel of malignant tissue, then the patch-wise label would be inaccurately assigned as malignant, despite a voxel-level imbalance of 1:32767 supporting the alternate class. Such a training pair carries high label noise and proves detrimental to the learning cycle.
[Fig. 3 legend: SE-residual blocks, grid-attention gates, transposed convolutions, concatenation, residual addition, softmax layer and focal loss (α = 0.75, γ = 2.00) computation; filter counts F = 16 to 256 across scales; spatial dimensions (width, height, depth) of [144, 144, 18], [72, 72, 18], [36, 36, 18], [18, 18, 9] and [9, 9, 9].]
Fig. 3. Architecture schematic for the Dual-Attention U-Net (M1). M1 is a modified adaptation of the UNet++ architecture (Zhou et al., 2020), utilizing a pre-activation residual backbone (He et al., 2016) with Squeeze-and-Excitation (SE) channel-wise attention mechanism (Hu et al., 2019) and grid-attention gates (Schlemper et al., 2019). All convolutional layers in the encoder and decoder stages are activated by ReLU and LeakyReLU, respectively, and use kernels of size … with L2 regularization (β = …). Both downsampling and upsampling operations throughout the network are performed via anisotropic strides. Dropout nodes (rate = …) are connected at each scale of the decoder to alleviate train-time overfitting.

We regulate patch-level label noise with a threshold τ, representing the minimum percentage of malignant tissue voxels required for a given patch to be considered malignant.

For M2, we consider CNN architectures based on residual learning for feature extraction, due to their modularity and continued success in supporting state-of-the-art segmentation and detection performance in the medical domain (Yoo et al., 2019; McKinney et al., 2020; Jiang et al., 2020).

The goal of the decision fusion node N_DF is to aggregate M1 and M2 predictions (y1, y2) into a single output y_DF, which retains the same sensitivity as y1, but improves specificity by reducing false positives. False positives in y1 are fundamentally clusters of positive values located in the benign regions of the scan. N_DF employs y2 as a means of identifying these regions. We set a threshold T_P on (1 − y_i) to classify each patch x_i, where i ∈ [1, 8]. T_P represents the minimum probability required to classify x_i as a benign patch. A high value of T_P adapts M2 as a highly sensitive classifier that yields very few false negatives, if any at all. Once all benign regions have been identified, any false positives within these patches are suppressed by multiplying their corresponding regions in y1 with a penalty factor λ.
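The label-noise threshold τ described above can be sketched as a simple rule on the malignant-voxel fraction of each patch's ground-truth mask; the default value is only an example drawn from the τ settings studied in the paper:

```python
import numpy as np

def patch_label(mask_patch, tau=0.005):
    """Assign a binary patch label from a voxel-level ground-truth mask.

    The patch is labelled malignant only if its fraction of malignant
    voxels reaches tau (e.g. tau = 0.005 corresponds to the tau = 0.5%
    setting examined in the paper). tau = 0.0 reproduces the naive policy:
    any single malignant voxel makes the patch malignant."""
    if tau == 0.0:
        return float(mask_patch.any())        # naive patch extraction
    return float(mask_patch.mean() >= tau)    # noise-regulated labelling
```

Under this rule, a 64 × 64 × 8 octant containing one stray malignant voxel (fraction ≈ 0.003%) is labelled benign for any τ above that fraction, removing the extreme 1:32767 label-noise case discussed earlier.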
The resultant detection map y_DF, i.e. essentially a post-processed y1, serves as the final output of our proposed CAD system. N_DF is limited to a simple framework of two hyperparameters only to alleviate the risk of overfitting. An appropriate combination of T_P and λ can either suppress clear false positives or facilitate an aggressive reduction scheme at the expense of fewer true positives in y_DF. In this research, we opted for the former policy to retain maximum csPCa detection sensitivity. Optimal values of T_P and λ were determined to be 0.98 and 0.90, respectively, via a coarse-to-fine hyperparameter grid search.

Parallel to recent studies in medical image computing (Gibson et al., 2018; Dalca et al., 2018; Wachinger et al., 2018; Cao et al., 2019b) on infusing spatial priori into CNN architectures, we hypothesize that M1 can benefit from an explicit anatomical prior for csPCa detection in bpMRI. To this end, we construct a probabilistic population prior P, as introduced in our previous work (Saha et al., 2020). P captures the spatial prevalence and zonal distinction of csPCa using 1584 radiologically-estimated csPCa (PR) annotations and CNN-generated prostate zonal segmentations from the training dataset. We opt for an early fusion technique to encode the clinical priori (Hosseinzadeh et al., 2019), where P is concatenated as an additional channel to every input scan passed through M1, thereby guiding its learning cycle as a spatial weight map embedded with domain-specific clinical knowledge (refer to Fig. 2).

Several experiments were conducted to statistically evaluate performance and analyze the design choices throughout the end-to-end model. We facilitated a fair comparison by maintaining an identical preprocessing, augmentation, tuning and train-validation pipeline for each candidate system in a given experiment.
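The decision fusion rule described above can be sketched as follows; T_P = 0.98 and λ = 0.90 are the reported optimal values, while the `patch_slices` bookkeeping input (mapping each patch back into the detection map) is an assumed implementation detail:

```python
import numpy as np

def decision_fusion(y1, patch_scores, patch_slices, t_p=0.98, lam=0.90):
    """Decision fusion node N_DF sketch: any patch whose benign probability
    (1 - y_i) meets the threshold T_P has its corresponding region in the
    detection map y1 multiplied by the penalty factor lambda, suppressing
    potential false positives while leaving suspicious patches untouched."""
    y_df = y1.copy()
    for score, region in zip(patch_scores, patch_slices):
        if (1.0 - score) >= t_p:      # patch confidently classified benign
            y_df[region] *= lam       # penalize detections inside it
    return y_df
```

Because λ is close to 1.0, confident detections survive the penalty and only weak activations in benign regions fall below typical operating thresholds, matching the conservative reduction policy chosen in the paper.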
Patient-based diagnosis performance was evaluated using the Receiver Operating Characteristic (ROC), where the area under the ROC (AUROC) was estimated from the normalized Wilcoxon/Mann-Whitney U statistic (Hanley and McNeil, 1982). Lesion-level performance was evaluated using the Free-Response Receiver Operating Characteristic (FROC) to address PCa multifocality, where detections sharing a minimum Dice Similarity Coefficient of 0.10 with the ground-truth annotation were considered true positives. All metrics were computed in 3D. Confidence intervals were estimated as twice the standard deviation from the mean of 5-fold cross-validation (applicable to validation sets) or 1000 replications of bootstrapping (applicable to testing sets). Statistically significant improvements were verified with a p-value on the difference in case-level AUROC and lesion-level sensitivity at clinically relevant false positive rates (0.5, 1.0) using 1000 replications of bootstrapping (Chihara et al., 2014). Bonferroni correction was used to adjust the significance level for multiple comparisons.
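The AUROC estimator referenced above (the normalized Wilcoxon/Mann-Whitney U statistic of Hanley and McNeil, 1982) can be sketched directly, since it equals the probability that a randomly drawn malignant case scores higher than a randomly drawn benign one:

```python
import numpy as np

def auroc_mann_whitney(scores_pos, scores_neg):
    """AUROC as the normalized Mann-Whitney U statistic: the fraction of
    (malignant, benign) score pairs where the malignant case ranks higher,
    counting ties as one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)
```

Pairing this estimator with resampled case lists (e.g. 1000 bootstrap replications over the test cohort) yields the confidence intervals and p-values described in the evaluation protocol.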
3. Results and Analysis
To determine the effect of the classification architecture for M2, five different 3D CNNs (ResNet-v2, Inception-ResNet-v2, Residual Attention Network, SEResNet, SEResNeXt) were implemented and tuned across their respective hyperparameters to maximize patient-based AUROC over 5-fold cross-validation. Furthermore, each candidate CNN was trained using whole-images and patches, in separate turns, to draw out a comparative analysis surrounding the merits of spatial context versus localized labels. In the latter case, we studied the effect of τ on patch-wise label assignment (refer to Section 2.2.2). We investigated four different values of τ: 0.0%, 0.1%, 0.5%, 1.0%; which correspond to minimum csPCa volumes of 9, 297, 594 and 1188 mm³, respectively. Each classifier was assessed qualitatively via 3D GradCAMs (Selvaraju et al., 2017) to ensure adequate interpretability for clinical usage.

From the results noted in Table 1, we observed that the SEResNet architecture consistently scored the highest AUROC across every training scheme. However, in each case, its performance remained statistically similar (p ≥ …) to that of the other candidates, despite the additional spatial
Fig. 4. Model interpretability of the candidate CNN architectures for classifier M2 at τ = …, for csPCa (PR) located in the prostatic TZ (center row) or PZ (top, bottom rows), as indicated by the yellow contours. Whole-image GradCAMs are generated by restitching and normalizing (min-max) the eight patch-level GradCAMs generated per case. Maximum voxel-level activation is observed in close proximity of csPCa (PR), despite training each network using patch-level binary labels only.

Table 1. Patient-based diagnosis performance of the candidate CNN architectures and training schemes (whole-image versus patch-wise training with four different values of τ to regulate label noise) for classifier M2. Performance scores indicate mean of 5-fold cross-validation, followed by 95% confidence intervals estimated as twice the standard deviation.

Model | Params | AUROC (Whole-Image) | AUROC (Patches, τ = 0.0%) | τ = 0.1% | τ = 0.5% | τ = 1.0%
ResNet-v2 (He et al., 2016) | 0.089 M | … | … | … | … | …
Inception-ResNet-v2 (Szegedy et al., 2017) | 6.121 M | … | … | … | … | …
Res. Attention Network (Wang et al., 2017) | 1.233 M | … | … | … | … | …
SEResNet (Hu et al., 2019) | 0.095 M | … | … | … | … | …
SEResNeXt (Hu et al., 2019) | 0.128 M | … | … | … | … | …

context provided per sample during whole-image training. Increasing the value of τ consistently improved performance for all candidate classifiers (up to 10% in patch-level AUROC). While we attribute this improvement to lower label noise, it is important to note that the vast majority of csPCa lesions are typically small (refer to Section 2.1.2) and entire patient cases risk being discarded from the training cycle for higher values of τ. For instance, when τ = 1.0%, any malignant patch containing less than 1188 mm³ of csPCa tissue is labelled as benign, leading to 9 patient cases with incorrect label assignment in the training dataset.
For the 3D CAD system, we chose the SEResNet patch-wise classifier trained at τ = … as M2, because at this setting performance remained statistically similar to the other τ settings (τ = {…}%), while patch-level AUROC still improved by nearly 2% relative to a naive patch extraction system (τ = 0.0%). M2 accurately targets csPCa lesions (if any) on a voxel-level basis, despite being trained on patch-level binary labels (as highlighted in Fig. 4). Further details regarding the network and training configurations of M2 are listed in Appendix A.

We analyzed the effect of the M1 architecture, in comparison to the four baseline 3D CNNs (U-SEResNet, UNet++, nnU-Net, Attention U-Net) that inspire its design. We evaluated the end-to-end 3D CAD system, along with the individual contributions of its constituent components (M1, M2, P), to examine the effects of false positive reduction and clinical priori. Additionally, we applied the ensembling heuristic of the nnU-Net framework (Isensee et al., 2020) to create CAD∗, i.e. an ensemble model comprising multiple CAD instances, and we studied its impact on overall performance. Each candidate setup was tuned over 5-fold cross-validation and benchmarked on the testing datasets (TS1, TS2).

Lesion Localization: From the FROC analysis on the institutional testing set TS1 (refer to Fig. 5), we observed that M1 reached 88.15 ± … detection sensitivity (p ≤ …)
Fig. 5. Lesion-level FROC (left) and patient-based ROC (right) analyses of csPCa (PR) (top row) / csPCa (GS) (bottom row) detection sensitivity against the number of false positives generated per patient scan using the baseline, ablated and proposed detection models on the institutional testing set TS1 (top row) and the external testing set TS2 (bottom row). Transparent areas indicate the 95% confidence intervals. Mean performance for the consensus of expert radiologists and their 95% confidence intervals are indicated by the centerpoint and length of the green markers, respectively, where all observations marked PI-RADS 4 or 5 are considered positive detections (as detailed in Section 2.3).

Adding the classifier M to M (M ⊗ M) reduced false positives by up to 12.89% (p ≤ ). The impact of M ⊗ M is illustrated in Fig. 6 through a particularly challenging patient case, where the prostate gland is afflicted by multiple, simultaneous conditions. With the inclusion of anatomical prior P in M ⊗ M, our proposed CAD system benefited from a further 3.14% increase in partial area under FROC (pAUC) between 0.10–2.50 false positives per patient, reaching 1.676 ± . It detected significantly more csPCa (PR) lesions (p ≤ ) than its component systems M and M ⊗ M, respectively, and reached a maximum detection sensitivity of 93.19 ± .

Fig. 6. (a) T2W, (b) DWI, (c) ADC scans for a patient case in the external testing set TS2, followed by its csPCa detection map as predicted by each candidate system: (d) U-SEResNet, (e) UNet++, (f) Attention U-Net, (g) nnU-Net, (h) M, (i) M ⊗ M, (j) proposed CAD, (k) proposed CAD∗. Legend: Benign Prostatic Hyperplasia (BPH); Indolent Prostate Cancer (GS ≤ 3+3); Clinically Significant Prostate Cancer (GS > 3+3). Three stand-alone detection networks (UNet++, nnU-Net, M) successfully identify the csPCa lesion, albeit with additional false positive(s).
In the case of the proposed CAD/CAD∗ system, the classifier in M ⊗ M suppresses these false positive(s) from M, while the inclusion of prior P further strengthens the confidence and boundaries of the true positive.

Overall, the proposed system produced fewer false positive occurrences than all other candidate systems.

Patient-Based Diagnosis: From the ROC analysis on the institutional testing set TS1 (refer to Fig. 5), we observed that our proposed CAD system reached 0.882 ± patient-based AUROC, significantly outperforming the baseline networks (p ≤ ), while its ability to discriminate between benign and malignant patient cases was statistically similar (p ≥ ) to that of M and M ⊗ M.

Both the FROC and ROC analyses on the external testing set TS2 (refer to Fig. 5) indicate similar patterns emerging as those observed in Section 3.2.1, but with an overall decrease in performance. Given the near-identical MRI scanners and acquisition conditions employed between both institutions (refer to Section 2.1.1), we primarily attribute this decline to the disparity between the imperfect radiologically-estimated training annotations (csPCa (PR)) and the histologically-confirmed testing annotations (csPCa (GS)) in TS2 (refer to Section 3.3 for radiologists' performance). By comparing the relative drop in performance for each candidate model, we can effectively estimate their generalization and latent understanding of csPCa, beyond our provided training samples.

Lesion Localization: At 1.0 false positive per patient, our proposed CAD system achieved 85.55 ± detection sensitivity (p ≤ ), detecting significantly more csPCa (GS) lesions than its ablated counterparts M and M ⊗ M, respectively. The 3D CAD system reached a maximum detection sensitivity of 90.03 ± , whereas that of M and M ⊗ M fell by nearly 10%. From the inclusion of P in M ⊗ M, this decline came down to only 3% for the CAD system at the same false positive rate. Furthermore, an overall 11.54% increase in pAUC was observed between 0.10–2.50 false positives per patient, relative to M ⊗ M.

Patient-Based Diagnosis: Our proposed CAD system reached 0.862 ± patient-based AUROC, significantly outperforming the baseline networks (p ≤ ) and improving upon M ⊗ M by 3.6% (p ≤ ).

Table 2.
Computational requirements (in terms of the number of trainable parameters, VRAM usage and the average time taken per patient scan during inference on a single NVIDIA RTX 2080 Ti) against the localization performance (in terms of the maximum csPCa detection sensitivity achieved and its corresponding false positive rate across both testing datasets) for each candidate detection system.
Model | Params | VRAM | Inference | Max. Sensitivity { FP Rate }, TS1 – csPCa (PR) | Max. Sensitivity { FP Rate }, TS2 – csPCa (GS)
U-SEResNet (Hu et al., 2019) | 1.615 M | . GB | . ± . s | . ± . { . } | . ± . { . }
UNet++ (Zhou et al., 2020) | 14.933 M | . GB | . ± . s | . ± . { . } | . ± . { . }
nnU-Net (Isensee et al., 2020) | 30.599 M | . GB | . ± . s | . ± . { . } | . ± . { . }
Attention U-Net (Schlemper et al., 2019) | 2.235 M | . GB | . ± . s | . ± . { . } | . ± . { . }
Dual-Attention U-Net – M | | . GB | . ± . s | . ± . { . } | . ± . { . }
M with False Positive Reduction – M ⊗ M | | . GB | . ± . s | . ± . { . } | . ± . { . }
M ⊗ M with Prior – Proposed CAD | 15.335 M | . GB | . ± . s | . ± . { . } | . ± . { . }
Ensemble of CAD – Proposed CAD∗ | | . GB | . ± . s | . ± . { . } | . ± . { . }

3.2.3. Effect of Ensembling

The ensembled prediction of CAD∗ is the weighted-average output of three member models: the 2D, 3D and two-stage cascaded 3D variants of the proposed CAD system (refer to Appendix A for detailed implementation). In comparison to the standard CAD system, CAD∗ carries 2.6× trainable parameters, occupies 2.5× VRAM for hardware acceleration and requires 1.3× inference time per patient scan (as noted in Table 2). In terms of its performance, CAD∗ demonstrated a 0.3–0.4% improvement in patient-based AUROC across both testing datasets and shared statistically similar lesion localization on TS1. It boasted a considerably large improvement in lesion detection on TS2, amounting to a 4.01% increase in pAUC between 0.10–2.50 false positives per patient (refer to Fig. 5), as well as a higher maximum detection sensitivity (91.05 ± ).

To evaluate the proposed CAD∗ system in comparison to the consensus of expert radiologists, we analyzed their relative performance on the external testing set TS2. Agreements in patient-based diagnosis were computed with Cohen's kappa. Radiologists achieved 90.72 ± , while the CAD∗ system reached 0.753 ± . Agreement of CAD∗ with the radiologists, of CAD∗ with the pathologists, and of the radiologists with the pathologists was moderate (kappa = ± , kappa = ± and kappa = ± , respectively).
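The weighted-average ensembling described above can be sketched as follows; the function name and array layout are our own illustration, with the member weights (0.60 / 0.20 / 0.20) taken from Appendix A:

```python
import numpy as np

def ensemble(member_maps, weights):
    """Weighted-average ensemble of member-model csPCa likelihood maps.

    member_maps : list of equally shaped prediction arrays, one per member
                  (e.g. the 3D, 2D and cascaded 3D variants of the CAD system).
    weights     : one scalar weight per member; should sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "ensemble weights should sum to 1"
    # Stack to (n_members, ...) and contract the member axis with the weights.
    return np.tensordot(weights, np.stack(member_maps), axes=1)
```

For example, `ensemble([p_3d, p_2d, p_cascade], [0.60, 0.20, 0.20])` would yield the CAD∗-style fused likelihood map.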
4. Discussion and Conclusion
We conclude that a detection network (M), harmonizing state-of-the-art attention mechanisms, can accurately discriminate more malignancies at the same false positive rate (refer to Section 3.2.1). Among four other recent adaptations of the 3D U-Net that are popularly used for biomedical segmentation, M detected significantly more csPCa lesions at 1.00 false positive per patient and consistently reached the highest detection sensitivity on the testing datasets between 0.10–2.50 false positives per patient (refer to Fig. 5). As soft attention mechanisms continue to evolve, supporting ease of optimization, sharing equivariance over permutations (Goyal and Bengio, 2020) and suppressing gradient updates from inaccurate annotations (Wang et al., 2017; Min et al., 2019), deep attentive models, such as M, become increasingly more applicable for csPCa detection in bpMRI (Duran et al., 2020; Yu et al., 2020b).

We conclude that a residual patch-wise 3D classifier (M) can significantly reduce false positives, without sacrificing high sensitivity. In stark contrast to ensembling, which scaled up the number of trainable parameters nearly 3× for limited improvements in performance (refer to Section 3.2.3), M produced flat increases in specificity (up to 12.89% fewer false positives per patient) across both testing datasets, while requiring less than 1% of the total parameters in our proposed CAD system (as noted in Table 2). Furthermore, as a decoupled classifier, M offers two practical advantages.
Fig. 7. Six patient cases from the external testing set TS2 and their corresponding csPCa detection maps, as predicted by the proposed CAD∗ system. Yellow contours indicate csPCa (GS) lesions, if present. While CAD∗ is able to successfully localize large, multifocal and apical instances of csPCa (GS) (left), in the presence of severe inflammation/fibrosis induced by other non-malignant conditions (e.g. BPH, prostatitis), CAD∗ can misidentify smaller lesions, resulting in false positive/negative predictions (right).

Firstly, the influence of M on the overall CAD system could be controlled via the decision fusion node N DF, such that the maximum detection sensitivity of the system was completely retained (refer to Table 2). Secondly, due to its independent training scheme, M remains highly modular, i.e. it can be easily tuned, upgraded or swapped out entirely upon future advancements, without retraining or affecting the stand-alone performance of M.

We conclude that encoding an anatomical prior (P) into the CNN architecture can guide model generalization with domain-specific clinical knowledge. Results indicated that P played the most important role in the generalization of the 3D CAD system (via M) and in retaining its performance across the multi-institutional testing datasets (refer to Section 3.2.2). Remarkably, its contribution was substantially more than any other architectural enhancement proposed in recent literature, while introducing negligible changes in the number of trainable parameters (refer to Table 2). However, it is worth noting that similar experiments with classifier M yielded no statistical improvements. Parallel to the methods proposed by Cheng et al. (2018) and Tang et al. (2019), M was designed to learn a different set of feature representations for csPCa than M, using its smaller receptive field size, patch-wise approach and decoupled optimization strategy.
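One plausible construction of such a prior, consistent with the population-based priors this work builds on but not necessarily the authors' exact procedure, is to average spatially normalized training annotations into an empirical prevalence map that is then fed to the network as an extra input channel:

```python
import numpy as np

def anatomical_prior(lesion_masks, eps=1e-6):
    """Estimate a probabilistic prior of csPCa spatial prevalence.

    lesion_masks : binary csPCa annotation masks, assumed to be already
                   registered to a common prostate-centered reference
                   frame, with shape (n_cases, D, H, W).
    Returns a voxel-wise empirical probability map, clipped away from
    exact 0/1 so it behaves well as a CNN input channel.
    """
    masks = np.asarray(lesion_masks, dtype=float)
    prior = masks.mean(axis=0)          # per-voxel fraction of positive cases
    return np.clip(prior, eps, 1.0 - eps)
```

A zonal distinction, as described in the abstract, could be obtained analogously by averaging within peripheral- and transition-zone segmentations separately; that refinement is omitted here.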
Thus, while M was trained to learn translation-covariant features for localization, M was trained to learn translation-invariant features for classification, i.e. patch-wise prediction of the presence/absence of csPCa, regardless of its spatial context in the prostate gland. We presume this key difference to be the primary reason why M was effective at independent false positive reduction, yet unable to leverage the spatial priori embedded in P. Nonetheless, our study confirmed that powerful anatomical priors, such as P, can substitute additional training data for deep learning-based CAD systems and improve model generalization, by relaying the inductive biases of csPCa in bpMRI (Goyal and Bengio, 2020).

We benchmarked our proposed architecture against a consensus of radiologists, using an external testing set graded by independent pathologists. Notably, we observed that the CAD∗ system demonstrated higher agreement with the pathologists (81.08%; kappa = ± ) than with the radiologists (kappa = ± ), underlining its ability to diagnose csPCa (GS) and generalize beyond the radiologically-estimated training annotations. Although deep learning-based systems remain inadequate as stand-alone solutions (refer to Fig. 5, 7), the moderate agreement of CAD∗ with both clinical experts, while inferring predictions relatively dissimilar to radiologists, highlights its potential to improve diagnostic certainty as a viable second reader in a screening setting (Sanford et al., 2020; Schelb et al., 2020).

The study is limited in a few aspects. All prostate scans used within the scope of this research were acquired using MRI scanners developed by the same vendor. Thus, generalizing our proposed solution to a vendor-neutral model requires special measures, such as domain adaptation (Chiou et al., 2020), to account for heterogeneous acquisition conditions. Radiologists utilize additional clinical variables (e.g. prior studies, DCE scans, PSA density levels, etc.)
to inform their diagnosis for each patient case, limiting the equity of any direct comparisons against the 3D CNNs developed in this research.

In summary, a novel automated end-to-end 3D CAD system, harmonizing several state-of-the-art methods from recent literature, was developed to diagnose and localize csPCa in bpMRI. To the best of our knowledge, this was the first demonstration of a deep learning-based 3D detection and diagnosis system for csPCa, trained using radiologically-estimated annotations only and evaluated on large, multi-institutional testing datasets. The promising results of this research motivate the ongoing development of new techniques, particularly those which factor in the breadth of clinical knowledge established in the field beyond limited training datasets, to create comprehensive CAD solutions for the clinical workflow of prostate cancer management.

Acknowledgements
The authors would like to acknowledge the contributions of Maarten de Rooij and Ilse Slootweg from Radboud University Medical Center during the annotation of fully delineated masks of prostate cancer for every bpMRI scan used in this study. This research is supported in part by the European Union H2020: ProCAncer-I project (EU grant 952159) and Siemens Healthineers (CID: C00225450). Anindo Saha is supported by a European Union EACEA: Erasmus+ grant in the Medical Imaging and Applications (MaIA) program.
References
Aldoj, N., Lukas, S., Dewey, M., Penzkofer, T., 2020. Semi-Automatic Classification of Prostate Cancer on Multi-parametric MR Imaging using a Multi-Channel 3D Convolutional Neural Network. European Radiology 30, 1243–1253.
Alkadi, R., El-Baz, A., Taher, F., Werghi, N., 2019. A 2.5D Deep Learning-Based Approach for Prostate Cancer Detection on T2-Weighted Magnetic Resonance Imaging, in: Computer Vision – ECCV 2018 Workshops, Springer International Publishing. pp. 734–739.
Basillote, J.B., Armenakas, N.A., Hochberg, D.A., Fracchia, J.A., 2003. Influence of Prostate Volume in the Detection of Prostate Cancer. Urology 61, 167–171.
Bass, E., Pantovic, A., Connor, M., Gabe, R., Ahmed, H., 2020. A Systematic Review and Meta-Analysis of the Diagnostic Accuracy of Bi-parametric Prostate MRI for Prostate Cancer in Men at Risk. Prostate Cancer and Prostatic Diseases, 1–16.
Cao, R., Mohammadian Bajgiran, A., Afshari Mirak, S., Shakeri, S., Zhong, X., Enzmann, D., Raman, S., Sung, K., 2019a. Joint Prostate Cancer Detection and Gleason Score Prediction in mp-MRI via FocalNet. IEEE Transactions on Medical Imaging 38, 2496–2506.
Cao, R., Zhong, X., Scalzo, F., Raman, S., Sung, K., 2019b. Prostate Cancer Inference via Weakly-Supervised Learning using a Large Collection of Negative MRI, in: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 434–439.
Chen, M.E., Johnston, D.A., Tang, K., Babaian, R.J., Troncoso, P., 2000. Detailed Mapping of Prostate Carcinoma Foci: Biopsy Strategy Implications. Cancer 89, 1800–1809.
Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., Huang, T., 2018. Revisiting RCNN: On Awakening the Classification Power of Faster RCNN, in: Proceedings of the European Conference on Computer Vision (ECCV).
Chihara, L.M., Hesterberg, T.C., Dobrow, R.P., 2014. Mathematical Statistics with Resampling and R & Probability: With Applications and R. John Wiley & Sons.
OCLC: 941516595.
Chiou, E., Giganti, F., Punwani, S., Kokkinos, I., Joskowicz, L., 2020. Harnessing Uncertainty in Domain Adaptation for MRI Prostate Lesion Segmentation, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Springer International Publishing. pp. 510–520.
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, Springer International Publishing. pp. 424–432.
Dalca, A.V., Guttag, J., Sabuncu, M.R., 2018. Anatomical Priors in Convolutional Networks for Unsupervised Biomedical Segmentation, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9290–9299.
Duran, A., Jodoin, P.M., Lartizien, C., 2020. Prostate Cancer Semantic Segmentation by Gleason Score Group in Bi-parametric MRI with Self Attention Model on the Peripheral Zone, in: International Conference on Medical Imaging with Deep Learning (MIDL) – Full Paper Track, Montreal, QC, Canada. pp. 193–204.
Elwenspoek, M.M.C., Sheppard, A.L., McInnes, M.D.F., Whiting, P., 2019. Comparison of Multiparametric Magnetic Resonance Imaging and Targeted Biopsy With Systematic Biopsy Alone for the Diagnosis of Prostate Cancer: A Systematic Review and Meta-analysis. JAMA Network Open 2, e198427.
Engels, R.R., Israël, B., Padhani, A.R., Barentsz, J.O., 2020. Multiparametric Magnetic Resonance Imaging for the Detection of Clinically Significant Prostate Cancer: What Urologists Need to Know. Part 1: Acquisition. European Urology 77, 457–468.
Epstein, J.I., Egevad, L., Amin, M.B., Delahunt, B., 2016. The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System. Am. J. Surg. Pathol. 40, 244–252.
Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., 2017.
Dermatologist-level Classification of Skin Cancer with Deep Neural Networks. Nature 542, 115–118.
Fu, J., Liu, J., Tian, H., Lu, H., 2019. Dual Attention Network for Scene Segmentation, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3141–3149.
Garcia-Reyes, K., Passoni, N.M., Palmeri, M.L., Kauffman, C.R., 2015. Detection of Prostate Cancer with Multiparametric MRI (mpMRI): Effect of Dedicated Reader Education on Accuracy and Confidence of Index and Anterior Cancer Diagnosis. Abdominal Imaging 40, 134–142.
Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B., Pereira, S.P., Clarkson, M.J., Barratt, D.C., 2018. Automatic Multi-Organ Segmentation on Abdominal CT With Dense V-Networks. IEEE Transactions on Medical Imaging 37, 1822–1834.
Goyal, A., Bengio, Y., 2020. Inductive Biases for Deep Learning of Higher-Level Cognition. arXiv:2011.15091.
Hanley, J.A., McNeil, B.J., 1982. The Meaning and Use of The Area Under A Receiver Operating Characteristic (ROC) Curve. Radiology 143, 29–36. PMID: 7063747.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Identity Mappings in Deep Residual Networks, in: Leibe, B., Matas, J., Sebe, N., Welling, M. (Eds.), Computer Vision – ECCV 2016, Springer International Publishing. pp. 630–645.
Hosseinzadeh, M., Brand, P., Huisman, H., 2019. Effect of Adding Probabilistic Zonal Prior in Deep Learning-based Prostate Cancer Detection, in: International Conference on Medical Imaging with Deep Learning (MIDL) – Extended Abstract Track, London, United Kingdom. pp. 1026–1034.
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E., 2019. Squeeze-and-Excitation Networks.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 7132–7141.
Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H., 2020. nnU-Net: A Self-configuring Method for Deep Learning-based Biomedical Image Segmentation. Nature Methods.
Israël, B., van der Leest, M., Sedelaar, M., Padhani, A.R., Zámecnik, P., Barentsz, J.O., 2020. Multiparametric Magnetic Resonance Imaging for the Detection of Clinically Significant Prostate Cancer: What Urologists Need to Know. Part 2: Interpretation. European Urology 77, 469–480.
Jiang, Z., Ding, C., Liu, M., Tao, D., 2020. Two-Stage Cascaded U-Net: 1st Place Solution to BraTS Challenge 2019 Segmentation Task, in: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Springer International Publishing. pp. 231–241.
Johnson, L.M., Turkbey, B., Figg, W.D., Choyke, P.L., 2014. Multiparametric MRI in Prostate Cancer Management. Nature Reviews Clinical Oncology 11, 346–353.
Kasivisvanathan, V., Rannikko, A.S., Borghi, M., Panebianco, V., 2018. MRI-Targeted or Standard Biopsy for Prostate-Cancer Diagnosis. New England Journal of Medicine 378, 1767–1777.
Kingma, D.P., Ba, J., 2015. Adam: A Method for Stochastic Optimization, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1412.6980.
Lemaître, G., Martí, R., Rastgoo, M., Mériaudeau, F., 2017. Computer-Aided Detection for Prostate Cancer Detection based on Multi-parametric Magnetic Resonance Imaging, in: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3138–3141.
Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal Loss for Dense Object Detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007.
Litjens, G., Debats, O., Barentsz, J., Karssemeijer, N., Huisman, H., 2014. Computer-aided Detection of Prostate Cancer in MRI.
IEEE Transactions on Medical Imaging 33, 1083–1092.
Luo, L., Xiong, Y., Liu, Y., 2019. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, in: International Conference on Learning Representations.
McKinney, S.M., Sieniek, M., Godbole, V., Godwin, J., 2020. International Evaluation of an AI System for Breast Cancer Screening. Nature 577, 89–94.
Miller, K.D., Nogueira, L., Mariotto, A.B., Rowland, J.H., Yabroff, K.R., Alfano, C.M., Jemal, A., Kramer, J.L., Siegel, R.L., 2019. Cancer Treatment and Survivorship Statistics, 2019. CA: A Cancer Journal for Clinicians 69, 363–385.
Min, S., Chen, X., Zha, Z.J., Wu, F., Zhang, Y., 2019. A Two-Stream Mutual Attention Network for Semi-supervised Biomedical Segmentation with Noisy Labels, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4578–4585.
Riepe, T., Hosseinzadeh, M., Brand, P., Huisman, H., 2020. Anisotropic Deep Learning Multi-planar Automatic Prostate Segmentation, in: Proceedings of the 28th International Society for Magnetic Resonance in Medicine Annual Meeting. URL: http://indexsmart.mirasmart.com/ISMRM2020/PDFfiles/3518.html.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation, in: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Springer International Publishing. pp. 234–241.
Rosenkrantz, A.B., Ginocchio, L.A., Cornfeld, D., Froemming, A.T., 2016. Interobserver Reproducibility of the PI-RADS Version 2 Lexicon: A Multicenter Study of Six Experienced Prostate Radiologists. Radiology 280, 793–804.
Rouvière, O., Puech, P., Renard-Penna, R., Claudon, M., 2019. Use of Prostate Systematic and Targeted Biopsy on the Basis of Multiparametric MRI in Biopsy-Naive Patients (MRI-FIRST): A Prospective, Multicentre, Paired Diagnostic Study. The Lancet Oncology 20, 100–109.
Rundo, L., Han, C., Nagano, Y., Zhang, J., Hataya, R., Militello, C., Tangherloni, A., Nobile, M., Ferretti, C., Besozzi, D., Gilardi, M., Vitabile, S., Mauri, G., Nakayama, H., Cazzaniga, P., 2019. USE-Net: Incorporating Squeeze-and-Excitation Blocks into U-Net for Prostate Zonal Segmentation of Multi-Institutional MRI Datasets. Neurocomputing 365, 31–43.
Saha, A., Hosseinzadeh, M., Huisman, H., 2020. Encoding Clinical Priori in 3D Convolutional Neural Networks for Prostate Cancer Detection in bpMRI, in: Medical Imaging Meets NeurIPS Workshop – 34th Conference on Neural Information Processing Systems (NeurIPS 2020). URL: https://arxiv.org/abs/2011.00263.
Sanford, T., Harmon, S.A., Turkbey, E.B., Turkbey, B., 2020. Deep-Learning-Based Artificial Intelligence for PI-RADS Classification to Assist Multiparametric Prostate MRI Interpretation: A Development Study. Journal of Magnetic Resonance Imaging.
Schelb, P., Kohl, S., Radtke, J.P., Bonekamp, D., 2019. Classification of Cancer at Prostate MRI: Deep Learning versus Clinical PI-RADS Assessment. Radiology 293, 607–617.
Schelb, P., Wang, X., Radtke, J.P., Bonekamp, D., 2020. Simulated Clinical Deployment of Fully Automatic Deep Learning for Clinical Prostate MRI Assessment. European Radiology.
Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D., 2019. Attention Gated Networks: Learning to Leverage Salient Regions in Medical Images. Medical Image Analysis 53, 197–207.
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
Smith, C.P., Harmon, S.A., Barrett, T., Bittencourt, L.K., 2019. Intra- and Interreader Reproducibility of PI-RADSv2: A Multireader Study. Journal of Magnetic Resonance Imaging 49, 1694–1703.
Smith, L.N., 2017.
Cyclical Learning Rates for Training Neural Networks, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472.
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A., 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Press. pp. 4278–4284.
Tang, H., Zhang, C., Xie, X., 2019. NoduleNet: Decoupled False Positive Reduction for Pulmonary Nodule Detection and Segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI 2019), pp. 266–274.
Turkbey, B., Rosenkrantz, A.B., Haider, M.A., Padhani, A.R., Margolis, D.J., 2019. Prostate Imaging Reporting and Data System Version 2.1: 2019 Update of Prostate Imaging Reporting and Data System Version 2. European Urology.
van der Leest, M., Cornel, E., Israël, B., Hendriks, R., 2019. Head-to-head Comparison of Transrectal Ultrasound-guided Prostate Biopsy Versus Multiparametric Prostate Resonance Imaging with Subsequent Magnetic Resonance-guided Biopsy in Biopsy-naïve Men with Elevated Prostate-specific Antigen: A Large Prospective Multicenter Clinical Study. European Urology 75, 570–578. doi: 10.1016/j.eururo.2018.11.023.
Verma, S., Choyke, P.L., Eberhardt, S.C., Oto, A., Tempany, C.M., Turkbey, B., Rosenkrantz, A.B., 2017. The Current State of MR Imaging-targeted Biopsy Techniques for Detection of Prostate Cancer. Radiology 285, 343–356.
Wachinger, C., Reuter, M., Klein, T., 2018. DeepNAT: Deep Convolutional Neural Network for Segmenting Neuroanatomy. NeuroImage 170, 434–445.
Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X., 2017. Residual Attention Network for Image Classification, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6450–6458.
Weinreb, J.C., Barentsz, J.O., Choyke, P.L., Cornud, F., 2016.
PI-RADS Prostate Imaging – Reporting and Data System: 2015, Version 2. European Urology 69, 16–40.
Westphalen, A.C., McCulloch, C.E., Anaokar, J.M., Arora, S., Rosenkrantz, A.B., 2020. Variability of the Positive Predictive Value of PI-RADS for Prostate MRI across 26 Centers: Experience of the Society of Abdominal Radiology Prostate Cancer Disease-focused Panel. Radiology 296, 76–84. PMID: 32315265.
Xiao, C., Deng, R., Li, B., Yu, F., Liu, M., Song, D., 2018. Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation, in: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018, Springer International Publishing. pp. 220–237.
Yoo, S., Gujrathi, I., Haider, M.A., Khalvati, F., 2019. Prostate Cancer Detection using Deep Convolutional Neural Networks. Scientific Reports 9, 19518.
Yu, X., Lou, B., Shi, B., Szolar, D., 2020a. False Positive Reduction Using Multiscale Contextual Features for Prostate Cancer Detection in Multi-Parametric MRI Scans, in: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1355–1359.
Yu, X., Lou, B., Zhang, D., Winkel, D., Joskowicz, L., 2020b. Deep Attentive Panoptic Model for Prostate Cancer Detection Using Biparametric MRI Scans, in: Medical Image Computing and Computer Assisted Intervention (MICCAI 2020), Springer International Publishing. pp. 594–604.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2020. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Transactions on Medical Imaging 39, 1856–1867.

Appendix A. Network Configurations
The proposed CAD/CAD∗ system, including its CNN components (M, M), was implemented in TensorFlow (Estimator, Keras APIs). Special care was taken throughout the design stage (as detailed in Section 2.2) to ensure computational efficiency, such that the end-to-end 3D system is fully trainable and deployable from a single NVIDIA RTX 2080 Ti GPU (11 GB) in less than 6 hours for the dataset used in this study.
3D Dual-Attention U-Net (M) (component of the CAD system): The network architecture (as detailed in Section 3.2.1) comprises 75 convolutional layers. Layers along the encoder and decoder stages are activated by ReLU and Leaky ReLU (α = ), respectively, and the output layer is activated by the softmax function. A dimension reduction ratio of 8 is applied to re-weight each channel inside every SE module (Hu et al., 2019). Sub-sampling kernels of size (1,1,1) are used inside every grid-based attention gate (Schlemper et al., 2019). Dropout nodes (rate = ) are connected at each scale of the decoder to alleviate overfitting. M is initialized using He uniform variance scaling (He et al., 2015) and trained using multi-channel whole-images over 40 epochs, with a minibatch size of 2 and an exponentially decaying cyclic learning rate (γ = , step size = epochs) (Smith, 2017). Focal loss (α = , γ = ) is used with the Adam optimizer (Kingma and Ba, 2015) in backpropagation through the model. Train-time augmentations include horizontal flip, rotation, horizontal/vertical translation and scaling centered along the axial plane. Test-time augmentation includes horizontal flip along the axial plane. M predictions carry a weight of 0.60 in the ensembled output of CAD∗.
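For reference, the α-balanced binary focal loss of Lin et al. (2017) used here can be written in a few lines of NumPy; the α and γ values below are illustrative placeholders, since the paper's exact settings were lost in extraction:

```python
import numpy as np

def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """Alpha-balanced binary focal loss (Lin et al., 2017), averaged per voxel.

    p : predicted csPCa probability; y : binary ground-truth label.
    alpha/gamma are illustrative, not the paper's exact values.
    """
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)   # class-balancing weight
    # The (1 - pt)^gamma factor down-weights easy, well-classified voxels.
    return -(a * (1 - pt) ** gamma * np.log(pt)).mean()
```

With gamma = 0 and alpha = 0.5, this reduces to half the standard binary cross-entropy, which makes the focusing effect easy to verify.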
3D SEResNet (M) (component of the CAD system): The network follows a relatively shallow 3D adaptation of the SEResNet architecture proposed by Hu et al. (2019), comprising 2 residual blocks with 6 convolutional layers each, followed by global average pooling and a single densely-connected layer. All layers are activated by ReLU, with the exception of the output layer, which is activated by the softmax function. A dimension reduction ratio of 8 is applied to re-weight each channel inside every SE module. M is initialized using He uniform variance scaling (He et al., 2015) and trained using multi-channel octant patches over 262 epochs. It trains with a minibatch size of 80 (equivalent to 10 full scans) and an exponentially decaying cyclic learning rate (γ = , step size = epochs) (Smith, 2017). Balanced cross-entropy loss (β = ) is used with the AMSBound optimizer (Luo et al., 2019) in backpropagation through the model. Train-time augmentations include horizontal flip, rotation, horizontal/vertical translation and scaling centered along the axial plane.
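The SE channel re-weighting applied in both networks (Hu et al., 2019) amounts to a global average pool, a two-layer bottleneck and a sigmoid gate. A minimal NumPy forward-pass sketch (weight matrices here are hypothetical stand-ins, sized for a reduction ratio of 8):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation channel re-weighting (Hu et al., 2019).

    x  : feature map of shape (C, D, H, W).
    w1 : (C, C // r) squeeze weights; w2 : (C // r, C) excite weights,
         where r is the dimension reduction ratio (8 in this paper).
    """
    c = x.shape[0]
    z = x.reshape(c, -1).mean(axis=1)          # squeeze: global average pool
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)  # excite: FC -> ReLU -> FC -> sigmoid
    return x * s[:, None, None, None]          # re-scale each channel
```

In the real networks the weights are learned and the block sits inside residual units; this sketch only illustrates the data flow.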
3D CAD (member model of the CAD∗ ensemble): The standard solution proposed in this research, comprising the detection network M, decoupled classifier M and anatomical prior P (as detailed in Section 3.2). Model predictions carry a weight of 0.60 in the ensembled output of CAD∗.
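The decision fusion node N DF that combines M and M is not specified in this excerpt; one simple fusion rule consistent with its described behavior (down-weighting, but never zeroing, detector scores in patches the classifier deems benign, so the maximum-sensitivity operating point survives) might look like the following sketch, where all names and the `floor` parameter are our own illustration:

```python
import numpy as np

def fuse(detection_map, patch_probs, patch_size, floor=0.5):
    """Illustrative decision fusion of detector and patch classifier.

    detection_map : 3D array of voxel-level csPCa likelihoods from M.
    patch_probs   : 3D array with one malignancy probability per patch from M.
    patch_size    : (d, h, w) extent of each classifier patch in voxels.
    Scores are rescaled, never zeroed, so every detection survives at a
    reduced confidence and maximum sensitivity is preserved.
    """
    fused = detection_map.copy()
    for idx in np.ndindex(patch_probs.shape):
        w = max(patch_probs[idx], floor)   # never fully suppress a patch
        sl = tuple(slice(i * s, (i + 1) * s) for i, s in zip(idx, patch_size))
        fused[sl] *= w
    return fused
```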
2D CAD (member model of the CAD∗ ensemble): Network architecture and training configuration are identical to those of the 3D CAD system, with only one exception: all modules operate with isotropically-strided 2D convolutions. Model predictions carry a weight of 0.20 in the ensembled output of CAD∗.
3D Two-Stage Cascaded CAD (member model of the CAD∗ ensemble): The network architecture of each stage and the training configuration of the overall model are identical to those of the 3D CAD system, with three exceptions. The first stage uses only half as many convolutional filters as the 3D CAD system at every resolution. The second-stage input includes the first-stage output as an additional channel. The total cost function is computed as the average loss between the intermediary first-stage and the final second-stage outputs against the same ground truth, identical to the coarse-to-fine approach proposed by Jiang et al. (2020). Model predictions carry a weight of 0.20 in the ensembled output of CAD∗.
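The cascade wiring and its averaged cost function can be sketched schematically; the stage functions, shapes and loss are our assumptions, not the exact implementation:

```python
import numpy as np

def cascade_forward(x, stage1, stage2):
    """Two-stage coarse-to-fine cascade (cf. Jiang et al., 2020).

    x : bpMRI input of shape (channels, D, H, W).
    The second-stage input is the original channels plus the coarse
    first-stage prediction, appended as an extra channel.
    """
    y1 = stage1(x)                          # coarse likelihood map, (1, D, H, W)
    x2 = np.concatenate([x, y1], axis=0)    # append prediction as a channel
    y2 = stage2(x2)                         # refined likelihood map
    return y1, y2

def cascade_loss(y1, y2, target, loss_fn):
    """Total cost: average of intermediary and final losses vs. one target."""
    return 0.5 * (loss_fn(y1, target) + loss_fn(y2, target))
```

Supervising the intermediary output alongside the final one is what makes the first stage a usable coarse detector rather than an arbitrary feature extractor.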