Semi-supervised Semantic Segmentation of Prostate and Organs-at-Risk on 3D Pelvic CT Images

Zhuangzhuang Zhang, Tianyu Zhao, Hiram Gay, Baozhou Sun*, and Weixiong Zhang*

The paper was submitted for review on 10/22/2020. This work was supported in part by Varian Medical Systems through a research grant. Zhuangzhuang Zhang (email: [email protected]) is with the Department of Computer Science and Engineering, Washington University, One Brookings Drive, Campus Box 1045, St. Louis, MO 63130 USA. Tianyu Zhao, Hiram Gay, and Baozhou Sun (email: [email protected]) are with the Department of Radiation Oncology, Washington University School of Medicine, 4921 Parkview Place, Campus Box 8224, St. Louis, MO 63110 USA. Weixiong Zhang (email: [email protected]) is with the Department of Computer Science and Engineering and the Department of Genetics, Washington University, One Brookings Drive, Campus Box 1045, St. Louis, MO 63130 USA. *Correspondence to [email protected] or [email protected].
Abstract—Automated segmentation of organs-at-risk (OARs) in pelvic computed tomography (CT) images can assist radiotherapy treatment planning by saving effort on manual contouring and reducing intra-observer and inter-observer variations. However, training effective deep-learning segmentation models usually requires sufficient high-quality labeled data, which are costly to collect. Taking automated segmentation of OARs as a case study, we developed a novel semi-supervised adversarial deep learning approach to support medical image data processing and modeling where training data may be insufficient. Unlike supervised deep learning methods, the new approach can utilize both annotated and un-annotated data for training. Additionally, it can generate un-annotated data through a generative adversarial network (GAN)-aided data augmentation scheme. We applied the new approach to segmenting tumors and multiple OARs in male pelvic CT images and evaluated it on a dataset of 100 training cases and 20 testing cases. Experimental results on four metrics (dice similarity coefficient, average Hausdorff distance, average surface Hausdorff distance, and relative volume difference) showed that the new method achieves comparable performance with less annotated data and better performance with the same amount of annotated data, and that its 3D pelvic CT segmentation model performs comparably to or better than other state-of-the-art methods.
Index Terms —Deep Learning, Generative Adversarial Network, Semi-supervised Learning, Organs-at-risk Segmentation.
I. INTRODUCTION
Accurate delineation of organs-at-risk (OARs) is crucial for maximizing target coverage while minimizing toxicities during radiotherapy [1, 2]. As a major treatment for prostate cancer, external beam radiotherapy requires accurate segmentation of the prostate and its nearby organs [3]. OARs are normally contoured manually by experienced oncologists, a common clinical practice that entails three serious drawbacks: 1) it is labor-intensive and time-consuming for physicians to contour multiple organs slice by slice; 2) the diverse expertise and experience levels of physicians introduce considerable intra-observer variation [2]; and 3) low-contrast and fuzzy boundaries of OARs on medical images cause significant inter-observer variation in delineation [4]. Time-consuming manual segmentation is inadequate to support adaptive treatment, and intra- and inter-observer variations introduce uncertainty into treatment planning that potentially compromises treatment outcomes. Automated segmentation approaches that can rapidly contour OARs with reliable and robust quality can overcome these drawbacks and support accurate and efficient radiotherapy treatment.

Deep learning [5] has evolved rapidly as an enabling technique for various real-world problems in the past decades. It has been extended and applied to medical image segmentation to provide accurate, reliable, and efficient delineation of pelvic CT images [1-3, 6-12]. However, supervised deep learning models usually require massive annotated datasets to train, which presents a severe bottleneck for medical imaging applications. High-performance deep learning segmentation models for natural images are usually supported by massive datasets; for instance, sophisticatedly crafted deep learning methods are supported by more than 14 million images in the ImageNet dataset [13]. Unlike natural images, however, medical images must be annotated by professionals with specific expertise, making data acquisition and annotation laborious, expensive, and time-consuming [3]. Although current clinical practice produces annotated data during treatment, the quality of those annotations varies and influences the learning outcome: inter-observer variation, caused by the difficulty of annotating and by diverse expertise levels, imposes inconsistency on the dataset and compromises the learning outcome [3]. As a result, the lack of massive, high-quality annotated datasets is a common bottleneck in medical imaging.

Here, we propose a novel semi-supervised adversarial deep learning approach to address the lack-of-data bottleneck by utilizing and synthesizing un-annotated data. Ample un-annotated data are often available from clinical practice and can serve as training data for our semi-supervised learning scheme. Furthermore, the new method can synthesize new images to enlarge the set of un-annotated data, if needed, to enhance training and ultimately improve segmentation models. In short, the new approach has two prominent features. First, it adopts an adversarial learning scheme to utilize un-annotated data for training, so that fewer annotated data are required. Second, it can synthesize new data from a limited number of annotated images and thus support a semi-supervised adversarial learning scheme.

The new approach has three major components: a CNN-based segmentation network (S-net), a discriminator network (D-net) for adversarial learning, and a progressive growth generative adversarial network (PGGAN). We adopt the residual U-net [30] as the backbone architecture for S-net and introduce several modifications to the overall design. We use an adversarial learning scheme, formulated as a generative adversarial network (GAN) [15], to train D-net to distinguish predicted label maps from ground-truth label maps. We design a PGGAN for data augmentation [16] to synthesize new un-annotated data, when needed, from a small number of annotated training images for use in semi-supervised learning.

We applied the novel approach to 3D pelvic CT segmentation as a case study to demonstrate its power on applications where training data are limited. The problem of contouring tumors and OARs in pelvic CT images is not overly complex and involves only a few relevant OARs. We also have ample high-quality annotated pelvic CT data, so we can adequately validate the effectiveness of the semi-supervised learning framework. In our study, we also compared the new method and model with several state-of-the-art segmentation methods using four widely used metrics: dice similarity coefficient (DSC), average Hausdorff distance (AHD), average surface Hausdorff distance (ASHD), and relative volume difference (VD).
II. RELATED WORK
A. Male Pelvic CT Image Segmentation
The automated segmentation of male pelvic CT images has been a research focus in radiation therapy. Two types of methods have been proposed: multi-atlas-based and learning-based methods. Multi-atlas-based methods typically follow two steps to generate organ segmentations: atlas registration and label fusion [3]. Acosta et al. discussed several segmentation and evaluation strategies for different organs that fall into this category [4]. While the results of these methods are acceptable, the required manual guidance and registration are costly, which automated segmentation approaches attempt to address [3, 17].

Machine learning methods, e.g., random forests [8], have been exploited to perform OAR segmentation on medical images [18, 19]. As deep learning methods have developed rapidly over the past decades, they have achieved state-of-the-art performance on many difficult problems [5]. Deep learning models have been applied to medical image segmentation with outstanding performance thanks to their capacity for feature extraction and representation [2, 7, 9, 10, 12, 20, 21]. The fully convolutional network (FCN) [22] was first proposed for natural image segmentation and inspired many medical image segmentation methods. For example, Wang et al. [9] proposed a dilated FCN method for OAR segmentation in pelvic CT images and achieved a DSC score of 0.85 on the prostate. After Ronneberger et al. [14] proposed U-net for medical image segmentation, several U-net variations performed well in image segmentation [1, 2, 7, 23]. Kazemifar et al. developed a 2D U-net-based method for region-of-interest (ROI) segmentation on pelvic CT images [2]. Their new approach used a 2D U-net for ROI localization and a 3D U-net for segmentation [1]. Although this method performs excellently on multiple organs, it trains multiple segmentation networks, one for each organ, which is costly to implement [1].

It is essential to highlight that deep learning methods typically require a massive amount of training data to build effective models. Obtaining massive datasets is challenging in medical imaging, since data annotation requires professional expertise, making data collection labor-intensive and time-consuming.
B. Generative Adversarial Network (GAN)
The generative adversarial network (GAN), proposed in 2014 by Goodfellow et al. [15], has been extensively studied and applied to many real-world problems. A favorable feature of the GAN is its ability to generate data with a desirable distribution [24]. A GAN consists of two major components, a generator and a discriminator. The generator is trained to learn a model of given training examples, while the discriminator is tasked to distinguish the data produced by the generator from the given examples [15]. The two components are trained in an adversarial scheme in which they compete against each other until reaching an equilibrium: the generator attempts to produce synthetic data similar to the given examples to fool the discriminator, while the discriminator tries to distinguish synthetic data from genuine data. Tailored to specific applications, several GAN variations have been proposed, such as the conditional GAN (CGAN) [25], the progressive growth GAN (PGGAN) [26], and the triple GAN [27]. These generative methods perform well for data generation, which is particularly desirable when training examples are insufficient.

Besides generating new data, the idea of adversarial learning also lends itself to supervised learning. Luc et al. cast a segmentation network as the generator and adopted the adversarial learning scheme for semantic segmentation, which performed well on the Pascal VOC 2012 dataset [28]. Hung et al. proposed a semi-supervised adversarial segmentation method that uses images with incomplete annotations for training [29].
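To make the adversarial scheme concrete, the following is a minimal GAN training loop in PyTorch on toy 2-D data. The architectures, data distribution, and hyperparameters are arbitrary illustrative choices, not taken from any of the cited works.

```python
import torch
import torch.nn as nn

# Tiny generator (noise -> 2-D sample) and discriminator (2-D sample -> real/fake score).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, 2) * 0.5 + 2.0      # samples from the target distribution
    fake = G(torch.randn(64, 8))               # generator output from noise z

    # Discriminator update: push real samples toward 1 and fake samples toward 0.
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: try to make the discriminator label fake samples as real.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```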
III. THE METHOD

A. Overview
To reiterate, our new method (Fig. 1) included three major components: a segmentation network (S-net), a discriminator network (D-net), and a progressive growth GAN (PGGAN) [26]. The method worked in two modes: 1) PGGAN-aided data augmentation, and 2) a semi-supervised adversarial learning scheme using S-net and D-net. These two modes of action were integrated to address the bottleneck of insufficient data. We first trained a CT-synthesis PGGAN and synthesized un-annotated CT images, which we later used for semi-supervised learning. We then designed a semi-supervised adversarial learning scheme involving S-net and D-net. This scheme worked with both the annotated training data and the additional data synthesized by PGGAN. In other words, the semi-supervised adversarial learning scheme combined annotated and un-annotated data to improve the quality of the final model.

The overall approach was designed for three scenarios: (1) when data annotations are unavailable yet un-annotated data are available, it can utilize the un-annotated data for training with the semi-supervised adversarial learning scheme; (2) when the given annotated training examples are insufficient and un-annotated data are also unavailable, it can synthesize un-annotated data using PGGAN; and (3) when training examples are sufficient for learning, it can generate and use additional synthesized un-annotated data to improve the quality of training and of the final models by semi-supervised learning.
B. Residual U-net Backbone with Auxiliary Classifiers
The segmentation network S-net adopted the residual U-net [30] as its backbone architecture. The residual connection allowed the feature maps to bypass the non-linear transformations with an identity mapping; this design went beyond regular skip connections and reformulated the layers as learning residual functions [31]. The residual U-net in S-net took 16 × 128 × 128 CT volumes (16 slices of 128 × 128 CT scans) with one channel as its input and output 16 × 128 × 128 segmentation volumes with six channels, one for each of the OARs and the background.

As in the original U-net [14] architecture, S-net first down-sampled the input for feature extraction, then up-sampled the extracted feature maps back to the original size to perform classification. During the up-sampling process, feature maps from the down-sampling path were passed forward for concatenation as guidance for high-resolution spatial context information [22]. Along the down-sampling path of the original U-net, multiple stride-2 max-pooling layers were used to shrink feature maps, which caused spatial information loss [32]. To preserve spatial information, we replaced all the stride-2 max-pooling layers with the multi-scale pooling layers developed in our adversarial multi-residual multi-scale pooling MRF-enhanced network (ARPM-net) [33], as sketched below.

Besides the classifier on the top level (in green in Fig. 1) that classified feature maps at the original resolution (16 × 128 × 128), S-net also included two auxiliary classifiers. The auxiliary classifier aux_2 (in green in Fig. 1) worked on feature maps with 128 channels, and aux_4 (in green in Fig. 1) processed feature maps with 256 channels. The segmentation outputs of these two auxiliary classifiers were up-sampled by interpolation to the input dimensions of 16 × 128 × 128. This architecture design incorporated the compressed feature maps into the final prediction, allowing supervised learning in the latent space [34].
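The exact multi-scale pooling layer is specified in ARPM-net [33]; the sketch below only illustrates the general idea described above, namely parallel stride-2 poolings with different kernel sizes fused by a 1×1×1 convolution. The kernel sizes, cropping rule, and fusion are our hypothetical choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiScalePool3D(nn.Module):
    """Hypothetical multi-scale replacement for a stride-2 max-pooling layer."""
    def __init__(self, channels, kernel_sizes=(2, 3, 5)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool3d(k, stride=2, padding=k // 2) for k in kernel_sizes)
        # Fuse the concatenated branches back to the original channel count.
        self.fuse = nn.Conv3d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):
        outs = [p(x) for p in self.pools]
        # Crop branches to a common spatial size before concatenation.
        z, h, w = (min(o.shape[d] for o in outs) for d in (2, 3, 4))
        outs = [o[:, :, :z, :h, :w] for o in outs]
        return self.fuse(torch.cat(outs, dim=1))
```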
Fig. 1. Workflow of the proposed semi-supervised learning method.

In the original residual U-net [30], there was only one classifier in the last layer of the network, and its outputs determined the loss against the ground truth. When the loss was back-propagated into the network, supervision was provided only in the last layer. With two auxiliary classifiers contributing to the final prediction, however, supervision occurred not only in the last layer but also in the middle layers; thus, the two auxiliary classifiers introduced supervision into the latent space. The segmentation predictions of aux_4 and aux_2 were weighted by 0.25 and 0.5, respectively, in the final label prediction. The final label map predictions were 16 × 128 × 128 volumes with six channels representing the probability of each voxel belonging to each of the six classes.
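A minimal sketch of how the three classifier outputs could be fused into the final prediction: aux_2 and aux_4 are up-sampled to the input resolution by interpolation and weighted 0.5 and 0.25 as stated above. The paper specifies only the weights; summing logits before a softmax is our assumption about the exact fusion rule.

```python
import torch
import torch.nn.functional as F

def fuse_predictions(main_logits, aux2_logits, aux4_logits):
    """main_logits: (N, 6, 16, 128, 128); aux2/aux4: lower-resolution logits."""
    size = main_logits.shape[2:]
    aux2 = F.interpolate(aux2_logits, size=size, mode="trilinear", align_corners=False)
    aux4 = F.interpolate(aux4_logits, size=size, mode="trilinear", align_corners=False)
    fused = main_logits + 0.5 * aux2 + 0.25 * aux4   # weighted combination
    return torch.softmax(fused, dim=1)               # per-voxel class probabilities
```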
C. Progressive Growth GAN-aided CT Synthesis
The generative adversarial network is known for generating synthetic samples from high-dimensional data distributions [24]. The general framework supports the generation of different types of data, and different GAN variations work particularly well on certain data types. The progressive growth generative adversarial network (PGGAN) is a GAN tailored to image synthesis [26]. PGGAN incrementally generates images at increasingly higher resolutions: it first discovers the large structures of the image distribution and then shifts attention to increasingly finer details [26].

We adopted the training scheme of PGGAN and designed a new network architecture for synthesizing 3D CT volumes (Fig. 2). The generator 𝒢 and the discriminator 𝒟 had mirrored architectures. The training started by generating low-resolution volumes: 𝒢 took a random noise $z$ as input and output a synthesized volume $x'$, while 𝒟 took $x'$ or a down-sampled real CT volume $x$ as input and output a prediction of whether the input was real ($x$) or fake ($x'$). After 𝒢 was well trained on low-resolution ($n \times n \times n$) volumes, one more up-sampling block was added to 𝒢 and 𝒟 individually to generate $2n \times 2n \times 2n$ volumes. This process repeated until the last up-sampling blocks were added to generate 64 × 128 × 128 volumes from 64 × 64 × 64 feature maps. Note that the 64 × 128 × 128 synthesized CT volumes had the same dimensions as real CT volumes, and both were cropped to 16 × 128 × 128 before being fed into S-net.

Notice that adding randomly initialized layers to the well-trained 𝒢 and 𝒟 could be disruptive to the training equilibrium. To address this issue, we used the smooth fade-in trick proposed in [26], linearly increasing a weight $\alpha$ that combines the outputs of an already trained low-resolution layer and a newly added higher-resolution layer.
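A minimal sketch of the smooth fade-in [26] on the generator side: the output of the new block is linearly blended with the upsampled output of the trained low-resolution path while $\alpha$ grows from 0 to 1. Module names are illustrative, and the uniform scale factor of 2 is a simplification (the paper's last stage doubles only the in-plane dimensions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FadeIn3D(nn.Module):
    """Blend a newly added high-resolution block with the trained low-resolution path."""
    def __init__(self, new_block, to_volume_old, to_volume_new):
        super().__init__()
        self.new_block = new_block          # newly added up-sampling block
        self.to_volume_old = to_volume_old  # 1x1x1 conv head of the trained stage
        self.to_volume_new = to_volume_new  # 1x1x1 conv head of the new stage
        self.alpha = 0.0                    # increased linearly during training

    def forward(self, feat):
        old = F.interpolate(self.to_volume_old(feat), scale_factor=2,
                            mode="trilinear", align_corners=False)
        new = self.to_volume_new(self.new_block(feat))
        return (1.0 - self.alpha) * old + self.alpha * new
```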
We used the well-trained PGGAN to synthesize 30 CT volumes to aid semi-supervised learning in the subsequent adversarial semantic segmentation training.

Fig. 2. Structure of the progressive growth generative adversarial network (PGGAN).

D. Semi-supervised Adversarial Training
Adversarial training is mainly used for GAN-based data generation [15]. It can potentially improve the performance of the original task by using additional, albeit synthesized, data generated from random noise and training examples [20, 28, 33, 35]. Semantic segmentation can be cast as a generative task: rather than generating new images, S-net in our approach produced segmentation masks that closely resemble manually delineated masks on the input CT images.

We added an adversarial learning scheme to the new approach by exploiting two significant merits of adversarial models. First, without D-net, S-net would be optimized based only on a cross-entropy loss, an element-wise (i.e., pixel- or voxel-wise) loss metric widely used for semantic segmentation [13, 22, 36]; the adversarial scheme introduces a style loss, calculated by D-net, into training to help S-net generate contours that closely resemble manual contours. Second, like other CNNs, S-net was designed for supervised learning and cannot by itself make use of un-annotated CT images; thanks to the adversarial learning scheme, S-net can be extended to learn from un-annotated CT images with the help of the discriminator.

The basic adversarial training scheme had four steps:
1. Feed the input CT image $x$ forward through S-net to obtain the predicted mask $\hat{y} = S(x)$, where $S(\cdot)$ denotes the forward pass of S-net.
2. Feed the voxel-wise product $x \cdot \hat{y}$ (fake input) or $x \cdot y$ (real input) into D-net to obtain a confidence map $D(x \cdot \hat{y})$ or $D(x \cdot y)$, where $y$ is the ground-truth label map of CT image $x$ and $D(\cdot)$ denotes the forward pass of D-net.
3. Compute the losses for S-net and D-net.
4. Backpropagate the losses for S-net and D-net.

The main objective of the adversarial learning is to train S-net to generate "fake" segmentation maps that closely resemble genuine, manually delineated label maps and can fool D-net, and to simultaneously train D-net to distinguish fake inputs from real ones.

The loss for S-net (denoted $\mathcal{L}_S$) has two components: 1) a voxel-wise loss (denoted $\mathcal{L}_{vox}$) between the prediction $\hat{y}$ and the ground truth $y$, and 2) an adversarial loss (denoted $\mathcal{L}_{adv}$) computed interactively with D-net. The loss for D-net (denoted $\mathcal{L}_D$) is computed based on how well the discriminator separates fake inputs from real ones. The formal definitions are given in Section III.E.

Besides the basic adversarial learning, we also utilized un-annotated CT images as input to S-net. Un-annotated CT images came either from real CT images without annotations or from PGGAN-synthesized CT images. We fed an un-annotated CT image $\dot{x}$ to S-net to predict $S(\dot{x})$. Since there is no ground truth for $\dot{x}$, we cannot compute $\mathcal{L}_{vox}$ for un-annotated CT images, but we can use D-net to compute $D(\dot{x} \cdot S(\dot{x}))$ and take advantage of semi-supervised adversarial learning.
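The sketch below illustrates one adversarial training iteration on an annotated batch, following the four steps above. It is a minimal PyTorch sketch under our own assumptions: `snet` ends in a softmax, `dnet` ends in a sigmoid, the voxel-wise term uses a plain (unweighted) cross-entropy in place of the adaptive loss of Section III.E, and $\lambda_{adv} = 0.01$ as in Section IV.B.

```python
import torch
import torch.nn.functional as F

def train_step(snet, dnet, opt_s, opt_d, x, y_onehot, lambda_adv=0.01):
    # Step 1: forward the CT volume x through S-net for the predicted mask.
    y_hat = snet(x)                                # (N, C, Z, H, W) softmax

    # Step 2: voxel-wise products as fake/real inputs for D-net.
    fake_in = x * y_hat                            # x . y_hat
    real_in = x * y_onehot                         # x . y

    # Steps 3-4 for D-net: real -> 1, fake -> 0 (Eq. 12), then update.
    d_real, d_fake = dnet(real_in), dnet(fake_in.detach())
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Steps 3-4 for S-net: voxel-wise term plus adversarial term (Eqs. 3, 8).
    loss_vox = -(y_onehot * torch.log(y_hat.clamp(min=1e-8))).sum()
    d_out = dnet(fake_in)                          # try to fool the updated D-net
    loss_adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    loss_s = loss_vox + lambda_adv * loss_adv
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    return loss_s.item(), loss_d.item()
```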
E. Learning Algorithm

1) Training the Segmentation Network (S-net)
Inspired by Hung et al.'s work [29], S-net was trained with three losses: a voxel-wise loss $\mathcal{L}_{vox}$, an adversarial loss $\mathcal{L}_{adv}$, and a semi-supervised learning loss $\mathcal{L}_{semi}$. Starting with the voxel-wise loss $\mathcal{L}_{vox}$ and adopting the adaptively weighted loss function from ARPM-net [33], the voxel-wise loss (for each sample) of S-net is

$$\ell_{mce^*}(\hat{y}, y) = -\sum^{Z \times H \times W}\sum_{c=1}^{C} w_c\, y_c \ln \hat{y}_c, \quad (1)$$

$$w_c = 2 - \mathrm{DSC}_c + \ln\frac{\text{number of voxels of all classes}}{\text{number of voxels of class } c}. \quad (2)$$

In Eq. 1, $\ell_{mce^*}(\hat{y}, y)$ is the adaptively weighted multi-class cross-entropy (MCE) loss between the prediction $\hat{y}$ and the ground truth $y$; $Z$, $H$, and $W$ are the depth, height, and width of the 3D CT volume, respectively; $C$ is the number of classes (in our case $C = 6$); and $w_c$ is the adaptive weight of class $c$, calculated by Eq. 2, where $\mathrm{DSC}_c$ is the current performance (measured by the Dice similarity coefficient) on class $c$. Combined, the voxel-wise loss $\mathcal{L}_{vox}$ over all $N$ samples is

$$\mathcal{L}_{vox} = \sum_{n=1}^{N} \ell_{mce^*}(S(x_n), y_n). \quad (3)$$

The adversarial learning trained S-net to generate segmentation predictions that fool D-net into classifying a synthetic input as a real one. The adversarial loss $\mathcal{L}_{adv}$ measures the difference between the current S-net and a "perfect generator" that always fools D-net. For one sample, the loss can be written as

$$\ell_{adv} = \ell_{bce}\big(D(x_n \cdot S(x_n)), \mathbf{1}\big), \quad (4)$$

where $\mathbf{1}$ is the target confidence map of D-net with value 1 (real) for all voxels, and $\ell_{bce}(\cdot)$ is the binary cross-entropy (BCE) loss function:

$$\ell_{bce}(\hat{z}, z) = -\sum_{i=1}^{Z \times H \times W}\big[z_i \ln \hat{z}_i + (1 - z_i)\ln(1 - \hat{z}_i)\big], \quad (5)$$

$$\hat{z} = D(x_n \cdot S(x_n)), \quad (6)$$

$$z = \mathbf{1}. \quad (7)$$

Thus the adversarial loss $\mathcal{L}_{adv}$ over all $N$ samples is

$$\mathcal{L}_{adv} = -\sum_{n=1}^{N}\sum^{Z \times H \times W} \ln\big(D(x_n \cdot S(x_n))\big). \quad (8)$$

In semi-supervised learning, un-annotated (or synthetic) CT images can be utilized for S-net training. Since no ground truth is available, we cannot compute $\mathcal{L}_{vox}$ for un-annotated data. However, for an un-annotated image $\dot{x}$, the trained discriminator D-net generates a confidence map $D(\dot{x} \cdot S(\dot{x}))$, which can infer the regions that are sufficiently close to ground-truth label maps [29]. These regions are selected based on a pre-set threshold $T_{semi}$. For image $\dot{x}$, we also have the self-taught "ground truth" $\tilde{y}$, a one-hot encoding with $\tilde{y}^{(z,h,w,c^*)} = 1$ if $c^* = \arg\max_c S(\dot{x})^{(z,h,w,c)}$, where $(z, h, w)$ denotes a voxel location and $c$ a channel. Thus the semi-supervised loss can be written as

$$\mathcal{L}_{semi} = -\sum_{n=1}^{N}\sum^{Z,H,W}\sum_{c=1}^{C} I^{(z,h,w)}\, \tilde{y}_n^{(z,h,w,c)} \ln\big(S(\dot{x}_n)^{(z,h,w,c)}\big), \quad (9)$$

$$I^{(z,h,w)} = \mathbb{1}\big[D(\dot{x}_n \cdot S(\dot{x}_n))^{(z,h,w)} > T_{semi}\big]. \quad (10)$$

The indicator $I^{(z,h,w)}$ in Eq. 10 indicates whether the S-net prediction at voxel $(z, h, w)$ is sufficiently trustworthy, and the threshold $T_{semi}$ controls the sensitivity [29]. We pre-set the threshold to 0.2, as suggested in [29]. The total loss $\mathcal{L}_S$ is a weighted sum of the three losses:

$$\mathcal{L}_S = \mathcal{L}_{vox} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{semi}\mathcal{L}_{semi}. \quad (11)$$
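A minimal PyTorch sketch of the adaptively weighted voxel-wise loss (Eqs. 1-2) and the semi-supervised loss (Eqs. 9-10). The tensor layouts, the clamp for numerical stability, and passing the running per-class DSC as an argument are our assumptions; the paper does not publish reference code.

```python
import torch
import torch.nn.functional as F

def adaptive_mce_loss(pred, target_onehot, running_dsc):
    """Adaptively weighted multi-class cross-entropy (Eqs. 1-2).
    pred:          (N, C, Z, H, W) softmax probabilities from S-net
    target_onehot: (N, C, Z, H, W) one-hot ground-truth label maps
    running_dsc:   (C,) current per-class Dice scores
    """
    total = target_onehot.sum()
    per_class = target_onehot.sum(dim=(0, 2, 3, 4)).clamp(min=1.0)
    w = 2.0 - running_dsc + torch.log(total / per_class)          # Eq. 2
    w = w.view(1, -1, 1, 1, 1)
    return -(w * target_onehot * torch.log(pred.clamp(min=1e-8))).sum()  # Eq. 1

def semi_loss(pred, confidence, t_semi=0.2):
    """Semi-supervised loss on un-annotated images (Eqs. 9-10).
    pred:       (N, C, Z, H, W) softmax probabilities for un-annotated inputs
    confidence: (N, 1, Z, H, W) D-net confidence map D(x * S(x))
    """
    mask = (confidence > t_semi).float()                          # Eq. 10 indicator
    self_taught = F.one_hot(pred.argmax(dim=1), pred.size(1))     # self-taught labels
    self_taught = self_taught.permute(0, 4, 1, 2, 3).float()
    return -(mask * self_taught * torch.log(pred.clamp(min=1e-8))).sum()
```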
2) Training the Discriminator Network (D-net)
We train D-net as a competitor against S-net in adversarial learning [15]. The discriminator learns to separate the label map predictions generated by S-net from the ground truth. For an input CT image $x_n$ with its ground-truth label map $y_n$, we use the binary cross-entropy loss function:

$$\mathcal{L}_D = \sum_{n=1}^{N}\big[\ell_{bce}(D(x_n \cdot y_n), \mathbf{1}) + \ell_{bce}(D(x_n \cdot S(x_n)), \mathbf{0})\big]. \quad (12)$$

D-net learns to label all voxels of the confidence map as 1 (real) when the input is the product computed with the ground truth, and as 0 (fake) when it is computed with the S-net prediction. Hung et al. [29] did not report the issue that D-net can easily distinguish whether a probability map comes from the ground truth by detecting one-hot probabilities [28, 29], but we encountered this problem. We resolved the issue by using the element-wise product of the input CT image and the one-hot encoded label prediction as the input for D-net.
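A sketch of the D-net input construction described above: the element-wise product of the CT volume with the one-hot encoded label prediction, which keeps D-net from trivially spotting ground-truth one-hot maps. Shapes are our assumptions; note that the hard argmax is non-differentiable, so a soft product may be needed on the path used to update S-net.

```python
import torch
import torch.nn.functional as F

def dnet_input(ct, probs):
    """ct: (N, 1, Z, H, W) CT volume; probs: (N, C, Z, H, W) S-net softmax output."""
    onehot = F.one_hot(probs.argmax(dim=1), probs.size(1))  # hard one-hot prediction
    onehot = onehot.permute(0, 4, 1, 2, 3).float()          # back to (N, C, Z, H, W)
    return ct * onehot                                      # x . y_hat for D-net
```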
IV. EXPERIMENTS
A. Data Used and Augmentation
The data used in our study consisted of planning CT and structure data from 120 intact prostate cancer patients selected retrospectively. All CT images were acquired by a 16-slice CT scanner with an 85 cm bore size (Philips Brilliance Big Bore, Cleveland, OH, US). The contours of the five organs (prostate, bladder, rectum, left femur, and right femur) were drawn by the same imaging technician [33]. The prostate contours were drawn by two radiation oncologists with over ten years' experience treating prostate cancer, and the contours were generated using the Eclipse treatment planning system (Varian Medical Systems, CA) [33].

From each of the 120 patients, we collected 100-200 slices with a slice thickness of 1.5 mm. Every CT slice was translated into a 2D array of 512 × 512 pixels according to the original pixel spacing. We stacked the 2D CT slices into 3D CT volumes to cover the slice thickness. To accommodate the workstation used for the experiments, we first center-cropped each slice to 384 × 384, then resized it to 128 × 128, and picked the middle 64 slices from each case. Thus, for every patient, we built a 3D volume with dimensions of 64 × 128 × 128. We randomly split the 120 cases into a train/validation set of 100 cases and a test set of 20 cases. We used the train/validation set of 100 patients to perform 10-fold cross-validation, where 90 cases were used for training and 10 for validation each time. The test set of 20 cases was used for model evaluation and performance comparison.

The training set of 90 cases was small and could cause serious over-fitting, so we used a random-sampling technique for data augmentation. For each training iteration, we randomly picked 16 contiguous slices from a 64 × 128 × 128 CT volume and cropped the ground-truth label maps accordingly. Thus, each 64 × 128 × 128 volume can generate 48 different 16 × 128 × 128 volumes depending on the cropping position. These smaller volumes may contain different OARs, and the positions of the OARs are shifted between volumes cropped from nearby positions. This technique effectively augmented the training data by adding diversity and random translations to otherwise fixed organ positions. During the inference and testing phase, every validation/testing case was cropped into four 16-slice volumes and fed into the model one by one; the network outputs were then stacked together to restore the original dimensions (64 × 128 × 128) for evaluation.

We set up three configurations of the training set for model comparison: a) 90 annotated cases (config.A); b) 60 annotated cases randomly chosen from the 90 plus 30 cases without annotation (config.B); and c) 90 annotated cases plus 30 PGGAN-synthesized un-annotated cases (config.C). The config.A dataset can be used for non-adversarial and adversarial supervised training schemes, and the other two datasets support semi-supervised adversarial training.
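A sketch of the random-sampling augmentation described above: for each training iteration, pick 16 contiguous slices from the 64-slice volume and crop the label map at the same offset. The array layout is our assumption.

```python
import numpy as np

def random_16_slice_crop(volume, labels, depth=16):
    """volume, labels: (64, 128, 128) arrays for one patient."""
    z0 = np.random.randint(0, volume.shape[0] - depth + 1)  # random crop offset
    return volume[z0:z0 + depth], labels[z0:z0 + depth]
```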
B. Implementation and Training

1) Model Parameters
We implemented the new method in PyTorch and trained the overall model on two RTX 2080 Ti GPUs using data parallelism. For S-net, we used the Adam optimizer [37] with an initial learning rate of 10⁻⁴ and a polynomial learning rate scheduler. For D-net, we used the Adam optimizer with an initial learning rate of 10⁻⁴ and the same learning rate scheduler as for S-net. For the other parameters in Eq. 11, we set $\lambda_{adv} = 0.01$ or $0.001$ for annotated and un-annotated data, respectively, and $\lambda_{semi} = 0.1$, as suggested in [29].

In the semi-supervised adversarial learning described in Sections III.D and III.E, we used both annotated and un-annotated data (config.B or config.C) for training. Following [29], we pre-trained S-net and D-net with only labeled data for 5000 iterations before semi-supervised training started; the pre-training iterations help stabilize the randomly initialized model. After pre-training S-net and D-net, we randomly interleaved annotated and un-annotated data for semi-supervised training [29]. We trained the semi-supervised adversarial model on the config.B and config.C datasets for 40k iterations with a batch size of 2.

We also implemented and trained several models to serve as baselines for model comparison and evaluation (Table I): 1) res_Unet is a vanilla residual U-net model as proposed in [30]; 2) res_Unet_aux is res_Unet upgraded with auxiliary classifiers; 3) res_Unet_aux_adv is the adversarial learning version of res_Unet_aux; and 4) res_Unet_aux_adv_semi is the complete version of our proposed method. The first three models can be trained with config.A; only the complete version can be trained with config.B and config.C.

C. Evaluation Metrics
Four evaluation metrics were used in our experiments: dice similarity coefficient [38], average Hausdorff distance [39], average surface Hausdorff distance [40], and relative volume difference [40]. Dice similarity coefficient (DSC) is a widely used segmentation metric in medical imaging:
$$DSC = \frac{2\,|True \cap Pred|}{|True| + |Pred|}, \quad (13)$$

where $|True|$ and $|Pred|$ are the numbers of voxels in the ground truth and the prediction, respectively, and $|True \cap Pred|$ is the number of voxels in their overlap [33]. The average Hausdorff distance (AHD) [2] takes the maximum of the average point-wise distance from points in $X$ to their nearest points in $Y$ and the average point-wise distance from points in $Y$ to their nearest points in $X$:

$$AHD(X, Y) = \max\left(\frac{1}{|X|}\sum_{x \in X}\min_{y \in Y} d(x, y),\; \frac{1}{|Y|}\sum_{y \in Y}\min_{x \in X} d(y, x)\right), \quad (14)$$

where $X$ and $Y$ are the voxel sets of the ground-truth and prediction volumes, respectively, and $d(x, y)$ is the Euclidean distance from point $x$ to point $y$. The average surface Hausdorff distance (ASHD) [2] is the symmetric average point-wise distance between points on one surface and the nearest points on the other surface:

$$ASHD(X, Y) = \frac{1}{|X| + |Y|}\left(\sum_{x \in X}\min_{y \in Y} d(x, y) + \sum_{y \in Y}\min_{x \in X} d(y, x)\right), \quad (15)$$

where $X$ and $Y$ are the voxel sets of the ground-truth and prediction surfaces, respectively, and $d(x, y)$ is the Euclidean distance from point $x$ to point $y$.
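One standard way to compute DSC (Eq. 13) and AHD (Eq. 14) on binary voxel masks, sketched with SciPy's Euclidean distance transform. The voxel spacing argument is an assumption (1.5 mm slices; the in-plane spacing after resizing is unspecified), and this is not the paper's evaluation code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dsc(truth, pred):
    """truth, pred: boolean 3D masks of the same shape."""
    inter = np.logical_and(truth, pred).sum()
    return 2.0 * inter / (truth.sum() + pred.sum())               # Eq. 13

def ahd(truth, pred, spacing=(1.5, 1.0, 1.0)):
    """Average Hausdorff distance in mm between two non-empty boolean masks."""
    # Distance from every voxel to the nearest foreground voxel of each mask.
    d_to_pred = distance_transform_edt(~pred, sampling=spacing)
    d_to_truth = distance_transform_edt(~truth, sampling=spacing)
    return max(d_to_pred[truth].mean(), d_to_truth[pred].mean())  # Eq. 14
```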
V. RESULTS
A. Model Evaluation
We evaluated the performance of our new method and compared it with the baseline methods by 10-fold cross-validation (Fig. 3). The proposed method can generate accurate contours for all five targeted OARs (Table I), and the resulting contours closely resemble the ground truth (Fig. 3). The new approach achieves nearly 100% accuracy on both femurs (Table I) thanks to their high contrast with the background on CT images. The prostate, bladder, and rectum are more challenging to segment due to their low contrast and fuzzy boundaries; despite the low contrast, our approach generates high-quality segmentation contours that closely align with the ground truth. Compared to manual segmentation, which typically takes 20-30 minutes per patient for the prostate alone, our approach can segment all five OARs in 10 seconds per patient. The results suggest that the new approach can improve the quality of routine clinical practice by rapidly producing accurate contours.
B. Effectiveness of Semi-supervised Adversarial Learning
A notable feature of the new method is its ability to generate synthesized data to address the severe bottleneck of insufficient training data, together with its semi-supervised learning capability to accommodate annotated and un-annotated data and improve overall performance. Importantly, since the new method can utilize un-annotated images, we can use fewer manually annotated examples, thus significantly reducing overall labor and cost. We compared the performance of the proposed approach with the baseline methods to show that it achieves comparable results with less annotated data and better results with the same amount of annotated data (Table I).

Experiments 1 and 2 were designed to measure the effectiveness of the auxiliary classifiers. Even though the prostate and rectum are challenging to segment due to their low contrast and variations in size and shape, using auxiliary classifiers improved performance on the prostate and rectum (Table I). With auxiliary classifiers performing segmentation on feature maps at multiple scales, the model generated better contours for organs of various shapes and sizes. Experiments 2 and 3 were used to analyze the effectiveness of adversarial training. The results showed that performance on the prostate and rectum improved with the adversarial training scheme, demonstrating the effectiveness of adding the style-wise loss to the loss function.
Fig. 3. Comparison of segmentation results. The five columns show CT slices, ground-truth (manual annotations by experts) label masks (bladder in light blue, prostate in beige, rectum in purple, and femurs in red and green), prediction label masks of the baseline and of our method, and contour overlays (ground truth in green, prediction in red) of the proposed method. The four slices are selected from the same test case and ordered from superior to inferior.
We designed Experiments 3 and 4 to test whether our method can achieve comparable or even better performance with less annotated data. The results showed that discarding one third of the annotated examples did not degrade performance much under the new semi-supervised learning algorithm. With Experiments 3 and 4, we demonstrated that the new method requires less effort for data annotation. We also expected additional synthesized CT images to improve performance given the same amount of annotated data. Thus, we introduced Experiments 3 and 5 to test whether our method achieves better performance with PGGAN-synthesized data augmentation. Indeed, the combination of the semi-supervised loss and synthesized data produced by adversarial learning was effective, providing comparable or better results (Table I). We also tested whether our semi-supervised learning scheme was robust to diverse types of input data. In Experiment 4, the un-annotated cases used in training were real CT images without annotations, whereas in Experiment 5, the 30 un-annotated training cases were synthesized data. The results from Experiments 4 and 5 showed that PGGAN-aided CT synthesis was effective. More importantly, both real un-annotated images (Exp. 4) and PGGAN-synthesized un-annotated images (Exp. 5) supported the semi-supervised learning in our new method well.
C. Model Comparison
We compared the performance of our model with the models from several state-of-the-art methods for pelvic CT segmentation, including one multi-atlas-based model [4] and four deep learning-based models. Deep Dilated CNN [10] and 2D U-net [2] are end-to-end models, and 2D U-net Localization plus 3D U-net Segmentation [1] is a two-step ROI segmentation approach. Since we do not have access to their datasets or source code, we tried our best to implement and train their models on our dataset but did not achieve the same performance. As a result, we did not compare the methods directly on the same dataset; instead, we accept their reported performances as their best results (Table II).

The proposed method outperformed the multi-atlas-based model [4] and the Deep Dilated CNN method with higher DSCs on all OARs. The 2D U-net and the 2D U-net Localization plus 3D U-net Segmentation are ROI segmentation methods, in which multiple networks are trained for different OARs to cater to their different shapes, sizes, and contrast levels. The latter method required one 2D localization network and four 3D segmentation networks with different architectural designs to segment four different OARs, which increases the difficulty of implementation. Our method achieved DSCs on all OARs similar to those of the two ROI segmentation methods. However, unlike the ROI segmentation models that train one segmentation network per organ, our model performs end-to-end segmentation and segments all organs with one trained network. Compared to ARPM-net, which also segments all OARs within one forward propagation [33], our new method achieves better DSCs, AHDs, and ASHDs on the prostate and rectum. Our new method was trained and tested on the same dataset as ARPM-net, and our semi-supervised adversarial learning scheme, supported by PGGAN-aided CT synthesis, yielded significantly smaller AHDs and ASHDs by utilizing a style-wise loss for learning.

Table I: Experiment results from 10-fold cross-validation. We evaluated the performance with four metrics: dice similarity coefficient (DSC), average Hausdorff distance (AHD), average surface Hausdorff distance (ASHD), and relative volume difference (VD) on five OARs: prostate, bladder, rectum, left femur (Femur_L), and right femur (Femur_R). The best score of each metric on each organ is colored in red.
Experiment No. | Model | Dataset
1 | res_Unet [30] | 90 labeled
2 | res_Unet_aux | 90 labeled
3 | res_Unet_aux_adv | 90 labeled
4 | res_Unet_aux_adv_semi (our method) | 60 labeled & 30 unlabeled
5 | res_Unet_aux_adv_semi (our method) | 90 labeled & 30 unlabeled
[Per-organ DSC, AHD, ASHD, and VD values omitted.]

Table II: Model performance on testing data. The best score of each metric on each organ is colored in red. Note that relative volume difference is not listed because the metric was not used in most of the compared methods. Entries per organ are DSC / AHD (mm) / ASHD (mm).

Multi-atlas Based Segmentation [4] (2014): Prostate 0.85(±0.004) / – / –; Bladder 0.92(±0.002) / – / –; Rectum 0.80(±0.007) / – / –; Femur_L N/A; Femur_R N/A
Deep Dilated CNN [10] (2017): Prostate 0.88 / – / –; Bladder 0.93 / – / –; Rectum 0.62 / – / –; Femur_L 0.92 / – / –; Femur_R 0.92 / – / –
2D U-net [2] (2018): Prostate 0.88(±0.12) / 0.4(±0.7) / 1.2(±0.9); Bladder 0.95(±0.04) / 0.4(±0.6) / 1.1(±0.8); Rectum 0.92(±0.06) / 0.2(±0.3) / 0.8(±0.6); Femur_L N/A; Femur_R N/A
2D U-net Localization, 3D U-net Segmentation [1] (2018): Prostate 0.90(±0.02) / 5.3(±2.8) / 0.7(±0.5); Bladder 0.95(±0.02) / 17.0(±14.6) / 0.5(±0.7); Rectum 0.84(±0.04) / 4.9(±3.9) / 0.8(±0.7); Femur_L 0.96(±0.03) / – / –; Femur_R 0.95(±0.01) / – / –
ARPM-net [33] (2020): Prostate 0.88(±0.11) / 1.58(±1.77) / 2.11(±2.03); Bladder 0.97(±0.07) / 1.91(±1.29) / 2.36(±2.43); Rectum 0.86(±0.12) / 3.14(±2.39) / 3.05(±2.11); Femur_L 0.97(±0.01) / 1.76(±1.57) / 1.99(±1.66); Femur_R 0.97(±0.01) / 1.92(±1.01) / 2.00(±2.07)
Proposed (our method): Prostate 0.90(±0.12) / 0.27(±0.13) / 0.77(±0.20); Bladder 0.95(±0.05) / 0.11(±0.13) / 0.47(±0.23); Rectum 0.87(±0.10) / 0.30(±0.19) / 0.63(±0.21); Femur_L 0.97(±0.01) / 0.04(±0.01) / 0.24(±0.11); Femur_R 0.97(±0.01) / 0.04(±0.02) / 0.18(±0.06)
VI. CONCLUSION AND DISCUSSION
The research was motivated in part by the lack of sufficient data in medical imaging. We proposed and developed a semi-supervised adversarial learning framework, supported by GAN-aided data augmentation, for 3D male pelvic CT segmentation; the framework is also expected to benefit a wide range of medical machine-learning applications that currently suffer from the lack-of-data bottleneck. We cast semantic segmentation as a problem of adversarial learning for data generation and semi-supervised learning to make use of both annotated and un-annotated data. The results, quantified by four popular segmentation quality metrics, showed that our new method achieved state-of-the-art performance on a set of clinical 3D pelvic CT images under 10-fold cross-validation. A comparative analysis showed that the new semi-supervised adversarial learning method outperformed the current best methods with the same amount of annotated data and achieved comparable performance with less annotated data. It can produce high-quality contours for five organs: the prostate, bladder, rectum, left femur, and right femur.

The new method can be improved. First, like most adversarial learning models, its hyperparameters are delicate to tune; training may encounter mode collapse, and it is challenging to balance the learning progress of the segmentation network and the discriminator. Second, the potential of the progressive growth GAN [26] has not been fully exploited. For instance, it is possible to generate new data with corresponding ground-truth annotations [16], and we plan to investigate how synthesized annotated CT images can aid the learning process.

ACKNOWLEDGMENT
The work was supported in part by Varian Medical Systems through a research grant. We obtained Washington University in St. Louis institutional review board (IRB) approval for this study and for the patient data used in the study.

REFERENCES

[1] A. Balagopal, S. Kazemifar, D. Nguyen, M.-H. Lin, R. Hannan, A. Owrangi, and S. Jiang, "Fully automated organ segmentation in male pelvic CT images," Physics in Medicine & Biology, vol. 63, no. 24, pp. 245015, 2018.
[2] S. Kazemifar, A. Balagopal, D. Nguyen, S. McGuire, R. Hannan, S. Jiang, and A. Owrangi, "Segmentation of the prostate and organs at risk in male pelvic CT images using deep learning," Biomedical Physics & Engineering Express, vol. 4, no. 5, pp. 055003, 2018.
[3] S. Wang, D. Nie, L. Qu, Y. Shao, J. Lian, Q. Wang, and D. Shen, "CT male pelvic organ segmentation via hybrid loss network with incomplete annotation," IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 2151-2162, 2020.
[4] O. Acosta, J. Dowling, G. Drean, A. Simon, R. De Crevoisier, and P. Haigron, "Multi-atlas-based segmentation of pelvic structures from CT scans for planning in prostate cancer radiotherapy," Abdomen and Thoracic Imaging, pp. 623-656: Springer, 2014.
[5] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning: MIT Press, Cambridge, 2016.
[6] F. Martínez, E. Romero, G. Dréan, A. Simon, P. Haigron, R. De Crevoisier, and O. Acosta, "Segmentation of pelvic structures for planning CT using a geometrical shape model tuned by a multi-scale edge detector," Physics in Medicine & Biology, vol. 59, no. 6, pp. 1471, 2014.
[7] Q. Dou, L. Yu, H. Chen, Y. Jin, X. Yang, J. Qin, and P.-A. Heng, "3D deeply supervised network for automated segmentation of volumetric medical images," Medical Image Analysis, vol. 41, pp. 40-54, 2017.
[8] Y. Gao, Y. Shao, J. Lian, A. Z. Wang, R. C. Chen, and D. Shen, "Accurate segmentation of CT male pelvic organs via regression-based deformable models and multi-task random forests," IEEE Transactions on Medical Imaging, vol. 35, no. 6, pp. 1532-1543, 2016.
[9] B. Wang, Y. Lei, T. Wang, X. Dong, S. Tian, X. Jiang, A. B. Jani, T. Liu, W. J. Curran, and P. Patel, "Automated prostate segmentation of volumetric CT images using 3D deeply supervised dilated FCN," p. 109492S.
[10] K. Men, J. Dai, and Y. Li, "Automatic segmentation of the clinical target volume and organs at risk in the planning CT for rectal cancer using deep dilated convolutional neural networks," Medical Physics, vol. 44, no. 12, pp. 6377-6389, 2017.
[11] Y. Fu, T. R. Mazur, X. Wu, S. Liu, X. Chang, Y. Lu, H. H. Li, H. Kim, M. C. Roach, and L. Henke, "A novel MRI segmentation method using CNN-based correction network for MRI-guided adaptive radiotherapy," Medical Physics, vol. 45, no. 11, pp. 5129-5137, 2018.
[12] K. H. Cha, L. Hadjiiski, R. K. Samala, H. P. Chan, E. M. Caoili, and R. H. Cohan, "Urinary bladder segmentation in CT urography using deep-learning convolutional neural network and level sets," Medical Physics, vol. 43, no. 4, pp. 1882-1896, 2016.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," pp. 1097-1105.
[14] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," pp. 234-241.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," pp. 2672-2680.
[16] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V. Hernández, J. Wardlaw, and D. Rueckert, "GAN augmentation: Augmenting training data using generative adversarial networks," arXiv preprint arXiv:1810.10863, 2018.
[17] L. Ma, R. Guo, G. Zhang, F. Tade, D. M. Schuster, P. Nieh, V. Master, and B. Fei, "Automatic segmentation of the prostate on CT images using deep learning and multi-atlas fusion," p. 101332O.
[18] C. Lu, Y. Zheng, N. Birkbeck, J. Zhang, T. Kohlberger, C. Tietjen, T. Boettger, J. S. Duncan, and S. K. Zhou, "Precise segmentation of multiple organs in CT volumes using learning-based approach and information theory," pp. 462-469.
[19] M. W. Macomber, M. Phillips, I. Tarapov, R. Jena, A. Nori, D. Carter, L. Le Folgoc, A. Criminisi, and M. J. Nyflot, "Autosegmentation of prostate anatomy for radiation treatment planning using deep decision forests of radiomic features," Physics in Medicine & Biology, vol. 63, no. 23, pp. 235002, 2018.
[20] X. Dong, Y. Lei, S. Tian, T. Wang, P. Patel, W. J. Curran, A. B. Jani, T. Liu, and X. Yang, "Synthetic MRI-aided multi-organ segmentation on male pelvic CT using cycle consistent deep attention network," Radiotherapy and Oncology, vol. 141, pp. 192-199, 2019.
[21] X. Ma, L. M. Hadjiiski, J. Wei, H. P. Chan, K. H. Cha, R. H. Cohan, E. M. Caoili, R. Samala, C. Zhou, and Y. Lu, "U-Net based deep learning bladder segmentation in CT urography," Medical Physics, vol. 46, no. 4, pp. 1752-1765, 2019.
[22] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," pp. 3431-3440.
[23] N. Ibtehaz and M. S. Rahman, "MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation," Neural Networks, vol. 121, pp. 74-87, 2020.
[24] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," arXiv preprint arXiv:1701.00160, 2016.
[25] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[26] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
[27] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang, "Triple generative adversarial nets," pp. 4088-4098.
[28] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic segmentation using adversarial networks," arXiv preprint arXiv:1611.08408, 2016.
[29] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang, "Adversarial learning for semi-supervised semantic segmentation," arXiv preprint arXiv:1802.07934, 2018.
[30] Z. Zhang, Q. Liu, and Y. Wang, "Road extraction by deep residual U-net," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749-753, 2018.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," pp. 770-778.
[32] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, 2017.
[33] Z. Zhang, T. Zhao, H. Gay, W. Zhang, and B. Sun, "ARPM-net: A novel CNN-based adversarial method with Markov random field enhancement for prostate and organs at risk segmentation in pelvic CT images," arXiv preprint arXiv:2008.04488, 2020.
[34] L. Yu, X. Yang, H. Chen, J. Qin, and P.-A. Heng, "Volumetric convnets with mixed residual connections for automated prostate segmentation from 3D MR images," pp. 36-72.
[35] Z. Zhang, L. Yang, and Y. Zheng, "Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network," pp. 9242-9251.
[36] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," pp. 1520-1528.
[37] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[38] V. Thada and V. Jaglan, "Comparison of Jaccard, Dice, cosine similarity coefficient to find best fitness value for web retrieved documents using genetic algorithm," International Journal of Innovations in Engineering and Technology, vol. 2, no. 4, pp. 202-205, 2013.
[39] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 850-863, 1993.
[40] A. A. Taha and A. Hanbury, "Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool," BMC Medical Imaging, vol. 15, no. 1, pp. 29, 2015.