High Tissue Contrast MRI Synthesis Using Multi-Stage Attention-GAN for Glioma Segmentation
Mohammad Hamghalam, Baiying Lei,* Tianfu Wang
National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, China, 518060
Faculty of Electrical, Biomedical and Mechatronics Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran
[email protected], [email protected], [email protected]
Abstract
Magnetic resonance imaging (MRI) provides images of internal organs with varying tissue contrast based on a strong magnetic field. Despite the non-invasive advantage of MRI for frequent imaging, the low contrast of MR images in the target area makes tissue segmentation a challenging problem. This paper demonstrates the potential benefits of image-to-image translation techniques for generating synthetic high tissue contrast (HTC) images. Notably, we adopt a new cycle generative adversarial network (CycleGAN) with an attention mechanism to increase the contrast within the underlying tissues. The attention block, as well as training on HTC images, guides our model to converge on certain tissues. To increase the resolution of the HTC images, we employ a multi-stage architecture that focuses on one particular tissue as foreground and filters out the irrelevant background in each stage. This multi-stage structure also alleviates the common artifacts of synthetic images by decreasing the gap between the source and target domains. We show the application of our method for synthesizing HTC images on brain MR scans containing glioma tumors. We also employ the HTC MR images in both end-to-end and two-stage segmentation structures to confirm their effectiveness. Experiments with three competitive segmentation baselines on the BraTS 2018 dataset indicate that incorporating the synthetic HTC images in the multi-modal segmentation framework improves the average Dice scores by 0.8%, 0.6%, and 0.5% on the whole tumor, tumor core, and enhancing tumor, respectively, while eliminating one real MRI sequence from the segmentation procedure.
* Corresponding author. This work was supported partly by the National Natural Science Foundation of China (Nos. 61871274, 61801305, and 81571758), the Natural Science Foundation of Guangdong Province (No. 2017A030313377), the Guangdong Pearl River Talents Plan (2016ZT06S220), the Shenzhen Peacock Plan (Nos. KQTD2016053112051497 and KQTD2015033016104926), and the Shenzhen Key Basic Research Project (Nos. JCYJ20170413152804728, JCYJ20180507184647636, JCYJ20170818142347251, and JCYJ20170818094109846).

Introduction

Among brain tumors, glioma is the most prevalent tumor that begins in the tissue of the brain and can affect brain function (Elazab et al. 2018).
Figure 1: The MR images show low tissue contrast in the source domain, which makes tissue segmentation most challenging. (a) A glioma lesion in the FLAIR MR image (left) with its intensity distributions (i.e., non-enhancing, edema, and enhancing) as well as the normal tissue distribution. The corresponding HTC target image (middle) with attention to WT is obtained from the manual labels (right). (b) Unpaired training data, consisting of a source set (first row) s ∼ p(s) and a target set (second row) t ∼ p(t), with no information provided as to which s matches which t. WT: whole tumor; TC: tumor core; ET: enhancing tumor.

In brain magnetic resonance (MR) images, the intensity distributions of pixels largely overlap in regions of interest (ROIs), leading to low tissue contrast and creating the main challenge for tissue segmentation. In glioma, ROIs exhibit similar intensity levels in MR images, making tissue segmentation quite challenging. Fig. 1(a) shows a brain lesion in a FLAIR MR slice with three overlapping tissues: whole tumor (WT), tumor core (TC), and enhancing tumor (ET).
Figure 2: Design of the proposed multi-stage structure for segmentation of glioma within three stages (Stage-I: attention to R_WT; Stage-II: attention to R_TC; Stage-III: attention to R_ET). In each stage, we first generate synthetic HTC images with minimal overlap between the class-conditional densities of the foreground (R_f) and background (R_b) through the MRI-to-HTC block. The synthetic HTC images are then used for 2D binary segmentation in both the end-to-end and the two-stage training tactic. The synthetic HTC images deal with R_WT, R_TC, and R_ET in each stage, sequentially.

We define target images as a high-contrast domain in which tissues have a limited overlapping area, whereas the source domain has overlapping tissue distributions. Our goal is to increase the intensity contrast between the underlying tissue region and the others through an image-to-image translation technique based on unpaired training data (Fig. 1(b)), thereby improving segmentation performance.

Image translation aims to learn the mapping from an input image following the source-domain distribution to an output image with a defined distribution, using a training set of paired (Isola et al. 2017) or unpaired (Zhu et al. 2017) images. Despite the limitations of synthetic images in clinical applications, such data have shown promising results with generative adversarial networks (GANs) (Goodfellow et al. 2014), including data augmentation (Bowles et al. 2018), image reconstruction (Sharma and Hamarneh 2019), and segmentation (Huo, Xu, and Moon 2019; Zhang, Yang, and Zheng 2018; Chartsias et al. 2017; Nie et al. 2018; Wolterink et al. 2017; Zhao et al. 2017). Paired methods require training source images that are aligned with the target ones to learn image generation through the forward adversarial loss, while unpaired approaches frequently employ unaligned images through the CycleGAN structure.

CycleGAN models have two main components: a source-to-target and a target-to-source block. Each part consists of a generator G and a discriminator D. G aims to generate a realistic image from a noise vector and an input image, while D tries to distinguish real images from those produced by G. The key challenge in medical image synthesis, for either inter-modality (T1-to-T2, FLAIR-to-T1, and others) or cross-modality (MRI-to-CT, PET-to-CT, PET-to-MRI) translation, is to predict the structure and fine-grained content of the target modality from the source one (Huo, Xu, and Moon 2019; Zhang, Yang, and Zheng 2018; Chartsias et al. 2017; Nie et al. 2018; Wolterink et al. 2017; Zhao et al. 2017). CycleGAN provides effective supervision through cycle consistency between the source inputs and the reconstructed images, as well as between the target images and their corresponding reconstructions.

However, state-of-the-art medical image synthesis methods are restricted by the model's inability to attend to a specific tissue. In this paper, we propose a multi-stage model that segments only one tissue per stage through a segmentation block following an attention-guided synthesis block (Fig. 2). Specifically, the synthesis block generates high tissue contrast (HTC) images with attention to the tissue relevant to the segmentation task. The image synthesis block contains two mappings, MRI-to-HTC and HTC-to-MRI: the former accepts 2D MR slices and generates HTC images, which are then fed into the latter to reconstruct the input MR images. In the segmentation block, the HTC images are passed through convolution layers to produce a binary segmentation map and a bounding box for the next stage.
To provide attention to the specific tissue during the synthesis process, two strategies are used: (1) attaching an attention block to the CycleGAN, and (2) using high-contrast images during the training phase. The attention block guides G towards the expected region for translation via an attention map. This trainable map is further employed at the input of D to filter out irrelevant areas. Regarding training, we use the ground-truth (GT) labels to form images with minimal overlap between the foreground and background tissue intensity distributions in each stage (depicted in Fig. 1(a)).

Furthermore, to produce a more detailed synthesis and consequently a more accurate segmentation map, we explore a multi-stage architecture that deals with only one region in each stage. This structure alleviates the artifacts of the synthesized images by decreasing the gap between the source and target domains. The attention module effectively learns attention maps that guide the generator to select the more important regions for generating an HTC image. The generated HTC image closely follows the distribution of the target domain and boosts segmentation performance significantly. Besides, our model is based on the CycleGAN framework to leverage vast quantities of unpaired data for training within the same modality. The experiments are conducted on the multi-modal BraTS 2018 dataset (Menze et al. 2015) to segment the internal parts of glioma. Specifically, we employ the real modalities, i.e., FLAIR, T2, and T1c, to generate a synthetic one with attention to the WT, TC, and ET in each stage, respectively. The contributions of this paper are summarized as:
• We design a novel framework to increase the contrast among sub-regions of glioma in MR images. Training on high-contrast images as well as an unsupervised attention block inside the adversarial network guide our model to pay attention to the particular regions.
• We propose a multi-stage structure that decreases the gap between the source and target domains to enhance the resolution of synthetic HTC images.
• We employ HTC MR images in both the end-to-end and two-stage segmentation structures on the BraTS dataset to confirm the effectiveness of these images.
Related Works
Segmentation
Numerous machine learning (Hatami et al. 2019) and deep learning methods have been introduced to address segmentation problems, especially for glioma sub-regions (Soleymanifard and Hamghalam 2019). Fully convolutional networks (FCNs) (Long, Shelhamer, and Darrell 2015; Ronneberger, Fischer, and Brox 2015; Drozdzal et al. 2016; Chen et al. 2018; Jégou et al. 2017), an extension of convolutional neural networks (CNNs) (He et al. 2016; Huang et al. 2017) with down-sampling and up-sampling layers, are considered a benchmark for segmentation. Replacing fully connected layers with convolution layers allows FCNs to capture global features and provides localization in an end-to-end framework (Long, Shelhamer, and Darrell 2015). In U-Net (Ronneberger, Fischer, and Brox 2015), the authors used a U-shaped FCN architecture with skip connections to combine features extracted on the encoder side with those of the decoder. In other work, Drozdzal et al. (2016) added residual blocks (He et al. 2016) to the U-Net framework to improve segmentation accuracy by reducing the effect of vanishing gradients (Res-U-Net). Chen et al. (2018) extended the fully convolutional version of residual networks (ResNets) (He et al. 2016) by incorporating dilation into the main structure. Jégou et al. (2017) extended DenseNet (Huang et al. 2017) to the fully convolutional DenseNet (FC-DenseNet), which requires no post-processing for segmentation. This architecture provides implicit deep supervision and captures contextual information.
Segmentation in Adversarial Framework
Adversarial methods have been successfully exploited in medical image analysis to address the shortage of large and diverse annotated databases (Bowles et al. 2018) and missing or corrupted MR pulse sequences (Sharma and Hamarneh 2019), as well as to boost segmentation performance in typical applications. These latter approaches can be categorized into two-stage training techniques (Chartsias et al. 2017; Wolterink et al. 2017; Zhao et al. 2017; Nie et al. 2018; Hamghalam, Lei, and Wang 2019) and end-to-end methods (Huo, Xu, and Moon 2019; Zhang, Yang, and Zheng 2018). The former treats synthesis and segmentation as two individual training stages, while the latter incorporates the segmentation loss into the adversarial loss during training.

Chartsias et al. (2017) produced synthetic cardiac data from unpaired images coming from different individuals (CT-to-MRI cardiac image translation) based on CycleGAN. They found that training on both real and synthetic images leads to a statistically significant improvement compared to training on real data alone. Wolterink et al. (2017) proposed MRI-to-CT synthesis on pairwise-aligned training images of the same patient for treatment planning of brain tumors. They analyzed paired and unpaired image mappings from 2D brain MR slices to 2D CT slices and found that the synthetic CT images obtained from the model trained with unpaired data appeared more realistic and contained fewer artifacts than those obtained from the model trained with paired data. Zhao et al. (2017) introduced a multi-atlas based hybrid approach to synthesize T1w MR images from CT and CT images from T1w MR images using a random-forest synthesis framework. This method used a set of random-forest regressors within each label for synthesizing intensities on pairs of MR and CT images of the whole head. Nie et al. (2018) first applied an FCN model to generate MR from CT images, as well as 7T MR from 3T MR images, based on CycleGAN; in the next step, they employed the synthetic images for the task of semantic segmentation.

In the end-to-end framework, Huo et al. (2019) integrated CycleGAN and segmentation into an end-to-end structure to train a segmentation network for both MRI-to-CT and CT-to-MRI without manual labels in the target modality. With this architecture, called SynSeg-Net, the authors demonstrated that end-to-end training achieved better segmentation performance than two-stage training. Zhang et al. (2018) presented a 3D cross-modality synthesis approach (CT-to-MRI) to segment cardiovascular volumes by adding a shape-consistency loss to the CycleGAN framework. They also validated that coupling the generator and segmentor modules resulted in better segmentation accuracy than training them separately.
Method
The proposed framework is composed of K stages, where K denotes the number of labels in the input images. Each stage consists of two main modules: (1) an image synthesis block with attention, and (2) a segmentation block. The former is learned in an adversarial framework to generate synthetic HTC images with attention to an individual foreground region R_f^(k), while the latter performs supervised binary segmentation between the foreground and the background region R_b^(k). The bounding box computed from the segmentation map at stage k is passed on to the next stage, k + 1. Fig. 2 shows an overview of the proposed structure for segmentation of a brain lesion with three regions (K = 3): R_WT, R_TC, and R_ET. This section first describes how the image synthesis block transforms the tissue intensity distribution of the foreground from the source to the target domain, and then details how the synthetic images are incorporated into the segmentation framework, which is expected to produce more accurate results than using real MR images. A sketch of the stage-wise control flow is given below.
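For illustration only, the following Python sketch shows how the three stages could be chained at inference time; the names synthesize_htc, segment_binary, bounding_box, and crop are hypothetical stand-ins for the trained stage-k networks and cropping helpers, not part of the released implementation.

import numpy as np

def bounding_box(mask):
    # axis-aligned bounding box of the foreground pixels
    ys, xs = np.nonzero(mask)
    return ys.min(), ys.max() + 1, xs.min(), xs.max() + 1

def crop(image, box):
    y0, y1, x0, x1 = box
    return image[y0:y1, x0:x1]

def multi_stage_segmentation(slices, synthesize_htc, segment_binary, K=3):
    # slices[k]: the real MR slice used at stage k (FLAIR, T2, T1c)
    # synthesize_htc[k], segment_binary[k]: trained stage-k networks
    box, masks = None, []
    for k in range(K):
        mr = slices[k] if box is None else crop(slices[k], box)
        htc = synthesize_htc[k](mr)    # MRI-to-HTC, attending to R_f of stage k
        mask = segment_binary[k](htc)  # binary map: foreground vs. background
        masks.append(mask)
        box = bounding_box(mask)       # restrict stage k+1 to the detected region
    return masks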
Image Synthesis via Attention-GAN (MRI-to-HTC)

Let the MR source image at stage k, s^(k) ∈ S^(k), be the union of foreground pixels, s_f^(k), and background pixels, s_b^(k), in the source domain:

s = [s_f ∼ p(s|R_f)] ∪ [s_b ∼ p(s|R_b)],   (1)

where we omit the superscript k for simplicity. Similarly, in the target domain, we have the HTC image t^(k) ∈ T^(k):

t = [t_f ∼ p(t|R_f)] ∪ [t_b ∼ p(t|R_b)],   (2)

where p(s|R) and p(t|R) are the class-conditional distributions of tissue in the source and target domains, respectively. We also assume that the distributions of the foreground and background have little overlap in the target space.

Our goal in each stage is to estimate a mapping function F_S→T (MRI-to-HTC) from the source domain S (MR image) to the target domain T (HTC image) based on independently sampled data instances, such that the distribution of the mapped samples, s′, matches the probability distribution p(t) of the target. For cycle consistency, an inverse mapping F_T→S (HTC-to-MRI) also generates reconstructed images, s″, that closely match the input image, s ≈ s″.

Attention Block
In our mapping, we need to generate HTC images that provide maximum segmentation accuracy in R_f. To this end, we first need to locate R_f in each image and then apply the translation to that region. Specifically, we achieve this by adding two attention networks, A_S and A_T, which select the areas to translate by maximizing the probability that the discriminator makes a mistake in the source and target domain, respectively. The attention block is an FCN consisting of convolution, deconvolution, and ResNet (He et al. 2016) units, followed by a soft-max layer. For each input image, it produces a per-pixel attention map of the same size as the input, indicating the importance of the spatial information. After feeding the input image to the generator, we apply the attention mask to the generated image using an element-wise product (⊙), and then add the background using the inverse of the mask applied to the input image.

As shown in Fig. 3, s is split into two paths: the first is fed to the source attention block, A_S, to create the attention map s_a = A_S(s), while the second is the input of the generator G_S→T, which highlights the foreground region. To eliminate the background region, s_a is element-wise multiplied by G_S→T(s) to form the masked image s_f = s_a ⊙ G_S→T(s). Finally, the synthetic HTC image is

s′ = s_a ⊙ G_S→T(s) + (1 − s_a) ⊙ s,   (3)

where s′ is passed to the segmentation block to segment R_f and fed to the inverse mapping for reconstruction. Likewise, we have

s″ = t_a ⊙ G_T→S(s′) + (1 − t_a) ⊙ s′,   (4)

where t_a = A_T(s′) is the attention map in the target domain.
Figure 3: Image synthesis with attention to WT at stage-I. MRI and HTC images are considered as source and target, respectively.
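As a minimal sketch of Eq. (3), assuming attention_s (A_S) and generator_st (G_S→T) are trained networks that return arrays matching the input's spatial size, the attention-gated composition can be written as:

import numpy as np

def attention_gated_translation(s, attention_s, generator_st):
    # Eq. (3): s' = s_a ⊙ G_S→T(s) + (1 − s_a) ⊙ s
    s_a = attention_s(s)          # per-pixel attention map in [0, 1]
    translated = generator_st(s)  # full-image translation
    # foreground comes from the translation, background from the input
    return s_a * translated + (1.0 - s_a) * s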
Training Procedure
Training the MRI-to-HTC network requires a discriminator D_T to discern the translated outputs from the real HTC images t. Likewise, the discriminator at the source domain, D_S, encourages the HTC-to-MRI network to translate t into an image indistinguishable from the source domain. We train both discriminators such that they only rate attended regions. In particular, instead of employing an entire image as the input, we first filter both the generated and the real image via an element-wise multiplication with the attention map in the source and target domains; the filtered images are then fed into the discriminator for evaluation. Following (Mejjati et al. 2018), to avoid mode collapse, we train the network on whole images (s and t) for 25 epochs and then switch to masked ones (s ⊙ s_a and t ⊙ t_a), once the attention blocks A_S and A_T have been trained moderately.

According to Equations 3 and 4, as long as A_S and A_T attend to the background regions, the generated images will preserve their input-domain classes, and the discriminators can easily detect them as fake. To succeed in the two-player minimax game, A_S and A_T have to concentrate on the objects or regions that the corresponding discriminator considers the most descriptive within its domain (i.e., the foreground). Finally, the network finds an equilibrium between the generator, attention map, and discriminator to produce realistic images.
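A minimal sketch of this training schedule, with hypothetical array arguments (the 25-epoch switch point follows the text):

def discriminator_pair(t, s_prime, t_a, s_a, epoch, switch_epoch=25):
    # For the first `switch_epoch` epochs, D_T sees whole images; afterwards,
    # both the real image t and the translated image s' are filtered by the
    # attention maps (element-wise product), so D_T only rates attended regions.
    if epoch < switch_epoch:
        return t, s_prime
    return t * t_a, s_prime * s_a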
Preparing Target HTC Images

The assumption of non-overlapping tissue distributions in the target domain can be realized through the GT labels. We change the class-conditional distributions p(t|R_f) and p(t|R_b) according to the manual labels, as depicted in Fig. 4. We minimize the intra-class variance while maximizing the inter-class distance between p(t|R_f) and p(t|R_b) in the target space. The mean and variance in the target domain are treated as hyperparameters, and choosing appropriate values yields maximum segmentation accuracy. However, there is a trade-off between class overlap at large variance values and visual artifacts at small ones: a low variance generates much sharper results but introduces visual artifacts, which degrade segmentation performance. Fig. 4 (left column) demonstrates the considerable class overlap between the foreground and background tissue distributions in the source domain on the BraTS dataset. We use the FLAIR, T2, and T1c sequences to segment WT, TC, and ET in each stage, respectively. Fig. 4 (right column) shows the distributions of the corresponding tissues in the defined target domain.

Figure 4: Tissue distributions of glioma in the source (left) and target (right) domains. (a) FLAIR MR images are used in the first stage to produce synthetic HTC images with attention to R_WT. (b) In the second stage, T2 images are cropped according to the bounding box from the first stage and employed to increase the tissue contrast between R_TC and edema, R_ED. (c) The third stage is dedicated to separating enhancing tumor, R_ET, from non-enhancing tumor, R_NET, in T1c MR images.
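As an illustrative sketch of this target preparation, assuming intensities scaled to [0, 1]; the means and standard deviation below are hypothetical hyperparameter values, not those used in the paper:

import numpy as np

def make_htc_target(gt_mask, mu_f=0.8, mu_b=0.2, sigma=0.05, rng=None):
    # Draw foreground and background intensities from two well-separated
    # Gaussians; a smaller sigma sharpens contrast but risks artifacts.
    rng = rng or np.random.default_rng()
    target = np.where(
        gt_mask > 0,
        rng.normal(mu_f, sigma, gt_mask.shape),  # foreground: p(t | R_f)
        rng.normal(mu_b, sigma, gt_mask.shape),  # background: p(t | R_b)
    )
    return target.astype(np.float32)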
Segmentation Block
The segmentation block provides feedback to the image synthesis block during training in the end-to-end strategy. We apply a 2D binary segmentation structure with a weighted cross-entropy loss, L_Seg, to handle the class imbalance, which is especially severe in the first stage. Specifically, FC-DenseNet comprises Dense blocks (batch normalization (BN), ReLU, 3 × 3 convolution, and Dropout), Transition down blocks (BN, ReLU, 1 × 1 convolution, Dropout, and 2 × 2 max pooling), and Transition up blocks (3 × 3 transposed convolution with a stride of 2). We use non-overlapping max pooling and Dropout with p = 0.2. Each Dense block contains four convolution layers, each computing 12 feature maps; these features are sequentially concatenated to build 48 feature maps at the output of the Dense block. In the training phase, the bounding boxes are generated automatically from the GT, whereas in the testing phase they are obtained from the binary segmentation results of the preceding stage.
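A sketch of one such Dense block in Keras, under the assumptions above (growth rate 12, four layers); this illustrates the FC-DenseNet design rather than the exact training code:

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12, drop_rate=0.2):
    # Each layer is BN -> ReLU -> 3x3 conv (12 feature maps) -> Dropout,
    # with dense connectivity; four layers yield 4 * 12 = 48 output maps.
    new_features = []
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding='same')(y)
        y = layers.Dropout(drop_rate)(y)
        new_features.append(y)
        x = layers.Concatenate()([x, y])  # feed all previous features forward
    # the block outputs the concatenation of the newly produced maps
    return layers.Concatenate()(new_features)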
Loss Functions

In addition to the segmentation loss, L_Seg, four loss functions are used to generate HTC images in each stage. The adversarial losses, which exploit the GAN networks at the target and source domains, are

L_adv^s(F_S→T, A_S, D_T) = E_{t∼p(t)}[log(D_T(t))] + E_{s∼p(s)}[log(1 − D_T(s′))],   (5)

L_adv^t(F_T→S, A_T, D_S) = E_{s∼p(s)}[log(D_S(s))] + E_{t∼p(t)}[log(1 − D_S(t′))].   (6)

Meanwhile, and similarly to CycleGAN, we add cycle-consistency losses to the adversarial ones by enforcing a one-to-one mapping between the true image, s, and the cycle-reconstructed one, s″, as a forward cycle-consistency loss:

L_cyc^s(s, s″) = ‖s − s″‖,   (7)

where s″ = F_T→S(F_S→T(s)). In the backward path, we also have the backward cycle-consistency loss

L_cyc^t(t, t″) = ‖t − t″‖,   (8)

where t″ = F_S→T(F_T→S(t)). Finally, we combine the defined loss functions with different weights. The final objective for image synthesis, L_synth, is

L_synth = λ1 L_adv^s(F_S→T, A_S, D_T) + λ2 L_cyc^s(F_S→T, F_T→S, S) + λ3 L_adv^t(F_T→S, A_T, D_S) + λ4 L_cyc^t(F_T→S, F_S→T, T),   (9)

where λ1, λ2, λ3, and λ4 are scalar hyper-parameters that regularize the loss terms.

In the two-stage training strategy, we first minimize L_synth to generate HTC images with attention to the specific region, and then optimize L_Seg with the synthesis weights fixed, as two independent training steps. In the end-to-end case, we optimize L_total, which incorporates the segmentation loss into the adversarial one during training:

L_total = λ L_Seg + L_synth,   (10)

where λ balances the effect of L_Seg to equip our HTC synthesis model with segmentation feedback.
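A sketch of the synthesis objective in TensorFlow, assuming an L1 cycle penalty and discriminator outputs in (0, 1); in practice the generators and discriminators optimize the adversarial terms with opposite signs, which is omitted here for brevity:

import tensorflow as tf

def synthesis_loss(s, s_rec, t, t_rec, d_t_real, d_t_fake, d_s_real, d_s_fake,
                   lam=(1.0, 10.0, 1.0, 10.0), eps=1e-8):
    # Eqs. (5)-(9): s_rec = F_T→S(F_S→T(s)), t_rec = F_S→T(F_T→S(t))
    l_adv_s = tf.reduce_mean(tf.math.log(d_t_real + eps) +
                             tf.math.log(1.0 - d_t_fake + eps))  # Eq. (5)
    l_adv_t = tf.reduce_mean(tf.math.log(d_s_real + eps) +
                             tf.math.log(1.0 - d_s_fake + eps))  # Eq. (6)
    l_cyc_s = tf.reduce_mean(tf.abs(s - s_rec))                  # Eq. (7)
    l_cyc_t = tf.reduce_mean(tf.abs(t - t_rec))                  # Eq. (8)
    l1, l2, l3, l4 = lam
    return l1 * l_adv_s + l2 * l_cyc_s + l3 * l_adv_t + l4 * l_cyc_t  # Eq. (9)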
Experiments and Results
We conduct several experiments on the BraTS 2018 dataset to demonstrate the effectiveness of the proposed method for synthesizing HTC images and for the segmentation task. Each sequence is normalized separately by subtracting the mean and dividing by the standard deviation of the brain region, and the non-brain area is set to zero. All networks are trained for 180 epochs with Adam and a learning rate of 0.0001. Our implementation uses TensorFlow on an NVIDIA TITAN X GPU with 12 GB of RAM.
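A minimal sketch of this per-sequence normalization:

import numpy as np

def normalize_sequence(volume):
    # z-score over the brain region (non-zero voxels) only;
    # the non-brain area is kept at zero, as described above
    brain = volume > 0
    mean, std = volume[brain].mean(), volume[brain].std()
    out = np.zeros_like(volume, dtype=np.float32)
    out[brain] = (volume[brain] - mean) / std
    return out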
Table 1: K-S test results between the synthetic and target HTC images applying different loss weights.

Loss weights            Brain lesions
λ1, λ3     λ2, λ4       WT      TC      ET      Normal
1          10
10         1            0.28    0.26    0.22    0.10
1          100          0.30    0.27    0.32    0.06
1          1            0.12    0.18    0.12    0.13
Dataset
The performance of the proposed method is evaluated on the publicly available BraTS 2018 dataset (Menze et al. 2015; Bakas et al. 2017; 2018), gathered from various scanners, with a volume size of 240 × 240 × 155 voxels. Four MR sequences are available for each patient: FLAIR, T1, T1c, and T2. Evaluation is performed for three ROIs: the WT (all internal parts of the tumor), the TC (enhancing and non-enhancing), and the ET. Around 2K axial slices are randomly selected and center-cropped for training, such that each slice has non-zero values on at least half of its pixels. Specifically, in the first stage, 2D FLAIR MR images are used to generate synthetic HTC images in our cyclic framework with attention to WT: FLAIR ↔ FLAIR′. Then, we segment FLAIR′ with the end-to-end (FLAIR ↔ FLAIR′ → R_WT) as well as the two-stage (FLAIR ↔ FLAIR′, FLAIR′ → R_WT) approach. Accordingly, for the segmentation of TC, we extract T2 patches from the corresponding slices according to the first-stage bounding box; thus, we have T2 ↔ T2′ → R_TC and T2 ↔ T2′, T2′ → R_TC for the end-to-end and two-stage approaches, respectively. In the last stage, segmentation of ET, we apply T1c patches to generate the synthetic HTC images and predict the pixel labels of ET (T1c ↔ T1c′ → R_ET and T1c ↔ T1c′, T1c′ → R_ET).
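A minimal sketch of the slice-selection rule (the non-zero threshold follows the text; the function name is illustrative):

import numpy as np

def select_training_slices(volume, min_nonzero_frac=0.5):
    # keep axial slices in which at least half of the pixels are non-zero
    keep = []
    for z in range(volume.shape[2]):  # axial axis of a 240 x 240 x 155 volume
        sl = volume[:, :, z]
        if np.count_nonzero(sl) >= min_nonzero_frac * sl.size:
            keep.append(z)
    return keep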
Evaluation of the Synthetic HTC MR Images

To evaluate the synthetic HTC MR images, we calculate the Kolmogorov-Smirnov (K-S) statistic on the target domain to estimate the goodness-of-fit between the intensity distributions of the synthetic HTC and real HTC images for each class label. Table 1 lists the results of the K-S test on the WT, TC, ET, and normal regions of the brain tumor for various loss-weight values. Note that the segmentation block is bypassed (λ = 0) to assess the quality of the synthetic HTC images alone. We further appraise the quality of the synthetic HTC images at each stage using the peak signal-to-noise ratio (PSNR) and the structural similarity index metric (SSIM) in Table 2. In these experiments, the loss weights are set to λ1, λ3 = 1 and λ2, λ4 = 10. Moreover, Fig. 5 shows examples of synthetic HTC images with attention to ET at stage III. The first column presents the real T1c MR patches in the source domain, the second column displays the attention maps, the third shows the corresponding synthetic HTC patches, and the last depicts the real HTC images in the target domain.
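A sketch of the per-label K-S computation using SciPy's two-sample test:

import numpy as np
from scipy.stats import ks_2samp

def ks_per_label(synthetic, target, labels):
    # two-sample K-S statistic between synthetic-HTC and real-HTC intensity
    # distributions, computed separately for each class label
    stats = {}
    for lab in np.unique(labels):
        d, _ = ks_2samp(synthetic[labels == lab].ravel(),
                        target[labels == lab].ravel())
        stats[int(lab)] = d  # smaller D means a better distribution fit
    return stats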
Table 2: Quality evaluation of the synthetic HTC images in our multi-stage framework.

MRI-to-HTC (attention)       SSIM    PSNR
FLAIR ↔ FLAIR′ (WT)
T2 ↔ T2′ (TC)
T1c ↔ T1c′ (ET)
Figure 5: Examples of MRI-to-HTC translation with attention to ET on the BraTS dataset. From left to right: input image, attention map, synthetic HTC image, and target image.
Table 3: Ablation study of the multi-stage MRI-to-HTC model.

Model                                  SSIM     PSNR
CycleGAN                               0.6072   17.64
CycleGAN + A_S
CycleGAN + A_T
CycleGAN + A_S + A_T
CycleGAN + A_S + A_T + multi-stage

We measure PSNR and SSIM as similarity metrics between the synthetic HTC and target images. In Table 3, we first employ the plain CycleGAN (Zhu et al. 2017) to generate HTC images with attention to four regions. Then, we evaluate the model with only one attention block, in either the source (CycleGAN + A_S) or the target domain (CycleGAN + A_T). Finally, we repeat the experiment considering only one region (ET) to assess our multi-stage MRI-to-HTC structure with attention blocks.
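A sketch of these similarity measurements using scikit-image, assuming both images are floats on a common data range:

from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_quality(synthetic, target, data_range=1.0):
    # PSNR and SSIM between a synthetic HTC image and its target
    psnr = peak_signal_noise_ratio(target, synthetic, data_range=data_range)
    ssim = structural_similarity(target, synthetic, data_range=data_range)
    return psnr, ssim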
Table 4: Segmentation accuracy for WT, TC, and ET of brain lesions in MR images via cross-validation.

Dice                CycleGAN + Segmentation           Proposed
Mean (± Std.(%))    End-to-End      Two-Stage         End-to-End    Two-Stage
WT                  0.9231 (0.24)   0.8804 (0.37)
TC
ET
Table 5: Dice and HD95 for three segmentation baselines on the BraTS 2018 validation set, using the real four-sequence input versus the input in which T1 is replaced by the synthetic HTC sequence (FLAIR ↔ FLAIR*).

Method         Modality concatenation       Dice                       HD95 (mm)
                                            EN      WT      TC         EN      WT      TC
U-Net          FLAIR, T1, T1c, T2           0.7874  0.8913  0.8402     4.15    5.61
U-Net          FLAIR, FLAIR*, T1c, T2
Res-U-Net      FLAIR, T1, T1c, T2
Res-U-Net      FLAIR, FLAIR*, T1c, T2
FC-DenseNet    FLAIR, T1, T1c, T2           0.7890  0.8965  0.8439
FC-DenseNet    FLAIR, FLAIR*, T1c, T2

We compare our method with the recently proposed approaches of (Huo, Xu, and Moon 2019) and (Chartsias et al. 2017), which employ synthetic images for segmentation in an end-to-end and a two-stage manner, respectively. The former combines the segmentation loss with the adversarial one during training, while the latter trains the image synthesis and segmentation blocks individually. In Table 4, we measure the segmentation accuracy for WT, TC, and ET via 4-fold cross-validation and observe that the proposed end-to-end method with the attention block achieves the highest accuracy. Table 4 also demonstrates the advantage of end-to-end training over the two-stage one in terms of accuracy. Note that roughly 27 ms is needed to generate a synthetic 2D HTC image in the two-stage framework at inference time.
Synthetic HTC Volumes in 3D Multi-Modal Segmentation Framework
We evaluate the effect of synthetic HTC images in the 3D multi-modal segmentation framework based on the two-stage training approach. To this end, we substitute the corresponding FLAIR ↔ FLAIR* sequence for the T1 MR volume, thereby increasing contrast among the non-enhancing, edema, enhancing, and normal regions. We experiment with three state-of-the-art segmentation models: U-Net (Ronneberger, Fischer, and Brox 2015), Res-U-Net (Drozdzal et al. 2016), and FC-DenseNet (Jégou et al. 2017). For a fair comparison, we perform experiments using four sequences in both cases, i.e., FLAIR, T1, T1c, and T2 for the real segmentation and FLAIR, FLAIR*, T1c, and T2 for the synthetic one. Since the T1 modality carries less information about glioma than the other sequences, we eliminate T1 in our experiments. Table 5 presents Dice and the modified Hausdorff distance (HD95) on the BraTS'18 validation set (Leaderboard), as reported by the CBICA image processing online portal. Segmentation with HTC sequences improves Dice scores in three clinically important sub-regions: WT (0.8%), TC (0.6%), and ET (0.5%). We also achieve an average improvement of 0.4 mm in WT in terms of HD95. However, approximately 4.2 s is needed to generate each FLAIR* volume from the real FLAIR.

Discussion and Conclusion
We have shown that a deep neural network can be trained on an unpaired dataset to synthesize an HTC image from an MR image. Our proposed supervised model modifies the class-conditional distributions of ROIs for the segmentation task in each stage based on a GAN model equipped with attention mechanisms that alter only the relevant regions in the input image. We validate our approach on the sub-regions of glioma in multi-modal MR scans of the BraTS 2018 dataset. The results of the K-S test confirm that the proposed MRI-to-HTC model can modify the distributions of WT, TC, and ET in the FLAIR, T2, and T1c MR images, respectively. The experiments over three segmentation baselines indicate that incorporating the synthetic HTC images with the other modalities, i.e., FLAIR, T1c, and T2, improves Dice score and HD95 on the BraTS 2018 Leaderboard while eliminating the T1 MR sequence from the segmentation procedure. Although the proposed MRI-to-HTC achieves promising results, it still has a limitation in defining the mean and standard deviation of the class-conditional distributions in the HTC target images: small standard deviation values generate much sharper results but introduce visual artifacts in the synthetic images, which reduce segmentation accuracy. As a direction for future work, one could develop a framework to handle corrupted or missing MR volumes that arise during scanning in the acquisition setting. Towards this end, the synthetic HTC volume could replace the corrupted one to complement the information presented by the missing sequence for automated systems.
References
Bakas, S.; Akbari, H.; Sotiras, A.; et al. 2017. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4:170117.
Bakas, S.; et al. 2018. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629.
Bowles, C.; Gunn, R.; Hammers, A.; et al. 2018. GANsfer learning: Combining labelled and unlabelled data for GAN based data augmentation. arXiv preprint arXiv:1811.10669.
Chartsias, A.; Joyce, T.; Dharmakumar, R.; et al. 2017. Adversarial image synthesis for unpaired multi-modal cardiac data. In International Workshop on Simulation and Synthesis in Medical Imaging, 3–13.
Chen, L.; Papandreou, G.; Kokkinos, I.; et al. 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848.
Drozdzal, M.; Vorontsov, E.; Chartrand, G.; et al. 2016. The importance of skip connections in biomedical image segmentation. In Deep Learning and Data Labeling for Medical Applications. Springer. 179–187.
Elazab, A.; Abdulazeem, Y. M.; Anter, A. M.; Hu, Q.; Wang, T.; and Lei, B. 2018. Macroscopic cerebral tumor growth modeling from medical images: A review. IEEE Access.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; et al. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
Hamghalam, M.; Lei, B.; and Wang, T. 2019. Brain tumor synthetic segmentation in 3D multimodal MRI scans. arXiv preprint arXiv:1909.13640.
Hatami, T.; Hamghalam, M.; Reyhani-Galangashi, O.; and Mirzakuchaki, S. 2019. A machine learning approach to brain tumors segmentation using adaptive random forest algorithm. In , 076–082.
He, K.; Zhang, X.; Ren, S.; et al. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Huang, G.; Liu, Z.; van der Maaten, L.; et al. 2017. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2261–2269.
Huo, Y.; Xu, Z.; and Moon, H. 2019. SynSeg-Net: Synthetic segmentation without target modality ground truth. IEEE Transactions on Medical Imaging.
Isola, P.; Zhu, J.; Zhou, T.; et al. 2017. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 5967–5976.
Jégou, S.; Drozdzal, M.; Vazquez, D.; et al. 2017. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 1175–1183.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.
Mejjati, Y. A.; Richardt, C.; Tompkin, J.; et al. 2018. Unsupervised attention-guided image-to-image translation. In Advances in Neural Information Processing Systems, 3693–3703.
Menze, B. H.; Jakab, A.; Bauer, S.; et al. 2015. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging 34(10):1993–2024.
Nie, D.; Trullo, R.; Lian, J.; et al. 2018. Medical image synthesis with deep convolutional adversarial networks. IEEE Transactions on Biomedical Engineering.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 234–241.
Sharma, A., and Hamarneh, G. 2019. Missing MRI pulse sequence synthesis using multi-modal generative adversarial network. arXiv preprint arXiv:1904.12200.
Soleymanifard, M., and Hamghalam, M. 2019. Segmentation of whole tumor using localized active contour and trained neural network in boundaries. In , 739–744.
Wolterink, J. M.; Dinkla, A. M.; Savenije, M. H.; et al. 2017. Deep MR to CT synthesis using unpaired data. In Simulation and Synthesis in Medical Imaging, 14–23.
Zhang, Z.; Yang, L.; and Zheng, Y. 2018. Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9242–9251.
Zhao, C.; Carass, A.; Lee, J.; et al. 2017. A supervoxel based random forest synthesis framework for bidirectional MR/CT synthesis. In International Workshop on Simulation and Synthesis in Medical Imaging, 33–40.
Zhu, J.; Park, T.; Isola, P.; et al. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2223–2232.