3D Medical Multi-modal Segmentation Network Guided by Multi-source Correlation Constraint
Accepted at the 25th International Conference on Pattern Recognition (ICPR), Jan 2021, Milan, Italy
Tongxue Zhou∗†‡, Stéphane Canu∗‡, Pierre Vera§ and Su Ruan†‡
∗ INSA Rouen, LITIS - Apprentissage, Rouen 76800, France
† Université de Rouen Normandie, LITIS - QuantIF, Rouen 76183, France
‡ Normandie Univ, INSA Rouen, UNIROUEN, UNIHAVRE, LITIS, France
§ Department of Nuclear Medicine, Henri Becquerel Cancer Center, Rouen, 76038, France
Abstract—In the field of multimodal segmentation, the correlation between different modalities can be exploited to improve the segmentation results. In this paper, we propose a multi-modality segmentation network with a correlation constraint. Our network includes N model-independent encoding paths for N image sources, a correlation constraint block, a feature fusion block, and a decoding path. The model-independent encoding paths capture modality-specific features from the N modalities. Since there exists a strong correlation between different modalities, we first propose a linear correlation block to learn the correlation between modalities; a loss function is then used to guide the network to learn the correlated features based on the linear correlation block. This block forces the network to learn the latent correlated features which are more relevant for segmentation. Considering that not all the features extracted from the encoders are useful for segmentation, we propose to use a dual attention based fusion block to recalibrate the features along the modality and spatial paths, which can suppress less informative features and emphasize the useful ones. The fused feature representation is finally projected by the decoder to obtain the segmentation result. Our experimental results on the BraTS 2018 dataset for brain tumor segmentation demonstrate the effectiveness of the proposed method.
I. INTRODUCTION
Multimodal segmentation using a single model remains challenging due to the different image characteristics of different modalities. A key challenge is to exploit the latent correlation between modalities and to fuse the complementary information to improve the segmentation performance. In this paper, we propose a method to exploit the multi-source correlation and apply it to the brain tumor segmentation task.

A brain tumor is a growth of cells in the brain that multiplies in an abnormal, uncontrollable way, and it is one of the most lethal cancers in the world. Today, an estimated 700,000 people in the United States are living with a primary brain tumor, and over 87,000 more will be diagnosed in 2020 (NBTS: National Brain Tumor Society). Gliomas are the most common brain tumors that arise from glial cells. According to their degree of malignancy [1], gliomas can be categorized into two grades: low-grade gliomas (LGG) and high-grade gliomas (HGG). The former tend to be benign and grow more slowly, with lower degrees of cell infiltration and proliferation; the latter are malignant, more aggressive, and need immediate treatment. Moreover, the five-year relative survival rate of gliomas is only 6.8%. Therefore, early diagnosis of brain tumors is highly desired in clinical practice for better treatment planning.

Magnetic Resonance Imaging (MRI) is commonly used in radiology to diagnose brain tumors. It is a non-invasive imaging modality with good soft tissue contrast, which provides invaluable information about the shape, size, and localization of brain tumors without exposing the patient to high ionization radiation [2]–[4]. The commonly used sequences are T1-weighted (T1), contrast-enhanced T1-weighted (T1c), T2-weighted (T2) and Fluid Attenuation Inversion Recovery (FLAIR) images. In this work, we refer to these images of different sequences as modalities. Different modalities can provide complementary information to analyze different sub-regions of gliomas. For example, T2 and FLAIR highlight the tumor with peritumoral edema, designated whole tumor. T1 and T1c highlight the tumor without peritumoral edema, designated tumor core. An enhancing region of the tumor core with hyper-intensity can also be observed in T1c, designated enhancing tumor core. Therefore, applying multi-modal images can reduce the information uncertainty and improve clinical diagnosis and segmentation accuracy.

Inspired by the fact that there is a strong correlation between MR modalities, since the same scene (the same patient) is observed by different modalities [5], we propose a 3D multimodal brain tumor segmentation network guided by a multi-source correlation constraint. The main contributions of our method are fourfold: 1) A correlation block is introduced to discover the latent multi-source correlation between modalities, making the features more relevant for segmentation. 2) A dual attention based fusion strategy is proposed to recalibrate the feature representation along the modality-wise and spatial-wise dimensions. 3) A correlation based loss function is proposed to aid the segmentation network in extracting correlated feature representations for a better segmentation. 4) To the best of our knowledge, this is the first 3D multimodal brain tumor segmentation network guided by a multi-source correlation constraint.
Related Work

A wide range of approaches for brain tumor segmentation, such as probability theory [5], kernel feature selection [6], belief functions [7] based on [8], random forests [9], conditional random fields [10] and support vector machines [11], have been developed with success. However, brain tumor segmentation is still a challenging task for three reasons: (1) the brain anatomy varies from patient to patient; (2) gliomas vary in size, shape, and texture; (3) qualitative MR imaging modalities exhibit variability in intensity range and low contrast (see Fig. 1).
Fig. 1. Example of data from a training subject. The first four images from left to right show the MRI modalities: Fluid Attenuation Inversion Recovery (FLAIR), contrast-enhanced T1-weighted (T1c), T1-weighted (T1), and T2-weighted (T2) images; the fifth image shows the ground truth labels created by experts. Color is used to distinguish the different tumor regions: red: necrotic and non-enhancing tumor; yellow: edema; green: enhancing tumor; black: healthy tissue and background.
Recently, with their strong feature learning ability, deep learning-based approaches have become more prominent for brain tumor segmentation. Cui et al. [12] proposed a cascaded deep learning convolutional neural network consisting of two sub-networks: the first network defines the tumor region in an MRI slice, and the second network labels the defined tumor region into multiple sub-regions. Zhao et al. [13] integrated fully convolutional neural networks (FCNNs) [14] and conditional random fields to segment brain tumors. Havaei et al. [15] implemented a two-pathway architecture that learns both the local details of the brain tumor and the larger context features. Wang et al. [16] proposed to decompose the multi-class segmentation problem into a sequence of three binary segmentation problems according to the sub-region hierarchy. Kamnitsas et al. [17] proposed an efficient fully connected multi-scale CNN architecture named DeepMedic, which assembles a high-resolution and a low-resolution pathway to obtain the segmentation results; furthermore, they used a 3D fully connected conditional random field to effectively remove false positives. Kamnitsas et al. [18] introduced EMMA, an ensemble of multiple models and architectures including DeepMedic, FCNs and U-Net. Myronenko et al. [19] proposed a segmentation network for brain tumors from multimodal 3D MRIs, where a variational auto-encoder branch is added to the U-Net to further regularize the decoder in the presence of limited training data.

For the multimodal segmentation task, exploiting the complementary information from different modalities plays an important role in the final segmentation accuracy. As presented in [20], multi-modal segmentation network architectures can be categorized into single-encoder-based and multi-encoder-based methods. The single-encoder-based methods [18], [21] directly integrate the different multi-modality images channel-wise in the input space, so the correlations between different modalities are not well exploited. The multi-encoder-based methods [22], in contrast, separately extract individual feature information by applying multiple modality-specific encoders and fuse them with a specific fusion strategy to emphasize the information useful for the segmentation task. According to [23], the multi-encoder-based method performs better than the single-encoder-based method, as it can learn more complementary and cross-modal interdependent features. However, not all features extracted from the encoders are useful for segmentation. It is therefore necessary to find an effective way to fuse features, and we focus on the extraction of the most informative features for segmentation. To this end, we propose to use the attention mechanism, which can be viewed as a tool capable of taking into account the most informative feature representations. Channel attention modules and spatial attention modules are the most commonly used attention mechanisms. The former learn a channel-wise feature representation that quantifies the relative importance of each channel's features [24]–[26]. The latter learn the feature representation at each position by a weighted sum of the features at all other positions [27]–[29]. However, the methods mentioned above evaluate the attention mechanism only on single-modal image datasets and do not consider the fusion issue for multi-modal medical images. In this paper, we propose to apply the attention mechanism to the multi-modality brain tumor dataset.
To learn the contributions of the feature representations from the different modalities, we propose a dual attention based fusion block to selectively emphasize feature representations, which consists of a modality attention module and a spatial attention module. The proposed fusion block uses the modality-specific features to derive a modality-wise and a spatial-wise weight map that quantify the relative importance of each modality's features and of the different spatial locations in each modality. These fusion maps are then multiplied with the modality-specific feature representations to obtain a fused representation of the complementary multi-modality information. In this way, we can discover the most relevant characteristics to aid the segmentation.

For multi-modal MR brain tumor segmentation, since the four MR modalities come from the same patient, there exists a strong correlation between modalities [5]. In this paper, our goal is to exploit and utilize the correlation between modalities to improve the segmentation performance. Therefore, we first exploit the correlation between each pair of modalities and then use a loss function to guide the segmentation network to learn the correlated features that enhance the segmentation result. To the best of our knowledge, this is the first work capable of utilizing the latent multi-source correlation to help the segmentation.

II. METHOD
Our network is based on our previous work [23], which used a multi-encoder based network to deal with the multi-modal fusion issue. In this paper, we aim to exploit the multi-source correlation between modalities and utilize it to constrain the network to learn more effective features, so as to improve the segmentation performance. To learn complementary features and cross-modal inter-dependencies from multi-modality MRIs, we apply the multi-encoder based framework. Each encoder takes one 3D MRI modality as input and produces a modality-specific feature representation. At the lowest level of the network, the linear correlation block is first used to exploit the latent multi-source correlation, and a dedicated loss function then guides the network to learn effective feature information. All the modality-specific feature representations are then concatenated and passed to the fusion block at each layer. With the assistance of the dual attention fusion block, the feature representations are recalibrated along the modality-wise and spatial-wise dimensions, and the most informative features are retained as the shared latent representation, which is finally projected by the decoder to the label space to obtain the segmentation result. The pipeline of our method is described in Fig. 2.

Fig. 2. The pipeline of the proposed method, consisting of feature extraction, correlation constraint and fusion blocks; the four color circles represent the four modality feature representations.
A. Architecture Design
Different regions of an image are likely to require different receptive fields when segmented, and a standard U-Net cannot capture enough semantic features due to its limited receptive field. Inspired by dilated convolution, we use residual blocks with dilated convolutions (rate = 2, 4), denoted res_dil blocks, in both the encoder and the decoder to obtain features at multiple scales. The encoder includes a convolutional block and a res_dil block followed by a skip connection. All convolutions are 3 × 3 × 3. Each decoder level begins with an up-sampling layer followed by a convolution that reduces the number of features by a factor of 2. The upsampled features are then combined with the features from the corresponding level of the encoder using concatenation. After the concatenation, we use a res_dil block to increase the receptive field. In addition, we employ deep supervision [21] for the segmentation decoder by integrating segmentation layers from different levels to form the final network output. The proposed network architecture is described in Fig. 3.

Fig. 3. Overview of our proposed segmentation network framework.
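A minimal sketch of such a res_dil block in Keras (the framework used for our implementation) is given below. The layer ordering, the batch normalization and the 1 × 1 × 1 projection shortcut are illustrative assumptions; the text above only specifies the dilation rates (2, 4) and the residual connection.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res_dil_block(x, filters):
    """Residual block with dilated convolutions (rates 2 and 4)."""
    shortcut = x
    # First dilated convolution enlarges the receptive field (rate = 2).
    y = layers.Conv3D(filters, 3, padding="same", dilation_rate=2)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    # Second dilated convolution enlarges it further (rate = 4).
    y = layers.Conv3D(filters, 3, padding="same", dilation_rate=4)(y)
    y = layers.BatchNormalization()(y)
    # Hypothetical 1x1x1 projection so the residual addition is shape-safe.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv3D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))
```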
B. Correlation Constraint Block

Since the same scene (the same patient) is observed by different modalities, there is a strong correlation between the MR modalities [5]. From Fig. 4, which presents the joint intensities of the MR images, we can observe a strong correlation in intensity distribution between each pair of modalities. It is therefore reasonable to assume that a strong correlation also exists between the latent representations of the modalities.

Fig. 4. Joint intensity distributions of MR images: (a) T1-FLAIR, (b) T1-T1c, (c) T1-T2, (d) FLAIR-T1c, (e) FLAIR-T2, (f) T1c-T2. The intensity of the first modality is read on the abscissa axis and that of the second modality on the ordinate axis.

We therefore introduce a Correlation Constraint (CC) block, which consists of a Linear Correlation (LC) block (see Fig. 5) to discover the latent correlation and a correlation loss to constrain the correlation between modalities. For simplicity, we present the CC block using two modalities. The input modalities $\{X_1, ..., X_n\}$, where $n = 4$, are first fed to the independent encoders (with learning parameters $\theta_i$) to learn the modality-specific representations $Z_i(X_i|\theta_i)$. Then, a network of two fully connected layers with LeakyReLU maps each modality-specific representation $Z_i(X_i|\theta_i)$ to a set of independent parameters $\Gamma_i = \{\alpha_i, \beta_i\}$, $i = 1, ..., n$. Finally, the linear correlation representation $F_j(X_j|\theta_j)$ of modality $j$ can be obtained via Equation (1):

$$F_j(X_j|\theta_j) = \alpha_i \odot Z_i(X_i|\theta_i) + \beta_i, \quad i \neq j \qquad (1)$$

Fig. 5. Architecture of the Correlation Constraint (CC) block, which consists of the Linear Correlation (LC) block and a correlation constraint loss.

Since we have four modalities and each pair of modalities has a strong linear correlation, we only need to learn three pairs of correlation expressions. The Kullback–Leibler divergence (Equation (2)) is then used as the correlation loss to constrain the distributions of the estimated correlation representation and the original feature representation, which enables the segmentation network to learn correlated feature representations that improve the segmentation performance:

$$L_{correlation} = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \qquad (2)$$

where $P$ and $Q$ are the probability distributions of $Z_i$ and $F_j$, respectively, defined on the same probability space $X$.
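As an illustration, a sketch of the LC block and the correlation loss in TensorFlow/Keras could look as follows, under two simplifying assumptions of ours: $(\alpha_i, \beta_i)$ are predicted as per-sample scalars (the $\odot$ in Eq. (1) allows element-wise parameters), and the feature maps are softmax-normalized so that the KL term of Eq. (2) compares valid distributions; the hidden size is also illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def linear_correlation(z_i, hidden=64):
    """LC block: two FC layers with LeakyReLU predict (alpha_i, beta_i),
    then Eq. (1) estimates modality j's representation from modality i's."""
    g = layers.GlobalAveragePooling3D()(z_i)
    g = layers.LeakyReLU()(layers.Dense(hidden)(g))
    params = layers.Dense(2)(g)                      # one (alpha, beta) pair
    alpha = layers.Reshape((1, 1, 1, 1))(params[:, 0:1])
    beta = layers.Reshape((1, 1, 1, 1))(params[:, 1:2])
    return alpha * z_i + beta                        # F_j, Eq. (1)

def correlation_loss(z, f, eps=1e-8):
    """Eq. (2): KL divergence between the (softmax-normalized) distributions
    of the original representation and its linear estimate, per sample."""
    p = tf.nn.softmax(tf.reshape(z, (tf.shape(z)[0], -1)), axis=-1)
    q = tf.nn.softmax(tf.reshape(f, (tf.shape(f)[0], -1)), axis=-1)
    kl = tf.reduce_sum(p * tf.math.log((p + eps) / (q + eps)), axis=-1)
    return tf.reduce_mean(kl)
```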
C. Dual Attention Based Fusion Strategy

The purpose of fusion is to highlight the most important features from the different source images, that is, the regions most relevant to the target region. Different MR modalities identify different attributes of the target tumor and thus provide complementary information; in addition, within the same MR modality, different contents appear at different locations. Inspired by the attention mechanism [27], we propose a dual attention based fusion block to enable a better integration of the complementary information between modalities. It consists of a modality attention module and a spatial attention module; the architecture is described in Fig. 6.

The individual feature representations learned by the four encoders ($Z_1$, $Z_2$, $Z_3$, $Z_4$) are first concatenated to obtain the input feature representation $Z = [Z_1, Z_2, Z_3, Z_4]$, $Z_k \in \mathbb{R}^{H \times W}$. Note that at the lowest level of the network there are only the four modality-specific feature representations ($Z_1$, $Z_2$, $Z_3$, $Z_4$); at the other levels, the upsampling layer of the decoder path is also concatenated with the modality-specific feature representations to obtain the input feature representation $Z = [Z_1, Z_2, Z_3, Z_4, Z_5]$. For simplicity, in the following we describe the fusion block with the four modality-specific feature representations.

In the modality attention module, a global average pooling is first performed to produce a channel descriptor $g$, which represents the global spatial information of the feature representation, with its $k$-th element

$$g_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} Z_k(i, j) \qquad (3)$$
Then two fully-connected layers are applied to encode the modality-wise dependencies, $\hat{g} = W_2(\delta(W_1 g))$, with $W_1$ and $W_2$ being the weights of the two fully-connected layers and $\delta(\cdot)$ the ReLU operator. $\hat{g}$ is then passed through a sigmoid layer to obtain the modality-wise weights, which are applied to the input representation $Z$ through multiplication to achieve the modality-wise features $Z_m$; $\sigma(\hat{g}_k)$ indicates the importance of the $k$-th modality of the feature representation:

$$Z_m = [\sigma(\hat{g}_1) Z_1, \sigma(\hat{g}_2) Z_2, \sigma(\hat{g}_3) Z_3, \sigma(\hat{g}_4) Z_4] \qquad (4)$$

In the spatial attention module, the feature representation can be considered as $Z = [Z^{1,1}, Z^{1,2}, ..., Z^{i,j}, ..., Z^{H,W}]$, where $Z^{i,j}$ denotes the feature at spatial location $(i, j)$, $i \in \{1, ..., H\}$, $j \in \{1, ..., W\}$. A convolution operation $q = W_s \star Z$, $q \in \mathbb{R}^{H \times W}$, with weight $W_s$, is used to squeeze the modality domain and produce a projection tensor that represents the linearly combined representation of all modalities at each spatial location. The tensor is finally passed through a sigmoid layer to obtain the space-wise weights; $\sigma(q_{i,j})$ indicates the importance of the spatial information at $(i, j)$:

$$Z_s = [\sigma(q_{1,1}) Z^{1,1}, ..., \sigma(q_{i,j}) Z^{i,j}, ..., \sigma(q_{H,W}) Z^{H,W}] \qquad (5)$$

Finally, the fused feature representation is obtained by adding the modality-wise and space-wise feature representations:

$$Z_f = Z_m + Z_s \qquad (6)$$

Fig. 6. Proposed dual attention fusion block. The individual feature representations ($Z_1$, $Z_2$, $Z_3$, $Z_4$) are first concatenated, then recalibrated by the modality attention module and the spatial attention module to achieve the modality attention representation $Z_m$ and the spatial attention representation $Z_s$; finally they are added to obtain the fused feature representation $Z_f$.

From Fig. 6, we can observe that the target tumor's characteristics are not obvious in the four independent feature representations; however, the modality attention module highlights the different attributes of the modalities to provide complementary information. For example, the FLAIR modality highlights the edema region and the T1c modality highlights the tumor core region. In the spatial attention module, all the locations related to the target tumor region are highlighted. In this way, we can discover the most relevant characteristics between modalities. Furthermore, the proposed fusion block can be directly adapted to any multi-modal fusion problem.
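A compact sketch of this dual attention fusion in Keras follows; the bottleneck ratio of the two fully-connected layers and the single 1 × 1 × 1 convolution for the spatial squeeze are assumptions in the spirit of the squeeze-and-excitation modules [24], [27] the block builds on.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dual_attention_fusion(modality_features, reduction=2):
    """Fuse a list of modality-specific 3D feature maps Z_1..Z_n."""
    z = layers.Concatenate(axis=-1)(modality_features)
    channels = z.shape[-1]

    # Modality attention, Eqs. (3)-(4): global average pooling, two FC
    # layers, then a sigmoid gate applied channel-wise.
    g = layers.GlobalAveragePooling3D()(z)
    g = layers.Dense(channels // reduction, activation="relu")(g)
    g = layers.Dense(channels, activation="sigmoid")(g)
    z_m = z * layers.Reshape((1, 1, 1, channels))(g)

    # Spatial attention, Eq. (5): a 1x1x1 convolution squeezes the modality
    # axis to a single map; sigmoid gives one weight per spatial location.
    q = layers.Conv3D(1, 1, activation="sigmoid")(z)
    z_s = z * q

    # Eq. (6): the fused representation is the sum of the two branches.
    return z_m + z_s
```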
III. DATA AND IMPLEMENTATION DETAILS
A. Data
The datasets used in the experiments come from the BraTS 2018 dataset. The training set includes 285 patients, and each patient has four image modalities: T1, T1c, T2 and FLAIR. Following the challenge, four intra-tumor structures have been grouped into three mutually inclusive tumor regions: (a) the whole tumor (WT), which consists of all tumor tissues; (b) the tumor core (TC), which consists of the enhancing tumor and the necrotic and non-enhancing tumor core; and (c) the enhancing tumor (ET). The provided data have been pre-processed by the organisers: co-registered to the same anatomical template, interpolated to the same resolution (1 mm³) and skull-stripped. The ground truth has been manually labeled by experts. We apply additional pre-processing with a standard procedure: the N4ITK [30] method is used to correct the bias field distortion of the MRI data, and intensity normalization is applied to each modality of each patient. To exploit the spatial contextual information of the images, we use the 3D volumes, cropped and resized from 240 × 240 × 155 to a smaller fixed input size.
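A plausible version of this pre-processing step is sketched below, assuming SimpleITK for the N4 bias field correction, an Otsu-based brain mask, and z-score intensity normalization over brain voxels (the exact normalization used is not detailed above).

```python
import SimpleITK as sitk
import numpy as np

def preprocess_modality(path):
    image = sitk.ReadImage(path, sitk.sitkFloat32)
    # N4ITK corrects the low-frequency bias field of the MRI volume.
    mask = sitk.OtsuThreshold(image, 0, 1, 200)
    corrected = sitk.N4BiasFieldCorrection(image, mask)
    volume = sitk.GetArrayFromImage(corrected)
    # Z-score normalization computed on the non-zero (brain) voxels only.
    brain = volume[volume > 0]
    volume[volume > 0] = (brain - brain.mean()) / (brain.std() + 1e-8)
    return volume
```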
B. Implementation Details

Our network is implemented in Keras and trained on a single Nvidia Quadro P5000 GPU (16 GB). The models are optimized using the Adam optimizer (initial learning rate = 5e-4), with the learning rate halved when the validation loss does not improve for 10 epochs. To avoid over-fitting, early stopping is used when the validation loss does not improve for 50 epochs. We randomly split the dataset into 80% training and 20% testing.
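This training configuration can be expressed with standard Keras callbacks; a sketch follows, where the model, the data and the loss are assumed to be defined as in the other sections.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)
callbacks = [
    # Halve the learning rate when validation loss stalls for 10 epochs.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=10),
    # Stop training when validation loss stalls for 50 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50),
]
# model.compile(optimizer=optimizer, loss=total_loss)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           callbacks=callbacks)
```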
C. The Choice of Loss Function
The network is trained with the following overall loss function:

$$L_{total} = L_{dice} + \lambda L_{correlation} \qquad (7)$$

where $\lambda$ is a trade-off parameter weighting the importance of the correlation term; it is set to 0.1 in our experiments. For segmentation, we use the dice loss to evaluate the overlap between the prediction and the ground truth:

$$L_{dice} = 1 - \frac{2 \sum_{c=1}^{C} \sum_{i=1}^{N} p_{ic}\, g_{ic} + \epsilon}{\sum_{c=1}^{C} \sum_{i=1}^{N} (p_{ic} + g_{ic}) + \epsilon} \qquad (8)$$

where $N$ is the set of all examples, $C$ is the set of classes, $p_{ic}$ is the predicted probability that voxel $i$ belongs to class $c$, $g_{ic}$ is the corresponding ground truth, and $\epsilon$ is a small constant to avoid division by zero.
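A sketch of this composite loss in TensorFlow/Keras is given below, reusing the correlation_loss sketched in the correlation constraint section; the reduction over classes and voxels follows Eq. (8), while the epsilon value and the batch-mean reduction are assumptions.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, epsilon=1e-5):
    """Soft dice loss of Eq. (8), summed over classes and voxels."""
    axes = tuple(range(1, len(y_pred.shape)))  # all voxel and class axes
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    denominator = tf.reduce_sum(y_true + y_pred, axis=axes)
    return tf.reduce_mean(1.0 - (2.0 * intersection + epsilon)
                          / (denominator + epsilon))

def total_loss(y_true, y_pred, z_j, f_j, lam=0.1):
    """Eq. (7): L_total = L_dice + lambda * L_correlation, lambda = 0.1."""
    return dice_loss(y_true, y_pred) + lam * correlation_loss(z_j, f_j)
```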
D. Evaluation Metrics

To evaluate the proposed method, two evaluation metrics, Dice Score and Hausdorff distance, are used to obtain quantitative measurements of the segmentation accuracy.

1) Dice Score: It evaluates the overlap between the prediction and the ground truth. It ranges from 0 to 1, and a better prediction yields a larger Dice value:

$$Dice = \frac{2TP}{2TP + FP + FN} \qquad (9)$$

where $TP$, $FP$ and $FN$ denote the numbers of true positive, false positive and false negative voxels, respectively.

2) Hausdorff Distance (HD): It is computed between the boundaries of the prediction and the ground truth, and is an indicator of the largest segmentation error. A better prediction yields a smaller HD value:

$$HD = \max\left\{ \sup_{r \in \partial R} d_m(r, \partial S),\; \sup_{s \in \partial S} d_m(s, \partial R) \right\} \qquad (10)$$

where $\partial S$ and $\partial R$ are the sets of tumor border voxels of the predicted and the real annotations, and $d_m(v, V)$ is the minimum of the Euclidean distances between a voxel $v$ and the voxels in a set $V$.
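For illustration, both metrics can be computed on binary masks with NumPy and SciPy as sketched below; extracting boundary voxels by morphological erosion for the Hausdorff distance is an implementation choice of ours, not prescribed by the text above.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred, truth):
    """Eq. (9): Dice = 2TP / (2TP + FP + FN)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    return 2.0 * tp / (pred.sum() + truth.sum() + 1e-8)

def hausdorff_distance(pred, truth):
    """Eq. (10): symmetric Hausdorff distance between mask boundaries."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    border_p = np.argwhere(pred & ~binary_erosion(pred))
    border_t = np.argwhere(truth & ~binary_erosion(truth))
    return max(directed_hausdorff(border_p, border_t)[0],
               directed_hausdorff(border_t, border_p)[0])
```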
IV. EXPERIMENT RESULTS

We conduct a series of comparative experiments to demonstrate the effectiveness of our proposed method and to compare it with other approaches. In Section IV-A1, we first perform an ablation experiment to show the importance of the proposed components and demonstrate that adding them enhances the segmentation performance. In Section IV-A2, we compare our method with the state-of-the-art methods. In Section IV-B, the qualitative results further demonstrate that our proposed method achieves promising segmentation results.
A. Quantitative Analysis
To prove the effectiveness of our network, we first conduct an ablation experiment to assess the effectiveness of the proposed components, and then we compare our method with the state-of-the-art methods. All the results are obtained from the online evaluation platform (https://ipp.cbica.upenn.edu/).
1) Effectiveness of Individual Modules:
To assess the performance of our method and the importance of the proposed components, namely the dual attention fusion strategy and the correlation constraint block, we conducted an ablation experiment; our network without the dual attention fusion strategy and the correlation constraint block is denoted as the baseline. From Table I, we can observe that the baseline method achieves Dice Scores of 0.726, 0.867 and 0.766 for enhancing tumor, whole tumor and tumor core, respectively. When the dual attention fusion strategy is applied to the network, we can see an improvement in Dice Score and Hausdorff Distance across all tumor regions, with average improvements of 0.85% and 6.44%, respectively. The main reason is that the proposed fusion block helps to emphasize the most important representations from the different modalities across different positions, which boosts the segmentation result. Another advantage of our method is the correlation constraint block, which constrains the encoders to discover the latent multi-source correlation representation between modalities and then guides the network to learn correlated representations for a better segmentation. From the results, we can observe that with the assistance of the correlation constraint block, the network achieves the best Dice Scores of 0.747, 0.886 and 0.776 and Hausdorff Distances of 7.851, 7.345 and 9.016 for enhancing tumor, whole tumor and tumor core, respectively, with average improvements of 2.21% and 9.28% relative to the baseline.

TABLE I
EVALUATION OF OUR PROPOSED METHOD ON THE BRATS TRAINING DATASET. (1) BASELINE, (2) BASELINE + DUAL ATTENTION FUSION, (3) BASELINE + DUAL ATTENTION FUSION + CORRELATION CONSTRAINT. ET, WT, TC DENOTE ENHANCING TUMOR, WHOLE TUMOR AND TUMOR CORE, RESPECTIVELY.

Methods | Dice Score (ET, WT, TC)  | Hausdorff (mm) (ET, WT, TC)
(1)     | 0.726, 0.867, 0.764      | 8.743, 8.463, 9.482
(2)     | 0.733, 0.879, 0.765      | 8.003, 7.813, 9.153
(3)     | 0.747, 0.886, 0.776      | 7.851, 7.345, 9.016
The results in Table I demonstrate the effectiveness of each proposed component and show that our proposed network architecture performs well on brain tumor segmentation.
2) Comparisons with the State-of-the-art:
To demonstrate the performance of our method, we compare it with state-of-the-art methods on the BraTS 2018 validation set, which contains 66 patient images without ground truth. Table II shows the comparison results. We also carried out a comparison study with state-of-the-art methods based on U-Net.

(1) Hu et al. [31] proposed the multi-level up-sampling network (MU-Net) for automated segmentation of brain tumors, where a novel global attention (GA) module is used to combine the low-level feature maps obtained by the encoder and the high-level feature maps obtained by the decoder.

(2) Tuan et al. [32] proposed using bit-planes to generate a series of binary images by determining significant bits. The first U-Net uses the significant bits to segment the tumor boundary, and a second U-Net uses the original images and the images with the least significant bits to predict the label of all pixels inside the boundary.

(3) Hu et al. [33] introduced the 3D-residual-UNet architecture. The network comprises a context aggregation pathway and a localization pathway; the encoder abstracts a representation of the input, which is then recombined with shallower features to precisely localize the region of interest via the localization path.

(4) Myronenko et al. [19] proposed a 3D MRI brain tumor segmentation using autoencoder regularization, where a variational autoencoder branch is added to reconstruct the input image itself in order to regularize the shared decoder and impose additional constraints on its layers.

The best result in the BraTS 2018 Challenge is from [19], which achieves 0.814, 0.904 and 0.859 in terms of Dice Score on the enhancing tumor, whole tumor and tumor core regions, respectively. However, it uses 32 initial convolution filters and a lot of memory (an NVIDIA Tesla V100 32GB GPU is required) to train the model, which is computationally expensive. Our method uses only 8 initial filters, and from Table II it can be observed that our proposed method yields competitive results in terms of Dice Score and Hausdorff distance across all the tumor regions. We also implemented the method of [19] with 8 initial filters, but the results were not good. Compared with the other methods, [33] has a better Dice Score on the enhancing tumor, while our method achieves a better average Dice Score over all the tumor regions, with an improvement of 1.15%, and an average improvement of 14.81% in Hausdorff Distance.

TABLE II
COMPARISON OF DIFFERENT METHODS ON THE BRATS 2018 VALIDATION DATASET. ET, WT, TC DENOTE ENHANCING TUMOR, WHOLE TUMOR AND TUMOR CORE, RESPECTIVELY. BOLD DENOTES THE BEST SCORE FOR EACH TUMOR REGION, UNDERLINE THE SECOND BEST.

Methods  | Dice Score (ET, WT, TC, Average) | Hausdorff (mm) (ET, WT, TC, Average)
[31]     | 0.69, 0.88, 0.74, 0.77           | 6.69, 4.76, 10.67, 7.373
[32]     | 0.682, 0.818, 0.699, 0.733       | 7.016, 9.412, 12.462, 9.633
[33]     | 0.719, 0.856, 0.769, 0.781       | 5.5, 10.843, 9.985, 8.776
[19]     | 0.814, 0.904, 0.859, 0.859       | –, –, –, –
Proposed | 0.705, 0.883, 0.783, 0.79        | 7.27, 5.111, 10.047, 7.476

To visualize the effectiveness of the proposed correlation constraint block, we select an example and show the feature representations of the four modalities in the last layer (before the output) of the network in Fig. 7. The first and second rows show the feature representations without and with the correlation constraint block, and the fifth column shows the ground truth. We can observe that the correlation constraint block constrains the network to emphasize the tumor region of interest for segmentation.
Fig. 7. Visualization of the effectiveness of the proposed correlation constraint block.
B. Qualitative Analysis
In order to evaluate the robustness of our model, we randomly select several examples from the BraTS 2018 dataset and visualize the segmentation results in Fig. 8. From Fig. 8, we can observe that the segmentation results gradually improve as the proposed strategies are integrated; these comparisons indicate the effectiveness of the proposed strategies. With all the proposed strategies, our method achieves the best results.

Fig. 8. Visualization of several segmentation results. (a) Baseline. (b) Baseline with fusion block. (c) Proposed method with fusion block and correlation constraint. (d) Ground truth. Red: necrotic and non-enhancing tumor core; yellow: edema; green: enhancing tumor.

V. DISCUSSION AND CONCLUSION
In this paper, we proposed a 3D multimodal brain tumor segmentation network guided by a multi-source correlation constraint, and demonstrated its segmentation performance on multi-modal MR images of glioma patients.

To take advantage of the complementary information from the different modalities, a multi-encoder based network is used to learn modality-specific feature representations. Considering that the correlation between MR modalities can help the segmentation, a linear correlation block is used to describe the latent multi-source correlation. Since effective feature learning contributes to a better segmentation result, a loss function guides the network to learn the most correlated feature representations. Furthermore, different MR modalities identify different attributes of the target tumor, and each MR modality image presents different contents at different locations. To this end, inspired by the attention mechanism, a dual attention fusion strategy is integrated into our network. The modality attention module is used to distinguish the contribution of each modality, and the spatial attention module is used to extract more useful spatial information to boost the segmentation result. The proposed fusion strategy encourages the network to learn more useful feature representations, and performs better than simple max or mean fusion.

The advantages of our proposed network architecture are: (i) the segmentation results, evaluated with two metrics (Dice Score and Hausdorff Distance), are close to the real annotations provided by the expert radiologists; (ii) the architecture is an end-to-end deep learning approach and is fully automatic, without any user intervention; (iii) the experimental results demonstrate that our proposed method gives accurate results for the segmentation of brain tumors and their sub-regions, even small regions, and achieves very competitive results with less computational complexity. In addition, our method can be generalized to other kinds of correlation (e.g. nonlinear) and applied to other kinds of multi-source images if some correlation exists between them.

As a perspective of this research, we will validate our method in different clinical scenarios. In addition, we intend to study a more efficient correlation representation approach to describe the correlation between modalities, and to apply it to synthesize additional images to cope with limited medical image datasets.
ACKNOWLEDGMENT
This project was co-financed by the European Union with the European Regional Development Fund (ERDF, 18P03390/18E01750/18P02733) and by the Haute-Normandie Regional Council via the M2SINUM project.
REFERENCES

[1] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993–2024, 2014.
[2] Z.-P. Liang and P. C. Lauterbur, Principles of Magnetic Resonance Imaging: A Signal Processing Perspective. SPIE Optical Engineering Press, 2000.
[3] S. Bauer, R. Wiest, L.-P. Nolte, and M. Reyes, "A survey of MRI-based medical image analysis for brain tumor studies," Physics in Medicine & Biology, vol. 58, no. 13, p. R97, 2013.
[4] A. Drevelegas, Imaging of Brain Tumors with Histological Correlations. Springer Science & Business Media, 2010.
[5] J. Lapuyade-Lahorgue, J.-H. Xue, and S. Ruan, "Segmenting multi-source images using hidden Markov fields with copula-based multivariate statistical distributions," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3187–3195, 2017.
[6] N. Zhang, S. Ruan, S. Lebonvallet, Q. Liao, and Y. Zhu, "Kernel feature selection to fuse multi-spectral MRI images for brain tumor segmentation," Computer Vision and Image Understanding, vol. 115, no. 2, pp. 256–269, 2011.
[7] C. Lian, S. Ruan, T. Denœux, H. Li, and P. Vera, "Joint tumor segmentation in PET-CT images using co-clustering and fusion based on belief functions," IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 755–766, 2018.
[8] C. Lian, S. Ruan, and T. Denœux, "Dissimilarity metric learning in the belief function framework," IEEE Transactions on Fuzzy Systems, vol. 24, no. 6, pp. 1555–1564, 2016.
[9] D. Zikic, B. Glocker, E. Konukoglu, A. Criminisi, C. Demiralp, J. Shotton, O. M. Thomas, T. Das, R. Jena, and S. J. Price, "Decision forests for tissue-specific segmentation of high-grade gliomas in multi-channel MR," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2012, pp. 369–376.
[10] Y. Yu, P. Decazes, J. Lapuyade-Lahorgue, I. Gardin, P. Vera, and S. Ruan, "Semi-automatic lymphoma detection and segmentation using fully conditional random fields," Computerized Medical Imaging and Graphics, vol. 70, pp. 1–7, 2018.
[11] S. Bauer, L.-P. Nolte, and M. Reyes, "Fully automatic segmentation of brain tumor images using support vector machine classification in combination with hierarchical conditional random field regularization," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2011, pp. 354–361.
[12] S. Cui, L. Mao, J. Jiang, C. Liu, and S. Xiong, "Automatic semantic segmentation of brain gliomas from MRI images using a deep cascaded neural network," Journal of Healthcare Engineering, vol. 2018, 2018.
[13] X. Zhao, Y. Wu, G. Song, Z. Li, Y. Zhang, and Y. Fan, "A deep learning model integrating FCNNs and CRFs for brain tumor segmentation," Medical Image Analysis, vol. 43, pp. 98–111, 2018.
[14] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[15] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, and H. Larochelle, "Brain tumor segmentation with deep neural networks," Medical Image Analysis, vol. 35, pp. 18–31, 2017.
[16] G. Wang, W. Li, S. Ourselin, and T. Vercauteren, "Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks," in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 178–190.
[17] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, "Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation," Medical Image Analysis, vol. 36, pp. 61–78, 2017.
[18] K. Kamnitsas, W. Bai, E. Ferrante, S. McDonagh, M. Sinclair, N. Pawlowski, M. Rajchl, M. Lee, B. Kainz, D. Rueckert et al., "Ensembles of multiple models and architectures for robust brain tumour segmentation," in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 450–462.
[19] A. Myronenko, "3D MRI brain tumor segmentation using autoencoder regularization," in International MICCAI Brainlesion Workshop. Springer, 2018, pp. 311–320.
[20] T. Zhou, S. Ruan, and S. Canu, "A review: Deep learning for medical image segmentation using multi-modality fusion," Array, vol. 3, p. 100004, 2019.
[21] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein, "Brain tumor segmentation and radiomics survival prediction: Contribution to the BRATS 2017 challenge," in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 287–297.
[22] K.-L. Tseng, Y.-L. Lin, W. Hsu, and C.-Y. Huang, "Joint sequence learning and cross-modality convolution for 3D biomedical segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6393–6400.
[23] T. Zhou, S. Ruan, Y. Guo, and S. Canu, "A multi-modality fusion network based on attention mechanism for brain tumor segmentation," in IEEE International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 377–380.
[24] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[25] H. Li, P. Xiong, J. An, and L. Wang, "Pyramid attention network for semantic segmentation," arXiv preprint arXiv:1805.10180, 2018.
[26] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., "Attention U-Net: Learning where to look for the pancreas," arXiv preprint arXiv:1804.03999, 2018.
[27] A. G. Roy, N. Navab, and C. Wachinger, "Concurrent spatial and channel 'squeeze & excitation' in fully convolutional networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 421–429.
[28] A. G. Roy, S. Siddiqui, S. Pölsterl, N. Navab, and C. Wachinger, "'Squeeze & excite' guided few-shot segmentation of volumetric images," Medical Image Analysis, vol. 59, p. 101587, 2020.
[29] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
[30] B. B. Avants, N. Tustison, and G. Song, "Advanced normalization tools (ANTs)," Insight Journal, vol. 2, pp. 1–35, 2009.
[31] Y. Hu, X. Liu, X. Wen, C. Niu, and Y. Xia, "Brain tumor segmentation on multimodal MR imaging using multi-level upsampling in decoder," in International MICCAI Brainlesion Workshop. Springer, 2018, pp. 168–177.
[32] T. A. Tuan et al., "Brain tumor segmentation using bit-plane and U-Net," in International MICCAI Brainlesion Workshop. Springer, 2018, pp. 466–475.
[33] X. Hu, H. Li, Y. Zhao, C. Dong, B. H. Menze, and M. Piraud, "Hierarchical multi-class segmentation of glioma images using networks with multi-level activation function," in International MICCAI Brainlesion Workshop. Springer, 2018.