CA-Net: Comprehensive Attention Convolutional Neural Networks for Explainable Medical Image Segmentation
Ran Gu, Guotai Wang, Tao Song, Rui Huang, Michael Aertsen, Jan Deprest, Sébastien Ourselin, Tom Vercauteren, Shaoting Zhang
Abstract—Accurate medical image segmentation is essential for diagnosis and treatment planning of diseases. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they are still challenged by complicated conditions where the segmentation target has large variations of position, shape and scale, and existing CNNs have a poor explainability that limits their application to clinical decisions. In this work, we make extensive use of multiple attentions in a CNN architecture and propose a comprehensive attention-based CNN (CA-Net) for more accurate and explainable medical image segmentation that is aware of the most important spatial positions, channels and scales at the same time. In particular, we first propose a joint spatial attention module to make the network focus more on the foreground region. Then, a novel channel attention module is proposed to adaptively recalibrate channel-wise feature responses and highlight the most relevant feature channels. Also, we propose a scale attention module implicitly emphasizing the most salient feature maps among multiple scales so that the CNN is adaptive to the size of an object. Extensive experiments on skin lesion segmentation from ISIC 2018 and multi-class segmentation of fetal MRI showed that our proposed CA-Net significantly improved the average segmentation Dice score from 87.77% to 92.08% for the skin lesion, 84.79% to 87.08% for the placenta and 93.20% to 95.88% for the fetal brain, respectively, compared with U-Net. It reduced the model size to around 15 times smaller with close or even better accuracy compared with the state-of-the-art DeepLabv3+. In addition, it has a much higher explainability than existing networks, as shown by visualizing the attention weight maps. Our code is available at https://github.com/HiLab-git/CA-Net

Index Terms—Attention, Convolutional Neural Network, Medical Image Segmentation, Explainability
I. INTRODUCTION

[This work was supported by the National Natural Science Foundation of China, No. 81771921 and No. 61901084. (Corresponding author: Guotai Wang.) R. Gu, G. Wang, and S. Zhang are with the School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, China (e-mail: [email protected]). T. Song and R. Huang are with SenseTime Research, Shanghai, China. M. Aertsen is with the Department of Radiology, University Hospitals Leuven, Leuven, Belgium. J. Deprest is with the School of Biomedical Engineering and Imaging Sciences, King's College London, London, U.K., with the Department of Obstetrics and Gynaecology, University Hospitals Leuven, Leuven, Belgium, and with the Institute for Women's Health, University College London, London, U.K. S. Ourselin and T. Vercauteren are with the School of Biomedical Engineering and Imaging Sciences, King's College London, London, U.K.]

AUTOMATIC medical image segmentation is important for facilitating quantitative pathology assessment, treatment planning and monitoring disease progression [1]. However, this is a challenging task for several reasons. First, medical images can be acquired with a wide range of protocols and usually have low contrast and inhomogeneous appearances, leading to over-segmentation and under-segmentation [2]. Second, some structures have large variations of scale and shape, such as skin lesions in dermoscopic images [3], making it hard to construct a prior shape model. In addition, some structures may have large variations of position and orientation in a large image context, such as the placenta and fetal brain in Magnetic Resonance Imaging (MRI) [2], [4], [5]. To achieve good segmentation performance, it is highly desirable for automatic segmentation methods to be aware of the scale and position of the target.

With the development of deep Convolutional Neural Networks (CNNs), state-of-the-art performance has been achieved for many segmentation tasks [1]. Compared with traditional methods, CNNs have a higher representation ability and can learn the most useful features automatically from a large dataset. However, most existing CNNs are faced with the following problems. Firstly, by design of the convolutional layer, they use shared weights at different spatial positions, which may lead to a lack of spatial awareness and thus reduced performance when dealing with structures with flexible shapes and positions, especially for small targets. Secondly, they usually use a very large number of feature channels, while these channels may be redundant. Many networks such as the U-Net [6] use a concatenation of low-level and high-level features with different semantic information. These features may have different importance for the segmentation task, and highlighting the relevant channels while suppressing irrelevant ones would benefit the segmentation [7]. Thirdly, CNNs usually extract multi-scale features to deal with objects at different scales, but lack the awareness of the most suitable scale for a specific image to be segmented [8]. Last but not least, the decisions of most existing CNNs are hard to explain and are employed in a black-box manner due to their nested non-linear structure, which limits their application to clinical decisions.

To address these problems, the attention mechanism is promising for improving CNNs' segmentation performance, as it mimics the human behavior of focusing on the most relevant information in the feature maps while suppressing irrelevant parts. Generally, there are different types of attentions that can be exploited for CNNs,
such as paying attention to the relevant spatial regions, feature channels and scales. As an example of spatial attention, the Attention Gate (AG) [9] generates soft region proposals implicitly and highlights useful salient features for the segmentation of abdominal organs. The Squeeze and Excitation (SE) block [7] is one kind of channel attention, and it recalibrates useful channel feature maps related to the target. Qin et al. [10] used attention to deal with multiple parallel branches with different receptive fields for brain tumor segmentation, and the same idea was used in prostate segmentation from ultrasound images [11]. However, these works have only demonstrated the effectiveness of using one or two attention mechanisms for segmentation, which may limit the performance and explainability of the network. We assume that a more comprehensive use of attentions would boost the segmentation performance and make it easier to understand how the network works.

For artificial intelligence systems, explainability is highly desirable when applied to medical diagnosis [12]. The explainability of CNNs has a potential for verification of the prediction, where the reliance of the networks on the correct features must be guaranteed [12]. It can also help humans understand the model's weaknesses and strengths in order to improve the performance and discover new knowledge distilled from a large dataset. In the segmentation task, explainability helps developers interpret and understand how the decision is obtained, and accordingly modify the network to gain better accuracy. Some early works tried to understand CNNs' decisions by visualizing feature maps or convolution kernels in different layers [13]. Other methods such as Class Activation Map (CAM) [14] and Guided Back Propagation (GBP) [15] are mainly proposed for explaining decisions of CNNs in classification tasks. However, the explainability of CNNs in the context of medical image segmentation has rarely been investigated [16], [17]. Schlemper et al. [16] proposed attention gates that implicitly learn to suppress irrelevant regions while highlighting salient features. Furthermore, Roy et al. [17] introduced spatial and channel attention at the same time to boost meaningful features. In this work, we take advantage of spatial, channel and scale attentions to interpret and understand how the pixel-level predictions are obtained by our network. Visualizing the attention weights obtained by our network not only helps to understand which image region is activated for the segmentation result, but also sheds light on the scale and channel that contribute most to the prediction.

To the best of our knowledge, this is the first work on using comprehensive attentions to improve the performance and explainability of CNNs for medical image segmentation. The contribution of this work is three-fold. First, we propose a novel Comprehensive Attention-based Network (i.e., CA-Net) to make complete use of attentions to spatial positions, channels and scales. Second, to implement each of these attentions, we propose novel building blocks including a dual-pathway multi-scale spatial attention module, a novel residual channel attention module and a scale attention module that adaptively selects features from the most suitable scales. Third, we use the comprehensive attention to obtain a good explainability of our network, where the segmentation result can be attributed to the relevant spatial areas, feature channels and scales.
Our proposed CA-Net was validated on two segmentation tasks: binary skin lesion segmentation from dermoscopic images and multi-class segmentation of fetal MRI (including the fetal brain and the placenta), where the objects vary largely in position, scale and shape. Extensive experiments show that CA-Net outperforms its counterparts that use no or only partial attentions. In addition, by visualizing the attention weight maps, we achieved a good explainability of how CA-Net works for the segmentation tasks.

II. RELATED WORKS
A. CNNs for Image Segmentation
Fully Convolutional Network (FCN) [18] frameworks such as DeepLab [8] are successful methods for natural semantic image segmentation. Subsequently, an encoder-decoder network, SegNet [19], was proposed to produce dense feature maps. DeepLabv3+ [20] extended DeepLab by adding a decoder module and using depth-wise separable convolution for better performance and efficiency.

In medical image segmentation, FCNs have also been extensively exploited in a wide range of tasks. U-Net [6] is a widely used CNN for 2D biomedical image segmentation. The 3D U-Net [21] and V-Net [22] with similar structures were proposed for 3D medical image segmentation. In [23], a dilated residual and pyramid pooling network was proposed for automated segmentation of melanoma. Some other CNNs with good performance for medical image segmentation include HighRes3DNet [24], DeepMedic [25], and H-DenseUNet [26]. However, these methods only use position-invariant kernels for learning, without focusing on the features and positions that are more relevant to the segmentation object. Meanwhile, they have a poor explainability, as they provide little mechanism for interpreting the decision-making process.
B. Attention Mechanism
In computer vision, attention mechanisms have been applied in different task scenarios [27]–[29]. Spatial attention has been used for image classification [27] and image captioning [29], etc. The learned attention vector highlights the salient spatial areas of the sequence conditioned on the current feature while suppressing the irrelevant counterparts, making the prediction more contextualized. The SE block using channel-wise attention was originally proposed for image classification and has recently been used for semantic segmentation [26], [28]. These attention mechanisms work by generating a context vector which assigns weights to the input sequence. In [30], an attention mechanism was proposed to learn to softly weight feature maps at multiple scales. However, this method feeds multiple resized input images to a shared deep network, which requires human expertise to choose the proper sizes and is not self-adaptive to the target scale.

Recently, to leverage attention mechanisms for medical image segmentation, Oktay et al. [9] combined spatial attention with U-Net for abdominal pancreas segmentation from CT images. Roy et al. [17] proposed the concurrent spatial and channel-wise 'Squeeze and Excitation' (scSE) framework for whole brain and abdominal multiple organ segmentation.
Fig. 1. Our proposed comprehensive attention CNN (CA-Net). Blue rectangles with a kernel size (3×3 or 1×1) and a channel number (16, 32, 64, 128, and 256, or the class number) correspond to the convolution layers. We use four spatial attentions (SA1 to SA4), four channel attentions (CA1 to CA4) and one scale attention (LA). F1 to F4 are the resampled versions of the feature maps that are concatenated as input of the scale attention module.

Qin et al. [10] and Wang et al. [11] obtained feature maps of different sizes from middle layers and recalibrated these feature maps by assigning an attention weight. Despite the increasing number of works leveraging attention mechanisms for medical image segmentation, they seldom pay attention to feature maps at different scales. What's more, most of them focus on only one or two attention mechanisms, and to the best of our knowledge, attention mechanisms have not been comprehensively incorporated to increase the accuracy and explainability of segmentation tasks.

III. METHODS
A. Comprehensive-Attention CNN
The proposed CA-Net making use of comprehensive attentions is shown in Fig. 1, where we add specialized convolutional blocks to achieve comprehensive attention guidance with respect to the space, channel and scale of the feature maps simultaneously. Without loss of generality, we choose the powerful structure of the U-Net [6] as the backbone. The U-Net backbone is an end-to-end-trainable network consisting of an encoder and a decoder with a shortcut connection at each resolution level. The encoder is regarded as a feature extractor that obtains high-dimensional features across multiple scales sequentially, and the decoder utilizes these encoded features to recover the segmentation target.

Our CA-Net has four spatial attention modules (SA1–SA4), four channel attention modules (CA1–CA4) and one scale attention module (LA), as shown in Fig. 1. The spatial attention is utilized to strengthen the region of interest on the feature maps while suppressing the potential background or irrelevant parts. Hence, we propose a novel multi-scale spatial attention module that is a combination of a non-local block [31] at the lowest resolution level (SA4) and dual-pathway AGs [9] at the other resolution levels (SA1–SA3). We call it the joint spatial attention (Js-A), which enhances the inter-pixel relationship to make the network better focus on the segmentation target. Channel attention (CA1–CA4) is used to calibrate the concatenation of low-level and high-level features in the network so that the more relevant channels are weighted by higher coefficients. Unlike the SE block that only uses average-pooling to gain the channel attention weight, we additionally introduce max-pooled features to exploit more salient information for channel attention [32]. Finally, we concatenate feature maps at multiple scales in the decoder and propose a scale attention module (LA) to highlight the features at the most relevant scales for the segmentation target. These different attention modules are detailed in the following.
1) Joint Spatial Attention Modules:
The joint spatial attention is inspired by the non-local network [31] and AG [9]. We use four attention blocks (SA1–SA4) in the network to learn attention maps at four different resolution levels, as shown in Fig. 1. First, for the spatial attention at the lowest resolution level (SA4), we use a non-local block that captures interactions between all pixels with a better awareness of the entire context. The detail of SA4 is shown in Fig. 2(a). Let x represent the input feature map with a shape of 256 × H × W, where 256 is the input channel number, and H and W represent the height and width, respectively. We first use three parallel 1×1 convolutional layers with an output channel number of 64 to reduce the dimension of x, obtaining three compressed feature maps x′, x″ and x‴, respectively, which have the same shape of 64 × H × W. The three feature maps can then be reshaped into 2D matrices with a shape of 64 × HW. A spatial attention coefficient map is obtained as:

α = σ(x′ᵀ · x″)   (1)

where ᵀ denotes the matrix transpose operation. α ∈ (0, 1)^(HW×HW) is a square matrix, and σ is a row-wise Softmax function so that the sum of each row equals 1.0. α is used to represent the feature of each pixel as a weighted sum of the features of all the pixels, ensuring the interaction among all the pixels. The calibrated feature map in the reduced dimension is:

x̂ = α · x‴ᵀ   (2)

x̂ is then reshaped to 64 × H × W, and we use Φ, a 1×1 convolution with batch normalization and an output channel number of 256, to expand x̂ to match the channel number of x. A residual connection is finally used to facilitate the information propagation during training, and the output of SA4 is obtained as:

y_SA4 = Φ(x̂) + x   (3)
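For concreteness, the following is a minimal PyTorch sketch of the SA4 non-local block following Eqs. (1)–(3); the class and variable names are our own illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class NonLocalSpatialAttention(nn.Module):
    """Sketch of SA4 (Eqs. 1-3): pairwise pixel affinities over the
    bottleneck feature map of shape 256 x H x W re-weight its features."""

    def __init__(self, in_channels=256, inter_channels=64):
        super().__init__()
        # three parallel 1x1 convolutions compressing x to 64 channels
        self.query = nn.Conv2d(in_channels, inter_channels, kernel_size=1)
        self.key = nn.Conv2d(in_channels, inter_channels, kernel_size=1)
        self.value = nn.Conv2d(in_channels, inter_channels, kernel_size=1)
        # Phi: 1x1 convolution + BN expanding back to the input channels
        self.phi = nn.Sequential(
            nn.Conv2d(inter_channels, in_channels, kernel_size=1),
            nn.BatchNorm2d(in_channels),
        )

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # B x HW x 64
        k = self.key(x).flatten(2)                     # B x 64 x HW
        v = self.value(x).flatten(2).transpose(1, 2)   # B x HW x 64
        alpha = torch.softmax(q @ k, dim=-1)           # Eq. (1): B x HW x HW
        x_hat = (alpha @ v).transpose(1, 2).reshape(b, -1, h, w)  # Eq. (2)
        return self.phi(x_hat) + x                     # Eq. (3): residual
```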
Fig. 2. Details of our proposed joint spatial attention block. (a) SA4 is a non-local block used at the lowest resolution level. (b) Single-pathway spatial attention block (SA). (c) SA1–SA3 are the dual-pathway attention blocks used at higher resolution levels. The query feature x_h is used to calibrate the low-level key feature x_l. δ means the Sigmoid function.
Second, as the increased memory consumption limits applying the non-local block to feature maps with higher resolution, we extend AG to learn attention coefficients in SA1–SA3. As a single AG may lead to a noisy spatial attention map, we propose a dual-pathway spatial attention that exploits two AGs in parallel to strengthen the attention to the region of interest as well as reducing noise in the attention map. Similarly to model ensembles, combining two AGs in parallel has a potential to improve the robustness of the segmentation. The details of a single-pathway AG are shown in Fig. 2(b). Let x_l represent the low-level feature map at scale s in the encoder, and x_h represent a high-level feature map up-sampled from the end of the decoder at scale s+1 with a lower spatial resolution, so that x_h and x_l have the same shape. In a single-pathway AG, the query feature x_h is used to calibrate the low-level key feature x_l. As shown in Fig. 2(b), x_h and x_l are each compressed by a 1×1 convolution with an output channel number C (e.g., 64), and the results are summed and followed by a ReLU activation function. The feature map obtained by the ReLU is then fed into another 1×1 convolution with one output channel followed by a Sigmoid function to obtain a pixel-wise attention coefficient α ∈ [0, 1]^(H×W). x_l is then multiplied with α to be calibrated. In our dual-pathway AG, the spatial attention maps in the two pathways are denoted as α̂ and α̃, respectively. As shown in Fig. 2(c), the output of our dual-pathway AG for SA_s (s = 1, 2, 3) is obtained as:

y_SAs = ReLU[Φ_C((x_l · α̂) ⊕ (x_l · α̃))]   (4)

where ⊕ means channel concatenation. Φ_C denotes a 1×1 convolution with C output channels followed by batch normalization. Here C is 64, 32 and 16 for SA3, SA2 and SA1, respectively.
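A hedged PyTorch sketch of the dual-pathway AG in Eq. (4) follows; the module layout is our reading of the text and Fig. 2(c), not the official code, and the channel arguments are illustrative.

```python
import torch
import torch.nn as nn


class DualPathwayAG(nn.Module):
    """Sketch of the dual-pathway attention gate (Eq. 4): two parallel
    additive AGs calibrate the low-level key feature x_l using the
    high-level query x_h; the two calibrated maps are concatenated and
    fused by Phi_C (1x1 convolution + BN) followed by ReLU."""

    def __init__(self, channels, out_channels):
        super().__init__()
        self.gates = nn.ModuleList(
            [self._make_gate(channels, out_channels) for _ in range(2)])
        self.fuse = nn.Sequential(                       # Phi_C
            nn.Conv2d(2 * channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
        )

    @staticmethod
    def _make_gate(channels, inter_channels):
        return nn.ModuleDict({
            'wl': nn.Conv2d(channels, inter_channels, kernel_size=1),
            'wh': nn.Conv2d(channels, inter_channels, kernel_size=1),
            'psi': nn.Conv2d(inter_channels, 1, kernel_size=1),
        })

    def forward(self, x_l, x_h):
        outs = []
        for g in self.gates:
            # additive attention: compress, sum, ReLU, 1x1 conv, Sigmoid
            a = torch.sigmoid(g['psi'](torch.relu(g['wl'](x_l) + g['wh'](x_h))))
            outs.append(x_l * a)                         # calibrated key
        return torch.relu(self.fuse(torch.cat(outs, dim=1)))   # Eq. (4)
```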
2) Channel Attention Modules:
In our network, channel concatenation is used to combine the spatial attention-calibrated low-level features from the encoder and the higher-level features from the decoder, as shown in Fig. 1. Feature channels from the encoder contain mostly low-level information, while their counterparts from the decoder contain more semantic information. Therefore, they may have different importance for the segmentation task. To better exploit the most useful feature channels, we introduce channel attention to automatically highlight the relevant feature channels while suppressing irrelevant channels. The details of the proposed channel attention module (CA1–CA4) are shown in Fig. 3.

Fig. 3. Structure of our proposed channel attention module with residual connection. Additional global max-pooled features are used in our module. β means the channel attention coefficient.

Unlike the previous SE block that only utilized average-pooled information to excite feature channels [7], we additionally use max-pooled features to keep more information [32]. Let x represent the concatenated input feature map with C channels. A global average pooling P_avg and a global max pooling P_max are first used to obtain the global information of each channel, and the outputs are represented as P_avg(x) ∈ R^(C×1×1) and P_max(x) ∈ R^(C×1×1), respectively. A multi-layer perceptron (MLP) M_r is used to obtain the channel attention coefficient β ∈ [0, 1]^(C×1×1), and M_r is implemented by two fully connected layers, where the first one has an output channel number of C/r followed by ReLU, and the second one has an output channel number of C. We set r = 2 considering the trade-off between performance and computational cost [7]. Note that a shared M_r is used for P_avg(x) and P_max(x), and their results are summed and fed into a Sigmoid to obtain β. The output of our channel attention module is obtained as:

y_CA = x · β + x   (5)

where we use a residual connection to benefit the training. In our network, four channel attention modules (CA1–CA4) are used (one for each concatenated feature), as shown in Fig. 1.
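Below is a minimal sketch of this module in PyTorch, with our own class and parameter names; it mirrors Eq. (5) with the shared two-layer MLP and r = 2.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Sketch of the channel attention module (Eq. 5): a shared MLP is
    applied to both the average- and max-pooled channel descriptors, and
    the summed logits give the channel weights beta via a Sigmoid."""

    def __init__(self, channels, reduction=2):
        super().__init__()
        # shared two-layer MLP: C -> C/r -> C, with r = 2 as in the paper
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # P_avg branch
        mx = self.mlp(x.amax(dim=(2, 3)))                # P_max branch
        beta = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # beta in [0,1]^C
        return x * beta + x                              # Eq. (5): residual
```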
3) Scale Attention Module:
The U-Net backbone obtains feature maps at different scales. To better deal with objects of different scales, it is reasonable to combine these features for the final prediction. However, for a given object, the feature maps at different scales may have different relevance to the object. It is desirable to automatically determine the scale-wise weight for each pixel, so that the network can be adaptive to the corresponding scales of a given input. Therefore, we propose a scale attention module that learns an image-specific weight for each scale automatically to calibrate the features at different scales; it is used at the end of the network, as shown in Fig. 1.
Fig. 4. Structure of our proposed scale attention module with residual connection. Its input is the concatenation of interpolated feature maps at different scales obtained by the decoder. γ means the scale-wise attention coefficient. We additionally use a spatial attention block LA* to gain the pixel-wise scale attention coefficient γ*.
Our proposed LA block is illustrated in Fig. 4. We first use bilinear interpolation to resample the feature maps F_s at the different scales (s = 1, 2, 3, 4) obtained by the decoder to the original image size. To reduce the computational cost, these feature maps are compressed into four channels using 1×1 convolutions, and the compressed results from the different scales are concatenated into a hybrid feature map F̂. Similarly to our CA, we combine P_avg and P_max with an MLP to obtain a coefficient for each channel (i.e., each scale here), as shown in Fig. 4. The scale attention coefficient vector is denoted as γ ∈ [0, 1]^(4×1×1). To distribute the multi-scale soft attention weight to each pixel, we additionally use a spatial attention block LA* taking F̂ · γ as input to generate the spatial-wise attention coefficient γ* ∈ [0, 1]^(4×H×W), so that γ · γ* represents a pixel-wise scale attention. LA* consists of one 3×3 and one 1×1 convolutional layer, where the first one has 4 output channels followed by ReLU, and the second one has 4 output channels followed by Sigmoid. The final output of our LA module is:

y_LA = F̂ · γ · γ* + F̂ · γ + F̂   (6)

where residual connections are again used to facilitate the training, as shown in Fig. 4. Using the scale attention module enables the CNN to be aware of the most suitable scale (i.e., how big the object is).
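The following PyTorch sketch implements Eq. (6) under our reading of the text: the input is the hybrid map F̂ (four scales, each compressed to four channels), and the grouping of those 16 channels into per-scale groups is our assumption rather than a detail stated in the paper.

```python
import torch
import torch.nn as nn


class ScaleAttention(nn.Module):
    """Sketch of the scale attention module (Eq. 6). gamma holds one
    weight per scale and gamma* one map per scale; group pooling over
    each scale's 4 channels is an illustrative design choice."""

    def __init__(self, scales=4, group=4, reduction=2):
        super().__init__()
        self.scales, self.group = scales, group
        self.mlp = nn.Sequential(                  # shared MLP for gamma
            nn.Linear(scales, scales // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(scales // reduction, scales),
        )
        self.spatial = nn.Sequential(              # LA*: 3x3 then 1x1 conv
            nn.Conv2d(scales * group, scales, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(scales, scales, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f):
        # f: B x (scales*group) x H x W, the concatenated hybrid map F_hat
        b, _, h, w = f.shape
        g = f.view(b, self.scales, self.group, h, w)
        avg = g.mean(dim=(2, 3, 4))                # per-scale descriptors
        mx = g.amax(dim=(2, 3, 4))
        gamma = torch.sigmoid(self.mlp(avg) + self.mlp(mx))       # B x 4
        gamma = gamma.view(b, self.scales, 1, 1, 1)
        fg = (g * gamma).view(b, -1, h, w)                        # F_hat * gamma
        gamma_star = self.spatial(fg).view(b, self.scales, 1, h, w)
        out = g * gamma * gamma_star + g * gamma + g              # Eq. (6)
        return out.view(b, -1, h, w)
```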
IV. EXPERIMENTAL RESULTS

We validated our proposed framework with two applications: (i) binary skin lesion segmentation from dermoscopic images, and (ii) multi-class segmentation of fetal MRI, including the fetal brain and the placenta. For both applications, we implemented ablation studies to validate the effectiveness of our proposed CA-Net and compared it with state-of-the-art networks. Experimental results of these two tasks are detailed in Section IV-B and Section IV-C, respectively.
A. Implementation and Evaluation Methods
All methods were implemented in the PyTorch framework (https://pytorch.org/), and our code is available at https://github.com/HiLab-git/CA-Net. We used Adaptive Moment Estimation (Adam) for training with an initial learning rate of 10⁻⁴, weight decay of 10⁻⁸, batch size 16, and 300 epochs. The learning rate is decayed by 0.5 every 256 epochs. The feature channel number in the first block of our CA-Net was set to 16 and doubled after each down-sampling. In the MLPs of our CA and LA modules, the channel compression factor r was 2, according to [7]. Training was implemented on one NVIDIA GeForce GTX 1080 Ti GPU. We used the Soft Dice loss function for the training of each network and used the best-performing model on the validation set among all the epochs for testing. We used 5-fold cross-validation for the final evaluation.

Quantitative evaluation of segmentation accuracy was based on: (i) the Dice score between a segmentation and the ground truth, defined as:

Dice = 2|R_a ∩ R_b| / (|R_a| + |R_b|)   (7)

where R_a and R_b denote the region segmented by the algorithm and the ground truth, respectively; and (ii) the Average Symmetric Surface Distance (ASSD). Supposing S_a and S_b represent the sets of boundary points of the automatic segmentation and the ground truth respectively, the ASSD is defined as:

ASSD = (1 / (|S_a| + |S_b|)) × ( Σ_{a∈S_a} d(a, S_b) + Σ_{b∈S_b} d(b, S_a) )   (8)

where d(v, S) = min_{w∈S}(‖v − w‖) denotes the minimum Euclidean distance from point v to all the points of S.
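As an illustration, the two metrics and the Soft Dice loss can be computed as in the sketch below (NumPy/SciPy and PyTorch; function names are ours):

```python
import numpy as np
import torch
from scipy import ndimage


def dice_score(seg, gt):
    """Dice coefficient (Eq. 7) between two binary masks."""
    inter = np.logical_and(seg, gt).sum()
    return 2.0 * inter / (seg.sum() + gt.sum())


def assd(seg, gt):
    """Average symmetric surface distance (Eq. 8) in pixels."""
    def border(mask):
        # boundary points = mask minus its binary erosion
        return np.logical_and(mask, ~ndimage.binary_erosion(mask))

    s_a, s_b = border(seg.astype(bool)), border(gt.astype(bool))
    # distance of every pixel to the nearest boundary point of each set
    d_to_a = ndimage.distance_transform_edt(~s_a)
    d_to_b = ndimage.distance_transform_edt(~s_b)
    return (d_to_b[s_a].sum() + d_to_a[s_b].sum()) / (s_a.sum() + s_b.sum())


def soft_dice_loss(prob, target, eps=1e-5):
    """Soft Dice loss on predicted probabilities, used for training."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
```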
B. Lesion Segmentation from Dermoscopic Images

With the emergence of automatic analysis algorithms, accurate automatic skin lesion boundary segmentation can help dermatologists with fast early diagnosis and screening of skin diseases. The main challenge of this task is that skin lesion areas have various scales, shapes and colors, which requires automatic segmentation methods to be robust against shape and scale variations of the lesion [33].
1) Dataset:
For skin lesion segmentation, we used the publicly available training set of ISIC 2018 (https://challenge2018.isic-archive.com/) with 2594 images and their ground truth. We randomly split the dataset into 1816, 260 and 518 images for training, validation and testing, respectively. The original sizes of the images ranged from 540×722 to 4499×6748 pixels, and we resized each image to 256×342 and normalized it by the mean value and standard deviation. During training, random cropping with a size of 224×300, horizontal and vertical flipping, and random rotation with an angle in (−π/6, π/6) were used for data augmentation.
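A possible torchvision pipeline mirroring this augmentation is sketched below; the sizes follow the text above, and the normalization statistics are placeholders to be replaced by the dataset's own mean and standard deviation:

```python
from torchvision import transforms

# Illustrative training-time preprocessing; RandomRotation(30) draws an
# angle from (-30, 30) degrees, i.e. roughly (-pi/6, pi/6) in radians.
train_transform = transforms.Compose([
    transforms.Resize((256, 342)),
    transforms.RandomCrop((224, 300)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(30),
    transforms.ToTensor(),
    # placeholder statistics; the paper normalizes by the dataset mean/std
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```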
2) Comparison of Spatial Attention Methods:
We first investigated the effectiveness of our spatial attention modules without using the channel attention and scale attention modules. We compared different variants of our proposed multi-level spatial attention: 1) using the standard single-pathway AG [9] at the positions of SA1–SA3, which is referred to as s-AG; 2) using the dual-pathway AG at the positions of SA1–SA3, which is referred to as t-AG; and 3) using the non-local block at SA4 only, which is referred to as n-Local [31].
Fig. 5. Visual comparison between different spatial attention methods for skin lesion segmentation. (a) The visualized attention weight maps of single-pathway, dual-pathway and our proposed spatial attention. (b) Segmentation results, where red arrows highlight some mis-segmentations. For better viewing of the segmentation boundary of the small target lesion, the first row of (b) shows a zoomed-in version of the region in the blue rectangle in (a).
S A and dual-pathway AG in S A − is denoted as Js-A. For the baselineU-Net, the skip connection was implemented by a simpleconcatenation of the corresponding features in the encoder andthe decoder [6]. For other compared variants that do not use S A − , their skip connections were implemented as the sameas that of U-Net. Table I shows a quantitative comparisonbetween these methods. It can be observed that all the variantsusing spatial attention lead to higher segmentation accuracythan the baseline. Also, we observe that dual-pathway spatialAG is more effective than single-pathway AG, and our jointspatial attention block outperforms the others. Compared withthe standard AG [9], our proposed spatial attention improvedthe average Dice from 88.46% to 90.83%. TABLE IQ
UANTITATIVE EVALUATION OF DIFFERENT SPATIAL ATTENTIONMETHODS FOR SKIN LESION SEGMENTATION . ( S -AG) MEANSSINGLE - PATHWAY
AG, ( T -AG) MEANS DUAL - PATHWAY
AG, ( N -L OCAL ) MEANS NON - LOCAL NETWORKS . J S -A IS OUR PROPOSED MULTI - SCALESPATIAL ATTENTION THAT COMBINES NON - LOCAL BLOCK ANDDUAL - PATHWAY
AG.
Network Para Dice(%) ASSD(pix)Baseline(U-Net [6]) 1.9M 87.77 ± ± ± ± ± ± ± ± ± ± Fig. 5(a) visualizes the spatial attention weight maps ob-tained by s-AG, t-AG and our Js-A. It can be observed thatsingle-pathway AG pays attention to almost every pixel, whichmeans it is dispersive. The dual-pathway AG is better than thanthe single-pathway AG but still not self-adaptive enough. Incomparison, our proposed Js-A pays a more close attention tothe target than the above methods.Fig. 5(b) presents some examples of qualitative segmen-tation results obtained by the compared methods. It can beseen that introducing spatial attention block in neural networklargely improves the segmentation accuracy. Furthermore, theproposed Js-A gets better result than the other spatial attentionmethods in both cases. In the second case where the lesionhas a complex shape and blurry boundary, our proposed Js-Akeeps a better result.We observed that there may exist skew between originalannotation and our cognition in ISIC 2018, as shown in Fig. 5.This is mainly because that the image contrast is often low along the true boundary, and the exact lesion boundary requiressome expertise to delineate. The ISIC 2018 dataset was an-notated by experienced dermatologists, and some annotationsmay be different from what a non-expert thinks.
3) Comparison of Channel Attention Methods:
In this comparison, we only introduced channel attention modules to verify the effectiveness of our proposed method. We first investigated the position in the network at which the channel attention modules are plugged in: 1) the encoder, 2) the decoder, and 3) both the encoder and decoder. These three variants are referred to as C-A (Enc), C-A (Dec) and C-A (Enc&Dec), respectively. We also compared the impact of using and not using max pooling in the channel attention module.
TABLE II: Quantitative comparison of different channel attention methods for skin lesion segmentation. Enc, Dec and Enc&Dec mean that the channel attention blocks are plugged into the encoder, the decoder, and both the encoder and decoder, respectively.

Network | P_max | Para | Dice(%) | ASSD(pix)
Baseline | – | 1.9M | 87.77 | –
C-A (Enc) | × | – | – | –
C-A (Enc) | √ | – | – | –
C-A (Dec) | × | – | – | –
C-A (Dec) | √ | – | 91.68 | –
C-A (Enc&Dec) | × | – | – | –
C-A (Enc&Dec) | √ | – | – | –

Table II shows the quantitative comparison of these variants, which demonstrates that the channel attention blocks indeed improve the segmentation performance. Moreover, the channel attention block with additional max-pooled information generally performs better than the one using average pooling only. Additionally, we find that the channel attention block plugged into the decoder performs better than when plugged into the encoder or into both the encoder and decoder. C-A (Dec) achieved an average Dice score of 91.68%, which outperforms the others. Fig. 6 shows a visual comparison of our proposed channel attention and its variants. The baseline U-Net has a poor performance when the background has a complex texture, and the channel attention methods improve the accuracy in these cases. Clearly, our proposed channel attention module C-A (Dec) obtains higher accuracy than the others.
4) Comparison of Scale Attention Methods:
In this comparison, we only introduced scale attention methods to verify the effectiveness of our proposed scale attention. Let L-A (1-K) denote the scale attention applied to the concatenation of feature maps from scale 1 to K, as shown in Fig. 1.
Fig. 6. Visual comparison of our proposed channel attention method with different variants. Our proposed attention block is C-A (Dec), where the channel attention module uses an additional max pooling and is plugged into the decoder. The red arrows highlight some mis-segmentations.

To investigate the effect of the number of feature map scales on the segmentation, we compared our proposed method with K = 2, 3, 4 and 5, respectively.
TABLE III: Quantitative evaluation of different scale attention methods for skin lesion segmentation. L-A (1-K) represents that the features from scale 1 to K were concatenated for scale attention.

Network | Para | Dice(%) | ASSD(pix)
Baseline | 1.9M | 87.77 | –
L-A (1-2) | – | – | –
L-A (1-3) | – | – | –
L-A (1-4) | – | 91.58 | 0.66
L-A (1-5) | 2.0M | 89.67 | –

Table III shows the quantitative comparison results. We find that combining features from multiple scales outperforms the baseline. When we concatenated features from scale 1 to 4, the Dice score and ASSD reached the best values of 91.58% and 0.66 pixels, respectively. However, when we combined features from all 5 scales, the segmentation accuracy decreased. This suggests that the feature maps at the lowest resolution level are not suitable for predicting pixel-wise labels in detail. As a result, we only fused the features from scale 1 to 4, as shown in Fig. 1, in the following experiments. Fig. 7 shows a visualization of skin lesion segmentation based on the different scale attention variants.
Fig. 7. Visual comparison of segmentation obtained by scale attention applied to the concatenation of features from different scales.
Fig. 8 presents the visualization of the pixel-wise scale attention coefficient γ · γ* at different scales, where the number under each picture denotes the scale-wise attention coefficient γ. This helps to better understand the importance of the features at different scales. The two cases show a large and a small lesion, respectively. It can be observed that the large lesion has higher global attention coefficients γ at scales 2 and 3 than the small lesion, and γ at scale 1 has a higher value for the small lesion than for the large lesion. The pixel-wise scale attention maps also show that the strongest attention is paid to scale 2 in the first row, and to scale 1 in the second row. This demonstrates that the network automatically learns to focus on the corresponding scales for the segmentation of lesions at different sizes.

Fig. 8. Visualization of scale attention on dermoscopic images. Warmer color represents higher attention coefficient values. γ means the global scale-wise attention coefficient.
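Such attention maps can be read out in a single forward pass, e.g., with a forward hook as sketched below; the submodule name passed in depends on how the network is assembled and is therefore hypothetical:

```python
import torch


def capture_attention(model, module_name, image):
    """Return the output of a named submodule (e.g., the Sigmoid that
    produces an attention coefficient map) for one input image."""
    captured = {}
    module = dict(model.named_modules())[module_name]
    handle = module.register_forward_hook(
        lambda m, inp, out: captured.setdefault('a', out.detach().cpu()))
    with torch.no_grad():
        model(image)          # one forward pass, no extra computation
    handle.remove()
    return captured['a']      # can then be colour-mapped and overlaid
```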
5) Comparison of Partial and Comprehensive Attention:
To investigate the effect of combining different attention mechanisms, we compared CA-Net with six variants using different combinations of the three basic spatial, channel and scale attentions. Here, SA means our proposed multi-scale joint spatial attention, and CA represents our channel attention used only in the decoder of the backbone.
TABLE IV: Comparison between partial and comprehensive attention methods for skin lesion segmentation. SA, CA and LA represent our proposed spatial, channel and scale attention modules, respectively.

Network | Para | Dice(%) | ASSD(pix)
Baseline | 1.9M | 87.77 | –
SA | – | – | –
CA | – | – | –
LA | – | – | –
SA + CA | – | – | –
SA + LA | – | – | –
CA + LA | – | – | –
CA-Net (ours) | – | 92.08 | 0.58

Table IV presents the quantitative comparison of our CA-Net and the partial attention methods for skin lesion segmentation. From Table IV, we find that each of SA, CA and LA obtains a performance improvement compared with the baseline U-Net. Combining two of these attention methods outperforms using a single attention. Furthermore, our proposed CA-Net outperforms all the other variants in both Dice score and ASSD, with values of 92.08% and 0.58 pixels, respectively.
6) Comparison with the State-of-the-Art Frameworks:
We compared our CA-Net with three state-of-the-art methods:
Fig. 9. Visual comparison between CA-Net and state-of-the-art networks for skin lesion segmentation. Red arrows highlight some mis-segmentations. DeepLabv3+ has similar performance to ours, but our CA-Net has fewer parameters and better explainability.
1) DenseASPP [34], which uses DenseNet-121 [35] as the backbone; 2) RefineNet [36], which uses ResNet-101 [37] as the backbone; and 3) two variants of DeepLabv3+ [20] that use Xception [38] and the Dilated Residual Network (DRN) [39] as the feature extractor, respectively. We retrained all these networks on ISIC 2018 and did not use their pre-trained models.
TABLE V: Comparison of the state-of-the-art methods and our proposed CA-Net for skin lesion segmentation. Inf-T means the inference time for a single image. E-able means the method is explainable.

Network | Para/Inf-T | E-able | Dice(%) | ASSD(pix)
Baseline (U-Net [6]) | 1.9M | × | 87.77 | –
Attention U-Net [9] | – | √ | – | –
DenseASPP [34] | – | × | – | –
DeepLabv3+ (DRN) | – | × | – | –
RefineNet [36] | – | × | – | –
DeepLabv3+ [20] | – | × | – | –
CA-Net (ours) | – | √ | 92.08 | 0.58

Quantitative comparison results of these methods are presented in Table V. It shows that all the state-of-the-art methods have good performance in terms of Dice score and ASSD. Our CA-Net obtained a Dice score of 92.08%, which is a considerable improvement compared with U-Net, whose Dice is 87.77%. Though our CA-Net has a slightly lower performance than DeepLabv3+, the difference is not significant (p-value = 0.46 > 0.05), while CA-Net has around 15 times fewer parameters.
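The paper does not state which significance test produced this p-value; a paired test over per-image Dice scores of the two methods on the same test images is one common choice, e.g.:

```python
import numpy as np
from scipy import stats

# per-image Dice scores of two methods on the same test images (toy values)
dice_canet = np.array([0.92, 0.88, 0.95, 0.90])
dice_deeplab = np.array([0.93, 0.87, 0.94, 0.91])

t_stat, p_value = stats.ttest_rel(dice_canet, dice_deeplab)
print(p_value > 0.05)  # True -> the difference is not significant at 0.05
```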
TABLE VI: Quantitative evaluation of different spatial attention methods for placenta and fetal brain segmentation. s-AG means single-pathway AG, t-AG means dual-pathway AG, n-Local means non-local networks. Js-A is our proposed multi-scale spatial attention that combines the non-local block and dual-pathway AG.

Network | Placenta Dice(%) | Placenta ASSD(pix) | Fetal Brain Dice(%) | Fetal Brain ASSD(pix)
Baseline | 84.79 | – | – | –
s-AG | – | – | – | –
t-AG | – | – | – | –
n-Local | 85.43 | – | – | –
Js-A | 85.65 | 0.58 | 95.47 | 0.30

C. Segmentation of Multiple Organs from Fetal MRI
In this experiment, we demonstrate the effectiveness of our CA-Net for multi-organ segmentation, where we aim to jointly segment the placenta and the fetal brain from fetal MRI slices. Fetal MRI has been increasingly used to study fetal development and pathology, as it provides a better soft tissue contrast than the more widely used prenatal sonography [4]. Segmentation of some important organs such as the fetal brain and the placenta is important for fetal growth assessment and motion correction [40]. Clinical fetal MRI data are often acquired with a large slice thickness for a good contrast-to-noise ratio. Moreover, movement of the fetus can lead to inhomogeneous appearances between slices. Hence, 2D segmentation is considered more suitable than direct 3D segmentation from motion-corrupted MRI slices [2].
1) Dataset:
The dataset consists of 150 stacks in three views (axial, coronal, and sagittal) of T2-weighted fetal MRI scans of 36 pregnant women in the second trimester, acquired with Single-shot Fast-Spin Echo (SSFSE) with pixel size 0.74 to 1.58 mm and inter-slice spacing 3 to 4 mm. The gestational age ranged from 22 to 29 weeks. Eight of the fetuses were diagnosed with spina bifida and the others had no fetal pathologies. All the pregnant women were above 18 years old, and the use of the data was approved by the Research Ethics Committee of the hospital.

As the stacks contained an imbalanced number of slices covering the objects, we randomly selected 10 of these slices from each stack for the experiments. Then, we randomly split the slices at patient level and assigned 1050 for training, 150 for validation, and 300 for testing. The test set contains 110 axial slices, 80 coronal slices, and 110 sagittal slices. Manual annotations of the fetal brain and placenta by an experienced radiologist were used as the ground truth. We trained a multi-class segmentation network for simultaneous segmentation of these two organs. Each slice was resized to 256×256. We randomly flipped along the x and y axes and rotated with an angle in (−π/6, π/6) for data augmentation. All the images were normalized by the mean value and standard deviation.
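A patient-level split keeps all slices of one fetus on the same side of the split, avoiding information leakage between training and test sets; a hedged scikit-learn sketch (array names and counts are illustrative, not the real data layout):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# one group id per slice; here: 36 patients with 10 slices each (toy sizes)
patient_ids = np.repeat(np.arange(36), 10)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(patient_ids, groups=patient_ids))
# all indices of a given patient fall entirely in train_idx or test_idx
```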
2) Comparison of Spatial Attention Methods:
In parallel to Section IV-B2, we compared our proposed Js-A with: (1) the single-pathway AG (s-AG) only, (2) the dual-pathway AG (t-AG) only, and (3) the non-local block (n-Local) only. Table VI presents the quantitative comparison results between these methods. From Table VI, we observe that all the variants of the spatial attention modules led to better Dice and ASSD scores.
Fig. 10. Visual comparison between different spatial attention methods for fetal MRI segmentation. (a) The visualized attention weight maps of single-pathway, dual-pathway and our proposed spatial attention. (b) Segmentation results, where red arrows and circles highlight mis-segmentations.

It can be observed that the dual-pathway AG performs more robustly than the single-pathway AG, and the Js-A module obtains the highest scores, with a Dice of 95.47% and an ASSD of 0.30 pixels for the fetal brain. Furthermore, for placenta segmentation, where the tissue boundary is fuzzy, our model still maintains an encouraging segmentation performance, with a Dice score of 85.65% and an ASSD of 0.58 pixels.

Fig. 10 presents a visual comparison of the segmentation results obtained by these methods as well as their attention weight maps. From Fig. 10(b), we find that spatial attention gives reliable performance when dealing with complex object shapes, as highlighted by the red arrows. Meanwhile, as shown by the spatial attention weight maps in Fig. 10(a), our proposed Js-A has a greater ability to focus on the target areas than the other methods, as it distributes higher and more concentrated weights on the target of interest.
3) Comparison of Channel Attention Methods:
We compared the proposed channel attention method with the same variants as listed in Section IV-B3 for fetal MRI segmentation. The comparison results are presented in Table VII. It shows that the channel attention plugged into the decoder brings noticeably fewer parameters and still maintains similar or higher accuracy than the other variants. We also compared using and not using max pooling in the channel attention block. From Table VII, we find that adding extra max-pooled information indeed increases the performance in terms of Dice and ASSD, which demonstrates the effectiveness of our proposed method.
TABLE VII: Comparison of channel attention-based networks for fetal MRI segmentation. Enc, Dec, and Enc&Dec mean that the channel attention blocks are located in the encoder, the decoder, and both the encoder and decoder, respectively.

Network | P_max | Para | Placenta Dice(%) | Placenta ASSD(pix) | Fetal Brain Dice(%) | Fetal Brain ASSD(pix)
Baseline | – | 1.9M | 84.79 | – | – | –
C-A (Enc) | × | – | – | – | – | –
C-A (Enc) | √ | – | – | – | – | –
C-A (Dec) | × | – | – | – | – | –
C-A (Dec) | √ | – | – | – | – | –
C-A (Enc&Dec) | × | – | – | – | – | –
C-A (Enc&Dec) | √ | – | – | – | – | –
4) Comparison of Scale Attention Methods:
In this comparison, we investigated the effect of concatenating different numbers of feature maps from scale 1 to K, as described in Section IV-B4, and Table VIII presents the quantitative results. Analogously, we observe that combining features from multiple scales outperforms the baseline. When we concatenate features from scale 1 to 4, we get the best results, with corresponding Dice values for the placenta and the fetal brain of 86.21% and 95.18%, respectively. When the feature maps at the lowest resolution are additionally used, i.e., L-A (1-5), the Dice scores are slightly reduced.
TABLE VIII: Comparison between different variants of scale attention-based networks. L-A (1-K) represents that the features from scale 1 to K were concatenated for scale attention.

Network | Placenta Dice(%) | Placenta ASSD(pix) | Fetal Brain Dice(%) | Fetal Brain ASSD(pix)
Baseline | 84.79 | – | – | –
L-A (1-2) | – | – | – | –
L-A (1-3) | – | – | – | –
L-A (1-4) | 86.21 | – | 95.18 | –
L-A (1-5) | – | – | – | –

Fig. 11 shows the visual comparison of our proposed scale attention and its variants. In the second row, the placenta has a complex shape with a long tail, and combining features from scale 1 to 4 obtained the best performance. Fig. 12 shows the visual comparison of scale attention weight maps on fetal MRI. From the visualized pixel-wise scale attention maps, we observe that the network pays much attention to scale 1 in the first row, where the fetal brain is small, and to scale 2 in the second row, where the fetal brain is larger.
Fig. 11. Visual comparison of the proposed scale attention method applied to the concatenation of features from different scales.
5) Comparison of Partial and Comprehensive Attention:
Similar to Section IV-B5, we compared comprehensive attention with partial attentions in the task of segmenting the fetal brain and placenta from fetal MRI. From Table IX, we find that the models combining two of the three attention mechanisms generally outperform the variants using a single attention mechanism. SA + CA obtains the highest scores among the three binary-attention methods, achieving Dice scores of 86.68% for the placenta and 95.42% for the fetal brain.
Fig. 12. Visualization of scale attention weight maps on fetal MRI. Warmer color represents higher attention coefficient values. γ means the global scale-wise attention coefficient.

Furthermore, our proposed CA-Net outperforms all these binary-attention methods, achieving Dice scores of 87.08% for the placenta and 95.88% for the fetal brain, respectively. The ASSD values of CA-Net are also lower than those of the other methods.
TABLE IX: Quantitative comparison of partial and comprehensive attention methods for fetal MRI segmentation. SA, CA and LA are our proposed spatial, channel and scale attention modules, respectively.

Network | Placenta Dice(%) | Placenta ASSD(pix) | Fetal Brain Dice(%) | Fetal Brain ASSD(pix)
Baseline | 84.79 | – | – | –
SA | – | – | – | –
CA | – | – | – | –
LA | – | – | – | –
SA + CA | 86.68 | – | 95.42 | –
SA + LA | – | – | – | –
CA + LA | – | – | – | –
CA-Net (ours) | 87.08 | – | 95.88 | –
6) Comparison of State-of-the-Art Frameworks:
We alsocompared our CA-Net with the state-of-the-art methods andtheir variants as implemented in section IV-B5. The segmen-tation performance on images in axial, sagittal and coronalviews was measured respectively. A quantitative evaluation ofthese methods for fetal MRI segmentation is listed in Table X.We observe that our proposed CA-Net obtained better Dicescores than the others in all the three views. Our CA-Netcan improve the Dice scores by . , . , and . for placenta segmentation and . , . , and . forfetal brain segmentation in three views compared with U-Net, respectively, surpassing the existing attention methodand the state-of-the-art segmentation methods. In addition, forthe average Dice and ASSD values across the three views,CA-Net outperformed the others. Meanwhile, CA-Net hasa much smaller model size compared with RefineNet [36]and Deeplabv3+ [20], which leads to lower computationalcost for training and inference. For fetal MRI segmentation,the average inference time per image for our CA-Net was1.5ms, compared with 3.4ms and 2.2ms by DeepLabv3+ andRefineNet, respectively. Qualitative results in Fig. 13 also showthat CA-Net performs noticeably better than the baseline andthe other methods for fetal MRI segmentation. In dealing withthe complex shapes as shown in the first and fifth rows ofFig. 13, as well as the blurry boundary in the second row, CA-Net performs more closely to the authentic boundary than theother methods. Note that visualization of the spatial and scale attentions as show in Fig. 10 and Fig. 12 helps to interpret thedecision of our CA-Net, but such explainability is not providedby DeepLabv3+, RefineNet and DenseASPP. Baseline (U-Net) CA-Net (ours) RefineNetDenseASPP
Fig. 13. Visual comparison of our proposed CA-Net with the state-of-the-art segmentation methods for fetal brain and placenta segmentation from MRI. Red arrows highlight the mis-segmented regions.
V. DISCUSSION AND CONCLUSION
For a medical image segmentation task, some targets such as lesions may have large variations of position, shape and scale, so enabling the network to be aware of the object's spatial position and size is important for accurate segmentation. In addition, convolutional neural networks generate feature maps with a large number of channels, and concatenations of feature maps with different semantic information or from different scales are often used. Paying attention to the most relevant channels and scales is an effective way to improve the segmentation performance. Using scale attention to adaptively make use of the features at different scales has advantages in dealing with objects with a variation of scales. To take these advantages simultaneously, we make comprehensive use of these complementary attention mechanisms, and our results show that CA-Net helps to obtain more accurate segmentation with only few parameters.

For explainable CNNs, previous works like CAM [14] and GBP [15] mainly focused on image classification tasks, and they only consider the spatial information for explaining the CNN's prediction. In addition, they are post-hoc methods that require additional computations after a forward-pass prediction to interpret the prediction results. Differently from these methods, CA-Net gives a comprehensive interpretation of how each spatial position, feature channel and scale is used for the prediction in segmentation tasks. What's more, we obtain these attention coefficients in a single forward pass and require no additional computations. By visualizing the attention maps in different aspects, as shown in Fig. 5 and Fig. 8, we can better understand how the network works, which has a potential to help us improve the design of CNNs.

We have done experiments on two different image domains, i.e., RGB images and fetal MRI. These are two representative image domains, and in both cases our CA-Net achieves a considerable segmentation improvement compared with U-Net, which shows that CA-Net has competitive performance for different segmentation tasks in different modalities.
TABLE X: Quantitative evaluation of the state-of-the-art methods and our proposed CA-Net for fetal MRI segmentation in three views (axial, coronal and sagittal). Inf-T means the inference time on the whole test dataset. E-able means the method is explainable.

Placenta:
Network | Para/Inf-T | E-able | Axial Dice(%)/ASSD(pix) | Coronal Dice(%)/ASSD(pix) | Sagittal Dice(%)/ASSD(pix) | Whole Dice(%)/ASSD(pix)
Baseline (U-Net [6]) | 1.9M/0.9ms | × | – | – | – | 84.79/–
Attention U-Net [9] | – | √ | – | – | – | –
DenseASPP [34] | – | × | – | – | – | –
DeepLabv3+ (DRN) | – | × | – | – | – | –
RefineNet [36] | – | × | – | – | – | –
DeepLabv3+ [20] | – | × | – | – | – | –
CA-Net (ours) | –/1.5ms | √ | – | – | – | 87.08/–

Fetal brain:
Network | Para/Inf-T | E-able | Axial Dice(%)/ASSD(pix) | Coronal Dice(%)/ASSD(pix) | Sagittal Dice(%)/ASSD(pix) | Whole Dice(%)/ASSD(pix)
Baseline (U-Net [6]) | 1.9M/0.9ms | × | – | – | – | –
Attention U-Net [9] | – | √ | – | – | – | –
DenseASPP [34] | – | × | – | – | – | –
DeepLabv3+ (DRN) | – | × | – | – | – | –
RefineNet [36] | – | × | – | – | – | –
DeepLabv3+ [20] | – | × | – | – | – | –
CA-Net (ours) | –/1.5ms | √ | – | – | – | 95.88/–

It is of interest to apply our CA-Net to other image modalities, such as ultrasound, and other anatomies in the future.

In this work, we have investigated three main types of attentions associated with segmentation targets in various positions and scales. Recently, some other types of attentions have also been proposed in the literature, such as attention to parallel convolution kernels [41]. However, using multiple parallel convolution kernels will increase the model complexity.

Most of the attention blocks in our CA-Net are in the decoder. This is mainly because the encoder acts as a feature extractor that is exploited to obtain enough candidate features. Applying attention in the encoder may lead to some potentially useful features being suppressed at an early stage. Therefore, we use the attention blocks in the decoder to highlight the relevant features among all the candidate features. Specifically, following [9], the spatial attention is designed to use high-level semantic features in the decoder to calibrate low-level features in the encoder, so it is used at the skip connections after the encoder. The scale attention is designed to better fuse the raw semantic predictions obtained in the decoder, and should naturally be placed at the end of the network. For the channel attentions, we tried placing them at different positions of the network, and found that placing them in the decoder is better than in the encoder. As shown in Table II, all the channel attention variants outperformed the baseline U-Net. However, using channel attention only in the decoder outperformed the variants with channel attention in the encoder. The reason may be that the encoding phase needs to maintain enough feature information, which confirms our assumption that suppressing some features at an early stage will limit the model's performance. Some other attentions [41] might nevertheless be useful in the encoder, which will be investigated in the future.

Differently from previous works that mainly focus on improving the segmentation accuracy while being hard to explain, we aim to design a network with good comprehensive properties, i.e., high segmentation accuracy, efficiency and explainability at the same time. Indeed, the segmentation accuracy of our CA-Net is competitive: it leads to a significant improvement of Dice compared with U-Net (92.08% vs. 87.77%) for the skin lesion. Compared with the state-of-the-art DeepLabv3+ and RefineNet, our CA-Net achieved very close segmentation accuracy with around 15 times fewer parameters. What's more, CA-Net is easy to explain, as shown in Figs. 5, 8, 10, and 12, but DeepLabv3+ and RefineNet have a poor explainability in terms of how they localize the target region, recognize the scale and determine the useful features.
Meanwhile, in fetal MRI segmentation, the experimental results in Table X show that our CA-Net has a considerable improvement compared with U-Net (e.g., 87.08% vs. 84.79% in Dice for the placenta), and it outperforms the state-of-the-art methods in all three views. Therefore, the superiority of our CA-Net is that it achieves higher explainability and efficiency than state-of-the-art methods while maintaining comparable or even better accuracy.

In the skin lesion segmentation task, we observe that our CA-Net leads to a slightly inferior performance compared with DeepLabv3+, which is however without significant difference. We believe the reason is that DeepLabv3+ is mainly designed for natural image segmentation, and dermoscopic skin images are color images with an intensity distribution similar to that of natural images. However, compared with DeepLabv3+, our CA-Net achieves comparable performance with higher explainability and 15 times fewer parameters, leading to higher computational efficiency. In the fetal MRI segmentation task, our CA-Net has distinctly higher accuracy than those state-of-the-art methods, which shows the effectiveness and good explainability of our method.

In conclusion, we propose a comprehensive attention-based convolutional neural network (CA-Net) that learns to make extensive use of multiple attentions for better performance and explainability of medical image segmentation. We enable the network to adaptively pay attention to spatial positions, feature channels and object scales at the same time. Motivated by existing spatial and channel attention methods, we make further improvements to enhance the network's ability to focus on areas of interest. We propose a novel scale attention module implicitly emphasizing the most salient scales to obtain multi-scale features. Experimental results show that, compared with state-of-the-art semantic segmentation models like DeepLabv3+, our CA-Net obtains comparable and even higher accuracy for medical image segmentation with a much smaller model size. Most importantly, CA-Net gains a good model explainability, which is important for understanding how the network works, and has a potential to improve clinicians' acceptance of and trust in predictions given by an artificial intelligence algorithm. Our proposed multiple attention modules can be easily plugged into most semantic segmentation networks. In the future, the method can be easily extended to segment 3D images.

REFERENCES

[1] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Med. Image Anal., vol. 42, pp. 60–88, 2017.
REFERENCES

[1] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Med. Image Anal., vol. 42, pp. 60–88, 2017.
[2] G. Wang, W. Li, M. A. Zuluaga, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin et al., "Interactive medical image segmentation using deep learning with image-specific fine tuning," IEEE Trans. Med. Imag., vol. 37, no. 7, pp. 1562–1573, 2018.
[3] N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti et al., "Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC)," arXiv preprint arXiv:1902.03368, 2019.
[4] G. Wang, M. A. Zuluaga, W. Li, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin et al., "DeepIGeoS: A deep interactive geodesic framework for medical image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1559–1572, 2018.
[5] S. S. M. Salehi, S. R. Hashemi, C. Velasco-Annis, A. Ouaalam, J. A. Estroff, D. Erdogmus, S. K. Warfield, and A. Gholipour, "Real-time automatic fetal brain extraction in fetal MRI by deep learning," in Proc. ISBI, 2018, pp. 720–724.
[6] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. MICCAI, Oct 2015, pp. 234–241.
[7] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. CVPR, 2018, pp. 7132–7141.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2017.
[9] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., "Attention U-Net: Learning where to look for the pancreas," in Proc. MIDL, Jul 2018.
[10] Y. Qin, K. Kamnitsas, S. Ancha, J. Nanavati, G. Cottrell, A. Criminisi, and A. Nori, "Autofocus layer for semantic segmentation," in Proc. MICCAI, Sep 2018, pp. 603–611.
[11] Y. Wang, Z. Deng, X. Hu, L. Zhu, X. Yang, X. Xu, P.-A. Heng, and D. Ni, "Deep attentional features for prostate segmentation in ultrasound," in Proc. MICCAI, Sep 2018, pp. 523–530.
[12] W. Samek, T. Wiegand, and K.-R. Müller, "Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models," arXiv preprint arXiv:1708.08296, 2017.
[13] R. Chen, H. Chen, J. Ren, G. Huang, and Q. Zhang, "Explaining neural networks semantically and quantitatively," in Proc. ICCV, 2019, pp. 9187–9196.
[14] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. CVPR, 2016, pp. 2921–2929.
[15] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[16] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, "Attention gated networks: Learning to leverage salient regions in medical images," Med. Image Anal., vol. 53, pp. 197–207, 2019.
[17] A. G. Roy, N. Navab, and C. Wachinger, "Concurrent spatial and channel 'squeeze & excitation' in fully convolutional networks," in Proc. MICCAI, Sep 2018, pp. 421–429.
[18] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. CVPR, 2015, pp. 3431–3440.
[19] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, 2017.
[20] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proc. ECCV, 2018, pp. 801–818.
[21] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3D U-Net: Learning dense volumetric segmentation from sparse annotation," in Proc. MICCAI, 2016, pp. 424–432.
[22] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in Proc. 3DV, 2016, pp. 565–571.
[23] M. M. K. Sarker, H. A. Rashwan, F. Akram, S. F. Banu, A. Saleh, V. K. Singh, F. U. Chowdhury, S. Abdulwahab, S. Romani, P. Radeva et al., "SLSDeep: Skin lesion segmentation based on dilated residual and pyramid pooling networks," in Proc. MICCAI, Sep 2018, pp. 21–29.
[24] W. Li, G. Wang, L. Fidon, S. Ourselin, M. J. Cardoso, and T. Vercauteren, "On the compactness, efficiency, and representation of 3D convolutional networks: Brain parcellation as a pretext task," in Proc. IPMI, 2017, pp. 348–360.
[25] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, "Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation," Med. Image Anal., vol. 36, pp. 61–78, 2017.
[26] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu, "Tell me where to look: Guided attention inference network," in Proc. CVPR, 2018, pp. 9215–9223.
[27] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in Proc. CVPR, 2017, pp. 3156–3164.
[28] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," in Proc. CVPR, 2019, pp. 3146–3154.
[29] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proc. CVPR, 2017, pp. 375–383.
[30] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, "Attention to scale: Scale-aware semantic image segmentation," in Proc. CVPR, 2016, pp. 3640–3649.
[31] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. CVPR, 2018, pp. 7794–7803.
[32] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in Proc. ECCV, 2018, pp. 3–19.
[33] L. Bi, J. Kim, E. Ahn, A. Kumar, M. Fulham, and D. Feng, "Dermoscopic image segmentation via multistage fully convolutional networks," IEEE Trans. Biomed. Eng., vol. 64, no. 9, pp. 2065–2074, 2017.
[34] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, "DenseASPP for semantic segmentation in street scenes," in Proc. CVPR, 2018, pp. 3684–3692.
[35] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. CVPR, 2017, pp. 4700–4708.
[36] G. Lin, A. Milan, C. Shen, and I. Reid, "RefineNet: Multi-path refinement networks for high-resolution semantic segmentation," in Proc. CVPR, 2017, pp. 1925–1934.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770–778.
[38] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. CVPR, 2017, pp. 1251–1258.
[39] F. Yu, V. Koltun, and T. Funkhouser, "Dilated residual networks," in Proc. CVPR, 2017, pp. 472–480.
[40] J. Torrents-Barrena, G. Piella, N. Masoller, E. Gratacós, E. Eixarch, M. Ceresa, and M. Á. G. Ballester, "Segmentation and classification in MRI and US fetal imaging: Recent trends and future prospects," Med. Image Anal., vol. 51, pp. 61–88, 2019.
[41] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, "Dynamic convolution: Attention over convolution kernels," in