D2A U-Net: Automatic Segmentation of COVID-19 Lesions from CT Slices with Dilated Convolution and Dual Attention Mechanism

Xiangyu Zhao a, Peng Zhang a, Fan Song a, Guangda Fan a, Yangyang Sun a, Yujia Wang a, Zheyuan Tian a, Luqi Zhang a and Guanglei Zhang a,b,∗

a School of Biological Science and Medical Engineering, Beihang University, Beijing, China
b Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Beijing, China
ARTICLE INFO

Keywords: Attention mechanism, COVID-19, Deep learning, Dilated convolution, Segmentation

ABSTRACT
Background and Objective: Coronavirus Disease 2019 (COVID-19) has caused great casualties and has become one of the most urgent public health events worldwide. Computed tomography (CT) is a significant screening tool for COVID-19 infection, and automated segmentation of lung infection in COVID-19 CT images would greatly assist the diagnosis and health care of patients. However, accurate and automatic segmentation of COVID-19 lung infections remains challenging. In this paper we propose a dilated dual attention U-Net (D2A U-Net) for COVID-19 lesion segmentation in CT slices, based on dilated convolution and a novel dual attention mechanism, to address the issues above.
Methods: We introduce a dilated convolution module in the model decoder to achieve a large receptive field, which refines the decoding process and contributes to segmentation accuracy. We also present a dual attention mechanism composed of two attention modules, inserted into the skip connections and the model decoder respectively. The dual attention mechanism is utilized to refine feature maps and reduce the semantic gap between different levels of the model.
Results: The proposed method has been evaluated on an open-source dataset and outperforms cutting-edge methods in semantic segmentation. Our proposed D2A U-Net with a pretrained encoder achieves a Dice score of 0.7298 and a recall score of 0.7071. We also build a simplified D2A U-Net without a pretrained encoder to provide a fair comparison with other models trained from scratch; it still outperforms popular U-Net family models, with a Dice score of 0.7047 and a recall score of 0.6626.
Conclusion: Our experimental results show that by introducing dilated convolution and the dual attention mechanism, the number of false negatives is significantly reduced, which improves sensitivity to COVID-19 lesions and subsequently brings a significant increase in Dice score.

Significance: Our proposed method improves segmentation performance on COVID-19 lesions in CT slices, and can be regarded as a potential AI-based approach for the diagnosis and prognosis of COVID-19 patients.
1. Introduction
The COVID-19 pandemic caused by SARS-CoV-2 continues to spread all over the world [24], and most countries have been affected in this unprecedented public health event. By August 2020, more than 23.75 million cases of COVID-19 had been reported and more than 810,000 people had died of the infection [2]. Due to the strong infectivity of SARS-CoV-2, identification of people infected with COVID-19 is significant to cut off transmission and slow down the spread of the virus. Reverse transcriptase-polymerase chain reaction (RT-PCR) is considered the gold standard of diagnosis [29] for its high specificity, but it is time-consuming and laborious. Also, the capacity for RT-PCR testing can be rather insufficient in less-developed regions, especially during the pandemic. Computed tomography (CT) imaging is one of the most commonly used screening methods to detect lung infection and has proved efficient in the diagnosis and follow-up prognosis of COVID-19. Compared with chest X-ray images, CT imaging is more sensitive, especially in the early stage of infection. The ground glass pattern is the most common finding in COVID-19 infections, usually in the early stage, while pulmonary consolidation can be observed in the later stage. Pleural effusion can also be observed in pathological CT slices. These typical features of COVID-19 lung infection are shown in Fig. 1.

∗ Corresponding author: [email protected] (G. Zhang)
Figure 1: Example of COVID-19 CT slices, where the red, green and blue masks denote ground glass, consolidation and pleural effusion respectively. The images are collected from [1].
Thus, chest CT imaging is regarded as a convenient, fast and accurate approach to diagnosing COVID-19. Evaluation of the localization and geometric features of the infection area could provide adequate information about disease progress and
First Author et al.:
Preprint submitted to Elsevier
help doctors make better treatment decisions [10][16][19]. However, manual annotation of infection regions is time-consuming and laborious work. Also, annotations made by radiologists can be subjective and biased due to individual experience and personal judgement. Recently, a number of deep learning systems using convolutional neural networks (CNNs) have been proposed to detect COVID-19 infection. For instance, Wang and Wong [27] developed COVID-Net to perform ternary classification between healthy people, COVID-19 patients and people infected with other pneumonia in chest X-ray images, which achieves an overall accuracy of 93.3%. In terms of deep learning systems for CT imaging, Zhou and Canu [39] proposed an automatic network facilitated with an attention mechanism to segment infection areas in CT slices. Fan et al. [6] developed Inf-Net and a corresponding semi-supervision algorithm to perform CT segmentation. Zheng et al. [37] proposed a weakly-supervised deep learning method to detect COVID-19 in CT volumes. Xi et al. [18] presented a dual-sampling attention network to distinguish COVID-19 from community-acquired pneumonia. However, detecting lung infection caused by COVID-19 in CT images remains challenging: infection regions vary in shape, position and texture, and their boundaries with normal tissues can be rather blurred, which adds to the difficulty of COVID-19 detection and limits model performance, especially in terms of recall score. To address the issues above, we propose a dilated dual attention U-Net (D2A U-Net) framework to automatically segment lung infection in COVID-19 CT slices. Since infected tissues can be hardly distinguishable from normal tissues, we introduce a dual attention mechanism consisting of a gate attention module (GAM) and a decoder attention module (DAM) to refine feature maps and produce more informative feature representations.
The proposed GAM refines skip connections by fusing features with semantic-rich gate signals. The proposed DAM is introduced into the decoder of the network to improve decoding quality and better segment blurred infected tissues. As COVID-19 infection varies in position and size, we utilize dilated convolution with different dilation rates in the model decoder to obtain larger receptive fields and balance the segmentation of both large and tiny objects. Such refinement improves the segmentation recall score and thus provides better segmentation results.

The paper is organized as follows: Section 2 offers a review of related works on CT segmentation. Section 3 gives an overview of this work and details our model. Section 4 presents the details of our experiments and provides both quantitative and qualitative segmentation results. Section 5 discusses the proposed method and concludes our work.
2. Related Works
In this section, we will go through four types of the most related works: chest CT segmentation, attention mechanism, dilated convolution and AI-based COVID-19 segmentation systems.
Chest CT imaging is one of the most popular screening methods for lung disease diagnosis [9]. Segmentation of organs and lesions provides crucial information for disease diagnosis and prognosis. However, manual segmentation remains time-consuming and laborious, and subjective error is inevitable; thus automatic CT segmentation has gained much popularity in the research field. Recent research on automatic segmentation mainly focuses on machine learning techniques. Related works mostly feature a pixel-wise classifier that infers from extracted features to make predictions. For example, Mansoor et al. [13] proposed a texture-based feature classifier for pathological lung segmentation in CT images. Yao et al. [34] utilized texture analysis and support vector machines to segment infections in lung tissues. These algorithms have realized automatic segmentation in chest CT images, but several issues remain unsolved, including subjective bias in feature extraction and difficulties in segmenting nodule regions. Deep learning algorithms feature powerful fitting capacity and require no laborious preprocessing, and most cutting-edge segmentation algorithms are based on deep learning approaches. For example, Shaziya et al. [23] used U-Net to segment lung tissues in chest CT scans. Zhao et al. [36] proposed a fully convolutional neural network with multi-instance and conditional adversary loss for pathological lung segmentation.
Attention plays an important role in human perception and visual cognition [5]. One significant property of human perception is that humans hardly process visual information as a whole; instead, they usually process visual information recurrently, where top-down information is utilized to guide the bottom-up feedforward process [15]. Inspired by this principle, attention mechanisms have been widely used in computer vision, especially in image classification [7][31][25]. Related algorithms typically refine feature maps in the spatial dimension, the channel dimension or both. For example, Hu et al. [7] introduced a Squeeze-and-Excitation module, where global average pooling is performed on input features to produce channel-wise attention. Woo et al. [31] proposed a convolutional block attention module (CBAM) to introduce a fused attention consisting of channel attention and spatial attention. Wang et al. [25] presented a residual attention network, which contains an attention module featuring an encoder-decoder architecture. Attention mechanisms have also been utilized in semantic segmentation tasks to make more accurate dense predictions. For instance, Li et al. [11] proposed a Pyramid Attention Network to exploit the impact of global contextual information in semantic segmentation. These typical algorithms resemble each other in some respects: certain operations, such as global pooling, convolution and the combination of downsampling and upsampling, are utilized to enhance informative regions in the feature maps and suppress unrelated information, which makes the network learn
more generalized visual structures and improves robustness to noisy inputs.

Figure 2: The proposed D2A U-Net architecture with a ResNeXt-50 (32×4d) encoder.

Traditional deep convolutional networks often involve strided convolution or pooling operations to improve receptive fields, and input images are downsampled in this process. However, these operations often lead to the loss of global information in dense predictions such as semantic segmentation and object detection. Yu and Koltun [35] introduced dilated convolution to deep networks, which has proved useful in dense predictions. The basic idea of dilated convolution is to insert "holes" (zeros) into convolution kernels to obtain large receptive fields without downsampling. Dilated convolution avoids information loss during downsampling and has been widely used in semantic segmentation tasks [30][14][20]. However, it has been observed that simply stacking dilated convolutions in CNNs may cause grid effects and introduce irrelevant long-ranged information [35], leading to performance deterioration. Wang et al. [28] proposed a hybrid dilated convolution (HDC) framework to avoid grid effects and improve segmentation performance on both large and tiny objects.
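As a concrete illustration of this idea, the following PyTorch sketch (our own, not from the paper's code) shows that a 3 × 3 convolution keeps full spatial resolution at any dilation rate, so the receptive field grows without any downsampling:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation d has an equivalent kernel size of
# 3 + 2*(d-1); padding by d keeps the spatial resolution unchanged,
# so the receptive field grows without any downsampling.
x = torch.randn(1, 16, 64, 64)
for d in (1, 2, 5):
    conv = nn.Conv2d(16, 16, kernel_size=3, dilation=d, padding=d)
    y = conv(x)
    assert y.shape == x.shape  # resolution preserved at every rate
```

The dilation rates (1, 2, 5) used here mirror the hybrid dilated convolution design adopted later in the paper's decoder.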
Artificial intelligence has been widely utilized in the fight against COVID-19. We mainly focus on AI-based semantic segmentation systems for CT scans. Many works focus on learning robust and noise-insensitive representations from limited or noisy inputs. For example, Xie et al. [33] proposed RTSU-Net for segmenting pulmonary lobes in CT scans, introducing a non-local neural network module to learn both visual and geometric relationships among feature maps and produce self-attention. Wang et al. [26] presented a noise-robust framework for COVID-19 lesion segmentation, utilizing a noise-robust Dice loss and an adaptive self-ensembling strategy to learn from noisy labels. Chen et al. [4] proposed a residual attention U-Net which introduced aggregated residual transformations and a soft attention mechanism to learn robust feature representations. Researchers have also looked into segmentation solutions that achieve both high speed and high accuracy. For example, Zhou et al. [38] developed a rapid, accurate and machine-agnostic segmentation and quantification method for COVID-19 lesions; the innovation of their work lies in the first CT scan simulator for COVID-19 and a novel network architecture which solves the large-scene-small-object problem. Qiu et al. [21] developed a parameter-efficient framework to achieve fast segmentation of COVID-19 lung infection with relatively low computational cost.
3. Methods
In this section we will go through the details of the proposed D2A U-Net architecture. First, we offer an overview of the proposed network. We then provide details about the dual attention mechanism and the proposed attention modules. Finally, we introduce our proposed decoder blocks.
Basically, our proposed network is based on the U-Net [22] architecture, which is quite popular in medical image segmentation. Compared with the original U-Net, dilated convolution and a novel combination of attention mechanisms are integrated into our framework to obtain better feature representations. As the COVID-19 pandemic broke out rapidly, open-access CT image data with gold-standard annotations are hard to acquire, so utilizing a pretrained encoder in the segmentation model can offer better parameter initialization and improve generalization ability. Therefore, in this work, we utilize a ResNeXt-50 (32×4d) encoder pretrained on ImageNet-1K.
Figure 3: The proposed gate attention module, which takes a guiding signal and features as input to generate fused attention. The number shown in parentheses inside a conv block is the number of output channels. See Section 3.2.1 for details.
We introduce a dual attention mechanism composed of a gate attention module (GAM) and a decoder attention module (DAM) to our network. GAM is utilized to refine features extracted by the model encoder and reduce the semantic gap by fusing high- and low-level feature maps. DAM is inserted into the model decoder to refine feature representations after upsampling.
Feature concatenation from encoder to decoder is the typical topological structure in U-Net, where the combination of high-resolution features in the encoder and upsampled features in the decoder enables better localization of segmentation targets [22]. However, not all visual representations in encoder feature maps contribute to precise segmentation, and the semantic gap between encoder and decoder can limit model performance as well. Thus, we introduce a gate attention module before concatenation to refine features coming from the model encoder and reduce the semantic gap. Oktay et al. [17] proposed an attention gate to refine encoder features with an attention mechanism, but their attention gate implements only spatial attention. We believe introducing channel attention and spatial attention simultaneously improves the efficiency of the attention mechanism. Thus, inspired by the global attention upsample module proposed in the pyramid attention network [11] and by CBAM [31], we provide a novel design of a gate attention module that enables both channel and spatial attention. The detailed scheme of the proposed GAM is shown in Fig. 3. Two feature maps are fed into the attention module: the guiding signal refers to the feature map coming from the model decoder (or the last convolution block in the model encoder), and the feature refers to the feature maps coming from the model encoder to be concatenated with the upsampled feature maps. We use G ∈ ℝ^{C_g×H_g×W_g} to denote the guiding signal and F ∈ ℝ^{C_f×H_f×W_f} to denote the features.

In a U-Net shaped architecture, compared with F, G contains deeper and semantically richer information, which is encoded in the channel dimension. We utilize global average pooling and a multilayer perceptron (MLP) to create a channel attention map Z_c(F) ∈ ℝ^{C_f×1×1}. The output size of the MLP is smaller than the input size; thus we suppress irrelevant feature representations in the channel dimension and implement a channel-wise attention mechanism. In short, we compute channel attention as follows:

Z_c(F) = σ(MLP(P_avg(G))) = σ(W_{C_f}(ReLU(W_{C_g/r}(P_avg(G)))))   (1)

where σ denotes sigmoid activation, P_avg denotes global average pooling, W_{C_g/r} ∈ ℝ^{C_g/r×C_g} and W_{C_f} ∈ ℝ^{C_f×C_g/r}, and r denotes the reduction ratio, set to 16 in our experiments.

Spatial attention is guided by both the guiding signal and the input feature itself. We use a convolution with one filter to squeeze the channel dimension of G and F. The reduced feature map from G is then upsampled to match the size of F. A combination of convolutions with different kernel sizes is utilized to produce the spatial attention Z_s(F) ∈ ℝ^{H_f×W_f}. In short, we compute spatial attention as:

Z_s(F) = σ(f_{k1}([F_r, G_r]) + f_{k2}([F_r, G_r]) + f_{k3}([F_r, G_r])),
where F_r = f_r(F), G_r = upsample(f_r(G))   (2)

where σ denotes sigmoid activation, f_{k1}, f_{k2} and f_{k3} denote convolutions with the corresponding kernel sizes (see Fig. 3), and f_r is used to squeeze the channel dimension.

Then we use element-wise multiplication to combine spatial and channel attention and produce the fused attention Z(F):

Z(F) = F ∗ Z_s(F) ∗ Z_c(F)   (3)

Figure 4: The proposed residual attention block (left) and decoder attention module (right). RAB integrates a hybrid dilated convolution module and a DAM; n in the parentheses refers to the dilation rate. DAM is utilized to refine post-upsampling features; the number shown in parentheses inside a conv block is the number of output channels. See Section 3.3 for details about RAB and Section 3.2.2 for details about DAM.

In semantic segmentation, high-resolution visual representations in the encoder need to be upsampled to make dense predictions. Transposed convolution and interpolation are both popular solutions for image upsampling, but both have drawbacks. Compared with interpolation, transposed convolution is trainable and offers more nonlinearity to deep networks, which improves model fitting capacity; however, grid effects are hard to avoid if hyperparameters are not configured properly, and this drawback becomes more troublesome when stacking more than one transposed convolution layer. Thus we use a combination of bilinear interpolation and a following convolution to upsample feature maps. However, as interpolation is not trainable, it inevitably introduces irrelevant information or noise into upsampling.

We introduce a decoder attention module to solve this issue. A fused attention mechanism is utilized to refine post-upsampling feature maps in both channel and spatial dimensions. The scheme is shown in Fig. 4. Compared with GAM, DAM is more simplified and takes only one input, but the implementation of both channel and spatial attention is quite similar. We use Z_c(F) ∈ ℝ^{C×1×1} to denote channel attention, Z_s(F) ∈ ℝ^{H×W} to denote spatial attention and Z(F) to denote fused attention. In short, DAM is computed as follows:

Z_c(F) = σ(MLP(P_avg(F))) = σ(W_C(ReLU(W_{C/r}(P_avg(F)))))   (4)

where σ denotes sigmoid activation, P_avg denotes global average pooling, W_{C/r} ∈ ℝ^{C/r×C} and W_C ∈ ℝ^{C×C/r}, and r denotes the reduction ratio, set to 16 in our experiments.

Z_s(F) = σ(f_{k1}(f_r(F)) + f_{k2}(f_r(F)) + f_{k3}(f_r(F)))   (5)

where σ denotes sigmoid activation, and f_{k1}, f_{k2} and f_{k3} denote convolutions with the corresponding kernel sizes.
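The fused channel-and-spatial attention of Eqs. (4)–(6) can be sketched in PyTorch as follows. This is a minimal illustration of the DAM only; the spatial-branch kernel sizes (3, 5, 7) are our assumption for illustration, since the exact values are specified in Fig. 4 rather than in the text:

```python
import torch
import torch.nn as nn

class DecoderAttention(nn.Module):
    """Sketch of the decoder attention module: channel attention via
    global pooling + MLP (Eq. 4), spatial attention via a sum of
    convolutions on a channel-squeezed map (Eq. 5), fused by
    element-wise multiplication (Eq. 6). Kernel sizes (3, 5, 7) are
    an assumption, not taken from the paper's text."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1)  # f_r
        self.spatial = nn.ModuleList(
            [nn.Conv2d(1, 1, kernel_size=k, padding=k // 2) for k in (3, 5, 7)]
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        # Eq. (4): channel attention Z_c(F)
        z_c = torch.sigmoid(self.mlp(f.mean(dim=(2, 3)))).view(b, c, 1, 1)
        # Eq. (5): spatial attention Z_s(F)
        r = self.squeeze(f)
        z_s = torch.sigmoid(sum(conv(r) for conv in self.spatial))
        # Eq. (6): fused attention Z(F)
        return f * z_s * z_c

f = torch.randn(2, 32, 28, 28)
assert DecoderAttention(32)(f).shape == f.shape
```

The GAM follows the same pattern but takes a second input (the guiding signal G) for both branches, as described above.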
Table 1: Dataset description.

Num | Dataset | Description | Split
1 | COVID-19 CT segmentation dataset [1] | 110 slices, 100 of which contain annotations | Test set
2 | Segmentation dataset nr. 2 [1] | 9 CT volumes; 373 of the 829 slices were evaluated by a radiologist as positive and segmented | Training set
3 | COVID-19 CT Lung and Infection Segmentation Dataset [8] | 20 CT volumes; left lung, right lung and infections labeled by two radiologists and verified by an experienced radiologist; 1,844 of the 3,520 slices contain infection regions | Training set

f_r is used to squeeze the channel dimension. The fused attention is again computed by element-wise multiplication:

Z(F) = F ∗ Z_s(F) ∗ Z_c(F)   (6)

Standard convolution hardly reaches a large receptive field due to its kernel size, and this drawback in the traditional design of U-Net based decoders can limit segmentation performance. Inspired by the design of hybrid dilated convolution [28], we propose a residual attention block (RAB) as the basic module of our model decoder. Unlike similar works that use dilated convolution in the encoder, we explore its use in the decoder to capture multiscale patterns of upsampled feature maps. Hybrid dilated convolution is utilized in our RAB to acquire large receptive fields and avoid grid effects. The stem of the RAB is a stack of dilated convolutions with kernel size 3 and dilation rates [1, 2, 5], followed by a decoder attention module. The scheme is shown in Fig. 4.

We assume an initial receptive field of 1 × 1. The equivalent kernel size of a dilated convolution is computed as follows:

K = k + (k − 1)(n − 1)   (7)

where K denotes the equivalent kernel size, k the actual kernel size and n the dilation rate. Thus, the equivalent kernel sizes of dilated convolutions with kernel size 3 and dilation rates [1, 2, 5] are 3, 5 and 11, respectively. According to the definition of the receptive field, such a design of stacked dilated convolutions reaches a receptive field of 17 × 17, which enables the capture of global information. Also, dilated convolutions with different dilation rates capture multiscale information in feature maps, which contributes to accurate segmentation of both large and small objects. With X denoting the block input and Y its output, the RAB is computed with a residual connection:

Y = X + DAM(HDC(X))   (8)

where DAM denotes the decoder attention module and HDC denotes hybrid dilated convolution.
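Eq. (7) and the resulting receptive field of the RAB stem can be checked with a few lines of Python:

```python
def equivalent_kernel(k: int, n: int) -> int:
    """Eq. (7): equivalent kernel size of a dilated convolution with
    actual kernel size k and dilation rate n."""
    return k + (k - 1) * (n - 1)

def stacked_receptive_field(kernels) -> int:
    """Receptive field of stacked stride-1 convolutions, starting from 1x1."""
    rf = 1
    for K in kernels:
        rf += K - 1
    return rf

ks = [equivalent_kernel(3, n) for n in (1, 2, 5)]
print(ks)                           # [3, 5, 11]
print(stacked_receptive_field(ks))  # 17
```

This confirms the 17 × 17 receptive field stated above for kernel size 3 and dilation rates [1, 2, 5].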
4. Experiments
CT slices used in our experiments come from three datasets [1][8]; details are shown in Table 1. Dataset 1 contains 100 axial CT slices from more than 40 patients, which have been rescaled to 512 × 512 pixels and grayscaled. All slices are segmented by a radiologist using three labels: ground-glass opacity, consolidation and pleural effusion. Dataset 2 contains 9 axial CT volumes, where 373 of the 829 slices have been evaluated by a radiologist as positive and segmented using two labels: ground-glass opacity and consolidation. Dataset 3 contains 20 axial CT volumes, which have been segmented by two radiologists and verified by an experienced radiologist.

Datasets 2 and 3 contain 29 CT volumes in total, but not all slices contain infection regions. We discard all slices containing no COVID-19 infection and use only slices with annotations. As the annotations in Dataset 3 do not distinguish ground-glass opacity from consolidation, we take both classes in Dataset 2 as COVID-19 lesions and do not distinguish them either, thus creating a binary segmentation dataset. Intensity normalization has been applied to both datasets and all slices have been rescaled to 512 × 512 pixels to match Dataset 1. We likewise take all ground-glass, consolidation and pleural effusion annotations in Dataset 1 as COVID-19 lesions, just as we did for Dataset 2.

We did not combine processed Datasets 1 to 3 and split them randomly, because in that way slices of one subject could exist in both the training and test sets, which would constitute data leakage and cause virtually high model performance. Instead, we obtain 1,645 processed slices from processed Datasets 2 and 3 in total and use them as our final training set, and we use the 100 axial slices from Dataset 1 as our final test set. Such a data split best evaluates model generalization capacity.
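The subject-wise split described above can be sketched as follows; the function and variable names are illustrative, not from the authors' code:

```python
import random

def subject_wise_split(slices, test_ratio=0.2, seed=0):
    """Split CT slices by volume ID so that no subject appears in both
    the training and test sets (avoiding the data leakage described
    above). `slices` is a list of (volume_id, slice_data) pairs."""
    volumes = sorted({vid for vid, _ in slices})
    rng = random.Random(seed)
    rng.shuffle(volumes)
    n_test = max(1, int(len(volumes) * test_ratio))
    test_ids = set(volumes[:n_test])
    train = [s for s in slices if s[0] not in test_ids]
    test = [s for s in slices if s[0] in test_ids]
    return train, test

# 10 volumes with 5 slices each; the two sets share no volume IDs.
slices = [(vid, None) for vid in range(10) for _ in range(5)]
train, test = subject_wise_split(slices)
assert {v for v, _ in train}.isdisjoint({v for v, _ in test})
```

In the paper the split is even stricter: entire datasets are assigned to training or test, so the test set additionally comes from a different annotation source.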
The model encoder is a ResNeXt-50 (32×4d) pretrained on ImageNet-1K. We use a combination of Dice loss L_d and binary cross-entropy loss L_c as our final loss function:

L_seg = L_d + α L_c   (9)

where α = 1 in our experiments.

Our model is implemented in PyTorch on an Ubuntu 16.04 server, and we use an NVIDIA RTX 2080 Ti GPU to accelerate training. Data augmentation is utilized during training to reduce overfitting and improve generalization capacity: input images are first rescaled to 560 × 560 and then cropped to 448 × 448 before being fed into the network. The model is optimized by an Adam optimizer with β₁ = 0.9, β₂ = 0.999 and ε = 1e−8. L2 regularization is utilized to reduce overfitting as well, with weight decay set to 1e-4. The initial learning rate is set to 1e-4 and reduced on plateau, with a reduction factor of 0.1 and a patience of 10. The batch size is set to 6 and we evaluate on the test set after 30 epochs. The training process takes approximately 140 minutes.

We use the Dice similarity coefficient and pixel error as the main metrics to evaluate the segmentation performance of our
D2A U-Net. Dice is a statistic used to gauge the similarity of two samples and has been widely used to evaluate performance in semantic segmentation. Pixel error measures the proportion of falsely predicted pixels in the image, showing the global segmentation accuracy of the proposed models. Both metrics measure segmentation performance in a global way. In addition, we calculate the recall score of infection regions, as recall measures the model's sensitivity to lung infection, which is rather significant in terms of COVID-19. We use G to denote the ground truth, P the dense predictions, TP true positives, FP false positives, TN true negatives and FN false negatives. The metrics are calculated as follows:

Dice = 2|G ∩ P| / (|G| + |P|) = 2TP / (2TP + FP + FN)   (10)

Pixel Error = (FP + FN) / (TP + TN + FP + FN)   (11)

Recall = TP / (TP + FN)   (12)
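Eqs. (10)–(12) can be computed directly from the confusion-matrix counts; a minimal sketch on flat binary masks:

```python
def seg_metrics(pred, gt):
    """Compute Dice, pixel error and recall (Eqs. 10-12) from two
    binary masks given as flat 0/1 lists."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))
    dice = 2 * tp / (2 * tp + fp + fn)
    pixel_error = (fp + fn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    return dice, pixel_error, recall

pred = [1, 1, 0, 0, 1, 0]
gt   = [1, 0, 0, 0, 1, 1]
dice, err, rec = seg_metrics(pred, gt)
print(round(dice, 3), round(err, 3), round(rec, 3))  # 0.667 0.333 0.667
```

In practice the same counts are accumulated over whole 512 × 512 slices rather than toy lists.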
We compared the performance of the proposed network with U-Net [22], Attention U-Net [17] and U-Net++ [40]. The VGG-style backbone refers to the encoder design proposed in the original U-Net paper [22]. We also compared our model with two cutting-edge models widely used in natural image segmentation, FCN8s [12] and DeepLab v3 (output stride = 8) [3], both containing a pretrained backbone. Apart from model performance, model parameters and computational costs (FLOPs) are also compared in our experiments. As our model differs from other U-Net family models in its encoder, to best evaluate our design of the model decoder and attention mechanism, we also build a simplified D2A U-Net with a VGG-style backbone. We believe the simplified version offers a fairer comparison between the proposed network and other U-Net based models, while the standard D2A U-Net uses a ResNeXt-50 (32×4d) backbone.

Detailed comparison among the different models in our experiments is shown in Table 2. As shown, without a pretrained backbone, our proposed network outperforms U-Net, Attention U-Net and U-Net++ in terms of Dice, pixel error and recall. As these models have identical encoders, it is clear that the proposed dual attention mechanism and RAB contribute greatly to infection segmentation. The attention mechanism helps the model detect infected tissues more accurately, which reduces the number of false negatives and improves the recall score. Also, the RAB in the model decoder captures both large and tiny visual structures, which helps segment infection lesions of different sizes. It should also be noted that the proposed D2A U-Net with a VGG-style backbone outperforms U-Net++ with comparably lower model parameters and computational costs, which demonstrates the balance of efficiency and performance in our models.
Table 2: Quantitative analysis of infection regions on our dataset. Backbone VGG-style refers to the encoder proposed in [22]; backbones ResNet-101 and ResNeXt-50 (32×4d) are pretrained on ImageNet-1K.

Model | Backbone | Param. | FLOPs | Dice | Pixel Error | Recall
U-Net | VGG-style | 7.85 M | 43.13 G | 0.6384 | 0.0332 | 0.5512
Attention U-Net | VGG-style | 8.12 M | 43.78 G | 0.6646 | 0.0390 | 0.6470
U-Net++ | VGG-style | 9.16 M | 106.81 G | 0.6830 | 0.0332 | 0.6417
DeepLab v3 (os=8) | ResNet-101 | 58.63 M | 185.00 G | 0.7095 | 0.0323 | 0.6780
FCN8s | ResNet-101 | 51.94 M | 165.67 G | 0.6825 | 0.0315 | 0.6348
D2A U-Net | VGG-style | 8.95 M | 53.19 G | 0.7047 | 0.0323 | 0.6626
D2A U-Net | ResNeXt-50 (32×4d) | 90.05 M | 149.97 G | 0.7298 | – | 0.7071
Utilizing a pretrained backbone can further improve model performance: our D2A U-Net with a pretrained ResNeXt-50 (32×4d) backbone achieves the best Dice and recall scores among all evaluated models, with a Dice score of 0.7298 and a recall score of 0.7071.

We visualized the segmentation results, as shown in Fig. 5. It can be seen from the visualization that our proposed model clearly outperforms the other models. U-Net and Attention U-Net are the least sensitive to COVID-19 lesions, and their background pixels have much stronger activation than in other models. U-Net++ produces more accurate segmentation results, but still not satisfying ones, as some tiny lesions or lesions with blurred edges are segmented poorly. D2A U-Net with a VGG-style backbone produces the most accurate segmentation masks among the U-Net based models mentioned above, and when the backbone is switched to ResNeXt-50 (32×4d), D2A U-Net produces the best segmentation results, being comparably more sensitive to blurred or tiny lesions than the other models.
Several ablation experiments were conducted to evaluate the performance of the components presented in our model, as shown in Table 3.
Effectiveness of Proposed GAM
To evaluate the validity of the proposed GAM, we designed two baselines shown in Table 3: No.1 (U-Net only) and No.2 (U-Net + GAM). Experimental results show that introducing GAM into the U-Net model boosts performance, leading to better Dice and recall scores.
Effectiveness of Proposed RAB
We conducted similar experiments (No.1 and No.3) to explore the effectiveness of the proposed RAB, which includes a hybrid dilated convolution block and a decoder attention module. Experimental results indicate that introducing RAB into our model yields better results as well, though the performance boost is comparably limited compared with GAM.
Effectiveness of Combining GAM, RAB and PB
As can be seen from Table 3, in experiment No.4, introducing GAM and RAB together (the proposed D2A U-Net) yields the best results in our experiments, and the performance boost exceeds the simple sum of each module's individual boost. These results indicate that GAM and RAB promote performance mutually. Also, in No.5, the pretrained backbone, as a better parameter initialization, further improves model performance.
5. Conclusion
In this paper we proposed a novel segmentation network, D2A U-Net, for COVID-19 CT segmentation. Inspired by global attention upsample and CBAM, we proposed a novel gated attention mechanism, called the gate attention module, to produce a fused attention map on features extracted by the encoder. We also introduced a decoder attention module, which helps refine upsampled feature maps. In addition, inspired by hybrid dilated convolution, we presented a residual attention block containing a hybrid dilated convolution block and a decoder attention module, and used it as the basic block of the model decoder. The attention mechanism is utilized to increase model sensitivity to positive pixels and improve the recall score, while the residual attention block refines upsampled feature maps and enlarges the receptive field simultaneously. Experimental results indicate that our network design is capable of segmenting COVID-19 lesions from
First Author et al.:
Preprint submitted to Elsevier
Page 8 of 11
Figure 5: Visual comparison of COVID-19 lesion segmentation results.
Table 3
Ablation analysis of the proposed D2A U-Net, where GAM denotes gate attention module, RAB denotes residual attention block and PB denotes pretrained backbone.

Method                           Dice     Pixel Error   Recall
(No.1) U-Net                     0.6384   0.0332        0.5512
(No.2) U-Net + GAM               0.6771   0.0343        0.6445
(No.3) U-Net + RAB               0.6579   0.0354        0.6154
(No.4) U-Net + RAB + GAM         0.7047   0.0323        0.6626
(No.5) U-Net + RAB + GAM + PB
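For reference, the three metrics reported in Table 3 can be computed from binary masks as follows. This is a minimal pure-Python sketch in which `pred` and `gt` are assumed to be flattened 0/1 masks; the function names are illustrative.

```python
def dice_score(pred, gt):
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    inter = sum(p * g for p, g in zip(pred, gt))
    return 2.0 * inter / (sum(pred) + sum(gt))

def pixel_error(pred, gt):
    """Fraction of pixels where prediction and ground truth disagree."""
    return sum(p != g for p, g in zip(pred, gt)) / len(gt)

def recall(pred, gt):
    """TP / (TP + FN): sensitivity to lesion pixels."""
    tp = sum(p * g for p, g in zip(pred, gt))
    return tp / sum(gt)

pred = [1, 1, 0, 0]
gt   = [1, 0, 1, 0]
print(dice_score(pred, gt))   # -> 0.5
print(pixel_error(pred, gt))  # -> 0.5
print(recall(pred, gt))       # -> 0.5
```

Note that a lower pixel error is better, while higher Dice and recall are better, which is why No.4 in Table 3 improves on all three columns simultaneously.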
CT slices automatically and achieves the best results among the popular cutting-edge models evaluated in our experiments. However, our work is still limited to some degree, as only binary segmentation is performed in our experiments, which can limit the model's potential use in both diagnosis and health care. We expect to gather more CT scans and perform multi-class segmentation in the future. Also, despite the significantly better performance of our D2A U-Net with the ResNeXt-50 (32 × 4d) backbone, its larger parameter count remains a drawback, and in future work we expect to streamline D2A U-Net to reduce channels and thus model parameters.
Acknowledgements
This work was partially supported by the Fundamental Research Funds for Central Universities, the National Natural Science Foundation of China (No. 61601019, 61871022), the Beijing Natural Science Foundation (7202102), and the 111 Project (No. B13003).