Esophageal Tumor Segmentation in CT Images using Dilated Dense Attention Unet (DDAUnet)
Sahar Yousefi, Hessam Sokooti, Mohamed S. Elmahdy, Irene M. Lips, Mohammad T. Manzuri Shalmani, Roel T. Zinkstok, Frank J.W.M. Dankers, Marius Staring
Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands; Department of Radiotherapy, Leiden University Medical Center, Leiden, The Netherlands; Department of Computer Engineering, Sharif University of Technology, Tehran, Iran; Department of Radiation Oncology, Netherlands Cancer Institute, Amsterdam, The Netherlands; Delft University of Technology, Delft, The Netherlands. Contact: s.yousefi[email protected]
Abstract—Manual or automatic delineation of the esophageal tumor in CT images is known to be very challenging. This is due to the low contrast between the tumor and adjacent tissues, the anatomical variation of the esophagus, as well as the occasional presence of foreign bodies (e.g. feeding tubes). Physicians therefore usually exploit additional knowledge such as endoscopic findings, clinical history, and additional imaging modalities like PET scans. Obtaining this additional information is time-consuming, while the results are error-prone and might lead to non-deterministic results. In this paper we aim to investigate if and to what extent a simplified clinical workflow based on CT alone allows one to automatically segment the esophageal tumor with sufficient quality. For this purpose, we present a fully automatic end-to-end esophageal tumor segmentation method based on convolutional neural networks (CNNs). The proposed network, called Dilated Dense Attention Unet (DDAUnet), leverages spatial and channel attention gates in each dense block to selectively concentrate on determinant feature maps and regions. Dilated convolutional layers are used to manage GPU memory and increase the network receptive field. We collected a dataset of 792 scans from 288 distinct patients including varying anatomies with air pockets, feeding tubes and proximal tumors. Repeatability and reproducibility studies were conducted for three distinct splits of training and validation sets. The proposed network achieved a DSC value of . ± . , a mean surface distance of . ± . mm and a Hausdorff distance of . ± . mm for 287 test scans, demonstrating promising results with a simplified clinical workflow based on CT alone. Our code is publicly available via https://github.com/yousefis/DenseUnet_Esophagus_Segmentation.

Index Terms—Esophageal tumor segmentation, CT images, densely connected pattern, UNet, dilated convolutional layer, attention gate
I. INTRODUCTION
Esophageal cancer is one of the least studied cancers [1], while it is lethal in most patients [2]. Because of the very poor survival rate, three standard treatment options are available, i.e. chemoradiotherapy (CRT), neoadjuvant CRT followed by surgical resection, or radical radiotherapy [3]. For this purpose, rapid and accurate delineation of the target volume in CT images plays a very important role in therapy and disease control. The complexities raised by automatic esophageal tumor delineation in CT images can be divided into several categories: textural similarities and the absence of contrast between the tumor and its adjacent tissues; the anatomical variation of different patients, either intrinsic or caused by a disease, like a hiatal hernia in which part of the stomach bulges into the chest cavity through an opening of the diaphragm (see Figure 1-(e) and (i)), extension of the tumor into the stomach, or existence of an air pocket inside the esophagus; and the existence of foreign bodies during the treatment, like a feeding tube or surgical clips inside the esophageal lumen. Figure 1 illustrates some of the challenging cases. These difficulties lead to a high degree of uncertainty associated with the target volume of the tumor, especially at the cranial and caudal border of the tumor [4]. In order to overcome these complexities, physicians integrate CT imaging with the clinical history, endoscopic findings, endoscopic ultrasound, and other imaging modalities such as positron-emission tomography (PET) [5]. Obtaining these additional modalities is however a time-consuming and expensive process. Moreover, the process of manual delineation is a repetitive and tedious task, and often there is a lack of consensus on how to best segment the tumor from normal tissue. Despite using additional modalities and expert knowledge, the process of manual delineation still remains an ill-posed problem [6]. Nowee et al.
[4] assessed manual delineation variability of the gross tumor volume (GTV) between using CT and combined 18F-fluorodeoxyglucose PET (FDG-PET) [7] and CT in esophageal cancer patients in a multi-institutional study by 20 observers. They concluded that the use of PET images can significantly influence the delineated volume in some cases; however, its impact on observer variation is limited.

In this paper we aim to investigate if a simplified clinical workflow based on CT scans alone allows one to automatically delineate the esophageal GTV with acceptable quality. Recently, there has been a revived interest in automating this process for both the esophagus and the esophageal tumor based on CT images alone [8], [9], [10]. Our earlier work [11] leveraged the idea of dense blocks proposed by [12], arranging them in a typical U-shape.

Fig. 1: Variations in shape and location of the tumor. Green contours show the manual delineation of the GTVs. (a) normal junction of esophagus and stomach, (b) hiatal hernia (type I): migration of the esophagogastric junction through the gap in the cranial direction, (c) hiatal hernia (type II): migration of the esophagogastric junction in the chest, (d) proximal tumor including an air pocket and feeding tube, (e) proximal tumor including an air pocket, (f) tumor including an air pocket, (g,h) junction tumor (extension of the tumor into the stomach) including an air pocket, (i) tumor including an air pocket and feeding tube, (j) relocation of the esophagus to the left of the aorta, (k) a variety in the shape of the tumor, (l) junction tumor including an air pocket.

In that study, the proposed method was trained and tested on 553 chest CT scans from 49 distinct patients and achieved a DSC value of . ± . , and a 95% mean surface distance (MSD) of . ± . mm for the test scans. Eight of the 85 scans in the test set had a DSC value lower than .
, caused by the presence of air cavities and foreign bodies in the GTV, which were rarely seen in the training data. In order to enhance the robustness of the network, in the present study we extended that work. The main contributions of this paper are as follows:

1) We propose an end-to-end CNN for esophageal GTV segmentation on CT scans. Different from much of the previous work, which addressed segmentation of the esophagus itself, we focus on the more challenging tumor area (GTV). The proposed method is end-to-end, without intricate pre- or post-processing steps, and uses no information in addition to the CT scans;

2) We introduce dilated dense attention blocks which leverage spatial and channel attention to emphasize the GTV-related features. Also, dilation layers are used to support an exponential expansion of the receptive field and decrease the size of the network without loss of resolution;

3) We collected a dataset of 288 distinct patients (792 scans). The dataset includes different varieties of anatomies, and the presence of foreign bodies and air pockets in the esophageal lumen. In this study, all patients received either neoadjuvant or definitive chemoradiotherapy treatment options. To the best of our knowledge, none of the related works have addressed such a comprehensive dataset.

The initial results of this work were presented in [11]. The current paper includes a larger and more diverse dataset, and a more elaborate evaluation. Also, we leverage dilated convolutional layers and attention gates in order to increase the receptive field and filter tumor-relevant features.

II. RELATED WORK
Most automatic esophagus segmentation approaches have used either a shape or appearance model to guide the segmentation, where training such a guidance model is complicated. Rousson et al. proposed a two-stage probabilistic shortest path approach to segment the esophagus from 2D CT images [6]. In the first stage, the aorta and left atrium are segmented and then registered to reference shapes in order to find a region of interest (ROI). In the second stage, the optimal esophagus centerline is extracted using the shortest path algorithm. Fieselmann et al. proposed an automatic approach for segmenting the esophagus by detecting the air cavities that often constitute the esophagus [13]. For reducing the time complexity they confined all the computations to an ROI. Also, they proposed another method based on spatially-constrained shape interpolation in order to segment the esophagus in 2D CT images [14]. In that investigation, two assumptions are considered: the shape of the esophagus changes smoothly, and there is no intersection between the esophagus and the other organs.

In [15] a multi-step approach based on probabilistic models has been proposed to segment the esophagus on 3D CT scans. In that work, a pre-processing step is used to extract an ROI. Then, a discriminative learning technique is applied to label the voxels. In [16] an optical flow approach for semi-automated segmentation of CT images is used, where manually drawn curvature points are extended to contours by Fourier interpolation and afterwards, optical flow is used for registering the original contour to the other slices. This method is not only highly user-interactive, but it also fails when the region to contour is topologically different between two slices. Feulner et al. proposed a multi-step approach based on probabilistic models for automatic segmentation of the esophagus in 3D CT scans [17]. In that work, by running a discriminative model for each axial slice, a set of approximated esophagus contours is extracted. Then, the contours are clustered and merged and afterwards, a Markov chain model is used for finding the most probable path through the axial slices. Ultimately, another discriminative model is used for refining the result. This approach only works for a manually selected ROI. The manual selection of an ROI was later extended to automatic ROI detection by a salient landmark on the chest [18]. Kurugol et al. presented a 3D level set model for segmenting the esophagus over the entire thoracic range employing a shape model, with a global and a locally deformable component [19].
In their work, an initial centerline estimation is required, where an ad-hoc centerline estimator was used, which was only performed at the ROI of some predefined anatomical landmarks followed by interpolation for the remaining slices. Later, they extended their work by using prior spatial and appearance models estimated from the training set instead of using the ad-hoc estimator [20]. In [21], a two-phase online atlas-based approach was proposed to rank and select a subset of optimal atlas candidates for segmentation of the esophagus on CT scans. Atlas-based approaches face some restrictions, including the selection of optimal atlases and a correct representation of the study population.

Deep learning for medical image analysis has attracted broad attention in recent years [22], [23], [24], [25]. However, this technique has been used only to a limited extent for esophagus segmentation, and even less for esophageal tumor segmentation. In [26], a fully convolutional neural network (FCNN) for segmenting the esophagus on 3D CT was proposed, covering the region between the bottom-most part of the heart and the topmost part of the stomach. For refining the results, an active contour model and a random walker were used as post-processing steps. In that study, 50 scans were used as the training set and 20 scans as the test set. An average
DSC value of 0.76 ± et al. [27]. The firststage performs a multi-organ segmentation in order to extractan ROI including the esophagus. Then the manually croppedROI is fed to the second network to segment the esophagus. A DSC value of 0.72 ± et al. [28]. Then theyapplied a graph cut for segmenting the tumor. They reportedan average DSC value of 0.75 ± et al. [8] introduced a spatial-context encodeddeep esophageal clinical target volume (CTV) delineationframework to produce superior margin-based CTV boundaries.That work in an expensive pre-processing step encodes spatialcontext by computing the signed distance transform maps(SDMs) of the GTV, lymph nodes (LNs) and organs at risks(OARs) and then feeds the results with the CT image into a 3DCNN. In another work Jin et al. [29] proposed a two-streamchained 3D CNN fusion pipeline to segment esophageal GTVsusing both CT and PET+CT scans. They evaluated theirapproach by conducting a 5-fold cross validation on scansof 110 patients. They reported that using PET images ascomplementary information can improve the DSC score from . ± . to . ± . . Although reasonable results can beobtained in the approaches mentioned earlier, the problem ofesophageal GTV segmentation in CT modalities without extraknowledge constraining the problem is known as an ill-posedproblem and remains challenging [9].Most of the mentioned works addressed esophagus seg-mentation and not esophageal tumor segmentation. However,esophageal tumor segmentation is a more complicated task dueto the poor contrast of the tumor with respect to its adjacenttissues. It is especially difficult to define the start and end ofthe tumor without additional information such as endoscopicfindings. III. T HE PROPOSED METHOD
A. Network architecture
Figure 2 shows a schematic of the proposed network, dubbed Dilated Dense Attention Unet (DDAUnet). The network is composed of three levels: a down-sampling path for extracting contextual features and an up-sampling path for retrieving the resolution lost during extraction. In each level, different from our prior work [11], we used dilated dense spatial and channel attention blocks (DDSCABs), each composed of a dilated dense block (DDB), a spatial attention gate and a channel attention gate, denoted SpA and ChA1 respectively. Using loop connectivity patterns between the layers in the DDSCAB blocks provides deep supervision by re-using the feature maps, while the dilated layers increase the receptive field exponentially. Spatial attention gates are used in the main building blocks, and encourage the network to concentrate on extracting features from the tumor adjacency. The channel attention gates are used in the skip connections between the contracting and expanding paths of the Unet (named ChA2), for filtering irrelevant feature maps to improve the training process. The proposed network DDAUnet does not include ChA1; this block is only used in some of the experiments during the optimization of the network configuration. In Section IV the performance of DDAUnet will be compared with DUnet [11]; the dilated dense Unet (DDUnet), which is DUnet with dilated convolutional layers in the dense blocks;
DDAUnet without ChA2, i.e. DDAUnet-noChA2; DDAUnet with ChA1 and without SpA and ChA2, i.e. DDAUnet-noSpA-plusChA1-noChA2; and DDAUnet with ChA1 and without ChA2, i.e. DDAUnet-plusChA1-noChA2.

According to [30], the incorporation of a stack of convolutional layers with small receptive fields in the first layers, rather than a few layers with large receptive fields, decreases the number of parameters, increases the non-linearity of the network, and consequently makes training of the network easier. These layers aid the network to extract significant features before applying convolutional operations with a wider receptive field in the DDSCABs. Therefore, the network starts with two consecutive (3×3×3) conv_{d=1, p=true} + BN + ReLU layers, in which (3×3×3), d and p indicate the kernel size, dilation factor and padding of the convolutional layer, respectively. Also, BN and ReLU denote batch normalization and a rectified linear unit layer, respectively.

Afterwards, the network is followed by a DDSCAB, composed of a dilated dense block (DDB) and spatial and channel attention gates. For each DDB, R is the number of sub-DDBs. In each sub-DDB, there are R (3×3×3) conv_{d=2, p=true} + BN + ReLU and R (1×1×1) conv_{d=1, p=true} + BN + ReLU layers. The output of a DDB is the concatenation of all preceding sub-DDBs. In our prior work, it has been shown that the loop connectivity patterns in dense blocks assist the network to perform better [11]. In DDBs, (1×1×1) conv layers are used as bottleneck layers, which compress the number of feature maps and thus improve computational efficiency [12]. In this paper, the feature maps in each DDB are compressed by a compression coefficient of θ. The output of a DDB is then fed to spatial and channel attention gates in order to selectively filter the GTV-irrelevant spatial features and feature maps respectively, which leads to an improved training process. In the down-sampling path, the DDSCABs are followed by (1×1×1) conv_{d=1, p=true} + BN + ReLU.
Using (1×1×1) convolutional layers does not affect the receptive field of the network, but increases the non-linearity in between layers [31]. At the end of the down-sampling path and in the up-sampling path, every DDSCAB is followed by (3×3×3) conv_{d=1} + BN + ReLU. In Section IV we will investigate the effect of deploying spatial and channel gates in the DDSCABs, and will see that utilizing only the spatial gate is more effective. Also, the skip connections between the contracting and expanding paths are equipped with channel attention gates to filter the irrelevant feature maps. Later we will show that leveraging the spatial and channel attention gates aids the network to end up with better segmentation results. The network is finalized by a convolutional layer with linear activation and a soft-max layer to compute a probabilistic output. The probabilistic output can be classified into tumor and non-tumor regions. The skip connections between the down-sampling and up-sampling paths denote cropped concatenation of the feature maps of the corresponding down-sampling and up-sampling levels.
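The effect of stacking small and dilated kernels on the receptive field, as described above, can be computed directly. The following sketch is our own illustration, not the paper's code; it assumes stride 1 and counts the receptive field along one axis:

```python
def receptive_field(layers):
    """1D receptive field of a stack of convolutional layers.

    `layers` is a list of (kernel_size, dilation) pairs; stride 1 is
    assumed, as in the dense blocks above. A dilated kernel of size k
    and dilation d covers d*(k-1)+1 input positions.
    """
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)  # each layer widens the field by d*(k-1)
    return rf

# Two initial 3x3x3 convolutions (d=1), as at the start of the network:
print(receptive_field([(3, 1), (3, 1)]))                   # -> 5
# Adding one dilated 3x3x3 convolution (d=2), as in a sub-DDB:
print(receptive_field([(3, 1), (3, 1), (3, 2)]))           # -> 9
# A (1x1x1) bottleneck layer leaves the receptive field unchanged:
print(receptive_field([(3, 1), (3, 1), (3, 2), (1, 1)]))   # -> 9
```

A single d=2 layer widens the receptive field as much as two extra plain 3×3×3 layers would, without their parameter cost, which is the motivation for dilating the dense blocks.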
B. Loss function
In this work, similar to our prior work [11], we used the Dice coefficient as our main loss function [32]:

$$\mathrm{DSC}_{\mathrm{GTV}} = \frac{2\sum_{i}^{N} s_i g_i}{\sum_{i}^{N} s_i + \sum_{i}^{N} g_i}, \qquad (1)$$

where $s_i \in S$ is the binary segmentation of the GTV predicted by the network and $g_i \in G$ is the ground truth segmentation. We investigated different combinations of loss functions, including boundary loss [33], distance map loss [34], and focal Dice [35]. In [33] it is shown that the boundary loss can be approximated by:

$$\mathcal{L}_B(\theta) = \int_{\Omega} \phi_G(q)\, s_\theta(q)\, dq, \qquad (2)$$

where $\phi_G$ and $s_\theta$ denote the level set representation of the boundary of the ground truth, and the network output, respectively. In Section IV it will be discussed that the combination of Dice and boundary loss works best for this problem.

IV. DATA, TRAINING DETAILS AND EVALUATION
A. Dataset
All patients of this study received one of the following two treatments:
1) Neoadjuvant chemo-radiotherapy (CRT) followed by surgical resection. The radiotherapy is 23 × .
2) Definitive chemoradiotherapy.

The scans have × × pixels and an average voxel thickness of . × . × mm, and were re-sampled to a voxel size of × × mm in this study.

Fig. 2: The architecture of the proposed method. DDSCAB and DDB stand for dilated dense spatial and channel attention block and dilated dense block, respectively. R is the number of sub-DDBs. ChA1, ChA2, and SpA denote the channel attention gate inside the DDSCAB block, the channel attention gate located on the skip connections, and the spatial attention gate inside the DDSCAB, respectively. ChA1, shown transparently here, is not included in the final network (DDAUnet), but is used in some of the experiments.

B. Training details
In this work, the proposed network, which contains 65k trainable parameters, has been implemented in Google's TensorFlow, and experiments were carried out using an NVIDIA Quadro RTX 6000 with 24 GB of GPU memory. For all networks, the patch extraction process has been implemented with multi-threaded programming, in which fetching the images into RAM, extracting the patches from the fetched images and feeding the extracted patches to the GPU are done concurrently. This multi-threading technique speeds up the patch extraction process. The input patches have been augmented by white noise drawn from a Gaussian distribution N(µ′, σ′), in which µ′ = 0 is the mean of the distribution and σ′ is the standard deviation, which is selected randomly between 0 and 5. During testing, the fully convolutional nature of the network is used, with zero padding to yield equal output size. For managing the GPU memory with a larger input patch, we use a batch size of seven.

For designing the best configuration of the network, we performed experiments comparing different architectures and loss functions. Datasets I and II were divided randomly into three distinct sets, detailed in Table II. The optimization of the network is performed on the validation set. The test set is excluded from the model optimization and kept independent for the final evaluation. After choosing the best configuration of the network, the final model is trained for two more random splits of the training and validation sets, resulting in three trained models. In the end, the average of the final results for the chosen network, trained on the three splits, is reported on the test set.

In Section V the optimization of the network configuration will be discussed on the validation set.
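The Gaussian-noise patch augmentation described above can be sketched as follows. This is a minimal NumPy illustration under our own naming (`augment_patch`, the dummy patch size), not the paper's TensorFlow pipeline:

```python
import numpy as np

def augment_patch(patch, rng, sigma_max=5.0):
    """Add zero-mean white Gaussian noise to an input patch.

    The standard deviation sigma' is drawn uniformly from [0, sigma_max),
    matching the augmentation described in the text (mu' = 0,
    sigma' selected randomly between 0 and 5).
    """
    sigma = rng.uniform(0.0, sigma_max)
    noise = rng.normal(loc=0.0, scale=sigma, size=patch.shape)
    return patch + noise

rng = np.random.default_rng(seed=42)
patch = np.zeros((16, 16, 16), dtype=np.float32)  # dummy CT patch
noisy = augment_patch(patch, rng)
print(noisy.shape)  # (16, 16, 16): the patch shape is preserved
```

Drawing σ′ anew for every patch means the network sees the same anatomy at varying noise levels, which acts as a cheap regularizer.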
Then the best network is trained with different linear combinations of the loss functions, including Dice, boundary loss, distance map, and focal loss. Then, in Section V-B, for reproducibility, the results of the best configuration of the network will be reported for two more distinct and random splits.

C. Evaluation measures
For evaluating the results we report the DSC value (see Section III-B), the MSD and the Hausdorff distance (HD), which are defined as:

$$\mathrm{MSD} = \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n} d(a_i, G) + \frac{1}{m}\sum_{j=1}^{m} d(b_j, S)\right), \qquad (3)$$

$$\mathrm{HD} = \max\left\{\max_i\{d(a_i, G)\},\ \max_j\{d(b_j, S)\}\right\}, \qquad (4)$$

in which $S$ and $G$ are the predicted and ground truth contours, $\{a_1, \ldots, a_n\}$ and $\{b_1, \ldots, b_m\}$ the surface mesh points of $S$ and $G$ respectively, and $d(a_i, G) = \min_j \|b_j - a_i\|$. For the Hausdorff distance we report the 95th percentile instead of the maximum, for robustness against outliers. Since defining the slices where the tumor starts and stops is difficult even for medical doctors, we report the perpendicular cranial and caudal distance between the output of the CNNs and the ground truth. The cranial distance (CrD) error is computed as the topmost slice number of the ground truth minus the topmost slice number of the CNN prediction; the caudal distance (CaD) error is computed similarly.

V. EXPERIMENTAL RESULTS
In this section the experimental results are reported, with the datasets divided into training, validation and test sets as described in Section IV-B. Model optimization experiments are described in Section V-A, where the comparison is performed on the validation set. Subsequently, the final results are reported on the test set in Section V-B. For all experiments, we extract the largest component of the network prediction using connected component analysis, and report that. A repeated measures one-way ANOVA test was performed on the Dice values using a significance level of p = 0.05.
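For reference, the evaluation measures of Section IV-C can be computed for toy examples as below. This is a brute-force NumPy sketch of Eq. (1), (3) and (4) under our own naming; real surface meshes would require an efficient nearest-neighbour search rather than a full pairwise distance matrix:

```python
import numpy as np

def dsc(s, g):
    """Dice similarity coefficient between two binary masks, Eq. (1)."""
    s, g = s.astype(bool), g.astype(bool)
    return 2.0 * np.logical_and(s, g).sum() / (s.sum() + g.sum())

def surface_distances(a, b, percentile=95):
    """MSD (Eq. 3) and percentile Hausdorff distance (Eq. 4) between two
    point sets `a` (n x 3) and `b` (m x 3) of surface mesh coordinates."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_ab = d.min(axis=1)   # distance of each point of a to surface b
    d_ba = d.min(axis=0)   # distance of each point of b to surface a
    msd = 0.5 * (d_ab.mean() + d_ba.mean())
    hd = max(np.percentile(d_ab, percentile), np.percentile(d_ba, percentile))
    return msd, hd

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
print(dsc(mask, mask))  # 1.0 for a perfect overlap

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = pts + np.array([0.0, 2.0, 0.0])
msd, hd = surface_distances(pts, shifted)
print(msd, hd)  # 2.0 2.0: the two surfaces lie 2 mm apart everywhere
```

Taking the 95th percentile in place of the maximum, as in the text, makes the Hausdorff measure insensitive to a few outlier surface points.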
TABLE I: Details of the dataset.

dataset   # of patients       # of scans   Type                             Time period   Treatment plan
I         21 (A)              525          5 time-points                    2012-2014     A: Neoadj. CRT
                                           (2 time-points: 3D scans;
                                           3 time-points: 3D + 4D scans)
II        162 (A) + 105 (B)   267          3D                               2014-2019     A: Neoadj. CRT
                                                                                          B: def. CRT

TABLE II: Data split into training, validation and testing sets. P and S denote distinct patients and scans.

dataset      P/S   DB I   DB II   total
training      P      13     182     195
              S     325     182     507
validation    P       2      23      25
              S      50      23      73
testing       P       6      62      68
              S     150      62     212
total         P      21     267     288
              S     525     267     792

TABLE III: AUC for the networks on the validation set.

model                            AUC
DUnet                            0.49
DDUnet                           0.53
DDAUnet-noChA2                   0.73
DDAUnet-plusChA1-noChA2          0.63
DDAUnet-noSpA-plusChA1-noChA2    0.71
DDAUnet                          0.76
A. Model optimization
Figure 3 shows boxplots of DSC, MSD, 95%HD, cumulative frequency (%) of DSC, and perpendicular cranial and caudal distance errors for different configurations of the CNN models using the DSC loss function. Since the channel attention gates inside the DDSCAB block, i.e. ChA1 in Fig. 2, did not improve the results, these are not used in the final configuration. The results show that DDAUnet outperforms the other network configurations significantly.

Figure 4 shows the precision-recall curves for the networks on the validation set using the DSC loss function. The precision and recall were calculated with different threshold values applied to the probabilistic output of the networks. For acquiring the final segmentation we used a threshold of 0.5. Table III tabulates the values of the area under the curve (AUC) for the networks on the validation set. The AUC for DDAUnet is the largest, and we choose this method as the final network architecture.

We experimented with different combinations of loss functions, including Dice, boundary loss, distance map loss and focal Dice, on the validation set. Figure 5 shows the results. The results show that DDAUnet using the DSC + BL loss function outperforms the other loss functions significantly.
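The precision-recall sweep behind Figure 4 can be outlined as follows. This is a NumPy sketch with hypothetical names; as in the paper, thresholds are applied to the probabilistic soft-max output, with 0.5 used for the final segmentation:

```python
import numpy as np

def precision_recall(prob, gt, thresholds):
    """Precision/recall pairs of a probabilistic segmentation against a
    binary ground truth, one (precision, recall) pair per threshold."""
    gt = gt.astype(bool)
    curve = []
    for t in thresholds:
        pred = prob >= t                       # binarize at threshold t
        tp = np.logical_and(pred, gt).sum()    # true positive voxels
        precision = tp / pred.sum() if pred.sum() else 1.0
        recall = tp / gt.sum()
        curve.append((float(precision), float(recall)))
    return curve

prob = np.array([0.9, 0.8, 0.6, 0.3, 0.1])  # toy probabilistic output
gt = np.array([1, 1, 0, 1, 0])              # toy ground truth labels
p, r = precision_recall(prob, gt, thresholds=[0.5])[0]
print(p, r)  # precision = recall = 2/3 at t = 0.5
```

Sweeping `thresholds` over (0, 1) traces one curve per network, and the area under it gives the AUC values compared in Table III.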
B. Final results
As explained before, repeatability and reproducibility studies were conducted for three distinct and random splits of the training and validation sets. Table IV shows the results on the independent test set after applying the largest component analysis. Figure 6 shows example results of the final network for some patients with different shape varieties and difficulties raised by the presence of air pockets or feeding tubes. The 2D DSC values are shown in yellow. Figure 7 shows a qualitative comparison between the different CNNs.

In order to study the strengths and weaknesses of the final model, we manually labeled each scan with the following properties: the presence of air pockets in the esophagus, the presence of a feeding tube in the esophageal lumen, the tumor is a junction tumor, the tumor volume is larger than 30 cc (defined by the median split of the GTV volumes), the patient has a hiatal hernia, the tumor is in a dislocated esophagus, and the tumor is located in the proximal esophagus (proximal tumor). Figure 8 shows the results of the
DSC value, MSD and 95%HD for the mentioned tags for the final network on the test set. The results show that the network works better for patients with an absence of air pockets, feeding tubes, or junction tumors. This may be caused by the different varieties raised by the existence of air pockets or foreign bodies. Also, the number of patients with a hiatal hernia, a dislocated esophagus or a proximal tumor is relatively small in the test set, not allowing us to draw a conclusion.

VI. DISCUSSION
Esophageal GTV segmentation is not a trivial problem, due to the difficulties raised by the poor contrast with respect to its vicinity. Most research addressed only segmentation of the esophagus, while esophageal GTV segmentation has been touched upon in few works. Since defining a correct start and end location (slice) of the tumor in the cranial-caudal direction based on CT images alone is not an easy task even for doctors, esophageal GTV segmentation is considered an ill-posed problem. In this paper, to address esophageal GTV segmentation we designed an efficient deep learning model.

In terms of the training data, we collected 792 CT scans from 288 patients diagnosed with an esophageal tumor. This dataset is the largest dataset among the present works addressing esophageal tumor segmentation. Training time for a single network was in the order of 5 days. The average inference time of the final network, for a cube of × × voxels, is . ± . seconds.

For tuning the proposed network, many experiments were performed in the present paper. We leveraged the DenseUnet network, already deployed in our prior work, as a baseline. In order to increase the receptive field of the network, the dense blocks were equipped with dilated convolutional layers, dubbed the DDUnet network. Then we leveraged attention mechanisms to encourage the network to selectively filter out GTV-irrelevant features. Three types of attention gates were utilized: i) a spatial attention gate in the dense blocks to filter out GTV-irrelevant features in the spatial domain of each feature map,
Fig. 3: A comparison between different network configurations (DUnet, DDUnet, DDAUnet-noSpA-plusChA1-noChA2, DDAUnet-noChA2, DDAUnet-plusChA1-noChA2 and DDAUnet) on the validation set, showing boxplots of DSC, MSD (mm), cumulative frequency of DSC (%), CrD (mm) and CaD (mm). The number of results with values larger than the maximum value on the vertical axis is shown on top of each plot. The stars in the DSC plot show statistical significance between DDAUnet and the other CNNs.

TABLE IV: Results for DDAUnet on the independent test set, with the combined Dice and boundary loss function. Per split, DSC, CrD (mm), CaD (mm), MSD (mm) and 95%HD (mm) are reported as µ ± σ.
Fig. 4: Precision-recall curves for the different network architectures on the validation set using the DSC loss function.

ii) a channel attention gate in the dense blocks to filter out irrelevant feature maps entirely, and iii) skip attention gates to filter out GTV-irrelevant feature maps between the contracting and expanding paths of the Unet. The experiments on the validation set showed that the architecture with the spatial attention and skip attention gates, dubbed DDAUnet, achieved the best result. Deploying channel attention inside the dense blocks (see Figure 2) might filter out the feature maps in early levels of the network before allowing it to extract fine features at the deeper levels. Channel attention in the skip connections filters out redundant or irrelevant feature maps during the retrieval of lost resolution. The optimized network architecture was further tuned using a large variety of loss functions, again on the validation set. Results showed that the summation of Dice and boundary loss performed best. Therefore, we introduce DDAUnet with the summation of Dice and boundary loss as the loss function of the final network.

We trained the final network for three random splits of the training and validation sets. The results on the test set showed an average DSC of . ± . , an MSD of . ± . mm, a 95%HD of . ± . mm, and cranial and caudal perpendicular distance errors of − . ± . mm and . ± . mm,
respectively. The cranial and caudal perpendicular distance errors between the ground truth and the network result show that the network overestimates at the top of the GTV by ~6.5 mm, and underestimates at the bottom of the GTV by ~3.5 mm. As the slice thickness of the data was 3 mm, this translates to approximately 2 and 1 slices on average, respectively. For alleviating this issue, incorporating auxiliary information could aid the network.

Fig. 5: A comparison between deploying different loss functions (DSC, DSC + DM, DSC + Focal, DSC + Focal + BL, DSC + BL) for DDAUnet on the validation set. The number of results with values larger than the maximum value on the vertical axis is shown on top of each plot. The stars in the DSC plot show statistical significance between DSC+BL and the other loss functions.

Although the datasets are not comparable, in [29] an average
DSC score of . ± . was obtained on scans of 110 patients, using 5-fold cross-validation. In [28] a DSC score of 0.75 ± . , a DSC value of . ± . , and a mean surface distance (MSD) of . ± . mm were reported for 85 CT scans from 13 distinct patients. In the work described in this paper, a higher DSC value was obtained.

Nowee et al. studied the inter-observer variability in esophageal tumor delineation, and found that this variability is mainly located at the cranial and caudal border [4]. They report a generalized conformity index for the GTV, a measure related to Dice overlap but for multiple observers, of 0.67. The human delineation variation in the cranial direction, defined as the standard deviation of the most proximal slice, was on average 9.9 mm, and 7.5 mm for the caudal direction. Although these measures are not the same as the measures reported in this paper, we cautiously conclude that the cranial and caudal errors of the proposed automatic method (see Table IV) are not far from human delineation variation.

Nowee et al. also investigated the impact of incorporating FDG-PET scans in the delineation process, and concluded that although it can influence the delineated volume significantly, its impact on observer variation was limited. As future work, we aim to study whether fusion of CT with FDG-PET can help the CNN improve the extracted features and subsequently the segmentation results.

For a closer inspection, we investigated the results on the independent test set for the final network. We labelled the patients in the test set with different tags, including the presence of air pockets, a feeding tube, a junction tumor, a tumor volume > 30 cc, a hiatal hernia, a dislocated esophagus, and a proximal tumor. Inspection of the final results (see Figure 8) showed that the network performed better for patients with an absence of air pockets, feeding tubes in the esophagus lumen, or junction tumors.
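The generalized conformity index reported by Nowee et al. [4] generalizes pairwise Dice overlap to multiple observers: the summed pairwise intersections divided by the summed pairwise unions. A minimal numpy sketch is shown below; the toy 1D masks and the function names are our own illustration, not taken from [4].

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def generalized_conformity_index(masks):
    """Sum of pairwise intersections over sum of pairwise unions,
    across all observer pairs."""
    inter_sum, union_sum = 0, 0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter_sum += np.logical_and(masks[i], masks[j]).sum()
            union_sum += np.logical_or(masks[i], masks[j]).sum()
    return inter_sum / union_sum

# Toy 1D "delineations" from three hypothetical observers
a = np.array([0, 1, 1, 1, 0, 0], bool)
b = np.array([0, 1, 1, 0, 0, 0], bool)
c = np.array([0, 0, 1, 1, 1, 0], bool)
print(round(dice(a, b), 3))                           # 0.8
print(round(generalized_conformity_index([a, b, c]), 3))  # 0.455
```

For two observers the generalized conformity index reduces to the Jaccard index, which is why it is related to, but numerically lower than, the corresponding Dice score.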
A lower performance was obtained for smaller tumors (< 30 cc), while the strength of the network for patients with a dislocated esophagus, a proximal tumor or a hiatal hernia could not be judged. Therefore, enriching the dataset with more patients with the mentioned properties would potentially improve the performance of the model. Also, incorporating endoscopic findings in the segmentation process could be considered as future work, to investigate whether this can help CNNs reduce errors, especially at the start and end of the GTV.

Fig. 6: Example results of the proposed method with the 2D DSC value in yellow. The manual delineation and the network results are shown by green and red contours, respectively. Panels: (a) normal esophageal GTV; (b) normal esophageal GTV; (c) GTV including an air pocket; (d) GTV in a dislocated esophagus; (e) junction GTV; (f) junction GTV; (g) GTV including an air pocket; (h) proximal GTV including an air pocket and a feeding tube; (i) proximal GTV including an air pocket and a feeding tube.
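The loss selected for the final network, the summation of Dice and boundary loss, can be sketched as follows. This is a minimal 1D numpy illustration; in the paper the inputs are 3D volumes and the boundary loss follows Kervadec et al. [33], which weights the predicted probabilities by a signed distance map precomputed from the ground-truth mask (negative inside the GTV). The function names, equal weighting of the two terms, and the toy arrays are our own assumptions.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a probability map and a binary target."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def boundary_loss(pred, signed_dist):
    """Boundary loss in the style of Kervadec et al.: mean predicted
    probability weighted by the signed distance to the GT boundary
    (negative inside the GTV, positive outside)."""
    return float((pred * signed_dist).mean())

def dsc_bl_loss(pred, target, signed_dist):
    """DSC + BL: plain summation of the two terms."""
    return soft_dice_loss(pred, target) + boundary_loss(pred, signed_dist)

# Toy 1D example; signed_dist would normally be precomputed from the
# ground-truth mask with a Euclidean distance transform.
pred = np.array([0.1, 0.9, 0.8, 0.2])
target = np.array([0.0, 1.0, 1.0, 0.0])
signed_dist = np.array([1.0, -1.0, -1.0, 1.0])  # hypothetical values
print(round(dsc_bl_loss(pred, target, signed_dist), 3))  # -0.2
```

Minimizing the boundary term pushes probability mass off voxels with positive distance (outside the GTV) and onto voxels with negative distance (inside), which is why it can reduce boundary errors that a pure overlap loss such as Dice tolerates. In practice the signed distance map can be derived with a Euclidean distance transform (e.g. scipy's `distance_transform_edt`) applied to the mask and its complement.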
Fig. 7: Qualitative comparison of DDAUnet with the other CNNs for three slices from three distinct patients. 2D DSC values are shown in yellow. The manual delineation and the network results are shown by green and red contours, respectively.

VII. CONCLUSION
In this study, we collected a large set of CT scans from 288 distinct patients with esophageal cancer. To the best of our knowledge this is the largest dataset in the esophageal tumor segmentation literature to date. We showed that despite the difficulties raised by the poor contrast of esophageal tumors with respect to their neighbouring tissues, the variation in tumor shape and location, and the presence of air pockets and foreign bodies, the proposed method, dubbed Dilated Dense Attention Unet (DDAUnet), could segment the gross tumor volume with a mean surface distance of . ± . mm.

VIII. ACKNOWLEDGMENTS
Femke P. Peters is acknowledged for delineation of the data.

Fig. 8: Results analysis for DDAUnet: DSC, MSD and 95% HD boxplots on the test data for patients with or without an air pocket, a feeding tube in the esophagus lumen, a junction tumor, a tumor volume larger than 30 cc (defined by the median split of the GTV volumes), a hiatal hernia, a dislocated esophagus, or a proximal tumor. Outliers larger than 20 for MSD and 60 for 95% HD are not shown. The number of scans for each boxplot is shown in parentheses below each plot.

REFERENCES

[1] Peter C Enzinger and Robert J Mayer. Esophageal cancer. New England Journal of Medicine, 349(23):2241–2252, 2003.
[2] Jacques Ferlay, Isabelle Soerjomataram, Rajesh Dikshit, Sultan Eser, Colin Mathers, Marise Rebelo, Donald Maxwell Parkin, David Forman, and Freddie Bray. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. International Journal of Cancer, 136(5):E359–E386, 2015.
[3] Marcelo Mamede, Paula Abreu-e-Lima, Maria Raquel Oliva, Vânia Nosé, Harvey Mamon, and Victor H Gerbaudo. FDG-PET/CT tumor segmentation-derived indices of metabolic activity to assess response to neoadjuvant therapy and progression-free survival in esophageal cancer: correlation with histopathology results. American Journal of Clinical Oncology, 30(4):377–388, 2007.
[4] Marlies E Nowee, Francine EM Voncken, Alexis NTJ Kotte, Lucas Goense, Peter SN van Rossum, Astrid LHMW van Lier, Stijn W Heijmink, Berthe MP Aleman, et al. Gross tumour delineation on computed tomography and positron emission tomography-computed tomography in oesophageal cancer: A nationwide study. Clinical and Translational Radiation Oncology, 14:33–39, 2019.
[5] Thomas R Charles, John G Hunter, and Blair AA Jobe. Esophageal cancer: principles and practice. Demos Medical Publishing, 2009.
[6] Mikael Rousson, Ying Bai, Chenyang Xu, and Frank Sauer. Probabilistic minimal path for automated esophagus segmentation. In Medical Imaging 2006: Image Processing, volume 6144, page 614449. International Society for Optics and Photonics, 2006.
[7] Frederiek M Lever, Irene M Lips, Sjoerd PM Crijns, Onne Reerink, Astrid LHMW van Lier, Marinus A Moerland, Marco van Vulpen, and Gert J Meijer. Quantification of esophageal tumor motion on cine-magnetic resonance imaging. International Journal of Radiation Oncology* Biology* Physics, 88(2):419–424, 2014.
[8] Dakai Jin, Dazhou Guo, Tsung-Ying Ho, Adam P Harrison, Jing Xiao, Chen-kan Tseng, and Le Lu. Deep esophageal clinical target volume delineation using encoded 3D spatial context of tumors, lymph nodes, and organs at risk. In MICCAI, pages 603–612. Springer, 2019.
[9] Di-Xiu Xue, Rong Zhang, Yuan-Yuan Zhao, Jian-Ming Xu, and Ya-Lei Wang. Fully convolutional networks with double-label for esophageal cancer image segmentation by self-transfer learning. In Ninth International Conference on Digital Image Processing (ICDIP 2017), volume 10420, page 104202D. International Society for Optics and Photonics, 2017.
[10] Ying Liang, Diane Schott, Ying Zhang, Zhiwu Wang, Haidy Nasief, Eric Paulson, William Hall, Paul Knechtges, Beth Erickson, and X Allen Li. Auto-segmentation of pancreatic tumor in multi-parametric MRI using deep convolutional neural networks. Radiotherapy and Oncology, 145:193–200, 2020.
[11] Sahar Yousefi, Hessam Sokooti, Mohamed S Elmahdy, Femke P Peters, Mohammad T Manzuri Shalmani, Roel T Zinkstok, and Marius Staring. Esophageal gross tumor volume segmentation using a 3D convolutional neural network. In MICCAI, pages 343–351. Springer, 2018.
[12] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
[13] Andreas Fieselmann, Stefan Lautenschläger, Frank Deinzer, and Björn Poppe. Automatic detection of air holes inside the esophagus in CT images. In Bildverarbeitung für die Medizin 2008, pages 397–401. Springer, 2008.
[14] Andreas Fieselmann, Stefan Lautenschläger, Frank Deinzer, Matthias John, and Björn Poppe. Esophagus segmentation by spatially-constrained shape interpolation. In Bildverarbeitung für die Medizin 2008, pages 247–251. Springer, 2008.
[15] Johannes Feulner, S Kevin Zhou, Matthias Hammon, Sascha Seifert, Martin Huber, Dorin Comaniciu, Joachim Hornegger, and Alexander Cavallaro. A probabilistic model for automatic segmentation of the esophagus in 3-D CT scans. IEEE Transactions on Medical Imaging, 30(6):1252–1264, 2011.
[16] Tzung-Chi Huang, Geoffrey Zhang, Thomas Guerrero, George Starkschall, Kan-Ping Lin, and Ken Forster. Semi-automated CT segmentation using optic flow and Fourier interpolation techniques. Computer Methods and Programs in Biomedicine, 84(2-3):124–134, 2006.
[17] Johannes Feulner, S Kevin Zhou, Alexander Cavallaro, Sascha Seifert, Joachim Hornegger, and Dorin Comaniciu. Fast automatic segmentation of the esophagus from 3D CT data using a probabilistic model. In MICCAI, pages 255–262. Springer, 2009.
[18] Johannes Feulner, S Kevin Zhou, Martin Huber, Alexander Cavallaro, Joachim Hornegger, and Dorin Comaniciu. Model-based esophagus segmentation from CT scans using a spatial probability map. In MICCAI, pages 95–102. Springer, 2010.
[19] Sila Kurugol, Necmiye Ozay, Jennifer G Dy, Gregory C Sharp, and Dana H Brooks. Locally deformable shape model to improve 3D level set based esophagus segmentation. In ICPR 2010, pages 3955–3958. IEEE, 2010.
[20] Sila Kurugol, Erhan Bas, Deniz Erdogmus, Jennifer G Dy, Gregory C Sharp, and Dana H Brooks. Centerline extraction with principal curve tracing to improve 3D level set esophagus segmentation in CT images. In EMBC 2011, pages 3403–3406. IEEE, 2011.
[21] Jinzhong Yang, Benjamin Haas, Raymond Fang, Beth M Beadle, Adam S Garden, Zhongxing Liao, Lifei Zhang, Peter Balter, et al. Atlas ranking and selection for automatic segmentation of the esophagus from CT scans. Physics in Medicine & Biology, 62(23):9140, 2017.
[22] Nicola Pezzotti, Sahar Yousefi, Mohamed S Elmahdy, Jeroen van Gemert, Christophe Schülke, Mariya Doneva, Tim Nielsen, Sergey Kastryulin, Boudewijn PF Lelieveldt, Matthias JP van Osch, Elwin de Weerdt, and Marius Staring. An adaptive intelligence algorithm for undersampled knee MRI reconstruction. IEEE Access, pages 204825–204838, 2020.
[23] Sahar Yousefi, Lydiane Hirschler, Merlijn van der Plas, Mohamed S Elmahdy, Hessam Sokooti, Matthias van Osch, and Marius Staring. Fast dynamic perfusion and angiography reconstruction using an end-to-end 3D convolutional neural network. In MLMRI, pages 25–35. Springer, 2019.
[24] Mohamed S Elmahdy, Thyrza Jagt, Sahar Yousefi, Hessam Sokooti, Roel Zinkstok, Mischa Hoogeman, and Marius Staring. Evaluation of multi-metric registration for online adaptive proton therapy of prostate cancer. In International Workshop on Biomedical Image Registration, pages 94–104. Springer, 2018.
[25] Mohamed S Elmahdy, Thyrza Jagt, Roel Th Zinkstok, Yuchuan Qiao, Rahil Shahzad, Hessam Sokooti, Sahar Yousefi, Luca Incrocci, CAM Marijnen, Mischa Hoogeman, et al. Robust contour propagation using deep learning and image registration for online adaptive proton therapy of prostate cancer. Medical Physics, 46(8):3329–3343, 2019.
[26] Tobias Fechter, Sonja Adebahr, Dimos Baltas, Ismail Ben Ayed, Christian Desrosiers, and Jose Dolz. A 3D fully convolutional neural network and a random walker to segment the esophagus in CT. Journal of Medical Physics, 2017.
[27] Roger Trullo, Caroline Petitjean, Dong Nie, Dinggang Shen, and Su Ruan. Fully automated esophagus segmentation with a hierarchical deep learning approach. In ICSIPA 2017, pages 503–506. IEEE, 2017.
[28] Zhaojun Hao, Jiwei Liu, and Jianfei Liu. Esophagus tumor segmentation using fully convolutional neural network and graph cut. In Chinese Intelligent Systems Conference, pages 413–420. Springer, 2017.
[29] Dakai Jin, Dazhou Guo, Tsung-Ying Ho, Adam P Harrison, Jing Xiao, Chen-kan Tseng, and Le Lu. Accurate esophageal gross tumor volume segmentation in PET/CT using two-stream chained 3D deep network fusion. In MICCAI, pages 182–191. Springer, 2019.
[30] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2014.
[32] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In , pages 565–571. IEEE, 2016.
[33] Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Eric Granger, Jose Dolz, and Ismail Ben Ayed. Boundary loss for highly unbalanced segmentation. In International Conference on Medical Imaging with Deep Learning, pages 285–296, 2019.
[34] Francesco Caliva, Claudia Iriondo, Alejandro Morales Martinez, Sharmila Majumdar, and Valentina Pedoia. Distance map loss penalty term for semantic segmentation. MIDL, 2019.
[35] Pei Wang and Albert CS Chung. Focal Dice loss and image dilation for brain tumor segmentation. In