Esophageal Tumor Segmentation in CT Images using Dilated Dense Attention Unet (DDAUnet)
Sahar Yousefi, Hessam Sokooti, Mohamed S. Elmahdy, Irene M. Lips, Mohammad T. Manzuri Shalmani, Roel T. Zinkstok, Frank J.W.M. Dankers, Marius Staring
Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands; Department of Radiotherapy, Leiden University Medical Center, Leiden, The Netherlands; Department of Computer Engineering, Sharif University of Technology, Tehran, Iran; Department of Radiation Oncology, Netherlands Cancer Institute, Amsterdam, The Netherlands; Delft University of Technology, Delft, The Netherlands. Contact: s.yousefi[email protected]
Abstract—Manual or automatic delineation of the esophageal tumor in CT images is known to be very challenging. This is due to the low contrast between the tumor and adjacent tissues, the anatomical variation of the esophagus, as well as the occasional presence of foreign bodies (e.g. feeding tubes). Physicians therefore usually exploit additional knowledge such as endoscopic findings, clinical history, and additional imaging modalities like PET scans. Obtaining this additional information is time-consuming, while the results are error-prone and might lead to non-deterministic results. In this paper we aim to investigate if and to what extent a simplified clinical workflow based on CT alone allows one to automatically segment the esophageal tumor with sufficient quality. For this purpose, we present a fully automatic end-to-end esophageal tumor segmentation method based on convolutional neural networks (CNNs). The proposed network, called Dilated Dense Attention Unet (DDAUnet), leverages spatial and channel attention gates in each dense block to selectively concentrate on determinant feature maps and regions. Dilated convolutional layers are used to manage GPU memory and increase the network receptive field. We collected a dataset of 792 scans from 288 distinct patients including varying anatomies with air pockets, feeding tubes and proximal tumors. Repeatability and reproducibility studies were conducted for three distinct splits of training and validation sets. The proposed network achieved a DSC value of . ± . , a mean surface distance of . ± . mm and a Hausdorff distance of . ± . mm for 287 test scans, demonstrating promising results with a simplified clinical workflow based on CT alone. Our code is publicly available via https://github.com/yousefis/DenseUnet_Esophagus_Segmentation.

Index Terms—Esophageal tumor segmentation, CT images, densely connected pattern, UNet, dilated convolutional layer, attention gate
I. INTRODUCTION
Esophageal cancer is one of the least studied cancers [1], while it is lethal in most patients [2]. Because of the very poor survival rate, three standard treatment options are available, i.e. chemoradiotherapy (CRT), neoadjuvant CRT followed by surgical resection, or radical radiotherapy [3]. For this purpose, rapid and accurate delineation of the target volume in CT images plays a very important role in therapy and disease control. The complexities raised by automatic esophageal tumor delineation in CT images can be divided into several categories: textural similarities and the absence of contrast between the tumor and its adjacent tissues; the anatomical variation of different patients, either intrinsic or caused by a disease, like a hiatal hernia in which part of the stomach bulges into the chest cavity through an opening of the diaphragm (see Figure 1-(e) and (i)), extension of the tumor into the stomach, or existence of an air pocket inside the esophagus; and the existence of foreign bodies during the treatment, like a feeding tube or surgical clips inside the esophageal lumen. Figure 1 illustrates some of the challenging cases. These difficulties lead to a high degree of uncertainty associated with the target volume of the tumor, especially at the cranial and caudal border of the tumor [4]. In order to overcome these complexities, physicians integrate CT imaging with the clinical history, endoscopic findings, endoscopic ultrasound, and other imaging modalities such as positron-emission tomography (PET) [5]. Obtaining these additional modalities is however a time-consuming and expensive process. Moreover, the process of manual delineation is a repetitive and tedious task, and often there is a lack of consensus on how to best segment the tumor from normal tissue. Despite using additional modalities and expert knowledge, the process of manual delineation still remains an ill-posed problem [6]. Nowee et al.
[4] assessed manual delineation variability of the gross tumor volume (GTV) between using CT and combined 18F-fluorodeoxyglucose PET (FDG-PET) [7] and CT in esophageal cancer patients in a multi-institutional study by 20 observers. They concluded that the use of PET images can significantly influence the delineated volume in some cases; however, its impact on observer variation is limited.

In this paper we aim to investigate if a simplified clinical workflow based on CT scans alone allows one to automatically delineate the esophageal GTV with acceptable quality. Recently, there has been a revived interest in automating this process for both the esophagus and the esophageal tumor based on CT images alone [8], [9], [10]. Our earlier work [11] leveraged the idea of dense blocks proposed by [12], arranging them in a typical U-shape.

Fig. 1: Variations in shape and location of the tumor. Green contours show the manual delineation of the GTVs. (a) normal junction of esophagus and stomach, (b) hiatal hernia (type I): migration of the esophagogastric junction through the gap in the cranial direction, (c) hiatal hernia (type II): migration of the esophagogastric junction in the chest, (d) proximal tumor including an air pocket and feeding tube, (e) proximal tumor including an air pocket, (f) tumor including an air pocket, (g,h) junction tumor (extension of the tumor into the stomach) including an air pocket, (i) tumor including an air pocket and feeding tube, (j) relocation of the esophagus to the left of the aorta, (k) a variety in the shape of the tumor, (l) junction tumor including an air pocket.

In that study, the proposed method was trained and tested on 553 chest CT scans from 49 distinct patients and achieved a DSC value of . ± . , and a 95% mean surface distance (MSD) of . ± . mm for the test scans. Eight of the 85 scans in the test set had a DSC value lower than .
, caused by the presence of air cavities and foreign bodies in the GTV, which were rarely seen in the training data. In order to enhance the robustness of the network, in the present study we extended that work. The main contributions of this paper are as follows:

1) We propose an end-to-end CNN for esophageal GTV segmentation on CT scans. Different from much of the previous work, which addressed segmentation of the esophagus itself, we focus on the more challenging tumor area (GTV). The proposed method is end-to-end, without intricate pre- or post-processing steps, and uses no information in addition to the CT scans;

2) We introduce dilated dense attention blocks which leverage spatial and channel attention to emphasize the GTV-related features. Also, dilation layers are used to support an exponential expansion of the receptive field and decrease the size of the network without loss of resolution;

3) We collected a dataset of 288 distinct patients (792 scans). The dataset includes different varieties of anatomies, and the presence of foreign bodies and air pockets in the esophageal lumen. In this study, all patients received either neoadjuvant or definitive chemoradiotherapy treatment options. To the best of our knowledge, none of the related works have addressed such a comprehensive dataset.

The initial results of this work were presented in [11]. The current paper includes a larger and more diverse dataset, and a more elaborate evaluation. Also, we leverage dilated convolutional layers and attention gates in order to increase the receptive field and filter tumor-relevant features.

II. RELATED WORK
Most automatic esophagus segmentation approaches have used either a shape or appearance model to guide the segmentation, where training such a guidance model is complicated. Rousson et al. proposed a two-stage probabilistic shortest path approach to segment the esophagus from 2D CT images [6]. In the first stage, the aorta and left atrium are segmented and then registered to reference shapes in order to find a region of interest (ROI). In the second stage, the optimal esophagus centerline is extracted using the shortest path algorithm. Fieselmann et al. proposed an automatic approach for segmenting the esophagus by detecting the air cavities that often constitute the esophagus [13]. For reducing the time complexity they confined all the computations to an ROI. Also, they proposed another method based on spatially-constrained shape interpolation in order to segment the esophagus in 2D CT images [14]. In that investigation, two assumptions are considered: the shape of the esophagus changes smoothly, and there is no intersection between the esophagus and the other organs.

In [15] a multi-step approach based on probabilistic models has been proposed to segment the esophagus on 3D CT scans. In that work, a pre-processing step is used to extract an ROI. Then, a discriminative learning technique is applied to label the voxels. In [16] an optical flow approach for semi-automated segmentation of CT images is used, where manually drawn curvature points are extended to contours by Fourier interpolation and afterwards, optical flow is used for registering the original contour to the other slices. This method is not only highly user-interactive, but it also fails when the region to contour is topologically different between two slices. Feulner et al. proposed a multi-step approach based on probabilistic models for automatic segmentation of the esophagus in 3D CT scans [17]. In that work, by running a discriminative model for each axial slice, a set of approximated esophagus contours is extracted. Then, the contours are clustered and merged and afterwards, a Markov chain model is used for finding the most probable path through the axial slices. Ultimately, another discriminative model is used for refining the result. This approach only works for a manually selected ROI. The manual selection of an ROI was later extended to automatic ROI detection by a salient landmark on the chest [18]. Kurugol et al. presented a 3D level set model for segmenting the esophagus over the entire thoracic range employing a shape model, with a global and a locally deformable component [19].
In their work, an initial centerline estimation is required, where an ad-hoc centerline estimator was used, which was only performed at the ROI of some predefined anatomical landmarks followed by interpolation for the remaining slices. Later, they extended their work by using prior spatial and appearance models estimated from the training set instead of using the ad-hoc estimator [20]. In [21], a two-phase online atlas-based approach was proposed to rank and select a subset of optimal atlas candidates for segmentation of the esophagus on CT scans. Atlas-based approaches face some restrictions, including the selection of optimal atlases and a correct representation of the study population.

Deep learning for medical image analysis has attracted broad attention in recent years [22], [23], [24], [25]. However, this technique has been used only to a limited extent for esophagus segmentation, and even less for esophageal tumor segmentation. In [26], a fully convolutional neural network (FCNN) for segmenting the esophagus on 3D CT was proposed, covering the region between the bottom-most part of the heart and the topmost part of the stomach. For refining the results, an active contour model and a random walker were used as post-processing steps. In that study, 50 scans were used as the training set and 20 scans as the test set. An average
DSC value of 0.76 ± et al. [27]. The firststage performs a multi-organ segmentation in order to extractan ROI including the esophagus. Then the manually croppedROI is fed to the second network to segment the esophagus. A DSC value of 0.72 ± et al. [28]. Then theyapplied a graph cut for segmenting the tumor. They reportedan average DSC value of 0.75 ± et al. [8] introduced a spatial-context encodeddeep esophageal clinical target volume (CTV) delineationframework to produce superior margin-based CTV boundaries.That work in an expensive pre-processing step encodes spatialcontext by computing the signed distance transform maps(SDMs) of the GTV, lymph nodes (LNs) and organs at risks(OARs) and then feeds the results with the CT image into a 3DCNN. In another work Jin et al. [29] proposed a two-streamchained 3D CNN fusion pipeline to segment esophageal GTVsusing both CT and PET+CT scans. They evaluated theirapproach by conducting a 5-fold cross validation on scansof 110 patients. They reported that using PET images ascomplementary information can improve the DSC score from . ± . to . ± . . Although reasonable results can beobtained in the approaches mentioned earlier, the problem ofesophageal GTV segmentation in CT modalities without extraknowledge constraining the problem is known as an ill-posedproblem and remains challenging [9].Most of the mentioned works addressed esophagus seg-mentation and not esophageal tumor segmentation. However,esophageal tumor segmentation is a more complicated task dueto the poor contrast of the tumor with respect to its adjacenttissues. It is especially difficult to define the start and end ofthe tumor without additional information such as endoscopicfindings. III. T HE PROPOSED METHOD
A. Network architecture
Figure 2 shows a schematic of the proposed network, dubbed Dilated Dense Attention Unet (DDAUnet). The network is composed of three levels: a down-sampling path for extracting contextual features and an up-sampling path for retrieving the resolution lost during extraction. In each level, different from our prior work [11], we used dilated dense spatial and channel attention blocks (DDSCABs), each composed of a dilated dense block (DDB), a spatial attention gate and a channel attention gate, denoted SpA and ChA1 respectively. Using loop connectivity patterns between the layers in the DDSCAB blocks provides deep supervision by re-using the feature maps, while the dilated layers increase the receptive field exponentially. Spatial attention gates are used in the main building blocks, and encourage the network to concentrate on extracting features from the tumor adjacency. The channel attention gates are used in the skip connections between the contracting and expanding paths of the Unet (named ChA2), for filtering irrelevant feature maps to improve the training process. The proposed network DDAUnet does not include ChA1; this block is only used in some of the experiments during the optimization of the network configuration. In Section IV the performance of DDAUnet will be compared with DUnet [11]; the dilated dense Unet (DDUnet), which is DUnet with dilated convolutional layers in the dense blocks;
DDAUnet without ChA2, i.e. DDAUnet-noChA2; DDAUnet with ChA1 and without SpA and ChA2, i.e. DDAUnet-noSpA-plusChA1-noChA2; and DDAUnet with ChA1 and without ChA2, i.e. DDAUnet-plusChA1-noChA2.

According to [30], the incorporation of a stack of convolutional layers with small receptive fields in the first layers, rather than a few layers with large receptive fields, decreases the number of parameters, increases the non-linearity of the network, and consequently makes training of the network easier. These layers aid the network to extract significant features before applying convolutional operations with a wider receptive field in the DDSCABs. Therefore, the network starts with two consecutive (3×3×3) conv_{d=1, p=true} + BN + ReLU layers, in which (3×3×3), d and p indicate the kernel size, dilation factor and padding of the convolutional layer, respectively. Also, BN and ReLU denote batch normalization and a rectified linear unit layer, respectively.

Afterwards, the network is followed by a DDSCAB, composed of a dilated dense block (DDB) and spatial and channel attention gates. For each DDB, R is the number of sub-DDBs. In each sub-DDB, there are R (3×3×3) conv_{d=2, p=true} + BN + ReLU and R (1×1×1) conv_{d=1, p=true} + BN + ReLU layers. The output of a DDB is the concatenation of all preceding sub-DDBs. In our prior work, it has been shown that the loop connectivity patterns in dense blocks assist the network to perform better [11]. In DDBs, (1×1×1) conv layers are used as bottleneck layers, which compress the number of feature maps and thus improve computational efficiency [12]. In this paper, the feature maps in each DDB are compressed by a compression coefficient of θ. The output of a DDB is then fed to spatial and channel attention gates in order to selectively filter the GTV-irrelevant spatial features and feature maps respectively, which leads to an improved training process. In the down-sampling path, the DDSCABs are followed by (1×1×1) conv_{d=1, p=true} + BN + ReLU.
Using (1×1×1) convolutional layers does not affect the receptive field of the network, but increases the non-linearity in between layers [31]. At the end of the down-sampling path and in the up-sampling path, every DDSCAB is followed by (3×3×3) conv_{d=1} + BN + ReLU. In Section IV we will investigate the effect of deploying spatial and channel gates in the DDSCABs, and will see that utilizing only the spatial gate is more effective. Also, the skip connections between the contracting and expanding paths are equipped with channel attention gates to filter the irrelevant feature maps. Later we will show that leveraging the spatial and channel attention gates aids the network to end up with better segmentation results. The network is finalized by a convolutional layer with linear activation and a soft-max layer to compute a probabilistic output. The probabilistic output can be classified into tumor and non-tumor regions. The skip connections between the down-sampling and up-sampling paths denote cropped concatenation of the feature maps of the corresponding down-sampling and up-sampling levels.
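The effect of stacking small and dilated kernels on the receptive field, as described above, can be computed directly. The following sketch is our own illustration, not the paper's code; it assumes stride 1 and counts the receptive field along one axis:

```python
def receptive_field(layers):
    """1D receptive field of a stack of convolutional layers.

    `layers` is a list of (kernel_size, dilation) pairs; stride 1 is
    assumed, as in the dense blocks above. A dilated kernel of size k
    and dilation d covers d*(k-1)+1 input positions.
    """
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)  # each layer widens the field by d*(k-1)
    return rf

# Two initial 3x3x3 convolutions (d=1), as at the start of the network:
print(receptive_field([(3, 1), (3, 1)]))                   # -> 5
# Adding one dilated 3x3x3 convolution (d=2), as in a sub-DDB:
print(receptive_field([(3, 1), (3, 1), (3, 2)]))           # -> 9
# A (1x1x1) bottleneck layer leaves the receptive field unchanged:
print(receptive_field([(3, 1), (3, 1), (3, 2), (1, 1)]))   # -> 9
```

A single d=2 layer widens the receptive field as much as two extra plain 3×3×3 layers would, without their parameter cost, which is the motivation for dilating the dense blocks.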
B. Loss function
In this work, similar to our prior work [11], we used the Dice coefficient as our main loss function [32]:

$$\mathrm{DSC}_{\mathrm{GTV}} = \frac{2\sum_{i}^{N} s_i g_i}{\sum_{i}^{N} s_i + \sum_{i}^{N} g_i}, \qquad (1)$$

where $s_i \in S$ is the binary segmentation of the GTV predicted by the network and $g_i \in G$ is the ground truth segmentation. We investigated different combinations of loss functions, including boundary loss [33], distance map loss [34], and focal Dice [35]. In [33] it is shown that the boundary loss can be approximated by:

$$\mathcal{L}_B(\theta) = \int_{\Omega} \phi_G(q)\, s_\theta(q)\, dq, \qquad (2)$$

where $\phi_G$ and $s_\theta$ denote the level set representation of the boundary of the ground truth, and the network output, respectively. In Section IV it will be discussed that the combination of Dice and boundary loss works best for this problem.

IV. DATA, TRAINING DETAILS AND EVALUATION
A. Dataset
All patients of this study received one of the following two treatments:
1) Neoadjuvant chemo-radiotherapy (CRT) followed by surgical resection. The radiotherapy is 23 × .
2) Definitive chemoradiotherapy.

The scans have × × pixels and an average voxel thickness of . × . × mm, and were re-sampled to a voxel size of × × mm in this study.

Fig. 2: The architecture of the proposed method. DDSCAB and DDB stand for dilated dense spatial and channel attention block and dilated dense block, respectively. R is the number of sub-DDBs. ChA1, ChA2, and SpA denote the channel attention gate inside the DDSCAB block, the channel attention gate located on the skip connections, and the spatial attention gate inside the DDSCAB, respectively. ChA1, shown transparently here, is not included in the final network (DDAUnet), but is used in some of the experiments.

B. Training details
In this work, the proposed network, which contains 65k trainable parameters, has been implemented in Google's TensorFlow, and experiments were carried out using an NVIDIA Quadro RTX 6000 with 24 GB of GPU memory. For all networks, the patch extraction process has been implemented with multi-threaded programming, in which fetching the images into RAM, extracting the patches from the fetched images and feeding the extracted patches to the GPU are done concurrently. This multi-threading technique speeds up the patch extraction process. The input patches have been augmented by white noise drawn from a Gaussian distribution N(µ′, σ′), in which µ′ = 0 is the mean of the distribution and σ′ is the standard deviation, which is selected randomly between 0 and 5. During testing, the fully convolutional nature of the network is used, with zero padding to yield equal output size. For managing the GPU memory with a larger input patch, we use a batch size of seven.

For designing the best configuration of the network, we performed experiments comparing different architectures and loss functions. Datasets I and II were divided randomly into three distinct sets, detailed in Table II. The optimization of the network is performed on the validation set. The test set is excluded from the model optimization and kept independent for the final evaluation. After choosing the best configuration of the network, the final model is trained for two more random splits of the training and validation sets, resulting in three trained models. In the end, the average of the final results for the chosen network, trained on the three splits, is reported on the test set.

In Section V the optimization of the network configuration will be discussed on the validation set.
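The Gaussian-noise patch augmentation described above can be sketched as follows. This is a minimal NumPy illustration under our own naming (`augment_patch`, the dummy patch size), not the paper's TensorFlow pipeline:

```python
import numpy as np

def augment_patch(patch, rng, sigma_max=5.0):
    """Add zero-mean white Gaussian noise to an input patch.

    The standard deviation sigma' is drawn uniformly from [0, sigma_max),
    matching the augmentation described in the text (mu' = 0,
    sigma' selected randomly between 0 and 5).
    """
    sigma = rng.uniform(0.0, sigma_max)
    noise = rng.normal(loc=0.0, scale=sigma, size=patch.shape)
    return patch + noise

rng = np.random.default_rng(seed=42)
patch = np.zeros((16, 16, 16), dtype=np.float32)  # dummy CT patch
noisy = augment_patch(patch, rng)
print(noisy.shape)  # (16, 16, 16): the patch shape is preserved
```

Drawing σ′ anew for every patch means the network sees the same anatomy at varying noise levels, which acts as a cheap regularizer.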
Then the best network is trained with different linear combinations of the loss functions, including Dice, boundary loss, distance map, and focal loss. Then, in Section V-B, for reproducibility, the results of the best configuration of the network will be reported for two more distinct and random splits.

C. Evaluation measures
For evaluating the results we report the DSC value (see Section III-B), the MSD and the Hausdorff distance (HD), which are defined as:

$$\mathrm{MSD} = \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n} d(a_i, G) + \frac{1}{m}\sum_{j=1}^{m} d(b_j, S)\right), \qquad (3)$$

$$\mathrm{HD} = \max\left\{\max_i\{d(a_i, G)\},\ \max_j\{d(b_j, S)\}\right\}, \qquad (4)$$

in which $S$ and $G$ are the predicted and ground truth contours, $\{a_1, \ldots, a_n\}$ and $\{b_1, \ldots, b_m\}$ the surface mesh points of $S$ and $G$ respectively, and $d(a_i, G) = \min_j \|b_j - a_i\|$. For the Hausdorff distance we report the 95th percentile instead of the maximum, for robustness against outliers. Since defining the slices where the tumor starts and stops is difficult even for medical doctors, we report the perpendicular cranial and caudal distance between the output of the CNNs and the ground truth. The cranial distance (CrD) error is computed as the topmost slice number of the ground truth minus the topmost slice number of the CNN prediction; the caudal distance (CaD) error is computed similarly.

V. EXPERIMENTAL RESULTS
In this section the experimental results are reported, with the datasets divided into training, validation and test sets as described in Section IV-B. Model optimization experiments are described in Section V-A, where the comparison is performed on the validation set. Subsequently, the final results are reported on the test set in Section V-B. For all experiments, we extract the largest component of the network prediction using connected component analysis, and report that. A repeated measures one-way ANOVA test was performed on the Dice values using a significance level of p = 0.05.
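For reference, the evaluation measures of Section IV-C can be computed for toy examples as below. This is a brute-force NumPy sketch of Eq. (1), (3) and (4) under our own naming; real surface meshes would require an efficient nearest-neighbour search rather than a full pairwise distance matrix:

```python
import numpy as np

def dsc(s, g):
    """Dice similarity coefficient between two binary masks, Eq. (1)."""
    s, g = s.astype(bool), g.astype(bool)
    return 2.0 * np.logical_and(s, g).sum() / (s.sum() + g.sum())

def surface_distances(a, b, percentile=95):
    """MSD (Eq. 3) and percentile Hausdorff distance (Eq. 4) between two
    point sets `a` (n x 3) and `b` (m x 3) of surface mesh coordinates."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_ab = d.min(axis=1)   # distance of each point of a to surface b
    d_ba = d.min(axis=0)   # distance of each point of b to surface a
    msd = 0.5 * (d_ab.mean() + d_ba.mean())
    hd = max(np.percentile(d_ab, percentile), np.percentile(d_ba, percentile))
    return msd, hd

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
print(dsc(mask, mask))  # 1.0 for a perfect overlap

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = pts + np.array([0.0, 2.0, 0.0])
msd, hd = surface_distances(pts, shifted)
print(msd, hd)  # 2.0 2.0: the two surfaces lie 2 mm apart everywhere
```

Taking the 95th percentile in place of the maximum, as in the text, makes the Hausdorff measure insensitive to a few outlier surface points.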
TABLE I: Details of the dataset.

dataset   # of patients       # of scans   Type                             Time period   Treatment plan
I         21 (A)              525          5 time-points                    2012-2014     A: Neoadj. CRT
                                           (2 time-points: 3D scans;
                                           3 time-points: 3D + 4D scans)
II        162 (A) + 105 (B)   267          3D                               2014-2019     A: Neoadj. CRT
                                                                                          B: def. CRT

TABLE II: Data split into training, validation and testing sets. P and S denote distinct patients and scans.

dataset      P/S   DB I   DB II   total
training      P      13     182     195
              S     325     182     507
validation    P       2      23      25
              S      50      23      73
testing       P       6      62      68
              S     150      62     212
total         P      21     267     288
              S     525     267     792

TABLE III: AUC for the networks on the validation set.

model                            AUC
DUnet                            0.49
DDUnet                           0.53
DDAUnet-noChA2                   0.73
DDAUnet-plusChA1-noChA2          0.63
DDAUnet-noSpA-plusChA1-noChA2    0.71
DDAUnet                          0.76
A. Model optimization
Figure 3 shows boxplots of DSC, MSD, 95%HD, cumulative frequency (%) of DSC, and perpendicular cranial and caudal distance errors for different configurations of the CNN models using the DSC loss function. Since the channel attention gates inside the DDSCAB block, i.e. ChA1 in Fig. 2, did not improve the results, these are not used in the final configuration. The results show that DDAUnet outperforms the other network configurations significantly.

Figure 4 shows the precision-recall curves for the networks on the validation set using the DSC loss function. The precision and recall were calculated with different threshold values applied to the probabilistic output of the networks. For acquiring the final segmentation we used a threshold of 0.5. Table III tabulates the values of the area under the curve (AUC) for the networks on the validation set. The AUC for DDAUnet is the largest, and we choose this method as the final network architecture.

We experimented with different combinations of loss functions, including Dice, boundary loss, distance map loss and focal Dice, on the validation set. Figure 5 shows the results. The results show that DDAUnet using the DSC + BL loss function outperforms the other loss functions significantly.
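The precision-recall sweep behind Figure 4 can be outlined as follows. This is a NumPy sketch with hypothetical names; as in the paper, thresholds are applied to the probabilistic soft-max output, with 0.5 used for the final segmentation:

```python
import numpy as np

def precision_recall(prob, gt, thresholds):
    """Precision/recall pairs of a probabilistic segmentation against a
    binary ground truth, one (precision, recall) pair per threshold."""
    gt = gt.astype(bool)
    curve = []
    for t in thresholds:
        pred = prob >= t                       # binarize at threshold t
        tp = np.logical_and(pred, gt).sum()    # true positive voxels
        precision = tp / pred.sum() if pred.sum() else 1.0
        recall = tp / gt.sum()
        curve.append((float(precision), float(recall)))
    return curve

prob = np.array([0.9, 0.8, 0.6, 0.3, 0.1])  # toy probabilistic output
gt = np.array([1, 1, 0, 1, 0])              # toy ground truth labels
p, r = precision_recall(prob, gt, thresholds=[0.5])[0]
print(p, r)  # precision = recall = 2/3 at t = 0.5
```

Sweeping `thresholds` over (0, 1) traces one curve per network, and the area under it gives the AUC values compared in Table III.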
B. Final results
As explained before, repeatability and reproducibility studies were conducted for three distinct and random splits of the training and validation sets. Table IV shows the results on the independent test set after applying the largest component analysis. Figure 6 shows example results of the final network for some patients with different shape varieties and difficulties raised by the presence of air pockets or feeding tubes. The 2D DSC values are shown in yellow. Figure 7 shows a qualitative comparison between the different CNNs.

In order to study the strengths and weaknesses of the final model, we manually labeled each scan with the following properties: the presence of air pockets in the esophagus, the presence of a feeding tube in the esophageal lumen, the tumor is a junction tumor, the tumor volume is larger than 30 cc (defined by the median split of the GTV volumes), the patient has a hiatal hernia, the tumor is in a dislocated esophagus, and the tumor is located in the proximal esophagus (proximal tumor). Figure 8 shows the results of the
DSC value, MSD and 95%HD for the mentioned tags for the final network on the test set. The results show that the network works better for patients with an absence of air pockets, feeding tubes, or junction tumors. This may be caused by the different varieties raised by the existence of air pockets or foreign bodies. Also, the number of patients with a hiatal hernia, a dislocated esophagus or a proximal tumor is relatively small in the test set, not allowing us to draw a conclusion.

VI. DISCUSSION
Esophageal GTV segmentation is not a trivial problem, due to the difficulties raised by the poor contrast with respect to its vicinity. Most research addressed only segmentation of the esophagus, while esophageal GTV segmentation has been touched upon in few works. Since defining a correct start and end location (slice) of the tumor in the cranial-caudal direction based on CT images alone is not an easy task even for doctors, esophageal GTV segmentation is considered an ill-posed problem. In this paper, to address esophageal GTV segmentation we designed an efficient deep learning model.

In terms of the training data, we collected 792 CT scans from 288 patients diagnosed with an esophageal tumor. This dataset is the largest dataset among the present works addressing esophageal tumor segmentation. Training time for a single network was in the order of 5 days. The average inference time of the final network, for a cube of × × voxels, is . ± . seconds.

For tuning the proposed network, many experiments were performed in the present paper. We leveraged the DenseUnet network, already deployed in our prior work, as a baseline. In order to increase the receptive field of the network, the dense blocks were equipped with dilated convolutional layers, dubbed the DDUnet network. Then we leveraged attention mechanisms to encourage the network to selectively filter out GTV-irrelevant features. Three types of attention gates were utilized: i) a spatial attention gate in the dense blocks to filter out GTV-irrelevant features in the spatial domain of each feature map,
Fig. 3: A comparison between different network configurations (DUnet, DDUnet, DDAUnet-noSpA-plusChA1-noChA2, DDAUnet-noChA2, DDAUnet-plusChA1-noChA2 and DDAUnet) on the validation set, showing boxplots of DSC, MSD (mm), cumulative frequency of DSC (%), CrD (mm) and CaD (mm). The number of results with values larger than the maximum value on the vertical axis is shown on top of each plot. The stars in the DSC plot show statistical significance between DDAUnet and the other CNNs.

TABLE IV: Results for DDAUnet on the independent test set, with the combined Dice and boundary loss function. Per split, DSC, CrD (mm), CaD (mm), MSD (mm) and 95%HD (mm) are reported as µ ± σ.
Fig. 4: Precision-recall curves for the different network architectures on the validation set using the DSC loss function.

ii) a channel attention gate in the dense blocks to filter out irrelevant feature maps entirely, and iii) skip attention gates to filter out GTV-irrelevant feature maps between the contracting and expanding paths of the Unet. The experiments on the validation set showed that the architecture with the spatial attention and skip attention gates, dubbed DDAUnet, achieved the best result. Deploying channel attention inside the dense blocks (see Figure 2) might filter out the feature maps in early levels of the network before allowing it to extract fine features at the deeper levels. Channel attention in the skip connections filters out redundant or irrelevant feature maps during the retrieval of lost resolution. The optimized network architecture was further tuned using a large variety of loss functions, again on the validation set. Results showed that the summation of Dice and boundary loss performed best. Therefore, we introduce DDAUnet with the summation of Dice and boundary loss as the loss function of the final network.

We trained the final network for three random splits of the training and validation sets. The results on the test set showed an average DSC of . ± . , an MSD of . ± . mm, a 95%HD of . ± . mm, and cranial and caudal perpendicular distance errors of − . ± . mm and . ± . mm,
respectively. The cranial and caudal perpendicular distance errors between the ground truth and the network result show that the network overestimates at the top of the GTV by ~6.5 mm, and underestimates at the bottom of the GTV by ~3.5 mm. As the slice thickness of the data was 3 mm, this translates to approximately 2 and 1 slices on average, respectively. For alleviating this issue, incorporating auxiliary information could aid the network.

Fig. 5: A comparison between deploying different loss functions (DSC, DSC + DM, DSC + Focal, DSC + Focal + BL, DSC + BL) for DDAUnet on the validation set. The number of results with values larger than the maximum value on the vertical axis is shown on top of each plot. The stars in the DSC plot show statistical significance between DSC+BL and the other loss functions.

Although the datasets are not comparable, in [29] an average
DSC score of . ± . was obtained on scans of 110 patients, using 5-fold cross-validation. In [28] a DSC score of 0.75 ± . , a DSC value of . ± . , and a mean surface distance (MSD) of . ± . mm were reported for 85 CT scans from 13 distinct patients. In the work described in this paper, a higher DSC value was obtained.

Nowee et al. studied the inter-observer variability in esophageal tumor delineation, and found that this variability is mainly located at the cranial and caudal border [4]. They report a generalized conformity index for the GTV, a measure related to Dice overlap but for multiple observers, of 0.67. The human delineation variation in the cranial direction, defined as the standard deviation of the most proximal slice, was on average 9.9 mm, and 7.5 mm for the caudal direction. Although these measures are not the same as the measures reported in this paper, we cautiously conclude that the cranial and caudal errors of the proposed automatic method (see Table IV) are not far from human delineation variation.

Nowee et al. also investigated the impact of incorporating FDG-PET scans in the delineation process, and concluded that although it can influence the delineated volume significantly, its impact on observer variation was limited. As future work, we aim to study whether fusion of CT with FDG-PET can help the CNN improve the extracted features and subsequently the segmentation results.

For a closer inspection, we investigated the results on the independent test set for the final network. We labelled the patients in the test set with different tags, including the presence of air pockets, a feeding tube, a junction tumor, a tumor volume > 30 cc, a hiatal hernia, a dislocated esophagus, and a proximal tumor. Inspection of the final results (see Figure 8) showed that the network performed better for patients with an absence of air pockets, feeding tubes in the esophagus lumen, or junction tumors.
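The generalized conformity index reported by Nowee et al. [4] generalizes pairwise Dice overlap to multiple observers: the summed pairwise intersections divided by the summed pairwise unions. A minimal numpy sketch is shown below; the toy 1D masks and the function names are our own illustration, not taken from [4].

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def generalized_conformity_index(masks):
    """Sum of pairwise intersections over sum of pairwise unions,
    across all observer pairs."""
    inter_sum, union_sum = 0, 0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter_sum += np.logical_and(masks[i], masks[j]).sum()
            union_sum += np.logical_or(masks[i], masks[j]).sum()
    return inter_sum / union_sum

# Toy 1D "delineations" from three hypothetical observers
a = np.array([0, 1, 1, 1, 0, 0], bool)
b = np.array([0, 1, 1, 0, 0, 0], bool)
c = np.array([0, 0, 1, 1, 1, 0], bool)
print(round(dice(a, b), 3))                           # 0.8
print(round(generalized_conformity_index([a, b, c]), 3))  # 0.455
```

For two observers the generalized conformity index reduces to the Jaccard index, which is why it is related to, but numerically lower than, the corresponding Dice score.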
A lower performance was obtained for smaller tumors (< 30 cc), while the strength of the network for patients with a dislocated esophagus, a proximal tumor or a hiatal hernia could not be judged. Therefore, enriching the dataset with more patients with the mentioned properties would potentially improve the performance of the model. Also, incorporating endoscopic findings in the segmentation process could be considered as future work, to investigate whether this can help CNNs reduce errors, especially at the start and end of the GTV.

Fig. 6: Example results of the proposed method with the 2D DSC value in yellow. The manual delineation and the network results are shown by green and red contours, respectively. Panels: (a) normal esophageal GTV; (b) normal esophageal GTV; (c) GTV including an air pocket; (d) GTV in a dislocated esophagus; (e) junction GTV; (f) junction GTV; (g) GTV including an air pocket; (h) proximal GTV including an air pocket and a feeding tube; (i) proximal GTV including an air pocket and a feeding tube.
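The loss selected for the final network, the summation of Dice and boundary loss, can be sketched as follows. This is a minimal 1D numpy illustration; in the paper the inputs are 3D volumes and the boundary loss follows Kervadec et al. [33], which weights the predicted probabilities by a signed distance map precomputed from the ground-truth mask (negative inside the GTV). The function names, equal weighting of the two terms, and the toy arrays are our own assumptions.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a probability map and a binary target."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def boundary_loss(pred, signed_dist):
    """Boundary loss in the style of Kervadec et al.: mean predicted
    probability weighted by the signed distance to the GT boundary
    (negative inside the GTV, positive outside)."""
    return float((pred * signed_dist).mean())

def dsc_bl_loss(pred, target, signed_dist):
    """DSC + BL: plain summation of the two terms."""
    return soft_dice_loss(pred, target) + boundary_loss(pred, signed_dist)

# Toy 1D example; signed_dist would normally be precomputed from the
# ground-truth mask with a Euclidean distance transform.
pred = np.array([0.1, 0.9, 0.8, 0.2])
target = np.array([0.0, 1.0, 1.0, 0.0])
signed_dist = np.array([1.0, -1.0, -1.0, 1.0])  # hypothetical values
print(round(dsc_bl_loss(pred, target, signed_dist), 3))  # -0.2
```

Minimizing the boundary term pushes probability mass off voxels with positive distance (outside the GTV) and onto voxels with negative distance (inside), which is why it can reduce boundary errors that a pure overlap loss such as Dice tolerates. In practice the signed distance map can be derived with a Euclidean distance transform (e.g. scipy's `distance_transform_edt`) applied to the mask and its complement.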
Fig. 7: Qualitative comparison of DDAUnet with the other CNNs for three slices from three distinct patients. 2D DSC values are shown in yellow. The manual delineation and the network results are shown by green and red contours, respectively.

VII. CONCLUSION
In this study, we collected a large set of CT scans from 288 distinct patients with esophageal cancer. To the best of our knowledge this is the largest dataset in the esophageal tumor segmentation literature to date. We showed that despite the difficulties raised by the poor contrast of esophageal tumors with respect to their neighbouring tissues, the variation in tumor shape and location, and the presence of air pockets and foreign bodies, the proposed method, dubbed Dilated Dense Attention Unet (DDAUnet), could segment the gross tumor volume with a mean surface distance of . ± . mm.

VIII. ACKNOWLEDGMENTS
Femke P. Peters is acknowledged for delineation of the data.

Fig. 8: Results analysis for DDAUnet: DSC, MSD and 95% HD boxplots on the test data for patients with or without an air pocket, a feeding tube in the esophagus lumen, a junction tumor, a tumor volume larger than 30 cc (defined by the median split of the GTV volumes), a hiatal hernia, a dislocated esophagus, or a proximal tumor. Outliers larger than 20 for MSD and 60 for 95% HD are not shown. The number of scans for each boxplot is shown in parentheses below each plot.

REFERENCES

[1] Peter C Enzinger and Robert J Mayer. Esophageal cancer. New England Journal of Medicine, 349(23):2241–2252, 2003.
[2] Jacques Ferlay, Isabelle Soerjomataram, Rajesh Dikshit, Sultan Eser, Colin Mathers, Marise Rebelo, Donald Maxwell Parkin, David Forman, and Freddie Bray. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. International Journal of Cancer, 136(5):E359–E386, 2015.
[3] Marcelo Mamede, Paula Abreu-e-Lima, Maria Raquel Oliva, Vânia Nosé, Harvey Mamon, and Victor H Gerbaudo. FDG-PET/CT tumor segmentation-derived indices of metabolic activity to assess response to neoadjuvant therapy and progression-free survival in esophageal cancer: correlation with histopathology results. American Journal of Clinical Oncology, 30(4):377–388, 2007.
[4] Marlies E Nowee, Francine EM Voncken, Alexis NTJ Kotte, Lucas Goense, Peter SN van Rossum, Astrid LHMW van Lier, Stijn W Heijmink, Berthe MP Aleman, et al. Gross tumour delineation on computed tomography and positron emission tomography-computed tomography in oesophageal cancer: A nationwide study. Clinical and Translational Radiation Oncology, 14:33–39, 2019.
[5] Thomas R Charles, John G Hunter, and Blair AA Jobe. Esophageal cancer: principles and practice. Demos Medical Publishing, 2009.
[6] Mikael Rousson, Ying Bai, Chenyang Xu, and Frank Sauer. Probabilistic minimal path for automated esophagus segmentation. In Medical Imaging 2006: Image Processing, volume 6144, page 614449. International Society for Optics and Photonics, 2006.
[7] Frederiek M Lever, Irene M Lips, Sjoerd PM Crijns, Onne Reerink, Astrid LHMW van Lier, Marinus A Moerland, Marco van Vulpen, and Gert J Meijer. Quantification of esophageal tumor motion on cine-magnetic resonance imaging. International Journal of Radiation Oncology* Biology* Physics, 88(2):419–424, 2014.
[8] Dakai Jin, Dazhou Guo, Tsung-Ying Ho, Adam P Harrison, Jing Xiao, Chen-kan Tseng, and Le Lu. Deep esophageal clinical target volume delineation using encoded 3D spatial context of tumors, lymph nodes, and organs at risk. In MICCAI, pages 603–612. Springer, 2019.
[9] Di-Xiu Xue, Rong Zhang, Yuan-Yuan Zhao, Jian-Ming Xu, and Ya-Lei Wang. Fully convolutional networks with double-label for esophageal cancer image segmentation by self-transfer learning. In Ninth International Conference on Digital Image Processing (ICDIP 2017), volume 10420, page 104202D. International Society for Optics and Photonics, 2017.
[10] Ying Liang, Diane Schott, Ying Zhang, Zhiwu Wang, Haidy Nasief, Eric Paulson, William Hall, Paul Knechtges, Beth Erickson, and X Allen Li. Auto-segmentation of pancreatic tumor in multi-parametric MRI using deep convolutional neural networks. Radiotherapy and Oncology, 145:193–200, 2020.
[11] Sahar Yousefi, Hessam Sokooti, Mohamed S Elmahdy, Femke P Peters, Mohammad T Manzuri Shalmani, Roel T Zinkstok, and Marius Staring. Esophageal gross tumor volume segmentation using a 3D convolutional neural network. In MICCAI, pages 343–351. Springer, 2018.
[12] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
[13] Andreas Fieselmann, Stefan Lautenschläger, Frank Deinzer, and Björn Poppe. Automatic detection of air holes inside the esophagus in CT images. In Bildverarbeitung für die Medizin 2008, pages 397–401. Springer, 2008.
[14] Andreas Fieselmann, Stefan Lautenschläger, Frank Deinzer, Matthias John, and Björn Poppe. Esophagus segmentation by spatially-constrained shape interpolation. In Bildverarbeitung für die Medizin 2008, pages 247–251. Springer, 2008.
[15] Johannes Feulner, S Kevin Zhou, Matthias Hammon, Sascha Seifert, Martin Huber, Dorin Comaniciu, Joachim Hornegger, and Alexander Cavallaro. A probabilistic model for automatic segmentation of the esophagus in 3-D CT scans. IEEE Transactions on Medical Imaging, 30(6):1252–1264, 2011.
[16] Tzung-Chi Huang, Geoffrey Zhang, Thomas Guerrero, George Starkschall, Kan-Ping Lin, and Ken Forster. Semi-automated CT segmentation using optic flow and Fourier interpolation techniques. Computer Methods and Programs in Biomedicine, 84(2-3):124–134, 2006.
[17] Johannes Feulner, S Kevin Zhou, Alexander Cavallaro, Sascha Seifert, Joachim Hornegger, and Dorin Comaniciu. Fast automatic segmentation of the esophagus from 3D CT data using a probabilistic model. In MICCAI, pages 255–262. Springer, 2009.
[18] Johannes Feulner, S Kevin Zhou, Martin Huber, Alexander Cavallaro, Joachim Hornegger, and Dorin Comaniciu. Model-based esophagus segmentation from CT scans using a spatial probability map. In MICCAI, pages 95–102. Springer, 2010.
[19] Sila Kurugol, Necmiye Ozay, Jennifer G Dy, Gregory C Sharp, and Dana H Brooks. Locally deformable shape model to improve 3D level set based esophagus segmentation. In ICPR 2010, pages 3955–3958. IEEE, 2010.
[20] Sila Kurugol, Erhan Bas, Deniz Erdogmus, Jennifer G Dy, Gregory C Sharp, and Dana H Brooks. Centerline extraction with principal curve tracing to improve 3D level set esophagus segmentation in CT images. In EMBC 2011, pages 3403–3406. IEEE, 2011.
[21] Jinzhong Yang, Benjamin Haas, Raymond Fang, Beth M Beadle, Adam S Garden, Zhongxing Liao, Lifei Zhang, Peter Balter, et al. Atlas ranking and selection for automatic segmentation of the esophagus from CT scans. Physics in Medicine & Biology, 62(23):9140, 2017.
[22] Nicola Pezzotti, Sahar Yousefi, Mohamed S Elmahdy, Jeroen van Gemert, Christophe Schülke, Mariya Doneva, Tim Nielsen, Sergey Kastryulin, Boudewijn PF Lelieveldt, Matthias JP van Osch, Elwin de Weerdt, and Marius Staring. An adaptive intelligence algorithm for undersampled knee MRI reconstruction. IEEE Access, pages 204825–204838, 2020.
[23] Sahar Yousefi, Lydiane Hirschler, Merlijn van der Plas, Mohamed S Elmahdy, Hessam Sokooti, Matthias van Osch, and Marius Staring. Fast dynamic perfusion and angiography reconstruction using an end-to-end 3D convolutional neural network. In MLMRI, pages 25–35. Springer, 2019.
[24] Mohamed S Elmahdy, Thyrza Jagt, Sahar Yousefi, Hessam Sokooti, Roel Zinkstok, Mischa Hoogeman, and Marius Staring. Evaluation of multi-metric registration for online adaptive proton therapy of prostate cancer. In International Workshop on Biomedical Image Registration, pages 94–104. Springer, 2018.
[25] Mohamed S Elmahdy, Thyrza Jagt, Roel Th Zinkstok, Yuchuan Qiao, Rahil Shahzad, Hessam Sokooti, Sahar Yousefi, Luca Incrocci, CAM Marijnen, Mischa Hoogeman, et al. Robust contour propagation using deep learning and image registration for online adaptive proton therapy of prostate cancer. Medical Physics, 46(8):3329–3343, 2019.
[26] Tobias Fechter, Sonja Adebahr, Dimos Baltas, Ismail Ben Ayed, Christian Desrosiers, and Jose Dolz. A 3D fully convolutional neural network and a random walker to segment the esophagus in CT. Journal of Medical Physics, 2017.
[27] Roger Trullo, Caroline Petitjean, Dong Nie, Dinggang Shen, and Su Ruan. Fully automated esophagus segmentation with a hierarchical deep learning approach. In ICSIPA 2017, pages 503–506. IEEE, 2017.
[28] Zhaojun Hao, Jiwei Liu, and Jianfei Liu. Esophagus tumor segmentation using fully convolutional neural network and graph cut. In Chinese Intelligent Systems Conference, pages 413–420. Springer, 2017.
[29] Dakai Jin, Dazhou Guo, Tsung-Ying Ho, Adam P Harrison, Jing Xiao, Chen-kan Tseng, and Le Lu. Accurate esophageal gross tumor volume segmentation in PET/CT using two-stream chained 3D deep network fusion. In MICCAI, pages 182–191. Springer, 2019.
[30] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2014.
[32] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In , pages 565–571. IEEE, 2016.
[33] Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Eric Granger, Jose Dolz, and Ismail Ben Ayed. Boundary loss for highly unbalanced segmentation. In International Conference on Medical Imaging with Deep Learning, pages 285–296, 2019.
[34] Francesco Caliva, Claudia Iriondo, Alejandro Morales Martinez, Sharmila Majumdar, and Valentina Pedoia. Distance map loss penalty term for semantic segmentation. MIDL, 2019.
[35] Pei Wang and Albert CS Chung. Focal Dice loss and image dilation for brain tumor segmentation. In