Automatic Segmentation of Organs-at-Risk from Head-and-Neck CT using Separable Convolutional Neural Network with Hard-Region-Weighted Loss
Wenhui Lei a, Haochen Mei a, Zhengwentai Sun a, Shan Ye a, Ran Gu a, Huan Wang a, Rui Huang b, Shichuan Zhang c, Shaoting Zhang a, Guotai Wang a,∗

a School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, China
b SenseTime Research, Shenzhen, China
c Department of Radiation Oncology, Sichuan Cancer Hospital and Institute, University of Electronic Science and Technology of China, Chengdu, China

∗ Corresponding author: Guotai Wang. Email address: [email protected] (Guotai Wang)
Abstract
Accurate segmentation of Organs-at-Risk (OAR) from Head and Neck (HAN) Computed Tomography (CT) images with uncertainty information is critical for effective planning of radiation therapy for Nasopharyngeal Carcinoma (NPC) treatment. Despite the state-of-the-art performance achieved by Convolutional Neural Networks (CNNs) for the segmentation task, existing methods do not provide uncertainty estimation of the segmentation results for treatment planning, and their accuracy is still limited by the low contrast of soft tissues in CT, the highly imbalanced sizes of OARs and the large inter-slice spacing. To address these problems, we propose a novel framework for accurate OAR segmentation with reliable uncertainty estimation. First, we propose a Segmental Linear Function (SLF) to transform the intensity of CT images to make multiple organs more distinguishable than existing simple window width/level-based methods. Second, we introduce a novel 2.5D network (named 3D-SepNet) specially designed for dealing with clinical CT scans with anisotropic spacing. Thirdly, we propose a novel hardness-aware loss function that pays attention to hard voxels for accurate segmentation. We also use an ensemble of models trained with different loss functions and intensity transforms to obtain robust results, which also yields segmentation uncertainty without extra effort. Our method won the third place of the HAN OAR segmentation task in the StructSeg 2019 challenge, where it achieved a weighted average Dice of 80.52% and a 95% Hausdorff Distance of 3.043 mm. Experimental results show that 1) our SLF for intensity transform helps to improve the accuracy of OAR segmentation from CT images; 2) with only about 1/3 of the parameters of 3D UNet, our 3D-SepNet obtains competitive or better segmentation for most OARs; 3) the proposed hard-voxel weighting strategy effectively improves the segmentation accuracy; 4) the segmentation uncertainty obtained by our method has a high correlation to mis-segmentations, which has a potential to assist more informed decisions in clinical practice. Our code is available at https://github.com/HiLab-git/SepNet.
Keywords:
Medical image segmentation, Intensity transform, Convolutional Neural Network, Uncertainty
1. Introduction
Nasopharyngeal carcinoma (NPC) is a tumor arising from the epithelial cells that cover the surface and line the nasopharynx [1]. It is one of the leading forms of cancer in the Arctic, China, Southeast Asia, and the Middle East/North Africa [2]. Radiation therapy is the most common treatment choice for NPC because it is mostly radio-sensitive [3]. Radiation planning sets up the radiation dose distribution for the tumor and ordinary organs, which is vital for the treatment. Oncologists design radiotherapy plans to make sure the cancer cells receive enough radiation while preventing damage to normal cells in Organs-at-Risk (OARs), such as the optical nerves and chiasma that are important for the vision of patients.

Medical images play a vital role in preoperative decision making, as they can assist radiologists in determining the OAR boundaries. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) are widely used non-invasive imaging methods for HAN. Delineating the boundaries of tens of OARs from HAN CT scans is one crucial step in radiotherapy treatment planning, and manual delineation is tedious, time consuming and likely to have high inter- and intra-observer variations. Automated segmentation of OARs can save radiologists time and has a potential to provide an accurate and reliable solution [4].

However, the automatic segmentation of HAN OARs from CT images is still challenging for several reasons. First, there is a severe imbalance between the sizes of large and small OARs. As shown in Fig. 1, the smallest organs such as the lens are only a few voxels wide, while the parotid gland is over 250 times larger than the lens, making an automatic segmentation algorithm easily biased towards the large organs. Existing methods [4, 5, 6] mainly deal with the imbalance problem by weighting small OARs in the loss function for training, which is a class-level weighting that ignores hard voxels in easy or large OARs. Second, due to the mechanism of CT imaging, soft tissues including OARs have a low contrast, such as the brain stem depicted in Fig. 1. The low contrast results in uncertainty and challenges for the accurate classification of voxels around the boundary of OARs.
Figure 1: An example of HAN CT images in three orthogonal views. The OARs have imbalanced sizes and a low contrast with the surrounding tissues. Note the large inter-slice spacing highlighted by red arrows and the small organs occupying only a few slices (e.g., the optical nerves and the lens, highlighted by yellow arrows).

Most existing works [5, 6, 7] do not deal with this problem well, as they use a single intensity window/level for intensity transform, where a naive linear function between a lower threshold and an upper threshold is used. Such a simple intensity transform cannot provide good visibility of multiple organs at the same time, which limits the accuracy of multi-organ segmentation. Thirdly, HAN CT images are usually acquired with a high in-plane resolution but a low through-plane resolution, so that tiny OARs such as the optical nerves (highlighted by yellow arrows in Fig. 1) are present in only a small number of slices. Therefore, using standard 3D convolutions that treat the x, y and z dimensions equally [4, 5, 6, 7] may overlook the anisotropic spatial resolution and limit the segmentation accuracy for these small organs.

In real clinical applications, the uncertainty of predictions is important in OAR segmentation. For radiotherapy planning, radiologists care not only about how accurate the segmentation results are, but also about the extent to which the model is confident in its predictions. If the uncertainty is too high, doctors need to focus on the corresponding area and may refine it [8]. In our case, the low contrast between OARs and their surroundings leads the voxels around boundaries to be segmented with less confidence [9]. The uncertainty information of these voxels can indicate regions that have potentially been mis-segmented, and could be used to guide humans to refine the prediction results [10, 11]. However, to the best of our knowledge, no existing work has explored uncertainty estimation in HAN OAR prediction and its relationship with mis-segmentation.

Though there are some existing works similar to individual components of the proposed method, our pipeline is elaborated for uncertainty-aware multi-organ segmentation from HAN CT images by considering the challenges associated with this specific task. In addition, several modules in our method differ from their existing counterparts. First, to deal with the low contrast of CT images, our Segmental Linear Functions (SLFs) transform the intensity of CT images to make multiple organs more distinguishable than existing methods based on a simple window width/level, which often gives a better visibility of one organ while hiding the others. Second, existing general 2D and 3D segmentation networks can hardly deal well with images with anisotropic spacing. Our SepNet is a novel 2.5D network specially designed for dealing with clinical HAN CT scans with anisotropic spacing. Thirdly, existing hardness-aware loss functions often deal with class-level hardness, but our proposed attention to hard voxels (ATH) uses a voxel-level hardness strategy, which is more suitable for dealing with hard regions even when the corresponding class is easy. Last but not least, uncertainty information is important for more informed treatment planning of NPC, but existing HAN CT segmentation methods do not provide it. We use an ensemble of models trained with different loss functions and intensity transforms to obtain robust results.
Therefore, our proposed method is a novel pipeline for accurate segmentation of HAN OARs that better considers the comprehensive properties of the images, including the low contrast, anisotropic resolution and segmentation hardness, and it provides segmentation uncertainty for more informed treatment decision making. Our proposed method won the 3rd place among 17 teams in the HAN OAR segmentation task of the StructSeg 2019 challenge (https://structseg2019.grand-challenge.org/Home/).
2. Related Works
In earlier studies, traditional segmentation methods such as atlas- [12], RF- [13], contour- [14], region- [15] and SVM-based [16] methods achieved significant benefits compared with manual segmentation. Zhou et al. [17] proposed an SVM-based method to segment NPC lesions from MR images.

The popularity of deep learning methods then increased dramatically in various computer vision tasks. Ibragimov and Xing [20] first proposed a CNN method at the end of 2016 to segment OARs in HAN. They implemented a convolutional neural network to extract features such as corners, end-points and edges that contribute to classification, and achieved higher accuracy than traditional methods. However, it requires further operations like a Markov Random Field algorithm to smooth the segmented area, and it is patch-based, which is time consuming at inference.

With the development of CNNs, a new architecture called the Fully Convolutional Network (FCN) [21] obtained a large performance improvement for segmentation, which also encouraged the emergence of U-Net [22]. Men et al. [23] used deep dilated CNNs for segmentation of OARs, where the network has a larger receptive field. Other researchers [24, 25] presented fully convolutional models with an auxiliary path strategy, which allows the model to learn discriminative features to obtain a higher segmentation performance. To deal with the imbalance problem, Liang et al. [26] used one CNN for OAR detection and then another CNN for fine segmentation. Based on 3D-UNet [27], Gao et al. [5] proposed FocusNet by adding additional sub-networks for better segmentation of small organs.

Loss functions also play an important role in the segmentation task. Liang et al. [26] improved their loss function with a weighted focal loss that performed well on the MICCAI 2015 dataset [28]. The traditional Dice loss [29] is unfavorable to small structures, as a few misclassified voxels can lead to a large decrease of the Dice score, and this sensitivity is irrelevant to the relative sizes among structures; therefore, balancing by label frequencies is suboptimal for the Dice loss. The Exponential Logarithmic loss [6], which combines the Focal loss [30] with the Dice loss, eased this issue without many adjustments to the network. However, it reduces the accuracy on hard voxels in large or otherwise easy regions.
To deal with anisotropic spacing in 3D image scans, previous networks can be divided into three major categories. First, directly applying 2D networks slice by slice to get the final results [31], which is agnostic to the large inter-slice spacing; however, due to their internal structural constraints, 2D networks can hardly capture 3D information. Second, resampling the original images to the same spacing along the three dimensions [32, 33, 34] and processing them with standard 3D convolutions, which can merge the information among the three dimensions effectively, but generates artifacts in the interpolated slices and leads to reduced segmentation accuracy. Thirdly, applying 2.5D networks [35, 36], which first use 2D convolutions separately in three orthogonal views and then fuse the multi-view information in a final stage, so that 3D contextual features can be used. However, previous 2.5D networks mainly treat each direction equally, thus ignoring the anisotropic spacing and organ scales among the three dimensions.

Despite the impressive performance that current methods have achieved, measuring how reliable the predictions are is also crucial to indicate potentially mis-segmented regions or to guide user interactions for refinement [37, 38, 11]. There are two major types of predictive uncertainties for deep CNNs [39]: aleatoric uncertainty and epistemic uncertainty.
Aleatoric uncertainty depends on noise or randomness in the input testing image, while epistemic uncertainty, also known as model uncertainty, can be explained away given enough training data. Many works have investigated uncertainty estimation for deep neural networks [39, 40, 41], mainly focusing on high-level image classification or regression tasks. Some recent works [42, 43] investigated test-time dropout-based (epistemic) uncertainty for segmentation. Wang et al. [44] and Nair et al. [43] extensively investigated different kinds of uncertainties for CNN-based medical image segmentation, including not only epistemic but also aleatoric uncertainties. Even though uncertainty estimation is of great interest in HAN OAR segmentation, to the best of our knowledge, existing works have not investigated this problem.
3. Methods
Our proposed framework for HAN OAR segmentation is shown in Fig. 2. It consists of four main components. First, we propose to use Segmental Linear Functions (SLFs) to obtain multiple intensity-transformed copies of an input image to get better contrasts of different OARs. Second, for each copy of the intensity-transformed image, we employ a novel network, 3D-SepNet, combining intra-slice and inter-slice convolutions to deal with the large inter-slice spacing. Thirdly, to train our 3D-SepNet we propose a novel hard-voxel weighting strategy that pays more attention to small organs and to hard voxels in large or easy organs, and it can be combined with existing loss functions. Finally, we ensemble several models trained with different SLFs and loss functions by a weighted fusion to get the final segmentation results, which simultaneously yields an uncertainty estimation of the segmentation, as shown in Fig. 2.
Figure 2: Overview of our proposed framework for accurate OAR segmentation. We first use different Segmental Linear Functions (SLFs) to transform the intensity of an input image, which obtains good visibility of different OARs. Then a novel network, 3D-SepNet, that leverages intra-slice and inter-slice convolutions is proposed to segment OARs from images with large inter-slice spacing, and we propose a novel hardness weighting strategy for training. Finally, an ensemble of networks related to different SLFs and loss functions obtains the final segmentation and uncertainty estimation simultaneously.
Figure 3: Different intensity transform functions for CT image preprocessing. Standard window width/level-based intensity transforms use the NLFs shown in (a), which employ a single linear function between the lower and the upper thresholds. Our proposed SLFs in (b) transform the HU values with K linear sections to obtain better visibility of multiple OARs.

The HU values of soft tissues such as the brain stem, parotid and temporal lobes are very close to each other, and largely different from those of bones in CT images. Previous studies [4, 7, 6] used a simple window width/level-based intensity transform for preprocessing, i.e., a lower threshold and an upper threshold with a linear function between them, which is shown in Fig. 3(a) and referred to as a Naive Linear Function (NLF) in this paper. Processing CT images with an NLF is sub-optimal as it cannot obtain good visibility for soft tissues and bones at the same time, which may limit the segmentation accuracy. To address this issue, and inspired by the fact that radiologists use different window widths/levels to better differentiate OARs with diverse HU values, we propose to use Segmental Linear Functions (SLFs) to transform the CT images. More specifically, assume increasing numbers $[x_1, x_2, ..., x_K]$ in $[0, 1]$ and corresponding $K$ HU values $[h_1, h_2, ..., h_K]$ in $[h_{min}, h_{max}]$, where $h_{min}$ and $h_{max}$ represent the minimal and maximal HU values of CT images, respectively. Let $h$ be the original HU value; the transformed intensity $x$ is:

\[
x = \begin{cases} 0, & h \leq h_1 \\ x_i + \frac{h - h_i}{h_{i+1} - h_i}\,(x_{i+1} - x_i), & h_i < h \leq h_{i+1},\ i \in [1, K-1] \\ 1, & h > h_K \end{cases} \tag{1}
\]
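As an illustration, the SLF in Eq. (1) is simply a piecewise linear interpolation between the control points $(h_i, x_i)$, which NumPy's `interp` implements directly. The following is a minimal sketch (not the authors' released code; variable names are ours):

```python
import numpy as np

def slf_transform(hu, h_points, x_points):
    """Segmental Linear Function (Eq. 1): piecewise-linearly map HU values
    to [0, 1] using control points (h_i, x_i). Values below h_1 map to 0
    and values above h_K map to 1."""
    # np.interp clamps to x_points[0] / x_points[-1] outside the range,
    # which matches the 0 and 1 cases of Eq. (1) when x_1 = 0 and x_K = 1.
    return np.interp(hu, h_points, x_points)

# SLF1 as defined in the paper: K = 4 control points.
x_pts = [0.0, 0.2, 0.8, 1.0]
slf1_h = [-500, -200, 200, 1500]

ct = np.random.randint(-1024, 2000, size=(32, 256, 256)).astype(np.float32)
ct_slf1 = slf_transform(ct, slf1_h, x_pts)  # one transformed copy of the scan
```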
Figure 4: Visualization of a CT image with different intensity transform functions. The sub-figures from left to right are transformed CT images based on: (a) our SLF1, which gives good visibility of both bone and soft tissues; (b) NLF1, which highlights soft tissues; and (c) NLF2, which preserves the internal structure and boundary shape of bone tissues while reducing the visibility of soft tissues.

In this work, we set $K = 4$ and $[x_1, x_2, x_3, x_4]$ = [0, 0.2, 0.8, 1.0], and use three different SLFs: SLF1, SLF2 and SLF3, with $[h_1, h_2, h_3, h_4]$ set as [-500, -200, 200, 1500], [-500, -100, 100, 1500] and [-500, -100, 400, 1500], respectively. We also compared them with two types of NLFs: NLF1 and NLF2, with corresponding $[h_1, h_2]$ set as [-100, 100] to focus on soft tissues and [-500, 800] for a large window width, respectively, as shown in Fig. 3(a).

Fig. 4 shows a visual comparison of the different intensity transforms (i.e., SLF1, NLF1 and NLF2) applied to HAN CT scans.
From the axial view, it can be observed that NLF1 suppresses the bones and improves the visibility of soft tissues, while NLF2 improves the visibility of bones but makes soft tissues hardly distinguishable. Each of these intensity transforms makes it hard to segment multiple OARs including bones (e.g., the mandible) and soft tissues at the same time. In contrast, our SLF1 obtains high visibility for both soft tissues and bones, which is beneficial for the segmentation of multiple OARs with complex intensity distributions.

Due to the large inter-slice spacing and the sharp boundary between bones and nearby soft tissues, upsampling the images along the z axis to obtain a high isotropic 3D resolution would produce plenty of artifacts on the interpolated slices and mislead the segmentation results. Therefore, we directly use the stacks of 2D slices for segmentation. As 2D networks ignore the correlation between adjacent slices, and standard 3D networks with isotropic 3D convolutions have receptive fields covering different physical extents (in terms of mm rather than voxels) along axes with different voxel spacings, we propose a 2.5D network combining intra-slice convolutions and inter-slice convolutions to deal with this problem.

Figure 5: Our proposed network 3D-SepNet for segmentation of OARs from images with large inter-slice spacing. Blue and white boxes indicate feature maps. Block(n) represents a convolutional block with n channels, where three intra-slice layers (yellow arrows) are followed by one inter-slice convolutional layer (red arrow) and a skip connection with one 1 × 1 × 1 convolution.

As shown in Fig. 5, our proposed network (3D-SepNet) is based on the backbone of 3D UNet [27] with 12 convolutional blocks in an encoder-decoder structure. Since the inter-slice and intra-slice voxel spacings of our experimental images are around 3 mm and 1 mm respectively, small organs like the optical nerves and chiasma cross only a few slices. Therefore, applying standard 3D convolutions would blur their boundaries and be harmful to accuracy. We propose to use spatially separable convolutions that separate a standard 3D convolution with a 3 × 3 × 3 kernel into intra-slice convolutions with 1 × 3 × 3 kernels and inter-slice convolutions with 3 × 1 × 1 kernels. The number of channels in Block(n) is doubled after each max pooling in the encoder. We concatenate feature maps from the encoding path with the corresponding feature maps in the decoding path for better performance. A final layer of 1 × 1 × 1 convolution followed by softmax produces the output probabilities for the N classes.
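The following is a hedged PyTorch sketch of one such convolutional block, following the description in Fig. 5 (three 1 × 3 × 3 intra-slice layers, one 3 × 1 × 1 inter-slice layer, each with IN and ReLU, and a 1 × 1 × 1 skip path). Whether the skip output is added or concatenated, and other details, are assumptions here; the released implementation is in the authors' repository (https://github.com/HiLab-git/SepNet).

```python
import torch
import torch.nn as nn

class SepBlock(nn.Module):
    """Sketch of a 3D-SepNet block: three intra-slice (1x3x3) conv layers
    followed by one inter-slice (3x1x1) conv layer, plus a 1x1x1 skip path.
    Layer counts and ordering follow Fig. 5; details may differ from the release."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def conv(ic, oc, k, p):
            return nn.Sequential(nn.Conv3d(ic, oc, k, padding=p),
                                 nn.InstanceNorm3d(oc), nn.ReLU(inplace=True))
        self.intra = nn.Sequential(
            conv(in_ch, out_ch, (1, 3, 3), (0, 1, 1)),   # intra-slice
            conv(out_ch, out_ch, (1, 3, 3), (0, 1, 1)),  # intra-slice
            conv(out_ch, out_ch, (1, 3, 3), (0, 1, 1)))  # intra-slice
        self.inter = conv(out_ch, out_ch, (3, 1, 1), (1, 0, 0))  # inter-slice
        self.skip = nn.Conv3d(in_ch, out_ch, 1)  # 1x1x1 skip connection

    def forward(self, x):
        # Assumption: the skip path is fused by addition.
        return self.inter(self.intra(x)) + self.skip(x)

y = SepBlock(48, 48)(torch.randn(1, 48, 16, 128, 128))  # (N, C, D, H, W)
```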
In our task, the ratio between the background and the smallest organ can reach around $10^5$:1, which makes the loss function values dominated by the large number of easy background voxels. To solve this problem, we first use the exponential logarithmic loss [6] ($L_{Exp}$), which balances the labels not only by their relative sizes but also by their segmentation difficulties:

\[ L_{Exp} = \omega_{DSC} L_{DSC} + \omega_{Cross} L_{Cross} \tag{2} \]

where $\omega_{DSC}$ and $\omega_{Cross}$ are the weights of the exponential logarithmic DSC loss ($L_{DSC}$) and the exponential cross-entropy ($L_{Cross}$), respectively:

\[ L_{DSC} = E\big[(-\ln(DSC_c))^{\gamma_{DSC}}\big] \tag{3} \]

\[ DSC_c = \frac{2\sum_x [g_c(x)\, p_c(x)] + \epsilon}{\sum_x [g_c(x) + p_c(x)] + \epsilon} \tag{4} \]

\[ L_{Cross} = E\big\{w_c \{-\ln[p_c(x)]\}^{\gamma_{Cross}}\big\} \tag{5} \]

where $x$ is a voxel and $c$ is a class. $p_c(x)$ is the predicted probability of being class $c$ for voxel $x$, and $g_c(x)$ is the corresponding ground truth label. $E[\cdot]$ is the mean value with respect to $c$ and $x$ in $L_{DSC}$ and $L_{Cross}$, respectively. $\epsilon$ is a small constant for numerical stability, and $w_c = ((\sum_k f_k)/f_c)^{0.5}$ is the class-level weight for reducing the influence of more frequently seen classes, where $f_k$ is the frequency of class $k$. We set $\omega_{DSC}$, $\omega_{Cross}$, $\gamma_{DSC}$ and $\gamma_{Cross}$ following [6].

Compared with $L_{Cross}$, $L_{DSC}$ weights the voxels of one certain label mainly by its segmentation difficulty, and the formulation of $L_{DSC}$ can be differentiated, yielding the gradient:

\[
\frac{\partial L_{DSC}}{\partial p_c(x)} = \frac{\partial L_{DSC}}{\partial DSC_c}\,\frac{\partial DSC_c}{\partial p_c(x)}
= -\frac{\gamma_{DSC}\,(-\ln DSC_c)^{\gamma_{DSC}-1}}{DSC_c}\cdot\frac{2\big\{g_c(x)\sum_x [g_c(x)+p_c(x)] - \sum_x [p_c(x)\, g_c(x)]\big\}}{\big\{\sum_x [g_c(x)+p_c(x)]\big\}^2} \tag{6}
\]

It is easy to observe that for a hard voxel $x$ in an easy class $c$, with $p_c(x)$ far away from $g_c(x)$, the absolute value of the gradient $\partial L_{DSC}/\partial p_c(x)$ is restrained compared with that of a voxel in a hard region, because the contribution of a single voxel in Eq. (6) is constrained when $DSC_c \approx 1$ and the sums are dominated by the many well-segmented voxels of class $c$. However, hard voxels exist in different objects, including large or easy regions, and it is helpful to force the network to focus on these hard voxels to balance this contradiction. More formally, we propose to multiply the prediction by a weighting function $w_c(x)$ before sending it into $L_{DSC}$, with a tunable attention parameter $\alpha > 0$:

\[ w_c(x) = e^{\alpha(p_c(x) - g_c(x))} \tag{7} \]

\[ p_c^w(x) = p_c(x)\, w_c(x) \tag{8} \]

where $p_c^w(x)$ is the weighted prediction. A higher value of $p_c^w(x)$ represents a higher possibility of voxel $x$ belonging to class $c$.

Figure 6: $p_c^w$ with different $\alpha$ and ground truth label $g_c$ for a voxel. $p_c^w$ is lower than $p_c$ for $g_c = 1$ and higher than $p_c$ for $g_c = 0$, meaning the weighted prediction is further away from the ground truth, which leads a less accurate prediction to have a larger impact on the backpropagation.

Fig. 6 shows the effect of the weighting function $w_c(x)$ with different values of $\alpha$ for $g_c = 1$ and $g_c = 0$, respectively. It can be observed that $p_c^w(x)$ is lower than $p_c(x)$ for $g_c(x) = 1$ and higher than $p_c(x)$ for $g_c(x) = 0$, meaning the weighted prediction is further away from the ground truth than the original prediction. As a result, the weighted harder regions will have a larger impact on the backpropagation, as they have larger gradient values than accurately predicted voxels. Generally, this leaves more room for improvement and makes the network focus more on hard voxels. We refer to this weighting strategy as attention to hard voxels (ATH). It should be noticed that ATH can be combined with a standard loss function in the training process; we combine it with $L_{Exp}$ due to its better performance than the standard Dice loss and cross entropy loss. Our loss function with ATH is named ATH−$L_{Exp}$.
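As a sketch, ATH amounts to one extra step before the Dice term: the probabilities are rescaled by Eqs. (7)-(8) and then fed into the exponential logarithmic Dice loss of Eqs. (3)-(4). The simplified $L_{DSC}$ below and the default $\gamma$ value are our assumptions, not the exact released implementation:

```python
import torch

def ath_weight(p, g, alpha):
    """Eqs. (7)-(8): w_c(x) = exp(alpha * (p_c(x) - g_c(x))),
    p^w_c(x) = p_c(x) * w_c(x). Pushes inaccurate predictions further from
    the ground truth so that they dominate the backpropagated gradients."""
    return p * torch.exp(alpha * (p - g))

def exp_log_dice(p, g, gamma=0.3, eps=1.0):
    """Simplified exponential logarithmic Dice term (Eqs. 3-4);
    p, g: (N, C, D, H, W) probabilities and one-hot labels.
    gamma/eps are illustrative defaults, not confirmed by the paper."""
    dims = (0, 2, 3, 4)  # sum over batch and space, keep the class axis
    dsc = (2 * (p * g).sum(dims) + eps) / ((p + g).sum(dims) + eps)
    return torch.pow(-torch.log(dsc.clamp_min(1e-6)), gamma).mean()

def ath_dice_loss(p, g, alpha=0.5):
    # Apply the ATH weighting before the Dice computation.
    return exp_log_dice(ath_weight(p, g, alpha), g)
```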
To achieve robust segmentation and obtain segmentation uncertainty at the same time, we use an ensemble of models trained with different SLFs and with ATH based on different $\alpha$, as shown in Fig. 2. Since these models perform differently on different organs, we implement the ensemble for each class respectively and use a class-wise weighted average. Specifically, for class $c$, a model with a higher performance on $c$ is assigned a higher weight. For the model obtaining the $i$-th highest DSC for class $c$ on the validation set, its predicted probability map of class $c$ and the corresponding weight for a test image are denoted as $P_{ic}$ and $w_{ic}$ respectively. We set $w_{ic}$ to 5, 4, 3, 1, 1, 1 for $i$ = 1, 2, ..., 6, respectively. The final probability map for class $c$ of a test image is:

\[ \hat{P}_c = \frac{\sum_i w_{ic} P_{ic}}{\sum_i w_{ic}} \tag{9} \]

Based on the predictions of multiple models, it is straightforward to obtain the segmentation uncertainty, which is typically estimated by measuring the diversity of multiple predictions for a given image [39]. Suppose $X$ denotes a test image and $Y$ the predicted label of $X$. The variance and the entropy of the distribution $p(Y|X)$ are two common measures for uncertainty estimation. Similarly to [40], it is a promising choice to estimate uncertainty with an ensemble of multiple independently learned models. We use the entropy to represent the voxel-wise uncertainty:

\[ H(Y|X) = -\int p(y|X)\ln(p(y|X))\, dy \tag{10} \]

Suppose $Y(x)$ denotes the predicted label for voxel $x$. With predictions from $N$ models, a set of values $\mathcal{Y} = \{y_1(x), y_2(x), ..., y_N(x)\}$ can be obtained. The voxel-wise uncertainty can therefore be approximated as:

\[ H(Y|X) \approx -\sum_{m=1}^{M} \hat{p}_m(x)\ln(\hat{p}_m(x)) \tag{11} \]

where $M$ is the number of unique values in $\mathcal{Y}$ and $\hat{p}_m(x)$ is the frequency of the $m$-th unique value in $\mathcal{Y}$.

We also estimate the structure-wise uncertainty for each OAR by calculating the Volume Variation Coefficient (VVC). Let $V_i = \{v_{i1}, v_{i2}, ..., v_{iN}\}$ denote the set of volumes of OAR $i$ obtained by the $N$ models, and let $\mu_{V_i}$ and $\sigma_{V_i}$ denote the mean value and standard deviation of $V_i$ respectively. We use the VVC to estimate the structure-wise uncertainty for OAR $i$:

\[ VVC_i = \frac{\sigma_{V_i}}{\mu_{V_i}} \tag{12} \]
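The ensemble and both uncertainty measures reduce to a few array operations. Below is a hedged NumPy sketch of Eqs. (9), (11) and (12); the function names and shape conventions are ours:

```python
import numpy as np

def ensemble_probability(prob_maps, weights):
    """Eq. (9): class-wise weighted average of per-model probability maps.
    prob_maps: (N_models, C, D, H, W); weights: (N_models, C), e.g. set to
    5/4/3/1/1/1 according to each model's validation DSC rank per class."""
    w = np.asarray(weights, dtype=np.float32)[:, :, None, None, None]
    return (w * prob_maps).sum(0) / w.sum(0)

def voxelwise_entropy(labels):
    """Eq. (11): entropy of the N discrete predicted labels per voxel.
    labels: (N_models, D, H, W) integer label maps; returns (D, H, W)."""
    n = labels.shape[0]
    ent = np.zeros(labels.shape[1:], dtype=np.float32)
    for value in np.unique(labels):
        p = (labels == value).sum(0) / n          # frequency of this label
        ent -= p * np.log(np.where(p > 0, p, 1.0))  # 0*log(0) treated as 0
    return ent

def vvc(volumes):
    """Eq. (12): Volume Variation Coefficient sigma/mu over the volumes
    of one OAR obtained by the N models."""
    v = np.asarray(volumes, dtype=np.float64)
    return v.std() / v.mean()
```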
4. Experiments and Results
Our proposed method is evaluated on two segmentation tasks: (1) the StructSeg 2019 challenge training dataset consisting of CT scans from 50 NPC patients, where 22 OARs are to be segmented; (2) a mixed HAN CT dataset containing 165 volumes from three sources: 50 volumes from StructSeg 2019, 48 volumes from the MICCAI 2015 Head and Neck challenge [28], and 67 volumes of NPC patients before radiotherapy treatment collected locally, where we segment 7 OARs that are annotated in all of these three sources.

Table 1: DSC (mean ± std %) and ASSD (mean ± std mm) evaluation of 3D HAN OAR segmentation in StructSeg 2019 task 1, with 3D-SepNet, three different intensity transforms (NLF1, NLF2 and our SLF1) and three different loss functions (our ATH−L_Exp, L_Exp and the Dice loss). The best result is in bold font.
12. 3.05 ± ± . ± . ± ± ± ± . ± . ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± ± . ± . ± ± ± . ± . ± ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± . ± . ± ± . ± . ± ± ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± ± ± . ± . ± ± . ± . ± ± ± ± ± ± . ± . ± ± ± ± . ± . TM Joint L . ± . ± ± ± ± . ± . ± ± ± ± ± ± ± ± . ± . . ± . ± ± ± ± . ± . ± ± ± ± ± ± . ± . ± ± ± ± . ± . ± ± . ± . ± ± ± ± . ± . ± ± ± ± . ± . ± ± ± ± Average ± ± . ± . ± ± ± ± . ± . ± ± in all of these three sources. For comparison, we implemented3D-UNet [27] with three versions: original one, SE block [47]added and residual connection [48] added, where we modifiedthem by starting with 48 base channels and applying IN [45] fornormalization. These networks and our proposed 3D-SepNetwere implemented in PyTorch, trained on two NVIDIA 1080TIGPUs for 250 epochs with the Adam optimizer [49], batch size6, initial learning rate 10 − and weight decay 10 − . The learningrate was decayed by 0.9 every 10 epochs. Without resampling,each CT slice was cropped by a 256 ×
256 window located atthe center to focus on the body part. Random cropping of size16 × ×
128 was used for data augmentation. Our code isavailable at https://github.com/HiLab-git/SepNet . We trained six models with three di ff erent SLFs (i.e., SLF1,SLF2 and SLF3) and two loss functions (i.e., AT H ( α = . − L Exp and
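For concreteness, the optimization schedule described above can be written in PyTorch as follows; the initial learning rate and weight decay are placeholders (the exact values are not recoverable here), while the 250 epochs, batch size 6 and the decay of 0.9 every 10 epochs follow the text:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

net = torch.nn.Conv3d(1, 23, 1)  # stand-in for 3D-SepNet (22 OARs + background)
optimizer = Adam(net.parameters(), lr=1e-3, weight_decay=1e-8)  # placeholders
scheduler = StepLR(optimizer, step_size=10, gamma=0.9)  # decay by 0.9 per 10 epochs

for epoch in range(250):
    # ... train on randomly cropped sub-volumes (16 slices deep) of the
    # 256 x 256 center-cropped CT stacks, batch size 6 ...
    scheduler.step()
```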
Our algorithms were containerized with Docker and submitted to the organizers of the StructSeg 2019 challenge to get results on the official testing set containing 10 images. For each test case, the algorithm required at most 120 seconds to generate the output. Quantitative evaluation of the segmentation accuracy was based on the Dice Similarity Coefficient (DSC) and the 95% Hausdorff Distance (95% HD):

\[ DSC = \frac{2 \times TP}{2 \times TP + FN + FP} \tag{13} \]

where $TP$, $FP$ and $FN$ are the numbers of true positive, false positive and false negative voxels, respectively. The maximum Hausdorff Distance is the maximum distance of a set to the nearest point in the other set, defined as the maximin function:

\[ d_H(X, Y) = \max\big\{\max_{x \in X}\min_{y \in Y} d(x, y),\ \max_{y \in Y}\min_{x \in X} d(x, y)\big\} \tag{14} \]

The 95% HD is similar to the maximum HD, but is based on the 95th percentile of the distances between boundary points in $X$ and $Y$; the purpose of this metric is to eliminate the impact of a very small subset of outliers. Each type of objective score of the different organs is weighted by the organs' importance weights and then averaged. The 22 annotated OARs with importance weights are: left eye (100), right eye (100), left lens (50), right lens (50), left optical nerve (80), right optical nerve (80), optical chiasma (50), pituitary (80), brain stem (100), left temporal lobe (80), right temporal lobe (80), spinal cord (100), left parotid gland (50), right parotid gland (50), left inner ear (70), right inner ear (70), left middle ear (70), right middle ear (70), left temporomandibular joint (60), right temporomandibular joint (60), left mandible (100), right mandible (100). Eventually, the proposed method achieved a weighted average DSC of 80.52% and a 95% HD of 3.043 mm on the official test set.

For the ablation study, as the ground truth of the official testing images in StructSeg 2019 was not publicly available, we did not use any in-house data and randomly split the StructSeg 2019 training set into 40 images for training and the other 10 images for testing, which is referred to as the local testing data.
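A sketch of the two evaluation metrics of Eqs. (13)-(14): DSC from binary overlap counts, and the 95% HD computed from boundary-to-boundary distances via Euclidean distance transforms (our implementation choice; the challenge's official scoring code may differ):

```python
import numpy as np
from scipy import ndimage

def dice(pred, gt):
    """Eq. (13): DSC = 2*TP / (2*TP + FN + FP) for binary masks."""
    tp = np.logical_and(pred, gt).sum()
    return 2.0 * tp / (pred.sum() + gt.sum())

def hd95(pred, gt, spacing=(3.0, 1.0, 1.0)):
    """Eq. (14) with the 95th percentile instead of the maximum, which
    suppresses the influence of a small subset of outlier boundary points.
    Assumes both masks are non-empty; spacing is (z, y, x) in mm."""
    def surface(mask):
        return np.logical_and(mask, ~ndimage.binary_erosion(mask))
    sp, sg = surface(pred.astype(bool)), surface(gt.astype(bool))
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_g = ndimage.distance_transform_edt(~sg, sampling=spacing)
    dist_to_p = ndimage.distance_transform_edt(~sp, sampling=spacing)
    d = np.concatenate([dist_to_g[sp], dist_to_p[sg]])
    return np.percentile(d, 95)
```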
Figure 7: Visual comparison of different intensity transforms and loss functions for segmentation of OARs, where SLF means Segmental Linear Function and NLF means Naive Linear Function. L_Exp is the exponential logarithmic loss and ATH−L_Exp is the combination of our proposed ATH and L_Exp. As pointed out by the red arrows in the three views, models trained with an SLF delineate OAR boundaries (e.g., the inner ears and temporal lobes) more accurately than models trained with NLFs. Moreover, compared with L_Exp and the Dice loss, ATH−L_Exp makes the network pay more attention to hard voxels, such as boundaries in easy or large regions, thus achieving more accurate results.

Table 2: Quantitative comparison between our 3D-SepNet and the 3D-UNet based versions. These networks were trained with ATH(α = 0.5)−L_Exp and SLF1. Small organs comprise the lens, optical nerves, optical chiasma, pituitary and inner ears. The best result is in bold font.
Figure 8: Example of averaged activation maps of different networks. We map the values of each map m from [m_min, m_max] to [0, 1] for visualization, where m_min and m_max are the minimal and maximal values of the original map m. The red arrows in the three views point out that for small organs under the large inter-slice spacing, 3D-SepNet actually benefits from our separable convolutions and draws more accurate boundaries than the other compared networks.
For quantitative comparison, Table 1 reports the performance on our local testing set achieved by models trained with 3D-SepNet in two main groups: (1) three intensity transform functions, SLF1, NLF1 and NLF2, each with ATH(α = 0.5)−L_Exp; (2) three different loss functions, ATH(α = 0.5)−L_Exp, L_Exp and the Dice loss, each with the SLF1 transform, to investigate the effectiveness of our ATH−L_Exp. As shown in Table 1, our SLF1 exceeds NLF1 and NLF2, which cannot differentiate multiple OARs well: SLF1 improves the Dice by 0.6% and 0.7% compared with the traditional NLF1 and NLF2, respectively. Our ATH−L_Exp outperforms L_Exp in most organs, like the brain stem, optical nerves and inner ears, which verifies that ATH−L_Exp forces the network to pay more attention to hard voxels compared with L_Exp while still preserving the ability to distinguish the small organs.

For qualitative comparison, Fig. 7 shows a visual comparison of segmentation results based on the above-mentioned five models. It is easy to observe that in all views, the model trained with the wide intensity window of NLF2 failed to recognize the accurate borders of the brain stem, right temporomandibular joint and left parotid gland due to the low contrast of the transformed images. On the contrary, the model trained with SLF1 could distinguish these OAR boundaries effectively, demonstrating the superiority of our proposed intensity transform based on SLFs. It is also visible that for the hard voxels (near the boundaries of OARs) in easy organs like the middle ears and brain stem, segmentations based on L_Exp and the Dice loss are not very accurate, which is in line with our inference in Section 3.3 that L_Exp may restrain the gradients of hard voxels in easy regions. In contrast, ATH−L_Exp makes the network focus on these voxels and provides more accurate results.
We also compare our 3D-SepNet with three variants of 3D-UNet: the original one, one with SE blocks added and one with residual connections added. As shown in Table 2, with the same loss function and intensity transform function, the 3D-UNet variants have a lower performance than 3D-SepNet on the small flat organs, i.e., the lens, optical nerves, optical chiasma, pituitary and inner ears, which are only present in a few slices. The reason could be that after each inter-slice convolution, the boundary along the through-plane direction is blurred. For these flat organs, blurring the through-plane boundary leads to a larger reduction of segmentation accuracy than for organs that span a larger number of slices, where the blurred boundary region occupies a relatively small volume. Thus, our network is more friendly to flat organs than the other 3D-UNet versions. To illustrate this point, for each network we output the averaged activation maps before the last 1 × 1 × 1 convolutional layer and map the values of each map m from [m_min, m_max] to [0, 1] for visualization, where m_min and m_max are the minimal and maximal values of the original map m.

Figure 9: Quantitative comparison of the model ensemble with single models for segmentation of OARs. All models are based on 3D-SepNet: Model 1-3 were trained with ATH(α = 0.5)−L_Exp and SLF1, SLF2 and SLF3 respectively, and Model 4-6 with ATH(α = 1)−L_Exp and SLF1, SLF2 and SLF3 respectively.

Figure 10: Structure-level uncertainty in terms of VVC of OARs in the segmentation results. Each VVC is computed from the 6 models used for the ensemble and averaged within the local test dataset.
As presented in Fig. 8, the averaged activation maps from 3D-SepNet show high responses in small organs, e.g., fitting well to the optic nerve boundaries, especially in the sagittal and coronal views. In contrast, the shapes and boundaries of the same organs are unclear in the averaged activation maps of the other counterparts. Moreover, in Table 2 it can also be observed that 3D-UNet and 3D-UNet-Res achieve a performance similar to 3D-SepNet in organs with large scales in the through-plane direction, like the spinal cord, parotid gland and mandible. With about 1/3 of their parameters, our 3D-SepNet works more effectively.

Section 4.1.1 shows that it is difficult for a single intensity transform function and a single loss function to achieve the best performance for all OARs, and different SLFs and loss functions are complementary to each other. Therefore, we use an ensemble of them for more robust segmentation. Our ensemble is based on the 6 trained models, covering the combinations of the three SLFs with the loss functions ATH−L_Exp of α = 0.5 and α = 1. Combining different complementary intensity transform functions and loss functions effectively improves the segmentation robustness.

As we estimated the voxel-wise uncertainty based on the segmentation labels obtained by the 6 models using Eq. (11), we observed 6 levels of voxel-wise uncertainty in total: Level 1: 0, Level 2: 0.451, Level 3: 0.637, Level 4: 0.693, Level 5: 0.8675, Level 6: 1.011. Fig. 11 shows an example of the uncertainty information obtained by our ensemble model. In each subfigure, the first image shows the ensemble result compared with the ground truth, and the second image shows the uncertainty estimation encoded by a color bar, with yellow voxels having high uncertainty and purple voxels having low uncertainty. It can be observed that most of the uncertain segmentation belongs to soft tissues and locates near the borders of OARs. Moreover, the red arrows in Fig. 11 point out the overlap between highly uncertain regions and mis-segmented areas, showing that the uncertainty is highly related to segmentation errors. We also calculated the voxel-wise average error rate at each uncertainty level based on three different regions: 1) the entire image, 2) the predicted background region, and 3) the predicted foreground region of OARs.
Figure 11: An example of our segmentation uncertainty estimation, encoded by the color bar in the top left corner, with yellow voxels having high uncertainty values and purple voxels having low uncertainty. The left image in each sub-figure is our ensemble segmentation result. The red arrows point out the major overlap between highly uncertain regions and mis-segmentations.
Figure 12: Statistics of segmentation errors and voxel-wise uncertainty. (a-c) Voxel-wise prediction error rates at different uncertainty levels for the whole image, the predicted background region and the predicted OARs region, respectively. (d-f) Distribution of uncertainty levels in the mis-segmented, under-segmented and over-segmented regions, respectively. There are 6 levels of uncertainty in total: Level 1: 0, Level 2: 0.451, Level 3: 0.637, Level 4: 0.693, Level 5: 0.8675, Level 6: 1.011.

From the results shown in Fig. 12(a, b), it can be observed that for regions with uncertainty Level 1 (i.e., all the models obtained the same result), the average error rate is close to 0 for the whole image area and the predicted background area. This is mainly due to the large number of easy-to-recognize voxels in the background. For the predicted OARs area, which contains all over-segmented voxels, the average error rate in regions with uncertainty Level 1 is much higher and reaches nearly 20%, as shown in Fig. 12(c). With the increase of uncertainty, the error rate rises steeply in all three types of areas. In the predicted OARs area, the voxel-wise average error rates at the higher uncertainty levels are above 65%, while they are mainly less than 35% in the predicted background area. This indicates that uncertain regions in the predicted OARs area are more likely to be mis-segmented and are worth more attention.

Moreover, we also calculate the proportion of voxels belonging to each uncertainty level in the mis-segmentation for all three types of areas, as shown in Fig. 12(d, e, f). It can be observed that nearly half of the voxels in the mis-segmented regions have uncertainty levels between 2 and 6. Therefore, for both the predicted background and OARs areas, finding and correcting mis-segmented voxels in uncertain regions has a large potential for improving the prediction results.

For the evaluation of structure-wise uncertainty, we investigated the relationship between VVC and the segmentation accuracy measured by DSC. We calculate the VVC for each OAR based on the ensemble of six models and compute the average VVC for each OAR on our local test dataset, as shown in Fig. 10. From Fig. 9 and Fig. 10, it can be observed that the left optical nerve has a high structure-level uncertainty with a low segmentation accuracy. Moreover, we show the joint distribution of the average VVC and the DSC of the ensemble results in Fig. 13, where the fitted line shows that the DSC tends to be smaller when the VVC grows.

Table 3: Results on the mixed HAN OAR dataset for different loss functions and transform methods (DSC and 95% HD (mm)), based on 3D-SepNet.
Figure 13: Joint distribution of DSC and average VVC for each OAR of the ensemble result. Each point represents one OAR's DSC and average VVC. With the increase of VVC, the DSC tends to be smaller. The number of each point indicates the OAR: 1-6: brain stem, eye L, eye R, lens L, lens R, optical nerve L; 7-12: optical nerve R, optical chiasma, temporal lobe L, temporal lobe R, pituitary, parotid gland L; 13-18: parotid gland R, inner ear L, inner ear R, middle ear L, middle ear R, TM joint L; 19-22: TM joint R, spinal cord, mandible L, mandible R.
This demonstrates that a high VVC value can indicate inaccurate segmentation well. The three points far below the fitted line are the pituitary, right optical nerve and optical chiasma, showing the difficulty of segmenting these OARs.

For further investigation, we applied our methods to a mixed dataset of HAN CT scans of 165 patients: 50 from StructSeg 2019, 48 from the MICCAI 2015 Head and Neck challenge and 67 collected locally. We segment 7 organs that were annotated in all of them: brain stem, optical chiasma, mandible, right/left optical nerves, and right/left parotid glands. However, as shown in Fig. 14, the labeling style of each dataset is largely different, especially for small organs such as the optical chiasma and nerves, which makes this a challenging task. The StructSeg 2019 and our locally collected dataset had an inter-slice spacing around 3 mm and an intra-slice spacing around 1 mm. The MICCAI 2015 Head and Neck challenge dataset had an intra-slice spacing in the range of 0.76 mm to 1.27 mm and an inter-slice spacing from 1.25 mm to 3 mm; these CT scans were interpolated to a voxel size of 3 × 1 × 1 mm.

Figure 14: Examples of the labeling styles of the three datasets: (a) StructSeg 19, (b) MICCAI 15, (c) locally collected. The major labeling style differences are pointed out by the red arrows.

Table 3 shows the results of different loss functions and intensity transform functions based on 3D-SepNet. Due to the different labeling styles of the datasets, it is hard to obtain an ideal model for every patient, leading to unstable performance for small OARs like the optical chiasma. However, ATH−L_Exp with SLF1 still achieves the best DSC and 95% HD, proving the robustness of our methods. In addition, Table 4 shows that compared with the other three 3D-UNet variants, our 3D-SepNet also achieves prominent improvements for the optical chiasma and nerves, especially in terms of DSC. This further illustrates the applicability of our network to small organs.
5. Discussion and Conclusion
To segment multiple OARs with different sizes from CT images with low contrast and anisotropic resolution, we propose a novel framework consisting of a Segmental Linear Function (SLF)-based intensity transform and a 3D-SepNet trained with hard-voxel weighting. For the intensity transform, the traditional Naive Linear Function (NLF) may not be effective enough because it cannot obtain good visibility for multiple OARs including both soft tissues and bones at the same time. In contrast, our SLF gives multiple OARs good visibility at the same time. Considering the clinical fact that radiologists view scans under different window widths/levels to better delineate different organs, we also applied multiple SLFs to better segment different tissues. Generally, our SLF for intensity transform is also applicable to other tasks, such as the segmentation of multiple organs from thoracic or abdominal CT images [50].

To deal with the severe imbalance between large and small OARs, we first used the exponential logarithmic loss (L_Exp) that weights the OARs by their relative sizes and class-level segmentation difficulties. However, we found that L_Exp might limit the performance of CNNs on hard voxels in large or easy OARs. To solve this problem, we proposed a weighting function to make the network focus on the hard voxels, and combined it with L_Exp in our case, named ATH−L_Exp.

Table 4: Quantitative comparison between our 3D-SepNet and the 3D-UNet based versions on the mixed HAN OAR dataset. These networks were trained with ATH(α = 0.5)−L_Exp and SLF1.

As shown in the experimental results, ATH−L_Exp outperformed the linear DSC loss and L_Exp for most OARs, like the brain stem, eyes, lens and parotid glands. With the tunable attention parameter α > 0, we can further control how much attention is paid to hard regions. Our weighting function can also be easily combined with other segmentation loss functions.

We validated our framework with extensive experiments on the HAN OAR segmentation task of the StructSeg 2019 challenge and a mixed HAN dataset from three sources. With about 1/3 of the parameters of 3D-UNet, our 3D-SepNet works more effectively. The experiments found that our 3D-SepNet outperformed 3D-UNet for some flat organs, because 3D-UNet contains more inter-slice convolution operations than our 3D-SepNet and blurs through-plane boundaries. For these flat organs, blurring the through-plane boundaries makes accurate segmentation harder. Therefore, using a relatively small number of inter-slice convolutions helps to alleviate this problem and obtains better results for organs that are only present in a few axial slices. Some other techniques like group-wise [51] and depthwise separable convolutions [52] could also be combined with our method to build more lightweight and efficient networks.

Considering that models trained with different loss functions and SLFs obtained the best performance for different OARs, we use an ensemble of these models with a weighted average. Our method won the 3rd place in the HAN OAR segmentation task of the StructSeg 2019 challenge, achieving a weighted average DSC of 80.52% and a 95% HD of 3.043 mm on the official testing images. The leaderboard (our team is UESTC 501) shows that the difference between our method and the other top-performing methods is not very large in terms of DSC. This is mainly because the leaderboard reports results averaged across a set of 22 organs that is a mixture of easy and hard organs. We found that UNet could achieve a DSC larger than 80% in nearly 11 of them, which leaves limited room for improvement on these easy organs. Therefore, significant improvements on hard organs may not lead to a large improvement of the average result. However, it should be noticed that the average DSC of our method is only 0.14% lower than that of the second best performing method, and is 0.64% and 0.95% higher than those of the 4th and 7th top-performing methods, respectively.

Furthermore, to the best of our knowledge, this is the first work to investigate the uncertainty of HAN OAR segmentation using CNNs. Our results show that there is a high correlation between mis-segmentations and highly uncertain regions, leading to more informative segmentation outputs that could be used for refinement towards more accurate segmentation results and could help radiologists during radiotherapy planning. In the future, it is of interest to further improve the segmentation performance on small organs like the optical chiasma and optical nerves, and to leverage the uncertainty information to guide user interactions for refinement in challenging cases.

Acknowledgements
This work was supported by the National Natural Science Foundation of China [81771921, 61901084].
References [1] W. I. Wei, J. S. Sham, Nasopharyngeal carcinoma, Lancet 365 (2005)2041–2054.[2] E. T. Chang, H.-O. Adami, The enigmatic epidemiology of nasopharyn-geal carcinoma, Cancer Epidemiol. Biomark. Prev 15 (2006) 1765–1777.[3] M. A. Hunt, M. J. Zelefsky, S. Wolden, C.-S. Chui, T. LoSasso, K. Rosen-zweig, L. Chong, S. V. Spirou, L. Fromme, M. Lumley, et al., Treatmentplanning and delivery of intensity-modulated radiation therapy for pri-mary nasopharynx cancer, Int. J. Radiat. Oncol. Biol. Phys. 49 (2001)623–632.[4] W. Zhu, Y. Huang, L. Zeng, X. Chen, Y. Liu, Z. Qian, N. Du, W. Fan,X. Xie, AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy, Med. Phys. 46 (2019)576–589.[5] Y. Gao, R. Huang, M. Chen, Z. Wang, J. Deng, Y. Chen, Y. Yang,J. Zhang, C. Tao, H. Li, FocusNet: Imbalanced Large and Small Or-gan Segmentation with an End-to-End Deep Neural Network for Headand Neck CT Images, in: MICCAI, Springer, 2019, pp. 829–838.[6] K. C. Wong, M. Moradi, H. Tang, T. Syeda-Mahmood, 3D segmentationwith exponential logarithmic loss for highly unbalanced object sizes, in:MICCAI, 2018, pp. 612–619.[7] S. Nikolov, S. Blackwell, R. Mendes, J. De Fauw, C. Meyer, C. Hughes,H. Askham, B. Romera-Paredes, A. Karthikesalingam, C. Chu, et al.,Deep learning to achieve clinically applicable segmentation of head andneck anatomy for radiotherapy, arXiv preprint arXiv:1809.04430 (2018).[8] W. Shi, X. Zhuang, R. Wolz, D. Simon, K. Tung, H. Wang, S. Ourselin,P. Edwards, R. Razavi, D. Rueckert, A multi-image graph cut approachfor cardiac image segmentation and uncertainty estimation, in: Int. MIC-CAI STACOM Work., Springer, 2011, pp. 178–187.[9] G. Wang, W. Li, T. Vercauteren, S. Ourselin, Automatic Brain TumorSegmentation Based on Cascaded Convolutional Neural Networks WithUncertainty Estimation, Front. Comput. Neurosci. 13 (2019) 56.
[10] W. Lei, H. Wang, R. Gu, S. Zhang, S. Zhang, G. Wang, DeepIGeoS-V2: Deep interactive segmentation of multiple organs from head and neck images with lightweight CNNs, in: Int. MICCAI LABELS Work., Springer, 2019, pp. 61–69.
[11] G. Wang, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren, S. Zhang, Uncertainty-guided efficient interactive refinement of fetal brain segmentation from stacks of MRI slices, in: MICCAI, Springer, 2020, pp. 279–288.
[12] J.-F. Daisne, A. Blumhofer, Atlas-based automatic segmentation of head and neck organs at risk and nodal target volumes: a clinical validation, Radiat. Oncol. 8 (2013) 154.
[13] Z. Wang, L. Wei, L. Wang, Y. Gao, W. Chen, D. Shen, Hierarchical vertex regression-based segmentation of head and neck CT images for radiotherapy planning, IEEE Trans. Image Process. 27 (2017) 923–937.
[14] S. Gorthi, V. Duay, N. Houhou, M. B. Cuadra, U. Schick, M. Becker, A. S. Allal, J.-P. Thiran, Segmentation of head and neck lymph node regions for radiotherapy planning using active contour-based atlas registration, IEEE J. Sel. Top. Signal Process. 3 (2009) 135–147.
[15] H. Yu, C. Caldwell, K. Mah, I. Poon, J. Balogh, R. MacKenzie, N. Khaouam, R. Tirona, Automated radiation targeting in head-and-neck cancer using region-based texture analysis of PET and CT images, Int. J. Radiat. Oncol. Biol. Phys. 75 (2009) 618–625.
[16] X. Yang, N. Wu, G. Cheng, Z. Zhou, S. Y. David, J. J. Beitler, W. J. Curran, T. Liu, Automated segmentation of the parotid gland based on atlas registration and machine learning: a longitudinal MRI study in head-and-neck radiation therapy, Int. J. Radiat. Oncol. Biol. Phys. 90 (2014) 1225–1233.
[17] J. Zhou, K. L. Chan, P. Xu, V. F. Chong, Nasopharyngeal carcinoma lesion segmentation from MR images by support vector machine, in: ISBI, IEEE, 2006, pp. 1364–1367.
[18] I. Fitton, S. Cornelissen, J. C. Duppen, R. Steenbakkers, S. Peeters, F. Hoebers, J. H. Kaanders, P. Nowak, C. R. Rasch, M. van Herk, Semi-automatic delineation using weighted CT-MRI registered images for radiotherapy of nasopharyngeal cancer, Med. Phys. 38 (2011) 4662–4666.
[19] F. K. Lee, D. K. Yeung, A. D. King, S. Leung, A. Ahuja, Segmentation of nasopharyngeal carcinoma (NPC) lesions in MR images, Int. J. Radiat. Oncol. Biol. Phys. 61 (2005) 608–620.
[20] B. Ibragimov, L. Xing, Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks, Med. Phys. 44 (2017) 547–557.
[21] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: CVPR, 2015, pp. 3431–3440.
[22] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: MICCAI, Springer, 2015, pp. 234–241.
[23] K. Men, J. Dai, Y. Li, Automatic segmentation of the clinical target volume and organs at risk in the planning CT for rectal cancer using deep dilated convolutional neural networks, Med. Phys. 44 (2017) 6377–6389.
[24] L. Zhao, Z. Lu, J. Jiang, Y. Zhou, Y. Wu, Q. Feng, Automatic nasopharyngeal carcinoma segmentation using fully convolutional networks with auxiliary paths on dual-modality PET-CT images, J. Digit. Imaging 32 (2019) 462–470.
[25] K. D. Fritscher, M. Peroni, P. Zaffino, M. F. Spadea, R. Schubert, G. Sharp, Automatic segmentation of head and neck CT images for radiotherapy treatment planning using multiple atlases, statistical appearance models, and geodesic active contours, Med. Phys. 41 (2014) 051910.
[26] S. Liang, F. Tang, X. Huang, K. Yang, T. Zhong, R. Hu, S. Liu, X. Yuan, Y. Zhang, Deep-learning-based detection and segmentation of organs at risk in nasopharyngeal carcinoma computed tomographic images for radiotherapy planning, Eur. Radiol. 29 (2019) 1961–1967.
[27] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, O. Ronneberger, 3D U-Net: learning dense volumetric segmentation from sparse annotation, in: MICCAI, Springer, 2016, pp. 424–432.
[28] P. F. Raudaschl, P. Zaffino, G. C. Sharp, M. F. Spadea, A. Chen, B. M. Dawant, T. Albrecht, T. Gass, C. Langguth, M. Lüthi, et al., Evaluation of segmentation methods on head and neck CT: auto-segmentation challenge 2015, Med. Phys. 44 (2017) 2020–2036.
[29] F. Milletari, N. Navab, S.-A. Ahmadi, V-Net: Fully convolutional neural networks for volumetric medical image segmentation, in: 3DV, IEEE, 2016, pp. 565–571.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: ICCV, 2017, pp. 2980–2988.
[31] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, J. Liu, CE-Net: Context encoder network for 2D medical image segmentation, IEEE Trans. Med. Imaging 38 (2019) 2281–2292.
[32] J. Chen, Y. Wang, R. Guo, B. Yu, T. Chen, W. Wang, R. Feng, D. Z. Chen, J. Wu, LSRC: A long-short range context-fusing framework for automatic 3D vertebra localization, in: MICCAI, Springer, 2019, pp. 95–103.
[33] D. Karimi, S. E. Salcudean, Reducing the Hausdorff distance in medical image segmentation with convolutional neural networks, IEEE Trans. Med. Imaging 39 (2019) 499–513.
[34] M. P. Heinrich, Closing the gap between deep and conventional image registration using probabilistic dense displacement networks, in: MICCAI, Springer, 2019, pp. 50–58.
[35] G. Wang, W. Li, S. Ourselin, T. Vercauteren, Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks, in: Int. MICCAI Brainlesion Work., Springer, 2017, pp. 178–190.
[36] H. R. Roth, L. Lu, J. Liu, J. Yao, A. Seff, K. Cherry, L. Kim, R. M. Summers, Improving computer-aided detection using convolutional neural networks and random view aggregation, IEEE Trans. Med. Imaging 35 (2015) 1170–1181.
[37] J.-S. Prassni, T. Ropinski, K. Hinrichs, Uncertainty-aware guided volume segmentation, IEEE Trans. Vis. Comput. Graph. 16 (2010) 1358–1365.
[38] G. Wang, W. Li, M. A. Zuluaga, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin, et al., Interactive medical image segmentation using deep learning with image-specific fine tuning, IEEE Trans. Med. Imaging 37 (2018) 1562–1573.
[39] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, in: NIPS, 2017, pp. 5574–5584.
[40] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: NIPS, 2017, pp. 6402–6413.
[41] Y. Zhu, N. Zabaras, Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification, J. Comput. Phys. 366 (2018) 415–447.
[42] A. G. Roy, S. Conjeti, N. Navab, C. Wachinger, Inherent brain segmentation quality control from fully convnet Monte Carlo sampling, in: MICCAI, Springer, 2018, pp. 664–672.
[43] T. Nair, D. Precup, D. L. Arnold, T. Arbel, Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation, Med. Image Anal. 59 (2020) 101557.
[44] G. Wang, W. Li, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren, Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks, Neurocomputing 338 (2019) 34–45.
[45] D. Ulyanov, A. Vedaldi, V. Lempitsky, Instance normalization: The missing ingredient for fast stylization, arXiv preprint arXiv:1607.08022 (2016).
[46] V. Nair, G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: ICML, 2010, pp. 807–814.
[47] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: CVPR, 2018, pp. 7132–7141.
[48] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.
[49] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[50] E. Gibson, F. Giganti, Y. Hu, E. Bonmati, S. Bandula, K. Gurusamy, B. Davidson, S. P. Pereira, M. J. Clarkson, D. C. Barratt, Automatic multi-organ segmentation on abdominal CT with dense V-networks, IEEE Trans. Med. Imaging 37 (2018) 1822–1834.
[51] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.
[52] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: CVPR, 2017, pp. 1251–1258.