Squeeze-and-Excitation Normalization for Automated Delineation of Head and Neck Primary Tumors in Combined PET and CT Images
Andrei Iantsen, Dimitris Visvikis, and Mathieu Hatt

LaTIM, INSERM, UMR 1101, University Brest, Brest, France
Abstract.
Development of robust and accurate fully automated methods for medical image segmentation is crucial in clinical practice and radiomics studies. In this work, we contributed an automated approach for Head and Neck (H&N) primary tumor segmentation in combined positron emission tomography / computed tomography (PET/CT) images in the context of the MICCAI 2020 Head and Neck Tumor segmentation challenge (HECKTOR). Our model was designed on the U-Net architecture with residual layers and supplemented with Squeeze-and-Excitation Normalization. The described method achieved competitive results in cross-validation (DSC 0.745, precision 0.760, recall 0.789) performed on different centers, as well as on the test set (DSC 0.759, precision 0.833, recall 0.740), which allowed us to win first prize in the HECKTOR challenge among 21 participating teams. The full implementation based on PyTorch and the trained models are available at https://github.com/iantsen/hecktor.

Keywords:
Medical imaging · Segmentation · Head and Neck cancer · U-Net · SE Normalization.
Combined positron emission tomography / computed tomography (PET/CT) imaging is broadly used in clinical practice for radiotherapy treatment planning, initial staging and response assessment. In radiomics analyses, quantitative evaluation of radiotracer uptake in PET images and tissue density in CT images aims at extracting clinically relevant features and building diagnostic, prognostic and predictive models. The segmentation step of the radiomics workflow is the
Cite this paper as: Iantsen A., Visvikis D., Hatt M. (2021) Squeeze-and-Excitation Normalization for Automated Delineation of Head and Neck Primary Tumors in Combined PET and CT Images. In: Andrearczyk V., Oreiller V., Depeursinge A. (eds) Head and Neck Tumor Segmentation. HECKTOR 2020. Lecture Notes in Computer Science, vol 12603. Springer, Cham. https://doi.org/10.1007/978-3-030-67194-5_4

most time-consuming bottleneck, and variability in usual semi-automatic segmentation methods can significantly affect the extracted features, especially in the case of manual segmentation, which is affected by the highest magnitude of inter- and intra-observer variability. Under these circumstances, a fully automated segmentation is highly desirable to automate the whole process and facilitate its clinical routine usage.

The MICCAI 2020 Head and Neck Tumor segmentation challenge (HECKTOR) [1] aims at evaluating automatic algorithms for segmentation of Head and Neck (H&N) tumors in combined PET and CT images. A dataset of 201 patients from four medical centers in Québec (CHGJ, CHMR, CHUM and CHUS) with histologically proven H&N cancer in the oropharynx is provided for model development. A test set comprised of 53 patients from a different center in Switzerland (CHUV) is used for evaluation. All images were re-annotated by an expert for the purpose of the challenge in order to determine the primary gross tumor volumes (GTV) on which the methods are evaluated using the Dice score (DSC), precision and recall.

This paper describes our approach based on convolutional neural networks supplemented with Squeeze-and-Excitation Normalization (SE Normalization or SE Norm) layers to address the goal of the HECKTOR challenge.
The key element of our model is the SE Normalization layer [6] that we recently proposed in the context of the Brain Tumor Segmentation Challenge (BraTS 2020) [3]. Similarly to Instance Normalization [4], for an input X = (x₁, x₂, …, x_N) with N channels, an SE Norm layer first normalizes all channels of each example in a batch using the mean and standard deviation:

    x′ᵢ = (1/σᵢ)(xᵢ − μᵢ)   (1)

where μᵢ = E[xᵢ] and σᵢ = √(Var[xᵢ] + ε), with ε a small constant to prevent division by zero. After, a pair of parameters γᵢ, βᵢ are applied to each channel to scale and shift the normalized values:

    yᵢ = γᵢ x′ᵢ + βᵢ   (2)

In the case of Instance Normalization, both parameters γᵢ, βᵢ, fitted in the course of training, stay fixed and independent of the input X during inference. By contrast, we propose to model the parameters γᵢ, βᵢ as functions of the input X by means of Squeeze-and-Excitation (SE) blocks [5], i.e.,

    γ = f_γ(X)   (3)
    β = f_β(X)   (4)

where γ = (γ₁, γ₂, …, γ_N) and β = (β₁, β₂, …, β_N) are the scale and shift parameters for all channels, f_γ is the original SE block with the sigmoid activation, and f_β is modeled as the SE block with the tanh activation function to enable a negative shift (see Fig. 1a). Both SE blocks first apply global average pooling (GAP) to squeeze each channel into a single descriptor. Then, two fully connected (FC) layers aim at capturing non-linear cross-channel dependencies. The first FC layer is implemented with a reduction ratio r to form a bottleneck for controlling model complexity. Throughout this paper, we apply SE Norm layers with a fixed reduction ratio r = 2.

Fig. 1: Layers with SE Normalization: (a) SE Norm layer, (b) residual layer with the shortcut connection, and (c) residual layer with the non-linear projection. Output dimensions are depicted in italics.
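As a rough illustration, the SE Norm layer described above can be sketched in PyTorch (a minimal sketch, not the released implementation; the class and argument names `SEBlock` and `SENorm` are our own):

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: GAP -> FC -> ReLU -> FC -> activation."""

    def __init__(self, channels, reduction=2, activation=nn.Sigmoid()):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck with ratio r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            activation,
        )

    def forward(self, x):                      # x: (B, C, D, H, W)
        s = x.mean(dim=(2, 3, 4))              # squeeze: global average pooling
        return self.fc(s).view(x.size(0), -1, 1, 1, 1)  # one descriptor per channel


class SENorm(nn.Module):
    """Instance-normalize each channel, then scale/shift with gamma, beta
    computed from the input itself by two SE blocks (Eqs. 1-4)."""

    def __init__(self, channels, reduction=2, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.f_gamma = SEBlock(channels, reduction, nn.Sigmoid())  # scale in (0, 1)
        self.f_beta = SEBlock(channels, reduction, nn.Tanh())      # shift in (-1, 1)

    def forward(self, x):
        mu = x.mean(dim=(2, 3, 4), keepdim=True)
        var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)   # instance normalization
        return self.f_gamma(x) * x_hat + self.f_beta(x)
```

The tanh activation in `f_beta` mirrors the choice above of allowing a negative shift, while the sigmoid in `f_gamma` keeps the scale positive.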
Our model is built upon the seminal U-Net architecture [7,8] with the use of SE Norm layers [6]. The convolutional blocks that form the model decoder are stacks of 3 × 3 × 3 convolutions, each followed by SE Norm and ReLU, while the encoder is composed of residual layers (see Fig. 1b,c). To upsample feature maps in the decoder, additional paths apply 1 × 1 × 1 convolutions to reduce the number of channels and utilize trilinear interpolation to increase the spatial size of the feature maps (see Fig. 2, yellow blocks). The first residual block, placed after the input, is implemented with a kernel size of 7 × 7 × 7 to enlarge the receptive field.

Both PET and CT images were first resampled to a common resolution of 1 × 1 × 1 mm with trilinear interpolation. Each training example was a patch of 144 × 144 × 144 voxels randomly extracted from a whole PET/CT image, whereas validation examples were obtained from the bounding boxes provided by the organizers. Training patches were extracted to include the tumor class with a probability of 0.9 to facilitate model training. CT intensities were clipped to a fixed range of Hounsfield units and rescaled before training.

The model was trained for 800 epochs using the Adam optimizer on two NVIDIA GeForce GTX 1080 Ti GPUs (11 GB) with a batch size of 2 (one sample per worker). A cosine annealing schedule was applied to reduce the learning rate within every 25 epochs.

The unweighted sum of the Soft Dice Loss [9] and the Focal Loss [10] is utilized to train the model. Based on [9], the Soft Dice Loss for one training example can be written as

    L_Dice(y, ŷ) = 1 − (2 Σᵢ yᵢ ŷᵢ + 1) / (Σᵢ yᵢ + Σᵢ ŷᵢ + 1)   (5)

The Focal Loss is defined as

    L_Focal(y, ŷ) = −(1/N) Σᵢ yᵢ (1 − ŷᵢ)^γ ln(ŷᵢ)   (6)

In both definitions, yᵢ ∈ {0, 1} is the label for the i-th voxel, ŷᵢ ∈ [0, 1] is the predicted probability for the i-th voxel, and N is the total number of voxels. Additionally, we add +1 to the numerator and denominator in the Soft Dice Loss to avoid division by zero in cases when the tumor class is not present in training patches. The parameter γ in the Focal Loss is set to 2.
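The two loss terms can be illustrated with a short PyTorch sketch (our own hedged approximation of Eqs. (5) and (6), not the released code; it assumes binary targets and predicted probabilities already flattened into tensors, and the function names are ours):

```python
import torch


def soft_dice_loss(y, y_hat):
    """Soft Dice Loss (Eq. 5). The +1 added to numerator and denominator
    avoids division by zero on patches without any tumor voxels."""
    inter = (y * y_hat).sum()
    return 1.0 - (2.0 * inter + 1.0) / (y.sum() + y_hat.sum() + 1.0)


def focal_loss(y, y_hat, gamma=2.0, eps=1e-7):
    """Focal Loss (Eq. 6) with the down-weighting exponent gamma = 2."""
    y_hat = y_hat.clamp(eps, 1.0 - eps)  # numerical safety for the log
    return -(y * (1.0 - y_hat) ** gamma * torch.log(y_hat)).mean()


def total_loss(y, y_hat):
    """Unweighted sum of the two terms, as used for training."""
    return soft_dice_loss(y, y_hat) + focal_loss(y, y_hat)
```

For a perfect prediction the Dice term vanishes, and the `(1 − ŷ)^γ` factor in the focal term suppresses the contribution of already well-classified voxels.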
Fig. 2: The model architecture with SE Norm layers. The input consists of PET/CT patches of the size of 144 × 144 × 144 voxels. The encoder consists of residual blocks with identity (solid arrows) and projection (dashed arrows) shortcuts. The decoder is formed by convolutional blocks. Additional upsampling paths are added to transfer low-resolution features further in the decoder. Kernel sizes and numbers of output channels are depicted in each block.
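One of the additional decoder upsampling paths mentioned above can be sketched as follows (an assumed minimal version, not the released code; `UpsamplePath` is a hypothetical name): a 1 × 1 × 1 convolution reduces the number of channels, then trilinear interpolation increases the spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpsamplePath(nn.Module):
    """Transfer low-resolution features further in the decoder:
    1x1x1 conv for channel reduction, then trilinear upsampling."""

    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        self.reduce = nn.Conv3d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C_in, D, H, W)
        x = self.reduce(x)                       # channel reduction
        return F.interpolate(x, scale_factor=self.scale,
                             mode='trilinear', align_corners=False)
```

Because the convolution is pointwise, the path is cheap and only mixes channels; the spatial enlargement is left entirely to the interpolation.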
Table 1: The performance results on different cross-validation splits. Average results (the row 'Average') are provided for each evaluation metric across all centers in the leave-one-center-out cross-validation (first four rows). The mean and standard deviation of each metric are computed across all data samples in the corresponding validation center. The row 'Average (rs)' indicates the average results on the four random data splits.
    Center          DSC             Precision       Recall
    CHUS (n = 72)   0.— ± 0.206     0.— ± 0.248     0.— ± 0.—
    CHUM (n = 56)   0.— ± 0.190     0.— ± 0.224     0.— ± 0.—
    CHGJ (n = 55)   0.— ± 0.180     0.— ± 0.208     0.— ± 0.—
    CHMR (n = 18)   0.— ± 0.232     0.— ± 0.286     0.— ± 0.—
    Average         0.745           0.760           0.789
    Average (rs)    —               —               —

Our results on the test set were produced with the use of an ensemble of eight models trained and validated on different splits of the training set. Four models were built using a leave-one-center-out cross-validation, i.e., the data from three centers was used for training and the data from the fourth center was held out for validation. Four other models were fitted on random training/validation splits of the whole dataset. Predictions on the test set were produced by averaging the predictions of the individual models and applying a threshold operation to the averaged probabilities.

Table 2: The ensemble results on the test set.

    Center          DSC     Precision   Recall
    CHUV (n = 53)   0.759   0.833       0.740

Our validation results in the context of the HECKTOR challenge are summarized in Table 1. The best outcome in terms of all evaluation metrics was obtained for the 'CHGJ' center with 55 patients. The model demonstrated the poorest performance for the 'CHMR' center, which is the least represented in the whole dataset. The differences with the two other centers were minor for all evaluation metrics. The small spread between all centers and the average results implies that the model predictions were robust and that no center-specific data standardization was required. This finding is supported by the lack of significant difference in
the average results between the leave-one-center-out and random split cross-validations.

The ensemble results on the test set, consisting of 53 patients from the 'CHUV' center, are presented in Table 2. On previously unseen data, the ensemble of eight models achieved the best results among the 21 participating teams, with a Dice score of 75.9%, precision of 83.3% and recall of 74.0%.
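The ensemble step described above, averaging the probability maps of the individual models and binarizing the result, can be sketched as follows (the 0.5 threshold is an assumption of this illustration, not a value taken from the paper):

```python
import torch


def ensemble_predict(prob_maps, threshold=0.5):
    """prob_maps: list of per-model probability tensors of identical shape.
    Returns a binary segmentation mask after averaging and thresholding."""
    mean_prob = torch.stack(prob_maps).mean(dim=0)   # average model outputs
    return (mean_prob > threshold).to(torch.uint8)   # binarize
```

Averaging in probability space before thresholding lets models that disagree on a voxel cancel out, which is typically more stable than majority-voting hard masks.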