Automatic Segmentation, Localization, and Identification of Vertebrae in 3D CT Images Using Cascaded Convolutional Neural Networks
Naoto Masuzawa, Yoshiro Kitamura, Keigo Nakamura, Satoshi Iizuka, Edgar Simo-Serra
Imaging Technology Center, Fujifilm Corporation, Minato, Tokyo, Japan
Center for Artificial Intelligence Research, University of Tsukuba, Tsukuba, Ibaraki, Japan
Department of Computer Science and Engineering, Waseda University, Shinjuku, Tokyo, Japan
[email protected]
Abstract.
This paper presents a method for automatic segmentation, localization, and identification of vertebrae in arbitrary 3D CT images. Many previous works cannot perform the three tasks simultaneously and require a priori knowledge of which part of the anatomy is visible in the 3D CT images. Our method tackles all three tasks in a single multi-stage framework without any such assumptions. In the first stage, we train a 3D fully convolutional network to find the bounding boxes of the cervical, thoracic, and lumbar vertebrae. In the second stage, we train an iterative 3D fully convolutional network to segment individual vertebrae within the bounding box. The input to the second network has an auxiliary channel in addition to the 3D CT images: given the already segmented vertebra regions in the auxiliary channel, the network outputs the next vertebra. The proposed method is evaluated in terms of segmentation, localization, and identification accuracy on two public datasets: 15 3D CT images from the MICCAI CSI 2014 workshop challenge and 302 3D CT images with various pathologies introduced in [1]. Our method achieved a mean Dice score of 96%, a mean localization error of 8.3 mm, and a mean identification rate of 84%. In summary, our method outperformed all existing works on all three metrics.
Keywords:
Vertebrae · Segmentation · Localization · Identification · Convolutional neural networks
Automatic segmentation, localization, and identification of individual vertebrae from 3D CT (Computed Tomography) images play an important role as a preprocessing step for automatic analysis of the spine. However, many previous works are not able to perform segmentation, localization, and identification simultaneously, and they require a priori knowledge of which part of the anatomy is visible in the 3D CT images.
Fig. 1.
Differences in anatomy between cervical and thoracic vertebrae, and between thoracic and lumbar vertebrae.
Fig. 2. a) A sagittal slice of a 3D CT image which includes cervical (C1-C7), thoracic (T1-T12), and lumbar (L1-L5) vertebrae. b) Segmentation and identification of the individual vertebrae.
We overcome those drawbacks with a single multi-stage framework. More specifically, in the first stage, we train a 3D fully convolutional network (we call it the "Semantic Segmentation Net"), which segments the cervical, thoracic, and lumbar vertebrae so as to find their bounding boxes. As shown in Figure 1, thoracic vertebrae are distinguished from the cervical and lumbar vertebrae by whether they connect to ribs, and therefore the Semantic Segmentation Net performs well even if the field-of-view (FOV) is limited. In the second stage, we train an iterative 3D fully convolutional network (we call it the "Iterative Instance Segmentation Net"), which segments (i.e., predicts the labels of all voxels in the 3D CT images), localizes (i.e., finds the centroids of all vertebrae), and identifies (i.e., assigns the anatomical labels to) the vertebrae in the bounding box one by one. Figure 2 shows an example input image and the corresponding output produced by the proposed method. In summary, our contributions are as follows: 1) a two-stage coarse-to-fine approach for vertebrae segmentation, localization, and identification; 2) in-depth experiments and comparisons with existing approaches.
The challenges associated with automatic segmentation, localization, and identification of individual vertebrae stem from the following three points: 1) the high similarity in appearance between vertebrae; 2) various pathologies such as abnormal spine curvature and vertebral fractures; 3) the variability of input 3D CT images in terms of FOV, resolution, and image artifacts. To address these challenges, many methods have been proposed. Traditionally, vertebral segmentation has used mathematical methods such as atlas-based segmentation or deformable models [5, 8, 9]. As for localization and identification, Glocker et al. [1, 2] proposed a method based on regression forests evaluated on a challenging dataset: they introduced 302 3D CT images with various pathologies, narrow FOVs, and metal artifacts. Recently, deep learning has been employed for vertebral segmentation, localization, and identification. Yang et al. [13] proposed a deep image-to-image network (DI2IN) to predict the centroid coordinates of vertebrae. On the other hand, the common way to segment vertebrae using deep learning is semantic segmentation, which predicts the labels of all voxels in the input 3D CT images. For example, Janssens et al. [4] proposed 3D fully convolutional networks (FCNs) to segment lumbar vertebrae. However, approaches based on semantic segmentation can segment vertebrae such as the lumbar spine only when the whole of the vertebrae is visible in the 3D CT images. This motivated Lessmann et al. [10] to treat vertebral segmentation as an instance segmentation problem. The networks introduced by Lessmann et al. [10] have an auxiliary channel in addition to the input: given the already segmented vertebra regions in the auxiliary channel, the networks output the next vertebra. Thus, the method proposed by Lessmann et al. [10] is able to perform vertebral segmentation even when the whole of the vertebrae is not visible in the 3D CT images and the number of vertebrae is not known a priori.

Although the method by Lessmann et al. [10] achieves high segmentation accuracy, it does not predict anatomical labels (i.e., cervical C1-C7, thoracic T1-T12, lumbar L1-L5) for each vertebra, and it does not handle general 3D CT images where it is not known in advance which part of the anatomy is visible. In fact, their method requires a priori knowledge of the anatomy, such as the location of the fifth lumbar vertebra. By contrast, our approach is able to predict anatomical labels and handle general 3D CT images.
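This iterative auxiliary-channel scheme, which our second stage builds on, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `net` stands in for a trained network, and the stopping test (an empty prediction) and the incrementing-label bookkeeping are assumptions of the sketch.

```python
import numpy as np

def iterative_instance_segmentation(ct, net, first_label, max_vertebrae=25):
    """Segment vertebrae one by one using an auxiliary "memory" channel.

    ct:          3D CT volume (numpy array), already preprocessed.
    net:         callable (ct, memory) -> binary mask of the *next* vertebra;
                 a stand-in for a trained instance segmentation network.
    first_label: label index of the first vertebra in the bounding box,
                 here assumed to come from a first-stage coarse segmentation.
    Returns a list of (label, mask) pairs; labels simply increment caudally.
    """
    memory = np.zeros_like(ct, dtype=bool)   # auxiliary channel: voxels already assigned
    results = []
    label = first_label
    for _ in range(max_vertebrae):
        mask = net(ct, memory)
        if mask.sum() == 0:                  # no further vertebra found -> stop
            break
        results.append((label, mask))
        memory |= mask                       # feed the segmented region back in
        label += 1
    return results
```

A toy `net` that marks one new voxel per call is enough to exercise the loop and see how the memory channel drives the iteration.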
Fig. 3.
A schematic view of the present approach.
Our method relies on a two-stage approach, as shown in Figure 3. The first stage aims to segment the cervical, thoracic, and lumbar vertebrae from the input 3D CT images. Individual vertebrae are segmented in the second stage; vertebral centroid coordinates and their anatomical labels are also obtained there. Below, we first present our training dataset, followed by descriptions of the Semantic Segmentation Net and the Iterative Instance Segmentation Net.
We prepared 1035 3D CT images (head: 181, chest: 477, abdomen: 270, leg: 107) for training, obtained from diverse manufacturers' equipment (e.g., GE, Siemens, Toshiba). The leg 3D CT images were included for the purpose of suppressing false positives in the first stage. The slice thickness ranges from 0.4 mm to 3.0 mm, and the in-plane resolution ranges from 0.34 mm to 0.97 mm. The images were selected to include abnormal spine curvature, metal artifacts, and narrow FOVs. Our spine model for training includes n = 25 individual vertebrae: the 19 regular cervical and thoracic vertebrae plus up to six lumbar vertebrae, including the irregular sixth lumbar vertebra. Reference segmentations of the visible vertebrae were generated by manually correcting automatic segmentations.
Convolutional neural networks are widely used to solve segmentation tasks with supervised learning. Recent works have shown that this technique can be successfully applied to multi-organ segmentation in 3D CT images [11]. In our method, we develop the Semantic Segmentation Net, which segments the cervical, thoracic, and lumbar vertebrae from 3D CT images to find their bounding boxes. Figure 4 shows a schematic drawing of the architecture, which is based on a 3D FCN [11]. For our Semantic Segmentation Net, the convolutions performed in each stage use volumetric kernels of size 3 × 3 × 3.

Fig. 4.
Architecture of the Semantic Segmentation Net.
Data augmentation and training
In the preprocessing steps, the intensities of the input 3D CT images are clipped to the [-512.0, 1024.0] range and then normalized to the [-1.0, 1.0] interval. After that, the input 3D CT images are rescaled to 1.0 mm isotropic voxels. For each training iteration, we randomly crop a 160-voxel-wide patch from the input volume and apply data augmentation, including Gaussian noise with µ = 0.0 and σ sampled from [0.0, 50.0/1536.0]. During training, a bootstrapped cross-entropy loss [6] is optimized with the Adam optimizer [7] with a learning rate of 0.001, since the multi-class Dice loss can be unstable. The idea behind bootstrapping [6] is to backpropagate the cross-entropy loss not from all voxels but from a subset of voxels whose posterior probabilities are less than a threshold. In our experiment, the hardest 10% of the total voxels are used for backpropagation.

Fig. 5.
Architecture of the Instance Segmentation Net.
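The first-stage preprocessing and bootstrapped loss described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the intensity mapping to [-1.0, 1.0] is assumed to be linear, the voxel grid is flattened to one row per voxel, and the function names are hypothetical.

```python
import numpy as np

def clip_and_normalize(hu, lo=-512.0, hi=1024.0):
    """Clip CT intensities to [lo, hi] and map them linearly to [-1, 1].
    (Rescaling to 1.0 mm isotropic voxels would follow as a separate step.)"""
    hu = np.clip(hu, lo, hi)
    return 2.0 * (hu - lo) / (hi - lo) - 1.0

def bootstrapped_cross_entropy(probs, labels, keep_fraction=0.1):
    """Cross entropy averaged over only the hardest voxels [6].

    probs:  (N, C) softmax posteriors, one row per voxel.
    labels: (N,) integer class index per voxel.
    Only the keep_fraction of voxels with the lowest posterior probability
    for their true class contribute to the loss (10% in the paper).
    """
    n = probs.shape[0]
    p_true = probs[np.arange(n), labels]   # posterior of each voxel's true class
    k = max(1, int(keep_fraction * n))     # number of voxels kept
    hardest = np.argsort(p_true)[:k]       # lowest-confidence voxels
    return float(-np.mean(np.log(p_true[hardest] + 1e-12)))
```

Selecting only the low-confidence voxels keeps the gradient focused on hard examples, which is the motivation the paper cites for preferring this loss over a multi-class Dice loss in the first stage.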
The goal of the second stage is to segment, localize, and assign anatomical labels to each vertebra. To this end, we developed the Iterative Instance Segmentation Net, inspired by Lessmann et al. [10]. The input to the Iterative Instance Segmentation Net has an auxiliary channel in addition to the 3D CT images: given the already segmented vertebra regions in the auxiliary channel, the network outputs the next vertebra. The method by Lessmann et al. [10] requires the fifth lumbar vertebra region as a priori knowledge, and therefore it is not able to handle general 3D CT images. By contrast, by using the segmentation results of the first stage, our method is able to handle general 3D CT images.

Figure 5 shows a schematic drawing of the architecture. For our Iterative Instance Segmentation Net, the convolutions performed in each stage use volumetric kernels of size 3 × 3 × 3, each followed by batch normalization [3] and a ReLU activation function, and the max pooling uses volumetric kernels of size 2 × 2 × 2.

Data augmentation and training
In the preprocessing steps, similar to the first stage, the intensities of the input 3D CT images are clipped to the [-512.0, 1024.0] range and then normalized to the [-1.0, 1.0] interval. After that, the input 3D CT images are rescaled to 1.0 mm isotropic voxels. For each training iteration, we randomly crop the spine region from the input 3D CT images and apply data augmentation. In particular, we apply an affine transformation consisting of a random rotation between -15 and +15 degrees and random scaling between -20% and +20%, both sampled from uniform distributions. In addition, we apply Gaussian noise with µ = 0.0 and σ sampled from [0.0, 50.0/1536.0]. During training, the Dice loss of the segmented volume is optimized with the Adam optimizer [7] with a learning rate of 0.001.

We present two sets of experimental results: the first on vertebral segmentation, and the second on vertebral localization and identification. We validate our algorithm with two public datasets: 15 3D CT images with reference segmentations from the MICCAI CSI (Computational Spine Imaging) 2014 workshop challenge, and 302 3D CT images of patients with various types of pathologies introduced in [1]. The second dataset contains unusual appearances such as abnormal spine curvature and metal artifacts; in addition, the FOV of each volume varies widely.
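The Dice loss used for the second-stage training can be sketched in a minimal binary form; this formulation, including the smoothing term, is an assumption of the sketch rather than the paper's exact definition.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for a binary segmentation (illustrative sketch).

    pred:   predicted foreground probabilities in [0, 1].
    target: binary ground-truth mask.
    Returns 1 - Dice coefficient, so a perfect overlap gives ~0.
    """
    inter = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)
```

Unlike the first-stage multi-class setting, the second stage predicts a single vertebra at a time, which is why a binary Dice loss is stable here.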
We evaluated our method in terms of segmentation accuracy on the MICCAI CSI 2014 workshop challenge. The CSI dataset consists of 15 3D CT images of healthy young adults, aged 20-34 years. The images were scanned with either a Philips iCT 256-slice CT scanner or a Siemens Sensation 64-slice CT scanner (120 kVp, with IV contrast). The in-plane resolution ranges from 0.31 mm to 0.36 mm, and the slice thickness ranges from 0.7 mm to 1.0 mm. Each volume covers the thoracic and lumbar vertebrae. We evaluate segmentation performance using the Average Symmetric Surface Distance (ASSD), the Hausdorff Distance (HD), and the Dice score, after the final segmentation masks are rescaled to the resolution of the input 3D CT images. The results on the CSI dataset are summarized in Table 1. Our method achieved slightly better performance than existing methods. Examples of the segmentations and anatomical labels obtained with our method are shown in Figure 6. In all 15 3D CT images, the Semantic Segmentation Net provided the Iterative Instance Segmentation Net with accurate bounding boxes. Moreover, the Iterative Instance Segmentation Net segmented the vertebrae precisely and predicted all of the anatomical labels.
Table 1.
Comparison of Dice scores, ASSD, and HD for segmentation results.

Method                Dice score (%)  ASSD (mm)  HD (mm)
Janssens et al. [4]   95.7            0.37       4.32
Lessmann et al. [10]  94.9            0.19       -
Our method            96
Fig. 6. Segmentation results and predicted anatomical labels obtained with the proposed method.
We evaluate localization and identification performance on the 302 3D CT images introduced in [1]. This dataset is challenging since it includes a wide variety of anomalies such as abnormal spine curvature and metal artifacts; furthermore, the FOV of each volume differs largely. In this dataset, the reference centroid coordinates of the vertebrae and the anatomical labels were given by clinical experts. We evaluate our method with the two metrics described in [2]: the Euclidean localization error (in mm) and the identification rate (Id.Rate) defined in [1]. When calculating these metrics, the final segmentation masks are rescaled to the resolution of the input 3D CT images. Table 2 shows a comparison between our method and previous works [12, 13]. The mean localization error is 8.3 mm, and the mean identification rate is 84%. Our method achieved better performance than the other existing methods.
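The two metrics can be sketched as follows. Note that the identification rate defined in [1] additionally requires the predicted centroid to be the one closest to the reference; this sketch uses only the 20 mm distance test as a simplification, and the dictionary-based interface is an assumption.

```python
import numpy as np

def localization_metrics(pred, ref, tol=20.0):
    """Mean Euclidean localization error (mm) and identification rate.

    pred, ref: dicts mapping anatomical labels to centroid coordinates in mm.
    A vertebra counts as identified when its predicted centroid lies within
    `tol` mm of the reference centroid (simplified from the definition in [1]).
    """
    errors, correct = [], 0
    for label, c_ref in ref.items():
        if label not in pred:
            continue                          # missed vertebra: no error recorded
        d = float(np.linalg.norm(np.asarray(pred[label]) - np.asarray(c_ref)))
        errors.append(d)
        if d <= tol:
            correct += 1
    id_rate = correct / len(ref) if ref else 0.0
    return float(np.mean(errors)), id_rate
```

For example, a prediction 3 mm from its reference counts as identified, while one 30 mm away contributes to the localization error but not the identification rate.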
Table 2.
Comparison of localization errors in mm and identification rates.

Region    Method                    Mean   Std    Id.rates
All       Glocker et al. [1]        12.4   11.2   70%
          Suzani et al. [12]        18.2   11.4   -
          Yang et al. [13]          9.1
          Our method                8.3           84%
Cervical  Glocker et al. [1]        7.0    4.7    80%
          Suzani et al. [12]        17.1   8.7    -
          Yang et al. [13]          6.6    3.9    83%
          Yang et al. [13] (+1000)  5.8    3.9    88%
          Our method
Thoracic  Glocker et al. [1]        13.8   11.8   62%
          Suzani et al. [12]        17.2   11.8   -
          Yang et al. [13]          9.9    7.5    74%
          Yang et al. [13] (+1000)  9.5    8.5    78%
          Our method
Lumbar    Glocker et al. [1]        14.3   12.3   75%
          Suzani et al. [12]        20.3   12.2   -
          Yang et al. [13]          10.9   9.1    80%
          Yang et al. [13] (+1000)  9.9    9.1    84%
          Our method
In this paper, we proposed a multi-stage framework for segmentation, localization, and identification of vertebrae in 3D CT images. A novelty of this framework is to divide the three tasks into two stages: the first stage is multi-class segmentation of the cervical, thoracic, and lumbar vertebrae, and the second stage is iterative instance segmentation of individual vertebrae. By doing this, the method works without a priori knowledge of which part of the anatomy is visible in the 3D CT images, which means that it can be applied to a wide range of 3D CT images and applications. In the experiments on two public datasets, the method achieved the best Dice score for volume segmentation, as well as the best mean localization error and identification rate. As far as we know, this is the first unified framework that tackles the three tasks simultaneously with state-of-the-art performance. We hope that the proposed method will help doctors in clinical practice.
References
1. Glocker, B., et al.: Automatic localization and identification of vertebrae in arbitrary field-of-view CT scans. In: Medical Image Computing and Computer-Assisted Intervention, pp. 590–598. Springer Berlin Heidelberg (2012)
2. Glocker, B., et al.: Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In: Medical Image Computing and Computer-Assisted Intervention, pp. 262–270. Springer Berlin Heidelberg (2013)
3. Ioffe, S., et al.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, vol. 37, pp. 448–456 (2015)
4. Janssens, R., et al.: Fully automatic segmentation of lumbar vertebrae from CT images using cascaded 3D fully convolutional networks. In: IEEE 15th International Symposium on Biomedical Imaging, pp. 893–897 (2018)
5. Yao, J., et al.: A multi-center milestone study of clinical vertebral CT segmentation. Computerized Medical Imaging and Graphics 49, 16–28 (2016)
6. Keshwani, D., et al.: Computation of total kidney volume from CT images in autosomal dominant polycystic kidney disease using multi-task 3D convolutional neural networks. In: Medical Image Computing and Computer Assisted Intervention, Machine Learning in Medical Imaging workshop, vol. 66, pp. 90–99 (2018)
7. Kingma, D.P., et al.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Klinder, T., et al.: Automated model-based vertebra detection, identification, and segmentation in CT images. Medical Image Analysis 13, 471–482 (2009)
9. Korez, R., et al.: A framework for automated spine and vertebrae interpolation-based detection and model-based segmentation. In: