A fully automated method for 3D individual tooth identification and segmentation in dental CBCT
Tae Jun Jang, Kang Cheol Kim, Hyun Cheol Cho, and Jin Keun Seo
Abstract—Accurate and automatic segmentation of three-dimensional (3D) individual teeth from cone-beam computerized tomography (CBCT) images is a challenging problem because of the difficulty in separating an individual tooth from adjacent teeth and its surrounding alveolar bone. Thus, this paper proposes a fully automated method of identifying and segmenting 3D individual teeth from dental CBCT images. The proposed method addresses the aforementioned difficulty by developing a deep learning-based hierarchical multi-step model. First, it automatically generates upper and lower jaw panoramic images to overcome the computational complexity caused by high-dimensional data and the curse of dimensionality associated with a limited training dataset. The obtained 2D panoramic images are then used to identify 2D individual teeth and capture loose and tight regions of interest (ROIs) of 3D individual teeth. Finally, accurate 3D individual tooth segmentation is achieved using both loose and tight ROIs. Experimental results showed that the proposed method achieved an F1-score of 93.35% for tooth identification and a Dice similarity coefficient of 94.79% for individual 3D tooth segmentation. The results demonstrate that the proposed method provides an effective clinical and practical framework for digital dentistry.
Index Terms—Cone-beam computerized tomography, digital dentistry, tooth segmentation, tooth identification, deep learning
I. INTRODUCTION
Digital dentistry is evolving rapidly along with the rapid innovation of artificial intelligence and the development of cone-beam computerized tomography (CBCT), intra-oral and face scanners, and dental three-dimensional (3D) printing. Digital dentistry enhances a dentist's efficiency and improves the accuracy of orthodontic diagnoses, treatment planning, and surgical guides. A fundamental component of digital dentistry is the 3D segmentation of teeth, jaws, and skulls from CBCT images. Moreover, accurate digital models of individual tooth geometry and jaws facilitate the simulation of prosthetic evaluation, cephalometric analysis, computer-aided digital implant planning, and bite irregularity prediction.

Automatic and accurate 3D individual tooth segmentation from CBCT images is a difficult task for the following reasons: (i) similar intensities between tooth roots and the surrounding alveolar bone; and (ii) attached boundaries between adjacent teeth in the crown parts.

Over the last decade, there have been several attempts to develop 3D tooth segmentation methods, most of which are based on level set methods [1]–[5].
The authors are with the School of Mathematics and Computing (Computational Science and Engineering), Yonsei University, Seoul, 03722. E-mail: [email protected] (corresponding author)

Unfortunately, level set-based methods have fundamental limitations in achieving fully automated segmentation. This difficulty arises from the dependence of such methods on the initialization of the level set, and automatic initialization is hindered by the complex image structure associated with adjacent teeth, the jaw, the alveolar bone, etc. Hence, user intervention through manual initialization is inevitable in this approach. Similarly, graph cut-based methods [6], [7] require manual intervention because their results are also affected by initialization. An anatomy-driven (or template-based) method [8] was proposed to automatically model the overall 3D tooth shape through a B-spline representation. The disadvantage of this approach is its convex hull property, which causes inaccurate 3D tooth segmentation. In particular, the approach is vulnerable to topological changes of molar teeth in transverse computerized tomography (CT) slices along the longitudinal axis.

Recently, deep learning methods have been applied to 3D tooth segmentation. Lee et al. [9] and Rao et al. [10] used a fully convolutional network (FCN) [11] for whole tooth segmentation instead of individual tooth segmentation. Chen et al. [12] attempted to provide individual tooth segmentation using a marker-controlled watershed transform on the tooth area predicted by an FCN. The disadvantage of this approach is that a tooth may be broken into several fractions, causing the watershed transform to assign several labels to an individual tooth. This drawback is a major obstacle to providing accurate and robust segmentation. Cui et al. [13] proposed a deep learning framework for individual tooth segmentation and identification using Mask R-CNN [14]. The limitation of these deep learning methods is the patch-based approach used to handle high-dimensional inputs (e.g., hundreds of voxels along each axis of a 3D CBCT image) and a limited amount of labeled samples. It is necessary to use both local and global information to achieve accurate segmentation with individual tooth identification. Thus, the drawback of this patch-based approach is its inability to reflect contextual (global) information, since each output of a convolutional network depends only on the corresponding patch.

Automatic individual tooth identification is also a difficult task. Recently, several individual tooth identification attempts [15], [16] have been made using convolutional networks. However, these approaches suffer from misclassification errors caused by similarities between adjacent teeth.

Existing 3D tooth segmentation methods may not be effective for CBCT images that are severely corrupted by metal artifacts. In a clinical dental CBCT environment (e.g., low dose radiation exposure), metal artifacts become common as the number of aged patients with metallic prostheses increases. Hence, it would be desirable to develop a method that works well even in images degraded by metal artifacts.
Fig. 1: Schematic diagram of the proposed method, which consists of four steps: 1) panoramic image reconstruction of the upper and lower jaws from a 3D CBCT image; 2) tooth identification and 2D segmentation of individual teeth in the panoramic images; 3) extraction of loose and tight 3D tooth ROIs using the detected bounding boxes and segmented tooth regions; and 4) 3D segmentation of individual teeth from the 3D tooth ROIs.
Fig. 2: Fédération Dentaire Internationale (FDI) dental notation using a two-digit numbering system. The first digit (quadrant code) represents a quadrant of teeth, and the second digit (tooth code) represents the order of the tooth from the central incisor in a quadrant.

This paper aimed to address these limitations by developing a hierarchical multi-step deep learning model. The proposed method is summarized as follows. The first step is to circumvent the high-dimensionality problem associated with CT images. This step automatically generates panoramic images of the upper and lower jaws from CT images, where their size is smaller than that of the original CT image. The panoramic images of the upper and lower jaws are separated to reduce overlaps between adjacent teeth. Notably, panoramic images generated from CBCT images are not significantly affected by metal-related artifacts. We utilize these panoramic images to accurately perform 2D tooth detection, identification, and segmentation. The second step is to identify individual teeth by two-digit numbers relative to their quadrant and location, as shown in Fig. 2. We develop a tooth detection method that localizes bounding boxes enclosing each tooth and classifies them into four types according to tooth morphology. This method solves misclassification problems caused by similar adjacent teeth. The individual teeth are then identified using the results of tooth detection. Additionally, we perform 2D segmentation for individual teeth. The third step extracts loose and tight 3D tooth regions of interest (ROIs) from the 2D detected boxes and segmented tooth regions for accurate 3D individual tooth segmentation in the final step. Tight ROIs improve the segmentation accuracy. A schematic diagram of our method is shown in Fig. 1.

The rest of this paper is organized as follows. The details of the proposed method are described in Section II. Implementation details and experimental results are provided in Section III. Finally, Section IV presents the discussion and conclusion of this paper.

II. METHOD
Let X denote a 3D CT image with voxel grid Ω := {(x, y, z) ∈ ℕ³ : 1 ≤ x ≤ N_x, 1 ≤ y ≤ N_y, 1 ≤ z ≤ N_z}, where N_x, N_y, and N_z are the numbers of voxels in the directions x (sagittal axis), y (frontal axis), and z (longitudinal axis), respectively. The CT images used in this work have fixed voxel dimensions; the acquisition settings are given in Section III-A. The value X(x, y, z) at the voxel position (x, y, z) represents the attenuation coefficient.

Fig. 3: Workflow of Step 1, showing the reconstruction process of the upper jaw panoramic image P_{X_upper} and the lower jaw panoramic image P_{X_lower} from a 3D CT image X: [1-1] thresholding of X into a binarized bone image X̃; [1-2] connected component labeling (CCL), in which the largest and the second largest components give the binarized upper jaw X̃_upper and lower jaw X̃_lower, and elementwise products with X give the upper jaw X_upper and lower jaw X_lower; [1-3] maximum intensity projection (MIP) images M_{X_upper} and M_{X_lower}; [1-4] thresholding and closing to obtain the upper and lower dental arches A_{X_upper} and A_{X_lower}; [1-5] skeletonization and curve processing to obtain the reference curves C_{X_upper} and C_{X_lower}; and [1-6] projection along the reference curves to obtain the panoramic images P_{X_upper} and P_{X_lower}.

A. Step 1: Panoramic image reconstruction of the upper and lower jaws from a 3D CBCT image

This step describes the automatic reconstruction of panoramic images of the upper and lower jaws from a 3D CBCT image X. Fig. 3 illustrates the workflow.

[Step 1-1] To obtain a binarized bone image X̃, the 3D CT image X is segmented into three classes (air, soft tissue, and bone) using a multi-level version of Otsu's method [17]. The threshold values T₁ and T₂ for the histogram h(t) corresponding to X are determined by maximizing the between-class variance:

{T₁, T₂} = argmax_{t₁,t₂} [ ω₀(μ₀ − μ)² + ω₁(μ₁ − μ)² + ω₂(μ₂ − μ)² ], (1)

where ω_k and μ_k denote the probability and the mean intensity of the k-th class induced by the candidate thresholds t₁ < t₂, and μ is the mean intensity of the whole histogram h(t). Voxels above the bone threshold form X̃. The subsequent sub-steps [1-2] through [1-6] (connected component labeling, maximum intensity projection, dental arch extraction, reference curve generation, and projection) are summarized in Fig. 3.
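As a concrete illustration of Step 1-1, the following minimal Python sketch produces a binarized bone volume from a NumPy CT array; it uses scikit-image's multi-Otsu implementation, which maximizes the same between-class variance criterion as (1) (the variable names are ours, not the paper's):

```python
import numpy as np
from skimage.filters import threshold_multiotsu

def binarize_bone(ct: np.ndarray) -> np.ndarray:
    """Segment a CT volume into air/soft tissue/bone and keep the bone class."""
    # Two thresholds T1 < T2 separating the three intensity classes.
    t1, t2 = threshold_multiotsu(ct, classes=3)
    # Voxels brighter than the upper threshold are treated as bone.
    return (ct > t2).astype(np.uint8)
```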
B. Step 2: Tooth identification and 2D segmentation of individual teeth in the panoramic images

This step aims to identify and segment individual teeth in the reconstructed panoramic images. To achieve this goal, we first perform individual tooth detection. Here, the teeth are classified as incisor (class 1), canine (class 2), premolar (class 3), and molar (class 4).

[Step 2-1] To detect individual teeth in a panoramic image, we develop a deep learning method inspired by one-stage object detection [21], [22]. Given a panoramic image P of size N_s × N_z (e.g., N_s × N_z = 640 × 320), a uniform grid is created. Each grid cell G_ij is defined as follows:

G_ij = {(s, z) ∈ ℕ² : g(i − 1) < s ≤ gi, g(j − 1) < z ≤ gj}, (5)

where g is the side length of the grid cells. We then learn a tooth detection map f_det : P ↦ Y given by

f_det(P) = (Y_ij), 1 ≤ i ≤ N_s/g, 1 ≤ j ≤ N_z/g, (6)

where Y_ij = (c_ij, b_ij, p_ij) predicts a confidence score c_ij, a bounding box component b_ij, and a class probability p_ij in G_ij, as illustrated in Fig. 5. The confidence score c_ij ∈ [0, 1] represents the existence of a tooth center in G_ij. A bounding box component is denoted by

b_ij = (s_ij, z_ij, w_ij, h_ij), (7)

where (s_ij, z_ij) is the center of the bounding box in G_ij and (w_ij, h_ij) indicates its width and height. For a tooth in the bounding box corresponding to b_ij, we estimate a class probability

p_ij = (p_{ij,1}, p_{ij,2}, p_{ij,3}, p_{ij,4}), (8)

where p_{ij,k} represents the probability of the tooth being class k.

Fig. 5: Concept of Step 2-1. The detection map f_det predicts Y_ij = (c_ij, b_ij, p_ij) for each grid cell G_ij. The center position (s_ij, z_ij) in G_ij, the width w_ij and height h_ij of the bounding box, and the tooth class (e.g., p_ij: molar) are read from Y_ij wherever the confidence score c_ij has a high value.

To find the exact bounding boxes among the boxes predicted for all G_ij, we remove the boxes with scores e_ij = c_ij · (max_k p_{ij,k}) less than 0.5. Several bounding boxes with high scores may appear near the center of a tooth. We adopt the non-maximum suppression (NMS) technique to filter out bounding boxes that highly overlap higher-scoring boxes.

Using a labeled training dataset {(P^(n), Y*^(n))}_{n=1}^{N}, where Y* is the ground truth, f_det is learned by minimizing the loss between the output Y = f_det(P) and the ground truth Y*:

Σ_{n=1}^{N} [ L_obj(Y^(n), Y*^(n)) + λ_noobj L_noobj(Y^(n), Y*^(n)) + λ_box L_box(Y^(n), Y*^(n)) + L_cls(Y^(n), Y*^(n)) ], (9)

where

L_obj(Y, Y*) := Σ_{(i,j) : c*_ij = 1} (1 − c_ij)², (10)

L_noobj(Y, Y*) := Σ_{(i,j) : c*_ij = 0} (0 − c_ij)², (11)

L_box(Y, Y*) := Σ_{(i,j) : c*_ij = 1} |b*_ij − b_ij|², (12)

L_cls(Y, Y*) := Σ_{(i,j) : c*_ij = 1} CrossEntropy(p*_ij, p_ij). (13)

L_obj, L_box, and L_cls represent the prediction errors of the confidence scores, bounding box components, and class probabilities where objects exist (c*_ij = 1), respectively. L_noobj is related to the confidence scores where no objects exist (c*_ij = 0). Since there is no object in most grid cells, the confidence scores tend to be predicted as zero [21]. To eliminate this imbalance, L_noobj and L_box are weighted by the constants λ_noobj = 0.5 and λ_box = 5, respectively.

Here, for stable learning of the bounding box regression [23], b_ij is replaced by b̂_ij = (ŝ_ij, ẑ_ij, ŵ_ij, ĥ_ij), which satisfies the following conditions:

s_ij = g(ŝ_ij + i − 1),  z_ij = g(ẑ_ij + j − 1),  w_ij = a_w exp(ŵ_ij),  h_ij = a_h exp(ĥ_ij), (14)

where a_w and a_h are the width and height of an anchor box, respectively. We set the size of the anchor box to the mean size of the ground truth bounding boxes.
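For concreteness, the detection loss (9)-(13) can be written as a short PyTorch function. The tensor layout below (batch × 9 channels × grid height × grid width, with channel 0 the confidence, channels 1-4 the box components, and channels 5-8 the class scores) is our assumption for illustration, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def detection_loss(y_pred, y_true, lambda_noobj=0.5, lambda_box=5.0):
    """YOLO-style loss of (9)-(13) on (B, 9, H, W) prediction/target tensors."""
    c_pred, c_true = y_pred[:, 0], y_true[:, 0]
    obj = (c_true == 1).float()      # cells containing a tooth center
    noobj = (c_true == 0).float()    # background cells

    l_obj = (((1.0 - c_pred) ** 2) * obj).sum()                 # (10)
    l_noobj = ((c_pred ** 2) * noobj).sum()                     # (11)

    b_pred, b_true = y_pred[:, 1:5], y_true[:, 1:5]
    l_box = (((b_true - b_pred) ** 2).sum(dim=1) * obj).sum()   # (12)

    # cross entropy against one-hot class targets, only where objects exist
    logp = F.log_softmax(y_pred[:, 5:9], dim=1)
    l_cls = ((-(y_true[:, 5:9] * logp).sum(dim=1)) * obj).sum() # (13)

    return l_obj + lambda_noobj * l_noobj + lambda_box * l_box + l_cls  # (9)
```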
[Step 2-2] For each tooth in a detected bounding box, a number is assigned to identify the unique tooth according to the FDI system. For convenience, we first explain how the numbers are assigned to teeth in the upper jaw. As illustrated in Fig. 6, the detected bounding boxes are listed in ascending order of the s coordinates of the box centers. The upper right and upper left quadrants are divided at the middle of the four sequential incisor boxes. For the two right incisors and the two left incisors, numbers 1 and 2 are assigned from the inside to the outside, respectively. Number 3 is assigned to the canines, since there is only one in each quadrant. On each side, premolars are assigned numbers 4 and 5 from the inside to the outside. Likewise, molars are assigned numbers 6, 7, and 8 (if a wisdom tooth exists); a sketch of this numbering rule is given after Fig. 6.

Fig. 6: Tooth identification process using the classification results of Step 2-1. The capital letters represent the first letters of the tooth types, and the numbers are tooth codes.
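The quadrant splitting and numbering of Step 2-2 amount to a simple rule-based procedure. The following Python sketch is our illustrative reconstruction for one jaw with a full dentition (the detection format and function names are hypothetical):

```python
# Each detection is a (s_center, tooth_type) pair, with type in {"I", "C", "P", "M"}.
# Tooth codes per type, listed from the inside (midline) to the outside.
CODES = {"I": [1, 2], "C": [3], "P": [4, 5], "M": [6, 7, 8]}

def number_outward(teeth):
    """Assign tooth codes to teeth listed from the midline toward the back."""
    counters = {t: 0 for t in CODES}
    codes = []
    for _, tooth_type in teeth:
        codes.append(CODES[tooth_type][counters[tooth_type]])
        counters[tooth_type] += 1
    return codes

def assign_tooth_codes(detections):
    boxes = sorted(detections)                 # ascending s coordinate of box centers
    incisors = [i for i, (_, t) in enumerate(boxes) if t == "I"]
    split = incisors[1] + 1                    # midline between the four incisors
    right, left = boxes[:split], boxes[split:]
    # the right quadrant is traversed outward by reversing its order
    right_codes = list(reversed(number_outward(list(reversed(right)))))
    return right_codes, number_outward(left)
```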
[Step 2-3] The proposed 2D tooth segmentation method uses a U-shaped FCN [24], taking advantage of the bounding box information obtained in Step 2-1. Let S ∈ ℝ^{N_s × N_z} denote the segmentation image for a tooth corresponding to a detected bounding box in P. We construct a training dataset {(I_roi^(n), Y_roi^(n))}_{n=1}^{N} for individual tooth segmentation, where I_roi^(n) and Y_roi^(n) are tooth images of P and S cropped by the bounding boxes. A segmentation map f_seg : I_roi ↦ Y_roi is learned using a U-shaped network by minimizing the following loss:

L_seg = (1/N) Σ_{n=1}^{N} [ − Σ_{x=1}^{M} Y_roi^(n)(x) log f_seg(I_roi^(n))(x) ], (15)

where x is a pixel position and M is the number of pixels of Y_roi.

C. Step 3: Extraction of loose and tight 3D tooth ROIs using the detected bounding boxes and segmented tooth regions

In this step, 3D tooth ROIs are obtained using the results of the previous steps. As described in Fig. 7, a bounding box containing one tooth is projected back into the 3D CBCT image using (3) and (4). A loose ROI domain of the target tooth is then given by

D_box = {(r(s) + t n(s), z) : −α ≤ t ≤ α, (s, z) ∈ B_box}, (16)

where r(s) is a point on the reference curve, n(s) is the unit normal at r(s), and B_box is the set of pixel positions in the bounding box. Similarly, a tight ROI domain D_seg is determined by B_seg, which is the set of pixel positions of the 2D segmented tooth region within the box.

The loose 3D tooth ROI R_box is obtained by changing the voxel values outside D_box to 0 and extracting the content of a 3D bounding box that fits closely around D_box, as shown in Fig. 7. Similarly, we obtain the tight 3D tooth ROI R_seg by processing D_seg instead of D_box and using the same 3D bounding box as above.

Fig. 7: Extraction of loose and tight 3D tooth ROIs. A loose ROI domain (green dotted line in X) is determined by the domain of projection between points (blue stars in X) on the reference curve corresponding to points (blue stars in P) on the bounding box. A 3D bounding box is then obtained by closely fitting the loose ROI domain. The loose 3D tooth ROI R_box is extracted by cropping the CT image X with the 3D bounding box and by changing the values of voxels outside the loose ROI domain to 0. Similarly, the tight 3D tooth ROI R_seg is obtained from the 2D segmented tooth region.
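A minimal NumPy sketch of the loose ROI masking in (16) is given below; it assumes the reference curve samples r(s) and unit normals n(s) from Step 1-5 are available as arrays and rounds each back-projected point to the nearest voxel (array names and the sampling density are our choices):

```python
import numpy as np

def loose_roi_mask(ct_shape, r, n, box_pixels, alpha, n_offsets=64):
    """Binary mask of the loose ROI domain D_box of (16).

    ct_shape: (Nx, Ny, Nz) shape of the CT volume.
    r, n: (Ns, 2) arrays of curve points r(s) and unit normals n(s).
    box_pixels: iterable of (s, z) pixel positions inside the 2D bounding box.
    alpha: half-thickness of the slab around the reference curve, in voxels.
    """
    mask = np.zeros(ct_shape, dtype=bool)
    for t in np.linspace(-alpha, alpha, n_offsets):   # sample -alpha <= t <= alpha
        for s, z in box_pixels:
            x, y = np.rint(r[s] + t * n[s]).astype(int)   # r(s) + t n(s)
            if 0 <= x < ct_shape[0] and 0 <= y < ct_shape[1] and 0 <= z < ct_shape[2]:
                mask[x, y, z] = True
    return mask
```

The loose ROI R_box is then the masked volume cropped to the tight 3D bounding box of the mask; replacing box_pixels by the segmented pixel set B_seg yields the tight ROI R_seg.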
D. Step 4: 3D segmentation of individual teeth from the 3D tooth ROIs

In this final step, 3D individual tooth segmentation is performed by applying the loose ROI (R_box) and the tight ROI (R_seg) to a 3D version of the U-shaped FCN [24]. The tight ROI is crucial for improving the segmentation accuracy at the attached boundaries between a target tooth and its neighboring teeth.

The input of the network is I_roi3 = R_box ⊕ R_seg, the channelwise concatenation of the two ROIs. Let Y_roi3 denote a binary vector representing the 3D tooth segmentation corresponding to I_roi3. Using a training dataset {(I_roi3^(n), Y_roi3^(n))}_{n=1}^{N}, we learn a 3D segmentation map f_seg3 : I_roi3 ↦ Y_roi3 by minimizing the following loss:

L_seg3 = (1/N) Σ_{n=1}^{N} [ − Σ_{v=1}^{V} Y_roi3^(n)(v) log f_seg3(I_roi3^(n))(v) ], (17)

where v is a voxel position and V is the number of voxels of Y_roi3.
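The two-channel input construction and the voxelwise cross-entropy of (17) can be sketched in PyTorch as follows (a minimal illustration; net3d stands in for the 3D U-shaped network of Appendix B):

```python
import torch
import torch.nn.functional as F

def tooth_3d_loss(net3d, r_box, r_seg, y_true):
    """Voxelwise cross-entropy of (17) on the 2-channel ROI input.

    r_box, r_seg: loose/tight ROI volumes of shape (B, 1, D, H, W).
    y_true: integer (torch.long) label volume of shape (B, D, H, W), values {0, 1}.
    """
    x = torch.cat([r_box, r_seg], dim=1)    # I_roi3 = R_box (+) R_seg, 2 channels
    logits = net3d(x)                       # (B, 2, D, H, W) per-voxel class scores
    return F.cross_entropy(logits, y_true)  # mean of -log p(true class) over voxels
```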
III. EXPERIMENTS AND RESULTS

A. Dataset and implementation details

Experiments were conducted on 3D CT images produced by a dental CBCT scanner with a circular trajectory (DENTRI-X; HDXWILL, Seoul, South Korea) using a tube voltage of 90 kVp and a tube current of 10 mA. All available datasets were formatted in the Digital Imaging and Communications in Medicine (DICOM) standard as a series of 16-bit grayscale images (corresponding to individual CT slices). The row and column pixel spacing and the slice thickness of the original CBCT images were both 0.2 mm. During scanning, a bite block was used to prevent contact between the upper and lower teeth.

We received 97 dental 3D CBCT images from HDXWILL. Using these data, we generated 194 upper jaw and lower jaw panoramic images in Step 1. We also received labeled data consisting of 97 sets of 2D individual tooth segmentations, bounding box components, and tooth codes, as well as 11 sets of 3D individual tooth segmentations. The labeling was performed by experts at HDXWILL.

To reconstruct panoramic images from CT images, we applied (4) to the 3D CBCT images using the tomographic reconstruction software TIGRE [25]. The size of all panoramic images was fixed at 640 × 320. The width of the panoramic images was determined by 640 reference curve points. In Step 1-5, those points are obtained by interpolating 500 points on the smooth skeleton and by extrapolating 70 points at each end of the curve. To completely cover the teeth at both ends, we extrapolated 70 points (approximately 13.3 mm), taking into account the average size of the molars. The height of 320 was determined by removing 80 CT slices that do not contain teeth from the bottom.

For the 2D detection and segmentation in Steps 2-1 and 2-3, 66 CBCT datasets were used for training and 31 datasets for testing. Because two panoramic images (upper and lower parts) are obtained from each CBCT image through Step 1, we use 132 labeled training data and 62 test data. Meanwhile, for the 3D segmentation in Step 4, 7 CBCT datasets were used for training and 4 datasets for testing. Since each patient has approximately 28 to 32 teeth, each CBCT image can provide approximately 28 to 32 training data for individual tooth segmentation. To be precise, we use 216 training data and 112 test data for the 3D segmentation in Step 4. In Steps 2-3 and 4, the 2D tooth images and the 3D loose and tight ROIs were resized to fixed sizes before being fed to the networks.

B. Training of the proposed network

Training was implemented using PyTorch [26] on a CPU (Intel(R) Core(TM) i7-9700K, 3.60 GHz) and GPU (NVIDIA GTX-2070, 8 GB) system. In Steps 2 and 4, we trained three networks: tooth detection, 2D individual tooth segmentation, and 3D individual tooth segmentation. The proposed neural networks were trained by minimizing the losses in (9), (15), and (17) using the Adam optimizer [27]. The network architectures were determined using five-fold cross validation on the training dataset. We used batch normalization [28] to prevent overfitting. The training in Steps 2-1, 2-3, and 4 used batch sizes of 8, 32, and 4, respectively. We also used data augmentation techniques including random contrast and brightness adjustment, horizontal flips, and rotations between −8° and 8°.

C. Evaluation and results of the proposed method

1) Bounding box detection: For a quantitative evaluation of the bounding box detection, we provide four precision-recall (PR) curves [29] and their average precision (AP) [29], as shown in Fig. 8. When the intersection over union (IOU) threshold value was 0.6, the PR curve shows that the precision tends to stay high as the recall increases.

Fig. 8: Tooth detection results. A PR curve represents the change in precision as the recall increases for a fixed IOU threshold value, which is used for NMS. The AP is the average of the precision values on a PR curve.

2) Individual tooth identification: This subsection presents the performance evaluation of tooth identification. The precision, recall, and F1-score were used to evaluate the results of the identification method. In Step 2-1, teeth are initially classified into four types instead of directly predicting the eight tooth codes. The direct identification method can often misclassify teeth within the same tooth type. Fig. 9 summarizes the identification results for each tooth code. As shown in Fig. 9 (a), the direct method confuses first premolars (code 4) and wisdom teeth (code 8) in particular. These errors hinder the performance of the direct method. In contrast, the four type-based method achieves a high accuracy by preventing misclassification due to similar tooth shapes. Table I shows that the proposed method leads to statistically more accurate identification.

TABLE I: Quantitative evaluation of tooth identification methods (mean ± std).

Metric          Direct method    Proposed method
Precision (%)   … ± …            96.… ± …
Recall (%)      … ± …            90.… ± …
F1-score (%)    … ± …            93.35 ± …

Fig. 9: Confusion matrices for tooth identification. The vertical and horizontal axes represent true tooth codes and predicted tooth codes, respectively. The diagonal components represent the number of correct identifications. (a) Result of the direct method:

110   3   0   0   0   3   0   0
  3  96   0   0   7   0   0   0
  2   0 107   1   0   6   0   0
  0   1   0  96  23   0   0   0
  0   0   0   0  89   0   0   0
  0   0   1   0   0 119   2   0
  0   0   2   0   0   1 116   1
  0   0   0   0   0   0   9  52

(b) Result of the proposed method:

108   2   0   0   0   1   0   0
  3  97   0   2   0   0   0   0
  0   0 113   1   0   3   0   0
  0   0   0 114   3   0   0   0
  0   0   0   3  89   0   0   0
  0   0   0   0   0 117   3   0
  0   0   0   0   0   1 116   3
  0   0   0   0   0   0   1  56
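Per-class precision, recall, and F1-score follow directly from such a confusion matrix; the short NumPy sketch below restates the metric definitions used here (rows are true codes, columns are predicted codes):

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Precision, recall, and F1 per tooth code from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # correct / everything predicted as the code
    recall = tp / cm.sum(axis=1)      # correct / all true instances of the code
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, reading tooth code 8 from Fig. 9 (b): recall is 56/57 ≈ 98.2% and precision is 56/59 ≈ 94.9%.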
3) Individual tooth segmentation: In Steps 2-3 and 4, we performed the 2D and 3D individual tooth segmentation. To evaluate the segmentation performance, we used precision, recall, the Dice similarity coefficient (DSC) [30], the Hausdorff distance (HD) [31], and the average symmetric surface distance (ASSD) [32].

2D individual tooth segmentation. The proposed method proceeds in two steps, consisting of the bounding box detection in Step 2-1 and individual segmentation in Step 2-3, whereas Mask R-CNN [14] achieves the same task in a single step. We implemented both approaches, and the quantitative evaluation results are reported in Table II. The proposed method is numerically more accurate than Mask R-CNN. In our experiment, Mask R-CNN successfully detected teeth; however, there was a lack of segmentation detail at the edges of the teeth, as shown in Fig. 10.

The accuracy of the 2D segmentation is important because the key to precise 3D tooth segmentation is the use of tight 3D tooth ROIs obtained from the 2D segmentation. Although the detection and segmentation are not performed simultaneously, two simple convolutional networks (a one-stage object detector and a U-shaped FCN) are designed to achieve a high accuracy.

TABLE II: Quantitative evaluation of 2D tooth segmentation methods (mean ± std).

Metric          2D Mask R-CNN    Proposed method
Precision (%)   … ± …            96.… ± …
Recall (%)      … ± …            93.… ± …
DSC (%)         … ± …            94.… ± …
HD (mm)         … ± …            1.… ± …
ASSD (mm)       … ± …            0.… ± …

Fig. 10: Qualitative comparison of the proposed method and Mask R-CNN. Segmentation results of (a) the proposed method and (b) Mask R-CNN. Mask R-CNN is capable of tooth segmentation; however, there are some inaccuracies, mainly at the edges of the teeth.

3D individual tooth segmentation. We developed a fully automated multi-step method for 3D individual segmentation. To verify the effectiveness of the proposed method, we compared it with a patch-based Mask R-CNN [14] and ToothNet [13]. We also provide the segmentation results of the proposed method adopting either the loose ROI, the tight ROI, or both.

Individual tooth segmentation can be formulated as an instance segmentation problem. Mask R-CNN is the state-of-the-art deep learning framework for instance segmentation. However, Mask R-CNN cannot be applied directly to large 3D CBCT images because of the computational limit. For the comparison experiments, we implemented Mask R-CNN in a patch-based fashion [33], [34] as an alternative that avoids this limitation; ToothNet is based on Mask R-CNN and also adopts this approach. In the implementation, overlapping image patches of a fixed size were extracted with a fixed stride. The patch-based approach yields redundant results caused by overlapping local image patches, owing to the disconnected spatial relationship between adjacent patches. The integrated results are obtained by removing the overlapped segmentations, following ToothNet.

The patch-based 3D Mask R-CNN and ToothNet show lower segmentation performance in the quantitative evaluation, as shown in Table III. These methods perform individual segmentation from the original CBCT images. In contrast, the proposed method has the advantage of using loose and tight ROIs that exclude considerable background region in advance. In particular, the tight ROI excludes structures on the sides (e.g., adjacent teeth, jaw, etc.) of the target tooth. To evaluate the effectiveness of using both loose and tight ROIs, we performed experiments using either the loose ROI, the tight ROI, or both on the same 3D segmentation network. When using only tight ROIs, the recall is the lowest because a loss of tooth information may occur where the tight ROI boundary intersects the tooth boundary. Using only the loose ROI, which contains the tooth boundary, yields a higher recall. However, the HD tends to be high because there is no information on the tooth boundaries. A combination of the two ROIs enhances the segmentation performance, as the tight ROI provides detailed information on the target tooth and the loose ROI compensates for the disadvantage of the tight ROI.

The Wilcoxon signed-rank test [35] is used to assess the statistical significance of the differences between the proposed method and the other methods, as summarized in Table III.

TABLE III: Quantitative comparison of 3D tooth segmentation methods (mean ± std). For each competing method, the p-value of the Wilcoxon signed-rank test against the proposed method (loose & tight ROIs) is given in parentheses.

Metric          3D Mask R-CNN         ToothNet               Loose ROI              Tight ROI              Loose & tight ROIs
Precision (%)   … ± … (p < 0.001)     89.… ± … (p < 0.001)   94.… ± … (p < 0.001)   94.… ± … (p < 0.001)   95.… ± …
Recall (%)      … ± … (p < 0.01)      93.… ± … (p < 0.001)   92.… ± … (p < 0.001)   91.… ± … (p < 0.001)   93.… ± …
DSC (%)         … ± … (p < 0.001)     91.… ± … (p < 0.001)   93.… ± … (p < 0.001)   92.… ± … (p < 0.001)   94.79 ± …
HD (mm)         … ± … (p < 0.001)     2.… ± … (p < 0.001)    2.… ± … (p < 0.001)    1.… ± … (p < 0.001)    1.… ± …
ASSD (mm)       … ± … (p < 0.001)     0.… ± … (p < 0.001)    0.… ± … (p < 0.001)    0.… ± … (p < 0.001)    0.… ± …

Fig. 11: Qualitative comparison of 3D individual tooth segmentation in a CBCT image with metal artifacts. Segmentation results of (a) Mask R-CNN, (b) ToothNet, (c) the proposed method using loose ROIs, (d) the proposed method using tight ROIs, and (e) the proposed method using both loose and tight ROIs.
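The significance values reported in Table III come from a paired Wilcoxon signed-rank test; the comparison can be reproduced with SciPy as in the sketch below (the per-tooth DSC arrays here are synthetic stand-ins, not the paper's data):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic per-tooth DSC values for two methods evaluated on the same test teeth.
dsc_proposed = rng.normal(0.948, 0.02, size=112)
dsc_baseline = rng.normal(0.915, 0.03, size=112)

# Paired two-sided test on the per-tooth differences.
statistic, p_value = wilcoxon(dsc_proposed, dsc_baseline)
print(f"Wilcoxon statistic = {statistic:.1f}, p = {p_value:.3g}")
```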
D. Metal artifact-contaminated CBCT

The proposed method effectively handles problems caused by metal-related artifacts. Fig. 12 shows a CBCT image that is significantly contaminated by metal artifacts, whereas the artifacts are significantly reduced in the corresponding panoramic image generated from the CBCT image. The panoramic images from Step 1 allow 2D tooth detection and segmentation to be performed accurately. These results provide prior knowledge of each 3D tooth in the form of loose and tight ROIs. As shown in Fig. 12, the tight ROI excludes adjacent teeth even when the tooth boundaries are obscured by metal artifacts. Fig. 11 provides a qualitative evaluation of 3D tooth segmentation in a CBCT image with metal artifacts. As shown in Fig. 11 (d) and (e), the segmentation results in the degraded CT image are superior to those of Fig. 11 (a)-(c), as the tight ROIs provide robust boundary information. However, Fig. 11 (d) shows that using only the tight ROIs may not provide robust segmentation because they can cut off the edges of the teeth. Fig. 11 (e) thus illustrates the advantages and effectiveness of the proposed method using both loose and tight ROIs.

Fig. 12: (a) CBCT image affected by metal artifacts. (b) Panoramic image generated by Step 1, which is not significantly affected by metal artifacts. In (a), the green solid line represents the outline of the tight ROI obtained from the segmented region in (b). The tight ROI provides boundary information for the target tooth.

E. Special cases

This subsection shows the experimental results for abnormal data (e.g., missing teeth and implants), which are often encountered in real clinical practice.

1) Missing teeth: People often lose teeth due to factors such as cavities, periodontal disease, aging, dental trauma, and orthodontic treatment. To address these cases, we applied CBCT images with missing teeth (except for wisdom teeth) to our method. The four type-based classification was successful, but tooth identification was incomplete. With the exception of canines, each tooth quadrant contains two or more teeth of the same type. When a tooth is missing, it is difficult to number the remaining teeth of the same type from the panoramic image alone. Therefore, we suggest performing only classification in the case of a missing tooth, as shown in Fig. 13.

2) Implants: Implant surgery is a dental prosthetic treatment to replace a missing tooth. An implant appears as a screw in a panoramic image, which is a unique signature. Although data containing implants were totally unseen during model training, Fig. 14 shows successful bounding box detection and segmentation for implants. When applied to 3D segmentation, we also confirmed successful outcomes, as shown in Fig. 14. However, implants cannot be predicted within the original tooth classes since they have a different shape. Their codes can be inferred from the classification and identification of the neighboring teeth.

Fig. 13: Illustration of tooth identification when there is a missing tooth. Two premolars (class P) corresponding to number 4 are missing. It is not possible to determine whether the remaining premolar is number 4 or 5 from the panoramic image alone. Hence, we suggest only marking the tooth type.

Fig. 14: Results of three CT images with implants. Implants were successfully segmented in both the panoramic images and the 3D CT images, but the implants are classified into different classes.

IV. DISCUSSION AND CONCLUSION

In this paper, we developed a fully automated segmentation and identification method for individual teeth and jaws from CBCT images. Given CBCT data, the method automatically generates the maxillary and mandibular panoramic images that are projected along the reference curve representing a region-based shape feature of the dental arch. In the maxillary and mandibular panoramic images, 2D tooth segmentation and identification are performed using deep learning methods, which are vital for high-precision 3D tooth segmentation and identification. Experiments showed that the accuracy of the method is suitable for the clinical setting. Our method overcomes the limitations of existing automated methods by achieving full automation and improved accuracy. Additionally, the method addresses the difficulty of learning from high-dimensional data.

The main idea of the proposed method is the careful use of accurate and robust 2D tooth segmentation and identification in 2D panoramic images, in an indirect manner, to address the difficulty of 3D segmentation from metal artifact-contaminated 3D CBCT images. In a clinical dental CBCT environment (e.g., low dose radiation exposure), metal-related artifacts are common. The proposed method utilizes the crucial observation that metal artifacts are significantly reduced in the upper and lower panoramic images generated from the CBCT images.
The outcome of Step 2 serves as strong prior knowledge for 3D tooth segmentation, which plays an important role in separating teeth from 3D images in cases where the teeth are contacted, overlapped, or connected owing to metal-related artifacts.

The automated system proposed in this study improves the efficiency of dentists by reducing cumbersome and time-consuming manual intervention. The result provides an improved workflow for dentists to simulate pre-operative orthodontic treatment and manufacture implant surgical guides. Digital occlusion analysis is potentially possible by combining our method with intra-oral scan models [36], [37] via registration. Hence, it is expected to play an important role in digital dentistry.

ACKNOWLEDGEMENTS

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI20C0127). We would like to express our deepest gratitude to HDXWILL, which shared the dental CBCT images and ground-truth data.

REFERENCES

[1] H. Gao and O. Chae, "Individual tooth segmentation from CT images using level set method with shape and intensity prior," Pattern Recognition, vol. 43, no. 7, pp. 2406–2417, 2010.
[2] Y. Gan, Z. Xia, J. Xiong, Q. Zhao, Y. Hu, and J. Zhang, "Toward accurate tooth segmentation from computed tomography images using a hybrid level set model," Medical Physics, vol. 42, no. 1, pp. 14–27, 2015.
[3] H.-T. Yau, T.-J. Yang, and Y.-C. Chen, "Tooth model reconstruction based upon data fusion for orthodontic treatment simulation," Computers in Biology and Medicine, vol. 48, pp. 8–16, 2014.
[4] D. X. Ji, S. H. Ong, and K. W. C. Foong, "A level-set based approach for anterior teeth segmentation in cone beam computed tomography images," Computers in Biology and Medicine, vol. 50, pp. 116–128, 2014.
[5] Y. Wang, S. Liu, G. Wang, and Y. Liu, "Accurate tooth segmentation with improved hybrid active contour model," Physics in Medicine & Biology, vol. 64, no. 1, p. 015012, 2018.
[6] L. Hiew, S. Ong, K. W. Foong, and C. Weng, "Tooth segmentation from cone-beam CT using graph cut," in Proceedings of the Second APSIPA Annual Summit and Conference, 2010, pp. 272–275.
[7] J. Keustermans, D. Vandermeulen, and P. Suetens, "Integrating statistical shape models into a graph cut framework for tooth segmentation," in International Workshop on Machine Learning in Medical Imaging. Springer, 2012, pp. 242–249.
[8] S. Barone, A. Paoli, and A. V. Razionale, "CT segmentation of dental shapes by anatomy-driven reformation imaging and B-spline modelling," International Journal for Numerical Methods in Biomedical Engineering, vol. 32, no. 6, p. e02747, 2016.
[9] S. Lee, S. Woo, J. Yu, J. Seo, J. Lee, and C. Lee, "Automated CNN-based tooth segmentation in cone-beam CT for dental implant planning," IEEE Access, vol. 8, pp. 50507–50518, 2020.
[10] Y. Rao, Y. Wang, F. Meng, J. Pu, J. Sun, and Q. Wang, "A symmetric fully convolutional residual network with DCRF for accurate tooth segmentation," IEEE Access, vol. 8, pp. 92028–92038, 2020.
[11] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[12] Y. Chen, H. Du, Z. Yun, S. Yang, Z. Dai, L. Zhong, Q. Feng, and W. Yang, "Automatic segmentation of individual tooth in dental CBCT images from tooth surface map by a multi-task FCN," IEEE Access, 2020.
[13] Z. Cui, C. Li, and W. Wang, "ToothNet: Automatic tooth instance segmentation and identification from cone beam CT images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6368–6377.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[15] Y. Miki, C. Muramatsu, T. Hayashi, X. Zhou, T. Hara, A. Katsumata, and H. Fujita, "Classification of teeth in cone-beam CT using deep convolutional neural network," Computers in Biology and Medicine, vol. 80, pp. 24–29, 2017.
[16] D. V. Tuzoff, L. N. Tuzova, M. M. Bornstein, A. S. Krasnov, M. A. Kharchenko, S. I. Nikolenko, M. M. Sveshnikov, and G. B. Bednenko, "Tooth detection and numbering in panoramic radiographs using convolutional neural networks," Dentomaxillofacial Radiology, vol. 48, no. 4, p. 20180051, 2019.
[17] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[18] H. Samet and M. Tamminen, "Efficient component labeling of images of arbitrary dimension represented by linear bintrees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 4, pp. 579–586, 1988.
[19] R. M. Haralick, S. R. Sternberg, and X. Zhuang, "Image analysis using mathematical morphology," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 4, pp. 532–550, 1987.
[20] T.-C. Lee, R. L. Kashyap, and C.-N. Chu, "Building skeleton models via 3-D medial surface axis thinning algorithms," CVGIP: Graphical Models and Image Processing, vol. 56, no. 6, pp. 462–478, 1994.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[24] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[25] A. Biguri, M. Dosanjh, S. Hancock, and M. Soleimani, "TIGRE: a MATLAB-GPU toolbox for CBCT image reconstruction," Biomedical Physics & Engineering Express, vol. 2, no. 5, p. 055010, 2016.
[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[28] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[29] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[30] S. Rueda, S. Fathima, C. L. Knight, M. Yaqub, A. T. Papageorghiou, B. Rahmatullah, A. Foi, M. Maggioni, A. Pepe, J. Tohka et al., "Evaluation and comparison of current fetal ultrasound image segmentation methods for biometric measurements: a grand challenge," IEEE Transactions on Medical Imaging, vol. 33, no. 4, pp. 797–813, 2013.
[31] P. F. Raudaschl, P. Zaffino, G. C. Sharp, M. F. Spadea, A. Chen, B. M. Dawant, T. Albrecht, T. Gass, C. Langguth, M. Lüthi et al., "Evaluation of segmentation methods on head and neck CT: auto-segmentation challenge 2015," Medical Physics, vol. 44, no. 5, pp. 2020–2036, 2017.
[32] O. Maier, B. H. Menze, J. von der Gablentz, L. Häni, M. P. Heinrich, M. Liebrand, S. Winzeck, A. Basit, P. Bentley, L. Chen et al., "ISLES 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI," Medical Image Analysis, vol. 35, pp. 250–269, 2017.
[33] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, "Deep neural networks segment neuronal membranes in electron microscopy images," in Advances in Neural Information Processing Systems, 2012, pp. 2843–2851.
[34] B. Kim, K. C. Kim, Y. Park, J.-Y. Kwon, J. Jang, and J. K. Seo, "Machine-learning-based automatic identification of fetal abdominal circumference from ultrasound images," Physiological Measurement, vol. 39, no. 10, p. 105007, 2018.
[35] F. Wilcoxon, "Individual comparisons by ranking methods," in Breakthroughs in Statistics. Springer, 1992, pp. 196–202.
[36] X. Xu, C. Liu, and Y. Zheng, "3D tooth segmentation and labeling using deep convolutional neural networks," IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 7, pp. 2336–2348, 2018.
[37] S. Tian, N. Dai, B. Zhang, F. Yuan, Q. Yu, and X. Cheng, "Automatic classification and segmentation of teeth on 3D dental model using hierarchical deep learning networks," IEEE Access, vol. 7, pp. 84817–84828, 2019.

APPENDIX A
VISUALIZATION OF THE RESULTS

In this section, we provide a qualitative evaluation of four selected subjects by visualizing the results of each step. Fig. 15 presents the results of Step 2 using the upper and lower panoramic images generated by Step 1. Fig. 16 shows the individual tooth segmentation results, displayed on three CBCT slices for each subject. Fig. 17 shows the visualized 3D teeth with a skull segmented from the CBCT image. The segmented regions of the teeth are filled with different colors according to the corresponding numbers.

Fig. 15: Results of the proposed method for Step 2.

Fig. 16: 3D tooth segmentation results of the proposed method for four subjects.

Fig. 17: Visualization of 3D tooth segmentation results for four subjects.

APPENDIX B
DEEP LEARNING NETWORK ARCHITECTURES

This section shows the architectures of the three proposed networks in Steps 2-1, 2-3, and 4. The architecture of the tooth detection network is described in Table IV. In the last layer, different activation functions are used to predict the confidence scores, bounding box components, and class probabilities at the same time. Table V shows the network for 2D tooth segmentation, which is a typical U-Net [24] structure. As shown in Table VI, the 3D tooth segmentation network is also based on the U-Net structure.
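As a companion to Table VI, the following PyTorch sketch shows a 3D U-shaped network in its spirit, using the encoder channel progression recoverable from the table (16/32/64) with a 2-channel loose-and-tight ROI input; the layer details that are not legible in the table (e.g., the output head) are our assumptions:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    # conv(3x3x3) + batch normalization [28] + ReLU
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class UNet3D(nn.Module):
    """Sketch of the Step 4 network: 2-channel input (R_box, R_seg), 2 classes."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = block(2, 16), block(16, 32), block(32, 64)
        self.pool = nn.MaxPool3d(2)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.dec2 = block(64 + 32, 32)   # concatenation with the skip connection
        self.dec1 = block(32 + 16, 16)
        self.head = nn.Conv3d(16, 2, kernel_size=1)  # per-voxel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.head(d1)
```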
TABLE IV: Network architecture of the tooth detection network in Step 2-1.

Layer        Input              Output             Kernel   Activation
conv 1       640 × 320 × 1      640 × 320 × 64     3 × 3    ReLU
max pool 1   640 × 320 × 64     320 × 160 × 64     2 × 2    -
conv 2       320 × 160 × 64     320 × 160 × 128    3 × 3    ReLU
max pool 2   320 × 160 × 128    160 × 80 × 128     2 × 2    -
conv 3       160 × 80 × 128     160 × 80 × 256     3 × 3    ReLU
max pool 3   160 × 80 × 256     80 × 40 × 256      2 × 2    -
conv 4       80 × 40 × 256      80 × 40 × 512      3 × 3    ReLU
max pool 4   80 × 40 × 512      40 × 20 × 512      2 × 2    -
conv 5       40 × 20 × 512      40 × 20 × …        3 × 3    ReLU
conv 6       40 × 20 × …        40 × 20 × …        3 × 3    ReLU
conv 7       40 × 20 × …        40 × 20 × 9        …        Sigmoid (c), Sigmoid (ŝ, ẑ), - (ŵ, ĥ), Softmax (p)

TABLE V: Network architecture of the 2D tooth segmentation network in Step 2-3.

Encoding path:
Layer        Input              Output             Kernel   Activation
conv 1       128 × 128 × 1      128 × 128 × 64     3 × 3    ReLU
max pool 1   128 × 128 × 64     64 × 64 × 64       2 × 2    -
conv 2       64 × 64 × 64       64 × 64 × 128      3 × 3    ReLU
max pool 2   64 × 64 × 128      32 × 32 × 128      2 × 2    -
conv 3       32 × 32 × 128      32 × 32 × 256      3 × 3    ReLU
max pool 3   32 × 32 × 256      16 × 16 × 256      2 × 2    -
conv 4       16 × 16 × 256      16 × 16 × 512      3 × 3    ReLU

Decoding path:
Layer                 Input              Output             Kernel   Activation
upsample 1            16 × 16 × 512      32 × 32 × 512      2 × 2    -
conv 5                32 × 32 × 512      32 × 32 × 256      3 × 3    ReLU
concat with conv 3    32 × 32 × 256      32 × 32 × 512      -        -
conv 6                32 × 32 × 512      32 × 32 × 256      3 × 3    ReLU
upsample 2            32 × 32 × 256      64 × 64 × 256      2 × 2    -
conv 7                64 × 64 × 256      64 × 64 × 128      3 × 3    ReLU
concat with conv 2    64 × 64 × 128      64 × 64 × 256      -        -
conv 8                64 × 64 × 256      64 × 64 × 128      3 × 3    ReLU
upsample 3            64 × 64 × 128      128 × 128 × 128    2 × 2    -
conv 9                128 × 128 × 128    128 × 128 × 64     3 × 3    ReLU
concat with conv 1    128 × 128 × 64     128 × 128 × 128    -        -
conv 10               128 × 128 × 128    128 × 128 × 64     3 × 3    ReLU
conv 11               128 × 128 × 64     128 × 128 × …      …        Softmax

TABLE VI: Network architecture of the 3D tooth segmentation network in Step 4.

Encoding path:
Layer        Input          Output         Kernel       Activation
max pool 1   128³ × 2       64³ × 2        2 × 2 × 2    -
conv 2       64³ × 2        64³ × 16       3 × 3 × 3    ReLU
max pool 2   64³ × 16       32³ × 16       2 × 2 × 2    -
conv 3       32³ × 16       32³ × 32       3 × 3 × 3    ReLU
max pool 3   32³ × 32       16³ × 32       2 × 2 × 2    -
conv 4       16³ × 32       16³ × 64       3 × 3 × 3    ReLU

Decoding path:
Layer                 Input          Output         Kernel       Activation
upsample 1            16³ × 64       32³ × 64       2 × 2 × 2    -
conv 5                32³ × 64       32³ × 32       3 × 3 × 3    ReLU
concat with conv 3    32³ × 32       32³ × 64       -            -
conv 6                32³ × 64       32³ × 32       3 × 3 × 3    ReLU
upsample 2            32³ × 32       64³ × 32       2 × 2 × 2    -
conv 7                64³ × 32       64³ × 16       3 × 3 × 3    ReLU
concat with conv 2    64³ × 16       64³ × 32       -            -
conv 8                64³ × 32       64³ × 16       3 × 3 × 3    ReLU
upsample 3            64³ × 16       128³ × 16      2 × 2 × 2    -
conv 9                128³ × 16      128³ × …       3 × 3 × 3    ReLU
conv 10               128³ × …       128³ × 1       …            …