A fully automated method for 3D individual tooth identification and segmentation in dental CBCT
Tae Jun Jang, Kang Cheol Kim, Hyun Cheol Cho, and Jin Keun Seo
Abstract—Accurate and automatic segmentation of three-dimensional (3D) individual teeth from cone-beam computerized tomography (CBCT) images is a challenging problem because of the difficulty in separating an individual tooth from adjacent teeth and its surrounding alveolar bone. Thus, this paper proposes a fully automated method of identifying and segmenting 3D individual teeth from dental CBCT images. The proposed method addresses the aforementioned difficulty by developing a deep learning-based hierarchical multi-step model. First, it automatically generates upper and lower jaw panoramic images to overcome the computational complexity caused by high-dimensional data and the curse of dimensionality associated with a limited training dataset. The obtained 2D panoramic images are then used to identify 2D individual teeth and capture loose and tight regions of interest (ROIs) of 3D individual teeth. Finally, accurate 3D individual tooth segmentation is achieved using both loose and tight ROIs. Experimental results showed that the proposed method achieved an F1-score of 93.35% for tooth identification and a Dice similarity coefficient of 94.79% for individual 3D tooth segmentation. The results demonstrate that the proposed method provides an effective clinical and practical framework for digital dentistry.
Index Terms—Cone-beam computerized tomography, digital dentistry, tooth segmentation, tooth identification, deep learning
I. INTRODUCTION
Digital dentistry is evolving rapidly along with the rapid innovation of artificial intelligence and the development of cone-beam computerized tomography (CBCT), intra-oral and face scanners, and dental three-dimensional (3D) printing. Digital dentistry enhances a dentist's efficiency and improves the accuracy of orthodontic diagnoses, treatment planning, and surgical guides. A fundamental component of digital dentistry is the 3D segmentation of teeth, jaws, and skulls from CBCT images. Moreover, accurate digital models of individual tooth geometry and jaws facilitate the simulation of prosthetic evaluation, cephalometric analysis, computer-aided digital implant planning, and bite irregularity prediction.

Automatic and accurate 3D individual tooth segmentation from CBCT images is a difficult task for the following reasons: (i) similar intensities between tooth roots and the surrounding alveolar bone; and (ii) attached boundaries between adjacent teeth in the crown parts.

Over the last decade, there have been several attempts to develop 3D tooth segmentation methods, most of which are based on level set methods [1]–[5].
The authors are with the School of Mathematics and Computing (Computational Science and Engineering), Yonsei University, Seoul, 03722. E-mail: [email protected] (corresponding author)

Unfortunately, level set-based methods have fundamental limitations in achieving fully automated segmentation. This difficulty arises from the dependence of such methods on the initialization of the level set, and automatic initialization is hindered by the complex image structure associated with adjacent teeth, the jaw, the alveolar bone, etc. Hence, user intervention through manual initialization is inevitable in this approach. Similarly, graph cut-based methods [6], [7] require manual intervention because their results are also affected by initialization. An anatomy-driven (or template-based) method [8] was proposed to automatically model the overall 3D tooth shape through a B-spline representation. The disadvantage of this approach is its convex hull property, which causes inaccurate 3D tooth segmentation. In particular, the approach is vulnerable to topological changes of molar teeth in transverse computerized tomography (CT) slices along the longitudinal axis.

Recently, deep learning methods have been applied to 3D tooth segmentation. Lee et al. [9] and Rao et al. [10] used a fully convolutional network (FCN) [11] for whole tooth segmentation instead of individual tooth segmentation. Chen et al. [12] attempted to provide individual tooth segmentation using a marker-controlled watershed transform on the tooth area predicted by an FCN. The disadvantage of this approach is that a tooth may be broken into several fractions, causing the watershed transform to assign several labels to an individual tooth. This drawback is a major obstacle to providing accurate and robust segmentation. Cui et al. [13] proposed a deep learning framework for individual tooth segmentation and identification using Mask R-CNN [14]. The limitation of these deep learning methods is the patch-based approach used to handle high-dimensional inputs (e.g., hundreds of voxels along each axis of a 3D CBCT image) and a limited amount of labeled samples. It is necessary to use both local and global information to achieve accurate segmentation with individual tooth identification. Thus, the drawback of this patch-based approach is its inability to reflect contextual (global) information, since each output of a convolutional network depends only on the corresponding patch.

Automatic individual tooth identification is also a difficult task. Recently, several individual tooth identification attempts [15], [16] have been made using convolutional networks. However, these approaches suffer from misclassification errors caused by similarities between adjacent teeth.

Existing 3D tooth segmentation methods may not be effective for CBCT images that are severely corrupted by metal artifacts. In a clinical dental CBCT environment (e.g., low dose radiation exposure), metal artifacts become common as the number of aged patients with metallic prostheses increases. Hence, it would be desirable to develop a method that works well even in images degraded by metal artifacts.
Fig. 1: Schematic diagram of the proposed method, which consists of four steps: 1) panoramic image reconstruction of the upper and lower jaws from a 3D CBCT image; 2) tooth identification and 2D segmentation of individual teeth in the panoramic images; 3) extraction of loose and tight 3D tooth ROIs using the detected bounding boxes and segmented tooth regions; and 4) 3D segmentation of individual teeth from the 3D tooth ROIs.
Fig. 2: Fédération Dentaire Internationale (FDI) dental notation using a two-digit numbering system. The first digit (quadrant code) represents a quadrant of teeth, and the second digit (tooth code) represents the order of the tooth from the central incisor in a quadrant.

This paper aimed to address these limitations by developing a hierarchical multi-step deep learning model. The proposed method is summarized as follows. The first step is to circumvent the high-dimensionality problem associated with CT images. This step automatically generates panoramic images of the upper and lower jaws from CT images, where their size is smaller than that of the original CT image. The panoramic images of the upper and lower jaws are separated to reduce overlaps between adjacent teeth. Notably, panoramic images generated from CBCT images are not significantly affected by metal-related artifacts. We utilize these panoramic images to accurately perform 2D tooth detection, identification, and segmentation. The second step is to identify individual teeth by two-digit numbers relative to their quadrant and location, as shown in Fig. 2. We develop a tooth detection method that localizes bounding boxes enclosing each tooth and classifies them into four types according to tooth morphology. This method solves misclassification problems caused by similar adjacent teeth. The individual teeth are then identified using the results of tooth detection. Additionally, we perform 2D segmentation for individual teeth. The third step extracts loose and tight 3D tooth regions of interest (ROIs) from the 2D detected boxes and segmented tooth regions for accurate 3D individual tooth segmentation in the final step. Tight ROIs improve the segmentation accuracy. A schematic diagram of our method is shown in Fig. 1.

The rest of this paper is organized as follows. The details of the proposed method are described in Section II. Implementation details and experimental results are provided in Section III. Finally, Section IV presents the discussion and conclusion of this paper.

II. METHOD
Let X denote a 3D CT image with voxel grid Ω := {(x, y, z) ∈ ℕ³ : 1 ≤ x ≤ N_x, 1 ≤ y ≤ N_y, 1 ≤ z ≤ N_z}, where N_x, N_y, and N_z are the numbers of voxels in the directions x (sagittal axis), y (frontal axis), and z (longitudinal axis), respectively. The CT images used in this work have fixed voxel dimensions; the acquisition settings are given in Section III-A. The value X(x, y, z) at the voxel position (x, y, z) represents the attenuation coefficient.

Fig. 3: Workflow of Step 1, showing the reconstruction process of the upper jaw panoramic image P_{X_upper} and the lower jaw panoramic image P_{X_lower} from a 3D CT image X: [1-1] thresholding of X into a binarized bone image X̃; [1-2] connected component labeling (CCL), in which the largest and the second largest components give the binarized upper jaw X̃_upper and lower jaw X̃_lower, and elementwise products with X give the upper jaw X_upper and lower jaw X_lower; [1-3] maximum intensity projection (MIP) images M_{X_upper} and M_{X_lower}; [1-4] thresholding and closing to obtain the upper and lower dental arches A_{X_upper} and A_{X_lower}; [1-5] skeletonization and curve processing to obtain the reference curves C_{X_upper} and C_{X_lower}; and [1-6] projection along the reference curves to obtain the panoramic images P_{X_upper} and P_{X_lower}.

A. Step 1: Panoramic image reconstruction of the upper and lower jaws from a 3D CBCT image

This step describes the automatic reconstruction of panoramic images of the upper and lower jaws from a 3D CBCT image X. Fig. 3 illustrates the workflow.

[Step 1-1] To obtain a binarized bone image X̃, the 3D CT image X is segmented into three classes (air, soft tissue, and bone) using a multi-level version of Otsu's method [17]. The threshold values T₁ and T₂ for the histogram h(t) corresponding to X are determined by maximizing the between-class variance:

{T₁, T₂} = argmax_{t₁,t₂} [ ω₀(μ₀ − μ)² + ω₁(μ₁ − μ)² + ω₂(μ₂ − μ)² ], (1)

where ω_k and μ_k denote the probability and the mean intensity of the k-th class induced by the candidate thresholds t₁ < t₂, and μ is the mean intensity of the whole histogram h(t). Voxels above the bone threshold form X̃. The subsequent sub-steps [1-2] through [1-6] (connected component labeling, maximum intensity projection, dental arch extraction, reference curve generation, and projection) are summarized in Fig. 3.
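As a concrete illustration of Step 1-1, the following minimal Python sketch produces a binarized bone volume from a NumPy CT array; it uses scikit-image's multi-Otsu implementation, which maximizes the same between-class variance criterion as (1) (the variable names are ours, not the paper's):

```python
import numpy as np
from skimage.filters import threshold_multiotsu

def binarize_bone(ct: np.ndarray) -> np.ndarray:
    """Segment a CT volume into air/soft tissue/bone and keep the bone class."""
    # Two thresholds T1 < T2 separating the three intensity classes.
    t1, t2 = threshold_multiotsu(ct, classes=3)
    # Voxels brighter than the upper threshold are treated as bone.
    return (ct > t2).astype(np.uint8)
```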
B. Step 2: Tooth identification and 2D segmentation of individual teeth in the panoramic images

This step aims to identify and segment individual teeth in the reconstructed panoramic images. To achieve this goal, we first perform individual tooth detection. Here, the teeth are classified as incisor (class 1), canine (class 2), premolar (class 3), and molar (class 4).

[Step 2-1] To detect individual teeth in a panoramic image, we develop a deep learning method inspired by one-stage object detection [21], [22]. Given a panoramic image P of size N_s × N_z (e.g., N_s × N_z = 640 × 320), a uniform grid is created. Each grid cell G_ij is defined as follows:

G_ij = {(s, z) ∈ ℕ² : g(i − 1) < s ≤ gi, g(j − 1) < z ≤ gj}, (5)

where g is the side length of the grid cells. We then learn a tooth detection map f_det : P ↦ Y given by

f_det(P) = (Y_ij), 1 ≤ i ≤ N_s/g, 1 ≤ j ≤ N_z/g, (6)

where Y_ij = (c_ij, b_ij, p_ij) predicts a confidence score c_ij, a bounding box component b_ij, and a class probability p_ij in G_ij, as illustrated in Fig. 5. The confidence score c_ij ∈ [0, 1] represents the existence of a tooth center in G_ij. A bounding box component is denoted by

b_ij = (s_ij, z_ij, w_ij, h_ij), (7)

where (s_ij, z_ij) is the center of the bounding box in G_ij and (w_ij, h_ij) indicates its width and height. For a tooth in the bounding box corresponding to b_ij, we estimate a class probability

p_ij = (p_{ij,1}, p_{ij,2}, p_{ij,3}, p_{ij,4}), (8)

where p_{ij,k} represents the probability of the tooth being class k.

Fig. 5: Concept of Step 2-1. The detection map f_det predicts Y_ij = (c_ij, b_ij, p_ij) for each grid cell G_ij. The center position (s_ij, z_ij) in G_ij, the width w_ij and height h_ij of the bounding box, and the tooth class (e.g., p_ij: molar) are read from Y_ij wherever the confidence score c_ij has a high value.

To find the exact bounding boxes among the boxes predicted for all G_ij, we remove the boxes with scores e_ij = c_ij · (max_k p_{ij,k}) less than 0.5. Several bounding boxes with high scores may appear near the center of a tooth. We adopt the non-maximum suppression (NMS) technique to filter out bounding boxes that highly overlap higher-scoring boxes.

Using a labeled training dataset {(P^(n), Y*^(n))}_{n=1}^{N}, where Y* is the ground truth, f_det is learned by minimizing the loss between the output Y = f_det(P) and the ground truth Y*:

Σ_{n=1}^{N} [ L_obj(Y^(n), Y*^(n)) + λ_noobj L_noobj(Y^(n), Y*^(n)) + λ_box L_box(Y^(n), Y*^(n)) + L_cls(Y^(n), Y*^(n)) ], (9)

where

L_obj(Y, Y*) := Σ_{(i,j) : c*_ij = 1} (1 − c_ij)², (10)

L_noobj(Y, Y*) := Σ_{(i,j) : c*_ij = 0} (0 − c_ij)², (11)

L_box(Y, Y*) := Σ_{(i,j) : c*_ij = 1} |b*_ij − b_ij|², (12)

L_cls(Y, Y*) := Σ_{(i,j) : c*_ij = 1} CrossEntropy(p*_ij, p_ij). (13)

L_obj, L_box, and L_cls represent the prediction errors of the confidence scores, bounding box components, and class probabilities where objects exist (c*_ij = 1), respectively. L_noobj is related to the confidence scores where no objects exist (c*_ij = 0). Since there is no object in most grid cells, the confidence scores tend to be predicted as zero [21]. To eliminate this imbalance, L_noobj and L_box are weighted by the constants λ_noobj = 0.5 and λ_box = 5, respectively.

Here, for stable learning of the bounding box regression [23], b_ij is replaced by b̂_ij = (ŝ_ij, ẑ_ij, ŵ_ij, ĥ_ij), which satisfies the following conditions:

s_ij = g(ŝ_ij + i − 1),  z_ij = g(ẑ_ij + j − 1),  w_ij = a_w exp(ŵ_ij),  h_ij = a_h exp(ĥ_ij), (14)

where a_w and a_h are the width and height of an anchor box, respectively. We set the size of the anchor box to the mean size of the ground truth bounding boxes.
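For concreteness, the detection loss (9)-(13) can be written as a short PyTorch function. The tensor layout below (batch × 9 channels × grid height × grid width, with channel 0 the confidence, channels 1-4 the box components, and channels 5-8 the class scores) is our assumption for illustration, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def detection_loss(y_pred, y_true, lambda_noobj=0.5, lambda_box=5.0):
    """YOLO-style loss of (9)-(13) on (B, 9, H, W) prediction/target tensors."""
    c_pred, c_true = y_pred[:, 0], y_true[:, 0]
    obj = (c_true == 1).float()      # cells containing a tooth center
    noobj = (c_true == 0).float()    # background cells

    l_obj = (((1.0 - c_pred) ** 2) * obj).sum()                 # (10)
    l_noobj = ((c_pred ** 2) * noobj).sum()                     # (11)

    b_pred, b_true = y_pred[:, 1:5], y_true[:, 1:5]
    l_box = (((b_true - b_pred) ** 2).sum(dim=1) * obj).sum()   # (12)

    # cross entropy against one-hot class targets, only where objects exist
    logp = F.log_softmax(y_pred[:, 5:9], dim=1)
    l_cls = ((-(y_true[:, 5:9] * logp).sum(dim=1)) * obj).sum() # (13)

    return l_obj + lambda_noobj * l_noobj + lambda_box * l_box + l_cls  # (9)
```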
[Step 2-2] For each tooth in a detected bounding box, a number is assigned to identify the unique tooth according to the FDI system. For convenience, we first explain how the numbers are assigned to teeth in the upper jaw. As illustrated in Fig. 6, the detected bounding boxes are listed in ascending order of the s coordinates of the box centers. The upper right and upper left quadrants are divided at the middle of the four sequential incisor boxes. For the two right incisors and the two left incisors, numbers 1 and 2 are assigned from the inside to the outside, respectively. Number 3 is assigned to the canines, since there is only one in each quadrant. On each side, premolars are assigned numbers 4 and 5 from the inside to the outside. Likewise, molars are assigned numbers 6, 7, and 8 (if a wisdom tooth exists); a sketch of this numbering rule is given after Fig. 6.

Fig. 6: Tooth identification process using the classification results of Step 2-1. The capital letters represent the first letters of the tooth types, and the numbers are tooth codes.
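The quadrant splitting and numbering of Step 2-2 amount to a simple rule-based procedure. The following Python sketch is our illustrative reconstruction for one jaw with a full dentition (the detection format and function names are hypothetical):

```python
# Each detection is a (s_center, tooth_type) pair, with type in {"I", "C", "P", "M"}.
# Tooth codes per type, listed from the inside (midline) to the outside.
CODES = {"I": [1, 2], "C": [3], "P": [4, 5], "M": [6, 7, 8]}

def number_outward(teeth):
    """Assign tooth codes to teeth listed from the midline toward the back."""
    counters = {t: 0 for t in CODES}
    codes = []
    for _, tooth_type in teeth:
        codes.append(CODES[tooth_type][counters[tooth_type]])
        counters[tooth_type] += 1
    return codes

def assign_tooth_codes(detections):
    boxes = sorted(detections)                 # ascending s coordinate of box centers
    incisors = [i for i, (_, t) in enumerate(boxes) if t == "I"]
    split = incisors[1] + 1                    # midline between the four incisors
    right, left = boxes[:split], boxes[split:]
    # the right quadrant is traversed outward by reversing its order
    right_codes = list(reversed(number_outward(list(reversed(right)))))
    return right_codes, number_outward(left)
```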
[Step 2-3] The proposed 2D tooth segmentation method uses a U-shaped FCN [24], taking advantage of the bounding box information obtained in Step 2-1. Let S ∈ ℝ^{N_s × N_z} denote the segmentation image for a tooth corresponding to a detected bounding box in P. We construct a training dataset {(I_roi^(n), Y_roi^(n))}_{n=1}^{N} for individual tooth segmentation, where I_roi^(n) and Y_roi^(n) are tooth images of P and S cropped by the bounding boxes. A segmentation map f_seg : I_roi ↦ Y_roi is learned using a U-shaped network by minimizing the following loss:

L_seg = (1/N) Σ_{n=1}^{N} [ − Σ_{x=1}^{M} Y_roi^(n)(x) log f_seg(I_roi^(n))(x) ], (15)

where x is a pixel position and M is the number of pixels of Y_roi.

C. Step 3: Extraction of loose and tight 3D tooth ROIs using the detected bounding boxes and segmented tooth regions

In this step, 3D tooth ROIs are obtained using the results of the previous steps. As described in Fig. 7, a bounding box containing one tooth is projected back into the 3D CBCT image using (3) and (4). A loose ROI domain of the target tooth is then given by

D_box = {(r(s) + t n(s), z) : −α ≤ t ≤ α, (s, z) ∈ B_box}, (16)

where r(s) is a point on the reference curve, n(s) is the unit normal at r(s), and B_box is the set of pixel positions in the bounding box. Similarly, a tight ROI domain D_seg is determined by B_seg, which is the set of pixel positions of the 2D segmented tooth region within the box.

The loose 3D tooth ROI R_box is obtained by changing the voxel values outside D_box to 0 and extracting the content of a 3D bounding box that fits closely around D_box, as shown in Fig. 7. Similarly, we obtain the tight 3D tooth ROI R_seg by processing D_seg instead of D_box and using the same 3D bounding box as above.

Fig. 7: Extraction of loose and tight 3D tooth ROIs. A loose ROI domain (green dotted line in X) is determined by the domain of projection between points (blue stars in X) on the reference curve corresponding to points (blue stars in P) on the bounding box. A 3D bounding box is then obtained by closely fitting the loose ROI domain. The loose 3D tooth ROI R_box is extracted by cropping the CT image X with the 3D bounding box and by changing the values of voxels outside the loose ROI domain to 0. Similarly, the tight 3D tooth ROI R_seg is obtained from the 2D segmented tooth region.
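A minimal NumPy sketch of the loose ROI masking in (16) is given below; it assumes the reference curve samples r(s) and unit normals n(s) from Step 1-5 are available as arrays and rounds each back-projected point to the nearest voxel (array names and the sampling density are our choices):

```python
import numpy as np

def loose_roi_mask(ct_shape, r, n, box_pixels, alpha, n_offsets=64):
    """Binary mask of the loose ROI domain D_box of (16).

    ct_shape: (Nx, Ny, Nz) shape of the CT volume.
    r, n: (Ns, 2) arrays of curve points r(s) and unit normals n(s).
    box_pixels: iterable of (s, z) pixel positions inside the 2D bounding box.
    alpha: half-thickness of the slab around the reference curve, in voxels.
    """
    mask = np.zeros(ct_shape, dtype=bool)
    for t in np.linspace(-alpha, alpha, n_offsets):   # sample -alpha <= t <= alpha
        for s, z in box_pixels:
            x, y = np.rint(r[s] + t * n[s]).astype(int)   # r(s) + t n(s)
            if 0 <= x < ct_shape[0] and 0 <= y < ct_shape[1] and 0 <= z < ct_shape[2]:
                mask[x, y, z] = True
    return mask
```

The loose ROI R_box is then the masked volume cropped to the tight 3D bounding box of the mask; replacing box_pixels by the segmented pixel set B_seg yields the tight ROI R_seg.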
D. Step 4: 3D segmentation of individual teeth from the 3D tooth ROIs

In this final step, 3D individual tooth segmentation is performed by applying the loose ROI (R_box) and the tight ROI (R_seg) to a 3D version of the U-shaped FCN [24]. The tight ROI is crucial for improving the segmentation accuracy at the attached boundaries between a target tooth and its neighboring teeth.

The input of the network is I_roi3 = R_box ⊕ R_seg, the channelwise concatenation of the two ROIs. Let Y_roi3 denote a binary vector representing the 3D tooth segmentation corresponding to I_roi3. Using a training dataset {(I_roi3^(n), Y_roi3^(n))}_{n=1}^{N}, we learn a 3D segmentation map f_seg3 : I_roi3 ↦ Y_roi3 by minimizing the following loss:

L_seg3 = (1/N) Σ_{n=1}^{N} [ − Σ_{v=1}^{V} Y_roi3^(n)(v) log f_seg3(I_roi3^(n))(v) ], (17)

where v is a voxel position and V is the number of voxels of Y_roi3.
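The two-channel input construction and the voxelwise cross-entropy of (17) can be sketched in PyTorch as follows (a minimal illustration; net3d stands in for the 3D U-shaped network of Appendix B):

```python
import torch
import torch.nn.functional as F

def tooth_3d_loss(net3d, r_box, r_seg, y_true):
    """Voxelwise cross-entropy of (17) on the 2-channel ROI input.

    r_box, r_seg: loose/tight ROI volumes of shape (B, 1, D, H, W).
    y_true: integer (torch.long) label volume of shape (B, D, H, W), values {0, 1}.
    """
    x = torch.cat([r_box, r_seg], dim=1)    # I_roi3 = R_box (+) R_seg, 2 channels
    logits = net3d(x)                       # (B, 2, D, H, W) per-voxel class scores
    return F.cross_entropy(logits, y_true)  # mean of -log p(true class) over voxels
```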
III. EXPERIMENTS AND RESULTS

A. Dataset and implementation details

Experiments were conducted on 3D CT images produced by a dental CBCT scanner with a circular trajectory (DENTRI-X; HDXWILL, Seoul, South Korea) using a tube voltage of 90 kVp and a tube current of 10 mA. All available datasets were formatted in the Digital Imaging and Communications in Medicine (DICOM) standard as a series of 16-bit grayscale images (corresponding to individual CT slices). The row and column pixel spacing and the slice thickness of the original CBCT images were both 0.2 mm. During scanning, a bite block was used to prevent contact between the upper and lower teeth.

We received 97 dental 3D CBCT images from HDXWILL. Using these data, we generated 194 upper jaw and lower jaw panoramic images in Step 1. We also received labeled data consisting of 97 sets of 2D individual tooth segmentations, bounding box components, and tooth codes, as well as 11 sets of 3D individual tooth segmentations. The labeling was performed by experts at HDXWILL.

To reconstruct panoramic images from CT images, we applied (4) to the 3D CBCT images using the tomographic reconstruction software TIGRE [25]. The size of all panoramic images was fixed at 640 × 320. The width of the panoramic images was determined by 640 reference curve points. In Step 1-5, those points are obtained by interpolating 500 points on the smooth skeleton and by extrapolating 70 points at each end of the curve. To completely cover the teeth at both ends, we extrapolated 70 points (approximately 13.3 mm), taking into account the average size of the molars. The height of 320 was determined by removing 80 CT slices that do not contain teeth from the bottom.

For the 2D detection and segmentation in Steps 2-1 and 2-3, 66 CBCT datasets were used for training and 31 datasets for testing. Because two panoramic images (upper and lower parts) are obtained from each CBCT image through Step 1, we use 132 labeled training data and 62 test data. Meanwhile, for the 3D segmentation in Step 4, 7 CBCT datasets were used for training and 4 datasets for testing. Since each patient has approximately 28 to 32 teeth, each CBCT image can provide approximately 28 to 32 training data for individual tooth segmentation. To be precise, we use 216 training data and 112 test data for the 3D segmentation in Step 4. In Steps 2-3 and 4, the 2D tooth images and the 3D loose and tight ROIs were resized to fixed sizes before being fed to the networks.

B. Training of the proposed network

Training was implemented using PyTorch [26] on a CPU (Intel(R) Core(TM) i7-9700K, 3.60 GHz) and GPU (NVIDIA GTX-2070, 8 GB) system. In Steps 2 and 4, we trained three networks: tooth detection, 2D individual tooth segmentation, and 3D individual tooth segmentation. The proposed neural networks were trained by minimizing the losses in (9), (15), and (17) using the Adam optimizer [27]. The network architectures were determined using five-fold cross validation on the training dataset. We used batch normalization [28] to prevent overfitting. The training in Steps 2-1, 2-3, and 4 used batch sizes of 8, 32, and 4, respectively. We also used data augmentation techniques including random contrast and brightness adjustment, horizontal flips, and rotations between −8° and 8°.

C. Evaluation and results of the proposed method

1) Bounding box detection: For a quantitative evaluation of the bounding box detection, we provide four precision-recall (PR) curves [29] and their average precision (AP) [29], as shown in Fig. 8. When the intersection over union (IOU) threshold value was 0.6, the PR curve shows that the precision tends to stay high as the recall increases.

Fig. 8: Tooth detection results. A PR curve represents the change in precision as the recall increases for a fixed IOU threshold value, which is used for NMS. The AP is the average of the precision values on a PR curve.

2) Individual tooth identification: This subsection presents the performance evaluation of tooth identification. The precision, recall, and F1-score were used to evaluate the results of the identification method. In Step 2-1, teeth are initially classified into four types instead of directly predicting the eight tooth codes. The direct identification method can often misclassify teeth within the same tooth type. Fig. 9 summarizes the identification results for each tooth code. As shown in Fig. 9 (a), the direct method confuses first premolars (code 4) and wisdom teeth (code 8) in particular. These errors hinder the performance of the direct method. In contrast, the four type-based method achieves a high accuracy by preventing misclassification due to similar tooth shapes. Table I shows that the proposed method leads to statistically more accurate identification.

TABLE I: Quantitative evaluation of tooth identification methods (mean ± std).

Metric          Direct method    Proposed method
Precision (%)   … ± …            96.… ± …
Recall (%)      … ± …            90.… ± …
F1-score (%)    … ± …            93.35 ± …

Fig. 9: Confusion matrices for tooth identification. The vertical and horizontal axes represent true tooth codes and predicted tooth codes, respectively. The diagonal components represent the number of correct identifications. (a) Result of the direct method:

110   3   0   0   0   3   0   0
  3  96   0   0   7   0   0   0
  2   0 107   1   0   6   0   0
  0   1   0  96  23   0   0   0
  0   0   0   0  89   0   0   0
  0   0   1   0   0 119   2   0
  0   0   2   0   0   1 116   1
  0   0   0   0   0   0   9  52

(b) Result of the proposed method:

108   2   0   0   0   1   0   0
  3  97   0   2   0   0   0   0
  0   0 113   1   0   3   0   0
  0   0   0 114   3   0   0   0
  0   0   0   3  89   0   0   0
  0   0   0   0   0 117   3   0
  0   0   0   0   0   1 116   3
  0   0   0   0   0   0   1  56
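Per-class precision, recall, and F1-score follow directly from such a confusion matrix; the short NumPy sketch below restates the metric definitions used here (rows are true codes, columns are predicted codes):

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Precision, recall, and F1 per tooth code from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # correct / everything predicted as the code
    recall = tp / cm.sum(axis=1)      # correct / all true instances of the code
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, reading tooth code 8 from Fig. 9 (b): recall is 56/57 ≈ 98.2% and precision is 56/59 ≈ 94.9%.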
3) Individual tooth segmentation: In Steps 2-3 and 4, we performed the 2D and 3D individual tooth segmentation. To evaluate the segmentation performance, we used precision, recall, the Dice similarity coefficient (DSC) [30], the Hausdorff distance (HD) [31], and the average symmetric surface distance (ASSD) [32].

2D individual tooth segmentation. The proposed method proceeds in two steps, consisting of the bounding box detection in Step 2-1 and individual segmentation in Step 2-3, whereas Mask R-CNN [14] achieves the same task in a single step. We implemented both approaches, and the quantitative evaluation results are reported in Table II. The proposed method is numerically more accurate than Mask R-CNN. In our experiment, Mask R-CNN successfully detected teeth; however, there was a lack of segmentation detail at the edges of the teeth, as shown in Fig. 10.

The accuracy of the 2D segmentation is important because the key to precise 3D tooth segmentation is the use of tight 3D tooth ROIs obtained from the 2D segmentation. Although the detection and segmentation are not performed simultaneously, two simple convolutional networks (a one-stage object detector and a U-shaped FCN) are designed to achieve a high accuracy.

TABLE II: Quantitative evaluation of 2D tooth segmentation methods (mean ± std).

Metric          2D Mask R-CNN    Proposed method
Precision (%)   … ± …            96.… ± …
Recall (%)      … ± …            93.… ± …
DSC (%)         … ± …            94.… ± …
HD (mm)         … ± …            1.… ± …
ASSD (mm)       … ± …            0.… ± …

Fig. 10: Qualitative comparison of the proposed method and Mask R-CNN. Segmentation results of (a) the proposed method and (b) Mask R-CNN. Mask R-CNN is capable of tooth segmentation; however, there are some inaccuracies, mainly at the edges of the teeth.

3D individual tooth segmentation. We developed a fully automated multi-step method for 3D individual segmentation. To verify the effectiveness of the proposed method, we compared it with a patch-based Mask R-CNN [14] and ToothNet [13]. We also provide the segmentation results of the proposed method adopting either the loose ROI, the tight ROI, or both.

Individual tooth segmentation can be formulated as an instance segmentation problem. Mask R-CNN is the state-of-the-art deep learning framework for instance segmentation. However, Mask R-CNN cannot be applied directly to large 3D CBCT images because of the computational limit. For the comparison experiments, we implemented Mask R-CNN in a patch-based fashion [33], [34] as an alternative that avoids this limitation; ToothNet is based on Mask R-CNN and also adopts this approach. In the implementation, overlapping image patches of a fixed size were extracted with a fixed stride. The patch-based approach yields redundant results caused by overlapping local image patches, owing to the disconnected spatial relationship between adjacent patches. The integrated results are obtained by removing the overlapped segmentations, following ToothNet.

The patch-based 3D Mask R-CNN and ToothNet show lower segmentation performance in the quantitative evaluation, as shown in Table III. These methods perform individual segmentation from the original CBCT images. In contrast, the proposed method has the advantage of using loose and tight ROIs that exclude considerable background region in advance. In particular, the tight ROI excludes structures on the sides (e.g., adjacent teeth, jaw, etc.) of the target tooth. To evaluate the effectiveness of using both loose and tight ROIs, we performed experiments using either the loose ROI, the tight ROI, or both on the same 3D segmentation network. When using only tight ROIs, the recall is the lowest because a loss of tooth information may occur where the tight ROI boundary intersects the tooth boundary. Using only the loose ROI, which contains the tooth boundary, yields a higher recall. However, the HD tends to be high because there is no information on the tooth boundaries. A combination of the two ROIs enhances the segmentation performance, as the tight ROI provides detailed information on the target tooth and the loose ROI compensates for the disadvantage of the tight ROI.

The Wilcoxon signed-rank test [35] is used to assess the statistical significance of the differences between the proposed method and the other methods, as summarized in Table III.

TABLE III: Quantitative comparison of 3D tooth segmentation methods (mean ± std). For each competing method, the p-value of the Wilcoxon signed-rank test against the proposed method (loose & tight ROIs) is given in parentheses.

Metric          3D Mask R-CNN         ToothNet               Loose ROI              Tight ROI              Loose & tight ROIs
Precision (%)   … ± … (p < 0.001)     89.… ± … (p < 0.001)   94.… ± … (p < 0.001)   94.… ± … (p < 0.001)   95.… ± …
Recall (%)      … ± … (p < 0.01)      93.… ± … (p < 0.001)   92.… ± … (p < 0.001)   91.… ± … (p < 0.001)   93.… ± …
DSC (%)         … ± … (p < 0.001)     91.… ± … (p < 0.001)   93.… ± … (p < 0.001)   92.… ± … (p < 0.001)   94.79 ± …
HD (mm)         … ± … (p < 0.001)     2.… ± … (p < 0.001)    2.… ± … (p < 0.001)    1.… ± … (p < 0.001)    1.… ± …
ASSD (mm)       … ± … (p < 0.001)     0.… ± … (p < 0.001)    0.… ± … (p < 0.001)    0.… ± … (p < 0.001)    0.… ± …

Fig. 11: Qualitative comparison of 3D individual tooth segmentation in a CBCT image with metal artifacts. Segmentation results of (a) Mask R-CNN, (b) ToothNet, (c) the proposed method using loose ROIs, (d) the proposed method using tight ROIs, and (e) the proposed method using both loose and tight ROIs.
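The significance values reported in Table III come from a paired Wilcoxon signed-rank test; the comparison can be reproduced with SciPy as in the sketch below (the per-tooth DSC arrays here are synthetic stand-ins, not the paper's data):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic per-tooth DSC values for two methods evaluated on the same test teeth.
dsc_proposed = rng.normal(0.948, 0.02, size=112)
dsc_baseline = rng.normal(0.915, 0.03, size=112)

# Paired two-sided test on the per-tooth differences.
statistic, p_value = wilcoxon(dsc_proposed, dsc_baseline)
print(f"Wilcoxon statistic = {statistic:.1f}, p = {p_value:.3g}")
```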
D. Metal artifact-contaminated CBCT

The proposed method effectively handles problems caused by metal-related artifacts. Fig. 12 shows a CBCT image that is significantly contaminated by metal artifacts, whereas the artifacts are significantly reduced in the corresponding panoramic image generated from the CBCT image. The panoramic images from Step 1 allow 2D tooth detection and segmentation to be performed accurately. These results provide prior knowledge of each 3D tooth in the form of loose and tight ROIs. As shown in Fig. 12, the tight ROI excludes adjacent teeth even when the tooth boundaries are obscured by metal artifacts. Fig. 11 provides a qualitative evaluation of 3D tooth segmentation in a CBCT image with metal artifacts. As shown in Fig. 11 (d) and (e), the segmentation results in the degraded CT image are superior to those of Fig. 11 (a)-(c), as the tight ROIs provide robust boundary information. However, Fig. 11 (d) shows that using only the tight ROIs may not provide robust segmentation because they can cut off the edges of the teeth. Fig. 11 (e) thus illustrates the advantages and effectiveness of the proposed method using both loose and tight ROIs.

Fig. 12: (a) CBCT image affected by metal artifacts. (b) Panoramic image generated by Step 1, which is not significantly affected by metal artifacts. In (a), the green solid line represents the outline of the tight ROI obtained from the segmented region in (b). The tight ROI provides boundary information for the target tooth.

E. Special cases

This subsection shows the experimental results for abnormal data (e.g., missing teeth and implants), which are often encountered in real clinical practice.

1) Missing teeth: People often lose teeth due to factors such as cavities, periodontal disease, aging, dental trauma, and orthodontic treatment. To address these cases, we applied CBCT images with missing teeth (except for wisdom teeth) to our method. The four type-based classification was successful, but tooth identification was incomplete. With the exception of canines, each tooth quadrant contains two or more teeth of the same type. When a tooth is missing, it is difficult to number the remaining teeth of the same type from the panoramic image alone. Therefore, we suggest performing only classification in the case of a missing tooth, as shown in Fig. 13.

2) Implants: Implant surgery is a dental prosthetic treatment to replace a missing tooth. An implant appears as a screw in a panoramic image, which is a unique signature. Although data containing implants were totally unseen during model training, Fig. 14 shows successful bounding box detection and segmentation for implants. When applied to 3D segmentation, we also confirmed successful outcomes, as shown in Fig. 14. However, implants cannot be predicted within the original tooth classes since they have a different shape. Their codes can be inferred from the classification and identification of the neighboring teeth.

Fig. 13: Illustration of tooth identification when there is a missing tooth. Two premolars (class P) corresponding to number 4 are missing. It is not possible to determine whether the remaining premolar is number 4 or 5 from the panoramic image alone. Hence, we suggest only marking the tooth type.

Fig. 14: Results of three CT images with implants. Implants were successfully segmented in both the panoramic images and the 3D CT images, but the implants are classified into different classes.

IV. DISCUSSION AND CONCLUSION

In this paper, we developed a fully automated segmentation and identification method for individual teeth and jaws from CBCT images. Given CBCT data, the method automatically generates the maxillary and mandibular panoramic images that are projected along the reference curve representing a region-based shape feature of the dental arch. In the maxillary and mandibular panoramic images, 2D tooth segmentation and identification are performed using deep learning methods, which are vital for high-precision 3D tooth segmentation and identification. Experiments showed that the accuracy of the method is suitable for the clinical setting. Our method overcomes the limitations of existing automated methods by achieving full automation and improved accuracy. Additionally, the method addresses the difficulty of learning from high-dimensional data.

The main idea of the proposed method is the careful use of accurate and robust 2D tooth segmentation and identification in 2D panoramic images, in an indirect manner, to address the difficulty of 3D segmentation from metal artifact-contaminated 3D CBCT images. In a clinical dental CBCT environment (e.g., low dose radiation exposure), metal-related artifacts are common. The proposed method utilizes the crucial observation that metal artifacts are significantly reduced in the upper and lower panoramic images generated from the CBCT images.
The outcome of Step 2 serves as strong prior knowledge for 3D tooth segmentation, which plays an important role in separating teeth from 3D images in cases where the teeth are contacted, overlapped, or connected owing to metal-related artifacts.

The automated system proposed in this study improves the efficiency of dentists by reducing cumbersome and time-consuming manual intervention. The result provides an improved workflow for dentists to simulate pre-operative orthodontic treatment and manufacture implant surgical guides. Digital occlusion analysis is potentially possible by combining our method with intra-oral scan models [36], [37] via registration. Hence, it is expected to play an important role in digital dentistry.

ACKNOWLEDGEMENTS

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI20C0127). We would like to express our deepest gratitude to HDXWILL, which shared the dental CBCT images and ground-truth data.

REFERENCES

[1] H. Gao and O. Chae, "Individual tooth segmentation from CT images using level set method with shape and intensity prior," Pattern Recognition, vol. 43, no. 7, pp. 2406–2417, 2010.
[2] Y. Gan, Z. Xia, J. Xiong, Q. Zhao, Y. Hu, and J. Zhang, "Toward accurate tooth segmentation from computed tomography images using a hybrid level set model," Medical Physics, vol. 42, no. 1, pp. 14–27, 2015.
[3] H.-T. Yau, T.-J. Yang, and Y.-C. Chen, "Tooth model reconstruction based upon data fusion for orthodontic treatment simulation," Computers in Biology and Medicine, vol. 48, pp. 8–16, 2014.
[4] D. X. Ji, S. H. Ong, and K. W. C. Foong, "A level-set based approach for anterior teeth segmentation in cone beam computed tomography images," Computers in Biology and Medicine, vol. 50, pp. 116–128, 2014.
[5] Y. Wang, S. Liu, G. Wang, and Y. Liu, "Accurate tooth segmentation with improved hybrid active contour model," Physics in Medicine & Biology, vol. 64, no. 1, p. 015012, 2018.
[6] L. Hiew, S. Ong, K. W. Foong, and C. Weng, "Tooth segmentation from cone-beam CT using graph cut," in Proceedings of the Second APSIPA Annual Summit and Conference, 2010, pp. 272–275.
[7] J. Keustermans, D. Vandermeulen, and P. Suetens, "Integrating statistical shape models into a graph cut framework for tooth segmentation," in International Workshop on Machine Learning in Medical Imaging. Springer, 2012, pp. 242–249.
[8] S. Barone, A. Paoli, and A. V. Razionale, "CT segmentation of dental shapes by anatomy-driven reformation imaging and B-spline modelling," International Journal for Numerical Methods in Biomedical Engineering, vol. 32, no. 6, p. e02747, 2016.
[9] S. Lee, S. Woo, J. Yu, J. Seo, J. Lee, and C. Lee, "Automated CNN-based tooth segmentation in cone-beam CT for dental implant planning," IEEE Access, vol. 8, pp. 50507–50518, 2020.
[10] Y. Rao, Y. Wang, F. Meng, J. Pu, J. Sun, and Q. Wang, "A symmetric fully convolutional residual network with DCRF for accurate tooth segmentation," IEEE Access, vol. 8, pp. 92028–92038, 2020.
[11] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[12] Y. Chen, H. Du, Z. Yun, S. Yang, Z. Dai, L. Zhong, Q. Feng, and W. Yang, "Automatic segmentation of individual tooth in dental CBCT images from tooth surface map by a multi-task FCN," IEEE Access, 2020.
[13] Z. Cui, C. Li, and W. Wang, "ToothNet: Automatic tooth instance segmentation and identification from cone beam CT images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6368–6377.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[15] Y. Miki, C. Muramatsu, T. Hayashi, X. Zhou, T. Hara, A. Katsumata, and H. Fujita, "Classification of teeth in cone-beam CT using deep convolutional neural network," Computers in Biology and Medicine, vol. 80, pp. 24–29, 2017.
[16] D. V. Tuzoff, L. N. Tuzova, M. M. Bornstein, A. S. Krasnov, M. A. Kharchenko, S. I. Nikolenko, M. M. Sveshnikov, and G. B. Bednenko, "Tooth detection and numbering in panoramic radiographs using convolutional neural networks," Dentomaxillofacial Radiology, vol. 48, no. 4, p. 20180051, 2019.
[17] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[18] H. Samet and M. Tamminen, "Efficient component labeling of images of arbitrary dimension represented by linear bintrees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 4, pp. 579–586, 1988.
[19] R. M. Haralick, S. R. Sternberg, and X. Zhuang, "Image analysis using mathematical morphology," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 4, pp. 532–550, 1987.
[20] T.-C. Lee, R. L. Kashyap, and C.-N. Chu, "Building skeleton models via 3-D medial surface axis thinning algorithms," CVGIP: Graphical Models and Image Processing, vol. 56, no. 6, pp. 462–478, 1994.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[24] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[25] A. Biguri, M. Dosanjh, S. Hancock, and M. Soleimani, "TIGRE: a MATLAB-GPU toolbox for CBCT image reconstruction," Biomedical Physics & Engineering Express, vol. 2, no. 5, p. 055010, 2016.
[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[28] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[29] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[30] S. Rueda, S. Fathima, C. L. Knight, M. Yaqub, A. T. Papageorghiou, B. Rahmatullah, A. Foi, M. Maggioni, A. Pepe, J. Tohka et al., "Evaluation and comparison of current fetal ultrasound image segmentation methods for biometric measurements: a grand challenge," IEEE Transactions on Medical Imaging, vol. 33, no. 4, pp. 797–813, 2013.
[31] P. F. Raudaschl, P. Zaffino, G. C. Sharp, M. F. Spadea, A. Chen, B. M. Dawant, T. Albrecht, T. Gass, C. Langguth, M. Lüthi et al., "Evaluation of segmentation methods on head and neck CT: auto-segmentation challenge 2015," Medical Physics, vol. 44, no. 5, pp. 2020–2036, 2017.
[32] O. Maier, B. H. Menze, J. von der Gablentz, L. Häni, M. P. Heinrich, M. Liebrand, S. Winzeck, A. Basit, P. Bentley, L. Chen et al., "ISLES 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI," Medical Image Analysis, vol. 35, pp. 250–269, 2017.
[33] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, "Deep neural networks segment neuronal membranes in electron microscopy images," in Advances in Neural Information Processing Systems, 2012, pp. 2843–2851.
[34] B. Kim, K. C. Kim, Y. Park, J.-Y. Kwon, J. Jang, and J. K. Seo, "Machine-learning-based automatic identification of fetal abdominal circumference from ultrasound images," Physiological Measurement, vol. 39, no. 10, p. 105007, 2018.
[35] F. Wilcoxon, "Individual comparisons by ranking methods," in Breakthroughs in Statistics. Springer, 1992, pp. 196–202.
[36] X. Xu, C. Liu, and Y. Zheng, "3D tooth segmentation and labeling using deep convolutional neural networks," IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 7, pp. 2336–2348, 2018.
[37] S. Tian, N. Dai, B. Zhang, F. Yuan, Q. Yu, and X. Cheng, "Automatic classification and segmentation of teeth on 3D dental model using hierarchical deep learning networks," IEEE Access, vol. 7, pp. 84817–84828, 2019.

APPENDIX A
VISUALIZATION OF THE RESULTS

In this section, we provide a qualitative evaluation of four selected subjects by visualizing the results of each step. Fig. 15 presents the results of Step 2 using the upper and lower panoramic images generated by Step 1. Fig. 16 shows the individual tooth segmentation results, displayed on three CBCT slices for each subject. Fig. 17 shows the visualized 3D teeth with a skull segmented from the CBCT image. The segmented regions of the teeth are filled with different colors according to the corresponding numbers.

Fig. 15: Results of the proposed method for Step 2.

Fig. 16: 3D tooth segmentation results of the proposed method for four subjects.

Fig. 17: Visualization of 3D tooth segmentation results for four subjects.

APPENDIX B
DEEP LEARNING NETWORK ARCHITECTURES

This section shows the architectures of the three proposed networks in Steps 2-1, 2-3, and 4. The architecture of the tooth detection network is described in Table IV. In the last layer, different activation functions are used to predict the confidence scores, bounding box components, and class probabilities at the same time. Table V shows the network for 2D tooth segmentation, which is a typical U-Net [24] structure. As shown in Table VI, the 3D tooth segmentation network is also based on the U-Net structure.
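As a companion to Table VI, the following PyTorch sketch shows a 3D U-shaped network in its spirit, using the encoder channel progression recoverable from the table (16/32/64) with a 2-channel loose-and-tight ROI input; the layer details that are not legible in the table (e.g., the output head) are our assumptions:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    # conv(3x3x3) + batch normalization [28] + ReLU
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class UNet3D(nn.Module):
    """Sketch of the Step 4 network: 2-channel input (R_box, R_seg), 2 classes."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = block(2, 16), block(16, 32), block(32, 64)
        self.pool = nn.MaxPool3d(2)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.dec2 = block(64 + 32, 32)   # concatenation with the skip connection
        self.dec1 = block(32 + 16, 16)
        self.head = nn.Conv3d(16, 2, kernel_size=1)  # per-voxel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.head(d1)
```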
TABLE IV: Network architecture of the tooth detection network in Step 2-1.

Layer        Input              Output             Kernel   Activation
conv 1       640 × 320 × 1      640 × 320 × 64     3 × 3    ReLU
max pool 1   640 × 320 × 64     320 × 160 × 64     2 × 2    -
conv 2       320 × 160 × 64     320 × 160 × 128    3 × 3    ReLU
max pool 2   320 × 160 × 128    160 × 80 × 128     2 × 2    -
conv 3       160 × 80 × 128     160 × 80 × 256     3 × 3    ReLU
max pool 3   160 × 80 × 256     80 × 40 × 256      2 × 2    -
conv 4       80 × 40 × 256      80 × 40 × 512      3 × 3    ReLU
max pool 4   80 × 40 × 512      40 × 20 × 512      2 × 2    -
conv 5       40 × 20 × 512      40 × 20 × …        3 × 3    ReLU
conv 6       40 × 20 × …        40 × 20 × …        3 × 3    ReLU
conv 7       40 × 20 × …        40 × 20 × 9        …        Sigmoid (c), Sigmoid (ŝ, ẑ), - (ŵ, ĥ), Softmax (p)

TABLE V: Network architecture of the 2D tooth segmentation network in Step 2-3.

Encoding path:
Layer        Input              Output             Kernel   Activation
conv 1       128 × 128 × 1      128 × 128 × 64     3 × 3    ReLU
max pool 1   128 × 128 × 64     64 × 64 × 64       2 × 2    -
conv 2       64 × 64 × 64       64 × 64 × 128      3 × 3    ReLU
max pool 2   64 × 64 × 128      32 × 32 × 128      2 × 2    -
conv 3       32 × 32 × 128      32 × 32 × 256      3 × 3    ReLU
max pool 3   32 × 32 × 256      16 × 16 × 256      2 × 2    -
conv 4       16 × 16 × 256      16 × 16 × 512      3 × 3    ReLU

Decoding path:
Layer                 Input              Output             Kernel   Activation
upsample 1            16 × 16 × 512      32 × 32 × 512      2 × 2    -
conv 5                32 × 32 × 512      32 × 32 × 256      3 × 3    ReLU
concat with conv 3    32 × 32 × 256      32 × 32 × 512      -        -
conv 6                32 × 32 × 512      32 × 32 × 256      3 × 3    ReLU
upsample 2            32 × 32 × 256      64 × 64 × 256      2 × 2    -
conv 7                64 × 64 × 256      64 × 64 × 128      3 × 3    ReLU
concat with conv 2    64 × 64 × 128      64 × 64 × 256      -        -
conv 8                64 × 64 × 256      64 × 64 × 128      3 × 3    ReLU
upsample 3            64 × 64 × 128      128 × 128 × 128    2 × 2    -
conv 9                128 × 128 × 128    128 × 128 × 64     3 × 3    ReLU
concat with conv 1    128 × 128 × 64     128 × 128 × 128    -        -
conv 10               128 × 128 × 128    128 × 128 × 64     3 × 3    ReLU
conv 11               128 × 128 × 64     128 × 128 × …      …        Softmax

TABLE VI: Network architecture of the 3D tooth segmentation network in Step 4.

Encoding path:
Layer        Input          Output         Kernel       Activation
max pool 1   128³ × 2       64³ × 2        2 × 2 × 2    -
conv 2       64³ × 2        64³ × 16       3 × 3 × 3    ReLU
max pool 2   64³ × 16       32³ × 16       2 × 2 × 2    -
conv 3       32³ × 16       32³ × 32       3 × 3 × 3    ReLU
max pool 3   32³ × 32       16³ × 32       2 × 2 × 2    -
conv 4       16³ × 32       16³ × 64       3 × 3 × 3    ReLU

Decoding path:
Layer                 Input          Output         Kernel       Activation
upsample 1            16³ × 64       32³ × 64       2 × 2 × 2    -
conv 5                32³ × 64       32³ × 32       3 × 3 × 3    ReLU
concat with conv 3    32³ × 32       32³ × 64       -            -
conv 6                32³ × 64       32³ × 32       3 × 3 × 3    ReLU
upsample 2            32³ × 32       64³ × 32       2 × 2 × 2    -
conv 7                64³ × 32       64³ × 16       3 × 3 × 3    ReLU
concat with conv 2    64³ × 16       64³ × 32       -            -
conv 8                64³ × 32       64³ × 16       3 × 3 × 3    ReLU
upsample 3            64³ × 16       128³ × 16      2 × 2 × 2    -
conv 9                128³ × 16      128³ × …       3 × 3 × 3    ReLU
conv 10               128³ × …       128³ × 1       …            …