Real-time 3D Tracking of Articulated Tools for Robotic Surgery

Menglong Ye, Lin Zhang, Stamatia Giannarou and Guang-Zhong Yang

The Hamlyn Centre for Robotic Surgery, Imperial College London, UK
[email protected]
Abstract.
In robotic surgery, tool tracking is important for providing safe tool-tissue interaction and facilitating surgical skills assessment. Despite recent advances in tool tracking, existing approaches are faced with major difficulties in real-time tracking of articulated tools. Most algorithms are tailored for offline processing with pre-recorded videos. In this paper, we propose a real-time 3D tracking method for articulated tools in robotic surgery. The proposed method is based on the CAD model of the tools as well as robot kinematics to generate online part-based templates for efficient 2D matching and 3D pose estimation. A robust verification approach is incorporated to reject outliers in 2D detections, which is then followed by fusing inliers with robot kinematic readings for 3D pose estimation of the tool. The proposed method has been validated with phantom data, as well as ex vivo and in vivo experiments. The results clearly demonstrate the performance advantage of the proposed method when compared to the state-of-the-art.
1 Introduction

Recent advances in surgical robots have significantly improved the dexterity of surgeons, along with providing enhanced 3D vision and motion scaling. Surgical robots such as the da Vinci® (Intuitive Surgical, Inc., CA) platform allow the augmentation of preoperative data to enhance intraoperative surgical guidance. In robotic surgery, tracking of surgical tools is an important task for applications such as safe tool-tissue interaction and surgical skills assessment.

In the last decade, many approaches for surgical tool tracking have been proposed. The majority of these methods have focused on the tracking of laparoscopic rigid tools, including using template matching [1] and combining colour segmentation with prior geometrical tool models [2]. In [3], the 3D poses of rigid robotic tools were estimated by combining random forests with level-set segmentation. More recently, tracking of articulated tools has also attracted a lot of interest. For example, Pezzementi et al. [4] tracked articulated tools based on an offline synthetic model using colour and texture features. The CAD model of a robotic tool was used by Reiter et al. [5] to generate virtual templates using the robot kinematics. However, thousands of templates were created by configuring the original tool kinematics, leading to time-demanding rendering and template matching. In [6], boosted trees were used to learn predefined parts of surgical tools.
Similarly, regression forests have been employed in [7] to estimate the 2D pose of articulated tools. In [8], the 3D locations of robotic tools, estimated with offline-trained random forests, were fused with robot kinematics to recover the 3D poses of the tools. Whilst there has been significant progress on surgical tool detection and tracking, none of the existing approaches have thus far achieved real-time 3D tracking of articulated robotic tools.

In this paper, we propose a framework for real-time 3D tracking of articulated tools in robotic surgery. Similar to [5], CAD models are used to generate virtual tools, and their contour templates are extracted online based on the kinematic readings of the robot. In our work, tool detection on the real camera image is performed by matching the individual parts of the tools rather than the whole instrument. This enables our method to deal with the changing pose of the tools due to articulated motion. Another novel aspect of the proposed framework is a robust verification approach based on 2D geometrical context, which is used to reject outlier template matches of the tool parts. The inlier 2D detections are then used for 3D pose estimation via the Extended Kalman Filter (EKF). Experiments have been conducted on phantom, ex vivo and in vivo video data, and the results verify that our approach outperforms the state-of-the-art.

Fig. 1. (a) Illustration of transformations; (b) Virtual rendering example of the large needle driver and its keypoint locations; (c) Extracted gradient orientations from virtual rendering. The orientations are quantised and colour-coded as shown in the pie chart.
2 Method

Our proposed framework includes three main components. The first component is a virtual tool renderer that generates part-based templates online. After template matching, the second component performs verification to extract the inlier 2D detections. These 2D detections are finally fused with kinematic data for 3D tool pose estimation. Our framework is implemented on the da Vinci® robot, and the robot kinematics are retrieved using the da Vinci® Research Kit (dVRK) [9].
2.1 Online Part-Based Template Generation

In this work, to deal with the changing pose of articulated surgical tools, tool detection is performed by matching individual parts of the tools rather than the entire instrument, similar to [6]. To avoid the limitations of offline training, we propose to generate the part models on-the-fly, such that the changing appearance of the tool parts can be dynamically adapted.

To generate the part-based models online, the CAD model of the tool and the robot kinematics are used to render the tool in a virtual environment. The pose of a tool in the robot base frame $B$ can be denoted as the transformation $T_{BE}$, where $E$ is the end-effector coordinate frame shown in Fig. 1(a). $T_{BE}$ can be retrieved from the dVRK (kinematics) to provide the 3D coordinates of the tool in $B$. Thus, to set the virtual view to be the same as the laparoscopic view, a standard hand-eye calibration [10] is used to estimate the transformation $T_{CB}$ from $B$ to the camera coordinate frame $C$. However, errors in the calibration can affect the accuracy of $T_{CB}$, resulting in a 3D pose offset between the virtual tool and the real tool in $C$. In this regard, we represent the transformation found from the calibration as $T_{C^-B}$, where $C^-$ is the camera coordinate frame that includes the accumulated calibration errors. A correction transformation, denoted $T_{CC^-}$, can therefore be introduced to compensate for the calibration errors.

In this work, we have defined $n = 14$ keypoints $P^B = \{p_i^B\}_{i=1}^{n}$ on the tool, taking the large needle driver as an example. The keypoints include the points shown in Fig. 1(b) and those on the symmetric side of the tool. These keypoints represent the skeleton of the tool, and also apply to other da Vinci® tools. At time $t$, an image $I_t$ is obtained from the laparoscopic camera. The keypoints can be projected into $I_t$ with the camera intrinsic matrix $K$ via

$$P_t^I = \frac{1}{s} K \, T_{CC^-} T_{C^-B} P_t^B. \qquad (1)$$

Here, $s$ is the scaling factor that normalises the depth to the image plane.
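As a concrete reading of Eq. 1, the projection can be sketched as follows, assuming Eigen for the linear algebra; the function and variable names are ours, not from the paper.

```cpp
#include <Eigen/Dense>
#include <vector>

// Project 3D keypoints expressed in the robot base frame B into the image,
// following Eq. 1: p_img = (1/s) * K * T_{CC^-} * T_{C^-B} * p_B.
// T_ccm : online-estimated correction transform T_{CC^-} (4x4)
// T_cmb : hand-eye calibration result T_{C^-B} (4x4)
// K     : 3x3 camera intrinsic matrix
std::vector<Eigen::Vector2d> projectKeypoints(
    const Eigen::Matrix4d& T_ccm,
    const Eigen::Matrix4d& T_cmb,
    const Eigen::Matrix3d& K,
    const std::vector<Eigen::Vector4d>& pointsB)  // homogeneous [x y z 1]
{
  std::vector<Eigen::Vector2d> pixels;
  pixels.reserve(pointsB.size());
  for (const auto& pB : pointsB) {
    // Transform the keypoint into the (corrected) camera frame C.
    Eigen::Vector4d pC = T_ccm * T_cmb * pB;
    // Perspective projection; the depth acts as the scaling factor s.
    Eigen::Vector3d uv = K * pC.head<3>();
    pixels.emplace_back(uv.x() / uv.z(), uv.y() / uv.z());
  }
  return pixels;
}
```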
Fig. 2. (a) An example of part-based templates; (b) Quantised gradient orientations from a camera image; (c) Part-based template matching results of tool parts; (d) and (e) Geometrical context verification; (f) Inlier detections obtained after verification.

To represent the appearance of the tool parts, the Quantised Gradient Orientations (QGO) approach [11] is used (see Fig. 1(c)). Bounding boxes are created to represent the part-based models, centred at the keypoints in the virtual view (see Fig. 2(a)). The box size for each part is adjusted according to the $z$ coordinate of the keypoint with respect to the virtual camera centre. QGO templates are then extracted inside these bounding boxes. As QGO represents the contour information of the tool, it is robust to cluttered scenes and illumination changes. In addition, a QGO template is represented as a binary code by quantisation, so template matching can be performed efficiently.

Note that not all of the defined parts are visible in the virtual view, as some of them may be occluded. Therefore, templates are only extracted for the $m$ parts facing the camera. To find the correspondences of the tool parts between the virtual and real images, QGO is also computed on the real image (see Fig. 2(b)), and template matching is then performed for each part via sliding windows. Exemplar template matching results are shown in Fig. 2(c).
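The following sketch illustrates the flavour of QGO: orientations are quantised into one-hot bit codes so that template similarity reduces to bitwise tests. It assumes OpenCV for the gradients; the bin count, magnitude threshold and per-pixel scoring are illustrative simplifications of [11], which additionally spreads orientations and linearises memory for speed.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>

// Quantise gradient orientations into 8 bins over [0, 180) degrees,
// one bit per bin, keeping only pixels with a strong gradient magnitude.
cv::Mat quantisedOrientations(const cv::Mat& gray, float magThresh = 30.f) {
  cv::Mat dx, dy;
  cv::Sobel(gray, dx, CV_32F, 1, 0);
  cv::Sobel(gray, dy, CV_32F, 0, 1);
  cv::Mat qgo(gray.size(), CV_8U, cv::Scalar(0));
  for (int r = 0; r < gray.rows; ++r)
    for (int c = 0; c < gray.cols; ++c) {
      float gx = dx.at<float>(r, c), gy = dy.at<float>(r, c);
      if (std::hypot(gx, gy) < magThresh) continue;    // weak edge: no bit set
      float deg = std::atan2(gy, gx) * 180.f / CV_PI;  // [-180, 180]
      if (deg < 0) deg += 180.f;                       // fold direction: [0, 180)
      if (deg >= 180.f) deg -= 180.f;
      int bin = static_cast<int>(deg / 22.5f) & 7;     // 8 bins of 22.5 degrees
      qgo.at<uchar>(r, c) = static_cast<uchar>(1 << bin);
    }
  return qgo;
}

// Similarity of a binary template at offset (x, y): count pixels whose
// quantised orientation matches (bitwise AND of the one-hot codes).
int matchScore(const cv::Mat& qgoImage, const cv::Mat& qgoTemplate, int x, int y) {
  int score = 0;
  for (int r = 0; r < qgoTemplate.rows; ++r)
    for (int c = 0; c < qgoTemplate.cols; ++c)
      if (qgoTemplate.at<uchar>(r, c) & qgoImage.at<uchar>(y + r, x + c))
        ++score;
  return score;
}
```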
2.2 Geometrical Context Verification

To extract the best location estimates of the tool parts, a consensus-based verification approach [12] is included. This approach analyses the geometrical context of the correspondences in a PROgressive SAmple Consensus (PROSAC) scheme [13].

For the visible keypoints $\{p_i\}_{i=1}^{m}$ in the virtual view, we denote their 2D correspondences in the real camera image as $\{p_{i,j}\}_{i=1,j=1}^{m,k}$, where $\{p_{i,j}\}_{j=1}^{k}$ are the top $k$ correspondences of $p_i$ sorted by QGO similarity. In each PROSAC iteration, we select two point pairs from $\{p_{i,j}\}_{i=1,j=1}^{m,k}$ in sorted descending order. These two pairs represent the correspondences of two different parts, e.g., the pair of $p_1$ and $p_{1,a}$, and the pair of $p_2$ and $p_{2,b}$. The two pairs are then used to verify the geometrical context of the tool parts. As shown in Fig. 2(d) and (e), we use two polar grids to indicate the geometrical context of the virtual view and the camera image. The origins of the grids are defined as $p_1$ and $p_{1,a}$, respectively. The major axes of the grids are defined as the vectors from $p_1$ to $p_2$ and from $p_{1,a}$ to $p_{2,b}$, respectively. The scale difference between the two grids is found by comparing $d(p_1, p_2)$ and $d(p_{1,a}, p_{2,b})$, where $d(\cdot,\cdot)$ is the Euclidean distance. We define the angular and radial bin sizes as 30 degrees and 10 pixels (allowing moderate out-of-plane rotation), respectively. With these, the two polar grids can be created and placed on the virtual and camera images. A point pair is determined to be an inlier if its two points are located in the same zone of the two polar grids. If the number of inliers is larger than a predefined value, the geometrical context of the tools in the virtual and the real camera images is considered matched. Otherwise, the above verification is repeated until it reaches the maximum number (100) of iterations. After verification, the inlier point matches are used to estimate the correction transformation $T_{CC^-}$.
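The zone test can be sketched as follows (our reconstruction; the struct and helper names are illustrative). Each point is binned by its angle relative to the grid's major axis and by its radius, with the radial bin scaled by the grid's scale ratio; two corresponding points are consistent when their bins agree.

```cpp
#include <cmath>
#include <utility>

constexpr double kPi = 3.14159265358979323846;

struct Pt { double x, y; };

// Zone of point p in a polar grid with the given origin and major-axis angle.
// Angular bins are 30 degrees; radial bins are 10 pixels, multiplied by the
// scale ratio between the virtual and the real grid.
std::pair<int, int> polarZone(const Pt& p, const Pt& origin,
                              double axisAngle, double scale) {
  double dx = p.x - origin.x, dy = p.y - origin.y;
  double radius = std::hypot(dx, dy);
  double angle = std::atan2(dy, dx) - axisAngle;   // relative to the major axis
  while (angle < 0) angle += 2.0 * kPi;            // wrap to [0, 2*pi)
  while (angle >= 2.0 * kPi) angle -= 2.0 * kPi;
  int angBin = static_cast<int>(angle / (kPi / 6.0));      // 30-degree sectors
  int radBin = static_cast<int>(radius / (10.0 * scale));  // 10-pixel rings
  return {angBin, radBin};
}

// A correspondence (pVirtual, pReal) is an inlier when both points fall in
// the same zone of their respective grids.
bool sameZone(const Pt& pV, const Pt& oV, double axisV,
              const Pt& pR, const Pt& oR, double axisR, double scale) {
  return polarZone(pV, oV, axisV, 1.0) == polarZone(pR, oR, axisR, scale);
}
```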
2.3 3D Pose Estimation

We now describe how to combine the 2D detections with the 3D kinematic data to estimate $T_{CC^-}$. Here, the transformation matrix is represented as a vector of rotation angles and translations along each axis: $x = [\theta_x, \theta_y, \theta_z, r_x, r_y, r_z]^T$. We denote the $n$ observations (corresponding to the tool parts defined in Section 2.1) as $z = [u_1, v_1, \ldots, u_n, v_n]^T$, where $u_i$ and $v_i$ are the part locations in the camera image. To estimate $x$ on-the-fly, the EKF is adopted to find $x_t$ given the observations $z_t$ at time $t$. The process model is defined as $x_t = I x_{t-1} + w_t$, where $w_t$ is the process noise at time $t$ and the transition function $I$ is the identity matrix. The measurement model is defined as $z_t = h(x_t) + v_t$, with $v_t$ being the measurement noise. $h(\cdot)$ is the nonlinear function of $[\theta_x, \theta_y, \theta_z, r_x, r_y, r_z]^T$:

$$h(x_t) = \frac{1}{s} K f(x_t) \, T_{C^-B} P_t^B, \qquad (2)$$

which is derived according to Eq. 1. Note that $f(\cdot)$ is the function that composes the Euler angles and translation (in $x_t$) into the $4 \times 4$ matrix $T_{CC^-}$. As Eq. 2 is nonlinear, we derive the Jacobian matrix $J$ of $h(\cdot)$ with respect to each element of $x_t$. For iteration $t$, the predicted state $x_t^-$ is calculated and used to predict the measurement $z_t^-$, and also to calculate $J_t$. In addition, $z_t$ is obtained from the inlier detections (Section 2.2) and is used, along with $J_t$ and $x_t^-$, to derive the corrected state $x_t^+$, which contains the corrected angles and translations. These are finally used to compose the transformation $T_{CC^-}$ at time $t$, and thus the 3D pose of the tool in $C$ is obtained as $T_{CE} = T_{CC^-} T_{C^-B} T_{BE}$. Note that if no 2D detections are available at time $t$, the previous $T_{CC^-}$ is used.
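A minimal sketch of this EKF step, assuming Eigen; for brevity it approximates the Jacobian by forward differences, whereas the paper derives $J$ analytically from Eq. 2.

```cpp
#include <Eigen/Dense>
#include <functional>

// State x = [theta_x, theta_y, theta_z, r_x, r_y, r_z]^T parameterising T_{CC^-};
// measurement z stacks the 2D part detections [u_1, v_1, ..., u_n, v_n]^T.
struct EKF {
  Eigen::VectorXd x;     // state estimate
  Eigen::MatrixXd P;     // state covariance
  Eigen::MatrixXd Q, R;  // process and measurement noise covariances

  // Process model x_t = I * x_{t-1} + w_t: the prediction only inflates P.
  void predict() { P += Q; }

  void update(const Eigen::VectorXd& z,
              const std::function<Eigen::VectorXd(const Eigen::VectorXd&)>& h) {
    Eigen::VectorXd zPred = h(x);  // predicted measurement z_t^-
    // Forward-difference Jacobian J = dh/dx (analytical in the paper).
    const double eps = 1e-6;
    Eigen::MatrixXd J(z.size(), x.size());
    for (int i = 0; i < x.size(); ++i) {
      Eigen::VectorXd xp = x;
      xp(i) += eps;
      J.col(i) = (h(xp) - zPred) / eps;
    }
    Eigen::MatrixXd S = J * P * J.transpose() + R;        // innovation covariance
    Eigen::MatrixXd K = P * J.transpose() * S.inverse();  // Kalman gain
    x += K * (z - zPred);                                 // corrected state x_t^+
    P = (Eigen::MatrixXd::Identity(x.size(), x.size()) - K * J) * P;
  }
};
```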
At the beginning of the tracking process, an initial estimate of $T_{CC^-}$ is required to initialise the EKF and to correct the virtual view to be as close as possible to the real view. Therefore, template matching is performed at multiple scales and rotations for initialisation; after initialisation, only one template per tool part is needed for matching. The Efficient Perspective-n-Point (EPnP) algorithm [14] is applied to estimate $T_{CC^-}$ from the 2D-3D correspondences given by the tool parts matched between the virtual and real views and their 3D positions from the kinematic data.
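EPnP is available in common vision libraries; the sketch below assumes OpenCV (cv::solvePnP with the SOLVEPNP_EPNP flag), which is our assumption, as the paper does not name a library. The recovered rvec/tvec can then be composed into $T_{CC^-}$ with cv::Rodrigues.

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Estimate the initial correction transform T_{CC^-} from 2D-3D part
// correspondences using EPnP. points3d are the part keypoints expressed in
// the (uncorrected) camera frame C^- via T_{C^-B} and the kinematics;
// points2d are the matched part centres in the camera image.
bool initialisePose(const std::vector<cv::Point3f>& points3d,
                    const std::vector<cv::Point2f>& points2d,
                    const cv::Mat& K, cv::Mat& rvec, cv::Mat& tvec) {
  if (points3d.size() < 4) return false;  // EPnP needs at least 4 points
  return cv::solvePnP(points3d, points2d, K, cv::noArray(),
                      rvec, tvec, false, cv::SOLVEPNP_EPNP);
}
```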
The proposed framework can easily be extended to track multiple tools. This only requires generating part-based templates for all the tools in the same graphic rendering and then following the proposed framework. As template matching is performed on binarised templates, the computational speed is not deteriorated.

3 Experiments and Results

The proposed framework has been implemented on an HP workstation with an Intel Xeon E5-2643v3 CPU. Stereo videos are captured at 25 Hz. In our C++ implementation, we have separated the part-based rendering and the image processing into two CPU threads, enabling our framework to run in real-time. The rendering is implemented with VTK and OpenGL, and its rate is fixed at 25 Hz. As our framework only requires monocular images for 3D pose estimation, only the images from the left camera were processed. For an image size of 720x576, the processing speed ranges from 25 to 36 Hz.
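The two-thread split can be sketched as follows; this is our minimal reconstruction, assuming a mutex-guarded buffer of the latest rendered templates, with stub types and functions (Template, Frame, renderPartTemplates, grabFrame, matchAndTrack) standing in for the actual VTK/OpenGL and tracking code.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

struct Template { /* part id, bounding box, binary QGO code ... */ };
struct Frame { /* camera image ... */ };

// Stubs standing in for the real rendering/capture/tracking routines.
std::vector<Template> renderPartTemplates() { return {}; }  // VTK/OpenGL, ~25 Hz
Frame grabFrame() { return {}; }                            // laparoscopic capture
void matchAndTrack(const Frame&, const std::vector<Template>&) {}

std::mutex mtx;
std::vector<Template> latestTemplates;  // shared between the two threads
std::atomic<bool> running{true};

// Rendering thread: keeps the part-based templates up to date from kinematics.
void renderLoop() {
  while (running) {
    auto t = renderPartTemplates();
    std::lock_guard<std::mutex> lock(mtx);
    latestTemplates = std::move(t);
  }
}

// Processing thread: consumes the most recent templates for each camera frame.
void processLoop() {
  while (running) {
    Frame f = grabFrame();
    std::vector<Template> t;
    {
      std::lock_guard<std::mutex> lock(mtx);
      t = latestTemplates;
    }
    matchAndTrack(f, t);  // matching, verification and EKF update (Sec. 2)
  }
}
```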
Fig. 3. (a) and (b) Detection rate results of our online template matching and GradBoost [6] on two single-tool tracking sequences (see supplementary videos); (c) Overall rotation angle errors (mean ± std) along each axis on Seqs. 1-6.

Table 1. Translation and rotation errors (mean ± std) on Seqs. 1-6, and tracking accuracies with run-time speed in Hz (in brackets) compared to [8] on their dataset (Seqs. 7-12).
[Table 1: most numeric entries are not recoverable from the source text. The legible tracking-accuracy entries are: Seq. 10, our method 96.84% (28 Hz) vs. [8] 97.75% (1 Hz); Seq. 11, our method 96.57% (36 Hz) vs. [8] 98.76% (2 Hz).]

For initialisation, template matching is performed with additional scale ratios and rotations of ±15 degrees, which does not deteriorate the run-time speed thanks to template binarisation. Our method was compared to tracking approaches for articulated tools, namely [6] and [8].

To demonstrate the effectiveness of the online part-based templates for tool detection, we compared our approach to the method proposed in [6], which is based on boosted trees for 2D tool part detection. For ease of training data generation, a subset of the tool parts was evaluated in this comparison, namely the front pin, logo and rear pin. The classifier was trained with 6000 samples for each part. Since [6] applies to single-tool tracking only, the trained classifier and our approach were tested on two single-tool sequences (1677 and 1732 images), for which ground truth data was manually labelled. A part detection is determined to be correct if the distance between its centre and the ground truth is smaller than a threshold. To evaluate the results under different accuracy requirements, the threshold was sequentially set to 5, 10 and 20 pixels. The detection rates of the methods were calculated among the top N detections. As shown in Fig. 3(a-b), our method significantly outperforms [6] under all accuracy requirements. This is because our templates are generated adaptively online.
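Our reading of this evaluation protocol can be sketched as follows (the names are ours, and treating a frame as a hit when any of its top-N detections is within the threshold is an assumption about how the rates were aggregated):

```cpp
#include <cmath>
#include <vector>

struct Det { double x, y; };  // detection or ground-truth centre

// A frame counts as a hit if any of its top-N detections lies within
// `threshold` pixels (5, 10 or 20 in the paper) of the ground truth.
double detectionRate(const std::vector<std::vector<Det>>& topNPerFrame,
                     const std::vector<Det>& groundTruth,
                     double threshold) {
  int hits = 0;
  for (size_t f = 0; f < topNPerFrame.size(); ++f) {
    for (const Det& d : topNPerFrame[f]) {
      if (std::hypot(d.x - groundTruth[f].x, d.y - groundTruth[f].y) <= threshold) {
        ++hits;
        break;  // one correct detection among the top N is enough
      }
    }
  }
  return topNPerFrame.empty()
             ? 0.0
             : static_cast<double>(hits) / topNPerFrame.size();
}
```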
To validate the accuracy of the 3D pose estimation, we manually labelled the centre locations of the tool parts on both left and right camera images of phantom (Seqs. 1-3) and ex vivo (Seqs. 4-6) video data to generate the 3D ground truth. The tool pose errors are then obtained as the relative pose between the estimated pose and the ground truth. Our approach was also compared to the 3D poses estimated by performing EPnP on every image in which the tool parts are detected. However, EPnP generated unstable results and showed inferior performance to our approach, as shown in Table 1 and Fig. 3(c).

Fig. 4. Qualitative results. (a-c) Phantom data (Seqs. 1-3); (d) ex vivo ovine data (Seq. 4); (e) and (g) ex vivo porcine data (Seqs. 9 and 12); (f) in vivo porcine data (Seq. 11). Red lines indicate the tool kinematics, and green lines indicate the tracking results of our framework, with 2D detections shown as coloured dots.

We have also compared our framework to the method proposed in [8]. As their code is not publicly available, we ran our framework on the same ex vivo (Seqs. 7-10, 12) and in vivo (Seq. 11) data used in [8]. Example results are shown in Fig. 4(e-g). For a fair comparison, we evaluated the tracking accuracy as explained in their work, and present both our results and theirs, as reported in their paper, in Table 1. Although our framework achieved only slightly better accuracy than their approach, our processing speed is significantly faster, ranging from 25-36 Hz, whereas theirs is approximately 1-2 Hz as reported in [8]. As shown in Figs. 4(b) and (d), our proposed method is robust to occlusions caused by tool intersections and to specularities, thanks to the fusion of 2D part detections and kinematics. In addition, our framework provides accurate tracking even when $T_{C^-B}$ becomes invalid after the laparoscope has moved (Fig. 4(c), Seq. 3). This is because $T_{CC^-}$ is estimated online using the 2D part detections. All processed videos are available at https://youtu.be/oqw_9Xp_qsw.

4 Conclusions

In this paper, we have proposed a real-time framework for 3D tracking of articulated tools in robotic surgery. Online part-based templates are generated using the tool CAD models and robot kinematics, such that efficient 2D detection can be performed in the camera image. To reject outliers, a robust verification method based on 2D geometrical context is included. The inlier 2D detections are finally fused with the robot kinematics for 3D pose estimation. Our framework runs in real-time for multi-tool tracking, and can thus be used for
imposing dynamic active constraints and for motion analysis. The results of phantom, ex vivo and in vivo experiments demonstrate that our approach achieves accurate 3D tracking and outperforms the current state-of-the-art.
Acknowledgements.
We would like to thank Simon DiMaio from Intuitive Surgical for providing the tool CAD models, and Austin Reiter from Johns Hopkins University for assisting with the method comparisons.
References
1. Sznitman, R., Ali, K., Richa, R., Taylor, R.H., Hager, G.D., Fua, P.: Data-driven visual tracking in retinal microsurgery. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012, vol. 7511, pp. 568-575. Springer, Heidelberg (2012)
2. Wolf, R., Duchateau, J., Cinquin, P., Voros, S.: 3D tracking of laparoscopic instruments using statistical and geometric modeling. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011, vol. 6891, pp. 203-210. Springer, Heidelberg (2011)
3. Allan, M., Chang, P.L., Ourselin, S., Hawkes, D.J., Sridhar, A., Kelly, J., Stoyanov, D.: Image based surgical instrument pose estimation with multi-class labelling and optical flow. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015, vol. 9349, pp. 331-338. Springer, Heidelberg (2015)
4. Pezzementi, Z., Voros, S., Hager, G.: Articulated object tracking by rendering consistent appearance parts. In: ICRA, pp. 3940-3947 (2009)
5. Reiter, A., Allen, P.K., Zhao, T.: Articulated surgical tool detection using virtually-rendered templates. In: CARS (2012)
6. Sznitman, R., Becker, C., Fua, P.: Fast part-based classification for instrument detection in minimally invasive surgery. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014, vol. 8674, pp. 692-699. Springer, Heidelberg (2014)
7. Rieke, N., Tan, D.J., Alsheakhali, M., Tombari, F., di San Filippo, C.A., Belagiannis, V., Eslami, A., Navab, N.: Surgical tool tracking and pose estimation in retinal microsurgery. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015, vol. 9349, pp. 266-273. Springer, Heidelberg (2015)
8. Reiter, A., Allen, P.K., Zhao, T.: Appearance learning for 3D tracking of robotic surgical tools. Int. J. Rob. Res. 33(2), 342-356 (2014)
9. Kazanzides, P., Chen, Z., Deguet, A., Fischer, G., Taylor, R., DiMaio, S.: An open-source research kit for the da Vinci® surgical system. In: ICRA (2014)