Temporal Segmentation of Surgical Sub-tasks through Deep Learning with Multiple Data Sources
Yidan Qin, Sahba Aghajani Pedram, Seyedshams Feyzabadi, Max Allan, A. Jonathan McLeod, Joel W. Burdick, Mahdi Azizian

Intuitive Surgical Inc., 1020 Kifer Road, Sunnyvale, CA 94086, USA
Department of Mechanical and Civil Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
Department of Mechanical and Aerospace Engineering, University of California, Los Angeles, Los Angeles, CA 90095, USA
Emails: [email protected], [email protected]
Abstract
Many tasks in robot-assisted surgery (RAS) can be represented by finite-state machines (FSMs), where each state represents either an action (such as picking up a needle) or an observation (such as bleeding). A crucial step towards the automation of such surgical tasks is the temporal perception of the current surgical scene, which requires real-time estimation of the states in the FSMs. The objective of this work is to estimate the current state of the surgical task based on the actions performed or the events that have occurred as the task progresses. We propose Fusion-KVE, a unified surgical state estimation model that incorporates multiple data sources, including kinematics, vision, and system events. Additionally, we examine the strengths and weaknesses of different state estimation models in segmenting states with different representative features or levels of granularity. We evaluate our model on the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS), as well as on a more complex dataset involving robotic intra-operative ultrasound (RIOUS) imaging, created using the da Vinci® Xi surgical system. Our model achieves a frame-wise state estimation accuracy of up to 89.4%, improving on state-of-the-art surgical state estimation models on both the JIGSAWS suturing dataset and our RIOUS dataset.
In the field of surgical robotics research, the development of autonomous and semi-autonomous robotic surgical systems is among the most popular emerging topics [1]. Such systems allow RAS to go beyond teleoperation and assist surgeons in many ways, including autonomous procedures, user interface (UI) integration, and the provision of advisory information [2, 3]. One prerequisite for these applications is the perception of the current state of the surgical task being performed. These states include the actions performed and the changes in the environment observed by the system. For instance, during suturing, the system needs to know whether the needle is visible in the endoscopic view before providing more advanced applications such as advising on the needle position or autonomous suturing. Additionally, the recognition of higher-level surgical states, or surgical phases, has a wide range of applications in post-operative analysis and surgical skill evaluation [4].

Figure 1: Sample data from JIGSAWS (left) and the RIOUS dataset (right). The bottom row shows a sample state sequence for each task, where each color denotes a state label.

The recognition and segmentation of the robot's current action is one of the main pillars of the surgical state estimation process. Many models have been developed for the segmentation and recognition of fine-grained surgical actions that last for a few seconds, such as cutting [5-8], as well as surgical phases that last for up to 10 minutes, such as bladder dissection [9-11]. The recognition of fine-grained surgical states is particularly challenging due to their short duration and frequent state transitions. Most work in this field has focused on developing models using only one type of input data, such as kinematics or vision. Some studies have focused on learning from robot kinematics, using models such as Hidden Markov Models [12-14] and Conditional Random Fields (CRF) [15]. Zappella et al. proposed methods of modeling surgical video clips for single-action classification [16]. The Transition State Clustering (TSC) and Gaussian Mixture Model methods provide unsupervised or weakly-supervised surgical trajectory segmentation [17, 18]. More recently, deep learning methods have come to define the state of the art, such as Temporal Convolutional Networks (TCN) [19], Time Delay Neural Networks (TDNN) [7], and Long Short-Term Memory (LSTM) networks [6, 20]. Instead of using robot kinematics data, vision-based methods have been developed based on Convolutional Neural Networks (CNNs). Vision-based models in RAS use the vision data that is readily available from the endoscopic view. Concatenating spatial features on the temporal axis with spatio-temporal CNNs (ST-CNN) has been explored in [21]. Jin et al. introduced the post-processing of predictions using prior knowledge inference [22]. TCNs can also be applied to vision data for action segmentation, taking the encoding of a spatial CNN as input [19]. Ding et al.
proposed a hybrid TCN-BiLSTM network [23]. A limitation shared by single-input action recognition models is the large discrepancy among states' representative vision and kinematics features, meaning that different states are best distinguished by different types of input data.

Compared to action recognition datasets such as ActivityNet [24], RAS data enjoys the luxury of having synchronized vision, system events, and robot kinematics data. Previous attempts to incorporate multiple types of input data have focused on using derived values as additional variables in a single model. Lea et al. measured two scene-based features in JIGSAWS as additional variables to the robot kinematics data in their Latent Convolutional Skip-Chain CRF (LC-SC-CRF) model [5]. Zia et al. collected robot kinematics and system events data from RAS to perform surgical phase recognition [10]. While these attempts have been shown to improve model accuracy, to the best of the authors' knowledge, there is not yet a unified method that incorporates multiple data sources directly for fine-grained surgical state estimation.

In addition to robot actions, the finite-state machine (FSM) of a surgical task should also include the environmental changes observed by the robot. Non-action states are omitted in popular surgical action segmentation datasets such as JIGSAWS [25] and Cholec80 [26]; however, they are important for applications such as autonomous procedures. They are also challenging to recognize, as some non-action states may not be well reflected in a single-source dataset.
Contributions: In this paper, we propose a unified approach to fine-grained state estimation in RAS using multiple types of input data collected from the da Vinci® surgical system. The input data we use includes the endoscopic video, the robot kinematics, and the system events of the surgical system. Our goal is to achieve real-time fine-grained state estimation of the surgical task being performed. To re-emphasize, we refer to fine-grained states as states that last on the scale of seconds. Our main contributions include:

• Implement a unified state estimation model that incorporates vision-, kinematics-, and event-based state estimation results;
• Improve the frame-wise state estimation accuracy of state-of-the-art methods by up to 11% through the incorporation of multiple sources of data;
• Demonstrate the advantages of a multi-input state estimation model through the comparison of single-input models' performances in recognizing states with different representative features or levels of granularity in a complex and realistic surgical task.

We evaluated the performance of our model using JIGSAWS and a new RIOUS (robotic intra-operative ultrasound) dataset we developed. RIOUS consists of phantom and porcine experiments on a da Vinci® Xi surgical system (Fig. 1). Compared to JIGSAWS, which is relatively simple as it only contains dry-lab tasks with no camera motion and no non-action annotations, the RIOUS dataset better resembles real-world surgical tasks: it contains dry-lab, cadaveric, and in-vivo experiments, as well as camera movements and annotations of both action and non-action states. We evaluated the accuracy of multiple state estimation models in the recognition of states with different representative features. Each model has its respective strengths and weaknesses, which supports the superior performance of our unified approach to state estimation.
Figure 2: Our model contains four single-input state estimation models receiving three types of input data. A fusion model that receives the individual models' outputs makes the comprehensive state estimation result.
Figure 3: The encoder-decoder TCN network that hierarchically models vision or kinematics data to states.

Our proposed model (Fig. 2) consists of four single-source state estimation models based on vision, kinematics, and system events, respectively. The outputs are fed to a fusion model that makes a comprehensive inference. In this section, we discuss each individual model as well as the fusion model that effectively combines their outputs. All in-vivo experiments were performed on porcine models under an Institutional Animal Care and Use Committee (IACUC) approved protocol.

Vision-based Method
The vision-based state estimation model is a CNN-TCN model [19] that takes the endoscopic camera stream as input in the form of a series of video frames. The CNN architecture we deploy is VGG16 [27]. The spatial CNN component serves as a feature extractor and maps each 224 × 224 × 3 frame to a feature vector X ∈ R^N, where N is the number of features. X is then fed to the TCN component, which is an encoder-decoder network (Fig. 3). At time step t, the input vector is denoted by X_t for 0 < t ≤ T. For the l-th layer (l ∈ {1, ..., L}), F_l filters of kernel size k are applied along the temporal axis to capture the temporal progress of the input data. T_l is the number of time steps in the l-th layer. In each layer, the filters are parameterized by a weight tensor W^{(l)} ∈ R^{F_l × k × F_{l-1}} and a bias vector b^{(l)} ∈ R^{F_l}. The raw output activation vector for the l-th layer at time t, E^{(l)}_t, is calculated from a subsection of the normalized activation matrix of the previous layer, \hat{E}^{(l-1)} ∈ R^{F_{l-1} × T_{l-1}}:

E^{(l)}_t = f(W^{(l)} * \hat{E}^{(l-1)}_{t:t+k-1} + b^{(l)})    (1)

where f is a Rectified Linear Unit (ReLU) [28]. A max pooling layer of stride 2 is applied after each convolutional layer in the encoder part, such that T_l = T_{l-1}/2. The pooling layer is followed by a normalization layer, which normalizes the l-th activation vector at time t, E^{(l)}_t, using its highest value:

\hat{E}^{(l)}_t = E^{(l)}_t / (max(E^{(l)}_t) + ε)    (2)

where ε is a small constant that ensures a non-zero denominator, and \hat{E}^{(l)}_t is the normalized output activation vector. In the decoder part, an upsampling layer that repeats each data point twice precedes each temporal convolution and normalization layer. The output vector \hat{D}^{(l)}_t is calculated and normalized in the same manner as in the encoder part. The state estimation at frame t is made by a time-distributed fully-connected layer with softmax to normalize the logits.

Implementation details: The training of the CNN feature extractor starts with the VGG16 network initialized with ImageNet pre-trained weights. We fine-tune the weights by training with one fully-connected layer on top of the VGG16 model for state estimation. The feature vector is X_t ∈ R^N with N = 1024. We use L = 3 layers, with a kernel size k corresponding to 6 s for the JIGSAWS suturing dataset and 3 s for the RIOUS dataset. For training, we use the cross-entropy loss with the Adam optimization algorithm [29]. For our application of real-time state estimation, the model can only use information from the current and preceding time steps; therefore, for the RIOUS dataset, we assume a causal setting and pad the temporal input with k zeros on the left side before the convolutional layer and crop k data points on the right side afterwards.
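The sketch below illustrates the encoder-decoder TCN just described, with Eq. (1) as a 1-D convolution followed by ReLU, Eq. (2) as channel-wise max normalization, and causal left-padding for the real-time setting. The filter counts, kernel size, and class names are placeholder assumptions, not the paper's exact configuration.

```python
# Minimal PyTorch sketch of the encoder-decoder TCN (Fig. 3), assuming placeholder layer widths.
import torch
import torch.nn as nn
import torch.nn.functional as F


def channel_max_norm(x, eps=1e-5):
    # x: (batch, channels, time). Divide each frame's activation vector by its
    # largest component; eps keeps the denominator non-zero (Eq. (2)).
    return x / (x.max(dim=1, keepdim=True).values + eps)


class EncoderDecoderTCN(nn.Module):
    def __init__(self, n_features=1024, n_states=9, filters=(64, 96, 128), k=25, causal=True):
        super().__init__()
        self.k, self.causal = k, causal
        enc_chans = (n_features,) + tuple(filters)
        self.enc = nn.ModuleList(
            nn.Conv1d(enc_chans[i], enc_chans[i + 1], k) for i in range(len(filters)))
        dec_chans = tuple(reversed(filters)) + (filters[0],)
        self.dec = nn.ModuleList(
            nn.Conv1d(dec_chans[i], dec_chans[i + 1], k) for i in range(len(filters)))
        self.out = nn.Conv1d(dec_chans[-1], n_states, 1)  # time-distributed linear layer

    def _causal_conv(self, conv, x):
        # Causal setting: pad zeros on the left only, so frame t never sees future frames.
        pad = (self.k - 1, 0) if self.causal else ((self.k - 1) // 2, self.k // 2)
        return F.relu(conv(F.pad(x, pad)))                    # Eq. (1)

    def forward(self, x):                                     # x: (batch, n_features, T)
        T = x.shape[-1]
        for conv in self.enc:                                 # conv -> max pool -> normalize
            x = channel_max_norm(F.max_pool1d(self._causal_conv(conv, x), 2, ceil_mode=True))
        for conv in self.dec:                                 # upsample -> conv -> normalize
            x = channel_max_norm(self._causal_conv(conv, F.interpolate(x, scale_factor=2.0)))
        return self.out(x)[..., :T]                           # per-frame state logits
```

A (batch, N, T) tensor of per-frame VGG16 features goes in and per-frame state logits come out; these would be trained with a frame-wise cross-entropy loss and Adam, as described above.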
Kinematics-based Methods

We incorporate both a forward LSTM and a TCN to better capture states with different durations. The LSTM has no constraint of learning only from nearby data on the temporal axis. Rather, it maintains a memory cell and learns when to read/write/reset the memory [30]. It has been shown that LSTM-based approaches exceed the state-of-the-art performance in longer-duration action recognition [6]. We therefore incorporate both a TCN, which applies temporal convolution to learn local temporal dependencies, and an LSTM, which is able to capture longer-term progressions in the data. Although a bi-directional LSTM model yields a higher accuracy [6], it is not applicable to the real-time state estimation task, where no future data is available; we therefore use a forward LSTM with forget gates and peephole connections [30]. The loss function for the LSTM model is the cross-entropy between the ground truth and the predicted labels, and stochastic gradient descent (SGD) is used to minimize it.
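A minimal sketch of the forward (unidirectional) per-frame LSTM classifier is given below. Note that torch.nn.LSTM has forget gates but no peephole connections, and the hidden size, dropout, and learning rate simply mirror the grid-searched values reported in the implementation details that follow; treat this as an illustration rather than the exact training code.

```python
# Sketch of the forward LSTM state estimator for the kinematics stream.
import torch
import torch.nn as nn


class ForwardLSTMEstimator(nn.Module):
    def __init__(self, n_features=26, n_states=9, hidden=1024, num_layers=1, p_drop=0.5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=num_layers, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(hidden, n_states)

    def forward(self, x):              # x: (batch, T, n_features) kinematics sequence
        h, _ = self.lstm(x)            # forward pass only: frame t depends on frames 1..t
        return self.fc(self.drop(h))   # per-frame state logits, shape (batch, T, n_states)


# Frame-wise cross-entropy minimized with SGD, as described above.
model = ForwardLSTMEstimator()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
criterion = nn.CrossEntropyLoss()
```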
Implementation details: For the LSTM model, we perform a grid search over the initial learning rate (0.5 or 1.0), the number of hidden layers (1 or 2), the number of hidden units per layer (256, 512, 1024, or 2048), and the dropout probability (0 or 0.5). The optimized set of parameters is 1 hidden layer with 1024 hidden units and 0.5 dropout probability for JIGSAWS, and 512 hidden units for the RIOUS dataset. The optimized initial learning rate is 1.0. For the TCN model, we mostly follow the same protocol as the vision-based TCN model described earlier, with L = 2 layers. The feature vector for the kinematics data is X ∈ R^N, where N = 26 for the JIGSAWS suturing dataset and N = 19 for the RIOUS dataset.

Event-based Method

We experimented with various classification algorithms, including the AdaBoost classifier, decision tree, Random Forest (RF), Ridge classifier, Support Vector Machine (SVM), and SGD [31]. We performed a grid search over the parameters of each model and evaluated each model's performance using the Area Under the Receiver Operating Characteristic Curve (ROC AUC) score [32]. The evaluation process was iterated 200 times, with early stopping when the score improvement fell below a small threshold. At each iteration, we recorded the best-performing model with replacement. The top three models that were selected most frequently are included, and the final state estimation result is the mean of each model's prediction. The three top-performing models for our RIOUS dataset are RF (n_trees = 500, min_samples_split = 2), SVM (penalty = L2, kernel = linear, C = 2, multi_class = crammer_singer), and RF (n_trees = 400, min_samples_split = 3).
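The event-based estimator can be sketched as follows: candidate scikit-learn classifiers are scored with ROC AUC, the top performers are kept, and their per-state probabilities are averaged frame by frame. The candidate list, hyperparameters, and the single train/validation split below are illustrative stand-ins for the iterated grid search described above.

```python
# Sketch of the event-based classifier selection and averaging. Assumes integer
# state labels and binary event features; hyperparameters are placeholders.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def fit_event_models(X, y, n_keep=3, seed=0):
    """X: (n_frames, n_events) binary event features; y: (n_frames,) state labels."""
    candidates = [
        RandomForestClassifier(n_estimators=500, min_samples_split=2, random_state=seed),
        RandomForestClassifier(n_estimators=400, min_samples_split=3, random_state=seed),
        SVC(kernel="linear", C=2, probability=True, random_state=seed),
        AdaBoostClassifier(random_state=seed),
        DecisionTreeClassifier(random_state=seed),
    ]
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=seed)
    scored = []
    for clf in candidates:
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_va, clf.predict_proba(X_va),
                            multi_class="ovr", labels=clf.classes_)
        scored.append((auc, clf))
    # Keep the top-scoring models; the paper keeps the three selected most often
    # across 200 repetitions, which a single split only approximates.
    return [clf for _, clf in sorted(scored, key=lambda s: s[0], reverse=True)[:n_keep]]


def event_state_probabilities(models, X):
    # Final event-based estimate: the mean of the selected models' per-state probabilities.
    return np.mean([clf.predict_proba(X) for clf in models], axis=0)
```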
Fusion Model

The individual state estimation models have their respective strengths and weaknesses, since different states have inherent features that make them easier to recognize from one type of data than from the other(s). For instance, the 'transferring needle from left to right' state in the JIGSAWS suturing dataset can be distinctly characterized by the sequential opening and closing of the left and right needle drivers, which is captured by the kinematics data.

We therefore use a weighted voting method that incorporates the prediction vectors of all models. At time t, let Y^{(t)} ∈ R^{a × b}, where a is the number of models and b is the total number of possible states in a dataset. The row vector Y^{(t)}_{i,·} is the output vector of the i-th model at time t, with Σ_{j=1}^{b} Y^{(t)}_{i,j} = 1. The overall probability for the system to be in the j-th state at time t, according to the models, is then

P^{(t)}_j = Σ_{i=1}^{a} α_{i,j} Y^{(t)}_{i,j}    (3)

where α_{i,j} is the weighting factor for the i-th model predicting the j-th state. α is calculated from the diagnostic odds ratio (OR) derived from each model's accuracy in recognizing each state in the training data:

α_{i,j} = (TP_{i,j} · TN_{i,j}) / (FP_{i,j} · FN_{i,j} + ε)    (4)
where the (i, j) components of TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives of the i-th model in recognizing the j-th state, respectively, and ε is a small constant that keeps the denominator non-zero. α is normalized proportionally such that Σ_{i=1}^{a} α_{i,j} = 1. The comprehensive estimate of the state at time t, S^{(t)}, is then made by

S^{(t)} = argmax_j P^{(t)}_j.    (5)
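A compact NumPy sketch of the fusion step in Eqs. (3)-(5) is shown below. The per-model, per-state (TP, TN, FP, FN) counts are assumed to have been collected on the training set, and the value of ε is an assumption rather than the paper's exact constant.

```python
# Sketch of the weighted-voting fusion, Eqs. (3)-(5).
import numpy as np


def odds_ratio_weights(tp, tn, fp, fn, eps=1e-5):
    """Each argument: (a_models, b_states) counts from the training data.
    Returns alpha with each column normalized to sum to 1."""
    alpha = (tp * tn) / (fp * fn + eps)          # Eq. (4); eps keeps the denominator non-zero
    return alpha / alpha.sum(axis=0, keepdims=True)


def fuse_state(Y_t, alpha):
    """Y_t: (a_models, b_states) per-model state probabilities at time t."""
    p_t = (alpha * Y_t).sum(axis=0)              # Eq. (3): weighted sum over models
    return int(np.argmax(p_t))                   # Eq. (5): comprehensive state estimate
```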
We used two datasets to evaluate our models: the JIGSAWS and RIOUS datasets (Table 1).

JIGSAWS: The JIGSAWS dataset consists of three types of finely-annotated RAS tasks captured by an endoscope [25]. These tasks are performed in a benchtop setting. The dataset contains synchronized video and kinematics data. We used the suturing dataset of JIGSAWS, which has 39 trials recorded at 30 Hz; each trial is around 1.5 minutes long and contains close to 20 action instances. There are 9 possible actions (Fig. 4a). The kinematics variables we used include the end-effector positions, velocities, and gripper angles of the patient-side manipulators (PSMs). The raw kinematics data uses a rotation matrix to represent the end-effector's orientation. To reduce the data dimensionality, we converted the rotation matrix (9 variables) to Euler angles (3 variables).
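As an illustration of this dimensionality reduction, the snippet below converts the 3×3 end-effector rotation matrix into three Euler angles with SciPy; the 'zyx' convention is an assumption, since the text does not state which Euler convention was used.

```python
# Sketch of the rotation-matrix-to-Euler-angles pre-processing of the kinematics data.
import numpy as np
from scipy.spatial.transform import Rotation


def rotation_to_euler(rot_flat):
    """rot_flat: the 9 rotation-matrix entries of one kinematics sample (row-major)."""
    R = np.asarray(rot_flat, dtype=float).reshape(3, 3)
    return Rotation.from_matrix(R).as_euler("zyx")  # 3 angles (radians) replace 9 variables
```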
RIOUS: To explore the full potential of our unified model, we collected a robotic intra-operative ultrasound (RIOUS) dataset on a da Vinci® Xi surgical system at Intuitive Surgical Inc. (Sunnyvale, CA), in which we performed ultrasound scanning on both phantom and porcine kidneys. In RAS, using a drop-in ultrasound probe to scan organs is a common technique practiced by surgeons to localize underlying anatomical structures, including tumors and vasculature. Real-time state estimation of this task allows us to develop smart-assist technologies for surgeons, as well as enabling supervised autonomous techniques to perform such tasks.

The RIOUS dataset contains 30 trials performed by 5 users with no RAS experience but familiar with the da Vinci® surgical system. Each trial is around 5 minutes long and contains roughly 80 action instances. 26 trials were performed on a phantom kidney in a dry-lab setting and 4 were performed on a porcine kidney in an operating-room setting. The data is annotated with eight states (Fig. 4b). Two of the four arms were used, one holding an endoscope and the other holding a pair of ProGrasp™ forceps. The ultrasound machine used is the bk5000 with a robotic drop-in probe from BK Medical Holding Company, Inc. Both video and kinematics entries were synchronized and down-sampled to 30 Hz. The kinematics variables we used include the instrument's end-effector positions, velocities, and gripper angles, and the endoscope positions. We used the same pre-processing method as for the suturing kinematics data. We also collected six system events from the da Vinci® surgical system: camera follow, instrument follow, surgeon head in/out of the console, master clutch for the hand controller, and two ultrasound probe events. The ultrasound probe events detect whether the probe is being held by the forceps and whether the probe is in contact with the tissue, respectively. All events are represented as binary on/off time series.

We use two evaluation metrics for our state estimation model: the frame-wise state estimation accuracy and the edit distance. The frame-wise accuracy is the percentage of correctly recognized frames, measured without taking temporal consistency into account. This is because the model has knowledge of only the current and preceding data entries in the real-time state estimation setting. The edit distance, or Levenshtein distance [33], measures the number of operations (insertions, deletions, and substitutions) needed to convert the inferred sequence of states at the segment level into the ground truth. We normalize the edit distance following [5, 6]. We evaluate both datasets using
the Leave-One-User-Out setup described in [34]. For the ultrasound imaging task, we assume a causal setting, in which the models only have knowledge of the current and preceding time steps. This mimics the real-time state estimation application of our model, in which the robot cannot foresee the future. For the JIGSAWS suturing task, we assume a non-causal setting for more direct comparison with the reported accuracies of the state-of-the-art methods. The edit distance is therefore only used for JIGSAWS.
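Both metrics can be sketched as follows. The edit score collapses the frame-wise predictions into segments, computes the Levenshtein distance, and normalizes by the longer of the two segment sequences; this is one common normalization consistent with [5, 6], and the exact constant used here is an assumption.

```python
# Sketch of the two evaluation metrics: frame-wise accuracy and normalized edit distance.
import numpy as np


def frame_accuracy(pred, gt):
    pred, gt = np.asarray(pred), np.asarray(gt)
    return 100.0 * np.mean(pred == gt)


def to_segments(labels):
    # Collapse a frame-wise label sequence into its segment-level sequence.
    return [labels[i] for i in range(len(labels)) if i == 0 or labels[i] != labels[i - 1]]


def edit_score(pred, gt):
    a, b = to_segments(pred), to_segments(gt)
    # Standard dynamic-programming Levenshtein distance over segments.
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return 100.0 * (1 - d[-1, -1] / max(len(a), len(b)))
```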
Table 1: Dataset state descriptions and average durations

JIGSAWS Suturing Dataset
Action ID   Description                                    Duration (s)
G1          Reaching for the needle with right hand        2.2
G2          Positioning the tip of the needle              3.4
G3          Pushing needle through the tissue              9.0
G4          Transferring needle from left to right         4.5
G5          Moving to center with needle in grip           3.0
G6          Pulling suture with left hand                  4.8
G7          Orienting needle                               7.7
G8          Using right hand to help tighten suture        3.1
G9          Dropping suture and moving to end points       7.3

RIOUS Dataset
State ID    Description                                    Duration (s)
S1          Probe released, out of endoscopic view         17.3
S2          Probe released, in endoscopic view             10.6
S3          Reaching for probe                             4.1
S4          Grasping probe                                 1.3
S5          Lifting probe up                               2.2
S6          Carrying probe to tissue surface               2.3
S7          Sweeping                                       8.1
S8          Releasing probe                                2.5
Figure 4: FSMs of the JIGSAWS suturing task (a) and the RIOUS imaging task (b). The 0 states are the starting points of the tasks. The states with a double circle are the accepting (final) states. The actions in the JIGSAWS suturing task are labeled as gestures (G) and the states in the RIOUS imaging task are labeled as states (S).
Table 2 compares the performances of state-of-the-art surgical state estimation models with an ablated version of our model (Fusion-KV), consisting of the kinematics- and vision-based models as well as the fusion model. Table 3 compares the performances of our full fusion model (Fusion-KVE) and Fusion-KV with their single-source components on the RIOUS dataset. In Fig. 5, we show example state estimation results of our fusion models and their components for a string of ultrasound imaging sequences. Fig. 6 shows the distributions of the weight matrix α used in our fusion models. A large α_{i,j} indicates that the i-th model performs well in estimating the j-th state during training.

In Table 2, Fusion-KV achieves a frame-wise accuracy of 86.3% and an edit distance score of 87.2 for the JIGSAWS suturing dataset, both improving on the state-of-the-art surgical state estimation models. For the RIOUS dataset (Table 3), Fusion-KVE achieves a frame-wise accuracy of 89.4%, an improvement of 11% compared to the best-performing single-input model. Fusion-KV also achieves a higher accuracy than the single-input models.

A closer observation of the state sequences inferred by the various models, and of their weighting factors as shown in Fig. 5 and Fig. 6, reveals the key aspects of improvement of our method. Although kinematics-based state estimation models generally have a higher frame-wise accuracy than vision-based models (Tables 2 and 3), which are very sensitive to camera movements, each model has its respective strengths and weaknesses. For instance, at around 200 s of the sequence illustrated in Fig. 5, both kinematics-based models show a consecutive block of errors in which they fail to recognize the 'probe released and in endoscopic view' state. Considering the relatively random robotic motions in this state, this is to be expected. The low weighting factors of both kinematics-based models for this state, as shown in Fig. 6, also support this observation. On the other hand, the vision-based model correctly estimates this state, since it is more visually distinguishable. When incorporating both vision- and kinematics-based methods, our fusion models perform weighted voting based on the training accuracy of each model. In this example, the weighting factor for the vision-based model is higher than those of the kinematics-based models; therefore, our fusion models are able to correctly estimate the current state of the surgical task. In other states, where the robotic motions are more consistent but the vision data is less distinguishable, the kinematics-based models have higher weighting factors.

The incorporation of system events further improves the accuracy of our fusion model. Comparing Fusion-KV and Fusion-KVE, we observe fewer errors; many are corrected where α for the event-based model is high, such as in states with shorter durations or frequent camera movements. At around 250 s to 300 s of the presented sequence, frequent state transitions can be observed. Fusion-KVE estimates the states more accurately and shows fewer fluctuations than the other models. The event-based model is less sensitive to environmental noise, as the events are collected directly from the surgical system. Additionally, when state transitions are frequent, models that solely exploit the temporal dependencies of the input data, such as TCN and LSTM, are less accurate.
As the event-based model does not take temporal correlations into consideration, incorporating this data source reduces the fluctuation in the state estimation results, especially when state transitions are frequent or the duration of each state is short.

The average duration of each state in both the JIGSAWS suturing dataset and the RIOUS dataset varies significantly, as shown in Table 1. To better capture states with different durations, we implemented two kinematics-based state estimation models: a TCN and a forward LSTM. Fig. 6 supports this decision. When the average duration of a state is long, the LSTM-based model has a higher weighting factor. Similarly, the TCN-based model has a higher weighting factor for shorter-duration states.

As mentioned before, the RIOUS dataset is more complex than JIGSAWS and resembles real-world surgical tasks more closely. It is therefore harder to capture well with a single-input state estimation model. Furthermore, our application of real-time state estimation limits the amount of data available to the model. Although running multiple state estimation models at the same time inevitably requires more computing power, our fusion state estimation model is robust against complex and realistic surgical tasks such as ultrasound imaging and achieves a superior frame-wise accuracy.

Table 2: Results on the JIGSAWS suturing dataset
Method               Input data type   Accuracy (%)   Edit Dist.
ST-CNN [21]          Vis               74.7           66.6
TCN [19]             Kin               79.6           85.8
Forward LSTM [6]     Kin               80.5           75.3
TCN [19]             Vis               81.4           83.1
TDNN [7]             Kin               81.7           -
TricorNet [23]       Kin               82.9           86.8
Bidir. LSTM [6]      Kin               83.3           81.1
LC-SC-CRF [5]        Kin+Vis           83.5           76.8
Fusion-KV            Kin+Vis           86.3           87.2
Table 3: Results on the RIOUS dataset

Method               Input data type   Accuracy (%)
ST-CNN [21]          Vis               46.3
TCN [19]             Vis               54.8
LC-SC-CRF [5]        Kin               71.5
Forward LSTM [6]     Kin               72.2
TDNN [7]             Kin               78.1
TCN [19]             Kin               78.4
Fusion-KV            Kin+Vis           82.7
Fusion-KVE           Kin+Vis+Evt       89.4
Figure 5: Example state estimation results of the vision-based model (Vis) and the kinematics-based models (Kin-LSTM and Kin-TCN) used in our fusion models, along with Fusion-KV and Fusion-KVE, compared to the ground truth (GT). The top row of each block bar shows the state estimation results, and the frames marked in red in the bottom row are the discrepancies between the state estimation results and the ground truth.

Figure 6: Distributions of the normalized weighting factor matrix α for the JIGSAWS suturing task and the RIOUS imaging task. A larger weighting factor means that the model performs better at estimating the corresponding state.

CONCLUSION

In this paper, we introduce a unified approach to fine-grained state estimation for various surgical tasks using multiple sources of input data from the da Vinci® Xi surgical system. Our models (including Fusion-KV and Fusion-KVE) improve on the state-of-the-art performance for both the JIGSAWS suturing dataset and the RIOUS dataset. Fusion-KVE, which takes advantage of the system events (absent in the JIGSAWS dataset), further improves on Fusion-KV. Our RIOUS dataset is more complex than JIGSAWS and resembles real-world surgical tasks, with dry-lab, cadaveric, and in-vivo experiments, as well as camera movements and annotations of both action and non-action states. Our unified model proves its robustness against complex and realistic surgical tasks by achieving a superior frame-wise accuracy even in a causal setting, where the model has knowledge of only the current and preceding time steps.

We show how different types of input data (vision, kinematics, and system events) have their respective strengths and weaknesses in the recognition of fine-grained states. The fine-grained state estimation of surgical tasks is challenging due to the varying durations of states and frequent state transitions. We show that by incorporating multiple types of input data, we are able to extract richer information during training and more accurately estimate the states in a surgical setting. A possible next step of our work would be to use the weighting factor matrix in boosting methods to more efficiently train the unified state estimation model. In the future, we also plan to apply this state estimation framework to applications such as smart-assist technologies and supervised autonomy for surgical subtasks.

ACKNOWLEDGMENT
This work was funded by Intuitive Surgical, Inc. We would like to thank Dr. Azad Shademan and Dr. Pourya Shirazian for their support of this research.

REFERENCES

[1] G. P. Moustris, S. C. Hiridis, K. M. Deliparaschos, and K. M. Konstantinidis, "Evolution of autonomous and semi-autonomous robotic surgical systems: a review of the literature,"
The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 7, no. 4, pp. 375–392, 2011.
[2] P. Chalasani, A. Deguet, P. Kazanzides, and R. H. Taylor, "A computational framework for complementary situational awareness (CSA) in surgical assistant robots," IEEE, 2018, pp. 9–16.
[3] S. P. DiMaio, C. J. Hasser, R. H. Taylor, D. Q. Larkin, P. Kazanzides, A. Deguet, B. P. Vagvolgyi, and J. Leven, "Interactive user interfaces for minimally invasive telesurgical systems," Feb. 15, 2018, US Patent App. 15/725,271.
[4] A. Zia, A. Hung, I. Essa, and A. Jarc, "Surgical activity recognition in robot-assisted radical prostatectomy using deep learning," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, and G. Fichtinger, Eds. Cham: Springer International Publishing, 2018, pp. 273–280.
[5] C. Lea, R. Vidal, and G. D. Hager, "Learning convolutional action primitives for fine-grained action recognition," IEEE, 2016, pp. 1642–1649.
[6] R. DiPietro, C. Lea, A. Malpani, N. Ahmidi, S. S. Vedula, G. I. Lee, M. R. Lee, and G. D. Hager, "Recognizing surgical activities with recurrent neural networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 551–558.
[7] G. Menegozzo, D. Dall'Alba, C. Zandonà, and P. Fiorini, "Surgical gesture recognition with time delay neural network based on kinematic data," IEEE, 2019, pp. 1–7.
[8] E. Mavroudi, D. Bhaskara, S. Sefati, H. Ali, and R. Vidal, "End-to-end fine-grained action segmentation and recognition using conditional random field models and discriminative sparse coding," IEEE, 2018, pp. 1558–1567.
[9] T. Yu, D. Mutter, J. Marescaux, and N. Padoy, "Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition," arXiv preprint arXiv:1812.00033, 2018.
[10] A. Zia, C. Zhang, X. Xiong, and A. M. Jarc, "Temporal clustering of surgical activities in robot-assisted surgery," International Journal of Computer Assisted Radiology and Surgery, vol. 12, no. 7, pp. 1171–1178, 2017.
[11] G. Yengera, D. Mutter, J. Marescaux, and N. Padoy, "Less is more: surgical phase recognition with less annotations through self-supervised pre-training of CNN-LSTM networks," arXiv preprint arXiv:1805.08569, 2018.
[12] L. Tao, E. Elhamifar, S. Khudanpur, G. D. Hager, and R. Vidal, "Sparse hidden Markov models for surgical gesture classification and skill evaluation," in International Conference on Information Processing in Computer-Assisted Interventions. Springer, 2012, pp. 167–177.
[13] J. Rosen, J. D. Brown, L. Chang, M. N. Sinanan, and B. Hannaford, "Generalized approach for modeling minimally invasive surgery as a stochastic process using a discrete Markov model," IEEE Transactions on Biomedical Engineering, vol. 53, no. 3, pp. 399–413, 2006.
[14] M. Volkov, D. A. Hashimoto, G. Rosman, O. R. Meireles, and D. Rus, "Machine learning and coresets for automated real-time video segmentation of laparoscopic and robot-assisted surgery," IEEE, 2017, pp. 754–759.
[15] L. Tao, L. Zappella, G. D. Hager, and R. Vidal, "Surgical gesture segmentation and recognition," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2013, pp. 339–346.
[16] L. Zappella, B. Béjar, G. Hager, and R. Vidal, "Surgical gesture classification from video and kinematic data," Medical Image Analysis, vol. 17, no. 7, pp. 732–745, 2013.
[17] S. Krishnan, A. Garg, S. Patil, C. Lea, G. Hager, P. Abbeel, and K. Goldberg, "Transition state clustering: Unsupervised surgical trajectory segmentation for robot learning," in Robotics Research. Springer, 2018, pp. 91–110.
[18] B. van Amsterdam, H. Nakawala, E. De Momi, and D. Stoyanov, "Weakly supervised recognition of surgical gestures," IEEE, 2019, pp. 9565–9571.
[19] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks: A unified approach to action segmentation," in European Conference on Computer Vision. Springer, 2016, pp. 47–54.
[20] R. DiPietro, N. Ahmidi, A. Malpani, M. Waldram, G. I. Lee, M. R. Lee, S. S. Vedula, and G. D. Hager, "Segmenting and classifying activities in robot-assisted surgery with recurrent neural networks," International Journal of Computer Assisted Radiology and Surgery, pp. 1–16, 2019.
[21] C. Lea, A. Reiter, R. Vidal, and G. D. Hager, "Segmental spatiotemporal CNNs for fine-grained action segmentation," in European Conference on Computer Vision. Springer, 2016, pp. 36–52.
[22] Y. Jin, Q. Dou, H. Chen, L. Yu, J. Qin, C.-W. Fu, and P.-A. Heng, "SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network," IEEE Transactions on Medical Imaging, vol. 37, no. 5, pp. 1114–1126, 2017.
[23] L. Ding and C. Xu, "TricorNet: A hybrid temporal convolutional and recurrent network for video action segmentation," arXiv preprint arXiv:1705.07818, 2017.
[24] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.
[25] N. Ahmidi, L. Tao, S. Sefati, Y. Gao, C. Lea, B. B. Haro, L. Zappella, S. Khudanpur, R. Vidal, and G. D. Hager, "A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery," IEEE Transactions on Biomedical Engineering, vol. 64, no. 9, pp. 2025–2041, 2017.
[26] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy, "EndoNet: a deep architecture for recognition tasks on laparoscopic videos," IEEE Transactions on Medical Imaging, vol. 36, no. 1, pp. 86–97, 2016.
[27] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[28] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[29] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[30] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000). Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3. IEEE, 2000, pp. 189–194.
[31] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[32] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[33] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," in Soviet Physics Doklady, vol. 10, no. 8, 1966, pp. 707–710.
[34] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al., "JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling," in