Unsupervised Trajectory Segmentation and Promoting of Multi-Modal Surgical Demonstrations
Zhenzhou Shao, Hongfa Zhao, Jiexin Xie, Ying Qu, Yong Guan, Jindong Tan
Abstract — To improve the efficiency of surgical trajectory segmentation for robot learning in robot-assisted minimally invasive surgery, this paper presents a fast unsupervised method using video and kinematic data, followed by a promoting procedure to address the over-segmentation issue. An unsupervised deep learning network, a stacked convolutional auto-encoder, is employed to extract more discriminative features from videos in an effective way. To further improve the accuracy of segmentation, on one hand, a wavelet transform is used to filter out the noise in the features from video and kinematic data. On the other hand, the segmentation result is promoted by identifying adjacent segments with no state transition based on predefined similarity measurements. Extensive experiments on the public dataset JIGSAWS show that our method achieves much higher segmentation accuracy than state-of-the-art methods in a shorter time.
I. INTRODUCTION

Surgical trajectory segmentation is a fundamental problem in the field of robot-assisted minimally invasive surgery (RMIS). It can be applied in several settings, such as learning from demonstration [1], skill assessment [2], complex task automation [3], and so forth. Each surgical procedure is usually represented by synchronized video and kinematic recordings, and can be decomposed into several meaningful sub-trajectories. Since the segments are atomic, with less complexity and lower variance, and make it easier to eliminate outliers, the capability of further robot learning and assessment can be improved. However, it is challenging to segment a surgical trajectory accurately and rapidly. Even an identical surgical procedure can vary remarkably in the spatial and temporal domains due to skill differences among surgeons. Moreover, the trajectory is susceptible to random noise.

Traditional solutions usually cast surgical trajectory segmentation as a clustering problem, and are mainly divided into two categories: supervised and unsupervised methods. Among supervised methods, Linear Discriminant Analysis (LDA) [4], Hidden Markov Models (HMMs) [2], Descriptive Curve Coding (DCC) [5], and Conditional Random Fields (CRF) [6] have been proposed. However, supervised methods are time-consuming because the training dataset requires manual annotation by experts. Thus, unsupervised methods have drawn more attention in recent years.

*Corresponding author. Zhenzhou Shao, Hongfa Zhao, Jiexin Xie and Yong Guan are with the College of Information Engineering, Beijing Advanced Innovation Center for Imaging Technology and Beijing Key Laboratory of Light Industrial Robot and Safety Verification, Capital Normal University, Beijing, 100048, China {zshao, hfzhao, 2171002039, guanyong}@cnu.edu.cn. Ying Qu and Jindong Tan are with the Engineering College, The University of Tennessee, Knoxville, TN, 37996, USA {yqu3, tan}@utk.edu.
Some unsupervised methods based on Gaussian Mixture Models (GMM) and Dirichlet Processes (DP) have been proposed [7], [8]. Although GMM- and DP-based methods dispense with manual annotations, there remains room to improve segmentation accuracy since only the kinematic data are taken into account. Recently, video data have been incorporated through deep learning, since traditional pattern-recognition-based feature extraction methods cannot model the variations among surgeons' videos well. A. Murali et al. [9] employ VGGNet to extract features from video, followed by Transition State Clustering (TSC) for task-level segmentation using both kinematic and video data. Although the involvement of the video source enables higher segmentation accuracy, the feature extraction from videos is time-consuming and easily leads to over-segmentation.

This paper focuses on unsupervised surgical trajectory segmentation using both video and kinematic data. There are challenges in finding consistent segments from the varying and noisy recordings of surgeons with different skills performing a specific task. First, although video is capable of improving segmentation performance, it is challenging to extract distinguishing features efficiently. In addition, random noise has to be considered due to differences in surgeons' skill. Second, state-of-the-art methods generally suffer from the over-segmentation issue. We need an effective way to identify adjacent segments with no state transition.

As shown in Fig. 1, a fast unsupervised method for surgical trajectory segmentation is proposed using video and kinematic data. In particular, a promoting procedure is presented to alleviate the over-segmentation issue. First, a compact but effective unsupervised learning network called
Fig. 1: Illustration of the suturing trajectory segmentation with the promoting procedure using video and kinematic data.

a stacked convolutional auto-encoder (SCAE) is employed to speed up the feature extraction from video. A wavelet transform is then used to filter the features from video and kinematic data for further clustering based on TSC. We refer to the proposed segmentation method as TSC-SCAE for short. Finally, the segmentation result is promoted by merging clusters according to four similarity measurements, collectively called PMDD, based on principal component analysis, mutual information, data average and dynamic time warping, respectively.

II. UNSUPERVISED TRAJECTORY SEGMENTATION BASED ON TSC-SCAE
A. Visual Feature Extraction Using SCAE
The Stacked Convolutional Auto-Encoder (SCAE) [10] is an unsupervised feature extractor that is well suited to high-dimensional input. It is much faster than methods such as TSC-VGG and TSC-SIFT because of its simple neural network and unsupervised training. SCAE also has advantages in image processing, as it preserves the spatial relationship between pixels. The SCAE network for visual feature extraction is shown in Fig. 2, and the corresponding configuration is summarized in TABLE I.

Fig. 2 illustrates that the basic structure of the encoder consists of convolutional and pooling layers. The input feature maps (for the first layer, the original image I) are convolved with a convolutional layer to transfer the information to subsequent layers with the spatial relationship between pixels preserved. These feature maps then pass through a max-pooling layer to reduce the feature map size. After several such conv-pooling stages, a low-dimensional feature map is obtained from the encoder.

As shown in Fig. 2, the task of the decoder, which has a topology similar to the encoder, is to reconstruct the encoding result and recover the implied image information. Therefore, we need to up-sample the encoding result to recover the feature maps. To prevent the checkerboard effect caused by traditional transposed convolution, we use bilinear interpolation for up-sampling before each convolutional layer. For further reduction of the feature dimension, we employ two additional convolutional layers, one after the last layer of the encoder and one before the first layer of the decoder.

The Adam optimization algorithm [11] is employed to minimize an MSE (mean-square error) loss, which estimates the similarity between the reconstructed image Î output by the decoder and the original image I input to the encoder. After training, a model (i.e., the weights of each layer) for image encoding and reconstruction is obtained from the network.
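The three building blocks described above (valid convolution, 2 × 2 max-pooling, and bilinear up-sampling in place of transposed convolution) can be written in plain numpy. This is an illustrative sketch, not the authors' implementation, and the layer shapes are assumptions:

```python
import numpy as np

def conv2d(x, k):
    # 'valid' 2-D convolution (cross-correlation, as in CNN conv layers)
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def maxpool2(x):
    # 2x2 max-pooling to halve the feature-map size in the encoder
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def bilinear_up2(x):
    # 2x bilinear up-sampling; used in the decoder instead of transposed
    # convolution to avoid checkerboard artifacts
    H, W = x.shape
    rows = np.linspace(0.0, H - 1.0, 2 * H)
    cols = np.linspace(0.0, W - 1.0, 2 * W)
    r0 = np.floor(rows).astype(int)
    r1 = np.minimum(r0 + 1, H - 1)
    c0 = np.floor(cols).astype(int)
    c1 = np.minimum(c0 + 1, W - 1)
    fr = (rows - r0)[:, None]
    fc = (cols - c0)[None, :]
    top = x[r0][:, c0] * (1 - fc) + x[r0][:, c1] * fc
    bot = x[r1][:, c0] * (1 - fc) + x[r1][:, c1] * fc
    return top * (1 - fr) + bot * fr
```

Stacking conv2d/maxpool2 forms an encoder stage, and bilinear_up2 followed by conv2d forms a decoder stage, mirroring the structure in Fig. 2.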
In the feature extraction phase, we load only the encoder part of the model to extract the features of each frame in the surgical video.

TABLE I: Configuration of SCAE network (columns: Type, Patch Size, Stride, Output Size).

B. Denoising Based on Wavelet Transform
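The low-pass idea used in this subsection (decompose the signal into multi-scale bands, suppress the detail bands, and reconstruct) can be sketched in numpy. The paper uses a db10 wavelet with a 5-level decomposition; for brevity this sketch substitutes the simpler Haar wavelet, so it is illustrative only:

```python
import numpy as np

def haar_level(x):
    # one level of Haar analysis: approximation and detail coefficients
    x = x[:len(x) // 2 * 2]
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_inverse(a, d):
    # one level of Haar synthesis, undoing haar_level exactly
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def wavelet_denoise(x, levels=5):
    # low-pass filtering: decompose, zero the detail bands, reconstruct
    approx, details = x.astype(float), []
    for _ in range(levels):
        approx, d = haar_level(approx)
        details.append(np.zeros_like(d))  # discard small-scale (noisy) detail
    for d in reversed(details):
        approx = haar_inverse(approx, d)
    return approx
```

A production version would use a db10 filter bank from a wavelet toolbox and a soft threshold on the details rather than zeroing them outright.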
After feature extraction from the demonstration video, the visual and kinematic features are fed to a nonparametric mixture model for clustering. However, we find that these features usually suffer from random noise. To remove it, a wavelet-transform-based filter is employed, owing to its multi-scale filtering ability, and a low-pass filter is designed.

In this paper, we process the kinematic data and visual features with a db10 wavelet, and a 5-level wavelet decomposition is performed for denoising. Fig. 3 and Fig. 4 show the kinematic and visual features before and after the wavelet-based filtering.

After filtering, the visual and kinematic features are fed to a nonparametric mixture model to segment the surgical trajectory. Considering its clustering performance, Transition State Clustering (TSC) [8] is adopted in this paper.

III. SEGMENTATION PROMOTING BASED ON PMDD

Most unsupervised trajectory segmentation methods suffer from over-segmentation. To correct the
Fig. 2: SCAE network for visual feature extraction.
Fig. 3: Comparison of kinematic features before and after wavelet-based filtering: the vertical axis is in meters (m) and the horizontal axis is in frames (30 fps); x, y, z are the input data (a spatial location) while x_wt, y_wt, z_wt are the corresponding denoised results.

Fig. 4: Comparison of visual features before and after wavelet-based filtering: the horizontal axis is in frames (30 fps); the vertical axis denotes the value of the visual feature.

wrongly segmented sub-trajectories that belong to the same cluster, a criterion is required to evaluate the similarity between segments. Looking closely at segments of the same sub-trajectory, they share a few implicit and explicit associations. Besides similarity in the spatial and temporal domains, inner structure, variation nodes and moving trend are also important factors. Taking these factors into consideration, we propose a promoting algorithm based on PMDD, consisting of four similarity measurements based on Principal Component Analysis (PCA), Mutual Information (MI), Data Average (DA) and Dynamic Time Warping (DTW).
Similarity measurement based on PCA:
W. Krzanowski et al. [12] show that PCA can be used to measure the similarity between segments. PCA mainly captures the internal links and structure of the segments. Considering two segments S_a and S_b, PCA finds several principal components of each, which span subspaces representing the main information of S_a and S_b. A smaller angle between the subspaces of S_a and S_b means greater internal consistency between them. Thus, the PCA-based similarity measurement is defined by the angles between the subspaces spanned by their principal components:

SM_PCA(S_a, S_b) = (1/q) Σ_{i=1}^{q} Σ_{j=1}^{q} θ(i, j),   (1)

where q is the number of principal components.

Similarity measurement based on MI:
Surgery is a continuous process, and the data variation of segments within the same surgical sub-process is similar. Entropy can be interpreted as a measure of the uncertainty of particular variables. Therefore, MI is a good similarity measurement for the degree of variation between two segments; it is obtained by subtracting the joint entropy H(S_a, S_b) from the sum of the entropies H(S_a) and H(S_b) of the two segments:

SM_MI(S_a, S_b) = H(S_a) + H(S_b) − H(S_a, S_b),   (2)

Similarity measurement based on DA:
DA mainly reflects the spatial characteristics. During a surgical sub-process, the trajectory within a short time interval is similar in the spatial domain. Therefore, the distance between the segment centers in the spatial domain is taken into account, written as follows:

SM_DA(S_a, S_b) = ||μ_a − μ_b||,   (3)

where μ_a and μ_b are the mean vectors of segments S_a and S_b.

Similarity measurement based on DTW:
Due to differences in surgeons' skill, the same action may produce different sub-trajectories; a typical case is the same behavior performed differently in the temporal domain. The key issue in DTW is the warping curve. Here, we use the cumulative distance γ(i, j) to compute the best warping path while measuring the DTW similarity [13]:

SM_DTW(S_a, S_b) = min( √(Σ_{k=1}^{K} w_k) / K ),   (4)

where w_k is the k-th element of the warping path and K is the compensation parameter that can be identified from the cumulative distance:

γ(i, j) = d(q_i, c_j) + min{ γ(i−1, j−1), γ(i−1, j), γ(i, j−1) },   (5)

where d(q_i, c_j) is the Euclidean distance between points q_i and c_j.

The four similarity measurements above are on different scales, so normalization is required to obtain the final measure. For SM_PCA, SM_DA and SM_DTW, the smaller the value, the more similar the two segments; we normalize them using Eq. (6). The normalization for SM_MI is performed using Eq. (7). The final similarity O is then computed by Eq. (8):

Y = { 0, SM ≥ mean(SM); (mean(SM) − SM) / (mean(SM) − min(SM)), SM < mean(SM) },   (6)

Y = { (SM − mean(SM)) / (max(SM) − mean(SM)), SM > mean(SM); 0, SM ≤ mean(SM) },   (7)

O_{a,b} = [ Y_PCA + Y_DA + Y_DTW + Y_MI ] / 4,   (8)

Then, according to the final similarity, segments with high similarity can be merged iteratively. Considering the segmentation results S = {S_i, 1 ≤ i ≤ n}, the final similarity of each pair of adjacent segments is calculated by Eq. (8) in each iteration, giving a set of results O = {O_{1,2}, O_{2,3}, ..., O_{n−1,n}}. The pair with the highest final similarity is merged, the similarities O are updated, and the most similar segments are merged in the next iteration, until the overall final similarity O_{a,b} falls below a threshold τ. The segmentation promoting procedure is summarized in Algorithm 1.
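The four PMDD measurements can be sketched in numpy as follows. The number of principal components q, the histogram discretisation used for the entropies, and the DTW path normalisation are assumptions of this sketch rather than details fixed by the paper:

```python
import numpy as np

def sm_pca(Sa, Sb, q=2):
    # Eq. (1): mean principal angle between the top-q PCA subspaces of two segments
    def basis(S):
        X = S - S.mean(axis=0)
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        return Vt[:q].T                          # d x q orthonormal basis
    # singular values of Qa^T Qb are the cosines of the principal angles
    cos = np.clip(np.linalg.svd(basis(Sa).T @ basis(Sb), compute_uv=False), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))        # smaller -> more similar

def _entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def sm_mi(Sa, Sb, bins=8):
    # Eq. (2): H(Sa) + H(Sb) - H(Sa, Sb) on histogram-discretised first dimensions
    a = np.digitize(Sa[:, 0], np.histogram_bin_edges(Sa[:, 0], bins))
    b = np.digitize(Sb[:, 0], np.histogram_bin_edges(Sb[:, 0], bins))
    n = min(len(a), len(b))                      # truncate to a common length
    a, b = a[:n], b[:n]
    return _entropy(a) + _entropy(b) - _entropy(a * (bins + 2) + b)

def sm_da(Sa, Sb):
    # Eq. (3): Euclidean distance between the segment centres
    return float(np.linalg.norm(Sa.mean(axis=0) - Sb.mean(axis=0)))

def sm_dtw(Sa, Sb):
    # Eq. (5): cumulative distance gamma(i, j); Eq. (4) then normalises
    # the optimal path cost (a path-length-normalised variant is used here)
    n, m = len(Sa), len(Sb)
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(Sa[i - 1] - Sb[j - 1])
            g[i, j] = d + min(g[i - 1, j - 1], g[i - 1, j], g[i, j - 1])
    return float(g[n, m] / (n + m))
```

Applying Eqs. (6)-(8) then amounts to normalising each measurement over all adjacent pairs and averaging the four normalised values.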
Algorithm 1: Segmentation promoting based on PMDD.
Input: Segments S, threshold τ.
while max(O) > τ do
    for i = 1 : length(S) − 1 do
        Calculate O_{i,i+1} by Eq. (8).
    end for
    index ← arg max_i (O_{i,i+1})
    S_index ← merge(S_index, S_{index+1})
    remove(S_{index+1})
end while
Output: Post-processed segments S.

IV. EXPERIMENTAL RESULTS

In this section, two sets of experiments are conducted to verify the performance of the proposed unsupervised segmentation algorithm for surgical trajectories. In the first experiment, TSC-SCAE is evaluated with respect to accuracy and overall running time, compared with classic clustering methods including GMM and TSC. The effects of different data sources and wavelet-based filtering are analyzed quantitatively. Second, the segmentation promoting method is verified by appending it to different methods, using kinematic data alone and the combination of video and kinematic data, respectively.

The JIGSAWS dataset [14] from Johns Hopkins University is used in the experiments, including data recordings and manual annotations. The data recordings consist of surgical video and kinematic data collected from the da Vinci Surgical System. The sampling frequency for both the video and kinematic sources is 30 Hz. The dataset contains three surgical tasks: Suturing (SU), Needle-Passing (NP) and Knot-Tying (KT), performed and annotated by 8 surgeons with different skill levels. The suturing and needle-passing tasks are commonly used in the literature. In this paper, we adopt 11 demonstrations of these two tasks in the experiments, including the videos and kinematic data from 5 experts (E), 3 intermediates (I) and 3 novices (N). The kinematic data have 38 dimensions, including position, angular velocity, gripper angle, etc. All 11 videos of each task are used for SCAE model training and feature extraction. The computational configuration used in the experiments is summarized in TABLE II.

TABLE II: Configuration used in the experiments.
Category | Specification
Operating System | Ubuntu
CPU | 32 × Intel Xeon E5-2620 v4 @ 2.10GHz
GPU | NVidia Tesla K40
CUDA Compute Capability | 3.5
CUDA Cores | 2880
RAM | 128 GB
Programming Language | Python
A. Quantitative Analysis of TSC-SCAE

1) Accuracy Comparison:
In this section, the accuracy of TSC-SCAE is assessed using Normalized Mutual Information (NMI), which indicates the similarity of transition status between a predicted clustering result A and the ground truth B (manual annotations). It is calculated by

NMI(A, B) = I(A, B) / √(H(A) H(B)),   (9)

where H(A) and H(B) are the information entropies of A and B, respectively, and I(A, B) is their mutual information. The range of NMI is [0, 1], where 0 means there is no correlation between the two clustering results, and 1 means they are completely related.

We compare the proposed TSC-SCAE with state-of-the-art methods, including TSC [8], GMM [7], TSC-VGG and TSC-SIFT [9], on the selected surgical demonstrations. According to the data sources used, the experiments are divided into two categories: one uses kinematic data alone, and the other uses both video and kinematic data. TABLE III shows the NMI measurements of the segmentations. Our method TSC-SCAE achieves the best NMI across all trajectory segmentation tasks, thanks to the use of video data and the wavelet transform. In particular, using both video and kinematic data, the accuracy is improved by more than 2.6 times at most compared with TSC-SIFT.

TABLE III: NMI (%) of segmentation for different methods. K stands for using kinematic data alone, V&K represents using both video and kinematic data, and * denotes data filtered by the wavelet transform.

Method | Needle Passing (E / E+I / E+I+N) | Suturing (E / E+I / E+I+N)
TSC(K) | 21.6 / 27.2 / 17.0 | 43.2 / 38.0 / 25.7
GMM(K) | 53.3 / 51.2 / 45.8 | 45.2 / 43.4 / 41.0
TSC-VGG(V&K) | 62.9 / 64.7 / 69.3 | 58.6 / 64.0 / 66.5
TSC-SIFT(V&K) | 31.0 / 32.6 / 28.2 | 48.0 / 42.5 / 37.7
GMM-SCAE(V&K) | 59.3 / 57.4 / 58.7 | 57.5 / 52.5 / 51.4
TSC-SCAE(V&K) | 72.6 / 73.8 / 71.2 | 65.5 / 66.3 / 67.2
TSC-SCAE(V&K*) |

Overall, the methods using both video and kinematic data are generally better than those using kinematic data alone. This is consistent with the results reported in the literature.
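Eq. (9) can be sketched on two frame-label sequences as follows (natural-log entropies; this sketch assumes each labeling has at least two clusters, otherwise the denominator vanishes):

```python
import numpy as np

def nmi(A, B):
    # Eq. (9): NMI(A, B) = I(A, B) / sqrt(H(A) * H(B))
    def H(p):
        p = p[p > 0]                      # ignore empty cells
        return float(-(p * np.log(p)).sum())
    _, a = np.unique(np.asarray(A), return_inverse=True)
    _, b = np.unique(np.asarray(B), return_inverse=True)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)         # joint label histogram
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    mutual = H(pa) + H(pb) - H(joint.ravel())
    return mutual / np.sqrt(H(pa) * H(pb))
```

Note that NMI is invariant to label permutation, which is why it suits unsupervised clustering: the cluster ids need not match the annotation ids.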
The NMI of methods using kinematic data alone tends to decrease as the proportion of non-expert (I & N) demonstrations grows. This phenomenon is most pronounced in the suturing task, mainly because of its complexity and non-regularity. What's more, demonstrations from experts are usually smoother and faster than those from non-experts. However, when both kinematic and video data are considered, the phenomenon is clearly weakened. This shows that video data can help eliminate the influence of irregular trajectories from intermediates and novices, and are an effective complement for better surgical trajectory segmentation.

As mentioned above, random noise may interfere with the segmentation result. To address this, we apply multi-scale smoothing to the dataset using the db10 wavelet to filter out small-scale noise, which indirectly improves the segmentation accuracy. Compared with the experiments without filtering, the NMI increases by 3.5%-6.5% in the needle-passing task and by 1.2%-3.4% in the suturing task.
2) Overall Running Time Comparison:
Another key indicator is the overall running time; although surgical segmentation has no hard real-time requirement, the task still needs to be as fast as possible. For methods based on kinematic data alone, the running time is the cost of clustering and segmentation, while for methods using visual and kinematic data (TSC-VGG, TSC-SIFT, etc.) the cost of video feature extraction must be added. For our method TSC-SCAE, the time cost consists of three parts: visual feature extraction, wavelet-based filtering, and clustering segmentation.

The running times of the different steps are summarized in TABLE IV. The segmentation methods based on both visual and kinematic features are about 10 times slower than those using kinematic data alone, mainly because of the time-consuming visual feature extraction. However, among the methods using both data sources, our TSC-SCAE is almost 10 times faster than TSC-VGG and TSC-SIFT. The improvement in time efficiency is due to the highly efficient unsupervised model we employ for video feature extraction.
B. Evaluation of Segmentation Promoting
Over-segmentation is a common problem of clustering-based segmentation algorithms. To demonstrate the validity of the proposed promoting approach as a post-processing step, we apply it to the mainstream clustering segmentation algorithms, including GMM- and TSC-based methods. NMI measures the similarity of transition status in clustering-based segmentation, but the promoting stage does not merge based on transition states. Therefore, we choose segmentation accuracy (seg-acc) as the evaluation metric, which measures the similarity between the segmentation result and the ground truth intuitively and accurately.

The calculation of seg-acc is divided into two steps. In the first step, we match resultant segments to the ground truth by maximizing the number of overlapping frames between predicted segments and the ground truth [15]. In the second step, a match is a true positive if the IoU (Intersection over Union) between the ground-truth segment G_i and its corresponding resultant segment S_i exceeds a default threshold of 40%. We calculate the accuracy of each segment separately and then sum them up. Fig. 5 illustrates the calculation process, and seg-acc is obtained using

seg-acc = Σ_i L_i / L = Σ_i [min(S_ei, G_ei) − max(S_si, G_si)] / L,   (10)

where S_si, S_ei and G_si, G_ei are the start and end frames of segments S_i and G_i.

Fig. 5: Segmentation accuracy of predicted segments, where s and e denote the start and end frame of a segment, and L_i is the number of overlapping frames between predicted segment S_i and its corresponding ground truth G_i.

As shown in TABLE V, the seg-acc of each method is clearly improved in most cases. TSC-K is the biggest beneficiary, with seg-acc improved by 15.2% on average, while the accuracy improves less for TSC-SIFT and TSC-VGG.
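The two-step seg-acc computation can be sketched as follows, with segments represented as half-open (start, end) frame intervals. The greedy per-segment matching here is a simplification of the overlap-maximizing assignment of [15]:

```python
def overlap(s, g):
    # number of shared frames between half-open intervals [start, end)
    return max(0, min(s[1], g[1]) - max(s[0], g[0]))

def seg_acc(pred, gt, iou_thresh=0.4):
    # Eq. (10): summed overlap L_i of IoU-qualified matches over total length L
    L = gt[-1][1] - gt[0][0]
    total = 0
    for g in gt:
        # step 1: match the ground-truth segment to the prediction
        # with the largest frame overlap
        s = max(pred, key=lambda p: overlap(p, g))
        union = max(s[1], g[1]) - min(s[0], g[0])
        # step 2: count the overlap only if IoU exceeds the 40% threshold
        if union > 0 and overlap(s, g) / union >= iou_thresh:
            total += overlap(s, g)
    return total / L
```

For example, with ground truth [(0, 60), (60, 100)] and prediction [(0, 50), (50, 100)], both matches clear the IoU threshold and the overlaps 50 and 40 give a seg-acc of 0.9.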
In the experiments, we notice that it is difficult to refine the segmentation if the clustering result is far from the ground truth. As shown in Fig. 6, each color represents a surgical activity segment, while a white segment indicates an incorrect or over-segmented one. Among all methods, the seg-acc of the GMM-based method even declines after promoting: because GMM must be given the number of clusters, its typical failure mode is mis-segmentation rather than over-segmentation. For our method TSC-SCAE, the segmentation promoting yields up to a 16.7% improvement in seg-acc. In most cases, the resultant segmentation after promoting is significantly improved. From TABLE V, we also notice that the improvement on non-expert demonstrations is more pronounced than on expert ones, because non-expert demonstrations produce more over-segmented fragments.

In all experiments, TSC-SCAE obtains the best segmentation results, showing that the proposed promoting method is very effective for surgical trajectory segmentation. In general, it can be extended to most clustering-based segmentation algorithms.

TABLE IV: Comparison of overall running time using different segmentation methods (unit: s). FE stands for feature extraction, CS represents clustering segmentation, and WT is the wavelet transform; * denotes data filtered by the wavelet transform.

Method | Needle Passing (E / E+I / E+I+N) | Suturing (E / E+I / E+I+N) | Elements
TSC-K | 79 / 103 / 353 | 59 / 83 / 331 | CS
GMM-K | 1.76 / 1.95 / 3.34 | 1.59 / 2.00 / 5.38 | CS
TSC-VGG | 8120+394 / 9744+380 / 14616+1226 | 4935+322 / 5922+364 / 8884+1404 | FE+CS
TSC-SIFT | 2127+440 / 3284+723 / 5019+2020 | 1941+404 / 3036+533 / 4633+2259 | FE+CS
GMM-SCAE | 128+2.94 / 154+2.95 / 231+5.57 | 139+2.80 / 167+3.30 / 251+5.38 | FE+CS
TSC-SCAE | 128+197 / 154+199 / 231+933 | 139+158 / 167+201 / 251+1012 | FE+CS
TSC-SCAE* | | | FE+CS+WT

V. CONCLUSION

This paper proposed a fast unsupervised method for surgical trajectory segmentation based on a compact stacked convolutional auto-encoder model and wavelet-transform-based
TABLE V: Segmentation accuracy before and after segmentation promoting.

Method | Before: Needle Passing (E / E+I / E+I+N) | Before: Suturing (E / E+I / E+I+N) | After: Needle Passing (E / E+I / E+I+N) | After: Suturing (E / E+I / E+I+N)
TSC-K | 0.498 / 0.563 / 0.529 | 0.484 / 0.535 / 0.542 | 0.614 / 0.578 / 0.615 | 0.547 / 0.565 / 0.630
GMM | 0.480 / 0.528 / 0.541 | 0.466 / 0.489 / 0.503 | 0.392 / 0.475 / 0.551 | 0.494 / 0.541 / 0.575
TSC-VGG | 0.505 / 0.562 / 0.436 | 0.487 / 0.460 / 0.498 | 0.522 / 0.548 / 0.445 | 0.540 / 0.465 / 0.507
TSC-SIFT | 0.546 / 0.561 / 0.510 | 0.442 / 0.513 / 0.493 | 0.592 / 0.582 / 0.590 | 0.521 / 0.589 / 0.593
TSC-SCAE |
[Fig. 6 panels, top to bottom: TSC-K, TSC-SIFT, TSC-VGG, GMM, and TSC-SCAE; each panel shows the ground truth and the segmentation before and after promotion over the frame number.]
Fig. 6: Visualization of the comparison on the needle-passing task.

filtering using multi-modal surgical demonstrations. The improvement with respect to segmentation efficiency is three-fold. First, the newly introduced model can generate more discriminative visual features faster. Second, the short-range noise in the visual and kinematic features is filtered out based on the wavelet transform. Last but not least, a promoting approach is proposed to handle the over-segmentation problem. Compared with state-of-the-art methods, the experimental results demonstrate that the proposed algorithm improves segmentation accuracy in a more efficient way.

ACKNOWLEDGMENT

This work was supported by the Project of Beijing Municipal Commission of Education (KM201710028017), National Natural Science Foundation of China (61702348, 61772351, 61602324), National Key R&D Program of China (2017YFB1303000, 2017YFB1302800), the Project of the Beijing Municipal Science & Technology Commission (LJ201607), Capacity Building for Sci-Tech Innovation - Fundamental Scientific Research Funds (025185305000), and the Youth Innovative Research Team of Capital Normal University.

REFERENCES

[1] A. Guha, Y. Yang, C. Fermüller, and Y. Aloimonos, "Minimalist plans for interpreting manipulation actions," in
Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 5908–5914.
[2] C. E. Reiley, H. C. Lin, B. Varadarajan, B. Vagvolgyi, S. Khudanpur, D. Yuh, and G. Hager, "Automatic recognition of surgical motions using statistical modeling for capturing variability," Studies in Health Technology and Informatics, vol. 132, p. 396, 2008.
[3] K. Shamaei, Y. Che, A. Murali, S. Sen, S. Patil, K. Goldberg, and A. M. Okamura, "A paced shared-control teleoperated architecture for supervised automation of multilateral surgical tasks," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 1434–1439.
[4] H. C. Lin, I. Shafran, T. E. Murphy, A. M. Okamura, D. D. Yuh, and G. D. Hager, "Automatic detection and segmentation of robot-assisted surgical motions," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2005, pp. 802–810.
[5] N. Ahmidi, Y. Gao, B. Béjar, S. S. Vedula, S. Khudanpur, R. Vidal, and G. D. Hager, "String motif-based description of tool motion for detecting skill and gestures in robotic surgery," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2013, pp. 26–33.
[6] L. Tao, L. Zappella, G. D. Hager, and R. Vidal, "Surgical gesture segmentation and recognition," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2013, pp. 339–346.
[7] S. H. Lee, I. H. Suh, S. Calinon, and R. Johansson, "Autonomous framework for segmenting robot trajectories of manipulation task," Autonomous Robots, vol. 38, no. 2, pp. 107–141, 2015.
[8] S. Krishnan, A. Garg, S. Patil, C. Lea, G. Hager, P. Abbeel, and K. Goldberg, "Transition state clustering: Unsupervised surgical trajectory segmentation for robot learning," The International Journal of Robotics Research, vol. 36, no. 13-14, pp. 1595–1618, 2017.
[9] A. Murali, A. Garg, S. Krishnan, F. T. Pokorny, P. Abbeel, T. Darrell, and K. Goldberg, "TSC-DL: Unsupervised trajectory segmentation of multi-modal surgical demonstrations with deep learning," in Robotics and Automation (ICRA), 2016 IEEE International Conference on, 2016, pp. 4150–4157.
[10] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," in International Conference on Artificial Neural Networks. Springer, 2011, pp. 52–59.
[11] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[12] W. Krzanowski, "Between-groups comparison of principal components," Journal of the American Statistical Association, vol. 74, no. 367, pp. 703–707, 1979.
[13] D. J. Berndt, "Finding patterns in time series: A dynamic programming approach," Advances in Knowledge Discovery and Data Mining, pp. 229–248, 1996.
[14] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al., "JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling," in MICCAI Workshop: M2CAI, vol. 3, 2014, p. 3.
[15] C. Wu, J. Zhang, S. Savarese, and A. Saxena, "Watch-n-Patch: Unsupervised understanding of actions and relations," in