"Train one, Classify one, Teach one" -- Cross-surgery transfer learning for surgical step recognition
Daniel Neimark, Omri Bar, Maya Zohar, Gregory D. Hager, Dotan Asselmann
Proceedings of Machine Learning Research – Under Review:1–12, 2021. Full Paper – MIDL 2021 submission.
Daniel Neimark [email protected]
Omri Bar [email protected]
Maya Zohar [email protected]
Gregory D. Hager [email protected]
Dotan Asselmann [email protected]
Theator Inc., Palo Alto, CA, USA.
Department of Computer Science, Johns Hopkins University, Baltimore, USA.
Editors:
Under Review for MIDL 2021
Abstract
Prior work demonstrated the ability of machine learning to automatically recognize surgical workflow steps from videos. However, these studies focused on only a single type of procedure. In this work, we analyze, for the first time, surgical step recognition on four different laparoscopic surgeries: Cholecystectomy, Right Hemicolectomy, Sleeve Gastrectomy, and Appendectomy. Inspired by the traditional apprenticeship model, in which surgical training is based on the Halstedian method, we paraphrase the “see one, do one, teach one” approach for the surgical intelligence domain as “train one, classify one, teach one”. In machine learning, this approach is often referred to as transfer learning. To analyze the impact of transfer learning across different laparoscopic procedures, we explore various time-series architectures and examine their performance on each target domain. We propose a Time-Series Adaptation Network (TSAN), an architecture optimized for transfer learning of surgical step recognition. In addition, we show how TSAN can be pre-trained using self-supervised learning on a Sequence Sorting task. Such pre-training enables TSAN to learn workflow steps of a new laparoscopic procedure type given only a small number of samples from the target procedure dataset. Our proposed architecture leads to better performance compared to other possible architectures, reaching over 90% accuracy when transferring from laparoscopic Cholecystectomy to the other three procedure types.
Keywords:
Surgical Intelligence, Surgical Transfer Learning, Surgical Step Recognition, Phase Recognition, Domain Adaptation, Deep Learning.
1. Introduction
Minimally invasive surgery (MIS) video analysis is steadily gaining acceptance for surgical competency assessment (Ritter et al., 2019; Feldman et al., 2020). As MIS is performed under visualization of endoscopic footage, the possibilities for AI-enabled computer-assisted surgery (CAS) applications are immense (Maier-Hein et al., 2017). Regardless of the use case, such video-based applications must serve a wide variety of surgical procedures in order to be relevant and actionable, meet surgeons' clinical needs, and provide them with value.

A variety of surgery-related video-analysis tasks have been explored in recent studies. Surgical step (phase) recognition (Bar et al., 2020; Twinanda et al., 2016; Zisimopoulos et al., 2018; Hashimoto et al., 2019), surgical tool detection and segmentation (Twinanda et al., 2016; Al Hajj et al., 2019; Choi et al., 2017; Ni et al., 2020; Jin et al., 2018), and surgical gesture and skill assessment (Gao et al., 2014; Ahmidi et al., 2017) are a few examples. However, these studies were developed and evaluated on only a single type of procedure and thus provide limited evidence of their broader applicability in the surgical domain.

Figure 1: The same step, Adhesiolysis, is viewed in different procedures. (A) Cholecystectomy, (B) Appendectomy, (C) Right Hemicolectomy, and (D) Sleeve Gastrectomy.

This study aims to address three key aspects which, taken together, provide insight into our practical ability to scale video analysis of surgery to multiple procedures while minimizing the need for large labeled datasets. First, we assess the potential of using self-supervised pre-training to reduce dependence on explicitly labeled data. Second, we investigate the effectiveness of transfer learning to move pre-trained models between different surgical procedures. Finally, we explore the impact of data size on adaptation capabilities. Taken together, our results suggest a practical and effective path to generalizing video analysis of surgery while minimizing the need for laborious fine-grained labeling.

We chose to focus on the foundational task of surgical step recognition – that is, parsing a procedure video into meaningful segments that represent the surgeon's workflow.
While previous studies have explored step recognition for a single type of surgical procedure, such as laparoscopic Cholecystectomy (Bar et al., 2020; Twinanda et al., 2016), Cataract surgery (Yu et al., 2019; Zisimopoulos et al., 2018), and laparoscopic Sleeve Gastrectomy (Hashimoto et al., 2019), they did not assess whether their methods would perform well if applied to other types of surgeries and did not examine the ability to adapt to new types of procedures.

Inspired by the traditional apprenticeship model, in which surgical training is based on the Halstedian method (Cameron, 1997), we paraphrase the “see one, do one, teach one” approach for the surgical intelligence domain as “train one, classify one, teach one”. In machine learning, this approach is often referred to as transfer learning. Transfer learning attempts to exploit a model that was pre-trained on one task and apply its knowledge when training on a different task, thus improving overall generalization (Goodfellow et al., 2016). It is especially useful when the target task's dataset is relatively small, as in the surgical domain. Transfer learning has proven to be a robust method in many ML challenges. Specifically, in the computer vision domain it enables achieving state-of-the-art results in object detection (Girshick et al., 2014; Girshick, 2015), image segmentation (Long et al., 2015; He et al., 2017), face identification (Taigman et al., 2015), and video action recognition (Carreira and Zisserman, 2017). However, transfer learning tends to work better when the source task is related to the target task (Yosinski et al., 2014).

Figure 1 shows the same surgical step, Adhesiolysis, in four different procedure types. In Adhesiolysis, the goal is to remove adhesions. In these procedures, the adhesions are abdominal, and their removal can be done with different tools. While the anatomy viewed changes between procedures, the action remains the same. Thus, we argue that adapting knowledge from one procedure to another is beneficial.

The standard approach for step recognition in previous studies is to train two models for each procedure type: a deep ConvNet that extracts visual features and a time-series model that processes the features sequentially (Bar et al., 2020; Zisimopoulos et al., 2018; Hashimoto et al., 2019). The ConvNets are usually first trained on non-surgical datasets, e.g., ImageNet (Deng et al., 2009) and Kinetics-400 (Kay et al., 2017), but the obvious approach of transferring knowledge across different procedures has never been assessed before.

In this study, we suggest a new approach. We use a 3D ConvNet (Carreira and Zisserman, 2017; Wang et al., 2018), pre-trained for step recognition on Cholecystectomy (Bar et al., 2020). This model is used to extract feature representations from videos of three different laparoscopic procedures: Right Hemicolectomy, Sleeve Gastrectomy, and Appendectomy. We then explore various architectures for the time-series model and focus on finding the best one for surgical domain adaptation. We also suggest a self-supervised initialization method that improves the performance of our time-series model. Finally, we compare our findings with the traditional approach described above.
2. Methods
Our overall approach involves (1) extracting feature representations from videos using a 3D ConvNet; and (2) training a time-series model on these features to predict a step label for each second of video.
We consider several time-series model architectures below and explore two main variants to process the temporal dimension: (1) 1D convolution layers and (2) recurrent layers. As the main contribution, we found a specific combination of the two that yields optimal performance. In what follows, we describe the architectural details. In all cases, the final classification layer is a fully connected layer, followed by a softmax function that predicts, for each second, a single surgical step.
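The shared classification layer described above can be sketched as follows. This is an illustrative PyTorch snippet, not the authors' code; the hidden dimension of 128 and the seven-step label space follow values given later in the paper, and the video length is arbitrary.

```python
import torch
import torch.nn as nn

NUM_STEPS = 7     # e.g., the seven workflow steps per procedure (Appendix A)
HIDDEN_DIM = 128  # output dimension of the preceding time-series model

# Fully connected layer followed by softmax, applied independently
# to the representation of every second of the video.
head = nn.Sequential(
    nn.Linear(HIDDEN_DIM, NUM_STEPS),
    nn.Softmax(dim=-1),
)

L = 3600  # a one-hour video, one representation per second
features = torch.randn(L, HIDDEN_DIM)
probs = head(features)        # (L, NUM_STEPS) per-second step probabilities
preds = probs.argmax(dim=-1)  # one predicted step label per second
```

Because the head is applied per second, the same module works for videos of any length.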
1D Convolution Layers (Conv1D).
As short-term context is important when predicting a step for each second, we use standard 1D convolution layers and apply them along the temporal dimension. In our experiments, we explore different kernel sizes in order to observe different temporal contexts (Han et al., 2020).
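A temporal 1D convolution branch of this kind might look as follows. This is a sketch under assumed hyper-parameters: the feature dimension (2049), output dimension (128), and kernel sizes (5, 25, 39) are taken from values reported later in the paper, and the "same" padding choice mirrors the statement that output length is matched to input length.

```python
import torch
import torch.nn as nn

N_FEATURES = 2049  # feature dimension produced by the 3D ConvNet
OUT_DIM = 128      # output dimension of the time-series branches
K = 25             # temporal kernel size; K=5 and K=39 are also explored

# Treat the per-second feature dimension as channels and convolve over time.
conv = nn.Conv1d(N_FEATURES, OUT_DIM, kernel_size=K, padding=(K - 1) // 2)

L = 600  # a 10-minute video, one feature vector per second
x = torch.randn(1, N_FEATURES, L)  # (batch, channels, time)
y = conv(x)                        # output length equals L via padding
```

Odd kernel sizes make the symmetric padding `(K - 1) // 2` keep the output length exactly equal to the input length.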
Long Short-Term Memory (LSTM).
While Conv1D should be able to learn the context in a short temporal region of interest, it still lacks a larger scope of view and cannot link distant information. Hence, most recent studies use LSTM networks as their time-series model. We also explore the capabilities of LSTM to transfer knowledge and use a bidirectional LSTM in our experiments, thus not assuming any causality constraints.

Figure 2: Time-Series Adaptation Network (TSAN) architecture. Combining three Conv1D layers with two LSTMs, followed by a fully connected classification layer. φ_i indicates the feature representations extracted from the 3D ConvNet. K denotes the kernel size of the 1D convolution layers.

Time-Series Adaptation Network (TSAN).
Inspired by Ghosh and Kristensson (2017), our architecture fuses three Conv1D layers and two LSTM networks into a single architecture. The three Conv1D layers operate in parallel to a single bidirectional LSTM. The outputs of all four are concatenated and fed to an additional bidirectional LSTM, followed by a fully connected layer for classification (Figure 2).
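The parallel-branch fusion described above can be sketched in PyTorch. This is an illustrative reconstruction from the text and Figure 2, not the authors' implementation; the exact layer sizes (input 2049, hidden 128, kernels 5/25/39, seven output classes) are assumptions based on values stated elsewhere in the paper.

```python
import torch
import torch.nn as nn

class TSAN(nn.Module):
    """Sketch of TSAN (Figure 2): three Conv1D branches in parallel with a
    bidirectional LSTM; their outputs are concatenated, fed to a second
    bidirectional LSTM, then classified per second."""

    def __init__(self, in_dim=2049, hidden=128, num_steps=7,
                 kernels=(5, 25, 39)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, hidden, k, padding=(k - 1) // 2)
            for k in kernels)
        self.lstm1 = nn.LSTM(in_dim, hidden, bidirectional=True,
                             batch_first=True)
        # three conv branches (hidden each) + BiLSTM output (2 * hidden)
        self.lstm2 = nn.LSTM(3 * hidden + 2 * hidden, hidden,
                             bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_steps)

    def forward(self, x):            # x: (batch, L, in_dim)
        conv_in = x.transpose(1, 2)  # Conv1d wants (batch, channels, L)
        branches = [c(conv_in).transpose(1, 2) for c in self.convs]
        lstm_out, _ = self.lstm1(x)
        fused = torch.cat(branches + [lstm_out], dim=-1)
        out, _ = self.lstm2(fused)
        return self.fc(out)          # (batch, L, num_steps) step logits

model = TSAN()
logits = model(torch.randn(1, 120, 2049))  # a 2-minute video
```

Concatenating the branches along the feature dimension lets the second BiLSTM weigh short-range convolutional context against long-range recurrent context at every second.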
Sequence Sorting (SeSo).
Compared to the other architectures, TSAN is a deeper network with more parameters to train. The fact that the target surgical datasets are relatively small, especially compared to other video benchmarks (Kay et al., 2017), led us to explore a better initialization technique as an alternative to random initialization. We establish our initialization approach using an analogy to solving jigsaw puzzles (Noroozi and Favaro, 2016). In the temporal domain, this can be structured as correctly reassembling the shuffled segments of a video. We thus formulate a self-supervised training method as an initial task for step recognition.

More concretely, we split a video into nine segments and shuffle their order randomly. Then, we process each segment's feature vectors separately with a time-series model. We concatenate all nine segments' last-layer outputs and feed the resulting representation to a classification head that predicts the correct order. We use SeSo to pre-train both the TSAN and LSTM networks. We then remove the classification layer and fine-tune the networks on the step recognition task.
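One plausible sketch of the SeSo pretext task is shown below. Framing "predict the correct order" as classification over a fixed subset of permutations is an assumption borrowed from the jigsaw-puzzle formulation of Noroozi and Favaro (2016); the encoder, its sizes, and the number of candidate permutations are all illustrative, not the authors' exact setup.

```python
import itertools
import random
import torch
import torch.nn as nn

NUM_SEGMENTS = 9
# Fixed subset of candidate permutations (an assumption; 9! is too many).
PERMS = list(itertools.islice(
    itertools.permutations(range(NUM_SEGMENTS)), 100))

encoder = nn.LSTM(2049, 128, batch_first=True)  # shared per-segment encoder
sort_head = nn.Linear(NUM_SEGMENTS * 128, len(PERMS))

features = torch.randn(900, 2049)               # one 15-minute video
segments = torch.chunk(features, NUM_SEGMENTS)  # nine equal segments
label = random.randrange(len(PERMS))
shuffled = [segments[i] for i in PERMS[label]]

# Encode each shuffled segment separately; keep its last-layer output.
last = [encoder(s.unsqueeze(0))[0][:, -1] for s in shuffled]
logits = sort_head(torch.cat(last, dim=-1))     # (1, len(PERMS))
loss = nn.functional.cross_entropy(logits, torch.tensor([label]))
```

After pre-training, `sort_head` would be discarded and the encoder fine-tuned on step recognition, mirroring the procedure described in the text.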
The training process of the 3D ConvNets is based on the work of Bar et al. (2020). Each video's features form a matrix of size L × N, where L is the length of the video (in seconds) and N is the feature dimension size (N = 2049 in all our experiments). For the Conv1D, we explore three temporal kernel sizes, K = 5, 25, and 39, and the output's length is matched to the input by padding, based on the kernel size. The output dimension of both the 1D convolution layers and the LSTM hidden layer is set to 128.

Table 1: Number of samples per subset for each of the target datasets.
                     Total  Training  Validation  Test
Right Hemicolectomy    205       123          31    51
Sleeve Gastrectomy     229       138          34    57
Appendectomy           852       511         128   213
Each network architecture was trained for 100 epochs. We use SGD and set the learning rate to 10− for the Conv1D networks and 10− for the LSTM and TSAN networks. The loss function is the negative log-likelihood loss. Since the features are extracted from the raw videos in advance, applying augmentations like those used on images is not feasible. Thus, to apply some form of data augmentation and avoid overfitting, we apply two types of augmentation to the input feature matrix. First, we detect out-of-body and non-relevant video segments by applying the method described by Zohar et al. (2020). We then mark each video second as either relevant or not, and randomly remove the non-relevant seconds from training with a probability of 0.5. We also use Dropout of 0.5, both on the input matrix and on the intermediate layers.
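The relevance-based augmentation described above might be sketched as follows. The relevance mask here is a hypothetical stand-in for the output of the out-of-body detector of Zohar et al. (2020), and the interpretation that only non-relevant seconds are dropped with probability 0.5 is our reading of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

L, N = 600, 2049
features = rng.normal(size=(L, N))  # one video's pre-extracted feature matrix
relevant = rng.random(L) < 0.9      # hypothetical per-second relevance flags

# Keep all relevant seconds; keep each non-relevant second with p = 0.5.
keep = relevant | (rng.random(L) < 0.5)
augmented = features[keep]          # shortened training sequence
```

Because the mask is resampled each epoch, every pass sees a slightly different temporal subsampling of the same video, which serves the same role as image-space augmentation.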
3. Results
All datasets were randomly split into three subsets: training, validation, and test, with a ratio of 25% for the test set and 20% of the remaining videos for validation (Table 1). We provide a detailed description of the datasets and the step workflow definitions in Appendix A. The annotation process is identical for all procedure types. Each video undergoes a rigorous annotation process by two different annotation specialists. The team of annotators underwent thorough training on labeling the workflow steps. The validity of the annotation process was confirmed in a previous study (Korndorffer Jr et al., 2020), in which an unbiased group of surgeons reviewed large portions of the Cholecystectomy cases and reported high agreement with our annotation method.
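The split ratios above can be checked against the counts in Table 1 with a short sketch. The helper below is illustrative (the actual random seed and shuffling procedure are unknown), but applying 25% test and then 20%-of-remainder validation with rounding reproduces the per-subset counts for all three target datasets.

```python
import random

def split_dataset(videos, test_frac=0.25, val_frac=0.20, seed=0):
    """Randomly split videos into (train, val, test): 25% of all videos go
    to test, then 20% of the remainder to validation. Hypothetical helper."""
    videos = list(videos)
    random.Random(seed).shuffle(videos)
    n_test = round(len(videos) * test_frac)
    test, rest = videos[:n_test], videos[n_test:]
    n_val = round(len(rest) * val_frac)
    return rest[n_val:], rest[:n_val], test

# Right Hemicolectomy has 205 videos in total (Table 1).
train, val, test = split_dataset(range(205))
print(len(train), len(val), len(test))  # 123 31 51, matching Table 1
```

The same arithmetic on 229 and 852 videos yields 138/34/57 and 511/128/213, matching the Sleeve Gastrectomy and Appendectomy rows.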
We start by searching for an optimized architecture for surgical transfer learning. As a baseline, we use the traditional technique of training a 3D ConvNet on each target dataset, followed by training a bidirectional LSTM network on the resulting features. This is fully labeled training on each type of surgery – no surgical transfer learning is applied at this stage in the process.

We then evaluate several time-series models using Cholecystectomy features from a pre-trained 3D ConvNet. In Table 2, we evaluate the various models described in Sec. 2.1. We report the test set accuracy by measuring the number of seconds in all test videos that are labeled correctly by each model. At a high level, we see that TSAN, the combination of two LSTMs and three Conv1Ds, pre-trained with our self-supervised approach (Sec. 2.1), outperformed all other methods.

Table 2: Comparing different time-series model architectures. The last column is the result of averaging the accuracy over all three target datasets: Right Hemicolectomy (RH), Sleeve Gastrectomy (SG), and Appendectomy (APPY). The first row is the result of using the standard approach without surgical transfer learning. The other rows show the development of our suggested architecture. K denotes the kernel size of the 1D convolution layers (C1D). L denotes the number of LSTM layers.
We also observe two other interesting results from the comparisons shown in Table 2. First, our transfer learning approach produces better results than the traditional training method. Second, if one does apply transfer learning in the surgical domain, our method improves the results by about 2% compared to a single LSTM network.
To further explore the generalization of our approach and its future usability for rapidly achieving high performance on smaller datasets, we study the impact of (labeled) training set size on the final accuracy results. We chose to focus on the two variants that gave the best results for transfer learning: the LSTM network and our TSAN. Both are pre-trained using the SeSo method applied to features from Cholecystectomy.

To understand accuracy as a function of dataset size, we split the videos in the training sets of Right Hemicolectomy and Sleeve Gastrectomy into smaller subsets of 5, …, 50, and all training samples (all equals 123 and 138, respectively). For Appendectomy, as it is a larger dataset, we added two additional subsets of 150 and 350 (all equals 511). The subsets are randomly selected and constructed so that each subset extends the previous smaller one. The test set is kept the same to enable a fair comparison. In Figure 3, we show the accuracy values when training with different training set sizes. Our approach with SeSo generalized better for the majority of training set sizes. Furthermore, we see consistently high accuracy achieved between 100 and 200 videos.

Figure 3: Evaluating the impact of the training set size on model generalization. (A) Right Hemicolectomy, (B) Sleeve Gastrectomy, and (C) Appendectomy. We train LSTM-SeSo and TSAN-SeSo using smaller subsets of the original training set and measure the results on a fixed test set. (D) The validation accuracy curve when training the Sequence Sorting task vs. the number of samples used during training. The model trained using the Cholecystectomy data converges much faster compared to all other procedure types.

The SeSo initialization helps improve the results of both the single LSTM and TSAN architectures. Especially for TSAN, this type of pre-training yields the best-performing architecture compared to the other possibilities (Table 2). To better understand the impact of SeSo on surgical transfer learning, we explore the effect of pre-training the time-series model on the source (Cholecystectomy) or the target datasets. We trained four TSAN models on the sorting task using all four datasets. Then, we fine-tuned the models on the step recognition task. We measure the accuracy on each of the three target datasets, first when pre-training using the source dataset (Cholecystectomy), and second when pre-training using the target dataset. Table 3 shows only a small improvement when using the source dataset to train the SeSo task.
While this is surprising, it is likely due to the fact that the Cholecystectomy dataset is larger than the others. However, it supports the notion that self-supervised initialization can effectively exploit unlabeled data, even when it does not come from the target dataset. In Figure 3.D, we plot the validation accuracy during the SeSo task training and demonstrate that the model also converges much faster on the Cholecystectomy dataset.

Table 3: Comparing step recognition accuracy results on the three target datasets after training the Sequence Sorting initialization task on either the source dataset (Cholecystectomy) or the target dataset.

Step training dataset    SeSo training dataset    Accuracy
Right Hemicolectomy      Right Hemicolectomy      94.5
Right Hemicolectomy      Cholecystectomy
Sleeve Gastrectomy       Sleeve Gastrectomy       94.2
Sleeve Gastrectomy       Cholecystectomy
Appendectomy             Appendectomy             89.9
Appendectomy             Cholecystectomy
4. Conclusion
This work suggests a new approach to training surgical step recognition models by using surgical transfer learning. We show, for the first time, an analysis of transfer learning between different surgical procedures, and our findings demonstrate that it is possible to transfer knowledge from one procedure to another, even when using relatively small target datasets. It is also the first study to explore surgical step recognition on Right Hemicolectomy and Appendectomy. To facilitate robust domain adaptation, we explore various architectures and introduce a new time-series architecture, TSAN, optimized for model adaptation in the surgical domain. Moreover, we present a Sequence Sorting task as a pre-initialization method. The main advantage of this approach, besides improving TSAN performance when transferring knowledge from one surgery type to another, is the fact that it is trained with a self-supervised method.

Future work should explore how mutual learning of surgical step recognition, trained on several procedures simultaneously, will perform. Also, the ideas of domain adaptation presented in this study could be applied to other surgery-related tasks, such as event detection, and it would be interesting to test our findings on such tasks.

Although significant progress has been made in recent years in the field of surgical intelligence, the next leap forward must focus on the practical application of artificial intelligence in the surgical domain. Solid evidence that these technologies can be generalized to various surgical procedures is essential for surgeons to embrace them as part of their daily routine, both inside and outside the operating room. We believe that surgical transfer learning and the ability to transfer knowledge between models in the surgical domain are key facilitators and will expedite the development of computer-assisted surgery in a wide range of surgical procedures. This study is a step in that direction.
References
Narges Ahmidi, Lingling Tao, Shahin Sefati, Yixin Gao, Colin Lea, Benjamin Bejar Haro, Luca Zappella, Sanjeev Khudanpur, René Vidal, and Gregory D Hager. A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Transactions on Biomedical Engineering, 64(9):2025–2041, 2017.

Hassan Al Hajj, Mathieu Lamard, Pierre-Henri Conze, Soumali Roychowdhury, Xiaowei Hu, Gabija Maršalkaitė, Odysseas Zisimopoulos, Muneer Ahmad Dedmari, Fenqiang Zhao, Jonas Prellberg, et al. CATARACTS: Challenge on automatic tool annotation for cataract surgery. Medical Image Analysis, 52:24–41, 2019.

Omri Bar, Daniel Neimark, Maya Zohar, Gregory D Hager, Ross Girshick, Gerald M Fried, Tamir Wolf, and Dotan Asselmann. Impact of data on generalization of AI for surgical intelligence applications. Scientific Reports, 10(1):1–12, 2020.

John L Cameron. William Stewart Halsted. Our surgical heritage. Annals of Surgery, 225(5):445, 1997.

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.

Bareum Choi, Kyungmin Jo, Songe Choi, and Jaesoon Choi. Surgical-tools detection based on convolutional neural network in laparoscopic robot-assisted surgery. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 1756–1759. IEEE, 2017.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

Liane S Feldman, Aurora D Pryor, Aimee K Gardner, Brian J Dunkin, Linda Schultz, Michael M Awad, and E Matthew Ritter. SAGES video-based assessment (VBA) program: a vision for life-long learning for surgeons. Surgical Endoscopy, 34:3285–3288, 2020.

Yixin Gao, S Swaroop Vedula, Carol E Reiley, Narges Ahmidi, Balakrishnan Varadarajan, Henry C Lin, Lingling Tao, Luca Zappella, Benjamín Béjar, David D Yuh, et al. JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, volume 3, page 3, 2014.

Shaona Ghosh and Per Ola Kristensson. Neural networks for text correction and completion in keyboard decoding. arXiv preprint arXiv:1709.06429, 2017.

Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191, 2020.

Daniel A Hashimoto, Guy Rosman, Elan R Witkowski, Caitlin Stafford, Allison J Navarette-Welton, David W Rattner, Keith D Lillemoe, Daniela L Rus, and Ozanan R Meireles. Computer vision analysis of intraoperative video: Automated recognition of operative steps in laparoscopic sleeve gastrectomy. Annals of Surgery, 270(3):414–421, 2019.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

Amy Jin, Serena Yeung, Jeffrey Jopling, Jonathan Krause, Dan Azagury, Arnold Milstein, and Li Fei-Fei. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 691–699. IEEE, 2018.

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

James R Korndorffer Jr, Mary T Hawn, David A Spain, Lisa M Knowlton, Dan E Azagury, Aussama K Nassar, James N Lau, Katherine D Arnow, Amber W Trickey, and Carla M Pugh. Situating artificial intelligence in surgery: a focus on disease severity. Annals of Surgery, 272(3):523–528, 2020.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

Lena Maier-Hein, Swaroop S Vedula, Stefanie Speidel, Nassir Navab, Ron Kikinis, Adrian Park, Matthias Eisenmann, Hubertus Feussner, Germain Forestier, Stamatia Giannarou, et al. Surgical data science for next-generation interventions. Nature Biomedical Engineering, 1(9):691–696, 2017.

Zhen-Liang Ni, Gui-Bin Bian, Guan-An Wang, Xiao-Hu Zhou, Zeng-Guang Hou, Xiao-Liang Xie, Zhen Li, and Yu-Han Wang. BARNet: Bilinear attention network with adaptive receptive field for surgical instrument segmentation. arXiv preprint arXiv:2001.07093, 2020.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.

E Matthew Ritter, Aimee K Gardner, Brian J Dunkin, Linda Schultz, Aurora D Pryor, and Liane Feldman. Video-based assessment for laparoscopic fundoplication: initial development of a robust tool for operative performance assessment. Surgical Endoscopy, pages 1–8, 2019.

Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Web-scale training for face identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2746–2754, 2015.

Andru P Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel De Mathelin, and Nicolas Padoy. EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2016.

Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.

Felix Yu, Gianluca Silva Croso, Tae Soo Kim, Ziang Song, Felix Parker, Gregory D Hager, Austin Reiter, S Swaroop Vedula, Haider Ali, and Shameema Sikder. Assessment of automated identification of phases in videos of cataract surgery using machine learning and deep learning techniques. JAMA Network Open, 2(4):e191860–e191860, 2019.

Odysseas Zisimopoulos, Evangello Flouty, Imanol Luengo, Petros Giataganas, Jean Nehme, Andre Chow, and Danail Stoyanov. DeepPhase: surgical phase recognition in cataracts videos. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 265–272. Springer, 2018.

Maya Zohar, Omri Bar, Daniel Neimark, Gregory D Hager, and Dotan Asselmann. Accurate detection of out of body segments in surgical video using semi-supervised learning. In Medical Imaging with Deep Learning, pages 923–936. PMLR, 2020.

Appendix A. Detailed datasets description
The datasets' characteristics are summarized in Table 4.
Right Hemicolectomy.
This dataset contains 205 videos curated from four different medical centers. Each second in the video was annotated and categorized into one of seven clinically relevant surgical steps: (1) Preparation, (2) Adhesiolysis, (3) Mobilization and Dissection, (4) Specimen Packaging, (5) Anastomosis, (6) Specimen Retrieval, and (7) Final Inspection.
Sleeve Gastrectomy.
This dataset contains 229 videos curated from two medical centers. Each second in the video was annotated and categorized into one of seven clinically relevant surgical steps: (1) Preparation, (2) Adhesiolysis, (3) Dissection of Greater Curvature, (4) Gastric Transection, (5) Reinforcement of Staple Line, (6) Specimen Extraction, and (7) Final Inspection.
Appendectomy.