LapTool-Net: A Contextual Detector of Surgical Tools in Laparoscopic Videos Based on Recurrent Convolutional Neural Networks
Babak Namazi
Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX 76019, [email protected]
Ganesh Sankaranarayanan
Department of Surgery, Baylor University Medical Center, Dallas, TX 75246, [email protected]
Venkat Devarajan
Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX 76019, [email protected]
Abstract
We propose a new multilabel classifier, called LapTool-Net, to detect the presence of surgical tools in each frame of a laparoscopic video. The novelty of LapTool-Net is the exploitation of the correlations among the usage of different tools and between the tools and tasks, i.e., the context of the tools' usage. Towards this goal, the pattern in the co-occurrence of the tools is utilized for designing a decision policy for a multilabel classifier based on a Recurrent Convolutional Neural Network (RCNN) architecture to simultaneously extract the spatio-temporal features. In contrast to the previous multilabel classification methods, the RCNN and the decision model are trained in an end-to-end manner using a multi-task learning scheme. To overcome the high imbalance and avoid overfitting caused by the lack of variety in the training data, a high down-sampling rate is chosen based on the more frequent combinations. Furthermore, at the post-processing step, the predictions for all the frames of a video are corrected by designing a bi-directional RNN to model the long-term order of the tasks. LapTool-Net was trained using a publicly available dataset of laparoscopic cholecystectomy. The results show that LapTool-Net outperformed existing methods significantly, even while using fewer training samples and a shallower architecture.
Numerous advantages of minimally invasive surgery, such as shorter recovery time, less pain and blood loss, and better cosmetic results, make it the preferred choice over conventional open surgeries [Velanovich, 2000]. In laparoscopy, the surgical instruments are inserted through small incisions in the abdominal wall and the procedure is monitored using a laparoscope. The special way of manipulating the surgical instruments and the indirect observation of the surgical scene introduce more challenges in performing laparoscopic procedures [Ballantyne, 2002]. The complexity of laparoscopy requires special training and assessment for surgery residents to gain the required bi-manual dexterity. Videos from procedures previously performed by expert surgeons can be used for such training and assessment. The tedium and cost of such an assessment can be dramatically reduced using an automated tool detection system, among other things, and is therefore the focus of this paper.
In computer-aided intervention, the surgical tools are controlled by a surgeon with the aid of a specially designed robot [Antico et al., 2019], which requires a real-time understanding of the current task. Therefore, detecting the presence, location or pose of the surgical instruments may be useful in robotic surgeries as well [Du et al., 2016], [Allan et al., 2013], [Allan et al., 2018]. Finally, the actual location and movement of the tools can be extremely useful in rating the surgeries as well as in generating an operative summary.

In order to track the surgical instruments, several approaches have been introduced, which use the signals collected during the procedure [Elfring et al., 2010], [Reiter et al., 2012]. For instance, in vision-based methods, the instruments can be localized using the videos captured during the operation [Wesierski and Jezierska, 2018]. These methods are generally reliable and inexpensive. Traditional vision-based methods rely on extracted features such as shape, color, the histogram of oriented gradients, etc., along with a classification or regression method to estimate the presence, location or pose of the instrument in the captured images or videos [Bouget et al., 2017]. However, these methods are dependent on pre-defined and painstakingly extracted hand-crafted features. Just logically defining and extracting such features alone is a major part of the detection process. Thus, these hand-crafted features and designs are not suitable for real-time applications.

Recent years have witnessed great advances in deep learning techniques in various computer vision areas such as image classification, object detection and segmentation, and in medical imaging [Litjens et al., 2017], due to the availability of large data and much improved computational power compared to the 1990s. The main advantage of deep learning methods over traditional computer vision techniques is that optimal high-level features can be directly and automatically extracted from the data. Therefore, there is a trend towards using these methods in analyzing the videos taken from laparoscopic operations [Twinanda et al., 2017a].

Compared with other surgical video tasks, detecting the presence and usage of surgical instruments in laparoscopic videos has certain challenges that need to be considered.

Firstly, since multiple instruments might be present at the same time, detecting the presence of these tools in a video frame is a multilabel (ML) classification problem. In general, ML classification is more challenging compared to the well-studied multiclass (MC) problem, where every instance is related to only one output. These challenges include, but are not limited to, using the correlation and co-existence of different objects/concepts with each other and with the background/context, and the variations in the occurrence of different objects.

Secondly, as opposed to other surgical videos, such as cataract surgery [Al Hajj et al., 2019], robot-assisted surgery [Sarikaya et al., 2017] or videos from a simulation [Zisimopoulos et al., 2017], where the camera is stationary or moving smoothly, in laparoscopic videos the camera is constantly shaking. Due to the rapid movement and changes in the field of view of the camera, most of the images suffer from motion blur and the objects can be seen in various sizes and locations. Also, the camera view might be blocked by the smoke caused by burning tissue during cutting or cauterizing to arrest bleeding.
Therefore, using still images is not sufficient for detecting the instruments.

Thirdly, surgical operations follow a specific order of tasks. Although the usage of the tools does not strictly adhere to that order, it is nevertheless highly correlated with the task being performed. Using the information about the task and the relative position of the frame with regard to the entire video, the performance of the tool detection can be improved.

Lastly, since the performance of a deep classifier in a supervised learning method is highly dependent on the size and the quality of the labeled dataset, collecting and annotating a large dataset is a crucial task.

EndoNet [Twinanda et al., 2017a] was the first deep learning model designed for detecting the presence of surgical instruments in laparoscopic videos, wherein AlexNet [Krizhevsky et al., 2012] was used as a Convolutional Neural Network (CNN) for feature extraction and was trained for the simultaneous detection of surgical phases and instruments. Inspired by this work, other researchers used different and more accurate CNN architectures with transfer learning [Sahu et al., 2016], [Prellberg and Kramer, 2018] to classify the frames based on the visual features. For example, in [Zia et al., 2016], three CNN architectures are used, and [Wang et al., 2017] proposed an ensemble of two deep CNNs. [Sahu et al., 2017] were the first to address the imbalance in the classes in an ML classification of video frames. They balanced the training set according to the combinations of the instruments. The data were re-sampled to have a uniform distribution in label-set space and class re-weighting was used to balance the data at the single-class level. Despite the improvement gained by considering the co-occurrence in balancing the training set, the correlation of the tools' usage was not considered directly in the classifier and the decision was made solely based on the presence of single tools. [Abdulbaki Alshirbaji et al., 2018] used class weights and re-sampling together to deal with the imbalance issue.

In order to consider the temporal features of the videos, Twinanda et al. employed a hidden Markov model (HMM) in [Twinanda et al., 2017a] and a Recurrent Neural Network (RNN) in [Twinanda et al., 2017b]. Sahu et al. utilized a Gaussian distribution fitting method in [Sahu et al., 2016] and a temporal smoothing method using a moving average in [Sahu et al., 2017] to improve the classification results after the CNN was trained. [Mishra et al., 2017] were the first to apply a Long Short-Term Memory model (LSTM) [Hochreiter and Schmidhuber, 1997], as an RNN, to a short sequence of frames to simultaneously extract both spatial and temporal features for detecting the presence of the tools by end-to-end training.

Other papers invoked different approaches to address the issues in detecting the presence of surgical tools. [Hu et al., 2017] proposed an attention-guided method using two deep CNNs to extract local and global spatial features. In [Al Hajj et al., 2018], a boosting mechanism was employed to combine different CNNs and RNNs.
In [Jin et al., 2018a], the tools were localized by the Faster R-CNN [Ren et al., 2015] method, after labeling the dataset with bounding boxes containing the surgical tools.

It should be noted that none of the previous methods takes advantage of any knowledge regarding the order of the tasks, and the correlations of the tools are not directly utilized in identifying different surgical instruments.

In this paper, we propose a novel system called LapTool-Net to detect the presence of surgical instruments in laparoscopic videos. The main features of the proposed model are summarized as follows:

1. Exploiting the spatial discriminating features and the temporal correlation among them by designing a deep Recurrent Convolutional Neural Network (RCNN)
2. Taking advantage of the relationship among the usage of different tools by considering their co-occurrences
3. The end-to-end training of the tool detector using a multi-task learning approach
4. Considering the inherent long-term pattern of the tools' presence via a bi-directional RNN
5. Using a small portion of the labeled samples, considering the high correlation of the video frames, to avoid overfitting
6. Addressing the imbalance issue using re-sampling and re-weighting methods
7. Providing state-of-the-art performance on a publicly available dataset on laparoscopic cholecystectomy

The remainder of the paper is organized as follows: the main approach of LapTool-Net is described in section 2 and is elaborated in section 3. The performance of LapTool-Net is evaluated through the experiments described in section 4. Section 5 concludes the paper.

The uniqueness of our approach is based on the following three original ideas:

• A novel ML classifier is proposed as a part of LapTool-Net, to take advantage of the co-occurrence of different tools in each frame – in other words, the context is taken into account in the detection process. In order to accomplish this objective, each combination of tools is considered as a separate class during training and testing and is further used as a decision model for the ML classifier. To the best of our knowledge, this is the first attempt at directly using the information about the co-occurrence of surgical tools in laparoscopic videos in the classifier's decision-making.
• The ML classifier and the decision model are trained in an end-to-end fashion. For this purpose, the training is performed by jointly optimizing the loss functions for the ML classifier and the decision model using a multi-task learning approach.
• At the post-processing step, the trained model's prediction for each video is sent to another RNN to consider the order of the usage of different tools/tool combinations and long-term temporal dependencies – yet another consideration for the context.

The overview of the proposed model is illustrated in Fig. 1.

Figure 1: Block diagram of a) the proposed multiclass classifier $F$, which consists of $f$ and $g$, b) the architecture of the Gated Recurrent Unit (GRU) and c) the bi-directional RNN for post-processing.

Let $\mathcal{D} = \{(x_{ij}, Y_{ij}) \mid 1 \leq i \leq m_j,\ 1 \leq j \leq n\}$ denote the dataset, where $x_{ij}$ is the $i$-th frame of the $j$-th video, $Y_{ij}$ is the corresponding set of tool labels, $m_j$ is the number of frames of video $j$ and $n$ is the number of videos.

In order to detect the presence of surgical instruments in laparoscopic videos, the visual features (intra-frame spatial and inter-frame temporal features) need to be extracted. We use a CNN to extract the spatial features. A CNN consists of an input layer, multiple convolutional layers, non-linear activation units and pooling layers, followed by a fully connected (FC) layer to produce the outputs, which are typically the classification results or confidence scores.
Each layer passes its results to the next layer and the weights of the convolutional and FC layers are trained using back-propagation to minimize a cost function. The output of the last convolutional layer is a lower-dimensional representation of the input and can therefore be considered as the spatial features. As shown in Fig. 1, the input frame $x_{ij}$ is sent through the trained CNN and the output of the last convolutional layer (after pooling) forms a fixed-size spatial feature vector $v_{ij}$.

In the literature, several approaches have been proposed for utilizing the temporal features in videos for tasks such as activity recognition and object detection in videos [Karpathy et al., 2014], [Simonyan and Zisserman, 2014]. For instance, when there is a high correlation among video frames, it can be exploited to improve the performance of the tool detection algorithm.

An RNN is typically used to exploit the pattern of the instruments' usage. It uses its internal memory (states) to process a sequence of inputs for time series and video processing tasks [Jin et al., 2018b]. For the current frame $x_{ij}$, the sequence of spatial features $V_{ij} = [v_{(i-\lambda \Delta t)j} \dots v_{(i-\Delta t)j}\, v_{ij}]$ is the input to the RNN, where the hyper-parameters $\lambda$ and $\Delta t$ are the number of frames in the sequence and the constant inter-frame interval, respectively. The total length of the input is no longer than one second, which ensures that the tools remain visible during that time interval. Since the tool detection model is designed to be causal and to perform in real-time, only the previous frames with respect to the current frame can be used with the RNN.

We selected the Gated Recurrent Unit (GRU) [Cho et al., 2014] as our RNN for its simplicity. The architecture is illustrated in Fig. 1(b) and formulated as:

$$\begin{aligned} z_{ij} &= \sigma(v_{ij} U^z + h_{i-\Delta t,j} W^z), \\ r_{ij} &= \sigma(v_{ij} U^r + h_{i-\Delta t,j} W^r), \\ \tilde{h}_{ij} &= \tanh(v_{ij} U^h + (r_{ij} \odot h_{i-\Delta t,j}) W^h), \\ h_{ij} &= (1 - z_{ij}) \odot h_{i-\Delta t,j} + z_{ij} \odot \tilde{h}_{ij}, \end{aligned} \qquad (1)$$

where $U$ and $W$ are the GRU weights, $\odot$ is the element-wise multiplication and $\sigma$ is the sigmoid activation function. $z$ and $r$ are the update gate and the reset gate, respectively. The final hidden state $h_{ij}$ is the output of the GRU and is the input to a fully connected neural network $FC_1$. The output layer $FC_1$ is of size $K$ (the number of tools) and, after applying the sigmoid function, produces the vector of confidence scores $P_{ij}$ for all classes.

We designed the above RCNN architecture as the ML classifier model $f$ shown in Fig. 1(a), which exploits the spatio-temporal features of a given input frame and produces the vector of confidence scores of all the tools, which in turn is the input to the decision model $g$.
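To make the structure of the classifier $f$ concrete, the following is a minimal tf.keras sketch of the GRU-over-CNN-features architecture. The GRU size, the feature dimension and the layer names are illustrative assumptions; the five-frame sequence and the sigmoid output of size $K = 7$ follow the description above.

```python
import tensorflow as tf

K = 7             # number of surgical tools in the M2CAI dataset
SEQ_LEN = 5       # current frame plus 4 previous frames (lambda = 4)
FEAT_DIM = 1024   # assumed size of the pooled feature vector v_ij from the CNN

def build_ml_classifier_f(gru_units=256):
    """Sketch of the ML classifier f: a GRU over the sequence of CNN features,
    followed by FC_1 with a sigmoid that produces per-tool confidence scores P."""
    # V_ij = [v_(i-lambda*dt)j ... v_(i-dt)j  v_ij], one pooled CNN feature per frame
    features = tf.keras.Input(shape=(SEQ_LEN, FEAT_DIM))
    h = tf.keras.layers.GRU(gru_units)(features)           # final hidden state h_ij
    scores = tf.keras.layers.Dense(K, activation="sigmoid", name="FC1")(h)
    return tf.keras.Model(features, scores)

f = build_ml_classifier_f()
f.summary()
```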
One of the main challenges in ML classification is effectively utilizing the correlation among different classes. Using the Label Powerset (LP) transformation, uncommon combinations of the classes are automatically eliminated from the output and the classifier's attention is directed towards the more probable combinations.

Not all of the $2^K$ combinations are possible in a laparoscopic surgery. Fig. 2 shows the percentage of the most likely combinations in the M2CAI dataset (http://camma.u-strasbg.fr/m2cai2016/index.php/program-challenge). The first 15 classes, out of a possible maximum of 128, span more than 99.95% of the frames in both the training and the validation sets, and the tool combinations have almost the same distribution in both cases.

Figure 2: The distribution of the combinations of the tools in the M2CAI dataset.

Since an LP classifier is MC, the cost function for training a machine learning algorithm has to be the conventional one-vs-all (categorical) loss. For example, Softmax cross-entropy (CE) is the most popular MC loss function. However, Softmax CE requires the classes to be mutually exclusive, which is not true in the LP method. In other words, while using a Softmax loss, each superclass is treated as a separate class, i.e. separate features activate a superclass. This causes performance degradation in the classifier and, therefore, more data is required for training. We address this issue by a novel use of LP as the decision model $g$, which we apply to the ML classifier $f$. Our method helps the classifier to consider our superclasses as combinations of classes rather than as separate mutually exclusive classes.

The decision model is a fully connected neural network ($FC_2$), which takes the confidence scores of $f$ and maps them to the corresponding superclass. When the Softmax function is applied, the output of $g(\cdot)$ is the probability of each superclass $Q = (q_1, q_2, \dots, q_{\hat{K}})$, where $\hat{K}$ is the size of the superclass set. The final prediction of the tool detector $F$ is the index of the superclass with the highest probability, and for frame $i$ of video $j$ it is calculated as:

$$c_{ij} = \arg\max(Q_{ij}) \qquad (2)$$
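As a small illustration of the decision model $g$ and of Eq. (2), assuming the 15-superclass setting used for M2CAI, a single softmax layer over the confidence scores is sufficient; the layer name and the random placeholder input below are only for the example.

```python
import tensorflow as tf

K = 7        # number of tools
K_HAT = 15   # number of superclasses (tool combinations) kept for M2CAI

def build_decision_model_g():
    """Sketch of the decision model g: FC_2 with a softmax that maps the K
    confidence scores of f to the K_hat superclass probabilities Q."""
    scores = tf.keras.Input(shape=(K,))
    q = tf.keras.layers.Dense(K_HAT, activation="softmax", name="FC2")(scores)
    return tf.keras.Model(scores, q)

g = build_decision_model_g()

# Eq. (2): the final prediction is the index of the most probable superclass.
p_ij = tf.random.uniform((1, K))       # placeholder confidence scores from f
c_ij = tf.argmax(g(p_ij), axis=-1)
```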
Class imbalance has been a well-studied area in machine learning for a long time [Buda et al., 2018]. It is known that in skewed datasets, the classifier's decision is inclined towards the majority classes. Therefore, it is always beneficial to have a uniform distribution of the classes during training. Two major approaches have been proposed in the literature to deal with imbalanced datasets.

One approach is called cost-sensitive and is mainly based on class re-weighting. In this method, the outputs of the classes or the loss function during training are weighted based on the frequency of the classes. Although this approach works in some cases, the choice of the weights might not depend solely on the distribution of the data, since the complexity of the input is not known before training. Thus, class weights are another set of hyper-parameters that needs to be determined.

Another solution to an imbalanced dataset is to change the distribution of the input. This can be accomplished using over-sampling for the minority classes and under-sampling for the majority classes. However, in ML classification, finding a balancing criterion for re-sampling is challenging [Charte et al., 2015], since a change in the number of samples for one class might affect other classes as well.

The number of samples for each tool before balancing is shown in Table 1. In order to overcome this issue, we perform under-sampling to obtain a uniform distribution of the combinations of the classes. The main advantage of under-sampling over other re-sampling methods is that it can also be applied to avoid the overfitting caused by the high correlation between the neighboring frames of a laparoscopic video. Therefore, we try different under-sampling rates to find the smallest training set that does not sacrifice the performance.

Table 1: Number of frames for each tool in M2CAI

Tool           Train    Validation
Bipolar          631       331
Clipper          878       315
Grasper        10367      6571
Hook           14130      7454
Irrigator        953       131
Scissors         411       158
Specimen Bag    1504       483
No tools        2759      1888
Total          23421     12512

Since this approach will not guarantee balance, a cost-sensitive weighting approach can be used along with an ML loss, prior to the LP decision layer; nonetheless, we empirically found that this does not affect the performance of the ML classifier.

Figure 3 shows the relationship among the tools after re-sampling. It can be seen that the LP-based balancing method not only tends towards a uniform distribution in the superclass space, it also improves the balance of the dataset in the single-class space (with the exception of Grasper, which can be used with all the other tools).

Figure 3: The chord diagrams for the relationship between the tools (a) before and (b) after balancing based on the tools' co-occurrences.
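A minimal sketch of this combination-based under-sampling is shown below, assuming the frames and their tool label-sets are available as Python lists; the function name and the per-combination budget are illustrative, and the over-sampling used for the larger training sets in the experiments is omitted.

```python
import random
from collections import defaultdict

def balance_by_combination(frames, label_sets, keep=15, per_superclass=400):
    """Sketch: group frames by their tool combination (superclass), keep the
    `keep` most frequent combinations and draw at most `per_superclass` frames
    from each, yielding a roughly uniform distribution in label-set space."""
    groups = defaultdict(list)
    for frame, tools in zip(frames, label_sets):
        groups[frozenset(tools)].append(frame)            # superclass = set of tools
    most_frequent = sorted(groups, key=lambda s: len(groups[s]), reverse=True)[:keep]
    balanced = []
    for superclass in most_frequent:
        pool = groups[superclass]
        n = min(per_superclass, len(pool))                # under-sample majority combinations
        balanced.extend((f, sorted(superclass)) for f in random.sample(pool, n))
    random.shuffle(balanced)
    return balanced
```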
Since our tool detector $F(x)$ is decomposed into an ML classifier $f$ and an MC decision model $g$, the requirements of both models need to be considered during training. In order to accomplish this, the model is trained using both ML and MC loss functions.

We propose to use a joint training paradigm for optimizing the ML and MC losses as a multi-task learning approach. In order to do that, two optimizers are defined based on the two losses, with separate hyper-parameters such as learning rate and trainable weights. Using this technique, the extraction of the features is accomplished based on the final prediction of the model.

Having the vector of confidence scores $P$, the ML loss $L_f$ is the sigmoid cross-entropy and is formulated as:

$$L_f = -\frac{1}{d} \sum_{x \in \mathcal{D}} \log(p_{k=Y}), \qquad P = \sigma(f(x)) \qquad (3)$$

where $Y$ is the correct label for frame $x$, $d$ is the total number of frames and $\mathcal{D}$ is the training set. The Softmax CE loss function $L_g$ for the decision model is formulated as:

$$L_g = -\frac{1}{d} \sum_{x \in \mathcal{D}} \log(q_{k=\hat{Y}}), \qquad Q = \mathrm{softmax}(g(f(x))) \qquad (4)$$

The total loss function is the sum of the two losses and is formulated as:

$$L = L_f + \beta L_g \qquad (5)$$

where $\beta$ is a constant weight for adjusting the impact of the two loss functions. The training is performed in an end-to-end fashion using the backpropagation through time method.
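The paper trains the two terms with two separate optimizers; as a simplified, hedged illustration of Eqs. (3)-(5), the sketch below just combines them into a single scalar loss, with the value of $\beta$ assumed.

```python
import tensorflow as tf

BETA = 1.0   # assumed value of the weight beta in Eq. (5)

def joint_loss(y_tools, y_superclass, p_scores, q_probs):
    """Sketch of Eqs. (3)-(5): sigmoid cross-entropy on the per-tool scores of f
    plus a weighted softmax cross-entropy on the superclass probabilities of g."""
    # Eq. (3): multilabel (sigmoid) CE over the K tools; y_tools is multi-hot
    l_f = tf.keras.losses.binary_crossentropy(y_tools, p_scores)
    # Eq. (4): multiclass (softmax) CE over the superclasses; y_superclass is an index
    l_g = tf.keras.losses.sparse_categorical_crossentropy(y_superclass, q_probs)
    # Eq. (5): both terms are averaged over the training batch
    return tf.reduce_mean(l_f) + BETA * tf.reduce_mean(l_g)
```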
The final decision of the RCNN model from the previous section is made based on the extracted spatio-temporal features from a short sequence of frames. In other words, the model benefits from a short-term memory using the correlation among the neighboring frames. However, due to the high under-sampling rate for the balanced training set, this method might not produce a smooth prediction over the entire duration of the laparoscopic videos. In order to deal with this issue, we model the order of the usage of the tools with an RNN over all the frames of each video [Namazi et al., 2018].

Due to memory constraints, the final prediction from equation (2) of the RCNN, $\bar{C}(j) = [c_{1j} \dots c_{m_j j}]$ for all the videos $1 \leq j \leq n$, is selected as the input for the post-processing RNN. Since not all the videos have the same length, the shorter videos are padded with the no-tool class.

Our post-processing occurs offline, after the surgery is finished. Therefore, future frames can also be used along with past frames to improve the classification results of the current frame. In order to accomplish this, a bi-directional RNN is employed, which consists of two RNNs for the forward and backward sequences. The backward sequence is simply the reverse of $\bar{C}$. The outputs of the bi-RNN are concatenated and fed to an FC layer for the final prediction ($g'$ in Fig. 1(c)).

Since the input frames for the bi-RNN are in a specific order, it is not possible to balance the input through re-sampling. Therefore, class re-weighting is performed to compensate for the minority classes. The class weights are chosen to be proportional to the inverse of the frequency of the superclasses. The loss function is:

$$L_p = -\frac{1}{d} \sum_{x \in \mathcal{D}} w_{k=\hat{Y}} \log \bar{q}_{k=\hat{Y}}, \qquad \bar{Q} = \mathrm{softmax}(g'(\bar{C})) \qquad (6)$$

where $w_k$ is the weight for the superclass $k$, $g'$ is the bi-RNN with 64 hidden units and $\bar{Q} = (\bar{q}_1, \dots, \bar{q}_{\hat{K}})$ is the superclass probability vector.
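A hedged sketch of the post-processing model $g'$ and of the weighted loss in Eq. (6) is given below; the 64 hidden units follow the text, while the padded sequence length and the embedding of the integer predictions are assumptions, since the text does not specify how the class indices are encoded before the bi-RNN.

```python
import tensorflow as tf

K_HAT = 15        # number of superclasses
MAX_LEN = 5000    # assumed padded length of a per-video prediction sequence

def build_postprocessing_birnn(hidden=64, embed_dim=16):
    """Sketch of the offline model g': a bi-directional GRU over the whole
    sequence of per-frame superclass predictions C_bar of one video."""
    preds = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")      # c_1j ... c_mj, padded
    x = tf.keras.layers.Embedding(K_HAT, embed_dim)(preds)       # assumed encoding of the indices
    h = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(hidden, return_sequences=True))(x)   # forward + backward, concatenated
    q_bar = tf.keras.layers.Dense(K_HAT, activation="softmax")(h)
    return tf.keras.Model(preds, q_bar)

def weighted_softmax_ce(class_weights):
    """Eq. (6): softmax CE weighted by the inverse superclass frequencies."""
    w = tf.constant(class_weights, dtype=tf.float32)
    def loss(y_true, q_bar):
        ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, q_bar)
        return tf.reduce_mean(tf.gather(w, tf.cast(y_true, tf.int32)) * ce)
    return loss
```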
In this section, the performance of the different parts of the proposed tool detection model on the M2CAI dataset is validated through numerous experiments using the appropriate metrics.

We selected TensorFlow [Abadi et al., 2016] for all of the experiments. The CNN in all the experiments was Inception V1 [Szegedy et al., 2015]. In order to have better generalization, extensive data augmentation, such as random cropping, horizontal and vertical flipping, rotation and a random change in brightness, contrast, saturation and hue, was performed during training. The initial learning rate was 0.001 with a decay rate of 0.7 after 5 epochs and the results were taken after 100 epochs. The batch size was 32 for training the CNN models and 40 for the RNN-based models. All the experiments were conducted using an Nvidia TITAN XP GPU. The source code of the project is available on GitHub.

Since the proposed model is MC, the corresponding evaluation metrics were chosen. Due to the high imbalance of the validation dataset, accuracy alone is not sufficient to evaluate the proposed model. Therefore, we used the F1 score to compare the performance of different models in both per-class and overall metrics. These are calculated as:

$$F_{macro} = \frac{2\, P_{pc}\, R_{pc}}{P_{pc} + R_{pc}}, \qquad F_{micro} = \frac{2\, P_{ov}\, R_{ov}}{P_{ov} + R_{ov}} \qquad (7)$$

where $P_{pc}$, $R_{pc}$, $P_{ov}$ and $R_{ov}$ are the per-class precision/recall and the overall precision/recall, respectively, and are calculated as:

$$P_{pc} = \frac{1}{K} \sum_{k=1}^{K} \frac{N_c^{y_k}}{N_p^{y_k}}, \qquad R_{pc} = \frac{1}{K} \sum_{k=1}^{K} \frac{N_c^{y_k}}{N^{y_k}} \qquad (8)$$

$$P_{ov} = \frac{\sum_{k=1}^{K} N_c^{y_k}}{\sum_{k=1}^{K} N_p^{y_k}}, \qquad R_{ov} = \frac{\sum_{k=1}^{K} N_c^{y_k}}{\sum_{k=1}^{K} N^{y_k}} \qquad (9)$$

where $N_c^{y_k}$, $N_p^{y_k}$ and $N^{y_k}$ are the number of correctly predicted frames for class $k$, the total number of frames predicted as $k$, and the total number of frames for class $k$, respectively. Only frames with all the tools predicted correctly are considered exact matches.

In order to evaluate the RCNN model $f$, we used the ML metric mean Average Precision (mAP), which is the mean of the average precision (a weighted average of the precision with the recall at different thresholds) over all 7 tools.
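These metrics translate directly into a few lines of NumPy; the sketch below assumes the ground truth and predictions are multi-hot arrays of shape (frames, K) and leaves out mAP, which additionally needs the raw confidence scores.

```python
import numpy as np

def tool_metrics(y_true, y_pred):
    """Sketch of Eqs. (7)-(9): per-class and overall precision/recall,
    macro/micro F1 and exact-match accuracy for multi-hot tool labels."""
    tp = np.logical_and(y_true == 1, y_pred == 1).sum(axis=0)   # N_c  per tool
    pred_pos = (y_pred == 1).sum(axis=0)                        # N_p  per tool
    pos = (y_true == 1).sum(axis=0)                             # N    per tool
    p_pc = np.mean(tp / np.maximum(pred_pos, 1))                # Eq. (8)
    r_pc = np.mean(tp / np.maximum(pos, 1))
    p_ov = tp.sum() / max(pred_pos.sum(), 1)                    # Eq. (9)
    r_ov = tp.sum() / max(pos.sum(), 1)
    f1_macro = 2 * p_pc * r_pc / (p_pc + r_pc)                  # Eq. (7)
    f1_micro = 2 * p_ov * r_ov / (p_ov + r_ov)
    exact_match = np.mean(np.all(y_true == y_pred, axis=1))
    return {"F1-macro": f1_macro, "F1-micro": f1_micro, "exact match": exact_match}
```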
In the first experiments, we assumed that the classifier $f$ is a CNN and the decision model $g$ is applied to the resulting scores from the CNN. Since the dataset was labeled only for one frame per second (out of 25 frames/sec), there was a possibility of using the unlabeled frames for training, as long as the tools remain the same between two consecutive labeled frames. We used this unlabeled data to balance the training set, according to the LPs. The CNN was trained with the loss function (3) with $FC_1$ of size 7. Table 2 shows the results of our CNN with different training set sizes for the tools listed in Table 1.

Table 2: Results for the multi-label classification of the CNN

Total Frames   Balanced   Acc(%)   mAP(%)   F1-macro(%)
23k            No         77.23    61.02    58.48
150k           Yes        75.90    71.15    70.49
75k            Yes        74.78    77.24    74.81
25k            Yes        75.40    78.58    74.64
6k             Yes        74.36    78.22    74.43
3k             Yes        73.10    73.69    70.85

As was to be expected, the unbalanced training set results shown in the first row of Table 2 have the lowest performance on all the metrics. The high exact match accuracy (Acc) of 77.23% and the lower results on per-class metrics, such as F1-macro and mAP, show that the model correctly predicted the majority classes (grasper and hook) but has poor performance for the less used tools such as scissors.

In order to balance the datasets, the following specific steps were taken: 1) 15 superclasses were selected and the original frames were re-sampled to have a uniform distribution in the set of label-sets $\hat{Y}$. The number of frames for each superclass was randomly selected to be 10,000, 5,000, 1,666, 400 and 200. 2) Multiple copies of some frames were copied and pasted into the final set in the first two training sets, because of the availability of fewer frames for some tools such as scissors. This accomplished the intended over-sampling. 3) Similarly, under-sampling was performed in at least one class in all sets, and in all classes in the last two sets, because too many frames were available for some tools.

Under these conditions, we can discuss the results presented in rows 2 through 6 of Table 2. While the exact match accuracy is the highest for the 150K set, it has the lowest score on the per-class metrics. The likely reason is the high over-sampling rate, which causes overfitting for the less frequently occurring classes. On the other hand, a very high under-sampling rate in the 3K set results in lower accuracy, likely due to the lack of informative samples.

The best per-class results are for the 25K/6K versus the 150K/75K sets, which is due to the lower correlation among the inputs of the CNN during training. We used the 6K dataset for the rest of the experiments instead of the 25K set, because adding the RNN and the decision model to the selected CNN would increase the size of the model (RCNN-LP), and the chance of overfitting increases.

In order to evaluate the effect of utilizing the co-occurrence of different surgical tools, we tested the LP method as the primary classifier, as well as the decision model, using different training strategies. The configurations for each experiment are shown in Table 3.

Table 3: Setup configurations for training the multiclass CNN

Exp. num   Loss function   FC1 size   FC2 size   Trainable weights     Training method
1          (4)             15         -          all                   -
2 (150k)   (4)             15         -          all                   -
3          (3)/(4)         15         15         CNN+FC1 / FC2         Sequential
4          (3)/(4)         7          15         CNN+FC1 / CNN+FC2     Alternate
5          (3)/(4)         7          15         CNN+FC1 / all         Alternate
6          (3)/(4)         7          15         CNN+FC1 / FC2         Sequential
7          (5)             7          15         all                   Joint

In sequential training, the CNN was trained first, and the decision model was added on top of the trained model, while the CNN weights remained unchanged. In alternate training, the trainable weights change with the loss at every other step. The joint training method was explained in the previous section. We used MC metrics: exact match accuracy, micro and macro F1, and average per-class precision and recall. The results are shown in Table 4 and the precision and recall for each tool are shown in Table 5.

Table 4: Results for the multiclass CNNs

Exp. num   Acc(%)   F1-macro(%)   F1-micro(%)   Mean P(%)   Mean R(%)
1          70.01    69.14         84.57         72.90       67.98
2          76.13    73.77         87.89         86.08       67.24
3          73.18    74.30         86.92         79.65       70.80
4          74.42    75.70         87.75         82.37       71.48
5          72.44    75.23         86.42         87.60       67.25
6          74.97    75.47         88.04         80.67       73.21
7          76.31    78.32         88.53         78.48       78.95

In the first two experiments, the LP method was used directly to map the video frames to the corresponding superclass. In order to accomplish this, the features extracted using the CNN were connected to an FC layer of size 15 and the network was trained with the loss function from equation (4). We selected the balanced training sets from the previous experiments with 6K and 150K samples. It can be seen from experiments 1 and 2 (Table 4), which correspond to the 6K and 150K samples respectively, that both the accuracy and the F1 scores increase when the training set is larger. Also, the precision and recall in Table 5 show some improvements in almost all classes. However, compared with the results from Table 2, we observe minor improvements in accuracy and F1 when using 150K frames with the LP classifier, while the metrics decrease with a smaller training set. Considering that both training sets were balanced based on LP, this observation suggests that the LP-based classifier needs more examples for reasonable performance. This is because, in an LP classifier, the superclasses are treated as separate classes with different features from the corresponding single-label classes, which requires the classifier to have more training examples to learn the discriminating features. This can also be confirmed by checking the relatively close precision/recall for grasper and hook in Table 5, which have more unique frames (due to a lower under-sampling rate), in the two experiments.

In experiment 3, the ML loss was tried instead of the MC loss for training the LP classifier with 15 superclasses. $FC_2$ was added as a decision model and was trained sequentially. As shown in Table 4, the per-class F1 score for experiment 3 improves compared to experiment 1, while the exact match accuracy is lower. This is probably because the model is still not aware that a superclass is a combination of multiple classes.

In experiments 4, 5 and 6, the CNN was trained using the ML loss (3) with 7 classes and the decision layer was added on top of the confidence scores. We evaluated the model using different training strategies. All three of these experiments produced better results than experiment 3. This is likely because the model can learn the pattern of the 7 tools more easily with the ML loss, compared with learning the pattern of the combination classes using 15 classes.

Table 5: Precision (P(%)) and Recall (R(%)) of each tool for the multiclass CNNs

        Bipolar        Clipper        Grasper        Hook
Exp.    P      R       P      R       P      R       P      R
1       71.2   66.0    72.7   58.4    90.3   70.9    92.8   90.6
2       76.5   35.7    85.4   52.0    92.0   80.3    95.2   90.0
3       84.4   71.5    78.0   57.4    91.0   75.8    94.6   90.9
4       81.5   70.8    80.7   60.0    89.9   79.8    95.3   90.1
5       91.7   56.5    81.2   53.6    91.3   75.9    97.5   86.1
6       75.1   75.1    69.2   59.3    90.0   81.5    95.2   90.3
7       83.9   72.6    72.6   74.2    91.1   81.2    94.3   91.4

        Irrigator      Scissors       Specimen bag
Exp.    P      R       P      R       P      R
1       56.7   65.5    61.9   32.9    64.6   91.4
2       85.8   74.6    93.8   48.1    73.6   89.8
3       57.0   53.2    79.6   54.4    72.8   92.1
4       63.3   56.5    88.7   50.0    77.0   93.0
5       78.2   59.0    93.1   51.8    79.9   87.5
6       72.8   70.4    87.8   41.1    74.3   94.4
7       59.7   75.4    71.6   63.9    75.8   93.7

The point of performing experiments 4 and 5 was to evaluate the effect of the decision model in training the feature extractor and the ML classifier. In both experiments, the decision model $g$ and the CNN were trained alternately. The weights of the CNN were frozen in experiment 6, while in experiments 4 and 5 they were trained at each step. Therefore, in experiment 6, the role of the decision model was to just use the co-occurrence information to find the correct classes (superclasses) using the confidence scores of a trained model. The results show improvement in the F1 scores in all three experiments compared with the results from Table 2. This is due to the fact that, by using LP as the decision model, the co-occurrence of surgical tools in each frame is considered directly in the classification method without learning separate patterns for the superclasses.

In experiment 7, the loss is the weighted sum of the ML and MC loss functions and the training was performed on all the weights of the model. We can see that the end-to-end training of the RCNN-LP produces significantly better results compared with all other training methods, such as sequential and alternate training. The reason is that in end-to-end training, all parts of the model are trained simultaneously to reach better confidence scores and hence a better final decision.

In this section, the performance of the proposed model is evaluated after considering the spatio-temporal features using an RNN. Similar to the previous section, we tested the model before and after adding the decision model. The dataset for training is the 6K balanced set and all the models were trained end-to-end. For training the RCNN model, we used 5 frames at a time (the current frame and 4 previous frames) with an inter-frame interval of 5, which resulted in a total distance of 20 frames between the first and the last frame. The RCNN model was trained with a Stochastic Gradient Descent (SGD) optimizer. The data augmentation for the post-processing model includes adding random noise to the input and randomly dropping frames to change the duration of the sequences.

Table 6 shows the results of the proposed LapTool-Net. For ease of comparison, we have copied the results from the previous section for the CNN with and without the LP decision model. It can be seen that, by considering the temporal features through the RCNN model, the exact match accuracy and the F1 scores were improved. The higher performance of LapTool-Net is due to the utilization of the frames from both the past and the future of the current frame, as well as the long-term order of the usage of the tools, by the bi-directional RNN.

Table 6: Final results for the proposed model

Model         Acc(%)   F1-macro(%)   F1-micro(%)
CNN           74.36    74.43         87.70
CNN-LP        76.31    78.32         88.53
RCNN          77.51    81.95         89.54
RCNN-LP       78.58    84.89         89.79
LapTool-Net   80.96    89.11         91.35

The precision, recall and F1 score for each of the tools are shown in Table 7. Compared with the results from Table 5, we can see that the F1 scores for the clipper and scissors have significantly increased, because there is a high correlation between the usage of these tools and the tasks, i.e. the order of the occurrence of the tools (e.g. cutting only happens after clipping is completed). The lowest performance is for the Irrigator, which is probably because of the irregular pattern in its use (it is only used for bleeding and coagulation, which can occur at any time during the surgery). The higher overall recall is likely because of the class re-weighting method. We believe the performance could improve with a better choice of the weights.

Table 7: The precision, recall and F1 score of each tool for LapTool-Net

Tool           Precision   Recall   F1
Bipolar        0.82        0.95     0.88
Clipper        0.85        0.98     0.91
Grasper        0.89        0.88     0.89
Hook           0.94        0.94     0.94
Irrigator      0.74        0.91     0.82
Scissors       0.82        0.99     0.90
Specimen bag   0.92        0.89     0.91
Mean           0.85        0.93     0.89
In order to localize the predicted tools, the attention maps were visualized using the grad-CAM method [Selvaraju et al., 2017]. The results for some of the frames are shown in Figure 4. In order to avoid confusion with frames that contain multiple tools, only the class activation map of a single tool is shown, based on the prediction of the model. The results show that the visualization of the attention of the proposed model can also be used to reliably identify the location of each tool.

Figure 4: The visualization of the class activation maps for some examples, based on the prediction of the model (panels a-l: Grasper, Grasper/Hook, Grasper/Clipper, Grasper/Scissors, Grasper/Hook, Grasper/Bipolar, Grasper, Hook, Grasper, Scissors, Hook and Bipolar).
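For reference, the grad-CAM visualization used above can be reproduced with a short gradient computation; the sketch below assumes a tf.keras model whose last convolutional layer is named "last_conv" (an illustrative name) and returns a normalized heat map for one tool.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, tool_index, last_conv_layer="last_conv"):
    """Sketch of Grad-CAM [Selvaraju et al., 2017]: weight the last convolutional
    feature maps by the gradient of one tool's confidence score to localize it."""
    conv = model.get_layer(last_conv_layer)                     # assumed layer name
    grad_model = tf.keras.Model(model.inputs, [conv.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, scores = grad_model(image[np.newaxis])
        target = scores[:, tool_index]                          # score of the predicted tool
    grads = tape.gradient(target, conv_out)                     # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))                # global-average-pool the gradients
    cam = tf.nn.relu(tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1))[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()          # normalized heat map
```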
In order to validate the proposed model, we compared it with previously published research on the M2CAI dataset. The results are shown in Table 8. Since all the methods reported their results using ML metrics such as mAP, we compared our ML classifier $f$, which is the RCNN model, along with the final model. We show that our model outperformed previous methods by a significant margin, even when choosing a relatively shallower model (Inception V1) and while using less than 25% of the labeled images.

Table 8: Comparison of tool presence detection methods on M2CAI

Method                    CNN                                   mAP(%)   F1-Macro(%)
LapTool-Net               Inception-V1                          -        -
[Hu et al., 2017]         Resnet-101 [He et al., 2016]          86.9     -
[Sahu et al., 2017]       Alexnet                               65       -
[Wang et al., 2017]       Inception-V3 [Szegedy et al., 2016]   63.8     -
[Sahu et al., 2016]       Alexnet                               61.5     -
[Twinanda et al., 2016]   Alexnet                               52.5     -

The observation by surgical residents of the usage of specific surgical instruments and the duration of their usage in laparoscopic procedures gives great insight into how the surgery is performed. While identifying the tools in a recorded video of a surgery is a trivial, albeit tedious, task for an average human, there are certain challenges in detecting the tools using computer vision algorithms. In order to tackle these challenges, in this paper we proposed a novel deep learning system called LapTool-Net for automatically detecting the presence of surgical tools in every frame of a laparoscopic video.

The main feature of the proposed RCNN model is its context-awareness, i.e. the model learns the short-term and long-term patterns of the tools' usage by utilizing the correlation between the usage of the tools with each other and with the surgical steps, which follow a specific order. To achieve this goal, an LP-based model is used as a decision layer for the ML classifier and the training is performed in an end-to-end fashion. The advantage of this paradigm over a direct LP classifier is that the training can be accomplished with a smaller dataset, due to having fewer classes and avoiding learning separate (and probably not useful) patterns for the superclasses. Furthermore, the order of occurrence of the tools is extracted through training a bi-RNN with the final prediction of a trained RCNN model. To overcome the high imbalance in the occurrence of the tools, we used under-sampling based on the tools' combinations and the LP model. In addition to providing a balanced dataset, the high under-sampling rate reduces the generalization error by avoiding overfitting, which is the main challenge in tool detection, due to the high correlation among video frames. Our method outperformed all previously published results on the M2CAI dataset, while using less than 1% of the total frames in the training set.

While our model is designed based on previous knowledge of the cholecystectomy procedure, it does not require any domain-specific knowledge from experts and can be effectively applied to any video captured from laparoscopic or even other forms of surgeries. Also, the relatively small dataset after under-sampling suggests that the labeling process can be accomplished faster by using fewer frames (e.g. one frame every 5 seconds). Moreover, the simple architecture of the proposed LP-based classifier makes it easy to use with other proposed models such as [Al Hajj et al., 2018] and [Hu et al., 2017], or with weakly supervised models [38] to localize the tools in the frames. Moreover, the offline design can be useful in generating a summary report, assessment and procedure rating, etc. Also, the proposed model in online mode has a processing time of less than 0.01 seconds/frame, which makes it suitable for real-time applications such as feedback generation during surgery.

We plan on implementing a few ways to improve the performance of the proposed model. Firstly, the CNN can be replaced by a deeper and more accurate model. In particular, we will use Inception-Resnet-V2 [Szegedy et al., 2017] for the CNN and the cholec80 dataset [Twinanda et al., 2017a] for training. Secondly, since the RNN does not extract the unique motion features of the tools, it can be replaced by a 3D CNN. The other way to improve the results is to choose better hyper-parameters, especially the class weights for balancing.

In future work, we will investigate applying the findings of this paper to designing a semi-supervised learning based model [Cheplygina et al., 2019], using only a fraction of the videos being labeled.

Acknowledgments

This work was supported by a Joseph Seeger Surgical Foundation award from the Baylor University Medical Center at Dallas. The authors would like to thank NVIDIA Inc. for donating the TITAN XP GPU through the GPU grant program.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Technical report, Google, 2016.

Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, and Knut Möller. Surgical Tool Classification in Laparoscopic Videos Using Convolutional Neural Network. Current Directions in Biomedical Engineering, 4(1):407–410, 9 2018. ISSN 2364-5504. doi: 10.1515/cdbme-2018-0097.

Hassan Al Hajj, Mathieu Lamard, Pierre-Henri Conze, Béatrice Cochener, and Gwenolé Quellec. Monitoring tool usage in surgery videos using boosted convolutional and recurrent neural networks. Medical Image Analysis, 47:203–218, 7 2018. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2018.05.001.
Hassan Al Hajj, Mathieu Lamard, Pierre-Henri Conze, Soumali Roychowdhury, Xiaowei Hu, Gabija Maršalkaitė, Odysseas Zisimopoulos, Muneer Ahmad Dedmari, Fenqiang Zhao, Jonas Prellberg, Manish Sahu, Adrian Galdran, Teresa Araújo, Duc My Vo, Chandan Panda, Navdeep Dahiya, Satoshi Kondo, Zhengbing Bian, Arash Vahdat, Jonas Bialopetravičius, Evangello Flouty, Chenhui Qiu, Sabrina Dill, Anirban Mukhopadhyay, Pedro Costa, Guilherme Aresta, Senthil Ramamurthy, Sang-Woong Lee, Aurélio Campilho, Stefan Zachow, Shunren Xia, Sailesh Conjeti, Danail Stoyanov, Jogundas Armaitis, Pheng-Ann Heng, William G. Macready, Béatrice Cochener, and Gwenolé Quellec. CATARACTS: Challenge on automatic tool annotation for cataRACT surgery. Medical Image Analysis, 52:24–41, 2 2019. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2018.11.008.

M. Allan, S. Ourselin, S. Thompson, D. J. Hawkes, J. Kelly, and D. Stoyanov. Toward Detection and Localization of Instruments in Minimally Invasive Surgery. IEEE Transactions on Biomedical Engineering, 60(4):1050–1058, 4 2013. ISSN 0018-9294. doi: 10.1109/TBME.2012.2229278.

M. Allan, S. Ourselin, D. J. Hawkes, J. D. Kelly, and D. Stoyanov. 3-D Pose Estimation of Articulated Instruments in Robotic Minimally Invasive Surgery. IEEE Transactions on Medical Imaging, 37(5):1204–1213, 5 2018. ISSN 0278-0062. doi: 10.1109/TMI.2018.2794439.

Maria Antico, Fumio Sasazawa, Liao Wu, Anjali Jaiprakash, Jonathan Roberts, Ross Crawford, Ajay K. Pandey, and Davide Fontanarosa. Ultrasound guidance in minimally invasive robotic procedures. Medical Image Analysis, 54:149–167, 5 2019. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2019.01.002.

Garth H Ballantyne. The pitfalls of laparoscopic surgery: challenges for robotics and telerobotic surgery. Surgical Laparoscopy, Endoscopy & Percutaneous Techniques, 12(1):1–5, 2 2002. ISSN 1530-4515.

David Bouget, Max Allan, Danail Stoyanov, and Pierre Jannin. Vision-based and marker-less surgical tool detection and tracking: a review of the literature. Medical Image Analysis, 35:633–654, 1 2017. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2016.09.003.

Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 10 2018. ISSN 0893-6080. doi: 10.1016/J.NEUNET.2018.07.011.

Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163:3–16, 9 2015. ISSN 0925-2312. doi: 10.1016/J.NEUCOM.2014.08.091.

Veronika Cheplygina, Marleen de Bruijne, and Josien P.W. Pluim. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Medical Image Analysis, 54:280–296, 5 2019. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2019.03.009.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 6 2014.

Xiaofei Du, Maximilian Allan, Alessio Dore, Sebastien Ourselin, David Hawkes, John D. Kelly, and Danail Stoyanov. Combined 2D and 3D tracking of surgical instruments for minimally invasive and robotic-assisted surgery. International Journal of Computer Assisted Radiology and Surgery, 11(6):1109–1119, 6 2016. ISSN 1861-6410. doi: 10.1007/s11548-016-1393-4.
Robert Elfring, Matías de la Fuente, and Klaus Radermacher. Assessment of optical localizer accuracy for computer aided surgery systems. Computer Aided Surgery, 15(1-3):1–12, 2 2010. ISSN 1092-9088. doi: 10.3109/10929081003647239.

Eva Gibaja and Sebastián Ventura. A Tutorial on Multilabel Learning. ACM Computing Surveys, 47(3):1–38, 4 2015. ISSN 03600300. doi: 10.1145/2716262.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 6 2016. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.90.

Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 11 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735.

Xiaowei Hu, Lequan Yu, Hao Chen, Jing Qin, and Pheng-Ann Heng. AGNet: Attention-Guided Network for Surgical Tool Presence Detection. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 186–194. Springer, Cham, 2017. doi: 10.1007/978-3-319-67558-9_22.

Amy Jin, Serena Yeung, Jeffrey Jopling, Jonathan Krause, Dan Azagury, Arnold Milstein, and Li Fei-Fei. Tool Detection and Operative Skill Assessment in Surgical Videos Using Region-Based Convolutional Neural Networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 691–699. IEEE, 3 2018a. ISBN 978-1-5386-4886-5. doi: 10.1109/WACV.2018.00081.

Yueming Jin, Qi Dou, Hao Chen, Lequan Yu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. SV-RCNet: Workflow Recognition From Surgical Videos Using Recurrent Convolutional Network. IEEE Transactions on Medical Imaging, 37(5):1114–1126, 5 2018b. ISSN 0278-0062. doi: 10.1109/TMI.2017.2787657.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Fei Fei Li. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. ISBN 9781479951178. doi: 10.1109/CVPR.2014.223.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 12 2017. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2017.07.005.

Kaustuv Mishra, Rachana Sathish, and Debdoot Sheet. Learning Latent Temporal Connectionism of Deep Residual Visual Abstractions for Identifying Surgical Tools in Laparoscopy Procedures. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2233–2240. IEEE, 7 2017. ISBN 978-1-5386-0733-6. doi: 10.1109/CVPRW.2017.277.

Babak Namazi, Ganesh Sankaranarayanan, and Venkat Devarajan. Automatic Detection of Surgical Phases in Laparoscopic Videos. In Proceedings on the International Conference in Artificial Intelligence (ICAI), pages 124–130, 2018.

Jonas Prellberg and Oliver Kramer. Multi-label Classification of Surgical Tools with Convolutional Neural Networks. In International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2018. doi: 10.1109/IJCNN.2018.8489647. URL http://arxiv.org/abs/1805.05760.

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85:333–359, 2011. doi: 10.1007/s10994-011-5256-5.
Austin Reiter, Peter K. Allen, and Tao Zhao. Feature Classification for Tracking Articulated Surgical Tools. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2012), pages 592–600. Springer, Berlin, Heidelberg, 2012. doi: 10.1007/978-3-642-33418-4_73.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015. doi: 10.1109/TPAMI.2016.2577031.

Manish Sahu, Anirban Mukhopadhyay, Angelika Szengel, and Stefan Zachow. Tool and Phase recognition using contextual CNN features. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 186–194, 2016. URL https://arxiv.org/pdf/1610.08854.pdf.

Manish Sahu, Anirban Mukhopadhyay, Angelika Szengel, and Stefan Zachow. Addressing multi-label imbalance problem of surgical tool detection using CNN. International Journal of Computer Assisted Radiology and Surgery, 12(6):1013–1020, 6 2017. ISSN 1861-6410. doi: 10.1007/s11548-017-1565-x.

Duygu Sarikaya, Jason J. Corso, and Khurshid A. Guru. Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection. IEEE Transactions on Medical Imaging, 36(7):1542–1549, 7 2017. ISSN 0278-0062. doi: 10.1109/TMI.2017.2665671.

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626. IEEE, 10 2017. ISBN 978-1-5386-1032-9. doi: 10.1109/ICCV.2017.74.

Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, pages 568–576. MIT Press, 2014.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. ISBN 9781467369640. doi: 10.1109/CVPR.2015.7298594.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. IEEE, 6 2016. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.308.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2 2017. URL http://arxiv.org/abs/1602.07261.

Andru P. Twinanda, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. Single- and Multi-Task Architectures for Tool Presence Detection Challenge at M2CAI 2016. arXiv preprint, 10 2016. URL http://arxiv.org/abs/1610.08851.

Andru P. Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 1 2017a. ISSN 0278-0062. doi: 10.1109/TMI.2016.2593957.

Andru Putra Twinanda, Nicolas Padoy, Jocelyne Troccaz, and Gregory Hager. Vision-based Approaches for Surgical Activity Recognition Using Laparoscopic and RBGD Videos. PhD thesis, University of Strasbourg, 2017b. URL https://tel.archives-ouvertes.fr/tel-01557522/document.
V. Velanovich. Laparoscopic vs open surgery. Surgical Endoscopy, 14(1):16–21, 1 2000. ISSN 0930-2794. doi: 10.1007/s004649900003.

Sheng Wang, Ashwin Raju, and Junzhou Huang. Deep learning based multi-label classification for surgical tool presence detection in laparoscopic videos. In Proceedings - International Symposium on Biomedical Imaging, pages 620–623, 2017. ISBN 9781509011711. doi: 10.1109/ISBI.2017.7950597.

Daniel Wesierski and Anna Jezierska. Instrument detection and pose estimation with rigid part mixtures model in video-assisted surgeries. Medical Image Analysis, 46:244–265, 5 2018. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2018.03.012.

Aneeq Zia, Daniel Castro, and Irfan Essa. Fine-tuning Deep Architectures for Surgical Tool Detection. In Workshop and Challenges on Modeling and Monitoring of Computer Assisted Interventions (M2CAI), 2016.

Odysseas Zisimopoulos, Evangello Flouty, Mark Stacey, Sam Muscroft, Petros Giataganas, Jean Nehme, Andre Chow, and Danail Stoyanov. Can surgical simulation be used to train detection and classification of neural networks?