Machine Learning Distinguishes Neurosurgical Skill Levels in a Virtual Reality Tumor Resection Task
Samaneh Siyar, MSc (1,2); Hamed Azarnoush, PhD (1,2); Saeid Rashidi, PhD (3); Alexandre Winkler-Schwartz, MD (1); Vincent Bissonnette, MD (1); Nirros Ponnudurai, BEng (1); Rolando F. Del Maestro, MD, PhD (1)

(1) Neurosurgical Simulation Research and Training Centre, Department of Neurosurgery, Montreal Neurological Institute and Hospital, McGill University, Canada
(2) Department of Biomedical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Iran
(3) Science and Research Branch, Islamic Azad University, Iran
Details of previous presentations:
A portion of this work has been presented in the form of an abstract at the Computer Assisted Radiology and Surgery congress, June 19-23, 2018, Berlin.
Disclosure of financial support:
This work was supported by the Di Giovanni Foundation, the Montreal English School Board, the B-Strong Foundation, the Colannini Foundation, the Montreal Neurological Institute and Hospital and the McGill Department of Orthopedics. Samaneh Siyar is a Visiting Scholar in the Neurosurgical Simulation Research and Training Centre. Dr. H. Azarnoush previously held the Postdoctoral Neuro-Oncology Fellowship from the Montreal Neurological Institute and Hospital and is a Visiting Professor in the Neurosurgical Simulation Research and Training Centre. Dr. Winkler-Schwartz holds a Robert Maudsley Fellowship from the Royal College of Physicians and Surgeons of Canada and Nirros Ponnudurai is supported by a Heffez Family Bursary. Dr. Del Maestro is the William Feindel Emeritus Professor in Neuro-Oncology at McGill University.
Acknowledgments
We thank all the medical students, residents, and neurosurgeons from the Montreal Neurological Institute and Hospital and at other institutions who participated in this study. We would also like to thank Robert DiRaddo, Group Leader, Simulation, Life Sciences Division, National Research Council Canada at Boucherville and his team, including Denis Laroche, Valérie Pazos, Nusrat Choudhury and Linda Pecora for their support in the development of the scenarios used in these studies and all the members of the Simulation, Life Sciences Division, National Research Council Canada.
Corresponding author:
Dr. Hamed Azarnoush, PhD Visiting professor, Neurosurgical Simulation Research and Training Centre Montreal Neurological Hospital, McGill University 3801 University, E2.89 Montreal, Quebec, Canada H3A 2B4 Tel: 514-934-1934 ext. 36733 Email: [email protected]
Abstract:
Background:
Virtual reality simulators and machine learning have the potential to augment understanding, assessment and training of psychomotor performance in neurosurgery residents.
Objective:
This study outlines the first application of machine learning to distinguish “skilled” and “novice” psychomotor performance during a virtual reality neurosurgical task.
Methods:
Twenty-three neurosurgeons and senior neurosurgery residents comprised the "skilled" group, and 92 junior neurosurgery residents and medical students comprised the "novice" group. The task involved removing a series of virtual brain tumors without causing injury to surrounding tissue. Over 100 features were extracted, and 68 were selected using t-test analysis. These features were provided to 4 classifiers: K-Nearest Neighbors, Parzen Window, Support Vector Machine, and Fuzzy K-Nearest Neighbors. Equal error rate was used to assess classifier performance.
Results:
Train set sizes from 10% to 90% of the data and 5 to 30 features, chosen by a forward feature selection algorithm, were evaluated. A working point of 50% train set size and 15 features yielded equal error rates as low as 8.3% using the Fuzzy K-Nearest Neighbors classifier.
Conclusion:
Machine learning may be one component helping realign the traditional apprenticeship educational paradigm to a more objective model based on proven performance standards.
Keywords:
Artificial intelligence, Classifiers, Machine learning, Neurosurgery skill assessment, Surgical education, Tumor resection, Virtual reality simulation
1. Introduction
Virtual reality simulators have been proposed as tools to understand, assess and train neurosurgery residents.
An important element of simulator performance is the capacity of simulators to distinguish operator expertise. Most studies on operator performance have utilized “metrics.”
A useful definition of "metrics" is that they are standards of reference by which performance, efficiency, and progress can be assessed. Individual metrics can be used to assess particular aspects of operator performance; applied forces, bimanual dexterity, and stress have all been studied. An operator's performance metrics can be compared with previously defined proficiency benchmarks, placing the operator into 1 of 2 or more groups with specific levels of psychomotor expertise. Neurosurgical tasks are complicated, involving multiple cognitive processes and psychomotor skills, and larger sets of more complex, interacting metrics may be required to differentiate groups. Artificial intelligence, through machine learning algorithms (classifiers), has the capacity to exploit extensive data sets involving large numbers of features to separate such groups.
Machine learning has been reviewed in neurosurgery and used to characterize performance during otolaryngology and dental virtual reality procedures. However, machine learning classifiers have not been utilized to differentiate "skilled" and "novice" neurosurgical psychomotor performance using a virtual reality simulator with haptic feedback. The question addressed in this communication is: do the 4 classifiers utilized in our study (K-Nearest Neighbors, Parzen Window, Support Vector Machine, and Fuzzy K-Nearest Neighbors) have the ability to differentiate "skilled" from "novice" neurosurgical psychomotor performance on a virtual reality simulation platform?
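Although the study's own implementation is not published here, the four classifier families named above are easy to instantiate on toy data. The sketch below is my own illustration, not the authors' code: K-Nearest Neighbors and Support Vector Machine come from scikit-learn, and a Parzen Window classifier is approximated with per-class kernel density estimates; the cluster locations and bandwidth are invented for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KernelDensity
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy 2-class data standing in for "novice" vs "skilled" feature vectors
X0 = rng.normal(0.0, 1.0, size=(60, 4))      # novice-like cluster
X1 = rng.normal(1.5, 1.0, size=(20, 4))      # skilled-like cluster
X = np.vstack([X0, X1])
y = np.array([0] * 60 + [1] * 20)

knn = KNeighborsClassifier(n_neighbors=7).fit(X, y)   # k = 7, as in the study
svm = SVC(kernel="rbf").fit(X, y)

# Parzen Window classifier: class-conditional KDEs combined with class priors
kde = {c: KernelDensity(bandwidth=0.8).fit(X[y == c]) for c in (0, 1)}
prior = {c: float(np.mean(y == c)) for c in (0, 1)}

def parzen_predict(x):
    """Pick the class maximizing log(prior) + log Parzen-window density."""
    scores = {c: np.log(prior[c]) + kde[c].score_samples(x[None, :])[0]
              for c in (0, 1)}
    return max(scores, key=scores.get)

pred = parzen_predict(np.full(4, 1.5))       # query near the skilled cluster
```

The fuzzy variant of k-NN, which the paper found best, replaces the crisp vote with graded class memberships (see the Discussion).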
2. Methods
115 individuals participated: 16 board-certified practicing neurosurgeons from 3 institutions and 7 senior residents (PGY 4-6) from one university made up the skilled group (n=23), while 8 junior residents (PGY 1-3) and 84 medical students comprised the novice group (n=92). No participant had previous experience with the simulator utilized, and all participants signed a Research Ethics Board-approved consent form.
The NeuroTouch, now known as NeuroVR (CAE Healthcare, Montreal, Canada), virtual reality simulation platform was used. Tumor resections were performed using the simulated ultrasonic aspirator held in the dominant hand (Fig. 1A).
Figure 1B outlines the 6 scenarios used in this study, which involved the resection of 9 identical simulated brain tumors on 2 occasions (a total of 18 procedures) separated by removal of tumors with different complexities. The simulated operative procedure utilized for these studies can be seen in electronic Supplementary Material 1 of a previous publication. To maximize tumor differences and increase task difficulty, each of the 6 scenarios had 3 tumors of varying complexity in color (black; glioma-like; and white, similar to background) and Young's modulus stiffness (3 kPa, soft; 9 kPa, medium; and 15 kPa, hard). The background, with soft stiffness (3 kPa), represented the surrounding 'normal' white matter (Fig. 1C). Scenario 1 included 3 black tumors with different stiffness, Scenario 2 included 3 glioma-like tumors with different stiffness, and Scenario 3 included 3 white tumors with different stiffness. In Scenarios 4 through 6, all 3 simulated tumors in each scenario had the same stiffness but were visually different: Scenario 4 included 3 soft tumors, Scenario 5 included 3 medium-stiffness tumors, and Scenario 6 included 3 hard tumors, each set with different visual properties. Three minutes was allowed for each tumor's removal, with a 1-minute rest between tumor resections to decrease fatigue. The trial involved 54 minutes of active tumor resection, 71 minutes in total. To develop procedure familiarity, operators first resected a practice scenario, but these data were not used. Participants were unaware of the study purpose or the metrics utilized and were instructed to resect each tumor with minimal removal of the background tissue.

FIGURE 1. The hand position of the operator holding the simulated ultrasonic aspirator (A); the 6 simulated tumor scenarios with tumor color and sequence (B); and a lateral view of the brain tumor geometry and ellipsoidal shape utilized in each scenario, demonstrating the 3 identical tumors, the tumor buried underneath simulated 'normal' tissue, and the R1 and the R2 plus R3 regions studied (C).
The 3 processing steps, feature extraction, feature selection, and classification, are shown in Figure 2. Features may be considered inputs provided to machine learning algorithms to help define level of expertise. The simulator recorded signals including tool tip coordinates, tool tip orientation angles, contact force between the virtual tool and virtual tissue, and foot pedal activation state versus time. Although these signals provide useful information, previous investigations on developing a model of psychomotor performance for virtual reality tumor resections have outlined complex interacting human and task factors involved in differentiating skilled and novice performance. Different parametric features could be extracted from these and other derived signals with the goal of differentiating the skilled and novice groups.

FIGURE 2. Flowchart of the proposed feature selection and classification.

Acceleration

Acceleration was computed as the second derivative of the motion profile. Features based on the acceleration signal included mean acceleration, maximum acceleration, and the integral of the acceleration vector (IAV), given by:

\[ \mathrm{IAV} = \int_0^T \sqrt{\left(\frac{d^2x}{dt^2}\right)^2 + \left(\frac{d^2y}{dt^2}\right)^2 + \left(\frac{d^2z}{dt^2}\right)^2}\, dt \quad (1) \]

where x, y, and z are Cartesian coordinates and T is the duration of the task.

Jerk

Jerk is defined as the third derivative of the motion profile and has been applied to motor skill assessment. A normalized three-dimensional jerk metric is used in this study, given by:

\[ \mathrm{Jerk}_{norm} = \sqrt{\frac{T^5}{2A_m^2}\int_0^T \left[\left(\frac{d^3x}{dt^3}\right)^2 + \left(\frac{d^3y}{dt^3}\right)^2 + \left(\frac{d^3z}{dt^3}\right)^2\right] dt} \quad (2) \]

where T is task completion time and A_m is the amplitude of the motion.

2.4.2. Force-based features

The inability to measure forces applied by the neurosurgical aspirator during patient-related procedures has limited our understanding of the forces to which the human brain is exposed by this instrument. The simulation platform utilized has the ability to analyze the force feedback generated by the haptic device. These data have been utilized to create force pyramids, force heat maps, and force fingerprints to assess psychomotor function and automaticity during virtual reality tumor resections. Force-based features extracted in this study comprise force derivatives, the integral of the force, the range of the applied forces, and the interquartile range, i.e., the first quartile subtracted from the third quartile. In addition, parametric features including temporal and spatial features were extracted from the force signal and its derivatives. We also used 2 features proposed previously to indicate consistency, given by:

\[ df_{metric} = \sqrt{\frac{T}{2f_{iqr}^2}\int_0^T \left(\frac{df}{dt}\right)^2 dt} \quad (3) \]

\[ d^2f_{metric} = \sqrt{\frac{T}{2f_{iqr}^2}\int_0^T \left(\frac{d^2f}{dt^2}\right)^2 dt} \quad (4) \]

and one feature to indicate the smoothness of the force application, given by:

\[ d^3f_{metric} = \sqrt{\frac{T}{2f_{iqr}^2}\int_0^T \left(\frac{d^3f}{dt^3}\right)^2 dt} \quad (5) \]

where T is task completion time and f_iqr is the interquartile range of the force profile. We started with a list of over 100 parametric features, many of which were eliminated in the subsequent feature selection process. The list of all signal features is included in Table 1.

TABLE 1. List of signal features

- x(t), y(t), z(t): tool tip position in the x-, y-, and z-directions
- f(t): force
- v_x(t) = dx/dt, v_y(t) = dy/dt, v_z(t) = dz/dt: velocity in the x-, y-, and z-directions
- v_f(t) = df/dt: first derivative of the force signal
- V(t) = sqrt((dx/dt)^2 + (dy/dt)^2 + (dz/dt)^2): magnitude of velocity
- a_x(t) = dv_x/dt, a_y(t) = dv_y/dt, a_z(t) = dv_z/dt: acceleration in the x-, y-, and z-directions
- a_f(t) = dv_f/dt: second derivative of the force signal
- j_x(t) = da_x/dt, j_y(t) = da_y/dt, j_z(t) = da_z/dt: jerk in the x-, y-, and z-directions
- j_f(t) = da_f/dt: third derivative of the force signal
- Roll(t): rotation around the front-to-back axis; v_Roll(t), a_Roll(t), j_Roll(t): its first, second, and third derivatives
- Pitch(t): rotation around the side-to-side axis; v_Pitch(t), a_Pitch(t), j_Pitch(t): its first, second, and third derivatives
- Yaw(t): rotation around the vertical axis; v_Yaw(t), a_Yaw(t), j_Yaw(t): its first, second, and third derivatives

Since the parametric feature values are not of the same order of magnitude, the obtained features were normalized exponentially before training the classifiers:

\[ Z_i = e^{-x_i/\max(x)} \quad (6) \]

where Z_i is the normalized value and x_i is a data point (x_1, x_2, ..., x_n).

Feature selection follows feature extraction to decrease computational complexity while maintaining classifier performance. Irrelevant features are identified and only useful ones are provided to the classifiers, since irrelevant features may result in overfitting and increased resource use. An efficient search strategy is adopted to determine a feature subset, and the selected subset is then evaluated against evaluation criteria. Feature selection was carried out in 2 steps: first, features with a defined statistical differentiation were identified; second, features improving classifier performance were selected.

2.6.1. Statistical feature selection

For each feature, a t-test was applied and the resultant p-values were compared across all features as a measure of the usefulness of each individual feature in separating the groups. Among the extracted preliminary features, the 68 listed in Table 2 differentiated the 2 groups with a statistically significant difference of p < 0.05.

TABLE 2. List of the 68 selected parametric features that provide the best classification. The best 30 features are marked by one asterisk (*) and the best 15 features by two asterisks (**).
- Σ t(j_x ≤ 0)/T (T: task completion time)
- * N_extremum(f)
- Σ N(f > 0.1)
- * Σ f_R4 (R4: region beneath the tumor bulk)
- * std(f)/std(v_x) (std: standard deviation)
- max(f) − min(f)
- (max(v_x) − min(v_x)) · (max(v_y) − min(v_y)) · (max(v_z) − min(v_z))
- std(f)
- iqr(f)
- iqr(x) · iqr(y) · iqr(z)
- √(T Σ a_f²)
- ** √(T/(2·iqr(f)²) · Σ v_f²)
- (Σ_{i=1}^{N−1} |f_{i+1} − f_i|)/T
- Σ_i √(a_{x,i}² + a_{y,i}² + a_{z,i}²)
- ** (Σ_{i=1}^{N−1} |v_{i+1} − v_i|)/(T·std(f))
- * (Σ_{i=1}^{N−1} |a_{i+1} − a_i|)/T
- ** (Σ t(max(a_x)))/T
- ∫_{t1}^{t2} |f| dt (t1, t2: start and end points of a signal peak)
- * N_zero-cross(v_x)
- * (Σ t(min(a_x)))/T
- * (Σ t(min(a_y)))/T
- ** (Σ t(max(a_y)))/T
- (Σ t(max(a_z)))/T
- N_zero-cross(v_y)
- N_extremum(Pitch)
- ** (Σ t(min(a_z)))/T
- * (Σ t(min(a_f)))/T
- * (Σ t(max(a_f)))/T
- mean(V)
- max(V)
- * std(f)/std(v_z)
- ** √(T Σ j_f²)
- ** Σ f_low-frequency / Σ f_high-frequency
- ** std(f)/std(v_y)
- (Σ t(max(z)))/T
- * N_minimum(x)
- N_extremum(v_f) (N_extremum: number of extremum points)
- ** N_minimum(v_x)
- ** N_minimum(a_x)
- ** (Σ t(max(f)))/(Σ t(min(f)))
- N_minimum(a_y)
- N_extremum(a_f)
- N_max(x) + N_max(y) + N_max(z)
- Σ t(v_f ≥ 0)/Σ t(v_f ≤ 0)
- ** N_extremum(x)
- N_min(x) + N_min(y) + N_min(z)
- N_extremum(z)
- N_extremum(y)
- (Σ_{i=1}^{N−1} |Pitch_{i+1} − Pitch_i|)/T
- * Σ_{i=1}^{N−1} |Yaw_{i+1} − Yaw_i|
- (Σ_{i=1}^{N−1} |v_Roll,i+1 − v_Roll,i|)/T
- (Σ_{i=1}^{N−1} |v_Pitch,i+1 − v_Pitch,i|)/T
- * mean(Roll)·T / (max(v_Roll) − min(v_Roll))
- mean(Pitch)·T / (max(v_Pitch) − min(v_Pitch))
- N_min(Yaw) + N_min(Pitch) + N_min(Roll)
- ** N_max(Yaw) + N_max(Pitch) + N_max(Roll)
- N_extremum(Pitch)
- N_extremum(Yaw)
- N_extremum(v_Yaw)
- ** N_extremum(Roll)
- ** N_extremum(v_Roll)
- * N_extremum(v_Pitch)
- * (Σ t(max(Pitch)) − Σ t(min(Pitch)))/T
- Σ_{i=1}^{N−1} |j_Pitch,i+1 − j_Pitch,i| / mean|j_Pitch|
- Frequency of pedal activation
- (Σ t(max(Yaw)) − Σ t(min(Yaw)))/T
- Σ f_R3
- Σ f_R1

The forward feature selection algorithm was applied to rank the best 5, 10, 15, 20, 25, and 30 of the 68 features selected by the statistical step. This feature selection was done irrespective of the classifiers used in the next stage. Four classifiers, K-Nearest Neighbors with k = 7, Parzen Window, Support Vector Machine, and Fuzzy K-Nearest Neighbors with k = 7, were applied to classify the skilled and novice groups. To examine the configuration of the train and test sets, the train set was selected randomly, increasing from 10% to 90% of all data. For cross validation of each train set size, the classification process was repeated 20 times, each time with a randomly selected train set of that size. The equal error rate (EER) is obtained when sensitivity and specificity become equal. A classifier may have good sensitivity and poor specificity, or vice versa; the equal error rate is a commonly used measure of classifier performance since it captures both at the same time. Equal error rates were measured for different working points combining different train set sizes and different numbers of premier features.
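As a concrete illustration of the feature extraction step, the following sketch (my own, not the authors' implementation) computes the integral of the acceleration vector, the normalized jerk, and a force-consistency metric from uniformly sampled tool tip coordinates and force; the sampling interval, array names, and toy trajectory are assumptions for the example.

```python
import numpy as np

def resection_features(x, y, z, f, dt):
    """Sketch of three Methods features from uniformly sampled signals:
    IAV (integral of the acceleration vector magnitude), normalized jerk,
    and a force-consistency metric based on df/dt and the force IQR."""
    T = dt * (len(x) - 1)                              # task duration
    pos = np.stack([x, y, z])
    acc = np.gradient(np.gradient(pos, dt, axis=1), dt, axis=1)  # 2nd deriv
    jerk = np.gradient(acc, dt, axis=1)                          # 3rd deriv

    # IAV: integrate the magnitude of the acceleration vector over the task
    iav = np.linalg.norm(acc, axis=0).sum() * dt

    # Normalized (dimensionless) jerk; A_m = amplitude of the motion
    a_m = np.linalg.norm(pos.max(axis=1) - pos.min(axis=1))
    jerk_norm = np.sqrt(T**5 / (2 * a_m**2) * ((jerk**2).sum() * dt))

    # Force consistency: energy of df/dt scaled by task time and force IQR
    df = np.gradient(f, dt)
    f_iqr = np.subtract(*np.percentile(f, [75, 25]))   # Q3 - Q1
    df_metric = np.sqrt(T / (2 * f_iqr**2) * ((df**2).sum() * dt))
    return iav, jerk_norm, df_metric

t = np.linspace(0.0, 3.0, 301)                         # 3-second toy trial
iav, jn, dfm = resection_features(np.sin(t), np.cos(t), 0.1 * t,
                                  f=1.0 + 0.2 * np.sin(5 * t), dt=0.01)
```

Each of the 18 tumor resections would yield one such feature vector per participant, which is then exponentially normalized before classification.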
3. Results
All classifiers were applied to the 5, 10, 15, 20, 25, and 30 superior features for 20 iterations. Figure 3 outlines the ability of each classifier to discriminate between the skilled and novice groups in the 6 different scenarios for train set sizes between 10% and 90%, in 10% increments. The overall trends of classifier performance are similar across all scenarios. We used a 50% train set size as the working point, since increasing the train set size beyond this did not significantly improve classifier performance. For this train set size, the minimum equal error rates in Scenarios 1, 2, 4, and 5 were obtained with the Fuzzy K-Nearest Neighbors classifier (EER = 9.6%, 13.7%, 11.1%, and 12.0%, respectively). In Scenario 3, the Parzen Window classifier gave the minimum equal error rate (EER = 10.8%), and in Scenario 6 the minimum was obtained with the K-Nearest Neighbors classifier (EER = 14.3%).

FIGURE 3. Classification results for different train set divisions (%) in Scenario 1 (A), Scenario 2 (B), Scenario 3 (C), Scenario 4 (D), Scenario 5 (E), and Scenario 6 (F), with resultant equal error rate (EER) percentages for each scenario and classifier employed: K-Nearest Neighbors (KNN), Parzen Window (PW), Support Vector Machine (SVM), and Fuzzy K-Nearest Neighbors (FKNN).

Performance of the classifiers was then assessed for different numbers of selected premier features. Figure 4 shows equal error rate values as a function of the number of superior features, averaged over all train set sizes for 20 iterations. Fuzzy K-Nearest Neighbors demonstrated the best overall performance. Overall classifier performance improved as the number of premier features was increased to 15; for higher feature numbers, performance either decreased or did not significantly improve. Using 15 premier features, minimum equal error rates were obtained with the Fuzzy K-Nearest Neighbors classifier in most scenarios (EER = 9.3%, 14.4%, 10.0%, 9.2%, and 14.5%).

FIGURE 4. Classification results for different numbers of selected premier features in Scenario 1 (A), Scenario 2 (B), Scenario 3 (C), Scenario 4 (D), Scenario 5 (E), and Scenario 6 (F), with resultant equal error rate (EER) percentages for each scenario and each classifier employed: K-Nearest Neighbors (KNN), Parzen Window (PW), Support Vector Machine (SVM), and Fuzzy K-Nearest Neighbors (FKNN).

Figure 5 compares classifier performance across all scenarios at the working point of 50% train set size and the 15 best features, and shows that the Fuzzy K-Nearest Neighbors classifier has the best performance, with equal error rates ranging from 8.3% to 14.5%. In Table 2, the best 30 features are marked with one asterisk (*) and the best 15 features with two asterisks (**). Six of the best 15 features and 12 of the best 30 features involved force, while the remaining features were associated with motion.

FIGURE 5. Classification results for the selected working point (15 best features and 50% train set size), with resultant average equal error rate (EER) percentages for each scenario and each classifier employed: K-Nearest Neighbors (KNN), Parzen Window (PW), Support Vector Machine (SVM), and Fuzzy K-Nearest Neighbors (FKNN).
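The equal error rate reported throughout these results can be computed from a classifier's scores as the error at the threshold where the false-positive and false-negative rates cross. A minimal sketch (my own, assuming higher scores indicate "skilled"; the toy scores and labels are invented):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the error rate at the threshold where the false-positive rate
    equals the false-negative rate (i.e., sensitivity == specificity)."""
    order = np.argsort(scores)[::-1]       # sort by descending score
    labels = np.asarray(labels)[order]
    P, N = labels.sum(), (1 - labels).sum()
    tp = np.cumsum(labels)                 # true positives at each cut
    fp = np.cumsum(1 - labels)             # false positives at each cut
    fnr = 1 - tp / P                       # miss rate
    fpr = fp / N                           # false-alarm rate
    i = np.argmin(np.abs(fnr - fpr))       # closest crossing point
    return (fnr[i] + fpr[i]) / 2

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
eer = equal_error_rate(scores, labels)     # 1/3 for this toy example
```

With finite data the two rates rarely cross exactly, so the midpoint at the nearest crossing is a common convention.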
4. Discussion
Machine learning, a subset of artificial intelligence, uses algorithms (classifiers) that give computers the capacity to "learn" patterns, progressively improving performance on a specific task, when provided with sufficient data and without explicit programming. Supervised, unsupervised, semi-supervised, and reinforcement learning algorithms can be used.
In supervised classifiers, feature data are provided that maximize the ability of the classifier to separate groups by minimizing error. These techniques have been employed in neurosurgical diagnosis, presurgical planning, and outcome prediction. Studies of otolaryngology and dental virtual reality procedures involved from 1 to 7 skilled (expert) and 5 to 40 novice (less skilled) participants and differentiated skilled from novice groups with accuracies of 75% to 100%. Machine learning classifiers had not previously been used to differentiate skilled and novice groups in virtual reality cerebral tumor procedures. The scenarios utilized in this study involved aspirator skills used in human tumor resections, part of the surgical armamentarium of neurosurgeons and senior residents but not yet acquired by all junior residents and medical students. It therefore seemed reasonable to define skilled and novice (less skilled) groups based on the skill set required for the 6 scenarios studied.
We applied 4 different supervised machine learning classifiers to the data set from these participants. Our results demonstrate that all 4 classifiers distinguished the skilled and novice groups, with equal error rates as low as 8.3%, indicating the usefulness of classifiers in differentiating participants performing virtual reality procedures. The Fuzzy K-Nearest Neighbors classifier provided optimal performance, which may relate to its ability to assign fuzzy rather than crisp membership to the skilled and novice groups. The Support Vector Machine classifier had the least ability to separate the groups, consistent with its known performance degradation when classifying unbalanced groups.
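The fuzzy-membership idea can be sketched as follows. This is a generic Keller-style fuzzy k-NN with inverse-distance weighting, not the authors' implementation; the data, the fuzzifier m, and the cluster parameters are toy assumptions.

```python
import numpy as np

def fuzzy_knn_membership(X_train, y_train, x, k=7, m=2):
    """Fuzzy k-NN (Keller et al. style): instead of a crisp majority vote,
    return a graded membership in each class for query point x, weighting
    the k nearest neighbors by inverse distance raised to 2/(m-1)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                          # k nearest neighbors
    w = 1.0 / np.maximum(d[nn], 1e-12) ** (2 / (m - 1))
    u = {}
    for c in np.unique(y_train):
        u[c] = w[y_train[nn] == c].sum() / w.sum()  # membership in class c
    return u                                        # memberships sum to 1

rng = np.random.default_rng(1)
skilled = rng.normal(0.0, 1.0, size=(20, 5))        # toy skilled cluster
novice = rng.normal(3.0, 1.0, size=(80, 5))         # toy novice cluster
X = np.vstack([skilled, novice])
y = np.array([1] * 20 + [0] * 80)
u = fuzzy_knn_membership(X, y, x=np.zeros(5), k=7)  # query near "skilled"
```

A borderline participant thus receives, say, 0.6/0.4 memberships rather than a hard label, which may be why this classifier degrades more gracefully on the study's unbalanced 23-versus-92 groups.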
Table 3 presents the range of individuals misclassified by the Fuzzy K-Nearest Neighbors classifier. Using this classifier, 19-21 of the 23 skilled individuals and 78-84 of the 92 novices were correctly classified. Some neurosurgeons in this study had cerebrovascular, spinal, and functional specialization with little exposure to tumor resection, which may be one reason for misclassification; some junior residents may have been misclassified because they had already mastered the required surgical skills. Studies involving more complex scenarios, larger resident numbers, and a better understanding of which factors and/or combinations of metrics best differentiate groups are needed. The potential of machine learning classifiers applied to virtual reality procedures in surgical disciplines is that the newly identified features will result in new "metrics" which can then be evaluated in other model systems. These results may not only help us understand the psychomotor skills needed to increase surgical expertise but also aid in resident assessment and training and improve patient outcomes.

TABLE 3. The range of numbers of individuals correctly and incorrectly classified by the Fuzzy K-Nearest Neighbors classifier in the 6 different scenarios.

                  Classified as skilled   Classified as novice
Skilled (N=23)           19-21                   2-4
Novice (N=92)            8-14                    78-84
Total (N=115)
The importance of these results lies in their potential educational application to aid neurosurgical resident training and to help further define the psychomotor skill sets of expert surgeons. Machine learning and artificial intelligence applied to virtual reality surgical studies should be seen as useful adjuncts and not as a replacement for standard residency training. By relying on 68 features, these machine learning classifiers can automatically capture multiple aspects of psychomotor performance and segregate participants into 'skilled' or 'novice' groups. However, this should be seen as an initial step of a formative educational process, prompting instructors to further assess and coach a resident's performance to a desired level. The classifiers and simulator platform utilized to distinguish neurosurgical skill levels in this study have limitations. First, many of the parametric features included in this investigation have not been assessed in more complex scenarios; it is therefore not known whether the same classifiers would also be applicable to those scenarios. Whether these parametric features are the most appropriate, or whether other metrics such as the force pyramid or automaticity will be more useful, needs to be assessed.
Second, a simulated aspirator was utilized in the dominant hand which is not representative of the bimanual psychomotor skills and multiple instruments employed during patient tumor resections.
Previous studies have demonstrated differences in ergonomics between right- and left-handed operators; this issue was not addressed in this investigation and deserves further study. Third, the different visual and haptic complexities of the simulated tumors and the task duration may not adequately discriminate operator performance. More complex and realistic tumor scenarios, with simulated bleeding and bimanual instrument use, are being studied with classifiers and may prove more useful. Defining large populations of residents and neurosurgeons with equivalent virtual reality simulation experience is challenging. Sixteen practicing board-certified neurosurgeons from 3 institutions with different areas of expertise participated in this study, which is felt to be representative of a general neurosurgical population. We enrolled residents and medical students from only one institution, which limits the generalizability of these results. The authors believe that increasing study participants from multiple institutions may further improve classifier performance in distinguishing neurosurgical skill levels at various stages of resident training.
5. Conclusion
We presented the first investigation of the application of machine learning in assessing surgical skill level during virtual reality tumor resection. The importance of our results lies in their potential educational application in neurosurgical resident training and helping further define the psychomotor skill set of the skilled surgeon. Machine learning may be one component in helping to realign the present apprenticeship educational paradigm to a more objective model based on proven performance standards.