Eye Movement Feature Classification for Soccer Goalkeeper Expertise Identification in Virtual Reality

Abstract

Recent research on expertise assessment of soccer players has affirmed the importance of perceptual skills (especially for decision making), focusing either on high experimental control or on a realistic presentation. To assess the perceptual skills of athletes in an optimized manner, we captured omnidirectional in-field scenes and showed these to 12 expert, 10 intermediate and 13 novice soccer goalkeepers on virtual reality glasses. All scenes were shown from the same natural goalkeeper perspective and ended after the return pass to the goalkeeper. Based on their gaze behavior we classified their expertise with common machine learning techniques. This pilot study shows promising results for the objective classification of goalkeepers' expertise based on their gaze behaviour and provides valuable insight to inform the design of training systems that enhance the perceptual skills of athletes.

Full PDF

Eye Movement Feature Classification for Soccer Expertise Identification in Virtual Reality

Benedikt Hosp*,1,2, Florian Schultz, Enkelejda Kasneci, Oliver Höner
1 Perception Engineering, Human-Computer-Interaction, University of Tuebingen, Germany
2 Institute of Sport Science, University of Tuebingen, Germany
* [email protected]
arXiv preprint [cs.HC], September 25, 2020

Abstract

Recent research on expertise assessment of soccer players has emphasized the importance of perceptual skills. Former research focused either on high experimental control or on a natural presentation mode. To assess the perceptual skills of athletes in an optimized manner, we captured omnidirectional in-field scenes and showed them to 12 expert, 9 intermediate and 13 novice soccer goalkeepers on virtual reality glasses. All scenes were shown from the same natural goalkeeper perspective and ended after the return pass to the goalkeeper. Based on their responses and gaze behavior we classified their expertise with common machine learning techniques. This pilot study shows promising results for the objective classification of goalkeepers' expertise based on their gaze behaviour.

Introduction

Several sports-related studies on perceptual-cognitive skills have shown how much the perceptual skills of athletes contribute to superior performance in sports [1–7]. The methods of choice in research on perceptual-cognitive skills are video based. Observing perceptual-cognitive skills with video-based methods makes it possible to isolate different characteristics and to develop a knowledge base that explains certain perception-based advantages of athletes.

Research on perceptual skills has taken advantage of innovations in computer science, i.e. new presentation devices, interaction interfaces and biometric feature recording devices such as eye trackers. One of the main challenges in sports-related research on perceptual skills remains the trade-off between experimental control and a natural, valid presentation mode, as Kredel et al. [8] postulated in a meta-review of over 60 studies from over 40 years of research on natural gaze behaviour.

Larkin et al. [9] concluded, based on a review of 25 studies, that video-based training can enhance perceptual-cognitive performance. One fundamental aspect is a highly natural presentation mode, which leads to pronounced expertise effects in gaze behaviour and decision-making. Mann et al. [10] found moderator effects of the stimulus presentation mode, postulating a relationship between a more natural presentation mode and increased expertise effects.

For all research on perceptual-cognitive skills, an optimized trade-off between natural presentation mode and experimental control (for comparable results) is of high importance. Ignoring a natural presentation mode prevents the athletes from applying their natural gaze behaviour. Disregarding high experimental control prevents comparable and precise results. Both Vater et al. [11] and Mann et al. [10] suggest that sports-related perceptual-cognitive skills should be examined with attention to both sides of the trade-off: a natural environment that mimics the complexity of the task, combined (from a scientific perspective) with particular attention to the level of experimental control.

So far, eye tracking studies have focused on one side. Either in-field setups with a natural presentation mode (field camera) or laboratory setups with high experimental control [12–17] were conducted. For optimal research conditions both sides need to be improved.

As an emerging technology, virtual reality (VR) devices are used more and more often as stimulus presentation mode and interaction device. Research has focused either on photorealistic stereoscopic views of sports environments combined with interaction techniques for natural movements in virtual reality [18] or on modeling athletes' behaviour to create expertise-based adaptive interfaces or training systems. VR has the power to optimize the trade-off and even to create synergistic effects. VR can show realistic and immersive environments and, by using a built-in eye tracker, capture a close to natural gaze behaviour of the users. VR can even replace CAVE systems [19–22].

There are several other advantages of VR, which Bideau et al. [23] summarized. Their main contribution is to show that interactive and immersive virtual realities can elicit expert responses similar to real-world responses.

Another trend in computer science can help to improve the experimental control and the analysis of the results. With the more frequent usage of eye trackers and with more accurate, faster and ubiquitous devices, huge amounts of precise data can be generated. Machine learning provides the power to deal with huge amounts of data. In fact, machine learning algorithms typically improve with more data and allow fast, precise, objective and reproducible data analysis. Machine learning methods are used in different kinds of eye tracking studies. Expertise classification problems in particular can be solved, as shown by Castner et al. [24, 25] in dentistry education or by work on expertise identification in microsurgery [26–29]. Machine learning techniques are the current state of the art for expertise identification and classification. Both supervised learning algorithms [25, 26] and unsupervised methods or deep neural networks [24] have shown their power for this kind of problem.

Expertise identification and classification enable adaptive and personalized system designs, e.g. virtual cognitive training systems. The difficulty can be adapted to the expertise of the user. For higher skilled users, the difficulty of a level can be raised by pointing out fewer cues. With enough data it is also possible to adapt a training level to personal deficiencies that were found during expertise identification.

Our focus in this work is in particular to objectively identify and classify expertise based on perceptual-cognitive skills that are represented by eye movements. Further, we are interested in obtaining explainable features that could explain differences between expertise groups and that might not be obvious but can be found by a feature selection approach. In this work we present a system that is based on photorealistic 360° videos viewed on VR glasses and on a machine learning approach for data analysis. We show techniques to find explainable differences in goalkeepers' gaze behaviour between three groups of expertise. This work is meant to lay the foundation for a machine learning based perceptual-cognitive diagnostic system in virtual reality.

Project description

The HTC Vive is a consumer-grade virtual reality (VR) headset. Gaze can be recorded at 250 Hz through integration of the SMI high speed eye tracker. The SteamVR framework is an open-source software that interfaces common real-time game engines with the VR glasses to display custom virtual environments. We projected omnidirectional 4k footage onto the inside of a sphere that envelops the user's field of view, which leads to high immersion and presence in a realistic scene.

Fig 1. Schematic overview of the response options. The option "kick out" is only explained verbally.
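The projection of equirectangular footage onto the inside of a sphere can be illustrated with a minimal sketch. The function names, the axis convention (viewer at the sphere centre, image centre straight ahead) and the frame size used in the example are assumptions made for illustration; they are not taken from the study's implementation.

```python
import math

def equirect_pixel_to_direction(u, v, width, height):
    """Map pixel (u, v) of an equirectangular frame to a unit vector on the viewing sphere.

    Assumed convention: the image centre looks along +z, +x is to the right, +y is up.
    """
    # Longitude (yaw) runs from -pi at the left edge to +pi at the right edge.
    lon = (u / width) * 2.0 * math.pi - math.pi
    # Latitude (pitch) runs from +pi/2 at the top row to -pi/2 at the bottom row.
    lat = math.pi / 2.0 - (v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)

def horizontal_pixels_to_degrees(dx_px, width):
    """Express a horizontal pixel distance as a rotation angle in degrees."""
    return dx_px * 360.0 / width

# Example with an assumed frame size of 3840 x 1920 pixels.
centre = equirect_pixel_to_direction(1920, 960, 3840, 1920)
turn = horizontal_pixels_to_degrees(2400, 3840)
```

Under the assumed frame width of 3840 pixels, a horizontal jump of 2400 px corresponds to a turn of 225°, which matches the x-axis figure quoted in the Method section.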

We captured the 360° footage by placing an Insta Pro 360 (360° camera) on the soccer field at the position of the goalkeeper. Members of a German first-league elite youth academy played 6 (5 field players plus goalkeeper) versus 5 match scenes. Each scene was developed with a training staff team of the German Football Association (DFB). We only used scenes that have binary decisions.

We captured data of 12 experts during a DFB youth elite goalkeeper camp. These data come from German youth elite soccer goalkeepers (U-15 to U-21). The data of 8 intermediates were captured in our laboratory and come from regional league soccer players (semi-professional). The data of 13 novices come either from players of lower leagues or from people with little or no experience in soccer.

The study was approved by the ethics committee of the Faculty of Economics and Social Sciences of the University of Tuebingen. After the participants had signed a consent form allowing the usage of their data, we familiarized them with the footage. Five different screenshots and stimuli were played and explained so that the participants could acclimate to the setup. To teach the decision options we also showed a schematic overview. By doing this, we reduced the number of possible answers (see figure 1, plus the "kick out" option). The general procedure is as follows: One of the 26 stimuli is played in the VR glasses. Directly after the last pass (to the goalkeeper) is received, the video stops and a black screen is presented. The participant then has 1.5 seconds to state the decision option they want to take and the color of the ball, which was printed on the last return pass (to force all participants to perceive the last return pass realistically). The second block contains the same 26 stimuli but in a different order. Each decision on the continuation of a video has a binary rating: only one decision is counted as 1 (correct), and the remaining options are rated as 0 (incorrect). The correct answer is always the one teammate who stands free.

Fig 2. Example stimulus in equirectangular format.

Method

The raw data of the SMI eye tracker can be exported from the proprietary BeGaze software as csv files. BeGaze already provides the calculation of different eye movement features based on the raw gaze points. The following section describes the steps that are necessary to train a model based on eye movement features.

For the classification of expertise level we focus on the following features:
• event durations and frequency (fixation / saccade),
• fixation dispersion (in °),
• smooth pursuit duration (in ms),
• smooth pursuit dispersion (in °),
• saccade amplitude (in °),
• average saccade acceleration (in °/s²),
• peak saccade acceleration (in °/s²),
• average saccade deceleration (in °/s²),
• peak saccade deceleration (in °/s²),
• average saccade velocity (in °/s),
• peak saccade velocity (in °/s).

Each participant viewed 26 stimuli twice, resulting in 52 trials per subject.

First, we viewed the samples of these 52 trials and checked the confidence measures of the eye tracking device. We removed all trials with a tracking ratio of less than 75%, as gaze data below this threshold are not reliable. Due to errors in the eye tracking device, not all participants' data are available for all trials. Hence, we only used trials that we consider valid. The number of trials was still 52, except for 3 participants who only had 41 valid trials. We then checked the remaining trials for the data quality of saccades. This data preparation is necessary to remove erroneous and low quality data that come from poor detections of the eye tracking device and do not reflect the correct gaze. Therefore, we investigated invalid samples and removed (1) all saccades with invalid starting position values, (2) all saccades with invalid intra-saccade samples, and (3) all saccades with invalid velocity, acceleration or deceleration values.

(1) Invalid starting position: 0.22% of the saccades started at coordinates (0,0). This is an encoding for an error of the eye tracking device. As amplitude, acceleration, deceleration and velocity are calculated based on the distance from start point to end point, these calculations result in physiologically impossible values, e.g., saccade amplitudes of over 360°.

(2) Invalid intra-saccade values: Another error of the eye tracking device stems from the way the saccade amplitude is calculated from the average velocity (equation 1), which is based on the distance between start and end points on a sample-to-sample basis (see equation 2). 3.6% of the saccades had at least one invalid gaze sample and were removed (for an example see figure 3).

\[ \text{Amplitude} = \overline{\text{Velocity}} \cdot \text{EventDuration} \tag{1} \]

\[ \overline{\text{Velocity}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\operatorname{dist}(\text{startpoint}(i), \text{endpoint}(i))}{\text{EventDuration}(i)} \tag{2} \]

On samples 7, 8, 14-16 and 18-20 both the x- and the y-signal show zero values and thereby indicate a tracking loss. As the saccade amplitude is based on the average velocity, which is calculated sample by sample according to equation (2), the velocities from samples 6 to 7, 8 to 9, 13 to 14, 16 to 17, 17 to 18, and 20 to 21 increase the average velocity extremely, because the distances are large (on average over 2400 px for the x-signal and over 1000 px for the y-signal, which corresponds to a turn of 225° on the x-axis and 187.5° on the y-axis within the 4 ms between two consecutive samples).

Fig 3. Example of invalid intra-saccade values. The x-axis shows the sample number (40 samples, 250 Hz, 160 ms duration) and the y-axis shows the position in pixels. The blue line represents the x-signal of the gaze and the orange line the y-signal.

There are two interpretations of saccadic amplitude. The first refers to the shortest distance from the start to the end point of a saccadic movement (i.e., a straight line) and the second describes the total distance traveled along the (potentially curved [30], p.311) trajectory of the saccade. The SMI implementation follows the second definition. We could have interpolated invalid intra-saccade samples instead of removing the complete saccade from the analysis. However, this leads to uncertainties that can affect the amplitude depending on the number of invalid samples, and it does not necessarily represent the true curvature of the saccade.

(3) Invalid velocity, acceleration or deceleration: As the velocity increases as a function of the saccade amplitude [31], 4.8% of the saccades were discarded because their velocity exceeded 1000°/s. Similar to extreme velocities, we removed all saccade samples that exceeded the maximum theoretical acceleration and deceleration thresholds. Saccades with larger amplitudes have higher velocity, acceleration and deceleration, but cannot exceed the physiological boundary of 100,000°/s² [30]. 3.0% and 4.0% of all saccades, respectively, exceeded this limit. As most of the invalid samples had more than one error source, we only removed 5.5% of the saccades (3.5% of all samples) in total.

After cleaning the data we use the remaining samples to calculate the average, maximum, minimum and standard deviation of the features. This results in 36 individual features. We use those for classifying expertise in the following.

In the following, we refer to expert samples as trials completed by an elite youth player of the DFB goalkeeper camp, to intermediate samples as those of regional league players and to novice samples as those of amateur players. We built a support vector machine (SVM) model and validated it in two steps: cross-validation and leave-one-out validation. We trained and evaluated our model in 1000 runs with both validations. For each run, we trained a model (and validated it with cross-validation) with the samples of 8 experts, 8 intermediates and 8 novices, and used the samples of the remaining participants to predict their classes (leave-out validation). The experts as well as the intermediates and the novices in the validation set were picked randomly for each run.
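The cleaning rules (1) to (3) and the aggregation into average, maximum, minimum and standard deviation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary keys (`start`, `samples`, `peak_velocity`, `peak_acceleration`, `peak_deceleration`) are hypothetical and do not reflect the BeGaze export format, and the use of the population standard deviation is an arbitrary choice. Only the two thresholds are taken from the text.

```python
from statistics import mean, pstdev

MAX_VELOCITY = 1000.0         # deg/s, velocity limit stated in the text
MAX_ACCELERATION = 100000.0   # deg/s^2, acceleration/deceleration limit stated in the text

def is_valid_saccade(saccade):
    """Return True if a saccade passes cleaning rules (1) to (3)."""
    # (1) A start position of (0, 0) encodes a tracker error.
    if saccade["start"] == (0, 0):
        return False
    # (2) Any (0, 0) sample inside the saccade indicates a tracking loss.
    if any(point == (0, 0) for point in saccade["samples"]):
        return False
    # (3) Velocity, acceleration or deceleration beyond the stated limits.
    if saccade["peak_velocity"] > MAX_VELOCITY:
        return False
    if abs(saccade["peak_acceleration"]) > MAX_ACCELERATION:
        return False
    if abs(saccade["peak_deceleration"]) > MAX_ACCELERATION:
        return False
    return True

def aggregate(values):
    """Average, maximum, minimum and standard deviation of one feature within a trial."""
    return {
        "avg": mean(values),
        "max": max(values),
        "min": min(values),
        "std": pstdev(values),  # population standard deviation (assumption)
    }

# Example: one plausible saccade and one with a tracking loss inside it.
ok = {"start": (100, 200), "samples": [(100, 200), (150, 210)],
      "peak_velocity": 300.0, "peak_acceleration": 9000.0, "peak_deceleration": -7000.0}
lost = dict(ok, samples=[(100, 200), (0, 0)])
kept = [s for s in (ok, lost) if is_valid_saccade(s)]
```

Removing a whole saccade as soon as one rule fails mirrors the decision described above not to interpolate invalid intra-saccade samples.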

We found that the way the samples of the data set are split into training and evaluation set is very important and that a participant-wise split should be applied. When samples are picked randomly, independent of the corresponding participant, the samples of a participant usually end up distributed over both the training and the evaluation set (illustrated in figure 4). This leads to an unintended learning behavior that does not necessarily classify expertise directly, but rather assigns a sample to a specific participant and thereby indirectly to that participant's expertise level. A model trained this way would work perfectly for known participants but is unlikely to work for unseen data. Multiple studies have shown that the gaze behavior of humans follows idiosyncratic patterns. Holmqvist et al. [30] show that a large number of eye tracking measures are subject to the participants' idiosyncrasy, which also means that inter-participant differences are much higher than intra-participant differences. In that case a classifier learns a biometric, person-specific measure instead of an expertise representation.

Fig 4. Example sample assignment. The top row shows a random assignment of samples, independent of the corresponding participant. The bottom row shows a participant-wise sample assignment to training and evaluation set.
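A participant-wise split of this kind can be sketched in a few lines. The sample layout (dictionaries with `participant` and `label` keys), the function name and the group sizes in the example are assumptions for illustration, and participant identifiers are assumed to be unique across classes.

```python
import random

def participant_wise_split(samples, n_train_per_class, seed=None):
    """Split trials so that all trials of a participant land on the same side.

    samples: list of dicts with (hypothetical) keys "participant" and "label".
    Returns (training samples, held-out samples).
    """
    rng = random.Random(seed)
    # Collect the participant ids of each expertise class.
    participants_by_class = {}
    for sample in samples:
        participants_by_class.setdefault(sample["label"], set()).add(sample["participant"])
    # Randomly pick the training participants of each class.
    train_ids = set()
    for label in sorted(participants_by_class):
        ids = sorted(participants_by_class[label])
        rng.shuffle(ids)
        train_ids.update(ids[:n_train_per_class])
    train = [s for s in samples if s["participant"] in train_ids]
    held_out = [s for s in samples if s["participant"] not in train_ids]
    return train, held_out

# Example with made-up data: three trials per participant.
demo = [
    {"participant": f"{label}-{p}", "label": label, "trial": t}
    for label, count in (("expert", 12), ("intermediate", 10), ("novice", 13))
    for p in range(count)
    for t in range(3)
]
train, held_out = participant_wise_split(demo, n_train_per_class=8, seed=0)
```

Because the split is drawn over participant ids rather than over individual trials, no participant can contribute samples to both sets, which is the property the bottom row of figure 4 illustrates.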

To find a model that is robust to high data variation, we applied cross-validation during training. The final model is based on the average of k = 50 models, where k is the number of folds in the cross-validation. For each model m_i, with i ∈ {1, ..., k}, we use all out-of-fold data of the i-th fold to train m_i and the in-fold data of the i-th fold to evaluate it. The final model is evaluated with a leave-out validation. The cross-validation step during training is independent of the leave-out validation with totally new data (never seen by the model): information from the cross-validation is used while building and optimizing the model, whereas the leave-out validation only provides information about the prediction accuracy of the model on completely new data.

Fig 5. Illustration of the k-fold cross-validation procedure. Each of the k models has a different out-of-fold and in-fold data set. We build the final model on the average of all predictions from all k models.

With a total of 810 valid samples, equally distributed over expert, intermediate and novice samples, we built a subset of 552 samples for training the model and a subset of 258 samples for evaluation. As each sample represents one trial, our approach here is to predict whether a trial belongs to the expert, intermediate or novice class. We tested this assumption with different approaches.
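The idea of training k fold models and averaging their predictions can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM, so this is explicitly not the model used in the study; the interleaved fold assignment and all function names are likewise assumptions.

```python
import math

def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): one mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def class_scores(centroids, row):
    """One score per class; a smaller distance to the centroid gives a higher score."""
    return {label: -math.dist(row, centre) for label, centre in centroids.items()}

def train_fold_models(X, y, k):
    """Train k models; model i is fit on all samples outside fold i."""
    models = []
    for i in range(k):
        out_of_fold = [n for n in range(len(X)) if n % k != i]  # interleaved folds (assumption)
        models.append(fit_centroids([X[n] for n in out_of_fold], [y[n] for n in out_of_fold]))
    return models

def predict_ensemble(models, row):
    """Average the per-class scores over all fold models and return the best class."""
    totals, counts = {}, {}
    for model in models:
        for label, score in class_scores(model, row).items():
            totals[label] = totals.get(label, 0.0) + score
            counts[label] = counts.get(label, 0) + 1
    return max(totals, key=lambda label: totals[label] / counts[label])

# Example with two well separated, made-up clusters.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3],
     [5.0, 5.0], [5.1, 4.9], [4.8, 5.2], [5.2, 5.1]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
models = train_fold_models(X, y, k=4)
prediction = predict_ensemble(models, [0.1, 0.1])
```

Averaging over fold models reduces the dependence of the final prediction on any single training subset, which is the motivation given above for using the average of the k models.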

Firstly, we used all 46 features to check the classifiability of this kind of data. The first approach contains all features from the section Feature selection with their derivations, namely average, maximum, minimum and standard deviation, to build an SVM model (tables 1, 2 and 3 show all features with their derivations, split by class). If the results of the binary case (experts vs. intermediates) indicate classifiability, the ternary case (experts vs. intermediates vs. novices) should be investigated.

Secondly, we looked at the features themselves and checked whether the individual features differ between classes and whether these differences are significant at the 5% level. We built a model based on the features that differ significantly at the 5% level (all white cells in tables 1, 2 and 3; gray cells mean that there is no significant difference between the groups).

In a third approach we reduced the number of features by running the prediction on all 46 features 1000 times. By taking the most frequent features of the model, we search for a subset of features that prevents the model from overfitting and allows interpretable results that represent the differences between the expertise classes with a minimum number of features. The resulting features with the highest frequency in our test can be seen in tables 1, 2 and 3, in orange.

Table 1. All 42 features with their derivations. Novice class.

Features                              average       std. dev.     minimum       maximum
Fixation frequency (Hz)               0.214         -             -             -
Fixation duration (ms)                214.017       31.926        190.49        239.30
Fixation dispersion (pixels)          72.092        25.68         24.67         110.523
Saccade frequency (Hz)                0.071         -             -             -
Saccade duration (ms)                 71.688        38.869        26.514        175.460
Saccade amplitude (°)                 9.294         9.417         0.574         51.402
Saccade acceleration, mean (°/s²)     4263.381      2482.019      366.666       13984.563
Saccade acceleration, peak (°/s²)     9322.483168   5777.273817   231.836       28355.224
Saccade deceleration, peak (°/s²)     -6848.104     4166.262      -35563.646    -411.760
Saccade velocity, mean (°/s)          105.463       65.023        20.288        298.134
Saccade velocity, peak (°/s)          215.245       129.294       40.310        766.157
Smooth pursuit duration (ms)          302.637       278.112       75.629        1026.329
Smooth pursuit dispersion (pixels)    622.805       201.268       185.437       1085.903

Gray cells show features with no significant differences between classes. Orange cells stand for a most frequent feature.

Table 2. All 42 features with their derivations. Intermediate class.

Features                              average       std. dev.     minimum       maximum
Fixation frequency (Hz)               0.255         -             -             -
Fixation duration (ms)                255.225       53.379        215.835       299.623
Fixation dispersion (pixels)          73.173        26.548        23.070        114.762
Saccade frequency (Hz)                0.084         -             -             -
Saccade duration (ms)                 84.349        59.726        26.127        246.121
Saccade amplitude (°)                 9.883         10.674        0.572         54.835
Saccade acceleration, mean (°/s²)     4123.970      2685.991      315.346       15472.889
Saccade acceleration, peak (°/s²)     8920.177      5989.251      216.722       28266.000
Saccade deceleration, peak (°/s²)     -6948.491     4770.063      -36334.137    -231.355
Saccade velocity, mean (°/s)          104.199       66.682        21.520        331.111
Saccade velocity, peak (°/s)          213.835       136.529       40.109        764.027
Smooth pursuit duration (ms)          291.092       278.718       73.835        977.120
Smooth pursuit dispersion (pixels)    425.089       124.853       168.320       694.370

We consider samples as belonging to a smooth pursuit when the dispersion of the samples is greater than 100 px, as the size of the players in the stimulus varies around 90 pixels plus a buffer.

Table 3. All 42 features with their derivations. Expert class.

Features                              average       std. dev.     minimum       maximum
Fixation frequency (Hz)               0.241         -             -             -
Fixation duration (ms)                241.509       58.629        198.132       291.721
Fixation dispersion (pixels)          72.837        25.989        21.736        114.549
Saccade frequency (Hz)                0.007         -             -             -
Saccade duration (ms)                 65.472        35.548        25.019        163.415
Saccade amplitude (°)                 8.938         9.430         0.567         52.029
Saccade acceleration, mean (°/s²)     4769.655      3064.343      390.094       18965.944
Saccade acceleration, peak (°/s²)     10026.456     7094.930      175.242       39445.125
Saccade deceleration, peak (°/s²)     -7912.190     5492.287      -43479.916    -362.396
Saccade velocity, mean (°/s)          110.675       72.737        21.182        375.363
Saccade velocity, peak (°/s)          238.371       157.740       40.262        935.514
Smooth pursuit duration (ms)          276.785       265.679       74.404        953.660
Smooth pursuit dispersion (pixels)    399.939       112.414       336.016       505.031

To strengthen the implicit assumption of this paper, that it is possible to distinguish between novices, intermediates and experts based on their gaze behavior, we evaluated our expert data separately by swapping the labels of a subset of experts to intermediate. After 100 iterations in which half of the experts were randomly labeled as intermediates, the average classification accuracy was below chance level, which means that the model cannot properly differentiate between experts. This strengthens our assumption that the differences among experts are smaller than the differences between experts, intermediates and novices.

Results

We first report the results of the classifiability test and then provide a deeper analysis of the model trained with all features and of two models based on features selected by 1) their significance level and 2) their frequency in the all-feature model. The classifiability test shows promising results. The binary model is able to distinguish between experts and intermediates with an accuracy of 88.1%. The model has a false negative rate of 1.6% and a false positive rate of 18.6%. This means the binary model predicted two out of 260 samples falsely as class one and 29 samples that are class zero as class one. As the false negative rate is rather low, the resulting miss rate is only 11.9%. The confusion matrix (figure 6) shows the overall metrics. The binary model is better at predicting class zero samples than class one samples. The overall accuracy of 88.1% is sufficient to proceed to the ternary classification. In the following we provide deeper insights into the ternary approaches by looking at accuracy, miss rate, recall and f1-scores and compare those values between the all-feature model (ALL), the most frequent features model (MFF) and the significant features model (SF).

Fig 6. Binary: distribution of prediction results from 100 runs.

The differences between the three approaches are barely visible when looking at the median (ALL: 75.08%, MFF: 78.20%, SF: 73.95%), but become larger when comparing the 75th percentile (ALL: 80.99%, MFF: 85.44%, SF: 79.25%). All models show a wide range of accuracy values, which means these models might overfit on some runs and underfit on others. The lower adjacent of all models is higher than chance level (ALL: 53.46%, MFF: 52.93% and SF: 52.41%), which means all models perform better than guessing. As accuracy is a rough performance metric that only reflects the number of correct predictions (true positives and true negatives), we take a more detailed look at the performance of the methods by comparing the miss rates of the individual approaches.

The miss rate is the rate of samples that belong to class x but were wrongly predicted to belong to class y. The ternary models are better at predicting the membership of samples to class one and class two than to class zero. This results in miss rates that are only a little lower than chance level when looking at the median miss rates (ALL: 28.12%, MFF: 23.81% and SF: 26.80%). The upper adjacent shows a high range of miss rates, reaching values of over 43.19% for the SF-model. The MFF-model has the lowest median miss rate of all three methods with a miss rate of 41.96%.

Fig 7. Accuracy values of the ternary methods.

Fig 8. Miss rates of ternary methods.

Fig 9. Recall values of ternary methods.

Recall is the number of samples correctly predicted as belonging to class x in relation to the number of samples that really belong to class x. All three models have a median recall of over 70%. In the ternary case, chance level is at 33.33%, which means that the median recall of all models is more than twice as high as chance level; the lower adjacent of all three models is also higher than 33.33%. The MFF-model median is the highest at 76.18%, followed by the SF-model at 73.19% and the ALL-model at 71.87%. Again the MFF-model has the best performance values of all three methods.
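The relation between the reported metrics can be made concrete with a small sketch that derives accuracy, per-class recall, miss rate and f1-score from a confusion matrix. The function name and the example matrix are made up for illustration and are not results of the study. With the definitions above, the miss rate of a class equals one minus its recall, which is consistent with the reported medians (for example 76.18% recall and 23.81% miss rate for the MFF-model).

```python
def per_class_metrics(confusion, labels):
    """Compute accuracy plus recall, miss rate and f1 for every class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    metrics = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        true_positive = confusion[i][i]
        actual = sum(confusion[i])                     # samples that really belong to the class
        predicted = sum(row[i] for row in confusion)   # samples predicted as the class
        recall = true_positive / actual if actual else 0.0
        precision = true_positive / predicted if predicted else 0.0
        denominator = precision + recall
        f1 = 2.0 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {"recall": recall, "miss_rate": 1.0 - recall, "f1": f1}
    return metrics

# Made-up ternary example with ten samples per class.
example = [[8, 1, 1],
           [2, 7, 1],
           [0, 2, 8]]
result = per_class_metrics(example, ["class zero", "class one", "class two"])
```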

The most frequent features in 100 runs are summarized in table 4. Only the minimum of the saccade duration has p > .05, which means that its differences are not statistically significant. All other features show significant differences: a Mann-Whitney U test rejects the null hypothesis that there are no differences with p < .05 for each of these features.
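For readers unfamiliar with the test, the following is a minimal sketch of a two-sided Mann-Whitney U test. It uses the normal approximation with tie correction and no continuity correction, so it is only appropriate for reasonably large samples and is not necessarily the exact procedure behind the p-values in table 4. The function name and the example values are assumptions.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of sample a, approximate p-value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # Pool both samples and remember which group each value came from.
    pooled = sorted([(value, 0) for value in a] + [(value, 1) for value in b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values with ranks i+1..j; use their average rank.
        average_rank = (i + 1 + j) / 2.0
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_a += average_rank * sum(1 for m in range(i, j) if pooled[m][1] == 0)
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0.0:
        return u_a, 1.0  # all values identical, no evidence of a difference
    z = (u_a - mean_u) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u_a, p_value

# Made-up example: two samples without any overlap.
u, p = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

A p-value below .05 would lead to the entry 1 in the "hypothesis discarded" column of table 4, a larger one to the entry 0.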

Table 4. Most frequent features.

Features                     derivation   novice      intermediate   expert      p-value       hypothesis discarded
saccade duration             std. dev.    38.869      59.726         35.548      3.33e-08      1
saccade duration             minimum      26.514      26.127         25.019      0.242216408   0
peak saccade deceleration    std. dev.    4166.262    4770.063       5492.287    2.49e-18      1
peak saccade velocity        std. dev.    129.294     136.529        157.740     6.19e-07      1
smooth pursuit dispersion    average      622.805     425.089        399.939     9.66e-82      1
smooth pursuit dispersion    minimum      185.437     168.320        336.016     5.44e-12      1
smooth pursuit dispersion    maximum      1085.903    694.370        505.031     1.52e-81      1

Looking more closely at the most frequent features and their values, it becomes clear that experts (SD = 35.54 ms) as well as novices (SD = 38.86 ms) have a more homogeneous gaze behaviour than intermediates (SD = 59.72 ms): the lengths of their saccades differ less. However, it would be a fallacy to attribute the same viewing behavior to novices and experts on the basis of the standard deviation and the minimum duration of the saccades, which is quite similar for all three groups (novice: 26 ms, intermediate: 26 ms, expert: 25 ms), since, for example, the average dispersion of smooth pursuits is about one third lower for experts (399.94 pixels) than for novices (622.81 pixels). This means that both groups have similarly long saccades among themselves, but the novices have similarly long saccades and the experts similarly short saccades. Conversely, this means that the experts have longer fixations than the novices and intermediates. Mann et al. [10] show that experts are overrepresented in fewer but longer fixations, because they have more time to process and absorb information.

Further differences between the groups can be found in the velocity of the saccades. On the one hand, the standard deviation of the peak saccade deceleration increases continuously from novices (4166.26°/s²) to intermediates (4770.06°/s²) to experts (5492.29°/s²), which is in line with the findings of Zwierko et al. [32], who say that the deceleration behaviour can be inferred from different expertise classes. Together with the differences in the distribution of the peak velocity of the saccades (standard deviation for novices: 129.29°/s, intermediates: 136.53°/s, experts: 157.74°/s), this suggests that experts have faster saccades, but on the other hand also show a more targeted, structured and fast gaze behavior. They adapt more to the situation. Novices perceive a scene as if it were an ordinary situation and try to look in all directions equally in order to maintain an overview.

One observation during the study was that novices often follow the ball with their gaze for a long time. This behavior is less evident among experts. They tend to look at the ball only when it has just been passed or when they themselves are not in play. At these times, the ball cannot change its path. This observation is supported by the values of the smooth pursuit dispersion. With a maximum of 505.03 pixels and a minimum of 336.02 pixels, experts have a very narrow window of smooth pursuit lengths. The maximum smooth pursuit dispersion of the experts is less than half that of the novices (1085.90 pixels), and the average smooth pursuit dispersion of the experts (expert: 399.94 pixels, intermediate: 425.09 pixels, novice: 622.81 pixels) is still about one third smaller than that of the novices. The intermediates are placed between the two groups. Again the values decrease continuously.

Discussion

Such a setup opens the door for dynamic and online analysis of gaze features based on natural gaze behavior. We are, however, aware that the small sample size restricts the conclusions that can be drawn and might lead to debatable results. Another limitation of this work is the restriction to eye movement features that are unrelated to head movement and the absence of a detailed smooth pursuit detection algorithm, which might be important. Therefore, in our future work we will implement an event calculation method, e.g. based on the work of Agtzidis et al. [33].

This work is meant to be a preliminary work for expertise prediction leading to objective perceptual skill assessment in virtual reality. We show that a study setup with an omnidirectional video source, a high speed eye tracker and a non-restrictive, realistic virtual environment is a promising technique for optimizing the trade-off between natural presentation mode and experimental control, and therefore allows the participants to apply their natural gaze behavior in a realistically mimicked environment. We are aware that the small sample count restricts the meaningfulness of the classification results and that more samples are needed to shape a robust model. But this work strengthens the assumption that there are differences between the gaze behavior of experts, intermediates and novices, and that these differences can be obtained through the mentioned methods. Especially when looking at the values of the most frequent features of the model in detail, the differences are noticeable and in line with the latest research. These differences lead to the conclusion that experts scan their environment in a more structured and faster way than intermediates and novices.

In this work we present a diagnostic model for classifying eye movement features into expert, intermediate and novice. The model is a first step towards the automatic and dynamic design of levels of a training system in a virtual environment based on personalized user gaze behavior. We show that this kind of data is classifiable with high accuracy and that the mentioned methods are suitable for obtaining explainable features of the user's gaze behaviour. After the binary and ternary classification of expertise, the following step should be a finer-grained gradation which, by mapping expertise onto a larger number of classes, allows the dynamic manipulation of the difficulty level of an exercise in a training system or of a game level in virtual environments. Besides a training system for athletes and other professional groups, the difficulty level in a VR game can be dynamically adjusted based on the gaze behavior of the user. In our further work, we plan to expand our data set to more subjects, add more classes, add a physical response mode and focus on research into person-specific, gaze-based detection of expertise weaknesses.

Another point is to integrate the model into an online diagnostic system. To use the model online, the gaze signal can be drawn directly at 250 Hz from the eye tracker by using the API provided by the vendor. Using a multi-threaded system, the data preparation and feature calculation can be done online in parallel to the data collection. Only the higher level features (e.g. SD) need to be computed when the trial ends and fed as a feature vector to the already trained model to estimate the class of the current trial. As predicting is done by solving a function, the prediction result should be available a few moments after the trial has ended. This is necessary, as the prediction is the input for the adaptation of the training. This work will be implemented in an online system for real-time gaze-based expertise detection in virtual reality systems, with an automatic input for the presentation device for dynamic manipulation of the difficulty of the scene.
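The online feature calculation outlined above can be sketched with running statistics that are updated as each event arrives, so that only the final feature vector has to be assembled when the trial ends. The class name, the feature order and the use of Welford's incremental algorithm are assumptions for illustration; the vendor API for reading gaze data is deliberately left out.

```python
import math

class RunningStats:
    """Incrementally track average, maximum, minimum and standard deviation of one feature."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations (Welford's algorithm)
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, value):
        """Add one new observation, e.g. the duration of a saccade that just ended."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def features(self):
        """Return [average, maximum, minimum, standard deviation] at the end of a trial."""
        std = math.sqrt(self._m2 / self.count) if self.count else 0.0
        return [self.mean, self.maximum, self.minimum, std]

# Example: feed made-up saccade durations (ms) one by one during a trial.
saccade_duration = RunningStats()
for duration in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    saccade_duration.update(duration)
feature_vector = saccade_duration.features()
```

Each update takes constant time, so such a tracker can run in the data collection thread without storing the raw events of the whole trial.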

Acknowledgment

This research was supported by the German Football Association (DFB). We thank our colleagues from the DFB who provided insight and expertise that greatly assisted the research.

References