Explaining the difference between men's and women's football

Abstract

Women's football is gaining supporters and practitioners worldwide, raising questions about what the differences are with men's football. While the two sports are often compared based on the players' physical attributes, we analyze the spatio-temporal events during matches in the last World Cups to compare male and female teams based on their technical performance. We train an artificial intelligence model to recognize if a team is male or female based on variables that describe a match's playing intensity, accuracy, and performance quality. Our model accurately distinguishes between men's and women's football, revealing crucial technical differences, which we investigate through the extraction of explanations from the classifier's decisions. The differences between men's and women's football are rooted in play accuracy, the recovery time of ball possession, and the players' performance quality. Our methodology may help journalists and fans understand what makes women's football a distinct sport and coaches design tactics tailored to female teams.



Luca Pappalardo

ISTI-CNR

Alessio Rossi

University of Pisa

Giuseppe Pontillo

University of Turin

Michela Natilli

University of Pisa

Paolo Cintia

University of Pisa

arXiv preprint [stat.AP]

Keywords: data science · sports analytics · football analytics · artificial intelligence · explainable AI

Women's football took its first steps thanks to the independent women of the Kerr Ladies team, who gave the most significant impetus to this sport from the early twentieth century [30]. As time passed, the Kerr Ladies intrigued the English crowds with their ability to stand up to male teams in numerous charity competitions. The success and enthusiasm of these events aroused concerns within the English Football Association, which on December 5, 1921, decreed that “football is quite unsuitable for females and ought not to be encouraged”, and requested “the clubs belonging to the Association to refuse the use of their grounds for such matches” [30]. Unfortunately, this measure drastically slowed down the development of women's football, which, after a long period of stagnation, resurfaced in the first half of the 1960s in northern European countries such as Norway, Sweden, and Germany. From that moment on, the development of women's football was unstoppable, spreading to the stadiums of Europe and the world and carving out a notable showcase among the most popular sports in the world. Since 2012, the number of women's academies has doubled [18], and around 40 million girls and women play football worldwide nowadays [26].

In the last decade, the attention around women's football has stimulated statistical comparisons with men's football [18, 28, 11]. Bradley et al. [2] compare 52 men and 59 women, drawn during a Champions League season, and observe that women cover more distance than men at lower speeds, especially in the final minutes of the first half. However, at higher speed levels, men perform better throughout the game [2]. Sakamoto et al. [28] examine the shooting performance of 17 men and 17 women belonging to a university league, finding that women have lower average values than men on ball speed, foot speed, and ball-to-foot velocity ratio [28]. Pedersen et al. [26] question the rules and regulations of the game and, taking into account the average height difference between 20-25 year-old men and women, estimate that the “fair” goal height in women's football should be 2.25 m, instead of 2.44 m. Gioldasis et al. [11] recruit 37 male and 27 female players from an amateur youth league and find that, while among male players there is a significant difference between roles for almost all technical skills, among female players only the dribbling ability presents a significant difference. Sakellaris [29] finds that, in international football competitions, female teams have a higher average number of goals scored per match than their male counterparts. Finally, Lange et al. [18] follow 157 female and 207 male young Dutch footballers to investigate the tendency to stop the game so that a teammate or opponent on the ground can receive care, finding that women show, on average, a greater willingness to help.

An overview of the state of the art cannot avoid noticing that current studies focus on physical features and analyze small samples of male and female players using data collected on purpose. At the same time, although massive digital data about the technical behavior of players are nowadays available at an unprecedented scale and detail [24, 22, 13, 7, 1, 27], investigations of the differences between women's and men's football from a technical point of view are still limited. Is the intensity of play in women's matches higher than in men's? Are women more accurate than men in passing? Furthermore, does the statistical distribution of male players' performance quality differ from that of female players?

In this article, we analyze a large dataset describing 173k spatio-temporal events that occur during the last men's and women's World Cups: 64 and 44 matches, respectively, and 32 men's and 24 women's teams with 736 male players and 546 female players. To the best of our knowledge, ours is the largest sample of men's and women's football matches and players. We quantify players' and teams' performance in several ways, from the number of game events generated during a match to the proportion of accurate passes, the velocity of the game, the quality of individual performance, and teams' collective behavior. We then tackle the following interesting question:

Can a machine distinguish a male team from a female one based on their technical performance only?

Using a machine learning classifier, we show that men's and women's football do have apparent differences, which we investigate through the extraction of global and local explanations from the classifier's decisions. Opening the classifier's black box allows us to reveal that, while the intensity of the game is similar, the differences between men's and women's football are rooted in play accuracy, time to recover ball possession, and the typical performance quality of the players.

Our methodology is useful to several actors in the sports industry. On the one hand, a deeper understanding of female and male performance differences may help coaches and athletic trainers design training sessions, strategies, and tactics tailored for women players. On the other hand, our results may help sports journalists tell and football fans understand what makes women's football a distinct sport.

We use data related to the last men's World Cup 2018, describing 101,759 events from 64 matches, 32 national teams and 736 players, and the last women's World Cup 2019, with 71,636 events from 44 matches, 24 national teams and 546 players. Each event records its type (e.g., pass, shot, foul), a time-stamp, the player(s) related to the event, the event's match, the position on the field, the event subtype and a list of tags, which enrich the event with additional information [24] (see an example of event in Table 1). Events are annotated manually from each match's video stream using proprietary software (the tagger) by three operators: one operator per team and one operator acting as responsible supervisor of the output of the whole match. The dataset regarding the men's World Cup 2018 has been publicly released recently [25], together with a detailed description of the data format, the data collection procedure, and its reliability [24, 23]. Match event streams are nowadays a standard data format widely used in sports analytics for performance evaluation [23, 7, 22, 8] and advanced tactical analysis [9, 5, 15]. Figure 1a shows some events generated by a player in a match. Figure 1b shows the distribution of the total number of events in our dataset: on average, a football match has around 1600 events, whereas a couple of matches have up to 2200 events.
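To make the event format concrete, the following minimal sketch counts events per team and per event type for a single match. The field names (eventName, eventSec, teamId, matchId) follow the example event of Table 1; the sample records, the second team identifier and the helper name count_events are illustrative assumptions and are not part of the released dataset or its tools.

```python
from collections import Counter, defaultdict

# Hypothetical events: field names follow the Table 1 example,
# the values (including team 3162) are invented for illustration.
SAMPLE_EVENTS = [
    {"eventName": "Pass", "eventSec": 2.41, "teamId": 3161, "matchId": 2576335},
    {"eventName": "Pass", "eventSec": 5.10, "teamId": 3161, "matchId": 2576335},
    {"eventName": "Duel", "eventSec": 7.80, "teamId": 3162, "matchId": 2576335},
    {"eventName": "Shot", "eventSec": 9.95, "teamId": 3162, "matchId": 2576335},
]


def count_events(events):
    """Return {teamId: Counter(eventName -> number of events)}."""
    volume = defaultdict(Counter)
    for event in events:
        volume[event["teamId"]][event["eventName"]] += 1
    return volume


volume = count_events(SAMPLE_EVENTS)
for team_id in sorted(volume):
    counts = volume[team_id]
    # Total number of events, followed by the breakdown per event type.
    print(team_id, "total:", sum(counts.values()), dict(counts))
```

Grouping by team first keeps the per-team totals and the per-type counts in one structure, which is the shape needed when a team in a match is later described by a vector of variables.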

Do technical characteristics of men's and women's football significantly differ, statistically speaking?

To answer this question, we define variables that describe relevant technical aspects of the game and show for which of them there is a statistical difference between men and women. In particular, we investigate three technical aspects: (i) intensity of play (Section 3.1); (ii) shooting distance (Section 3.2); and (iii) performance quality (Section 3.3).

The intensity of play is associated with a team's chance of success [5, 6]. Here, we measure intensity of play in terms of volume and velocity.

Volume.

For each team in a match, we compute the total number of events and the number of specific event types (duels, fouls, free kicks, offsides, passes and shots) [24]. Although, on average, men's matches show more events than women's, this difference is not statistically significant (unpaired t-score = 1.40, p-value = 0.16, see Table 2). Women's matches have, on average, more free kicks, duels, others on the ball (i.e., accelerations, clearances and ball touches) and passes but fewer fouls than men's matches (Table 2). Additionally, men's passes are on average more accurate than women's (unpaired t-score = 8.95, p-value < …, Table 2).

Figure 1: (a) Example of events observed for a player in our dataset. Events are shown at the position where they occurred. This plain “geo-referenced” visualization of events helps to understand how the player's behavior during the match can be reconstructed. (b) Distribution of the number of events per match (x-axis: number of events; y-axis: frequency; values annotated on the plot: 1605.51 and 156.98). On average, a football match in our dataset has 1600 events.

    {"eventName": "Pass",
     "eventSec": 2.41,
     "playerId": 3344,
     "matchId": 2576335,
     "teamId": 3161,
     "positions": [{"x": 49, "y": 50}],
     "subEventName": "Simple pass",
     "tags": [{"id": 1801}]}

Table 1: Example of event corresponding to an accurate pass. eventName indicates the name of the event's type: there are seven types of events (pass, foul, shot, duel, free kick, offside and touch). eventSec is the time when the event occurs (in seconds since the beginning of the current half of the match). playerId is the identifier of the player who generated the event. matchId is the match's identifier. teamId is the team's identifier. subEventName indicates the name of the subevent's type. positions is the event's origin and destination positions. Each position is a pair of coordinates (x, y) in the range [0, 100], indicating the percentage of the field from the perspective of the attacking team. tags is a list of event tags, each describing additional information about the event (e.g., accurate). A thorough description of this data format and its collection procedure can be found in [24].

Velocity.

The average pass velocity PassV(g) measures the average time between two consecutive passes in a match g, and the average ball recovery time RecT(g) measures the average time for a team to recover ball possession in g (see Supplementary Information 1). The interruption time StopT(g) indicates the time spent between two consecutive actions (i.e., the time to take a free kick, a corner kick or a throw-in). We also measure the average time between a team's two consecutive shots in a match and the average pass length PassL(g), i.e., the average distance between a pass's starting and ending points. For all of these features, we perform an unpaired t-test to detect differences between men and women (Table 2). The tests detect differences for women's PassV(g) (unpaired t-score = 8.69, p-value < …) and for two further features (unpaired t-scores = 5.41 and 3.54, p-values < …), at least one of which is lower for women than for men (Table 2).
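The velocity features and the unpaired t-test described above can be sketched as follows. This is a minimal illustration under stated assumptions: the events belong to a single half (eventSec restarts at each half), the t-score is Student's statistic with pooled variance (the text does not say which variant of the unpaired test is used), and all numbers, as well as the helper names average_pass_interval and unpaired_t_score, are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def average_pass_interval(events):
    """Average time in seconds between consecutive passes of one half."""
    times = sorted(e["eventSec"] for e in events if e["eventName"] == "Pass")
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return mean(gaps) if gaps else float("nan")


def unpaired_t_score(sample_a, sample_b):
    """Student's t statistic for two independent samples (pooled variance)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = ((n_a - 1) * variance(sample_a)
              + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))


# Invented events of one half: only the three passes contribute to the metric.
first_half = [
    {"eventName": "Pass", "eventSec": 0.0},
    {"eventName": "Duel", "eventSec": 1.0},
    {"eventName": "Pass", "eventSec": 2.0},
    {"eventName": "Pass", "eventSec": 5.0},
]
pass_v = average_pass_interval(first_half)  # mean of the gaps 2.0 and 3.0

# Invented per-match values of one feature for two groups of matches.
group_a = [2.0, 2.5, 3.0]
group_b = [3.0, 3.5, 4.0]
t_score = unpaired_t_score(group_a, group_b)
print(pass_v, round(t_score, 3))
```

The sketch stops at the t-score; turning it into a p-value additionally requires the cumulative distribution of Student's t with n_a + n_b - 2 degrees of freedom, which is not part of the Python standard library.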

Table 2 (columns: Event, Women, Men, t-score, p-value). Entries are given as value ± standard deviation per match; among them, only the women's values of $\mathrm{PR}_{avg}$ (-0.01) and $\mathrm{PR}_{std}$ (0.05) remain legible. Grey rows indicate features for which the difference between men and women is statistically significant. The highest values are highlighted in bold.

We explore the spatial distribution of the positions where male and female players perform free kicks and shots (see Supplementary Figure 10) and quantify shooting distance ShotD as the Euclidean distance from the position where the shot starts to the center of the opponents' goal. To find statistical differences between men and women, we use the non-parametric Mann-Whitney U-Test. On average, men kick the ball from a greater distance than women (p-value < 0.001, Table 2).

To take into account that men and women may have a different perception of distance to the opponents' goal, we split the attacking midfield into three zones $Z_1$, $Z_2$ and $Z_3$, according to the two distributions of shooting distance, i.e., looking at a shot's minimum and maximum starting positions. $Z_1$ is the area closest to the goal, $Z_3$ the furthest, $Z_2$ the zone in the middle. The zones of women are 1.1 meters closer to the goal than the zones of men (p-value < 0.001).

We then use a z-test for proportions with two independent samples to verify whether there is a difference in the shooting activity between men and women. Female teams have a higher percentage of shots from their $Z_1$ zone than male teams (p-value = 0.01); the opposite is true in the $Z_2$ shooting area (p-value = 0.004). Finally, female teams have a higher percentage of shots from their $Z_3$ shooting area (p-value = 0.02) than male teams.

We use the PlayeRank algorithm [23] to compute the PR score, which quantifies a player's performance quality in a match (see Supplementary Information 2 for details on the algorithm).
PlayeRank agrees robustly with rankings of players given by professional football scouts, thanks to its capability of describing football performance comprehensively [23]. For each match g, and for both teams, we compute the mean and the standard deviation of the individual PR scores, $\mathrm{PR}_{avg}(T, g)$ and $\mathrm{PR}_{std}(T, g)$, respectively. High values of $\mathrm{PR}_{avg}(T, g)$ indicate that the players in team T perform well in match g, on average. High values of $\mathrm{PR}_{std}(T, g)$ indicate a large variability of PR across the teammates in match g. Male players have higher $\mathrm{PR}_{avg}$ than female players (unpaired t-score = 9.01, p < 0.001) but similar $\mathrm{PR}_{std}$ (unpaired t-score = -0.40, p-value = 0.69). We find a statistical difference in the PR score between men and women for left fielders only (Figure 2).

[Figure 2 plot. Axes: role (right fielder, central forward, central fielder, left fielder, left central back, right forward, right central back, left forward) versus PlayeRank score; legend: Male, Female.]

Figure 2: PlayeRank score by role for male and female players. Asterisks indicate a significant statistical difference between male and female for that role.

Table 3: List of the top ten football teams with the highest average H, FC, and PR indicators in the two competitions.

Team       Sex  H avg     Team       Sex  PR avg    Team       Sex  FC avg
Spain      M    1.67      USA        F    0.08      Mexico     M    0.064
Egypt      M    1.60      France     F    0.04      Germany    M    0.063
Denmark    M    1.59      Belgium    M    0.04      Morocco    M    0.063
Japan      F    1.56      Germany    F    0.04      Spain      M    0.063
Australia  M    1.54      Australia  F    0.035     Argentina  M    0.062
England    F    1.53      Italy      F    0.035     USA        F    0.062
Chile      F    1.51      Croatia    M    0.035     Canada     F    0.061
Iran       M    1.47      Sweden     F    0.034     Japan      F    0.061
England    M    1.45      Russia     M    0.03      England    F    0.061
Tunisia    M    1.44      England    F    0.03      Peru       M    0.06

We also explore the differences in the collective behavior of male and female teams by computing the passing networks, graphs in which nodes are players and edges represent passes between teammates in a match [5, 10, 19, 4, 3]. From the passing network of a team T in a match g we derive the H indicator H(T, g) [5, 6] and the team flow centrality FC(T, g) [10], two ways of quantifying the goodness of a team's performance in a match [24] (Supplementary Information 3). H(T, g) summarizes different aspects of a team's passing behaviour, such as the average amount $\mu_p$ of passes and the variance $\sigma_p$ of the number of passes managed by players [5]. The higher the $\sigma_p$, the higher the heterogeneity in the volume of passes managed by the players. A player's flow centrality in a match is defined as their betweenness centrality in the passing network [10]. The team flow centrality, FC(T, g), is hence defined as the average of the flow centralities of the players of team T in match g [10].

Table 3 shows the top ten male and female teams with the highest average H indicator $H_{avg}$, average PR score $\mathrm{PR}_{avg}(T)$, and average FC score $\mathrm{FC}_{avg}(T)$. Spain is the male team with the best overall team performance ($H^{(M)}_{avg}(\mathrm{Spain})$ = 1.67), and so is Japan in the women's World Cup ($H^{(F)}_{avg}(\mathrm{Japan})$ = 1.56). In general, the H indicator of male teams ($H^{(M)}_{avg}$ = 1.…) is higher (unpaired t-score = 2.67, p < 0.02) than that of female teams ($H^{(F)}_{avg}$ = 1.…). Similarly, the FC indicator of male teams ($\mathrm{FC}^{(M)}_{avg}$ = 0.…) is slightly higher (unpaired t-score = 2.11, p < 0.04) than that of female teams ($\mathrm{FC}^{(F)}_{avg}$ = 0.…).

Our statistical analysis reveals that male and female teams do differ in many technical characteristics (Table 2):

• Men perform more passes per match with a higher accuracy, indicating a higher volume of play and a better technical quality of men compared to women;
• Men perform longer passes and shoot from a longer distance than women, presumably due to the physical differences between genders (e.g., men have greater strength in the legs, which allows them to shoot from farther away);
• The typical performance quality of male teams, in terms of pass volume, heterogeneity, centrality and PR score, is higher than that of female teams. This result could be related to different playing styles;
• Women's ball recovery time is shorter than men's, denoting either a better capability of women to recover the ball or a lower capability to retain it, and characterizing a more fragmented game in women's football.

Classifier       Accuracy      Precision     Recall        F1-Score
AdaBoost.M1
Random Forest    0.86 (46%)    0.69 (45%)    0.82 (95%)    0.73 (65%)
Decision Tree    0.85 (77%)    0.68 (44%)    0.79 (88%)    0.71 (61%)
Logistic         0.79 (64%)    0.64 (36%)    0.79 (88%)    0.66 (50%)
Baseline         0.48          0.47          0.42          0.44

Table 4: Leave-one-team-out cross-validation results (i.e., Accuracy, Precision, Recall and F1-score) computed on the training dataset for each machine learning classifier used to predict a football team in a game as male (class 0) or female (class 1). The baseline classifier always predicts by respecting the training set's class distribution, which is balanced. The percentages in the table refer to the improvement of each machine learning model compared to the baseline results.
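Two of the quantities compared in the summary above, pass length and shooting distance, are Euclidean distances computed from event positions. A minimal sketch follows, under explicit assumptions: positions are percentages of the field as in Table 1, the opponents' goal is centred at x = 100, y = 50, and the pitch measures 105 m by 68 m. The pitch size and the helper names are assumptions made for illustration and are not stated in the text.

```python
from math import hypot

PITCH_LENGTH_M = 105.0  # assumption: pitch length is not given in the text
PITCH_WIDTH_M = 68.0    # assumption: pitch width is not given in the text
GOAL_CENTER = {"x": 100, "y": 50}  # assumed centre of the opponents' goal


def distance_m(p, q):
    """Euclidean distance in metres between two percent-coordinate points."""
    dx = (p["x"] - q["x"]) / 100.0 * PITCH_LENGTH_M
    dy = (p["y"] - q["y"]) / 100.0 * PITCH_WIDTH_M
    return hypot(dx, dy)


def shot_distance(shot_event):
    """Distance from the shot's starting position to the goal centre."""
    return distance_m(shot_event["positions"][0], GOAL_CENTER)


def pass_length(pass_event):
    """Distance between a pass's starting and ending positions."""
    start, end = pass_event["positions"][0], pass_event["positions"][-1]
    return distance_m(start, end)


# Invented events in the format of Table 1.
shot = {"eventName": "Shot", "positions": [{"x": 80, "y": 50}]}
simple_pass = {"eventName": "Pass",
               "positions": [{"x": 49, "y": 50}, {"x": 59, "y": 50}]}
print(shot_distance(shot), pass_length(simple_pass))
```

Converting percentages to metres before taking the distance matters because the two axes have different scales: one percent of the length is longer than one percent of the width.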

Having established that women's and men's football differ in many technical characteristics related to intensity of play, shooting distance, and performance quality, we now turn to the question:

Can we design a machine learning classifier to distinguish between a male and a female football team?

Machine learning can capture the interplay between technical features, and explanations extracted from the constructed classifier can reveal further insights into the differences between men's and women's football [14].

As a first step, we describe the behavior of a team T in match g by a performance vector of variables and associate it with a target variable:
• number of events ( … $\mathrm{PR}_{avg}$ and its standard deviation $\mathrm{PR}_{std}$;
• the target variable indicates whether the team is male (class 1) or female (class 0).

We build a supervised classifier and use 20% of the dataset to tune its hyper-parameters through a grid search with 5-fold cross-validation. We use the remaining 80% of the dataset to validate the model using a leave-one-team-out cross-validation: in turn, we leave out all matches of one team and train the model using all matches of the remaining teams. We assess the performance of the model using four metrics [16]: (i) accuracy, the ratio of correct predictions over the total number of predictions; (ii) precision, the ratio of correct predictions over the number of predictions for the positive class (male); (iii) recall, the ratio of correct predictions over the total number of instances of the positive class (male); (iv) F1 score, the harmonic mean of precision and recall.

We try several learners to construct different types of classifiers (Decision Tree, Logistic Regression, Random Forest, and AdaBoost). All classifiers achieve a good performance (see Supplementary Figure 9), with an average relative improvement of 67% in terms of F1-score over a classifier that predicts the team's gender at random (Table 4). The best model, AdaBoost, has an improvement of 93% over the baseline in terms of F1-score. These results indicate that a classifier can distinguish between male and female teams on the sole basis of the performance variables.
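The leave-one-team-out protocol and the four metrics described above can be sketched in plain Python. The learner used here is a deliberately simple stand-in (a threshold on a single pass-accuracy feature) and not AdaBoost or any of the classifiers evaluated in the paper; the team labels, feature values and function names are invented for illustration. Class 1 denotes a male team, as in the definition of the target variable above.

```python
def leave_one_team_out(rows):
    """Yield (team, train_rows, test_rows), holding out one team at a time."""
    teams = sorted({row["team"] for row in rows})
    for team in teams:
        train = [r for r in rows if r["team"] != team]
        test = [r for r in rows if r["team"] == team]
        yield team, train, test


def fit_threshold(train):
    """Stand-in learner: midpoint between the two class means of acc_p."""
    male = [r["acc_p"] for r in train if r["label"] == 1]
    female = [r["acc_p"] for r in train if r["label"] == 0]
    return (sum(male) / len(male) + sum(female) / len(female)) / 2


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with respect to the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented data: one row per team per match, label 1 = male, 0 = female.
ROWS = [
    {"team": "A", "acc_p": 0.85, "label": 1},
    {"team": "A", "acc_p": 0.83, "label": 1},
    {"team": "B", "acc_p": 0.84, "label": 1},
    {"team": "B", "acc_p": 0.78, "label": 1},
    {"team": "C", "acc_p": 0.74, "label": 0},
    {"team": "C", "acc_p": 0.76, "label": 0},
    {"team": "D", "acc_p": 0.75, "label": 0},
    {"team": "D", "acc_p": 0.80, "label": 0},
]

y_true, y_pred = [], []
for team, train, test in leave_one_team_out(ROWS):
    threshold = fit_threshold(train)  # fitted without the held-out team
    for row in test:
        y_true.append(row["label"])
        y_pred.append(1 if row["acc_p"] > threshold else 0)

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Holding out all matches of a team at once prevents the model from recognizing a team it has already seen in training, which a random split over matches would allow.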


Inspecting the reasoning underlying the model's decisions can provide deeper insights into the differences between men's and women's football. We extract global (i.e., inference on the basis of the complete dataset) and local (i.e., inference about an individual prediction) explanations from the best model (AdaBoost) using SHAP, a method that explains each prediction based on the optimal Shapley value [20]. Following game theory, the Shapley value of a performance variable is obtained by forming combinations of variables and averaging the change in the prediction when that variable is present or absent, which determines the importance of the single variable [20]. The interpretation of the Shapley value $\phi_j$ of the j-th variable is: the value of the j-th variable contributed $\phi_j$ to the prediction of a particular instance compared to the average prediction for the dataset [21].

Figure 3 shows the global explanation of AdaBoost, in which variables are ranked by their overall importance to the model according to their shap values. Pass accuracy (AccP) is by far the most important feature for classifying a team's gender. Recovery time (RecT), average interruption time (StopT), pass velocity (PassV), pass length (PassL), $\mathrm{PR}_{avg}$, and $\mathrm{PR}_{std}$ are other important features for the decision-making process.

Figure 4 shows the summary plot that combines feature importance and feature effects, where each point indicates a team. The position of a feature on the y-axis indicates the importance of that feature to the model's decision. A point's color, in a gradient from blue (low) to red (high), indicates its numerical value. The position of a point on the x-axis indicates the associated shap value: positive values indicate that a team is more likely to be male; negative values that it is more likely to be female. Higher values of AccP (red points) are associated with higher shap values.
This indicates that male players are typically more accurate in passing, a property that the classifier uses to discriminate a male team from a female one. Similarly, high values of RecT are associated with a higher probability that a team is male, further highlighting that female teams are characterized by a more fragmented play.

Figure 5a refers to the final of the men's World Cup 2018, Croatia vs. France. AdaBoost correctly predicts that France is a male team, basing its decision on five main variables: $\mathrm{PR}_{avg}$, … . France has RecT(France, CRO vs FRA) = 38.…, …(France, CRO vs FRA) = 241 and $\mathrm{PR}_{avg}$(France, CRO vs FRA) = 0.…, closer to the typical values of men's football ($\mathrm{RecT}^{(M)}$ = 27.…, …$^{(M)}$ = 394.…, $\mathrm{PR}^{(M)}_{avg}$ = 0.…) than to those of women's football ($\mathrm{RecT}^{(F)}$ = 19.…, …$^{(F)}$ = 430.…, $\mathrm{PR}^{(F)}_{avg}$ = −0.…). In contrast, AccP(France, CRO vs FRA) = 0.… and PassV(France, CRO vs FRA) = 2.… are closer to the typical values of a female team ($\mathrm{AccP}^{(F)}$ = 0.…, $\mathrm{PassV}^{(F)}$ = 2.…, Table 2) than to those of a male one ($\mathrm{AccP}^{(M)}$ = 0.…, $\mathrm{PassV}^{(M)}$ = 2.…, Table 2). Overall, the sum of the shap values indicates that France played a match in accordance with the typical characteristics of a male team.

Figure 5b shows the prediction for a match in the women's World Cup 2019, USA vs Spain. In this case, AdaBoost correctly predicts that USA is a female team, basing its decision mainly on AccP, $\mathrm{PR}_{std}$, StopT, RecT, and PassV. USA has RecT(USA, USA vs SPA) = 28.… and StopT(USA, USA vs SPA) = 30.…, closer to the typical values of men's football ($\mathrm{RecT}^{(M)}$ = 27.… and $\mathrm{StopT}^{(M)}$ = 23.…, Table 2) than to those of women's football ($\mathrm{RecT}^{(F)}$ = 19.… and $\mathrm{StopT}^{(F)}$ = 18.…, Table 2). In contrast, the values of AccP(USA, USA vs SPA) = 0.… and PassV(USA, USA vs SPA) = 2.… are more similar to those of women's teams (Table 2).
Overall, the sum of the shap values leads the model to classify USA as a female team.

Figures 6a and 6b visualize the predictions of the AdaBoost classifier on a test set of 31 men's matches and 21 women's matches with respect to the two most important variables, AccP and RecT. In just two cases out of 21, AdaBoost misclassifies a female team as a male one (Figure 6b). For example, in the match Brazil vs France of the women's World Cup, RecT(Brazil, BRA vs FRA) = 35.… and AccP(Brazil, BRA vs FRA) = 0.… (Figure 6c), which leads the model to misclassify Brazil as a male team because those values are more typical of men's football than of women's football.

In just three cases out of 31, a male team is misclassified as a female one (Figure 6a, red crosses). For example, in the match Sweden vs Mexico of the men's World Cup, Mexico is correctly classified as a male team: its values of AccP(Mexico, SWE vs MEX) = 0.… and RecT(Mexico, SWE vs MEX) = 30 are indeed close to the typical values of men's football. In contrast, in the match Germany vs South Korea, Germany is misclassified as a female team, mainly because RecT(Germany, GER vs KOR) = 20.… makes it more similar to a female team ($\mathrm{RecT}^{(F)}$ = 19.…) than to a male one ($\mathrm{RecT}^{(M)}$ = 27.…, see Table 2 and Figure 6d).

The misclassified women's teams have on average $\mathrm{AccP}^{(F, wrong)}$ = 0.… > $\mathrm{AccP}^{(F)}$ = 0.…, and $\mathrm{RecT}^{(F, wrong)}$ = 31 > $\mathrm{RecT}^{(F)}$ = 29. Moreover, on average $\mathrm{StopT}^{(F, wrong)}$ = 19, which is greater than $\mathrm{StopT}^{(F)}$ = 18 among all female teams. The misclassified male teams have $\mathrm{AccP}^{(M, wrong)}$ = 0.… < $\mathrm{AccP}^{(M)}$ = 0.… (close to $\mathrm{AccP}^{(F)}$ = 0.…), and $\mathrm{RecT}^{(M, wrong)}$ = 36 < $\mathrm{RecT}^{(M)}$ = 37 ($\mathrm{RecT}^{(F)}$ = 29). In both cases, AccP and RecT play a fundamental role in confusing the classifier.

SHAP library released for Python: https://github.com/slundberg/shap
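The Shapley attribution that underlies these explanations can be illustrated with an exact computation on a toy model. The sketch below is not the SHAP library and not the AdaBoost classifier of this paper: it enumerates all subsets of variables, which is feasible only for a handful of features, and it replaces "absent" variables with a fixed baseline value. The scoring function toy_score, its coefficients and all feature values are hypothetical; only the variable names AccP and RecT come from the text.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    names = list(instance)
    n = len(names)

    def value(present):
        mixed = {k: instance[k] if k in present else baseline[k] for k in names}
        return model(mixed)

    phi = {}
    for j in names:
        others = [k for k in names if k != j]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                total += weight * (value(present | {j}) - value(present))
        phi[j] = total
    return phi


def toy_score(x):
    """Hypothetical score with an interaction term; higher = more 'male-like'."""
    return 4.0 * x["AccP"] + 0.02 * x["RecT"] + 0.1 * x["AccP"] * x["RecT"]


instance = {"AccP": 0.85, "RecT": 30.0}  # invented team-in-match values
baseline = {"AccP": 0.80, "RecT": 25.0}  # invented reference values

phi = shapley_values(toy_score, instance, baseline)
print(phi)
# The contributions add up to the gap between the instance and the baseline.
print(sum(phi.values()), toy_score(instance) - toy_score(baseline))
```

The last line shows the property used when reading a local explanation: summing the per-variable contributions recovers the distance of the prediction from the reference prediction, which is why the sum of the shap values determines the predicted class.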

Figure 3: Ranking of feature importance (mean Shap value) extracted from the team gender classifier.

The availability of spatio-temporal match events related to the last men's and women's World Cups allowed us to compare the technical characteristics of men's and women's football. While most of the existing works focus on the differences in physical characteristics, we reconstructed a complex mosaic of the differences between male and female players. Our statistical analysis revealed that differences do exist in several technical features: the time between two consecutive events and the time required to recover possession are the lowest in women's football; conversely, male teams are typically more accurate in passing, and they kick the ball from a greater distance than women players. The inspection, through global and local explanations, of a model that classifies team gender from the technical features confirmed that the percentage of accurate passes and the time to recover possession are crucial to distinguish between the two sports. In particular, the local explanations provide a novel perspective to reason about the difference between men and women in football, highlighting the reason behind the peculiar cases in which the classifier has been “fooled” by a team's technical performance.

Our results are open to various interpretations. First of all, the statistical non-significance of the difference in the number of events and shots suggests that, overall, men's and women's football have similar play intensity. Conversely, the higher accuracy of passes in men's matches may be due to the higher technical level of male players, which may be rooted in the fact that national teams in the men's World Cup are mainly composed of professional players. In contrast, several female national teams (e.g., Italy) are composed of non-professional players or of players who have been professional only for a short time. Although the technical level of women's football is increasing rapidly, there is still a technical gap between the two sports. The shorter recovery time observed for women's matches may be due to both the lower pass accuracy (i.e., more balls lost) and a better capacity of women to press the opponents and recover ball possession. Performance indicators reveal that centrality is higher in men's football, denoting the presence of “hub” players that centralize the game (higher flow centrality) and higher variability in the performance quality across teammates (higher H indicator and PR score). This suggests that passes in women's football are more uniformly distributed across the teammates. Women's football also has a preference for short passes over long balls. Since accurate long balls are harder than short ones, this preference may be a way to compensate for women players' lower technical level.

As future work, we plan to investigate differences in men's and women's football in national tournaments, and to investigate to what extent these differences vary nation by nation and between national and continental competitions. Are the differences we found in this paper more marked in the longer competitions for clubs?


Figure 4: Distribution of the impact of each feature on the team gender classifier. The color represents the feature value (red high, blue low), and the position of the point indicates the Shap value.

Figure 5: Local Shap explanations for two examples in our dataset: France in France vs Croatia and USA in USA vs Spain. Feature values that increase the probability that a team is male are shown in red; those decreasing the probability are in blue.

Figure 6: (a, b) Scatter plots displaying AccP versus RecT for a test set of teams in male matches (a) and female matches (b). Circles indicate a team correctly classified by the team gender classifier, crosses indicate a mistake by the classifier. The dashed lines are at the median values for the two variables over the entire data set. In plots (c) and (d) we report the local Shap explanations of two misclassified examples.

References

[1] Bornn, L., Cervone, D., and Fernandez, J. Soccer analytics: Unravelling the complexity of “the beautiful game”. Significance 15, 3 (2018), 26–29.

[2] Bradley, P. S., Dellal, A., Mohr, M., Castellano, J., and Wilkie, A. Gender differences in match performance characteristics of soccer players competing in the UEFA Champions League. Human Movement Science 33 (2014), 159–171.

[3] Buldú, J. M., Busquets, J., Echegoyen, I., and Seirul·lo, F. Defining a historic football team: Using network science to analyze Guardiola’s F.C. Barcelona. Scientific Reports 9, 1 (2019), 13602.

[4] Cintia, P., Coscia, M., and Pappalardo, L. The haka network: Evaluating rugby team performance with dynamic graph analysis. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (2016), pp. 1095–1102.

[5] Cintia, P., Giannotti, F., Pappalardo, L., Pedreschi, D., and Malvaldi, M. The harsh rule of the goals: Data-driven performance indicators for football teams. In (Oct 2015), pp. 1–10.

[6] Cintia, P., Rinzivillo, S., and Pappalardo, L. Network-based measures for predicting the outcomes of football games. In Proceedings of the 2nd Workshop on Machine Learning and Data Mining for Sports Analytics co-located with 2015 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2015), Porto, Portugal, September 11th, 2015 (2015), pp. 46–54.

[7] Decroos, T., Bransen, L., Van Haaren, J., and Davis, J. Actions speak louder than goals: Valuing player actions in soccer. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019), pp. 1851–1861.

[8] Decroos, T., Bransen, L., Van Haaren, J., and Davis, J. VAEP: An objective approach to valuing on-the-ball actions in soccer. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20 (2020), International Joint Conferences on Artificial Intelligence Organization, pp. 4696–4700.

[9] Decroos, T., Van Haaren, J., and Davis, J. Automatic discovery of tactics in spatio-temporal soccer match data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), pp. 223–232.

[10] Duch, J., Waitzman, J. S., and Amaral, L. A. N. Quantifying the performance of individual players in a team activity. PloS one 5, 6 (2010).

[11] Gioldasis, A., Souglis, A., and Christofilakis, O. Technical skills according to playing position of male and female soccer players. International Journal of Sport Culture and Science 5 (2017), 293–301.

[12] Golbeck, J. Introduction to Social Media Investigation: A Hands-on Approach, first ed. Syngress, 2015.

[13] Gudmundsson, J., and Horton, M. Spatio-temporal analysis of team sports. ACM Comput. Surv. 50, 2 (Apr. 2017).

[14] Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., and Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 51, 5 (Aug. 2018).

[15] Gyarmati, L., and Hefeeda, M. Competition-wide evaluation of individual and team movements in soccer. In (2016), IEEE, pp. 144–151.

[16] Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference, and prediction, second ed. Springer Series in Statistics, 2009.

[17] James, G., Witten, D., Hastie, T., and Tibshirani, R. An introduction to statistical learning, vol. 112. Springer, 2013.

[18] Lange, P. A. M. V., Manesi, Z., Meershoek, R. W. J., Yuan, M., Dong, M., and Doesum, N. J. V. Do male and female soccer players differ in helping? A study on prosocial behavior among young players. PloS one 13(12), e0209168 (2018).

[19] López Peña, J., and Touchette, H. A network theory analysis of football strategies. ArXiv e-prints (June 2012).

[20] Lundberg, S. M., and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in neural information processing systems (2017), pp. 4765–4774.

[21] Molnar, C. Interpretable Machine Learning. Lulu.com, 2020.

[22] Pappalardo, L., and Cintia, P. Quantifying the relation between performance and success in soccer. Advances in Complex Systems 20, 4 (2017).

[23] Pappalardo, L., Cintia, P., Ferragina, P., Massucco, E., Pedreschi, D., and Giannotti, F. PlayeRank: Data-driven performance evaluation and player ranking in soccer via a machine learning approach. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 5 (Sept. 2019).

[24] Pappalardo, L., Cintia, P., Rossi, A., Massucco, E., Ferragina, P., Pedreschi, D., and Giannotti, F. A public data set of spatio-temporal match events in soccer competitions. Scientific data 6, 236 (2019), 1–15.

[25] Pappalardo, L., and Massucco, E. Soccer match event dataset, Feb 2019.

[26] Pedersen, A. V., Aksdal, I. M., and Stalsberg, R. Scaling demands of soccer according to anthropometric and physiological sex differences: a fairer comparison of men’s and women’s soccer. Frontiers in Psychology 10 (2019), 762.

[27] Rossi, A., Pappalardo, L., Cintia, P., Iaia, F. M., Fernàndez, J., and Medina, D. Effective injury forecasting in soccer with GPS training data and machine learning. PLOS ONE 13, 7 (07 2018), 1–15.

[28] Sakamoto, K., Hong, S., Tabei, Y., and Asai, T. Comparative study of female and male soccer players in kicking motion. Procedia Engineering 34 (2012), 206–211. Engineering of sport conference 2012.

[29] Sakellaris, D. The In-Game Comparison Between Male and Female Footballers. Statathlon (2017).

[30] Scardicchio, A. Storia e storie del calcio femminile. Lampi di Stampa, Milano, 2011.

A Supplementary Information

Supplementary Information 1: Intensity of Play

We split a match into possession phases, i.e., sequences of consecutive events in which only one team owns the ball [24]. An action begins when a team gains the ball and ends when one of the following occurs: the first or the second half of the match ends, the ball goes out of the field, or there is an offside or a foul [24]. Women’s matches include an event that is not present in men’s matches, the so-called cooling breaks, i.e., pauses in the game due to excessive heat; the algorithm recognizes them and treats them as an additional cause of the end of an action.

Average pass velocity.

The average pass velocity $PassV(g)$ in a match $g$ is the average time between two consecutive passes in which the receiver of the first pass is the player who makes the next pass to a teammate.

Average ball possession recovery time.

The average ball recovery time $RecT(g)$ is the average time elapsed between a team’s last recorded pass and the first new pass made by a player of the same team.

Shooting time.

The average shooting time $ShotV(g)$ is the average time between two shots of the same team. For example, in the men’s World Cup final, approximately 345 seconds passed on average between two shots by France, and about 281 seconds between two shots by Croatia.
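The time-based indicators above can be illustrated with a minimal Python sketch. The event layout `(time, team, kind)`, the toy data, and the function names are hypothetical and are not the schema of the data set in [24]. The recovery time is computed under one possible reading of the definition: the gap between a team’s last pass before an opponent pass and that team’s next pass.

```python
from statistics import mean

# Hypothetical event format: (time_in_seconds, team, kind), sorted by time.
# Illustrative toy data only, not taken from the paper's data set.
events = [
    (10.0, "A", "pass"),
    (14.0, "A", "pass"),
    (20.0, "B", "pass"),
    (26.0, "B", "shot"),
    (31.0, "A", "pass"),
    (40.0, "A", "shot"),
    (55.0, "B", "pass"),
    (70.0, "A", "pass"),
    (100.0, "A", "shot"),
]


def shooting_time(events, team):
    """Mean time between consecutive shots of `team` (ShotV-like)."""
    shots = [t for t, tm, kind in events if tm == team and kind == "shot"]
    gaps = [b - a for a, b in zip(shots, shots[1:])]
    return mean(gaps) if gaps else None  # undefined with fewer than two shots


def recovery_time(events, team):
    """Mean time from `team`'s last pass before an opponent pass to
    `team`'s next pass (RecT-like, under the assumption stated above)."""
    passes = [(t, tm) for t, tm, kind in events if kind == "pass"]
    gaps = []
    last_own = None  # time of the team's most recent pass
    lost = False     # True once an opponent pass has followed it
    for t, tm in passes:
        if tm == team:
            if lost and last_own is not None:
                gaps.append(t - last_own)
            last_own, lost = t, False
        elif last_own is not None:
            lost = True
    return mean(gaps) if gaps else None


print(shooting_time(events, "A"), recovery_time(events, "A"))
```

Both functions return `None` when the indicator is undefined for a team, for example when it took fewer than two shots.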

Average pass length.

We measure the average pass length $PassL(g)$ in a match $g$ as the average Euclidean distance between a pass’s starting and ending positions.

Supplementary Information 2: PlayeRank scores

The PlayeRank algorithm takes into account different types of events made by the players to compute the performance rating $r(u, g)$ of each player $u$ in a match $g$ [23]. Given a match $g$, PlayeRank describes the performance of a player $u$ in $g$ by a $p$-dimensional feature vector $Q_u^g = [x_1, \ldots, x_p]$, where each $x_j$, with $j = 1, \ldots, p$, is a feature describing a certain aspect of $u$’s behaviour during $g$. Some features are related to the number of specific events produced by $u$ in $g$ (e.g., passes, shots), others take into account the outcome of these events, e.g., whether or not they are accurate. The performance rating $r(u, g)$ of $u$ in $g$ is computed as:

$$ r(u, g) = \frac{1}{R} \sum_{j=1}^{p} w_j x_j \qquad (1) $$

where $w_j$ is the importance of feature $j$, $x_j$ the value of that feature, and $R$ a normalization constant. The weights $w_j$ are computed during a learning phase based on machine learning and consisting of two steps: feature weighting and role detector training [23]. Note that PlayeRank assigns every player to a role if they played at least of the matches in that role. Each role on the field is defined through a K-means clustering method implemented in the role detection step of the learning phase [23]. The performance rating $r(u, g)$ is combined with the number of goals scored using a goal weight α (set to α = 0. in our experiments). For example, Harry Kane (England), in the match against Panama, scored three goals and achieved a PlayeRank score of 0.59, demonstrating his centrality in the 6-1 victory. Similarly, the Australian champion Samantha Kerr, in the match against Jamaica, scored four times, resulting in a PlayeRank score of 0.80.

Supplementary Information 3: Team Indicators

H-indicator.

The H indicator summarizes different aspects of the passing behaviour of a team $T$ into a single value. All these aspects are related to the pass-based performance features, which are measured using a team’s passing network in a certain match $g$. First, we compute the average amount $\mu_p$ of passes managed by players in a team during a match and the standard deviation $\sigma_p$ of the amount of passes managed by players in a team during a match [5]. The higher $\sigma_p$, the higher the heterogeneity in the volume of passes managed by the players. Moreover, we consider the distribution of passes over the zones of the pitch by splitting the football pitch into 100 zones, each of size 11 m × 6.5 m, and computing the zone passing network, where nodes are zones of the pitch and edges represent the passes between two zones [5]. We take the average amount $\mu_z$ of passes managed by zones of the pitch during the match and the standard deviation $\sigma_z$ of the amount of passes managed by zones of the pitch during the match [5]. High values of $\sigma_z$ indicate the coexistence of hot zones with high passing activity and cold zones with low passing activity during the game. Low values of $\sigma_z$ indicate instead a more uniform distribution of passing activity across the zones of the pitch [5]. Finally, we combine these indicators by their harmonic mean to summarize the passing behavior of a team $T$ into the H indicator:

$$ H(T, g) = \frac{5}{1/w + 1/\mu_p + 1/\sigma_p + 1/\mu_z + 1/\sigma_z} \qquad (2) $$

where $w$ is simply the number of passes produced by the team $T$ in a match $g$.

Flow Centrality.

The team passing network allows measuring the centrality of each player within the network of passes. The team flow centrality derives from the player flow centrality [10], which we compute using an adapted version of the algorithm taken from [24]. The player flow centrality ranks each player based on their centrality in the network of passes in a certain match. Formally, it measures the current-flow betweenness centrality value of each node (recall that each node is a football player). The betweenness centrality captures a node’s role in allowing information to pass from one part of the network to the other. Technically, it measures the percentage of shortest paths that must go through the specific node. The important thing to know is that betweenness is a measure of how important the node is to the flow of information through a network [12]. In this context, it quantifies how central a player is in passing the ball from one side of the field to the other. The team flow centrality $FC_{avg}(T, g)$ is then defined as the average of the flow betweenness centrality values of the players of the same team $T$ in the matches they played. We also measure the variability $FC_{std}(T, g)$ in the passing flow centrality of a team in a match. High values of $FC_{std}(T, g)$ highlight that there are players who individually are at the center of a team’s passing behavior in a particular game $g$; low values of $FC_{std}(T, g)$, instead, depict an equilibrium between players of the same team in the passing flow centrality.

Supplementary Figure 8 shows two examples of passing networks and the corresponding H, FC, and PR values.
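As a rough illustration of the two team indicators in this section, the following Python sketch computes the harmonic mean of Eq. (2) and the team-level flow centrality summaries. It makes several assumptions: the function names and toy inputs are hypothetical, standard deviations are taken as population standard deviations, and the per-player centrality values are supplied as a precomputed dictionary (they could come, for instance, from networkx’s `current_flow_betweenness_centrality`, although the paper only states that the algorithm was adapted from [24]).

```python
from statistics import mean, pstdev


def h_indicator(w, player_passes, zone_passes):
    """Harmonic mean of w, mu_p, sigma_p, mu_z, sigma_z, as in Eq. (2).

    w             -- number of passes produced by the team in the match
    player_passes -- passes managed by each player (hypothetical input)
    zone_passes   -- passes managed by each pitch zone (hypothetical input)
    """
    terms = [
        w,
        mean(player_passes), pstdev(player_passes),  # mu_p, sigma_p
        mean(zone_passes), pstdev(zone_passes),      # mu_z, sigma_z
    ]
    if any(x == 0 for x in terms):
        return 0.0  # assumption: treat the harmonic mean as 0 if a term vanishes
    return len(terms) / sum(1.0 / x for x in terms)


def team_flow_centrality(player_centrality):
    """Return (FC_avg, FC_std) from a {player: centrality} mapping.

    FC_std is taken here as the population standard deviation, which is
    an assumption about how the variability is measured.
    """
    values = list(player_centrality.values())
    return mean(values), pstdev(values)


# Toy inputs, for illustration only.
h = h_indicator(20, [2, 4, 4, 4, 5, 5, 7, 9], [1, 3])
fc_avg, fc_std = team_flow_centrality({"p1": 0.1, "p2": 0.3})
print(round(h, 4), round(fc_avg, 4), round(fc_std, 4))
```

Because the harmonic mean is dominated by its smallest term, a team with very uniform passing across players or zones (small standard deviations) obtains a low H value in this sketch.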

Figure 7: Heatmaps describing the pitch zones from which the free-kick shots and the shots in motion were most frequently made by male and female players during their respective World Cup championships. They show the kernel estimate of the First Grade Intensity function $\lambda(s)$, where the event points $s_i$ are the free-kick shots ((a) and (b)) and the shots in motion ((c) and (d)), and the football field is the region of interest $R$. The darker the green, the higher the number of free-kick shots and shots in motion in a specific field zone. The pitch length (x) and width (y) are in the range [0, 100], which indicates the percentage of the field starting from the left corner of the attacking team.

Supplementary Information 4: Predictions on Full Matches

Tracing the classifier’s predictions for teams that competed in the same game helps verify that the results and the comments made previously are not simply due to chance. To do this, we consider different test sets (again based on the random state value with which the first training set was divided); on each set we compute the class predictions, and we isolate only the games with both teams within the test set. In particular, we use fifty different test sets.
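The selection step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample layout `(match_id, side)`, the number of matches, the test fraction of 0.2, and the function name are all hypothetical, and the classifier itself is omitted. The sketch only shows how fifty seeded splits can be scanned for matches whose two teams both land in the test set.

```python
import random
from collections import defaultdict

# Hypothetical samples: one (match_id, side) pair per team per match.
samples = [(m, side) for m in range(20) for side in ("home", "away")]


def full_matches_in_test(samples, test_fraction, seed):
    """Return the ids of matches whose two teams both fall in the test set."""
    rng = random.Random(seed)  # the seed plays the role of the random state
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    teams_per_match = defaultdict(int)
    for match_id, _ in test:
        teams_per_match[match_id] += 1
    return sorted(m for m, n in teams_per_match.items() if n == 2)


# Fifty different test sets, one per seed (test fraction is an assumption).
full_by_seed = {seed: full_matches_in_test(samples, 0.2, seed) for seed in range(50)}
print(sum(len(v) for v in full_by_seed.values()), "full matches across 50 test sets")
```

Predictions for the two teams of each selected match could then be compared side by side.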

Acknowledgments

We thank WyScout Spa for providing the match events and Daniele Fadda for his support on data visualization.

Funding

This research has been supported by the EU project H2020 SoBigData++ RI, grant

Authors’ contributions

LP directed the work, performed the statistical analysis, and wrote the paper. AR refined the statistical analysis and the classification experiments, and made the plots. GP conducted the statistical analysis and the classification experiments, made the plots, and wrote the paper. MN suggested experiments and checked the results. PC directed the work and wrote the paper.


Figure 8: Passing networks of the round-of-16 game of the France World Cup, Germany v. Nigeria, and of the Russia World Cup final, France v. Croatia. Each node represents a player, and its width is related to how many times teammates have passed the ball to that particular player. Formally, the width is related to the normalized weighted in-degree measure. The width of each edge, in turn, is weighted with respect to how many times the two players have passed the ball to each other. The highlighted players are those who received the highest percentage of passes from their teammates, i.e., the most sought after on the pitch during the match. The algorithm used to draw the networks was adapted from [24].

Panel values (female):
Germany: H-Ind = 0.85, PR = 0.05, PR std = 0.08, FC = 0.069, FC std = 0.0003
Nigeria: H-Ind = 1.07, PR = 0.003, PR std = 0.001, FC = 0.049, FC std = 0.0003

Panel values (male):
France: H-Ind = 1.42, PR = 0.08, PR std = 0.15, FC = 0.0566, FC std = 0.0002
Croatia: H-Ind = 0.85, PR = 0.06, PR std = 0.13, FC = 0.0630, FC std = 0.0007

Players labelled in the figure: Varane, Pogba, Griezmann (France); Brozovic, Modric, Rakitic (Croatia); Doorsoun-Khajeh, Hegering, Magull (Germany); Ebi, Ohale, Ayinde (Nigeria).


Figure 9: ROC curves for the implemented classifiers (x-axis: false positive rate; y-axis: true positive rate). Legend: Baseline (AUC = 0.52), Logistic (AUC = 0.94), Decision Tree (AUC = 0.89), Random Forest (AUC = 0.95), Adaboost M1 (AUC = 0.96). The curves trace out the true positive rate and the false positive rate as the probability threshold changes, i.e., the threshold beyond which an observation is assigned to class 1 (male team). When the true positive rate and the false positive rate are both 0, the threshold is 1 (all the observations are classified as class 0) [17, p. 147]. In this case, the true positive rate is the percentage of male teams correctly classified and the false positive rate is the percentage of female teams mistaken as male, using a given threshold. The actual thresholds are not shown. The AUC represents the area under the curve; the larger the AUC, the better the classifier [17, p. 147]. Random Forest and Adaboost M1 show the best predictive performance.
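To make the threshold sweep behind a ROC curve concrete, here is a minimal Python sketch that builds the curve and its AUC from predicted scores. The labels, the scores, and the function names are hypothetical toy values and are not the paper’s results. Class 1 stands for a male team, as in the caption above, and the AUC is obtained with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs as the threshold sweeps from high to low."""
    pos = sum(y_true)            # number of class-1 (male) samples
    neg = len(y_true) - pos      # number of class-0 (female) samples
    points = [(0.0, 0.0)]        # threshold above every score: nothing is class 1
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: 1 = male team, 0 = female team (illustrative values only).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points, auc(points))
```

In this toy example eight of the nine male-female pairs are ranked correctly by the scores, so the AUC equals 8/9.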

Male Shots & Free Kicks: Z1 56%, Z2 43%, Z3 1%

Female Shots & Free Kicks: Z1 60%, Z2 38%, Z3 2%