Soccer Event Detection Using Deep Learning
Ali Karimi, Ramin Toosi, Mohammad Ali Akhaee
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran.
Abstract—Event detection is an important step in extracting knowledge from video. In this paper, we propose a deep learning approach to detect events in a soccer match, emphasizing the distinction between images of red and yellow cards and the correct separation of the selected events from other images. The method comprises three modules: i) a variational autoencoder (VAE) module to differentiate between soccer images and other images, ii) an image classification module to classify the images of events, and iii) a fine-grain image classification module to distinguish the images of red and yellow cards. Additionally, a new dataset is introduced for soccer image classification and is employed to train the networks described in the paper. Finally, 10 UEFA Champions League matches are used to evaluate the networks' performance and precision in detecting the events. The experiments demonstrate that the proposed method achieves better performance than state-of-the-art methods.
Index Terms—Event detection, neural networks, soccer video
I. INTRODUCTION

Soccer is among the most popular sports in the world. The attractiveness of this sport has gathered many spectators, and various studies are being performed to grow and assist the sport and to meet the needs of soccer clubs and the media. These studies mainly focus on estimating team tactics [1], tracking players [2] or the ball on the field [3], detecting events occurring in the match [4], [5], [6], [7], [8], summarizing the soccer match [9], [10], [11], and estimating ball possession statistics [12]. They are carried out using various methods and techniques, including machine learning (ML).

Artificial intelligence (AI), and more specifically ML, can assist in conducting the above-mentioned research in order to achieve better and more intelligent results. ML itself can be implemented in a variety of ways, including deep learning (DL). Distinguishing the events of a soccer match is one of the active research fields related to soccer. Today, various AI methods are utilized to detect events in a soccer game [4], [7], and their use in this area can help to achieve higher detection accuracy.

Detecting the events of a soccer match has several applications. Event detection, for instance, can help obtain the statistics of a match. Counting the number of free kicks, fouls, tackles, etc. in a soccer game can be done manually; however, manual annotation is not only costly and time consuming but also error-prone. With intelligent systems based on event detection, such statistics can be calculated and used automatically. Another application of event detection is the summarization of a soccer match [13]. The summary of a soccer match includes the important events of the match, and to prepare a useful summary, the events should be correctly identified.
Using a method that identifies the events with high accuracy can improve the quality of the summarization task.

The purpose of the current study is to detect events in soccer matches. In this regard, various deep learning architectures have been developed. (Corresponding author: Mohammad Ali Akhaee; email: [email protected].)

When using deep learning methods, there is always a need for training datasets. To this end, a rich visual dataset is first collected for our task, covering penalty kicks, corner kicks, free kicks, tackles, substitutions, and yellow and red cards, together with images of the sides and center of the field. This image dataset is then used to train the proposed networks. Solving the event detection problem encounters some challenges, including the similarity of some images, such as yellow and red cards, which makes it difficult to separate the images of these two groups from each other. This causes the classifier to have trouble distinguishing between yellow and red cards, resulting in incorrect detections.

Another problem in detecting events in a soccer match is the issue of no-highlight frames. Not all events happening in a soccer match can be assigned to the specific events given to the network for training; some scenes that were never fed to the network during training may occur in a match. In such cases, the network must be able to handle these images or videos and not mistakenly categorize them as one of the given events. We solve the no-highlight detection problem by using a VAE, setting a threshold, and using additional classes in the image classification module.

The proposed method for detecting soccer match events uses two convolutional neural networks (CNNs) and a variational autoencoder (VAE).
The current study focuses on resolving the problem of no-highlight frames, which may be wrongly considered as one of the events, as well as the problem of similarity between red and yellow card frames. Experiments demonstrate that the proposed method improves the accuracy of image classification from 88.93% to 93.21%. The main reason for this increase is the use of a new fine-grain classification network to classify yellow and red cards. The proposed method also detects no-highlight frames with high precision. All datasets and implementations are publicly available at https://github.com/FootballAnalysis/footballanalysis.

The rest of the paper is organized as follows. Section II provides a literature review and examines the drawbacks of previous works in this area. Section III describes the proposed algorithm and the structure and mechanism of its modules. Section IV introduces the datasets collected for this study. Section V presents the experimental results and compares them with those of other papers. Eventually, Section VI concludes the paper.

II. RELATED WORK
The research of Duan et al. [14] is one of the first works in this area; it implements supervised learning for top-down video shot classification. The authors in [15] present a method for detecting soccer events using a Bayesian network. These early methods suffered from low accuracy until methods based on DL were proposed. Presenting a method based on a convolutional network and an LSTM network, Agyeman et al. [9] summarize a soccer match based on event detection, considering five events: corner kicks, free kicks, goal scenes, centerline scenes, and throw-ins. Their study uses the 3D-ResNet34 architecture as the convolutional network. One of the problems with this work is that the number of events is limited and no-highlight frames are not taken into account. Jiang et al. [8] first perform feature extraction using a convolutional network and then detect events by combining it with an RNN model. This method is limited to four events: goal, goal attempt, corner, and card. Sigari et al. [16] employ a fuzzy inference system; their algorithm works based on replay detection, logo detection, view type recognition, and audience excitement, and is also limited to three events: penalty, corner, and free kick. Eleven events are classified in [17], covering a good number of events; however, this method is not capable of distinguishing between the red and yellow card events because of their high similarity. In general, the methods presented in this field work on either images, videos [9], [8], or audio signals [18], [19]. Nonetheless, some methods employ two signals, i.e., audio and video, simultaneously [10], [16]. Recently presented methods utilize DL architectures as the main tool for feature extraction. Among the DL architectures suggested for feature extraction, the closest flagship architecture is EfficientNet [20].
This architecture is presented in eight different versions. Its variants generally offer higher performance than previous models [21], [22], [23], with fewer parameters and a smaller memory footprint.

One of the challenges of the event detection problem in soccer matches, which has not been addressed adequately in the literature, is events that are very similar in appearance but are nonetheless separate. For instance, in the images of yellow and red cards, only the color of the card differs and the other parts of the image are the same. Although both may appear to be the same card-taking operation, in a soccer game these two events impact the course of the match very differently. In the literature, both yellow and red card events are considered as one event [15], which causes problems for event detection. The reason is the very high similarity of the images of these two events, which makes it very challenging to distinguish between them. In this paper, fine-grained image classification is used instead of common feature extraction architectures to detect such events.

Fine-grained image classification is one of the challenges in machine vision; it categorizes images that fall into a similar category but not into the same subcategory [24]. For example, tasks such as face recognition or distinguishing different breeds of dogs, birds, etc. involve subcategories with many structural similarities that are nonetheless distinct. As another example, the California gull and the ring-billed gull are two similar birds, differing mainly in beak pattern, that belong to two separate subclasses. The main problem with this type of classification is that such differences are usually subtle and local, and finding the discriminating regions of two subcategories is the challenge faced by these methods.
The work of Lin et al. [25] is one of the deep-learning-based studies in the field of fine-grained image classification. In this model, two neural networks are used simultaneously; the outer product of their outputs is mapped to a bilinear vector, and a softmax layer finally assigns the class. The accuracy of this method on the CUB-200-2011 dataset [26] is 84.1%, whereas using only a single neural network in this architecture yields at most 74.7%. Fu et al. [27] introduce a recurrent attention CNN framework, which receives the image at its original size and passes it through a classification network, thereby extracting the probability of its placement in each category. At the same time, after the convolutional layers of the classifier, an attention proposal network produces region parameters that are used to zoom in on the image and crop it. The resulting image is fed into a network like the original image, and another attention proposal network re-extracts a further part of the new image. This method reaches an accuracy of 85.3% on the Birds dataset. In another work, the authors in [28] propose a multi-attention convolutional neural network framework with an accuracy of 86.5%. In 2018, Sun et al. [29] presented a new attention-based CNN that learns multiple attention region features per image through the one-squeeze multi-excitation (OSME) module and then applies the multi-attention multi-class constraint (MAMC). Thanks to this structure, the method improves on the accuracy of previous methods to some extent. One of the latest methods, presented in [30], consists of three steps. In the first step, the instances are detected and instance segmentation is carried out. In the second step, a set of complementary parts is created from the original image.
In the third step, a CNN is applied to each image obtained, and the outputs of these CNNs are given to LSTM cells over all images. The accuracy of the best model in this method on the Birds dataset reaches 90.4%. In general, the limitation of the methods presented in this section is the accuracy they achieve, and newer methods attempt to improve it.

Another issue is that a soccer match can include various scenes that are not necessarily a specific event, such as scenes
Fig. 1: Generic block diagram of the proposed algorithm

from a soccer match where players are walking, or moments when the game is stopped. If such images are applied as input to the classification network, the network will mistakenly place them in one of the defined categories. The reason is that the network is trained only to categorize images among events; this is called a traditional classification network [31]. In such classifiers, only the known classes are used during training, and only known-class images should be given to the network during testing. Otherwise, the network will have trouble detecting the image category: even though the image should not be placed in any of the categories, it will be placed incorrectly in one of them. This type of categorization does not suffice for the problem under study because, as explained, the input images may not belong to any of the categories. Thus, we need a network that assigns an input image to one of the seven categories if it belongs to one of them, and otherwise rejects it rather than mistakenly placing it in a defined category. In other words, open set recognition is required to address this problem [31].

Open set recognition can be implemented using different methods. Cevikalp et al. [32] perform it based on support vector machines (SVMs). The works in [33] and [34] are based on deep neural networks. Today, the use of generative models in this area is reaching its pinnacle; they are divided into two categories: instance generation and non-instance generation [31]. Finding a suitable method is still challenging, and the literature on event detection has not addressed this issue profoundly.

III. PROPOSED METHOD
This section describes the proposed method. Initially, the general procedure is explained; then the three main parts of the method are introduced: an image classification module for detecting the images of the defined events, a fine-grain classification module used to classify yellow and red card images, and a variational autoencoder for detecting no-highlight frames.

A. The Proposed Algorithm
As depicted in Fig 1, the received video is first split into frames according to the video length and frame rate, and each frame is passed separately through a variational autoencoder. If the loss value of the VAE network for the input frame is smaller than a specified value, the received frame is considered an event frame and given to the image classification module. The image classification module classifies the images into nine classes. If the input image belongs to one of the categories center circle, right penalty area, or left penalty area, it is not categorized as an event and is classified as no highlight. If it belongs to one of the five events penalty kick, corner kick, tackle, free kick, or substitution, it is recorded as an event. If the event is a card event, the image is given to the fine-grain classification module, which determines whether it is a yellow or a red card. Finally, to detect events in a soccer match, each event is evaluated over every 15 consecutive frames (seven frames before, seven frames after, and the current frame); if more than half of the frames belong to an event, those 15 frames (half a second) are tagged with that event. Moreover, an event cannot be repeated more than once within 10 s; if it recurs, only one occurrence is counted in the number of events. In the following, each module is described in detail.
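The frame-level voting and 10 s de-duplication described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function and variable names are ours, and a 30 fps rate is assumed (consistent with 15 frames being half a second).

```python
from collections import Counter

def aggregate_events(frame_labels, fps=30, window=15, dedup_s=10):
    """Majority-vote a sliding 15-frame window (7 before, 7 after, current)
    and suppress repeats of the same event within `dedup_s` seconds.
    `frame_labels` is a per-frame list of event names or None (no highlight)."""
    events = []      # (frame_index, event) pairs
    last_seen = {}   # event -> frame index of the last accepted detection
    half = window // 2
    for i in range(half, len(frame_labels) - half):
        window_labels = [l for l in frame_labels[i - half:i + half + 1] if l]
        if not window_labels:
            continue
        label, count = Counter(window_labels).most_common(1)[0]
        if count <= window // 2:      # need more than half of the 15 frames
            continue
        if label in last_seen and (i - last_seen[label]) < dedup_s * fps:
            continue                  # same event within 10 s: count once
        last_seen[label] = i
        events.append((i, label))
    return events
```

For example, two bursts of corner-kick frames separated by more than 10 s of no-highlight frames would be counted as two distinct corner events.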
B. No Highlight Detection Module (Variational Autoencoder)

A soccer match can include various scenes that are not necessarily a specific event, such as scenes where the director is showing the faces of the players either on the field or on the bench, or where the players are walking and the game is stopped. These scenes are not categorized as events of a soccer match. In general, to separate the images of the defined events from the rest of the images, three complementary actions must be performed to detect no-highlight frames:
1) The use of the VAE network to identify whether the input images are similar to the soccer event (SEV) dataset images.
2) The use of three additional categories, that is, left penalty area, right penalty area, and center circle, in the image classification module, given that most free kick images are similar to images from these categories (without these categories, images of the wings of the field would usually get a good score in the free-kick category).
3) Applying the best threshold on the prediction value of the last layer of the EfficientNetB0 feature extraction network.
The second and third actions are applied in the image classification module and are described later. In the first action, the VAE architecture shown in Fig 2 is employed to identify images that do not fall into any of the event categories. The VAE is trained with the loss L(ỹ, y) = −ln p(y | z̃) + KL(q(z | y) ‖ p(z)).

Fig. 2: No highlight detection module architecture (VAE)

Fig. 3: Image classification module architecture

To this end, all images of the soccer training dataset are given to the VAE network for training; then, using the reconstruction loss and a threshold on it, images whose reconstruction loss is higher than the fixed threshold are not considered soccer images. Images with a reconstruction loss below the threshold are categorized as soccer game images and are then given to the image classification module for classification. In other words, this VAE plays the role of a two-class classifier that puts soccer images in one category and non-soccer images in another. The reconstruction loss is obtained from the difference between the input image and the reconstructed image: the further the input images are from the training distribution, the higher the reconstruction loss will be, whereas images from the same distribution yield a smaller error.
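At test time, this module reduces to thresholding the per-image reconstruction error. A minimal numpy sketch follows; the function names and the sum-of-squared-errors form are our assumptions, while the 328 threshold is the value reported in Section V. Here `x_hat` stands in for the trained VAE's reconstruction of the batch `x`.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Per-image sum of squared errors between input and reconstruction."""
    return ((x - x_hat) ** 2).reshape(len(x), -1).sum(axis=1)

def is_soccer_image(x, x_hat, threshold=328.0):
    """Images reconstructed well (loss below the threshold) are treated as
    soccer images; the rest are rejected as no-highlight frames."""
    return reconstruction_loss(x, x_hat) < threshold
```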
C. Image Classification Module
The EfficientNetB0 architecture, shown in Fig 3, is used to categorize images. This network classifies images into nine classes. For an image to be placed in one of these classes, its prediction value at the last layer must be higher than the threshold value, which is set to 0.9; otherwise it is selected as a no-highlight frame. If the output of this network is left penalty area, right penalty area, or center circle, the image is not treated as an event of the soccer match but is included in the no-highlight category. However, if it is one of the categories penalty kick, corner kick, tackle, free kick, or substitution, the event is finalized and the decision is made. Finally, if the image is classified in the card category, it is given to the fine-grain classification module, where the color of the card is determined.
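This decision rule can be sketched as below. The class-name spellings and the ordering of the nine softmax outputs are our assumptions; the 0.9 threshold and the rejection of the three auxiliary field-view classes follow the text.

```python
import numpy as np

EVENT_CLASSES = {"penalty kick", "corner kick", "tackle", "free kick",
                 "to substitute", "card"}
AUX_CLASSES = {"center circle", "left penalty area", "right penalty area"}
CLASS_NAMES = sorted(EVENT_CLASSES | AUX_CLASSES)  # 9 network outputs

def decide(probs, threshold=0.9):
    """Map the 9-way softmax output to an event name, 'no highlight', or
    'card' (which would then go to the fine-grain module)."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "no highlight"      # low-confidence prediction: reject
    name = CLASS_NAMES[best]
    if name in AUX_CLASSES:
        return "no highlight"      # field-view classes are not events
    return name                    # one of the six event outputs
```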
D. Fine-grain Classification Module
The only difference between the red and yellow cards is the color of the card; otherwise, there is no difference in their images. Thus, both are in the card category but belong to separate subcategories. Since the main classifier does not distinguish these two categories well, the two cards are merged into one category in the training phase of the image classification module, and a separate subclassifier focusing on the details is used to separate the yellow and red cards. The final architecture employed in this section can be seen in Fig 4.

Fig. 4: Fine-grain classification module architecture

Here, the architecture provided in [29] is exploited, except that instead of the ResNet50 architecture used in [29], the EfficientNetB0 architecture is employed, and the network is trained using the yellow and red card data. The inputs to this network are the images that were categorized as cards in the image classification module; the output determines whether these images belong to the yellow or the red card category.

IV. DATASETS
In this paper, two datasets, namely the soccer event (SEV) dataset and the test event dataset, have been collected. The collection was done in two ways:
1) By crawling Internet websites, images of different games were collected. These images are unique and are not consecutive frames.
2) Using videos of soccer games of the last few years of the prestigious European leagues, images related to events were extracted.
A. ImageNet
The ImageNet dataset [35] is employed in the proposed method for the initial weighting of the EfficientNetB0 network.
B. SEV Dataset
This dataset includes images of soccer match events collected specifically for this study. In the SEV dataset, a total of 60,000 images were collected in 10 categories. The images of this dataset, as described, were collected in two different ways. Seven of the ten categories are related to the soccer events defined in this paper, and the remaining categories are used for no-highlight detection, so that their images are not mistakenly included in the seven main categories. Table I shows how the SEV dataset is divided into train, validation, and test sets. Samples of the SEV dataset are shown in Fig 5.
C. Test Event Dataset
The test event dataset is used to evaluate the proposed method; samples are shown in Fig 6. The dataset consists of three classes. The first class contains images of the events selected from the SEV dataset, with an equal number of images (200 instances from each category) selected from each defined event category. The second class includes other soccer images in which none of the seven events appear. The third class includes images that are not related to soccer at all. Details of the number of images in this dataset are given in Table II.

V. EXPERIMENT AND PERFORMANCE EVALUATION
A. Training
All three networks are trained independently; an end-to-end method is not employed. The training of the VAE network, the 9-class image classification network, and the yellow and red card classifier network are explained in the following subsections, respectively.
1) VAE
To train this network, the images of the seven events defined in the SEV dataset are selected and given to the VAE network as training data. The test and validation data of these seven categories are used to evaluate the network. The simulation parameters used in this network are summarized in Table III. As illustrated in Fig 7, the value of the loss curve decreases over successive epochs for the validation data.
2) Image Classification (9 Class)
The image classification network is first trained on the ImageNet image collection with input dimensions of 224×224×3. Then, using transfer learning, the network is re-trained on the SEV dataset and fine-tuned. The input images of the SEV dataset also have dimensions of 224×224×3. The network is trained for 20 epochs with the simulation parameters specified in Table IV. The yellow and red card classes are merged, and their 5,000 images are used for network training together with the images of the other SEV dataset classes.
Fig. 5: Samples of SEV dataset

TABLE I: Statistics of SEV dataset
Fig. 6: Samples of test event dataset

TABLE II: Statistics of test event dataset
Class name                                               Image instances
Soccer events (defined events)                           1400
Other soccer events (throw-in, goal kick, offside, ...)  1400
Other images (not related to football)                   1400
Sum                                                      4200
TABLE III: Simulation parameters of the variational autoencoder
Parameter           Value
Optimizer           Adam
Loss function       Reconstruction loss + KL loss
Performance metric  Loss
TABLE IV: Simulation parameters of the image classification module
Parameter           Value
Optimizer           Adam
Loss function       Categorical cross-entropy
Performance metric  Accuracy
Total classes       9 (red and yellow card classes merged)
Augmentation        Scale, rotate, shift, flip
Batch size          16
TABLE V: Simulation parameters of the fine-grain image classification module
Parameter           Value
Optimizer           Adam
Loss function       Categorical cross-entropy
Performance metric  Accuracy
Total classes       2 (red and yellow card classes)
Augmentation        Scale, rotate, shift, flip
Batch size          16
3) Fine-grain image classification
The network shown in Fig 4 is trained using the two classes of red card and yellow card from the SEV dataset. The data of each category are partitioned into train, test, and validation sets with 5000, 500, and 500 images, respectively. The simulation parameters used in this network are given in Table V.
B. Evaluation Metrics
Different metrics are used to evaluate the networks. To evaluate and compare the different image classification architectures of the EfficientNetB0 and fine-grain module networks, accuracy is used as the main metric; recall and F1-score are used to determine the appropriate threshold value for the EfficientNetB0 network, and precision is used to evaluate the performance of the proposed method in detecting events. The accuracy metric determines how accurately the trained model predicts and, as described in this paper, is used to compare different architectures and hyperparameters in the EfficientNetB0 and fine-grain module networks.
Accuracy = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN)   (1)

The F1-score considers both recall and precision together; its value is one in the best case and zero in the worst case.

F1 = 2 / (1/Precision + 1/Recall) = TP / (TP + (FP + FN)/2)   (2)

Precision determines how accurate the model is when making a prediction. This metric has been used as a criterion in selecting the appropriate threshold.

Precision = TP / (TP + FP)   (3)

Recall refers to the percentage of actual positive instances that are correctly identified.
Recall = TP / (TP + FN)   (4)
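Eqs. (1) through (4) can be checked with a small helper (the function name is ours, not part of the paper's code):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from the entries of a binary
    confusion matrix, following Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))  # equals 2 / (1/precision + 1/recall)
    return accuracy, precision, recall, f1
```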
C. Evaluation
The various parts of the proposed method have been evaluated to achieve the best model for detecting events in a soccer match. In the first step, the algorithm should be able to classify the images of the defined events correctly. In the next step, the network is examined to see how well it detects no-highlight frames, and the best possible model is selected. Eventually, the performance of the proposed algorithm on soccer videos is examined and compared with other state-of-the-art methods.
1) Classification evaluation
The image classification module is responsible for classifying images. To test and evaluate this network, as shown in Fig 7, different architectures were trained and different hyperparameters were tested for these models. As shown in Fig 7a, the EfficientNetB0 model has the best accuracy among the models. If, in the above model, we divide the card images into the two categories of yellow and red cards and give the dataset to the network in the form of the same 10 categories of the SEV dataset for training, the accuracy on the test data is reduced from 94.08% to 88.93%. The reason is the interference of yellow and red card predictions in this model. Consequently, yellow and red card detection has been assigned to a subclassifier, and only the merged card category is used in the EfficientNetB0 model. For the subclassification of yellow and red card images, various fine-grained methods were evaluated; the results are shown in Table VI. The bilinear CNN method [25] using the EfficientNet architecture achieves 66.86% accuracy and

(a) Architecture  (b) Batch size  (c) Optimizer  (d) Learning rate
Fig. 7: Comparing the results on hyperparameters and architecture for validation accuracy (image classification module)

TABLE VI: Comparison between fine-grain image classification models
Method name                                 Epoch  Acc (red and yellow card) (%)  Acc (CUB-200-2011) (%)
B-CNN [25] [20]                             60     66.86                          84.01
OSME + MAMC [29] using ResNet50             60     61.70                          86.2
OSME + MAMC [29] using EfficientNetB0 [20]  60     79.90                          –

the OSME method that employs the EfficientNet architecture reaches 79.90% accuracy, which is higher than the other methods. Nonetheless, the main architecture (EfficientNetB0) used in the image classifier shows 62.02% accuracy, a difference of 17.88%. Table VII compares the accuracy of combining the image classification module and the fine-grain classification module with that of other models. As shown in Table VII, the accuracy of the proposed method for image classification is 93.21%, the best among all models. The proposed method is also faster than the other models except MobileNet and MobileNetV2. To demonstrate this, the execution time of each model was measured on 1400 images, and the average run time is reported as the mean inference time in Table VII. As shown in Table VIII, the problem of card overlap is also solved, and the proposed method separates the two card categories almost well.
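The mean inference time could be measured along these lines. This is a hedged sketch, not the paper's benchmarking code: `model_fn` stands in for a trained model's predict call, and the paper averages over 1400 images.

```python
import time

def mean_inference_time(model_fn, inputs):
    """Average wall-clock time of `model_fn` over a batch of inputs,
    mirroring how a mean inference time per image could be obtained."""
    start = time.perf_counter()
    for x in inputs:
        model_fn(x)
    return (time.perf_counter() - start) / len(inputs)
```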
2) Known or Unknown Evaluation
To determine the threshold value on the output of the last layer of the EfficientNetB0 network, different threshold values were tested and evaluated according to Table IX; the value 0.90, which gives the highest F1-score for detecting the events and also the best recall to detect
TABLE VII: Comparison between image classification models on SEV dataset
Method name                                     Accuracy (%)  Mean inference time (second)
VGG16 [21]                                      83.21         0.510
ResNet50V2 [36]                                 84.18         0.121
ResNet50 [23]                                   84.89         0.135
InceptionV3 [37]                                84.93         0.124
MobileNetV2 [38]                                86.95         0.046
Xception [39]                                   86.97         0.186
NASNetMobile [40]                               87.01         0.121
MobileNet [41]                                  88.48         0.048
InceptionResNetV2 [42]                          88.71         0.252
EfficientNetB7 [20]                             88.92         1.293
EfficientNetB0 [20]                             88.93         0.064
DenseNet121 [43]                                89.47         0.152
Proposed method (image classification section)  93.21
TABLE VIII: Confusion matrix of the proposed algorithm (image classification section)
Actual \ Predicted  Center  Corner  Free kick  Left PA  Penalty  R Card  Right PA  Tackle  To Sub.  Y Card
Center              0.994   0       0.006      0        0        0       0         0       0        0
Corner              0       0.988   0.006      0        0        0.004   0         0       0.002    0
Free kick           0.004   0.004   0.898      0.03     0        0.002   0.05      0.006   0.004    0.002
Left Penalty Area   0.002   0       0.036      0.958    0.004    0       0         0       0        0
Penalty kick        0       0       0          0.002    0.972    0       0.026     0       0        0
Red cards           0       0.008   0.03       0        0        0.862   0         0.002   0.012    0.086
Right Penalty Area  0       0       0.01       0.002    0        0       0.988     0       0        0
Tackle              0.012   0.014   0.02       0.02     0.002    0.006   0.028     0.876   0.004    0.018
To Substitute       0       0       0          0        0        0.002   0.002     0       0.994    0.002
Yellow cards        0       0.022   0.01       0        0        0.131   0         0.002   0.006    0.829
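Since the test classes are equally sized (500 images per class, Section IV), the overall accuracy follows directly from the diagonal of a row-normalized confusion matrix like Table VIII. A small illustrative helper (the equal-class-size assumption is ours to state explicitly):

```python
import numpy as np

def overall_accuracy(confusion):
    """Overall accuracy from a row-normalized confusion matrix under
    equally sized classes: the mean of the per-class recalls (diagonal)."""
    confusion = np.asarray(confusion, dtype=float)
    return float(np.mean(np.diag(confusion)))
```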
TABLE IX: Comparison between different threshold values (image classification module)
Threshold  F1-score (event images) (%)  Recall (images not related to football) (%)
0.99       86.6                         95.3
0.98       88.2                         94.2
0.97       89.2                         93.8
0.95       90.5                         93.1
0.90       91.8                         92.4
0.80       92.4                         76.9
0.50       95.2                         51.2

no-highlight images, was determined as the threshold for the last layer of the network. The threshold value for the loss of the VAE network was also examined with different values; as shown in Fig 8, the value of 328 as the loss threshold gives the best distinction between the categories. To test how the network detects no-highlight images and the defined event images, the test dataset is used. Using this dataset reveals how many main events are incorrectly classified by the proposed method as events not related to soccer, how many soccer-related images are classified as the main events or placed in the no-highlight category, and how many images of real events are categorized correctly into the right events or excluded from the main events. The results of this evaluation are provided in Table X.
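The selection rule described above, highest event F1-score while keeping good recall on no-highlight images, can be sketched against Table IX's values. The 90% recall floor is an illustrative assumption of ours; the paper does not state an explicit cutoff.

```python
def pick_threshold(rows, min_recall=90.0):
    """Choose the threshold with the highest event F1-score among those
    whose no-highlight recall stays at or above `min_recall`.
    `rows` holds (threshold, f1, recall) triples as in Table IX."""
    feasible = [(f1, r, t) for t, f1, r in rows if r >= min_recall]
    best = max(feasible)   # tuple comparison: highest F1 wins
    return best[2]

# (threshold, F1-score on event images %, recall on non-soccer images %)
TABLE_IX = [(0.99, 86.6, 95.3), (0.98, 88.2, 94.2), (0.97, 89.2, 93.8),
            (0.95, 90.5, 93.1), (0.90, 91.8, 92.4), (0.80, 92.4, 76.9),
            (0.50, 95.2, 51.2)]
```

Under this criterion the 0.90 threshold is selected, matching the choice in the text.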
3) Final Evaluation
Ten soccer matches were downloaded from the UEFA Champions League, and the proposed method was applied to detect their events. In this evaluation, the events occurring in each soccer game are examined; in other words, the numbers of events correctly and incorrectly detected by the network have been determined. Details of the results are given in Table XI and compared with state-of-the-art methods.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, two novel datasets for soccer event detection have been presented. One is the SEV dataset, including 60,000 images in 10 categories, seven of which are related to soccer events and three to soccer scenes; these were used in training the image classification networks. The images of this dataset were taken from the top five leagues in Europe and the European Champions League. The other dataset is the test event dataset, which contains 4,200 images in three categories: the first consists of the events mentioned in the paper, the second comprises other images of a soccer match apart from the first-category events, and the third includes images off the soccer field. This dataset was exploited to examine the network's power in detecting and distinguishing between highlight and no highlight images.
Furthermore, a method for soccer event detection is proposed. The proposed method employs the EfficientNetB0 network in the image classification module to detect events
Fig. 8: Reconstruction loss of the VAE
TABLE X: The precision of the proposed algorithm
Class                Sub-class      Precision
Soccer Events        Corner kick    0.94
—                    Free kick      0.92
—                    To Substitute  0.98
—                    Tackle         0.91
—                    Red Card       0.90
—                    Yellow Card    0.91
—                    Penalty Kick   0.93
Other soccer events  —              0.86
Other images         —              0.94
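The per-class precision values in Table X follow the standard definition: true positives over all predicted positives. A minimal sketch, using hypothetical counts rather than the authors' data:

```python
# Sketch: per-class precision as reported in Table X,
# computed from hypothetical detection counts.

def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# e.g. 94 correct corner-kick detections out of 100 predicted corners
print(round(precision(94, 6), 2))  # 0.94
```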
TABLE XI: Precision of the proposed and state-of-the-art methods on 10 soccer matches (%)
Event name     Proposed method   BN [15]   Jiang et al. [8]
Corner kick    94.16             88.13     93.91
Free kick      83.31             –         –
To Substitute  97.68             –         –
Tackle         90.13             81.19     –
Red Card       92.39             –         –
Yellow Card    92.66             –         –
Card           –                 88.83     93.21
Penalty Kick   88.21             –         –

in a soccer match. Also, the fine-grain image classification module was used to differentiate between red and yellow cards. Without this module, red and yellow cards would have been categorized in the image classification module, and the differentiation accuracy would have been 88.93%; the fine-grain image classification module increased this accuracy to 93.21%. To address the network's difficulty in predicting images other than those of the defined events, a VAE was employed with an adjusted threshold value, and several images other than those of the defined events were used to achieve a better distinction between the images of the defined events and other images.

REFERENCES
[1] G. Suzuki, S. Takahashi, T. Ogawa, and M. Haseyama, "Team tactics estimation in soccer videos based on a deep extreme learning machine and characteristics of the tactics," IEEE Access, vol. 7, pp. 153238–153248, 2019.
[2] M. Manafifard, H. Ebadi, and H. A. Moghaddam, "A survey on player tracking in soccer videos," Computer Vision and Image Understanding, vol. 159, pp. 19–46, 2017.
[3] P. Kamble, A. Keskar, and K. Bhurchandi, "A deep learning ball tracking system in soccer videos," Opto-Electronics Review, vol. 27, no. 1, pp. 58–69, 2019.
[4] Y. Hong, C. Ling, and Z. Ye, "End-to-end soccer video scene and event classification with deep transfer learning," in . IEEE, 2018, pp. 1–4.
[5] B. Fakhar, H. R. Kanan, and A. Behrad, "Event detection in soccer videos using unsupervised learning of spatio-temporal features based on pooled spatial pyramid model," Multimedia Tools and Applications, vol. 78, no. 12, pp. 16995–17025, 2019.
[6] A. Khan, B. Lazzerini, G. Calabrese, and L. Serafini, "Soccer event detection," in . AIRCC Publishing Corporation, 2018, pp. 119–129.
[7] M. Z. Khan, S. Saleem, M. A. Hassan, and M. U. G. Khan, "Learning deep c3d features for soccer video event detection," in . IEEE, 2018, pp. 1–6.
[8] H. Jiang, Y. Lu, and J. Xue, "Automatic soccer video event detection based on a deep neural network combined cnn and rnn," in . IEEE, 2016, pp. 490–494.
[9] R. Agyeman, R. Muhammad, and G. S. Choi, "Soccer video summarization using deep learning," in Information Processing and Retrieval (MIPR). IEEE, 2019, pp. 270–273.
[10] M. Sanabria, F. Precioso, and T. Menguy, "A deep architecture for multimodal summarization of soccer games," in Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, 2019, pp. 16–24.
[11] M. Rafiq, G. Rafiq, R. Agyeman, S.-I. Jin, and G. S. Choi, "Scene classification for sports video summarization using transfer learning," Sensors, vol. 20, no. 6, p. 1702, 2020.
[12] S. Sarkar, A. Chakrabarti, and D. Prasad Mukherjee, "Generation of ball possession statistics in soccer using minimum-cost flow network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[13] H. M. Zawbaa, N. El-Bendary, A. E. Hassanien, and T.-h. Kim, "Event detection based approach for soccer video summarization using machine learning," International Journal of Multimedia and Ubiquitous Engineering, vol. 7, no. 2, pp. 63–80, 2012.
[14] L.-Y. Duan, M. Xu, Q. Tian, C.-S. Xu, and J. S. Jin, "A unified framework for semantic shot classification in sports video," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1066–1083, 2005.
[15] M. Tavassolipour, M. Karimian, and S. Kasaei, "Event detection and summarization in soccer videos using bayesian network and copula," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 2, pp. 291–304, 2013.
[16] M.-H. Sigari, H. Soltanian-Zadeh, and H.-R. Pourreza, "Fast highlight detection and scoring for broadcast soccer video summarization using on-demand feature extraction and fuzzy inference," International Journal of Computer Graphics, vol. 6, no. 1, pp. 13–36, 2015.
[17] J. Yu, A. Lei, and Y. Hu, "Soccer video event detection based on deep learning," in International Conference on Multimedia Modeling. Springer, 2019, pp. 377–389.
[18] H. Duxans, X. Anguera, and D. Conejero, "Audio based soccer game summarization," in . IEEE, 2009, pp. 1–6.
[19] A. Raventos, R. Quijada, L. Torres, and F. Tarrés, "Automatic summarization of soccer highlights using audio-visual descriptors," SpringerPlus, vol. 4, no. 1, pp. 1–19, 2015.
[20] M. Tan and Q. V. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[24] X. Dai, S. Gong, S. Zhong, and Z. Bao, "Bilinear cnn model for fine-grained classification based on subcategory-similarity measurement," Applied Sciences, vol. 9, no. 2, p. 301, 2019.
[25] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear cnn models for fine-grained visual recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
[26] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The caltech-ucsd birds-200-2011 dataset," 2011.
[27] J. Fu, H. Zheng, and T. Mei, "Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4438–4446.
[28] H. Zheng, J. Fu, T. Mei, and J. Luo, "Learning multi-attention convolutional neural network for fine-grained image recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5209–5217.
[29] M. Sun, Y. Yuan, F. Zhou, and E. Ding, "Multi-attention multi-class constraint for fine-grained image recognition," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 805–821.
[30] W. Ge, X. Lin, and Y. Yu, "Weakly supervised complementary parts models for fine-grained image classification from the bottom up," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3034–3043.
[31] C. Geng, S.-j. Huang, and S. Chen, "Recent advances in open set recognition: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[32] H. Cevikalp, "Best fitting hyperplanes for classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1076–1088, 2016.
[33] M. Hassen and P. K. Chan, "Learning a neural-network-based representation for open set recognition," in Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020, pp. 154–162.
[34] A. Bendale and T. E. Boult, "Towards open set deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563–1572.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[38] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[39] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[40] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[41] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[43] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
Ali Karimi received the B.S. degree in Computer Engineering (software engineering) from Bu-Ali Sina University, Hamedan, Iran, in 2018. He is currently pursuing the M.S. degree in Information Technology Engineering at the University of Tehran, Tehran, Iran. His fields of interest include image and video processing, machine vision, and machine learning.