Soccer Event Detection Using Deep Learning
Ali Karimi, Ramin Toosi, Mohammad Ali Akhaee
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran.
Abstract—Event detection is an important step in extracting knowledge from video. In this paper, we propose a deep learning approach to detect events in a soccer match, emphasizing the distinction between images of red and yellow cards and the correct separation of the selected events from other images. The method comprises three modules: i) a variational autoencoder (VAE) module to differentiate between soccer images and other images, ii) an image classification module to classify the images of events, and iii) a fine-grain image classification module to distinguish the images of red and yellow cards. Additionally, a new dataset is introduced for soccer image classification and is employed to train the networks described in the paper. Finally, 10 UEFA Champions League matches are used to evaluate the networks' performance and precision in detecting the events. The experiments demonstrate that the proposed method achieves better performance than state-of-the-art methods.
Index Terms—Event detection, neural networks, soccer video
I. INTRODUCTION

Soccer is among the most popular sports in the world. The attractiveness of this sport has gathered many spectators, and various studies are being performed to grow and assist the sport and to meet the needs of soccer clubs and the media. These studies mainly focus on estimating team tactics [1], tracking players [2] or the ball on the field [3], detecting events occurring in the match [4], [5], [6], [7], [8], summarizing the soccer match [9], [10], [11], and estimating ball possession statistics [12]. They are carried out using various methods and techniques, including machine learning (ML).

Artificial intelligence (AI), and more specifically ML, can assist in conducting the above-mentioned research in order to achieve better and more intelligent results. ML itself can be implemented in a variety of ways, including deep learning (DL). Distinguishing the events of a soccer match is one of the active research fields related to soccer. Today, various AI methods are utilized to detect events in a soccer game [4], [7], and their use in this area can help to achieve higher detection accuracy.

Detecting the events of a soccer match has several applications. Event detection, for instance, can help obtain the statistics of a match. Counting the number of free kicks, fouls, tackles, etc. in a soccer game can be done manually; however, manual annotation is not only costly and time consuming but also error-prone. With intelligent systems based on event detection, such statistics can be calculated and used automatically. Another application of event detection is the summarization of a soccer match [13]. The summary of a soccer match includes the important events of the match, and to prepare a useful summary, the events should be correctly identified.
Using a method that identifies the events with high accuracy can improve the quality of the summarization task.

The purpose of the current study is to detect events in soccer matches. In this regard, various deep learning architectures have been developed. (Corresponding author: Mohammad Ali Akhaee; email: [email protected].)

When using deep learning methods, there is always a need for training datasets. To this end, a rich visual dataset is first collected for our task, covering penalty kicks, corner kicks, free kicks, tackles, substitutions, and yellow and red cards, together with images of the sides and center of the field. This image dataset is then used to train the proposed networks. Solving the event detection problem encounters some challenges, including the similarity of some images, such as yellow and red cards, which makes it difficult to separate the images of these two groups from each other. This causes the classifier to have trouble distinguishing between yellow and red cards, resulting in incorrect detections.

Another problem in detecting events in a soccer match is the issue of no-highlight frames. Not all events happening in a soccer match can be assigned to the specific events given to the network for training; some scenes that were never fed to the network during training may occur in a match. In such cases, the network must be able to handle these images or videos and not mistakenly categorize them as one of the given events. We solve the no-highlight detection problem by using a VAE, setting a threshold, and using additional classes in the image classification module.

The proposed method for detecting soccer match events uses two convolutional neural networks (CNNs) and a variational autoencoder (VAE).
The current study focuses on resolving the problem of no-highlight frames, which may be wrongly considered as one of the events, as well as the problem of similarity between red and yellow card frames. Experiments demonstrate that the proposed method improves the accuracy of image classification from 88.93% to 93.21%. The main reason for this increase is the use of a new fine-grain classification network to classify yellow and red cards. The proposed method also detects no-highlight frames with high precision. All datasets and implementations are publicly available at https://github.com/FootballAnalysis/footballanalysis.

The rest of the paper is organized as follows. Section II provides a literature review and examines the drawbacks of previous works in this area. Section III describes the proposed algorithm and the structure and mechanism of its modules. Section IV introduces the datasets collected for this study. Section V presents the experimental results and compares them with those of other papers. Eventually, Section VI concludes the paper.

II. RELATED WORK
The research of Duan et al. [14] is one of the first works in this area; it implements supervised learning for top-down video shot classification. The authors in [15] present a method for detecting soccer events using a Bayesian network. These early methods suffered from low accuracy until methods based on DL were proposed. Presenting a method based on a convolutional network and an LSTM network, Agyeman et al. [9] summarize a soccer match based on event detection, considering five events: corner kicks, free kicks, goal scenes, centerline scenes, and throw-ins. Their study uses the 3D-ResNet34 architecture as the convolutional network. One of the problems with this work is that the number of events is limited and no-highlight frames are not taken into account. Jiang et al. [8] first perform feature extraction using a convolutional network and then detect events by combining it with an RNN model. This method is limited to four events: goal, goal attempt, corner, and card. Sigari et al. [16] employ a fuzzy inference system; their algorithm works based on replay detection, logo detection, view type recognition, and audience excitement, and is also limited to three events: penalty, corner, and free kick. Eleven events are classified in [17], covering a good number of events; however, this method is not capable of distinguishing between the red and yellow card events because of their high similarity. In general, the methods presented in this field work on either images, videos [9], [8], or audio signals [18], [19]. Nonetheless, some methods employ two signals, i.e., audio and video, simultaneously [10], [16]. Recently presented methods utilize DL architectures as the main tool for feature extraction. Among the DL architectures suggested for feature extraction, the closest flagship architecture is EfficientNet [20].
This architecture is presented in eight different versions. Its variants generally offer higher performance than previous models [21], [22], [23], with fewer parameters and a smaller memory footprint.

One of the challenges of the event detection problem in soccer matches, which has not been addressed adequately in the literature, is events that are very similar in appearance but are nonetheless separate. For instance, in the images of yellow and red cards, only the color of the card differs and the other parts of the image are the same. Although both may appear to be the same card-taking operation, in a soccer game these two events impact the course of the match very differently. In the literature, both yellow and red card events are considered as one event [15], which causes problems for event detection. The reason is the very high similarity of the images of these two events, which makes it very challenging to distinguish between them. In this paper, fine-grained image classification is used instead of common feature extraction architectures to detect such events.

Fine-grained image classification is one of the challenges in machine vision; it categorizes images that fall into a similar category but not into the same subcategory [24]. For example, tasks such as face recognition or distinguishing different breeds of dogs, birds, etc. involve subcategories with many structural similarities that are nonetheless distinct. As another example, the California gull and the ring-billed gull are two similar birds, differing mainly in beak pattern, that belong to two separate subclasses. The main problem with this type of classification is that such differences are usually subtle and local, and finding the discriminating regions of two subcategories is the challenge faced by these methods.
The work of Lin et al. [25] is one of the deep-learning-based studies in the field of fine-grained image classification. In this model, two neural networks are used simultaneously; the outer product of their outputs is mapped to a bilinear vector, and a softmax layer finally assigns the class. The accuracy of this method on the CUB-200-2011 dataset [26] is 84.1%, whereas using only a single neural network in this architecture yields at most 74.7%. Fu et al. [27] introduce a recurrent attention CNN framework, which receives the image at its original size and passes it through a classification network, thereby extracting the probability of its placement in each category. At the same time, after the convolutional layers of the classifier, an attention proposal network produces region parameters that are used to zoom in on the image and crop it. The resulting image is fed into a network like the original image, and another attention proposal network re-extracts a further part of the new image. This method reaches an accuracy of 85.3% on the Birds dataset. In another work, the authors in [28] propose a multi-attention convolutional neural network framework with an accuracy of 86.5%. In 2018, Sun et al. [29] presented a new attention-based CNN that learns multiple attention region features per image through the one-squeeze multi-excitation (OSME) module and then applies the multi-attention multi-class constraint (MAMC). Thanks to this structure, the method improves on the accuracy of previous methods to some extent. One of the latest methods, presented in [30], consists of three steps. In the first step, the instances are detected and instance segmentation is carried out. In the second step, a set of complementary parts is created from the original image.
In the third step, a CNN is applied to each image obtained, and the outputs of these CNNs are given to LSTM cells over all images. The accuracy of the best model in this method on the Birds dataset reaches 90.4%. In general, the limitation of the methods presented in this section is the accuracy they achieve, and newer methods attempt to improve it.

Another issue is that a soccer match can include various scenes that are not necessarily a specific event, such as scenes
Fig. 1: Generic block diagram of the proposed algorithm

from a soccer match where players are walking, or moments when the game is stopped. If such images are applied as input to the classification network, the network will mistakenly place them in one of the defined categories. The reason is that the network is trained only to categorize images among events; this is called a traditional classification network [31]. In such classifiers, only the known classes are used during training, and only known-class images should be given to the network during testing. Otherwise, the network will have trouble detecting the image category: even though the image should not be placed in any of the categories, it will be placed incorrectly in one of them. This type of categorization does not suffice for the problem under study because, as explained, the input images may not belong to any of the categories. Thus, we need a network that assigns an input image to one of the seven categories if it belongs to one of them, and otherwise rejects it rather than mistakenly placing it in a defined category. In other words, open set recognition is required to address this problem [31].

Open set recognition can be implemented using different methods. Cevikalp et al. [32] perform it based on support vector machines (SVMs). The works in [33] and [34] are based on deep neural networks. Today, the use of generative models in this area is reaching its pinnacle; they are divided into two categories: instance generation and non-instance generation [31]. Finding a suitable method is still challenging, and the literature on event detection has not addressed this issue profoundly.

III. PROPOSED METHOD
This section describes the proposed method. Initially, the general procedure is explained; then the three main parts of the method are introduced: an image classification module for detecting the images of the defined events, a fine-grain classification module used to classify yellow and red card images, and a variational autoencoder for detecting no-highlight frames.

A. The Proposed Algorithm
As depicted in Fig 1, the received video is first split into frames according to the video length and frame rate, and each frame is passed separately through a variational autoencoder. If the loss value of the VAE network for the input frame is smaller than a specified value, the received frame is considered an event frame and given to the image classification module. The image classification module classifies the images into nine classes. If the input image belongs to one of the categories center circle, right penalty area, or left penalty area, it is not categorized as an event and is classified as no highlight. If it belongs to one of the five events penalty kick, corner kick, tackle, free kick, or substitution, it is recorded as an event. If the event is a card event, the image is given to the fine-grain classification module, which determines whether it is a yellow or a red card. Finally, to detect events in a soccer match, each event is evaluated over every 15 consecutive frames (seven frames before, seven frames after, and the current frame); if more than half of the frames belong to an event, those 15 frames (half a second) are tagged with that event. Moreover, an event cannot be repeated more than once within 10 s; if it recurs, only one occurrence is counted in the number of events. In the following, each module is described in detail.
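The frame-level voting and 10 s de-duplication described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function and variable names are ours, and a 30 fps rate is assumed (consistent with 15 frames being half a second).

```python
from collections import Counter

def aggregate_events(frame_labels, fps=30, window=15, dedup_s=10):
    """Majority-vote a sliding 15-frame window (7 before, 7 after, current)
    and suppress repeats of the same event within `dedup_s` seconds.
    `frame_labels` is a per-frame list of event names or None (no highlight)."""
    events = []      # (frame_index, event) pairs
    last_seen = {}   # event -> frame index of the last accepted detection
    half = window // 2
    for i in range(half, len(frame_labels) - half):
        window_labels = [l for l in frame_labels[i - half:i + half + 1] if l]
        if not window_labels:
            continue
        label, count = Counter(window_labels).most_common(1)[0]
        if count <= window // 2:      # need more than half of the 15 frames
            continue
        if label in last_seen and (i - last_seen[label]) < dedup_s * fps:
            continue                  # same event within 10 s: count once
        last_seen[label] = i
        events.append((i, label))
    return events
```

For example, two bursts of corner-kick frames separated by more than 10 s of no-highlight frames would be counted as two distinct corner events.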
B. No Highlight Detection Module (Variational Autoencoder)

A soccer match can include various scenes that are not necessarily a specific event, such as scenes where the director is showing the faces of the players either on the field or on the bench, or where the players are walking and the game is stopped. These scenes are not categorized as events of a soccer match. In general, to separate the images of the defined events from the rest of the images, three complementary actions must be performed to detect no-highlight frames:
1) The use of the VAE network to identify whether the input images are similar to the soccer event (SEV) dataset images.
2) The use of three additional categories, that is, left penalty area, right penalty area, and center circle, in the image classification module, given that most free kick images are similar to images from these categories (without these categories, images of the wings of the field would usually get a good score in the free-kick category).
3) Applying the best threshold on the prediction value of the last layer of the EfficientNetB0 feature extraction network.
The second and third actions are applied in the image classification module and are described later. In the first action, the VAE architecture shown in Fig 2 is employed to identify images that do not fall into any of the event categories. The VAE is trained with the loss L(ỹ, y) = −ln p(y | z̃) + KL(q(z | y) ‖ p(z)).

Fig. 2: No highlight detection module architecture (VAE)

Fig. 3: Image classification module architecture

To this end, all images of the soccer training dataset are given to the VAE network for training; then, using the reconstruction loss and a threshold on it, images whose reconstruction loss is higher than the fixed threshold are not considered soccer images. Images with a reconstruction loss below the threshold are categorized as soccer game images and are then given to the image classification module for classification. In other words, this VAE plays the role of a two-class classifier that puts soccer images in one category and non-soccer images in another. The reconstruction loss is obtained from the difference between the input image and the reconstructed image: the further the input images are from the training distribution, the higher the reconstruction loss will be, whereas images from the same distribution yield a smaller error.
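At test time, this module reduces to thresholding the per-image reconstruction error. A minimal numpy sketch follows; the function names and the sum-of-squared-errors form are our assumptions, while the 328 threshold is the value reported in Section V. Here `x_hat` stands in for the trained VAE's reconstruction of the batch `x`.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Per-image sum of squared errors between input and reconstruction."""
    return ((x - x_hat) ** 2).reshape(len(x), -1).sum(axis=1)

def is_soccer_image(x, x_hat, threshold=328.0):
    """Images reconstructed well (loss below the threshold) are treated as
    soccer images; the rest are rejected as no-highlight frames."""
    return reconstruction_loss(x, x_hat) < threshold
```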
C. Image Classification Module
The EfficientNetB0 architecture, shown in Fig 3, is used to categorize images. This network classifies images into nine classes. For an image to be placed in one of these classes, its prediction value at the last layer must be higher than the threshold value, which is set to 0.9; otherwise it is selected as a no-highlight frame. If the output of this network is left penalty area, right penalty area, or center circle, the image is not treated as an event of the soccer match but is included in the no-highlight category. However, if it is one of the categories penalty kick, corner kick, tackle, free kick, or substitution, the event is finalized and the decision is made. Finally, if the image is classified in the card category, it is given to the fine-grain classification module, where the color of the card is determined.
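This decision rule can be sketched as below. The class-name spellings and the ordering of the nine softmax outputs are our assumptions; the 0.9 threshold and the rejection of the three auxiliary field-view classes follow the text.

```python
import numpy as np

EVENT_CLASSES = {"penalty kick", "corner kick", "tackle", "free kick",
                 "to substitute", "card"}
AUX_CLASSES = {"center circle", "left penalty area", "right penalty area"}
CLASS_NAMES = sorted(EVENT_CLASSES | AUX_CLASSES)  # 9 network outputs

def decide(probs, threshold=0.9):
    """Map the 9-way softmax output to an event name, 'no highlight', or
    'card' (which would then go to the fine-grain module)."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "no highlight"      # low-confidence prediction: reject
    name = CLASS_NAMES[best]
    if name in AUX_CLASSES:
        return "no highlight"      # field-view classes are not events
    return name                    # one of the six event outputs
```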
D. Fine-grain Classification Module
The only difference between the red and yellow cards is the color of the card; otherwise, there is no difference in their images. Thus, both are in the card category but belong to separate subcategories. Since the main classifier does not distinguish these two categories well, the two cards are merged into one category in the training phase of the image classification module, and a separate subclassifier focusing on the details is used to separate the yellow and red cards. The final architecture employed in this section can be seen in Fig 4.

Fig. 4: Fine-grain classification module architecture

Here, the architecture provided in [29] is exploited, except that instead of the ResNet50 architecture used in [29], the EfficientNetB0 architecture is employed, and the network is trained using the yellow and red card data. The inputs to this network are the images that were categorized as cards in the image classification module; the output determines whether these images belong to the yellow or the red card category.

IV. DATASETS
In this paper, two datasets, namely the soccer event (SEV) dataset and the test event dataset, have been collected. The collection was done in two ways:
1) By crawling Internet websites, images of different games were collected. These images are unique and are not consecutive frames.
2) Using videos of soccer games of the last few years of the prestigious European leagues, images related to events were extracted.
A. ImageNet
The ImageNet dataset [35] is employed in the proposed method for the initial weighting of the EfficientNetB0 network.
B. SEV Dataset
This dataset includes images of soccer match events collected specifically for this study. In the SEV dataset, a total of 60,000 images were collected in 10 categories. The images of this dataset, as described, were collected in two different ways. Seven of the ten categories are related to the soccer events defined in this paper, and the remaining categories are used for no-highlight detection, so that their images are not mistakenly included in the seven main categories. Table I shows how the SEV dataset is divided into train, validation, and test sets. Samples of the SEV dataset are shown in Fig 5.
C. Test Event Dataset
The test event dataset is used to evaluate the proposed method; samples are shown in Fig 6. The dataset consists of three classes. The first class contains images of the events selected from the SEV dataset, with an equal number of images (200 instances from each category) selected from each defined event category. The second class includes other soccer images in which none of the seven events appear. The third class includes images that are not related to soccer at all. Details of the number of images in this dataset are given in Table II.

V. EXPERIMENT AND PERFORMANCE EVALUATION
A. Training
All three networks are trained independently; an end-to-end method is not employed. The training of the VAE network, the 9-class image classification network, and the yellow and red card classifier network are explained in the following subsections, respectively.
1) VAE
To train this network, the images of the seven events defined in the SEV dataset are selected and given to the VAE network as training data. The test and validation data of these seven categories are used to evaluate the network. The simulation parameters used in this network are summarized in Table III. As illustrated in Fig 7, the value of the loss curve decreases over successive epochs for the validation data.
2) Image Classification (9 Class)
The image classification network is first trained on the ImageNet image collection with input dimensions of 224×224×3. Then, using transfer learning, the network is re-trained on the SEV dataset and fine-tuned. The input images of the SEV dataset also have dimensions of 224×224×3. The network is trained for 20 epochs with the simulation parameters specified in Table IV. The yellow and red card classes are merged, and their 5,000 images are used for network training together with the images of the other SEV dataset classes.
Fig. 5: Samples of SEV dataset

TABLE I: Statistics of SEV dataset
Fig. 6: Samples of test event dataset

TABLE II: Statistics of test event dataset
Class name                                               Image instances
Soccer events (defined events)                           1400
Other soccer events (throw-in, goal kick, offside, ...)  1400
Other images (not related to football)                   1400
Sum                                                      4200
TABLE III: Simulation parameters of the variational autoencoder
Parameter           Value
Optimizer           Adam
Loss function       Reconstruction loss + KL loss
Performance metric  Loss
TABLE IV: Simulation parameters of the image classification module
Parameter           Value
Optimizer           Adam
Loss function       Categorical cross-entropy
Performance metric  Accuracy
Total classes       9 (red and yellow card classes merged)
Augmentation        Scale, rotate, shift, flip
Batch size          16
TABLE V: Simulation parameters of the fine-grain image classification module
Parameter           Value
Optimizer           Adam
Loss function       Categorical cross-entropy
Performance metric  Accuracy
Total classes       2 (red and yellow card classes)
Augmentation        Scale, rotate, shift, flip
Batch size          16
3) Fine-grain image classification
The network shown in Fig 4 is trained using the two classes of red card and yellow card from the SEV dataset. The data of each category are partitioned into train, test, and validation sets with 5000, 500, and 500 images, respectively. The simulation parameters used in this network are given in Table V.
B. Evaluation Metrics
Different metrics are used to evaluate the networks. To evaluate and compare the different image classification architectures of the EfficientNetB0 and fine-grain module networks, accuracy is used as the main metric; recall and F1-score are used to determine the appropriate threshold value for the EfficientNetB0 network, and precision is used to evaluate the performance of the proposed method in detecting events. The accuracy metric determines how accurately the trained model predicts and, as described in this paper, is used to compare different architectures and hyperparameters in the EfficientNetB0 and fine-grain module networks.
Accuracy = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN)   (1)

The F1-score considers both recall and precision together; its value is one in the best case and zero in the worst case.

F1 = 2 / (1/Precision + 1/Recall) = TP / (TP + (FP + FN)/2)   (2)

Precision determines how accurate the model is when making a prediction. This metric has been used as a criterion in selecting the appropriate threshold.

Precision = TP / (TP + FP)   (3)

Recall refers to the percentage of actual positive instances that are correctly identified.
Recall = TP / (TP + FN)   (4)
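Eqs. (1) through (4) can be checked with a small helper (the function name is ours, not part of the paper's code):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from the entries of a binary
    confusion matrix, following Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))  # equals 2 / (1/precision + 1/recall)
    return accuracy, precision, recall, f1
```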
C. Evaluation
The various parts of the proposed method have been evaluated to achieve the best model for detecting events in a soccer match. In the first step, the algorithm should be able to classify the images of the defined events correctly. In the next step, the network is examined to see how well it detects no-highlight frames, and the best possible model is selected. Eventually, the performance of the proposed algorithm on soccer videos is examined and compared with other state-of-the-art methods.
1) Classification evaluation
The image classification module is responsible for classifying images. To test and evaluate this network, as shown in Fig 7, different architectures were trained and different hyperparameters were tested for these models. As shown in Fig 7a, the EfficientNetB0 model has the best accuracy among the models. If, in the above model, we divide the card images into the two categories of yellow and red cards and give the dataset to the network in the form of the same 10 categories of the SEV dataset for training, the accuracy on the test data is reduced from 94.08% to 88.93%. The reason is the interference of yellow and red card predictions in this model. Consequently, yellow and red card detection has been assigned to a subclassifier, and only the merged card category is used in the EfficientNetB0 model. For the subclassification of yellow and red card images, various fine-grained methods were evaluated; the results are shown in Table VI. The bilinear CNN method [25] using the EfficientNet architecture achieves 66.86% accuracy and

(a) Architecture  (b) Batch size  (c) Optimizer  (d) Learning rate
Fig. 7: Comparing the results on hyperparameters and architecture for validation accuracy (image classification module)

TABLE VI: Comparison between fine-grain image classification models
Method name                                 Epoch  Acc (red and yellow card) (%)  Acc (CUB-200-2011) (%)
B-CNN [25] [20]                             60     66.86                          84.01
OSME + MAMC [29] using ResNet50             60     61.70                          86.2
OSME + MAMC [29] using EfficientNetB0 [20]  60     79.90                          –

the OSME method that employs the EfficientNet architecture reaches 79.90% accuracy, which is higher than the other methods. Nonetheless, the main architecture (EfficientNetB0) used in the image classifier shows 62.02% accuracy, a difference of 17.88%. Table VII compares the accuracy of combining the image classification module and the fine-grain classification module with that of other models. As shown in Table VII, the accuracy of the proposed method for image classification is 93.21%, the best among all models. The proposed method is also faster than the other models except MobileNet and MobileNetV2. To demonstrate this, the execution time of each model was measured on 1400 images, and the average run time is reported as the mean inference time in Table VII. As shown in Table VIII, the problem of card overlap is also solved, and the proposed method separates the two card categories almost well.
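The mean inference time could be measured along these lines. This is a hedged sketch, not the paper's benchmarking code: `model_fn` stands in for a trained model's predict call, and the paper averages over 1400 images.

```python
import time

def mean_inference_time(model_fn, inputs):
    """Average wall-clock time of `model_fn` over a batch of inputs,
    mirroring how a mean inference time per image could be obtained."""
    start = time.perf_counter()
    for x in inputs:
        model_fn(x)
    return (time.perf_counter() - start) / len(inputs)
```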
2) Known or Unknown Evaluation
To determine the threshold value on the output of the last layer of the EfficientNetB0 network, different threshold values were tested and evaluated according to Table IX; the value 0.90, which gives the highest F1-score for detecting the events and also the best recall to detect
TABLE VII: Comparison between image classification models on SEV dataset
Method name                                     Accuracy (%)  Mean inference time (second)
VGG16 [21]                                      83.21         0.510
ResNet50V2 [36]                                 84.18         0.121
ResNet50 [23]                                   84.89         0.135
InceptionV3 [37]                                84.93         0.124
MobileNetV2 [38]                                86.95         0.046
Xception [39]                                   86.97         0.186
NASNetMobile [40]                               87.01         0.121
MobileNet [41]                                  88.48         0.048
InceptionResNetV2 [42]                          88.71         0.252
EfficientNetB7 [20]                             88.92         1.293
EfficientNetB0 [20]                             88.93         0.064
DenseNet121 [43]                                89.47         0.152
Proposed method (image classification section)  93.21
TABLE VIII: Confusion matrix of the proposed algorithm (image classification section)
Actual \ Predicted  Center  Corner  Free kick  Left PA  Penalty  R Card  Right PA  Tackle  To Sub.  Y Card
Center              0.994   0       0.006      0        0        0       0         0       0        0
Corner              0       0.988   0.006      0        0        0.004   0         0       0.002    0
Free kick           0.004   0.004   0.898      0.03     0        0.002   0.05      0.006   0.004    0.002
Left Penalty Area   0.002   0       0.036      0.958    0.004    0       0         0       0        0
Penalty kick        0       0       0          0.002    0.972    0       0.026     0       0        0
Red cards           0       0.008   0.03       0        0        0.862   0         0.002   0.012    0.086
Right Penalty Area  0       0       0.01       0.002    0        0       0.988     0       0        0
Tackle              0.012   0.014   0.02       0.02     0.002    0.006   0.028     0.876   0.004    0.018
To Substitute       0       0       0          0        0        0.002   0.002     0       0.994    0.002
Yellow cards        0       0.022   0.01       0        0        0.131   0         0.002   0.006    0.829
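Since the test classes are equally sized (500 images per class, Section IV), the overall accuracy follows directly from the diagonal of a row-normalized confusion matrix like Table VIII. A small illustrative helper (the equal-class-size assumption is ours to state explicitly):

```python
import numpy as np

def overall_accuracy(confusion):
    """Overall accuracy from a row-normalized confusion matrix under
    equally sized classes: the mean of the per-class recalls (diagonal)."""
    confusion = np.asarray(confusion, dtype=float)
    return float(np.mean(np.diag(confusion)))
```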
TABLE IX: Comparison between different threshold values (image classification module)
Threshold  F1-score (event images) (%)  Recall (images not related to football) (%)
0.99       86.6                         95.3
0.98       88.2                         94.2
0.97       89.2                         93.8
0.95       90.5                         93.1
0.90       91.8                         92.4
0.80       92.4                         76.9
0.50       95.2                         51.2

no-highlight images, was determined as the threshold for the last layer of the network. The threshold value for the loss of the VAE network was also examined with different values; as shown in Fig 8, the value of 328 as the loss threshold gives the best distinction between the categories. To test how the network detects no-highlight images and the defined event images, the test dataset is used. Using this dataset reveals how many main events are incorrectly classified by the proposed method as events not related to soccer, how many soccer-related images are classified as the main events or placed in the no-highlight category, and how many images of real events are categorized correctly into the right events or excluded from the main events. The results of this evaluation are provided in Table X.
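The selection rule described above, highest event F1-score while keeping good recall on no-highlight images, can be sketched against Table IX's values. The 90% recall floor is an illustrative assumption of ours; the paper does not state an explicit cutoff.

```python
def pick_threshold(rows, min_recall=90.0):
    """Choose the threshold with the highest event F1-score among those
    whose no-highlight recall stays at or above `min_recall`.
    `rows` holds (threshold, f1, recall) triples as in Table IX."""
    feasible = [(f1, r, t) for t, f1, r in rows if r >= min_recall]
    best = max(feasible)   # tuple comparison: highest F1 wins
    return best[2]

# (threshold, F1-score on event images %, recall on non-soccer images %)
TABLE_IX = [(0.99, 86.6, 95.3), (0.98, 88.2, 94.2), (0.97, 89.2, 93.8),
            (0.95, 90.5, 93.1), (0.90, 91.8, 92.4), (0.80, 92.4, 76.9),
            (0.50, 95.2, 51.2)]
```

Under this criterion the 0.90 threshold is selected, matching the choice in the text.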
3) Final Evaluation
Ten soccer matches were downloaded from the UEFA Champions League, and the proposed method was applied to detect their events. In this evaluation, the events occurring in each soccer game are examined; in other words, the numbers of events correctly and incorrectly detected by the network have been determined. Details of the results are given in Table XI and compared with state-of-the-art methods.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, two novel datasets for soccer event detection have been presented. One is the SEV dataset, including 60,000 images in 10 categories, seven of which are related to soccer events and three to soccer scenes; these were used in training the image classification networks. The images of this dataset were taken from the top five leagues in Europe and the European Champions League. The other dataset is the test event dataset, which contains 4,200 images in three categories: the first consists of the events mentioned in the paper, the second comprises other images of a soccer match apart from the first-category events, and the third includes images off the soccer field. This dataset was exploited to examine the network's power in detecting and distinguishing between highlight and no highlight images.
Furthermore, a method for soccer event detection is proposed. The proposed method employs the EfficientNetB0 network in the image classification module to detect events
Fig. 8: Reconstruction loss of the VAE
TABLE X: The precision of the proposed algorithm
Class                Sub-class      Precision
Soccer Events        Corner kick    0.94
—                    Free kick      0.92
—                    To Substitute  0.98
—                    Tackle         0.91
—                    Red Card       0.90
—                    Yellow Card    0.91
—                    Penalty Kick   0.93
Other soccer events  —              0.86
Other images         —              0.94
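The per-class precision values in Table X follow the standard definition: true positives over all predicted positives. A minimal sketch, using hypothetical counts rather than the authors' data:

```python
# Sketch: per-class precision as reported in Table X,
# computed from hypothetical detection counts.

def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# e.g. 94 correct corner-kick detections out of 100 predicted corners
print(round(precision(94, 6), 2))  # 0.94
```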
TABLE XI: Precision of the proposed and state-of-the-art methods on 10 soccer matches (%)
Event name     Proposed method   BN [15]   Jiang et al. [8]
Corner kick    94.16             88.13     93.91
Free kick      83.31             –         –
To Substitute  97.68             –         –
Tackle         90.13             81.19     –
Red Card       92.39             –         –
Yellow Card    92.66             –         –
Card           –                 88.83     93.21
Penalty Kick   88.21             –         –

in a soccer match. Also, the fine-grain image classification module was used to differentiate between red and yellow cards. Without this module, red and yellow cards would have been categorized in the image classification module, and the differentiation accuracy would have been 88.93%; the fine-grain image classification module increased this accuracy to 93.21%. To address the network's difficulty in predicting images other than those of the defined events, a VAE was employed with an adjusted threshold value, and several images other than those of the defined events were used to achieve a better distinction between the images of the defined events and other images.

REFERENCES
[1] G. Suzuki, S. Takahashi, T. Ogawa, and M. Haseyama, "Team tactics estimation in soccer videos based on a deep extreme learning machine and characteristics of the tactics," IEEE Access, vol. 7, pp. 153238–153248, 2019.
[2] M. Manafifard, H. Ebadi, and H. A. Moghaddam, "A survey on player tracking in soccer videos," Computer Vision and Image Understanding, vol. 159, pp. 19–46, 2017.
[3] P. Kamble, A. Keskar, and K. Bhurchandi, "A deep learning ball tracking system in soccer videos," Opto-Electronics Review, vol. 27, no. 1, pp. 58–69, 2019.
[4] Y. Hong, C. Ling, and Z. Ye, "End-to-end soccer video scene and event classification with deep transfer learning," in . IEEE, 2018, pp. 1–4.
[5] B. Fakhar, H. R. Kanan, and A. Behrad, "Event detection in soccer videos using unsupervised learning of spatio-temporal features based on pooled spatial pyramid model," Multimedia Tools and Applications, vol. 78, no. 12, pp. 16995–17025, 2019.
[6] A. Khan, B. Lazzerini, G. Calabrese, and L. Serafini, "Soccer event detection," in . AIRCC Publishing Corporation, 2018, pp. 119–129.
[7] M. Z. Khan, S. Saleem, M. A. Hassan, and M. U. G. Khan, "Learning deep c3d features for soccer video event detection," in . IEEE, 2018, pp. 1–6.
[8] H. Jiang, Y. Lu, and J. Xue, "Automatic soccer video event detection based on a deep neural network combined cnn and rnn," in . IEEE, 2016, pp. 490–494.
[9] R. Agyeman, R. Muhammad, and G. S. Choi, "Soccer video summarization using deep learning," in Information Processing and Retrieval (MIPR). IEEE, 2019, pp. 270–273.
[10] M. Sanabria, F. Precioso, and T. Menguy, "A deep architecture for multimodal summarization of soccer games," in Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, 2019, pp. 16–24.
[11] M. Rafiq, G. Rafiq, R. Agyeman, S.-I. Jin, and G. S. Choi, "Scene classification for sports video summarization using transfer learning," Sensors, vol. 20, no. 6, p. 1702, 2020.
[12] S. Sarkar, A. Chakrabarti, and D. Prasad Mukherjee, "Generation of ball possession statistics in soccer using minimum-cost flow network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[13] H. M. Zawbaa, N. El-Bendary, A. E. Hassanien, and T.-h. Kim, "Event detection based approach for soccer video summarization using machine learning," International Journal of Multimedia and Ubiquitous Engineering, vol. 7, no. 2, pp. 63–80, 2012.
[14] L.-Y. Duan, M. Xu, Q. Tian, C.-S. Xu, and J. S. Jin, "A unified framework for semantic shot classification in sports video," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1066–1083, 2005.
[15] M. Tavassolipour, M. Karimian, and S. Kasaei, "Event detection and summarization in soccer videos using bayesian network and copula," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 2, pp. 291–304, 2013.
[16] M.-H. Sigari, H. Soltanian-Zadeh, and H.-R. Pourreza, "Fast highlight detection and scoring for broadcast soccer video summarization using on-demand feature extraction and fuzzy inference," International Journal of Computer Graphics, vol. 6, no. 1, pp. 13–36, 2015.
[17] J. Yu, A. Lei, and Y. Hu, "Soccer video event detection based on deep learning," in International Conference on Multimedia Modeling. Springer, 2019, pp. 377–389.
[18] H. Duxans, X. Anguera, and D. Conejero, "Audio based soccer game summarization," in . IEEE, 2009, pp. 1–6.
[19] A. Raventos, R. Quijada, L. Torres, and F. Tarrés, "Automatic summarization of soccer highlights using audio-visual descriptors," SpringerPlus, vol. 4, no. 1, pp. 1–19, 2015.
[20] M. Tan and Q. V. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[24] X. Dai, S. Gong, S. Zhong, and Z. Bao, "Bilinear cnn model for fine-grained classification based on subcategory-similarity measurement," Applied Sciences, vol. 9, no. 2, p. 301, 2019.
[25] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear cnn models for fine-grained visual recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
[26] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The caltech-ucsd birds-200-2011 dataset," 2011.
[27] J. Fu, H. Zheng, and T. Mei, "Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4438–4446.
[28] H. Zheng, J. Fu, T. Mei, and J. Luo, "Learning multi-attention convolutional neural network for fine-grained image recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5209–5217.
[29] M. Sun, Y. Yuan, F. Zhou, and E. Ding, "Multi-attention multi-class constraint for fine-grained image recognition," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 805–821.
[30] W. Ge, X. Lin, and Y. Yu, "Weakly supervised complementary parts models for fine-grained image classification from the bottom up," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3034–3043.
[31] C. Geng, S.-j. Huang, and S. Chen, "Recent advances in open set recognition: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[32] H. Cevikalp, "Best fitting hyperplanes for classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1076–1088, 2016.
[33] M. Hassen and P. K. Chan, "Learning a neural-network-based representation for open set recognition," in Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020, pp. 154–162.
[34] A. Bendale and T. E. Boult, "Towards open set deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563–1572.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[38] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[39] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[40] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[41] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[43] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
Ali Karimi received the B.S. degree in Computer Engineering (software engineering) from Bu-Ali Sina University, Hamedan, Iran, in 2018. He is currently pursuing the M.S. degree in Information Technology Engineering at the University of Tehran, Tehran, Iran. His fields of interest include image and video processing, machine vision, and machine learning.