EventAnchor: Reducing Human Interactions in Event Annotation of Racket Sports Videos
Dazhen Deng, Jiang Wu, Jiachen Wang, Yihong Wu, Xiao Xie, Zheng Zhou, Hui Zhang, Xiaolong Zhang, Yingcai Wu
Dazhen Deng, State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China
Jiang Wu, State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China
Jiachen Wang, State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China
Yihong Wu, State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China
Xiao Xie, State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China
Zheng Zhou, Department of Sport Science, Zhejiang University, Hangzhou, Zhejiang, China
Hui Zhang, Department of Sport Science, Zhejiang University, Hangzhou, Zhejiang, China
Xiaolong (Luke) Zhang, College of Information Sciences and Technology, Pennsylvania State University, United States of America
Yingcai Wu*, State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China

ABSTRACT
The popularity of racket sports (e.g., tennis and table tennis) leads to high demands for data analysis, such as notational analysis, on player performance. While sports videos offer many benefits for such analysis, retrieving accurate information from sports videos could be challenging. In this paper, we propose EventAnchor, a data analysis framework to facilitate interactive annotation of racket sports video with the support of computer vision algorithms. Our approach uses machine learning models in computer vision to help users acquire essential events from videos (e.g., serve, the ball bouncing on the court) and offers users a set of interactive tools for data annotation. An evaluation study on a table tennis annotation system built on this framework shows significant improvement of user performances in simple annotation tasks on objects of interest and complex annotation tasks requiring domain knowledge.
CCS CONCEPTS
• Human-centered computing → Interaction techniques; Interactive systems and tools.

* Yingcai Wu is the corresponding author.
CHI ’21, May 8–13, 2021, Yokohama, Japan
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8096-6/21/05...$15.00
https://doi.org/10.1145/3411764.3445431
KEYWORDS
Racket sports, visualization, computer vision, data annotation
ACM Reference Format:
Dazhen Deng, Jiang Wu, Jiachen Wang, Yihong Wu, Xiao Xie, Zheng Zhou, Hui Zhang, Xiaolong (Luke) Zhang, and Yingcai Wu. 2021. EventAnchor: Reducing Human Interactions in Event Annotation of Racket Sports Videos. In CHI Conference on Human Factors in Computing Systems (CHI ’21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3411764.3445431
1 INTRODUCTION

Racket sports are popular all over the world. For example, tennis, often regarded as the top racket sport, had more than 87 million players in 2019 [2], and ATP (the Association of Tennis Professionals) events have attracted 1 billion cumulative viewers [12]. Such popularity leads to high demands for data analysis on player performance by both amateurs and professional analysts [27, 54]. One widely used analytical method for racket sports is notational analysis [1, 27], which focuses on the movements of players in a match. Video recordings of matches are often used for such analysis because of the availability of rich source information, such as the position and action of players, action time, and action result. Manually retrieving massive source information from long match videos could be very challenging for users, so computer vision algorithms have been applied to data extraction from sports videos.

Existing data acquisition systems based on computer vision have several limitations. First, many systems cannot accurately track data from low-quality videos, such as broadcasting videos [45]. For example, the low frame rate in broadcasting videos cannot exhibit the fast motion of players and the ball/shuttlecock well. In elite table tennis matches, where the average duration between two strokes can be as short as half a second [27, 31], the image of the ball in a single frame may be a semi-transparent tail that shows a series of ball positions. Most computer vision models cannot accurately recognize the ball position from the tail. Similarly, other characteristics of racket sports videos, including but not limited to frequent shot transitions, inconsistent scene appearances, and severe occlusion of the ball/shuttlecock by players, also pose challenges for robust object recognition. Second, existing systems largely focus on low-level object recognition, such as human action [64], and are weak at identifying and retrieving high-level event information, such as the outcomes of the actions [10]. This limitation is still an open problem in computer vision [21, 45], because automatically extracting contextual information in sports requires the integration of domain knowledge into algorithms.

Interactive data acquisition systems have been developed to improve the accuracy and quality of data extraction from videos. Such systems allow user involvement in the data processing, such as manually validating the tracking result of the ball or labeling the outcome of a serve. One of the challenges such systems face is scalability: when having a large number of annotations to process, existing systems often rely on crowdsourcing [23, 34, 48]. Another challenge is annotation efficiency for individual users. Some research attempted to reduce human interaction in data annotation from sports videos, such as baseball videos [33], but these methods cannot be applied to racket sports, which, with faster and more dynamic rhythms, require different approaches for data annotation.

In this paper, we propose EventAnchor, an analytical framework to support data annotation for racket sports videos. Our framework integrates computer vision models for scene detection and object tracking, and uses the model outputs to create a series of anchor points, which are potential events of interest. Interacting with these anchors, users can quickly find desired information, analyze relevant events, and eventually create annotations on simple events or complex player actions.
Based on the framework, we implement an annotation system for table tennis. The results of our evaluation study on the system show significant improvement of user performances in data annotation with our method.

The major contribution of this paper lies in the novel framework, EventAnchor, that we propose for multiple-level video data annotation, based on our empirical work in understanding the requirements of data annotation by expert analysts. This framework integrates rich information and supports efficient video content exploration.
2 RELATED WORK

Our research focuses on interactive video annotation enhanced with machine learning techniques in computer vision. Thus, in this section, we review the methods for video annotation, particularly those relying on machine learning or crowdsourcing to scale up annotation. We also discuss research on interaction design to support video annotation.
2.1 Machine Learning-Assisted Video Annotation

The advance of machine learning has provided new opportunities to reduce the cognitive and interaction burdens of users in video annotation [14, 28, 43, 49]. Models have been incorporated into video annotation systems for various purposes, such as predicting annotations based on user interaction activities [14, 28, 49] and propagating the annotations of keyframes to other frames [26, 50]. Many different models have been considered. For example, the models to predict annotations include those based on continuous relevance [26], particle filtering [56, 57], and Bayesian inference [50]. One common approach in model-based video annotation is to preprocess data with models pre-trained on other datasets. This practice can improve the efficiency of data annotation by removing non-interesting data. For example, when constructing the NCAA Basketball Dataset, Ramanathan et al. [38] used a pre-trained classifier to filter video clips first, so that non-profile shots could be eliminated before distributing the data and tasks to crowd workers. This approach can significantly reduce the amount of data for annotation, as well as the burden on users in annotation.

Motivated by these methods, this research uses computer vision models to extract essential entities and objects from racket sports videos, such as keyframes, ball trajectories, and player positions. Despite the inevitable errors of such models, these entities and objects lay the foundation for further data processing (e.g., event recognition) and user interaction (e.g., searching and evaluating events of interest), thereby potentially improving annotation efficiency.
2.2 Interaction Design for Video Annotation

Researchers have also explored ways to help people annotate video data through interaction designs. One research direction is to explore new interactive approaches to facilitate important annotation tasks, such as an adaptive video playback tool to assist quick review of long video clips [3, 15, 18], a mobile application to support real-time, precise emotion annotation [63], an interaction pipeline for the annotation of objects and their relations [41], and a novel method to acquire tracking data for sports videos [33]. These designs, which largely target single users, can improve the efficiency and accuracy of video annotation from different perspectives. Our proposed method differs from the aforementioned works in two respects. First, our method allows users to locate events of interest by integrating not only essential information at the object level (e.g., ball position), which existing designs [15, 18] largely focused on, but also more advanced information at the event (e.g., stroke type) and context (e.g., tactical style) levels, which we propose to enable more comprehensive and in-depth data analysis. Second, our method supports a more efficient and scalable exploration of events with computer vision algorithms and an improved timeline tool. Our algorithms can remove useless content and keep the key events to better support fast and dynamic video review. Our fine-grained timeline, which visualizes the events at the frame level and is controlled by a calibration hotbox, allows users to quickly examine frames back and forth, even in very long videos.

Another research direction focuses on designs to support crowd workers. Crowdsourcing has been considered as a way to scale up interactive annotation [32, 60]. While some work studied general design issues, such as user interface design guidelines for crowd-based video data annotation [22], most research in this direction explored designs to combine annotations from the crowd to generate better results. For example, Kaspar et al. [19] developed an
ensemble method to improve the quality of video segmentation, and Song et al. [43] proposed an intelligent, human-machine hybrid method to combine crowd annotations for 3D reconstruction. In addition to these works from a technical design perspective, some research also investigated non-technical issues in the design of crowdsourcing tools, such as the skills and motivation of crowd workers [48] and workflows for crowd workers [20]. For sports videos, while most work used crowdsourcing to enhance data analysis [35, 47] or model training [38], Tang et al. [44] developed a crowdsourcing method to construct annotations for video highlights based on social media data from sports fans. In this work, we focus on improving the efficiency of single workers.

Figure 1: The snapshots of the broadcast videos and the scoreboards of tennis, table tennis, and badminton, respectively.
2.3 Video Annotation Tools

Various tools [5, 6, 11, 48, 59] have been developed for video data annotation. Early work largely focused on object recognition and annotation. For example, ViPER [11] can annotate the bounding boxes of objects and texts frame by frame, and LabelMe [59] supports the annotation of the same object across different frames. As the demands for video annotation dramatically increased, efforts were made to reduce the burdens of annotation. VATIC [48], for example, was designed to leverage crowdsourcing for video annotation; iVAT [5] combined automatic label generation with user manipulation to improve annotation efficiency; ViTBAT [6] supported the annotation of individual and group behaviors across different frames.

These projects laid the foundations for the design of video annotation systems. Some methods, such as the bounding box in ViPER and frame interpolation in LabelMe, have become common practices supported by many annotation tools. Basic functions like geometry drawing (e.g., lines, rectangles, polygons) and video operation (e.g., pause, speed control, skip back or forward) have been widely adopted. However, these tools only support basic annotation tasks, such as labeling objects in general videos, with limited support for annotation tasks involving multiple fast-moving objects across spatial and temporal dimensions, as racket sports videos usually have.
3 BACKGROUND

In this section, we first explain the major rules of racket sports and some characteristics of broadcast videos that may affect data acquisition. Although a racket sports match can be a singles or doubles competition, we use singles matches as examples. Also, we focus on typical racket sports with a net separating the players, such as tennis, table tennis, and badminton. Sports in which players are not separated by a net and can have direct body contact, such as racquetball and squash, are not considered because of their different video scene structures. We also introduce our two studies on the tasks and data in video annotation: an interview study with three domain experts, and a survey investigating the interests of sports fans.
3.1 Racket Sports and Broadcast Videos

The match structures of racket sports are similar. A match is a competition between two players. A match is usually played as the best of N (e.g., 3, 5, 7) games, and each game is played as the best of N rallies (or points). The only exception is tennis, where there is another layer, called a set, above the game: tennis is played as the best of N sets (Fig. 1). When playing a rally, two players hit the ball (or shuttlecock) in turns until one fails to send the ball to the court on the other side and loses one point [27]. Each hit is called a stroke, and the first stroke in a rally is called the serve.

Broadcast videos of racket sports include different types of content. The central piece is the rallies, which are shown without interruption and often with a fixed camera angle to ensure coverage of the whole court, as shown in Fig. 1. Before a rally, videos usually capture how players prepare for the rally (e.g., resting, chatting with coaches). After a rally, audience reactions often appear, and a rally replay in slow motion may also be provided. What is essential to data annotation is the rally segments; the value of the other content is minimal. The duration of a rally varies from sport to sport, ranging from seconds to minutes [27], but a match can last hours, as often seen in tennis.
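To make this hierarchy concrete, below is a minimal sketch of how a match could be represented in code. The class and field names are illustrative assumptions, not part of any system described in this paper; tennis would add a Set layer between Match and Game.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Stroke:
    player: str       # who hit the ball
    timestamp: float  # seconds into the match video

@dataclass
class Rally:
    strokes: List[Stroke] = field(default_factory=list)
    winner: str = ""  # the player who won this point

    @property
    def serve(self) -> Stroke:
        # the first stroke in a rally is the serve
        return self.strokes[0]

@dataclass
class Game:
    rallies: List[Rally] = field(default_factory=list)

@dataclass
class Match:
    players: Tuple[str, str]
    games: List[Game] = field(default_factory=list)
```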
3.2 Studies on Annotation Tasks and Data

We designed two studies to learn how a match is analyzed and what data is used in the analysis. Considering the diverse interests of people and the possibly vast design space, we first conducted an interview study with domain experts to identify the essential tasks
in analysis, data required by analysis, and common challenges in data acquisition. Based on the information collected from this study, we designed a survey to investigate what ordinary sports fans may be interested in, and what kinds of problems they may have had if they have been involved in data annotation.

Table 1: The common analytical tasks based on interview data.

Tasks | Level | T | TT | B
T1. Who served? (server) | Object | ✓ | ✓ | ✓
T2. What was the type of serve? (serve type) | Context | ✓ | ✓ | ✓
T3. What was the effect of serve? (serve effect) | Context | ✓ | ✓ | ✓
T4. Where did the ball fall on the court? (ball position) | Event | ✓ | ✓ | ✓
T5. What was the speed of the ball (shuttlecock)? (ball speed) | Object | ✓ | — | ✓
T6. What was the spin type of the ball? (ball spin) | Context | ✓ | ✓ | —
T7. How was the ball received? (receive type) | Context | ✓ | ✓ | ✓
T8. What was the effect of receiving? (receive effect) | Context | ✓ | ✓ | ✓
T9. Where was the server/receiver? (player position) | Event | ✓ | ✓ | ✓
T10. How did the server/receiver move before/after hitting the ball? (player movement) | Event | ✓ | ✓ | ✓
T11. Who won this rally? (rally winner) | Object | ✓ | ✓ | ✓
T12. What was the tactic of the player in this rally? (rally tactic) | Context | ✓ | ✓ | ✓
Note: T—Tennis, TT—Table Tennis, B—Badminton.
3.2.1 Interview Study

Our interview study was a semi-structured investigation involving three domain experts: E1, E2, and E3. E1, a professor of sports science, is interested in table tennis analysis. E2 is a badminton analyst and also a professor at a top sports university. Both E1 and E2 have more than twenty years of experience in data analysis. E3 is a Ph.D. candidate in sports science and, as a former professional tennis player, has conducted research on tennis data analysis for more than three years. Our interviews with E1 and E3 were face-to-face, and the meeting with E2 was a real-time video conference call.

The three interviews followed the same structure. Each interview had two sessions. The questions in the first session were the same for all three experts and focused on understanding their analytical tasks and relevant data in their own domains. The conversations in the second session were based on the information gained from the first session and aimed at deepening our understanding of the challenges in analysis and current approaches to address them. Each interview lasted about 90 minutes: roughly 60 minutes for the first session and 30 for the second.

In the first session, we learned about the commonality and uniqueness of analytical tasks in these sports. The interests of the three experts were almost the same at a high level. They were all interested in analyzing the movement of the ball (shuttlecock), the movement of players, their tactics in a rally (e.g., the type and effect of a serve), and the outcome of each rally. However, for certain tasks, their focuses differ. For example, in the analysis of ball movement, ball speed is a major factor in tennis and badminton but not in table tennis, and ball spin type is crucial to table tennis and tennis but not at all to badminton. What distinguishes their analyses most are their strategies. In table tennis, where a rally is usually very short, the analysis often focuses on the scoring rates of players in different stages of a rally [53, 62] and the tactics used by a player (e.g., the stroke position of a player, the landing location of the ball, and the stroke type [51, 55]). In comparison, in badminton, the strategy centers on the three-dimensional trajectory of the shuttlecock [58], because it can fundamentally affect the tactics in both offense and defense. In tennis, which has a much larger court and a larger ball than table tennis and badminton, managing physical energy by predicting the ball position and moving in advance is critical to players. Therefore, the analysis often emphasizes player movement and its correlation with ball position [17], in order to understand spatio-temporal shot patterns [36, 37] and how players use various techniques [61] to force their opponents to move.

In the second session, we gathered information about how these experts conducted their analysis. They all used certain software, but their tools are usually very basic, largely limited to controlling video playback, capturing video images, and extracting video segments from a long video clip; they cannot support more advanced tasks, such as identifying important events, relating different events, and constructing annotations. For example, in table tennis analysis, E1 usually needed to first specify the start and end times of all rallies, and then drill down into them to label the ball position, player position, stroke type, and spin type of each stroke. However, searching for the starts and ends of rallies through a long video is a tedious process, and manually clipping individual rallies out of the whole video is exhausting.
In addition, no tool was available for accurately specifying ball and player positions. As a compromise, a common practice is to use a 3 × … grid; E3 usually used a virtual court with a dense grid for the positions of the ball and players. In badminton, the three-dimensional trajectory of the shuttlecock is estimated by a physical motion model [9]. To specify the three-dimensional start and end positions of the ball, E2 used a tool with a vertical view of the court for the (x, y) coordinates and an end view for the z coordinate. These tools were mostly developed in-house by their supporting staff and are not commercially available.

In addition, the experts encountered more challenges in advanced tasks that require domain knowledge, such as identifying a stroke type in table tennis, which has to be inferred from ball position and player position. The video annotation tools help the experts conduct opponent analysis and prepare players for their future matches. For example, knowing the tactics and strategies at different levels, players can take appropriate actions, such as avoiding situations where their opponents have high winning rates. This group of users is the primary user group of this research.

Based on the data collected from the interview study, we summarized the primary tasks that are common across the three sports, as shown in Table 1. Each task reflects a question that experts tend to ask in analysis. We use a simple term, given inside the parentheses after each question, as a reference for each task. All tasks, except two, are interesting to all three sports. These two tasks are ball speed, which is not a concern in table tennis, and ball spin, which is not applicable to badminton. We still consider these two tasks in this research because they are very critical to the other two sports.

Figure 2: Tasks that sports fans are interested in when watching racket sports videos.
3.2.2 Survey Study

We conducted a survey to learn what general sports fans may be interested in. The backgrounds and interests of sports fans could be very diverse. To keep our focus, we designed a questionnaire based on the tasks developed in the interview study.

The questionnaire includes demographic questions, task interest questions, and data annotation questions. Demographic questions sought basic information from respondents related to their familiarity with and involvement in tennis, table tennis, and badminton, as well as their experience in watching racket sports videos. Task interest questions were developed by drawing on the tasks in Table 1 and asked respondents which tasks they are interested in when watching match videos. In addition to these tasks, respondents could also choose none of the tasks and provide other tasks. Data annotation questions asked whether respondents had been involved in data annotation for sports videos, and if so, what challenges they had.

We distributed the survey to two online communities in China. The total number of members in the two communities is more than 600. We got answers from 109 respondents. Among them, 51 (46.8%) said they had watched tennis match videos, 86 (78.9%) table tennis videos, and 85 (77.1%) badminton videos.

Most respondents indicated that they were interested in some of the tasks on the list (Figure 2). Only a small portion showed no interest in any of them: 11.8% of tennis responses, 14% of table tennis, and 9.5% of badminton. For tennis, the top three tasks are ball position (54.9%), ball speed (47.1%), and serve effect (47.1%), and the least favorite tasks are two tied choices: player tactic (19.6%) and player position (19.6%). The top three tasks in table tennis are rally winner (54.7%), receive effect (45.3%), and ball position (41.9%), and the bottom one is player position (21.0%). The top three tasks in badminton are shuttlecock position (54.8%), serve effect (46.4%), and player position (46.4%), and the least concern is who the server is in a rally (25%).

Only a few respondents had been involved in video data annotation: 10 people indicated experience in annotating table tennis videos, 2 in badminton, and 1 in tennis; one person had experience in all three. For table tennis annotation, two challenges stand out: accurately finding the times of important events (70%) and locating a specific rally in a long video (60%). Two challenges mentioned in annotating badminton videos are estimating shuttlecock location (100%) and finding the times of important events (50%). The only challenge given for tennis is locating a specific rally in a long video.

Novice users use video annotation tools differently from experts (Section 3.2.1). Our survey data shows that fans are more interested in events like who won a rally, where a ball landed, and how fast a ball was. We can envision some application scenarios of our design for this group of users, such as using it to help create highlight videos of a match or tutorial videos based on matches. The proposed method allows them to quickly identify and understand the key events in a match and choose their desired video segments.
4 THE EVENTANCHOR FRAMEWORK

EventAnchor was developed based on the literature on sports data analysis and what we learned from the interview and survey studies. It has been argued [42] that tasks in video analysis can involve information at different levels, ranging from raw objects (e.g., ball, court, player) at the bottom level to advanced inference or semantic analysis at the top level (e.g., player tactic). The primary tasks shown
in Table 1 actually include tasks at different levels. For example, some tasks like ball position, ball speed, and player position are low-level tasks that concern object recognition, while tasks like serve type, serve effect, and rally tactic are high-level semantic tasks that require domain knowledge to relate various aspects of spatial and temporal information about the ball and players.

Figure 3: The EventAnchor framework. Data at the object level includes objects recognized by computer vision algorithms. The event-level data is obtained through event detection algorithms based on the object-level data. The context-level data is the result of user-machine collaboration, where users apply domain knowledge to select and integrate information from lower levels and the video.

In-depth analysis of these tasks indicates that they are all related to a few key events: ball (shuttlecock)-racket contact and ball-court contact. For example, tasks such as server, serve type, receive type, and player position are all about situations before or after the event of ball-racket contact, and tasks like serve effect and receive effect are related to the ball-court contact event. Other advanced semantic tasks require the integration of information related to a series of such events.

Based on this understanding, we developed a three-level framework, which has an event level in the middle to connect an object level below and a context level above (Fig. 3). At the bottom is the object level. The data at this level is the foundation of the whole framework and includes essential objects recognized by computer vision algorithms from videos, such as the positions of the ball, the players, and the court. Data at this level can be represented as a tuple (objectId, x, y, t), where objectId is the identity of a recognized object, x and y are the coordinates of the object in a video frame, and t is the timestamp of the frame where the object appears.

The center level of the framework is the event level. Data at this level concerns the interactions between essential objects from the object level, such as a stroke, which is the result of the ball contacting a racket, and aggregations of them (e.g., a rally with multiple strokes). Data at this level comes from information at the object level, such as the moving direction of the ball, or from machine learning models that recognize events. The data can be represented as a tuple (eventType, t_start, t_end), where eventType represents the type of event, and t_start and t_end are the timestamps of the start and end of the event, respectively.

Data at the context level summarizes information from the event level, and can include the technical attributes of strokes (e.g., stroke type, spin type) and the tactical style of a player. Retrieving data at this level requires extensive annotation by domain experts because of the required domain knowledge. For example, determining the type of a stroke in table tennis demands the skill to recognize a sequence of micro-actions of the hand and wrist; only analysts with extensive knowledge can make the right call. Similarly, obtaining the contextual information of some events, such as what player tactics a rally is based on, also requires domain knowledge. The data structure at this level can also be a tuple, (contextType, eventId), where contextType represents the type of context information and eventId the identity of the event. We provide a mapping from the analytical tasks to the data levels in Table 1.
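The three tuples map naturally onto simple record types. Below is a minimal sketch in which the field names follow the tuples in the text; the concrete Python types and example values are our own assumptions.

```python
from typing import NamedTuple

class ObjectRecord(NamedTuple):
    """Object level: (objectId, x, y, t)."""
    object_id: str   # e.g., "ball", "player_1", "court"
    x: float         # coordinates of the object in the video frame
    y: float
    t: float         # timestamp of the frame

class EventRecord(NamedTuple):
    """Event level: (eventType, t_start, t_end)."""
    event_type: str  # e.g., "stroke", "bounce", "rally"
    t_start: float
    t_end: float

class ContextRecord(NamedTuple):
    """Context level: (contextType, eventId)."""
    context_type: str  # e.g., "stroke_type", "rally_tactic"
    event_id: int      # the event this annotation attaches to
```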
The event level plays an important role in this framework. Recognized events at this level are the anchors for analytical tasks. Knowing the locations of these events in a video, analysts can examine the images around them, find relevant video segments, and create corresponding annotations. For example, table tennis players prefer to launch an attack as early as possible in a rally, so analysts often want to examine those rallies in which a player launches an attack immediately after the serve and gains a point. To identify such rallies, users can rely on event information to select short rallies as candidates and then apply domain knowledge to determine which rallies are of interest. Because of the essential role of the event level in linking analytical tasks, we call the framework EventAnchor.

5 EVENTANCHOR FOR TABLE TENNIS

Based on our framework, we implemented a system, EventAnchor for Table Tennis (ETT), to support annotation of table tennis videos. We chose table tennis because annotation of table tennis videos is often regarded as one of the most challenging tasks among racket sports. First, we used computer vision models, such as object detection [39, 40], object tracking [4, 16, 52], and pose estimation [7, 46] models, to identify the players, the ball, the court, and relevant trajectories (object level) (Fig. 4A). Second, based on the motion of the ball and the players, as well as their relative positions, we obtained events. The positions and timestamps of the events are used as anchors (event level) (Fig. 4B). For example, a sudden change in the moving direction of the ball implies the event that a player hits
the ball or the ball bounces on the table. Anchors can help the user quickly locate individual events in videos. Third, through visual interaction, the user can add contextual information (context level), such as the technical characteristics of a stroke and the tactical style of a rally, to each anchor, or calibrate the spatial and temporal information of an anchor (Fig. 4C).

Figure 4: Pipeline of EventAnchor for Table Tennis. (A) exhibits the extraction of essential objects by computer vision models, such as score (A1), scene (A2), ball (A3), and player pose (A4); (B) describes the methods to obtain events by estimating the moment of score changing (B1) or scene changing (B2), or identifying the moments of ball hitting and ball bouncing (B3); and (C) shows two interactive tools for calibrating an event (C1) and annotating the event with contextual information (C2).
5.1 Object-Level Data Acquisition

To acquire object-level data from videos, we adopted a series of computer vision models [8, 13, 16, 30]. Video processing had three steps: score detection, scene detection, and ball and pose recognition. First, we used FOTS [30], an optical character recognition model, to process the scoreboard in the video (Fig. 4A1). We sampled 5,000 images from videos and annotated the locations of the digits through crowdsourcing. The retrieved images were separated into a training set (70% of the whole data set) and a test set (30%) for model training and testing. On the test set, FOTS obtained a precision of 92.1% and a recall of 95.4%. Second, we classified the frames according to the scenes (Fig. 4A2). Each frame was pre-processed with a ResNet-50 [13] pre-trained on ImageNet [24] to obtain an embedding vector of length 2,048. Given the embeddings, we conducted binary classification with a support vector machine to obtain the "in-play" frames. Third, to recognize the ball and player postures (Fig. 4A3, A4), we used TrackNet [16], a ball tracking model for tennis and badminton, to extract the ball trajectory. By stacking three consecutive frames as the model input, TrackNet can resolve the problems of noisy objects (e.g., white dots on a billboard or the headband of a player being recognized as the ball), transparent tails, and invisible or severely blurred balls. To apply TrackNet to table tennis, we sampled over 60,000 frames from different videos to annotate ball positions. After training, TrackNet achieved an accuracy of 88.6%. For pose recognition, we used OpenPose [8] trained on the COCO dataset [29].
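To illustrate the second step, the sketch below embeds each frame with an ImageNet-pretrained ResNet-50 and trains a binary SVM to separate "in-play" frames from everything else. It is a plausible reconstruction under stated assumptions (input preprocessing, a linear kernel), not the authors' released code.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# ResNet-50 pre-trained on ImageNet; replacing the final fully
# connected layer with an identity keeps the 2048-d embedding.
resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(frame: np.ndarray) -> np.ndarray:
    """Map an HxWx3 uint8 RGB frame to a 2048-d feature vector."""
    with torch.no_grad():
        x = preprocess(frame).unsqueeze(0)
        return resnet(x).squeeze(0).numpy()

def train_scene_classifier(frames, labels) -> SVC:
    """frames: list of RGB frames; labels: 1 = in-play, 0 = other."""
    features = np.stack([embed(f) for f in frames])
    clf = SVC(kernel="linear")
    clf.fit(features, labels)
    return clf
```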
5.2 Event-Level Anchor Generation

We used the object-level information to obtain anchors at the event level. First, we segmented a video into a set of rallies by detecting the timestamps of score changes (Fig. 4B1). With the scores detected, we adopted the longest-increasing-subsequence algorithm to model score changes and obtain the match structure; the accuracy of rally segmentation is 98.5% on the test set. Second, based on the scene detection results, we derived the start and end frames of each rally (Fig. 4B2). Third, combining the ball trajectory and player poses, we recognized events such as the ball hitting a racket and the ball bouncing on the table (Fig. 4B3). For example, for ball-hitting events, we computed the ball velocity and the distance between the ball and the players' hands. To correctly obtain the poses of the players, we adopted Faster R-CNN for player detection, and filtered and clustered the bounding boxes using k-means to track the players on both sides. When computing the distance between the ball and the players, the hand nodes were sometimes missed by OpenPose because of occlusion. To resolve this problem, we additionally considered the neck nodes, which were never missed by the model during testing. We regarded the ball-hitting time as the moment when the ball velocity changes direction and the distance reaches a local minimum. These potential moments are regarded as anchors, which help to precisely locate events occurring in a long video.
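The hit-detection heuristic can be summarized as: a frame is a candidate anchor when the ball's velocity flips direction while its distance to a player's hand (or neck, under occlusion) reaches a local minimum. A simplified sketch follows; the input format and the distance threshold are illustrative assumptions.

```python
import numpy as np

def detect_hit_anchors(ball_xy, hand_xy, dist_thresh=80.0):
    """ball_xy, hand_xy: (N, 2) arrays of per-frame pixel positions.
    Returns frame indices that are candidate ball-hit anchors."""
    vx = np.diff(ball_xy[:, 0])                  # horizontal velocity
    dist = np.linalg.norm(ball_xy - hand_xy, axis=1)
    anchors = []
    for i in range(1, len(vx)):
        direction_flip = vx[i - 1] * vx[i] < 0   # velocity changes sign
        local_min = dist[i] <= dist[i - 1] and dist[i] <= dist[i + 1]
        if direction_flip and local_min and dist[i] < dist_thresh:
            anchors.append(i)                    # candidate hit moment
    return anchors
```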
5.3 Context-Level Annotation

For the context-level information, we designed a user interface to support the calibration of the temporal and spatial attributes of anchors, and the creation of contextual information on the events according to different analytical goals. The user interface has three major components: the anchors (Fig. 5B, 5D), a calibration box (Fig. 5A), and an annotation box (Fig. 5C). Anchors visually present when and where an event occurs. The calibration box and annotation box support interactive control of anchors and creation of annotations, respectively.
Figure 5: User interface of EventAnchor for Table Tennis. The interface includes three major components: anchors (B, D), a calibration box (A), and an annotation box (C). The figure shows a scenario where a user is correcting the timestamp of the second anchor with the calibration box.

Anchors
An anchor contains the temporal and spatial information of an event. We visualized the spatial attribute (x, y) directly on the video frame and the temporal information (t) on a timeline. Fig. 5 illustrates an anchor on an event where the ball hit the table. The red point on the table (Fig. 5B) shows where the event happened, for example, where the ball bounced on the table. For the temporal information, we used a highlighted mark on a timeline to show where the event is in the video clip (Fig. 5D). Different colors of marks on the timeline indicate different mark types: black marks are those that have not been calibrated, green ones are those that have been calibrated, and the red mark is the one currently being examined.

Calibration Box
Anchors are automatically detected by algorithms and inevitably contain errors. The calibration box is used to calibrate the time of an anchor. The design of the calibration box is inspired by "Hotbox" [25], a menu widget that arranges menu items in a circular manner. We divided the circular box into four functional areas: the left and right areas for correcting the timestamp of an anchor, and the top and bottom areas for adding or removing an anchor (Fig. 5A). When the user clicks the mouse button on a video, the calibration box appears centered at the cursor. To correct a timestamp, the user can hold the mouse button and drag the cursor left or right to move the timestamp backward or forward. If an anchor is useless, the user can delete it by dragging down to the delete area (with a minus symbol). To add a moment as an anchor, the user can invoke the calibration box and drag up to the addition area (with a plus symbol).
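For illustration, the four-way mapping from a drag gesture to a calibration action could look like the sketch below; the function name, pixel units, and frame step are hypothetical and not taken from the system's implementation.

```python
def hotbox_action(dx: float, dy: float, frame_step: int = 1):
    """dx, dy: drag offset in pixels from where the hotbox was opened.
    Returns an (action, amount) pair."""
    if abs(dx) >= abs(dy):
        # left/right areas: move the anchor timestamp backward/forward
        return ("shift_timestamp", frame_step if dx > 0 else -frame_step)
    if dy < 0:
        return ("add_anchor", None)     # top area: add a new anchor
    return ("delete_anchor", None)      # bottom area: remove the anchor
```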
Annotation Box
With the annotation box, the user can interactively create and modify the annotation of an event. The annotation box is also a customized "Hotbox"; the number of functional areas is determined by the number of annotation data types. Fig. 5C illustrates a scenario where the annotation box is used to annotate the tactics of a rally.
6 EVALUATION

We conducted two experiments to evaluate how EventAnchor for Table Tennis (ETT) can assist the annotation of table tennis match videos. The first experiment focused on a task concerning event-level information, and the second on a semantic task at the context level.
6.1 Experiment 1

The experiment used a within-subjects design with two treatments: ETT and a baseline system.
Participants
We recruited 8 participants (male: 6, female: 2). They all played table tennis regularly (at least twice a week) and knew the sport well. We paid each $10 for their participation.
Figure 6: Results of Experiment 1. The bar depicts the mean value and the error bars represent the 95% confidence interval.
Task
In this experiment, we asked the participants to find one of the most frequent events: the ball hitting the table in a rally. They were asked to record when and where the ball hit the table in a given video. We chose the final of the ITTF World Tour 2019 between Ito Mima and Chen Meng. There are 10 rallies in the video, and the length of each rally is between 94 and 175 seconds. The number of target events in each rally ranges from 5 to 9.
Apparatus
ETT and a baseline system were used. ETT presented the anchors on the timeline and visualized the spatial position as a highlighted mark overlaying the video image. When a video was played, it slowed down when approaching an anchor and paused at it. Participants could use the calibration box to adjust an anchor's timestamp. The baseline system had a structure and an appearance similar to ETT, but with some ETT functions disabled, including the automatic slowing down and pausing at event timestamps, the calibration box, and the anchor visualization on the timeline. To ensure its usability, the baseline system had a video playback tool for participants to watch and control the video with a keyboard. When seeing the ball hit the table, participants could click the position where the ball hit. The system recorded the time and location of each mouse click on the video.
Procedure
Participants were required to annotate all ten rallies with both ETT and the baseline system. Half of the participants used ETT first and then the baseline; the other half reversed the order. In each condition, participants went through three steps: training, test, and post-test interview. In the training step, they were introduced to the task and the system, and practiced annotation on five rallies different from those used in the test. They could ask any questions about the task and the user interface. After becoming familiar with the task and the system, participants took the test. They were requested to finish each annotation as fast as they could while ensuring annotation accuracy in time and location. Annotation data was recorded automatically by the system in both conditions. After finishing all tasks, they were interviewed for their feedback on the tasks and systems. The whole experiment lasted 30 minutes, 15 minutes for each condition.
Results
In total, we collected 461 valid annotations in ETT and 466 in the baseline condition. We compared the task time and errors between the two conditions (Fig. 6).
Task Time
Task time in this experiment was computed as the time difference between annotating two consecutive events. The mean times in the two treatments are 6.72 seconds (SD = .40) for ETT and 7.69 seconds (SD = .49) for the baseline (Fig. 6A), respectively. The result of a t-test shows that ETT is significantly more efficient than the baseline system in completing the task (t = …, p = …).

Task Errors
We analyzed two types of errors in annotation: temporal and spatial errors. Temporal error was measured as the difference between the frame a participant annotated and the correct frame, and spatial error was the pixel difference between where a participant clicked and where the ball really hit. The average temporal errors in the two treatments are 0.50 frames (SD = …) for ETT and 0.85 frames (SD = .41) for the baseline (Fig. 6B). A t-test shows the difference is significant (t = …, p = …). The average spatial errors are … pixels (SD = .9) for ETT and 8.18 pixels (SD = .04) for the baseline (Fig. 6C), and no significant difference was found between them (t = …, p = …).

User Feedback
In the post-test interview, all participants except one preferred ETT. They mentioned that the anchors in ETT were very helpful and assisted them in locating the target events more easily, as one participant said: "automatically pausing around the events prevents me from missing the event, because the match is at a fast pace."

Although the user interface of ETT is slightly more complicated than that of the baseline and includes the less common hotbox design, most participants were positive about the user interface in general. Two participants indicated that showing the locations of anchors on a timeline helped to improve annotation efficiency. One participant commented: "this allows me to know in advance how many ball positions need to be labelled, and roughly when to be marked." Some participants were enthusiastic about the hotbox design: three indicated that it was efficient for controlling the video time during annotation.

One participant expressed a concern about annotation errors in ETT. With the baseline system, participants had to check the video frame by frame to find a specific event. With the suggestions
in ETT, however, participants could accept the suggestions from the algorithms without checking whether there was any error in them. This concern is legitimate, considering the possible errors of computer vision algorithms.

Figure 7: Results of Experiment 2. The bar depicts the mean value and the error bars represent the 95% confidence interval.
6.2 Experiment 2

The second experiment also used a within-subjects design with two treatments: ETT and a baseline system. The overall design of this experiment was the same as that of the first experiment.
Participants
We recruited 8 table tennis analysts (male: 5, female: 3) for this experiment. They were all former professional players and had extensive knowledge of the sport. We paid each participant $15 for their participation.
Task
We asked participants to identify high-level tactics occurring in the final of the ITTF World Tour 2019 between Ito Mima and Chen Meng. The task was to find the rallies where Ito used the tactic of "serve and attack" and won the rally. This tactic refers to an approach in which the server launches an attack at the third stroke, immediately after the opponent receives the ball. We chose two games (G1, G2) from the match. Both games contained 24 rallies: one lasted 11 minutes with 2 qualified rallies, and the other 8 minutes with 2 qualified rallies.
Apparatus
We used ETT and a baseline system used by professional table tennis analysts. ETT generated a series of anchors of potential rallies, and participants needed to locate and verify these anchors. They needed to explore all rallies and annotate "serve and attack" with an annotation hotbox. The rule to generate the anchors was that a qualified rally was served and won by Ito and the total number of strokes by the two players was more than 2. The baseline system had a structure and an appearance similar to ETT, but with the annotation box and the anchor visualization on the timeline disabled. Instead, the baseline system provided a confirm button to specify a qualified rally. As in the baseline system of Experiment 1, the basic video-control features of common video players were preserved. To annotate a video, participants needed to manually check all rallies one by one and click the confirm button for qualified rallies.
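The anchor-generation rule is essentially a filter over rallies. A toy illustration, reusing the hypothetical Rally sketch from Section 3.1:

```python
def candidate_rallies(rallies, player="Ito"):
    """Keep rallies served and won by the given player that contain
    more than two strokes, per the rule described above."""
    return [
        r for r in rallies
        if r.serve.player == player   # served by the target player
        and r.winner == player        # and won by the target player
        and len(r.strokes) > 2        # with more than two strokes
    ]
```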
Procedure
Participants were required to identify qualified rallies from two games, G1 with ETT and G2 with the baseline system. We could not use the same game in the two treatments because participants, as professional analysts, could easily remember the results from a previous treatment. The two games were chosen carefully to make sure the task difficulties were comparable. We could not find two games with exactly the same length, so between G1 (11 minutes) and G2 (8 minutes), we chose G1 for ETT and G2 for the baseline to give the baseline an edge. Half of the participants annotated G1 with ETT first and then G2 with the baseline; the other half reversed the order. All participants went through the training, testing, and post-interview steps. Videos used in training differed from those in the test. The experiment lasted about 20 minutes, 10 for each treatment.
Results
In total, we collected 16 annotated results: eight in ETT and eight in the baseline. We analyzed task time and errors in the two conditions (Fig. 7).
Task Time
The task time for annotating rally tactics was computed as the time between the start and the end of verifying all rallies in a game. The average times for completing the tasks in the two conditions are 56.4 seconds (SD = .4) per game for ETT and 144.5 seconds (SD = .7) per game for the baseline (Fig. 7A). The result of a t-test shows a significant difference between the means (t = …, p < …).

Task Errors
We examined the precision and recall of the annotated results. Precision was computed as the ratio of correct annotations to the total submitted annotations, and recall as the ratio of correct annotations to the ground truth. The ground truth was produced by one of the domain experts we interviewed (E1) (Section 3.2.1). The average precision is 0.854 (SD = …) for ETT and … (SD = …) for the baseline; the average recall is … (SD = …).

User Feedback
In the post-test interview, all participants preferred ETT. They liked the way the anchors helped them efficiently locate the potential rallies. They also enjoyed the experience of interacting with anchors, as one participant commented: "the anchors have indicated when Ito will serve and win the rallies, so that I do not have to remember this condition, and just need to focus on the tactic analysis." Another participant added: "with the help of anchors, I can confirm the tactic type of a rally by only watching the first three strokes."
7 DISCUSSION

The results of our evaluation show that the interaction system based on the proposed EventAnchor framework can improve the work of annotating table tennis videos. For ordinary users, who may be interested in important movements in a match, this method can help them identify those events more quickly and achieve slightly better annotation accuracy. For experienced analysts, who care more about complex techniques used in a match, the system can significantly improve the efficiency of their work, with similar task accuracy. These results indicate the reliability of our framework in supporting such annotation activities, the robustness of our computer vision algorithms, and the good usability of our system.

By observing the use of EventAnchor for annotation, we found that our method can help users overcome some barriers they faced under their old practices. First, from the perspective of interaction, EventAnchor allows users to focus more on important analytical tasks by freeing them from repetitive interaction tasks. With their old tools, experts had to interact with the keyboard frequently to locate the timestamps of events before they could analyze and annotate them. With our tool, experts learned that they could trust the pre-computed and filtered timestamps of these events and directly focus on judging and recognizing the events. Second, EventAnchor provides better support for the integration of necessary data with analytic goals. With their old tools, experts usually divide their workflow into two stages, annotation and analysis. The focus of the first stage is on filing video clips and recording such detailed data as stroke type and stroke position; after the completion of such data preparation work, they shift to the second stage and use different types of data for various analytical tasks. After using EventAnchor, experts discovered a new annotation-on-demand approach. For example, in Experiment 2, the candidates for the required rally can be filtered and retrieved quickly with the basic information provided by computer vision models, and the experts can annotate the detailed attributes of the strokes when necessary. Third, our method can help reduce cognitive load and shorten interaction processes. Under their old practices, experts had to annotate stroke attributes at the rally level, because the whole match was clipped into rallies manually in advance. To be efficient, they often tried to memorize the attributes of several consecutive strokes and record the results all at once while watching the rally video. After annotation, they also needed to replay and review the whole rally for validation. Sometimes, missing a stroke could lead to several extra replays to discover and correct the errors. Our tool uses computer vision models to provide them with fine-grained information, so that they can quickly and accurately obtain the required data attributes, such as the timestamps of the strokes. Consequently, they can reserve their valuable cognitive resources for analytical tasks rather than the memorization of supporting data, and potentially avoid mistakes caused by memory errors and the resulting task repetitions.

EventAnchor can support various statistical and decision-making tasks. Here we provide two scenarios based on the tasks seen in our evaluation study. One scenario concerns the use of EventAnchor for accurate statistical analysis by leveraging crowdsourcing.
In table tennis, statistically analyzing the ball positions of a full match requires a dedicated analyst to mark the exact positions of the ball on the table; it usually takes 30 to 40 minutes to complete the task. With the help of our system, this task can be accomplished by distributing individual video segments of ball landing, generated by computer vision algorithms, to crowd workers. Verifying and calibrating a ball position is an ideal crowdsourcing task because of the short time required to do it, about 6.72 seconds, as we learned from our study (Fig. 6A), and the absence of any requirement for domain knowledge. Another scenario is related to quick decision-making that involves domain experts. In real matches, coaches and players often need to adjust their tactics or strategies based on the performance of the opponent. Our system can help them quickly search through videos to find the tactics of the opponent and make the necessary adjustments. As our experiment results show that the average time to discover the rallies with a specific condition is less than a minute (56.4 seconds) for domain experts (Fig. 7A), our system can provide real-time support for coaches and players during a break between rallies. Thus, our system can potentially change the ways people conduct video analysis by reducing the requirement for domain knowledge, or by allowing the use of video data for decision-making in fast-paced situations.

Although the effectiveness of our framework is demonstrated through a system for table tennis annotation, this approach can be applied to video annotation in other racket sports. For racket sports like tennis and badminton, with similar image setups and structures in broadcasting videos, our framework can be directly applied with proper algorithm training. For other racket sports, such as racquetball and squash, more work is needed to refine computer vision algorithms to adapt to the different image structures and player movement patterns in videos, but our framework of anchoring analysis to events can still be used, because many rules of these sports, such as those concerning the ball hitting a racket and the court, are similar to those of tennis, table tennis, and badminton.

There are some limitations in our work. First, our work could be more flexible in the definition of events. Our current definition of events as ball-object contact works well for table tennis analysis, but people may be interested in other events, such as sudden movement changes of a player. One approach to expanding the definition of events is to develop an event syntax that includes essential elements (e.g., ball, player, court, net, racket), their attributes, and the spatial and temporal relationships among them, and then let users interactively define new event types under the syntax.

The second limitation is the insufficient use of audio from videos. Audio information could be used for event recognition by detecting the sound of ball contact (e.g., stroke detection [33]) and, by adding information from another sensory channel, provide users with additional information for annotation and make data annotation more engaging.

Furthermore, we need better mechanisms to motivate users to carefully examine and calibrate the results suggested by algorithms. As shown in Fig. 6C, the spatial error in annotating ball position
with ETT is slightly larger than that with the baseline system, although the difference is not significant. New designs are needed to encourage user engagement with algorithms and computational results.
8 CONCLUSION

This paper proposed EventAnchor, a framework to support data annotation of racket sports videos. The framework uses events recognized by computer vision algorithms as anchors to help users locate, analyze, and annotate objects more efficiently. Based on this framework, we implemented a system for table tennis annotation, and the results of our evaluation study on the system show significant improvement of user performances in simple annotation tasks (e.g., labeling ball position) and in complex tasks requiring domain knowledge (e.g., labeling rallies with specific tactics).

Our method can guide the design of systems for video annotation in other racket sports, such as tennis and badminton. With improvements in algorithms and interaction designs, its application domain can be extended. We will explore designs that allow users to define new event types, so that the system can recognize and process more complex events and support annotation of fast and dynamic videos in other domains. In addition, we will improve the interaction design to make users more engaged with computational results, further strengthening the collaboration between human brain power and machine computation power.
ACKNOWLEDGMENTS

We thank all participants and reviewers for their thoughtful feedback and comments. The work was supported by the National Key R&D Program of China (2018YFB1004300), NSFC (62072400), the Zhejiang Provincial Natural Science Foundation (LR18F020001), and the 100 Talents Program of Zhejiang University. This project was also funded by the Chinese Table Tennis Association.
REFERENCES

[1] 2004. Notational Analysis of Sport: Systems for Better Coaching and Performance in Sport. Journal of Sports Science & Medicine 3, 2 (01 Jun 2004), 104–104.
[2] 2019. ITF Global Tennis Report 2019: A Report on Tennis Participation and Performance Worldwide. http://itf.uberflip.com/i/1169625-itf-global-tennis-report-2019-overview. Accessed: 2020-08-27.
[3] Abir Al Hajri, Matthew Fong, Gregor Miller, and Sidney S Fels. 2014. Fast forward with your VCR: visualizing single-video viewing statistics for navigation and sharing. In Graphics Interface. 123–128.
[4] Alex Bewley, ZongYuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. 2016. Simple Online and Realtime Tracking. CoRR abs/1602.00763 (2016). arXiv:1602.00763 http://arxiv.org/abs/1602.00763
[5] Simone Bianco, Gianluigi Ciocca, Paolo Napoletano, and Raimondo Schettini. 2015. An interactive tool for manual, semi-automatic and automatic video annotation. Computer Vision and Image Understanding 131 (2015), 88–99. https://doi.org/10.1016/j.cviu.2014.06.015
[6] Tewodros A Biresaw, Tahir Nawaz, James Ferryman, and Anthony I Dell. 2016. ViTBAT: Video tracking and behavior annotation tool. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 295–301.
[7] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2018. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CoRR abs/1812.08008 (2018).
[8] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Conference on Computer Vision and Pattern Recognition.
[9] Lung-Ming Chen, Yi-Hsiang Pan, and Yung-Jen Chen. 2009. A Study of Shuttlecock's Trajectory in Badminton. Journal of Sports Science & Medicine (2009).
[10] Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis. 2019. Actions Speak Louder than Goals: Valuing Player Actions in Soccer. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1851–1861.
[11] David Doermann and David Mihalcik. 2000. Tools and techniques for video performance evaluation. In Proceedings 15th International Conference on Pattern Recognition (ICPR-2000), Vol. 4. IEEE, 167–170.
[12] Paul Downward, Bernd Frick, Brad Humphreys, Tim Pawlowski, Jane Ruseski, and Brian Soebbing. 2019. The SAGE Handbook of Sports Economics. https://doi.org/10.4135/9781526470447
[13] K He, X Zhang, S Ren, and J Sun. 2015. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR).
[14] Fabian Caba Heilbron, Joon-Young Lee, Hailin Jin, and Bernard Ghanem. 2018. What do I annotate next? An empirical study of active learning for action localization. In European Conference on Computer Vision. Springer, 212–229.
[15] Keita Higuchi, Ryo Yonetani, and Yoichi Sato. 2017. EgoScanning: Quickly scanning first-person videos with egocentric elastic timelines. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 6536–6546.
[16] Yu-Chuan Huang, I-No Liao, Ching-Hsuan Chen, Tsì-Uí İk, and Wen-Chih Peng. 2019. TrackNet: A Deep Learning Network for Tracking High-speed and Tiny Objects in Sports Applications. In IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 1–8.
[17] M Hughes and P Moore. 2002. Movement analysis of elite level male 'serve and volley' tennis players. Science and Racket Sports II (2002), 254.
[18] Neel Joshi, Wolf Kienzle, Mike Toelle, Matt Uyttendaele, and Michael F Cohen. 2015. Real-time hyperlapse creation via optimal frame selection. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–9.
[19] Alexandre Kaspar, Geneviève Patterson, Changil Kim, Yağız Aksoy, Wojciech Matusik, and Mohamed Elgharib. 2018. Crowd-Guided Ensembles: How Can We Choreograph Crowd Workers for Video Segmentation? In Proceedings of ACM CHI.
[20] Juho Kim, Phu Tran Nguyen, Sarah Weir, Philip J Guo, Robert C Miller, and Krzysztof Z Gajos. 2014. Crowdsourcing step-by-step information extraction to enhance existing how-to videos. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 4017–4026.
[21] Yu Kong and Yun Fu. 2018. Human Action Recognition and Prediction: A Survey. arXiv preprint arXiv:1806.11230 (2018).
[22] Adriana Kovashka, Olga Russakovsky, Li Fei-Fei, and Kristen Grauman. 2016. Crowdsourcing in computer vision. arXiv preprint arXiv:1611.02145 (2016).
[23] Ranjay A Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A Shamma, Li Fei-Fei, and Michael S Bernstein. 2016. Embracing error to enable rapid crowdsourcing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 3167–3179.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[25] Gordon Kurtenbach, George W Fitzmaurice, Russell N Owen, and Thomas Baudel. 1999. The Hotbox: efficient access to a large number of menu-items. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 231–237.
[26] Victor Lavrenko, SL Feng, and Raghavan Manmatha. 2004. Statistical models for automatic video annotation and retrieval. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 3. IEEE, iii–1044.
[27] Adrian Lees. 2003. Science and the major racket sports: a review. Journal of Sports Sciences 21, 9 (2003), 707–732. https://doi.org/10.1080/0264041031000140275 PMID: 14579868.
[28] Hongsen Liao, Li Chen, Yibo Song, and Hao Ming. 2016. Visualization-based active learning for video annotation. IEEE Transactions on Multimedia 18, 11 (2016), 2196–2205.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
[30] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. 2018. FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5676–5685.
[31] Tze Chien Loh and Oleksandr Krasilshchikov. 2015. Competition performance variables differences in elite and U-21 international men singles table tennis players. Journal of Physical Education & Sport 15, 4 (2015).
[32] Nuno Luz, Nuno Silva, and Paulo Novais. 2015. A survey of task-oriented crowdsourcing. Artificial Intelligence Review 44, 2 (2015), 187–213.
[33] Jorge Piazentin Ono, Arvi Gjoka, Justin Salamon, Carlos Dietrich, and Claudio T. Silva. 2019. HistoryTracker: Minimizing Human Interactions in Baseball Game Annotation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, Article 63, 12 pages.
[34] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. 2017. Extreme clicking for efficient object annotation. In Proceedings of the IEEE International Conference on Computer Vision. 4930–4939.
[35] Charles Perin, Romain Vuillemot, and Jean-Daniel Fekete. 2013. Real-Time Crowdsourcing of Detailed Soccer Data. In What's the score? The 1st Workshop on Sports Data Visualization.
[36] Tom Polk, Dominik Jäckle, Johannes Häußler, and Jing Yang. 2019. CourtTime: Generating actionable insights into tennis matches using visual analytics. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 397–406.
[37] Tom Polk, Jing Yang, Yueqi Hu, and Ye Zhao. 2014. TenniVis: Visualization for tennis match analysis. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 2339–2348.
[38] Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, and Li Fei-Fei. 2016. Detecting events and key actors in multi-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3043–3053.
[39] Joseph Redmon, S. Divvala, Ross B. Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[40] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[41] Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 International Conference on Multimedia Retrieval. 279–287.
[42] Huang-Chia Shih. 2017. A survey of content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology 28, 5 (2017), 1212–1231.
[43] Jean Y Song, Stephan J Lemmer, Michael Xieyang Liu, Shiyan Yan, Juho Kim, Jason J Corso, and Walter S Lasecki. 2019. Popup: reconstructing 3D video using particle filtering to aggregate crowd responses. In Proceedings of the 24th International Conference on Intelligent User Interfaces. 558–569.
[44] Anthony Tang and Sebastian Boring. 2012. #EpicPlay: Crowd-sourcing sports video highlights. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1569–1572.
[45] Graham Thomas, Rikke Gade, Thomas B Moeslund, Peter Carr, and Adrian Hilton. 2017. Computer vision for sports: Current applications and research topics. Computer Vision and Image Understanding 159 (2017), 3–18.
[46] Alexander Toshev and Christian Szegedy. 2014. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1653–1660.
[47] Guido Van Oorschot, Marieke Van Erp, and Chris Dijkshoorn. 2012. Automatic Extraction of Soccer Game Events from Twitter. In Proceedings of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web. 21–30.
[48] Carl Vondrick, Donald Patterson, and Deva Ramanan. 2013. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision 101, 1 (2013), 184–204.
[49] In Advances in Neural Information Processing Systems. 28–36.
[50] Fangshi Wang, De Xu, Wei Lu, and Weixin Wu. 2007. Automatic video annotation and retrieval based on Bayesian inference. In International Conference on Multimedia Modeling. Springer, 279–288.
[51] Jiachen Wang, Kejian Zhao, Dazhen Deng, Anqi Cao, Xiao Xie, Zheng Zhou, Hui Zhang, and Yingcai Wu. 2019. Tac-Simur: Tactic-based Simulative Visual Analytics of Table Tennis. IEEE Transactions on Visualization and Computer Graphics (2019).
[52] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple Online and Realtime Tracking with a Deep Association Metric. In IEEE International Conference on Image Processing (ICIP). IEEE, 3645–3649. https://doi.org/10.1109/ICIP.2017.8296962
[53] H Wu, Z Li, Z Tao, W Ding, and J Zhou. 1989. Methods of actual strength evaluation and technical diagnosis in table tennis match. Journal of National Research Institute of Sports Science (1989).
[55] Yingcai Wu, Ji Lan, Xinhuan Shu, Chenyang Ji, Kejian Zhao, Jiachen Wang, and Hui Zhang. 2018. iTTVis: Interactive Visualization of Table Tennis Data. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 709–718.
[56] Yingcai Wu, Xiao Xie, Jiachen Wang, Dazhen Deng, Hongye Liang, Hui Zhang, Shoubin Cheng, and Wei Chen. 2018. ForVizor: Visualizing spatio-temporal team formations in soccer. IEEE Transactions on Visualization and Computer Graphics (2018).
[57] IEEE Transactions on Visualization and Computer Graphics (2020).
[58] Shuainan Ye, Zhutian Chen, Xiangtong Chu, Yifan Wang, Siwei Fu, Lejun Shen, Kun Zhou, and Yingcai Wu. 2020. ShuttleSpace: Exploring and analyzing movement trajectory in immersive visualization. IEEE Transactions on Visualization and Computer Graphics (2020).
[59] Jenny Yuen, Bryan Russell, Ce Liu, and Antonio Torralba. 2009. LabelMe video: Building a video database with human annotations. In IEEE International Conference on Computer Vision. IEEE, 1451–1458.
[60] Man-Ching Yuen, Irwin King, and Kwong-Sak Leung. 2011. A survey of crowdsourcing systems. In IEEE International Conference on Privacy, Security, Risk and Trust and IEEE International Conference on Social Computing. IEEE, 766–773.
[61] Hui Zhang, Wei Liu, Jin-ju Hu, and Rui-zhi Liu. 2013. Evaluation of elite table tennis players' technique effectiveness. Journal of Sports Sciences 31, 14 (2013), 1526–1534.
[62] Hui Zhang, Zheng Zhou, and Qing Yang. 2018. Match analyses of table tennis in China: a systematic review. Journal of Sports Sciences 36, 23 (2018), 2663–2674.
[63] Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic, and Pablo Cesar. 2020. RCEA: Real-time, Continuous Emotion Annotation for Collecting Precise Mobile Video Ground Truth Labels. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–15.
[64] Guangyu Zhu, Changsheng Xu, Qingming Huang, and Wen Gao. 2006. Action recognition in broadcast tennis video. In 18th International Conference on Pattern Recognition (ICPR'06). IEEE.