Multi-Modal Trajectory Prediction of NBA Players
Sandro Hauri, Nemanja Djuric, Vladan Radosavljevic, Slobodan Vucetic
Temple University, Uber Advanced Technology Group, Spotify
[email protected], [email protected], [email protected], [email protected]
Abstract
National Basketball Association (NBA) players are highly motivated and skilled experts that solve complex decision-making problems at every time point during a game. As a step towards understanding how players make their decisions, we focus on their movement trajectories during games. We propose a method that captures the multi-modal behavior of players, where they might consider multiple trajectories and select the most advantageous one. The method is built on an LSTM-based architecture predicting multiple trajectories and their probabilities, trained by a multi-modal loss function that updates the best trajectories. Experiments on large, fine-grained NBA tracking data show that the proposed method outperforms the state-of-the-art. In addition, the results indicate that the approach generates more realistic trajectories and that it can learn individual playing styles of specific players.
1. Introduction
In recent years, advances in artificial intelligence and computer vision started revolutionizing how athletic performance and results are being analyzed and understood, which includes the use of fine-grained player tracking data during sporting events. In our research we are developing new methods aimed at deeper understanding of the behavior of athletes in team sports, with particular focus on their motion prediction. This is a particularly important task in invasion sports, such as soccer or basketball, where knowledge of how and where the players will move, especially when it comes to those from the opposing team, is of critical importance for gaining tactical advantage during a game [19]. Beyond this use case, the benefits of accurate motion prediction extend to other applications, such as postgame analysis [11] or improving TV broadcasting of games by optimizing camera movement [4, 16]. Prediction of human trajectories can also be used to improve tracking accuracy [17], and has recently become a vibrant topic of research in the computer vision community [1, 8, 13].

Using mathematics, statistics, and artificial intelligence to analyze sports performance is, however, not a novel idea. It has been famously explored in baseball [22] and applied with great success to soccer [18], with authors uncovering useful patterns in the sports data that can be and have been used to move the needle in this highly competitive field. Today, elite teams from across the globe, such as the Golden State Warriors, New York Yankees, and Manchester United, have analytics departments focusing on deriving knowledge from the large amounts of data these teams generate. Beyond the sports professionals, it is interesting that even the general public is becoming more accepting of these complex statistical tools, as exemplified by the introduction of the concept of expected goals [25] in some postgame summaries in the Premier League, the English top soccer division.
This trend is also exemplified by a number of research publications, as well as high-profile conferences and workshops organized on the topic, such as MIT Sloan SAC or KDD Sports Analytics [3]. These are attended by both the scientific community on one side and world-class athletes and management of professional sports teams on the other, indicating the value and benefits that artificial intelligence is bringing to this multi-billion dollar industry.

In this paper we focus on movement prediction of NBA players during offensive possessions. Players at any moment have freedom to consider several options for their movement. Potential trajectories depend on the state of a possession, which includes positions and current trajectories of all the players and the ball, as well as on individual preferences of the players. To predict the trajectories, we propose an uncertainty-aware, multi-modal deep learning model. The model is trained to predict multiple trajectories of a player and probabilities that a given trajectory will be selected. Figure 1f shows an example of such trajectories and the associated probabilities, compared to baseline models. We provide an in-depth discussion of Figure 1 in the Results section, and evaluate the proposed method using fine-scale player tracking data collected during several months of an NBA season. In addition, we showcase that with our proposed training regime, the model has the ability of recreating distinct playing styles of individual players.

Figure 1: Visualization of predicted trajectories with H = 40 using several state-of-the-art methods: (a) location-LSTM; (b) CNN; (c) MBT_1; (d) MACRO VRNN; (e) SocialGAN; (f) MBT_{4,l} (ours); red: attackers, blue: defenders, orange: ball, grey: input history of predicted player, yellow: prediction, green: ground truth. A video animation is included in the Supplementary Material.
2. Related Work
Modeling and predicting human trajectories is an important challenge in a number of scientific areas. Researchers have worked on this problem to develop realistic crowd simulations [23], or to improve vehicle collision avoidance systems [15] through predicting future pedestrian movement. When it comes to traffic applications, pedestrian behavior was usually modeled using attracting and repulsive forces that guide pedestrians towards a goal while simultaneously avoiding obstacles. Human pedestrian prediction was also used to improve accuracy of tracking systems [6, 24, 30] or to study intentions of individuals or groups of people [5, 20, 29]. The advances in deep learning led to data-driven methods, such as Long Short-Term Memory (LSTM) networks [14] with shared hidden states [1], multi-modal Generative Adversarial Networks (GANs) [12], or inverse reinforcement learning [17], outperforming the traditional methods. The work by [12] is particularly related to our study, through its use of a multi-modal loss function and by showing practical benefits of multi-modal trajectory prediction as compared to single-trajectory predictions. Beyond pedestrian movement, recent research on predictive modeling of vehicular trajectories for self-driving car applications also contains ideas of relevance for the current study. In particular, [7] showed that multi-modal trajectory predictions for vehicles generate realistic real-world traffic trajectories. The multi-modal loss function in our approach is inspired by this work, where we adapt ideas from the self-driving domain to modeling of movement of basketball players.

The ubiquitous use of tracking systems in professional sports leagues like the NBA or the English Premier League inspired researchers to analyze and model trajectories of athletes during matches. In ECCV 2018, [8] used Variational Autoencoders (VAEs) to model real-world basketball data and showed for NBA data that the offensive player trajectories are less predictable than the defense.
[21] and [27] used LSTMs to predict near-optimal defensive positions for soccer and basketball, respectively. [28] similarly used variants of VAEs to generate trajectories for NBA players. NBA player trajectory predictions are also studied by [31] and [32], where a deep generative model based on VAE and LSTM and trained with weak supervision was proposed to predict trajectories for an entire team. Macro-intents for each player were inferred, where the players target a spot on the court they want to move to. The authors evaluate the model mostly by human expert preference studies and show they can outperform the baselines, indicating that RNNs can capture information from observational data in sports. However, their trajectories are usually not smooth and no restrictions are set on the position of a player on consecutive time steps, such that the model may output physically unrealistic trajectories. We consider this state-of-the-art approach in our experiments, and show that it is outperformed by the proposed multi-modal method.
3. Methodology
Recent advancements in optical tracking have made it possible to track the players and the ball during an NBA game with good enough accuracy and temporal resolution to recreate the trajectories of all ten players and the ball during an entire basketball game. This allows us to extract the 2-D location ℓ_t^p = [x_t^p, y_t^p] of player p at time step t, with p ∈ {1, ..., 10}, as well as the 2-D location of the ball at time t, ℓ_t^b = [x_t^b, y_t^b], where the x-coordinate represents the length of the field and the y-coordinate represents the width, with the origin at the upper left corner (see Figure 1 for illustration). Using an ordered sequence of the previous L + 1 time steps we can generate the historical trajectory of the p-th player as h_t^p = [ℓ_{t−L}^p, ..., ℓ_t^p], where time steps are equally spaced at an interval of Δt. Similarly, we can generate a historical trajectory of the ball as h_t^b = [ℓ_{t−L}^b, ..., ℓ_t^b]. As a convention, we will assume that the first 5 players represent the team on the offense and the last 5 players the team on the defense. We are interested in predicting the future trajectory of the p-th offensive player, represented as a vector τ_t^p = [ℓ_{t+1}^p, ..., ℓ_{t+H}^p], where H is the number of future time steps (or horizon) for which we predict the trajectory. We will assume that the player of interest (i.e., the offensive player for which we are predicting the future trajectory) is denoted by player index P.

In this paper, we processed the raw tracking data to create a labeled data set D = {(u_t^P, τ_t^P), t = 1, ..., T, P = 1, ..., 5}, where one data point is defined for each time step and each offensive player (as indicated by the range P = 1, ..., 5).
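To make the data-set construction concrete, assembling one input u_t^P can be sketched as below. The input (defined precisely in the next paragraph) stacks the history of the player of interest first, then the other nine players with teammates before opponents, then the ball history and the shot clock. This is a NumPy sketch under our own assumptions about array layout and the distance-based ordering, not the authors' released code:

```python
import numpy as np

def build_input(locations, ball, shot_clock, t, P, L=10):
    """Assemble u_t^P as an (L+1, 23)-step time series: per step, the 2-D
    locations of the ordered players (player of interest first), the ball,
    and the shot clock.

    locations : (T, 10, 2) player coordinates, players 0-4 on offense
    ball      : (T, 2) ball coordinates
    shot_clock: (T,) seconds left on the shot clock
    """
    hist = locations[t - L : t + 1]                          # (L+1, 10, 2)
    # distance of every other player to the player of interest at time t
    d = {p: np.linalg.norm(locations[t, p] - locations[t, P])
         for p in range(10) if p != P}
    teammates = sorted((p for p in d if p < 5), key=d.get)   # offense w/o P
    opponents = sorted((p for p in d if p >= 5), key=d.get)  # defense
    order = [P] + teammates + opponents                      # P comes first
    players = hist[:, order].reshape(L + 1, 20)              # ordered coords
    ball_h = ball[t - L : t + 1]                             # (L+1, 2)
    clock = shot_clock[t - L : t + 1, None]                  # (L+1, 1)
    return np.concatenate([players, ball_h, clock], axis=1)  # (L+1, 23)
```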
Here T is the total number of time steps, and the input vector u_t^P = {h_t^P, h_t^{−P}, h_t^b, s_t} is a set of historical player and ball trajectories, where h_t^P indicates the history of the player of interest, h_t^{−P} indicates the histories of all other 9 players, and s_t is the shot clock, defined as the time in seconds remaining until the shot clock expires. Note that in the input vector the history of the player of interest P always comes first, followed by histories of their teammates and then by opposing players, ordered by distance to the player of interest. The output vector τ_t^P is a future trajectory of the player of interest P computed at time step t, and the objective is to build a predictor that accurately predicts their trajectory given inputs u_t^P. We emphasize that, in addition to the given inputs, there are other features that potentially might influence the observed trajectories, such as game clock, home vs. away, foul calling, previous plays, or player mismatch. As we demonstrate with the shot clock feature, our approach allows for a straightforward use of any additional feature that a modeller may deem important. However, an in-depth feature analysis is out of scope of this paper, and instead we focus on showing viability of the proposed multi-modal predictive model. In fact, it could be argued that a number of such features are implicitly present in the input representation already. For example, if a team has a large point lead with little game time remaining, they may slow down on the offense and the observed movement history could capture that information.

Lastly, note that an alternative to predicting a sequence of H future locations of the offensive player is predicting a sequence of their velocities. As we know the current location at time t, we can convert trajectory τ_t^P to a velocity vector ν_t^P = [v_{t+1}^P, ..., v_{t+H}^P] using a direct mapping between locations and velocities, computed for horizon h ∈ {1, . . .
, H} as

v_{t+h}^P = [v_{x,t+h}^P, v_{y,t+h}^P] = [(x_{t+h}^P − x_{t+h−1}^P) / Δt, (y_{t+h}^P − y_{t+h−1}^P) / Δt].   (1)

Although trajectories and velocity vectors are mathematically interchangeable, a particular choice might have a significant impact on model training. As we will demonstrate experimentally, predicting the next location is more challenging due to issues with the normalization of coordinates.

As noted previously [31], movement of basketball players is inherently multi-modal, as the players can decide between multiple plausible trajectories at any given time (e.g., to move towards the basket for a layup or towards a corner for a three-point attempt). In order to account for this multi-modality we train a predictive model that generates output o_t^P = [ν̂_{t,1}^P, ..., ν̂_{t,M}^P, p̂_{t,1}^P, ..., p̂_{t,M}^P], which consists of M predicted trajectories ν̂_{t,m}^P representing M modes, as well as M scalars p̂_{t,m}^P representing probabilities that the corresponding mode is selected by the player. This results in (2H + 1)M output values, since the output for each mode consists of a trajectory comprising H two-dimensional velocities, plus one probability.

3.1. Loss function

Given a ground-truth trajectory ν and a predicted trajectory ν̂, we first define the trajectory loss as

L_MSE(ν, ν̂) = (1 / 2H) ‖ν − ν̂‖²,   (2)

a mean squared error (MSE) of the predicted velocity vector. Then, in order to train a model to predict multiple trajectories and their probabilities, we base our approach on an adaptation of the multi-modal loss function presented in [7]. A similar loss function is used by [12] to generate multi-modal pedestrian trajectories within a GAN framework.
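The mapping of equation (1), and its inverse, can be sketched as follows (a sketch assuming Δt = 0.12 s, not the authors' code):

```python
import numpy as np

DT = 0.12  # seconds between consecutive (downsampled) time steps

def to_velocities(traj, loc_t):
    """Eq. (1): map future locations [l_{t+1}, ..., l_{t+H}] to velocities."""
    locs = np.vstack([loc_t, traj])      # prepend the known current location
    return np.diff(locs, axis=0) / DT    # (H, 2) velocity vector nu_t^P

def to_locations(vel, loc_t):
    """Inverse mapping: integrate predicted velocities back to locations."""
    return loc_t + np.cumsum(vel * DT, axis=0)
```

The two functions are exact inverses of each other, which is why the paper can train on velocities and still report location-based error metrics.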
In particular, we define the Multiple-Trajectory Prediction (MTP) loss for time step t and player P, comprising a linear combination of the classification loss −log p̂_m and the trajectory loss (2),

L_MTP = Σ_{m=1}^{M} δ_ε(m = m*) (−log p̂_m + α L_MSE(ν_t^P, ν̂_{t,m}^P)),   (3)

where p̂_m is an output of a softmax, α is a hyper-parameter used to trade off the classification and trajectory losses, and m* is the index of the winning mode that produced the trajectory closest to the ground truth, computed according to a distance function dist() defined in the next subsection,

m* = argmin_{m ∈ {1,...,M}} dist(ν_t^P, ν̂_{t,m}^P).   (4)

Moreover, δ_ε is a relaxed Kronecker delta [26] giving the most weight to the best matching trajectory, but also a small weight to the remaining ones,

δ_ε(cond) = 1 − ε if condition cond is true, and ε / (M − 1) otherwise.   (5)

Intuitively, the classification loss in (3) forces the probability of the winning mode towards 1 (thus pushing probabilities of other modes towards zero due to the softmax), and the trajectory loss penalizes the prediction error of the winning mode. We note that [7] used the unrelaxed Kronecker delta (i.e., ε was set to 0), which only updates the closest trajectory. In practice, this leads to problems when a randomly initialized path is much worse than the remaining paths. Such poorly initialized modes never get selected through (4) and do not get a chance to improve during training. To prevent this issue we use the relaxed Kronecker delta, where we start from some small value of ε that is gradually reduced towards 0 as the training progresses. This phenomenon is well known in generative models and is commonly called mode collapse in GANs or posterior collapse in VAEs.
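A compact PyTorch sketch of the MTP loss (3)-(5), using the final-displacement distance of equation (7) to pick the winning mode; tensor shapes and names are our illustrative assumptions, not the authors' implementation:

```python
import torch

def mtp_loss(pred_traj, logits, gt_traj, alpha=1.0, eps=0.1):
    """MTP loss of Eq. (3) with the relaxed Kronecker delta of Eq. (5).

    pred_traj: (B, M, H, 2) predicted velocity modes (M > 1)
    logits:    (B, M) pre-softmax mode scores
    gt_traj:   (B, H, 2) ground-truth velocities
    """
    B, M, H, _ = pred_traj.shape
    log_p = torch.log_softmax(logits, dim=1)         # log p-hat_m
    diff = pred_traj - gt_traj.unsqueeze(1)          # (B, M, H, 2)
    # dist_l of Eq. (7): norm of the summed velocity error, i.e. the
    # displacement error at the final time step (up to a factor of dt)
    dist = diff.sum(dim=2).norm(dim=-1)              # (B, M)
    m_star = dist.argmin(dim=1)                      # Eq. (4)
    # relaxed Kronecker delta: 1 - eps on the winner, eps/(M-1) elsewhere
    delta = torch.full((B, M), eps / (M - 1))
    delta.scatter_(1, m_star.unsqueeze(1), 1.0 - eps)
    mse = (diff ** 2).sum(dim=(2, 3)) / (2 * H)      # Eq. (2)
    per_mode = -log_p + alpha * mse                  # Eq. (3) summand
    return (delta * per_mode).sum(dim=1).mean()
```

Annealing ε towards 0 during training recovers the unrelaxed delta of [7] once all modes have received enough gradient signal.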
Comparable annealing remedies have been proposed for VAEs [2], but are generally not sufficient to achieve good performance [10]. Our approach was more stable than VAE or GAN training, and we will empirically show that we can outperform state-of-the-art models based on each of those two methods.

As mentioned above, m* denotes the path closest to the ground truth; however, there are different closeness measures that can be considered. For example, in [12] the closest mode is defined simply as the path with the lowest trajectory loss, computed as

dist_MSE(ν, ν̂_m) = L_MSE(ν, ν̂_m).   (6)

We also considered other distance functions, as [7] concluded that their choice has a large impact on the model performance. Thus, we considered a distance function with the smallest overall displacement error, defined as the location error at the last time step and computed as

dist_l(ν, ν̂_m) = ‖ Σ_{h=1}^{H} (ν_{t+h} − ν̂_{t+h,m}) ‖.   (7)

Lastly, we considered using the error of the final player velocity (which can be interpreted as the player's "heading"), shown in earlier work [7] to be beneficial,

dist_v(ν_t, ν̂_{t,m}) = ‖ ν_{t+H} − ν̂_{t+H,m} ‖.   (8)

While [7] use the multi-modal loss function to train a CNN model, we will show that on the NBA data an LSTM network is more effective. We use a two-layer LSTM architecture, each layer with a width of 128, to encode the time-series input of recently observed data u_t^P.
The decoder is a fully connected layer, and the prediction consists of M trajectories of a single player given as x- and y-velocities for H future time steps, as well as M probabilities that the player will follow the respective trajectory.

It is important to note that the players differ in their positions, skills, heights, and weights, and we would expect them to run at different speeds and along different paths. To take these differences into account we consider a two-stage training approach to learn specific per-player models. To this end we first train the proposed model on data taken from all players, which can be seen as learning the average behavior of all NBA players. Then, in the second training phase this pre-trained network is used to initialize a specialized per-player network fine-tuned on a subset containing only that player's data, so that the individual behavior of the player can be learned. In the experiments we evaluate both global and per-player models.

We refer to the proposed multi-modal approach as Multi-modal Basketball Trajectories (MBT). We evaluate different numbers of modes M and investigate different distance functions in (4), indicating these choices in the subscript. In particular, we denote model variants as MBT_{M,d}, with d ∈ {MSE, l, v}, corresponding to (6), (7), and (8), respectively. For example, MBT_{4,l} generates 4 paths and uses distance function (7) during training. When using a single mode the distance measure is not used, and we refer to the uni-modal model as MBT_1.
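The described network, two LSTM layers of width 128 followed by a fully connected decoder emitting (2H + 1)M values, can be sketched in PyTorch as follows (the per-step input feature size and the output layout are our assumptions):

```python
import torch
import torch.nn as nn

class MBT(nn.Module):
    """Two-layer LSTM encoder + fully connected decoder producing M modes."""

    def __init__(self, in_dim, H=40, M=4, hidden=128):
        super().__init__()
        self.H, self.M = H, M
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.Linear(hidden, (2 * H + 1) * M)  # (2H + 1)M outputs

    def forward(self, x):                      # x: (B, L + 1, in_dim)
        out, _ = self.lstm(x)
        out = self.decoder(out[:, -1])         # decode the last hidden state
        split = 2 * self.H * self.M
        traj = out[:, :split].view(-1, self.M, self.H, 2)  # velocity modes
        probs = torch.softmax(out[:, split:], dim=1)       # mode probabilities
        return traj, probs
```

Fine-tuning a per-player model then amounts to loading the base model's weights and continuing training on that player's possessions only.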
4. Experiments
We used publicly available movement data collected from 632 NBA games during the 2015-2016 season, from which we extracted 114,294 offensive possessions. An offensive possession starts when all players from one team cross into the opponents' half court, and ends when the first offensive player leaves the half court or the game clock is paused. Very short possessions were discarded, resulting in 113,760 possessions. This amounts to 1.1 million seconds of gameplay where player location is captured every 0.04 s. We downsampled the data by a factor of 3 to obtain a sampling rate of Δt = 0.12 s, corresponding to a lower bound on human reaction time [9] during which velocity is considered constant. Furthermore, we randomly split the data into train and test sets using a 90/10 split. All inputs and outputs were normalized to the [−1, 1] range. To train the specialized networks that predict a specific player's movement we extracted possessions featuring that player. The amount of data for each player is in the order of several thousands of possessions (e.g., for Stephen Curry there were 2,767 possessions).

As discussed previously, we used a 2-layer LSTM with 128 channels in each layer. To learn the general model for all NBA players we trained the LSTM in batches of 1,024 samples using the Adam optimizer. We set the hyper-parameter α in equation (3) such that the amplitudes of the two losses are about equal, and initialized ε in (5) to a small value that was reduced by a constant factor per epoch until ε = 0. We used ℓ2 regularization and an early stopping mechanism to further prevent overfitting. To specialize the neural network for a specific player we fine-tune the base model on data from that player. We again start with a small ε, which is reduced by a factor of 0.01 per epoch to make sure that all modes benefit from the information contained in this smaller training set.
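The downsampling and normalization steps described above can be sketched as follows (the court dimensions of 94 x 50 feet and the exact normalization convention are our assumptions):

```python
import numpy as np

COURT = np.array([94.0, 50.0])  # NBA court length and width in feet (assumed)

def preprocess(locations):
    """Downsample 25 Hz tracking data by a factor of 3 (dt = 0.12 s)
    and map coordinates from feet to the [-1, 1] range."""
    locations = locations[::3]              # keep every third frame
    return locations / COURT * 2.0 - 1.0    # [0, court] -> [-1, 1]
```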
The initial learning rate for fine-tuning was reduced relative to that of the base model. All training was done on a single computer with an Nvidia GeForce GTX 1080 card. It took approximately 60 minutes to train the base model, while specializing the network for a specific player took less than 5 minutes.

(Data: https://github.com/sealneaward/nba-movement-data, last accessed June 2020; we are not associated with the data creator in any way.)

We report common measures used in pedestrian trajectory prediction, final displacement error (FDE) and average displacement error (ADE) [1, 12], defined as

FDE = (1 / 5T) Σ_{t=1}^{T} Σ_{P=1}^{5} ‖ ℓ_{t+H}^P − ℓ̂_{t+H}^P ‖,
ADE = (1 / 5HT) Σ_{t=1}^{T} Σ_{P=1}^{5} Σ_{h=1}^{H} ‖ ℓ_{t+h}^P − ℓ̂_{t+h}^P ‖.   (9)

In other words, FDE considers the location error at the end of the prediction horizon H, while ADE averages location errors over the entire trajectory. We also report the MSE error, defined as in equation (2). Unlike FDE and ADE, which measure trajectory prediction errors, MSE is a measure of how accurately the velocities are predicted. To evaluate multi-modal approaches we choose the path that has the smallest FDE among all the generated paths, which is consistent with the evaluation procedure used in the literature [12, 26].

To establish an upper bound for the proposed error measures we compared our method to two straw-man baselines.
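The displacement measures in (9), and the best-of-M mode selection used for multi-modal evaluation, reduce to a few lines of NumPy (a sketch; gt and pred hold absolute locations in feet):

```python
import numpy as np

def ade_fde(gt, pred):
    """Average and final displacement errors of Eq. (9).
    gt, pred: (N, H, 2) ground-truth and predicted future locations."""
    err = np.linalg.norm(gt - pred, axis=-1)  # (N, H) per-step location errors
    return err.mean(), err[:, -1].mean()

def best_mode(gt, pred_modes):
    """For multi-modal output (N, M, H, 2), pick per sample the mode with
    the smallest FDE, matching the evaluation protocol described above."""
    fde = np.linalg.norm(gt[:, None, -1] - pred_modes[:, :, -1], axis=-1)
    idx = fde.argmin(axis=1)[:, None, None, None]       # (N, 1, 1, 1)
    return np.take_along_axis(pred_modes, idx, axis=1).squeeze(1)
```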
The constant location (CL) baseline assumes that a player stays at the last observed location on the court, while the constant velocity (CV) baseline assumes that the player keeps moving in the last observed direction with constant speed.

The CNN baseline refers to an approach that transforms the input to a rasterized trace image and uses a CNN encoder (instead of an LSTM) before predicting the future velocities [7]. For the encoder, we used 5 layers with depths [64, 128, 128, 64, 32], 5x5 masks, "same" padding, and 2x2 max pooling. The decoder consisted of 2 densely connected layers of sizes 128 and 64.

To compare different output alternatives we trained the same LSTM architecture used for our model to directly predict player locations, as opposed to predicting velocities. We refer to this model as the location-LSTM.

We also considered SocialGAN [12], the state-of-the-art in human trajectory prediction. This approach uses an LSTM-based generator, coupled with a social pooling layer to account for nearby actors. We trained this model using code made available by the original authors (https://github.com/agrimgupta92/sgan, last accessed June 2020), using the same NBA data set, except that SocialGAN cannot use extra information such as ball location or shot clock, and therefore only the player trajectories are used. GANs are notoriously hard to train, which resulted in a training time of 28 hours for 50 epochs of training.

In addition, we considered the state-of-the-art MACRO VRNN [31], which uses programmatic weak supervision to first predict a location that the player wants to reach and then uses a Variational RNN (VRNN) to predict a trajectory that the player will take to reach it. MACRO VRNN also accounts for the multi-modality of the problem, with the number of generated paths denoted in the subscript. We used models provided in [31] trained on roughly the same amount of data. Note that its training takes up to 20 hours, as opposed to only 1 hour for our proposed method.

Table 1: Comparison of various models, input steps L, and modes M in terms of error metrics ADE and FDE (in feet) and MSE (in ft²/s²), for horizons H = 10, 20, and 40.

                           H = 10                H = 20                H = 40
Method          L   M    ADE   FDE   MSE       ADE   FDE    MSE       ADE    FDE    MSE
CL              1   1    3.19  5.69  39.56     5.47  9.78   38.46     9.03   14.85  37.34
CV              1   1    1.72  3.92  9.09      4.64  10.97  16.01     11.59  26.14  20.59
CNN             10  1    2.76  5.25  15.80     5.28  9.99   17.48     8.15   13.23  21.95
location-LSTM   10  1    1.61  2.98  10.21     3.43  6.91   15.94     6.79   12.11  29.80
MBT_1           10  1    1.43  2.98  7.26      3.32  6.92   12.36     6.59   11.97  16.93
MBT_1           20  1    1.40  2.93  7.25      3.30  6.91   12.41     6.59   11.97  16.74
MBT_1           30  1    1.39  2.92  7.46      3.33  6.91   12.32     6.58   11.92  16.87
SocialGAN_1     10  1    1.25  2.75  8.18      3.09  6.67   13.32     6.47   12.35  17.54
MACRO VRNN_1    10  1    1.70  3.43  13.17     4.46  8.66   19.85     8.48   14.98  25.03
SocialGAN_4     10  4    1.19  2.61  7.36      2.95  6.33   11.91     6.19   11.54  15.76
MACRO VRNN_4    10  4    1.07  1.98  5.90      3.14  5.07   11.93     6.40   8.54   19.29
MBT_{4,MSE}     10  4    …     …     …         …     …      …         …      …      …
MBT_{4,v}       10  4    1.05  1.93  4.00      2.66  4.31   7.75      6.71   8.74   14.72
MBT_{4,l}       10  4    …     …     …         …     …      …         …      …      …
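The two straw-man baselines described above (CL and CV) are trivial to implement; a sketch, with dt = 0.12 s assumed:

```python
import numpy as np

def constant_location(history, H):
    """CL baseline: repeat the last observed location for H future steps."""
    return np.repeat(history[-1][None, :], H, axis=0)

def constant_velocity(history, H, dt=0.12):
    """CV baseline: extrapolate the last observed velocity for H steps."""
    v = (history[-1] - history[-2]) / dt          # last observed velocity
    steps = np.arange(1, H + 1)[:, None] * dt     # elapsed time per step
    return history[-1] + steps * v
```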
We first compared performance of models trained on data containing all possessions, with results across different error measures and time horizons presented in Table 1. Model CL predicts that the player will remain static at the last observed location, which explains the large MSE. CL also has a relatively large FDE for shorter horizons, but does not deteriorate as fast as the CV model, which assumes the player will keep moving with the last observed velocity. The CNN model outperformed the simple baselines at longer horizons; however, at short horizons its performance was suboptimal. Location-LSTM (which predicts the player's locations instead of velocities as the competing methods do) is comparable to the MBT_1 model in terms of ADE and FDE metrics, but has a much worse MSE metric. As we will demonstrate later in qualitative results, this difference in MSE can be explained by the fact that location-LSTM produces trajectories that are not physically achievable by the players.

Next we experimented with the uni-modal MBT_1 model and evaluated the influence of different lengths of historical inputs L. Based on the results we confirmed that the MBT models only marginally improve with longer input sequences. As a result, in the remainder of the experiments we use a value of L = 10, consistent with [31].

In the following experiment we compared different distance functions used for training MBT methods, where we kept M fixed at 4. We see that the choice of distance function had limited effect on accuracy measures at the shorter horizon of 1.2 s. However, as the horizon increased, MBT_{4,l} started outperforming the competing approaches by a considerable margin.
Taking this result into account, in further experiments we used the distance function defined in (7). When we compare the proposed method to the state-of-the-art models MACRO VRNN and SocialGAN, we separate the analysis by comparing the same number of modes. When evaluating a single trajectory, SocialGAN outperforms both our approach and MACRO VRNN in ADE and FDE. However, MBT_1 reaches a better MSE than those approaches. When comparing multiple modes, we can see that the performance of MBT_{4,l}, MBT_{4,v}, and MBT_{4,MSE} is roughly comparable at shorter horizons, but MBT_{4,l} outperforms all other methods across all accuracy measures at longer horizons. Quite notably, MBT_{4,l} outperforms the baselines by a large margin in terms of the MSE velocity measure. For example, for horizon H = 40, our MBT_{4,l} model achieves a smaller ADE than both MACRO VRNN_4 and SocialGAN_4.

In Figure 1, we illustrate predicted trajectories for a randomly picked player. Trajectories were generated using a single-path model that predicts locations (location-LSTM, Figure 1a), two single-path models that predict velocities, one based on a CNN architecture (Figure 1b) and one based on an LSTM architecture (MBT_1, Figure 1c), one sampled path of MACRO VRNN (Figure 1d), 4 sampled paths of SocialGAN (Figure 1e), and our proposed method with 4 modes, MBT_{4,l} (Figure 1f). We can see that the location-LSTM output is noisy and does not represent realistic player movements. Player trajectories predicted by the CNN and MBT models are smoother and more realistic, showing the advantage of predicting velocities instead of locations. While CNN and MBT_1 generate qualitatively similar results, MBT_1 outperforms CNN in the quantitative measures. MACRO VRNN generally produces paths that are less smooth than competing models, explaining its high MSE error as discussed above.

Figure 2: Evaluation of predicted mode probabilities for MBT_{4,l}.
The multiple paths predicted by SocialGAN are smooth and look plausible, but lack the diversity of movement that we would expect from basketball trajectories. MBT_{4,l} predicts 4 paths that are very distinct from each other. The highest-probability path ends up very close to the observed final player location, while accurately following the ground-truth trajectory. Other paths produced by the multi-modal model allow for diverse movements, such as an aggressive drive to the basket or supporting the ball-handling teammate near the center of the court.

We also evaluated the quality of the inferred mode probabilities produced by the MBT_{4,l} model. To this end we compared predicted mode probabilities to empirical ones, computed as the frequency of how often a mode of certain probability had the lowest FDE. We bucketed inferred probabilities in 5% bins and for each bin computed the empirical probability, with the average per-bucket results presented in Figure 2. We can see that the plot closely follows the identity line, indicating that the predicted mode probabilities are well calibrated.

To evaluate the hypothesis that the MBT trajectories are more physically realistic, we calculated the acceleration of predicted trajectories on the test set. The maximum acceleration produced by MBT_{4,l} remains moderate; we note that the ground truth itself contains noisy outliers with much larger accelerations. In contrast, MACRO VRNN produces accelerations far in excess of the ground-truth values. This indicates that in many cases the baseline trajectories are far from being physically achievable, while the proposed method yielded more realistic outputs.

Table 2: Prediction of specific players with and without fine-tuning for H = 40 (4.8 seconds) using the MBT_{4,l} model.

Player             Fine-tuned?   ADE    FDE    MSE
LeBron James       No            4.78   6.63   9.97
LeBron James       Yes           4.67   6.24   9.91
Stephen Curry      No            6.32   7.80   17.35
Stephen Curry      Yes           6.09   7.51   16.62
Russell Westbrook  No            5.49   7.15   12.43
Russell Westbrook  Yes           5.36   6.90   12.23
DeAndre Jordan     No            4.36   6.01   12.20
DeAndre Jordan     Yes           3.93   4.94   12.56
Andrew Bogut       No            4.54   6.12   9.34
Andrew Bogut       Yes           4.29   5.40   9.03
In this section we compare per-player models to the base model trained on all players, as well as to per-player models fine-tuned on players that play the same position but are known to have distinct playing styles. We first compared performance of the base and per-player models for several example players, with results presented in Table 2. We can see that per-player models resulted in improved performance across the board, as they better capture the playing styles of individual players.

Let us consider a specific game situation where center DeAndre Jordan has just set up a pick and roll, shown in Figure 3 and the animated video in the Supplementary Material. The model trained on all players predicts that the so-called roll man will now move either towards the basket or towards the wide open space on the right-hand side of the court, shown in the first row of Figure 3a. Jordan is a very dynamic and fast center who executes many successful pick and rolls, so our model trained on his data predicts he will drive to the basket faster and with a higher probability than an average player in the same situation, as shown in Figure 3b. We also compared to a model trained on data of Andrew Bogut, a defense specialist who is not as fast as Jordan. According to stats.nba.com (https://on.nba.com/2ulXVau, last accessed June 2020), Bogut only attempts 0.5 pick and rolls per game, while Jordan attempts 2.4. Our model correctly predicts Bogut's paths to be less dynamic and gives a 25% probability that he would turn around and focus on defending a counter attack, entirely relying on his teammate to capitalize on the pick, shown in Figure 3c.

The following experiment involves a situation where Stephen Curry has possession of the ball at the top of the circle, with a defender to his right, as illustrated in Figure 4 (and in the Supplementary Material).
This example exposes some limitations of our approach, because in actuality Curry first acts as if he wants to drive inside, but decides to stop and shoot a 2-pointer before starting to move backwards. The predicted trajectories are much simpler, but still capture some interesting options that the player may choose.

Link to Supplementary Material: https://on.nba.com/2ulXVau, last accessed June 2020.

Figure 3: Visualization of predicted trajectories for DeAndre Jordan with H = 20 (2.4 s), using three different networks MBT_l: (a) trained on all players; (b) retrained with the data of DeAndre Jordan; (c) retrained with the data of Andrew Bogut.

Figure 4: Visualization of predicted trajectories for Stephen Curry with H = 20 (2.4 s), using three different networks MBT_l: (a) trained on all players; (b) retrained with the data of Stephen Curry; (c) retrained with the data of Russell Westbrook.

The model trained on all players predicts that the player may move towards the basket with about 40% probability, as seen in Figure 4a, with lower-probability options to move along the arc, stay at the top of the arc, or try to circle around the defender. The model retrained on data of Stephen Curry, shown in Figure 4b, slightly adjusts the path along the arc, because Curry often tries to shoot 3-pointers (more specifically, he had the second-most 3-point attempts in the 2015/16 season); as a result, the model also gives him a lower probability of driving towards the basket. We evaluated the same situation with a network fine-tuned on data of Russell Westbrook, shown in Figure 4c. Westbrook attempts far fewer 3-pointers than Curry, and instead takes more 2-point attempts. He is also a very dynamic player who excels at driving to the basket, such that when he makes an attempt he usually gets closer to the basket than an average player would.
Thus, when he moves along the arc our model predicts that he will not stay behind the 3-point line, but will instead try to get closer to the basket. We can see that the model successfully captured characteristics of individual players, adjusting its predictions to their own playing styles.
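The multi-modal predictions discussed above (several candidate trajectories, each with a probability) are trained, per the abstract, with a multi-modal loss that updates the best trajectories. The authors' exact loss is not reproduced here; the following is a minimal sketch of a winner-takes-all style objective under that assumption, with hypothetical helper names: only the mode closest to the ground truth receives the regression penalty, while a cross-entropy term pushes probability mass toward that mode.

```python
import math

def multimodal_loss(candidates, probs, true):
    """Winner-takes-all multi-modal loss sketch.
    candidates: K trajectories, each a list of (x, y) positions.
    probs: K per-mode probabilities summing to 1.
    Returns (loss, index of best mode)."""
    # Average displacement error of each candidate mode.
    errs = [sum(math.dist(p, t) for p, t in zip(c, true)) / len(true)
            for c in candidates]
    best = min(range(len(errs)), key=errs.__getitem__)
    reg = errs[best]              # regression term: best mode only is updated
    cls = -math.log(probs[best])  # classification term: reward the best mode
    return reg + cls, best
```

Because gradients flow only through the winning mode, the remaining modes are free to specialize on other plausible futures, which is what produces the distinct options with probabilities seen in Figures 3 and 4.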
5. Conclusion
In this paper we proposed an LSTM-based model, trained using a multi-modal loss, that can generate multiple paths which accurately predict the movement of NBA players. In addition, we showed that per-player fine-tuning can capture interesting, player-specific behavior. The proposed approach outperformed the state-of-the-art by a large margin, both in terms of standard prediction metrics and in terms of a velocity error that better captures trajectory realism. As future work, we are exploring ideas to model the multi-modal behavior of the entire team, as well as opponents' strategies that can counter such trajectories.

References

[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[2] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[3] U. Brefeld and A. Zimmermann. Guest editorial: Special issue on sports analytics. Data Mining and Knowledge Discovery, 31(6):1577–1579, 2017.
[4] J. Chen, H. M. Le, P. Carr, Y. Yue, and J. J. Little. Learning online smooth predictors for realtime camera planning using recurrent decision trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4688–4696, 2016.
[5] W. Choi and S. Savarese. Understanding collective activities of people from videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1242–1257, 2013.
[6] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In European Conference on Computer Vision, pages 215–230. Springer, 2012.
[7] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In IEEE International Conference on Robotics and Automation (ICRA), 2019.
[8] P. Felsen, P. Lucey, and S. Ganguly. Where will they go? Predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders. In The European Conference on Computer Vision (ECCV), September 2018.
[9] B. Fischer and E. Ramsperger. Human express saccades: Extremely short reaction times of goal directed eye movements. Experimental Brain Research, 57:191–195, 1984.
[10] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. arXiv preprint arXiv:1903.10145, 2019.
[11] J. Gudmundsson and M. Horton. Spatio-temporal analysis of team sports. ACM Computing Surveys (CSUR), 50(2):22, 2017.
[12] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.
[13] S. Haddad and S. Lam. Self-growing spatial graph networks for pedestrian trajectory prediction. In The IEEE Winter Conference on Applications of Computer Vision, pages 1151–1159, 2020.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] C. G. Keller and D. M. Gavrila. Will the pedestrian cross? A study on pedestrian path prediction. IEEE Transactions on Intelligent Transportation Systems, 15(2):494–506, 2013.
[16] K. Kim, M. Grundmann, A. Shamir, I. Matthews, J. Hodgins, and I. Essa. Motion fields to predict play evolution in dynamic sport scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 840–847. IEEE, 2010.
[17] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In European Conference on Computer Vision, pages 201–214. Springer, 2012.
[18] S. Kuper. Soccernomics: Why England Loses, Why Spain, Germany, and Brazil Win, and Why the US, Japan, Australia and Even Iraq Are Destined to Become the Kings of the World's Most Popular Sport. Nation Books, 2014.
[19] L. Lamas, J. Barrera, G. Otranto, and C. Ugrinowitsch. Invasion team sports: Strategy and match modeling. International Journal of Performance Analysis in Sport, 14(1):307–329, 2014.
[20] T. Lan, Y. Wang, W. Yang, and G. Mori. Beyond actions: Discriminative models for contextual group activities. In Advances in Neural Information Processing Systems, pages 1216–1224, 2010.
[21] H. Le, P. Carr, Y. Yue, and P. Lucey. Data-driven ghosting using deep imitation learning, 2017.
[22] M. Lewis. Moneyball: The Art of Winning an Unfair Game. W. W. Norton & Company, 2004.
[23] N. Pelechano, J. M. Allbeck, and N. I. Badler. Controlling individual agents in high-density crowd simulation. In Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 99–108. Eurographics Association, 2007.
[24] S. Pellegrini, A. Ess, and L. Van Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In European Conference on Computer Vision, pages 452–465. Springer, 2010.
[25] R. Pollard, J. Ensum, and S. Taylor. Estimating the probability of a shot resulting in a goal: The effects of distance, angle and space. Int. J. Soccer Sci., 2, 2004.
[26] C. Rupprecht, I. Laina, R. DiPietro, and M. Baust. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In IEEE International Conference on Computer Vision (ICCV), October 2017.
[27] T. Seidl, A. Cherukumudi, A. T. Hartnett, P. Carr, and P. Lucey. Bhostgusters: Realtime interactive play sketching with synthesized NBA defenses, 2018.
[28] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy. Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641, 2019.
[29] D. Xie, T. Shu, S. Todorovic, and S. Zhu. Learning and inferring dark matter and predicting human intents and trajectories in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1639–1652, 2017.
[30] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with and where are you going? In CVPR 2011, pages 1345–1352. IEEE, 2011.
[31] E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey. Generating multi-agent trajectories using programmatic weak supervision. In