Analysis of professional basketball field goal attempts via a Bayesian matrix clustering approach
Fan Yin (Department of Statistics, University of California, Irvine), Guanyu Hu (Department of Statistics, University of Missouri, Columbia), and Weining Shen* (Department of Statistics, University of California, Irvine)

*Corresponding author. Email: [email protected]
Abstract
We propose a Bayesian nonparametric matrix clustering approach to analyze the latent heterogeneity structure in the shot selection data collected from professional basketball players in the National Basketball Association (NBA). The proposed method adopts a mixture of finite mixtures framework and fully utilizes the spatial information via a mixture of matrix normal distribution representation. We propose an efficient Markov chain Monte Carlo algorithm for posterior sampling that allows simultaneous inference on both the number of clusters and the cluster configurations. We also establish large-sample convergence properties for the posterior distribution. The excellent empirical performance of the proposed method is demonstrated via simulation studies and an application to shot chart data from selected players in the NBA's 2017–2018 regular season.
Keywords: Basketball Shot Chart; Bayesian Nonparametrics; Matrix Data; Mixture of Finite Mixtures; Model-Based Clustering
1 Introduction

Sports analytics have received increasing interest in the statistics community and continue to offer new challenges as a result of ever-increasing data sources. Conventional statistical research for sports analytics was mainly concerned with forecasting game results, such as predicting the number of goals scored in soccer matches (Dixon and Coles, 1997; Karlis and Ntzoufras, 2003; Baio and Blangiardo, 2010) and basketball game outcomes (Carlin, 1996; Caudill, 2003; Cattelan et al., 2013). More recently, fast development in player tracking technologies has greatly facilitated data collection (Albert et al., 2017), and in turn substantially expanded the role of statistics in sports analytics, including granular evaluation of player/team performance (Cervone et al., 2014; Franks et al., 2015; Cervone et al., 2016; Wu and Bornn, 2018) and in-game strategy evaluation (Fernandez and Bornn, 2018; Sandholtz et al., 2019).

In professional basketball research, the shooting pattern remains a fundamental metric for evaluating players' performance and has aroused great interest among statisticians. Shot charts, as graphical representations of players' shot locations, provide an excellent tool to summarize and visualize shooting patterns. To account for the spatial correlation in shot chart data, several spatial statistical models have been studied in the literature. For example, Reich et al. (2006) proposed a multinomial logit regression model with spatially varying coefficients to quantify the effects of several hand-crafted features (e.g., home or away games, average number of blocks made by the defensive player, the presence of certain other teammates) on the probability of making a shot over different basketball court regions. More recently, the spatial point process (Miller et al., 2014; Jiao et al., 2019) has emerged as a promising direction for shot chart data analysis in recognition of the random nature of shot locations. In those works, it is common to first summarize the shot charts as intensity matrices of the underlying point process and then conduct a regression analysis of pre-specified artificial baseline shooting patterns on game outcomes. A common finding in these studies is that shooting behaviors are highly heterogeneous among different players, which calls for a clustering analysis towards a deeper understanding of the player-level heterogeneity and the improvement of existing statistical models by incorporating the latent clustering structure. To date, most existing clustering approaches for shot chart data analysis are distance-based (e.g., K-means and hierarchical clustering), and hence lack a probabilistic interpretation. A model-based clustering approach was proposed in Hu et al. (2020) based on calculating the similarity matrix between intensity matrices of players' shot charts. However, that method still lacks an intuitive model interpretation for the clustering results, since the clustering is performed on the similarity matrix rather than the intensity matrices.

The main goal of this paper is to fill this gap by introducing a novel Bayesian model-based clustering approach for learning the basketball players' heterogeneity via their shot chart data.
The key novelty of our method starts from treating each shot chart as a matrix, i.e., the basketball court is divided into a few rectangular regions, and the number of shots (or the intensity of the underlying spatial point process) over those regions is represented as the elements of the corresponding matrix. The immediate benefit of treating each sampling unit (shot chart) as a matrix is that the spatial structure information is automatically taken into account in the analysis. Moreover, it allows us to conveniently extend the classical Gaussian mixture model (for vectors) to clustering matrix-valued shot chart data. Gaussian mixture models (and mixture models in general) have been widely used in many applications thanks to their convenient probabilistic interpretation and elegant computational solutions such as the expectation maximization (EM) algorithm. However, mixture models for matrix-valued data have received little attention until recently. Most existing works (Viroli, 2011a; Thompson et al., 2020; Gao et al., 2020) are based on the EM framework, which requires pre-specifying the number of clusters, while inference cannot be easily conducted for clustering outputs over different cluster numbers. A Bayesian approach was proposed in Viroli (2011b) by imposing a prior on the number of clusters and drawing posterior samples with a birth and death Markov chain Monte Carlo algorithm (BDMCMC; Stephens, 2000). However, that approach requires a careful parameter tuning process in BDMCMC, and the computation does not scale up with the size of the matrices. To date, it remains challenging to conduct efficient Bayesian inference for matrix-valued mixture models, due to the large parameter space (e.g., the number of parameters is at least of order $O(p^2 + q^2)$ for $p \times q$ matrices) and the fact that the parameter space is not fixed as the number of clusters varies. Moreover, there is a lack of understanding of the theoretical properties of these mixture models.

Our methodology development is directly motivated by solving the aforementioned challenges. In particular, we propose MFM-MxN, a novel nonparametric Bayesian mixture model of matrix normal distributions (MxN) under the mixture of finite mixtures framework (MFM; Miller and Harrison, 2018). The main idea is to represent each cluster (of shot charts) by a matrix normal distribution and allow the number of clusters to be random. We develop a Gibbs sampler that enables efficient full Bayesian inference on the number of clusters, the mixture probabilities, as well as the other modeling parameters. We demonstrate its excellent numerical performance through simulations and an analysis of NBA shot chart data. In addition, we establish a consistency result for the posterior estimates of the cluster number and the associated modeling parameters.

Our proposed method is unique in the following aspects. First, the idea of representing each player's shot chart as an intensity matrix and formally introducing the concept of matrix data analysis for solving the clustering problem is novel. In fact, this idea and our proposed approach are widely applicable to general sports applications, such as baseball and football studies, and provide a valuable alternative to the existing literature that mainly relies on spatial methods. Secondly, by adopting a full Bayesian framework, the clustering results yield a useful probabilistic interpretation.
Moreover, the developed posterior sampling scheme renders efficient computation and convenient inference compared to the other methods for modeling matrix-valued data in the literature. Thirdly, our theoretical result is among the first of its kind for mixture models of matrix-variate distributions. The posterior consistency result not only provides a theoretical justification for the excellent empirical performance (e.g., the high probability of selecting the correct number of clusters), but also connects to the existing theoretical findings on mixture models in general (for vector-valued data).

2 Motivating Data: NBA Shot Charts

We consider a dataset consisting of the locations of field goal attempts (FTA) from the offensive half court in all 82 games during the 2017–2018 National Basketball Association (NBA) regular season. Following Hu et al. (2020), we focus on 191 players who made more than 400 FTAs in that season. Rookie-year players, such as Lonzo Ball and Jayson Tatum, are not included in our analysis. All shooting locations are in a 47 ft (baseline to midcourt line) by 50 ft (sideline to sideline) rectangle, which is the standard court size for NBA games. We select nine players (DeMar DeRozan, LeBron James, Giannis Antetokounmpo, Stephen Curry, Nick Young, Eric Gordon, Steven Adams, Clint Capela, DeAndre Jordan) and visualize their shot charts in Figure 1. From this figure, we observe a clear heterogeneity pattern: the three players in the first row have a more balanced spatial location pattern in their FTAs than the other six players; the three players in the second row have more FTAs around the three-point line; and the three players in the last row have more FTAs near the basket. These observations seem closely related to the players' positions and playing styles. Our goal in this paper is to synthesize these empirical findings through a formal model-based clustering approach.
Figure 1: Shot charts for selected NBA players
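To make the matrix representation used later concrete, here is a minimal base R sketch that bins raw shot locations into a count matrix over rectangular court regions; the data frame `shots` and the 10 × 6 grid are hypothetical stand-ins, not the paper's processed data.

```r
# Bin shot locations (in feet) into a p x q count matrix over court regions.
set.seed(1)
shots <- data.frame(x = runif(500, 0, 47),   # baseline to midcourt line
                    y = runif(500, 0, 50))   # sideline to sideline

p <- 10; q <- 6   # arbitrary grid resolution for illustration
x_breaks <- seq(0, 47, length.out = p + 1)
y_breaks <- seq(0, 50, length.out = q + 1)

shot_counts <- table(cut(shots$x, x_breaks, include.lowest = TRUE),
                     cut(shots$y, y_breaks, include.lowest = TRUE))
shot_matrix <- matrix(as.integer(shot_counts), nrow = p, ncol = q)
```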
3 Method

In this section, we first give a brief review of the log Gaussian Cox process and the matrix normal distribution, and then present our Bayesian matrix normal mixture model in Section 3.3.

3.1 Log Gaussian Cox process
Consider a collection of 2D spatial locations $S = \{s_1, s_2, \ldots, s_N\}$ over a study region $\mathcal{B} \subset \mathbb{R}^2$. It is common to represent the underlying spatial pattern by a spatial point process characterized by a quantity called the intensity. Formally, within the region $\mathcal{B}$, the intensity at location $s \in \mathcal{B}$ is defined as
$$\lambda(s) = \lim_{|ds| \to 0} \frac{\mathrm{E}[N(ds)]}{|ds|},$$
where $ds$ is an infinitesimal region around $s$, $|ds|$ represents its area, and $N(ds)$ denotes the number of events that happen over $ds$. A spatial Poisson point process is a process such that the number of events/points in any subregion $A \subset \mathcal{B}$ follows a Poisson distribution with mean $\lambda(A) = \int_A \lambda(s)\,ds$ for some function $\lambda(\cdot)$. As with the Poisson distribution, a Poisson process $\mathrm{PP}(\lambda(\cdot))$ satisfies $\mathrm{E}(N_S(A)) = \mathrm{Var}(N_S(A)) = \lambda(A)$. A homogeneous Poisson process (HPP) assumes $\lambda(s) = \lambda$, i.e., the intensity is constant over the entire region $\mathcal{B}$. A more realistic case is to let $\lambda(s)$ vary spatially, which leads to a nonhomogeneous Poisson process. Among the class of Poisson processes, the log Gaussian Cox process (LGCP) has received much attention in practice thanks to its flexibility and easy interpretability. An LGCP is a doubly-stochastic Poisson process with a correlated and spatially-varying intensity (Møller et al., 1998), defined as
$$S \sim \mathrm{PP}(\lambda(\cdot)), \qquad \lambda(\cdot) = \exp(Z(\cdot)), \qquad Z(\cdot) \sim \mathrm{GP}(0, k(\cdot,\cdot)), \qquad (1)$$
where $Z(\cdot)$ is a zero-mean Gaussian process with covariance kernel $k(\cdot,\cdot)$. From (1), the LGCP can be viewed as an exponentiated Gaussian process, i.e., a Gaussian random field (GRF; Rasmussen and Williams, 2006), which assumes that the log intensities at different spatial locations are normally distributed and spatially correlated. To relate to the basketball shot chart data discussed in Section 2, for all the players of interest, we can model their shot charts $S^{(1)}, S^{(2)}, \ldots, S^{(n)}$ through an LGCP and estimate the associated intensity functions, denoted by $\lambda^{(1)}(\cdot), \lambda^{(2)}(\cdot), \ldots, \lambda^{(n)}(\cdot)$. This step can be conveniently implemented using the integrated nested Laplace approximation (INLA; Rue et al., 2009). See Cervone et al. (2016) and Hu et al. (2020) for more implementation details. For illustration, we plot the estimated intensity maps for three selected players in Figure 2.

Figure 2: Estimated intensity maps for three selected players (LeBron James, Stephen Curry, Steven Adams).
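As a concrete illustration of definition (1), the following sketch simulates an LGCP on a discretized half-court grid: a zero-mean Gaussian random field is drawn at the cell centers, exponentiated to give the intensity, and Poisson counts are then drawn cell by cell. The exponential kernel and its length scale are illustrative assumptions, not the covariance specification used in the paper.

```r
set.seed(42)
p <- 10; q <- 6
# Cell centers of a p x q grid over the 47 x 50 ft half court.
centers <- expand.grid(x = seq(47 / (2 * p), 47 - 47 / (2 * p), length.out = p),
                       y = seq(50 / (2 * q), 50 - 50 / (2 * q), length.out = q))

# Z(.) ~ GP(0, k), with an exponential kernel k(s, s') = exp(-|s - s'| / 10).
D <- as.matrix(dist(centers))
K <- exp(-D / 10)
Z <- as.vector(t(chol(K + 1e-8 * diag(p * q))) %*% rnorm(p * q))

lambda <- exp(Z)                                   # lambda(.) = exp(Z(.))
counts <- matrix(rpois(p * q, lambda), nrow = p)   # one simulated shot chart
```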
3.2 Matrix normal distribution

Next we provide a brief review of the matrix normal distribution. Consider a $p \times q$ random matrix $Y$. We say $Y$ follows a matrix-variate normal distribution (MxN) with parameters $M$, $U$ and $V$, denoted by $Y \sim \mathcal{N}_{p,q}(M, U, V)$, if it has the probability density function
$$f(Y; M, U, V) = \frac{\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}\!\left[V^{-1}(Y-M)^{\top}U^{-1}(Y-M)\right]\right)}{(2\pi)^{pq/2}\,|V|^{p/2}\,|U|^{q/2}}, \qquad (2)$$
where the matrix $M \in \mathbb{R}^{p \times q}$ is the mean of $Y$, and $|\cdot|$ denotes the matrix determinant. The positive definite matrices $U \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{q \times q}$ are the row-wise and column-wise covariance parameters, describing the covariances between, respectively, the $p$ rows and the $q$ columns of $Y$. It is clear from (2) that the matrix normal distribution can be viewed as a multivariate normal distribution with a Kronecker product covariance structure (Gupta and Nagar, 1999); that is, $Y \sim \mathcal{N}_{p,q}(M, U, V)$ is equivalent to $\mathrm{vec}(Y) \sim \mathcal{N}_{pq}(\mathrm{vec}(M), V \otimes U)$, where $\mathrm{vec}(\cdot)$ is the vectorization operator that stacks the columns of a matrix into a tall column vector. Since $V \otimes U = (aV) \otimes (a^{-1}U)$ for any $a \neq 0$, we impose the constraint $\mathrm{tr}(V) = q$ for model identifiability.

From this definition, we can see that the matrix normal distribution enjoys a parsimonious covariance structure. By representing a $(pq) \times (pq)$ covariance as the Kronecker product of a $p \times p$ and a $q \times q$ covariance matrix, it effectively reduces the number of unknown covariance parameters from $pq(pq+1)/2$ to $\{p(p+1) + q(q+1)\}/2$. Moreover, it provides a useful interpretation by projecting the spatial variability onto the column and row directions, which can be viewed as a spatial version of the analysis of variance (ANOVA) model. For the basketball shot chart data, it is natural to divide the offensive half court equally into rectangular regions and represent the measurements (e.g., the number of shots made by a player) over those regions in matrix form. Moreover, we can model the logarithm of the corresponding intensity function over the matrix by a matrix normal distribution. It is also worth mentioning that there are other useful distributions defined for matrix-valued data, such as the matrix-variate $t$ distribution (Thompson et al., 2020). Our proposed Bayesian mixture model of matrix normal distributions can be naturally extended to those distributions; we focus on the matrix normal distribution here for its convenience and easy interpretation.
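The vec–Kronecker equivalence above is easy to verify numerically; the following base R sketch evaluates the log-density (2) both ways for arbitrary illustrative parameters.

```r
set.seed(7)
Y <- matrix(rnorm(6), 3, 2); M <- matrix(0, 3, 2)
U <- crossprod(matrix(rnorm(9), 3))   # random 3 x 3 SPD row covariance
V <- crossprod(matrix(rnorm(4), 2))   # random 2 x 2 SPD column covariance

# Log-density of the matrix normal distribution, following (2).
ldmatnorm <- function(Y, M, U, V) {
  p <- nrow(Y); q <- ncol(Y); R <- Y - M
  -0.5 * sum(diag(solve(V, t(R)) %*% solve(U, R))) -
    0.5 * p * q * log(2 * pi) -
    0.5 * p * c(determinant(V)$modulus) - 0.5 * q * c(determinant(U)$modulus)
}

# Log-density of vec(Y) ~ N(vec(M), V %x% U).
ldmvnorm_vec <- function(Y, M, U, V) {
  S <- kronecker(V, U); r <- as.vector(Y - M)
  -0.5 * drop(t(r) %*% solve(S, r)) -
    0.5 * length(r) * log(2 * pi) - 0.5 * c(determinant(S)$modulus)
}

all.equal(ldmatnorm(Y, M, U, V), ldmvnorm_vec(Y, M, U, V))  # TRUE
```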
3.3 MFM-MxN: a Bayesian mixture of matrix normal distributions

To account for the potential heterogeneity in the matrix-valued data, we propose a Bayesian mixture model in which each mixture component is represented by a matrix normal distribution. More specifically, suppose that there are a total of $K$ clusters with weights $\pi_1, \ldots, \pi_K$, and each mixture component follows a different matrix normal distribution. We then adopt the mixture of finite mixtures (MFM) framework (Miller and Harrison, 2018) by assigning prior distributions to the unknown model parameters as follows:
$$\begin{aligned}
K &\sim p_K, \quad p_K \text{ a p.m.f. on } \mathbb{N}^{+} = \{1, 2, \ldots\},\\
\pi = (\pi_1, \ldots, \pi_k) &\sim \mathrm{Dir}_k(\gamma, \ldots, \gamma), \quad \text{given } K = k,\\
P(Z_i = j) &= \pi_j \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k, \text{ given } \pi,\\
M_1, \ldots, M_k &\overset{\text{i.i.d.}}{\sim} \mathcal{N}_{p,q}(M_0, \Sigma_0, \Omega_0), \quad \text{given } K = k,\\
U_1, \ldots, U_k &\overset{\text{i.i.d.}}{\sim} \mathrm{IW}_p(2\alpha, (2\beta)^{-1}), \quad \text{given } K = k,\\
V_1, \ldots, V_k &\overset{\text{i.i.d.}}{\sim} \mathrm{IW}_q(2\psi, (2\rho)^{-1}), \quad \text{given } K = k,\\
Y_i &\sim \mathcal{N}_{p,q}(M_{Z_i}, U_{Z_i}, V_{Z_i}) \quad \text{independently for } i = 1, \ldots, n, \text{ given } \Theta \text{ and } Z_1, \ldots, Z_n,
\end{aligned} \qquad (3)$$
where $Z_1, \ldots, Z_n$ are cluster membership indicators taking values in $\{1, \ldots, K\}$ for each observation $Y_i$; $\Theta = (\Theta_1, \ldots, \Theta_K)$ with $\Theta_k = (M_k, U_k, V_k)$, $k = 1, \ldots, K$, collects the parameters of the matrix normal distributions; and $\gamma, \alpha, \beta, \psi, \rho$ are hyper-parameters. Here $\mathrm{IW}_p(\nu, S^{-1})$ denotes an inverse-Wishart distribution on $p \times p$ positive definite matrices with degrees of freedom $\nu$ ($\nu > p - 1$) and scale parameter $S$, whose probability density is proportional to $|\Sigma|^{-(\nu+p+1)/2} \exp(-\mathrm{tr}(S\Sigma^{-1})/2)$. The observations $Y_i$ are the log intensities $\log(\hat{\lambda}^{(i)}(\cdot))$ of the LGCPs obtained in Section 3.1. We follow the convention of choosing $p_K$ as a Poisson($\tau = 1$) distribution truncated to take only positive values (Miller and Harrison, 2018; Geng et al., 2019). The prior distributions for the $\Theta_k$'s are specified to facilitate Bayesian inference via Gibbs sampling by taking advantage of the Normal–Normal and Normal–inverse-Wishart conjugacy. We discuss more details about the numerical implementation in later sections.

The matrix normal mixture model has been previously studied in Viroli (2011a) under the EM framework and in Gao et al. (2020) by imposing regularization on the mean structure for sparsity. However, in both works it remains challenging to conduct full inference on the number of clusters and the cluster parameters simultaneously. Viroli (2011b) considered a Bayesian matrix normal mixture model and proposed a birth and death MCMC algorithm for posterior inference. However, that method does not scale with the size of the matrix, and the theoretical properties of the Bayesian estimators remain largely unknown. We provide more details about computation and theoretical results, and highlight our contributions, in the next two sections.
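A forward simulation from (3) makes the hierarchy explicit. The sketch below uses base R only; the inverse-Wishart draws and the specific hyperparameter values are illustrative assumptions rather than the settings recommended in Section 4.

```r
set.seed(2018)
p <- 10; q <- 6; n <- 50
gamma <- 3; tau <- 1

# K ~ Poisson(tau) truncated to positive values.
K <- 0
while (K < 1) K <- rpois(1, tau)

# pi | K ~ Dirichlet(gamma, ..., gamma), via normalized Gamma draws.
w <- rgamma(K, shape = gamma); pi_k <- w / sum(w)

# Cluster-specific parameters; identity-scale inverse-Wishart for illustration.
rinvwish <- function(d, nu) solve(rWishart(1, df = nu, Sigma = diag(d))[, , 1])
M_k <- lapply(1:K, function(k) matrix(rnorm(p * q), p, q))
U_k <- lapply(1:K, function(k) rinvwish(p, nu = p + 2))
V_k <- lapply(1:K, function(k) rinvwish(q, nu = q + 2))

# Memberships, then Y = M + A E B' with A A' = U, B B' = V, E iid N(0, 1).
Z <- sample.int(K, n, replace = TRUE, prob = pi_k)
Y <- lapply(1:n, function(i) {
  A <- t(chol(U_k[[Z[i]]])); B <- t(chol(V_k[[Z[i]]]))
  M_k[[Z[i]]] + A %*% matrix(rnorm(p * q), p, q) %*% t(B)
})
```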
4 Bayesian Inference

In this section, we present a Gibbs sampler that enables efficient Bayesian inference for the proposed model, and we adopt Dahl's method (Dahl, 2006) for post-processing the MCMC outputs.

4.1 A collapsed Gibbs sampler

By exploiting the conditional conjugacy in model specification (3), we derive a collapsed Gibbs sampler algorithm for efficient Bayesian inference. Detailed derivations of the full conditionals are provided in Sections S3 and S4 of the Supplementary Materials. For the basketball application, we find it plausible to assume that different mixture components share the same covariance structure, that is, $U_1 = \cdots = U_K = U$ and $V_1 = \cdots = V_K = V$. Extension to distinct covariances across clusters is possible by introducing auxiliary parameters when updating the indicator variables $Z_i$, $i = 1, \ldots, n$, using the method in Neal (2000). Based on Algorithm 2 of Neal (2000), we obtain the following proposition, which provides the full conditional distribution of $Z_i$, $i = 1, \ldots, n$, while collapsing the number of clusters $K$.

Proposition 1. The full conditional distribution $P(Z_i = c \mid Z_{-i}, \Theta)$ is given by
$$P(Z_i = c \mid Z_{-i}, \Theta) \propto \begin{cases} (|c| + \gamma)\, f(Y_i \mid M_c, U, V), & \text{at an existing cluster } c,\\[4pt] \dfrac{V_n(|\mathcal{C}_{-i}| + 1)}{V_n(|\mathcal{C}_{-i}|)}\, \gamma\, m(Y_i \mid U, V), & \text{if } c \text{ is a new cluster}, \end{cases} \qquad (4)$$
where $|c|$ denotes the cardinality of the cluster labeled $c$; $f(Y_i \mid M_c, U, V)$ is the MxN density defined in (2); and $V_n(t)$ is a coefficient of the partition distribution defined as
$$V_n(t) = \sum_{k=1}^{\infty} \frac{k_{(t)}}{(\gamma k)^{(n)}}\, p_K(k), \quad \text{with } k_{(t)} = k(k-1)\cdots(k-t+1), \; (\gamma k)^{(n)} = \gamma k(\gamma k + 1)\cdots(\gamma k + n - 1).$$
Here $\mathcal{C}_{-i}$ denotes the partition of the set $\{1, 2, \ldots, n\} \setminus \{i\}$, and $|\mathcal{C}_{-i}|$ the number of blocks in that partition. We also define
$$m(Y_i \mid U, V) = \frac{|\tilde{\Sigma}|^{1/2} \exp\!\big(-\tfrac{1}{2}\big[\mathrm{vec}(Y_i)^{\top}(V^{-1} \otimes U^{-1})\mathrm{vec}(Y_i) + \mathrm{vec}(M_0)^{\top}(\Omega_0^{-1} \otimes \Sigma_0^{-1})\mathrm{vec}(M_0) - \tilde{\mu}^{\top}\tilde{\Sigma}^{-1}\tilde{\mu}\big]\big)}{(2\pi)^{pq/2}\, |V|^{p/2}\, |U|^{q/2}\, |\Omega_0|^{p/2}\, |\Sigma_0|^{q/2}},$$
where $\tilde{\Sigma}^{-1} = V^{-1} \otimes U^{-1} + \Omega_0^{-1} \otimes \Sigma_0^{-1}$ and $\tilde{\mu} = \tilde{\Sigma}\big[(V^{-1} \otimes U^{-1})\mathrm{vec}(Y_i) + (\Omega_0^{-1} \otimes \Sigma_0^{-1})\mathrm{vec}(M_0)\big]$.

The derivation of $m(Y_i \mid U, V)$ in Proposition 1 is given in the Supplementary Materials. Our collapsed Gibbs sampler algorithm for the proposed model is summarized as Algorithm 1 in Section S5 of the Supplementary Materials.
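For intuition, the coefficient $V_n(t)$ in Proposition 1 can be evaluated by truncating its infinite series; the sketch below does this on the log scale under the truncated Poisson(1) prior on $K$. The truncation point is an assumption chosen for illustration.

```r
# log V_n(t) = log sum_k k_(t) / (gamma k)^(n) * p_K(k), truncated at k_max.
log_Vn <- function(t, n, gamma = 1, tau = 1, k_max = 500) {
  k <- t:k_max   # terms with k < t vanish since k_(t) = 0
  log_fall <- lfactorial(k) - lfactorial(k - t)          # log k_(t)
  log_rise <- lgamma(gamma * k + n) - lgamma(gamma * k)  # log (gamma k)^(n)
  log_pK   <- dpois(k, tau, log = TRUE) - log1p(-exp(-tau))  # truncated Poisson
  terms <- log_fall - log_rise + log_pK
  m <- max(terms)
  m + log(sum(exp(terms - m)))   # log-sum-exp for numerical stability
}

# Ratio V_n(t + 1) / V_n(t) multiplying gamma * m(Y_i | U, V) in (4):
exp(log_Vn(4, n = 191, gamma = 1) - log_Vn(3, n = 191, gamma = 1))
```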
We make the following recommendations for the hyperparameter values in the priors:

- $\alpha = (p+1)/2$, $\psi = (q+1)/2$, $(2\beta)^{-1} = I_p$, $(2\rho)^{-1} = I_q$, which ensure that the prior distributions for the covariance matrices are fairly diffuse, while the scale parameters $2\beta$, $2\rho$ take the simplest possible forms.

- $\gamma = 3$, which puts low probability on small group sizes.

- $M_0$ set to the (element-wise) middle point of the observations.

- $\Sigma_0 = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$ and $\Omega_0 = \mathrm{diag}(\omega_1^2, \ldots, \omega_q^2)$, where $\sigma_1, \ldots, \sigma_p$ and $\omega_1, \ldots, \omega_q$ are equal to half of the ranges along the respective rows and columns.

Numerical experiments have confirmed that these hyperparameters work well, and hence these values are used for all simulation studies and case studies in this paper.

4.2 Post-MCMC inference

We carry out posterior inference on the group memberships using Dahl's method (Dahl, 2006), which proceeds as follows (an R sketch is given after the steps):
- Step 1. Define the membership matrices $A^{(l)} = \big(A^{(l)}(i,j)\big)_{i,j \in \{1,\ldots,n\}} = \big(\mathbb{1}(Z_i^{(l)} = Z_j^{(l)})\big)_{n \times n}$, where $l = 1, \ldots, L$ indexes the retained MCMC draws after burn-in and $\mathbb{1}(\cdot)$ is the indicator function.
- Step 2. Calculate the element-wise mean of the membership matrices, $\bar{A} = \frac{1}{L}\sum_{l=1}^{L} A^{(l)}$.

- Step 3. Identify the most representative posterior draw as the one closest to $\bar{A}$ with respect to the element-wise Euclidean distance $\sum_{i=1}^{n}\sum_{j=1}^{n}\big(A^{(l)}(i,j) - \bar{A}(i,j)\big)^2$ among the retained draws $l = 1, \ldots, L$.

The posterior estimates of the cluster memberships $Z_1, \ldots, Z_n$ and other model parameters $\Theta$ can also be obtained using Dahl's method accordingly.
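The three steps translate directly into a few lines of base R. This is a minimal sketch; `Z_draws`, an L × n matrix of retained membership draws, is a hypothetical input.

```r
# Dahl's method: choose the retained draw whose membership matrix is closest
# to the element-wise posterior mean membership matrix.
dahl_select <- function(Z_draws) {
  L <- nrow(Z_draws)
  A <- lapply(1:L, function(l) outer(Z_draws[l, ], Z_draws[l, ], "=="))  # Step 1
  A_bar <- Reduce(`+`, A) / L                                            # Step 2
  d <- vapply(A, function(a) sum((a - A_bar)^2), numeric(1))             # Step 3
  Z_draws[which.min(d), ]
}

# Toy example with three retained draws for n = 5 observations:
dahl_select(rbind(c(1, 1, 2, 2, 2), c(1, 1, 1, 2, 2), c(1, 1, 2, 2, 2)))
```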
5 Theoretical Properties

Next we study the theoretical properties of the posterior distribution obtained from model (3). In order to establish the posterior contraction results, we consider a refined parameter space $\Theta^* = \cup_{k=1}^{\infty} \Theta_k^*$, where $\Theta_k^*$ is the compact parameter space for all the model parameters (i.e., mixture weights, matrix normal means, and covariances) given a fixed cluster number $K = k$. More precisely, we define
$$\Theta_k^* = \Big\{ w_1, \ldots, w_k \in (\epsilon, 1-\epsilon),\ \textstyle\sum_{i=1}^{k} w_i = 1;\ M_1, \ldots, M_k \in (-C_0, C_0)^{p \times q};$$
$$\sigma_1(U_i), \ldots, \sigma_p(U_i) \in (\underline{\sigma}, \bar{\sigma}),\ e_1(U_i), \ldots, e_p(U_i) \in (-C_1, C_1)^p \text{ for every } i = 1, \ldots, k;$$
$$\sigma_1^*(V_j), \ldots, \sigma_q^*(V_j) \in (\underline{\sigma}, \bar{\sigma}),\ e_1^*(V_j), \ldots, e_q^*(V_j) \in (-C_2, C_2)^q \text{ for every } j = 1, \ldots, k \Big\},$$
where $\epsilon, \underline{\sigma}, \bar{\sigma}, C_0, C_1, C_2$ are positive constants, and $\{\sigma_1(U_i), \ldots, \sigma_p(U_i); e_1(U_i), \ldots, e_p(U_i)\}$ and $\{\sigma_1^*(V_j), \ldots, \sigma_q^*(V_j); e_1^*(V_j), \ldots, e_q^*(V_j)\}$ are the eigenvalues and eigenvectors of the matrices $U_i$ and $V_j$, respectively. We also define the mixing measure $G = \sum_{i=1}^{k} w_i \delta_{\gamma_i}$, where $\delta$ is the point mass measure and $\gamma_i = \{M_i, U_i, V_i\}$ is the collection of parameters of the matrix normal distribution in cluster $i$, $i = 1, \ldots, k$. For two sequences of real numbers $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ if there exists a universal positive constant $C$, independent of $n$, such that $a_n \leq C b_n$. For any two mixing measures $G_1 = \sum_{i=1}^{k} p_i \delta_{\gamma_i}$ and $G_2 = \sum_{j=1}^{k'} p_j' \delta_{\gamma_j'}$, we define their Wasserstein distance as $W_1(G_1, G_2) = \inf_{q \in \mathcal{Q}} \sum_{i,j} q_{ij} \|\gamma_i - \gamma_j'\|$, where $\|\cdot\|$ is the element-wise $L_2$-distance, $\mathcal{Q}$ denotes the collection of joint discrete distributions on the space $\{1, \ldots, k\} \times \{1, \ldots, k'\}$, and $q_{ij}$, the probability associated with the $(i,j)$-element, satisfies the constraints $\sum_{i=1}^{k} q_{ij} = p_j'$ and $\sum_{j=1}^{k'} q_{ij} = p_i$ for every $i = 1, \ldots, k$ and $j = 1, \ldots, k'$.

Let $K_0$, $G_0$, $P_0$ be the true number of clusters, the true mixing measure, and the corresponding probability measure, respectively. The following theorem establishes the posterior consistency and contraction rate for the cluster number $K$ and the mixing measure $G$. The proof is given in Section S6 of the Supplementary Materials; it is based on the general results for Bayesian mixture models in Guha et al. (2019).

Theorem 5.1.
Let $\Pi_n(\cdot \mid Y_1, \ldots, Y_n)$ be the posterior distribution obtained from (3) given a random sample $Y_1, \ldots, Y_n$. Assume that the parameters of interest are restricted to $\Theta^*$. Then we have
$$\Pi_n(K = K_0 \mid Y_1, \ldots, Y_n) \to 1, \quad \text{and} \quad \Pi_n\big(W_1(G, G_0) \lesssim (\log n / n)^{1/2} \mid Y_1, \ldots, Y_n\big) \to 1,$$
almost surely under $P_0$ as $n \to \infty$.

Theorem 5.1 shows that our proposed Bayesian method is able to correctly identify the unknown number of clusters and the latent clustering structure with posterior probability tending to one as the sample size increases. The requirement of a compact parameter space $\Theta^*$ is commonly used in the Bayesian nonparametrics literature (Guha et al., 2019), and it is practically relevant since the model parameters are expected to take values in a pre-specified range. For example, it is reasonable to assume that the mixture weights are greater than some extremely small positive constant.

6 Simulation
6.1 Simulation settings

We conduct simulation studies to examine the finite-sample performance of the proposed method based on three evaluation metrics: (i) the probability of choosing the correct number of clusters, (ii) the Rand index (Rand, 1971), and (iii) the root mean squared error (RMSE) in estimating $V \otimes U$. These three metrics measure the model selection accuracy, the recovery of the clustering structure, and the parameter estimation accuracy, respectively. We compare the performance of the proposed method with that of two classical benchmark methods, K-means clustering (Hartigan and Wong, 1979) and spectral clustering (Ng et al., 2002). Both benchmarks take the vectorized matrices as input. They are implemented using the built-in function kmeans and the function specc in the R package kernlab (Karatzoglou et al., 2004) with a Gaussian kernel under default settings, respectively. The Rand index is calculated using the function rand.index in the R package fossil (Vavrek, 2011).

When generating the data, we consider two matrix sizes: (i) a small matrix size, where $p = 10$ and $q = 6$, and (ii) a large matrix size, where $p = 25$ and $q = 18$. For the small matrix size, we generate three clusters of signals from matrix normal distributions with fixed mixture weights $\pi_0$ and the mean matrices $M_1, M_2, M_3 \in \mathbb{R}^{10 \times 6}$ displayed in the first row of Figure 3, where the elements are coded as 1 if the corresponding regions are shaded and 0 otherwise. The row-wise covariance matrix $U$ is drawn from a standard Wishart distribution with $\nu = 11$ and dimension 10 (to ensure that the marginal variance of the noise is equal to $\sigma^2$, $U$ is converted to a correlation matrix), and the column-wise covariance matrix $V$ is a $6 \times 6$ AR(1) covariance matrix with correlation parameter $\rho = 0.5$. We set the total sample size $n \in \{100, 200, 400\}$. To examine the performance of the proposed method and the two competing methods under different noise levels, we also consider a low-noise setting in which the column-wise covariance matrix $V$ is scaled down by a constant factor. We run 100 Monte Carlo replications, and for each replication we run MCMC chains for 1500 iterations, discarding the first 1000 draws as burn-in, for the experiments on the small matrix size. A sketch of the data-generating mechanism is given after Figure 3. For the large matrix size, we consider three mean patterns shown in Figure 4, each designed to represent a prevalent offensive style in the NBA shot chart data.

Figure 3: True signals $M_1, M_2, M_3$ (panels a–c) and representative draws of the recovered signals $\hat{M}_1, \hat{M}_2, \hat{M}_3$ by MFM under $n = 100, 200, 400$ (panels d–l), at the high noise level.
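The following sketch mirrors the small-matrix data-generating mechanism just described: a Wishart draw converted to a row correlation matrix, an AR(1) column covariance, and matrix normal noise around cluster-specific 0/1 mean patterns. The mean patterns and mixture weights here are placeholders, since the exact patterns live in Figure 3.

```r
set.seed(123)
p <- 10; q <- 6; n <- 200; sigma <- 1; rho <- 0.5

# Row-wise covariance: Wishart(nu = 11) draw converted to a correlation matrix.
U <- cov2cor(rWishart(1, df = 11, Sigma = diag(p))[, , 1])
# Column-wise covariance: AR(1) with marginal variance sigma^2.
V <- sigma^2 * rho^abs(outer(1:q, 1:q, "-"))

# Placeholder 0/1 mean patterns standing in for M1, M2, M3 of Figure 3.
M <- list(matrix(0, p, q),
          matrix(rep(c(1, 0), each = p * q / 2), p, q),
          matrix(rep(c(0, 1), each = p * q / 2), p, q))

z <- sample(1:3, n, replace = TRUE)        # cluster memberships
A <- t(chol(U)); B <- t(chol(V))
Y <- lapply(1:n, function(i)
  M[[z[i]]] + A %*% matrix(rnorm(p * q), p, q) %*% t(B))
```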
From the log intensity maps in Figure 4, we note that Group 2 represents all-around players such as LeBron James; Group 3 represents three-point shooters such as Eric Gordon; and Group 1 represents inside players such as Steven Adams. With the mean structure being the log intensity shown in Figure 4, we generate the simulated data from matrix normal distributions with column-wise covariance matrix $V = \sigma^2 \Sigma_{AR(1),\rho}$ and row-wise covariance matrix $U$ drawn from a standard Wishart distribution with $\nu = 19$ and dimension 18 (to ensure that the marginal variance of the noise is equal to $\sigma^2$, $U$ is converted to a correlation matrix). We fix $n = 200$ to mimic the number of players in the motivating data example and choose fixed mixture weights $\pi_0$. The numbers of clusters for K-means and spectral clustering are set equal to the number of clusters obtained from the proposed MFM-MxN. The MCMC settings are chosen based on pilot runs, and the overlaid traceplots of the Rand index, shown in the Supplementary Materials, further justify the validity of the chosen MCMC settings. All computations presented in this paper were performed in R (version 3.6.0) (R Core Team, 2019) on a computing server (256 GB RAM, 8 AMD Opteron 6276 processors at 2.3 GHz, 8 processing cores each).
Figure 4: Log intensity maps of the three mean patterns (Groups 1–3).

6.2 Simulation results
We first present the results for the small matrix size ($p = 10$, $q = 6$). Table 1 reports the mean Rand index for the three methods under different sample sizes and noise levels. The proposed method (MFM-MxN) outperforms K-means and spectral clustering in all scenarios, and its advantage is more salient when the noise level is higher, which is particularly important for clustering; the clustering accuracy of MFM-MxN remains high (mean Rand index above 0.80) throughout. We then summarize the distribution of the estimated number of clusters for MFM-MxN in Table 2. The probability of identifying the correct number of clusters is very satisfactory. We also report the RMSE of the covariance estimate $\hat{V} \otimes \hat{U}$ in Figure 5. The estimation accuracy improves as the noise level drops and as the sample size increases. This is also confirmed in Figure 3, where the recovered signals are significantly less noisy and better recapitulate the true signals as the sample size increases.

Table 1: Simulation results for the small matrix size: mean Rand index obtained from MFM-MxN, K-means, and spectral clustering under different sample sizes ($n = 100, 200, 400$) and the high/low noise levels defined in Section 6.1, based on 100 Monte Carlo replications.

Table 2: Simulation results for the small matrix size: percentage (%) of the selected number of clusters $\hat{K}$ for MFM-MxN under different sample sizes and noise levels, based on 100 Monte Carlo replicates. The true number of clusters is 3.
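For reference, the Rand index reported in Tables 1 and 4 (computed in the paper with fossil::rand.index) counts pairwise agreements between two partitions; a minimal base R equivalent is sketched below.

```r
# Rand index: fraction of object pairs on which two partitions agree,
# i.e., both place the pair in one cluster or both place it in different ones.
rand_index <- function(z1, z2) {
  same1 <- outer(z1, z1, "==")
  same2 <- outer(z2, z2, "==")
  agree <- same1 == same2
  sum(agree[upper.tri(agree)]) / choose(length(z1), 2)
}

rand_index(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3))  # 0.8
```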
Figure 5: Histograms of the RMSE (over 100 replicates) for the covariance estimates $\hat{V} \otimes \hat{U}$ under different sample sizes ($n = 100, 200, 400$) and the high/low noise levels.

In Tables 3 and 4, we present the results for the large matrix size ($p = 25$, $q = 18$). Similar to the previous findings, our proposed method correctly finds the true number of clusters at least 80% of the time across settings, and it also achieves the highest average Rand index among the three methods.

Table 3: Simulation results for the large matrix size: percentage (%) of the selected number of clusters $\hat{K}$ for MFM-MxN under different noise levels $\sigma$ and correlation strengths $\rho$ ($V = \sigma^2 \Sigma_{AR(1),\rho}$), based on 100 replicates. The true number of clusters is 3.
† Results for four runs are not shown due to large estimated cluster numbers.
Table 4: Simulation results for the large matrix size: mean Rand index for MFM-MxN, K-means, and spectral clustering under different noise levels $\sigma$ and correlation strengths $\rho$, based on 50 replicates.

7 NBA Shot Chart Data Analysis

In this section, we apply the proposed method to investigate the shooting patterns of players in the 2017–2018 NBA regular season. Our analysis is based purely on the locations of shots, and hence the resulting clusters are completely data-driven, without considering other information about the players or their teams. Shots made more than 36 ft away from the baseline are excluded from this analysis, as they are usually not part of the regular tactics. We start by obtaining the intensity surface for each player by fitting an LGCP to the raw shot location data using off-the-shelf functions in the R package inlabru (Bachl et al., 2019). The resulting intensity surface is scaled to adjust for the number of games played by the respective player.
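A heavily hedged sketch of this LGCP fitting step is given below, patterned on inlabru's generic point-process workflow rather than the authors' actual script: `shots_sp` (a SpatialPoints object of one player's shot locations) and `court_poly` (a SpatialPolygons object for the half court) are hypothetical inputs, and the mesh resolution and PC priors are illustrative guesses.

```r
library(INLA)
library(inlabru)

# Hypothetical inputs: shots_sp (SpatialPoints), court_poly (SpatialPolygons).
bnd  <- inla.sp2segment(court_poly)
mesh <- inla.mesh.2d(boundary = bnd, max.edge = c(2, 6))  # illustrative sizes
matern <- inla.spde2.pcmatern(mesh,
                              prior.range = c(10, 0.5),   # illustrative priors
                              prior.sigma = c(1, 0.5))

# LGCP: log intensity = spatial field + intercept.
cmp <- coordinates ~ field(coordinates, model = matern) + Intercept(1)
fit <- lgcp(cmp, data = shots_sp,
            samplers = court_poly,
            domain = list(coordinates = mesh))

# Posterior log intensity on a pixel grid, ready to be binned into a matrix.
pred <- predict(fit, pixels(mesh, mask = court_poly), ~ field + Intercept)
```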
We fit the proposed MFM-MxN model to the resulting log-intensity matrices and select a representative MCMC chain (mean Rand index 0.82 compared to all other chains with respect to the clustering memberships). This representative chain yields 3 groups of sizes 71, 23 and 97, respectively. Visualizations of the intensity matrices with contours for selected players from these three groups are presented in Figure 6. A full list of player names for each group is given in Section S1 of the Supplementary Materials. We also plot the estimated covariance matrices in Section S7. We find that both the row-wise covariance $\hat{U}$ and the column-wise covariance $\hat{V}$ enjoy a banded structure, i.e., the correlations beyond the diagonal and off-diagonal entries are quite small, which confirms that our method takes the spatial/location information into account by modeling the matrix structure in the data appropriately.

Several interesting observations can be made from the visualization results. The players in Group 1 are able to make all types of shots, including three-pointers, perimeter shots, and shots over the painted area; compared with the players in Group 3, however, they make fewer three-pointers. Most players in this group are power forwards and small forwards, but we can still find some players such as Dwyane Wade, Ricky Rubio and Rajon Rondo in this group. This can be explained by the fact that these three players do not have a good three-point shooting percentage and tend to attack the basket in the paint area. The players in Group 2 have most of their shots located near the hoop. They are good at making alley-oops and slam dunks, but not at making three-pointers; most of them are centers. One interesting player in this group is Dejounte Murray: he plays very similarly to Tony Parker, a former Spurs player, and takes more field goal attempts in the paint area, like a center. The players in Group 3 make more three-pointers than the other two groups, and they are almost all shooting guards. Kevin Durant also belongs to this group, because he is an all-rounder with an excellent ability to score three-pointers. In addition, there are some inside players in this group, such as Kevin Love, Kelly Olynyk, and DeMarcus Cousins. This reflects the recent trend that NBA teams start to prefer, or even require, their inside players to shoot from downtown and release the space in the paint area. Our findings also confirm that the number of pure inside players decreases as the three-pointer becomes a conventional weapon for most players.

Figure 6: Log intensity matrices with contours for selected players (Group 1: Giannis Antetokounmpo, Joel Embiid, LeBron James; Group 2: Clint Capela, DeAndre Jordan, Dwight Howard; Group 3: James Harden, Kevin Durant, Paul George).
8 Discussion

In this paper, we propose a novel Bayesian nonparametric clustering approach for learning the latent heterogeneity in matrix-valued response data. Building upon the mixture of finite mixtures framework, we develop a collapsed Gibbs sampler for efficient Bayesian inference and adopt Dahl's method for post-MCMC inference. Numerical results confirm that the proposed method is able to simultaneously infer the number of clusters and the model parameters with high accuracy. Compared to traditional clustering techniques such as K-means and spectral clustering, the proposed method improves clustering performance, especially when the noise level is high, because the rich spatial location information is incorporated by handling the data in matrix format.

In the analysis of the NBA shot chart data, three prevalent shooting patterns, along with the respective players, are identified. The results provide valuable insights to both players and managers: players can obtain a more comprehensive understanding of their current attacking patterns and hence develop further training plans accordingly, while managers can be equipped with a more objective and principled analysis of the shooting patterns of the players in the league and hence make better data-informed decisions on player recruiting.

A few topics beyond the scope of this paper are worth further investigation. A natural extension is Bayesian clustering for general multi-way data such as tensors. Also, since a matrix inverse is required at each MCMC update, the proposed estimation procedure can be slow when the matrix size is very large. Developing an efficient algorithm for large matrix data is an interesting direction for future work, and we envision low-rank approximation and sparse compression as two promising ways to mitigate this computational challenge. Finally, jointly estimating the intensity surfaces and the grouping information is another interesting direction for future work.

Acknowledgements

The authors would like to thank Dr. Yishu Xue and Dr. Hou-Cheng Yang for providing the organized data, which include the raw shot charts and the intensity maps estimated via INLA, as well as R code for data visualization.
Supplementary Materials
Technical details about the posterior derivations, proofs of the theorems, and additional numerical results are provided in the Online Supplementary Materials. R code and the data for the computations of this work are available at https://github.com/fyin-stats/MFM-MxN.
References
J. Albert, M. E. Glickman, T. B. Swartz, and R. H. Koning. Handbook of Statistical Methods and Analyses in Sports. CRC Press, 2017.

F. E. Bachl, F. Lindgren, D. L. Borchers, and J. B. Illian. inlabru: an R package for Bayesian spatial modelling from ecological survey data. Methods in Ecology and Evolution, 10:760–766, 2019. doi: 10.1111/2041-210X.13168.

G. Baio and M. Blangiardo. Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics, 37(2):253–264, 2010.

B. P. Carlin. Improved NCAA basketball tournament modeling via point spread and team strength information. The American Statistician, 50(1):39–43, 1996.

M. Cattelan, C. Varin, and D. Firth. Dynamic Bradley–Terry modelling of sports tournaments. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(1):135–150, 2013.

S. B. Caudill. Predicting discrete outcomes with the maximum score estimator: The case of the NCAA men's basketball tournament. International Journal of Forecasting, 19(2):313–317, 2003.

D. Cervone, A. D'Amour, L. Bornn, and K. Goldsberry. POINTWISE: Predicting points and valuing decisions in real time with NBA optical tracking data. In Proceedings of the 8th MIT Sloan Sports Analytics Conference, Boston, MA, USA, volume 28, page 3, 2014.

D. Cervone, A. D'Amour, L. Bornn, and K. Goldsberry. A multiresolution stochastic process model for predicting basketball possession outcomes. Journal of the American Statistical Association, 111(514):585–599, 2016.

D. B. Dahl. Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics, 4:201–218, 2006.

M. J. Dixon and S. G. Coles. Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C (Applied Statistics), 46(2):265–280, 1997.

J. Fernandez and L. Bornn. Wide open spaces: A statistical technique for measuring space creation in professional soccer. In Sloan Sports Analytics Conference, volume 2018, 2018.

A. Franks, A. Miller, L. Bornn, and K. Goldsberry. Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics, 9(1):94–121, 2015.

X. Gao, W. Shen, L. Zhang, J. Hu, N. J. Fortin, R. D. Frostig, and H. Ombao. Regularized matrix data clustering and its application to image analysis. Biometrics, 2020.

J. Geng, W. Shi, and G. Hu. Bayesian nonparametric nonhomogeneous Poisson process with applications to USGS earthquake data. arXiv preprint arXiv:1907.03186, 2019.

A. Guha, N. Ho, and X. Nguyen. On posterior contraction of parameters and interpretability in Bayesian mixture modeling. arXiv preprint arXiv:1901.05078, 2019.

A. Gupta and D. Nagar. Matrix Variate Distributions, volume 104. CRC Press, 1999.

J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):100–108, 1979.

G. Hu, H.-C. Yang, and Y. Xue. Bayesian group learning for shot selection of professional basketball players. arXiv preprint arXiv:2006.07513, 2020.

J. Jiao, G. Hu, and J. Yan. A Bayesian joint model for spatial point processes with application to basketball shot chart. arXiv preprint arXiv:1908.05745, 2019.

A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab - an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.

D. Karlis and I. Ntzoufras. Analysis of sports data by using bivariate Poisson models. Journal of the Royal Statistical Society: Series D (The Statistician), 52(3):381–393, 2003.

A. Miller, L. Bornn, R. Adams, and K. Goldsberry. Factorized point process intensities: A spatial analysis of professional basketball. In International Conference on Machine Learning, pages 235–243, 2014.

J. W. Miller and M. T. Harrison. Mixture models with a prior on the number of components. Journal of the American Statistical Association, 113(521):340–356, 2018.

J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.

R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.

W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

B. J. Reich, J. S. Hodges, B. P. Carlin, and A. M. Reich. A spatial analysis of basketball shot chart data. The American Statistician, 60(1):3–12, 2006.

H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.

N. Sandholtz, J. Mortensen, and L. Bornn. Measuring spatial allocative efficiency in basketball. arXiv preprint arXiv:1912.05129, 2019.

M. Stephens. Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods. Annals of Statistics, pages 40–74, 2000.

G. Z. Thompson, R. Maitra, W. Q. Meeker, and A. F. Bastawros. Classification with the matrix-variate-t distribution. Journal of Computational and Graphical Statistics, pages 1–7, 2020.

M. J. Vavrek. fossil: Palaeoecological and palaeogeographical analysis tools. Palaeontologia Electronica, 14(1):16, 2011.

C. Viroli. Finite mixtures of matrix normal distributions for classifying three-way data. Statistics and Computing, 21(4):511–522, 2011a.

C. Viroli. Model based clustering for three-way data structures. Bayesian Analysis, 6(4):573–602, 2011b.

S. Wu and L. Bornn. Modeling offensive player movement in professional basketball. The American Statistician, 2018.