[PDF] Heterogeneous Regression Models for Clusters of Spatial Dependent Data

Abstract

In economic development, there are often regions that share similar economic characteristics, and economic models on such regions tend to have similar covariate effects. In this paper, we propose a Bayesian clustered regression for spatially dependent data in order to detect clusters in the covariate effects. Our proposed method is based on the Dirichlet process which provides a probabilistic framework for simultaneous inference of the number of clusters and the clustering configurations. The usage of our method is illustrated both in simulation studies and an application to a housing cost dataset of Georgia.

Full PDF

HHeterogeneous Regression Models for Clusters of SpatialDependent Data

Zhihua Ma Yishu Xue ∗ Guanyu Hu

Abstract

In economic development, there are often regions that share similar economic char-acteristics, and economic models on such regions tend to have similar covariate eﬀects.In this paper, we propose a Bayesian clustered regression for spatially dependent datain order to detect clusters in the covariate eﬀects. Our proposed method is based on theDirichlet process which provides a probabilistic framework for simultaneous inferenceof the number of clusters and the clustering conﬁgurations. The usage of our methodis illustrated both in simulation studies and an application to a housing cost datasetof Georgia. keywords:

Clustered Coeﬃcients Regression, Dirichlet process, MCMC, Spatial Ran-dom Eﬀects ∗ [email protected] a r X i v : . [ ec on . E M ] A p r Introduction

Spatial regression models have been widely applied in many diﬀerent ﬁelds such as environ-mental science (Hu and Bradley, 2018), biological science (Zhang and Lawson, 2011), andeconometrics (Brunsdon et al., 1996) to explore the relation between a response variable anda set of predictors over a region. One of the most important tasks for a spatial regressionmodel is to capture the spatial dependent structure between a response variable and a setof covariates. Cressie (1992) proposed a spatial regression model with Gaussian process,where the spatial random eﬀects are accounted for only by the intercepts. Brunsdon et al.(1996) proposed a geographically weighted regression (GWR) model which assumed the ex-istence of a spatially-dependent parameter surface, and used weighted local linear regressionto estimate such parameter surface. An application of GWR in analyzing the impact ofsocio-economic factors on treated prevalence for mental disorders in Barcelona is presentedin Salinas-P´erez et al. (2015). The idea of GWR has been subsequently extended to the Coxmodel framework by Xue et al. (2019). In addition to mean-based regression, Chasco andGallo (2015) used a spatial quantile regression technique to identify heterogeneities. Fromthe Bayesian perspective, Gelfand et al. (2003) incorporated Gaussian process to regressioncoeﬃcients to build a model with spatially varying coeﬃcients. Autant-Bernard and LeSage(2019) used a Bayesian heterogeneous spatial autoregressive model that allows for spatialvariation variations in intercepts, covariate eﬀects, and noise variances to study the knowl-edge production functions of diﬀerent regions in order to set up their regional innovationstrategy. The aforementioned works, however, all assumed that each location has its own setof regression parameters, which sometimes leads to excessive numbers of parameters, andsubsequently overﬁtting. Cluster eﬀects over the space of interest has not been taken intoaccount.Detection of heterogeneous covariate eﬀects in many diﬀerent ﬁelds, such as real estateapplications, spatial econometrics, and environmental science are becoming of increasing re-2earch interest. For example, administrative divisions in a country, such as regions, provinces,states, or territories, often have diﬀerent economic statuses and development patterns. Moreadvanced divisions and less developed divisions could be put into separate clusters and an-alyzed. Such clustering information is of great interest to regional economics researchers.One of the most popular methods for spatial cluster detection is the scan statistic method(Kulldorﬀ and Nagarwalla, 1995), where a scan statistic is constructed via a likelihood ratiostatistic to test the potential clusters. The usage of spatial scan statistics has been extendedto studies of disease mapping, crime, and public health. Similar endeavor has also been madeunder the Bayesian and nonparametric Bayesian frameworks in pursuit of spatial homogene-ity. Li et al. (2015) used nonparametric Bayesian method to detect cluster boundaries forareal data. Noticing that traditional methods may not work well with spatial missing data,Panzera et al. (2016) proposed using multiple imputation together with the Bayesian Inter-polation method to analyze spatially clustered missing data, which addresses both spatialunivariate and multivariate problems.Most of the aforementioned frequentist and Bayesian approaches mainly focus on estimat-ing cluster conﬁgurations of spatial response. Spatially varying patterns in the relationshipbetween a set of covariates and the response is also an important topic that needs to bestudied. Bill´e et al. (2017) used a two-step approach where in the ﬁrst step, spatial regimesof spatially varying parameters are identiﬁed, and in the second step estimated. Recently,methods for cluster detection of spatial regression coeﬃcients have been proposed to detectthe homogeneity of the covariates eﬀects among subareas. Li and Sang (2019) incorpo-rated spatial neighborhood information in a penalized approach to detect spatially clusteredpatterns in the regression coeﬃcients.Under the Bayesian framework, Lawson et al. (2014) explored the usage of multinomialpriors in modeling clustered coeﬃcients in the accelerated failure time model for survivaldata. As discussed by Lawson et al. (2014), to infer the grouping level, complicated searchalgorithms in variable dimensional parameter space are needed, such as the reversible jump3arkov chain Monte Carlo (MCMC) algorithm of Green (1995), which assigns a prior onthe number of clusters, and this number is updated at each iteration of an MCMC chain.Such algorithms are diﬃcult to implement and automate, and are known to suﬀer from lackof scalability and mixing issues. Nonparametric Bayesian approaches, such as the Dirichletprocess mixture model (DPMM; Ferguson, 1973), oﬀer choices to allow for uncertainty inthe number of clusters, and provide an integrated probabilistic framework under which thenumber of clusters, the clustering conﬁguration, and regression coeﬃcients are simultaneouslyestimated.In this paper, we propose a Bayesian spatial clustered linear regression model with Dirich-let process (DP; Ferguson, 1973) prior, which considers spatially dependent structure andclusters the covariate eﬀects simultaneously. In addition, implementation of our proposedmethods based on nimble (de Valpine et al., 2017), a relatively new and powerful R package,is discussed. The model diagnostic technique, logarithm of the pseudo-marginal likelihood(LPML; Ibrahim et al., 2013), is introduced to assess the ﬁtness of our proposed model. Ourproposed Bayesian approach reveals interesting features of the state-level data of Georgia.The remainder of the paper is organized as follows. In Section 2, we develop a spatialclustered linear regression model with DP prior. In Section 3, a MCMC sampling algorithmbased on nimble and post MCMC inference are discussed. Extensive simulation studiesare carried out in the next section. For illustration, our proposed methodology is appliedto Georgia housing cost dataset in Section 5. Finally, we conclude this paper with a briefdiscussion. In this section, a Bayesian spatial clustered linear model using DPMM is proposed forcoeﬃcient grouping in spatially dependent data. Based on the spatial regression model,spatially-varying coeﬃcients are assigned with a nonparametric DP prior to achieve the goal4f grouping.

The basic geostatistical model (Gelfand and Schliep, 2016) for spatially dependent responseat locations s = ( s , . . . , s n ) is denoted by Y = Xβ + w + (cid:15) , (1)where Y = ( Y ( s ) , . . . , Y ( s n )) the n -dimensional vector of responses observed at the n diﬀerent locations, X =  X ( s ) (cid:62) . . .X ( s n ) (cid:62)  is the n × p matrix of covariates, w = ( w ( s ) , . . . , w ( s n ))is the vector of spatial random eﬀects, which is assumed to follow a stationary Gaussianprocess whose covariance structure often depends on the geographical locations, and (cid:15) ∼ MVN( , σ y I ) adds the nugget eﬀect (see e.g., Chapter 6 of Carlin et al., 2014), which isusually a vector of white noise, with MVN denoting the multivariate normal distribution.Oftentimes, w ( s ) is assumed to also follow a MVN. The above spatial regression model canalso be rewritten as Y | β , w , σ y ∼ MVN( Xβ + w , σ y I ) , w ∼ MVN( , Σ W ) , where Σ W is the covariance matrix of the spatial random eﬀect vector w ( s ), and I denotesthe identity matrix. Conditional on X and w , entries in Y are independent. Conventionally,the covariance matrix is given as Σ W = σ w H , where H is a matrix constructed using thegreat circle distance matrix, denoted as GCD, between diﬀerent locations, i.e.,GCD = GCD( i, j ) i,j ∈{ ,...,n } = great circle distance between locations s i and s j , σ w is a scalar. There are three common weighting schemes for deﬁning H , including:the unity scheme : H = I n × n the exponential scheme : H ( φ ) = H ( i, j ) i,j ∈{ ,...,n } = [exp( − GCD( i, j ) /φ )] n × n , the Gaussian scheme : H ( φ ) = H ( i, j ) i,j ∈{ ,...,n } = (cid:2) exp (cid:0) − (GCD( i, j ) /φ ) (cid:1)(cid:3) n × n , (2)where φ is a tuning parameter that controls the spatial correlation. Larger value of φ indicatesstronger correlation.Another spatial regression model is the spatially varying coeﬃcients model (Gelfand et al.,2003): Y ( s ) = X ( s ) (cid:62) (cid:101) β ( s ) + (cid:15) ( s ) , (3)where X ( s ) is p × s , and (cid:101) β ( s ) is assumed to followa p -variate spatial process model. If we have observations ( Y ( s i ) , X ( s i )) for i = 1 , . . . , n ,they can be written into Y = X (cid:62) (cid:101) β + (cid:15) , where Y = ( Y ( s ) , . . . , Y ( s n )) (cid:62) , X (cid:62) is an n × ( np ) block diagonal matrix which has the rowvector X (cid:62) ( s i ) as its i -th diagonal entry, (cid:101) β = ( (cid:101) β ( s ) (cid:62) , . . . , (cid:101) β ( s n ) (cid:62) ) (cid:62) , and (cid:15) ∼ MVN( , σ I ).Gelfand et al. (2003) proposed the following hierarchical model: Y | (cid:101) β , σ ∼ MVN( X (cid:62) (cid:101) β , σ I ) (cid:101) β | µ β , T ∼ MVN(1 n × ⊗ µ β , H ( φ ) ⊗ T ) (4)where µ β is a p × H ( φ ) is a n × n matrix measuring spatial correlations between the n observed locations, T is a p × p covariance matrix associated with an observation vectorat any spatial location, and ⊗ denotes the Kronecker product. In model (1), β is constantover space, which means the covariate eﬀects remain the same over all locations; model (4)6llows for diﬀerent covariate eﬀects over locations, but restrict the covariate eﬀects to bedetermined by distances between pairs of locations as in (2).For many spatial economics data, however, some regions will share similar covariateeﬀects regardless of their geographical distance. Taking China as an example, Beijing tendsto have similar economic development pattern with Shanghai or Jiangsu (Ma et al., 2019b). The models in (1) and (4), however, do not take into account such inherent similarities inspatially dependent data. Within the Bayesian framework, coeﬃcient clustering can be accomplished using a Dirichletprocess mixture model (DPMM) by nonparametrically linking the spatial response variableto covariates through cluster membership. Formally, a probability measure G following a DPwith a concentration parameter α and a base distribution G is denoted by G ∼ DP( α, G ) if( G ( A ) , · · · , G ( A r )) ∼ Dirichlet( αG ( A ) , · · · , αG ( A r )) , (5)where ( A , · · · , A r ) are ﬁnite measurable partitions of the space Ω. Several diﬀerent formula-tions can be used for determining the DP. In this work, we use the stick-breaking constructionproposed by Sethuraman (1991) for DP realization, which is given as θ c ∼ G , G = ∞ (cid:88) c =1 π c δ θ c ( · ) ,π = V , π c = V c (cid:89) (cid:96)

MCMC is used to draw samples from the posterior distributions of the model parameters.In this section we present the sampling scheme, the posterior inference of cluster belongings,and measurements to evaluate the estimation performance and clustering accuracy.

We present the main R function written using the nimble package (de Valpine et al., 2017).The model is wrapped in a nimbleCode() function. For ease of exposition, we break it intoseparate snippets. The full code is available on GitHub with an example implementation.(link removed for blinding purposes, documentation submitted separately)Deﬁne S as the number of locations. The following code represent Equation (7). Ateach location, y[i] has a normal distribution with mu y[i] and precision tau y , whichis equivalent to 1 /σ y . A Gamma(1,1) prior is given to tau y . The coeﬃcient vector forlocation i , b[i, 1:6] , equals the coeﬃcient vector estimated for the cluster it belongs to,represented by latent[i] , which follows a multinomial distribution with probability vector zlatent[1:M] , where M denotes the number of potential clusters. SLMMCode <- nimbleCode( { for (i in 1:S) { y[i] ~ dnorm(mu_y[i], tau = tau_y)mu_y[i] <- b[i, 1] * x1[i] + b[i, 2] * x2[i] +b[i, 3] * x3[i] + b[i, 4] * x4[i] + b[i, 5] * x5[i] +b[i, 6] * x6[i] + W[i]b[i, 1:6] <- bm[latent[i], 1:6]latent[i] ~ dcat(zlatent[1:M]) The following code represent Equation (8). H represents the matrix Σ W , where phi is thetuning parameter φ in the exponential scheme in (2) that controls spatial correlation. Therandom eﬀects at locations 1 to S follow a multivariate normal distribution with mu w[1:S] and precision matrix, which equals the product of σ w , tau w , and the inverse of H . H is deﬁnedas a function of a certain distance matrix Dist , which is passed into the function later as anargument of

SLMMConsts . Here, the function is chosen to be the exponential scheme. Theprior distribution of the bandwidth phi is speciﬁed to be a uniform distribution from 0 to acertain upper limit, denoted by D . The prior of tau w is set to Gamma(1,1). for (j in 1:S) { for (k in 1:S) { H[j, k] <- exp(-Dist[j, k]/phi) }} W[1:S] ~ dmnorm(mu_w[1:S], prec = prec_W[1:S, 1:S])prec_W[1:S, 1:S] <- tau_w * inverse(H[1:S, 1:S])phi ~ dunif(0, D)tau_w ~ dgamma(1, 1)mu_w[1:S] <- rep(0, S)

The distribution of β for each location s i is deﬁned next. They each come from a multivariatenormal distribution with mean mu bm and covariance matrix var bm , which is a diagonalmatrix with all diagonal entries being . The inverse variance term, tau bm , is again10iven a Gamma(1,1) prior, and the entries in the mean vector are all given independentstandard normal priors. for (k in 1:M) { bm[k, 1:6] ~ dmnorm(mu_bm[1:6], cov = var_bm[1:6, 1:6]) } var_bm[1:6, 1:6] <- 1/tau_bm * diag(rep(1, 6))tau_bm ~ dgamma(1, 1)for (j in 1:6) { mu_bm[j] ~ dnorm(0, 1) } Finally for the model, the stick breaking process corresponding to Equations (10) and (11)is depicted. zlatent[1:M] <- stick_breaking(vlatent[1:(M - 1)])for (j in 1:(M - 1)) { vlatent[j] ~ dbeta(1, alpha) } alpha ~ dgamma(1, 1)tau_y ~ dgamma(1, 1) } ) With the full model deﬁned, we next declare the data list, which is made up of the response Y , the covariates X[,1] to X[,6] , and the matrix of distances

Dist . The constants in the11odel also need to be supplied, including the number of locations S , the number of startingclusters M , and the upper endpoint D for the uniform distribution of bandwidth. In addition,the initial values are speciﬁed. Code to compile the model, supply the initial values, andinvoke the MCMC process is included in the supplementary package. SLMMdata <- list(y = y, x1 = X[,1], x2 = X[,2], x3 = X[,3],x4 = X[,4], x5 = X[,5], x6 = X[,6],Dist = distmatrix)SLMMConsts <- list(S = 159, M = 50, D = 100)SLMMInits <- list(tau_y = 1,latent = rep(1, SLMMConsts$S), alpha = 2,tau_bm = 1,mu_bm = rnorm(6), phi = 1, tau_w = 1,vlatent = rbeta(SLMMConsts$M - 1, 1, 1))

The estimated parameters, together with the cluster assignments z , are determined for eachreplicate from the best post burn-in iteration selected using the Dahl’s method (Dahl, 2006).Dahl (2006) proposed a least-square model-based clustering for estimating the clustering ofobservations using draws from a posterior clustering distribution. In this method, mem-bership matrices for each iteration, B (1) , . . . , B ( M ) , where M is the number of post-burn-inMCMC itertations, are calculated. The membership matrix for the c th iteration, B ( c ) is12eﬁned as: B ( c ) = ( B ( c ) ( i, j )) i,j ∈{ n } = 1( z ( c ) i = z ( c ) j ) n × n , (12)with 1() being the indicator function, B ( c ) ( i, j ) ∈ { , } for all i, j = 1 , ..., n and c = 1 , . . . , M .Having B ( c ) ( i, j ) = 1 means observations i and j are in the same cluster in the c th iteration.The average of B (1) , . . . , B ( M ) can be calculated as B = 1 M M (cid:88) c =1 B ( c ) , where (cid:80) here denotes element-wise summation of matrices. The ( i, j )th entry of B providesan empirical estimate of the probability for locations i and j to be in the same cluster.Next we ﬁnd the iteration that has the least squared distance to B as: C LS = arg min c ∈ (1: M ) n (cid:88) i =1 n (cid:88) j =1 (cid:0) B ( c ) ( i, j ) − B ( i, j ) (cid:1) , (13)where B ( c ) ( i, j ) is the ( i, j )th entry of B ( c ) , and B ( i, j ) is the ( i, j )th entry of B . Anadvantage of the least-squares clustering is the fact that information from all clusterings areutilized via the usage of the empirical pairwise probability matrix B . It is also intuitivelyappealing, as the average clustering is selected instead of formed via an external, ad hoc clustering algorithm. In the spatial regression model, the Gaussian process spatial structure Σ W = σ w H canbe constructed via several diﬀerent weighting schemes including the aforementioned unity,exponential, and Gaussian schemes in (2). In order to determine which weighting scheme isthe most suitable for the data, a commonly used model comparison criterion, the logarithm13f the pseudo-Marginal likelihood (LPML; Ibrahim et al., 2013), is applied. The LPMLcan be obtained through the conditional predictive ordinate (CPO) values. With Y ∗ ( − i ) =( Y , . . . , Y i − , Y i +1 , . . . , Y n ) denoting the observations with the i th subject response deleted,CPO can be regarded as leave-one-out-cross-validation under Bayesian framework, and itestimates the probability of observing Y i in the future if after having already observed Y ∗ ( − i ) .The CPO for the i th subject is calculated as:CPO i = (cid:90) f ( y ( s i ) | β ( s i ) , w ( s i ) , σ y ) π ( w ( s ) , β ( s ) , σ y | Y ∗ ( − i ) ) d (cid:0) w ( s ) , β ( s ) , σ y (cid:1) , (14)where π ( w ( s ) , β ( s ) , σ y | Y ∗ ( − i ) ) = (cid:81) j (cid:54) = i f ( y ( s j ) | β ( s j ) , w ( s j ) , σ y ) π ( w ( s ) , β ( s ) , σ y | Y ∗ ( − i ) ) c ( Y ∗ ( − i ) ) , and c ( Y ∗ ( − i ) ) is the normalizing constant. Within the Bayesian framework, a Monte Carloestimate of the CPO can be obtained as: (cid:91) CPO − i = 1 M M (cid:88) t =1 f ( y ( s i ) | w t ( s i ) , β t ( s i ) , σ yt ) , (15)where w t ( s i ) is calculated based on the sampled φ in the t -th iteration, and β t ( s i ) and σ yt are, respectively, the t -th iteration samples for β ( s i ) and σ y . An estimate of the LPML cansubsequently be calculated as: (cid:92) LPML = N (cid:88) i =1 log (cid:16) (cid:91) CPO i (cid:17) . (16)A model with a larger LPML value is preferred. In addition, p D , the eﬀective number ofparameters, which can be used to measure the complexity of the model, is deﬁned as p D = D − D ( θ ) , (17)14here D = − f ( y ( s ) | θ ) is the deviance of the model, D is the posterior mean ofdeviance, θ is the posterior mean of the parameters and D ( θ ) denotes deviance at posteriormeans. We use the Rand index (RI; Rand, 1971) to measure the accuracy of clustering. The RI isdeﬁned as RI = ( a + b ) / ( a + b + c + d ) = ( a + b ) / (cid:18) n (cid:19) , where C = { X , . . . , X r } and C = { Y , . . . , Y s } are two partitions of { , , . . . , n } , and a, b, c and d respectively denote the number of pairs of elements of { , , . . . , n } that are (a) in asame set in C and a same set in C , (b) in diﬀerent sets in C and diﬀerent sets in C , (c)in a same set in C but in diﬀerent sets in C , and (d) in diﬀerent sets in C and a same setin C . The RI ranges from 0 to 1 with a higher value indicating better agreement betweenthe two partitions. In particular, RI = 1 indicates that C and C are identical in terms ofmodulo labeling of the nodes. In this section, we conduct simulation studies to assess the performance of the proposedmethods under scenarios where there is no clustered covariate eﬀect, and when there isindeed clustered covariate eﬀect. All simulations are run on an institutional high performancecomputing cluster running Red Hat Enterprise Linux Server (release 6.7).15 .1 Simulation Without Clustered Covariate Eﬀects

The spatial adjacency structure of counties in Georgia is used. As a starting point, to mimicthe real dataset we use later, one observation is generated for each of the 159 counties. Sixcovariate vectors are generated for the 159 counties with each entry i.i.d. from N (0 , × X . The spatial random eﬀects w are simulated basedon the matrix of great circle distance (GCD) between county centroids. The great circledistances are obtained using the function distCosine() , and the centroids are calculated basedon county polygons using the function centroid() , both provided by the R package geosphere (Hijmans, 2017). The GCD matrix is subsequently normalized to have a maximum value of10 for ease in computation. and the response vector Y is generated as Y = Xβ + w + (cid:15) , where X = ( X , . . . , X ), w ∼ MVN( , exp( − GCD / (cid:15) ∼ MVN( , I ). Diﬀerentvalues of β are used: (1 , , , , . , (cid:62) , (2 , , , , , (cid:62) , and (9 , , − , , , (cid:62) , correspondingto scenarios where the signal is weak, moderate, and strong. For each of the three β ’s,the average parameter estimate denoted by (cid:98) β (cid:96),m ( (cid:96) = 1 , · · · , m = 1 , · · · ,

6) in 100simulations is calculated as (cid:98) β (cid:96),m = 1100 (cid:88) r =1 (cid:98) β (cid:96),m,r , (18)where (cid:98) β (cid:96),m,r denotes the posterior estimate for the m th coeﬃcient of county (cid:96) in the r threplicate. The performance of these posterior estimates are evaluated by the mean absolutebias (MAB), the mean standard deviation (MSD), the mean of mean squared error (MMSE)and mean coverage rate (MCR) of the 95% highest posterior density (HPD) intervals in the16ollowing ways: MAB = 1159 (cid:88) (cid:96) =1 (cid:88) r =1 (cid:12)(cid:12)(cid:12) (cid:98) β (cid:96),m,r − β (cid:96),m (cid:12)(cid:12)(cid:12) , (19)MSD = 1159 (cid:88) (cid:96) =1 (cid:118)(cid:117)(cid:117)(cid:116) (cid:88) r =1 (cid:16) (cid:98) β (cid:96),m,r − (cid:98) β (cid:96),m (cid:17) , (20)MMSE = 1159 (cid:88) (cid:96) =1 (cid:88) r =1 (cid:16) (cid:98) β (cid:96),m,r − β (cid:96),m (cid:17) , (21)MCR = 1159 (cid:88) (cid:96) =1 (cid:88) r =1 (cid:16) (cid:98) β (cid:96),m,r ∈

95% HPD interval (cid:17) , (22)In each replicate, the MCMC chain length is set to be 50,000, with thinning 10 and theﬁrst 2,000 samples are discarded as burn-in, therefore we have 3,000 samples for posteriorinference. The parameter D for the uniform prior of bandwidth, i.e. φ in Equation (2),is set to 100 such that the prior for bandwidth is also noninformative. In Table 1 theaverage parameter estimates (cid:98) β (cid:96),m are reported together with the four performance measuresin Equations (19),(20), (21) and (22) are reported for the three settings. Under all threesettings, the parameter estimates are highly close to the true underlying values, and havevery small MAB, MSD and MMSE, while maintaining the MCR at close to 95% level. TheRI’s are all very close to or equal to 1, indicating that the clustering results are highlyconsistent and credible. It is worth noticing that, even when the signal is relatively weak,the clustering approach is quite precise. We consider an underlying setting where there exist clustered covariate eﬀects. First weconsider a setting where the clustered covariate eﬀect is independent of spatial locations, i.e.where cluster belonging are set randomly. The 159 counties are randomly assigned to threeclusters, visualized in Figure 1(a). There are, respectively, 51, 49, and 59 counties in thethree clusters. Diﬀerent parameter vectors are used for data generation in diﬀerent clusters17able 1: Average parameter estimates and performance of parameter estimates and cluster-ing results when without clustered covariate eﬀect. (cid:98) β MAB MSD MMSE MCR RISetting 1 β β β β -0.003 0.090 0.071 0.005 0.970 β β β β -0.001 0.090 0.072 0.005 0.980 β β β β β β -0.001 0.088 0.069 0.005 0.970 β -3.996 0.089 0.069 0.005 0.960 β -0.002 0.091 0.072 0.005 0.960 β β Longitude La t i t ude Cluster (a)

Longitude La t i t ude Cluster (b)

Figure 1: Visualization of (a) random cluster assignment and (b) regional cluster assignmentfor Georgia counties used for simulation studies.18able 2: True parameter vectors used in data generation for three clusters.Cluster 1 Cluster 2 Cluster 3Setting 1 (1, 0, 1, 0, 0.5, 2) (1, 0.7, 0.3, 2, 0, 3) (2, 1, 0.8, 1, 0, 1)Setting 2 (2, 0, 1, 0, 4, 2) (1, 0, 3, 2, 0, 3) (4, 1, 0, 3, 0, 1)Setting 3 (9, 0, -4, 0, 2, 5) (1, 7, 3, 6, 0, -1) (2, 0, 6, 1, 7, 0)(see Table 2) to assess the estimation and clustering performance under diﬀerent strengthsof signals. The spatial random eﬀect W is generated using the same setting as before.The performance measures are presented in Table 3. In another scenario, a setting wherethe clustered covariate eﬀect depends on spatial locations. Consider a partition of Georgiacounties into three large regions, visualized in Figure 1(b). The same parameter vectors inTable 2 are used for the three clusters under three settings. Corresponding performancemeasures are reported in Table 4.For each signal strength and each of the two settings, we randomly selected four replicatesfrom the total of 100 replicates and visualize the results in the Online Supplement. It is nosurprise that under both settings, the accuracy of clustering increases with the strengtheningof signals. It can be seen from Supplemental Figures 1 and 4 that with weak true signals,the proposed approach suﬀers from over-clustering, which is a known property of Dirichletprocess mixtures that the posterior does not concentrate at the true number of clusters (see,e.g., Miller and Harrison, 2013). This over-clustering behavior, however, diminishes as thesignals’ strength increase. When the signals are strong, the RI reaches near 0.85, indicatingthat 85% of the time, two counties that belong to the same cluster are correctly put into thesame cluster. Together with increase in RI is decrease in MCR, which is an inevitable resultof incorporating more counties in each cluster. For each county, taking other counties thatdo not belong to this county’s true cluster introduces bias in estimation.19able 3: Performance of parameter estimates and clustering results under the scenario wherecluster belongings are set randomly.MAB MSD MMSE MCR RISetting 1 β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β Real Data Analysis

We consider analyzing inﬂuential factors for monthly housing cost in Georgia using the pro-posed methods. The dataset is available at , with 159observations corresponding to the 159 counties in Georgia. For each county, the dependentvariable median monthly housing cost for all occupied housing units is observed. The inde-pendent variables considered here include: the percentage of adults aged 18 to 64 who areunemployed ( X ), the average total real and personal property taxes collected per person( X ), the median home market value ( X , in thousand dollars), the percentage of White racepopulation ( X ), the median age ( X ), and size of a county’s population ( X , in thousands).Figure 2 provides a visualization of the 6 covariates on the Georgia map. In the computation,the covariates are centered and scaled to have mean 0 and unit standard deviation. Also,following the common practice in economics to account for long-tailed distributions, we takethe logarithm of monthly housing cost before ﬁtting the model. The response variable is alsocentered and scaled, and therefore all models to follow are ﬁtted without the intercept term.We ﬁrstly apply the model assessment criteria, LPML, for selecting the best weightingscheme for the data. The LPML values for the unity weighting scheme, the exponentialweighting scheme and the Gaussian weighting scheme are shown in Table 5. From Table 5we can see that the model with exponential weighting scheme has the largest LPML valueamong the three candidate schemes, and is therefore preferred. To verify that there isindeed spatially varying covariate eﬀects, we also ﬁtted the spatially-varying coeﬃcientsmodel without clustering (4) to the dataset. The model is also compared against a vanillaBayesian regression, where no spatial eﬀect is considered, and observations are treated asi.i.d. samples from the population. In this model, no spatial variation is assumed in thecovariate eﬀects β , and the model reduces to the regular Bayesian linear regression.The computation is performed on a desktop computer running Windows 10 Enterprise,with i7-8700K CPU @ 3.70GHz. The computing time as well as performance measure for22 Longitude La t i t ude Value

Unemployment.Rate

Longitude La t i t ude Value

Property.Tax

Longitude La t i t ude Value

House.Market.Price

Longitude La t i t ude Value

White.Race

Longitude La t i t ude Value

Age

Longitude La t i t ude Value

Population

Figure 2: Visualizations of covariate values in counties of Georgia.Table 5: LPML values for diﬀerent weighting schemes in the proposed model.Unity Exponential GaussianLPML -165.784 -165.620 -217.878these three models are recorded and presented in Table 6. The proposed model takes around650 seconds to run, followed by the spatially varying coeﬃcients model, and then the spatiallyconstant coeﬃcients model. The LPML values of the ﬁrst two models that allow for spatiallyvarying coeﬃcients are larger than the vanilla regression model, and the diﬀerences are notminor. This indicates that there indeed exist spatially varying covariate eﬀect, and moreﬂexible models are preferred. Comparing the LPML and p D for the ﬁrst two models, it canbe seen that the proposed model reduces p D and provides better ﬁt to the data. Combiningthe conclusions from Tables 5 and 6, the proposed model with exponential weighting schemefor W ( s ) is ﬁtted on the dataset.A total of 3 clusters are identiﬁed. The cluster belongings of the 159 counties are visu-23able 6: LPML and p D values and computing time for diﬀerent models.The proposed model Spatial-varying coeﬃcients model Vanilla regressionLPML -165.620 -174.171 -203.013 p D In this paper, we have proposed a Bayesian clustered coeﬃcients linear regression model withspatial random eﬀects to capture heterogeneity of regression coeﬃcients. Multiple weightingschemes in modeling the spatial random eﬀects have been proposed, and the corresponding24

Longitude La t i t ude Cluster

Figure 3: Clusters produced by the proposed approach.Table 7: Parameter estimates and their 95% HPD intervals for the three clusters identiﬁed.Cluster 1 2 3 (cid:98) β (cid:98) β -0.091 0.043 -0.047(-0.217, 0.040) (-0.404, 0.433) (-1.19, 1.08) (cid:98) β (cid:98) β -0.078 -0.146 0.151(-0.295, 0.102) (-0.648, 0.313) (-0.910, 1.509) (cid:98) β -0.004 -0.998 -0.994(-0.205, 0.192) (-1.478, -0.578) (-1.899, 0.186) (cid:98) β References

Autant-Bernard, C. and J. P. LeSage (2019). A heterogeneous coeﬃcient approach to theknowledge production function.

Spatial Economic Analysis 14 (2), 196–218.Bill´e, A. G., R. Benedetti, and P. Postiglione (2017). A two-step approach to account forunobserved spatial heterogeneity.

Spatial Economic Analysis 12 (4), 452–471.Brunsdon, C., A. S. Fotheringham, and M. E. Charlton (1996). Geographically weighted26egression: a method for exploring spatial nonstationarity.

Geographical Analysis 28 (4),281–298.Carlin, B. P., A. E. Gelfand, and S. Banerjee (2014).

Hierarchical Modeling and Analysis forSpatial Data . Chapman and Hall/CRC.Chasco, C. and J. L. Gallo (2015). Heterogeneity in perceptions of noise and air pollution:A spatial quantile approach on the city of Madrid.

Spatial Economic Analysis 10 (3),317–343.Cressie, N. (1992). Statistics for spatial data.

Terra Nova 4 (5), 613–617.Dahl, D. B. (2006). Model-based clustering for expression data via a Dirichlet process mixturemodel.

Bayesian Inference for Gene Expression and Proteomics 4 , 201–218.de Valpine, P., D. Turek, C. J. Paciorek, C. Anderson-Bergman, D. T. Lang, and R. Bodik(2017). Programming with models: writing statistical algorithms for general model struc-tures with NIMBLE.

Journal of Computational and Graphical Statistics 26 (2), 403–413.Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems.

Annals ofStatistics 1 (2), 209–230.Gelfand, A. E., H.-J. Kim, C. Sirmans, and S. Banerjee (2003). Spatial modeling withspatially varying coeﬃcient processes.

Journal of the American Statistical Associa-tion 98 (462), 387–396.Gelfand, A. E. and E. M. Schliep (2016). Spatial statistics and Gaussian processes: Abeautiful marriage.

Spatial Statistics 18 , 86–104.Geng, J., A. Bhattacharya, and D. Pati (2019). Probabilistic community detection with un-known number of communities.

Journal of the American Statistical Association 114 (526),893–905. 27reen, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesianmodel determination.

Biometrika 82 (4), 711–732.Hijmans, R. J. (2017). geosphere: Spherical Trigonometry . R package version 1.5-7.Hu, G. and J. Bradley (2018). A Bayesian spatial-temporal model with latent multivariatelog-gamma random eﬀects with application to earthquake magnitudes.

Stat 7 (1), e179.e179 sta4.179.Hu, G., J. Geng, Y. Xue, and H. Sang (2020). Bayesian spatial homogeneity pursuit offunctional data: an application to the U.S. income distribution.

Arxiv . Preprint.Ibrahim, J. G., M.-H. Chen, and D. Sinha (2013).

Bayesian Survival Analysis . SpringerScience & Business Media.Kulldorﬀ, M. and N. Nagarwalla (1995). Spatial disease clusters: detection and inference.

Statistics in Medicine 14 (8), 799–810.Lawson, A. B., J. Choi, and J. Zhang (2014). Prior choice in discrete latent modeling ofspatially referenced cancer survival.

Statistical Methods in Medical Research 23 (2), 183–200.Li, F. and H. Sang (2019). Spatial homogeneity pursuit of regression coeﬃcients for largedatasets.

Journal of the American Statistical Association , 1–21.Li, P., S. Banerjee, T. A. Hanson, and A. M. McBean (2015). Bayesian models for detectingdiﬀerence boundaries in areal data.

Statistica Sinica 25 (1), 385–402.Ma, Z., Y. Xue, and G. Hu (2019a). Geographically weighted regression analysis for spatialeconomics data: a Bayesian recourse. Technical report, University of Connecticut.Ma, Z., Y. Xue, and G. Hu (2019b). Nonparametric analysis of income distributions amongdiﬀerent regions based on energy distance with applications to China Health and NutritionSurvey data.

Communications for Statistical Applications and Methods 26 (1), 57–67.28iller, J. W. and M. T. Harrison (2013). A simple example of Dirichlet process mixtureinconsistency for the number of components. In

Advances in Neural Information ProcessingSystems , pp. 199–206.Panzera, D., R. Benedetti, and P. Postiglione (2016). A Bayesian approach to parameterestimation in the presence of spatial missing data.

Spatial Economic Analysis 11 (2),201–218.Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods.

Journal ofthe American Statistical Association 66 (336), 846–850.Salinas-P´erez, J. A., M. L. Rodero-Cosano, C. R. Garc´ıa-Alonso, and L. Salvador-Carulla(2015). Applying an evolutionary algorithm for the analysis of mental disorders in macro-urban areas: The case of Barcelona.

Spatial Economic Analysis 10 (3), 270–288.Sethuraman, J. (1991). A constructive deﬁnition of Dirichlet priors.

Statistica Sinica 4 (2),639–650.Xue, Y., E. D. Schifano, and G. Hu (2019). Geographically weighted Cox regression and itsapplication to prostate cancer survival data in Louisiana.

Geographical Analysis . Forth-coming.Zhang, J. and A. B. Lawson (2011). Bayesian parametric accelerated failure time spatialmodel and its application to prostate cancer.

Journal of Applied Statistics 38 (3), 591–603.Zhao, P., H.-C. Yang, D. K. Dey, and G. Hu (2020). Bayesian spatial homogeneity pursuitregression for count value data.