Geographically Weighted Regression Analysis for Spatial Economics Data: a Bayesian Recourse
GGeographically Weighted Regression Analysis for SpatialEconomics Data: a Bayesian Recourse
Zhihua Ma Yishu Xue Guanyu Hu
Abstract
The geographically weighted regression (GWR) is a well-known statistical approachto explore spatial non-stationarity of the regression relationship in spatial data analy-sis. In this paper, we discuss a Bayesian recourse of GWR. Bayesian variable selectionbased on spike-and-slab prior, bandwidth selection based on range prior, and modelassessment using a modified deviance information criterion and a modified logarithmof pseudo-marginal likelihood are fully discussed in this paper. Usage of the graphdistance in modeling areal data is also introduced. Extensive simulation studies arecarried out to examine the empirical performance of the proposed methods with bothsmall and large number of location scenarios, and comparison with the classical fre-quentist GWR is made. The performance of variable selection and estimation of theproposed methodology under different circumstances are satisfactory. We further ap-ply the proposed methodology in analysis of a province-level macroeconomic data of30 selected provinces in China. The estimation and variable selection results revealinsights about China’s economy that are convincing and agree with previous studiesand facts.Keywords: MCMC, Model Assessment, Spatial Econometrics, Variable Selection a r X i v : . [ s t a t . A P ] J u l ntroduction For geographically sparse data with inherent spatial variability, estimating coefficients of aregression model for a particular location based on only observations from this location isnot feasible due to the small number of observations. The geographically weighted regression(GWR; Brunsdon et al., 1996) is an important tool to explore spatial non-stationarity of theregression relationship in spatial data analysis. It has been applied to a variety of fields, in-cluding geology, environmental science, epidemiology, and econometrics. Fotheringham et al.(2002) has summarized the basic theory, statistical inference, and bandwidth selection forGWR, and proposed natural extensions of GWR under the generalized linear model frame-work. The basic idea of GWR is to make use of information from nearby locations. Theweighting idea is a natural strategy to use in light of Tobler’s first law that “near things aremore related than distant things” (Tobler, 1970). In estimating the parameter for one spe-cific location, subjects in the data are weighed according to their distance from this location,with greater weight for closer subjects. P´aez et al. (2002a,b) proposed estimation and infer-ence for GWR under a maximum-likelihood-based framework. Mei et al. (2006) proposed amixed geographically weighted regression model which included both spatially-varying andnon-spatially-varying coefficients, and gave a testing procedure of important explanatoryvariables. Wang et al. (2008) proposed a local linear-based GWR for the spatially vary-ing coefficient models which can significantly improve GWR. More recently, da Silva andFotheringham (2016) discussed the multiple testing issue, and proposed a solution whichoutperforms other solutions such as the Bonferroni procedure under the GWR framework.The aforementioned works discussed the GWR in the frequentist fashion. From the Bayesianperspective, LeSage (2004) proposed a Bayesian GWR, which gives a prior distribution onthe parameter vector depending on historical knowledge. The proposed model, however, usedcross-validation for bandwidth selection, which relies on a user-specified grid of bandwidth,and is computationally intensive like other cross-validation based methods.2n this paper, we propose Bayesian techniques for the GWR using the likelihood-basedapproach in P´aez et al. (2002a,b). In addition to Bayesian estimation and inference, a spikeand slab (Ishwaran et al., 2005) prior is applied for variable selection for Bayesian GWR.Furthermore, bandwidth selection and weighting scheme selection are discussed based onprior selection and Bayesian model selection criteria. An introduction to the implementationof GWR based on nimble (de Valpine et al., 2017), a relatively new and powerful R packagefor Bayesian inference, is presented as an open source repository on GitHub. Our simulationstudies showed the promising empirical performance of the proposed methods in both non-spatially varying and spatially varying cases. In addition, our proposed Bayesian approachreveals interesting features of the province-level macroeconomic data in China.The rest of this paper is organized as follows. In Section “Geographically WeightedRegression”, the GWR and its weighting schemes are discussed. Section “Bayesian Recoursefor GWR” gives a detailed discussion of Bayesian inference, variable selection, bandwidthselection, and model assessment for the GWR. Extensive simulation studies are conductedin Section “Simulation Studies” to investigate the empirical performance of the proposedmethods. In Section “Real Data Analysis”, we implement our model using province-levelmacroeconomic data in China from year 2012 to year 2016. Finally, Section “Discussion”contains a brief summary of this paper. Geographically Weighted Regression
Geographically Weighted Regression
From Brunsdon et al. (1998) , the GWR model can be written as: y ( s ) = β ( s ) x ( s ) + ... + β p ( s ) x p ( s ) + (cid:15) ( s ) (1)3here y ( s ) is the response variable at location s , β i ( s ), i = 1 , , ..., p are the coefficients ofindependent variables at location s , and (cid:15) ( s ) is the random effect at location s , assumedto follow N (0 , σ ). In addition, we also assume cov( (cid:15) ( (cid:96) ) , (cid:15) ( m )) = 0 for any (cid:96) (cid:54) = m . Givena weighting function, the weights of each observation can be calculated with the distancebetween that observation and s . Estimation of coefficients at location s can be formulatedin a way similar to the weighted least squares: (cid:98) β ( s ) = (cid:0) X (cid:62) W ( s ) X (cid:1) − X (cid:62) W ( s ) Y, (2)where X is the n × p matrix of covariates, Y is the n × W ( s ) =diag( w ( s ) , ..., w n ( s )) is a diagonal matrix of the weights. Spatial Weighting Functions and Distances
We first introduce several spatial weighting functions that can be used in GWR. Notice thatthe weighting scheme of ordinary least squares can be defined in the following form: w jk = 1 , ∀ j, k (3)where j represents the location of the observations, and k represents the location for whichparameters are estimated. In a global model where observations from all locations are usedto estimate one vector of coefficients, each observation is assigned a weight of unity.A first step to consider locality is to include observations that are only within a certaindistance d from the target location, i.e., w jk = d jk ≤ d , (4)where d jk is the distance between locations j and k . This weighting scheme is one of the sim-4lest to calculate. It is, however, a discontinuous function of distance, which can sometimeslead to undesired jumps in the estimated parameter surface. In order to get a continuousweighting function, the exponential function and the Gaussian function can also be used.The exponential weighting scheme can be written as: w jk = exp ( − d jk /b ) , (5)where b is the bandwidth that can be chosen appropriately to control the decay with respectto distance. The Gaussian weighting scheme can be written as: w jk = exp (cid:0) − ( d jk /b ) (cid:1) . (6)Both (5) and (6) are decreasing functions of d jk , which, intuitively, indicates that an ob-servation very far away from the location of interest contributes little in the estimation ofparameters at this location. In order to provide a continuous, near-Gaussian weighting func-tion up to distance b from the estimation point, and then zero weights for any data pointbeyond b , Brunsdon et al. (1996, 1998); Fotheringham et al. (1998) proposed the bi-squarefunction: w jk = (1 − ( d jk /b ) ) d jk < b . (7)For the bi-square kernel, by tuning the threshold b , one can control the number of neighborsthat are used to estimate the parameters for the location of interest. The weighting schemesmentioned above are the most popular schemes used in the GWR.We then briefly discuss different choices of distance function. The Euclidean distance,defined as d jk = (cid:113) (latitude j − latitude k ) + (longitude j − longitude k ) , is one of the most popular choices when the precise (latitude, longitude) location of each5bservation is available. However, in some public health and epidemiology studies, or somesocioeconomics studies, data are collected and summarized on a higher level than singleobservations, such as wards (Brunsdon et al., 1996) or counties (Xue et al., 2019), whichproduces areal data instead of point-reference data. All observations within the same areaare assigned the same (latitude, longitude). For example, Hu and Huffer (2020) attributedeach county’s observations to its centroid. Note that the Euclidean distance is easily affectedby the areas of the administrative divisions, which additionally complicates the process ofparameter tuning as there is no golden benchmark measure of distance.An alternative distance measure when we have areal data is the graph distance (M¨ulleret al., 1987; Bhattacharyya and Bickel, 2014). The administrative devisions are regardedas vertices of a graph G , denoted as v , . . . , v n . The graph G also includes a set of edges, E ( G ) = { e , . . . , e m } , where each edge connects a pair of vertices. The graph distance isdefined as: d v i v j = | V ( e ) | if e is the shortest path connecting v i and v j ∞ if v i and v j are not connected , (8)where | V ( e ) | denotes the number of edges in e .While it remains a subjective problem in choosing appropriate bandwidths and thresholdsfor the geographical distance based methods, i.e. one has to decide “how close is closeenough”, a natural definition of closeness would derive from the graph distance. Countiessharing a common boundary, i.e. having graph distance 1, are close, while having graphdistance greater than 1 indicates “not close”, and observations in these far neighboringcounties need to be weighed down. A graph distance based weighting function would be w jk = d Gjk ≤ f ( d Gjk | b ) otherwise , (9)6here d Gjk denotes the graph distance, and f is a certain weighting function with bandwidth b . In this work, we choose f () to be the negative exponential function, i.e., w jk = d Gjk ≤ (cid:0) − d Gjk /b (cid:1) otherwise . (10) Bayesian Recourse for GWR
In this section, we propose the posterior estimation, variable selection, and bandwidth se-lection for the Bayesian GWR model. The proposed methods are implemented with thepowerful R package nimble . The code and documentation can be found at GitHub. Bayesian Estimation for GWR
According to Boscardin and Gelman (1993) and P´aez et al. (2002a,b), we can get the esti-mation of GWR using Bayesian computation. The likelihood function of this model can bewritten as: Y | β ( s ) , X, W ( s ) , σ ( s ) ∼ MVN( Xβ ( s ) , σ ( s ) W − ( s )) , (11)where MVN indicates the multivariate normal distribution. In order to have a conjugateposterior distribution, we can set the priors of β ( s ) and σ ( s ) as: β ( s ) | Σ β ∼ N p (0 , Σ β ) , (12)where Σ β is a diagonal matrix, and σ ( s ) ∼ IGamma( α , α ) , j = 1 , . . . , p, (13)7here α , α are the hyper-parameters for distributions of σ ( s ). One set of non-informativechoices of hyper-parameters is to set Σ β = 100 I p and α = α = 0 .
01 (Gelman et al., 2013).The posterior distribution can be written as: p (cid:0) β ( s ) , σ ( s ) | Y, X, W ( s ) (cid:1) ∝ p (cid:0) Y | β ( s ) , X, W ( s ) , σ ( s ) (cid:1) × p ( β ( s ) | Σ β ) × p (cid:0) σ ( s ) (cid:1) . (14)According to (14), we can use Markov chain Monte Carlo (MCMC, Gelman et al., 2013) toestimate β ( s ) and σ ( s ). Bayesian Variable Selection
We first consider the regression problem for one location. Following the procedure of Georgeand McCulloch (1993), the spike and slab prior for β j ( s ) can be formulated as: β j ( s ) | γ j ∼ (1 − γ j ) N (0 , τ j ) + γ j N (0 , c j τ j ) , (15)where P ( γ j = 0) = 1 − P ( γ j = 1) = p j . (16)When γ j = 0, β j ( s ) ∼ N (0 , τ j ), and when γ j = 1, β j ( s ) ∼ N (0 , c j τ j ). Our interpretation ofthis prior is: we set τ j small enough so that if τ j = 0, β j ( s ) would be so small that we can“safely” estimate it as 0; inversely, we set c j large so that if γ j = 1, we include the β j ( s ) intoour final model. For the prior of γ j , we set γ j ∼ Bernoulli(0 . ayesian Bandwidth Selection In GWR, it is important to choose a proper bandwidth for the weighting functions. In theBayesian approach, a prior can be given to the bandwidth b so that the optimal bandwidthcan be simultaneously obtained together with the estimation of other parameters. The prioralso depends on which measure of distance is used. A more detailed discussion of distancemeasures is given in Section “Spatial Weighting Function and Distances”. Using similarideas as in Boehm Vock et al. (2015), a prior for bandwidth can be set as: b ∼ Uniform(0 , D ) , where D is the upper limit for the support of the distribution of b . Without any priorknowledge, D can be chosen large enough so that we start from a noninformative priorfor the bandwidth, i.e., we start from an approximate global model where observationsare always weighed equally. There are also some other choices of prior distributions forthe bandwidth, such as the gamma distribution or discrete uniform distribution. If priorinformation is available about the bandwidth, parameters for the prior distributions can beset to incorporate such information. Our proposed model can be summarized as follows: Y | β ( s ) , X, W ( s ) , σ ( s ) ∼ MVN( Xβ ( s ) , σ ( s ) W − ( s )) (17) β j ( s ) | γ j , τ j ∼ (1 − γ j ) N (0 , τ j ) + γ j N (0 , c j τ j ) (18) τ j ∼ IGamma( α , α ) (19) γ j ∼ Bernoulli(0 .
5) (20) w i ( s ) = f ( d i | b ) (21) b ∼ Uniform(0 , D ) (22)where “IGamma” denotes the inverse-Gamma distribution, and f is the weighting functionintroduced in (9), i = 1 , ..., n and j = 1 , ..., p . As we incorporate the prior of b into our9odel, conjugate posterior distribution for b cannot be obtained. Therefore, we use theMetropolis–Hastings Algorithm (MH; Gelman et al., 2013) to estimate the parameters. Bayesian Model Assesment
In Section “Spatial Weighting Function and Distances”, we introduced several spatial weight-ing functions that can be used in the GWR. In order to select the weighting scheme thatfits the data best, we apply the most commonly used tools, the Deviance Information Crite-rion (DIC; Spiegelhalter et al., 2002) and the Logarithm of the Pseudo-Marginal Likelihood(LPML; Ibrahim et al., 2013), for model selection. The DIC is defined as:DIC = Dev( θ ) + 2 p D , (23)where θ and θ represent the parameter of interest and the corresponding posterior mean. Theterm Dev( · ) denotes the deviance function, while p D is the effective number of parameters inthe model, given by p D = Dev( θ ) − Dev( θ ). For the GWR model in our paper, the followingdeviance function can be specified (Ma et al., 2018):Dev( β ( s ) , W ( s ) , σ ( s )) = − f ( Y | β ( s ) , X, W ( s ) , σ ( s ))= n log(2 π ) + log( σ ( s )) − log( | W ( s ) | ) + ( Y − Xβ ( s )) (cid:62) σ − ( s ) W ( s )( Y − Xβ ( s )) , where n is the total number of the observations. Therefore, the DIC for the GWR modelcan be given as: DIC = Dev( β ( s ) , W ( s ) , σ ( s )) + 2 p D = 2Dev( β ( s ) , W ( s ) , σ ( s )) − Dev( β ( s ) , W ( s ) , σ ( s )) , β ( s ), W ( s ) and σ ( s ) are posterior estimates obtained from MCMC results. A smallervalue of DIC indicates a better model. It can be regarded as the Bayesian equivalent of AIC,where the term p D is the penalty term for model complexity, similar to the p , i.e., dimensionof parameter space, in AIC. Similar to AIC, DIC also takes both fitness and model complexityinto account simultaneously.The LPML is constructed based on the Conditional Predictive Ordinate (CPO) values,which are estimates of the probability for observing Y i given that all other responses havebeen observed. Let D ( − i ) = { Y j : j = 1 , · · · , i − , i + 1 , · · · , N } denote the observed datawith the i th subject response deleted. The CPO for the i th subject is defined as:CPO i = (cid:90) f ( Y | β ( s ) , X, W ( s ) , σ ( s )) π ( W ( s ) , β ( s ) , σ ( s ) | D ( − i ) ) d (cid:0) W ( s ) , β ( s ) , σ ( s ) (cid:1) , (24)where π ( W ( s ) , β ( s ) , σ ( s ) | D ( − i ) ) = (cid:81) j (cid:54) = i f ( Y j | β ( s ) ,X,W ( s ) ,σ ( s )) π ( W ( s ) ,β ( s ) ,σ ( s ) | D ( − i ) ) c ( D ( − i ) ) , and c ( D ( − i ) )denotes the normalizing constant. Within the Bayesian framework, a Monte Carlo estimateof the CPO can be obtained as: (cid:100) CPO − i = 1 T T (cid:88) t =1 f ( Y i | X i , β t ( s ) , W t ( s ) , σ t ( s )) , where T is the total number of Monte Carlo iterations. Then an estimate of the LPML isgiven by: (cid:100) LPML = N (cid:88) i =1 log( (cid:100) CPO i ) . (25)A model with a larger LPML value indicates that it is more preferred.11 imulation Studies In this section, we present the performance of the proposed estimation and variable selectiontechniques under scenarios where the covariate effects do not vary spatially, and where thecovariate effects do vary spatially. We use the spatial structure of 30 selected provincesin China in our simulations. A map of these provinces with their names is presented inFigure 1(a). Specifically, Hainan province is an island, and is therefore not connected withany others, which makes its graph distance with any other province infinity. It is, however,very close to Guangdong province, and they bear a lot of resemblance in both culture andeconomic development. Therefore, we modified the adjacency matrix so that Hainan andGuangdong are adjacent. The graph distance matrix is calculated based on the modifiedadjacency matrix. A visualization of the graph distance matrix is presented in Figure 1(b).Denote the average parameter estimates as β (cid:96),m , calculated as (cid:98) β (cid:96),m = 1100 (cid:88) r =1 (cid:98) β (cid:96),m,r , (26)where (cid:98) β (cid:96),m,r denotes the parameter estimate for the m th coefficient of province (cid:96) in the r threplicate. The parameter estimates are evaluated based on their bias, standard deviation,mean squared error, and coverage rate of the 95% highest posterior density (HPD) intervalsin the following four ways:mean absolute bias (MAB) = 130 (cid:88) (cid:96) =1 (cid:88) r =1 (cid:12)(cid:12)(cid:12) (cid:98) β (cid:96),m,r − β (cid:96),m (cid:12)(cid:12)(cid:12) , (27)mean standard deviation (MSD) = 130 (cid:88) (cid:96) =1 (cid:118)(cid:117)(cid:117)(cid:116) (cid:88) r =1 (cid:16) (cid:98) β (cid:96),m,r − (cid:98) β (cid:96),m (cid:17) , (28)mean of mean squared error (MMSE) = 130 (cid:88) (cid:96) =1 (cid:88) r =1 (cid:16) (cid:98) β (cid:96),m,r − β (cid:96),m (cid:17) , (29)mean coverage rate (MCR) = 130 (cid:88) (cid:96) =1 (cid:88) r =1 I ( β (cid:96),m ∈
95% HPD interval) , (30)12 eijingTianjinHebeiShanxiInnerMong Liaoning JilinHeilongjiangShanghaiJiangsuZhejiangAnhuiFujianJiangxiShandongHenanHubeiHunanGuangdongGuangxiHainanChongqingSichuanGuizhouYunnan ShaanxiGansuQinghai NingxiaXinjiang
80 100 120
Longitude La t i t ude (a) BeijingTianjinHebeiShanxiInnerMongLiaoningJilinHeilongjiangShanghaiJiangsuZhejiangAnhuiFujianJiangxiShandongHenanHubeiHunanGuangdongGuangxiHainanChongqingSichuanGuizhouYunnanShaanxiGansuQinghaiNingxiaXinjiang B e iji n g T i a n ji n H e b e i S h a n x i I n n e r M o n gL i a o n i n g J ili n H e il o n g ji a n g S h a n g h a i J i a n g s u Z h e ji a n g A n h u i F u ji a n J i a n g x i S h a n d o n g H e n a n H u b e i H u n a n G u a n g d o n g G u a n g x i H a i n a n C h o n g q i n g S i c h u a n G u i z h o u Y u n n a n S h a a n x i G a n s u Q i n g h a i N i n g x i a X i n ji a n g Graph.Distance (b)
Figure 1: (a) A map of the 30 selected provinces of China, with their names annotated.(b) Visualization of graph distance matrix for 30 selected provinces in China. Darker colorindicate larger graph distance. 13here β (cid:96),m is the true underlying parameter, and I( · ) denotes the indicator function. Thesemeasures are first calculated for each individual province over replicates, and then averagedover provinces. The variable selection approach is evaluated using both the accuracy ratefor a single variable ACC and for the entire model (Model ACC), defined as:ACC m = (cid:80) r =1 I (covariate m is selected in the final model) β (cid:96),m (cid:54) = 0 , (cid:96) = 1 , . . . , (cid:80) r =1 I (covariate m is not selected in the final model) β (cid:96),m = 0 , (cid:96) = 1 , . . . , , and Model ACC = 1100 (cid:88) r =1 I (the exact true underlying model is selected) . To compare with frequentist approach, the same datasets are also fitted using the classicalfrequentist GWR approach, where the bandwidth selection is made based on minimizing thesummation of SSE across 30 provinces over a grid of candidate bandwidths. The detailsof frequentist bandwdith selection, as well as the parameter estimates obtained using theoptimal bandwdith, are presented in Section 1 of the supplemental material. To demonstratethat the graph distance produces credible parameter estimates, and that at the same timeit circumvents the additional effort of threshold selection in weighting kernels such as (4)and (10), simulation study is done for the same designs to be presented, with the greatcircle distance used. The results are reported in Section 2 of the supplemental material.In addition, considering that a total of 150 observations with 30 locations still make asmall sample, an additional simulation study with more than 300 locations using the spatialstructure of census tracts in Hartford, Litchfield, and Middletown counties in Connecticuthas been conducted, and included in Section 3 of the supplemental material.For both the following simulation studies and the supplemental simulation studies, theeffective number of parameters for the frequentist GWR is also calculated as in Brunsdonet al. (2000). Note that the frequentist GWR is based on one bandwidth only, and only the14ull model that includes all covariates is fitted, therefore the frequentist p D should be usedas a reference, instead of a criteria for direct comparison. Simulation Without Spatially Varying Coefficients
Under the scenario where there are no spatially varying coefficients, we generate data usingthe same set of parameters for all provinces. The independent continuous covariates aregenerated i.i.d. from the standard normal distribution N (0 , X , X ,. . . , X ,and we use the matrix X to denote the covariate matrix with 5 columns, with the i th columnbeing X i . The response vector Y is generated as Xβ + (cid:15) , where (cid:15) ∼ MVN( , I ). Differentchoices of β have been used corresponding to different underlying true models. The parameter D for bandwidth is set to be 100. Given that the maximum graph distance in the spatialstructure of the 30 selected provinces is 6, a bandwidth of 100 induces a weighting schemesthat, even if the distance between one certain province and another province whose parameterestimates we want to obtain, this province gets assigned a relative weight of exp( − / b is sufficiently noninformative.For each province, five observations are generated, resulting in 150 observations per replicate.A total of 100 replicates are performed. For each replicate, a chain of length 10,000 is runwithout thinning, where the first 2,000 samples are discarded as burn-in. Three parametersettings similar to in Shao (1997) were used, with β (cid:62) = (2 , , , , , , , , , , , , (cid:98) β MAB MSD MMSE MCR True ACC(%) Model ACC(%) b Setting 1 β β β β β β β β β β β β β β β Simulation with Spatially Varying Coefficients
For estimation and variable selection in the presence of spatially varying coefficients, we usesimilar simulation schemes as in Xue et al. (2019). The graph distance matrix visualizedin Figure 1 is transformed using multidimensional scaling (MDS; Cox and Cox, 2000) andmapped into a Cartesian space. Denoting the transformed coordinates corresponding toprovince (cid:96) as ( x c(cid:96) , y c(cid:96) ), the β vector for province (cid:96) is set to β p,(cid:96) = , x p not in the true model β p + 0 . x c(cid:96) + y c(cid:96) ) , x p in the true model . (31)16 Longitude La t i t ude value b Longitude La t i t ude value b Longitude La t i t ude value b Longitude La t i t ude value b Longitude La t i t ude value b Figure 2: Visualization of parameter surfaces when the parameters vary according to (31).The variation pattern is visualized in Figure 2. The estimation and variable selection resultsfor the setting in (31) are presented in Table 2. It can be seen that in all three settings,the MAB, MSD and MMSE are all larger than when there is no spatially varying covariateeffects. There have been decrease in the MCR for parameters corresponding to covariatesthat are in the true model. The variable selection procedure, however, remains quite robust,and in all replicates of simulation are the correct models selected. The average bandwidthsselected for the three settings are around 67. This indicates that in the presence of spatialvariation, for some replicates of simulation, the bandwidths tend to be large as the gain instabilizing the parameter estimates for each location dominates the incurred bias. This hasalso been observed for the classical frequentist GWR, as presented in Supplemental Table 2.The frequentist GWR has an average effective number of parameters of value 19.32, 20.49and 20.19 under the three settings, and the Bayesian GWR has 179.38, 179.58 and 179.32instead.To study the estimation and variable selection performance under a scenario where re-gional variation exists, we choose to use the four major economic regions of China proposed17able 2: Performance of parameter estimates, and variable selection results when a simplelinear variation pattern is present. “True” denotes whether a covariate is in the true modelor not, and ACC stands for variable selection accuracy rate.MAB MSD MMSE MCR True ACC(%) Model ACC(%) b Setting 1 β β β β β β β β β β β β β β β β (cid:62) ’s under the threesimulation settings are given in Table 3. The estimation and variable selection results arepresented in Table 4. Again, compared to results in Supplemental Table 3, both approachesyield similar MAB, MSD, MMSE and MCR/MCP. Similar to previously observed, the vari-able selection in the Bayesian approach effectively reduces the MAB, MSD and MMSE pa-rameter estimates for variables that are not in the true underlying model. The bandwidthsselected average to around 70. This could be due to the fact that a province now has a fewneighbors with exactly the same true underlying coefficients, and therefore the weightingfunction tries to weigh observations in the neighboring provinces equally as the local one.The frequentist GWR performed similarly, and results are included Supplemental Table 3.The average effective number of parameters under the frequentist GWR are 17.27, 16.73 and18able 3: The true underlying parameters used for the four regions in each setting.Region 0 Region 1 Region 2 Region 3Setting 1 (1.8, 0, 0, 4.2, 7) (1.5, 0, 0, 3.8, 9) (2.2, 0, 0, 4, 8.5) (2, 0, 0, 4, 8)Setting 2 (1.8, 1.8, 0, 4.2, 7) (1.5, 1.5, 0, 3.8, 9) (2.2, 2.2, 0, 4, 8.5) (2, 2, 0, 4, 8)Setting 3 (1.8, 1.8, 2.9, 4.2, 7) (1.5, 1.5, 3.4, 3.8, 9) (2.2, 2.2, 3.1, 4, 8.5) (2, 2, 3, 4, 8)
80 100 120
Longitude La t i t ude Region
Figure 3: Map of selected provinces in China colored by their economic regions proposedduring the Eleventh Five-year Plan.15.28 under the three settings, while under the Bayesian GWR, the values are 178.35, 178.25and 178.38, respectively.
Real Data Analysis
The proposed Bayesian GWR model is used to analyze province-level macroeconomic data in30 selected provinces of China from year 2012 to year 2016, i.e., we have 150 observations in19able 4: Performance of parameter estimates and variable selection results when regionalvariation is present. “True” denotes whether a covariate is in the true model or not, andACC stands for variable selection accuracy rate.MAB MSD MMSE MCR True ACC(%) Model ACC(%) b Setting 1 β β β β β β β β β β β β β β β Y ). Five covariates, including the resident population in millions ( X ), the urbanpopulation in millions ( X ), the fixed asset investment in the whole society in billions ofCNY ( X ), total export value in billions of USD ( X ), and total import value in billionsof USD ( X ), are incorporated in the model. The 5-year means of the variables for eachprovince are shown in Figure 4. Following the common practice in economics to account forlong-tails (see, e.g. Wooldridge, 2015), we take the logarithm of GDP before model fitting.All five covariates are continuous, and are therefore standardized before model fitting.The proposed Bayesian GWR model (17) – (22) is fitted on this dataset. Priors σ ( s ) ∼ IGamma(1 ,
1) and β ( s ) ∼ N (0 ,
1) are given, and we set τ j = 0 . c j = 10000 followingthe common practice in spike-and-slab model selection, and D = 100 so that we start froman approximately uniform weight over all provinces. The same graph distance matrix asin Section was used. The length of chains was selected to be 5000, with the first 2000 as20 Longitude La t i t ude value GDP (billion CNY)
Longitude La t i t ude value Resident Population (million)
Longitude La t i t ude value Urban Population (million)
Longitude La t i t ude value Investment (billion CNY)
Longitude La t i t ude value Export (billion USD)
Longitude La t i t ude value Import (billion USD)
Figure 4: Five-year means for variables in selected provinces of China.Table 5: DIC and LPML values for different weighting schemesUnity Scheme Exponential Scheme Gaussian SchemeDIC 12584.45 12525.65 12511.21LPML -6875.99 -6876.55 -6871.79 p D p D ; Spiegelhalteret al., 2002) for these three weighting schemes are shown in Table 5. It can be seen thatthe Gaussian weighting scheme yields the smallest DIC value and the largest LPML value,indicating that the model with a Gaussian weighting scheme is selected as the best modelamong the candidate models.Under the GWR model with a Gaussian weighting scheme, the posterior modes of theindicators γ j are (1 , , , , X , X and X are selected, while covariates X and X can be excluded from the regression model. Specifi-21 Longitude La t i t ude −0.70−0.68−0.66−0.64−0.62 value Resident Population (million)
Longitude La t i t ude value Urban Population (million)
Longitude La t i t ude −0.42−0.40−0.38 value Export (billion USD)
Figure 5: Plot of parameter estimates for the three covariates in the final model in eachselected province.cally, in our model, the number of resident population, the number of urban population, andtotal export value can help explain the change of GDP in each province. The posterior esti-mation results of the parameters under the Bayesian GWR model with a Gaussian weightingscheme are presented in Figure 5. The geographical variation in the parameters is ratherobvious. We can see that the number of resident population and the total export value havesignificant negative impact on the increase of GDP, while the number of urban populationhas significant positive impact. For the Gaussian weight function, the posterior estimateof the bandwidth b is 9.40, which indicates that the most distant provinces are assigned arelative weight of 0.665 in the estimation for one particular province. The impact of residentpopulation on GDP is larger in north China than in southeast China. Comparing this to thepopulation density map in China made by the Center for Geographic Analysis at HarvardUniversity ( worldmap.havard.edu/maps/11756 ), it can be seen that the influence is biggerin less populous provinces, which is in accordance with our intuition. The increasing effect ofurban population on GDP is smaller in southeast China than in northwest China. Consider-ing the relatively higher urbanization in south and east China (Wang et al., 2012), this canbe explained by the “decreasing marginal effect” in economics. Export appears to be moreimportant to provinces in west China than in eastern areas, which can also be explained bythe fact that east China is relatively more developed, and have a more diversified source ofGDP. 22or comparison, classical frequentist GWR in (2) is also fitted on the dataset. We reportthe parameter estimates, together with plots, in the supplemental material. Particularly, thetwo covariates dropped by our Bayesian variable selection approach have the smallest abso-lute parameter values among all five covariates, which indicates that our proposed approachis indeed capable of picking out the most influential factors. The parameter estimates arealso plotted on maps as in Figure 5. There are slight differences in the values of parameterestimates, which is partly due to the fact that we only have 150 observations (30 provinces,5 years). It is, however, worth noticing that the trend of variation is consistent between thefrequentist and Bayesian approaches. Discussion
We developed a likelihood-based Bayesian approach to estimate regression coefficients in con-junction with spike and slab variable selection for geographically sparse data. The selectionof bandwidth is discussed for a wide choice of weighting schemes using popular Bayesianmodel selection criteria such as the DIC and the LPML under the GWR context. Theproposed methods are implemented in nimble . In our simulation studies, when there is nospatially varying covariate effect, the bandwidth is selected to give all observations close touniform weight in estimating the coefficients for each individual location, whereas when thereis indeed spatially varying covariate effect, the bandwidth is selected to achieve a balancebetween introducing bias for each location by taking into consideration nearby observations,and having too unstable estimates by placing the majority of emphasis on local observationsand weighing down all others too heavily. The parameters estimated for each location havedecent coverage rate that are close to the nominal 95% level.Compared to the great circle distance, with a natural threshold of 1 to define “closeenough”, the graph distance yields weighting schems that produce models with robust pa-rameter estimation and variable selection performance. Based on comparisons with classical23requentist GWR from the simulation studies, it is interesting for us to notice that the band-width selection results of the Bayesian approach are different from that of the frequentistapproach. A partial reason for this pattern is that the bandwidth selection of frequentistapproach is based on minimizing the summation of the sum of square errors (SSE) for eachlocations. Our Bayesian approach, however, tries to maximize the whole data likelihood.Therefore, the frequentist approach tends to select smaller bandwidth than Bayesian ap-proach.A few issues beyond the scope of this paper are worth further investigation. In thiswork, we are only concerned with estimation of parameter for linear regression. Extensionof similar ideas to generalized linear models, and semi-parametric models such as the Coxmodel, are worth developing. In the second alternative simulation scenario with regionalvariation patterns, both the frequentist and Bayesian GWR try to weigh neighbors as highas possible, leading to large bandwidths. Under the frequentist framework, clustering ofcovariate effects have been done using hierarchical clustering on the parameter estimation,which is ad hoc. Another approach is the penalized methods in Li and Sang (2019). Inthe Bayesian paradigm, however, hierarchical modeling provides an integrated frameworkthat incorporates the latent cluster configuration layer. Development of such a frameworkis worth investigating. Also, we are assuming that a covariate is either in the true model forall locations, or not in the true model for all locations. There are cases where a covariateis important for some locations, but is minimally impacting for other locations. Identifyingsuch locations is devoted to future research. Detecting a relationship between two areas thatdo not share a boundary (Gao and Bradley, 2019) other than using graph distance is alsoan interesting future work. 24 eferences
Bhattacharyya, S. and P. J. Bickel (2014). Community detection in networks using graphdistance. arXiv preprint arXiv:1401.3915 .Boehm Vock, L. F., B. J. Reich, M. Fuentes, and F. Dominici (2015). Spatial variable selec-tion methods for investigating acute health effects of fine particulate matter components.
Biometrics 71 (1), 167–177.Boscardin, W. J. and A. Gelman (1993). Bayesian computation for parametric models ofheteroscedasticity in the linear model. In T. B. Fomby and R. C. Hill (Eds.),
Advances inEconometrics , Volume 11, pp. 87–109. Emerald Group Publishing Ltd.Brunsdon, C., A. S. Fotheringham, and M. E. Charlton (1996). Geographically weightedregression: a method for exploring spatial nonstationarity.
Geographical Analysis 28 (4),281–298.Brunsdon, C., S. Fotheringham, and M. Charlton (1998). Geographically weighted regression.
Journal of the Royal Statistical Society: Series D (The Statistician) 47 (3), 431–443.Brunsdon, C., S. Fotheringham, and M. Charlton (2000). Geographically weighted regres-sion as a statistical model. Technical report, Department of Geography, University ofNewcastle-upon-Tyne.Cox, T. F. and M. A. Cox (2000).
Multidimensional Scaling . Chapman and Hall/CRC.da Silva, A. R. and A. S. Fotheringham (2016). The multiple testing issue in geographicallyweighted regression.
Geographical Analysis 48 (3), 233–247.de Valpine, P., D. Turek, C. J. Paciorek, C. Anderson-Bergman, D. T. Lang, and R. Bodik(2017). Programming with models: writing statistical algorithms for general model struc-tures with NIMBLE.
Journal of Computational and Graphical Statistics 26 (2), 403–413.25otheringham, A., C. Brunsdon, and M. Charlton (2002).
Geographically Weighted Regres-sion: The Analysis of Spatially Varying Relationships . Wiley. Chichester, West Sussex(UK).Fotheringham, A. S., M. E. Charlton, and C. Brunsdon (1998). Geographically weightedregression: a natural evolution of the expansion method for spatial data analysis.
Envi-ronment and Planning A 30 (11), 1905–1927.Gao, H. and J. R. Bradley (2019). Bayesian analysis of areal data with unknown adjacenciesusing the stochastic edge mixed effects model.
Spatial Statistics 31 , 100357.Gelman, A., H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013).
Bayesian Data Analysis . Chapman and Hall/CRC.George, E. I. and R. E. McCulloch (1993). Variable selection via Gibbs sampling.
Journalof the American Statistical Association 88 (423), 881–889.Hu, G. and F. Huffer (2020). Modified Kaplan-Meier estimator and Nelson-Aalen estimatorwith geographical weighting for survival data.
Geographical Analysis 52 (1), 28–48.Ibrahim, J. G., M.-H. Chen, and D. Sinha (2013).
Bayesian Survival Analysis . SpringerScience & Business Media.Ishwaran, H., J. S. Rao, et al. (2005). Spike and slab variable selection: frequentist andBayesian strategies.
The Annals of Statistics 33 (2), 730–773.LeSage, J. P. (2004).
A Family of Geographically Weighted Regression Models , pp. 241–264.Berlin, Heidelberg: Springer Berlin Heidelberg.Li, F. and H. Sang (2019). Spatial homogeneity pursuit of regression coefficients for largedatasets.
Journal of the American Statistical Association 114 (527), 1050–1062.26a, Z., M.-H. Chen, and G. Hu (2018). Bayesian hierarchical spatial regression models forspatial data in the presence of missing covariates with applications. Technical Report18-22, University of Connecticut, Department of Statistics.Mei, C.-L., N. Wang, and W.-X. Zhang (2006). Testing the importance of the explanatoryvariables in a mixed geographically weighted regression model.
Environment and PlanningA 38 (3), 587–598.M¨uller, W., K. Szymanski, J. Knop, and N. Trinajsti´c (1987). An algorithm for constructionof the molecular distance matrix.
Journal of Computational Chemistry 8 (2), 170–173.P´aez, A., T. Uchida, and K. Miyamoto (2002a). A general framework for estimation and infer-ence of geographically weighted regression models: 1. location-specific kernel bandwidthsand a test for locational heterogeneity.
Environment and Planning A 34 (4), 733–754.P´aez, A., T. Uchida, and K. Miyamoto (2002b). A general framework for estimation andinference of geographically weighted regression models: 2. spatial association and modelspecification tests.
Environment and Planning A 34 (5), 883–904.Shao, J. (1997). An asymptotic theory for linear model selection.
Statistica Sinica 7 (2),221–264.Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesianmeasures of model complexity and fit.
Journal of the Royal Statistical Society: Series B(Statistical Methodology) 64 (4), 583–639.Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region.
Economic Geography 46 (sup1), 234–240.Wang, L., C. Li, Q. Ying, X. Cheng, X. Wang, X. Li, L. Hu, L. Liang, L. Yu, H. Huang,and P. Gong (2012, Aug). China’s urban expansion from 1990 to 2010 determined withsatellite remote sensing.
Chinese Science Bulletin 57 (22), 2802–2812.27ang, N., C.-L. Mei, and X.-D. Yan (2008). Local linear estimation of spatially varyingcoefficient models: an improvement on the geographically weighted regression technique.
Environment and Planning A 40 (4), 986–1005.Wooldridge, J. M. (2015).
Introductory Econometrics: A Modern Approach . Nelson Educa-tion.Xue, Y., E. D. Schifano, and G. Hu (2019). Geographically weighted Cox regression and itsapplication to prostate cancer survival data in Louisiana.