Estimation of Health and Demographic Indicators with Incomplete Geographic Information
EEstimation of Health and Demographic Indicators with IncompleteGeographic Information
Katie Wilson and Jon Wakefield , Department of Health Metrics Sciences, University of Washington Department of Biostatistics, University of Washington Department of Statistics, University of Washington
Abstract
In low and middle income countries, household surveys are a valuable source of information fora range of health and demographic indicators. Increasingly, subnational estimates are requiredfor targeting interventions and evaluating progress towards targets. In the majority of cases,stratified cluster sampling is used, with clusters corresponding to enumeration areas. The reportedgeographical information varies. A common procedure, to preserve confidentiality, is to give ajittered location with the true centroid of the cluster is displaced under a known algorithm. Analternative situation, which was used for older surveys in particular, is to report the geographicalregion within the cluster lies. In this paper, we describe a spatial hierarchical model in which weaccount for inaccuracies in the cluster locations. The computational algorithm we develop is fastand avoids the heavy computation of a pure MCMC approach. We illustrate by simulation thebenefits of the model, over naive alternatives.
Keywords:
Household surveys; Integrated nested Laplace approximation; Jittering; Masking;Spatial modeling.
1. Introduction
Effective implementation of health programs requires information on where unmet need existswithin countries. In many low and middle income countries (LMIC), health data can come from
Preprint submitted to Elsevier a r X i v : . [ s t a t . M E ] S e p variety of sources, and in many cases the sources are an incomplete representation of all of thecountry’s inhabitants. Surveys are a primary tool for obtaining vital information. One exampleof surveys that are commonly used, especially in LMIC, are the Demographic and Health Surveys(DHS). Typically, the DHS employs stratified, cluster sampling, where the clusters represent asmall area of the country. Within clusters, households are randomly selected. They are designedto provide reliable estimates at a pre-specified, and usually geographically large, administrativelevel. However, policymakers and researchers are often interested in modeling and understandinghealth indicators at lower levels, e.g., at the district level.To protect respondent confidentiality, survey data from households within the same cluster areaggregated to a single point, the centroid of the cluster. This cluster location information canbe used for spatial modeling. However, the geographic identifiers available in the DHS can varyand typically the precise coordinates of the cluster centers are not publicly available [3]. Recently,the geographic locations of the clusters (i.e., the centroids) are provided, but they are displaced.Specifically, urban clusters locations are displaced up to 2km, and 99% of rural cluster locationsare displaced up to 5km, with the remaining 1% displaced up to 10km. For example, in Kenyadisplaced GPS data are available for DHS completed on or after 2003. In older DHS and othersurveys such as the Multiple Indicator Clusters Surveys (MICS) only the larger administrativearea within which the cluster resides is reported. We will refer to this procedure as “masking”.In the literature this is sometimes referred to “aggregation”, though not to be confused with theaggregation procedure that was the focus of [26] in which the observed data consist of the sumor mean response of all responses in the area. In that paper, censuses were considered, whichprovided outcomes that are aggregated over an entire administrative area. Here, we considerpoint data, but where the geographic location of the point is assigned to the administrative areawithin which the point belongs. For the purposes of this paper, we will ignore issues involvingthe first step of aggregating household data to a single point, though an approach similar to thatin [26] could be used. Instead, we focus on issues surrounding displacement and masking of the2luster locations.First, we consider the displacement scenario. A naive analysis would ignore the jittering of thecluster centroids and fit a continuous spatial model using the displaced location. The effect ofdoing this in spatial analyses has been studied [11]. Using real data, 100 datasets were simu-lated with jittered location information and the impact on analyses involving several indicatorsof interest was assessed. The effect of the displacement on spatial correlation, spatial covariateassociations, and model derived surfaces was investigated. In the example, using empirical var-iograms, it was found that there was not a large impact on spatial correlation. However, somedifferences in the relationship between spatial covariates and the outcomes was found; modelsnaively using the spatial covariate value at the displaced value tended to have lower R althoughthis was not always the case. Some inaccuracies in predicted surfaces were also observed whenusing displaced data, and these differences tended to be exacerbated when the spatial covariatechanged quickly in space. Based on work by [20], the DHS have proposed guidelines that whenusing spatial covariate raster data, the average value of the covariate in cells within a specifiedbuffer (10km for rural clusters and 2km for urban clusters) of the reported cluster location shouldbe used in analyses. Warren et al. [25] examine the impact of jittering on covariate modeling,when the covariates are available for areas, and the outcome at points. They introduce a newmethod for this scenario, maximum probability covariate (MPC) selection and show superior per-formance with the naive method of using the covariate associated with the displaced point. Thismethod cannot resolve the problem completely, since bias in association parameters will in generalresult unless the correct covariate is determined with probability 1.Having location information subject to positional error can be thought of as an error-in-variablesproblem. In particular, let { s , . . . , s n } denote the set of true, unobserved, (analogous to covari-ates in the regular setting) that give rise to the set of observed outcomes { y , . . . , y n } . Denotethe measured (reported) locations associated with the outcomes as { u , . . . , u n } . The Berkson3easurement error model would be, s i = u i + (cid:15) ?i where (cid:15) ?i is an error term, whereas the classicalmeasurement error model would be, u i = s i + (cid:15) i where (cid:15) i is an error term [4]. The Berksonmodel is particularly appealing when there are a set of desired locations that outcomes shouldbe collected at, but the actual location has been perturbed. This could be a result of usingan imprecise positional instrument. On the other hand, the classical measurement error modelwould arise when outcomes are collected at a particular location, but the reported location hasbeen perturbed. Many proposed approaches seeking to address the positional error issue focuson normally-distributed outcomes under a Berkson measurement error model [9, 5, 6]. To over-come the computationally expensive Monte Carlo integration from earlier work using a Berksonmeasurement error model, an approximate composite likelihood for inference has been suggested[8]. In their application to DHS data from Senegal, the displacement mechanism is approximatedand the stratified sampling nature of the data is ignored. In this paper, we develop a methodunder the classical measurement error model and the outcome distribution can be non-normal.Now, we turn to the issue of masking. To incorporate masked data reported at the administrativearea, one solution is to use a discrete spatial model, such as the ICAR model. Using this type ofmodel would not allow for higher spatial resolution maps than the broadest administrative levelreported. Further issues could arise if the divisions of regions change over time. Additionally,these boundaries are often arbitrary and using a discrete model can be difficult to interpret ifregions differ substantially in size and shape. In the context of modeling the under-five mortalityrate, [12] fit a continuous spatial model and develop an approximate strategy for includingdata associated with areas and do not distinguish between aggregate and point-level data withmissing coordinates. To deal with the masking problem, points are randomly generated in anarea according to the population density. Points nearby are grouped together to form “pseudo-clusters” and assigned a weight based on the population that each “pseudo-cluster” represents.These weights then essentially partition the observed data to each of the “pseudo-clusters.” Thisapproach has no formal justification, and it is difficult to gauge how the method will perform in4ractice.The organization of this paper is as follows. In Sections 2, we make explicit the problem andpropose a model that can accommodate masked (only administrative area available) or displaced(jittered coordinates) data. In Section 3 a hybrid computational scheme is described. In Sections4 and 5, we conduct a simulation study to assess the impact of each of these problems on spatialmodeling. We also consider disclosure risk, which, in this case, refers to the ability to identify thetrue cluster location from the reported cluster location. Finally, we conclude with a discussion inSection 6, which includes directions for future work.
2. Method
We suppose that the cluster data is associated with a true point location, namely the centroidof the cluster. Let i = 1 , . . . , I index the administrative areas, j = 1 , . . . , J index the strata(typically J = 2 for urban/rural), and k = 1 , . . . , K ij index the clusters within administrativearea i and strata j . Consider cluster k in strata j and administrative area i , and denote the truecluster centroid location by s ijk , and the available location information by u ijk . Suppose the setof all possible (true) cluster locations (i.e., the sampling frame) is known and denote the set ofthe potential locations in area i , strata j by E ij = { E ije , e = 1 , . . . , m ij } .In the masking scenario , only the area in which the cluster is located is reported, which we willdenote by u ijk = { s ijk ∈ E ij } . Hence, the prior on the location is, p ( s ijk = E ije | u ijk ) = d ije , e = 1 , . . . , m ij , (1)where d ije is the probability that potential location E ije was selected. If probability proportional tosize (PPS) sampling was undertaken (the usual strategy in the DHS), then d ije ∝ N ije where N ije is the population size of enumeration area located at E ije . If random sampling was undertaken,5hen d ije ∝ .In the displacement scenario , a jittered version of the true location is reported, which we willdenote by u ijk = s ijk + (cid:15) ijk where (cid:15) ijk is the result of the jittering probability density function. Weconsider the DHS jittering algorithm in which the true location is randomly displaced according tothe distribution (in polar coordinates), p ( r, θ ) = (2 πR ) − I (0 < r < R ) × I (0 < θ < π ) where R = 2 km for urban clusters and R = 5 km for 99% of rural clusters and R = 10 km for theremaining 1% of rural clusters, and I ( · ) is the indicator function. That is, p ( s ijk = E ije | u ijk ) ∝ p ( u ijk | s ijk = E ije ) × p ( s ijk = E ije ) , e = 1 , . . . , m ij , where p ( s ijk ) corresponds to (1). To derive the first term on the right side, we need to marginalizeover possible values of R . First note that for a given R , p ( u ijk | s ijk = E ije , R ) = [2 πRd ( u ijk , E ije )] − C ije,R I (0 < d ( u ijk , E ije ) < R ) where d ( u ijk , E ije ) = [( E ije − u ijk ) + ( E ije − u ijk ) ] / is the distance between the candidatelocation E ije = [ E ije , E ije ] and the reported location u ijk = [ u ijk , u ijk ] and C ije,R = (cid:20)Z u ∈ D i [2 πRd ( u , E ije )] − I (0 < d ( u , E ije ) < R ) d u (cid:21) − (2)is the normalizing constant that accounts for the jittered point being restrictedwil to stay within6dministrative area D i . Therefore, p ( s ijk = E ije | u ijk ) ∝ d ije [4 πd ( u ijk , E ije )] − C ije,R =2 I (0 < d ( u ijk , E ije ) < km ) if urban d ije { . × [10 πd ( u ijk , E ije )] − C ije,R =5 I (0 < d ( u ijk , E ije ) < km )+0 . × [20 πd ( u ijk , E ije )] − C ije,R =10 I (0 < d ( u ijk , E ije ) < km ) } if rural(3)where d ije are the prior probabilities on the locations.Denote the outcome data measured at each cluster as y , the available location information as u ,the true (unobserved) location information as s , non-spatial covariates as x , spatial covariatesas z where z ijk = z ( s ijk ) is the vector of spatial covariates associated with the true location ofcluster k in strata j and administrative area i . In contrast to Gaussian Markov Random Fields(GMRFs) that are fundamentally discrete, Gaussian Random Fields (GRFs) are continuouslyindexed. Consider a domain D ∈ R . Then S ( s ) is a GRF if all finite collections are jointlymultivariate normal. That is, for a collection of points [ s , s , . . . , s n ] T , the density is, π ( S ) = (2 π ) − n/ | Σ | − / exp (cid:20) −
12 ( S − µ ) > Σ − ( S − µ ) (cid:21) where S i = S ( s i ) , µ i = µ ( s i ) for some mean function µ ( · ) , and Σ ij = C ( s i , s j ) for somecovariance function C ( · , · ) . We focus on the Mat´ern covariance function with scaling parameter κ > , marginal variance λ and smoothness parameter ν , C ν ( s i , s j ) = λ ν − γ ( ν ) ( κ || s i − s j || ) ν K ν ( κ || s i − s j || ) (4)where || · || denotes the Euclidean distance in R and K ν is the modified Bessel function of thesecond kind and order ν > . In general, it is difficult to learn about the smoothness parameter7 , and so we follow convention and fix this parameter to ν = 1 [22, 23]. The benefit of thischoice is that the field has one continuous derivative while maintaining computational feasibility.We now describe how computation can be carried out for this GRF model.In a major breakthrough an elegant connection between GRFs and GMRFs has been established[18, 22, 23]. In these papers, the following stochastic partial differential equation (SPDE) isconsidered, ( κ − ∆) α/ S ( s ) = λW ( s ) , s ∈ R where ∆ = ( ∂ /∂s ) + ( ∂ /∂s ) is the Laplacian on R , W ( s ) is Gaussian white noise and α = ν + 1 . They show that the solution to the SPDE is a GRF with Mat´ern covariance, S ( s ) = Z R k ( s , s ) dW ( s ) where k ( s , s ) = C ν ( s , s ) .Finally, using finite element analysis, a representation to the solution of the SPDE over a trian-gulation of the domain (called the mesh) is constructed by a weighted sum of basis functions, S ( s ) ≈ ˜ S ( s ) = M X m =1 w m ψ m ( s ) , (5)where M is the number of mesh points in the triangulation, ψ m ( s ) is a basis function and w = [ w , . . . , w M ] T is a collection of weights. The weights w are jointly Gaussian with µ = and sparse m × m precision matrix, Q , depending on spatial hyperparameters λ and κ ; hence w isa GMRF. The exact form for Q is chosen so that the resulting distribution for ˜ S ( s ) approximatesthe distribution of the solution to the SPDE, and thus the form will depend on the basis functions.The basis functions are chosen to be piecewise linear functions; that is, ψ m ( s ) = 1 at the m -th8ertex of the mesh and ψ m ( s ) = 0 at all other vertices, m = 1 , . . . , M . This results in a set ofpyramids, each with typically a six- or seven-sided base.To use the approximation, a mesh is first created. For the simulation we considered, the meshconsisted of M = 2 , mesh points and is shown in Figure 1. Let φ = [log λ, log κ ] representthe parameters of the GRF model, and w be the vector of the weights. Let β be a vector ofthe fixed effects, and define θ = [ β , w ] . We could proceed using a data augmentation (DA)algorithm, based on the factorizations: p ( θ , φ | y , s , u ) ∝ p ( y | s , θ , φ ) × p ( θ , φ ) , (6) p ( s | y , θ , φ , u ) ∝ p ( y | s , θ , φ ) × p ( s | u ) . (7)but this is computationally expensive, and so instead we use a hybrid scheme.
3. INLA within MCMC
For inference, we propose using an approximate Gibbs sampling strategy, known as “INLA withinMCMC” [13, 14]. The motivation for doing this is that the model described in the last sectioncannot be fit with INLA, given that it is a mixture distribution over the unknown locations.One could use MCMC for inference [19]. However, such algorithms are inefficient for Gaussianprocesses [7]. The key for our implementation is to note that if the locations of the clusters arefixed, the conditional models can be fit using INLA.It has been recognized that some models can be fit in
R-INLA once certain parameters in the modelare fixed [2, 1] . Specifically, the authors define a grid of values for the “problem parameters”and use
R-INLA to fit a conditional model. The reported marginal likelihood from
R-INLA isthen used to obtain the posterior distributions of these “problem parameters”. Lastly, Bayesianmodeling averaging [15] is used to derive the posterior distribution for the other parameters.9 ig. 1: Top: Mesh over the geography of Kenya. Bottom left: Kenya provinces with locations of centroids.Bottom right: Kenya provinces with true locations of the 398 clusters. Red: urban. Green: rural.
10n the setting that we consider, it is not straightforward to derive a grid of values with highposterior probability for the “problem parameters” (in our case, the unknown locations). Avariation on the approach of [2, 1] is to use a Metropolis-Hastings algorithm for the “problemparameters”. This has been previously proposed [13, 14], with the marginal likelihood fromfitting conditional models in
R-INLA being used to determine the acceptance probability for theMetropolis-Hastings step. They note that the marginal likelihood reported by
R-INLA is anestimate, and the limiting distribution is not exactly the desired stationary distribution. For theirpurposes, they argue and show that the difference is not significant.We take this approach one step further and propose a new algorithm using
R-INLA to fit (6) andgenerate a sample for [ θ , φ ] based on the posterior (INLA) approximation. This sample is thenused to generate a sample for s . This avoids needing a proposal distribution, as the posteriorconditional distribution is available exactly. Therefore, the “INLA within MCMC” algorithm canbe summarized as: Initialize θ (0) = [ β (0) , w (0) ] , where w represents the values of the spatial field on the mesh. Iterate: (a)
Sample s ( t +1) ijk using Gibbs sampling, p ( s ijk = E ije | y ijk , u ijk , θ ( t ) ) ∝ p ( s ijk = E ije | u ijk ) × p ( y ijk | s ijk = E ije , θ ( t ) ) (8)where the first term on the right corresponds to (1) for the masking scenario and(3) for the displacement scenario. The second term on the right corresponds to thecomplete data likelihood. (b) Use INLA to obtain the approximate conditional posterior, denoted ˜ p ( θ , φ | y , s ( t +1) ) .Sample θ ( t +1) , φ ( t +1) from the approximate posterior.We note that the implementation of R-INLA means that the hyperparameters, φ , are defined11n a grid meaning that when a joint sample is drawn from the approximation in step (b) thehyperparameters can only fall on the grid. By default, a new grid for the hyperparameters isconstructed during each iteration (since the cluster locations changes). This grid could be fixedahead of time if this is desired; however, we allow the grid to change from one iteration to thenext in our examples. How accurate the results are relies on the accuracy of INLA and on thejoint posterior sampling algorithm used in R-INLA .The latter is based on a mixture of multivariatenormal distributions, with the mixing being over the grid of hyperparameters.
4. Simulation Setup
We investigate the impact of masking and displacement of cluster centroids using the geography ofKenya. A masterframe of all sampling locations approximately representing the true masterframefrom the 2009 Kenya census was created based on population density retrieved from [27]. Thiswas done by first dividing the gridded population density into two zones: urban and rural withineach county. To identify these zones, thresholding was used so that the proportion exceedingthe threshold amount matched the proportion urban in the 2014 Kenya DHS [17]. The 1kmby 1km grids that exceeded the threshold were labeled as urban and otherwise labeled rural.The masterframe of all sampling locations was then created by randomly drawing coordinatesproportional to population density within each strata (urban/rural crossed with county) to obtain95,310 enumeration areas; see Figure 1 for locations and Table 1 for counts of clusters by eachof the eight provinces and the urban/rural strata. Finally, 398 clusters (right panel in Figure1) were then randomly sampled (uniformly, not proportional to size), stratified by province andurban/rural. The number of clusters within each sampling strata were chosen to match the 2008Kenya DHS. 12ural UrbanCentral 7,816 4,192Coast 4,268 3,569Eastern 12,396 3,234Nairobi 0 10,394North Eastern 2,230 433Nyanza 9,787 3,041Rift Valley 19,097 6,051Western 7,383 1,419
Table 1: Number of potential clusters in each administrative area and strata.
We imagine a binary response and generate data from the model, Y ijk | p ( s ijk ) ∼ Binomial (25 , p ( s ijk )) logit ( p ( s ijk )) = β + β z ijk + ˜ S ( s ijk ) where ˜ S ( · ) is the SPDE approximation to the Gaussian process spatial random effect surface. Inour simulation, we used the square-root of nighttime lights (NOAA nighttime lights series) as thespatial covariate, z . The spatial surface ˜ S ( · ) and nighttime lights surface are plotted in Figure 2. We will consider several simulation scenarios where centroid locations are jittered or masked,described in Table 2. Row 1a corresponds to fitting the “gold standard” model, which is if allcluster centroids are available exactly. Row 2a corresponds to fitting the model using the jitteredlocations of the centroids. Figure 3 shows the true and jittered locations for clusters in theWestern province. Row 3a corresponds to using our proposed approach to accommodate thejittered nature of the locations. Rows 4a–6a refer to a masking scenario where we will mask50% of the centroids (so that only the strata and administrative area are known); Figure 4 shows13 ig. 2: Left: Spatial surface ˜ S ( · ) used in the simulation. Right: square root of nighttime lights surface. the location of the clusters where the true centroids are known and only the administrative areaand strata are known. Row 4a corresponds to fitting the model only to the data where thecluster information is known exactly. Row 5a corresponds to also incorporating the data fromthe masked cluster locations. Here, we use the centroid of all potential cluster locations. Row6a corresponds to using our proposed approach. We repeat these scenarios when a covariateis included (rows 1b–6b). In each case, we investigate the effect on surface reconstruction andcovariate associations. For the “INLA within MCMC” algorithm, (8) is as follows: p ( s ( t +1) ijk = E ije | y ijk , u ijk , θ ( t ) ) ∝ p ( s ( t +1) ijk = E ije | u ijk ) × n expit (cid:16) β ( t )0 + β ( t )1 z ( E ije ) + ˜ S ( E ije ) ( t ) (cid:17)o y ijk × n − expit (cid:16) β ( t )0 + β ( t )1 z ( E ije ) + ˜ S ( E ije ) ( t ) (cid:17)o − y ijk (cid:88)
2b INLA naive 100% jittered (cid:88)
3b INLA within MCMC 100% jittered (cid:88)
4b INLA 50% exact (cid:88)
5b INLA 50% exact, 50% at centroids (cid:88)
6b INLA within MCMC 50% exact, 50% masked (cid:88)
Table 2: Simulation scenarios considered. C o v a r i a t e a t D i s p l a c ed C l u s t e r s Lo c a t i on s Fig. 3: Left: true and jittered locations in simulation, zoomed in on the Western province. Solid points: truelocations of clusters. × : displaced locations. Red: urban clusters. Green: rural clusters. Right: value of covariateat true and jittered locations. ig. 4: Solid points: GPS locations of clusters known. Grey squares: only admin area of clusters known.i.e., masked data. Red: urban clusters. Green: rural clusters. where θ = ( β , w ) , s ijk is the true location of cluster k in strata j and administrative area i , u ijk is the available location information for the cluster, and E ije is a potential cluster locationin strata j and area i , e = 1 , . . . , m ij . The number of potential locations, m ij , ranged from1 to 1,015 for the jittering scenario (median 142) and 433 to 19,097 for the masking scenario.To obtain the normalization factors (2), for each possible enumeration area, we simulate 1,000jitterings of the point following the DHS jittering algorithm and determine the proportion ofrealizations that fall within the administrative area. The priors for the fixed effects (intercept andcovariate association) were N (0 , . The hyperprior for φ = [log λ, log κ ] > is chosen to befairly vague. Here, the prior mean for φ corresponds to a marginal variance λ of 1. The priormean for φ corresponds to a practical range of roughly 20% of the domain size. Code to fit themodels can be found at https://github.com/wilsonka/Incomplete-Geography odel β β φ φ Truth -1.5 0.15 3.93 -4.51a -1.24 (-1.75, -0.85) - 3.97 (3.60, 4.36) -4.55 (-5.09, -4.05)2a -1.22 (-1.64, -0.84) - 3.97 (3.60, 4.36) -4.55 (-5.09, -4.05)3a -1.23 (-1.69, -0.81) - 3.98 (3.61, 4.35) -4.58 (-5.22, -4.10)4a -1.09 (-1.43, -0.50) - 3.81 (3.33, 4.31) -4.45 (-5.08, -3.86)5a -1.07 (-1.58, -0.55) - 4.03 (3.57, 4.51) -4.61 (-5.25, -4.00)6a -1.13 (-1.59, -0.63) - 3.94 (3.44, 4.40) -4.56 (-5.35, -4.00)1b -1.10 (-1.67, -0.48) 0.16 (0.12, 0.19) 4.00 (3.63, 4.40) -4.71 (-5.36, -4.13)2b -1.08 (-1.58, -0.59) 0.14 (0.11, 0.18) 3.98 (3.61, 4.39) -4.69 (-5.34, -4.11)3b -1.14 (-1.81, -0.48) 0.16 (0.12, 0.19) 3.99 (3.60, 4.37) -4.71 (-5.52, -4.16)4b -1.06 (-1.99, 0.05) 0.18 (0.13, 0.23) 4.02 (3.56, 4.52) -4.82 (-5.59, -4.13)5b -1.00 (-1.89, 0.27) 0.19 (0.15, 0.23) 4.05 (3.59, 4.54) -4.85 (-5.62, -4.15)6b -1.05 (-1.81, -0.27) 0.18 (0.14, 0.23) 4.05 (3.57, 4.51) -4.79 (-5.74, -4.17)Table 3: Posterior medians (95% CIs) for parameters in the simulation scenarios considered (see Table 2 fordescription of the different scenarios).
5. Simulation Results
To assess convergence, trace plots were examined and we calculated the ˆ R statistic [10] and thesewere all less than 1.05, which suggests convergence for all scenarios and approaches. Posteriormedians and 95% credible intervals (CIs) for the fixed effects and spatial hyperparameters arepresented in Table 3. Figures 5–8 shows the posterior medians latent surface ˜ S ( · ) and posteriorstandard deviation for the jittering and masking scenarios, respectively. First, we consider jittering,where the “best case” scenarios are 1a and 1b, where the true cluster locations are available.In general, using the jittered coordinates does not significantly impact the results, except fordifferences in the uncertainty in the spatial surface (bottom rows of Figures 5 and 6). Additionally,the differences are larger when a spatial covariate is involved in our simulation (1b and 2b). Whenusing the DA approach for jittered data (3a and 3b), we also see some minor differences, andsome “recovery” of the best case scenario.Next, we consider the masking scenario, where the “best case” scenarios are again 1a and 1b,where true cluster locations are available for all clusters. Across the board, posteriors tend tobe wider when we consider cases where only 50% of the clusters with GPS coordinates (4a and4b). The approach that uses the centroid for the masked data (5a and 5b) gives slightly different17 ig. 5: Top row: posterior medians of latent spatial surface. Bottom row: posterior standard deviations of latentspatial surface for the jittering scenario without a spatial covariate. The left column is the ideal scenario withexact GPS available (see Table 2 for full description of the different scenarios).Fig. 6: Top row: posterior medians of latent spatial surface. Bottom row: posterior standard deviations of latentspatial surface for the jittering scenario with a spatial covariate. The left column is the ideal scenario with exactGPS available (see Table 2 for full description of the different scenarios). ig. 7: Top row: posterior medians of latent spatial surface. Bottom row: posterior standard deviations of latentspatial surface for the masking scenario without spatial covariate. results. Noticeably, these results tend to be worse when a spatial covariate is involved (5b). Inthis scenario, we had taken the location for the masked data to be the centroid location of thepotential locations and used the value of the spatial covariate at that centroid location (ratherthan averaging the covariate from the potential locations). The DA approach (6a and 6b), wherethe other 50% of the clusters with only the admin area known are also included, show similarresults. We find a more noticeable narrowing of the 95% CIs in the scenario involving the covariate(6b).The predicted probability surfaces are in Figures 9 and 10. Plotted are the posterior medians and95% CIs. The posterior medians tend to be similar within the jittering scenarios and within themasking scenarios. There is some overall reduction in uncertainty when using DA for the maskingscenario, though this varies significantly spatially (Figure 11).Additionally, we consider the mean squared error (MSE) of the predicted latent surface S ( s ) and19 ig. 8: Top row: posterior medians of latent spatial surface. Bottom row: posterior standard deviations of latentspatial surface for the masking scenario with spatial covariate. the probability surface (on the logit scale) over Kenya,MSE ( M ) = 1 G G X g =1 n E ( Y ( M ) g − y g ) o + 1 G G X g =1 Var ( Y ( M ) g ) with g indexing points, y g being the true value of the surface at location s g , and Y ( M ) g beingthe estimate for the surface for model M at location g . We consider 2 different resolutions. Inthe first, predictions are made on a 1km × km apart. Inthe second, the predicted probability surface on the 1km × × × × i g . : P r e d i c t e dp r o b a b ili t y s u r f a ce f o r s i m u l a t i o n w i t h o u t s p a t i a l c o v a r i a t e . T o p r o w : p o s t e r i o r m e d i a n . M i dd l e r o w : . t hp e r ce n t il e . B o tt o m r o w : . t hp e r ce n t il e . i g . : P r e d i c t e dp r o b a b ili t y s u r f a ce f o r s i m u l a t i o n w i t h s p a t i a l c o v a r i a t e . T o p r o w : p o s t e r i o r m e d i a n . M i dd l e r o w : . t hp e r ce n t il e . B o tt o m r o w : . t hp e r ce n t il e . ig. 11: Ratio of posterior standard deviation of logit( p ) in DA approach to 50% only approach for maskingscenario. Values less than 1 indicates that the posterior standard deviation is lower when the data with maskedlocation information is incorporated over not including it. ˜ S ( s ) p ( s )
1a 30.2 (16.3) 19.0 (8.81)2a 30.5 (17.5) 20.2 (9.82)3a 33.6 (19.0) 20.3 (9.73)1b 45.4 (26.1) 20.7 (9.79)2b 43.8 (26.4) 20.9 (9.66)3b 45.8 (23.9) 21.0 (9.73)4a 46.2 (26.0) 29.0 (14.0)5a 43.6 (26.3) 25.2 (13.2)6a 38.5 (22.1) 24.9 (12.4)4b 65.4 (28.9) 27.7 (12.4)5b 85.7 (35.0) 27.4 (13.5)6b 56.2 (25.8) 26.3 (12.7)Table 4: MSE (bias ) of the probability surface from the various models on a 1km × the masked data at all.An important consideration is the disclosure risk, or ability to identify the enumeration area aparticular set of data arose from. Exact identification in the jittering case would be possible ifthere is only 1 possible EA within 2km for urban coordinates or within 10km for rural coordinates.In our example, 1 cluster could be exactly identified with another 10 having only at most 5 possibleEAs; see Figure 12.Another potential avenue for disclosure risk is if the posterior probability is significantly larger for23 umber of Possible EAs N u m be r o f C l u s t e r s Fig. 12: Histogram of potential disclosure risk. one particular EA than for the other potential ones. To establish this, the posterior probability ofthe possible EAs is calculated and the largest and second largest are compared. We do this forscenario 3a. First, we note that for 6 clusters, there was 1 possible EA with posterior probability > . , meaning that for those clusters disclosure risk is highly probable. Additionally, for 26(105) clusters the most likely EA had a posterior probability that was more than 5 (2) timeshigher than the second most likely.This is less of a concern for the masking procedure as the number of possible EAs for each clusterrange from 433 to 19,097 and the posterior probabilities were fairly uniform. Figure 13 shows theprior and posterior probabilities for one cluster that was known to be from a rural EA from theCoast province with the outcome y = 5 for 5a (no spatial covariate) and y = 2 for 5b (spatialcovariate). Noticeably, the posterior probability is lower than the prior probability in the centraleastern region where the latent spatial surface is highest.24 ig. 13: Prior and posterior probability of EA location for masking scenario. Darker indicates higher probability.Green point is the true location of the cluster. Also shown is the latent spatial surface (bottom left) and lightsurface (bottom middle). . Discussion In this paper, we propose an approach for incorporating data with missing or jittered GPS locationinformation. We develop an “INLA within MCMC” approach, where we alternate between (1)updating the location of clusters by sampling from the full conditional posterior and (2) fittingconditional models using INLA and sampling from the approximated conditional posterior. Interms of computation time, it took about 52 hours to run 1,000 iterations for each scenario. Themain computational burden comes from fitting the 1,000
R-INLA models.We show that inference tends to be improved when the procedure that results in missing locationinformation is taken into account through a simulation. Jittering of the coordinates did not havea significant impact on the results and one could argue that the more complicated DA procedureis not warranted. Further, there is a very real risk of identifying the true locations of (some of) theclusters, which is a privacy concern. From the DHS Terms of Use, users of geographic data “agreeto treat all data as confidential, and to make no effort to identify any individual, household, orenumeration area in the survey” ( https://dhsprogram.com/data/terms-of-use.cfm ). Weillustrate that in our simulation this is possible if the potential sampling locations (i.e., themasterframe) are available. Ideally, the jittering should provide a balance between accuracy ininference if the (incorrect) geographic coordinates are used and confidentiality in terms of notbeing able to uniquely identify the enumeration area that the cluster comes from.In our model formulation, access to a masterframe was assumed. However, in many cases amasterframe is not available; therefore, the possible cluster locations are unknown. In this case,one could be created as it was done for the simulation and assumed to be correct or the grid cellsof a population density raster could be used as a surrogate, with the population density value ofthe grid cell being used in (1).Validity of the “INLA within MCMC” computational approach we employ relies on the validity26f the INLA approximation. In our simulation, this approach seems to be accurate enough (incomparison to the models that could be fit solely in INLA) based on the posteriors obtained.To fully evaluate this approach, it would be best to compare results to only using MCMC. Morepractically, another option includes trying cruder approximations, i.e., the Gaussian and simplifiedLaplace approximations [21] in the INLA step to see if the results seem stable. Another strategycould be to refine the hyperparameter grid that is used in the approximation. One could alsovalidate overall results by holding out data and then comparing results on a large administrativearea level.Future work involves expanding the scope of the simulation study to cases where the masterframeis not available, investigating the impact of other covariates that are smoother in space, alteringthe spatial range of the underlying process, and including different levels of masking. In thesimulation, we supposed that the the available location information was the provincial level, butother geographies exist such as the county level.27 eferences [1] Bivand, R., G´omez-Rubio, V., and Rue, H. (2015). Spatial data analysis with
R-INLA withsome extensions.
Journal of Statistical Software , 63:1–31.[2] Bivand, R. S., G´omez-Rubio, V., and Rue, H. (2014). Approximate bayesian inference forspatial econometrics models.
Spatial Statistics , 9:146–165.[3] Burgert, C. R., Colston, J., Roy, T., and Zachary, B. (2013). Geographic displacement proce-dure and georeferenced data release policy for the demographic and health surveys. Technicalreport, ICF International. DHS Spatial Analysis Reports No. 7.[4] Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2006).
MeasurementError in Nonlinear Models . Chapman and Hall/CRC.[5] Cressie, N. and Kornak, J. (2003). Spatial statistics in the presence of location error with anapplication to remote sensing of the environment.
Statistical science , 18:436–456.[6] Fanshawe, T. and Diggle, P. (2011). Spatial prediction in the presence of positional error.
Environmetrics , 22:109–122.[7] Filippone, M., Zhong, M., and Girolami, M. (2013). A comparative evaluation of stochastic-based inference methods for Gaussian process models.
Machine Learning , 93:93–114.[8] Fronterr`e, C., Giorgi, E., and Diggle, P. (2018). Geostatistical inference in the presence ofgeomasking: a composite-likelihood approach.
Spatial Statistics , 28:319–330.[9] Gabrosek, J. and Cressie, N. (2002). The effect on attribute prediction of location uncertaintyin spatial data.
Geographical Analysis , 34:262–285.[10] Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple se-quences.
Statistical Science , 7:457–511. 2811] Gething, P., Tatem, A., Bird, T., and Burgert-Brucker, C. (2015). Creating spatial inter-polation surfaces with DHS data. Technical report, ICF International. DHS Spatial AnalysisReports No. 11.[12] Golding, N., Burstein, R., Longbottom, J., Browne, A. J., Fullman, N., Osgood-Zimmerman,A., et al. (2017). Mapping under-5 and neonatal mortality in Africa, 2000–15: a baselineanalysis for the Sustainable Development Goals.
The Lancet , 390:2171–2182.[13] G´omez-Rubio, V. and Palm´ı-Perales, F. (2017). Spatial models with the integrated nestedlaplace approximation within markov chain monte carlo. arXiv preprint arXiv:1702.03891 .[14] G´omez-Rubio, V. and Rue, H. (2018). Markov chain Monte Carlo with the integrated nestedLaplace approximation.
Statistics and Computing , 28:1033–1051.[15] Hoeting, J., Madigan, D., Raftery, A., and Volinsky, C. (1998). Bayesian model averaging.Technical Report 9814, Department of Statistics, Colorado State University.[16] Image and Data processing by NOAA’s National Geophysical Data Center. DMSP datacollected by the US Air Force Weather Agency (2008). Version 4 DMSP-OLS NighttimeLights Time Series. https://ngdc.noaa.gov/eog/dmsp/downloadV4composites.html .Accessed 27 July 2018.[17] Kenya National Bureau of Statistics (2015). Kenya Demographic and Health Survey 2014.Technical report, Kenya National Bureau of Statistics.[18] Lindgren, F., Rue, H., and Lindstr¨om, J. (2011). An explicit link between Gaussian fields andGaussian Markov random fields: the stochastic differential equation approach (with discussion).
Journal of the Royal Statistical Society, Series B , 73:423–498.[19] Marin, J.-M., Mengersen, K., and Robert, C. P. (2005). Bayesian modelling and inference29n mixtures of distributions. In Dey, D. and Rao, C., editors,
Handbook of Statistics , pages459–507. Elsevier, Amsterdam.[20] Perez-Heydrick, C., Warren, J., Burgert, C., and Emch, M. (2013). Guidelines on the useof DHS GPS data. Technical report, ICF International. DHS Spatial Analysis Reports No. 8.[21] Rue, H., Riebler, A., Sørbye, S. H., Illian, J. B., Simpson, D. P., and Lindgren, F. K. (2017).Bayesian computing with INLA: a review.
Annual Review of Statistics and Its Application ,4:395–421.[22] Simpson, D., Lindgren, F., and Rue, H. (2012a). In order to make spatial statistics compu-tationally feasible, we need to forget about the covariance function.
Environmetrics , 23:65–74.[23] Simpson, D., Lindgren, F., and Rue, H. (2012b). Think continuous: Markovian Gaussianmodels in spatial statistics.
Spatial Statistics , 1:16–29.[24] The Demographic and Health Surveys Program (2018). Conditions of Use for The DHSProgram datasets . https://dhsprogram.com/data/terms-of-use.cfm . Accessed: 2019-01-30.[25] Warren, J. L., Perez-Heydrich, C., Burgert, C. R., and Emch, M. E. (2016). Influenceof demographic and health survey point displacements on point-in-polygon analyses.
Spatialdemography , 4:117–133.[26] Wilson, K. and Wakefield, J. (2020). Pointless spatial modeling.