Boundary detection in disease mapping studies
BBoundary detection in disease mapping studies.
Duncan Lee and Richard Mitchell School of Mathematics and Statistics, University of Glasgow, Glasgow, UK Public Health, School of Medicine, University of Glasgow, Glasgow, UKOctober 28, 2018
Abstract
In disease mapping, the aim is to estimate the spatial pattern in diseaserisk over an extended geographical region, so that areas with elevated riskscan be identified. A Bayesian hierarchical approach is typically used toproduce such maps, which models the risk surface with a set of spatiallysmooth random effects. However, in complex urban settings there arelikely to be boundaries in the risk surface, which separate populations thatare geographically adjacent but have very different risk profiles. Thereforethis paper proposes an approach for detecting such risk boundaries, andtests its effectiveness by simulation. Finally, the model is applied to lungcancer incidence data in Greater Glasgow, Scotland, between 2001 and2005.Boundary detection; Disease mapping; Spatial correlation
Disease mapping is the area of statistics that quantifies the spatial pattern indisease risk over an extended geographical region, such as a city of country.The study region is partitioned into a number of small non-overlapping arealunits, such as electoral wards, and only the total number of cases in each unit isavailable. The majority of disease risk maps are produced using Bayesian hier-archical models, utilising a vector of covariate risk factors and a set of randomeffects. The latter model any overdispersion or spatial correlation in the diseasedata (after the covariate effects have been allowed for), which can be caused bythe presence of unmeasured risk factors that also have a spatial structure. Therandom effects are often represented by a Conditional Autoregressive (CAR)prior, which forces the effects in two areas to be correlated if those areas sharea common border.The production of such maps provide a number of benefits to public healthprofessionals, including the ability to investigate the association between dis-ease prevalence and suspected risk factors. More recently, disease maps have1 a r X i v : . [ s t a t . O T ] A ug een used to identify boundaries (or cliffs or discontinuities ) in the risk sur-face, which separate geographically adjacent areas that have high and low risks.Such boundaries are likely to exist in complex urban settings, where rich andpoor communities can be separated by just a few metres. The detection of suchboundaries has a number of benefits, including the ability to detect the spatialextent of a cluster of high risk areas. It is also important economically, becauseit allows health resources to be targeted at communities with the highest dis-ease risks. Furthermore, the spatial units at which the health and covariate dataare available are typically designed for administrative purposes, and thus oftendo not delineate between distinct neighbourhoods (i.e. groups of people withthe same social circumstances, culture and behaviour). Therefore, boundariesin disease risk surfaces may also correspond to boundaries between differentneighbourhoods, which are of interest because “their locations reflect underly-ing biological, physical, and/or social processes” (Jacquez et al. (2000)).The statistical detection of boundaries in disease maps is also known as Wombling , following the seminal article by Womble (1951). More recently, anumber of different approaches to this problem have been proposed, includingthe calculation of local statistics (Boots (2001)), and a Bayesian random effectsmodel (Lu and Carlin (2005)). In this paper we also use a Bayesian randomeffects model, and identify boundaries by measuring the level of dissimilaritybetween the populations living in neighbouring areas. We believe that riskboundaries are more likely to occur between populations that are very differ-ent, because homogeneous populations should have similar disease risks. Theremainder of this paper is organised as follows. Section 2 provides a reviewof disease mapping and boundary detection methods, while Section 3 presentsour proposed methodological extension. Section 4 assesses the efficacy of ourapproach using simulation, while Section 5 identifies boundaries in the risk sur-face for lung cancer cases in Greater Glasgow. Finally, Section 6 contains aconcluding discussion and outlines future developments.
The data used to quantify disease risk are denoted by y = ( y , . . . , y n ) and E =( E , . . . , E n ), the former being the numbers of disease cases observed in each of n non-overlapping areas within a specified time frame. The latter are the expectednumbers of disease cases, which depend on the size and demographic structureof the population living within each area. Disease risk can be summarised bythe standardised incidence ratio (SIR), which for area k is given by ˆ R k = y k /E k .Values above one represent areas with elevated risks of disease, while values lessthan one correspond to relatively healthy areas. However, elevated SIR valuescan occur by chance in areas where E k is small, so the set of disease risks for all n areas are more commonly estimated using a Bayesian hierarchical model. A2eneral specification has been described by Elliott et al. (2000), Banerjee et al.(2004), Wakefield (2007) and Lawson (2008), and is given by Y k | E k , R k ∼ Poisson( E k R k ) for k = 1 , . . . , n, ln( R k ) = x T k β + φ k . (1)Disease risk is modelled by covariates x T k = ( x k , . . . , x kp ), and randomeffects φ = ( φ , . . . , φ n ), the latter allowing for any overdispersion and spatialcorrelation in the disease data (after the covariate effects have been accountedfor). The most common prior for φ is a conditional autoregressive (CAR, Besaget al. (1991)) model, which is specified in terms of n univariate conditionaldistributions, f ( φ k | φ , . . . , φ k − , φ k +1 , . . . , φ n ) for k = 1 , . . . , n . However, eachfull conditional distribution only depends on the values of φ j in a small numberof neighbouring areas. This neighbourhood information is contained in a binaryadjacency matrix W , which has elements w kj that are equal to one or zerodepending on whether areas ( k, j ) are defined to be neighbours. A commonspecification is that areas ( k, j ) are neighbours if they share a common border,which corresponds to w kj =1 and is denoted by k ∼ j . A number of priors havebeen proposed within the general class of CAR models, and the one we adopthere was originally proposed by Leroux et al. (1999), and has subsequently beenreviewed by MacNab (2003) and Lee (2011). The model has full conditionaldistributions given by φ k | φ − k , W, τ , ρ, µ ∼ N (cid:32) ρ (cid:80) nj =1 w kj φ j + (1 − ρ ) µρ (cid:80) nj =1 w kj + 1 − ρ , τ ρ (cid:80) nj =1 w kj + 1 − ρ (cid:33) . (2)The conditional expectation is a weighted average of the random effects inneighbouring areas and a global intercept µ (not included in (1)), where theweights are controlled by ρ . In this model ρ = 0 corresponds to independence,while ρ close to one defines strong spatial correlation. These full conditionaldistributions correspond to a proper multivariate Gaussian distribution if ρ ∈ [0 , φ ∼ N( µ , τ [ ρW ∗ + (1 − ρ ) I ] − ) , (3)where I denotes an n × n identity matrix while is an n-vector of ones. In theabove equation W ∗ has diagonal elements w ∗ kk = (cid:80) ni =1 w ki , and non-diagonalelements w ∗ kj = − w kj . This model simplifies to the intrinsic autoregressivemodel if ρ is fixed at one, although this does correspond to an improper jointdistribution for φ . Lu and Carlin (2005) propose the use of Boundary Likelihood Values (BLV),which are calculated as BLV kj = | ˆ R k − ˆ R j | , the absolute difference in risk3etween two neighbouring areas. The border between neighbouring areas ( k, j )can then be classified as a boundary in the risk surface if either(a) BLV kj > c for some cut-off c ; or(b) BLV kj is within the top c % of the boundary likelihood values over thestudy region, for some percentage c .These approaches to boundary detection are ad-hoc, because the decisionrules (a) and (b) require tuning constants ( c or c ) to be specified. Jacquezet al. (2000) has criticised this approach for this reason, and argues that byspecifying the tuning constant the investigator essentially chooses the numberof boundaries that are identified, even though this is unknown and the goal ofthe analysis. The approach we propose allows the data to determine the number and locationsof any boundaries in the risk surface, rather than requiring the investigator tospecify a tuning constant. We achieve this by modelling w kj as a binary ran-dom quantity if areas ( k, j ) share a common border, rather than assuming itis fixed at one. If w kj is estimated as zero the random effects in areas ( k, j )are conditionally independent, which corresponds to a boundary in the risk sur-face. In contrast, if w kj equals one the random effects are correlated, whichcorresponds to no boundary. This general approach to boundary detection haspreviously been proposed by Lu et al. (2007), Ma and Carlin (2007), and Maet al. (2010), who model the set of w kj by logistic regression, a CAR prior,or an Ising model. However, this requires the large set of w kj for all pairs ofneighbouring areas to be estimated, which Li et al. (2011) argue are not wellidentified from the data. In this paper we model the set of w kj as a function ofparameters α = ( α , . . . , α q ), rather than treating each w kj as a separate un-known quantity. This results in a parsimonious yet flexible model for detectingboundaries in the risk surface, which avoids the weak parameter identifiabilityand slow MCMC convergence experienced by Li et al. (2011), when modellingeach w kj separately. The first stage of our hierarchical model is similar to that described in Section2, and is given by Y k | E k , R k ∼ Poisson( E k R k ) for k = 1 , . . . , n, ln( R k ) = φ k , (4) φ k | φ − k , µ, α , τ ∼ N (cid:32) . (cid:80) nj =1 w kj ( α ) φ j + 0 . µ . (cid:80) nj =1 w kj ( α ) + 0 . , τ . (cid:80) nj =1 w kj ( α ) + 0 . (cid:33) .
4n common with Lu et al. (2007) no covariates are included in (4), because φ would then represent the residual pattern in disease risk, and any boundariesidentified would hence be in the residual surface. Secondly, we fix ρ at 0.99 be-cause it enforces strong spatial correlation, which allows the presence or absenceof boundaries to be determined by W ( α ). We note that we do not choose ρ = 1,as it results in an infinite mean and variance in the conditional distribution of φ k (see (2)) if an area is surrounded by boundaries, i.e. if (cid:80) nj =1 w kj ( α ) = 0 forsome area k . We believe that boundaries in the risk surface are likely to occur between pop-ulations that are very different, because homogeneous populations should havesimilar risk profiles. Therefore, we model the presence or absence of a bound-ary between areas ( k, j ) by a vector of q non-negative dissimilarity metrics z kji = ( z ki , . . . , z kjq ). These metrics have the general form z kji = | z ki − z ji | σ i for i = 1 , . . . , q, (5)the absolute difference in the value of a covariate between the two areas inquestion. Here, σ i represents the standard deviation of | z ki − z ji | over all pairs ofcontiguous areas, and we re-scale the dissimilarity metrics to improve the mixingand convergence of the MCMC algorithm. It is these dissimilarity measures thatdrive the detection of boundaries in the risk surface, and examples could includedifferences in the population’s social characteristics (e.g. average income) or riskinducing behaviour (e.g. smoking prevalence). Using these metrics we modelthe elements of W ( α ) as w kj ( α ) = (cid:26) − (cid:80) qi =1 z kji α i ) ≥ . j ∼ k , (6)where pairs of areas that do not share a common border have w kj ( α ) fixedat zero. For areas that are contiguous, the model detects a boundary in the risksurface ifexp( − (cid:80) qi =1 z kji α i ) is less than 0.5. Therefore we constrain the regression pa-rameters to be non-negative, so that the greater the dissimilarity between twoareas the more likely there is to be a boundary between them. In contrast, iftwo areas have identical covariate values (and hence homogeneous populations)there cannot be a boundary between them, regardless of the value of α . This isthe reason we do not include an intercept term in (6), as doing so would allowboundaries to be detected between areas with homogeneous populations. Theregression parameters determine the number of risk boundaries in the studyregion, with larger values of α corresponding to more boundaries being de-tected. If only one dissimilarity metric z is included in (6), then a plausiblerange of values for the single regression parameter α can be determined. At5ne extreme, no boundaries will be detected if α ≤ − ln(0 . /z max , while at theother, all borders in the study region will be considered as boundaries (unless z kj =0) if α > − ln(0 . /z min . Here ( z min , z max ) denote the minimum positiveand maximum values of the dissimilarity metric. More generally, if there are q dissimilarity metrics then boundaries are identified ifexp( − z kj α ) × . . . × exp( − z kjq α q ) < . , where the value of each component, exp( − z kji α i ), must lie between zeroand one. Therefore, if α i ≤ − ln(0 . /z maxi , the dissimilarity measure z kji is notsolely responsible for detecting any boundaries, because exp( − z i α i ) would begreater than 0.5 for all pairs of contiguous areas. Therefore in terms of interpre-tation, the dissimilarity metric can be said to have no effect on detecting bound-aries if the entire 95% credible interval for α i is less than α min = − ln(0 . /z maxkji .In contrast, if the interval lies completely above α min , then the metric can besaid to have a substantial effect on identifying risk boundaries. We note thatthe usual statistical representation of ‘no effect’ (credible interval that includeszero) is not possible in this context, because the regression parameters are con-strained to be non-negative. We also note that the approach we outline does notguarantee that the boundaries we detect will be closed (form an unbroken line,an example of which is shown in Figure 3), which allows us to detect bound-aries that enclose an entire subregion, as well as those that just separate highlydifferent areas. The vector of random effects depend on hyperparameters ( µ, τ , α ), which re-spectively control its mean, variance and correlation structure. The mean µ is assigned a weakly informative Gaussian prior distribution, with a mean ofzero and a variance of 10. The variance parameter τ is assigned a weaklyinformative Uniform(0 ,
10) prior on the standard deviation scale, following thesuggestion of Gelman (2006). We adopt a uniform prior for the regression pa-rameters, α i ∼ Uniform(0 , M i ), which corresponds to our prior ignorance aboutthe number of boundaries in the risk surface. An alternative would be a recip-rocal prior, f ( α i ) ∝ α i I [0 ≤ α i ≤ M i ], which represents our prior belief thatthe risk surface is spatially smooth. In both cases a natural upper limit wouldbe M i = − ln(0 . /z minkji , the value at which the dissimilarity measure z kji solelyidentifies all borders as boundaries in the risk surface. However, in a bound-ary detection analysis one is looking to identify boundaries between collectionsof areas, which have similar risks within each collection but differ across theboundary. Therefore we fix M i so that at most 50% of borders can be classifiedas boundaries, and present a sensitivity analysis in the supplementary materialto this choice of 50%. 6 Simulation study
In this section we present a simulation study, that assesses the accuracy withwhich our proposed model can detect boundaries in the risk surface. In doingthis we assess whether our model can detect ‘true’ boundaries in the risk surface,as well as the extent to which it falsely identifies boundaries that do not exist.We have decided not to compare our model to the existing boundary detectionapproach using (1), (2) and BLVs, because this requires a tuning constant ( c or c ) to be specified by the user, which in real life would be unknown. However,these tuning constants are known in this simulation setting, and specifying theirvalue would put this method at an unfair advantage or disadvantage, dependingon whether we chose the ‘correct’ values. We base our study on the n = 271 areas that comprise the Greater Glasgow andClyde health board, which is the region considered in the cancer mapping studypresented in Section five. Disease counts are generated from the Poisson model(4), where the expected numbers of admissions, E , relate to the Glasgow cancerdata. A new risk surface R = exp( φ ) is generated for each set of simulateddisease data, because this ensures the results are not affected by a particularrealisation of φ . Each simulated risk surface has fixed boundaries, which areshown by the bold black lines in Figure 1. There are 74 boundaries in total,which corresponds to approximately 10% of the set of borders in the studyregion. This set of boundaries partition the study region into 6 groups, themain area shaded in white, and the remaining 5 smaller areas shaded in grey.To produce risk surfaces with boundaries, the random effects ( φ ) are generatedfrom a multivariate Gaussian distribution with a piecewise constant mean, whichin the white region is equal to 0, while in the grey regions it is equal to k . Thecorrelation function is from the Matern class with smoothness parameter equalto κ = 2 .
5, while the spatial range is fixed so that the median correlationbetween areas is 0.5. The dissimilarity metrics are generated from z kj ∼ (cid:26) | N(1 , . ) | if areas ( k, j ) are not separated by a boundary. | N(1 + k , . ) | if areas ( k, j ) are separated by a boundary. (7)Here, larger values of k correspond to dissimilarity metrics that better iden-tify the true boundaries in the risk surface; i.e. have larger values for the bound-aries in Figure 1 than for the non-boundaries. We assess the effect on modelperformance of changing both the size of the boundaries (via k ) and the qualityof the dissimilarity metrics (via k ), the results of which are displayed in Table1. When assessing the effect of boundary size we fix k = 3, which providesnearly ‘perfect’ dissimilarity metrics, i.e. values for z kj for boundaries and non-boundaries that almost never overlap. Conversely, when assessing the effect of7igure 1: Locations of the true boundaries in the simulated risk surfaces.having imperfect dissimilarity metrics, we fix the boundary size at k = 0 . The top half of Table 1 displays the effects of changing the magnitude of therisk boundaries, in the idealised situation of having perfect dissimilarity metrics.The constant k represents the difference in the mean value of the risk surfacebetween white and grey areas (see Figure 1), hence smaller values correspond toa spatially smoother surface ( k = 0 would correspond to a completely smoothrisk surface with no boundaries). The table shows that the model can detectlarger risk boundaries more often than smaller ones as expected, although itstill achieves over a 90% detection rate when the risk surface only has a meandifference of 0.2. The much lower percentages for smaller values of k are alsonot surprising, as they correspond to situations in which the average size of thetrue boundaries are not very different to the average size of the non-boundaries.In addition, the false positive rates are very low (generally less than 1%) regard-less of the boundary size, which suggests that detected boundaries are likely tobe real.The bottom half of Table 1 displays the effects of having imperfect dissimilar-ity metrics, i.e. metrics for which some of the values corresponding to the ‘true’8able 1: The effect of boundary size (as measured by k ) and the quality ofthe dissimilarity metrics (as measured by k ) on the effectiveness of the model.The table displays the percentage agreement for boundaries (BA) and non-boundaries (NBA), as well as the bias and root mean square error (RMSE) ofthe estimated risk surface, which are presented as a percentage of their truevalue. Comparison k k BA ( % ) NBA ( % ) Bias RMSE z kj k decreases, there is a greater overlap between the values ofthe dissimilarity metrics at boundaries and non-boundaries. The limit of k = 0corresponds to a dissimilarity metric with no information, i.e. it has the samerange of values for boundaries as non-boundaries. The table shows that if thedissimilarity metric is nearly perfect ( k = 3 corresponds to the means in equa-tion (7) being separated by 6 standard deviations), then the model nearly alwayscorrectly identifies boundaries and non-boundaries. However, as the informa-tion content in the dissimilarity metric decreases (as k decreases) so does theperformance of the model, both in terms of the boundary agreement and thefalse positive rate. If the dissimilarity metric contains no information the modelonly identifies 1.86% of the true boundaries as would be expected, although inthis situation the false positive rate returns to being low at around 1%. This section presents a study mapping the risk of lung cancer in Greater Glas-gow, Scotland, between 2001 and 2005.
The data for our study are publicly available, and can be downloaded fromthe Scottish Neighbourhood Statistics (SNS) database ( ).The study region is the Greater Glasgow and Clyde health board, which con-tains the city of Glasgow in the east, and the river Clyde estuary in the west.9lasgow is known to contain some of the poorest people in Europe (Leylandet al. (2007)), and has rich and poor communities that are geographically adja-cent. The Greater Glasgow and Clyde health board is partitioned into n = 271administrative units called Intermediate Geographies (IG), which were devel-oped specifically for the distribution of small-area statistics, and have a medianarea of 124 hectares and a median population of 4,239.The disease data we model are the number of people diagnosed with lungcancer between 2001 and 2005 in each IG, which corresponds to ICD-10 codesC33 - C34. The expected numbers of cases in each IG are calculated by exter-nal standardisation, using age and sex adjusted rates for the whole of Scotland.These rates were obtained from the Information Services Division (ISD), whichis the statistical arm of the National Health Service in Scotland. The simplestmeasure of disease risk is the standardised incidence ratio, which is presented inFigure 2 as a choropleth map. The Figure shows that the risk of lung cancer ishighest in the heavily deprived east end of Glasgow (east of the study region),as well as along the banks of the river Clyde (the thin white line running southeast). The Figure also shows that cancer incidence in Greater Glasgow is higherthan in the rest of Scotland, as the average SIR across the study region is 1.186.Large amounts of covariate data are available from the Scottish Neighbour-hood Statistics database, and the first variable we consider is a modelled esti-mate of the percentage of the population in each IG that smoke, further detailsof which are available from Whyte et al. (2007). The causal relationship betweensmoking and lung cancer risk is long standing (see for example Doll and Brad-ford Hill (1950) and Doll et al. (2005)), and it is likely to be the most importantdissimilarity metric in our study. Cancer risk has also been shown to vary byethnic group (see for example National Cancer Intelligence Network (2009)), sowe include the percentage of school children from ethnic minorities as a proxymeasure. Socio-economic deprivation is also associated with cancer risk (see forexample Quinn, M and Babb, P (2000) and Woods et al. (2006)), and as a proxymeasure we use the natural log of the median house price. This proxy measureis used because it is the measure of deprivation available that is least correlatedwith the smoking covariate (Pearson’s r=-0.69). Finally, we acknowledge thatother factors such as diet and physical activity have been repeatedly associatedwith lung cancer risk. However, no data are available at the small-area levelabout these factors. Inference for all models is based on 50,000 MCMC samples generated from fiveMarkov chains, that were initialised at dispersed locations in the sample space.Each chain is burnt-in until convergence (40,000 iterations), and the next 10,000samples are used for the analysis. A number of models were fitted to the datawith different combinations of the three dissimilarity metrics, and in each casethe residuals were assessed for the presence of spatial correlation. This was10igure 2: Standardised Incidence Ratio (SIR) for lung cancer in Greater Glasgowbetween 2001 and 2005.achieved using a permutation test based on Moran’s I statistic (Moran (1950)),and in all cases the model adequately removes the spatial correlation present inthe data, as the corresponding p-values (not shown) are greater than 0.05.Initially, all three covariates (smoking, ethnicity and house price) were in-cluded in the model as dissimilarity metrics, and the posterior medians and 95%credible intervals are displayed in Table 2. Also displayed in Table 2 is α min , thethreshold value below which the dissimilarity metric does not solely detect anyboundaries in the risk surface. The table shows that smoking prevalence hassubstantial influence in detecting boundaries, as the estimate and credible inter-val lie above the threshold value of α min = 0 . α min = − ln(0 . /z max is presented, the value at which each dissimilarity met-ric does not solely identify any risk boundaries. Covariate Estimate 95 % credible interval α min Smoking 0.232 (0.171, 0.254) 0.131Ethnicity 0.012 (0.001, 0.046) 0.126House price (log) 0.015 (0.001, 0.103) 0.119the smoking only model the sole regression parameter has a posterior medianand 95% credible interval of 0.257 (0.239, 0.262), which, in common with theresults in Table 2, lies completely above the ‘no effect’ threshold.
Figure 3 displays the estimated risk surface for lung cancer from the smokingonly model, where the shading is on the same scale as that used in Figure 2.The solid white lines denote the risk boundaries that have been identified bythe smoking covariate, which are defined by having posterior median values of w kj equal to zero. We note that the study region is split completely in two bythe river Clyde (the thin white line running south east), and areas on oppositebanks are not assumed to be neighbours. Therefore no boundaries can be de-tected across the river, which explains the absence of boundaries in this area.The figure shows that the smoking covariate detects 162 boundaries in therisk surface (23.1% of all possible borders), including within the city of Glasgow(middle of the map) as well as along the southern coast of the Clyde estuary (farwest of the map). The majority of these estimated risk boundaries appear tocorrespond to sizeable changes in the risk surface, suggesting that the smokingcovariate appears to be an appropriate dissimilarity metric for detecting suchboundaries. However, a few of the boundaries identified show no evidence ofseparating areas with differing health risks, such as the closed boundary in thesouth of the city. These ‘false positives’ correspond to two areas having differentsmoking prevalences but similar risk profiles, and would be a starting point fora more detailed investigation into why the risk profiles are similar given thevastly different smoking rates. In this paper we have proposed a statistical approach to detecting boundariesin disease risk maps, which separate populations that exhibit high and low risksof disease. Our approach detects boundaries by measuring the dissimilarity be-tween populations living in neighbouring areas, because we believe that abruptchanges in the risk surface are most likely to occur between populations that12igure 3: Estimated risk surface for lung cancer in Greater Glasgow between2001 and 2005. The risk boundaries are denoted by solid white lines.are geographically adjacent but have very different social characteristics or riskinducing behaviour. Our approach has the advantage of being fully automatic,so that unlike the approach of Lu and Carlin (2005), the number of boundariesin the risk surface is determined by the data and not a-priori by the investi-gator. In addition, our approach identifies boundaries using a small numberof regression parameters α , rather than estimating the set of neighbourhoodrelations { w kj } for all pairs of contiguous areas. Li et al. (2011) have suggestedthat this latter approach (used by Lu et al. (2007) and Ma and Carlin (2007))can lead to identifiability problems and slow MCMC convergence, due to thelarge number of parameters to be estimated. For example, in the lung cancerapplication presented in Section 5, there are 701 neighbourhood relations w kj ,compared with disease data in only 271 areas.The simulation study presented in Section 4 suggests that our approachgenerally performs well, both in terms of detecting true boundaries in the risksurface, as well as not detecting large numbers of false positives. In the presenceof a perfect dissimilarity metric our approach can detect the majority of truerisk boundaries, for example, it has a 99.6% detection rate when the averagedifference in the log risk surface is 0.3. The main drawback of our approach isillustrated by the bottom half of Table 1, which shows that it is crucially de-pendent on the existence of good quality dissimilarity metrics, that have large13alues for risk boundaries and small values for non-boundaries. However, in theabsence of good dissimilarity metrics (as in the last few rows of Table 1) ourapproach appears to remain conservative, in the sense that the false positiverate is relatively low.As our approach to boundary detection is based solely on covariates, it hasthe advantage that in addition to identifying risk boundaries, it also identifiesthe underlying drivers of these boundaries. However, as illustrated by the sim-ulation study, the corresponding disadvantage is that it relies on the existenceof relevant covariate data, which may not always be available. Our lung cancerexample also illustrates this, as the main covariate (smoking prevalence) iden-tifies the majority of the risk boundaries evident from Figure 3. However, italso identifies some boundaries that do not appear to be real, as well as ignor-ing others where there is a suspected discontinuity in the risk surface. Thisimperfection could be caused by the omission of other important dissimilaritymetrics, or by the fact that the smoking covariate is a modelled estimate ratherthan being real prevalence data. Therefore, the production of risk maps suchas Figure 3 can be viewed as an exploratory tool, which helps the investigatorunderstand the drivers of the spatial variation in disease risk. For example, ifa risk boundary is not detected despite their being a discontinuity in the risksurface, further investigation of the two areas in question could be carried outto determine what other factors could be causing this discontinuity.Finally, this paper opens up numerous avenues for future work. The mostobvious is to develop a boundary detection method that has the advantages ofthe method proposed here, but does not rely on covariate data to detect riskboundaries. One possibility in this vein is to develop an iterative algorithm,which compares the fit (possibly using DIC) of a number of models with differentbut fixed neighbourhood matrices W . In this way one could compare a modelwhere all adjacent areas have w kj = 1, against an alternative with w kj = 0 forareas that are suspected of being separated by a discontinuity in the risk surface.Other natural extensions to this approach include the detection of boundariesin multiple disease risk surfaces simultaneously, as well as adding a temporaldimension to the model. Acknowledgments
The data and shapefiles used in this study were provided by the Scottish Gov-ernment.
Conflict of Interest:
None declared.
Funding
This work was supported by the Economic and Social Research Council [RES-000-22-4256]. 14 eferences
Banerjee, S., B. Carlin, and A. Gelfand (2004).
Hierarchical Modelling andAnalysis for Spatial Data (1st ed.). Chapman and Hall.Besag, J., J. York, and A. Mollie (1991). Bayesian image restoration with twoapplications in spatial statistics.
Annals of the Institute of Statistics andMathematics 43 , 1–59.Boots, B. (2001). Using local statistics for boundary characterization.
GeoJour-nal 53 , 339–345.Doll, R. and A. Bradford Hill (1950). Smoking and carcinoma of the lung.
British Medical Journal 2 , 739748.Doll, R., R. Peto, J. Boreham, and I. Sutherland (2005). Mortality from cancerin relation to smoking: 50 years observations on british doctors.
BritishJournal of Cancer 92 , 426–429.Elliott, P., J. Wakefield, N. Best, and D. Briggs (2000).
Spatial Epidemiology:Methods and Applications (1st ed.). Oxford University Press.Gelman, A. (2006). Prior distributions for variance parameters in hierarchicalmodels.
Bayesian Analysis 1 , 515–533.Jacquez, G., S. Maruca, and M. Fortin (2000). From fields to objects: A reviewof geographic boundary analysis.
Journal of Geographical Systems 2 , 221–241.Lawson, A. (2008).
Bayesian Disease Mapping: Hierarchical Modelling in Spa-tial Epidemiology (1st ed.). Chapman and Hall.Lee, D. (2011). A comparison of conditional autoregressive model used inBayesian disease mapping.
Spatial and Spatio-temporal Epidemiology to ap-pear , DOI:10.1016/j.sste.2011.03.001.Leroux, B., X. Lei, and N. Breslow (1999).
Estimation of disease rates in smallareas: A new mixed model for spatial dependence , Chapter Statistical Modelsin Epidemiology, the Environment and Clinical Trials, Halloran, M and Berry,D (eds), pp. 135–178. Springer-Verlag, New York.Leyland, A., R. Dundas, P. McLoone, and F. Boddy (2007). Inequalities inmortality in Scotland 1981-2001.
BMC Public Health 7 , 172.Li, P., S. banerjee, and A. McBean (2011). Mining boundary effects in areallyreferenced spatial data using the Bayesian information criterion.
Geoinfor-matica to appear , DOI: 10.1007/s10707–010–0109–0.Lu, H. and B. Carlin (2005). Bayesian Areal Wombling for Geographical Bound-ary Analysis.
Geographical Analysis 37 , 265–285.15u, H., C. Reilly, S. Banerjee, and B. Carlin (2007). Bayesian areal wombling viaadjacency modelling.
Environmental and Ecological Statistics 14 , 433–452.Ma, H. and B. Carlin (2007). Bayesian Multivariate Areal Wombling for Mul-tiple Disease Boundary Analysis.
Bayesian Analysis 2 , 281–302.Ma, H., B. Carlin, and S. Banerjee (2010). Hierarchical and Joint Site-EdgeMethods for Medicare Hospice Service Region Boundary Analysis.
Biomet-rics 66 , 355–364.MacNab, Y. (2003). Hierarchical Bayesian Modelling of Spatially CorrelatedHealth Service Outcome and Utilization Rates.
Biometrics 59 , 305–316.Moran, P. (1950). Notes on continuous stochastic phenomena.
Biometrika 37 ,17–23.National Cancer Intelligence Network (2009). Cancer incidence and survivalby major ethnic group, england, 2002-2006. Technical report, Cancer Re-search UK Cancer Survival Group, and London School of Hygiene and Trop-ical Medicine.Quinn, M and Babb, P (2000). Cancer trends in England and Wales, 19501999.Technical report, Health Statistics Quarterly, Office for National Statistics.Wakefield, J. (2007). Disease mapping and spatial regression with count data.
Biostatistics 8 , 158–183.Whyte, B., D. Gordon, S. Haw, C. Fischbacher, and R. Harrison (2007).
Anatlas of tobacco smoking in Scotland: A report presenting estimated smok-ing prevalence and smoking attributable deaths within Scotland . NHS HealthScotland.Womble, W. (1951). Differential systematics.
Science 114 , 315–322.Woods, L., B. Rachet, and M. Coleman (2006). Origins of socio-economic in-equalities in cancer survival: a review.