[PDF] Boundary detection in disease mapping studies

Abstract

In disease mapping, the aim is to estimate the spatial pattern in disease risk over an extended geographical region, so that areas with elevated risks can be identified. A Bayesian hierarchical approach is typically used to produce such maps, which models the risk surface with a set of spatially smooth random effects. However, in complex urban settings there are likely to be boundaries in the risk surface, which separate populations that are geographically adjacent but have very different risk profiles. Therefore this paper proposes an approach for detecting such risk boundaries, and tests its effectiveness by simulation. Finally, the model is applied to lung cancer incidence data in Greater Glasgow, Scotland, between 2001 and 2005.

Full PDF

BBoundary detection in disease mapping studies.

Duncan Lee and Richard Mitchell School of Mathematics and Statistics, University of Glasgow, Glasgow, UK Public Health, School of Medicine, University of Glasgow, Glasgow, UKOctober 28, 2018

Abstract

In disease mapping, the aim is to estimate the spatial pattern in diseaserisk over an extended geographical region, so that areas with elevated riskscan be identiﬁed. A Bayesian hierarchical approach is typically used toproduce such maps, which models the risk surface with a set of spatiallysmooth random eﬀects. However, in complex urban settings there arelikely to be boundaries in the risk surface, which separate populations thatare geographically adjacent but have very diﬀerent risk proﬁles. Thereforethis paper proposes an approach for detecting such risk boundaries, andtests its eﬀectiveness by simulation. Finally, the model is applied to lungcancer incidence data in Greater Glasgow, Scotland, between 2001 and2005.Boundary detection; Disease mapping; Spatial correlation

Disease mapping is the area of statistics that quantiﬁes the spatial pattern indisease risk over an extended geographical region, such as a city of country.The study region is partitioned into a number of small non-overlapping arealunits, such as electoral wards, and only the total number of cases in each unit isavailable. The majority of disease risk maps are produced using Bayesian hier-archical models, utilising a vector of covariate risk factors and a set of randomeﬀects. The latter model any overdispersion or spatial correlation in the diseasedata (after the covariate eﬀects have been allowed for), which can be caused bythe presence of unmeasured risk factors that also have a spatial structure. Therandom eﬀects are often represented by a Conditional Autoregressive (CAR)prior, which forces the eﬀects in two areas to be correlated if those areas sharea common border.The production of such maps provide a number of beneﬁts to public healthprofessionals, including the ability to investigate the association between dis-ease prevalence and suspected risk factors. More recently, disease maps have1 a r X i v : . [ s t a t . O T ] A ug een used to identify boundaries (or cliﬀs or discontinuities ) in the risk sur-face, which separate geographically adjacent areas that have high and low risks.Such boundaries are likely to exist in complex urban settings, where rich andpoor communities can be separated by just a few metres. The detection of suchboundaries has a number of beneﬁts, including the ability to detect the spatialextent of a cluster of high risk areas. It is also important economically, becauseit allows health resources to be targeted at communities with the highest dis-ease risks. Furthermore, the spatial units at which the health and covariate dataare available are typically designed for administrative purposes, and thus oftendo not delineate between distinct neighbourhoods (i.e. groups of people withthe same social circumstances, culture and behaviour). Therefore, boundariesin disease risk surfaces may also correspond to boundaries between diﬀerentneighbourhoods, which are of interest because “their locations reﬂect underly-ing biological, physical, and/or social processes” (Jacquez et al. (2000)).The statistical detection of boundaries in disease maps is also known as Wombling , following the seminal article by Womble (1951). More recently, anumber of diﬀerent approaches to this problem have been proposed, includingthe calculation of local statistics (Boots (2001)), and a Bayesian random eﬀectsmodel (Lu and Carlin (2005)). In this paper we also use a Bayesian randomeﬀects model, and identify boundaries by measuring the level of dissimilaritybetween the populations living in neighbouring areas. We believe that riskboundaries are more likely to occur between populations that are very diﬀer-ent, because homogeneous populations should have similar disease risks. Theremainder of this paper is organised as follows. Section 2 provides a reviewof disease mapping and boundary detection methods, while Section 3 presentsour proposed methodological extension. Section 4 assesses the eﬃcacy of ourapproach using simulation, while Section 5 identiﬁes boundaries in the risk sur-face for lung cancer cases in Greater Glasgow. Finally, Section 6 contains aconcluding discussion and outlines future developments.

The data used to quantify disease risk are denoted by y = ( y , . . . , y n ) and E =( E , . . . , E n ), the former being the numbers of disease cases observed in each of n non-overlapping areas within a speciﬁed time frame. The latter are the expectednumbers of disease cases, which depend on the size and demographic structureof the population living within each area. Disease risk can be summarised bythe standardised incidence ratio (SIR), which for area k is given by ˆ R k = y k /E k .Values above one represent areas with elevated risks of disease, while values lessthan one correspond to relatively healthy areas. However, elevated SIR valuescan occur by chance in areas where E k is small, so the set of disease risks for all n areas are more commonly estimated using a Bayesian hierarchical model. A2eneral speciﬁcation has been described by Elliott et al. (2000), Banerjee et al.(2004), Wakeﬁeld (2007) and Lawson (2008), and is given by Y k | E k , R k ∼ Poisson( E k R k ) for k = 1 , . . . , n, ln( R k ) = x T k β + φ k . (1)Disease risk is modelled by covariates x T k = ( x k , . . . , x kp ), and randomeﬀects φ = ( φ , . . . , φ n ), the latter allowing for any overdispersion and spatialcorrelation in the disease data (after the covariate eﬀects have been accountedfor). The most common prior for φ is a conditional autoregressive (CAR, Besaget al. (1991)) model, which is speciﬁed in terms of n univariate conditionaldistributions, f ( φ k | φ , . . . , φ k − , φ k +1 , . . . , φ n ) for k = 1 , . . . , n . However, eachfull conditional distribution only depends on the values of φ j in a small numberof neighbouring areas. This neighbourhood information is contained in a binaryadjacency matrix W , which has elements w kj that are equal to one or zerodepending on whether areas ( k, j ) are deﬁned to be neighbours. A commonspeciﬁcation is that areas ( k, j ) are neighbours if they share a common border,which corresponds to w kj =1 and is denoted by k ∼ j . A number of priors havebeen proposed within the general class of CAR models, and the one we adopthere was originally proposed by Leroux et al. (1999), and has subsequently beenreviewed by MacNab (2003) and Lee (2011). The model has full conditionaldistributions given by φ k | φ − k , W, τ , ρ, µ ∼ N (cid:32) ρ (cid:80) nj =1 w kj φ j + (1 − ρ ) µρ (cid:80) nj =1 w kj + 1 − ρ , τ ρ (cid:80) nj =1 w kj + 1 − ρ (cid:33) . (2)The conditional expectation is a weighted average of the random eﬀects inneighbouring areas and a global intercept µ (not included in (1)), where theweights are controlled by ρ . In this model ρ = 0 corresponds to independence,while ρ close to one deﬁnes strong spatial correlation. These full conditionaldistributions correspond to a proper multivariate Gaussian distribution if ρ ∈ [0 , φ ∼ N( µ , τ [ ρW ∗ + (1 − ρ ) I ] − ) , (3)where I denotes an n × n identity matrix while is an n-vector of ones. In theabove equation W ∗ has diagonal elements w ∗ kk = (cid:80) ni =1 w ki , and non-diagonalelements w ∗ kj = − w kj . This model simpliﬁes to the intrinsic autoregressivemodel if ρ is ﬁxed at one, although this does correspond to an improper jointdistribution for φ . Lu and Carlin (2005) propose the use of Boundary Likelihood Values (BLV),which are calculated as BLV kj = | ˆ R k − ˆ R j | , the absolute diﬀerence in risk3etween two neighbouring areas. The border between neighbouring areas ( k, j )can then be classiﬁed as a boundary in the risk surface if either(a) BLV kj > c for some cut-oﬀ c ; or(b) BLV kj is within the top c % of the boundary likelihood values over thestudy region, for some percentage c .These approaches to boundary detection are ad-hoc, because the decisionrules (a) and (b) require tuning constants ( c or c ) to be speciﬁed. Jacquezet al. (2000) has criticised this approach for this reason, and argues that byspecifying the tuning constant the investigator essentially chooses the numberof boundaries that are identiﬁed, even though this is unknown and the goal ofthe analysis. The approach we propose allows the data to determine the number and locationsof any boundaries in the risk surface, rather than requiring the investigator tospecify a tuning constant. We achieve this by modelling w kj as a binary ran-dom quantity if areas ( k, j ) share a common border, rather than assuming itis ﬁxed at one. If w kj is estimated as zero the random eﬀects in areas ( k, j )are conditionally independent, which corresponds to a boundary in the risk sur-face. In contrast, if w kj equals one the random eﬀects are correlated, whichcorresponds to no boundary. This general approach to boundary detection haspreviously been proposed by Lu et al. (2007), Ma and Carlin (2007), and Maet al. (2010), who model the set of w kj by logistic regression, a CAR prior,or an Ising model. However, this requires the large set of w kj for all pairs ofneighbouring areas to be estimated, which Li et al. (2011) argue are not wellidentiﬁed from the data. In this paper we model the set of w kj as a function ofparameters α = ( α , . . . , α q ), rather than treating each w kj as a separate un-known quantity. This results in a parsimonious yet ﬂexible model for detectingboundaries in the risk surface, which avoids the weak parameter identiﬁabilityand slow MCMC convergence experienced by Li et al. (2011), when modellingeach w kj separately. The ﬁrst stage of our hierarchical model is similar to that described in Section2, and is given by Y k | E k , R k ∼ Poisson( E k R k ) for k = 1 , . . . , n, ln( R k ) = φ k , (4) φ k | φ − k , µ, α , τ ∼ N (cid:32) . (cid:80) nj =1 w kj ( α ) φ j + 0 . µ . (cid:80) nj =1 w kj ( α ) + 0 . , τ . (cid:80) nj =1 w kj ( α ) + 0 . (cid:33) .

4n common with Lu et al. (2007) no covariates are included in (4), because φ would then represent the residual pattern in disease risk, and any boundariesidentiﬁed would hence be in the residual surface. Secondly, we ﬁx ρ at 0.99 be-cause it enforces strong spatial correlation, which allows the presence or absenceof boundaries to be determined by W ( α ). We note that we do not choose ρ = 1,as it results in an inﬁnite mean and variance in the conditional distribution of φ k (see (2)) if an area is surrounded by boundaries, i.e. if (cid:80) nj =1 w kj ( α ) = 0 forsome area k . We believe that boundaries in the risk surface are likely to occur between pop-ulations that are very diﬀerent, because homogeneous populations should havesimilar risk proﬁles. Therefore, we model the presence or absence of a bound-ary between areas ( k, j ) by a vector of q non-negative dissimilarity metrics z kji = ( z ki , . . . , z kjq ). These metrics have the general form z kji = | z ki − z ji | σ i for i = 1 , . . . , q, (5)the absolute diﬀerence in the value of a covariate between the two areas inquestion. Here, σ i represents the standard deviation of | z ki − z ji | over all pairs ofcontiguous areas, and we re-scale the dissimilarity metrics to improve the mixingand convergence of the MCMC algorithm. It is these dissimilarity measures thatdrive the detection of boundaries in the risk surface, and examples could includediﬀerences in the population’s social characteristics (e.g. average income) or riskinducing behaviour (e.g. smoking prevalence). Using these metrics we modelthe elements of W ( α ) as w kj ( α ) = (cid:26) − (cid:80) qi =1 z kji α i ) ≥ . j ∼ k , (6)where pairs of areas that do not share a common border have w kj ( α ) ﬁxedat zero. For areas that are contiguous, the model detects a boundary in the risksurface ifexp( − (cid:80) qi =1 z kji α i ) is less than 0.5. Therefore we constrain the regression pa-rameters to be non-negative, so that the greater the dissimilarity between twoareas the more likely there is to be a boundary between them. In contrast, iftwo areas have identical covariate values (and hence homogeneous populations)there cannot be a boundary between them, regardless of the value of α . This isthe reason we do not include an intercept term in (6), as doing so would allowboundaries to be detected between areas with homogeneous populations. Theregression parameters determine the number of risk boundaries in the studyregion, with larger values of α corresponding to more boundaries being de-tected. If only one dissimilarity metric z is included in (6), then a plausiblerange of values for the single regression parameter α can be determined. At5ne extreme, no boundaries will be detected if α ≤ − ln(0 . /z max , while at theother, all borders in the study region will be considered as boundaries (unless z kj =0) if α > − ln(0 . /z min . Here ( z min , z max ) denote the minimum positiveand maximum values of the dissimilarity metric. More generally, if there are q dissimilarity metrics then boundaries are identiﬁed ifexp( − z kj α ) × . . . × exp( − z kjq α q ) < . , where the value of each component, exp( − z kji α i ), must lie between zeroand one. Therefore, if α i ≤ − ln(0 . /z maxi , the dissimilarity measure z kji is notsolely responsible for detecting any boundaries, because exp( − z i α i ) would begreater than 0.5 for all pairs of contiguous areas. Therefore in terms of interpre-tation, the dissimilarity metric can be said to have no eﬀect on detecting bound-aries if the entire 95% credible interval for α i is less than α min = − ln(0 . /z maxkji .In contrast, if the interval lies completely above α min , then the metric can besaid to have a substantial eﬀect on identifying risk boundaries. We note thatthe usual statistical representation of ‘no eﬀect’ (credible interval that includeszero) is not possible in this context, because the regression parameters are con-strained to be non-negative. We also note that the approach we outline does notguarantee that the boundaries we detect will be closed (form an unbroken line,an example of which is shown in Figure 3), which allows us to detect bound-aries that enclose an entire subregion, as well as those that just separate highlydiﬀerent areas. The vector of random eﬀects depend on hyperparameters ( µ, τ , α ), which re-spectively control its mean, variance and correlation structure. The mean µ is assigned a weakly informative Gaussian prior distribution, with a mean ofzero and a variance of 10. The variance parameter τ is assigned a weaklyinformative Uniform(0 ,

10) prior on the standard deviation scale, following thesuggestion of Gelman (2006). We adopt a uniform prior for the regression pa-rameters, α i ∼ Uniform(0 , M i ), which corresponds to our prior ignorance aboutthe number of boundaries in the risk surface. An alternative would be a recip-rocal prior, f ( α i ) ∝ α i I [0 ≤ α i ≤ M i ], which represents our prior belief thatthe risk surface is spatially smooth. In both cases a natural upper limit wouldbe M i = − ln(0 . /z minkji , the value at which the dissimilarity measure z kji solelyidentiﬁes all borders as boundaries in the risk surface. However, in a bound-ary detection analysis one is looking to identify boundaries between collectionsof areas, which have similar risks within each collection but diﬀer across theboundary. Therefore we ﬁx M i so that at most 50% of borders can be classiﬁedas boundaries, and present a sensitivity analysis in the supplementary materialto this choice of 50%. 6 Simulation study

In this section we present a simulation study, that assesses the accuracy withwhich our proposed model can detect boundaries in the risk surface. In doingthis we assess whether our model can detect ‘true’ boundaries in the risk surface,as well as the extent to which it falsely identiﬁes boundaries that do not exist.We have decided not to compare our model to the existing boundary detectionapproach using (1), (2) and BLVs, because this requires a tuning constant ( c or c ) to be speciﬁed by the user, which in real life would be unknown. However,these tuning constants are known in this simulation setting, and specifying theirvalue would put this method at an unfair advantage or disadvantage, dependingon whether we chose the ‘correct’ values. We base our study on the n = 271 areas that comprise the Greater Glasgow andClyde health board, which is the region considered in the cancer mapping studypresented in Section ﬁve. Disease counts are generated from the Poisson model(4), where the expected numbers of admissions, E , relate to the Glasgow cancerdata. A new risk surface R = exp( φ ) is generated for each set of simulateddisease data, because this ensures the results are not aﬀected by a particularrealisation of φ . Each simulated risk surface has ﬁxed boundaries, which areshown by the bold black lines in Figure 1. There are 74 boundaries in total,which corresponds to approximately 10% of the set of borders in the studyregion. This set of boundaries partition the study region into 6 groups, themain area shaded in white, and the remaining 5 smaller areas shaded in grey.To produce risk surfaces with boundaries, the random eﬀects ( φ ) are generatedfrom a multivariate Gaussian distribution with a piecewise constant mean, whichin the white region is equal to 0, while in the grey regions it is equal to k . Thecorrelation function is from the Matern class with smoothness parameter equalto κ = 2 .

5, while the spatial range is ﬁxed so that the median correlationbetween areas is 0.5. The dissimilarity metrics are generated from z kj ∼ (cid:26) | N(1 , . ) | if areas ( k, j ) are not separated by a boundary. | N(1 + k , . ) | if areas ( k, j ) are separated by a boundary. (7)Here, larger values of k correspond to dissimilarity metrics that better iden-tify the true boundaries in the risk surface; i.e. have larger values for the bound-aries in Figure 1 than for the non-boundaries. We assess the eﬀect on modelperformance of changing both the size of the boundaries (via k ) and the qualityof the dissimilarity metrics (via k ), the results of which are displayed in Table1. When assessing the eﬀect of boundary size we ﬁx k = 3, which providesnearly ‘perfect’ dissimilarity metrics, i.e. values for z kj for boundaries and non-boundaries that almost never overlap. Conversely, when assessing the eﬀect of7igure 1: Locations of the true boundaries in the simulated risk surfaces.having imperfect dissimilarity metrics, we ﬁx the boundary size at k = 0 . The top half of Table 1 displays the eﬀects of changing the magnitude of therisk boundaries, in the idealised situation of having perfect dissimilarity metrics.The constant k represents the diﬀerence in the mean value of the risk surfacebetween white and grey areas (see Figure 1), hence smaller values correspond toa spatially smoother surface ( k = 0 would correspond to a completely smoothrisk surface with no boundaries). The table shows that the model can detectlarger risk boundaries more often than smaller ones as expected, although itstill achieves over a 90% detection rate when the risk surface only has a meandiﬀerence of 0.2. The much lower percentages for smaller values of k are alsonot surprising, as they correspond to situations in which the average size of thetrue boundaries are not very diﬀerent to the average size of the non-boundaries.In addition, the false positive rates are very low (generally less than 1%) regard-less of the boundary size, which suggests that detected boundaries are likely tobe real.The bottom half of Table 1 displays the eﬀects of having imperfect dissimilar-ity metrics, i.e. metrics for which some of the values corresponding to the ‘true’8able 1: The eﬀect of boundary size (as measured by k ) and the quality ofthe dissimilarity metrics (as measured by k ) on the eﬀectiveness of the model.The table displays the percentage agreement for boundaries (BA) and non-boundaries (NBA), as well as the bias and root mean square error (RMSE) ofthe estimated risk surface, which are presented as a percentage of their truevalue. Comparison k k BA ( % ) NBA ( % ) Bias RMSE z kj k decreases, there is a greater overlap between the values ofthe dissimilarity metrics at boundaries and non-boundaries. The limit of k = 0corresponds to a dissimilarity metric with no information, i.e. it has the samerange of values for boundaries as non-boundaries. The table shows that if thedissimilarity metric is nearly perfect ( k = 3 corresponds to the means in equa-tion (7) being separated by 6 standard deviations), then the model nearly alwayscorrectly identiﬁes boundaries and non-boundaries. However, as the informa-tion content in the dissimilarity metric decreases (as k decreases) so does theperformance of the model, both in terms of the boundary agreement and thefalse positive rate. If the dissimilarity metric contains no information the modelonly identiﬁes 1.86% of the true boundaries as would be expected, although inthis situation the false positive rate returns to being low at around 1%. This section presents a study mapping the risk of lung cancer in Greater Glas-gow, Scotland, between 2001 and 2005.

The data for our study are publicly available, and can be downloaded fromthe Scottish Neighbourhood Statistics (SNS) database ( ).The study region is the Greater Glasgow and Clyde health board, which con-tains the city of Glasgow in the east, and the river Clyde estuary in the west.9lasgow is known to contain some of the poorest people in Europe (Leylandet al. (2007)), and has rich and poor communities that are geographically adja-cent. The Greater Glasgow and Clyde health board is partitioned into n = 271administrative units called Intermediate Geographies (IG), which were devel-oped speciﬁcally for the distribution of small-area statistics, and have a medianarea of 124 hectares and a median population of 4,239.The disease data we model are the number of people diagnosed with lungcancer between 2001 and 2005 in each IG, which corresponds to ICD-10 codesC33 - C34. The expected numbers of cases in each IG are calculated by exter-nal standardisation, using age and sex adjusted rates for the whole of Scotland.These rates were obtained from the Information Services Division (ISD), whichis the statistical arm of the National Health Service in Scotland. The simplestmeasure of disease risk is the standardised incidence ratio, which is presented inFigure 2 as a choropleth map. The Figure shows that the risk of lung cancer ishighest in the heavily deprived east end of Glasgow (east of the study region),as well as along the banks of the river Clyde (the thin white line running southeast). The Figure also shows that cancer incidence in Greater Glasgow is higherthan in the rest of Scotland, as the average SIR across the study region is 1.186.Large amounts of covariate data are available from the Scottish Neighbour-hood Statistics database, and the ﬁrst variable we consider is a modelled esti-mate of the percentage of the population in each IG that smoke, further detailsof which are available from Whyte et al. (2007). The causal relationship betweensmoking and lung cancer risk is long standing (see for example Doll and Brad-ford Hill (1950) and Doll et al. (2005)), and it is likely to be the most importantdissimilarity metric in our study. Cancer risk has also been shown to vary byethnic group (see for example National Cancer Intelligence Network (2009)), sowe include the percentage of school children from ethnic minorities as a proxymeasure. Socio-economic deprivation is also associated with cancer risk (see forexample Quinn, M and Babb, P (2000) and Woods et al. (2006)), and as a proxymeasure we use the natural log of the median house price. This proxy measureis used because it is the measure of deprivation available that is least correlatedwith the smoking covariate (Pearson’s r=-0.69). Finally, we acknowledge thatother factors such as diet and physical activity have been repeatedly associatedwith lung cancer risk. However, no data are available at the small-area levelabout these factors. Inference for all models is based on 50,000 MCMC samples generated from ﬁveMarkov chains, that were initialised at dispersed locations in the sample space.Each chain is burnt-in until convergence (40,000 iterations), and the next 10,000samples are used for the analysis. A number of models were ﬁtted to the datawith diﬀerent combinations of the three dissimilarity metrics, and in each casethe residuals were assessed for the presence of spatial correlation. This was10igure 2: Standardised Incidence Ratio (SIR) for lung cancer in Greater Glasgowbetween 2001 and 2005.achieved using a permutation test based on Moran’s I statistic (Moran (1950)),and in all cases the model adequately removes the spatial correlation present inthe data, as the corresponding p-values (not shown) are greater than 0.05.Initially, all three covariates (smoking, ethnicity and house price) were in-cluded in the model as dissimilarity metrics, and the posterior medians and 95%credible intervals are displayed in Table 2. Also displayed in Table 2 is α min , thethreshold value below which the dissimilarity metric does not solely detect anyboundaries in the risk surface. The table shows that smoking prevalence hassubstantial inﬂuence in detecting boundaries, as the estimate and credible inter-val lie above the threshold value of α min = 0 . α min = − ln(0 . /z max is presented, the value at which each dissimilarity met-ric does not solely identify any risk boundaries. Covariate Estimate 95 % credible interval α min Smoking 0.232 (0.171, 0.254) 0.131Ethnicity 0.012 (0.001, 0.046) 0.126House price (log) 0.015 (0.001, 0.103) 0.119the smoking only model the sole regression parameter has a posterior medianand 95% credible interval of 0.257 (0.239, 0.262), which, in common with theresults in Table 2, lies completely above the ‘no eﬀect’ threshold.

Figure 3 displays the estimated risk surface for lung cancer from the smokingonly model, where the shading is on the same scale as that used in Figure 2.The solid white lines denote the risk boundaries that have been identiﬁed bythe smoking covariate, which are deﬁned by having posterior median values of w kj equal to zero. We note that the study region is split completely in two bythe river Clyde (the thin white line running south east), and areas on oppositebanks are not assumed to be neighbours. Therefore no boundaries can be de-tected across the river, which explains the absence of boundaries in this area.The ﬁgure shows that the smoking covariate detects 162 boundaries in therisk surface (23.1% of all possible borders), including within the city of Glasgow(middle of the map) as well as along the southern coast of the Clyde estuary (farwest of the map). The majority of these estimated risk boundaries appear tocorrespond to sizeable changes in the risk surface, suggesting that the smokingcovariate appears to be an appropriate dissimilarity metric for detecting suchboundaries. However, a few of the boundaries identiﬁed show no evidence ofseparating areas with diﬀering health risks, such as the closed boundary in thesouth of the city. These ‘false positives’ correspond to two areas having diﬀerentsmoking prevalences but similar risk proﬁles, and would be a starting point fora more detailed investigation into why the risk proﬁles are similar given thevastly diﬀerent smoking rates. In this paper we have proposed a statistical approach to detecting boundariesin disease risk maps, which separate populations that exhibit high and low risksof disease. Our approach detects boundaries by measuring the dissimilarity be-tween populations living in neighbouring areas, because we believe that abruptchanges in the risk surface are most likely to occur between populations that12igure 3: Estimated risk surface for lung cancer in Greater Glasgow between2001 and 2005. The risk boundaries are denoted by solid white lines.are geographically adjacent but have very diﬀerent social characteristics or riskinducing behaviour. Our approach has the advantage of being fully automatic,so that unlike the approach of Lu and Carlin (2005), the number of boundariesin the risk surface is determined by the data and not a-priori by the investi-gator. In addition, our approach identiﬁes boundaries using a small numberof regression parameters α , rather than estimating the set of neighbourhoodrelations { w kj } for all pairs of contiguous areas. Li et al. (2011) have suggestedthat this latter approach (used by Lu et al. (2007) and Ma and Carlin (2007))can lead to identiﬁability problems and slow MCMC convergence, due to thelarge number of parameters to be estimated. For example, in the lung cancerapplication presented in Section 5, there are 701 neighbourhood relations w kj ,compared with disease data in only 271 areas.The simulation study presented in Section 4 suggests that our approachgenerally performs well, both in terms of detecting true boundaries in the risksurface, as well as not detecting large numbers of false positives. In the presenceof a perfect dissimilarity metric our approach can detect the majority of truerisk boundaries, for example, it has a 99.6% detection rate when the averagediﬀerence in the log risk surface is 0.3. The main drawback of our approach isillustrated by the bottom half of Table 1, which shows that it is crucially de-pendent on the existence of good quality dissimilarity metrics, that have large13alues for risk boundaries and small values for non-boundaries. However, in theabsence of good dissimilarity metrics (as in the last few rows of Table 1) ourapproach appears to remain conservative, in the sense that the false positiverate is relatively low.As our approach to boundary detection is based solely on covariates, it hasthe advantage that in addition to identifying risk boundaries, it also identiﬁesthe underlying drivers of these boundaries. However, as illustrated by the sim-ulation study, the corresponding disadvantage is that it relies on the existenceof relevant covariate data, which may not always be available. Our lung cancerexample also illustrates this, as the main covariate (smoking prevalence) iden-tiﬁes the majority of the risk boundaries evident from Figure 3. However, italso identiﬁes some boundaries that do not appear to be real, as well as ignor-ing others where there is a suspected discontinuity in the risk surface. Thisimperfection could be caused by the omission of other important dissimilaritymetrics, or by the fact that the smoking covariate is a modelled estimate ratherthan being real prevalence data. Therefore, the production of risk maps suchas Figure 3 can be viewed as an exploratory tool, which helps the investigatorunderstand the drivers of the spatial variation in disease risk. For example, ifa risk boundary is not detected despite their being a discontinuity in the risksurface, further investigation of the two areas in question could be carried outto determine what other factors could be causing this discontinuity.Finally, this paper opens up numerous avenues for future work. The mostobvious is to develop a boundary detection method that has the advantages ofthe method proposed here, but does not rely on covariate data to detect riskboundaries. One possibility in this vein is to develop an iterative algorithm,which compares the ﬁt (possibly using DIC) of a number of models with diﬀerentbut ﬁxed neighbourhood matrices W . In this way one could compare a modelwhere all adjacent areas have w kj = 1, against an alternative with w kj = 0 forareas that are suspected of being separated by a discontinuity in the risk surface.Other natural extensions to this approach include the detection of boundariesin multiple disease risk surfaces simultaneously, as well as adding a temporaldimension to the model. Acknowledgments

The data and shapeﬁles used in this study were provided by the Scottish Gov-ernment.

Conﬂict of Interest:

None declared.

Funding

This work was supported by the Economic and Social Research Council [RES-000-22-4256]. 14 eferences

Banerjee, S., B. Carlin, and A. Gelfand (2004).

Hierarchical Modelling andAnalysis for Spatial Data (1st ed.). Chapman and Hall.Besag, J., J. York, and A. Mollie (1991). Bayesian image restoration with twoapplications in spatial statistics.

Annals of the Institute of Statistics andMathematics 43 , 1–59.Boots, B. (2001). Using local statistics for boundary characterization.

GeoJour-nal 53 , 339–345.Doll, R. and A. Bradford Hill (1950). Smoking and carcinoma of the lung.

British Medical Journal 2 , 739748.Doll, R., R. Peto, J. Boreham, and I. Sutherland (2005). Mortality from cancerin relation to smoking: 50 years observations on british doctors.

BritishJournal of Cancer 92 , 426–429.Elliott, P., J. Wakeﬁeld, N. Best, and D. Briggs (2000).

Spatial Epidemiology:Methods and Applications (1st ed.). Oxford University Press.Gelman, A. (2006). Prior distributions for variance parameters in hierarchicalmodels.

Bayesian Analysis 1 , 515–533.Jacquez, G., S. Maruca, and M. Fortin (2000). From ﬁelds to objects: A reviewof geographic boundary analysis.

Journal of Geographical Systems 2 , 221–241.Lawson, A. (2008).

Bayesian Disease Mapping: Hierarchical Modelling in Spa-tial Epidemiology (1st ed.). Chapman and Hall.Lee, D. (2011). A comparison of conditional autoregressive model used inBayesian disease mapping.

Spatial and Spatio-temporal Epidemiology to ap-pear , DOI:10.1016/j.sste.2011.03.001.Leroux, B., X. Lei, and N. Breslow (1999).

Estimation of disease rates in smallareas: A new mixed model for spatial dependence , Chapter Statistical Modelsin Epidemiology, the Environment and Clinical Trials, Halloran, M and Berry,D (eds), pp. 135–178. Springer-Verlag, New York.Leyland, A., R. Dundas, P. McLoone, and F. Boddy (2007). Inequalities inmortality in Scotland 1981-2001.

BMC Public Health 7 , 172.Li, P., S. banerjee, and A. McBean (2011). Mining boundary eﬀects in areallyreferenced spatial data using the Bayesian information criterion.

Geoinfor-matica to appear , DOI: 10.1007/s10707–010–0109–0.Lu, H. and B. Carlin (2005). Bayesian Areal Wombling for Geographical Bound-ary Analysis.

Geographical Analysis 37 , 265–285.15u, H., C. Reilly, S. Banerjee, and B. Carlin (2007). Bayesian areal wombling viaadjacency modelling.

Environmental and Ecological Statistics 14 , 433–452.Ma, H. and B. Carlin (2007). Bayesian Multivariate Areal Wombling for Mul-tiple Disease Boundary Analysis.

Bayesian Analysis 2 , 281–302.Ma, H., B. Carlin, and S. Banerjee (2010). Hierarchical and Joint Site-EdgeMethods for Medicare Hospice Service Region Boundary Analysis.

Biomet-rics 66 , 355–364.MacNab, Y. (2003). Hierarchical Bayesian Modelling of Spatially CorrelatedHealth Service Outcome and Utilization Rates.

Biometrics 59 , 305–316.Moran, P. (1950). Notes on continuous stochastic phenomena.

Biometrika 37 ,17–23.National Cancer Intelligence Network (2009). Cancer incidence and survivalby major ethnic group, england, 2002-2006. Technical report, Cancer Re-search UK Cancer Survival Group, and London School of Hygiene and Trop-ical Medicine.Quinn, M and Babb, P (2000). Cancer trends in England and Wales, 19501999.Technical report, Health Statistics Quarterly, Oﬃce for National Statistics.Wakeﬁeld, J. (2007). Disease mapping and spatial regression with count data.

Biostatistics 8 , 158–183.Whyte, B., D. Gordon, S. Haw, C. Fischbacher, and R. Harrison (2007).

Anatlas of tobacco smoking in Scotland: A report presenting estimated smok-ing prevalence and smoking attributable deaths within Scotland . NHS HealthScotland.Womble, W. (1951). Diﬀerential systematics.

Science 114 , 315–322.Woods, L., B. Rachet, and M. Coleman (2006). Origins of socio-economic in-equalities in cancer survival: a review.