Modeling large scale species abundance with latent spatial processes
Avishek Chakraborty, Alan E. Gelfand, Adam M. Wilson, Andrew M. Latimer, John A. Silander Jr
aa r X i v : . [ s t a t . A P ] N ov The Annals of Applied Statistics (cid:13)
Institute of Mathematical Statistics, 2010
MODELING LARGE SCALE SPECIES ABUNDANCE WITHLATENT SPATIAL PROCESSES By Avishek Chakraborty, Alan E. Gelfand, Adam M. Wilson,Andrew M. Latimer and John A. Silander, Jr.
Duke University, Duke University, University of Connecticut, University ofCalifornia, Davis and University of Connecticut
Modeling species abundance patterns using local environmentalfeatures is an important, current problem in ecology. The Cape Floris-tic Region (CFR) in South Africa is a global hot spot of diversity andendemism, and provides a rich class of species abundance data forsuch modeling. Here, we propose a multi-stage Bayesian hierarchicalmodel for explaining species abundance over this region. Our modelis specified at areal level, where the CFR is divided into roughly37 ,
000 one minute grid cells; species abundance is observed at somelocations within some cells. The abundance values are ordinally cat-egorized. Environmental and soil-type factors, likely to influence theabundance pattern, are included in the model. We formulate the em-pirical abundance pattern as a degraded version of the potential pat-tern, with the degradation effect accomplished in two stages. First,we adjust for land use transformation and then we adjust for mea-surement error, hence misclassification error, to yield the observedabundance classifications. An important point in this analysis is thatonly 28% of the grid cells have been sampled and that, for sampledgrid cells, the number of sampled locations ranges from one to morethan one hundred. Still, we are able to develop potential and trans-formed abundance surfaces over the entire region.In the hierarchical framework, categorical abundance classifica-tions are induced by continuous latent surfaces. The degradationmodel above is built on the latent scale. On this scale, an areal levelspatial regression model was used for modeling the dependence ofspecies abundance on the environmental factors. To capture antic-ipated similarity in abundance pattern among neighboring regions,spatial random effects with a conditionally autoregressive prior (CAR)were specified. Model fitting is through familiar Markov chain MonteReceived September 2009; revised February 2010. Supported in part by NSF DEB Grants 056320 and 0516198.
Key words and phrases.
Conditional autoregressive prior, latent variables, misalign-ment, ordinal categorical data, parallel computing.
This is an electronic reprint of the original article published by theInstitute of Mathematical Statistics in
The Annals of Applied Statistics ,2010, Vol. 4, No. 3, 1403–1429. This reprint differs from the original in paginationand typographic detail. 1
A. CHAKRABORTY ET AL.Carlo methods. While models with CAR priors are usually efficientlyfitted, even with large data sets, with our modeling and the largenumber of cells, run times became very long. So a novel parallelizedcomputing strategy was developed to expedite fitting. The modelwas run for six different species. With categorical data, display ofthe resultant abundance patterns is a challenge and we offer severaldifferent views. The patterns are of importance on their own, com-paratively across the region and across species, with implications forspecies competition and, more generally, for planning and conserva-tion.
1. Introduction.
Ecologists increasingly use species distribution modelsto address theoretical and practical issues, including predicting the responseof species to climate change [Midgley and Thuiller (2007), Fitzpatrick etal. (2008), Loarie et al. (2008)], designing and managing conservation areas[Pressey et al. (2007)], and finding additional populations of known species orclosely related sibling species [Raxworthy et al. (2003), Guisan et al. (2006)].In all these applications the core problem is to use information about wherea species occurs and about relevant environmental factors to predict howlikely the species is to be present or absent in unsampled locations.The literature on species distribution modeling covers many applications;there are useful review papers that organize and compare model approaches[Guisan and Zimmerman (2000), Guisan and Thuiller (2005), Elith et al.(2006), Graham and Hijmans (2006), Wisz et al. (2008)]. Most species dis-tribution models ignore spatial pattern and thus are based implicitly ontwo assumptions: (1) environmental factors are the primary determinants ofspecies distributions and (2) species have reached or nearly reached equilib-rium with these factors [Schwartz et al. (2006), Beale et al. (2007)]. Theseassumptions underlie the currently dominant species distribution model-ing approaches—generalized linear and additive models (GLM and GAM),species envelope models such as BIOCLIM [Busby (1991)], and the maxi-mum entropy-based approach MAXENT [Phillips and Dud´ık (2008)]. Thestatistics literature covers GLM and GAM models extensively. The lattertends to fit data better than the former since they employ additional pa-rameters but lose simplicity in interpretation and risk overfitting and poorout-of-sample prediction. Climate envelope models and the now increasingly-used maximum entropy methods are algorithmic and not of direct interesthere.In addition to the fundamental ecological issues mentioned above, compli-cation arises in various forms in modeling abundance from imperfect surveydata such as observer error [Royle et al. (2007), Cressie et al. (2009)], variablesampling intensity, gaps in sampling, and spatial misalignment of distribu-tional and environmental data [Gelfand et al. (2005a)]. First, since a regionis almost never exhaustively sampled, individuals not exposed to sampling
PATIAL MODELING FOR SPECIES ABUNDANCE will be missed. Second, it may be that potentially present individuals areundetected [Royle et al. (2007)] and, possibly vice versa, for example, a falsepositive misclassification error with regard to species detection [Royle andLink (2006)]. A third complication is that ecologists and field workers arebiased against absences; they tend to sample where species are, not wherethey aren’t. Such preferential sampling and its impact on inference is dis-cussed in Diggle, Menezes and Su (2010). Further complications arise withanimals due to their mobility. Previous work has developed spatial hier-archical models that accommodate some of these difficulties, fitting thesemodels to presence/absence data in a Bayesian framework [Hooten, Larsenand Wikle (2003), Gelfand et al. (2005a, 2005b), Latimer et al. (2006)].The species distribution modeling discussed above is either in the pres-ence/absence or presence-only data settings; there is relatively little workon spatial abundance patterns, despite their theoretical and practical im-portance [Kunin, Hartley and Lennon (2000), Gaston (2003)]. Our primarycontribution here is to develop a hierarchical modeling approach for ordi-nal categorical abundance data, explained by the suitability of the environ-ment, the effect of land use/land transformation, and potential misclassifica-tion error. Ordinal classifications are often the case in ecological abundancedata, especially for plants [Sutherland (2006), Ib´a˜nez et al. (2009)]. Froma stochastic modeling perspective, categorical data can be viewed as theoutcome of a multinomial model, with the cell probabilities dependent onbackground features. Within a Bayesian framework such modeling is oftenimplemented using data augmentation [Albert and Chib (1993)], introduc-ing a latent hierarchical level. There, the ordered classification is viewedas a clipped version of a single latent continuous response, introducing cutpoints. See also De Oliveira (2000) and Higgs and Hoeting (2010).At the latent level, suitability of the environment can be modeled throughregression. Availability in terms of land use degrades suitability. That is, animportant feature of our modeling, from an ecological point of view, is thatit deals with transformation of the study area by human intervention. Inmuch of the region, the “natural” state of areas has been altered to anagricultural or urban state, or the vegetation has been densely colonizedby alien invasive plant species. So, we cannot treat the entire region asequally available to the plant species we are modeling. We must introducea contrast between the current abundance of species (their transformed or adjusted abundance) and their potential distributions in the absence of landuse change ( potential abundance). These notions are formally defined at theareal unit level in Section 3. A further degradation enabling the possibilityof misclassification and/or observer error in the data collection procedurecan be accounted for as measurement error in the latent surface. Thereis a substantial literature on measurement error modeling for continuousobservations, for example, Fuller (1987), Stefanski and Carroll (1987), and A. CHAKRABORTY ET AL.
Mallick and Gelfand (1995). In our modeling we impose a hard constraint:with no potential presence (i.e., an unsuitable environment), we can observeonly zero abundance. We enforce this constraint on the latent scale. Withcut points, modeled as random, we provide an explanatory model for theobserved categorical abundance data. Furthermore, we can invert from thelatent abundance scale to the categorical abundances to predict abundancefor unsampled cells and also to predict abundance in the absence of land usetransformation.With spatial data collection, we anticipate spatial pattern in abundanceand thus introduce spatial structure into our modeling. That is, causal eco-logical explanations such as localized dispersal, as well as omitted (unob-served) explanatory variables with spatial pattern such as local smoothnessof geological or topographic features, suggest that, at sufficiently high resolu-tion, abundance of a species at one location will be associated with its abun-dance at neighboring locations [Ver Hoef et al. (2001)]. Moreover, throughspatial modeling, we can provide spatial adjustment to cells that have notbeen sampled, accommodating the gaps in sampling and irregular samplingintensity mentioned above. In particular, we create a latent process modelthrough a trivariate spatial process specification, with truncated support,to capture potential abundance, land transformation-adjusted abundance,and measurement error-adjusted abundance. Since our environmental infor-mation is available at grid cell level, we use Markov random field (MRF)models [Besag (1974), Banerjee, Carlin and Gelfand (2004)] to capture spa-tial dependence and to facilitate computation. However, we work with a largelandscape of approximately 37 ,
000 grid cells which leads to very long runtimes in model fitting and so we introduce a novel parallelized computingstrategy to expedite fitting.There have been other recent developments in modeling of species abun-dances, some using Bayesian hierarchical models. First, there has been somework on developing models that deal almost exclusively with animal censusdata, including count data and mark-recapture data [Royle et al. (2007),Conroy et al. (2008), Gorresen et al. (2009)]. Potts and Elith (2006) providean overview of abundance modeling, in fact, five regression models (Poisson,negative binomial, quasi-Poisson, the hurdle model, and the zero-inflatedPoisson) fitted for one particular plant example. These models focus on cor-recting observer error and bias as well as under-detection (the species ispresent but not observed), whence the “true” abundance is virtually alwayshigher than observed [Royle et al. (2007), Cressie et al. (2009)]. We notesome very recent work on working with ordinal species abundance in plantdata by Ib´a˜nez et al. (2009). This approach takes ordinally scored abun-dances and uses an ordered logit hierarchical Bayes model to infer potentialabundances for species that are still spreading across the landscape.
PATIAL MODELING FOR SPECIES ABUNDANCE The advantages of working within the Bayesian framework with Markovchain Monte Carlo (MCMC) model fitting are familiar by now—full infer-ence about arbitrary unknowns, that is, functions of model parameters andpredictions, can be achieved through their posterior distributions, and un-certainty can be quantified exactly rather than through asymptotics. In thisapplication we work with the disaggregated data at the level of individualspecies and sites to present spatially resolved abundance “surfaces” and tocapture uncertainty in model parameters. Doing this turns out to be moredifficult than might be expected, as we reveal in our model developmentsection. The key modeling issues center on careful articulation of the defi-nition of events and associated probabilities, the misalignment between thesampling for abundance (at the relatively small sampling sites) and the avail-able environmental data layers (at a scale of minute by minute grid cells,roughly 1.55 km ×
2. Data description.
The focal area for this abundance study is the CapeFloristic Kingdom or Region (CFR), the smallest of the world’s six flo-ral kingdoms (Figure 1). As noted above, it encompasses a small regionof southwestern South Africa, about 90,000 km , including the Cape ofGood Hope, and is partitioned into 36,907 minute-by-minute grid cells ofequal area. It has long been recognized for high levels of plant species diver-sity and endemism across all spatial scales. The region includes about 9000plant species, 69% of which are found nowhere else [Goldblatt and Manning(2000)]. Globally, this is one of the highest concentrations of endemic plantspecies in the world. It is as diverse as many of the world’s tropical rainforests and apparently has the highest density of globally endangered plantspecies [Rebelo (2002)]. The plant diversity in the CFR is concentrated inrelatively few groups, such as the icon flowering plant family of South Africa,the Proteaceae. We focus on this family because the data on species distri-bution and abundance patterns are sufficiently rich and detailed to allowcomplex modeling. The Proteaceae have also shown a remarkable level ofspeciation, with about 400 species across Africa, of which 330 species are99% restricted to the CFR. Of those 330 species, at least 152 are listed as A. CHAKRABORTY ET AL. “threatened” with extinction by the International Union for the Conserva-tion of Nature. Proteaceae have been unusually well sampled across the re-gion by the Protea Atlas Project of the South African National BiodiversityInstitute [Rebelo (2001)]. Data were collected at record localities : relativelyuniform, geo-referenced areas typically 50 to 100 m in diameter. In additionto the presence (or absence) at the locality of protea species, abundance ofeach species along with selected environmental and species-level informationwere also tallied [Rebelo (1991)]. To date, some 60,000 localities have beenrecorded (including null sites), with a total of about 250,000 species countsfrom among some 375 proteas [Rebelo (2006)].Abundance is given for a sampling locality. Evidently, there is no notionof abundance at a point; however, with roughly 60 ,
000 sites sampled overthe entire CFR, the relative scale of the Protea Atlas observations is smallenough when compared to our areal units to be considered as “points.” Inthe literature, abundance is sometimes measured as percent cover [Mueller-Dombois and Ellenberg (2003)]. In our data set, abundance is recorded asan ordinal categorical classification of the count for the species with fourcategories: category 0: none observed, category 1: 1–10 observed, category2: 11–100 observed, category 3: >
100 observed. Such categorization is fastand efficient for studying many species and many sampling locations butis certainly at risk for measurement error in the form of misclassification.Additionally, a large number of cells were not sampled at all. In fact, only10,158, that is, 28%, were sampled at one or more sites. Even among cellssampled, some have just one or two sites while others have more than 100,reflecting the irregular and opportunistic nature of the sampling rather thanan experimentally designed sampling plan.
Fig. 1.
Location of the Cape Floristic Region (CFR) of South Africa. Inset shows thelocation of the CFR within the African Continent. The , km region was divided into ,
907 1 -minute cells for modeling.
PATIAL MODELING FOR SPECIES ABUNDANCE Turning to the covariates, in Gelfand et al. (2005a, 2005b) 16 explana-tory environmental variables were studied, capturing climate, soil, and topo-graphic features (further detail is provided there). Here, we confine ourselvesto the six most significant variables from that study, which are Evapotran-spiration (APAN.MEAN), July (winter) minimum temperature (MIN07),January (summer) maximum temperature (MAX01), mean annual precip-itation (MEAN.AN.PR), summer soil moisture days (SUMSMD), and soilfertility (FERT1). Transformed areas (by agriculture, reforestation, alienplant infestation, and urbanization) were obtained as a GIS data layer fromR. Cowling (private communication). Figure 2 shows the pattern of trans-formation across the CFR. Approximately 1 /
3. Multi-level latent abundance modeling.
In Section 3.1 we briefly re-view the earlier work on hierarchical modeling for presence/absence data,presented in Gelfand et al. (2005a, 2005b), in order to reveal how we havegeneralized it for the abundance problem. Section 3.2 develops our proposedprobability model for the categorical abundance data. In Section 3.3 discreteprobability distributions are replaced using latent continuous variables. InSection 3.4 we discuss bias issues associated with modeling abundance dataand, in particular, how they affect our setting. Section 3.5 deals with explicitmodel details for the likelihood and posterior.
Fig. 2.
Proportion of untransformed land inside the CFR. Most of the transformation isdue to agriculture, but includes dense stands of alien invasive species.
A. CHAKRABORTY ET AL.
Hierarchical presence/absence modeling.
In Gelfand et al. (2005a,2005b) the authors model at the scale of the grid cells in the CFR and providea block averaged binary process presence/absence model at this scale. Inparticular, let A i ⊂ R denote the geographical region corresponding to the i th grid cell and X ( k ) i the event that a randomly selected location within A i is suitable (1) or unsuitable (0) for species k . Set P ( X ( k ) i = 1) = p ( k ) i . Then, p ( k ) i is conceptualized by letting λ ( k ) ( s ) be a binary process over the regionindicating the suitability (1) or not (0) of location s for species k and taking p ( k ) i to be the block average of this process over unit i . That is, p ( k ) i = 1 | A i | Z A i λ ( k ) ( s ) ds = 1 | A i | Z A i ( λ ( k ) ( s ) = 1) ds, (3.1)where | A i | denotes the area of A i . From Equation (3.1), the interpretationis that the more locations within A i with λ ( k ) ( s ) = 1, the more suitable A i is for species k , that is, the greater the chance of potential presence in A i .The collection of p ( k ) i ’s over the A i is viewed as the potential distribution ofspecies k .Let V ( k ) i denote the event that a randomly selected location in A i is suit-able for species k in the presence of transformation of the landscape. Let T ( s ) be an indicator process indicating whether location s is transformed(1) or not (0). Then, at s , both T ( s ) = 0 (availability) and λ ( k ) ( s ) = 1 (suit-ability) are needed in order that location s is suitable under transformation.Therefore, P ( V ( k ) i = 1) = 1 | A i | Z A i ( T ( s ) = 0) ( λ ( k ) ( s ) = 1) ds. (3.2)If, for each pixel, availability is uncorrelated with suitability, then Equation(3.2) simplifies to P ( V ( k ) i = 1) = u i p ( k ) i , where u i denotes the proportion ofarea in A i which is untransformed, 0 ≤ u i ≤ A i has been visited n i times in untransformed areaswithin the cell. Further, let y ( k ) ij be the observed presence/absence statusof the k th species at the j th sampling location within the i th unit. The y ( k ) ij | V ( k ) i = 1 are modeled as i.i.d. Bernoulli trials with success probability q ( k ) i , that is, for a randomly selected location in A i , q ( k ) i is the probabilityof species k being present given the location is both suitable and available.Of course, given V i ( k ) = 0, y ( k ) ij = 0 with probability 1. Then, we have that P ( y ( k ) ij = 1) = q ( k ) i u i p ( k ) i . Gelfand et al. (2005a, 2005b) model the p ( k ) i and q ( k ) i using logistic regressions. In fact, they use environmental variables andspatial random effects to model the p ( k ) i ’s, the probabilities of potentialpresence, and, to facilitate identifiability of parameters, use species levelattributes to model the q ( k ) i ’s. PATIAL MODELING FOR SPECIES ABUNDANCE Probability model for categorical abundance.
We first define what categorical abundance means at an areal scale using the four ordinal cate-gories from Section 2. Suppressing the species index, let X i denote the clas-sification for a randomly selected location in A i and define p ih = P ( X i = h )for h = 0 , , ,
3. If λ ( s ) is a four-colored process taking values 0 , , ,
3, then p ih = | A i | R A i λ ( s ) = h ) ds . That is, p ih is the proportion of area within A i with color h , equivalently, the proportion in abundance class h . The p ih denote the potential abundance probabilities, that is, in the absence oftransformation.We describe land transformation percentage (1 − u i ) as a block averageof a binary availability process T ( s ) over A i . It can also be interpreted asthe probability that a randomly selected site within A i is transformed. At atransformed location abundance must be 0. Thus, as in Equation (3.2), in thepresence of transformation, we revise p ih to P T ( X i = h ) = | A i | R A i T ( s ) =0)1( λ ( s ) = h ) ds . Under independence of abundance and land transforma-tion, we obtain P T ( X i = h ) = u i p ih . The u i p ih denote the transformed abun-dance probabilities for h = 0. The probability of abundance class 0 undertransformation is evidently 1 − u i + u i p i . Let r ih denote the abundance classprobabilities in the presence of transformation.Finally, suppose there is an observed categorical abundance at location j within A i , say, y ij . There is an associated conceptual λ ij and an observed T ij .Then, λ ij = λ ij T ij if there has been transformation degradation at location j , unless λ ij = 0. Furthermore, if there has been a misclassification error at j , y ij = λ ij T ij unless λ ij = 0. Let q ih denote the abundance class probabilitiesassociated with the observed abundances. In Section 3.3 we specify a latenttrivariate continuous abundance model that produces the p ’s, r ’s, and q ’sby integrating over appropriate intervals. This latent model can be viewedas the process model for our setting.The data set consists of observed abundances across several samplingsites within the CFR. Let D denote our CFR study domain so D is dividedinto I = 36 ,
907 grid cells of equal area. For each cell i = 1 , , , . . . , I , weare given information on p covariates as v i = ( v i , v i , . . . , v ip ). Within A i ,the abundance category of a species was recorded at each of n i samplingsites. For many cells n i >
1. For site j in A i we observe y ij as a multinomialtrial, that is, y ij i . i . d . ∼ mult( { q ih } ) , j = 1 , , . . . , n i . We have a large number ofunsampled cells, that is, n i = 0. In fact, out of 36,907 cells, only m = 10 , y ij ’s in the data set.Hence, the inference problem involves estimation of probabilities over theobserved cells as well as prediction over the unsampled region. Prediction ofa categorical response distribution for unsampled locations in a point levelmodel was discussed in De Oliveira (2000) and Higgs and Hoeting (2010). A. CHAKRABORTY ET AL.
Fig. 3.
Cells within the CFR that have at least one observation from the Protea Atlasdata set are shown in light grey, while cells with no observations are shown in dark grey.
In our areal setup with only areal level v ’s, we address this problem with aMRF model, as described in Section 3.3. Again, we seek to infer about the p ’s, r ’s, and q ’s given the observed y ’s for a subset of cells and v ’s knownfor all cells.3.3. Latent continuous abundances.
It is now common to model the prob-ability mass function of a scalar ordinal categorical variable through anunderlying univariate continuous distribution. In a more general setup, LeLoc’h and Galli (1997) and Armstrong et al. (2003) used latent random vec-tors to define the categorical probabilities in terms of these vectors takingvalues within a specific set. In a similar spirit, corresponding to an observedabundance category variable y ij , we introduce a continuous latent variable z O,ij such that y ij = X h =0 h α h − < z ij < α h ) , where α = ( α − = −∞ , α = 0 , α , α , α = ∞ ) are an increasing sequenceof cut points. For identifiability and without loss of generality, we can set α = 0 and interpret z O,ij < z O,ij > P ( y ij = h ) = q ih = P ( z O,ij ∈ ( α h − , α h )). So q ih will be determined bythe probability model specified for the z O,ij ’s. We will introduce spatialdependence between z O,ij ’s below but, for now, to simplify notation, wedrop the subscript.A simple model would put a Gaussian distribution on these latent z O ’swhose means are linear functions of the associated v ’s. This would provide aroutine categorical regression model but ignores known land transformation PATIAL MODELING FOR SPECIES ABUNDANCE and potential measurement/ecological error in the recorded abundance cat-egories. Instead, we introduce z P,ij to provide the p ij ’s and z T,ij to providethe r ij ’s. We need a joint distribution to relate the z P , z T , and z O . Froma process perspective in terms of the proposed degradation, it seems nat-ural to specify this distribution in the form f ( z P ) f ( z T | z P ) f ( z O | z T ). Since( z P , z T , z O ) capture the sequential degradation of an associated categori-cal abundance distribution, we need to use the same set of α ’s to producemeaningful ( p, r, q ) respectively. Now, we propose (and clarify) the followingdependence structure. Define c ( µ ) = µ − φ ( µ )¯Φ( µ ) , where φ ( · ) and Φ( · ) are thestandard normal p.d.f. and c.d.f. respectively. Note that c ( µ ) = E ( V | V ∼ N ( µ, , V <
0) so c ( µ ) ≤ min(0 , µ ) for all µ ∈ R . Let P ( y = h | z O , u ) = 1( α h − ≤ z O ≤ α h ); 0 ≤ h ≤ ,f ( z O | z T ) = φ ( z O ; z T , z T ≥ + δ z T z T ≤ , (3.3) f ( z T | z P , u ) ∼ uδ z P + (1 − u ) δ c ( z P ) ,f ( z P | v, β, τ ) = φ ( z P ; v T β + θ, . Again, the conditional modeling above is motivated by the degradationperspective. To model the latent z P surface, we use the covariate informa-tion, that is, climate and soil features that are believed to influence theabundance distribution of different species in different ways. We also add aspatial random effect ( θ ) in the mean function to account for spatial associ-ation that may arise from factors, apart from included covariates, that mayhave a spatial pattern. The covariate effects β as well as the spatial randomeffects θ are species-specific. Variances are fixed at 1 for identifiability (seeSection 3.5). Since we are working at areal scale, we assign each cell a single θ with the prior on θ , ,...,I specified using a Gaussian Markov random field(MRF) [Besag (1974)] with first-order adjacency proximities. See Banerjee,Carlin and Gelfand (2004) for details as well as further references.Next, the z P surface is degraded by land transformation. A random lo-cation inside A i is untransformed with probability u i . Then, z T = z P , thatis, a degenerate distribution at z P given z P . If it is transformed, the degra-dation occurs so that the z T corresponds to the zero abundance category.For simplicity (with further discussion below), we make this a degeneratedistribution at c ( z P ) <
0, whence z T | z P , u becomes a two point distributionas above. Again, transformation is equivalent to absence and since α = 0 isthe upper threshold for that classification, we need z T < z T < u = 1 (complete availability) z T and z P are thesame. For any 0 < u <
1, we get E ( z T | z P ) = uz P + (1 − u ) c ( z P ) . Also, since c ( x ) < x , E ( z T | z P ) ≤ z P , which is essential in the sense thattransformation can only degrade abundance [and clarifies our choice for c ( · )]. A. CHAKRABORTY ET AL.
Posterior summaries of z T measure the prevailing abundance under trans-formation within the CFR. [In Appendix A.1 we show that | E ( z T ) | < ∞ .]The two-point mixture distribution also implies the probability of abun-dance class 0, P ( z T < ≥ (1 − u ), that is, no matter how large the poten-tial abundance is within a cell, for any u < z T | z P specification besides a point mass at c ( z P ) include putting a point mass at some arbitrary point c <
0, or using atruncated normal z T | z P on R − . In the first case, it is not ensured whether z T ≤ z P (it depends on whether there are cells with z P < c ), while the secondchoice adds complication for no benefit, is less interpretable, and does notensure z T < z P with probability 1. Also, in Section 4 we show that, in termsof fitting the model, the specification used in Equation (3.3) is the same asusing a truncated normal distribution for land transformation.Next, we modify the { z T } surface to produce { z O } . With regard to mea-surement error, the recorded category of abundance at a particular locationcan be different from the prevailing category due to inaccuracy in field as-sessment of species quantity. However, we assume that when the potentialabundance was zero, one could not record a nonzero abundance category forit [no false positives, see Royle and Link (2006) in this regard]. This putsa directional constraint on the effect of noise. A specification for f ( z T | z O )which is coherent with this restriction has, with z T > z O | z T ∼ N ( z T , MEM ) specifi-cation. For a site with no presence z T <
0, our assumption says there cannotbe any measurement error, thus, in Equation (3.3), for simplicity, we set z O to be the same as z T . Again, other choices of z O | z T can be considered forthe z T < z P surface, as we clarify in Section 4. This sequential dependence structure, z P → z T → z O , implies that if z P < z T and z O . Hence, if a site isnot suitable for a species, at no intermediate stage of the model can the sitehave any positive probability of species occurrence. A change in category be-tween actual and observed arises when the noise pushes z T to the other sideof some cut point to produce z O . And, because of the truncation structure,that shift cannot happen from the left of α = 0 to the right.An alternative way to jointly model ( z O , z T ) could use a bivariate normaldistribution with support truncated to R − [0 , ∞ ) × ( −∞ , f ( z O | z T ) which match our intuition abouthow the degradation took place. Also, from the distributional perspective,the truncated normal redistributes the mass contained inside the left-outregion uniformly across the support, whereas the specification in Equation(3.3) shifts the mass only to ( z O < PATIAL MODELING FOR SPECIES ABUNDANCE The simple dependence structure for z T | z P allows us to marginalize over z T and work with z P and z O | z P as our joint latent distribution. We have f ( z O | z P ) = (cid:26) uφ ( z O ; z P ,
1) + (1 − u ) δ c ( z P ) , z P > uδ z P + (1 − u ) δ c ( z P ) , z P < f ( z O | z P ) ∼ u [ φ ( z O ; z P , z P ≥ + δ z P z P ≤ ] + (1 − u ) δ c ( z P ) . (3.5)Moreover, Equation (3.5) has a nice interpretation in the sense that, first, itindicates whether the land is transformed or not with probability 1 − u . If theland is transformed, it sets observed abundance to be z O = c ( z P ) < z P . Inthe case of available land, if there is a potential presence, it allows for a MEMaround z P ; in the case of absence, it stays fixed at z P . Since z O is relatedto the observed data and z P is our surface of interest, the marginalizationremoves one stage of hierarchy from our model fitting and thus reduces corre-lation, yielding better behaved MCMC in model fitting. Furthermore, we canretrieve the z T surface after the fact since f ( z T | z O , z P ) ∝ f ( z T | z P ) f ( z O | z T ).3.4. Measurement error and bias issues.
In the Introduction we notedthat measurement error and bias typically occur with ecological survey data.It can manifest itself in the form of detection error, spatial coverage bias[Royle et al. (2007)], and under-reporting of absences. How do these biasesarise in our modeling? Noteworthy points here are (i) the difference betweenobtaining abundance as actual counts as opposed to through ordinal classi-fications and (ii) what “no abundance” means across our collection of gridcells.Nondetection bias (i.e., undetected individuals in a sampled location)tends to be discussed more with regard to animal abundance [Ver Hoefand Frost (2003), Royle et al. (2007), Gorresen et al. (2009)]. Using counts,evidently observed abundance is at most true abundance; error can occurin only one direction. With ordinal counts, the bias is still expected to re-flect under-reporting but, depending upon the categorical definitions, willbe much less frequent and need not be absolutely so. For example, in ourdata set, plant population size is visually estimated and an observation, es-pecially of large populations, could potentially have error in either direction.In our modeling, “true” abundance is not “potential” abundance. For us, onecould envision true abundance on the latent scale as a “true” transformedabundance, say, ˜ z T with measurement error yielding z O . Then, one mightinsist that our measurement error model requires z O ≤ ˜ z T . Under our mea-surement error formulation using z T , we even allow z O > z P to account forpotential overestimation of abundance. Evidently, since y O may occasion-ally be less than the potential classification y P at that location, we may be A. CHAKRABORTY ET AL. slightly underestimating potential abundance. We don’t expect this to beconsequential and, in any event, with no knowledge about the incidence of under-classification in our setting, we have no sensible way to correct forthis bias.Turning to spatial coverage bias (i.e., individuals not exposed to samplingwill be missed), for us, with only 28% of grid cells sampled, we certainly aresubject to this. However, the spatial modeling we introduce helps in thisregard. The mean of z P,ij is v Ti β + θ i regardless of whether we collected anydata in A i . So, the regression is expected to find the appropriate level for thecell and the spatial smoothing associated with the θ i is expected to providesuitable local adjustment. We could argue that, if sampling of grid cells israndom, this bias can be ignored.Perhaps the most difficult bias to address is the under-sampling of ab-sences. This bias counters the previous ones; under-sampling of absenceswill tend to produce over-estimates of potential abundance. In our setting,under-sampling of absences is reflected in the decision-making that leadsto only 28% of cells being sampled, that is, it is not a random 28% thathave been sampled. Different from spatial coverage bias, in this context, theecologist expresses confidence that the species is not present in some of theunsampled cells. If this is so and we were to set some additional abundancesto 0, this would assert that these “0”s are not nondetects and would di-minish potential abundance, opposite to the case of nondetects. Of course,in the absence of actual data collection, we would not see any of these 0’sand would adopt model-based inference regarding potential abundance forthese cells. In any event, with no explicit knowledge of how sampling siteswere chosen, we are unable to attempt correction for this bias. Possibly,approaches to address the effects of preferential sampling [Diggle, Menezesand Su (2010)] could be attempted here.3.5. Likelihood and posterior distribution.
The posterior distributions ofinterest, p and r , will be constructed in the post MCMC analysis (discussedin detail in Section 4.3). From the conditional structure we first write P ( y = c | z O , α ) = 1 z O ∈ ( α c − ,α c ) . So the likelihood function for a single sample y turnsout to be L ( y | z O , α ) = Q k =0 z O ∈ ( α k − , α k )) y = k ) . Now f ( z O | z P ) can bewritten as in Equation (3.5).Again, we have I cells with n i sampling sites within A i . For each y ij weintroduce a corresponding z O,ij, and hence a pair of z T,ij , z
P,ij , to representthe event happening at the j th sampling site within A i . We work directlywith the z O | z P structure. Since we are interested in the areal level abundancedistribution and have covariates at areal resolution, we assume for fixed i , z P,ij i . i . d . ∼ N ( · ; v Ti β + θ i , z O,ij ’s are conditionallyindependent given the z P,ij ’s.
PATIAL MODELING FOR SPECIES ABUNDANCE Without loss of generality, re-index cells so that the first m of them aresampled and the last I − m are not. The latter cells have no contribution tothe y column and, hence, no associated z O appears in the likelihood. Using anonspatial model, we would work with a posterior on the domain of sampledcells only. But assuming a CAR prior structure with adjacency proximitymatrix W for the θ over the whole domain enables us to learn about z P forunsampled cells. In summary, the posterior distribution takes the followingform, up to proportionality, with Θ = ( α , β , θ ): π ( z P , z O , Θ | y , v , u ) ∝ m Y i =1 n i Y j =1 L ( y ij | z O,ij , α ) f ( z O,ij | z P,ij ) f ( z P,ij | v i , Θ) × π (Θ) , (3.6) π (Θ) = π ( α ) π ( β ) π ( θ ) ,π ( θ , ,...,I ) = CAR( η , W ) . We turn to the identifiability of the set of parameters under the hierar-chical dependent latent structure. First, with a latent continuous processyielding an ordinal categorical variable, the mean and scale of the distribu-tion can be identified only up to their ratio. In Equation (3.3) the depen-dence across ( z p , z T , z O ) is specified through conditional means. Hence, allGaussian distributions there are specified with standard deviation 1. Fourcategories of abundance allow three free probabilities and the correspondingfour latent surfaces will also have 3 degrees of freedom. As noted above, weset α = 0 with α , α as free parameters.We also need to ensure that all three z surfaces can be distinguished fromeach other. Since transformation percentage 1 − u is given a priori, it isstraightforward to separate z P and z T . We turn to the joint distribution for z P , z O given as z O | z P ∼ N ( z P , , z P ∼ N ( vβ + θ, z P ; it is redundant, there is no way to distinguish between z P and z O , andone can use the marginal z O ∼ N ( vβ + θ, z O surface non-Gaussian though the z P surface is.The greater the measurement error, the more departure from Gaussianityin the marginal distribution of z O . Again, the measurement error cannotbe estimated on any absolute scale, since the latent z scales are fixed foridentifiability. It will be controlled by parameters like β and θ . To comparethe relative effect of measurement error across different species, under fixedscale parameters, P ( z P <
0) is a candidate but other model features can beinformative as well.Finally, the full model specification, described in Equation (3.3), can berepresented through a graphical model, shown in Figure 4. A. CHAKRABORTY ET AL.
Fig. 4.
Graphical model for latent abundance specification at site j within cell A i . z ’sdenote latent abundance processes, observed (O), transformed (T), and potential (P); y ’sdenote interval-censored abundances, observed (O), and potential (P); u is proportion ofland untransformed, v ’s are covariates, β ’s are regression coefficients, α ’s are cut pointsfor z scale, and θ ’s are spatial random effects.
4. Posterior computation and inference.
Here, we describe how to designa computationally efficient MCMC algorithm for the model. We then dis-cuss how to summarize the posterior samples to estimate important modelfeatures.4.1.
Sampling.
Introduction of latent layers, although increasing the pa-rameter dimension in the model, makes the posterior full conditionals stan-dard and easy to sample from. Our goal is to efficiently estimate componentsof Θ which control potential abundance. We rewrite Equation (3.5) as fol-lows: f ( z O | z P ) = uφ ( z O ; z P , z P ≥ + [ uδ z P + (1 − u ) δ c ( z P ) ]1 z P ≤ (4.1) + (1 − u ) δ c ( z P ) z P > , and work with Equation (4.1) to implement the computation for the modelfitting.We start with updating all z O , z P using (Θ ( t ) ; y , v , u ) and then drawingcomponents of Θ from their respective posterior full conditionals based on z P, ( t +1) , z O, ( t +1) . Given the draw from z P , sampling the components of Θis standard as in almost any spatial regression analysis (see Appendix A.3).For the set of θ ’s, after sampling them sequentially, we need to “center themon the fly” [Besag and Kooperberg (1995), Gelfand and Sahu (1999)]. Themore challenging part is to update z O , z P | Θ. In Albert and Chib (1993)the latent variables were sampled in the MCMC from mutually independenttruncated Gaussian full conditionals, with the support determined by thecorresponding classification. For our model, the posterior full conditional forany z O is π ( z O | z P , y, u ) ∝ f ( z O | z P )1( z O ∈ ( α y − , α y )) . We take two different strategies to update z P , z O depending on the ob-served y . For any site with nonzero y we have (with α = 0) f ( z O , z P | y > PATIAL MODELING FOR SPECIES ABUNDANCE , u ) ∝ φ ( z O | z P )1 z O ∈ ( α y − ,α y ) φ ( z P ), which amounts to sampling first a uni-variate normal z O, ( t +1) | ( z P, ( t ) , y, α ) truncated within ( α ( t ) y − , α ( t ) y ) and thenfrom z P, ( t +1) | z O, ( t +1) , Θ ( t ) which is also Gaussian [with location ( z O, ( t +1) + µ ( t ) ) / p / µ ( t ) = v T β ( t ) + θ ( t ) ]. For a site with ob-served y = 0 the case is more complicated, with details provided in Ap-pendix A.2. All of the sampling distributions required in MCMC are listedin Appendix A.3.4.2. Computational efficiency.
The algorithm described above is compu-tationally demanding as we have two latent variables to sample at each sam-pling site and one spatial parameter for each of the grid cells. However, since z P , z O are independent across cells given Θ, we can update them all at once.The problematic part is sampling the spatial effects, with approximately37,000 grid cells. To handle this issue, we used a parallelization methodwhere D is divided into disjoint and exhaustive subregions D , D , . . . , D L along with a resultant set of boundary cells B arising through the CARproximity matrix. Thus, once θ B is updated conditional on the rest, then θ D , θ D , . . . , θ D L given θ B can be updated in parallel.This algorithm is illustrated in Figure 5, where we have a 15 × W which puts weight only onthe cells sharing a common boundary. Sequential updating would have re-quired 120 steps. We constructed a set of 48 boundary cells B (the darkcells in Figure 5). It divides the rectangle into 12 segments of 6 cells each,so that conditional on θ B , those segments can be updated independently ofeach other so we need only 54 updating steps. This is only an illustrativeexample, but for large regions, this may significantly improve the run time.However, the time required for communication and data assimilation is anissue for this parallelization method. With increasing L , although the timerequired for the sequential updating within each D i goes down, the size of B increases as does the amount of communication required within the par-allel architecture. So a trade-off must be determined for choosing L ; in oursetting L = 11 worked well.4.3. Posterior summaries.
There are several ways to summarize infer-ence about the p and r distributions. According to our model, for A i , p ih = Φ( α h − v Ti β − θ i ) − Φ( α h − − v Ti β − θ i ). Posterior samples of β, θ, τ enable us to compute samples of the p i . A posterior sample of r i can beconstructed using the relation r i ≡ (1 − u i + u i p io , u i p i , u i p i , u i p i ). Addi-tionally, we can calculate the mean as well as the uncertainty from thesesamples, enabling maps for transformed abundance ( r ) and potential ( p )abundance. For each of p i and r i , we have 4 submaps, one for each abun-dance category. This is useful in terms of assessing high and low abundance A. CHAKRABORTY ET AL.
Fig. 5.
An example grid to illustrate parallel CAR implementation. Normal sequentialupdating would have required steps in each iteration. By dividing the rectangle into segments of cells each with boundary cells (shown in dark grey), each segment can beupdated independently (conditional on the boundaries). In this example the parallelizationresults in only updating steps. regions for the species. The β ’s provide (up to fixed scale) the effect of a par-ticular climate or soil-related factor on the abundance of a particular species.Comparison of the p and r maps informs about the effect of land transfor-mation. One may also be interested in capturing p or r through a singlesummary feature rather than all 4 categorical probabilities. Grouped meanabundance (expectation with respect to the p or r distribution) can be usedwith suitable categorical midpoints. We note that the posterior inferencecan also be summarized on the latent scale using posterior samples of the z ’s. However, working on the z scale can only provide relative comparison.
5. Data analysis.
We have implemented the described model on abun-dance data for several different plant species over the whole
CFR . Wecentered and scaled all the v ’s before using them in the model. As pri-ors we used π ( α ) ≡ π ( β ) = N (0 , φI ) with large φ = 100. For θ , we used η = 0 . W to be a binary matrix with w ( i, j ) = 1 iff d ( i, j ) < . Table 1
Posterior summaries for covariate effects (mean and % c.i. width) Species Apan.mean Max Min Mean.an.pr Sumsmd Fert PRPUNC 1 . − . − . . . . . . . . . . . − . − . . − . . . . . . . . Fig. 6.
Posterior mean spatial effects ( θ ) for Protea punctata (PRPUNC) and Protearepens (PRREPE). These effects offer local adjustment to potential abundance. Cells withvalues greater than zero represent regions with larger than expected populations, conditionalon the other covariates. The threshold 0 .
30 was used to provide an 8 nearest neighbor structurefor most of the cells. However, for boundary cells, the number of neighborsvaries from 3 to 6. The parallelization algorithm was implemented inside R( ) using l = 11. The run time for an individualspecies was about 9000 iterations / day. The outputs presented below are cre-ated by first running 12500 iterations of MCMC, discarding the initial 7500samples, and thinning the rest at every fifth sample. The β ’s were quick toconverge, but the α sequences were highly autocorrelated and moved moreslowly in the space.Here we consider two species, Protea punctata (PRPUNC) and
Protearepens (PRREPE). A summary of the model output is presented throughthe following table and diagrams. Table 1 provides the mean covariate effectsfor both species along with the 95% equal tail credible interval width (inparentheses). Considering 95% equal tail credible interval, all the covariateeffects are significant except
Fert1 for
P. punctata .The mean posterior spatial effects are shown in Figure 6. Note that thespatial effects for the two species have quite different patterns, with
Protearepens having a region of low values in the northeast and larger valueselsewhere, while Protea punctata is more even across the landscape, butwith lower values toward the edges of the
CFR . These surfaces capture thespatial variability in abundance that is not explained by the other covariateswithin the model. This suggests that the covariates predict higher abundanceof
P. repens in the northwest than what was observed, perhaps indicatingsome unobserved limiting factor (such as unsuitable soils, more extremeseasonality in rainfall, or dispersal limitations). Similarly for
P. punctata ,the covariates may over-predict abundances at the edges of the CFR wheremany environmental factors change as one transitions to other biome types. A. CHAKRABORTY ET AL.
Fig. 7.
Abundance category probability maps for Protea punctata (PRPUNC) for un-transformed (left) and transformed (right) situations. Values are cellwise posterior meanprobabilities for the abundance classes. Class means the probability the species is absent,while classes – indicate estimated abundance from – , – , individuals, re-spectively. Figures 7 and 8 show the mean posterior abundance category probabil-ities (potential and transformed) for
P. punctata (Figure 7) and
P. repens (Figure 8). Comparing these plots among rows contrasts the probabilitiesassociated with each abundance class for the species, while comparing be-tween columns shows the effects of landscape transformations on abundanceclass probabilities. Both species show higher predicted abundances coincid-ing with mountainous areas of the
CFR . This is where the fynbos biomedominates the landscape and where proteas are characteristically the dom-inant, indicator species [Rebelo et al. (2006)]. Note that
P. punctata , a lesscommon species, is only slightly affected by landscape transformation, whilethere are dramatic differences for
P. repens (Figures 9 and 10). This isbecause
P. punctata is mostly limited to dry, rocky, or shale slopes [Re-
PATIAL MODELING FOR SPECIES ABUNDANCE Fig. 8.
Abundance category probability maps for Protea repens (PRREPE) for untrans-formed (left) and transformed (right) situations. Values are cellwise posterior mean prob-abilities for the abundance classes. Class means the probability the species is absent,while classes – indicate estimated abundance from – , – , individuals, re-spectively. belo (2001)] which are less suitable for agriculture or development and thusmostly untransformed. P. repens , on the other hand, is much more ubiqui-tous across the region and can frequently occur in lowland areas that havebeen largely transformed by human activities [Rebelo (2001), Rebelo et al.(2006)].It is also useful to summarize these data through mean potential abun-dance and mean transformed abundance (see Section 4.3) as in Figures 9and 10. These figures allow inspection of the underlying latent surfaces thatare of interest to ecologists as a continuous relative representation of speciesabundances. However, the latent “ z ” scales may be difficult to interpret eco-logically and, thus, estimated potential and transformed abundance (usingthe grouped mean) are also shown. These represent the expected abundance A. CHAKRABORTY ET AL.
Fig. 9.
Mean posterior abundance summaries for Protea punctata (PRPUNC). On thelatent z -scale, “Mean z [ P ] ” refers to the potential abundance, while “Mean z [ T ] ” refersto the potential abundance corrected for habitat transformation. The Grouped Mean Abun-dance rescales the Mean z [ P ] surface to the expected potential size of a population in agrid cell (using the observed abundance classes: absent, – , – , ). The GroupMean Transformed Abundance shows the expected size of a population after correcting forhabitat transformation. (with respect to the p ’s or r ’s) of a species at a randomly selected samplelocation in that grid. The associated display makes it easy to visualize theeffects of habitat transformation on protea populations. P. punctata showsalmost no effects of landscape transformation, while large differences areapparent for
P. repens.
Note the large transformed regions in the south andwest where the expected abundance of plants has dropped from more than50 to near zero. It is also apparent that, across the landscape,
P. punctata tends to have a higher expected mean abundance at any given sample pointthan does
P. repens (Figures 9 and 10).
6. Discussion and future work.
Building on previous efforts that haveaddressed the presence/absence of species, we have presented a modelingframework for learning about potential patterns for species abundance notdegraded by land transformation and potential measurement error. Themodel was built using a hierarchical latent abundance specification, incor-porating spatial structure to capture anticipated association between adja-cent locations. Along with potential pattern, we also have an estimate of
PATIAL MODELING FOR SPECIES ABUNDANCE Fig. 10.
Mean posterior abundance summaries for Protea repens (PRREPE). On thelatent z -scale, “Mean z [ P ] ” refers to the potential abundance, while “Mean z [ T ] ” refersto the potential abundance corrected for habitat transformation. The Grouped Mean Abun-dance rescales the Mean z [ P ] surface to the expected potential size of a population in agrid cell (using the observed abundance classes: absent, – , – , ). The GroupMean Transformed Abundance shows the expected size of a population after correcting forhabitat transformation. transformed abundance pattern. Comparison of these two patterns is help-ful for understanding the effect of land transformation on species presenceand abundance and, in particular, for disentangling these effects from thoseof other environmental factors. This may facilitate designing strategies forspecies conservation as well as understanding the overall effects of climatechange.This work has applications in biogeography and in conservation biology[Pearce and Ferrier (2001), Gaston (2003)]. We can now develop predictivemaps of “high quality” habitat sites within a species range, based on highpredicted abundances. This will help identify prime locations for effectiveconservation efforts. We can also estimate the impact of habitat transfor-mation on the size of the population using the information from Figures 8and 10, and thus identify threats to conservation. Predictive abundancemaps will also be useful to explore patterns in biodiversity and species abun-dances. Do species abundances tend to peak in the middle of the species’range [Gaston (2003)]? Do areas of high biodiversity tend to have lower A. CHAKRABORTY ET AL. species abundances? Are there areas that are rich in both abundance andbiodiversity (perhaps identifying ideal regions for conservation efforts)?There are several natural extensions. One is to study the temporal changein abundance. With abundance data collected over time as well as associatedenvironmental factors such as rainfall and temperature, dynamic modelingof species abundance with changing environmental factors may give a clearerpicture of how a species is responding to climate change. Indeed, when con-nected to future climate scenarios, we may attempt to forecast prospectivespecies abundance. Similarly, if the transformation data is also time varying,we could illuminate the effect of land transformation in greater detail.The current model uses transformation percentage (1 − u ) in a determin-istic way (transformation having a binary effect on potential abundance). Inother cases (e.g., to study abundance pattern of animals) it may be reason-able to treat transformation as another covariate influencing species habitat.Also, it may be imagined that the relationship between potential abundanceand environmental variables is not linear as specified in equation (3.3), forexample, environmental variables may affect larger abundance classes differ-ently from smaller abundance classes; piecewise linear specification, intro-ducing different regression coefficients over the different abundance classes,could be explored.Another possible extension lies in joint modeling of two or more species.One may wish to learn whether two plant varieties are sympatric or al-lopatric and whether or not there is evidence for competitive interactions orfacilitation. Such modeling can be done by extending our model to have mul-tiple ( z P,k , z
T,k , z
O,k ) surfaces, where k is the species indicator. Dependencecan be introduced across z P,k surfaces by modeling θ ( k ) using an MCAR[Gefand and Vounatsou (2003), Jin, Carlin and Banerjee (2005)]. Fittingsuch models will be very challenging if there are many grid cells.Instead of taking an areal level approach, if covariate information is avail-able at point level (where sampling sites are viewed as “points” within thelarge region, D ), one may consider a point-level model. This amounts to re-placing the CAR model with a Gaussian process prior for the spatial effects.With many sampling sites, we will need to use appropriate approximationtechniques [Banerjee et al. (2008)].APPENDIX
A.1. Proof of E ( z T ) finite. E ( z T ) = E ( E ( z T | z P )) = E ( uz P + (1 − u ) × c ( z P )) = E ( z P − (1 − u ) φ ( z P )1 − Φ( z P ) ). Assuming z P ∼ N ( µ, R ∞−∞ φ ( x )1 − Φ( x ) φ ( x − µ ) dx < ∞ . PATIAL MODELING FOR SPECIES ABUNDANCE Consider the quantity x φ ( x )1 − Φ( x ) φ ( x − µ ), if x → −∞ , it goes to 0. When x → ∞ , we havelim x →∞ x φ ( x )1 − Φ( x ) φ ( x − µ ) L ′ ptal = lim x →∞ xφ ( x ) φ ( x − µ ) − x φ ( x ) φ ( x − µ ) − x ( x − µ ) φ ( x ) φ ( x − µ ) − φ ( x )= 0 . So lim | x |→∞ x φ ( x )1 − Φ( x ) φ ( x − µ ) = 0, thus, we can get B < , B >
0, such that φ ( x )1 − Φ( x ) φ ( x − µ ) < x for all x / ∈ ( B , B ). Hence, the result follows. A.2. Posterior simulation of z ’s for a site with no presence observed. We subdivide by considering the ways that we can generate a 0 realizationof y based on Equation (3.5) [one may also use Equation (3.3) to do this]:(i) The area is untransformed, the species was potentially there, but missedduring data collection or it was absent at that time instance; the eventis 1 z P ≥ α ,z O ≤ α with prior probability π = uP ( z P ≥ α , z O ≤ α ).(ii) Potentially the species was absent there; the event is 1 z P ≤ α with priorprobability π = P ( z P ≤ α ).(iii) The species was potentially there 1 z P ≥ α , but the area was transformed;the event has prior probability π = (1 − u ) P ( z P ≥ α ) . These three events are exhaustive and mutually exclusive for the event( y = 0). Thus, f ( z P , z O | y = 0 , Θ) is a 3-component mixture. To draw a( z P , z O ) pair from this distribution amounts to first choosing a componentand then drawing a pair ( z P , z O ) from that component distribution. ByBayes’ rule, conditional on observed ( y = 0), these three cases can happenwith posterior probability π i / P l =1 π l , i = 1 , ,
3. So we use a multinomialto select which of these events took place. Before going into case by casedetails, it is worth mentioning that in all these cases the sampling fromthe joint density of chosen mixture component was done via the marginal f ( z P | · · ) followed by f ( z O | z P , · · ). The advantage of this scheme is that wedon’t need to draw from the latter because z O ’s corresponding to y = 0 arenot involved in posterior full conditionals of any other parameters in themodel (as α = 0, fixed). If the second case is selected, then f ( z O , z P |· , · ) ∝ [ uδ z P + (1 − u ) δ c ( z P ) ]1 z O ≤ z P ≤ φ ( z P ) and thus marginalizing over z O , we get f ( z P | · · ) ∝ φ ( z P )1 z P ≤ which is a truncated Gaussian on R − . Similarly un-der case (iii), we need to simulate z P from φ ( z P )1 z P ≥ , a Gaussian truncatedon R + . In case (i), f ( z O , z P | · · ) ∝ φ ( z O ; z P , z P ≥ z O ≤ φ ( z P ), so marginal-izing over z O we get f ( z P | · · ) ∝ φ ( z P )(1 − Φ( z P ))1 z P ≥ . An efficient way todraw from this density is to propose a z P from a truncated normal on R + A. CHAKRABORTY ET AL. and do a Metropolis–Hastings update with an independent proposal, usingthe quantity (1 − Φ( · )). However, all sampling distributions are summarizedin Appendix A.3 below. A.3. Posterior full conditionals needed for Gibbs sampling. • If y ij >
0, draw z O,ij ∼ N ( z P,ij , ( α yij − ,α yij ) . Draw z P,ij ∼ N ( v Ti β + θ i + z O,ij , ). • If y ij = 0, compute p ij = ( u Φ ([0 , ∞ ] × [ −∞ , µ ij , Σ ) , − Φ( v Ti β + θ ) , (1 − u )Φ( v Ti β + θ i )), where µ ij = ( v Ti β + θ i , v Ti β + θ i ) and Σ = (cid:0)
11 12 (cid:1) arethe location and dispersion parameters for bivariate normal joint priordistribution of ( z O,ij , z
P,ij ). Draw d ij i . i . d . ∼ mult( p ij ). If d ij = 1, propose z propose P,ij ∼ N ( v Ti β + θ i , (0 , ∞ ) and do a Metropolis–Hastings sampler us-ing (1 − Φ( · )). Else if d ij = 2, draw z P,ij ∼ N ( v Ti β + θ i , ( −∞ , , else draw z P,ij ∼ N ( v Ti β + θ i , (0 , ∞ ) . • Draw α h = unif(max ij : y ij = h z O,ij , min ij : y ij = h +1 z O,ij ), h = 1 , • Draw β ∼ N ( µ β , Σ β ) Q i,j N ( z P,ij ; v i , β, θ i ). • Draw θ i ∼ N ( z P,ij ; v i , β, θ i ) N ( P j w ij θ j w i + , τ w i + ) for i = 1 , , . . . , m . Draw θ i ∼ N ( P j w ij θ j w i + , τ w i + ) for i = m + 1 , , . . . , I . Acknowledgments.
The authors thank Guy Midgley and Anthony Re-belo for useful discussions. REFERENCES
Albert, J. H. and
Chib, S. (1993). Bayesian analysis of binary and polychotomousresponse data.
J. Amer. Statist. Assoc. Armstrong, M., Galli, A. G., Le Loc’h, G., Geffroy, F. and
Eschard R. (2003).
PluriGaussian Simulations in Geosciences.
Springer, Berlin.
Banerjee, S., Carlin, B. P. and
Gelfand, A. E. (2004).
Hierarchical Modeling andAnalysis for Spatial Data.
Chapman & Hall/CRC, Boca Raton, FL.
Banerjee, S., Gelfand, A. E., Finley, A. O. and
Sang, H. (2008). Gaussian predictiveprocess models for large spatial datasets.
J. Roy. Statist. Soc. Ser. B Beale, C. M., Lennon, J. J., Elston, D. A., Brewer, M. J. and
Yearsley, J.M. (2007). Red herrings remain in geographical ecology: A reply to Hawkins et al.
Ecography Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (withdiscussion).
J. Roy. Statist. Soc. Ser. B Besag, J. and
Kooperberg, C. (1995). On conditional and intrinsic autoregressions.
Biometrika Busby, J. R. (1991). BIOCLIM: A bioclimatic analysis and predictive system. In
NatureConservation: Cost Effective Biological Surveys and Data Analysis (C. R. Margules andM. P. Austin, eds.) 64–68. CSIRO, Canberra, Australia.PATIAL MODELING FOR SPECIES ABUNDANCE Conroy, M. J., Runge, J. P., Barker, R. J., Schofield, M. R. and
Fonnesbeck,C. J. (2008). Efficient estimation of abundance for patchily distributed populations viatwo-phase, adaptive sampling.
Ecology Cressie, N., Calder, C. A., Clark, J. S., Ver Hoef, J. M. and
Wikle, C. K. (2009). Accounting for uncertainty in ecological analysis: The strengths and limitationsof hierarchical statistical modeling.
Ecological Applications De Oliveira, V. (2000). Bayesian prediction of clipped Gaussian random fields.
Comput.Statist. Data Anal. Diggle, P. J., Menezes, R. and
Su, T.-L. (2010). Geostatistical analysis under prefer-ential sampling (with discussion).
J. Roy. Statist. Soc. Ser. C Elith, J., Graham, C. H., Anderson, R. P., Dud´ık, M., Ferrier, S., Guisan, A., Hi-jmans, R. J., Huettmann, F., Leathwick, J. R., Lehmann, A., Li, J., Lohmann,L. G., Loiselle, B. A., Manion, G., Moritz, C., Nakamura, M., Nakazawa,Y., Overton, J. McC., Peterson, A. T., Phillips, S. J., Richardson, K. S.,Scachetti-Pereira, R., Schapire, R. E., Sober´on, J., Williams, S., Wisz, M.S. and
Zimmermann, N. E. (2006). Novel methods improve prediction of species’ dis-tributions from occurrence data.
Ecography Fitzpatrick, M. C., Gove, A. D., Nathan, J., Sanders, N. J. and
Dunn, R. R. (2008). Climate change, plant migration, and range collapse in a global biodiversityhotspot: The Banksia (Proteaceae) of Western Australia.
Global Change Biology Fuller, W. A. (1987).
Measurement Error Models . Wiley, New York. MR0898653
Gaston, K. (2003). The structure and dynamics of geographic ranges, 1st ed. OxfordUniv. Press, Oxford.
Gelfand, A. E. and
Sahu, S. K. (1999). Identifiability, improper priors, and Gibbssampling for generalized linear models.
J. Amer. Statist. Assoc. Gelfand, A. E. and
Vounatsou, P. (2003). Proper multivariate conditional autoregres-sive models for spatial data analysis.
Biostatistics Gelfand, A. E., Silander, J. A., Jr., Wu, S., Latimer, A. M., Lewis, P., Rebelo,A. G. and
Holder, M. (2005a). Explaining species distribution patterns through hi-erarchical modeling.
Bayesian Anal. Gelfand, A. E., Schmidt, A. M., Wu, S., Silander, J. A., Jr., Latimer, A. M. and
Rebelo, A. G. (2005b). Modelling species diversity through species level hierarchicalmodeling.
J. Roy. Statist. Soc. Ser. C Goldblatt, P. and
Manning, J. (2000).
Cape Plants: A Conspectus of the Cape Floraof South Africa . National Botanical Institute of South Africa, Cape Town.
Gorresen, P. M., McMillan, G. P., Camp, R. J. and
Pratt, T. K. (2009). A spatialmodel of bird abundance as adjusted for detection probability.
Ecography Graham, C. H. and
Hijmans, R. J. (2006). A comparison of methods for mapping speciesranges and species richness.
Global Ecology and Biogeography Guisan, A. and
Thuiller, W. (2005). Predicting species distribution: Offering more thansimple habitat models.
Ecology Letters Guisan, A. and
Zimmerman, N. E. (2000). Predictive habitat distribution models inecology.
Ecological Modelling
Guisan, A., Lehman, A., Ferrier, S., Austin, M. P., Overton, J. M. C., Aspinall,R. and
Hastie, T. (2006). Making better biogeographical predictions of species’ dis-tributions.
Journal of Applied Ecology Higgs, M. D. and
Hoeting, J. A. (2010). A clipped latent-variable model for spatiallycorrelated ordered categorical data.
Comput. Statist. Data Anal. A. CHAKRABORTY ET AL.
Hooten, M. B., Larsen, D. R. and
Wikle, C. K. (2003). Predicting the spatial distri-bution of ground flora on large domains using a hierarchical Bayesian model.
LandscapeEcology Ib´a˜nez, I., Silander, J. A., Jr., Allen, J. M., Treanor, S. and
Wilson, A. (2009).Identifying hotspots for plant invasions and forecasting focal points of further spread.
Journal of Applied Ecology Jin, X., Carlin, B. P. and
Banerjee, S. (2005). Generalized hierarchical multivariateCAR models for areal data.
Biometrics Kunin, W. E., Hartley, S. and
Lennon, J. (2000). Scaling down: On the challenges ofestimating abundance from occurrence patterns.
The American Naturalist
Latimer, A. M., Wu, S., Gelfand, A. E. and
Silander, J. A., Jr. (2006). Buildingstatistical models to analyze species distributions.
Ecological Applications Le Loc’h, G. and
Galli, A. (1997). Truncated plurigaussian method: Theoretical andpractical points of view. In
Geostatistics Wollongong’96 (E. Y. Baafi and N. A.Schofield, eds.) Loarie, S. R., Carter, B. E., Hayhoe, K., McMahon, S., Moe, R., Knight, C. A. and
Ackerly, D. D. (2008). Climate change and the future of California’s endemicflora.
PLoS ONE e2502. Mallick, B. and
Gelfand, A. E. (1995). Bayesian analysis of semiparametric propor-tional hazards models.
Biometrics Midgley, G. F. and
Thuiller, W. (2007). Potential vulnerability of Namaqualand plantdiversity to anthropogenic climate change.
Journal of Arid Environments Mueller-Dombois, D. and
Ellenberg, H. (2003).
Aims and Methods of VegetationEcology . Blackburn Press, Caldwell, NJ.
Pearce, J. and
Ferrier, S. (2001). The practical value of modelling relative abundanceof species for regional conservation planning: A case study.
Biological Conservation Phillips, S. J. and
Dud´ık, M. (2008). Modeling of species distributions with Maxent:New extensions and a comprehensive evaluation.
Ecography Potts, J. M. and
Elith, J. (2006). Comparing species abundance models.
EcologicalModelling
Pressey, R. L., Cabeza, M., Watts, M. E., Cowling, R. M. and
Wilson, K. A. (2007). Conservation planning in a changing world.
Trends in Ecology and Evolution Raxworthy, C. J., Martinez-Meyer, E., Horning, N., Nussbaum, R. A., Schnei-der, G. E., Ortega-Huerta, M. A. and
Peterson, A. T. (2003). Predicting distri-butions of known and unknown reptile species in madagascar.
Nature
Rebelo, A. G. (1991).
Protea Atlas Manual: Instruction Booklet to the Protea AtlasProject . Protea Atlas Project, Cape Town.
Rebelo, A. G. (2001).
Proteas: A Field Guide to the Proteas of Southern Africa , 2nd ed.Fernwood Press, Vlaeberg, South Africa.
Rebelo, A. G. (2002). The state of plants in the Cape Flora. In
Proceedings of a Con-ference Held at the Rosebank Hotel in Johannesburg (G. H. Verdoorn and J. Le Roux,eds.) . The State of South Africa’s Species, Endangered Wildlife Trust. Rebelo, A. G. (2006). Protea atlas project website. Available at http://protea.worldonline.co.za/default.htm . Rebelo, A. G., Boucher, C., Helme, N., Mucina, L. and
Rutherford, M. C. (2006).Fynbos biome. In
The Vegetation of South Africa, Lesotho and Swaziland. Streltzia (L.Micina and M. C. Rutherford, eds.) . South African National Biodiversity Institute,Pretoria, South Africa.PATIAL MODELING FOR SPECIES ABUNDANCE Rouget, M., Richardson, D. M., Cowling, R. M., Lloyd, J. W. and
Lombard,A. T. (2003). Current patterns of habitat transformation and future threats to biodi-versity in terrestrial ecosystems of the Cape Floristic Region, South Africa.
BiologicalConservation
Royle, J. A. and
Link, W. A. (2006). Generalized site occupancy models allowing forfalse positive and false negative errors.
Ecology Royle, J. A., K´ery, M., Gautier, R. and
Schmidt, H. (2007). Hierarchical spatialmodels of abundance and occurrence from imperfect survey data.
Ecological Monographs Schwartz, M. W., Iverson, L. R., Prasad, A. M., Matthews, S. N. and
O’Connor,R. J. (2006). Predicting extinctions as a result of climate change.
Ecology Stefanski, L. A. and
Carroll, R. J. (1987). Conditional scores and optimal scores forgeneralized linear measurement-error models.
Biometrika Sutherland, W. J. (2006).
Ecological Census Techniques , 2nd ed. Cambridge Univ.Press, Cambridge.
Ver Hoef, J. M., Cressie, N., Fisher, R. N. and
Case, T. J. (2001). Uncertaintyand spatial linear models for ecological data. In
Spatial Uncertainty in Ecology (C. T.Hunsaker, M. F. Goodchild, M. A. Friedl and T. J. Case, eds.) 214–237. Springer, NewYork.
Ver Hoef, J. M. and
Frost, K. (2003). A Bayesian hierarchical model for monitoringharbor seal changes in Prince William Sound, Alaska.
Environ. Ecol. Stat. Wisz, M. S., Hijmans, R. J., Li, J., Peterson, A. T., Graham, C. H. and
Guisan,A. (2008). Effects of sample size on the performance of species distribution models.
Diversity and Distributions A. ChakrabortyA. E. GelfandDepartment of Statistical ScienceDuke UniversityDurham, North Carolina 27708USAE-mail: [email protected]@stat.duke.edu
A. M. WilsonJ. A. Silander, Jr.Department of Ecology andEvolutionary BiologyUniversity of ConnecticutStorrs, Connecticut 06269USAE-mail: [email protected]@uconn.edu