Poisson point process models solve the "pseudo-absence problem" for presence-only data in ecology
aa r X i v : . [ s t a t . A P ] N ov The Annals of Applied Statistics (cid:13)
Institute of Mathematical Statistics, 2010
POISSON POINT PROCESS MODELS SOLVETHE “PSEUDO-ABSENCE PROBLEM” FORPRESENCE-ONLY DATA IN ECOLOGY By David I. Warton and Leah C. Shepherd
University of New South Wales
Presence-only data, point locations where a species has beenrecorded as being present, are often used in modeling the distribu-tion of a species as a function of a set of explanatory variables—whether to map species occurrence, to understand its associationwith the environment, or to predict its response to environmentalchange. Currently, ecologists most commonly analyze presence-onlydata by adding randomly chosen “pseudo-absences” to the data suchthat it can be analyzed using logistic regression, an approach whichhas weaknesses in model specification, in interpretation, and in imple-mentation. To address these issues, we propose Poisson point processmodeling of the intensity of presences. We also derive a link betweenthe proposed approach and logistic regression—specifically, we showthat as the number of pseudo-absences increases (in a regular or uni-form random arrangement), logistic regression slope parameters andtheir standard errors converge to those of the corresponding Poissonpoint process model. We discuss the practical implications of theseresults. In particular, point process modeling offers a framework forchoice of the number and location of pseudo-absences, both of whichare currently chosen by ad hoc and sometimes ineffective methods inecology, a point which we illustrate by example.
1. Background.
Pearce and Boyce (2006) define presence-only data as“consisting only of observations of the organism but with no reliable data onwhere the species was not found. Sources for these data include atlases, mu-seum and herbarium records, species lists, incidental observation databasesand radio-tracking studies.” Note that such data arise as point locations where the organism is observed, which we denote as y in this article. An Received May 2009; revised December 2009. Supported by the Australian Research Council, Linkage Project LP0774833.
Key words and phrases.
Habitat modeling, quadrature points, occurrence data, pseudo-absences, species distribution modeling.
This is an electronic reprint of the original article published by theInstitute of Mathematical Statistics in
The Annals of Applied Statistics ,2010, Vol. 4, No. 3, 1383–1402. This reprint differs from the original in paginationand typographic detail. 1
D. I. WARTON AND L. C. SHEPHERD
Fig. 1. (a)
Example presence-only data—atlas records of where the tree species
An-gophora costata has been reported to be present, west of Sydney, Australia. The studyregion is shaded. (b)
A map of minimum temperature ( ◦ C) over the study region. Vari-ables such as this are used to model how intensity of A. costata presence relates to theenvironment. (c)
A species distribution model, modeling the association between
A. costata and a suite of environmental variables. This is the fitted intensity function for
A. costata records per km , modeled as a quadratic function of four environmental variables using apoint process model as in Section 4. example is given in Figure 1(a). This figure gives all locations where a par-ticular tree species ( Angophora costata ) has been reported by park rangerssince 1972, within 100 km of the Greater Blue Mountains World HeritageArea, near Sydney, Australia. Note that this does not consist of all loca-tions where an
Angophora costata tree is found—rather it is the locationswhere the species has been reported to be found. We would like to use thesepresence points, together with maps of explanatory variables describing theenvironment (often referred to in ecology as “environmental variables”), topredict the location of
A. costata and how it varies as a function of explana-tory variables (Figure 1).Presence-only data are used extensively in ecology to model species distribu-tions—while the term “presence-only data” was rarely used before the 1990s,ISI Web of Science reports that it was used in 343 publications from 2005to 2008. The use of presence-only data in modeling is a relatively recentdevelopment, presumably aided by the movement toward electronic recordkeeping and recent advances in Geographic Information Systems. One rea-son for the current widespread usage of presence-only data is that often thisis the best available information concerning the distribution of a species, asthere is often little or no information on species distribution being availablefrom systematic surveys [Elith and Leathwick (2007)].Species distribution models, sometimes referred to as habitat models orhabitat classification models [Zarnetske, Edwards and Moisen (2007)], are
OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING regression models for the likelihood that a species is present at a given loca-tion, as a function of explanatory variables that are available over the wholestudy region. Such models are used to construct maps predicting the full spa-tial distribution of a species [given GIS maps of explanatory variables suchas in Figure 1(b)]. When surveys have recorded the presence and absenceof a species in a pre-defined study area (“presence/absence data”), logis-tic regression approaches and modern generalizations [Elith, Leathwick andHastie (2008)] are typically used for species distribution modeling. If insteadpresence-only data are to be used in species distribution modeling, then acommon approach to analysis is to first create “pseudo-absences,” denotedas y , usually achieved by randomly choosing point locations in the region ofinterest and treating them as absences. Then the presence/pseudo-absencedata set is analyzed using standard analysis methods for presence/absencedata [Pearce and Boyce (2006); Elith and Leathwick (2007)], which havebeen used in species distribution modeling for a long time [Austin (1985)].Ward et al. (2009) recently proposed a modification of the pseudo-absencelogistic regression approach for the analysis of presence-only data, when theprobability π that a randomly chosen pseudo-absence point of a presence isknown. However, π is not known in practice.We see three key weaknesses of the “pseudo-absence” approach so widelyused in ecology for analyzing presence-only data, which we describe conciselyas problems of model specification, interpretation, and implementation. Asounder model specification would involve constructing a model for the ob-served data y only , rather than requiring us to generate new data y priorto constructing a model. Interpretation of results is difficult, because somemodel parameters of interest (such as p i of Section 3) are a function of thenumber of pseudo-absences and their location. For example, we explain inSection 3 that as the number of pseudo-absences approaches infinity, p i → y . Implementation of the approach is prob-lematic because it is unclear how pseudo-absences should be chosen [Elithand Leathwick (2007); Guisan et al. (2007); Zarnetske, Edwards and Moisen(2007); Phillips et al. (2009)], and one can obtain qualitatively different re-sults depending on the method of choice of pseudo-absences [Chefaoui andLobo (2008)].In this paper we make two key contributions. First, we propose pointprocess models (Section 2) as an appropriate tool for species distributionmodeling of presence-only data, given that presence-only data arise as aset of point events—a set of locations where a species has been reportedto have been seen. A point process model specification addresses each ofthe three concerns raised above regarding pseudo-absence approaches. Oursecond key contribution is a proof that the pseudo-absence logistic regressionapproach, when applied with an increasing number of regularly spaced orrandomly chosen pseudo-absences, yields estimates of slope parameters that
D. I. WARTON AND L. C. SHEPHERD converge to the point process slope estimates (Section 3). These two keyresults have important ramifications for species distribution modeling inecology (Section 5), in particular, we provide a solution to the problem ofhow to select pseudo-absences. We illustrate our results for the
A. costata data of Figure 1(a) (Section 4).
2. Poisson point process models for presence-only data.
Presence-onlydata are a set y = { y , . . . , y n } of point locations in a two-dimensional region A , where the locations where presences are recorded (the y i ) are out of thecontrol of the researcher, as is the total number of presence points n . We alsoobserve a “map” of values over the entire region A for each of k explanatoryvariables, and we denote the values of these variables at y i as ( x i , . . . , x ik ).We propose analyzing y = { y , . . . , y n } as a point process, hence, we jointlymodel number of presence points n and their location ( y i ). This has notpreviously been proposed for the analysis of presence-only data, despitethe extensive literature on the analysis of presence-only data. We considerinhomogeneous Poisson point process models [Cressie (1993); Diggle (2003)],which make the following two assumptions:1. The locations of the n point events ( y , . . . , y n ) are independent.2. The intensity at point y i [ λ ( y i ), denoted as λ i for convenience], the lim-iting expected number of presences per unit area [Cressie (1993)], canbe modeled as a function of the k explanatory variables. We assume alog-linear specification: log( λ i ) = β + k X j =1 x ij β j , (2.1)although note that the linearity assumption can be relaxed in the usualway (e.g., using quadratic terms or splines). The parameters of the modelfor the λ i are stored in the vector β = ( β , β , . . . , β k ).Note that the process being modeled here is locations where an organism hasbeen reported rather than locations where individuals of the organism occur.Hence, the independence assumption would only be violated by interactionsbetween records of sightings rather than by interactions between individ-ual organisms per se. The atlas data of Figure 1 consist of 721 A. costata records accumulated over a period of 35 years in a region of 86,000 km , soindependence of records seems a reasonable assumption in this case, giventhe rarity of event reporting. Nevertheless, the methods we review here canbe generalized to handle dependence between point events [Baddeley andTurner (2005)]. OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING Cressie (1993) shows that the log-likelihood for y can be written as l ( β ; y ) = n X i =1 log( λ i ) − Z y ∈A λ ( y ) dy − log( n !) . (2.2)Berman and Turner (1992) showed that if the integral is estimated via nu-merical quadrature as R y ∈A λ ( y ) dy ≈ P mi =1 w i λ i , then the log-likelihood is(approximately) proportional to a weighted Poisson likelihood: l ppm ( β ; y , y , w ) = m X i =1 w i ( z i log( λ i ) − λ i ) , (2.3)where z i = I ( i ∈{ ,...,n } ) w i , y = { y n +1 , . . . , y m } are quadrature points, the vec-tor w = ( w , . . . , w m ) stores all quadrature weights, and I ( · ) is the indicatorfunction. Being able to write l ( β ; y ) as a weighted Poisson likelihood hasimportant practical significance because it implies that generalized linearmodeling (GLM) techniques can be used for estimation and inference about β . Further, adaptations of GLM techniques to other settings, such as gen-eralized additive models [Hastie and Tibshirani (1990)], can then be readilyapplied to Poisson point process models also.Before implementing this approach, however, we need to make two keydecisions—how to choose quadrature points y = { y n +1 , . . . , y m } and how tocalculate the quadrature weight w i at each point y i .We propose choosing quadrature points in a regular rectangular grid, andconsidering grids of increasing spatial resolution until the estimate of themaximized log-likelihood l ppm ( ˆ β ; y , y , w ) has converged. A rectangular gridprovides reasonably efficient coverage of the region A , and is an arrangementfor which environmental data x i , . . . , x ik can be easily obtained via GISsoftware. We illustrate this method in Section 4. Note a large data set maybe required—in Section 4 convergence was achieved at a spatial scale thatrequired inclusion of approximately 86,000 quadrature points.Quadrature weights are calculated as the area of the neighborhood A i around each point y i , according to some definition of the A i such that y i ∈ A i for each i , A i ∩ A i ′ = ∅ for each i = i ′ , and S i A i = A . In Section 4 wecalculated quadrature weights using the tiling method implemented in the R package spatstat [Baddeley and Turner (2005)]. This crude approachbreaks the region A into rectangular tiles and calculates the weight of apoint as the inverse of the number of points per unit area in its tile. Wefixed tile size at the size of the regular grid used to sample quadraturepoints, such that all tiles contained exactly one quadrature point. Dirichlettessellation [Baddeley and Turner (2005)] offers an alternative method ofestimating weights, but this was not practical for our sample sizes. D. I. WARTON AND L. C. SHEPHERD
3. Asymptotic equivalence of pseudo-absence logistic regression and Pois-son point process models.
Ecologists typically analyze presence-only datapoints y = { y , . . . , y n } by generating a set of “pseudo-absence” points y = { y n +1 , . . . , y m } , then using logistic regression to model the “response vari-able” I ( i ∈ { , . . . , n } ) as a function of explanatory variables [Pearce andBoyce (2006)]. Note that I ( i ∈ { , . . . , n } ) is not actually a stochastic quan-tity, nevertheless, the use of logistic regression to model this quantity as aBernoulli response variable can be motivated via a case-control argumentalong the lines of Diggle (2003), Section 9.3.In this section we will show that the approach to analysis currently usedin ecology, logistic regression using pseudo-absences, is closely related tothe Poisson point process model introduced in Section 2. Specifically, if thepseudo-absences are either generated on a regular grid or completely atrandom over the region A , then as the number of pseudo-absences increase,all parameter estimators except for the intercept in the logistic regressionmodel converge to the maximum likelihood estimators of the Poisson processmodel of Section 2. This asymptotic relationship between logistic regressionand Poisson point process models does not appear to have been recognizedpreviously in the literature.First, we will specify a probability model for I ( i ∈ { , . . . , n } ) that permitsa logistic regression model, and the study of its properties as m → ∞ . Thiscan be achieved by considering a point chosen at random from { y , . . . , y m } and defining U as the event that the randomly chosen point y i is a presence.We are interested in modeling U conditionally on the explanatory variablesobserved at the randomly chosen point, x i = ( x i , . . . , x ik ). In this setting, U is a Bernoulli variable with conditional mean p i , and we assume thatlog (cid:18) p i − p i (cid:19) = γ − log( m − n ) + k X j =1 x ij γ j . (3.1)The intercept term is written as γ − log( m − n ) because p i − p i = f ( x i | U = 1) f ( x i | U = 0) · P ( U = 1) P ( U = 0) = f ( x i | U = 1) f ( x i | U = 0) nm − n , (3.2)where f ( · ) and f ( · ) are the densities of x i conditional on U = 1 and U =0 respectively. Provided that f ( x i | U = 0) is not a function of m (whichis ensured, e.g., by using an identical process to select all pseudo-absencepoints), then the odds of a presence point p i − p i is a function of m only viathe multiplier ( m − n ) − .It can be seen from equation (3.2) that if m → ∞ in such a way that f ( x i | U = 0) is not a function of i , then p i → m − , and the intercept term in the logistic regression model OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING approaches −∞ at the rate log( m ). This in turn means that the logisticregression log-likelihood, defined below, will also diverge as m → ∞ : l bin ( γ ; y , y ) = n X i =1 log( p i ) + m X i = n +1 log(1 − p i ) . (3.3)Clearly, as p i →
0, log( p i ) → −∞ and, hence, l bin ( γ ; y , y ) → −∞ . Such di-vergence is a symptom that the original model has been incorrectly specified.The use of the more appropriate spatial point process model of Section 2 willnot encounter such problems. However, it is shown in the following theoremsthat despite the problems inherent in the logistic regression model specifica-tion, and despite divergence of the intercept term, the remaining parametersconverge to the corresponding parameters from the Poisson process modelof equation (2.3). Further, pseudo-absences play the same role in the logisticregression model that quadrature points played in Section 2.For notational convenience, we will define J m to be the single-entry matrixwhose first element is log m : J m = (log m, , . . . , . (3.4)This definition will be used in each of the theorems that follow. The J m notation is immediately useful in writing out the parameters of the modelfor p i in equation (3.1) as γ − J m − n where γ = ( γ , γ , . . . , γ k ). Theorem 3.1.
Consider a fixed set of n observations from a pointprocess y = { y , . . . , y n } , and a set of pseudo-absences y = { y n +1 , . . . , y m } of variable size that is chosen via some identical process on A for i ∈{ n + 1 , . . . , m } . We model U , whether or not a randomly chosen point isa presence point, via logistic regression as in equation (3.1).As m → ∞ , the logistic regression log-likelihood of equation (3.3) ap-proaches the Poisson point process log-likelihood [equation (2.3)] but withall quadrature weights set to one: l bin ( γ ; y , y ) = l ppm ( γ − J m ; y , y , ) + O ( m − ) , where is a m -vector of ones, and J m is defined in equation (3.4). The proofs to Theorem 3.1 and all other theorems are given in Ap-pendix A.Theorem 3.1 has two interesting practical implications.First, it implies that the pseudo-absence points of presence-only logisticregression play the same role as quadrature points of a point process model,and so established guidelines on how to choose quadrature points (such asthose of Section 2) can inform choice of pseudo-absences. Previously pseudo-absences have been generated according to ad hoc recommendations [Pearce
D. I. WARTON AND L. C. SHEPHERD
Fig. 2.
Asymptotic behavior of Poisson point process and pseudo-absence logistic regres-sion models when the number of quadrature points becomes large (via sampling in a regulargrid with increasing spatial resolution). (a)
The maximized log-likelihood converges for aPoisson point process, but not for pseudo-absence logistic regression. (b)
The parametersand their standard errors converge for Poisson point process and logistic regression mod-els, for sufficiently high spatial resolution. Linear coefficient of “minimum temperature”is given here (corresponding to the second entry in Table 2). and Boyce (2006); Zarnetske, Edwards and Moisen (2007)], given the lackof a theoretical framework for their selection. In contrast, quadrature pointsare generated in order to estimate the log-likelihood to a pre-determinedlevel of accuracy, a criterion which guides the choice of locations and num-bers of quadrature points m − n , as explained in Section 2 and as illustratedlater in Section 4 [Figure 2(a)]. Interestingly, current methods of selectingpseudo-absences in ecology [Pearce and Boyce (2006); Zarnetske, Edwardsand Moisen (2007)] do not appear to be consistent with the best practice inlow-dimensional numerical quadrature—points are usually selected at ran-dom rather than on a regular grid, and the number of pseudo-absences( m − n ) is more commonly chosen relative to the magnitude of the num-ber of presences ( n ) rather than based on some convergence criterion as inFigure 2(a).Second, Theorem 3.1 implies that despite the apparent ad hoc nature ofthe pseudo-absence approach, some form of point process model is being es-timated . However, logistic regression is only equivalent to a Poisson pointprocess when w = , that is, all quadrature weights are ignored. The impli-cations of ignoring weights is considered in Theorems 3.2 and 3.3 below.It should also be noted that Theorem 3.1 is closely related to results dueto Owen (2007) and Ward (2007), although Theorem 3.1 differs from theseresults by relating pseudo-absence logistic regression specifically to pointprocess modeling. OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING Owen (2007) also considered the logistic regression setting where the num-ber of presence points is fixed, and the number of pseudo-absences increasesto infinity, and referred to this as “infinitely unbalanced logistic regression.”Owen (2007) derived conditions under which convergence of model parame-ters could be achieved as the number of pseudo-absences increased. The keycondition is that the centroid of the points { y , . . . , y n } in the design spaceis “surrounded”—see Definition 3 of Owen (2007) for details.Along the lines of Ward et al. (2009), Ward (2007) considered a pseudo-absence logistic regression formulation of the presence-only data problem,and defined the “population logistic model” as the model across “the fullpopulation” of locations in the region A . The unconstrained log-likelihoodof the population logistic model [Ward (2007), equation (7.6)] has a similarform to the point process log-likelihood l ( β ; y ) of Section 2. For presence-only logistic regression as in Ward et al. (2009), Ward (2007) shows that asthe number of pseudo-absences approaches infinity, the log-likelihood con-verges to that of the population logistic model, a result that is analogous toTheorem 3.1.Having shown that pseudo-absence logistic regression is equivalent to apoint process model where weights are ignored, the implications of ignoringweights is now considered in Theorems 3.2 and 3.3. Theorem 3.2.
Consider a point process model with quadrature points y n +1 , . . . , y m selected such that for all i w i = |A| m , where |A| is the total areaof the region A . Assume also that the design matrix X has full rank.The maximum likelihood estimators of l ppm ( γ − J m ; y , y , ) and l ppm ( β ; y , y , |A| m ) , ˆ γ − J m and ˆ β respectively, satisfy ˆ γ = ˆ β + J |A| . Further, the Fisher information for ˆ γ and ˆ β is equal.That is, provided that quadrature points have been selected such that quadra-ture weights are equal, ignoring quadrature weights in a Poisson point processmodel does not change slope parameters nor their standard errors, althoughthe intercept term will differ by log( |A| /m ) . Theorem 3.2 refers to the special case where all points (including presencepoints) are sampled on a regular grid. This arises in the special case wherethe region A has been divided into grid cells of equal area, and each gridcell is assigned the value 1 only if it contains a presence point. This formof presence-only analysis is sometimes used in ecology [Phillips, Andersonand Schapire (2006), e.g.]. In addition, the setting of Theorem 3.2 providesa reasonable approximation to the approach to quadrature-point selectionproposed in Section 2. D. I. WARTON AND L. C. SHEPHERD
Note that together Theorems 3.1–3.2 suggest that when quadrature points(or, equivalently, pseudo-absences) are sampled in a regular grid at increas-ing resolution, the logistic regression parameter estimates and their standarderrors will approach those of the point process model—with the exceptionof the intercept term, which diverges slowly to −∞ as all p i → m . This nonconvergence of the intercept was alsonoticed by Owen (2007). Figure 2 illustrates these results for the A. costata data.Theorem 3.3 below links the above results with the case where pseudo-absences are randomly sampled within the region A , which is a more com-mon approach in ecology than sampling on a regular grid [e.g., Elith andLeathwick (2007); Hernandez et al. (2008)]. Theorem 3.3.
Consider again the conditions of Theorem 3.2, but nowassume that the quadrature points y are selected uniformly at random withinthe region A . As previously, ˆ γ is the maximum likelihood estimator of l ppm ( γ − J m ; y , y , ) , but now let ˆ β be the maximum likelihood estimator of l ( β ; y ) from equation (2.2). As m → ∞ , ˆ γ P → ˆ β + J |A| . That is, if quadrature points are randomly selected instead of being sampledon a regular grid, the result of Theorem 3.2 holds in probability rather thanexactly.
Note that the stochastic convergence in Theorem 3.3 is with respect to m not n , that is, it is conditional on the observed point process.Note also that one can think of randomly selecting pseudo-absence pointsas an implementation of “crude” Monte Carlo integration [Lepage (1978)]for estimating R y ∈A λ ( y ) dy in the point process likelihood [equation (2.2)].
4. Modeling Angophora costata species distribution.
As an illustration,we construct Poisson point process models for the intensity of
Angophoracostata records as a function of a set of explanatory variables. We considermodeling the log of intensity using linear and quadratic functions of the vari-ables minimum and maximum temperature, mean annual rainfall, numberof fires since 1943, and “wetness,” a coefficient which can be considered as anindicator of local moisture. These five variables were recommended by localexperts as likely to be important in determining
A. costata distribution.Our full model for intensity of
A. costata records at the point y i has thefollowing form: log( λ i ) = β + x Ti β + x Ti Bx i , (4.1) OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING Table 1
AIC values for linear and quadratic Poisson point process modelsfor log( λ ) of Angophora costata presence. Model fitted at the 1 kmby 1 km resolution
Model AIC
Linear terms only 5363.6Quadratic (additive terms only) 4763.4Quadratic (interactions included) 4400.6 where β is a vector of linear coefficients, B is a matrix of quadratic coeffi-cients, and x i is a vector containing measurements of the five environmentalvariables at point y i . We consider a quadratic model for log( λ i ) becausethis enables fitting a nonlinear function and interaction between differentenvironmental variables, both considered important in species distributionmodeling [Elith, Leathwick and Hastie (2008)].All analyses were carried out using purpose-written code on the R program[R Development Core Team (2009)].We first considered the spatial resolution at which quadrature pointsneeded to be sampled in order for the log-likelihood l ( β ; y ) to be suitablywell approximated by l ppm ( β ; y , y , w ). We found [as in Figure 2(a)] that onincreasing the number of quadrature points, the estimate of the maximizedlog-likelihood converged, and that there was minimal change in the solutionbeyond a resolution of one quadrature point every 1 km (the maximized log-likelihood changed by less than one when the number of quadrature pointswas increased 4-fold). Hence, the 1 km resolution was used in model-fitting,and these results are reported here. This involved a total of 86,227 quadra-ture points.In order to study which environmental variables are associated with A.costata and how they are associated, we performed model selection wherewe considered different forms of models for log-intensity as a function ofenvironmental variables, and we considered different subsets of the envi-ronmental variables via all-subsets selection. In both cases we used AIC asour model selection criterion, a simple and widely-used penalty-based modelselection criterion [Burnham and Anderson (1998)].Comparison of AIC values for linear and quadratic models suggest that amuch better fit is achieved when using a quadratic model with interactionsterms for all coefficients (Table 1). Hence, we have evidence that environmen-tal variables interact in their effect on
A. costata . Judging from the modelcoefficients and their relative size compared to standard errors, the interac-tions between maximum temperature, minimum temperature, and annualrainfall appear to be the major contributors. D. I. WARTON AND L. C. SHEPHERD
All-subsets selection considered a total of 32 models, and found that thebest-fitting model included four variables (Figure 3)—all except for “wet-ness.” Parameter estimates and standard errors for this best-fitting modelare given in Table 2(a).An image of the fitted intensity surface from the best-fitting model ispresented in Figure 1(c). The regions of highest predicted intensity are nearthe coast and just north of Sydney, which are indeed where the highest den-sity of presence points appeared in Figure 1(a). We also compared intensitysurfaces fitted at different spatial resolutions, and note that they appearidentical when quadrature points are selected in a 500 ×
500 m, 1 × × A. costata records per square kilometer, as inFigure 1(c). Note this is in contrast to logistic regression, where fitted prob-abilities in any given location are a function of number of pseudo-absences,and vary by a factor of 16 when moving from a 2 × ×
500 mgrid.To assist in interpreting parameters from the best-fitting model, we haveconstructed image plots of the fitted intensity in “environmental space” toelucidate the nature of the effect of each environmental variable on intensityof
A. costata records (Figure 4). It can be seen in Figure 4 that there isa strong and negatively correlated response to maximum temperature andannual rainfall. Of the four environmental variables, the response to numberof fires appears to be the weakest, with little apparent change in predictedintensity as number of fires increased, and no observable interaction withthe three climatic variables.
Fig. 3.
Results of all-subsets selection, expressed as AIC of the best-fitting model at eachlevel of complexity. The respective best-fitting models included minimum temperature, thenminimum and maximum temperature, then annual rainfall was added, then fire count, andfinally wetness. The best-fitting model included four explanatory variables (all variablesexcept wetness).
OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING Table 2
Parameter estimates and their standard errors for (a) the Poisson point process modelwith a × km regular grid of quadrature points; (b) The logistic regression model witha × km regular grid of pseudo-absences; (c) The logistic model with 1000 randomlychosen pseudo-absence points. In each case we fitted a quadratic model of minimumtemperature (MNT), maximum temperature (MXT), annual rainfall (RA), and fire count(FC). Notice that with few exceptions, terms are equivalent to 2–3 significant figures forthe models fitted over a regular grid. But this is not the case for (c) , and, in particular,standard errors are all 30–80% larger (a) (b) (c)Term ˆ β j se ( ˆ β j ) ˆ β j se ( ˆ β j ) ˆ β j se ( ˆ β j ) Intercept − . − − − . . − . . − .
91 4 . − .
21 0 . − .
205 0 . − .
185 0 . . . . . . ∗ MXT 0 .
539 0 .
090 0 .
535 0 .
091 0 .
377 0 . − .
98 0 . − .
97 0 . − .
84 0 . .
759 0 .
065 0 .
755 0 .
066 0 .
714 0 . ∗ RA 0 . . . . . . ∗ RA − . . − . . − . . /1000 − . . − . . − . . .
24 3 .
37 5 .
98 3 .
42 4 .
08 4 . ∗ FC − .
101 0 . − .
101 0 . − .
207 0 . ∗ FC − .
123 0 . − .
115 0 . − . . ∗ FC − . . − . . − . . − .
127 0 . − .
123 0 . − .
107 0 . For the purpose of comparison, parameter estimates and their standarderrors are reported not just for the point-process model fit, but also for theanalogous logistic regression model in Table 2(b), for a model fitted withquadrature points sampled in a 1 × m − n = 1000pseudo-absences at randomly selected locations [Table 2(c)]. This is at thelower end of the range typically used [Pearce and Boyce (2006); Elith andLeathwick (2007); Hernandez et al. (2008)] for pseudo-absence selection inecology. Note that the standard errors are substantially larger in this case,and no parameter estimates are correct past the first significant figure. Thisresult exemplifies how current practice in ecology regarding the number ofpseudo-absences ( m − n ) can lead to poor results. Instead, it is advisableto consider the sensitivity of results to different choices of m − n , along thelines of Figure 2. D. I. WARTON AND L. C. SHEPHERD
Fig. 4.
Image plots of the joint effects of environmental variables on predicted log (in-tensity) of
A. costata records. Darker areas of the image correspond to higher predictedvalues of log( λ i ) . Note the strong and highly correlated response of intensity to maximumtemperature and annual rainfall, and the relatively weak response to To explore the goodness of fit of the best-fitting model, an inhomogeneous K -function [Baddeley, Moller and Waagepetersen (2000)] was plotted usingthe kinhom function from the spatstat package on R [Baddeley and Turner(2005)], and simulation envelopes around the fitted model were constructed.See Diggle (2003) for details concerning the use of K -functions to exploregoodness of fit of point process models. The inhomogeneous K -function, asits name suggests, is a generalization of the K -function to the nonstationarycase. It is defined over the region A as K inhom ( r ) = 1 |A| E (cid:26) X y i ∈ y X y j ∈ y \{ y i } I ( k y i − y j k < r ) λ i λ j (cid:27) . OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING Fig. 5.
Goodness of fit plot of the quadratic Poisson process model—inhomogeneous K -function (solid line) with simulation envelope (broken lines). The envelope gives 95%confidence bands as estimated from 500 simulated data sets. Note that the K -function fallswithin those bounds over most of its range, although with a possible departure for r < km. K inhom reduces to the usual K function for a stationary process, and like theusual K -function, can be used to diagnose whether there are interactions inthe point pattern y [Baddeley, Moller and Waagepetersen (2000)]. Results(Figure 5) suggest a reasonable fit of this model to the data, although withsome lack of fit at small spatial scales ( r < spatstat package. We have repeated analyses using such a model andfound results to be generally consistent with those presented here, except ofcourse that the equivalence with logistic regression no longer holds.
5. Discussion.
In this paper we have proposed the use of Poisson pointprocess models for the analysis of presence-only data in ecology, an impor-tant and widely-studied problem to which this methodology is well suited.We have also shown that this method is approximately equivalent to logisticregression, when a suitable number of regularly or randomly spaced pseudo-absences are used, hence, we provide a link between the proposed methodand the approach most commonly used in ecology at the moment. But thisraises the question: why use point process models, if the method currentlybeing used is (asymptotically) equivalent to logistic regression anyway? Sev-eral reasons are listed below.Recall that in Section 1 we argued that the pseudo-absence approachhas problems with model specification, interpretation, and implementation. D. I. WARTON AND L. C. SHEPHERD
We argue that each of these difficulties is resolved by using a point processmodeling framework.
Model specification —we believe that a point processmodel as in Section 2 is a plausible model for the data generation mech-anism for presence-only data. In contrast, the logistic regression approachinvolves generating new data in order to fit a model originally designed fora different problem (analysis of binary data not analysis of point-events).Hence, the pseudo-absence approach as it is usually applied appears to in-volve coercing the data to fit the model rather than choosing a model thatfits the original data.
Interpretation —in the logistic regression approach wemodel p i , the probability that a given point event is a presence not a pseudo-absence. This quantity has no physical meaning and clearly its interpretationis sensitive to our method of choice of pseudo-absences (and typically each p i → m → ∞ ). In contrast, the intensity at a point λ i has a naturalinterpretation as the (limiting) expected number of presences per unit area,and will not be sensitive to choice of quadrature points, provided that thenumber of quadrature points is sufficiently large. Implementation —in Sec-tion 2 we explain that point process models offer a framework for choosingquadrature points. Specifically, equation (2.3) is used to estimate the pointprocess log-likelihood, and progressively more quadrature points are addeduntil convergence of l ppm ( ˆ β ; y , y , w ) is achieved as in Figure 2(a). No suchframework for choice of pseudo-absences is offered by logistic regression, andinstead choice of the location and number of pseudo-absences is ad hoc, withpotentially poor results [Table 2(c)]. Ecologists are concerned about the is-sues of how many pseudo-absences to choose [Pearce and Boyce (2006)],where to put them [Elith and Leathwick (2007); Zarnetske, Edwards andMoisen (2007); Phillips et al. (2009)], and what spatial resolution to use inmodel-fitting [Guisan et al. (2007); Elith and Leathwick (2009)], all issuesthat have natural solutions given a point process model specification of theproblem, as in Section 2.It should be emphasized that we have demonstrated equivalence of pointprocess modeling and pseudo-absence logistic regression only for large num-bers of pseudo-absences and only for pseudo-absences that are either regu-larly spaced or located uniformly at random over A . Current practice con-cerning selection of pseudo-absences in the ecology literature does not alwaysinvolve sampling at random over A [e.g., Hernandez et al. (2008)] and doesnot involve sampling sufficiently many pseudo-absences for model conver-gence. Instead, choice of the number of pseudo-absences is ad hoc, and atotal of 1000–10,000 pseudo-absences is usually used [Elith and Leathwick(2007); Hernandez et al. (2008)], although sometimes even fewer [Zarnetske,Edwards and Moisen (2007)]. On Figure 2, 1000–10,000 corresponds to a res-olution of about 4–8 km, for which model convergence has not been achieved.When fitting a model using just 1000 pseudo absences, some parameter es-timates are not equivalent to the high-resolution fits to even one significantfigure, and all standard error estimates were larger by 30–80% (Table 2). OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING While only Poisson point processes were considered in this paper, themethodology implemented in Section 4 can be generalized to incorporate in-teractions between points in a straightforward fashion [Baddeley and Turner(2005)]. However, the links between point process models and logistic regres-sion identified in Section 3 may be lost in this more general setting.One issue not touched on in this paper is the problem of observer bias—that the likelihood of a species being reported is a function of additionalvariables related to properties of the observer and not of the target species,such as variation in the level of accessibility of different parts of the region A . For example, the high number of A. costata records just north of Sydneymay be due in part to proximity to a large city, rather than simply being dueto environmental conditions being suitable for
A. costata . This issue will beaddressed in a related article.APPENDIX A: PROOF OF THEOREMS
A.1. Proof of Theorem 3.1.
The proof involves two steps. The first stepinvolves showing that l bin ( · ), as a function of p i , is asymptotically equivalentto l ppm ( · ) when written as a function of λ i . The second step involves showingthat given the definitions of p i and λ i in equations (2.1) and (3.1), we canreplace one with the other without affecting the order of approximation.Specifically, the log-likelihood function for U can be written as l bin ( γ ; y , y ) = n X i =1 log( p i ) + m X i = n +1 log(1 − p i )and a Taylor expansion of log(1 − p i ) yields= n X i =1 log( p i ) + m X i = n +1 { p i + O ( p i ) } , but it can be seen from equation (3.2) that p i = O ( m − ) and, hence, P ni =1 p i = O ( m − ) for fixed n , so l bin ( γ ; y , y ) = n X i =1 log( p i ) + m X i = n +1 p i + O ( m − )= n X i =1 log( p i ) + m X i =1 p i + O ( m − )(A.1) = m X i =1 { I ( i ∈ { , . . . , n } ) log( p i ) − p i } + O ( m − ) . Note that equation (A.1) has the form of the Poisson point process log-likelihood, but with all weights set to one and p i being used in place of D. I. WARTON AND L. C. SHEPHERD λ i . We will now derive a relation between p i and λ i which motivates thereplacement of p i by λ i .First note that the Taylor expansion of log(1 − x ) implies both that log(1 − p i ) = O ( m − ), and that log( m − n ) = log( m ) + log(1 − n/m ) = log( m ) + O ( m − ). So from equation (3.1),log p i = γ − log( m − n ) + k X j =1 x ij γ j − log(1 − p i )= γ − log( m ) + k X j =1 x ij γ j + O ( m − ) . This has the form of equation (2.1), where β = γ − J m . So when β = γ − J m ,log p i = log λ i + O ( m − ), and P mi =1 p i = P mi =1 λ i { O ( m − } = P mi =1 λ i + O ( m − ). Now plugging these results into equation (A.1) yields l ppm ( γ − J m ; y , y , ) + O ( m − ), completing the proof. (cid:3) A.2. Proof of Theorem 3.2.
The proof follows by inspection of the scoreequations. Specifically, let s j ( β ; w ) = ∂∂β j l ppm ( β ; y , y , w ). From equation (2.3),for j ∈ { , . . . , k } , s j ( β ; w ) = m X i =1 x ij λ i w i (cid:18) z i − λ i λ i (cid:19) = m X i =1 x ij w i ( z i − λ i ) , (A.2)where z i = I ( i ∈ ,...,n ) w i . If j = 0, equation (A.2) holds but with x ij = 1 for each i . Now ˆ β satisfies s j ( ˆ β ; |A| m ) = 0 for each j , that is,0 = m X i =1 x ij (cid:18) I ( i ∈ , . . . , n ) − ˆ λ i |A| m (cid:19) , (A.3)where from equation (2.1), log(ˆ λ i ) = ˆ β + P kj =1 x ij ˆ β .ˆ γ satisfies s j (ˆ γ − J m ; ) = 0, for each j ,0 = m X i =1 x ij ( I ( i ∈ , . . . , n ) − ˜ λ i ) , (A.4)where ˜ λ i is the maximum likelihood estimator of λ i for l ppm ( γ − J m ; y , y , ),which satisfies log(˜ λ i ) = ˆ γ − log m + P kj =1 x ij ˆ γ .The solutions to equations (A.3) and (A.4) are related by the identity˜ λ i = ˆ λ i |A| m for each i , and if we take the logarithm of both sides,ˆ γ − log m + k X j =1 x ij ˆ γ j = ˆ β + k X j =1 x ij ˆ β j + log |A| − log m. OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING Provided that the design matrix X has full rank, ˆ γ = ˆ β + J |A| .Also, note that the ( j, j ′ )th element of the Fisher information matrix of l ppm ( β ; y , y , w ) is I jj ′ ( β ; w ) = − E (cid:18) ∂ ∂β j β j ′ l ppm ( β ; y , y , w ) (cid:19) = m X i =1 x ij w i x ij ′ λ i (A.5)and so I jj ′ ( ˆ β ; |A| m ) = P mi =1 x ij x ij ′ |A| m ˆ λ i = P mi =1 x ij x ij ′ ˜ λ i = I jj ′ (ˆ γ − J m ; ) foreach ( j, j ′ ). This completes the proof. (cid:3) A.3. Proof of Theorem 3.3.
Let δ = ˆ γ − ˆ β − J |A| . We will prove the the-orem by using a Taylor expansion of the score equations for l ppm ( β ; y , y , )to show that for fixed n and m → ∞ , δ P → S ( β ; ) be the vector of score equations whose j th element is s j ( β ; ) = ∂∂β j l ppm ( β ; y , y , ) and let I ( β ; ) be the corresponding Fisher informationmatrix. A Taylor expansion of S (ˆ γ − J m ; ) about S ( ˆ β + J |A| /m ; ) yields S (ˆ γ − J m ; ) = S ( ˆ β + J |A| /m ; ) − I ( ˆ β + J |A| /m ; ) δ + O p ( k δ k ) . (A.6)The left-hand side is zero, because it is evaluated at the maximizer of l ppm ( β ; y , y , ). Also, evaluating λ i at ˆ β + J |A| /m gives ˆ λ i |A| m , and substi-tuting this into equation (A.2) at w = , s j ( ˆ β + J |A| /m ; ) = n X i =1 x ij − m X i =1 x ij ˆ λ i |A| m P → Z y ∈A x j ( y )ˆ λ ( y ) dy from the weak law of large numbers. But this is the derivative of l ( β ; y )from equation (2.2), evaluated at the maximum likelihood estimate ˆ β , andso it equals zero for each j and, hence, S ( ˆ β + J |A| /m ; ) P → . Similarly, foreach ( j, j ′ ), I jj ′ ( ˆ β + J |A| /m ; ) P → I jj ′ ( ˆ β ), the ( j, j ′ )th element of the Fisherinformation matrix for ˆ β from l ( β ; y ). So returning to equation (A.6), δ = I ( ˆ β + J |A| /m ; ) − S ( ˆ β + J |A| /m ; ) + O p ( k δ k ) P → , completing the proof. (cid:3) Acknowledgments.
Thanks to the NSW Department of Environmentand Climate Change for making the data available, and to Dan Rampand Evan Webster at UNSW for assistance processing the data and ad-vice. Thanks to David Nott, William Dunsmuir and Simon Barry for helpfuldiscussions. D. I. WARTON AND L. C. SHEPHERD
REFERENCES
Austin, M. P. (1985). Continuum concept, ordination methods and niche theory.
AnnualReview of Ecology, Evolution, and Systematics Baddeley, A. and
Turner, R. (2005). Spatstat: An R package for analyzing spatialpoint patterns.
Journal of Statistical Software Baddeley, A. J. and van Lieshout, M. (1995). Area-interaction point processes.
Ann.Inst. Statist. Math. Baddeley, A. J. , Moller, J. and
Waagepetersen, R. (2000). Non- and semiparametricestimation of interaction in inhomogeneous point patterns.
Statist. Neerlandica Berman, M. and
Turner, T. (1992). Approximating point process likelihoods withGLIM.
J. Roy. Statist. Soc. Ser. C Burnham, K. P. and
Anderson, D. R. (1998).
Model Selection and Inference—A Prac-tical Information-Theoretic Approach . Springer, New York. MR1919620
Chefaoui, R. M. and
Lobo, J. M. (2008). Assessing the effects of pseudo-absences onpredictive distribution model performance.
Ecological Modelling
Cressie, N. A. C. (1993).
Statistics for Spatial Data . Wiley, New York. MR1239641
Diggle, P. J. (2003).
Statistical Analysis of Spatial Point Patterns , 2nd ed. Arnold,London. MR0743593
Elith, J. and
Leathwick, J. (2007). Predicting species distributions from museum andherbarium records using multiresponse models fitted with multivariate adaptive regres-sion splines.
Diversity and Distributions Elith, J. and
Leathwick, J. (2009). Species distribution models: Ecological explana-tion and prediction across space and time.
Annual Review of Ecology, Evolution, andSystematics Elith, J. , Leathwick, J. R. and
Hastie, T. (2008). A working guide to boosted regres-sion trees.
Journal of Animal Ecology Guisan, A. , Graham, C. H. , Elith, J. and
Huettmann, F. (2007). Sensitivity of pre-dictive species distribution models to change in grain size.
Diversity and Distributions Hastie, T. and
Tibshirani, R. (1990).
Generalized Additive Models . Chapman & Hall,Boca Raton, FL. MR1082147
Hernandez, P. A. , Franke, I. , Herzog, S. K. , Pacheco, V. , Paniagua, L. , Quin-tana, H. L. , Soto, A. , Swenson, J. J. , Tovar, C. , Valqui, T. H. , Vargas, J. and
Young, B. E. (2008). Predicting species distributions in poorly-studied landscapes.
Biodiversity and Conservation Lepage, G. (1978). A new algorithm for adaptive multidimensional integration.
J. Com-put. Phys. Owen, A. B. (2007). Infinitely imbalanced logistic regression.
J. Mach. Learn. Res. Pearce, J. L. and
Boyce, M. S. (2006). Modelling distribution and abundance withpresence-only data.
Journal of Applied Ecology Phillips, S. J. , Anderson, R. P. and
Schapire, R. E. (2006). Maximum entropy mod-eling of species geographic distributions.
Ecological Modelling
Phillips, S. J. , Dud´ık, M. , Elith, J. , Graham, C. H. , Lehmann, A. , Leathwick, J. and
Ferrier, S. (2009). Sample selection bias and presence-only distribution models:Implications for background and pseudo-absence data.
Ecological Applications R Development Core Team (2009).
R: A Language and Environment for StatisticalComputing . R Foundation for Statistical Computing, Vienna, Austria.OINT PROCESS MODELS FOR SPECIES DISTRIBUTION MODELING Ward, G. (2007). Statistics in ecological modelling; presence-only data andboosted MARS. PhD thesis, Dept. Statistics, Stanford Univ. Available at . Ward, G. , Hastie, T. , Barry, S. , Elith, J. and
Leathwick, J. R. (2009). Presence-only data and the EM algorithm.
Biometrics Zarnetske, P. L. , Edwards, T. C., Jr. and
Moisen, G. G. (2007). Habitat classifica-tion modeling with incomplete data: Pushing the habitat envelope.
Ecological Applica-tions School of Mathematics and Statisticsand Evolution & Ecology Research CentreUniversity of New South WalesSydney, NSW 2052AustraliaE-mail: