A note on Bayesian logistic regression for spatial exponential family Gibbs point processes
Tuomas A. Rajala
Department of Mathematical Sciences
Chalmers University of Technology and University of Gothenburg
S-41296 Gothenburg, Sweden
[email protected]
October 4, 2018
Abstract
Recently, a very attractive logistic regression inference method for exponential family Gibbs spatial point processes was introduced. We combine it with the technique of quadratic tangential variational approximation and derive a new Bayesian technique for analysing spatial point patterns. The technique is described in detail and demonstrated on numerical examples.
Keywords: Exponential family model; Logistic regression; Bayesian inference
Gibbs point processes are a popular class of models for interacting events taking place at spatial locations. Parametric inference for these inherently non-independent event models is costly due to analytically intractable normalizing constants in the likelihood. However, with the recent development of approximate techniques such as the logistic regression technique described by Baddeley et al. (2014), the act of modeling is becoming free from the constraints of computational limitations.

In this paper we describe a Bayesian version of the logistic regression technique of Baddeley et al. (2014). Bayesian methodology is not generally frowned upon by point pattern statisticians, so its slow growth in popularity amongst scientists is more likely a result of the need to employ MCMC algorithms (see overviews in e.g. Gelfand et al., 2010; Møller and Waagepetersen, 2007). MCMC for point process models requires strong expertise in designing the algorithms even for the simplest models, and usually additional, costly simulation loops are needed to evaluate the chains' transition probabilities. The proposed method requires no simulations and provides posterior distributions in a fraction of the running time of an MCMC algorithm.

This text is based on the setup described in detail by Baddeley et al. (2014), and we often refer to that paper to minimize repetition. Section 2 describes the details of the Bayesian version, and Section 3 depicts some examples of how the Bayesianity can be exploited for a flexible analysis of point pattern data.
We assume that we observe a stationary point process in some finite window W ⊂ R^d. We denote the volume of the window by V := |W|, and write ω ⊂ W for the observed, finite configuration of point locations ω_i ∈ W. The cardinality is denoted by n = n(ω) := |ω ∩ W|. We write u ∈ W for an arbitrary point in the window. The notation is slightly changed from that of Baddeley et al. (2014) to simplify the exposition of the model fitting algorithm later on.

The exponential family Gibbs models are defined through a density with respect to a unit rate Poisson process in R^d. In the so-called canonical parametrisation, a model of the exponential Gibbs model family has a parametric density of the form

    \tilde f(\omega \mid \theta) = \alpha(\theta)^{-1} H(\omega) \exp[\theta^T t(\omega)]    (1)

where θ ∈ R^p is a vector of parameters, t(ω) = [t_1(ω) ... t_p(ω)]^T ∈ R^p is a vector of canonical statistics, the function H(ω) is a baseline or reference density (mainly to allow a hard-core effect), and α(θ) is the normalizing constant.

The normalizing constant α(θ) is usually not tractable, hindering direct likelihood based inference. Approximate likelihood inference for Gibbs models is usually based on the local dynamics of the process, described by the Papangelou conditional intensity, defined at any location u ∈ W as

    \lambda(u; \omega) := \frac{\tilde f(\omega \cup u)}{\tilde f(\omega \setminus u)} = H(u; \omega) \exp[\theta^T t(u; \omega)]    (2)

with the convention ω \ u = ω if u ∉ ω. Here H(u; ω) := 1{H(ω \ u) > 0} H(ω ∪ u)/H(ω \ u) and t(u; ω) := t(ω ∪ u) − t(ω \ u) are the contributions of the point u to the density.

The logistic likelihood approximation described by Baddeley et al. (2014) replaces the true likelihood (1) with an unbiased estimating equation formally equal to that of standard logistic regression.
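As a minimal illustration of the conditional intensity (2): once the local statistics t(u; ω) and the baseline contribution H(u; ω) are available, λ(u; ω) is a cheap pointwise computation. The sketch below is in Python (the paper's own software is R); the function name and arguments are ours, not from the paper.

```python
import math

def conditional_intensity(theta, t_u, H_u=1.0):
    """Papangelou conditional intensity lambda(u; omega) = H(u; omega) * exp(theta^T t(u; omega)).

    theta : list of canonical parameters
    t_u   : list, the local canonical statistics t(u; omega)
    H_u   : baseline contribution H(u; omega); 1.0 when there is no hard core
    """
    return H_u * math.exp(sum(a * b for a, b in zip(theta, t_u)))

# Hypothetical repulsive model: one neighbour within range halves the local
# intensity, so with baseline intensity 100 and interaction weight 0.5 we get 50.
lam_u = conditional_intensity(theta=[math.log(100.0), math.log(0.5)],
                              t_u=[1.0, 1.0])
```

Evaluating λ(u; ω) over a grid of locations u is the basic building block of all the likelihood approximations that follow.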
The method is based on an auxiliary point configuration: First, we simulate a set of dummy locations d = {u_j ∈ W : j = 1, ..., m} from a known process with a known intensity function ϱ(u) > 0. Then we concatenate the data and the dummy points to a configuration ω̄ = ω ∪ d, keeping the order, and define a vector of data indicators y_i := 1{u_i ∈ ω} for each u_i ∈ ω̄, i = 1, ..., n + m.

The likelihood (1) is then well approximated by a logistic regression density of the form

    f(\omega \mid \theta) = \prod_{i=1}^{n+m} p_i^{y_i} (1 - p_i)^{1 - y_i}    (3)

where

    p_i = \frac{\exp[\theta^T t(u_i; \omega) + o_i]}{1 + \exp[\theta^T t(u_i; \omega) + o_i]}

and the o_i := log[H(u_i; ω)/ϱ(u_i)] are offset terms. Spatial trend components with possible location and covariate dependencies can be included as usual for logistic regression.

The power of the approach is that the original, computationally hard point process inference problem is approximated by a much simpler, standard exponential family form which can be solved in the generalized linear model (GLM) framework with standard software. In what follows, a variational Bayes method for computing the solution is described.

The variational technique for logistic regression used here was first described in 1996 and further developed by Jaakkola and Jordan (2000), motivated by the requirements of fast learning of graphical models. It is based on a tight tangential bound for the function log(1 + e^x), and with Gaussian priors for θ the technique leads to tractable Gaussian posteriors. The technique is part of a large group of variational techniques developed for Bayesian data analysis, often grouped under the name Variational Bayes (VB) techniques (see e.g. Ormerod and Wand (2010) for an overview).

Write N := n + m. To conform to the usual GLM notation, we will now write X = [x_1 ... x_N]^T for the data matrix with the canonical statistics forming the rows, x_i = t(u_i; ω). Using vector notation, the logarithm of (3) becomes

    \log f(y \mid \theta) = \sum_{i=1}^{N} \left\{ y_i (x_i^T \theta + o_i) - \log[1 + \exp(x_i^T \theta + o_i)] \right\}
                          = y^T (X\theta + o) - 1_N^T \log[1 + \exp(X\theta + o)]    (4)

with 1_N = [1 ... 1]^T denoting the N-length vector of 1's and o = [o_1 ... o_N]^T. The formulation of Ormerod and Wand (2010) has here been extended to include the offset terms.

The logarithm terms in (4) make it hard to derive posteriors for θ, as the normalizing integral is not tractable. The method of Jaakkola and Jordan (2000) works around this by approximating the logarithm terms with tangential lower bounds of quadratic form. The bound is based on the inequality

    -\log[1 + \exp(x_i^T \theta + o_i)] \ge \lambda(\xi_i)(x_i^T \theta + o_i)^2 - \tfrac{1}{2}(x_i^T \theta + o_i) + \gamma(\xi_i)  \quad \forall \xi_i > 0

where λ(ξ_i) = −tanh(ξ_i/2)/(4ξ_i) and γ(ξ_i) = ξ_i/2 − log(1 + e^{ξ_i}) + (ξ_i/4) tanh(ξ_i/2). The ξ_i's are called the variational parameters, and they determine the goodness of the approximation. Equality holds when ξ_i² = (x_i^T θ + o_i)². In vector form, applying the bound to each of the logarithm terms in (4) leads to the quadratic form

    -1_N^T \log[1 + \exp(X\theta + o)] \ge (X\theta + o)^T \Lambda(\xi)(X\theta + o) - \tfrac{1}{2} 1_N^T (X\theta + o) + 1_N^T \gamma(\xi)
        = \theta^T X^T \Lambda(\xi) X \theta + 2 o^T \Lambda(\xi) X \theta + o^T \Lambda(\xi) o - \tfrac{1}{2} 1_N^T (X\theta + o) + 1_N^T \gamma(\xi)

where Λ(ξ) = diag(λ(ξ_i)).

By setting a Gaussian prior θ ∼ N(µ, Σ), the logarithm of the joint density f(y, θ) in (4) has the closed form lower bound

    \log f(y, \theta; \xi) = -\tfrac{1}{2} \theta^T [\Sigma^{-1} - 2 X^T \Lambda(\xi) X] \theta + [(y - \tfrac{1}{2} 1_N + 2\Lambda(\xi) o)^T X + \mu^T \Sigma^{-1}] \theta
        + 1_N^T \gamma(\xi) + o^T \Lambda(\xi) o + (y - \tfrac{1}{2} 1_N)^T o - \tfrac{p}{2} \log(2\pi) - \tfrac{1}{2} \log|\Sigma| - \tfrac{1}{2} \mu^T \Sigma^{-1} \mu.

The bound is proportional to a Gaussian density in θ, with posterior covariance matrix and mean given by

    \Sigma_\xi^{-1} = \Sigma^{-1} - 2 X^T \Lambda(\xi) X
    \mu_\xi = \Sigma_\xi [(y - \tfrac{1}{2} 1_N + 2\Lambda(\xi) o)^T X + \mu^T \Sigma^{-1}]^T

As these can be computed, we have derived a closed form for the (approximate) posterior distribution.

One detail remains: both posterior parameters depend on the variational parameters, which we need to define. A natural criterion for optimizing the variational parameters is given by the difference between the lower bound of the evidence f(y; ξ) and the true evidence f(y), i.e. we aim to minimize the Kullback-Leibler divergence between the true evidence and the approximation. Due to the Gaussianity of θ, the lower bound of the log-evidence can be solved in closed form as

    \log f(y; \xi) = \tfrac{1}{2} \log|\Sigma_\xi| - \tfrac{1}{2} \log|\Sigma| + 1_N^T \gamma(\xi) + \tfrac{1}{2} \mu_\xi^T \Sigma_\xi^{-1} \mu_\xi - \tfrac{1}{2} \mu^T \Sigma^{-1} \mu + o^T \Lambda(\xi) o + (y - \tfrac{1}{2} 1_N)^T o    (5)

Numerical techniques could now be used for finding the ξ that maximizes (5), but Jaakkola and Jordan (2000) proposed to use a simple expectation-maximization (EM) algorithm instead. With some matrix calculus, especially the linearity of traces, it can be shown that the E-step function has the form

    Q(\xi' \mid \xi) = E_{\theta \mid y; \xi} \log f(y, \theta; \xi') = \mathrm{tr}\left\{ [X(\Sigma_\xi + \mu_\xi \mu_\xi^T) X^T + o o^T + 2 X \mu_\xi o^T] \Lambda(\xi') \right\} + 1_N^T \gamma(\xi') + \mathrm{const.}

As λ(ξ_i) is a strictly increasing function of ξ_i > 0, the optimal values are given by the equation

    \xi^2 = \mathrm{diag}[X(\Sigma_\xi + \mu_\xi \mu_\xi^T) X^T + o o^T + 2 X \mu_\xi o^T]

taken elementwise; cf. Ormerod and Wand (2010) with the current extension of including the offset terms.

A summary of the technique is given in Algorithm 1. The computational complexity of the technique differs from that of Baddeley et al. (2014) only in the logistic regression estimation part, so we comment only on the complexity of the variational approximation. The variational parameter vector is of size N = n + m, and in each iteration the inverse and determinant operations are at most of order O(p³). Since usually p ≪ N, the complexity of the VB algorithm is dominated by the matrix multiplications, of complexity O(Np²). In comparison, the complexity of solving the iterated weighted least squares equations, the default technique in the glm function used by the spatstat package, is of the same order.

Algorithm 1 Posterior estimation for exponential family Gibbs point processes
    Set µ, Σ, H and ϱ
    Generate dummy points d
    Create the vector y and the matrix X
    Initialize ξ (N × 1)
    while the log-evidence (5) is increasing do
        Update Σ_ξ^{-1} = Σ^{-1} − 2 X^T Λ(ξ) X
        Update µ_ξ = Σ_ξ [(y − ½ 1_N + 2 Λ(ξ) o)^T X + µ^T Σ^{-1}]^T
        Update ξ = sqrt( diag[X(Σ_ξ + µ_ξ µ_ξ^T) X^T + o o^T + 2 X µ_ξ o^T] )
    end while
    Return (µ_ξ, Σ_ξ)

To assess the precision of the VB technique, we compare it to the default GLM fitter of the statistical software environment R. The comparison consists of fits to simulations from a range of homogeneous Strauss processes, a model family with repulsive interaction between the events' locations. With some fixed parameter R denoting the interaction range, a Strauss process has canonical statistics in the conditional intensity (2) of the form

    t(u; \omega) = [1, \; n(\omega \cap b(u, R))]^T

where b(u, R) is the ball of radius R centered at u. The model has two parameters to estimate, the intensity-related parameter θ₁ and the strength-of-interaction parameter θ₂.
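To make the Strauss statistic concrete, here is a minimal Python sketch (the paper's computations use R and spatstat; the function name and data layout here are ours) that computes t(u; ω) = [1, n(ω ∩ b(u, R))] for a location u:

```python
import math

def strauss_statistic(u, omega, R):
    """Canonical statistic of the Strauss model, t(u; omega) = [1, n(omega ∩ b(u, R))]:
    a constant 1 for the intensity term, and the number of points of omega within
    distance R of u (u itself excluded, matching the omega \\ u convention)."""
    n_close = sum(1 for x in omega if x != u and math.dist(u, x) <= R)
    return [1.0, float(n_close)]

# Example: of the three points, two lie within R = 0.1 of u = (0.5, 0.5).
t_u = strauss_statistic((0.5, 0.5),
                        [(0.5, 0.55), (0.56, 0.5), (0.9, 0.9)],
                        R=0.1)
```

Evaluated at every data and dummy location, these vectors form the rows of the design matrix X used above.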
Strict definition of the model requires θ₂ ≤ 0. We also write [β, γ]^T := exp(θ). The range parameter R is not of canonical form, so the models are conditional on a fixed R.

We simulated 100 realisations of the Strauss model with the design exp(θ) ∈ {100, 1000} × {.05, .4}. To have a realistic range parameter value, we set it to 0.7 R_max, where R_max = 2(πλ)^{−1/2} is the maximal range under the hard packing limit. In setting the range parameters, the unknown intensity λ was crudely replaced by exp(θ₁), leading to R ≈ .079 and R ≈ .025 for the small and large θ₁, respectively. The observation window was set to [0, 1]², with an additional R-dilation for border correction.

We set three types of priors for θ:
a) "flat": µ_a = [0, 0]^T and S_a diagonal with very large variances;
b) "tight" around the correct values: µ_b = θ and S_b = diag(1, .01) when γ = .05, with a correspondingly tight S_b when γ = .4;
c) "tight" around wrong values: µ_c = log(2) + θ and S_c = S_b.

The "tight" wrong priors (c) correspond to stating "expert knowledge" that the true interaction parameter γ = exp(θ₂) is between .07 and .19 when in fact γ = .05, and between .65 and 1 when γ = .4. The estimation used the true values of the R parameters, as is common practice in similar experiments.

The same dummy point configurations were used for each fitter, and were provided by the current state-of-the-art Gibbs model inference function ppm from the spatstat package (Baddeley and Turner, 2005).

Figure 1 shows the estimated parameter values from 100 simulations with the described Bayesian method (VB) plotted against the standard GLM fitter glm of the R software, as used by the ppm function. The left column, with the "flat" prior (a), shows that the VB technique with weak priors induces virtually no extra bias in the estimates, confirming past experiences of the good performance of the tangential approximation (Jaakkola and Jordan, 2000). The middle column, with the tight prior around the true value (b), shows that for the low intensity (around 100 points) the estimates concentrate strongly around the prior (the 0-line in the figure). Higher intensity diminishes the prior effect. In the right column we see that the effect of the wrong prior (c) is likewise strong for the small intensity and diminishes when more points are available (cf. + vs ×). The reduction is greater for less interaction (larger θ₂). It is clear that to fully override strong priors an intensity around 1000 is not enough, but for large point patterns of several thousand points or more a dubious expert knowledge would most likely be overridden by the data.
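To illustrate the EM iteration of Algorithm 1, the following is a minimal pure-Python sketch for the simplest possible case: a single parameter (p = 1) and no offsets (o = 0), i.e. ordinary logistic regression. The paper's experiments use R and spatstat; the function names here are ours, and this is a sketch of the scalar updates, not a full implementation.

```python
import math

def lam(xi):
    """Variational coefficient lambda(xi) = -tanh(xi/2)/(4 xi); negative for xi > 0."""
    return -math.tanh(xi / 2.0) / (4.0 * xi)

def gam(xi):
    """gamma(xi) = xi/2 - log(1 + e^xi) + (xi/4) tanh(xi/2)."""
    return xi / 2.0 - math.log(1.0 + math.exp(xi)) + (xi / 4.0) * math.tanh(xi / 2.0)

def vb_logistic(xs, ys, mu0=0.0, v0=1e4, iters=50):
    """Scalar (p = 1, no offsets) version of Algorithm 1: alternate between the
    Gaussian posterior (mu, v) of theta and the variational parameters xi."""
    xis = [1.0] * len(xs)                               # initialize xi (all > 0)
    lin = sum((y - 0.5) * x for x, y in zip(xs, ys))    # (y - 1/2 1_N)^T X, a scalar
    mu, v = mu0, v0
    for _ in range(iters):
        prec = 1.0 / v0 - 2.0 * sum(lam(xi) * x * x for xi, x in zip(xis, xs))
        v = 1.0 / prec                                  # Sigma_xi
        mu = v * (lin + mu0 / v0)                       # mu_xi
        xis = [math.sqrt(x * x * (v + mu * mu)) for x in xs]  # EM update of xi
    return mu, v

# Intercept-only check: 73 "events" out of 100 trials; with a vague prior the
# posterior mean should sit near the logit of the observed proportion,
# log(.73/.27), roughly 0.99.
mu, v = vb_logistic([1.0] * 100, [1.0] * 73 + [0.0] * 27)
```

With a vague prior the posterior mean essentially reproduces the maximum likelihood estimate, in line with the behaviour seen under the "flat" prior (a) above; the full algorithm replaces the scalars by the p × p matrix operations.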
Figure 1: Comparison of the VB logistic regression technique and the standard glm fitter in R. 100 simulations of the Strauss model with 4 sets of parameters (see legend): exp(θ) ∈ {(100, .05), (1000, .05), (100, .4), (1000, .4)}. Columns correspond to the different priors a), b) and c) (see text). The upper and lower rows correspond to θ₁ and θ₂, respectively; units are in relative error.

This example revisits Example 5.4 of Baddeley et al. (2014). The data consist of mucous membrane cells of two types, see Figure 2a). The interesting question is whether the two cell types have a different trend component, indicating that the intensities are not proportional. The trend model for each type is a fourth degree polynomial in the vertical coordinate. A Strauss component is set for the cross-type interaction, with a fixed short range R. The fitted posteriors indicate that the trends differ over the y-coordinate locations. A confirmation is given by the Bayes factor 26 against a model with a shared trend.

Figure 2b) answers the same question as the corresponding figure of Baddeley et al. (2014), namely what are the pointwise confidence regions for the trends given the data. To derive the regions, Baddeley et al. (2014) deploy a parametric bootstrap scheme where the fitted model is repeatedly simulated. Proper simulation of a point process model is in general not trivial, and can in fact be very costly. In our case we need only simulate a 10-dimensional Gaussian random variable, a comparatively trivial execution with very little cost.

Figure 2: a) The mucous membrane cell data, type 1 = ◦, type 2 = ×. b) Posterior pointwise 95%-envelopes for the trends without intercept, as a function of the y-coordinate.

For the so-called pairwise interaction Gibbs model family, an often used alternative form of the density is

    \tilde f(\omega; \theta) \propto \prod_i \varphi(\omega_i) \prod_{i < j} \varphi_\theta(\|\omega_i - \omega_j\|)

The Strauss model belongs to this family, with the interaction function φ_θ(r) = exp(θ₂)^{1{r ≤ R}} = γ^{1{r ≤ R}}. In the simulation experiment the interaction level was estimated at .058, with a narrow 95% central confidence region.

Figure 3: a) Interaction function estimation using a step function with a smoothing prior. True interaction used in the simulation (blue, thick line), 1000 posterior simulations of the step function, LOESS-smoothed (light gray lines), and the posterior means of the step function weights (red ×). b) Interaction function estimated using 50 squared-exponential basis functions.

We also estimated the interaction function using 50 squared-exponential basis functions h_k, using the same r-grid for the centers as for the step functions. The results are again quite good, considering that no particular information about the original potential's shape is used. The characteristic range estimate is also very good. The choice of the r-grid and of the basis function width, both corresponding to the level of smoothness of the unknown function, is an issue that has been discussed heavily in the literature and is therefore beyond this text.

In this exposition we combined two approximation techniques to get a computationally cheap Bayesian method for analysing exponential family Gibbs point processes. The first of the two approximations, and in all accounts the more important one in the point pattern context, draws the connection between the conditional intensity of a process and logistic regression, as described by Baddeley et al. (2014) and apparently discovered by others during the past decade (personal communications).
As it is superior to the most commonly used Poisson approximation of the pseudo-likelihood, the connection is currently the most efficient method for estimating the parameters of these models, and it is opening interesting doors for point pattern analysis and modeling.

We developed a Bayesian version of the logistic regression method by recalling the second approximation from the machine learning literature, a technique that has seen surprisingly little use in the hands of statisticians. Approximating the logistic likelihood using tangential bounds has a good track record in the machine learning community, where feasibility, performance and scalability are valued, and the examples shown here only reinforce that optimism.

The examples shown are not particularly new. Bayesian inference for unknown interaction functions and posterior analysis of trend components have been described earlier by e.g. Heikkinen and Penttinen (1999), Berthelsen and Møller (2003) and Bognar (2005). However, most Bayesian analyses of point pattern data are based on MCMC, which is often time consuming for exponential point process models, involving repeated simulations of the process. The described method avoids all such simulations, thus providing real savings in time and computation. And with a full posterior distribution for the parameters at hand, some further computational strain can be alleviated in the data analysis compared to the non-Bayesian counterparts. In the analysis of the mucous membrane data, for example, the bootstrap simulations of a bivariate Gibbs point process were replaced by the considerably easier simulations of a Gaussian vector.

Hopefully this discussion increases the interest of the statistical community in the developments happening in other data-oriented fields with similar computational problems. The emphasis on analysis details might be different, but the numerical problems are parallel, and we should use that connection to everyone's advantage.
The author has been financially supported by the Knut and Alice Wallenberg Foundation.

References

A. Baddeley and R. Turner. Spatstat: an R package for analyzing spatial point patterns. Journal of Statistical Software, 12:1-42, 2005.
A. Baddeley, J. Coeurjolly, E. Rubak, and R. Waagepetersen. Logistic regression for spatial Gibbs point processes. Biometrika, 101:377-392, 2014.
K. Berthelsen and J. Møller. Likelihood and non-parametric Bayesian MCMC inference for spatial point processes based on perfect simulation and path sampling. Scandinavian Journal of Statistics, 30:549-564, 2003.
M. Bognar. Bayesian inference for spatially inhomogeneous pairwise interacting point processes. Computational Statistics & Data Analysis, 49:1-18, 2005.
A. Gelfand, P. Diggle, M. Fuentes, and P. Guttorp, editors. Handbook of Spatial Statistics. CRC Press, 2010.
J. Heikkinen and A. Penttinen. Bayesian smoothing in the estimation of the pair potential function of Gibbs point processes. Bernoulli, 5:1119-1136, 1999. URL http://projecteuclid.org/euclid.bj/1143122305.
T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25-37, 2000.
J. E. Lennard-Jones. On the determination of molecular fields. Proceedings of the Royal Society A, 106:463-477, 1924.
J. Møller and R. Waagepetersen. Modern statistics for spatial point processes. Scandinavian Journal of Statistics, 34:643-684, 2007.
J. Ormerod and M. Wand. Explaining variational approximations. The American Statistician, 64:140-153, 2010. URL http://amstat.tandfonline.com/doi/abs/10.1198/tast.2010.09058.