A note on Bayesian logistic regression for spatial exponential family Gibbs point processes
Tuomas A. Rajala
Department of Mathematical Sciences
Chalmers University of Technology and University of Gothenburg
S-41296 Gothenburg, Sweden
[email protected]
October 4, 2018
Abstract
Recently, a very attractive logistic regression inference method for exponential family Gibbs spatial point processes was introduced. We combine it with the technique of quadratic tangential variational approximation and derive a new Bayesian technique for analysing spatial point patterns. The technique is described in detail and demonstrated on numerical examples.
Keywords: Exponential family model; Logistic regression; Bayesian inference
Gibbs point processes are a popular class of models for interacting events taking place at spatial locations. Parametric inference for these inherently non-independent event models is costly due to analytically intractable normalizing constants in the likelihood. However, with the recent development of approximate techniques such as the logistic regression technique described by Baddeley et al. (2014), the act of modeling is becoming free from the constraints of computational limitations.

In this paper we describe a Bayesian version of the logistic regression technique of Baddeley et al. (2014). Bayesian methodology is not generally frowned upon by point pattern statisticians, so its slow growth in popularity amongst scientists is more likely a result of the need to employ MCMC algorithms (see overviews in e.g. Gelfand et al., 2010; Møller and Waagepetersen, 2007). MCMC for point process models requires strong expertise in designing the algorithms even for the simplest models, and usually additional, costly simulation loops are needed to evaluate the chains' transition probabilities. The proposed method requires no simulations and provides posterior distributions in a fraction of the running time of an MCMC algorithm.

This text is based on the setup described in detail by Baddeley et al. (2014), and we often refer to that paper to minimize repetition. Section 2 describes the details of the Bayesian version, and Section 3 depicts some examples of how the Bayesianity can be exploited for a flexible analysis of point pattern data.
We assume that we observe a stationary point process in some finite window W ⊂ R^d. We denote the volume of the window by V := |W|, and write ω ⊂ W for the observed, finite configuration of point locations ω_i ∈ W. The cardinality is denoted by n = n(ω) := |ω ∩ W|. We write u ∈ W for an arbitrary point in the window. The notation is slightly changed from that of Baddeley et al. (2014) to simplify the exposition of the model fitting algorithm later on.

The exponential family Gibbs models are defined through a density with respect to a unit rate Poisson process in R^d. In the so-called canonical parametrisation, a model of the exponential Gibbs model family has a parametric density of the form

    \tilde f(\omega \mid \theta) = \alpha(\theta)^{-1} H(\omega) \exp[\theta^T t(\omega)]    (1)

where θ ∈ R^p is a vector of parameters, t(ω) = [t_1(ω) ... t_p(ω)]^T ∈ R^p is a vector of canonical statistics, the function H(ω) is a baseline or reference density (mainly to allow a hard-core effect), and α(θ) is the normalizing constant.

The normalizing constant α(θ) is usually not tractable, hindering direct likelihood based inference. Approximate likelihood inference for Gibbs models is usually based on the local dynamics of the process, described by the Papangelou conditional intensity, defined at any location u ∈ W as

    \lambda(u; \omega) := \frac{\tilde f(\omega \cup u)}{\tilde f(\omega \setminus u)} = H(u; \omega) \exp[\theta^T t(u; \omega)]    (2)

with the convention ω \ u = ω if u ∉ ω. Here H(u; ω) := 1{H(ω \ u) > 0} H(ω ∪ u)/H(ω \ u) and t(u; ω) := t(ω ∪ u) − t(ω \ u) are the contributions of the point u to the density.

The logistic likelihood approximation described by Baddeley et al. (2014) replaces the true likelihood (1) with an unbiased estimating equation formally equal to that of standard logistic regression.
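As a minimal illustration of the conditional intensity (2): once the local statistics t(u; ω) and the baseline contribution H(u; ω) are available, λ(u; ω) is a cheap pointwise computation. The sketch below is in Python (the paper's own software is R); the function name and arguments are ours, not from the paper.

```python
import math

def conditional_intensity(theta, t_u, H_u=1.0):
    """Papangelou conditional intensity lambda(u; omega) = H(u; omega) * exp(theta^T t(u; omega)).

    theta : list of canonical parameters
    t_u   : list, the local canonical statistics t(u; omega)
    H_u   : baseline contribution H(u; omega); 1.0 when there is no hard core
    """
    return H_u * math.exp(sum(a * b for a, b in zip(theta, t_u)))

# Hypothetical repulsive model: one neighbour within range halves the local
# intensity, so with baseline intensity 100 and interaction weight 0.5 we get 50.
lam_u = conditional_intensity(theta=[math.log(100.0), math.log(0.5)],
                              t_u=[1.0, 1.0])
```

Evaluating λ(u; ω) over a grid of locations u is the basic building block of all the likelihood approximations that follow.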
The method is based on an auxiliary point configuration: First, we simulate a set of dummy locations d = {u_j ∈ W : j = 1, ..., m} from a known process with a known intensity function ϱ(u) > 0. Then we concatenate the data and the dummy points to a configuration ω̄ = ω ∪ d, keeping the order, and define a vector of data indicators y_i := 1{u_i ∈ ω} for each u_i ∈ ω̄, i = 1, ..., n + m.

The likelihood (1) is then well approximated by a logistic regression density of the form

    f(\omega \mid \theta) = \prod_{i=1}^{n+m} p_i^{y_i} (1 - p_i)^{1 - y_i}    (3)

where

    p_i = \frac{\exp[\theta^T t(u_i; \omega) + o_i]}{1 + \exp[\theta^T t(u_i; \omega) + o_i]}

and the o_i := log[H(u_i; ω)/ϱ(u_i)] are offset terms. Spatial trend components with possible location and covariate dependencies can be included as usual for logistic regression.

The power of the approach is that the original, computationally hard point process inference problem is approximated by a much simpler, standard exponential family form which can be solved in the generalized linear model (GLM) framework with standard software. In what follows, a variational Bayes method for computing the solution is described.

The variational technique for logistic regression used here was first described in 1996 and further developed by Jaakkola and Jordan (2000), motivated by the requirements of fast learning of graphical models. It is based on a tight tangential bound for the function log(1 + e^x), and with Gaussian priors for θ the technique leads to tractable Gaussian posteriors. The technique is part of a large group of variational techniques developed for Bayesian data analysis, often grouped under the name Variational Bayes (VB) techniques (see e.g. Ormerod and Wand (2010) for an overview).

Write N := n + m. To conform to the usual GLM notation, we will now write X = [x_1 ... x_N]^T for the data matrix with the canonical statistics forming the rows, x_i = t(u_i; ω). Using vector notation, the logarithm of (3) becomes

    \log f(y \mid \theta) = \sum_{i=1}^{N} \left\{ y_i (x_i^T \theta + o_i) - \log[1 + \exp(x_i^T \theta + o_i)] \right\}
                          = y^T (X\theta + o) - 1_N^T \log[1 + \exp(X\theta + o)]    (4)

with 1_N = [1 ... 1]^T denoting the N-length vector of 1's and o = [o_1 ... o_N]^T. The formulation of Ormerod and Wand (2010) has here been extended to include the offset terms.

The logarithm terms in (4) make it hard to derive posteriors for θ, as the normalizing integral is not tractable. The method of Jaakkola and Jordan (2000) works around this by approximating the logarithm terms with tangential lower bounds of quadratic form. The bound is based on the inequality

    -\log[1 + \exp(x_i^T \theta + o_i)] \ge \lambda(\xi_i)(x_i^T \theta + o_i)^2 - \tfrac{1}{2}(x_i^T \theta + o_i) + \gamma(\xi_i)  \quad \forall \xi_i > 0

where λ(ξ_i) = −tanh(ξ_i/2)/(4ξ_i) and γ(ξ_i) = ξ_i/2 − log(1 + e^{ξ_i}) + (ξ_i/4) tanh(ξ_i/2). The ξ_i's are called the variational parameters, and they determine the goodness of the approximation. Equality holds when ξ_i² = (x_i^T θ + o_i)². In vector form, applying the bound to each of the logarithm terms in (4) leads to the quadratic form

    -1_N^T \log[1 + \exp(X\theta + o)] \ge (X\theta + o)^T \Lambda(\xi)(X\theta + o) - \tfrac{1}{2} 1_N^T (X\theta + o) + 1_N^T \gamma(\xi)
        = \theta^T X^T \Lambda(\xi) X \theta + 2 o^T \Lambda(\xi) X \theta + o^T \Lambda(\xi) o - \tfrac{1}{2} 1_N^T (X\theta + o) + 1_N^T \gamma(\xi)

where Λ(ξ) = diag(λ(ξ_i)).

By setting a Gaussian prior θ ∼ N(µ, Σ), the logarithm of the joint density f(y, θ) in (4) has the closed form lower bound

    \log f(y, \theta; \xi) = -\tfrac{1}{2} \theta^T [\Sigma^{-1} - 2 X^T \Lambda(\xi) X] \theta + [(y - \tfrac{1}{2} 1_N + 2\Lambda(\xi) o)^T X + \mu^T \Sigma^{-1}] \theta
        + 1_N^T \gamma(\xi) + o^T \Lambda(\xi) o + (y - \tfrac{1}{2} 1_N)^T o - \tfrac{p}{2} \log(2\pi) - \tfrac{1}{2} \log|\Sigma| - \tfrac{1}{2} \mu^T \Sigma^{-1} \mu.

The bound is proportional to a Gaussian density in θ, with posterior covariance matrix and mean given by

    \Sigma_\xi^{-1} = \Sigma^{-1} - 2 X^T \Lambda(\xi) X
    \mu_\xi = \Sigma_\xi [(y - \tfrac{1}{2} 1_N + 2\Lambda(\xi) o)^T X + \mu^T \Sigma^{-1}]^T

As these can be computed, we have derived a closed form for the (approximate) posterior distribution.

One detail remains: both posterior parameters depend on the variational parameters, which we need to define. A natural criterion for optimizing the variational parameters is given by the difference between the lower bound of the evidence f(y; ξ) and the true evidence f(y), i.e. we aim to minimize the Kullback-Leibler divergence between the true evidence and the approximation. Due to the Gaussianity of θ, the lower bound of the log-evidence can be solved in closed form as

    \log f(y; \xi) = \tfrac{1}{2} \log|\Sigma_\xi| - \tfrac{1}{2} \log|\Sigma| + 1_N^T \gamma(\xi) + \tfrac{1}{2} \mu_\xi^T \Sigma_\xi^{-1} \mu_\xi - \tfrac{1}{2} \mu^T \Sigma^{-1} \mu + o^T \Lambda(\xi) o + (y - \tfrac{1}{2} 1_N)^T o    (5)

Numerical techniques could now be used for finding the ξ that maximizes (5), but Jaakkola and Jordan (2000) proposed to use a simple expectation-maximization (EM) algorithm instead. With some matrix calculus, especially the linearity of traces, it can be shown that the E-step function has the form

    Q(\xi' \mid \xi) = E_{\theta \mid y; \xi} \log f(y, \theta; \xi') = \mathrm{tr}\left\{ [X(\Sigma_\xi + \mu_\xi \mu_\xi^T) X^T + o o^T + 2 X \mu_\xi o^T] \Lambda(\xi') \right\} + 1_N^T \gamma(\xi') + \mathrm{const.}

As λ(ξ_i) is a strictly increasing function of ξ_i > 0, the optimal values are given by the equation

    \xi^2 = \mathrm{diag}[X(\Sigma_\xi + \mu_\xi \mu_\xi^T) X^T + o o^T + 2 X \mu_\xi o^T]

taken elementwise; cf. Ormerod and Wand (2010) with the current extension of including the offset terms.

A summary of the technique is given in Algorithm 1. The computational complexity of the technique differs from that of Baddeley et al. (2014) only in the logistic regression estimation part, so we comment only on the complexity of the variational approximation. The variational parameter vector is of size N = n + m, and in each iteration the inverse and determinant operations are at most of order O(p³). Since usually p ≪ N, the complexity of the VB algorithm is dominated by the matrix multiplications, of complexity O(Np²). In comparison, the complexity of solving the iterated weighted least squares equations, the default technique in the glm function used by the spatstat package, is of the same order.

Algorithm 1 Posterior estimation for exponential family Gibbs point processes
    Set µ, Σ, H and ϱ
    Generate dummy points d
    Create the vector y and the matrix X
    Initialize ξ (N × 1)
    while the log-evidence (5) is increasing do
        Update Σ_ξ^{-1} = Σ^{-1} − 2 X^T Λ(ξ) X
        Update µ_ξ = Σ_ξ [(y − ½ 1_N + 2 Λ(ξ) o)^T X + µ^T Σ^{-1}]^T
        Update ξ = sqrt( diag[X(Σ_ξ + µ_ξ µ_ξ^T) X^T + o o^T + 2 X µ_ξ o^T] )
    end while
    Return (µ_ξ, Σ_ξ)

To assess the precision of the VB technique, we compare it to the default GLM fitter of the statistical software environment R. The comparison consists of fits to simulations from a range of homogeneous Strauss processes, a model family with repulsive interaction between the events' locations. With some fixed parameter R denoting the interaction range, a Strauss process has canonical statistics in the conditional intensity (2) of the form

    t(u; \omega) = [1, \; n(\omega \cap b(u, R))]^T

where b(u, R) is the ball of radius R centered at u. The model has two parameters to estimate, the intensity-related parameter θ₁ and the strength-of-interaction parameter θ₂.
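To make the Strauss statistic concrete, here is a minimal Python sketch (the paper's computations use R and spatstat; the function name and data layout here are ours) that computes t(u; ω) = [1, n(ω ∩ b(u, R))] for a location u:

```python
import math

def strauss_statistic(u, omega, R):
    """Canonical statistic of the Strauss model, t(u; omega) = [1, n(omega ∩ b(u, R))]:
    a constant 1 for the intensity term, and the number of points of omega within
    distance R of u (u itself excluded, matching the omega \\ u convention)."""
    n_close = sum(1 for x in omega if x != u and math.dist(u, x) <= R)
    return [1.0, float(n_close)]

# Example: of the three points, two lie within R = 0.1 of u = (0.5, 0.5).
t_u = strauss_statistic((0.5, 0.5),
                        [(0.5, 0.55), (0.56, 0.5), (0.9, 0.9)],
                        R=0.1)
```

Evaluated at every data and dummy location, these vectors form the rows of the design matrix X used above.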
Strict definition of the model requires θ₂ ≤ 0. We also write [β, γ]^T := exp(θ). The range parameter R is not of canonical form, so the models are conditional on a fixed R.

We simulated 100 realisations of the Strauss model with the design exp(θ) ∈ {100, 1000} × {.05, .4}. To have a realistic range parameter value, we set it to 0.7 R_max, where R_max = 2(πλ)^{−1/2} is the maximal range under the hard packing limit. In setting the range parameters, the unknown intensity λ was crudely replaced by exp(θ₁), leading to R ≈ .079 and R ≈ .025 for the small and large θ₁, respectively. The observation window was set to [0, 1]², with an additional R-dilation for border correction.

We set three types of priors for θ:
a) "flat": µ_a = [0, 0]^T and S_a diagonal with very large variances;
b) "tight" around the correct values: µ_b = θ and S_b = diag(1, .01) when γ = .05, with a correspondingly tight S_b when γ = .4;
c) "tight" around wrong values: µ_c = log(2) + θ and S_c = S_b.

The "tight" wrong priors (c) correspond to stating "expert knowledge" that the true interaction parameter γ = exp(θ₂) is between .07 and .19 when in fact γ = .05, and between .65 and 1 when γ = .4. The estimation used the true values of the R parameters, as is common practice in similar experiments.

The same dummy point configurations were used for each fitter, and were provided by the current state-of-the-art Gibbs model inference function ppm from the spatstat package (Baddeley and Turner, 2005).

Figure 1 shows the estimated parameter values from 100 simulations with the described Bayesian method (VB) plotted against the standard GLM fitter glm of the R software, as used by the ppm function. The left column, with the "flat" prior (a), shows that the VB technique with weak priors induces virtually no extra bias in the estimates, confirming past experiences of the good performance of the tangential approximation (Jaakkola and Jordan, 2000). The middle column, with the tight prior around the true value (b), shows that for the low intensity (around 100 points) the estimates concentrate strongly around the prior (the 0-line in the figure). Higher intensity diminishes the prior effect. In the right column we see that the effect of the wrong prior (c) is likewise strong for the small intensity and diminishes when more points are available (cf. + vs ×). The reduction is greater for less interaction (larger θ₂). It is clear that to fully override strong priors an intensity around 1000 is not enough, but for large point patterns of several thousand points or more a dubious expert knowledge would most likely be overridden by the data.
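To illustrate the EM iteration of Algorithm 1, the following is a minimal pure-Python sketch for the simplest possible case: a single parameter (p = 1) and no offsets (o = 0), i.e. ordinary logistic regression. The paper's experiments use R and spatstat; the function names here are ours, and this is a sketch of the scalar updates, not a full implementation.

```python
import math

def lam(xi):
    """Variational coefficient lambda(xi) = -tanh(xi/2)/(4 xi); negative for xi > 0."""
    return -math.tanh(xi / 2.0) / (4.0 * xi)

def gam(xi):
    """gamma(xi) = xi/2 - log(1 + e^xi) + (xi/4) tanh(xi/2)."""
    return xi / 2.0 - math.log(1.0 + math.exp(xi)) + (xi / 4.0) * math.tanh(xi / 2.0)

def vb_logistic(xs, ys, mu0=0.0, v0=1e4, iters=50):
    """Scalar (p = 1, no offsets) version of Algorithm 1: alternate between the
    Gaussian posterior (mu, v) of theta and the variational parameters xi."""
    xis = [1.0] * len(xs)                               # initialize xi (all > 0)
    lin = sum((y - 0.5) * x for x, y in zip(xs, ys))    # (y - 1/2 1_N)^T X, a scalar
    mu, v = mu0, v0
    for _ in range(iters):
        prec = 1.0 / v0 - 2.0 * sum(lam(xi) * x * x for xi, x in zip(xis, xs))
        v = 1.0 / prec                                  # Sigma_xi
        mu = v * (lin + mu0 / v0)                       # mu_xi
        xis = [math.sqrt(x * x * (v + mu * mu)) for x in xs]  # EM update of xi
    return mu, v

# Intercept-only check: 73 "events" out of 100 trials; with a vague prior the
# posterior mean should sit near the logit of the observed proportion,
# log(.73/.27), roughly 0.99.
mu, v = vb_logistic([1.0] * 100, [1.0] * 73 + [0.0] * 27)
```

With a vague prior the posterior mean essentially reproduces the maximum likelihood estimate, in line with the behaviour seen under the "flat" prior (a) above; the full algorithm replaces the scalars by the p × p matrix operations.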
Figure 1: Comparison of the VB logistic regression technique and the standard glm fitter in R. 100 simulations of the Strauss model with 4 sets of parameters (see legend): exp(θ) ∈ {(100, .05), (1000, .05), (100, .4), (1000, .4)}. Columns correspond to the different priors a), b) and c) (see text). The upper and lower rows correspond to θ₁ and θ₂, respectively; units are in relative error.

This example revisits Example 5.4 of Baddeley et al. (2014). The data consist of mucous membrane cells of two types, see Figure 2a). The interesting question is whether the two cell types have a different trend component, indicating that the intensities are not proportional. The trend model for each type is a fourth degree polynomial in the vertical coordinate. A Strauss component is set for the cross-type interaction, with a fixed short range R. The fitted posteriors indicate that the trends differ over the y-coordinate locations. A confirmation is given by the Bayes factor 26 against a model with a shared trend.

Figure 2b) answers the same question as the corresponding figure of Baddeley et al. (2014), namely what are the pointwise confidence regions for the trends given the data. To derive the regions, Baddeley et al. (2014) deploy a parametric bootstrap scheme where the fitted model is repeatedly simulated. Proper simulation of a point process model is in general not trivial, and can in fact be very costly. In our case we need only simulate a 10-dimensional Gaussian random variable, a comparatively trivial execution with very little cost.

Figure 2: a) The mucous membrane cell data, type 1 = ◦, type 2 = ×. b) Posterior pointwise 95%-envelopes for the trends without intercept, as a function of the y-coordinate.

For the so-called pairwise interaction Gibbs model family, an often used alternative form of the density is

    \tilde f(\omega; \theta) \propto \prod_i \varphi(\omega_i) \prod_{i < j} \varphi_\theta(\|\omega_i - \omega_j\|)

The Strauss model belongs to this family, with the interaction function φ_θ(r) = exp(θ₂)^{1{r ≤ R}} = γ^{1{r ≤ R}}. In the simulation experiment the interaction level was estimated at .058, with a narrow 95% central confidence region.

Figure 3: a) Interaction function estimation using a step function with a smoothing prior. True interaction used in the simulation (blue, thick line), 1000 posterior simulations of the step function, LOESS-smoothed (light gray lines), and the posterior means of the step function weights (red ×). b) Interaction function estimated using 50 squared-exponential basis functions.

We also estimated the interaction function using 50 squared-exponential basis functions h_k, using the same r-grid for the centers as for the step functions. The results are again quite good, considering that no particular information about the original potential's shape is used. The characteristic range estimate is also very good. The choice of the r-grid and of the basis function width, both corresponding to the level of smoothness of the unknown function, is an issue that has been discussed heavily in the literature and is therefore beyond this text.

In this exposition we combined two approximation techniques to get a computationally cheap Bayesian method for analysing exponential family Gibbs point processes. The first of the two approximations, and in all accounts the more important one in the point pattern context, draws the connection between the conditional intensity of a process and logistic regression, as described by Baddeley et al. (2014) and apparently discovered by others during the past decade (personal communications).
As it is superior to the most commonly used Poisson approximation of the pseudo-likelihood, the connection is currently the most efficient method for estimating the parameters of these models, and it is opening interesting doors for point pattern analysis and modeling.

We developed a Bayesian version of the logistic regression method by recalling the second approximation from the machine learning literature, a technique that has seen surprisingly little use in the hands of statisticians. Approximating the logistic likelihood using tangential bounds has a good track record in the machine learning community, where feasibility, performance and scalability are valued, and the examples shown here only reinforce that optimism.

The examples shown are not particularly new. Bayesian inference for unknown interaction functions and posterior analysis of trend components have been described earlier by e.g. Heikkinen and Penttinen (1999), Berthelsen and Møller (2003) and Bognar (2005). However, most Bayesian analyses of point pattern data are based on MCMC, which is often time consuming for exponential point process models, involving repeated simulations of the process. The described method avoids all such simulations, thus providing real savings in time and computation. And with a full posterior distribution for the parameters at hand, some further computational strain can be alleviated in the data analysis compared to the non-Bayesian counterparts. In the analysis of the mucous membrane data, for example, the bootstrap simulations of a bivariate Gibbs point process were replaced by the considerably easier simulations of a Gaussian vector.

Hopefully this discussion increases the interest of the statistical community in the developments happening in other data-oriented fields with similar computational problems. The emphasis on analysis details might be different, but the numerical problems are parallel, and we should use that connection to everyone's advantage.
The author has been financially supported by the Knut and Alice Wallenberg Foundation.

References

A. Baddeley and R. Turner. Spatstat: an R package for analyzing spatial point patterns. Journal of Statistical Software, 12:1-42, 2005.
A. Baddeley, J. Coeurjolly, E. Rubak, and R. Waagepetersen. Logistic regression for spatial Gibbs point processes. Biometrika, 101:377-392, 2014.
K. Berthelsen and J. Møller. Likelihood and non-parametric Bayesian MCMC inference for spatial point processes based on perfect simulation and path sampling. Scandinavian Journal of Statistics, 30:549-564, 2003.
M. Bognar. Bayesian inference for spatially inhomogeneous pairwise interacting point processes. Computational Statistics & Data Analysis, 49:1-18, 2005.
A. Gelfand, P. Diggle, M. Fuentes, and P. Guttorp, editors. Handbook of Spatial Statistics. CRC Press, 2010.
J. Heikkinen and A. Penttinen. Bayesian smoothing in the estimation of the pair potential function of Gibbs point processes. Bernoulli, 5:1119-1136, 1999. URL http://projecteuclid.org/euclid.bj/1143122305.
T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25-37, 2000.
J. E. Lennard-Jones. On the determination of molecular fields. Proceedings of the Royal Society A, 106:463-477, 1924.
J. Møller and R. Waagepetersen. Modern statistics for spatial point processes. Scandinavian Journal of Statistics, 34:643-684, 2007.
J. Ormerod and M. Wand. Explaining variational approximations. The American Statistician, 64:140-153, 2010. URL http://amstat.tandfonline.com/doi/abs/10.1198/tast.2010.09058.