Fused Spatial Point Process Intensity Estimation with Varying Coefficients on Complex Constrained Domains
Lihao Yin (Institute of Statistics and Big Data, Renmin University of China; Department of Statistics, Texas A&M University) and Huiyan Sang (Department of Statistics, Texas A&M University)

February 23, 2021
Abstract
The availability of large spatial data geocoded at accurate locations has fueled a growing interest in spatial modeling and analysis of point processes. The proposed research is motivated by the intensity estimation problem for large spatial point patterns on complex domains, where many existing spatial point process models suffer from the problems of "leakage" and computation. We propose an efficient intensity estimation algorithm to estimate the spatially varying intensity function and to study the varying relationship between intensity and explanatory variables on complex domains. The method is built upon a graph regularization technique and hence can be flexibly applied to point patterns on complex domains such as regions with irregular boundaries and holes, or linear networks. An efficient proximal gradient optimization algorithm is proposed to handle large spatial point patterns. We also derive the asymptotic error bound for the proposed estimator. Numerical studies are conducted to illustrate the performance of the method. Finally, we apply the method to study and visualize the intensity patterns of the accidents on the Western Australia road network, and the spatial variations in the effects of income, lighting condition, and population density on the Toronto homicide occurrences.
Keywords:
Graph regularization, Intensity estimation, Line network, Spatial point pattern, Varying coefficient models.

1 Introduction
Numerous problems in geosciences, social sciences, ecology, and urban planning nowadays involve extensive amounts of spatial point pattern data recording event occurrences. Examples include locations of invasive species, pick-up locations of taxi trips, addresses of 911 calls, and traffic accidents on roads, to name a few. In many such applications, the primary problem of interest is to characterize the probability of event occurrence. In the presence of additional covariate information, another problem of interest is to study the effect of these covariates on the event occurrence probability, considering the spatial dependence of observations. Spatial point process models have been widely used for the analysis of point patterns, in which the intensity function, denoted as $\rho(u)$, is used to describe the likelihood for an event to occur at location $u$.

In practice, many spatial point pattern data are collected over complex domains with irregular boundaries, peninsulas, interior holes, or network geographical structures. In this paper, we consider two motivating data examples. The first one is the traffic accident locations on the Western Australia road network shown in the left panel of Figure 1, where the interest lies in studying the spatial variation of accident occurrences. The right panel in Figure 1 shows the homicides that occurred in Toronto, where the city boundary has a very irregular shape. The second data set also includes several additional covariates such as the records of average income, night lights, and population density. Therefore, the questions of interest include not only the intensity of crime events but also the relationships between crime intensity and regional characteristics.
In particular, for a large city like Toronto, we may expect that such relationships can vary, and in some places rather abruptly, across the study domain.

Figure 1: Left: Map of homicide locations in Toronto during 2000−…

Thus far, many methods have been introduced to model the first-order spatially varying intensity function $\rho(u)$. Popular point process models include the spatial Poisson point processes, the log-Gaussian Cox processes, and the Gibbs point processes. See a review by Møller and Waagepetersen (2007). The intensity estimation of these models is often done using maximum composite likelihoods (Guan, 2006), estimating equations (Guan et al., 2015), or Bayesian inference methods (Leininger et al., 2017; Gonçalves and Gamerman, 2018; Shirota and Banerjee, 2019). Nonparametric methods have also been widely used for estimating spatially varying intensity functions, including the edge-corrected kernel smoothing estimators (Diggle, 1985; Jones et al., 1996), and the Voronoi estimator by Barr and Schoenberg (2010), which uses the inverse of the area of the Voronoi cell for each observed location.

Yet, statistical analysis of point patterns on complex domains presents severe challenges to many of the classical point process models reviewed above. Mainly, the commonly adopted Euclidean assumption underpinning some of these methods no longer holds for point patterns on complex domains. For example, two locations on a road network that are close by Euclidean distance may actually lie on two separate roads. Moreover, large data sizes aggravate the challenges in modeling point patterns on complex domains. There is a great need to develop spatial point pattern analysis tools that are computationally efficient and that solve the so-called "leakage" problem encountered on complex domains.

For point patterns that occur along a line network, a particular type of complex domain, a number of methods have been developed recently.
Kernel estimators of the intensity function on a line network were investigated in Moradi et al. (2018), McSwiggan et al. (2017), and Rakshit et al. (2019), adapting the idea of edge-correction using path lengths. Other variations of kernel density estimation methods are reviewed in Baddeley et al. (2020). It is known that kernel estimators are, by nature, more suitable for estimating relatively smooth intensity functions because of the use of smoothing kernel functions. When the intensity function exhibits discontinuities and abrupt changes in space, as discussed in Baddeley et al. (2020), piece-wise constant estimators become an appealing alternative as they adapt well to such changes. One line of research in this direction is the aforementioned Voronoi estimator by Barr and Schoenberg (2010). However, that method suffers from high variance in the estimator. To reduce the variance, Moradi et al. (2019) extended it with a bootstrap resample smoothing procedure. Recently, Bassett and Sharpnack (2019) proposed to estimate the density of points on a network, as opposed to the intensity function, based on a total variation regularization method. While each represents an advancement in estimating the intensity or density of points, none has incorporated spatial covariates in estimation.

When spatial covariates are available, various methods (Baddeley et al., 2012; McSwiggan, 2019) have been developed to incorporate covariate information with the goal of investigating the effect of spatial covariates on point patterns. However, to the best of our knowledge, there has been very limited work dealing with varying regression coefficients for spatial point patterns, even in the simpler case where point patterns are observed in the Euclidean space. One notable exception is the work by Pinto Junior et al. (2015), which modeled the regression coefficients as a multivariate Gaussian process in a similar fashion as the spatially varying coefficients (SVC) linear regression model proposed by Gelfand et al.
(2003). Despite the model richness and flexibility, the SVC model is known to involve heavy computation in the presence of large spatial data due to the requirement of Metropolis MCMC and the need to invert a large covariance matrix. The intractability of the likelihood function of the spatial Poisson process further aggravates the issue. To address the computation issue, Pinto Junior et al. (2015) partitioned the study region into a small number of subregions according to administrative areas, and the latent spatial random effects are assumed to be constant within each subregion. However, in some applications, such a pre-determined partition may be unavailable or may fail to accurately reflect the complex underlying environmental and geological conditions.

In light of these limitations in the current literature, we develop a new approach based on a fused lasso regularization method on a graph for the estimation of piece-wise constant spatial intensity functions. We propose penalties on regression coefficients to encourage sparsity in the differences among regression coefficients that are close in space. We consider various spatial graphs to represent the spatial geometry of complex domains and compare their performance. In addition, we extend the approach to a piece-wise constant coefficient spatial point process model when explanatory variables are available, to model the varying relationships between point patterns and covariates. We formulate the estimation problem as a penalized optimization, which we solve by an efficient proximal gradient algorithm. We tailor the algorithm to utilize spatial graph structures and parallel computations such that the method is applicable for dealing with large spatial point patterns. We also make a theoretical contribution by studying the asymptotic properties of the proposed estimator. Finally, we apply this method to the analysis of the Western Australia accident data and the Toronto homicides data.
The results of our analysis reveal several interesting clustering patterns of traffic accidents and the spatial crime distribution in relation to a number of key environmental, social, and economic variables.

The paper is organized as follows. In Section 2, we review the basic mathematical formulations and definitions of spatial point processes. We then introduce our method in Section 3.1, followed by the computation algorithm in Section 3.2. Sections 4 and 5 include the simulations to illustrate the model performance and the applications to the two real data sets. We offer conclusions in Section 6.

In this study, we consider spatial points on two important types of observation domains. The first type is a bounded domain $D \subset \mathbb{R}^2$ that can be fully covered by finitely many rectangles. The commonly assumed planar window $[a_1, a_2] \times [b_1, b_2]$ is a special case of this type. For any locations $u_1, u_2$ in a planar window, the Euclidean distance is used to measure the distance between the two locations, denoted as $d(u_1, u_2)$. Another example of this type is given in Figure 1, where the observation domain is the city of Toronto, which has irregular city limit boundaries. For any Borel subset $B \subseteq D$, the Lebesgue measure $|B|$ is the area of $B$.

In the second type, we assume $D$ is a linear network. Let $[u, v] = \{tu + (1-t)v : 0 \le t \le 1\}$ denote a line segment in the plane with endpoints $u, v \in \mathbb{R}^2$. A linear network is defined as the finite union $D = \cup_{i=1}^{n} [u_i, v_i]$ of line segments $[u_1, v_1], \dots, [u_n, v_n]$ embedded in the same plane. We define the distance $d(u_1, u_2)$ as the shortest-path distance between $u_1$ and $u_2$ on the network, and for any subset $B \subseteq D$, the measure $|B|$ is the total length of all segments in $B$. An example of a linear network is shown in the right panel, where the road network in the state of Western Australia is drawn in grey lines and the red points mark the traffic accident locations in 2011.
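The shortest-path metric $d(u_1, u_2)$ on a network can differ sharply from the Euclidean distance, which is the source of the "leakage" problem discussed above. A minimal sketch (the toy network, coordinates, and edge set are invented for illustration) computes network distances from a weighted adjacency matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Toy network: vertices 0 and 3 are close in Euclidean distance
# but connected only through the long detour 0-1-2-3.
coords = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 0.1], [0.0, 0.1]])
edges = [(0, 1), (1, 2), (2, 3)]          # no direct edge between 0 and 3

n = len(coords)
W = np.zeros((n, n))
for i, j in edges:
    length = np.linalg.norm(coords[i] - coords[j])  # segment length
    W[i, j] = W[j, i] = length

# Geodesic (shortest-path) distances along the network.
D_net = shortest_path(csr_matrix(W), directed=False)

euclid_03 = np.linalg.norm(coords[0] - coords[3])   # 0.1
network_03 = D_net[0, 3]                            # 2.0 + 0.1 + 2.0 = 4.1
```

Here the two locations are 0.1 apart in the plane but 4.1 apart on the network, so any method based on Euclidean proximity would wrongly treat them as neighbors.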
Let $X$ be a spatial point process on $D$ with the locally finite property, i.e., the random cardinality $N_X(B) = \#\{u : u \in X \cap B\}$ is almost surely finite for any bounded $B \subset D$. Assume that, for any bounded $B \subset D$, there exists a non-negative and locally integrable function $\rho(\cdot): B \mapsto \mathbb{R}$ such that
$$E\{N_X(B)\} = \int_B \rho(u)\, du;$$
then $\rho(\cdot)$ is called the intensity function of $X$. The intensity function is of key interest in point pattern analysis, as $\rho(u)\,du$ is interpreted as the approximate probability that an event falls in the infinitesimal set $du$.

Poisson point processes are one of the most fundamental and tractable spatial point process models. In practice, $\rho(\cdot)$ often varies over $D$, i.e., $X$ is inhomogeneous, and can also depend on some spatial covariates $z(u)$. In our study, we model the intensity function with a general log-linear form,
$$\rho(u; \beta) = \exp\{z^T(u)\beta\}, \quad u \in D, \qquad (1)$$
where $z(u) = (z_1(u), \dots, z_p(u))^T$ is a $p$-dimensional vector of spatial covariates associated with the spatial location $u$, and $\beta \in \mathbb{R}^p$ is the vector of regression parameters.

There are several other popular parametric point process models whose marginal intensity functions take the same log-linear form as in (1). The class of Cox process models is one such example. Let $\Lambda = \{\Lambda(u) : u \in D\}$ denote a real, nonnegative valued random field. If the conditional distribution of $X$ given $\Lambda$ is a Poisson process on $D$ with intensity function $\Lambda$, then $X$ is said to be a Cox process driven by $\Lambda$. Popular examples of Cox process models include the Neyman-Scott process and the log-Gaussian Cox process (e.g., see a review in Chapter 17, Gelfand et al. (2010)).

To estimate the parameter $\beta$ in (1), one commonly used method is to construct unbiased estimating equations and obtain estimators by maximizing the corresponding composite likelihoods.
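Before turning to estimation, note that a Poisson pattern with the log-linear intensity (1) can be generated on a planar window by the standard thinning construction: propose points from a homogeneous Poisson process at an upper-bound rate, then retain each proposal with probability $\rho(u)/\rho_{\max}$. This sketch is illustrative only (the covariate $z(u) = (1, x)$, the coefficients, and the unit-square window are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(x, y, beta):
    """Log-linear intensity rho(u; beta) = exp{z(u)^T beta}, with the
    invented covariate vector z(u) = (1, x)."""
    return np.exp(beta[0] + beta[1] * x)

beta = np.array([4.5, 1.0])
rho_max = rho(1.0, 0.0, beta)          # intensity bound on [0,1]^2

# Step 1: homogeneous Poisson proposals at rate rho_max on [0,1]^2.
n_prop = rng.poisson(rho_max)          # |D| = 1
xs, ys = rng.uniform(size=n_prop), rng.uniform(size=n_prop)

# Step 2: retain each proposal with probability rho(u) / rho_max.
keep = rng.uniform(size=n_prop) < rho(xs, ys, beta) / rho_max
points = np.column_stack([xs[keep], ys[keep]])
```

The expected number of retained points is $\int_D \rho(u)\,du$, here $e^{4.5}(e - 1) \approx 155$.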
The Poisson based composite log-likelihood function (Waagepetersen, 2007) and the logistic based composite log-likelihood function (Baddeley et al., 2014) have been widely used in the literature; they are respectively given by
$$\ell_{PL}(\beta) = \sum_{i=1}^{n} \log \rho(u_i; \beta) - \int_D \rho(u; \beta)\, du, \qquad (2)$$
$$\ell_{LRL}(\beta) = \sum_{i=1}^{n} \log\left(\frac{\rho(u_i; \beta)}{\delta(u_i) + \rho(u_i; \beta)}\right) - \int_D \delta(u) \log\left(\frac{\rho(u; \beta) + \delta(u)}{\delta(u)}\right) du, \qquad (3)$$
where $\{u_1, \dots, u_n\}$ denotes a set of realizations of a point process, and $\delta(u)$ is a non-negative real-valued function. When the point process is a Poisson process, the Poisson based composite log-likelihood function in (2) is identical to the full log-likelihood function. For other point process models, the use of composite likelihood can be justified by the theory of estimating functions (Guan, 2006). It can be shown that the estimators obtained by maximizing both the Poisson based and the logistic based composite log-likelihoods are the solutions to the two corresponding unbiased estimating equations for $\beta$.

Nevertheless, composite likelihood based inference produces a less efficient estimator compared with the full likelihood based estimator, due to the loss of information incurred when only using the first-order moment property of the point process. To improve its efficiency, several methods have been developed to carefully select the weights when combining composite likelihood terms (Guan and Shen, 2010). For simplicity, we only consider the unweighted composite likelihoods in this paper, but remark that the methods can potentially be generalized to the use of weighted composite likelihoods.

In practice, numerical approximations are needed for the composite likelihood inference because the evaluations of both (2) and (3) involve integral terms. For equation (2), Berman and Turner (1992) developed a numerical quadrature method that employs a Riemann sum approximation to the integral part.
To implement this, the domain $D$ is partitioned into $M - n$ subdivisions. We divide a 2-D bounded domain into equal rectangular pixels and divide a line network into smaller line segments. The $M - n$ dummy points, denoted by $\{u_i, i = n+1, \dots, M\}$, are then placed at the center of each subdivision. The Poisson based composite log-likelihood is then approximated by
$$\ell_{PL}(\beta) \approx \sum_{i=1}^{M} v_i \{y_i \log \rho(u_i; \beta) - \rho(u_i; \beta)\}, \qquad (4)$$
where $\{u_i \in D, i = 1, \dots, M\}$ consists of the $n$ observed points and the $M - n$ dummy points, and $v_i$ is the quadrature weight corresponding to each $u_i$. We set $v_i = a_i / n_i$, where $n_i$ denotes the total number of observed events and dummy points in the subdivision in which $u_i$ resides, and $a_i$ denotes the Lebesgue measure of the subdivision of $u_i$, such that $\sum_{i=n+1}^{M} a_i = |D|$. The working response data becomes $y_i = v_i^{-1} \Delta_i$, where $\Delta_i$ is an indicator of whether point $i$ is an observation ($\Delta_i = 1$) or a dummy point ($\Delta_i = 0$).

The Berman-Turner approximation (4) often requires a great number of dummy points, consequently incurring extra computational cost. Baddeley et al. (2014) showed that the estimates based on the logistic likelihood in (3) perform competitively with the Berman-Turner approximation using a smaller number of dummy points. The method approximates (3) by
$$\ell_{LRL}(\beta) \approx \sum_{i=1}^{n} \log\left(\frac{\rho(u_i; \beta)}{\delta(u_i) + \rho(u_i; \beta)}\right) + \sum_{i=n+1}^{M} \log\left(\frac{\delta(u_i)}{\delta(u_i) + \rho(u_i; \beta)}\right), \qquad (5)$$
where the integration term is calculated by Monte Carlo integration, and the dummy points are drawn from a Poisson point process over $D$ that has intensity function $\delta(u)$ and is independent of $X$. Baddeley et al. (2014) suggested choosing $\delta(u) = (M - n)/|D|$. Considering the inhomogeneity of $X$, we set $\delta(u) \propto \hat{\rho}_{Ker}(u)$, where $\hat{\rho}_{Ker}(u)$ is the intensity function estimated by the kernel based method for point processes on linear networks (Baddeley et al., 2020).
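The Berman-Turner device in (4) can be sketched as follows. A useful sanity check is that for a constant intensity $\rho = e^{\beta_0}$ the quadrature approximation equals the exact log-likelihood (2), since $\sum_i v_i y_i = n$ and $\sum_i v_i = |D|$. The grid size and simulated data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# n data points on the unit square, plus one dummy point per pixel.
n, G = 50, 8                                # G x G equal rectangular pixels
data = rng.uniform(size=(n, 2))
cell_area = 1.0 / G**2

def cell_of(pts):
    """Pixel index of each point (row-major over the G x G grid)."""
    ix = np.minimum((pts[:, 0] * G).astype(int), G - 1)
    iy = np.minimum((pts[:, 1] * G).astype(int), G - 1)
    return ix * G + iy

counts = np.bincount(cell_of(data), minlength=G * G)  # data per pixel

# Stack data points and dummy points (one per pixel center).
delta = np.r_[np.ones(n), np.zeros(G * G)]            # indicator Delta_i
cells = np.r_[cell_of(data), np.arange(G * G)]
n_i = counts[cells] + 1                               # events + 1 dummy
v = cell_area / n_i                                   # weights v_i = a_i / n_i
y = delta / v                                         # responses y_i

# Approximate Poisson composite log-likelihood (4) for constant rho.
beta0 = 3.0
rho = np.exp(beta0)
ell_approx = np.sum(v * (y * np.log(rho) - rho))
ell_exact = n * beta0 - rho                           # exact (2), |D| = 1
```

With a spatially varying $\rho$ the same weights are reused and $\rho$ is simply evaluated at each data and dummy location.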
The traditional way to model the log-linear term of the intensity function is to treat the regression coefficients as constants in space, as in (1). In the proposed model, we are interested in estimating a piece-wise constant intensity function in an intercept-only log-linear model, or detecting clustering patterns in $\beta$ when covariates are available. Below, we introduce a spatially varying coefficient log-linear intensity model (SVCI) for spatial point processes via regularization methods.

To elaborate, suppose a set of spatial points is observed at locations $u_1, \dots, u_n \in D$. We assume these spatial points are a realization from a point process $X$ with an intensity function $\rho(u)$, which depends on the $p$-dimensional spatial explanatory variables $z(u) = \{z_1(u), \dots, z_p(u)\}$. As an extension of the constant coefficients regression model, we assume that the regression coefficients are spatially varying across $D$, denoted as $\beta(u) = \{\beta_1(u), \dots, \beta_p(u)\}^T$. The spatially varying coefficient models inherit the simplicity and easy interpretation of the traditional log-linear model in (1), yet enjoy great flexibility that allows practitioners to investigate locally varying relationships among variables.

Let $\beta_k = (\beta_k(u_1), \dots, \beta_k(u_n))^T$, for $k = 1, \dots, p$, denote the vector of regression coefficients associated with the $k$-th covariate. We assume that each $\beta_k$ has its own spatially clustered pattern and is piece-wise constant on $D$; that is, the coefficients are homogeneous within the same spatial cluster and vary across different clusters. In many spatial applications, it is desirable to consider spatially contiguous clustering configurations such that only adjacent locations are clustered together.
This way, practitioners can detect discontinuities across boundaries and easily interpret the clusters as local regions to facilitate subsequent regional analysis.

Before introducing our regularization method, we formally define a spatially contiguous cluster of points using the notion of connected components in graph theory. Consider an undirected graph denoted as $G = (V, E)$, where $V = \{u_i \in \mathbb{R}^d, i = 1, \dots, n, n < \infty\}$ is the set of vertices, and $E$ is the edge set consisting of a subset of $\{(u_i, u_j) : u_i, u_j \in V\}$. In graph theory, a graph $G$ is said to be connected if for any two vertices there exists a path between them. A subgraph $G_s$ is called a connected component of $G$ if it is connected and there is no path between any vertex in $G_s$ and any vertex in $G \setminus G_s$, i.e., the difference between the sets $V$ and $V_s$. Now we can define spatially contiguous clusters as the connected components of a graph $G$. As a result, a spatially contiguous partition of $V$ is defined as a collection of disjoint connected components such that the union of their vertices is $V$.

This motivates us to construct a graph based regularization model, which permits contiguous cluster identification of the regression coefficients for each covariate in the log-linear point process model. Let $\beta_k^* = (\beta_k(u_{n+1}), \dots, \beta_k(u_M))^T$, for $k = 1, \dots, p$, denote the vector of regression coefficients at the dummy points associated with the $k$-th covariate. Denote the vector of the stacked regression coefficients at both the observed and dummy points by $\beta = (\beta_1^T, \beta_1^{*,T}, \dots, \beta_p^T, \beta_p^{*,T})^T$.
We estimate $\beta$ by minimizing the penalized negative composite log-likelihood objective function:
$$Q(\beta) = -\frac{1}{|D|} \tilde{\ell}_c(\beta) + \sum_{k=1}^{p} \sum_{(i,j) \in E} P_\lambda\left(\beta_k(u_i) - \beta_k(u_j)\right), \qquad (6)$$
where $\tilde{\ell}_c(\beta)$ is either the approximation of the Poisson based composite log-likelihood or the logistic regression based composite log-likelihood function, similar to (4) and (5). The expressions are given below:
$$\tilde{\ell}_{PL}(\beta) = \sum_{i=1}^{M} v_i \{y_i \log \rho(u_i; \beta(u_i)) - \rho(u_i; \beta(u_i))\}, \qquad (7)$$
$$\tilde{\ell}_{LRL}(\beta) = \sum_{i=1}^{n} \log\left(\frac{\rho(u_i; \beta(u_i))}{\delta(u_i) + \rho(u_i; \beta(u_i))}\right) + \sum_{i=n+1}^{M} \log\left(\frac{\delta(u_i)}{\delta(u_i) + \rho(u_i; \beta(u_i))}\right). \qquad (8)$$
The second term in the objective function (6) adds a graph pairwise fused regularization to the negative composite log-likelihood function. $E$ is the edge set of a graph $G$, and $(i, j) \in E$ implies that there is an edge in $E$ connecting the points at $u_i$ and $u_j$. $P_\lambda(\cdot)$, a non-negative function tuned by the parameter $\lambda$, penalizes the pairwise differences of the regression coefficients whose corresponding locations are connected by an edge in $E$. One popular choice is the $L_1$ penalty,
$$P_\lambda(t) = \lambda \|t\|_1,$$
which is often referred to as the graph fused lasso penalty in the literature (Tibshirani and Taylor, 2011; Arnold and Tibshirani, 2016; Li and Sang, 2019). The $L_1$ penalty encourages sparsity in the pairwise differences between the coefficients of edge-connected locations. As a result, the edges in the graph can be classified into a set that corresponds to the non-zero elements of $|\beta_k(u_i) - \beta_k(u_j)|$ and another set that corresponds to the zero elements of $|\beta_k(u_i) - \beta_k(u_j)|$. This naturally leads to a piece-wise constant estimate of $\beta_k$ for each covariate, and hence a well defined spatially contiguous partition of the vertices for each covariate.
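The $L_1$ penalty term in (6) is simply a sum of absolute coefficient differences over the edges, accumulated across covariates. A minimal sketch (the toy edge set and coefficient values are invented for illustration):

```python
import numpy as np

def graph_fused_penalty(betas, edges, lam):
    """lam * sum_k sum_{(i,j) in E} |beta_k[i] - beta_k[j]|, where
    betas has one row per covariate k and one column per node."""
    betas = np.atleast_2d(betas)
    return lam * sum(abs(b[i] - b[j]) for b in betas for i, j in edges)

# Toy example: 4 nodes on a chain graph, p = 2 covariates.
edges = [(0, 1), (1, 2), (2, 3)]
betas = np.array([[1.0, 1.0, 3.0, 3.0],    # one jump of size 2
                  [0.0, 0.0, 0.0, 0.0]])   # constant -> zero penalty
pen = graph_fused_penalty(betas, edges, lam=0.5)  # 0.5 * 2 = 1.0
```

A piece-wise constant coefficient vector pays only for its jumps, which is exactly why minimizing (6) fuses neighboring coefficients into contiguous clusters.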
$\lambda$ is a non-negative tuning parameter that determines the strength of the penalization and ultimately influences the estimated clusters. To choose its value properly, we use the Bayes information criterion (BIC) to select an optimal value of $\lambda$ (Li and Sang, 2019).

We remark that there are other choices of sparsity inducing penalty functions, including the adaptive lasso (Zou, 2006), the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), and the minimax concave penalty (MCP) (Zhang, 2010). There are also other criteria for tuning parameter selection, including the Akaike information criterion (AIC), generalized cross-validation (GCV) (Golub et al., 1979), and the extended Bayesian information criterion (EBIC) (Chen and Chen, 2012). In this paper, we choose to use the lasso together with BIC to demonstrate the utility of our method, for their computational simplicity. The method can adopt other forms of penalty functions and model-selection criteria, which may further improve its performance.

The selection of the edge set $E$ is a key ingredient in our SVCI model, playing two important roles. First, the corresponding graph $G$ reflects the prior assumptions about the spatial structure and the contiguity constraint of the regression coefficients. In particular, we rely on $G$ to incorporate the relational information among observations on complex constrained domains so that we can relax the Euclidean assumption. Second, as we will explain in Section 3.2, the computation speed and storage complexity of the optimization algorithm are largely determined by the structure of $G$. We seek to construct a graph fused lasso regularization that achieves a good balance between model accuracy and computational efficiency.

For point patterns on a bounded observation domain, one natural choice is to construct a nearest neighbor graph that connects each vertex with its k nearest neighbors (KNN) or with its neighbors within a certain radius (RNN).
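One common way to implement a BIC rule for fused-lasso fits, in the spirit of Li and Sang (2019), is to count the number of distinct fused coefficient levels as the model degrees of freedom. The sketch below assumes the fits at each candidate $\lambda$ are already computed; the toy Poisson counts, candidate $\lambda$ values, and fitted means are all invented for illustration:

```python
import numpy as np

def poisson_loglik(y, mu):
    """Poisson log-likelihood, up to the constant sum(log y_i!)."""
    return float(np.sum(y * np.log(mu) - mu))

def bic(y, mu, df):
    """BIC = -2 * loglik + df * log(sample size)."""
    return -2.0 * poisson_loglik(y, mu) + df * np.log(len(y))

# Toy counts with two clear groups, and fitted intensities that a
# fused-lasso path might produce at a large and a small lambda.
y = np.array([1.0, 1.0, 1.0, 9.0, 9.0, 9.0])
fits = {
    1.0: np.full(6, y.mean()),                      # fully fused: df = 1
    0.1: np.array([1.0, 1.0, 1.0, 9.0, 9.0, 9.0]),  # two groups: df = 2
}
scores = {lam: bic(y, mu, df=len(np.unique(mu))) for lam, mu in fits.items()}
best_lam = min(scores, key=scores.get)   # the two-group fit wins here
```

In the actual method the log-likelihood would be the composite likelihood (7) or (8), but the selection logic is the same.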
In practice, the number of neighbors in KNN or the radius in RNN needs to be chosen with care to guarantee that $G$ is a connected graph. It is known in the machine learning literature (see, e.g., Shaw and Jebara (2009)) that KNN graphs can effectively preserve the intrinsic manifold structure of the data. Another approach is the Delaunay triangulation (Lee, 1980), which constructs triangles with a vertex set such that no vertex is inside the circumcircle of any triangle. In practice, edges longer than a certain threshold are removed to ensure the spatial proximity of neighboring vertices. Triangular graphs have also shown their capability to preserve complex topological structures of the data; see Lindgren et al. (2011) and Mu et al. (2018) for examples. Moreover, when a graph has certain simple structures, such as a chain or a tree graph, several recent works (Padilla et al., 2018; Li and Sang, 2019) showed that these simple graph structures can be utilized to design efficient algorithms to solve the graph fused lasso problem. This motivates us to adopt a similar strategy and replace the original graph by a minimum spanning tree graph, defined as a subgraph that connects all vertices with no cycles and with minimum total edge weight. We will investigate and compare the performance of the proposed SVCI model with different types of graphs in the numerical studies in Section 4.
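Both the KNN graph and its minimum spanning tree reduction can be built with standard sparse-graph tools. A hedged sketch (the grid of locations is invented; edge weights are Euclidean lengths):

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.spatial import cKDTree
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Toy vertex set: a 5 x 6 grid of locations.
xx, yy = np.meshgrid(np.arange(5.0), np.arange(6.0))
pts = np.column_stack([xx.ravel(), yy.ravel()])
n = len(pts)

# KNN graph: connect each vertex to its k nearest neighbors.
k = 4
dist, idx = cKDTree(pts).query(pts, k=k + 1)    # first neighbor is self
A = lil_matrix((n, n))
for i in range(n):
    for d, j in zip(dist[i, 1:], idx[i, 1:]):
        A[i, j] = A[j, i] = d                   # symmetrize

# Replace the graph by its minimum spanning tree: n - 1 edges,
# still connected, which simplifies the fused-lasso optimization.
mst = minimum_spanning_tree(A.tocsr())
n_comp, _ = connected_components(mst, directed=False)
```

If the KNN graph were disconnected, `minimum_spanning_tree` would return a spanning forest instead, which is why the choice of k (or the RNN radius) must guarantee connectivity.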
3.2 Computation

Once we construct the edge set $E$, we rewrite the objective function in (6) in matrix form and obtain the estimate of $\beta$ by solving the following fused lasso optimization problem:
$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{M \cdot p}} \left\{ -\frac{1}{|D|} \tilde{\ell}_c(\beta) + \lambda \sum_{k=1}^{p} \|H\beta_k\|_1 \right\}, \qquad (9)$$
where $H$ is an $m \times M$ incidence matrix corresponding to the edge set $E$ with $m$ edges. Specifically, for the $l$-th edge of $E$ connecting vertices $u_i$ and $u_j$, the penalty term $|\beta_k(u_i) - \beta_k(u_j)|$ is represented as $|H_l \beta_k|$, where $H_l$ is the $l$-th row of $H$ and contains only two nonzero elements, $1$ at the $i$-th index and $-1$ at the $j$-th.

Path following type algorithms (Arnold and Tibshirani, 2016) and the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) have been developed to solve graph fused lasso problems. However, the computation of these algorithms can be expensive for a general graph with a large number of nodes. Recall that the number of nodes in our graph is the sum of the numbers of observations and dummy points, typically a large number in practice. It is, therefore, challenging to directly apply these conventional algorithms for the implementation of our model.

It is noted that the two approximated composite log-likelihood functions in (7) and (8) coincide with the forms of the log-likelihood functions of a weighted Poisson linear regression and a logistic linear regression, respectively, both of which are concave functions of $\beta$.
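The incidence matrix $H$ in (9) has one row per edge, with $+1$ and $-1$ in the two endpoint columns, so that $H\beta_k$ stacks all pairwise edge differences and $H^T H$ is the graph Laplacian. A minimal sparse construction (the toy edge set and coefficient values are invented):

```python
import numpy as np
from scipy.sparse import coo_matrix

def incidence_matrix(edges, n_nodes):
    """m x n sparse incidence matrix: row l has +1 at column i and
    -1 at column j for the l-th edge (i, j)."""
    m = len(edges)
    rows = np.repeat(np.arange(m), 2)
    cols = np.asarray(edges).ravel()
    vals = np.tile([1.0, -1.0], m)
    return coo_matrix((vals, (rows, cols)), shape=(m, n_nodes)).tocsr()

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
H = incidence_matrix(edges, 4)
beta_k = np.array([1.0, 1.0, 5.0, 5.0])
diffs = H @ beta_k                    # edge differences: [0, -4, 0, -4]
laplacian = (H.T @ H).toarray()       # diagonal = vertex degrees
```

Since each row of $H$ has only two nonzeros, $H^T H$ inherits the sparsity of the graph, which is the property the algorithm below exploits.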
Below, we propose to combine the proximal gradient method and the alternating direction method of multipliers (ADMM) to solve the convex optimization problem in (9). In particular, we take advantage of the structures of our selected spatial graphs and utilize parallel computations to speed up computation.

Specifically, with the current estimate of the parameters being $\beta^{(t)}$, we follow the proximal gradient method (Beck and Teboulle, 2009) to update the value of $\beta$ iteratively by solving:
$$\beta^{(t+1)} = \arg\min_{\beta} \frac{1}{2}\left\|\beta - R^{(t)}\right\|_2^2 + \frac{\lambda}{L} \sum_{k=1}^{p} \|H\beta_k\|_1, \qquad (10)$$
where $L$ is the local Lipschitz constant of the gradient of $-\frac{1}{|D|}\tilde{\ell}_c(\beta)$ at $\beta^{(t)}$, $\nabla \tilde{\ell}_c(\beta^{(t)})$ is the first derivative of $\tilde{\ell}_{PL}(\beta)$ or $\tilde{\ell}_{LRL}(\beta)$ evaluated at $\beta^{(t)}$ for the Poisson based or logistic regression based log-likelihood, respectively, and $R^{(t)} = \beta^{(t)} + \frac{1}{L}\frac{1}{|D|} \nabla \tilde{\ell}_c(\beta^{(t)})$. We can choose $L$ to be the maximum eigenvalue of the Hessian matrix of $-\frac{1}{|D|}\tilde{\ell}_c(\beta)$ evaluated at $\beta^{(t)}$.

Now the optimization at each iteration boils down to solving (10), for which we propose to use the ADMM algorithm (Wahlberg et al., 2012). By introducing auxiliary variables $\theta = \{\theta_1, \dots, \theta_p\}$, Equation (10) is equivalent to:
$$\arg\min_{\beta, \theta} \frac{1}{2}\left\|\beta - R^{(t)}\right\|_2^2 + \frac{\lambda}{L} \sum_{k=1}^{p} \|\theta_k\|_1 \quad \text{s.t.} \quad \theta_k = H\beta_k, \ \forall k = 1, \dots, p.$$
Its augmented Lagrangian function is:
$$\frac{1}{2}\left\|\beta - R^{(t)}\right\|_2^2 + \frac{\lambda}{L} \sum_{k=1}^{p} \|\theta_k\|_1 + \frac{\gamma}{2} \sum_{k=1}^{p} \|H\beta_k - \theta_k\|_2^2 + \gamma \sum_{k=1}^{p} u_k^T (H\beta_k - \theta_k),$$
where $u = \{u_1, \dots, u_p\}$ are the Lagrangian multipliers and $\gamma$ is a penalty parameter.
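The gradient step that defines $R^{(t)}$ in (10) is cheap in the Poisson case because $\tilde{\ell}_{PL}$ is separable across nodes: for an intercept-only model with one coefficient per node, the Hessian is diagonal, so $L$ is just its largest diagonal entry. A hedged sketch (the quadrature weights and working responses are invented; this is not the full algorithm, only the step that produces $R^{(t)}$ and $L$):

```python
import numpy as np

def lltilde(beta, v, y, area=1.0):
    """Scaled intercept-only Poisson composite log-likelihood
    (1/|D|) * l~_PL(beta), with rho_i = exp(beta_i) at each node."""
    return float(np.sum(v * (y * beta - np.exp(beta))) / area)

def gradient_step(beta, v, y, area=1.0):
    """Return R = beta + (1/L)(1/|D|) grad l~_PL(beta) and L, where L
    is the largest diagonal entry of the (diagonal) Hessian."""
    rho = np.exp(beta)
    grad = v * (y - rho) / area     # node-wise gradient
    L = np.max(v * rho) / area      # max eigenvalue of the Hessian
    return beta + grad / L, L

# Toy quadrature weights and working responses at 5 nodes.
v = np.full(5, 0.2)
y = np.array([5.0, 5.0, 2.0, 2.0, 2.0])

# At the node-wise optimum beta_i = log(y_i) the gradient vanishes,
# so R is a fixed point of the step.
beta_star = np.log(y)
R_star, _ = gradient_step(beta_star, v, y)

# From beta = 0: grad = 0.2 * (y - 1) and L = 0.2, so R = y - 1.
R0, L0 = gradient_step(np.zeros(5), v, y)
```

Because the Hessian depends on the current iterate, $L$ here is only a local curvature bound, consistent with the "local Lipschitz constant" in the text.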
ADMM alternately optimizes $\{\beta, \theta, u\}$ by solving the following three subproblems:
$$\beta^{(t+1)} = \arg\min_{\beta} \frac{1}{2}\left\|\beta - R^{(t)}\right\|_2^2 + \frac{\gamma}{2} \sum_{k=1}^{p} \left\|H\beta_k - \theta_k^{(t)} + u_k^{(t)}\right\|_2^2,$$
$$\theta_k^{(t+1)} = \arg\min_{\theta_k} \frac{\lambda}{L} \|\theta_k\|_1 + \frac{\gamma}{2} \left\|H\beta_k^{(t+1)} - \theta_k + u_k^{(t)}\right\|_2^2, \quad k = 1, \dots, p,$$
$$u_k^{(t+1)} = u_k^{(t)} + H\beta_k^{(t+1)} - \theta_k^{(t+1)}, \quad k = 1, \dots, p,$$
where $t$ denotes the $t$-th iteration. The above subproblems have the following analytical solutions, for each $k = 1, \dots, p$:
$$\beta_k^{(t+1)} = (I_n + \gamma H^T H)^{-1}\left[R_k^{(t)} + \gamma H^T \left(\theta_k^{(t)} - u_k^{(t)}\right)\right],$$
$$\theta_k^{(t+1)} = S\left(H\beta_k^{(t+1)} + u_k^{(t)}; \frac{\lambda}{L\gamma}\right),$$
$$u_k^{(t+1)} = u_k^{(t)} + H\beta_k^{(t+1)} - \theta_k^{(t+1)},$$
where $S(z, \lambda)$ is the soft-thresholding operator. It is noted that the above optimization steps are separable across the parameters associated with each $k$, and hence can be conveniently solved in a parallel fashion. In addition, under our choice of the spatial graphs, the graph Laplacian matrix $H^T H$ is a sparse matrix. As a result, the update of $\beta^{(t)}$ only involves a linear solve with the sparse matrix $(I_n + \gamma H^T H)$, whose sparse Cholesky factorization can be pre-computed efficiently using the R package Matrix. We iterate between the proximal gradient update (10) and the above ADMM steps until convergence.
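The inner problem (10) for one covariate block can be solved exactly by these updates. Below is a compact, self-contained sketch of the ADMM loop with the soft-thresholding operator, run on a toy chain graph ($\gamma$, the vector $R$, the iteration count, and $\lambda/L$ are invented; the paper pre-factorizes $I + \gamma H^T H$ once, while here a dense solve is used for brevity):

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fused_lasso_prox(R, H, lam_over_L, gamma=1.0, n_iter=3000):
    """ADMM for min_beta 0.5 ||beta - R||^2 + (lam/L) ||H beta||_1."""
    n, m = len(R), H.shape[0]
    beta, theta, u = R.copy(), np.zeros(m), np.zeros(m)
    A = np.eye(n) + gamma * H.T @ H       # factorizable once per graph
    for _ in range(n_iter):
        beta = np.linalg.solve(A, R + gamma * H.T @ (theta - u))
        theta = soft_threshold(H @ beta + u, lam_over_L / gamma)
        u = u + H @ beta - theta
    return beta

# Toy chain graph with 6 nodes: one (+1, -1) row per chain edge.
n = 6
H = np.zeros((n - 1, n))
for l in range(n - 1):
    H[l, l], H[l, l + 1] = 1.0, -1.0

R = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
beta = fused_lasso_prox(R, H, lam_over_L=0.3)
# The exact solution is piecewise constant: [0.1]*3 + [4.9]*3 (each
# group mean shrinks toward the other by lam_over_L / group size).
```

The closed-form solution for this toy problem can be verified from the KKT conditions, which makes it a convenient correctness check for the solver.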
In this section, we adopt an increasing domain framework and investigate the rates of convergence of our estimators as $n$ goes to infinity. For simplicity, we only present the asymptotic results for the regularized unweighted Poisson based composite likelihood estimator. Similar results and proofs hold for the regularized logistic based composite likelihood estimator. We first define the notation needed to establish the theoretical results below:

• $\hat{\beta}_n$ denotes $\hat{\beta}$ obtained over $D_n$ using (7), and $\beta_0$ denotes the true parameter value.

• $\rho^{(k)}(u_1, \dots, u_k) \equiv \lim_{|du_i| \to 0,\, i=1,\dots,k} \left[ \frac{E\{N(du_1) \cdots N(du_k)\}}{|du_1| \cdots |du_k|} \right]$ denotes the $k$-th order intensity function.

• $G_k(u_1, \dots, u_k) = \lim_{|du_i| \to 0,\, i=1,\dots,k} \left[ \frac{\mathrm{cum}\{N(du_1), \dots, N(du_k)\}}{|du_1| \cdots |du_k|} \right]$ denotes the cumulant function of a point process describing the dependence among the points $u_1, \dots, u_k$, where $\mathrm{cum}(N_1, \dots, N_k)$ is the coefficient of $i^k t_1 \cdots t_k$ in the Taylor series expansion of $\log\left[E\left\{\exp\left(i \sum_{j=1}^{k} N_j t_j\right)\right\}\right]$ at the origin.

• The strong mixing coefficient is defined as
$$\alpha(p; k) = \sup \left\{ |P(A_1 \cap A_2) - P(A_1)P(A_2)| : A_1 \in \mathcal{F}_1, A_2 \in \mathcal{F}_2, |E_1| \le p, |E_2| \le p, d(E_1, E_2) \ge k \right\},$$
where $\mathcal{F}_i$ is the $\sigma$-algebra generated by $X \cap E_i$, $i = 1, 2$, $d(E_1, E_2)$ is the minimal distance between the sets $E_1$ and $E_2$, and the supremum is taken over all compact and convex subsets $E_i \subset \mathbb{R}^2$.

• $U_n(\beta) = -\ell(\beta)/|D_n|$ denotes the scaled negative Poisson composite log-likelihood function. $U_n^{(1)}(\beta)$ and $U_n^{(2)}(\beta)$ denote the first and second derivatives of $U_n(\beta)$, respectively.

• $B_n(\beta) = \int_{D_n} z(u) z(u)^{\top} \rho(u; \beta)\, du$ and $C_n(\beta) = \int_{D_n} \int_{D_n} z(u) z(v)^{\top} \{g(u, v) - 1\} \rho(u; \beta) \rho(v; \beta)\, dv\, du$. Below, we use $B_n$ and $C_n$ to denote $B_n(\beta_0)$ and $C_n(\beta_0)$, respectively.

• Let $\mathbf{H}$ be the direct sum of $p$ copies of the incidence matrix $H$.
Let $H^\dagger$ be the Moore-Penrose pseudo-inverse of $H$, $P_H = H^\dagger H$ be the projection matrix onto the row space spanned by $H$, and $P_{H^\perp} = I - P_H$. Our asymptotic results rely on the following regularity conditions as $n \to \infty$:

Assumption 1.
For every $n \ge 1$, $D_n = nD = \{ ne : e \in D \}$, where $D \subset \mathbb{R}^d$ and $|D| = \mathrm{Const}$.

Assumption 1 states that we consider an increasing-domain asymptotic regime in which $D_n$ expands in all directions.

Assumption 2. $\sup_p \alpha(p; k)/p = O(k^{-\epsilon})$ for some $\epsilon > 0$.

Assumption 2 is a strong mixing condition: for any two fixed sets, the dependence between them decays to 0 at a polynomial rate of their separation distance $k$.

Assumption 3.
The first-order intensity function $\rho(u; \beta)$ is bounded away from 0, $\rho^{(2)}(u, u'; \beta)$ is bounded and continuous with respect to $\beta$, and $\sup_{u_1} \int_{D_n} \cdots \int_{D_n} |G_k(u_1, \dots, u_k)|\, du_2 \cdots du_k < \infty$.
There exists a $\delta$-neighborhood of $\beta_0$, denoted $\mathcal{N}_\delta(\beta_0)$, such that $\hat\beta_n \in \mathcal{N}_\delta(\beta_0)$ and $\liminf_{n \to \infty} \nu_{\min}\big( |D_n|^{-1} B_n(\beta) \big) > 0$ for any $\beta \in \mathcal{N}_\delta(\beta_0)$.

Assumptions 1-4 are commonly adopted in the asymptotic theory for point process models. We adopt these assumptions to establish the asymptotic normality of $U_n^{(1)}(\beta_0)$. Assumption 5 is adopted for the second-order Taylor expansion of the composite likelihood function. These assumptions are needed to derive the error bound for $\hat\beta_n$. We then follow ideas similar to the basic inequality for the Lasso (Bühlmann and Van De Geer, 2011) and the projection argument in Wang et al. (2016) to derive the following error bound.

Theorem 1.
Let $M$ denote the maximum $L_2$ norm of the columns of $(B_n + C_n)^{1/2} P_{H^\perp} H^\dagger$. Under Assumptions 1 to 5, for a tuning parameter $\lambda = O(M |D_n|^{-1/2} \sqrt{\log n})$, with probability tending to 1 as $n \to \infty$, we have
$$\big\| \hat\beta_n - \beta_0 \big\|_2^2 = O_P\big( M |D_n|^{-1/2} \sqrt{\log n}\, \| H\beta_0 \|_1 \big).$$

The proof of Theorem 1 is provided in the Appendix. Finally, we demonstrate in Corollary 1 how the estimation error rate can be used to guide the detection of clusters in practice. Define $H_j$ as the $j$-th row of $H$, and $I_k = \{ j : H_j \beta_k \ne 0 \}$ as the set consisting of the edges that connect points with different coefficient values for the $k$-th covariate. We have the following corollary based on Theorem 1:

Corollary 1. If $M \| H\beta_0 \|_1 = o(|D_n|^{1/2} / \sqrt{\log n})$, then under the assumptions of Theorem 1 there exists $\delta > 0$ such that $|H_j \hat\beta_k| < \delta \Leftrightarrow j \notin I_k$ with probability tending to 1 as $n \to \infty$.

In this section, we conduct simulation studies to investigate the performance of our SVCI model. We design two different scenarios:

• Scenario 1: Point patterns are generated from a Poisson point process on a planar window, where the log intensity is a linear function of an intercept and two covariates with clustered regression coefficients, i.e., $\rho(u; \beta) = \exp\{ \beta_0(u) + z_1(u)\beta_1(u) + z_2(u)\beta_2(u) \}$.

• Scenario 2: Point patterns are generated from a Poisson point process on a linear network.
We consider two sub-scenarios: (a) the log intensity function is piecewise constant, i.e., $\rho(u; \beta) = \exp\{ \beta_0(u) \}$; (b) the log intensity is a linear function of an intercept and two covariates with clustered regression coefficients, as in Scenario 1.

In Scenario 1, we focus on examining the performance of our method under different model choices, including the choice of graph used in the graph fused lasso penalties and the choice between the Poisson likelihood based SVCI (SVCI-PL) and the logistic regression likelihood based SVCI (SVCI-LRL). For comparison studies, to the best of our knowledge, there are very few existing methods available for spatially clustered coefficient log-linear point process models on complex domains, as reviewed in the Introduction, except for the simple case of an intercept-only log-linear model. As such, Scenario 2(a) is included so that we can compare our method with the nonparametric kernel density estimation (KDE) method on a linear network proposed in Baddeley et al. (2020). For the case with spatial covariates, as in Scenario 2(b), the comparison is made with the LGCP model (Møller et al., 1998), in which the inhomogeneity of the intensity function is modeled through latent spatial Gaussian process random effects.

Given the estimator $\hat\beta$ defined in (9), we can predict the coefficients at any new location $u \in D \setminus \{ u_1, \dots, u_M \}$ according to $\hat\beta(u) = \sum_{i=1}^{M} \mathbb{1}\{ u_i \in N_K(u) \}\, \hat\beta(u_i)/K$, where $N_K(u)$ denotes the $K$ nearest neighbors of $u$. To quantify the performance of parameter estimation, we evaluate the estimation accuracy of $\beta_k(u)$ by the mean integrated squared error ($MISE_\beta$, Davis (1977)), defined as
$$MISE_\beta = \frac{1}{p|D|} \sum_{k=1}^{p} \int_D \big( \beta_k(u) - \hat\beta_k(u) \big)^2\, du.$$

We implement our methods in R. The code will be made publicly available upon publication. The data generation and the implementation of the KDE method are done using the R package spatstat (Baddeley and Turner, 2014).
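The nearest-neighbor coefficient prediction and the MISE criterion above can be sketched as follows. This is a minimal Python illustration on our own toy data (the paper's implementation is in R; the helper names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_predict(locs_fit, beta_fit, locs_new, K=5):
    """Predict beta(u) at new locations as the average of the fitted
    estimates at the K nearest fitted locations."""
    _, idx = cKDTree(locs_fit).query(locs_new, k=K)
    return beta_fit[idx].mean(axis=1)

def mise(beta_true, beta_hat):
    """Monte Carlo approximation of MISE_beta: average the squared error
    over evaluation points (rows) and coefficients (columns). The domain
    area |D| cancels when the evaluation points are uniform over D."""
    return np.mean((beta_true - beta_hat) ** 2)

# Toy usage: two spatial clusters with coefficient values 0 and 2.
locs_fit = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
beta_fit = np.array([0.0, 0.0, 2.0, 2.0])
pred = knn_predict(locs_fit, beta_fit, np.array([[0.5, 0.0], [0.5, 1.0]]), K=2)
# pred -> [0.0, 2.0]
```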
The LGCP model is implemented in R using the lgcp function provided in geostatsp (Brown et al., 2020). All computations were performed on a Mac Pro with 2.8 GHz eight-core processors and 16 GB of memory.

In Simulation Scenario 1, we consider a spatial 2D window $D = [0, R]^2 \subset \mathbb{R}^2$, and the true regression coefficients in the log-intensity function are assumed to have the clustering patterns shown in the top panel of Figure 2. We simulate the two covariates $\{z_1(u)\}$ and $\{z_2(u)\}$ from two independent realizations of a spatial GP with mean zero and an isotropic exponential covariance function of the form $\mathrm{Cov}\{ z_k(u), z_k(v) \} = \sigma^2 \exp(-\| u - v \| / \phi)$, $k = 1, 2$, $u, v \in [0, R]^2$, where $\phi$ is the range parameter; we consider $\phi = 0.1R$, which corresponds to a moderate spatial correlation setting.

Under one chosen fixed coefficient pattern, we experiment with a range of $R$ values such that the number of simulated points ranges from 800 to 6000 on average, in order to examine the performance of SVCI as the sample size increases with the expanding domain. Furthermore, we report the model performance for three different numbers of dummy points $nd$: (a) $nd < n$; (b) $nd = n$; (c) $nd > n$, where $n$ is the number of observed points. As discussed in Section 3, the selection of the connection graph for the fused lasso penalty plays a critical role in both estimation accuracy and computation speed. In this study, we compare the performance of SVCI using three types of connection graphs: the minimum spanning tree graph (MST), the Delaunay triangulation (DT), and the $K$-nearest-neighbor graph ($K$-NN, with $K$ set to 3, 4 and 5), with $nd = n$. In Table 1, we report the averaged MISE of the estimate of each $\beta_k(u)$ ($MISE_\beta$), $k = 0, 1,$
2. There are several noticeable observations. First, a denser graph such as the 5-NN graph produces more accurate estimates than a sparser graph such as the MST or 3-NN graph. The bottom panel of Figure 2 illustrates an example of the estimated coefficients using the 5-NN graph when $n = 2000$ and $nd = n$, which demonstrates the capability of SVCI to capture the cluster structure in the regression coefficients and to detect abrupt changes across the boundaries of adjacent clusters. However, there is clearly a trade-off between estimation accuracy and computation efficiency when using different graphs, as reported in the left panel of Figure 3.

Figure 2: The top panel (a-c): the true coefficients $\beta_0$, $\beta_1$ and $\beta_2$ in Scenario 1; the bottom panel (d-f): the estimated coefficients from one simulation using the 5-NN graph when $n = 2400$, $nd = n$.

Second, the estimation accuracy improves as $n$ grows larger, as evidenced by the decreasing value of $MISE_\beta$. Finally, SVCI-LRL produces results comparable to those of SVCI-PL when $n = 800$ or when using the MST graph, but notably outperforms SVCI-PL for larger $n$. Given $\hat\beta(u)$, we obtain the estimate of the log intensity function as $\log \hat\rho(u) = z^T(u) \hat\beta(u)$ for $u \in D$. The right panel of Figure 3 compares the $MISE$ of $\log \hat\rho(u)$ from SVCI-PL, SVCI-LRL and LGCP, respectively. SVCI-LRL maintains its superior performance over SVCI-PL when predicting the intensity function. Moreover, both versions of SVCI produce more accurate estimates than LGCP when estimating an intensity function with clustered regression coefficients.

Next, we examine the performance of SVCI in recovering the clusters of coefficients. Table 2 reports the Rand index for each of the SVCI estimates $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$ averaged over 100 simulations, using the 5-NN graph and setting $nd = n$.
The Rand index measures the proportion of pairs of locations on which the true clustering and the estimated clustering agree, by virtue of belonging either to the same cluster or to different clusters. Overall, SVCI achieves accurate cluster recovery, evidenced by relatively high Rand index values ranging from 0.73 to 0.93 across all settings. We also find that SVCI-LRL surpasses SVCI-PL in detecting spatial clusters. Finally, an interesting observation is that $\hat\beta_0$ has a lower Rand index than both $\hat\beta_1$ and $\hat\beta_2$, which might be a consequence of the larger number of clusters in the true function $\beta_0$.

Finally, we check the sensitivity of the model performance to the number of dummy points. We fix $n = 1600$ and consider different numbers of dummy points, denoted $nd$. Table 3 presents the averaged $MISE_\beta$ and the associated computation time over 100 simulations. For the Poisson likelihood, the R package spatstat suggests a default grid of dummy points. In our experiments, however, as presented in Table 3, both SVCI-PL and SVCI-LRL achieve the minimal $MISE_\beta$ when $nd = 40^2 = 1600$, i.e., when the number of dummy points equals the number of observed points. Weighing the trade-off between computation efficiency and estimation accuracy, we recommend using $nd \approx n$ in practice.

Table 1: Scenario 1: Mean integrated squared error of $\beta$ ($MISE_\beta$) averaged over 100 simulations for different values of $n$ with $nd = n$, different connection graphs, and the Poisson-based SVCI-PL method and the logistic regression based SVCI-LRL method.

              n = 800       n = 1600      n = 2400      n = 3600      n = 6000
  Graph      PL     LRL    PL     LRL    PL     LRL    PL     LRL    PL     LRL
  MST       0.234  0.243  0.222  0.224  0.189  0.191  0.152  0.146  0.130  0.125
  3-NN      0.223  0.230  0.204  0.182  0.189  0.184  0.155  0.142  0.127  0.115
  4-NN      0.214  0.209  0.200  0.181  0.157  0.132  0.133  0.116  0.115  0.098
  5-NN      0.209  0.195  0.184  0.153  0.141  0.119  0.122  0.105  0.095  0.078
  DT        0.215  0.211  0.174  0.141  0.128  0.111  0.113  0.097  0.080  0.072
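The Rand index used above can be computed as follows. This is a small Python sketch of the standard pairwise definition, not code from the paper:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: the fraction of point pairs on which two labelings agree,
    i.e. both place the pair in the same cluster or both in different ones."""
    n = len(labels_a)
    pairs = list(combinations(range(n), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Identical partitions (up to relabeling) have Rand index 1.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # 2 of 6 pairs agree -> 0.333...
```

Because the index only compares co-membership of pairs, it is invariant to how the cluster labels themselves are named, which is why it suits comparing estimated and true coefficient clusters.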
In Scenario 2, we generate spatial points on the chicago linear network from the R package spatstat. The network is shown in the left panel of Figure 4.

Figure 3: Left: computation time to solve the optimization for one tuning parameter using different connection graphs; Right: boxplots of $MISE_\beta$ for SVCI-PL and SVCI-LRL with 5-NN graphs, and for LGCP. Reported results are averaged over 100 Monte Carlo simulations under Scenario 1.

Table 2: Scenario 1: Rand index of the estimates $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$ (averaged over 100 Monte Carlo simulations) for SVCI-PL and SVCI-LRL for different values of $n$, using $nd = n$ and 5-NN connection graphs.

Table 3: Scenario 1: comparison of $MISE_\beta$ (averaged over 100 simulations) and computation time (in seconds) between SVCI-PL and SVCI-LRL for different numbers of dummy points ($nd = 30^2, 40^2, 60^2, 80^2$), using $n = 1600$ and 5-NN connection graphs.

We embed the network in $D = [0, R]^2$ and increase $R$ to expand $D$, so that the linear network grows with $D$ at the same rate, yielding increasing numbers of realizations on the network.

We first consider a simplified case in which no covariates are available, and focus on estimating an intensity function whose true value is piecewise constant; that is, $\rho(u) = \exp\{ \beta_0(u) \}$, where the true $\beta_0(u)$ has the clustered pattern shown in the left panel of Figure 4. The upper part of Table 4 presents the $MISE$ of $\log \rho$, i.e., of the log intensity function, for each value of $n$. In general, we obtain findings on the linear network similar to those on the planar window in Scenario 1: the $MISE$ from both the Poisson-based and logistic-based SVCI models shows a converging tendency as $n$ grows, and the logistic likelihood based method achieves slightly more accurate estimation than the Poisson-based method when the number of points is large.
For this case, we include the results of the KDE method for point processes on linear networks (Baddeley et al., 2020) for comparison. Both the SVCI-PL and SVCI-LRL models outperform the KDE-based intensity estimation method in all settings. It is not surprising that SVCI recovers the cluster boundaries more accurately than KDE, because the KDE-based method is better suited to cases where the intensity function is relatively smooth; in contrast, the advantage of our method is more prominent when the true intensity function has abrupt changes or clustered patterns. This is illustrated by the example in the right panel of Figure 4, which plots the true and the estimated log intensity along a selected road segment from one simulation. It clearly shows that SVCI captures the discontinuities in the intensity more effectively than KDE.

We then consider the case with an intercept and two covariates, using the same setting as in Scenario 1. The true regression coefficients are plotted in subfigures (a-c) of Figure 5. Subfigures (d-f) of Figure 5 show the estimated coefficients from SVCI on the chicago network. The results demonstrate that our method is also capable of capturing clustered coefficient patterns on a linear network.
In addition, the log intensity estimation results presented in the lower part of Table 4 are broadly consistent with the findings in Scenario 1 and Scenario 2(a): the SVCI model with the logistic regression likelihood, or with a larger $n$, is preferable. Table 4 also displays the results from LGCP as a comparison, indicating a clear improvement of SVCI over LGCP in estimation accuracy.

Figure 4: Left: the true log intensity $\beta_0$ on the chicago network in Scenario 2(a); Right: the true and the estimated log intensity along one road segment, corresponding to the line between the two black arrows in the left panel.

Figure 5: The top three panels (a-c): the spatial structures of the true coefficients $\beta_0$, $\beta_1$ and $\beta_2$ in Scenario 2(b); the bottom three panels (d-f): the estimated coefficient surfaces from one simulation using SVCI-LRL with $n = 2400$.

Table 4: Scenario 2: mean integrated squared error of $\log \rho$ ($MISE_{\log\rho}$) averaged over 100 simulations for different values of $n$. We compare the two estimating equations, the Poisson likelihood (PL) and the logistic regression likelihood (LRL), with kernel density estimation (KDE) and the log Gaussian Cox process (LGCP).

  Method                                                n = 800  n = 1600  n = 2400  n = 3600  n = 6000
  (a): $\rho(u) = \exp\{\beta_0(u)\}$
  SVCI-PL                                                0.128    0.101     0.084     0.057     0.041
  SVCI-LRL                                               0.122    0.095     0.073     0.048     0.034
  KDE                                                    0.157    0.140     0.112     0.084     0.061
  (b): $\rho(u) = \exp\{\beta_0(u) + z_1(u)\beta_1(u) + z_2(u)\beta_2(u)\}$
  SVCI-PL                                                0.177    0.152     0.135     0.114     0.085
  SVCI-LRL                                               0.174    0.142     0.128     0.102     0.073
  LGCP                                                   0.227    0.202     0.188     0.154     0.137

We consider two real data examples to illustrate the performance of the proposed method. The first, the Toronto homicide data, has a moderate data size with 1398 points and three explanatory variables on a domain with irregular boundaries. The second, the Western Australia traffic accidents data, has a larger data size with 14,
562 points on a linear road network.
We apply the proposed SVCI model to the analysis of the Toronto homicide dataset. The raw dataset contains information on 1398 homicides that occurred in Toronto, Canada between 1990 and 2014, recording the locations of murder scenes, homicide types, and information on the victims obtained from the Toronto Star newspaper. We select the more recent years since 2000 and delete records with duplicated locations, leaving 764 homicide cases for the final analysis. Figure 1 shows the entire city of Toronto and the locations of the selected cases within a 42 ×
31 km rectangular window. Notably, the old Toronto region, in the middle of the coast, has more frequent homicide occurrences.

The data also contain records of average income, night lights, and population density for the city of Toronto in 2006, which we use as explanatory variables. Figure 6 (a-c) shows the observations of the three variables. As can be seen, there is large spatial variation in these variables across the city. We scaled and centered each spatial covariate before running our point process models.

We focus on investigating the relationship between the distribution of homicides and the three explanatory variables. We first fit a standard log-Gaussian Cox process model with constant regression coefficients as a benchmark. The intensity function of the LGCP takes the form
$$\log\{ \rho(u) \} = \beta_0 + \mathrm{Income}(u)\beta_1 + \mathrm{Night}(u)\beta_2 + \mathrm{Pdens}(u)\beta_3 + \phi(u),$$
where $\beta_1$, $\beta_2$ and $\beta_3$ are constant coefficients across the domain and $\phi(u)$ is a spatial Gaussian process with zero mean and a Matérn correlation function.

Figure 6: Left panel: patterns of covariates; right panel: patterns of estimates; (a,e) average income of the residents in Toronto; (b,d) light intensity in Toronto at night; (c,f) population density in Toronto.

After centering and scaling the covariates, we obtain the parameter estimates from the LGCP benchmark.
We then fit the proposed SVCI model with intensity function
$$\log\{ \rho(u) \} = \beta_0(u) + \mathrm{Income}(u)\beta_1(u) + \mathrm{Night}(u)\beta_2(u) + \mathrm{Pdens}(u)\beta_3(u),$$
where the coefficients $\beta_k(u)$, $k = 0, 1, 2, 3$, are allowed to vary over space; their overall levels are consistent with the results of LGCP. For example, the estimates of $\beta_1(u)$ range from -2 to 0, indicating the expected negative relationship between income and homicide occurrence. This negative relationship is most prominent in the western region of Toronto, whereas a weaker relationship is observed near the upper east corner of the city. For both $\beta_2$ and $\beta_3$, we observe a small cluster in the Old Toronto region, which has the most concentrated homicide cases. Notably, the relationships between light intensity/population density and homicide occurrence in the Old Toronto region differ significantly from the rest of the city: a stronger positive relationship is observed for both variables.

In this section, we study the traffic accidents data for the state of Western Australia in 2011, as shown in Figure 1. The data were originally provided by the Western Australian State Government Department of Main Roads and are made publicly available as part of the Western Australian Whole of Government Open Data Policy. The data can also be accessed from the R package spatstat. It consists of 14,
562 locations of accidents on a road network with 115,
169 road segments constrained in a rectangular observation window. We fit the intensity model $\rho(u; \beta) = \exp\{ \beta_0(u) \}$. In this study, we do not have any spatial covariates available, and hence we focus on detecting clustered patterns of the intensity function $\rho(u)$. Figure 7 plots the estimated log intensity $\log \hat\rho(u)$ on the road network. We notice that $\hat\rho(u)$ has large spatial variation, ranging from 0 accidents per kilometer in some remote eastern areas to nearly 50 accidents per kilometer on some busy roads in the Perth metropolitan area. We further zoom into a sub-region for a closer look.

In this study, we propose a varying coefficient log-linear intensity model, referred to as the SVCI model, for the visualization and analysis of spatial point processes. We utilize a graph fused lasso regularization to estimate clustered patterns in the regression coefficients, which allows practitioners to capture discontinuous changes over space. The method guarantees spatially contiguous clustering configurations with highly flexible cluster shapes and data-driven cluster sizes. It supplements the current research on intensity estimation, which primarily focuses on relatively smooth intensity functions or spatially constant regression coefficients. The method also has the advantage of being applicable to a broad range of complex domains, such as linear networks and spatial domains with irregular boundaries. The computation of the model is made highly efficient by a proximal gradient optimization algorithm. Numerical studies show that our method produces fast and accurate estimates. We apply the method to identify spatially heterogeneous patterns in the determinants of Toronto crime events and in the intensity of traffic accidents in Western Australia.

Moving forward, this work could be further refined in several ways. First, SVCI only considers a small fixed number of covariates.
However, in practice, analysts may face a large number of available covariates but lack a strong theory to inform variable selection. There is a need for a more general model that allows researchers to perform variable selection and spatial cluster identification simultaneously for point processes. Second, the SVCI estimator does not come with an uncertainty measure, which makes statistical inference difficult, a common issue shared by many regularization-based approaches. We may consider a Bayesian version of the method or a bootstrap-based approach to address the inference problem. Finally, we acknowledge that further investigation is needed into how to verify and relax some of the assumptions used to establish the theoretical results. Tighter error bounds may be obtained following entropy-bound arguments similar to those in Ortelli and van de Geer (2019). However, these are nontrivial theoretical questions, and we leave them for future research.

References
Arnold, T. B. and R. J. Tibshirani (2016). Efficient implementations of the generalized lasso dual path algorithm.
Journal of Computational and Graphical Statistics 25(1), 1–27.
Baddeley, A., Y.-M. Chang, Y. Song, and R. Turner (2012). Nonparametric estimation of the dependence of a spatial point process on spatial covariates.
Statistics and Its Interface 5(2), 221–236.
Baddeley, A., J.-F. Coeurjolly, E. Rubak, and R. Waagepetersen (2014). Logistic regression for spatial Gibbs point processes.
Biometrika 101(2), 377–392.
Baddeley, A., G. Nair, S. Rakshit, G. McSwiggan, and T. M. Davies (2020). Analysing point patterns on networks: a review.
Spatial Statistics, 100435.
Baddeley, A. and R. Turner (2014). Package ‘spatstat’.
The Comprehensive R Archive Network.
Baddeley, A. J., R. Turner, et al. (2004). Spatstat: An R package for analyzing spatial point patterns.
Barr, C. D. and F. P. Schoenberg (2010). On the Voronoi estimator for the intensity of an inhomogeneous planar Poisson process.
Biometrika 97(4), 977–984.
Bassett, R. and J. Sharpnack (2019). Fused density estimation: theory and methods.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81(5), 839–860.
Beck, A. and M. Teboulle (2009). Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems.
IEEE Transactions on Image Processing 18(11), 2419–2434.
Berman, M. and T. R. Turner (1992). Approximating point process likelihoods with GLIM.
Journal of the Royal Statistical Society: Series C (Applied Statistics) 41(1), 31–38.
Boyd, S., N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends® in Machine Learning 3(1), 1–122.
Brown, P., R. Hijmans, and M. P. Brown (2020). Package ‘geostatsp’.
Bühlmann, P. and S. Van De Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
Chen, J. and Z. Chen (2012). Extended BIC for small-n-large-P sparse GLM.
Statistica Sinica 22, 555–574.
Davis, K. B. (1977). Mean integrated square error properties of density estimates.
The Annals of Statistics, 530–535.
Diggle, P. (1985). A kernel method for smoothing point process data.
Journal of the Royal Statistical Society: Series C (Applied Statistics) 34(2), 138–147.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American Statistical Association 96(456), 1348–1360.
Gelfand, A. E., P. Diggle, P. Guttorp, and M. Fuentes (2010). Handbook of Spatial Statistics. CRC Press.
Gelfand, A. E., H.-J. Kim, C. Sirmans, and S. Banerjee (2003). Spatial modeling with spatially varying coefficient processes.
Journal of the American Statistical Association 98(462), 387–396.
Golub, G. H., M. Heath, and G. Wahba (1979). Generalized cross-validation as a method for choosing a good ridge parameter.
Technometrics 21(2), 215–223.
Gonçalves, F. B. and D. Gamerman (2018). Exact Bayesian inference in spatiotemporal Cox processes driven by multivariate Gaussian processes.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(1), 157–175.
Guan, Y. (2006). A composite likelihood approach in fitting spatial point process models.
Journal of the American Statistical Association 101(476), 1502–1512.
Guan, Y., A. Jalilian, and R. Waagepetersen (2015). Quasi-likelihood for spatial point processes.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(3), 677.
Guan, Y. and Y. Shen (2010). A weighted estimating equation approach for inhomogeneous spatial point processes.
Biometrika 97(4), 867–880.
Jones, M. C., J. S. Marron, and S. J. Sheather (1996). A brief survey of bandwidth selection for density estimation.
Journal of the American Statistical Association 91(433), 401–407.
Lee, D.-T. (1980). Two-dimensional Voronoi diagrams in the Lp-metric.
Journal of the ACM (JACM) 27(4), 604–618.
Leininger, T. J., A. E. Gelfand, et al. (2017). Bayesian inference and model assessment for spatial point patterns using posterior predictive samples.
Bayesian Analysis 12(1), 1–30.
Li, F. and H. Sang (2019). Spatial homogeneity pursuit of regression coefficients for large datasets.
Journal of the American Statistical Association, 1–21.
Lindgren, F., H. Rue, and J. Lindström (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(4), 423–498.
McSwiggan, G. (2019). Spatial point process methods for linear networks with applications to road accident analysis. PhD thesis.
McSwiggan, G., A. Baddeley, and G. Nair (2017). Kernel density estimation on a linear network.
Scandinavian Journal of Statistics 44(2), 324–345.
Møller, J., A. R. Syversveen, and R. P. Waagepetersen (1998). Log Gaussian Cox processes.
Scandinavian Journal of Statistics 25(3), 451–482.
Møller, J. and R. P. Waagepetersen (2007). Modern statistics for spatial point processes.
Scandinavian Journal of Statistics 34(4), 643–684.
Moradi, M. M., O. Cronie, E. Rubak, R. Lachieze-Rey, J. Mateu, and A. Baddeley (2019). Resample-smoothing of Voronoi intensity estimators.
Statistics and Computing 29(5), 995–1010.
Moradi, M. M., F. J. Rodríguez-Cortés, and J. Mateu (2018). On kernel-based intensity estimation of spatial point patterns on linear networks.
Journal of Computational and Graphical Statistics 27(2), 302–311.
Mu, J., G. Wang, and L. Wang (2018). Estimation and inference in spatially varying coefficient models.
Environmetrics 29(1), e2485.
Ortelli, F. and S. van de Geer (2019). Oracle inequalities for image denoising with total variation regularization. arXiv preprint arXiv:1911.07231.
Padilla, O. H. M., J. Sharpnack, J. G. Scott, and R. J. Tibshirani (2018). The DFS fused lasso: Linear-time denoising over general graphs.
Journal of Machine Learning Research 18(176), 1–36.
Pinto Junior, J. A., D. Gamerman, M. S. Paez, and R. H. Fonseca Alves (2015). Point pattern analysis with spatially varying covariate effects, applied to the study of cerebrovascular deaths.
Statistics in Medicine 34(7), 1214–1226.
Rakshit, S., T. Davies, M. M. Moradi, G. McSwiggan, G. Nair, J. Mateu, and A. Baddeley (2019). Fast kernel smoothing of point patterns on a large network using two-dimensional convolution.
International Statistical Review 87(3), 531–556.
Shaw, B. and T. Jebara (2009). Structure preserving embedding. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 937–944.
Shirota, S. and S. Banerjee (2019). Scalable inference for space-time Gaussian Cox processes.
Journal of Time Series Analysis 40(3), 269–287.
Tibshirani, R. J. and J. Taylor (2011). The solution path of the generalized lasso.
The Annals of Statistics 39(3), 1335–1371.
Waagepetersen, R. P. (2007). An estimating function approach to inference for inhomogeneous Neyman–Scott processes.
Biometrics 63(1), 252–258.
Wahlberg, B., S. Boyd, M. Annergren, and Y. Wang (2012). An ADMM algorithm for a class of total variation regularized estimation problems.
IFAC Proceedings Volumes 45(16), 83–88.
Wang, Y.-X., J. Sharpnack, A. J. Smola, and R. J. Tibshirani (2016). Trend filtering on graphs.
The Journal of Machine Learning Research 17(1), 3651–3691.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
Annals of Statistics 38, 894–942.
Zou, H. (2006). The adaptive lasso and its oracle properties.