A functional-model-adjusted spatial scan statistic
AA functional-model-adjusted spatial scan statistic
Mohamed-Salem Ahmed and Micha¨el Genin
Univ. Lille, EA2694 - Sant´e publique : ´epid´emiologie et qualit´e des soins, F-59000 Lille, France [email protected]@univ-lille.fr
Abstract
This paper introduces a new spatial scan statistic designed to adjust cluster detection forlongitudinal confounding factors indexed in space. The functional-model-adjusted statistic wasdeveloped using generalized functional linear models in which longitudinal confounding factorswere considered to be functional covariates. A general framework was developed for applicationto various probability models. Application to a Poisson model showed that the new methodis equivalent to a conventional spatial scan statistic that adjusts the underlying population forcovariates. In a simulation study with univariate and multivariate models, we found that ournew method adjusts the cluster detection procedure more accurately than other methods. Useof the new spatial scan statistic was illustrated by analysing data on premature mortality inFrance over the period from 1998 to 2013, with the quarterly unemployment rate as a longitu-dinal confounding factor.
Keywords: cluster detection, confounding factor, functional data analysis, longitudinal data,generalized functional linear model.
In many fields of science, cluster detection methods are useful tools for objective identifying ag-gregations of events in time and/or space and for determining the latter’s statistical significance.In the field of epidemiology, researchers often seek to detect spatial clusters in which the risk ofdisease is significantly higher or lower than in the rest of the geographical area studied. For dis-eases of unknown etiology, information on the presence and nature of clusters provides clues to thedisease mechanism (especially in terms of environmental factors), and can facilitate the design ofsubsequent individual-level observational studies.Over the last few decades, several cluster detection methods have been developed. In particular,spatial scan statistics (originally proposed by Kulldorff, on the basis of Bernoulli and Poisson models(Kulldorff, 1997, 1999)) are powerful methods for detecting spatial clusters with a variable scanningwindow size and in the absence of pre-selection bias, and then testing the clusters’ statisticalsignificance. Following on from Kulldorff’s initial work, several researchers have adapted spatialscan statistics to other spatial data distributions, such as ordinal (Jung et al., 2007), normal1 a r X i v : . [ s t a t . M E ] M a r Kulldorff et al., 2009), exponential (Huang et al., 2007) and Weibull model (Bhatt & Tiwari,2014). Spatial scan statistics have been extended to the multivariate framework by Kulldorff et al.(2007), Neill (2012), and, most recently, Cucala et al. (2017, 2018).One of the main problems in cluster detection is the need to adjust for covariates. If a covariateis a confounding factor associated with the event of interest, and is not homogeneously distributedover a geographical area, a cluster analysis can generate clusters in which the covariate (and not theevent of interest) predominates. For example, clusters of cardiovascular disease must be adjustedfor social deprivation, which is a strong confounding factor (Rothman et al., 2008). In the absenceof adjustment, the analysis may highlight very deprived areas that have a higher number of diseasecases but are not epidemiologically relevant because of confusion bias. In the literature, severalcovariate adjustment techniques have been applied to spatial scan statistics. For the Poisson model,Kulldorff et al. (1997) originally suggested the use of (i) indirect standardization methods to adjustfor qualitative covariates, and (ii) regression methods to adjust for quantitative covariates andto estimate the expected number of cases per spatial unit. For the Bernoulli model, Kulldorffet al. (2007) suggested using several datasets for each stratum of a qualitative covariate. Klassenet al. (2005) applied multilevel regression methods to adjust for quantitative covariates. Morerecently, Jung (Jung, 2009) used generalized multivariate linear models (GMLMs) to build spatialscan statistics that incorporated covariates. The latter approach is particularly valuable becauseit merges spatial scan statistics developed for different probability models into a single framework.However, this approach has limitations when dealing with longitudinal covariates. In a purelyspatial analysis, there are two possible scenarios for longitudinal data: (i) the variable outcomeand the covariates are observed on the same time scale (e.g. one observation per year for both)over a long period of time, or (ii) the variable outcome and the covariates are observed on differenttime scales (e.g. one observation per year for the outcome, and one observation per month forcovariates). In Jung’s approach, a simplistic way of managing longitudinal covariates in bothscenarios is to summarize the data by averaging them (or determining the median) over the entiretime period. However, this may lead to significant information loss and a decrease in the qualityof covariate adjustment. Alternatively, the confounding factors for each measurement time scalecan be included in the model as a covariate, as long as the time scale is the same for each of thespatial units (in order to limit the number of missing values). However, this approach may createa high-dimensional vector of coefficients and introduce multicollinearity (James, 2002).In the present work, we developed a spatial scan statistic based on functional data analysis(FDA) (Ramsay & Silverman, 2005). Firstly, our approach allows longitudinal data to be consideredas the realization of a random function over an interval containing discrete time points. It shouldbe noted that the random function can be observed at different, unequally spaced time points foreach location. Secondly, our approach replaces the high-dimensional vector of coefficients by aparameter function to be estimated. These two characteristics make it possible to overcome boththe above-mentioned problems, i.e. identical measurement times, and high dimensionality.The present article is organized as follows. Section 2 describes the methodological aspects of thefunctional-model-adjusted spatial scan statistic (FMASSS). In Section 3, the FMASSS was appliedto a Poisson model, and was found to be equivalent to a conventional spatial scan statistic whenthe underlying population was adjusted for covariates. Section 4 presents both the design and the2esults of a simulation study. Section 5 describes the application of the FMASSS to epidemiologicdata and the detection of clusters of high and low premature mortality in France. Lastly, the resultsare discussed in Section 6.
Let consider that at each location s i (one of n different spatial locations s , . . . , s n included in D ⊂ R )), we observe an outcome variable Y i and two type of covariate: Z i is a p × { X i , . . . , X im i } is the realization of a real-valued stochastic process at m i time points t i , . . . , t im i (i.e. longitudinal data). Hereafter, all observations are considered to be independent,this is a classical assumption in scan statistics. A spatial scan statistic usually denotes the maximumconcentration observed among a collection of potential clusters denoted by S = { S k ⊂ D, k =1 , , ... } . It is used as a test statistic for areas in which the concentration might be abnormally highor abnormally low Cressie (1977). Kulldorff Kulldorff (1997) introduced a spatial scan statisticbased on a generalized likelihood ratio; this enables the comparison of concentrations in potentialclusters of different sizes, and takes account of heterogeneity in the underlying population. Withoutloss of generality and in line with Kulldorff’s work (Kulldorff, 1997), we shall focus on variable-size, circular clusters. Hence, the set of potential clusters S is built so that (i) each potentialcluster is centered at a particular location, and (ii) the radius is limited so that the correspondingcluster cannot cover more than 50% of the studied region. It should be noted that many otherconfigurations (such as elliptical clusters (Kulldorff et al., 2006) and graph-based clusters (Cucalaet al., 2013) have been suggested.Conventionally, the spatial scan statistic can be defined as the potential cluster that maximizes alog-likelihood ratio (LLR) over S namely the most likely cluster (MLC). This LLR is based on a nullhypothesis H (the absence of a cluster) and an alternative hypothesis H (the presence of a cluster).If confounding covariates ( Z i and { X i , . . . , X im i } ) are present, the MLC can be revealed by thesefactors alone. Thus, the spatial scan statistic has to be adjusted with respect to these covariates.In Jung’s GMLM approach (Jung, 2009), Z i and { X i , . . . , X im i } will be integrated as separatecovariates. However, as mentioned in the Introduction, this approach can be limited by informationloss and high dimensionality. Hence, we developed an FMASSS that considers { X i , . . . , X im i } asrealizations of a random function { X i ( t ) , t ∈ T } , where T is an interval containing the discrete timepoints. The random function { X i ( t ) , t ∈ T } is approximated from the longitudinal observations { X i , . . . , X im i } . More generally, a basis of functions { ϕ j ( t ) , j = 1 , . . . , K, t ∈ T } is consideredwith K ≤ min( m , . . . , m n ), and the random function is assumed to belong to the space generatedby this basis X i ( t ) = K (cid:88) j =1 a ij ϕ j ( t ) , (1)where the n × K matrix basis coefficients A with elements a ij can be estimated using either aninterpolation method (if the measurements { X i , . . . , X im i } are observed without error, i.e X ik = X i ( t ik ) , k = 1 , . . . , m i , or an ordinary (or penalized) least-square method (if the measurementsare observed with some error, i.e X ik = X i ( t ik ) + e ik , k = 1 , . . . , m i . The choice of the basis of3unctions depends on the shape of the longitudinal data. For instance, a B-spline basis is the mostsuitable choice for non-periodic functional data, a Fourier basis can be useful for periodic functionaldata, while a wavelet basis can be appropriate for functional data with discontinuities or changesin behavior (see Ramsay & Silverman (2005) for more details).Once the random function { X i ( t ) , t ∈ T } has been built for each location si, one can use thegeneralized functional linear modelM¨uller & Stadtm¨uller (2005) to adjust the spatial scan statisticwith respect to the covariate Z i and the random function { X i ( t ) , t ∈ T } . To this end, let S k ∈ S and assume that the conditional mean of the outcome variable Y i , (with respect to the covariateinformation and the potential cluster) is defined by the following revised generalized functionallinear model: µ ( k ) i = E ( Y i | S k , Z i , X i ) = Φ − (cid:18) α + δ k ξ ( k ) i + Z (cid:48) i β + (cid:90) T X i ( t ) θ ( t ) dt (cid:19) , (2)where ξ ( k ) i is a binary covariate equal to 1 if the location s i belongs to S k and equal to 0 otherwise,and where Φ( · ) is a known increasing link function. The parameters of interest are the intercept α , δ k which refers to the intensity of the cluster, the coefficients β associated with the p × Z , and the parameter function θ ( · ), which is a smoothing function that can be consideredas a generalization of a slope function.The parameters β and θ ( · ) are fixed inside and outside thepotential cluster, which means that the distributions of the covariates Z and X ( · ) are invariantwith respect to the clustering hypotheses. In other words, the conditional mean of Y i inside S k isfully characterized by its intensity δ k . It should be noted that exp( δ k ) can be interpreted as thecovariate-adjusted relative risk for individuals within the potential cluster S k , relative to the riskfor those outside it. The clustering hypotheses can therefore be expressed as follows: H : δ k = 0 H : δ k > δ k < . Given that Φ( · ) is an increasing function, H means that the mean of Y i inside S k is higher (orlower) than the mean of Y i outside S k .As mentioned above, the spatial scan statistic is based on the likelihood ratio between thesetwo hypotheses. Thus, in order to provide a general framework that can handle various models(Bernoulli, normal, Poisson, etc.), one needs to assume that the outcome variable Y has a known,parametrized, conditional log-likelihood function: F (cid:16) Y i ; µ ( k ) i , σ (cid:16) µ ( k ) i (cid:17)(cid:17) , (3)where σ ( · ) is a positive function defining the variance of Y .Below, we describe the estimation procedure under each hypothesis and then introduce theFMASSS. 4 stimation under the null hypothesis . Under the null hypothesis, model (2) is reduced to aGLFM: µ i = E ( Y i | Z i , X i ) = Φ − (cid:18) α + Z (cid:48) i β + (cid:90) T X i ( t ) θ ( t ) dt (cid:19) . (4)We used the popular estimation procedure developed by M¨uller and Stadtm¨ullerM¨uller & Stadtm¨uller(2005). It is based on a truncation strategy in which the random function and the parameter func-tion are projected into a space of functions generated by a basis of functions with an arbitrarydimension. Let { φ j ( t ) , j = 1 , . . . , K } be the eigenbasis associated with the functional principalcomponent analysis (PCA) of the functional data { X i ( t ) , i = 1 , . . . , n } . For a fixed J , the param-eter function is approximated by its projection in the space of functions generated by the first J eigenfunctions: ˜ θ ( t ) = J (cid:88) j =1 θ j φ j ( t ) . Using this approach, M¨uller & Stadtm¨uller (2005) suggested that the conditional mean (4) couldbe approximated by its truncated version ˜ µ i :˜ µ i = Φ − (cid:16) α + Z (cid:48) i β + C (cid:48) i θ (cid:17) (5)where θ = ( θ , . . . , θ J ) (cid:48) and C i is the coefficient vector of the random function { X i ( t ) , t ∈ T } inthe eigenbasis, which is given by: C ij = (cid:90) T X i ( t ) φ j ( t ) , j = 1 , . . . , K. Using (5), we defined the following truncated log-likelihood function under H ˜ L ( α, β, θ ) = n (cid:88) i =1 F ( Y i ; ˜ µ i , σ (˜ µ i )) (6)It should be noted that (6) is a log-likelihood function associated with a GMLM whose covariatesare Z i and C i , where (cid:98) α , (cid:98) β and (cid:98) θ are the maximum likelihood estimators (MLEs) of α, β and θ respectively. Consequently, the MLE of the parameter function is given by: (cid:98) θ ( t ) = J (cid:88) j =1 (cid:98) θ j φ j ( t ) . The quality of the estimation depends principally on J , i.e. the number of eigenfunctions used inthe truncation strategy. This crucial parameter can be consistently chosen by inspecting the Akaikeinformation criterion (AIC) related to (6), as proved by M¨uller & Stadtm¨uller (2005). Note thatwe used a pre-selected J based on the cumulative inertia. Indeed, we focused on the selection ofa J (using the AIC) with a cumulative inertia value below a given threshold (95% in the presentcase)Ahmed et al. (2018). 5 stimation under the alternative hypothesis . Since the parameters β and θ ( · ) must beindependent of the potential cluster, their estimates under H will be fixed in the alternativehypothesis H . This means that under H , covariate effects are invariant inside and outside thepotential cluster. Hence, one only needs to estimate the parameters α and δ k for each S k ∈ S .This can be achieved by maximizing the following log-likelihood function with respect to the twoscalars: ˜ L k ( α, δ k ) = n (cid:88) i =1 F (cid:16) Y i ; ˜ µ ( k ) i , σ (cid:16) ˜ µ ( k ) i (cid:17)(cid:17) , (7)with ˜ µ ( k ) i = Φ − (cid:18) α + δ k ξ ( k ) i + Z (cid:48) i (cid:98) β + (cid:90) T X i ( t ) (cid:98) θ ( t ) dt (cid:19) . Let us consider (cid:98) α ( k ) and (cid:98) δ k , denoting the MLEs of α k and δ k , respectively. It should be noted thatthe covariate information is added as an offset, which illustrates the above-mentioned assumptionconcerning the independence of the potential cluster vs. the covariates. Functional-model-adjusted spatial scan statistic.
Using the MLEs determined under the twohypotheses, the LLR can be defined as follows:LLR k = ˜ L k (cid:16)(cid:98) α ( k ) , (cid:98) δ k (cid:17) − ˜ L (cid:16)(cid:98) α, (cid:98) β, (cid:98) θ (cid:17) . (8)The MLC is then defined as the potential cluster S k that maximizes this ratio:MLC = argmax S k ∈S { LLR k } . (9)Hence, the FMASSS is defined as the LLR associated with the MLC: λ = max S k ∈S { LLR k } . (10)Since the distribution of λ under H does not have a closed form, the significance of the MLC isevaluated by Monte-Carlo simulation. Each simulation m ( m = 1 , . . . , M ) combines the real data(associated with the covariates) with a random dataset generated for the outcome variable. Thelatter is simulated using a conditional distribution under H ( via (cid:98) α, (cid:98) β and (cid:98) θ ( · )). Let λ (1) , . . . , λ ( M ) denote the observations of the FMASSS on the simulated datasets. According to DwassDwass(1957), the p-value of the FMASSS λ observed in the real data is defined by 1 − R/ ( M + 1), where R is the rank of λ in the ( M + 1)-sample { λ (1) , . . . , λ ( M ) , λ } .The FMASSS is built in three steps: Construction of functional data, and dimension reduction • Construct the functional data by using a suitable basis of functions { ϕ j ( t ) , j = 1 , . . . , K } .6 Apply a functional PCA to the constructed functions { X i ( t ) , i = 1 , . . . , n } . This is equivalentto a multiple PCA on the matrix A Ψ / where Ψ is the K × K matrix with elements (Escabiaset al., 2004) Ψ jr = (cid:90) T ϕ j ( t ) ϕ r ( t ) dt, j, r = 1 , . . . , K. Thus, the eigenfunctions are defined by φ j ( t ) = (cid:80) Kj =1 v lj ϕ l ( t ) , j = 1 , . . . , K where v lj arethe elements of the K × K matrix V = Ψ − / G , where G is the eigenvector matrix associatedwith a multiple PCA of the matrix A Ψ / . Moreover, the coefficients of the functional datain the eigenbasis are given by the n × K matrix C = A Ψ V . • Choose the optimal truncation parameter J ∗ ∈ { , . . . , K } ,i.e. one that minimizes the AICassociated with models with the log-likelihood function defined in (6). Computation of the observed FMASSS • Use J ∗ to estimate (cid:98) α , (cid:98) β and (cid:98) θ under H by using the log-likelihood function defined in (6). • For each potential cluster S k , find (cid:98) α ( k ) and (cid:98) δ k , that maximize (7) by adding the Z i (cid:98) β and C (cid:48) i (cid:98) θ as offsets, then calculate the associated LLR k . Moreover, identify the MLC and its FMASSS λ , over the set of potential clusters. Monte-Carlo simulation • Apply the Monte-Carlo hypothesis testing procedure described above.
This section describes the estimation procedure when the data on the outcome variable Y have aPoisson distribution. Let N i be the measurement of the underlying at-risk population associatedwith the i th location s i . The Poisson model is characterized by the following link function Φ( · ) andthe conditional log-likelihood (3):Φ( t ) = log( t ) and F (cid:16) Y i ; µ ( k ) i (cid:17) = Y i log (cid:16) µ ( k ) i (cid:17) − µ ( k ) i − log ( Y i !) , (11)with µ ( k ) i = E ( Y i | S k , Z i , X i ) = N i exp (cid:18) α + δ k ξ ( k ) i + Z (cid:48) i β + (cid:90) T X i ( t ) θ ( t ) dt (cid:19) . It should be noted that multiplication by N i makes it possible to take account of the underlyingat-risk population as an adjustment covariate. Consequently, log( N i ) is taken as an offset in themodel. Let (cid:98) α , (cid:98) β and (cid:98) θ ( · ) be the MLEs under the null hypothesis. It can be shown that the MLE (cid:98) α is expressed in the following manner (for details, see the Appendix): (cid:98) α = log (cid:32) (cid:80) ni =1 Y i (cid:80) ni =1 ˜ N i (cid:33) , where ˜ N i = N i exp (cid:18) Z (cid:48) i (cid:98) β + (cid:90) T X i ( t ) (cid:98) θ ( t ) dt (cid:19) . (cid:98) α ) can be viewed as the incidence rate under H in the adjusted underlyingat-risk population ˜ N i rather than in the initial underlying at-risk population N i .As detailed in section 2, the estimation procedure under H consists in maximizing the log-likelihood(7), which is expressed as follows:˜ L k ( α, δ k ) = n (cid:88) i =1 Y i (cid:16) α + δ k ξ ( k ) i + log (cid:16) ˜ N i (cid:17)(cid:17) − ˜ N i exp (cid:16) α + δ k ξ ( k ) i (cid:17) − log ( Y i !) , taking its maximum at: (cid:98) α ( k ) = log (cid:32) (cid:80) ni =1 Y i (1 − ξ ( k ) i ) (cid:80) ni =1 ˜ N i (1 − ξ ( k ) i ) (cid:33) and (cid:98) δ k = log (cid:32) (cid:80) ni =1 Y i ξ ( k ) i (cid:80) ni =1 ˜ N i ξ ( k ) i (cid:80) ni =1 ˜ N i (1 − ξ ( k ) i ) (cid:80) ni =1 Y i (1 − ξ ( k ) i ) (cid:33) . It should be noted that exp( (cid:98) δ k ) is the relative risk associated with the potential cluster S k afteradjusting for the underlying at-risk population ˜ N i .Next, LLR k is given by:LLR k = ˜ L k ( (cid:98) α ( k ) , (cid:98) δ k ) − ˜ L ( (cid:98) α, (cid:98) β, (cid:98) θ )= (cid:34) O ( k ) log (cid:32) O ( k ) ˜ N ( k ) (cid:33) + ( O − O ( k ) ) log (cid:32) O − O ( k ) ˜ N − ˜ N ( k ) (cid:33)(cid:35) − O log (cid:18) O ˜ N (cid:19) , (12)where ˜ N = n (cid:88) i =1 ˜ N i , ˜ N ( k ) = n (cid:88) i =1 ˜ N i ξ ( k ) i , O = n (cid:88) i =1 Y i and O ( k ) = n (cid:88) i =1 Y i ξ ( k ) i . It should be noted that (12) is equivalent to the LLR proposed by Kulldorff (1997) for a Poissonmodel, except that the adjusted underlying at-risk population ˜ N i is taken into account (rather than N i ). In other words, adjustment for covariates is equivalent to considering a Poisson model withan underlying at-risk population adjusted under the null hypothesis. We simulated a cluster detection procedure in order to compare the quality of adjustment fora longitudinal confounding factor in three spatial scan statistic models: a univariate model, amultivariate model, and a functional model.
Artificial datasets were generated according to Poisson models by using the geographic locations ofthe n = 94 French administrative areas ( d´epartements , as shown in Figure 6 in the SupplementaryMaterial) and population data from the French national census database ( Institut National de la tatistique et des Etudes Economiques , INSEE). Each location was defined as the d´epartement ’sadministrative center. Two types of non-overlapping cluster (each containing 8 d´epartements ) weredefined and simulated for each artificial dataset. The first was entirely characterized by the clusterintensity δ , namely the true cluster (the areas in green in Figure 6), and the second was characterizedsolely by the effect of the functional covariate, namely the fake cluster (the areas in red in Figure 6). Generation of the artificial datasets . The random functions were simulated as the realizationof the following process in the interval [0 , X i ( t ) = U i h ( t ) + (1 − U i ) h ( t + 4) + (cid:15) i ( t ) for s i outside the fake cluster U i h ( t ) + (1 − U i ) h ( t −
4) + (cid:15) i ( t ) for s i inside the fake cluster where h ( t ) = max (6 − | t − | , U i is uniform, and (cid:15) i ( t ) are uncorrelated, normally distributedrandom variables. A total of 94 curves were simulated with respect to the random function X ( · )(Figure 1, left panel) and used to generate data from the following Poisson model:Figure 1: Simulation study: an example of the generated longitudinal data before (left panel) andafter (right panel) smoothing. The red curves correspond to the observations in the fake cluster . µ i = N i exp (cid:18) α + δξ i + (cid:90) X i ( t ) θ ( t ) dt (cid:19) (13)where N i corresponds to the at-risk population in the i th d´epartement and ξ i = 1 for d´epartements located in the true cluster . Firstly, an intercept α = − .
51 was chosen to ensure a diseaseincidence of approximately 10 − in the absence of a cluster and the absence of a confoundingcovariate. Secondly, the confounding functional covariate was introduced into the model using θ ( t ) = t sin( π t + π ) , t ∈ [0 ,
21] in such a way that the mean value of the outcome was twice as9igh inside the fake cluster as outside. Thirdly, different values of the true cluster intensity wereconsidered and expressed in terms of the relative risk: exp( δ ) ∈ { , . , . , . , . , } . Comparison of three models . To illustrate the performance of the functional approach toadjustment, we compared three models. We considered that for each location, the functionalcovariate was only observed, at 70 time points equally spaced throughout the interval [0 , ,
21] (the right panel in Figure 1).For each value of the cluster intensity, 1000 artificial datasets were simulated. The three modelswere compared with regard to three distinct criteria: the power to detect a significant cluster ( true or fake ), the true-positive (TP) rate, and the false-positive (FP) rate. The power of each modelwas defined as the proportion of datasets highlighting a significant cluster (a true or fake cluster),with a type I error of 0 .
05 and 999 Monte-Carlo simulations. The TP and FP rates were calculatedaccording to Cucala et al.’s methodCucala et al. (2018).
The results of the simulation study are shown in Figure 2 (see Table 2 in the supplementarymaterial for more details). The adjustment based on a univariate model (with the average of thelongitudinal data as a covariate) failed to detect the true cluster as the MLC. This was particularlythe case for cluster intensity values that were low or moderate, relative to the intensity of the fakecluster . The univariate model detected the fake cluster as the MLC, as illustrated by the curvesfor the power and the TP and FP rates in Figure 2. The adjustments based on the functionaland multivariate models did not differ significantly with regard to the power or the TP and FPrates for detecting the true cluster . The functional model performed slightly better for high clusterintensities (exp( δ ) = 1 . . fake cluster as the MLC. We considered data provided by the INSEE on premature mortality in France between 1998 and2013. Premature mortality was defined as death before the age of 65. For each of the 94 French10igure 2: Simulation study: comparison of the univariate, multivariate and functional models withregard to the quality of adjustment. For each model, the power curves and the true-positive andfalse-positive rates for the detection of the true cluster (A) and the fake cluster (B) as most likelycluster are shown. The quantity exp ( δ ) refers to the cluster intensity.11 ´epartements (administrative areas) and for the period between 1998 and 2013, the mean prematuremortality rate was defined as the number of persons who died before the age of 65, divided by themean number of persons aged under 65. Hereafter, the outcome variable refers to the number ofpremature deaths per d´epartement between 1998 and 2013. The spatial distribution of prematuremortality in France is shown in Figure 7 (supplementary materials).It is known that premature mortality affects men more than women, and is correlated with socio-economic status: the most deprived are more likely to die youngStringhini et al. (2017). Thus, itis important to adjust the spatial cluster detection analyzes for the confounding factors of genderand socio-economic status. To this end, we considered the mean proportion of men aged under 65over the period from 1998 to 2013 for each d´epartement (as provided by the INSEE database). Wechose the mean proportion because it did not greatly vary over the 16-year period (see Figure 8in the supplementary materials). We considered the unemployment rate (in %) for each quarterof the period from 1998 to 2013 as a proxy for socioeconomic status - leading to 64 values per d´epartement . Figure 3 shows both the spatial distribution of the mean unemployment rate over theentire period and the change over time in the unemployment rate for each of the d´epartements . Themean unemployment rate is spatially heterogeneous. Furthermore, the unemployment rate variedmarkedly between 1998 and 2013, and thus must be considered as a longitudinal confounding factor. s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − s t qua r t e r − U ne m p l o y m en t r a t e ( % ) Quarters of the period 1998−2013
Figure 3: The unemployment rate in France from 1998 to 2013, by d´epartement . The leftpanel shows the unemployment rate averaged over the 16-year period from 1998 to 2013 for each d´epartement . The right panel shows the change over time in the unemployment rate between 1998and 2013; each curve corresponds to a d´epartement .12 .2 Spatial clusters detection
In order to detect spatial clusters of premature mortality, four different Poisson models were con-sidered. Each model was adjusted for gender by introducing the mean proportion of men by d´epartement over the period from 1998 to 2013 as a covariate. The four models are describedbelow:1. Model 1 (the non-adjusted model): no adjustment of the outcome variable for the unemploy-ment rate.2. Model 2 (the univariate model): adjustment of the outcome variable for the unemploymentrate, using the mean rate over the period from 1998 to 2013 by d´epartement as a singlecovariate.3. Model 3 (the multivariate model): adjustment of the variable outcome for the unemploymentrate by considering the each of the quarterly values by d´epartement for the period from 1998to 2013 as a covariate. Thus, 64 covariates related to the unemployment rate were introducedinto the model.4. Model 4 (the functional model): adjustment of the outcome variable for the unemploymentrate using smoothed rate curves as a functional covariate. The curves were built from thedata using a cubic B-spline basis defined by 15 knots in the interval [0 , δ ) >
1) or with a low risk of premature mortality (RR = exp( δ ) < The statistically significant spatial clusters detected by the non-adjusted and univariate models(models 1 and 2) are presented in Figure 4, and those identified by the multivariate and thefunctional models (models 3 and 4) are displayed in Figure 5. Detailed information on the spatialclusters is presented in Table 1.Model 1 identified 6 significant spatial clusters of premature mortality: 3 low-risk clusters (RR:0.79 to 0.86) and 3 high-risk clusters (RR: 1.18 to 1.28) (top panel in Figure 4). The MLC (Cluster1, RR=1.28) was located in northern France, and was characterized by a high unemployment rate.Similarly, the first secondary cluster (Cluster 2, RR=0.79) was located in eastern France and wascharacterized by a low unemployment rate.Model 2 also identified 6 significant spatial clusters of premature mortality: 3 low-risk clusters(RR: 0.77 to 0.86) and 3 high-risk clusters (RR: 1.08 to 1.20) (bottom panel in Figure 4). Likemodel 1, model 2 also detected the cluster with a high unemployment rate in northern France(Cluster 6, RR: 1.08) and the cluster with a low unemployment rate in eastern France (Cluster 2,13igure 4: Significant spatial clusters of premature mortality detected by the model not adjusted forthe unemployment rate (top panel A) and the univariate model (bottom panel B). Spatial clustersin red indicate a high risk of premature mortality, and those in blue show indicate a low risk ofpremature mortality. The clusters are numbered as follows: cluster 1 is the most likely cluster, andthe other (secondary) clusters a number in descending order for their test statistic. For each cluster,the unemployment rate curves (from 1998 to 2013) in each d´epartement are presented. Curves inblue correspond to d´epartements inside the cluster, curves in red correspond to d´epartements outsidethe cluster, and the curve in green is the mean curve.14igure 5: Significant spatial clusters of premature mortality detected by the multivariate model(top panel A) and the functional model (bottom panel B). Spatial clusters in red color indicate ahigh risk of premature mortality, and those in blue indicate a low risk of premature mortality. Theclusters are numbered as follows: cluster 1 is the most likely cluster, and the other (secondary)clusters a number in descending order for their test statistic. For each cluster, the unemploymentrate curves (from 1998 to 2013) in each d´epartement are presented. Curves in blue correspond to d´epartements inside the cluster, curves in red correspond to d´epartements outside the cluster, andthe curve in green is the mean curve. 15able 1: Statistically significant spatial clusters of premature mortality detected in the absence ofadjustment for the unemployment rate (Model 1), a univariate model (Model 2), a multivariatemodel (Model 3) and a functional model (Model 4).Model Cluster d´epartements relative risk (exp( δ )) LLR P-valueModel 1 1 4 1.28 4648.24 0.0012 7 0.79 3225.33 0.0013 9 0.86 2939.85 0.0014 4 1.28 1131.82 0.0015 3 1.18 856.63 0.0016 2 0.80 827.15 0.001Model 2 1 12 1.19 1531.91 0.0012 8 0.86 1458.09 0.0013 3 0.85 1120.88 0.0014 3 1.20 1091.54 0.0015 2 0.77 1090.26 0.0016 3 1.08 405.53 0.001Model 3 1 6 0.86 916.17 0.0012 3 1.17 795.61 0.0013 4 1.19 511.30 0.001Model 4 1 3 1.24 1455.90 0.0012 2 0.74 1398.57 0.0013 5 1.21 917.15 0.00116R=0.86) - emphasizing the poor quality of adjustment when using solely the mean unemploymentrate over the study period.Model 3 detected 3 statistically significant spatial clusters of premature mortality: a low-riskcluster (RR: 0.86) and 2 high-risk clusters (RR: 1.17 and 1.19, respectively) (top panel in Figure 5).It should be noted that the clusters characterized by a high or low unemployment rate (in northernand eastern France, respectively) detected by models 1 and 2 were not detected by model 3. TheMLC in model 3 (Cluster 1; RR: 0.86) highlighted significant heterogeneity in the unemploymentrates because it included a d´epartement with a high unemployment rate and a d´epartement with alow unemployment rate.Model 4 highlighted 3 statistically significant spatial clusters of premature mortality: 1 low-riskcluster (RR: 0.74) and 2 high-risk clusters (RR: 1.21 and 1.24, respectively) (bottom panel in Figure5). These three clusters are characterized by unemployment rate curves close to the average curve(in green). This result shows that the cluster detection was well adjusted for the unemploymentrate. Here, we developed an FMASSS in order to adjust cluster detection for longitudinal confoundingfactors in a purely spatial analysis. In other words, we addressed the issue of adjusting a spatialscan statistic for repeatedly measured covariates whose values vary over time. The FMASSS wasderived by modeling the longitudinal confounding factor as a random function. The correspondingbasis of functions depends principally on the nature of the longitudinal data. One advantage ofusing a random function is its consideration of the entire set of longitudinal data, rather than arough approximation by a statistical indicator such as the mean (which is often the case in spatialepidemiological studies). Furthermore, this functional approach makes it possible to overcome (i)the missing data problem related to the difference in measurement times between spatial units, and(ii) the high dimensionality inherently associated with multivariate approaches when longitudinaldata are measured at many time points. Our approach was built into a general framework for usewith various parametric models (Bernoulli, Gaussian, and Poisson models, etc.). For a Poissonmodel, it has been shown that the FMASSS is equivalent to Kulldorff’s classical spatial scan statis-tic in an adjusted population (Kulldorff, 1997).We next simulated and compared different way of adjusting the spatial scan statistics for lon-gitudinal confounders. The univariate model did not adjust the data well, and detected a fakecluster when the cluster intensity was weak or moderate. In contrast, the multivariate and func-tional models were both able detect the true cluster with a high power. The functional model wasslightly better than the multivariate model for high cluster intensities. It should be noted that thisgeneral power equivalence for the two latter models is partly due to the design of the simulationstudy. In fact, the simulation represented an ideal situation because the measurement times for thelongitudinal data were the same in all the spatial units; hence, there were no missing data in themultivariate model. 17hese models were applied to the detection of spatial clusters of premature mortality in Franceover the period from 1998 to 2013. The proportion of men by d´epartement and the unemploymentrates for each quarter of the study period (64 values per d´epartement ) were considered as confound-ing variables. The clusters considered to be significant in the univariate model (based on the meanunemployment rate over the entire study period) were characterized by unemployment rates thatwere far from the mean. This finding highlighted the univariate model’s poor ability to adjust fora longitudinal confounding factor summarized as the mean. In the multivariate model, the MLCalso included d´epartements with unemployment rates that were far from the mean value - againshowing that the adjustment was not optimal. In contrast, the spatial clusters of premature mor-tality detected by the functional model had unemployment rates that were very close to the mean -testifying to high-quality adjustment for the longitudinal confounding factor. In the present appli-cation, it would have been interesting to adjust to environmental factors that are usually measureddaily or weekly. The new method presented here is very well suited to this type of longitudinal data.It should be borne in mind that the new FMASSS deals with round-shaped clusters only (thesimplest case). However, clusters may be elongated in some situations - such as the aggregationof cases of water-borne disease along a river. However, the FMASSS can easily be extended toother spatial cluster shapes, such as elliptical clusters (Kulldorff et al., 2006), graph-based clusters(Cucala et al., 2013) or (for a spatiotemporal framework) cylindrical clusters (Kulldorff et al., 2005).Lastly, the FMASSS can be extended to spatiotemporal frameworks in which the outcomemeasure and longitudinal confounders are measured on different time scales (e.g. an outcomemeasured annually and a longitudinal confounding factor measured monthly). In this context, thelongitudinal data can be represented by a random function for each of the outcome time units.
References
Ahmed, M., Attouch, M., & Dabo-Niang, S. (2018). Binary functional linear models under choice-based sampling.
Econometrics and statistics , , 134–152.Bhatt, V., & Tiwari, N. (2014). A spatial scan statistic for survival data based on weibull distri-bution. Statistics in medicine , , 1867–1876.Cressie, N. (1977). On some properties of the scan statistic on the circle and the line. Journal ofApplied Probability , , 272–283. URL: .Cucala, L., Demattei, C., Lopes, P., & Ribeiro, A. (2013). A spatial scan statistic for case eventdata based on connected components. Computational Statistics , , 357–369.Cucala, L., Genin, M., Lanier, C., & Occelli, F. (2017). A multivariate gaussian scan statisticforspatial data. Spatial Statistics , , 66–74.Cucala, L., Genin, M., Occelli, F., & Soula, J. (2018). A multivariate nonparametric scan statisticfor spatial data. Spatial Statistics , , 1–14. 18wass, M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals ofMathematical Statistics , (pp. 181–187).Escabias, M., Aguilera, A. M., & Valderrama, M. J. (2004). Principal component estimation offunctional logistic regression: discussion of two different approaches.
Journal of NonparametricStatistics , , 365–384.Huang, L., Kulldorff, M., & Gregorio, D. (2007). A spatial scan statistic for survival data. Biomet-rics , , 109–118.James, G. M. (2002). Generalized linear models with functional predictors. Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , , 411–432.Jung, I. (2009). A generalized linear models approach to spatial scan statistics for covariate ad-justment. Statistics in medicine , , 1131–1143.Jung, I., Kulldorff, M., & Klassen, A. C. (2007). A spatial scan statistic for ordinal data. Statisticsin medicine , , 1594–1607.Klassen, A. C., Kulldorff, M., & Curriero, F. (2005). Geographical clustering of prostate cancergrade and stage at diagnosis, before and after adjustment for risk factors. International journalof health geographics , , 1.Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics-Theory and methods , , 1481–1496.Kulldorff, M. (1999). Spatial scan statistics: models, calculations, and applications. In Scanstatistics and applications (pp. 303–322). Springer.Kulldorff, M., Feuer, E. J., Miller, B. A., & Freedma, L. S. (1997). Breast cancer clusters in thenortheast united states: a geographic analysis.
American journal of epidemiology , , 161–170.Kulldorff, M., Heffernan, R., Hartman, J., Assun¸cao, R., & Mostashari, F. (2005). A space–timepermutation scan statistic for disease outbreak detection. PLoS medicine , , e59.Kulldorff, M., Huang, L., & Konty, K. (2009). A scan statistic for continuous data based on thenormal probability model. International journal of health geographics , , 58.Kulldorff, M., Huang, L., Pickle, L., & Duczmal, L. (2006). An elliptic spatial scan statistic. Statistics in medicine , , 3929–3943.Kulldorff, M., Mostashari, F., Duczmal, L., Katherine Yih, W., Kleinman, K., & Platt, R. (2007).Multivariate scan statistics for disease surveillance. Statistics in medicine , , 1824–1833.M¨uller, H.-G., & Stadtm¨uller, U. (2005). Generalized functional linear models. The Annals ofStatistics , , 774–805. doi: .19eill, D. B. (2012). Fast subset scan for spatial pattern detection. Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) , , 337–360.Ramsay, J., & Silverman, B. (2005). Functional data analysis. 2nd springer. New York , .Rothman, K. J., Greenland, S., Lash, T. L. et al. (2008). Modern epidemiology, .Stringhini, S., Carmeli, C., ..., & Zins, M. (2017). Socioeconomic status and the 25x25 risk factorsas determinants of premature mortality: a multicohort study and meta-analysis of 1.7 millionmen and women.
The Lancet , , 1229 – 1237. A An explicit intercept estimator in the Poisson model under H Under the null hypothesis, the truncated log-likelihood function (6) associated with the Poissonmodel (11) is given by:˜ L ( α, β, θ ) n (cid:88) i =1 Y i (cid:16) log( N i ) + α + Z (cid:48) i β + C (cid:48) i θ (cid:17) − n (cid:88) i =1 N i exp (cid:16) α + Z (cid:48) i β + C (cid:48) i θ (cid:17) − n (cid:88) i =1 log( Y i !) . and has the following first partially derivative with respect to α : ∂ ˜ L∂α ( α, β, θ ) = n (cid:88) i =1 Y i − n (cid:88) i =1 N i exp (cid:16) α + Z (cid:48) i β + C (cid:48) i θ (cid:17) . It should be borne in mind that the MLEs (cid:98) α , (cid:98) β and (cid:98) θ of α , β and θ , respectively, have to satisfythe first-order condition: ∂ ˜ L∂α (cid:16)(cid:98) α, (cid:98) β, (cid:98) θ (cid:17) = n (cid:88) i =1 Y i − n (cid:88) i =1 N i exp (cid:16)(cid:98) α + Z (cid:48) i (cid:98) β + C (cid:48) i (cid:98) θ (cid:17) = 0 . Therefore, the coefficient (cid:98) α has an explicit expression with respect to the other coefficients (cid:98) β and (cid:98) θ exp ( (cid:98) α ) = (cid:80) ni =1 Y i (cid:80) ni =1 ˜ N i , with ˜ N i = N i exp (cid:16) Z (cid:48) i (cid:98) β + C (cid:48) i (cid:98) θ (cid:17) . (cid:3) B Supplementary material true cluster or a fake cluster are given.exp( δ ) Cluster Univariate model Multivariate model Functional modelPower TP FP Power TP FP Power TP FP1 True 0.012 0.014 0.100 0.075 0.169 0.104 0.083 0.183 0.109Fake 0.914 0.853 0.022 Fake 0.893 0.857 0.026 0.037 0.090 0.117 0.041 0.097 0.1161.4 True 0.093 0.084 0.105 0.344
Fake 0.862 0.813 0.037 0.044 0.074 0.123 0.039 0.069 0.1201.6 True 0.248 0.224 0.102 0.656 0.709
Fake 0.588 0.553 0.072 0.021 0.023 0.105 0.015 0.018 0.1042.0 True 0.676 0.642 0.054 0.948 0.898 0.018
Fake 0.366 0.345 0.081 0.008 0.011 0.100 0.006 0.007 0.09921igure 6: Simulation study: the true and fake simulated clusters among the 94 d´epartements (administrative areas) of France. 22igure 7: Spatial distribution of the cumulative incidence of premature mortality in France duringthe period from 1998 to 2013.
Years P e r c en t age o f m en i n t he popu l a t i on 0 . . . . . . . . Figure 8: Change over time in the proportion of men in the underlying population in France duringthe period from 1998 to 2013. Each curve corresponds to a d´epartementd´epartement