Locally Adaptive Smoothing with Markov Random Fields and Shrinkage Priors
James R. Faulkner
Quantitative Ecology and Resource Management, University of Washington, Seattle, Washington, U.S.A.
Vladimir N. Minin
Departments of Statistics and Biology, University of Washington, Seattle, Washington, U.S.A.
Abstract
We present a locally adaptive nonparametric curve fitting method that operates within a fully Bayesian framework. This method uses shrinkage priors to induce sparsity in order-k differences in the latent trend function, providing a combination of local adaptation and global control. Using a scale mixture of normals representation of shrinkage priors, we make explicit connections between our method and k-th order Gaussian Markov random field smoothing. We call the resulting processes shrinkage prior Markov random fields (SPMRFs). We use Hamiltonian Monte Carlo to approximate the posterior distribution of model parameters because this method provides superior performance in the presence of the high dimensionality and strong parameter correlations exhibited by our models. We compare the performance of three prior formulations using simulated data and find the horseshoe prior provides the best compromise between bias and precision. We apply SPMRF models to two benchmark data examples frequently used to test nonparametric methods. We find that this method is flexible enough to accommodate a variety of data generating models and offers the adaptive properties and computational tractability to make it a useful addition to the Bayesian nonparametric toolbox.

Nonparametric curve fitting methods find extensive use in many aspects of statistical modeling, such as nonparametric regression, spatial statistics, and survival models, to name a few. Although these methods form a mature area of statistics, many computational and statistical challenges remain when such curve fitting needs to be incorporated into multi-level Bayesian models with complex data generating processes. This work is motivated by the need for a curve fitting method that could adapt to local changes in the smoothness of a function, including abrupt changes or jumps, and would not be restricted by the nature of observations and/or their associated likelihood. Our desired method should offer measures of uncertainty for use in inference, should be relatively simple to implement, and should be computationally efficient. There are many methods available for nonparametric curve fitting, but few that meet all of these criteria.

Gaussian process (GP) regression (Neal, 1998; Rasmussen and Williams, 2006) is a popular Bayesian nonparametric approach for functional estimation that places a GP prior on the function of interest. The covariance function must be specified for the GP prior, and the isotropic covariance functions typically used are not locally adaptive. Nonstationary covariance functions have been investigated to make GP regression locally adaptive (Brahim-Belhouari and Bermak, 2004; Paciorek and Schervish, 2004, 2006). Any finite dimensional representation of GPs involves manipulations of typically high dimensional Gaussian vectors with mean vector and covariance matrix induced by the GP. Many GPs, including the ones with nonstationary covariance functions, suffer from the high computational cost imposed by manipulations (e.g., Cholesky factorization) of the dense covariance matrix in the finite dimensional representation.

Sparsity can be imposed in the precision matrix (inverse covariance matrix) by constraining a finite dimensional representation of a GP to be a Gaussian Markov random field (GMRF), and then computational methods for sparse matrices can be employed to speed computations (Rue, 2001; Rue and Held, 2005).
Fitting smooth functions with GMRFs has been practiced widely. These methods use difference equations as approximations to continuous function derivatives to induce smoothing, and have a direct relationship to smoothing splines (Speckman and Sun, 2003). GMRFs have also been used to develop Bayesian adaptive smoothing splines (Lang et al., 2002; Yue et al., 2012, 2014). A similar approach is the nested GP (Zhu and Dunson, 2013), which puts a GP prior on the order-k function derivative, which is in turn centered on another GP. This approach has good adaptive properties but has not been developed for non-Gaussian data.

Differencing has commonly been used as an approach to smoothing and trend estimation in time series analysis, signal processing, and spatial statistics. Its origins go back at least to Whittaker (1922), who suggested a need for a trade-off between fidelity to the data and smoothness of the estimated function. This idea is the basis of some frequentist curve-fitting methods based on penalized least squares, such as the smoothing spline (Reinsch, 1967; Wahba, 1975) and the trend filter (Kim et al., 2009; Tibshirani, 2014). These penalized least-squares methods are closely related to regularization methods for high-dimensional regression, such as ridge regression (Hoerl and Kennard, 1970) and the lasso (Tibshirani, 1996), due to the form of the penalties imposed.

Bayesian versions of methods like the lasso (Park and Casella, 2008) utilize shrinkage priors in place of penalties. Therefore, it is interesting to investigate how these shrinkage priors (Polson and Scott, 2010; Griffin et al., 2013; Bhattacharya et al., 2015) perform when applied to differencing-based time series smoothing. Although shrinkage priors have been used explicitly in the Bayesian nonparametric regression setting for regularization of wavelet coefficients (Abramovich et al., 1998; Johnstone and Silverman, 2005; Reményi and Vidakovic, 2015) and for shrinkage of order-k differences of basis spline coefficients in adaptive Bayesian P-splines (Scheipl and Kneib, 2009), a Bayesian version of the trend filter and Markov random field (MRF) smoothing with shrinkage priors has not been thoroughly investigated. To our knowledge, only Roualdes (2015), independently from our work, looked at a Laplace prior-based Bayesian version of the trend filter in the context of a normal response model. In this paper, we conduct a thorough investigation of smoothing with shrinkage priors applied to MRFs for Gaussian and non-Gaussian data. We call the resulting models shrinkage prior Markov random fields (SPMRFs).

We borrow the idea of shrinkage priors from the sparse regression setting and apply it to the problem of function estimation. We take the perspective that nonparametric curve fitting is essentially a regularization problem where estimation of an unknown function can be achieved by inducing sparsity in its order-k derivatives. We propose a few fully Bayesian variations of the trend filter (Kim et al., 2009; Tibshirani, 2014) which utilize shrinkage priors on the k-th order differences in values of the unknown target function. The shrinkage imposed by the priors induces a locally adaptive smoothing of the trend.
The fully Bayesian implementation allows representation of parameter uncertainty through posterior distributions and eliminates the need to specify a single global smoothing parameter by placing a prior distribution on the smoothing parameter, although complete automation is not possible, so we offer ways to parameterize the global smoothing prior. In Section 2 we provide a derivation of the models starting from penalized frequentist methods and show the relationship to GMRF models. In Section 2 we also describe our method of sampling from the posterior distribution of the parameters using Hamiltonian Monte Carlo (HMC), which is efficient and straightforward to implement. In Section 3 we use simulations to investigate performance properties of the SPMRF models under two different prior formulations and compare results to those for a GMRF with constant precision. We show that the choice of shrinkage prior affects the smoothness and local adaptive properties. In Section 4 we apply the method to two example data sets which are well known in the nonparametric regression setting.

We start by reviewing a locally adaptive penalized least squares approach to nonparametric regression known as the trend filter (Kim et al., 2009; Tibshirani and Taylor, 2011; Tibshirani, 2014) and use that as a basis to motivate a general Bayesian approach that utilizes shrinkage priors in place of roughness penalties. We first consider the standard nonparametric regression problem of estimating an unknown function f. We let θ represent a vector of values of f on a discrete uniform grid t ∈ {1, 2, . . . , n}, and we assume y = θ + ε, where ε ∼ N(0, Iσ²), and y and ε are vectors of length n. Here all vectors are column vectors. Following Tibshirani (2014) with slight modification, the least squares estimator of the k-th order trend filtering estimate θ̂ is

θ̂ = argmin_θ (1/2)‖y − θ‖₂² + λ‖D^(k) θ‖₁,   (1)

where ‖·‖_q represents the L_q vector norm, and D^(k) is an (n − k) × n forward difference operator matrix of order k, such that the i-th element of the vector ∆^k θ = D^(k) θ is the forward difference

∆^k θ_i = (−1)^k Σ_{j=0}^{k} (−1)^j (k choose j) θ_{i+j}.

Note that D^(k) has recursive properties such that D_n^(k) = D_{n−k+1}^(1) D_n^(k−1), where D_m^(h) has dimensions (m − h) × m. The objective function in equation (1) balances the trade-off between minimizing the squared deviations from the data (the first term in the sum on the right) and minimizing the discretized roughness penalty of the function f (the second term in the sum on the right). The smoothing parameter λ ≥ 0 controls the strength of the penalty: as λ goes to 0 we get least squares estimation, and as λ gets large, the roughness penalty dominates, resulting in a function with k-th order differences approaching 0 for all t. The trend filter produces a piecewise polynomial function of t_1, . . . , t_n with degree k − 1 as an estimate of f. Increasing the order of the difference operator will enforce a smoother function.

The L₁ penalty in equation (1) gives the trend filter its locally adaptive smoothing properties. Tibshirani (2014) shows that the trend filter is very similar in form and performance to smoothing splines and locally adaptive regression splines, but the trend filter has a finer level of local adaptivity than smoothing splines.
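To make the objective in equation (1) concrete, the following is a minimal R sketch (our illustration, not code from the authors' software) that evaluates the trend filter objective for a candidate trend, using R's built-in diff() for the order-k forward differences:

```r
# Trend filter objective from equation (1):
# 0.5 * ||y - theta||_2^2 + lambda * ||D^(k) theta||_1,
# where diff(theta, differences = k) computes D^(k) theta.
trend_filter_objective <- function(y, theta, lambda, k) {
  0.5 * sum((y - theta)^2) + lambda * sum(abs(diff(theta, differences = k)))
}

# Example: compare a rough and a smooth candidate fit to noisy step data.
set.seed(42)
y <- c(rep(0, 50), rep(5, 50)) + rnorm(100)
trend_filter_objective(y, theta = y, lambda = 10, k = 1)                 # no smoothing
trend_filter_objective(y, theta = rep(mean(y), 100), lambda = 10, k = 1) # maximal smoothing
```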
A main difference between the trend filter and smoothing splines is that the latter uses a squared L₂ penalty, which is the same penalty used in ridge regression (Hoerl and Kennard, 1970). Note that the L₁ penalty used by the trend filter is also used by lasso regression (Tibshirani, 1996), and the trend filter is a form of generalized lasso (Tibshirani and Taylor, 2011; Tibshirani, 2014). In the linear regression setting with regression coefficients β_j, the L₁ and squared L₂ penalties can be represented by the generalized ridge penalty λ Σ_j |β_j|^q (Frank and Friedman, 1993), where q = 1 gives the L₁ penalty and q = 2 the squared L₂ penalty; letting q go to zero results in all-subsets selection regression (Tibshirani, 2011). Based on what we know about lasso regression, subset selection regression, and ridge regression, we expect a penalty closer to subset selection to do better for fitting functions with a small number of large jumps, a trend filter penalty (L₁) to do better for fitting functions with small to moderate deviations from polynomials of degree k − 1, and a smoothing spline (squared L₂) penalty to do better for smooth polynomial-like functions with no jumps.
This distinction will become important later when we assess the performance of different Bayesian formulations of the trend filter.

One can translate the penalized least squares formulation in equation (1) into either a penalized likelihood formulation or a Bayesian formulation. Penalized least squares can be interpreted as minimizing the penalized negative log-likelihood

−l_p(θ | y) = −l(θ | y) + p(θ | λ),

where l(θ | y) is the unpenalized log-likelihood and p(θ | λ) is the penalty. It follows that maximization of the penalized log-likelihood is directly comparable to finding the mode of the log-posterior in the Bayesian formulation, where the penalty is represented as a prior. This implies independent Laplace (double-exponential) priors on the ∆^k θ_j, where j = 1, . . . , n − k, for the trend filter formulation in equation (1). That is,

p(∆^k θ_j | λ) = (λ/2) exp(−λ|∆^k θ_j|).

This is a well-known result that has been used in deriving a Bayesian form of the lasso (Tibshirani, 1996; Figueiredo, 2003; Park and Casella, 2008). Note that putting independent priors on the k-th order differences results in an improper joint prior p(θ | λ), which can be made proper by including a proper prior on the first k θ's.
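To spell out this correspondence for normal observations (a short derivation we add here, with a scaling convention consistent with equation (1)):

```latex
% Negative log-posterior under y ~ N(theta, sigma^2 I) and independent
% Laplace(lambda) priors on the order-k differences:
-\log p(\theta \mid y, \lambda)
  = \frac{1}{2\sigma^2}\lVert y - \theta \rVert_2^2
    + \lambda \lVert D^{(k)}\theta \rVert_1 + \text{const},
% so with sigma^2 = 1 the posterior mode is exactly the trend filter
% estimate of equation (1).
```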
The Laplace prior falls into a class of priors commonly known as shrinkage priors. An effective shrinkage prior has the ability to shrink noise to zero yet retain and accurately estimate signals (Polson and Scott, 2010). These properties translate into a prior density function that has a combination of high mass near zero and heavy tails. The high density near zero acts to shrink small values close to zero, while the heavy tails allow large signals to be maintained. A simple prior developed for subset selection in the Bayesian setting is the spike-and-slab prior, which is a mixture distribution between a point mass at zero and a continuous distribution (Mitchell and Beauchamp, 1988). This prior works well for model selection, but some drawbacks are that it forces small signals to be exactly zero, and computational issues can make it difficult to use (Polson and Scott, 2010). There has been much interest in developing priors with continuous distributions (one group) that retain the variable selection properties of the spike-and-slab (two-group) yet do so by introducing sparsity through shrinkage (Polson and Scott, 2010). This approach allows all of the coefficients to be nonzero, but most are small and only some are large. Many such shrinkage priors have been proposed, including the normal-gamma (Griffin et al., 2010), generalized double-Pareto (Armagan et al., 2013), horseshoe (Carvalho et al., 2010), horseshoe+ (Bhadra et al., 2015), and Dirichlet-Laplace (Bhattacharya et al., 2015). The Laplace prior lies somewhere between the normal prior and the spike-and-slab in its shrinkage abilities, yet most shrinkage priors of current research interest have sparsity-inducing properties closer to those of the spike-and-slab. Our main interest is in comparing the Laplace prior to other shrinkage priors in the context of nonparametric smoothing.

It is clear that shrinkage priors other than the lasso could represent different smoothing penalties and therefore could lead to more desirable smoothing properties. There is a large and growing number of shrinkage priors in the literature. It is not our goal to compare and characterize properties of Bayesian nonparametric function estimation under all of these priors. Instead, we wish to investigate a few well known shrinkage priors and demonstrate as proof of concept that adaptive functional estimation can be achieved with shrinkage priors. Further research can focus on improvements to these methods. What follows is a general description of our modeling approach and the specific prior formulations that will be investigated through the remainder of the paper.

We assume the n observations y_i, where i = 1, . . . , n, are independent and follow some distribution dependent on the unknown function values θ_i, and possibly other parameters ξ, at discrete points t. We further assume that the order-k forward differences in the function parameters, ∆^k θ_j, where j = 1, . . . , n − k, are independent and identically distributed conditional on a global scale parameter which is a function of the smoothing parameter λ. These assumptions result in the following general hierarchical form:

y_i | θ_i, ξ ∼ p(y_i | θ_i, ξ),
∆^k θ_j | λ ∼ p(∆^k θ_j | λ),
λ ∼ p(λ),  ξ ∼ p(ξ).   (2)

One convenient trait of many shrinkage priors, including the Laplace, the logistic, and the t-distribution, is that they can be represented as scale mixtures of normal distributions (Andrews and Mallows, 1974; West, 1987; Polson and Scott, 2010). The conditional form of scale mixture densities leads naturally to hierarchical representations. This can allow some otherwise intractable density functions to be represented hierarchically with standard distributions and can ease computation. To take advantage of this hierarchical structure, we restrict the densities p(∆^k θ_j | λ) to be scale mixtures of normals, which allows us to induce a hierarchical form in our model formulation by introducing latent local scale parameters, τ_j. Here the order-k differences in the function parameters, ∆^k θ_j, are conditionally normally distributed with mean zero and variance τ_j², and the τ_j are independent and identically distributed with a global scale parameter which is a function of the smoothing parameter λ. The distribution statement for ∆^k θ_j in equation (2) can then be replaced with the following hierarchical representation:

∆^k θ_j | τ_j ∼ N(0, τ_j²),  τ_j | λ ∼ p(τ_j | λ).   (3)

To complete the model specification, we place proper priors on θ_1, . . . , θ_k. This maintains propriety and can improve computational performance for some Markov chain Monte Carlo (MCMC) samplers. We start by setting θ_1 ∼ N(µ, ω²), where µ and ω can be constants or allowed to follow their own distributions. Then for k ≥ 2 and h = 1, . . . , k − 1, we let ∆^h θ_1 | α_h ∼ N(0, α_h²) and α_h | λ ∼ p(α_h | λ), where p(α_h | λ) has the same form as p(τ_j | λ).
That is, we assume the order-h differences are independent with scale parameters that follow the same distribution as the order-k differences. For most situations the order k will be less than 4, so issues of scale introduced by assuming the same distribution on the scale parameters for the lower and higher order differences will be minimal. One could alternatively adjust the scale parameter of each p(α_h | λ) to impose smaller variance for lower order differences.

For the remainder of the paper we investigate two specific forms of shrinkage priors: the Laplace and the horseshoe. We later compare the performance of these two priors to the case where the order-k differences follow identical normal distributions. The following provides specific descriptions of our shrinkage prior formulations.

Laplace. As we showed previously, this prior arises naturally from an L₁ penalty, making it the default prior for Bayesian versions of the lasso (Park and Casella, 2008) and trend filter. The Laplace distribution is leptokurtic and features high mass near zero and exponential tails (Figure 1). Various authors have investigated its shrinkage properties (Griffin et al., 2010; Kyung et al., 2010; Armagan et al., 2013).
[Figure 1: Shapes of prior distributions (left) and associated tail behavior (right) for the normal, Laplace, and horseshoe priors used for p(∆^k θ | λ).]

We allow the order-k differences ∆^k θ_j to follow a Laplace distribution conditional on a global scale parameter γ = 1/λ, and we allow γ to follow a half-Cauchy distribution with scale parameter ζ. That is,

∆^k θ_j | γ ∼ Laplace(γ),  γ ∼ C⁺(0, ζ).   (4)

The use of a half-Cauchy prior on γ is a departure from Park and Casella (2008), who placed a gamma prior on the smoothing parameter to induce conjugacy in the Bayesian lasso. We chose to use the half-Cauchy prior on γ because its single parameter simplifies implementation, it has desirable properties as a prior on a scale parameter (Gelman, 2006; Polson and Scott, 2012b), and it allowed us to be consistent across methods (see the horseshoe specification below). The hierarchical form of the Laplace prior arises when the mixing distribution on the square of the local scale parameter τ_j is an exponential distribution. Specifically, we specify τ_j² | λ ∼ Exp(λ²/2) and ∆^k θ_j | τ_j ∼ N(0, τ_j²) in the hierarchical representation.
Horseshoe. The horseshoe prior (Carvalho et al., 2010) has an infinite spike in density at zero but also exhibits heavy tails (Figure 1). This combination results in excellent performance as a shrinkage prior (Polson and Scott, 2010), and gives the horseshoe shrinkage properties more similar to the spike-and-slab variable selection prior than those of the Laplace prior. We allow the order-k differences ∆^k θ_j to follow a horseshoe distribution conditional on a global scale parameter γ = 1/λ, and allow γ to follow a half-Cauchy distribution with scale parameter ζ. That is,

∆^k θ_j | γ ∼ HS(γ),  γ ∼ C⁺(0, ζ).   (5)

The horseshoe density function does not exist in closed form, but we have derived an approximate closed-form solution using the known function bounds (see Appendix A), which could be useful for application in some settings. Carvalho et al. (2010) represent the horseshoe density hierarchically as a scale mixture of normals where the local scale parameters τ_j are distributed half-Cauchy. In our hierarchical version, the latent scale parameter follows τ_j | γ ∼ C⁺(0, γ), and conditional on τ_j the distribution on the order-k differences is ∆^k θ_j | τ_j ∼ N(0, τ_j²).

The horseshoe prior arises when the mixing distribution on the local scale parameter τ_j is half-Cauchy, which is a special case of a half-t-distribution with degrees of freedom (df) equal to 1. Setting df > 1 lightens the tails of the resulting marginal distribution, while 0 < df < 1 makes them heavier. We tried half-t formulations with df between 1 and 5 in test scenarios but did not find an appreciable difference in performance relative to the horseshoe. We also attempted to place a prior distribution on the df parameter, but found the data to be insufficient to gain information in the posterior for df in our test scenarios, so we did not pursue this further.

Normal. The normal distribution arises as a prior on the order-k differences when the penalty in the penalized likelihood formulation is a squared L₂ penalty. The normal prior is also the form of prior used in Bayesian smoothing splines. The normal is not considered a shrinkage prior and does not have the flexibility to allow locally adaptive smoothing behavior. We use it for comparison to demonstrate the local adaptivity allowed by the shrinkage priors. For our investigations, the distribution on the order-k differences and associated scale parameter is:

∆^k θ_j | γ ∼ N(0, γ²),  γ ∼ C⁺(0, ζ).   (6)
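The scale mixture representations behind (4) and (5) are easy to check by simulation; the following R sketch (ours, for illustration) draws from the Laplace and horseshoe increment priors via their normal mixtures:

```r
# Scale-mixture-of-normals draws for the increment priors.
set.seed(1)
N <- 1e5
lambda <- 2; gamma <- 1 / lambda

# Laplace: tau_j^2 | lambda ~ Exp(lambda^2 / 2), then N(0, tau_j^2)
tau2 <- rexp(N, rate = lambda^2 / 2)
x_laplace <- rnorm(N, 0, sqrt(tau2))
c(mean(abs(x_laplace)), 1 / lambda)   # both approximate E|x| = 1/lambda

# Horseshoe: tau_j | gamma ~ C+(0, gamma), then N(0, tau_j^2)
tau <- abs(rcauchy(N, 0, gamma))
x_horseshoe <- rnorm(N, 0, tau)

# The horseshoe shows a stronger spike at zero and much heavier tails:
quantile(abs(x_horseshoe), c(0.5, 0.99))
quantile(abs(x_laplace),  c(0.5, 0.99))
```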
Here we briefly show that the models represented by (2) can be expressed with GMRF priors for θ conditional on the local scale parameters τ. It is instructive to start with the normal increments model (6), which belongs to a class of time series models known as autoregressive models of order k. Rue and Held (2005) call this model a k-th order random walk and show that it is a GMRF with respect to a k-th order chain graph — a graph with nodes {1, 2, . . . , n}, where nodes i ≠ j are connected by an edge if and only if |i − j| ≤ k. Since the normal model (6) does not fully specify the joint distribution of θ, it is an intrinsic (improper) GMRF. We make it a proper GMRF by specifying a prior density for the first k components of θ, p(θ_1, . . . , θ_k). The Markov property of the model manifests itself in the following factorization:

p(θ) = p(θ_1, . . . , θ_k) p(θ_{k+1} | θ_1, . . . , θ_k) · · · p(θ_n | θ_{n−1}, . . . , θ_{n−k}).

Equipped with the initial distribution p(θ_1, . . . , θ_k), models (5) and (4) also admit this factorization, so they are k-th order Markov, albeit not Gaussian, models. However, if we condition on the latent scale parameters τ, both the Laplace and horseshoe models become GMRFs, or more specifically k-th order normal random walks. One important feature of these random walks is that each step in the walk has its own precision. To recap, under prior specifications (5) and (4), p(θ | γ) is a non-Gaussian Markov field, while p(θ | τ, γ) = p(θ | τ) is a GMRF.

Our GMRF point of view is useful in at least three respects. First, GMRFs with constant precision have been used for nonparametric smoothing in many settings (see Rue and Held (2005) for examples). GMRFs with nonconstant precision have been used much less frequently, but one important application is the development of adaptive smoothing splines by allowing order-k increments to have nonconstant variances (Lang et al., 2002; Yue et al., 2012). The approach of these authors is very similar to our own but differs in at least two important ways. First, we specify the prior distribution on the latent local scale parameters τ_j with the resulting marginal distribution of ∆^k θ_j in mind, such as the Laplace or horseshoe distributions which arise as scale mixtures of normals. This allows a better understanding of the adaptive properties of the resulting marginal prior in advance of implementation. In contrast, Lang et al. (2002) and Yue et al. (2012) appear to choose the distribution on local scale parameters based on conjugacy and do not consider the effect on the marginal distribution of ∆^k θ_j. Second, we allow the local scale parameters τ_j to be independent, whereas Lang et al. (2002) and Yue et al. (2012) impose dependence among the scale (precision) parameters by forcing them to follow another GMRF. Allowing the local scale parameters to be independent makes the model more flexible and able to adapt to jumps and sharp local features. We should also note that Rue and Held (2005), in Section 4.3, show that the idea of scale mixtures of normal distributions can be used with GMRFs to generate order-k differences which marginally follow a t-distribution by introducing latent local scale parameters. Although they do not pursue this further, we mention it because it bears similarity to our approach.

Second, viewing the SPMRF models as conditional GMRFs allows us to utilize some of the theoretical results and computational methods developed for GMRFs. In particular, one can take advantage of more complex forms of precision matrices such as circulant or seasonal trend matrices (see Rue and Held (2005) for examples). One can also employ the computational methods developed for sparse matrices, which speed computation times (Rue, 2001; Rue and Held, 2005). We note that simple model formulations such as the k-th order random walk models can be coded with state-space formulations based on forward differences, which speed computation times by eliminating the operations on covariance matrices necessary with multivariate Gaussian formulations.
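A quick way to see the sparsity that the GMRF view buys (an illustration we add here, not from the paper) is to build the order-k difference operator and the implied banded precision matrix in R:

```r
# Order-k difference operator D^(k) and the banded random walk precision Q = D'D.
n <- 10; k <- 2
D <- diff(diag(n), differences = k)   # (n - k) x n forward difference matrix
Q <- crossprod(D)                     # precision of the intrinsic k-th order walk
# Q is banded with bandwidth k, so sparse-matrix methods apply:
max(abs(row(Q) - col(Q))[Q != 0])     # returns k = 2
# With local scales, the conditional GMRF precision becomes:
# Q_tau <- t(D) %*% diag(1 / tau^2) %*% D
```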
A third advantage of connecting our models to GMRFs is that the GMRF representation allows us to connect our first-order Markov models to subordinated Brownian motion (Bochner, 1955; Clark, 1973), a type of Lévy process recently studied in the context of scale mixtures of normal distributions (Polson and Scott, 2012a). Polson and Scott (2012a) use the theory of Lévy processes to develop shrinkage priors and penalty functions. Let us briefly consider a simple example of subordinated Brownian motion. Let W be a Wiener process, so that W(t + s) − W(t) ∼ N(0, sσ²), and W has independent increments. Let T be a subordinator, which is a Lévy process that is non-decreasing with probability 1, has independent increments, and is independent of W. The subordinated process Z results from observing W at locations T(t); that is, Z(t) = W[T(t)]. The subordinator essentially generates a random set of irregular locations over which the Brownian motion is observed, which results in a new process. In our hierarchical representation of Laplace and horseshoe priors for the first order differences, we can define a subordinator process T_j = Σ_{i=1}^{j} τ_i², so that the GMRF p(θ | τ) can be thought of as a subordinated Brownian motion, or as a realization of a Brownian motion with unit variance on the random latent irregular grid T_1, . . . , T_n. The subordinated Brownian motion interpretation is not so straightforward when applied to higher-order increments, but we think this interpretation will be fruitful for extending our SPMRF models in the future. One example where this interpretation is useful is when observations occur on an irregularly spaced grid, which we explore in the following section.
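For intuition, here is a small R simulation (our illustration) of a first-order horseshoe SPMRF drawn as unit-variance Brownian motion on the random grid T_j = Σ_{i≤j} τ_i²:

```r
# First-order horseshoe SPMRF as subordinated Brownian motion.
set.seed(2)
n <- 200; gamma <- 0.1
tau <- abs(rcauchy(n - 1, 0, gamma))      # local half-Cauchy scales
Tj  <- cumsum(tau^2)                      # subordinator T_j = sum_{i<=j} tau_i^2
# Unit-variance Brownian motion observed at T_1, ..., T_{n-1}:
theta <- c(0, cumsum(rnorm(n - 1, 0, sqrt(diff(c(0, Tj))))))
# diff(c(0, Tj)) equals tau^2, so the increments are N(0, tau_j^2), matching (3).
plot(theta, type = "l", xlab = "t", ylab = expression(theta))
```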
So far we have restricted our model formulation to the case where data are observed at equally-spaced locations. Here we generalize the model formulation to allow for data observed at locations with irregular spacing. This situation arises with continuous measurements over time, space, or some covariate, or when gaps are left by missing observations.

For a GMRF with constant precision (normally distributed k-th order differences), we can use integrated Wiener processes to obtain the precision matrix (see Rue and Held (2005) and Lindgren and Rue (2008) for details). However, properly accounting for irregular spacing in our models with Laplace or horseshoe k-th order differences is more difficult. To use tools similar to those for integrated Wiener processes, we would need to show that the processes built on Laplace and horseshoe increments maintain their distributional properties over any subinterval of a continuous measure. Polson and Scott (2012a) show that processes with Laplace or horseshoe first-order increments can be represented as subordinated Brownian motion. However, to meet the necessary condition of an infinitely divisible subordinator, the subordinator for the Laplace process needs to be on the precision scale and the subordinator for the horseshoe process needs to be on the log-variance scale. Both resulting processes are Lévy processes, which means they have independent and stationary increments, but the increments are no longer over the continuous measure of interest. This makes representation of these processes over continuous time difficult, and development of the necessary theory is out of the scope of this paper.

Absent theory to properly address this problem, we instead start with our hierarchical model formulations and assume that, conditional on a set of local variance parameters τ, we can use methods based on integrated Wiener processes to obtain the precision matrices for the latent GMRFs. This requires the assumption that local variances are constant within the respective intervals between observations. Let s_1 < s_2 < · · · < s_n be a set of locations of observations, and let δ_j = s_{j+1} − s_j be the distance between adjacent locations. We assume we have a discretely observed continuous process and denote by θ(s_j) the value of the process at location s_j. For the first-order model and some interval [s_j, s_{j+1}], we assume that, conditional on the local variance τ_j², θ(s) follows a Wiener process where θ(s_j + h) − θ(s_j) | τ_j ∼ N(0, h τ_j²) for all 0 ≤ h ≤ δ_j. If we let ∆θ_j = θ(s_{j+1}) − θ(s_j), the resulting variance of ∆θ_j is

Var(∆θ_j) = δ_j τ_j².

Note that the resulting marginal distribution of θ(s_j + h) − θ(s_j) after integrating over τ_j is therefore assumed to be Laplace or horseshoe for all h, with the form of the marginal distribution dependent on the distribution of τ_j. We know this cannot be true in general given the properties of these distributions, but we assume it approximately holds for h ≤ δ_j.

The situation becomes more complex for higher order models. We restrict our investigations to the second-order model and follow the methods of Lindgren and Rue (2008), who use a Galerkin approximation to the stochastic differential equation representing the continuous process. The resulting formula for a second-order increment becomes

∆²θ_j = θ(s_{j+2}) − (1 + δ_{j+1}/δ_j) θ(s_{j+1}) + (δ_{j+1}/δ_j) θ(s_j),

and the variance of a second-order increment conditional on τ_j is

Var(∆²θ_j) = [δ_{j+1}(δ_j + δ_{j+1})/2] τ_j².

This adjustment of the variance results in good consistency properties for GMRFs with constant precision (Lindgren and Rue, 2008), so it should also perform well over intervals with locally constant precision. We show in Appendices A and B that integrating over the local scale parameter τ_j maintains the distance correction as a multiplicative factor on the scale of the resulting marginal distribution. We also provide a data example involving a continuous covariate in Appendix C, where we apply the methods above for irregular grids.

Since we have two general model formulations, marginal and hierarchical, we could use MCMC to approximate the posterior distribution of the heights of our piecewise step functions, θ, by working with either one of the two corresponding posterior distributions. The first corresponds to the marginal model formulation:

p(θ, γ, ξ | y) ∝ [∏_{i=1}^{n} p(y_i | θ_i, ξ)] p(θ | γ) p(ξ) p(γ),   (7)

where p(θ | γ) is a Markov field induced by the normal, Laplace, or horseshoe densities, and p(γ) is a half-Cauchy density. Note that a closed-form approximation to the density function for the horseshoe prior (see Appendix A) is needed for the marginal formulation using the horseshoe. The second posterior corresponds to the hierarchical model with latent scale parameters τ:

p(θ, τ, γ, ξ | y) ∝ [∏_{i=1}^{n} p(y_i | θ_i, ξ)] p(θ | τ) [∏_{j=1}^{n−k} p(τ_j | γ)] p(ξ) p(γ),   (8)

where p(θ | τ) is a GMRF and the choice of p(τ_j | γ) makes the marginal prior specification for θ correspond either to a Laplace or to a horseshoe Markov random field. Notice that the unconditional GMRF (normal prior) has only the marginal specification.
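To make the hierarchical formulation (8) concrete, here is a minimal rstan sketch of a first-order horseshoe SPMRF with Poisson observations. This is our own illustrative code, not the implementation in the authors' spmrf package, and it uses the direct (centered) parameterization; as discussed in what follows, a non-centered parameterization is preferable in practice.

```r
library(rstan)

spmrf_hs_code <- "
data {
  int<lower=1> n;
  int<lower=0> y[n];
  real<lower=0> zeta;          // scale of the half-Cauchy prior on gamma
}
parameters {
  vector[n] theta;             // latent log-intensities
  vector<lower=0>[n-1] tau;    // local scales
  real<lower=0> gam;           // global scale gamma = 1/lambda
}
model {
  gam ~ cauchy(0, zeta);       // half-Cauchy via the lower=0 constraint
  tau ~ cauchy(0, gam);        // horseshoe: half-Cauchy local scales
  theta[1] ~ normal(0, 5);     // proper prior on the first state
  for (j in 1:(n-1))
    theta[j+1] ~ normal(theta[j], tau[j]);   // conditional GMRF increments
  y ~ poisson_log(theta);      // p(y_i | theta_i)
}
"
# fit <- stan(model_code = spmrf_hs_code,
#             data = list(n = length(y), y = y, zeta = 0.01),
#             chains = 4, warmup = 500, iter = 3000, thin = 5)
```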
Both of the above model classes are highly parameterized, with dependencies among parameters induced by differencing and the model hierarchy. It is well known that high-dimensional, hierarchical models with strong correlations among parameters can create challenges for standard MCMC samplers, such as component-wise random walk Metropolis or Gibbs updates. When faced with these challenges, random walk behavior can result in inefficient exploration of the parameter space, which can lead to poor mixing and prohibitively long convergence times. Many approaches have been proposed to deal with these issues, including block updating (Knorr-Held and Rue, 2002), elliptical slice sampling (Murray et al., 2010; Murray and Adams, 2010), the Metropolis-adjusted Langevin algorithm (MALA) (Roberts and Stramer, 2002), and Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 1993, 2011). All of these approaches jointly update some or all of the parameters at each MCMC iteration, which usually improves mixing and speeds up convergence of MCMC. Among these methods, HMC offered the most practical choice due to its ability to handle a wide variety of models and its relative ease of implementation via readily available software such as stan (Carpenter et al., 2016). We used a modification of HMC proposed by Hoffman and Gelman (2014) which automatically adjusts HMC tuning parameters. We used the open source package rstan (Stan Development Team, 2015a), which provides a platform for fitting models using HMC in the R computing environment (R Core Team, 2014).

Even with HMC, slow mixing can still arise with hierarchical models and heavy-tailed distributions due to the inability of a single set of HMC tuning parameter values to be effective across the entire model parameter space. Fortunately, this problem can often be remedied by model reparameterizations that change the geometry of the sampled parameter space. For hierarchical models, the non-centered parameterization methods described by Papaspiliopoulos et al. (2003, 2007) and Betancourt and Girolami (2015) can be useful. Non-centered parameterizations break the dependencies among parameters by introducing deterministic transformations of the parameters; the MCMC algorithm then operates directly on the independent parameters. Betancourt and Girolami (2015) discuss non-centered parameterizations in the context of HMC, and further examples of these and other reparameterization methods that target heavy-tailed distributions are provided in the documentation for stan (Stan Development Team, 2015b).
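As one example of such a reparameterization (our sketch of the general non-centered recipe, not code from the paper), the increments in the model block above can be rewritten in terms of standard normal innovations:

```r
# Non-centered version of the first-order walk: theta becomes a deterministic
# transform of iid standard normals z, so HMC samples a better-behaved space.
spmrf_hs_ncp_fragment <- "
parameters {
  real theta1;
  vector[n-1] z;               // standard normal innovations
  vector<lower=0>[n-1] tau;
  real<lower=0> gam;
}
transformed parameters {
  vector[n] theta;
  theta[1] = theta1;
  for (j in 1:(n-1))
    theta[j+1] = theta[j] + tau[j] * z[j];  // theta[j+1]-theta[j] ~ N(0, tau[j]^2)
}
model {
  z ~ normal(0, 1);            // implies the same conditional GMRF as before
  // ... priors on theta1, tau, gam and the likelihood as in the sketch above
}
"
```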
We note that after employing reparameterizations, HMC with stationary distribution equal to the hierarchical model posterior (8) had good convergence and mixing properties for each of our models in nearly all of our numerical experiments. HMC that targeted the marginal model posterior (7) had fast run times and good mixing for the normal and Laplace formulations, but we could not effectively reparameterize the (approximate) marginal horseshoe distribution to remove the effects of its heavy tails, which resulted in severe mixing problems for the marginal horseshoe-based model. Therefore, in the rest of the manuscript we work with the hierarchical model posterior distribution (8) for all models.

For SPMRF and GMRF models, the computation time needed to evaluate the log-posterior and its gradient scales as O(n), where n is the grid size. However, the hierarchical SPMRF models have approximately twice as many parameters as the GMRF or marginal SPMRF models. These hierarchical SPMRF methods are therefore slower than their GMRF counterparts. Since the computational cost of evaluating the log-posterior is only one factor determining the MCMC speed, we compared run times of the SPMRF and GMRF models on simulated and real data (see Appendix G). Our results show that SPMRF models are slower than GMRFs, but not prohibitively so.

We developed an R package titled spmrf which allows for easy implementation of our models via a wrapper to the rstan tools. The package code is publicly available at https://github.com/jrfaulkner/spmrf.
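A call to the package might look like the following. This is an assumed interface sketched from the repository's documentation — the argument names (prior, likelihood, order) are our best reading of it and should be checked against the package itself:

```r
# Hypothetical usage sketch of the spmrf package (interface assumed, not verified):
# library(spmrf)
# dat <- list(y = y)                    # observations
# fit <- spmrf(prior = "horseshoe",     # "normal", "laplace", or "horseshoe"
#              likelihood = "poisson",  # observation model
#              order = 1,               # order k of the differences
#              data = dat,
#              chains = 4, iter = 3000, warmup = 500, thin = 5)
```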
We use simulations to investigate the performance of two SPMRF formulations using the Laplace and horseshoe shrinkage priors described in Section 2.2, and we compare results to those using a normal distribution on the order-k differences. We refer to the shrinkage prior methods as adaptive, due to the local scale parameters, and to the method with the normal prior as non-adaptive, due to its single scale parameter. We constructed underlying trends with a variety of characteristics following approaches similar to those of other authors (Scheipl and Kneib, 2009; Yue et al., 2012; Zhu and Dunson, 2013). We investigated four different types of underlying trend (constant, piecewise constant, smooth function, and function with varying smoothness). The first row of Figure 2 shows examples of the trend functions, each illustrated with simulated normal observations centered at the function values over a regular grid. We used three observation types for each trend type, where the observations were conditionally independent given the trend function values θ_i, with i = 1, . . . , n. The observation distributions investigated were 1) normal: y_i | θ_i ∼ N(θ_i, σ²), where σ = 1.5 or σ = 4.5; 2) Poisson: y_i | θ_i ∼ Pois(exp(θ_i)); and 3) binomial: y_i | θ_i ∼ Binom(m, (1 + exp(−θ_i))⁻¹), where m = 20 for all scenarios.
Note that we constructed the function values for the scenarios with normally distributed observations so that each function would have approximately the same mean and variance, where the mean and variance were calculated across the function values realized at the discrete time points. This allowed us to specify observation variances which resulted in the same signal-to-noise ratio for each function, where the signal-to-noise ratio is defined as the standard deviation of function values divided by the standard deviation of observations. The signal-to-noise ratios for our scenarios with normal observations were 6 for σ = 1.5 and 2 for σ = 4.5.
We chose the mean sizes for the Poisson scenarios and sample sizes for the binomial scenarios so that the resulting signal-to-noise ratios would be similar to those for the normal scenarios with σ = 1.5.

Constant. This scenario uses a constant mean across all points. We use this scenario to investigate the ability of each method to find a straight horizontal line in the presence of noisy data. The values used for the constant mean were 20 for normal and Poisson observations, and 0.5 for binomial observations.

Piecewise constant. This type of function has been used by Tibshirani (2014) and others such as Scheipl and Kneib (2009) and Zhu and Dunson (2013). The horizontal trends combined with sharp breaks offer a difficult challenge for all methods. For the scenarios with normal or Poisson observations, the function values were 25, 10, 35, and 15; for the binomial observations, the function values on the probability scale were 0.65, 0.25, 0.85, and 0.45, with the same three break points in t for all observation types.

Smooth trend. We use this as an example to test the ability of the adaptive methods to handle a smoothly varying function. We generated the function f as a GP with squared exponential covariance function. That is, f ∼ GP(µ, Σ), with Σ_{i,j} = σ_f² exp[−(t_j − t_i)²/(2ρ²)], where Σ_{i,j} is the covariance between points i and j, σ_f > 0, and ρ > 0. We used ρ = 10 for the scenarios with normal or Poisson observations, with µ and σ_f set so that the realized function matched the location and scale of the other scenarios. For binomial observations, f was generated in logit space with a negative mean µ and σ_f = 3.

Varying smoothness. This function with varying smoothness was initially presented by DiMatteo et al. (2001) and later used by others, including Yue et al. (2012). We adapted the function to a uniform grid, t ∈ [1, n], where n = 100 in our case, resulting in a function of the form

g(t) = sin(x_t) + 2 exp(−30 x_t²),

where x_t maps t ∈ [1, n] linearly onto [−2, 2]. For the normal and Poisson observations we shifted and rescaled g(t) to match the location and scale of the other scenarios; for binomial observations we used a rescaled version of g(t) on the logit scale.
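As an illustration of how the smooth GP trend above can be generated (our sketch; the specific µ and σ_f values here are placeholders rather than those used in the paper):

```r
# Simulate a smooth trend from a GP with squared exponential covariance.
set.seed(3)
n <- 100; t <- 1:n
mu <- 20; sigma_f <- 3; rho <- 10            # placeholder values for illustration
Sigma <- sigma_f^2 * exp(-outer(t, t, "-")^2 / (2 * rho^2))
# Draw f ~ GP(mu, Sigma) via a Cholesky factor (jitter for numerical stability):
f <- mu + drop(t(chol(Sigma + 1e-8 * diag(n))) %*% rnorm(n))
y <- rnorm(n, mean = f, sd = 1.5)            # normal observations with sigma = 1.5
plot(t, y); lines(t, f, lwd = 2)
```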
We generated 100 datasets for each combination of trend and observation type. This number of simulations was sufficient to identify meaningful differences between models without excessive computation time. Each dataset had 100 equally-spaced sample points over the interval [1, 100]. We fit models with three different prior formulations for the order-k differences: 1) normal, 2) Laplace, and 3) horseshoe. We used the hierarchical prior representations for these models given in Section 2.2. We selected the degree of the k-th order differences for each model based on knowledge of the shape of the underlying function: we fit first-order models for the constant and piecewise constant functions, and second-order models for the smooth and varying smoothness functions. For the scenarios with normal observations, we placed a half-Cauchy prior on σ and set θ_1 ∼ N(µ, ω²), where µ is set to the sample mean and ω is two times the sample standard deviation of the observed data, transformed to match the scale of θ. We also set γ ∼ C⁺(0, 0.01) for all models.
We used HMC to approximate the posterior distributions. For each model we ran four independent chains with different randomly generated starting parameter values and an initial burn-in of 500 iterations.
For all scenarios except normal observations with σ = 4.5, each chain had 2,500 posterior draws post-burn-in that were thinned to keep every 5th draw. For the scenarios with normal observations with σ = 4.5, chains with 10,000 iterations post-burn-in were necessary, with additional thinning to every 20th draw.
In all cases, these settings resulted in 2,000 posterior draws retained per model. We found that these settings consistently resulted in good convergence properties, where convergence and mixing were assessed with a combination of trace plots, autocorrelation values, effective sample sizes, and potential scale reduction statistics (Gelman and Rubin, 1992).

We assessed the relative performance of each model using three different summary statistics. We compared the posterior medians of the trend parameters (θ̂_i) to the true trend values (θ_i) using the mean absolute deviation (MAD):

MAD = (1/n) Σ_{i=1}^{n} |θ̂_i − θ_i|.   (9)

We assessed the width of the 95% Bayesian credible intervals (BCIs) using the mean credible interval width (MCIW):

MCIW = (1/n) Σ_{i=1}^{n} (θ̂_{0.975,i} − θ̂_{0.025,i}),   (10)

where θ̂_{0.975,i} and θ̂_{0.025,i} are the 97.5% and 2.5% quantiles of the posterior distribution for θ_i. We also computed the mean absolute sequential variation (MASV) of θ̂ as

MASV = (1/(n−1)) Σ_{i=1}^{n−1} |θ̂_{i+1} − θ̂_i|.   (11)

We compared the observed MASV to the true MASV (TMASV) of the underlying trend function, which is calculated by substituting the true θ's into the equation for MASV.
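All three metrics are direct functions of the posterior draws; a small R helper (ours, assuming draws is an iterations × n matrix of posterior samples of θ):

```r
# MAD, MCIW, MASV, and TMASV from equations (9)-(11).
# draws:      matrix of posterior samples (rows = iterations, cols = grid points)
# theta_true: vector of true trend values
trend_metrics <- function(draws, theta_true) {
  theta_hat <- apply(draws, 2, median)
  ci <- apply(draws, 2, quantile, probs = c(0.025, 0.975))
  c(MAD   = mean(abs(theta_hat - theta_true)),
    MCIW  = mean(ci[2, ] - ci[1, ]),
    MASV  = mean(abs(diff(theta_hat))),
    TMASV = mean(abs(diff(theta_true))))
}
```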
In the interest of space, we emphasize results for the scenarios with normally distributed observations with σ = 1.5; the other scenarios offered similar results. We follow these results with a brief summary for the other observation types, and we provide further summaries of other results in Appendix D.

Constant. The three models performed similarly in terms of the absolute values of all the metrics (Table 1 and Figure 2), but the Laplace and normal models were slightly better at fitting straight lines than the horseshoe. This is evidenced by the fact that the horseshoe had larger MCIW and larger MASV than the other methods. The first column of plots in Figure 3 provides a visual example of the extra variation exhibited by the horseshoe.

Piecewise constant. The horseshoe model performed the best in all categories for this scenario and the normal model performed the worst (Table 1 and Figure 2). The Laplace model was closer to the normal model in performance. The horseshoe was flexible enough to account for the large function breaks yet still able to limit variation in the constant segments. Example fits for the piecewise constant function are shown in the second column of plots in Figure 3.
Smooth trend. The different models were all close in value of the performance metrics for the smooth trend scenario (Table 1 and Figure 2). The normal and Laplace models had the smallest MAD, but the horseshoe had MASV closer to the true MASV. The fact that the values of the metrics were similar for all models suggests that not much performance is lost in fitting a smooth trend with the adaptive methods in comparison to the non-adaptive method.

Varying smoothness. Again the models all performed similarly in terms of the absolute values of the metrics, but there was a clear ordering among models in relative performance (Table 1 and Figure 2). The horseshoe model performed the best relative to the other models on all metrics. This function forces a compromise between having local variances large enough to capture the rapidly varying portion of the function and small enough to maintain smoothness elsewhere.
[Figure 2: Functions used in simulations and simulation results (MAD, MCIW, and MASV) by model (N = normal, L = Laplace, H = horseshoe) and function type for normally distributed data with σ = 1.5.]

Table 1: Mean values of performance measures across 100 simulations for normal observations (σ = 1.5) for each model and trend function type.

Function          Model      MAD     MCIW    MASV   TMASV
Constant          Normal     0.341    1.904  0.003  0.000
                  Laplace    0.339    1.937  0.003  0.000
                  Horseshoe  0.356    2.406  0.006  0.000
Piecewise Const.  Normal     2.112   10.826  1.587  0.606
                  Laplace    1.816    9.899  1.441  0.606
                  Horseshoe  0.886    5.919  0.689  0.606
Smooth            Normal     1.355    7.092  1.328  1.406
                  Laplace    1.352    7.106  1.329  1.406
                  Horseshoe  1.389    7.081  1.359  1.406
Varying Smooth    Normal     1.596    6.467  0.426  0.543
                  Laplace    1.552    6.413  0.432  0.543
                  Horseshoe  1.211    5.743  0.470  0.543

Summarizing across the other observation types: for the constant function, the normal prior performed the best and the horseshoe prior the worst, although differences in terms of the absolute values of the performance metrics were small. The relative differences were more pronounced in the scenarios with normal observations with σ = 4.5. For the piecewise constant function, the horseshoe prior performed the best for all scenarios and the normal prior the worst. All methods performed similarly for the smooth function, with the normal and Laplace generally performing a little better than the horseshoe. For the function with varying smoothness, the horseshoe performed the best and the normal the worst for all scenarios.
[Figure 3: Example fits for the constant, piecewise constant, smooth GP, and varying smoothness functions using a) normal, b) Laplace, and c) horseshoe priors, where observations are drawn from normal distributions centered on the true values of θ. True functions, posterior medians, and 95% BCIs are shown; values between observed locations are interpolated for plotting.]
Here we provide two examples of fitting SPMRF models to real data. Each example uses a different probability distribution for the observations. The first example exhibits a change point, which makes it amenable to adaptive smoothing methods. The second example has a more uniformly smooth trend but also shows a period of rapid change, so it represents a test for all methods. First we address the issue of setting the hyperparameter for the global smoothing parameter.

The value of the global smoothing parameter λ determines the precision of the marginal distributions of the order-k differences, which influences the smoothness of the estimated trend. Selection of the global smoothing parameter in penalized regression models is typically done via cross-validation in the frequentist setting (Tibshirani, 1996) or marginal maximum likelihood in the empirical Bayes setting (Park and Casella, 2008). Our fully Bayesian formulation eliminates the need for these additional steps, but in turn requires selection of the hyperparameter controlling the scale of the prior on the smoothing parameter. The value of this hyperparameter
will depend on the order of the model, the grid resolution, and the variability in the latent trend parameters. Therefore, a single hyperparameter value cannot be used in all situations. Some recent studies have focused on methods for more careful and principled specification of priors for complex hierarchical models (Fong et al., 2010; Simpson et al., 2014; Sørbye and Rue, 2014). The method of Sørbye and Rue (2014) was developed for intrinsic GMRF priors, and we adapt their approach to our specific models in what follows.

We wish to specify values of the hyperparameter ζ for various situations, where the global scale parameter γ ∼ C⁺(0, ζ). Let Q be the precision matrix for the Markov random field corresponding to the model of interest (see Appendix E for examples), and let Σ = Q⁻¹ be the covariance matrix with diagonal elements Σ_ii. The marginal standard deviation of all components of θ for a fixed value of γ is σ_γ(θ_i) = γσ_ref(θ), where σ_ref(θ) is the geometric mean of the individual marginal standard deviations when γ = 1. We then place an upper bound U on the average marginal standard deviation of θ_i, such that Pr(σ_γ(θ_i) > U) = α, where α is some small probability. Using the cumulative probability function for a half-Cauchy distribution, we can find the value of ζ for a given value of σ_ref(θ) specific to a model of interest and given common values of U and α:

ζ = U / [σ_ref(θ) tan(π(1 − α)/2)].   (12)

By standardizing calculations to be relative to the average marginal standard deviation, the methods of Sørbye and Rue (2014) allow us to easily calculate ζ for a model of different order or a model with a different density of grid points. For practical purposes we apply the same method to the normal and SPMRF models. This is not ideal in terms of theory, however, since the horseshoe distribution has infinite variance and the corresponding SPMRF will clearly not have the same marginal variance as a GMRF. This is not necessarily problematic, since the GMRF approximation will result in an estimate of ζ under the horseshoe SPMRF which is less informative than would result under similar methods derived specifically for the horseshoe SPMRF, and could therefore be seen as more conservative in terms of guarding against oversmoothing. In contrast, the Laplace SPMRF has finite marginal variance that is well approximated by the GMRF methods. We apply these methods in the data examples that follow.
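Equation (12) is a one-liner to evaluate; for instance (our sketch, with illustrative inputs rather than the paper's values):

```r
# Hyperparameter for the half-Cauchy prior on gamma, from equation (12).
zeta_from_bound <- function(U, alpha, sigma_ref) {
  U / (sigma_ref * tan(pi * (1 - alpha) / 2))
}
# Example: a bound U = 2 that the average marginal SD of theta should
# exceed only with probability alpha = 0.05, given sigma_ref = 3.5.
zeta_from_bound(U = 2, alpha = 0.05, sigma_ref = 3.5)
```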
This is an example of estimating the time-varying intensity of an inhomogeneous Poisson process that exhibits a relatively rapid period of change. The data are on the time intervals between successive coal-mining disasters, and were originally presented by Maguire et al. (1952), with later corrections given by Jarrett (1979) and Raftery and Akman (1986). We use the data format presented by Raftery and Akman (1986). A disaster is defined as an accident involving 10 or more deaths. The first disaster was recorded in March of 1851 and the last in March of 1962, with 191 total event times during the period 1 January, 1851 through 31 December, 1962. Visual inspection of the data suggests a decrease in the rate of disasters over time, but it is unclear by eye alone whether this change is abrupt or gradual. The decrease in disasters is associated with a few changes in the coal industry at the time. A sharp decline in labor productivity at the end of the 1880's is thought to have decreased the opportunity for disasters, and the formation of the Miner's Federation, a labor union, in late 1889 brought added safety and protection to the workers (Raftery and Akman, 1986).

This data set has been of interest to various authors due to uncertainty in the timing and rate of decline in disasters and the computational challenge presented by the discrete nature of the observations. Some authors have fit smooth curves exhibiting gradual change (Adams et al., 2009; Teh and Rao, 2011) and others have fit change-point models with abrupt, instantaneous change (Raftery and Akman, 1986; Carlin et al., 1992; Green, 1995). An ideal model would provide the flexibility to automatically adapt to either scenario.

We assumed an inhomogeneous Poisson process for the disaster events and binned the event counts by year. We fit first-order models using the normal, Laplace, and horseshoe prior formulations. We assumed the event counts, y_i, were distributed Poisson conditional on the θ_i: y_i | θ_i ∼ Pois(exp(θ_i)). The marginal prior distributions for the first-order increments were ∆θ_j ∼ N(0, γ²) for the normal, ∆θ_j ∼ Laplace(γ) for the Laplace, and ∆θ_j ∼ HS(γ) for the horseshoe. We used the same prior specifications as those used in the simulations for the remaining parameters, except we used the guidelines in Section 4.1 to set the hyperparameter on the global scale prior: using calculations outlined in Appendix E, we computed σ_ref(θ) and set the bound U, and with α = 0.05 substituted these into Equation (E.6) to obtain the value of ζ for the prior γ ∼ C⁺(0, ζ).

The normal model did not have sufficient flexibility to allow large jumps and produced a gradual decline in accident rate, which is less plausible than a sharp decline in light of the additional information about changes in coal mining industry safety regulations. The relative qualitative performance of the normal, Laplace, and horseshoe densities is similar to that for the piecewise constant scenario from our simulation study. The posterior distributions of the change point times are shown in Figure 4. The horseshoe model clearly shows a more concentrated posterior for the break points, and that distribution is centered near the late 1880's, which corresponds to the period of change in the coal industry.
Laplace llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Horseshoe llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Median95% BCI . . . . . Year A cc i d e n t s p e r yea r D e n s i t y Figure 4: Top row: fits to coal mining disaster data for di ff erent prior distributions. Posteriormedians (lines), 95% credible intervals (shaded regions), and data points are shown. Bottomrow: associated posterior distributions for change points.break points, and that distribution is centered near the late 1880’s, which corresponds to theperiod of change in the coal industry. Therefore, we think the Bayesian trend filter with thehorseshoe prior is a better default model in cases where sharp change points are expected.It is important to point out that we tried other values for the scale parameter ( ζ ) in the priordistribution for γ and found that the models were somewhat sensitive to that hyperparameterfor this data set. In particular, the horseshoe results for ζ = ζ = . ζ = .
01 (seeAppendix F).
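For intuition about the first-order horseshoe SPMRF used here, the short R sketch below draws a single prior path via the scale-mixture representation (each increment is normal given a half-Cauchy local scale). All settings are illustrative and are not the fitted values from this example.

```r
# One draw from a first-order SPMRF prior path with horseshoe increments:
# delta_theta_j | tau_j ~ N(0, gamma^2 * tau_j^2), tau_j ~ C+(0, 1).
set.seed(1)
n      <- 112                    # yearly grid, e.g., 1851-1962
gamma  <- 0.1                    # assumed global scale for illustration
tau    <- abs(rcauchy(n - 1))    # half-Cauchy local scales
dtheta <- rnorm(n - 1, mean = 0, sd = gamma * tau)
theta  <- cumsum(c(0, dtheta))   # random-walk construction of the trend
lambda <- exp(theta)             # Poisson intensity, as in y_i ~ Pois(exp(theta_i))
```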
Tokyo Rainfall

This problem concerns the estimation of the time-varying mean of an inhomogeneous binomial process. We are interested in estimating the seasonal trend in the daily probability of rainfall. The data are binary indicators of whether daily rainfall exceeded 1 mm in Tokyo, Japan, over the course of 39 consecutive years (1951-1989). The indicators were combined by day of year across years, resulting in a sample size of m_i = 39 for each of 365 out of 366 possible days, and a size of m_i = 10 for the additional day that occurred in each of the 10 leap years. The observation variable y_i is therefore a count, where y_i ∈ {0, 1, ..., m_i}. Data were obtained from NOAA's National Center for Climate Information (https://...).
Figure 5: Fits to Tokyo rainfall data for different prior distributions (panels: normal, Laplace, horseshoe; x-axis: month, January-December; y-axis: probability of rain). Posterior medians (lines), 95% credible intervals (shaded regions), and estimated probabilities (y_i / m_i) are shown.

We fit SPMRF models with Laplace and horseshoe priors and a GMRF model (normal prior). All models were based on second-order differences. The observation model was

$$y_i \mid \theta_i \sim \mathrm{Bin}\!\left(m_i,\ \frac{1}{1 + \exp(-\theta_i)}\right),$$

and the marginal prior distributions for the second-order differences were Δ²θ_j ~ N(0, γ²) for the normal prior, Δ²θ_j ~ Laplace(γ) for the Laplace, and Δ²θ_j ~ HS(γ) for the horseshoe. We used the same prior specifications as those used in the simulations for the remaining parameters, except we used the guidelines in Section 4.1 to set the hyperparameter on the global scale prior.
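As a sanity check on the observation model above, the R sketch below simulates counts from the binomial-logit likelihood given an arbitrary smooth latent trend; the trend, seed, and settings are hypothetical and not the fitted values.

```r
# Simulate y_i | theta_i ~ Bin(m_i, 1 / (1 + exp(-theta_i))) over a year grid.
set.seed(3)
days  <- 1:366
m     <- ifelse(days == 60, 10, 39)            # leap day observed in only 10 years
theta <- -1 + 0.8 * sin(2 * pi * days / 366)   # illustrative seasonal trend
p     <- 1 / (1 + exp(-theta))                 # inverse-logit link
y     <- rbinom(length(days), size = m, prob = p)
```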
Using calculations outlined in Appendix E, we computed σ_ref(θ), took U to be the estimated standard deviation of the data on the logit scale, and set α = 0.05; substituting into Equation (E.6) gives the value of ζ, so γ ~ C⁺(0, ζ) for each model. We ran four independent chains for each model, each with a burn-in of 500 followed by 6,250 draws thinned at every 5. This resulted in a total of 5,000 MCMC samples retained for each model.

The resulting function estimates for all models reveal a sharp increase in probability of rain in June, followed by a sharp decrease through July and early August, and a subsequent sharp increase in late August and September (Figure 5). Changes through the rest of the months were relatively smooth. The estimated function displays some variations in smoothness similar to the function with varying smoothness used in our simulations. All methods resulted in a similar estimated function, but the horseshoe prior resulted in a smoother function that displayed sharper features at transition points in late June and early August, yet also had narrower credible intervals over most of the function. The normal and Laplace models resulted in a little more variability in the trend in January-April and in November. In their analysis of a subset of these data, Rue and Held (2005) used a circular constraint to tie together the endpoints of the function at the beginning and end of the year. We did not use such a constraint here, although it is possible with the SPMRF models. Even so, it is evident that the horseshoe model resulted in more similar function estimates at the endpoints than did the other two models.

Discussion

We presented a method for curve fitting in a Bayesian context that achieves locally adaptive smoothing by exploiting the sparsity-inducing properties of shrinkage priors and the smoothing properties of GMRFs. We compared the performance of the Laplace prior, which simply reformulates the frequentist trend filter as a Bayesian analog, to a more aggressive horseshoe shrinkage prior by using simulations, and found that the horseshoe provided the best balance between bias and precision. The horseshoe prior has the greatest concentration of density near zero and the heaviest tails among the priors we investigated. This combination allows smooth functions to be fit in regions with weak signals or noisy data while still allowing for recovery of sharp functional changes when supported by informative data. The Laplace prior allowed more functional changes of moderate value to be retained and could not accommodate large changes without compromising the ability to shrink the noisy and smaller functional changes. This resulted in greater variability in the estimated functions and wider associated credible intervals for the models with the Laplace prior in comparison to those with the horseshoe prior when the underlying true functions had jumps or varying smoothness. The Laplace prior did have adaptive ability not possessed by the normal prior, but the horseshoe prior clearly had the best adaptive properties among the priors we investigated.

The Laplace prior performed better than the horseshoe for the constant and smooth functions in our simulations, with results closer to those of the normal prior, although the differences in performance among the three methods were relatively small. These functions do not have large deviations in order-k differences, and so there are many small or medium sized values for the estimated Δᵏθ.
This situation is reflective of cases described by Tibshirani (1996) where the lasso and ridge regression perform best, which helps explain why the analogous SPMRF models with Laplace or normal prior distributions do better here. We expect that non-adaptive or mildly adaptive methods will perform better when used on functions which do not exhibit jumps or varying smoothness. However, it is reassuring that an adaptive method does nearly as well as a non-adaptive method for these functions. This allows an adaptive model such as that using the horseshoe to be applied to a variety of functions with minimal risk of performance loss.

Our fully Bayesian implementation of the SPMRF models eliminates the need to explicitly select the global smoothing parameter λ, either directly or through selection methods such as cross-validation (e.g., Tibshirani (1996)) or marginal maximum likelihood (e.g., Park and Casella (2008)). However, the fully Bayesian approach does still require attention to the selection of the hyperparameter that controls the prior distribution on the smoothing parameter. We found the methods of Sørbye and Rue (2014) to offer practical guidelines for selecting this hyperparameter, and we successfully applied a modification of those methods in our data examples. A highly informative prior on the global smoothing parameter can result in oversmoothing if the prior overwhelms the information in the data, while a diffuse prior may result in a rougher function with insufficient smoothing. Noisier data are therefore more sensitive to the choice of parameterization of the prior on the global smoothing parameter. We tested prior sensitivity in the coal mining example and found that the horseshoe prior was more responsive to changes in hyperparameter values than the normal and Laplace priors (see Appendix F). However, the results for the simulations and for the Tokyo rainfall example were much more robust to the value of the hyperparameter on the global scale due to the information in the data. As a precaution, we recommend first applying methods such as those by Sørbye and Rue (2014) to set the hyperparameter, but then also paying attention to prior sensitivity when analyzing noisy data with the SPMRF models.

We only addressed one-dimensional problems here, but we think the GMRF representation of these models can allow extension to higher dimensions, such as the spatial setting, by incorporating methods used by Rue and Held (2005) and others. We also plan to extend these methods to semi-parametric models that allow additional covariates.

Acknowledgments
J.R.F. and V.N.M. were supported by the NIH grant R01 AI107034. J.R.F. was supported by the NOAA Advanced Studies Program and V.N.M. was supported by the NIH grant U54 GM111274. We are grateful to Soumik Pal for making us think harder about subordinated processes. We thank two anonymous reviewers for their help in improving this manuscript.
References
Abramovich, F., Sapatinas, T., and Silverman, B. W. (1998). "Wavelet thresholding via a Bayesian approach." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(4): 725–749.

Adams, R. P., Murray, I., and MacKay, D. J. (2009). "Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities." In Proceedings of the 26th Annual International Conference on Machine Learning, 9–16. ACM.

Andrews, D. F. and Mallows, C. L. (1974). "Scale mixtures of normal distributions." Journal of the Royal Statistical Society. Series B (Methodological), 36(1): 99–102.

Armagan, A., Dunson, D. B., and Lee, J. (2013). "Generalized Double Pareto Shrinkage." Statistica Sinica, 23(1): 119–143.

Betancourt, M. and Girolami, M. (2015). "Hamiltonian Monte Carlo for hierarchical models." In Upadhyay, S. K., Singh, U., Dey, D. K., and Loganathan, A. (eds.), Current Trends in Bayesian Methodology with Applications, 79–97. CRC Press.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2015). "The horseshoe+ estimator of ultra-sparse signals." arXiv preprint arXiv:1502.00560.

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015). "Dirichlet-Laplace priors for optimal shrinkage." Journal of the American Statistical Association, 110(512): 1479–1490.

Bochner, S. (1955). Harmonic Analysis and the Theory of Probability. University of California Press.

Brahim-Belhouari, S. and Bermak, A. (2004). "Gaussian process for nonstationary time series prediction." Computational Statistics & Data Analysis, 47(4): 705–712.

Carlin, B. P., Gelfand, A. E., and Smith, A. F. (1992). "Hierarchical Bayesian analysis of changepoint problems." Applied Statistics, 41(2): 389–405.

Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2016). "Stan: a probabilistic modeling language." Journal of Statistical Software, (in press).

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). "The Horseshoe Estimator for Sparse Signals." Biometrika, 97(2): 465–480.

Clark, P. K. (1973). "A subordinated stochastic process model with finite variance for speculative prices." Econometrica, 41(1): 135–155.

DiMatteo, I., Genovese, C. R., and Kass, R. E. (2001). "Bayesian curve-fitting with free-knot splines." Biometrika, 88(4): 1055–1071.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). "Hybrid Monte Carlo." Physics Letters B, 195(2): 216–222.

Figueiredo, M. A. (2003). "Adaptive sparseness for supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9): 1150–1159.

Fong, Y., Rue, H., and Wakefield, J. (2010). "Bayesian inference for generalized linear mixed models." Biostatistics, 11(3): 397–412.

Frank, I. E. and Friedman, J. H. (1993). "A statistical view of some chemometrics regression tools." Technometrics, 35(2): 109–135.

Gelman, A. and Rubin, D. B. (1992). "Inference from iterative simulation using multiple sequences." Statistical Science, 7(4): 457–472.

Gelman, A. (2006). "Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)." Bayesian Analysis, 1(3): 515–534.

Green, P. J. (1995). "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination." Biometrika, 82(4): 711–732.

Griffin, J. E. and Brown, P. J. (2010). "Inference with normal-gamma prior distributions in regression problems." Bayesian Analysis, 5(1): 171–188.

— (2013). "Some priors for sparse regression modelling." Bayesian Analysis, 8(3): 691–702.

Hoerl, A. E. and Kennard, R. W. (1970). "Ridge regression: Biased estimation for nonorthogonal problems." Technometrics, 12(1): 55–67.

Hoffman, M. D. and Gelman, A. (2014). "The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo." The Journal of Machine Learning Research, 15(1): 1593–1623.

Jarrett, R. G. (1979). "A Note on the Intervals Between Coal-Mining Disasters." Biometrika, 66(1): 191–193.

Johnstone, I. M. and Silverman, B. W. (2005). "Empirical Bayes selection of wavelet thresholds." Annals of Statistics, 33(4): 1700–1752.

Kim, S.-J., Koh, K., Boyd, S., and Gorinevsky, D. (2009). "ℓ₁ Trend Filtering." SIAM Review, 51(2): 339–360.

Kitagawa, G. (1987). "Non-Gaussian state-space modeling of nonstationary time series." Journal of the American Statistical Association, 82(400): 1032–1041.

Knorr-Held, L. and Rue, H. (2002). "On block updating in Markov random field models for disease mapping." Scandinavian Journal of Statistics, 29(4): 597–614.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). "Penalized Regression, Standard Errors, and Bayesian Lassos." Bayesian Analysis, 5(2): 369–411.

Lang, S., Fronk, E.-M., and Fahrmeir, L. (2002). "Function estimation with locally adaptive dynamic models." Computational Statistics, 17(4): 479–499.

Lindgren, F. and Rue, H. (2008). "On the Second-Order Random Walk Model for Irregular Locations." Scandinavian Journal of Statistics, 35(4): 691–700.

Maguire, B. A., Pearson, E. S., and Wynn, A. H. A. (1952). "The Time Intervals Between Industrial Accidents." Biometrika, 39(1/2): 168–180.

Mitchell, T. J. and Beauchamp, J. J. (1988). "Bayesian variable selection in linear regression." Journal of the American Statistical Association, 83(404): 1023–1032.

Murray, I. and Adams, R. P. (2010). "Slice sampling covariance hyperparameters of latent Gaussian models." In Advances in Neural Information Processing Systems, 1732–1740.

Murray, I., Adams, R. P., and MacKay, D. (2010). "Elliptical slice sampling." In International Conference on Artificial Intelligence and Statistics, 541–548.

Neal, R. (2011). "MCMC Using Hamiltonian Dynamics." Handbook of Markov Chain Monte Carlo, 2.

Neal, R. M. (1993). "Probabilistic inference using Markov chain Monte Carlo methods." Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.

— (1998). "Regression and classification using Gaussian process priors." Bayesian Statistics, 6: 475–501.

Paciorek, C. and Schervish, M. (2004). "Nonstationary covariance functions for Gaussian process regression." Advances in Neural Information Processing Systems, 16: 273–280.

Paciorek, C. J. and Schervish, M. J. (2006). "Spatial modelling using a new class of nonstationary covariance functions." Environmetrics, 17(5): 483–506.

Papaspiliopoulos, O., Roberts, G. O., and Sköld, M. (2003). "Non-centered parameterisations for hierarchical models and data augmentation." In Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., and West, M. (eds.), Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, 307–326. Oxford University Press, USA.

— (2007). "A general framework for the parametrization of hierarchical models." Statistical Science, 22(1): 59–73.

Park, T. and Casella, G. (2008). "The Bayesian Lasso." Journal of the American Statistical Association, 103(482): 681–686.

Polson, N. G. and Scott, J. G. (2010). "Shrink globally, act locally: sparse Bayesian regularization and prediction." Bayesian Statistics, 9: 501–538.

— (2012a). "Local Shrinkage Rules, Lévy Processes and Regularized Regression." Journal of the Royal Statistical Society: Series B (Methodological), 74(2): 287–311.

— (2012b). "On the Half-Cauchy Prior for a Global Scale Parameter." Bayesian Analysis, 7(4): 887–902.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

Raftery, A. E. and Akman, V. E. (1986). "Bayesian analysis of a Poisson process with a change-point." Biometrika, 73(1): 85–89.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

Reinsch, C. H. (1967). "Smoothing by spline functions." Numerische Mathematik, 10(3): 177–183.

Reményi, N. and Vidakovic, B. (2015). "Wavelet Shrinkage with Double Weibull Prior." Communications in Statistics - Simulation and Computation, 44(1): 88–104.

Roberts, G. O. and Stramer, O. (2002). "Langevin diffusions and Metropolis-Hastings algorithms." Methodology and Computing in Applied Probability, 4(4): 337–357.

Roualdes, E. A. (2015). "Bayesian Trend Filtering." arXiv preprint arXiv:1505.07710.

Rue, H. (2001). "Fast sampling of Gaussian Markov random fields." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2): 325–338.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press.

Scheipl, F. and Kneib, T. (2009). "Locally Adaptive Bayesian P-splines with a Normal-Exponential-Gamma Prior." Computational Statistics & Data Analysis, 53(10): 3533–3552.

Simpson, D. P., Martins, T. G., Riebler, A., Fuglstad, G.-A., Rue, H., and Sørbye, S. H. (2014). "Penalising model component complexity: A principled, practical approach to constructing priors." arXiv preprint arXiv:1403.4630.

Sørbye, S. H. and Rue, H. (2014). "Scaling intrinsic Gaussian Markov random field priors in spatial modelling." Spatial Statistics, 8: 39–51.

Speckman, P. L. and Sun, D. (2003). "Fully Bayesian spline smoothing and intrinsic autoregressive priors." Biometrika, 90(2): 289–302.

Stan Development Team (2015a). "RStan: the R interface to Stan, Version 2.6.2." URL http://mc-stan.org/rstan.html

— (2015b). Stan Modeling Language Users Guide and Reference Manual, Version 2.6.2. URL http://mc-stan.org/

Teh, Y. W. and Rao, V. (2011). "Gaussian Process Modulated Renewal Processes." In Advances in Neural Information Processing Systems, 2474–2482.

Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society. Series B (Methodological), 58(1): 267–288.

— (2011). "Regression shrinkage and selection via the lasso: a retrospective." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3): 273–282.

Tibshirani, R. J. (2014). "Adaptive Piecewise Polynomial Estimation via Trend Filtering." The Annals of Statistics, 42(1): 285–323.

Tibshirani, R. J. and Taylor, J. (2011). "The Solution Path of the Generalized Lasso." The Annals of Statistics, 39(3): 1335–1371.

Wahba, G. (1975). "Smoothing noisy data with spline functions." Numerische Mathematik, 24(5): 383–393.

West, M. (1987). "On scale mixtures of normal distributions." Biometrika, 74(3): 646–648.

Whittaker, E. T. (1922). "On a new method of graduation." Proceedings of the Edinburgh Mathematical Society, 41: 63–75.

Yue, Y. R., Simpson, D., Lindgren, F., and Rue, H. (2014). "Bayesian Adaptive Smoothing Splines Using Stochastic Differential Equations." Bayesian Analysis, 9(2): 397–424.

Yue, Y. R., Speckman, P. L., and Sun, D. (2012). "Priors for Bayesian adaptive spline smoothing." Annals of the Institute of Statistical Mathematics, 64(3): 577–613.

Zhu, B. and Dunson, D. B. (2013). "Locally adaptive Bayes nonparametric regression via nested Gaussian processes." Journal of the American Statistical Association, 108(504): 1445–1456.
A Approximation to the Horseshoe Density
There is no exact closed-form expression available for the horseshoe density function. We present an approximation to the horseshoe density that can be used without the need for explicit specification of the nuisance local scale parameters. Following Carvalho et al. (2010), the marginal distribution of u given global scale parameter γ is found by integrating over possible values of the local scale parameter τ, where u | τ ~ N(0, δτ²) and τ | γ ~ C⁺(0, γ). Here δ is a constant representing a scale factor for the distance between adjacent points when this distribution is used for the increments of a kth-order smoothing model. This leads to

$$p(u \mid \delta, \gamma) = \int_0^\infty p(u \mid \delta, \tau)\, p(\tau \mid \gamma)\, d\tau = \int_0^\infty \frac{1}{\sqrt{2\pi\delta}\,\tau} \exp\!\left(-\frac{u^2}{2\delta\tau^2}\right) \frac{2\gamma}{\pi(\tau^2 + \gamma^2)}\, d\tau.$$

We let $B = 2\gamma/(\pi\sqrt{2\pi\delta})$ and introduce the substitution $\omega = \tau^{-2}$, which gives $d\tau = -d\omega/(2\omega^{3/2})$, resulting in

$$p(u \mid \delta, \gamma) = \frac{B}{2} \int_0^\infty \frac{1}{1 + \omega\gamma^2} \exp\!\left(-\frac{u^2\omega}{2\delta}\right) d\omega.$$

Next we substitute $z = 1 + \omega\gamma^2$, which gives $d\omega = \gamma^{-2}\,dz$, and results in

$$p(u \mid \delta, \gamma) = \frac{B}{2\gamma^2} \exp\!\left(\frac{u^2}{2\delta\gamma^2}\right) \int_1^\infty \frac{1}{z} \exp\!\left(-\frac{u^2 z}{2\delta\gamma^2}\right) dz = \left(\frac{1}{2\pi^3\delta\gamma^2}\right)^{1/2} \exp\!\left(\frac{u^2}{2\delta\gamma^2}\right) E_1\!\left(\frac{u^2}{2\delta\gamma^2}\right),$$

where $E_1$ is the exponential integral function. Note that $\lim_{x \to 0^+} E_1(x) = \infty$, but for $x > 0$ the function $E_1(x)$ is bounded as follows:

$$\tfrac{1}{2}\, e^{-x} \ln\!\left(1 + \frac{2}{x}\right) < E_1(x) < e^{-x} \ln\!\left(1 + \frac{1}{x}\right).$$

Then for $u \in \{\mathbb{R} : u \neq 0\}$ we have

$$\tfrac{1}{2} \exp\!\left(-\frac{u^2}{2\delta\gamma^2}\right) \ln\!\left(1 + \frac{4\delta\gamma^2}{u^2}\right) < E_1\!\left(\frac{u^2}{2\delta\gamma^2}\right) < \exp\!\left(-\frac{u^2}{2\delta\gamma^2}\right) \ln\!\left(1 + \frac{2\delta\gamma^2}{u^2}\right).$$

It follows that the target density is bounded by

$$\tfrac{1}{2}\left(\frac{1}{2\pi^3\delta\gamma^2}\right)^{1/2} \ln\!\left(1 + \frac{4\delta\gamma^2}{u^2}\right) < p(u \mid \gamma) < \left(\frac{1}{2\pi^3\delta\gamma^2}\right)^{1/2} \ln\!\left(1 + \frac{2\delta\gamma^2}{u^2}\right). \quad (A.1)$$

Let the left bound in equation (A.1) be denoted $B_1(u)$ and the right bound $B_2(u)$. Note that as $u \to 0$, each of $B_1(u)$, $p(u \mid \gamma)$, and $B_2(u)$ approaches $\infty$. It can be shown that $\int_{-\infty}^{\infty} B_1(u)\,du = \sqrt{2/\pi}$ and $\int_{-\infty}^{\infty} B_2(u)\,du = 2/\sqrt{\pi}$. Since $\sqrt{2/\pi} < 1 < 2/\sqrt{\pi}$, these bounds can be used to find an approximate expression for $p(u \mid \gamma)$ that integrates to 1 and still satisfies equation (A.1). We set

$$\tilde{p}(u \mid \gamma) = w B_1(u) + (1 - w) B_2(u) \quad (A.2)$$

with constraints $0 < w < 1$ and $\int_{-\infty}^{\infty} \left[w B_1(u) + (1 - w) B_2(u)\right] du = 1$. Using the values for the integrated bounds and solving gives $w = (2 - \sqrt{\pi})/(2 - \sqrt{2})$. Substituting this value of $w$ into equation (A.2) and simplifying gives the following closed-form approximation to the horseshoe density function:

$$\tilde{p}(u \mid \gamma) = \left(\frac{1}{2\pi^3\delta\gamma^2}\right)^{1/2} \left[\frac{2 - \sqrt{\pi}}{2(2 - \sqrt{2})} \ln\!\left(1 + \frac{4\delta\gamma^2}{u^2}\right) + \frac{\sqrt{\pi} - \sqrt{2}}{2 - \sqrt{2}} \ln\!\left(1 + \frac{2\delta\gamma^2}{u^2}\right)\right]. \quad (A.3)$$
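The closed-form approximation (A.3) is easy to evaluate; the R sketch below implements it and checks numerically that it integrates to 1. The function name is ours, for illustration.

```r
# Approximate horseshoe density from Equation (A.3).
p_hs_approx <- function(u, gamma, delta = 1) {
  w <- (2 - sqrt(pi)) / (2 - sqrt(2))        # mixing weight from (A.2)
  K <- 1 / sqrt(2 * pi^3 * delta * gamma^2)  # common constant
  K * ((w / 2) * log(1 + 4 * delta * gamma^2 / u^2) +
       (1 - w) * log(1 + 2 * delta * gamma^2 / u^2))
}

# The density is symmetric, so integrate over (0, Inf) and double;
# the log singularity at u = 0 is integrable.
2 * integrate(p_hs_approx, 0, Inf, gamma = 1)$value  # ~1
```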
B Marginal Laplace Distribution with Irregular Grid Spacing

The following is a derivation of the marginal prior distribution for the order-k differences when grid spacing is unequal. These derivations are based on the scale-mixture representation of the Laplace distribution. These results are known to apply to the first-order and second-order models, but not necessarily to higher orders.

Let u_j = Δᵏθ_j and let δ_j be a constant representing a scale factor for the distance between adjacent points when this distribution is used for the increments of a kth-order smoothing model. For convenience, subscripts on u and δ are dropped from here forward. We assume u | τ, δ ~ N(0, δτ) and τ | λ ~ Exp(λ²/2), where λ = 1/γ is the global shrinkage parameter. It follows that

$$p(u \mid \delta, \lambda) = \int_0^\infty \frac{\lambda^2}{2} \exp\!\left(-\frac{\tau\lambda^2}{2}\right) \frac{1}{\sqrt{2\pi\delta\tau}} \exp\!\left(-\frac{u^2}{2\delta\tau}\right) d\tau = A \int_0^\infty \tau^{-1/2} \exp\!\left(-\frac{\tau\lambda^2}{2} - \frac{u^2}{2\delta\tau}\right) d\tau,$$

where $A = \lambda^2/(2\sqrt{2\pi\delta})$. Completing the square in the exponent (adding and subtracting $\lambda|u|/\delta^{1/2}$) and making the substitution $\omega = 1/\tau$, which gives $d\tau = -\omega^{-2}\,d\omega$, yields

$$p(u \mid \delta, \lambda) = \frac{\lambda}{2\sqrt{\delta}} \exp\!\left(-\frac{\lambda|u|}{\sqrt{\delta}}\right) \int_0^\infty \sqrt{\frac{\lambda^2}{2\pi\omega^3}} \exp\!\left(-\frac{\lambda^2(\omega - \mu)^2}{2\mu^2\omega}\right) d\omega = \frac{\lambda}{2\sqrt{\delta}} \exp\!\left(-\frac{\lambda|u|}{\sqrt{\delta}}\right),$$

where the last equality follows from the fact that the integrand is the pdf of an inverse-Gaussian distribution with mean parameter $\mu = \sqrt{\delta}\lambda/|u|$ and shape parameter $\lambda^2$, which integrates to one. The result is the pdf of a Laplace distribution with mean zero and rate $\lambda/\sqrt{\delta}$ (equivalently, scale $\sqrt{\delta}/\lambda$). Note that the variance of this Laplace distribution is $2\delta/\lambda^2$, which implies that the grid spacing factor δ scales the variance of the increments u.
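A quick Monte Carlo check of this normal-exponential mixture result is below: with τ ~ Exp(λ²/2) as the variance and u | τ ~ N(0, δτ), the sample variance of u should match 2δ/λ². The parameter values are arbitrary.

```r
# Monte Carlo check of the Laplace marginal derived above.
set.seed(2)
lambda <- 2; delta <- 1.5
tau <- rexp(1e5, rate = lambda^2 / 2)  # exponential mixing on the variance
u   <- rnorm(1e5, mean = 0, sd = sqrt(delta * tau))
c(empirical = var(u), theoretical = 2 * delta / lambda^2)
```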
C Data Example with Irregular Grid

We apply the SPMRF models to a data set with a continuous covariate. The response data are rent per square meter of floor space in Munich, Germany, and the covariate is the floor space in square meters. These data were analysed by Rue and Held (2005) using a second-order GMRF with irregular spacing. Here we apply a second-order GMRF and SPMRF models using the methods described in Section 2.4 of the main text.

Let x represent the floor space measurements, and let x₁ < x₂ < ... < xₙ be the ordered set of unique floor measurement values. Further, let δ_j = x_{j+1} − x_j be the distance between adjacent floor space measurements. The marginal prior distributions for the second-order differences were Δ²θ_j ~ N(0, d_j γ²) for the normal prior, Δ²θ_j ~ Laplace(d_j^{1/2} γ) for the Laplace, and Δ²θ_j ~ HS(d_j^{1/2} γ) for the horseshoe, where

$$\Delta^2\theta_j = \theta(x_{j+2}) - \left(1 + \frac{\delta_{j+1}}{\delta_j}\right)\theta(x_{j+1}) + \frac{\delta_{j+1}}{\delta_j}\,\theta(x_j), \qquad d_j = \frac{\delta_{j+1}(\delta_j + \delta_{j+1})}{2}.$$

Using methods described in Section 4.1 of the main text and Appendix E, we calculated the value of the hyperparameter ζ for the global scale parameter, so that γ ~ C⁺(0, ζ) for each model.
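The increment weights and scale factors above are simple to compute from the sorted unique covariate values; a minimal R sketch (our own helper, with hypothetical toy inputs) follows.

```r
# Second-order increment weights and scale factors for an irregular grid:
# Delta^2 theta_j = theta(x_{j+2}) - (1 + r_j) theta(x_{j+1}) + r_j theta(x_j),
# with r_j = delta_{j+1} / delta_j and d_j = delta_{j+1} (delta_j + delta_{j+1}) / 2.
rw2_irregular <- function(x) {
  delta <- diff(x)              # gaps between adjacent unique locations
  J <- length(x) - 2            # number of second-order increments
  r <- delta[2:(J + 1)] / delta[1:J]
  d <- delta[2:(J + 1)] * (delta[1:J] + delta[2:(J + 1)]) / 2
  list(r = r, d = d)            # prior sd for increment j: sqrt(d[j]) * gamma
}

w <- rw2_irregular(sort(unique(c(20, 25, 31, 40, 58))))  # toy floor sizes
```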
Figure C.1: Results for models using irregular grids for Munich rent data (panels: normal, Laplace, horseshoe; x-axis: floor size; y-axis: rent). Posterior medians (dark line) are shown with 95% Bayesian credible intervals (BCIs). Locations of data are shown with vertical bars at the bottom of plots.
D Additional Simulation Results
Here we display plots and tables with additional simulation results for normal data with σ = 0.5, for Poisson data, and for binomial data, for each model and trend function type.

Table D.1: Mean values of performance measures across 100 simulations for normal observations (σ = 0.5) for each model and trend function type.
Function Model MAD MCIW MASV TMASV
Constant          Normal     0.115  0.698  0.002  0.000
                  Laplace    0.116  0.731  0.002  0.000
                  Horseshoe  0.127  0.921  0.004  0.000
Piecewise Const.  Normal     1.040  5.479  1.647  0.606
                  Laplace    0.899  3.282  1.557  0.606
                  Horseshoe  0.281  1.918  0.638  0.606
Smooth            Normal     0.565  2.985  1.391  1.406
                  Laplace    0.561  2.985  1.393  1.406
                  Horseshoe  0.565  2.946  1.414  1.406
Varying Smooth    Normal     0.586  3.036  0.613  0.543
                  Laplace    0.550  2.898  0.592  0.543
                  Horseshoe  0.438  2.228  0.558  0.543

Table D.2: Mean values of performance measures across 100 simulations for Poisson observations for each model and trend function type.
Function Model MAD MCIW MASV TMASV
Constant          Normal     0.022  0.142  0.001  0.000
                  Laplace    0.023  0.149  0.001  0.000
                  Horseshoe  0.025  0.167  0.001  0.000
Piecewise Const.  Normal     0.109  0.557  0.077  0.030
                  Laplace    0.092  0.529  0.064  0.030
                  Horseshoe  0.051  0.334  0.036  0.030
Smooth            Normal     0.078  0.379  0.072  0.079
                  Laplace    0.078  0.380  0.072  0.079
                  Horseshoe  0.079  0.382  0.073  0.079
Varying Smooth    Normal     0.067  0.296  0.020  0.023
                  Laplace    0.066  0.295  0.020  0.023
                  Horseshoe  0.058  0.277  0.020  0.023

Table D.3: Mean values of performance measures across 100 simulations for binomial observations for each model and trend function type.
Function Model MAD MCIW MASV TMASV
Constant          Normal     0.042  0.249  0.001  0.000
                  Laplace    0.043  0.262  0.001  0.000
                  Horseshoe  0.047  0.311  0.002  0.000
Piecewise Const.  Normal     0.229  1.191  0.166  0.066
                  Laplace    0.193  1.126  0.137  0.066
                  Horseshoe  0.108  0.690  0.076  0.066
Smooth            Normal     0.139  0.733  0.110  0.117
                  Laplace    0.139  0.735  0.111  0.117
                  Horseshoe  0.143  0.740  0.113  0.117
Varying Smooth    Normal     0.188  0.730  0.056  0.068
                  Laplace    0.183  0.726  0.056  0.068
                  Horseshoe  0.149  0.676  0.058  0.068
Figure D.1: Functions used in simulations and simulation results by model (N = normal, L = Laplace, H = horseshoe) and function type (constant, piecewise constant, smooth GP, varying smoothness) for normally distributed data with σ = 0.5. Top row shows true functions (dashed lines) with example simulated data. Remaining rows show mean absolute deviation (MAD), mean credible interval width (MCIW), and mean absolute sequential variation (MASV). Horizontal dashed line in plots on bottom row is the true mean absolute sequential variation (TMASV). Shown for each model are standard boxplots of simulation results (left) and mean values with 95% frequentist confidence intervals (right).
Figure D.2: Functions used in simulations and simulation results by model (N = normal, L = Laplace, H = horseshoe) and function type for Poisson distributed data. Top row shows true functions (dashed lines) with example simulated data. Remaining rows show mean absolute deviation (MAD), mean credible interval width (MCIW), and mean absolute sequential variation (MASV). Horizontal dashed line in plots on bottom row is the true mean absolute sequential variation (TMASV). Shown for each model are standard boxplots of simulation results (left) and mean values with 95% frequentist confidence intervals (right).
Figure D.3: Functions used in simulations and simulation results by model (N = normal, L = Laplace, H = horseshoe) and function type for binomial distributed data. Top row shows true functions (dashed lines) with empirical probability estimates from example simulated data. Remaining rows show mean absolute deviation (MAD), mean credible interval width (MCIW), and mean absolute sequential variation (MASV). Horizontal dashed line in plots on bottom row is the true mean absolute sequential variation (TMASV). Shown for each model are standard boxplots of simulation results (left) and mean values with 95% frequentist confidence intervals (right).

E Parameterizing the Global Smoothing Prior
Here we provide additional details for calculating the hyperparameter ζ for the prior on the global scale parameter γ, where γ ~ C⁺(0, ζ). First let Q be the precision matrix for the Markov random field corresponding to the model of interest (see examples below), and Σ = Q⁻¹ be the covariance matrix with diagonal elements Σ_ii. Following Sørbye and Rue (2014), the marginal standard deviation of all components of θ for a fixed value of γ is σ_γ(θ_i) = γ σ_ref(θ), where

$$\sigma_{\mathrm{ref}}(\theta) = \exp\!\left\{\frac{1}{n}\sum_{i=1}^{n} \log \sigma_{\{\gamma=1\}}(\theta_i)\right\} = \exp\!\left\{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\log(\Sigma_{ii})\right\}. \quad (E.1)$$

That is, σ_ref(θ) is the geometric mean of the individual marginal standard deviations when γ = 1. The matrix Σ depends on the marginal variance of the first component of θ, which we denote ω² = Var(θ₁). For a given value of γ, the n × n precision matrix for a first-order model is:

$$Q_1 = \frac{1}{\gamma^2}\begin{pmatrix} \gamma^2/\omega^2 + 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}, \quad (E.2)$$

and the corresponding covariance matrix has entries $\Sigma_{1,ij} = \omega^2 + (\min(i, j) - 1)\gamma^2$:

$$\Sigma_1 = \begin{pmatrix} \omega^2 & \omega^2 & \cdots & \omega^2 \\ \omega^2 & \omega^2 + \gamma^2 & \cdots & \omega^2 + \gamma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \omega^2 & \omega^2 + \gamma^2 & \cdots & \omega^2 + (n-1)\gamma^2 \end{pmatrix}. \quad (E.3)$$

Therefore, the marginal variances for the first-order model are $\Sigma_{1,ii} = \omega^2 + (i-1)\gamma^2$. For the second-order model, the n × n precision matrix is:

$$Q_2 = \frac{1}{\gamma^2}\begin{pmatrix} \gamma^2/\omega^2 + 2 & -3 & 1 & & & \\ -3 & 6 & -4 & 1 & & \\ 1 & -4 & 6 & -4 & 1 & \\ & \ddots & \ddots & \ddots & \ddots & \ddots \\ & & 1 & -4 & 5 & -2 \\ & & & 1 & -2 & 1 \end{pmatrix}. \quad (E.4)$$

There is an analytical form for the covariance matrix for the second-order model, but it suffices here to know that the form of the marginal variances is:

$$\Sigma_{2,ii} = \omega^2 + \frac{i(i-1)(2i-1)}{6}\,\gamma^2. \quad (E.5)$$

Note that if we were using an intrinsic GMRF model, we would assume that ω² is infinite, which would result in a covariance matrix of rank n − k. Following Sørbye and Rue (2014) we would then use the generalized inverse of the precision matrix to calculate the marginal variances. In practice we use the variance of the data (transformed data if the θ parameters are on a transformed scale) as an estimate of ω². Although this is using the data twice, this offers a reasonable constraint on the marginal variances of the θs.

We want to set an upper bound U on the average marginal standard deviation of θ_i, such that Pr(σ_γ(θ_i) > U) = α, where α is some small probability. Using the cumulative probability function for a half-Cauchy distribution, we can find a value of ζ for a given value of σ_ref(θ) specific to a model of interest and given common values of U and α by:

$$\zeta = \frac{U}{\sigma_{\mathrm{ref}}(\theta)\,\tan\!\left(\tfrac{\pi}{2}(1-\alpha)\right)}. \quad (E.6)$$

It may be useful to note here that the median of a half-Cauchy distribution is equal to its scale parameter, since the median may be a more intuitive measure of the effect of different values of ζ. For our data examples in the main text, we let U be the estimated standard deviation of the data on the appropriate scale. We know that the marginal variances of the θs should not exceed the variance in the observed data, on average. We set α = 0.05 as the probability of the average marginal standard deviation exceeding U. For the coal mining example in the main text, we found an estimate of the variance of the data on the log scale as the sample variance of the adjusted log counts, $\sum_{i=1}^{n} \left(\ln(y_i + 0.5) - \overline{\ln(y + 0.5)}\right)^2 / (n-1)$, where y_i is the observed count at time i = 1, ..., n. For the Tokyo rain example, we estimated the variance of the data on the logit scale as the sample variance of $\mathrm{logit}((y_i + q_i)/m_i)$, where y_i is the number of years with rain on day i out of m_i possible years, and $q_i = 0.5\,I_{\{y_i = 0\}} - 0.5\,I_{\{y_i = m_i\}}$ (so q_i = 0 when y_i ∉ {0, m_i}), where I is an indicator function.

Suppose we have calculated ζ_{o1}, the hyperparameter for a first-order model, given the corresponding average marginal standard deviation σ_ref(θ_{o1}) using Equation (E.6). If we wish to calculate the value of ζ_{o2} for a second-order model we can simply use

$$\zeta_{o2} = \zeta_{o1}\,\frac{\sigma_{\mathrm{ref}}(\theta_{o1})}{\sigma_{\mathrm{ref}}(\theta_{o2})}.$$

Now suppose we have a model with n equidistant nodes and want to increase the density of the grid to kn nodes without changing the range of the grid. For a first-order model, Var(Δθ_new) = k⁻¹ Var(Δθ), and for a second-order model Var(Δ²θ_new) = k⁻³ Var(Δ²θ) (Lindgren and Rue, 2008; Sørbye and Rue, 2014). In terms of the hyperparameter for the global smoothing prior, for the first-order model ζ_{o1,new} = k^{-1/2} ζ_{o1}, and for the second-order model ζ_{o2,new} = k^{-3/2} ζ_{o2}.
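Putting (E.1), the first-order marginal variances, and (E.6) together, the R sketch below computes σ_ref(θ) and ζ for a first-order model. The value of ω² here is a stand-in for a data-based estimate, and the helper name is ours.

```r
# sigma_ref for a first-order model: Sigma_ii = omega^2 + (i - 1) at gamma = 1,
# then zeta from Equation (E.6).
sigma_ref_rw1 <- function(n, omega2) {
  sd_i <- sqrt(omega2 + 0:(n - 1))  # marginal sds with gamma = 1
  exp(mean(log(sd_i)))              # geometric mean, Equation (E.1)
}

n <- 112; omega2 <- 1.0             # omega2: e.g., variance of log counts
U <- sqrt(omega2)                   # upper bound U = sd of the data
zeta <- U / (sigma_ref_rw1(n, omega2) * tan(pi * (1 - 0.05) / 2))
```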
Figure F.1: Model fits to coal mining accidents data (accidents per year by year; panels: normal, Laplace, horseshoe) by model type and value of the hyperparameter for the global scale: a) ζ = 1, b) ζ = 0.01, and c) the smallest value tested. Posterior medians (lines) and 95% BCIs (shaded regions) are shown.

F Prior Sensitivity
We tested the sensitivity of the three prior formulations (normal, Laplace, and horseshoe) to the value of the hyperparameter (ζ) which controls the scale of the distribution on the smoothing parameter γ, where γ ~ C⁺(0, ζ). A smaller value of ζ constricts γ to be closer to zero, which in turn constricts the scales of the priors on the order-k differences. We tested three levels for the hyperparameter: a) ζ = 1, b) ζ = 0.01, and c) a third, still smaller value. The coal mining disaster data offered a good test set because the observations are relatively noisy.

Clearly the horseshoe prior was the most sensitive to the level of ζ (Figure F.1). In particular, the horseshoe results for ζ = 1 differ noticeably from those for ζ = 0.01 and the smallest value tested.

G Computational Efficiency

To compare SPMRF and GMRF models' computational efficiency, we calculated the effective number of posterior samples per second of computation time (ESSps) for different model formulations and data configurations. We used the scenario with a piecewise-constant expected value from our main simulations (Section 3) to test the effect of model type, model order, and number of grid cells (n) on the ESS per second of sampling time. Here sampling time is defined as the total run time minus the time spent in the adaptation (warm-up or burn-in) phase, where time is measured in seconds of CPU time. We also calculated the ESSps for the coal mining and Tokyo rainfall examples.

There were three simulated scenarios with piecewise-constant trend: 1) order-1 with n =
100 observations and grid points (one observation per grid point), 2) order-1 with a larger number of grid points, and 3) an order-2 model, each with normally distributed observations. For each of these scenarios we ran 4 independent chains, each with 1,000 iterations of burn-in and 2,500 iterations post burn-in thinned at every 5, for a total of 2,000 retained posterior samples combined across chains. The maximum ESS would therefore be 2,000 for these scenarios. The chains were run in sequence, so the total time (TCPU) and sampling time (SCPU) are the respective cumulative times across the 4 chains. For the coal mining and Tokyo rainfall examples we used the same settings for number of iterations and thinning as were used in the main text (Section 4). We calculated effective sample size using the methods described in the documentation for Stan (Stan Development Team, 2015b).

Table G.1: Measures of computational efficiency for each model type (Normal (N), Laplace (L), or Horseshoe (H)) for three simulated data scenarios and two real data examples. Model order and number of parameters (p) are shown. The total CPU time (TCPU; includes adaptive phase) and sampling CPU time (SCPU) are in seconds. The minimum and mean effective sample sizes per second (ESSps) of SCPU are also shown.

Scenario               Model  Order  p  TCPU  SCPU  Min.ESSps  MeanESSps
Piece. Const. (n = ...)  ...
Piece. Const. (n = ...)  ...
Piece. Const. (n = ...)  ...
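The ESSps quantities reported in Table G.1 can be computed from a fitted rstan object roughly as follows; `fit` is assumed to be an existing stanfit object, and the column names follow the rstan documentation.

```r
# Effective samples per second of sampling time from a stanfit object.
library(rstan)
ess   <- summary(fit)$summary[, "n_eff"]          # per-parameter ESS
scpu  <- sum(get_elapsed_time(fit)[, "sample"])   # sampling CPU across chains
essps <- ess / scpu
c(Min.ESSps = min(essps), MeanESSps = mean(essps))
```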