BNSP: an R Package for Fitting Bayesian Semiparametric Regression Models and Variable Selection
Georgios Papageorgiou
Department of Economics, Mathematics and Statistics
Birkbeck, University of London, UK
[email protected]
Abstract
The R package
BNSP provides a unified framework for semiparametric location-scale regression and stochastic search variable selection. The statistical methodology that the package is built upon utilizes basis function expansions to represent semiparametric covariate effects in the mean and variance functions, and spike-and-slab priors to perform selection and regularization of the estimated effects. In addition to the main function that performs posterior sampling, the package includes functions for assessing convergence of the sampler, summarizing model fits, visualizing covariate effects and obtaining predictions for new responses or their means given feature/covariate vectors.
Keywords: additive models; basis functions; heteroscedastic models; variable selection
There are many approaches to non- and semi-parametric modeling. From a Bayesian perspective, Müller & Mitra (2013) provide a review that covers methods for density estimation, modeling of random effects distributions in mixed effects models, clustering and modeling of unknown functions in regression models. Our interest is in Bayesian methods for modeling unknown functions in regression models. In particular, we are interested in modeling both the mean and variance functions non-parametrically, as general functions of the covariates. There are multiple reasons why allowing the variance function to be a general function of the covariates may be important (Chan et al., 2006). Firstly, it can result in more realistic prediction intervals than those obtained by assuming constant error variance, or, as Müller & Mitra (2013) put it, it can result in a more honest representation of uncertainties. Secondly, it allows the practitioner to examine and understand which covariates drive the variance. Thirdly, it results in more efficient estimation of the mean function. Lastly, it produces more accurate standard errors of unknown parameters.

In the R (R Core Team, 2016) package
BNSP (Papageorgiou, 2018) we implemented Bayesian regression models with Gaussian errors and with mean and log-variance functions that can be modeled as general functions of the covariates. Covariate effects may enter the mean and log-variance functions parametrically or non-parametrically, with the nonparametric effects represented as linear combinations of basis functions. The strategy that we follow in representing unknown functions is to utilize a large number of basis functions. This allows for flexible estimation and for capturing true effects that are locally adaptive. Potential problems associated with large numbers of basis functions, such as over-fitting, are avoided in our implementation, and efficient estimation is achieved, by utilizing spike-and-slab priors for variable selection. A review of variable selection methods is provided by O'Hara & Sillanpää (2009).

The methods described here belong to the general class of models known as generalized additive models for location, scale and shape (GAMLSS) (Rigby & Stasinopoulos, 2005; Stasinopoulos & Rigby, 2007), or the Bayesian analogue termed BAMLSS (Umlauf et al., 2018) and implemented in package bamlss (Umlauf et al., 2017). However, due to the nature of the spike-and-slab priors that we have implemented, in addition to flexible modeling of the mean and variance functions, the methods described here can also be utilized for selecting promising subsets of predictor variables in multiple regression models. The implemented methods fall in the general class of methods known as stochastic search variable selection (SSVS). SSVS has received considerable attention in the Bayesian literature and its applications range from investigating factors that affect individuals' happiness (George & McCulloch, 1993), to constructing financial indexes (George & McCulloch, 1997) and to gene mapping (O'Hara & Sillanpää, 2009).
These methods associate each regression coefficient, either a main effect or the coefficient of a basis function, with a latent binary variable that indicates whether the corresponding covariate is needed in the model or not. Hence, the joint posterior distribution of the vector of these binary variables can identify the models with the highest posterior probability.

R packages that are related to BNSP include spikeSlabGAM (Scheipl, 2016), which also utilizes SSVS methods (Scheipl, 2011). A major difference between the two packages, however, is that whereas spikeSlabGAM utilizes spike-and-slab priors for function selection,
BNSP utilizes spike-and-slab priors for variable selection. In addition, Bayesian GAMLSS models, also referred to as distributional regression models, can also be fit with R package brms using normal priors (Bürkner, 2018). Further, the R package gamboostLSS (Hofner et al., 2018) includes a frequentist GAMLSS implementation based on boosting that can handle high-dimensional data (Mayr et al., 2012). Lastly, the R package mgcv (Wood, 2018) can also fit generalized additive models with Gaussian errors and integrated smoothness estimation, with implementations that can handle large datasets.

In
BNSP we have implemented functions for fitting such semi-parametric models, summarizing model fits, visualizing covariate effects and predicting new responses or their means. The main functions are mvrm, mvrm2mcmc, print.mvrm, summary.mvrm, plot.mvrm and predict.mvrm. A quick description of these functions follows. The first one, mvrm, returns samples from the posterior distributions of the model parameters, and it is based on an efficient Markov chain Monte Carlo (MCMC) algorithm in which we integrate out the coefficients in the mean function, generate the variable selection indicators in blocks (Chan et al., 2006) and choose the MCMC tuning parameters adaptively (Roberts & Rosenthal, 2009). In order to minimize random-access memory utilization, posterior samples are not kept in memory, but instead written to files in a directory supplied by the user. The second function, mvrm2mcmc, reads in the samples from the posterior of the model parameters and creates an object of class "mcmc". This enables users to easily utilize functions from package coda (Plummer et al., 2006), including functions plot and summary for assessing convergence and for summarizing posterior distributions. Further, functions print.mvrm and summary.mvrm provide summaries of model fits, including the models and priors specified, marginal posterior probabilities of term inclusion in the mean and variance models, and the models with the highest posterior probabilities. Function plot.mvrm creates plots of parametric and nonparametric terms that appear in the mean and variance models. The function can create two-dimensional plots by calling functions from package ggplot2 (Wickham, 2009). It can also create static or interactive three-dimensional plots by calling functions from packages plot3D (Soetaert, 2016) and threejs (Lewis, 2016).
Lastly, function predict.mvrm provides predictions either for new responses or their means given feature/covariate vectors.

We next provide a detailed model description followed by illustrations of the usage of the package and the options it provides. Technical details on the implementation of the MCMC algorithm are provided in the Appendix. The paper concludes with a brief summary.
Mean-variance nonparametric regression models
Let $y = (y_1, \ldots, y_n)^\top$ denote the vector of responses and let $X = [x_1, \ldots, x_n]^\top$ and $Z = [z_1, \ldots, z_n]^\top$ denote design matrices. The models that we consider express the vector of responses utilizing $Y = \beta_0 1_n + X \beta_1 + \epsilon$, where $1_n$ is the usual $n$-dimensional vector of ones, $\beta_0$ is an intercept term, $\beta_1$ is a vector of regression coefficients and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$ is an $n$-dimensional vector of independent random errors. Each $\epsilon_i$, $i = 1, \ldots, n$, is assumed to have a normal distribution, $\epsilon_i \sim N(0, \sigma_i^2)$, with variances that are modeled in terms of covariates. Let $\sigma^2 = (\sigma_1^2, \ldots, \sigma_n^2)^\top$. We model the vector of variances utilizing $\log(\sigma^2) = \alpha_0 1_n + Z \alpha_1$, where $\alpha_0$ is an intercept term and $\alpha_1$ is a vector of regression coefficients. Equivalently, the model for the variances can be expressed as
$$\sigma_i^2 = \sigma^2 \exp(z_i^\top \alpha_1), \quad i = 1, \ldots, n, \quad (1)$$
where $\sigma^2 = \exp(\alpha_0)$.

Let $D(\alpha_1)$ denote an $n$-dimensional diagonal matrix with elements $\exp(z_i^\top \alpha_1 / 2)$, $i = 1, \ldots, n$. Then the model that we consider may be expressed more economically as
$$Y = X^* \beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 D^2(\alpha_1)), \quad (2)$$
where $\beta = (\beta_0, \beta_1^\top)^\top$ and $X^* = [1_n, X]$.

In the next subsections we describe how, within model (2), both parametric and nonparametric effects of explanatory variables on the mean and variance functions can be captured utilizing regression splines and variable selection methods. We begin by considering the special case where there is a single covariate entering the mean model and a single covariate entering the variance model. Suppose that the observed data set consists of triplets $(y_i, u_i, w_i)$, $i = 1, \ldots, n$, where explanatory variables $u$ and $w$ enter flexibly the mean and variance models, respectively.
To model the nonparametric effects of $u$ and $w$ we consider the following formulations of the mean and variance models
$$\mu_i = \beta_0 + f_{\mu}(u_i) = \beta_0 + \sum_{j=1}^{q_1} \beta_j \phi_{1j}(u_i) = \beta_0 + x_i^\top \beta_1, \quad (3)$$
$$\log(\sigma_i^2) = \alpha_0 + f_{\sigma}(w_i) = \alpha_0 + \sum_{j=1}^{q_2} \alpha_j \phi_{2j}(w_i) = \alpha_0 + z_i^\top \alpha_1. \quad (4)$$
In the preceding, $x_i = (\phi_{11}(u_i), \ldots, \phi_{1q_1}(u_i))^\top$ and $z_i = (\phi_{21}(w_i), \ldots, \phi_{2q_2}(w_i))^\top$ are vectors of basis functions, and $\beta_1 = (\beta_1, \ldots, \beta_{q_1})^\top$ and $\alpha_1 = (\alpha_1, \ldots, \alpha_{q_2})^\top$ are the corresponding coefficients.

In package BNSP we implemented two sets of basis functions. Firstly, radial basis functions
$$\mathcal{B}_1 = \left\{ \phi_1(u) = u, \; \phi_2(u) = ||u - \xi_1||^2 \log\left(||u - \xi_1||^2\right), \ldots, \phi_q(u) = ||u - \xi_{q-1}||^2 \log\left(||u - \xi_{q-1}||^2\right) \right\}, \quad (5)$$
where $||u||$ denotes the Euclidean norm of $u$ and $\xi_1, \ldots, \xi_{q-1}$ are the knots that within package BNSP are chosen as the quantiles of the observed values of explanatory variable $u$, with $\xi_1 = \min(u_i)$, $\xi_{q-1} = \max(u_i)$ and the remaining knots chosen as equally spaced quantiles between $\xi_1$ and $\xi_{q-1}$.

Secondly, we implemented thin plate splines
$$\mathcal{B}_2 = \{ \phi_1(u) = u, \; \phi_2(u) = (u - \xi_1)_+, \ldots, \phi_q(u) = (u - \xi_{q-1})_+ \},$$
where $(a)_+ = \max(a, 0)$ and the knots $\xi_1, \ldots, \xi_{q-1}$ are determined as above.

In addition, BNSP supports the smooth constructors from package mgcv, e.g. the low-rank thin plate splines, cubic regression splines, P-splines, their cyclic versions and others. Examples of how these smooth terms are used within
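To make the construction of the radial basis concrete, the following is an illustrative sketch of how such a design matrix could be built; it is our own code, not code from BNSP, and the function name rbf_design is our assumption.

```r
# Sketch (not BNSP code): radial basis design matrix with q basis functions,
# i.e. one linear term plus q - 1 radial terms, with knots placed at equally
# spaced quantiles of u (including the minimum and maximum), as in the text.
rbf_design <- function(u, q) {
  xi <- quantile(u, probs = seq(0, 1, length.out = q - 1))  # knots xi_1, ..., xi_{q-1}
  X <- matrix(0, nrow = length(u), ncol = q)
  X[, 1] <- u                                   # phi_1(u) = u
  for (j in 2:q) {
    r2 <- (u - xi[j - 1])^2                     # squared distance to knot
    X[, j] <- ifelse(r2 > 0, r2 * log(r2), 0)   # ||u - xi||^2 log(||u - xi||^2)
  }
  X
}

u <- sort(runif(100))
X <- rbf_design(u, q = 21)   # 21 basis functions, 20 knots, as in the examples below
```

The `ifelse` guard handles observations that coincide with a knot, where the radial term has the limiting value zero.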
BNSP are provided later in this paper.

Locally adaptive models for the mean and variance functions are obtained utilizing the methodology developed by Chan et al. (2006). Considering the mean function, local adaptivity is achieved by utilizing a large number of basis functions $q_1$. Over-fitting, and the problems associated with it, are avoided by allowing positive prior probability that the regression coefficients are exactly zero. The latter is achieved by defining binary variables $\gamma_j$, $j = 1, \ldots, q_1$, that take value $\gamma_j = 1$ if $\beta_j \neq 0$ and $\gamma_j = 0$ if $\beta_j = 0$. Hence, vector $\gamma = (\gamma_1, \ldots, \gamma_{q_1})^\top$ determines which terms enter the mean model. The vector of indicators $\delta = (\delta_1, \ldots, \delta_{q_2})^\top$ for the variance function is defined analogously.

Given vectors $\gamma$ and $\delta$, the heteroscedastic, semiparametric model (2) can be written as
$$Y = X^*_{\gamma} \beta_{\gamma} + \epsilon, \quad \epsilon \sim N(0, \sigma^2 D^2(\alpha_{\delta})),$$
where $\beta_{\gamma}$ consists of all non-zero elements of $\beta$ and $X^*_{\gamma}$ consists of the corresponding columns of $X^*$. Subvector $\alpha_{\delta}$ is defined analogously.

We note that, as was suggested by Chan et al. (2006), we work with mean-corrected columns in the design matrices $X$ and $Z$, both in this paper and in the BNSP implementation. We remove the mean from all columns in the design matrices except those that correspond to categorical variables.
Let $\tilde{X} = D(\alpha)^{-1} X^*$. The prior for $\beta_\gamma$ is specified as (Zellner, 1986)
$$\beta_\gamma \mid c_\beta, \sigma^2, \gamma, \alpha, \delta \sim N\left(0, c_\beta \sigma^2 (\tilde{X}_\gamma^\top \tilde{X}_\gamma)^{-1}\right).$$
Further, the prior for $\alpha_\delta$ is specified as
$$\alpha_\delta \mid c_\alpha, \delta \sim N(0, c_\alpha I).$$
Independent priors are specified for the indicator variables $\gamma_j$ as $P(\gamma_j = 1 \mid \pi_\mu) = \pi_\mu$, $j = 1, \ldots, q_1$, from which the joint prior is obtained as
$$P(\gamma \mid \pi_\mu) = \pi_\mu^{N(\gamma)} (1 - \pi_\mu)^{q_1 - N(\gamma)},$$
where $N(\gamma) = \sum_{j=1}^{q_1} \gamma_j$. Similarly, for the indicators $\delta_j$ we specify independent priors $P(\delta_j = 1 \mid \pi_\sigma) = \pi_\sigma$, $j = 1, \ldots, q_2$. It follows that the joint prior is
$$P(\delta \mid \pi_\sigma) = \pi_\sigma^{N(\delta)} (1 - \pi_\sigma)^{q_2 - N(\delta)}, \quad N(\delta) = \sum_{j=1}^{q_2} \delta_j.$$
We specify inverse Gamma priors for $c_\beta$ and $c_\alpha$ and Beta priors for $\pi_\mu$ and $\pi_\sigma$:
$$c_\beta \sim \mathrm{IG}(a_\beta, b_\beta), \quad c_\alpha \sim \mathrm{IG}(a_\alpha, b_\alpha), \quad \pi_\mu \sim \mathrm{Beta}(c_\mu, d_\mu), \quad \pi_\sigma \sim \mathrm{Beta}(c_\sigma, d_\sigma). \quad (6)$$
Lastly, for $\sigma^2$ we consider inverse Gamma and half-normal priors
$$\sigma^2 \sim \mathrm{IG}(a_\sigma, b_\sigma) \quad \text{and} \quad |\sigma| \sim N(0, \phi^2_\sigma). \quad (7)$$

It is straightforward to extend the methodology described earlier to allow fitting of flexible mean and variance surfaces. In fact, the only modification required is in the basis functions and knots. For fitting surfaces, in package
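The joint prior of the inclusion indicators has a simple closed form, which the following sketch (ours, not part of BNSP) evaluates for a given inclusion vector:

```r
# Sketch (not BNSP code): joint prior of the inclusion indicators,
# P(gamma | pi) = pi^N(gamma) * (1 - pi)^(q - N(gamma)).
joint_prior <- function(gamma, pi) {
  q  <- length(gamma)
  Ng <- sum(gamma)            # N(gamma): number of included terms
  pi^Ng * (1 - pi)^(q - Ng)
}

gamma <- c(1, 1, 0, 0, 0)     # first two basis functions included
joint_prior(gamma, pi = 0.5)  # with pi = 0.5 every model is equally likely: 0.5^5 = 0.03125
```

Placing a Beta prior on $\pi_\mu$, as in (6), induces dependence among the indicators after marginalization and automatically penalizes large models.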
BNSP we implemented radial basis functions
$$\mathcal{B}_3 = \left\{ \phi_1(u) = u_1, \; \phi_2(u) = u_2, \; \phi_3(u) = ||u - \xi_1||^2 \log\left(||u - \xi_1||^2\right), \ldots, \phi_q(u) = ||u - \xi_{q-2}||^2 \log\left(||u - \xi_{q-2}||^2\right) \right\}.$$
We note that the prior specification presented earlier for fitting flexible functions remains unchanged for fitting flexible surfaces. Further, for fitting bivariate or higher order functions,
BNSP also supports the smooth constructors s, te and ti from mgcv.

In the presence of multiple covariates, the effects of which may be modeled parametrically or semiparametrically, the mean model in (3) is extended to the following
$$\mu_i = \beta_0 + u_{ip}^\top \beta_p + \sum_{k=1}^{K_1} f_{\mu,k}(u_{ik}), \quad i = 1, \ldots, n,$$
where $u_{ip}$ includes the covariates the effects of which are modeled parametrically, $\beta_p$ denotes the corresponding effects, and $f_{\mu,k}(u_{ik})$, $k = 1, \ldots, K_1$, are flexible functions of one or more covariates expressed as
$$f_{\mu,k}(u_{ik}) = \sum_{j=1}^{q_{1k}} \beta_{kj} \phi_{1kj}(u_{ik}),$$
where $\phi_{1kj}$, $j = 1, \ldots, q_{1k}$, are the basis functions used in the $k$th component, $k = 1, \ldots, K_1$.

Similarly, the variance model (4), in the presence of multiple covariates, is expressed as
$$\log(\sigma_i^2) = \alpha_0 + w_{ip}^\top \alpha_p + \sum_{k=1}^{K_2} f_{\sigma,k}(w_{ik}), \quad i = 1, \ldots, n,$$
where
$$f_{\sigma,k}(w_{ik}) = \sum_{j=1}^{q_{2k}} \alpha_{kj} \phi_{2kj}(w_{ik}).$$
Hence, flexible components appear in the mean model, $f_{\mu,k}$, $k = 1, \ldots, K_1$, and in the variance model, $f_{\sigma,k}$, $k = 1, \ldots, K_2$. To avoid over-fitting, we allow removal of the unnecessary ones utilizing the usual indicator variables, $\gamma_k = (\gamma_{k1}, \ldots, \gamma_{kq_{1k}})^\top$, $k = 1, \ldots, K_1$, and $\delta_k = (\delta_{k1}, \ldots, \delta_{kq_{2k}})^\top$, $k = 1, \ldots, K_2$. Here, vectors $\gamma_k$ and $\delta_k$ determine which basis functions appear in $f_{\mu,k}$ and $f_{\sigma,k}$ respectively.

The model that we implemented in package BNSP specifies independent priors for the indicator variables $\gamma_{kj}$ as $P(\gamma_{kj} = 1 \mid \pi_{\mu k}) = \pi_{\mu k}$, $j = 1, \ldots, q_{1k}$. From these, the joint prior follows
$$P(\gamma_k \mid \pi_{\mu k}) = \pi_{\mu k}^{N(\gamma_k)} (1 - \pi_{\mu k})^{q_{1k} - N(\gamma_k)},$$
where $N(\gamma_k) = \sum_{j=1}^{q_{1k}} \gamma_{kj}$. Similarly, for the indicators $\delta_{kj}$ we specify independent priors $P(\delta_{kj} = 1 \mid \pi_{\sigma k}) = \pi_{\sigma k}$, $j = 1, \ldots, q_{2k}$. It follows that the joint prior is
$$P(\delta_k \mid \pi_{\sigma k}) = \pi_{\sigma k}^{N(\delta_k)} (1 - \pi_{\sigma k})^{q_{2k} - N(\delta_k)},$$
where $N(\delta_k) = \sum_{j=1}^{q_{2k}} \delta_{kj}$.

We specify the following independent priors for the inclusion probabilities:
$$\pi_{\mu k} \sim \mathrm{Beta}(c_{\mu k}, d_{\mu k}), \; k = 1, \ldots, K_1, \quad \pi_{\sigma k} \sim \mathrm{Beta}(c_{\sigma k}, d_{\sigma k}), \; k = 1, \ldots, K_2. \quad (8)$$
The rest of the priors are the same as those specified for the single covariate models.

In this section we provide results on simulation studies and real data analyses. The purpose is twofold: firstly, we point out that the package works well and provides the expected results (in simulation studies), and secondly, we illustrate the options that the users of
BNSP have.
Here we present results from three simulation studies, involving one, two, and multiple covariates. For the majority of these simulation studies, we utilize the same data-generating mechanisms as those presented by Chan et al. (2006).
We consider two mechanisms that involve a single covariate that appears in both the mean and variance model. Denoting the covariate by $u$, the data-generating mechanisms are the normal model $Y \sim N\{\mu(u), \sigma^2(u)\}$ with the following mean and standard deviation functions:

1. $\mu(u) = 2u$, $\sigma(u) = 0.1 + u$,

2. $\mu(u) = \{N(u, \mu = 0.{\cdot}, \sigma^2 = 0.{\cdot}) + N(u, \mu = 0.{\cdot}, \sigma^2 = 0.{\cdot})\}/{\cdot}$, $\sigma(u) = \{N(u, \mu = 0.{\cdot}, \sigma^2 = 0.{\cdot}) + N(u, \mu = 0.{\cdot}, \sigma^2 = 0.{\cdot})\}/{\cdot}$.

We simulate samples of size $n = 500$ from each mechanism, where variable $u$ is obtained from the uniform distribution, $u \sim \mathrm{Unif}(0, 1)$. The first dataset is generated utilizing the following code

> mu <- function(u){2 * u}
> stdev <- function(u){0.1 + u}
> set.seed(1)
> n <- 500
> u <- sort(runif(n))
> y <- rnorm(n, mu(u), stdev(u))
> data <- data.frame(y, u)

Above we specified the seed value to be one, and we do so in what follows, so that our results are replicable. To the generated dataset we fit a special case of the model that we presented, where the mean and variance functions in (3) and (4) are specified as
$$\mu = \beta_0 + f_\mu(u) = \beta_0 + \sum_{j=1}^{q_1} \beta_j \phi_j(u) \quad \text{and} \quad \log(\sigma^2) = \alpha_0 + f_\sigma(u) = \alpha_0 + \sum_{j=1}^{q_2} \alpha_j \phi_j(u), \quad (9)$$
with $\phi$ denoting the radial basis functions presented in (5). Further, we choose $q_1 = q_2 = 21$ basis functions or, equivalently, 20 knots. Hence, we have $\phi_{1j}(u) = \phi_{2j}(u)$, $j = 1, \ldots, 21$, which results in identical design matrices for the mean and variance models. In R, the two models are specified using

> model <- y ~ sm(u, k = 20, bs = "rd") | sm(u, k = 20, bs = "rd")

The above formula (Zeileis & Croissant, 2010) specifies the response, mean and variance models. Smooth terms are specified utilizing function sm, which takes as input the covariate u, the selected number of knots and the selected type of basis functions.

Next we specify the hyper-parameter values for the priors in (6) and (7). The default prior for $c_\beta$ is inverse Gamma with $a_\beta = 0.5$, $b_\beta = n/2$. For $c_\alpha$ the default prior is a non-informative but proper inverse Gamma with $a_\alpha = b_\alpha = 1.1$. Concerning $\pi_\mu$ and $\pi_\sigma$, the default priors are uniform, obtained by setting $c_\mu = d_\mu = 1$ and $c_\sigma = d_\sigma = 1$. Lastly, the default prior for the error standard deviation is the half-normal with variance $\phi^2_\sigma = 2$, $|\sigma| \sim N(0, 2)$.

We run the MCMC sampler for a total of 10,000 iterations and discard the first 5,000 as burn-in. Of the remaining 5,000 samples we retain 1 every 2 samples, resulting in 2,500 posterior samples. Further, as mentioned above, we set the seed of the MCMC sampler equal to one. Obtaining posterior samples is achieved by a function call of the form

> m1 <- mvrm(formula = model, data = data, sweeps = 10000, burn = 5000,
+ thin = 2, seed = 1, StorageDir = DIR,
+ c.betaPrior = "IG(0.5,0.5*n)", c.alphaPrior = "IG(1.1,1.1)",
+ pi.muPrior = "Beta(1,1)", pi.sigmaPrior = "Beta(1,1)", sigmaPrior = "HN(2)")
Samples from the posteriors of the model parameters $\{\beta, \gamma, \alpha, \delta, c_\beta, c_\alpha, \sigma^2\}$ are written in seven separate files which are stored in the directory specified by argument StorageDir. If a storage directory is not specified, then function mvrm returns an error message, as without these files there will be no output to process. Furthermore, the last two lines of the above function call show the specified priors, which are $c_\beta \sim \mathrm{IG}(0.5, n/2)$, $c_\alpha \sim \mathrm{IG}(1.1, 1.1)$, $\pi_\mu \sim \mathrm{Beta}(1, 1)$, $\pi_\sigma \sim \mathrm{Beta}(1, 1)$ and $|\sigma| \sim N(0, 2)$.

For $c_\beta$ and $c_\alpha$ only inverse Gamma priors are available, with parameters that can be specified by the user in the intuitive way. For instance, the prior $c_\beta \sim \mathrm{IG}(1.01, 1.01)$ can be specified in function mvrm by using c.betaPrior = "IG(1.01,1.01)". The second parameter of the prior for $c_\beta$ can be a function of the sample size $n$ (but only symbol n would work here), so for instance c.betaPrior = "IG(1,0.4*n)" is another acceptable specification. Further, Beta priors are available for parameters $\pi_\mu$ and $\pi_\sigma$, with parameters that can be specified by the user, again in the intuitive way. Lastly, two priors are available for the error variance. These are the default half-normal and the inverse Gamma. For instance, sigmaPrior = "HN(5)" defines $|\sigma| \sim N(0, 5)$ as the prior, while sigmaPrior = "IG(1.1,1.1)" defines $\sigma^2 \sim \mathrm{IG}(1.1, 1.1)$ as the prior.

Function mvrm2mcmc reads in posterior samples from the files that the call to function mvrm generated and creates an object of class "mcmc". Hence, for summarizing posterior distributions and for assessing convergence, functions summary and plot from package coda can be used. As an example, here we read in the samples from the posterior of $\beta$

> beta <- mvrm2mcmc(m1, "beta")

and summarize the posterior using summary. For the sake of economizing space, only the part of the output that describes the posteriors of $\beta_0$, $\beta_1$, and $\beta_2$ is shown below

> summary(beta)
Iterations = 5001:9999
Thinning interval = 2
Number of chains = 1
Sample size per chain = 2500

1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
                 Mean       SD     Naive SE Time-series SE
(Intercept) 9.534e-01 0.004399 8.799e-05      0.0002534
u           1.864e+00 0.042045 8.409e-04      0.0010356
sm(u).1     3.842e-04 0.016421 3.284e-04      0.0003284

2. Quantiles for each variable:
             2.5%    25%    50%    75%  97.5%
(Intercept) 0.946 0.9513 0.9533 0.9554 0.960
u           1.833 1.8565 1.8614 1.8682 1.923
sm(u).1     0.000 0.0000 0.0000 0.0000 0.000

Further, we may obtain a plot using

> plot(beta)
Figure 1 shows the first three of the plots created by function plot. These are the plots of the samples from the posteriors of coefficients $\beta_0$, $\beta_1$ and $\beta_2$. As we can see from both the summary and Figure 1, only the first two coefficients have posteriors that are not centered around zero.

Returning to function mvrm2mcmc, it requires two inputs. These are an object of class "mvrm" and the name of the file to be read in R. For the parameters in the current model $\{\beta, \gamma, \alpha, \delta, c_\beta, c_\alpha, \sigma^2\}$ the corresponding file names are 'beta', 'gamma', 'alpha', 'delta', 'cbeta', 'calpha' and 'sigma2' respectively.
Figure 1: Trace and density plots for the regression coefficients $\beta_0$, $\beta_1$ and $\beta_2$ of the first simulated example. Parameters $\beta_1$ and $\beta_2$ are the coefficients of the first two basis functions, denoted by 'u' and 'sm(u).1'. Plots for coefficients $\beta_3, \ldots, \beta_{21}$ are omitted as they follow a very similar pattern to that seen for $\beta_2$, i.e. most of the time they take value zero but with random spikes away from zero.

Summaries of 'mvrm' fits may be obtained utilizing functions print.mvrm and summary.mvrm. Function print takes as input an object of class "mvrm". It returns basic information about the model fit, as shown below

> print(m1)
Call:
mvrm(formula = model, data = data, sweeps = 10000, burn = 5000,
thin = 2, seed = 1, StorageDir = DIR, c.betaPrior = "IG(0.5,0.5*n)",
c.alphaPrior = "IG(1.1,1.1)", pi.muPrior = "Beta(1,1)",
pi.sigmaPrior = "Beta(1,1)", sigmaPrior = "HN(2)")

2500 posterior samples

Mean model - marginal inclusion probabilities
u sm(u).1 sm(u).2 sm(u).3 sm(u).4 sm(u).5 sm(u).6 sm(u).7
1.0000 0.0040 0.0036 0.0032 0.0084 0.0036 0.0044 0.0028
sm(u).8 sm(u).9 sm(u).10 sm(u).11 sm(u).12 sm(u).13 sm(u).14 sm(u).15
0.0060 0.0020 0.0060 0.0036 0.0056 0.0056 0.0036 0.0052
sm(u).16 sm(u).17 sm(u).18 sm(u).19 sm(u).20
0.0060 0.0044 0.0056 0.0044 0.0052

Variance model - marginal inclusion probabilities
u sm(u).1 sm(u).2 sm(u).3 sm(u).4 sm(u).5 sm(u).6 sm(u).7
1.0000 0.6072 0.5164 0.5808 0.5488 0.6760 0.5320 0.6336
sm(u).8 sm(u).9 sm(u).10 sm(u).11 sm(u).12 sm(u).13 sm(u).14 sm(u).15
0.6936 0.6708 0.5996 0.4816 0.4912 0.3728 0.6268 0.5688
sm(u).16 sm(u).17 sm(u).18 sm(u).19 sm(u).20
0.5872 0.6528 0.4428 0.6900 0.5356

The function returns a matched call, the number of posterior samples obtained and marginal inclusion probabilities of the terms in the mean and variance models.

Whereas the output of function print focuses on marginal inclusion probabilities, the output of function summary focuses on the most frequently visited models.
It takes as input an object of class 'mvrm' and the number of (most frequently visited) models to be displayed, which by default is set to nModels = 5. Here, to economize space, we set nModels = 2. The information returned by function summary is shown below

> summary(m1, nModels = 2)
Specified model for the mean and variance:
y ~ sm(u, k = 20, bs = "rd") | sm(u, k = 20, bs = "rd")
Specified priors:
[1] c.beta = IG(0.5,0.5*n) c.alpha = IG(1.1,1.1) pi.mu = Beta(1,1)
[4] pi.sigma = Beta(1,1) sigma = HN(2)
Total posterior samples: 2500 ; burn-in: 5000 ; thinning: 2
Files stored in /home/papgeo/1/
Null deviance: 1299.292
Mean posterior deviance: -88.691
Joint mean/variance model posterior probabilities:
mean.u mean.sm.u..1 mean.sm.u..2 mean.sm.u..3 mean.sm.u..4 mean.sm.u..5
1 1 0 0 0 0 0
2 1 0 0 0 0 0
mean.sm.u..6 mean.sm.u..7 mean.sm.u..8 mean.sm.u..9 mean.sm.u..10
1 0 0 0 0 0
2 0 0 0 0 0
mean.sm.u..11 mean.sm.u..12 mean.sm.u..13 mean.sm.u..14 mean.sm.u..15
1 0 0 0 0 0
2 0 0 0 0 0
mean.sm.u..16 mean.sm.u..17 mean.sm.u..18 mean.sm.u..19 mean.sm.u..20 var.u
1 0 0 0 0 0 1
2 0 0 0 0 0 1
var.sm.u..1 var.sm.u..2 var.sm.u..3 var.sm.u..4 var.sm.u..5 var.sm.u..6
1 1 1 1 1 1 1
2 1 0 1 1 1 1
var.sm.u..7 var.sm.u..8 var.sm.u..9 var.sm.u..10 var.sm.u..11 var.sm.u..12
1 1 1 1 0 1 0
2 1 1 1 1 1 1
var.sm.u..13 var.sm.u..14 var.sm.u..15 var.sm.u..16 var.sm.u..17 var.sm.u..18
1 1 1 1 1 1 1

Firstly, the function provides the specified mean and variance models and the specified priors. This is followed by information about the MCMC chain and the directory where files have been stored. In addition, the function provides the null and the mean posterior deviance. Finally, the function provides the specification of the joint mean/variance models that were visited most often during MCMC sampling. This specification is in terms of a vector of indicators, each consisting of zeros and ones that show which terms are in the mean and variance model.
To make clear which terms pertain to the mean and which to the variance function, we have preceded the names of the model terms by 'mean.' or 'var.'. In the above output we see that the most visited model specifies a linear mean model (only the linear term is included in the model) while the variance model includes twelve terms. See also Figure 2.

We next describe function plot.mvrm, which creates plots of terms in the mean and variance functions. Two calls to function plot can be seen in the code below. Argument x expects an object of class 'mvrm', as created by a call to function mvrm. Argument model may take one of three possible values: 'mean', 'stdev' or 'both', specifying the model to be visualized. Further, argument term determines the term to be plotted. In the current example there is only one term in each of the two models, which leaves us with only one choice, term = "sm(u)". Equivalently, term may be specified as an integer, term = 1. If term is left unspecified, then by default the first term in the model is plotted. For creating two-dimensional plots, as in the current example, function plot utilizes package ggplot2. Users of BNSP may add their own options to plots via argument plotOptions. The code below serves as an example.

> x1 <- seq(0, 1, length.out = 30)
> plotOptionsM <- list(geom_line(aes_string(x = x1, y = mu(x1)), col = 2, alpha = 0.5,
+ lty = 2), geom_point(data = data, aes(x = u, y = y)))
> plot(x = m1, model = "mean", term = "sm(u)", plotOptions = plotOptionsM,
+ intercept = TRUE, quantiles = c(0.005, 0.995), grid = 30)
> plotOptionsV = list(geom_line(aes_string(x = x1, y = stdev(x1)), col = 2,
+ alpha = 0.5, lty = 2))
> plot(x = m1, model = "stdev", term = "sm(u)", plotOptions = plotOptionsV,
+ intercept = TRUE, quantiles = c(0.05, 0.95), grid = 30)
The resulting plots can be seen in Figure 2, panels (a) and (b). Panel (a) displays the simulated dataset, showing the expected increase in both the mean and variance with increasing values of the covariate $u$. Further, we see the posterior mean of $\mu(u) = \beta_0 + f_\mu(u) = \beta_0 + \sum_{j=1}^{21} \beta_j \phi_j(u)$ evaluated over a grid of 30 values of $u$, as specified by the (default) grid = 30 option in function plot. For each sample $\beta^{(s)}$, $s = 1, \ldots, S$, from the posterior of $\beta$, and for each value of $u$ over the grid of 30 values, $u_j$, $j = 1, \ldots, 30$, function plot computes $\mu(u_j)^{(s)} = \beta_0^{(s)} + \sum_{k=1}^{21} \beta_k^{(s)} \phi_k(u_j)$. The default option intercept = TRUE specifies that the intercept $\beta_0$ is included in the computation, but it may be removed by setting intercept = FALSE. The posterior means are computed by the usual $\bar{\mu}(u_j) = S^{-1} \sum_s \mu(u_j)^{(s)}$ and are plotted with a solid (blue-color) line. By default, the function displays 80% point-wise credible intervals (CI). In Figure 2 panel (a) we have plotted 99% CIs, as specified by option quantiles = c(0.005, 0.995). This option specifies that for each value $u_j$, $j = 1, \ldots, 30$, on the grid, 99% CIs for $\mu(u_j)$ are computed by the 0.5% and 99.5% quantiles of the samples $\mu(u_j)^{(s)}$, $s = 1, \ldots, S$. Plots without credible intervals may be obtained by setting quantiles = NULL.

Figure 2: Results from the single covariate simulated examples. The column on the left-hand side displays the generated data points and posterior means of the estimated effect along with 99% CIs. The column on the right-hand side displays the posterior mean of the estimated standard deviation function along with 90% CIs. In all panels, the truth is represented by dashed (red color) lines, the posterior means by solid (blue color) lines, and the CIs by gray color.

Figure 2, panel (b) displays the posterior mean of the standard deviation function, given by $\sigma(u) = \sigma \exp\{\sum_{j=1}^{q_2} \alpha_j \phi_j(u)/2\}$.
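The pointwise summaries just described can be sketched directly from a matrix of posterior draws. The following is a stand-alone illustration; the matrix mu_samples below is a synthetic stand-in of our own, not an object created by BNSP.

```r
# Sketch (not BNSP code): pointwise posterior means and 99% credible
# intervals of mu(u_j) from S posterior draws over a grid of 30 values.
set.seed(1)
S <- 1000
grid <- seq(0, 1, length.out = 30)
# synthetic stand-in for the S x 30 matrix of draws mu(u_j)^(s);
# here centered on the true mean mu(u) = 2u of the first mechanism
mu_samples <- matrix(rnorm(S * 30, mean = rep(2 * grid, each = S), sd = 0.05),
                     nrow = S, ncol = 30)
post_mean <- colMeans(mu_samples)                              # mu-bar(u_j)
ci <- apply(mu_samples, 2, quantile, probs = c(0.005, 0.995))  # 0.5% and 99.5% quantiles
```

The solid line in a panel such as Figure 2(a) corresponds to post_mean and the gray band to the two rows of ci.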
The details are the same as for the plot of the mean function, so here we briefly mention a difference: option intercept = TRUE specifies that $\sigma$ is included in the calculation. It may be removed by setting intercept = FALSE, which will result in plots of $\sigma(u)^* = \exp\{\sum_{j=1}^{q_2} \alpha_j \phi_j(u)/2\}$.

We use the second simulated dataset to show how the s constructor from package mgcv may be used. In our example, we used s to specify the model as follows

> model <- y ~ s(u, k = 15, bs = "ps", absorb.cons = TRUE) |
+ s(u, k = 15, bs = "ps", absorb.cons = TRUE)

Function
BNSP::s in turn calls mgcv::s and mgcv::smoothCon. All options of the latter two functions may be passed to BNSP::s; in the example above we used options k, bs and absorb.cons.

The remaining R code for the second simulated example is precisely the same as that for the first example, and hence omitted. Results are shown in Figure 2, panels (c) and (d).

We conclude the current section by describing the function predict.mvrm. The function provides predictions and posterior credible or prediction intervals for given feature vectors. The two types of intervals differ in the associated level of uncertainty: prediction intervals attempt to capture a future response and are usually much wider than credible intervals, which attempt to capture the mean response. The following code shows how credible and prediction intervals can be obtained for a sequence of covariate values stored in x1
Figure 3: Prediction results from the first two simulated datasets. Each panel displays a credible interval and two prediction intervals, one obtained using a model that recognizes the dependence of the variance on the covariate and one that ignores it.

> x1 <- seq(0, 1, length.out = 30)
> p1 <- predict(m1, newdata = data.frame(u = x1), interval = "credible")
> p2 <- predict(m1, newdata = data.frame(u = x1), interval = "prediction")

where the first argument of function predict is a fitted mvrm model, the second is a data frame containing the feature vectors at which predictions are to be obtained, and the last defines the type of interval to be created. We applied the predict function to the two simulated datasets. To each of these datasets we fitted two models: the first is the one we saw earlier, where both the mean and variance are modeled in terms of the covariate, while the second ignores the dependence of the variance on the covariate. The latter model is specified in R using

> model <- y ~ sm(u, k = 20, bs = "rd") | 1

Results are displayed in Figure 3. Each panel displays a credible interval and two prediction intervals. The figure emphasizes a point that was discussed in the introductory section: modeling the variance in terms of covariates can result in more realistic prediction intervals. The same point was recently discussed by Mayr et al. (2012).
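The gap between the two interval types can be seen directly from posterior draws. The sketch below uses hypothetical draws of the mean and standard deviation at a single covariate value; a prediction interval adds the observation-level noise to the uncertainty about the mean, and is therefore wider:

```r
set.seed(1)
S <- 10000
muDraws <- rnorm(S, mean = 2, sd = 0.1)      # hypothetical posterior draws of mu(x)
sdDraws <- rgamma(S, shape = 20, rate = 20)  # hypothetical posterior draws of sigma(x)

# Credible interval: quantiles of the draws of the mean response
credible <- quantile(muDraws, probs = c(0.025, 0.975))

# Prediction interval: quantiles of posterior predictive draws of a new response
yNew <- rnorm(S, mean = muDraws, sd = sdDraws)
prediction <- quantile(yNew, probs = c(0.025, 0.975))
```

With these draws, diff(prediction) is several times diff(credible), mirroring the relative widths of the intervals in Figure 3.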
Interactions between two predictors can be modeled by appropriate specification of either the built-in sm function or the smooth constructors from mgcv. Function sm can take up to two covariates, both of which may be continuous, or one continuous and one discrete. Next we consider an example that involves two continuous covariates. An example involving a continuous and a discrete covariate is shown later on, in the second application to a real dataset.

Let u = (u_1, u_2)^T denote a bivariate predictor. The data-generating mechanism that we consider is

y(u) ~ N{μ(u), σ²(u)},
μ(u) = 0.1 + N(u; μ_1, Σ_1) + N(u; μ_2, Σ_2),
σ²(u) = 0.1 + {N(u; μ_1, Σ_1) + N(u; μ_2, Σ_2)}/2,

with μ_1 = (0.25, 0.75)^T, Σ_1 = [0.03, 0.01; 0.01, 0.03], μ_2 = (0.65, 0.35)^T, Σ_2 = [0.09, 0.01; 0.01, 0.09].

As before, u_1 and u_2 are obtained independently from uniform distributions on the unit interval. Further, the sample size is set to n = 500. In R, we simulate data from the above mechanism using

> mu1 <- matrix(c(0.25, 0.75))
> sigma1 <- matrix(c(0.03, 0.01, 0.01, 0.03), 2, 2)
> mu2 <- matrix(c(0.65, 0.35))
> sigma2 <- matrix(c(0.09, 0.01, 0.01, 0.09), 2, 2)
> mu <- function(x1, x2) {x <- cbind(x1, x2);
+   0.1 + dmvnorm(x, mu1, sigma1) + dmvnorm(x, mu2, sigma2)}
> Sigma <- function(x1, x2) {x <- cbind(x1, x2);
+   0.1 + (dmvnorm(x, mu1, sigma1) + dmvnorm(x, mu2, sigma2)) / 2}
> set.seed(1)
> n <- 500
> w1 <- runif(n)
> w2 <- runif(n)
> y <- vector()
> for (i in 1 : n) y[i] <- rnorm(1, mean = mu(w1[i], w2[i]),
+   sd = sqrt(Sigma(w1[i], w2[i])))
> data <- data.frame(y, w1, w2)

We fit a model with mean and variance functions specified as

μ(u) = β_0 + Σ_{j1=1}^{12} Σ_{j2=1}^{12} β_{j1,j2} φ_{j1,j2}(u), log(σ²(u)) = α_0 + Σ_{j1=1}^{12} Σ_{j2=1}^{12} α_{j1,j2} φ_{j1,j2}(u).

The R code that fits the above model is

> Model <- y ~ sm(w1, w2, k = 10, bs = "rd") | sm(w1, w2, k = 10, bs = "rd")
> m2 <- mvrm(formula = Model, data = data, sweeps = 10000, burn = 5000, thin = 2,
+   seed = 1, StorageDir = DIR)
As in the univariate case, convergence assessment and univariate posterior summaries may be obtained by using function mvrm2mcmc in conjunction with functions plot.mcmc and summary.mcmc. Further, summaries of the 'mvrm' fits may be obtained using functions print.mvrm and summary.mvrm. Plots of the bivariate effects may be obtained using function plot.mvrm. This is shown below, where argument plotOptions utilizes package colorspace (Zeileis et al., 2009).

> plot(x = m2, model = "mean", term = "sm(w1,w2)", static = TRUE,
+   plotOptions = list(col = diverge_hcl(n = 10)))
> plot(x = m2, model = "stdev", term = "sm(w1,w2)", static = TRUE,
+   plotOptions = list(col = diverge_hcl(n = 10)))

Figure 4: Bivariate simulation study results with two continuous covariates. Posterior means of the (a) mean and (b) standard deviation function.

Results are shown in Figure 4. For bivariate predictors, function plot.mvrm calls function ribbon3D from package plot3D. Dynamic plots, viewable in a browser, can be created by replacing the default static = TRUE by static = FALSE. When the latter option is specified, function plot.mvrm calls function scatterplot3js from package threejs. Users may pass their own options to plot.mvrm via the plotOptions argument.

We consider fitting general additive models for the mean and variance functions in a simulated example with four independent continuous covariates. In this scenario, we set n = 1000. Further, the covariates w = (w_1, w_2, w_3, w_4)^T are simulated independently from a uniform distribution on the unit interval. The data-generating mechanism that we consider is

Y(w) ~ N(μ(w), σ²(w)), with μ(w) = Σ_{j=1}^{4} μ_j(w_j) and σ²(w) = Π_{j=1}^{4} σ_j²(w_j),

where the functions μ_j, σ_j, j = 1, ..., 4, are specified below:

1. μ_1(w_1) = 1.5 w_1, with σ_1(w_1) given by a half-sum of two normal densities in w_1,
2. μ_2(w_2) given by a half-sum of two normal densities in w_2, with σ_2(w_2) a sinusoidal function of w_2,
3. μ_3(w_3) = 1 + sin(2πw_3), with σ_3(w_3) linearly decreasing in w_3,
4. μ_4(w_4) linearly decreasing in w_4, with σ_4(w_4) linearly increasing in w_4.

To the generated dataset we fit a model with mean and variance functions modeled as

μ(w) = β_0 + Σ_{k=1}^{4} Σ_{j=1}^{16} β_{kj} φ_{kj}(w_k) and log{σ²(w)} = α_0 + Σ_{k=1}^{4} Σ_{j=1}^{16} α_{kj} φ_{kj}(w_k).

Fitting the above model to the simulated data is achieved by the following R code

> Model <- y ~ sm(w1, k = 15, bs = "rd") + sm(w2, k = 15, bs = "rd") +
+   sm(w3, k = 15, bs = "rd") + sm(w4, k = 15, bs = "rd") |
+   sm(w1, k = 15, bs = "rd") + sm(w2, k = 15, bs = "rd") +
+   sm(w3, k = 15, bs = "rd") + sm(w4, k = 15, bs = "rd")
> m3 <- mvrm(formula = Model, data = data, sweeps = 50000, burn = 25000,
+   thin = 5, seed = 1, StorageDir = DIR)
By default, function sm utilizes radial basis functions, hence there is no need to specify bs = "rd" as we did earlier if radial basis functions are preferred over thin plate splines. Further, we have selected k = 15 for all smooth functions. However, there is no restriction on the number of knots, and one can certainly select a different number of knots for each smooth function.

As discussed previously, for each term that appears on the right-hand side of the mean and variance functions, the model incorporates indicator variables that specify which basis functions are to be included in and which are to be excluded from the model. For the current example, the indicator variables are denoted by γ_kj and δ_kj, k = 1, 2, 3, 4, j = 1, ..., 16. The prior probabilities that variables are included were specified in (8) and they are specific to each term: π_k^μ ~ Beta(c_k^μ, d_k^μ), π_k^σ ~ Beta(c_k^σ, d_k^σ), k = 1, 2, 3, 4. The default option pi.muPrior = "Beta(1,1)" specifies that π_k^μ ~ Beta(1, 1), k = 1, 2, 3, 4. Further, by setting, for example, pi.muPrior = "Beta(2,1)" we specify that π_k^μ ~ Beta(2, 1), k = 1, 2, 3, 4. To specify a different Beta prior for each of the four terms, pi.muPrior must be specified as a vector of length four, for example, pi.muPrior = c("Beta(1,1)", "Beta(2,1)", "Beta(3,1)", "Beta(4,1)"). Specification of the priors for π_k^σ is done in a similar way, via argument pi.sigmaPrior.

We conclude this section by presenting plots of the four terms in the mean and variance models. The plots are presented in Figure 5. We provide a few details on how function plot works in the presence of multiple terms, and how the comparison between true and estimated effects is made. Starting with the mean function, to create the relevant plots that appear in the left panels of Figure 5, function plot considers only the part of the mean function μ(w) that is related to the chosen term, leaving all other terms out. For instance, in the code below we choose term = "sm(w1)" and hence we plot the posterior mean and a posterior credible interval for Σ_{j=1}^{16} β_{1j} φ_{1j}(w_1), where the intercept β_0 is left out by option intercept = FALSE. Further, comparison is made with a centered version of the true curve, represented by the dashed (red color) line and obtained by the first three lines of code below.

> x1 <- seq(0, 1, length.out = 30)
> y1 <- mu1(x1)
> y1 <- y1 - mean(y1)
> PlotOptions <- list(geom_line(aes_string(x = x1, y = y1),
+   col = 2, alpha = 0.5, lty = 2))
> plot(x = m3, model = "mean", term = "sm(w1)", plotOptions = PlotOptions,
+   intercept = FALSE, centreEffects = FALSE, quantiles = c(0.005, 1 - 0.005))

Turning to the standard deviation function, to create the plots in the right panels of Figure 5, function plot considers only the part of σ(w) that is related to the chosen term.
For instance, below we choose term = "sm(w1)". Hence, in this case the plot will present the posterior mean and a posterior credible interval for exp{Σ_{j=1}^{16} α_{1j} φ_{1j}(w_1)/2}, where the intercept α_0 is left out by option intercept = FALSE. Option centreEffects = TRUE scales the posterior realizations of exp{Σ_{j=1}^{16} α_{1j} φ_{1j}(w_1)/2} before plotting them, where the scaling is done in such a way that the realized function has mean one over the range of the predictor. Further, the comparison is made with a scaled version of the true curve, where again the scaling is done to achieve mean one. This is shown below, and it is in the spirit of Chan et al. (2006), who discuss the differences between the data-generating mechanism and the fitted model.

> y1 <- stdev1(x1) / mean(stdev1(x1))
> PlotOptions <- list(geom_line(aes_string(x = x1, y = y1),
+   col = 2, alpha = 0.5, lty = 2))
> plot(x = m3, model = "stdev", term = "sm(w1)", plotOptions = PlotOptions,
+   intercept = FALSE, centreEffects = TRUE, quantiles = c(0.025, 1 - 0.025))

In this section we present four empirical applications.
In the first empirical application, we analyse a dataset from Pagan & Ullah (1999) that is available in the R package np (Hayfield & Racine, 2008). The dataset consists of n = 205 observations on the dependent variable logwage, the logarithm of the individual's wage, and the covariate age, the individual's age. The dataset comes from the 1971 Census of Canada Public Use Sample Tapes, and the sampling units it involves are males of common education. Hence, the investigation of the relationship between age and the logarithm of wage is carried out controlling for the two potentially important covariates education and gender.

We utilize the following R code to specify flexible models for the mean and variance functions, and to obtain 5,000 posterior samples, after a burn-in period of 25,000 samples and a thinning period of 5.

> data(cps71)
> DIR <- getwd()
> model <- logwage ~ sm(age, k = 30, bs = "rd") | sm(age, k = 30, bs = "rd")
> m4 <- mvrm(formula = model, data = cps71, sweeps = 50000,
+   burn = 25000, thin = 5, seed = 1, StorageDir = DIR)
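Convergence checking, mentioned below, can be sketched with mvrm2mcmc and coda. This assumes that mvrm2mcmc accepts the fitted object and a parameter label such as "beta"; consult ?mvrm2mcmc for the labels available in your version of the package.

```r
# Sketch: convert the stored posterior samples of the mean-model coefficients
# into an 'mcmc' object and inspect them with coda tools.
library(BNSP)
library(coda)

betaSamples <- mvrm2mcmc(m4, "beta")  # label "beta" assumed; see ?mvrm2mcmc
plot(betaSamples)                     # trace and density plots
effectiveSize(betaSamples)            # effective sample size per coefficient
```

The same pattern applies to the variance-model coefficients and the remaining parameters, one label at a time.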
After checking convergence, we use the following code to create the plots that appear in Figure 6.

> wagePlotOptions <- list(geom_point(data = cps71, aes(x = age, y = logwage)))
> plot(x = m4, model = "mean", term = "sm(age)", plotOptions = wagePlotOptions)
> plot(x = m4, model = "stdev", term = "sm(age)")
Figure 6(a) shows the posterior mean and an 80% credible interval for the mean function, and it suggests a quadratic relationship between age and logwage. Figure 6(b) shows the posterior mean and an 80% credible interval for the standard deviation function. It suggests a complex relationship between age and the variability in logwage. The relationship suggested by Figure 6(b) is also suggested by the spread of the data points around the estimated mean in Figure 6(a). At ages around 20 years the variability in logwage is high. It then reduces until about age 30, to start increasing again until about age 45. From age 45 to 60 it remains stable but high, while for ages above 60, Figure 6(b) suggests a further increase in the variability, but the wide credible interval suggests high uncertainty over this age range.

Figure 5: Multiple covariate simulation study results. The column on the left-hand side presents the true and estimated mean functions, along with 99% credible intervals. The column on the right-hand side presents the true and estimated standard deviation functions, along with 95% credible intervals. In all panels, the truth is represented by dashed (red color) lines, the estimated functions by solid (blue color) lines, and the credible intervals by gray color.

Figure 6: Results from the data analysis on the relationship between age and the logarithm of wage. Panel (a) shows the posterior mean and an 80% credible interval of the mean function, and the observed data points. Panel (b) shows the posterior mean and an 80% credible interval of the standard deviation function.

In the second empirical application, we analyse a dataset from Wooldridge (2008) that is also available in the R package np.
The response variable here is the logarithm of the individual's hourly wage (lwage), while the covariates include the years of education (educ), the years of experience (exper), the years with the current employer (tenure), the individual's gender (named female within the dataset, with levels Female and Male), and marital status (named married, with levels Married and Notmarried). The dataset consists of n = 526 independent observations. We analyse the first three covariates as continuous and the last two as discrete.

As the variance function is modeled in terms of an exponential, see (1), to avoid potential numerical problems we transform the three continuous variables to have range in the interval [0, 1]:

> data(wage1)
> wage1$ntenure <- wage1$tenure / max(wage1$tenure)
> wage1$nexper <- wage1$exper / max(wage1$exper)
> wage1$neduc <- wage1$educ / max(wage1$educ)

We choose to fit the following mean and variance models to the data

μ_i = β_0 + β_1 married_i + f_1(ntenure_i) + f_2(neduc_i) + f_3(nexper_i, female_i),
log(σ_i²) = α_0 + f_4(nexper_i).

We note that, as it turns out, an interaction between variables nexper and female is not necessary for the current data analysis. However, we choose to add this term to the mean model in order to illustrate how interaction terms can be specified. We illustrate further options below.

> knots1 <- seq(min(wage1$nexper), max(wage1$nexper), length.out = 30)
> knots2 <- c(0, 1)
> knotsD <- expand.grid(knots1, knots2)
> model <- lwage ~ fmarried + sm(ntenure) + sm(neduc, knots = data.frame(knots =
+   seq(min(wage1$neduc), max(wage1$neduc), length.out = 15))) +
+   sm(nexper, ffemale, knots = knotsD) | sm(nexper, knots = data.frame(knots =
+   seq(min(wage1$nexper), max(wage1$nexper), length.out = 15)))

The first three lines of the R code above specify the matrix of (potential) knots to be used for representing f_3(nexper, female). Knots may be left unspecified, in which case the defaults in function sm are used. Furthermore, in the specification of the mean model we use sm(ntenure). By this, we choose to represent f_1 utilizing the default 10 knots and the radial basis functions. Further, the specification of f_2 in the mean model illustrates how users can specify their own knots for univariate functions. In the current example, we select 15 knots uniformly spread over the range of neduc.
Fifteen knots are also used to represent f_4 within the variance model. The following code is used to obtain samples from the posteriors of the model parameters.

> DIR <- getwd()
> m5 <- mvrm(formula = model, data = wage1, sweeps = 100000,
+   burn = 25000, thin = 5, seed = 1, StorageDir = DIR)

After summarizing results and checking convergence, we create plots of posterior means, along with 95% credible intervals, for functions f_1, ..., f_4. These are displayed in Figure 7. As it turns out, variable married does not have an effect on the mean of lwage. For this reason, we do not provide further results on the posterior of its coefficient, β_1. However, in the code below we show how graphical summaries of β_1 can be obtained, if needed.

> PlotOptionsT <- list(geom_point(data = wage1, aes(x = ntenure, y = lwage)))
> plot(x = m5, model = "mean", term = "sm(ntenure)", quantiles = c(0.025, 0.975),
+   plotOptions = PlotOptionsT)
> PlotOptionsEdu <- list(geom_point(data = wage1, aes(x = neduc, y = lwage)))
> plot(x = m5, model = "mean", term = "sm(neduc)", quantiles = c(0.025, 0.975),
+   plotOptions = PlotOptionsEdu)
> pchs <- as.numeric(wage1$female)
> pchs[pchs == 1] <- 17; pchs[pchs == 2] <- 19
> cols <- as.numeric(wage1$female)
> cols[cols == 2] <- 3; cols[cols == 1] <- 2
> PlotOptionsE <- list(geom_point(data = wage1, aes(x = nexper, y = lwage),
+   col = cols, pch = pchs, group = wage1$female))
> plot(x = m5, model = "mean", term = "sm(nexper,female)", quantiles = c(0.025, 0.975),
+   plotOptions = PlotOptionsE)
> plot(x = m5, model = "stdev", term = "sm(nexper)", quantiles = c(0.025, 0.975))
> PlotOptionsF <- list(geom_boxplot(fill = 2, color = 1))
> plot(x = m5, model = "mean", term = "married", quantiles = c(0.025, 0.975),
+   plotOptions = PlotOptionsF)

Figure 7, panels (a) and (b) show the posterior means and 95% credible intervals for f_1(ntenure) and f_2(neduc).
It can be seen that expected wages increase with tenure and education, although there is high uncertainty over a large part of the range of both covariates. Panel (c) displays the posterior mean and a 95% credible interval for f_3. We can see that although the forms of the two functions, one per gender, are similar, the function for males lies above the one for females.

Figure 7: Results from the data analysis on the relationship between covariates gender, marital status, experience, education and tenure, and response variable logarithm of hourly wage. Posterior means and 95% credible intervals for (a) f_1(ntenure), (b) f_2(neduc), (c) f_3(nexper, female), and (d) the standard deviation function σ_i = σ exp{f_4(nexper_i)/2}.

We obtain predictions at the levels "Married" and "Notmarried" of variable fmarried and the levels "Female" and "Male" of variable ffemale, with variables ntenure, neduc and nexper fixed at their mid-range.

> p1 <- predict(m5, newdata = data.frame(fmarried = rep(c("Married", "Notmarried"), 2),
+   ntenure = rep(0.5, 4), neduc = rep(0.5, 4), nexper = rep(0.5, 4),
+   ffemale = rep(c("Female", "Male"), each = 2)), interval = "credible")
> p1
       fit      lwr      upr
1 1.321802 1.119508 1.506574
2 1.320400 1.119000 1.505272
3 1.913341 1.794035 2.036255
4 1.911939 1.791578 2.034832

The predictions are suggestive of no 'marriage' effect and of a 'gender' effect.
Here we analyse brain activity level data obtained by functional magnetic resonance imaging. The dataset is available in package gamair (Wood, 2006) and was previously analysed by Landau et al. (2003). We are interested in three of the columns in the dataset. These are the response variable, medFPQ, which is calculated as the median over three measurements of 'Fundamental Power Quotient', and the two covariates, X and Y, which give the location of each voxel.

The following R code loads the relevant data frame, removes two outliers and transforms the response variable, as was suggested by Wood (2006). In addition, it plots the brain activity data using function levelplot from package lattice (Sarkar, 2008).

> data(brain)
> brain <- brain[brain$medFPQ > 5e-5, ]
> brain$medFPQ <- (brain$medFPQ) ^ 0.25
> levelplot(medFPQ ~ Y * X, data = brain, xlab = "Y", ylab = "X",
+   col.regions = gray(10 : 100 / 100))

The plot of the observed data is shown in Figure 8, panel (a). Its distinctive feature is the noise level, which makes it difficult to decipher any latent pattern. Hence, the goal of the current data analysis is to obtain a smooth surface of brain activity level from the noisy data. It was argued by Wood (2006) that for achieving this goal a spatial error term is not needed in the model. Thus, we analyse the brain activity level data using a model of the form

medFPQ_i ~ind N(μ_i, σ²), with μ_i = β_0 + Σ_{j1=1}^{10} Σ_{j2=1}^{10} β_{j1,j2} φ_{j1,j2}(X_i, Y_i), i = 1, ..., n,

where n = 1565 is the number of voxels. The R code that fits the above model is

> Model <- medFPQ ~ sm(Y, X, k = 10, bs = "rd") | 1
> m6 <- mvrm(formula = Model, data = brain, sweeps = 50000, burn = 20000, thin = 2,
+   seed = 1, StorageDir = DIR)

From the fitted model we obtain a smooth brain activity level surface using function predict. The function estimates the average activity at each voxel of the brain. Further, we plot the estimated surface using function levelplot.
> p1 <- predict(m6)
> levelplot(p1[, 1] ~ Y * X, data = brain, xlab = "Y", ylab = "X",
+   col.regions = gray(10 : 100 / 100), contour = TRUE)

Results are shown in Figure 8, panel (b). The smooth surface makes it much easier to see and understand which parts of the brain have higher activity.
In the fourth and final application we use function mvrm to identify the best subset of predictors in a regression setting. Usually, stepwise model selection is performed using functions step and stepAIC in R. Here we show how mvrm can be used as an alternative to these two functions. The data frame that we apply mvrm to is mtcars, where the response variable is mpg and the explanatory variables that we consider are disp, hp, wt and qsec. The code below loads the data frame, specifies the model and obtains samples from the posteriors of the model parameters.

Figure 8: Results from the brain activity level data analysis. Panel (a) shows the observed data and panel (b) the model-based smooth surface.

> data(mtcars)
> Model <- mpg ~ disp + hp + wt + qsec | 1
> m7 <- mvrm(formula = Model, data = mtcars, sweeps = 50000, burn = 25000, thin = 2,
+   seed = 1, StorageDir = DIR)

The following is an excerpt of the output that function summary produces, showing the three models with the highest posterior probability.

> summary(m7, nModels = 3)
Joint mean/variance model posterior probabilities:
  mean.disp mean.hp mean.wt mean.qsec freq  prob cumulative
1         0       1       1         0 1085 43.40      43.40
2         0       0       1         1 1040 41.60      85.00
3         0       0       1         0  128  5.12      90.12
Displaying 3 models of the 11 visited
3 models account for 90.12% of the posterior mass
The model with the highest posterior probability (43.40%) includes hp and wt. The model that includes wt and qsec has almost equal posterior probability, 41.60%. The third model includes only wt as a predictor, but its posterior probability is much lower, 5.12%.

In this section we present the technical details of how the MCMC algorithm is designed for the case where there is a single covariate in the mean and variance models. We first note that, to improve mixing of the sampler, we integrate out vector β from the likelihood of y, as was done by Chan et al. (2006):

f(y | α, c_β, γ, δ, σ²) ∝ |σ² D²(α_δ)|^{-1/2} (c_β + 1)^{-(N(γ)+1)/2} exp(-S/(2σ²)),   (10)

where, with ỹ = D^{-1}(α_δ) y, we have S = S(y, α, c_β, γ, δ) = ỹ^T ỹ - {c_β/(1 + c_β)} ỹ^T X̃_γ (X̃_γ^T X̃_γ)^{-1} X̃_γ^T ỹ.

The six steps of the MCMC sampler are as follows.

1. Similar to Chan et al. (2006), the elements of γ are updated in blocks of randomly chosen elements. The block size is chosen based on probabilities that can be supplied by the user or be left at their default values. Let γ_B be a block of random size of randomly chosen elements from γ. The proposed value for γ_B is obtained from its prior, with the remaining elements of γ, denoted by γ_{B^C}, kept at their current value. The proposal pmf is obtained from the Bernoulli prior with π_μ integrated out:

p(γ_B | γ_{B^C}) = p(γ)/p(γ_{B^C}) = Beta(c_μ + N(γ), d_μ + q - N(γ)) / Beta(c_μ + N(γ_{B^C}), d_μ + q - L(γ_B) - N(γ_{B^C})),

where L(γ_B) denotes the length of γ_B, i.e. the size of the block. For this proposal pmf, the acceptance probability of the Metropolis-Hastings move reduces to the ratio of the likelihoods in (10):

min{1, [(c_β + 1)^{-(N(γ^P)+1)/2} exp(-S^P/(2σ²))] / [(c_β + 1)^{-(N(γ^C)+1)/2} exp(-S^C/(2σ²))]},

where superscripts P and C denote proposed and current values, respectively.

2. Vectors α and δ are updated simultaneously. Similarly to the updating of γ, the elements of δ are updated in random order in blocks of random size.
Let δ_B denote a block. Blocks δ_B and the whole vector α are generated simultaneously. As was mentioned by Chan et al. (2006), generating the whole vector α, instead of the subvector α_B, is necessary in order to make the proposed value of α consistent with the proposed value of δ.

Generating the proposed value for δ_B is done in a similar way as was done for γ_B in the previous step. Let δ^P denote the proposed value of δ. Next, we describe how the proposed value for α_{δ^P} is obtained. The development is in the spirit of Chan et al. (2006), who built on the work of Gamerman (1997).

Let β̂^C_γ = {c_β/(1 + c_β)} (X̃_γ^T X̃_γ)^{-1} X̃_γ^T ỹ denote the current value of the posterior mean of β_γ. Define the current squared residuals e²_{Ci} = {y_i - (x*_{iγ})^T β̂^C_γ}², i = 1, ..., n. These will have an approximate σ²_i χ²_1 distribution, where σ²_i = σ² exp(z_i^T α). The latter defines a Gamma generalized linear model (GLM) for the squared residuals with mean E(σ²_i χ²_1) = σ²_i = σ² exp(z_i^T α), which, utilizing a log-link, can be thought of as a Gamma GLM with an offset term: log(σ²_i) = log(σ²) + z_i^T α. Given the proposed value of δ, denoted by δ^P, the proposal density for α^P_{δ^P} is derived utilizing the one-step iteratively re-weighted least squares algorithm. This proceeds as follows. First define the transformed observations

d_{Ci}(α^C) = log(σ²) + z_i^T α^C + {e²_{Ci} - (σ²_i)^C}/(σ²_i)^C,

where superscript C denotes current values. Further, let d^C denote the vector of the d_{Ci}. Next we define

Δ(δ^P) = (c_α^{-1} I + Z_{δ^P}^T Z_{δ^P})^{-1} and α̂(δ^P, α^C) = Δ(δ^P) Z_{δ^P}^T d^C,

where Z is the design matrix.
The proposed value α^P_{δ^P} is obtained from a multivariate normal distribution with mean α̂(δ^P, α^C) and covariance h Δ(δ^P), denoted as N(α^P_{δ^P}; α̂(δ^P, α^C), h Δ(δ^P)), where h is a free parameter that we introduce and whose value we select adaptively (Roberts & Rosenthal, 2009) in order to achieve an acceptance probability of 20%-
25% (Roberts & Rosenthal, 2001). Let N(α^C_{δ^C}; α̂(δ^C, α^P), h Δ(δ^C)) denote the proposal density for taking a step in the reverse direction, from model δ^P to δ^C. Then the acceptance probability of the pair (δ^P, α^P_{δ^P}) is the minimum of 1 and

[|D(α^P_{δ^P})|^{-1} exp(-S^P/(2σ²)) / |D(α^C_{δ^C})|^{-1} exp(-S^C/(2σ²))] × [(2πc_α)^{-N(δ^P)/2} exp(-(α^P_{δ^P})^T α^P_{δ^P}/(2c_α)) / ((2πc_α)^{-N(δ^C)/2} exp(-(α^C_{δ^C})^T α^C_{δ^C}/(2c_α)))] × [N(α^C_{δ^C}; α̂(δ^C, α^P), h Δ(δ^C)) / N(α^P_{δ^P}; α̂(δ^P, α^C), h Δ(δ^P))].

We note that the determinants that appear in the above ratio are equal to one when utilizing centred explanatory variables in the variance model, and hence can be left out of the calculation of the acceptance probability.

3. We update σ² utilizing the marginal (10) and the two priors in (7). The full conditional corresponding to the IG(a_σ, b_σ) prior is

f(σ² | ...) ∝ (σ²)^{-n/2 - a_σ - 1} exp{-(S/2 + b_σ)/σ²},

which is recognized as another inverse gamma, IG(n/2 + a_σ, S/2 + b_σ), distribution. The full conditional corresponding to the half-normal prior |σ| ~ N(0, φ²_σ) is

f(σ² | ...) ∝ (σ²)^{-n/2} exp(-S/(2σ²)) exp(-σ²/(2φ²_σ)).

Proposed values are obtained from σ²_p ~ N(σ²_c, f²), where σ²_c denotes the current value. Proposed values are accepted with probability f(σ²_p | ...)/f(σ²_c | ...). We treat f² as a tuning parameter and we select its value adaptively (Roberts & Rosenthal, 2009) in order to achieve an acceptance probability of 20%-25% (Roberts & Rosenthal, 2001).

4. Parameter c_β is updated from the marginal (10) and the IG(a_β, b_β) prior:

f(c_β | ...) ∝ (c_β + 1)^{-(N(γ)+1)/2} exp(-S/(2σ²)) (c_β)^{-a_β - 1} exp(-b_β/c_β).

To sample from the above, we utilize a normal approximation to it. Let ℓ(c_β) = log{f(c_β | ...)}. We utilize a normal proposal density N(ĉ_β, -g²/ℓ''(ĉ_β)), where ĉ_β is the mode of ℓ(c_β), found using a Newton-Raphson algorithm, ℓ''(ĉ_β) is the second derivative of ℓ(c_β) evaluated at the mode, and g² is a tuning variance parameter whose value is chosen adaptively (Roberts & Rosenthal, 2009). At iteration u + 1 the acceptance probability is the minimum of one and
25% (Roberts & Rosenthal, 2001).4. Parameter c β is updated from the marginal (10) and the IG( a β , b β ) prior f ( c β | . . . ) ∝ ( c β + 1) − N ( γ )+12 exp( − S/ σ )( c β ) − a β − exp( − b β /c β ) . To sample from the above, we utilize a normal approximation to it. Let (cid:96) ( c β ) = log { f ( c β | . . . ) } . Weutilize a normal proposal density N (ˆ c β , − g /(cid:96) (cid:48)(cid:48) (ˆ c β )) , where ˆ c β is the mode of (cid:96) ( c β ), found using aNewton-Raphson algorithm, (cid:96) (cid:48)(cid:48) (ˆ c β ) is the second derivative of (cid:96) ( c β ) evaluated at the mode, and g is a tuning variance parameter the value of which is chosen adaptively (Roberts & Rosenthal, 2009).At iteration u + 1 the acceptance probability is the minimum between one and f ( c ( u +1) β | . . . ) f ( c ( u ) β | . . . ) N ( c ( u ) β ; ˆ c β , − g /(cid:96) (cid:48)(cid:48) (ˆ c β )) N ( c ( u +1) β ; ˆ c β , − g /(cid:96) (cid:48)(cid:48) (ˆ c β )) .
5. Parameter $c_{\alpha}$ is updated from the inverse gamma density IG$(a_{\alpha} + N(\delta)/2, \, b_{\alpha} + \alpha_{\delta}^{\top} \alpha_{\delta}/2)$.

6. If samples are required for $\beta$, they can be generated from
\[
\beta_{\gamma} \mid \cdots \sim N\!\left( \frac{c_{\beta}}{1+c_{\beta}} (\tilde{X}_{\gamma}^{\top} \tilde{X}_{\gamma})^{-1} \tilde{X}_{\gamma}^{\top} \tilde{y}, \; \sigma^2 \frac{c_{\beta}}{1+c_{\beta}} (\tilde{X}_{\gamma}^{\top} \tilde{X}_{\gamma})^{-1} \right),
\]
where $\beta_{\gamma}$ is the non-zero part of $\beta$.

We have presented a tutorial on several functions from the R package
BNSP. These functions are used for specifying, fitting and summarizing results from regression models with Gaussian errors and with mean and variance functions that can be modeled nonparametrically. Function sm is utilized to specify smooth terms in the mean and variance functions of the model. Function mvrm calls an MCMC algorithm that obtains samples from the posteriors of the model parameters. Samples are converted into an object of class 'mcmc' by function mvrm2mcmc, which facilitates the use of multiple functions from package coda. Functions print.mvrm and summary.mvrm provide summaries of fitted 'mvrm' objects. Further, function plot.mvrm provides graphical summaries of parametric and nonparametric terms that enter the mean or variance function. Lastly, function predict.mvrm provides predictions for a future response or a mean response, along with the corresponding prediction/credible intervals.

References
Bürkner, P.-C. (2018). brms: Bayesian Regression Models using 'Stan'. R package version 2.3.1.
Chan, D., Kohn, R., Nott, D., & Kirby, C. (2006). Locally adaptive semiparametric estimation of the mean and variance functions in regression models. Journal of Computational and Graphical Statistics, 15(4), 915–936.
Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing, 7(1), 57–68.
George, E. I. & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889.
George, E. I. & McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7(2), 339–373.
Hayfield, T. & Racine, J. S. (2008). Nonparametric econometrics: The np package. Journal of Statistical Software, 27(5).
Hofner, B., Mayr, A., Fenske, N., & Schmid, M. (2018). gamboostLSS: Boosting Methods for GAMLSS Models. R package version 2.0-1.
Landau, S., Ellison-Wright, I. C., & Bullmore, E. T. (2003). Tests for a difference in timing of physiological response between two brain regions measured by using functional magnetic resonance imaging. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53(1), 63–82.
Lewis, B. W. (2016). threejs: Interactive 3D Scatter Plots, Networks and Globes. R package version 0.2.2.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.
Mayr, A., Fenske, N., Hofner, B., Kneib, T., & Schmid, M. (2012). Generalized additive models for location, scale and shape for high dimensional data: a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3), 403–427.
Müller, P. & Mitra, R. (2013). Bayesian nonparametric inference - why and how. Bayesian Analysis, 8(2), 269–302.
O'Hara, R. B. & Sillanpää, M. J. (2009). A review of Bayesian variable selection methods: What, how and which. Bayesian Analysis, 4(1), 85–117.
Pagan, A. & Ullah, A. (1999). Nonparametric Econometrics. Cambridge University Press.
Papageorgiou, G. (2018). BNSP: Bayesian Non- And Semi-Parametric Model Fitting. R package version 2.0.8.
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7–11.
R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Rigby, R. A. & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape (with discussion). Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3), 507–554.
Roberts, G. O. & Rosenthal, J. S. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4), 351–367.
Roberts, G. O. & Rosenthal, J. S. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2), 349–367.
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. New York: Springer. ISBN 978-0-387-75968-5.
Scheipl, F. (2011). spikeSlabGAM: Bayesian variable selection, model choice and regularization for generalized additive mixed models in R. Journal of Statistical Software, 43(14), 1–24.
Scheipl, F. (2016). spikeSlabGAM: Bayesian Variable Selection and Model Choice for Generalized Additive Mixed Models. R package version 1.1-11.
Soetaert, K. (2016). plot3D: Plotting Multi-Dimensional Data. R package version 1.1.
Stasinopoulos, D. M. & Rigby, R. A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1–46.
Umlauf, N., Klein, N., & Zeileis, A. (2018). BAMLSS: Bayesian additive models for location, scale and shape (and beyond). Journal of Computational and Graphical Statistics.
Umlauf, N., Klein, N., Zeileis, A., & Koehler, M. (2017). bamlss: Bayesian Additive Models for Location Scale and Shape (and Beyond). R package version 0.1-2.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wood, S. (2018). mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. R package version 1.8-23.
Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Boca Raton, Florida: Chapman and Hall/CRC. ISBN 1-58488-474-6.
Wooldridge, J. (2008). Introductory Econometrics: A Modern Approach. South Western College.
Zeileis, A. & Croissant, Y. (2010). Extended model formulas in R: Multiple parts and multiple responses. Journal of Statistical Software, 34(1), 1–13.
Zeileis, A., Hornik, K., & Murrell, P. (2009). Escaping RGBland: Selecting colors for statistical graphics. Computational Statistics & Data Analysis, 53(9), 3259–3270.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In P. Goel & Zellner (Eds.),